
A Locality-Improving Dynamic Memory Allocator

Yi Feng and Emery D. Berger


Department of Computer Science
University of Massachusetts Amherst
140 Governors Drive
Amherst, MA 01002
{yifeng, emery}@cs.umass.edu

ABSTRACT
Because most application data is dynamically allocated, the memory manager plays a crucial role in application performance by determining the spatial locality of heap objects. Previous general-purpose allocators have focused on reducing fragmentation, while most locality-improving allocators have either focused on improving the locality of the allocator (not the application) or required information supplied by the programmer or obtained by profiling. We present a high-performance memory allocator that builds on previous allocator designs to achieve low fragmentation while transparently improving application locality. Our allocator, called Vam, improves page-level locality by managing the heap in page-sized chunks and aggressively giving up free pages to the virtual memory manager. By eliminating object headers, using fine-grained size classes, and by allocating objects using a reap-based algorithm, Vam improves cache-level locality. Over a range of large footprint benchmarks, Vam improves application performance by an average of 4%–8% versus the Lea (Linux) and FreeBSD allocators. When memory is scarce, Vam improves application performance by up to 2X compared to the FreeBSD allocator, and by over 10X compared to the Lea allocator. We show that synergy between Vam's layout algorithms and the Linux swap clustering algorithm increases its swap prefetchability, further improving its performance when paging.

This material is based upon work supported by the National Science Foundation under CAREER Award CNS-0347339. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Submitted to MSP 2005, Chicago, IL, USA.
Copyright 2005 ACM 0-12345-67-8/90/01 ..$5.00

1. Introduction
Explicit memory managers have traditionally focused on addressing the problem of fragmentation, discontiguous free chunks of memory. Reducing fragmentation improves space efficiency and understandably has received considerable attention by memory manager designers. For example, the widely-used Lea allocator that forms the basis of the Linux malloc (DLmalloc) was designed specifically for high performance and low fragmentation [15, 16, 19].

However, the widely-acknowledged increasing latency gap between the CPU and the various levels of the memory hierarchy (caches, RAM, and disk) makes improving data locality a first-level concern. For most applications, this means improving the locality of the heap. While applications typically exhibit temporal locality, spatial locality is dictated by the memory allocator, which determines where and how to lay out the application's dynamic data. This allocator-controlled locality can have a significant impact on the application's overall performance.

In this paper, we present a new general-purpose memory allocator called Vam that improves data locality while providing low fragmentation. Vam improves page-level locality by managing the heap in page-sized chunks and aggressively giving up free pages to the virtual memory manager. By eliminating object headers, using a judicious selection of size classes, and by allocating objects using a reap-based algorithm [9], Vam improves cache-level locality.

We compare Vam to the low-fragmentation Linux allocator (DLmalloc) and to the page-level locality-improving FreeBSD allocator (PHKmalloc) [17], both of which we describe in detail. To our knowledge, PHKmalloc has not been discussed previously in the memory management literature. We build on these algorithms, incorporating their best features while removing most of their disadvantages.

Our experiments on a suite of memory-intensive benchmarks show that Vam consistently achieves the best performance. Vam performs on average 8% faster than DLmalloc and 4% faster than PHKmalloc when there is sufficient physical memory to avoid paging. When physical memory is scarce, Vam outperforms these allocators by over 10X and up to 2X, respectively. We show that part of this improvement is due to an unintended but fortunate synergy between Vam and the way Linux manages swap space, which holds evicted pages on disk. We call this phenomenon swap prefetchability and show that it leads to improved performance when paging.

2. Related Work
There has been extensive research on dynamic memory allocation. In their well-known survey paper, Wilson et al. devote most of their attention to the question of fragmentation, which they identify as the most important metric for evaluating memory allocators [24]. Johnstone and Wilson in their subsequent studies evaluate a wide range of allocation policies using actual C/C++ programs and argue that fragmentation is near zero, given a good choice of allocation policy [15, 16]. While they argue that reducing fragmentation generally improves locality, we show that Vam's approach is more effective.
Most previous researchers have attacked the problem of locality in memory allocation either by improving the locality of the allocator itself or by using extra information such as programmer hints or profiles to guide placement decisions. Grunwald and Zorn investigate the locality impact of allocation algorithms by simulating caches using reference traces [13], and conclude that best-first search allocation schemes are the primary culprit for poor allocator locality. Their benchmark suite is highly allocation-intensive, causing locality effects in the allocator to dominate. Vam's algorithms focus instead on the effect of allocator data layout decisions on the application's overall locality, rather than on locality within the allocator. Our benchmark suite of memory-intensive programs is also less allocation-intensive, emphasizing the impact of allocator layout policies.

Chilimbi et al. describe ccmalloc, a memory allocator that allows the programmer to help the allocator group objects with temporal locality [12]. Truong et al. describe a memory allocator that separates the hot and cold fields of objects into different cache lines [23]. Both of these approaches improve cache-level locality but require programmer intervention. Vam's approach is largely orthogonal. Its use of the standard malloc interface allows it to be used to improve the locality of unaltered programs. It should be possible to build custom locality-improving allocators like ccmalloc on top of Vam, but we do not investigate that possibility here.

Barrett and Zorn use a profile-based approach which predicts object lifetime at allocation time and segregates short-lived objects in the heap [6]. Their system improves locality and space efficiency while reducing allocation cost, but requires profiling and imposes runtime overhead. Zorn and Seidl extend this approach by incorporating the reference behavior and lifetime prediction gathered during profiling to guide memory allocation and improve virtual memory performance [26, 22]. Their method also imposes some runtime overhead, which may have an adverse effect on application performance. Vam's approach avoids the need for profiling and improves application performance both in the presence and absence of virtual memory paging.

3. General-Purpose Memory Allocators
Vam builds on previous allocator designs to achieve its goals of high performance and improved application-level locality. The most influential allocators in its design are DLmalloc, which focuses on reducing fragmentation, PHKmalloc, which focuses on improving page-level locality, and reaps, which provide high-speed allocation and cache-level locality.

3.1 DLmalloc
DLmalloc is a widely-used malloc implementation written by Doug Lea [19]. It forms the basis of the Linux memory allocator included in the GNU C library. DLmalloc has been tuned over many years and is widely considered to be both among the fastest and most space-efficient allocators [9, 16]. The version we use in this study is the latest release, version 2.7.2.

DLmalloc is an approximate best-fit allocator with different behavior based on object size. Small objects (less than 64 bytes) are allocated from exact-size quicklists. Requests for a medium-sized object (between 72 and 504 bytes) and certain other events trigger DLmalloc to coalesce the objects in these quicklists (combining adjacent free objects) in the hope that this reclaimed space can be reused for the medium-sized object. For medium-sized objects, DLmalloc performs immediate coalescing and splitting (breaking objects into smaller ones) and approximates best fit. DLmalloc manages large objects (between 512 and 128K bytes) similarly, but places these in a group of free lists containing free chunks of a particular size range. These size ranges are logarithmically spaced and DLmalloc sorts free chunks within each range by size, so that the first chunk that fits is the best fit. Very large objects (128KB or larger) are allocated and freed using mmap.

One notable implementation detail of DLmalloc common to other allocators is that each object has a header that stores metadata containing the object's size and status. This metadata is also referred to as boundary tags and simplifies coalescing. Each object header is an 8-byte chunk placed before the object. This space overhead can become significant if an application allocates a large number of small objects. Placing the header next to the object itself also degrades data locality, because the header is only accessed by the allocator and not by the application accessing the object [13]. In other words, the header and the object have different access patterns and frequencies and, if put in the same cache line, may lower cache line utilization.
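To make the layout concrete, here is a minimal sketch of such a boundary-tag header (the field and function names are ours, for illustration; they are not DLmalloc's actual definitions):

    #include <stdint.h>

    /* An 8-byte boundary-tag header placed immediately before each object
     * (illustrative layout; not DLmalloc's actual structures). */
    typedef struct chunk_header {
        uint32_t prev_size;   /* size of the preceding chunk, for backward coalescing */
        uint32_t size_status; /* this chunk's size, with status flags in the low bits */
    } chunk_header_t;

    /* The allocator recovers the header from the pointer it handed out. */
    static inline chunk_header_t *header_of(void *object) {
        return (chunk_header_t *)((char *)object - sizeof(chunk_header_t));
    }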
3.2 PHKmalloc
The PHKmalloc allocator was designed for the FreeBSD operating system by Poul-Henning Kamp [17]. As far as we are aware, this memory allocator has not previously been described in the literature. We describe the current version here (1.89).

Unlike DLmalloc, which disregards page boundaries, PHKmalloc's design is page-oriented. The central design goal of PHKmalloc was to minimize the number of pages accessed by both the application and the allocator [17]. The heap is a contiguous space divided into 4K pages, and a table stores the status of these pages (empty or occupied). Every object on a page is the same size. This organization allows PHKmalloc to avoid individual object headers by storing metadata such as object size at the start of the page, which can be located by bitmasking the object's address. The metadata field also contains a bitmap to record the status of each object (free or allocated). This technique of avoiding per-object headers is sometimes referred to as a BIBOP-style organization ("Big Bag of Pages" [14]) and has been employed by many memory managers, including the Boehm-Demers-Weiser conservative garbage collector [11] and the Hoard multiprocessor memory allocator [7].
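As a sketch of why this organization is attractive, locating the metadata for any object is a single mask operation (a simplified model with hypothetical structure names; PHKmalloc's actual data structures differ):

    #include <stdint.h>

    #define PAGE_SIZE 4096u

    /* Hypothetical per-page metadata kept at the start of each page:
     * one size field and a free/allocated bitmap describe every object
     * on the page, since all objects on a page are the same size. */
    struct page_info {
        uint32_t object_size;
        uint32_t free_bitmap[16];   /* one bit per object; 512 bits covers
                                       a 4KB page of 8-byte objects */
    };

    /* Mask off the page-offset bits of any object's address to find the
     * page metadata -- no per-object header is required. */
    static inline struct page_info *page_info_of(void *object) {
        return (struct page_info *)((uintptr_t)object & ~(uintptr_t)(PAGE_SIZE - 1));
    }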
PHKmalloc distinguishes just two object size classes: small (less than 2KB) and large (2KB or more). Like the BSD 4.2 (Kingsley) allocator [24], PHKmalloc rounds up small object requests to the nearest power of two and rounds large object requests up to the nearest multiple of the page size; the remainder in the last page is not reused. PHKmalloc keeps pages containing free space in a doubly-linked list sorted by address order, implementing the policy known as address-ordered first-fit.

PHKmalloc's rounding-up of object sizes makes it susceptible to considerable internal fragmentation (unused space inside each chunk) or page-internal fragmentation (unused space at the end of the last page of a large object) [4]. In practice, the space saved by eliminating individual object headers is largely offset by this internal fragmentation.

On the other hand, using coarse size classes dramatically reduces the number of free lists, allowing the quick reuse of freed chunks and reducing external fragmentation. In some situations, this can improve locality, as we show in Section 5.3 and Section 5.5.

A key advantage of PHKmalloc's page-oriented design is that it allows the allocator to discard empty pages via the madvise system call. In this case, although the page is still mapped from the kernel, the previously-allocated RAM space may be reclaimed by the kernel and the contents do not need to be written back to swap. The underlying physical page can thus be immediately reused. If the page is touched again, the virtual memory manager will materialize a demand-zero page.
3.3 Reaps
Reaps are a combination of regions and heaps that extend region semantics with individual object deletion [9]. A reap consists of a chunk of memory, a "bump" pointer set to the start of the chunk, and an associated heap. Allocation in a reap initially consists of bumping its pointer through the chunk of memory. Reaps add object headers to every allocated object. These headers contain metadata that allow the object to be subsequently placed on the heap. Reaps act like regions (performing pointer-bumping allocation) until a call to reapFree deletes an individual object. Reaps place freed objects onto an associated heap. Subsequent allocations from that reap use memory from the heap until it is exhausted, at which point it reverts to region mode. Experimental results show that reaps capture most of the performance of region allocators.
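The following toy fragment illustrates this allocation discipline (a sketch only, assuming a single object size and omitting the per-object headers; the actual reap implementation [9] is layered and more general):

    #include <stddef.h>

    /* A toy reap: a chunk with a bump pointer, plus a free list standing
     * in for the associated heap. (Sketch only; not the code from [9].) */
    typedef struct reap {
        char *bump;       /* next unused byte in the chunk */
        char *chunk_end;  /* one past the end of the chunk */
        void *heap;       /* freed objects, standing in for the heap */
    } reap_t;

    static void *reap_alloc(reap_t *r, size_t size) {
        if (r->heap != NULL) {                 /* reuse heap memory first... */
            void *obj = r->heap;
            r->heap = *(void **)obj;
            return obj;
        }
        if (r->bump + size <= r->chunk_end) {  /* ...else region mode: bump */
            void *obj = r->bump;
            r->bump += size;
            return obj;
        }
        return NULL;                           /* chunk exhausted; growth omitted */
    }

    static void reap_free(reap_t *r, void *obj) {
        *(void **)obj = r->heap;               /* place the freed object on the heap */
        r->heap = obj;
    }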
4. Vam
The key design goal for Vam was to enhance application-level locality at the cache and page level while delivering high performance over a range of memory sizes. In particular, we wanted its performance to exceed that of other fast allocators both when there is enough physical memory to hold the entire heap and when physical memory is scarce.

We implemented Vam using Heap Layers, a C++-based infrastructure for building high-performance memory managers [8]. Figure 1 presents an example of Vam's heap layout. The following is an overview of Vam's design, which we explore in detail in the rest of this section.

Fine-grained size classes: Vam improves cache utilization by using exact size classes for objects up to 496 bytes in size, thus eliminating internal fragmentation.

Page-based: Vam uses a page-oriented heap layout similar to PHKmalloc, but uses a larger number of pages for large objects to minimize page-internal fragmentation.

No object headers for small objects: Vam reduces cache pollution by eliminating object headers for all objects under 128 bytes.

Reap allocation: Vam uses a variant of reap allocation in each page to improve throughput and to enhance cache locality.

Ordered per-size allocation: Vam maintains non-full pages for each small or medium size, sorted in the order in which the pages become non-full. This ordering improves locality and allows new objects to fill the free space in the front, increasing the likelihood that empty pages emerge from the end.

Aggressive discarding of empty pages: Whenever a page is made empty, Vam gives it back to the virtual memory manager.

Approximate address-ordered first-fit at page level: Vam maintains free pages in sorted order and preferentially allocates from the front, improving performance when paging by increasing swap prefetchability (Section 5.6).

4.1 Fine-Grained Size Classes
Like DLmalloc, Vam classifies object sizes into four categories: small (below 128 bytes), medium (between 128 and 496 bytes), large (between 504 bytes and 32KB), and extremely large (more than 32KB). These size boundaries are tunable parameters in the allocator. Each size class has two associated linked lists of blocks, groups of pages containing objects dedicated to that size class. The available list contains blocks with free space, while the full list contains blocks with no remaining space.

To improve cache line utilization, Vam uses much finer size classes than either DLmalloc or PHKmalloc. For small and medium objects, each size class is only 8 bytes apart. Fine-grained size classes eliminate internal fragmentation by providing exact fits for small and medium object allocation requests, since the C standard requires that all objects returned by malloc be double-word (8-byte) aligned. Reducing fragmentation for these objects is important for improving overall cache utilization because most objects are small or medium-sized. In our benchmarks, 89.6% of all objects requested are small and 6.4% are medium-sized.

Nonetheless, using coarser size classes could improve locality of reference. A wider size range in each size class allows quicker reuse of free space across these sizes, which could result in improved cache locality and page-level locality. We have observed this phenomenon in the 253.perlbmk benchmark.
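With classes spaced 8 bytes apart, mapping a request to its size class is simple arithmetic, as the following sketch shows (the helper names are ours, not Vam's):

    #include <stddef.h>

    /* Round a request up to the 8-byte alignment the C standard
     * effectively requires of malloc. */
    static inline size_t aligned_size(size_t request) {
        return (request + 7) & ~(size_t)7;   /* e.g., 13 -> 16 */
    }

    /* With exact 8-byte-spaced classes, the class index follows directly:
     * class 0 serves 1-8 bytes, class 1 serves 9-16 bytes, and so on.
     * The rounded size is also the allocated size, so a request wastes
     * nothing beyond mandatory alignment. (Assumes request >= 1.) */
    static inline size_t size_class_of(size_t request) {
        return (aligned_size(request) >> 3) - 1;
    }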
4.2 Large and Extremely Large Objects
Like small and medium objects, large object size classes are only 8 bytes apart and each size class has a dedicated free list. Vam uses a best-fit algorithm for large objects. It linearly searches the free list table for the first non-empty list containing a chunk large enough to satisfy the given size request. If the remaining space is large enough to hold the smallest large object (i.e., 504 bytes), Vam splits the chunk and places the remaining space onto the appropriate free list.

This use of fine-grained size classes for large objects also improves allocator-level locality. Because each size class provides an exact fit, no search within the size class is needed for a best fit. This can improve locality because such a search (as in DLmalloc) may visit several free chunks before it finds a best fit, and these free chunks may be scattered in memory and have poor locality. Vam only scans the free list table, which is a contiguous space and has good locality. However, this linear scan may occasionally visit a large number of table entries and flush caches. It is possible to solve this problem by hierarchically indexing into the table, but we have not implemented this optimization.

Allocation requests for large objects are rare and often have poor size locality. For example, applications may allocate large buffers of varied lengths corresponding to file inputs. Vam collocates large objects in memory regions shared by all these sizes and aggressively coalesces free chunks on deallocation. This aggressive coalescing reduces fragmentation, and for these large objects, the per-byte cost for this coalescing is low.

Collocating large objects in large memory regions may have another beneficial impact because it tends to align them randomly. This alignment may reduce conflict misses. For example, if a program accesses some field of one type much more frequently than the other fields, and if the objects of that type are always regularly aligned (e.g., at the page boundary), some cache lines may suffer excessive conflict misses while others may be under-utilized. A more random alignment can map such hot fields in large objects more evenly.

Finally, Vam directly allocates extremely large objects from the kernel via the mmap system call and frees them using munmap.
4.3 Page-Based Heap Management
Vam allocates small and medium-sized objects from page-aligned blocks, similarly to PHKmalloc. A block is one page for small objects. For medium objects, it is four pages; this reduces page-internal fragmentation at the end of the block [4]. Each block is divided into equal-sized chunks. This division makes it impossible to fragment memory inside a block.

We note that, in principle, segregating objects of different sizes could harm locality by preventing adjacent allocation of temporally-local objects of different sizes. This potential cost must be weighed against the locality and space benefit of eliminating external fragmentation. Wilson and others have observed a strong skew towards a small number of size classes, increasing the odds that temporally-local objects will be of the same size [15, 24].
[Figure 1: An example of Vam's heap layout (see Section 4). The figure shows the free list table (FLT) and per-block metadata (BMD: prev_block, next_block, num_total, object_size, num_free, free_list, num_bumped, current_ptr) for small objects (8–128 bytes, one-page 4KB blocks, memory regions without object headers), medium objects (136–496 bytes, four-page 16KB blocks with object headers, OH), large objects (504 bytes–32KB, multi-page regions with object headers), and extremely large objects (above 32KB, mmapped pages).]

4.4 Elimination of Object Headers for Small Objects
Like PHKmalloc, Vam uses the BIBOP technique to eliminate individual object headers and locates metadata at the beginning of each block. Vam uses this approach only for small objects, because the resulting space savings and locality improvement are most significant for small objects. Larger objects each have per-object headers, simplifying coalescing. To distinguish the two cases, Vam partitions the entire address space into 16MB regions and uses a table to record the type of objects in each region. Because a one-byte flag is enough to hold the information for each region, this table is only 256 bytes for a 4GB address space. Although every object deallocation needs to perform a conditional check on the corresponding entry in this table, these checks have very good locality. Since most objects do not have headers, this branch is also highly predictable.
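A sketch of this check (our reconstruction, with our own names; it assumes the 32-bit address space of the paper's experimental platform):

    #include <stdint.h>

    #define REGION_SHIFT 24                             /* 16MB regions */
    #define NUM_REGIONS  (1u << (32 - REGION_SHIFT))    /* 256 entries for 4GB */

    /* One byte per region: does memory there carry per-object headers?
     * The whole table is 256 bytes, so it stays cache-resident. */
    static uint8_t region_has_headers[NUM_REGIONS];

    static inline int object_has_header(void *object) {
        return region_has_headers[(uintptr_t)object >> REGION_SHIFT];
    }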
4.5 Reap Allocation
Unlike PHKmalloc, Vam does not use per-block bitmaps to track which objects are free or allocated. Instead, it uses a cheaper pointer-bumping allocation until the end of the block is reached. It then reuses objects from a free list for that block. This technique is a variant of reap allocation [9]. The original reap algorithm adds per-object headers and employs a full-blown heap implementation to manage freed objects. Vam instead manages its (headerless) free objects by threading a linked list through them. Vam's use of a single size class per block ensures that this approach does not lead to external fragmentation. Pointer-bumping also improves cache locality by maintaining allocation order.
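Putting the pieces together, the per-block fast path looks roughly like the following (a sketch using the block metadata fields shown in Figure 1; the code is illustrative, not Vam's):

    #include <stddef.h>

    /* Per-block metadata, following the BMD fields of Figure 1. */
    typedef struct block {
        struct block *prev_block, *next_block;  /* available/full list links */
        size_t num_total;    /* number of chunks in the block */
        size_t object_size;  /* the single size class this block serves */
        size_t num_free;     /* chunks currently on the free list */
        void  *free_list;    /* freed chunks, threaded through themselves */
        size_t num_bumped;   /* chunks handed out by pointer-bumping so far */
        char  *current_ptr;  /* the bump pointer */
    } block_t;

    static void *block_alloc(block_t *b) {
        if (b->num_bumped < b->num_total) {   /* cheap pointer-bumping first */
            void *obj = b->current_ptr;
            b->current_ptr += b->object_size;
            b->num_bumped++;
            return obj;
        }
        if (b->free_list != NULL) {           /* then reuse freed chunks */
            void *obj = b->free_list;
            b->free_list = *(void **)obj;     /* next link lives in the chunk itself */
            b->num_free--;
            return obj;
        }
        return NULL;   /* block full: caller moves it to the full list */
    }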
4.6 Ordered Per-Size Allocation
To minimize misses, Vam preferentially allocates objects from recently-accessed blocks. It allocates from the first block in the available list until the block becomes full. It then moves the block to the full list and uses the next block in the available list, creating one if none exists. Vam places freed objects onto the appropriate per-block free list for reuse. When an object is freed to a previously-full block, Vam moves the block from the full list to the front of the available list. This page-level ordering ensures that new objects always fill free space on the page in the front of the available list and increases the chance that pages near the end become entirely free.

PHKmalloc uses a similar approach, but sorts non-full pages in increasing address order. We do not use address order because the sorting operation is costly.

4.7 Aggressive Discarding of Pages
Blocks of small-to-medium objects and regions of large objects are all multiples of pages. In Vam, a page manager manages these pages by recording status information for each page in a page descriptor table and keeping consecutive free pages in a set of free lists.

Vam uses the madvise call to discard blocks of small and medium objects whenever they become empty. For large objects, Vam discards empty pages inside the large object memory region. As described above, Vam releases all extremely large objects upon free by a call to munmap.

This strategy reduces application footprint and can greatly reduce paging when under memory pressure. However, aggressive discarding of pages does add some runtime overhead. Each page discard requires one system call. When the page is later reused, there is a cost in reassigning a physical page to the free page in the kernel (soft page fault handling and page zeroing). In fact, these overheads are very low in practice, thanks to the efficient implementation of system calls and soft page handling in the Linux kernel. We would prefer to discard pages only in response to memory scarcity, but this feature is not supported by the current kernel.
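For concreteness, discarding an empty block amounts to a single hint to the kernel, roughly as follows (a sketch; error handling and the page manager's bookkeeping are omitted, and the text does not name the exact advice flag Vam passes):

    #include <stddef.h>
    #include <sys/mman.h>

    /* Tell the kernel it may reclaim the physical frames behind an empty
     * block. The pages stay mapped; touching one later yields a fresh
     * demand-zero page. MADV_DONTNEED has this meaning for anonymous
     * memory on Linux (FreeBSD offers MADV_FREE with similar intent). */
    static void discard_block(void *block_start, size_t block_bytes) {
        madvise(block_start, block_bytes, MADV_DONTNEED);
    }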
                      176.gcc  197.parser  253.perlbmk  255.vortex
Execution Time            24s        280s          42s         50s
Instructions              40G        424G         114G        101G
VM Size                 130MB        15MB        120MB        65MB
Max Live Size           110MB        10MB         90MB        45MB
Total Allocations          9M        788M         5.4M        1.5M
Alloc. Rate (#/sec)      373K       2813K         129K         30K
Avg. Size (bytes)          52          21          285         471

Table 1: CPU and memory allocation statistics of memory-intensive CPU2000 benchmarks, run with DLmalloc.

[Figure 2: Total execution time, normalized to DLmalloc, for DLmalloc, PHKmalloc, Vam, and the custom allocators across the four benchmarks and their geometric mean.]
5. Experimental Evaluation
To evaluate the efficacy of Vam's design, we sought to answer the following questions:

• Does Vam reduce total application execution time?
• Does Vam increase cache-level locality?
• What is the effect of Vam's policies on fragmentation?
• Under memory pressure, does Vam reduce paging?

To answer these questions, we use the four memory-intensive applications from the SPEC CPU2000 benchmark suite [2]: 176.gcc, 197.parser, 253.perlbmk, and 255.vortex. The other benchmarks in the suite either use very little memory or only allocate memory at the start of execution [3]. For those applications, the choice of allocator has essentially no impact. Whenever multiple inputs were available, we use the reference input that consumes the most memory. These are scilab.i, ref.in, splitmail.pl 850 5 19 18 1500, and lendian1.raw, respectively. Table 1 summarizes the benchmark CPU and memory allocation statistics.

The original 176.gcc and 197.parser applications use custom memory allocators: 176.gcc uses obstacks and 197.parser uses a custom allocator called xalloc [9]. The use of custom allocation means that the original applications make only occasional calls to malloc. We create versions of these applications that use general-purpose memory allocators. We can replace 197.parser's custom allocator directly because xalloc has the same interface and semantics as malloc and free. This replacement decreases the maximum virtual memory requirements of 197.parser from 30MB to 15MB. The obstack allocator has a different interface and semantics than the general-purpose memory allocator. To replace it, we use an obstack layer that directly invokes malloc for individual objects [9]. This layer requires additional metadata and thus increases 176.gcc's peak memory usage from 85MB to about 130MB.

We use a Dell Optiplex SX270 as our experimental platform (3.0GHz Pentium 4, 1GB RAM, 40GB 5400RPM hard drive, Linux version 2.4.24). The Pentium 4 has an 8KB L1 data cache (64-byte cache lines, 4-way set-associative) and a 512KB L2 cache (64-byte cache lines, 8-way set-associative). All memory allocators are compiled into shared libraries at the highest optimization level with gcc version 3.2.2 and preloaded into memory before the applications start using LD_PRELOAD.

We measure total execution time using /usr/bin/time, and measure instructions retired, L1/L2 cache misses, and data TLB misses using the Pentium 4's on-chip performance counters. We use the perfctr patch for Linux and the perfex tool [20] to set the performance counters according to the manufacturer's manual [1]. We run each experiment five times and report the median. To minimize variance, we perform all experiments with the machine in single-user mode.

5.1 Total Execution Time
Figure 2 presents total execution time results. Vam consistently improves application performance over both PHKmalloc and DLmalloc. Vam's improvement over PHKmalloc ranges from 1–8%, and its improvement over DLmalloc ranges from 1–23%. On average, Vam is 4% and 8% faster than PHKmalloc and DLmalloc, respectively. The custom memory allocators in 176.gcc and 197.parser are faster than the general-purpose ones: the obstack allocator in 176.gcc is 8% faster than DLmalloc and the xalloc allocator in 197.parser is 23% faster. These allocators improve performance because both applications are very allocation intensive (see Table 1). In fact, 197.parser is so allocation-intensive that the number of cycles executed by the allocator dictates its performance. We attribute the difference between this result and that obtained by Berger et al. [9] (showing a smaller gap between DLmalloc and xalloc) to our use of shared objects for the allocators, which precludes link-time optimizations.

5.2 Cache Locality

L1 Locality
We measure both L1 and L2 cache locality for the different allocators. Figure 3(a) shows L1 cache misses using different allocators, normalized to DLmalloc. Vam reduces L1 cache misses for two of the four benchmarks. We attribute this result to Vam's reduction of internal fragmentation and elimination of object headers. Vam significantly increases L1 cache misses for one benchmark, 253.perlbmk. This benchmark allocates from a wide range of sizes, and Vam's use of fine-grained size classes causes more cache traffic than DLmalloc. However, this result is somewhat misleading: 253.perlbmk's L1 cache miss rate is very low for all allocators, and so has very little impact on total execution time.

PHKmalloc increases L1 cache misses in three of the four benchmarks. We attribute this to the internal fragmentation from PHKmalloc's coarse size classes. The only benchmark for which PHKmalloc reduces L1 cache misses is 197.parser, which primarily allocates small objects, and the dominant object sizes are 8, 16 and 24 bytes. These objects fit into PHKmalloc's power-of-two size classes with little fragmentation, and the lack of object headers leads to efficient cache line utilization both for PHKmalloc and for Vam.
[Figure 3: Cache-level locality results: (a) L1 cache miss counts, normalized to DLmalloc; (b) L2 cache miss counts, normalized to DLmalloc. Bars compare DLmalloc, PHKmalloc, Vam, and the custom allocators for each benchmark and their geometric mean.]

Variant            Description
PHK_sc             size classes every 8 bytes instead of powers of two
PHK_reap           replaces bitmap operations with reap allocation [9]
PHK_sc_reap        combines PHK_sc and PHK_reap
Vam_small+header   adds 8-byte headers to small objects
Vam_bitmap         replaces reap allocation with bitmap operations for small and medium objects

Table 2: Variants of PHKmalloc and Vam (see Section 5.3).

L2 Locality
Both Vam and PHKmalloc significantly reduce L2 cache misses over DLmalloc, as Figure 3(b) shows. On average, Vam reduces L2 cache misses by 39% over DLmalloc. This cache-level locality improvement is more significant in 253.perlbmk and 255.vortex than in 176.gcc and 197.parser. For 176.gcc, the obstack allocator produces the fewest cache misses. This result is partially due to the extra metadata required to simulate obstack semantics. Unlike L1 locality, L2 cache performance is strongly correlated with application run time performance. However, PHKmalloc's locality improvement is offset by its excessive number of instructions, particularly in 197.parser. We also measured data TLB misses, and these exhibit nearly identical trends, so we do not report them here.

Summary: Vam generally provides better L1 cache locality than the other allocators. The use of a page-oriented heap layout improves L2 cache locality for both PHKmalloc and Vam, although Vam's improvement is somewhat greater.

5.3 Performance of Allocator Variants
To evaluate the effects of Vam's design decisions, we developed several variants of both PHKmalloc and Vam, summarized in Table 2. These variants let us quantify the impact of the choice of fine-grained size classes and reap-based allocation. Figures 4 and 5 present the L2 cache misses, instruction counts and run time performance of these PHKmalloc and Vam variants. Note that the results are normalized to their respective original versions, i.e., PHKmalloc variants are normalized to PHKmalloc and Vam variants are normalized to Vam.

Impact of Size Classes and Reaps: PHKmalloc
As Figure 4(a) shows, PHK_sc (fine-grained size classes) reduces cache misses in three of the four benchmarks. The exception is 253.perlbmk, which uses many more distinct sizes than the other benchmarks. The coarser size classes in the original PHKmalloc allow quicker reuse of freed space within each size class, yielding better cache locality. Although this PHKmalloc variant's changes in cache misses do not notably affect the overall run times shown in Figure 4(c), it greatly improves space efficiency over the original allocator and achieves better VM performance when under memory pressure.

PHK_reap (replacing bitmap operations with reap allocation) reduces instructions executed by 14% for 197.parser and runs 10% faster than the original PHKmalloc. On average, this variant improves application performance by 3%. However, because this modification adds extra memory accesses, it also increases L2 cache misses for most of the benchmarks (except 255.vortex). This increase is the greatest for 253.perlbmk. However, because the absolute number of misses is quite small for 253.perlbmk, these extra misses do not affect run time.

The PHK_sc_reap variant, combining the changes in PHK_sc and PHK_reap, shows that these improvements are generally complementary. On average, this variant improves run time performance by 4%. It notably reduces cache misses in 197.parser and 255.vortex and instructions in 176.gcc and 197.parser.

Impact of Headers and Bitmaps: Vam
Figures 5(a) and 5(c) show that adding headers to the small objects in Vam results in an average increase in L2 cache misses of 15% and a 3% increase in run times. The impact of adding headers is the greatest for 197.parser, increasing run time by 10%. The average object size in 197.parser is only 21 bytes, and the extra headers substantially increase its working set.

Figure 5(b) shows that the Vam_bitmap variant significantly increases the number of instructions executed in 197.parser. On average, this Vam variant reduces L2 cache misses by 2% and increases instructions by 2%, resulting in a 2% increase in run time.

Summary: The use of fine-grained size classes and elimination of object headers generally improve cache locality and reduce total runtime. The choice between bitmap operations and reap-like allocation is a trade-off. Vam currently uses reaps, but trading CPU instructions for fewer memory accesses during allocation may eventually prove more beneficial.

5.4 Fragmentation
We evaluate the effect of allocator design on memory fragmentation. We define fragmentation as the maximum number of pages in use divided by the maximum amount of memory (in pages) requested by the application. In-use pages are those mapped from the kernel and touched, but not discarded. Pages mapped but never touched do not have physical space allocated; discarded pages have their previously-allocated memory reclaimed. This view of application memory usage is from the VM manager's perspective and, we believe, better reflects the actual resource consumption.
[Figure 4: Comparison of the PHKmalloc variants (PHKmalloc, PHK_sc, PHK_reap, PHK_sc_reap), normalized to the original: (a) normalized L2 cache misses; (b) normalized instructions retired; (c) normalized total execution time.]

[Figure 5: Comparison of the Vam variants (Vam, Vam_small+header, Vam_bitmap), normalized to the original: (a) normalized L2 cache misses; (b) normalized instructions retired; (c) normalized total execution time.]

We compare four allocators here: DLmalloc, PHKmalloc, Vam, and the PHK_sc variant of PHKmalloc. Figure 6 shows the results. We were surprised to see that DLmalloc, an allocator known for low fragmentation, in fact leads to the highest fragmentation on average. The first reason for this is the space overhead of per-object headers. More importantly, DLmalloc is unable to distinguish and discard any free pages it may have. PHKmalloc overcomes both of these shortcomings. However, its coarse size classes lead to internal fragmentation that negates its other advantages. Our PHK_sc variant uses fine-grained size classes and, on average, yields the lowest fragmentation. Vam combines these fragmentation-reducing features and nearly matches PHK_sc's low fragmentation.

[Figure 6: Fragmentation results for DLmalloc, PHKmalloc, PHK_sc, and Vam across the four benchmarks and their geometric mean.]

5.5 Performance While Paging
To evaluate the effect of limited physical memory, we launch a process that pins down a specified amount of RAM, leaving the desired amount of available RAM for the benchmark applications. Figures 7(a) through 7(d) show the run times of the four SPEC benchmarks under a range of available RAM sizes, using different memory allocators. The rightmost point of each line shows the run time of the application with sufficient RAM to run without paging. As available memory is reduced (moving left), application performance degrades. This performance degradation is markedly different with different memory allocators, except for 176.gcc, where all the allocators degrade similarly with reduced RAM. For all other benchmark applications, Vam delivers the best performance across a wide range of available RAM.

Recall that for 176.gcc, we needed to add extra metadata to simulate the obstack semantics with general-purpose allocators. The original obstack allocator thus performs better than the general-purpose allocators when RAM is scarce. Nonetheless, all of the general-purpose allocators similarly preserve the application locality because of the clustered allocations and deallocations in 176.gcc. The slight difference between these allocators is largely due to their respective space efficiency, for which the original obstack custom allocator is the best.

The story is different for the other custom allocator. As Figure 7(b) shows, 197.parser's custom allocator (xalloc) requires substantially more RAM to avoid paging and performs much worse than the general-purpose allocators as available RAM is reduced. This poor performance is due to a limitation in xalloc. Unlike the general-purpose allocators, xalloc cannot reuse heap space immediately after objects are freed. Instead, it must wait until consecutive objects at the end of the heap are all free, at which point it reuses memory from after the last object in use. While this strategy is effective when physical memory is ample, under memory pressure it degrades performance dramatically.
[Figure 7: Performance using different memory allocators over a range of available RAM sizes. Each panel plots run time (sec) against available RAM (MB) for one benchmark: (a) 176.gcc (DLmalloc, PHKmalloc, Vam, obstack), (b) 197.parser (DLmalloc, PHKmalloc, Vam, xalloc), (c) 253.perlbmk, and (d) 255.vortex (DLmalloc, PHKmalloc, Vam).]

Figure 7(c) and Figure 7(d) highlight the effectiveness of both PHKmalloc's and Vam's page discarding algorithms. DLmalloc suffers a 5X slowdown when available physical memory is reduced to 80MB for 253.perlbmk, while PHKmalloc and Vam suffer the same slowdown only after just 30MB of RAM remains. With both of those allocators, 253.perlbmk exhibits a more graceful performance degradation than when using DLmalloc. For 255.vortex, Vam performs better than the other two allocators over all available RAM sizes we tested. DLmalloc required about 6MB more available RAM to achieve Vam's performance. Only the page discarding algorithms play a role here: 255.vortex's average object size is 471 bytes, so DLmalloc's 8-byte object headers have little impact.

We note that, for 253.perlbmk, PHKmalloc degrades performance slightly less than Vam when available RAM is less than 60MB. This is, again, because PHKmalloc's coarse size classes result in a locality improvement for this particular benchmark in some situations. We also ran 253.perlbmk with the PHK_sc variant, and its performance degradation curve is then very close to that of Vam across all memory sizes.

5.6 Page-Level Locality
In this section, we explore the effect of allocator choice on application page-level locality in more detail by using an LRU simulator and page-level reference traces. We first gather the application's page-level references to the heap using a tool that intercepts system memory calls (brk, sbrk, mmap, munmap, and madvise) to keep track of heap pages currently mapped from the kernel and traps memory references by page protection. We use the SAD (Safely-Allowed-Drop) algorithm to reduce the trace to a manageable size [18].

We then run these traces through an LRU simulator to generate page miss curves that indicate the number of misses (page faults) that would arise for every possible size of available memory. While no real system implements LRU, many systems closely approximate it, including the Linux kernel we use here. Our LRU simulator is similar to that described by Yang et al. [25]. We use placeholders in the LRU queue for pages discarded by madvise, in addition to pages unmapped by munmap/sbrk. These placeholders allow us to more accurately approximate a real VM system.
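A toy version of such a simulator follows (our own sketch, not the authors' tool): tracking each reference's depth in an LRU stack yields, in one pass, the miss count for every memory size, since a reference at depth d misses in any memory of d pages or fewer. The placeholder handling for discarded pages is omitted here.

    #include <stddef.h>

    #define MAX_PAGES 65536   /* largest heap, in pages, we can simulate */

    static unsigned long lru_stack[MAX_PAGES];        /* MRU first */
    static size_t stack_size = 0;
    static unsigned long misses_at_depth[MAX_PAGES + 1];

    static void reference_page(unsigned long page) {
        size_t depth = stack_size;                    /* assume not resident */
        for (size_t i = 0; i < stack_size; i++)       /* O(n) scan, for clarity */
            if (lru_stack[i] == page) { depth = i; break; }

        if (depth == stack_size) {                    /* first touch */
            misses_at_depth[MAX_PAGES]++;             /* misses at every size */
            if (stack_size < MAX_PAGES) stack_size++;
        } else {
            misses_at_depth[depth]++;                 /* miss iff memory <= depth pages */
        }
        size_t start = depth < MAX_PAGES ? depth : MAX_PAGES - 1;
        for (size_t i = start; i > 0; i--)            /* move page to the MRU slot */
            lru_stack[i] = lru_stack[i - 1];
        lru_stack[0] = page;
    }

    /* Miss curve: misses(m pages) = sum of misses_at_depth[d] for all d >= m. */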
We compare the miss curves generated by the simulator with the actual page faults. The actual page faults are the major (hard) page faults measured in the experiments we described in Section 5.5. For two of our benchmarks, 197.parser and 255.vortex, the simulated miss curves are nearly the same as the actual page faults (except for the xalloc custom allocator in 197.parser).

However, for 176.gcc and 253.perlbmk, the actual page faults are far fewer than the simulated ones, as Figure 8 shows. For example, for 176.gcc with 40MB of RAM, the simulated faults are around 40,000 while the actual page faults measured are under 10,000. This inconsistency is due to the swap prefetching used by the Linux VM manager but not in our simulator. In addition to swapping non-resident pages into RAM whenever they are accessed, the Linux virtual memory manager also speculatively prefetches adjacent pages on the swap disk. To verify this, we turn off prefetching in the kernel and re-run the paging experiments. The actual number of page faults then closely matches the simulated results for all benchmarks and allocators.

Swap Prefetchability
The effectiveness of prefetching is determined by the locality of page misses on the swap disk. If page misses require contiguous pages on the swap disk to be swapped in, prefetching will be effective. Page allocation on the swap disk is managed by the virtual memory manager. The Linux virtual memory manager attempts to cluster pages that are adjacent in virtual address space to store them contiguously on disk [10]. For this reason, the application's locality of reference affects the effectiveness of prefetching in the kernel when the system is paging.
[Figure 8: Predicted page miss curves versus actual major (page) faults in a real system with prefetching, for 176.gcc and 253.perlbmk. Each panel plots the number of page faults against available RAM (MB), showing simulated miss curves and measured major faults for DLmalloc, PHKmalloc, Vam, and (for 176.gcc) obstack.]

We investigate the effect of different allocators on application locality by measuring this swap prefetchability. We measure this by quantifying the locality of page misses. We gather the application's page references that would result in a miss for a given memory size in the LRU simulation. We then feed this page miss trace to a page miss adjacency calculator. This calculator measures the minimum distance (in pages) between the current miss and the previous N misses. The N parameter roughly models the memory buffer size for prefetching in the VM manager. We set N to 32, meaning that the last 32 prefetches can be buffered. We denote page misses that have a minimum distance to the previous 32 misses of no more than 8 pages (the Linux default prefetch size) as prefetchable misses. The remaining misses are non-prefetchable.
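A sketch of this calculation (our reconstruction of the tool described above, using virtual page numbers as a proxy for swap-disk position, which the kernel's clustering makes reasonable):

    #include <stdlib.h>

    #define WINDOW            32   /* N: recent misses the VM could prefetch around */
    #define PREFETCH_DISTANCE 8    /* Linux default prefetch size, in pages */

    static long recent[WINDOW];    /* ring buffer of recent miss pages */
    static int  count = 0, slot = 0;

    /* Returns 1 if this miss lies within PREFETCH_DISTANCE pages of any
     * of the previous WINDOW misses, i.e., it is "prefetchable". */
    static int classify_miss(long miss_page) {
        int prefetchable = 0;
        for (int i = 0; i < count; i++) {
            if (labs(miss_page - recent[i]) <= PREFETCH_DISTANCE) {
                prefetchable = 1;
                break;
            }
        }
        recent[slot] = miss_page;            /* remember this miss */
        slot = (slot + 1) % WINDOW;
        if (count < WINDOW) count++;
        return prefetchable;
    }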
Figure 9 presents our swap prefetchability results for different allocators, with specific memory sizes noted on the figure. For 176.gcc and across all allocators, as many as 90% of the misses are prefetchable. This prefetchability is due to 176.gcc's strong locality in obstack-style memory allocation. The original version of 197.parser (using xalloc) also exhibits this strong locality. However, this locality is less well preserved by the general-purpose allocators, although among these, Vam leads to the greatest prefetchability. With PHKmalloc and Vam, 253.perlbmk has very few non-prefetchable misses: over 90% of the misses are prefetchable. However, it has a large number of non-prefetchable misses with DLmalloc, and only 64% of the misses are prefetchable. This result demonstrates that 253.perlbmk's data locality is better preserved by PHKmalloc and Vam than by DLmalloc. 255.vortex has much less prefetchability than the other applications: about 50% of the misses are non-prefetchable with PHKmalloc and Vam, and 66% with DLmalloc. In fact, 255.vortex's poor page-level locality is also reflected in the very steep VM performance degradation curves in Figure 7(d) and in its simulated miss curves. This occurs either because 255.vortex's data locality is intrinsically poor or because it is not preserved by any of the allocators.

Note that this prefetchability calculation assumes an ideal prefetching scenario. The real VM manager may not actually be able to prefetch all the prefetchable misses. Nevertheless, these results appear to reflect observed application performance on a real system. We attribute the improved prefetchability of PHKmalloc and Vam to their page-oriented design and address-ordered first-fit allocation at the page level.

[Figure 9: Swap prefetchability: each bar shows simulated prefetchable misses (top) and non-prefetchable misses (bottom) at a specific memory size: 176.gcc (obstack, DLmalloc, PHKmalloc, and Vam at 40MB), 197.parser (xalloc at 16MB; DLmalloc and PHKmalloc at 9MB; Vam at 8MB), 253.perlbmk (DLmalloc, PHKmalloc, and Vam at 70MB), and 255.vortex (DLmalloc, PHKmalloc, and Vam at 40MB).]

6. Conclusions
In this paper, we present Vam, a memory allocator that builds on previous allocator designs to improve data locality and provide high performance while reducing fragmentation. We show that, compared to the Linux and FreeBSD allocators and over a suite of memory-intensive benchmarks, Vam improves application performance by an average of 4–8% when memory is plentiful, and by factors ranging from 2X to over 10X when memory is scarce. Vam's performance degrades gracefully as physical memory becomes scarce and paging begins. We explore the impact of Vam's design decisions and find that its fine-grained size classes, reap-like allocation, and page-oriented design all contribute to its effectiveness. We also find that a synergy between Vam's design and the Linux swap space clustering algorithm leads to improved disk prefetching when paging.

7. References
[1] IA-32 Intel architecture software developer's manual, volume 3: System programming guide. ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf.
[2] SPEC CPU2000. http://www.spec.org/osg/cpu2000/.
[3] SPEC CPU2000 memory footprint. http://www.spec.org/osg/cpu2000/analysis/memory/.
[4] D. F. Bacon, P. Cheng, and V. Rajan. Controlling fragmentation and space consumption in the Metronome, a real-time garbage collector for Java. In ACM SIGPLAN 2003 Conference on Languages, Compilers, and Tools for Embedded Systems (LCTES'2003), San Diego, CA, June 2003. ACM Press.
[5] D. F. Bacon, P. Cheng, and V. Rajan. A real-time garbage collector with low overhead and consistent utilization. In Conference Record of the Thirtieth Annual ACM Symposium on Principles of Programming Languages, ACM SIGPLAN Notices, New Orleans, LA, Jan. 2003. ACM Press.
[6] D. A. Barrett and B. G. Zorn. Using lifetime predictors to improve memory allocation performance. In PLDI [21], pages 187–196.
[7] E. Berger, K. McKinley, R. Blumofe, and P. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In ASPLOS-IX: Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, pages 117–128, Cambridge, MA, Nov. 2000.
[8] E. D. Berger, B. G. Zorn, and K. S. McKinley. Composing high-performance memory allocators. In Proceedings of SIGPLAN 2001 Conference on Programming Languages Design and Implementation, ACM SIGPLAN Notices, Snowbird, Utah, June 2001. ACM Press.
[9] E. D. Berger, B. G. Zorn, and K. S. McKinley. Reconsidering custom memory allocation. In OOPSLA'02 ACM Conference on Object-Oriented Systems, Languages and Applications, ACM SIGPLAN Notices, Seattle, WA, Nov. 2002. ACM Press.
[10] D. Black, J. Carter, G. Feinberg, R. MacDonald, S. Mangalat, E. Sheinbrood, J. Sciver, and P. Wang. OSF/1 virtual memory improvements. In Proceedings of the USENIX Mach Symposium, pages 87–103, Nov. 1991.
[11] H.-J. Boehm and M. Weiser. Garbage collection in an uncooperative environment. Software — Practice and Experience, 18(9):807–820, 1988.
[12] T. M. Chilimbi, M. D. Hill, and J. R. Larus. Cache-conscious structure layout. In Proceedings of SIGPLAN'99 Conference on Programming Languages Design and Implementation, ACM SIGPLAN Notices, pages 1–12, Atlanta, May 1999. ACM Press.
[13] D. Grunwald, B. Zorn, and R. Henderson. Improving the cache locality of memory allocation. In PLDI [21], pages 177–186.
[14] D. R. Hanson. A portable storage management system for the Icon programming language. Software — Practice and Experience, 10(6):489–500, June 1980.
[15] M. S. Johnstone. Non-Compacting Memory Allocation and Real-Time Garbage Collection. PhD thesis, University of Texas at Austin, Dec. 1997.
[16] M. S. Johnstone and P. R. Wilson. The memory fragmentation problem: Solved? In R. Jones, editor, ISMM'98 Proceedings of the First International Symposium on Memory Management, volume 34(3) of ACM SIGPLAN Notices, pages 26–36, Vancouver, Oct. 1998. ACM Press.
[17] P.-H. Kamp. Malloc(3) revisited. http://phk.freebsd.dk/pubs/malloc.pdf.
[18] S. F. Kaplan, Y. Smaragdakis, and P. R. Wilson. Flexible reference trace reduction for VM simulations. ACM Trans. Model. Comput. Simul., 13(1):1–38, 2003.
[19] D. Lea. A memory allocator. http://gee.cs.oswego.edu/dl/html/malloc.html.
[20] M. Pettersson. The perfctr patch for Linux and the perfex tool. http://user.it.uu.se/~mikpe/linux/perfctr/.
[21] Proceedings of SIGPLAN'93 Conference on Programming Languages Design and Implementation, volume 28(6) of ACM SIGPLAN Notices, Albuquerque, NM, June 1993. ACM Press.
[22] M. L. Seidl and B. G. Zorn. Implementing heap-object behavior prediction efficiently and effectively. Software — Practice and Experience, 31(9):869–892, 2001.
[23] D. N. Truong, F. Bodin, and A. Seznec. Improving cache behavior of dynamically allocated data structures. In IEEE PACT, pages 322+, 1998.
[24] P. R. Wilson, M. S. Johnstone, M. Neely, and D. Boles. Dynamic storage allocation: A survey and critical review. In H. Baker, editor, Proceedings of International Workshop on Memory Management, volume 986 of Lecture Notes in Computer Science, Kinross, Scotland, Sept. 1995. Springer-Verlag.
[25] T. Yang, E. D. Berger, M. Hertz, S. F. Kaplan, and J. E. B. Moss. Autonomic heap sizing: Taking real memory into account. In A. Diwan, editor, ISMM'04 Proceedings of the Fourth International Symposium on Memory Management, Vancouver, Oct. 2004. ACM Press.
[26] B. Zorn and M. Seidl. Segregating heap objects by reference behavior and lifetime. In Eighth International Conference on Architectural Support for Programming Languages and Operating Systems, San Jose, CA, Oct. 1998.
