A Locality-Improving Dynamic Memory Allocator: Yi Feng and Emery D. Berger
[Figure: Vam heap layout. Small objects (8-128 bytes) are carved from 4KB pages and carry no object headers; large objects carry per-object headers (OH); extremely large objects (above 32KB) live on mmapped pages. Legend: FLT = Free List Table, BMD = Block Metadata, OH = Object Header; block metadata fields shown include prev_block, next_block, num_total, object_size, num_free, free_list, num_bumped, and current_ptr.]
local objects will be of the same size [15, 24].

4.4 Elimination of Object Headers for Small Objects

Like PHKmalloc, Vam uses the BIBOP technique to eliminate individual object headers and locates metadata at the beginning of each block. Vam uses this approach only for small objects, because the resulting space savings and locality improvement are most significant for small objects. Larger objects each have per-object headers, simplifying coalescing. To distinguish the two cases, Vam partitions the entire address space into 16MB regions and uses a table to record the type of objects in each region. Because a one-byte flag is enough to hold the information for each region, this table is only 256 bytes for a 4GB address space. Although every object deallocation needs to perform a conditional check on the corresponding entry in this table, these checks have very good locality. Since most objects do not have headers, this branch is also highly predictable.
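As an illustration of this check, the sketch below implements the deallocation-time region test. The function and field names are illustrative assumptions; only the 16MB-region granularity and the one-byte-per-region table come from the description above.

    #include <stddef.h>
    #include <stdint.h>

    #define REGION_SHIFT 24u                          /* 16MB regions: 2^24 bytes */
    #define NUM_REGIONS  (1u << (32u - REGION_SHIFT)) /* 256 entries cover 4GB    */

    enum region_kind { REGION_SMALL = 0, REGION_LARGE = 1 };

    /* One byte per 16MB region: 256 bytes total for a 32-bit address space. */
    static uint8_t region_kind_table[NUM_REGIONS];

    /* Placeholder deallocation paths: a real allocator would consult the
       per-block metadata (small) or the per-object header (large). */
    static void free_small_object(void *obj) { (void)obj; }
    static void free_large_object(void *obj) { (void)obj; }

    void my_free(void *obj)
    {
        if (obj == NULL)
            return;
        /* Index the table by the object's 16MB region. */
        uint32_t region = (uint32_t)(uintptr_t)obj >> REGION_SHIFT;
        if (region_kind_table[region] == REGION_SMALL)
            free_small_object(obj);  /* headerless; block metadata describes it    */
        else
            free_large_object(obj);  /* size and links live in a per-object header */
    }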
4.5 Reap Allocation

Unlike PHKmalloc, Vam does not use per-block bitmaps to track which objects are free or allocated. Instead, it uses cheaper pointer-bumping allocation until the end of the block is reached. It then reuses objects from a free list for that block. This technique is a variant of reap allocation [9]. The original reap algorithm adds per-object headers and employs a full-blown heap implementation to manage freed objects. Vam instead manages its (headerless) free objects by threading a linked list through them. Vam's use of a single size class per block ensures that this approach does not lead to external fragmentation. Pointer-bumping also improves cache locality by maintaining allocation order.
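The per-block fast path can be sketched as follows. The block layout here is a simplified assumption, not Vam's exact metadata, but it shows pointer bumping followed by reuse from a free list threaded through the freed, headerless objects.

    #include <stddef.h>

    /* Simplified per-block state for one size class (illustrative only). */
    typedef struct block {
        char   *bump_ptr;     /* next never-allocated slot                */
        char   *bump_end;     /* end of the block's object area           */
        void   *free_list;    /* freed objects, linked through themselves */
        size_t  object_size;  /* single size class per block              */
    } block_t;

    /* Allocate one object from a block, or return NULL if the block is full. */
    static void *block_alloc(block_t *b)
    {
        /* Fast path: pointer bumping preserves allocation order, which helps
           cache locality. */
        if (b->bump_ptr + b->object_size <= b->bump_end) {
            void *obj = b->bump_ptr;
            b->bump_ptr += b->object_size;
            return obj;
        }
        /* Slow path: reuse a freed object from this block's free list. */
        if (b->free_list != NULL) {
            void *obj = b->free_list;
            b->free_list = *(void **)obj;   /* next pointer stored in the object */
            return obj;
        }
        return NULL;   /* block is full; the caller moves on to another block */
    }

    /* Free: push the headerless object back onto its block's free list. */
    static void block_free(block_t *b, void *obj)
    {
        *(void **)obj = b->free_list;
        b->free_list = obj;
    }

Because each block holds a single size class, reusing slots from the free list cannot create external fragmentation within the block.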
4.6 Ordered Per-Size Allocation

To minimize misses, Vam preferentially allocates objects from recently accessed blocks. It allocates from the first block in the available list until the block becomes full. It then moves the block to the full list and uses the next block in the available list, creating one if none exists. Vam places freed objects onto the appropriate per-block free list for reuse. When an object is freed to a previously-full block, Vam moves the block from the full list to the front of the available list. This page-level ordering ensures that new objects always fill free space on the page in the front of the available list and increases the chance that pages near the end become entirely free.

PHKmalloc uses a similar approach, but sorts non-full pages in increasing address order. We do not use address order because the sorting operation is costly.
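The list movements behind this ordering can be sketched as follows. This is a simplified illustration whose prev_block/next_block and num_free fields loosely mirror the block metadata in the heap-layout figure; it is not Vam's actual code.

    #include <stddef.h>

    /* Simplified per-block record for list management (illustrative). */
    typedef struct block {
        struct block *prev_block, *next_block;
        unsigned num_free;    /* objects currently on this block's free list */
        unsigned num_total;   /* object capacity of the block                */
        int on_full_list;     /* 1 if the block sits on the full list        */
    } block_t;

    /* Per-size-class lists: available (non-full) blocks and full blocks. */
    typedef struct size_class {
        block_t *available;
        block_t *full;
    } size_class_t;

    static void list_remove(block_t **head, block_t *b)
    {
        if (b->prev_block) b->prev_block->next_block = b->next_block;
        else               *head = b->next_block;
        if (b->next_block) b->next_block->prev_block = b->prev_block;
        b->prev_block = b->next_block = NULL;
    }

    static void list_push_front(block_t **head, block_t *b)
    {
        b->prev_block = NULL;
        b->next_block = *head;
        if (*head) (*head)->prev_block = b;
        *head = b;
    }

    /* Called when allocation exhausts the block at the head of the available
       list: move it to the full list; the caller then allocates from the next
       available block, creating one if none exists. */
    void block_became_full(size_class_t *sc, block_t *b)
    {
        list_remove(&sc->available, b);
        list_push_front(&sc->full, b);
        b->on_full_list = 1;
    }

    /* Called after an object is pushed onto block b's free list.  Freeing into
       a previously-full block moves it to the FRONT of the available list, so
       its space is the first to be reused. */
    void object_freed_into_block(size_class_t *sc, block_t *b)
    {
        b->num_free++;
        if (b->on_full_list) {
            list_remove(&sc->full, b);
            list_push_front(&sc->available, b);
            b->on_full_list = 0;
        }
    }

Pushing a newly non-full block to the front, rather than sorting by address as PHKmalloc does, keeps recently touched pages in use and costs O(1) per free.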
4.7 Aggressive Discarding of Pages

Blocks of small-to-medium objects and regions of large objects are all multiples of pages. In Vam, a page manager manages these pages by recording status information for each page in a page descriptor table and keeping consecutive free pages in a set of free lists.

Vam uses the madvise call to discard blocks of small and medium objects whenever they become empty. For large objects, Vam discards empty pages inside the large object memory region. As described above, Vam releases all extremely large objects upon free by a call to munmap.
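A minimal sketch of the discard operation, assuming Linux's MADV_DONTNEED semantics (the paper names only madvise; the flag and the helper below are our assumptions):

    #define _DEFAULT_SOURCE   /* for madvise() on glibc */
    #include <stddef.h>
    #include <stdio.h>
    #include <sys/mman.h>

    /* Discard the physical pages backing an empty block, or an empty page run
       inside a large-object region.  The virtual range stays mapped; the
       kernel reclaims the frames and supplies zero-filled pages (via soft page
       faults) if the range is touched again.  Extremely large objects are
       instead released outright with munmap(). */
    static int discard_pages(void *start, size_t length)
    {
        /* start and length are assumed to be page-aligned multiples of the
           page size, as Vam's blocks and regions are. */
        if (madvise(start, length, MADV_DONTNEED) != 0) {
            perror("madvise");
            return -1;
        }
        return 0;
    }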
This strategy reduces application footprint and can greatly reduce paging when under memory pressure. However, aggressive discarding of pages does add some runtime overhead. Each page discard requires one system call, and when a discarded page is later reused, the kernel must assign it a fresh physical page (soft page fault handling and page zeroing). In practice, these overheads are very low, thanks to the efficient implementation of system calls and soft page-fault handling in the Linux kernel. We would prefer to discard pages only in response to memory pressure.

[Table: benchmark statistics]

                       176.gcc   197.parser   253.perlbmk   255.vortex
    Execution Time     24s       280s         42s           50s
    Instructions       40G       424G         114G          101G
    VM Size            130MB     15MB         120MB         65MB
    Max Live Size      110MB     10MB         90MB          45MB
    Total Allocations  9M        788M         5.4M          1.5M

[Figure: run time (normalized) for each benchmark with DLmalloc, PHKmalloc, Vam, and the custom allocators.]

Figure 3: Cache-level locality results. (a) L1 cache miss counts, normalized to DLmalloc; (b) L2 cache miss counts, normalized to DLmalloc.
    Variant            Description
    PHK_sc             size classes every 8 bytes instead of 2x
    PHK_reap           replaces bitmap operations with reap allocation [9]
    PHK_sc_reap        combines PHK_sc and PHK_reap
    Vam small+header   adds 8-byte headers to small objects
    Vam bitmap         replaces reap allocation with bitmap operations for small and medium objects

Table 2: Variants of PHKmalloc and Vam (see Section 5.3).

L2 Locality

Both Vam and PHKmalloc significantly reduce L2 cache misses over DLmalloc, as Figure 3(b) shows. On average, Vam reduces L2 cache misses by 39% over DLmalloc. This cache-level locality improvement is more significant in 253.perlbmk and 255.vortex than in 176.gcc and 197.parser. For 176.gcc, the obstack allocator produces the fewest cache misses; this result is partially due to the extra metadata the general-purpose allocators require to simulate obstack semantics. Unlike L1 locality, L2 cache performance is strongly correlated with application run time. However, PHKmalloc's locality improvement is offset by its excessive number of instructions, particularly in 197.parser. We also measured data TLB misses; these exhibit nearly identical trends, so we do not report them here.

Summary: Vam generally provides better L1 cache locality than the other allocators. The use of a page-oriented heap layout improves L2 cache locality for both PHKmalloc and Vam, although Vam's improvement is somewhat greater.

5.3 Performance of Allocator Variants

To evaluate the effects of Vam's design decisions, we developed several variants of both PHKmalloc and Vam, summarized in Table 2. These variants let us quantify the impact of the choice of fine-grained size classes and reap-based allocation. Figures 4 and 5 present the L2 cache misses, instruction counts, and run time performance of these PHKmalloc and Vam variants. Note that the results are normalized to their respective original versions, i.e., PHKmalloc variants are normalized to PHKmalloc and Vam variants are normalized to Vam.

Impact of Size Classes and Reaps: PHKmalloc

As Figure 4(a) shows, PHK_sc (fine-grained size classes) reduces cache misses in three of the four benchmarks. The exception is 253.perlbmk, which uses many more distinct object sizes than the other benchmarks. The coarser size classes in the original PHKmalloc allow quicker reuse of freed space within each size class, yielding better cache locality. Although this PHKmalloc variant's changes in cache misses do not notably affect the overall run times shown in Figure 4(c), it greatly improves space efficiency over the original allocator and achieves better VM performance when under memory pressure.

PHK_reap (replacing bitmap operations with reap allocation) reduces instructions executed by 14% for 197.parser and runs 10% faster than the original PHKmalloc. On average, this variant improves application performance by 3%. However, because this modification adds extra memory accesses, it also increases L2 cache misses for most of the benchmarks (except 255.vortex). The increase is greatest for 253.perlbmk, but because the absolute number of misses there is quite small, the extra misses do not affect run time.

The PHK_sc_reap variant, combining the changes in PHK_sc and PHK_reap, shows that these improvements are generally complementary. On average, this variant improves run time performance by 4%. It notably reduces cache misses in 197.parser and 255.vortex and instructions in 176.gcc and 197.parser.

Impact of Headers and Bitmaps: Vam

Figures 5(a) and 5(c) show that adding headers to the small objects in Vam results in an average increase in L2 cache misses of 15% and a 3% increase in run times. The impact of adding headers is greatest for 197.parser, increasing run time by 10%: the average object size in 197.parser is only 21 bytes, and the extra headers substantially increase its working set.

Figure 5(b) shows that the Vam bitmap variant significantly increases the number of instructions executed in 197.parser. On average, this Vam variant reduces L2 cache misses by 2% and increases instructions by 2%, resulting in a 2% increase in run time.

Summary: The use of fine-grained size classes and the elimination of object headers generally improve cache locality and reduce total run time. The choice between bitmap operations and reap-like allocation is a trade-off. Vam currently uses reaps, but trading CPU instructions for fewer memory accesses during allocation may eventually prove more beneficial.
Figure 4: Comparison of PHKmalloc variants, normalized to the original. (a) Normalized L2 cache misses; (b) normalized instructions retired; (c) normalized total execution time.

Figure 5: Comparison of the Vam variants, normalized to the original. (a) Normalized L2 cache misses; (b) normalized instructions retired; (c) normalized total execution time.
5.4 Fragmentation

We evaluate the effect of allocator design on memory fragmentation. We define fragmentation as the maximum number of pages in use divided by the maximum amount of memory (in pages) requested by the application. In-use pages are those mapped from the kernel and touched, but not discarded. Pages mapped but never touched do not have physical space allocated; discarded pages have their previously allocated memory reclaimed. This view of application memory usage is from the VM manager's perspective and, we believe, better reflects the actual resource consumption.
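Written as a formula (our restatement of the definition above; both maxima are taken independently over the entire run):

    \[
      \mathrm{fragmentation} \;=\;
        \frac{\max_{t}\,\mathrm{pages\_in\_use}(t)}
             {\max_{t}\,\mathrm{pages\_requested}(t)}
    \]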
We compare four allocators here: DLmalloc, PHKmalloc, Vam, and the PHK_sc variant of PHKmalloc. Figure 6 shows the results. We were surprised to see that DLmalloc, an allocator known for low fragmentation, in fact leads to the highest fragmentation on average. The first reason is the space overhead of per-object headers. More importantly, DLmalloc is unable to identify and discard any free pages it may have. PHKmalloc overcomes both of these shortcomings, but its coarse size classes lead to internal fragmentation that negates its other advantages. Our PHK_sc variant uses fine-grained size classes and, on average, yields the lowest fragmentation. Vam combines these fragmentation-reducing features and nearly matches PHK_sc's low fragmentation.

Figure 6: Fragmentation results (DLmalloc, PHKmalloc, PHK_sc, and Vam on each benchmark, with their geometric mean).
5.5 Performance While Paging

To evaluate the effect of limited physical memory, we launch a process that pins down a specified amount of RAM, leaving the desired amount of available RAM for the benchmark applications. Figures 7(a) through 7(d) show the run times of the four SPEC benchmarks under a range of available RAM sizes, using different memory allocators. The rightmost point of each line shows the run time of the application with sufficient RAM to run without paging. As available memory is reduced (moving left), application performance degrades. This degradation differs markedly across memory allocators, except for 176.gcc, where all the allocators degrade similarly with reduced RAM. For all other benchmark applications, Vam delivers the best performance across a wide range of available RAM.
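The pinning tool itself is not described in the paper; one common way to build such a process on Linux (an assumption on our part, not necessarily the authors' tool) is to map an anonymous region of the requested size and mlock it:

    #define _DEFAULT_SOURCE   /* for MAP_ANONYMOUS and mlock on glibc */
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>

    /* Pin the requested number of megabytes of RAM (default 64) so that a
       benchmark running alongside sees only what remains.  Needs a sufficient
       RLIMIT_MEMLOCK or CAP_IPC_LOCK. */
    int main(int argc, char **argv)
    {
        size_t mb  = (argc > 1) ? strtoul(argv[1], NULL, 10) : 64;
        size_t len = mb << 20;             /* assumes a 64-bit size_t      */

        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)    { perror("mmap");  return 1; }
        if (mlock(p, len) != 0) { perror("mlock"); return 1; }

        for (;;) pause();                  /* hold the memory until killed */
    }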
Recall that for 176.gcc, we needed to add extra metadata to simulate the obstack semantics with the general-purpose allocators. The original obstack allocator thus performs better than the general-purpose allocators when RAM is scarce. Nonetheless, all of the general-purpose allocators preserve the application's locality similarly well because of the clustered allocations and deallocations in 176.gcc; the slight differences between them are largely due to their respective space efficiency, for which the original obstack custom allocator is the best.

The story is different for the other custom allocator. As Figure 7(b) shows, 197.parser's custom allocator (xalloc) requires substantially more RAM to avoid paging and performs much worse than the general-purpose allocators as available RAM is reduced. This poor performance is due to a limitation in xalloc: unlike the general-purpose allocators, xalloc cannot reuse heap space immediately after objects are freed. Instead, it must wait until consecutive objects at the end of the heap are all free, at which point it reuses memory from after the last object in use. While this strategy is effective when physical memory is ample, under memory pressure it degrades performance dramatically.

Figures 7(c) and 7(d) highlight the effectiveness of both PHKmalloc's and Vam's page-discarding algorithms. DLmalloc suffers a 5x slowdown when available physical memory is reduced to 80MB for 253.perlbmk, while PHKmalloc and Vam suffer the same slowdown only once available RAM drops to 30MB. With both of those allocators, 253.perlbmk exhibits a more graceful performance degradation than when using DLmalloc. For 255.vortex, Vam performs better than the other two allocators over all available RAM sizes we tested; DLmalloc required about 6MB more available RAM to achieve Vam's performance. Only the page-discarding algorithms play a role here: 255.vortex's average object size is 471 bytes, so DLmalloc's 8-byte object headers have little impact.
Figure 7: Performance using different memory allocators over a range of available RAM sizes. (Run time in seconds versus available RAM in MB for 176.gcc, 197.parser, 253.perlbmk, and 255.vortex; each panel compares DLmalloc, PHKmalloc, and Vam, plus the custom allocator where applicable: obstack for 176.gcc and xalloc for 197.parser.)
We note that, for 253.perlbmk, PHKmalloc degrades performance slightly less than Vam when available RAM is below 60MB. This is, again, because PHKmalloc's coarse size classes result in a locality improvement for this particular benchmark in some situations. We also run 253.perlbmk with the PHK_sc variant, and its performance degradation curve is then very close to that of Vam across all memory sizes.

5.6 Page-Level Locality

In this section, we explore the effect of allocator choice on application page-level locality in more detail, using an LRU simulator and page-level reference traces. We first gather the application's page-level references to the heap using a tool that intercepts system memory calls (brk, sbrk, mmap, munmap, and madvise) to keep track of the heap pages currently mapped from the kernel, and that traps memory references via page protection. We use the SAD (Safely-Allowed-Drop) algorithm to reduce the trace to a manageable size [18].

We then run these traces through an LRU simulator to generate page miss curves that indicate the number of misses (page faults) that would arise for every possible size of available memory. While no real system implements LRU, many systems closely approximate it, including the Linux kernel we use here. Our LRU simulator is similar to that described by Yang et al. [25]. We use placeholders in the LRU queue for pages discarded by madvise, in addition to pages unmapped by munmap/sbrk. These placeholders allow us to more accurately approximate a real VM system.
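For concreteness, the following is a minimal sketch of a one-pass (Mattson-style) LRU stack simulation that produces a miss curve from a page reference trace. It omits the placeholder handling for discarded and unmapped pages described above, and all names are illustrative.

    #include <stdlib.h>

    /* One-pass LRU stack simulation: for each page reference, find the page's
       depth in an LRU-ordered list (its reuse distance), record it in a
       histogram, and move the page to the front.  The miss curve follows:
       misses(m) = cold misses + references whose reuse distance exceeds m. */

    typedef struct node { unsigned long page; struct node *next; } node_t;

    #define MAX_DEPTH 65536   /* deepest reuse distance tracked, in pages */

    static unsigned long depth_hist[MAX_DEPTH + 1];
    static unsigned long cold_misses;
    static node_t *lru_head;

    static void reference(unsigned long page)
    {
        node_t **pp = &lru_head;
        unsigned long depth = 1;
        for (node_t *n = lru_head; n != NULL; n = *pp) {
            if (n->page == page) {
                *pp = n->next;                /* unlink ...                */
                n->next = lru_head;           /* ... and move to the front */
                lru_head = n;
                depth_hist[depth < MAX_DEPTH ? depth : MAX_DEPTH]++;
                return;
            }
            pp = &n->next;
            depth++;
        }
        cold_misses++;                        /* first touch of this page  */
        node_t *n = malloc(sizeof *n);
        if (n == NULL)
            exit(1);
        n->page = page;
        n->next = lru_head;
        lru_head = n;
    }

    /* Misses that an LRU-managed memory of mem_pages pages would incur. */
    static unsigned long misses_for(unsigned long mem_pages)
    {
        unsigned long total = cold_misses;
        for (unsigned long d = mem_pages + 1; d <= MAX_DEPTH; d++)
            total += depth_hist[d];
        return total;
    }

Sweeping misses_for over every memory size of interest yields the miss curve.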
We compare the miss curves generated by the simulator with the actual page faults. The actual page faults are the major (hard) page faults measured in the experiments described in Section 5.5. For two of our benchmarks, 197.parser and 255.vortex, the simulated miss curves are nearly the same as the actual page faults (except for the xalloc custom allocator in 197.parser).

However, for 176.gcc and 253.perlbmk, the actual page faults are far fewer than the simulated ones, as Figure 8 shows. For example, for 176.gcc with 40MB of RAM, the simulated faults number around 40,000 while the actual page faults measured are under 10,000. This inconsistency is due to the swap prefetching performed by the Linux VM manager but not by our simulator. In addition to swapping non-resident pages into RAM whenever they are accessed, the Linux virtual memory manager also speculatively prefetches adjacent pages on the swap disk. To verify this, we turn off prefetching in the kernel and re-run the paging experiments. The actual number of page faults then closely matches the simulated results for all benchmarks and allocators.

Swap Prefetchability

The effectiveness of prefetching is determined by the locality of page misses on the swap disk. If page misses require contiguous pages on the swap disk to be swapped in, prefetching will be effective. Page allocation on the swap disk is managed by the virtual memory manager; the Linux virtual memory manager attempts to cluster pages that are adjacent in virtual address space and store them contiguously on disk [10]. For this reason, the application's locality of reference affects the effectiveness of prefetching in the kernel when the system is paging.
Figure 8: Predicted page miss curves versus actual major (page) faults in a real system with prefetching. (Number of page faults versus available RAM for 176.gcc and 253.perlbmk, showing simulated miss curves and measured major faults for each allocator.)
We investigate the effect of different allocators on application locality by measuring this swap prefetchability, which quantifies the locality of page misses. We gather the application's page references that would result in a miss for a given memory size in the LRU simulation. We then feed this page miss trace to a page miss adjacency calculator, which measures the minimum distance (in pages) between the current miss and the previous N misses. The N parameter roughly models the memory buffer size for prefetching in the VM manager; we set N to 32, meaning that the last 32 prefetches can be buffered. We denote page misses whose minimum distance to the previous 32 misses is no more than 8 pages (the Linux default prefetch size) as prefetchable misses. The remaining misses are non-prefetchable.
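The adjacency calculation described above can be sketched directly; the constants follow the text (N = 32 previous misses, 8-page prefetch window), while the function name and interface are illustrative.

    #include <stddef.h>

    #define N_PREV        32   /* previous misses considered (prefetch buffer) */
    #define PREFETCH_DIST  8   /* Linux default prefetch window, in pages      */

    /* Classify each page miss in miss_trace (page numbers, in miss order) as
       prefetchable or not: a miss is prefetchable if it lies within
       PREFETCH_DIST pages of at least one of the previous N_PREV misses. */
    static void classify_misses(const unsigned long *miss_trace, size_t n_misses,
                                size_t *prefetchable, size_t *non_prefetchable)
    {
        *prefetchable = *non_prefetchable = 0;
        for (size_t i = 0; i < n_misses; i++) {
            unsigned long min_dist = (unsigned long)-1;
            size_t start = (i >= N_PREV) ? i - N_PREV : 0;
            for (size_t j = start; j < i; j++) {
                unsigned long d = (miss_trace[i] > miss_trace[j])
                                      ? miss_trace[i] - miss_trace[j]
                                      : miss_trace[j] - miss_trace[i];
                if (d < min_dist)
                    min_dist = d;
            }
            if (i > 0 && min_dist <= PREFETCH_DIST)
                (*prefetchable)++;
            else
                (*non_prefetchable)++;
        }
    }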
Figure 9 presents our swap prefetchability results for different allocators, with the specific memory sizes noted on the figure. For 176.gcc and across all allocators, as many as 90% of the misses are prefetchable. This prefetchability is due to 176.gcc's strong locality in obstack-style memory allocation. The original version of 197.parser (using xalloc) also exhibits this strong locality. However, this locality is less well preserved by the general-purpose allocators, although among these, Vam leads to the greatest prefetchability. With PHKmalloc and Vam, 253.perlbmk has very few non-prefetchable misses: over 90% of its misses are prefetchable. However, it has a large number of non-prefetchable misses with DLmalloc, where only 64% of the misses are prefetchable. This result demonstrates that 253.perlbmk's data locality is better preserved by PHKmalloc and Vam than by DLmalloc. 255.vortex has much less prefetchability than the other applications: about 50% of the misses are non-prefetchable with PHKmalloc and Vam, and 66% with DLmalloc. In fact, 255.vortex's poor page-level locality is also reflected in its very steep VM performance degradation curves in Figure 7(d) and in its simulated miss curves. This occurs either because 255.vortex's data locality is intrinsically poor or because it is not preserved by any of the allocators.

Note that this prefetchability calculation assumes an ideal prefetching scenario; the real VM manager may not actually be able to prefetch all the prefetchable misses. Nevertheless, these results appear to reflect observed application performance on a real system. We attribute the improved prefetchability of PHKmalloc and Vam to their page-oriented design and address-ordered first-fit allocation at the page level.

[Figure 9: Simulated prefetchable versus non-prefetchable page misses for each benchmark and allocator: xalloc@16MB, obstack@40MB, DLmalloc@40MB, PHKmalloc@40MB, Vam@40MB, DLmalloc@9MB, PHKmalloc@9MB, Vam@8MB, DLmalloc@70MB, PHKmalloc@70MB, Vam@70MB, DLmalloc@40MB, PHKmalloc@40MB, Vam@40MB.]

6. Conclusions

In this paper, we present Vam, a memory allocator that builds on previous allocator designs to improve data locality and provide high performance while reducing fragmentation. We show that, compared to the Linux and FreeBSD allocators across a suite of memory-intensive benchmarks, Vam improves application performance by an average of 4–8% when memory is plentiful, and by factors ranging from 2X to over 10X when memory is scarce. Vam's performance degrades gracefully as physical memory becomes scarce and paging begins. We explore the impact of Vam's design decisions and find that its fine-grained size classes, reap-like allocation, and page-oriented design all contribute to its effectiveness. We also find that a synergy between Vam's design and the Linux swap space clustering algorithm leads to improved disk prefetching when paging.

7. References