[Speedup graphs (speedup vs. number of processors); plots not reproduced.]
(e) BEMengine speedup. Linking with MTmalloc caused an exception to be raised.
(f) BEMengine speedup for the system solver only.

[Speedup graphs (speedup vs. number of processors); plots not reproduced.]
(a) Speedup for the active-false benchmark, which fails to scale with memory allocators that actively induce false sharing.
(b) Speedup for the passive-false benchmark, which fails to scale with memory allocators that passively or actively induce false sharing.
Figure 4: Speedup graphs that exhibit the effect of allocator-induced false sharing.
Benchmark        Hoard fragmentation (A/U)   max in use (U)   max allocated (A)   total memory requested   # objects requested   average object size
single-threaded benchmarks
espresso         1.47                        284,520          417,032             110,143,200              1,675,493             65.7378
Ghostscript      1.15                        1,171,408        1,342,240           52,194,664               566,542               92.1285
LRUsim           1.05                        1,571,176        1,645,856           1,588,320                39,109                40.6126
p2c              1.20                        441,432          531,912             5,483,168                199,361               27.5037
multithreaded benchmarks
threadtest       1.24                        1,068,864        1,324,848           80,391,016               9,998,831             8.04
shbench          3.17                        556,112          1,761,200           1,650,564,600            12,503,613            132.00
Larson           1.22                        8,162,600        9,928,760           1,618,188,592            27,881,924            58.04
BEMengine        1.02                        599,145,176      613,935,296         4,146,087,144            18,366,795            225.74
Barnes-Hut       1.18                        11,959,960       14,114,040          46,004,408               1,172,624             39.23

Table 4: Hoard fragmentation results and application memory statistics. We report fragmentation statistics for 14-processor runs of the multithreaded programs. All units are in bytes.
superblock can yield poor memory efficiency for certain behaviors, although Hoard still attains good scalable performance for this application (see Figure 3(b)).
5.5 Sensitivity Study
We also examined the effect of changing the empty fraction on runtime and fragmentation for the multithreaded benchmarks. Because superblocks are returned to the global heap (for reuse by other threads) when the heap crosses the emptiness threshold, the empty fraction affects both synchronization and fragmentation. We varied the empty fraction from 1/8 to 1/2 and saw very little change in runtime and fragmentation. We chose this range to exercise the tension between increased (worst-case) fragmentation and synchronization costs. The only benchmark that is substantially affected by these changes in the empty fraction is the Larson benchmark, whose fragmentation increases from 1.22 to 1.61 for an empty fraction of 1/2. Table 5 presents the runtime for these programs on 14 processors (we report the number of memory operations per second for the Larson benchmark, which runs for 30 seconds), and Table 6 presents the fragmentation results. Hoard's runtime is robust with respect to changes in the empty fraction because programs tend to reach a steady state in memory usage and stay within an empty fraction as small as 1/8, as described in Section 4.2.
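The precise rule is given in Section 4.2; purely as an illustration of the kind of check involved (a sketch, not Hoard's code), a per-processor heap can be considered "too empty" when more than a fraction f of its memory is free and it holds at least K superblocks' worth of free memory:

    #include <stddef.h>

    /* Illustrative sketch only (see Section 4.2 for Hoard's actual invariant):
     * u is the memory in use on this per-processor heap, a is the memory it
     * holds, S is the superblock size, f is the empty fraction, and K is a
     * small constant. When this returns nonzero, the heap releases a mostly
     * empty superblock to the global heap. */
    static int crossed_emptiness_threshold(size_t u, size_t a,
                                           size_t S, double f, unsigned K)
    {
        return (double)u < (1.0 - f) * (double)a   /* more than f of a is free */
            && u + K * S < a;                      /* at least K superblocks free */
    }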
program          runtime (sec)
                 f = 1/8      f = 1/4      f = 1/2
threadtest       1.27         1.28         1.19
shbench          1.45         1.50         1.44
BEMengine        86.85        87.49        88.03
Barnes-Hut       16.52        16.13        16.41
                 throughput (memory ops/sec)
Larson           4,407,654    4,416,303    4,352,163

Table 5: Runtime on 14 processors using Hoard with different empty fractions.

program          fragmentation
                 f = 1/8      f = 1/4      f = 1/2
threadtest       1.22         1.24         1.22
shbench          3.17         3.17         3.16
Larson           1.22         1.22         1.61
BEMengine        1.02         1.02         1.02
Barnes-Hut       1.18         1.18         1.18

Table 6: Fragmentation on 14 processors using Hoard with different empty fractions.

6. Related Work
While dynamic storage allocation is one of the most studied topics in computer science, there has been relatively little work on concurrent memory allocators. In this section, we place past work into a taxonomy of memory allocator algorithms and compare each to Hoard, addressing the blowup and allocator-induced false sharing characteristics of each of these algorithms.
6.1 Taxonomy of Memory Allocator Algorithms
Our taxonomy consists of the following five categories:

Serial single heap. Only one processor may access the heap at a time (Solaris, Windows NT/2000 [21]).

Concurrent single heap. Many processors may simultaneously operate on one shared heap ([5, 16, 17, 13, 14]).

Pure private heaps. Each processor has its own heap (STL [30], Cilk [6]).

Private heaps with ownership. Each processor has its own heap, but memory is always returned to its "owner" processor (MTmalloc, Ptmalloc [9], LKmalloc [22]).

Private heaps with thresholds. Each processor has its own heap which can hold a limited amount of free memory (DYNIX kernel allocator [25], Vee and Hsu [37], Hoard).

Below we discuss these single and multiple-heap algorithms, focusing on the false sharing and blowup characteristics of each.
Single Heap Allocation
Serial single heap allocators often exhibit extremely low fragmentation over a wide range of real programs [19] and are quite fast [23]. Since they typically protect the heap with a single lock, which serializes memory operations and introduces contention, they are inappropriate for use with most parallel multithreaded programs. In multithreaded programs, contention for the lock prevents allocator performance from scaling with the number of processors. Most modern operating systems provide such memory allocators in the default library, including Solaris and IRIX. Windows NT/2000 uses 64-bit atomic operations on freelists rather than locks [21], which is also unscalable because the head of each freelist is a central bottleneck (the Windows 2000 allocator and some of Iyengar's allocators use one freelist for each object size or range of sizes [13, 14, 21]). These allocators all actively induce false sharing.

Concurrent single heap allocation implements the heap as a concurrent data structure, such as a concurrent B-tree [10, 11, 13, 14, 16, 17] or a freelist with locks on each free block [5, 8, 34]. This approach reduces to a serial single heap in the common case when most allocations are from a small number of object sizes. Johnstone and Wilson show that for every program they examined, the vast majority of objects allocated are of only a few sizes [18]. Each memory operation on these structures requires either time linear in the number of free blocks or O(log C) time, where C is the number of size classes of allocated objects. A size class is a range of object sizes that are grouped together (e.g., all objects between 32 and 36 bytes are treated as 36-byte objects). Like serial single heaps, these allocators actively induce false sharing. Another problem with these allocators is that they make use of many locks or atomic update operations (e.g., compare-and-swap), which are quite expensive. State-of-the-art serial allocators are so well engineered that most memory operations involve only a handful of instructions [23]. An uncontended lock acquire and release accounts for about half of the total runtime of these memory operations. In order to be competitive, a memory allocator can acquire and release at most two locks in the common case, or incur three atomic operations. Hoard requires only one lock for each malloc and two for each free, and each memory operation takes constant (amortized) time (see Section 3.4).
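To make the notion of a size class concrete, the following sketch rounds a request up to an illustrative 4-byte granularity; the granularity and rounding rule here are ours, not those of any particular allocator:

    #include <stddef.h>

    /* Illustrative size-class rounding: with a 4-byte granularity, any request
     * between 33 and 36 bytes is served from the 36-byte class. Real allocators
     * typically use coarser or non-uniform class spacings. */
    static size_t size_class(size_t request)
    {
        const size_t granularity = 4;
        return (request + granularity - 1) / granularity * granularity;
    }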
Multiple-Heap Allocation
We describe three categories of allocators, all of which use multiple heaps. These allocators assign threads to heaps either by assigning one heap to every thread (using thread-specific data) [30], by using a currently unused heap from a collection of heaps [9], by round-robin heap assignment (as in MTmalloc, provided with Solaris 7 as a replacement allocator for multithreaded applications), or by providing a mapping function that maps threads onto a collection of heaps (LKmalloc [22], Hoard). For simplicity of exposition, we assume that there is exactly one thread bound to each processor and one heap for each of these threads.
STL's (Standard Template Library) pthread_alloc, Cilk 4.1, and many ad hoc allocators use pure private heaps allocation [6, 30]. Each processor has its own per-processor heap that it uses for every memory operation (the allocator mallocs from its heap and frees to its heap). Each per-processor heap is "purely private" because each processor never accesses any other heap for any memory operation. After one thread allocates an object, a second thread can free it; in pure private heaps allocators, this memory is placed in the second thread's heap. Since parts of the same cache line may be placed on multiple heaps, pure private-heaps allocators passively induce false sharing. Worse, pure private-heaps allocators exhibit unbounded memory consumption given a producer-consumer allocation pattern, as described in Section 2.2. Hoard avoids this problem by returning freed blocks to the heap that owns the superblocks they belong to.
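The producer-consumer pattern referred to above can be as simple as the following sketch (our illustration, not the Section 2.2 benchmark itself). The producer only allocates and the consumer only frees, so under a pure private heaps allocator every freed block lands on the consumer's heap, where the producer never looks for it; the footprint grows without bound even though only one object is live at a time.

    #include <pthread.h>
    #include <stdlib.h>

    /* One object is in flight at a time, yet a pure private heaps allocator
     * keeps requesting fresh memory for the producer while free blocks pile
     * up, unused, on the consumer's heap. */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static void *slot = NULL;

    static void *producer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            void *obj = malloc(64);            /* drawn from the producer's heap */
            pthread_mutex_lock(&lock);
            while (slot != NULL)
                pthread_cond_wait(&cond, &lock);
            slot = obj;
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            pthread_mutex_lock(&lock);
            while (slot == NULL)
                pthread_cond_wait(&cond, &lock);
            void *obj = slot;
            slot = NULL;
            pthread_cond_signal(&cond);
            pthread_mutex_unlock(&lock);
            free(obj);                    /* deposited on the consumer's heap */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t p, c;
        pthread_create(&p, NULL, producer, NULL);
        pthread_create(&c, NULL, consumer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }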
Private heaps with ownership allocators return free blocks to the heap that allocated them. This algorithm, used by MTmalloc, Ptmalloc [9], and LKmalloc [22], yields O(P) blowup, whereas Hoard has O(1) blowup. Consider a round-robin style producer-consumer program: each processor i allocates K blocks and processor (i + 1) mod P frees them. The program requires only K blocks, but the allocator will allocate P*K blocks (K on each of the P heaps). Ptmalloc and MTmalloc can actively induce false sharing (different threads may allocate from the same heap). LKmalloc's permanent assignment of large regions of memory to processors and its immediate return of freed blocks to these regions, while leading to O(P) blowup, should have the advantage of eliminating allocator-induced false sharing, although the authors did not explicitly address this issue. Hoard explicitly takes steps to reduce false sharing, although it cannot avoid it altogether, while maintaining O(1) blowup.
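For concreteness, the round-robin pattern above can be rendered as the sketch below (our illustration; P, K, and the object size are arbitrary). The threads take turns around a ring: each frees the K blocks allocated by its predecessor and then allocates K fresh blocks from its own heap, so only about K blocks are live at any time while an ownership-based allocator ends up holding roughly K blocks on each of the P heaps.

    #include <pthread.h>
    #include <stdlib.h>

    #define P      4
    #define K      1024
    #define ROUNDS 100

    static void *blocks[P][K];            /* blocks handed from thread i to i+1 */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int turn = 0;                  /* serializes the ring */

    static void *worker(void *arg)
    {
        int i = (int)(long)arg;
        for (int r = 0; r < ROUNDS; r++) {
            pthread_mutex_lock(&lock);
            while (turn % P != i)
                pthread_cond_wait(&cond, &lock);
            int prev = (i + P - 1) % P;
            for (int j = 0; j < K; j++) {   /* free the predecessor's blocks;    */
                free(blocks[prev][j]);      /* with ownership they return to     */
                blocks[prev][j] = NULL;     /* heap prev, not to this heap       */
            }
            for (int j = 0; j < K; j++)     /* allocate K fresh blocks from this */
                blocks[i][j] = malloc(64);  /* thread's own heap                 */
            turn++;
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[P];
        for (long i = 0; i < P; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < P; i++)
            pthread_join(t[i], NULL);
        return 0;
    }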
Both Ptmalloc and MTmalloc also suffer from scalability bottlenecks. In Ptmalloc, each malloc chooses the first heap that is not currently in use (caching the resulting choice for the next attempt). This heap selection strategy causes substantial bus traffic, which limits Ptmalloc's scalability to about 6 processors, as we show in Section 5. MTmalloc performs round-robin heap assignment by maintaining a "nextHeap" global variable that is updated by every call to malloc. This variable is a source of contention that makes MTmalloc unscalable and actively induces false sharing. Hoard has no centralized bottlenecks except for the global heap, which is not a frequent source of contention, for reasons described in Section 4.2.
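The contention point is easy to picture: every allocation on every processor updates the same word. The sketch below shows the shape of such a scheme; the names and structure are our illustration, not MTmalloc's code.

    #include <stddef.h>
    #include <stdlib.h>

    #define NHEAPS 16

    typedef struct { int id; /* per-heap state (lock, free lists, ...) */ } heap_t;
    static heap_t heaps[NHEAPS];

    static volatile unsigned next_heap = 0;        /* the shared bottleneck */

    static void *heap_alloc(heap_t *h, size_t sz)  /* stand-in for a real   */
    {                                              /* per-heap allocator    */
        (void)h;
        return malloc(sz);
    }

    void *rr_malloc(size_t sz)
    {
        /* Every call atomically bumps one global counter (GCC/Clang builtin),
         * so its cache line bounces between all allocating processors no
         * matter how well the per-heap structures themselves scale. */
        unsigned h = __sync_fetch_and_add(&next_heap, 1) % NHEAPS;
        return heap_alloc(&heaps[h], sz);
    }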
The DYNIX kernel memory allocator by McKenney and Slingwine [25] and the single object-size allocator by Vee and Hsu [37] employ a private heaps with thresholds algorithm. These allocators are efficient and scalable because they move large blocks of memory between a hierarchy of per-processor heaps and heaps shared by multiple processors. When a per-processor heap has more than a certain amount of free memory (the threshold), some portion of the free memory is moved to a shared heap. This strategy also bounds blowup to a constant factor, since no heap may hold more than some fixed amount of free memory. The mechanisms that control this motion and the units of memory moved by the DYNIX and Vee and Hsu allocators differ significantly from those used by Hoard. Unlike Hoard, both of these allocators passively induce false sharing by making it very easy for pieces of the same cache line to be recycled: as long as the amount of free memory does not exceed the threshold, pieces of the same cache line spread across processors will be repeatedly reused to satisfy memory requests. Also, these allocators are forced to synchronize every time the threshold amount of memory is allocated or freed, while Hoard can avoid synchronization altogether while the emptiness of per-processor heaps is within the empty fraction. On the other hand, these allocators do avoid the two-fold slowdown that can occur in the worst case described for Hoard in Section 4.2.
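A minimal sketch of the threshold policy just described (the names, the byte count, and the return-half rule are ours, not DYNIX's or Vee and Hsu's):

    #include <stddef.h>

    #define THRESHOLD (64 * 1024)

    typedef struct {
        size_t free_bytes;     /* free memory currently held by this heap */
        /* ... free lists, etc. ... */
    } private_heap_t;

    /* Stub: detach 'bytes' worth of free blocks and push them, under the
     * shared heap's lock, onto the heap shared by all processors. */
    static void move_to_shared_heap(private_heap_t *h, size_t bytes)
    {
        (void)h;
        (void)bytes;
    }

    /* Called after every free: crossing the threshold is the only point at
     * which this processor must synchronize with the shared heap. */
    static void note_free(private_heap_t *h, size_t freed_size)
    {
        h->free_bytes += freed_size;
        if (h->free_bytes > THRESHOLD) {
            size_t excess = h->free_bytes / 2;   /* return some portion */
            move_to_shared_heap(h, excess);
            h->free_bytes -= excess;
        }
    }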
Allocator algorithm              fast?   scalable?   avoids false sharing?   blowup
serial single heap               yes     no          no                      O(1)
concurrent single heap           no      maybe       no                      O(1)
pure private heaps               yes     yes         no                      unbounded
private heaps w/ownership
  Ptmalloc [9]                   yes     yes         no                      O(P)
  MTmalloc                       yes     no          no                      O(P)
  LKmalloc [22]                  yes     yes         yes                     O(P)
private heaps w/thresholds
  Vee and Hsu, DYNIX [25, 37]    yes     yes         no                      O(1)
  Hoard                          yes     yes         yes                     O(1)

Table 7: A summary of the allocator algorithms described in this section, along with their speed, scalability, false sharing, and blowup characteristics.

Table 7 presents a summary of the above allocator algorithms, along with their speed, scalability, false sharing, and blowup characteristics. As can be seen from the table, the algorithms closest to Hoard are Vee and Hsu, DYNIX, and LKmalloc. The first two fail to avoid passively-induced false sharing and are forced to synchronize with a global heap after each threshold amount of memory is consumed or freed, while Hoard avoids false sharing and is not required to synchronize until the emptiness threshold is crossed or when a heap does not have sufficient memory. LKmalloc has similar synchronization behavior to Hoard and avoids allocator-induced false sharing, but has O(P) blowup.
7. Future Work
Although the hashing method that we use has so far proven to be an effective mechanism for assigning threads to heaps, we plan to develop an efficient method that can adapt to the situation when two concurrently-executing threads map to the same heap.
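For reference, hashing a thread onto one of the heaps can be as simple as the sketch below; the hash function, the use of pthread_self, and the heap count are illustrative assumptions, not Hoard's implementation.

    #include <pthread.h>
    #include <stdint.h>

    #define NHEAPS 32   /* e.g., a small multiple of the number of processors */

    /* Illustrative thread-to-heap hash; a collision means two concurrently
     * executing threads share a heap, the case discussed above. Casting
     * pthread_t to an integer is not strictly portable, but suffices here. */
    static unsigned heap_index_for_current_thread(void)
    {
        uintptr_t tid = (uintptr_t)pthread_self();
        return (unsigned)((tid * 2654435761u) % NHEAPS);
    }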
While we believe that Hoard improves program locality in various ways, we have yet to quantitatively measure this effect. We plan to use both cache and page-level measurement tools to evaluate and improve Hoard's effect on program-level locality.

We are also looking at ways to remove the one size class per superblock restriction. This restriction is responsible for increased fragmentation and a decline in performance for programs which allocate objects from a wide range of size classes, like espresso and shbench.

Finally, we are investigating ways of improving the performance of Hoard on cc/NUMA architectures. Because the unit of cache coherence on these architectures is an entire page, Hoard's mechanism of coalescing to page-sized superblocks appears to be very important for scalability. Our preliminary results on an SGI Origin 2000 show that Hoard scales to a substantially larger number of processors, and we plan to report these results in the future.
8. Conclusion
In this paper, we have introduced the Hoard memory allocator. Hoard improves on previous memory allocators by simultaneously providing four features that are important for scalable application performance: speed, scalability, false sharing avoidance, and low fragmentation. Hoard's novel organization of per-processor and global heaps, along with its discipline for moving superblocks across heaps, enables Hoard to achieve these features and is the key contribution of this work. Our analysis shows that Hoard has provably bounded blowup and low expected-case synchronization. Our experimental results on eleven programs demonstrate that in practice Hoard has low fragmentation, avoids false sharing, and scales very well. In addition, we show that Hoard's performance and fragmentation are robust with respect to its primary parameter, the empty fraction. Since scalable application performance clearly requires scalable architecture and runtime system support, Hoard thus takes a key step in this direction.
9. Acknowledgements
Many thanks to Brendon Cahoon, Rich Cardone, Scott Kaplan, Greg Plaxton, Yannis Smaragdakis, and Phoebe Weidmann for valuable discussions during the course of this work and input during the writing of this paper. Thanks also to Martin Bächtold, Trey Boudreau, Robert Fleischman, John Hickin, Paul Larson, Kevin Mills, and Ganesan Rajagopal for their contributions to improving and porting Hoard, and to Ben Zorn and the anonymous reviewers for helping to improve this paper.

Hoard is publicly available at http://www.hoard.org for a variety of platforms, including Solaris, IRIX, AIX, Linux, and Windows NT/2000.
10. References
[1] U. Acar, E. Berger, R. Blumofe, and D. Papadopoulos. Hood: A threads library for multiprogrammed multiprocessors. http://www.cs.utexas.edu/users/hood, Sept. 1999.
[2] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446-449, 1986.
[3] bCandid.com, Inc. http://www.bcandid.com.
[4] E. D. Berger and R. D. Blumofe. Hoard: A fast, scalable, and memory-efficient allocator for shared-memory multiprocessors. Technical Report UTCS-TR99-22, The University of Texas at Austin, 1999.
[5] B. Bigler, S. Allan, and R. Oldehoeft. Parallel dynamic storage allocation. In International Conference on Parallel Processing, pages 272-275, 1985.
[6] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356-368, Santa Fe, New Mexico, Nov. 1994.
[7] Coyote Systems, Inc. http://www.coyotesystems.com.
[8] C. S. Ellis and T. J. Olson. Algorithms for parallel memory allocation. International Journal of Parallel Programming, 17(4):303-345, 1988.
[9] W. Gloger. Dynamic memory allocator implementations in Linux system libraries. http://www.dent.med.uni-muenchen.de/~wmglo/malloc-slides.html.
[10] A. Gottlieb and J. Wilson. Using the buddy system for concurrent memory allocation. Technical Report System Software Note 6, Courant Institute, 1981.
[11] A. Gottlieb and J. Wilson. Parallelizing the usual buddy algorithm. Technical Report System Software Note 37, Courant Institute, 1982.
[12] D. Grunwald, B. Zorn, and R. Henderson. Improving the cache locality of memory allocation. In R. Cartwright, editor, Proceedings of the Conference on Programming Language Design and Implementation, pages 177-186, New York, NY, USA, June 1993. ACM Press.
[13] A. K. Iyengar. Dynamic Storage Allocation on a Multiprocessor. PhD thesis, MIT, 1992. MIT Laboratory for Computer Science Technical Report MIT/LCS/TR-560.
[14] A. K. Iyengar. Parallel dynamic storage allocation algorithms. In Fifth IEEE Symposium on Parallel and Distributed Processing. IEEE Press, 1993.
[15] T. Jeremiassen and S. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. In ACM Symposium on Principles and Practice of Parallel Programming, pages 179-188, July 1995.
[16] T. Johnson. A concurrent fast-fits memory manager. Technical Report TR91-009, University of Florida, Department of CIS, 1991.
[17] T. Johnson and T. Davis. Space efficient parallel buddy memory management. Technical Report TR92-008, University of Florida, Department of CIS, 1992.
[18] M. S. Johnstone. Non-Compacting Memory Allocation and Real-Time Garbage Collection. PhD thesis, University of Texas at Austin, Dec. 1997.
[19] M. S. Johnstone and P. R. Wilson. The memory fragmentation problem: Solved? In ISMM, Vancouver, B.C., Canada, 1998.
[20] K. Kennedy and K. S. McKinley. Optimizing for parallelism and data locality. In Proceedings of the Sixth International Conference on Supercomputing, pages 323-334, Distributed Computing, July 1992.
[21] M. R. Krishnan. Heap: Pleasures and pains. Microsoft Developer Newsletter, Feb. 1999.
[22] P. Larson and M. Krishnan. Memory allocation for long-running server applications. In ISMM, Vancouver, B.C., Canada, 1998.
[23] D. Lea. A memory allocator. http://g.oswego.edu/dl/html/malloc.html.
[24] B. Lewis. comp.programming.threads FAQ. http://www.lambdacs.com/newsgroup/FAQ.html.
[25] P. E. McKenney and J. Slingwine. Efficient kernel memory allocation on shared-memory multiprocessor. In Proceedings of the Winter 1993 USENIX Conference, pages 295-305, San Diego, California, USA, Jan. 1993. USENIX Association.
[26] MicroQuill, Inc. http://www.microquill.com.
[27] MySQL, Inc. The MySQL database manager. http://www.mysql.org.
[28] G. J. Narlikar and G. E. Blelloch. Space-efficient scheduling of nested parallelism. ACM Transactions on Programming Languages and Systems, 21(1):138-173, Jan. 1999.
[29] J. M. Robson. Worst case fragmentation of first fit and best fit storage allocation strategies. ACM Computer Journal, 20(3):242-244, Aug. 1977.
[30] SGI. The Standard Template Library for C++: Allocators. http://www.sgi.com/Technology/STL/Allocators.html.
[31] Standard Performance Evaluation Corporation. SPECweb99. http://www.spec.org/osg/web99/.
[32] D. Stefanović. Properties of Age-Based Automatic Memory Reclamation Algorithms. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, Massachusetts, Dec. 1998.
[33] D. Stein and D. Shah. Implementing lightweight threads. In Proceedings of the 1992 USENIX Summer Conference, pages 1-9, 1992.
[34] H. Stone. Parallel memory allocation using the FETCH-AND-ADD instruction. Technical Report RC 9674, IBM T. J. Watson Research Center, Nov. 1982.
[35] Time-Warner/AOL, Inc. AOLserver 3.0. http://www.aolserver.com.
[36] J. Torrellas, M. S. Lam, and J. L. Hennessy. False sharing and spatial locality in multiprocessor caches. IEEE Transactions on Computers, 43(6):651-663, 1994.
[37] V.-Y. Vee and W.-J. Hsu. A scalable and efficient storage allocator on shared-memory multiprocessors. In International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN '99), pages 230-235, Fremantle, Western Australia, June 1999.