Hoard: A Scalable Memory Allocator for Multithreaded Applications

Emery D. Berger    Kathryn S. McKinley†    Robert D. Blumofe    Paul R. Wilson

Department of Computer Sciences           †Department of Computer Science
The University of Texas at Austin          University of Massachusetts
Austin, Texas 78712                        Amherst, Massachusetts 01003
{emery, rdb, wilson}@cs.utexas.edu         [email protected]
This work is supported in part by the Defense Advanced Research Projects Agency (DARPA) under Grant F30602-97-1-0150 from the U.S. Air Force Research Laboratory. Kathryn McKinley was supported by DARPA Grant 5-21425, NSF Grant EIA-9726401, and NSF CAREER Award CCR-9624209. In addition, Emery Berger was supported by a Novell Corporation Fellowship. Multiprocessor computing facilities were provided through a generous donation by Sun Microsystems.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ASPLOS 2000 Cambridge, MA USA
Copyright 2000 ACM 0-89791-88-6/97/05 ..$5.00

ABSTRACT
Parallel, multithreaded C and C++ programs such as web servers, database managers, news servers, and scientific applications are becoming increasingly prevalent. For these applications, the memory allocator is often a bottleneck that severely limits program performance and scalability on multiprocessor systems. Previous allocators suffer from problems that include poor performance and scalability, and heap organizations that introduce false sharing. Worse, many allocators exhibit a dramatic increase in memory consumption when confronted with a producer-consumer pattern of object allocation and freeing. This increase in memory consumption can range from a factor of P (the number of processors) to unbounded memory consumption.
This paper introduces Hoard, a fast, highly scalable allocator that largely avoids false sharing and is memory efficient. Hoard is the first allocator to simultaneously solve the above problems. Hoard combines one global heap and per-processor heaps with a novel discipline that provably bounds memory consumption and has very low synchronization costs in the common case. Our results on eleven programs demonstrate that Hoard yields low average fragmentation and improves overall program performance over the standard Solaris allocator by up to a factor of 60 on 14 processors, and up to a factor of 18 over the next best allocator we tested.

1. Introduction
Parallel, multithreaded programs are becoming increasingly prevalent. These applications include web servers [35], database managers [27], news servers [3], as well as more traditional parallel applications such as scientific applications [7]. For these applications, high performance is critical. They are generally written in C or C++ to run efficiently on modern shared-memory multiprocessor servers. Many of these applications make intensive use of dynamic memory allocation. Unfortunately, the memory allocator is often a bottleneck that severely limits program scalability on multiprocessor systems [21]. Existing serial memory allocators do not scale well for multithreaded applications, and existing concurrent allocators do not provide one or more of the following features, all of which are needed in order to attain scalable and memory-efficient allocator performance:

Speed. A memory allocator should perform memory operations (i.e., malloc and free) about as fast as a state-of-the-art serial memory allocator. This feature guarantees good allocator performance even when a multithreaded program executes on a single processor.

Scalability. As the number of processors in the system grows, the performance of the allocator must scale linearly with the number of processors to ensure scalable application performance.

False sharing avoidance. The allocator should not introduce false sharing of cache lines, in which threads on distinct processors inadvertently share data on the same cache line.

Low fragmentation. We define fragmentation as the maximum amount of memory allocated from the operating system divided by the maximum amount of memory required by the application. Excessive fragmentation can degrade performance by causing poor data locality, leading to paging.

Certain classes of memory allocators (described in Section 6) exhibit a special kind of fragmentation that we call blowup. Intuitively, blowup is the increase in memory consumption caused when a concurrent allocator reclaims memory freed by the program but fails to use it to satisfy future memory requests. We define blowup as the maximum amount of memory allocated by a given allocator divided by the maximum amount of memory allocated by an ideal uniprocessor allocator. As we show in Section 2.2, the common producer-consumer programming idiom can cause blowup. In many allocators, blowup ranges from a factor of P (the number of processors) to unbounded memory consumption (the longer the program runs, the more memory it consumes). Such a pathological increase in memory consumption can be catastrophic, resulting in premature application termination due to exhaustion of swap space.
The contribution of this paper is to introduce the Hoard allocator and show that it enables parallel multithreaded programs to achieve scalable performance on shared-memory multiprocessors. Hoard achieves this result by simultaneously solving all of the above problems. In particular, Hoard solves the blowup and false sharing problems, which, as far as we know, have never been addressed in the literature.
As we demonstrate, Hoard also achieves nearly zero synchronization costs in practice.
Hoard maintains per-processor heaps and one global heap. When a per-processor heap's usage drops below a certain fraction, Hoard transfers a large fixed-size chunk of its memory from the per-processor heap to the global heap, where it is then available for reuse by another processor. We show that this algorithm bounds blowup and synchronization costs to a constant factor. This algorithm avoids false sharing by ensuring that the same processor almost always reuses (i.e., repeatedly mallocs) from a given cache line. Results on eleven programs demonstrate that Hoard scales linearly as the number of processors grows and that its fragmentation costs are low. On 14 processors, Hoard improves performance over the standard Solaris allocator by up to a factor of 60 and a factor of 18 over the next best allocator we tested. These features have led to its incorporation in a number of high-performance commercial applications, including the Twister, Typhoon, Breeze and Cyclone chat and USENET servers [3] and BEMSolver, a high-performance scientific code [7].
The rest of this paper is organized as follows. In Section 2, we explain in detail the issues of blowup and allocator-induced false sharing. In Section 3, we motivate and describe in detail the algorithms used by Hoard to simultaneously solve these problems. We sketch proofs of the bounds on blowup and contention in Section 4. We demonstrate Hoard's speed, scalability, false sharing avoidance, and low fragmentation empirically in Section 5, including comparisons with serial and concurrent memory allocators. We also show that Hoard is robust with respect to changes to its key parameter. We classify previous work into a taxonomy of memory allocators in Section 6, focusing on the speed, scalability, false sharing, and fragmentation problems described above. Finally, we discuss future directions for this research in Section 7, and conclude in Section 8.

2. Motivation
In this section, we focus special attention on the issues of allocator-induced false sharing of heap objects and blowup to motivate our work. These issues must be addressed to achieve efficient memory allocation for scalable multithreaded applications but have been neglected in the memory allocation literature.

2.1 Allocator-Induced False Sharing of Heap Objects
False sharing occurs when multiple processors share words in the same cache line without actually sharing data and is a notorious cause of poor performance in parallel applications [20, 15, 36]. Allocators can cause false sharing of heap objects by dividing cache lines into a number of small objects that distinct processors then write. A program may introduce false sharing by allocating a number of objects within one cache line and passing an object to a different thread. It is thus impossible to completely avoid false sharing of heap objects unless the allocator pads out every memory request to the size of a cache line. However, no allocator we know of pads memory requests to the size of a cache line, and with good reason: padding could cause a dramatic increase in memory consumption (for instance, objects would be padded to a multiple of 64 bytes on a SPARC) and could significantly degrade spatial locality and cache utilization.
Unfortunately, an allocator can actively induce false sharing even on objects that the program does not pass to different threads. Active false sharing is due to malloc satisfying memory requests by different threads from the same cache line. For instance, single-heap allocators can give many threads parts of the same cache line. The allocator may divide a cache line into 8-byte chunks. If multiple threads request 8-byte objects, the allocator may give each thread one 8-byte object in turn. This splitting of cache lines can lead to false sharing.
Allocators may also passively induce false sharing. Passive false sharing occurs when free allows a future malloc to produce false sharing. If a program introduces false sharing by spreading the pieces of a cache line across processors, the allocator may then passively induce false sharing after a free by letting each processor reuse pieces it freed, which can then lead to false sharing.
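To make the active case concrete, the following minimal sketch (ours, not the paper's) has two threads that each request a small object for purely private use; with a serial single-heap allocator, the two requests may be carved from adjacent chunks of a single cache line, so the threads' independent writes can ping-pong that line between processors. The object size and loop count are arbitrary.

    #include <cstdlib>
    #include <thread>

    int main() {
        auto hammer = [] {
            // An 8-byte request for this thread's private counter; a single-heap
            // allocator may satisfy both threads' requests from the same cache
            // line, even though the threads never share any data.
            long* counter = static_cast<long*>(std::malloc(sizeof(long)));
            *counter = 0;
            for (int i = 0; i < 50000000; ++i)
                *counter += 1;      // each write may invalidate the other
            std::free(counter);     // processor's copy of the shared line
        };
        std::thread t1(hammer), t2(hammer);
        t1.join();
        t2.join();
        return 0;
    }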
2.2 Blowup
Many previous allocators suffer from blowup. As we show in Section 3.1, Hoard keeps blowup to a constant factor, but many existing concurrent allocators either have unbounded blowup (the Cilk and STL allocators [6, 30]), in which memory consumption grows without bound while the memory required is fixed, or have memory consumption that can grow linearly with P, the number of processors (Ptmalloc and LKmalloc [9, 22]). It is important to note that these worst cases are not just theoretical. Threads in a producer-consumer relationship, a common programming idiom, may induce this blowup. To the best of our knowledge, papers in the literature do not address this problem. For example, consider a program in which a producer thread repeatedly allocates a block of memory and gives it to a consumer thread which frees it. If the memory freed by the consumer is unavailable to the producer, the program consumes more and more memory as it runs.
This unbounded memory consumption is plainly unacceptable, but a P-fold increase in memory consumption is also cause for concern. The scheduling of multithreaded programs can cause them to require much more memory when run on multiple processors than when run on one processor [6, 28]. Consider a program with P threads. Each thread calls x = malloc(s); free(x). If these threads are serialized, the total memory required is s. However, if they execute on P processors, each call to malloc may run in parallel, increasing the memory requirement to P · s. If the allocator multiplies this consumption by another factor of P, then memory consumption increases to P² · s.
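The sketch below (ours, not one of the paper's benchmarks) illustrates this producer-consumer idiom: one thread mallocs blocks and hands them off, and the other frees them. Under a pure private-heaps allocator, the freed memory accumulates on the consumer's heap and is never reused by the producer. The queue, block size, and iteration count are arbitrary choices.

    #include <cstdlib>
    #include <mutex>
    #include <queue>
    #include <thread>

    int main() {
        std::queue<void*> handoff;
        std::mutex lock;
        const int kIterations = 1000000;
        const std::size_t kBlockSize = 64;

        std::thread producer([&] {
            for (int i = 0; i < kIterations; ++i) {
                void* block = std::malloc(kBlockSize);   // always allocated by the producer
                std::lock_guard<std::mutex> guard(lock);
                handoff.push(block);
            }
        });

        std::thread consumer([&] {
            int freed = 0;
            while (freed < kIterations) {
                void* block = nullptr;
                {
                    std::lock_guard<std::mutex> guard(lock);
                    if (!handoff.empty()) { block = handoff.front(); handoff.pop(); }
                }
                if (block) { std::free(block); ++freed; }  // always freed by the consumer
            }
        });

        producer.join();
        consumer.join();
        return 0;
    }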
3. The Hoard Memory Allocator
This section describes Hoard in detail. Hoard can be viewed as an allocator that generally avoids false sharing and that trades increased (but bounded) memory consumption for reduced synchronization costs.
Hoard augments per-processor heaps with a global heap that every thread may access (similar to Vee and Hsu [37]). Each thread can access only its heap and the global heap. We designate heap 0 as the global heap and heaps 1 through P as the per-processor heaps. In the implementation we actually use 2P heaps (without altering our analytical results) in order to decrease the probability that concurrently-executing threads use the same heap; we use a simple hash function to map thread id's to per-processor heaps that can result in collisions. We need such a mapping function because in general there is not a one-to-one correspondence between threads and processors, and threads can be reassigned to other processors. On Solaris, however, we are able to avoid collisions of heap assignments to threads by hashing on the light-weight process (LWP) id. The number of LWP's is usually set to the number of processors [24, 33], so each heap is generally used by no more than one LWP.
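A minimal sketch of such a thread-to-heap mapping; the paper does not give Hoard's actual hash function or thread-id source, so hashing std::thread::id modulo 2P here is purely illustrative.

    #include <cstddef>
    #include <functional>
    #include <thread>

    // Map the calling thread to one of the 2P per-processor heaps
    // (heap 0 is reserved for the global heap). Collisions are possible.
    std::size_t heap_index(std::size_t P) {
        std::size_t h = std::hash<std::thread::id>{}(std::this_thread::get_id());
        return 1 + (h % (2 * P));
    }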
Hoard maintains usage statistics for each heap. These statistics are ui, the amount of memory in use ("live") in heap i, and ai, the amount of memory allocated by Hoard from the operating system held in heap i.
Hoard allocates memory from the system in chunks we call superblocks. Each superblock is an array of some number of blocks (objects) and contains a free list of its available blocks maintained in LIFO order to improve locality. All superblocks are the same size (S), a multiple of the system page size. Objects larger than half the size of a superblock are managed directly using the virtual memory system (i.e., they are allocated via mmap and freed using munmap). All of the blocks in a superblock are in the same size class. By using size classes that are a power of b apart (where b is greater than 1) and rounding the requested size up to the nearest size class, we bound worst-case internal fragmentation within a block to a factor of b. In order to reduce external fragmentation, we recycle completely empty superblocks for re-use by any size class. For clarity of exposition, we assume a single size class in the discussion below.
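As an illustration of this rounding (not Hoard's code), the sketch below walks up a geometric series of size classes with base b; b = 1.2 is the value reported in Section 5, and the 8-byte smallest class is an assumption made here.

    #include <cmath>
    #include <cstddef>

    // Round `requested` up to the nearest size class, where consecutive classes
    // differ by a factor of b. A block then wastes at most a factor of b of its
    // space, bounding worst-case internal fragmentation to b.
    std::size_t size_class(std::size_t requested, double b = 1.2,
                           std::size_t smallest = 8) {
        std::size_t sz = smallest;
        while (sz < requested)
            sz = static_cast<std::size_t>(std::ceil(sz * b));
        return sz;
    }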
3.1 Bounding Blowup
Each heap "owns" a number of superblocks. When there is no memory available in any superblock on a thread's heap, Hoard obtains a superblock from the global heap if one is available. If the global heap is also empty, Hoard creates a new superblock by requesting virtual memory from the operating system and adds it to the thread's heap. Hoard does not currently return empty superblocks to the operating system. It instead makes these superblocks available for reuse.
Hoard moves superblocks from a per-processor heap to the global heap when the per-processor heap crosses the emptiness threshold: more than f, the empty fraction, of its blocks are not in use (ui < (1 − f)·ai), and there are more than some number K of superblocks' worth of free memory on the heap (ui < ai − K·S). As long as a heap is not more than f empty, and has K or fewer superblocks' worth of free memory, Hoard will not move superblocks from a per-processor heap to the global heap. Whenever a per-processor heap does cross the emptiness threshold, Hoard transfers one of its superblocks that is at least f empty to the global heap. Always removing such a superblock whenever we cross the emptiness threshold maintains the following invariant on the per-processor heaps: (ui ≥ ai − K·S) ∨ (ui ≥ (1 − f)·ai). When we remove a superblock, we reduce ui by at most (1 − f)·S but reduce ai by S, thus restoring the invariant. Maintaining this invariant bounds blowup to a constant factor, as we show in Section 4.
Hoard finds f-empty superblocks in constant time by dividing superblocks into a number of bins that we call "fullness groups". Each bin contains a doubly-linked list of superblocks that are in a given fullness range (e.g., all superblocks that are between 3/4 and completely empty are in the same bin). Hoard moves superblocks from one group to another when appropriate, and always allocates from nearly-full superblocks. To improve locality, we order the superblocks within a fullness group using a move-to-front heuristic. Whenever we free a block in a superblock, we move the superblock to the front of its fullness group. If we then need to allocate a block, we will be likely to reuse a superblock that is already in memory; because we maintain the free blocks in LIFO order, we are also likely to reuse a block that is already in cache.
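A minimal sketch of the emptiness-threshold test implied by the two conditions above (not Hoard's actual implementation); u and a stand for the per-heap statistics ui and ai, and the guard against unsigned underflow is an implementation detail added here.

    #include <cstddef>

    // Returns true when heap i has crossed the emptiness threshold, i.e., when both
    // conditions hold: more than the fraction f of its memory is free (u < (1 - f) * a)
    // and it holds more than K superblocks' worth of free memory (u < a - K * S).
    bool crosses_emptiness_threshold(std::size_t u, std::size_t a,
                                     std::size_t S, double f, std::size_t K) {
        bool more_than_K_superblocks_free = (a > K * S) && (u < a - K * S);
        bool more_than_f_empty =
            static_cast<double>(u) < (1.0 - f) * static_cast<double>(a);
        return more_than_K_superblocks_free && more_than_f_empty;
    }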
[Figure 1: Allocation and freeing in Hoard. See Section 3.2 for details. The four diagrams (read top left, top right, bottom left, bottom right) show heap 1, heap 2, and the global heap after t1: x9 = malloc(s); t1: free(y4); t2: free(x2); and t2: free(x9).]

3.2 Example
Figure 1 illustrates, in simplified form, how Hoard manages superblocks. For simplicity, we assume there are two threads and heaps (thread i maps to heap i). In this example (which reads from top left to top right, then bottom left to bottom right), the empty fraction f is 1/4 and K is 0. Thread 1 executes code written on the left-hand side of each diagram (prefixed by "t1:") and thread 2 executes code on the right-hand side (prefixed by "t2:"). Initially, the global heap is empty, heap 1 has two superblocks (one partially full, one empty), and heap 2 has a completely-full superblock.
The top left diagram shows the heaps after thread 1 allocates x9 from heap 1. Hoard selects the fullest superblock in heap 1 for allocation. Next, in the top right diagram, thread 1 frees y4, which is in a superblock that heap 2 owns. Because heap 2 is still more than 1/4 full, Hoard does not remove a superblock from it. In the bottom left diagram, thread 2 frees x2, which is in a superblock owned by heap 1. This free does not cause heap 1 to cross the emptiness threshold, but the next free (of x9) does. Hoard then moves the completely-free superblock from heap 1 to the global heap.

3.3 Avoiding False Sharing
Hoard uses the combination of superblocks and multiple heaps described above to avoid most active and passive false sharing. Only one thread may allocate from a given superblock since a superblock is owned by exactly one heap at any time. When multiple threads make simultaneous requests for memory, the requests will always be satisfied from different superblocks, avoiding actively induced false sharing. When a program deallocates a block of memory, Hoard returns the block to its superblock. This coalescing prevents multiple threads from reusing pieces of cache lines that were passed to these threads by a user program, avoiding passively-induced false sharing.
While this strategy can greatly reduce allocator-induced false sharing, it does not completely avoid it. Because Hoard may move superblocks from one heap to another, it is possible for two heaps to share cache lines. Fortunately, superblock transfer is a relatively infrequent event (occurring only when a per-processor heap has dropped below the emptiness threshold). Further, we have observed that in practice, superblocks released to the global heap are often completely empty, eliminating the possibility of false sharing. Released superblocks are guaranteed to be at least f empty, so the opportunity for false sharing of lines in each superblock is reduced.
Figure 1 also shows how Hoard generally avoids false sharing. Notice that when thread 1 frees y4, Hoard returns this memory to y4's superblock and not to thread 1's heap. Since Hoard always uses heap i to satisfy memory allocation requests from thread i, only thread 2 can reuse that memory. Hoard thus avoids both active and passive false sharing in these superblocks.

3.4 Algorithms
In this section, we describe Hoard's memory allocation and deallocation algorithms in more detail. We present the pseudo-code for these algorithms in Figure 2. For clarity of exposition, we omit discussion of the management of fullness groups and superblock recycling.

malloc(sz)
 1. If sz > S/2, allocate the superblock from the OS and return it.
 2. i ← hash(the current thread).
 3. Lock heap i.
 4. Scan heap i's list of superblocks from most full to least
    (for the size class corresponding to sz).
 5. If there is no superblock with free space,
 6.   Check heap 0 (the global heap) for a superblock.
 7.   If there is none,
 8.     Allocate S bytes as superblock s and set the owner to heap i.
 9.   Else,
10.     Transfer the superblock s to heap i.
11.     u0 ← u0 − s.u
12.     ui ← ui + s.u
13.     a0 ← a0 − S
14.     ai ← ai + S
15. ui ← ui + sz.
16. s.u ← s.u + sz.
17. Unlock heap i.
18. Return a block from the superblock.

free(ptr)
 1. If the block is "large",
 2.   Free the superblock to the operating system and return.
 3. Find the superblock s this block comes from and lock it.
 4. Lock heap i, the superblock's owner.
 5. Deallocate the block from the superblock.
 6. ui ← ui − block size.
 7. s.u ← s.u − block size.
 8. If i = 0, unlock heap i and the superblock and return.
 9. If ui < ai − K·S and ui < (1 − f)·ai,
10.   Transfer a mostly-empty superblock s1 to heap 0 (the global heap).
11.   u0 ← u0 + s1.u, ui ← ui − s1.u
12.   a0 ← a0 + S, ai ← ai − S
13. Unlock heap i and the superblock.

Figure 2: Pseudo-code for Hoard's malloc and free.

Allocation
Hoard directly allocates "large" objects (size > S/2) via the virtual memory system. When a thread on processor i calls malloc for small objects, Hoard locks heap i and gets a block of a superblock with free space, if there is one on that heap (line 4). If there is not, Hoard checks the global heap (heap 0) for a superblock. If there is one, Hoard transfers it to heap i, adding the number of bytes in use in the superblock s.u to ui, and the total number of bytes in the superblock S to ai (lines 10–14). If there are no superblocks in either heap i or heap 0, Hoard allocates a new superblock and inserts it into heap i (line 8). Hoard then chooses a single block from a superblock with free space, marks it as allocated, and returns a pointer to that block.

Deallocation
Each superblock has an "owner" (the processor whose heap it's in). When a processor frees a block, Hoard finds its superblock (through a pointer in the block's header). (If this block is "large", Hoard immediately frees the superblock to the operating system.) It first locks the superblock and then locks the owner's heap. Hoard then returns the block to the superblock and decrements ui. If the heap is too empty (ui < ai − K·S and ui < (1 − f)·ai), Hoard transfers a superblock that is at least f empty to the global heap (lines 10–12). Finally, Hoard unlocks heap i and the superblock.
4. Analytical Results
In this section, we sketch the proofs of bounds on blowup and synchronization. We first define some useful notation. We number the heaps from 0 to P: 0 is the global heap, and 1 through P are the per-processor heaps. We adopt the following convention: capital letters denote maxima and lower-case letters denote current values. Let A(t) and U(t) denote the maximum amount of memory allocated and in use by the program ("live memory") after memory operation t. Let a(t) and u(t) denote the current amount of memory allocated and in use by the program after memory operation t. We add a subscript for a particular heap (e.g., ui(t)) and add a caret (e.g., â(t)) to denote the sum for all heaps except the global heap.

4.1 Bounds on Blowup
We now formally define the blowup for an allocator as its worst-case memory consumption divided by the ideal worst-case memory consumption for a serial memory allocator (a constant factor times its maximum memory required [29]):

DEFINITION 1. blowup = O(A(t)/U(t)).

We first prove the following theorem that bounds Hoard's worst-case memory consumption: A(t) = O(U(t) + P). We can show that the maximum amount of memory in the global and the per-processor heaps (A(t)) is the same as the maximum allocated into the per-processor heaps (Â(t)). We make use of this lemma, whose proof is straightforward but somewhat lengthy (the proof may be found in our technical report [4]).

LEMMA 1. A(t) = Â(t).

Intuitively, this lemma holds because these quantities are maxima; any memory in the global heap was originally allocated into a per-processor heap. Now we prove the bounded memory consumption theorem:

THEOREM 1. A(t) = O(U(t) + P).

PROOF. We restate the invariant from Section 3.1 that we maintain over all the per-processor heaps: (ai(t) − K·S ≤ ui(t)) ∨ ((1 − f)·ai(t) ≤ ui(t)).
The first inequality is sufficient to prove the theorem. Summing over all P per-processor heaps gives us

    Â(t) ≤ Σ_{i=1..P} ui(t) + P·K·S    [def. of Â(t)]
         ≤ Û(t) + P·K·S                [def. of Û(t)]
         ≤ U(t) + P·K·S.               [Û(t) ≤ U(t)]

Since by the above lemma A(t) = Â(t), we have A(t) = O(U(t) + P).

Because the number of size classes is constant, this theorem holds over all size classes. By the definition of blowup above, and assuming that P << U(t), Hoard's blowup is O((U(t) + P)/U(t)) = O(1). This result shows that Hoard's worst-case memory consumption is at worst a constant factor overhead that does not grow with the amount of memory required by the program.
Our discipline for using the empty fraction (f) enables this proof, so it is clearly a key parameter for Hoard. For reasons we describe and validate with experimental results in Section 5.5, Hoard's performance is robust with respect to the choice of f.
4.2 Bounds on Synchronization
In this section, we analyze Hoard's worst-case synchronization costs and discuss its expected synchronization costs. Synchronization costs come in two flavors: contention for a per-processor heap and acquisition of the global heap lock. We argue that the first form of contention is not a scalability concern, and that the second form is rare. Further, for common program behavior, the synchronization costs are low over most of the program's lifetime.

Per-processor Heap Contention
While the worst-case contention for Hoard arises when one thread allocates memory from the heap and a number of other threads free it (thus all contending for the same heap lock), this case is not particularly interesting. If an application allocates memory in such a manner and the amount of work between allocations is so low that heap contention is an issue, then the application itself is fundamentally unscalable. Even if heap access were to be completely independent, the application itself could only achieve a two-fold speedup, no matter how many processors are available.
Since we are concerned with providing a scalable allocator for scalable applications, we can bound Hoard's worst case for such applications, which occurs when pairs of threads exhibit the producer-consumer behavior described above. Each malloc and each free will be serialized. Modulo context-switch costs, this pattern results in at most a two-fold slowdown. This slowdown is not desirable, but it is scalable as it does not grow with the number of processors (as it does for allocators with one heap protected by a single lock).
It is difficult to establish an expected case for per-processor heap contention. Since most multithreaded applications use dynamically-allocated memory for the exclusive use of the allocating thread and only a small fraction of allocated memory is freed by another thread [22], we expect per-processor heap contention to be quite low.

Global Heap Contention
Global heap contention arises when superblocks are first created, when superblocks are transferred to and from the global heap, and when blocks are freed from superblocks held by the global heap. We simply count the number of times the global heap's lock is acquired by each thread, which is an upper bound on contention. We analyze two cases: a growing phase and a shrinking phase. We show that worst-case synchronization for the growing phases is inversely proportional to the superblock size and the empty fraction, while the worst case for the shrinking phase is expensive only for a pathological case that is unlikely to occur in practice. Empirical evidence from Section 5 suggests that for most programs, Hoard will incur low synchronization costs for most of the program's execution.
Two key parameters control the worst-case global heap contention while a per-processor heap is growing: f, the empty fraction, and S, the size of a superblock. When a per-processor heap is growing, a thread can acquire the global heap lock at most k/(f · S/s) times for k memory operations, where f is the empty fraction, S is the superblock size, and s is the object size. Whenever the per-processor heap is empty, the thread will lock the global heap and obtain a superblock with at least f · S/s free blocks. If the thread then calls malloc k times, it will exhaust its heap and acquire the global heap lock at most k/(f · S/s) times.
When a per-processor heap is shrinking, a thread will first acquire the global heap lock when the release threshold is crossed. The release threshold could then be crossed on every single call to free if every superblock is exactly f empty. Completely freeing each superblock in turn will cause the superblock to first be released to the global heap, and every subsequent free to a block in that superblock will therefore acquire the global heap lock. Luckily, this pathological case is highly unlikely to occur since it requires an improbable sequence of operations: the program must systematically free (1 − f) of each superblock and then free every block in a superblock one at a time.
For the common case, Hoard will incur very low contention costs for any memory operation. This situation holds when the amount of live memory remains within the empty fraction of the maximum amount of memory allocated (and when all frees are local). Johnstone and Stefanović show in their empirical studies of allocation behavior that, for nearly every program they analyzed, the memory in use tends to vary within a range that is within a fraction of total memory currently in use, and this amount often grows steadily [18, 32]. Thus, in the steady-state case Hoard incurs no contention, and in gradual growth Hoard incurs low contention.
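As an illustrative instance of the growing-phase bound (not a figure from the paper), take the parameter values reported in Section 5, f = 1/4 and S = 8K, and assume 8-byte objects (the threadtest object size):

    f · S / s = (1/4) · 8192 / 8 = 256 free blocks per acquired superblock,

so a thread performing k consecutive mallocs acquires the global heap lock at most k/256 times, i.e., for at most about 0.4% of its allocations.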
5. Experimental Results
In this section, we describe our experimental results. We performed experiments on uniprocessors and multiprocessors to demonstrate Hoard's speed, scalability, false sharing avoidance, and low fragmentation. We also show that these results are robust with respect to the choice of the empty fraction. The platform used is a dedicated 14-processor Sun Enterprise 5000 with 2GB of RAM and 400MHz UltraSparcs with 4 MB of level 2 cache, running Solaris 7. Except for the Barnes-Hut benchmark, all programs (including the allocators) were compiled using the GNU C++ compiler at the highest possible optimization level (-O6). We used GNU C++ instead of the vendor compiler (Sun Workshop compiler version 5.0) because we encountered errors when we used high optimization levels. In the experiments cited below, the size of a superblock S is 8K, the empty fraction f is 1/4, the number of superblocks K that must be free for superblocks to be released is 4, and the base of the exponential for size classes b is 1.2 (bounding internal fragmentation to 1.2).
We compare Hoard (version 2.0.2) to the following single and multiple-heap memory allocators: Solaris, the default allocator provided with Solaris 7; Ptmalloc [9], the Linux allocator included in the GNU C library that extends a traditional allocator to use multiple heaps; and MTmalloc, a multiple-heap allocator included with Solaris 7 for use with multithreaded parallel applications. (Section 6 includes extensive discussion of Ptmalloc, MTmalloc, and other concurrent allocators.) The latter two are the only publicly-available concurrent allocators of which we are aware for the Solaris platform (for example, LKmalloc is Microsoft proprietary). We use the Solaris allocator as the baseline for calculating speedups.
We use the single-threaded applications from Wilson and Johnstone, and Grunwald and Zorn [12, 19]: espresso, an optimizer for programmable logic arrays; Ghostscript, a PostScript interpreter; LRUsim, a locality analyzer; and p2c, a Pascal-to-C translator. We chose these programs because they are allocation-intensive and have widely varying memory usage patterns. We used the same inputs for these programs as Wilson and Johnstone [19].
There is as yet no standard suite of benchmarks for evaluating multithreaded allocators. We know of no benchmarks that specifically stress multithreaded performance of server applications like web servers¹ and database managers. We chose benchmarks described in other papers and otherwise published (the Larson benchmark from Larson and Krishnan [22] and the shbench benchmark from MicroQuill, Inc. [26]), two multithreaded applications which include benchmarks (BEMengine [7] and Barnes-Hut [1, 2]), and wrote some microbenchmarks of our own to stress different aspects of memory allocation performance (threadtest, active-false, passive-false). Table 1 describes all of the benchmarks used in this paper. Table 4 includes their allocation behavior: fragmentation, maximum memory in use (U) and allocated (A), total memory requested, number of objects requested, and average object size.

¹ Memory allocation becomes a bottleneck when most pages served are dynamically generated (Jim Davidson, personal communication). Unfortunately, the SPECweb99 benchmark [31] performs very few requests for completely dynamically-generated pages (0.5%), and most web servers exercise dynamic memory allocation only when generating dynamic content.

  single-threaded benchmarks [12, 19]
    espresso           optimizer for programmable logic arrays
    Ghostscript        PostScript interpreter
    LRUsim             locality analyzer
    p2c                Pascal-to-C translator
  multithreaded benchmarks
    threadtest         each thread repeatedly allocates and then deallocates 100,000/P objects
    shbench [26]       each thread allocates and randomly frees random-sized objects
    Larson [22]        simulates a server: each thread allocates and deallocates objects,
                       and then transfers some objects to other threads to be freed
    active-false       tests active false sharing avoidance
    passive-false      tests passive false sharing avoidance
    BEMengine [7]      object-oriented PDE solver
    Barnes-Hut [1, 2]  n-body particle solver

Table 1: Single- and multithreaded benchmarks used in this paper.

5.1 Speed
Table 2 lists the uniprocessor runtimes for our applications when linked with Hoard and the Solaris allocator (each is the average of three runs; the variation between runs was negligible). On average, Hoard causes a slight increase in the runtime of these applications (6.2%), but this loss is primarily due to its performance on shbench. Hoard performs poorly on shbench because shbench uses a wide range of size classes but allocates very little memory (see Section 5.4 for more details). The longest-running application, LRUsim, runs almost 3% faster with Hoard. Hoard also performs well on BEMengine (10.3% faster than with the Solaris allocator), which allocates more memory than any of our other benchmarks (nearly 600MB).

  program          Solaris (sec)  Hoard (sec)  change
  single-threaded benchmarks
  espresso         6.806          7.887        +15.9%
  Ghostscript      3.610          3.993        +10.6%
  LRUsim           1615.413       1570.488     -2.9%
  p2c              1.504          1.586        +5.5%
  multithreaded benchmarks
  threadtest       16.549         15.599       -6.1%
  shbench          12.730         18.995       +49.2%
  active-false     18.844         18.959       +0.6%
  passive-false    18.898         18.955       +0.3%
  BEMengine        678.30         614.94       -10.3%
  Barnes-Hut       192.51         190.66       -1.0%
  average                                      +6.2%

Table 2: Uniprocessor runtimes for single- and multithreaded benchmarks.

5.2 Scalability
In this section, we present our experiments to measure scalability. We measure speedup with respect to the Solaris allocator. These applications vigorously exercise the allocators, as revealed by the large difference between the maximum in use and the total memory requested (see Table 4).
Figure 3 shows that Hoard matches or outperforms all of the allocators we tested. The Solaris allocator performs poorly overall because serial single-heap allocators do not scale. MTmalloc often suffers from a centralized bottleneck. Ptmalloc scales well only when memory operations are fairly infrequent (the Barnes-Hut benchmark in Figure 3(d)); otherwise, its scaling peaks at around 6 processors. We now discuss each benchmark in turn.
In threadtest, t threads do nothing but repeatedly allocate and deallocate 100,000/t 8-byte objects (the threads do not synchronize or share objects). As seen in Figure 3(a), Hoard exhibits linear speedup, while the Solaris and MTmalloc allocators exhibit severe slowdown. For 14 processors, the Hoard version runs 278% faster than the Ptmalloc version. Unlike Ptmalloc, which uses a linked list of heaps, Hoard does not suffer from a scalability bottleneck caused by a centralized data structure.
The shbench benchmark is available on MicroQuill's website and is shipped with the SmartHeap SMP product [26]. This benchmark is essentially a "stress test" rather than a realistic simulation of application behavior. Each thread repeatedly allocates and frees a number of randomly-sized blocks in random order, for a total of 50 million allocated blocks. The graphs in Figure 3(b) show that Hoard scales quite well, approaching linear speedup as the number of threads increases. The slope of the speedup line is less than ideal because the large number of different size classes hurts Hoard's raw performance. For 14 processors, the Hoard version runs 85% faster than the next best allocator (Ptmalloc). Memory usage in shbench remains within the empty fraction during the entire run, so Hoard incurs very low synchronization costs, while Ptmalloc again runs into its scalability bottleneck.
The intent of the Larson benchmark, due to Larson and Krishnan [22], is to simulate a workload for a server. A number of threads are repeatedly spawned to allocate and free 10,000 blocks ranging from 10 to 100 bytes in a random order. Further, a number of blocks are left to be freed by a subsequent thread. Larson and Krishnan observe this behavior (which they call "bleeding") in actual server applications, and their benchmark simulates this effect. The benchmark runs for 30 seconds and then reports the number of memory operations per second. Figure 3(c) shows that Hoard scales linearly, attaining nearly ideal speedup. For 14 processors, the Hoard version runs 18 times faster than the next best allocator, the Ptmalloc version.
[Figure 3: Speedup graphs. Each graph plots speedup versus the number of processors (1–14) for Hoard, Ptmalloc, MTmalloc, and the Solaris allocator. (a) The threadtest benchmark. (b) The SmartHeap benchmark (shbench). (c) Speedup using the Larson benchmark. (d) Barnes-Hut speedup. (e) BEMengine speedup; linking with MTmalloc caused an exception to be raised. (f) BEMengine speedup for the system solver only.]
After an initial start-up phase, Larson remains within its empty fraction for most of the rest of its run (dropping below only a few times over a 30-second run and over 27 million mallocs), so Hoard incurs very low synchronization costs. Despite the fact that Larson transfers many objects from one thread to another, Hoard performs quite well. All of the other allocators fail to scale at all, running slower on 14 processors than on one processor.
Barnes-Hut is a hierarchical n-body particle solver included with the Hood user-level multiprocessor threads library [1, 2], run on 32,768 particles for 20 rounds. This application performs a small amount of dynamic memory allocation during the tree-building phase. With 14 processors, all of the multiple-heap allocators provide a 10% performance improvement, increasing the speedup of the application from less than 10 to just above 12 (see Figure 3(d)). Hoard performs only slightly better than Ptmalloc in this case because this program does not exercise the allocator much. Hoard's performance is probably somewhat better simply because Barnes-Hut never drops below its empty fraction during its execution.
The BEMengine benchmark uses the solver engine from Coyote Systems' BEMSolver [7], a 2D/3D field solver that can solve electrostatic, magnetostatic, and thermal systems. We report speedup for the three mostly-parallel parts of this code (equation registration, preconditioner creation, and the solver). Figure 3(e) shows that Hoard provides a significant runtime advantage over Ptmalloc and the Solaris allocator (MTmalloc caused the application to raise a fatal exception). During the first two phases of the program, the program's memory usage dropped below the empty fraction only 25 times over 50 seconds, leading to low synchronization overhead. This application causes Ptmalloc to exhibit pathological behavior that we do not understand, although we suspect that it derives from false sharing. During the execution of the solver phase of the computation, as seen in Figure 3(f), contention in the allocator is not an issue, and both Hoard and the Solaris allocator perform equally well.

5.3 False sharing avoidance
The active-false benchmark tests whether an allocator avoids actively inducing false sharing. Each thread allocates one small object, writes on it a number of times, and then frees it. The rate of memory allocation is low compared to the amount of work done, so this benchmark only tests contention caused by the cache coherence mechanism (cache "ping-ponging") and not allocator contention. While Hoard scales linearly, showing that it avoids actively inducing false sharing, both Ptmalloc and MTmalloc only scale up to about 4 processors because they actively induce some false sharing. The Solaris allocator does not scale at all because it actively induces false sharing for nearly every cache line.
The passive-false benchmark tests whether an allocator avoids both passive and active false sharing by allocating a number of small objects and giving one to each thread, which immediately frees the object. The benchmark then continues in the same way as the active-false benchmark. If the allocator does not coalesce the pieces of the cache line initially distributed to the various threads, it passively induces false sharing. Figure 4(b) shows that Hoard scales nearly linearly; the gradual slowdown after 12 processors is due to program-induced bus traffic. Neither Ptmalloc nor MTmalloc avoids false sharing here, but the cause could be either active or passive false sharing.
In Table 3, we present measurements for our multithreaded benchmarks of the number of objects that could have been responsible for allocator-induced false sharing (i.e., those objects already in a superblock acquired from the global heap). In every case, when the per-processor heap acquired superblocks from the global heap, the superblocks were empty. These results demonstrate that Hoard successfully avoids allocator-induced false sharing.

  program       falsely-shared objects
  threadtest    0
  shbench       0
  Larson        0
  BEMengine     0
  Barnes-Hut    0

Table 3: Possible falsely-shared objects on 14 processors.

5.4 Fragmentation
We showed in Section 3.1 that Hoard has bounded blowup. In this section, we measure Hoard's average-case fragmentation. We use a number of single- and multithreaded applications to evaluate Hoard's average-case fragmentation.
Collecting fragmentation information for multithreaded applications is problematic because fragmentation is a global property. Updating the maximum memory in use and the maximum memory allocated would serialize all memory operations and thus seriously perturb allocation behavior. We cannot simply use the maximum memory in use for a serial execution because a parallel execution of a program may lead it to require much more memory than a serial execution.
We solve this problem by collecting traces of memory operations and processing these traces off-line. We modified Hoard so that (when collecting traces) each per-processor heap records every memory operation along with a timestamp (using the SPARC high-resolution timers via gethrtime()) into a memory-mapped buffer and writes this trace to disk upon program termination. We then merge the traces in timestamp order to build a complete trace of memory operations and process the resulting trace to compute maximum memory allocated and required. Collecting these traces results in nearly a threefold slowdown in memory operations but does not excessively disturb their parallelism, so we believe that these traces are a faithful representation of the fragmentation induced by Hoard.
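A minimal off-line sketch of this post-processing step, assuming a simple per-event record of a timestamp and byte deltas; the paper does not specify the trace format, and this is not the authors' tool.

    #include <algorithm>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    struct Event {
        std::uint64_t timestamp;        // e.g., from gethrtime()
        std::int64_t  delta_used;       // +size on malloc, -size on free
        std::int64_t  delta_allocated;  // bytes obtained from or returned to the OS, if any
    };

    int main() {
        // In practice each per-processor heap would contribute one trace file read
        // from disk; here we simply merge already-loaded vectors.
        std::vector<std::vector<Event>> per_heap_traces = {};

        std::vector<Event> merged;
        for (const auto& trace : per_heap_traces)
            merged.insert(merged.end(), trace.begin(), trace.end());
        std::sort(merged.begin(), merged.end(),
                  [](const Event& a, const Event& b) { return a.timestamp < b.timestamp; });

        std::int64_t used = 0, allocated = 0, max_used = 0, max_allocated = 0;
        for (const Event& e : merged) {
            used += e.delta_used;
            allocated += e.delta_allocated;
            max_used = std::max(max_used, used);
            max_allocated = std::max(max_allocated, allocated);
        }
        std::printf("max in use (U) = %lld, max allocated (A) = %lld\n",
                    static_cast<long long>(max_used),
                    static_cast<long long>(max_allocated));
        return 0;
    }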
[Figure 4: Speedup graphs that exhibit the effect of allocator-induced false sharing. Each graph plots speedup versus the number of processors (1–14) for Hoard, Ptmalloc, MTmalloc, and the Solaris allocator. (a) Speedup for the active-false benchmark, which fails to scale with memory allocators that actively induce false sharing. (b) Speedup for the passive-false benchmark, which fails to scale with memory allocators that passively or actively induce false sharing.]

  Benchmark     fragmentation (A/U)  max in use (U)  max allocated (A)  total memory requested  # objects requested  average object size
  single-threaded benchmarks
  espresso      1.47                 284,520         417,032            110,143,200             1,675,493            65.7378
  Ghostscript   1.15                 1,171,408       1,342,240          52,194,664              566,542              92.1285
  LRUsim        1.05                 1,571,176       1,645,856          1,588,320               39,109               40.6126
  p2c           1.20                 441,432         531,912            5,483,168               199,361              27.5037
  multithreaded benchmarks
  threadtest    1.24                 1,068,864       1,324,848          80,391,016              9,998,831            8.04
  shbench       3.17                 556,112         1,761,200          1,650,564,600           12,503,613           132.00
  Larson        1.22                 8,162,600       9,928,760          1,618,188,592           27,881,924           58.04
  BEMengine     1.02                 599,145,176     613,935,296        4,146,087,144           18,366,795           225.74
  Barnes-Hut    1.18                 11,959,960      14,114,040         46,004,408              1,172,624            39.23

Table 4: Hoard fragmentation results and application memory statistics. We report fragmentation statistics for 14-processor runs of the multithreaded programs. All units are in bytes.

Single-threaded Applications
We measured fragmentation for the single-threaded benchmarks. We follow Wilson and Johnstone [19] and report memory allocated without counting overhead (like per-object headers) to focus on the allocation policy rather than the mechanism. Hoard's fragmentation for these applications is between 1.05 and 1.2, except for espresso, which consumes 46% more memory than it requires. Espresso is an unusual program since it uses a large number of different size classes for a small amount of memory required (less than 300K), and this behavior leads Hoard to waste space within each 8K superblock.

Multithreaded Applications
Table 4 shows that the fragmentation results for the multithreaded benchmarks are generally quite good, ranging from nearly no fragmentation (1.02) for BEMengine to 1.24 for threadtest. The anomaly is shbench. This benchmark uses a large range of object sizes, randomly chosen from 8 to 100, and many objects remain live for the duration of the program (470K of its maximum 550K objects remain in use at the end of the run cited here). These unfreed objects are randomly scattered across superblocks, making it impossible to recycle them for different size classes. This extremely random behavior is not likely to be representative of real programs [19], but it does show that Hoard's method of maintaining one size class per superblock can yield poor memory efficiency for certain behaviors, although Hoard still attains good scalable performance for this application (see Figure 3(b)).
shbench 1.45 1.50 1.44
BEMengine 86.85 87.49 88.03
5.5 Sensitivity Study
Barnes-Hut 16.52 16.13 16.41
We also examined the effect of changing the empty fraction on run- throughput (memory ops/sec)
time and fragmentation for the multithreaded benchmarks. Because Larson 4,407,654 4,416,303 4,352,163
superblocks are returned to the global heap (for reuse by other
threads) when the heap crosses the emptiness threshold, the empty Table 5: Runtime on 14 processors using Hoard with different
fraction affects both synchronization and fragmentation. We var- empty fractions.
ied the empty fraction from 1=8 to 1=2 and saw very little change
in runtime and fragmentation. We chose this range to exercise the
tension between increased (worst-case) fragmentation and synchro- an empty fraction as 1=8, as described in Section 4.2.
nization costs. The only benchmark which is substantially affected
by these changes in the empty fraction is the Larson benchmark, 6. Related Work
whose fragmentation increases from 1.22 to 1.61 for an empty frac- While dynamic storage allocation is one of the most studied topics
tion of 1=2. Table 5 presents the runtime for these programs on 14 in computer science, there has been relatively little work on con-
processors (we report the number of memory operations per second current memory allocators. In this section, we place past work into
for the Larson benchmark, which runs for 30 seconds), and Table 6 a taxonomy of memory allocator algorithms and compare each to
presents the fragmentation results. Hoard’s runtime is robust with Hoard. We address the blowup and allocator-induced false sharing
respect to changes in the empty fraction because programs tend to characteristics of each of these allocator algorithms and compare
reach a steady state in memory usage and stay within even as small them to Hoard.
program fragmentation or atomic update operations (e.g., compare-and-swap), which are
f = 1=8 f = 1 =4 f = 1= 2 quite expensive.
threadtest 1.22 1.24 1.22 State-of-the-art serial allocators are so well engineered that most
shbench 3.17 3.17 3.16 memory operations involve only a handful of instructions [23]. An
Larson 1.22 1.22 1.61 uncontended lock acquire and release accounts for about half of the
BEMengine 1.02 1.02 1.02 total runtime of these memory operations. In order to be competi-
Barnes-Hut 1.18 1.18 1.18 tive, a memory allocator can only acquire and release at most two
locks in the common case, or incur three atomic operations. Hoard
Table 6: Fragmentation on 14 processors using Hoard with dif- requires only one lock for each malloc and two for each free and
ferent empty fractions. each memory operation takes constant (amortized) time (see Sec-
tion 3.4).
6.1 Taxonomy of Memory Allocator Algorithms Multiple-Heap Allocation
Our taxonomy consists of the following five categories: We describe three categories of allocators which all use multiple-
heaps. The allocators assign threads to heaps either by assigning
Serial single heap. Only one processor may access the heap at a one heap to every thread (using thread-specific data) [30], by using
time (Solaris, Windows NT/2000 [21]). a currently unused heap from a collection of heaps [9], round-robin
heap assignment (as in MTmalloc, provided with Solaris 7 as a re-
Concurrent single heap. Many processors may simultaneously op- placement allocator for multithreaded applications), or by provid-
erate on one shared heap ([5, 16, 17, 13, 14]). ing a mapping function that maps threads onto a collection of heaps
(LKmalloc [22], Hoard). For simplicity of exposition, we assume
Pure private heaps. Each processor has its own heap (STL [30], that there is exactly one thread bound to each processor and one
Cilk [6]). heap for each of these threads.
STL’s (Standard Template Library) pthread alloc, Cilk 4.1, and
Private heaps with ownership. Each processor has its own heap, many ad hoc allocators use pure private heaps allocation [6, 30].
but memory is always returned to its “owner” processor (MT- Each processor has its own per-processor heap that it uses for every
malloc, Ptmalloc [9], LKmalloc [22]). memory operation (the allocator mallocs from its heap and frees to
its heap). Each per-processor heap is “purely private” because each
Private heaps with thresholds. Each processor has its own heap
processor never accesses any other heap for any memory operation.
which can hold a limited amount of free memory (DYNIX
After one thread allocates an object, a second thread can free it; in
kernel allocator [25], Vee and Hsu [37], Hoard).
pure private heaps allocators, this memory is placed in the second
thread’s heap. Since parts of the same cache line may be placed on
Below we discuss these single and multiple-heap algorithms, fo-
multiple heaps, pure private-heaps allocators passively induce false
cusing on the false sharing and blowup characteristics of each.
sharing. Worse, pure private-heaps allocators exhibit unbounded
Single Heap Allocation memory consumption given a producer-consumer allocation pat-
Serial single heap allocators often exhibit extremely low fragmen- tern, as described in Section 2.2. Hoard avoids this problem by
tation over a wide range of real programs [19] and are quite fast returning freed blocks to the heap that owns the superblocks they
[23]. Since they typically protect the heap with a single lock which belong to.
serializes memory operations and introduces contention, they are Private heaps with ownership returns free blocks to the heap that
inappropriate for use with most parallel multithreaded programs. allocated them. This algorithm, used by MTmalloc, Ptmalloc [9]
In multithreaded programs, contention for the lock prevents alloca- and LKmalloc [22], yields O (P ) blowup, whereas Hoard has O (1)
tor performance from scaling with the number of processors. Most blowup. Consider a round-robin style producer-consumer program:
modern operating systems provide such memory allocators in the each processor i allocates K blocks and processor (i + 1)modP
frees them. The program requires only K blocks but the alloca-

default library, including Solaris and IRIX. Windows NT/2000 uses
64-bit atomic operations on freelists rather than locks [21] which is tor will allocate P K blocks (K on all P heaps). Ptmalloc and
also unscalable because the head of each freelist is a central bottle- MTmalloc can actively induce false sharing (different threads may
neck2 . These allocators all actively induce false sharing. allocate from the same heap). LKmalloc’s permanent assignment
Concurrent single heap allocation implements the heap as a of large regions of memory to processors and its immediate return
concurrent data structure, such as a concurrent B-tree [10, 11, 13, of freed blocks to these regions, while leading to O (P ) blowup,
14, 16, 17] or a freelist with locks on each free block [5, 8, 34]. should have the advantage of eliminating allocator-induced false
This approach reduces to a serial single heap in the common case sharing, although the authors did not explicitly address this issue.
when most allocations are from a small number of object sizes. Hoard explicitly takes steps to reduce false sharing, although it can-
Johnstone and Wilson show that for every program they examined, not avoid it altogether, while maintaining O(1) blowup.
the vast majority of objects allocated are of only a few sizes [18]. Both Ptmalloc and MTmalloc also suffer from scalability bottle-
Each memory operation on these structures requires either time lin- necks. In Ptmalloc, each malloc chooses the first heap that is not
ear in the number of free blocks or O(log C ) time, where C is the currently in use (caching the resulting choice for the next attempt).
number of size classes of allocated objects. A size class is a range This heap selection strategy causes substantial bus traffic which
of object sizes that are grouped together (e.g., all objects between limits Ptmalloc’s scalability to about 6 processors, as we show in
32 and 36 bytes are treated as 36-byte objects). Like serial sin- Section 5. MTmalloc performs round-robin heap assignment by
Because pieces of the same cache line may end up on multiple heaps, pure private-heaps allocators passively induce false sharing. Worse, pure private-heaps allocators exhibit unbounded memory consumption given a producer-consumer allocation pattern, as described in Section 2.2. Hoard avoids this problem by returning freed blocks to the heap that owns the superblocks they belong to.

Private heaps with ownership returns free blocks to the heap that allocated them. This algorithm, used by MTmalloc, Ptmalloc [9], and LKmalloc [22], yields O(P) blowup, whereas Hoard has O(1) blowup. Consider a round-robin style producer-consumer program: each processor i allocates K blocks and processor (i + 1) mod P frees them. The program requires only K blocks, but the allocator will allocate P · K blocks (K on all P heaps). Ptmalloc and MTmalloc can actively induce false sharing (different threads may allocate from the same heap). LKmalloc's permanent assignment of large regions of memory to processors and its immediate return of freed blocks to these regions, while leading to O(P) blowup, should have the advantage of eliminating allocator-induced false sharing, although the authors did not explicitly address this issue. Hoard explicitly takes steps to reduce false sharing, although it cannot avoid it altogether, while maintaining O(1) blowup.
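The round-robin producer-consumer pattern can be written down directly. The sketch below is illustrative C only (P, K, ROUNDS, and the 64-byte block size are arbitrary, and the heap behavior is described in comments rather than implemented): the application never has more than K blocks live, yet under a private-heaps-with-ownership allocator each thread's freed blocks go back to the heap that allocated them, so after one trip around the ring every one of the P heaps holds K blocks and the allocator's footprint reaches P * K.

    #include <pthread.h>
    #include <stdlib.h>

    /* All parameters here are arbitrary choices for illustration. */
    #define P      4        /* threads, standing in for processors      */
    #define K      1024     /* blocks allocated on each turn            */
    #define ROUNDS 100      /* turns taken by each thread               */

    static void *blocks[K];                 /* the only K live blocks at any time */
    static int turn = 0;                    /* whose turn it is around the ring   */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

    static void *worker(void *arg)
    {
        long id = (long)arg;
        for (int r = 0; r < ROUNDS; r++) {
            pthread_mutex_lock(&lock);
            while (turn % P != id)                   /* wait for our turn           */
                pthread_cond_wait(&cond, &lock);
            for (int k = 0; k < K; k++) {
                free(blocks[k]);                     /* free the previous thread's  */
                                                     /* blocks; an ownership-based  */
                                                     /* allocator returns them to   */
                                                     /* that thread's heap          */
                blocks[k] = malloc(64);              /* allocate K fresh blocks     */
            }
            turn++;                                  /* pass the ring to thread id+1 */
            pthread_cond_broadcast(&cond);
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[P];
        for (long i = 0; i < P; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (long i = 0; i < P; i++)
            pthread_join(t[i], NULL);
        for (int k = 0; k < K; k++)
            free(blocks[k]);                         /* last thread's blocks */
        return 0;
    }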
Both Ptmalloc and MTmalloc also suffer from scalability bottlenecks. In Ptmalloc, each malloc chooses the first heap that is not currently in use (caching the resulting choice for the next attempt). This heap selection strategy causes substantial bus traffic which limits Ptmalloc's scalability to about 6 processors, as we show in Section 5. MTmalloc performs round-robin heap assignment by maintaining a "nextHeap" global variable that is updated by every call to malloc. This variable is a source of contention that makes MTmalloc unscalable and actively induces false sharing. Hoard has no centralized bottlenecks except for the global heap, which is not a frequent source of contention for reasons described in Section 4.2.
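A minimal sketch of such a round-robin scheme appears below (illustrative C only, not MTmalloc's source); the single shared counter is the centralized bottleneck: every allocation updates it, and threads on different processors get steered onto the same heaps, where their objects can end up on shared cache lines.

    #include <stdatomic.h>

    #define NHEAPS 16                                   /* assumed heap count    */

    typedef struct heap { int placeholder; } heap_t;    /* per-heap state elided */
    static heap_t heaps[NHEAPS];

    /* One shared counter implements round-robin assignment.  Every allocation
     * increments it, so the cache line holding nextHeap bounces between
     * processors; and because consecutive allocations from different threads
     * land on the same heaps, their objects can share cache lines. */
    static _Atomic unsigned nextHeap = 0;

    heap_t *pick_heap(void)
    {
        unsigned n = atomic_fetch_add_explicit(&nextHeap, 1u, memory_order_relaxed);
        return &heaps[n % NHEAPS];
    }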
Allocator algorithm             fast?   scalable?   avoids false sharing?   blowup
-----------------------------------------------------------------------------------
serial single heap              yes     no          no                      O(1)
concurrent single heap          no      maybe       no                      O(1)
pure private heaps              yes     yes         no                      unbounded
private heaps w/ownership
  Ptmalloc [9]                  yes     yes         no                      O(P)
  MTmalloc                      yes     no          no                      O(P)
  LKmalloc [22]                 yes     yes         yes                     O(P)
private heaps w/thresholds
  Vee and Hsu, DYNIX [25, 37]   yes     yes         no                      O(1)
Hoard                           yes     yes         yes                     O(1)

Table 7: A taxonomy of memory allocation algorithms discussed in this paper.
The DYNIX kernel memory allocator by McKenney and Slingwine [25] and the single object-size allocator by Vee and Hsu [37] employ a private heaps with thresholds algorithm. These allocators are efficient and scalable because they move large blocks of memory between a hierarchy of per-processor heaps and heaps shared by multiple processors. When a per-processor heap has more than a certain amount of free memory (the threshold), some portion of the free memory is moved to a shared heap. This strategy also bounds blowup to a constant factor, since no heap may hold more than some fixed amount of free memory. The mechanisms that control this motion and the units of memory moved by the DYNIX and Vee and Hsu allocators differ significantly from those used by Hoard. Unlike Hoard, both of these allocators passively induce false sharing by making it very easy for pieces of the same cache line to be recycled. As long as the amount of free memory does not exceed the threshold, pieces of the same cache line spread across processors will be repeatedly reused to satisfy memory requests. Also, these allocators are forced to synchronize every time the threshold amount of memory is allocated or freed, while Hoard can avoid synchronization altogether while the emptiness of per-processor heaps is within the empty fraction. On the other hand, these allocators do avoid the two-fold slowdown that can occur in the worst case described for Hoard in Section 4.2.
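To make the threshold mechanism concrete, here is a small illustrative C sketch (not the DYNIX or Vee and Hsu code; the block layout, the threshold value, and moving the whole local list rather than some portion of it are all simplifying assumptions). Frees below the threshold are purely local, so freed pieces of a cache line keep being recycled on whichever heap holds them, while crossing the threshold forces synchronization on the shared heap's lock.

    #include <pthread.h>
    #include <stddef.h>

    #define THRESHOLD (64 * 1024)    /* assumed per-heap free-memory limit, in bytes */

    typedef struct block { struct block *next; size_t size; } block_t;

    typedef struct {
        block_t *free_list;          /* local free blocks: no locking needed     */
        size_t   free_bytes;         /* free memory currently held by this heap  */
    } local_heap_t;

    typedef struct {
        block_t *free_list;          /* blocks available to every processor      */
        pthread_mutex_t lock;
    } shared_heap_t;

    static shared_heap_t shared = { NULL, PTHREAD_MUTEX_INITIALIZER };

    /* Free a block to this processor's heap.  Below the threshold the operation
     * stays local (the passive false sharing described above); crossing the
     * threshold is the point at which these allocators must synchronize. */
    void local_free(local_heap_t *h, block_t *b)
    {
        b->next = h->free_list;
        h->free_list = b;
        h->free_bytes += b->size;

        if (h->free_bytes > THRESHOLD) {
            block_t *tail = h->free_list;
            while (tail->next != NULL)           /* find the end of the local list */
                tail = tail->next;

            pthread_mutex_lock(&shared.lock);    /* the forced synchronization     */
            tail->next = shared.free_list;
            shared.free_list = h->free_list;
            pthread_mutex_unlock(&shared.lock);

            h->free_list  = NULL;
            h->free_bytes = 0;
        }
    }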
Table 7 presents a summary of the above allocator algorithms, along with their speed, scalability, false sharing, and blowup characteristics. As can be seen from the table, the algorithms closest to Hoard are Vee and Hsu, DYNIX, and LKmalloc. The first two fail to avoid passively-induced false sharing and are forced to synchronize with a global heap after each threshold amount of memory is consumed or freed, while Hoard avoids false sharing and is not required to synchronize until the emptiness threshold is crossed or when a heap does not have sufficient memory. LKmalloc has similar synchronization behavior to Hoard and avoids allocator-induced false sharing, but has O(P) blowup.
7. Future Work

Although the hashing method that we use has so far proven to be an effective mechanism for assigning threads to heaps, we plan to develop an efficient method that can adapt to the situation when two concurrently-executing threads map to the same heap.
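As background for this item, hash-based thread-to-heap assignment can be as simple as the following sketch (illustrative only; the mixing step and the heap count are assumptions, not Hoard's source code). The situation described above is a collision: two running threads whose identifiers hash to the same index share one heap and contend for it.

    /* Assumed constants and hash; this is not Hoard's source code. */
    #define NUM_HEAPS 16                /* e.g., twice the number of processors */

    /* Map a thread (identified here by an integer id supplied by the caller)
     * to one of the heaps.  Two concurrently running threads whose ids hash
     * to the same index end up contending for a single heap, which is the
     * case an adaptive assignment scheme would need to detect and repair. */
    unsigned heap_index(unsigned long thread_id)
    {
        thread_id ^= thread_id >> 16;   /* cheap mixing of the identifier bits */
        return (unsigned)(thread_id % NUM_HEAPS);
    }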
While we believe that Hoard improves program locality in various ways, we have yet to quantitatively measure this effect. We plan to use both cache- and page-level measurement tools to evaluate and improve Hoard's effect on program-level locality.
We are also looking at ways to remove the one size class per superblock restriction. This restriction is responsible for increased fragmentation and a decline in performance for programs which allocate objects from a wide range of size classes, like espresso and shbench.

Finally, we are investigating ways of improving the performance of Hoard on cc/NUMA architectures. Because the unit of cache coherence on these architectures is an entire page, Hoard's mechanism of coalescing to page-sized superblocks appears to be very important for scalability. Our preliminary results on an SGI Origin 2000 show that Hoard scales to a substantially larger number of processors, and we plan to report these results in the future.

8. Conclusion

In this paper, we have introduced the Hoard memory allocator. Hoard improves on previous memory allocators by simultaneously providing four features that are important for scalable application performance: speed, scalability, false sharing avoidance, and low fragmentation. Hoard's novel organization of per-processor and global heaps, along with its discipline for moving superblocks across heaps, enables Hoard to achieve these features and is the key contribution of this work. Our analysis shows that Hoard has provably bounded blowup and low expected case synchronization. Our experimental results on eleven programs demonstrate that in practice Hoard has low fragmentation, avoids false sharing, and scales very well. In addition, we show that Hoard's performance and fragmentation are robust with respect to its primary parameter, the empty fraction. Since scalable application performance clearly requires scalable architecture and runtime system support, Hoard thus takes a key step in this direction.

9. Acknowledgements

Many thanks to Brendon Cahoon, Rich Cardone, Scott Kaplan, Greg Plaxton, Yannis Smaragdakis, and Phoebe Weidmann for valuable discussions during the course of this work and input during the writing of this paper. Thanks also to Martin Bächtold, Trey Boudreau, Robert Fleischman, John Hickin, Paul Larson, Kevin Mills, and Ganesan Rajagopal for their contributions to helping to improve and port Hoard, and to Ben Zorn and the anonymous reviewers for helping to improve this paper.

Hoard is publicly available at http://www.hoard.org for a variety of platforms, including Solaris, IRIX, AIX, Linux, and Windows NT/2000.
10. References

[1] U. Acar, E. Berger, R. Blumofe, and D. Papadopoulos. Hood: A threads library for multiprogrammed multiprocessors. http://www.cs.utexas.edu/users/hood, Sept. 1999.
[2] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446–449, 1986.
[3] bCandid.com, Inc. http://www.bcandid.com.
[4] E. D. Berger and R. D. Blumofe. Hoard: A fast, scalable, and memory-efficient allocator for shared-memory multiprocessors. Technical Report UTCS-TR99-22, The University of Texas at Austin, 1999.
[5] B. Bigler, S. Allan, and R. Oldehoeft. Parallel dynamic storage allocation. International Conference on Parallel Processing, pages 272–275, 1985.
[6] R. D. Blumofe and C. E. Leiserson. Scheduling multithreaded computations by work stealing. In Proceedings of the 35th Annual Symposium on Foundations of Computer Science (FOCS), pages 356–368, Santa Fe, New Mexico, Nov. 1994.
[7] Coyote Systems, Inc. http://www.coyotesystems.com.
[8] C. S. Ellis and T. J. Olson. Algorithms for parallel memory allocation. International Journal of Parallel Programming, 17(4):303–345, 1988.
[9] W. Gloger. Dynamic memory allocator implementations in Linux system libraries. http://www.dent.med.uni-muenchen.de/~wmglo/malloc-slides.html.
[10] A. Gottlieb and J. Wilson. Using the buddy system for concurrent memory allocation. Technical Report System Software Note 6, Courant Institute, 1981.
[11] A. Gottlieb and J. Wilson. Parallelizing the usual buddy algorithm. Technical Report System Software Note 37, Courant Institute, 1982.
[12] D. Grunwald, B. Zorn, and R. Henderson. Improving the cache locality of memory allocation. In R. Cartwright, editor, Proceedings of the Conference on Programming Language Design and Implementation, pages 177–186, New York, NY, USA, June 1993. ACM Press.
[13] A. K. Iyengar. Dynamic Storage Allocation on a Multiprocessor. PhD thesis, MIT, 1992. MIT Laboratory for Computer Science Technical Report MIT/LCS/TR–560.
[14] A. K. Iyengar. Parallel dynamic storage allocation algorithms. In Fifth IEEE Symposium on Parallel and Distributed Processing. IEEE Press, 1993.
[15] T. Jeremiassen and S. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. In ACM Symposium on Principles and Practice of Parallel Programming, pages 179–188, July 1995.
[16] T. Johnson. A concurrent fast-fits memory manager. Technical Report TR91-009, University of Florida, Department of CIS, 1991.
[17] T. Johnson and T. Davis. Space efficient parallel buddy memory management. Technical Report TR92-008, University of Florida, Department of CIS, 1992.
[18] M. S. Johnstone. Non-Compacting Memory Allocation and Real-Time Garbage Collection. PhD thesis, University of Texas at Austin, Dec. 1997.
[19] M. S. Johnstone and P. R. Wilson. The memory fragmentation problem: Solved? In ISMM, Vancouver, B.C., Canada, 1998.
[20] K. Kennedy and K. S. McKinley. Optimizing for parallelism and data locality. In Proceedings of the Sixth International Conference on Supercomputing, pages 323–334, Distributed Computing, July 1992.
[21] M. R. Krishnan. Heap: Pleasures and pains. Microsoft Developer Newsletter, Feb. 1999.
[22] P. Larson and M. Krishnan. Memory allocation for long-running server applications. In ISMM, Vancouver, B.C., Canada, 1998.
[23] D. Lea. A memory allocator. http://g.oswego.edu/dl/html/malloc.html.
[24] B. Lewis. comp.programming.threads FAQ. http://www.lambdacs.com/newsgroup/FAQ.html.
[25] P. E. McKenney and J. Slingwine. Efficient kernel memory allocation on shared-memory multiprocessor. In USENIX Association, editor, Proceedings of the Winter 1993 USENIX Conference: January 25–29, 1993, San Diego, California, USA, pages 295–305, Berkeley, CA, USA, Winter 1993. USENIX.
[26] MicroQuill, Inc. http://www.microquill.com.
[27] MySQL, Inc. The mysql database manager. http://www.mysql.org.
[28] G. J. Narlikar and G. E. Blelloch. Space-efficient scheduling of nested parallelism. ACM Transactions on Programming Languages and Systems, 21(1):138–173, January 1999.
[29] J. M. Robson. Worst case fragmentation of first fit and best fit storage allocation strategies. ACM Computer Journal, 20(3):242–244, Aug. 1977.
[30] SGI. The standard template library for c++: Allocators. http://www.sgi.com/Technology/STL/Allocators.html.
[31] Standard Performance Evaluation Corporation. SPECweb99. http://www.spec.org/osg/web99/.
[32] D. Stefanović. Properties of Age-Based Automatic Memory Reclamation Algorithms. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, Massachusetts, Dec. 1998.
[33] D. Stein and D. Shah. Implementing lightweight threads. In Proceedings of the 1992 USENIX Summer Conference, pages 1–9, 1992.
[34] H. Stone. Parallel memory allocation using the FETCH-AND-ADD instruction. Technical Report RC 9674, IBM T. J. Watson Research Center, Nov. 1982.
[35] Time-Warner/AOL, Inc. AOLserver 3.0. http://www.aolserver.com.
[36] J. Torrellas, M. S. Lam, and J. L. Hennessy. False sharing and spatial locality in multiprocessor caches. IEEE Transactions on Computers, 43(6):651–663, 1994.
[37] V.-Y. Vee and W.-J. Hsu. A scalable and efficient storage allocator on shared-memory multiprocessors. In International Symposium on Parallel Architectures, Algorithms, and Networks (I-SPAN '99), pages 230–235, Fremantle, Western Australia, June 1999.