Myths and Realities: The Performance Impact of Garbage Collection
ABSTRACT
This paper explores and quantifies garbage collection behavior for three whole heap collectors and their generational counterparts: copying semi-space, mark-sweep, and reference counting, the canonical algorithms from which essentially all other collection algorithms are derived. Efficient implementations in MMTk, a Java memory management toolkit, in IBM's Jikes RVM share all common mechanisms to provide a clean experimental platform. Instrumentation separates collector and program behavior, and performance counters measure timing and memory behavior on three architectures.

Our experimental design reveals key algorithmic features and how they match program characteristics to explain the direct and indirect costs of garbage collection as a function of heap size on the SPEC JVM benchmarks. For example, we find that the contiguous allocation of copying collectors attains significant locality benefits over free-list allocators. The reduced collection costs of the generational algorithms, together with the locality benefit of contiguous allocation, motivate a copying nursery for newly allocated objects. These benefits dominate the overheads of generational collectors compared with non-generational collection and with no collection, disputing the myth that "no garbage collection is good garbage collection." Performance is less sensitive to the mature space collection algorithm in our benchmarks. However, the locality and pointer mutation characteristics of a given program occasionally prefer copying or mark-sweep. This study is unique in its breadth of garbage collection algorithms and its depth of analysis.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors—Memory management (garbage collection)

General Terms
Design, Performance, Algorithms

Keywords
Java, Mark-Sweep, Semi-Space, Reference Counting, Generational

* This work is supported by the following grants: ARC DP0452011, NSF ITR CCR-0085792, NSF CCR-0311829, NSF EIA-0303609, DARPA F33615-03-C-4106, and IBM. Any opinions, findings, and conclusions expressed herein are the authors' and do not necessarily reflect those of the sponsors.

1. Introduction
Programmers are increasingly choosing object-oriented languages such as Java with automatic memory management (garbage collection) because of their software engineering benefits. Although researchers have studied garbage collection for a long time [3, 22, 24, 30, 35, 42], few detailed performance studies exist. No previous study compares the effects of garbage collection algorithms on instruction throughput and locality in the light of modern processor technology trends to explain how garbage collection algorithms and programs can combine to yield good performance.

This work studies in detail the three canonical garbage collection algorithms: semi-space, mark-sweep, and reference counting, and their three generational counterparts. These collectors encompass the key mechanisms and policies from which essentially all garbage collectors are composed. Our findings therefore have application beyond these algorithms. We conduct our study in the Java memory management toolkit (MMTk) [13] in IBM's Jikes RVM [2, 1]. The collectors are efficient, share all common mechanisms and policies, and provide a clean and meaningful experimental platform [13].

The results use a wide range of heap sizes on the SPEC JVM benchmarks to reveal the inherent space-time trade-offs of collector algorithms. For fair comparisons, each experiment fixes the heap size and triggers collection when the program exhausts available memory. We use three architectures (Athlon, Pentium 4, PowerPC) and find the same trends on all three. Each experiment divides total program performance into mutator (application code) and collection phases. The mutator phase includes some memory management activity, such as the allocation sequence and, for the generational collectors, a write barrier. Hardware performance counters measure the L1, L2, and TLB misses for the collector and mutator phases. The experiments reveal the direct cost of garbage collection and its indirect effects on mutator performance and locality.

Our first set of experiments confirms the widely held but unexamined hypothesis that contiguous allocation improves the locality of the mutator. For the whole heap collectors in small heaps, the more space efficient free-list mark-sweep collector performs best because collection frequency dominates the locality benefit of contiguous allocation. As heap size increases, the mutator locality advantage of contiguous allocation with copying collection outweighs the space efficiency of mark-sweep. Contiguous allocation provides fewer misses at all levels of the cache hierarchy (L1, L2, and TLB). These results are counter to the myth that collection frequency is always the first order effect that determines total program performance. Further experiments reveal that most of these locality benefits accrue to the young objects, which motivates contiguous allocation for them in generational collectors.

The generational collectors divide newly allocated nursery objects from mature objects that survive one or more collections, and
collect the nursery independently and more frequently than the mature space [35, 42]. They work well when the rate of death among the young objects is high. To collect the nursery independently, the generational collectors use a write barrier which records any pointer into the nursery from the mature objects. During a nursery collection, the collector assumes the referents of these pointers are live to avoid scanning the entire mature generation. To implement the write barrier, the compiler generates a sequence of code for every pointer store that at run time records only those pointers from the mature space into the nursery. The write barrier thus induces a direct mutator overhead for programs that use generational rather than whole heap collection.

Our experiments show that the generational collectors provide better performance than the whole heap collectors in virtually all circumstances. They significantly reduce collection time itself, and their contiguous nursery allocation has a positive impact on locality. We carefully measure the impact of the write barrier on the mutator and find that its cost is usually very low (often 2% or less); even when it is high (14%), the cost is outweighed by the improvements in collection time.

Comparing the generational collectors against each other, performance differences are typically small. Two factors contribute to this result. First, allocation order provides good spatial locality for young objects even if the program briefly uses and discards them. Second, the majority of reads are actually to the mature objects, but caching usually achieves good temporal locality for these objects regardless of the mature space policy. Some object demographics do, however, have a preference. For instance, generational collection with a copying mature space works best when the mature space references are dispersed and frequent. The mark-sweep mature space performs best, sometimes significantly, in small heaps, where its space efficiency reduces collector invocations.

The next section compares our study to previous collector performance analysis studies, none of which consider this variety of collectors in an apples-to-apples setting, nor include a similar depth of analysis or vary the architecture. We then overview the collectors, a number of key implementation details, and the experimental setting. The results section studies the three base algorithms, separating allocation and collection costs (as much as possible), compares the whole heap algorithms and their generational counterparts, and examines the cost of the generational write barrier. We examine the impact of nursery size on performance and debunk the myth that the nursery size should be tied to the L2 cache size. We also examine mature space behaviors using a fixed-size nursery to hold the mature space work load constant. We perform every experiment on the nine benchmarks and three architectures, but select representative results for brevity and clarity.

2. Related Work
To our knowledge, few studies quantitatively compare uniprocessor garbage collection algorithms [5, 14, 27, 28, 40], and these studies evaluate various copying and generational collectors. Our results on copying collectors are similar to theirs, but they do not compare with free-list mark-sweep or reference counting collectors, nor explore memory system consequences.

Attanasio et al. [5] evaluate parallel collectors on SPECjbb, focusing on the effect of parallelism on throughput and heap size when running on 8 processors. They concluded that mark-sweep and generational mark-sweep with a fixed-size nursery (16 MB or 64 MB) are equal and the best among all the collectors. Our data shows that the generational collectors are superior to the whole heap collectors, especially with a variable-size nursery.

A few recent studies explore heap size effects on performance [14, 18, 32, 40], and as we show here, garbage collectors are very sensitive to heap size, and in particular to tight heaps. Diwan et al. [25, 41], Hicks et al. [28], and others [15, 29] measure detailed, specific mechanism costs and architecture influences [25], but do not consider a variety of collection algorithms. Many researchers have evaluated a range of memory allocators for C/C++ programs [9, 10, 11, 17, 21, 43], but this work does not include copying collectors since C/C++ programs may store pointers arbitrarily.

Java performance analysis work either disabled garbage collection [23, 37], which introduces unnecessary memory fragmentation, or held it constant [32]. Kim and Hsu measure similar details as we do, with simulation of IBM JDK 1.1.6, a Java JIT, using a whole heap mark-sweep algorithm with occasional compaction. Our work thus stands out as the first thorough evaluation of a variety of different garbage collection algorithms, showing how they compare and affect performance using execution measurements and performance counters. The comprehensiveness of our approach reveals new insights, such as the most space efficient collection algorithms and the distinct locality patterns of young and old objects; suggests mechanisms for matching algorithms to object demographics; and reveals the performance trade-offs each strategy makes.

We evaluate the reuse, modularity, portability, and performance of MMTk in a separate publication [13]. In that work we do not explore generational collectors, nor measure and explain performance differences between collectors. However, we do demonstrate that MMTk combines modularity and reuse with high performance, and we rely on that finding here. For example, collectors that share functionality, such as root processing, copying, tracing, allocation, or collection mechanisms, use the exact same implementation in MMTk. In addition, the allocation and collector mechanisms perform as well as hand tuned monolithic counterparts written in Java or C. The experiments in this paper thus offer true policy comparisons in an efficient setting.

3. Background
This section presents the garbage collection terminology, algorithms, and features that this paper compares and explores. It first presents the algorithms, and then enumerates a few key implementation details. For a thorough treatment of algorithms, see Jones and Lins [30], and Blackburn et al. for additional implementation details [13].

In MMTk, a policy pairs one allocation mechanism with one collection mechanism. Whole heap collectors use a single policy. Generational collectors divide the heap into age cohorts and use one or more policies [3, 42]. For generational and incremental algorithms, such as reference counting, a write barrier remembers pointers. For every pointer store, the compiler inserts write-barrier code. At execution time, this code conditionally records pointers depending on the collector policy. Following the literature, the execution time consists of the mutator (the program itself) and periodic garbage collection. Some memory management activities, such as object allocation and the write barrier, mix in with the mutator. Collection can run concurrently with mutation, but this work uses a separate collection phase. MMTk implements the following standard allocation and collection mechanisms.

A Contiguous Allocator appends new objects to the end of a contiguous space by incrementing a bump pointer by the size of the new object.

A Free-List Allocator organizes memory into k size-segregated free-lists. Each free list is unique to a size class and is composed of blocks of contiguous memory. It allocates an object into a free cell in the smallest size class that accommodates the object.
A Tracing Collector identifies live objects by computing a transitive closure from the roots (stacks, registers, and class variables/statics) and from any remembered pointers. It reclaims space by copying live data out of the space, or by freeing untraced objects.

A Reference Counting Collector counts the number of incoming references for each object, and reclaims objects with no references.
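Of these mechanisms, contiguous allocation has the simplest fast path. The following minimal Java sketch is our illustration, not MMTk's actual code: the real fast path is inlined by the compiler and manipulates raw addresses rather than array offsets, and all names here are hypothetical.

    // Minimal sketch of contiguous (bump-pointer) allocation.
    final class BumpAllocator {
        private int cursor;       // offset of the next free byte in the space
        private final int limit;  // end of the region reserved for allocation

        BumpAllocator(int start, int limit) {
            this.cursor = start;
            this.limit = limit;
        }

        // Returns the offset of the new object, or -1 when the region is
        // exhausted and the caller must acquire more memory or collect.
        int alloc(int bytes) {
            int result = cursor;
            int newCursor = result + bytes;  // bump by the object size
            if (newCursor > limit) return -1;
            cursor = newCursor;
            return result;
        }
    }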
3.1 Collectors
All modern collectors build on these mechanisms. This paper examines the following whole heap collectors, and a generational counterpart for each. The generational collectors use a copying nursery for newly allocated objects.

SemiSpace: The semi-space algorithm uses two equal sized copy spaces. It contiguously allocates into one, and reserves the other space for copying into, since in the worst case all objects could survive. When full, it traces and copies live objects into the other space, and then swaps the spaces. Collection time is proportional to the number of survivors. Its throughput suffers because it reserves half of the space for copying and it repeatedly copies objects that survive for a long time, and its responsiveness suffers because it collects the entire heap every time.
Implementation Details: Copying tracing implements the transitive closure as follows. It enqueues the locations of all root references, and repeatedly takes a reference from the locations queue. If the referent object is uncopied, it copies the object, leaves a forwarding address in the old object, enqueues the copied object on a gray object queue, and adjusts the reference to point to the copied object. If it previously copied the referent object, it instead adjusts the reference with the forwarding address. When the locations queue is empty, the collector scans each object on the gray object queue. Scanning places the locations of the pointer fields of these objects on the locations queue. When the gray object queue is empty, it processes the locations queue again, and so on. It terminates when both queues are empty. These experiments use a depth-first order, because our experiments show it performs better than the more standard breadth-first order [19]. MMTk supports other orderings. SemiSpace has no write barrier.
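The following Java sketch restates this two-queue closure under simplifying assumptions of our own (explicit node objects instead of raw memory, and a (node, field) pair standing in for a location); it is an illustration, not MMTk's implementation.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.List;

    // Illustrative model of copying tracing with a locations queue and a
    // gray object queue. The deque discipline approximates the depth-first
    // order noted above.
    final class CopyingTrace {
        static final class Node {
            Node[] fields;  // outgoing pointers
            Node forward;   // forwarding pointer, set once copied
            Node(int n) { fields = new Node[n]; }
        }

        static final class Location {  // the address of one pointer field
            final Node holder; final int index;
            Location(Node holder, int index) { this.holder = holder; this.index = index; }
            Node get() { return holder.fields[index]; }
            void set(Node n) { holder.fields[index] = n; }
        }

        final Deque<Location> locations = new ArrayDeque<>();
        final Deque<Node> gray = new ArrayDeque<>();

        void collect(List<Location> roots) {
            roots.forEach(locations::push);               // enqueue root locations
            while (!locations.isEmpty() || !gray.isEmpty()) {
                while (!locations.isEmpty()) {
                    Location loc = locations.pop();
                    Node ref = loc.get();
                    if (ref == null) continue;
                    if (ref.forward == null) {            // uncopied: copy it,
                        Node copy = new Node(ref.fields.length);
                        System.arraycopy(ref.fields, 0, copy.fields, 0, ref.fields.length);
                        ref.forward = copy;               // leave a forwarding address,
                        gray.push(copy);                  // and queue it for scanning
                    }
                    loc.set(ref.forward);                 // redirect the reference
                }
                while (!gray.isEmpty()) {                 // scan gray objects: their
                    Node n = gray.pop();                  // pointer fields become
                    for (int i = 0; i < n.fields.length; i++)
                        if (n.fields[i] != null)
                            locations.push(new Location(n, i));
                }
            }
        }
    }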
MarkSweep: Mark-sweep uses a free-list allocator and a tracing collector. When the heap is full, it triggers a collection. The collection traces and marks the live objects using bit maps, and lazily finds free slots during allocation. Tracing is thus proportional to the number of live objects, and reclamation is incremental and proportional to allocation. The tracing for MarkSweep is exactly the same as for SemiSpace, except that instead of copying the object, it marks a bit in a live object bit map. Since MarkSweep is a whole heap collector, its maximum pause time is poor, and its performance suffers from repeatedly tracing objects that survive many collections.

Implementation Details: The free-list uses segregated-fits with a range of size classes similar to the Lea allocator [33]. MMTk uses 51 size classes that attain a worst case internal fragmentation of 1/8 for objects less than 255 bytes. The size classes are 4 bytes apart from 8 to 63, 8 bytes apart from 64 to 127, 16 bytes apart from 128 to 255, 32 bytes apart from 256 to 511, 256 bytes apart from 512 to 2047, and 1024 bytes apart from 2048 to 8192. Small, word-aligned objects get an exact fit—in practice, these are the vast majority of all objects. All objects 8KB or larger get their own block (see Section 3.2.3). MarkSweep has no write barrier. The collector keeps the blocks of a size class in a circular list ordered by allocation time. It allocates the first free element in the first block. Finding the right fit is about 10% slower [13] than bump-pointer allocation. The free-list stores the bit vector for each block together with the block. Since block sizes vary from 256 bytes to 8K bytes, this organization may be a source of some conflict misses, but we leave that investigation for future work.
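Transcribed directly into code, the size-class computation reduces to a chain of round-ups. The sketch below is our paraphrase of the stated spacings, not MMTk's implementation; the exact treatment of the 8KB boundary is an assumption.

    // Rounds a request up to its size-class cell size in bytes, following
    // the spacings quoted above; illustrative only.
    final class SizeClasses {
        static int cellSize(int bytes) {
            if (bytes <= 8) return 8;
            if (bytes <= 64) return (bytes + 3) & ~3;          // 4 bytes apart
            if (bytes <= 128) return (bytes + 7) & ~7;         // 8 bytes apart
            if (bytes <= 256) return (bytes + 15) & ~15;       // 16 bytes apart
            if (bytes <= 512) return (bytes + 31) & ~31;       // 32 bytes apart
            if (bytes <= 2048) return (bytes + 255) & ~255;    // 256 bytes apart
            if (bytes <= 8192) return (bytes + 1023) & ~1023;  // 1024 bytes apart
            return -1;  // larger objects: large object space (Section 3.2.3)
        }
    }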
RefCount: The deferred reference-counting collector uses a free-list allocator. During mutation, the write barrier ignores stores to roots and logs mutated objects. It then periodically updates reference counts for root referents and generates reference count increments and decrements using the logged objects. It then deletes objects with a zero reference count and recursively applies decrements. It uses trial deletion to detect cycles [7]. Collection time is proportional to the number of dead objects, but the mutator load is significantly higher than for the other collectors since it logs every mutated heap object.

Implementation Details: RefCount uses object logging with coalescing [34]. RefCount thus records an object only the first time the program modifies it, and buffers decrements for all its referent objects. At collection time, it (1) generates increments for all root and modified object referents, thus coalescing intermediate updates, (2) introduces temporary [7] increments for deferred objects (e.g., roots), and (3) deletes objects with a zero count. When a reference count goes to zero, it puts the object back on the free-list by setting a bit and decrements all its referents. On the next collection, it includes a decrement for all temporary increments from the previous collection.
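In outline, one such collection performs the three steps just enumerated. The sketch below is our paraphrase with hypothetical names and data structures; trial deletion for cycles is omitted.

    import java.util.ArrayDeque;
    import java.util.ArrayList;
    import java.util.Deque;
    import java.util.List;

    // Illustrative skeleton of one deferred, coalescing RC collection.
    final class DeferredRC {
        static final class Obj {
            int rc;
            List<Obj> refs = new ArrayList<>();  // outgoing references
        }
        static final class LogEntry {            // one mutated object plus the
            final Obj obj;                       // referents it had when first
            final List<Obj> oldRefs;             // logged (buffered decrements)
            LogEntry(Obj obj, List<Obj> oldRefs) { this.obj = obj; this.oldRefs = oldRefs; }
        }

        private List<Obj> lastTemporaries = new ArrayList<>();

        void collect(List<LogEntry> log, List<Obj> rootReferents) {
            Deque<Obj> zero = new ArrayDeque<>();
            for (Obj r : rootReferents) r.rc++;          // (2) temporary increments
            for (LogEntry e : log) {
                for (Obj n : e.obj.refs) n.rc++;         // (1) increments for current
                for (Obj o : e.oldRefs) dec(o, zero);    //     referents; old referents
            }                                            //     get buffered decrements
            for (Obj t : lastTemporaries) dec(t, zero);  // retire the previous
            lastTemporaries = new ArrayList<>(rootReferents);  // collection's temporaries
            while (!zero.isEmpty()) {                    // (3) free zero-count objects
                Obj dead = zero.pop();                   //     and recursively decrement
                for (Obj child : dead.refs) dec(child, zero);
            }
        }

        private void dec(Obj o, Deque<Obj> zero) {
            if (--o.rc == 0) zero.push(o);
        }
    }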
GenCopy: The classic copying generational collector [3] allocates into a young (nursery) space. The write barrier records pointers from mature to nursery objects. It collects when the nursery is full, and promotes survivors into a mature semi-space. When the mature space is exhausted, it collects the entire heap. When the program follows the weak generational hypothesis [35, 42], i.e., many young objects die quickly and old objects survive at a higher rate than young ones, GenCopy attains better performance than SemiSpace. GenCopy improves over SemiSpace in this case because it repeatedly collects the nursery, which yields a lot of free space; it compacts the survivors, which can improve mutator locality; and it incurs the collection cost of the mature objects infrequently. It also has better average pause times than SemiSpace, since the nursery is typically smaller than the entire heap.

GenMS: This hybrid generational collector uses a copying nursery and the MarkSweep policy for the mature generation. It allocates using a bump pointer and, when the nursery fills up, triggers a nursery collection. The write barrier, nursery collection, and nursery allocation policies and mechanisms are identical to those of GenCopy. The test for an exhausted heap must accommodate space for copying an entire nursery full of survivors into the MarkSweep space. GenMS should be better than MarkSweep for programs that follow the weak generational hypothesis. In comparison with GenCopy, GenMS can use memory more efficiently, since GenCopy reserves half the heap for copying space. However, both MarkSweep and GenMS can fragment the free space when objects are distributed among size classes. Infrequent collections can also contribute to spreading consecutively allocated (or promoted) objects out in memory. Both sources of fragmentation can reduce locality. Mark-compact collectors can reduce this fragmentation, but need one or two additional passes over the live and dead objects [20].

GenRC: This hybrid generational collector uses a copying nursery and RefCount for the mature generation [16]. It ignores mutations to nursery objects by marking them as logged, and logs the addresses of all mutated mature objects. When the nursery fills, it promotes nursery survivors into the reference counting space. As part of the promotion of nursery objects, it generates reference counts for them and their referents. At the end of the nursery collection, GenRC computes reference counts and deletes dead objects,
as in RefCount. Since GenRC ignores the frequent mutations of the nursery objects, it performs much better than RefCount. Collection time is proportional to the nursery size and the number of dead objects in the RefCount space. With a small nursery and other collection triggers, pause times are very low [16]. RefCount and GenRC are subject to the same free-list fragmentation issues as MarkSweep and GenMS. However, since GenRC collects the mature space on every collection, it is likely to maintain a smaller memory footprint.

3.2 Implementation Details
This section adds a few more implementation details about shared mechanisms, including the nursery size policies, inlining of write barriers and allocation, the reference counting header, the large object space, and the boot image.
3.2.1 Nursery size policies
By default, the generational collectors implement a variable nursery [3] whose initial size is half of the heap; the other half is reserved for copying. Each nursery collection reduces the nursery by the size of the survivors. When the available space for the nursery is too small (256KB by default), it triggers a mature space collection. MMTk also provides a bounded nursery which takes a command line parameter as the initial nursery size, collects after the nursery is full, and resizes the nursery below the bound only when the mature space cannot accommodate a nursery of survivors; it then shrinks using the above variable nursery policy with the same lower bound. The fixed nursery never reduces the size of the nursery, and thus triggers a whole heap collection sooner than a bounded nursery of the same size. The bounded nursery triggers more collections than the variable nursery, which uses space more efficiently, but when the variable nursery is large, pause time suffers.
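The three policies reduce to a small amount of arithmetic. The sketch below is our simplification, with hypothetical names; it is not MMTk's interface.

    // Our paraphrase of the three nursery sizing policies.
    final class NurseryPolicy {
        static final int MIN_NURSERY = 256 * 1024;  // default full-heap trigger

        // Variable nursery: half the heap initially; shrinks as survivors
        // accumulate in the mature space (which also needs a copy reserve).
        static int variableNursery(int heapBytes, int matureBytes) {
            return (heapBytes - 2 * matureBytes) / 2;
            // a result below MIN_NURSERY triggers a mature space collection
        }

        // Bounded nursery: a fixed bound until the mature space squeezes it.
        static int boundedNursery(int bound, int heapBytes, int matureBytes) {
            return Math.min(bound, variableNursery(heapBytes, matureBytes));
        }

        // Fixed nursery: never shrinks, so full-heap collection comes sooner.
        static int fixedNursery(int size) {
            return size;
        }
    }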
3.2.2 Write-barrier and allocation inlining
For the generational collectors, MMTk inlines the write-barrier fast path, which filters stores to nursery objects and thus does not record most pointer updates, i.e., it ignores between 93.7% and 99.9% of pointer stores. The slow path makes the appropriate entries in the remembered set. Since the write barrier for RefCount is unconditional, it is fully inlined, but it forces the slow path object remembering mechanism out-of-line to minimize code bloat and compiler overhead [15]. SemiSpace and MarkSweep have no write barrier.
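The generational fast path amounts to one comparison per pointer store. In this illustrative sketch (names are ours), the nursery is assumed to occupy the top of the address range, which is a common layout choice rather than a detail the text specifies.

    // Illustrative inlined fast path of the generational write barrier.
    // 'slot' is the field being stored into (within 'source');
    // 'target' is the new pointer value.
    final class WriteBarrier {
        static final int NURSERY_START = 0x60000000;  // hypothetical boundary

        static void pointerStore(int source, int slot, int target) {
            store(slot, target);                      // the store itself
            if (source < NURSERY_START && target >= NURSERY_START)
                remember(slot);   // slow path: a mature-to-nursery pointer
        }                         // must enter the remembered set

        private static void store(int slot, int value) { /* raw heap store */ }
        private static void remember(int slot) { /* append to remembered set */ }
    }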
MMTk inlines the fast path of the allocation sequence. For the copying and generational allocators, the inlined sequence consists of incrementing a bump pointer and testing it against a limit pointer. If the test fails (the failure rate is typically 0.1%), the allocation sequence calls an out-of-line routine to acquire another block of memory, which may trigger a collection.

For the MarkSweep and RefCount free-list allocators, the inlined allocation sequence consists of establishing the size class for the allocation (for non-array types, the compiler statically evaluates the size), and removing a free cell from the appropriate free-list, if such a cell is available. If there is no available free cell, the allocation path calls out-of-line to move to another block or, if there are no more blocks of that size class, to acquire a new block.
3.2.3 Header, large objects, and boot image
MMTk has a two word (8 byte) header for each object, which contains a pointer to the TIB (type information block, located in the immortal space; see below), hash bits, lock bits, and GC bits. A one word header for the MarkSweep collectors is possible, but not yet implemented. Bacon et al. found that a one word header yields an average of 2-3% improvement in overall execution [6]. RefCount and the mature space in GenRC have an additional word (4 bytes) in the object header to accommodate the reference count.
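As a picture of the layout just described (the bit assignments in the status word are our invention; only the word counts and the fields' purposes come from the text):

    // Illustrative two-word object header; field packing is hypothetical.
    final class ObjectHeader {
        int tib;     // word 1: pointer to the type information block (TIB)
        int status;  // word 2: hash bits, lock bits, and GC bits, packed

        static final int GC_BITS   = 0x00000003;  // hypothetical masks for
        static final int LOCK_BITS = 0x00000FFC;  // the status word
        static final int HASH_BITS = 0xFFFFF000;
    }
    // RefCount and the GenRC mature space add a third word for the count.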
MMTk allocates all objects 8KB or larger separately into a large object space (LOS) using an integral number of pages. The generational collectors allocate large objects directly into this space. The LOS uses the treadmill algorithm [8]: it records a pointer to each object in a list. During whole heap collections, all of the collectors but RefCount and GenRC trace the live large objects, placing them on another list, and then reclaim any objects left on the original list. RefCount and GenRC reference count the large objects at each collection. MMTk does not a priori reserve space for the LOS, but allocates it on demand.
The boot image contains various objects and precompiled classes necessary for booting Jikes RVM, including the compiler, classloader, garbage collector, and other essential elements of the virtual machine, as part of the Java-in-Java design. MMTk puts these objects in an immortal space, and none of the collectors collect them. All except RefCount and GenRC trace through the boot image objects whenever they perform a whole heap collection. RefCount and GenRC assume all pointers out of the boot image are live to avoid a priori assigning reference counts at boot time.

4. Methodology
This section describes Jikes RVM, our experimental platform, and key benchmark characteristics.

4.1 IBM Jikes RVM
We use MMTk in Jikes RVM version 2.3.1+CVS (a 2.3.2 pre-release, CVS timestamp 2004/03/25 05:11:47 UTC) [2, 1], with patches to support performance counters and pseudo-adaptive compilation. Jikes RVM is a high-performance VM written in Java with an aggressive optimizing compiler [1, 2]. We use configurations that precompile as much as possible, including key libraries and the optimizing compiler, and that turn off assertion checking (the Fast build-time configuration). The adaptive compiler uses sampling to select methods to optimize, leading to high performance [4] but a lack of determinism. Eeckhout et al. use statistical techniques to show that including the adaptive compiler for short running programs skews the results to measure the virtual machine [26]. In addition, adaptive compiler variations result in changes to the allocation behavior and running time of the same run, or of runs with different heap sizes. For example, sampling triggers compilation in different methods, and the compilation of the different write barriers for each collector is part of the runtime system as well as the program, and induces both different mutator behavior and collector load [15].

Since our goal is to focus on application and garbage collection interactions, our pseudo-adaptive approach deterministically mimics adaptive compilation (Xianglong Huang and Narendran Sachindran jointly implemented the pseudo-adaptive compilation mechanism). First we profile each benchmark five times and select the best, collecting a log of the methods that the adaptive compiler chooses to optimize. This log is then used as deterministic compilation advice for the performance runs. For our performance runs, we run two iterations of each benchmark. In the first iteration, the compiler optimizes the methods in the advice file on demand, and base compiles the others. Before the second iteration, we perform a whole heap garbage collection to flush the heap of compiler objects. We then measure the second iteration, which uses optimized code for the hot methods and whose heap includes only application objects. We perform this experiment five times and report the fastest time. Our methodology thus avoids variations due to adaptive compilation.

4.2 Experimental Platform
We perform our experiments on three architectures: Athlon, Pentium 4, and PowerPC. We present the Athlon results because it performs the best and has a relatively simpler memory hierarchy that is easier to analyze.
Table 1: Benchmark characteristics.

                                           Source Field (p)            Target Object (o = *p)
                 alloc  alloc:  %   % Nur  % Read        % Focus       % Read        % Focus
                   MB    min    GC   srv   Nur Mat Imm   Nur   Mat     Nur Mat Imm   Nur   Mat
  202 jess        261   17:1   63     1     29  44  27   0.4    69      18  62  20   0.2    97
  228 jack        231   17:1   53     3     25  39  36   0.1     6      21  50  28   0.1     7
  205 raytrace    135    8:1   46     2     19  75   6   0.3    48      18  78   4   0.3    49
  227 mtrt        142    7:1   51     5     20  75   6   0.3    21      19  77   5   0.3    21
  213 javac       185    7:1   29    23     30  46  24   0.4     2      25  55  21   0.3     3
  201 compress     99    6:1    8     0     97   0   3  11.0     3      61  39   0   6.9   712
  pseudojbb       216    5:1   21    32     16  59  25   0.2     2      14  72  15   0.1     2
  209 db           82    4:1   11     9      5  69  26   0.3    49       1  89   9   0.1    63
  222 mpegaudio     3    1:1    0     –      –   –   –    –      –       –   –   –    –      –
Figure 1: Total, Mutator and GC Performance of All Six Collectors.
[Panels: (a) 202 jess Total Time, (b) 209 db Total Time, (c) 213 javac Total Time; (d) 202 jess GC Time, (e) 209 db GC Time, (f) 213 javac GC Time; (g) 202 jess Mutator Time, (h) 209 db Mutator Time, (i) 213 javac Mutator Time. Each panel plots SemiSpace, MarkSweep, RefCount, GenCopy, GenMS, and GenRC against heap size relative to minimum heap size, with heap size in MB on the upper axis; plots not reproduced.]
We first measure an upper bound on the time the program spends in contiguous allocation by pushing the allocation sequence out-of-line. This cost typically ranges from 1% to at most 10% of total time. We then use a micro benchmark to establish the relative costs of the two mechanisms. The benchmark allocates objects of the same size in a tight loop. Contiguous allocation is 11% faster than free-list allocation, allocating at 726 MB/s versus 654 MB/s. (We recently reported slower times on an older architecture [13].) Since allocation time is less than 10% of total time, this small difference between the mechanisms reduces to less than 1% of total time, and excludes the allocation sequence as a major source of variation.
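A micro benchmark of the kind described can be as simple as the following sketch. It is our construction: the text fixes only "objects of the same size in a tight loop", so the object size, counts, and harness structure here are assumptions.

    // Allocate fixed-size objects in a tight loop and report a rate.
    final class AllocBench {
        static final int OBJECT_BYTES = 32;      // hypothetical fixed size

        public static void main(String[] args) {
            final int count = 10_000_000;
            Object[] keep = new Object[64];      // keep a few objects reachable
            long start = System.nanoTime();
            for (int i = 0; i < count; i++)
                keep[i & 63] = new byte[OBJECT_BYTES];
            double secs = (System.nanoTime() - start) / 1e9;
            double mb = (double) count * OBJECT_BYTES / (1 << 20);
            System.out.printf("allocated %.0f MB at %.0f MB/s%n", mb, mb / secs);
        }
    }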
Figure 2 examines mutator time and memory hierarchy performance for 202 jess, 209 db, and pseudojbb, which have representative behaviors, plotting mutator time, L1 misses, L2 misses, and TLB misses as a function of heap size on a log scale. First consider SemiSpace and MarkSweep. SemiSpace mutator performance improvements range from 7 to 15% over MarkSweep (only on 201 compress and 222 mpegaudio is free-list allocation within 5%). The limit analysis above indicates that the direct effect of the allocator is typically 1% or less of this difference. Since the application code is otherwise identical, second order effects must dominate. 202 jess, 209 db, and pseudojbb each show a strong and consistent correlation between cache memory and mutator performance, where SemiSpace always improves over MarkSweep. Contiguous allocation in SemiSpace thus offers locality from two sources: allocation order and copying compaction. Free-list allocation in MarkSweep degrades program locality. The mutator benefit of SemiSpace over MarkSweep is relatively insensitive to heap size, suggesting that this benefit comes from allocation locality rather than from mature object compaction. An exception is the TLB performance on 202 jess, where the four copying collectors show a sharp reduction in TLB misses at smaller heap sizes, presumably due to collection-induced locality. However, L1 misses appear to dominate, so the reduction in TLB misses does not translate to a reduction in mutator time.

5.3.2 Mutator costs in generational collectors
We perform the following experiment to examine more closely whether SemiSpace locality is mostly due to the allocation order or to the copying compaction of mature objects. We hold the work load on the mature space constant with a fixed-size nursery variant of the generational collectors. The young objects thus are all in allocation order. Since young objects are collected at the same frequency, only the mature space collection policies differ. Figure 3 shows the geometric mean of mutator performance across all benchmarks. When the nursery size is fixed, GenCopy and GenMS have very similar mutator performance. The locality of mature space objects is thus not a dominant effect on mutator performance.
Figure 2: Mutator time and L1, L2 and TLB misses for all six collectors (log scale).
[Panels: (a)-(c) mutator time, (d)-(f) mutator L1 misses, (g)-(i) mutator L2 misses, and (j)-(l) mutator TLB misses for 202 jess, 209 db, and pseudojbb, plotted against heap size relative to minimum heap size; plots not reproduced.]
As Section 5.4.1 discusses in more detail, the variable-size nursery attains a space advantage when combined with GenMS, which reduces the number of nursery collections, a direct benefit. The indirect benefit is slightly improved locality for nursery objects, since they stay in allocation order in the nursery for longer. Figure 3 suggests that mature object compaction in the free-list will be of little use for these programs. However, Figure 2 reveals the two exceptions to this rule: 209 db and pseudojbb.

The most striking counterpoint is 209 db, where the generational collectors make little impact on mutator time. Even the copying nursery in GenMS provides no advantage over MarkSweep. GenCopy slightly degrades mutator locality compared with SemiSpace, due to the write barrier (see Table 2). Section 4.3 shows that 209 db is dominated by mature space accesses, and thus nursery locality is immaterial for 209 db.

In pseudojbb, the copying nursery benefits GenMS compared to MarkSweep, but GenCopy still performs significantly better than both. This suggests that pseudojbb has mature space access patterns which are locality sensitive. The access pattern statistics in Section 4.3 confirm this result. Mature space is accessed more heavily by pseudojbb, but the accesses are relatively unfocused.

Together, the whole heap and generational results indicate that free-list allocation significantly degrades locality, whereas contiguous allocation achieves locality on young objects from allocation order.
[Figure 3: Mutator time for whole heap and fixed-size nursery collectors, geometric mean across all benchmarks; SemiSpace, MarkSweep, GenCopy (fixed 4MB nursery), and GenMS (fixed 4MB nursery) plotted against heap size relative to minimum heap size (log scale); plot not reproduced.]

Table 3: Impact of very large heap size on mutator time. For each benchmark: the number of collections and mutator time at a 1.5x minimum heap, and the ratio of mutator time in an uncollected heap (∞) to the 1.5x heap, for MarkSweep and SemiSpace.

                        MarkSweep mutator           SemiSpace mutator
                        1.5× min        ratio       1.5× min        ratio
                        GCs  time (s)   ∞/1.5×      GCs  time (s)   ∞/1.5×
  202 jess               27    2.48      0.97        50    1.97      1.18
  228 jack               25    2.39      0.97        49    2.11      1.08
  205 raytrace           10    2.35      0.98        25    2.04      1.06
  227 mtrt                9    2.38      1.04        26    2.10      1.07
  213 javac              12    4.57      0.99        28    3.80      1.03
  201 compress            7    5.47      1.00         7    5.41      0.99
  pseudojbb               9    7.21      1.00        32    6.04      1.07
  209 db                  5   13.78      1.01        22   12.81      0.86
  222 mpegaudio           0   10.57      0.93         0    9.77      1.00
  Geometric mean          8    4.57      0.99        18    4.02      1.03
Furthermore, a copying nursery ameliorates the locality penalty of the mature space free-list in all but 209 db and pseudojbb, where mature-space reads play a large role.

5.4 Collection: How, when, and whether?
The choice of allocation mechanism also governs the choice of collection mechanisms. We now examine the time and space overheads of the collection algorithms, and their influence on mutator locality. We consider how frequently to collect. We also show that our results are consistent across architectures, and then discuss whether we should choose garbage collection at all.

5.4.1 Garbage collection costs
Contiguous allocation dictates copying collection, which requires a copy reserve. The SemiSpace, GenCopy, and GenMS collector performance graphs reflect this copying space overhead, which leads to many more collections than pure MarkSweep—SemiSpace typically collects between 1.5 and 2 times as often as MarkSweep for a given heap size. For example, GC time in Figure 1 for SemiSpace is typically at least 50% worse than for MarkSweep. We measured the tracing rates for SemiSpace and MarkSweep on a micro benchmark: they are very close (59.5MB/sec and 59.2MB/sec), which means that the frequency of collection is the source of the overhead. In addition, GenMS with a variable nursery reduces the number of nursery collections over GenCopy because it is more space efficient. The first order effect of fewer collections is reduced collection time. A second order effect could be fewer cache line displacements due to collector invocations. The stability of mutator cache performance as a function of heap size, in the face of dramatic differences in the numbers of collections, dissuades us of this hypothesis.
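The 1.5 to 2 ratio is what simple copy-reserve arithmetic predicts. As a rough model of our own (an illustration, not a measurement from this study): with heap size H and live data L, SemiSpace can allocate only H/2 - L bytes between collections because half the heap is copy reserve, whereas MarkSweep can allocate roughly H - L. Over an allocation volume A, the collection counts are then about A/(H/2 - L) versus A/(H - L), a ratio of (H - L)/(H/2 - L), which is 2 when L is small relative to H; free-list fragmentation reduces MarkSweep's effective capacity below H - L, pulling the observed ratio under 2.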
examine this effect, Table 3 compares the mutator time for each
5.4.2 Trading off collection cost and mutator locality benchmark using contiguous and free-list allocation with a modest
Total performance is of course a function of the mutator and col- heap (1.5× minimum), and an uncollected heap, large enough to
lector performance. While contiguous allocation offers a signifi- avoid triggering any collection. For these benchmarks, 900MB is
cant mutator advantage, its copy reserve requirement results in a adequate. Only 202 jess follows the hypothesis that never col-
substantial overhead. In small heap sizes, collection time typically lecting degrades performance. Since 202 jess has a high heap
swamps total performance and overwhelms mutator locality differ- turn over and some accesses to mature space, it does suffer some
ences; MarkSweep outperforms SemiSpace. In large heaps, muta- fragmentation that degrades mutator performance when the heap is
tor time dominates and SemiSpace outperforms MarkSweep. Fig- never collected.
ure 1 illustrates the crossovers in total performance for MarkSweep Most of the other benchmarks have about the same mutator per-
and SemiSpace on 213 javac and 202 jess. formance in the uncollected heap (∞) as in the modest heap. At first
As Sections 5.3.1 and 5.3.2 establish, the locality advantage of this result seems a little surprising in light of the inevitable degra-
contiguous allocation is greatest among the young objects. These dation in locality among the older objects. However, as Section 5.3
results indicate that the copying nursery combined with a space showed, the spatial locality of the mature objects is not a dominant
efficient MarkSweep mature space offers a good combination of factor for these benchmarks. 209 db actually achieves better per-
locality benefits and reduced collection costs. However, when ma- formance without collection because it attains good locality from
ture space locality dominates, such as in 209 db, GenCopy can contiguous allocation and it has low GC work load. Blackburn et
perform best. al. found for a more memory constrained machine, never collecting
5.4.3 Tracing or Reference Counting? caused severe degradations in 209 db due to paging [14]. Table 1
With a free-list, the collector can either trace the live objects from together with mutator locality results indicate that all of the other
the roots or count references. Continuously tracking the number programs have a slight majority of accesses to a few mature objects
[Figure 4: GenCopy and GenMS as a function of nursery size (KB, log scale, 64KB to 64MB ticks): mutator time and GC time, with panels (d) L1 Mutator Misses, (e) L2 Mutator Misses, and (f) TLB Mutator Misses; plots not reproduced.]
5.4.5 Sizing the nursery
Given the performance advantages of generational collection, we now examine the influence of the nursery size. Figure 4 shows the performance of GenMS and GenCopy over a wide range of bounded nursery sizes (128KB to 32MB), running in a very large heap (900MB). Note that the x-axis in this figure is nursery size, rather than heap size as in all the other figures in this paper. Figure 4(a) shows a small improvement in mutator performance with larger nurseries, due to fewer L2 misses (Figure 4(e)) and TLB misses (Figure 4(f)). However, the difference in GC time dominates: smaller nurseries demand more frequent collection and thus a substantially higher load. We measured the fixed overhead of each collection and found that each invocation of a collection scanned around 64KB of roots. These fixed costs become significant when the nursery is as small as 128KB. The garbage collection cost tapers off between 4MB and 8MB as the fixed collection costs become insignificant. These results debunk the myth that the nursery size should be matched to the L2 cache size (512KB on all three architectures).

5.5 Architecture influences
Figure 5 compares the geometric mean of the benchmarks for all six collectors on the P4, Athlon, and PPC. The x-axis is heap size, and the y-axis is time. The P4 has the fastest clock speed, followed by the Athlon, and then the PPC. Intel would like us to believe that this ordering means the P4 will perform the best. Instead, the Athlon performs about 20% better. For the generational collectors, even the PPC is close to the P4. The Athlon's advantage comes from substantially fewer cache misses than the P4 (compare Figures 2 and 6). Due to the Athlon's exclusive cache architecture, substantially larger L1, and higher associativity L2, it simply has more effective cache, and this advantage dominates clock speed.

The collectors follow the same trends discussed above on all of the architectures. The generational collectors perform best on all architectures, due to reductions in collection time and to the locality of contiguous nursery allocation. However, the difference is more pronounced on the PPC than on the P4 or Athlon, which suggests a reduced influence of collection time on faster processors. The space advantage of MarkSweep over SemiSpace and the locality advantage of SemiSpace over MarkSweep show different cross-over points on each architecture. The faster the clock speed, the closer the cross-over point moves towards the minimum heap size: the cross-over where SemiSpace improves over MarkSweep is 2.2 for the P4, 3.4 for the Athlon, and 4 for the PPC. This trend suggests that for future processors the locality advantages of contiguous allocation will become even more pronounced.
[Figure 5: Geometric mean of total time for all six collectors (SemiSpace, MarkSweep, RefCount, GenCopy, GenMS, GenRC) on the three architectures, plotted as normalized time and time in seconds against heap size relative to minimum heap size; plots not reproduced.]

Figure 6: P4 mutator L1, L2 and TLB misses for 202 jess (log scale). Compare with Figures 2(d), 2(g) and 2(j).
[Panels: (a) 202 jess Mutator L1, (b) 202 jess Mutator L2, (c) 202 jess Mutator TLB; plots not reproduced.]
5.6 Is garbage collection a good idea?
The software engineering benefits of garbage collection over explicit memory management are widely accepted, but the performance trade-off in languages designed for garbage collection is unexplored. Section 5.3 shows a clear mutator performance advantage for contiguous over free-list allocation, and the architectural comparison shows that architectural trends should make this advantage more pronounced. The traditional explicit memory management use of malloc() and free() is tightly coupled to the use of a free-list allocator—in fact, the MMTk free-list allocator implementation is based on the Lea allocator [33], which is the default allocator in standard C libraries. Standard explicit memory management is thus unable to exploit the locality advantages of contiguous allocation. It is therefore possible that garbage collection presents a performance advantage over explicit memory management on current or future architectures. A striking example of this is seen in Figures 1(a) and 1(g), where the total time for GenMS matches or betters the mutator time for MarkSweep. Further exploration of this is unfortunately beyond our scope. Another alternative—not reclaiming memory at all—is unsustainable.

6. Conclusion
This study examines the implications of the key policy choices in memory management on collection time, space, mutator locality, mutator performance, and total performance. A few key observations emerge. First, even if programs do not follow the generational hypothesis, the contiguous allocation of a copying nursery offers locality benefits that indicate the weak generational collectors are always the collectors of choice. As a corollary, although many accesses go to mature objects, their performance relies on temporal locality, whereas in the nursery, allocation order provides good spatial locality for young objects that die quickly. We also show that the cost of the generational write barrier is usually low. Second, the choice of mature space collector should not be dictated only by space efficiency, which would always prefer MarkSweep, but should also include the rate of death among the mature objects and the access and mutation rate of the mature space. If these rates are high, a copying mature space can attain better mutator locality that in the end overcomes its higher collection time penalty. These results can guide users to the right collector for their program, and offer insights to memory management designers for future collectors that could tune themselves on long running applications.

7. REFERENCES
[1] B. Alpern et al. Implementing Jalapeño in Java. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 314–324, Denver, CO, Nov. 1999.
[2] B. Alpern et al. The Jalapeño virtual machine. IBM Systems Journal, 39(1):211–238, Feb. 2000.
[3] A. W. Appel. Simple generational garbage collection and fast allocation. Software Practice and Experience, 19(2):171–183, 1989.
[4] M. Arnold, S. J. Fink, D. Grove, M. Hind, and P. Sweeney. Adaptive optimization in the Jalapeño JVM. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 47–65, Minneapolis, MN, Oct. 2000.
[5] C. R. Attanasio, D. F. Bacon, A. Cocchi, and S. Smith. A comparative evaluation of parallel garbage collectors. In Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science. Springer-Verlag, 2001.
[6] D. Bacon, S. Fink, and D. Grove. Space- and time-efficient implementations of the Java object model. In Proceedings of the European Conference on Object-Oriented Programming (ECOOP), pages 111–132. ACM Press, June 2002.
[7] D. F. Bacon and V. T. Rajan. Concurrent cycle collection in reference counted systems. In J. L. Knudsen, editor, Proc. of the 15th ECOOP, volume 2072 of Lecture Notes in Computer Science, pages 207–235. Springer-Verlag, 2001.
[8] H. G. Baker. The Treadmill: Real-time garbage collection without motion sickness. ACM SIGPLAN Notices, 27(3):66–70, 1992.
[9] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In ACM Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, Nov. 2000.
[10] E. D. Berger, B. G. Zorn, and K. S. McKinley. Composing high-performance memory allocators. In ACM SIGPLAN Conference on Programming Languages Design and Implementation, pages 114–124, Salt Lake City, UT, June 2001.
[11] E. D. Berger, B. G. Zorn, and K. S. McKinley. Reconsidering custom memory allocation. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 1–12, Seattle, WA, Nov. 2002.
[12] S. M. Blackburn, P. Cheng, and K. S. McKinley. Myths and realities: The performance impact of garbage collection. Technical Report TR-CS-04-04, Dept. of Computer Science, Australian National University, 2004.
[13] S. M. Blackburn, P. Cheng, and K. S. McKinley. Oil and water? High performance garbage collection in Java with JMTk. In ICSE, Scotland, UK, May 2004.
[14] S. M. Blackburn, R. E. Jones, K. S. McKinley, and J. E. B. Moss. Beltway: Getting around garbage collection gridlock. In Proc. of SIGPLAN 2002 Conference on PLDI, pages 153–164, Berlin, Germany, June 2002.
[15] S. M. Blackburn and K. S. McKinley. In or out? Putting write barriers in their place. In ACM International Symposium on Memory Management, pages 175–183, Berlin, Germany, June 2002.
[16] S. M. Blackburn and K. S. McKinley. Ulterior reference counting: Fast garbage collection without a long wait. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 244–358, Anaheim, CA, Oct. 2003.
[17] H.-J. Boehm. Space efficient conservative garbage collection. In ACM SIGPLAN Conference on Programming Languages Design and Implementation, pages 197–206, 1993.
[18] T. Brecht, E. Arjomandi, C. Li, and H. Pham. Controlling garbage collection and heap growth to reduce the execution time of Java applications. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 353–366, Tampa, FL, 2001.
[19] C. J. Cheney. A non-recursive list compacting algorithm. Communications of the ACM, 13(11):677–678, Nov. 1970.
[20] J. Cohen and A. Nicolau. Comparison of compacting algorithms for garbage collection. ACM Transactions on Programming Languages and Systems, 5(4):532–553, Oct. 1983.
[21] D. L. Detlefs, A. Dosser, and B. Zorn. Memory allocation costs in large C and C++ programs. Software Practice & Experience, 24(6):527–542, June 1994.
[22] L. P. Deutsch and D. G. Bobrow. An efficient incremental automatic garbage collector. Communications of the ACM, 19(9):522–526, Sept. 1976.
[23] S. Dieckmann and U. Hölzle. A study of the allocation behavior of the SPECjvm98 Java benchmarks. In Proceedings of the European Conference on Object-Oriented Programming, pages 92–115, June 1999.
[24] E. Dijkstra, L. Lamport, A. Martin, C. Scholten, and E. Steffens. On-the-fly garbage collection: An exercise in cooperation. Communications of the ACM, 21(11):966–975, Sept. 1978.
[25] A. Diwan, D. Tarditi, and J. E. B. Moss. Memory subsystem performance of programs using copying garbage collection. In Conference Record of the Twenty-First ACM Symposium on Principles of Programming Languages, pages 1–14, Portland, OR, Jan. 1994.
[26] L. Eeckhout, A. Georges, and K. D. Bosschere. How Java programs interact with virtual machines at the microarchitectural level. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 244–358, Anaheim, CA, Oct. 2003.
[27] R. Fitzgerald and D. Tarditi. The case for profile-directed selection of garbage collectors. In ACM International Symposium on Memory Management, pages 111–120, Minneapolis, MN, Oct. 2000.
[28] M. W. Hicks, J. T. Moore, and S. Nettles. The measured cost of copying garbage collection mechanisms. In ACM International Conference on Functional Programming, pages 292–305, 1997.
[29] A. L. Hosking and R. L. Hudson. Remembered sets can also play cards, Oct. 1993. Position paper for the OOPSLA '93 Workshop on Memory Management and Garbage Collection.
[30] R. E. Jones and R. D. Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. Wiley, July 1996.
[31] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th International Symposium on Computer Architecture, pages 364–373, Seattle, WA, June 1990.
[32] J. Kim and Y. Hsu. Memory system behavior of Java programs: Methodology and analysis. In ACM SIGMETRICS Conference on Measurement & Modeling Computer Systems, pages 264–274, Santa Clara, CA, June 2000.
[33] D. Lea. A memory allocator. http://gee.cs.oswego.edu/dl/html/malloc.html, 1997.
[34] Y. Levanoni and E. Petrank. An on-the-fly reference counting garbage collector for Java. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 367–380, Tampa, FL, Oct. 2001.
[35] H. Lieberman and C. E. Hewitt. A real time garbage collector based on the lifetimes of objects. Communications of the ACM, 26(6):419–429, 1983.
[36] M. Pettersson. Linux Intel/x86 performance counters, 2003. http://user.it.uu.se/~mikpe/linux/perfctr/.
[37] Y. Shuf, M. J. Serrano, M. Gupta, and J. P. Singh. Characterizing the memory behavior of Java workloads: A structured view and opportunities for optimizations. In ACM SIGMETRICS Conference on Measurement & Modeling Computer Systems, pages 194–205, Cambridge, MA, June 2001.
[38] Standard Performance Evaluation Corporation. SPECjvm98 Documentation, release 1.03 edition, Mar. 1999.
[39] Standard Performance Evaluation Corporation. SPECjbb2000 (Java Business Benchmark) Documentation, release 1.01 edition, 2001.
[40] D. Stefanović, M. Hertz, S. M. Blackburn, K. McKinley, and J. Moss. Older-first garbage collection in practice: Evaluation in a Java virtual machine. In Memory System Performance, pages 175–184, June 2002.
[41] D. Tarditi and A. Diwan. Measuring the cost of storage management. Lisp and Symbolic Computation, 9(4), Dec. 1996.
[42] D. M. Ungar. Generation scavenging: A non-disruptive high performance storage reclamation algorithm. In ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 157–167, Apr. 1984.
[43] B. G. Zorn. The measured cost of conservative garbage collection. Software Practice & Experience, 23(7):733–756, 1993.