
Myths and Realities:
The Performance Impact of Garbage Collection∗

Stephen M. Blackburn (Department of Computer Science, Australian National University, Canberra, ACT 0200, Australia)
Perry Cheng (IBM T.J. Watson Research Center, P.O. Box 704, Yorktown Heights, NY 10598, USA)
Kathryn S. McKinley (Department of Computer Sciences, University of Texas at Austin, Austin, TX 78712, USA)

ABSTRACT

This paper explores and quantifies garbage collection behavior for three whole heap collectors and their generational counterparts: copying semi-space, mark-sweep, and reference counting, the canonical algorithms from which essentially all other collection algorithms are derived. Efficient implementations in MMTk, a Java memory management toolkit, in IBM's Jikes RVM share all common mechanisms to provide a clean experimental platform. Instrumentation separates collector and program behavior, and performance counters measure timing and memory behavior on three architectures.

Our experimental design reveals key algorithmic features and how they match program characteristics to explain the direct and indirect costs of garbage collection as a function of heap size on the SPEC JVM benchmarks. For example, we find that the contiguous allocation of copying collectors attains significant locality benefits over free-list allocators. The reduced collection costs of the generational algorithms, together with the locality benefit of contiguous allocation, motivate a copying nursery for newly allocated objects. These benefits dominate the overheads of generational collectors compared with non-generational collectors and with no collection, disputing the myth that "no garbage collection is good garbage collection." Performance is less sensitive to the mature space collection algorithm in our benchmarks. However, the locality and pointer mutation characteristics of a given program occasionally prefer copying or mark-sweep. This study is unique in its breadth of garbage collection algorithms and its depth of analysis.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors—Memory management (garbage collection)

General Terms: Design, Performance, Algorithms

Keywords: Java, Mark-Sweep, Semi-Space, Reference Counting, Generational

∗This work is supported by the following grants: ARC DP0452011, NSF ITR CCR-0085792, NSF CCR-0311829, NSF EIA-0303609, DARPA F33615-03-C-4106, and IBM. Any opinions, findings and conclusions expressed herein are the authors' and do not necessarily reflect those of the sponsors.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGMETRICS/Performance'04, June 12–16, 2004, New York, NY, USA.
Copyright 2004 ACM 1-58113-664-1/04/0006 ...$5.00.

1. Introduction

Programmers are increasingly choosing object-oriented languages such as Java with automatic memory management (garbage collection) because of their software engineering benefits. Although researchers have studied garbage collection for a long time [3, 22, 24, 30, 35, 42], few detailed performance studies exist. No previous study compares the effects of garbage collection algorithms on instruction throughput and locality in the light of modern processor technology trends to explain how garbage collection algorithms and programs can combine to yield good performance.

This work studies in detail the three canonical garbage collection algorithms: semi-space, mark-sweep, and reference counting, and their three generational counterparts. These collectors encompass the key mechanisms and policies from which essentially all garbage collectors are composed. Our findings therefore have application beyond these algorithms. We conduct our study in the Java memory management toolkit (MMTk) [13] in IBM's Jikes RVM [2, 1]. The collectors are efficient, share all common mechanisms and policies, and provide a clean and meaningful experimental platform [13].

The results use a wide range of heap sizes on the SPEC JVM benchmarks to reveal the inherent space-time trade-offs of collector algorithms. For fair comparisons, each experiment fixes the heap size and triggers collection when the program exhausts available memory. We use three architectures (Athlon, Pentium 4, PowerPC) and find the same trends on all three. Each experiment divides total program performance into mutator (application code) and collection phases. The mutator phase includes some memory management activity, such as the allocation sequence and, for the generational collectors, a write barrier. Hardware performance counters measure the L1, L2, and TLB misses for the collector and mutator phases. The experiments reveal the direct cost of garbage collection and its indirect effects on mutator performance and locality.

Our first set of experiments confirms the widely held, but unexamined, hypothesis that contiguous allocation improves the locality of the mutator. For the whole heap collectors in small heaps, the more space efficient free-list mark-sweep collector performs best because collection frequency dominates the locality benefit of contiguous allocation. As heap size increases, the mutator locality advantage of contiguous allocation with copying collection outweighs the space efficiency of mark-sweep. Contiguous allocation produces fewer misses at all levels of the cache hierarchy (L1, L2 and TLB). These results run counter to the myth that collection frequency is always the first order effect that determines total program performance. Further experiments reveal that most of these locality benefits accrue to the young objects, which motivates contiguous allocation for them in generational collectors.

The generational collectors divide newly allocated nursery objects from mature objects that survive one or more collections, and collect the nursery independently and more frequently than the mature space [35, 42]. They work well when the rate of death among the young objects is high. In order to collect the nursery independently, the generational collectors use a write barrier which records any pointer into the nursery from the mature objects. During a nursery collection, the collector assumes the referents of these pointers are live to avoid scanning the entire mature generation. To implement the write barrier, the compiler generates a sequence of code for every pointer store that at runtime records only those pointers from the mature space into the nursery. The write barrier thus induces a direct mutator overhead for generational collection relative to whole heap collection.

Our experiments show that the generational collectors provide better performance than the whole heap collectors in virtually all circumstances. They significantly reduce collection time itself, and their contiguous nursery allocation has a positive impact on locality. We carefully measure the impact of the write barrier on the mutator and find that its cost is usually very low (often 2% or less), and even when high (14%), the cost is outweighed by the improvements in collection time.

Comparing the generational collectors against each other, performance differences are typically small. Two factors contribute to this result. First, allocation order provides good spatial locality for young objects even if the program briefly uses and discards them. Second, the majority of reads are actually to the mature objects, but caching usually achieves good temporal locality for these objects regardless of the mature space policy. Some object demographics do, however, have a preference. For instance, generational collection with a copying mature space works best when the mature space references are dispersed and frequent. The mark-sweep mature space performs best, sometimes significantly, in small heaps, where its space efficiency reduces collector invocations.

The next section compares our study to previous collector performance analysis studies, none of which consider this variety of collectors in an apples-to-apples setting, nor include a similar depth of analysis or vary the architecture. We then overview the collectors, a number of key implementation details, and the experimental setting. The results section studies the three base algorithms, separating allocation and collection costs (as much as possible), compares the whole heap algorithms and their generational counterparts, and examines the cost of the generational write barrier. We examine the impact of nursery size on performance and debunk the myth that the nursery size should be tied to the L2 cache size. We also examine mature space behaviors using a fixed-size nursery to hold the mature space work load constant. We perform every experiment on the nine benchmarks and three architectures, but select representative results for brevity and clarity.

2. Related Work

To our knowledge, few studies quantitatively compare uniprocessor garbage collection algorithms [5, 14, 27, 28, 40], and these studies evaluate various copying and generational collectors. Our results on copying collectors are similar to theirs, but they do not compare with free-list mark-sweep or reference counting collectors, nor explore memory system consequences.

Attanasio et al. [5] evaluate parallel collectors on SPECjbb, focusing on the effect of parallelism on throughput and heap size when running on 8 processors. They concluded that mark-sweep and generational mark-sweep with a fixed-size nursery (16 MB or 64 MB) are equal and the best among all the collectors. Our data shows that the generational collectors are superior to whole heap collectors, especially with a variable-size nursery.

A few recent studies explore heap size effects on performance [14, 18, 32, 40], and as we show here, garbage collectors are very sensitive to heap size, and in particular to tight heaps. Diwan et al. [25, 41], Hicks et al. [28], and others [15, 29] measure detailed, specific mechanism costs and architecture influences [25], but do not consider a variety of collection algorithms. Many researchers have evaluated a range of memory allocators for C/C++ programs [9, 10, 11, 17, 21, 43], but this work does not include copying collectors since C/C++ programs may store pointers arbitrarily.

Java performance analysis work either disabled garbage collection [23, 37], which introduces unnecessary memory fragmentation, or held it constant [32]. Kim and Hsu measure similar details as we do, with simulation of IBM JDK 1.1.6, a Java JIT, using a whole heap mark-sweep algorithm with occasional compaction. Our work thus stands out as the first thorough evaluation of a variety of different garbage collection algorithms, showing how they compare and affect performance using execution measurements and performance counters. The comprehensiveness of our approach reveals new insights, such as the most space efficient collection algorithms and the distinct locality patterns of young and old objects; suggests mechanisms for matching algorithms to object demographics; and reveals the performance trade-offs each strategy makes.

We evaluate the reuse, modularity, portability, and performance of MMTk in a separate publication [13]. In that work we do not explore generational collectors, nor measure and explain performance differences between collectors. However, we do demonstrate that MMTk combines modularity and reuse with high performance, and we rely on that finding here. For example, collectors that share functionality, such as root processing, copying, tracing, allocation, or collection mechanisms, use the exact same implementation in MMTk. In addition, the allocation and collector mechanisms perform as well as hand tuned monolithic counterparts written in Java or C. The experiments in this paper thus offer true policy comparisons in an efficient setting.

3. Background

This section presents the garbage collection terminology, algorithms, and features that this paper compares and explores. It first presents the algorithms, and then enumerates a few key implementation details. For a thorough treatment of algorithms, see Jones and Lins [30], and see Blackburn et al. [13] for additional implementation details.

In MMTk, a policy pairs one allocation mechanism with one collection mechanism. Whole heap collectors use a single policy. Generational collectors divide the heap into age cohorts and use one or more policies [3, 42]. For generational and incremental algorithms, such as reference counting, a write barrier remembers pointers. For every pointer store, the compiler inserts write-barrier code. At execution time, this code conditionally records pointers depending on the collector policy. Following the literature, the execution time consists of the mutator (the program itself) and periodic garbage collection. Some memory management activities, such as object allocation and the write barrier, mix in with the mutator. Collection can run concurrently with mutation, but this work uses a separate collection phase. MMTk implements the following standard allocation and collection mechanisms.

A Contiguous Allocator appends new objects to the end of a contiguous space by incrementing a bump pointer by the size of the new object.

A Free-List Allocator organizes memory into k size-segregated free-lists. Each free list is unique to a size class and is composed of blocks of contiguous memory. It allocates an object into a free cell in the smallest size class that accommodates the object. (Both allocation mechanisms are sketched in code after this list.)
A Tracing Collector identifies live objects by computing a transitive closure from the roots (stacks, registers, and class variables/statics) and from any remembered pointers. It reclaims space by copying live data out of the space, or by freeing untraced objects.

A Reference Counting Collector counts the number of incoming references to each object, and reclaims objects with no references.
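To make the two allocation mechanisms concrete, here is a minimal Java sketch of each. This is our illustration, not MMTk code: MMTk's allocators manage raw memory and blocks, while this model hands out offsets into an array, and the class names are hypothetical.

    // Illustrative sketch only; MMTk's real allocators operate on raw memory.
    final class BumpPointerAllocator {
        private final byte[] space;   // the contiguous region
        private int cursor = 0;       // the "bump pointer"
        private final int limit;

        BumpPointerAllocator(int bytes) { space = new byte[bytes]; limit = bytes; }

        /** Contiguous allocation: advance the cursor by the object size. */
        int alloc(int size) {
            if (cursor + size > limit) return -1;  // out of space: caller triggers GC
            int result = cursor;
            cursor += size;
            return result;                         // offset of the new object
        }
    }

    final class FreeListAllocator {
        // One free list of cell offsets per size class (4, 8, 16, ... bytes here).
        private final java.util.ArrayDeque<Integer>[] freeLists;
        private final int[] cellSizes = {4, 8, 16, 32, 64, 128, 256};

        @SuppressWarnings("unchecked")
        FreeListAllocator() {
            freeLists = new java.util.ArrayDeque[cellSizes.length];
            for (int i = 0; i < freeLists.length; i++)
                freeLists[i] = new java.util.ArrayDeque<>();
        }

        /** Allocate from the smallest size class that fits the request. */
        int alloc(int size) {
            for (int i = 0; i < cellSizes.length; i++) {
                if (cellSizes[i] >= size && !freeLists[i].isEmpty())
                    return freeLists[i].pop();     // reuse a free cell
            }
            return -1;  // no free cell: acquire a fresh block or trigger GC
        }

        /** Incremental freeing: return a cell to its size class. */
        void free(int offset, int sizeClassIndex) { freeLists[sizeClassIndex].push(offset); }
    }

The sketch makes the essential trade-off visible: the bump pointer does constant work per allocation and packs objects in allocation order, while the free list searches size classes but can free individual cells without a copy reserve.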
3.1 Collectors

All modern collectors build on these mechanisms. This paper examines the following whole heap collectors, and a generational counterpart for each. The generational collectors use a copying nursery for newly allocated objects.
SemiSpace: The semi-space algorithm uses two equal sized copy spaces. It contiguously allocates into one, and reserves the other space for copying into, since in the worst case all objects could survive. When full, it traces and copies live objects into the other space, and then swaps the spaces. Collection time is proportional to the number of survivors. Its throughput suffers because it reserves half of the space for copying and repeatedly copies objects that survive for a long time, and its responsiveness suffers because it collects the entire heap every time.
Implementation Details: Copying tracing implements the transitive closure as follows. It enqueues the locations of all root references, and repeatedly takes a reference from the locations queue. If the referent object is uncopied, it copies the object, leaves a forwarding address in the old object, enqueues the copied object on a gray object queue, and adjusts the reference to point to the copied object. If it previously copied the referent object, it instead adjusts the reference using the forwarding address. When the locations queue is empty, the collector scans each object on the gray object queue. Scanning places the locations of the pointer fields of these objects on the locations queue. When the gray object queue is empty, it processes the locations queue again, and so on. It terminates when both queues are empty. These experiments use a depth-first order, because our experiments show it performs better than the more standard breadth-first order [19]. MMTk supports other orderings. SemiSpace has no write barrier.
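The two-queue closure can be sketched as follows. This is a simplified, self-contained Java model, not MMTk source; Obj, Slot, and forwardee are our stand-ins for heap objects, reference locations, and forwarding addresses.

    import java.util.ArrayDeque;

    /** A toy heap object; 'forwardee' models the forwarding address left in from-space. */
    final class Obj {
        final Obj[] fields;
        Obj forwardee;                       // null until this object is copied
        Obj(int nFields) { fields = new Obj[nFields]; }
    }

    final class CopyingTrace {
        /** A slot is a reference location: fields[index] of some holder array. */
        record Slot(Obj[] holder, int index) {}

        private final ArrayDeque<Slot> locations = new ArrayDeque<>();
        private final ArrayDeque<Obj> gray = new ArrayDeque<>();

        /** Compute the closure from the roots, copying every reachable object. */
        void collect(Obj[] roots) {
            for (int i = 0; i < roots.length; i++)
                locations.add(new Slot(roots, i));          // enqueue root locations
            while (!locations.isEmpty() || !gray.isEmpty()) {
                while (!locations.isEmpty()) {
                    Slot s = locations.poll();
                    Obj ref = s.holder()[s.index()];
                    if (ref == null) continue;
                    if (ref.forwardee == null) {             // first visit: copy it
                        Obj copy = new Obj(ref.fields.length);
                        System.arraycopy(ref.fields, 0, copy.fields, 0, ref.fields.length);
                        ref.forwardee = copy;                // leave forwarding address
                        gray.add(copy);                      // scan its fields later
                    }
                    s.holder()[s.index()] = ref.forwardee;   // redirect the reference
                }
                if (!gray.isEmpty()) {                       // scan one gray object:
                    Obj o = gray.poll();                     // its fields become locations
                    for (int i = 0; i < o.fields.length; i++)
                        locations.add(new Slot(o.fields, i));
                }
            }
        }
    }

This sketch processes both deques in FIFO order, which approximates breadth-first; replacing add with push (LIFO) yields the depth-first order that performs better in these experiments.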
MarkSweep: Mark-sweep uses a free-list allocator and a tracing collector. When the heap is full, it triggers a collection. The collection traces and marks the live objects using bit maps, and lazily finds free slots during allocation. Tracing is thus proportional to the number of live objects, and reclamation is incremental and proportional to allocation. The tracing for MarkSweep is exactly the same as for SemiSpace, except that instead of copying an object, it marks a bit in a live object bit map. Since MarkSweep is a whole heap collector, its maximum pause time is poor, and its performance suffers from repeatedly tracing objects that survive many collections.
Implementation Details: The free-list uses segregated fits with a range of size classes similar to the Lea allocator [33]. MMTk uses 51 size classes, chosen to attain a worst case internal fragmentation of 1/8 for objects of less than 255 bytes. The size classes are 4 bytes apart from 8 to 63 bytes, 8 bytes apart from 64 to 127, 16 bytes apart from 128 to 255, 32 bytes apart from 256 to 511, 256 bytes apart from 512 to 2047, and 1024 bytes apart from 2048 to 8192. Small, word-aligned objects get an exact fit—in practice, these are the vast majority of all objects. All objects 8KB or larger get their own block (see Section 3.2.3). MarkSweep has no write barrier. The collector keeps the blocks of a size class in a circular list ordered by allocation time. It allocates the first free element in the first block. Finding the right fit is about 10% slower [13] than bump-pointer allocation. The free-list stores the bit vector for each block together with the block. Since block sizes vary from 256 bytes to 8K bytes, this organization may be a source of some conflict misses, but we leave that investigation for future work.
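The size-class schedule above translates directly into code. The following is our reconstruction from the stated spacings, not MMTk source; MMTk additionally aligns requests and lets the compiler evaluate the lookup statically for non-array types.

    /** Size-class schedule from Section 3.1: 4-byte steps in [8,63], 8-byte in [64,127],
     *  16-byte in [128,255], 32-byte in [256,511], 256-byte in [512,2047],
     *  1024-byte in [2048,8192]; requests above 8KB go to the large object space. */
    final class SizeClasses {
        static int cellSize(int bytes) {
            if (bytes <= 8) return 8;
            if (bytes <= 63) return roundUp(bytes, 4);
            if (bytes <= 127) return roundUp(bytes, 8);
            if (bytes <= 255) return roundUp(bytes, 16);
            if (bytes <= 511) return roundUp(bytes, 32);
            if (bytes <= 2047) return roundUp(bytes, 256);
            if (bytes <= 8192) return roundUp(bytes, 1024);
            return -1;  // 8KB or larger: large object space (own block)
        }
        private static int roundUp(int n, int step) { return ((n + step - 1) / step) * step; }

        public static void main(String[] args) {
            System.out.println(cellSize(20));   // 20  (4-byte spacing: exact fit)
            System.out.println(cellSize(100));  // 104 (8-byte spacing)
            System.out.println(cellSize(600));  // 768 (256-byte spacing)
        }
    }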
RefCount: The deferred reference-counting collector uses a free-list allocator. During mutation, the write barrier ignores stores to roots and logs mutated objects. The collector then periodically updates reference counts for root referents and generates reference count increments and decrements using the logged objects. It then deletes objects with a zero reference count and recursively applies decrements. It uses trial deletion to detect cycles [7]. Collection time is proportional to the number of dead objects, but the mutator load is significantly higher than for the other collectors, since the barrier logs every mutated heap object.

Implementation Details: RefCount uses object logging with coalescing [34]. RefCount thus records an object only the first time the program modifies it, and buffers decrements for all of its referent objects. At collection time, it (1) generates increments for all root and modified object referents, thus coalescing intermediate updates, (2) introduces temporary [7] increments for deferred objects (e.g., roots), and (3) deletes objects with a zero count. When a reference count goes to zero, it puts the object back on the free-list by setting a bit, and it decrements all its referents. On the next collection, it includes a decrement for all temporary increments from the previous collection.
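The object logging with coalescing that RefCount uses can be sketched as follows. This is our simplified Java model of the cited technique [34]; the names are ours, and root handling, temporary increments, and cycle detection are omitted.

    import java.util.ArrayDeque;
    import java.util.HashSet;
    import java.util.Set;

    /** Toy object for deferred reference counting with coalescing. */
    final class RcObj {
        RcObj[] fields = new RcObj[4];
        int rc;                  // reference count (root references are deferred)
        boolean logged;          // true once this object is in the modified-object log
    }

    final class DeferredRC {
        private final Set<RcObj> modLog = new HashSet<>();
        private final ArrayDeque<RcObj> decBuffer = new ArrayDeque<>();

        /** Write barrier: log an object only on its first mutation, buffering
         *  decrements for its current referents (coalescing later updates). */
        void writeBarrier(RcObj src, int field, RcObj newRef) {
            if (!src.logged) {
                src.logged = true;
                modLog.add(src);
                for (RcObj old : src.fields)
                    if (old != null) decBuffer.add(old);   // old referents lose a reference
            }
            src.fields[field] = newRef;                    // the store itself
        }

        /** Periodic collection: increment final referents, then apply decrements. */
        void collect() {
            for (RcObj src : modLog) {                     // coalesced increments
                for (RcObj ref : src.fields)
                    if (ref != null) ref.rc++;
                src.logged = false;
            }
            modLog.clear();
            while (!decBuffer.isEmpty()) {
                RcObj o = decBuffer.poll();
                if (--o.rc == 0)                           // dead: free and recurse
                    for (RcObj ref : o.fields)
                        if (ref != null) decBuffer.add(ref);
            }
        }
    }

Coalescing matters because a field mutated many times between collections generates only one increment for its final value and one decrement for its original value, rather than one pair per store.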
GenCopy: The classic copying generational collector [3] allocates into a young (nursery) space. The write barrier records pointers from mature to nursery objects. It collects when the nursery is full, and promotes survivors into a mature semi-space. When the mature space is exhausted, it collects the entire heap. When the program follows the weak generational hypothesis [35, 42], i.e., many young objects die quickly and old objects survive at a higher rate than young ones, GenCopy attains better performance than SemiSpace. GenCopy improves over SemiSpace in this case because it repeatedly collects the nursery, which yields a lot of free space; it compacts the survivors, which can improve mutator locality; and it incurs the collection cost of the mature objects infrequently. It also has better average pause times than SemiSpace, since the nursery is typically smaller than the entire heap.

GenMS: This hybrid generational collector uses a copying nursery and the MarkSweep policy for the mature generation. It allocates using a bump pointer and, when the nursery fills up, triggers a nursery collection. The write barrier, nursery collection, and nursery allocation policies and mechanisms are identical to those of GenCopy. The test for an exhausted heap must accommodate space for copying an entire nursery full of survivors into the MarkSweep space. GenMS should be better than MarkSweep for programs that follow the weak generational hypothesis. In comparison with GenCopy, GenMS can use memory more efficiently, since GenCopy reserves half the heap for copying space. However, both MarkSweep and GenMS can fragment the free space when objects are distributed among size classes. Infrequent collections can also contribute to spreading consecutively allocated (or promoted) objects out in memory. Both sources of fragmentation can reduce locality. Mark-compact collectors can reduce this fragmentation, but need one or two additional passes over the live and dead objects [20].

GenRC: This hybrid generational collector uses a copying nursery and RefCount for the mature generation [16]. It ignores mutations to nursery objects by marking them as logged, and logs the addresses of all mutated mature objects. When the nursery fills, it promotes nursery survivors into the reference counting space. As part of the promotion of nursery objects, it generates reference counts for them and their referents. At the end of the nursery collection, GenRC computes reference counts and deletes dead objects, as in RefCount. Since GenRC ignores the frequent mutations of nursery objects, it performs much better than RefCount. Collection time is proportional to the nursery size and the number of dead objects in the RefCount space. With a small nursery and other collection triggers, pause times are very low [16]. RefCount and GenRC are subject to the same free-list fragmentation issues as MarkSweep and GenMS. However, since GenRC collects the mature space at every collection, it is likely to maintain a smaller memory footprint.
3.2 Implementation Details

This section adds a few more implementation details about shared mechanisms, including the nursery size policies, write-barrier and allocation inlining, the reference counting header, the large object space, and the boot image.
3.2.1 Nursery size policies
By default, the generational collectors implement a variable nursery [3] whose initial size is half of the heap; the other half is reserved for copying. Each nursery collection reduces the nursery by the size of the survivors. When the available space for the nursery is too small (256KB by default), the collector triggers a mature space collection. MMTk also provides a bounded nursery, which takes a command line parameter as the initial nursery size, collects after the nursery is full, and resizes the nursery below the bound only when the mature space cannot accommodate a nursery of survivors. It shrinks using the above variable nursery policy with the same lower bound. The fixed nursery never reduces the size of the nursery, and thus triggers a whole heap collection sooner than a bounded nursery of the same size. The bounded nursery triggers more collections than the variable nursery, which uses space more efficiently, but when the variable nursery is large, pause time suffers.

3.2.2 Write-barrier and allocation inlining
For the generational collectors, MMTk inlines the write-barrier fast path, which filters stores to nursery objects and thus does not record most pointer updates, i.e., it ignores between 93.7% and 99.9% of pointer stores. The slow path makes the appropriate entries in the remembered set. Since the write barrier for RefCount is unconditional, it is fully inlined, but it forces the slow path object remembering mechanism out-of-line to minimize code bloat and compiler overhead [15]. SemiSpace and MarkSweep have no write barrier.

MMTk inlines the fast path for the allocation sequence. For the copying and generational allocators, the inlined sequence consists of incrementing a bump pointer and testing it against a limit pointer. If the test fails (the failure rate is typically 0.1%), the allocation sequence calls an out-of-line routine to acquire another block of memory, which may trigger a collection.

For the MarkSweep and RefCount free-list allocators, the inline allocation sequence consists of establishing the size class for the allocation (for non-array types, the compiler statically evaluates the size), and removing a free cell from the appropriate free-list, if such a cell is available. If there is no available free cell, the allocation path calls out-of-line to move to another block, or, if there are no more blocks of that size class, to acquire a new block.
gers a whole heap collection sooner than the bounded nursery of
that precompile as much as possible, including key libraries and
the same size. The bounded nursery triggers more collections than
the optimizing compiler and turn off assertion checking (the Fast
the variable nursery which uses space more efficiently, but when
build-time configuration). The adaptive compiler uses sampling to
the variable nursery is large, pause time suffers.
select methods to optimize, leading to high performance [4], but
3.2.2 Write-barrier and allocation inlining a lack of determinism. Eechout et al. use statistical techniques to
For the generational collectors, MMTk inlines the write-barrier fast show that including the adaptive compiler for short running pro-
path which filters stores to nursery objects and thus does not record grams skews the results to measure the virtual machine [26]. In ad-
most pointer updates, i.e., ignores between 93.7% to 99.9% of point- dition, adaptive compiler variations result in changes to allocation
er stores. The slow path makes the appropriate entries in the re- behavior and running time of the same run or runs with different
membered set. Since the write barrier for RefCount is uncondi- heap sizes. For example, sampling triggers compilation in different
tional, it is fully inlined but forces the slow path object remem- methods, and the compilation of different write barriers for each
bering mechanism out-of-line to minimize code bloat and compiler collector is part of the runtime system as well as the program and
overhead [15]. SemiSpace and MarkSweep have no write barrier. induces both different mutator behavior and collector load [15].
MMTk inlines the fast path for the allocation sequence. For Since our goal is to focus on application and garbage collection
the copying and generational allocators, the inlined sequence con- interactions, our pseudo adaptive approach deterministically mim-
sists of incrementing a bump pointer and testing it against a limit ics adaptive compilation.2 First we profile each benchmark five
pointer. If the test fails (failure rate is typically 0.1%), the alloca- times and select the best, collecting a log of the methods that the
tion sequence calls an out-of-line routine to acquire another block adaptive compiler chooses to optimize. This log is then used as
of memory, which may trigger a collection. deterministic compilation advice for the performance runs. For our
For the MarkSweep and RefCount free-list allocators, the inline performance runs, we run two iterations of each benchmark. In the
allocation sequence consists of establishing the size class for the first iteration, the compiler optimizes the methods in the advice file
allocation (for non-array types, the compiler statically evaluates the on demand, and base compiles the others. Before the second itera-
size), and removing a free cell from the appropriate free-list, if such tion, we perform a whole heap garbage collection to flush the heap
a cell is available. If there is no available free cell, the allocation of compiler objects. We then measure the second iteration which
path calls out-of-line to move to another block, or if there are no uses optimized code for hot methods and whose heap includes only
more blocks of that size class, to acquire a new block. application objects. We perform this experiment five times and re-
3.2.3 Header, large objects, and boot image port the fastest time. Our methodology thus avoids variations due
MMTk has a two word (8 byte) header for each object, which con- to adaptive compilation.
tains a pointer to the TIB (type information block located in the
immortal space, see below), hash bits, lock bits, and GC bits. A
4.2 Experimental Platform
one word header for MarkSweep collectors is possible, but not yet We perform our experiments on three architectures: Athlon, Pen-
implemented. Bacon et al. found that a one word header yields an tium 4, and Power PC. We present the Athlon results because it
average of 2-3% improvement in overall execution [6]. RefCount performs the best and it has a relatively simpler memory hierarchy
and the mature space in GenRC have an additional word (4 bytes) that is easier to analyze.
in the object headers to accommodate the reference count. 1A 2.3.2 pre-release, cvs timestamp ‘2004/03/25 05:11:47 UTC’.
MMTk allocates all objects 8KB or larger separately into a large 2 Xianglong Huang and Narendran Sachindran jointly implemented
object space (LOS) using an integral number of pages. The genera- the pseudo adaptive compilation mechanism.
                 alloc  alloc:  % GC  % Nur  Source Field (p)               Target Object (o = *p)
                  MB    min     SS    srv    % Read          % Focus       % Read          % Focus
                                             Nur  Mat  Imm   Nur   Mat     Nur  Mat  Imm   Nur   Mat
202 jess          261   17:1    63    1      29   44   27    0.4   69      18   62   20    0.2   97
228 jack          231   17:1    53    3      25   39   36    0.1   6       21   50   28    0.1   7
205 raytrace      135   8:1     46    2      19   75   6     0.3   48      18   78   4     0.3   49
227 mtrt          142   7:1     51    5      20   75   6     0.3   21      19   77   5     0.3   21
213 javac         185   7:1     29    23     30   46   24    0.4   2       25   55   21    0.3   3
201 compress      99    6:1     8     0      97   0    3     11.0  3       61   39   0     6.9   712
pseudojbb         216   5:1     21    32     16   59   25    0.2   2       14   72   15    0.1   2
209 db            82    4:1     11    9      5    69   26    0.3   49      1    89   9     0.1   63
222 mpegaudio     3     1:1     0     –      –    –    –     –     –       –    –    –     –     –

Table 1: Benchmark Characteristics. alloc = total MB allocated; alloc:min = ratio of total allocation to the minimum GenMS heap; % GC SS = percentage of time SemiSpace spends collecting at 2× the minimum heap; % Nur srv = percentage of allocated bytes copied out of a 4MB fixed nursery.
We use a 1.9GHz AMD Athlon XP 2600+. It has 64 byte L1 and L2 cache lines. The data and instruction L1 caches are 64KB 2-way set associative. It has a unified, exclusive 512KB 16-way set associative L2 cache and an 8 entry victim buffer [31] between the two caches. The L2 holds only replacement victims from the L1, and does not contain copies of data cached in the L1. When the L1 data cache evicts a line, it goes to the victim buffer, which in turn evicts the LRU line in the victim buffer into the L2. The Athlon has 1GB of dual channel 333MHz DDR RAM configured as 2 × 512MB DIMMs, with an nForce2 K7N2G motherboard and a 333MHz front side bus. This machine is marketed by AMD as comparable to a 2.6GHz Pentium 4.

The 2.6GHz Pentium 4 uses hyperthreading. It has 64 byte L1 and L2 cache lines, an 8KB 4-way set associative L1 data cache, a 12Kµops L1 instruction trace cache, and a 512KB unified 8-way set associative on-chip L2 cache. The machine has 1GB of dual channel 400MHz DDR RAM configured as 2 × 512MB DIMMs, with an Intel i865 motherboard and an 800MHz front side bus.

We also use an Apple Power Mac G5 with a 1.6GHz IBM PowerPC 970. It has 128 byte L1 and L2 cache lines, a 64KB direct mapped L1 instruction cache, a 32KB 2-way set associative L1 data cache, and a 512KB unified 8-way set associative on-chip L2 cache. The machine has 768MB of 333MHz DDR RAM with an Apple motherboard and an 800MHz front side bus.

All three platforms run the same configuration of Debian Linux with a 2.6.0 kernel. We run all experiments in standalone mode with all non-essential daemons and services (including the network interface) shut down. We instrument MMTk and Jikes RVM to use the AMD and Intel performance counters to measure cycles, retired instructions, L1 cache misses, L2 cache misses, and TLB misses of both the mutator and collector as the collector algorithm, heap size, and other features vary. Because of hardware limitations, each performance counter requires a separate execution. We use version 2.6.5 of the perfctr Intel/x86 hardware performance counters for Linux with the associated kernel patch and libraries [36]. At the time of writing, perfctr was unavailable for the PowerPC 970.

4.3 Benchmarks
Table 1 shows key characteristics of each of our benchmarks. We use the eight SPEC JVM benchmarks, and pseudojbb, a variant of SPEC JBB2000 [38, 39] that executes a fixed number of transactions to perform comparisons under a fixed garbage collection load. The alloc column in Table 1 indicates the total number of megabytes allocated. Our prior work reports on the adaptive compiler activity [13] and thus shows more allocation and higher ratios of live data to allocation. However, Eeckhout et al. show that the adaptive compiler swamps program behaviors, and thus the methodology we use here exposes variations due to the program instead of the VM. The alloc:min column quantifies the garbage collection load with the ratio of total allocation to the minimum heap size in which GenMS executes. For a heap size of 2× the minimum, the % GC SS column shows the percentage of time that SemiSpace spends performing GC work. The % Nur srv column quantifies generational behavior for a 4MB fixed size nursery using the percentage of allocated data that the collector copies out of the nursery.

The remaining columns indicate access patterns for object accesses. We instrument every pointer read 'o = *p' and count the dereferenced field, p (columns 6–10), and the referent object, o (last five columns). Table 1 includes the percentage of reads from the nursery (Nur), mature (Mat) and immortal (Imm) spaces. The focus columns present the accesses in the nursery and mature space divided by the number of bytes allocated in the nursery and mature space respectively. For example, in 202 jess, 29% of the time p is in the nursery, and 18% of the time the dereferenced object o is in the nursery. The focus of accesses to p in the mature space (69) was more than 100 times greater than that of accesses to p in the nursery (0.4). A higher number reflects higher temporal locality. 202 jess promotes only around 1% of data into the mature space, and yet 44% of 202 jess's field reads are to this 1%, while 29% are to the 99% of objects that never survive the nursery.

We group programs according to Table 1. 202 jess, 228 jack, 205 raytrace, and 227 mtrt exhibit low nursery survival and high ratios of total allocation to minimum live size. 213 javac, pseudojbb, and 209 db have higher nursery survival, but relatively high heap turnover. Two programs have high nursery survival and do not exercise collection much: 201 compress and 222 mpegaudio. 201 compress allocates large objects and requires little garbage collection. 222 mpegaudio allocates less than 4MB, and thus the generational collectors never collect it. The first two groups of programs are thus better tests of memory management influences and policies, and we focus on them. The results section presents representative benchmarks which we discuss in detail. Other benchmarks follow the same trends, except when noted. Complete results are included in a technical report [12].
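In symbols, the focus metric of Table 1 can be written as follows. This is our formalization of the definition above, up to the table's normalization constant:

\[ \mathrm{focus}_s(x) \;\propto\; \frac{\%\ \text{of reads of}\ x\ \text{that fall in space}\ s}{\text{bytes allocated in space}\ s}, \qquad s \in \{\mathrm{Nur}, \mathrm{Mat}\},\ x \in \{p, o\} \]

A high focus means many reads are concentrated on few allocated bytes, i.e., high temporal locality, as in the 202 jess mature space example above.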
5. Results

This section examines collector performance and its influence on mutator and total performance using the Athlon. We first explain how occasionally small changes in heap size cause variations in collection time. We then compare the whole heap and generational collectors, validating the uniform performance benefits of the weak generational hypothesis [35, 42]. We then tease apart the influences of allocation and collection mechanisms. Contiguous allocation yields better mutator locality than free-list allocation, but the space-efficient free-list reduces total collector load. For most programs, cache measurements reveal that the spatial locality of objects allocated close together in time is key for nursery objects, but not as important for mature objects. A fixed nursery isolates the influence of the mature space collection policy, showing that mutator performance is usually agnostic to mature space policies, with a few notable exceptions that need copying to achieve locality. However, when the mature space benefits from less frequent collection in GenMS, total time improves. Varying the nursery size reveals that frequent GCs in a small nursery degrade collector performance, and that nursery sizes well above the L2 cache size perform best. We then show that the same trends hold across the Athlon, P4, and PPC architectures.

Figure 1 and subsequent figures plot total time, garbage collection (GC) time, mutator time, and cache statistics for different benchmarks as a function of heap size. The right y-axis expresses time in seconds and the left y-axis normalizes to the fastest time. Heap size is shown as a multiple of the smallest heap size in which the particular application executes using GenMS on the bottom x-axis, and in megabytes (MB) on top.

5.1 Collector Sensitivity to Heap Size
Figure 1 shows the general trend that, up to some point, increases in heap size tend to decrease the frequency of garbage collection and thus total time. Each heap size is an independent trial. In all our experiments, the variation between runs at the same heap size is less than 1%. However, small changes in heap size can produce what seems like chaotic behavior, such as the differences in total and GC time between heap sizes of 1 and 1.3 times the minimum for GenMS on 213 javac. The reason is that a small change in heap size triggers collections at different points, which changes which objects a collector promotes. For instance, consider a program that builds a large but relatively short lived pointer data structure. In a small heap, the generational collection point happens just prior to when the program builds the data structure, and in a slightly larger heap it happens in the middle. In the second case, the collector promotes the data structure, which dies shortly thereafter, but it does not detect the death until a whole heap collection. In the meantime, the increased heap occupancy triggers the next nursery collection sooner, and so on. The exact timing of a collection can thus have cascading positive as well as negative effects, which explains variations between nearby heap sizes.

5.2 Evaluating Generational Behavior
This section compares whole heap collectors to their generational counterparts and explores the generational write-barrier cost. Figure 1 shows that for 202 jess, 209 db, and 213 javac the generational collectors perform much better than their whole heap variants. This result holds on all the benchmarks, although low-mortality, low GC load programs such as 201 compress only benefit in small heaps. Generational collectors reduce GC time for 202 jess by an order of magnitude, and even for 213 javac, where 23% of nursery objects survive, GenCopy improves GC time over SemiSpace by a factor of two, and GenMS improves over MarkSweep. The generational collectors reduce GC time by reducing the cost of each collection through examining only the nursery. Counting the number of collections (not shown) reveals that the reductions come from dramatically fewer collections as well. Because collection costs are heap-size dependent, the impact of GC time on total time is greatest in small to modestly sized heaps.

Examining mutator performance reveals that heap size does not systematically influence mutator time. This result is counterintuitive because, although the application itself is unchanged by heap size, larger heap sizes will tend to spread objects out more. Mutator time is, however, strongly correlated with the GC algorithm, where SemiSpace usually performs best. SemiSpace benefits from having no write barrier and from faster allocation than MarkSweep. The generational collectors benefit from contiguous allocation. GenCopy and SemiSpace perform about the same for 213 javac and 209 db, whereas mutator performance in GenCopy is around 20% slower than SemiSpace on 202 jess. We now show that this difference is mostly due to the write barrier.

5.2.1 The Write Barrier: Friend or Foe?
To examine the cost of the write barrier, we use a new collector which has the same heap organization, write barrier, and promotion policies as GenCopy, but traces (without collecting) the whole heap at each collection. It collects the whole heap only when the mature space is full. Because it always traces the entire heap, it establishes the liveness of nursery objects by reachability, so the write barrier is not required for correctness. The garbage collection overhead of this collector is substantial, so we do not recommend it, but it yields an experimental platform in which we can include or exclude the write barrier while holding all other factors constant, such as the heap organization and promotion policy.

Table 2 shows the overhead of the standard MMTk generational write barrier on mutator performance with a 4MB nursery and a moderate heap (3× minimum) on the Athlon platform. We show the percentage slowdown in the mutator when using the write barrier relative to mutator performance without the barrier. The overhead is low, 3.2% on average (3.1% for the P4 and 1.9% for the PPC). 202 jess suffers a substantial mutator slowdown. Table 1 points to the high mortality rate and the concentration of accesses on the few objects that do survive as the cause of the heavy write barrier traffic for 202 jess. However, the previous section shows that the massive reduction in collection costs swamps the mutator overhead in such a setting. Other benchmarks show very low overheads. For example, 222 mpegaudio never collects, thus no objects are ever in the mature space, and the write barrier test never adds to the remembered sets. The multi-issue architecture thus completely hides the barrier's cost in unused issue slots. So while the write barrier has the potential to be expensive, its overhead is usually very low, and the advantages seen at collection time far outweigh the cost.

                 Overhead %
202 jess          13.6
228 jack          1.7
205 raytrace      0.9
227 mtrt          2.9
213 javac         4.6
201 compress      0
pseudojbb         3.1
209 db            2.4
222 mpegaudio     0
Geometric mean    3.2

Table 2: Write Barrier Mutator Overhead for a 4MB Nursery

The combination of good mutator performance and outstanding GC performance is clear in the total time results. Even in 213 javac, which has low infant mortality, and in 209 db, which has a low GC work load, the generational collectors perform better than the whole heap collectors. In 202 jess, the advantage for the generational collectors is dramatic. This data supports the weak generational hypothesis, and indicates that even when it is less true, generational collectors offer benefits.

5.3 Allocation: Free List versus Contiguous
The essential allocator choice is free-list or contiguous, which in turn dictates the choice of collection algorithm. Free-list allocation is more expensive than contiguous allocation, but permits incremental freeing and obviates the need for a copy reserve. Contiguous allocation provides spatial locality for objects allocated close together in time, whereas free-list allocation may spread these objects out. To reveal the allocation time trade-offs, we examine their impact on the mutator. Since both RefCount and MarkSweep use the same free-list allocator, our analysis focuses on MarkSweep and GenMS, which are simpler than RefCount and GenRC.

5.3.1 Mutator costs in whole heap collectors
The contiguous and free-list allocators directly impact mutator performance as a consequence of the mutator allocation cost and the collection policies they impose. They also impact the mutator through their locality effects.
[Figure 1 appears here: nine panels plotting, as a function of heap size (1–6× the minimum, and in MB), (a–c) total time, (d–f) GC time, and (g–i) mutator time for 202 jess, 209 db, and 213 javac under all six collectors. Each panel shows normalized time on the left axis and time in seconds on the right.]

Figure 1: Total, Mutator and GC Performance of All Six Collectors
We first measure an upper bound on the time the program spends in contiguous allocation by pushing the allocation sequence out-of-line. This cost typically ranges from 1% to at most 10% of total time. We then use a micro benchmark to establish the relative costs of the two mechanisms. The benchmark allocates objects of the same size in a tight loop. Contiguous allocation is 11% faster than free-list allocation, allocating at 726 MB/s and 654 MB/s respectively. (We recently reported slower times on an older architecture [13].) Since allocation time is less than 10% of total time, this small difference between the mechanisms reduces to less than 1% of total time, and excludes the allocation sequence as a major source of variation.
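The micro benchmark is easy to model. The following sketch is under our assumptions, since the paper does not give the harness or the object size; the sink field defeats dead-code elimination so the allocations survive JIT optimization.

    /** Tight allocation loop: with a contiguous allocator each iteration compiles
     *  to a bump-and-compare; with a free-list, to a size-class lookup. */
    final class AllocLoop {
        static long sink;                      // defeat dead-code elimination
        static final int OBJECT_BYTES = 24;    // hypothetical fixed object size

        public static void main(String[] args) {
            final int iterations = 50_000_000;
            long start = System.nanoTime();
            for (int i = 0; i < iterations; i++) {
                byte[] o = new byte[OBJECT_BYTES];  // same-size object, dies immediately
                sink += o.length;
            }
            double seconds = (System.nanoTime() - start) / 1e9;
            double mbPerSec = iterations * (double) OBJECT_BYTES / (1 << 20) / seconds;
            System.out.printf("allocated at %.0f MB/s (sink=%d)%n", mbPerSec, sink);
        }
    }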
Figure 2 examines mutator time and memory hierarchy performance for 202 jess, 209 db, and pseudojbb, which have representative behaviors, plotting mutator time, L1 misses, L2 misses, and TLB misses as a function of heap size on a log scale. First consider SemiSpace and MarkSweep. SemiSpace mutator performance improvements range from 7 to 15% over MarkSweep (only on 201 compress and 222 mpegaudio is free-list allocation within 5%). The limit analysis above indicates that the direct effect of the allocator is typically 1% or less of this difference. Since the application code is otherwise identical, second order effects must dominate. 202 jess, 209 db, and pseudojbb each show a strong and consistent correlation between cache memory and mutator performance, where SemiSpace always improves over MarkSweep. Contiguous allocation in SemiSpace thus offers locality from two sources: allocation order and copying compaction. Free-list allocation in MarkSweep degrades program locality. The mutator benefit of SemiSpace over MarkSweep is relatively insensitive to heap size, suggesting that this benefit comes from allocation locality rather than from mature object compaction. An exception is the TLB performance on 202 jess, where the four copying collectors show a sharp reduction in TLB misses at smaller heap sizes, presumably due to collection-induced locality. However, L1 misses appear to dominate, so the reduction in TLB misses does not translate to a reduction in mutator time.

5.3.2 Mutator costs in generational collectors
We perform the following experiment to examine more closely whether SemiSpace locality is mostly due to allocation order or to the copying compaction of mature objects. We hold the work load on the mature space constant with a fixed-size nursery variant of the generational collectors. The young objects thus are all in allocation order. Since young objects are collected at the same frequency, only the mature space collection policies differ. Figure 3 shows the geometric mean of mutator performance across all benchmarks. When the nursery size is fixed, GenCopy and GenMS have very similar mutator performance. The locality of mature space objects is thus not a dominant effect on mutator performance.
[Figure 2 appears here: twelve panels plotting, on log scales as a function of heap size, (a–c) mutator time, (d–f) mutator L1 misses, (g–i) mutator L2 misses, and (j–l) mutator TLB misses for 202 jess, 209 db, and pseudojbb under all six collectors.]

Figure 2: Mutator time and L1, L2 and TLB misses for all six collectors (log scale)
As Section 5.4.1 discusses in more detail, the variable-size nursery attains a space advantage when combined with GenMS, which reduces the number of nursery collections, a direct benefit. The indirect benefit is slightly improved locality for nursery objects, since they stay in allocation order in the nursery for longer. Figure 3 suggests that compacting objects in the mature space free-list will be of little use for these programs. However, Figure 2 reveals the two exceptions to this rule: 209 db and pseudojbb.

The most striking counterpoint is 209 db, where the generational collectors make little impact on mutator time. Even the copying nursery in GenMS provides no advantage over MarkSweep. GenCopy slightly degrades mutator locality compared with SemiSpace, due to the write barrier (see Table 2). Section 4.3 shows that 209 db is dominated by mature space accesses, and thus nursery locality is immaterial for 209 db.

In pseudojbb, the copying nursery benefits GenMS compared to MarkSweep, but GenCopy still performs significantly better than both. This suggests that pseudojbb has mature space access patterns which are locality sensitive. The access pattern statistics in Section 4.3 confirm this result: mature space is accessed more heavily by pseudojbb, but the accesses are relatively unfocused.

Together, the whole heap and generational results indicate that free-list allocation significantly degrades locality, whereas contiguous allocation achieves locality on young objects from allocation order. Furthermore, a copying nursery ameliorates the locality penalty of the mature space free-list in all but 209 db and pseudojbb, where mature-space reads play a large role.
[Figure 3 appears here: mutator time (seconds, log scale) as a function of heap size (1–6× the minimum) for SemiSpace, MarkSweep, and fixed 4MB nursery GenCopy and GenMS.]

Figure 3: Mutator time for whole heap and fixed-size nursery collectors, geometric mean across all benchmarks

                  MarkSweep mutator            SemiSpace mutator
                  1.5× min          ratio      1.5× min          ratio
                  GCs   time (s)    ∞/1.5×     GCs   time (s)    ∞/1.5×
202 jess          27    2.48        0.97       50    1.97        1.18
228 jack          25    2.39        0.97       49    2.11        1.08
205 raytrace      10    2.35        0.98       25    2.04        1.06
227 mtrt          9     2.38        1.04       26    2.1         1.07
213 javac         12    4.57        0.99       28    3.8         1.03
201 compress      7     5.47        1.00       7     5.41        0.99
pseudojbb         9     7.21        1.00       32    6.04        1.07
209 db            5     13.78       1.01       22    12.81       0.86
222 mpegaudio     0     10.57       0.93       0     9.77        1.00
Geometric mean    8     4.57        0.99       18    4.02        1.03

Table 3: Impact of very large heap size on mutator time
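In the notation of Table 3, the ratio column is (our notation):

\[ \text{ratio} \;=\; \frac{t_{\mathrm{mut}}(\infty)}{t_{\mathrm{mut}}(1.5\times \mathrm{min})} \]

so values below 1 mean the mutator runs faster when the heap is never collected, and values above 1 mean never collecting degrades mutator performance.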
5.4 Collection: How, when, and whether?
The choice of allocation mechanism also governs the choice of collection mechanism. We now examine the time and space overheads of the collection algorithms and their influence on mutator locality. We consider how frequently to collect. We also show that our results are consistent across architectures, and then discuss whether we should choose garbage collection at all.
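Before the measurements, a back-of-envelope model, our own simplification rather than the paper's, frames the space-time trade-off that Section 5.4.1 quantifies. With total allocation A, live data L, and heap size H, a whole heap collector that can use a fraction u of the heap between collections performs roughly

\[ N \;\approx\; \frac{A}{uH - L} \]

collections. SemiSpace reserves half the heap for copying (u = 1/2), while MarkSweep can use nearly the whole heap (u ≈ 1), which is consistent with SemiSpace collecting between 1.5 and 2 times as often as MarkSweep at a given heap size, with the gap widening as the heap shrinks toward the live size.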
5.4 Collection: How, when, and whether?

The choice of allocation mechanism also governs the choice of collection mechanisms. We now examine the time and space overheads of the collection algorithms, and their influence on mutator locality. We consider how frequently to collect. We also show that our results are consistent across architectures, and then discuss whether we should choose garbage collection at all.

5.4.1 Garbage collection costs

Contiguous allocation dictates copying collection, which requires a copy reserve. The SemiSpace, GenCopy, and GenMS collector performance graphs reflect this copying space overhead, which leads to many more collections than pure MarkSweep—SemiSpace typically collects between 1.5 and 2 times as often as MarkSweep for a given heap size. For example, GC time in Figure 2 for SemiSpace is typically at least 50% worse than for MarkSweep. We measured the tracing rates for SemiSpace and MarkSweep on a microbenchmark: they are very close (59.5MB/sec and 59.2MB/sec), which means that the frequency of collection is the source of the overhead. In addition, GenMS with a variable nursery reduces the number of nursery collections over GenCopy because it is more space efficient. The first order effect of fewer collections is reduced collection time. A second order effect could be fewer cache line displacements by collector invocations. The stability of mutator cache performance as a function of heap size, in the face of dramatic differences in numbers of collections, dissuades us from this hypothesis.
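To see why the copy reserve roughly doubles collection frequency, consider a first-order estimate (a sketch of ours; it ignores fragmentation and per-space metadata). With heap size H and live data L, SemiSpace can allocate only H/2 − L bytes between collections, because half the heap is held as copy reserve, whereas MarkSweep can allocate H − L bytes. The ratio of collection counts is therefore approximately

    (H − L) / (H/2 − L),

which approaches 2 from above as L becomes small relative to H. Fragmentation reduces MarkSweep's effective capacity below H − L and pulls the measured ratio down, consistent with the observed factor of 1.5 to 2.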
5.4.2 Trading off collection cost and mutator locality

Total performance is of course a function of the mutator and collector performance. While contiguous allocation offers a significant mutator advantage, its copy reserve requirement results in a substantial overhead. In small heap sizes, collection time typically swamps total performance and overwhelms mutator locality differences; MarkSweep outperforms SemiSpace. In large heaps, mutator time dominates and SemiSpace outperforms MarkSweep. Figure 1 illustrates the crossovers in total performance for MarkSweep and SemiSpace on 213 javac and 202 jess.

As Sections 5.3.1 and 5.3.2 establish, the locality advantage of contiguous allocation is greatest among the young objects. These results indicate that the copying nursery combined with a space-efficient MarkSweep mature space offers a good combination of locality benefits and reduced collection costs. However, when mature space locality dominates, such as in 209 db, GenCopy can perform best.
5.4.3 Tracing or Reference Counting?

With a free-list, the collector can either trace the live objects from the roots or count references. Continuously tracking the number of references to each object is expensive, even with aggressive optimizations [22, 34], which the MMTk implementation also uses. This result is evident in Figure 1, where RefCount performs dramatically worse than MarkSweep for 202 jess and 213 javac. RefCount performs well on 201 compress, but this application is atypical. As discussed in Sections 5.2 and 5.3, there is compelling evidence for a generational policy with a copying nursery and a free-list in the mature space. The distinctly different demographics of young and old objects further motivate a hybrid generational reference counting policy [16]. Figure 1 shows that GenRC performs similarly to the other generational collectors, except in 213 javac, which has an unusually large amount of cyclic data structures [7]. The performance of GenRC is sensitive to the frequency of cycle detection, which we did not tune in these experiments. GenRC holds a potential locality and space advantage over GenMS because it promptly reclaims dead mature space objects, and thus can pack the free-list more tightly. GenRC performs reference counting at every nursery collection, whereas GenMS only infrequently performs whole heap collections. This promise is not borne out in Figure 1, but that may reflect the immaturity of the GenRC implementation rather than the fundamentals of the algorithm.
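To make that cost concrete, the sketch below (ours, not MMTk's code; the NaiveRC class, its Map-based count table, and the reclaim helper are illustrative assumptions) shows what an unoptimized reference-counting write barrier adds to every pointer store. Deferral [22] and coalescing [34] exist precisely to move most of this work off the hot path: deferred schemes skip counting for stacks and registers, and coalescing buffers repeated updates to the same slot and applies only the net effect.

    import java.util.IdentityHashMap;
    import java.util.Map;

    // Sketch only: an unoptimized reference-counting write barrier.
    // Every pointer store pays for two count updates and possibly a free.
    final class NaiveRC {
      private final Map<Object, Integer> rc = new IdentityHashMap<>();

      // Stands in for "src.field = newTarget"; returns the value to store.
      Object writeBarrier(Object oldTarget, Object newTarget) {
        if (newTarget != null)
          rc.merge(newTarget, 1, Integer::sum);               // increment
        if (oldTarget != null
            && rc.merge(oldTarget, -1, Integer::sum) == 0)    // decrement
          reclaim(oldTarget);  // immediate reclamation; cyclic garbage never reaches zero
        return newTarget;
      }

      private void reclaim(Object dead) {
        rc.remove(dead);
        // A real collector would also decrement the counts of objects
        // referenced from 'dead', possibly cascading further frees.
      }
    }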
5.4.4 How often?

We now examine the limits of not collecting, and then consider how often to collect the nursery.

If the heap is never collected and memory is monotonically consumed, the spatial locality of older objects should gradually degrade as neighboring objects die. Assuming an approximately uniform death rate over time, fragmentation will be an exponential function of age—older objects being the most fragmented, and the very most recently allocated objects suffering no fragmentation. To examine this effect, Table 3 compares the mutator time for each benchmark using contiguous and free-list allocation with a modest heap (1.5× minimum) and with an uncollected heap, large enough to avoid triggering any collection; for these benchmarks, 900MB is adequate. Only 202 jess follows the hypothesis that never collecting degrades performance. Since 202 jess has a high heap turnover and some accesses to mature space, it does suffer some fragmentation that degrades mutator performance when the heap is never collected.

Most of the other benchmarks have about the same mutator performance in the uncollected heap (∞) as in the modest heap. At first this result seems a little surprising in light of the inevitable degradation in locality among the older objects. However, as Section 5.3 showed, the spatial locality of the mature objects is not a dominant factor for these benchmarks. 209 db actually achieves better performance without collection because it attains good locality from contiguous allocation and it has a low GC workload. Blackburn et al. found that, on a more memory-constrained machine, never collecting caused severe degradation in 209 db due to paging [14].
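The uniform-death-rate assumption above has a simple consequence (a sketch of ours, not a measured result): if each live object dies independently with a constant hazard rate λ, the fraction of a region's objects still live at age t is

    s(t) = e^(−λt),

so live data thins out exponentially as a region ages, leaving the oldest regions the most fragmented and freshly allocated ones not fragmented at all.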
[Figure 4: Performance Effect of Nursery Size, 128KB to 32MB (log scale). Panels: (a) Mutator Time, (b) GC Time, (c) Total Time, (d) L1 Mutator Misses, (e) L2 Mutator Misses, (f) TLB Mutator Misses, (g) L1 GC Misses, (h) L2 GC Misses, (i) TLB GC Misses; curves: GenCopy, GenMS; x-axes: nursery size (KB), log scale.]
Table 1, together with the mutator locality results, indicates that all of the other programs have a slight majority of accesses to a few mature objects with good temporal locality, and accesses to a very large number of young objects with poor temporal locality (typically used briefly and then discarded). Thus, compression of mature space objects is not an important source of locality in these programs. We expect that server applications, and others with large memory usage and footprints, will follow 202 jess more closely than these results suggest.
5.4.5 Sizing the nursery

Given the performance advantages of generational collection, we now examine the influence of the nursery size. Figure 4 shows the performance of GenMS and GenCopy over a wide range of bounded nursery sizes (128KB to 32MB), running in a very large heap (900MB). Note that the x-axis in this figure is nursery size, rather than heap size as in all the other figures in this paper. Figure 4(a) shows a small improvement in mutator performance with larger nurseries, due to fewer L2 (Figure 4(e)) and TLB misses (Figure 4(f)). However, the difference in GC time dominates: smaller nurseries demand more frequent collection and thus a substantially higher load. We measured the fixed overhead of each collection and found that each invocation scanned around 64KB of roots. These fixed costs become significant when the nursery is as small as 128KB. The garbage collection cost tapers off between 4MB and 8MB as the fixed collection costs become insignificant. These results debunk the myth that the nursery size should be matched to the L2 cache size (512KB on all three architectures).
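These measurements suggest a simple amortization model (a sketch under our assumptions of a fixed root set and a constant survival fraction, not a fitted model). For total allocation A, nursery size n, fixed per-collection cost c_f (dominated here by scanning roughly 64KB of roots), per-byte copying cost c_c, and nursery survival fraction s, the nursery collection cost is approximately

    T_gc ≈ (A / n) · c_f + A · s · c_c.

The copying term is independent of n, while the fixed term falls off as 1/n: a 64KB root scan per invocation dominates a 128KB nursery but is negligible once the nursery reaches a few megabytes, matching the observed taper between 4MB and 8MB. Note that the L2 cache size appears nowhere in this trade-off.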
5.5 Architecture influences

Figure 5 compares the geometric mean of the benchmarks for all six collectors on the P4, Athlon, and PPC. The x-axis is heap size, and the y-axis is time. The P4 has the fastest clock speed, followed by the Athlon, and then the PPC. Intel would like us to believe that this ordering means the P4 will perform the best. Instead, the Athlon performs about 20% better. For the generational collectors, even the PPC is close to the P4. The Athlon's advantage comes from substantially fewer cache misses than the P4 (compare Figures 2 and 6). Due to its exclusive cache architecture, substantially larger L1, and higher-associativity L2, the Athlon simply has more effective cache capacity, and this advantage dominates clock speed.

The collectors follow the same trends discussed above on all of the architectures. The generational collectors perform best on all architectures due to reductions in collection time and the locality benefits of contiguous nursery allocation. However, the difference is more pronounced on the PPC than on the P4 or Athlon, which suggests that collection time has less influence on faster processors. The space advantage of MarkSweep over SemiSpace and the locality advantage of SemiSpace over MarkSweep show different cross-over points on each architecture. The faster the clock speed, the closer the cross-over point moves towards the minimum heap size: the cross-over where SemiSpace improves over MarkSweep is 2.2 for the P4, 3.4 for the Athlon, and 4 for the PPC. This trend suggests that for future processors, the locality advantages of contiguous allocation will become even more pronounced.
[Figure 5: Total time on three architectures. Panels: (a) P4, (b) Athlon, (c) PPC; curves: SemiSpace, MarkSweep, RefCount, GenCopy, GenMS, GenRC; x-axes: heap size (MB) and heap size relative to minimum heap size; y-axes: normalized time and time (sec).]
[Figure 6: P4 mutator L1, L2 and TLB misses for 202 jess (log scale). Panels: (a) 202 jess Mutator L1, (b) 202 jess Mutator L2, (c) 202 jess Mutator TLB; curves: SemiSpace, MarkSweep, RefCount, GenCopy, GenMS, GenRC; x-axes: heap size relative to minimum heap size. Compare with Figures 2(d), 2(g) and 2(j).]
5.6 Is garbage collection a good idea?

The software engineering benefits of garbage collection over explicit memory management are widely accepted, but the performance trade-off in languages designed for garbage collection is unexplored. Section 5.3 shows a clear mutator performance advantage for contiguous over free-list allocation, and the architectural comparison shows that architectural trends should make this advantage more pronounced. The traditional explicit memory management use of malloc() and free() is tightly coupled to the use of a free-list allocator—in fact, the MMTk free-list allocator implementation is based on the Lea allocator [33], which is the default allocator in standard C libraries. Standard explicit memory management is thus unable to exploit the locality advantages of contiguous allocation (contrast the bump-pointer sketch below). It is therefore possible that garbage collection presents a performance advantage over explicit memory management on current or future architectures. A striking example of this is seen in Figures 1(a) and 1(g), where the total time for GenMS matches or betters the mutator time for MarkSweep. Further exploration of this is unfortunately beyond our scope. Another alternative—not reclaiming memory at all—is unsustainable.
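The contrast is easy to see in code. The sketch below is ours, not MMTk's implementation (offsets into a byte array stand in for real addresses), but it captures why contiguous allocation is both cheap and locality-friendly: allocation is a bounds check plus a cursor increment, and consecutively allocated objects are adjacent.

    // Sketch of contiguous (bump-pointer) allocation. Consecutive
    // allocations occupy adjacent addresses, so objects allocated
    // together share cache lines and pages; a free list instead
    // returns whatever recycled block happens to fit.
    final class BumpAllocator {
      private final byte[] region;  // contiguous allocation region
      private int cursor = 0;       // next free byte

      BumpAllocator(int bytes) { region = new byte[bytes]; }

      // Returns the new object's offset, or -1 to request a collection.
      int alloc(int size) {
        if (cursor + size > region.length) return -1;
        int result = cursor;
        cursor += size;
        return result;
      }

      void reset() { cursor = 0; }  // e.g., after the region is evacuated
    }

An explicit malloc/free regime cannot adopt this discipline, because freed objects leave holes that must be recycled in place; that is the coupling to free lists noted above.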
6. Conclusion

This study examines the implications of the key policy choices in memory management on collection time, space, mutator locality, mutator performance, and total performance. A few key observations emerge. First, even if programs do not follow the generational hypothesis, the contiguous allocation of a copying nursery offers locality benefits that make the weak generational collectors always the collectors of choice. As a corollary, although many accesses go to mature objects, their performance relies on temporal locality, whereas in the nursery, allocation order provides good spatial locality for young objects that die quickly. We also show that the cost of the generational write barrier is usually low. Secondly, the choice of mature space collector should not be dictated only by space efficiency, which would always prefer MarkSweep, but should also take into account the rate of death among the mature objects, and the access and mutation rate of the mature space. If these rates are high, a copying mature space can attain better mutator locality that in the end overcomes its higher collection time penalty. These results can guide users to the right collector for their program, and offer insights to memory management designers for future collectors that could tune themselves on long running applications.
7. REFERENCES

[1] B. Alpern et al. Implementing Jalapeño in Java. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 314–324, Denver, CO, Nov. 1999.
[2] B. Alpern et al. The Jalapeño virtual machine. IBM Systems Journal, 39(1):211–238, Feb. 2000.
[3] A. W. Appel. Simple generational garbage collection and fast allocation. Software Practice & Experience, 19(2):171–183, 1989.
[4] M. Arnold, S. J. Fink, D. Grove, M. Hind, and P. Sweeney. Adaptive optimization in the Jalapeño JVM. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 47–65, Minneapolis, MN, Oct. 2000.
[5] C. R. Attanasio, D. F. Bacon, A. Cocchi, and S. Smith. A comparative evaluation of parallel garbage collectors. In Languages and Compilers for Parallel Computing, Lecture Notes in Computer Science. Springer-Verlag, 2001.
[6] D. Bacon, S. Fink, and D. Grove. Space- and time-efficient implementations of the Java object model. In Proceedings of the European Conference on Object-Oriented Programming (ECOOP), pages 111–132. ACM Press, June 2002.
[7] D. F. Bacon and V. T. Rajan. Concurrent cycle collection in reference counted systems. In J. L. Knudsen, editor, Proceedings of the 15th ECOOP, volume 2072 of Lecture Notes in Computer Science, pages 207–235. Springer-Verlag, 2001.
[8] H. G. Baker. The Treadmill: Real-time garbage collection without motion sickness. ACM SIGPLAN Notices, 27(3):66–70, 1992.
[9] E. D. Berger, K. S. McKinley, R. D. Blumofe, and P. R. Wilson. Hoard: A scalable memory allocator for multithreaded applications. In ACM Conference on Architectural Support for Programming Languages and Operating Systems, Cambridge, MA, Nov. 2000.
[10] E. D. Berger, B. G. Zorn, and K. S. McKinley. Composing high-performance memory allocators. In ACM SIGPLAN Conference on Programming Languages Design and Implementation, pages 114–124, Salt Lake City, UT, June 2001.
[11] E. D. Berger, B. G. Zorn, and K. S. McKinley. Reconsidering custom memory allocation. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 1–12, Seattle, WA, Nov. 2002.
[12] S. M. Blackburn, P. Cheng, and K. S. McKinley. Myths and realities: The performance impact of garbage collection. Technical Report TR-CS-04-04, Dept. of Computer Science, Australian National University, 2004.
[13] S. M. Blackburn, P. Cheng, and K. S. McKinley. Oil and water? High performance garbage collection in Java with JMTk. In ICSE, Scotland, UK, May 2004.
[14] S. M. Blackburn, R. E. Jones, K. S. McKinley, and J. E. B. Moss. Beltway: Getting around garbage collection gridlock. In Proceedings of the SIGPLAN 2002 Conference on Programming Languages Design and Implementation, pages 153–164, Berlin, Germany, June 2002.
[15] S. M. Blackburn and K. S. McKinley. In or out? Putting write barriers in their place. In ACM International Symposium on Memory Management, pages 175–183, Berlin, Germany, June 2002.
[16] S. M. Blackburn and K. S. McKinley. Ulterior reference counting: Fast garbage collection without a long wait. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 344–358, Anaheim, CA, Oct. 2003.
[17] H.-J. Boehm. Space efficient conservative garbage collection. In ACM SIGPLAN Conference on Programming Languages Design and Implementation, pages 197–206, 1993.
[18] T. Brecht, E. Arjomandi, C. Li, and H. Pham. Controlling garbage collection and heap growth to reduce the execution time of Java applications. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 353–366, Tampa, FL, 2001.
[19] C. J. Cheney. A non-recursive list compacting algorithm. Communications of the ACM, 13(11):677–678, Nov. 1970.
[20] J. Cohen and A. Nicolau. Comparison of compacting algorithms for garbage collection. ACM Transactions on Programming Languages and Systems, 5(4):532–553, Oct. 1983.
[21] D. L. Detlefs, A. Dosser, and B. Zorn. Memory allocation costs in large C and C++ programs. Software Practice & Experience, 24(6):527–542, June 1994.
[22] L. P. Deutsch and D. G. Bobrow. An efficient incremental automatic garbage collector. Communications of the ACM, 19(9):522–526, Sept. 1976.
[23] S. Dieckmann and U. Hölzle. A study of the allocation behavior of the SPECjvm98 Java benchmarks. In Proceedings of the European Conference on Object-Oriented Programming, pages 92–115, June 1999.
[24] E. Dijkstra, L. Lamport, A. Martin, C. Scholten, and E. Steffens. On-the-fly garbage collection: An exercise in cooperation. Communications of the ACM, 21(11):966–975, Sept. 1978.
[25] A. Diwan, D. Tarditi, and J. E. B. Moss. Memory subsystem performance of programs using copying garbage collection. In Conference Record of the Twenty-First ACM Symposium on Principles of Programming Languages, pages 1–14, Portland, OR, Jan. 1994.
[26] L. Eeckhout, A. Georges, and K. D. Bosschere. How Java programs interact with virtual machines at the microarchitectural level. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 169–186, Anaheim, CA, Oct. 2003.
[27] R. Fitzgerald and D. Tarditi. The case for profile-directed selection of garbage collectors. In ACM International Symposium on Memory Management, pages 111–120, Minneapolis, MN, Oct. 2000.
[28] M. W. Hicks, J. T. Moore, and S. Nettles. The measured cost of copying garbage collection mechanisms. In ACM International Conference on Functional Programming, pages 292–305, 1997.
[29] A. L. Hosking and R. L. Hudson. Remembered sets can also play cards, Oct. 1993. Position paper for the OOPSLA '93 Workshop on Memory Management and Garbage Collection.
[30] R. E. Jones and R. D. Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. Wiley, July 1996.
[31] N. P. Jouppi. Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. In Proceedings of the 17th International Symposium on Computer Architecture, pages 364–373, Seattle, WA, June 1990.
[32] J. Kim and Y. Hsu. Memory system behavior of Java programs: Methodology and analysis. In ACM SIGMETRICS Conference on Measurement & Modeling Computer Systems, pages 264–274, Santa Clara, CA, June 2000.
[33] D. Lea. A memory allocator. http://gee.cs.oswego.edu/dl/html/malloc.html, 1997.
[34] Y. Levanoni and E. Petrank. An on-the-fly reference counting garbage collector for Java. In ACM Conference on Object-Oriented Programming Systems, Languages, and Applications, pages 367–380, Tampa, FL, Oct. 2001.
[35] H. Lieberman and C. E. Hewitt. A real time garbage collector based on the lifetimes of objects. Communications of the ACM, 26(6):419–429, 1983.
[36] M. Pettersson. Linux Intel/x86 performance counters, 2003. http://user.it.uu.se/~mikpe/linux/perfctr/.
[37] Y. Shuf, M. J. Serrano, M. Gupta, and J. P. Singh. Characterizing the memory behavior of Java workloads: A structured view and opportunities for optimizations. In ACM SIGMETRICS Conference on Measurement & Modeling Computer Systems, pages 194–205, Cambridge, MA, June 2001.
[38] Standard Performance Evaluation Corporation. SPECjvm98 Documentation, release 1.03 edition, Mar. 1999.
[39] Standard Performance Evaluation Corporation. SPECjbb2000 (Java Business Benchmark) Documentation, release 1.01 edition, 2001.
[40] D. Stefanović, M. Hertz, S. M. Blackburn, K. McKinley, and J. Moss. Older-first garbage collection in practice: Evaluation in a Java virtual machine. In Memory System Performance, pages 175–184, June 2002.
[41] D. Tarditi and A. Diwan. Measuring the cost of storage management. Lisp and Symbolic Computation, 9(4), Dec. 1996.
[42] D. M. Ungar. Generation scavenging: A non-disruptive high performance storage reclamation algorithm. In ACM SIGSOFT/SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 157–167, Apr. 1984.
[43] B. G. Zorn. The measured cost of conservative garbage collection. Software Practice & Experience, 23(7):733–756, 1993.