[Figure 5. GC speedup against number of processors (1–8); legend gives each program's speedup on 8 processors: linear speedup (8), fulsom (4.75), constraints (4.45), gc_bench (3.70), circsim (3.28), lcss (3.06), power (2.42), spellcheck (2.28), ghc (1.98), happy (1.83), fibheaps (1.73).]

[Figure 7. Speedup on 8 processors against minimum chunk size (32, 128, 256, 4096 words); legend values as plotted: power (1.47), happy (1.33), gc_bench (1.01), fulsom (1.01), spellcheck (0.98), fibheaps (0.96), ghc (0.94), constraints (0.94), lcss (0.91), circsim (0.40).]
[Figure 6. Work balance against number of processors (1–8); legend includes ghc (4.60), spellcheck (3.96), happy (3.88), fibheaps (3.48).]
The amount of speedup varies significantly between programs. We aim to qualify the reasons for these differences in some of the measurements made in the following sections.

5.3 Measuring work-imbalance

If the work is divided unevenly between threads, then it is not possible to achieve the maximum speedup. Measuring work imbalance is therefore useful, because it gives us some insight into whether a less-than-perfect speedup is due to an uneven work distribution or to other factors. The converse doesn't hold: if we have perfect work distribution it doesn't necessarily imply perfect wall-clock speedup. For instance, the threads might be running in sequence, even though they are each doing the same amount of work.

In this section we quantify the work imbalance, and measure it for our benchmarks. An approximation to the amount of work done by a GC thread is the number of bytes it copies. This is an approximation because there are operations that involve copying no bytes: scanning a pointer to an already-evacuated object, for example. Still, we believe it is a reasonable approximation.
We define the work balance factor for a single GC to be C_tot/C_max, where C_tot is the total number of bytes copied by all GC threads and C_max is the largest number of bytes copied by any single thread. A perfectly balanced collection on N processors therefore has a balance factor of N, while the worst case, with all the work done by one thread, is 1.
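As a concrete illustration, here is a minimal sketch of this computation, assuming a per-thread array of copy counters; the array and function names are ours, not the runtime's.

#include <stddef.h>

/* Work balance factor C_tot / C_max for one GC, given a hypothetical
   array of the number of bytes copied by each GC thread.  Ranges from
   1 (all work done by one thread) to n_threads (perfectly even). */
static double work_balance(const size_t *copied, int n_threads)
{
    size_t c_tot = 0, c_max = 0;
    for (int i = 0; i < n_threads; i++) {
        c_tot += copied[i];
        if (copied[i] > c_max)
            c_max = copied[i];
    }
    return c_max ? (double)c_tot / (double)c_max
                 : 1.0;   /* degenerate case: this GC copied nothing */
}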
The work balance results in Figure 6 are broadly consistent with the speedup results from Figure 5: the top 3 and the bottom 4 programs are the same in both graphs. Work imbalance is clearly an issue affecting the wall-clock speedup, although it is not the only issue, since even when we get near-perfect work balance (e.g. constraints, 7.7 on 8 processors), the speedup we see is still only 4.5.

Work imbalance may be caused by two things:

• Failure of our algorithm to balance the available work
• Lack of actual parallelism in the heap: for example, when the heap consists mainly of a single linked list

When the work is balanced evenly, a lack of speedup could be caused by contention for shared resources, or by threads being idle (that is, the work is balanced but still serialised).

Gaining more insight into these results is planned for future work. The results we have presented are our current best effort, after identifying and fixing various instances of these problems (see for instance Section 3.7), but there is certainly more that can be done.

5.4 Varying the chunk size

The collector has a "minimum chunk size", which is the smallest amount of work it will push to the global Pending Block Set. When the Pending Block Set has plenty of work on it, we allow the allocation blocks to grow larger than the minimum size, in order to reduce the overhead of modifying the Pending Block Set too often.

Our default minimum chunk size, used to get the results presented so far, was 128 words (with a block size of 1024 words). The results for our benchmarks on 8 processors using different chunk sizes are given in Figure 7. The default chunk size of 128 words seems to be something of a sweet spot, although there is a little more parallelism to be had in fibheaps and constraints when using a 32-word chunk size.
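A hedged sketch of the export decision this implies; the constants, names, and watermark test are illustrative assumptions rather than the runtime's actual code.

#include <stddef.h>

#define MIN_CHUNK_WORDS 128   /* the default discussed above */

typedef struct {
    size_t unscanned_words;   /* copied-but-not-yet-scanned work in the block */
} AllocBlock;

/* A thread's allocation block is pushed to the global Pending Block
   Set only once it holds at least the minimum chunk of work, and only
   while the set looks short of work; otherwise the block keeps
   growing, avoiding traffic on the shared set. */
static int should_export(const AllocBlock *b, size_t pending_set_blocks,
                         size_t low_watermark)
{
    if (b->unscanned_words < MIN_CHUNK_WORDS)
        return 0;                                /* too little work to share */
    return pending_set_blocks < low_watermark;   /* share only when scarce */
}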
5.5 Lock contention

There are various mutexes in the parallel garbage collector, which we implement as simple spinlocks. The spinlocks are:

• The block allocator (one global lock)
• The remembered sets (one for each generation)
• The Pending Block Sets (one for each step)
• The large-object lists (one for each step)
• The per-object evacuation lock

The runtime system counts how many times each of these spinlocks is found to be contended, bumping a counter each time the requestor spins. We found very little contention for most locks. In particular, it is extremely rare that two threads attempt to evacuate the same object simultaneously: the per-object evacuation lock typically counts less than 10 spins per second during GC. This makes it all the more painful that this locking is so expensive, and it is why we plan to investigate relaxing this locking for immutable objects (Section 7).
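For concreteness, a minimal sketch of such a contention-counting spinlock, assuming GCC-style atomic builtins; the type and field names are ours, not the runtime's actual definitions.

/* Hypothetical contention-counting spinlock. */
typedef struct {
    volatile int locked;            /* 0 = free, 1 = held */
    unsigned long contended_count;  /* bumped each time a requestor spins */
} SpinLock;

static void spin_lock(SpinLock *l)
{
    while (__sync_lock_test_and_set(&l->locked, 1)) {
        l->contended_count++;       /* lock was contended: count it (a racy
                                       increment is fine for statistics) */
        while (l->locked)
            ;                       /* spin until the lock looks free */
    }
}

static void spin_unlock(SpinLock *l)
{
    __sync_lock_release(&l->locked);
}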
There was significant contention for the block allocator, especially in the programs that use a large heap. To reduce this contention we did two things:

• We rewrote the block allocator to maintain its free list more efficiently.
• We allocate multiple blocks at a time, and keep the spare ones in the thread's Partly Free List (Section 3.6). At the end of GC any unused free blocks are returned to the block allocator.
5.6 Fragmentation

A block-structured heap will necessarily waste some memory at the end of each block. This arises when:

• An object to be evacuated is too large to fit in the current block, so a few words are often lost at the end of a to-space block.
• The last to-space block to be allocated into will on average be half-full, and there is one such block for each step of each generation for each thread. These partially-full blocks will be used for to-space during the next GC, however.
• When work is scarce (see Section 3.6), partially-full blocks are exported to the Pending Block Set. The GC tries to fill up any partially-full blocks rather than allocating fresh empty blocks, but it is possible that some partially-full blocks remain at the end of GC. The GC will try to re-use them at the next GC if this happens.

We measured the amount of space lost due to these factors by comparing the actual amount of live data to the number of blocks allocated at the end of each GC. The runtime tracks the maximum amount of fragmentation at any one time over the run of a program and reports it at the end; we found that in all our benchmarks, the fragmentation was never more than 1% of the total memory allocated by the runtime. To put this in perspective, remember that copying GC wastes at least half the memory.
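A small sketch of this accounting, under the assumption that fragmentation is simply block capacity minus live data, sampled at the end of each GC; all names are illustrative.

#include <stddef.h>

#define BLOCK_WORDS 1024          /* block size used in our measurements */

static size_t frag_high_watermark = 0;

/* Fragmentation at the end of a GC is the gap between the memory held
   in allocated blocks and the live data inside them; the high-water
   mark would be reported when the program exits. */
static void record_fragmentation(size_t blocks_allocated, size_t live_words)
{
    size_t capacity = blocks_allocated * BLOCK_WORDS;
    size_t lost = capacity > live_words ? capacity - live_words : 0;
    if (lost > frag_high_watermark)
        frag_high_watermark = lost;
}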
5.7 Eager Promotion

To our knowledge this is the first time the idea of eager promotion has been presented, and it is startlingly effective in practice. Figure 8 shows the benefit of doing eager promotion (Section 4.1) in the single-threaded GC. On average, eager promotion reduces the time spent in garbage collection by 6.8%. One example (power) went slower with eager promotion turned on; this program turns out to be quite sensitive to small changes in the times at which GC strikes, and this effect dominates.

Program          Δ GC time (%)
circsim          -1.0
constraints      -18.4
fibheaps         -2.7
fulsom           -14.8
gc_bench         -6.2
ghc              -3.6
happy            -14.2
lcss             -15.6
power            +36.6
spellcheck       -17.4
Min              -18.4
Max              +36.6
Geometric Mean   -6.8

Figure 8. Effect of adding eager promotion
5.8 Miscellany

Here we list a number of other techniques or modifications that we tried, but have not made systematic measurements for.

• Varying the native block size used by the block allocator. In practice this makes little difference to performance until the block size gets too small, while too large a block size increases the amount of fragmentation. The current default of 4 kbytes is reasonable.
• When taking a block from the Pending Block Set, do we take the block most recently added to the set (LIFO), or the block added first (FIFO)? Right now, we use FIFO, as we found it increased parallelism slightly, although LIFO might be better from a cache perspective. All other things being equal, it would make sense to take blocks of work recently generated by the current thread, in the hope that they would still be in the cache. Another strategy we could try is to take a random block, on the grounds that it would avoid accidentally hitting any worst-case behaviour. (Both take policies are sketched after this list.)
• Adding more generations and steps doesn't help for these benchmarks, although we have found in the past that adding a generation is beneficial for very long-running programs.
• At one stage we used to have a separate to-space for objects that do not need to be scavenged, because they have no pointer fields (boxed integers and characters, for example). However, the runtime system has statically pre-allocated copies of small integers and characters, giving a limited form of hash-consing, which meant that usually less than 1% of the dynamic heap consisted of objects with no pointers. There was virtually no benefit in practice from this optimisation, and it added some complexity to the code, so it was discarded.
• We experimented with pre-fetching in the GC, with very limited success. Pre-fetching to-space ahead of the allocation pointer is easy, but gives no benefit on modern processors, which tend to spot sequential access and pre-fetch automatically. Pre-fetching the scan block ahead of the scan pointer suffers from the same problem. Pre-fetching fields of objects that will shortly be scanned can be beneficial, but we found in practice that it was extremely difficult and processor-dependent to tune the distance at which to prefetch. Currently, our GC does no explicit prefetching.
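To make the two take policies concrete, here is a sketch of the Pending Block Set as a doubly-linked deque from which either end can be taken; the representation is an assumption for illustration, not the runtime's actual structure.

#include <stddef.h>

/* Illustrative Pending Block Set as a deque of blocks: threads push at
   the tail, so the head holds the oldest block (FIFO) and the tail the
   newest (LIFO). */
typedef struct PBlock_ {
    struct PBlock_ *prev, *next;
    /* ... block contents ... */
} PBlock;

typedef struct { PBlock *head, *tail; } PendingBlockSet;

static PBlock *take_fifo(PendingBlockSet *s)      /* our current policy */
{
    PBlock *b = s->head;
    if (b) {
        s->head = b->next;
        if (s->head) s->head->prev = NULL; else s->tail = NULL;
    }
    return b;
}

static PBlock *take_lifo(PendingBlockSet *s)      /* cache-friendlier alternative */
{
    PBlock *b = s->tail;
    if (b) {
        s->tail = b->prev;
        if (s->tail) s->tail->next = NULL; else s->head = NULL;
    }
    return b;
}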
6. Related work

There follows a survey of the related work in this area. Jones provides an introduction to classical sequential GC [JL96]. We focus on tracing collectors based on exploring the heap by reachability from root references. Moreover, we consider only parallel copying collectors, omitting those that use compaction or mark-sweep. We also restrict our discussion to algorithms that are practical for general-purpose use.

Halstead [RHH85] developed a parallel version of Baker's incremental semispace copying collector. During collection the heap is logically partitioned into per-thread from-spaces and to-spaces. Each thread traces objects from its own set of roots, copying them into its own to-space. Fine-grained locking is used to synchronize access to from-space objects, although Halstead reports that such contention is very rare. As Halstead acknowledges, this approach can lead to work imbalance and to heap overflow. Halstead makes heap overflow less likely by dividing to-spaces into 64K-128K chunks which are allocated on demand.

Many researchers have explored how to avoid the work imbalance problems in early parallel copying collectors. It is impractical to avoid work imbalance statically: the collector does not know ahead of time which roots will lead to large data structures and which to small ones. The main technique therefore is to dynamically re-balance work from busy threads to idle threads.

Imai and Tick [IT93] developed the first parallel copying GC algorithm with dynamic work balancing. They divide to-space into blocks, with each active GC thread having a "scan" block (of objects that it is tracing from) and a "copy" block (into which it copies objects it finds in from-space). If a thread fills its copy block then it adds it to a shared work pool, allocates a fresh copy block, and continues scanning. If a thread finishes its scan block then it fetches a fresh block from the work-pool. The size of the blocks provides a trade-off between the time spent synchronizing on the work-pool and the potential work imbalance. Siegwart and Hirzel [SH06] extend this approach to copy objects in hierarchical order.

Endo et al [ETY97] developed a parallel mark-sweep collector based on the Boehm-Demers-Weiser conservative GC. They use work-stealing to avoid load-imbalance during the mark phase: GC threads have individual work queues and if a thread's own queue becomes empty then it steals work from another's. Endo et al manage work at a finer granularity than Imai and Tick: they generally use per-object work items, but also sub-divide large objects into 512-byte chunks for tracing. They found fine-granularity work management valuable because of large arrays in the scientific workloads that they studied. They parallelise the sweep phase by over-partitioning the heap into batches of blocks that are processed in parallel, meaning that there are more batches than GC threads and that GC threads dynamically claim new batches as they complete their work.

Flood et al [FDSZ01] developed a parallel semispace copying collector. As with Endo et al, they avoid work-imbalance by per-object work-stealing. They parallelise root scanning by over-partitioning the root set (including the card-based remembered set in generational configurations). As with Imai and Tick, each GC thread allocates into its own local memory buffer. Flood et al also developed a parallel mark-compact collector which we do not discuss here.

Concurrent with Flood et al, Attanasio et al [ABCS01] developed a modular GC framework for Java on large symmetric multiprocessor machines executing server applications. Unlike Flood et al, Attanasio et al's copying collector performed load balancing using work buffers of multiple pointers to objects. A global list is maintained of full buffers ready to process. Attanasio reports that this coarser mechanism scales as well as Flood et al's fine-grained design on the javac and SPECjbb benchmarks; as with Imai and Tick's design, the size of the buffers controls a trade-off between synchronization costs and work imbalance.

Also concurrent with Flood et al, Cheng and Blelloch developed a parallel copying collector using a shared stack of objects waiting to be traced [BC99, CB01]. Each GC thread periodically pushed part of its work onto the shared stack and took work from the shared stack when it exhausted its own work. The implementation of the stack is simplified by a gated synchronization mechanism so that pushes are never concurrent with pops.

Ben-Yitzhak et al [BYGK+02] augmented a parallel mark-sweep collector with periodic clearing of an "evacuated area" (EA). The basic idea is that if this is done sufficiently often then it prevents fragmentation building up. The EA is chosen before the mark phase and, during marking, references into the EA are identified. After marking, the objects in the EA are evacuated to new locations (parallelized by over-partitioning the EA) and the references found during the mark phase are updated. In theory performance may be harmed by a poor choice of EA (e.g. one that does not reduce fragmentation because all the objects in it are dead) or by workload imbalance when processing the EA. In practice the first problem could be mitigated by a better EA-selection algorithm and, for modest numbers of threads, the impact of the second is not significant.

This approach was later used in Barabash et al's parallel framework [BBYG+05]. Barabash et al also describe the "work packet" abstraction they developed to manage the parallel marking phases. As with Attanasio et al's work buffers, this provides a way to batch communication between GC threads. Each thread has one input packet from which it is taking marking work and one output packet into which it places work that it generates. These packets remain distinct (unlike Imai and Tick's chunks [IT93]) and are shared between threads only at a whole-packet granularity (unlike per-object work stealing in Endo et al's and Flood et al's work). Barabash et al report that this approach makes termination detection easy (all the packets must be empty) and makes it easy to add or remove GC threads (because the shared pool's implementation is oblivious to the number of participants).

Petrank and Kolodner [PK04] observe that existing parallel copying collectors allocate objects into per-thread chunks, raising the possibility of fragmentation of to-space. They showed how this could be avoided by "delayed allocation" of to-space copies of objects: GC threads form batches of proposed to-space allocations which are then performed by a single CAS on a shared allocation pointer. This guarantees that there is no to-space fragmentation while avoiding per-allocation CAS operations. In many systems the impact of this form of fragmentation is small because the number of chunks with left-over space is low.

There are thus two kinds of dynamic re-balancing: fine-grained schemes like Endo et al [ETY97] and Flood et al [FDSZ01] which work at a per-object granularity, and batching schemes like Imai and Tick [IT93], Attanasio et al [ABCS01] and Barabash et al [BBYG+05] which group work into blocks or packets. Both approaches have their advantages. Fine-grained schemes reduce the latency between one thread needing work and it being made available and, as Flood et al argue, the synchronization overhead can be mitigated by carefully designed work-stealing systems. Block-based schemes may make termination decisions easier (particularly if available work is placed in a single shared pool) and make it easier to express different traversal policies by changing how blocks are selected from the pool (as Siegwart and Hirzel's work illustrated [SH06]).

We attempt to combine the best of these approaches. In particular, we try to keep the latency in work distribution low by exporting incomplete blocks when there are idle threads. Furthermore, as with Imai and Tick's [IT93] and Siegwart and Hirzel's [SH06] work, we represent our work items by areas of to-space, avoiding the need to reify them in a separate queue or packet structure.

A number of collectors have exploited the immutability of most data in functional languages. Doligez and Leroy's concurrent collector for ML [DL93] uses per-thread private heaps and allows multiple threads to collect their own private heaps in parallel. They simplify this by preserving an invariant that there are no inter-private-heap references and no references from a common shared heap into any thread's private heap: the new referents of mutable objects are copied into the shared heap, and mutable objects themselves are allocated in the shared heap. This exploits the fact that in ML (as in Haskell) most data is immutable. Huelsbergen and Larus' concurrent copying collector [HL93] also exploits the immutability of data in ML: immutable data can be copied in parallel with concurrent accesses by the mutator.
7. Conclusion and further work

The advent of widespread multi-core processors offers an attractive opportunity to reduce the costs of automatic memory management with zero programmer intervention. The opportunity is somewhat tricky to exploit, but it can be done, and we have demonstrated real wall-clock benefits achieved by our algorithm. Moreover, we have made progress toward explaining the lack of perfect speedup by measuring the load imbalance in the GC and showing that this correlates well with the wall-clock speedup.

There are two particular directions in which we would like to develop our collector.

7.1 Reducing per-object synchronisation

As noted in Section 3.1, a GC thread uses an atomic CAS instruction to gain exclusive access to a from-space heap object. The cost of atomicity here is high: 20–30% (Section 5.1), and we would like to reduce it.

Many heap objects in a functional language are immutable, and the language does not support pointer-equality. If such an immutable object is reachable via two different pointers, it is therefore semantically acceptable to copy the object twice into to-space. Sharing is lost, and the heap size may increase slightly, but the mutator can see no difference.

So the idea is simple: for immutable objects, we avoid using atomic instructions to claim the object, and accept the small possibility that the object may be copied more than once into to-space. We know that contention for individual objects happens very rarely in the GC (Section 5.5), so we expect the amount of accidental duplication to be negligible in practice.
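A minimal sketch of the idea, assuming a copy-then-publish scheme in which the forwarding pointer is installed after the copy; the header encoding, helper functions, and names are illustrative assumptions, not GHC's actual object layout.

#include <stdint.h>
#include <string.h>

typedef struct { volatile uintptr_t header; } Obj;  /* payload follows */

extern void  *to_space_alloc(size_t bytes);   /* assumed word-aligned allocator */
extern size_t obj_size(const Obj *o);
extern int    obj_is_immutable(const Obj *o);

#define FWD_BIT     ((uintptr_t)1)            /* low bit marks a forwarding pointer */
#define IS_FWD(h)   ((h) & FWD_BIT)
#define FWD_PTR(h)  ((Obj *)((h) & ~FWD_BIT))

static Obj *evacuate(Obj *from)
{
    uintptr_t h = from->header;
    if (IS_FWD(h))
        return FWD_PTR(h);                    /* already copied by someone */

    size_t sz = obj_size(from);
    Obj *to = to_space_alloc(sz);
    memcpy(to, from, sz);

    if (obj_is_immutable(from)) {
        /* No CAS: two threads may occasionally both copy this object,
           which is semantically harmless for immutable data in a
           language without pointer equality. */
        from->header = (uintptr_t)to | FWD_BIT;
        return to;
    }
    /* Mutable object: exactly one copy must win. */
    if (__sync_bool_compare_and_swap(&from->header, h, (uintptr_t)to | FWD_BIT))
        return to;
    /* Lost the race: use the winner's copy; ours becomes garbage. */
    return FWD_PTR(from->header);
}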
7.2 Privatising minor collections

A clear shortcoming of the system we have described is that all garbage collection is global: all the processors stop, agree to garbage collect, perform garbage collection, and resume mutation. It would be much better if a mutator thread could perform local garbage collection on its private heap without any interaction with other threads whatsoever. We plan to implement such a scheme, very much along the lines described by Doligez and Leroy [DL93].

References

[ABCS01] C. Attanasio, D. Bacon, A. Cocchi, and S. Smith. A comparative evaluation of parallel garbage collectors. In Fourteenth Annual Workshop on Languages and Compilers for Parallel Computing, pages 177–192, Cumberland Falls, KY, 2001. Springer-Verlag.

[ABP98] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In SPAA '98: Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 119–129. ACM Press, June 1998.

[BBYG+05] Katherine Barabash, Ori Ben-Yitzhak, Irit Goft, Elliot K. Kolodner, Victor Leikehman, Yoav Ossia, Avi Owshanko, and Erez Petrank. A parallel, incremental, mostly concurrent garbage collector for servers. ACM Trans. Program. Lang. Syst., 27(6):1097–1146, 2005.

[BC99] Guy E. Blelloch and Perry Cheng. On bounding time and space for multiprocessor garbage collection. In PLDI '99: Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation, pages 104–117. ACM, 1999.

[BYGK+02] Ori Ben-Yitzhak, Irit Goft, Elliot K. Kolodner, Kean Kuiper, and Victor Leikehman. An algorithm for parallel incremental compaction. In ISMM '02: Proceedings of the 3rd international symposium on Memory management, pages 100–105. ACM, 2002.

[CB01] Perry Cheng and Guy E. Blelloch. A parallel, real-time garbage collector. In PLDI '01: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, pages 125–136. ACM, 2001.

[Che70] C. J. Cheney. A nonrecursive list compacting algorithm. Commun. ACM, 13(11):677–678, 1970.

[DEB94] R. Kent Dybvig, David Eby, and Carl Bruggeman. Don't stop the BIBOP: Flexible and efficient storage management for dynamically-typed languages. Technical Report 400, Indiana University Computer Science Department, 1994.

[DL93] Damien Doligez and Xavier Leroy. A concurrent, generational garbage collector for a multithreaded implementation of ML. In POPL '93: Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 113–123. ACM, 1993.

[ETY97] Toshio Endo, Kenjiro Taura, and Akinori Yonezawa. A scalable mark-sweep garbage collector on large-scale shared-memory machines. In Supercomputing '97: Proceedings of the 1997 ACM/IEEE conference on Supercomputing (CDROM), pages 1–14. ACM, 1997.

[FDSZ01] Christine Flood, Dave Detlefs, Nir Shavit, and Catherine Zhang. Parallel garbage collection for shared memory multiprocessors. In Usenix Java Virtual Machine Research and Technology Symposium (JVM '01), Monterey, CA, 2001.

[HL93] Lorenz Huelsbergen and James R. Larus. A concurrent copying garbage collector for languages that distinguish (im)mutable data. SIGPLAN Not., 28(7):73–82, 1993.

[HMP05] Tim Harris, Simon Marlow, and Simon Peyton Jones. Haskell on a shared-memory multiprocessor. In ACM Workshop on Haskell, Tallinn, Estonia, 2005. ACM.

[IT93] A. Imai and E. Tick. Evaluation of parallel copying garbage collection on a shared-memory multiprocessor. IEEE Trans. Parallel Distrib. Syst., 4(9):1030–1040, 1993.

[JL96] Richard Jones and Rafael Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. John Wiley and Sons, July 1996.

[Par92] W. D. Partain. The nofib benchmark suite of Haskell programs. In J. Launchbury and P. M. Sansom, editors, Functional Programming, Glasgow 1992, pages 195–202. 1992.

[PK04] Erez Petrank and Elliot K. Kolodner. Parallel copying garbage collection using delayed allocation. Parallel Processing Letters, 14(2), June 2004.

[RHH85] Robert H. Halstead, Jr. Multilisp: a language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst., 7(4):501–538, 1985.

[SH06] David Siegwart and Martin Hirzel. Improving locality with parallel hierarchical copying GC. In ISMM '06: Proceedings of the 5th international symposium on Memory management, pages 52–63. ACM, 2006.

[Ste] Don Stewart. nobench: Benchmarking Haskell implementations. http://www.cse.unsw.edu.au/~dons/nobench.html.

[Ste77] Guy Lewis Steele Jr. Data representations in PDP-10 MacLISP. Technical report, MIT Artificial Intelligence Laboratory, 1977. AI Memo 420.

[Ung84] D. Ungar. Generation scavenging: A non-disruptive high performance storage management reclamation algorithm. In ACM SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 157–167, Pittsburgh, Pennsylvania, April 1984.