[Figure 5. GC speedup against number of processors (1–8); legend gives each program's speedup on 8 processors: linear speedup (8), fulsom (4.75), constraints (4.45), gc_bench (3.70), circsim (3.28), lcss (3.06), power (2.42), spellcheck (2.28), ghc (1.98), happy (1.83), fibheaps (1.73).]

[Figure 7. Speedup on 8 processors against minimum chunk size (32, 128, 256, 4096 words); legend values as plotted: power (1.47), happy (1.33), gc_bench (1.01), fulsom (1.01), spellcheck (0.98), fibheaps (0.96), ghc (0.94), constraints (0.94), lcss (0.91), circsim (0.40).]
[Figure 6. Work balance against number of processors (1–8); legend includes ghc (4.60), spellcheck (3.96), happy (3.88), fibheaps (3.48).]
The amount of speedup varies significantly between programs. We aim to qualify the reasons for these differences in some of the measurements made in the following sections.

5.3 Measuring work-imbalance

If the work is divided unevenly between threads, then it is not possible to achieve the maximum speedup. Measuring work imbalance is therefore useful, because it gives us some insight into whether a less-than-perfect speedup is due to an uneven work distribution or to other factors. The converse doesn't hold: if we have perfect work distribution it doesn't necessarily imply perfect wall-clock speedup. For instance, the threads might be running in sequence, even though they are each doing the same amount of work.

In this section we quantify the work imbalance, and measure it for our benchmarks. An approximation to the amount of work done by a GC thread is the number of bytes it copies. This is an approximation because there are operations that involve copying no bytes: scanning a pointer to an already-evacuated object, for example. Still, we believe it is a reasonable approximation.
We define the work balance factor for a single GC to be C_tot/C_max, where C_tot is the total number of bytes copied by all GC threads and C_max is the largest number of bytes copied by any single thread. A perfectly balanced collection on N processors therefore has a balance factor of N, while the worst case, with all the work done by one thread, is 1.
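As a concrete illustration, here is a minimal sketch of this computation, assuming a per-thread array of copy counters; the array and function names are ours, not the runtime's.

#include <stddef.h>

/* Work balance factor C_tot / C_max for one GC, given a hypothetical
   array of the number of bytes copied by each GC thread.  Ranges from
   1 (all work done by one thread) to n_threads (perfectly even). */
static double work_balance(const size_t *copied, int n_threads)
{
    size_t c_tot = 0, c_max = 0;
    for (int i = 0; i < n_threads; i++) {
        c_tot += copied[i];
        if (copied[i] > c_max)
            c_max = copied[i];
    }
    return c_max ? (double)c_tot / (double)c_max
                 : 1.0;   /* degenerate case: this GC copied nothing */
}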
The work balance results in Figure 6 are broadly consistent with the speedup results from Figure 5: the top 3 and the bottom 4 programs are the same in both graphs. Work imbalance is clearly an issue affecting the wall-clock speedup, although it is not the only issue, since even when we get near-perfect work balance (e.g. constraints, 7.7 on 8 processors), the speedup we see is still only 4.5.

Work imbalance may be caused by two things:

• Failure of our algorithm to balance the available work
• Lack of actual parallelism in the heap: for example, when the heap consists mainly of a single linked list

When the work is balanced evenly, a lack of speedup could be caused by contention for shared resources, or by threads being idle (that is, the work is balanced but still serialised).

Gaining more insight into these results is planned for future work. The results we have presented are our current best effort, after identifying and fixing various instances of these problems (see for instance Section 3.7), but there is certainly more that can be done.

5.4 Varying the chunk size

The collector has a "minimum chunk size", which is the smallest amount of work it will push to the global Pending Block Set. When the Pending Block Set has plenty of work on it, we allow the allocation blocks to grow larger than the minimum size, in order to reduce the overhead of modifying the Pending Block Set too often.

Our default minimum chunk size, used to get the results presented so far, was 128 words (with a block size of 1024 words). The results for our benchmarks on 8 processors using different chunk sizes are given in Figure 7. The default chunk size of 128 words seems to be something of a sweet spot, although there is a little more parallelism to be had in fibheaps and constraints when using a 32-word chunk size.
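A hedged sketch of the export decision this implies; the constants, names, and watermark test are illustrative assumptions rather than the runtime's actual code.

#include <stddef.h>

#define MIN_CHUNK_WORDS 128   /* the default discussed above */

typedef struct {
    size_t unscanned_words;   /* copied-but-not-yet-scanned work in the block */
} AllocBlock;

/* A thread's allocation block is pushed to the global Pending Block
   Set only once it holds at least the minimum chunk of work, and only
   while the set looks short of work; otherwise the block keeps
   growing, avoiding traffic on the shared set. */
static int should_export(const AllocBlock *b, size_t pending_set_blocks,
                         size_t low_watermark)
{
    if (b->unscanned_words < MIN_CHUNK_WORDS)
        return 0;                                /* too little work to share */
    return pending_set_blocks < low_watermark;   /* share only when scarce */
}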
5.5 Lock contention

There are various mutexes in the parallel garbage collector, which we implement as simple spinlocks. The spinlocks are:

• The block allocator (one global lock)
• The remembered sets (one for each generation)
• The Pending Block Sets (one for each step)
• The large-object lists (one for each step)
• The per-object evacuation lock

The runtime system counts how many times each of these spinlocks is found to be contended, bumping a counter each time the requestor spins. We found very little contention for most locks. In particular, it is extremely rare that two threads attempt to evacuate the same object simultaneously: the per-object evacuation lock typically counts less than 10 spins per second during GC. This makes it all the more painful that this locking is so expensive, and it is why we plan to investigate relaxing this locking for immutable objects (Section 7).
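For concreteness, a minimal sketch of such a contention-counting spinlock, assuming GCC-style atomic builtins; the type and field names are ours, not the runtime's actual definitions.

/* Hypothetical contention-counting spinlock. */
typedef struct {
    volatile int locked;            /* 0 = free, 1 = held */
    unsigned long contended_count;  /* bumped each time a requestor spins */
} SpinLock;

static void spin_lock(SpinLock *l)
{
    while (__sync_lock_test_and_set(&l->locked, 1)) {
        l->contended_count++;       /* lock was contended: count it (a racy
                                       increment is fine for statistics) */
        while (l->locked)
            ;                       /* spin until the lock looks free */
    }
}

static void spin_unlock(SpinLock *l)
{
    __sync_lock_release(&l->locked);
}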
There was significant contention for the block allocator, especially in the programs that use a large heap. To reduce this contention we did two things:

• We rewrote the block allocator to maintain its free list more efficiently.
• We allocate multiple blocks at a time, and keep the spare ones in the thread's Partly Free List (Section 3.6). At the end of GC any unused free blocks are returned to the block allocator.
5.6 Fragmentation

A block-structured heap will necessarily waste some memory at the end of each block. This arises when:

• An object to be evacuated is too large to fit in the current block, so a few words are often lost at the end of a to-space block.
• The last to-space block to be allocated into will on average be half-full, and there is one such block for each step of each generation for each thread. These partially-full blocks will be used for to-space during the next GC, however.
• When work is scarce (see Section 3.6), partially-full blocks are exported to the Pending Block Set. The GC tries to fill up any partially-full blocks rather than allocating fresh empty blocks, but it is possible that some partially-full blocks remain at the end of GC. The GC will try to re-use them at the next GC if this happens.

We measured the amount of space lost due to these factors by comparing the actual amount of live data to the number of blocks allocated at the end of each GC. The runtime tracks the maximum amount of fragmentation at any one time over the run of a program and reports it at the end; we found that in all our benchmarks, the fragmentation was never more than 1% of the total memory allocated by the runtime. To put this in perspective, remember that copying GC wastes at least half the memory.
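A small sketch of this accounting, under the assumption that fragmentation is simply block capacity minus live data, sampled at the end of each GC; all names are illustrative.

#include <stddef.h>

#define BLOCK_WORDS 1024          /* block size used in our measurements */

static size_t frag_high_watermark = 0;

/* Fragmentation at the end of a GC is the gap between the memory held
   in allocated blocks and the live data inside them; the high-water
   mark would be reported when the program exits. */
static void record_fragmentation(size_t blocks_allocated, size_t live_words)
{
    size_t capacity = blocks_allocated * BLOCK_WORDS;
    size_t lost = capacity > live_words ? capacity - live_words : 0;
    if (lost > frag_high_watermark)
        frag_high_watermark = lost;
}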
5.7 Eager Promotion

To our knowledge this is the first time the idea of eager promotion has been presented, and it is startlingly effective in practice. Figure 8 shows the benefit of doing eager promotion (Section 4.1) in the single-threaded GC. On average, eager promotion reduces the time spent in garbage collection by 6.8%. One example (power) went slower with eager promotion turned on; this program turns out to be quite sensitive to small changes in the times at which GC strikes, and this effect dominates.

Program          Δ GC time (%)
circsim          -1.0
constraints      -18.4
fibheaps         -2.7
fulsom           -14.8
gc_bench         -6.2
ghc              -3.6
happy            -14.2
lcss             -15.6
power            +36.6
spellcheck       -17.4
Min              -18.4
Max              +36.6
Geometric Mean   -6.8

Figure 8. Effect of adding eager promotion
5.8 Miscellany

Here we list a number of other techniques or modifications that we tried, but have not made systematic measurements for.

• Varying the native block size used by the block allocator. In practice this makes little difference to performance until the block size gets too small, while too large a block size increases the amount of fragmentation. The current default of 4 kbytes is reasonable.
• When taking a block from the Pending Block Set, do we take the block most recently added to the set (LIFO), or the block added first (FIFO)? Right now, we use FIFO, as we found it increased parallelism slightly, although LIFO might be better from a cache perspective. All other things being equal, it would make sense to take blocks of work recently generated by the current thread, in the hope that they would still be in the cache. Another strategy we could try is to take a random block, on the grounds that it would avoid accidentally hitting any worst-case behaviour. (Both take policies are sketched after this list.)
• Adding more generations and steps doesn't help for these benchmarks, although we have found in the past that adding a generation is beneficial for very long-running programs.
• At one stage we used to have a separate to-space for objects that do not need to be scavenged, because they have no pointer fields (boxed integers and characters, for example). However, the runtime system has statically pre-allocated copies of small integers and characters, giving a limited form of hash-consing, which meant that usually less than 1% of the dynamic heap consisted of objects with no pointers. There was virtually no benefit in practice from this optimisation, and it added some complexity to the code, so it was discarded.
• We experimented with pre-fetching in the GC, with very limited success. Pre-fetching to-space ahead of the allocation pointer is easy, but gives no benefit on modern processors, which tend to spot sequential access and pre-fetch automatically. Pre-fetching the scan block ahead of the scan pointer suffers from the same problem. Pre-fetching fields of objects that will shortly be scanned can be beneficial, but we found in practice that it was extremely difficult and processor-dependent to tune the distance at which to prefetch. Currently, our GC does no explicit prefetching.
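To make the two take policies concrete, here is a sketch of the Pending Block Set as a doubly-linked deque from which either end can be taken; the representation is an assumption for illustration, not the runtime's actual structure.

#include <stddef.h>

/* Illustrative Pending Block Set as a deque of blocks: threads push at
   the tail, so the head holds the oldest block (FIFO) and the tail the
   newest (LIFO). */
typedef struct PBlock_ {
    struct PBlock_ *prev, *next;
    /* ... block contents ... */
} PBlock;

typedef struct { PBlock *head, *tail; } PendingBlockSet;

static PBlock *take_fifo(PendingBlockSet *s)      /* our current policy */
{
    PBlock *b = s->head;
    if (b) {
        s->head = b->next;
        if (s->head) s->head->prev = NULL; else s->tail = NULL;
    }
    return b;
}

static PBlock *take_lifo(PendingBlockSet *s)      /* cache-friendlier alternative */
{
    PBlock *b = s->tail;
    if (b) {
        s->tail = b->prev;
        if (s->tail) s->tail->next = NULL; else s->head = NULL;
    }
    return b;
}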
6. Related work

There follows a survey of the related work in this area. Jones provides an introduction to classical sequential GC [JL96]. We focus on tracing collectors based on exploring the heap by reachability from root references. Moreover, we consider only parallel copying collectors, omitting those that use compaction or mark-sweep. We also restrict our discussion to algorithms that are practical for general-purpose use.

Halstead [RHH85] developed a parallel version of Baker's incremental semispace copying collector. During collection the heap is logically partitioned into per-thread from-spaces and to-spaces. Each thread traces objects from its own set of roots, copying them into its own to-space. Fine-grained locking is used to synchronize access to from-space objects, although Halstead reports that such contention is very rare. As Halstead acknowledges, this approach can lead to work imbalance and to heap overflow. Halstead makes heap overflow less likely by dividing to-spaces into 64K-128K chunks which are allocated on demand.

Many researchers have explored how to avoid the work imbalance problems in early parallel copying collectors. It is impractical to avoid work imbalance statically: the collector does not know ahead of time which roots will lead to large data structures and which to small ones. The main technique therefore is to dynamically re-balance work from busy threads to idle threads.

Imai and Tick [IT93] developed the first parallel copying GC algorithm with dynamic work balancing. They divide to-space into blocks, with each active GC thread having a "scan" block (of objects that it is tracing from) and a "copy" block (into which it copies objects it finds in from-space). If a thread fills its copy block then it adds it to a shared work pool, allocates a fresh copy block, and continues scanning. If a thread finishes its scan block then it fetches a fresh block from the work-pool. The size of the blocks provides a trade-off between the time spent synchronizing on the work-pool and the potential work imbalance. Siegwart and Hirzel [SH06] extend this approach to copy objects in hierarchical order.

Endo et al [ETY97] developed a parallel mark-sweep collector based on the Boehm-Demers-Weiser conservative GC. They use work-stealing to avoid load-imbalance during the mark phase: GC threads have individual work queues and if a thread's own queue becomes empty then it steals work from another's. Endo et al manage work at a finer granularity than Imai and Tick: they generally use per-object work items, but also sub-divide large objects into 512-byte chunks for tracing. They found fine-granularity work management valuable because of large arrays in the scientific workloads that they studied. They parallelise the sweep phase by over-partitioning the heap into batches of blocks that are processed in parallel, meaning that there are more batches than GC threads and that GC threads dynamically claim new batches as they complete their work.

Flood et al [FDSZ01] developed a parallel semispace copying collector. As with Endo et al, they avoid work-imbalance by per-object work-stealing. They parallelise root scanning by over-partitioning the root set (including the card-based remembered set in generational configurations). As with Imai and Tick, each GC thread allocates into its own local memory buffer. Flood et al also developed a parallel mark-compact collector which we do not discuss here.

Concurrent with Flood et al, Attanasio et al [ABCS01] developed a modular GC framework for Java on large symmetric multiprocessor machines executing server applications. Unlike Flood et al, Attanasio et al's copying collector performed load balancing using work buffers of multiple pointers to objects. A global list is maintained of full buffers ready to process. Attanasio reports that this coarser mechanism scales as well as Flood et al's fine-grained design on the javac and SPECjbb benchmarks; as with Imai and Tick's design, the size of the buffers controls a trade-off between synchronization costs and work imbalance.

Also concurrent with Flood et al, Cheng and Blelloch developed a parallel copying collector using a shared stack of objects waiting to be traced [BC99, CB01]. Each GC thread periodically pushed part of its work onto the shared stack and took work from the shared stack when it exhausted its own work. The implementation of the stack is simplified by a gated synchronization mechanism so that pushes are never concurrent with pops.

Ben-Yitzhak et al [BYGK+02] augmented a parallel mark-sweep collector with periodic clearing of an "evacuated area" (EA). The basic idea is that if this is done sufficiently often then it prevents fragmentation building up. The EA is chosen before the mark phase and, during marking, references into the EA are identified. After marking, the objects in the EA are evacuated to new locations (parallelized by over-partitioning the EA) and the references found during the mark phase are updated. In theory performance may be harmed by a poor choice of EA (e.g. one that does not reduce fragmentation because all the objects in it are dead) or by workload imbalance when processing the EA. In practice the first problem could be mitigated by a better EA-selection algorithm and, for modest numbers of threads, the impact of the second is not significant.

This approach was later used in Barabash et al's parallel framework [BBYG+05]. Barabash et al also describe the "work packet" abstraction they developed to manage the parallel marking phases. As with Attanasio et al's work buffers, this provides a way to batch communication between GC threads. Each thread has one input packet from which it is taking marking work and one output packet into which it places work that it generates. These packets remain distinct (unlike Imai and Tick's chunks [IT93]) and are shared between threads only at a whole-packet granularity (unlike per-object work stealing in Endo et al's and Flood et al's work). Barabash et al report that this approach makes termination detection easy (all the packets must be empty) and makes it easy to add or remove GC threads (because the shared pool's implementation is oblivious to the number of participants).

Petrank and Kolodner [PK04] observe that existing parallel copying collectors allocate objects into per-thread chunks, raising the possibility of fragmentation of to-space. They showed how this could be avoided by "delayed allocation" of to-space copies of objects: GC threads form batches of proposed to-space allocations which are then performed by a single CAS on a shared allocation pointer. This guarantees that there is no to-space fragmentation while avoiding per-allocation CAS operations. In many systems the impact of this form of fragmentation is small because the number of chunks with left-over space is low.

There are thus two kinds of dynamic re-balancing: fine-grained schemes like Endo et al [ETY97] and Flood et al [FDSZ01] which work at a per-object granularity, and batching schemes like Imai and Tick [IT93], Attanasio et al [ABCS01] and Barabash et al [BBYG+05] which group work into blocks or packets. Both approaches have their advantages. Fine-grained schemes reduce the latency between one thread needing work and it being made available and, as Flood et al argue, the synchronization overhead can be mitigated by carefully designed work-stealing systems. Block-based schemes may make termination decisions easier (particularly if available work is placed in a single shared pool) and make it easier to express different traversal policies by changing how blocks are selected from the pool (as Siegwart and Hirzel's work illustrated [SH06]).

We attempt to combine the best of these approaches. In particular, we try to keep the latency in work distribution low by exporting incomplete blocks when there are idle threads. Furthermore, as with Imai and Tick's [IT93] and Siegwart and Hirzel's [SH06] work, we represent our work items by areas of to-space, avoiding the need to reify them in a separate queue or packet structure.

A number of collectors have exploited the immutability of most data in functional languages. Doligez and Leroy's concurrent collector for ML [DL93] uses per-thread private heaps and allows multiple threads to collect their own private heaps in parallel. They simplify this by preserving an invariant that there are no inter-private-heap references and no references from a common shared heap into any thread's private heap: the new referents of mutable objects are copied into the shared heap, and mutable objects themselves are allocated in the shared heap. This exploits the fact that in ML (as in Haskell) most data is immutable. Huelsbergen and Larus' concurrent copying collector [HL93] also exploits the immutability of data in ML: immutable data can be copied in parallel with concurrent accesses by the mutator.
7. Conclusion and further work

The advent of widespread multi-core processors offers an attractive opportunity to reduce the costs of automatic memory management with zero programmer intervention. The opportunity is somewhat tricky to exploit, but it can be done, and we have demonstrated real wall-clock benefits achieved by our algorithm. Moreover, we have made progress toward explaining the lack of perfect speedup by measuring the load imbalance in the GC and showing that this correlates well with the wall-clock speedup.

There are two particular directions in which we would like to develop our collector.

7.1 Reducing per-object synchronisation

As noted in Section 3.1, a GC thread uses an atomic CAS instruction to gain exclusive access to a from-space heap object. The cost of atomicity here is high: 20–30% (Section 5.1), and we would like to reduce it.

Many heap objects in a functional language are immutable, and the language does not support pointer-equality. If such an immutable object is reachable via two different pointers, it is therefore semantically acceptable to copy the object twice into to-space. Sharing is lost, and the heap size may increase slightly, but the mutator can see no difference.

So the idea is simple: for immutable objects, we avoid using atomic instructions to claim the object, and accept the small possibility that the object may be copied more than once into to-space. We know that contention for individual objects happens very rarely in the GC (Section 5.5), so we expect the amount of accidental duplication to be negligible in practice.
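A minimal sketch of the idea, assuming a copy-then-publish scheme in which the forwarding pointer is installed after the copy; the header encoding, helper functions, and names are illustrative assumptions, not GHC's actual object layout.

#include <stdint.h>
#include <string.h>

typedef struct { volatile uintptr_t header; } Obj;  /* payload follows */

extern void  *to_space_alloc(size_t bytes);   /* assumed word-aligned allocator */
extern size_t obj_size(const Obj *o);
extern int    obj_is_immutable(const Obj *o);

#define FWD_BIT     ((uintptr_t)1)            /* low bit marks a forwarding pointer */
#define IS_FWD(h)   ((h) & FWD_BIT)
#define FWD_PTR(h)  ((Obj *)((h) & ~FWD_BIT))

static Obj *evacuate(Obj *from)
{
    uintptr_t h = from->header;
    if (IS_FWD(h))
        return FWD_PTR(h);                    /* already copied by someone */

    size_t sz = obj_size(from);
    Obj *to = to_space_alloc(sz);
    memcpy(to, from, sz);

    if (obj_is_immutable(from)) {
        /* No CAS: two threads may occasionally both copy this object,
           which is semantically harmless for immutable data in a
           language without pointer equality. */
        from->header = (uintptr_t)to | FWD_BIT;
        return to;
    }
    /* Mutable object: exactly one copy must win. */
    if (__sync_bool_compare_and_swap(&from->header, h, (uintptr_t)to | FWD_BIT))
        return to;
    /* Lost the race: use the winner's copy; ours becomes garbage. */
    return FWD_PTR(from->header);
}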
7.2 Privatising minor collections

A clear shortcoming of the system we have described is that all garbage collection is global: all the processors stop, agree to garbage collect, perform garbage collection, and resume mutation. It would be much better if a mutator thread could perform local garbage collection on its private heap without any interaction with other threads whatsoever. We plan to implement such a scheme, very much along the lines described by Doligez and Leroy [DL93].

References

[ABCS01] C. Attanasio, D. Bacon, A. Cocchi, and S. Smith. A comparative evaluation of parallel garbage collectors. In Fourteenth Annual Workshop on Languages and Compilers for Parallel Computing, pages 177–192, Cumberland Falls, KY, 2001. Springer-Verlag.

[ABP98] Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. In SPAA '98: Proceedings of the Tenth Annual ACM Symposium on Parallel Algorithms and Architectures, pages 119–129. ACM Press, June 1998.

[BBYG+05] Katherine Barabash, Ori Ben-Yitzhak, Irit Goft, Elliot K. Kolodner, Victor Leikehman, Yoav Ossia, Avi Owshanko, and Erez Petrank. A parallel, incremental, mostly concurrent garbage collector for servers. ACM Trans. Program. Lang. Syst., 27(6):1097–1146, 2005.

[BC99] Guy E. Blelloch and Perry Cheng. On bounding time and space for multiprocessor garbage collection. In PLDI '99: Proceedings of the ACM SIGPLAN 1999 conference on Programming language design and implementation, pages 104–117. ACM, 1999.

[BYGK+02] Ori Ben-Yitzhak, Irit Goft, Elliot K. Kolodner, Kean Kuiper, and Victor Leikehman. An algorithm for parallel incremental compaction. In ISMM '02: Proceedings of the 3rd international symposium on Memory management, pages 100–105. ACM, 2002.

[CB01] Perry Cheng and Guy E. Blelloch. A parallel, real-time garbage collector. In PLDI '01: Proceedings of the ACM SIGPLAN 2001 conference on Programming language design and implementation, pages 125–136. ACM, 2001.

[Che70] C. J. Cheney. A nonrecursive list compacting algorithm. Commun. ACM, 13(11):677–678, 1970.

[DEB94] R. Kent Dybvig, David Eby, and Carl Bruggeman. Don't stop the BIBOP: Flexible and efficient storage management for dynamically-typed languages. Technical Report 400, Indiana University Computer Science Department, 1994.

[DL93] Damien Doligez and Xavier Leroy. A concurrent, generational garbage collector for a multithreaded implementation of ML. In POPL '93: Proceedings of the 20th ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 113–123. ACM, 1993.

[ETY97] Toshio Endo, Kenjiro Taura, and Akinori Yonezawa. A scalable mark-sweep garbage collector on large-scale shared-memory machines. In Supercomputing '97: Proceedings of the 1997 ACM/IEEE conference on Supercomputing (CDROM), pages 1–14. ACM, 1997.

[FDSZ01] Christine Flood, Dave Detlefs, Nir Shavit, and Catherine Zhang. Parallel garbage collection for shared memory multiprocessors. In Usenix Java Virtual Machine Research and Technology Symposium (JVM '01), Monterey, CA, 2001.

[HL93] Lorenz Huelsbergen and James R. Larus. A concurrent copying garbage collector for languages that distinguish (im)mutable data. SIGPLAN Not., 28(7):73–82, 1993.

[HMP05] Tim Harris, Simon Marlow, and Simon Peyton Jones. Haskell on a shared-memory multiprocessor. In ACM Workshop on Haskell, Tallinn, Estonia, 2005. ACM.

[IT93] A. Imai and E. Tick. Evaluation of parallel copying garbage collection on a shared-memory multiprocessor. IEEE Trans. Parallel Distrib. Syst., 4(9):1030–1040, 1993.

[JL96] Richard Jones and Rafael Lins. Garbage Collection: Algorithms for Automatic Dynamic Memory Management. John Wiley and Sons, July 1996.

[Par92] W. D. Partain. The nofib benchmark suite of Haskell programs. In J. Launchbury and P. M. Sansom, editors, Functional Programming, Glasgow 1992, pages 195–202. 1992.

[PK04] Erez Petrank and Elliot K. Kolodner. Parallel copying garbage collection using delayed allocation. Parallel Processing Letters, 14(2), June 2004.

[RHH85] Robert H. Halstead, Jr. Multilisp: a language for concurrent symbolic computation. ACM Trans. Program. Lang. Syst., 7(4):501–538, 1985.

[SH06] David Siegwart and Martin Hirzel. Improving locality with parallel hierarchical copying GC. In ISMM '06: Proceedings of the 5th international symposium on Memory management, pages 52–63. ACM, 2006.

[Ste] Don Stewart. nobench: Benchmarking Haskell implementations. http://www.cse.unsw.edu.au/~dons/nobench.html.

[Ste77] Guy Lewis Steele Jr. Data representations in PDP-10 MacLISP. Technical report, MIT Artificial Intelligence Laboratory, 1977. AI Memo 420.

[Ung84] D. Ungar. Generation scavenging: A non-disruptive high performance storage management reclamation algorithm. In ACM SIGPLAN Software Engineering Symposium on Practical Software Development Environments, pages 157–167, Pittsburgh, Pennsylvania, April 1984.