GARBAGE COLLECTION
HANDBOOK
The Art of Automatic Memory Management
Chapman & Hall/CRC
Applied Algorithms and Data Structures Series
Series Editor
Samir Khuller
University of Maryland
The scope of the series includes, but is not limited to, titles in the areas of parallel algorithms, approximation algorithms, randomized algorithms, graph algorithms, search algorithms, machine learning algorithms, medical algorithms, data structures, graph structures, tree data structures, and other relevant topics.
Published Titles
A Practical Guide to Data Structures and Algorithms Using Java
GARBAGE COLLECTION
HANDBOOK
The Art of Automatic Memory Management
Richard Jones
Antony Hosking
Eliot Moss
CRC Press
Taylor & Francis Group
Boca Raton London New York
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2012 by Richard Jones, Antony Hosking, and Eliot Moss
CRC Press is an imprint of Taylor & Francis Group, an Informa business
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including microfilming, photocopying, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for
identification and explanation without intent to infringe.
List of Algorithms xv
Acknowledgements xxvii
Authors xxix
1 Introduction 1
1.1 Explicit deallocation 2
1.2 Automatic dynamic memory management 3
1.3 Comparing garbage collection algorithms 5
Safety 6
Throughput 6
Completeness and promptness 6
Pause time 7
Space overhead 8
Optimisations for specific languages 8
Scalability and portability 9
1.4 A performance disadvantage? 9
1.5 Experimental methodology 10
1.6 Terminology and notation 11
The heap 11
The mutator and the collector 12
The mutator roots 12
Space usage 29
To move or not to move? 30
4 Copying garbage collection 43
Advanced solutions 74
6.1 Throughput 77
6.2 Pause time 78
6.3 Space 78
6.4 Implementation 79
First-fit allocation 89
Next-fit allocation 90
Best-fit allocation 90
Boundary tags 98
Heap parsability 98
Locality 100
Wilderness preservation 100
9 Generational garbage collection 111
9.1 Example 112
Garbage-First 150
Immix and others 151
Copying collection in a constrained memory space 154
10.4 Bookmarking garbage collection 156
10.5 Ulterior reference counting 157
10.6 Issues to consider 158
12 Language-specific concerns 213
12.1 Finalisation 213
Caches 231
Coherence 232
Cache coherence performance example: spin locks 232
13.2 Hardware memory consistency 234
Fences and happens-before 236
Consistency models 236
13.3 Hardware primitives 237
Compare-and-swap 237
Load-linked/store-conditionally 238
Atomic arithmetic primitives 240
Test then test-and-set 240
More powerful primitives 240
Terminology 302
Is parallel collection worthwhile? 303
Strategies for balancing loads 303
Managing tracing 303
Low-level synchronisation 305
Sweeping and compaction 305
Termination 306
Compressor 352
Pauseless 355
17.8 Issues to consider 361
Glossary 417
Bibliography 429
Index 463
List of Algorithms
2.1 Mark-sweep: allocation 18
2.2 Mark-sweep: marking 19
2.3 Mark-sweep: sweeping 20
2.4 Printezis and Detlefs's bitmap marking 24
2.5 Lazy sweeping with a block structured heap 25
2.6 Marking with a FIFO prefetch buffer 28
2.7 Marking graph edges rather than nodes 28
6.1 Abstract tracing garbage collection 82
6.2 Abstract reference counting garbage collection 83
6.3 Abstract deferred reference counting garbage collection 84
7.1 Sequential allocation 88
4.1 Copying garbage collection: an example 47
4.2 Copying a tree with different traversal orders 49
4.3 Moon's approximately depth-first copying 51
1.1 Modern languages and garbage collection 5
11.1 An example of pointer tag encoding 169
11.2 Tag encoding for the SPARC architecture 169
11.3 The crossing map encoding of Garthwaite et al 201
14.1 State transition logic for the Imai and Tick collector 295
14.2 State transition logic for the Siegwart and Hirzel collector 297
Preface
Happy anniversary! As we near completion of this book, it is also the 50th anniversary of the first papers on automatic dynamic memory management, or garbage collection, written two years after the implementation of Lisp started in 1958. McCarthy [1978] recollects that the first online demonstration was to an MIT Industrial Liaison Symposium.
1 The IBM 704's legacy to the Lisp world includes the terms car and cdr. The 704's 36-bit words included two 15-bit parts, the address and decrement parts. Lisp's list or cons cells stored pointers in these two parts. The head of the list, the car, could be obtained using the 704's 'Contents of the Address part of Register' (car) instruction, and the tail, the cdr, with its 'Contents of the Decrement part of Register' (cdr) instruction.
The audience
In this book, we have tried to bring together the wealth of experience gathered by automatic memory management researchers and developers over the past fifty years. The literature is huge: our online bibliography contains 2,500 entries at the time of writing. We discuss and compare the most important approaches and state-of-the-art techniques in a single, accessible framework. We have taken care to present algorithms and concepts using a consistent style and terminology. These are described in detail. The almost universal adoption of garbage collection by modern programming languages makes a thorough understanding of this topic essential for any programmer.
We also cover techniques for allocating memory and examine the extent to which automatic garbage collection leads to allocator policies that differ from those of explicit malloc/free memory management.
The first seven chapters make the implicit assumption that all objects in the heap are managed in the same way. However, there are many reasons why that would be a poor design. Chapters 8 to 10 consider why we might want to partition the heap into different spaces, and how we might manage those spaces. We look at generational garbage collection, one of the most successful strategies for managing objects, at how to handle large objects, and at many other partitioned schemes.
The interface with the rest of the run-time system is one of the trickiest aspects of building a collector.2 We devote Chapter 11 to the run-time interface, including finding pointers, safe points at which to collect, and read and write barriers, and Chapter 12 to language-specific concerns. Most chapters conclude by considering the requirements that must be met. What questions need to be answered about the behaviour of the client program, its operating system or the underlying hardware? These summaries are not intended as a substitute for reading the chapter. Above all, they are not intended as canned solutions, but we hope that they will provide a focus for further analysis.
Finally, what is missing from the book? We have only considered automatic techniques for memory management embedded in the run-time system. Thus, even when a language specification mandates garbage collection, we have not discussed in much depth other mechanisms for memory management that it may also support. The most obvious example is the use of 'regions' [Tofte and Talpin, 1994], most prominently used in the Real-Time Specification for Java. We pay only brief attention to questions of region inferencing or stack allocation, and very little at all to other compile-time analyses intended to replace, or at least assist, garbage collection. Neither do we address how best to use techniques such as reference counting in the client program, although this is popular in languages like C++. Finally, the last decade has seen little new research on distributed garbage collection. In many ways, this is a shame since we expect lessons learnt in that field also to be useful to those developing collectors for the next generation of machines with heterogeneous collections of highly non-uniform memory architectures. Nevertheless, we do not discuss distributed garbage collection here.

2 And one that we passed on in Jones [1996]!
Online resources
The website accompanying this book, http://www.gchandbook.org, includes a number of resources, including our comprehensive bibliography. The bibliography at the end of this book contains over 400 references. However, our comprehensive online database contains over 2,500 garbage collection related publications. This database can be searched online or downloaded as BibTeX, PostScript or PDF.
Acknowledgements
We thank our many colleagues for their support for this new book. It is certain that without their encouragement (and pressure), this work would not have got off the ground. In particular, we thank Steve Blackburn, Hans Boehm, David Bacon, Cliff Click, David Detlefs, Daniel Frampton, Robin Garner, Barry Hayes, Laurence Hellyer, Maurice Herlihy, Martin Hirzel, Tomas Kalibera, Doug Lea, Simon Marlow, Alan Mycroft, Cosmin Oancea, Erez Petrank, Fil Pizlo, Tony Printezis, John Reppy, David Siegwart, Gil Tene and Mario Wolczko, all of whom have answered our many questions or given us excellent feedback on early drafts. We also pay tribute to the many computer scientists who have worked on automatic memory management since 1958: without them there would be nothing to write about.
We are very grateful to Randi Cohen, our long-suffering editor at Taylor and Francis, for her support and patience. She has always been quick to offer help and slow to chide us for our tardiness. We also thank Elizabeth Haylett and the Society of Authors3 for her service, which we recommend highly to other authors.
Above all, I am grateful to Robbie. How she has borne the stress of another book, whose writing has yet again stretched well over the planned two years, I will never know. I owe you everything! I also doubt whether this book would have seen the light of day without the inexhaustible enthusiasm of my co-authors. Tony, Eliot, it has been a pleasure and an honour writing with knowledgeable and diligent colleagues.
Richard Jones
In the summer of 2002, Richard and I hatched plans to write a follow-up to his 1996 book. There had been lots of new work on GC in those six years, and it seemed there was demand for an update. Little did we know then that it would be another nine years before the current volume would appear. Richard, your patience is much appreciated. As conception turned into concrete planning, Eliot's offer to pitch in was gratefully accepted; without his sharing the load we would still be labouring anxiously. Much of the early planning and writing was carried out while I was on sabbatical with Richard in 2008, with funding from Britain's Engineering and Physical Sciences Research Council and the United States' National Science Foundation, whose support we gratefully acknowledge. Mandi, without your encouragement and willingness to live out our own Canterbury tale, this project would not have been possible.
Antony Hosking
3 http://www.societyofauthors.org
Thank you to my co-authors for inviting me into their project, already largely conceived and being proposed for publication. You were a pleasure to work with (as always), and tolerant of my sometimes idiosyncratic writing style. A formal thank you is also due the Royal Academy of Engineering, who supported my visit to the UK in November 2009, which greatly advanced the book. Other funding agencies supported the work indirectly by helping us attend conferences and meetings at which we could gain some face to face working time for the book as well. And most of all, many thanks to my "girls," who endured my absences, physical and otherwise. Your support was essential and is deeply appreciated!
Eliot Moss
Authors
he was the inaugural Programme Chair. He has published numerous papers on garbage collection, heap visualisation and electronic publishing, and he regularly sits on the programme committees of leading international conferences. He is a member of the Editorial Board of Software: Practice and Experience. He was made an Honorary Fellow of the University of Glasgow in 2005 in recognition of his research and scholarship in dynamic memory management, and a Distinguished Scientist of the Association for Computing Machinery in 2006. He is married, with three children, and in his spare time he races Dart 18 catamarans.
University of Adelaide, Australia, in 1985, and an MSc in Computer Science from the University of Waikato, New Zealand, in 1987. He continued his graduate studies at the University of Massachusetts Amherst, receiving a PhD in Computer Science in 1995. His work is in the area of programming language design and implementation, with specific interests in database and persistent programming languages, object-oriented database systems, dynamic memory management, compiler optimisations, and architectural support for programming languages and applications. He is a Senior Member of the Association for Computing Machinery and a Member of the Institute of Electrical and Electronics Engineers. He regularly serves on programme and steering committees of major conferences, mostly focused on programming language design and implementation. He is married, with three children. When the opportunity arises, he most enjoys sitting somewhere behind the bowler's arm on the first day of any Test match at the Adelaide Oval.
was named a Fellow of the Association for Computing Machinery and in 2009 a Fellow of the Institute of Electrical and Electronics Engineers. He served for four years as Secretary of the Association for Computing Machinery's Special Interest Group on Programming Languages, and served on many programme and steering committees of the significant venues related to his areas of research. Ordained a priest of the Episcopal Church in 2005, he leads a congregation in addition to his full-time academic position. He is married, with two children. He enjoys listening to recorded books and movie-going, and has been known to play the harp.
Chapter 1
Introduction
Developers are increasingly turning to managed languages and run-time systems for the many virtues they offer, from the increased security they bestow to code to the flexibility they provide by abstracting away from operating system and architecture. The benefits of managed code are widely accepted [Butters, 2007]. Because many services are provided by the virtual machine, programmers have less code to write. Code is safer if it is type-safe and if the run-time system verifies code as it is loaded, checks for resource access violations and the bounds of arrays and other collections, and manages memory automatically. Deployment costs are lower since it is easier to deploy applications to different platforms, even if the mantra 'write once, run anywhere' is over-optimistic. Consequently, programmers can spend a greater proportion of development time on the logic of their application.
Almost all modern programming languages make use of dynamic memory allocation. This allows objects to be allocated and deallocated even if their total size was not known at the time that the program was compiled, and if their lifetime may exceed that of the subroutine activation1 that allocated them. A dynamically allocated object is stored in a heap, rather than on the stack (in the activation record or stack frame of the procedure that allocated it) or statically (whereby the name of an object is bound to a storage location known at compile or link time). Heap allocation is particularly important because it allows the programmer:

• to choose dynamically the size of new objects (thus avoiding program failure through exceeding hard-coded limits on arrays);
• to define and use recursive data structures such as lists, trees and maps;
• to return newly created objects to the parent procedure (allowing, for example, factory methods);
• to return a function as the result of another function (for example, closures or suspensions in functional languages).
Heap allocated objects are accessed through references. Typically, a reference is a pointer to the object (that is, the address in memory of the object). However, a reference may alternatively refer to an object only indirectly, for instance through a handle which in turn points to the object. Handles offer the advantage of allowing an object to be relocated (updating its handle) without having to change every reference to that object/handle throughout the program.
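To make the indirection concrete, here is a minimal sketch of a handle table in C++; the type and member names are our own illustration, not any particular virtual machine's API:

#include <cstddef>
#include <vector>

// A hypothetical handle table: the mutator holds stable handle indices,
// while the table records each object's current address. The collector
// can move an object and fix up exactly one slot.
class HandleTable {
    std::vector<void*> slots;
public:
    std::size_t newHandle(void* obj) {
        slots.push_back(obj);
        return slots.size() - 1;
    }
    // Every object access pays one extra indirection.
    void* deref(std::size_t h) const { return slots[h]; }
    // Called by the collector after relocating the object.
    void relocate(std::size_t h, void* newAddr) { slots[h] = newAddr; }
};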
1 We shall tend to use the terms method, function, procedure and subroutine interchangeably.
Figure 1.1: Premature deletion of an object may lead to errors. Here B has been freed. The live object A now contains a dangling pointer. The space occupied by C has leaked: C is not reachable but it cannot be freed.
Memory may be freed prematurely, while there are still references to it. Such a reference is called a dangling pointer (see Figure 1.1). If the program subsequently follows a dangling pointer, the results are unpredictable. For programmers in languages like C++, key advice has been to be consistent in the way that they manage the ownership of objects [Belotsky, 2003; Cline and Lomow, 1995]. Belotsky [2003] and others offer several possible strategies for C++. First, programmers should avoid heap allocation altogether, wherever possible. For example, objects can be allocated on the stack instead. When the objects' creating method returns, the popping of the stack will free these objects automatically. Secondly, programmers should pass and return objects by value, by copying the full contents of a parameter/result rather than by passing references. Clearly both of these strategies have limitations. Thirdly, it may be appropriate to use custom allocators, for example, that manage a pool of objects. At the end of a program phase, the entire pool can be freed as a whole, as in the sketch below.
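For illustration, a bump-pointer arena along these lines might look as follows in C++ (the names and sizing policy are our own assumptions):

#include <cstddef>
#include <cstdlib>
#include <new>

// A hypothetical arena: objects allocated during a program phase are
// carved out of one large chunk and released together when the phase ends.
class Arena {
    char* base;
    std::size_t size, used = 0;
public:
    explicit Arena(std::size_t n)
        : base(static_cast<char*>(std::malloc(n))), size(n) {}
    ~Arena() { std::free(base); }   // frees the whole pool at once

    void* allocate(std::size_t n) {
        // round up so every object is maximally aligned
        const std::size_t a = alignof(std::max_align_t);
        n = (n + a - 1) & ~(a - 1);
        if (used + n > size) throw std::bad_alloc();
        void* p = base + used;
        used += n;
        return p;
    }
};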
C++ has seen several attempts to use special pointer classes and templates to improve memory management. These overload normal pointer operations in order to provide safe storage reclamation. However, such smart pointers have several limitations. The auto_ptr class template cannot be used with the Standard Template Library and will be deprecated in the expected next edition of the C++ standard [Boehm and Spertus, 2009].5 It will be replaced by an improved unique_ptr that provides strict ownership semantics that allow the target object to be deleted when the unique pointer is. The standard will also include a reference counted shared_ptr,6 but these also have limitations. Reference counted pointers are unable to manage self-referential (cyclic) data structures. Most smart pointers are provided as libraries, which restricts their applicability if efficiency is a concern. Possibly, they are most appropriately used to manage very large blocks, references to which are rarely assigned or passed, in which case they might be significantly cheaper than tracing collection. On the other hand, without the cooperation of the compiler and run-time system, reference counted pointers are not an efficient, general purpose solution to the management of small objects, especially if pointer manipulation is to be thread-safe.
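The cycle limitation is easy to demonstrate with the shared_ptr that was eventually standardised in C++11; this small program (our own example) leaks both nodes:

#include <memory>

struct Node {
    std::shared_ptr<Node> next;  // an owning link: forms a cycle below
};

int main() {
    auto a = std::make_shared<Node>();
    auto b = std::make_shared<Node>();
    a->next = b;
    b->next = a;   // each node's reference count is now 2
    // When a and b go out of scope the counts drop to 1, never to 0,
    // so neither destructor runs: the cycle is unreachable yet leaks.
    // A tracing collector would reclaim it; with shared_ptr, one link
    // must be a std::weak_ptr to break the cycle.
}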
The plethora of strategies for safe manual memory management throws up yet another problem. If it is essential for the programmer to manage object ownership consistently, which approach should she adopt? This is particularly problematic when using library code. Which approach does the library take? Do all the libraries used by the program use the same approach?
Garbage collection (GC) prevents dangling pointers being created: an object is reclaimed only when there is no pointer to it from a reachable object. Conversely, in principle all garbage is guaranteed to be freed (any object that is unreachable will eventually be reclaimed by the collector) with two caveats. The first is that tracing collection uses a definition of 'garbage' that is decidable and may not include all objects that will never be accessed again. The second is that in practice, as we shall see in later chapters, garbage collector implementations may choose, for efficiency, to leave some unreachable objects unreclaimed.
4\"When C++ is your hammer, everything looks like a thumb,\" Steven M. Haflich, Chair of the NCITS/J13
technical committee for ANSI standard for Common Lisp.
5The final committee draft for the next ISO C++ standard is currently referred to as C++0x.
6http://boost.org
the code of that module alone, or at worst a few closely related modules. Reducing the coupling between modules means that the behaviour of one module is not dependent on the implementation of another module. As far as correct memory management is concerned, this means that modules should not have to know the rules of the memory management game played by the other modules with which they are linked. Indeed, garbage collection, in one form or another, has been a requirement of almost all modern languages (see Table 1.1). It is even expected that the next C++ standard will require code to be written so as to allow a garbage-collected implementation [Boehm and Spertus, 2009].
There is substantial evidence that managed code, including automatic memory management, reduces development costs [Butters, 2007]. Unfortunately, most of this evidence is anecdotal or compares development in different languages and systems (hence comparing more than just memory management strategies), and few detailed comparative studies have been performed. Nevertheless, one author has suggested that memory management should be the prime concern in the design of software for complex systems [Nagle, 1995]. Rovner [1985] estimated that 40% of development time for Xerox's Mesa system was spent on getting memory management correct. Possibly the strongest corroboration of the case for automatic dynamic memory management is an indirect, economic one: the continued existence of a wide variety of vendors and tools for detection of memory errors.
We do not claim that garbage collection is a silver bullet that will eradicate all memory-related programming errors or that it is applicable in all situations. Memory leaks are one of the most prevalent kinds of memory error. Although garbage collection tends to reduce the chance of memory leaks, it does not guarantee to eliminate them. If an object structure becomes unreachable to the rest of the program (for example, if it cannot be reached through any chain of pointers from the known roots), then the garbage collector will reclaim it. Since this is the only way that an object can be deleted, dangling pointers cannot arise. Furthermore, if deletion of an object causes its children to become unreachable, they too will be reclaimed. Thus, neither of these errors is possible. However, garbage collection has no answer to a data structure that is still reachable, but grows without limit (for example, if a programmer repeatedly adds data to a cache but never removes objects from that cache), or that is reachable and simply never accessed again.
Automatic dynamic memory management is designed to do just what it says. Some critics of garbage collection have complained that it is unable to provide general resource management, for example, to close files or windows promptly after their last use. However, this is unfair. Garbage collection is not a universal panacea. It attacks and solves a specific question: the management of memory resources. Nevertheless, the problem of general resource management in a garbage collected language is a substantial one. With explicitly-managed systems there is a straightforward and natural coupling between memory reclamation and the disposal of other resources. Automatic memory management introduces the problem of how to structure resource management in the absence of a natural coupling. However, it is interesting to observe that many resource release scenarios require something akin to a collector in order to detect whether the resource is still in use (reachable) from the rest of the program.
Singer et al [2007b] applied machine learning techniques to predict the best collector configuration for a particular program. Others have explored allowing Java virtual machines to switch collectors as they run if they believe that the characteristics of the workload being run would benefit from a different collector [Printezis, 2001; Soman et al, 2004].
In this section, we examine the metrics by which collectors can be compared. Nevertheless, such comparisons are difficult in both principle and practice. Details of implementation, locality and the practical significance of the constants in algorithmic complexity formulae make them less than perfect guides to practice. Moreover, the metrics are not independent variables. Not only does the performance of an algorithm depend on the topology and volume of objects in the heap, but also on the access patterns of the application. Worse, the tuning options in production virtual machines are inter-connected: variation of one option is likely to affect the behaviour of others.
Safety
The prime consideration is that garbage collection should be safe: the collector must never reclaim the storage of live objects. However, safety comes with a cost, particularly for concurrent collectors (see Chapter 15). The safety of conservative collection, which receives no assistance from the compiler or run-time system, may in principle be vulnerable to certain compiler optimisations that disguise pointers.
Throughput
A common goal for end users is that their programs should run faster. However, there are several aspects to this. One is that the overall time spent in garbage collection should be as low as possible. This is commonly referred to in the literature as the mark/cons ratio, comparing the early Lisp activities of the collector ('marking' live objects) and the mutator (creating or 'consing' new list cells). However, the user is most likely to want the application as a whole (mutator plus collector) to execute in as little time as possible. In most well designed configurations, much more CPU time is spent in the mutator than the collector. Therefore it may be worthwhile trading some collector performance for increased mutator throughput.

Completeness and promptness
Ideally, a collector should eventually reclaim all garbage, but not all collectors are complete. Reference counting collectors, for example, are unable to reclaim cyclic garbage (self-referential structures).
For performance reasons, it may be desirable not to collect the whole heap at every collection cycle. For example, generational collectors segregate objects by their age into two or more regions called generations (we discuss generational garbage collection in Chapter 9). By concentrating effort on the youngest generation, generational collectors can both improve total collection time and reduce the average pause time for individual collections.

Concurrent collectors interleave the execution of mutators and collectors; the goal of such collectors is to avoid, or at least bound, interruptions to the user program. One consequence is that objects that become garbage after a collection cycle has started may not be reclaimed until the end of the next cycle; such objects are called floating garbage. Hence, in a concurrent setting it may be more appropriate to define completeness as eventual reclamation of all garbage, as opposed to reclamation within one cycle. Different collection algorithms may vary in their promptness of reclamation, again leading to time/space trade-offs.
Pause time
A minimum mutator utilisation (MMU) or bounded mutator utilisation (BMU) graph's x-axis represents time, from 0 to total execution time, and its y-axis the fraction of CPU time spent in the mutator (utilisation). Thus, not only do MMU and BMU curves show total garbage collection time as a fraction of overall execution time (the y-intercept, at the top right of the curves, is the mutator's overall share of processor time), but they also show the maximum pause time (the longest window for which the mutator's CPU utilisation is zero) as the x-intercept. In general, curves that are higher and more to the left are preferable since they tend towards a higher mutator utilisation for a smaller maximum pause. Note that the MMU is the minimum mutator utilisation (y) in any time window (x). As a consequence it is possible for a larger window to have a lower MMU than a smaller window, leading to dips in the curve. In contrast, BMU curves give the MMU in that time window or any larger one. Monotonically increasing BMU curves are perhaps more intuitive than MMU.
Figure: MMU and BMU curves display concisely the (minimum) fraction of time spent in the mutator, for any given time window (x-axis: time in ms, log scale; y-axis: utilisation). MMU is the minimum mutator utilisation (y) in any time window (x), whereas BMU is the minimum mutator utilisation in that time window or any larger one. In both cases, the x-intercept gives the maximum pause time and the y-intercept is the overall fraction of processor time given to the mutator.
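As a worked illustration of the definition, the following C++ sketch (our own, not from the book) computes MMU for a given window size from a list of stop-the-world pause intervals; it exploits the fact that the worst window starts at a pause start or ends at a pause end:

#include <algorithm>
#include <vector>

struct Pause { double start, end; };  // one stop-the-world pause, in ms

// Total GC time overlapping the window [t, t+w).
double gcTimeIn(const std::vector<Pause>& ps, double t, double w) {
    double sum = 0;
    for (const Pause& p : ps)
        sum += std::max(0.0, std::min(p.end, t + w) - std::max(p.start, t));
    return sum;
}

// Minimum mutator utilisation for window size w over a run of length
// 'total': the minimum, over all window positions, of the fraction of
// the window left to the mutator. The minimum is attained when a window
// boundary coincides with a pause boundary, so only those starts are tried.
double mmu(const std::vector<Pause>& ps, double total, double w) {
    std::vector<double> starts = {0.0};
    for (const Pause& p : ps) {
        starts.push_back(p.start);     // window starting at a pause start
        starts.push_back(p.end - w);   // window ending at a pause end
    }
    double minUtil = 1.0;
    for (double t : starts) {
        t = std::clamp(t, 0.0, total - w);
        minUtil = std::min(minUtil, (w - gcTimeIn(ps, t, w)) / w);
    }
    return minUtil;
}
// For example, with pauses [10,14] and [40,48] in a 100 ms run,
// mmu(ps, 100, 20) is (20 - 8) / 20 = 0.6.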
Space overhead
The goal of memory management is safe and efficient use of space. Different memory managers, both explicit and automatic, impose different space overheads. Some garbage collectors may impose per-object space costs (for example, to store reference counts); others may be able to smuggle these overheads into objects' existing layouts (for example, a mark bit can often be hidden in a header word, or a forwarding pointer may be written over user data). Collectors may have a per-heap space overhead. For example, copying collectors divide the heap into two semispaces. Only one semispace is available to the mutator at any time; the other is held as a copy reserve into which the collector will evacuate live objects at collection time. Collectors may require auxiliary data structures. Tracing collectors need mark stacks to guide the traversal of the pointer graph in the heap; they may also store mark bits in separate bitmap tables rather than in the objects themselves. Concurrent collectors, or collectors that divide the heap into independently collected regions, require remembered sets that record where the mutator has changed the value of pointers, or the locations of pointers that span regions, respectively.
Internally, however, they typically update data structures at most once (from a 'thunk' to weak head normal form); this gives multi-generation collectors opportunities to promote fully evaluated data structures eagerly (see Chapter 9). Authors have also suggested complete mechanisms for handling cyclic data structures with reference counting. Declarative languages may also allow other mechanisms for efficient management of heap spaces. Any data created in a logic language after a 'choice point' becomes unreachable after the program backtracks to that point. With a memory manager that keeps objects laid out in the heap in their order of allocation, memory allocated after the choice point can be reclaimed in constant time. Conversely, different language definitions may make specific requirements of the collector. The most notable are the ability to deal with a variety of pointer strengths and the need for the collector to cause dead objects to be finalised.
Scalability and portability
Some algorithms depend on support from the operating system or hardware (for instance, by protecting pages or by double mapping virtual memory space, or on the availability of certain atomic operations on the processor). Such techniques are not necessarily portable.
1.4 A performance disadvantage?
Although automatic memory management does impose a performance penalty on the program, it is not as much as is commonly assumed. Furthermore, explicit operations like malloc and free also impose a significant cost. Hertz, Feng, and Berger [2005] measured the true cost of garbage collection for a variety of Java benchmarks and collection algorithms. They instrumented a Java virtual machine to discover precisely when objects became unreachable, and then used the reachability trace as an oracle to drive a simulator, measuring cycles and cache misses. They compared a wide variety of garbage collector configurations against different implementations of malloc/free: the simulator invoked free at the point where the trace indicated that an object had become garbage. Although, as expected, results varied between both collectors and explicit allocators, Hertz et al found garbage collectors could match the execution time performance of explicit allocation provided they were given a sufficiently large heap (five times the minimum required). For more typical heap sizes, the garbage collection overhead increased to 17% on average.
1.5 Experimental methodology
One of the most welcome changes over the past decade or so has been the improvement in experimental methodology reported in the literature on memory management. Nevertheless, it remains clear that reporting standards in computer science have some way to improve before they match the quality of the very best practice in the natural or social sciences. Mytkowicz et al [2008] find measurement bias to be 'significant and commonplace'. In a study of a large number of papers on garbage collection, Georges et al [2007] found the experimental methodology, even where reported, to be inadequately rigorous in many cases. Many reported performance improvements were sufficiently small, and the reports lacking in statistical analysis, to raise questions of whether any confidence could be placed in the results. Errors introduced may be systematic or random. Systematic errors are largely due to poor experimental practice and can often be reduced by more careful design of experiments. Random errors are typically due to non-determinism in the system under measurement. By their nature, these are unpredictable and often outside the experimenter's control.

Synthetic and toy benchmarks may mislead, either because they do not reflect the interactions in memory allocation that occur in real programs, or because their working sets are sufficiently small that they exhibit locality effects that real programs would not. Wilson et al [1995a] provide an excellent critique of such practices. Fortunately, other than for stress testing, synthetic and toy benchmarks have been largely abandoned in favour of larger scale benchmark suites, consisting of widely used programs that are believed to represent a wide range of typical behaviour (for example, the DaCapo suite for Java [Blackburn et al, 2006b]).
Experiments should use benchmark suites that contain a large number of realistic programs. A first run of a benchmark may also be distorted by, for example, loading the necessary files into the disk cache: thus Georges et al [2007] advocate running several invocations of the virtual machine and benchmark and discarding the first.
Dynamic (or run-time) compilation is a major source of non-determinism. By reporting results across a range of heap sizes
(often expressed in terms of multiples of the smallest heap size in which a program will
run to completion), such 'jitter' is made readily apparent.
1.6 Terminology and notation
We conclude this chapter by explaining the notation used in the rest of the book.
The heap
The heap is either a contiguous array of memory words or organised into a set of discontiguous blocks of contiguous words. A granule is the smallest unit of allocation, typically a word or double word.

7 The coefficient of variation is the standard deviation divided by the mean.
An object is usually allocated as a contiguous group of granules (although some memory managers for real-time or embedded systems may construct an individual large object as a pointer structure, this structure is not revealed to the user program). A field may contain a reference or some other scalar non-reference value such as an integer. A reference is either a pointer to a heap object or the distinguished value null. Usually, a reference will be the canonical pointer to the head of the object (that is, its first address), or it may point to some offset from the head. An object will sometimes also have a header field which stores metadata used by the run-time system, commonly (but not always) stored at the head of an object. A derived pointer is a pointer obtained by adding an offset to an object's canonical pointer. An interior pointer is a derived pointer to an internal object field.
A block is an aligned chunk of a particular size, usually a power of two. For completeness, we also mention that a frame (when not referring to a stack frame) means a large, 2^k sized portion of address space, and a space is a possibly discontiguous collection of chunks, or even objects, that receive similar treatment by the system. A page is as defined by the hardware and operating system's virtual memory mechanism, and a cache line (or cache block) is as defined by its cache. A card is a 2^k aligned chunk, smaller than a page, related to some schemes for remembering cross-space pointers (Section 11.8).
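For instance, mapping an address to its card is a single shift; this small C++ sketch (the constant and names are our own illustration) shows the arithmetic a card-marking write barrier would use:

#include <cstddef>
#include <cstdint>

// A hypothetical card size of 2^9 = 512 bytes.
constexpr unsigned CARD_SHIFT = 9;

inline std::size_t cardIndex(std::uintptr_t heapStart, const void* addr) {
    return (reinterpret_cast<std::uintptr_t>(addr) - heapStart) >> CARD_SHIFT;
}
// A write barrier would set cardTable[cardIndex(heapStart, fieldAddr)]
// so that the collector can later scan dirty cards for cross-space pointers.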
The heap is often characterised as an object graph, which is a directed graph whose nodes are heap objects and whose directed edges are the references to heap objects stored in their fields. An edge is a reference from a source node or a root (see below) to a destination node.
The mutator and the collector
• The mutator executes application code, which allocates new objects and mutates the object graph by changing reference fields so that they refer to different destination objects.
• The collector executes garbage collection code, which discovers unreachable objects and reclaims their storage.

A program may have more than one mutator thread, but the threads together can usually be thought of as a single actor over the heap. Equally, there may be one or more collector threads.
The mutator roots
A root may be discarded by overwriting the root pointer's storage with some other reference (that is, null or a pointer to another object). We denote the set of (addresses of) the roots by Roots.
In practice, the roots usually comprise static/global storage and thread-local storage (such as thread stacks) containing pointers through which mutator threads can directly manipulate heap objects. As mutator threads execute over time, their state (and so their roots) will change.
In a type-safe programming language, once an object becomes unreachable in the heap, and the mutator has discarded all root pointers to that object, then there is no way for the mutator to reacquire a pointer to the object. The mutator cannot 'rediscover' the object arbitrarily (without interaction with the run-time system): there is no pointer the mutator can traverse to it and arithmetic construction of new pointers is prohibited. A variety of languages support finalisation of at least some objects. These appear to the mutator to be 'resurrected' by the run-time system. Our point is that the mutator cannot gain access to any arbitrary unreachable object by its efforts alone.
Just because a program continues to hold a pointer to an object does not mean it will access it. Fortunately, we can approximate liveness by a property that is decidable: pointer reachability. An object N is reachable from an object M if N can be reached by following a chain of pointers, starting from some field f of M. By extension, an object is only usable by a mutator if there is a chain of pointers from one of the mutator's roots to the object.
More formally (in the mathematical sense that allows reasoning about reachability), we can define the immediate 'points-to' relation →f as follows. For any two heap nodes M, N in Nodes, M →f N if and only if there is some field location f = &M[i] in Pointers(M) such that *f = N. The set of reachable nodes is then the least set satisfying

reachable = {N ∈ Nodes | (∃r ∈ Roots : r → N) ∨ (∃M ∈ reachable : M →f N)}    (1.1)
An object that is unreachable in the heap, and not pointed to by any mutator root, can never be accessed by a type-safe mutator. Conversely, any object reachable from the roots may be accessed by the mutator. Thus, liveness is more profitably defined for garbage collectors by reachability. Unreachable objects are certainly dead and can safely be reclaimed. But any reachable object may still be live and must be retained. Although we realise that doing so is not strictly accurate, we will tend to use live and dead interchangeably with reachable and unreachable, and garbage as synonymous with unreachable.
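Equation 1.1 is constructive: the reachable set is exactly what a worklist traversal from the roots computes. A minimal C++ sketch over a toy model of our own (integer object identities and an adjacency map, not the book's pseudo-code):

#include <unordered_map>
#include <unordered_set>
#include <vector>

// Each object's outgoing reference fields, keyed by object identity.
using Heap = std::unordered_map<int, std::vector<int>>;

std::unordered_set<int> reachable(const Heap& heap,
                                  const std::vector<int>& roots) {
    std::unordered_set<int> marked(roots.begin(), roots.end());
    std::vector<int> worklist(roots.begin(), roots.end());
    while (!worklist.empty()) {
        int m = worklist.back();
        worklist.pop_back();
        auto it = heap.find(m);
        if (it == heap.end()) continue;    // no outgoing references
        for (int n : it->second)           // M points to N
            if (marked.insert(n).second)   // first time N is seen
                worklist.push_back(n);
    }
    return marked;  // the least fixed point of equation 1.1
}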
Pseudo-code
We use a common pseudo-code to describe garbage collection algorithms. We offer these algorithm fragments as illustrative rather than definitive, preferring to resolve ambiguities informally in the text rather than formally in the pseudo-code. Our goal is a concise and readable notation.
The allocator
The allocator supports two operations: allocate, which reserves the underlying memory storage for an object, and free, which returns that storage to the allocator for subsequent re-use. The size of the storage reserved by allocate is passed as an optional parameter; when omitted, the allocation is of a fixed-size object, or the size of the object is not necessary for understanding of the algorithm. Where necessary, we may pass further arguments to allocate, for example to distinguish arrays from other objects, or arrays of pointers from those that do not contain pointers, or to include other information necessary to initialise object headers.
New():
    return allocate()

Write(src, i, val):
    src[i] ← val
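The corresponding Read accessor, elided above, is presumably the trivial dual of Write:

Read(src, i):
    return src[i]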
Atomic operations
In the face of concurrency between mutator threads, collector threads, and between the mutator and collector, all collector algorithms require that certain code sequences appear to execute atomically. For example, stopping mutator threads makes the task of garbage collection appear to occur atomically: the mutator threads will never access the heap in the middle of garbage collection. Moreover, when running the collector concurrently with the mutator, the New, Read, and Write operations may need to appear to execute atomically with respect to the collector and/or other mutator threads. To simplify the exposition of collector algorithms we will usually leave implicit the precise mechanism by which atomicity of operations is achieved, simply marking them with the keyword atomic. The meaning is clear: all the steps of an atomic operation must appear to execute indivisibly and instantaneously with respect to other operations. That is, other operations will appear to execute either before or after the atomic operation, but never interleaved between any of the steps that constitute the atomic operation. For discussion of different techniques to achieve atomicity as desired see Chapter 11 and Chapter 13.
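As a concrete flavour of what an atomic block might compile down to, here is a C++ sketch (our own example) using a compare-and-swap so that exactly one racing thread claims a mark byte:

#include <atomic>
#include <cstdint>

// A toy object header holding a mark byte that mutator and collector
// threads may race on.
struct ObjectHeader {
    std::atomic<std::uint8_t> mark{0};
};

// Returns true for exactly one of any number of racing threads: the
// compare-and-swap moves the byte from 0 to 1 indivisibly.
bool tryMark(ObjectHeader& h) {
    std::uint8_t expected = 0;
    return h.mark.compare_exchange_strong(expected, 1);
}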
Sets, multisets, sequences and tuples
We use the usual definition of a set as a collection of distinct (that is, unique) elements. The cardinality of a set S, written |S|, is the number of its elements.
In addition to the standard set notation, we also make use of multisets. A multiset's elements may have repeated membership in the multiset. The cardinality of a multiset is the total number of its elements, including repeated memberships. The number of times an element appears is its multiplicity. We adopt the following notation:

• [] denotes the empty multiset
• [a, a, b] denotes the multiset containing two as and one b
• [a, b] + [a] = [a, a, b] denotes multiset union
• [a, a, b] - [a] = [a, b] denotes multiset subtraction
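If it helps to see the multiset rules executed, C++'s std::multiset matches this notation directly (a small example of our own):

#include <cassert>
#include <set>

int main() {
    std::multiset<char> m = {'a', 'b'};   // [a, b]
    m.insert('a');                        // [a, b] + [a] = [a, a, b]
    assert(m.count('a') == 2);            // the multiplicity of a is 2
    assert(m.size() == 3);                // cardinality counts repeats
    m.erase(m.find('a'));                 // [a, a, b] - [a] = [a, b]
    assert(m.size() == 2);
}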
Unlike a multiset, the same element can appear multiple times at different positions in a sequence. We adopt the following notation:

• () denotes the empty sequence
• (a, a, b) denotes the sequence containing two as followed by a b
• (a, b) · (a) = (a, b, a) denotes appending of the sequence (a) to (a, b)
While a tuple of length k can be thought of as being equivalent to a sequence of the same length, we sometimes find it convenient to use a different notation to emphasise the fixed length of a tuple as opposed to the variable length of a sequence, and so on. We adopt the notation below for tuples; we use tuples only of length two or more.

• (a1, ..., ak) denotes the k-tuple whose ith member is ai, for 1 ≤ i ≤ k
Chapter 2
Mark-Sweep Garbage Collection

All garbage collection schemes are based on one of four fundamental approaches: mark-sweep collection, copying collection, mark-compact collection or reference counting. Different collectors may combine these approaches in different ways, for example, by collecting one region of the heap with one method and another part of the heap with a second method. The next four chapters focus on these four basic styles of collection. In Chapter 6 we compare their characteristics.
For now we shall assume that the mutator is running one or more threads, but that there is a single collector thread. All mutator threads are stopped while the collector thread runs. This stop-the-world approach simplifies the construction of collectors considerably. From the perspective of the mutator threads, collection appears to execute atomically: no mutator thread will see any intermediate state of the collector, and the collector will not see interference with its task by the mutator threads. We can assume that each mutator thread is stopped at a point where it is safe to examine its roots: we look at the details of the run-time interface in Chapter 11. Stopping the world provides a snapshot of the heap, so we do not have to worry about mutators rearranging the topology of objects in the heap while the collector is trying to determine which objects are live. This also means that there is no need for the collector thread, as it returns free space, to synchronise with other collector threads or with the allocator as it tries to acquire space. We avoid the question of how multiple mutator threads can acquire fresh memory until Chapter 7. There are more complex run-time systems that employ parallel collector threads or allow mutator threads to run concurrently with the collector; we consider those in later chapters.
Algorithm 2.1: Mark-sweep: allocation

 1 New():
 2     ref ← allocate()
 3     if ref = null                /* Heap is full */
 4         collect()
 5         ref ← allocate()
 6         if ref = null            /* Heap is still full */
 7             error "Out of memory"
 8     return ref
 9
10 atomic collect():
11     markFromRoots()
12     sweep(HeapStart, HeapEnd)
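Rendered as C++, the allocation slow path of Algorithm 2.1 might look like the sketch below; allocate and collect are assumed entry points into the collector, not definitions from the book:

#include <cstddef>
#include <stdexcept>

void* allocate(std::size_t size);  // assumed: returns nullptr when heap is full
void  collect();                   // assumed: stop-the-world mark-sweep

// Try to allocate; on failure, collect once and retry before giving up.
void* gcNew(std::size_t size) {
    void* ref = allocate(size);
    if (ref == nullptr) {              // heap is full
        collect();
        ref = allocate(size);
        if (ref == nullptr)            // heap is still full
            throw std::runtime_error("Out of memory");
    }
    return ref;
}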
These tasks are not independent. In particular, the way space is reclaimed affects how fresh space is allocated. As we noted in Chapter 1, true liveness is an undecidable problem. Instead, we turn to an over-approximation of the set of live objects: pointer reachability (defined on page 13). We accept an object as live if and only if it can be reached by following a chain of references from a set of known roots. By extension, an object is dead, and its space can be reclaimed, if it cannot be reached through any such chain of pointers. This is a safe estimate. Although some objects in the live set may never be accessed again, all those in the dead set are certainly dead.

The first algorithm that we look at is mark-sweep collection [McCarthy, 1960]. It is an indirect collection algorithm: rather than detecting garbage directly, it identifies all the live objects and concludes that anything else must be garbage. In this it contrasts with the direct method, reference counting. Unlike indirect methods, direct algorithms determine the liveness of an object from the object alone, and the mutator's operations must be redefined appropriately (the default definitions were given in Chapter 1 on page 15). The mark-sweep interface with the mutator is very simple. If a thread is unable to allocate a new object, the collector is called and the allocation request is retried (Algorithm 2.1). To emphasise that the collector operates in stop-the-world mode, without concurrent execution of the mutator threads, we mark the collect routine with the atomic keyword. If there is still insufficient memory available to meet the allocation request, then heap memory is exhausted and an out-of-memory error is reported.
Algorithm 2.2: Mark-sweep: marking

10 initialise(worklist):
11     worklist ← empty
12
13 mark():
14     while not isEmpty(worklist)
15         ref ← remove(worklist)            /* ref is marked */
16         for each fld in Pointers(ref)
17             child ← *fld
18             if child ≠ null && not isMarked(child)
19                 setMarked(child)
20                 add(worklist, child)
An object can be marked by setting a bit (or byte), either in the object's header or in a side table. If an object cannot contain pointers, then because it has no children there is no need to add it to the work list. Of course the object itself must still be marked. In order to minimise the size of the work list, markFromRoots calls mark immediately. Alternatively, it may be desirable to complete scanning the roots of each thread as quickly as possible. For instance, a concurrent collector might wish to stop each thread only briefly to scan its stack and then traverse the graph while the mutator is running. In this case mark (line 8) could be moved outside the loop.

For a single-threaded collector, the work list could be implemented as a stack. This leads to a depth-first traversal of the graph. If mark-bits are co-located with objects, it has the advantage that the elements that are processed next are those that have been marked most recently, and hence are likely to still be in the hardware cache. As we shall see repeatedly, it is essential to pay attention to cache behaviour if the collector is not to sacrifice performance. Later we discuss techniques for improving locality.

Marking the graph of live objects is straightforward. References are removed from the work list, and the targets of their fields marked, until the work list is empty. Note that in this version of mark, every item in the work list has its mark-bit set. If a field contains a null pointer or a pointer to an object that has already been marked, there is no work to do; otherwise the target is marked and added to the work list.

Termination of the marking phase is enforced by not adding already marked objects to the work list, so that eventually the list will become empty. At this point, every object
reachable from the roots will have been visited and its mark-bit will have been set. Any
unmarked object is therefore garbage.
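In a real language the marking loop is only a few lines; here is a C++ sketch over a toy object model (the Obj type and field layout are our assumptions, not the book's):

#include <vector>

struct Obj {
    bool marked = false;
    std::vector<Obj*> fields;  // the object's pointer fields
};

void markFromRoots(const std::vector<Obj*>& roots) {
    std::vector<Obj*> worklist;  // a stack: depth-first traversal
    for (Obj* r : roots)
        if (r != nullptr && !r->marked) {
            r->marked = true;
            worklist.push_back(r);
        }
    while (!worklist.empty()) {
        Obj* ref = worklist.back();   // most recently marked: likely
        worklist.pop_back();          // still in cache, as noted above
        for (Obj* child : ref->fields)
            if (child != nullptr && !child->marked) {
                child->marked = true;
                worklist.push_back(child);
            }
    }
}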
The sweep phase returns unmarked nodes to the allocator (Algorithm 2.3). Typically, the collector sweeps the heap linearly, starting from the bottom, freeing unmarked nodes and resetting the mark-bits of marked nodes in preparation for the next collection cycle. Note that we can avoid the cost of resetting the mark-bit of live objects if the sense of the bit is switched between one collection and the next.
We will not discuss the implementation of allocate and free until Chapter 7, but note that the mark-sweep collector imposes constraints upon the heap layout. First, this collector does not move objects. The memory manager must therefore be careful to try to reduce the chance that the heap becomes so fragmented that the allocator finds it difficult to meet new requests, which would lead to the collector being called too frequently, or in the worst case, preventing the allocation of new memory at all. Second, the sweeper must be able to find each node in the heap. In practice, given a node, sweep must be able to find the next node even in the presence of padding introduced between objects in order to observe alignment requirements. Thus, nextObject may have to parse the heap instead of simply adding the size of the object to its address (line 7 in Algorithm 2.3); we also discuss heap parsability in Chapter 7.
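Continuing the toy model from the marking sketch, a sweep over a heap kept as a list of nodes in address order might look like this (our own illustration; a real sweeper walks raw memory and must parse object sizes and padding):

#include <vector>

struct Node { bool marked = false; /* payload ... */ };

// Unmarked nodes go back to the allocator's free-list; marked nodes
// have their bit reset ready for the next collection cycle.
void sweep(std::vector<Node*>& heap, std::vector<Node*>& freeList) {
    std::vector<Node*> live;
    for (Node* n : heap) {
        if (n->marked) {
            n->marked = false;     // reset for the next cycle
            live.push_back(n);
        } else {
            freeList.push_back(n); // reclaimed: garbage
        }
    }
    heap.swap(live);
}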
Figure 2.1: Marking with the tricolour abstraction. Black objects and their children have been visited by the collector; grey objects are on the mark stack, known but not yet fully processed; white objects have not yet been reached.

Objects are coloured by mark-sweep collection as follows. Figure 2.1 shows a simple object graph and a mark stack (implementing the work list), mid-way through the mark phase.
If a memory location has been accessed recently, it is very likely that it will be accessed again soon, and so it is worth caching its value. Applications may also exhibit good spatial locality: if a location is accessed, it is likely that adjacent locations will also be accessed soon. Modern hardware can take advantage of this property in two ways. Rather than transferring single words between a cache and lower levels of memory, each entry in the cache (the cache line or cache block) holds a fixed number of bytes, typically 32-128 bytes. Secondly, processors may use hardware prefetching. For example, the Intel Core micro-architecture can detect a regular stride in the memory access pattern and fetch streams of data in advance. Explicit prefetching instructions are also commonly available for program-directed prefetching.
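With GCC or Clang, program-directed prefetching is available through a builtin; a marking loop might issue a prefetch for the next work-list entry before processing the current one (a FIFO buffer, as in Algorithm 2.6, gives the prefetch time to complete). A sketch:

// Compile with GCC or Clang; __builtin_prefetch is a compiler builtin,
// not a library call.
void traceWithPrefetch(void** workBuffer, int count) {
    for (int i = 0; i < count; i++) {
        if (i + 1 < count)
            // hint: read access (0), keep in all cache levels (3)
            __builtin_prefetch(workBuffer[i + 1], 0, 3);
        // ... process workBuffer[i] while the next line loads ...
    }
}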
The granularity of the bitmap depends on the object alignment requirements of the virtual machine. Either a single bitmap can be used or, in a block structured heap, a separate bitmap can be used for each block. The latter organisation has the advantage that no space is wasted if the heap is not contiguous. Per-block bitmaps might be stored in the blocks. However, placing the bitmap at a fixed position in each block risks degrading performance. This is because the bitmaps will contend for the same sets in a set-associative cache. Also, accessing the bitmap implies touching the page. Thus it may be better to use more instructions to access the bit rather than to incur locality overheads due to paging and cache associativity. To avoid the cache associativity issue, the position of the bitmap in the block can be varied by computing some simple hash of the block's address to determine an offset for the bitmap. Alternatively, the bitmap can be stored to the side [Boehm and Weiser, 1988], but using a table that is somehow indexed by block, perhaps by hashing. This avoids both paging and cache conflicts.
Bitmaps suffice if there is only a single marking thread. Otherwise, setting a bit in a bitmap is vulnerable to races that lose updates, whereas setting a bit in an object header only risks setting the same bit twice: the operation is idempotent. Instead of a bitmap, byte-maps are commonly used (at the cost of an 8-fold increase in space), thereby making marking races benign. Alternatively, a bitmap must use a synchronised operation to set a bit. In practice, matters are often more complicated for header bits in systems that allow marking concurrently with mutators, since header words are typically shared with mutator data such as locks or hash codes. With care, it may be possible to place this data and mark-bits in different bytes of a header word. Otherwise, even mark-bits in headers must be set atomically.
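The synchronised bit-set is a one-liner with an atomic read-modify-write; a plain load, or, store sequence could lose a concurrent update to a neighbouring bit in the same word. A C++ sketch (the word layout is our own choice):

#include <atomic>
#include <cstddef>
#include <cstdint>

// One mark-bit per object index, packed 64 to a word.
void setMarked(std::atomic<std::uint64_t>* bitmap, std::size_t objIndex) {
    bitmap[objIndex / 64].fetch_or(std::uint64_t{1} << (objIndex % 64),
                                   std::memory_order_relaxed);
}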
Mark bitmaps have a number of potential advantages. We identify these now, and then examine whether they materialise in practice on modern hardware. A bitmap stores marks much more densely than if they are stored in object headers. Consider how mark-sweep behaves with a mark bitmap. With a bitmap, marking will not modify any object, but will only read pointer fields of live objects. Other than loading the type descriptor field, no other part of pointer-free objects will be accessed. Sweeping will not read or write to any live object, although it may overwrite fields of garbage objects as part of freeing them (for example to link them into a free-list). Thus bitmap marking is likely to modify fewer words, and to dirty fewer cache lines, so less data needs to be written back to memory.
First, the collector must not alter the value stored in any location owned by the mutator (including objects and roots). This rules out all algorithms that move objects, since this would require updating every reference to a moved object. It also rules out storing mark-bits in object headers, since the 'object' in question might not be an object if it was reached by following a false pointer. Setting or clearing a bit might destroy user data. Second, it is very useful to minimise the chance of the mutator interfering with the collector's data. Adding a header word for the collector's use, contiguous to every object, is riskier than keeping collector metadata such as mark-bits in a separate data structure.
Bitmap marking was also motivated by the concern to minimise the amount of paging caused by the collector [Boehm, 2000]. However, in modern systems, any paging at all due to the collector is generally considered unacceptable. The question for today is whether bitmap marking can improve cache performance. There is considerable evidence that objects tend to live and die in clusters [Hayes, 1991; Jones and Ryder, 2008]. Many allocators will tend to allocate these objects close to each other. Sweeping with a bitmap has two advantages. It allows the mark-bits of clusters of objects to be tested and cleared in groups, as the common case will be that either every bit/byte is set or every bit/byte is clear in a map word. A corollary is that it is simple from the bitmap to determine whether a complete block of objects is garbage, thus allowing the whole block to be returned to the allocator.
Many memory managers use a block-structured heap (for example, Boehm and Weiser
[1988]). A straightforward implementation might reserve a prefix of each block for its
bitmap. As previously discussed, this leads to unnecessary cache conflicts and page accesses,
so collectors tend to store bitmaps separately from user data blocks.
Garner et al [2007] adopt a hybrid approach, associating each block in a segregated-fits
allocator's data structure with a byte in a map, as well as marking a bit in object headers.
The byte is set if and only if the corresponding block contains at least one object. The
byte-map of used/unused blocks thus allows the sweeper to determine easily which blocks are
completely empty (of live objects) and can be recycled as a whole. This has two
advantages. Both the bit in the object header and the byte in the byte-map, corresponding to
the block in which the object resides, can be set without using synchronised operations.
Furthermore, there are no data dependencies on either write (which might lead to cache
stalls), and writing the byte in the byte-map is unconditional.
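A minimal C sketch of this hybrid scheme might look as follows; the block size, header layout and table size are assumptions for illustration only.

    #include <stdint.h>

    #define LOG_BLOCK_SIZE 15                /* assumed 32 KiB blocks */

    static uint8_t blockLive[1 << 17];       /* byte-map: one byte per block */

    /* Mark an object: a bit in its own header plus the byte for its block.
     * Neither write needs synchronisation (both are idempotent), there is no
     * data dependency between them, and the byte store is unconditional. */
    static void markObject(uintptr_t obj, uintptr_t heapStart) {
        *(uintptr_t *)obj |= 1;              /* assumed spare low bit in header word */
        blockLive[(obj - heapStart) >> LOG_BLOCK_SIZE] = 1;
    }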
Printezis and Detlefs [2000] use bitmaps to reduce the amount of space used for mark
stacks in a mostly-concurrent, generational collector. First, as usual, mutator roots are
marked by setting a bit in the map. Then, the marking thread linearly searches this
bitmap, looking for live objects. Algorithm 2.4 strives to maintain the invariant that marked
objects below the current 'finger', cur in the mark routine, are black and those above it are
 1 mark():
 2   cur ← nextInBitmap()
 3   while cur < HeapEnd            /* marked ref is black if and only if ref < cur */
 4     add(worklist, cur)
 5     markStep(cur)
 6     cur ← nextInBitmap()
 7
 8 markStep(start):
 9   while not isEmpty(worklist)
10     ref ← remove(worklist)       /* ref is marked */
11     for each fld in Pointers(ref)
12       child ← *fld
13       if child ≠ null && not isMarked(child)
14         setMarked(child)
15         if child < start
16           add(worklist, child)
grey. When the next live (marked) object cur is found, it is pushed onto the stack and we
enter the usual marking loop to restore the invariant: objects are popped from the stack
and their children marked recursively until the mark stack is empty. If an item is below
cur in the heap, it is pushed onto the mark stack; otherwise its processing is deferred to
later in the linear search. The main difference between this algorithm and Algorithm 2.1 is
its conditional insertion of children onto the stack at line 15. Objects are only marked
recursively (thus consuming mark stack space) if they are behind the black wavefront, which
moves linearly through the heap. Although the complexity of this algorithm is
proportional to the size of the space being collected, in practice searching a bitmap is cheap.
A similar approach can be used to deal with mark stack overflow. When the stack
overflows, this is noted and the object is marked but not pushed onto the stack. Marking
continues until the stack is exhausted. Now we must find those marked objects that could
not be added to the stack. The collector searches the heap, looking for any marked objects
with one or more unmarked children, and continues the trace from these children. The
most straightforward way to do this is with a linear sweep of the heap. Sweeping a bitmap
will be more efficient than examining a bit in the header of each object in the heap.
One way to improve the cache behaviour of the sweep phase is to prefetch objects. In order to
avoid fragmentation, allocators supporting mark-sweep collectors typically lay out objects
of the same size consecutively (see Chapter 7 on page 93), leading to a fixed stride as a block
of same-sized objects is swept. Not only does this pattern allow software prefetching, but
it is also ideal for the hardware prefetching mechanisms found in modern processors.
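The fixed stride makes a prefetching sweep loop straightforward, as in this C sketch. Here isMarked, unsetMark, addToFreeList and the prefetch distance are assumed runtime hooks and tuning values, not definitions from the text.

    #include <stddef.h>

    extern int  isMarked(void *obj);         /* assumed runtime hooks */
    extern void unsetMark(void *obj);
    extern void addToFreeList(void *slot);

    #define PREFETCH_AHEAD 4                 /* assumed tuning parameter */

    /* Sweep one block of same-sized slots with a fixed stride, prefetching a
     * few slots ahead so the mark test rarely stalls on a cache miss. */
    static void sweepBlock(char *start, char *end, size_t slotSize) {
        for (char *slot = start; slot < end; slot += slotSize) {
            __builtin_prefetch(slot + PREFETCH_AHEAD * slotSize);
            if (isMarked(slot))
                unsetMark(slot);
            else
                addToFreeList(slot);         /* unmarked: recycle the slot */
        }
    }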
2.5 Lazy sweeping
 1 atomic collect():
 2   markFromRoots()
 3   for each block in Blocks
 4     if not isMarked(block)             /* no objects marked in this block? */
 5       add(blockAllocator, block)       /* return block to block allocator */
 6     else
 7       add(reclaimList, block)          /* queue block for lazy sweeping */
 8
 9 atomic allocate(sz):
10   result ← remove(sz)                  /* allocate from size class for sz */
11   if result = null                     /* if no free slots for this size... */
12     lazySweep(sz)                      /* sweep a little */
13     result ← remove(sz)
14   return result                        /* if still null, collect */
15
16 lazySweep(sz):
17   repeat
18     block ← nextBlock(reclaimList, sz)
19     if block ≠ null
20       sweep(start(block), end(block))
21       if spaceFound(block)
22         return
23   until block = null                   /* reclaim list for this size class is empty */
Can the time for which the mutators are stopped during the sweep phase be reduced or
even eliminated? We observe two properties of objects and their mark-bits. First, once an
object is garbage, it remains garbage: it can neither be seen nor be resurrected by a mutator.
Second, mutators cannot access mark-bits. Thus, the sweeper can be executed in parallel
with mutator threads, modifying mark-bits and even overwriting fields of garbage objects
to link them into allocator structures. The sweeper (or sweepers) could be executed as
separate threads, running concurrently with the mutator threads, but a simple solution is to
use lazy sweeping [Hughes, 1982]. Lazy sweeping amortises the cost of sweeping by having
the allocator perform the sweep. Rather than a separate sweep phase, the responsibility
for finding free space is devolved to allocate. At its simplest, allocate advances the
sweep pointer until it finds sufficient space in a sequence of unmarked objects. However,
it is more practical to sweep a block of several objects at a time.
Algorithm 2.5 shows a lazy sweeper that operates on a block of memory at a time. It is
common for allocators to place objects of the same size class into a block (we discuss this in
detail in Chapter 7). Each size class will have one or more current blocks from which
it can allocate and a reclaim list of blocks not yet swept. As usual the collector will mark
all live objects in the heap, but instead of eagerly sweeping the whole heap, collect will
simply return any completely empty blocks to the block-level allocator (line 5). All other
blocks are added to the reclaim queue for their size class. Once the stop-the-world phase
of the collection cycle is complete, the mutators are restarted. The allocate method
first attempts to acquire a free slot of the required size from an appropriate size class (in
the same way as Algorithm 7.2 would). If that fails, the lazy sweeper is called to sweep
one or more remaining blocks of this size class, but only until the request can be satisfied
(line 12). However, it may be the case that no blocks remain to be swept or that none of
the blocks swept contained any free slots. In this case, the sweeper attempts to acquire
a whole fresh block from a lower level, block allocator. This fresh block is initialised by
setting up its metadata, for example, threading a free-list through its slots or creating a
mark byte-map. However, if no fresh blocks are available, the collector must be called.
There is a subtle issue that arises from lazy sweeping a block-structured heap, such as
one that allocates from different size classes. Hughes [1982] worked with a contiguous
heap and thus guaranteed that the allocator would sweep every node before it ran out of
space and invoked the garbage collector again. However, lazily sweeping separate size
classes does not make this guarantee, since it is almost certain that the allocator will exhaust
one size class (and all the empty blocks) before it has swept every block in every other
size class. This leads to two problems. First, garbage objects in unswept blocks will not
be reclaimed, leading to a memory leak. If the block also contains a truly live object, this
leak is harmless since these slots would not have been recycled anyway until the mutator
made a request for an object of this size class. Second, if all the objects in the unswept
block subsequently become garbage, we have lost the opportunity to reclaim the whole
block and recycle it to more heavily used size classes.
The simplest solution is to complete sweeping all blocks in the heap before starting
to mark. However, it might be preferable to give a block more opportunities to be lazily
swept. Garner et al [2007] trade some leakage for avoiding any eager sweeps. They achieve
this for Jikes RVM/MMTk [Blackburn et al, 2004b] by marking objects with a bounded
integer rather than a bit. This does not usually add space costs, since there is often room to
use more than one bit if marks are stored in object headers, and separate mark tables often
mark with bytes rather than bits. Each collection cycle increments, modulo 2^K, the value
used as the mark representing 'live', where K is the number of mark-bits used, thus rolling
the mark back to zero on overflow. In this way, the collector can distinguish between an
object marked in this cycle and one marked in a previous cycle. Only marks equal to the
current mark value are considered to be set. Marking value wrap-around is safe because,
immediately before the wrap-around, any live object in the heap is either unmarked
(allocated since the last collection) or has the maximum mark-bit value. Any object with a
mark equal to the next value to be used must have been marked last some multiple of 2^K
collections ago. Therefore it must be floating garbage and will not be visited by the marker.
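As a concrete illustration, the following C sketch implements such a cyclic mark with K bits in the header word; the constants and header layout are assumptions, not MMTk's actual encoding.

    #include <stdint.h>

    #define K 4                                      /* assumed number of mark bits */
    #define MARK_MASK ((1u << K) - 1)

    static unsigned currentMark = 1;                 /* value meaning 'live this cycle' */

    /* At the start of each collection cycle, advance the live value modulo 2^K,
     * rolling back to zero on overflow as described above. */
    static void startCycle(void) {
        currentMark = (currentMark + 1) & MARK_MASK;
    }

    /* Only a mark equal to the current value counts as set. */
    static int isMarked(uint32_t header) {
        return (header & MARK_MASK) == currentMark;
    }

    static uint32_t setMark(uint32_t header) {
        return (header & ~MARK_MASK) | currentMark;
    }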
This potential leak is addressed somewhat by block marking. Whenever the MMTk
collector marks an object, it also marks its block. If none of the objects in a block has been
marked with the current value, then the block will not have been marked either, and so
will be reclaimed as a whole at line 5 in Algorithm 2.5. Given the tendency for objects to
live and die in clumps, this is an effective tactic.
Lazy sweeping offers a number of benefits. It has good locality: object slots tend to
be used soon after they are swept. It reduces the algorithmic complexity of mark-sweep
to be proportional to the size of the live data in the heap, the same as semispace copying.
Figure 2.2: Marking with a first-in, first-out prefetch buffer: references popped from the mark stack are prefetched and inserted into the FIFO queue, and the oldest entry in the queue is returned to mark.
add(worklist, item):
    markStack ← getStack(worklist)
    push(markStack, item)

remove(worklist):
    markStack ← getStack(worklist)
    addr ← pop(markStack)
    prefetch(addr)
    fifo ← getFifo(worklist)
    prepend(fifo, addr)
    return remove(fifo)
mark():
    while not isEmpty(worklist)
        obj ← remove(worklist)
        if not isMarked(obj)
            setMarked(obj)
            for each fld in Pointers(obj)
                child ← *fld
                if child ≠ null
                    add(worklist, child)
Cher et al [2004] observe that the fundamental problem is that cache lines are fetched
in a breadth-first, first-in, first-out (FIFO) order, but the mark-sweep algorithm traverses the
graph depth-first, last-in, first-out (LIFO). Their solution is to insert a first-in, first-out queue
in front of the mark stack (Figure 2.2 and Algorithm 2.6). As usual, when mark adds an
object to its work list, a reference to the object is pushed onto a mark stack. However, when
mark wants to acquire an object from the work list, a reference is popped from the mark
stack but inserted into the queue, and the oldest item in the queue is returned to mark. The
reference popped from the stack is also prefetched, the length of the queue determining the
prefetch distance. Prefetching a few lines beyond the popped reference will help to ensure
that sufficient fields of the object to be scanned are loaded without cache misses.
Prefetching the object to be marked through the first-in, first-out queue enables mark
to load the object to be scanned without cache misses (lines 16-17 in Algorithm 2.2).
However, testing and setting the mark of the child nodes will incur a cache miss (line 18).
Garner et al [2007] realised that mark's tracing loop can be restructured to offer greater
opportunities for prefetching. Algorithm 2.2 added each node of the live object graph to
the work list exactly once; an alternative would be to traverse and add each edge exactly
once. Instead of adding children to the work list only if they are unmarked, this algorithm
inserts the children of unmarked objects unconditionally (Algorithm 2.7). Edge
enqueuing requires more instructions to be executed and leads to larger work lists than node
enqueuing, since graphs must contain more edges than nodes (Garner et al suggest that
typical Java applications have about 40% more edges than nodes). However, if the cost of
adding and removing these additional work list entries is sufficiently small, then the gains
from reducing cache misses will outweigh the cost of this extra work. Algorithm 2.7 hoists
marking out of the inner loop. The actions that might lead to cache misses, isMarked and
Pointers, now operate on the same object obj, which has been prefetched through the
first-in, first-out queue.

2.7 Issues to consider
Mutator overhead

Mark-sweep in its simplest form imposes no overhead on mutator read and write
operations. In contrast, reference counting (which we introduce in Chapter 5) imposes a
significant overhead on the mutator. However, note that mark-sweep is also commonly used as
a base algorithm for more sophisticated collectors which do require some synchronisation
between mutator and collector. Both generational collectors (Chapter 9) and concurrent
and incremental collectors (Chapter 15) require the mutator to inform the collector when
they modify pointers. However, the overhead of doing so is typically small, a few percent
of overall execution time.
Throughput

Combined with lazy sweeping, mark-sweep offers good throughput. The mark phase is
comparatively cheap, and is dominated by the cost of chasing pointers.
Space usage

Mark-sweep has significantly better space usage than approaches based on semispace
copying. It also potentially has better space usage than reference counting algorithms.
Mark-bits can often be stored at no cost in spare bits in object headers. Alternatively, if a
side bitmap table is used, the space overhead depends on object alignment requirements;
it will be no worse than 1/alignment of the heap (1/32 or 1/64 of the heap, depending on
architecture), and possibly better depending on alignment restrictions. Reference counting,
on the other hand, requires a full slot in each object header to store counts (although this can be
reduced if a limit is placed on the maximum reference count stored). Copying collectors
make even worse use of available memory, dividing the available heap into two equally
sized semispaces, only one of which is used by the mutator at any time. On the debit
side, non-compacting collectors, like mark-sweep and reference counting, require more
complex allocators, such as segregated-fits free-lists.
A tracing collector must identify all live objects in a space before it can reclaim the memory used by any dead
objects. This is an expensive operation and so should be done infrequently. This means
that tracing collectors must be given some headroom in which to operate in the heap. If
the live objects occupy too large a proportion of the heap, and the allocators allocate too
fast, then a mark-sweep collector will be called too often: it will thrash. For moderate to
large heaps, the headroom necessary may be between 20% and 50% of the heap [Jones,
1996], though Hertz and Berger [2005] show that, in order to provide the same throughput,
Java programs managed by mark-sweep collection may need a heap several times larger
than if it were to be managed by explicit deallocation.
Mostly-copying collectors offer a hybrid approach [Bartlett, 1988a; Hosking, 2006]. Here, a program's roots must be treated conservatively (if
it looks like a pointer, assume it is a pointer), so the collector cannot move their referents.
However, type-accurate information about the layout of objects is available to the collector,
so it can move others that are not otherwise pinned to their location.
Safety in uncooperative systems managed by a conservative collector precludes the
collector's modifying user data, including object headers. It also encourages placing collector
metadata separate from user or other run-time system data, to reduce the risk of
modification by the mutator. For both reasons, it is desirable to store mark-bits in bitmaps rather
than in object headers.
The problem with not moving objects is that, in long running applications, the heap
tends to become fragmented. Non-moving memory allocators require space O(log(max/min))
larger than the minimum possible, where min and max are the smallest and largest possible
object sizes [Robson, 1971, 1974]. Thus a non-compacting collector may have to be called
more frequently than one that compacts. Note that all tracing collectors need sufficient
headroom (say, 20-50%) in the heap in order to avoid thrashing the collector.
To avoid having performance suffer due to excessive fragmentation, many production
collectors that use mark-sweep to manage a region of the heap also periodically use
another algorithm such as mark-compact to defragment it. This is particularly true if the
application's mix of object sizes changes over time. If the application allocates more large objects than it previously did, the
result may be many small holes in the heap no longer being reused for new allocations of
objects of the same size. Conversely, if the application begins to allocate smaller objects
than before, these smaller objects might be allocated in gaps previously occupied by larger
objects, with the remaining space in each gap being wasted. However, careful heap
management can reduce the tendency to fragment by taking advantage of objects' tendency to
live and die in clumps [Dimpsey et al, 2000; Blackburn and McKinley, 2008]. Allocation
with segregated-fits can also reduce the need to compact.
Chapter 3

Mark-compact garbage collection

Fragmentation¹ can be a problem for non-moving collectors. Although there may be space
available in the heap, either there may be no contiguous chunk of free space sufficiently
large to handle an allocation request, or the time taken to allocate may become excessive
as the memory manager has to search for suitable free space. Allocators may alleviate this
problem by storing small objects of the same size together in blocks [Boehm and Weiser,
1988], especially, as we noted earlier, for applications that do not allocate many very large
objects and whose ratios of different object sizes do not change much. However, many
long running applications, managed by non-moving collectors, will fragment the heap, and
performance will suffer.
In this and the next chapter we discuss two strategies for compacting live objects in
the heap in order to eliminate external fragmentation. The major benefit of a compacted
heap is that it allows very fast, sequential allocation, simply by testing against a heap limit
and 'bumping' a free pointer by the size of the allocation request (we discuss allocation
mechanisms further in Chapter 7). The strategy we consider in this chapter is in-place
compaction² of objects into one end of the same region. In the next chapter we discuss
a second strategy, copying collection: the evacuation of live objects from one region to
another (for example, between semispaces).

Mark-compact algorithms operate in a number of phases. The first phase is always a
marking phase, which we discussed in the previous chapter. Then, further compacting
phases compact the live data by relocating objects and updating the pointer values of all
live references to objects that have moved. The number of passes over the heap, the order
in which these are executed and the way in which objects are relocated varies from
algorithm to algorithm. The compaction order has locality implications. Any moving collector
may rearrange objects in the heap in one of three ways.

Arbitrary: objects are relocated without regard for their original order or whether they
point to one another.

Linearising: objects are relocated so that they are adjacent to related objects, such as ones
to which they refer, which refer to them, which are siblings in a data structure, and
so on, as far as this is possible.

Sliding: objects are slid to one end of the heap, squeezing out garbage, thereby
maintaining their original allocation order in the heap.
¹We discuss fragmentation in more detail in Section 7.3.
²Often called compactifying in older papers.
Most compacting collectors of which we are aware use arbitrary or sliding orders.
Arbitrary order compactors are simple to implement and fast to execute, particularly if all
nodes are of a fixed size, but lead to poor spatial locality for the mutator because related
objects may be dispersed to different cache lines or virtual memory pages. All modern
mark-compact collectors implement sliding compaction, which does not interfere with
mutator locality by changing the relative order of object placement. Copying collectors
can even improve mutator locality by varying the order in which objects are laid out,
placing them close to their parents or siblings. Conversely, recent experiments with a
collector that compacts in an arbitrary order confirm that its rearrangement of objects' layout
can lead to drastic reductions in application throughput [Abuaiadh et al, 2004].

Compaction algorithms may impose further constraints. Arbitrary order algorithms
handle objects of only a single size or compact objects of different sizes separately.
Compaction may require two or three passes through the heap. It may be necessary to provide
an extra slot in object headers to hold relocation information: such an overhead is likely
to be significant for a general purpose memory manager. Compaction algorithms may
impose restrictions on pointers. For example, in which direction may references point? Are
interior pointers allowed? We discuss the issues they present in Chapter 11.
We examine several styles of compaction algorithm. First, we introduce Edwards's
Two-Finger collector [Saunders, 1974]. Although this algorithm is simple to implement
and fast to execute, it disturbs the layout of objects in the heap. The second compacting
collector is a widely used sliding collector, the Lisp 2 algorithm. However, unlike the
Two-Finger algorithm, it requires an additional slot in each object's header to store its
forwarding address, the location to which it will be moved. Our third example, Jonkers's
threaded compaction [1979], slides objects without any space overhead. However, it makes
two passes over the heap, both of which tend to be expensive. The final class of compacting
algorithms we examine uses auxiliary side tables to compact the heap in a single pass (Section 3.4).
atomic collect():
    markFromRoots()
    compact()
3.1 Two-finger compaction
The idea behind Edwards's Two-Finger algorithm is straightforward:
given the volume of live data in the region to be compacted, we know where the high-water
mark of the region will be after compaction. Live objects above this threshold are
moved into gaps below the threshold. Algorithm 3.1 starts with two pointers or 'fingers':
free, which points to the start of the region, and scan, which starts at the end of the region.
The first pass repeatedly advances the free pointer until it finds a gap (an unmarked
object) in the heap, and retreats the scan pointer until it finds a live object. If the free
and scan fingers pass each other, the phase is complete. Otherwise, the object at scan is
moved into the gap at free, overwriting a field of the old copy (at scan) with a
forwarding address, and the process continues. This is illustrated in Figure 3.1, where object A has
been moved to its new location A' and some slot of A (say, the first slot) has been
overwritten with the address A'. Note that the quality of compaction depends on the size of
the gap at free closely matching the size of the live object at scan. Unless this algorithm
is used on fixed-size objects, the degree of defragmentation might be very poor indeed.
Figure 3.1: Edwards's Two-Finger algorithm. Live objects at the top of the
heap are moved into free gaps at the bottom of the heap. Here, the object at
A has been moved to A'. The algorithm terminates when the free and scan
pointers meet.
compact():
    relocate(HeapStart, HeapEnd)
    updateReferences(HeapStart, free)

relocate(start, end):
    free ← start
    scan ← end
    while free < scan
        while isMarked(free)
            unsetMarked(free)
            free ← free + size(free)           /* find next hole */
        while not isMarked(scan) && scan > free
            scan ← scan - size(scan)           /* find previous live object */
        if scan > free
            unsetMarked(scan)
            move(scan, free)                   /* move object into the hole */
            *scan ← free                       /* leave forwarding address in old copy */
            free ← free + size(free)
            scan ← scan - size(scan)

updateReferences(start, end):
    for each fld in Roots                      /* update roots that pointed to moved objects */
        ref ← *fld
        if ref ≥ end
            *fld ← *ref                        /* use the forwarding address left in first pass */
    scan ← start
    while scan < end                           /* update fields in the live region */
        for each fld in Pointers(scan)
            ref ← *fld
            if ref ≥ end
                *fld ← *ref                    /* use the forwarding address left in first pass */
        scan ← scan + size(scan)
At the end of this phase, free points at the high-water mark. The second pass updates
the values of pointers that referred to old locations beyond the high-water mark with the
forwarding addresses found in those locations, that is, with the objects' new locations.
The benefits of this algorithm are that it is simple and fast, doing minimal work at each
iteration. It has no memory overhead, since forwarding addresses are written into slots
above the high-water mark only after the live object at that location has been relocated: no
information is destroyed. The algorithm supports interior pointers. Its memory access
patterns are predictable, and hence provide opportunities for prefetching (by either hardware
or software) which should lead to good cache behaviour in the collector. However, the
movement of the scan pointer in relocate does require that the heap (or at least the live
objects) can be parsed 'backwards'; this could be done by storing mark-bits in a separate
bitmap, or recording the start of each object in a bitmap when it is allocated. Unfortunately,
the order of objects in the heap that results from this style of compaction is arbitrary, and
this tends to harm the mutator's locality. Nevertheless, it is easy to imagine how mutator
locality might be improved somewhat. Since related objects tend to live and die together
in clumps, rather than moving individual objects, we could move groups of consecutive
live objects into large gaps. In the remainder of this chapter, we look at sliding collectors
which maintain the layout order of the mutator.
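To make the mechanics concrete, here is a minimal C sketch of the relocate pass over a toy heap of equal-sized slots; the slot model, the live[] array and the use of the first field for the forwarding index are all illustrative assumptions.

    #include <stddef.h>
    #include <stdint.h>

    #define NSLOTS 1024

    typedef struct { uintptr_t fields[4]; } Slot;   /* toy fixed-size objects */
    static Slot slots[NSLOTS];
    static int  live[NSLOTS];                       /* mark bits, one per slot */

    /* Two-Finger relocate: 'free' advances to the next gap, 'scan' retreats to
     * the previous live object; each live object above the eventual high-water
     * mark is moved down and leaves a forwarding index in its old first field. */
    static size_t relocate(void) {
        size_t free = 0, scan = NSLOTS - 1;
        for (;;) {
            while (free < scan && live[free]) free++;
            while (scan > free && !live[scan]) scan--;
            if (free >= scan) break;                /* fingers met: done */
            slots[free] = slots[scan];              /* move object into the gap */
            live[free] = 1;
            live[scan] = 0;
            slots[scan].fields[0] = free;           /* forwarding index in old copy */
            free++;
            scan--;
        }
        return free;   /* high-water mark: references at or above it need forwarding */
    }

An update pass would then rewrite every reference whose index is at or above the returned high-water mark, using the forwarding index stored in the stale copy.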
3.2 The Lisp 2 algorithm

Cohen and Nicolau [1983] found the Lisp 2 compactor to be the fastest of the compaction algorithms they studied. However, they did
not take cache or paging behaviour into account, which is an important factor as we have
seen before. The chief drawback of the Lisp 2 algorithm is that it requires an additional
full-slot field in every object header to store the address to which the object is to be moved;
this field can also be used for the mark-bit.

The first pass over the heap (after marking) computes the location to which each live
object will be moved, and stores this forwarding address in the object's header field
(Algorithm 3.2). It is essential that the direction of the passes (upward, from lower to
higher addresses, in our code) is opposite to the direction in which the objects will move
(downward, from higher to lower addresses).
 1 compact():
 2   computeLocations(HeapStart, HeapEnd, HeapStart)
 3   updateReferences(HeapStart, HeapEnd)
 4   relocate(HeapStart, HeapEnd)
 5
 6 computeLocations(start, end, toRegion):
 7   scan ← start
 8   free ← toRegion
 9   while scan < end
10     if isMarked(scan)
11       forwardingAddress(scan) ← free
12       free ← free + size(scan)
13     scan ← scan + size(scan)
14
15 updateReferences(start, end):
16   for each fld in Roots                 /* update roots */
17     ref ← *fld
18     if ref ≠ null
19       *fld ← forwardingAddress(ref)
20
21   scan ← start
22   while scan < end                      /* update fields of live objects */
23     if isMarked(scan)
24       for each fld in Pointers(scan)
25         ref ← *fld
26         if ref ≠ null
27           *fld ← forwardingAddress(ref)
28     scan ← scan + size(scan)
29 relocate(start, end):
30   scan ← start
31   while scan < end
32     if isMarked(scan)
33       dest ← forwardingAddress(scan)
34       move(scan, dest)
35       unsetMarked(dest)
36     scan ← scan + size(scan)
Figure 3.2: Jonkers's threaded compaction. All fields referring to node N are chained through N's header word, info; following the chain lets the collector update each referring field with N's new address and then restore the header.
This guarantees that when the third pass copies an object, it is to
a location that has already been vacated. Some parallel compactors that divide the heap
into blocks slide the contents of alternating blocks in opposite directions. This results in
larger 'clumps', and hence larger free gaps, than sliding each block's contents in the same
direction [Flood et al, 2001]. An example is shown in Figure 14.8.

This algorithm can be improved in several ways. Data can be prefetched in similar
ways as for the sweep phase of mark-sweep collectors. Adjacent garbage can be merged
after line 10 of computeLocations in order to improve the speed of subsequent passes.
3.3 Threaded compaction

compact():
    updateForwardReferences()
    updateBackwardReferences()

thread(ref):                        /* thread a reference */
    if *ref ≠ null
        *ref, **ref ← **ref, ref

update(ref, addr):                  /* unthread all references, replacing with addr */
    tmp ← *ref
    while isReference(tmp)
        *tmp, tmp ← addr, *tmp
    *ref ← tmp

updateForwardReferences():
    for each fld in Roots
        thread(*fld)

    free ← HeapStart
    scan ← HeapStart
    while scan < HeapEnd
        if isMarked(scan)
            update(scan, free)      /* forward refs to scan set to free */
            for each fld in Pointers(scan)
                thread(fld)
            free ← free + size(scan)
        scan ← scan + size(scan)

updateBackwardReferences():
    free ← HeapStart
    scan ← HeapStart
    while scan < HeapEnd
        if isMarked(scan)
            update(scan, free)      /* backward refs to scan set to free */
            move(scan, free)        /* slide scan back to free */
            free ← free + size(scan)
        scan ← scan + size(scan)
When the collector follows a chain of threaded references in order to update them, it must be able to
recognise that the final field, which holds the object's original header word, does not hold a threaded pointer.
Jonkers requires two passes over the heap, the first to thread references that point
forward in the heap, and the second to thread backward pointers (see Algorithm 3.3). The
first pass starts by threading the roots. It then sweeps through the heap, start to finish,
computing a new address free for each live object encountered, determined by summing
the volume of live data encountered so far. It is easiest to understand this algorithm by
considering a single marked (live) node N. When the first pass reaches A, it will thread
the reference to N. By the time that the pass reaches N, all the forward pointers to N will
have been threaded (see Figure 3.2b). This pass can then update all the forward references
to N by following this chain and writing the value of free, the address of the location to
which N will be moved, into each previously referring slot. When it reaches the end of
the chain, the collector will restore N's info header word. The next step on this pass is
to increment free and thread N's children. By the end of this pass, all forward references
will have been updated to point to their new locations and all backward pointers will have
been threaded. The second pass similarly updates references to N, this time by following
the chain of backward pointers. This pass also moves N.
The chief advantage of this algorithm is that it does not require any additional space,
although object headers must be large enough to hold a pointer (which must be
distinguishable from a normal value). However, threading algorithms suffer a number of
disadvantages. They modify each pointer field of live objects twice, once to thread and once
to unthread and update references. Threading requires chasing pointers, so it is just as cache
unfriendly as marking, but has to chase pointers three times (marking, threading and
unthreading) in Jonkers's algorithm. Martin [1982] claimed that combining the mark phase
with the first compaction pass improved collection time by a third, but this is a testament to
the cost of pointer chasing and modifying pointer fields. Because Jonkers modifies
pointers in a destructive way, it is inherently sequential and so cannot be used for concurrent
compaction. For instance, in Figure 3.2b, once the references to N have been threaded,
there is no way to discover that the first pointer field of B held a reference to N (unless that
pointer is stored at the end of the chain as an extra slot in A's header, defeating the goal
of avoiding additional storage overhead). Finally, Jonkers does not support interior
pointers, which may be an important concern for some environments. However, the threaded
compactor from Morris [1982] can accommodate interior pointers at the cost of an
additional tag bit per field, and the restriction that the second compaction pass must be in the
opposite direction to the first (adding to the problem of heap parsability).
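The threading operations are compact but notoriously hard to visualise, so here is a self-contained C demonstration on a toy heap of words. The heap model and the is_reference encoding (indices below HEAP_WORDS count as references) are illustrative assumptions only.

    #include <assert.h>
    #include <stdint.h>

    /* A toy heap of words. Values below HEAP_WORDS are 'references' (indices
     * into the heap); anything else is ordinary data. This mirrors the
     * isReference() test that threading relies on. */
    #define HEAP_WORDS 16
    static uintptr_t heap[HEAP_WORDS];

    static int is_reference(uintptr_t v) { return v < HEAP_WORDS; }

    /* thread(ref): splice field 'ref' into the chain rooted at the header of
     * the object it refers to, saving the header's old contents in the field. */
    static void thread(uintptr_t ref) {
        uintptr_t target = heap[ref];      /* *ref: index of object N */
        if (is_reference(target)) {
            heap[ref] = heap[target];      /* field now holds old header (or chain) */
            heap[target] = ref;            /* header now points back at the field */
        }
    }

    /* update(obj, addr): walk the chain from obj's header, rewriting every
     * referring field with obj's new address and restoring the header. */
    static void update(uintptr_t obj, uintptr_t new_addr) {
        uintptr_t tmp = heap[obj];
        while (is_reference(tmp)) {        /* tmp is a threaded field index */
            uintptr_t next = heap[tmp];
            heap[tmp] = new_addr;          /* unthread: write forwarding address */
            tmp = next;
        }
        heap[obj] = tmp;                   /* restore original header word */
    }

    int main(void) {
        /* Object N lives at index 9 with a non-reference header word; the
         * fields at indices 2 and 5 refer to N, which will move to index 3. */
        uintptr_t N = 9;
        heap[N] = 0xBEEF;
        heap[2] = N;
        heap[5] = N;
        thread(2);                         /* chain: heap[9] -> 2 -> 0xBEEF */
        thread(5);                         /* chain: heap[9] -> 5 -> 2 -> 0xBEEF */
        update(N, 3);
        assert(heap[2] == 3 && heap[5] == 3 && heap[N] == 0xBEEF);
        return 0;
    }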
of
compact algorithms for multiprocessors that do precisely this. The former is a parallel,
stop-the-worldalgorithm (it employs multiple compaction threads); the latter can be can
also be configured to be concurrent (allowing mutator threads to run
alongside collector
threads), and incremental (periodically suspending a mutator thread briefly to perform a
small quantum of compaction work). We discuss the parallel, concurrentand
incremental
aspects of these algorithms in later chapters. Here, we focus on the corecompaction
algorithms in a stop-the-world setting.
Both algorithms use a number of side tables or vectors. Common to many collectors,
marking uses a bitmap with one bit for each granule (say, a word). Marking sets the bits
corresponding to the first and last granules of each live object. For example, bits 16 and
19 are set for the object marked old in Figure 3.3. By scrutinising the mark bitmap in the
compaction phase, the collector can calculate the size of any live object.
Figure 3.3: The heap (before and after compaction) and metadata used by
Compressor [Kermany and Petrank, 2006]. Bits in the mark-bit vector
indicate the start and end of each live object. Words in the offset vector hold
the address to which the first live object in their corresponding block will be
moved.
The offset table stores the forwarding address of the first live object in each
block. The new locations of the other live objects in a block can be computed on-the-fly
from the offset and mark-bit vectors. Similarly, given a reference to any object, we can
compute its block number and thus derive its forwarding address from the entry in the
offset table and the mark-bits for that block. This allows the algorithms to replace
multiple passes over the full heap to relocate objects and to update pointers with a single pass
over the mark-bit vector to construct the offset vector and a single pass over the heap (after
marking) to move objects and update references by consulting these summary vectors.
Reducing the number of heap passes has consequent advantages for locality. Let us consider
the details as they appear in Algorithm 3.4.

After marking is complete, the computeLocations routine passes over the mark-bit
vector to produce the offset vector. Essentially, it performs the same calculation as in
Lisp 2 (Algorithm 3.2) but does not need to touch any object in the heap. For example,
consider the first marked object in block 2, shown with a bold border in Figure 3.3. Bits 2
and 3, and 6 and 7 are set in the first block, and bits 3 and 5 in the second (in this example,
each block comprises eight slots). This represents 7 granules (words) that are marked in
the bitmap before this object. Thus the first live object in block 2 will be relocated to the
seventh slot in the heap. This address is recorded in the offset vector for the block (see
the dashed arrow marked offset[block] in the figure).

Once the offset vector has been calculated, the roots and live fields are updated to
reflect the new locations. The Lisp 2 algorithm had to separate the updating of references
and moving of objects because relocation information was held in the heap, and object
movement destroyed this information as relocated objects were slid over old objects. In
contrast, Compressor-type algorithms relocate objects and update references in a single pass,
updateReferencesRelocate in Algorithm 3.4. This is possible because new addresses
can be calculated reasonably quickly from the mark bitmap and the offset vector on-the-fly:
Compressor does not have to store forwarding addresses in the heap. Given the
address of any object in the heap, newAddress obtains its block number (through shift
and mask operations) and adds the object's offset within the block, computed from the
mark-bits, to the block's entry in the offset vector.
compact():
    computeLocations(HeapStart, HeapEnd, HeapStart)
    updateReferencesRelocate(HeapStart, HeapEnd)

/* computeLocations passes over the mark-bit vector, as described above,
   to fill the offset vector; it performs the same calculation as
   computeLocations in Algorithm 3.2 without touching the heap */

newAddress(old):
    block ← getBlockNum(old)
    return offset[block] + offsetInBlock(old)

updateReferencesRelocate(start, end):
    for each fld in Roots
        ref ← *fld
        if ref ≠ null
            *fld ← newAddress(ref)
    scan ← start
    while scan < end
        if isMarked(scan)
            for each fld in Pointers(scan)   /* update the fields of each live object */
                ref ← *fld
                if ref ≠ null
                    *fld ← newAddress(ref)
            dest ← newAddress(scan)
            move(scan, dest)
        scan ← scan + size(scan)
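The address arithmetic can be made concrete with a short C sketch. The granule and block sizes are chosen so that each block's mark bits fill exactly one 64-bit word, and we assume a variant bitmap in which every granule of a live object is marked; both are assumptions for illustration, not Compressor's exact encoding.

    #include <stdint.h>

    #define LOG_GRANULE 3                   /* assumed 8-byte granules */
    #define GRANULES_PER_BLOCK 64           /* assumed 512-byte blocks */

    extern uint64_t  markBits[];    /* one word of mark bits per block */
    extern uintptr_t offset[];      /* address the block's first live object moves to */
    extern uintptr_t heapStart;

    static uintptr_t newAddress(uintptr_t old) {
        size_t granule = (old - heapStart) >> LOG_GRANULE;
        size_t block = granule / GRANULES_PER_BLOCK;
        size_t bit = granule % GRANULES_PER_BLOCK;
        /* live granules in this block that precede 'old' */
        uint64_t before = markBits[block] & (((uint64_t)1 << bit) - 1);
        size_t offsetInBlock = (size_t)__builtin_popcountll(before) << LOG_GRANULE;
        return offset[block] + offsetInBlock;
    }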
3.5 Issues to consider

Is compaction necessary?

Mark-sweep garbage collection uses less memory than other techniques such as copying
collection (which we discuss in the next chapter). Furthermore, since it does not move
objects, a mark-sweep collector need only identify (a superset of) the roots of the collection; it
does not need to modify them.
Long-lived data

It is not uncommon for long-lived or even immortal data to accumulate near the beginning
of the heap in moving collectors. Copying collectors handle such objects poorly, repeatedly
copying them from one semispace to another. On the other hand, generational collectors
(which we examine in Chapter 9) deal with these well, by moving them to a different space
which is collected only infrequently. However, a generational solution might not be
acceptable if heap space is tight. It is also obviously not a solution if the space being collected is
the oldest generation of a generational collector! Mark-compact, however, can simply elect
not to compact objects in this 'sediment'. Hanson [1977] was the first to observe that these
objects tended to accumulate at the bottom of the 'transient object area' in his SITBOL
system. His solution was to track the height of this 'sediment' dynamically, and simply avoid
collecting it unless absolutely necessary, at the expense of a small amount of fragmentation.
Sun Microsystems' HotSpot Java virtual machine uses mark-compact as the default
collector for its old generation. It too avoids compacting objects in the user-configurable
'dense prefix' of the heap [Sun Microsystems, 2006]. If bitmap marking is used, the extent
of a live prefix of desired density can be determined simply by examining the bitmap.
Locality

Mark-compact collectors may preserve the allocation order of objects in the heap or they
may rearrange them arbitrarily. Although arbitrary order collectors may be faster than
other mark-compact collectors, mutator locality is likely to suffer from an arbitrary scrambling of object order. Sliding compaction has a
further benefit for some systems: the space occupied by all objects allocated after a certain
point can be reclaimed in constant time, just by retreating the free space pointer.
Limitations of mark-compact algorithms

A wide variety of mark-compact collection algorithms has been proposed. A fuller account
of many older compaction strategies can be found in Chapter 5 of Jones [1996].

Chapter 4

Copying garbage collection

Copying collection compacts the heap, allowing fast sequential allocation,
yet requires only a single pass over the live objects in the heap. Its chief disadvantage is
that it reduces the size of the available heap by half.
4.1 Semispace copying collection

A semispace copying collector evacuates, or 'scavenges', live objects from the old semispace. At the end of the collection, all live
objects will have been placed in a dense prefix of tospace. The collector simply abandons
the fromspace (and the objects it contains) until the next collection. In practice, however, many
collectors will zero that space for safety during the initialisation of the next collection cycle
(see Chapter 11, where we discuss the interface with the run-time system).
After initialisation, semispace copying collectors populate their work list by copying
the objects directly reachable from the roots (Algorithm 4.2).

¹Note: our allocate and copy routines ignore issues of alignment and padding, and also the possibility that
a copied object may have a different format, such as an explicit rather than an implicit hash code for Java objects.
createSemispaces():
    tospace ← HeapStart
    extent ← (HeapEnd - HeapStart) / 2      /* size of a semispace */
    top ← fromspace ← HeapStart + extent
    free ← tospace

atomic allocate(size):
    result ← free
    newfree ← result + size
    if newfree > top
        return null                          /* signal 'Memory exhausted' */
    free ← newfree
    return result
It is essential that collectors preserve the topology of live objects in the tospace copy of the
heap. This is achieved by storing the address of each tospace object as a forwarding
address in its old, fromspace replica when the object is copied (line 34). The forward routine,
tracing from a tospace field, uses this forwarding address to update the field, regardless of
whether the copy was made in this tracing step or a previous one (line 22). Collection is
complete when all tospace objects have been scanned.

Unlike most mark-compact collectors, semispace copying does not require any extra
space in object headers. Any slot in a fromspace object can be used for the forwarding
address (at least, in stop-the-world implementations), since that copy of the object is not used
after the collection. This makes copying collection suitable even for header-less objects.
Cheney's elegant insight was that the work list can be represented implicitly, by the grey
objects in tospace themselves (Algorithm 4.3). After the root objects are copied, the work list (the set of grey objects)
comprises precisely those (copied but unscanned) objects between scan and free. This
invariant is maintained throughout the collection. The scan pointer is advanced as
tospace fields are scanned and updated (line 9). Collection is complete when the work list
is empty: when the scan pointer catches up with the free pointer. Thus, the actions of
this implementation are very simple. To determine termination, isEmpty does no more
than compare the scan and free pointers; remove just returns the scan pointer; and no
action is required to add work to the work list.
 1 atomic collect():
 2   flip()
 3   initialise(worklist)                 /* empty */
 4   for each fld in Roots                /* copy the roots */
 5     process(fld)
 6   while not isEmpty(worklist)          /* copy transitive closure */
 7     ref ← remove(worklist)
 8     scan(ref)
 9
10 flip():                                /* switch the roles of the semispaces */
11   fromspace, tospace ← tospace, fromspace
12   top ← tospace + extent
13   free ← tospace
14
15 scan(ref):
16   for each fld in Pointers(ref)
17     process(fld)
18
19 process(fld):                          /* update field with reference to tospace replica */
20   fromRef ← *fld
21   if fromRef ≠ null
22     *fld ← forward(fromRef)            /* update with tospace reference */
23
24 forward(fromRef):
25   toRef ← forwardingAddress(fromRef)
26   if toRef = null                      /* not copied (not marked) */
27     toRef ← copy(fromRef)
28   return toRef
29
30 copy(fromRef):                         /* copy object and return forwarding address */
31   toRef ← free
32   free ← free + size(fromRef)
33   move(fromRef, toRef)
34   forwardingAddress(fromRef) ← toRef   /* mark */
35   add(worklist, toRef)
36   return toRef
 1 initialise(worklist):
 2   scan ← free
 3
 4 isEmpty(worklist):
 5   return scan = free
 6
 7 remove(worklist):
 8   ref ← scan
 9   scan ← scan + size(scan)
10   return ref
11
12 add(worklist, ref):
13   /* nop */
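Pulling these routines together, here is a compact C sketch of a Cheney scan over a toy object model: a header word holding a field count, then pointer fields, with the forwarding address tagged in the header's low bit. All of it is illustrative rather than any production layout.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    typedef uintptr_t word;

    static word *tospaceFree;   /* next free word in tospace */
    static word *tospaceScan;   /* first grey (unscanned) object */

    #define NFIELDS(h)    ((h) >> 1)             /* header stores 2 * field count */
    #define IS_FWD(h)     ((h) & 1)
    #define MK_FWD(p)     ((word)(p) | (word)1)
    #define FWD_TARGET(h) ((word *)((h) & ~(word)1))

    /* Copy obj to tospace if not already copied; return its tospace address. */
    static word *forward(word *obj) {
        word h = obj[0];
        if (IS_FWD(h))
            return FWD_TARGET(h);                /* already copied */
        size_t n = NFIELDS(h);
        word *to = tospaceFree;
        memcpy(to, obj, (n + 1) * sizeof(word)); /* copy header and fields */
        tospaceFree += n + 1;
        obj[0] = MK_FWD(to);                     /* leave forwarding address */
        return to;
    }

    /* Cheney scan: tospaceFree must point at the base of empty tospace.
     * The grey objects are exactly those between tospaceScan and tospaceFree. */
    static void collect(word **roots, size_t nroots) {
        tospaceScan = tospaceFree;
        for (size_t i = 0; i < nroots; i++)
            if (roots[i] != NULL)
                roots[i] = forward(roots[i]);
        while (tospaceScan < tospaceFree) {      /* work list non-empty */
            size_t n = NFIELDS(tospaceScan[0]);
            for (size_t i = 1; i <= n; i++)      /* process each pointer field */
                if (tospaceScan[i] != 0)
                    tospaceScan[i] = (word)forward((word *)tospaceScan[i]);
            tospaceScan += n + 1;
        }
    }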
An example

Figure 4.1 shows an example of how a Cheney scan would copy L, a linked list structure
with pointers to the head and tail of the list. Figure 4.1a shows fromspace before the
collection starts. At the start of the collection, the roles of the semispaces are flipped and L, which
we assume is directly reachable from the roots, is copied to tospace (advancing the free
pointer), and a forwarding reference to the new location L' is written into L (for instance,
over the first field). The scan pointer points to the first object in tospace (Figure 4.1b). The
collector is now ready to start copying the transitive closure of the roots. The scan pointer
points to the first object to process. L' holds references to A and E in fromspace, so these
objects are evacuated to the location pointed at by free in tospace (advancing free), the
references in L' are updated to point to the new locations A' and E' (Figure 4.1c), and scan
is advanced to the next grey object. Note that the collector is finished with L', so it is
conceptually black, whereas the next objects to scan, A' and E', are grey. This process is repeated
for each tospace object until the scan and free pointers meet (Figure 4.1f). Observe that,
in Figure 4.1e, D' holds a reference to E, which has already been copied. The referring field
in D' is therefore updated with the forwarding address stored in E, thereby preserving the
shape of the graph. As with other tracing algorithms, copying garbage collection can cope
with all shapes of graphs, including cyclic data structures, preserving sharing properly.
4.2 Traversal order and locality

Comparing a non-moving mark-sweep collector (using a segregated-fits allocator) with copying collection, Blackburn
et al [2004a] found that mark-sweep's space efficiency made it the better choice for collection in tight heaps. Conversely, in large heaps
Figure 4.1: Copying a linked list with a Cheney scan: (a) fromspace before
the collection; (b) copy the root, L; (c) scan L's replica; (d) scan A's replica,
and so on; (e) scan C's replica; (f) scan D's replica: scan = free, so collection
is complete.
Figure 4.2: Copying a tree with different traversal orders (depth-first,
breadth-first, hierarchical decomposition and online object reordering). Each row shows
how a traversal order lays out objects in tospace, assuming that three objects
can be placed on a page (indicated by the thick borders). For online object
reordering, prime numbered (bold italic) fields are considered to be hot.
the locality benefits to the mutator of sequential allocation outweighed the space efficiency
of mark-sweep collection, leading to better miss rates at all levels of the cache hierarchy.
This was particularly true for newly allocated objects, which tend to experience higher
mutation rates than older objects [Blackburn and McKinley, 2003].

The Blackburn et al [2004a] study copied objects depth-first. In contrast, Cheney's
copying collector traverses the graph breadth-first. Although this is implemented by a linear
scan of, and hence predictable access to, the work list of grey tospace objects, breadth-first
copying adversely affects mutator locality because it tends to separate parents and
children. The table in Figure 4.2b compares the effect of different traversal orders on object
layout, given the tree in Figure 4.2a. Each row shows where different tracing orders would
place objects in tospace. If we examine row 2, we see that breadth-first traversal places
only objects 2 and 3 near their parent. In this section we look more closely at traversal
order and its consequences for locality.
White [1980] suggested long ago that the garbage collector could be used to improve
the performance of the mutator. Both copying and compacting garbage collectors move
objects, thus potentially affecting the mutators' locality patterns. Sliding is generally
considered to be the best order for mark-compact algorithms since it preserves the order of layout
of objects established by the allocator. This is a safe, conservative policy, but can we do
better?
better?
Mark-compact algorithms condense the heap in place, either by moving objects into
holes (arbitrary order compactors) or by slidinglive data (overwriting only garbage or
objects
that have already been moved), and thus have no opportunity for more locality-aware
reorganisation. However,any collector that evacuates live objects to a fresh region of the
heap without destroying the original data can rearrange their layout in order to improve
the performance of the mutator.
Unfortunately there are two reasonswhy we cannot find an optimal layout of objects,
that minimises the number of cachemissessuffered by the program. First of all, the
collectorcannot know what the pattern of future accesses to objects will be. But worse, Petrank
and Rawitz [2002] show that the placement problem is NP-complete: even a
given perfect
initialise(worklist):
    scan ← free
    partialScan ← free

remove(worklist):
    if partialScan < free
        ref ← partialScan                 /* prefer secondary scan */
        partialScan ← partialScan + size(partialScan)
    else
        ref ← scan                        /* primary scan */
        scan ← scan + size(scan)
    return ref

add(worklist, ref):                       /* secondary scan on the most recently allocated page */
    partialScan ← max(partialScan, startOfPage(ref))
Some researchers have used either profiling, on the
assumption that programs behave similarly for different inputs [Calder et al, 1998], or online
sampling, assuming that behaviour remains unchanged from one period to the next [Chilimbi
et al, 1999]. Another heuristic is to preserve allocation order, as sliding compaction does.
A third strategy is to try to place children close to one of their parents, since the only way
to access a child is by loading a reference from one of its parents. Cheney's algorithm uses
breadth-first traversal, but its unfortunate consequence is that it separates related data,
tending to co-locate distant cousins rather than parents and children. Depth-first traversal
(row one), on the other hand, tends to place children closer to their parents.

Early studies of the locality benefits of different copying orders focused on trying to
minimise page faults: the goal was to place related items on the same page. Stamos found
that simulations of Smalltalk systems suggested that depth-first ordering gave a modest
improvement over breadth-first ordering, but worse paging behaviour than the original
object creation order [Stamos, 1982; Blau, 1983; Stamos, 1984]. However, Wilson et al [1991]
argue that these simulations ignore the topology of real Lisp and Smalltalk programs, which
tended to create wide but shallow trees, rooted in hash tables, designed to spread their
keys in order to avoid clashes.
If we are prepared to pay the cost of an auxiliary last-in, first-out marking stack, then
the Fenichel and Yochelson algorithm leads to a depth-first traversal. However, it is
possible to obtain a pseudo-depth-first traversal without paying the space costs that come from
using a stack. Moon [1984] modified Cheney's algorithm to make it 'approximately
depth-first'. He added a second, partialScan pointer in addition to the primary scan pointer
(see Figure 4.3). Whenever an object is copied, Moon's algorithm starts a secondary scan
from the last page of tospace that has not been completely scanned. Once the last tospace
page has been scanned, the primary scan continues from the first incompletely scanned
page.
Figure 4.3: Moon's approximately depth-first copying: a secondary partialScan pointer scans the most recently allocated, incompletely scanned page of tospace, ahead of the primary scan pointer.
Figure 4.4: A FIFO prefetch buffer (discussed in Chapter 2) does not improve
locality with copying, as distant cousins (C, Y, Z), rather than parents and
children, tend to be placed together.
 1 atomic collect():
 2   flip()
 3   initialise(hotList, coldList)
 4   for each fld in Roots
 5     adviceProcess(fld)
 6   repeat
 7     while not isEmpty(hotList)
 8       adviceScan(remove(hotList))
 9     while not isEmpty(coldList)
10       adviceProcess(remove(coldList))
11   until isEmpty(hotList)
12
13 initialise(hotList, coldList):
14   hotList ← empty
15   coldList ← empty
16
17 adviceProcess(fld):
18   fromRef ← *fld
19   if fromRef ≠ null
20     *fld ← forward(fromRef)
21
22 adviceScan(obj):
23   for each fld in Pointers(obj)
24     if isHot(fld)
25       adviceProcess(fld)
26     else
27       add(coldList, fld)
Suppose that a string S, with an associated character array C, is popped
from the stack. Desirably, S should be placed adjacent to C
in tospace, as the depth-first algorithm would do. Using the first-in, first-out queue, after
S is popped from the stack, it is added to the queue. Suppose that the queue is full, so the
oldest entry X is removed, copied and its references Y and Z pushed on the stack, as
illustrated in Figure 4.4. Unfortunately, Y and Z will be removed from the queue and copied
after S but before C.
The reorganisations above are static: the algorithms pay no attention to the behaviour
of individual applications. However, it is clear that the benefits of layout reordering
schemes depend on the behaviour of the mutator. Lam et al [1992] found that both
algorithms were sensitive to the mix and shape of program data structures, giving
disappointing performance for structures that were not tree-like. Siegwart and Hirzel [2006]
also observed that a parallel hierarchical decomposition collector led to benefits for some
benchmarks but little improvement overall for others. Huang et al [2004] address this with
online object reordering, copying the fields that the running program accesses most
frequently ('hot' fields) ahead of cold ones; compare the advice-directed algorithm above
and the last row of Figure 4.2b. The main scanning loop of their algorithm (line 6)
processes all hot fields in its work lists before any cold fields. Piggybacking on the sampling
already performed by the runtime's adaptive compiler keeps the overhead of gathering
this advice low.
Other authors have also suggested custom, static reordering by object type [Wilson et al,
1991; Lam et al, 1992], particularly for system data structures. By allowing class authors to
specify the order in which fields are copied, Novark et al [2006] reduce the cache miss rate
significantly for certain data structures. Shuf et al [2002] use off-line profiling to identify
prolific types. The allocator is modified so that, when a parent is created, adjacent space
is left for its children, thus both improving locality and encouraging clustering of objects
with similar lifetimes. This approach may address to some extent the problem identified
above of combining a first-in, first-out prefetch queue with depth-first copying.
4.3 Issues to consider

Allocation

Allocation in a compacted heap is fast because it is simple. In the common case, it simply
requires a test against a heap or block limit and that a free pointer be incremented. If a
block-structured rather than a contiguous heap is used, occasionally the test will fail and a
new block must be acquired. The slow-path frequency will depend on the ratio of the
average size of objects allocated and the block size. Sequential allocation also works well with
multithreaded applications, since each mutator can be given its own local allocation buffer
in which to allocate without needing to synchronise with other threads. This
arrangement is simpler and requires little metadata, in contrast with local allocation schemes for
³We discuss barriers in Chapter 11.
non-moving collectors, where each thread might need its own size-class data structures for
segregated-fits allocation.

The code sequence for such a bump-a-pointer allocation is short but, even better, it is
well behaved with respect to the cache, as allocation advances linearly through the heap.
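A minimal sketch of the fast path in C, assuming each thread holds a local allocation buffer [free, top) and that sizes arrive already aligned:

    #include <stddef.h>
    #include <stdint.h>

    typedef struct {
        uintptr_t free;   /* next unallocated address in this thread's buffer */
        uintptr_t top;    /* end of the buffer */
    } AllocBuffer;

    /* Bump-a-pointer allocation: one comparison and one addition in the common
     * case; no synchronisation, since the buffer belongs to a single thread. */
    static void *allocate(AllocBuffer *buf, size_t size) {
        uintptr_t result = buf->free;
        if (size > buf->top - result)
            return NULL;          /* slow path: acquire a new buffer or collect */
        buf->free = result + size;
        return (void *)result;
    }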
A program typically allocates a group of related objects at around the same time, uses
them together, and finally tends to abandon them all at once. Here, compacted heaps offer good spatial
locality, with related objects typically allocated on the same page and maybe in the same cache
line if they are small. Such a layout is likely to lead to fewer cache misses than if related
objects are allocated from different free-lists.
Ignoring the data structures needed by the collector itself, semispace copying provides only half the heap
space of that offered by other whole heap collectors. The consequence is that copying
collectors will perform more garbage collection cycles than other collectors. Whether or not
this translates into better or worse performance depends on trade-offs between the
mutator and the collector, the characteristics of the application program and the volume of heap
space available.

Simple asymptotic complexity analyses might prefer copying over mark-sweep
collection. Let M be the total size of the heap, L be the volume of live data, and write r = L/M
for the proportion of the heap that is live. Semispace collectors must copy, scan and update
pointers in live data. Mark-sweep collectors must similarly trace all the live objects but
then sweep the whole heap. Jones [1996] defines the mark/cons ratio as the amount of work
done by the collector per unit allocated. Writing c and m for the costs per unit of copying
and marking, and s for the cost per unit of sweeping, the efficiency of these two algorithms
is therefore:

    e_copy = 2cr / (1 - 2r)        e_ms = (mr + s) / (1 - r)
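To see what these formulas imply, consider an illustrative heap that is one-fifth live, r = 0.2: then e_copy = 0.4c/0.6 ≈ 0.67c, while e_ms = (0.2m + s)/0.8 = 0.25m + 1.25s, so copying wins whenever the unit copying cost is below roughly 0.37m + 1.87s. As r approaches 0.5 the denominator 1 - 2r vanishes and the copying collector's work per unit allocation grows without bound, whereas mark-sweep degrades only as r approaches 1.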
Figure 4.5: Mark/cons ratios for semispace copying and mark-sweep collection, as functions of heap occupancy r.
The mark/cons ratio curves presented in Figure 4.5 show that copying collection can
be made more efficient than mark-sweep collection, provided that the heap is large enough
and r is small enough. However, such a simple analysis ignores several matters. Modern
mark-sweep collectors are likely to use lazy sweeping, thus reducing the constant s.

Although reordering has benefits for some programs, it often has negligible effects. Why is this?
Most objects have short lifetimes and do not survive a collection. Moreover, many
applications concentrate accesses, and especially writes, on these young objects [Blackburn and
McKinley, 2003]. Collector traversal policies cannot affect the locality properties of objects
that are never moved.

Printezis has also pointed out that whether parallel collector threads are used or not
will influence the choice of copying mechanism. It may be simpler to do very fine-grained
load-balancing by work stealing from per-thread stacks, as opposed to using a Cheney
queue.⁴ We discuss these issues in depth in Chapter 14.
Moving objects
The choice of a copying collector will depend in part on whether it is possible to move objects and the cost of doing so. In some environments objects cannot be moved. One reason is that lack of type accuracy means that it would not be safe to modify the slot holding a reference to a putative object. Another is that a reference to the object has been passed to unmanaged code (perhaps as an argument in a system call) that does not expect the reference to change. Furthermore, the problem of pointer finding can often be simpler in a mark-sweep context than that of a moving collector. With a non-moving collector, it suffices to find at least one reference to a live object. A moving collector, on the other hand, must find and update all references to an evacuated object. As we will see in Chapter 17, this also makes the problem of concurrent moving collection much harder than concurrent non-moving collection, since all the references to an object must appear to be updated atomically.
It is expensive to copy some objects. Although copying even a small object is likely to be more expensive than marking it, the cost and latency of doing so is often absorbed by the costs of chasing pointers and discovering type information. On the other hand, repeatedly copying large, pointer-free objects will lead to poor performance. One solution is simply not to copy them but instead devolve the management of large objects to a non-moving collector. Another is to copy them virtually but not physically. This can be done either by holding such objects on a linked list maintained by the collector, or by allocating large objects on their own virtual memory pages, which can be remapped. We consider such techniques in Chapters 8 to 10.
Chapter 5
Reference counting
The algorithms considered so far have all been indirect. Each has traced the graph of live objects from a set of known roots to identify all live objects. In this chapter, we consider the last class of fundamental algorithms, reference counting [Collins, 1960]. Rather than tracing reachable objects, reference counting maintains for each object a count of the references to it, reclaiming an object once its count drops to zero. Algorithm 5.1 assumes that the Write method is invoked for all pointer updates, including updates of local variables. We also assume it is called to write null into local variables before each procedure returns. The operations addReference and deleteReference increment and decrement respectively the reference counts of their object argument. Note that it is essential that the reference counts are adjusted in this order (lines 9-10) to prevent premature reclamation of the target in the case when the old and the new targets of the pointer are the same, that is, src[i] = ref. Once a reference count is zero (line 20), the object can be freed and the reference counts of all its children decremented, which may in turn lead to their reclamation and so on recursively.
The Write method in Algorithm 5.1 is an example of a write barrier. For these, the compiler emits a short code sequence around the actual pointer write. As we shall see later in this book, mutators are required to execute barriers in many systems. More precisely, they are required whenever collectors do not consider the liveness of the entire object graph, atomically with respect to the mutator. Such collectors may execute concurrently, either in lock-step with the mutator as for reference counting or asynchronously in another thread. Alternatively, the collector may process different regions of the heap at different frequencies, as do generational collectors. In all these cases, mutator barriers must be executed in order to preserve the invariants of the collector algorithm.
1 Reference listing algorithms, commonly used by distributed systems such as Java's RMI libraries, modify this invariant so that an object is deemed to be live if and only if the set of clients believed to be holding a reference to the object is non-empty. This offers certain fault tolerance benefits: for example, set insertion or deletion is idempotent, unlike counter arithmetic.
Algorithm 5.1: Simple reference counting

 1 New():
 2   ref ← allocate()
 3   if ref = null
 4     error "Out of memory"
 5   rc(ref) ← 0
 6   return ref
 7
 8 atomic Write(src, i, ref):
 9   addReference(ref)
10   deleteReference(src[i])
11   src[i] ← ref
12
13 addReference(ref):
14   if ref ≠ null
15     rc(ref) ← rc(ref) + 1
16
17 deleteReference(ref):
18   if ref ≠ null
19     rc(ref) ← rc(ref) - 1
20     if rc(ref) = 0
21       for each fld in Pointers(ref)
22         deleteReference(*fld)
23       free(ref)
Reference counting can recycle memory as soon as an object becomes garbage (but we shall see below why this may not always be desirable). Consequently, it may continue to operate satisfactorily in a nearly full heap, unlike tracing collectors which need some headroom. Since reference counting operates directly on the sources and targets of pointers, the locality of a reference counting algorithm may be no worse than that of its client program. Client programs can use destructive updates rather than copying objects if they can prove that an object is not shared. Reference counting can be implemented without assistance from or knowledge of the run-time system. In particular, it is not necessary to know the roots of the program.
Reference counting has been used in many systems, including Oce printers, scanners and document management systems, as well as operating systems' file managers. Libraries to support safe reclamation of objects are widely available for languages like C++ that do not yet require automatic memory management. Such libraries often use smart pointers to access objects. Smart pointers typically overload constructors and operators such as assignment, either to enforce unique ownership of objects or to provide reference counting. Unique pointers ensure that an object has a single 'owner'.
When this owner is destroyed, the object also can be destroyed. For example, the next C++ standard is expected to include a unique_ptr template. Many C++ programmers use smart pointers to provide reference counting to manage memory automatically. The best known smart pointer library for C++ is the Boost library, which provides reference counting through shared pointer objects. One drawback of smart pointers is that they have semantics different from those of the raw pointers that they imitate [Edelson, 1992].
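To make this concrete, the following is a minimal sketch of a reference-counting smart pointer in C++. It is illustrative only: it is not Boost's shared_ptr, it is not thread-safe (a point taken up below), and like all simple reference counting it leaks cycles.

    #include <cstddef>

    // Minimal reference-counting smart pointer (an illustrative sketch).
    template <typename T>
    class counted_ptr {
        struct block { T value; std::size_t rc; };
        block* b_ = nullptr;

        explicit counted_ptr(block* b) : b_(b) {}
        void retain() { if (b_) ++b_->rc; }
        void release() { if (b_ && --b_->rc == 0) delete b_; }
    public:
        counted_ptr() = default;
        static counted_ptr make(const T& v) { return counted_ptr(new block{v, 1}); }
        counted_ptr(const counted_ptr& other) : b_(other.b_) { retain(); }
        counted_ptr& operator=(const counted_ptr& other) {
            // Increment the new target before decrementing the old one, as on
            // lines 9-10 of Algorithm 5.1, so self-assignment is safe.
            block* old = b_;
            b_ = other.b_;
            retain();
            if (old != nullptr && --old->rc == 0) delete old;
            return *this;
        }
        ~counted_ptr() { release(); }
        T& operator*() const { return b_->value; }   // undefined if empty
        T* operator->() const { return &b_->value; }
    };

Here the count lives in a control block allocated with the object; an intrusive design could instead hijack bits of an object header word, an option discussed later in this chapter.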
Unfortunately, there are also a number of disadvantages to reference counting. First, reference counting imposes a time overhead on the mutator. In contrast to the tracing algorithms we considered in earlier chapters, Algorithm 5.1 redefined all pointer Read and Write operations in order to manipulate reference counts. Even non-destructive operations such as iteration require the reference count of each element to be incremented and then decremented as a pointer moves across a data structure such as a list. From a performance point of view, it is particularly undesirable to add overhead to operations that manipulate registers or thread stack slots. For this reason alone, this naive algorithm is impractical for use as a general purpose, high volume, high performance memory manager. Fortunately, as we shall see, the cost of reference counted pointer manipulations can be reduced substantially.
Second, both the reference count manipulations and the pointer load or store must be a single atomic action in order to prevent races between mutator threads which would risk premature reclamation of objects. It is insufficient to protect the integrity of the reference count operation alone. For now, we simply assert that actions are atomic, without explaining how this might be achieved. We reconsider this in Chapter 18 when we examine reference counting and concurrency in detail. Some smart pointer libraries that provide reference counting require careful use by the programmer if races are to be avoided. For example, in the Boost library, concurrent threads can read the same shared_ptr instance simultaneously, or can modify different shared_ptr instances simultaneously, but the library enforces atomicity only upon reference count manipulations. The combination of pointer read or write and reference count increment is not a single atomic action. Thus, the application programmer must take care to prevent races to update a pointer slot, which might lead to undefined behaviour.
Third, naive reference counting turns read-only operations into ones requiring stores to memory (to update reference counts). Similarly, it requires reading and writing the old referent of a pointer field when changing that field to refer to a different object. These writes 'pollute' the cache and induce extra memory traffic.

Fourth, reference counting cannot reclaim cyclic data structures (that is, data structures that contain references to themselves). Even if such a structure is isolated from the rest of the object graph (it is unreachable), the reference counts of its components will never drop to zero. Unfortunately, self-referential structures are common (doubly-linked lists, trees whose nodes hold a back pointer to the root, and so on), although their frequency varies widely between applications [Bacon and Rajan, 2001].

Fifth, in the worst case, the number of references to an object could be equal to the number of objects in the heap. This means that the reference count field must be pointer sized, that is, a whole slot. Given that the average size of nodes in object-oriented languages is small (for example, Java instance objects are typically 20-64 bytes long [Dieckmann and Hölzle, 1999, 2001; Blackburn et al, 2006a], and Lisp cons cells usually fit into two or three slots), this overhead can be significant.
Finally, reference counting may still induce pauses. When the last reference to the head of a large pointer structure is deleted, reference counting must recursively delete each descendant of the root. Boehm [2004] suggests that thread-safe implementations of reference counting may even lead to longer maximum pause times than tracing collectors. Weizenbaum [1969] suggested lazy reference counting: rather than immediately freeing garbage pointer structures, deleteReference adds an object with a zero reference count to a to-be-freed list, without destroying its contents. When the object is later acquired by the allocator, its children can be processed similarly, without recursive freeing. Unfortunately, this technique allows large garbage structures to be hidden by smaller ones, and hence increases overall space requirements [Boehm, 2004].
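The following C++ fragment sketches Weizenbaum's scheme under an assumed object model (the Object layout and names here are invented): deleteReference pushes a dead object onto a to-be-freed list instead of recursing, and the allocator later pops one object at a time.

    #include <cstddef>
    #include <vector>

    // Illustrative object layout: a reference count and some child pointers.
    struct Object {
        std::size_t rc;
        std::vector<Object*> fields;   // outgoing references
        Object* next_free;             // links the to-be-freed list
    };

    Object* to_be_freed = nullptr;     // objects with rc == 0, contents intact

    // deleteReference defers the work: push, do not recurse.
    void delete_reference(Object* obj) {
        if (obj != nullptr && --obj->rc == 0) {
            obj->next_free = to_be_freed;   // contents are NOT destroyed yet
            to_be_freed = obj;
        }
    }

    // The allocator later reclaims one object per request: each pop may push
    // the object's dead children, but the freeing never recurses.
    Object* reclaim_one() {
        Object* obj = to_be_freed;
        if (obj == nullptr) return nullptr;
        to_be_freed = obj->next_free;
        for (Object* child : obj->fields)
            delete_reference(child);        // children are deferred in turn
        return obj;                         // this memory can now be reused
    }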
Let us now see the extent to which we can resolve two of the major problems facing reference counting: the cost of reference count manipulations and collecting cyclic garbage. It turns out that common solutions to both of these problems involve a stop-the-world pause.

One fundamental obstacle to efficiency is that object reference counts are part of the global state of the program, but operations on (thread-)local state are usually more efficient. The three classes of solution above share a common approach to this problem: they divide execution into periods or epochs. Within an epoch, some or all synchronised reference counting operations can be either eliminated or replaced by unsynchronised writes (to thread-local buffers). Identification of garbage is performed only at the end of an epoch.

[Figure 5.1: Deferred reference counting: reference count operations on heap object fields are performed immediately, while those involving stacks and registers are deferred.]
Although these algorithms also impose a small overhead on the mutator, it is much lower than the overhead of safely manipulating reference counts. To overwrite a pointer, Write in Algorithm 5.1 executed a dozen or so instructions (though in some cases the compiler could statically elide some tests). The reference count adjustments must be atomic operations and be kept consistent with pointer updates. Furthermore, Write modifies both the old and new targets of the field in question, possibly polluting the cache with dirty words that will not be reused soon. Optimisation to remove matching increments and decrements is error prone if done by hand, but has proved effective as a compiler optimisation [Cann and Oldehoeft, 1988].
Most high-performance reference counting systems (for example, that of Blackburn and McKinley [2003]) use deferred reference counting. The overwhelming majority of pointer loads are to local and temporary variables, that is, to registers or stack slots. Long ago, Deutsch and Bobrow [1976] showed how to remove reference count manipulations from these operations by adjusting counts only when pointers are stored into heap objects. Figure 5.1 shows an abstract view of deferred reference counting in which operations on heap objects are performed immediately but those involving stacks or registers are deferred. There is, of course, a cost to pay. If reference count manipulations on local variables are ignored, then counts will no longer be accurate. It is therefore no longer safe to reclaim an object just because its reference count is zero. In order to reclaim any garbage, deferred reference counting must periodically correct counts during stop-the-world pauses. Fortunately, these pauses are likely to be short compared with those of tracing collectors, such as mark-sweep [Ungar, 1984].
Algorithm 5.2 loads object references using the simple, unbarriered implementation of Read from Chapter 1. Similarly, references can also be written to roots using an unbarriered store (line 14). In contrast, writes to heap objects must be barriered. In this case, the reference count of the new target is incremented as usual (line 17). However, if decrementing the reference count of the old target causes it to drop to zero, the Write barrier adds the object whose reference count is zero to a zero count table (ZCT) rather than immediately reclaiming it (line 26). The zero count table can be implemented in a variety of ways, for example with a bitmap [Baden, 1983] or a hash table [Deutsch and Bobrow, 1976]. An object with a reference count of zero cannot be reclaimed at this point because there might be an uncounted reference to it from the program stack. Conceptually, the zero count table
contains objects whose reference counts are zero but may be live.
Algorithm 5.2: Deferred reference counting

 1 New():
 2   ref ← allocate()
 3   if ref = null
 4     collect()
 5     ref ← allocate()
 6     if ref = null
 7       error "Out of memory"
 8   rc(ref) ← 0
 9   add(zct, ref)
10   return ref
11
12 Write(src, i, ref):
13   if src = Roots
14     src[i] ← ref
15   else
16     atomic
17       addReference(ref)
18       remove(zct, ref)
19       deleteReferenceToZCT(src[i])
20       src[i] ← ref
21
22 deleteReferenceToZCT(ref):
23   if ref ≠ null
24     rc(ref) ← rc(ref) - 1
25     if rc(ref) = 0
26       add(zct, ref)             /* defer freeing */
27
28 atomic collect():
29   for each fld in Roots         /* mark the stacks */
30     addReference(*fld)
31   sweepZCT()
32   for each fld in Roots         /* unmark the stacks */
33     deleteReferenceToZCT(*fld)
34
35 sweepZCT():
36   while not isEmpty(zct)
37     ref ← remove(zct)
38     if rc(ref) = 0              /* now reclaim garbage */
39       for each fld in Pointers(ref)
40         deleteReference(*fld)
41       free(ref)
Depending on the implementation of the zero count table and whether it is desirable to limit its size, we can also choose to remove the new target from the zero count table when writing a reference into a heap object, as its true reference count must be positive (line 18).
However, at some point garbage objects must be collected if the
program is not to run
out of memory. Periodically, for example when the allocator fails to return memory to
New, all threads are stopped while each object in the zero count table is scrutinised to determine whether its true reference count should be zero. An object in the zero count table with reference count zero can only be live if there are one or more references to it from the roots. The simplest way to discover this is to scan the roots and 'mark' any objects found by incrementing their reference counts (line 29). After this, no object referenced from the stack can have a reference count of zero, so any object with a zero count must be garbage. We could now sweep the entire heap, as with mark-sweep collection (for example, Algorithm 2.3), looking for and reclaiming 'unmarked' objects with zero reference counts. However, it is sufficient to confine this search to the zero count table. The entries in the zero count table are scanned and any objects with zero counts are immediately processed and freed, in the same way as in the simple Algorithm 5.1. Finally, the 'mark' operations must be reversed: the stack is scanned again and the reference counts of any objects found are decremented (reverted to their previous value). If an object's reference count becomes zero, it is reinstated in the zero count table.
Deferred reference counting removes the cost of manipulating reference counts on local variables from the mutator. Several older studies have suggested that it can reduce the cost of pointer manipulations by 80% or more [Ungar, 1984; Baden, 1983]. Given the increased importance of locality, we can speculate that its performance advantage over naive reference counting will be even larger on modern hardware. However, reference count adjustments due to object field updates must still be performed eagerly rather than deferred, and must be atomic. Next, we explore how to replace expensive atomic reference count manipulations caused by updates to objects' fields with simple instructions, and how to reduce the number of modifications necessary.
The write barrier of Algorithm 5.3, if the source object is not already dirty, logs the object by saving its address and the values of its pointer fields to a local update buffer (line 5). The modified object is marked as dirty. The log procedure attempts to avoid duplicating entries in the thread's local log by first appending the original values of the object's pointer fields to the log (line 11).
Algorithm 5.3: Coalesced reference counting: the write barrier

 1 me ← myThreadId
 2
 3 Write(src, i, ref):
 4   if not dirty(src)
 5     log(src)
 6   src[i] ← ref
 7
 8 log(obj):
 9   for each fld in Pointers(obj)
10     if *fld ≠ null
11       append(updates[me], *fld)
12   if not dirty(obj)
13     slot ← appendAndCommit(updates[me], obj)
14     setDirty(obj, slot)
15
16 dirty(obj):
17   return logPointer(obj) ≠ CLEAN
18
19 setDirty(obj, slot):
20   logPointer(obj) ← slot    /* address of entry for obj in updates[me] */
Next, it checks that src is still not dirty, and only then is the entry committed by writing src to the log (appendAndCommit), tagged so that it can be recognised as an object entry rather than a field entry; the log's internal cursor is then advanced (line 13). The object is marked dirty by writing a pointer to this log entry in its header. Note that even if a race leads to records being created in more than one thread's local buffer, the algorithm guarantees that all these records will contain identical information, so it does not matter to which log's entry the header points. Note that, depending on the processor's memory consistency model, this write barrier may not require any synchronisation operations.
Later, we will discuss how coalesced reference counts can be processed concurrently with mutator threads, but here we simply stop the world periodically to process the logs. At the start of each collection cycle, Algorithm 5.4 halts every thread, transfers their update buffers to the collector's log, and allocates fresh ones. As we noted above, race conditions mean that an entry for an object may appear in more than one thread's update buffer. This is harmless provided the collector processes each dirty object only once. The processReferenceCounts procedure tests whether an object is still dirty before updating the reference counts. The counts of the children of an object before its first modification in this epoch are decremented, and then those of its children at the time of the collection are incremented. In a simple system, any object whose reference count drops to zero could be freed recursively. However, if reference counting on local variables is deferred, or if for efficiency the algorithm does not guarantee to process all increments before decrements, we simply remember any object whose count has dropped to zero. The algorithm cleans the object so that it will not be processed again in this cycle. Pointers to an object's previous children can be found directly from the log entry. Its current children can be found from the object itself (recall that the log contains a reference to that object). Notice that there is opportunity for prefetching objects or reference count fields in both the increment and decrement loops [Paz and Petrank, 2007].
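As an illustration of that prefetching opportunity (a sketch, not the scheme of Paz and Petrank), a collector written in C++ for GCC or Clang might hint the cache a few entries ahead while decrementing the old children recorded in a log entry:

    #include <cstddef>
    #include <vector>

    struct Object { std::size_t rc; /* ... */ };
    std::vector<Object*> zct;   // zero count table, as in deferred RC

    void decrement_old(Object* const* old_fields, std::size_t n) {
        const std::size_t kAhead = 8;   // prefetch distance, a tunable guess
        for (std::size_t i = 0; i < n; ++i) {
            // Hint the cache about a later child's count word
            // (__builtin_prefetch is the GCC/Clang intrinsic).
            if (i + kAhead < n && old_fields[i + kAhead] != nullptr)
                __builtin_prefetch(&old_fields[i + kAhead]->rc, /*write=*/1);
            Object* child = old_fields[i];
            if (child != nullptr && --child->rc == 0)
                zct.push_back(child);   // remember; do not free recursively
        }
    }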
Algorithm 5.4: Coalesced reference counting: processing the logs

atomic collect():
  collectBuffers()
  processReferenceCounts()
  sweepZCT()

collectBuffers():
  collectorLog ← []
  for each t in Threads
    collectorLog ← collectorLog + updates[t]

processReferenceCounts():
  for each entry in collectorLog
    obj ← objFromLog(entry)
    if dirty(obj)                      /* do not process duplicates */
      incrementNew(obj)
      decrementOld(entry)

decrementOld(entry):
  for each fld in Pointers(entry)      /* use the values in the collector's log */
    child ← *fld
    if child ≠ null
      rc(child) ← rc(child) - 1
      if rc(child) = 0
        add(zct, child)

incrementNew(obj):
  for each fld in Pointers(obj)        /* use the values in the object */
    child ← *fld
    if child ≠ null
      rc(child) ← rc(child) + 1
[Figure 5.2: Coalesced reference counting. The old referents of a modified object can be found in the collector's log and the most recent new referent D can be found directly from A.]
Let us look at the example in Figure 5.2. Suppose that A was modified in the previous epoch to swing its pointer from C to D. The old values of the object's fields (B and C) will have been recorded in a log which has been passed to the collector (shown on the left of the figure). The collector will therefore decrement the reference counts of B and C and increment those of B and D. This retains the original value of B's reference count since the pointer from A to B was never modified.
Thus, through a combination of deferred reference counting and coalescing, much of reference counting's overhead on the mutator has been removed. In particular, we have removed any necessity for mutator threads to employ expensive synchronisation operations. However, this benefit has come at some cost. We have reintroduced pauses for garbage collection, although we expect these to be shorter than those required for tracing collection. We have reduced the promptness of collection (since no object is reclaimed until the end of an epoch) and added space overheads for the buffers and zero count table. Coalesced reference counting may also require the collector to decrement and then increment the same children of unmodified slots.
5.5 Cyclic reference counting

Unfortunately, cycles are common, created both by application programmers and by the run-time system. Applications often use doubly-linked lists and circular buffers. Object-relation mapping systems may require that databases know their tables and vice versa. Some real-world structures are naturally cyclic, such as roads in geographical information systems. Lazy functional language implementations commonly use cycles to express recursion [Turner, 1979, the Y combinator]. A number of techniques have been proposed to solve the cycle problem; we review some of these now.
An early approach distinguished strong references from weak ones, managed in order to preserve the invariant that all reachable objects are strongly reachable, without creating any cycles of strong references. Unfortunately, this algorithm is unsafe and may reclaim objects prematurely: see Salkild's counter-example [Jones, 1996, Chapter 6.5]. Salkild [1987] amended the algorithm to make it safe, but at the cost of non-termination in some cases. Pepels et al [1988] provided a very complex solution, but it is expensive both in terms of space, with double the space overheads of normal reference counting, and in terms of performance, having twice the cost of standard reference counting in most cases and being exponential in the worst case.
The most widely adopted mechanisms for handling cycles through reference counting use a technique called trial deletion. The key observation is that it is not necessary for a backup tracing collector to visit the whole live object graph. Instead, its attention can be confined to those parts of the graph where a pointer deletion might have created a garbage cycle. Note that:
• In any garbage pointer structure, all reference counts must be due to internal pointers (that is, pointers between objects within the structure).

• Garbage cycles can arise only from a pointer deletion that leaves a reference count greater than zero.
Partial tracing algorithms take advantage of these observations by tracing the subgraph rooted at an object suspected of being garbage. These algorithms trial-delete each reference encountered by temporarily decrementing reference counts, in effect removing the contribution of these internal pointers. If the reference count of any object remains non-zero, it must be because there is a pointer to the object from outside the subgraph, and hence neither the object nor its transitive closure is garbage.
The Recycler [Bacon et al, 2001; Bacon and Rajan, 2001; Paz et al, 2007] supports concurrent cyclic reference counting. In Algorithm 5.5, we show the simpler, synchronous, version, deferring the asynchronous collector to Chapter 15. The cycle collection algorithm operates in three phases.

1. First, the collector traces partial graphs, starting from objects identified as possible members of garbage cycles, decrementing reference counts due to internal pointers (markCandidates). Objects visited are coloured grey.

2. Second, each candidate subgraph is scanned for external references: objects whose reference counts remain non-zero must be externally reachable, so the trial decrements are undone and those objects re-blackened; the rest are coloured white (scan).

3. Finally, any members of the subgraph that are still white must be garbage and are reclaimed (collectCandidates).
Algorithm 5.5: The Recycler

 1 New():
 2   ref ← allocate()
 3   if ref = null
 4     collect()                    /* the cycle collector */
 5     ref ← allocate()
 6     if ref = null
 7       error "Out of memory"
 8   rc(ref) ← 0
 9   return ref
10
11 addReference(ref):
12   if ref ≠ null
13     rc(ref) ← rc(ref) + 1
14     colour(ref) ← black          /* cannot be in a garbage cycle */
15
16 deleteReference(ref):
17   if ref ≠ null
18     rc(ref) ← rc(ref) - 1
19     if rc(ref) = 0
20       release(ref)
21     else
22       candidate(ref)             /* might isolate a garbage cycle */
23
24 release(ref):
25   for each fld in Pointers(ref)
26     deleteReference(*fld)
27   colour(ref) ← black            /* objects on the free-list are black */
28   if not ref in candidates       /* deal with candidates later */
29     free(ref)
30
31 atomic collect():
32   markCandidates()
33   for each ref in candidates
34     scan(ref)
35   collectCandidates()
36
37 candidate(ref):
38   if colour(ref) ≠ purple
39     colour(ref) ← purple
40     add(candidates, ref)
41
42 markCandidates():
43   for each ref in candidates
44     if colour(ref) = purple
45       markGrey(ref)
46     else
47       remove(candidates, ref)
48       if colour(ref) = black && rc(ref) = 0
49         free(ref)
50
51 markGrey(ref):
52   if colour(ref) ≠ grey
53     colour(ref) ← grey
54     for each fld in Pointers(ref)
55       child ← *fld
56       if child ≠ null
57         rc(child) ← rc(child) - 1   /* trial deletion */
58         markGrey(child)
59
60 scan(ref):
61   if colour(ref) = grey
62     if rc(ref) > 0
63       scanBlack(ref)               /* there must be an external reference */
64     else
65       colour(ref) ← white          /* looks like garbage... */
66       for each fld in Pointers(ref)   /* ...so continue */
67         child ← *fld
68         if child ≠ null
69           scan(child)
70
71 scanBlack(ref):
72   colour(ref) ← black
73   for each fld in Pointers(ref)
74     child ← *fld
75     if child ≠ null
76       rc(child) ← rc(child) + 1    /* undo the trial deletion */
77       if colour(child) ≠ black
78         scanBlack(child)
79
80 collectCandidates():
81   while not isEmpty(candidates)
82     ref ← remove(candidates)
83     collectWhite(ref)
84
85 collectWhite(ref):
86   if colour(ref) = white && not ref in candidates
87     colour(ref) ← black            /* free-list objects are black */
88     for each fld in Pointers(ref)
89       child ← *fld
90       if child ≠ null
91         collectWhite(child)
92     free(ref)
In its synchronous mode, the Recycler uses five colours to identify nodes. As usual, black means live (or free) and white is garbage. Grey is now a possible member of a garbage cycle, and we add the colour purple to indicate objects that are candidates for roots of garbage cycles.

Deleting any reference other than the last to an object may isolate a garbage cycle. In this case, Algorithm 5.5 colours the object purple and adds it to a set of candidate members of garbage cycles (line 22). Otherwise, the object is garbage and its reference count must be zero. Procedure release resets its colour to black, processes its children recursively and, if it is not a candidate, frees the object. The reclamation of any objects in the candidates set is postponed to the markCandidates phase. For example, in Figure 5.3a, some reference to object A has been deleted. A's reference count was non-zero, so it has been added to the candidates set.
In the first phase of collecting cycles, the markCandidates procedure establishes the extent of possible garbage structures, and removes the effect of internal references from the counts. It considers every object in the set of garbage candidates. If the object is still purple (hence, no references to the object have been added since it was added to the set), its transitive closure is marked grey. Otherwise it is removed from the set and, if it is a black object with reference count zero, it is freed. As markGrey traces a reference, the reference count of its target is decremented. Thus, in Figure 5.3b, the subgraph rooted at A has been marked grey and the contribution of references internal to the subgraph has been removed from the reference counts.
In the second phase of collection, each candidate and its grey transitive closure is scanned for external references. If a reference count is non-zero, it can only be because there is a reference to this object from outside the grey sub-graph. In this case, the effect of markGrey is undone by scanBlack: reference counts are incremented and objects are reverted to black. On the other hand, if the reference count is zero, the object is coloured white and the scan continues to its children. Note that at this point we cannot say that a white object is definitely garbage as it might be revisited later by scanBlack starting from another node in the subgraph. For example, in Figure 5.3b, objects Y and Z have zero reference counts but are externally reachable via X. When scan reaches X, which has a non-zero reference count, it will invoke scanBlack on the grey transitive closure of X, restoring the reference counts and reverting those objects to black.
[Figure 5.3: Cyclic reference counting. The first field of each object is its reference count.]
Processing candidates in batches in this way made a dramatic difference to performance, reducing the garbage collection time for moderately sized Java programs from minutes (Lins) to a maximum of a few milliseconds (Recycler). Further improvements can be gained by recognising statically that certain classes of object, including but not limited to those that contain no pointers, can never be members of cycles. The Recycler allocates objects of these types as green rather than black, and never adds them to the candidate set nor traces through them. Bacon and Rajan [2001] found that this reduced the size of the candidate set by an order of magnitude. Figure 5.4 illustrates the full state transition system of the synchronous Recycler, including green nodes.
5.6 Limited-field reference counting

In principle a reference count may be as large as the number of objects in the heap, but only contrived applications will cause counts to grow so large; in practice most objects have small reference counts. Indeed, most objects are not shared at all and so the space they use could be reused immediately the pointer to them is deleted [Clark and Green, 1977; Stoye et al, 1984; Hartel, 1988]. In functional languages, this allows objects such as arrays to be updated in place rather than having to modify a copy of the object. Given a priori knowledge of the upper bound on reference counts, it would be possible to use a smaller field for the reference count. Unfortunately, it is common for some objects to be very popular [Printezis and Garthwaite, 2002].
However, it is still possible to limit the size of the reference count field provided that some backup mechanism is occasionally invoked to deal with reference count overflow. Once a reference count has been incremented to the maximum permissible value, it becomes a sticky reference count, not changed by any subsequent pointer updates. The most extreme option is to use a single bit for the reference count, thus concentrating reference counting on the common case of objects that are not shared. The bit can either be stored in the object itself [Wise and Friedman, 1977] or in the pointer [Stoye et al, 1984]. The corollary of limited-field reference counts is that once objects become stuck they can no longer be reclaimed by reference counting. A backup tracing collector is needed to handle such objects. As the tracing collector traverses each pointer, it can restore the correct reference counts (wherever this is no greater than the sticky value); Wise [1993a] shows that, with some effort, a mark-compact or copying collector can also restore uniqueness information. Such a backup tracing collector would be needed to reclaim garbage cycles in any case.
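A small saturating counter makes the sticky behaviour concrete. The C++ sketch below uses invented names (a real header would pack other bits, such as hash or lock bits, into the same word) and never lets the count wrap around; once stuck at the maximum, the object is left to the backup tracing collector:

    #include <cstdint>

    constexpr std::uint8_t kSticky = 255;   // the count sticks at this value

    struct Header {
        std::uint8_t rc;   // a one-byte limited reference count field
    };

    void increment(Header& h) {
        if (h.rc != kSticky) ++h.rc;   // saturate instead of overflowing
    }

    // Returns true if the object is now known to be garbage. A stuck object
    // never reports garbage here; only the backup tracing collector can
    // reclaim it (and can restore its true count while tracing).
    bool decrement(Header& h) {
        if (h.rc == kSticky) return false;   // sticky: no longer maintained
        return --h.rc == 0;
    }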
5.7 Issues to consider

Reference counting is attractive for the promptness with which it reclaims garbage objects and for its good locality properties. Simple reference counting can reclaim the space occupied by an object as soon as the last pointer to that object is removed. Its operation involves only the targets of old and new pointers read or written, unlike tracing collection which visits every live object in the heap. However, these strengths are also the weaknesses of simple reference counting. Because it cannot reclaim an object until the last pointer to that object has been removed, it cannot reclaim cycles of garbage. Reference counting taxes every pointer read and write operation and thus imposes a much larger tax on throughput than tracing does. Furthermore, multithreaded applications require the manipulation of reference counts and updating of pointers to be expensively synchronised. This tight coupling of mutator actions and memory manager risks some fragility, especially if 'unnecessary' reference count updates are optimised away by hand. Finally, reference counts increase the sizes of objects.
The environment

Despite these concerns, it would be wrong to dismiss reference counting without further consideration. Reference counting can be implemented as part of a library rather than being baked into the language's run-time system. It can therefore give the programmer complete control over its use. In particular, the programmer must ensure that races between pointer modifications and reference count updates are avoided. If reference counting is implemented through smart pointers, he must also be aware that the semantics of pointers and smart pointers differ. As Edelson [1992] wrote, 'They are smart, but they are not pointers'.
Advanced solutions

Sophisticated reference counting algorithms can offer solutions to many of the problems faced by naive reference counting but, paradoxically, these algorithms often introduce behaviours similar to those of stop-the-world tracing collectors. We examine this duality further in the next chapter.

Garbage cycles can be reclaimed by a backup, tracing collector or by using the trial deletion algorithms we discussed in Section 5.5. In both cases, this requires mutator threads to be suspended while we reclaim cyclic data (although we show how these stop-the-world pauses can be removed in later chapters).
Although the worst case requires reference count fields to be almost as large as pointer fields, most applications hold only a few references to most objects. Often, it is possible for the reference count to hijack a few bits from an existing header word (for example, one used for object hashing or for locks). However, it is common for a very few objects to be heavily referenced. If limited-field reference counting is used, these objects will either leak (which may not be a serious problem if they are few in number or have very long lifetimes) or must be reclaimed by a backup tracing collector. Note, however, that in comparing the space overhead of reference counting and, say, mark-sweep collection it is not sufficient simply to measure the cost of the reference count fields. In order not to thrash, tracing collectors require some headroom in the heap. If the application is given a heap, say, 20% larger than its maximum volume of live data, then at least 10% of the heap will be 'wasted' on average. This fraction may be similar to the overhead of reference counting (depending on the average size of objects it manages).
The throughput overhead of reference counting can be addressed by omitting to count some pointer manipulations and by reducing the cost of others. Deferred reference counting ignores mutator operations on local variables. This allows the counts of objects reachable from roots to be lower than their true value, and hence prevents their prompt reclamation (since a reference count of zero no longer necessarily means that the object is garbage). Coalesced reference counting accounts for the state of an object only at the beginning and end of an epoch: it ignores pointer manipulations in between. In one sense, this automates the behaviour of programmers who often optimise away temporary adjustments to reference counts (for example, to an iterator as it traverses a list). However, once again, this comes at a cost; at worst, the same values might be written to the logs of two different threads. Furthermore, both solutions add space overhead to the cost of reference counting, either to store the zero count table or to store update logs.
In the next chapter, we compare all four forms of collection we have examined so far: mark-sweep, mark-compact, copying and reference counting. We then consider a remarkable abstraction of tracing and advanced reference counting collection that reveals that they are not so different after all.

Chapter 6

Comparing garbage collectors

In the preceding chapters, we presented four different styles of garbage collection. In this chapter, we compare them in more detail. We examine the collectors in two different ways. First, we consider criteria by which we may assess the algorithms and the strengths and weaknesses of the different approaches in different circumstances. We then present abstractions of tracing and reference counting algorithms due to Bacon et al [2004]. These abstractions reveal that while the algorithms exhibit superficial differences they also bear a deep and remarkable similarity.
It is common to ask: which is the best garbage collector to use? However, the temptation to provide a simple answer needs to be resisted. First, what does 'best' mean? Do we want the collector that provides the application with the best throughput, or do we want the shortest pause times? Is space utilisation important? Or is a compromise that combines these desirable properties required? Second, it is clear that, even if a single metric is chosen, the ranking of different collectors will vary between different applications. For example, in a study of twenty Java benchmarks and six different collectors, Fitzgerald and Tarditi [2000] found that for each collector there was at least one benchmark that would have been at least 15% faster with a more appropriate collector. And furthermore, not only do programs tend to run faster given larger heaps, but also the relative performance of collectors varies according to the amount of heap space available. To complicate matters yet further, excessively large heaps may disperse temporally related objects, leading to worsened locality that may slow down applications.
6.1 Throughput

The first item on many users' wish lists is likely to be overall application throughput. This might be the primary goal for a 'batch' application or for a web server where pauses might be tolerable or obscured by aspects of the system such as network delays. Although it is tempting to measure only the time spent in collection, the collector can also affect the mutator's performance, for example because a copying collector has rearranged objects in such a way as to affect cache behaviour.
The asymptotic cost of mark-sweep collection, which must sweep the whole heap, can be greater than copying collection. However, the number of instructions executed to visit an object for mark-sweep tracing is fewer than those for copying tracing. Locality plays a significant part here as well. We saw in Section 2.6 how prefetching techniques could be used to hide cache misses. However, it is an open question as to whether such techniques can be applied to copying collection without losing the benefits to the mutator of depth-first copying. In either of these tracing collectors, the cost of chasing pointers is likely to dominate. Furthermore, if marking is combined with lazy sweeping, we obtain greatest benefit in the same circumstances that copying performs best: when the proportion of live data in the heap is small.
6.3 Space

All memory managers impose some space overhead, reducing the amount of heap usable to the application. It is important not to ignore the costs of non-heap, metadata space. Tracing collectors may require marking stacks, mark bitmaps or other auxiliary data structures. Any non-compacting memory manager, including explicit managers, will use space for their own data structures, such as segregated free-lists and so on. Finally, if a tracing or a deferred reference counting collector is not to thrash by collecting too frequently, it requires sufficient room for garbage in the heap. Systems are typically configured to use a heap anything from 30% to 200% or 300% larger than the minimum required by the program. Many systems also allow the heap to expand when necessary, for example in order to avoid thrashing the collector. Hertz and Berger [2005] suggest that a garbage collected heap three to six times larger than that required by explicitly managed heaps is needed to achieve comparable application performance.
In contrast, simple reference counting frees objects as soon as they become unlinked from the graph of live objects. Apart from the obvious advantage of preventing the accumulation of garbage in the heap, this may offer other potential benefits. Space is likely to be reused shortly after it is freed, which may improve cache performance. It may also be possible in some circumstances for the compiler to detect when an object becomes free, and to reuse it immediately, without recycling it through the memory manager.

It is desirable for collectors not only to be complete (to reclaim all dead objects eventually) but also to be prompt, that is, to reclaim all dead objects at each collection cycle. The basic tracing collectors presented in earlier chapters achieve this, but at the cost of tracing all live objects at every collection. However, modern high-performance collectors typically trade immediacy for performance, allowing some garbage to float in the heap from one collection to a subsequent one. Reference counting faces the additional problem of being incomplete; specifically, it is unable to reclaim cyclic garbage structures without recourse to tracing.
6.4 Implementation

Collectors must be able to identify all the roots of a program, including global variables, and references held in registers and stack slots. We discuss this in more detail in Chapter 11. However, we note here that the task facing copying and compacting collectors is harder than that facing non-moving collectors: a non-moving collector need only identify at least one reference to each live object, and never needs to change the value of a pointer. So-called conservative collectors [Boehm and Weiser, 1988] can reclaim memory without accurate knowledge of mutator stacks or indeed object layouts. Instead they make intelligent (but safe, conservative) guesses about whether a value really is a reference. Because non-moving collectors do not update references, the risk of misidentifying a value as a heap pointer is confined to introducing a space leak: the value itself will not be corrupted. A full discussion of conservative garbage collection can be found in Jones [1996, Chapters 9 and 10].
Reference counting has both the advantages and disadvantages of being tightly coupled to the mutator. The advantages are that reference counting can be implemented in a library, making it possible for the programmer to decide selectively which objects should be managed by reference counting and which should be managed explicitly. The disadvantages are that this coupling introduces the processing overheads discussed above and that it is essential that all reference count manipulations are correct.

The performance of any modern language that makes heavy use of dynamically allocated data is heavily dependent on the memory manager. The critical actions typically include allocation, mutator updates including barriers, and the garbage collector's inner loops. Wherever possible, the code sequences for these critical actions need to be inlined, but this has to be done carefully to avoid exploding the size of the generated code. If the processor's instruction cache is sufficiently large and the code expansion is sufficiently small (in older systems with much smaller caches, Steenkiste [1989] suggested less than 30%), this blowup may have negligible effect on performance. Otherwise, it will be necessary to distinguish in these actions the common case, which needs to be small enough to be inlined (the 'fast path'), whilst calling out to a procedure for the less common 'slow path' [Blackburn and McKinley, 2002]. There are two lessons to be learnt here. The output from the compiler matters and it is essential to examine the assembler code produced. The effect on the caches also has a major impact on performance.
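The fast-path/slow-path split might look as follows in C++ (an illustrative sketch using GCC/Clang attribute spellings; the shape of the code, not its details, is the point):

    #include <cstddef>
    #include <cstdint>

    std::uint8_t* free_ptr = nullptr;   // bump pointer (illustrative globals)
    std::uint8_t* limit = nullptr;      // end of the current buffer

    // Out-of-line slow path: refill the buffer, perhaps triggering a
    // collection. Kept out of line so the inlined fast path stays small.
    __attribute__((noinline)) void* allocate_slow(std::size_t n);

    // Fast path: a handful of instructions, cheap enough to inline at
    // every allocation site without exploding the generated code.
    inline void* allocate(std::size_t n) {
        std::uint8_t* result = free_ptr;
        if (limit - result >= static_cast<std::ptrdiff_t>(n)) {  // common case
            free_ptr = result + n;
            return result;
        }
        return allocate_slow(n);    // rare case: call out of line
    }

Inspecting the compiler's output for a site that calls allocate is exactly the kind of check the paragraph above recommends.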
There is a lesson here, too, for those choosing a collector: take account of your application. Measure its behaviour, and the size and lifetime distributions of the objects it uses; micro-benchmarks are likely to mislead. In what follows we present abstractions of tracing and reference counting collectors in a way that highlights precisely where they are similar and where they differ.

http://Java.sun.com/docs/hotspot/gc5.0/ergo5.html
6.6 A unified theory of garbage collection

Bacon et al [2004] show that tracing and reference counting can be expressed in a common framework, in which an abstract reference count ρ(ref) satisfies:
    ρ(ref) = |{fld ∈ Roots : *fld = ref}|
           + |{fld ∈ Pointers(n) : n ∈ Nodes ∧ ρ(n) > 0 ∧ *fld = ref}|    (6.1)
Algorithm 6.1: Abstracted tracing

 1 atomic collectTracing():
 2   rootsTracing(W)
 3   scanTracing(W)
 4   sweepTracing()
 5
 6 scanTracing(W):
 7   while not isEmpty(W)
 8     src ← remove(W)
 9     ρ(src) ← ρ(src) + 1          /* shade src */
10     if ρ(src) = 1                /* src was white, now grey */
11       for each fld in Pointers(src)
12         ref ← *fld
13         if ref ≠ null
14           W ← W + [ref]
15
16 sweepTracing():
17   for each node in Nodes
18     if ρ(node) = 0               /* node is white */
19       free(node)
20     else                         /* node is black */
21       ρ(node) ← 0                /* reset node to white */
22
23 New():
24   ref ← allocate()
25   if ref = null
26     collectTracing()
27     ref ← allocate()
28     if ref = null
29       error "Out of memory"
30   ρ(ref) ← 0                     /* node is white */
31   return ref
32
33 rootsTracing(R):
34   for each fld in Roots
35     ref ← *fld
36     if ref ≠ null
37       R ← R + [ref]
Algorithm 6.2 abstracts reference counting similarly, with reference count adjustments being buffered by the mutator's inc and dec procedures rather than performed immediately, in order to highlight the similarity with tracing. This buffering technique turns out to be very practical for multithreaded applications; we consider it further in Chapter 18. This logging of actions also shares similarities with coalesced reference counting, discussed in Section 5.4. The garbage collector, collectCounting, performs the deferred increments I with applyIncrements and the deferred decrements D with scanCounting.
Algorithm 6.2: Abstracted reference counting

 1 atomic collectCounting(I, D):
 2   applyIncrements(I)
 3   scanCounting(D)
 4   sweepCounting()
 5
 6 scanCounting(W):
 7   while not isEmpty(W)
 8     src ← remove(W)
 9     ρ(src) ← ρ(src) - 1
10     if ρ(src) = 0
11       for each fld in Pointers(src)
12         ref ← *fld
13         if ref ≠ null
14           W ← W + [ref]
15
16 sweepCounting():
17   for each node in Nodes
18     if ρ(node) = 0
19       free(node)
20
23 New():
24   ref ← allocate()
25   if ref = null
26     collectCounting(I, D)
27     ref ← allocate()
28     if ref = null
29       error "Out of memory"
30   ρ(ref) ← 0
31   return ref
32
33 dec(ref):
34   if ref ≠ null
35     D ← D + [ref]
36
37 inc(ref):
38   if ref ≠ null
39     I ← I + [ref]
40
41 atomic Write(src, i, dst):
42   inc(dst)
43   dec(src[i])
44   src[i] ← dst
45
46 applyIncrements(I):
47   while not isEmpty(I)
48     ref ← remove(I)
49     ρ(ref) ← ρ(ref) + 1
Algorithm 6.3: Abstracted deferred reference counting

atomic collectDrc(I, D):
  rootsTracing(I)
  applyIncrements(I)
  scanCounting(D)
  sweepCounting()
  rootsTracing(D)
  applyDecrements(D)

New():
  ref ← allocate()
  if ref = null
    collectDrc(I, D)
    ref ← allocate()
    if ref = null
      error "Out of memory"
  ρ(ref) ← 0
  return ref

Write(src, i, dst):
  inc(dst)
  dec(src[i])
  src[i] ← dst

applyDecrements(D):
  while not isEmpty(D)
    ref ← remove(D)
    ρ(ref) ← ρ(ref) - 1
Mutation, using the Write procedure, stores a new destination reference dst into a field src[i]. In doing so, it buffers an increment for the new destination, inc(dst), and buffers a decrement for the old referent, dec(src[i]), before storing the new destination to the field, src[i] ← dst.

Each collection begins by applying all deferred increments to bring them up to date. The deferred decrements are applied in the next phase. The scanCounting procedure begins with reference counts that over-estimate the true counts. Thus, it must decrement the counts of nodes in the work list as it encounters them. Any source node whose count, ρ(src), is decremented to zero in this phase is treated as garbage, and its child nodes are added to the work list. Finally, the procedure sweepCounting frees the garbage nodes.

The tracing and reference counting algorithms are identical but for minor differences. Each has a scan procedure: the scanTracing collector uses reference count increments whereas the scanCounting collector uses decrements. In both cases the recursion condition checks for a zero reference count. Each has a sweep procedure that frees the space occupied by garbage nodes. In fact, the outline structures of the first 31 lines of Algorithm 6.1 and Algorithm 6.2 are identical. Deferred reference counting, which defers counting references from the roots, is similarly captured by this framework (see Algorithm 6.3).
Finally, we noted earlier that computing reference counts is tricky when it comes to cycles in the object graph. The trivial object graph in Figure 6.1 shows a simple isolated cycle, where assuming A has reference count zero allows B also to have reference count zero (since only source nodes with a non-zero count contribute to the reference counts of their destinations). But there is a chicken-and-egg problem here, since the reference counts of A and B are mutually dependent. It is just as feasible for us to claim that A has reference count 1, because of its reference from B, leading us to claim that B also has reference count 1.

This seeming anomaly arises generally for fixed-point computations, where there may be several different feasible solutions. In Figure 6.1 we have the case that Nodes = {A, B} and Roots = {}. There are two fixed-point solutions of Equation 6.1 for this simple graph: a least fixed-point ρ(A) = ρ(B) = 0 and a greatest fixed-point ρ(A) = ρ(B) = 1. Tracing collectors compute the least fixed-point, whereas reference counting collectors compute the greatest, so they cannot (by themselves) reclaim cyclic garbage. The difference between these two solutions is precisely the set of objects reachable only from garbage cycles. We saw in Section 5.5 that reference counting algorithms can use partial tracing to reclaim garbage cycles. They do so by starting from the greatest fixed-point solution and contracting the set of unreclaimed objects to the least fixed-point solution.
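The two fixed points can be computed by iterating Equation 6.1 directly. The toy C++ program below (purely illustrative) does so for the two-node cycle of Figure 6.1, once starting from the 'all dead' assumption, as tracing does, and once from the 'all live' assumption, as reference counting does:

    #include <array>
    #include <cstdio>

    // Figure 6.1: two nodes, A -> B and B -> A, and no roots.
    // edges[n] lists the targets of node n's pointer fields.
    constexpr int kNodes = 2;
    const std::array<std::array<int, 1>, kNodes> edges = {{{1}, {0}}};

    // One application of Equation 6.1: a node's count is the number of
    // pointers to it from roots (none here) plus those from nodes whose
    // own count is non-zero.
    std::array<int, kNodes> step(const std::array<int, kNodes>& rho) {
        std::array<int, kNodes> next{};
        for (int n = 0; n < kNodes; ++n)
            if (rho[n] > 0)
                for (int target : edges[n]) ++next[target];
        return next;
    }

    int main() {
        std::array<int, kNodes> least{0, 0};     // start 'all dead' (tracing)
        std::array<int, kNodes> greatest{1, 1};  // start 'all live' (counting)
        for (int i = 0; i <= kNodes; ++i) {      // enough steps to converge
            least = step(least);
            greatest = step(greatest);
        }
        std::printf("least: A=%d B=%d  greatest: A=%d B=%d\n",
                    least[0], least[1], greatest[0], greatest[1]);
        // Prints: least: A=0 B=0  greatest: A=1 B=1
    }

Both starting points are already fixed points here, which is exactly the anomaly: the equations alone cannot decide whether the cycle is live.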
Chapter 7
Allocation
There are three aspects to a memory management system: (i) allocation of memory in the first place, (ii) identification of live data and (iii) reclamation for future use of memory previously allocated but currently occupied by dead objects. Garbage collectors address these issues differently than do explicit memory managers, and different automatic memory managers use different algorithms to manage these actions. However, in all cases allocation and reclamation of memory are tightly linked: how memory is reclaimed places constraints on how it is allocated.

The problem of allocating and freeing memory dynamically under program control has been addressed since the 1950s. Most of the techniques devised over the decades are of potential relevance to allocating in garbage collected systems, but there are several key differences between automatic and explicit freeing that have an impact on desirable allocation strategies and performance.
• Garbage collected systems free space all at once rather than one object at a time. Further, some garbage collection algorithms (those that copy or compact) free large contiguous regions at one time.

• Many systems that use garbage collection have available more information when allocating, such as static knowledge of the size and type of object being allocated.

• Because of the availability of garbage collection, users will write programs in a different style and are likely to use heap allocation more often.
7.1 Sequential allocation
Algorithm 7.1: Sequential allocation

sequentialAllocate(n):
  result ← free
  newFree ← result + n
  if newFree > limit
    return null          /* signal 'Memory exhausted' */
  free ← newFree
  return result
[Figure: sequential allocation. A free pointer advances through the chunk; alignment padding may separate allocated cells, and the space between the free pointer and the limit remains available.]
Sequential allocation has several attractive properties:

• It is simple.

• It is efficient, although Blackburn et al [2004a] have shown that the fundamental performance difference between sequential allocation and segregated-fits free-list allocation for a Java system is on the order of 1% of total execution time.

• It appears to result in better cache locality than does free-list allocation, especially for initial allocation of objects in moving collectors [Blackburn et al, 2004a].

• It may be less suitable than free-list allocation for non-moving collectors, if uncollected objects break up larger chunks of space into smaller ones, resulting in many small sequential allocation chunks as opposed to one or a small number of large ones.
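The alignment padding shown in the figure above comes from rounding the bump pointer up before each allocation. A small C++ helper (names invented) shows the usual computation:

    #include <cstddef>
    #include <cstdint>

    // Round addr up to the next multiple of alignment, which must be a
    // power of two; the padding introduced is (result - addr) bytes.
    inline std::uintptr_t align_up(std::uintptr_t addr, std::size_t alignment) {
        return (addr + alignment - 1) & ~(static_cast<std::uintptr_t>(alignment) - 1);
    }

    // Example: align_up(13, 8) == 16, inserting three bytes of padding.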
Algorithm 7.2: First-fit allocation

firstFitAllocate(n):
  prev ← addressOf(head)
  loop
    curr ← next(prev)
    if curr = null
      return null                        /* signal 'Memory exhausted' */
    else if size(curr) < n
      prev ← curr
    else
      return listAllocate(prev, curr, n)

listAllocate(prev, curr, n):
  result ← curr
  if shouldSplit(size(curr), n)
    remainder ← result + n
    next(remainder) ← next(curr)
    size(remainder) ← size(curr) - n
    next(prev) ← remainder
  else
    next(prev) ← next(curr)
  return result
Although the set of free cells need not literally be organised as a list, we will use the traditional term 'free-list' for them anyway. One can think of sequential allocation as a degenerate case of free-list allocation, but its special properties and simple implementation distinguish it in practice.

We consider first the case of organising the set as a single list of free cells. The allocator considers each free cell in turn, and according to some policy, chooses one to allocate. This is called sequential fits allocation.
First-fit allocation
When trying to satisfy an allocation request, a first-fit allocator will use the first cell it finds that can satisfy the request. If the cell is larger than required, the allocator may split the cell and return the remainder to the free-list. However, if the remainder is too small (allocation data structures and algorithms usually constrain the smallest allocatable cell size), then the allocator cannot split the cell. Further, the allocator may follow a policy of not splitting unless the remainder is larger than some absolute or percentage size threshold.
Algorithm 7.2 gives code for first-fit. Notice that it assumes that each free cell has room to record its own size and the address of the next free cell. It maintains a single global variable head that refers to the first free cell in the list.

A variation that leads to simpler code in the splitting case is to return the portion at the end of the cell being split, illustrated in Algorithm 7.3. A possible disadvantage of this approach is the different alignment of objects, but this could cut either way. First-fit tends to exhibit the following properties:

• Small remainder cells accumulate near the front of the list, slowing down allocation.

• In terms of space utilisation, it may behave rather similarly to best-fit since cells in the free-list end up roughly sorted from smallest to largest.
When using a single free-list, it is usually more natural to build the list in address order, which is what a mark-sweep collector does.
Next-fit allocation

Next-fit is a variation of first-fit that starts the search for a cell of suitable size from the point in the list where the last search succeeded [Knuth, 1973]. This is the variable prev in the code sketched by Algorithm 7.4. When it reaches the end of the list it starts over from the beginning, and so is sometimes called circular first-fit allocation. The idea is to reduce the need to iterate repeatedly past the small cells at the head of the list. While next-fit is appealingly simple, it exhibits several drawbacks:

• Objects from different phases of mutator execution become mixed together. Because they become unreachable at different times, this can affect fragmentation (see Section 7.3).

• Accesses through the roving pointer have poor locality because the pointer cycles through all the free cells.

• The allocated objects may also exhibit poor locality, being spread out through memory and interspersed with objects allocated by previous mutator phases.
Best-fit allocation

Best-fit allocation finds the cell whose size most closely matches the request. The idea is to minimise waste, as well as to avoid splitting large cells unnecessarily. Algorithm 7.5 sketches the code. In practice best-fit seems to perform well for most programs, giving relatively low wasted space in spite of its bad worst-case performance [Robson, 1977]. Though such measurements were for explicit freeing, we would expect the space utilisation to remain high for garbage collected systems as well.
Algorithm 7.4: Next-fit allocation

nextFitAllocate(n):
    start ← prev
    loop
        curr ← next(prev)
        if curr = null
            prev ← addressOf(head)            /* restart from the beginning of the free-list */
            curr ← next(prev)
        if prev = start
            return null                       /* signal 'Memory exhausted' */
        else if size(curr) < n
            prev ← curr
        else
            return listAllocate(prev, curr, n)
Algorithm 7.5: Best-fit allocation

bestFitAllocate(n):
    best ← null
    bestSize ← ∞
    prev ← addressOf(head)
    loop
        curr ← next(prev)
        if curr = null || size(curr) = n
            if curr ≠ null
                bestPrev ← prev
                best ← curr
            else if best = null
                return null                   /* signal 'Memory exhausted' */
            return listAllocate(bestPrev, best, n)
        else if size(curr) < n || bestSize < size(curr)
            prev ← curr
        else
            best ← curr
            bestPrev ← prev
            bestSize ← size(curr)
            prev ← curr
Algorithm 7.6: Searching the Cartesian tree for first-fit allocation

firstFitAllocateCartesian(n):
    parent ← null
    curr ← root
    loop
        if left(curr) ≠ null && max(left(curr)) ≥ n
            parent ← curr
            curr ← left(curr)
        else if prev < curr && size(curr) ≥ n
            prev ← curr
            return treeAllocate(curr, parent, n)
        else if right(curr) ≠ null && max(right(curr)) ≥ n
            parent ← curr
            curr ← right(curr)
        else
            return null                       /* signal 'Memory exhausted' */
Allocating from a single sequential list may not scale very well to large memories. Therefore researchers have devised a number of more sophisticated organisations of the set of free cells, to speed free-list allocation according to various policies. One obvious choice is to use a balanced binary tree of the free cells. These might be sorted by size (for best-fit) or by address (for first-fit or next-fit). When sorting by size, it saves time to enter only one cell of each size into the tree, and to chain the rest of the cells of that size from that tree node. Not only does the search complete faster, but the tree needs reorganisation less frequently, since this happens only when adding a new size or removing the last cell of a given size.
To use balanced trees for first-fit or next-fit, one needs to use a Cartesian tree [Vuillemin, 1980]. This indexes by both address (primary key) and size (secondary key). It is totally ordered on addresses, but organised as a 'heap' for the sizes, which allows quick search for the first or next fit that will satisfy a given size request. This technique is also known as fast-fits allocation [Tadman, 1978; Standish, 1980; Stephenson, 1983]. A node in the Cartesian tree must record the address and size of the free cell, the pointers to the left and right child, and the maximum of the sizes of all cells in its subtree. It is easy to compute this maximum from the maximum values recorded in the node's children and its own size. Hence the minimum possible size for a node is larger than for simple list-based schemes. While we omit code for inserting and removing nodes from the tree, to clarify the approach we give sample code for searching under the first-fit policy, in Algorithm 7.6. The code uses the single global variable root, which refers to the root of the binary tree. Each node n maintains a value max(n) that gives the maximum size of any node in that node's subtree. Next-fit is only slightly more complicated than first-fit.
Balanced binary trees improve worst-case behaviour from linear to logarithmic in the number of free cells. Self-adjusting (splay) trees [Sleator and Tarjan, 1985] have similar (amortised time) benefits.
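As an illustration of the bookkeeping involved, a node might be declared as follows in C, with a helper that re-establishes the subtree-maximum invariant after a child changes; this is our sketch, not code from the text.

    #include <stddef.h>

    typedef struct Node {
        char        *addr;          /* start of the free cell (primary key)  */
        size_t       size;          /* size of the free cell (secondary key) */
        size_t       max;           /* largest cell size in this subtree     */
        struct Node *left, *right;
    } Node;

    /* Recompute max(n) from n's own size and its children's max values. */
    static void fixMax(Node *n) {
        size_t m = n->size;
        if (n->left  != NULL && n->left->max  > m) m = n->left->max;
        if (n->right != NULL && n->right->max > m) m = n->right->max;
        n->max = m;
    }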
Another useful approach to address-ordered first-fit or next-fit allocation is bitmapped-fits allocation. A bitmap on the side has one bit for each granule of the allocatable heap. Rather than scanning the heap itself, we scan the bitmap. We can scan a byte at a time by using the byte value to index pre-calculated tables giving the size of the largest run of free granules within the eight-granule unit represented by the byte. The bitmap can also be augmented with run-length information that speeds calculating the size of larger free or allocated cells, in order to skip over them more quickly. Bitmaps have several virtues:
• They are 'on the side' and thus less vulnerable to corruption. This is especially important for less safe languages such as C and C++, but it is also helpful in improving the reliability and debuggability of collectors for other, safer, languages.
• They do not require information to be recorded in the free and allocated cells, and thus minimise constraints on cell size. This effect can more than pay back the 3% storage overhead of one bit per 32-bit word. However, other considerations may … locality.
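As a sketch of the byte-at-a-time scanning just described: the table below records, for each of the 256 possible byte values, the length of the longest run of zero (free) bits; a scanner consults it one bitmap byte at a time. The construction is our own illustration; a real allocator must also handle runs that span byte boundaries.

    #include <stdint.h>

    static uint8_t maxFreeRun[256];   /* longest run of 0 bits in each byte value */

    static void initMaxFreeRun(void) {
        for (int b = 0; b < 256; b++) {
            int best = 0, run = 0;
            for (int i = 0; i < 8; i++) {
                if (b & (1 << i)) run = 0;        /* a 1 bit = allocated granule */
                else if (++run > best) best = run;
            }
            maxFreeRun[b] = (uint8_t)best;
        }
    }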
7.3 Fragmentation

At the beginning an allocation system generally has one, or a small number, of large cells of contiguous free memory. As a program runs, allocating and freeing cells, it typically produces a larger number of free cells, which can individually be small. This dispersal of free memory across a possibly large number of small free cells is called fragmentation. Fragmentation has at least two negative effects in an allocation system:
• It can prevent allocation from succeeding. There can be enough free memory, in total, to satisfy a request, but not enough in any particular free cell. In non-garbage collected systems this generally forces a program to terminate. In a garbage collected system, it may trigger collection sooner than would otherwise be necessary.

• Even if there is enough memory to satisfy a request, fragmentation may cause a program to use more address space, more resident pages and more cache lines than it would otherwise.
… free-list. Next-fit will tend to distribute small fragments more evenly across the heap, but that is not necessarily better. The only total solution to fragmentation is compaction or copying collection.
… byte-addressed machines, as would a unit of one word for word-addressed machines. Still, even when bytes are the unit for describing size, a granule is more likely the size of a word, or even larger. Having c be a power of two speeds the division in the formula by allowing substitution of a shift for the generally slower division operation.
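For instance, with a granule of c = 2^3 = 8 bytes, the dense size-class index for a request of n bytes can be computed with a shift in place of a division; the constant and names here are illustrative only.

    #include <stddef.h>

    #define LOG_GRANULE 3   /* c = 8 bytes, an assumed example */

    /* Size-class index = ceiling(n / c), using a shift instead of a division. */
    static inline size_t sizeClass(size_t n) {
        return (n + (((size_t)1 << LOG_GRANULE) - 1)) >> LOG_GRANULE;
    }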
In addition to a very dense range of small size classes, a system might provide one or more ranges of somewhat larger sizes, less densely packed, as opposed to switching immediately to a general free-list mechanism. For example, the Boehm-Demers-Weiser collector has separate lists for each size from the minimum up to eight words, then for even numbers of words up to 16, and for multiples of four words up to 32 [Boehm and Weiser, 1988]. Above that size it determines size classes somewhat dynamically, filling in an array that maps requested size (in bytes) to allocated size (in words). It then directly indexes an array of free-lists using the allocated size. Only those sizes used will be populated.

Algorithm 7.7: Segregated-fits allocation
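The listing for Algorithm 7.7 is not reproduced here, but the overall shape of segregated-fits allocation is roughly the following C sketch, assuming a sizeClass mapping like the one above, one free-list per class, and unspecified helpers (allocateFromGeneralList, refillAndAllocate) for the cases it does not handle; all of these names are our assumptions.

    #include <stddef.h>

    #define NUM_CLASSES 64
    typedef struct Cell { struct Cell *next; } Cell;
    static Cell *classLists[NUM_CLASSES];            /* one free-list per size class */

    extern size_t sizeClass(size_t n);
    extern void *allocateFromGeneralList(size_t n);  /* single-list fallback */
    extern void *refillAndAllocate(size_t k);        /* repopulate an empty class */

    void *segregatedFitsAllocate(size_t n) {
        size_t k = sizeClass(n);
        if (k >= NUM_CLASSES)                /* big request: use the general scheme */
            return allocateFromGeneralList(n);
        Cell *c = classLists[k];
        if (c == NULL)                       /* e.g. split a larger cell, or */
            return refillAndAllocate(k);     /* fetch a fresh block          */
        classLists[k] = c->next;             /* pop the head cell */
        return c;
    }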
Fragmentation

In the simpler free-list allocators we discussed previously, there was only one kind of fragmentation: free cells that were too small to satisfy a request. This is known as external fragmentation, because it is unusable space outside any allocated cell. When we introduce size classes, if the sizes are at all spread out then there is also internal fragmentation, where space is wasted inside an allocated cell because the requested size was rounded up to a larger size class.
It should now be reasonably clear how segregated-fits allocation works, except for the … (allocating them from separate blocks and maintaining separate free-lists for them); for small objects the savings (by not having to record type information in each object) can be great. Examples include Lisp cons cells.
Beyond the combining of small cells' metadata across an entire block, block-based allocation has the virtue of making the recombining of free cells particularly simple and efficient: it does not recombine unless all cells in a block are free, and then it returns the block to the block pool. Its common case for allocation, grabbing an available cell from a known list, is quite efficient, and if the list is empty, populating it is straightforward. Its primary disadvantage is its worst-case fragmentation.
Splitting. We have already seen cell splitting as a way to obtain cells of a given size s: the various simple free-list schemes will split a larger cell if that is the only way to satisfy a request. If we use a fairly dense collection of size classes, then when we split a cell, we will be likely to have a suitable free-list to receive the portion not allocated. There are some particular organisations of less dense size classes that also have that property. One such scheme is the buddy system, which uses sizes that are powers of two [Knowlton, 1965; Peterson and Norman, 1977]. It is clear that we can split a cell of size 2^{l+1} into two cells of size 2^l. We can also recombine (or coalesce) two adjacent cells of size 2^l into one cell of size 2^{l+1}. A buddy system will only recombine that way if the cells were split from the same cell of size 2^{l+1} originally. Hence cells of size 2^l come in pairs, that is, are buddies. Given the high internal fragmentation of this approach (its average is 25% for arbitrarily chosen allocation requests), it is now largely of historical as opposed to practical interest.
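One attraction of the power-of-two buddy system is that finding a cell's buddy needs no search: with offsets measured from the start of the managed area, the buddy of a cell of size 2^l is obtained by flipping bit l of its offset. A sketch:

    #include <stdint.h>

    /* offset: the cell's offset within the managed area; size: the cell's
       size, a power of two (2^l). The buddy is the other half of the
       size-2^(l+1) cell from which the pair was split. */
    static inline uintptr_t buddyOf(uintptr_t offset, uintptr_t size) {
        return offset ^ size;
    }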
A variation of the 2^l buddy system is the Fibonacci buddy system [Hirschberg, 1973; Burton, 1976; Peterson and Norman, 1977], in which the size classes form a Fibonacci sequence: s_{i+2} = s_{i+1} + s_i, with suitable s_0 and s_1 to start. Because the ratio of adjacent sizes is smaller than in the power-of-two buddy system, the average internal fragmentation will be lower (as a percentage of allocated memory). However, locating adjacent cells for recombining free cells after collection is slightly more complicated, since a buddy can have the next size larger or smaller depending on which member of a buddy pair is under consideration.
Other variations on the buddy system have been described by Wise [1978], Page and Hagins [1986] and Wilson et al [1995b].
1 Boehm and Weiser [1988] place this portion at the start of the block rather than its end, presumably to reduce competition for cache lines near the beginning of blocks. This helps more for small cache lines, since it is effective only for (some) cell sizes larger than a cache line.
… class. If a request finds that the free-list for its size class is empty, we can implement best-fit by searching the larger size classes in order of increasing size, looking for a non-empty free-list. Having a segregated-fits front end modifies first- and next-fit, leading to a design choice of what to do when the free-list for the desired size class is empty. But in any case, if we end up searching list f_k, the list of all cells of size greater than s_{k−1}, then we apply the single-list scheme (first-fit, best-fit or next-fit).
Another way of seeing this is that we really have a segregated-fits scheme, and are simply deciding how we are going to manage f_k. To summarise, we can manage it in these ways:

• As a single free-list, using first-fit, best-fit, next-fit or one of the variations on them, possibly organised to reduce search time.

• Using block-based allocation.

• Using a buddy system.
Alignment

Depending on constraints of the underlying machine and its instruction set, or for better performance, some objects may need to be allocated on stronger boundaries than the word, for example on a double-word boundary (that is, with the three low bits of the address equal to zero). One way to address the overall problem is to make double-words the granule of allocation. In that case, all allocated and free cells are a multiple of eight bytes in size, and are aligned on an eight-byte boundary. This is simple, but perhaps slightly wasteful. Further, when allocating an array of double, there is still some special work that might be required. Suppose that the Java heap design requires two header words for scalar (non-array) objects, one to refer to the object's class information (for virtual method dispatch, type determination and so on) and one for the object's hash code and Java synchronisation (locking). This is a typical design. Array objects require a third word, giving the number of elements in the array. If we store these three header words at the start of the allocated space and follow them immediately by the array elements, the elements will be aligned on an odd word boundary, not an even one as required. If we use double-words as the granule, then we simply use four words (two double-words) for the three-word header and waste a word.

But suppose our granule is one word, and we wish to avoid wasting a word whenever we can. In that case, if a free cell we are considering is aligned on an odd word boundary (that is, its address is 4 modulo 8), we can simply use the cell as is, putting the three header words first, immediately followed by the array elements, which will be double-word aligned as required. If the cell starts on an even word boundary, we have to skip a word to get the proper alignment. Notice that this complicates our determination of whether a request will fit in a given cell: it may or may not fit, depending on the required and actual alignment; see Algorithm 7.8.
Algorithm 7.8: Testing whether a cell can satisfy an aligned request

fits(n, a, m, blk):
    /* need n bytes, alignment a modulo m, m a power of 2. Can blk satisfy this request? */
    z ← blk − a                       /* back up */
    z ← (z + m − 1) & ¬(m − 1)        /* round up */
    z ← z + a                         /* go forward */
    pad ← z − blk
    return n + pad ≤ size(blk)
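The same test in C might read as follows; blk is the candidate cell's address and blkSize its size, and the arithmetic mirrors the pseudocode (this rendering is ours).

    #include <stddef.h>
    #include <stdint.h>

    /* Can a cell at blk of blkSize bytes hold n bytes at alignment
       a modulo m, where m is a power of two? */
    int fits(size_t n, uintptr_t a, uintptr_t m, char *blk, size_t blkSize) {
        uintptr_t z = (uintptr_t)blk - a;   /* back up */
        z = (z + m - 1) & ~(m - 1);         /* round up to a multiple of m */
        z = z + a;                          /* go forward */
        size_t pad = z - (uintptr_t)blk;    /* bytes skipped at the front */
        return n + pad <= blkSize;
    }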
Size constraints

Some collection schemes require a minimum amount of space in each object (cell) for managing the collection process. For example, basic compacting collection needs room for the new address in each object. Some collectors may need two words, such as a lock/status word plus a forwarding pointer. This implies that even if the language needs only one word, the allocator will still need to allocate two words. In fact, if a program allocates some objects that contain no data and serve only as distinct unique identifiers, for some languages they could in principle consume no storage at all! In practice this does not work, since the address of the object forms its unique identity (or else you must calculate a unique value and store it in the object), so the object must consume at least one byte.
Boundary tags

In order to support recombination when freeing objects, many allocate-free systems associate an additional header or boundary tag with each cell, outside the storage available to the program [Knuth, 1973]. The boundary tag indicates the size of the cell and whether it is allocated or free. It may also indicate the size of the previous cell, making it easier to find its flag indicating whether it is free, and its free-list chaining pointers if it is free. Thus, a boundary tag may be two words long, though with additional effort, and possibly more overhead in the allocation and freeing routines, it may be possible to pack it into one word. Using bit tables on the side to indicate which granules are allocated and free avoids the need for boundary tags, and may be more robust, as we previously observed. Which approach uses less storage depends on the average object size and the allocation granularity.
We further observe that because a given garbage collection algorithm frees objects all at once, it may not need boundary tags, or may need less information in them. Further, in …
Heap parsability

The sweeping phase of a mark-sweep collector must be able to advance from cell to cell in the heap. This capability is what we call heap parsability. Other kinds of collectors may not require parsability, but it can be a great help in debugging collectors, so it is good to support parsability if possible and the cost is not too high.
Generally we need parsability only in one direction, most commonly in order of increasing address. A typical language will use one or two words to record an object's type and other necessary information. We call this the object's header. For example, many Java implementations use one word to record what amounts to a type (a pointer to type information, including a vector of addresses of methods of the object's class) and one word for a hash code, synchronisation information, garbage collection mark bit and so on. In order to make indexing into arrays efficient on most machines, it helps if the object reference refers to the first element of the array, with successive elements at successively higher addresses. Since the language run-time and the collector need to find the type of an object in a uniform way given a reference to the object, we place the header immediately before the object data. Thus, the object reference points not to the first allocated byte, but into the middle of the allocated cell, after the header. Having the header come before the object contents therefore facilitates upward parsing of the heap.

Figure 7.2: A Java object header design for heap parsability.
Again using a Java system as an example, array instances need to record the length of the individual array. For easy parsability, it helps if the length field comes after the two-word header used for every object. Therefore the first array element falls at the third word of the allocated cell, the length is at word −1 and the rest of the header is at words −2 and −3. A scalar (non-array) object needs to place its header at words −2 and −3 as well. This would appear to leave word −1 as a 'hole', but in fact there is no problem placing the first (scalar) field of the object there (assuming that the machine can index by a small negative constant just as well as by a small positive one, and most can). Further, if the object has no additional fields, there is still no problem: the header of the next object can legally appear at the address to which the object reference points! We illustrate all this in Figure 7.2.
A particular issue arises if an implementation desires to over-write one object with another (necessarily smaller) one, as a number of functional language implementations do in replacing a closure with its evaluated value. If the implementation takes no further action, a scan that parses the heap may land in the middle of 'unformatted' bits and get quite confused.

Non-Stop Haskell solves this problem by inserting filler objects [Cheadle et al, 2004]. In the usual case they need only to insert a reference to metadata indicating a pointer-free object of the appropriate size; they pre-construct metadata for sizes one through eight words. Larger fillers are quite rare, but would require creating metadata dynamically.2
One final consideration arises from alignment requirements. If an individual object needs to be shifted one or more words from the beginning of its cell for proper alignment, we need to record something in the gap so that in heap parsing we will know to skip it. If ordinary object headers cannot begin with an all-zero word, and if we zero all free space in advance, then when parsing we can simply skip words whose value is zero. A simple alternative is to devise a distinct range of values to write at the start of a gap, identifying it as a gap and giving its length. For example, Sun have long used what they call a 'self-parsing' heap. When they free an object (in a non-moving space), they overwrite its memory with a filler object, which includes a field giving its size (think of it as an array of words). This is particularly useful for skipping ahead to the next real object when sweeping the heap.

2 They do not offer details, but it seems reasonable to us to place the metadata in the filler in that case, thus avoiding any run-time allocation to restore heap parsability.
A bitmap on the side, indicating where each object starts, makes heap parsing easy and simplifies the design constraints on object header formats. However, such bits consume additional space and require additional time to set during allocation. Allocation bitmaps are useful in many collectors, especially parallel and concurrent ones.
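Pulling these pieces together, a heap parser is essentially a loop of the following shape. Purely for illustration, we assume every cell begins with a header from which its total size can be read and that fillers are marked as such; real header formats are more compact.

    #include <stddef.h>

    typedef struct Header { size_t size; int isFiller; } Header;

    /* Walk the heap from lo to hi in address order, visiting each real object. */
    void parseHeap(char *lo, char *hi, void (*visit)(Header *)) {
        while (lo < hi) {
            Header *h = (Header *)lo;
            if (!h->isFiller)
                visit(h);        /* a real object */
            lo += h->size;       /* sizes are assumed to include the header */
        }
    }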
While we mentioned a design for Java, similar considerations apply to other languages. Furthermore, block-based allocation offers simple parsing for the small cells, and it is also easy to handle the large blocks. For improved cache performance, the location of a large object inside a sequence of one or more blocks is something we might randomise: that is, we randomise how much of the wasted space comes before, and how much after, the application object. It is easy to record at the start of the block where the object is, in order to support parsability.
Locality

Locality issues come up in several ways in allocation. There is locality of the allocation process itself, and of freeing. Other things being equal, an address-ordered free-list may improve locality of allocator memory accesses. Sequential allocation also leads naturally to sequential accesses with good locality. In fact, software prefetching a bit ahead of the allocator can help [Appel, 1994], though for certain hardware that is unnecessary [Diwan et al, 1994]. But there is an entirely different notion of locality that is also useful to consider: objects that may become unreachable at about the same time. If some objects become unreachable at the same time, and they are allocated adjacent to one another, then after collection their space will coalesce into a single free chunk, thus minimising fragmentation. Empirically, objects allocated at about the same time often become unreachable at about the same time. This makes non-moving systems less problematic than might be presumed [Hayes, 1991; Dimpsey et al, 2000; Blackburn and McKinley, 2008]. It also suggests applying a heuristic of trying to allocate next to, or at least near, the most recently allocated object. Specifically, if the previous allocation request was satisfied by splitting a larger cell, then the remainder of that cell might be used to satisfy requests in the near future, if the future request cannot be satisfied directly from a free-list for objects of the appropriate size.
Wilderness preservation

A typical heap organisation consists of a large contiguous part of the machine's address space, often bounded at the low end by the program's static code and data areas. The other end is often not occupied, but rather is open for expansion. This boundary in Unix systems is called the 'break' and the sbrk call can grow (or shrink) the available space by adjusting the boundary. Space beyond the boundary may not even be in the virtual memory map. The last free chunk in the heap is thus expandable. Since it begins what could be called 'unoccupied territory', it is called the wilderness, and Korn and Vo [1985] found that wilderness preservation, allocating from the wilderness only as a last resort, helped reduce fragmentation. It also has the salutary effect of tending to defer the need to grow the heap, and thus conserves overall system resources.
Crossing maps

Some collection schemes, or their write barriers, require the allocator to fill in a crossing map. This map indicates, for each aligned segment of the heap of size 2^k for some suitable k, the address (or offset within the segment) of the last object that begins in that segment. Combined with heap parsability, this allows a barrier or collector to determine fairly quickly, from an address within an object, the start of the object, and thus to access the object's headers, and so on. We discuss crossing maps in more detail in Section 11.8.
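A sketch of the allocator's side of this: with 2^9-byte segments and word-aligned objects, one byte per segment suffices to record the offset (in words) of the last object allocated in it. The representation and names are our assumptions, not a prescription.

    #include <stdint.h>

    #define LOG_SEGMENT 9                 /* 512-byte segments, for example */

    extern uint8_t crossingMap[];         /* one entry per heap segment */
    extern char   *heapStart;

    /* Called for each newly allocated object; later calls for the same
       segment overwrite earlier ones, leaving the last object's offset. */
    static void updateCrossingMap(char *obj) {
        uintptr_t off = (uintptr_t)(obj - heapStart);
        uintptr_t seg = off >> LOG_SEGMENT;
        crossingMap[seg] =
            (uint8_t)((off & (((uintptr_t)1 << LOG_SEGMENT) - 1)) >> 3);
        /* offset within the segment, in 8-byte words */
    }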
7.7 Allocation in concurrent systems

… a slowly allocating thread might receive a small chunk while a rapidly allocating one gets a large chunk. Dimpsey et al [2000] noted substantial performance improvement in a multiprocessor Java system using a suitably organised local allocation buffer (LAB) for each thread.3 They further note that since the local allocation buffers absorb almost all allocation of small objects, it was beneficial to retune the global free-list-based allocator, since its typical request was for a new local allocation buffer chunk.
Garthwaite et al [2005] discussed adaptive sizing of local allocation buffers, and found benefit from associating them with processors rather than threads. They describe the original mechanism for sizing per-thread local allocation buffers as follows. Initially a thread requests a 24-word (96 byte) local allocation buffer. Each time it requests another local allocation buffer, it multiplies the size by 1.5. However, when the collector runs, it decays each thread's local allocation buffer size by dividing by two. The scheme also involves adjustment to the young generation's size according to the number of different threads allocating.
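That policy fits in a few lines of C. The 24-word start, the 1.5 growth factor and the halving on collection come from the description above; the per-thread state and the lower clamp are our sketch.

    #include <stddef.h>

    #define LAB_INITIAL (24 * sizeof(void *))   /* 24 words */

    typedef struct { size_t labSize; } ThreadState;   /* one per thread */

    /* Called when a thread needs a new local allocation buffer. */
    size_t nextLabSize(ThreadState *t) {
        size_t s = t->labSize;
        t->labSize += t->labSize / 2;           /* multiply by 1.5 */
        return s;
    }

    /* Called for each thread when the collector runs. */
    void decayLabSize(ThreadState *t) {
        t->labSize /= 2;                        /* decay by half */
        if (t->labSize < LAB_INITIAL)           /* clamping back to the initial */
            t->labSize = LAB_INITIAL;           /* size is our assumption       */
    }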
The per-processor local allocation buffer scheme relies on multiprocessor restartable critical sections, which Garthwaite et al introduced. This mechanism allows a thread to determine whether it has been preempted and rescheduled, which implies that it may be running on a different processor. By having such preemption modify a register used in addressing the per-processor data, they can cause stores after preemption to produce a trap, and the trap handler can restart the interrupted allocation. Even though per-processor local allocation buffers involve more instructions, their latency was the same, and they required less sophisticated sizing mechanisms to work well. They also found that for small numbers of threads, per-thread local allocation buffers were better (consider especially the case where there are fewer threads than processors), and per-processor local allocation buffers were better when there are many allocating threads. Therefore, they designed their …

3 Some authors use the term 'thread-local heap'. We use local allocation buffer when the point is separate allocation, and reserve use of 'thread-local heap' for the case where the local areas are collected separately. Thus, while a 'thread-local heap' is almost certainly a local allocation buffer, the reverse need not be true.
A typical local allocation buffer is used for sequential allocation. Another design is for each thread (or processor) to maintain its own set of segregated free-lists, in conjunction with incremental sweeping. When a thread sweeps a block incrementally during allocation, it puts the free cells into its own free-lists. This design has certain problems that arise when it is used for explicit storage management, as addressed by Berger et al [2000]. For example, if the application uses a producer-consumer model, then the producer allocates message buffers and the consumer frees them, leading to a net transfer of buffers from one to the other. In the garbage collected world, the collection process may return buffers to a global pool. However, incremental sweeping that places free cells on the sweeper's free-lists naturally returns free buffers to the threads that allocate them most frequently.
… space within some blocks. Block-based allocation may also fit well with organisations that support multiple spaces with different allocation and collection techniques.

Segregated-fits is generally faster than single free-list schemes. This is of greater importance in a garbage collected context, since programs coded assuming garbage collection tend to do more allocation than ones coded using explicit freeing.

Because a collector frees objects in batches, the techniques designed for recombining free cells for explicit freeing systems are less relevant. The sweep phase of mark-sweep can rebuild a free-list efficiently from scratch. In the case of compacting collectors, in the end there is usually just one large free chunk appropriate for sequential allocation. Copying similarly frees whole semispaces without needing to free each individual cell.
Chapter 8

Partitioning the Heap

… managed by the same collection algorithm and all are collected at the same time. However, there is no reason why this should be so, and substantial performance benefits accrue from a more discriminating treatment of objects. The best known example is generational collection [Lieberman and Hewitt, 1983; Ungar, 1984], which segregates objects by age and preferentially collects younger objects. There are many reasons why it might be beneficial to treat different categories of object in different ways. Some but not all of these reasons are related to the collector technology that might be used to manage them. As we saw in earlier chapters, objects can be managed either by a direct algorithm (such as reference counting) or by an indirect, tracing algorithm. Tracing algorithms may move objects (mark-compact or copying) or not (mark-sweep). We might therefore consider whether or not we wish to have the collector move different categories of object and, if so, how we might wish to move them. We might wish to distinguish, quickly by their address, which collection or allocation algorithm to apply to different objects. Most commonly, we might wish to distinguish when we collect different categories of object.
8.1 Terminology

It is useful to distinguish the sets of objects to which we want to apply certain memory management policies from the mechanisms that are used to implement those policies efficiently. We shall use the term space to indicate a logical set of objects that receive similar treatment. A space may use one or more chunks of address space. Chunks are contiguous and often power-of-two sized and aligned.
… policy or with a different mechanism. These ideas were first explored in Bishop's influential thesis [1977]. These reasons include object mobility, size, lower space overheads, easier identification of object properties, improved garbage collection yield, reduced pause time, better locality, and so on. We examine these motivations now, before considering particular models of garbage collection and object management that take advantage of heap partitioning.
Partitioning by mobility

In a hybrid collector it may be necessary to distinguish objects that can be moved from those that either cannot be moved or which it is costly to move. It may be impossible to move objects due to lack of communication between the run-time system and the compiler, or because an object is passed to the operating system (for example, an I/O buffer). Chase [1987, 1988] suggests that asynchronous movement may also be particularly detrimental to compiler optimisations. In order to move an object, we must be able to discover every reference to that object so that each can be updated to point to the object's new location. In contrast, if collection is non-moving, it suffices that a tracing collector finds at least one reference. Thus, objects cannot be moved if a reference has been passed to a library (for example, through the Java Native Interface) that does not expect garbage collection. Either such objects must be pinned or we must ensure that garbage collection is not enabled for that space while the object is accessible to the library.1
The references that must be updated in order to move objects include the root set. Determining an accurate map of root references is one of the more challenging parts of building the interface between a managed language and its run-time. We discuss this in detail in Chapter 11. One commonly chosen route, sometimes limited to an initial implementation, is to scan roots (thread stacks and registers) conservatively rather than construct a type-accurate map of which stack frame slots and so on contain object references. This tactic is inevitable if the compiler does not provide type-accurate information (for example, compilers for languages like C and C++). Conservative stack scanning [Boehm and Weiser, 1988] treats every slot in every stack frame as a potential reference, applying tests to discard those values found that cannot be pointers (for example, because they 'point' outside the range of the heap or to a location in the heap at which no object has been allocated). Since conservative stack scanning identifies a superset of the true pointer slots in the stack, it is not possible to change the values of any of these (since we might inadvertently change an integer that just happened to look like a pointer). Thus, conservative collection cannot move any object directly referenced by the roots. However, if appropriate information (which need not be full type information) is provided for objects in the heap, then a mostly-copying collector can safely move any object except for one which appears to be directly reachable from ambiguous roots [Bartlett, 1988a].
Partitioning by size

It may also be undesirable (rather than impossible) to move some objects. For example, the cost of moving large objects may outweigh the fragmentation costs of not moving them. A common strategy is to allocate objects larger than a certain threshold into a separate large object space (LOS). We have already seen how segregated-fits allocators treat large and small objects differently. Large objects are typically placed on separate pages (so a minimum size … a further reason for managing large object spaces with a non-copying collector.
Partitioning by kind

Physically segregating objects of different categories also allows a particular property, such as type, to be determined simply from the address of the object, rather than by retrieving the value of one of its fields or, worse, by chasing a pointer. This has several benefits. First, it offers a cache advantage, since it removes the necessity to load a further field (particularly if the placement of objects of a particular category is made statically, so the address comparison is against a compile-time constant). Second, segregation by property, whereby all objects sharing the same property are placed in the same contiguous chunk in order to allow a quick address-based identification of the space, allows the property to be associated with the space rather than replicated in each object's header. Third, the kind of the object is significant for some collectors. Objects that do not contain pointers do not need to be scanned by a tracing collector. Large pointer-free objects may benefit from being stored in their own space, whereas the cost of processing a large array of pointers is likely to be dominated by the cost of tracing the pointers rather than, say, the cost of moving the object. Conservative collectors benefit particularly from placing large compressed bitmaps in areas that are never scanned, as they are a frequent source of false pointers [Boehm, 1993]. Cycle-collecting tracing collectors can also benefit from segregating inherently acyclic objects, which cannot be candidate roots of garbage cycles.
Virtual machines often generate and store code sequences in the heap. Moving and reclaiming code has special problems, such as identifying, and keeping consistent, references to code, or determining when code is no longer used and hence can be unloaded (note that class reloading is generally not transparent, since the class may have state). Code objects also tend to be large and long lived. For these reasons, it is often desirable not to relocate code objects [Reppy, 1993], and to consider unloading code as an exceptional case particular to certain applications.
… abandoned within a relatively short time". Indeed, it is even common for a significant fraction …
… the least effort is to concentrate collection effort on those objects most likely to be garbage. If the distribution of object lifetimes is sufficiently skewed, then it is worth repeatedly collecting a subset (or subsets) of the heap rather than the entire heap [Baker, 1993].

For example, generational collectors typically collect a single space of the heap (the young generation or nursery) many times for every time that they collect the entire heap. Note that there is a trade-off here. By not tracing the whole heap at every collection, the collector allows some garbage to go unreclaimed (to float in the heap). This means that the space available for the allocation of new objects is smaller than it would have been otherwise, and hence that the collector is invoked more often. Furthermore, as we shall see later, segregating the heap into collected and uncollected spaces imposes more bookkeeping effort on both the mutator and the collector. Nevertheless, provided that the space chosen for collection has a sufficiently low survival rate, a partitioned collection strategy can be very effective.
… from reducing its working set size, since younger objects typically have higher mutation rates than older ones [Blackburn et al, 2004b].
Partitioning by thread

Garbage collection requires synchronisation between mutator and collector threads. On-the-fly collection, which never pauses more than one mutator thread at a time, may require a complex system of handshakes with the mutator threads, but even stop-the-world collection requires synchronisation to bring all mutator threads to a halt. This cost can be reduced if we halt just a single thread at a time and collect only those objects that were allocated by that thread and which cannot have escaped to become reachable by other threads. To achieve this, the collector must be able to distinguish those objects that are accessible from only one thread from those that may be shared, for example by allocating in thread-local heaplets [Doligez and Leroy, 1993; Doligez and Gonthier, 1994; Steensgaard, 2000; Jones and King, 2005].
At a larger granularity, it may be desirable to distinguish the objects accessible to particular tasks, where a task comprises a set of cooperating threads. For example, a server may run multiple managed applications, each of which usually requires its own complete virtual machine to be loaded and initialised. In contrast, a multi-tasking virtual machine (MVM) allows many applications (tasks) to run within a single invocation of the multi-tasking virtual machine [Palacz et al, 1994; Soman et al, 2006, 2008; Wegiel and Krintz, 2008]. Care is clearly needed to ensure that different tasks cannot interfere with one another, either directly (by obtaining access to another's data) or indirectly (through denying another task fair access to system resources such as memory, CPU time, and so on). It is particularly desirable to be able to unload all the resources of a task when it has completed, without having to disturb other tasks (for example, without having to run the garbage collector). All these matters are simplified by segregating unshared data owned by different threads.
Partitioning by availability

One reason for wishing not to touch objects that are accessible to other threads is to reduce … recognising this allowed the server to handle larger workloads. More generally, in a system managed by distributed garbage collection, it will be desirable to manage local and remote objects and references with different policies and mechanisms, since the cost of accessing a remote object will be many orders of magnitude more expensive than accessing a local object.
Distribution is not the only reason why the cost of object access may not be uniform. Earlier we paid particular attention to how tracing collectors might minimise the cost of cache misses. The cost of a cache miss may be a few hundred cycles, whereas accessing …
Partitioning by mutability

Finally, we might wish to partition objects according to their mutability. Recently created objects tend to be modified more frequently (for example, to initialise their fields) than longer lived objects [Wolczko and Williams, 1992; Bacon and Rajan, 2001; Blackburn and McKinley, 2003; Levanoni and Petrank, 2006]. Memory managers based on reference counting tend to incur a high per-update overhead and thus are less suitable for objects that are modified frequently. On the other hand, in very large heaps, only a comparatively small proportion of objects will be updated in any period, but a tracing collector must nevertheless visit all objects that are candidates for garbage. Reference counting might be better suited to this scenario.
… align chunks on power-of-two boundaries. In that case, an object's space is encoded into the highest bits of its address and can be found by a shift or mask operation. Once the space identity is known, the collector can decide how to process the object (for example, mark it, copy it, ignore it and so on). If the layout of the spaces is known at compile time, this test can be particularly efficient: a comparison against a constant. Otherwise, the space can be looked up, using these bits as an index into a table.
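Concretely, with chunks aligned on 2^k boundaries the test is a shift and a table index (or a comparison against a constant); a sketch, with all names and the chunk size assumed:

    #include <stdint.h>

    #define LOG_CHUNK 20                 /* 1 MiB chunks, for example */

    typedef uint8_t SpaceId;
    extern SpaceId spaceTable[];         /* indexed by chunk number */

    static inline SpaceId spaceOf(void *p) {
        return spaceTable[(uintptr_t)p >> LOG_CHUNK];
    }

    /* When a space's placement is fixed at compile time, the lookup
       reduces to:  ((uintptr_t)p >> LOG_CHUNK) == NURSERY_CHUNK       */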
2 On the other hand, many current generation netbooks have limited memory and page thrashing is a concern.
programs, and the presence of dynamic class loading commonly necessitates excessively conservative analysis, although Jones and King [2005] show how to obtain a more accurate static estimate of escapement in the context of thread-local allocation. If object escapement is tracked dynamically, then the distinction is between objects that are currently thread-local and those that are (or have been) accessible to more than one thread.3 The downside of dynamic segregation is that it imposes more work on the write barrier. Whenever a pointer update causes its referent to become potentially shared, then the referent and its transitive closure must be marked as shared.
Finally in this section, we note that collecting only a subset of the partitions of the heap necessarily leads to a collector that is incomplete: it cannot reclaim any garbage in partitions that are not collected. Even if the collector takes care to scavenge every partition at some time, say on a round-robin basis, garbage cycles that span partitions will not be collected. In order to ensure completeness, some discipline must be imposed on the order in which partitions are collected and the destination partition to which unreclaimed objects are moved. A simple, and widely used, solution is to collect the entire heap when other tactics fail. However, more sophisticated strategies are possible, as we shall see when we consider Mature Object Spaces (also called the Train collector) [Hudson and Moss, 1992].
… segregate objects by their age into a number of spaces. In this case, partitioning is performed dynamically, by the collector. As an object's age increases beyond some threshold, it is promoted (moved physically or logically) into the next space.

Objects may also be segregated by the collector because of constraints on moving objects. For example, mostly-copying collectors may not be able to move some objects while they are considered pinned, that is, accessible by code that is not aware that objects' locations may change.
Partitioning decisions may also be made by the allocator. Most commonly, allocators determine from the size of an allocation request whether the object should be allocated in a large object space. In systems supporting explicit memory regions visible to the programmer, or inferred by the compiler (such as scoped regions), the allocator or compiler can place objects in a particular region. Allocators in thread-local systems place objects in a heaplet of the executing thread unless they are directed that the object is shared. Some generational systems may attempt to co-locate a new object in the same region as one that will point to it, on the grounds that eventually it will be promoted there anyway [Guyer and McKinley, 2004].

An object's space may also be decided statically, by its type, because it is code, or through some other analysis. If it is known a priori that all objects of a particular kind share a common property, such as immortality, then the compiler can determine the space in which these objects should be allocated, and generate the appropriate code sequence. Generational collectors normally allocate in a nursery region set aside for new objects; later, the collector may decide to promote some of these objects to an older generation. However, if the compiler 'knows' that certain objects (for instance, those allocated at a particular point in the code) will usually be promoted, then it can pretenure these objects by allocating them directly into that generation [Cheng et al, 1998; Blackburn et al, 2001, 2007; Marion et al, 2007].
Finally, objects may be repartitioned by the mutator as it runs, if the heap is managed by a concurrent collector (Chapter 15). Mutator access to objects may be mediated by read or write barriers, each of which may cause one or more objects to be moved or marked. The colouring of objects (black, grey, white) and the old/new space holding the object may be thought of as a partitioning. The mutator can also dynamically discriminate objects according to other properties. As we saw above, the write barrier used by Domani et al [2002] logically segregates objects as they escape their allocating thread. Collaboration between the run-time system and the operating system can repartition objects as pages are swapped in and out [Hertz et al, 2005].

In the next two chapters, we investigate a variety of partitioned garbage collectors. Chapter 9 looks at generational collectors in detail, while Chapter 10 examines a wide variety of other schemes, including both those based on different ways of exploiting objects' ages and those based on non-temporal properties.
Chapter 9

Generational Garbage Collection

The goal of a collector is to find dead objects and reclaim the space they occupy. Tracing collectors (and copying collectors in particular) are most efficient if the space they manage contains few live objects. On the other hand, long-lived objects are handled poorly if the collector processes them repeatedly, either marking and sweeping or copying them again and again from one semispace to another. We noted in Chapter 3 that long-lived objects tend to accumulate in the bottom of a heap managed by a mark-compact collector, and that some collectors avoid compacting this dense prefix. While this eliminates the cost of relocating these objects, they must still be traced and all references they contain must be updated.
Generational collectors extend this idea by not considering the oldest objects whenever possible. By concentrating reclamation effort on the youngest objects, in order to exploit the weak generational hypothesis that most objects die young, they hope to maximise yield (recovered space) while minimising effort. Generational collectors segregate objects by age into generations, typically physically distinct areas of the heap. Younger generations are collected in preference to older ones, and objects that survive long enough are promoted (or tenured) from the generation being collected to an older one.
Most generational collectors manage younger generations by copying. If, as expected, few objects are live in the generation being collected, then the mark/cons ratio between the volume of data processed by the collector and the volume allocated for that collection will be low. The time taken to collect the youngest generation (or nursery) will in general depend on its size. By tuning its size, we can control the expected pause times for collection of a generation. Young generation pause times for a well configured collector (running an application that conforms to the weak generational hypothesis) are typically of the order of ten milliseconds on current hardware. Provided the interval between collections is sufficient, such a collector will be unobtrusive to many applications.
Occasionally a generational collector must collect the whole heap, for example when the allocator runs out of space and the collector estimates that insufficient space would be recovered by collecting only the younger generations. Generational collection therefore improves only expected pause times, not the worst case. On its own, it is not sufficient for real-time systems. We consider the requirements for garbage collection in a hard real-time environment and how to achieve them in Chapter 19.
Generational collection can also improve throughput by avoiding repeatedly processing long-lived objects. However, there are costs to pay. Any garbage in an old generation cannot be reclaimed by collection of younger generations: collection of long-lived objects that become garbage is not prompt. In order to be able to collect one generation without collecting the others, the mutator must be charged with some extra bookkeeping work in order to track references that span generations, an overhead hoped to be small compared to the benefits of generational collection. Tuning generational collectors to meet throughput and pause-time goals simultaneously is a subtle art.
9.1 Example

Figure 9.1 shows a simple example of generational collection. This collector is using two generations. Objects are created in the young generation. At each minor collection (or nursery collection), objects in the young generation are promoted to the old generation if they are sufficiently old. Before the first collection, the young generation in this example contains four objects, N, P, V and Q, and the old generation three objects, R, S and U. R and N are reachable from outside the generational space; maybe some roots point to them. The collector is about to collect the young generation. Suppose that N, P and V were allocated some time ago but Q was created only shortly before the collector was invoked. The question of which objects should be promoted raises important issues.
A generational collector will promote objects it discovers from the young generation to the old one, provided they are old enough. This decision requires that a generational collector has a way of measuring time and a mechanism for recording ages. In our example, no objects in the young generation other than N are directly reachable from the roots, but P and Q are also clearly live, since they are reachable from the roots via R and S. Most generational collectors do not examine the whole heap, but trace only the generation(s) being collected. Since the old generation is not to be traced here, a generational system must record inter-generational pointers, such as the one from S to P, in order that the collector may discover P and Q.
Such inter-generational pointers can arise in two ways. First, the mutator creates a … written. A generational collector needs a similar copy write barrier to detect any inter-generational references created by promotion. In the example, the remembered set (remset) records the location of any objects (or fields) that may contain an inter-generational pointer of interest to the collector, in this case S and U.
Unfortunately, treating the source of inter-generational pointers as roots for a minor collection exacerbates the problem of floating garbage. Minor collections are frequent but do not reclaim garbage in the old generation, such as U. Worse, U holds an inter-generational pointer, so must be considered a root for the young generation. This nepotism will lead to the young garbage child V of the old garbage object being promoted rather than reclaimed, thus further reducing the space available for live objects in the older generation.
… birth and their death. Space allocated is a largely machine-independent measure, although clearly a system with 64-bit addresses or integers will use more space than a 32-bit one. Bytes allocated also directly measures the pressure placed upon the memory manager; it is closely related to the frequency with which the collector must be called.

Unfortunately, measuring time in terms of bytes allocated is tricky in multithreaded systems (where there are multiple application or system threads). A simple global measure of the volume of allocation may inflate the lifetime of an object, since the counter will include allocation by threads unrelated to the object in question [Jones and Ryder, 2008]. In practice, generational collectors often measure time in terms of how many collections an object has survived, because this is more convenient to record and requires fewer bits, but the number of collections survived is appropriately considered to be an approximate proxy for actual age in terms of bytes allocated.
… reported that between 50% and 90% of Common Lisp objects survive less than ten kilobytes of allocation. The story is similar for functional languages. For Haskell, between 75% and 95% of heap data died before they were ten kilobytes old, and only 5% lived longer than one megabyte [Sansom and Peyton Jones, 1993]. Appel [1992] observed that Standard ML/NJ reclaimed 98% of any given generation at each collection, and Stefanovic and Moss [1994] found that only 2% to 8% of heap allocated data survived the 100 kilobyte threshold.
… more complex. Object lifetimes are not random. They commonly live in clumps and die all at the same time, because programs operate in phases [Dieckmann and Holzle, 1999; Jones and Ryder, 2008]. A significant number of objects may never die. The lifetime of objects may be correlated with their size, although opinion has differed on this [Caudill and Wirfs-Brock, 1986; Ungar and Jackson, 1988; Barrett and Zorn, 1993]. However, as we saw above, there are other reasons why we might want to treat large objects specially.
… spaces, called steps or buckets. Generations may also hold their own large object subspaces. Each generation may be managed by a different algorithm.

The primary goals of generational garbage collection are reduced pause times and improved throughput. Assuming that the youngest generation is processed by copying collection, expected pause times depend largely on the volume of data that survives a minor collection of that generation, which in turn depends on the size of the generation. However, if the size of the nursery is too small, collection will be fast but little memory will be reclaimed, as the objects in the nursery will have had insufficient time to die. This will have many undesirable consequences.
First, young generation collections will be too frequent; as well as its copying cost, proportional to the volume of surviving objects (which will be higher, since objects have had less time to die), each collection must also bear the cost of stopping threads and scanning their stacks.

Second, the older generation will fill too fast, and then it too will have to be collected. High promotion rates will cause time-consuming older generation or full heap collections to take place too frequently. In addition, premature promotion will increase the incidence
of nepotism, as 'tenured' garbage objects in the old generation preserve their offspring in the young generation, artificially inflating the survivor rate, as those dead children will also be promoted.

Third, there is considerable evidence that newly created objects are modified more frequently than older ones. If these young objects are promoted prematurely, their high mutation rate will put further pressure on the mutator's write barrier; this is particularly undesirable if the cost of the write barrier is high. Any transfer of overheads between mutator and collector needs careful evaluation with realistic workloads. Typically, the collector will account for a much smaller proportion of execution time than the mutator in any well configured system. For example, suppose a write barrier comprises just a few instructions in its fast path yet accounts for 5% of overall execution time; suppose further that the collector accounts for 10% of overall run time. It would be quite easy for an alternative write barrier implementation to double the cost of the barrier, thus adding 5% to overall execution time. To recover this, garbage collection time must be reduced by 50%, which would be hard to do.
Finally, by promoting objects, the program's working set may be diluted. Generational organisation is a balancing act between keeping minor collections as short as possible, minimising the number of minor and the much more expensive full, major collections, and avoiding passing too much of the cost of memory management to the mutator. We now look at how this can be achieved.
Multiple generations

Intermediate generations are collected more frequently than old ones but less frequently than the young. Although the time taken to collect an intermediate generation will be less than that required to collect the full heap, pause times will be longer than those for nursery collections. Multiple generation collectors are also more complex to implement and may introduce additional overheads to the collector's tracing loop, as this performance-critical code must now distinguish between multiple generations rather than just two (which can often be accomplished with a single check against an address, possibly a compile-time constant). Increasing the number of generations will tend to increase the number of inter-generational pointers created, which in turn may increase the pressure on the mutator's write barrier, depending on implementation. It will also increase the size of the root set for younger generations, since objects will have been promoted that would not have been if some of the space used for the intermediate generations had been used to increase the size of the young generation.
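With only two contiguous generations, the tracing loop's test really can be a single comparison. A minimal sketch (the boundary value and names are invented for illustration, not taken from any particular system):

    // Hypothetical two-generation membership check: with contiguous
    // generations, the nursery occupies all addresses above a fixed
    // boundary, so the test is one comparison against a (possibly
    // compile-time constant) address.
    final class GenerationCheck {
        static final long NURSERY_BASE = 0x4000_0000L;   // assumed layout

        static boolean inNursery(long address) {
            return address >= NURSERY_BASE;
        }
    }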
Although many early generational collectors for Smalltalk and Lisp offered multiple generations, rarely are more than two generations used by default [Marlow et al, 2008]. Instead, mechanisms within generations, especially the youngest generation, can be used to control promotion rates.
En masse promotion, in which every live object in the young generation is promoted at each minor collection, is the simplest policy and makes optimal utilisation of the memory devoted to the young generation. There is neither any need to record per-object ages nor is there any necessity for copy reserve space in each generation (except for the last, if indeed it is managed by copying). The generational collectors used by the MMTk memory manager in the Jikes RVM Java virtual machine use en masse promotion in this way [Blackburn et al, 2004b]. However, Zorn [1993] has suggested that en masse promotion of every live object (in a Lisp system) may lead to promotion rates 50% to 100% higher than can be achieved by requiring objects to survive more than one minor collection before they are promoted.
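The difference between these policies comes down to a small test applied to each survivor. A sketch under invented names (the policy decision only; mechanisms for recording ages are discussed below):

    // Hypothetical promotion test: a survivor is tenured only once it has
    // already survived 'threshold' minor collections. A threshold of 1
    // corresponds to en masse promotion of every live nursery object.
    final class PromotionPolicy {
        final int threshold;

        PromotionPolicy(int threshold) { this.threshold = threshold; }

        boolean shouldPromote(int collectionsSurvived) {
            return collectionsSurvived + 1 >= threshold;  // counting this one
        }
    }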
Figure 9.3, due to Wilson and Moher [1989b], illustrates survival rates under different copy counts. If objects are required to survive two collections rather than one before promotion, the very youngest objects (which we can expect to die soon) are denied tenure, and the promotion rate is substantially reduced. In general, increasing the copy count for promotion beyond two is likely to pay diminishing returns [Ungar, 1984; Shaw, 1988; Ungar and Jackson, 1988]; Wilson [1989] suggests that the count would have to be increased substantially to obtain any further significant reduction.
Aging semispaces
Promotion can be delayed by structuring a generation into two or more aging spaces. This allows objects to be copied between the fromspace and tospace an arbitrary number of times before they are promoted.
Figure 9.3: Survival rates with a copy count of 1 or 2 (curve labels: 'never copied', 'count = 1', 'count = 2'; HWM marks the high water mark). The curves show the fraction of objects that will survive a future collection if they were born at time x. Curve (b) shows the proportion that will survive one collection and curve (c) the proportion that will survive two.
While this arrangement allows the older members of the generation time to die, the very
youngest will still be promoted, possibly prematurely.
Sun's ExactVM¹ also implemented the younger of two generations as a pair of semispaces (see Figure 9.2c) but controlled promotion of an individual object by stealing five bits from one of two header words to record its age. In this case, individual live objects can either be evacuated to tospace or promoted to the next generation. While this throttles the promotion of the youngest objects, it adds a test and an addition operation to the work done to process live objects in the young generation.
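A sketch of what stealing five header bits for an age involves (the field position is invented for illustration; ExactVM's actual layout is not described here):

    // Hypothetical five-bit age field packed into a one-word object header.
    final class HeaderAge {
        private static final int AGE_SHIFT = 3;                  // assumed position
        private static final int AGE_MASK  = 0x1F << AGE_SHIFT;  // five bits

        static int age(int header) {
            return (header & AGE_MASK) >>> AGE_SHIFT;
        }

        static int withAge(int header, int age) {
            return (header & ~AGE_MASK) | ((age & 0x1F) << AGE_SHIFT);
        }

        // The extra per-object work at each scavenge: a test, then an add.
        static int tick(int header) {
            int a = age(header);
            return a < 31 ? withAge(header, a + 1) : header;     // saturate
        }
    }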
Bucket brigade and step systems allow a somewhat finer discrimination between object ages without maintaining per-object ages. Here, a generation is divided into a number of subspaces and objects are advanced from one bucket or step to the next at each collection. Some step systems advance all surviving objects from one step to the next at each collection: live objects from the oldest step are promoted to the next generation. Here, an n-step system guarantees that objects will not reach the next generation until they have survived n scavenges. Glasgow Haskell allows an arbitrary number of steps in each generation (although the default is two in the young generation and one in others), as does the UMass GC Toolkit [Hudson et al, 1991]. Shaw [1988] further divides each step into a pair of semispaces in his bucket brigade scheme. Survivors are copied between each pair of semispaces b times before advancing to the next step. Thus, the two-bucket scheme guarantees that objects will not reach the next generation until they have survived between b + 1 and 2b scavenges. Shaw arranged his scheme to simplify promotion. Figure 9.4 shows an instance of his scheme with two buckets: n = 3, so objects are copied up to three times within a bucket before being evacuated to the aging bucket or promoted. Because Shaw's generations are contiguous, he can merge the aging bucket with the old generation by delaying
¹Later called the Sun Microsystems Laboratories Virtual Machine for Research, http://research.sun.com/features/tenyears/volcd/papers/heller.htm.
Figure 9.4: Shaw's bucket brigade system. Objects are copied within the young generation from a creation space to an aging semispace. By placing the aging semispace adjacent to the old generation at even numbered collections, objects can be promoted to the old generation simply by moving the boundary between generations. Jones [1996]. Reprinted by permission.
the promotion step until the oldest bucket's tospace is adjacent to the old generation. At this point the bucket is promoted by adjusting the boundary between the generations. The aging spaces of Figure 9.2c have some similarities with a two-bucket scheme but pay the cost of manipulating age bits in the headers of survivors.
It is important to understand the differences between steps and generations. Both segregate objects by age but, whereas different generations are collected at different frequencies, all the steps of a generation are collected at the same time, so pointers between steps of the same generation need not be remembered.
Survivor spaces and flexibility
All the semispace organisations described above are wasteful of space since they reserve half the space in the generation for copying. Ungar [1984] organised the young generation as one large creation space (sometimes called eden) and two smaller buckets or survivor semispaces (see Figure 9.2d). As usual, objects are allocated in eden, which is scavenged alongside the survivor fromspace at each minor collection. All live eden objects are promoted to the survivor tospace. Live objects in survivor fromspace are either evacuated to tospace within the young generation or promoted to the next generation, depending on their age. This organisation can improve space utilisation because the eden region is very
Figure 9.5: High water marks. Objects are copied from a fixed creation space to an aging semispace within a younger generation and then promoted to an older generation. Although all survivors in an aging semispace are promoted, by adjusting a 'high water mark', we can choose to copy or promote an object in the creation space simply through an address comparison. Wilson and Moher [1989b], doi: 10.1145/74877.74882. © 1989 Association for Computing Machinery, Inc. Reprinted by permission.
much larger than the two semispaces. For example, Sun's HotSpot Java virtual machine [Sun Microsystems, 2006] has a default eden versus survivor space ratio of 32:1, thus using a copy reserve of less than 3% of the young generation.² HotSpot's promotion policy does not impose a fixed age limit for promotion but instead attempts to keep the survivor space half empty. In contrast, the other semispace schemes waste half of the nursery space on copy reserve.
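The reserve fraction follows directly from the ratio (a quick check, on the usual assumption that the young generation comprises eden plus two equal survivor spaces, only one of which is empty at any time):

    young generation = eden + 2 survivor spaces = 32 + 1 + 1 = 34 parts
    copy reserve     = the single empty survivor space = 1 part
    1/34 ≈ 2.9%, less than 3% of the young generation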
The Opportunistic garbage collector [Wilson and Moher, 1989b] used a bucket brigade system with the space parsimony of survivor spaces and some flexibility in promotion age. The age at which objects are promoted can be varied down to the granularity of an individual object: by drawing a high water mark through the (sequentially allocated) creation space, younger objects (above the line in Figure 9.5) can be distinguished from older ones by a simple address comparison. Younger members of the creation space are treated as members of bucket 0. Older members and all of the aging space become bucket 1; survivors of this bucket are promoted.
²It is interesting to observe the development of hardware and configurations. Ungar [1984] used an eden of just 140 kilobytes with 28 kilobyte survivor spaces, and a 940 kilobyte old generation. HotSpot's default size for the young generation is 2228 kilobytes on the 32-bit Solaris operating system. We have even heard of a real configuration as extreme as a 3 gigabyte eden, 128 kilobyte survivor spaces and a 512 megabyte old generation.
The high water mark can be modified at any time, including during scavenges. We can see the effect in Figure 9.3. Any data in the dark grey or black regions to the left of the dashed white high water mark line will be promoted at their first collection. Those to the right of the high water mark line will be promoted if they are in the black area below curve (c), or evacuated to die later in the aging space if they are in the grey area above the curve. Wilson and Moher used this scheme with three generations for the byte-coded Scheme-48; it was also used in Standard ML with up to 14 generations [Reppy, 1993].
Appel-style garbage collection
Appel [1989a] introduced an adaptive generational layout for Standard ML that gives as much room as possible to the young generation for a given memory budget, rather than using fixed size spaces. This scheme is designed for environments where infant mortality is high: typically only 2% of ML's young generation survived a collection. The heap is divided into three regions: the old generation, a copy reserve, and the young generation (see Figure 9.6a). Nursery collections promote all young survivors en masse to the end of the old generation (Figure 9.6b). After the collection, any space not needed for old generation objects is split equally to create the copy reserve and a new young generation. If the space allocatable to the young generation falls below some threshold, the full heap is collected.
Figure 9.6: Appel's adaptive layout. (a) Before a minor collection, the copy reserve must be at least as large as the young generation. (b) At a minor collection, survivors are copied into the copy reserve, extending the old generation. The copy reserve and young generation are reduced but still of equal size. (c) After a minor collection and before a major collection, only objects in the oldest region, old, will be evacuated into the copy reserve. After the evacuation, all live old objects can be moved to the beginning of the heap.
As in any scheme managed by copying, Appel must ensure that the copy reserve is sufficient to accommodate the worst case, that all old and young objects are live. The most conservative way is to ensure that old + young < reserve. However, Appel can initiate full heap collections less frequently, requiring only that old < reserve ∧ young < reserve for safety, arguing as follows. Before a minor collection, the reserve is sufficient even if all young objects survive. Immediately after a minor collection, all newly promoted objects in old are live: they do not need to be moved. The reserve is sufficient to accommodate all previously promoted objects in old (Figure 9.6c). Following the scavenge of old, all surviving data (now at the top of the heap) can be block moved to the bottom of the heap. We note that in this collect-twice approach any cycle of dead objects that lies partly in the nursery and partly in the old generation will be preserved. However, it will be collected during the next full collection since it is then contained entirely in the old generation.
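The invariant and the equal split after each minor collection can be sketched in a few lines (illustrative names; sizes in bytes against a fixed heap budget):

    // Sketch of Appel-style sizing: after each minor collection the space
    // not used by the old generation is split equally between the copy
    // reserve and a new young generation; a full collection is triggered
    // when the young generation would fall below a minimum.
    final class AppelSizing {
        final long heapBudget;          // total memory available
        long old, reserve, young;       // current region sizes

        AppelSizing(long heapBudget) { this.heapBudget = heapBudget; }

        // Safety conditions from the argument above.
        boolean minorIsSafe() { return young < reserve; }
        boolean majorIsSafe() { return old < reserve; }

        void afterMinorCollection(long newOldSize, long minYoung) {
            old = newOldSize;
            long free = heapBudget - old;
            reserve = free / 2;
            young = free - reserve;
            if (young < minYoung)
                collectFullHeap();      // placeholder for the major collection
        }

        private void collectFullHeap() { /* elided */ }
    }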
The entire generational universe in Appel's scheme was contiguous, but Appel-style collectors can also be implemented in block structured heaps, which avoids the necessity of sliding the live data to the start of the heap after a major collection. Shrinking nurseries can also be used in conjunction with an old generation managed by a non-moving algorithm, such as mark-sweep.
The advantage of Appel-style collection is that, by dynamically adapting the size of the copy reserve, it offers good memory utilisation and reduces the number of collections needed compared with configurations that use en masse promotion and fix the size of the young generation. However, some caution is necessary to avoid thrashing the collector. Benchmarks that have high allocation rates but promote little data from the young generation are common: indeed this was one of the motivations for Appel's design. This can lead to the situation where the space allotted to the nursery shrinks to become so small that minor collections are overly frequent, yet never enough data is promoted to trigger a major collection. To combat this, the old generation should be collected whenever the young generation's size falls below a minimum.
Ungar and Jackson [1988; 1992] used the volume of objects surviving each collection to select the age threshold for promotion that should be used at the next collection. Although this mechanism can control promotion rates, it cannot demote objects from an older to a younger generation. Barrett and Zorn [1995] vary a threatening boundary between two generations in both directions. The cost is that they must track more pointers as they cannot predict where the inter-generational boundary will lie.
In version 1.5.0, Sun's HotSpot family of collectors introduced Ergonomics, an adaptive mechanism for resizing generations based on user provided goals. Ergonomics focuses on three soft goals rather than attempting to provide hard real time guarantees. It first attempts to meet a maximum pause time goal. Once that is met, it targets throughput (measured as the fraction of overall time spent in garbage collection) and finally, once the other goals are satisfied, it shrinks the footprint. Pause time goals are addressed by shrinking the size of generations, one at a time, starting with the one whose pause time is longest, based on statistics acquired at each collection. Throughput is improved by increasing the size of the heap and the generations, the latter in proportion to the time taken to collect each generation. By default, sizes are increased more aggressively than they are decreased.
Vengerov [2009] offers an analytical model for the throughput of HotSpot. From this model he derives a practical algorithm for tuning the collector by adjusting the relative sizes of HotSpot's two generations and the promotion threshold, the number of collections that a young object must survive before it is promoted. He makes the important observation that it is insufficient to consider whether to adjust the promotion threshold simply on the basis of whether it would reduce the number of objects promoted. Instead, it is essential also to consider the ratio of free space in the old generation after a major collection to the volume promoted into it at each minor collection. His ThruMax algorithm provides a co-evolutionary framework for alternately adjusting the size of the young generation and the promotion threshold. In brief, ThruMax is invoked after the first major collection and once the volume of data in HotSpot's survivor spaces reaches a steady state (between 75% and 90% of the young generation's survivor space for two consecutive minor collections). ThruMax first increases the nursery size S until it reaches the neighbourhood of an optimum value (discovered by observing that S has been decreased and so is probably oscillating around this value). Then ThruMax adjusts the tenuring threshold until the model shows that a further change would decrease throughput. After this, a new episode of adjustments is begun, provided that there is no pressure to decrease S and sufficient minor collections are expected before the next major collection.
Overall, sophisticated collectors like HotSpot present the user with a large number of tuning knobs, whose settings are likely to be interdependent.
Inter-generational pointers

The roots of a generation include not only references held outside the heap, for example in registers, stacks and globals, but also any references to its objects from objects in other parts of the heap that are not being collected at the same time. These typically include older generations and spaces outside the generational heap, such as large object spaces and spaces that are never collected, including those for immortal objects and possibly code. As we noted above, inter-generational pointers are created in just three ways: by initialising writes as an object is created, by other mutator updates to pointer slots, and when objects are moved to different generations. In general, such pointers must be detected as they are created and recorded so that they can be used as roots when a generation is collected.
Pointers from a boot image can be discovered by tracing, by scanning or from remembered sets. Tracing can be expensive, and might be applied only during full collections; thus it would be used in conjunction with scanning or remembered sets. Scanning has the virtue of not requiring a write barrier on updates to boot image objects, but the down side is that the collector must consider more fields to find the interesting pointers. If scanning is used in conjunction with tracing, then after a trace the collector should zero the fields of unreachable boot image objects, to prevent misinterpretation of pointers that may refer to old garbage now reclaimed. Remembered sets have their usual virtues and costs, and also do not require zeroing of unreachable boot image objects' fields.
Remembered sets
The data structures used to record inter-generational pointers are called remembered sets.³ Remembered sets record the location of possible sources of pointers (for example, U and the second slot of S in the example) from one space of the heap to another. The source rather than the target of an interesting pointer is recorded for two reasons. First, it allows a moving collector to update the source field with the new address of an object that has been copied or promoted. Second, a source field may be updated more than once between successive collections, so remembering the source ensures that the collector processes only the object that is referenced by the field at the time of the collection, and not the targets of any obsolete pointers. Thus, the remembered set for any generation holds those locations at which a potentially interesting pointer to an object in this generation has been stored. Remembered set implementations vary in the precision with which they record these locations. The choice of precision is a trade-off between overhead on the mutator, space for the remembered sets and the collector's cost of processing them. Note that the term remembered 'set' is sometimes a misnomer because an implementation may allow duplicate entries (and hence be a multiset).
Clearly it is important to detect and record as few pointers as possible. Pointer writes by the collector as it moves objects are easily identified. Pointer stores by the mutator can be detected by a software write barrier, emitted by the compiler before each pointer store. This may not be possible if an uncooperative compiler is used. In that case, the locations where writes have occurred can often be determined from the operating system's virtual memory manager.
The prevalence of pointer stores will vary between different programming languages and their implementations. From a static analysis of a suite of SPUR Lisp programs, Zorn [1990] found the frequency of pointer stores to be 13% to 15%, although Appel found a
³Our terminology differs from that of Jones [1996], who distinguished card table schemes from other remembered set implementations.
lower static frequency of 3% for Lisp [1987] and a dynamic, run-time frequency of 1% for ML [1989a]. State-based languages can be expected to have a higher incidence of destructive pointer writes. Java programs vary widely in terms of the frequency of pointer stores: for example, Dieckmann and Hölzle [1999] found that between 6% and 70% of heap accesses were stores (the latter was an outlier; the next highest was 46%).
Pointer direction
Fortunately, not all stores need to be detected or recorded. Some languages (such as implementations of ML) store procedure activation records in the heap. If these frames are scanned as part of the root set at every collection, the pointer slots they contain can be discovered by the techniques we discuss later in Chapter 11. If stack writes can be identified as such by the compiler, then no barrier need be emitted on writes to these stack frames. Furthermore, many stores will refer to objects in the same partition. Although such stores will probably be detected, the pointers are not interesting from a generational point of view, and need not be recorded.
If we impose a discipline on the order in which generations are collected, then the number of inter-generational pointers that need to be recorded can be reduced further. By guaranteeing that younger generations will be collected whenever an older one is, young-to-old pointers need not be recorded (for example, the pointer in N in Figure 9.1). Many pointer writes are initialising stores to newly created objects: Zorn [1990] estimated that 90% to 95% of Lisp pointer stores were initialising (and that of the remaining non-initialising stores, two-thirds were to objects in the young generation). By definition, these pointers must refer to older objects. Unfortunately, many languages separate the allocation of objects from the initialisation of their fields, making it hard to separate out the non-initialising stores that may create old-young pointers. Other languages provide more support for the compiler to identify pointer stores that do not require a write barrier. For example, the majority of pointer writes in a pure, lazy functional language like Haskell will refer to older objects; old-new pointers can arise only when a thunk (a function applied to its arguments) is evaluated and overwritten with a pointer value. ML, a strict language with side-effects, requires the programmer to annotate mutable variables explicitly; writes to these objects are the only source of old-to-young references.
Object-oriented languages like Java present a more complex scene. Here the programming paradigm centres on updating objects' states, which naturally leads to pointers from old objects to young, live objects. In this case, the write barrier must remember pointers in both directions, although if the policy decision is made always to collect the young generation at the same time as older ones, we can ignore writes to the nursery (which we expect to be prevalent). Because the number of pointers recorded may be large, such a barrier is best paired with an implementation where the size of the remembered set does not depend on the number of pointers remembered. We discuss implementation of write barriers and remembered sets in Chapter 11.
Space management

A simple generational collector that manages its young generation by copying needs a copy reserve as large as the generation being collected, as all objects may survive in the worst case. However, in practice most objects do not survive a young generation collection.
Better space utilisation can be obtained with a smaller copy reserve, switching from copying to compacting collection whenever the reserve is too small [McGachey and Hosking, 2006]. Here, the collector must be able to switch between copying and marking on the fly, because it will only discover that the copy reserve is too small during a collection. Figure 9.7a shows the state of the heap once all survivors have been identified: copied objects are shown in black and the remaining live young objects are marked grey. The next problem is that pointers may still refer to the black objects in the young generation. McGachey and Hosking solve this problem by requiring the first pass over the grey young generation objects to fix up references to copied objects. Next, they move the marked objects with Jonkers's sliding compactor (see Section 3.3 in Chapter 3) because this threaded algorithm does not require additional space
in object headers. A better solution might be to adapt Compressor for this purpose (discussed in Section 3.4), since Compressor neither requires extra space in object headers nor overwrites any part of live objects. With a copy reserve of just 10% of the heap, McGachey and Hosking gained improvements in performance of 4% on average (and sometimes up to 20%) over MMTk collectors that manage the old generation by either copying or mark-sweep collection.
Older-first garbage collection

Generational collectors condemn a youngest prefix of the heap: objects promoted out of it are reconsidered only at an intermediate collection (in a configuration that uses more than two generations) or a full heap collection. Adaptive techniques that control the promotion of objects can be thought of as ways of varying the age boundary of the young (to be collected) prefix in order to give young objects more time to die. However, generational garbage collection is but one design that avoids collecting the whole heap (we look at schemes outside an age-based context in the next chapter). Possibilities for age-based collection include:

Older-first collection: The collector aims to focus effort on middle-aged objects. It gives the youngest objects sufficient time to die but reduces the time spent considering very long-lived objects (although these are examined from time to time).
Figure 9.8: Renewal Older-First garbage collection. At each collection, the objects least recently collected are scavenged and survivors are placed after the youngest objects.
Older-first collection presents two challenges: how to identify those objects considered to be 'older', and the increased complexity of managing pointers into the condemned set, since interesting pointers may point in either direction (oldest to older, or youngest to older). In the rest of this section we consider two different solutions to these problems.
Renewal Older-First garbage collection. This collector defines the age of an object to be the time since it was created or last collected, whichever is most recent [Clinger and Hansen, 1997; Hansen, 2000; Hansen and Clinger, 2002].
Renewal Older-First always collects the 'oldest' prefix of the heap. To simplify remembered set management, the heap is divided into k equally sized steps. Allocation is always into the lowest-numbered empty step. When the heap is full, the oldest k − j steps (the grey window in Figure 9.8) are condemned, and any survivors are evacuated to a copy reserve at the youngest end of the heap (the black region in the figure). Thus, survivors are 're-newed' and the remaining j youngest steps are now the oldest. In the figure, the heap advances rightwards through virtual address space. This simplifies the write barrier: only pointers from right to left in the figure, that is, those whose source is at a higher address than their target, need to be remembered by the mutator. Although this arrangement might be feasible for some programs in a 64-bit address space, it would soon exhaust a 32-bit address space. In that case, Renewal Older-First must renumber all the steps in preparation for the next cycle, and its write barrier must filter pointers by comparing the step numbers of the source and target; this requires table lookups rather than simple address comparisons. A second potential disadvantage of Renewal Older-First is that it does not preserve the order of objects in the heap by their true ages but irreversibly mixes them. Although Hansen filters out many pointers in the Larceny implementation of Scheme by adding a standard generational nursery (and using Renewal Older-First only to manage the old generation), his remembered sets are large.
Deferred Older-First garbage collection. The alternative does preserve the true age order of objects in the heap [Stefanovic, 1999; Stefanovic et al, 1999]. Deferred Older-First slides a fixed size collection window (the grey region in Figure 9.9) from the oldest to the
Figure 9.9: Deferred Older-First garbage collection.
youngest end of the heap. When the heap is full the window is collected, ignoring any older or younger objects (the white regions). Any survivors (the black region) are moved to immediately after the oldest region of the heap and any space freed is added to the youngest (rightmost) end of the heap. The next collection window is immediately to the right (younger end) of the survivors. The intuition behind Deferred Older-First is that it will seek out a sweet spot in the heap where the collection window finds few survivors. At this point, the collector's mark/cons ratio will be low and the window will move only very slowly (as in the lower rows of the figure). However, at some point the window will reach the youngest end of the heap, where the collector must reset it to the oldest end of the heap.
Although objects are stored in true-age order, Deferred Older-First requires a more complicated write barrier. The mutator's write barrier must remember all pointers from the oldest region into either the collection window or the youngest region, and all young-old pointers (except those whose source is in the condemned window). Similarly, the collector's copy write barrier must remember all pointers from survivors to other regions and all young survivor to old survivor pointers. Once again, Deferred Older-First collectors typically divide the heap into blocks; they associate a 'time of death' with each block (ensuring that older blocks have a higher time of death than younger ones). Barriers can then be implemented through block time-of-death comparisons, though care is needed to handle time of death overflow [Stefanovic et al, 2002].
Beltway

Several insights can be drawn from the designs discussed so far:

• 'Most objects die young': the weak generational hypothesis [Ungar, 1984].
• As a corollary, generational collectors avoid repeatedly collecting old objects.
• Response times have been improved by exploiting incrementality. Generational collectors commonly use small nurseries; other techniques such as the Mature Object Space (often called the 'Train') collector [Hudson and Moss, 1992] also bound the size of spaces collected.
• Small nurseries managed by sequential allocators improve data locality [Blackburn et al, 2004a].
• Objects need sufficient time to die.
The Beltway garbage collection framework [Blackburn et al, 2002] combines all these insights. It can be configured to behave as any other region-based copying collector. The Beltway unit of collection is called an increment. Increments can be grouped into queues, called belts. In Figure 9.10 each row represents a belt, with increments shown as 'trays' on each belt. Increments on a belt are collected independently, first-in, first-out, as also are belts, although typically the increment selected for collection is the oldest non-empty increment on the youngest belt. A promotion policy dictates the destination of objects that survive a collection: they may be copied to another increment on the same belt or they may be promoted to an increment on a higher belt. Note that Beltway is not just another generational collector. Fixed-size nursery collectors limit the size of the belt 0 increment (Figure 9.10b) whereas Appel-style collectors allow both increments to grow to consume all usable memory (Figure 9.10c). Aging semispaces can be modelled by increasing the number of increments on belt 0 (Figure 9.10d). However, unlike the aging semispaces discussed in Section 9.6, this design trades increased space for reduced collection time: unreachable objects in the second increment are not reclaimed in this collection cycle. Renewal Older-First and Deferred Older-First can also be modelled. Figure 9.10e shows clearly how objects of different ages are mixed by Renewal Older-First collectors. Deferred Older-First collectors use two belts, whose roles are flipped when the collection window reaches the youngest end of the first belt. Blackburn et al also used the Beltway framework to introduce new copying collection algorithms. Beltway.X.X (Figure 9.10g) adds incrementality to an Appel-style collector: when belt 1 is full, it collects only the first increment. In this configuration X is the maximum size of an increment as a fraction of usable memory: thus, Beltway.100.100 corresponds to a standard Appel-style generational collector. If X < 100, Beltway.X.X is not guaranteed to be complete, since garbage structures may span increments that are never condemned together.
Assuming that every configuration collects only the oldest increments on the youngest belts, Beltway's write barrier needs to remember references from older to younger
Figure 9.10: Beltway can be configured as any copying collector. Each configuration shows the increment used for allocation, the increment to be collected and the increment to which survivors will be copied; the configurations include (e) Renewal Older-First and (f) Deferred Older-First. Blackburn et al [2002], doi: 10.1145/512529.512548. © 2002 Association for Computing Machinery, Inc. Reprinted by permission.
belts, and from younger to older increments on the same belt. If we number belts upwards from 0 (youngest), and increments in each belt in the order in which they are created, an increment can be identified by the pair (b, i) where b is its belt number and i its creation order in belt b. In that numbering, a pointer from increment (b, i) to increment (b′, j) is interesting if b′ < b ∨ (b′ = b ∧ j < i). However, the collector can associate a unique small number n_i with each increment i such that a pointer from increment i to increment j is interesting exactly when n_j < n_i. It may need to renumber occasionally, such as when fresh increments are added to belts. A typical implementation keeps these numbers where the barrier can compare them cheaply.
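Both forms of the test can be written directly; a sketch (the increment identifiers and the numbering table are assumptions for illustration):

    // Sketch of Beltway's interesting-pointer tests. The pair test uses
    // (belt, creation order); the single-number test assumes the collector
    // maintains n[] so that n[dst] < n[src] exactly when the pair test holds.
    final class BeltwayBarrier {
        static boolean interesting(int srcBelt, int srcInc,
                                   int dstBelt, int dstInc) {
            return dstBelt < srcBelt
                || (dstBelt == srcBelt && dstInc < srcInc);
        }

        static boolean interesting(int[] n, int src, int dst) {
            return n[dst] < n[src];   // may require occasional renumbering
        }
    }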
Researchers have also used profiling to identify longevity. Cheng et al [1998] recorded which allocation sites consistently created objects that were promoted. Blackburn et al [2001; 2007] used lifetime metrics that compared the longevity of objects allocated at a particular program point with some fraction of the program's largest heap footprint in order to discriminate between short-lived, long-lived and immortal objects. Both techniques necessitated the time-consuming gathering of off-line traces. This information was then used to optimise the code so that new objects were allocated in the most appropriate generation or the immortal space. Some pretenuring decisions may be specific to a single program, although Blackburn et al computed generic advice for allocation sites used by all programs (that is, those in the boot image or library code). The effectiveness of such generic advice, and the fact that it need be gathered only once, make the necessary profiling more reasonable.
In contrast, the approach used by Marion et al [2007] is generic, and provides true prediction rather than self-prediction: they obtain pretenuring advice by syntactic comparison of programs' micro-patterns [Gil and Maman, 2005] against a pre-existing knowledge bank.
A collector must also decide how and when objects should be promoted between generations. The choice depends largely upon the anticipated lifetime distributions of the application's objects. The simplest policy promotes all survivors en masse, since the remembered set for the young generation can then be discarded after collection.
Alternatively, a collector may require an object to survive more than one collection before being promoted. In this case, we need a mechanism to record object ages: either some bits in the header of each object in the younger generations must be used to hold its age, or the generation must be divided into subspaces each of which holds objects of a particular age, or both. Common configurations include step-based schemes and eden plus survivor semispaces. In all cases, the subspaces of a generation are collected together.
Finally, it is often possible to avoid having to promote certain objects at all. Many collectors reserve an immortal space for objects that will survive until the end of the program. Often the objects placed in an immortal area can be recognised either at the time the collector is built or by the compiler. Such objects might include the collector's own data structures or objects representing the code being executed (assuming that it will not be necessary to unload code).
Promotion rates may also affect the cost of the write barrier and size of remembered sets. Higher rates of promotion may lead to more inter-generational pointers that must be recorded. Whether or not this affects the performance of the write barrier depends on its implementation, a subject considered in more detail in Section 11.8. Write barriers may record pointer writes unconditionally or they may filter out writes of no interest to the collector. The space requirements for card tables are independent of the number of writes recorded, in contrast to remembered sets implemented as sequential store buffers or hash tables.
The frequency with which write barriers are invoked also depends on whether generations can be collected independently. Independent collection requires all inter-generational pointers to be recorded. However, if we are prepared to give up this flexibility in favour of collecting all younger generations whenever an older one is collected, then the write barrier needs to record only old-young pointers, which we can expect to be far fewer. The number of pointers recorded also depends on whether we record the field or the object into which a pointer is written. For card tables, the choice is likely to be irrelevant. However, by noting in the object whether it has already been recorded as a possible source of an inter-generational pointer, we can reduce the size of the remembered set if we use object-remembering rather than field-remembering.
The different mechanisms used by the mutator to record the possible sources of inter-generational pointers affect the cost of collection. Although less precise recording mechanisms may reduce the cost of the write barrier, they are likely to increase the amount of work done by the collector. Field-recording with sequential store buffers may be the most precise mechanism, although the buffer may contain duplicate entries. Both object-recording and card tables require the collector to scan the object or card to find any inter-generational pointers.
In conclusion, generations are but one way of partitioning the heap to improve garbage collection. In the next chapter, we look at other partitioning methods.
Algorithm 9.1: Abstract generational garbage collection: collector routines
 1 atomic collectNursery(I):
 2     rootsNursery(I)
 3     scanNursery(I)
 4     sweepNursery()
 5
 6 scanNursery(W):
 7     while not isEmpty(W)
 8         src ← remove(W)
 9         ρ(src) ← ρ(src) + 1            /* shade src */
10         if ρ(src) = 1                  /* src was white, now grey */
11             for each fld in Pointers(src)
12                 ref ← *fld
13                 if ref in Nursery
14                     W ← W + [ref]
15
16 sweepNursery():
17     while not isEmpty(Nursery)
18         node ← remove(Nursery)         /* en masse promotion */
19         if ρ(node) = 0                 /* node is white */
20             free(node)
21
22 rootsNursery(I):
23     for each fld in Roots
24         ref ← *fld
25         if ref ≠ null and ref in Nursery
26             I ← I + [ref]
The set I acts as a remembered set restricted to the nursery, counting only references from old objects. It is the complement of deferred reference counting's zero count table. After adding references from roots to the nursery (rootsNursery), the nursery is traced from I (scanNursery) and is then swept, removing survivors from Nursery, which implicitly adds them to the older generation, and freeing unreachable nursery objects, that is, those whose abstract reference count is zero. Note that the statement in line 18 performs en masse promotion of all the live nursery objects: it would be straightforward to modify this to model other promotion policies.
27 New():
28     ref ← allocate()
29     if ref = null
30         collectNursery(I)
31         ref ← allocate()
32         if ref = null
33             collect()                  /* tracing, counting, or other full-heap GC */
34             ref ← allocate()
35             if ref = null
36                 error "Out of memory"
37     ρ(ref) ← 0                         /* node is black */
38     Nursery ← Nursery ∪ {ref}          /* allocate in nursery */
39     return ref
40
41 incNursery(node):
42     if node in Nursery
43         I ← I + [node]
44
45 decNursery(node):
46     if node in Nursery
47         I ← I − [node]
48
49 Write(src, i, ref):
50     if src ∉ Roots and src ∉ Nursery
51         incNursery(ref)
52         decNursery(src[i])
53     src[i] ← ref
Chapter 10
Other Partitioned Schemes
Figure 10.1: The Treadmill collector: objects are held on a double-linked list. Each of the four segments holds objects of a different colour, so that the colour of an object can be changed by 'unsnapping' it from one segment and 'snapping' it into another. The pointers controlling the Treadmill are the same as for other incremental copying collectors [Baker, 1978]: scanning is complete when scan meets T.
Some systems manage a large object space with a wider range of algorithms, including copying. Several implementations have separated large objects into a small (possibly fixed-size) header and a body [Caudill and Wirfs-Brock, 1986; Ungar and Jackson, 1988, 1992; Hosking et al, 1992]. The body is kept in a non-moving area, but the header is managed in the same way as other small objects. The header may also be handled by a generational garbage collector; opinions differ on whether large object headers should be promoted by the collector [Hudson et al, 1991] or not (so that the large amount of space that they occupy can be reclaimed as soon as possible after the object's death [Ungar and Jackson, 1992]). Other Java virtual machines, including Sun's ExactVM [Printezis, 2001], Oracle's JRockit and Microsoft's Marmot [Fitzgerald and Tarditi, 2000], have not used a separate space but allocated large objects directly into the old generation. Since large objects are by their nature likely to survive for some time, this approach saves copying them from the young generation.
The Treadmill garbage collector
It is also possible to copy or move objects logically without moving them physically. In this section we discuss the Treadmill; in the next section we consider how to move objects with operating system support. In terms of the tricolour abstraction, a tracing garbage collector partitions heap objects into four sets: black (scanned), grey (visited but not fully scanned), white (not yet visited) and free; it processes the grey set until it is empty. Each collection algorithm provides a different way to represent these sets. The Treadmill [Baker, 1992a] provides some of the advantages of semispace copying algorithms but in a non-moving collector. Although it was intended as an incremental collector, its virtues have also led it to be used in stop-the-world configurations for managing large objects.
The Treadmill is organised as a cyclic, double-linked list of objects (Figure 10.1) so that, considered anticlockwise, the black segment is followed by the grey segment, then the white segment and finally the free segment. The black and grey segments comprise the tospace, and the white segment the fromspace of the heap. Four pointers are used to operate the Treadmill. Just as with Cheney's algorithm, scan points to the start of the grey segment and divides it from the black one. B and T point to the bottom and top of the white fromspace list respectively, and free divides the free segment from the black segment.
Before a stop-the-world collection, all objects are black and in tospace. An object is allocated by advancing the free pointer clockwise, thus removing it from the free segment and adding it to the start of the black segment. When the free pointer meets the B pointer at the bottom of fromspace, free memory is exhausted and it is time to flip. At this point, the Treadmill contains at most two colours, black and white. The black segment is reinterpreted as white and the T and B pointers are swapped. The collector then behaves much as any semispace copying collector. As grey objects are scanned, the scan pointer is moved anticlockwise to add the object to the end of the black segment. When a white object in fromspace is visited by the collector, it is evacuated to tospace by unsnapping it from the white segment and snapping it into the grey segment. When the scan pointer meets the T pointer, the grey segment is empty and the collection is complete.
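Snapping is simply doubly-linked-list surgery, which is why 'copying' an object is a constant-time operation. A toy sketch (list cells only; the four colour pointers and object payloads are omitted):

    // Sketch of Treadmill snapping: cells sit on a cyclic doubly linked
    // list; changing an object's colour moves its cell between segments
    // without moving the object itself.
    final class Cell {
        Cell prev = this, next = this;   // a fresh cell is its own list

        // Remove this cell from its current segment.
        void unsnap() {
            prev.next = next;
            next.prev = prev;
        }

        // Link this (unsnapped) cell immediately after 'anchor': at the scan
        // pointer for depth-first traversal, or before T for breadth-first.
        void snapAfter(Cell anchor) {
            prev = anchor;
            next = anchor.next;
            anchor.next.prev = this;
            anchor.next = this;
        }
    }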
The Treadmill has several benefits. Allocation and 'copying' are fairly fast. A concurrent Treadmill can allocate objects of any colour simply by snapping them into the appropriate segment. As objects are not moved physically by snapping, allocation and 'copying' are constant time operations, not dependent on the size of the object. Snapping simplifies the choice of traversal order compared with other techniques discussed in Chapter 4. Snapping objects to the end of the grey segment (before the T pointer) gives breadth-first traversal. Snapping objects at the start of the segment (at the scan pointer) gives depth-first traversal without needing an explicit auxiliary stack, although effectively a stack is embedded in the links of the Treadmill for all traversal orders.
One disadvantage of the Treadmill for general purpose collection is the per-object overhead of the two links. However, for copying collection this overhead is offset by removing the need for any copy reserve, as the Treadmill does not physically copy objects. Another issue for the Treadmill is how to accommodate objects of different sizes (see [Brent, 1989; White, 1990; Baker et al, 1985]). One solution is to use separate Treadmills for each size class [Wilson and Johnstone, 1993]. However, these disadvantages are less of an issue for large objects. Large object Treadmills (for example, as used in Jikes RVM) keep each object on its own page (or sequence of pages). If links are kept in the pages themselves, they may simply consume some of the space otherwise wasted when rounding up the size to an integral number of pages. Alternatively, the links can be stored together, outside the pages.
Moving objects with operating system support

With operating system support, a large object can be allocated to its own set of pages. Instead of copying the object word by word, its pages can be re-mapped to fresh virtual memory addresses [Withington, 1991]. It is also possible to use operating system support to initialise large objects incrementally.¹ Rather than zeroing the space for the whole object in one step, the object's pages can be memory protected. Any attempt to access uninitialised sections of the object will spring this trap, at which point the page in question can be zeroed and unprotected; see also our discussion of zeroing in Section 11.1.
Pointer-free objects
There are good reasons, not directly related to their size, for segregating certain typically large objects. If an object does not contain any pointers, it is unnecessary to scan it. Segregation allows knowledge of whether the object is pointer-free to be derived from its address. If the mark-bit for the object is kept in a side table, then it is not necessary to touch the object at all. Allocating large bitmaps and strings in their own area, managed by a specialised scanner, can lead to significant performance improvements, even if the size of the area is modest. For example, Ungar and Jackson [1988] obtained a fourfold pause time reduction by using a separate space of only 330 kilobytes, tiny by today's standards.
Topological collectors

Objects may also be segregated by their position in the topology of pointer structures in the heap. This arrangement offers opportunities for new garbage collection algorithms, which we consider in this section.

Mature object space garbage collection

The mature object space, or 'Train', collector [Hudson and Moss, 1992] divides the heap into fixed-size cars, grouped into trains; each train can be reclaimed in isolation from other trains. The algorithm proceeds as follows.

1. Select the lowest numbered car c of the lowest numbered train t as the from-car.
¹http://www.memorymanagement.org/.
2. If train t's remembered set is empty, reclaim the whole of train t and stop.

3. Copy any object in c that is referenced by a root to a to-car c′ in a higher numbered train t′, possibly a fresh one.

4. Recursively copy objects in c that are reachable from to-car c′ to that car; if c′ is full, append a fresh car to t′.

5. Move any object promoted from the generational scheme to a train holding a reference to it.

6. Copy any object in c that is referenced from another train to that train.

7. Copy any other object reachable from other cars in this train t to the last car of t, appending a new car if necessary.
Step 2 reclaims whole trains that contain only garbage, even if this includes pointer structures (such as cycles) that span several cars of the train. As the train's remembered set is empty, there can be no references to it from any other train. Steps 3 and 4 move into a different train all objects in the from-car that are reachable from roots via reference chains contained in this car. These objects are certainly live, and this step segregates them from any possibly-garbage objects in the current train. For example, in Figure 10.2, objects A and B in car C1, train T1 are copied to the first car of a new train T3. The last two steps start to disentangle linked garbage structures from other live structures. Step 6 removes objects from this train if they are reachable from another one: in this example, P is moved to train 2, car 2. Finally, step 7 moves the remaining potentially live objects in this car (for example, X) to the end of its train. It is essential that these steps are done in this order since a single object may be reachable from more than one train. Following step 7, any objects remaining in car c are unreachable from outside it, and so this from-car is discarded, just as a semispace collector would do.
The Train algorithm has a number of virtues. It is incremental and bounds the amount of copying done at each collection cycle to the size of a single car. Furthermore, it attempts to co-locate objects with those that refer to them. Because of the discipline imposed on the order in which trains and cars are collected, it requires only references from high to low numbered trains/cars to be remembered. If it is used with a young generation collector, so that all spaces outside the mature object space are collected at each cycle, no references from outside that space need be remembered.
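The ordering discipline makes the barrier's filter a pair of comparisons. A sketch under the assumption that trains and cars carry integer numbers reflecting collection order:

    // Sketch: with cars collected in ascending (train, car) order, only
    // references from a higher-numbered train/car to a lower-numbered one
    // need be remembered.
    final class TrainBarrier {
        static boolean mustRemember(int srcTrain, int srcCar,
                                    int dstTrain, int dstCar) {
            return dstTrain < srcTrain
                || (dstTrain == srcTrain && dstCar < srcCar);
        }
    }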
Unfortunately, the Train collector can be challenged by several common mutator behaviours.² Isolating a garbage structure into its own train may require a number of garbage collection cycles quadratic in the number of cars over which the structure is distributed. As presented above, the algorithm may also fail to make progress in certain conditions. Consider the example in Figure 10.3a, where there is insufficient room for both objects (or pointer structures) to fit in a single car. Object A will be moved to a fresh car at the end of the current train when the first car is collected. Provided that none of the pointers in this example are modified, the next collection will find an external reference to the leading car, so B will be evacuated to a higher numbered train. Similarly, the third collection will find a reference to A from B's train and so move A there. There are no cars left in this train, so we can dispose of it. The next cycle will collect the first car of the next train, as desired. However, now suppose that, after each collection cycle, the mutator switches the external
²It was superseded as the 'low pause' collector in Sun Microsystems' JDK after Java 5 in favour of a concurrent collector.
Figure 10.2: (b) After collecting car 1, train 1: X is moved to the same car as its referent Y, and A and B to a fresh train T3. The next collection cycle will isolate T2 and reclaim it wholesale. Numbered labels show the copies made in each algorithm step.
Figure 10.3: (a) Before collecting the first car. (b) Before collecting the next car.
reference to the object in the second car, as in Figure 10.3b. The Train collector never discovers an external reference to the object in the leading car, and so the object will forever be moved to the last car of the current train, which will never empty. The collector can never progress to collect other trains. Seligmann and Grarup [1995] called these 'futile' collections. They solve the problem by remembering external pointers further down the train and using these in futile collections, thereby forcing progress by eventually evacuating the whole train.
The Train algorithm bounds the amount of copying done in each collection cycle but does not bound other work, such as remembered set scanning and updating references. Any 'popular', highly referenced objects will induce large remembered sets and require many referring fields to be updated when they are moved to another car. Hudson and Moss suggest dealing with such objects by moving them to the end of the newest train, into their own car, which can be moved logically rather than physically in future collections without the need to update references. Unfortunately, this does not guarantee to segregate a garbage cycle that spans popular cars. Even if a popular car is allowed to contain more than one popular item, it may still be necessary to disentangle these to separate cars unless they are part of the same structure. Both Seligmann and Grarup [1995] and Printezis and Garthwaite [2002] have found popular objects to be common in practice. The latter address this by allowing remembered sets to expand up to some threshold (say, 4,096 entries), after which they coarsen a set by rehashing its entries into a set of the same size but using a coarser hashing function. Seligmann and Grarup tune the frequency of train collections by tracking a running estimate of the garbage collected (a low estimate allows the collection frequency to be reduced). But Printezis and Garthwaite found it to be common for an application to have a few very long trains of long lived data; this defeats such a tuning mechanism.
Connectivity-based garbage collection

The performance of a partitioned collector would be improved if the number of inter-partition pointers that need to be remembered could be reduced or even eliminated. In the previous chapter, we saw how Guyer and McKinley [2004] used a static analysis to place new objects in the same generation as the objects to which they would be connected, and Zee and Rinard [2002] eliminated write barriers for the initialisation of the newest object in a generational collector. Hirzel et al [2003] explored connectivity-based allocation and collection further. They observed that the lifetimes of Java objects are strongly correlated with their connectivity. Those reachable only from the stack tend to be short-lived whereas those reachable from globals tend to live for most of the execution
of the program (and they note that this property is largely independent of the precise definition of short- or long-lived). Furthermore, objects connected by a chain of pointers tend to die at the same time.
Based on this observation, they proposed a new model of connectivity-based collection (CBGC) [Hirzel et al, 2003]. Their model has four components. A conservative pointer analysis divides the object graph into stable partitions: if an object a may point to an object b, then either a and b share a partition or there is an edge from a's partition to b's partition in the directed acyclic graph (DAG) of partitions. Although new partitions may be added (for example, as classes are loaded), partitions are never split. The collector can then choose any partition (or set of partitions) to collect, provided it also collects all its predecessor partitions in the DAG. Partitions in the condemned set are collected in topological order. This approach has two benefits. First, the collector requires neither write barriers nor remembered sets. Second, partitions can be reclaimed early: by collecting in topological order, as soon as the collector has finished tracing objects in a partition, any unvisited (white) objects in that partition or earlier ones must be unreachable and so can be reclaimed. Note that this also allows popular child partitions to be ignored.
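The rule that a partition may only be condemned together with its DAG predecessors amounts to a transitive closure; a sketch (partition identifiers and the predecessor map are assumptions for illustration):

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;

    // Sketch of CBGC's condemned-set rule: collecting partition p requires
    // collecting every partition with a path to p in the partition DAG.
    final class CondemnedSet {
        static Set<Integer> closure(int chosen,
                                    Map<Integer, List<Integer>> preds) {
            Set<Integer> condemned = new HashSet<>();
            Deque<Integer> work = new ArrayDeque<>();
            work.push(chosen);
            while (!work.isEmpty()) {
                int p = work.pop();
                if (condemned.add(p))
                    work.addAll(preds.getOrDefault(p, List.of()));
            }
            return condemned;   // then collect these in topological order
        }
    }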
Hirzel et al suggest that the performance of connectivity-based garbage collectors depends strongly on the quality of partitioning, their estimate of the survivor volume of each partition and their choice of partitions to collect. However, they obtained disappointing results (from simulation) for a configuration based on partitioning by the declared types of objects and their fields, estimating a partition's chance of survival from its global or stack reachability, moderated by a partition age based decay function, and using a greedy algorithm to choose partitions to collect. Although mark/cons ratios were somewhat better than those of a semispace copying collector, they were much worse than those of an Appel-style generational collector. On the other hand, worst-case pause times were always better. Comparison with an oracular collector, which received perfect advice on the choice of partition, suggested that there was a performance gap that might be exploited by a better configuration. Dynamic partitioning based on allocation site also improved performance of the collector, at the cost of re-introducing a write barrier to combine partitions.
Figure 10.4: Permissible pointer directions between the local (L) and optimistically-local (OL) heaplets of two threads and the shared heap.
Note that any organisation can be used within a heaplet (for example, a flat arrangement).
Steensgaard [2000] used a fast but conservative pointer analysis, similar to that of Ruf [2000], to identify Java objects potentially reachable from a global variable and by more than one thread. The goal of his flow-insensitive, context-sensitive escape analysis is to allow methods that create objects to be specialised in order to allocate each object in either the thread's local heaplet or the shared heaplet. Each heaplet comprises an old and a young generation. His collector is only mostly thread-local. Because Steensgaard treats all static fields as roots for a local heaplet, each collection requires a global rendezvous. A single thread scans the globals and all thread stacks in order to copy any directly reachable objects, before Cheney-scanning the shared heaplet. The local threads are released to finish independent collections of their own heaplets only after the shared scan is complete. These threads may encounter uncopied objects in the shared heaplet: if so, a global lock must be acquired before the object is copied.
Static segregation of shared and thread-local objects requires a whole program analysis. This is a problem for any language that permits classes to be loaded dynamically, since polymorphic methods in sub-classes loaded after the analysis is complete may 'leak' references to local objects by writing references into fields of globally reachable ones. Jones and King address this problem and provide a design for a truly thread-local collector [King, 2004; Jones and King, 2005]. Their escape analysis builds on Steensgaard's but is compositional: it supports Java's dynamic class loading, dealing safely with classes loaded after the analysis is complete. Designed for long running Java applications, the analysis was sufficiently fast to be deployed at run time in a background thread, with Sun's ExactVM Java virtual machine running on a multiprocessor under Solaris. They provide each thread with two local heaplets: one for objects that are guaranteed to be reachable only by the thread that allocated them, no matter what further classes may be loaded, and one for optimistically-local objects: those that are accessible by no more than one thread at the time of the analysis but which may become shared if an antagonistic class is loaded. Purely thread-local objects turn out to be comparatively rare: these are mostly objects that do not escape their allocating method. Optimistically-local objects are fairly common, however. The rules for pointer directionality are extended: local objects may also point to optimistically-local ones, but not vice-versa; optimistically-local objects may refer to global ones. A schematic of permissible pointers is shown in Figure 10.4. Jones and King
just before a thread creates a reference to an object it did not allocate. The barrier must also set this bit for every object in the transitive closure of the target object. The parallel mark-sweep collector of Domani et al collects threads independently. It stops all threads only if it is unable to allocate a large object or a fresh allocation buffer. They also allocate objects known to be always global (such as thread and class objects, or those identified as
copying semantics, and thread-local heaps can be collected independently. The costs of this design are that message passing is an O(n) operation (where n is the size of the message) and message data are replicated between processes.
Sagonas and Wilhelmsson add to this architecture a shared area for messages and one for binaries, in order to reduce the cost of message passing [Johansson et al, 2002; Sagonas and Wilhelmsson, 2004; Wilhelmsson, 2005; Sagonas and Wilhelmsson, 2006]. They impose the usual restrictions on pointer direction between the process-local areas and the shared message area. Their shared message area does not contain any cyclic data and the binaries do not contain references. A static message analysis guides allocation: data that is likely to be part of a message is allocated speculatively on the shared heap and otherwise in a
3 Not to be confused with reference lists used by distributed reference counting systems, where the target maintains a list of processes that it believes hold references to it.
memory architecture. In their case, the local/shared regions also served as the young/old generations of their collector. Their target was Concurrent Caml Light, ML with concurrency primitives. Unlike Erlang, ML does have mutable variables. In order to allow threads to collect their young generations independently, mutable objects are stored in the shared old generation. If a mutable object is updated to refer to an object in a thread-local young generation, then the transitive closure of the young object is copied to the old generation. As in the Erlang case, making two copies of the data structure is safe since young objects are guaranteed to be immutable. As well as copying the young objects, the collector updates a forwarding address in each object's header to refer to its shared replica. These addresses are used by subsequent thread-local, young generation collections; the mutator write barrier has done some of the collector's work for it. Note that the forwarding pointer must be stored in a reserved slot in the object's header, rather than written destructively over user data, since the young copy is still in use. This additional header word is stripped from the old generation copy, as it is not required by the shared heap's concurrent mark-sweep collector. While this additional word imposes a space overhead in the young generations, this overhead may be acceptable since young generation data will usually occupy a much smaller fraction of total heap size than old generation data.
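The following is a hedged C sketch of such a write barrier; isInSharedOld, isInLocalYoung and copyTransitively are invented stand-ins for the runtime's real operations, and the forward field models the reserved header slot described above.

    typedef struct Object Object;
    struct Object {
        Object *forward;   /* reserved header slot: shared replica, or NULL */
        /* ... user fields follow ... */
    };

    /* Assumed runtime tests and copier, not the actual
     * Concurrent Caml Light interface. */
    extern int     isInSharedOld(Object *o);
    extern int     isInLocalYoung(Object *o);
    extern Object *copyTransitively(Object *o);  /* copies o's closure to the
                                                    shared old generation and
                                                    sets forward pointers */

    /* Barrier applied when storing val into a field of obj. */
    void writeBarrier(Object *obj, Object **field, Object *val) {
        if (val != NULL && isInSharedOld(obj) && isInLocalYoung(val)) {
            if (val->forward == NULL)
                val = copyTransitively(val);  /* make a shared copy */
            else
                val = val->forward;           /* reuse an earlier copy */
        }
        *field = val;
    }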
Stack allocation
Several researchers have proposed allocating objects on the stack rather than in the heap, wherever possible. A wide variety of mechanisms have been suggested, but fewer have been implemented, especially in production systems. Stack allocation has some attractions. It potentially reduces the frequency of expensive garbage collection, and tracing or reference counting is unnecessary for stack allocated data. Thus, stack allocation should in theory be gentler on caches. On the down side, it may prolong the lifetime of objects allocated in frames that persist on the stack for a long time.
The key issue is to ensure that no stack allocated object is reachable from another object with a longer lifetime. This can be determined either conservatively, through an escape analysis (for example, [Blanchet, 1999; Gay and Steensgaard, 2000; Corry, 2006]), or by runtime escape detection using a write barrier. Baker [1992b] was the first to suggest (but not
bits are used to store the frame's depth in the stack. These bits are ignored by pointer loads but checked by stores: storing a reference to an object in a new frame into an object in an old frame causes a trap, which moves the object and fixes up references to it (which can only be held by objects in newer frames). The fixup is expensive, so it needs to be rare for stack allocation to be effective. If stack allocating an object would cause the frame to become too large, Azul place the object in an overflow area to the side. Azul find that they still need occasional thread-local collections to deal with dead stack allocated objects in long lived frames.
Overall, most of these schemes have either not been implemented, are reported with insufficient detail of comparative systems, or do not offer significant performance improvements. While it is likely that for many applications a large fraction of objects might be stack allocatable, most of these are likely to be short-lived. Azul find that over half of all objects may be stack allocated in large Java applications. However, this scenario is precisely the one in which generational garbage collection excels. It is not clear that stack allocation reduces memory management costs sufficiently to make it worthwhile. Another rationale for stack allocation is that it can reduce memory bandwidth by keeping these objects entirely in the cache, given a sufficiently large cache. One related strategy that is effective is scalar replacement or object inlining, whereby an object is replaced by local variables representing its fields [Dolby, 1997; Dolby and Chien, 1998, 2000; Gay and Steensgaard, 2000]. A common application of scalar replacement is for iterators in object-oriented programs.
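As an illustration of scalar replacement (all names here are invented, and a compiler would perform the transformation automatically), an object that provably never escapes can have its fields replaced by locals, eliminating the allocation entirely:

    /* Before: a cursor object with fields pos and limit. */
    typedef struct { int pos; int limit; } Cursor;

    int sumBoxed(const int *a, int n) {
        Cursor c = { 0, n };          /* imagine this were heap allocated */
        int sum = 0;
        while (c.pos < c.limit)
            sum += a[c.pos++];
        return sum;
    }

    /* After scalar replacement: the object's fields become locals. */
    int sumScalar(const int *a, int n) {
        int pos = 0, limit = n;       /* the replaced fields */
        int sum = 0;
        while (pos < limit)
            sum += a[pos++];
        return sum;
    }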
create and destroy regions or to indicate the region into which an object must be allocated. Possibly the best known explicit system is the Real-Time Specification for Java (RTSJ). In addition to the standard heap, the RTSJ provides an immortal region and scoped regions. The RTSJ enforces strict rules on pointer directionality: an object in an outer scoped region cannot refer to one in an inner scope.

Other region-based systems may relax the requirements on pointer direction, allowing regions to be reclaimed even if there are references into that region from other, live regions. To be safe, such systems require a guarantee that the mutator will never follow a dangling pointer into a deallocated region. These systems require compiler support, either for inferring the region to which an object should be allocated and when it is safe to reclaim the

programs (for example, a 58,000 line program took one and a half hours to compile). Tofte et al report that it was often best to restrict region inferencing to well understood coding patterns and to manage other parts of the program by garbage collection.
[Figure 10.5: the design space between passive and aggressive memory managers, locating mark-sweep and semispace copying collectors by evacuation threshold (0 to 100%) and allocation threshold.]
When considering how the volume of live objects in a block can be used to make evacuate-or-mark decisions, Spoonhower et al [2005] contrast an evacuation threshold (whether the block contains sufficiently little live data to make it a candidate for evacuation) with an allocation threshold (whether the block contains sufficient free space to be used for allocation). These thresholds determine when and how fragmentation is reduced. For example, a mark-sweep collector has an evacuation threshold of zero (it never copies) but an allocation threshold of 100% (it reuses all free space in a block), whereas a semispace copying collector has an evacuation threshold of 100% but an allocation threshold of zero (fromspace pages are not used for allocation until after the next collection); these two collectors are shown in Figure 10.5. Overly passive memory managers, with low evacuation and allocation thresholds, can suffer from fragmentation; overly aggressive managers, where both thresholds are high, have high overheads, either because they replicate data or because they require more passes to collect the heap.
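A minimal sketch of the evacuation-threshold test, assuming per-block live-byte counts from the previous trace (the names are illustrative, not Spoonhower et al's interface):

    #include <stdbool.h>
    #include <stddef.h>

    /* Decide whether a block is a candidate for evacuation, given the
     * live volume recorded by the last trace. A threshold of 0 copies
     * (at most) completely empty blocks, as mark-sweep does; a
     * threshold of 1.0 copies everything, as a semispace collector. */
    static bool candidateForEvacuation(size_t liveBytes, size_t blockSize,
                                       double evacThreshold /* 0.0..1.0 */) {
        double liveFraction = (double)liveBytes / (double)blockSize;
        return liveFraction <= evacThreshold;
    }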
The performance of a large or long running application may eventually suffer from fragmentation unless the heap is managed by a compacting collector. Unfortunately, compaction is likely to be expensive in time or space compared with non-moving collection. Lang and Dupont [1987] use copying to compact the heap incrementally, one region at a time. The heap is divided into k + 1 equally sized windows, one of which is empty. At collection time, some window is chosen to be the fromspace and the empty window is used as the tospace. All other windows are managed by a mark-sweep collector. As the collector traces the heap, objects are
evacuated to the tospace window if they are in the fromspace window; otherwise they are marked (see Figure 10.6). References in any window to fromspace objects must be updated with their tospace replicas.

[Figure 10.6: the heap before and after collection, showing the fromspace and tospace windows.]
By rotating the window chosen to be the fromspace through the heap, Lang and Dupont can compact the whole heap in k collections, at a space overhead of 1/k of the heap. Unlike

by a Cheney algorithm. At each tracing step, the collector can choose whether to take the next item from the mark-sweep or the copying work list: Lang and Dupont advocate preferring the mark-sweep collector in order to limit the size of its stack. There is also a locality argument here, since mark-sweep tends to have better locality than Cheney copying.
The Spoonhower et al [2005] collector for C# takes a more flexible approach. It uses block residency predictions to decide whether to process a block in place or to evacuate its contents to tospace. Predictions may be static (for example, large object space pages), use fixed evacuation thresholds (generational collectors assume few young objects survive) or dynamic ones (determined by tracing). Spoonhower et al use residency counts from the previous collection to determine whether to evacuate or mark objects in a block (blocks containing pinned objects are processed in place), in order not to need an extra pass at each collection. In a manner similar to Dimpsey et al [2000] (discussed below), they maintain a free-list of gaps and bump allocate into these.
Garbage-First
Garbage-First [Detlefs et al, 2004] is a sophisticated and complex incrementally compacting algorithm, designed to meet a soft real-time performance goal that collection should consume no more than x milliseconds of any y millisecond time slice. It was introduced in Sun Microsystems' HotSpot VM in JDK 7 as a longer term replacement to a concurrent
compaction) collector for IBM's server Java virtual machine, version 1.1.7. Like Sun's 1.1.5 collectors, the IBM server used thread-local allocation buffers.4 Small objects were bump-allocated within a buffer without synchronisation; synchronised allocation of buffers and other large objects (greater than 0.25 times the buffer size) was performed by searching a free-list. Dimpsey et al found that this architecture on its own led to poor performance. Although most large object requests were for local allocation buffers, free chunks that could not satisfy these requests tended to congregate at the start of the free-list, leading to very long searches. To address this, they introduced two further free-lists, one for objects of exactly local allocation buffer size (1.5 kilobytes plus header) and one for objects between 512 bytes and buffer size. Whenever the buffer list became empty, a large chunk was obtained from the large object list and split into many buffers. This optimisation substantially improved Java performance on uniprocessors, and even more so on multiprocessors.

The IBM server collector marked objects in a side bitmap. Sweeping traversed the bitmap, testing bits a byte or a word at a time. Dimpsey et al optimise their sweep by ignoring short sequences of unused space; a bit in the object header was used to distinguish a large object from a small one followed by garbage, and two tables were used to translate
Figure 10.7: Allocation in immix, showing blocks of lines. Immix uses bump pointer allocation within a partially empty block of small objects, lineCursor advancing to lineLimit, before moving on to the next group of unmarked lines. It acquires wholly empty blocks in which to bump-allocate medium-sized objects. Immix marks both objects and lines. Because a small object may span two lines (but no more), immix treats the line after any sequence of (explicitly) marked lines as implicitly marked: the allocator will not use it. Blackburn and McKinley [2008], doi:10.1145/1375581.1375586. © 2008 Association for Computing Machinery, Inc. Reprinted by permission.
The potential cost of this technique is that some free space is not returned to the allocator. However, objects tend to live and die together, and Dimpsey et al use this property to avoid compaction as much as possible. They follow the advice of Johnstone [1997] by using an address-ordered, first-fit allocator in order to increase the chance of creating holes in the heap large enough to be useful. Furthermore, they allow local allocation blocks to be of variable length. If the first item on the local allocation buffer free-list is smaller than a desired size T (they use six kilobytes), it is used as is (note that the item must be larger than the minimum size for inclusion in the free-list). If its size is between T and 2T, it is accepted and split into two evenly sized buffers. Otherwise, the block is split to yield a buffer of size T. Dimpsey et al also set aside 5% of the heap beyond the 'wilderness boundary' [Korn and Vo, 1985], to be used only if insufficient space is available after a collection.
Like the Dimpsey et al IBM server, the immix collector [Blackburn and McKinley, 2008] distinguishes medium sized objects, whose size is greater than a line, from small objects; most Java objects are small. Algorithm 10.1 shows the immix algorithm for small and medium sized objects. Immix preferentially allocates into empty line-sized gaps in partially filled blocks using a linear,
 1 alloc(size):
 2   addr ← sequentialAllocate(lines)
 3   if addr ≠ null
 4     return addr
 5   if size < LINE_SIZE
 6     return allocSlowHot(size)
 7   else
 8     return overflowAlloc(size)
 9
10 allocSlowHot(size):
11   lines ← getNextLineInBlock()
12   if lines = null
13     lines ← getNextRecyclableBlock()
14   if lines = null
15     lines ← getFreeBlock()
16   if lines = null
17     return null                      /* Out of memory */
18   return alloc(size)
19
20 overflowAlloc(size):
21   addr ← sequentialAllocate(block)
22   if addr ≠ null
23     return addr
24   block ← getFreeBlock()
25   if block = null
26     return null                      /* Out of memory */
27   return sequentialAllocate(block)
next-fit strategy. In the fast path, the allocator attempts to bump-allocate into the current contiguous sequence of free lines (line 2). If this fails, the search distinguishes between

the overwhelming proportion of allocation was into blocks that were either completely free or less than a quarter full. Note that allocation of both small and medium sized objects is into thread-local blocks; synchronisation is required only to acquire a fresh block (either partially filled or completely empty).
The immix collector marks both objects (to ensure correct termination of the scan) and lines; the authors call this 'mark-region'. A small object is by definition smaller than
a line, but it may still span two lines. Immix marks the second line implicitly (and conservatively): the line following any sequence of marked lines is skipped by the allocator (see Figure 10.7), even though, in the worst case, this might waste nearly a line in every gap. Blackburn and McKinley found that tracing performance was improved if a line was marked as an object was scanned, rather than when it was marked and added to the work list, since the more expensive scanning operation better hid the latency of line marking. Implicit marking improved the performance of the marker considerably. In contrast, medium sized objects are marked exactly (a bit in their header distinguishes small and medium objects).
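A sketch of line marking under these rules; the line size and mark table here are assumptions for illustration, not immix's actual data structures:

    #include <stddef.h>
    #include <stdint.h>

    #define LINE_SIZE 128          /* illustrative; a small power of two */
    extern uint8_t lineMark[];     /* assumed: one mark byte per line */

    static size_t lineOf(void *addr) {
        return (uintptr_t)addr / LINE_SIZE;
    }

    /* Small objects (at most a line, so spanning at most two lines):
     * mark only the first line; the line after any marked sequence is
     * treated as implicitly marked, so the allocator will not reuse it. */
    void markSmall(void *obj) {
        lineMark[lineOf(obj)] = 1;
    }

    /* Medium objects are marked exactly: every covered line is marked. */
    void markMedium(void *obj, size_t size) {
        size_t first = lineOf(obj);
        size_t last  = lineOf((char *)obj + size - 1);
        for (size_t line = first; line <= last; line++)
            lineMark[line] = 1;
    }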
Immix compacts opportunistically, depending on fragmentation statistics, and in the same pass as marking. These statistics are recorded at the end of each collection by the sweeper, which operates at the granularity of lines. Immix annotates each block with the number of gaps it contains and constructs histograms mapping the number of marked lines as a function of the number of gaps blocks contain. The collector selects the most fragmented blocks as candidates for compaction in the next collection cycle. As these statistics can provide only a guide, immix can stop compacting early if there is insufficient room to evacuate objects. In practice, compaction is rare for many benchmarks.
which is collected frequently, with any survivors being copied to the old generation. If the space remaining drops to a single block, a full heap collection is initiated.

Independent collection of each block requires a remembered set for each one, but this would complicate the generational write barrier, since it would have to record not only inter-generational pointers but also inter-block ones. Instead, Mark-Copy's first phase marks all live objects, and also constructs per-block unidirectional remembered sets and counts the volume of live data for each block. Two advantages arise from having the marker rather than the mutator construct the remembered sets: the remembered sets are precise (they contain only those slots that actually hold pointers from higher numbered to lower numbered blocks at the time of collection) and they do not contain any duplicates. Windows of consecutive blocks are evacuated one at a time, starting with the lowest numbered (to avoid the need for bidirectional remembered sets), copying live data to the free block. Because the marker has counted the volume of live data in each block, we can determine how many blocks can be evacuated in each pass. For example, the second pass in Figure 10.8 was able to evacuate a window of three blocks. At the end of each pass, the space consumed by the evacuated blocks is released (unmapped).
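A sketch of the marker's remembered-set duty, with invented helpers: because the marker scans each pointer slot exactly once, recording only pointers from higher- to lower-numbered blocks yields precise, duplicate-free, unidirectional sets.

    #include <stddef.h>

    /* Assumed runtime helpers, not Mark-Copy's actual interface. */
    extern size_t blockOf(void *addr);             /* block number */
    extern void   remember(size_t block, void **slot);

    /* Called once per pointer slot as the marker scans a marked object. */
    void scanSlot(void **slot) {
        void *target = *slot;
        if (target == NULL)
            return;
        size_t src = blockOf((void *)slot);
        size_t dst = blockOf(target);
        if (src > dst)
            remember(dst, slot);   /* higher-to-lower pointers only */
    }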
[Figure 10.8: Mark-Copy. (b) After the first copying pass: B has been evacuated and the first block has been unmapped. (c) After the second copying pass: note that there was sufficient room to evacuate three blocks.]
The MC2 collector [Sachindran et al, 2004] relaxes Mark-Copy's requirement for blocks to occupy contiguous locations by numbering blocks logically rather than by their (virtual) address. This has several advantages. It removes the need for blocks to be remapped at the end of each pass (and hence the risk of running out of virtual address space in a 32-bit environment). It also allows blocks to be evacuated logically, simply by changing their block number, which is useful if the volume of live data in the block is sufficiently high to outweigh the benefit of copying and compacting it. Numbering the blocks logically also allows the order of collection of blocks to be modified at collection time. Unlike Mark-Copy, MC2 spreads the passes required to copy old generation blocks over multiple nursery collections; it also marks the old generation incrementally, using a Steele insertion barrier (we discuss incremental marking in Chapter 15). Because of its incrementality, it starts collecting the old generation somewhat before space runs out, and adaptively
adjusts the amount of work it does in each increment to try to avoid a large pause that might occur if space runs out. Like other approaches discussed in this chapter, MC2 segregates popular objects into a special block for which it does not maintain a remembered set (thus treating them as immortal, although this decision can be reverted). Furthermore, in order to bound the size of remembered sets, it also coarsens the largest ones by converting them from sequential store buffers to card tables (we explain these techniques in Chapter 11). Large arrays are also managed by card tables, in this case by allocating space for their own table at the end of each array. Through careful tuning of its combination of techniques, MC2 achieves high space utilisation, high throughput, and well-balanced pauses.
memory manager which always evicts the least recently used page. Outside collection time, the page chosen will always be an as yet unused, but soon to be occupied, tospace page. Indeed, if most objects are short-lived, it is quite likely that the least recently used page will be the very next one to be used by the allocator: the worst possible paging scenario from its point of view! A fromspace page would be a much better choice: not only will it not be accessed (and hence reloaded) until the next collection, but its contents do not need to be written out to the backing store.
The Bookmarking collector can complete a garbage collection trace without faulting in non-resident pages. The trace conservatively assumes that all objects on a non-resident page are live, but it also needs to locate any objects reachable from that page. To support this, if a live page has to be scheduled for eviction, the run-time system scans it, looking for outgoing references, and 'bookmarks' their targets. When this page is reloaded, its bookmarks are removed. These bookmarks are used at collection time to propagate the trace.
The virtual memory manager is modified to send a signal whenever a page is scheduled for eviction. The Bookmarking collector always attempts to choose an empty page. If this is not possible, it calls the collector and then selects a newly emptied page. This choice can be communicated to the virtual memory manager through a system call, for example madvise with the MADV_DONTNEED flag. Thus Bookmarking attempts to shrink the heap to avoid page faults. It never selects pages in the nursery or those containing its metadata. If Bookmarking cannot find an empty page, it chooses a victim (often the scheduled page) and scans it for outgoing references, setting a bit in their targets' headers. Hertz et al extend the Linux kernel with a new system call allowing user processes to surrender a list of pages.
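A sketch of the eviction hook, assuming invented helpers rather than the Bookmarking collector's actual interface:

    #include <stddef.h>

    /* Assumed helpers; the real system integrates with a modified kernel. */
    extern int    isHeapPointer(void *p);
    extern void   setBookmark(void *obj);            /* bit in target's header */
    extern void **firstSlot(void *page);
    extern void **nextSlot(void **slot, void *page); /* NULL at page end */

    /* Invoked on the eviction signal for a page the collector could not
     * replace with an empty one. */
    void onEviction(void *page) {
        for (void **slot = firstSlot(page); slot != NULL;
             slot = nextSlot(slot, page)) {
            void *target = *slot;
            if (target != NULL && isHeapPointer(target))
                setBookmark(target);   /* target kept live conservatively */
        }
    }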
If the whole heap is not memory resident, full heap collections start by scanning the heap for bookmarked objects, which are added to the collector's work list. While this is expensive, it is cheaper in a small heap than a single page fault. Occasionally it is necessary
to compact the old generation. The marking phase counts the number of live objects of each size class and selects the minimum set of pages needed to hold them. A Cheney pass then moves objects to these pages (objects on the target page are not moved). Bookmarked objects are never moved, in order to avoid having to update pointers held in non-resident pages.
Young objects are allocated and die at very high rates; they are also mutated frequently (for example, to initialise them) [Stefanovic, 1999]. Evacuation is an effective technique for such objects, since it allows fast bump pointer allocation and needs to copy only live data, little of which is expected. Modern applications require increasingly large heaps and live sets. Long lived objects tend to have lower mortality and update rates. All these factors are inimical to tracing collection: its cost is proportional to the volume of live data, and it is undesirable to trace long lived data repeatedly. On the other hand, reference counting is well suited to such behaviour, as its cost is simply proportional to the rate at which objects are mutated. Blackburn and McKinley [2003] argue that each space, young and old, should be managed by a policy appropriate to its size, and to the expected lifetimes and mutation rate of the objects that it contains.
count of each reference counted child in the mutation log; any targets in the nursery are marked as live and added to the nursery collector's work list. As surviving young objects are promoted and scavenged, the collector increments the reference counts of their targets. As with many other implementations of deferred reference counting, the counts of objects directly reachable from the roots are also incremented temporarily during collection. All the buffered increments are applied before the buffered decrements. Cyclic garbage is handled by the Recycler algorithm [Bacon and Rajan, 2001]. However, rather than invoking it at each collection on all those decremented objects whose count did not reach zero, Blackburn and McKinley trigger cycle detection only if the available heap space falls below a user-defined limit.
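A sketch of processing the buffers at collection time, with invented buffer names standing in for Blackburn and McKinley's actual structures: increments are applied before decrements, and nursery targets are grafted into the copying collector's work list.

    #include <stddef.h>

    typedef struct Object Object;

    /* Assumed buffers and helpers, not the Ulterior collector's API. */
    extern Object **incBuffer; extern size_t nInc;  /* buffered increments */
    extern Object **decBuffer; extern size_t nDec;  /* buffered decrements */
    extern int  inNursery(Object *o);
    extern void markLiveAndEnqueue(Object *o);      /* nursery work list */
    extern void rcIncrement(Object *o);
    extern void rcDecrement(Object *o);             /* may free at zero */

    void processMutationLog(void) {
        /* Apply all increments before any decrement so that counts
         * never drop to zero prematurely. */
        for (size_t i = 0; i < nInc; i++) {
            Object *o = incBuffer[i];
            if (inNursery(o))
                markLiveAndEnqueue(o);   /* young target: keep and copy */
            else
                rcIncrement(o);
        }
        nInc = 0;
        for (size_t i = 0; i < nDec; i++)
            rcDecrement(decBuffer[i]);
        nDec = 0;
    }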
5 In contrast, the write barrier of Levanoni and Petrank [2001] records a snapshot of the mutated object (see Chapter 5).
objects in the heap. We partition the set of objects in the heap so that we can manage different partitions or spaces under different policies and with different mechanisms: the policies and mechanisms adopted will be those most appropriate to the properties of the objects in the space. Partitioning by physical segregation can have a number of benefits, including fast address-based space membership tests, increased locality, selective defragmentation and reduced management costs.
representing images) from large arrays of pointers: it is not necessary to trace the former and, if they are marked in a separate bitmap, it is never necessary for the collector to access them, thus avoiding page and cache faults.
Partitioning can also be used to allow the heap to be collected incrementally, rather than as a whole. Here, we mean that the collector can choose to collect only a subset of the spaces in the heap, in the same way that generational collectors preferentially collect only the nursery generation. The benefit is the same: the collector has a more tightly bounded amount of work to do in any single cycle and hence is less intrusive.

One approach is to partition the graph by its topology or by the way in which the mutator accesses objects. One reason for doing this is to ensure that large pointer structures are eventually placed in a single partition that can be collected on its own. Unless this is done, garbage cyclic structures that span partitions can never be reclaimed by collecting a

time, once it is known that all the objects in a region are dead. Placement can either be done explicitly, as for example by the Real-Time Specification for Java, or automatically, guided by a region inferencing algorithm [Tofte et al, 2004].
Pointer analyses have also been used to partition objects into heaplets that are never accessed by more than one thread [Steensgaard, 2000; Jones and King, 2005]. These heaplets can then be collected independently and without stopping other threads. Blackburn and McKinley [2003] exploit the observation that mutators are likely to modify young objects more frequently than old ones. Their Ulterior collector thus manages young objects by copying and old ones by reference counting. High mutation rates do not impose any overhead on copying collection, which is also well suited to spaces with high mortality rates. Reference counting is well suited to very large, stable spaces which would be expensive to trace.
Another common approach is to divide the heap into spaces and apply a different collection policy to each space, chosen dynamically [Lang and Dupont, 1987; Detlefs et al, 2004; Blackburn and McKinley, 2008]. The usual reason for this is to allow the heap to be defragmented incrementally, thus spreading the cost of defragmentation over several collection cycles. At each collection, one or more regions are chosen for defragmentation; typically their survivors are evacuated to another space, whereas objects in other spaces are marked in place. Copying live data space by space also reduces the amount of space

does so block by block in order to limit the space overhead to a single block. Its successor, MC2 [Sachindran et al, 2004], offers greater incrementality, working to achieve good utilisation of available memory and CPU resources while also avoiding large or clustered pauses.
Chapter 11
Run-time interface
The heart of an automatic memory management system is the collector and allocator, their algorithms and data structures, but these are of little use without suitable means to access them from a program, or if they themselves cannot appropriately access the underlying platform. Furthermore, some algorithms impose requirements on the programming language implementation, for example to provide certain information or to enforce particular invariants. The interfaces between the collector (and allocator) and the rest of the system, both the language and compiler above and the operating system and libraries beneath, are the focus of this chapter.

We consider in turn allocating new objects; finding and adjusting pointers in objects, global areas and stacks; actions when accessing or updating pointers or objects (barriers); synchronisation between mutators and the collector; managing address space; and using virtual memory.
the memory manager, or both. For Java objects this might include space for a hash code or synchronisation information, and for Java arrays we clearly need to record their length somewhere.

3. Secondary initialisation. By this we mean to set (or update) fields of the new object after the new object reference has 'escaped' from the allocation subsystem and has become potentially visible to the rest of the program, other threads and so on.
Consider the three example languages again.

• C: All the work happens in Step 1; the language neither requires nor offers any system or secondary initialisation: the programmer does all the work (or fails to). Notice, though, that allocation may include setting up or modifying a header, outside of the cell returned, used to assist in freeing the object later.

• Java: Steps 1 and 2 together provide an object whose method dispatch vector, hash code and synchronisation information are initialised, and all fields set to a default value (typically all zeroes). For arrays, the length field is also filled in. At this point the object is type safe but 'blank'. This is what the new bytecode returns. Step 3 in Java happens in code provided inside a constructor or static initialiser, or even afterwards, to set fields to non-zero values. Even initialisation of final fields happens in Step 3, so it can be tricky to ensure that other threads do not see those fields change if the object is made public too soon.

• Haskell: The programmer provides the constructor with values for all fields of the requested object, and the compiler and memory manager together guarantee complete initialisation before the new object becomes available to the program. That is,
Steps 1 and 2, calling the collector if memory is exhausted. It uses sequential allocation, so

respect to possible garbage collection, it allows the initialising stores to avoid some write
The kind of object to allocate. For example, managed run-time languages such as Java typically distinguish between array and non-array objects. Some systems distinguish between objects that contain no pointers and ones that may contain pointers [Boehm and Weiser, 1988]; objects containing executable code may also be special. In short, any distinction that requires attention by the allocator needs to appear at the interface.

• The referenced cell has the requested size and alignment, but is not otherwise prepared for use.
• Beyond having correct size and alignment, the cell is zeroed. Zeroing helps to guarantee that the program cannot treat old pointers, or non-pointer bit patterns for that matter, as valid references. Zero is a good value because it typically represents the null pointer and is otherwise a bland and legal value for most types. Some languages, such as Java, require zeroing or something similar for their security and

memory has a specific non-zero bit pattern, such as 0xdeadbeef or 0xcafebabe, which are values we have actually seen.
• The allocated cell appears to be an object of the requested type. This is a case where we present the type to the allocator. The difference between this and the weakest post-condition (the first one in this list) is that the allocator fills in the object header.

• The allocator guarantees a fully type-safe object of the requested type. This involves both zeroing and filling in the object header. This is not quite the same as a fully initialised object, in that zeroing provides a safe, but bland, default value, while a program will generally initialise at least one field to a non-default value.

• The allocator guarantees a fully initialised object. This may be less common, since the interface must provide for passing the initial value(s). A good example is the cons function in Lisp, which we might provide as a separate allocation function because calls to it are so common and need to be fast and simple from the program's side.
might inline the common (successful) case and call a collect-and-retry function out of line. Of course, if we inline Step 1, then there remains little distinction between Steps 1 and 2: the overall code sequence must be effectively atomic. Later on we discuss handshaking between mutators and collectors, so as to achieve such atomicity. We note that for purposes of atomicity it is generally more appropriate to view allocation as a mutator activity.
Speeding allocation
Since many systems and applications tend to allocate at a high rate relative to the rest of their computation, it is important to tune allocation to be fast. A key technique is to inline the common case code (the 'fast path') and call out to 'slow path' code that handles the rarer, more complex cases. Making good choices here requires careful comparative measurements under suitable workloads.

An apparent virtue of sequential allocation is its simplicity, which leads to a short code

the result register; add-immediate the needed size to the bump pointer; compare the bump
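A minimal C sketch of that inlined fast path, under invented names, with the collect-and-retry slow path out of line:

    #include <stddef.h>

    typedef struct {
        char *bump;    /* next free byte */
        char *limit;   /* end of the current allocation region */
    } Allocator;

    void *allocSlowPath(Allocator *a, size_t size);  /* collect and retry */

    /* The inlined common case: bump, compare against the limit, and
     * fall into the slow path only on overflow. */
    static inline void *alloc(Allocator *a, size_t size) {
        char *result  = a->bump;
        char *newBump = result + size;     /* add-immediate for fixed sizes */
        if (newBump > a->limit)
            return allocSlowPath(a, size); /* out-of-line slow path */
        a->bump = newBump;
        return result;
    }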
Zeroing
Some system designs require that free space contain a distinguished value, often zero, for

it, but experience suggests that bulk zeroing is more efficient. Also, zeroing with explicit memory writes at that time may cause a number of cache misses, and on some architectures, reads may block until the zeroing writes drain from a hardware write buffer/store queue. Some ML implementations, and also Sun's HotSpot Java virtual machine, prefetch ahead of the (optimised) bump pointer precisely to try to hide the latency of fetching newly allocated words into the cache [Appel, 1994; Gonçalves and Appel, 1995]. Modern processors may also detect this pattern and perform the prefetching in hardware. Diwan et al [1994] found that write-allocate caches that can allocate on a per-word basis offered the best performance, but these do not seem to be common in practice.

From the standpoint of writing an allocator, it is often best to zero whole chunks using a call to a library routine such as bzero. These routines are typically well optimised for the target system, and may even use special instructions that zero directly in the cache without fetching from memory, such as dcbz (Data Cache Block Zero) on the PowerPC. Notice that direct use of such instructions may be tricky, since the cache line size is a model-specific parameter. In any case, a system is likely to obtain best performance if it zeroes

disadvantage of dirtying memory long before it will be used. Such freshly zeroed words will likely be flushed from the cache, causing write-backs, and then will need to be reloaded
during allocation. Anecdotal experience suggests the best time to zero, from the standpoint of performance, is somewhat ahead of the allocator, so that the processor has time to fetch the words into the cache before the allocator reads or writes them, but not so far ahead of the allocator that the zeroed words are likely to be flushed. Given modern cache miss times, it is not clear that the prefetching technique that Appel described will work; at least it may need tuning to determine the proper distance ahead of the allocator that we should prefetch. For purposes of debugging, zeroing or writing a special value into memory should be done as soon as we free cells, to maximise the range of time during which we will catch errors.
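A sketch of both policies, using the standard memset (bzero is the traditional name); the distance constant is illustrative only:

    #include <string.h>

    #define ZERO_AHEAD (4 * 1024)   /* illustrative distance ahead of bump */

    /* Keep a window of zeroed memory a little ahead of the allocator:
     * far enough that zeroing writes have drained, near enough that
     * the lines are still cached when allocation reaches them. */
    void zeroAhead(char *bump, char **zeroedTo, char *limit) {
        char *target = bump + ZERO_AHEAD;
        if (target > limit) target = limit;
        if (target > *zeroedTo) {
            memset(*zeroedTo, 0, (size_t)(target - *zeroedTo));
            *zeroedTo = target;
        }
    }

    /* Debug builds: poison cells as soon as they are freed, to catch
     * use-after-free over the longest possible window. */
    void debugFree(void *cell, size_t size) {
        memset(cell, 0xDE, size);   /* distinctive non-zero pattern */
    }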
techniques for conservative pointer finding, and then ones for accurately finding pointers
in various locations.
5. Consult the object map for blocks of this size; has the slot corresponding to this object in this block been allocated?
collectors for C may need to retain two objects in that case, or else over-allocate arrays by one word to avoid possible ambiguity. An explicit-free system may interpose a header between objects, which also solves the problem. In the presence of compiler optimisations, pointers may be even further 'mangled'; see page 183 for a discussion of this topic.
Since a non-pointer bit pattern may cause the collector to retain an object that is in fact not reachable, Boehm [1993] devised a mechanism called black-listing, which tries to avoid using regions of virtual address space as heap when their addresses correspond to these kinds of non-pointer values. In particular, if the collector encounters a possible pointer that refers to memory in a non-allocated block, it black-lists the block, meaning it will not allocate the block. Were it to allocate the block (and an object at that address), future traces would mistakenly recognise the false pointer as a true pointer. The collector also supports blocks used for strictly non-pointer objects, such as bitmaps. Distinguishing this data not only speeds the collector (since it does not need to scan the contents of these objects), but it also prevents excessive black-listing that can result from the bit patterns of the non-pointer data. The collector further refines its black-listing by discriminating between invalid pointers that may be interior, and those that cannot be interior because they are from the heap in the configuration that disallows heap-stored interior pointers. In the possibly-interior case, the referenced block is black-listed from any use, while in the other case the collector allows the block to be used for small non-pointer objects (this cannot cause much waste). To initialise the black-list, the collector does a collection immediately before the allocation. It also avoids using blocks whose first heap address ends in many zeroes, since non-pointer data in the stack often results in such values.
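A minimal sketch of the black-listing test, with a hypothetical block table; Boehm's actual implementation differs in detail:

    #include <stdbool.h>
    #include <stdint.h>

    /* Assumed per-block metadata. */
    typedef struct {
        bool allocated;
        bool blacklisted;
    } BlockInfo;

    extern BlockInfo *blockInfo(uintptr_t addr);  /* NULL if outside heap */

    /* Called on each ambiguous value found while scanning roots. */
    void considerAmbiguousValue(uintptr_t value) {
        BlockInfo *b = blockInfo(value);
        if (b != NULL && !b->allocated)
            b->blacklisted = true;   /* never place objects here */
    }

    bool mayAllocateIn(const BlockInfo *b) {
        return !b->blacklisted;
    }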
example, on a byte-addressed machine with a word size of four bytes, we might steal two bits for tags. We force objects to start on a word boundary, so pointers always have their low two bits zero. We choose some other value(s) to indicate (say) integers. Supposing that we give integers word values with a low bit of one, we end up with 31-bit integers; bit-stealing in this way does reduce the range of numbers we can represent easily. We might use a pattern of 10 in the low bits to indicate the start of an object in the heap, for parsability (Section 7.6). Table 11.1 illustrates the sample tag encoding, which is similar to one used in actual Smalltalk implementations.
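A sketch of working with the encoding of Table 11.1 in C (the helper names are invented); note that the arithmetic right shift used to untag relies on common, but not universal, compiler behaviour:

    #include <stdint.h>
    #include <assert.h>

    /* Tag scheme of Table 11.1: low bits 00 = pointer,
     * 10 = object header, x1 = 31-bit integer. */
    static inline int isPointer(uint32_t v) { return (v & 3u) == 0; }
    static inline int isInteger(uint32_t v) { return (v & 1u) == 1; }

    static inline uint32_t tagInt(int32_t n) {
        return ((uint32_t)n << 1) | 1u;
    }

    static inline int32_t untagInt(uint32_t v) {
        assert(isInteger(v));
        return (int32_t)v >> 1;   /* arithmetic shift restores the sign
                                     on mainstream compilers */
    }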
Dealing with tagged integers efficiently is a bit of a challenge, though arguably the

subtracting tagged integers. These instructions indicate overflow, and there are versions that trap as well, on overflow of the operation or if either operand's two lowest bits are not zero. For this architecture we might use the tag encoding shown in Table 11.2. This encoding does require that we adjust references made from pointers, though in most cases that
2 In either case it allows interior pointers, but in the more restrictive case it requires that any reachable object have a reachable pointer that is not interior. Thus, in that configuration, it ignores interior pointers when marking.
Tag   Encoded value
00    Pointer
10    Object header
x1    Integer

Table 11.1: An example of pointer tag encoding

Tag   Encoded value
00    Integer
01    Pointer
10    Other primitive value
11    Object header

Table 11.2: An example of tag encoding for an architecture with tagged add and subtract instructions
numeric and other primitive values have their full native length. This tagging approach dedicates whole blocks to hold integers, other blocks to floating point numbers, and so on. Since these are pure values and do not change,3 when allocating new ones we might use hashing to avoid making new copies of the values already in the table. This technique, also called hash consing (from the Lisp cons function for allocating new pairs), is quite venerable [Ershov, 1958; Goto, 1974]. In hash consing Lisp pairs, the allocator maintains a hash table of immutable pairs and can avoid allocating a new pair if the requested pair is already in the table. This extends in the obvious way to any immutable heap-allocated objects, such as those of class Integer in Java. Notice that this is a case where it might be good to use weak references (Section 12.2) from the hash table to the objects it contains.
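A sketch of hash consing for immutable pairs, with an invented table interface: the allocator probes the table first and allocates only on a miss.

    #include <stddef.h>

    typedef struct Pair { void *head; void *tail; } Pair;

    /* Assumed hash table of previously allocated immutable pairs; a
     * real system might hold these entries weakly (Section 12.2). */
    extern Pair *tableLookup(void *head, void *tail);
    extern void  tableInsert(Pair *p);
    extern Pair *allocPair(void);              /* ordinary allocation */

    Pair *cons(void *head, void *tail) {
        Pair *existing = tableLookup(head, tail);
        if (existing != NULL)
            return existing;        /* share the structurally equal pair */
        Pair *p = allocPair();
        p->head = head;
        p->tail = tail;
        tableInsert(p);
        return p;
    }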
3 This is a property of the representational approach, not of the language: in using this form of tagging, the designer made a choice to represent integers (floats, and so on) as tagged pointers to their full (untagged) values.
vector. Thus the collector, or any other part of the run-time that uses type information (such as the reflection mechanism in Java), can find the type information quite readily. What the collector needs is a table that indicates where pointer fields lie in objects of the given type. Two typical organisations are a bit vector, similar to a bit table of mark bits, and a vector of offsets of pointer fields. Huang et al [2004] used a vector of offsets to particular advantage by permuting the order of the entries to obtain different tracing orders, and thus different orders of objects in a copying collector, improving cache performance. With care, they did this while the system was running (in a stop-the-world collector).
A way to identify pointers in objects that is simpler in some respects than using a table is to partition the pointer and non-pointer data. This is straightforward for some languages and system designs4 but problematic for others. For example, in ML objects can be polymorphic. If the system generates a single piece of code for all polymorphic versions, and the objects need to use the same field for a pointer in some cases and a non-pointer in others, then segregation fails. In object-oriented systems that desire to apply superclass code to subclass objects, fields added in subclasses need to come after those of superclasses, again leading to mixing of pointer and non-pointer fields. One way around that is to place pointer fields in one direction from the reference point in the object (say at negative offsets) and non-pointer fields in the other direction (positive offsets), which has been called bidirectional object layout. On byte-addressed machines with word-aligned objects, the system can maintain heap parsability by insuring that the first header word has its low bit set
each type of closure. Bartlett [1989a] applied the idea of methods for collection to C++ by requiring the user to write a pointer-enumerating method for each collected C++ class.

A managed language can use object-oriented indirect function calls in other ways related to collection. In particular, Cheadle et al [2008] dynamically change an object's function pointer so as to offer a self-erasing read barrier in a copying collector, similar to the approach Cheadle et al [2000] used for the Glasgow Haskell Compiler (GHC). That system also used a version of stack barriers, implemented in a similar way, and it used the same trick again to provide a generational write barrier when updating thunks. A fine point of systems that update closure environments is that, since they can shrink an existing object, in order to maintain heap parsability they may need to insert a 'fake' object in the heap after the one that shrank. Conversely, they may also need to expand an object: here the old version is overwritten with an indirection node holding a reference to the new version. Later collections can short-circuit the indirection node. Collectors can also perform other computation on behalf of the mutator, such as eager evaluation of applications of 'well-known' functions to arguments already partially evaluated: a common example is the function that returns the head of a list.
4 Bartlett [1989b] takes this approach for a Scheme implementation done by translating to C, and Cheadle et al [2000] take this approach in Non-Stop Haskell.
objects. For example, Smalltalk, some Lisp and some Java systems start with a base system 'image', also called the boot image, that includes a number of classes/functions and instances, particularly if they start with an interactive programming environment. A running program might modify parts of the system image, usually tables of one kind or another, causing image objects to refer to newer objects. A system might therefore treat pointer fields in the image as roots. Notice, though, that image objects can become garbage, so it may be a good idea sometimes to trace through the image to find what actually remains reachable. This is all tied into whether we are using generational collection, in which case we may treat the image as a particularly old generation.
One way to deal with call stacks is to heap allocate activation records, as advocated by Appel [1987], for example. See also [Appel and Shao, 1994, 1996] and a counter-argument by Miller and Rozas [1994]. Some language implementations manage to make stack frames look like heap objects and thus kill two birds with one stone. Examples include the Glasgow Haskell Compiler [Cheadle et al, 2000] and Non-Stop Haskell [Cheadle et al, 2004]. It is also possible to give the collector specific guidance about the contents of the stack, for example as Henderson [2002] does with custom-generated C code for implementing the Mercury language, and which Baker et al [2009] improved upon for a real-time Java implementation.
However, most languages give stack frames special treatment because of the need for a variety of efficiencies in order to obtain best performance. There are three issues we consider:

very cleaned up from the typically more optimised and 'raw' layout in the actual frames. Because stack parsing is generally useful, frame layout conventions generally provide for it. For example, many designs include a dynamic chain field in each frame, which points
to the previous frame. Various other fields generally lie at fixed offsets from the reference point of the frame (the address to which the frame pointer or dynamic chain refers). These might include the return address, the static chain and so on. Systems also generally provide a map to determine, from a return address, the function within which the address lies. In non-collected systems this might occur only in debugger symbol tables, but many managed systems access this table from the program, so it may be part of the loaded or generated information about code, rather than just in auxiliary debugger tables.

To find pointers within a frame, a system might explicitly add stack map information to each frame to help the collector. This metadata might consist of a bitmap indicating which frame fields contain pointers, or the system might partition a frame into pointer-containing and non-pointer portions, with metadata giving the size of each. Notice that there are likely to be some initial instructions of each function during which the new frame exists but is not yet entirely initialised. Collecting during this time might be problematic; see our later discussion of garbage collection safe points and mutator handshaking in Section 11.6. Alternatively, we might get by with careful collector analysis of the initial code sequence, with careful use of push instructions on a machine that supports them, or some other custom-designed approach. Obviously frame scanning is simpler if the compiler uses any given frame field always as a pointer or always as a non-pointer. That way the whole function needs only one map.
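A sketch of scanning one frame with such a per-function bitmap; the metadata layout and helpers are assumptions, not any particular system's format:

    #include <stdint.h>
    #include <stddef.h>

    /* Assumed frame metadata, found from the return address. */
    typedef struct {
        uint32_t pointerBits;   /* bit i set: frame slot i holds a pointer */
        size_t   nSlots;        /* number of mapped slots (<= 32 here) */
    } StackMap;

    extern StackMap *mapForReturnAddress(void *retAddr);
    extern void      processPointerSlot(void **slot);  /* collector callback */

    void scanFrame(void **frameBase, void *retAddr) {
        StackMap *map = mapForReturnAddress(retAddr);
        for (size_t i = 0; i < map->nSlots; i++)
            if (map->pointerBits & ((uint32_t)1 << i))
                processPointerSlot(&frameBase[i]);  /* may update the slot */
    }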
However, the single-map approach is not always possible. For example, at least two language features make it difficult:

• Generic/polymorphic functions.

• The Java Virtual Machine jsr instruction.

We previously observed that a polymorphic function may use the same code for pointer and non-pointer arguments. Since a straightforward stack map cannot distinguish the cases, the system needs some additional source of information. Fortunately the caller 'knows' more about the specific call, but it too may be a polymorphic function. So the caller may need to 'pass the buck' to its caller. However, this is guaranteed to bottom out, at the main function invocation in the worst case. The situation is analogous to typing objects from roots [Appel, 1989b; Goldberg, 1991; Goldberg and Gloger, 1992].
In the Java Virtual Machine, the jsr instruction performs a local call, which does not create a new frame but rather has access to the same local variables as the caller. It was designed to be used to implement the try-finally feature of the Java language, using a single piece of code to implement the finally block by calling it using jsr in both the normal and the exceptional case. The problem is that during the jsr call, some local variables' types are ambiguous, in the sense that, depending on which jsr called the finally block, a particular variable, not used in the finally block but used later, might contain a pointer from one call site and a non-pointer from another. There are two solution approaches to this problem. One is to refer these cases to the calling site for disambiguation. In this approach, rather than have each stack map entry be just 'pointer' or 'non-pointer' (that is, a single bit), we need an additional case that means 'refer to jsr caller'. In addition, we need to be able to find the jsr return address, which requires some analysis of the Java bytecode to track where it stored that value. An alternative, more popular in modern
of subtle bugs, so managing system complexity here may be important. We note that some
systems defer generating a stack map until the collector needs it, saving space and time in
the normal case but perhaps increasing collector pause time.
Another reason that a system might choose not to use a single map per frame is that it further restricts the register allocator: it must use a given register consistently as a pointer or non-pointer. This is particularly undesirable on machines that have few registers in the first place.

Notice that whether we have one map per function, or different ones for different parts of a function, the compiler must propagate type information far through the back end. This may not be overly difficult if we understand the requirement before we write the compiler, but revising existing compilers to do it can be quite difficult.
Finding pointers in registers. To this point we have ignored the issue of pointers in machine registers. There are several reasons why handling registers is more difficult than

• As we pointed out previously, even if each stack frame field is fixed as a pointer or a non-pointer for a whole function, it is less convenient to impose that rule on registers, or to be even further restrictive and require that pointers, and only pointers, reside in a particular subset of the registers. It is probably practical only on machines that provide a large number of registers. Thus most systems will have more than one

• Calling conventions often provide that some registers follow a caller-save protocol, in which the caller must save and restore a register if it wants the value to survive across a call, and that some other registers follow a callee-save protocol, in which the callee must save and restore a register, on behalf of callers deeper in the stack, before the callee can use the register. Caller-save registers are not a problem, since the caller knows what value is in them, but callee-save registers have contents known only to some caller up the stack (if any). Thus a callee cannot indicate in a register map whether or not an unsaved callee-save register contains a pointer. Likewise, if a callee saves a callee-save register to a frame field, the callee cannot say whether that field contains a pointer.
from our side memory the value that the callee had in the register. Once we have done this for all callee-save registers saved by the callee, we produce pointers for the callee, and allow the collector to update them as necessary. However, we should skip any registers whose contents we processed in the caller, to avoid processing them a second time. In some collectors, processing the same root more than once is not harmful; mark-sweep is an example, since marking twice is not a problem. However, in a copying collector it is natural to assume that any unforwarded referent is in fromspace. If the collector processes the same root twice (not two different roots referring to the same object), then it would make an extra copy of the tospace copy of the object, which would be bad.
We offer details of this process in Algorithm 11.1, and now proceed to describe the example illustrated in Figure 11.2. In the algorithm, func is the function applied to each frame slot and register, for example the body of the for each loop in markFromRoots of Algorithm 2.2 (Mark-Sweep, Mark-Compact) or the body of the root scanning loop in collect of Algorithm 4.2 (Copying).

Considering Figure 11.2a, notice first the call stack, which appears shaded on the right. The sequence of actions leading to that stack is as follows.
2. Function f saved the return address, saved r2 in slot 1 and r1 in slot 2, and set local 3 to -13 and local 4 to refer to object q. It then called g with r1 containing a reference to object r, r2 holding 17 and a return address of f+178.

3. Function g saved the return address, saved r2 in slot 1, and set local 2 to refer to object r, local 3 to hold -7 and local 4 to refer to object s.
suspended execution. These are the values that our unwinding procedure should attempt to recover. We now assume that a garbage collection occurs in the middle of g.

being stored in a suspended thread data structure, or perhaps in an actual frame for the garbage collection routine.

5. Here processStack has retrieved registers from the thread state into Regs and initialised Restore. Execution is at line 15 in Algorithm 11.1 for the frame for g.
 1 processStack(thread, func):
 2   Regs ← getRegisters(thread)        /* register contents thread would see */
 3   Done ← empty                       /* no registers processed yet */
 4   Top ← topFrame(thread)
 5   processFrame(Top, Regs, Done, func)
 6   setRegisters(thread, Regs)         /* get corrected register contents back to thread */
 7

12   if Caller ≠ null
13     Restore ← empty                  /* holds info to restore after doing caller */
14

28   remove(Done, reg)
29
6. The ⟨register, slot⟩ pairs returned by calleeSavedRegs for g's IP value appear in a box to the left of g's frame.

7. Execution is at line 19 for f's frame. We 'un-saved' both r1 and r2 in this case, from slots 2 and 1 respectively.
[Figure 11.2: Stack scanning example. The frames for main(), f() and g() are shown shaded, each with its old IP, saved slots and locals; beside them appear Restore, Regs, Done and the calleeSavedRegs pairs at each step. 'GC happens' marks the suspension at IP = g+36 with r1 = r and r2 = t, and then the state after line 38 is shown. The frames themselves show the state at line 35. Values that are written, though not necessarily changed, are in boldface; those not written are grey.]
9. Regs holds the register values at the point main called f; as yet, Done is empty.
[Figure 11.2 (continued): the unwinding proceeds back down the stack; updated references p′, r′, s′ and t′ are installed in Regs, the frame slots and Done as each frame is processed.]
10. Register r1 was updated by func (because r1 is in pointerRegs for main+52). Done indicates that r1 refers to a (possibly) new location of its referent object.
11. Regs holds the register values at the point where f called g. Notice that the (possibly updated) values of r1 and r2 are saved into slots 2 and 1 of f's frame, and their values in Regs have been restored from Restore.
14. Register r1 was skipped (because it was in Done), but r2 was updated by func and added to Done.

Finally, in step 15, processStack stores the values in Regs back to the thread state.
• If func will not update its argument, then one can omit the Done data structure, the statements that update it, and the conditional test on line 36, invoking func unconditionally on line 37. This simplification applies for non-moving collectors and non-moving phases of moving collectors. It also applies if a moving collector's implementation of func works correctly when invoked on the same slot more than once.
• Rather than calling func late in processFrame, one can move the two for loops at the end upwards, inserting them after line 9. If combined with variation one, the resulting algorithm needs to process the stack only in one direction, which allows an iterative implementation as opposed to a recursive one, as shown in Algorithm 11.2.
1  processStack(thread, func):
2    Top ← topFrame(thread)
3    processFrame(Top, func)
4    Regs ← getRegisters(thread)           /* register contents thread would see */
5    for each reg in pointerRegs(IP)       /* trace from registers at GC-point */
6      func(getAddress(Regs[reg]))
7    setRegisters(thread, Regs)            /* get corrected reg contents back to thread */
8
9  processFrame(Frame, func):
10   repeat
11     IP ← getIP(Frame)                   /* current instruction pointer */
12     for each slot in pointerSlots(IP)   /* process frame's pointer slots */
13       func(getSlotAddress(Frame, slot))
14     Frame ← getCallerFrame(Frame)
15   until Frame = null
Compressing stack maps. Experience shows that the space needed to store stack maps can be a considerable fraction of the size of the code in a system. For example, Diwan et al [1992] found their tables for Modula-3 for the VAX to be 16% of the size of code, and Stichnoth et al [1999] reported their tables for Java to be 20% of the size of x86 code. Tarditi [2000] describes techniques for compressing these tables, and applies them in the Marmot Java compiler, achieving a compression ratio of four to five and final table sizes averaging 3.6% of code size. The approach exploits two empirical observations.
• While there may be many garbage collection points (GC-points) needing maps, many of those maps are the same. Thus a system can save space if multiple GC-points share the same map. In the Marmot system this is particularly true of call sites, which tend to have few pointers live across them. Tarditi [2000] found that this technique cut table space in half.
• If the compiler works to group pointers close together in stack frames, then even more maps tend to be the same. Using live variable analysis and colouring to place pointer variables with disjoint lifetimes into the same slot also increases the number of identical maps. Tarditi [2000] found this to be important for large programs.
The overall flow of Tarditi's scheme is as follows.

1. Map the (sparse) set of return addresses to a (smaller, denser) set of GC-point numbers.5 In this mapping, if table entry t[i] equals return address ra, then ra maps to GC-point i.
5Tarditi uses the term 'call site' where we use 'GC-point'.
2. Map the set of GC-point numbers to a (small, dense) set of map numbers. This is useful because multiple GC-points often have the same map. Given the GC-point i above, this can be written as map number mn=mapnum[i].

3. Index into a map array using the map number to get the map information. Given mn from the previous step, this can be written as info=map[mn].
In Tarditi's scheme the map information is a 32-bit word. If the information fits in 31 bits, then that word is adequate and its low bit is set to 0; otherwise, the low bit is set to 1 and the remaining bits point to a variable-length record giving the full map. The details probably need to be retuned for different platforms (language, compiler and target architecture), so refer to the paper for the exact encoding.
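As a concrete illustration, the three-step lookup might be coded in C roughly as follows. The table names t, mapnum and map follow the steps above, but the types, the search and the guard are our own assumptions for the sketch, not Marmot's actual code.

    #include <stdint.h>

    /* Hypothetical table layout following steps 1-3 above. */
    extern uintptr_t t[];        /* sorted return addresses: t[i] is GC-point i */
    extern uint16_t  mapnum[];   /* step 2: GC-point number -> map number */
    extern uint32_t  map[];      /* step 3: map number -> encoded map info */

    /* Return the 32-bit map information word for return address ra,
     * assuming ra is one of the n recorded GC-points. */
    uint32_t map_info_for(uintptr_t ra, int n)
    {
        int lo = 0, hi = n - 1;
        while (lo < hi) {                     /* step 1: binary search for ra */
            int mid = (lo + hi) / 2;
            if (t[mid] < ra) lo = mid + 1; else hi = mid;
        }
        int i = lo;                           /* the GC-point number */
        uint32_t info = map[mapnum[i]];       /* steps 2 and 3 */
        /* Low bit 0: the other 31 bits are the map itself.
         * Low bit 1: the remaining bits point to a variable-length record. */
        return info;
    }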
Tarditi also explored several organisations for mapping IP (instruction pointer) values to GC-point numbers.
• Using the same number for adjacent GC-points whose stack maps are the same, a technique also used by Diwan et al [1992]. This records only the first GC-point, and subsequent ones whose address is less than the next address in the table are treated as being equivalent.
• Using a two-level table to represent what is conceptually a large array of GC-point addresses. This builds a separate table for each 64 kilobyte chunk of code space. Since all GC-points in the chunk have the same upper bits, it needs to record only the low 16 bits in each table entry. In a 32-bit address space this saves essentially half the table space. We also need to know the GC-point number for the first GC-point in a chunk; simply adding this to the index of a return address within the chunk's table will get the GC-point number for the matching IP (see the sketch after this list).
• Using a sparse array of GC-points and interpolating by examining the code near the IP value. This chooses points roughly k bytes apart in the code, indicating where these places are, their GC-point number and their map number. It starts from the highest location preceding the IP value and disassembles code forward. As it finds calls (or other garbage collection points), it updates the GC-point number and map number. Notice that it must be able to recognise GC-points by inspection. Tarditi found that even for the x86 the disassembly process for these purposes was not overly complex or slow, though the scheme includes a 16-element cache to reduce repeated computation for the same return address values. It was the most compact of the schemes examined and the disassembly overhead was small.
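The two-level organisation in the second bullet might be realised along these lines; the structure and names here are illustrative assumptions, not any particular system's layout.

    #include <stdint.h>

    /* One table per 64 KiB chunk of code: only the low 16 bits of each
     * GC-point address are stored, plus the number of the chunk's first
     * GC-point. */
    typedef struct {
        int       first;      /* GC-point number of first GC-point in chunk */
        int       count;      /* number of GC-points in this chunk */
        uint16_t *low;        /* sorted low 16 bits of each GC-point address */
    } Chunk;

    extern Chunk     chunks[];
    extern uintptr_t code_base;   /* start of code space, 64 KiB aligned */

    int gc_point_number(uintptr_t ip)
    {
        Chunk   *c  = &chunks[(ip - code_base) >> 16];
        uint16_t lo = (uint16_t)ip;  /* GC-points in a chunk share the upper bits */
        for (int j = 0; j < c->count; j++)   /* linear scan, for clarity */
            if (c->low[j] == lo)
                return c->first + j;         /* chunk's base number + index */
        return -1;                           /* ip is not a recorded GC-point */
    }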
Stichnoth et al [1999] take a different approach, aiming to allow garbage collection at every instruction and compressing the resulting maps. If adjacent blocks of code begin with the same map, they treat the two blocks as one large block. Working forward from each reference point, they encode the length of the instruction at that point (because the x86 has variable-length instructions) and the delta to the map caused by the instruction. For example, the instruction might push or pop a value on the stack, load a pointer into a register, and so on. They Huffman-code the delta stream to obtain additional compression. Across a suite of benchmarks they get an average map size of about 22% of code size.
They argue that, as a fraction of code size, the situation should not be worse for machines with larger register sets, since the instructions increase in size too. Also, the overall space used might be somewhat better for machines with fixed-length instructions, since there is still a noticeable overhead for recording instruction lengths, even though (like Tarditi [2000]) they use a disassembler in most cases to avoid recording instruction lengths. They still need a fraction of a bit to mark those places where they cannot legally allow garbage collection, such as in the middle of the sequence for a write barrier.
Finding pointers in code. Pointers embedded in code present several problems.

• It is not always easy, or even possible, to distinguish code from any data embedded within it.

• As in the case of uncooperative compilers, it is not generally possible to tell embedded pointers from non-pointer data that happen to have a value that looks as if it refers to a location in the heap.

• When embedded in instructions, a pointer may be broken into smaller pieces. For example, on the MIPS processor, loading a 32-bit static pointer value into a register would typically require a load-upper-immediate instruction, which loads a 16-bit immediate field into the upper half of a 32-bit register and zeroes the low 16 bits, and then an or-immediate of another 16-bit value into the lower half of the register. Similar code sequences occur for other instruction sets. This is a particular case of derived pointers (page 183).

• An embedded pointer value may not refer directly to its target object; see our discussions of interior (page 182) and derived (page 183) pointers.
The more general solution is to arrange for the compiler to generate a side table that indicates where embedded pointers lie in the code. Some systems simply rule out embedded pointers to avoid the issues altogether. The impact on code performance will vary according to target architecture, compilation strategy and the statistics of programs' accesses.
Target objects that move. If the target of an embedded reference moves, then the collector must update the embedded reference. One possible difficulty is that for safety or security reasons code areas may be read-only. Thus the collector must either change the permissions temporarily (if possible), which might involve expensive system calls, or the system must disallow embedded references to moving objects.

Interior pointers. An interior pointer refers to a location inside an object, and to handle one the collector must be able to locate the object containing it. There are several ways to support this.

• The system can maintain a side table, such as a bitmap recording those granules that are the first granules of objects. This might be useful for the allocator and collector in any case.
• If the system supports heap parsability (Section 7.6), then one can scan the heap to find the object whose locations contain the target of the interior pointer. It would be prohibitively expensive to search from the beginning of the heap every time, so typically a system records the first (or last) object-start position within each k-byte chunk of the heap, where k is usually a power of two for convenient and efficient calculation. This allows parsing to start in the chunk to which the interior pointer refers, or the previous chunk as necessary. There is a trade-off between the space used for this side table and the overhead of parsing. For a more detailed discussion see Section 11.8.
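A sketch of that chunked lookup in C follows; the table layout and the next_object parsing helper are assumptions for illustration, not a particular system's design.

    #include <stdint.h>

    #define LOG_K 9                        /* 512-byte chunks, for illustration */
    extern uintptr_t heap_base;
    extern uintptr_t first_start[];        /* first object start in each chunk; 0 if none */
    extern uintptr_t next_object(uintptr_t obj);  /* heap parsability: following object */

    /* Find the start of the object containing the interior pointer p. */
    uintptr_t containing_object(uintptr_t p)
    {
        uintptr_t chunk = (p - heap_base) >> LOG_K;
        /* back up to a chunk whose first recorded object start precedes p */
        while (first_start[chunk] == 0 || first_start[chunk] > p)
            chunk--;
        uintptr_t obj = first_start[chunk];
        while (next_object(obj) <= p)      /* parse forward to the last start <= p */
            obj = next_object(obj);
        return obj;
    }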
If most pointers are tidy pointers (those that refer to an object's standard reference point), then the time overhead of dealing with the interior pointers themselves may not be great. However, making provision for them at all may add space cost for tables (though the particular collector design may include the necessary tables or metadata anyway) and add time cost for maintaining the tables.
Return addresses are a particular case of interior pointers into code. They present no special difficulty, though for a variety of reasons the tables for looking up the function containing a particular return address may be distinct from the tables the collector uses for other objects.
Derived pointers. A derived pointer is a value computed from one or more object references, for example:

• p - q, the distance between two objects.
In some cases we can reconstruct a tidy pointer (one that points to the referent's standard reference address) from the derived pointer. An example is p + c where c is a compile-time known constant. In the general case we must have access to the base expression from which the derived pointer was derived. That expression might itself be a derived pointer, but eventually gets back to tidy pointers.

In a non-moving collector, just having the tidy pointers available as roots is enough. Notice, though, that at a GC-point the tidy pointer may no longer be live in the sense of compiler live variable analysis, even though the derived pointer is live. Thus the compiler must keep at least one copy of the tidy pointer(s) for each live derived pointer. An exception to this rule is the p ± c case, since adjusting with a compile-time known value produces the tidy pointer without reference to other run-time data.
In a moving collector, the run-time interface must provide the collector with the base expressions from which each derived pointer was derived and the operations needed to reconstruct the derived pointer. Diwan et al [1992] give details on handling derived quantities of the form Σᵢ pᵢ − Σⱼ qⱼ + E, where the pᵢ and qⱼ are pointers or derived values and E is an expression not involving pointers (and thus not affected if any of the pᵢ or qⱼ move). The advantage of this form is that the collector can subtract out the pᵢ and add in the qⱼ, forming E, before moving any objects; do any moving; and then add back the new pᵢ′ and subtract off the new qⱼ′ to produce the correct adjusted derived pointer.
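For the simplest case of a single pᵢ and a single qⱼ, the adjustment might be coded as below; move() stands for whatever relocation the collector performs, and all names are illustrative.

    #include <stdint.h>

    extern void *move(void *obj);   /* relocate an object, returning its new address */

    /* Adjust a derived pointer of the form p - q + E across a move.
     * p_root and q_root are the tidy roots the compiler kept live. */
    uintptr_t adjust_derived(uintptr_t derived, void **p_root, void **q_root)
    {
        /* subtract out p and add in q, leaving E, before moving anything */
        intptr_t e = (intptr_t)derived - (intptr_t)*p_root + (intptr_t)*q_root;
        *p_root = move(*p_root);    /* do any moving, updating the tidy pointers */
        *q_root = move(*q_root);
        /* add back the new p and subtract off the new q */
        return (uintptr_t)(e + (intptr_t)*p_root - (intptr_t)*q_root);
    }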
Diwan et al [1992] point out several issues that arise in optimising compilers when trying to handle derived pointers, including dead base variables (which we mentioned above), multiple derivations reaching the same point in code (for which they add more variables to record the path that actually pertains), and indirect references (where they record the intermediate location along the chain of references). Supporting derived pointers sometimes required producing less optimal code, but the impact was slight. They achieved table sizes about 15% of the size of code for Modula-3 on the VAX.
11.3 Object tables
For reasons of mutator speed and space consumption, many systems have represented object references as direct pointers to their referent objects. A more general approach is to give each object a unique identifier and to locate its contents via some mapping mechanism. This has been of particular interest when the space of objects is large, and possibly persistent, but the hardware's underlying address space is small in comparison. The focus here is on heaps that fit into the address space. Even in that case, however, some systems have found it helpful to use object tables. An object table is a generally dense array of small entries, each standing for one object.

With an object table, a compacting collector can proceed by marking as usual (modulo the level of indirection imposed by the object table) and then doing a simple sliding compaction of the object data. Free object table entries can simply be chained into a free-list. Notice that in marking it may be advantageous to keep mark bits in object table entries, so as to save a memory reference when checking or setting the mark bit. A side mark-bit table has similar benefits. It can also be advantageous to keep other metadata in the object table entry, such as a reference to class and size information. It is also possible to compact the object table itself, for example using the Two-Finger algorithm of Section 3.1. This can be done together with compacting the object data, requiring only one pass over the data in order to compact both the data and the object table.
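An object table entry might look something like the following; the exact layout is ours, chosen only to illustrate keeping the mark bit and class metadata in the entry.

    /* Illustrative object table entry: references are indices into this
     * table rather than direct pointers. */
    typedef struct {
        void    *data;         /* current address of the object's contents */
        void    *class_info;   /* class and size information */
        unsigned mark : 1;     /* checking/setting the mark touches only the entry */
        unsigned free : 1;     /* entry is on the free-list */
    } OTEntry;

    extern OTEntry object_table[];

    /* Every field access pays one extra indirection through the table. */
    static inline void *load_field(unsigned ref, int index)
    {
        return ((void **)object_table[ref].data)[index];
    }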
Object tables may be problematic, or simply unhelpful, if the language allows interior or derived pointers. Note also the similarity of object table entries to handles as used to support references from external code to heap objects, as discussed in Section 11.4.

11.4 References from external code

Some systems allow code outside the managed environment to hold references to objects within it. A typical example is the Java Native Interface, which allows code written in C, C++ or possibly other languages to access objects in the Java heap.
More generally, just about every system needs to support input/output, which must somehow move data between the operating system and heap objects. Two difficulties arise in supporting references from external code and data to objects in a managed heap. The first issue is ensuring that the collector continues to treat an object as reachable while external code possesses a reference to the object. This is necessary to prevent the object from being reclaimed before the external code is done with it. Often we need the guarantee only for the duration of a call to external code. We can make that guarantee by ensuring that there is a live reference to the object, for example in the stack of the calling thread, for the duration of the call.
The second issue is ensuring that external code knows where an object is. This is relevant only to moving collectors. Some interfaces keep external code at arm's length by requiring all accesses to heap objects to go through collector-provided access routines. This makes it easier to support collectors that move objects. Typically the collector provides to external code a pointer to a handle. The handle contains a reference to the actual heap object, and possibly some other management data. Handles act as registered-object table entries, and thus are roots for collection. The Java Native Interface works this way.
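For example, a JNI native method that wants to keep a reference beyond the duration of the call must promote its local reference (a handle valid only during the call) to a global reference, which the collector treats as a root. The class Example and its method cache below are hypothetical.

    #include <jni.h>

    static jobject cached;   /* a global reference (a handle), not a raw pointer */

    /* Hypothetical native method: void Example.cache(Object obj) */
    JNIEXPORT void JNICALL
    Java_Example_cache(JNIEnv *env, jobject self, jobject obj)
    {
        if (cached != NULL)
            (*env)->DeleteGlobalRef(env, cached);   /* unregister the old root */
        cached = (*env)->NewGlobalRef(env, obj);    /* stays valid even if the
                                                       collector moves the object */
    }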
Notice also that some interfaces allow external code direct access to an object's contents, which requires pinning the object: a pinned object cannot be moved while it is pinned, with the further implication that pinned objects are reachable and will not be reclaimed.
If we know when allocating an object that it may need to be pinned, then we can allocate the object directly into a non-moving space. This may work for buffers for file stream I/O if the buffered-stream code allocates the buffers itself. However, in general it is difficult to determine in advance which objects will need to be pinned. Thus, some languages support pin and unpin functions that the programmer can invoke on any object.
Pinning is not a problem for non-moving collectors, but is inconvenient for ones that normally move an object. There are several solutions, each with its strengths and weaknesses.
• Defer collection, at least of a pinned object's region, while it is pinned. This is simple, but there is no guarantee that it will be unpinned before running out of memory.
• If the application requests pinning an object, and the object is not in a non-moving region, we can immediately collect the object's containing region (and any others required to be collected at the same time) and move the object to a non-moving region. This might be acceptable if pinning is not frequent, and the collector is of a design such as a generational collector with a nursery whose survivors are copied to a non-moving mature space.
• We can extend our collector to tolerate not moving pinned objects, which complicates the collector and may introduce new inefficiencies.
As a simple example of extending a moving collector to support pinning, consider a basic non-generational copying collector. Extending it to support pinned objects requires first of all that the collector can distinguish pinned from unpinned objects. It can copy and forward unpinned objects as usual. It will trace through pinned objects, updating pointers from the pinned object to objects that move, but leaving pinned objects where they are. The collector should also record in a table the pinned objects it encounters. When all survivors have been copied, the collector reclaims only the holes between pinned objects rather than reclaiming all of fromspace. Thus, rather than obtaining a single, large, free region, it may obtain an arbitrary number of smaller ones. The allocator can use each one as a sequential allocation space. This can lead to a degree of fragmentation, but that is unavoidable in the presence of pinning. Notice that a future collection may find that a previously pinned object is no longer pinned, so the fragmentation need not persist. As we saw in Section 10.3, some mostly non-moving collectors take a similar approach, also sequentially allocating in the gaps between surviving objects [Dimpsey et al, 2000; Blackburn and McKinley, 2008].
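A sketch of the forwarding routine of such a collector follows; every helper here (is_pinned, copy_to_tospace and so on) is assumed for illustration rather than taken from a particular implementation.

    extern int   is_pinned(void *obj);
    extern int   is_forwarded(void *obj);
    extern void *forwarding_address(void *obj);
    extern void *copy_to_tospace(void *obj);  /* copies; installs forwarding address */
    extern void  record_pinned(void *obj);    /* table of pinned survivors */

    void *forward(void *obj)
    {
        if (is_pinned(obj)) {
            record_pinned(obj);   /* the holes between these are reclaimed later */
            return obj;           /* trace through it, but leave it where it is */
        }
        if (is_forwarded(obj))
            return forwarding_address(obj);
        return copy_to_tospace(obj);
    }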
Another possible difficulty is that, even though an object is pinned, the collector is examining and updating it, which may lead to races with external code that accesses the object at the same time. Thus, we may need to pin not only a given object but also some of the objects to which it refers. Likewise, if, starting from a given object, external code traces through to other objects, or even just examines or copies references to them without examining the objects' contents, those other objects also need to be pinned.
Features of a programming language itself, and its implementation, may require pinning. In particular, if the language allows passing object fields by reference, then there may be stack references to the interior of objects. The implementation can apply the interior pointer techniques described on page 182 in order to support moving the object containing the referent field. However, such support can be complex, and the code for handling interior pointers correctly may thus be difficult to maintain. Therefore an implementation might choose simply to pin such objects. This requires being able to determine fairly easily and efficiently which object contains a given referent. Hence it most readily allows interior pointers but not more general cases of derived pointers (see page 183).
11.5 Stack barriers

Scanning a stack is problematic while its thread is actively running, so we must either pause the thread for some period of time or otherwise synchronise with it; see Section 11.6 for more discussion of when it is appropriate to scan a thread's registers and stack. It is possible to scan a stack incrementally, however, and also mostly-concurrently, using a technique called stack barriers. The idea is to arrange for a thread to be diverted if it tries to return (or throw) beyond a given frame in its stack. Suppose we have placed a barrier in frame F. Then we can asynchronously process the caller of F, its caller, and so on, confident that the running thread will not cut the stack back from under our scanning.
The key step in introducing a stack barrier is to hijack the return address of the frame. In place of the actual return address we write the address of the stack barrier handler we wish to install. We put the original return address in some standard place that the stack barrier handler can find, such as a thread-local variable. The handler can then remove the barrier as appropriate. Naturally it must be careful not to disturb any register contents that the caller may examine.
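In outline, installing such a barrier might look as follows; the thread structure, the handler stub and the way the return-address slot is located are all assumptions, since the details are ABI-specific.

    typedef struct Thread {
        void *hijacked_ra;   /* standard place where the handler finds the original */
    } Thread;

    /* In a real system this is an assembly stub: it must preserve registers,
     * do the handler's work, then jump to the saved hijacked_ra. */
    extern void stack_barrier_stub(void);

    /* ra_slot points at the stack slot in frame F holding F's return address. */
    static void install_stack_barrier(Thread *t, void **ra_slot)
    {
        t->hijacked_ra = *ra_slot;                 /* save the real return address */
        *ra_slot = (void *)&stack_barrier_stub;    /* divert the return through us */
    }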
For incremental stack scanning by the thread itself, when it encounters the barrier the handler scans some number of frames up the stack and sets a new barrier at the limit of its scanning (unless it finished scanning the whole stack). We call this synchronous incremental scanning. For asynchronous scanning by another thread, the barrier serves to stop the running thread before it overtakes the scanning thread. For its part, the scanning thread can move the barrier down after it scans some number of frames. That way it is possible that the running thread will never hit the barrier. If it does hit the barrier, then it merely has to wait for the scanning thread to advance and unset that barrier; then it can continue.
Cheng and Blelloch [2001] introduced stack barriers in order to bound the collection work done in one increment and to support asynchronous stack scanning. Their design breaks each stack into a collection of fixed-size stacklets that can be scanned one at a time. That is, returning from one stacklet to another is the possible location of what we call a stack barrier. But the idea does not require discontiguous stacklets or predetermination of which frames can have a barrier placed on them.
Stack barriers can also be used in the opposite way from that described above: they can mark the portion of the stack that has not changed, and thus that the collector does not need to reprocess to find new pointers. In collectors that are mostly-concurrent this approach can shorten the 'flip' time at the end of a collection cycle.
Another use for stack barriers is in handling dynamic changes to code, particularly optimised code. For example, consider the situation where routine A calls B, which calls C, and there is a frame on the stack for an optimised version of A that inlined B but did not further inline C. In this situation there is a frame for A + B and another one for C. If the user now edits B, future calls of B should go to the new version. Therefore, when returning from C, the system should deoptimise A + B and create frames for unoptimised versions of A and B, so that when B also returns, the frame for A supports calling the new version of B. It might also be possible to re-optimise and build a new A + B. The point here is that returning to A + B triggers the deoptimisation, and the stack barrier is the mechanism by which the system gains control at that point.

11.6 GC-safe points and mutator suspension

One consideration in deciding where garbage collection may happen is the size of the stack map tables (see Section 11.2 for details on compressing maps), which tend to be large if more IPs are legal for garbage collection.
What might make a given IP unsafe for garbage collection? Most systems have occasional short sequences of code that must be run in their entirety in order to preserve invariants relied on by garbage collection. For example, a typical write barrier needs to do both the underlying write and some recording. If a garbage collection happens between the two steps, some object may be missed by the collector or some pointer not properly updated by it. Systems usually have a number of such short sequences that need to be atomic with respect to garbage collection (though not necessarily atomic with respect to true concurrency). In addition to write barriers, other examples include setting up a new stack frame and initialising a new object.
A system is simpler in one way if it can allow garbage collection at any IP, but such a system must support stack maps for every IP, or else employ techniques that do not require them, as for uncooperative C and C++ compilers. If a system allows garbage collection at most IPs, then if it needs to collect and a thread is suspended at an unsafe point, it can either interpret instructions ahead for the suspended thread until it is at a safe point, or it can wake the thread up for a short time to get it to advance (probabilistically) to a safe point. Interpretation risks rarely exercised bugs, while nudging a thread gives only a probabilistic guarantee. Such systems may also pay the cost of larger maps [Stichnoth et al, 1999].
Many systems make the opposite choice and allow garbage collection only at certain restricted safe points, and produce maps only for those points.6

6Excepting the possibility of checking for adequate thread-private free space before a sequence of allocations.
Agesen found that patching the code at safe points has much lower overhead than polling, but of course it is more difficult to implement, and more problematic in a concurrent system.
In bringing up the idea of GC-check points, notice that we have introduced the notion of a handshake mechanism between the collector and a mutator thread. Such handshakes may be necessary even if a system does not include true concurrency but merely multiplexes several mutator threads on one processor: the collector may need to indicate the need for garbage collection and then wake up any suspended thread that is not at a safe point so that the thread can advance to a safe point. Some systems allow threads to suspend only at safe points so as to avoid this additional complexity, but for other reasons a system may not control all aspects of thread scheduling, and so may need this handshake.
For concreteness we mention some particular mechanisms for the handshake. Each thread can maintain a thread-local variable that indicates whether the rest of the system needs that thread's attention at a safe point. This mechanism can be used for things other than signalling for a garbage collection. At a GC-check point, the thread checks that thread-local variable, and if it is non-zero (say) it calls a system routine that uses the exact value of the variable to determine what action to take. One particular value will indicate 'time to garbage collect'. When the thread notices the request, it sets another thread-local variable to indicate it has noticed, or perhaps decrements a global variable on which a collector thread is waiting. Systems typically arrange for thread-local variables to be cheap to access, so this may be a good approach.
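A minimal sketch of this polling handshake follows, with the flag held in a per-thread structure so the collector can address it; all names and the exact protocol are illustrative assumptions.

    #include <stdatomic.h>

    typedef struct Thread {
        atomic_int attention;   /* 0 = nothing to do; 1 = 'time to garbage collect' */
    } Thread;

    extern atomic_int mutators_to_stop;   /* global count the collector waits on */

    static void respond(Thread *self)     /* the out-of-line system routine */
    {
        if (atomic_exchange(&self->attention, 0) == 1) {
            /* ... publish registers and stack bounds for scanning ... */
            atomic_fetch_sub(&mutators_to_stop, 1);  /* tell collector we noticed */
            /* ... block here until the collector releases the mutators ... */
        }
    }

    /* Inlined at each GC-check point: one cheap load, one rarely taken branch. */
    static inline void gc_check(Thread *self)
    {
        if (atomic_load_explicit(&self->attention, memory_order_relaxed) != 0)
            respond(self);
    }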
Another possibility is to set a processor condition code in the saved thread state of the suspended thread. A GC-check point can then consist of a very cheap conditional branch over a call to the system routine for responding to the request. This approach works only if the processor has multiple condition code sets (as for the PowerPC) and can be guaranteed not to be in external code when awakened. If the machine has enough registers that one can be dedicated to the signalling, a register can be used almost as cheaply as a condition code flag. If a thread is in external code, the system needs an alternate method of getting its attention when the thread comes out of that code (unless it is suspended at a safe point already). Hijacking the return address (see also Section 11.5) is a way to get attention as the external code completes.
As an alternative to flag setting and return address hijacking, in some cases an operating system-level inter-thread signal, such as those offered by some implementations of POSIX threads, may be viable. This may be problematic for wide portability, and it may not be very efficient. It can be slow in part because of the relatively long path through the operating system kernel to set up and deliver a signal to a user-level handler. It can also be slow not only because of the need for a low-level processor interrupt but also because of the effect on caches and translation lookaside buffers.
In sum, there are two basic approaches: synchronous notification, also appropriately called polling, and asynchronous notification via some kind of signal or interrupt. Each approach has its own overheads, which vary across platforms. Polling may also require a degree of compiler cooperation, depending on the specific technique. Further, asynchronous notification will usually need to be turned into synchronous notification, since scanning the stack, or whatever action is being requested, may not be possible at every moment. Thus, the signal handler's main goal may be to set a flag local to its thread, where the thread is guaranteed to notice the flag soon and act on it.
We further note that in implementing synchronisation between threads to direct scanning of stacks, considerations of concurrent hardware and software crop up, for which we offer general background in Chapter 13. Of particular relevance may be Section 13.7, which discusses coordinating threads to move from phase to phase of collection, which mutator threads may need to do as collection begins and ends.
11.7 Garbage collecting code
While many systems statically compile all code in advance, garbage collection has its roots in implementations of languages like Lisp, which can build and execute code on the fly: originally interpretively, but also compiled to native code since early days. Systems that load or construct code dynamically, and that optimise it at run time, are if anything more common now. Loading and generating code dynamically leads logically enough to the desire to reclaim the memory consumed by that code when the code is no longer needed. Straightforward tracing or reference counting techniques often will not work, because code for many functions is accessible through global variables or symbol tables that will never be cleared. In some languages little can be done if the program does not explicitly remove such entries, and the language may provide no approved way to do that.
Two specific cases deserve further mention. First, closures consist of a function and an environment of bindings to use when the function runs. Naive construction of a closure, say for function g nested within function f, provides g with the full environment of f, possibly sharing a common environment object. Thomas and Jones [1994] described a system that, upon collection, can specialise the environment to just those items that g uses. This may ultimately make some other closure unreachable and thus reclaimable.
The other case is class-based systems, such as Java. One consideration is that in such systems object instances generally refer to their class. It is common to place classes, and the code for their methods, in a non-moving, non-collected area. In that way the collector can ignore the class pointer in every object. But to reclaim classes, the collector will need to trace the class pointer fields, possibly a significant cost in the normal case. It might avoid that cost by tracing through class pointers only when invoked in a special mode.
For Java in particular, a run-time class is actually determined by both the class's code and its class loader. Since loading a Java class has side-effects such as initialising static variables, unloading a class is not transparent if the class might be reloaded by the same class loader. The only way to guarantee that a class will not be reloaded by a given class loader is for the class loader itself to be reclaimable. A class loader has a table of the classes it has loaded (to avoid reloading them, reinitialising them, and so on) and a run-time class needs also to mention its class loader (as part of its identity). So, to reclaim a class, there must be no references to its class loader, any class loaded by that class loader, or any instance of one of those classes, from existing threads or global variables (of classes loaded by other class loaders). Furthermore, since the bootstrap class loader is never reclaimed, no class that it loads can be reclaimed. While Java class unloading is something of a special case, certain kinds of programs rely on it, or else servers will run out of space.
In addition to user-visible code elements such as methods, functions and closures, a system may generate multiple versions of code, to be interpreted or run natively, and may need to reclaim these as well.

11.8 Read and write barriers

A barrier involves two concerns: detection is determining that a pointer is 'interesting', while recording is noting that fact for later use by the collector. To some extent detection and recording are orthogonal, though some detection methods fit more naturally with particular ways of recording.
Engineering

A typical barrier involves one or more checks that guard an action. Typical checks include whether the pointer being stored is null and the relationship between the generations of the referring object and its referent; a typical action is to record an object in a remembered set. The full code for all the checks and the action may be too large to inline entirely, depending on the implementation. Even a fairly modest sequence of instructions would create very large compiled code and also risk poor instruction cache performance, since much of the code executes only rarely. Therefore designers often separate the code into what are commonly called 'fast path' and 'slow path' portions. The fast path is inlined for speed, and it calls the slow path part only if necessary; there is one copy of the slow path, in order to conserve space and improve instruction cache performance. It is critical that the fast path code handle the most common cases and that the slow path be entered only rarely.
However, it sometimes helps to apply the same principle to the slow path code. If the barrier involves multiple tests (and they usually do), then it is important to order those tests so that the first one filters out the most cases, the second the next larger set of cases, and so on, modulo the cost of performing each test. In doing this tuning there is no substitute for trying various arrangements and measuring performance on a range of programs, because so many factors come into play on modern hardware that simple analytical models fail to give good enough guidance.
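The fast-path/slow-path split might look like this for a generational barrier; the nursery-above-a-boundary layout and all names here are assumptions made for the sake of the example.

    extern char *nursery_start;              /* nursery lies above this address */
    extern void  remember_slow(void **fld);  /* out-of-line: final filtering, logging */

    /* Inlined fast path: do the store, filter the common uninteresting cases,
     * and call the single out-of-line slow path only for candidate pointers. */
    static inline void write_barrier(void **fld, void *ref)
    {
        *fld = ref;
        if (ref != NULL &&
            (char *)ref >= nursery_start &&   /* referent is young ... */
            (char *)fld <  nursery_start)     /* ... and the field is old */
            remember_slow(fld);               /* rare case: record it */
    }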
Another significant factor in barrier performance is speed in accessing any required data structures, such as card tables. It may be a good trade-off to dedicate a machine register to hold a data structure pointer, such as the base of the card table, but this can vary considerably by machine and algorithm.
Also of concern is the software engineering and maintenance of those aspects of the garbage collection algorithm (mostly barriers, GC-checks and allocation sequences) that are built into the compiler(s) of a system. If possible it seems best to arrange for the compiler to inline a routine in which the designer codes the fast path portion of a sequence. That way the compiler does not need to know the details and the designer can change them freely. However, as we noted before, these code sequences may have constraints, such as no garbage collection in the middle of them, that require care on the compiler's part. The compiler may also have to disable some optimisations on these code sequences, such as leaving apparently dead stores that save something useful for the collector, and not reordering barrier code or interspersing it with surrounding code. To that end the compiler might support special pragmas or markers for the designer to use to indicate special properties such as uninterruptible code sequences.
In the remainder of this section we discuss write barriers; we defer the discussion of read barriers. A write barrier's cost falls mostly on the mutator, though some of this cost may be masked if the cache locality of the barrier is better than that of the mutator itself (for example, it is probably unnecessary to stall the user code while the write barrier records an interesting pointer). Typically, more precise recording of interesting pointers in the remembered set means less work for the collector to do to find the pointer, but more work for the mutator to filter and log it. At one extreme, in a generational collector, not logging any pointer stores transfers all overheads from the mutator to the collector, which must scan all other spaces in the heap looking for references to the condemned generation; this is unlikely to be a generally successful policy.
A first design choice is how much filtering to do inline and when to call an out-of-line routine to complete the filtering and possibly add the pointer to a remembered set. The more filtering that is done inline, the fewer instructions may be executed, but the code size will increase, and the larger instruction cache footprint may undermine any performance gains. This requires careful tuning of the order of the filter tests and of which are done inline.
Second, at what granularity is the location of the pointer to be recorded? The most accurate is to record the address of the field into which the pointer was written. However, this will increase the size of the remembered set if many fields of an object, such as an array, are updated. An alternative is to record the address of the object containing the field: this also permits duplicates to be filtered, which field remembering does not (since there is generally no room in the field to record that it has been remembered). Object remembering requires the collector to scan every pointer in the object at scavenge time in order to discover those that refer to objects that need to be traced. A hybrid solution might be to object-record arrays and field-record scalars, on the assumption that if one slot of an array is updated then many are likely to be. Conversely, it might be sensible to field-record arrays (to avoid scanning the whole thing) and object-record scalars (since they tend to be smaller). For arrays it may make sense to record only a portion of the array. This is analogous to card marking, but specific to arrays and aligned with the array indices rather than with the addresses of the array's fields in virtual memory. Whether to store the object or one of its slots may also depend on what information the mutator has at hand. If the write action knows the address of the object as well as the field, the barrier can choose to remember either; if only the address of the field is passed to the barrier, then computing the address of the object will incur further overhead. Hosking et al [1992] resolve this dilemma by storing the addresses of both the object and the slot in their sequential store buffer. Note that remembered sets must behave as true sets rather than multisets if they are not to contain duplicates.
In summary, if a card or page based scheme is used, then the collector's scanning cost will depend on the number of dirty cards or pages. Otherwise, the cost will depend on the number of pointer writes if a scheme without duplicate elimination is used. With duplicate elimination, it will depend on the number of different objects modified. In all cases, uninteresting pointer filtering will reduce the collector's root scanning cost. Mechanisms for implementing remembered sets include hash sets, sequential store buffers, card tables, virtual memory mechanisms and hardware support. We consider each of these in turn.
Hash tables

The remembered set must truly implement a set if it is to remember slots without duplicating entries. Equally, a set is required for object remembering if there is no room in object headers to mark an object as remembered. A further requirement for a remembered set is that adding entries must be a fast, and preferably constant time, operation. Hash tables meet these requirements.
Hosking et al [1992] implement a remembered set with a circular hash table, using linear hashing, in their multiple-generation memory management toolkit for a Smalltalk interpreter that stores stack frames in generation 0, step 0, in the heap. More specifically, a separate remembered set is maintained for each generation. Their remembered sets can store either objects or fields. The tables are implemented as arrays of 2^i + k entries (they use k = 2). Addresses are hashed to obtain i bits (from the middle bits of the address), and the hash is used to index the array. If that entry is empty, the address of the object or field is stored at that index; otherwise the next k entries are searched (this is not done circularly, which is why the array size is 2^i + k). If this also fails to find an empty entry, the table is searched circularly.
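In C, such an insertion routine might read as follows; constants and names are illustrative, and a real implementation would also grow the table on failure.

    #include <stddef.h>
    #include <stdint.h>

    #define I 12                             /* hash yields I bits */
    #define K 2
    #define SIZE ((1u << I) + K)             /* 2^i + k entries */

    static void *table[SIZE];

    static unsigned hash(void *addr)         /* middle bits of the address */
    {
        return ((uintptr_t)addr >> 4) & ((1u << I) - 1);
    }

    static int remember(void *addr)
    {
        unsigned h = hash(addr);
        for (unsigned j = h; j <= h + K; j++) {      /* natural slot plus next k */
            if (table[j] == addr) return 1;          /* already remembered */
            if (table[j] == NULL) { table[j] = addr; return 1; }
        }
        for (unsigned j = 1; j < SIZE; j++) {        /* then search circularly */
            unsigned s = (h + j) % SIZE;
            if (table[s] == addr) return 1;
            if (table[s] == NULL) { table[s] = addr; return 1; }
        }
        return 0;                                    /* table full: grow it */
    }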
In order not to increase further the work that must be done by the remembering code, the write barrier filters out all writes to generation 0 objects and all young-young writes. In addition, it adds all interesting pointers to a single scratch remembered set rather than to the remembered set for the target generation. Not deciding at mutator time the generation to whose remembered set a pointer should be added might be even more apposite in a multithreaded environment; there, per-processor scratch remembered sets could be used without synchronisation.
1 Write(src, i, ref):
2   add %src, %i, %fld
3   st %ref, [%fld]          ; src[i] ← ref
4   st %fld, [%next]         ; SSB[next] ← fld
5   add %next, 4, %next      ; next ← next + 1
On a collision, the question is whether to place the new item in the current slot and continue probing with the contents of that slot. Garthwaite et al use robin hood hashing [Celis et al, 1985]. Each entry is stored in its slot along with its depth in the probing sequence, taking advantage of the fact that the least significant bits of an item (such as the address of a card) will be zero. If a slot already contains an item, its depth is compared with the depth of the new item: we leave whichever value is deeper in its probing sequence and continue with the other.
Sequential store buffers

A sequential store buffer (SSB) holds its entries in a simple array: to add an entry, the write barrier need only store through a next pointer and bump the next pointer. The MMTk [Blackburn et al, 2004b] implements a sequential store buffer as a chain of blocks. Each block is power-of-two sized and aligned, and filled from high addresses to low. This allows a simple overflow test by comparing the low order bits of the next pointer with zero (which is often a fast operation).
A number of tricks can be used to eliminate the explicit overflow check, in which case the cost of adding an entry to the sequential store buffer can be as low as one or two instructions if a register can be reserved for the next pointer, as in Algorithm 11.4. With a dedicated register this might be as low as one instruction on the PowerPC: stwu fld,4(next).
Appel [1989a], Hudson and Diwan [1990] and Hosking et al [1992] use a write-protected guard page. When the write barrier attempts to add an entry on this page, the trap handler performs the necessary overflow action, which we discuss later. Raising and handling a page protection exception is very expensive, costing hundreds of thousands of instructions. This technique is therefore effective only if traps are very infrequent: the trap cost must be less than the cost of the (large number of) software tests that would otherwise be performed:

    cost of page trap < cost of limit test × buffer size

Appel ensures that his guard page trap is triggered precisely once per collection by storing the sequential store buffer as a list in the young generation. The guard page is placed at the end of the space reserved for the young generation; thus any allocation, whether for objects or remembered set entries, may spring the trap and invoke the collector. This technique relies on the young generation's area being contiguous. It might appear that a system can simply place the heap at the end of the data area of the address space and use the brk system call to grow (or shrink) the heap. However, protecting the page beyond the end of the heap interferes with use of brk by malloc, as noted by Reppy [1993], so it may be better to use a higher region of address space and manage it with mmap.
1 atomic insert(fld):
2   *(next - 4) ← fld        /* add the entry in the previous slot */
3   tmp ← next >> (n - 1)
4   tmp ← tmp & 6            /* tmp = 4 or 6 */
5   next ← next + tmp
Retaining every entry from one collection to the next bloats the remembered set and leads to the collector processing the same long-lived entries from one collection to the next. A better solution is to move entries that need to be preserved to the remembered set of the appropriate generation. These remembered sets might also be sequential store buffers, or the information might be more concisely transferred into a hash table, as we saw above.
Overflow action

Hash tables and sequential store buffers may overflow: this can be handled in different ways. The MMTk acquires and links a fresh block into the sequential store buffer [Blackburn et al, 2004b]. Garthwaite et al keep their hash tables sparse: they grow a table whenever a pointer cannot be remembered to its natural location in the table or one of the k following slots, or when the occupancy of the table exceeds a threshold (for example, 60%). Tables are grown by incrementing the size of the hash key, effectively doubling the table's size; a corollary is that the key size cannot be a compile-time constant, which may increase the size and cost of the write barrier. As Appel [1989a] stores his sequential store buffer in the heap, overflow triggers garbage collection. The MMTk also invokes the collector whenever the size of its metadata (such as the sequential store buffer) grows too large.
Card tables

Card table (card marking) schemes divide the heap conceptually into fixed size, contiguous areas called cards [Sobalvarro, 1988; Wilson and Moher, 1989a,b]. Cards are typically small, between 128 and 512 bytes. The simplest way to implement the card table is as an array of bytes, indexed by the cards. Whenever a pointer is written, the write barrier dirties an entry in the card table corresponding to the card containing the source of the pointer (for example, see Figure 11.3). The card's index can be obtained by shifting the address of the updated field. The motivation behind card tables is to permit a small and fast write barrier that can be inlined into mutator code. In addition, card tables cannot overflow, unlike hash tables or sequential store buffers. As always, the trade-off is that more overhead is transferred to the collector. In this case, the collector must search dirtied cards for fields that have been modified and may contain an interesting pointer: the cost to the collector is proportional to the number of cards marked (and to card size) rather than the number of (interesting) stores made.
Because they are designed to minimise impact on mutator performance, card marking schemes are most often used with an unconditional write barrier. This means the card table must be sufficiently large that all locations that might be modified by Write can be mapped to a slot in the table. The size of the table could be reduced if it were guaranteed that no interesting pointers would ever be written to some region of the heap and a conditional test was used to filter out such dull pointers. For example, if the area of the heap above some fixed virtual address boundary was reserved for the nursery (which is scavenged at every collection), then it is only necessary to map the area below that boundary.
While the most compact implementation of a card table is an array of bits, this is not the best choice, for several reasons. Modern processor instruction sets are not designed to write single bits, so bit manipulations require more instructions than primitive operations: read a byte, apply a logical operator to set or clear the bit, write the byte back. Worse, this sequence of operations is not atomic: card updates may be lost if threads race to update the same card table entry, even though they may not be modifying the same field or object in the heap. For this reason, card tables generally use arrays of bytes.
Because processors often have fast instructions for clearing memory, 'dirty' is often represented by 0. Using a byte array, the card can be dirtied in just two SPARC instructions [Detlefs et al, 2002a] (other architectures may require a few more instructions), as shown in Algorithm 11.6. For clarity, we write ZERO to represent the SPARC register %g0, which always holds 0. A BASE register needs to be set up so that it holds CT1-(H>>LOG_CARD_SIZE), where CT1 and H are the starting addresses of the card table and the heap respectively, and both are aligned on a card-size boundary, say 512 bytes. Detlefs et al [2002a] use a SPARC local register for that, which is set up once on entry to a method that might perform a write, and is preserved across calls by the register window mechanism.
1 Write(src, i, ref):
2   add %src, %i, %fld
3   st %ref, [%fld]                   ; src[i] ← ref
4   srl %fld, LOG_CARD_SIZE, %idx     ; idx ← fld >> LOG_CARD_SIZE
5   stb ZERO, [%BASE + %idx]          ; CT[idx] ← DIRTY
Algorithm 11.7: Recording stored pointers with Hölzle's card table on SPARC

1 Write(src, i, ref):
2   st %ref, [%src + %i]
3   srl %src, LOG_CARD_SIZE, %idx     /* calculate approximate byte index */
4   clrb [%BASE + %idx]               /* clear byte in byte map */
1 Write(src, i, ref):
2   add %src, %i, %fld
3   st %ref, [%fld]                        /* do the write */
4   srl %fld, LOG_CARD_SIZE, %idx          /* get the level 1 index */
5   stb ZERO, [%BASE + %idx]               /* mark the level 1 card dirty */
6   srl %fld, LOG_SUPERCARD_SIZE, %idx     /* get the level 2 index */
7   stb ZERO, [%BASE + %idx]               /* mark the level 2 card dirty */
Card size is a compromise between space usage and the collector's root scanning time, since larger cards indicate the location of modified fields less precisely but occupy smaller tables.
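In C the unconditional byte-array barrier is one shift and one store; the biased table pointer plays the role of the BASE register above, and the names and constants here are illustrative.

    #include <stdint.h>

    #define LOG_CARD_SIZE 9      /* 512-byte cards */
    #define DIRTY 0

    /* biased_ct = CT1 - (H >> LOG_CARD_SIZE), set up once at startup */
    extern unsigned char *biased_ct;

    static inline void card_write_barrier(void **fld, void *ref)
    {
        *fld = ref;
        biased_ct[(uintptr_t)fld >> LOG_CARD_SIZE] = DIRTY;
    }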
At collection time, the collector must search all dirty cards for interesting pointers. There are two aspects to the search. First, the collector must scan the card table, looking for dirty cards. The search can be sped up by observing that mutator updates tend to have good locality, so clean and dirty cards will tend to occur in clumps. If bytes are used in the card table, four or eight cards can be checked at once by comparing whole words in the table.
[Figure 11.3: Card table and crossing map. A pointer store has caused a card to be dirtied (shown in black); the updated field is shown in grey. The crossing map shows offsets (in words) to the last object in a card.]

If a generational collector does not promote all survivors en masse, some objects will be retained in the younger generation while others are promoted. If a promoted object refers to an object not promoted, then the older-to-younger reference leads unavoidably to a dirty card. However, when a promoted object is first copied into the older generation, it may refer to objects in the younger generation, all of which end up being promoted. In that case it would be better not to dirty the promoted object's card(s), since doing so will cause needless card scanning during the next collection. Hosking et al [1992] take care to promote objects to clean cards, which are updated as necessary as the cards are scanned, using a filtering copy barrier.
Even so, a collector may spend significant time in a very large heap skipping clean cards. Detlefs et al [2002a] observe that the overwhelming majority of cards are clean, whilst cards with more than 16 cross-generational pointers are quite rare. The cost of searching the card table for dirty cards can be reduced, at the expense of some additional space, with a two-level card table. The second, smaller card table uses more coarsely grained cards, each of which corresponds to 2^n fine-grained cards, thus speeding up the scanning of clean cards by the same factor. The write barrier can be made very similar to that of Algorithm 11.6 (just two more instructions are needed) by sacrificing some space in order to ensure that the start of the second level card table CT2 is aligned with the first such that CT1-(H>>LOG_CARD_SIZE) = CT2-(H>>LOG_SUPERCARD_SIZE), as in Algorithm 11.8.
Crossing maps

As a card table is searched, each dirty card discovered must be processed, which requires finding the modified objects and slots somewhere in the card. This is not straightforward, since the start of the card is not necessarily aligned with the start of an object, but in order to scan fields we must start at an object. Worse, the field that caused the card to be dirtied may belong to a large object whose header is several cards earlier (this is another reason for storing large objects separately). In order to be able to find the start of an object, we need a crossing map that decodes how objects span cards.

The crossing map holds as many entries as cards. Each entry in the crossing map indicates the offset from the start of the corresponding card to an object starting in that card. Entries in the crossing map corresponding to old generation cards are set by the collector as it promotes objects, or by the allocator in the case of cards for spaces outside the generational world. Notice that the nursery space does not need cards, since objects there cannot point to other objects that are younger still: they are the youngest objects in the system. Promotion also requires that the crossing map be updated. The design of the crossing map depends on whether the card table records objects or slots.

Used with a slot-recording write barrier, the crossing map must record the offset to the last object in each card, or a negative value if no object starts in a card. Because objects can span cards, the start of the modified object may be several cards earlier than the dirty one.
Algorithm 11.9: Search a crossing map for a slot-recording card table; trace is the collector's marking or copying procedure.

1  search(card):
2    start ← H + (card << LOG_CARD_SIZE)
3    end ← start + CARD_SIZE               /* start of next card */
4    offset ← crossingMap[card]
5    while offset < 0
6      card ← card + offset                /* offset is negative: go back */
7      offset ← crossingMap[card]
8    offset ← CARD_SIZE - (offset << LOG_BYTES_IN_WORD)
9    next ← H + (card << LOG_CARD_SIZE) + offset
10   repeat
11     trace(next, start, end)             /* trace the object at next */
12     next ← nextObject(next)
13   until next > end
Walking backwards in this way is a process that may need to be repeated a number of times if the preceding object is quite large. Alternatively, the system can reserve a single value, such as -1, to mean 'back up'.
Entry
v Encoded meaning
ferent from the scheme above which gives the offset to the last word in a card.Finding the
first word eliminates the need to search back possibly many cards. Large objects, suchas
arrays, may span cards. The second encoding dealswith the case that such an object spans
two or more cards, and that the first v \342\200\224
256 words of the second card are all references
and that this sequence terminates the object. The benefit of this encoding is that the
references can be found directly, without accessing the object's type information. However, this
references and non-references. In this case, the crossing map entry should be set to a value
greater that 384 to indicate that collector should consult the entry
v \342\200\224
384 entries earlier.
Garthwaite et al also include a scheme in which, if an object completely spans two crossing map slots, then the four bytes of these slots should be treated as the address of the object. In this discussion, we have assumed that a crossing map entry should be two bytes long. However, a single byte suffices if, for example, we use 512-byte cards and 64-bit alignment.
Summarising cards
Some generational collectors do not promote objects en masse. Whenever the collector scans a dirty card and finds an interesting pointer but does not promote its referent, the card must be left dirty so that it is found at the next collection and scanned again. It would be preferable to discover interesting pointers directly rather than by searching through cards.
discriminated, and hardware write barriers could set bits in a page table [Moon, 1984]. However, it is possible to use operating system support to track writes without special-purpose hardware. Shaw [1988] modified the HP-UX operating system to use its paging system for this purpose. The virtual memory manager must always record which pages are dirty so that it knows whether to write them back to the swap file when they are evicted. Shaw modified the virtual memory manager to intercept a page's eviction and remember the state of its dirty bit, and added system calls to clear a set of page dirty bits and to return a map of pages modified since the last collection. The benefit of this scheme is that it imposes no normal-case cost on the mutator. A disadvantage is that it overestimates the remembered set, since the operating system does not distinguish pages dirtied by writing a pointer from pages dirtied by writing a non-pointer, plus there are the overheads of the traps and operating system calls.
Boehm et al [1991] avoided the need to modify the operating system by write-protecting pages after a collection. The first write to a page since it was protected leads to a fault; the handler records the page as dirty and unprotects it so that subsequent writes proceed unimpeded. The cost is thus proportional to the number of pages written rather than the number of writes. However, these schemes incur further expense. Reading dirty page information from the operating system is expensive. Page protection mechanisms are known to incur 'trap storms' as many protection faults occur in rapid succession, particularly just after a collection.
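The following is a minimal POSIX sketch of this style of write tracking; it is an illustration under stated assumptions (a page-aligned heap, names such as heap_start and page_dirty invented for the sketch), not Boehm et al's implementation, and it elides error handling and thread safety.

    #include <signal.h>
    #include <stdint.h>
    #include <string.h>
    #include <sys/mman.h>

    #define MAX_PAGES (1u << 20)

    static char    *heap_start;            /* page-aligned base of the heap */
    static size_t   heap_pages;            /* number of pages in the heap */
    static size_t   page_size;             /* virtual memory page size */
    static uint8_t  page_dirty[MAX_PAGES]; /* one flag per heap page */

    /* First write to a protected page faults here: log the page, unprotect it. */
    static void on_fault(int sig, siginfo_t *info, void *ctx) {
        char *page = (char *)((uintptr_t)info->si_addr & ~(page_size - 1));
        page_dirty[(size_t)(page - heap_start) / page_size] = 1;
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
    }

    /* After each collection: forget old dirty flags and re-protect the heap. */
    static void reset_write_tracking(void) {
        memset(page_dirty, 0, heap_pages);
        mprotect(heap_start, heap_pages * page_size, PROT_READ);
    }

    static void install_write_tracking(void) {
        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
    }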
Studies by Hosking et al [1992] and Fitzgerald and Tarditi [2000] found no clear winner amongst the different remembered set mechanisms for generational garbage collectors, although neither study explored Sun-style card summarising. Page-based schemes performed worst but, if a compiler is uncooperative, they do provide a way to track where pointers are written. In general, for card table remembered sets, card sizes around 512 bytes performed better than much larger or much smaller cards.
Blackburn and Hosking [2004] examined the overheads of executing different generational barriers alone on a range of platforms. Card marking and four partial barrier mechanisms were studied: a boundary test, a logging test, a frame comparison and a hybrid barrier. They excluded the costs of inserting an entry into the remembered set for the partial barriers. The boundary test checked whether the pointer crossed a space boundary (a compile-time constant). The logging test checked a 'logged' field in the header of the pointer's source object. The frame barrier determined whether a pointer spanned two 2^n-sized and aligned areas of the heap by xoring the addresses of its source and target: such barriers can allow more flexibility in the choice of space to be collected [Hudson and Moss, 1992; Blackburn et al, 2002]. Finally, a hybrid test chose statically between the boundary test for arrays and the logging test for scalars.

They concluded that the costs of the barriers (excluding the remembered set insertion in the case of the partial techniques) were generally small, less than 2%. Even where a write barrier's overhead was much higher, the cost can be more than balanced by improvements in collector performance.
[Figure 11.4: A chunked list; each chunk carries NEXT and PREV chaining pointers, with null terminating each direction.]
Chunked lists
It is common to find list-like data structures in collectors where an array is attractive because it does not require a linked-list pointer or object header for each element, and it achieves good cache locality, but where the unused part of large arrays, and the possible need to move and reallocate a growing array, are problematic. A remembered set in a generational collector is such an example. A chunked list offers the advantage of high storage density but without the need to reallocate, and with relatively small waste and overhead. This data structure consists of a linked list, possibly linked in both directions for a general deque, of chunks, where a chunk consists of an array of slots for holding data, plus the one or two chaining pointers. This is illustrated in Figure 11.4.

A useful refinement of this data structure is to make the size of the chunks a power of two, say 2^k, and align them on 2^k boundaries in the address space. Then logical pointers into a chunk used for scanning, inserting or removing do not need a separate 'current chunk' pointer and an index, but can use a single pointer. Algorithm 11.10 shows code for traversing a bidirectional chunked list in either direction, as a sample of the technique. The modular arithmetic can be performed with shifting and masking.

An important additional motivation for chunking is related to parallelism. If a chunked list or deque represents a work queue, then individual threads can grab chunks instead of individual items. If the chunk size is large enough, this greatly reduces contention on obtaining work from the queue. Conversely, provided that the chunk size is small enough, this approach still admits good load balancing. Another application for chunking is for local allocation buffers (Section 7.7), though in that case the chunks are just free memory, not a dense representation of a list data structure.
Algorithm 11.10 (excerpt): Traversing chunked lists

    bumpToNext(ptr):
        ptr ← ptr + 4
        if (ptr % 2^k) = 0             /* went off the end... */
            ptr ← *(ptr - 2^k + NEXT)  /* ...back up to start of chunk and chain */
        return ptr

    bumpToPrev(ptr):
        ptr ← ptr - 4
        if (ptr % 2^k) < DI            /* went off the beginning of the data... */
            ptr ← *ptr                 /* ...chain */
        return ptr
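As a concrete rendering of the same technique, here is a hedged C sketch using masking in place of the modular arithmetic; the chunk layout (PREV in the first slot, NEXT in the last, data in between, with NEXT and PREV pointing at the adjacent chunk's first and last data slots) is an illustrative assumption, not the book's definitive layout.

    #include <stdint.h>

    #define LOG_CHUNK 9
    #define CHUNK     ((uintptr_t)1 << LOG_CHUNK)  /* 2^k bytes, 2^k-aligned */
    #define DATA_OFF  sizeof(void *)               /* PREV sits in the first slot */
    #define NEXT_OFF  (CHUNK - sizeof(void *))     /* NEXT sits in the last slot */

    typedef uintptr_t word;

    /* Advance the cursor; when it reaches the NEXT slot at the end of the
       chunk, chain to the next chunk's first data slot (stored in NEXT). */
    static word *bump_to_next(word *p) {
        p++;
        if (((uintptr_t)p & (CHUNK - 1)) == NEXT_OFF)
            p = *(word **)p;
        return p;
    }

    /* Move the cursor back; when it falls onto the PREV slot, chain to the
       previous chunk's last data slot (stored in PREV). */
    static word *bump_to_prev(word *p) {
        p--;
        if (((uintptr_t)p & (CHUNK - 1)) < DATA_OFF)
            p = *(word **)p;
        return p;
    }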
Some parts of a system require, or at least are simpler with, large contiguous regions. In a 32-bit address space it can be difficult to lay out the various spaces statically and have them be large enough for all applications. If that were not problematic enough, on many systems we face the added difficulty that the operating system may have the right to place dynamic link libraries (also called shared object files) anywhere it likes within large swaths of the address space. Furthermore, these libraries may not end up in the same place on each run: for security purposes the operating system may randomise their placement. Of course one solution is the larger address space of a 64-bit machine. However, the wider pointers needed in a 64-bit system end up increasing the real memory requirements of applications.

One of the key reasons for using certain large-space layouts of the address space is to make address-oriented write barriers efficient, that is, to enable a write barrier to work by comparing a pointer to a fixed address or to another pointer rather than requiring a table lookup. For example, if the nursery of a generational system is placed at one end of the address space used for the heap, a single check against a boundary value suffices to distinguish writes of pointers referring to objects in the nursery from other writes.

In building new systems, it may be best not to insist on large contiguous regions of address space for the heap, but to work more on the basis of frames, or at least to allow 'holes' in the middle of otherwise contiguous regions. Unfortunately this may then require table lookup for write barriers.

Assuming table lookup costs that are acceptable, the system can manage a large logical address space by mapping it down to the available virtual address space. This does not allow larger heaps, but it does give flexibility in that it removes some of the contiguity requirements. To do this, the system deals with memory in power-of-two sized and aligned frames, generally somewhat larger than a virtual memory page. The system maintains a table indexed by frame number (upper bits of virtual address) that gives each frame's generation number.
Algorithm 11.11: Frame-based generational write barrier

    Write(src, i, ref):
        ct ← frameTableBase
        ...
This barrier costs only a few instructions on a typical processor, particularly if entries in the frame table are a single byte each, simplifying the array indexing operation. Notice also that the algorithm works even if ref is null: we simply ensure that the entry for null's frame has the highest generation number, so the code will always skip the call to remember.
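The listing above survives only in part, so the following C sketch reconstructs a plausible frame-based barrier from the description: one generation byte per frame, younger generations given smaller numbers, and null's frame given the largest. It is a reconstruction under those assumptions, not the book's exact Algorithm 11.11.

    #include <stdint.h>

    #define LOG_FRAME_SIZE 16            /* 64 KiB frames (illustrative) */

    static uint8_t *frame_table;         /* one generation number per frame */

    extern void remember(void **slot);   /* hypothetical remembered-set insert */

    /* Record the slot only if it may create an old-to-young pointer. Null's
       frame holds the largest number, so stores of null always skip remember. */
    static inline void write_barrier(void **slot, void *src, void *ref) {
        if (frame_table[(uintptr_t)src >> LOG_FRAME_SIZE] >
            frame_table[(uintptr_t)ref >> LOG_FRAME_SIZE])
            remember(slot);
        *slot = ref;
    }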
It is further possible to arrange true multiplexing of a large address space into a smaller one; after all, that is what operating systems do in providing virtual memory. One approach would be to use wide addresses and do a check on every access, mimicking in software what the virtual memory hardware accomplishes. This could use the software equivalent of translation lookaside buffers, and so on. The performance penalty might be high. It is possible to avoid that penalty by leveraging the virtual memory hardware, which we discuss in more detail in Section 11.10.

It is good to build into systems from the start the capability to relocate the heap. Many systems have a starting heap or system image that they load as the system initialises. That image assumes it will reside at a particular location in the address space, but what if a dynamic link library has been placed there? Allocating pages in advance can help the system in reserving resources such as swap space, though all virtual memory mapping calls tend to be expensive. Allocating pages in advance can also determine earlier that there are not adequate resources for a larger heap. However, operating systems do not always 'charge' for demand-zero pages until they are used, so simply allocating may not give an early failure indication.
memory protection checking. Implemented in this way the checks have little or no normal-case overhead and furthermore require no explicit conditional branches. A general consideration is that the overhead of fielding the trap, all the way through the operating system to the collector software and back again, can be quite high. Also, changing page protections can be costly, especially in a multiprocessor system where the operating system may need to stop processes currently executing and update and flush their page mapping information. So sometimes an explicit check is cheaper even when the system could use protection traps [Hosking et al, 1992]. Traps are also useful in dealing with uncooperative code, in which it is not possible to cause barriers or checks in any other way.

A consideration, especially in the future, is that there are hardware performance reasons to increase page size. In particular, programs use more memory now than when these page sizes were chosen, while enlarging translation lookaside buffers is constrained by speed and power concerns. But given that translation lookaside buffer size is more or less fixed, staying with a small page size while programs' memory use increases implies more translation lookaside buffer misses. With larger pages, some of the virtual memory 'tricks' we discuss operate at a correspondingly coarser granularity.
Double mapping
Earlier we mentioned double mapping, by which the system maps the same page at two different addresses with different protections. On some systems each physical page can have only one virtual address at a time. The operating system can support double mapping there by effectively invalidating one of the virtual addresses and then connecting the other one. This may involve cache flushing.
We have already mentioned one application of no-access pages: an unconditional read barrier. There are at least two more applications for no-access pages in common use. One is to detect dereferences of null pointers, which we assume to be represented by the value 0. This works by setting page 0, and possibly a few more pages after it, no-access. If a mutator tries to access a field through a null pointer, it will attempt to read or write the no-access page. Since fielding a null pointer dereference exception generally is not required to be fast, this application can be a good trade-off. In the rare case of an access that has a large offset, the compiler can emit an explicit check. If the object layout places headers or other fields at negative offsets from the object pointer, the technique still works provided that one or more pages with very high addresses are set no-access. Most operating systems reserve the high addresses for their own use anyway.
The other common use for a no-access page is as a guard page. For example, the sequential store buffer technique for recording new remembered set entries consists of three steps: ensure there is room in the buffer; write the new element to the buffer; and increment the buffer pointer. The check for room, and the call to the buffer overflow handler routine, can be removed if the system places a no-access guard page immediately after the buffer. Since write barriers can be frequent and their code can be emitted in many places, the guard page technique can speed up mutators and keep their code smaller.
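A hedged C sketch of this arrangement follows: the buffer is backed by a no-access guard page, so the barrier's fast path is just a store and a bump. Buffer size and names are assumptions of the sketch, and the fault handler's recovery logic is not shown.

    #include <stddef.h>
    #include <sys/mman.h>

    #define SSB_BYTES (64 * 1024)   /* buffer capacity (illustrative) */

    static void **ssb_next;         /* next free slot; bumped by the barrier */

    /* Map the buffer with a no-access guard page immediately after it. */
    static void ssb_init(size_t page_size) {
        char *base = mmap(NULL, SSB_BYTES + page_size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        mprotect(base + SSB_BYTES, page_size, PROT_NONE);  /* the guard page */
        ssb_next = (void **)base;
    }

    /* Write barrier fast path: no overflow check. Running off the end faults
       on the guard page, where a handler would empty the buffer and resume. */
    static inline void ssb_record(void *interesting_slot) {
        *ssb_next++ = interesting_slot;
    }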
Some systems apply the same idea to detecting stack or heap overflow by placing a guard page at the end of the stack (heap). To detect stack overflow, it is best if a procedure's prologue touches the most remote location of the new stack frame it desires to build. That way the trap happens at a well-defined place in the code. The handler can grow the stack by reallocating it elsewhere, or add a new stack segment, and then restart the mutator with an adjusted stack and frame pointer. Likewise, when using sequential allocation, the allocator can touch the most remote word of the desired new object and cause a trap if it falls into the guard page that marks the end of the sequential allocation area.

In either case, if the new stack frame or object is so large that its most remote word might lie beyond the guard page, the system needs to use an explicit check. But such large stack frames and objects are rare in many systems, and in any case a large object will take more time to initialise and use, which amortises the cost of the explicit check.
No-access pages can also help in supporting a large logical address space in a smaller virtual address space. An example is the Texas persistent object store [Singhal et al, 1992]. Using the strategy for persistence (maintaining a heap beyond a single program execution) goes beyond our scope, but the mechanism is suitable for the non-persistent case as well.

In this approach the system works in terms of pages, of the same size as virtual memory pages or some power-of-two multiple of that. The system maintains a table that indicates where each logical page is: either or both of an address in (virtual) memory and a location in an explicitly managed swapping file on disk. A page can be in one of four states:

• Unallocated: Not yet used, empty.

• Resident: In memory and accessible; it may or may not have a disk copy saved yet.

• Non-resident: On disk and not accessible.

• Reserved: On disk and not accessible, but with specific virtual memory reserved.
How can the system free up the Reserved virtual space for re-use? It must determine that there are no longer any Resident pages referring to the Reserved page. It can help make this happen by evicting pages that refer to the Reserved page. At that point the page can become Non-resident and the system can reuse the space. Notice that Resident pages refer to each other and to Reserved pages, but never directly to data in Non-resident pages.

Now consider what happens if the program accesses a Reserved page (and if there are evicted data that are reachable in the object graph, then there must be Reserved pages). The system looks up the page's logical address and fetches it from disk. It then goes through the page's pointers and replaces long logical addresses with short virtual addresses (called pointer swizzling). For referents on pages that are Resident or Reserved, this consists of just a table lookup. If the referent is itself on a Non-resident page, then the system must reserve virtual address space for that page, and then replace the long address with a pointer to the newly Reserved page. Acquiring virtual address space for these newly Reserved pages may require evicting other pages so that some page(s) can be made Non-resident and their virtual address space recycled.
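The swizzling step might be sketched in C as follows; the page-table layout and the helpers lookup and reserve_vspace are assumptions made for illustration, not the Texas implementation.

    #include <stdint.h>

    #define LOG_PAGE 12

    typedef uint64_t logical_addr;         /* long, persistent-format address */

    enum pstate { UNALLOCATED, RESIDENT, NONRESIDENT, RESERVED };

    struct page_entry {                    /* one entry per logical page */
        enum pstate state;
        char       *vaddr;                 /* short (virtual) address, if any */
    };

    extern struct page_entry *lookup(uint64_t logical_page);  /* hypothetical */
    extern char *reserve_vspace(uint64_t logical_page);       /* may evict pages */

    /* Swizzle one slot: turn a long logical address into a short pointer,
       reserving (no-access) virtual space for a non-resident referent. */
    static void *swizzle(logical_addr la) {
        struct page_entry *e = lookup(la >> LOG_PAGE);
        if (e->state == NONRESIDENT) {
            e->vaddr = reserve_vspace(la >> LOG_PAGE);
            e->state = RESERVED;           /* on disk, but now addressable */
        }
        return e->vaddr + (la & ((1u << LOG_PAGE) - 1));
    }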
Just as an operating system virtual memory manager needs good page replacement policies, so the Texas approach needs a policy, though it can reasonably borrow from the vast store of virtual memory management algorithms.

How does the scheme work in the presence of garbage collection? It is clear that a full heap garbage collection of a heap larger than the virtual address space is probably going to involve significant performance penalties. Collection of persistent stores has its own literature and lies beyond our scope. However, we can say that partitioned schemes can help, and techniques like Mature Object Space [Hudson and Moss, 1992] can offer completeness.

Related techniques include the Bookmarking collector [Hertz et al, 2005; Bond and McKinley, 2008]. However, the purpose of bookmarking is more to avoid thrashing real memory; it does not extend the logical address space beyond the physical. Rather, it summarises the outgoing pointers of pages evicted by the operating system so that the collector can avoid touching evicted pages and thus remain within the working set, at a possible loss of precision similar to that occasioned by remembered sets and generational collection: the collector may trace from pointers in dead objects of evicted pages.
Other approaches to adjusting the size of the heap include choosing which pages to page out, as in the Bookmarking collector [Hertz et al, 2005; Hertz, 2006], and having the collector save rarely accessed objects to disk [Bond and McKinley, 2008].

Alonso and Appel [1990] devised a scheme where an 'advice server' tracks virtual memory usage. Each managed process reports to the server the size of its heap, how long it has been since the last full collection and how much mutator and collector CPU time it has expended since the last collection. The advice server determines an additional amount of space that appears safe for the process to use, and the process adjusts its heap size accordingly. The aim is to maximise throughput of the managed processes without causing other processes to thrash either.
In contrast to Alonso and Appel, Brecht et al [2001, 2006] control the growth in heap size for Java applications without reference to operating system paging information. Rather, for a system with a given amount of real memory (they considered 64 and 128 megabytes) they give a series of increasing thresholds, T1 to Tk, stated as fractions of the real memory of the system. At any given time, a process uses a heap of size Ti for some i. If collecting at size Ti yields less than a fraction Ti+1 - Ti of the space reclaimed, the system increases the threshold from Ti to Ti+1. They considered the Boehm-Demers-Weiser collector [Boehm and Weiser, 1988], which cannot shrink its heap, so their approach deals only with heap growth. The thresholds must be determined empirically, and the approach further assumes that the program in question is the only program of interest running on the system.
Cooper et al [1992] present an approach that aims for a specified working set size for an Appel-style SML collector running under the Mach operating system. They adjust the nursery size to try to stay within the working set size, and they also do two things specific to Mach. One is that they use a large sparse address space and avoid the need to copy tospace to lower addresses to avoid hitting the end of the address space. This has little to do with heap sizing, but does reduce collector time. The second thing specific to Mach is having the collector inform the Mach pager that evacuated fromspace pages can be discarded and need not be paged out; if referenced again, such pages can be offered back to the application with arbitrary contents, which the allocator will zero as necessary. Cooper et al obtain a four-fold improvement in elapsed time for a small benchmark suite, with about half of the improvement coming from the heap size adjustment. However, the target working set size must still be determined by the user.
Yang et al [2004] modify a stock Unix kernel to add a system call whereby an application can obtain advice as to how much it may increase its working set size without thrashing, or how much to decrease it to avoid thrashing. They modify garbage collectors of several kinds to adjust their heap size using this information. They demonstrate the importance of adaptive heap sizing in obtaining the best performance as memory usage by other processes changes. They introduce the notion of the footprint of a program, which is the number of pages it needs in memory to avoid increasing the running time by more than a specified fraction t, set to five or ten percent. For a garbage collected program, the footprint depends on the heap size, and for copying collectors, also on the survival rate from full collections, that is, the live size. However, an observation they make, not unlike Alonso and Appel, is that the key relationship is between how the footprint changes for a given change in heap size.
Of the indicators available to a process, page outs have to do with the pages of that process, and page faults and allocation stalls occur because of actions of the process. Of these three possible indicators that a system is under so much memory load that shrinking the heap might be wise, Hertz et al [2009] find that the number of allocation stalls is the best indicator to use. When a collection sees no allocation stalls, their collector grows the heap by an amount originally set to 2% of the user-specified heap size; values between 2% and 5% gave similar results. If a collection experiences allocation stalls, the collector shrinks the nursery so that the total heap space, including the reserve into which the nursery is copied, fits within the space used the last time that there were no allocation stalls. This leaves the nursery cut by up to 50%. In the absence of memory pressure, the scheme performs similarly to a non-adjusting baseline, while in the presence of memory pressure it performs close to the non-pressure case while the baseline system degrades substantially.
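The grow/shrink rule just described reduces to a few lines; this C fragment is a paraphrase under assumed names (the stall count and sizes would come from the collector and the operating system), not code from the system itself.

    #include <stddef.h>

    /* Called after each collection; 'stalls' is the number of allocation
       stalls observed since the last collection. */
    static void adjust_heap_size(size_t *heap_size, size_t user_heap_size,
                                 long stalls, size_t last_stall_free_size) {
        if (stalls == 0)
            *heap_size += user_heap_size / 50;  /* grow by 2% of requested size */
        else if (*heap_size > last_stall_free_size)
            *heap_size = last_stall_free_size;  /* shrink so the total heap fits
                                                   the last stall-free footprint */
    }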
The schemes we have discussed so far concern adjusting individual processes' use of memory. A related multiprocess scheme additionally triggers a full collection if the resident set size decreases. The resulting system appears to size competing processes' heaps well to achieve the best throughput.

The dynamic heap sizing mechanism proposed by Zhang et al [2006] is similar in spirit to that of Hertz et al [2009], but has the program itself check the number of page faults at each collection and adjust the target heap size itself, rather than building the mechanism into the collector. Unlike the other mechanisms we have discussed, they assume that the user has somehow identified the phases of the program and inserted code to consider forcing collection at phase changes. They showed that dynamic adaptive heap sizing can substantially improve performance over any single fixed heap size.
the discretion of the implementer, and the decisions may be made in order to improve performance or the robustness of the run-time system.

We need to consider what requirements are made of allocation and initialisation. Is the language run-time's job simply to allocate some space of sufficient size, must some header fields be initialised before the object can be usable, or must initialising values for all the fields be provided? It is common to inline the allocation fast path, which does the least work, but not the other, slower paths, which might involve obtaining space from a lower-level allocator or invoking the collector. However, too much inlining can explode the size of the code and negate the benefit hoped for. Similarly, it might be desirable to dedicate a register to a particular purpose, such as the bump-pointer for sequential allocation, though that could be counterproductive on a register-poor platform.

Depending on the language supported, for safety or for debugging, the run-time may zero memory. Space could be zeroed as objects are allocated, but bulk zeroing with a well-optimised library routine is likely to be more efficient. Should memory be zeroed shortly before it is used (for best cache performance) or immediately when it is freed, which may help with debugging (though here writing a special value might be more useful)?
Collectors need to find pointers in order to determine reachability. Should the run-time scan conservatively for pointers, which avoids the effort of implementing type-accurate collection? On the other hand, scanning stacks for pointers conservatively constrains the choice of collection algorithm, as objects directly reachable from stacks cannot be moved. Systems generally provide stack maps to determine from a return address the function within which the address lies. Polymorphic functions and language constructs such as Java's jsr bytecode complicate their use. The implementer must also decide when stack maps should be generated and when they can be used. Should the maps be generated in advance, or should we defer generating them until the collector needs one, thereby saving space? Is a map only valid at certain safe-points? Stack maps can be large: how can they be compressed, especially if they must be valid at every instruction? Stack scanning also raises the question of whether the stack should be scanned in its entirety, atomically, or incrementally. Incremental stack scanning is more complex but offers two benefits. First, it can bound the amount of work done in an increment (which may be important for real-time collectors). Second, by noting the portion of the stack that has not changed since the last time it was scanned, we can reduce the amount of work that the collector has to do.
Language semantics and compiler optimisations raise further questions. How should interior and derived pointers be handled? Languages may allow access to objects from outside the managed environment, typically from code written in C or C++, and every language needs to interact with the operating system for input/output. The run-time must ensure that objects are not reclaimed while they are being used by external code and that external code can find these objects. Typically, this may involve pinning such objects or providing access to them through handles.

Some systems may allow a garbage collection at any point. However, it is usually simpler to restrict where collection can happen to specific GC-safe points. Typically these include allocation, backward branches, and function entry and return. There are alternative ways to cause a thread to suspend at a GC-point. One way is to have threads poll by checking a flag that indicates that a collection has been requested. An alternative is to patch the code of a running thread to roll it forward to the next GC-point. The handshake between collector and mutator thread can be achieved by having threads check a thread-local variable, by setting a processor condition code in the saved thread state of a suspended thread, by hijacking return addresses or through operating system signals.
Several classes of garbage collection algorithm require 'interesting' pointers to be detected as mutators run. This opens up a wide range of design policies and implementations for the detection and recording of these pointers. As barrier actions are very common, it is essential to minimise any overhead they incur. Barriers may be short sequences of code inserted by the compiler before pointer loads or stores, or they may be provided through operating system support, such as page protection traps. As always, there are trade-offs to be considered. In this case, the trade-offs are between the cost to the mutator and the cost to the collector, and between precision of recording and speed of a barrier. In general, it is better to favour adding overhead to relatively infrequent collector actions (such as discovering roots) than to very frequent mutator actions (such as heap stores). Adding a write barrier can increase the instruction count for a pointer write by a factor of two or more, though some of this cost may be masked by cache access times.

How accurately should pointer writes be recorded? Unconditional logging may impose less overhead on the mutator than filtering out uninteresting pointers, but the implementation of the remembered set is key to this decision. How much filtering should be inline? Careful tuning is essential here. At what granularity is the location of the pointer to be recorded? Should we record the field overwritten, the object, or the card or page on which it resides? Should we allow the remembered set to contain duplicate entries? Should arrays and non-arrays be treated in the same way?

What data structures should be used to record the location of interesting pointers: hash tables, sequential store buffers, cards or a combination of these? How does this choice vary the overheads between the mutator and the collector? Data structures may overflow: how can this be handled safely and efficiently? Card tables offer an imprecise recording mechanism. At collection time they must be scanned to find dirty cards and hence objects that may contain interesting pointers. This raises three performance questions. What size should a card be? Card tables are often sparse: how can we speed up the search for dirty cards? Should a two-level card table be used? Can we summarise the state of a card, for example if it contains only one modified field or object? Once a dirty card is found, the collector needs to find the first object on that card, but that object may start on an earlier card. We need a crossing map that decodes how objects span cards. How does card marking interact with multiprocessor cache coherency protocols? If two processors repeatedly write to different objects on the same card, both will want exclusive access to the card's cache line. Is this likely to be a problem in practice?

In systems run with virtual memory, it is important that garbage collected applications fit within available real memory. Unlike non-managed programs, garbage collected ones can adjust their heap size so as to fit better within available memory. What events and counts does the particular operating system provide that a collector might use to adjust heap size appropriately? Which of these events or counts are most effective? What is a good policy for deciding when to grow or shrink the heap?
Chapter 12

Language-specific concerns

12.1 Finalisation
Consider the example of open files: the operating system identifies each one by a small integer called a file descriptor, and the interface limits the number of files that a given process may have open at one time. A language implementation will generally have, for each open file, an object that the programmer uses to manage that file stream. Most of the time it is clear when a program has finished with a given file stream, and the program can ask the run-time system to close the stream, which can close the corresponding file descriptor at the operating system interface, allowing the descriptor number to be reused.

But if the file stream is shared across a number of components in a program, it can be difficult to know when they have all finished with the stream. If each component that uses a given stream sets its reference to the stream to null when the component is finished with the stream, then when there are no more references the collector can (eventually) detect that fact. We show such a situation in Figure 12.1.
[Figure 12.1: A FileStream object in the garbage collected application, whose field desc = 3 names an entry in the operating system's open file table.]
Perhaps we can arrange for the collector to invoke a finaliser in that case: a piece of code run when the object is no longer reachable from the application, that is, not from mutator roots. In Figure 12.2 we show the previous situation but with a finaliser added. The finaliser's call to close the descriptor is conditional, since the application may have already closed the file.

In a reference counting system, before freeing an object the collector checks the finalisation table to see if the object requires finalisation. If it does, then the collector causes the finaliser function to run, and removes the object's entry in the finalisation table. Similarly, in a tracing system, after the tracing phase the collector checks the finalisation table to see if any untraced object has a finaliser, and if so, the collector causes the finaliser to run, and so on.

There are a range of subtly different ways in which finalisation can work. We now consider some of the possibilities and issues.
When do finalisers run?

At what time do finalisers run? In particular, finalisation might occur during collection, as soon as the collector determines the need for it. However, the situation during collection might not support execution of general user code. For example, it may not be possible for user code to allocate new objects at this time. Therefore most finalisation approaches run finalisers after collection. The collector simply queues the finalisers. To avoid the need to allocate space for the queue during collection, the collector can partition the finalisation table into two portions, one for objects queued for finalisation and one for objects that have a finaliser but are not yet queued. When the collector enqueues an object for finalisation, it moves that queue entry to the enqueued-objects partition. A simple, but possibly inefficient, approach is to associate an enqueued flag with each entry and have the finalisation activity scan the finalisation table. To avoid scanning, we can group the enqueued objects together in the table, perhaps permuting entries when the collector needs to enqueue another object.
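A hedged C sketch of that last variant: the finalisation table is one array whose front segment holds the enqueued objects, so enqueueing is a constant-time swap and needs no allocation during collection. The names and the fixed capacity are assumptions of the sketch.

    #define MAX_FINALISABLE 1024

    struct fin_entry {
        void  *object;
        void (*finaliser)(void *);
    };

    static struct fin_entry fin_table[MAX_FINALISABLE];
    static int fin_count;      /* total live entries in the table */
    static int fin_enqueued;   /* entries [0, fin_enqueued) await finalisation */

    /* Enqueue entry i by swapping it into the enqueued partition: O(1) time
       and no allocation, so it is safe to do during collection. */
    static void enqueue_for_finalisation(int i) {
        struct fin_entry tmp = fin_table[i];
        fin_table[i] = fin_table[fin_enqueued];
        fin_table[fin_enqueued] = tmp;
        fin_enqueued++;
    }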
[Figure 12.2: The situation of Figure 12.1 with a finaliser added: a table of objects that have finalisers refers to the FileStream, whose method table includes finalize() { if isOpen close(desc); }.]
In general, finalisers affect shared state; there is little reason to operate only on finalisable objects since they are about to disappear. For example, finalisers may need to access some global data to release a shared resource, and so often need to acquire locks. This is another reason not to run finalisers during collection: it could result in a deadlock. Worse, if the run-time system provides re-entrant locks (locks where the same thread can acquire a lock that it already holds) we can have the absence of deadlock yet silent corruption of the state of the application.1
Even assuming that finalisers run after collection, there remain several options as to exactly when they run. One possibility is immediately after collection, before mutator thread(s) resume. This improves promptness of finalisation but perhaps to the detriment of mutator pause time. Also, if finalisers communicate with other threads, which remain blocked at this time, or if finalisers compete for locks on global data structures, this policy could lead to communication problems or deadlock.

A last consideration is that it is not desirable for a language's specification of finalisation to constrain the possible collection techniques. In particular, collection on the fly, concurrent with mutation, most naturally leads to running finalisers at arbitrary times, concurrent with mutator execution.

1Java avoids this by indicating that a finalisation thread will invoke a finaliser with no locks held. Thus the finalisation thread must be one that does not hold a lock on the object being finalised. In practice this pretty much requires finalisation threads to be distinct threads used only for that purpose.
For example, a finaliser for an object of type T might run at the same time as the allocation and initialisation code for a new instance of T. Any shared data structures must therefore be synchronised to handle that case.2

In a single-threaded language, which thread runs a finaliser is not a question, though it does re-raise the question of when finalisers run. Given the difficulties previously mentioned, it appears that the only feasible and safe way, in general, to run finalisers in a single-threaded system is to queue them and have the program run them under explicit control. In a multithreaded system, as previously noted, it is best that distinct finalisation threads invoke finalisers, to avoid issues around locks.
In many cases it is convenient for a finaliser to access the state of the object being reclaimed. In the file stream example, the operating system file descriptor number, a small integer, might most conveniently be stored as a field in the file stream object, as we showed in Figure 12.2. The simplest finaliser can read that field and call on the operating system to close the file (possibly after flushing a buffer of any pending output data). Notice that if the finaliser does not have access to the object, and is provided no additional data but is just a piece of code to run, then finalisation will not be very useful: the finaliser needs some context for its work. In a functional language, this context may be a closure; in an object-oriented language it may be an object. Thus the queuing mechanism needs to provide for the passing of arguments to the finaliser.

On balance it seems more convenient if finalisers can access the object being finalised. Assuming finalisers run after collection, this implies that objects enqueued for finalisation survive the collection cycle in which they are enqueued. So that finalisers have access to everything they might need, the collector must also retain all objects reachable from objects enqueued for finalisation. This implies that tracing collectors need to operate in two passes. The first pass discovers the objects to be finalised, and the second pass traces and preserves objects reachable from the finaliser queue. In a reference counting collector the system can increment the object's reference count as it enqueues it for finalisation, that is, counting the queue's reference as an ordinary one.

2Java has a special rule to help prevent this: if an object's finaliser can cause synchronisation on the object, then the object is considered mutator reachable whenever its lock is held. This can inhibit removal of synchronisation.
[Figure: a BufferedStream and StringBuffer in the garbage collected application, with their method tables; the BufferedStream wraps the FileStream discussed in the text.]
3As a more subtle point, note that unless we can guarantee that the FileStream is used only by the BufferedStream, then the BufferedStream should not close the FileStream. Unfortunately this implies that it may require two collection cycles before the file descriptor is closed.
[Figure 12.4: (a) objects A and B, each with a finaliser, cross referencing each other; (b) B split into B and B', where only B' has a finaliser and B' does not refer to A.]
Notice that if we impose order on finalisations, ultimate finalisation may be slow, since we finalise only one 'level' in the order at each collection. That is, in a given collection we finalise only those unreached objects that are not themselves reachable from other unreached objects. This proposal has a significant flaw: it does not handle cycles of unreachable objects where more than one needs finalisation. Given that such cases appear to be rare, it seems simpler and more helpful to guarantee finalisation in order of reachability; that is, if B is reachable from A, the system should invoke the finaliser for A first.

In the rare case of cycles, the programmer will need to get more involved. Mechanisms such as weak references (see Section 12.2) may help, though using them correctly may be tricky. A general technique is to separate out fields needed for finalisation in such a way as to break the cycle of objects needing finalisation, as suggested by Boehm [2003]. That is, if A and B have finalisers and cross reference each other as shown in Figure 12.4a, we can split B into B and B', where B does not have a finaliser but B' does (see Figure 12.4b). A and B still cross reference each other, but (importantly) B' does not refer to A. In this scenario, finalisation in reachability order will finalise A first and then B'.
Algorithm 12.1: Process finalisation queue

    process_finalisation_queue():
        while not isEmpty(Queue)
            while not isEmpty(Queue)
                obj ← remove(Queue)
                obj.finalize()
            if desired             /* whatever condition is appropriate */
                collect()
An alternative is for the system to place objects needing finalisation on a programmer-defined queue. In the queueing approach, the programmer will add code to the program at desirable (that is, safe) points. The code will process any enqueued objects needing finalisation. Since running finalisers can cause other objects to be enqueued for finalisation, such queue-processing code should generally continue processing until the queue is empty, and may want to force collections if it is important to reclaim resources promptly. Suitable pseudocode appears in Algorithm 12.1. As previously noted, the thread that runs this algorithm should not be holding a lock on any object to be finalised, which constrains the places where this processing can proceed safely.

The pain involved in this approach is the need to identify appropriate places in the code at which to empty the finalisation queue. In addition to sprinkling enough invocations throughout the code, the programmer must also take care that invocations do not happen in the middle of other operations on the shared data structures. Locks alone cannot prevent this, since the invoking thread may already hold the lock, and thus can be allowed to proceed. This is the source of the statement in the Java Language Specification that the system will invoke finalize methods only while holding no user-visible locks.
This pretty much means that finalisation runs in one or more separate threads, even though the specification is not quite worded that way. If finalize throws an exception, the Java system ignores it and moves on. If the finalised object is not resurrected, a future collection will reclaim it. Java also provides support for programmer-controlled finalisation through appropriate use of the java.lang.ref API, as we describe in Section 12.2.
Lisp. Liquid Common Lisp offers a kind of object called a finalisation queue. The programmer can register an ordinary object with one or more finalisation queues. When the registered object becomes otherwise unreachable, the collector enters it into the finalisation queues with which it was registered. The programmer can extract objects from any finalisation queue and do with them what she will. The system guarantees that if objects A and B are both registered and become unreachable in the same collection, and B is reachable from A but not vice versa, then the collector will enter A in the finalisation queue before it enters B. That is, it guarantees order of finalisation for acyclic object graphs. The finalisation queues of Liquid Common Lisp are similar to the guardians described by Dybvig et al [1993].
CLisp offers a simpler mechanism: the programmer can request that the collector call a given function f when it detects that a given object O is no longer reachable. In this case f must not refer to O or else O will remain reachable and the system will never call the finaliser. Since f receives O as an argument, this system permits resurrection. Also, f could register O again, so O can be finalised more than once. A variant of the basic mechanism allows the programmer to specify a guardian G in addition to the object O and function f. In this case, when O becomes unreachable the system calls f only if G is still reachable. If at this time G is unreachable, then the system reclaims O but does not call f. This can be used to implement guardians of the kind described by Dybvig et al [1993].
C++ destructors conventionally release unmanaged resources, such as open file handles and the like. If a kind of object needs finalisation, then the destructor should call the finaliser, to cover the case when the object is reclaimed explicitly by compiler-generated code. However, the collector will call the finaliser if the object is being reclaimed implicitly, that is, by the collector; in that case the destructor will not be called. In any case, the finalisation mechanism itself is very similar to that of Java. The end result is a mixture of C++ destructors and something close to Java finalisation, with both synchronous and asynchronous invocation of finalisers possible.
If an object is reachable only via weak references, the collector may reclaim the object and set any weak reference to the object to null. Such objects are called weakly-reachable. As we will see, the collector may also take additional action, such as notifying the mutator that a given weak reference has been set to null.

In the case of the canonicalisation table for variable names, if the reference from the table to the name is a weak reference, then once there are no ordinary references to the string, the collector can reclaim the string and set the table's weak reference to null. Notice that the table design must take this possibility into account, and it may be necessary or helpful for the program to clean up the table from time to time. For example, if the table is organised by hashing with each hash bucket being a linked list, defunct weak references result in linked list entries whose referent is null. We should clean those out of the table from time to time. This also shows why a notification facility might be helpful: we can use it to trigger the cleaning up.
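For instance, a clean-up pass over one hash bucket might look like this C sketch, where each entry holds a weak reference that the collector nulls when the referent dies; the structure is assumed for illustration.

    #include <stdlib.h>

    struct entry {
        void         *weak_ref;   /* set to NULL by the collector when the
                                     referent is reclaimed */
        struct entry *next;
    };

    /* Unlink and free the bucket entries whose weak referent has been cleared. */
    static void clean_bucket(struct entry **bucket) {
        struct entry **p = bucket;
        while (*p != NULL) {
            if ((*p)->weak_ref == NULL) {
                struct entry *dead = *p;
                *p = dead->next;          /* unlink the defunct entry */
                free(dead);
            } else {
                p = &(*p)->next;
            }
        }
    }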
Below we offer a more general definition of weak references, which allows several different strengths of reference, and we indicate how a collector can support them, but first we consider how to implement just two strengths: strong and weak. First, we take the case of a tracing collector, which processes weak references in a second pass after its main tracing pass. If a weak reference's target was reached in the first pass, then the collector retains the weak reference, and in copying collectors it updates the weak reference to refer to the new copy. If a weak reference's target was not reached, the collector sets the weak reference to null, thus making the referent no longer reachable. At the end of the second pass, the collector can reclaim all unreached objects.
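The two-pass structure can be sketched as follows in C; is_marked, forwarding_address and the weak-reference list are assumed runtime facilities for the illustration.

    struct weak_ref {
        void           **slot;    /* location holding the weak pointer */
        struct weak_ref *next;
    };

    extern int   is_marked(void *obj);           /* reached in the first pass? */
    extern void *forwarding_address(void *obj);  /* new copy (copying collectors) */

    static struct weak_ref *weak_refs;           /* all registered weak references */

    /* Second pass, run after all strong references have been traced. */
    static void process_weak_refs(int copying) {
        struct weak_ref *w;
        for (w = weak_refs; w != NULL; w = w->next) {
            void *target = *w->slot;
            if (target == NULL)
                continue;
            if (is_marked(target))
                *w->slot = copying ? forwarding_address(target) : target;
            else
                *w->slot = NULL;   /* referent unreached: clear the reference */
        }
    }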
The collector must be able to identify a weak reference. It may be possible to use a bit in the reference to indicate that it is weak. For example, if objects are word-aligned in a byte-addressed machine, then pointers normally have their low two bits set to zero. One of those bits could indicate a weak reference if the bit is set to one. This approach has the disadvantage that it requires the low bits to be cleared before trying to use a reference that may be weak. That may be acceptable if weak references arise only in certain restricted places in a given language design. Some languages and their implementations may use tagged values anyway, and this simply requires one more possible tag value. Another disadvantage of this approach is that the collector needs to find and null all weak references to objects being reclaimed, requiring another pass over the collector roots and heap, or that the collector remember from its earlier phases of work where all the weak pointers are.
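A few C macros illustrate the low-bit tagging convention described here; this is a sketch of the convention, not any particular system's implementation.

    #include <stdint.h>

    #define WEAK_BIT 0x1u   /* word alignment leaves the low pointer bits zero */

    /* Tag a reference as weak by setting a low bit. */
    #define MAKE_WEAK(p) ((void *)((uintptr_t)(p) | WEAK_BIT))
    #define IS_WEAK(p)   (((uintptr_t)(p) & WEAK_BIT) != 0)
    /* Strip the tag before dereferencing a possibly-weak reference. */
    #define UNTAG(p)     ((void *)((uintptr_t)(p) & ~(uintptr_t)WEAK_BIT))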
An alternative to using low bits is to use high order bits and double-map the heap. In this case every heap page appears twice in the virtual address space, once in its natural place and again at a high memory (different) address. The addresses differ only in the value of a chosen bit near the high-order end of the address. This technique avoids the need to mask pointers before using them, and its test for weakness is simple and efficient. However, it uses half the address space, which may make it undesirable except in large address spaces.
Perhaps the most common implementation approach is to use indirection, so that specially marked weak objects hold the weak references. The disadvantage of the weak object approach is that it is less transparent to use (it requires an explicit dereferencing operation on the weak object) and it imposes a level of indirection. It also requires allocating a weak object in addition to the object whose reclamation we are trying to control. However, an advantage is that weak objects are special only to the allocator and collector; to all other code they are like ordinary objects. A system can distinguish weak objects from ordinary ones by setting a particular bit in the object header reserved for that purpose. Alternatively, if objects have custom-generated tracing methods, weak objects will just have a special one.

How does a programmer obtain a weak reference (weak object) in the first place? In the case of true weak references, the system must supply a primitive that, when given a strong reference to object O, returns a weak reference to O. In the case of weak objects, the weak object types likewise supply a constructor that, given a strong reference to O, returns a new weak object whose target is O. It is also possible for a system to allow programs to change the referent field in a weak object.
Additional motivations
Canonicalisation tables are but one example of situations where weak references of some kind help solve a programming problem, or solve it more easily or efficiently. Another example is managing a cache whose contents can be retrieved or rebuilt as necessary. Such caches embody a space-time trade-off, but it can be difficult to determine how to control
An object is α-reachable (without the superscript *) if it is α*-reachable but not (α+1)*-reachable. Equivalently, an object is α-reachable if every path to it from a root includes at least one reference of strength α, and at least one path includes no references of strength less than α. Below we will use the names of strengths in place of numeric values; the values are somewhat arbitrary anyway, since what we rely on is the relative order of the strengths. Also, for gracefulness of expression, we will say Weakly-reachable instead of Weak-reachable, and so on.

Each level of strength will generally have some collector action associated with it. The strengths defined by Java4 are as follows.

Strong: Ordinary references have the highest strength. The collector never clears these.

Soft: The collector can clear a Soft reference at its discretion, based on current space usage. If a Java collector clears a Soft reference to object O (that is, sets the reference to null), it must at the same time atomically5 clear all other Soft references from which O is Strongly-reachable. This rule ensures that after the collector clears the reference, O will no longer be Softly-reachable.

Weak: The collector must clear a (Soft*-reachable) Weak reference as soon as the collector determines its referent is Weak-reachable (and thus not Soft*-reachable). As with Soft references, if the collector clears a Weak reference to O, it must at the same time clear all other Soft*-reachable Weak references from which O is Soft*-reachable.

Finaliser: We term a reference from the table of objects with finalisers to an object that has a finaliser a finaliser reference. We described Java finalisation before, but list it here to clarify the relative strength of this kind of weak reference, even though it is internal to the run-time system as opposed to a weak object exposed to the programmer.

4In fact, we are not aware at this time of any language other than Java that supports multiple strengths, but the idea may propagate in the future.
5By atomically the Java specification seems to mean that no thread can see just some of the references cleared: either all of them are cleared or none are. This can be accomplished by having the reference objects consult a shared flag that indicates whether the referent field should be treated as cleared, even if it is not yet set to null. The reference object can itself contain a flag that indicates whether the single global flag should be consulted, that is, whether the reference is being considered for clearing. Doing this safely requires synchronisation in concurrent collectors.
1. Working from the roots, trace and copy all Strongly-reachable objects, noting (but not tracing through) any Soft, Weak, or Phantom objects found.

2. Optionally, clear all Soft references atomically.6 If we chose not to clear Soft references, then trace and copy from them, finding all Soft*-reachable objects, continuing to note any Weak or Phantom objects found by tracing through Soft objects.

3. If the target of any Weak object noted previously has been copied, update the Weak object's pointer to refer to the new copy. If the target has not been copied, clear the Weak object's pointer.

4. If any object requiring finalisation has not yet been copied, enqueue it for finalisation. Go back to Step 1, treating the objects newly enqueued for finalisation as new roots. Notice that in this second round through the algorithm there can be no additional objects requiring finalisation.7

Note that the collector cannot clear any Phantom object's pointer; the programmer must do that explicitly.
While we worded the steps as for a copying collector, they work just as well for mark-sweep collection. However, it is more difficult to construct a reference counting version of the Java semantics. One way to do this is not to count the references from Soft, Weak and Phantom objects in the ordinary reference count, but rather to have a separate bit to indicate if an object is a referent of any of these Reference objects. It is also convenient if an object has a separate bit indicating that it has a finaliser. We assume that there is a global table that, for each object O that is the referent of at least one Reference object, indicates those Reference objects that refer to O. We call this the Reverse Reference Table.

6It is legal to be more selective, but following the rules makes that difficult. Note that by 'all' we mean all Soft references currently in existence, not just the ones found by the collector so far.
7Barry Hayes pointed out to us that a Weak object w1 might be reachable from an object x requiring finalisation, and the Weak object's referent y might be some object also requiring finalisation, which has another Weak object w2 referring to it, that is, both w1 and w2 are Weak objects that refer to y. If w2 is strongly reachable, then it will have been cleared, while w1 may not be cleared yet if it is reachable only from x. This situation becomes especially strange if the finaliser for y resurrects y, since then w2 is cleared but y is now Strongly-reachable. Unfortunately the issue seems inherent in the way Java defines Weak objects and finalisation.
When the ordinary reference count of an object that is the referent of at least one Reference goes to zero, we check the Reverse Reference Table. Here are the cases for handling the Reference objects that refer to the object whose ordinary reference count went to zero; we assume they are processed from strongest to weakest.

Weak: Clear the referent field of the WeakReference and enqueue it if requested.

Finaliser: Enqueue the object for finalisation. Let the entry in the finalisation queue count as an ordinary reference. Thus, the reference count will go back up to one. Clear the object's 'I have a finaliser' bit.

Phantom: If the referent has a finaliser, then do nothing. Otherwise, enqueue the Phantom. In order to trigger reconsideration of the referent for reclamation, increment its ordinary reference count and mark the Phantom as enqueued. When the Phantom's reference is cleared, if the Phantom has been enqueued, decrement the referent's ordinary reference count. Do the same processing when reclaiming a Phantom reference.

There are some more special cases to note. When we reclaim a Reference object, we need to remove it from the Reverse Reference Table. We also need to do that when a Reference object is cleared.
A tricky case is when a detector of garbage cycles finds such a cycle. It appears that, before doing anything else, we need to see if any of the objects is the referent of a Soft object, and in that case retain them all, but keep checking periodically somehow. If none are Soft referents but some are Weak referents, then we need to clear all those Weak objects atomically, and enqueue any objects requiring finalisation. Finally, if none of the previous cases apply but there are some Phantom referents to the cycle, we need to retain the whole cycle and enqueue the Phantoms. If no object in the cycle is the referent of a Reference object or requires finalisation, we can reclaim the whole cycle.
Suppose we have two objects, A and B, that we wish to finalise in that order. One way to do this is to create a Phantom object A', a Phantom reference to A. In addition, this Phantom reference should extend the Java PhantomReference class so that it holds an ordinary (strong) reference to B in order to prevent early reclamation of B.8 We illustrate this situation in Figure 12.5.
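A minimal Java sketch of the idea may help; the class name OrderedPhantom, its field b and the helper releaseSuccessor are our own illustration, not library API:

    import java.lang.ref.PhantomReference;
    import java.lang.ref.ReferenceQueue;

    // A' is an OrderedPhantom: a Phantom reference to A that also holds a
    // strong reference to B, so B cannot be reclaimed (or finalised) early.
    class OrderedPhantom extends PhantomReference<Object> {
        Object b;    // fields added in Reference subclasses are ordinary strong pointers

        OrderedPhantom(Object a, Object b, ReferenceQueue<Object> queue) {
            super(a, queue);
            this.b = b;
        }

        // Called when this Phantom is dequeued: A's finaliser has already run,
        // so dropping the reference to B lets B be finalised at a later collection.
        void releaseSuccessor() {
            b = null;
            clear();
        }
    }

When the application removes A' from its ReferenceQueue, calling releaseSuccessor plays the role of clearing the Phantom reference and nulling the reference to B, as described next.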
When the collector enqueues A', the Phantom for A, we know not only that A is unreachable from the application, but also that the finaliser for A has run. This is because reachability from the table of objects requiring finalisation is stronger than Phantom reachability. Then, we clear the Phantom reference to A and null the reference to B. At the next collection the finaliser for B will be triggered. We further delete the Phantom object itself from the global table, so that it too can be reclaimed. It is easy to extend this approach
8Fields added in subclasses of Java's built-in reference classes hold strong pointers, not special weak referents.
[Figure 12.5: the Phantom object A' refers (weakly) to A and holds a strong reference to B, ensuring that A is finalised before B.]
to chains of more than two objects. Notice that an ordinary weak reference would not suffice here: if A' were a Weak object, then as soon as A became unreachable the weak reference to A will be cleared and A' will be enqueued. We can then clear the reference from A' to B. Unfortunately, the clearing of the weak reference to A happens before the finaliser for A runs, and we cannot easily tell when that finaliser has finished. Therefore we might cause the finaliser for B to run first. Phantoms are intentionally designed to be weaker than finalisation reachability, and thus will not be enqueued until after their referent's finaliser has run.
A related construct is the converse: it retains all its arguments until none of them are strongly-reachable, and then sets all of its fields to nil. We refer readers to the documentation on the Platform Independent Extensions to Common Lisp10 for more details and further generalisations, including weak associations, weak AND- and OR-mappings, weak association lists, and a version of weak hash tables similar to what we discussed above.
Ephemerons [Hayes, 1997]11 are a particular form of weak key-value pairs useful for this kind of situation: the reference to the value is strong while the key is non-null, but becomes weak after the key is set to null. In the example, the reference to the base object O is weak, and initially the reference to the associated information I is strong. Once O is reclaimed and the weak reference to it is set to null, the reference to I is treated as weak. Thus, I is not reclaimable while O is alive, but becomes reclaimable once O has been reclaimed.
To summarise some of the issues that weak reference mechanisms raise for collectors:

• Weak pointers and finalisation tend to require additional tracing 'passes'. These passes are conceptually straightforward, but the atomic clearing and setting of weak references requires additional mechanism and care. As mentioned in earlier discussion, traversing the weak references needs to include a check of a shared flag and possibly some additional synchronisation, to ensure that collector and mutator threads make the same determination as to which weakly-referenced objects are live: they need to resolve the race between any mutator thread trying to obtain a strong reference and the collector trying to clear a group of weak references atomically. This race is by no means peculiar to Java's weak reference mechanisms, and is a potentiality in any system that combines weak references with concurrent collection.
• Java soft references require a collector mechanism to decide whether it is appropriate to clear them during a given collection cycle.
Chapter 13
Concurrency preliminaries
Concurrent collection algorithms have been studied for a long time, going back at least to the 1970s [Steele, 1975]. For a long time, though, they were relevant to a small minority of users. Now, multiprocessors enjoy widespread commercial availability; even the laptop on which this text is being written has a dual-core processor. Moreover, programmers need to deploy multiple cores to cooperate on the same task, since that has become the only way to get a job done faster: clock speed increases can no longer deliver the regular performance boost they used to. Therefore, language implementations need to support concurrent programming, and their run-time systems, and their garbage collectors in particular, need to support the concurrent world well. Later chapters explore parallel, concurrent and real-time collection in depth. Here we consider concepts, algorithms and data structures fundamental to collection in the presence of logical and physical parallelism, including an introduction to the relevant aspects of hardware, memory consistency models, atomic update primitives, progress guarantees, mutual exclusion algorithms, work sharing and termination detection, concurrent data structures and the emerging model called transactional memory.
13.1 Hardware
In order to understand both the correctness and the performance of parallel and concurrent collection, it is necessary first to understand relevant properties of multiprocessor hardware. This section offers definitions and overviews of several key concepts: processors and threads, including the various 'multis' (multiprocessor, multicore, multiprogrammed and multithreaded); interconnect; and memory and caches.1
A thread may in general be scheduled on any processor, not necessarily the one on which it ran previously, though the scheduler may recognise and offer some degree of affinity of a thread to a particular processor.
A slight complication in these definitions is that some processor hardware supports more than one logical processor using a single execution pipeline. This is called simultaneous multithreading (SMT) or hyperthreading, and unfortunately for our terminology, the logical processors are often called threads. Here thread will always mean the software entity, and SMTs will be viewed as providing multiple (logical) processors, since the logical processors are individually schedulable, and so on.
A multiprocessor is a computer that provides more than one processor. A chip multiprocessor (CMP), also called a multicore or even many-core processor, is a multiprocessor that has more than one processor on a single integrated circuit chip. Except in the case of SMT, multithreaded refers to software that uses multiple threads, which may run concurrently on a multiprocessor. Multiprogrammed refers to software executing multiple processes or threads on a single processor.
Interconnect
What distinguishes a multiprocessor from the general case of cluster, cloud or distributed computing is that it involves shared memory, accessible to each of the processors. This access is mediated by an interconnection network of some kind. The simplest interconnect is a single shared bus, through which all messages pass between processors and memory. It is helpful to think of memory accesses as the sending of messages between a processor and a memory unit, given how long the accesses take in terms of processor cycles: now in the hundreds of cycles. A single bus can be reasonably fast in terms of its raw speed, but it can obviously be a bottleneck if multiple processors request service at the same time. The highest bandwidth interconnect would provide a private channel between each processor and each memory unit, but the hardware resources required grow as the product of the number of processors and the number of memory units. Note that for better overall bandwidth (number of memory accesses per second across the entire system), splitting the memory into multiple units is a good idea. Also, transfers between processors and memories are usually in terms of whole cache lines (see page 231) rather than single bytes or words.

In larger CMPs a memory request may need to traverse multiple nodes in an interconnection network, such as a grid, ring or torus connection arrangement. Details lie beyond our scope, but the point is that access time is long and can vary according to where a processor is in the network and where the target memory unit is. Concurrent traffic along the same interconnect paths can introduce more delay.
Note that the bus in single-bus systems generally becomes a bottleneck when the system has more than about eight to sixteen processors. However, buses are generally simpler and cheaper to implement than other interconnects, and they allow each unit to listen to all of the bus traffic (sometimes called snooping), which simplifies supporting cache coherence (see page 232).

If the memory units are separate from the processors, the system is called a symmetric multiprocessor, in which each processor can access each memory unit in about the same amount of time.

2Private memory is suitable for thread-local heaps if the threads can be bound to processors (allowed to run only on the specific processor where their heap resides). It is also suitable for local copies of immutable data.
The most relevant properties of interconnect are that memory takes a long time to
access, that interconnect can be a bottleneck, and that different portions of memory may take
relatively longer times to access from different processors.
Memory
From the standpoint of garbage collection, shared memory appears as a single address space of words or bytes, even though it may be physically spread out across multiple memory units or processors. Because memory consists of multiple units accessed concurrently, it is not necessarily possible to describe it as having a single definite global state at any given moment. However, each unit, and thus each word, has a well-defined state at each moment.
Caches
Because memory accesses take so long, modern processors typically add one or more layers of cache to hold recently accessed data and thus statistically reduce the number of memory accesses a program requires as it runs. Caches generally operate in terms of cache lines (also called cache blocks), typically 32 or 64 bytes in size. If an access finds its containing line in the cache, that is a cache hit; otherwise the access is a cache miss, which requires accessing the next higher level of cache, or memory if this was the highest level. In CMPs it is typical for some processors to share some higher levels of cache. For example, each processor might have its own Level One (L1) cache but share its L2 cache with a neighbour. The line sizes of different levels need not be the same.

When there is a cache miss and there is not room for the new line in the cache, then a line currently in the cache, chosen according to the cache's replacement policy, must be evicted before loading the new line. The evicted line is called the victim. Some caches are write-through, meaning that updates to lines in the cache are passed on to the next level as soon as practicable, while some caches are write-back, meaning that a modified line (also called a dirty line) is not written to the next higher level until it is evicted, explicitly flushed (which requires using a special instruction) or explicitly written back (which also requires a special instruction).

A cache's replacement policy depends substantially on the cache's internal organisation. A fully-associative cache allows any set of lines, up to the cache size, to reside in the cache together. Its replacement policy can choose to evict any line. At the opposite end of the spectrum are direct-mapped caches, where each line must reside in a particular place in the cache, so there is only one possible victim. In between these extremes are k-way set-associative caches, where each line is mapped to a set of k lines of cache memory, and the replacement policy chooses its victim from among the k lines of that set.
Coherence
Caches hold copies of memory data that is potentially shared. Because not all copies are updated at the same moment, particularly with write-back caches, the various copies in general do not contain the same value for each address. Thus, it may be possible for two processors to disagree on the value at a particular location. This is undesirable, so the underlying hardware generally supports some degree of cache coherence. One of the common coherence protocols is MESI, from the initial letters of the names it gives to the possible states of a given line of memory in each cache.

Modified: This cache is the only one holding a copy of the line, and its value has been updated but not yet written back to memory.

Exclusive: This cache is the only one holding a copy of the line, but its value corresponds with that in memory.

Shared: Other caches may hold a copy of this line, but they all have the same value as in memory.

Invalid: This cache does not hold a valid copy of this line.
exchangeUnlock(x):
    *x ← 0

AtomicExchange(x, v):
    atomic
        old ← *x
        *x ← v
    return old
testAndTestAndSetExchangeLock(x):
    while testAndExchange(x) = 1
        /* do nothing */

testAndTestAndSetExchangeUnlock(x):
    *x ← 0

testAndExchange(x):
    while *x = 1
        /* do nothing */
    return AtomicExchange(x, 1)
testAndSetLock(x):
    while TestAndSet(x) = 1
        /* do nothing */

testAndSetUnlock(x):
    *x ← 0

TestAndSet(x):
    atomic
        old ← *x
        if old = 0
            *x ← 1
            return 0
        return 1

testAndTestAndSetLock(x):
    while testAndTestAndSet(x) = 1
        /* do nothing */

testAndTestAndSet(x):
    while *x = 1
        /* do nothing */
    return TestAndSet(x)

testAndTestAndSetUnlock(x):
    testAndSetUnlock(x)
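For comparison, here is a sketch of the test-and-test-and-set idiom in Java, using AtomicBoolean.getAndSet in the role of TestAndSet (the class name TTASLock is ours):

    import java.util.concurrent.atomic.AtomicBoolean;

    final class TTASLock {
        private final AtomicBoolean held = new AtomicBoolean(false);

        void lock() {
            while (true) {
                while (held.get())             // test: spin on a read, staying in cache
                    ;                          // do nothing
                if (!held.getAndSet(true))     // test-and-set: getAndSet is an atomic exchange
                    return;                    // lock acquired
            }
        }

        void unlock() {
            held.set(false);
        }
    }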
3The Java memory model is even looser: if two writes are not otherwise synchronised, then a processor can observe either value on any future read, and thus the value may oscillate.
Cache coherence guarantees that writes to any particular memory location are totally ordered, and each processor's view of that location is consistent with that order. However, a program's view of the order of writes (and reads) to more than one location does not necessarily correspond with the order of those actions at caches or memories, and thus as perceived by other processors. That is, program order is not necessarily consistent with memory order. This raises two questions: why, and what are the implications?
To answer the 'why' question, it is a matter of both hardware and software. Broadly, the reasons are tied up with performance: strict consistency requires either more hardware resources or reduced performance, or both. One hardware reason is that many processors contain a write buffer (also called a store buffer) that receives pending writes to memory. A write buffer is basically a queue of (address, data) pairs. Normally these writes may be performed in order, but if a later write is to an address already in the write buffer, the hardware may combine it with the previous pending write. This means the later write can effectively pass an earlier write to a different location and appear in memory sooner. Designers are careful to provide each processor with a consistent view of its own actions. Thus a read of a location that has a pending write in the write buffer will ultimately produce the value in the write buffer, either with a direct hardware path (faster but more costly) or by waiting for the write buffer to empty and then reading the value from cache.

Another reason program actions can be reordered at the memory is cache misses. Many processors will continue executing later instructions past a (data) cache miss, and thus reads can pass reads and writes (and so can writes). Further, write-back caches present writes to memory only when dirty lines are evicted or flushed, so writes to different lines can be drastically reordered. This summary of hardware reasons is illustrative but not exhaustive.

Software reasons for reordering mostly come from compilers. For example, if two memory references are known to go to the same location and there are no intervening writes that can affect that location, the compiler may just use the value originally fetched. More generally, if the compiler can show that variables are not aliased (do not refer to the same memory location), it can freely reorder reads and writes of the locations, since the same overall result will obtain (on a uniprocessor in the absence of thread switches). Languages allow such reordering and reuse of the results of previous accesses because it leads to more efficient code, and much of the time it does not affect the semantics.
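A short Java sketch of the write-buffer effect (the class and field names are ours): with plain, unsynchronised fields, both threads can read zero, an outcome that strict consistency would forbid.

    // Store-buffer litmus test: with no volatile or synchronisation, the
    // stores to x and y may sit in per-processor write buffers (and the
    // compiler may also reorder), so r1 = r2 = 0 is a permitted outcome.
    class StoreBufferLitmus {
        int x = 0, y = 0;    // plain fields: no ordering guarantees
        int r1, r2;

        void thread1() { x = 1; r1 = y; }    // may observe y = 0
        void thread2() { y = 1; r2 = x; }    // may observe x = 0
    }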
Obviously, from a programmer's standpoint, lack of consistency between program order and memory order can be a problem. A particularly subtle case is that of dependent loads, where the program issues a load from address x and then later issues a load from address y, where y depends on the value returned by loading x. An example is following a pointer chain. There are many different kinds of memory access orderings weaker than total consistency; we consider the more common ones here.
4Some authors use the word 'synchronising' where we use 'atomic', but this conflates the atomicity of these
operations with their usual influence on ordering, which is a strictly different notion.
[Table 13.1: reorderings allowed by some well-known processor families. Rows: R → R, R → W, W → W, W → R, Atomic → R, Atomic → W, and reordering of dependent loads; a Y marks a reordering the family permits. The column headings naming the processor families are not recoverable here.]
A read fence enforces a happens-before relationship between each previous read and each later read. Typically, atomic operations imply a total fence on all operations: every earlier read, write and atomic operation must happen-before each later read, write and atomic operation. However, other models are possible, such as acquire-release. In that model, an acquiring operation (think of it as being like acquiring a lock) prevents later operations from being performed before the acquire, but earlier reads and writes can happen after the acquire. A releasing operation is symmetrical: it prevents earlier operations from happening after the release, but later reads and writes may happen before the release. In short, operations outside an acquire-release pair may move inside it, but ones inside it may not move out. This is suitable for implementing critical sections.
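Java exposes this model through VarHandle access modes (Java 9 and later); here is a sketch of a spin lock whose lock is an acquire and whose unlock is a release (the class is our own illustration):

    import java.lang.invoke.MethodHandles;
    import java.lang.invoke.VarHandle;

    final class AcquireReleaseLock {
        private int state;    // 0 = free, 1 = held
        private static final VarHandle STATE;
        static {
            try {
                STATE = MethodHandles.lookup()
                        .findVarHandle(AcquireReleaseLock.class, "state", int.class);
            } catch (ReflectiveOperationException e) {
                throw new ExceptionInInitializerError(e);
            }
        }

        void lock() {
            // the CAS has acquire semantics: operations inside the critical
            // section cannot move before it
            while ((int) STATE.compareAndExchangeAcquire(this, 0, 1) != 0)
                ;    // spin
        }

        void unlock() {
            // release: writes made inside the critical section are visible
            // to the next thread whose acquiring CAS succeeds
            STATE.setRelease(this, 0);
        }
    }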
Consistency models
The strongest consistency model is strict consistency, where every read, write and atomic operation occurs in the same order everywhere in the system.5 Strict consistency implies that happens-before is a total order, with the order defined by some global clock. This is the easiest model to understand, and probably the way most programmers think, but it is impractical to implement.6

5By 'occurs' we mean 'appears to occur': a program cannot tell the difference.
6Given relativistic effects, a total order may not even be well-defined in modern systems.
CompareAndSwap(x, old, new):
    atomic
        curr ← *x
        if curr = old
            *x ← new
        return curr

CompareAndSet(x, old, new):
    atomic
        curr ← *x
        if curr = old
            *x ← new
            return true
        return false
Happens-before is also required between a read and the write that stored the value obtained by the read. The term relaxed consistency applies to any model weaker than sequential consistency.

While allowed reorderings depend to some extent on the interconnect and memory system, that is, they may lie outside total control by the processor, Table 13.1 shows the reorderings allowed by some well-known processor families. All the processors implement at least weak or release consistency. For more background on memory consistency models see Adve and Gharachorloo [1995, 1996].

Compare-and-swap

The CompareAndSwap primitive and its close relation, CompareAndSet, are presented in Algorithm 13.4. CompareAndSet compares a memory location to an expected value old, and if the location's value equals old, it sets the value to new. In either case it indicates whether or not it updated the memory location. CompareAndSwap differs only in that it returns the value of the memory location observed by the primitive before any update,
compareThenCompareAndSwap(x):
    if *x = interesting
        z ← value for the desired next state
        CompareAndSwap(x, interesting, z)
LoadLinked(address):
    atomic
        reservation ← address    /* reservation is a per-processor variable */
        reserved ← true          /* reserved is a per-processor variable */
    return *address

StoreConditionally(address, value):
    atomic
        if reserved
            store(address, value)
            return true
        return false

store(address, value):    /* any processor's store to the reserved address clears the reservation */
    atomic
        if address = reservation
            reserved ← false
        *address ← value
rather than returning a boolean truth value. The utility of the two primitives is essentially the same, although their semantics are not strictly equivalent.
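Java's atomic classes offer both flavours: compareAndSet corresponds to CompareAndSet, and compareAndExchange (Java 9 and later) to CompareAndSwap, returning the value it witnessed. A small illustration (the demo class is ours):

    import java.util.concurrent.atomic.AtomicInteger;

    class CasFlavours {
        public static void main(String[] args) {
            AtomicInteger x = new AtomicInteger(5);
            boolean ok    = x.compareAndSet(5, 6);        // like CompareAndSet: true, x = 6
            int witness   = x.compareAndExchange(6, 7);   // like CompareAndSwap: returns 6, x = 7
            int unchanged = x.compareAndExchange(6, 8);   // fails: returns 7, x still 7
            System.out.println(ok + " " + witness + " " + unchanged);   // true 6 7
        }
    }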
CompareAndSwap is often used to advance a location from one state to another, such as 'locked by thread t1' to 'unlocked' to 'locked by thread t2'. It is common to examine the current state and then try to advance it atomically, following the pattern of Algorithm 13.5, sometimes called compare-then-compare-and-swap. There is a lurking trap in this approach, namely that it is possible that at the CompareAndSwap the state has changed multiple times, and is now again equal to the value sampled before. In some situations this may be all right, but in others it could be that the bit pattern, while equal, actually has a different meaning. This can happen in garbage collection if, for example, two semispace collections occur, and along the way a pointer was updated to refer to a different object that by coincidence lies where the original object was two collections ago. This inability of CompareAndSwap to detect whether a value has changed and then changed back is called the ABA problem.
Load-linked/store-conditionally
observed ← LoadLinked(x)
compute desired new value z, using observed
if not StoreConditionally(x, z)
    go back and recompute or otherwise handle interference
Algorithm 13.8: Implementing compare-and-swap with load-linked/store-conditionally

compareAndSwapByLLSC(x, old, new):
    previous ← LoadLinked(x)
    if previous = old
        StoreConditionally(x, new)
    return previous

compareAndSetByLLSC(x, old, new):
    previous ← LoadLinked(x)
    if previous = old
        return StoreConditionally(x, new)
    return false
This models LoadLinked/StoreConditionally more precisely. It still falls short, though, because the reservation is cleared not only by writes by the same processor, but also by writes coming from other processors. Because any write to the reserved location resets the reserved flag, the compare-then-compare-and-swap code can be rewritten to avoid the possible ABA problem, as shown in Algorithm 13.7.7 LoadLinked/StoreConditionally is thus strictly more powerful than CompareAndSwap. In fact, it should be clear that the LoadLinked/StoreConditionally primitives allow a programmer to implement any atomic read-modify-write operation that acts on a single memory word. Algorithm 13.8 shows how to implement compare-and-swap with LoadLinked/StoreConditionally, and also an implementation of compare-and-set. One more behaviour of LoadLinked/StoreConditionally is worth mentioning: it is legal for a StoreConditionally to fail 'spuriously', that is, even if no processor wrote the location in question. There might be a variety of low-level hardware situations that can cause spurious failures, but notable is the occurrence of interrupts, including such things as page and overflow traps, and timer or I/O interrupts, all of which induce kernel activity. This is not usually a problem, but if some code between LoadLinked and StoreConditionally causes a trap every time, then the StoreConditionally will always fail.

Because LoadLinked/StoreConditionally solves ABA problems so neatly, code presented here will generally prefer LoadLinked/StoreConditionally where CompareAndSwap would exhibit an ABA problem. It would typically be straightforward to convert such instances to use CompareAndSwap with an associated counter.
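In Java, that conversion is packaged as java.util.concurrent.atomic.AtomicStampedReference, which pairs a reference with an integer stamp updated atomically alongside it; a sketch (the wrapper class is ours):

    import java.util.concurrent.atomic.AtomicStampedReference;

    class StampedTop<T> {
        private final AtomicStampedReference<T> top =
                new AtomicStampedReference<>(null, 0);

        boolean replaceTop(T expected, T replacement) {
            int[] stamp = new int[1];
            T current = top.get(stamp);    // read value and stamp together
            // even if the value changed and changed back, the stamp differs
            return current == expected
                && top.compareAndSet(expected, replacement, stamp[0], stamp[0] + 1);
        }
    }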
7A thread also loses its reservation on a context-switch.
The 'test then test-and-set' pattern was illustrated in function testAndTestAndSet (see Algorithm 13.3). Because of the way that algorithm iterates, it is correct. Programmers should avoid two fallacious attempts at the same semantics, here called test-then-test-and-set and test-then-test-then-set, illustrated in Algorithm 13.10. Test-then-test-and-set is fallacious because it does not iterate, yet the TestAndSet could fail if x is updated between the if and the TestAndSet. Test-then-test-then-set is even worse: it fails to use any atomic primitive, and thus anything can happen between the first and second read of x, and between the second read and the write. Notice that making x volatile does not solve the problem. There are similar patterns that might be called compare-then-compare-and-set or compare-then-compare-then-set that are equally fallacious. These traps illustrate the difficulty programmers have in thinking concurrently.
8Of course if contention is that high, there may be the possibility of starvation at the hardware level, in trying to gain exclusive access to the relevant cache line.
Algorithm 13.9: Atomic arithmetic primitives

AtomicIncrement(x):
    atomic
        *x ← *x + 1

AtomicDecrement(x):
    atomic
        *x ← *x - 1

AtomicAdd(x, v):
    atomic
        new ← *x + v
        *x ← new
    return new

FetchAndAdd(x, v):
    atomic
        old ← *x
        *x ← old + v
    return old
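These primitives correspond directly to methods of Java's atomic classes; for instance (the demo class is ours):

    import java.util.concurrent.atomic.AtomicInteger;

    class AtomicArithmetic {
        public static void main(String[] args) {
            AtomicInteger x = new AtomicInteger(10);
            int old = x.getAndAdd(5);     // FetchAndAdd: old = 10, x = 15
            int now = x.addAndGet(5);     // AtomicAdd:   now = 20
            x.incrementAndGet();          // AtomicIncrement: x = 21
            x.decrementAndGet();          // AtomicDecrement: x = 20
            System.out.println(old + " " + now + " " + x.get());   // 10 20 20
        }
    }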
testThenTestAndSetLock(x):    /* fallacious! */
    if *x = 0
        TestAndSet(x)

testThenTestThenSetLock(x):   /* fallacious! */
    if *x = 0
        other work
        if *x = 0
            *x ← 1
Some processors provide a wide, double-word CompareAndSwap in addition to single-word CompareAndSwap (see Algorithm 13.11). These are not of greater theoretical power. However, a wide double-word CompareAndSwap can solve the ABA problem of single-word CompareAndSwap by using the second word for a counter of the number of times the first word has been updated. It would take so long for the counter to wrap around (2³² updates for a 32-bit word) that the ABA problem cannot arise in practice.
Algorithm 13.11: CompareAndSwapWide

CompareAndSwapWide(x, old0, old1, new0, new1):
    atomic
        curr0, curr1 ← x[0], x[1]
        if curr0 = old0 && curr1 = old1
            x[0], x[1] ← new0, new1
        return curr0, curr1
7
5 decide(v):
6 proposals [me] v
<\342\200\224 /* 0 < thread id < N */
7 \342\200\224
CompareAndSwap(&winner, 1, me)
8 return proposals [winner]
Atomic read-modify-write primitives tend to be slow. One reason is that they must read the old value and write the new one before the instruction is complete. While modern processors may overlap multiple instructions, often there are few instructions available in the pipeline since the next thing to do often depends strongly on the result of the atomic operation. Because of the need for coherence, an atomic update primitive often includes a bus or memory access, which consumes many cycles.

The other reason atomic primitives tend to be slow is that they either include memory fence semantics, or else, by the way they are used, the programmer will need to insert fences manually, typically on both sides of the atomic operation. This undermines the performance advantage of overlapped and pipelined processing, and makes it difficult for the processor to hide the cost of any bus or memory access the primitive requires.
Progress guarantees may also be qualified with respect to particular aspects of real systems. For example, an algorithm might be wait-free as long as it does not exhaust free storage. See Herlihy and Shavit [2008] for a thorough discussion of these concepts, how to implement them, and so on.
A wait-free algorithm typically involves the notion of threads helping each other along. That is, if thread t2 is about to undertake an action that would undermine thread t1, which is somehow judged to be ahead of t2, then t2 will first help advance the work of t1 and then do its own work. Assuming a fixed bound on the number of threads, there is a bound on helping to accomplish one work unit or operation on the data structure, and thus the total time for any work unit or operation can be bounded. However, not only is the bound large, but the typical time for an operation is rather higher than for weaker progress guarantees because of the additional data structures and work required. For the simple case of consensus, it is fairly easy to devise a wait-free algorithm with low time overhead, as illustrated in Algorithm 13.13. It is fairly easy to see that this meets all of the criteria to be a solution to the consensus problem for N threads, but it does have space overhead proportional to N.
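A Java rendering of this wait-free consensus may make it concrete (the class is our sketch; thread ids 0 to N-1 are assumed to be assigned by the caller):

    import java.util.concurrent.atomic.AtomicInteger;
    import java.util.concurrent.atomic.AtomicReferenceArray;

    class Consensus<V> {
        private final AtomicReferenceArray<V> proposals;    // one slot per thread
        private final AtomicInteger winner = new AtomicInteger(-1);

        Consensus(int nThreads) {
            proposals = new AtomicReferenceArray<>(nThreads);
        }

        V decide(int me, V v) {
            proposals.set(me, v);            // publish proposal before competing
            winner.compareAndSet(-1, me);    // exactly one thread's id sticks
            return proposals.get(winner.get());
        }
    }

Every call completes in a bounded number of steps, so the implementation is wait-free; the space is proportional to the number of threads, as noted above.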
Obstruction-freedom is easier to achieve than wait-freedom, but may require scheduler cooperation. If threads can see that they are contending, they can use random increasing
back-off so as to allow some thread to win. That is, each time they detect contention, they compute a longer possible back-off period T and randomly choose an amount of time between zero and T to wait before trying again. In a pool of contending threads, each will eventually succeed, probabilistically speaking.

Lock-freedom is even easier to achieve. It requires only that at least one contender make progress on any occasion, though any particular individual can 'starve' indefinitely.
Both parallel and concurrent collection algorithms typically have a number of phases, such as marking, scanning, copying, forwarding or sweeping, and concurrent collection also has mutator work trying to proceed at the same time. Multiple collector threads may aim to cooperate, yet sometimes interfere with one another and with mutator threads. In such a complex situation, how can collector correctness be described? Certainly the collector must do nothing blatantly wrong: at the least it must preserve the reachable parts of the object graph and support the mutations being performed by the mutators. Next, provided that an invocation of the collector eventually terminates, it should generally return some unreachable memory for reuse. However, the specific expectations vary by collector.
Termination is relatively easy to argue for a stop-the-world collector working on a fixed object graph. It is less obvious with concurrent collection, because the object graph can grow because of allocation of new objects, and it can change during a collection cycle. If each mutator change forces more collector work, how can we know that the collector will ever catch up? Mutators may need to be throttled back or stopped completely for a time. Even if a proof deals with the issues of more collector work being created during collection, there remains a further difficulty: unless the algorithm uses wait-free techniques, interference can prevent progress indefinitely. For example, in a lock-free algorithm, one thread can continually fail in its attempts to accomplish a work step. In fact, two competing threads can even each prevent progress of the other indefinitely, an occurrence called livelock.
Different phases of collection may offer different progress guarantees: one phase might be lock-free, another wait-free. However, practical implementations, even of theoretically entirely wait-free algorithms, may have some (it is hoped small) portions that are not, since it can be too complex or costly to make every last corner wait-free. Further, notice that an overall collection algorithm can
be judged wait-free from the standpoint of the mutators only if it can reclaim memory fast enough to ensure that a mutator will not block in allocation waiting for collection to complete. Put another way, the heap must not run out before the collector is done. This requires more than a wait-free guarantee for each phase: it requires overall balance between heap size, maximum live size, allocation rate and collection rate. Enough resources need to be devoted to collection (memory and processing time) for the collector to keep up. This may be required for critical real-time systems, and Chapter 19 discusses it in more detail.

13.5 Notation used for concurrent algorithms

The code offered here for algorithms that may execute concurrently follows certain conventions. This makes it easier to translate the pseudocode into a working implementation in a real programming language.
Meaning of atomic: The actions within an atomic block must be perceived by all processors as if they happened instantaneously: no other shared memory read or write can appear to happen in the middle. Moreover, atomic actions must be perceived as happening in the same order everywhere if they conflict (one writes and the other reads or writes the same shared variable), and in program execution order for the thread that executes them. Furthermore, atomic blocks act as fences for all other shared memory accesses. Since not all hardware includes fence semantics with atomic primitives, the programmer may need to add them. The code here may work with acquire-release fence semantics, but is designed assuming total fences.
Marking variables: We explicitly mark shared variables; all other variables are private to each thread.

Arrays: Where we use arrays, we give the number of elements within brackets, such as proposals[N]. Declarations of arrays use shared or private explicitly, so as not to look like uses of the arrays, and may be initialised with a tuple, such as shared pair[2] ← [0,1], including tuples extended to the specified length, such as shared level[N] ← [-1, ...].

References to shared variables: Each reference to a shared variable is assumed to result in an actual memory read or write, though not necessarily in the order presented.

Causality obeyed: Code assumes that if, subject to the sequential semantics of the pseudocode language, an action x causally precedes an action y, then x happens-before y.
Explicit fence points: Even with the conventions listed above, many operations may need memory fences when realised on particular hardware; the symbol $ marks points in the code where a fence may be required.

13.6 Mutual exclusion

Mutual exclusion can be achieved using only ordinary reads and writes, without stronger atomic update primitives. One of the classic techniques is Peterson's Algorithm
for mutual exclusion between two threads, shown in Algorithm 13.14. Not only does this algorithm guarantee mutual exclusion, it also guarantees progress (if two threads are competing to enter the critical section, one will succeed) and that waits are bounded, that is, the number of turns taken by other processes before a requester gets its turn is bounded.9 In this case the bound is one turn by the other thread.

It is not too hard to generalise Peterson's Algorithm to N threads, as shown in Algorithm 13.15, which highlights its similarity to the two-thread case. How the while loop works is a bit subtle. The basic idea is that a requesting thread can advance a level in the competition to enter the critical section if it sees no other thread at the same or higher level. However, if another thread enters its current level, that thread will change victim and the earlier arrival can advance. Put another way, the latest arrival at a given level

9The time before this happens is not bounded unless a requesting thread whose turn it is enters and then leaves within bounded time.
Algorithm 13.14: Peterson's algorithm for mutual exclusion

shared interested[2] ← [false, false]
shared victim ← 0
me ← myThreadId

petersonLock():
    other ← 1 - me    /* thread id must be 0 or 1 */
    interested[me] ← true
    victim ← me                                      $
    while victim = me && interested[other]           $
        /* do nothing: wait */

petersonUnlock():
    interested[me] ← false
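A Java sketch of the two-thread case (the class is our own): the fields need volatile (or stronger) semantics, since the algorithm depends on the store to victim not being reordered with the subsequent reads, exactly the W → R reordering discussed in Section 13.2.

    import java.util.concurrent.atomic.AtomicBoolean;

    class PetersonLock {
        private final AtomicBoolean[] interested =
                { new AtomicBoolean(false), new AtomicBoolean(false) };
        private volatile int victim;

        void lock(int me) {                  // me must be 0 or 1
            int other = 1 - me;
            interested[me].set(true);
            victim = me;                     // volatile store supplies the fence marked $ above
            while (victim == me && interested[other].get())
                ;                            // do nothing: wait
        }

        void unlock(int me) {
            interested[me].set(false);
        }
    }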
shared level[N] ← [-1, ...]
shared victim[N]
me ← myThreadId

petersonLockN():
    for lev ← 0 to N-1
        level[me] ← lev    /* 0 ≤ thread id < N */
        victim[lev] ← me                                       $
        while victim[lev] = me && (∃i ≠ me)(level[i] ≥ lev)    $
            /* do nothing: wait */

petersonUnlockN():
    level[me] ← -1
shared winner ← -1
shared value

decide(v):
    lock()
    if winner = -1
        winner ← me
        value ← v
    unlock()
    return value
waits for threads at all higher levels plus earlier arrivals at its own level. Meanwhile, later arrivals at the same and lower levels will come strictly later. Mutual exclusion can also solve consensus, in settings where its weaker progress guarantees are not needed, as shown in Algorithm 13.16. Since Peterson's mutual exclusion algorithm implements mutual exclusion, it can also support this kind of consensus.

13.7 Work sharing and termination detection

A recurring problem is detecting termination of a parallel algorithm. Note that this is quite distinct from demonstrating that a parallel algorithm will terminate; it concerns having the program detect that termination has actually been achieved in a specific instance. In particular, consider a generic situation in which threads consume work, and as they process work units, they may generate more work. If each thread is concerned only with its own work, detecting termination is simple: just have each thread set a done flag and when all the flags are set, the algorithm has terminated. However, parallel algorithms generally involve some sort of sharing of work items so as to try to balance the amount of work done by each thread and gain maximum speedup from the available processors. This balancing can take two forms: threads with a relatively large amount of work can push work to more lightly loaded threads, or lightly loaded threads can pull work from more heavily loaded threads. Work pulling is also called work stealing.

Work movement must be atomic, or at least must guarantee that no work unit is lost.10 Here, though, the concern is detecting termination of a work sharing algorithm. It is relatively easy to detect termination using a single shared counter of work units updated atomically by each thread, but such counters may become bottlenecks to performance if updated frequently.
checking the jobsMoved flag every √N iterations of the detector's scanning loop. Given the time needed to perform work in a collection algorithm, it is doubtful that such a refinement is worthwhile.
sendJobs(some, j):    /* push jobs to more lightly loaded thread */
    enqueue(jobs[j], some)                             $
    while (not busy[j]) && (not isEmpty(jobs[j]))      $
        /* do nothing: wait for j to wake up */
    /* indicate that some work moved */
    jobsMoved ← true                                   $

detect():
    anyActive ← true
    while anyActive
        anyActive ← (∃i)(busy[i])
        anyActive ← anyActive || jobsMoved             $
        jobsMoved ← false                              $
    allDone ← true                                     $
The detector relies on the jobsMoved flag, which indicates whether any jobs have moved recently; the detector restarts detection in that case. It is also important that sendJobs waits until busy[j] is true, to guarantee that before, during and immediately after the transfer at least one of the busy[i] is true: the only way that all busy[i] can be false is if there is no work in the system.

Algorithm 13.18 shows the similar algorithm for a work stealing (pull) model of sharing work. For example, Endo et al [1997] use essentially this algorithm to detect termination in their parallel collector. Also, while the lock-free collector of Herlihy and Moss [1992] is not based on work sharing, its termination algorithm at its heart uses the same logic as the busy and jobsMoved flags.
me ← myThreadId

worker():
    loop
        while not isEmpty(jobs[me])
            job ← dequeue(jobs[me])
            perform job                                            $
        if another thread j exists whose jobs set appears relatively large
            some ← stealJobs(j)                                    $
            enqueue(jobs[me], some)
            continue
        busy[me] ← false                                           $
        while no thread has jobs to steal && not allDone           $
            /* do nothing: wait for work or termination */
        if allDone return                                          $
        busy[me] ← true                                            $

stealJobs(j):
    some ← atomicallyRemoveSomeJobs(jobs[j])
    if not isEmpty(some)
        jobsMoved ← true    /* indicate that some work moved */
    return some
worker():
    ...
    busy[me] ← false                                       $
    anyIdle ← true                                         $
    ...

detect():
    anyActive ← true
    while anyActive
        anyActive ← false
        while not anyIdle                                  $
            /* do nothing: wait until a scan might be useful */
        anyIdle ← false                                    $
        anyActive ← (∃i)(busy[i])                          $
        anyActive ← anyActive || jobsMoved                 $
        jobsMoved ← false                                  $
    allDone ← true                                         $
me ← myThreadId

worker():
    loop
        while not isEmpty(jobs[me])
            job ← dequeue(jobs[me])
            perform(job)                                           $
        if my job set is large
            anyLarge ← true                                        $
        if anyLarge
            anyLarge ← false    /* set false before looking */     $
            if another thread j has a relatively large jobs set    $
                anyLarge ← true    /* could be more stealable work */  $
                some ← stealJobs(j)                                $
                enqueue(jobs[me], some)
                continue
        busy[me] ← false                                           $
        while (not anyLarge) && (not allDone)                      $
            /* do nothing: wait for work or termination */
        if allDone return                                          $
        busy[me] ← true                                            $
Rendezvous barriers
Another common synchronisation mechanism in parallel and concurrent collectors is the need for all participants to reach the same point in the algorithm, essentially a point of termination of a phase of collection, and then to move on. In the general case one of the previously presented termination algorithms may be most appropriate. Another common case occurs when the phase does not involve work sharing or balancing, but it is required only to wait for all threads to reach a given point, called the rendezvous barrier. This can
work():
    ...

detectSymmetric():
    while not allDone                                      $
        while (not anyIdle) && (not anyLarge)              $
            /* do nothing: wait until a scan might be useful */
        if anyLarge return                                 $
        anyIdle ← false
        anyActive ← (∃i)(busy[i])                          $
        anyActive ← anyActive || jobsMoved                 $
        jobsMoved ← false                                  $
        allDone ← not anyActive                            $
me ← myThreadId

work():
    ...
    while I have no work && not allDone                    $
        if detector ≥ 0
            continue    /* wait for previous detector to finish before trying */
        if CompareAndSet(&detector, -1, me)
            detectSymmetric()                              $
            detector ← -1                                  $
shared numBusy ← N

worker():
    loop
        while work remaining
            perform(work)
        if AtomicAdd(&numBusy, -1) = 0
            return
        while nothing to steal && (numBusy > 0)
            /* do nothing: wait for work or termination */
        if numBusy = 0
            return
        AtomicAdd(&numBusy, 1)
shared numBusy ← N

barrier():
    AtomicAdd(&numBusy, -1)
    while numBusy > 0
        /* do nothing: wait for others to catch up */
shared numBusy ← N
shared numPast ← 0

barrier():
    AtomicAdd(&numBusy, -1)
    while numBusy > 0
        /* do nothing: wait for others to catch up */
    if AtomicAdd(&numPast, 1) = N    /* one winner does the reset */
        numPast ← 0                                        $
        numBusy ← N                                        $
    else
        while numBusy = 0    /* the others wait (but not for long) */
            /* do nothing: wait for reset to complete */
use a simplified version of termination detection with a counter (Algorithm 13.23), shown in Algorithm 13.24. Since a collector is usually invoked more than once as a program runs, these counters must be reset as the algorithm starts, or in any case before the phase is run again, and the resetting should be done with care to ensure that no thread can still be depending on the value of the rendezvous counter at the time it is reset. Algorithm 13.25 shows such a resetting barrier.
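Java packages the resetting rendezvous barrier as java.util.concurrent.CyclicBarrier; a minimal usage sketch (the surrounding class is ours):

    import java.util.concurrent.CyclicBarrier;

    class PhaseBarrier {
        static final int N = 4;    // number of collector threads
        static final CyclicBarrier barrier = new CyclicBarrier(N);

        static void collectorThread() throws Exception {
            // ... perform this phase's work ...
            barrier.await();    // rendezvous: blocks until all N threads arrive
            // the barrier has reset itself; safe to begin the next phase
        }
    }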
13.8 Concurrent data structures

There are particular data structures commonly used in parallel and concurrent allocators and collectors, so it is helpful to review some of the relevant implementation techniques. It should be plain that data structure implementations for sequential programs are not suitable as-is for parallel and concurrent systems: they will generally break. If a data structure is accessed rarely enough, then it may suffice to apply mutual exclusion to an otherwise sequential implementation by adding a lock variable to each instance of the data structure and having each operation acquire the lock before the operation and release it after. If operations can be nested or recursive, then a 'counting lock' is appropriate, as shown in Algorithm 13.26.
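In Java, the counting-lock pattern is provided by java.util.concurrent.locks.ReentrantLock, which counts nested acquisitions by the owning thread and releases the lock only when the count returns to zero; a brief sketch (class and method names are ours):

    import java.util.concurrent.locks.ReentrantLock;

    class CountingLockExample {
        private final ReentrantLock lock = new ReentrantLock();

        void outer() {
            lock.lock();
            try {
                inner();    // re-acquiring in the same thread just counts up
            } finally {
                lock.unlock();
            }
        }

        void inner() {
            lock.lock();
            try {
                /* operate on the data structure */
            } finally {
                lock.unlock();
            }
        }
    }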
Some data structures have high enough traffic that applying simple mutual exclusion leads to bottlenecks. Therefore a number of concurrent data structures have been devised that allow greater overlap between concurrent operations. If concurrent operations are overlapped, the result must still be safe and correct. An implementation of a concurrent
countingLock():
    old ← lock
    ...

countingUnlock():
    /* leaves thread id, but no harm even when count becomes 0 */
    old ← lock
    lock ← (old.thread, old.count - 1)
data structure is linearisable if each operation appears to take effect instantaneously at some linearisation point, but the relative order of the linearisation points of operations that affect each other will always be consistent with the logical order of the operations. If operations do not affect each other then they can linearise in either order. Many memory manager actions, such as allocation and changes to work lists, must be linearisable.
There is a range of generic strategies a programmer can employ in building a concurrent data structure. In order from lower to higher concurrency, and typically from simplest to most complex, they are:13

Coarse-grained locking: One 'large' lock is applied to the whole data structure (already mentioned).
13See Herlihy and Shavit [2008] Chapter 9 for details of each of these approaches applied to a set implemented as a linked list.
Fine-grained locking: The data structure is partitioned, with a lock on each piece, so that operations touching different pieces can proceed in parallel; an operation locks only the pieces it needs.

Optimistic locking: This refines fine-grained locking by doing any searching of the data structure without locks, then locking what appear to be the proper elements for the intended action and validating, after locking, that they are still the right elements. If the validation fails, it releases the locks and starts over. Avoiding locking while searching improves concurrency considerably.
Lazy update: Even with optimistic locking, read-only operations may still need to lock a data structure. This can result in a concurrency bottleneck, and also has the effect that a read-only operation performs writes (of locks). It is often possible to design a data structure so that read-only operations need no locking, but of course the updating operations are a bit more complex. Generally speaking, they make some change that logically accomplishes the operation, but may need further steps to complete it and get the data structure into a normalised form. An example may help in understanding this. For lazy update of a linked list representation of a set, the remove operation will first mark an element as being (logically) removed, by setting a boolean flag deleted in the element. After that it will unchain the deleted element by redirecting the predecessor's pointer. All this happens while holding locks in the appropriate elements, so as to prevent problems with concurrent updaters. The two steps are necessary so that readers can proceed without locking. Adding an element needs to modify only one next pointer in the data structure and therefore needs only one update (again, with appropriate locks held).
Non-blocking: There are strategies that avoid locking altogether and rely on atomic update primitives to accomplish changes to the state of data structures. Typically a state-changing operation has some particular atomic update event that is its linearisation point. This is in contrast to lock-based methods, where some critical section marks the linearisation 'point'.14 As previously mentioned, these can be characterised according to their progress guarantees, in order from easiest to implement to hardest. Lock-free implementations may allow starvation of individual threads; obstruction-free implementations may require long enough periods in which a single thread can make progress without interference; and wait-free implementations guarantee progress of all threads. Some lock-free implementations are sketched below; for wait-free implementation, see Herlihy and Shavit [2008].
14Because of mutual exclusion, it is a point as far as any other operations are concerned. However, lazy update methods also tend to have a single linearisation point.
Concurrent stacks
First, we sketch ways to implement a concurrent stack using a singly linked list. Since there is only one locus of mutation for a stack, the performance of the various approaches to locking will be about the same. The code is obvious, so not illustrated. Algorithm 13.27 shows a lock-free implementation of a stack. It is easy to make push lock-free; pop is a little harder. The popABA routine is a simple CompareAndSet implementation of pop that is lock-free, but that also has an ABA problem. Algorithm 13.27 also shows LoadLinked/StoreConditionally and CompareAndSetWide solutions that avoid the ABA problem, as concrete examples of how to do that. The problem occurs when some other thread(s) pop the node referred to by currTop, and that node is pushed later with its next different from the currTop.next read by this popping thread.
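A Java sketch of the lock-free push and pop (the class LockFreeStack is our own): in Java the ABA danger of popABA is defused by garbage collection itself, since a node cannot be reused while any thread still holds a reference to it.

    import java.util.concurrent.atomic.AtomicReference;

    class LockFreeStack<T> {
        private static final class Node<E> {
            final E value;
            Node<E> next;
            Node(E value) { this.value = value; }
        }

        private final AtomicReference<Node<T>> top = new AtomicReference<>();

        void push(T val) {
            Node<T> node = new Node<>(val);
            do {
                node.next = top.get();    // re-read top on every retry
            } while (!top.compareAndSet(node.next, node));
        }

        T pop() {
            Node<T> currTop;
            do {
                currTop = top.get();
                if (currTop == null)
                    return null;          // stack empty
            } while (!top.compareAndSet(currTop, currTop.next));
            return currTop.value;
        }
    }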
A concurrent stack based on an array is best implemented using a lock. Concurrent stacks tend to be a bottleneck not just because of cache and memory issues, but because all the operations must serialise. However, it is possible to do better. Blelloch and Cheng [1999] provide a lock-free solution by requiring all threads accessing a shared stack either all to be popping from it or all to be pushing onto it, thus allowing the stack pointer to be controlled by a FetchAndAdd instruction rather than a lock. We discuss this in detail in Chapter 14. Chapter 11 of Herlihy and Shavit discusses a concurrent lock-free stack implementation where threads that encounter high contention try to find matching operations in a side buffer. When a pop finds a waiting push, or a push finds a waiting pop, the two can exchange values and complete without using the shared stack at all.
A concurrent queue can be implemented as a singly linked list with head and tail pointers, where head refers to a dummy node; when the dummy node has no successor, the queue is empty.
Algorithm 13.28 shows an implementation that does fine-grained locking. It has one lock for each locus of mutation. Notice that remove changes head to refer to the next node; thus, after the first successful remove, the original dummy node will be free, and the node with the value just removed becomes the new head. This version of Queue is unbounded. Algorithm 13.29 shows a similar implementation for BoundedQueue. To avoid update contention on a single size field, it maintains counts of the number of items added and the number removed. It is fine if these counts wrap around; the fields storing them just need to be able to store all max + 1 values from zero through max. Of course if these counts lie on the same cache line, this 'optimisation' may perform no better than using a single size field in the general case.

Other locking approaches (such as optimistic or lazy update) offer no real advantage over fine-grained locking for this data structure.
push(val):
    node ← new Node(value: val, next: null)
    loop
        currTop ← *topAddr
        node.next ← currTop
        if CompareAndSet(topAddr, currTop, node)
            return

popABA():
    loop
        currTop ← *topAddr
        if currTop = null
            return null
        /* code below can have an ABA problem if node is reused */
        next ← currTop.next
        if CompareAndSet(topAddr, currTop, next)
            return currTop.value

pop():
    loop
        currTop ← LoadLinked(topAddr)
        if currTop = null
            return null
        next ← currTop.next
        if StoreConditionally(topAddr, next)
            return currTop.value

popCount():
    loop
        currTop ← *topAddr
        if currTop = null
            return null
        currCnt ← *cntAddr                                 $
        nextTop ← currTop.next
        if CompareAndSetWide(&topCnt, currTop, currCnt, nextTop, currCnt + 1)
            return currTop.value
add(val):
    node ← new Node(value: val, next: null)
    lock(&addLock)
    tail.next ← node
    tail ← node
    unlock(&addLock)

remove():
    lock(&removeLock)
    node ← head.next
    if node = null
        unlock(&removeLock)
        return EMPTY    /* or otherwise indicate emptiness */
    val ← node.value
    head ← node
    unlock(&removeLock)
    return val
shared numAdded ← 0
shared numRemoved ← 0

add(val):
    node ← new Node(value: val, next: null)
    lock(&addLock)
    if numAdded - numRemoved = MAX
        unlock(&addLock)
        return false    /* or otherwise indicate full */
    tail.next ← node
    tail ← node
    numAdded ← numAdded + 1
    unlock(&addLock)
    return true

remove():
    lock(&removeLock)
    node ← head.next
    if numAdded - numRemoved = 0
        unlock(&removeLock)
        return EMPTY    /* or otherwise indicate emptiness */
    val ← node.value
    head ← node
    numRemoved ← numRemoved + 1
    unlock(&removeLock)
    return val
add(val):
    node ← new Node(value: val, next: null)
    loop
        currTail ← LoadLinked(&tail)
        currNext ← currTail.next
        if currNext ≠ null
            /* tail appears to be out of sync: try to help */
            StoreConditionally(&tail, currNext)
            continue    /* start over after attempt to sync */
        if CompareAndSet(&currTail.next, null, node)
            /* added to end of chain; try to update tail */
            StoreConditionally(&tail, node)
            /* ok if failed: someone else brought tail into sync, or will in the future */
            return

remove():
    loop
        currHead ← LoadLinked(&head)
        next ← currHead.next
        if next = null
            if StoreConditionally(&head, currHead)
                /* head has not changed, so truly empty */
                return EMPTY    /* or otherwise indicate emptiness */
            continue    /* head may have changed so try again */
        currTail ← tail
        if currHead = currTail
            /* not empty; tail appears to be out of sync; try to help */
            currTail ← LoadLinked(&tail)
            next ← currTail.next
            if next ≠ null
                StoreConditionally(&tail, next)
            continue
        val ← next.value
        if StoreConditionally(&head, next)
            return val
shared buffer[MAX]
shared head ← 0
shared tail ← 0
shared numAdded ← 0
shared numRemoved ← 0

add(val):
    lock(&addLock)
    if numAdded - numRemoved = MAX
        unlock(&addLock)
        return false    /* indicate failure */
    buffer[tail] ← val
    tail ← (tail + 1) % MAX
    numAdded ← numAdded + 1
    unlock(&addLock)
    return true

remove():
    lock(&removeLock)
    if numAdded - numRemoved = 0
        unlock(&removeLock)
        return EMPTY    /* indicate failure */
    val ← buffer[head]
    head ← (head + 1) % MAX
    numRemoved ← numRemoved + 1
    unlock(&removeLock)
    return val
A queue implemented with an array has higher storage density than one implemented with a linked list, and it does not require on-the-fly allocation of nodes from a pool. A bounded queue can be implemented with a circular buffer. Algorithm 13.31 shows a fine-grained locking version of that, which can be improved by folding together head and numRemoved, and also tail and numAdded, using modular arithmetic, as shown in Algorithm 13.32. This is particularly attractive if MAX is a power of two, since then the modulus operation reduces to bit masking.
add(val):
    lock(&addLock)
    if (tail - head + MODULUS) % MODULUS = MAX
        unlock(&addLock)
        return false    /* indicate failure */
    buffer[tail % MAX] ← val
    tail ← (tail + 1) % MODULUS
    unlock(&addLock)
    return true    /* indicate success */

remove():
    lock(&removeLock)
    if (tail - head + MODULUS) % MODULUS = 0
        unlock(&removeLock)
        return EMPTY    /* indicate failure */
    val ← buffer[head % MAX]
    head ← (head + 1) % MODULUS
    unlock(&removeLock)
    return val
Taking MODULUS to be twice MAX gives the smallest modulus that will work, and it has the added virtue of being a power of two when MAX is. In the code we add MODULUS to tail - head to ensure we are taking the modulus of a positive number, which is not necessary if using masking or if the implementation language does a proper modulus (toward -∞ as opposed to toward zero).

If there is a distinguished value that can mark empty slots in the buffer, then the code can be further simplified, as shown in Algorithm 13.33.
It is often the case that the buffer has just a single reader and a single writer (for example, the channels used by Oancea et al [2009]). In this case, the code for a circular buffer is much simpler; it appears in Algorithm 13.34. This algorithm is a good example for mentioning the adjustments a programmer needs to make to realise the algorithm on different
add(val):
    lock(&addLock)
    if buffer[tail] ≠ EMPTY
        unlock(&addLock)
        return false    /* indicate failure */
    buffer[tail] ← val
    tail ← (tail + 1) % MAX
    unlock(&addLock)
    return true    /* indicate success */

remove():
    lock(&removeLock)
    if buffer[head] = EMPTY
        unlock(&removeLock)
        return EMPTY    /* indicate failure */
    val ← buffer[head]
    head ← (head + 1) % MAX
    unlock(&removeLock)
    return val
Algorithm 13.34: Single reader/single writer lock-free buffer [Oancea et al, 2009]

add(val):
    newTail ← (tail + 1) % MAX
    if newTail = head
        return false
    buffer[tail] ← val
    tail ← newTail
    return true

remove():
    if head = tail
        return EMPTY    /* or otherwise indicate emptiness */
    value ← buffer[head]                                   $
    head ← (head + 1) % MAX                                $
    return value
add(val):
    pos ← FetchAndAdd(&head, 1)
    buffer[pos] ← val

remove():
    limit ← head
    pos ← -1
    loop
        pos ← pos + 1
        if pos = limit
            return null    /* found nothing */
        val ← LoadLinked(&buffer[pos])
        if val ≠ EMPTY
            if StoreConditionally(&buffer[pos], EMPTY)
                return val
kinds of hardware. On the PowerPC, for example, we insert an lwsync instruction in add between the write to buffer and the write to tail, to serve as a store-store memory fence.15 This will guarantee that if the remover orders its load instructions properly, it will not perceive the change to tail until after it can perceive the change to buffer. Likewise we add an isync instruction, which serves as a load-store memory fence, before the store to buffer, to ensure that the processor does not speculatively begin the store before the load of head and thus possibly overwrite a value being read by the remover.16
Similarly
we insert an lwsync in remove between loadingbuffer [head]and
updating head, and an isync before loading from buffer, to serve as a load-load memory
barrierbetween loading tail and loading from buffer.
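On a platform with C11-style atomics, the same single-reader/single-writer discipline can be expressed portably with acquire and release orderings rather than explicit lwsync/isync instructions. The sketch below is illustrative only (it is not Oancea et al's code): buffer, head and tail mirror Algorithm 13.34, and EMPTY is merely a sentinel return value here.

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX 256
    #define EMPTY 0

    static int buffer[MAX];
    static atomic_uint head, tail;       /* reader advances head; writer, tail */

    bool add(int val) {                  /* single writer only */
        unsigned t = atomic_load_explicit(&tail, memory_order_relaxed);
        unsigned newTail = (t + 1) % MAX;
        /* acquire: the buffer store below must not move before this check */
        if (newTail == atomic_load_explicit(&head, memory_order_acquire))
            return false;                /* full */
        buffer[t] = val;
        /* release: the buffer store is visible before the new tail */
        atomic_store_explicit(&tail, newTail, memory_order_release);
        return true;
    }

    int remove_item(void) {              /* single reader only */
        unsigned h = atomic_load_explicit(&head, memory_order_relaxed);
        /* acquire: do not read the slot until the tail update is seen */
        if (h == atomic_load_explicit(&tail, memory_order_acquire))
            return EMPTY;                /* empty */
        int val = buffer[h];
        /* release: finish reading the slot before publishing the new head */
        atomic_store_explicit(&head, (h + 1) % MAX, memory_order_release);
        return val;
    }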
Oancea et al proposed a solution that includes writing null in remove as an explicit EMPTY value, and having add (respectively remove) watch its intended buffer slot until the slot appears suitably empty (non-empty) before writing its new value (EMPTY). Because there is only one reader and one writer, only one thread writes EMPTY values and only one writes non-EMPTY values, and each delays its write until it sees the other thread's previous write, so accesses to the buffer cannot incorrectly pass each other. Likewise, only one thread writes each of head and tail, so at worst the other thread may have a stale view. This solution avoids fences, but the buffer writes by the remover may cause more cache ping-ponging than fences would.
Algorithm 13.36: A lock-free buffer with a lower bound on scanning

add(val):
    pos <- FetchAndAdd(&head, 1)
    buffer[pos] <- val

remove():
    limit <- head
    currLower <- lower
    pos <- currLower - 1
    loop
        pos <- pos + 1
        if pos = limit
            return null                  /* found nothing */
        val <- LoadLinked(&buffer[pos])
        if val = EMPTY
            continue
        if val = USED
            if pos = currLower
                /* try to advance lower */
                currLower <- LoadLinked(&lower)
                if pos = currLower
                    StoreConditionally(&lower, pos + 1)
            continue
        /* try to grab */
        if StoreConditionally(&buffer[pos], USED)
            return val
Oancea et al actually combine both solutions but, as we just argued, each seems adequate on its own. This all shows the care needed to obtain a correct and efficient implementation.
If the queue is being used as a buffer, that is, if the order in which things are removed need not match exactly the order in which they were added, then it is not too hard to devise a lock-free buffer. First assume an array large enough that wrap-around will never occur. Algorithm 13.35 implements a lock-free buffer. It assumes that initially all entries are EMPTY.
This algorithm does a lot of repeated scanning. Algorithm 13.36 adds an index lower from which to start scans. It requires distinguishing not just empty slots, but also ones that have been filled and then emptied, indicated by USED in the code.
Further refinement is needed to produce a lock-free circular buffer implementation along these lines. In particular, there needs to be code in the add routine that carefully converts USED slots to EMPTY ones before advancing the head index. It also helps to use index values that cycle through twice MAX, as in Algorithm 13.32. The resulting code appears in Algorithm 13.37.
Algorithm 13.37: A lock-free circular buffer

shared head <- 0                         /* refers to next slot to fill */
shared lower <- 0                        /* slots from lower to head-1 may have data */

add(val):
    loop
        currHead <- head
        currLower <- lower
        if (currHead % MAX) = (currLower % MAX) && (currHead != currLower)
            advanceLower()               /* lower is a buffer behind */
            continue
        oldVal <- LoadLinked(&buffer[currHead % MAX])
        if oldVal = USED
            /* try to clean entry; ensure head has not changed */
            if currHead = head
                StoreConditionally(&buffer[currHead % MAX], EMPTY)
            continue
        if oldVal != EMPTY
            if currHead != head
                continue                 /* things changed: try again */
            return false                 /* indicate failure: buffer is full */
        currHead <- LoadLinked(&head)    /* try to claim slot */
        /* recheck inside LL/SC */
        if buffer[currHead % MAX] = EMPTY
            if StoreConditionally(&head, (currHead + 1) % MODULUS)
                buffer[currHead % MAX] <- val
                return true

remove():
    advanceLower()
    limit <- head
    scan <- lower - 1
    loop
        scan <- (scan + 1) % MODULUS
        if scan = limit
            return null                  /* found nothing */
        /* could peek at value first before using atomic operator */
        val <- LoadLinked(&buffer[scan % MAX])
        if val = EMPTY || val = USED
            continue
        if StoreConditionally(&buffer[scan % MAX], USED)
            return val

advanceLower():
    if buffer[lower % MAX] != USED
        return                           /* quick return without using atomic operation */
    loop
        currLower <- LoadLinked(&lower)
        if buffer[currLower % MAX] = USED
            if StoreConditionally(&lower, (currLower + 1) % MODULUS)
                continue
        return
A transactional memory system must provide means to indicate:

• The start of a transaction.
• Each read that is part of the current transaction.
• Each write that is part of the current transaction.
• The end of a transaction.

The reads and writes of a transaction may be executed speculatively. It is necessary to mark their end so that speculation can be resolved and the transaction accepted, with its writes installed, and so on, or rejected and the software notified so that it can retry or take some other action.
Similar to the ACID properties of database transactions, transactional memory transactions ensure:

• Atomicity: All effects (writes) of a transaction appear or none do.
• Consistency: A transaction appears to execute at a single instant.
• Isolation: No other thread can perceive an intermediate state of a transaction, only a state before or a state after the transaction.

The durability property of database transactions, which ensures to very high probability that the results of a successful transaction will not be lost, is omitted from the requirements on transactional memory.
The actual reads and writes of a transaction will be spread out over time. Thus, as transactions run, they may interfere with each other if they access the same locations. Specifically, transactions A and B conflict if one of them writes an item that the other reads or writes. Conflicting transactions must be ordered. In some cases, given the reads and writes a transaction has already performed, this is not possible. For example, if A and B have both read x, and then they both try to write to x, there is no way to complete both transactions so as to satisfy transactional semantics. In that case one or both of A and B must be aborted (discarded), and the situation made to appear as if the aborted transaction had not run. Generally the software will try it again, which will likely force a suitable ordering.
Transactional memory can be implemented in hardware, software or a hybrid combination. Any implementation strategy must provide for atomicity of writes, detection of conflicts and visibility control (for isolation). Visibility control may be part of conflict detection.
Atomicity of writes can be achieved either by buffering or by undoing. The buffering approach accumulates writes in some kind of scratch memory separate from the memory locations written, and updates those memory locations only if the transaction commits. Hardware buffering may be achieved by augmenting caches or using some other side buffer; software buffering might work at the level of words, object fields or whole objects. With buffering, a transaction commit installs the buffered writes, while an abort discards the buffer. This typically requires more work for commits, usually the more common case, and less work for aborts. Undoing works in the converse way: it updates modified data as a transaction runs, but saves the previous value of each item it writes in a side data structure called the undo log. If the transaction commits, it simply discards the undo log, but if the transaction aborts, it uses the undo log to restore the previous values. Undo logs can be implemented in hardware, software, or a combination, just as buffering can.
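As an illustration of the undo-log approach, here is a hypothetical word-granularity undo log in C. All names are invented for this sketch, and conflict detection and visibility control, which a real transactional memory must also provide, are omitted.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        uintptr_t *addr;                  /* location written */
        uintptr_t  oldValue;              /* value to restore on abort */
    } UndoEntry;

    typedef struct {
        UndoEntry log[1024];
        size_t    used;
    } Txn;

    void txWrite(Txn *tx, uintptr_t *addr, uintptr_t value) {
        tx->log[tx->used].addr = addr;    /* save the previous value ... */
        tx->log[tx->used].oldValue = *addr;
        tx->used++;
        *addr = value;                    /* ... then update in place */
    }

    void txCommit(Txn *tx) {
        tx->used = 0;                     /* commit: discard the undo log */
    }

    void txAbort(Txn *tx) {
        /* abort: replay the log backwards to restore previous values */
        while (tx->used > 0) {
            UndoEntry *e = &tx->log[--tx->used];
            *e->addr = e->oldValue;
        }
    }

Note how the cost asymmetry described above is visible here: txCommit is trivial, while txAbort does work proportional to the number of writes, the converse of the buffering approach.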
Conflict detection may be implemented eagerly or lazily. Eager conflict checking checks each new access against the currently running transactions to see if it conflicts. If necessary, it will cause one of the conflicting transactions to abort. Lazy conflict checking does the checks when a transaction attempts to commit. Some mechanisms also allow a transaction to request an abort programmatically.
For purposes of presentation, let us discuss a simple hardware transactional memory interface consisting of these primitives, as introduced by Herlihy and Moss [1993]:

TCommit() indicates that the transaction wants to commit. It returns a boolean that is true if and only if the commit succeeded.
TLoad(addr) marks a transactional load from the indicated address. This adds that address to the transaction's read set and returns the current value in that memory location.

TStore(addr, value) marks a transactional store of the indicated value to the indicated address. This adds the address to the transaction's write set and performs the write transactionally, that is, in a way in which the effect of the write disappears if the transaction aborts.
Compared with the earlier versions of the shared queue, the transactional version shown below is simpler: add is simpler because it can write two locations atomically, and remove is simpler because it can read two and even three values atomically. More importantly, it is easier to see that the transactional implementation is correct; verifying the other versions requires more subtle arguments about orders of reads and writes.
• Software transactional memory tends to involve significant overheads, even after optimisation. Given the desire for low overheads in most parts of automatic storage management, the scope for applying software transactional memory may be small. Still, coding of low-traffic data structures might be simplified while continuing to avoid the difficulties with locks.
• Hardware transactional memory will likely have idiosyncrasies. For example, it may handle conflict detection, access and updating all in terms of physical units such as cache lines.
add(val):
    node <- new Node(value: val, next: null)
    loop
        currTail <- TLoad(&tail)
        TStore(&currTail.next, node)
        TStore(&tail, node)
        if TCommit()
            return

remove():
    loop
        currHead <- TLoad(&head)
        next <- TLoad(&currHead.next)
        if next = null
            if TCommit()                 /* the commit ensures we got a consistent view */
                return EMPTY             /* or otherwise indicate emptiness */
            continue
        val <- TLoad(&next.value)
        TStore(&head, next)
        if TCommit()
            return val
• Transactional memory can guarantee at most lock-freedom, though it does that fairly easily. Even if the underlying commit mechanism of transactional memory is wait-free, transactions can conflict, leading to aborts and retries. Programming wait-free data structures will remain complex and subtle.
• Transactional memory can require careful performance tuning. One concern is inherent conflicts between transactions because they access the same data. An example is a concurrent stack: transactional memory will not solve the bottleneck caused by the need for every push and pop to update the stack pointer. Furthermore, exactly where in a transaction various reads and writes occur, nearer to the beginning or nearer to the end, can significantly affect conflicts and the overhead of retrying transactions.
• On the other hand, the simpler model of the world that transactional memory presents may result in fewer bugs and reduce development effort.

A further concern when applying transactional memory to memory management is that mutator and collector transactions might interfere with each other's semantics. This would be harder to avoid with hardware transactional memory, since it is oblivious to the semantics of the data being managed, whereas a software transactional memory built for a particular language might give special treatment to object headers.
14 Parallel Garbage Collection

Today's trend is for modern hardware architectures to offer increasing numbers of processors and cores. Sutter [2005] wrote that 'the free lunch is over' as many of the traditional approaches to improving performance ran out of steam. Energy costs, and the difficulty of dissipating that energy, have led hardware manufacturers away from increasing clock speeds (power consumption is a cubic function of clock frequency) towards placing multiple processor cores on a single chip (where the increase in energy consumption is linear in the number of cores). As there is no reason to expect this trend to change, designing and implementing applications to exploit the parallelism offered by hardware will become more and more important. If anything, heterogeneous and other non-uniform memory architectures will only increase the need for programmers to take the particular characteristics of the underlying platform into account.
Up to now we have assumed that, although there may be many mutator threads, there is only a single collector thread. This is clearly a poor use of resources on modern multicore or multiprocessor hardware. In this chapter we consider how to parallelise garbage collection, although we continue to assume that no mutators run while garbage collection proceeds and that each collection cycle terminates before the mutators can continue.
Terminology is important. Early papers used terms like 'concurrent', 'parallel', 'on-the-fly' and 'real-time' interchangeably or inconsistently. We shall be more consistent, in keeping with most usage today.
Figure 14.1a represents execution on a single processor as a horizontal bar, with time proceeding from left to right, and shows mutator execution in white while different collection cycles are represented by distinct non-white shades. Thus grey boxes represent actions of one garbage collection cycle and black boxes those of the next. On a multiprocessor, suspension of the mutator means stopping all the mutator threads. Figure 14.1b shows the general scenario we have considered so far: multiple mutator threads are suspended while a single processor performs garbage collection work. This is clearly a poor use of resources. An obvious way to reduce pause times is to have all processors cooperate to collect garbage (while still stopping all mutator threads), as illustrated in Figure 14.1c. This parallel collection is the topic of this chapter.
These scenarios, where collection cycles are completed while the mutators are halted, are called stop-the-world collection. We note in passing that pause times can be further diminished either by interleaving mutator and collector actions (incremental collection) or by running collector and mutator at the same time (concurrent collection); these are the subjects of later chapters.

14.1 Is there sufficient work to parallelise?

Parallelising collection adds costs of coordination and synchronisation between collector threads. The question therefore arises: is there sufficient garbage collection work available for the gains offered by a parallel solution to more than offset these costs?
Some garbage collection problems appear inimical to parallelising. For example, a mark-sweep collector may need to trace a list, but this is an inherently sequential activity: at each tracing step, the marking stack will contain only a single item, the next item in the list to be traced. In this case, only one collector thread will do work and all others will stall, waiting for work. Siebert [2008] shows that the number of times n that a processor stalls for lack of work during a parallel mark phase on a p-processor system is limited by the maximum depth of any reachable object o:

    n < (p - 1) · max{depth(o) : o ∈ reachable}

For example, with p = 8 processors and a maximum reachable-object depth of 100, at most 700 stalls can occur.
This formulation depends on the unrealistic assumption that all marking steps take the same amount of time. Of course, these steps are not uniform but depend on the kind of object being scanned. Although most objects in most programming languages are typically small (in particular, they contain only a few pointers), arrays may be larger, and often very much larger, than the common case (unless they are implemented as a contiguous 'spine' which contains pointers to fixed-size 'arraylets' that hold the array elements).
Fortunately, typical applications comprise a richer set of data structures than a single list. For example, tracing a branching data structure such as a tree will generate more work at each step than it consumes, until the trace reaches the leaves. Furthermore, there are typically multiple sources from which tracing can be initiated. These include global variables, the stacks of mutator threads and, in the case of generational or concurrent collectors, remembered sets. In a study of small Java benchmarks, Siebert finds that not only do many programs have a fairly shallow maximum depth but, more significantly, that the ratio between the maximum depth and the number of reachable objects is very small: stalls would occur on less than 4% of the objects marked, indicating a high degree of potential parallelism, with all the benchmarks scaling well up to 32 processors (or even more in some cases).
Tracing is the garbage collection component that is most problematic for identifying potential parallelism. The opportunities for parallelising other components, such as sweeping or fixing up references to compacted objects, are more straightforward, at least in principle. An obvious way to proceed is to split those parts of the heap that need to be processed into a number of non-overlapping regions, each of which is managed in parallel by a separate processor. Of course, the devil is in the details.
14.2 Load balancing

The goals of an even balance of work across collector threads and minimal coordination typically conflict. A static balance of work might be determined in advance of execution, at the startup of the memory manager or, at the latest, before a collection cycle. It may require no coordination of work between garbage collection threads other than to reach a consensus on when their tasks are complete. However, static partitioning may not always lead to an even distribution of work amongst threads. For example, a contiguous mark-compact space on an N-processor system might be divided into N regions, with each processor responsible for fixing up references in its own region. This is a comparatively simple task, yet its cost is dependent on the number of objects in the region, the number of references they contain, and so on. Unless these characteristics are broadly similar across regions, some processors are likely to have more work to do than others. Notice also that, as well as balancing the amount of work across processors, it is also important to balance other resources given to those processors. In a parallel implementation of Baker's copying collector [1978], Halstead [1984, 1985] gave each processor its own fixed fromspace and tospace. Unfortunately, this static organisation frequently led to one processor exhausting its tospace while there was room in other processors' spaces.
Many collection tasks require dynamic load balancing to distribute work approximately evenly. For jobs where it is possible to obtain a good estimate of the amount of work to be done in advance of performing it, even if this estimate will vary from collection to collection, the division of labour may be done quite simply, and in such a way that no further cooperation is required between parallel garbage collector threads. For example, in the compaction phase of a parallel mark-compact collector, after the marking phase has
identified live objects, Flood et al [2001] divide the heap into N regions, each containing approximately equal volumes of live data, and assign a processor to compact each region.
Alternatively, the work can be over-partitioned into more sub-tasks than there are threads or processors, with each thread competing to claim one task at a time to execute. Over-partitioning has several advantages. It is more resilient to changes in the number of processors available to the collector due to load from other processes on the machine, since smaller sub-tasks can more easily be redistributed across the remaining processors. If one task takes longer than expected to complete, any further work can be carried out by threads that have completed their smaller tasks. For example, Flood et al also over-partition the heap into M object-aligned areas of approximately equal size before installing forwarding pointers; M was typically chosen to be four times the number of collection threads. Each thread then competes to claim an area, counting the volume of live data in it and coalescing adjacent unmarked objects into a single garbage block. Notice how different load balancing strategies are used in different phases of this collector (which we discuss in more detail later).
We simplify the algorithms we present later in this chapter by concentrating on the three key sub-tasks of acquiring, performing and generating collection work. We abstract this by assuming in most cases that each collector thread executes the following loop:

    while not terminated()
        acquireWork()
        performWork()
        generateWork()

Here, acquireWork attempts to obtain one, or possibly more than one, unit of work; performWork does the work; and generateWork may take one or more new work units discovered by performWork and place them in the general pool for collector threads to acquire.
14.3 Synchronisation

It might seem that the best possible load balancing would be to divide the work to be done into the smallest possible independent tasks, such as marking a single object. However, while such fine granularity might lead to a perfect balancing of tasks between processors, since whenever a task was available any processor wanting work could claim it, the cost of coordinating processors makes this impractical. Synchronisation is needed both for correctness and to avoid, or at least minimise, repeating work. There are two aspects to correctness. It is essential to prevent parallel execution of garbage collector threads from corrupting either the heap or a collector's own data structures. Consider two examples. Any moving collector must ensure that only a single thread copies an object. If two threads were to copy it simultaneously, in the best case (where the object is immutable) space would be wasted, but the worst case risks the two replicas being updated later with conflicting values. Safeguarding the collector's own data structures is also essential. If all threads share a single marking stack, then all push and pop operations must be synchronised in order to avoid losing work when more than one thread manipulates the stack pointer or adds or removes entries.
Synchronisation between collector threads has time and space overheads. Mechanisms to ensure exclusive access may use locks or wait-free data structures. Well-designed algorithms minimise the occasions on which synchronisation operations are needed.
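To make the cost concrete, consider a shared mark stack kept as a linked list: even a simple push must be synchronised so that concurrent pushes cannot lose each other's nodes. The following is a hedged C11 sketch (a Treiber-style stack with invented names), not any particular collector's code; a complete implementation would also need to address the ABA problem on pop, for example with version counters.

    #include <stdatomic.h>

    typedef struct MarkNode {
        void *object;                    /* grey object to be traced */
        struct MarkNode *next;
    } MarkNode;

    static _Atomic(MarkNode *) top;      /* shared stack top */

    void pushWork(MarkNode *node) {
        MarkNode *old = atomic_load(&top);
        do {
            node->next = old;
            /* retry if another collector thread changed top meanwhile;
               on failure, 'old' is reloaded with the current top */
        } while (!atomic_compare_exchange_weak(&top, &old, node));
    }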
14.4 Taxonomy

In the rest of this chapter we will consider particular solutions to the problems of parallelising marking, sweeping, copying and compaction. Throughout, we assume that all mutator threads are halted at safe-points while the collector threads run to completion. As far as possible, we situate these case studies within a consistent framework. In all cases, we shall be interested in how the algorithms acquire, perform and generate collection work. The design and implementation of these three activities determines what synchronisation is necessary, the granularity of the workloads for individual collector threads and how these loads are balanced between processors.
Parallel garbage collection algorithms can be broadly categorised as either processor-centric or memory-centric. Processor-centric algorithms tend to have threads acquire work quanta that vary in size, typically by stealing work from other threads. Little regard is given to the location of the objects that are to be processed. However, as we have seen in earlier chapters, locality has significant effects on performance, even in the context of a uniprocessor. Its importance is even greater for non-uniform memory or heterogeneous architectures. Memory-centric algorithms, on the other hand, take location into greater account. They typically operate on contiguous blocks of heap memory and acquire and release work from and to shared pools of buffers of work; these buffers are likely to be of a fixed size. They are most likely to be used by parallel copying collectors.
Finally, we are concerned with the termination of parallel collection. Threads not only acquire work to do but also generate further work dynamically. Thus it is usually insufficient to detect termination of a collection cycle by, say, simply checking that a shared pool of work is empty, since an active thread may be about to add further tasks to that pool.
14.5 Parallel marking

Marking comprises three activities: acquiring an object to process from a work list; testing and setting one or more marks; and generating further marking work by adding the object's children to a work list. All known parallel marking algorithms are processor-centric. No synchronisation is necessary to acquire an object to trace if the work list is thread-local and non-empty. Otherwise the thread must acquire work (one or more objects) atomically, either from some other thread's work list or from some global list. Atomicity is chiefly necessary to maintain the integrity of the list from which the work is acquired. Marking an object more than once, or adding its children to more than one work list, affects the efficiency but not the correctness of the operation. The object's children can be added to the marking list without synchronisation if the list is private and unbounded. Synchronisation is necessary if the list is shared or if it is bounded. In the latter case, some marking work must be transferred to a global list whenever the local list is filled. If the object is a very large array of pointers, pushing all its children onto a work list as a single task may induce some load imbalance. Some collectors, especially those for real-time systems, process the slots of large objects incrementally, often by representing a large object as a linked data structure rather than a single contiguous array of elements.
Processor-centric techniques
Any parallel collector needs to take care with how mark bitmaps are treated and how large arrays are processed. Bits in a mark bitmap word must be set atomically. Rather than locking the word and then testing the bit, Endo et al use a simple load to test the bit and, only if it is not set, attempt to set it atomically, retrying if the set fails (because bits are only set in this phase, only a limited number of retries are needed); this is illustrated in Algorithm 14.2. Collectors like that of Flood et al [2001], which store the mark bit in the object header, can of course mark without atomic operations, though.
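This test-then-set discipline is easy to express with C11 atomics. The sketch below is illustrative, not Endo et al's code: it takes the bitmap word and bit position directly (a real collector would compute them from the object's address), and the plain first load means already-marked objects pay no atomic-operation cost.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Returns true if this thread set the mark, false if it was already set. */
    bool setMarked(_Atomic(uintptr_t) *word, unsigned bit) {
        uintptr_t mask = (uintptr_t)1 << bit;
        uintptr_t old = atomic_load_explicit(word, memory_order_relaxed);
        while (!(old & mask)) {
            /* try to set the bit; on failure, 'old' is reloaded by the CAS.
               Since bits are only ever set in this phase, retries are bounded. */
            if (atomic_compare_exchange_weak(word, &old, old | mask))
                return true;             /* we marked it */
        }
        return false;                    /* another thread marked it first */
    }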
Processing large arrays of pointers has been observed to be a source of problems. For example, Boehm and Weiser [1988] tried to avoid mark stack overflow by pushing large objects in smaller (128-word) portions. Similarly, Endo et al split a large object into 512-byte sections before adding them to a stack or queue in order to improve load balancing; here, the stack or queue holds (address, size) pairs.
The Flood et al [2001] parallel generational collector manages its young generation by copying and its old generation by mark-compact collection. In this section, we consider only parallel marking. Whereas Endo et al used a stack and a stealable queue per processor, Flood et al use just a single deque per collector thread. Their lock-free, work stealing algorithm is based on Arora et al [1998]; its low overhead allows work to be balanced at the level of individual objects. The algorithm works as follows; see also the detailed presentation in Section 13.8.
Algorithm 14.1: Work stealing from per-thread stealable work queues

acquireWork():
    if isEmpty(myMarkStack)
        for each j in Threads
            if not locked(stealableWorkQueue[j])
                if lock(stealableWorkQueue[j])
                    /* grab half of j's stealable work queue */
                    n <- size(stealableWorkQueue[j]) / 2
                    transfer(stealableWorkQueue[j], n, myMarkStack)
                    unlock(stealableWorkQueue[j])
                    return

performWork():
    while pop(myMarkStack, ref)
        for each fld in Pointers(ref)
            child <- *fld
            if child != null && not isMarked(child)
                setMarked(child)
                push(myMarkStack, child)

Algorithm 14.2: Atomically setting a bit in a mark byte

setMarked(ref):
    bitPosition <- markBit(ref)
    loop
        oldByte <- markByte(ref)
        if isMarked(oldByte, bitPosition)
            return
        newByte <- mark(oldByte, bitPosition)
        if CompareAndSet(&markByte(ref), oldByte, newByte)
            return
[Figure 14.2: The global overflow set, holding overflow objects of each class linked through their class structures.]
A thread treats the bottom of its deque as its mark stack; its push does not require synchronisation, and its pop operation requires synchronisation only to claim the last element of the deque. Threads without work steal an object from the top of other threads' deques using the synchronised remove operation. One advantage of this work stealing design is that the synchronisation mechanism, with its concomitant overheads, is activated only when it is needed to balance loads. In contrast, other approaches (such as grey packets, which we discuss below) may have their load balancing mechanism permanently 'turned on'.
The Flood et al thread deques are fixed size in order to avoid having to allocate during a collection. However, this risks overflow, so they provide a global overflow set with just a small, per-class, overhead. The class structure for each Java class C is made to hold a list of all the overflow objects of this class, linked together through their type fields (illustrated in Figure 14.2).
Siebert's block-structured collector takes a different approach: each block has a whole colour word rather than a few bits. A thread marks a block grey by using a CompareAndSwap operation to link it through this colour word into a local grey list of the processor on which the thread
Algorithm 14.3: Parallel marking with a deque per thread and a global overflow set [Flood et al, 2001]

shared overflowSet
shared deque[n]                          /* one per thread */
me <- myThreadId

acquireWork():
    if not isEmpty(deque[me])
        return
    n <- size(overflowSet) / 2
    if transfer(overflowSet, n, deque[me])
        return
    for each j in Threads
        ref <- remove(deque[j])          /* try to steal from j */
        if ref != null
            push(deque[me], ref)
            return

performWork():
    loop
        ref <- pop(deque[me])
        if ref = null
            return
        for each fld in Pointers(ref)
            child <- *fld
            if child != null && not isMarked(child)
                setMarked(child)
                push(deque[me], child)

generateWork():
    /* nop */
is running. To balance loads, Siebert steals other threads' work lists wholesale: a thread without work attempts to steal all of another thread's grey list. To prevent two threads from working on the same grey block, a new colour, anthracite, is introduced for blocks while they are being scanned in a mark step. Thief threads also steal by attempting to change the colour of the head of the grey list of another processor to anthracite. This mechanism is very coarse, and best suited to the case where the victim thread is not performing any collection work but may be only adding blocks to its grey list as it executes write barriers. This is a plausible scenario for a real-time, concurrent collector. However, if all threads are collecting garbage, it may degrade to a situation where all threads compete for a single remaining list of grey blocks. Siebert writes that this does not occur often in practice.
Figure 14.3: Grey packets. Each thread exchanges an empty packet for a packet of references to trace. Marking fills an empty packet with new references to trace; when it is full, the thread exchanges it with the global pool for another empty packet.
Consider, for example, a detector thread on processor A that steals all the tasks of processor B. First, A must clear its stack-empty flag, then set the detection-interrupted flag and finally B's queue-empty flag. Unfortunately, as Petrank and Kolodner [2004] point out, this protocol is flawed if more than one thread is allowed to detect termination, since a second detector thread may clear the detection-interrupted flag after the first detector thread has set it, thus fooling the first detector thread into believing that the flag remained clear throughout.
Kolodner and Petrank [1999] employ a solution common to many concurrency problems. They ensure that only one thread at a time can try to detect termination by introducing a lock: a synchronised, global, detector-identity word. Before attempting to detect termination, a thread must check that the detector-identity word is -1 (meaning that no thread is currently trying to detect termination) and, if so, try to set its own identity into the word atomically, or else wait.
Flood et al detect termination through a status word, with one bit for each participating thread, which must be updated atomically. Initially, all threads' statuses are active. When a thread has no work to do (and has not been able to steal any), it sets its status bit to inactive and loops, checking whether all the status word's bits are off. If so, all threads have offered to terminate and the collection phase is complete. Otherwise, the thread peeks at other threads' queues, looking for work to steal. If it finds stealable work, it sets its status bit to active and tries to steal. If it fails to steal, it reverts the bit to inactive and loops again. This technique clearly does not scale to a number of threads beyond the number of bits in a word. The authors suggest using a count of active threads instead.
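A minimal sketch of such a status word in C11 follows. The names are invented, and as the text notes it works only for as many threads as the word has bits; a production collector would also need the peek-and-reactivate protocol around it.

    #include <stdatomic.h>
    #include <stdbool.h>

    static atomic_uint active;           /* bit i set => thread i has work */

    void setActive(unsigned me)   { atomic_fetch_or(&active, 1u << me); }
    void setInactive(unsigned me) { atomic_fetch_and(&active, ~(1u << me)); }

    /* All threads have offered to terminate when every bit is clear. */
    bool allDone(void) {
        return atomic_load(&active) == 0;
    }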
Grey packets. Ossia et al observe that work stealing with per-thread mark stacks is a technique best suited to a small number of collector threads [Ossia et al, 2002; Barabash et al, 2005]. This will not be the case if each mutator thread also helps by performing a small increment of work, say, at each allocation. They also note that it may be difficult both for a thread to choose the best queue from which to steal, and to detect termination. Instead, they balance work loads by having each thread compete for packets of marking work to perform. (The first publication of this idea, other than through a patent application, was by Ossia et al [2002].) Their system had a fixed number (1000) of packets available, and each packet was of a fixed size (512 entries).
Each thread uses two packets: it processes entries in its input packet and adds work to its output packet. Packets are acquired from and returned to global pools, which are managed as lists updated with a CompareAndSwap operation (with the thread's identifier added to the head of the list to avoid an ABA problem). They also reduce the number of fences that have to be inserted on machines with weakly ordered memory: rather than fencing after pushing each object, a fence is required only when a thread acquires or returns packets. Ossia et al use a vector of allocation bits when they conservatively scan thread stacks in order to determine whether a putative reference really does point to an allocated object. Their allocation bits are also used for synchronisation between mutators and collectors. Their allocators use local allocation buffers. On local allocation buffer overflow, the allocator performs a fence and then sets the allocation bits for all the objects in that local allocation buffer, thus ensuring that the stores that allocate and initialise new objects are visible before the stores that set their allocation bits (Algorithm 14.5). Two further fences are needed. First, when a tracing thread acquires a new input packet, it tests the allocation bits of every object in the new packet, recording in a private data structure whether an object is safe to trace (its allocation bit has been set) or not. The thread then fences before continuing to trace all the safe objects in the input packet. Tracing unsafe objects is deferred; instead, they are added to a third, deferred, packet. At some point, this packet may be returned to a global pool of deferred packets. This protocol ensures that an object cannot be traced before its allocation bit has been loaded and found to be set. A tracing thread also fences when it returns its output packet to the global pool (in order to prevent the stores to the packet being reordered with respect to adding the packet back to the global pool). A fence is not needed for this purpose when getting an input packet, since there is a data dependency between loading the pointer to the packet and accessing its contents, an ordering that most hardware respects.
Grey packets make it comparatively easy to track state. Each global pool has an associated count of the number of packets it contains, updated by an atomic operation whenever a packet is added or removed.
Algorithm 14.4: Grey packet management

getInPacket():
    atomic
        inPacket <- remove(fullPool)
    if isEmpty(inPacket)
        atomic
            inPacket <- remove(halfFullPool)
    if isEmpty(inPacket)
        inPacket, outPacket <- outPacket, inPacket
    return not isEmpty(inPacket)

testAndMarkSafe(packet):
    for each ref in packet
        safe(ref) <- (allocBit(ref) = true)   /* private data structure */

getOutPacket():
    if isFull(outPacket)
        generateWork()
    if outPacket = null
        atomic
            outPacket <- remove(emptyPool)
    if outPacket = null
        atomic
            outPacket <- remove(halfFullPool)
    if outPacket = null
        if not isFull(inPacket)
            inPacket, outPacket <- outPacket, inPacket
    return

addOutPacket(ref):
    getOutPacket()
    if outPacket = null || isFull(outPacket)
        dirtyCard(ref)
    else
        add(outPacket, ref)
Algorithm 14.5: Allocation with allocation bits

sequentialAllocate(n):
    result <- free
    newFree <- result + n
    if newFree <= labLimit
        free <- newFree
        return result
    /* LAB overflow */
    fence
    for each obj in lab
        allocBit(obj) <- true
    /* then acquire a new local allocation buffer and retry */

Algorithm 14.6: Marking with grey packets

acquireWork():
    if isEmpty(inPacket)
        if getInPacket()
            testAndMarkSafe(inPacket)
            fence

performWork():
    for each ref in inPacket
        if safe(ref)
            for each fld in Pointers(ref)
                child <- *fld
                if child != null && not isMarked(child)
                    setMarked(child)
                    addOutPacket(child)

generateWork():
    fence
    add(fullPool, outPacket)
    outPacket <- null
Algorithm 14.7: Marking with channels [Wu and Li, 2007]

performWork():
    loop
        if isEmpty(myMarkStack)
            return
        ref <- pop(myMarkStack)
        for each fld in Pointers(ref)
            child <- *fld
            if child != null && not isMarked(child)
                if not generateWork(child)   /* drip a task to another processor */
                    push(myMarkStack, child)

generateWork(ref):
    for each j in Threads
        if needsWork(j) && not isFull(channel[me, j])
            add(channel[me, j], ref)
            return true
    return false
So that a pool's count cannot drop to zero temporarily, each thread must obtain a new packet before it replaces the old one. Requiring a thread to obtain its input packet before its output packet at the start of a collection will ensure that attempts to acquire work packets when no tracing work remains will not prevent termination detection.
Grey packets limit the depth of the total mark queue, making it possible that marking may overflow. If a thread cannot obtain an output packet with vacant entries, it may swap the roles of its input and output packets. If both are full, some overflow mechanism is required: in Algorithm 14.4, addOutPacket falls back to dirtying the card corresponding to the reference, deferring that marking work.
Channels. Wu and Li [2007] suggest an architecture for load balancing on large-scale servers that does not require expensive atomic operations. Instead, threads exchange marking tasks through single writer, single reader channels (recall Algorithm 13.34), as shown in Algorithm 14.7. In a system with P marking threads, each thread has an array of P - 1 queues, implemented as circular buffers; null indicates that a slot in the buffer
is empty. It is the restriction to one reader and one writer that allows this architecture to avoid the expense of atomic operations. It performed better than the Flood et al [2001] work stealing algorithm on servers with a large number of processors.
Similar to the strategy used by Endo et al [1997], threads proactively give up tasks to other threads. When a thread i generates a new task, it first checks whether any other thread j needs work and, if so, adds the task to the output channel (i → j). Otherwise, it pushes the task onto its own marking stack. If its stack is empty, it takes a task from one of its input channels.
14.6 Parallel copying

Processor-centric techniques
Dividing work among processors. Blelloch and Cheng parallelise copying in the context of replicating collection [Blelloch and Cheng, 1999; Cheng and Blelloch, 2001; Cheng, 2001]. We discuss replicating collection in detail in Chapter 17 but, in brief, replicating collectors are incremental or concurrent collectors that copy live objects while the mutators are running, taking special care to fix up the values of any fields that a mutator might have changed during the course of a collection cycle. In this chapter, we discuss only the parallelism aspects of their design.
Each copying thread is given its own stack of work to do. Blelloch and Cheng claim that stacks offer easier synchronisation between copying threads and less fragmentation than Cheney queues (but we examine Cheney-style parallel copying collectors below). Load is balanced by having threads periodically transfer work between their local stacks and a shared stack.
Algorithm 14.8: Parallel copying with local stacks and a shared stack (after Blelloch and Cheng [1999])

collect():
    loop
        enterRoom()                      /* enter pop room */
        repeat a fixed number of times
            if isLocalStackEmpty()
                acquireWork()
                if isLocalStackEmpty()
                    break
            performWork()
        transitionRooms()
        generateWork()
        if exitRoom()                    /* leave push room */
            terminate()

acquireWork():
    sharedPop()                          /* move work from shared stack */

performWork():
    ref <- localPop()
    scan(ref)                            /* see Algorithm 4.2 */

generateWork():
    sharedPush()                         /* move work to shared stack */

isLocalStackEmpty():
    return sp = 0

localPush(ref):
    myCopyStack[sp++] <- ref

localPop():
    return myCopyStack[--sp]

sharedPop():                             /* move work from shared stack */
    cursor <- FetchAndAdd(&sharedStack, 1)   /* try to grab from shared stack */
    if cursor > stackLimit               /* shared stack empty */
        FetchAndAdd(&sharedStack, -1)    /* readjust stack */
    else
        myCopyStack[sp++] <- cursor[0]   /* move work to local stack */
Algorithm 14.9: The rooms abstraction (after Blelloch and Cheng [1999])

enterRoom():                             /* enter pop room */
    while gate != OPEN
        /* do nothing: wait */
    FetchAndAdd(&popClients, 1)          /* try to start popping */
    while gate != OPEN
        FetchAndAdd(&popClients, -1)     /* back out since did not succeed */
        while gate != OPEN
            /* do nothing: wait */
        FetchAndAdd(&popClients, 1)      /* try again */

transitionRooms():
    gate <- CLOSED
    FetchAndAdd(&pushClients, 1)         /* move from pop room to push room */
    FetchAndAdd(&popClients, -1)
    while popClients > 0
        /* do nothing: wait for pop room to empty */

exitRoom():
    pushers <- FetchAndAdd(&pushClients, -1) - 1   /* stop pushing */
    if pushers = 0                       /* I was last in push room: check termination */
        if isEmpty(sharedStack)          /* no grey objects left */
            gate <- OPEN
            return true
        else
            gate <- OPEN
            return false
    return false
Before a thread can enter the push room, the popping room must be empty. The algorithm is shown in Algorithm 14.9. At each iteration of the collection loop, a thread first enters the pop room and performs a fixed amount of work. It obtains slots to scan either from its own local stack or from the shared stack with a FetchAndAdd. Any new work generated is added to its local stack. The thread then leaves the pop room and waits until all other threads have also left the room before it tries to enter the push room. The first thread to enter the push room closes the gate to prevent any other thread entering the pop room. Once in the push room, the thread empties its local stack entirely onto the shared stack, again using FetchAndAdd to reserve space on the stack. The last thread to leave the push room opens the gate.
The problem with this mechanism is that any processor waiting to enter the push room must wait until all processors in the pop room have finished greying their objects. The time to grey objects is considerable compared to fetching or depositing new work, and a processor trying to transition to the push phase must wait for all other processors already in the pop phase to finish greying their objects. Large variations in the time for different
processors to grey their objects make this idle time significant. A more relaxed abstraction would allow processors to leave the pop room without going into the push room. Since greying objects is not related to the shared stack, that work can be done outside the rooms. This greatly increases the likelihood that the pop room is empty and so a thread can move to the push room.
The original Blelloch and Cheng room abstraction allows straightforward termination detection. Each thread's local tracing stack will be empty when it leaves the push room, so the last thread to leave should detect whether the shared stack is also empty. However, the relaxed definition means that collection threads may be working outside the rooms. With this abstraction, the shared stack must maintain a global counter of how many threads have borrowed objects from it. The last thread to leave the push room must check whether this counter is zero as well as whether the shared stack is empty.
Copying objects in parallel. To ensure that only one thread copies an object, threads must race to copy an object and install a forwarding address in the old version's header. How threads copy an object depends on whether or not they share a single allocation region. By sharing a single region, threads avoid some wastage, but at the cost of having to use an atomic operation to allocate. In this case, Blelloch and Cheng [1999] have threads race to write a 'busy' value into the object's forwarding pointer slot. The winning thread copies the object before overwriting the slot with the address of the replica; losing threads must spin until they observe a valid pointer value in the slot. An alternative, if each thread knows where it will copy an object (for example, because it will copy into its own local allocation buffer), is for threads to attempt to write the forwarding address atomically into the slot before they copy the object.
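The following C sketch shows the busy-word protocol just described. The object layout, the BUSY encoding and the allocate function are assumptions of this sketch, not a prescribed implementation; in particular, BUSY must be a value that can never be mistaken for a valid pointer.

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stddef.h>
    #include <string.h>

    #define BUSY ((uintptr_t)1)          /* never a valid aligned pointer */

    typedef struct Object {
        _Atomic(uintptr_t) forward;      /* 0 = not yet forwarded */
        size_t size;
        /* ... fields ... */
    } Object;

    extern void *allocate(size_t size);  /* assumed tospace allocator */

    Object *forwardObject(Object *obj) {
        uintptr_t fwd = atomic_load(&obj->forward);
        if (fwd == 0 &&
            atomic_compare_exchange_strong(&obj->forward, &fwd, BUSY)) {
            Object *copy = allocate(obj->size);   /* we won the race */
            memcpy(copy, obj, obj->size);
            atomic_store(&copy->forward, 0);      /* replica is not forwarded */
            atomic_store(&obj->forward, (uintptr_t)copy);  /* publish */
            return copy;
        }
        while ((fwd = atomic_load(&obj->forward)) == BUSY)
            ;                            /* loser: spin until a real pointer appears */
        return (Object *)fwd;
    }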
Marlow et al [2008] compared two approaches in the context of the GHC Haskell system. In the first approach, a thread trying to copy an object first tests whether it has been forwarded. If it has, it simply returns the forwarding address. Otherwise, it attempts to CompareAndSwap a busy value into the forwarding address word; this value should be distinguishable from both a 'normal' value to be expected in that slot (such as a lock or a hash code) and a valid forwarding address. If the operation succeeds, the thread copies the object, writes the address of its tospace replica into the slot and then returns this address. If the busy CompareAndSwap fails, the thread spins until the winning thread has completed copying the object. In their second approach, they avoid spinning by having threads optimistically copy the object and then CompareAndSwap the forwarding address. If the CompareAndSwap fails, the copy must be retracted (for example, by returning the thread's free pointer to its original value). They found that this latter approach offered little benefit since collisions were rare. However, they suggest that in the case of immutable objects it may be worthwhile to replace the atomic write with an unsynchronised one. In a similar vein, a thread can make a speculative allocation in its local allocation buffer and then attempt to CompareAndSwap
the forwarding pointer. If it succeeds, the thread copies the object. If the CompareAndSwap fails, it will return the forwarding pointer that the winning thread installed.
As we have seen throughout this book, locality has a significant impact on performance. This is likely to become increasingly important for multiprocessors with non-uniform memory architectures. Here, the ideal is to place objects close to the processor that will use them most. Modern operating systems support standard memory affinity policies, used to determine the processor from which memory will be reserved. Typically, a policy may be first-touch or local, in which case memory is allocated from the processor running the thread that requested it, or round-robin, where memory allocation is striped across all processors. A processor-affinity thread scheduler will help preserve locality properties by attempting to schedule a thread to the last processor on which it ran. Ogasawara [2009] observes that, even with a local-processor policy, a memory manager that is unaware of a non-uniform memory architecture may not place objects appropriately. If local allocation buffers are smaller than a page and are handed out to threads linearly, then some threads will have to allocate in remote memory, particularly if the system is configured to use the operating system's large page (16 megabyte) feature to reduce the cost of virtual to physical address translation. Further, collectors that move objects will not respect their affinity.
In contrast, Ogasawara's memory manager is aware of non-uniform memory access and so splits the heap into segments of one or more pages. Each segment is mapped to a single processor. The allocator, used by both mutator and collector threads, preferentially obtains blocks of memory from the preferred processor. For the mutator, this will be the processor on which the thread is running. The collector threads always try to evacuate live objects to memory associated with their preferred processor. Since the thread that allocated an object may not be the one that accesses it most frequently, the collector also uses dominant-thread information to determine each object's preferred processor. First, for objects directly referred to from the stack of a mutator thread, this will be the processor on which that mutator thread was running; it may be necessary for mutator threads to update the identity of their preferred processor periodically. Second, the collector can use object locking information to identify the dominant thread. Locking schemes often leave the locking thread's identity in a word in the object's header. Although this identifies only the thread, and hence the preferred processor, that last locked the object, this is likely to be a sufficient approximation, especially as many objects never escape their allocating thread (although they may still be locked). Finally, the collector can propagate the preferred processor from parent objects to their children. In the example in Figure 14.4, three threads are marking; for simplicity, we assume they are all running on their preferred processors.
Figure 14.5: Chunk management in the Imai and Tick [1993] parallel copying collector, showing selection of a scan block before (above) and after (below) overflow. Hatching denotes blocks that have been added to the global pool.
Memory-centric techniques

Per-thread fromspace and tospace. Copying collection lends itself naturally to a division of labour based on objects' locations. A simple solution to parallelising copying collection is to give each Cheney-style collector its own fromspace and tospace [Halstead, 1984]. In this way, each thread has its own contiguous chunk of memory to scan, but still competes with other threads to copy objects and install forwarding pointers. However, this very simple design not only risks poor load balancing, as one processor may run out of work while others are still busy, but also requires some mechanism to handle the case that one thread's tospace overflows although there is unused space in other tospaces.
Imai and Tick [1993] improved on this design in two ways. First, the chunks they used were comparatively small (only 256 words). The problem with using small chunks for linear allocation is that it may lead to excessive fragmentation since, on average, we can expect to waste half an object's worth of space at the end of each chunk. To solve this, Imai and Tick used big bag of pages allocation (see Chapter 7) for small objects; consequently each thread owned N chunks for copying. Larger objects and chunks were both allocated from the shared heap using a lock.
Second, they balanced load at a granularity finer than a chunk. Each chunk was divided into smaller blocks (which they called 'load distribution units'). These might be as small as 32 words; smaller blocks led to better speedups. In this algorithm, each thread offered to give up some of its unscanned blocks whenever it needed a new scan block. After scanning a slot and incrementing its scan pointer, the thread checked whether
Figure 14.6: Block states and transitions in the Imai and Tick [1993] collector. Blocks in states with thick borders are part of the global pool; those with thin borders are owned by a thread.

[Table 14.1: State transition logic for the Imai and Tick collector. For each combination of copy block and scan block colourings, the table gives the transitions applied (for example copy → scanlist, scanlist → scan, freelist → copy, scan → done, or aliasing of the copy and scan blocks); some combinations cannot happen.]
it had reached the block boundary. If so, and the next object was smaller than a block, the thread advanced its scan pointer to the start of its current copy block. This helps reduce contention on the global pool, since the thread does not have to compete to acquire a scan block. It also avoids a situation whereby the only blocks containing grey objects to scan are copy blocks. If there were any unscanned blocks between the old scan block and the copy block, these were given up to the global pool for other threads to claim. Figure 14.5 shows two example scenarios. In Figure 14.5a, a thread's scan and copy blocks are in the same chunk; in Figure 14.5b, they are in different chunks. Either way, all but one of the unscanned blocks in the thread's copy and scan blocks are given up to the global pool.
If the object was larger than a block but smaller than a chunk, the scan pointer was advanced to the start of the thread's current copy chunk. If the object was large, the thread continued to scan it. Any large objects copied were immediately added to the global pool.
Figure 14.6 shows the states of blocks and their transitions. Blocks in the states freelist, scanlist and done are in the global pool; blocks in the other states are local to a thread. The transitions are labelled with the possible colourings of a block when it changes state. Under the Imai and Tick scheme, a block's state can change only when the scan pointer reaches the end of a scan block, the copy pointer reaches the end of a copy block, or scan reaches free (that is, the scan block is the same as the copy block: they are aliased). For example, a block must contain at least some empty space in order to be a copy block, so all blocks moving into the copy state are at least partially empty. Table 14.1 shows the actions taken, depending on the states of the copy and scan blocks. For example, if the copy block contains both grey slots and empty space and the unaliased scan block is completely black, then we are finished with the scan block and continue scanning in the copy block: the copy and scan blocks are now aliases of one another.
Marlow et al [2008] found that this block-at-a-time load balancing over-sequentialised the collector when work was scarce in GHC Haskell. For example, if a thread evacuates its roots into a single block, it will export work to other threads only when its scan and free pointers are separated by more than a block. Their solution is to export partially full blocks to the global pool whenever (i) the size of the pool is below some threshold, (ii) the thread's copy block has sufficient work to be worth exporting, and (iii) its scan block has enough unscanned slots to process before it has to claim a new block to scan. The optimum minimum quantum of work to export was 128 words (for most of their benchmarks, though some benefited from much smaller quanta). This design could be expected to suffer badly from fragmentation if threads were to acquire only empty blocks for copying while exporting partially filled ones. To avoid this, they have threads prefer to acquire partially filled blocks for copying.
Siegwart and Hirzel [2006] obtain hierarchical copying in a parallel collector. In hierarchical copying, partially scanned blocks are associated with two pointers, a partial scan pointer and a free space pointer. Similarly, Imai and Tick used pairs of scan and free pointers for their blocks. The trick to obtaining a hierarchical traversal of the object graph with the parallel algorithm is therefore for threads to select the 'right' blocks to use next. Like both of these collectors, Siegwart and Hirzel prefer to alias copy and scan blocks, in contrast to the approach that Ossia et al [2002] used, where they strove to have threads hold distinct input and output packets. Unlike Imai and Tick, who defer checking whether the copy and scan blocks can be aliased until the end of a block, Siegwart and Hirzel make the check immediately after scanning a grey slot. It is this immediacy that leads to the hierarchical decomposition order of traversal of the object graph.
Figure 14.7 shows the states of blocks and their transitions under this scheme. As before, blocks in the states freelist, scanlist and done are in the global pool; blocks in the other states are local to a thread. The transitions are labelled with the possible colourings of a block when it changes state. Table 14.2 shows the actions taken, depending on the states
Figure 14.7: Block states and transitions in the Siegwart and Hirzel collector. Blocks in states with thick borders are part of the global pool; those with thin borders are local to a thread. A thread may retain one block of the scanlist in a local cache.
copy aliased
-\302\273
copy aliased
\342\200\224>
scanlist scan
-\302\273 scanlist scan
\342\200\224>
freelist -\302\273
copy freelist \342\200\224\302\273
copy copy scan
\342\200\224>
freelist \342\200\224>
copy
\342\226\240 aliased done
-\302\273 (cannot happen) (cannot happen)
freelist -\302\273
copy
scanlist scan
-\302\273
Table 14.2: State transition logic for the Siegwart and Hirzel collector.
Siegwart and Hirzel [2006], doi: 10.1145/1133 956.1133964.
2006
\302\251 Association for Computing Machinery, Inc. Reprinted by permission.
For example, if the copy block contains both grey slots and empty space and the unaliased scan block also has grey slots, then we return the scan block to the scanlist and continue scanning in the copy block: the copy and scan blocks are now aliases of one another. Thus, the state transition system for Siegwart and Hirzel is a superset of that for Imai and Tick [1993].
A thread may cache one block of the scanlist: the transition scanlist → scan really obtains the cached block (if any), and scan → scanlist caches the block, possibly returning in its stead the previously cached block to the shared pool of blocks to be scanned. Parallel hierarchical copying is very effective in improving the spatial locality of connected objects: most parents and children were within a page (four kilobytes) of each other. In particular, it offers a promise of reduced translation lookaside buffer and cache miss rates. Thus, it can trade mutator speedup for collector slowdown; whether or not this is a good trade depends on the application.
In the channel-based collector of Oancea et al [2009], locks are needed only to acquire a partition's work list. The partitions used here are larger, at 32 kilobytes, than those we have seen before. While a larger granularity reduces communication costs, it is less effective at load balancing than finer grained approaches.
While there is work left, each thread processes work in its incoming channels and its work list. The termination condition for a collector thread is that (i) it does not own any work list, (ii) all its input and output channels are empty, and (iii) all work lists (of all threads) are empty. On exit, each thread sets a globally visible flag. Oancea et al take a pragmatic approach to the management of this collector. They use an initialisation phase that processes in parallel a number (30,000) of objects under a classical tracing algorithm and then places the resulting grey objects in their corresponding work lists, locking the partitions to do so, before distributing the work lists among the processors and switching to the channel-based algorithm.
One way to parallelise the scanning of a card table is to divide it into consecutive, equally sized blocks, either statically assigned to processors or for which collector threads would compete to claim. However, the distribution of live objects among blocks tends to be uneven, with some blocks very densely populated and others very sparsely. Flood et al [2001] found that this straightforward division of work led to uneven load balancing, as scanning the dense blocks dominated collection time. To address this, they over-partitioned the card table into N strides, each a set of cards separated by intervals of N cards. Thus, cards {0, N, 2N, ...} comprise one stride, cards {1, N+1, 2N+1, ...} comprise the next, and so on. This causes dense areas to be spread across strides. Instead of competing for blocks, threads compete to claim strides.
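A hedged C sketch of iterating over one stride follows; the card-table representation and the scan_card helper are assumptions for illustration, not Flood et al's code.

    #include <stddef.h>

    /* Process every card belonging to stride `stride_id`. Stride k consists of
     * cards {k, k + n_strides, k + 2*n_strides, ...}, so dense regions of the
     * card table are spread evenly across all strides. */
    static void scan_stride(unsigned char *card_table, size_t n_cards,
                            size_t stride_id, size_t n_strides,
                            void (*scan_card)(size_t card_index)) {
        for (size_t card = stride_id; card < n_cards; card += n_strides) {
            if (card_table[card])   /* card is dirty: it holds work to do */
                scan_card(card);
        }
    }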
Other schemes divide work over blocks in a way that balances load according to the allocation rates of mutator threads.
The first and only step in the sweep phase of lazy sweeping is to identify completely empty blocks and return them to the block allocator. In order to reduce contention, Endo et al [1997] gave each sweep thread several (for example, 64) consecutive blocks to process locally. Their collector used bitmap marking, with the bitmaps held in block headers, stored separately from the blocks themselves. This makes it easy to determine whether a block is completely empty or not. Empty ones are sorted and coalesced, and added to a local free-block list. Partially full blocks are added to local reclaim lists (for example, one for each size class if segregated-fits allocation is being used) for subsequent lazy sweeping by mutator threads. Once a processor has finished with its sweep set, it merges its free-block list into the global free-block list. One remaining question is, what should a mutator thread do if it has run out of blocks on its local reclaim list and the global pool of blocks is empty? One solution is that it should steal a block from another thread. This requires synchronising the acquisition of the next block to sweep, but this is a reasonable cost to pay since acquiring a new block to sweep is less frequent than allocating a slot in a block, and we can expect contention for a block to sweep to be uncommon.
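The mutator's slow path can then be sketched as follows. This is a hedged illustration; reclaim_pop, global_pool_pop and steal_block are assumed helpers, not Endo et al's interface.

    typedef struct Block Block;

    extern Block *reclaim_pop(int size_class);  /* thread-local reclaim list */
    extern Block *global_pool_pop(void);        /* shared pool; synchronised */
    extern Block *steal_block(void);            /* synchronised steal from a peer */
    extern void   lazy_sweep(Block *b);         /* sweep b using its mark bitmap */

    Block *acquire_block(int size_class) {
        Block *b = reclaim_pop(size_class);     /* common case: unsynchronised */
        if (b == NULL)
            b = global_pool_pop();              /* fall back to the global pool */
        if (b == NULL)
            b = steal_block();                  /* rare, so its synchronisation is
                                                   cheap next to per-slot allocation */
        if (b != NULL)
            lazy_sweep(b);                      /* make its free slots allocatable */
        return b;                               /* NULL: trigger a collection */
    }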
Once marking is complete, all compacting collectors require two or more further phases to determine the forwarding address of each object, to update references and to move objects. As we saw in Chapter 3, different algorithms may perform these tasks in different orders or even combine two tasks in a single pass over the heap.
Crammond [1988] implemented a location-aware parallel collector for Parlog, a concurrent logic programming language. Logic programming languages benefit from preserving the order of objects in the heap. In particular, backtracking to 'choice points' is made more efficient by preserving the allocation order of objects in memory, since all memory allocated after the choice point can simply be discarded. Sliding compaction preserves this order. Crammond's collector parallelised the Morris [1978] threaded collector, which we discussed in Section 3.3; in this section, we consider only the parallelism aspects of the algorithm. Crammond reduced the cost of synchronisation by dividing the heap into regions associated with
Figure 14.8: Flood et al [2001] divide the heap into one region per thread and alternate the direction in which compacting threads slide live objects (shown in grey).
processors. A processor encountering an object in its own region marked and counted
it without synchronisation. However, if the object was a 'remote' one, a reference to it was added to that processor's stack of indirect references and a global counter was incremented. The remote processor was responsible for processing the object and decrementing the global counter (which was used to detect termination). Thus, synchronisation (using locks) was only required for remote objects, since the indirect stacks were single reader, multiple writer structures. Crammond found that indirect references typically comprised less than 1% of the objects marked.
Flood et al [2001] use parallel mark-compact to manage the old generation of their Java virtual machine. The collector uses three further phases after parallel marking (which we discussed above) to (i) calculate forwarding addresses, (ii) update references and (iii) move objects. An interesting aspect of their design is that they use different load balancing strategies for different phases of compaction. Uniprocessor compaction algorithms typically slide all live data to one end of the heap space. If multiple threads move data in parallel, then it is essential to prevent one thread from overwriting live data before another thread has moved it. For this reason, Flood et al do not compact all objects into a single, dense end of the heap but instead divide the space into several regions, one for each compacting thread. Each thread slides objects in its region only. To reduce the (limited) fragmentation that this partitioning might incur, they also have threads alternate the direction in which they move objects in even and odd numbered regions (see Figure 14.8).
The first step is to install a forwarding pointer into the header of each live object. This will hold the address to which the object is to be moved. In this phase, they over-partition the space in order to improve load balancing. The space is split into M object-aligned units, each of roughly the same size; they found that a good choice on their eight-way UltraSPARC server was to use four times as many units as garbage collection threads, M = 4N. Threads compete to claim units and then count the volume of live data in each unit; to improve subsequent passes, they also coalesce adjacent garbage objects into single quasi-objects. Once they know the volume of live objects in each unit, they can partition the space into N unevenly sized regions that contain approximately the same amount of live data. These regions are aligned with the units of the previous pass. They also calculate the destination address of the first live object in each unit, being careful to take into account the direction in which objects in a region will slide. Collection threads then compete once again to claim units in order to install forwarding pointers in each live object of their units.
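The idea of carving N regions of roughly equal live volume out of M measured units can be sketched as follows. This is an illustration under assumed data structures, not Flood et al's code.

    #include <stddef.h>

    /* Split m_units units into n_regions contiguous regions of roughly equal
     * live volume. live[u] is the live data counted in unit u; on return,
     * region_start[r] is the index of the first unit of region r. */
    static void partition_regions(const size_t *live, size_t m_units,
                                  size_t n_regions, size_t *region_start) {
        size_t total = 0;
        for (size_t u = 0; u < m_units; u++)
            total += live[u];

        size_t target = (total + n_regions - 1) / n_regions;  /* per-region goal */
        size_t region = 0, acc = 0;
        region_start[region++] = 0;
        for (size_t u = 0; u < m_units && region < n_regions; u++) {
            acc += live[u];
            if (acc >= target) {          /* close this region after unit u */
                region_start[region++] = u + 1;
                acc = 0;
            }
        }
        while (region < n_regions)        /* degenerate case: too few units */
            region_start[region++] = m_units;
    }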
The next pass updates references to point to objects' new locations. As usual, this requires scanning mutator threads' stacks, references to objects in this heap space that are held in objects stored outside that space, as well as live objects in this space.
The resulting compaction is nearly perfect, in the sense that space need only be wasted in a pile due to object alignment requirements. However, it is possible that if a very large number of threads/regions were to be used, it may be difficult for mutators to allocate very large objects.
Abuaiadh et al [2004] address the first problem by calculating rather than storing forwarding addresses, using the mark bitmap and an offset vector that holds the new address of the first live object in each small block of the heap, as we described in Section 3.4. Their solution to the second problem is to over-partition the heap into a number of fairly large areas. For example, they suggest that a typical choice may be to have 16 times as many areas as processors, while ensuring that each area is at least four megabytes. The heap areas are compacted in order. Threads race to claim an area, using an atomic operation to increment a global area index (or pointer). If the operation is successful, the thread has obtained this area to compact. If it was not successful, then another thread must have claimed it and the first thread tries again for the next area; thus, acquisition of areas is wait-free. A table holds pointers to the beginning of the free space for each area. After winning an area to compact, the thread competes to acquire an area into which it can move objects. A thread claims an area by trying to write null atomically into its corresponding table slot. Threads never try to compact from a source area nor into a target area whose table entry is null, and objects are never moved from a lower to a higher numbered area. Progress is guaranteed since a thread can always compact an area into itself. Once a thread has finished with an area, it updates the area's free space pointer in the table. If an area is full, its free space pointer will remain null.
Abuaiadh et al explored two ways in which objects could be moved. The best compaction, with the least fragmentation, is obtained by moving individual live objects to their destination, as we described above. Note that because every object in a block is moved to a location partly determined by the offset vector for that block, a block's objects are never split between two destination areas. They also tried trading quality of compaction for reduced compaction time by moving whole blocks at a time (256 bytes in their implementation), illustrated in Figure 14.9. Because objects in a linearly allocated space tend to live
and die in clumps, they found that this technique could reduce compaction time by a fifth at the cost of increasing the size of the compaction area by only a few percent. On the other hand, it is not hard to invent a worst case that would lead to no compaction at all.
The calculate-rather-than-store forwarding address mechanism was later adopted by Compressor [Kermany and Petrank, 2006]. However, Compressor introduced some changes. First, as the second phase of the collector passes over the mark bitmap, it calculates a first-object vector as well as the offset vector.5 The first-object table is a vector indexed by the pages that will hold the relocated objects. Each slot in the table holds the address in fromspace of the first object that will be moved into that page. Compaction itself starts by updating the roots (using the information held in the mark and offset vectors). The second difference is that each thread then competes to claim a tospace page from the first-object table. A successful thread maps a new physical page for its virtual page, and copies objects starting from the location specified in this slot of the first-object table, using the offset and mark vectors. Acquisition of a fresh page to which to evacuate objects allows Compressor to use parallel collector threads, whereas the description we gave in Chapter 3 sequentialised the sliding of objects. At first sight, this may look as if it is a copying algorithm rather than a mark-compact one. However, Compressor truly is a sliding mark-compact collector. It manages fromspace and tospace pages at a cost in physical memory of typically only one page per collector thread, in stark contrast to a traditional semispace collector which requires twice as much heap space. The trick is that, although Compressor needs to map fresh tospace pages, it can also unmap each fromspace page as soon as it has evacuated all the live objects from it.
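For illustration, here is a hedged sketch of the calculate-rather-than-store scheme of Section 3.4, assuming a bitmap in which every word of a live object is marked (one mark per heap word); the helper names and byte-per-bit representation are invented for clarity.

    #include <stdint.h>

    #define BLOCK_BYTES     256
    #define WORD_BYTES      8
    #define WORDS_PER_BLOCK (BLOCK_BYTES / WORD_BYTES)

    extern uint8_t   mark[];     /* 1 per live heap word (byte-expanded here) */
    extern uintptr_t offset[];   /* new address of first live object per block */

    /* Count the live words in [from, to). */
    static uintptr_t live_words(uintptr_t from, uintptr_t to) {
        uintptr_t n = 0;
        for (uintptr_t w = from; w < to; w++) n += mark[w];
        return n;
    }

    /* Forwarding address of the object starting at heap word `word_index`:
     * the block's base destination plus the live data that precedes the
     * object within its block. Nothing is stored in object headers. */
    uintptr_t forwarding_address(uintptr_t word_index) {
        uintptr_t block = word_index / WORDS_PER_BLOCK;
        uintptr_t block_start = block * WORDS_PER_BLOCK;
        uintptr_t before = live_words(block_start, word_index);
        return offset[block] + before * WORD_BYTES;
    }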
Terminology
Earlier work was often inconsistent in the terminology it used to describe parallel garbage collection. Papers in the twentieth century often used 'parallel', 'concurrent' and even 'real-time' interchangeably. Fortunately, since around 2000, authors have adopted a consistent usage. Thus, a parallel collector is now one that uses multiple garbage collector threads, running in parallel. The world may or may not be stopped while parallel collection threads run. It seems clear that it is sensible to allow parallel collection if the underlying platform has the capability to support this, in the same way that it is desirable to allow mutator threads to use all available parallel resources.
5At 512 bytes, their blocks are also larger than those of Abuaiadh et al [2004].
Even in the tracing phase, thread stacks and remembered sets can be scanned in parallel and with little synchronisation overhead; completing the trace in parallel requires more careful handling of work lists in order to limit the synchronisation costs while at the same time using parallel hardware resources as efficiently as possible. Information gathered in one phase can be used to estimate a fair division between threads of the work to be done by subsequent phases. The Flood et al [2001] collector is a good example of this approach.
Managing tracing
Tracing the heap involves consuming work (objects to mark or copy) and generating further work (their untraced children). Some structure, such as a stack or a queue, is needed to keep track of work to do. A single, shared structure would lead to high synchronisation costs, so collection threads should be given their own private data structures.
6Amdahl's law states that the speedup obtained from parallelising a program depends on the proportion of the program that can be parallelised. Thus, if s is the amount of time spent (by a serial processor) on serial parts of a program, and p is the amount of time spent (by a serial processor) on parts that can be done in parallel by n processors, then the speedup is 1/(s + p/n). For example, if s = 0.1, p = 0.9 and n = 8, the speedup is 1/(0.1 + 0.9/8) ≈ 4.7.
However, in order to balance load, some mechanism is required that can transfer work between threads. The first decision is what mechanism to use. We have discussed several in this chapter. Work stealing data structures can be used to allow work to be transferred safely from one thread to another. The idea is to make the common operations (pushing and popping entries while tracing) as cheap as possible, that is, unsynchronised, while still allowing infrequent operations (transferring work between threads) to be performed safely. Endo et al [1997]
give each thread its own stack and a stealable work queue, whereas Flood et al [2001] have each thread use just one double-ended queue both for tracing and stealing. Grey packets provide a global pool of buffers of work to do (hence their name) [Thomas et al, 1998; Ossia et al, 2002]. Here, each thread competes for a packet of work to do and returns new work to the pool in a fresh packet. Cheng and Blelloch [2001] resolve the problem of synchronising stack pushes and pops by splitting tracing into steps, which they call 'rooms'. At its simplest, all threads are in the push room or all are in the pop room. In each case, every thread wants to move the stack pointer in the same direction, so an atomic operation like FetchAndAdd can be used. Other authors eliminate the need for atomic operations by having tracing threads communicate through single writer, single reader channels [Wu and Li, 2007; Oancea et al, 2009].
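A minimal sketch of the 'rooms' idea, using a FetchAndAdd on a shared stack, follows; the room entry/exit protocol and overflow/underflow repair are omitted, and all names are assumptions rather than Cheng and Blelloch's code.

    #include <stdatomic.h>
    #include <stddef.h>

    #define STACK_CAP (1 << 20)

    static void *shared_stack[STACK_CAP];
    static atomic_long top;                 /* index of the next free slot */

    /* In the push room every thread moves `top` the same way, so a single
     * FetchAndAdd reserves a slot without locks. */
    static void push(void *obj) {
        long slot = atomic_fetch_add(&top, 1);
        shared_stack[slot] = obj;           /* assumes slot < STACK_CAP */
    }

    /* In the pop room, FetchAndSub reserves the slot to pop the same way.
     * (A real collector must repair a negative `top` before pushing again.) */
    static void *pop(void) {
        long slot = atomic_fetch_sub(&top, 1) - 1;
        return slot >= 0 ? shared_stack[slot] : NULL;   /* NULL: stack empty */
    }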
The second decision is how much work to transfer and when to transfer it. Different researchers have proposed different solutions. The smallest unit of transfer is a single entry from the stack. However, if data structures are small, this may lead to a higher volume of traffic between threads. In the context of a parallel, concurrent and real-time collector, Siebert [2010] has a processor with no work steal all of another processor's work list. This is only a sensible decision if it is unlikely that processors will run out of work to do at around the same time (in this case, because they are executing mutator code concurrently). A common solution is to transfer an intermediate amount of work between threads. Fixed size grey packets do this naturally; other choices include transferring half of a thread's mark stack. If mark stacks are a fixed size, then some mechanism must be employed to handle overflow. Again, grey packets handle this naturally: when an output packet is filled, it is returned to the global pool and an empty one is acquired from the pool. Flood et al [2001] thread overflow sets through Java class objects, at the cost of a small, fixed space overhead per class. Large arrays are problematic for load balancing. One solution, commonly adopted in real-time systems, is to divide large, logically contiguous objects into linked data structures. Another is to record in the mark stack a sequence of sections of the array to scan for pointers to trace, rather than requiring all of the array to be scanned in a single step.
The techniques above are processor-centric: the algorithms concern the management of thread (processor) local work lists. The alternative is to use memory-centric strategies that take into account the location of objects. This may be increasingly important in the context of non-uniform memory architectures where access to a remote memory location is more expensive than access to a local one. Memory-centric approaches are common in parallel copying collectors, particularly where work lists are Cheney queues [Imai and Tick, 1993; Siegwart and Hirzel, 2006]. Here the issues are (i) the size of the blocks (the quanta of work), (ii) which block to process next and which to return to the global pool of work, and (iii) which thread 'owns' an object. There are two aspects to choosing the sizes of blocks. First, any moving collector should be given its own, private region of the heap into which it can bump allocate. These chunks should probably be large in order to reduce contention on the chunk manager. However, large chunks do not offer an appropriate granularity for balancing the load of copying threads. Instead, chunks should be broken into smaller blocks which can act as the work quanta in a Cheney-style collector. Second, the choice of which object to process next affects the locality of both the collector and the mutator (as we saw in Section 4.2). In both cases, it seems preferable to select the next unscanned
object in the block that is being used for allocation, returning intermediate, unscanned or incompletely scanned blocks to the global pool. Making this decision at the end of scanning a block may improve the collector's locality; making it after scanning each object may improve the mutator's locality as well, because it causes the live object graph to be traversed in a more depth-first-like (hierarchical) order. Finally, the decision of which thread 'owns' an object can use the notion of a 'dominant thread' to guide the choice of which processor should copy an object (and hence the location to which it should be copied).
Low-level synchronisation
As well as synchronising operations on collector data structures, it may also be necessary to synchronise operations on individual objects. In principle, marking is an idempotent operation: it does not matter if an object is marked more than once. However, if a collector uses a vector of mark-bits, it is essential that the marker sets these bits atomically. Since modern processors' instruction sets do not provide the ability to set an individual bit in a word or byte atomically, setting a mark may necessitate looping, trying to set the value of the whole byte or word atomically. On the other hand, if the mark bit is held in the object's header, or the mark vector is a vector of bytes (one per object), then no synchronisation is necessary since double writing the mark is safe.
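For example, a mark bit in a shared bitmap word might be set with a compare-and-swap loop like the following hedged C11 sketch (a byte-per-object mark vector would instead need only a plain store):

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Set mark bit `index` in a shared bitmap. Returns true if this thread set
     * the bit, false if the object was already marked. The loop retries a
     * whole-word compare-and-swap because the hardware cannot set one bit
     * atomically on its own. */
    static bool mark_atomically(_Atomic uintptr_t *bitmap, size_t index) {
        _Atomic uintptr_t *word = &bitmap[index / (8 * sizeof(uintptr_t))];
        uintptr_t bit = (uintptr_t)1 << (index % (8 * sizeof(uintptr_t)));
        uintptr_t old = atomic_load(word);
        do {
            if (old & bit)
                return false;            /* already marked: nothing to do */
        } while (!atomic_compare_exchange_weak(word, &old, old | bit));
        return true;
    }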
A copying collector must not 'mark' (that is, copy) an object more than once, as this would change the topology of the graph, with possibly disastrous consequences for mutable objects. It is essential that copying an object and setting the forwarding address is seen by other collector threads to be a single, indivisible operation. The details come down to how the forwarding address is handled. A number of solutions have been adopted. A collector may attempt to write a 'busy' value into the forwarding address slot atomically, then copy the object and write the forwarding address with a simple store operation. If another thread sees a 'busy' value, it must spin until it sees the forwarding address. The synchronisation cost can be reduced by testing the forwarding address slot before attempting the atomic 'busy' write. Another tactic might be to copy the object if there is no forwarding address and then attempt to store the forwarding address atomically, retracting the copy if the store is unsuccessful. The effectiveness of such a tactic will depend on the frequency of collisions when installing forwarding addresses.
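The busy-word protocol might look like the following hedged sketch; the sentinel values and the copy_object helper are assumptions for illustration.

    #include <stdatomic.h>
    #include <stdint.h>

    #define EMPTY ((uintptr_t)0)
    #define BUSY  ((uintptr_t)1)     /* assumed sentinel, never a real address */

    typedef struct Object {
        _Atomic uintptr_t forward;   /* forwarding slot in the object header */
        /* ... payload ... */
    } Object;

    extern uintptr_t copy_object(Object *obj);  /* assumed: copies, returns
                                                   the tospace address */

    uintptr_t forward(Object *obj) {
        uintptr_t f = atomic_load(&obj->forward);
        if (f > BUSY)
            return f;                            /* cheap pre-test: forwarded */
        uintptr_t expected = EMPTY;
        if (atomic_compare_exchange_strong(&obj->forward, &expected, BUSY)) {
            uintptr_t to = copy_object(obj);     /* we won the race: copy */
            atomic_store(&obj->forward, to);     /* publish with a plain store */
            return to;
        }
        while ((f = atomic_load(&obj->forward)) == BUSY)
            ;                                    /* spin until address appears */
        return f;
    }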
It is important that certain actions be made visible in the proper order to other processors on platforms with weakly consistent memory models. This requires the compiler to emit memory fences in the appropriate places. Atomic operations such as CompareAndSwap often act as fences, but in many cases weaker instructions suffice. One factor in the choice of algorithm will be the complexity of deciding where to place fences, the number that need to be executed and the cost of doing so. It may well be worth trading some reduction in performance for simplicity of programming (and hence confidence that the code is correct).
Sweeping and compaction phases essentially sweep linearly through the heap (in more than one pass in the case of compaction). Thus, these operations are well suited to parallelisation. The simplest load balancing strategy might be to divide the heap into as many partitions as there are processors. However, this can lead to uneven load balancing if the amount of work is uneven between partitions. To a first approximation, the amount of work to be done is proportional to the number of objects in a partition. This information is available from the mark phase, and can be used to divide the heap into unequally sized (but object aligned) partitions, each of which contains roughly the same amount of work.
However, this strategy assumes that each partition can be processed independently of the others. This will not be true if processing one partition may destroy information on which another partition depends. For example, a sliding compaction collector cannot move objects in an arbitrary order to their destination as this would risk overwriting live but not yet moved data. In this case, it may be necessary to process partitions in address order. Here, the solution is to over-partition the heap and have threads compete for the next partitions to use (one for the objects to be moved and one into which to move them).
Termination
Finally, a parallel collection phase must detect when it has finished. Grey packets make one approach straightforward: they allow the packets in the global pool to be counted; if they are all present (and empty), then the phase is complete.
Chapter 15
Concurrent garbage collection
The basic principles of concurrent collection were initially devised as a means to reduce pause times for garbage collection on uniprocessors. Early papers used terms such as 'concurrent', 'parallel', 'on-the-fly' and 'real-time' interchangeably or inconsistently. In Chapter 14 we defined the modern usage of 'parallel'. Here, we define the remaining terms. So far, we have assumed that the mutator is suspended while garbage collection proceeds, and that each collection cycle terminates before the mutator can continue. As before, Figure 14.1a illustrates different collection styles by one or more horizontal bars, with time proceeding from left to right, and shows mutator execution in white while each collection cycle is represented as a distinct non-white shade. Thus, grey boxes represent actions of one garbage collection cycle, and black boxes those of the next.
We have already seen one way to reduce pause times on a multiprocessor: parallel collection. Another approach, incremental collection, interleaves the mutator with small increments of collection work; that is, the mutator is stopped for each increment of the collector cycle rather than for the whole cycle. It is possible to maintain this property on a multiprocessor by making sure that all parallel mutators are stopped for each increment, as illustrated in Figure 15.1b. The increments can also be parallelised, as in Figure 15.1c.
It is a conceptually simple step to go from interleaving of the mutator with the collector on a uniprocessor to concurrent execution of (multiple) mutators in parallel with the collector on a multiprocessor. The main added difficulty is ensuring that the collector and mutators synchronise properly to maintain a consistent view of the heap, and not just for reachability. For example, inconsistency can occur when a mutator attempts to manipulate partially scanned or copied objects, or to access metadata, concurrently with the collector. The degree and granularity of this synchronisation necessarily impacts application throughput (that is, end-to-end execution time including both mutator and collector work),
Figure 15.1: Incremental and concurrent collection styles, with time proceeding from left to right, including: (b) incremental multiprocessor collection; (c) parallel incremental collection; (e) mostly-concurrent incremental collection; (g) on-the-fly incremental collection.
Concurrent collectors are correct only insofar as they are able to control mutator and collector interleavings. As we shall soon see, concurrent mutator and collector operations will be specified as operating atomically, allowing us to interpret a sequence of interleaved operations as being generated by a single mutator (and single collector), without loss of generality. Any concurrent schedule for executing these atomic operations that preserves their appearance of atomicity will be permitted, leaving the actual implementation of that atomicity to the underlying system.
The tricolour abstraction, revisited
Correctness of concurrent collectors is often most easily reasoned about by considering invariants, based on the tricolour abstraction, that the collector and mutator must preserve. All concurrent collectors preserve some realisation of these invariants, but they must retain at least all the reachable objects (safety) even as the mutator modifies objects. Recall that:
White objects have not yet been reached by the collector; this includes all objects at the beginning of the collection cycle. Those left white at the end of the cycle will be treated as unreachable garbage.
Grey objects have been reached by the collector, but one or more of their fields still need to be scanned (they may still point to white objects).
Black objects have been reached by the collector, and all their fields have been scanned; thus, immediately after scanning, none of the outgoing pointers were to white objects. Black objects will not be rescanned unless their colour changes.
Historically, concurrent collection in general was referred to as 'on-the-fly' [Dijkstra et al, 1976, 1978; Ben-Ari, 1984]. However, on-the-fly has since come to mean more specifically never stopping all the mutator threads simultaneously.
The garbage collector can be thought of as advancing a grey wavefront, the boundary between black (reachable at some time and scanned) and white (not yet visited) objects. When the collector cycle can complete without mutators concurrently modifying the heap, there is no problem. The key problem with concurrent mutation is that the mutator's and the collector's views of the world may become inconsistent, and that the grey wavefront no longer represents a proper boundary between black and white.
Let us reconsider the earlier definition of the mutator Write operation, which we can recast as follows by introducing a redundant load from the field right before the store:

    atomic Write(src, i, new):
        old ← src[i]
        src[i] ← new

The Write operation inserts the pointer src→new into the field src[i] of object src. As a side-effect it deletes the pointer src→old from src[i]. We characterise the operation as atomic to emphasise that the old and new pointers are exchanged instantaneously, without any other interleaving of mutator/collector operations. Of course, on most hardware the store is naturally atomic so no explicit synchronisation is required.
When the mutator runs concurrently with the collector and modifies objects ahead of the wavefront (grey objects, whose fields still need to be scanned, or white objects, as yet unreached), correctness ensues, since the collector will still visit those objects at some point (if they are still reachable). There is also no problem if the mutator modifies objects behind the wavefront (black objects, whose fields have already been scanned), so long as it inserts or deletes a pointer to only a black or grey object (which the collector has already decided is reachable). However, other pointer updates may lead to the mutator's and the collector's view of the set of live objects becoming incoherent [Wilson, 1994], and thus live objects being freed incorrectly. Let us consider an example.
The mutator can hide a white object that is initially directly reachable from a grey object by inserting its pointer behind the wavefront and then deleting its link from the grey object. In the first scenario, the initial state of the heap shows a black object X and grey object Y, having been marked reachable from the roots. White object Z is directly reachable from Y. In step D1 the mutator inserts pointer b from X to Z by copying pointer a from Y. In step D2 it deletes pointer a, destroying the original path to Z. In step D3 the collector scans Y, making it black, without ever visiting Z. In the second scenario, white objects R and S are reachable from grey object Q. In step T1 the mutator inserts pointer e from P to S by copying pointer d from white object R. In step T2 the mutator deletes pointer c to R, destroying the path from the only unscanned object Q that leads to S. In step T3 the collector scans the object Q to make it black, and terminates its marking
Figure: the lost object problem. The panels show the roots and the objects X, Y, Z and P, Q, R, S before and after the steps D1: Write(X, b, Read(Y, a)); D3: scan(Y); and T1: Write(P, e, Read(R, d)); T3: scan(Q).
phase. In the sweep phase, white object S will be erroneously reclaimed, even though it is reachable. The lost object problem arises only when two conditions both hold: Condition 1, the mutator stores a pointer to a white object into a black object; and Condition 2, all paths from grey objects to that white object are destroyed. Inserting a white pointer (that is, a pointer to a white object) into a black object will cause problems if the collector never encounters another pointer to the white object. It would mean that the white object is reachable (from the black object, Condition 1), but the collector will never notice since it does not revisit black objects. The collector could only discover the white object by following a path of unvisited (that is, white) objects starting from an object that the collector has noticed but not finished with (that is, a grey object). But Condition 2 states that there is no such path.
To prevent live objects from being reclaimed incorrectly, we must ensure that both conditions cannot hold simultaneously. To guarantee that the collector will not miss any reachable objects, it must be sure to find every white object that is pointed to by black objects. So long as any white object pointed to by black objects is also protected from deletion, it will not be missed. It is sufficient for such an object to be directly reachable from some grey object, or transitively reachable from some grey object through a chain of white objects. In this case Condition 2 never holds. We say that such an object is grey protected. Thus, we must preserve:
The weak tricolour invariant: All white objects pointed to by a black object are grey protected (that is, reachable from some grey object, either directly or through a chain of white objects).
Non-copying collectors have the advantage that all white pointers automatically turn into grey/black pointers when their target object is shaded grey or black. Thus, white pointers in black objects are not a problem, because their grey protected white targets are eventually shaded by the collector: all white pointers in black objects eventually become black before the collection cycle can terminate.
In contrast, concurrent copying collectors are more restricted because they explicitly have two copies of every live object at the end of the collection cycle (the fromspace white copy, and the tospace black copy), at which point the white copies are discarded along with the garbage. By definition, black objects are never revisited by the collector. Thus, a correct concurrent copying collector must never allow a white fromspace pointer (to a white fromspace object) to be stored in a black tospace object. Otherwise, the collector will complete its cycle while leaving dangling white pointers from black tospace into the discarded white fromspace. That is, they must preserve:
The strong tricolour invariant: There are no pointers from black objects to white objects.
Clearly, the strong invariant implies the weak invariant, but not the other way round. Because problems can occur only when the mutator inserts a white pointer into a black object, it is sufficient simply to prohibit that. Preserving the strong tricolour invariant is a strategy equally suited to both copying and non-copying collectors.
In both the scenarios in the example, the mutator first wrote a pointer to a white object into a black object (D1/T1), breaking the strong invariant. It then destroyed all paths to that white object from grey objects (D2/T2), breaking the weak invariant. The result was that a (reachable) black object ended up pointing to a (presumed garbage) white object, violating correctness. Solutions to the lost object problem operate at either the step that writes the pointer to the white object (D1/T1) or the step that deletes a remaining path to that object (D2/T2).
Precision
Concurrent collectors vary in their precision: they may retain some varying superset of the reachable objects, and hence precision affects the promptness of reclamation of dead objects. A stop-the-world collector obtains maximal precision (all unreachable objects are collected) at the expense of any concurrency with the mutator. Finer grained atomicity permits increased concurrency with the mutator at the expense of possibly retaining more unreachable objects and the overhead to ensure atomicity of key operations. It is difficult to identify the minimal yet sufficient set of critical sections to place in tracing; Vechev et al [2007] show how this search can be semi-automated. Unreachable objects that are nevertheless retained at the end of the collection cycle are called floating garbage. It is usually desirable, though not strictly necessary for correctness, that a concurrent collector also ensure completeness in collecting floating garbage at some later collection cycle.
Mutator colour
In classifying algorithms it is also useful to talk about the colour of the mutator roots, as if the mutator itself were an object. A grey mutator either has not yet been scanned by the collector, so its roots are still to be traced, or its roots have been scanned but need to be rescanned. This means that the grey mutator roots may refer to objects that are white, grey or black. A black mutator has been scanned by the collector, so its roots have been traced, and will not be scanned again. Under the strong invariant, this means that a black mutator's roots can refer only to objects that are grey or black but not white. Under the weak invariant, a black mutator can hold white references so long as their targets are protected from deletion.
The colour of the mutator has implications for termination of a collection cycle. By definition, concurrent collection algorithms that permit a grey mutator need to rescan its roots. This will lead to more tracing work if a reference to a non-black object is found. When this trace is complete, the roots must be scanned again, in case the mutator has added to the roots yet another non-black reference, and so on. In the worst case, it may be necessary for grey mutator algorithms to halt all mutator threads for a final scan of their roots.
As mentioned earlier, our simplifying assumption for now is that there is only a single mutator. However, on-the-fly collectors distinguish among multiple mutator threads because they do not suspend them all at once to sample their roots. These collectors must operate with mutator threads of different colours, both grey (unscanned) and black (scanned). Moreover, some collectors may separate a single mutator thread's roots into scanned (black) and unscanned (grey) portions. For example, the top frame of a thread's stack may be scanned (black) while the frames below it remain unscanned (grey). Returning or unwinding into the grey portion of the stack forces the new top stack frame to be scanned.
Allocation colour
Mutator colour also influences the colour objects receive when they are allocated, since allocation results in the mutator holding the pointer to the newly allocated object, which must satisfy whichever invariant applies, given the colour of the mutator. But the allocation colour also affects how quickly a new object can be freed once it becomes unreachable. If an object is allocated black or grey then it will not be freed during the current collection cycle (since black and grey objects are considered to be live), even if the mutator drops its reference without storing it into the heap. A grey mutator can allocate objects white and so avoid unnecessarily retaining new objects. A black mutator cannot allocate white (whether the strong or weak invariant applies), unless (under the weak invariant) there is a guarantee that the white reference will be stored to a live object ahead of the wavefront so the collector will retain it. Otherwise, there is nothing to prevent the collector from
reclaiming the object even though the black mutator retains a pointer to it.
Incremental update solutions
Solutions that address the D1/T1 mutations are known as incremental update techniques, since they inform the collector of incremental changes that the mutator makes to the reachability graph. Note that, even after inserting a pointer to a white object behind the wavefront, the mutator may yet delete all other paths to the object ahead of the wavefront. Thus, incremental update techniques preserve the strong invariant. They use a mutator write barrier to protect against insertion of white pointers in black objects. In the example above, the write barrier would re-colour the source or destination of pointer b so that the pointer is no longer black to white.
When a black mutator loads a reference from the heap, it is effectively inserting a pointer in a black object (itself). Incremental update techniques can use a mutator read barrier to protect against insertion of white pointers in a black mutator.
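As a hedged illustration (not the book's algorithm listings verbatim), a Dijkstra-style insertion write barrier might look like this in C, where the Object layout, the fixed field count and the shade helper are assumptions:

    typedef enum { WHITE, GREY, BLACK } Colour;

    typedef struct Object {
        Colour colour;
        struct Object *fields[8];    /* assumed fixed shape, for brevity */
    } Object;

    extern void add_to_work_list(Object *obj);   /* assumed collector interface */

    /* Shade: grey a white object; greying grey or black objects has no effect. */
    static void shade(Object *obj) {
        if (obj != NULL && obj->colour == WHITE) {
            obj->colour = GREY;
            add_to_work_list(obj);
        }
    }

    /* Insertion barrier: greying the inserted target ensures no black-to-white
     * pointer is ever created, preserving the strong invariant. (Atomicity and
     * memory ordering are ignored in this sketch.) */
    void write_insertion(Object *src, int i, Object *new_ref) {
        src->fields[i] = new_ref;
        shade(new_ref);
    }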
Snapshot-at-the-beginning solutions
Wilson calls solutions that address D2/T2 mutations snapshot-at-the-beginning techniques, since they preserve the set of objects that were live at the start of the collection. They inform the collector when the mutator deletes a white pointer from a grey or white object (ahead of the wavefront). Snapshot-at-the-beginning solutions conservatively treat an object as live (non-white) if a pointer to it ever existed ahead of the wavefront, speculating that the mutator may have also inserted that pointer behind the wavefront. This maintains the weak invariant, because there is no way to delete every path from some grey object to any object that was live at the beginning of the collection cycle. Snapshot-at-the-beginning techniques use a mutator write barrier to protect against deletion of grey or white pointers from grey or white objects.
Snapshotting the mutator means scanning its roots, making it black. We must snapshot the mutator at the beginning of the collection cycle to ensure it holds no white pointers. Otherwise, if the mutator held a white pointer that was the only pointer to its referent, it could write that pointer into a black object and then drop the pointer, breaking the weak invariant. A write barrier on stores into black objects could catch such insertions, but this degenerates to maintaining the strong invariant. Thus, snapshot collectors operate only with a black mutator.
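By contrast, a Yuasa-style deletion barrier shades the overwritten target. This hedged sketch reuses the Object and shade definitions from the previous example:

    /* Deletion barrier: shade the target of the overwritten pointer so that an
     * object reachable at the start of the cycle (the snapshot) cannot be
     * hidden by deleting its last pointer ahead of the wavefront. */
    void write_deletion(Object *src, int i, Object *new_ref) {
        Object *old = src->fields[i];
        shade(old);                  /* protect the object losing a reference */
        src->fields[i] = new_ref;
    }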
A barrier may affect the wavefront in one of three ways:
• Add to the wavefront by shading an object grey, if it was white. Shading an already grey or black object has no effect.
• Advance the wavefront by scanning an object to make it black.
• Retreat the wavefront by reverting an object from black back to grey.
The only other actions (reverting an object to white, or shading an object black without scanning it) would break the invariants. Algorithms 15.1 to 15.2 enumerate the range of classical barrier techniques for concurrent collection.
• Boehm et al [1991] implemented a variant of the Steele [1975, 1976] barrier which ignores the colour of the inserted pointer, as shown in Algorithm 15.1b. They originally implemented this barrier using virtual memory dirty bits to record pages modified by the mutator without having to mediate the heap writes in software, which meant a less precise barrier that did not originally have the conditional test that the reverted source object is actually black. Boehm et al use a stop-the-world phase to terminate collection, at which time the dirty pages are rescanned.
• Dijkstra et al [1976, 1978] designed a barrier (Algorithm 15.1c) that yields less precision than Steele's since it commits to shading the target of the inserted pointer reachable (non-white), even if the inserted pointer is subsequently deleted. This loss of precision aids progress by advancing the wavefront. The original formulation of this barrier shaded the target without regard for the colour of the source.
2We believe that 'insertion barrier' is a clearer term for the mechanism than 'incremental update barrier'. Likewise, we prefer the term 'deletion barrier' to 'snapshot-at-the-beginning barrier'.
Algorithm 15.1c: Dijkstra et al [1976, 1978] barrier. Algorithm 15.2c: Abraham and Patel [1987] / Yuasa [1990] barrier.
Black mutator techniques
The first two black mutator approaches apply incremental update to maintain the strong invariant, using a read barrier to prevent the mutator from acquiring white pointers (that is, to protect from inserting a white pointer in a black mutator). The third, a snapshot technique, uses a deletion barrier on pointer writes into the heap to preserve the weak invariant (that is, to protect from deleting the last pointer keeping an object live that was reachable at the time of the snapshot). Under the weak invariant a black mutator can still hold white references; it is black because its roots do not need to be rescanned, even if it has since loaded pointers to white objects, because those white objects are protected from deletion by the write barrier.
• Baker [1978] used the read (mutator insertion) barrier shown in Algorithm 15.2a. This approach has less precision than Dijkstra et al, since it retains otherwise white objects whose references are loaded by the mutator at some time during the collection cycle, as opposed to those actually inserted behind the wavefront. Note that Baker's read barrier was designed originally for a copying collector, where the act of shading copies the object from fromspace to tospace, so the shade routine returns the tospace pointer (a sketch follows this list).
• Appel et al [1988] implemented a coarse-grained (less precise) variant of Baker's read barrier (Algorithm 15.2b), using the virtual memory page protection primitives of the operating system to trap accesses by the mutator to grey pages of the heap without having to mediate those reads in software. Having scanned (and unprotected) the page, the trapped access is allowed to proceed. This barrier can also be used with a copying collector since scanning will forward any fromspace pointers held in the source object, including that in the field being loaded.
• Abraham and Patel [1987] and Yuasa [1990] independently devised the deletion barrier of Algorithm 15.2c. At D2 it directly shades Z grey. At T2 it shades R grey so that S can eventually be shaded. This deletion barrier offers the least precision of all the techniques, since it retains any unreachable object to which the last pointer was deleted during the collection cycle. With an insertion barrier, at least we know that the mutator has had some interest in objects retained by the barrier (whether to acquire or store its reference), whereas the deletion barrier retains objects regardless of whether the mutator manipulated them. This is evident in that shading R retains it as floating garbage (it is not otherwise reachable) solely to preserve S. In its original form, this snapshot barrier was unconditional: it simply shaded the target of the overwritten pointer, regardless of the colour of the source. Abraham and Patel exploited this to drive their snapshot barrier using virtual memory copy-on-write mechanisms.
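For illustration, a Baker-style read barrier for a copying collector might be sketched as follows, reusing the Object type from the earlier sketches; copy_and_forward and in_fromspace are assumed helpers, not Baker's own interface.

    #include <stdbool.h>

    extern Object *copy_and_forward(Object *obj);  /* assumed: copy to tospace,
                                                      or return existing copy */
    extern bool in_fromspace(Object *obj);         /* assumed space test */

    /* Read barrier: the mutator never acquires a fromspace (white) pointer.
     * Loading a reference shades (copies) its target and caches the tospace
     * address back into the field. */
    Object *read_baker(Object *src, int i) {
        Object *ref = src->fields[i];
        if (ref != NULL && in_fromspace(ref)) {
            ref = copy_and_forward(ref);
            src->fields[i] = ref;
        }
        return ref;
    }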
A further black mutator technique is shown in Algorithm 15.3. This combines an insertion read barrier on a black mutator with a deletion barrier on the heap. The combination preserves a weak invariant: all black-to-white pointers have a copy in some grey object (this is slightly stronger than the basic weak invariant, which requires only a chain of white references from grey to white). The black mutator can safely acquire a white pointer from some grey source object, since the target object will eventually be shaded grey when the grey source is scanned, or the write barrier will shade the target grey if the source field is modified. The read barrier makes sure that the mutator never acquires a white pointer from a white object. Thus, every reachable white object remains protected by some grey object.
Variations on these barrier techniques can be obtained by short-circuiting or coarsening:
• Shading an object grey can be short-circuited by immediately scanning the object to make it black.
• A deletion barrier that shades the target of the deleted pointer grey can instead (and more coarsely) scan the source object containing the deleted pointer to black before the store.
• A read barrier that shades the target of the loaded pointer grey can instead (and more coarsely) scan the source object to black before the read. Thus, the read barrier of Appel et al coarsens that of Baker.
• An insertion barrier that shades the target of the inserted pointer grey can instead revert the source to grey. This is how the barriers of Steele and Boehm et al gain precision over that of Dijkstra et al.
Clearly, all strong invariant (incremental update) techniques must at least protect from a grey mutator inserting white pointers into black objects, or protect a black mutator from acquiring or using white pointers. The strong techniques all do one of these two things and need not do any more.
We have already argued that weak invariant (snapshot) techniques must operate with a black mutator. Under the weak invariant, a grey object does not merely capture a single path to reachable white objects. It may also be a placeholder for a pointer from a black object to some white object on that path. Thus, the snapshot barrier must preserve any white object directly pointed to from grey. The least it can do is to shade the white object when its pointer is deleted from grey.
To deal with white objects transitively reachable via a white path from a grey object (which may also be pointed to from black), we can either prevent the mutator from obtaining pointers to white objects on such paths so it can never modify the path [Pirinen, 1998], or make sure that deleting a pointer from a white object (which may be on such a path) at least makes the target of the pointer grey [Abraham and Patel, 1987; Yuasa, 1990].
Thus, all of the barrier techniques enumerated here cover the minimal requirements to maintain their invariants, but variations on these techniques can be obtained by short-circuiting or coarsening.
Whichever barrier is used, it shades some object grey: the source object, the new target, or the target originally stored in the field. References to these grey objects must be recorded in some data structure. However, concurrently with mutators adding references to the structure, the collector will remove and trace them. It is essential that insertions and removals be efficient and correct in the face of mutator-mutator and mutator-collector races.
One way to record grey objects is to add them to a log. We considered a variety of concurrent data structures and efficient ways to manage them in Chapter 13. In this section, we consider a popular alternative mechanism: card tables. The basic operation of card tables for stop-the-world collectors was described in Chapter 11. Here we extend
Whether the collector uses a Steele-style retreating barrier, a Dijkstra-style advancing barrier or a Yuasa-style deletion barrier, all objects in a dirty card must be considered grey. While this barrier may seem very imprecise, since it will preserve garbage neighbours of live objects, note that Abuaiadh et al [2004] found that compacting small blocks rather than individual objects led to an increase in memory footprint of only a few percent.
The card table is the concurrent collector's work list. The collector must scan it, looking for dirty cards and cleaning them, until all cards are clean. Since mutators may dirty cards after the collector has cleaned them, the collector must repeatedly scan the card table. An alternative might be to delay processing the card table until a final stop-the-world phase, but this is likely to cause the concurrent part of the tracing phase to terminate too soon [Barabash et al, 2003, 2005].
To clean a card, the collector sets its status to refining and searches it for grey objects (see Chapter 11). The collector now attempts to write the new status back to the card. First, it checks that the card's status is still refining and that no mutator has dirtied the card while the collector was searching it. If the status is still refining, the collector must try to change the value atomically to the new status, for example with a CompareAndSwap. If this fails, then a mutator must have dirtied the card concurrently, meaning that it may contain an unprocessed grey object. Detlefs et al simply leave this card dirty and proceed to the next dirty card, but one might also try to clean the card again.
With a two-level card table, once all the fine-grain cards corresponding to a coarse-grain card are clean, the collector attempts atomically to set the state of the coarse-grain card to clean. However, there is a subtle concurrency issue here. Because
write barrier actions are not atomic with respect to the card-cleaning thread, the write barrier must dirty the fine-grained card before dirtying the corresponding coarse-grained card, while the collector reads them in the opposite order. We note that obtaining the proper order may have extra cost on machines that require a memory fence to force it.
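The required ordering can be expressed with C11 release/acquire operations, as in this hedged sketch; the array layout and names are assumptions for illustration.

    #include <stdatomic.h>
    #include <stddef.h>

    enum { CLEAN = 0, DIRTY = 1 };

    extern _Atomic unsigned char fine[];    /* one entry per fine-grained card */
    extern _Atomic unsigned char coarse[];  /* one entry per coarse-grained card */

    /* Mutator write barrier: dirty the fine-grained card first; the release
     * store on the coarse card makes the fine store visible no later. */
    void dirty_cards(size_t f, size_t c) {
        atomic_store_explicit(&fine[f], DIRTY, memory_order_relaxed);
        atomic_store_explicit(&coarse[c], DIRTY, memory_order_release);
    }

    /* Card-cleaning thread: read in the opposite order. The acquire load pairs
     * with the release above, so a dirty coarse card cannot hide a dirty fine
     * card from the collector. */
    int fine_card_dirty(size_t f, size_t c) {
        if (atomic_load_explicit(&coarse[c], memory_order_acquire) != DIRTY)
            return 0;
        return atomic_load_explicit(&fine[f], memory_order_relaxed) == DIRTY;
    }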
Reducing work
One solution that reduces the amount of redundant work done by the collector is to try to avoid scanning any object more than once [Barabash et al, 2003, 2005]. Here, the authors defer cleaning cards for as long as there is other tracing work for the collector to do. Their mostly-concurrent collector uses a Steele-style retreating insertion barrier. Such collectors must scan marked objects on dirty cards and trace all their unmarked children. The first technique for reducing the amount of redundant scanning is not to trace through an object on a dirty card: it suffices to mark the object, as it will be traced through when the card is cleaned. Although objects that are traced through before their card is dirtied will still be scanned twice, this eliminates rescanning objects that are marked after their card is dirtied. Barabash et al observe that this can improve the collector's performance and reduce the number of cache misses it incurs. Note that although changes in the order of memory accesses on a weakly consistent platform may cause this optimisation to be missed, the technique is still safe.
A related technique defers dirtying cards for writes to objects in a thread's active local allocation buffer until a later time, instead adding the object to a deferred list. When the buffer overflows (the allocation slow path), the mutator sets all the cards in the buffer to be clean and clears all the defer bits for all objects in the buffer. One reason that this is effective is that Barabash et al found that the collector rarely reaches objects in an active local allocation buffer.
Some care is needed with this solution on weakly consistent platforms. The simplest approach is to have the collector run a fence after marking a card traced and before tracing an object, and to have the undirtying procedure run a fence between checking whether each card is dirty and checking whether it is traced (as above). Note that in both cases only the collector threads execute the fence. An alternative method is to have the undirtying procedure start by scanning the card table, cleaning and recording (in a list or an additional card table) all cards that are dirty but have not yet been traced. Next, the
Garbage collectors that are incremental (mutator interleaved with collector) or concurrent (mutator and collector in parallel) have one primary purpose: minimising the collector pauses observed by the mutator. Whether the pause is due to an increment of collection work needing to be performed by the mutator, or caused by the mutator having to synchronise with (and possibly wait for) the collector to finish some work, incremental and concurrent techniques usually trade increased elapsed time (mutator throughput) for reduced pause times. Unfortunately, there is no free lunch. As we have already seen, concurrent collectors require some level of communication and synchronisation between the mutator and the collector, in the form of mutator barriers. Moreover, contention between the mutator and collector for processor time or for memory (including disturbance of the caches by the collector) can also slow the mutator down.
These collectors impose overhead on individual mutator actions (loads or stores) in order to reduce the pauses observed by the application's users. However, an application's user may be another program, and this client may be very sensitive to delays. Ossia et al [2004] offer three-tier transaction processing systems as an example. They point out that delays for stop-the-world collections may cause transactions to time out and to be retried. By doing a little extra work (executing write barriers), much more extra work (reprocessing transactions that timed out) can be avoided.
The concurrent collection techniques that we consider in subsequent chapters each have their own particular impact on these costs. Concurrent reference counting collectors impose a particularly high overhead on pointer loads and stores. Concurrent mark-sweep collectors, which don't move objects, have relatively low overhead for pointer access (varying with the barrier), but they may suffer from fragmentation. Concurrent collectors that relocate objects require additional synchronisation to protect the mutator from, or inform the mutator about, objects that the collector moves. Copying collectors also impose additional space overhead that adds to memory pressure. In all concurrent collectors, whether a read barrier or write barrier is used will affect throughput differently, based on the relative frequency of reads and writes, and the amount of work the barrier performs.
Concurrent mark-sweep collectors typically use a write barrier to notify the marker of an object to mark from. Concurrent copying and compacting collectors typically use a read barrier, to protect the mutator from accessing stale objects that have been copied elsewhere. There is a trade-off between the frequency of barrier execution and the amount of work it must do. A barrier that triggers copying and scanning will be more expensive than one that simply copies, which will be more expensive than one that simply redirects the source pointer. Similarly, performing more work early may result in fewer later barriers needing to do much work. All of these factors depend on the granularity of work performed, across a scale from references through objects to pages.
The amount of floating garbage is another factor in the costs of concurrent collection. Not having to collect floating garbage will allow faster termination of the current collection cycle, at the expense of additional memory pressure.
Whether the mutator (threads) must be stopped at the beginning of the collection cycle
(to make sure the collector has seen all the roots) or at the end (to check for termination)
also has an impact on throughput. Termination criteria also affect the amount of floating
garbage.
A further consideration is that most concurrent collectors offer only loose assurances on pauses and space overhead. Providing the hard bounds on space and time needed for real-time applications means making well-defined progress guarantees for mutator operations that interact with the heap, and space guarantees that derive solely from knowledge of the memory allocation footprint of the application.
Incremental or concurrent collection can be particularly desirable when the volume of live data is expected to be very large. In this case, even stop-the-world parallel collection using every processor available would lead to unacceptable pause times. However, one drawback of incremental and concurrent collectors is that they cannot recycle any memory until the collection cycle is complete; we must provide sufficient headroom in the heap or give the collector a sufficiently generous share of processor resources (at the expense of the mutator) to ensure that the mutator does not run out of memory before the collection cycle completes. We consider garbage collector scheduling when we address the problem of real-time collection in Chapter 19; there, the problem is particularly acute.
An alternative approach is to use a hybrid generational/concurrent collection. The young generation is managed in the usual generational way, stopping the world for each minor collection. The old generation is managed by a concurrent collector. This has several advantages. Nursery collections are usually short enough (a few milliseconds) not to be disruptive, and we can expect memory to be recycled promptly for further allocation, thus reducing the space overhead required to avoid running out of memory. There is no need to apply the concurrent write barrier to objects in the young generation as it is collected stop-the-world: the generational write barrier in the slow path suffices. Concurrent collectors typically allocate new objects black, guaranteeing that they will survive a collection even though most objects will not live that long. However, by allocating new objects generationally, this problem disappears. Finally, old objects have much lower mutation rates than young ones [Blackburn and McKinley, 2003]. This is the ideal scenario for an incremental or concurrent collector since its write barrier is less frequently invoked.
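To make the combination concrete, here is a minimal Java sketch (not drawn from any particular system) of how the two barriers might share a single store path. The Obj class, its inNursery and marked flags, and the global markQueue and rememberedSet are illustrative assumptions standing in for a real object model, mark queue and remembered set.

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    // Illustrative object model: one pointer slot, a generation flag and a
    // concurrent-marking colour bit. All names here are assumptions.
    final class Obj {
        Obj field;
        boolean inNursery;
        boolean marked;
    }

    final class HybridWriteBarrier {
        static final Deque<Obj> markQueue = new ArrayDeque<>();  // grey objects
        static final Set<Obj> rememberedSet = new HashSet<>();   // old-to-young sources

        static void write(Obj src, Obj newRef) {
            if (src.inNursery) {
                // Stores into the nursery need no concurrent barrier: the
                // young generation is collected stop-the-world.
                src.field = newRef;
                return;
            }
            // Generational slow path: remember old-to-young pointers.
            if (newRef != null && newRef.inNursery) {
                rememberedSet.add(src);
            }
            // Concurrent barrier for the old generation: a snapshot-style
            // deletion barrier shades the overwritten target.
            Obj old = src.field;
            if (old != null && !old.marked) {
                old.marked = true;
                markQueue.add(old);
            }
            src.field = newRef;
        }
    }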
Chapter 16
Concurrent mark-sweep
In the previous chapter we looked at the need for incremental or concurrent garbage collection, and identified the problems faced by all such collectors. In this chapter, we consider one family of these collectors: concurrent mark-sweep collectors. As we noted before, the most important issue facing concurrent collection is correctness. The mutator and collector must communicate with each other in order to ensure that they share a coherent view of the heap. This is necessary on the mutator's part to prevent live objects from being hidden from the collector. It is necessary for collectors that move objects to ensure that the mutator uses the correct addresses of moved objects.
The mark-sweep family are the simplest of the concurrent collectors. Because they do not change pointer fields, the mutator can freely read pointers from the heap without needing to be protected from the collector. Thus, there is no inherent need for a read barrier for non-moving collectors. Read barriers are otherwise generally considered too expensive for use in maintaining the strong invariant for a non-moving collector, since heap reads by the mutator are typically much more frequent than writes. For example, Zorn [1990] found that the static frequencies of pointer loads and stores in SPUR Lisp were 13% to 15% and 4%, respectively. He measured the run-time overhead of inlined write barriers as ranging from 2% to 6%, and up to 20% for read barriers. The exception to this general rule is when compiler optimisation techniques can be brought to bear on eliminating redundant barriers [Hosking et al, 1999; Zee and Rinard, 2002], and on folding some of the barrier work into existing overheads for null pointer checks [Bacon et al, 2003a]. For this reason, mark-sweep collectors usually adopt the Dijkstra et al [1976, 1978] incremental update or Steele [1976] insertion write barriers, or their coarser Boehm et al [1991] variant, or the snapshot-at-the-beginning Yuasa [1990] deletion write barrier.
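To make the contrast concrete, the following minimal Java sketch shows one barrier of each family. It is illustrative only: GcObject, its single pointer slot and its mark bit are assumptions standing in for a real object model, and a production barrier would need to be atomic with respect to other mutators and the collector.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Hypothetical object model: one pointer field and a mark (colour) bit.
    final class GcObject {
        GcObject field;              // the single pointer slot, for illustration
        boolean marked;              // set once the object is grey or black
    }

    final class WriteBarriers {
        static final Deque<GcObject> worklist = new ArrayDeque<>(); // grey objects

        static void shade(GcObject obj) {
            if (obj != null && !obj.marked) {    // colour a white object grey
                obj.marked = true;
                worklist.add(obj);
            }
        }

        // Dijkstra-style incremental update: shade the NEW target, so a
        // pointer inserted behind the wavefront cannot remain white.
        static void writeIncrementalUpdate(GcObject src, GcObject newRef) {
            shade(newRef);
            src.field = newRef;
        }

        // Yuasa-style snapshot-at-the-beginning: shade the OLD (deleted)
        // target, so no path that existed when marking began is lost.
        static void writeSnapshot(GcObject src, GcObject newRef) {
            shade(src.field);
            src.field = newRef;
        }
    }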
16.1 Initialisation
Instead of allowing the mutator to run until memory is exhausted, concurrent collectors can run even as the mutator is still allocating. However, when to trigger the beginning of a new marking phase is a critical decision. If a collection is triggered too late, there may be insufficient memory to satisfy some allocation request, at which point the mutator will stall until the collection cycle can complete. Once the collection cycle begins, the collector's steady-state work-rate must be sufficient to complete the cycle before the mutator exhausts memory, while minimising its impact on mutator throughput. How and when to trigger a garbage collection cycle, ensuring that sufficient memory is available for allocation to keep the mutator satisfied even as concurrent collection proceeds, and reaching termination of the collection cycle so that garbage can be reclaimed and recycled, all depend on scheduling collection work alongside the mutator.
Algorithm 16.1: Concurrent mark-sweep allocation

New():
    collectEnough()
    ref ← allocate()                /* must initialise black if mutator is black */
    if ref = null
        error "Out of memory"
    return ref

atomic collectEnough():
    while behind()
        if not markSome()
            return
Algorithm 16.1 illustrates the mutator allocation sequence for a concurrent mark-sweep garbage collector that schedules some amount of collector work incrementally at each allocation (piggy-backed on the mutator thread) in the collectEnough procedure. This work is synchronised with other concurrent mutator threads executing mutator barriers, or other collector threads, as indicated by the atomic modifier. The decision as to when and how much collector work to perform is captured by the utility routine behind, which makes sure that the mutator does not get so far ahead of the collector that the allocate routine cannot satisfy the request for new memory.
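One plausible shape for the behind test, sketched in Java under the assumption that marking is paced against allocation: the collector is behind whenever the marking work completed so far this cycle falls below an assumed ratio of the bytes allocated this cycle. The ratio, the counters and the markSome stub are all hypothetical, not taken from any particular system.

    final class Pacing {
        static final double MARK_WORK_PER_BYTE = 0.5;  // assumed pacing ratio
        static long bytesAllocatedThisCycle;           // bumped by allocate()
        static long markWorkThisCycle;                 // units of marking done

        static boolean behind() {
            return markWorkThisCycle
                    < MARK_WORK_PER_BYTE * bytesAllocatedThisCycle;
        }

        // Allocation slow path, in the style of collectEnough in
        // Algorithm 16.1: mark until caught up, or until marking terminates.
        static void collectEnough() {
            while (behind()) {
                if (!markSome()) return;  // compare markSome in Algorithm 16.2
                markWorkThisCycle++;
            }
        }

        static boolean markSome() { return false; }    // placeholder stub
    }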
16.2 Termination
Termination of the collector cycle for a black mutator is a relatively straightforward procedure. When there are no grey objects remaining in the work list to be scanned, the collector terminates. At this point, even with the weak tricolour invariant the mutator can contain only black references, since there are no white objects reachable from grey objects still held by the mutator (since there are no grey objects). Because the mutator is black there is no need to rescan its roots.
Termination for a grey mutator is a little more complicated, since the mutator may acquire white pointers after its roots were scanned to initiate the collection. Thus, the grey mutator roots must be rescanned before the collector cycle can terminate. Provided that rescanning the mutator roots does not expose any fresh grey objects, the collection cycle is done. Thus, the example performs rescanning to ensure there are no more grey references before entering the sweep phase.
Algorithm 16.2: Concurrent mark-sweep marking

markSome():
    if isEmpty(worklist)            /* initiate collection */
        scan(Roots)                 /* Invariant: mutator holds no white references */
        if isEmpty(worklist)        /* Invariant: no more grey references */
            sweep()                 /* marking terminates; eager or lazy sweep */
            return false            /* terminate marking */
                                    /* collection continues */
    ref ← remove(worklist)
    scan(ref)
    return true                     /* continue marking, if still behind */

shade(ref):
    if not isMarked(ref)
        setMarked(ref)
        add(worklist, ref)

scan(ref):
    for each fld in Pointers(ref)
        child ← *fld
        if child ≠ null
            shade(child)

revert(ref):
    add(worklist, ref)

isWhite(ref):
    return not isMarked(ref)

isGrey(ref):
    return ref in worklist

isBlack(ref):
    return isMarked(ref) && not isGrey(ref)
16.3 Allocation
Notice that the allocator must initialise the mark state (colour) of the new object according to the colour of the mutator. If the mutator is black then new objects must be allocated black (marked) under the strong invariant, unless (under the weak invariant) the new object is also made reachable from some grey object. This last guarantee is generally difficult to make, so black mutators usually allocate black even under the weak invariant [Abraham and Patel, 1987; Yuasa, 1990]. However, a grey mutator admits a number of alternatives that several implementations exploit.
Kung and Song [1977] simply allocate black during the marking phase and white otherwise. Their choice is guided by the observation that new objects are usually immediately linked to existing reachable objects, at which point their write barrier (unconditional Dijkstra-style incremental update) would simply shade the object anyway. Moreover, because the new object contains no references, it is safe to allocate straight to black and avoid unnecessary work scanning it for non-existent children.
Steele [1976] chooses to vary the colour of allocation during marking, depending on the pointer values that are used to initialise the new object. Assuming that the initial values of a new object's reference fields are known a priori at the time of allocation permits a bulk test of the colour of the targets of those references. If none of them are white then the new object can safely be allocated black. Furthermore, if none of them are white then it is a possible sign that the marker is close to terminating and that the new object will not be discarded. Conversely, if any of the initialising pointers is white then the new object is allocated white. The Steele collector marks mutator stacks last, and scans them from bottom (least volatile) to top (most volatile), so most cells will be allocated white to reduce floating garbage.
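A sketch of the heuristic in Java, assuming (as Steele's scheme does) that the initial values of the new object's reference fields are known at allocation time; Node and its single mark bit are illustrative stand-ins for a real object model.

    final class SteeleAlloc {
        static final class Node {
            final Node[] fields;
            boolean marked;                       // grey or black if set
            Node(Node[] fields) { this.fields = fields; }
        }

        static boolean isWhite(Node n) { return n != null && !n.marked; }

        static Node allocate(Node[] initialFields, boolean marking) {
            Node n = new Node(initialFields);
            if (marking) {
                boolean anyWhite = false;
                for (Node f : initialFields) anyWhite |= isWhite(f);
                // No white children: safe to allocate black (and a hint that
                // marking is near termination). Any white child: allocate
                // white, reducing floating garbage.
                n.marked = !anyWhite;
            }
            return n;
        }
    }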
2. Change all white nodes to purple and all black nodes to white (preferably white, to avoid floating garbage) or grey (in the case that the node has been shaded concurrently by the mutator write barrier).
Marking ignores all purple objects: the mutator can never acquire a reference to a purple object, so grey objects never point to purple, and purple objects are never shaded. Of course, the difficulty with this approach is that the conversion of white to purple might require touching colour state associated with all of the garbage objects, which must be completed before sweeping can begin. Similarly, when starting the marking phase, all black objects (from the previous cycle) must be recoloured white.
Lamport describes an elegant solution to this problem in which the colours are reinterpreted at step 2 by rotating through a range of colour values. Each object is tagged with a two-bit basic hue (white, black, purple) plus a one-bit shaded flag. If the hue is white then setting the shaded flag shades the object grey (that is, a shaded white hue is grey). If the hue is black then setting the shaded flag has no effect (that is, a black hue means black whether the shaded flag is set or not). If the hue is purple then the shaded flag will never be set, since garbage objects will not be traced. The sense of the hue bits is determined by a global variable base, encoding the value of white (=base), black (=base+1) and purple (=base+2). At step 2 there are no grey or purple nodes because marking and sweeping have finished, so flipping from black to white and white to purple is achieved simply by incrementing base modulo 3. Table 16.1 shows the three possible values of base encoded in binary (00, 01, 10) and the two possible values of the shaded flag (0, 1), which together make up the possible colours, along with examples for the three possible values of base. The entries in the 'value' columns are determined using arithmetic modulo 3. Note that the combination hue=base+2/shaded=1 is impossible because purple (garbage) objects are never shaded grey. Subsequent increments cycle the hue interpretation accordingly.
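The arithmetic is compact enough to sketch directly. The Java below is one plausible encoding of the scheme as described (the exact bit layout is an assumption): colours are decoded relative to the global base, and rotate performs step 2 in constant time by reinterpreting every object's hue at once.

    final class LamportColours {
        enum Colour { WHITE, GREY, BLACK, PURPLE }

        static int base = 0;   // hue meaning: white = base, black = base+1,
                               // purple = base+2 (all modulo 3)

        static Colour colour(int hue, boolean shaded) {
            switch (Math.floorMod(hue - base, 3)) {
                case 0:  return shaded ? Colour.GREY : Colour.WHITE;
                case 1:  return Colour.BLACK;    // shaded flag has no effect
                default: return Colour.PURPLE;   // garbage, never shaded
            }
        }

        // Step 2: black becomes white and white becomes purple, in O(1),
        // without touching any per-object colour state.
        static void rotate() { base = (base + 1) % 3; }

        public static void main(String[] args) {
            int hue = base;                          // a white (unmarked) object
            System.out.println(colour(hue, false));  // WHITE
            rotate();
            System.out.println(colour(hue, false));  // PURPLE: now garbage
        }
    }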
16.5 On-the-fly marking
So far, we have assumed that the mutator threads are all stopped at once so that their roots can be scanned, whether to initiate or terminate marking. Thus, after the initial root scan, the mutator holds no white references. At this point, the mutator threads can be left to run as black (so long as a black mutator barrier is employed), or grey (with a grey mutator barrier) with the proviso that, to terminate marking, the collector must eventually stop and rescan grey mutators until no more work can be found. These stop-the-world actions reduce concurrency. An alternative is to sample the roots of each mutator thread separately, and concurrently with other mutator threads. This approach introduces complexity because of the need to cope with some threads operating grey and some operating black, all at the same time, and how this affects termination.
On-the-fly collection never stops the mutator threads all at once. Rather, the collector engages each of the mutators in a series of soft handshakes: these do not require a single global hard synchronisation at the command of the collector. Instead, the collector merely prompts each mutator thread asynchronously, one by one, to halt gracefully at some convenient point. The collector can then sample (and perhaps modify) each thread's state (stacks and registers) before releasing it on its way. While one mutator thread is stopped, others can continue to run. Furthermore, if stack barriers are used, as described in Section 11.5, the collector can restrict its examination of the stopped thread to just the top active stack frame (all other frames can be captured synchronously with a stack barrier), so the handshake can be very quick, minimising mutator interruption.
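A minimal sketch of the mechanism, assuming a hypothetical per-thread request flag polled at safe points; a real runtime would fold the poll into its existing safe-point checks and choose the requested action (root scanning, status change, and so on) per phase.

    import java.util.List;
    import java.util.concurrent.CopyOnWriteArrayList;
    import java.util.concurrent.atomic.AtomicBoolean;

    final class SoftHandshake {
        static final class MutatorState {
            final AtomicBoolean requested = new AtomicBoolean(false);
            volatile boolean acknowledged = false;

            // Called by the mutator thread itself at every safe point.
            void pollSafePoint(Runnable action) {
                if (requested.get()) {
                    action.run();              // e.g. scan this thread's roots
                    acknowledged = true;
                    requested.set(false);
                }
            }
        }

        static final List<MutatorState> mutators = new CopyOnWriteArrayList<>();

        // Collector side: prompt each thread asynchronously, one by one.
        // While one mutator is responding, all the others keep running.
        static void handshakeAll() throws InterruptedException {
            for (MutatorState m : mutators) {
                m.acknowledged = false;
                m.requested.set(true);
                while (!m.acknowledged) {
                    Thread.sleep(1);           // wait for its next safe point
                }
            }
        }
    }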
Synchronisation operations for on-the-fly collectors need some care. A common approach for mostly-concurrent collectors, which stop all threads together to scan their stacks, is to use a deletion barrier with a black mutator. Furthermore, new objects are allocated black. This approach simplifies the termination of marking: black stacks do not need to be rescanned, and allocation does not lead to more work for the collector. However, this approach is not sufficient for an on-the-fly collector, as Figure 16.1 illustrates. Because stacks are scanned on the fly, some may be white. The heap is allowed to contain black objects before all threads have been scanned and before tracing has started, because we allocate new objects black. The deletion barrier is not triggered on stack operations and there is no insertion barrier, so neither X nor Y is shaded grey. In summary, correct mutator-collector synchronisation for on-the-fly marking is a subtle issue that requires substantial care on the part of the algorithm designer.
Figure 16.1: (a) The deletion barrier is 'on'. Thread 1 has been scanned, but thread 2 has not. X has been newly allocated black. (b) X is updated to point to Y; thread 2's reference to Y is removed. Neither action triggers a deletion barrier.
Doligez-Leroy-Gonthier
Named for its authors [Doligez and Leroy, 1993; Doligez and Gonthier, 1994], this collector uses private thread-local heaps to allow separate garbage collection of data allocated solely on behalf of a single thread, and not shared with other threads. A global heap allows sharing of objects among threads, with the proviso that global shared objects never contain pointers into private heaps. A dynamic escape detection mechanism copies private objects into the shared heap whenever their reference is stored outside the private heap. Only immutable objects (the vast majority in ML) can be allocated privately, so making a copy of one in the shared heap does not require updating all the sources of its pointers (though it does require copying the transitive closure of reachable objects). But mutation is rare in ML so this happens infrequently. These rules permit a private heap to be collected independently, stopping only the mutator that owns the heap.
Doligez-Leroy-Gonthier uses concurrent mark-sweep collection in the shared heap, to avoid having to update references from each of the threads. The steady-state concurrent mark-sweep collector operates in the usual black mutator snapshot mode, employing a Yuasa-style snapshot deletion barrier. Initiating steady-state collection proceeds using a series of soft handshakes to transition mutator threads from grey to black, as follows.
The collector and mutator threads each track their own view of the state of the collection with a private status variable. To initiate the collection cycle, the collector sets its status to Sync1. The mutator threads are then made to acknowledge, and update their own status, via soft handshakes. Once all have acknowledged the Sync1 handshake, the collector is said to be in phase Sync1. Mutator threads ignore handshakes while storing to a pointer field or allocating, to ensure that these operations first complete, making them atomic with respect to phase changes. Having acknowledged this handshake, each mutator thread now runs with the write barrier in Algorithm 16.3a, which shades both the old and new values of modified pointer fields, combining both the black mutator Yuasa-style snapshot deletion barrier and the grey mutator Dijkstra-style incremental update insertion barrier. Shading by the mutator does not directly place the shaded object into the collector's work list for scanning (as Kung and Song [1977] do), but rather simply colours a white object explicitly grey and resets a global dirty variable to force the collector to scan for the newly grey object (in the style of Dijkstra et al [1978]). This avoids the need to synchronise explicitly between the mutator and the collector (other than for soft handshakes, where atomicity is accomplished simply by delaying acknowledgement of the handshake), but does mean that worst-case termination requires rescanning the heap for grey objects. Because mutation is rare in ML, this is not a significant impediment. At this point, the grey mutator threads are still allocating white, as they were before the collection cycle was initiated.
Algorithm 16.3: Doligez-Leroy-Gonthier write barriers

(a) WriteSync(src, i, new):
        old ← src[i]
        shade(old)
        shade(new)
        src[i] ← new

(b) WriteAsync(src, i, new):
        old ← src[i]
        if not isBlack(old)
            shade(old)
            if old < scanned    /* shaded object is behind the wavefront */
                dirty ← true
        src[i] ← new
Once all of the mutators have acknowledged the Sync1 handshake, the collector moves to phase Sync2 with another round of handshakes. Because the write barrier is atomic only with respect to handshakes, it does not impose mutator-mutator synchronisation. This leaves the possibility that a mutator from before the Sync1 handshake, which is not running the write barrier, could insert some other pointer X into the src[i] field right after the load old ← src[i]. Thus, shade(old) will not shade the pointer X that actually gets overwritten by the store src[i] ← new. The transition to phase Sync2 avoids such problems by ensuring that all mutator threads have completed any unmonitored atomic allocation or write begun in Async before the collector moves to Sync2. At that point, all mutators will be running the write barrier (with insertion protection), so even if the mutators interleave their write barrier operations there will not be a problem. The collector can then safely move into the steady-state snapshot marking phase, Async. Each mutator thread acknowledges the Async handshake by scanning (shading from) its roots for the collector (making itself black), starting to allocate black, and reverting to the standard snapshot barrier augmented with resetting the global dirty flag (similarly to Dijkstra et al [1978]) to force the collector to rescan if the shaded object is behind the scanning wavefront, as shown in Algorithm 16.3b.
Allocation colour must also take account of the point where the collector is currently sweeping (to avoid the race with the sweeper at the boundary). Once all mutators have acknowledged the Async handshake, all mutators are black, and the collector can complete marking, scanning and sweeping.
Sliding views
Azatchi et al [2003] offer further improvements to on-the-fly marking by exploiting the sliding views approach to sampling mutator roots without stopping the world [Levanoni and Petrank, 1999]. In place of the deque used by Domani et al [2000], the sliding views approach implements the snapshot deletion barrier by logging to a thread-local buffer the state of all the fields of an object before it is modified (dirtied) for the first time while the collector is marking. The buffers are drained via soft handshakes, with marking terminated once all the buffers are empty. Like Doligez-Leroy-Gonthier, after the initial handshake, and before the deletion barrier can be enabled for each mutator, the mutators also execute a Dijkstra-style incremental update insertion barrier to avoid propagating pointers unnoticed before the mutator snapshot can be gathered. These snooped stores also become mutator roots. The snooped stores are disabled once all threads are known to be logging the snapshot. Further details of this approach are discussed in Section 18.5.
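The essence of the logging barrier can be sketched as follows; this is an illustrative reading, not Azatchi et al's code. Before an object is dirtied for the first time while the collector is marking, the prior values of all its fields are saved to a thread-local buffer that soft handshakes later drain. A production barrier must order the dirty test, the logging and the store carefully across threads; the sketch ignores those memory-ordering issues.

    import java.util.ArrayDeque;
    import java.util.Deque;

    final class SlidingViewsLog {
        static final class Node {
            Node[] fields = new Node[4];
            boolean dirty;                   // logged already this cycle?
        }

        // One log buffer per mutator thread, drained via soft handshakes.
        static final ThreadLocal<Deque<Node[]>> logBuffer =
                ThreadLocal.withInitial(ArrayDeque::new);

        static void write(Node src, int i, Node newRef, boolean marking) {
            if (marking && !src.dirty) {
                logBuffer.get().add(src.fields.clone());  // snapshot old fields
                src.dirty = true;
            }
            src.fields[i] = newRef;
        }
    }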
16.6 Abstract concurrent collection
The concurrent collectors we have described share a common structure, differing in small but important details. To highlight these similarities and differences we can adopt a common abstract framework for concurrent garbage collection [Vechev et al, 2005, 2006; Vechev, 2007]. As discussed previously, the correctness of a concurrent collector depends on cooperation between the collector and the mutator in the presence of concurrency. Thus, the abstract concurrent collector logs events of mutual interest to the collector and mutator by appending to the shared list log. These events are tagged as follows:
• T(src, fld, old, new) records that the collector has Traced pointer field fld of source object src, and that the field initially contained reference old, which the collector has replaced by reference new. That is, the collector has traced an edge in the object graph src→old and replaced it with an edge src→new.
• N(ref) records that the mutator has allocated a New object ref.
• R(src, fld, old) records that the mutator has performed a Read from the heap by loading the value old from field fld of source object src.
• W(src, fld, old, new) records that the mutator has performed a Write to the heap by storing the value new into field fld of source object src, which previously contained value old. If fld is a pointer field then the mutator has replaced an edge src→old with an edge src→new.
Liveness is determined with the aid of an as-yet-undefined parametrised function expose, which takes a log prefix and returns a set of objects that should be considered as additional origins for live references. Different implementations of this function yield different abstract concurrent collector algorithms, corresponding to concrete algorithms in the literature, as discussed further below when we describe how to instantiate specific collectors. It is the log that permits dealing with concurrent mutations that cause reachable objects to be hidden from the scan routine, which otherwise would remain unmarked.
Algorithm 16.4: Abstract concurrent collection

collectTracingInc():
    atomic
        rootsTracing(W)
        log ← ()
    repeat
        scanTracingInc(W)
        addOrigins()
    until (?)
    atomic
        addOrigins()
        scanTracingInc(W)
        sweepTracing()

scanTracingInc(W):
    while not isEmpty(W)
        src ← remove(W)
        if ρ(src) = 0                   /* reference count is zero */
            for each fld in Pointers(src)
                atomic
                    ref ← *fld
                    log ← log · T(src, fld, ref, ref)
                if ref ≠ null
                    W ← W + [ref]
        ρ(src) ← ρ(src) + 1             /* increment reference count */

addOrigins():
    atomic
        origins ← expose(log)
    for each src in origins
        W ← W + [src]

New():
    ref ← allocate()
    atomic
        ρ(ref) ← 0
        log ← log · N(ref)
    return ref
Mutator events (N, R and W) are treated differently depending on whether they occur in the portion of the heap already scanned by the collector (behind the wavefront) or not yet scanned (ahead of the wavefront). The wavefront itself comprises the set of pending fields still to be scanned (specifically not the values of the pointers in those fields). Practical collectors may approximate the wavefront more or less precisely, from field granularity up through granularity at the level of objects to pages or other physical or logical units.
Adding origins
The addOrigins procedure uses the log to select a set of additional objects to be considered live, even if the collector has not yet encountered those objects in its trace, since it is possible that some number of reachable pointers were hidden by the mutator behind the wavefront. The precise choice of the set of origins is returned by the expose function.
Mutator barriers
The procedures New and Write represent the usual barriers performed by the mutator (here they are suitably atomic), which in the abstract algorithm coordinate with the collector by appending their actions to the log. Logging New objects allows subsequent mutator events to distinguish loading/storing fields of new objects from loading/storing references to new objects. A freshly allocated object always has a unique reference until that reference has been stored to more than one field in the heap. Moreover, it does not contain any outgoing references (so long as its fields have not been modified, since they are initialised to null). This event allows concrete collectors to vary in how they decide liveness of objects that are allocated during the collection cycle (some collectors treat all such objects as live regardless of their reachability, leaving those that are unreachable to be reclaimed at the next collection cycle). Others will retain only those new objects whose references are stored to live objects.
As usual, the mutator Write operation assigns src[i] ← new (with new ≠ null), so the pointer to destination object new is inserted in field src[i] of source object src. Similarly, the old pointer old previously in field src[i] of source object src is deleted. When the source field is behind the collector wavefront then the pointers new/old are inserted/deleted behind the wavefront. Otherwise, the pointers are inserted/deleted ahead of the wavefront. Write events capture both the inserted and deleted pointers.
Logging
Recall also that the wavefront can be expressed using the tricolour abstraction, where those objects/fields ahead of the wavefront are white, those at the wavefront are grey, and those behind the wavefront are black.
Precision
The abstract concurrent collector of Algorithm 16.4 preserves a fixed level of atomicity (as specified by the atomic blocks) while instantiating the function expose in different ways to vary precision. Varying this parameter of the abstract concurrent collector is sufficient to capture a representative subset of the concrete concurrent collectors that occur in the literature, but there are other real collectors that cannot be instantiated directly from Algorithm 16.4, since they vary also in what they treat atomically. For example, Algorithm 16.4 assumes that roots can be obtained atomically from the mutator threads, which implies that they must be sampled simultaneously, perhaps by stopping them all briefly (that is, Algorithm 16.4 is mostly-concurrent).
Instantiating collectors
Instantiating specific concurrent collectors within this framework requires defining a corresponding expose function. For example, consider a Steele-style concurrent collector that rescans all objects modified up to and including the wavefront. The wavefront at an object and field granularity is captured by the (last) Trace operations in the log for each object/field. The objects modified are captured by the src component of all the Write records in the log, and the modified fields by the fld component. The Steele-style expose function atomically rescans modified fields that have already been traced. The traditional implementation tracks the wavefront at the object granularity (src component of Trace records) using per-object mark bits, but the abstract framework highlights that the wavefront might also operate at the field (fld) granularity, given a mechanism for marking distinct fields. Thus, one need only rescan modified fields that have already been traced, as opposed to whole modified objects that have already been traced. Moreover, Steele assumes that mutator thread stacks are highly volatile, so expose must rescan them right to the end. Termination requires that every Trace record have no matching (at the field or object level) Write record update.
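Under this reading, a field-granularity Steele-style expose can be written directly over the abstract log. The Java below is one possible concrete rendering, with the T and W events reduced to just the components the function inspects; it returns the sources of writes that landed behind the wavefront, that is, into fields that some earlier Trace record had already covered.

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    final class SteeleExpose {
        record Field(Object src, int fld) {}            // a (source, field) slot
        sealed interface Event permits Trace, Write {}
        record Trace(Object src, int fld) implements Event {}  // T(src, fld, ...)
        record Write(Object src, int fld) implements Event {}  // W(src, fld, ...)

        static Set<Object> expose(List<Event> log) {
            Set<Field> traced = new HashSet<>();   // the wavefront, per field
            Set<Object> origins = new HashSet<>();
            for (Event e : log) {
                if (e instanceof Trace t) {
                    traced.add(new Field(t.src(), t.fld()));
                } else if (e instanceof Write w
                        && traced.contains(new Field(w.src(), w.fld()))) {
                    origins.add(w.src());          // modified behind the wavefront
                }
            }
            return origins;
        }
    }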
16.7 Issues to consider
Many of the issues facing concurrent mark-sweep garbage collection are common to all concurrent collectors. Concurrent collectors are without doubt more complex to design, implement and debug than stop-the-world collectors. Do the demands made of the collector warrant this additional complexity? Or would a simpler solution, such as a generational collector, suffice?
Generational collectors can offer expected pause times for most applications of only a few milliseconds. However, their worst case, a full heap collection, may pause an application for very much longer, depending on the size of the heap, the volume of live objects and so on. Such delays may not be acceptable. Concurrent collectors, on the other hand, offer shorter and more predictable pause times. As we shall see in Chapter 19,
hiding objects from a collector. In addition, collectors that move objects must ensure both that only one collector thread moves an evacuated object and that it appears to mutators that
further possibilities. New objects may be allocated black, grey or white, or the decision may be varied depending on the phase of the collector, the initial values of the new object's fields, or the progress of the sweeper.
In the remaining chapters, we examine concurrent copying and compacting collectors and conclude with collectors that can provide pause time guarantees sufficient for hard real-time systems, that is, those that must meet every deadline.
Chapter 17
Concurrent copying & compaction
also protect the mutator against concurrent copying. Moreover, concurrent updates by the mutator must be propagated to the copies being constructed in tospace by the collector.
For copying collectors, a black mutator must by definition hold only tospace pointers. If it held fromspace pointers then the collector would never revisit and forward them, violating correctness. This is called the black mutator tospace invariant: the mutator operates at all times ahead of the wavefront, in tospace. Similarly, a grey mutator must by definition hold only fromspace pointers at the beginning of the collector cycle. In the absence of a read barrier to forward a fromspace pointer to the tospace copy, the grey mutator cannot directly acquire tospace pointers from fromspace objects (since the copying collector does not forward pointers stored in fromspace objects). This is called the grey mutator fromspace invariant. Of course, for termination of a copying algorithm, all mutator threads must end the collection cycle holding only tospace pointers, so any copying collector that allows grey mutator threads to continue operating in fromspace must eventually switch them all over to tospace by forwarding their roots. Moreover, updates by the mutator in fromspace must also be reflected in tospace or else they will be lost.
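For example, the black mutator tospace invariant is classically maintained with a Baker-style read barrier. The Java sketch below assumes an illustrative object model (Cell, a forwardingAddress word and a stand-in copy routine): every pointer the mutator loads is forwarded, copying the target on demand, so the mutator never sees a fromspace reference.

    final class BakerReadBarrier {
        static final class Cell {
            Cell[] fields = new Cell[4];
            Cell forwardingAddress;        // non-null once copied to tospace
        }

        static Cell read(Cell src, int i) {
            Cell ref = src.fields[i];
            if (ref == null) return null;
            Cell to = ref.forwardingAddress;
            if (to == null) to = copy(ref);  // mutator performs copying work
            src.fields[i] = to;              // update the slot (many designs do)
            return to;                       // only ever a tospace reference
        }

        static Cell copy(Cell from) {        // stand-in for tospace evacuation
            Cell to = new Cell();
            to.fields = from.fields.clone();
            from.forwardingAddress = to;
            return to;
        }
    }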
cycle. At this point, the now-black mutators contain only (grey) tospace pointers, but the (unscanned) grey targets will still contain fromspace pointers. Baker's [1978] black mutator
objects referenced by ambiguous roots. The collector is free to move the remaining objects not directly referenced by ambiguous roots. It is straightforward to use the brief stop-the-world phase of a mostly-concurrent collector to mark (and pin) all the objects referenced by the ambiguous roots in the mutator thread stacks and registers. At this point all the mutator threads are black, and a Baker-style read barrier will ensure that the mutator threads never subsequently acquire references to uncopied objects.
Algorithm 17.1: Mostly-concurrent copying

collect():
    atomic
        flip()
        for each fld in Roots
            process(fld)
    loop
        atomic
            if isEmpty(worklist)
                break                       /* exit loop */
            ref ← remove(worklist)
            scan(ref)

flip():
    fromspace, tospace ← tospace, fromspace
    free, top ← tospace, tospace + extent

scan(toRef):
    for each fld in Pointers(toRef)
        process(fld)

process(fld):
    fromRef ← *fld
    if fromRef ≠ null
        *fld ← forward(fromRef)             /* update with tospace reference */

forward(fromRef):
    toRef ← forwardingAddress(fromRef)
    if toRef = null                         /* not copied (not marked) */
        toRef ← copy(fromRef)
    return toRef

copy(fromRef):
    toRef ← free
    free ← free + size(fromRef)
    if free > top
        error "Out of memory"
    move(fromRef, toRef)
    forwardingAddress(fromRef) ← toRef      /* mark */
    add(worklist, toRef)
    return toRef
subsequently for Modula-3 [Cardelli et al, 1992], both systems-oriented programming languages whose compilers were not sophisticated enough to generate accurate stack maps. Also, because their compilers did not emit an explicit barrier for heap accesses, DeTreville applied an Appel et al [1988] read barrier to synchronise the mutator with the collector, using virtual memory page protection. Detlefs [1990] used the same technique for C++, modifying the AT&T C++ compiler to derive automatically the accurate pointer maps for heap objects needed to allow copying of objects not referenced directly from ambiguous roots.
Now the only problem is that the read barrier can still read a field ahead of the wavefront that might refer to an uncopied fromspace object. Fortunately, the ubiquitous indirection field relaxes the need for the tospace invariant imposed by Baker, so the mutator is allowed to operate grey and hold fromspace references. To ensure termination, Brooks imposes a Dijkstra-style write barrier to prevent the insertion of fromspace pointers behind the wavefront, as in Algorithm 17.2.
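The shape of the indirection is worth sketching. In the illustrative Java below (Cell and its forwarding word are assumptions, and the insertion barrier is reduced to a comment), every object's forwarding word initially refers to the object itself, so all accesses can indirect through it unconditionally; when the collector copies an object it redirects the word, and subsequent accesses reach the tospace copy.

    final class BrooksIndirection {
        static final class Cell {
            Cell forwarding = this;          // self-referential until copied
            Cell[] fields = new Cell[4];
        }

        static Cell read(Cell src, int i) {
            // Unconditional indirection: reach the current copy of src. The
            // loaded reference may itself still be a fromspace original; its
            // own forwarding word keeps later accesses correct.
            return src.forwarding.fields[i];
        }

        static void write(Cell src, int i, Cell newRef) {
            // A Dijkstra-style insertion barrier would shade newRef here, to
            // keep fromspace pointers from hiding behind the wavefront.
            src.forwarding.fields[i] = newRef;
        }

        static void copyTo(Cell from, Cell to) {  // collector side (sketch)
            to.fields = from.fields.clone();
            from.forwarding = to;                 // future accesses redirected
        }
    }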
Because mutator threads now operate grey, once copying is finished they need a final scan of their stacks to replace any remaining unforwarded references. The alternative, as
Baker-style collectors require a read barrier to preserve their black mutator invariant. Read barriers are often considered to be more expensive than write barriers, since reads are more frequent than writes: they must test whether src[i] is in tospace and evacuate it if not. Cheadle et al [2004]
eliminate this test and eliminate all overheads in accessing a black tospace object for a Baker-style incremental copying collector in the Glasgow Haskell Compiler (GHC). The first word of every object (closure) in GHC points to its entry code: the code to execute (enter) when the closure is evaluated. They provide two versions of this code. In addition to the standard version, a second version will scavenge the closure before entering the standard code. Let us see how this scheme operates. When the collector is off, the entry-code word points to the standard, non-scavenging code. However, when an object is copied to tospace, this word is hijacked and set to point to the self-scavenging code. If the object, now in tospace, is entered, the self-scavenging code is executed first, to copy the object's children to tospace. Then the original value of the entry-code word is reinstated. Finally, the standard version of the code is entered. The beauty of this scheme is that if the closure is evaluated in the future then its standard code will be entered unconditionally: the read barrier has been erased. The cost of this scheme is some duplication of code: Cheadle et al found the overhead to be 25% over that of a stop-the-world copying collector. In [Cheadle et al, 2008] they applied this technique to flip method-table pointers in the Jikes RVM Java virtual machine. To do so they have to virtualise most accesses to an object (all method calls and accesses to fields, unless they are static or private). However, they were able to recoup some of this cost by using the run-time compiler to inline aggressively.
Unconditionally dereferencing the indirection pointer adds (bounded) overhead to every mutator heap access (both reads and writes, pointers and non-pointers), and the indirection pointer adds an additional pointer word to the header of every object. It has the advantage of avoiding the need for Baker's tospace invariant, which forces the mutator to perform copying work when loading a fromspace reference from the heap, while preserving the essential property that accesses (both reads and writes) go to the tospace copy whenever one is present. This has the important result that heap updates are never lost, because they occur either to the fromspace original before it is copied or to the tospace copy afterwards.1
Replication copying collectors [Nettles et al, 1992; Nettles and O'Toole, 1993] relax this requirement by allowing the mutator to continue operating against fromspace originals even while the collector is copying them to tospace. That is, the mutator threads obey a fromspace invariant, updating the fromspace objects directly, while a write barrier logs all
1 Atomic copying of an object and installation of the forwarding address from the old copy to the new one is not always simple.
collector via the mutation log, and when updating the roots from the mutators. Thread-local buffers and work stealing techniques can minimise the synchronisation overhead when manipulating the mutation log [Azagury et al, 1999]. The collector must use the mutation log to ensure that all replicas reach a consistent state before the collection terminates. When the collector modifies a replica that has already been scanned, it must rescan the replica to make sure that any object referenced as a result of the mutation is also replicated in tospace. Termination of the collector requires that each mutator thread be stopped to scan its roots. When there are no more objects to scan, the mutator log is empty, and no mutator has any remaining references to uncopied objects, then the collection cycle is finished. At this point all the mutator threads are stopped together briefly to switch them over to tospace by redirecting their roots.
The resulting algorithm imposes only short pauses to sample (and at the end redirect) the mutator roots: each mutator thread is stopped separately to scan its roots, with a brief
of Baker's [1978] algorithm, which divides the heap into multiple per-processor regions. Each processor has its own fromspace and tospace, and is responsible for evacuating into its own tospace any fromspace object it discovers while scanning. Halstead uses locking to handle races between processors that compete to copy the same object, and for updates, to avoid writing to an object while it is being evacuated. He also retains global synchronisation to have all the processors perform the flip into their tospace before discarding their fromspace. To eliminate this global synchronisation, Herlihy and Moss decouple fromspace reclamation from the flip. They divide each processor region into a single tospace plus multiple (zero or more) fromspaces. As copying proceeds, multiple fromspace versions of an object can accumulate in different spaces. Only one of these versions is current while the rest are obsolete.
Each processor2 alternates between its mutator task and a scanning task that checks local variables and its tospace for pointers to fromspace versions. When such a pointer
2 Herlihy and Moss use the term process for what might now be called a thread, but we continue to use processor here to match Halstead [1985] and to emphasise that the heap regions should be thought of as per-processor.
processor (scanners) holds any of its fromspace pointers. A scan is clean with respect to a given owner if the scan completes without finding any pointers to versions in any of its fromspaces; otherwise it is dirty. A round is an interval during which every processor starts and completes a scan. A clean round is one in which every scan is clean and no processor executes a flip. After a processor executes a flip, the resulting fromspace can be reclaimed after completion of a subsequent clean round.
An owner detects that another scanner has started and completed a scan using two atomic handshake bits, each written by one processor and read by the other. Initially both bits agree. To start a flip, the owner creates a new tospace, marks the old tospace as a fromspace, and inverts its handshake bit. At the start of a scan, the scanner reads the owner's handshake bit, performs the scan, and sets its handshake bit to the value read from the owner's. Thus, the handshake bits will agree once the scanner has started and completed a scan in the interval since the owner's bit was inverted.
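For a single owner-scanner pair the protocol is small enough to sketch; the Java below is illustrative, and the real scheme replicates these bits into the owner and scanner arrays described next.

    import java.util.concurrent.atomic.AtomicBoolean;

    final class HandshakeBits {
        final AtomicBoolean ownerBit = new AtomicBoolean(false);
        final AtomicBoolean scannerBit = new AtomicBoolean(false); // same initially

        void ownerStartsFlip() {
            ownerBit.set(!ownerBit.get());      // invert: bits now disagree
        }

        void scannerScans(Runnable scanTask) {
            boolean observed = ownerBit.get();  // read before scanning
            scanTask.run();                     // perform the scan itself
            scannerBit.set(observed);           // acknowledge afterwards
        }

        // True once some scan has both started and completed since the
        // owner's bit was last inverted.
        boolean scanCompletedSinceFlip() {
            return ownerBit.get() == scannerBit.get();
        }
    }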
However, an owner must detect that all processors have started and completed a scan, and every processor is symmetrically both an owner and a scanner, so the handshake bits are arranged into two arrays. An owner array contains the owner handshake bits, indexed by owner processor. A two-dimensional scanner array contains the scanner handshake bits, with an element for each owner-scanner pair. Because a scan can complete with respect to multiple owners, the scanner must copy the entire owner array into a local array on each scan. At the end of the scan, the scanner must set its corresponding scanner bits to these previously saved values. An owner detects that the round is complete as soon as its owner bit agrees with the bits from all scanners. An owner cannot begin a new round until the current round is complete.
To detect whether a completed round was clean, the processors share an array of dirty bits, indexed by processor. When an owner executes a flip, it sets the dirty bit for all other processors. Also, when a scanner finds a pointer into another processor's fromspace, it sets that processor's dirty bit. If an owner's dirty bit is clear at the end of a round then the round was clean, and it can reclaim its fromspaces. If not, then it simply clears its dirty bit and starts a new scan. By associating dirty bits with fromspaces rather than processor regions, and having scanners set the dirty bit for the target fromspace when they find a pointer, it is also possible to reclaim fromspaces individually rather than all at once.
Algorithm 17.3: WriteLocal

WriteLocal(a, i, v):
    ...
    seq(a) ← seq    $
    index(a) ← i
    value(a) ← v
    ...
    else
        Write(a, i, v)
Update in place with CompareAndSwap2. The first extension assumes the availability of the CompareAndSwap2 operator, which allows both performing the update and ensuring that the forwarding pointer next remains null as a single atomic operation. Unfortunately, CompareAndSwap2 is not widely implemented on modern multiprocessors. Transactional memory might be a viable alternative; in fact, this algorithm inspired the work leading to transactional memory [Herlihy and Moss].
requires additional fields in the header of object a: seq(a) is a modulo-two sequence number, index(a) is the index of the slot being updated and value(a) is the new value for that slot. Also, the forwarding pointer field next(a) is permitted to hold a sequence number, in addition to a pointer or null (this is easy enough to achieve by tagging the forwarding pointer field with a low bit to distinguish pointers to suitably aligned objects from a sequence number). There need only be two values for sequence numbers: if seq(a) = next(a) then the current update is installed, and otherwise it is ignored.
To perform a store using the full write barrier, a processor chains down the list of versions until it finds the current version (one with null or a sequence number stored in its next field). If the current version is local, then the processor performs the WriteLocal operation illustrated in Algorithm 17.3. This takes the current version a, the observed next
in the next field. If successful, then the processor performs a deletion barrier to scan any pointer overwritten by the store (this preserves the invariant that scanning has inspected every pointer written into tospace), before performing the store. Otherwise, the processor locates the newer version and retries the update by invoking the full write barrier. Having the owning process update in place is well-suited to a non-uniform memory architecture, where it is more efficient to update local objects.
If the object is remote then the new owner makes a local tospace copy as before, except that after making the copy, but before performing the store, it must check whether next(a) = seq(a). If they are equal, then it must first complete the pending update, performing the deletion barrier to scan the slot indicated by the index field and storing the value from the value field into that slot. The same action must be performed when the scanner evacuates an object into tospace. This ensures that any writes performed on the original object while it is being copied are linearised before writes performed to the copy. Since the owner is the only processor that updates the object in place, there is no need to synchronise with the scanner. The deletion barrier in step 2 ensures that pointers possibly seen by other processors will be scanned. The insertion barrier in step 4 ensures that if the object has already been scanned then the new pointer will not be mistakenly omitted.
17.6 Sapphire
A problem with symmetric division of the heap into independently collected regions per processor, as done by Halstead [1985] and Herlihy and Moss [1992], is that it ties the heap structure to the topology of the multiprocessor. Unfortunately, application heap structures and thread-level parallelism may not map so easily to this configuration. Moreover, one processor can become a bottleneck because it happens to own a particularly large or knotty portion of the heap, causing other processors to wait for it to complete its scan before they can discard their fromspaces, so they may end up stalling if they have no free space in which to allocate. It may be possible to steal free space from another processor, but this requires the ability to reconfigure the per-processor heap regions dynamically. These issues were discussed earlier in Chapter 14. Instead, non-parallel concurrent collectors place collector work asymmetrically on one or more dedicated collector threads, whose priority can easily be adjusted to achieve a balance of throughput between mutator and collector threads.
Algorithm 17.4: Sapphire phases

MarkCopy:
    Mark                 /* mark reachable objects */
    Allocate             /* allocate tospace shells */
    Copy                 /* copy fromspace contents into tospace shells */

Mark:
    PreMark              /* install Mark phase write barrier */
    RootMark             /* blacken global variables */
    HeapMark/StackMark   /* process collector mark queue */

Flip:
    PreFlip              /* install Flip phase write barrier */
    HeapFlip             /* flip all heap fromspace pointers to tospace */
    ThreadFlip           /* flip threads, one at a time */
    Reclaim              /* reclaim fromspace */
Collector phases
MarkCopy: The first group of phases marks the objects directly reachable from global variables and mutator thread stacks and registers, and copies them into tospace. During these phases the mutators all read from the originals in fromspace, but must also mirror their writes to the tospace copies. The fromspace and tospace copies are kept loosely coherent by relying on the programming language memory model (in this case for Java [Manson et al, 2005; Gosling et al, 2005], but which should also apply to the forthcoming memory model for C++ [Boehm and Adve, 2008]) to avoid a barrier on reads of non-volatile fields (volatile fields require a read barrier). This means the updates to each copy need not be atomic or simultaneous. Rather, a Java application need only perceive that the values in the copies cohere at application-level synchronisation points. Any changes made by a mutator thread to fromspace
Algorithm 17.5: Sapphire pointer equality

(a) pointerEQ(p, q):
        if p = q return true
        if q = null return false
        if p = null return false
        return flipPointerEQ(p, q)    /* called only during Flip phases */

(b) flipPointerEQ(p, q):
        pp ← forward(p)
        qq ← forward(q)
        return pp = qq
object must appear to have the same reference at the language level. Every pointer equality operation must apply a barrier, as illustrated in Algorithm 17.5a. Note that if either argument is statically null then the compiler can revert the test to the usual unbarriered comparison.
MarkCopy: Mark. The Mark phase marks every fromspace object reachable from the roots, both global variables and thread stacks/registers. The Sapphire design calls for the collector to scan the global variables, enqueuing any reference it finds to an unmarked fromspace object. The collector then marks and scans the unmarked objects in the queue. When it removes a pointer to object p from the queue, if p is not yet marked then it marks p and scans its slots to enqueue any unmarked fromspace objects referred to by p.
We emphasise that Sapphire assumes that there are no races on updating non-volatiles.
When the collector finds the mark queue empty, it scans each mutator, one at a time, stopping each mutator to scan its stacks and registers and enqueuing any reference it finds to an unmarked fromspace object. If the collector makes a pass through all the mutators without enqueuing any objects for marking then marking is complete; otherwise marking and scanning continue. Termination relies on the fact that the write barrier prevents retreating the marking wavefront, and that new objects are allocated black. Eventually all reachable fromspace objects will be marked and scanned.
The Mark phase has three steps. The PreMark step installs the Mark phase write barrier WriteMark, shown in Algorithm 17.6a. Mutators do not perform any marking directly, but all stores to objects, including initialising stores, invoke the write barrier, so newly allocated objects are treated as black. Mutator stacks and registers are still grey.
Finally, the HeapMark/StackMark step processes the collector's mark queue, a separate set of explicitly grey (marked) objects, and the thread stacks. For each reference in the mark queue, the collector checks whether it is already marked. If not, the collector marks the object and scans it, continuing until the mark queue and the explicit grey set are both empty. (An object can be enqueued for marking more than once, but eventually it will be marked and no longer enqueued by the mutators.)
Whenever the mark queue and grey sets are both empty, the collector scans a mutator stack by briefly stopping the mutator thread at a safe point (which cannot be in the middle of a write barrier), scanning the thread's stack and registers to blacken them by enqueuing every unmarked root using WriteMark. Having scanned every thread's stack without finding any white pointers or enqueued objects, and with the mark queue and grey set empty, there can be no white pointers in the thread stacks, global variables, or newly allocated objects. They are now all black. The termination argument for this phase relies on the write barrier to keep globals and newly allocated objects black. The write barrier also prevents mutators from writing white references into the heap. A mutator can obtain a white pointer only from a (reachable) grey or white object. Because there were no grey objects since the mutator threads were scanned, it cannot obtain a white pointer from a
Algorithm 17.6c: the Flip phase write barrier

WriteFlip(p, i, q):
    q ← forward(q)            /* omit this for non-pointer q */
    p[i] ← q
    pp ← toAddress(p)
    if pp ≠ null              /* p is in fromspace */
        pp[i] ← q
        return
    pp ← fromAddress(p)
    if pp ≠ null              /* p is in tospace */
        pp[i] ← q
        return
Note that only mutators active since their last scan during this Mark phase need to be rescanned. Similarly, only stack frames active since their last scan within this Mark phase need to be rescanned.
At this point, all white fromspace objects are unreachable.
MarkCopy: Allocate. Once the Mark phase has determined the set of reachable fromspace objects, the collector allocates an empty tospace copy 'shell' for each marked fromspace object. It sets a forwarding pointer in the fromspace object to refer to the tospace copy, and builds a hash table for the reverse mapping from each tospace copy to its fromspace copy. This is needed because a mutator thread that has been flipped to tospace still needs to update fromspace copies whenever other threads are still operating in fromspace.
copyWord(p, q):
    for i ← 1 to MAX_RETRY do
        toValue ← *p                           $
        toValue ← forward(toValue)             /* omit this for non-pointers */
        *q ← toValue                           $
        fromValue ← *p                         $
        if toValue = fromValue
            return
    LoadLinked(q)                              $
    toValue ← *p                               $
    toValue ← forward(toValue)                 /* omit this for non-pointers */
    StoreConditionally(q, toValue)             /* assuming no spurious failure */  $

copyWordSafe(p, q):
    for ...                                    /* as in copyWord */
    loop
        LoadLinked(q)                          $
        toValue ← *p                           $
        toValue ← forward(toValue)             /* omit this for non-pointers */
        if StoreConditionally(q, toValue)      $
            return                             /* SC succeeded */
collector will make progress in the face of repeated updates to the 'from' location.4 In practice, the copyWord loop must be re-coded defensively, as shown in copyWordSafe, but the collector will fail to make progress if the mutator repeatedly updates the 'from' location p between every invocation of LoadLinked and StoreConditionally.
Flip. Again, the phase operates in several steps. Beginning in this phase, unflipped mutators can operate in both fromspace and tospace. The PreFlip step installs the Flip phase write barrier WriteFlip to cope with this (Algorithm 17.6c). The HeapFlip step then processes references held in global variables and newspace objects, flipping all fromspace pointers to tospace. WriteFlip guarantees not to undo this work by ensuring that only tospace pointers are written into global variables and newspace. The ThreadFlip step then flips each mutator thread, one at a time: it stops the thread, flips any fromspace pointers in its stacks and registers over to tospace, and restarts the thread. During this phase, all mutators still need to update both fromspace and tospace copies. Thus, flipped threads need to be able to map from tospace copies back to fromspace copies (using fromAddress, analogously to toAddress). Finally, once all threads are flipped and no thread is still executing WriteFlip, the Reclaim step reclaims fromspace and discards the reverse mapping table from tospace back to fromspace.
Since unflipped threads may access both fromspace and tospace copies of the same object, the pointer equality test needs to compare the tospace pointers (Algorithm 17.5b).
Merging phases
Volatile fields
Java volatile fields require a physical memory access for each source code access, and accesses must appear to be sequentially consistent. For this reason, volatile fields require heavier synchronisation on mutator access and while the collector is copying them, to ensure that their copies are kept properly coherent. Hudson and Moss describe several techniques for achieving this, each of which imposes substantial additional overhead for
Compressor
Compressor [Kermany and Petrank, 2006], presented earlier in Section 3.4 and Section 14.8, exploits the freedom allowed by separating marking from copying to perform compaction concurrently with the mutator threads.
Recall that Compressor first computes an auxiliary first-object table that maps each tospace page to its first object. To support concurrent compaction, Compressor double-maps each physical page when its contents are to be copied: once in its 'natural' (still-protected) tospace virtual page, and again in an unprotected virtual page private to the compactor thread (see also Section 11.10). Once the compaction work has been done for that page, the tospace virtual page can be unprotected so mutators can proceed, and the private mapping is discarded.
In essence, Compressor applies the standard tricolour invariant. Fromspace pages are
white, protected tospace pages are grey,
and unprotected tospace pages are black. Initially,
the mutator threads operate grey in fromspace while the first-object table is computed
along with the tospace addresses. When the mutator threads are flipped over to tospace
they are black. The protection-driven double mapping read barrier prevents the black
mutator threads from acquiring stale fromspace references from grey pages that are still in
the process of being populated with their fromspace copies.
Compressor must also handle other aspects of the tricolour invariant. In particular,
after marking and before the task of determining the first-object table begins, mutators must
allocate all new objects in tospace, to prevent those allocations from interfering with the
relocation map (otherwise, allocating to a hole in fromspace would interfere). Moreover,
these newly allocated objects must eventually have their pointer fields scanned after the
mutators flip to tospace, to redirect any stale fromspace references in those fields over to
tospace, and similarly for global roots. Thus, both newly allocated tospace objects and the
global roots must be protected from access by mutators, with traps on their pages forcing
scanning to redirect their pointers.
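The 'trap forces scanning' mechanism can be prototyped on POSIX with mprotect and a SIGSEGV handler: touching a protected page runs the handler, which performs the deferred work for that page and unprotects it before the faulting access is retried. A toy sketch, not from the book (and note that mprotect is not formally async-signal-safe, something a production collector must engineer around):

    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static long page_size;

    static void on_fault(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        char *page = (char *)((uintptr_t)si->si_addr & ~(uintptr_t)(page_size - 1));
        mprotect(page, page_size, PROT_READ | PROT_WRITE);
        memset(page, '!', page_size);   /* stand-in for copying/forwarding work */
    }   /* returning retries the faulting access, which now succeeds */

    int main(void) {
        page_size = sysconf(_SC_PAGESIZE);
        char *heap = mmap(NULL, 4 * page_size, PROT_NONE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        struct sigaction sa = { .sa_sigaction = on_fault, .sa_flags = SA_SIGINFO };
        sigaction(SIGSEGV, &sa, NULL);
        printf("%c\n", heap[page_size]);   /* traps once, then prints '!' */
        return 0;
    }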
Compressor actually protects and double-maps the entire tospace at the beginning of collection
(to avoid the cost of double mapping each page as it is processed). Similarly,
Compressor moves eight virtual pages per trap (to better amortise the trap overhead on mutator
threads).
One downside of Compressor is that when a mutator traps on access to a protected
tospace page then it must not only copy all of that page's objects, it must also forward all
the pointers in those objects to refer to their relocated (or soon to be relocated) targets. This
can impose significant pauses on the mutator. In a moment, we will discuss the Pauseless
collector, which reduces the amount of incremental work needed to be performed by a
mutator to copying at most one object (without needing to forward any of the stale fromspace
references it contains). Before doing so, let us briefly review the way in which Compressor
drives compaction using page protection, as illustrated in Figure 17.1. The figures show
the logical grouping of virtual pages into distinct categories (the linear address-ordered
layout of the heap is intentionally not represented):
Live: pages containing (mostly) live objects (initially dark grey in the figures)
Condemned: pages containing some live objects, but mostly dead, which are good
candidates for compaction (light grey in the figures, with dark grey live objects)
Free: pages currently free but available for allocation (dashed borders)
New Live: pages in which copied live objects have been allocated but not yet copied
(dashed borders, with dashed space allocated for copies)
Figure 17.1a illustrates the initial state in which live objects have been identified along
with those to be relocated. For ease of later comparison with Pauseless, we take the liberty
here of restricting compaction only to pages sparsely occupied by live objects. In
Compressor, live tospace pages containing stale references that need forwarding, and tospace pages
into which objects are yet to be relocated, must first be protected to prevent the mutators
from accessing them. Concurrently with the mutators, the forwarding information for the
live objects is prepared on the side in auxiliary data structures. At this point, the heap
pages are configured as in Figure 17.1b, and the mutator roots are all flipped over to refer
only to the protected tospace pages. Compaction can now proceed concurrently with
the mutators, which will trap if they try to access an unprocessed tospace page. Trapping
on a live tospace page causes all of the references in that page to be forwarded to refer to
their tospace targets (Figure 17.1c). Trapping on a reserved tospace page causes live objects
to be evacuated from condemned fromspace pages to fill it (Figure 17.1d). Once all the live
objects in a condemned fromspace page have been evacuated, it is completely dead and its
physical page can be unmapped and returned to the operating system, though its virtual
page cannot be recycled until all references to it have been forwarded. Compaction ceases
when all tospace pages have been processed and unprotected (Figure 17.1e). We now
contrast this approach with the Pauseless collector.
CHAPTER 17. CONCURRENTCOPYING& COMPACTION
Figure 17.1: Compressor drives compaction using page protection.
(c) Trapping on a Live page forwards pointers contained in that page to refer to their
tospace targets. Unprotect the Live page once all its stale fromspace references have
been replaced with tospace references.
(d) Trapping on a reserved tospace page evacuates objects from fromspace pages to
fill the page. The fields of these objects are updated to point to tospace. Unprotect the
tospace page and unmap fully-evacuated fromspace pages (releasing their physical
pages).
(e) Compaction is finished when all Live pages have been scanned to forward the
references they contain, and all live objects in condemned pages have been copied into
tospace.
Pauseless
The Pauseless collector [Click et al, 2005; Azul, 2008], and its generational extension C4
[Tene et al, 2011], protects fromspace pages that contain objects being moved, instead of
protecting tospace pages containing moved objects and/or stale pointers. Rather than
needing to protect all of the tospace pages like Compressor, Pauseless protects the much
smaller set of pages whose objects are actually being moved (focusing on sparsely
populated pages that will yield the most space), and these pages can be protected incrementally.
Pauseless uses a read barrier to intercept and repair stale fromspace references before the
mutator can use them, and avoids blocking the mutator to fix up entire pages. The initial
implementation of Pauseless used proprietary hardware to implement the read barrier
directly as a special load-reference instruction, but on stock hardware Pauseless compiles
the necessary logic inline with every load-reference operation by the mutator.
Read(src, i):
    ref ← src[i]
    if protected(ref)
        ref ← GCtrap(ref, &src[i])
    return ref

GCtrap(oldRef, addr):
    newRef ← forward(oldRef)        /* forward/copy as necessary */
    mark(newRef)                    /* mark as necessary */
    loop                            /* will repeat only if CAS fails spuriously */
        if oldRef = CompareAndSwap(addr, oldRef, newRef)
            return newRef           /* CAS succeeded, so we are done */
        if oldRef ≠ *addr
            return newRef           /* another thread updated addr but newRef is ok */
Null references are quite common, so must be filtered explicitly, though the compiler can
often fold this test into the existing null pointer safety checks required by languages like
Java. Stripping the Not-Marked-Through bit in software can be achieved by having the
compiler modify all dereferences to strip it before use, and reusing the stripped reference
where the reuse does not cross a GC-safe point. Alternatively, the operating system can be
modified to multi-map memory or alias address ranges so that the Not-Marked-Through
bit is effectively ignored.
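On stock hardware the barrier and its self-healing store can be sketched in C, keeping the Not-Marked-Through state in a low tag bit of each reference. The bit assignment and all names below are illustrative, not Azul's actual encoding, and the slow path is a stub:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define NMT_BIT ((uintptr_t)1)

    static uintptr_t expected_nmt = NMT_BIT;  /* flipped at each mark phase */

    /* Stub slow path: a real collector would forward and mark; here we
     * only retag the reference with the expected NMT value. */
    static uintptr_t forward_and_mark(uintptr_t ref) {
        return (ref & ~NMT_BIT) | expected_nmt;
    }

    static void *read_barrier(_Atomic uintptr_t *slot) {
        uintptr_t ref = atomic_load(slot);
        if (ref != 0 && (ref & NMT_BIT) != expected_nmt) {
            uintptr_t healed = forward_and_mark(ref);
            /* Self-heal the slot so this reference never traps again; if
             * the CAS fails, another thread already stored a good value. */
            atomic_compare_exchange_strong(slot, &ref, healed);
            ref = healed;
        }
        return (void *)(ref & ~NMT_BIT);      /* strip the tag before use */
    }

    int main(void) {
        static long obj = 42;                      /* stand-in heap object */
        _Atomic uintptr_t slot = (uintptr_t)&obj;  /* stale: NMT bit is 0 */
        long *p = read_barrier(&slot);             /* traps once, heals slot */
        printf("%ld %d\n", *p, (int)(atomic_load(&slot) & NMT_BIT));
        return 0;
    }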
The Pauseless garbage collection phases. The Pauseless collector is divided into three
main phases, each of which is fully parallel and concurrent:
Mark is responsible for periodically refreshing the mark bits. In the process of doing
that it will set the Not-Marked-Through bit for all references to the desired value
and gather liveness statistics for each page. The marker starts from the roots (static
global variables and mutator stacks) and begins marking reachable objects. The Not-
Marked-Through bit assists in making the mark phase fully concurrent, as described
further below.
Relocate uses the most recently available mark bits to find sparse pages with little live
data, evacuating their live objects and freeing the underlying memory.
A relocate phase runs continuously, freeing memory to keep pace with mutator
allocation. It runs standalone or concurrently with the next mark phase.
Remap updates every pointer in the heap whose target has been relocated.
Collector threads traverse the object graph executing a read barrier against every
reference in the heap, forwarding stale references as if a mutator had trapped on the
reference. At the end of this phase no live heap reference can refer to pages protected
by the previous relocate phase, so virtual memory for those pages is freed.
objects, or stop the mutators in a final mark step to ensure termination). Collector threads
will compete with mutator threads for CPU time, though any spare CPU can be employed
by the collector.
Secondly, the phases incorporate a 'self-healing' effect, where mutators immediately
correct the cause of each read barrier trap by replacing any trapping reference in the slot
from which it was loaded with its updated reference that will not trigger another trap. The
work involved depends on the type of the trap. Once the mutators' working sets have
been repaired they can execute at full speed without any further traps. This results in a
drop in mutator utilisation for a short period (a 'trap storm') following a phase shift, with
the minimum mutator utilisation penalty of approximately 20 milliseconds spread over a
few hundred milliseconds. But Pauseless has no stop-the-world pauses where all threads
must be stopped simultaneously.
Mark. The mark phase manipulates mark bits managed on the side. It begins by clearing
the current cycle's mark bits. Each object has two mark bits, one for the current cycle and
one for the previous cycle. The mark phase then marks all global references, scans each
mutator thread's root set, and flips the per-thread expected Not-Marked-Through value.
Running threads cooperate by marking their own root set at a checkpoint. Blocked (or
stalled) threads are marked in parallel by mark phase collector threads. Each mutator
thread can immediately proceed once its root set has been marked (and expected Not-
Marked-Through flipped) but the mark phase cannot proceed until all threads have passed
the checkpoint.
After the root sets have been marked, marking proceeds in parallel and concurrently
with the mutators in the style of Flood et al [2001]. The markers ignore the Not-Marked-
Through bit, which is used only by the mutators. This continues until all live objects have
been marked. New objects are allocated in live pages. Because mutators can hold (and
thus store) only marked-through references, the initial state of the mark bit for new objects
does not matter for marking.
The Not-Marked-Through bit is crucial to completion of the mark phase in a single pass
over the live objects, regardless of stores by the mutator, because the read barrier prevents
mutators from acquiring unmarked references. A mutator that loads a reference with the
wrong flavour of Not-Marked-Through bit will take a Not-Marked-Through-trap which
will communicate the reference to the marker threads. Because it can never acquire an
unmarked reference, a mutator can never store and propagate an unmarked reference. The
Not-Marked-Through-trap also stores the corrected (marked) reference back to memory, so
that particular reference can never cause a trap in the future. This self-healing effect means
that a phase-change will not make the mutators wait until the marker threads can flip the
Not-Marked-Through bits in the objects on which the mutator is working. Instead, each
mutator flips each reference it encounters as it runs, so steady-state Not-Marked-Through
traps are rare.
Relocate. The relocate phase starts by finding sparsely occupied pages. Figure 17.2a
shows a logical grouping of virtual pages into distinct categories (again, the linear address-
ordered layout of the heap is intentionally not illustrated). There are references from both
the mutator roots and live pages into sparse pages whose live objects are to be compacted
by evacuation. The relocate phase first builds side arrays to hold forwarding pointers for
the objects to be relocated. These cannot be held in the fromspace originals because the
physical storage for the fromspace pages will be reclaimed immediately after copying and
long before all the fromspace references have been forwarded. The side array of
forwarding data is not large because only sparse pages are relocated, so it can be implemented
easily as a hash table. The relocate phase then protects the mostly dead condemned pages
from access by the mutators as in Figure 17.2b. Objects in these pages are now considered
stale, and can no longer be modified. Also, if a mutator loads a reference that points into a
protected page, it will trigger a GC-trap via the read barrier.
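A side array of forwarding data keyed by fromspace address can be as simple as an open-addressing hash table, as in this illustrative C sketch (fixed capacity, no resizing or concurrency control, all names invented):

    #include <stddef.h>
    #include <stdint.h>

    enum { FWD_CAP = 1 << 16 };             /* power of two; fixed for the sketch */
    static uintptr_t fwd_key[FWD_CAP], fwd_val[FWD_CAP];

    static size_t slot_of(uintptr_t from) {
        uint64_t h = (uint64_t)(from >> 3) * 0x9E3779B97F4A7C15ull;
        return (size_t)(h & (FWD_CAP - 1));
    }

    /* record that the object at 'from' will move to 'to' */
    void fwd_put(uintptr_t from, uintptr_t to) {
        size_t s = slot_of(from);
        while (fwd_key[s] != 0)             /* linear probing; assumes space */
            s = (s + 1) & (FWD_CAP - 1);
        fwd_key[s] = from;
        fwd_val[s] = to;
    }

    /* return the tospace address, or 0 if 'from' is not being relocated */
    uintptr_t fwd_get(uintptr_t from) {
        for (size_t s = slot_of(from); fwd_key[s] != 0; s = (s + 1) & (FWD_CAP - 1))
            if (fwd_key[s] == from)
                return fwd_val[s];
        return 0;
    }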
Figure 17.2: The Pauseless relocate phase.
(c) Flip mutator roots to tospace, copying their targets, but leaving the references they
contain pointing to fromspace. Mutators accessing an object on a protected fromspace
page will trap and wait until the object is copied.
(d) Mutators loading a reference to a protected page will now trigger a GC-trap via
the read barrier, copying their targets.
(e) Compaction is finished when all live objects in condemned pages have been copied
into tospace, and all tospace pages have been scanned to forward the references they
contain.
As in the mark phase, the read barrier in the relocate phase prevents the mutator from
loading a stale reference. The self-healing GC-trap handler forwards the reference and
updates the memory location using CompareAndSwap. If the fromspace object has not
yet been copied then the mutator will copy the object on behalf of the collector. This is
illustrated in Figure 17.2d. The mutator can read the GC-protected page because the GC-
trap handler runs in the elevated GC-protection mode. Once the live objects have been
evacuated and the stale references into the condemned pages forwarded, their virtual
memory can be recycled (Figure 17.2e), along with the side array of forwarding data.
Finalisation and weak references. Java's soft and weak references (see Section 12.1) lead
to a race between the collector nulling a reference and the mutator strengthening it.
Fortunately, processing the soft and weak references concurrently with the mutator is possible
with Pauseless by having the collector CompareAndSwap the reference down to null only
when it remains not marked-through. The Not-Marked-Through-trap handler already has
the proper CompareAndSwap behaviour, allowing both the mutator and the collector to
race to CompareAndSwap. If the mutator wins then the reference is strengthened (and
the collector will know), while if the collector wins then the reference is nulled (and the
mutator sees only the null).
To address these shortcomings, Pauseless benefits from operating system extensions that
support remapping without translation lookaside buffer invalidation (these can be applied
in bulk at the end of a large set of remaps as necessary), remapping of large (typically
two megabyte) page mappings, and multiple concurrent remaps within the same process.
These operating system improvements result in approximately three orders of magnitude
speedup compared to a stock operating system, scaling almost linearly as the number of
active threads doubles.
Summing up, Pauseless is designed as a fully parallel and concurrent collector for large
multiprocessor systems. It requires no stop-the-world pauses and dead objects can be
reclaimed at any point during a collector cycle. There are no phases where the collector
must race to finish before the mutators run out of free memory. Mutators can perceive a
period of reduced utilisation during trap storms at some phase shifts, but the self-healing
property of these traps serves to recover utilisation quickly.
This chapter has laid out the basic principles of concurrent copying collection and
concurrent compaction to reduce fragmentation, while also avoiding long pauses. As in any
concurrent collector algorithm, the collector must be protected against mutations that can
otherwise cause lost objects. But because the collector is moving objects, the mutator must
also be kept consistent with the copies: some collectors redirect the mutator to tospace
copies as they are created, but otherwise allow it to continue operating in fromspace
[Brooks, 1984]. Still others permit continued operation in fromspace, so long as updates
eventually propagate to tospace [Nettles et al, 1992; Nettles and O'Toole, 1993]. Once
copying has finished all the mutators flip to tospace in a single step. Dispensing with this
global transition can mean accumulating chains of multiple versions, which mutators must
traverse to find the most up-to-date copy [Herlihy and Moss, 1992]. Alternatively, by
performing updates on both copies, mutators can be transitioned one at a time [Hudson and
Moss, 2001, 2003]. Compaction can be performed in similar ways but without the need
to copy all objects at every collection [Kermany and Petrank, 2006; Click et al, 2005; Azul,
2008].
These approaches may result in longer pauses than non-moving concurrent collection:
on any given heap access the mutator may need to wait for an object (or objects) to be
moved.
We discussed reference counting in Chapter 5. The two chief issues facing naive
reference counting were its inability to collect garbage cycles and the high cost of manipulating
reference counts, particularly in the face of races between different mutator threads. The
solution to cyclic garbage was trial deletion (partial tracing). We used deferred reference
counting to avoid having mutators manipulate reference counts on local variables, and
coalescing to avoid having to make 'redundant' changes to reference counts that would be
cancelled out by later mutations; a useful side-effect of coalescing is that it tolerates
mutator races.1 All three solutions required stopping the world while the collector reconciled
reference counts and reclaimed any garbage. In this chapter, we relax this requirement and
consider the changes that need to be made in order to allow a reference counting collector
thread to run concurrently with mutator threads.
1Note that we are not concerned about the correctness of the user program in the face of races, but we must
ensure the consistency of the heap.
Figure 18.1: Reference counting must synchronise the manipulation of counts with pointer
updates: two racing threads perform o[i]←x and o[i]←y, each followed by
deleteReference(old).
Read(src, i):
    lock(src)
    tgt ← src[i]
    addReference(tgt)
    unlock(src)
    return tgt

Write(src, i, ref):
    addReference(ref)
    lock(src)
    old ← src[i]
    src[i] ← ref
    deleteReference(old)
    unlock(src)
The simplest way to do this is to lock the object containing the field that is being read or
written, src, as illustrated in Algorithm 18.1. This is safe. After Read has locked src, the
value of field i cannot change. If it is null, Read is trivially correct. Otherwise, src holds
a reference to some object tgt. The reference counting invariant ensures that tgt's
reference count cannot drop to zero before src is unlocked, since there is a reference to tgt
from src. Thus, we can guarantee that tgt cannot be freed during the Read and that
addReference will be able to update the count rather than potentially corrupting memory.
A similar argument establishes the safety of Write.
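Rendered directly in C with a per-object mutex, the barriers of Algorithm 18.1 look like the sketch below. The object layout and the shortcut taken by deleteReference are simplifications for illustration:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdlib.h>

    typedef struct Object Object;
    struct Object {
        pthread_mutex_t lock;
        atomic_long rc;
        Object *fields[8];
    };

    static void addReference(Object *o) {
        if (o) atomic_fetch_add(&o->rc, 1);
    }

    static void deleteReference(Object *o) {
        /* a real collector would also deleteReference o's children */
        if (o && atomic_fetch_sub(&o->rc, 1) == 1)
            free(o);
    }

    Object *rc_read(Object *src, int i) {
        pthread_mutex_lock(&src->lock);
        Object *tgt = src->fields[i];
        addReference(tgt);   /* safe: src still references tgt, so rc(tgt) > 0 */
        pthread_mutex_unlock(&src->lock);
        return tgt;
    }

    void rc_write(Object *src, int i, Object *ref) {
        addReference(ref);   /* ref is pinned by the caller */
        pthread_mutex_lock(&src->lock);
        Object *old = src->fields[i];
        src->fields[i] = ref;
        deleteReference(old);
        pthread_mutex_unlock(&src->lock);
    }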
It is appealing to hope that we can find a lock-free solution, using commonly
available
primitive operations. Unfortunately, single memory location primitives are
insufficient to
guarantee safety. The problem does not lie in Write. Imagine that, instead of the
coarser grain lock, we use atomic increments and decrements to update reference counts
and CompareAndSwap for the pointer write, as in Algorithm 18.2. If ref is non-null,
then the writing thread holds a reference to it so ref cannot be reclaimed until Write
returns (whether we use eager or deferred reference counting). Write spins, attempting
to set the pointer field until it is successful: at that point, we know that we will next
decrement the count of the correct old object and that only the winning thread
will do this. Note that the reference count of this old target remains an overestimate until
deleteReference(old) is called, and so old cannot be prematurely deleted.
We cannot apply the same tactic in Read, though. Even if Read uses a primitive atomic
operation to update the reference count, unless we lock src it is possible that some other
thread will delete the only reference to tgt and reclaim it in the window between Read's
loading of the pointer and its incrementing of the count.
Write(src, i, ref):
    if ref ≠ null
        AtomicIncrement(&rc(ref))       /* ref guaranteed to be non-free */
    loop
        old ← src[i]
        if CompareAndSet(&src[i], old, ref)
            deleteReference(old)
            return

Read(src, i):
    tgt ← src[i]
    AtomicIncrement(&rc(tgt))           /* oops! */
    return tgt

Read(src, i, root):
    loop
        tgt ← src[i]
        if tgt = null
            return null
        rc ← rc(tgt)
        if CompareAndSet2(&src[i], &rc(tgt), tgt, rc, tgt, rc+1)
            return tgt
Deferred reference counting avoids the problem we saw in the previous section by not
applying reference count operations to local variables and by deferring reclamation of
objects with zero reference counts (see Section 5.3). This leaves the question of how to
reduce the overhead of pointer writes to object fields. We now turn to look at buffered
reference counting techniques that use only simple loads and stores in the mutator write
barrier, yet support multithreaded applications.
In order to avoid the cost of synchronising reference count manipulations by different
mutator threads, DeTreville [1990] had mutators log the old and new referents of each
pointer update to a buffer (in a hybrid collector for Modula-2+ that used mark-sweep as
an occasional backup collector to handle cycles). A single, separate reference counting
thread processed the log and adjusted objects' reference counts, thereby ensuring that the
modifications were trivially atomic. In order to prevent inadvertently applying a reference
count decrement before an increment that causally preceded it (and hence prematurely
reclaiming an object), increments were applied before decrements. Unfortunately, buffering
updates does not resolve the problem of coordinating the reference count manipulations
with the pointer write. DeTreville offered two solutions, neither of which is entirely
satisfactory.
His first approach was, as above, to
protect the entire Write operation with a lock.
This ensures that records are correctly appended to the shared buffer as well as
synchronising
the updates. To avoid the cost of making every write atomic, his second solution
provided each mutator thread with its own buffer, which was periodically passed to the
reference counting thread, but this required the programmer to take care to ensure that
pointer writes were performed atomically, if necessary performing the locking manually,
to avoid the problems illustrated by Figure 18.1.
Bacon and Rajan [2001] also provided each thread with a local buffer but required the
update of the pointer field to be atomic, as for example in Algorithm 18.4; a CompareAnd-
Swap with retry could be used to do this. The mutator write barrier on a processor adds
the old and new values of slot i to its local myUpdates buffer (line 9). Once again,
reference counting of local variables is deferred, and time is divided into ragged epochs to ensure
that objects are not prematurely deleted, by using a single shared epoch number plus per-
thread local epoch numbers. Periodically, just as with deferred reference counting, a
processor will interrupt a thread and scan all the processor's local stacks, logging references
found to a local myStackBuffer. The processor then transfers its myStackBuffer and
myUpdates to the collector, and updates its local epoch number, e. Finally, it schedules
the collection thread of the next processor before resuming the interrupted thread.
The collector thread runs on the last processor. In each collection cycle k, the collector
applies the increments of epoch k and the decrements of epoch k - 1. Finally it
increments the global epoch counter (for simplicity, we assume an unbounded number of
global updatesBuffers in Algorithm 18.4). The advantage of this technique is that it is
never necessary to halt all mutators simultaneously: the collector is on-the-fly. Note how
the collector uses a variant of deferred reference counting. At the start of the collection the
counts of objects directly referenced from thread stacks (in this epoch) are incremented; at
the end of the cycle, the reference counts of those directly reachable from the stacks in the
previous epoch are decremented.
Algorithm 18.4: Concurrent buffered reference counting

 1 shared epoch
 2 shared updatesBuffer[]            /* one buffer per epoch */
 3
 4 Write(src, i, ref):
 5     if src = Roots
 6         src[i] ← ref
 7     else
 8         old ← AtomicExchange(&src[i], ref)
 9         log(old, ref)
10
11 log(old, new):
12     myUpdates ← myUpdates + [(old, new)]
13
14 collect():
15     /* each processor passes its buffers on to a global updatesBuffer */
16     myStackBuffer ← []
17     for each local ref in myStacks  /* deferred reference counting */
18         myStackBuffer ← myStackBuffer + [(ref, ref)]
19     atomic
20         updatesBuffer[e] ← updatesBuffer[e] + myStackBuffer
21     atomic
22         updatesBuffer[e] ← updatesBuffer[e] + myUpdates
23     myUpdates ← []
24     e ← e + 1
25
26     me ← myProcessorId
27     if me < MAX_PROCESSORS
28         schedule(collect, me+1)     /* schedule collect() on the next processor */
29     else
30         /* the last processor updates the reference counts */
31         for each (old, new) in updatesBuffer[epoch]
32             addReference(new)
33         for each (old, new) in updatesBuffer[epoch-1]
34             deleteReference(old)
35         release(updatesBuffer[epoch-1])   /* free the old buffer */
36         epoch ← epoch + 1
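The heart of Algorithm 18.4 is that the pointer update itself is a single atomic exchange, so the (old, new) pair logged to the unsynchronised per-thread buffer is exact even under races. A C sketch of just that write barrier, with the hand-off to the collector reduced to a stub:

    #include <stdatomic.h>
    #include <stddef.h>

    typedef _Atomic(void *) AtomicRef;
    typedef struct { void *old_ref, *new_ref; } LogEntry;

    enum { LOG_CAP = 4096 };
    static _Thread_local LogEntry myUpdates[LOG_CAP];
    static _Thread_local size_t myUpdatesLen;

    /* Stub: the collector applies all of an epoch's increments before its
     * decrements, then frees the buffer. */
    static void transfer_to_collector(LogEntry *log, size_t n) { (void)log; (void)n; }

    void rc_write(AtomicRef *slot, void *ref) {
        void *old = atomic_exchange(slot, ref);   /* the only synchronised step */
        myUpdates[myUpdatesLen++] = (LogEntry){ old, ref };
        if (myUpdatesLen == LOG_CAP) {            /* buffer full: pass it on */
            transfer_to_collector(myUpdates, myUpdatesLen);
            myUpdatesLen = 0;
        }
    }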
Figure 18.2: Finding an object's old state in the collector's log and in some thread's
current local log.
The collector simply used the replica to find and decrement the reference counts of the old
targets and the current version of the object to find and increment the new targets. All
dirty objects were cleaned.
Let us see first how we can allow the reference counting thread to run concurrently with
the mutators (after a brief pause to transfer buffers), and then consider how to make that
concurrent algorithm on-the-fly. In the first case, all the mutator threads can be stopped
temporarily while their buffers are transferred to the collector. However, once all the
mutator threads have transferred their buffers, they can be restarted. The collector's task is
to modify the reference counts of the old and new children of every modified object.
Reference decrements can be handled as before, using the replicas in the logs, but handling
increments is more involved (Algorithm 18.5). The task is to increment the reference counts
of the children of each object in the collector's log, using the state of the object at the time
that the log was transferred. There are two cases to consider, since the logged object may
have been modified since the logs were transferred to the collector.
        if child ≠ null
            rc(child) ← rc(child) + 1
            mark(child)        /* if tracing young generation */
If the object remains clean, its state has not changed, so the reference counts of its
current children are incremented. Note that incrementNew in Algorithm 18.5 must check
again after making a replica of a clean object in case it was dirtied while the copy was
being taken.
If the object has been modified since the logs were transferred, then it will have been
re-marked dirty and its state at the time of the transfer can be found in a fresh log buffer
of some mutator. The object's dirty pointer will now refer to this log, which can be read
without synchronising with that thread. Consider the example in Figure 18.2. A has been
modified again in this epoch, which complicates finding C, the target of the last update to
A in the previous epoch. As A is dirty, its previous contents will be held in some thread's
current local log (shown on the right of the figure): the log refers to C. Thus, we can
decrement the reference count of B and increment the reference count of C. In the next
epoch, C's reference count will be decremented to reflect the action Write(A, 0, D).
by Doligez and Gonthier [1994]. We consider later what modifications need to be made to
the algorithm to support weaker consistency models. Sliding views can be used in
several contexts: for plain reference counting [Levanoni and Petrank, 1999, 2001, 2006], for
managing the old generation of generational [Azatchi and Petrank, 2003] and age-oriented
[Paz et al, 2003, 2005b] collectors, and for integration with cyclic reference counting
collectors [Paz et al, 2005a, 2007]. Here, we consider how sliding views can be used in an
age-oriented collector and then extend it to reclaim cyclic structures.
Age-oriented collection
Age-oriented collectors partition the heap into young and old generations. Unlike
traditional generational collectors, both generations are collected at the same time: there are no
nursery collections and inter-generational pointers do not need to be trapped. Appropriate
policies and techniques are chosen for the management of each generation. Since the weak
generational hypothesis expects most objects to die young, and young objects are likely
to have high mutation rates (for example, as they are initialised), a young generation
benefits from a collector tuned to low survivor rates. In contrast, the old generation can be
managed by a collector tuned to lower death and mutation rates. Paz et al [2003] adopt
a mark-sweep collector for the young generation (since it need not trace large volumes of
dead objects) and a sliding views reference counting collector for the old generation (as it
can handle huge live heaps). Their age-oriented collector does not move objects: instead,
each object has a bit in its header denoting its generation.
The algorithm
On-the-fly collection starts by gathering a sliding view (Algorithm 18.6). Incremental
collection of a sliding view requires careful treatment of modifications made while the view
is being gathered. Pointer writes are protected by adding an incremental update write
barrier called snooping to the Write operation of Algorithm 5.3 (see Algorithm 18.7). This
barrier prevents missing a referent o whose only reference is removed from a slot s1 before
the sliding view reads s1, and is then written to another slot s2 after s1 is added to the view.
At the start of a cycle, each thread's snoopFlag is raised (without synchronisation).
While the sliding view is being collected (and the snoopFlag is up for this thread), the
new referent of any modified object is recorded in the thread's local mySnoopedBuffer
(line 25 of Algorithm 18.7). In terms of the tricolour abstraction, this Dijkstra-style barrier
marks ref black. Objects are allocated grey in the young generation (Algorithm 18.8) in
order to avoid activating the write barrier when their slots are initialised.
After the collector has raised the snoopFlag for each mutator thread, it executes the
first handshake. The handshake stops each thread, one at a time, and transfers its local log
and young set to the collector's updates buffer.
Next, all modified and young objects are cleaned. This risks a race. As cleaning is
performed while mutator threads are running, it may erase the dirty state of objects modified
concurrently with cleaning. A second handshake therefore pauses each thread, again on-
the-fly, and scans its local log to identify objects modified during cleaning; the dirty state
of these objects is then restored.
 1 shared updates
 2 shared snoopFlag[MAX_PROCESSORS]    /* one per processor */
 3
 4 collect():
 5     collectSlidingView()
 6     on-the-fly handshake 4:
 7         for each thread t
 8             suspend(t)
 9             scanStack(t)
10             snoopFlag[t] ← false
11             resume(t)
12     processReferenceCounts()
13     markNursery()
14     sweepNursery()
15     sweepZCT()
16     collectCycles()
17
18 collectSlidingView():
19     on-the-fly handshake 1:
20         for each thread t
21             suspend(t)
22             snoopFlag[t] ← true

37 processReferenceCounts():
38     for each obj in updates
39         decrementOld(obj)
40         incrementNew(obj)
41
42 collectCycles():
43     markCandidates()
44     markLiveBlack()
45     scan()
46     collectWhite()
47     processBuffers()
 1 Write(src, i, ref):
 2     if src = Roots
 3         src[i] ← ref
 4     else
 5         if not dirty(src)
 6             log(src)
 7         src[i] ← ref
 8         snoop(ref)                  /* for sliding view */
 9
10 log(ref):
11     for each fld in Pointers(ref)
12         if *fld ≠ null
13             add(logs[me], *fld)
14     if not dirty(ref)
15         /* commit the entry if ref is still clean */
16         entry ← add(logs[me], ref)
17         logPointer(ref) ← entry

23 snoop(ref):
24     if snoopFlag[me] && ref ≠ null
25         mySnoopedBuffer ← mySnoopedBuffer + [ref]    /* mark grey */
Once the old generation has been processed and all inter-generational references have
been discovered, the young generation is traced (markNursery), marking objects with the
Figure 18.3: Sliding views allow a fixed snapshot of the graph to be traced by
using the values stored in the log. Here, the shaded objects indicate the state of
the graph at the time that the pointer from X to Y was overwritten to refer to
Z. The old version of the graph can be traced by using the value of X's field
stored in the log.
The cycle collector will try to avoid considering objects that might be live, including root
referents, snooped objects and objects modified after the sliding view was collected. An
additional markBlack phase pre-processes these objects, marking them and their sliding
view descendants black. This raises a dilemma. The set of objects known to be live
(actually, a subset of the dirty objects) is not fixed during the collection, so it is not possible to
identify how many reference count modifications the collector might have made to an
object before it became dirty. Hence, it is not possible to restore its original count. Instead,
cycle detection operates on a second, cyclic, reference count. The alternative, to consider
these objects regardless, would lead to more objects being processed by the reference
counter.
Memory consistency
The sliding views algorithms presented above assume sequential consistency, which
modern processors do not always guarantee. On the mutator side, it is important that the
order of operations in Write is preserved to ensure that (i) the values seen in the log are the
correct ones (that is, that they represent a snapshot of the modified object as it was before
the collection cycle started); (ii) the collector reads only completed log entries; and (iii)
object fields cannot be updated after a collection starts without being snooped. The
handshakes used by the algorithm solve some of these dependency issues on weakly consistent
platforms.
Finally we note that there is a large literature on safe reclamation of memory when
using dynamic memory structures, from the ABA-prevention tags used in IBM's System
370 onwards. Other lock-free reference counting methods that require multi-word atomic
primitives include Michael and Scott [1995] and Herlihy et al [2002]. Techniques that use
timestamps to delay releasing an object until it is safe to do so are scheduler-dependent
and tend to be vulnerable to the delay or failure of a single thread. For example, the Read-
Copy-Update method [McKenney and Slingwine, 1998], used in the Linux kernel, delays
reclamation of an object until all threads that have accessed it reach a 'quiescence' point.
Other mechanisms that use immediate (rather than deferred) reference counting require a
particular programming style, for example hazard pointers [Michael, 2004] or announcement
schemes [Sundell, 2005].
Chapter 19
Real-time garbage collection
The concurrent and incremental garbage collection algorithms of the preceding chapters
strive to reduce the pause times perceived by the mutator, by interleaving small increments
of collector work on the same processor as the mutator or by running collector work at the
same time on another processor. Many of these algorithms were developed with the goal
of supporting applications where long pauses result in the application providing degraded
service quality (such as jumpy movement of a mouse cursor in a graphical user interface).
Thus, early incremental and concurrent collectors were often called 'real-time' collectors,
but they were real-time only under certain strict conditions (such as restricting the size of
objects). However, as real-time systems are now understood, none of the previous
algorithms live up to the promise of supporting true real-time behaviour because they cannot
provide strong progress guarantees to the mutator. When the mutator must take a lock
(within a read or write barrier or during allocation) its progress can no longer be
guaranteed. Worse, preemptive thread scheduling may result in the mutator being
descheduled arbitrarily in favour of concurrent collector threads. True real-time collection (RTGC)
must account precisely for all interruptions to mutator progress, while ensuring that space
bounds are not exceeded. Fortunately, there has been much recent progress in real-time
garbage collection that extends the advantages of automatic memory management to
real-time systems.
Figure 19.1: An unpredictable schedule of pauses, typical of non-real-time
collectors. Collector pauses in grey.
Hard real-time systems consider missed deadlines to mean failure of the system. A correct
hard real-time system must guarantee that all real-time constraints will be satisfied. In the
face of such timing constraints, it is important to be able to characterise the responsiveness
of garbage collection in real-time systems in ways that reflect both the needs of the
application (hard or soft real-time) and the behaviour of the garbage collector [Printezis, 2006].
Overall performance or throughput in real-time systems is less important than
predictability of performance. The timing behaviour of a real-time task should be able to be
determined analytically by design, or empirically during testing, so that its response-time
when deployed in the field can be known ahead of time (to some acceptable degree of
confidence). The worst-case execution time (WCET) of a task is the maximum length of time
the task could take to execute in isolation (that is, ignoring re-scheduling) on a particular
hardware platform. Multitasking real-time systems must schedule tasks so that their
real-time constraints are met. Knowing that these constraints will be met at run time involves
performing schedulability analysis ahead-of-time, assuming a particular (usually priority-
based) run-time scheduling algorithm.
Real-time applications are often deployed to run as embedded systems dedicated to
a specific purpose, such as the example above of a control system for engine timing.
Single-chip processors predominate in embedded systems, so incremental garbage
collection techniques translate naturally to embedded settings, but with multicore embedded
processors becoming increasingly common, techniques for concurrent and parallel
collection also apply. Moreover, embedded systems often impose tighter space constraints than
general-purpose platforms.
For all of these reasons, stop-the-world, parallel, or even concurrent garbage collectors
that impose unpredictable pause times are not suited to real-time applications. Consider
the collector schedule illustrated in Figure 19.1, which results when the effort required to
reclaim memory depends on the total amount and size of objects that the application uses,
the interconnections among those objects, and the level of effort required to free enough
memory to satisfy future allocations. Given this schedule, the mutator cannot rely on
predictable access to the processor.
When and how to trigger collector work is the main factor affecting the impact of the
collector on the mutator. Stop-the-world collectors defer all collector work until some allocation
attempt detects that space is exhausted and triggers the collector. An incremental
collector will piggyback some amount of collector work on each heap access (using read/write
barriers) and allocation. A concurrent collector will trigger some amount of collector work
to be performed concurrently (possibly in parallel) with the mutator, but imposes
mutator barriers to keep the collector synchronised with the mutator. To maintain steady-state
space consumption, the collector must free and recycle dead objects at the same rate
(measured by space allocated) as the mutator creates new objects. Fragmentation can lead to
space being wasted so that in the worst case an allocation request cannot be satisfied
unless the heap is compacted.
The classic Baker [1978] incremental semispace copying collector is one of the earliest
attempts at real-time garbage collection. It uses a precise model for analysing real-time
behaviour, founded on the limiting assumption that objects (in this case Lisp cons cells)
have a fixed size. Recall that Baker's read barrier prevents the mutator from accessing
fromspace objects, by making the mutator copy any fromspace object it encounters into
tospace.
mutator utilisation can also be used to drive time-based scheduling of real-time garbage
collection by making minimum mutator utilisation an input constraint to the collector.
Still, Blelloch and Cheng offer useful insights into the way in which pause times can be
tightly bounded, while also bounding space, so we consider their design in detail here.
having atomic TestAndSet and FetchAndAdd instructions for synchronisation.
Application model. The application model assumes the usual mutator operations Read
and Write, and New(n) which allocates a new object with n fields and returns a pointer
to the first field; the object also includes a header word for use by the memory manager. In
addition, Blelloch and Cheng require that on each processor every New(n) is followed by n
invocations of InitSlot(v) to initialise each of the n fields of the last allocated object of
the processor with v, starting at slot 0. A processor must complete all n invocations of
InitSlot before it uses the new object or executes another New, though any number of
other operations including Read and Write can be interleaved with the InitSlots.
Furthermore, the idealised application model assumes that Write operations are atomic (no
two processors can overlap execution of a Write). The memory manager further uses a
function isPointer(p, i) to determine whether the ith field of the object referenced by
p is a pointer, a fact often determined statically by the type of the object, or its class in an
object-oriented language.
The header word of a grey replica r records how many fields remain to be copied
(copyCount(r) = n). When the object turns black (is fully copied) then the header of the
replica will be zero (copyCount(r) = 0).
The heap is configured into two semispaces as shown in Figure 19.2. Fromspace is
bounded by the variables fromBot and fromTop, which are private to each thread. The
collector maintains an explicit copy stack in the top part of tospace holding pointers to the
Figure 19.2: Heap structure in the Blelloch and Cheng work-based collector
grey objects. As noted in Section 14.6, Blelloch and Cheng [1999] offer several arguments
that this explicit copy stack allows better control over locality and synchronisation than
Cheney queues in sharing the work of copying among concurrent collector threads. The
area between toBot and free holds all replicas and newly allocated objects. The area
between sharedStack and toTop holds the copy stack (growing down from toTop to
sharedStack). When free = sharedStack the collector has run out of memory. If the
collector is off when this happens then it is turned on. Otherwise an out of memory error
is reported. The variables toBot and toTop are also private to each thread, whereas free
and sharedStack are shared.
The code for copying a slot from a primary object to its replica is shown in
Algorithm 19.1, where copyOneSlot takes the address of the grey primary object p as its
argument, copies the slot specified by the current count stored in the replica, shades the object
pointed to by that slot (by calling makeGrey), and stores the decremented count. Finally,
the primary object p is still grey if it has fields that still need to be copied, in which case
it is pushed back onto the local copy stack (the operations on the local stack are defined
earlier in Algorithm 14.8).
The makeGrey function turns an object grey if it is white (has no replica allocated for
it) and returns the pointer to the replica. The atomic TestAndSet is used to check whether
the object already has a replica.
Algorithm 19.2 shows the code for the mutator operations when the collector is on. The
New operation allocates space for the primary and replica copies using allocate, and
sets some private variables that parametrise the behaviour of InitSlot, saying where
it should write initial values. The variable lastA tracks the address of the last allocated
object, lastL notes its length, and lastC holds the count of how many of its slots have
already been initialised. The InitSlot function stores the value of the next slot to be
initialised in both the primary and replica copies and increments lastC. These
initialising stores shade any pointers that are stored, to preserve the strong tricolour invariant
that black objects cannot point to white objects. The statement collect(k) incrementally
Algorithm 19.1: Copying in the Blelloch and Cheng work-based collector

shared gcOn ← false
...
    copyCount(r) ← i          /* unlock object with decremented index */
    if i > 0
        localPush(p)          /* push back on local stack */
...
allocate(n):
    ref ← FetchAndAdd(&free, n)
    if ref + n > sharedStack  /* is tospace exhausted? */
        if gcOn
            error "Out of memory"
        interrupt(collectorOn)  /* interrupt mutators to start next collection */
        return allocate(n)      /* try again */
    return ref
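The allocate routine is just an atomic bump of the shared free pointer. A C sketch follows, using indices rather than raw pointers (C11's atomic_fetch_add is defined for integer types), with the out-of-memory path reduced to a stub; the sizes and names are illustrative:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    enum { HEAP_WORDS = 1 << 20 };
    static uintptr_t heap[HEAP_WORDS];
    static _Atomic size_t free_idx;                        /* 'free', as an index */
    static _Atomic size_t shared_stack_idx = HEAP_WORDS;   /* copy stack grows down */

    static void out_of_memory(void) {
        fprintf(stderr, "heap full\n");   /* real code starts a GC cycle, retries */
        _Exit(1);
    }

    uintptr_t *allocate(size_t n) {
        /* FetchAndAdd claims n words without a lock, even under contention */
        size_t ref = atomic_fetch_add(&free_idx, n);
        if (ref + n > atomic_load(&shared_stack_idx))
            out_of_memory();
        return &heap[ref];
    }

    int main(void) {
        uintptr_t *a = allocate(4), *b = allocate(4);
        printf("%td\n", b - a);           /* prints 4: disjoint allocations */
        return 0;
    }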
copies k words for every word allocated. By design, the algorithm allows a collection
cycle to start while an object is only partially initialised (that is, when a processor has
lastC ≠ lastL).
The Write operation first shades any overwritten (deleted) pointer grey (to preserve
snapshot reachability), and then writes the new value into the corresponding slot of both
the primary and the replica (if it exists). When writing to a grey object it is possible that
the designated copier is also copying the same slot. This copy-write race can lead to a
lost update, if the mutator writes to the replica after the copier has read the slot from
New(n):
    p ← allocate(n)                  /* allocate primary */
    r ← allocate(n)                  /* allocate replica */
    forwardingAddress(p) ← r         /* primary forwards to replica */
    copyCount(r) ← 0                 /* replica has no slots to copy */
    lastA ← p                        /* set last allocated */
    lastC ← 0                        /* set count */
    lastL ← n                        /* set length */
    return p

atomic Write(p, i, v):
    if isPointer(p, i)
        makeGrey(p[i])               /* grey old value */
    p[i] ← v                         /* write new value into primary */
    if forwardingAddress(p) ≠ 0      /* check if object is forwarded */
        while forwardingAddress(p) = 1
            /* do nothing: wait for forwarding address */
        r ← forwardingAddress(p)     /* get pointer to replica */
        while copyCount(r) = -(i+1)
            /* do nothing: wait while slot concurrently being copied */
        if isPointer(p, i)
            v ← makeGrey(v)          /* grey new value */
        r[i] ← v                     /* update replica */
    collect(k)                       /* execute k copy steps */
the primary but before it has finished copying the slot to the replica. Thus, the Write
operation waits for the copier, both to allocate the replica and to finish copying the slot. It
is not a problem for the mutator to write to the primary before the copier locks the slot,
since the copier will then copy that value to the replica. The while statements that force
the mutator to wait are both time-bounded, the first by the time it takes for the copier to
allocate the replica and the second by the time it takes for the copier to copy the slot.
 1 collect(k):
 2     enterRoom()
 3     for i ← 0 to k-1
 4         if isLocalStackEmpty()      /* local stack empty */
 5             sharedPop()             /* move work from shared stack to local */
 6             if isLocalStackEmpty()  /* local stack still empty */
 7                 break               /* no more work to do */
 8         p ← localPop()
 9         copyOneSlot(p)
10     transitionRooms()
11     sharedPush()                    /* move work to shared stack */
12     if exitRoom()
13         interrupt(collectorOff)     /* turn collector off */
the snapshot. Also, the new object always has a replica so there is no need to check for
the replica's presence. Finally, the collector is designed so that if a collection cycle starts
while an object is only partially initialised, only the initialised slots will be copied (see
collectorOn in Algorithm 19.4).
Algorithm 19.3 shows the collector function collect(k), which copies k slots. The
shared copy stack allows the copy work to be shared among the processors. To reduce
the number of invocations of the potentially expensive sharedPop operation (which uses
FetchAndAdd), to improve the chances for local optimisation, and to enhance locality,
each processor takes most of its work from a private local stack (the shared and private
stack operations are defined earlier in Algorithm 14.8). Only when there is no work
available in this local stack will the processor fetch additional work from the shared copy stack.
After copying k slots, collect places any remaining work back into the shared stack.
Note that no two processors can simultaneously execute the code to copy slots (obtaining
additional work from the shared copy stack) in lines 2-10 and move copy work back to the
copy stack after copying k slots in lines 10-12. This is enforced using the 'rooms' of
Algorithm 14.9, which we discussed in Section 14.6.
Algorithm 19.4 shows the code to start (collectorOn) and stop (collectorOff)
the collector. Here, the only roots are assumed to reside in the fixed number of registers
REG private to each processor. The synch routine implements a synchronisation barrier
to block a processor until all processors have reached that barrier. These are used to ensure
a consistent view of the shared variables gcOn, free, and sharedStack. When a new
shared gcOn
shared toTop
shared free
shared count ← 0          /* number of processors that have synched */
shared round ← 0          /* the current synchronisation round */

synch():
    curRound ← round
    ...

collectorOn():
    synch()
    gcOn ← true
    synch()
    r ← allocate(lastL)                /* allocate replica of last allocated */
    forwardingAddress(lastA) ← r       /* forward last allocated */
    copyCount(r) ← lastC               /* set number of slots to copy */
    if lastC > 0
        localPush(lastA)               /* push work onto local stack */
    for i ← 0 to length(REG)-1         /* make roots grey */
        if isPointer(REG, i)
            makeGrey(REG[i])
    sharedPush()                       /* move work to shared stack */
    synch()

collectorOff():
    synch()
    for i ← 0 to length(REG)-1         /* forward roots */
        if isPointer(REG, i)
            REG[i] ← forwardingAddress(REG[i])
    lastA ← forwardingAddress(lastA)
    gcOn ← false
    synch()
Time and space bounds. The considerable effort taken by this algorithm to place a well-
defined bound on each increment of collector work allows for precise bounds to be placed
on space and the time spent in garbage collection. Blelloch and Cheng [1999] prove that
the algorithm requires at most 2(R(1 + 2/k) + N + 5PD) memory words, where P is the
number of processors, R is the maximum reachable space during a computation (number
of words accessible from the root set), N is the maximum number of reachable objects,
D is the maximum depth of any object and k controls the tradeoff between space and
time, bounding how many words are copied each time a word is allocated. They also
show that mutator threads are never stopped for more than time proportional to k non-
blocking machine instructions. These bounds are guaranteed even for large objects and
arrays, because makeGrey progresses the grey wavefront a field at a time rather than a
whole object at a time.
Performance. Cheng and Blelloch [2001] implemented their collector for ML, a statically
typed functional language. ML programs typically have very high allocation rates,
posing a challenge to most collectors. Results reported are for a 64-processor Sun Enterprise
10000, with processor clock speeds on the order of a few hundred megahertz. On a single
processor, the collector imposes an average (across a range of benchmarks) overhead of
51% compared to an equivalent stop-the-world collector. These are the costs to support
both parallel (39%) and concurrent (12%) collection. Nevertheless, the collector scales well
for 32 processors (17.2x speedup) while the mutator does not scale quite so well (9.2x
speedup), and near perfectly for 8 processors (7.8x and 7.2x, respectively). Minimum
mutator utilisation for the stop-the-world collector is zero or near zero for all benchmarks
at a granularity of 10ms, whereas the concurrent collector supports a minimum mutator
utilisation of around 10% for k = 2 and 15% for k = 1.2. Maximum pause times for the
concurrent collector range from three to four milliseconds.
Figure 19.3: Low mutator utilisation even with short collector pauses. The
mutator (white) runs infrequently while the collector (grey) dominates.
or the collector must abort the scanning of that stacklet, deferring that work to the mutator.
Similarly, variable-sized objects can be broken into fixed-size oblets, and arrays into
arraylets, to place bounds on the granularity of scanning/copying to advance the collector
wavefront. Of course, these non-standard representations require corresponding changes
to the operations for accessing object fields and indexing array elements, increasing space
and time overheads for the additional indirections [Siebert, 1998, 2000, 2010].
Nevertheless, Detlefs considers the asymmetric overheads of pure work-based
scheduling to be the final nail in its coffin. For example, in the Baker concurrent copying collector
mutator operations have costs that vary greatly depending on where in the collector cycle
they occur. Before a flip operation, the mutator is taxed only for the occasional allocation
operation in order to progress the wavefront, while reads are most likely to load references
to already copied objects. For some time after the flip, when only mutator roots have been
scanned, the average cost of reads may come close to the theoretical worst case as they
are forced to copy their targets. Similarly, for the Blelloch and Cheng [1999] collector, even
though writes are much less common than reads, there is still wide variability in the need
to replicate an object at any given write.
This variability can yield collector schedules that preserve predictably short pause
times, but do not result in satisfactory utilisation because of the frequency and duration
of collector work. Consider the schedule in Figure 19.3 in which the collector pauses are
bounded at a millisecond, but the mutator is permitted only a tenth of a millisecond
between collector pauses in which to run. Even though collector work is split into predictably
short bounded pauses, there is insufficient time remaining for a real-time mutator to meet
its deadlines.
While work-based scheduling may result in collector overhead being spread evenly
over mutator operations, on average, the big difference between average cost and worst-
case cost makes collector work something that must be budgeted for in a way that does
not make it a pure tax on mutator work, essentially by treating garbage collection as
another real-time task that must be scheduled. This results in mutator worst-case execution
time analysis that is much closer to actual average mutator performance, allowing for
better processor utilisation. Rare but potentially costly operations, such as flipping the
mutator, need only be short enough to complete during the portion of execution made
available to the collector.
Figure 19.4: Heap structure (two semispaces) in Henriksson's slack-based collector
Allocation by high-priority tasks is not taxed, while low-priority tasks perform some collector work
when allocating. A special task, the high-priority garbage collection task, is responsible for
performing collector work that was omitted while the high-priority tasks were executing,
as implied by the allocations performed by the high-priority tasks. The high-priority
garbage collection task has a priority lower than the high-priority tasks, but higher than
the low-priority tasks. It must always ensure that enough free memory is initialised and
available for allocation to meet the requirements of the high-priority tasks. Thus, collector
work operates entirely in the slack in the real-time task schedule.
The heap is configured into two semispaces as shown in Figure 19.4. New objects are
allocated at the top of tospace, at the position of the pointer top. Evacuated objects are
placed at the bottom of tospace, at the position designated by bottom. The collector scans
the evacuated objects in the usual Cheney style, evacuating all fromspace objects they refer
to. Low-priority threads perform some evacuation work incrementally as new objects are
allocated at the top of tospace. The position of scan indicates the progress of the collector
in scanning the evacuated objects.
Henriksson describes his approach in the context of a Brooks-style concurrent copying collector that uses an indirection barrier on all accesses, including a Dijkstra-style insertion write barrier to ensure that the new target object is in tospace, copying it if not. This maintains a strong invariant for concurrent collection: no tospace object contains references to fromspace objects. However, Henriksson does not impose the full copying cost of the write barrier on high-priority tasks. Instead, objects are evacuated lazily. The write barrier simply allocates space for the tospace copy, but without actually transferring the contents of the fromspace original. Eventually, the garbage collector will run (whether as the high-priority garbage collection task, or as a tax on allocation by low-priority tasks), and perform the deferred copying work when it comes to scan the contents of the tospace copy. Before scanning the tospace version the collector must copy the contents over from the fromspace original. To prevent any mutator from accessing the empty tospace copy before its contents have been copied over, Henriksson exploits the Brooks indirection barrier by giving every empty tospace shell a back-pointer to the fromspace original. This lazy evacuation is illustrated in Figure 19.5.
As sketched in Algorithms 19.5 and 19.6, the collector is similar to that of concurrent copying (Algorithm 17.1), but uses the Brooks indirection barrier to avoid the need for a tospace invariant on the mutators, and (like Sapphire) defers any copying from the mutator write barrier to the collector. Note that the temporary toAddress pointer allows the collector to forward references held in tospace copies, even while the mutator continues to operate on the fromspace originals.
[Figure 19.5: Lazy evacuation: empty tospace shells carry back-pointers to their fromspace originals until the collector copies their contents across.]
Algorithm 19.5: The Henriksson slack-based collector

coroutine collector:
    loop
        while bottom < top                      /* tospace is not full */
            yield                               /* revert to mutator */
        flip()
        for each fld in Roots
            process(fld)
            if not behind()
                yield                           /* revert to mutator */
        while scan < bottom
            scan ← scanObject(scan)
            if not behind()
                yield                           /* revert to mutator */

flip():
    toBot, fromBot ← fromBot, toBot
    toTop, fromTop ← fromTop, toTop
    bottom, top ← toBot, toTop
    scan ← bottom

scanObject(toRef):
    fromRef ← forwardingAddress(toRef)
    move(fromRef, toRef)
    for each fld in Pointers(toRef)
        process(fld)
    forwardingAddress(fromRef) ← toRef
    return toRef + size(toRef)                  /* advance scan past this object */

process(fld):
    fromRef ← *fld
    if fromRef ≠ null
        *fld ← forward(fromRef)                 /* update with tospace reference */

forward(fromRef):
    toRef ← forwardingAddress(fromRef)
    if toRef = fromRef                          /* not evacuated */
        toRef ← toAddress(fromRef)
        if toRef = null                         /* not scheduled for evacuation (not marked) */
            toRef ← schedule(fromRef)
    return toRef

schedule(fromRef):
    toRef ← bottom
    bottom ← bottom + size(fromRef)
    if bottom > top
        error "Out of memory"
    toAddress(fromRef) ← toRef                  /* schedule for evacuation (mark) */
    return toRef
Algorithm 19.6: Mutator allocation in the Henriksson slack-based collector

atomic NewHighPriority(size):
    top ← top − size
    toRef ← top
    forwardingAddress(toRef) ← toRef
    return toRef

atomic NewLowPriority(size):
    while behind()
        yield                                   /* wake up the collector */
    top ← top − size
    toRef ← top
    if bottom > top
        error "Out of memory"
    forwardingAddress(toRef) ← toRef
    return toRef
\"min
390 CHAPTER 19. REAL-TIME GARBAGE COLLECTION
The current GC ratio GCR is the ratio betweenperformedGC work W and the amount A
Allocation by the mutator causes A to increase, while GC work increases W. The collector
must enough work W to make sure that the current GCratio is no lessthan the
perform
minimum GC ratio (GCR > GCRmin). This will guarantee that fromspace is empty (all live
objects have been evacuated) before tospace is filled, even in the worst case.
Allocation of memory by low-priority tasks is throttled so that the current GC ratio GCR does not drop too low (below GCR_min), by giving the collector task priority. The upper bound on the collector work performed during allocation will be proportional to the size of the allocated object.
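The GC-ratio accounting can be sketched in a few lines of Java (a simplified model with invented names; Henriksson's analysis determines GCR_min offline):

    // Sketch only: tracks W and A and tells a low-priority allocator how
    // much collector work it owes to keep GCR = W/A at or above GCR_min.
    final class GcRatio {
        private double performedWork;    // W
        private double allocated;        // A
        private final double minRatio;   // GCR_min

        GcRatio(double minRatio) { this.minRatio = minRatio; }

        // Called on low-priority allocation of 'size' bytes: returns the
        // collector work needed to restore W >= GCR_min * A.
        double workOwedFor(double size) {
            allocated += size;
            return Math.max(0.0, minRatio * allocated - performedWork);
        }

        void creditWork(double amount) { performedWork += amount; }

        boolean behind() { return performedWork < minRatio * allocated; }
    }

The owed work grows linearly with the size of the object allocated, matching the proportional bound stated above.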
If a high-priority task is activated shortly before a semispace flip is due then the remaining memory in tospace may not be sufficient to hold both the last objects to be allocated by the high-priority task and the last objects to be evacuated from fromspace. The collector must ensure a sufficiently large buffer between bottom and top for these objects, large enough to hold all new objects allocated by the high-priority tasks while the collector finishes the current cycle. To do this, the application developer must estimate the worst-case allocation needed by the high-priority tasks in order to run, as well as their periods and worst-case execution times for each period. Henriksson suggests that this buffer size can be derived by standard schedulability analysis of the high-priority tasks, given a large set of program parameters such as task deadlines, task periods, and so on.
Execution overheads
The overhead to high-priority tasks for collector activity consists of tight bounds on the instructions required for memory allocation, pointer dereferencing and pointer stores. Of course, instruction counts alone are not always a reliable measure of time, in the face of loads that may miss in the cache. Worst-case execution time analysis must either assume caches are disabled (slowing down all loads) or the system must be tested empirically to ensure that real-time deadlines are met under the expected system load.
Heap accesses require a single-instruction indirection through the forwarding pointer, plus the overhead of disabling interrupts. Pointer stores have worst-case overhead on the order of twenty instructions to mark the target object for later evacuation. Allocation requires simply bumping a pointer and initialising the header (to include the forwarding pointer and other header information), having overhead on the order of ten instructions.
Low-priority
tasks have the same overheads for heap accesses and
pointer stores. On
allocation, the worst-case requirement is to perform collector work proportional to the
size of the new object. The exact worst case for allocation depends on the maximum object
size, total heap size, maximum live object set, and the maximum collector work performed
within any given cycle.
Worst-case latency for high-priority tasks depends on the time for the collector to complete (or abort) an ongoing item of atomic work, which is short and bounded. Henriksson states that latency is dominated more by the cost of the context switch than the cost of completing an item of atomic work.
Programmer input
The programmer must provide sufficient information about the application program, and the high-priority tasks, to compute the minimum GC ratio and to track the GC ratio as the program executes, so that the collector does not disrupt the high-priority tasks. The period and worst-case execution times for each high-priority task are required, along with its worst-case allocation need for any one of its periodic invocations, so as to calculate the minimum buffer requirements to satisfy high-priority allocations.
19.5 Time-based real-time collection: Metronome

Metronome is a time-based incremental mark-sweep collector that performs occasional compaction to avoid fragmentation. It uses a deletion barrier to enforce the weak tricolour invariant, marking live any object whose reference is overwritten during a write. Objects allocated during marking are black. The overhead of simply marking on writes is much lower (and more predictable) than replicating as in Blelloch and Cheng [1999].
After sweeping to reclaim garbage, Metronome compacts if necessary, to ensure that enough contiguous free space is available to satisfy allocation requests until the next collection.
Mutator utilisation
Metronome guarantees the mutator a predetermined percentage of execution time, with use of the remaining time at the collector's discretion: any time not used by the collector will be given to the mutator. By maintaining uniformly short collector pause times, Metronome is able to give finer-grained utilisation guarantees than traditional collectors. Using collector quanta of 500 microseconds over a 10 millisecond window, Metronome sets a default mutator utilisation target of 70%. This target utilisation can also be tuned further for the application to meet its space constraints. Figure 19.6 shows a 20-millisecond Metronome collector cycle split into 500-microsecond time slices. The collector preserves 70% utilisation over a 10-millisecond sliding window: there are at most 6 collector quanta in any such window (6 × 0.5 ms = 3 ms of collector time, leaving the mutator at least 7 ms of every 10).
[Figure 19.6: A 20-millisecond Metronome collector cycle split into 500-microsecond time slices (x-axis: time in ms).]
During windows in which the collector is inactive, all quanta go to the mutator, giving 100% utilisation. Overall, the mutator will see utilisation drop during periods that the collector is running, but never lower than the target utilisation. This is illustrated in Figure 19.7, which shows overall mutator utilisation dropping for each collector cycle.
Figure 19.8 shows mutator utilisation over the same collector cycle that was illustrated in Figure 19.6 (grey bars indicate each collector quantum while white is the mutator). At time t on the x-axis this shows utilisation for the ten millisecond window leading up to time t. Note that while the schedule in Figure 19.6 is perfect in that utilisation is exactly 70% over the collector cycle, real schedules will not be quite so exact. A real scheduler will typically allow collector quanta to run until minimum mutator utilisation is close to the target MMU and then back off to prevent overshooting the target.
Section A of the figure is a staircase graph where the descending portions correspond to collector quanta and the flat portions correspond to mutator quanta. The staircase shows the collector maintaining low pause times by interleaving with the mutator, as utilisation falls towards the target.
Supporting predictability
Metronome uses a number of techniques to achieve deterministic pause times while guaranteeing collector safety. The first of these addresses the unpredictability of allocating large objects when the heap becomes fragmented. The remainder advance the predictability of the collector's own operations.
Read barrier. Like Henriksson [1998], Metronome uses a Brooks-style read barrier to ensure that the overhead for accessing objects has uniform cost even if the collector has moved them. Historically, read barriers were considered too expensive to implement in software (Zorn [1990] measured their run-time overhead at around 20%), but Metronome's compiler optimisations reduce this overhead substantially.
Suspending the mutator threads. Metronome uses a series of short incremental pauses to complete each collector cycle. However, it must still stop all the mutator threads for each collector quantum, using a handshake mechanism to make all the mutator threads stop at a GC-safe point. At these points, each mutator thread will release any internally held run-time metadata, store any object references from its current context into well-described locations, signal that it has reached the safe point and then sleep while waiting for a resume signal. Upon resumption each thread will reload object pointers for the current context, reacquire any necessary run-time metadata that it previously held and then continue. Storing and reloading object pointers allows the collector to update the pointers if their targets move during the quantum. GC-safe points are placed at regularly-spaced intervals by the compiler so as to bound the time needed to suspend any mutator thread.
The suspend mechanism is used only for threads actively executing mutator code. Threads that do not access the heap, threads executing non-mutator 'native' code, and already suspended mutator threads (such as those waiting for synchronisation purposes) are ignored. If these threads need to begin (or return to) mutating the heap (for example, when returning from 'native' code, invoking operations of the Java Native Interface, or accessing other Java run-time structures), they will suspend themselves and wait for the collector quantum to complete.
Ragged root scanning. Metronome scans each complete thread stack within a single collector quantum, so as to avoid losing pointers to objects. Developers must make sure not to use deep stacks in their real-time applications, so as to permit each stack to be scanned in a single quantum. Though each whole stack must be scanned atomically in a single quantum, Metronome does allow scanning of distinct thread stacks to occur in different quanta. That is, the collector and mutator threads are allowed to interleave their execution while the collector is scanning the thread stacks. To support this, Metronome imposes an installation write barrier on all unscanned threads, to make sure they do not hide a root reference behind the wavefront before the collector can scan it.
Analysis
One of the biggest contributions of Metronome is a formal model of the scheduling of collection work and its characterisation in terms of mutator utilisation and memory usage [Bacon et al, 2003a]. The model is parametrised by the instantaneous allocation rate A*(τ) of the mutator over time, the instantaneous garbage generation rate G*(τ) of the mutator over time, and the garbage collector processing rate P (measured over the live data). All are defined in units of data volume per unit time. Here, time τ ranges over mutator time, idealised for a collector that runs infinitely fast (or in practice assuming there is sufficient memory to run without collecting).
These parameters allow simple definitions of the amount of memory allocated during an interval of time (τ1, τ2), written α*(τ1, τ2), and hence of the maximum allocation rate a* over any interval of duration Δτ:

    a^*(\Delta\tau) = \frac{\alpha^*(\Delta\tau)}{\Delta\tau}    (19.4)
The instantaneous memory requirement of the program (excluding garbage, overhead and fragmentation) at a given time τ is

    m^*(\tau) = \alpha^*(0, \tau) - \gamma^*(0, \tau)    (19.5)

where γ*(0, τ) is the garbage generated up to time τ. Of course, real time must also include the time for the collector to execute, so letting Φ(t) map real time t to the mutator time elapsed by t, the memory requirement at real time t is

    m(t) = m^*(\Phi(t))    (19.6)

and the maximum memory requirement over the entire program execution is

    m = \max_t m(t) = \max_\tau m^*(\tau).    (19.7)
For a time-scheduled collector with mutator quantum Q_T and collector quantum C_T, the minimum mutator utilisation over an interval Δt is

    u_T(\Delta t) = \frac{Q_T \left\lfloor \frac{\Delta t}{Q_T + C_T} \right\rfloor + x}{\Delta t}    (19.8)

(Note carefully here the distinction between a*, the maximum allocation rate over an interval, and α*, the maximum allocated memory over an interval.)
[Figure: minimum mutator utilisation u_T(Δt) against interval size Δt, for mutator quanta Q_T = 2.5, 10 and 40.]
where Q_T⌊Δt/(Q_T + C_T)⌋ is the length of the whole mutator quanta in the interval and x is the size of the remaining partial mutator quantum:

    x = \max\left(0,\ \Delta t - (Q_T + C_T)\left\lfloor \frac{\Delta t}{Q_T + C_T} \right\rfloor - C_T\right)    (19.9)

Asymptotically, minimum mutator utilisation approaches the expected ratio of total time given to the mutator versus the collector:

    \lim_{\Delta t \to \infty} u_T(\Delta t) = \frac{Q_T}{Q_T + C_T}    (19.10)
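A small Java sketch of Equations 19.8 and 19.9 (hypothetical helper names, not from Metronome itself) makes the floor-and-remainder structure concrete:

    // Sketch: worst-case (minimum) mutator utilisation for perfectly
    // scheduled time-based quanta, per Equations 19.8 and 19.9.
    final class TimeBasedMmu {
        static double utilisation(double qT, double cT, double dt) {
            double period = qT + cT;
            double whole = Math.floor(dt / period);             // whole periods in dt
            double x = Math.max(0.0, dt - period * whole - cT); // partial mutator quantum
            return (qT * whole + x) / dt;
        }
        // utilisation(0.5, 0.5, 10.0) yields 0.5; as dt grows the result
        // approaches qT / (qT + cT), matching Equation 19.10.
    }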
Space utilisation. As already noted, space utilisation will vary depending on the mutator allocation rate. Assuming constant collector rate P, at time t the collector will run for time m(t)/P to process the m(t) live data (work is proportional to the tracing needed to mark the live data). In that time, the mutator will run for quantum Q_T per quantum C_T of the collector. Thus, to run a collection increment at time t requires an excess space overhead of

    e_T(t) = \alpha^*\left(\Phi(t),\ \Phi(t) + \frac{m(t)}{P} \cdot \frac{Q_T}{C_T}\right)    (19.11)
Mutation. Mutation also has a space cost because the write barrier must record every deleted and inserted reference. It must filter null references and marked objects so as to place a bound on collector work (at most all the objects in the heap will be marked live), while keeping the cost of the write barrier constant. Thus, in the worst case, the write log can have as many entries as there are objects. This space must be accounted for by treating allocation of the log entries as an indirect form of allocation.
The behaviour of the time-scheduled collector is controlled by the quantisation parameters Q_T and C_T. Utilisation u_T depends solely on Q_T and C_T, so utilisation will remain steady (subject only to any jitter in the operating system delivering a timely quantum signal and the minimum quantum it can support).
The excess space required for collection, e_T(t), which determines the total space s_T needed, depends on both maximum application memory usage m and the amount of memory allocated over an interval. If the application developer underestimates either the total space required m or the maximum allocation rate a* then the total space requirement s_T may grow arbitrarily. Time-based collectors suffer from such behaviour particularly when there are intervals of time in which the allocation rate is very high. Similarly, the estimate of the collector processing rate P must be a conservative underestimate of the actual rate.
Fortunately, a collection cycle runs for a relatively long interval of mutator execution time, m·Q_T/(P·C_T), over which spikes in allocation rate are smoothed out, so there is little variation in space consumed so long as the estimate of maximum memory required m is accurate.
A work-scheduled collector is parametrised analogously by Q_w and C_w, these being the amount of memory that the mutator and collector (respectively) are allowed to allocate/process before yielding.
Because work-based time dilation is variable and non-linear, there is no way to obtain a closed-form solution for minimum mutator utilisation. Each collector increment processes C_w memory at rate P, so each pause for collection takes time d = C_w/P. Each mutator quantum involves allocation of Q_w memory, so the minimum total mutator time Δτ_i for i quanta is the minimum Δτ_i that solves the equation

    \alpha^*(\Delta\tau_i) = i \cdot Q_w.    (19.15)

Increasing the time interval does not decrease the maximum amount of allocation in that time, so α*(Δτ) increases monotonically. Thus, Δτ_i ≥ Δτ_{i−1}, so Equation 19.15 can be solved using an iterative method. Let k be the largest integer such that

    k \cdot d + \Delta\tau_k \le \Delta t    (19.16)

so that the minimum mutator utilisation over an interval Δt is

    u_w(\Delta t) = \frac{\Delta\tau_k + y}{\Delta t}    (19.17)

where Δτ_k is the time taken by k whole mutator quanta in Δt and y is the size of the remaining partial mutator quantum:

    y = \max(0,\ \Delta t - \Delta\tau_k - (k+1) \cdot d).    (19.18)
Note that minimum mutator utilisation u_w(Δt) will be zero for Δt < d. Moreover, any large allocation of n·Q_w bytes will force the collector to perform n units of work, leading to a pause lasting time n·d in which the mutator will experience zero utilisation. This reveals analytically that the application developer must take care with a work-based collector to achieve real-time bounds, by avoiding large allocations and making sure that allocation is spaced evenly.
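Since Equation 19.15 has no closed form, u_w(Δt) must be computed iteratively. The following Java sketch (hypothetical names; it assumes the caller supplies a monotonic α*) follows Equations 19.15 to 19.18:

    import java.util.function.DoubleUnaryOperator;

    // Sketch of the iterative solution. alphaStar(t) must return the maximum
    // allocation in any interval of length t (monotonically non-decreasing);
    // qW, cW and P are as in the text.
    final class WorkBasedMmu {
        static double utilisation(DoubleUnaryOperator alphaStar,
                                  double qW, double cW, double P, double dt) {
            double d = cW / P;              // pause per collector increment
            double tauK = 0.0;              // minimum mutator time for k quanta
            int k = 0;
            while (true) {
                double next = minMutatorTime(alphaStar, (k + 1) * qW, tauK, dt);
                if (next < 0 || (k + 1) * d + next > dt) break;   // Equation 19.16
                k++; tauK = next;
            }
            double y = Math.max(0.0, dt - tauK - (k + 1) * d);    // Equation 19.18
            return (tauK + y) / dt;                               // Equation 19.17
        }

        // Smallest t in [lo, hi] with alphaStar(t) >= target, found by
        // bisection (exploiting monotonicity); -1 if unattainable within hi.
        private static double minMutatorTime(DoubleUnaryOperator alphaStar,
                                             double target, double lo, double hi) {
            if (alphaStar.applyAsDouble(hi) < target) return -1;
            for (int i = 0; i < 60; i++) {
                double mid = 0.5 * (lo + hi);
                if (alphaStar.applyAsDouble(mid) >= target) hi = mid; else lo = mid;
            }
            return hi;
        }
    }

As a sanity check, the sketch returns zero whenever dt < d, as the analysis above predicts.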
Now, minimum mutator utilisation depends on the allocation rate a*(Δt) over the interval of interest, and on the collector processing rate P. Suppose that the interval Δt over which we require real-time performance is small (say twenty milliseconds), so the peak allocation rate for this interval is likely to be quite high. Thus, at real-time scales, work-based minimum mutator utilisation u_w(Δt) will vary considerably with the allocation rate. In contrast, note that the interval over which the time-based collector depends on allocation rate is at the much larger scale of a full collection cycle. The excess space required to run a collection increment at time t is

    e_w(t) = m(t) \cdot \frac{Q_w}{C_w}    (19.19)

and the excess space required for a collection cycle over its whole execution is

    e_w = m \cdot \frac{Q_w}{C_w}.    (19.20)
These will be accurate as long as the application developer's estimate of total live memory m is accurate. Also, note that the excess e_w for a whole collection cycle will exceed the maximum memory m needed for execution of the program unless Q_w < C_w. The space requirement of the program at time t is

    s_w(t) = m(t) + 3 e_w(t)    (19.21)

and over the whole execution

    s_w = m + 3 e_w.    (19.22)
To sum up, while a work-scheduled collector will meet its space bound so long as m is correctly estimated, its minimum mutator utilisation will be heavily dependent on the allocation rate over a real-time interval, while a time-based collector will guarantee minimum mutator utilisation easily but may fluctuate in its space requirements.
Robustness
Time-based scheduling yields the robustness needed for real-time collection, but when the application exceeds the assumed allocation rate or live memory budget the system may run out of memory. The only way for it to degrade gracefully is to slow down the allocation rate.
One approach to reducing the total allocation rate is to impose a generational scheme. This treats the nursery as a filter to reduce the allocation rate into the primary heap. Focusing collector effort on the portion of the heap most likely to yield free memory results in higher mutator utilisation and also reduces the amount of floating garbage. However, traditional nursery collection is unpredictable both in terms of the time to collect and the quantity of data that is promoted. Syncopation is an approach for performing nursery collection synchronously with the mature-space collector, where the nursery is evacuated at the beginning of the mature-space collection cycle and at the start of sweeping, as well as outside the mature-space collection cycle [Bacon et al, 2005]. It relies on an analytic solution for utilisation in generational collection, taking the nursery survival rate as a parameter and sizing the nursery such that evacuation is needed only once per real-time window. The analysis informs whether generational collection should be used in any given application.
Syncopation handles the situation where temporary spikes in allocation rate make it impossible to evacuate the nursery quickly enough to meet real-time bounds by moving the work triggered by the temporary spike to a later time. Frampton et al [2007] adopt a different approach, allowing nursery collection to be performed incrementally so as to avoid having pause times degenerate to the time needed to evacuate the nursery.
Another strategy for slowing the allocation rate is simply to add an element of work-based collection to slow the mutator down, but of course this can lead to missed deadlines. Alternatively, slack-based scheduling achieves this by preempting the low-priority threads as necessary for the collector to keep up with allocation. So long as sufficient low-priority slack is available then real-time deadlines will be preserved. These observations lead to the following Tax-and-Spend methodology that combines slack-based and time-based scheduling.
19.6 Combining scheduling approaches: Tax-and-Spend
Metronome works best on dedicated uniprocessor or small multiprocessor systems, because of its need to suspend the mutator while an increment of collector work is performed. Typical work-based collectors can suffer latencies that are orders of magnitude worse than time-based schemes. Henriksson's slack-based scheduling is best suited to systems whose periodic task schedule leaves sufficient slack for the collector to keep pace with allocation.
Tax-and-Spend scheduling
As we have already seen, minimum mutator utilisation is simple for developers to reason about because they can consider the system as just running somewhat slower than the native processor speed, until the responsiveness requirements approach the quantisation limits of the garbage collector. As a measure of garbage collector intrusiveness, minimum mutator utilisation is superior to maximum pause time since it accounts for clustering of the individual pauses that cause missed deadlines and pathological slowdowns. Tax-and-Spend scheduling allows different threads to run at different utilisations, providing flexibility when threads have widely varying allocation rates, or for threads having particularly stringent deadlines that must be interrupted as little as possible. Also, background threads on spare processors can be used to offload collector work to obtain high utilisation for mutator threads. The time metric can be physical or virtual as best suits the application. Of course, this does mean that any analysis of the application must compose the real-time constraints of the individual threads to obtain a global picture of application behaviour.
Per-thread scheduling. To manage per-mutator utilisation, Tax-and-Spend must measure and schedule collector work based on per-thread metrics, and allow a collector increment to be charged to a single mutator thread. All collector-related activity can be accounted for in each thread (including the overheads of extending the mutation log, initialising an allocation page, and other bookkeeping activities). The collector can track all of these so as to avoid scheduling too much work on any given mutator thread.
Also, by piggybacking collector increments on mutator threads before a thread voluntarily yields to the operating system (say to take an allocation slow path, or to perform I/O or execute native code that does not access the heap), Tax-and-Spend avoids having the operating system scheduler assume that the thread has finished with its operating system time quantum and schedule some unrelated thread in its place. This is particularly important in a loaded system. By interleaving mutation and collection on the same operating system thread, the operating system scheduler is less likely to interfere in the scheduling of the collection work.
Allowing different threads to run with different utilisation is important when allocation rates vary significantly across threads or when high-priority threads like event handlers desire minimal interruption. This also permits threads that can tolerate less stringent timing requirements to lower their quantisation overheads by running with larger quanta, and so increase throughput.
Time-based scheduling taxes the mutator for given amounts of processor time. It is robust to overload because the tax continues to be assessed, but when there is sufficient slack in the system it can result in unnecessary jitter, since collection can occur at any time so long as minimum mutator utilisation requirements are preserved.
Under Tax-and-Spend, each mutator thread is subject to a tax rate that determines how much collector work it must perform for a given amount of execution time, specified as a per-thread minimum mutator utilisation. Dedicated collector threads run at low or idle priority during slack periods and accumulate tax credits for their work. Credits are typically deposited in a single global account, though it is possible to consider policies that use multiple accounts.
The aggregate tax over all threads, combining the tax on the mutator threads with the credits contributed by the collector threads, must be sufficient for the collector to finish its cycle before memory is exhausted. The number of background collector threads is typically the same as the number of processors, configured so that they naturally run during slack periods in overall system execution. They execute a series of quanta, each adding the corresponding amount of credit. On real-time operating systems it is desirable to run these threads at some low real-time priority rather than the standard idle priority, so that they are scheduled similarly to other threads that perform real work rather than as a true idle thread. These low-priority real-time threads will still sleep for some small amount of time, making it possible for non-real-time processes to make progress even when collection might saturate the machine. This enables administrators to log in and kill run-away real-time processes as necessary.
Each mutator thread is scheduled according to its desired minimum mutator utilisation, guaranteeing that it can meet its real-time requirements while also allowing the collector to make sufficient progress. When a mutator thread is running and its tax is due, it first attempts to withdraw credit from the bank equal to its tax quantum. If this is successful then the mutator thread can skip its collector quantum because the collector is keeping up, so the mutator pays tax only when there is insufficient slack-scheduled background collection. Even if only a partial quantum's credit is available, the mutator can perform a smaller quantum of collector work than usual. Thus, if there is any slack available the mutator can still run with both higher throughput and lower latencies, without having the collector fall behind. This treats slack in a uniprocessor and excess capacity in a multiprocessor in the same way.
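A minimal sketch of the global credit account, assuming credits measured in abstract work units (the class and its API are illustrative, not the production implementation):

    import java.util.concurrent.atomic.AtomicLong;

    final class CreditBank {
        private final AtomicLong credits = new AtomicLong();

        void deposit(long amount) { credits.addAndGet(amount); }  // by collector threads

        // Withdraw up to 'wanted'; the shortfall is collector work the taxed
        // mutator must perform itself.
        long withdraw(long wanted) {
            while (true) {
                long c = credits.get();
                long granted = Math.min(c, wanted);
                if (granted <= 0) return 0;
                if (credits.compareAndSet(c, c - granted)) return granted;
            }
        }
    }

A taxed mutator whose quantum is q would then perform q − withdraw(q) units of collector work itself, paying tax only when slack-scheduled collection has not kept up.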
Tax-and-Spend prerequisites

Tax-and-Spend places several demands on the underlying collector. It must be both incremental and concurrent (so that slack-scheduled collector work can run concurrently with work-scheduled collector work), while still guaranteeing safety and completeness. Changes to global collector state, such as a new phase taking effect, are communicated using a ragged epoch mechanism to assert that the new state is in effect for all threads.
The epoch mechanism uses a single shared epoch number that can be atomically incremented by any thread to initiate a new epoch, plus a per-thread local epoch number. Each thread updates its local epoch number by copying the shared epoch, but it does so only at GC-safe points. Thus, each thread's local epoch is always less than or equal to the shared epoch. Any thread can examine the local epochs of all threads to find the least local epoch, which is called the confirmed epoch. Only when the confirmed epoch reaches or passes the value a thread sets for the global epoch can it be sure that all other threads have noticed the change. On weakly-ordered hardware a thread must use a memory fence before updating its local epoch. To cope with threads waiting on I/O or executing native code, Tax-and-Spend requires that they execute a GC-safe point on return to update their local epoch before they resume epoch-sensitive activities. Thus, such threads can always be assumed to be at the current epoch, so there is no need to wait for them.
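The epoch mechanism can be sketched in a few lines of Java (a simplified model with a fixed thread count and hypothetical names; the real system piggybacks on GC-safe points):

    import java.util.concurrent.atomic.AtomicLong;
    import java.util.concurrent.atomic.AtomicLongArray;

    // Local epochs advance only at GC-safe points; the confirmed epoch is
    // the minimum of all local epochs.
    final class RaggedEpochs {
        private final AtomicLong shared = new AtomicLong();
        private final AtomicLongArray local;    // one slot per thread

        RaggedEpochs(int threads) { local = new AtomicLongArray(threads); }

        long beginNewEpoch() { return shared.incrementAndGet(); }

        // Called by thread t at each GC-safe point; the volatile write also
        // provides the fence required on weakly ordered hardware.
        void safePoint(int t) { local.set(t, shared.get()); }

        long confirmedEpoch() {
            long min = Long.MAX_VALUE;
            for (int t = 0; t < local.length(); t++)
                min = Math.min(min, local.get(t));
            return min;
        }

        boolean allHaveSeen(long epoch) { return confirmedEpoch() >= epoch; }
    }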
Phase agreement using 'last one out'. Metronome easily achieved agreement on the collector phase (such as marking, sweeping, finalising, and so on) because all collector work occurred on dedicated threads that could block briefly to effect a phase change, so long as there was enough remaining time in their shared collector quantum. With concurrent collection piggy-backed on the mutator threads, each mutator might be at a different place in its taxation quantum, so it is essential that phase detection be non-blocking, or else a taxed mutator might fail to meet its deadlines. Using ragged epochs for this is not efficient because it does not distinguish taxed mutator threads from others. Instead, the 'last one out' protocol operates by storing a phase identifier and worker count in a single shared and atomically updatable location.
Threads performing collector work atomically increment the worker count as they begin and decrement it as they finish, leaving the phase identifier unchanged. When any thread believes that the phase might be complete because there is (apparently) no further work to do in that phase, and it is the only remaining worker thread (the count is one), then it will change the phase and decrement the worker count in one atomic operation to establish the new phase.
This protocol works only so long as each worker thread returns any incomplete work to a global work queue when it exits. Eventually there will be no work left, some thread will end up being the last one and it will be able to declare the next phase.
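A sketch of the packed phase/count word in Java, assuming the phase identifier occupies the high 32 bits and the worker count the low 32 bits (a hypothetical encoding, ignoring wrap-around):

    import java.util.concurrent.atomic.AtomicLong;

    final class LastOneOut {
        private final AtomicLong state = new AtomicLong(); // phase<<32 | workers

        void enter() { state.incrementAndGet(); }          // join the current phase

        // Leave the phase; the last worker out advances the phase in the
        // same atomic step when it believes no work remains.
        void leave(boolean phaseLooksDone) {
            while (true) {
                long s = state.get();
                long workers = s & 0xFFFFFFFFL;
                long phase = s >>> 32;
                long next = (phaseLooksDone && workers == 1)
                        ? (phase + 1) << 32                // new phase, zero workers
                        : s - 1;                           // just one fewer worker
                if (state.compareAndSet(s, next)) return;
            }
        }
    }

Packing both values into one word is what lets the decision 'I am the last worker, so I may declare the next phase' be taken in a single compare-and-set, without blocking.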
Unfortunately, termination of the mark phase in Metronome is not easily achieved using this mechanism, because the deletion barrier employed by Metronome deposits the overwritten pointer into a per-thread mutation log. Mark phase termination requires that all threads have an empty mutation log (not just those performing collector work). Thus, Tax-and-Spend introduces a final marking phase in which the remaining marking work is handled by one thread, which uses the ragged epoch mechanism to ensure that there is global agreement that all the mutation logs are empty. If this check fails then the deciding thread can declare a false alarm and switch back to parallel marking. Eventually all the termination conditions will be met and the deciding thread can move to the next post-marking phase.
Per-thread callbacks. Most phases of a collection cycle need just enough worker threads to make progress, but others require that something be done by (or to) every mutator thread. For example, the first phase of collection must scan every mutator stack. Other phases require that the mutator threads flush their thread-local state to make information available to the collector. To support this, some phases impose a callback protocol instead of 'last one out'.
In a callback phase some collector master thread periodically examines all the mutator threads to see if they have performed the desired task. Every active thread that has not yet done so is asked to perform a callback at its next GC-safe point to perform the required action (stack scanning, cache flushing, and so on). Threads waiting on I/O or executing native code are prevented from returning while the action is performed on their behalf. Thus, the maximum delay to any thread during a callback phase is the time taken to perform the action.
19.7 Controlling fragmentation

Keeping object replicas consistent in the face of concurrent mutation usually requires some form of locking, particularly for volatile fields. Moreover, replicating collectors rely on a synchronous termination phase to ensure that the mutator roots have been forwarded. Per-object locking does not scale. Compressor and Pauseless rely on page-level synchronisation using page protection, but suffer from poor minimum mutator utilisation both because of the cost of the traps and because they are work-based, with a trap storm following a phase shift.
The absence of lock-freedom means we cannot guarantee progress of the mutator, let alone preserve time bounds. There are a number of approaches to making mutator accesses wait-free or lock-free in the presence of concurrent compaction, which we now discuss.
Defragmentation proceeds in the following steps.

1. Sort the pages by the number of unused (free) objects per page, from dense to sparse.
2. Set the allocation page to the first (densest) non-full page in the resulting list.
3. Set the page to evacuate to the last (sparsest) page in the list.
4. While the target number of pages to evacuate in this size class has not been met, and the page to evacuate does not equal the page in which to allocate, move each live object from the sparsest page to the next available free cell on the allocation page (moving to the next page in the list whenever the current allocation page fills up).
Algorithm 19.7: Replication copying barriers for a uniprocessor

Read(p, i):
    return p[i]                     /* either version is current */

Write(p, i, value):
    p[i] ← value
    r ← forwardingAddress(p)
    r[i] ← value                    /* keep the other version coherent */
This moves objects from the sparsest pages to the densest pages. It moves the minimal number of objects and produces the maximal number of completely full pages. The choice of the first allocation page in step 2 as the densest non-full page may result in poor cache locality because previously co-located objects will be spread among the available dense pages. To address this, one can set a threshold for the density of the page in which to allocate at the head of the list, so that there are enough free cells in the page to satisfy the locality goal.
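Steps 1 to 3 can be sketched as follows, assuming a hypothetical Page abstraction; within one size class all pages hold equally many cells, so sorting by live objects orders them dense to sparse:

    import java.util.Comparator;
    import java.util.List;

    interface Page { int liveObjects(); boolean isFull(); }

    final class DefragPlanner {
        // Returns indices of the allocation page (densest non-full) and the
        // evacuation page (sparsest); step 4 then moves live objects from
        // the latter to the former, advancing the indices towards each other.
        static int[] plan(List<Page> pages) {
            pages.sort(Comparator.comparingInt(Page::liveObjects).reversed());
            int alloc = 0;
            while (alloc < pages.size() && pages.get(alloc).isFull()) alloc++;
            return new int[] { alloc, pages.size() - 1 };
        }
    }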
References to relocated objects are redirected as they are scanned by the subsequent tracing mark phase. Thus, at the end of the next mark phase, the relocated objects of the previous collection can be freed. In the meantime, the Brooks forwarding barrier ensures proper mutator access to the relocated objects. Deferring update of references to the next mark phase has three benefits: there is no extra 'fixup' phase, fewer references need to be fixed (since any object that dies will never be scanned) and there is the locality benefit of piggybacking fixup on tracing.
Before considering more complicated schemes for concurrent compaction, it is worth noting that many real-time applications run in embedded systems, where uniprocessors have been the predominant platform. Preserving atomicity of mutator operations (with respect to the collector and other mutators) is simple on a uniprocessor, either by disabling scheduler interrupts or by preventing thread switching except at GC-safe points (making sure that mutator barriers never contain a GC-safe point). In this setting, the collector can freely copy objects so long as mutators subsequently access only the copy (using a Brooks indirection barrier to force a tospace invariant), or they make sure to update both copies (in case other mutators are still reading from the old version in a replicating collector).
Kalibera [2009] compares replication copying to copying with a Brooks barrier in the context of a real-time system for Java running on uniprocessors. His replication scheme maintains the usual forwarding pointer in all objects, except that when the object is replicated the forwarding pointer in the replica refers back to the original instead of to itself (in contrast to Brooks [1984]). This arrangement allows for very simple and predictable mutator barriers. On Read the mutator need not be concerned whether it is accessing a fromspace or tospace object, and can simply load the value from whichever version the mutator references. All that Write needs to do is to make sure that the update is performed on both versions of the object to keep them coherent. Pseudo-code for these barriers (omitting the support necessary for concurrent tracing) is shown in Algorithm 19.7. Not surprisingly, avoiding the need to forward every read is a significant benefit, and the cost of the double-write is negligible given that most of the time both writes will be to the same address, because the forwarding address is a self-reference.
Concurrent compaction on a multiprocessor prevents us from assuming that Read and Write can be made straightforwardly atomic. For that we must consider more fine-grained synchronisation among mutators, and between mutator and collector, as follows.
In the Stopless collector of Pizlo et al [2007], each field of an object being moved may live in one of three places (the fromspace original, an intermediate wide copy, or the final tospace copy). The wide copy pairs every field with an adjacent status word. As in Blelloch and Cheng [1999], a header word on each object stores a Brooks forwarding pointer, either to the wide copy or to the tospace copy. During the compaction phase, mutator and collector threads race to create the wide copy, using CompareAndSwap to install the forwarding pointer.
Once the wide copy has been created, and its pointer installed in the original's forwarding pointer header field, the mutator can update only the wide copy. The status word on each field lets the mutator know (via read and write barriers) where to read/write the up-to-date field, encoding the three possibilities: inOriginal, inWide and inCopy. All status words on the fields in the wide object are initialised to inOriginal. So long as the status field is inOriginal, mutator reads occur on the fromspace original. All updates (both by the collector as it copies each field and the mutator as it performs updates) operate on the wide copy, atomically updating both the field and its adjacent status to inWide using CompareAndSwapWide. The collector must assert that the field is inOriginal as it copies the field. If this fails then the field has already been updated by the mutator and the copy operation can be abandoned.
Once all fields of an object have been converted to inWide (whether by copying or mutation), the collector allocates its final 'narrow' version in tospace, whose pointer is then installed as a forwarding pointer into the wide copy. At this point there are three versions of the object: the out-of-date fromspace original which forwards to the wide copy, the up-to-date wide copy which forwards to the tospace copy, and the uninitialised tospace copy. The collector concurrently copies each field of the wide copy into the narrow tospace copy, using CompareAndSwapWide to assert that the field is unmodified and to set its status to inCopy. If this fails then the field was updated by the mutator and the collector tries again to copy the field. If the mutator encounters an inCopy field when trying to access the wide copy then it will forward the access to the tospace copy.
Because Stopless forces all updates to the most up-to-date location of a field, it also supports Java volatile fields without the need for locking. It is also able to simulate application-level atomic operations like compare-and-swap on fields by the mutator. For details see Pizlo et al [2007]. The only remaining issue is coping with atomic operations on double-word fields (such as Java long), where CompareAndSwapWide is not able to cover both the double-word field and its adjacent status word; the authors of Stopless treat these as a special case [Pizlo et al, 2007]. Because pages chosen for evacuation have at most one third occupancy, one can make use of the dead space for the wide copies. Of course, the reason for evacuating the page is that it is fragmented, so there may not be sufficient contiguous free space available for all the copies. But if segregated-fits allocation is used then the free portions are uniformly sized, and it is possible to allocate the wide objects in multiple wide fragments so as to allocate each data field and its status word side-by-side. In Stopless, the space for the wide objects is retained until the next mark phase has completed.
The Staccato collector of McCloskey et al [2008] avoids expensive synchronisation in the common case, even on multiprocessors with weak memory ordering. Storms of atomic operations are avoided by moving few objects (only as necessary to reclaim sparsely-occupied pages) and by randomising their selection.
Staccato inherits the Brooks-style indirection barrier of Metronome, placing a forwarding pointer in every object header. It also relies on ragged synchronisation: the mutators are instrumented to perform a memory fence (on weakly ordered machines like the PowerPC) at regular intervals (such as GC-safe points) to bring them up to date with any change to global state. The collector reserves a bit in the forwarding pointer to denote that the object is being copied (Java objects are always word-aligned, so a low bit in the pointer can be used). This COPYING bit and the forwarding pointer can be changed atomically using compare-and-swap/set. To move an object, the collector performs the following steps:
1. Use CompareAndSet to set the COPYING bit in the object's forwarding pointer, advertising the intended copy to the mutators.
2. Wait for a ragged synchronisation where every mutator performs a read fence to ensure that all mutators have seen the update to the COPYING bit.
3. Perform a read fence (on weakly ordered machines) to ensure that the collector sees all updates by mutators from before they saw the change to the COPYING bit.
4. Allocate the copy, and copy over the fields from the original.
5. Perform a write fence (on weakly ordered machines) to push the newly written state of the copy to make it globally visible.
6. Wait for a ragged synchronisation where every mutator performs a read fence to ensure that it has seen the values written into the copy.
7. Use CompareAndSet to attempt to commit the copy, replacing the forwarding pointer (with its COPYING bit set) by a reference to the new copy; if this fails then a mutator has aborted the copy.
Algorithm 19.8: Copying and heap access in Staccato using CompareAndSet

copyObjects(candidates):
    for each p in candidates
        /* set COPYING bit */
        CompareAndSet(&forwardingAddress(p), p, p | COPYING)
    waitForRaggedSynch(readFence)       /* ensure mutators see COPYING bits */
    readFence()                         /* ensure collector sees mutator updates from before CAS */
    for each p in candidates
        r ← allocate(length(p))         /* allocate the copy */
        move(p, r)                      /* copy the contents */
        forwardingAddress(r) ← r        /* the copy forwards to itself */
        add(replicas, r)                /* remember the copies */
    writeFence()                        /* flush the copies so the mutators can see them */
    waitForRaggedSynch(readFence)       /* ensure mutators see the copies */
    for each (p in candidates, r in replicas)
        /* try to commit the copy */
        if not CompareAndSet(&forwardingAddress(p), p | COPYING, r)
            /* the commit failed so deal with it */
            free(r)                     /* free the aborted copy */
            add(aborted, p)             /* remember the aborts */
    return aborted

Access(p):
    r ← forwardingAddress(p)            /* load the forwarding pointer */
    if r & COPYING = 0
        return r                        /* use the forwarding pointer only if not copying */
    /* try to abort the copy */
    if CompareAndSet(&forwardingAddress(p), r, p)
        return p                        /* the abort succeeded */
    /* collector committed or another mutator aborted */
    atomic
        r ← forwardingAddress(p)        /* force reload of current forwardingAddress(p) */
    return r

Read(p, i):
    p ← Access(p)
    return p[i]

Write(p, i, value):
    p ← Access(p)
    p[i] ← value
Algorithm 19.9: Heap access (while copying) in Staccato using CompareAndSwap

Access(p):
    r ← forwardingAddress(p)            /* load the forwarding pointer */
    if r & COPYING = 0
        return r                        /* use the forwarding pointer only if not copying */
    /* otherwise try to abort the copy */
    r ← CompareAndSwap(&forwardingAddress(p), r, p)
    /* failure means the collector committed or another mutator aborted, so r is good */
    return r & ~COPYING                 /* success means we aborted, so clear the COPYING bit */
These steps are captured in the copyObjects routine in Algorithm 19.8, which takes a list of candidates to be moved and returns a list of aborted objects that could not be moved.
Meanwhile, when the mutator accesses an object (to examine or modify its state for any reason) it performs the following steps:
1. Load the forwarding pointer.
2. If the COPYING bit is clear, use the forwarding pointer as the object pointer.
3. Otherwise, try to abort the copy, using compare-and-set to reset the forwarding pointer so that it refers back to the original with the COPYING bit clear.
4. Use the forwarding pointer (with the COPYING bit cleared) as the object pointer only if the compare-and-set succeeds.
5. Otherwise, the failure of the compare-and-set means either that the collector committed the copy or else another mutator aborted it. So, reload the forwarding pointer using an atomic read (needed on weakly ordered machines), guaranteed to see the current value of the forwarding pointer (that is, the value placed there by the collector or other mutator).
These steps are shown in the Access barrier helper function, used by both Read and Write in Algorithm 19.8.
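On a real multiprocessor the Access protocol maps naturally onto a compare-and-set over a tagged word. A minimal Java model (hypothetical encoding, with the forwarding word held in an AtomicLong and addresses word-aligned so the low bit can serve as the COPYING flag):

    import java.util.concurrent.atomic.AtomicLong;

    final class StaccatoAccess {
        static final long COPYING = 1L;

        static long access(AtomicLong forwardingWord, long original) {
            long r = forwardingWord.get();
            if ((r & COPYING) == 0) return r;      // fast path: not being copied
            if (forwardingWord.compareAndSet(r, original))
                return original;                   // we aborted the copy
            return forwardingWord.get();           // committed, or aborted by another
        }
    }

The fast path costs only a load and a branch; atomic operations are needed only while an object is actually being copied.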
Objects that are accessed frequently by the mutator are difficult to relocate, because their move is more likely to be aborted. To cope with this, the designers of Staccato suggest that when such a popular object is detected, its page can be made the target of compaction. That is, instead of moving the popular object off of a sparsely populated page, it suffices simply to increase the population density of the page.
Also, abort storms can occur when the collector chooses to move objects that have temporal locality of access by the mutator, degrading minimum mutator utilisation because of the need to run an increased number of CompareAndSwap operations in a short time. This is unlikely because only objects on sparsely populated pages are moved, so objects allocated close together in time are unlikely all to move together. The probability of correlated aborts can be reduced by breaking the defragmentation into several phases to shorten the time window for aborts. Also, the set of pages chosen for defragmentation in each phase can be randomised. Finally, by choosing to run several defragmentation threads at much the same time (though not synchronously, and respecting minimum mutator utilisation requirements), there will be fewer mutator threads running, so reducing the likelihood of aborts.

Algorithm 19.10: Copying and the mutator write barrier in Chicken

copyObjects(candidates):
    for each p in candidates
        forwardingAddress(p) ← p | COPYING      /* set COPYING bit */
    ...

Write(p, i, value):
    r ← forwardingAddress(p)                    /* load the forwarding pointer */
    if r & COPYING ≠ 0
        /* try to abort the copy */
        CompareAndSet(&forwardingAddress(p), r, r & ~COPYING)
        /* failure means the collector committed or another mutator aborted */
        r ← forwardingAddress(p)                /* reload forwardingAddress(p) */
    r[i] ← value
Algorithm 19.11: Copying and heap access in Clover

copySlot(p, i):
    repeat
        value ← p[i]
        r ← forwardingAddress(p)
        r[i] ← value
    until CompareAndSet(&p[i], value, α)        /* mark the field as copied */

Read(p, i):
    value ← p[i]
    if value = α
        r ← forwardingAddress(p)
        value ← r[i]
    return value

Write(p, i, newValue):
    if newValue = α
        sleep until copying ends                /* extremely rare */
    repeat
        oldValue ← p[i]
        if oldValue = α
            r ← forwardingAddress(p)
            r[i] ← newValue
            break
    until CompareAndSet(&p[i], oldValue, newValue)
Clover [Pizlo et al, 2008] offers wait-free heap accesses for the mutator (except in very rare cases) and lock-free copying by the collector. Rather than preventing data races between the collector and the mutator, Clover detects when they occur, and in that rare situation may need to block the mutator until the copying phase has finished. Clover picks a random value α to mark fields that have been copied, and assumes that the mutator can never write that value to the heap. To ensure this, the write barrier includes a check on the value being stored, and will block the mutator if it attempts to do so.
As the collector copies the contents of the original object to the copy, it marks the original fields as copied by overwriting them with the value α using compare-and-swap. Whenever the mutator reads a field and loads the value α it knows that it must reload the up-to-date value of the field via the forwarding pointer (which points to the original if its copy has not been made yet, or the copy if it has). This works even if the true value of the field is α from before the copy phase began.
Whenever the mutator tries to overwrite a field containing the value α, it knows that it must store to the up-to-date location of the field via the forwarding pointer. If the mutator actually tries to store the value α then it must block until copying ends (so that α no longer means a copied field that must be reloaded via the forwarding pointer). We sketch Clover's collector copying routine and mutator barriers in Algorithm 19.11.
For some types, α can be guaranteed not to clash with a proper value: pointers usually have some illegal values that can never be used, and floating point numbers can use any one of the NaN forms so long as the program never generates them. For other types, α needs to be chosen with care to minimise overlap with values used in the program. To make the chance of overlap virtually impossible, Pizlo et al [2008] offer an innovative probabilistic solution, choosing α at random so that the odds of the program writing that exact value become vanishingly small. A further concern is that the collector may be left waiting to install α into a field it is copying while the mutator repeatedly updates the field, causing its CompareAndSwap to fail repeatedly.
All three algorithms aim for lock-free heap access, but with subtle differences. Chicken guarantees wait-free access for both reads and writes. Clover and Stopless provide only lock-free writes, and reads require branching. Clover's lock-free writes are only probabilistic, since it is possible that a heap write must be stalled until copying is complete, as noted above.
Clover never aborts an object copy. Stopless can abort copying an object in the unlikely situation that two or more mutator threads write to the same field at much the same time during entry into the compaction phase (see Pizlo et al [2007] for details). Chicken is much less careful: any write to an object while it is being copied will force the copy to abort.
Benchmarks comparing these collectors and non-compacting concurrent mark-sweep collection show that throughput is highest for the non-compacting collector (because it has much simpler barriers). The copying collectors install their copying-tailored barriers only during the compaction phase, by hot-swapping compiled code at phase changes using the techniques of Arnold and Ryder [2001]. Chicken is fastest (three times slow-down while copying, according to Pizlo in a personal communication), though it results in many more copy aborts, followed by Clover (five times slower while copying) and Stopless (ten times slower while copying). All the collectors scale well on a multiprocessor up to six processors. Because of the throughput slow-downs, copying degrades responsiveness to real-time events for both Clover and Stopless. Responsiveness for Chicken is much better because it stays out of the mutator's way by aborting copies quickly when necessary.
Fragmented allocation
The preceding discussion of compaction for real-time systems reveals that any real-time collector relying on defragmentation to ensure space bounds must trade off throughput and responsiveness to real-time events against the level of fragmentation it is willing to tolerate. Wait-freedom of mutator heap accesses was guaranteed only by Chicken/Staccato, at the price of aborting some copies. Stopless and Clover offer stronger space guarantees, but only with the weaker progress guarantee of lock-freedom for heap accesses. A real-time collector needing hard space bounds may find this tradeoff unacceptable.
For this reason, Siebert has long advocated bounding external fragmentation by allocating all objects in (logically if not physically) discontiguous fixed-size chunks [Siebert, 1998, 2000, 2010], as implemented in his Jamaica VM for real-time Java. The Jamaica VM
splits objects into a list of fixed-size oblets, with each successive oblet requiring an extra level of indirection to access, starting at the head of the list. This results in linear-time access for object fields, depending on the field index. Similarly, arrays are represented as a binary tree of arraylets arranged into a trie data structure [Fredkin, 1960]. Thus, accessing an array element requires a number of indirections logarithmic in the size of the array. The main problem with this scheme is this variable cost of accessing arrays. Worst-case execution time analysis requires knowing (or bounding) statically the size of the array being accessed. However, array size in Java is a dynamic property, so in the absence of other knowledge the worst-case access time for trie-based arrays can in general be bounded only by the size of the largest allocated array in the application, or (worse) the size of the heap itself if that bound is unknown.
To solve this problem, Pizlo et al [2010b] marry the spine-based arraylet allocation techniques of Metronome to the fragmented allocation techniques of the Jamaica VM in a system they call Schism. By allowing objects and arrays to be allocated as fixed-size fragments, there is no need to worry about external fragmentation. Moreover, both object and array accesses have strong time bounds: indirecting through a statically known number (depending on the field offset) of oblets for object accesses, and indirecting through the spine to access the appropriate arraylet for array accesses. To a first order approximation (ignoring cache effects) both operations require constant time. Schism's scheme for allocating fragmented objects and arrays is illustrated in Figure 19.10. An object or array is represented by a 'sentinel' fragment in the heap. Every object or array has a header word for garbage collection and another to encode its type. The sentinel fragment, representing the object or array, contains these and additional header words to encode the remaining structure.
Objects are encoded as a linked list of oblets as in Figure 19.10a. An array that fits in a single fragment is encoded as in Figure 19.10b. Arrays requiring multiple arraylet fragments are encoded with a sentinel that refers to a spine, which contains pointers to each of the arraylet fragments. The spine can be 'inlined' into the sentinel fragment if it is small enough, as in Figure 19.10c. Otherwise, the spine must be allocated separately.
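Array access through a spine can be sketched as follows (constants are illustrative: 128-byte arraylets holding 32 four-byte elements):

    final class Arraylets {
        static final int ELEMS_PER_ARRAYLET = 32;   // assumed layout

        static int read(int[][] spine, int i) {
            // one spine indirection plus a bounded offset: constant time
            return spine[i / ELEMS_PER_ARRAYLET][i % ELEMS_PER_ARRAYLET];
        }
    }

Unlike the trie representation, the access cost here is independent of the array's size, which is what restores a constant worst-case bound.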
The novelty of Schism is that separately allocated array spines need not be allocated in the object/array space. That space is managed entirely as a set of fixed-size fragments using the allocation techniques of the immix mark-region collector. The 128-byte lines of immix are the oblet and arraylet fragments of Schism. Schism adds fragmented allocation and on-the-fly concurrent marking to immix, using an incremental update Dijkstra-style insertion barrier. The fragments never move, but so long as there are sufficient free fragments available any array or object can be allocated. Thus, fragmentation is a non-issue for this space.
[Figure 19.10: Fragmented allocation in Schism.
(a) A two-fragment object with a payload of six to twelve words. The sentinel fragment has three header words: a fragmentation pointer to the next object fragment, a garbage collection header and a type header. Each fragment has a header pointing to the next.
(b) A single-fragment array with a payload of up to four words. The sentinel fragment has four header words: a null fragmentation pointer, a garbage collection header, a type header and an actual length n ≤ 4 words, followed by the inlined array fields.]
Schism has a number of desirable properties. First, mutator accesses to the heap are wait-free and tightly bounded (costing constant time). Second, fragmentation is strictly controlled. Indeed, Pizlo et al [2010b] prove that, given the number and type of objects and arrays (including their size) in the maximum live set of the program, the total memory needed for the program can be strictly bounded at 1.3104b, where b is the size of the maximum live set. Third, as proposed for Jamaica VM by Siebert [2000], Schism can run with contiguous allocation of arrays (objects are always fragmented) when there is sufficient contiguous space. Contiguous arrays are laid out as in Figure 19.10d, except with the payload extending into successive contiguous fragments. This allows for much faster array access without indirection through the spine. These properties mean that Schism has superior throughput compared to other production real-time collectors, while also being tolerant of fragmentation by switching to fragmented allocation of arrays when contiguous allocation fails. This comes at the cost of some slow-down to access the fragmented arrays. The cost of the read and write barrier machinery to access fragmented arrays is such that Schism achieves 77% of the throughput of pure concurrent mark-region garbage collection (without the fragmented array access support).
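The trade-off between the two array layouts can be seen in a sketch of the read path. The discriminator used here (a zero spine slot marking a contiguous array, echoing the null fragmentation pointer of Figure 19.10b) and the header size are illustrative assumptions, not Schism's actual encoding.

    // Contiguous fast path versus fragmented slow path (illustrative only).
    final class MixedArrayAccessSketch {
        static final int WORDS = 32;    // words per 128-byte fragment (assumed)
        static final int HEADER = 4;    // sentinel header words, as in Figure 19.10b

        static int readElement(int[][] heap, int sentinel, int index) {
            int spine = heap[sentinel][0];            // assumed spine slot
            if (spine == 0) {                         // contiguous fast path:
                int word = HEADER + index;            // index straight across
                return heap[sentinel + word / WORDS][word % WORDS];
            }
            int arraylet = heap[spine][index / WORDS]; // fragmented slow path:
            return heap[arraylet][index % WORDS];      // indirect via the spine
        }
    }

The fast path saves the dependent load through the spine, which is the source of the throughput difference reported above.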
For application developers requiring predictability of the cost for array accesses, Schism can be configured always to use fragmented allocation for arrays, at the cost of having to perform spine indirections on all array accesses. The benefit of this is much improved maximum pause times. Since all allocations are performed in terms of fragments, pauses due to allocation are essentially the cost of zero-initialising a four kilobyte page in the slow path of allocation: 0.4 milliseconds on a forty megahertz embedded processor. When allocating arrays contiguously, the allocator must first attempt to locate a contiguous range of fragments, which slows things down enough to cause maximum pauses of around a millisecond on that processor.
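(As a rough cross-check of that figure, and our arithmetic rather than the authors' measurement: a four kilobyte page is 1024 four-byte words, and 0.4 milliseconds at forty megahertz is 16,000 cycles, or roughly 16 cycles per word zeroed, a plausible cost for a simple zeroing loop on such a processor.)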
… window. From the perspective of this task, minimum mutator utilisation is immaterial, so long as the real-time expectations of that task are met. Moreover, minimum mutator utilisation and maximum pause time may be difficult to account for when the only … garbage collector? If so, then how can a system account for them? Pizlo et al [2010a] even go so far as to account for these slow paths using specialised hardware for on-device profiling of an embedded processor. To aid developers lacking such specialised hardware, Pizlo et al [2010b] provide a worst-case execution mode for their Schism collector that forces slow path execution, so that developers can get some reasonable estimate of worst-case execution times during testing.
Glossary
A comprehensive glossary can also be found at http://www.memorymanagement.org.

ABA problem  the inability of certain atomic operations to distinguish reading the same value twice from a memory location as 'nothing changed' versus some other thread changing the value after the first read and then changing it back before the second read.

accurate  see type-accurate.

activation record  a record that saves the state of computation and the return address of a method, sometimes called a frame.

barrier  an action (typically a sequence of code emitted by the compiler) mediating access to an object.

belt  a collection of increments used by the Beltway collector.

best-fit allocation  a free-list allocation strategy that places an object in the cell in the heap that most closely matches the object's size.

big bag of pages allocation (BiBoP)  a strategy that allocates blocks in power of …

card  a small, power of two sized and aligned area of the heap.

compacting  relocating marked (live) objects and updating the pointer values of all live references to objects that have moved, so as to eliminate external fragmentation.

cons cell  … points to the next cons cell in the list.

copy reserve  a space reserved for copying in copying collection.

copying collection  collection that evacuates live objects from one semispace to another (after which, the space occupied by the former can be reclaimed).

creation space  see nursery.

crossing map  a map that decodes how objects span areas (typically cards).

dangling pointer  a pointer to an object that has been reclaimed by the memory manager.

escape analysis  an analysis that determines whether an object may become reachable from outside the method or thread that created it.

evacuating  moving an object from a condemned space to its new location (in tospace); see copying collection or mark-compact collection.

explicit deallocation  the action of deallocation under the control of the programmer, rather than automatically.

external fragmentation  space wasted outside any cell; see also internal fragmentation.

false pointer  a value that was falsely …

false sharing  … that happen to lie in the same cache line, resulting in increased cache coherence traffic.

fast-fits allocation  a sequential fits allocation strategy that uses an index to search for the first or next cell that satisfies the allocation request.

Fibonacci buddy system  a buddy system in which the size classes form a Fibonacci sequence.

frame  a power of two sized and aligned chunk; typically a discontiguous space comprises a number of frames; see also activation record.

free  the state of a cell that is available for allocation.

free-list allocation  an allocation strategy that uses a data structure to record the location and size of free cells.

fromspace  the semispace from which copying collection copies objects.

fromspace invariant  the invariant that the mutator holds only fromspace references.

garbage  an object that is not live but whose space has not been reclaimed.

garbage collection (GC)  an automatic …

heap parsability  the capability to advance through the heap from one object to the next.

heaplet  a subset of the heap containing objects accessible to only a single thread.

hyperthreading  see simultaneous multithreading.

lazy reference counting  deferring the freeing of zero-count objects when reference counting until they are subsequently acquired by the allocator, at which point their children can be processed.

lazy sweeping  sweeping only on demand (when fresh space is required).

leak  see memory leak.

liveness (of object)  the property of an object that will be accessed at some time in the future execution of the mutator.

local allocation buffer (LAB)  a chunk of memory used for allocation by a single thread.

locality  the degree to which items (fields, objects) are accessed together in space or time; see also spatial locality and temporal locality.

lock  a synchronisation mechanism for controlling access to a resource by multiple concurrent threads; usually only one thread at a time can hold the lock, while all other threads must wait.

many-core  … that has a large number of processors on a single integrated circuit chip.

mark bit  a bit (stored in the object's header or on the side in a mark table) recording whether an object is live.

mark-sweep collection  collection that typically operates in two phases, first marking all live objects and then sweeping through the heap, reclaiming the storage of all unmarked, and hence dead, objects.

mark/cons ratio  a common garbage collection metric that compares the amount of work done by the collector ('marking') with the amount of allocation ('consing') done; see cons cell.

marking  recording that an object is live, often by setting a mark bit.

mature object space (MOS)  a space reserved for older (mature) objects, managed without respect to their age.

mmap  a Unix system call that creates a mapping for a range of virtual addresses.

mostly-concurrent collection  a technique for concurrent collection that may pause all mutator threads briefly.

multi-tasking virtual machine  … applications (tasks) within a single invocation of the virtual machine.

multicore  see chip multiprocessor.

non-blocking  a guarantee that threads …

null pointer  … does not refer to any object.

nursery  a space in which objects are created, typically by a generational collector.

partial tracing  tracing only a subset of the …

pinning  … moving a particular object (typically because it is accessible to code that is not collector-aware).

pointer reachability  the property of all live …

program order  the order of writes (and …

promotion  … generation.

promptness  the degree to which a collector reclaims all garbage at each collection cycle.

queue  a first-in, first-out data structure, allowing adding to the back (tail) and removing from the front (head).

raw pointer  a plain pointer (in contrast to a smart pointer).

reachable  the property of an object that can be accessed by following a chain of references from a set of mutator roots.

read barrier  a barrier on reference loads by the mutator.

real-time collection  a technique for concurrent collection or incremental collection supporting a real-time system.

reference  the canonical pointer used to …

… typically be made free in constant time.

release consistency  … acquire, but earlier accesses can happen after the acquire, and release operations prevent earlier accesses from happening after the release, but later accesses can happen before the release.

remembered set (remset)  a set of objects or fields that the collector must process; typically, mutators supported by generational collection, concurrent collection or incremental collection add entries to the remembered set as they create or delete pointers of interest to the collector.

remset  see remembered set.

rendezvous barrier  a code point at which each thread waits until all other threads have reached that point.

root object  an object in the heap referred to directly by a root.

run-time system  the code that supports the …

sequential fits allocation  an allocation strategy that searches the free-list sequentially for a cell that satisfies the allocation request.

sequential store buffer (SSB)  an efficient …

store buffer  see write buffer.

strict consistency  a consistency model in which every memory access and atomic operation appears to occur in the same order everywhere.

strong generational hypothesis  the hypothesis that object lifetime is inversely related to age.

tidy pointer  the canonical pointer used as an object's reference.

time-based scheduling  a technique for scheduling real-time collection that reserves a pre-defined portion of execution time solely for collector work, during which the mutator is stopped.

tospace  the semispace to which copying collection evacuates live objects.

tospace invariant  the invariant that the mutator holds only tospace references.

tracing  visiting the reachable objects by partitioning objects into white (not yet visited) and black (need not be revisited), using grey to represent the remaining work (to be revisited).

type-accurate  a property of a garbage collector that can precisely identify every slot or root that contains a pointer.

virtual machine (VM)  a run-time system that abstracts away details of the underlying hardware or operating system.

wilderness  the last free chunk in the heap.

wilderness preservation  a policy of allocating from the wilderness only as a last resort.

work stealing  a technique for balancing work among threads where lightly loaded threads pull work from more …

write barrier  a barrier on reference stores by the mutator.

write buffer  a buffer that holds pending …

zero count table (ZCT)  a table of objects whose reference counts are zero.
Bibliography
This bibliography contains over 400 references. However, our comprehensive database at http://www.cs.kent.ac.uk/~rej/gcbib/ contains over 2500 garbage collection related publications. This database can be searched online or downloaded as BibTeX, PostScript or PDF. As well as details of the articles, papers, books, theses and so on, the bibliography also contains abstracts for some entries and URLs or DOIs for most of the electronically available ones. We continually strive to keep this bibliography up to date as a service to the community. Here you can help: Richard ([email protected]) would be very grateful to receive further entries (or corrections).
Diab Abuaiadh, Yoav Ossia, Erez Petrank, and Uri Silbershtein. An efficient parallel heap compaction algorithm. In OOPSLA 2004, pages 224-236. doi:10.1145/1028976.1028995. xx, 32, 38, 46, 301, 302, 319

… tutorial. WRL Research Report 95/7, Digital Western Research Laboratory, September 1995. 237

Andrew W. Appel. Garbage collection can be faster than stack allocation. Information Processing Letters, 25(4):275-279, 1987. doi:10.1016/0020-0190(87)90175-X. 125, 171

Andrew W. Appel. Simple generational garbage collection and fast allocation. Software: Practice and Experience, 19(2):171-183, 1989a. doi:10.1002/spe.4380190206. 121, 122, 125, 195, 197

Andrew W. Appel and Zhong Shao. An empirical and analytic study of stack vs. heap cost for languages with closures. Technical Report CS-TR-450-94, Department of Computer Science, Princeton University, March 1994. 171

Andrew W. Appel and Zhong Shao. Empirical and analytic study of stack versus heap cost for languages with closures. Journal of Functional Programming, 6(1):47-74, January 1996. doi:10.1017/S095679680000157X. 171

Andrew W. Appel, John R. Ellis, and Kai Li. Real-time concurrent collection on stock multiprocessors. In PLDI 1988, pages 11-20. doi:10.1145/53990.53992. xvii, 316, 317, 318, 340, 352, 467

J. Armstrong, R. Virding, C. Wikstrom, and M. Williams. Concurrent Programming in Erlang. Prentice-Hall, second edition, 1996. 146

Matthew Arnold and Barbara G. Ryder. A framework for reducing the cost of instrumented code. In PLDI 2001, pages 168-179. doi:10.1145/378795.378832. 412

Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for …

Alain Azagury, Elliot K. Kolodner, and Erez Petrank. A note on the implementation of replication-based garbage collection for multithreaded applications and multiprocessor environments. Parallel Processing Letters, 9(3):391-399, 1999. doi:10.1142/S0129626499000360. 342

Hezi Azatchi, Yossi Levanoni, Harel Paz, and Erez Petrank. An on-the-fly mark and sweep garbage collector based on sliding views. In OOPSLA 2003, pages 269-281. doi:10.1145/949305.949329. 331
David F. Bacon, Perry Cheng, and V.T. Rajan. Controlling fragmentation and space consumption in the Metronome, a real-time garbage collector for Java. In LCTES 2003, pages 81-92. doi:10.1145/780732.780744. 404

David F. Bacon, Perry Cheng, and V.T. Rajan. A unified theory of garbage collection. In OOPSLA 2004, pages 50-68. doi:10.1145/1035292.1028982. 77, 80, 134

David F. Bacon, Perry Cheng, David Grove, and Martin T. Vechev. Syncopation: Generational real-time garbage collection in the Metronome. In LCTES 2005. 384, 385

Jason Baker, Antonio Cunei, Tomas Kalibera, Filip Pizlo, and Jan Vitek. Accurate garbage collection in uncooperative environments revisited. Concurrency and Computation: Practice and Experience, 21(12):1572-1606, 2009. doi:10.1002/cpe.1391. Supersedes Baker et al [2007]. 171

Katherine Barabash, Yoav Ossia, and Erez Petrank. Mostly concurrent garbage collection revisited. In OOPSLA 2003, pages 255-268. doi:10.1145/949305.949328. 319, 320

Katherine Barabash, Ori Ben-Yitzhak, Irit Goft, Elliot K. Kolodner, Victor Leikehman, Yoav Ossia, Avi Owshanko, and Erez Petrank. A parallel, incremental, mostly concurrent garbage collector for servers. ACM Transactions on Programming Languages and Systems, 27(6):1097-1146, November 2005. doi:10.1145/1108970.1108972. 284, 319, 320, 474

David A. Barrett and Benjamin G. Zorn. Using lifetime predictors to improve memory allocation performance. In PLDI 1993, pages 187-196. doi:10.1145/155090.155108. 114

Joel F. Bartlett. Compacting garbage collection with ambiguous roots. Research Report 88/2, DEC Western Research Laboratory, Palo Alto, CA, February 1988a. Also appears as Bartlett [1988b]. 30, 104

Joel F. Bartlett. Compacting garbage collection with ambiguous roots. Lisp Pointers, 1(6):3-12, April 1988b. doi:10.1145/1317224.1317225. 432

Peter B. Bishop. Computer Systems with a Very Large Address Space and Garbage Collection. PhD thesis, MIT Laboratory for Computer Science, May 1977. doi:1721.1/16428. Technical report MIT/LCS/TR-178. 103, 140
Stephen M. Blackburn and Antony L. Hosking. Barriers: Friend or foe? In ISMM 2004, pages 143-151. doi:10.1145/1029873.1029891. 202, 203

Stephen M. Blackburn and Kathryn S. McKinley. In or out? Putting write barriers in their place. In ISMM 2002, pages 175-184. doi:10.1145/512429.512452. 80

Stephen M. Blackburn, Matthew Hertz, Kathryn S. McKinley, J. Eliot B. Moss, and Ting Yang. Profile-based pretenuring. ACM Transactions on Programming Languages and Systems, 29(1):1-57, 2007. doi:10.1145/1180475.1180477. 110, 132

Hans-Juergen Boehm. Mark-sweep vs. copying collection and asymptotic complexity. http://www.hpl.hp.com/personal/Hans_Boehm/gc/complexity.html, September 1995. 26

Hans-Juergen Boehm. Reducing garbage collector cache misses. In ISMM 2000, pages 59-64. doi:10.1145/362422.362438. 23, 27

Tim Brecht, Eshrat Arjomandi, Chang Li, and Hang Pham. Controlling garbage collection and heap growth to reduce the execution time of Java applications. In OOPSLA 2001, pages 353-366. doi:10.1145/504282.504308. 209
Tim Brecht, Eshrat Arjomandi, Chang Li, and Hang Pham. Controlling garbage collection and heap growth to reduce the execution time of Java applications. ACM Transactions on Programming Languages and Systems, 28(5):908-941, September 2006. doi:10.1145/1152649.1152652. 209

R. P. Brent. Efficient implementation of the first-fit strategy for dynamic storage allocation. ACM Transactions on Programming Languages and Systems, 11(3):388-403, July 1989. doi:10.1145/65979.65981. 139

Rodney A. Brooks. Trading data space for reduced time and code space in real-time …

… Nancy, France, September 1985, pages 273-288. Volume 201 of Lecture Notes in Computer Science, Springer-Verlag. doi:10.1007/3-540-15975-4_42. 67

F. Warren Burton. A buddy system variation for disk storage allocation. Communications …

… Operating Systems, San Jose, CA, October 1998, pages 139-149. ACM SIGPLAN Notices 33(11), ACM Press. doi:10.1145/291069.291036. 50

D.C. Cann and Rod R. Oldehoeft. Reference count and copy elimination for parallel …

Luca Cardelli, James Donahue, Lucille Glassman, Mick Jordan, Bill Kalsow, and Greg Nelson. Modula-3 language definition. ACM SIGPLAN Notices, 27(8):15-42, August 1992. doi:10.1145/142137.142141. 340

… Systems and Applications (RTCSA), August 2005, pages 185-188. IEEE Press.

David R. Chase. Garbage Collection and Other Optimizations. PhD thesis, Rice University, August 1987. doi:1911/16127. 104

Andrew M. Cheadle, Anthony J. Field, Simon Marlow, Simon L. Peyton Jones, and R. L. While. Non-stop Haskell. In 5th ACM SIGPLAN International Conference on Functional …

Perry Cheng and Guy Blelloch. A parallel, real-time garbage collector. In PLDI 2001, pages 125-136. doi:10.1145/378795.378823. 7, 187, 289, 290, 304, 377, 382, 384, 468

W. T. Comfort. Multiword list items. Communications of the ACM, 7(6):357-362, June 1964. doi:10.1145/512274.512288. 94

Eric Cooper, Scott Nettles, and Indira Subramanian. Improving the performance of SML garbage collection using application-specific virtual memory management. In LFP 1992, pages 43-52. doi:10.1145/141471.141501. 209
Erik Corry. Optimistic stack allocation for Java-like languages. In ISMM 2006, pages 162-173. doi:10.1145/1133956.1133978. 147

Jim Crammond. A garbage collection algorithm for shared memory parallel processors. International Journal of Parallel Programming, 17(6):497-522, 1988. doi:10.1007/BF01407816. 299, 300

David Detlefs. Automatic inference of reference-count invariants. In 2nd Workshop on Semantics, Program Analysis, and Computing Environments for Memory Management (SPACE), Venice, Italy, January 2004a. 152

David Detlefs. A hard look at hard real-time garbage collection. In 7th International …

David Detlefs, William D. Clinger, Matthias Jacob, and Ross Knippel. Concurrent remembered set refinement in generational garbage collection. In 2nd Java Virtual Machine Research and Technology Symposium, San Francisco, CA, August 2002a. USENIX. 196, 197, 199, 201, 319

David Detlefs, Christine Flood, Steven Heller, and Tony Printezis. Garbage-first garbage collection. In ISMM 2004, pages 37-48. doi:10.1145/1029873.1029879. 150, 159

David L. Detlefs. Concurrent garbage collection for C++. Technical Report CMU-CS-90-119, Carnegie Mellon University, Pittsburgh, PA, May 1990. 340

David L. Detlefs, Paul A. Martin, Mark Moir, and Guy L. Steele. Lock-free reference counting. In 20th ACM Symposium on Principles of Distributed Computing, Newport, Rhode Island, August 2001, pages 190-199. ACM Press. doi:10.1145/383962.384016. 365

David L. Detlefs, Paul A. Martin, Mark Moir, and Guy L. Steele. Lock-free reference counting. Distributed Computing, 15:255-271, 2002b. doi:10.1007/s00446-002-0079-z. 365

John DeTreville. Experience with concurrent garbage collectors for Modula-2+. Technical Report 64, DEC Systems Research Center, Palo Alto, CA, August 1990. 338, 340, 366

Sylvia Dieckmann and Urs Holzle. A study of the allocation behaviour of the SPECjvm98 Java benchmarks. In Rachid Guerraoui, editor, 13th European Conference on Object-Oriented Programming, Lisbon, Portugal, July 1999, pages 92-115. Volume 1628 of Lecture Notes in Computer Science, Springer-Verlag. doi:10.1007/3-540-48743-3_5. 59, 114, 125

Edsger W. Dijkstra, Leslie Lamport, A. J. Martin, C. S. Scholten, and E. F. M. Steffens. On-the-fly garbage collection: An exercise in cooperation. Communications of the ACM, 21(11):966-975, November 1978.

Amer Diwan, J. Eliot B. Moss, and Richard L. Hudson. Compiler support for garbage collection in a statically typed language. In PLDI 1992, pages 273-282.

Julian Dolby and Andrew A. Chien. An automatic object inlining optimization and its evaluation. In PLDI 2000, pages 345-357. doi:10.1145/349299.349344. 148

Tamar Domani, Elliot K. Kolodner, Ethan Lewis, Erez Petrank, and Dafna Sheinwald. Thread-local heaps for Java. In ISMM 2002, pages 76-87. doi:10.1145/512429.512439. 109, 110, 146

Kevin Donnelly, Joe Hallett, and Assaf Kfoury. Formal semantics of weak references. In ISMM 2006, pages 126-137. doi:10.1145/1133956.1133974. 228

R. Kent Dybvig, Carl Bruggeman, and David Eby. Guardians in a generation-based garbage collector. In PLDI 1993, pages 207-216. doi:10.1145/155090.155110. 220

ECOOP 2007, Erik Ernst, editor. 21st European Conference on Object-Oriented Programming, Berlin, Germany, July 2007. Volume 4609 of Lecture Notes in Computer Science, Springer-Verlag. doi:10.1007/978-3-540-73589-2. 440, 460
Daniel R. Edelson. Smart pointers: They're smart, but they're not pointers. In USENIX C++ Conference, Portland, OR, August 1992. USENIX. 59, 74

Toshio Endo, Kenjiro Taura, and Akinori Yonezawa. A scalable mark-sweep garbage collector on large-scale shared-memory machines. In ACM/IEEE Conference on Supercomputing, San Jose, CA, November 1997. doi:10.1109/SC.1997.10059. xvii, …

Shahrooz Feizabadi and Godmar Back. Java garbage collection scheduling in utility accrual scheduling environments. In 3rd International Workshop on Java Technologies for Real-time and Embedded Systems (JTRES), San Diego, CA, 2005. 415

Shahrooz Feizabadi and Godmar Back. Garbage collection-aware scheduling in utility accrual scheduling environments. Real-Time Systems, 36(1-2), July 2007. doi:10.1007/s11241-007-9020-7. 415

Robert R. Fenichel and Jerome C. Yochelson. A Lisp garbage collector for virtual memory computer systems. Communications of the ACM, 12(11):611-612, November 1969. doi:10.1145/363269.363280. 43, 44, 50, 107

Stephen J. Fink and Feng Qian. Design, implementation and evaluation of adaptive recompilation with on-stack replacement. In 1st International Symposium on Code Generation and Optimization (CGO), San Francisco, CA, March 2003, pages 241-252. IEEE …

Christine Flood, Dave Detlefs, Nir Shavit, and Catherine Zhang. Parallel garbage collection for shared memory multiprocessors. In JVM 2001. xvii, xx, 34, 36, 248, 278, 280, 282, 283, 284, 288, 289, 292, 298, 300, 301, 303, 304, 357, 465, 468, 474

Daniel P. Friedman and David S. Wise. Reference counting can manage the circular environments of mutual recursion. Information Processing Letters, 8(1):41-45, January 1979. doi:10.1016/0020-0190(79)90091-7. 66, 67
… Ramakrishna. Method and mechanism for finding references in a card in time linear in the size of the card in a garbage-collected heap. United States Patent 7,136,887 B2, Sun Microsystems, November 2006. xxi, 200, 201

David Gay and Bjarne Steensgaard. Fast escape analysis and stack allocation for …

GC 1993. OOPSLA Workshop on Garbage Collection in Object-Oriented Systems, October 1993. 444, 460

Andy Georges, Dries Buytaert, and Lieven Eeckhout. Statistically rigorous Java performance evaluation. In ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, Montreal, Canada, October 2007, pages 57-76. ACM SIGPLAN Notices 42(10), ACM Press. doi:10.1145/1297027.1297033. 10

Joseph (Yossi) Gil and Itay Maman. Micro patterns in Java code. In OOPSLA 2005, pages 97-116. doi:10.1145/1094811.1094819. 132

O. Goh, Yann-Hang Lee, Z. Kaakani, and E. Rachlin. Integrated scheduling with garbage collection for real-time embedded applications in CLI. In 9th International Symposium on Object-Oriented Real-Time Distributed Computing, Gyeongju, Korea, April 2006. IEEE Press. doi:10.1109/ISORC.2006.41. 415

Benjamin Goldberg. Tag-free garbage collection for strongly typed programming languages. In PLDI 1991, pages 165-176. doi:10.1145/113445.113460. 171, 172

Benjamin Goldberg. Incremental garbage collection without tags. In European Symposium on Programming, Rennes, France, February 1992, pages 200-218. Volume 582 of Lecture Notes in Computer Science, Springer-Verlag. doi:10.1007/3-540-55253-7_12. 171

Benjamin Goldberg and Michael Gloger. Polymorphic type reconstruction for garbage collection without tags. In LFP 1992, pages 53-65. doi:10.1145/141471.141504. 171, 172
James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Specification. Addison-Wesley, third edition, May 2005. 346

Eiichi Goto. Monocopy and associative algorithms in an extended LISP. Technical Report 74-03, Information Science Laboratories, Faculty of Science, University of Tokyo, 1974. 169

Dan Grossman, Greg Morrisett, Trevor Jim, Michael Hicks, Yanling Wang, and James Cheney. Region-based memory management in Cyclone. In PLDI 2002, pages 282-293. doi:10.1145/512529.512563. 106

Chris Grzegorczyk, Sunil Soman, Chandra Krintz, and Rich Wolski. Isla Vista heap sizing: Using feedback to avoid paging. In 5th International Symposium on Code Generation and Optimization (CGO), San Jose, CA, March 2007, pages 325-340. IEEE Computer Society Press. doi:10.1109/CGO.2007.20. 209

Samuel Guyer and Kathryn McKinley. Finding your cronies: Static analysis for dynamic object colocation. In OOPSLA 2004, pages 237-250. doi:10.1145/1028976.1028996. 110, 132, 143

Robert H. Halstead. Implementation of Multilisp: Lisp on a multiprocessor. In LFP 1984, …

Tim Harris and Keir Fraser. Language support for lightweight transactions. In OOPSLA 2003, pages 388-402. doi:10.1145/949305.949340. 272

Roger Henriksson. Scheduling Garbage Collection in Embedded Systems. PhD thesis, Lund Institute of Technology, July 1998. xviii, xx, 377, 386, 387, 388, 389, 390, 391, 393, 399

Maurice Herlihy and J. Eliot B. Moss. Lock-free garbage collection for multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 3(3):304-311, May 1992.

Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann, April 2008. xxiii, 2, 229, 240, 243, 254, 255, 256

… Optimizing the read and write barrier for orthogonal persistence. In Ronald Morrison, Mick J. Jordan, and Malcolm P. Atkinson, editors, 8th International Workshop on Persistent Object Systems (August 1998), Tiburon, CA, 1999, pages 149-159. Advances in Persistent Object Systems, Morgan Kaufmann. 323

Richard L. Hudson and Amer Diwan. Adaptive garbage collection for Modula-3 and Smalltalk. In GC 1990. 195

Richard L. Hudson and J. Eliot B. Moss. Sapphire: Copying GC without stopping the world. In Joint ACM-ISCOPE Conference on Java Grande, Palo Alto, CA, June 2001, pages 48-57. ACM Press. doi:10.1145/376656.376810. 346, 361
ISMM 2000, Craig Chambers and Antony L. Hosking, editors. 2nd International Symposium on Memory Management, Minneapolis, MN, October 2000. ACM SIGPLAN Notices 36(1), ACM Press. doi:10.1145/362422. 434, 440, 442, 453, 456, 457

ISMM 2004, David F. Bacon and Amer Diwan, editors. 4th International Symposium on Memory Management, Vancouver, Canada, October 2004. ACM Press.

ISMM 2006, Erez Petrank and J. Eliot B. Moss, editors. 5th International Symposium on Memory Management, Ottawa, Canada, June 2006. ACM Press.

ISMM 2009, Hillel Kolodner and Guy Steele, editors. 8th International Symposium on Memory Management, Dublin, Ireland, June 2009. ACM Press. doi:10.1145/1542431. 434, 450, 459

ISMM 2010, Jan Vitek and Doug Lea, editors. 9th International Symposium on Memory Management, Toronto, Canada, June 2010. ACM Press. doi:10.1145/1806651. 443, 455

IWMM 1992, Yves Bekkers and Jacques Cohen, editors. International Workshop on Memory Management, St Malo, France, 17-19 September 1992. Volume 637 of Lecture Notes in …

JVM 2001. 1st Java Virtual Machine Research and Technology Symposium, Monterey, CA, April 2001. USENIX. 440, 453
Tomas Kalibera. Replicating real-time garbage collector for Java. In 7th International …

Tomas Kalibera, Filip Pizlo, Antony L. Hosking, and Jan Vitek. Scheduling hard real-time …

Taehyoun Kim, Naehyuck Chang, Namyun Kim, and Heonshik Shin. Scheduling garbage collector for embedded real-time systems. In ACM SIGPLAN Workshop on Languages, …

H. T. Kung and S. W. Song. An efficient parallel garbage collection system and its correctness proof. In IEEE Symposium on Foundations of Computer Science, 1977, pages 120-131. IEEE Press. doi:10.1109/SFCS.1977.5. 326, 329

Bernard Lang and Francis Dupont. Incremental incrementally compacting garbage collection. In Symposium on Interpreters and Interpretive Techniques, St Paul, MN, June 1987, pages 253-263. ACM SIGPLAN Notices 22(7), ACM Press.

LFP 1992. ACM Conference on LISP and Functional Programming, San Francisco, CA, June 1992. ACM Press. doi:10.1145/141471. 437, 441

Henry Lieberman and Carl E. Hewitt. A real-time garbage collector based on the lifetimes of objects. Communications of the ACM, 26(6):419-429, 1983. doi:10.1145/358141.358147. Also report TM-184, Laboratory for Computer Science, MIT, Cambridge, MA, July 1980 and AI Lab Memo 569, 1981. 103, 116

Rafael D. Lins. Cyclic reference counting with lazy mark-scan. Information Processing Letters, 44(4):215-220, 1992. doi:10.1016/0020-0190(92)90088-D. Also Computing Laboratory Technical Report 75, University of Kent, July 1990. 72

Sebastien Marion, Richard Jones, and Chris Ryder. Decrypting the Java gene pool: Predicting objects' lifetimes with micro-patterns. In ISMM 2007, pages 67-78. doi:10.1145/1296907.1296918. 110, 132

Maged M. Michael and M. L. Scott. Correction of a memory management method for lock-free data structures. Technical Report UR CSD/TR59, University of Rochester, December 1995. doi:1802/503. 374

James S. Miller and Guillermo J. Rozas. Garbage collection is fast, but a stack is faster. Technical Report AIM-1462, MIT AI Laboratory, March 1994. doi:1721.1/6622. 171

David A. Moon. Garbage collection in a large LISP system. In LFP 1984, pages 235-245. doi:10.1145/800055.802040. 50, 51, 202, 296

F. Lockwood Morris. A time- and space-efficient garbage compaction algorithm. Communications of the ACM, 21(8):662-665, 1978. doi:10.1145/359576.359583. 36, 42, 299

F. Lockwood Morris. On a comparison of garbage collection techniques. Communications of the ACM, 22(10):571, October 1979. 37, 42

F. Lockwood Morris. Another compacting garbage collector. Information Processing Letters, 15(4):139-142, October 1982. doi:10.1016/0020-0190(82)90094-1. 37, 38, 42
… Systems, Seattle, WA, March 2008, pages 265-276. ACM SIGPLAN Notices 43(3), ACM Press.

Gene Novark, Trevor Strohman, and Emery D. Berger. Custom object layout for …

OOPSLA 1999. ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, Denver, CO, October 1999. ACM SIGPLAN Notices 34(10), ACM Press. doi:10.1145/320384. 434, 457

OOPSLA 2001. ACM SIGPLAN Conference on Object-Oriented Programming, Systems, …

Yoav Ossia, Ori Ben-Yitzhak, Irit Goft, Elliot K. Kolodner, Victor Leikehman, and Avi Owshanko. A parallel, incremental and concurrent GC for servers. In PLDI 2002, pages 129-140. doi:10.1145/512529.512546. 284, 285, 288, 296, 304, 474

Yoav Ossia, Ori Ben-Yitzhak, and Marc Segal. Mostly concurrent compaction for …

Harel Paz, David F. Bacon, Elliot K. Kolodner, Erez Petrank, and V.T. Rajan. Efficient …

Harel Paz, Erez Petrank, and Stephen M. Blackburn. Age-oriented concurrent garbage collection. In CC 2005, pages 121-136. doi:10.1007/978-3-540-31985-6_9. 369

Harel Paz, Elliot K. Kolodner, David F. Bacon, Erez Petrank, and V. T. Rajan. An efficient …

James L. Peterson and Theodore A. Norman. Buddy systems. Communications of the ACM, …

Erez Petrank and Dror Rawitz. The hardness of cache conscious data placement. In Twenty-ninth Annual ACM Symposium on Principles of Programming Languages, Portland, OR, January 2002, pages 101-112. ACM SIGPLAN Notices 37(1), ACM Press. doi:10.1145/503272.503283. 49

Pekka P. Pirinen. Barrier techniques for incremental tracing. In ISMM 1998, pages 20-25. doi:10.1145/286860.286863. xvii, 20, 315, 316, 317, 318

Filip Pizlo and Jan Vitek. Memory management for real-time Java: State of the art. In 11th International Symposium on Object-Oriented Real-Time Distributed Computing, Orlando, FL, 2008, pages 248-254. IEEE Press. doi:10.1109/ISORC.2008.40. 377

Filip Pizlo, Daniel Frampton, Erez Petrank, and Bjarne Steensgaard. Stopless: A real-time garbage collector for multiprocessors. In ISMM 2007, pages 159-172. doi:10.1145/1296907.1296927. 406, 412

Filip Pizlo, Erez Petrank, and Bjarne Steensgaard. A study of concurrent real-time garbage collectors. In PLDI 2008, pages 33-44. doi:10.1145/1379022.1375587. 410, 411, 412

Filip Pizlo, Lukasz Ziarek, Ethan Blanton, Petr Maj, and Jan Vitek. High-level …

PLDI 2006, Michael I. Schwartzbach and Thomas Ball, editors. ACM SIGPLAN Conference on Programming Language Design and Implementation, Ottawa, Canada, June 2006. ACM SIGPLAN Notices 41(6), ACM Press. doi:10.1145/1133981. 436, 446, 458

PLDI 2008, Rajiv Gupta and Saman P. Amarasinghe, editors. ACM SIGPLAN Conference on Programming Language Design and Implementation, Tucson, AZ, June 2008. ACM SIGPLAN Notices 43(6), ACM Press. doi:10.1145/1375581. 433, 452

Tony Printezis. On measuring garbage collection responsiveness. Science of Computer Programming, 62(2):164-183, October 2006. doi:10.1016/j.scico.2006.02.004. 375, 376, 415

Tony Printezis and David Detlefs. A generational mostly-concurrent garbage collector. In ISMM 2000. …
Erik Ruf. Effective synchronization removal for Java. In PLDI 2000, pages 208-218. doi:10.1145/349299.349327. 145

Narendran Sachindran and Eliot Moss. MarkCopy: Fast copying GC with less space overhead. In OOPSLA 2003, pages 326-343. doi:10.1145/949305.949335. 154, 155, 159

… concurrent programs that use message passing. Science of Computer Programming, 62(2):98-121, October 2006. doi:10.1016/j.scico.2006.02.006. 146

Robert A. Shaw. Empirical Analysis of a Lisp System. PhD thesis, Stanford University, 1988. Technical Report CSL-TR-88-351. 116, 118, 192, 202

Yefim Shuf, Manish Gupta, Hubertus Franke, Andrew Appel, and Jaswinder Pal Singh. Creating and preserving locality of Java applications at allocation and garbage collection times. In OOPSLA 2002, pages 13-25. doi:10.1145/582419.582422. 53

… Kong, 1999, pages 96-102. IEEE Press, IEEE Computer Society Press. doi:10.1109/RTCSA.1999.811198. 182

David Siegwart and Martin Hirzel. Improving locality with parallel hierarchical copying GC. In ISMM 2006, pages 52-63. doi:10.1145/1133956.1133964. xx, xxi, 52, 295, 296, 297, 304, 468

Daniel Spoonhower, Guy Blelloch, and Robert Harper. Using page residency to balance tradeoffs in tracing garbage collection. In VEE 2005, pages 57-67. doi:10.1145/1064979.1064989. 149, 150, 152

James W. Stamos. Static grouping of small objects to enhance performance of a paged virtual memory. ACM Transactions on Computer Systems, 2(3):155-180, May 1984. doi:10.1145/190.194. 50

James William Stamos. A large object-oriented virtual memory: Grouping strategies, measurements, and performance. Master's thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, April 1982. doi:1721.1/15807. 50

… 316, 330

… Report CSL-TR-87-324. 27

V. Stenning. On-the-fly garbage collection. Unpublished notes, cited by Gries [1977], 1976. 315

… Symposium on Operating Systems Principles, Bretton Woods, NH, October 1983, pages 30-32. ACM SIGOPS Operating Systems Review 17(5), ACM Press. doi:10.1145/800217.806613. 92
James M. Stichnoth, Guei-Yuan Lueh, and Michal Cierniak. Support for garbage collection at every instruction in a Java compiler. In PLDI 1999, pages 118-127. doi:10.1145/301618.301652. 179, 180, 181, 188

Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3), March 2005. 275

Stephen P. Thomas. Having your cake and eating it: Recursive depth-first copying garbage collection with no extra stack. Personal communication, May 1995a. 170

David M. Ungar and Frank Jackson. Tenuring policies for generation-based storage reclamation. In ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages, and Applications, San Diego, CA, November 1988, pages 1-17. ACM SIGPLAN Notices 23(11), ACM Press. doi:10.1145/62083.62085. 114, 116, 121, 123, 137, 138, 140

David M. Ungar and Frank Jackson. An adaptive tenuring policy for generation scavengers. ACM Transactions on Programming Languages and Systems, 14(1):1-27, 1992. doi:10.1145/111186.116734. 121, 123, 138

Maxime van Assche, Joel Goossens, and Raymond R. Devillers. Joint garbage collection and hard real-time scheduling. Journal of Embedded Computing, 2(3-4):313-326, 2006. Also published in RTS'05 International Conference on Real-Time Systems, 2005. 415

Michal Wegiel and Chandra Krintz. The mapping collector: Virtual memory support for …

… GC 1993. 139

Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Dynamic storage allocation: A survey and critical review. In IWMM 1995, pages 1-116. doi:10.1007/3-540-60368-9_19. 10, 90

Paul R. Wilson, Mark S. Johnstone, Michael Neely, and David Boles. Memory allocation policies reconsidered. Unpublished manuscript, 1995b. 96

David S. Wise. The double buddy-system. Computer Science Technical Report TR79, Indiana University, Bloomington, IN, December 1978. 96

Feng Xian, Witawas Srisa-an, C. Jia, and Hong Jiang. AS-GC: An efficient generational garbage collector for Java application servers. In ECOOP 2007, pages 126-150. doi:10.1007/978-3-540-73589-2_7. 107

Ting Yang, Emery D. Berger, Matthew Hertz, Scott F. Kaplan, and J. Eliot B. Moss. Autonomic heap sizing: Taking real memory into account. In ISMM 2004, pages 61-72. doi:10.1145/1029873.1029881. 209

Benjamin Zorn. Barrier methods for garbage collection. Technical Report CU-CS-494-90, University of Colorado, Boulder, November 1990. 124, 125, 323, 393

Benjamin Zorn. The measured cost of conservative garbage collection. Software: Practice and Experience, 23:733-756, 1993. doi:10.1002/spe.4380230704. 116

Benjamin G. Zorn. Comparative Performance Evaluation of Garbage Collection Algorithms. PhD thesis, University of California, Berkeley, March 1989. Technical Report UCB/CSD 89/544. 10, 113
Index
Note: If an entry has particular defining occurrences, the page numbers appear in bold, such as 17, and if it has occurrences deemed primary, their page numbers appear in italics, such as 53.

ABA problem, 2, 238, 239, 285, 374, see also Concurrency, hardware primitives: CompareAndSwap; LoadLinked and StoreConditionally
Abstract concurrent collection, 331-335
    addOrigins, 333, 334
allocate
    best-fit, 91
    first-fit, 89
    immix, 153
    lazy sweeping, 25
    next-fit, 91
    real-time collection, replicating, 380, 382
    segregated-fits allocation, 95
Allocation colour (black or grey), see Tricolour abstraction, allocation colour
Allocation threshold, see Hybrid mark-sweep, copying
Allocators, custom, 3
Atomic operations, 15
Atomic primitives, see Concurrency, hardware primitives
AtomicAdd, 241, 252, 253
AtomicDecrement, 241
AtomicExchange, 232-234
AtomicIncrement, 241, 365
Atomicity, see Concurrency; Transactions
Best-fit, see Free-list allocation, best-fit
BiBoP, see Big bag of pages technique
Big bag of pages technique, 27, 168-169, 183, 294, 295, see also Sequential-fits allocation, block-based
Bitmapped-fits, see Allocation, bitmaps
…, 256-257, 208
Boot image, 132, 171
Bounded mutator utilisation, 7
Brooks's indirection barriers, 340-341, 386, 391, 393, 404-407
    Read, 341
    Write, 341
Buddy system allocation, see …
… reference counting
    abstract, 83
    abstract deferred, 84
    coalesced, 65
    deferred, 62
    tracing, abstract, 82
Caches
    associativity
        direct-mapped, 231
        fully-associative, 231
        set-associative, 231
    coherence, 232-234
    contention, 233
    exclusive/inclusive, 231
    levels, 231
    replacement policy, 231
    write-back, 231, 232, 235
    write-through, 231
Canonicalisation tables, 221, 222
Card tables, 12, 124, 151, 156, 193, 197-201, see also Crossing maps
    concurrent collection, 318-321
    space overhead, 198
    summarising cards, 201, 319
    two-level, 199
    Write, 198
Cards, 12
Cartesian trees, see Free-list allocation, Cartesian trees
collectNursery, 135
Collector, 12
Collector threads, 12, 15
collectorOn/Off, real-time collection, replicating, 382, 383
compact
    Compressor, 40
    mark-compact, 33, 35
    threaded compaction (Jonkers), 37
Compaction
    concurrent, see Concurrent copying and compaction
    incremental, see Hybrid mark-sweep, copying
    need for, 40
    parallel, 299-302
Compressor, 302, see also Concurrent Compressor
Concurrent collection
    snapshot-at-the-beginning approach, 314
    termination, 244, 248-252, 313, 332
    throughput, 313, 345
    work list, 319
    write barriers, 314-321, 330, 348
Concurrent copying and compaction
    Compressor, 352-354
    Pauseless, 355
    replicating, see Concurrent copying and compaction, replicating collection
    self-erasing, 340-341
    self-healing, 357, 358, 360
Concurrent data structures
    coarse-grained locking, see Locks, coarse-grained locking
    counting locks, 253, 254
    deques, 251, 331
    fine-grained locking, see Locks, fine-grained locking
    lazy update, 255
    linked-lists, 256-261, 271
    non-blocking, 255
    optimistic locking, see Locks, optimistic locking
    queues, 256-267
        bounded, 256, 259, 261-268, 271
Concurrent reference counting
    coalesced, 368-369, see also sliding views
        incrementNew, 368, 369
        Write, 370
    correctness, 363-366
    Write, 364, 365, 367
Connectivity-based collection, 143-144
    time overhead, 144
… bitmap marking, 23
copy
    Baker's algorithm, 339
    copying, semispace, 45
Copy reserve, see Copying, copy reserve
Copying, 17, 43-56, 79, 126, 127, 140, 152, 157, 158, see also Hybrid mark-sweep, copying
    allocate, 44
    approximately depth-first, see …
    Cheney scanning, 44-46
    copyWord (Sapphire), 350
    depth-first, see traversal order
    flip, 45
    flipping, 139
    forwarding addresses, see Forwarding addresses, copying collection
    fragmentation solution, 43
    implementations, 44-53
    parallel
        Blelloch and Cheng algorithm, …
        dominant-thread tracing, 293-294
        Flood et al, 292-293, 298
        generational, 298
        memory-centric, 294-298
        Oancea et al, see channels
        Ogasawara, see dominant-thread tracing
        processor-centric, 289-294
        remembered sets, 298
        rooms, 382, see Blelloch and Cheng
        Siegwart and Hirzel, 296-297
        termination, 292, 298
    performance, 49, 54
    stack, 44
Correctness, 13, 79, 278
countingLock, 254
Critical sections, see Concurrency, mutual exclusion
Crossing maps, 101, 182, 199-201, see also Card tables; Heap parsing
    search, 200
Custom allocators, see Allocators, custom
Cyclic data structures, 3, 140, 157, see also Reference counting, cycles
    segregating, 105
Cyclone, 106
Dangling pointers, see Pointers, dangling
Deadlock, 2
Deallocation, explicit, 2-3
decide, 243, 247
decNursery, 136
Evacuating, 43
Evacuation threshold, see Hybrid mark-sweep, copying
ExactVM, 118, 138, 145
exchangeLock, 233
exitRoom, 291
Explicit deallocation, see Deallocation, explicit
Explicit freeing, see Deallocation, explicit
External pointers, see Pointers, external
False sharing, see Cache lines, false sharing
Fast and slow path, 80, 164, 165, 191
    block-structured allocation, 53
Fast-fits allocation, see Free-list allocation, Cartesian trees
Fat pointers, see Pointers, fat
FetchAndAdd, 241, 264, 265, 289-291, 377, 382, 383
Fields, 12
Finalisation, 13, 213-221, 223, 224, 330
    .NET, 220-221
Fragmentation, …
    pinned objects, 186
    real-time collection, 403-415
    segregated-fits allocation, 95
Frames, 12, 204-205
    generational write barrier, 205
    partitioning, 109
free, 14
Free pointer, 87
Free-list allocation, 87-93, 102, 105, 126, …
Garbage collection
    defined, 3-5
    experimental methodology, 10
    importance, 3
    memory leaks, 4
    optimisations for specific languages, 8
    partitioning the heap, see Partitioned collection
    performance, 9
    portability, 9
Generational collection, …
    collection sets, 114
    heap layout, 114-117
    inter-generational pointers, see Pointers, inter-generational
    large object space, 114
    long-lived data, 41, 132
    major collection, see full heap collection
    Mature Object Space, see Mature Object Space collector
    measuring time, 113
    minor collection, 112, 113, 115, 116, 119, 120, 122, 123
    multiple generations, 115-116
    nepotism, 113, 114
    nursery collection, 112, 121, 126, 127, 133, 141, 146, 157
    pretenuring, 110, 132
    promoting, 110, 111, 112-117, 121, 127, 130, 132-134
        en masse, 116, 122
        feedback, 123
    read barriers, see Read barriers
    reference counting, see Concurrent reference counting, age-oriented
Generational hypothesis
    strong, see Strong generational hypothesis
    weak, 106, 111, 113, 121, 130, 370
Grey packets, …
Grey protected, see Objects, grey protected
Handles, 1, 104, 184, 185
    advantages, 1
happens-before, see Concurrency, happens-before
Hash consing, see Allocation, hash consing
Haskell, 8, 113, 118, 125, 161, 162, 165, 170, 171, 228, 292, 296, 341, see also Functional languages
Heap layout, 203-205, see also Virtual memory techniques and specific collection algorithms
Heap nodes, 12
Heap parsing, 20, 166, 168, 170, 182, see also Allocation, heap parsability
Heaplets, see Thread-local heaps
Heaps, 11
    block-structured, 22, 31, 122, 152, 166-168, 183, 294-297
    relocating, 205
    size, 208-210
Hierarchical decomposition, 51-53, 55, 296, 297
Incremental collection, …
    incremental incrementally compacting collector, 150
    Treadmill collector, see Treadmill collector
Incremental compaction, see Hybrid mark-sweep, copying
Linearisation points, 254-256
Lisp, 50, 58, 113, 115, 124, 162, 164, 169, 171, 190, 220, 226, 323, 384, see also Scheme
Lisp 2 algorithm, see Mark-compact, Lisp 2 algorithm
…
    copying, 46
    free-list allocation, 54
    lazy sweeping, 26
    mark-compact, 32, 41
    marking, 21-22
    parallel copying, 293
    sequential allocation, 54
Mark-compact collection, …
    collect, 32
    compact, 33, 35, 37, 40
    compact phase, 31
    compaction
        cache behaviour, 38
        one-pass, 38
        three-pass, 34
Marking, 153
    asymptotic complexity, 276
    atomicity, 22
    Endo et al, 280, 281, 283, 289
    Flood et al, 278, 280, 282-284, 288, 289
    grey packets, 282, 284-288, 296
    Ossia et al, see grey packets
    prefetching, see Prefetching, marking
    Siebert [2008], 276
    Siebert [2010], 280, 282
    termination, see Concurrent collection, termination
    time overhead, 27
    work list, 19, 28, 279, 280
    Wu and Li, see using channels
Marmot, 138
Mature Object Space collector, 109, 130, …
Memory fences, …
Memory leaks, 2
Mercury, 171
Metronome collector, 391-399, 402, 407, 413, see also Real-time collection, Tax-and-Spend
    compaction, 404-405
    root scanning, 394
    scheduling, 394
    sensitivity, 397
MMU, see Minimum mutator utilisation
Modula-2+, 340, 366
Modula-3, 162, 179, 184, 340
Moon's algorithm, see Copying, Moon's algorithm
Mortality of objects, 105
Mostly-concurrent collection, see Baker's algorithm
Mostly-copying collection, see Copying, mostly-copying collection; Concurrent copying and compaction, mostly-copying collection
Motorola MC68000, 169
Multiprocessors
    many-core, 230
    multicore, 230
    symmetric, 230
Multiprogramming, 230
Multithreading, 230
    simultaneous, 230
Mutator, 12
    performance, see Copying, improves mutator performance
    threads, see Mutator threads
Mutator colour (black or grey), see Tricolour abstraction, mutator colour
Mutator overhead, see specific collection algorithms
Mutator suspension, see GC-points
Mutator threads, 12, 15
Mutator utilisation, 7, 385, 391-393, 395, see also Bounded mutator utilisation; Minimum mutator utilisation
Objects
    grey, 43
    grey protected, 312
    hot, see Hot and cold, objects
    immortal, 110, 124, 126, 132-134
    immutable, 108, 146
    Java Reference, see References, weak
    large, 56, 114, 138, 151, 152, 201, see also Large object space
    moving, 139-140
    pointer-free, 140
    popular, 143, 156
    type of, 169-171
    unique, 58, 73
    updates of, 124-125, 132
    weak, see References, weak
Older-first, 128, 129
Partitioned collection
    partitioning for space, 104-105
    partitioning for yield, 105-106
    partitioning to reduce pause time, 106
    reasons to partition, 103-108
    reclaiming whole regions, 106, 148
Pointers, 13
Poor Richard's Memory Manager, 210
Reference counting
    lazy, 60
    limited counter fields, 72-74
    partial tracing, 67-72
    promptness, 73
    simple, 79
    Write, 58
Sapphire, see Concurrent copying and compaction, Sapphire collector
Scalar replacement, 148
Scalars, 12
scan
    copying, semispace, 45
… Clover, 411
… incremental replicating, 405
… see Virtual machine; Multi-tasking virtual machine
Write barriers
    reference counting
        coalesced, 63, 64
        deferred, 62
        performance, 61
        simple, 58
Zeroing, 43, 164, 165-166
    cache behaviour, 165, 166
Colophon
This book was set in Palatino (algorithms in Courier) with pdftex (from the TeX Live 2010 distribution). The illustrations were drawn with Adobe Illustrator CS3. We found the …
Published in 1996, Richard Jones's Garbage Collection was a milestone in the … years. The authors compare the most important approaches and state-of-the-art techniques in a single, accessible framework. … concurrent, and real-time garbage collection. Algorithms and concepts are often described with pseudocode and illustrations.

Features
• Provides a complete, up-to-date, and authoritative sequel to the 1996 book
• Offers thorough coverage of parallel, concurrent, and real-time garbage collection algorithms
• Explains some of the tricky aspects of garbage collection, including the interface to the run-time system
• Backed by a comprehensive online database of over 2,500 garbage collection-related publications

The nearly universal adoption of garbage collection by modern programming languages makes a thorough understanding of this topic essential for any programmer. This authoritative handbook gives expert insight on how different …
ISBN: 978-1-4200-8279-1