Shared-Memory Synchronization
Michael L. Scott, University of Rochester

Synthesis Lectures on Computer Architecture
Series Editor: Mark D. Hill, University of Wisconsin
Series ISSN: 1935-3235

About SYNTHESIS
This volume is a printed version of a work that appears in the Synthesis Digital Library of
Engineering and Computer Science. Synthesis Lectures provide concise, original presentations of
important research and development topics, published quickly, in digital and print formats. For more
information visit www.morganclaypool.com

Morgan & Claypool Publishers
ISBN: 978-1-60845-956-8
www.morganclaypool.com
Mark D. Hill, Series Editor
Shared-Memory Synchronization
Synthesis Lectures on
Computer Architecture
Editor
Mark D. Hill, University of Wisconsin, Madison
Synthesis Lectures on Computer Architecture publishes 50- to 100-page publications on topics
pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware
components to create computers that meet functional, performance and cost goals. The scope will
largely follow the purview of premier computer architecture conferences, such as ISCA, HPCA,
MICRO, and ASPLOS.
Shared-Memory Synchronization
Michael L. Scott
2013
Multithreading Architecture
Mario Nemirovsky and Dean M. Tullsen
2013
Performance Analysis and Tuning for General Purpose Graphics Processing Units
(GPGPU)
Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu
2012
On-Chip Networks
Natalie Enright Jerger and Li-Shiuan Peh
2009
The Memory System: You Can’t Avoid It, You Can’t Ignore It, You Can’t Fake It
Bruce Jacob
2009
Transactional Memory
James R. Larus and Ravi Rajwar
2006
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in
any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations
in printed reviews, without the prior permission of the publisher.
Shared-Memory Synchronization
Michael L. Scott
www.morganclaypool.com
DOI 10.2200/S00499ED1V01Y201304CAC023
Lecture #23
Series Editor: Mark D. Hill, University of Wisconsin, Madison
Series ISSN
Synthesis Lectures on Computer Architecture
Print 1935-3235 Electronic 1935-3243
Shared-Memory Synchronization
Michael L. Scott
University of Rochester
Morgan & Claypool Publishers
ABSTRACT
Since the advent of time sharing in the 1960s, designers of concurrent and parallel systems have
needed to synchronize the activities of threads of control that share data structures in memory.
In recent years, the study of synchronization has gained new urgency with the proliferation of
multicore processors, on which even relatively simple user-level programs must frequently run in
parallel.
This lecture offers a comprehensive survey of shared-memory synchronization, with an em-
phasis on “systems-level” issues. It includes sufficient coverage of architectural details to under-
stand correctness and performance on modern multicore machines, and sufficient coverage of
higher-level issues to understand how synchronization is embedded in modern programming
languages.
The primary intended audience is “systems programmers”—the authors of operating sys-
tems, library packages, language run-time systems, concurrent data structures, and server and util-
ity programs. Much of the discussion should also be of interest to application programmers who
want to make good use of the synchronization mechanisms available to them, and to computer
architects who want to understand the ramifications of their design decisions on systems-level
code.
KEYWORDS
atomicity, barriers, busy-waiting, conditions, locality, locking, memory mod-
els, monitors, multiprocessor architecture, nonblocking algorithms, scheduling,
semaphores, synchronization, transactional memory
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xv
Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvii
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Atomicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Condition Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.3 Spinning vs. Blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.4 Safety and Liveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2 Architectural Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1 Cores and Caches: Basic Shared-Memory Architecture . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Temporal and Spatial Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1.2 Cache Coherence . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.1.3 Processor (Core) Locality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Memory Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 Sources of Inconsistency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.2 Special Instructions to Order Memory Access . . . . . . . . . . . . . . . . . . . . . 15
2.2.3 Example Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.3 Atomic Primitives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3.1 The ABA Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Other Synchronization Hardware . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3 Essential Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1 Safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.1.1 Deadlock Freedom . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.1.2 Atomicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Liveness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.1 Nonblocking Progress . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
3.2.2 Fairness . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
3.3 The Consensus Hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 Memory Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42
3.4.1 Formal Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
3.4.2 Data Races . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
3.4.3 Real-World Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
6 Read-mostly Atomicity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1 Reader-Writer Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
6.1.1 Centralized Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
6.1.2 Queued Reader-Writer Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
6.2 Sequence Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
6.3 Read-Copy Update . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
Bibliography . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 173
Preface
This lecture grows out of some 25 years of experience in synchronization and concurrent data
structures. Though written primarily from the perspective of systems software, it reflects my con-
viction that the field cannot be understood without a solid grounding in both concurrency theory
and computer architecture.
Chapters 4, 5, and 7 are in some sense the heart of the lecture: they cover spin locks,
busy-wait condition synchronization (barriers in particular), and scheduler-based synchroniza-
tion, respectively. To set the stage for these, Chapter 2 surveys aspects of multicore and multi-
processor architecture that significantly impact the design or performance of synchronizing code,
and Chapter 3 introduces formal concepts that illuminate issues of feasibility and correctness.
Chapter 6 considers atomicity mechanisms that have been optimized for the important
special case in which most operations are read-only. Later, Chapter 8 provides a brief introduction
to nonblocking algorithms, which are designed in such a way that all possible thread interleavings
are correct. Chapter 9 provides a similarly brief introduction to transactional memory, which uses
speculation to implement atomicity without (in typical cases) requiring mutual exclusion. (A full
treatment of both of these topics is beyond the scope of the lecture.)
Given the volume of material, readers with limited time may wish to sample topics of
particular interest. All readers, however, should make sure they are familiar with the material in
Chapters 1 through 3. In my experience, practitioners often underestimate the value of formal
foundations, and theoreticians are sometimes vague about the nature and impact of architectural
constraints. Readers may also wish to bookmark Table 2.1 (page 19), which describes the memory
model assumed by the pseudocode. Beyond that:
• Programmers with an interest in operating systems and run-time packages may wish to
focus on Sections 2.2–2.3.1, all of Chapters 3–6, and Section 7.5.
• Authors of parallel libraries may wish to focus on Sections 2.2–2.3.1 and 5.4, plus all of
Chapters 3, 7, and 8.
• Compiler writers will need to understand all of Chapters 2 and 3, plus Sections 4.5.2, 5.1,
5.3.1, 5.3.3, and 7.3–7.4.
Some readers may be surprised to find that the lecture contains no concrete performance
results. This omission reflects a deliberate decision to focus on qualitative comparisons among
algorithmic alternatives. Performance is obviously of great importance in the evaluation of syn-
chronization mechanisms and concurrent data structures (and my papers are full of hard numbers),
but the constants change with time, and they depend in many cases on characteristics of the spe-
cific application, language, operating system, and hardware at hand. When relative performance
is in doubt, system designers would be well advised to benchmark the alternatives in their own
particular environment.
Michael L. Scott
April 2013
Acknowledgments
This lecture has benefited from the feedback of many generous colleagues. Sarita Adve, Hans
Boehm, Dave Dice, Maurice Herlihy, Mark Hill, Victor Luchangco, Paul McKenney, Maged
Michael, Nir Shavit, and Mike Swift all read through draft material, and made numerous helpful
suggestions for improvements. I am particularly indebted to Hans for his coaching on memory
consistency models and to Victor for his careful vetting of Chapter 3. (The mistakes that remain
are of course my own!) My thanks as well to the students of Mark’s CS 758 course in the fall of
2012, who provided additional feedback. Finally, my admiration and thanks both to Mark and
to Mike Morgan for their skillful shepherding of the Synthesis series, and for convincing me to
undertake the project.
Michael L. Scott
April 2013
CHAPTER 1
Introduction
In computer science, as in real life, concurrency makes it much more difficult to reason about
events. In a linear sequence, if E1 occurs before E2 , which occurs before E3 , and so on, we can
reason about each event individually: Ei begins with the state of the world (or the program) after
Ei−1, and produces some new state of the world for Ei+1. But if the sequence of events {Ei} is
concurrent with some other sequence {Fi}, all bets are off. The state of the world prior to Ei can
now depend not only on Ei−1 and its predecessors, but also on some prefix of {Fi}.
Consider a simple example in which two threads attempt—concurrently—to increment a
shared global counter:
thread 1: thread 2:
ctr++ ctr++
On any modern computer, the increment operation ( ctr++ ) will comprise at least three separate
instruction steps: one to load ctr into a register, a second to increment the register, and a third to
store the register back to memory. This gives us a pair of concurrent sequences:
thread 1: thread 2:
1: r := ctr 1: r := ctr
2: inc r 2: inc r
3: ctr := r 3: ctr := r
Intuitively, if our counter is initially 0, we should like it to be 2 when both threads have completed.
If each thread executes line 1 before the other executes line 3, however, then both will store a 1,
and one of the increments will be “lost.”
The problem here is that concurrent sequences of events can interleave in arbitrary ways,
many of which may lead to incorrect results. In this specific example, only two of the
(6 choose 3) = 20 possible interleavings—the ones in which one thread completes before the other starts—will
produce the result we want.
Synchronization is the art of precluding interleavings that we consider incorrect. In a dis-
tributed (i.e., message-passing) system, synchronization is subsumed in communication: if thread
T2 receives a message from T1 , then in all possible execution interleavings, all the events performed
by T1 prior to its send will occur before any of the events performed by T2 after its receive . In
a shared-memory system, however, things are not so simple. Instead of exchanging messages,
threads with shared memory communicate implicitly through load s and store s. Implicit commu-
nication gives the programmer substantially more flexibility in algorithm design, but it requires
separate mechanisms for explicit synchronization. Those mechanisms are the subject of this lec-
ture.
Significantly, the need for synchronization arises whenever operations are concurrent, re-
gardless of whether they actually run in parallel. This observation dates from the earliest work
in the field, led by Edsger Dijkstra [1965, 1968a, 1968b] and performed in the early 1960s. If a
single processor core context switches among concurrent operations at arbitrary times, then while
some interleavings of the underlying events may be less probable than they are with truly parallel
execution, they are nonetheless possible, and a correct program must be synchronized to protect
against any that would be incorrect. From the programmer’s perspective, a multiprogrammed
uniprocessor with preemptive scheduling is no easier to program than a multicore or multipro-
cessor machine.
A few languages and systems guarantee that only one thread will run at a time, and that
context switches will occur only at well defined points in the code. The resulting execution model
is sometimes referred to as “cooperative” multithreading. One might at first expect it to simplify
synchronization, but the benefits tend not to be significant in practice. The problem is that poten-
tial context-switch points may be hidden inside library routines, or in the methods of black-box
abstractions. Absent a programming model that attaches a true or false “may cause a context
switch” tag to every method of every system interface, programmers must protect against unex-
pected interleavings by using synchronization techniques analogous to those of truly concurrent
code.
Distribution
At the level of hardware devices, the distinction between shared memory and message passing disappears: we can
think of a memory cell as a simple process that receives load and store messages from more complicated processes,
and sends value and ok messages, respectively, in response. While theoreticians often think of things this way
(the annual PODC [Symposium on Principles of Distributed Computing ] and DISC [International Symposium on
Distributed Computing ] conferences routinely publish shared-memory algorithms), systems programmers tend to
regard shared memory and message passing as fundamentally distinct. This lecture covers only the shared-memory
case.
1.1 ATOMICITY
The example on p. 1 requires only atomicity: correct execution will be guaranteed (and incor-
rect interleavings avoided) if the instruction sequence corresponding to an increment operation
executes as a single indivisible unit:
thread 1: thread 2:
atomic atomic
ctr++ ctr++
The simplest (but not the only!) means of implementing atomicity is to force threads to
execute their operations one at a time. This strategy is known as mutual exclusion. The code of an
atomic operation that executes in mutual exclusion is called a critical section. Traditionally, mutual
exclusion is obtained by performing acquire and release operations on an abstract data object
called a lock:
lock L
thread 1: thread 2:
L.acquire() L.acquire()
ctr++ ctr++
L.release() L.release()
The acquire and release operations are assumed to have been implemented (at some lower level
of abstraction) in such a way that (1) each is atomic and (2) acquire waits if the lock is currently
held by some other thread.
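As an informal illustration in a real language (C++ here; the global ctr and the name counter_lock are ours, not taken from any particular system), the same pattern might be written as follows:

#include <mutex>

int ctr = 0;
std::mutex counter_lock;        // plays the role of lock L

void increment() {
    counter_lock.lock();        // L.acquire()
    ctr++;                      // the critical section
    counter_lock.unlock();      // L.release()
}

(In idiomatic C++ one would normally use std::lock_guard so that the release happens automatically, even if the critical section exits early.)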
In our simple increment example, mutual exclusion is arguably the only implementation
strategy that will guarantee atomicity. In other cases, however, it may be overkill. Consider an
operation that increments a specified element in an array of counters:
ctr_inc(i):
L.acquire()
ctr[i]++
L.release()
If thread 1 calls ctr_inc(i) and thread 2 calls ctr_inc(j) , we shall need mutual exclusion only if
i = j . We can increase potential concurrency with a finer granularity of locking—for example, by
declaring a separate lock for each counter, and acquiring only the one we need. In this example,
the only downside is the space consumed by the extra locks. In other cases, fine-grain locking
can introduce performance or correctness problems. Consider an operation designed to move n
dollars from account i to account j in a banking program. If we want to use fine-grain locking
(so unrelated transfers won’t exclude one another in time), we need to acquire two locks:
move(n, i, j):
L[i].acquire()
L[j].acquire() // (there’s a bug here)
acct[i] -= n
acct[j] += n
L[i].release()
L[j].release()
If lock acquisition and release are expensive, we shall need to consider whether the benefit of
concurrency in independent operations outweighs the cost of the extra lock. More significantly,
we shall need to address the possibility of deadlock:
thread 1: thread 2:
move(100, 2, 3) move(50, 3, 2)
If execution proceeds more or less in lockstep, thread 1 may acquire lock 2 and thread 2 may ac-
quire lock 3 before either attempts to acquire the other. Both may then wait forever. The simplest
solution in this case is to always acquire the lower-numbered lock first. In more general cases,
it may be difficult to devise a static ordering. Alternative atomicity mechanisms—in particular,
transactional memory, which we will consider in Chapter 9—attempt to achieve the concurrency
of fine-grain locking without its conceptual complexity.
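A sketch of the lower-numbered-lock-first convention, again in C++ and with illustrative names (acct, L, N) rather than anything prescribed by the text:

#include <algorithm>
#include <mutex>

const int N = 1000;                     // number of accounts (illustrative)
int acct[N];
std::mutex L[N];                        // one lock per account

void move(int n, int i, int j) {
    if (i == j) return;                 // nothing to transfer; also avoids locking one mutex twice
    int first  = std::min(i, j);        // always acquire the lower-numbered lock first,
    int second = std::max(i, j);        //   so concurrent transfers cannot deadlock
    L[first].lock();
    L[second].lock();
    acct[i] -= n;
    acct[j] += n;
    L[second].unlock();
    L[first].unlock();
}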
From the programmer’s perspective, fine-grain locking is a means of implementing atomic-
ity for large, complex operations using smaller (possibly overlapping) critical sections. The burden
of ensuring that the implementation is correct (that it does, indeed, achieve deadlock-free atom-
icity for the large operations) is entirely the programmer’s responsibility. The appeal of transac-
tional memory is that it raises the level of abstraction, allowing the programmer to delegate this
responsibility to some underlying system.
Whether atomicity is achieved through coarse-grain locking, programmer-managed fine-
grain locking, or some form of transactional memory, the intent is that atomic regions appear to be
indivisible. Put another way, any realizable execution of the program—any possible interleaving
of its machine instructions—must be indistinguishable from (have the same externally visible
behavior as) some execution in which the instructions of each atomic operation are contiguous
in time, with no other instructions interleaved among them. As we shall see in Chapter 3, there
are several possible ways to formalize this requirement, most notably linearizability and several
variants on serializability.
1.2 CONDITION SYNCHRONIZATION
In some cases, atomicity is not enough for correctness. Consider, for example, a program contain-
ing a work queue, into which “producer” threads place tasks they wish to have performed, and from
which “consumer” threads remove tasks they plan to perform. To preserve the structural integrity
of the queue, we shall need each insert or remove operation to execute atomically. More than
this, however, we shall need to ensure that a remove operation executes only when the queue is
nonempty and (if the size of the queue is bounded) an insert operation executes only when the
queue is nonfull:
Q.insert(d): Q.remove():
atomic atomic
await ¬Q.full() await ¬Q.empty()
// put d in next empty slot // return data from next full slot
In the synchronization literature, a concurrent queue (of whatever sort of objects) is some-
times called a bounded buffer ; it is the canonical example of mixed atomicity and condition syn-
chronization. As suggested by our use of the await condition notation above (notation we have
not yet explained how to implement), the conditions in a bounded buffer can be specified at the
beginning of the critical section. In other, more complex operations, a thread may need to per-
form nontrivial work within an atomic operation before it knows what condition(s) it needs to
wait for. Since another thread will typically need to access (and modify!) some of the same data in
order to make the condition true, a mid-operation wait needs to be able to “break” the atomicity
of the surrounding operation in some well-defined way. In Chapter 7 we shall see that some syn-
chronization mechanisms support only the simpler case of waiting at the beginning of a critical
section; others allow conditions to appear anywhere inside.
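For readers who want to see how such await conditions are typically expressed with today’s scheduler-based mechanisms (the subject of Chapter 7), here is a rough C++ sketch of a bounded queue using a mutex and two condition variables; the class and member names are ours, not part of any standard interface:

#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>

template <typename T>
class BoundedQueue {
    std::deque<T> q;
    const std::size_t capacity;
    std::mutex m;
    std::condition_variable not_full, not_empty;
public:
    explicit BoundedQueue(std::size_t cap) : capacity(cap) {}

    void insert(T d) {                        // blocks while the queue is full
        std::unique_lock<std::mutex> lk(m);
        not_full.wait(lk, [&]{ return q.size() < capacity; });   // await not-full
        q.push_back(std::move(d));
        not_empty.notify_one();
    }

    T remove() {                              // blocks while the queue is empty
        std::unique_lock<std::mutex> lk(m);
        not_empty.wait(lk, [&]{ return !q.empty(); });            // await not-empty
        T d = std::move(q.front());
        q.pop_front();
        not_full.notify_one();
        return d;
    }
};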
In many programs, condition synchronization is also useful outside atomic operations—
typically as a means of separating “phases” of computation. In the simplest case, suppose that
a task to be performed in thread B cannot safely begin until some other task (data structure
initialization, perhaps) has completed in thread A. Here B may spin on a Boolean flag variable
that is initially false and that is set by A to true . In more complex cases, it is common for a
program to go through a series of phases, each of which is internally parallel, but must complete
in its entirety before the next phase can begin. Many simulations, for example, have this structure.
For such programs, a synchronization barrier, executed by all threads at the end of every phase,
ensures that all have arrived before any is allowed to depart.
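The simple flag case might be sketched as follows with a C++ atomic Boolean; the release and acquire orderings anticipate the memory-consistency issues taken up in Chapter 2:

#include <atomic>

std::atomic<bool> initialized{false};

void thread_A() {               // performs the initialization
    // ... initialize the shared data structure ...
    initialized.store(true, std::memory_order_release);
}

void thread_B() {               // must not start its task until A is done
    while (!initialized.load(std::memory_order_acquire)) {
        // spin
    }
    // ... now safe to use the shared data structure ...
}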
It is tempting to suppose that atomicity (or mutual exclusion, at least) would be simpler to
implement—or to model formally—than condition synchronization. After all, it could be thought
of as a subcase: “wait until no other thread is currently in its critical section.” The problem with this
thinking is the scope of the condition. By standard convention, we allow conditions to consider
only the values of variables, not the states of other threads. Seen in this light, atomicity is the
more demanding concept: it requires agreement among all threads that their operations will avoid
interfering with each other. And indeed, as we shall see in Section 3.3, atomicity is more difficult
to implement, in a formal, theoretical sense.
For mutual exclusion, the simplest implementation employs a special hardware instruction
known as test and set ( TAS ). The TAS instruction, available on almost every modern machine,
sets a specified Boolean variable to true and returns the previous value. Using TAS , we can im-
plement a trivial spin lock:
type lock = bool := false

L.acquire():                  L.release():
    while TAS(&L)                 L := false
        // spin
Here we have equated the acquisition of L with the act of changing it from false to true . The
acquire operation repeatedly applies TAS to the lock until it finds that the previous value was
false . As we shall see in Chapter 4, the trivial test and set lock has several major performance
problems. It is, however, correct.
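In C++, std::atomic_flag provides essentially this instruction (its test_and_set member returns the previous value), so the trivial lock might be rendered as follows; as the text notes, the lock is correct but performs poorly under contention:

#include <atomic>

std::atomic_flag L = ATOMIC_FLAG_INIT;   // the lock; initially clear (false)

void acquire() {
    while (L.test_and_set(std::memory_order_acquire)) {
        // spin: the previous value was true, so some other thread holds the lock
    }
}

void release() {
    L.clear(std::memory_order_release);
}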
The obvious objection to spinning (also known as busy-waiting ) is that it wastes processor
cycles. In a multiprogrammed system it is often preferable to block—to yield the processor core
to some other, runnable thread. The prior thread may then be run again later—either after some
suitable interval of time (at which point it will check its condition, and possibly yield, again), or
at some particular time when another thread has determined that the condition is finally true.
CHAPTER 2
Architectural Background
The correctness and performance of synchronization algorithms depend crucially on architectural
details of multicore and multiprocessor machines. This chapter provides an overview of these
details. It can be skimmed by those already familiar with the subject, but should probably not
be skipped in its entirety: the implications of store buffers and directory-based coherence on
synchronization algorithms, for example, may not be immediately obvious, and the semantics of
synchronizing instructions (ordered accesses, memory fences, and read-modify-write instructions)
may not be universally familiar.
The chapter is divided into three main sections. In the first, we consider the implications for
parallel programs of caching and coherence protocols. In the second, we consider consistency—the
degree to which accesses to different memory locations can or cannot be assumed to occur in any
particular order. In the third, we survey the various read-modify-write instructions— test and set
and its cousins—that underlie most implementations of atomicity.
Figure 2.1: Typical symmetric (uniform memory access—UMA) machine. Numbers of components
of various kinds, and degree of sharing at various levels, differs across manufacturers and models.
cache. Each cache holds a temporary copy of data currently in active use by cores above it in the
hierarchy, allowing those data to be accessed more quickly than they could be if kept in memory.
On a NUMA machine in which the L2 connects directly to the global interconnect, the L3 may
sometimes be thought of as “belonging” to the memory.
In a machine with more than one processor, the global interconnect may have various
topologies. On small machines, broadcast buses and crossbars are common; on large machines,
a network of point-to-point links is more common. For synchronization purposes, broadcast has
the side effect of imposing a total order on all inter-processor messages; we shall see in Section 2.2
that this simplifies the design of concurrent algorithms—synchronization algorithms in partic-
ular. Ordering is sufficiently helpful, in fact, that some large machines (notably those sold by
Oracle) employ two different global networks: one for data requests, which are small, and benefit
from ordering, and the other for replies, which require significantly more aggregate bandwidth,
but do not need to be ordered.
As the number of cores per processor increases, on-chip interconnects—the connections
among the L2 and L3 caches in particular—can be expected to take on the complexity of cur-
rent global interconnects. Other forms of increased complexity are also likely, including, perhaps,
Figure 2.2: Typical nonuniform memory access (NUMA) machine. Again, numbers of components
of various kinds, and degree of sharing at various levels, differs across manufacturers and models.
¹A cache is said to be k-way associative if its indexing structure permits a given block to be cached in any of k distinct locations.
If k is 1, the cache is said to be direct mapped. If a block may be held in any line, the cache is said to be fully associative.
No-remote-caching Multiprocessors
Most of this lecture assumes a shared-memory multiprocessor with global (distributed) cache coherence, which
we have contrasted with machines in which message passing provides the only means of interprocessor commu-
nication. There is an intermediate option. Some NUMA machines (notably many of the offerings from Cray,
Inc.) support a single global address space, in which any processor can access any memory location, but remote
locations cannot be cached. We may refer to such a machine as a no-remote-caching (NRC-NUMA) multi-
processor. (Globally cache coherent NUMA machines are sometimes known as CC-NUMA.) Any access to a
location in some other processor’s memory will traverse the interconnect of an NRC-NUMA machine. Assuming
the hardware implements cache coherence within each node—in particular, between the local processor(s) and
the network interface—memory will still be globally coherent. For the sake of performance, however, system and
application programmers will need to employ algorithms that minimize the number of remote references.
2.2 MEMORY CONSISTENCY
On a single-core machine, it is relatively straightforward to ensure that instructions appear to
complete in execution order. Ideally, one might hope that a similar guarantee would apply to par-
allel machines—that memory accesses, system-wide, would appear to constitute an interleaving
(in execution order) of the accesses of the various cores. For several reasons, this sort of sequential
consistency [Lamport, 1979] imposes nontrivial constraints on performance. Most real machines
implement a more relaxed (i.e., potentially inconsistent) memory model, in which accesses by
different threads, or to different locations by the same thread, may appear to occur “out of or-
der” from the perspective of threads on other cores. When consistency is required, programmers
(or compilers) must employ special synchronizing instructions that are more strongly ordered than
other, “ordinary” instructions, forcing the local core to wait for various classes of potentially in-
flight events. Synchronizing instructions are an essential part of synchronization algorithms on
any non-sequentially consistent machine.
// initially x == y == 0
thread 1:            thread 2:
1: x := 1            1: y := 1
2: i := y            2: j := x
// finally i == j == 0
Figure 2.3: A simple ordering loop: each thread writes one variable and then reads the other, so
under sequential consistency at least one of the reads on line 2 would have to return 1.
single-threaded code will run correctly. On a multiprocessor, however, sequential consistency may
again be violated.
On a NUMA machine, or a machine with a topologically complex interconnect, differing
distances among locations provide additional sources of circular ordering. If variable x in Fig-
ure 2.3 is close to thread 2 but far from thread 1, and y is close to thread 1 but far from thread 2,
the reads on line 2 can easily complete before the writes on line 1, even if all accesses are in-
serted into the memory system in program order. With a topologically complex interconnect,
the cache coherence protocol itself may introduce variable delays—e.g., to dispatch invalidation
requests to the various locations that may need to change the state of a local cache line, and to
collect acknowledgments. Again, these differing delays may allow line 2 of the example—in both
threads—to complete before line 1.
In all the explanations of Figure 2.3, the ordering loop results from reads bypassing writes—
executing in-order (write-then-read) from the perspective of the issuing core, but out of order
(read-then-write) from the perspective of the memory system—or of threads on other cores. On
NUMA or topologically complex machines, it may also be possible for reads to bypass reads, writes
to bypass reads, or writes to bypass writes. Worse, circularity may arise even without bypassing—
i.e., even when every thread executes its own instructions in strict program order. Consider the
“independent reads of independent writes” (IRIW) example shown in Figure 2.4. If thread 1 is
close to thread 2 but far from thread 3, and thread 4 is close to thread 3 but far from thread 2,
the reads on line 1 in threads 2 and 3 may see the new values of x and y , while the reads on
line 2 see the old. Here the problem is not bypassing, but a lack of write atomicity—one thread
sees the value written by a store and another thread subsequently sees the value prior to the store.
Many other examples of unintuitive behavior permitted by modern hardware can be found in the
literature [Adve and Gharachorloo, 1996, Adve et al., 1999, Boehm and Adve, 2008, Manson
et al., 2005].
// initially x == y == 0
thread 1:        thread 2:         thread 3:         thread 4:
1: x := 1        1: x2 := x        1: y3 := y        1: y := 1
2:               2: y2 := y        2: x3 := x
// finally y2 == x3 == 0 and x2 == y3 == 1
Figure 2.4: Independent reads of independent writes (IRIW). If the writes of threads 1 and 4 prop-
agate to different places at different speeds, we can see an ordering loop even if instructions from the
same thread never bypass one another.
// initially x = f = 0
thread 1: thread 2:
1: x := foo() 1: while f = 0
2: f := 1 2: // spin
3: 3: y := 1/x
Figure 2.5: A simple example of flag-based synchronization. To avoid a spurious error, the update to
x must be visible to thread 2 before the update to f .
If the write at line 2 in thread 1 can bypass the write in line 1, however, thread 2 may read x too
early, and see a value of zero. Similarly, if the read of x at line 3 in thread 2 can bypass the read
of f in line 1, a divide-by-zero may again occur, even if the writes in thread 1 complete in order.
(While thread 2’s read of x is separated from the read of f by a conditional test, the second read
may still issue before the first completes, if the branch predictor guesses that the loop will never
iterate.)
Barriers Everywhere
Fences are sometimes known as memory barriers. Sadly, the word barrier is heavily overloaded. As noted in Sec-
tion 1.2 (and explored in more detail in Section 5.2), it is the name of a synchronization mechanism used to
separate program phases. In the programming language community, it refers to code that must be executed when
changing a pointer, in order to maintain bookkeeping information for the garbage collector. In a similar vein, it
sometimes refers to code that must be executed when reading or writing a shared variable inside an atomic trans-
action, in order to detect and recover from speculation failures (we discuss this code in Chapter 9, but without
referring to it as a “barrier”). The intended meaning is usually clear from context, but may be confusing to readers
who are familiar with only some of the definitions.
On many machines, fully ordered synchronizing instructions turn out to be quite expen-
sive—tens or even hundreds of cycles. Moreover, in many cases—including those described
above—full ordering is more than we need for correct behavior. Architects therefore often pro-
vide a variety of weaker synchronizing instructions. These may or may not be globally ordered,
and may prevent some, but not all, local bypassing. As we shall see in Section 2.2.3, the details
vary greatly from one machine architecture to another. Moreover, behavior is often defined not
in terms of the orderings an instruction guarantees among memory accesses, but in terms of the
reorderings it inhibits in the processor core, the cache subsystem, or the interconnect.
Unfortunately, there is no obvious, succinct way to specify minimal ordering requirements
in parallel programs. Neither synchronizing accesses nor fences, for example, allow us to order
two individual accesses with respect to one another (and not with respect to anything else), if that
is all that is really required. In an attempt to balance simplicity and clarity, the examples in this
lecture use a notation inspired by (but simpler than) the atomic operations of C++’11. Using this
notation, we will sometimes over-constrain our algorithms, but not egregiously.
A summary of our notation, and of the memory model behind it, can be found in Table 2.1.
To specify local ordering, each synchronizing instruction admits an optional annotation of the
form P‖S, indicating that the instruction is ordered with respect to preceding ( P ) and/or subse-
quent ( S ) read and write accesses in its thread ( P, S ⊆ { R, W }). So, for example, f.store(1, W‖)
might be used in Figure 2.5 at line 2 of thread 1 to order the (synchronizing) store to f after the
(ordinary) write to x , and f.load(‖RW) might be used at line 1 of thread 2 to order the (syn-
chronizing) load of f before both the (ordinary) read of x and any other subsequent reads and
writes. Similarly, fence(RW‖RW) would indicate a full fence, ordered globally with respect to all
other synchronizing instructions and locally with respect to all preceding and subsequent ordinary
accesses in its thread.
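As a rough mapping (not an exact correspondence), the W‖ store and ‖RW load of this example resemble a C++ release store and acquire load. A sketch of Figure 2.5 in that style, with compute_foo standing in for foo() and assumed (as in the figure) to return a nonzero value:

#include <atomic>

int compute_foo() { return 42; }        // stands in for foo(); assumed nonzero

int x = 0;                              // ordinary data
std::atomic<int> f{0};                  // the synchronization flag

void thread1() {
    x = compute_foo();                          // ordinary write
    f.store(1, std::memory_order_release);      // roughly f.store(1, W‖)
}

void thread2() {
    while (f.load(std::memory_order_acquire) == 0) {    // roughly f.load(‖RW)
        // spin
    }
    int y = 1 / x;                      // safe: the write to x is now visible
    (void) y;
}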
We will assume that synchronizing instructions inhibit reordering not only by the hard-
ware (processor, cache, or interconnect), but also by the compiler or interpreter. Compiler writers
or assembly language programmers interested in porting our pseudocode to some concrete ma-
chine will need to restrict their code improvement algorithms accordingly, and issue appropriate
synchronizing instructions for the hardware at hand. Beginning guidance can be found in Doug
Lea’s on-line “Cookbook for Compiler Writers” [2001].
To determine the need for synchronizing instructions in the code of a given synchroniza-
tion algorithm, we shall need to consider both the correctness of the algorithm itself and the
semantics it is intended to provide to the rest of the program. The acquire operation of Peterson’s
two-thread spin lock [1981], for example, employs synchronizing store s to arbitrate between
competing threads, but this ordering is not enough to prevent a thread from reading or writing
shared data before the lock has actually been acquired—or after it has been released. For that, one
needs accesses or fences with local ‖RW and RW‖ ordering (code in Section 4.1).
Fortunately for most programmers, memory ordering details are generally of concern only
to the authors of synchronization algorithms and low-level concurrent data structures, which
Table 2.1: Understanding the pseudocode
Throughout the remainder of this book, pseudocode will be set in sans serif font code (code in real
programming languages will be set in typewriter font). We will use the term synchronizing instruction
to refer to explicit load s and store s, fence s, and atomic read-modify-write ( fetch_and_Φ) operations
(listed in Table 2.2). Other memory accesses will be referred to as “ordinary.” We will assume the
following:
coherence
All accesses (ordinary and synchronizing) to any given location appear to occur in some
single, total order from the perspective of all threads.
global order
There is a global, total order on synchronizing instructions (to all locations, by all threads).
Within this order, instructions issued by the same thread occur in program order.
program order
Ordinary accesses appear to occur in program order from the perspective of the issuing
thread, but may bypass one another in arbitrary ways from the perspective of other threads.
local order
Ordinary accesses may also bypass synchronizing instructions, except when forbidden by
an ordering annotation ({ R,W }‖{ R,W }) on the synchronizing instruction.
values read
A read instruction will return the value written by the most recent write (to the same loca-
tion) that is ordered before the read. It may also, in some cases, return the value written by
an unordered write. More detail on memory models can be found in Section 3.4.
may need to be re-written (or at least re-tuned) for each target architecture. Programmers who
use these algorithms correctly are then typically assured that their programs will behave as if the
hardware were sequentially consistent (more on this in Section 3.4), and will port correctly across
machines.
Identifying a minimal set of ordering instructions to ensure the correctness of a given algo-
rithm on a given machine is a difficult and error-prone task. It has traditionally been performed
by hand, though there has been promising recent work aimed at verifying the correctness of a
given set of fences [Burckhardt et al., 2007], or even at inferring them directly [Kuperstein et al.,
2010].
2.3 ATOMIC PRIMITIVES

Table 2.2: Common atomic (read-modify-write) instructions

test_and_set
bool TAS(bool *a): atomic { t := *a; *a := true; return t }
swap
word Swap(word *a, word w): atomic { t := *a; *a := w; return t }
fetch_and_increment
int FAI(int *a): atomic { t := *a; *a := t + 1; return t }
fetch_and_add
int FAA(int *a, int n): atomic { t := *a; *a := t + n; return t }
compare_and_swap
bool CAS(word *a, word old, word new): atomic { t := (*a = old); if (t) *a := new; return t }
load_linked / store_conditional
word LL(word *a): atomic { remember a; return *a }
bool SC(word *a, word w): atomic { t := (a is remembered, and has not been evicted since the LL); if (t) *a := w; return t }
To facilitate the construction of synchronization algorithms and concurrent data structures, most
modern architectures provide instructions capable of updating (i.e., reading and writing) a mem-
ory location as a single atomic operation. We saw a simple example—the test and set instruction
( TAS )—in Section 1.3. A longer list of common instructions appears in Table 2.2. Note that for
each of these, when it appears in our pseudocode, we permit an optional, final argument that indi-
cates local ordering constraints. CAS(a, old, new, W‖) , for example, indicates a CAS instruction
that is ordered after all preceding write accesses in its thread.
Originally introduced on mainframes of the 1960s, TAS and Swap are still available on
several modern machines, among them the x86 and SPARC. FAA and FAI were introduced
for “combining network” machines of the 1980s [Kruskal et al., 1988]. ey are uncommon in
hardware today, but frequently appear in algorithms in the literature. e semantics of TAS ,
Swap , FAI , and FAA should all be self-explanatory. Note that they all return the value of the
target location before any change was made.
CAS was originally introduced in the 1973 version of the IBM 370 architecture [Brown
and Smith, 1975, Gifford et al., 1987, IBM, 1975]. It is also found on modern x86, IA-64 (Ita-
nium), and SPARC machines. LL / SC was originally proposed for the S-1 AAP Multiprocessor
at Lawrence Livermore National Laboratory [Jensen et al., 1987]. It is also found on modern
POWER, MIPS, and ARM machines. CAS and LL / SC are universal primitives, in a sense we
will define formally in Section 3.3. In practical terms, we can use them to build efficient simu-
lations of arbitrary (single-word) read-modify-write ( fetch_and_Φ) operations (including all the
other operations in Table 2.2).
CAS takes three arguments: a memory location, an old value that is expected to occupy
that location, and a new value that should be placed in the location if indeed the old value is cur-
rently there. The instruction returns a Boolean value indicating whether the replacement occurred
successfully. Given CAS , fetch_and_Φ can be written as follows, for any given function Φ:
1: word fetch_and_Φ(function Φ, word *w):
2: word old, new
3: repeat
4: old := *w
5: new := Φ(old)
6: until CAS(w, old, new)
7: return old
In effect, this code computes Φ(*w) speculatively, and then updates w atomically if its value has
not changed since the speculation began. The only way the CAS can fail to perform its update
(and return false at line 6) is if some other thread has recently modified w . If several threads
attempt to perform a fetch_and_Φ on w simultaneously, one of them is guaranteed to succeed,
and the system as a whole will make forward progress. This guarantee implies that fetch_and_Φ
operations implemented with CAS are nonblocking (more specifically, lock free), a property we will
consider in more detail in Section 3.2.
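The same idiom carries over almost directly to C++ atomics; in this sketch Φ is passed as an arbitrary function object (the name fetch_and_phi is ours), and compare_exchange_strong plays the role of the hardware CAS described above:

#include <atomic>
#include <functional>

// Atomically replace *w with phi(*w); return the value *w held beforehand.
int fetch_and_phi(const std::function<int(int)>& phi, std::atomic<int>* w) {
    int old = w->load();
    int desired;
    do {
        desired = phi(old);
        // on failure, compare_exchange_strong reloads the current value into 'old'
    } while (!w->compare_exchange_strong(old, desired));
    return old;
}

For example, fetch_and_phi([](int v){ return v * 2; }, &counter) would atomically double counter and return its prior value.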
One problem with CAS , from an architectural point of view, is that it combines a load and
a store into a single instruction, which complicates the implementation of pipelined processors.
LL / SC was designed to address this problem. In the fetch_and_Φ idiom above, it replaces the
load at line 4 with a special instruction that has the side effect of “tagging” the associated cache
line so that the processor will “notice” any subsequent eviction of the line. A subsequent SC will
then succeed only if the line is still present in the cache:
word fetch_and_Φ(function Φ, word *w):
word old, new
repeat
old := LL(w)
new := Φ(old)
until SC(w, new)
return old
Here any argument for forward progress requires an understanding of why SC might fail. Details
vary from machine to machine. In all cases, SC is guaranteed to fail if another thread has modified
*w (the location pointed at by w ) since the LL was performed. On most machines, SC will also
fail if a hardware interrupt happens to arrive in the post- LL window. On some machines, it will fail
if the cache suffers a capacity or conflict miss, or if the processor mispredicts a branch. To avoid
deterministic, spurious failure, the programmer may need to limit (perhaps severely) the types
of instructions executed between the LL and SC . If unsafe instructions are required in order to
compute the function Φ, one may need a hybrid approach:
1: word fetch_and_Φ(function Φ, word *w):
2: word old, new
3: repeat
4: old := *w
5: new := Φ(old)
6: until LL(w) = old && SC(w, new)
7: return old
Emulating CAS
Note that while LL / SC can be used to emulate CAS , the emulation requires a loop to deal with spurious SC
failures. This issue was recognized explicitly by the designers of the C++’11 atomic types and operations, who
introduced two variants of CAS . The atomic_compare_exchange_strong operation has the semantics of hard-
ware CAS : it fails only if the expected value was not found. On an LL / SC machine, it is implemented with a loop.
The atomic_compare_exchange_weak operation admits the possibility of spurious failure: it has the interface
of CAS , but is implemented without a loop on an LL / SC machine.
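In C++ the two variants appear as member functions of the atomic types; a brief sketch of the usual rule of thumb (weak inside retry loops, strong when a failure is treated as meaningful):

#include <atomic>

std::atomic<int> counter{0};

void increment() {
    int old = counter.load();
    // weak: a spurious failure is harmless here, because the loop simply retries
    while (!counter.compare_exchange_weak(old, old + 1)) { }
}

bool claim(std::atomic<int>& slot, int owner) {
    int expected = 0;
    // strong: a failure here really means the slot was already taken
    return slot.compare_exchange_strong(expected, owner);
}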
Figure 2.6: An ABA problem scenario (described in the text below): (a) the stack holds A and C;
(b) the stack holds A, B, and C; (c) the broken result, in which B has been lost.
Figure 2.6 shows one of many problem scenarios. In (a), our stack contains the elements
A and C. Suppose that thread 1 begins to execute pop(&top) , and has completed line 6, but
has yet to reach line 7. If thread 2 now executes a (complete) pop(&top) operation, followed by
push(&top, &B) and then push(&top, &A) , it will leave the stack as shown in (b). If thread 1
now continues, its CAS will succeed, leaving the stack in the broken state shown in (c).
The problem here is that top changed between thread 1’s load and the subsequent CAS . If
these two instructions were replaced with LL and SC , the latter would fail—as indeed it should—
causing thread 1 to try again.
On machines with CAS , programmers must consider whether the ABA problem can arise
in the algorithm at hand and, if so, take measures to avoid it. The simplest and most common
technique is to devote part of each to-be- CAS ed word to a sequence number that is updated in
pop on a successful CAS . Using this counted pointer technique, we can convert our stack code to
the (now safe) version shown in Figure 2.7.²
The sequence number solution to the ABA problem requires that there be enough bits
available for the number that wrap-around cannot occur in any reasonable program execution.
Some machines (e.g., the x86, or the SPARC when running in 32-bit mode) provide a double-
width CAS that is ideal for this purpose. If the maximum word width is required for “real” data,
however, another approach may be required.
²While Treiber’s technical report [Treiber, 1986] is the standard reference for the nonblocking stack algorithm, the ABA
problem is mentioned as early as the 1975 edition of the System 370 manual [IBM, 1975, p. 125], and a version of the stack
appears in the 1983 edition [IBM, 1983, App. A]. Treiber’s personal contribution (not shown in Figure 2.7) was to observe
that counted pointers are required only in the pop operation; push can safely perform a single-width CAS on the pointer
alone [Michael, 2013].
Figure 2.7: The lock-free “Treiber stack,” with a counted top-of-stack pointer to solve the ABA prob-
lem. It suffices to modify the count in pop only; if CAS is available in multiple widths, it may be applied
to only the pointer in push .
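As a rough illustration of the counted-pointer idea (a sketch only, not the code of Figure 2.7; node, counted_ptr, and top are our names), pop might look like this in C++, assuming the platform offers a lock-free double-width compare-and-swap and that nodes are not returned to the general-purpose allocator while the stack is in use:

#include <atomic>
#include <cstdint>

struct node { node* next; /* ... payload ... */ };

struct counted_ptr {            // must fit in a double-width CAS (e.g., 16 bytes)
    node*    ptr;
    uint64_t count;             // bumped on every successful pop
};

std::atomic<counted_ptr> top{ counted_ptr{ nullptr, 0 } };

node* pop() {
    counted_ptr old_top = top.load();
    counted_ptr new_top;
    do {
        if (old_top.ptr == nullptr) return nullptr;     // empty stack
        new_top.ptr   = old_top.ptr->next;
        new_top.count = old_top.count + 1;  // a recycled pointer gets a new count, so a stale CAS fails
    } while (!top.compare_exchange_weak(old_top, new_top));
    return old_top.ptr;
}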
In many programs, the programmer can reason that a given pointer will reappear in a given
data structure only as a result of memory deallocation and reallocation. Note that this is not
the case in the Treiber stack as presented here. It would be the case if we re-wrote the code
to pass push a value, and had the method allocate a new node to hold it. Symmetrically, pop
would deallocate the node and return the value it contained. In a garbage-collected language,
deallocation will not occur so long as any thread retains a reference, so all is well. In a language with
manual storage management, hazard pointers [Herlihy et al., 2005, Michael, 2004b] or read-copy-
update [McKenney et al., 2001] (Section 6.3) can be used to delay deallocation until all concurrent
uses of a datum have completed. In the general case (where a pointer can recur without its memory
having been recycled), safe CAS ing may require an extra level of pointer indirection [Jayanti and
Petrovic, 2003, Michael, 2004c].
Type-preserving Allocation
Both general-purpose garbage collection and hazard pointers can be used to avoid the ABA problem in appli-
cations where it might arise due to memory reallocation. Counted pointers can be used to avoid the problem in
applications (like the Treiber stack) where it might arise for reasons other than memory reallocation. But counted
pointers can also be used in the presence of memory reclamation. In this case, one must employ a type-preserving
allocator, which ensures that a block of memory is reused only for an object of the same type and alignment. Sup-
pose, for example, that we modify the Treiber stack, as suggested in the main body of the lecture, to pass push
a value, and have the method allocate a new node to hold it. In this case, if a node were deallocated and reused
by unrelated code (in, say, an array of floating-point numbers), it would be possible (if unlikely) that one of those
numbers might match the bit pattern of a counted pointer from the memory’s former life, leading the stack code
to perform an erroneous operation. With a type-preserving allocator, space once occupied by a counted pointer
would continue to hold such a pointer even when reallocated, and (absent wrap-around), a CAS would succeed
only in the absence of reuse.
One simple implementation of a type-preserving allocator employs a Treiber stack as a free list: old nodes are
push ed onto the stack when freed; new nodes are pop ped from the stack, or, if the stack is empty, obtained
from the system memory manager. A more sophisticated implementation avoids unnecessary cache misses and
contention on the top-of-stack pointer by employing a separate pool of free nodes for each thread or core. If the
local pool is empty, a thread obtains a new “batch” of nodes from a backup central pool, or, if it is empty, the
system memory manager. If the local pool grows too large (e.g., in a program that performs most enqueue s in
one thread and most dequeue s in another), a thread moves a batch of nodes back to the central pool. The central
pool is naturally implemented as a Treiber stack of batches.
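A minimal sketch of the pool structure described above, in C++ with hypothetical names; for brevity the central pool is protected by an ordinary mutex here rather than being implemented as a Treiber stack:

#include <mutex>
#include <vector>

struct node { node* next = nullptr; /* ... payload ... */ };

const std::size_t BATCH = 64;

std::vector<node*> central_pool;              // shared backup pool
std::mutex central_lock;
thread_local std::vector<node*> local_pool;   // per-thread pool; no locking needed

node* alloc_node() {
    if (local_pool.empty()) {                 // refill a batch from the central pool
        std::lock_guard<std::mutex> g(central_lock);
        while (!central_pool.empty() && local_pool.size() < BATCH) {
            local_pool.push_back(central_pool.back());
            central_pool.pop_back();
        }
    }
    if (local_pool.empty())
        return new node;                      // fall back on the system allocator
    node* n = local_pool.back();
    local_pool.pop_back();
    return n;
}

void free_node(node* n) {
    local_pool.push_back(n);                  // the space is only ever reused as a node
    if (local_pool.size() > 2 * BATCH) {      // local pool too large: return a batch
        std::lock_guard<std::mutex> g(central_lock);
        while (local_pool.size() > BATCH) {
            central_pool.push_back(local_pool.back());
            local_pool.pop_back();
        }
    }
}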
CHAPTER 3
Essential Theory
Concurrent algorithms and synchronization techniques have a long and very rich history of
formalization—far too much to even survey adequately here. Arguably the most accessible re-
source for practitioners is the text of Herlihy and Shavit [2008]. Deeper, more mathematical
coverage can be found in the text of Schneider [1997]. On the broader topic of distributed com-
puting (which as noted in the box on page 2 is viewed by theoreticians as a superset of shared-
memory concurrency), interested readers may wish to consult the classic text of Lynch [1996].
For the purposes of the current text, we provide a brief introduction here to safety, liveness,
the consensus hierarchy, and formal memory models. Safety and liveness were mentioned briefly in
Section 1.4. The former says that bad things never happen; the latter says that good things even-
tually do. The consensus hierarchy explains the relative expressive power of hardware primitives
like test_and_set (TAS) and compare_and_swap (CAS). Memory models explain which writes
may be seen by which reads under which circumstances; they help to regularize the “out of order”
memory references mentioned in Section 2.2.
3.1 SAFETY
Most concurrent data structures (objects) are adaptations of sequential data structures. Each of
these, in turn, has its own sequential semantics, typically specified as a set of preconditions and
postconditions for each of the methods that operate on the structure, together with invariants
that all the methods must preserve. The sequential implementation of an object is considered safe
if each method, called when its preconditions are true, terminates after a finite number of steps,
having ensured the postconditions and preserved the invariants.
When designing a concurrent object, we typically wish to allow concurrent method calls
(“operations”), each of which should appear to occur atomically. This goal in turn leads to at least
three safety issues:
1. In a sequential program, an attempt to call a method whose precondition does not hold
can often be considered an error: the program’s single thread has complete control over the
order in which methods are called, and can either reason that a given call is valid or else
check the precondition first, explicitly, without worrying about changes between the check
and the call (if ¬Q.empty() e := Q.dequeue()). In a parallel program, the potential for
concurrent operation in other threads generally requires either that a method be total (i.e.,
that its precondition simply be true, allowing it to run under any circumstances), or that
it use condition synchronization to wait until the precondition holds. The former option
is trivial if we are willing to return an indication that the operation is not currently valid
(Q.dequeue(), for example, might return a special ⊥ value when the queue is empty; see the
sketch after this list). The latter option is explored in Chapter 5.
2. Because threads may wait for one another due to locking or condition synchronization,
we must address the possibility of deadlock, in which some set of threads are permanently
waiting for each other. We consider lock-based deadlock in Section 3.1.1. Deadlocks due
to condition synchronization are a matter of application-level semantics, and must be ad-
dressed on a program-by-program basis.
3. The notion of atomicity requires clarification. If operations do not actually execute one at a
time in mutual exclusion, we must somehow specify the order(s) in which they are permitted
to appear to execute. We consider several popular notions of ordering, and the differences
among them, in Section 3.1.2.
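As a concrete illustration of the first of these issues, here is a hedged C++ sketch, not taken from the lecture, of a total dequeue that returns an empty optional in place of the ⊥ value rather than waiting for the queue to become nonempty.

// Illustrative only: a total dequeue() on a coarse-locked queue.  Rather
// than requiring the caller to ensure "queue not empty," the method is
// defined in every state and signals emptiness with std::nullopt.
#include <mutex>
#include <optional>
#include <queue>

template <typename T>
class TotalQueue {
    std::mutex m;
    std::queue<T> q;
public:
    void enqueue(T v) {
        std::lock_guard<std::mutex> g(m);
        q.push(std::move(v));
    }
    std::optional<T> dequeue() {              // total: valid under any circumstances
        std::lock_guard<std::mutex> g(m);
        if (q.empty()) return std::nullopt;   // the "⊥" of the text
        T v = std::move(q.front());
        q.pop();
        return v;
    }
};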
3. We can break the circularity condition by imposing a static order on locks, and requiring that
every operation acquire its locks according to that static order. This approach is slightly less
onerous than requiring a thread to request all its locks at once, but still far from general. It
does not, for example, provide an acceptable solution to the "move from A to f(v)" example
in strategy 1 above.
Strategy 3 is widely used in practice. It appears, for example, in every major operating
system kernel. The lack of generality, however, and the burden of defining—and respecting—a
static order on locks, makes strategy 2 quite appealing, particularly when it can be automated, as
it typically is in transactional memory. An intermediate alternative, sometimes used for applica-
tions whose synchronization behavior is well understood, is to consider, at each individual lock
request, whether there is a feasible order in which currently active operations might complete (un-
der worst-case assumptions about the future resources they might need in order to do so), even
if the current lock is granted. The best known strategy of this sort is the Banker's algorithm of
Dijkstra [early 1960s, 1982], originally developed for the THE operating system [Dijkstra,
1968a]. Where strategies 1 and 3 may be said to prevent deadlock by design, the Banker’s al-
gorithm is often described as deadlock avoidance, and strategy 2 as deadlock recovery.
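A minimal C++ sketch of strategy 3 (my own illustration, with hypothetical Account and transfer names): both locks are always acquired in a fixed order, here by address, so no cycle of waiting threads can arise.

// Illustrative static lock ordering.  Both locks are always acquired in
// address order, so two concurrent transfers can never each hold one lock
// while waiting for the other.  Assumes &from != &to.
#include <mutex>

struct Account {
    std::mutex m;
    long balance = 0;
};

void transfer(Account& from, Account& to, long amount) {
    Account* first  = (&from < &to) ? &from : &to;    // canonical order: lower address first
    Account* second = (&from < &to) ? &to   : &from;
    std::lock_guard<std::mutex> g1(first->m);
    std::lock_guard<std::mutex> g2(second->m);
    from.balance -= amount;
    to.balance   += amount;
}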
3.1.2 ATOMICITY
In Section 2.2 we introduced the notion of sequential consistency, which requires that low-level
memory accesses appear to occur in some global total order—i.e., “one at a time”—with each core’s
accesses appearing in program order (the order specified by the core’s sequential program). When
considering the order of high-level operations on a concurrent object, it is tempting to ask whether
sequential consistency can help. In one sense, the answer is clearly no: correct sequential code
will typically not work correctly when executed (without synchronization) by multiple threads
concurrently—even on a system with sequentially consistent memory. Conversely, as we shall see
in Section 3.4, one can (with appropriate synchronization) build correct high-level objects on top
of a system whose memory is more relaxed.
At the same time, the notion of sequential consistency suggests a way in which we might
define atomicity for a concurrent object, allowing us to infer what it means for code to be properly
synchronized. After all, the memory system is a complex concurrent object from the perspective
of a memory architect, who must implement load and store instructions via messages across a
distributed cache-cache interconnect. Just as the designer of a sequentially consistent memory
system might seek to achieve the appearance of a total order on memory accesses, consistent with
per-core program order, so too might the designer of a concurrent object seek to achieve the
appearance of a total order on high-level operations, consistent with the order of each thread’s
sequential program. In any execution that appeared to exhibit such a total order, each operation
could be said to have executed atomically.
(Throughout this lecture, we use T to represent the set of thread ids. For the sake of convenience,
we assume that the set is sufficiently dense that we can use it to index arrays.)
Because of the lock, put operations are totally ordered. Further, because a get operation
performs only a single (atomic) access to memory, it is easily ordered with respect to all put s—
after those that have updated the relevant element of A , and before those that have not. It is
straightforward to identify a total order on operations that respects these constraints and that is
consistent with program order in each thread. In other words, our counter is sequentially consis-
tent.
On the other hand, consider what happens if we have two counters—call them X and Y.
Because get operations can occur "in the middle of" a put at the implementation level, we can
imagine a scenario in which threads T3 and T4 perform gets on X and Y while both objects are
being updated—and see the updates in opposite orders:
At this point, the put to Y has happened before the put to X from T3's perspective, but after
the put to X from T4's perspective. To solve this problem, we might require the implementation
of a shared object to ensure that updates appear to other threads to happen at some single point
in time.
But this is not enough. Consider a software emulation of the hardware write buffers de-
scribed in Section 2.2.1. To perform a put on object X , thread T inserts the desired new value
into a local queue and continues execution. Periodically, a helper thread drains the queue and
applies the updates to the master copy of X , which resides in some global location. To perform
a get , T inspects the local queue (synchronizing with the helper as necessary) and returns any
pending update; otherwise it returns the global value of X . From the point of view of every thread
other than T , the update occurs when it is applied to the global value of X . From T ’s perspective,
however, it happens early, and, in a system with more than one object, we can easily obtain the
“bow tie” causality loop of Figure 2.3. This scenario suggests that we require updates to appear to
other threads at the same time they appear to the updater—or at least before the updater continues
execution.
Linearizability
To address the problem of composability, Herlihy and Wing introduced the notion of lineariz-
ability [1990]. For more than 20 years it has served as the standard ordering criterion for high-level
concurrent objects. The implementation of object O is said to be linearizable if, in every possible
execution, the operations on O appear to occur in some total order that is consistent not only
with program order in each thread but also with any ordering that threads are able to observe by
other means.
More specifically, linearizability requires that each operation appear to occur instanta-
neously at some point in time between its call and return. The "instantaneously" part of this
requirement precludes the shared counter scenario above, in which T3 and T4 have different
views of partial updates. The "between its call and return" part of the requirement precludes the
software write buffer scenario, in which a put by thread T may not be visible to other threads
until after it has returned.
For the sake of precision, it should be noted that there is no absolute notion of objective
time in a parallel system, any more than there is in Einsteinian physics. (For more on the notion
of time in parallel systems, see the classic paper by Lamport [1978].) What really matters is
observable orderings. When we say that an event must occur at a single instant in time, what we
mean is that it must be impossible for thread A to observe that an event has occurred, for A to
subsequently communicate with thread B (e.g., by writing a variable that B reads), and then for
B to observe that the event has not yet occurred.
To help us reason about the linearizability of a concurrent object, we typically identify a
linearization point within each method at which a call to that method can be said to have occurred.
If we choose these points properly, then whenever the linearization point of operation A precedes
the linearization point of operation B , we will know that operation A, as a whole, linearizes before
operation B .
In the trivial case in which every method is bracketed by the acquisition and release of
a common object lock, the linearization point can be anywhere inside the method—we might
arbitrarily place it at the lock release. In an algorithm based on fine-grain locks, the linearization
point might correspond to the release of some particular one of the locks.
In nonblocking algorithms, it is common to associate linearization with a specific instruc-
tion (a load , store , or other atomic primitive) and then argue that any implementation-level
memory updates that are visible before the linearization point will be recognized by other threads
as merely preparation, and any that can be seen to occur after it will be recognized as merely
cleanup. In the nonblocking stack of Figure 2.7, a successful push or pop can be said to linearize
at its final CAS instruction; an unsuccessful pop (one that returns null ) can be said to linearize
at the load of top .
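To make the stack example concrete, the following C++ sketch (mine, not the lecture's pseudocode) marks the linearization points in comments. It sidesteps the ABA problem by never reclaiming nodes, so a single-width CAS suffices; a real implementation would use counted pointers, hazard pointers, or garbage collection as discussed in Section 2.3.

#include <atomic>
#include <optional>

// Illustrative Treiber-style stack.  Nodes are deliberately never freed,
// so a plain pointer CAS is safe.
template <typename T>
class TreiberStack {
    struct Node { T value; Node* next; };
    std::atomic<Node*> top{nullptr};
public:
    void push(T v) {
        Node* n = new Node{std::move(v), nullptr};
        n->next = top.load(std::memory_order_relaxed);
        // The successful CAS below is the linearization point of push.
        while (!top.compare_exchange_weak(n->next, n,
                   std::memory_order_release, std::memory_order_relaxed)) {}
    }
    std::optional<T> pop() {
        Node* old = top.load(std::memory_order_acquire);
        while (old != nullptr) {
            // The successful CAS below is the linearization point of pop.
            if (top.compare_exchange_weak(old, old->next,
                    std::memory_order_acquire, std::memory_order_acquire))
                return std::move(old->value);     // node intentionally leaked
        }
        // An unsuccessful pop linearizes at the load (or failed CAS) that
        // observed top == null.
        return std::nullopt;
    }
};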
In a complex method, we may need to identify multiple possible linearization points, to
accommodate branching control flow. In other cases, the outcome of tests at run time may allow
us to argue that a method linearized at some point earlier in its execution (an example of this
sort can be found in Section 8.3). There are even algorithms in which the linearization point of
a method is determined by behavior in some other thread. All that really matters is that there be
a total order on the linearization points, and that the behavior of operations, when considered in
that order, be consistent with the object’s sequential semantics.
thread 1:                          thread 2:
// insert(C)                       // delete(D)
read n→v      // A                 read n→v      // A
p := n                             p := n
n := n→next                        n := n→next
read n→v      // D                 read n→v      // D
m := new node(C)                   t := n→next
m→next := n
p→next := m
                                   p→next := t

Figure 3.1: Dynamic trace of improperly synchronized list updates, starting from the list A → D → K.
This execution can lose node C even on a sequentially consistent machine.
Given linearizable implementations of objects A and B , one can prove that in every possible
program execution, the operations on A and B will appear to occur in some single total order that
is consistent both with program order in each thread and with any other ordering that threads are
able to observe. In other words, linearizable implementations of concurrent objects are compos-
able. Linearizability is therefore sometimes said to be a local property [Herlihy and Wing, 1990,
Weihl, 1989]: the linearizability of a system as a whole depends only on the (local) linearizability
of its parts.
Serializability
Recall that the purpose of an ordering criterion is to clarify the meaning of atomicity. By requiring
an operation to complete at a single point in time, and to be visible to all other threads before it
returns to its caller, linearizability guarantees that the order of operations on any given concurrent
object will be consistent with all other observable orderings in an execution, including those of
other concurrent objects.
The flip side of this guarantee is that the linearizability of individual operations does not
necessarily imply linearizability for operations that manipulate more than one object, but are still
intended to execute as a single atomic unit.
Consider a banking system in which thread 1 transfers $100 from account A to account B ,
while thread 2 adds the amounts in the two accounts:
// initially A.balance() = B.balance() = 500

thread 1:                     thread 2:
A.withdraw(100)
                              sum := A.balance()      // 400
                              sum +:= B.balance()     // 900
B.deposit(100)
If we think of A and B as separate objects, then the execution can linearize as suggested by
vertical position on the page, but thread 2 will see a cross-account total that is $100 “too low.” If
we wish to treat the code in each thread as a single atomic unit, we must disallow this execution—
something that neither A nor B can do on its own. We need, in short, to be able to combine smaller
atomic operations into larger ones—not just perform the smaller ones in a mutually consistent
order. Where linearizability ensures that the orders of separate objects will compose “for free,”
multi-object atomic operations will generally require some sort of global or distributed control.
Multi-object atomic operations are the hallmark of database systems, which refer to them
as transactions. Transactional memory (the subject of Chapter 9) adapts transactions to shared-
memory parallel computing, allowing the programmer to request that a multi-object operation
like thread 1’s transfer or thread 2’s sum should execute atomically.
The simplest ordering criterion for transactions—both database and memory—is known
as serializability. Transactions are said to serialize if they have the same effect they would have
had if executed one at a time in some total order. For transactional memory (and sometimes for
databases as well), we can extend the model to allow a thread to perform a series of transactions,
and require that the global order be consistent with program order in each thread.
It turns out to be NP-hard to determine whether a given set of transactions (with the
given inputs and outputs) is serializable [Papadimitriou, 1979]. Fortunately, we seldom need to
make such a determination in practice. Generally all we really want is to ensure that the current
execution will be serializable—something we can achieve with conservative (sufficient but not
necessary) measures. A global lock is a trivial solution, but admits no concurrency. Databases
and most TM systems employ more elaborate fine-grain locking. A few TM systems employ
nonblocking techniques.
If we regard the objects to be accessed by a transaction as “resources” and revisit the condi-
tions for deadlock outlined at the beginning of Section 3.1.1, we quickly realize that a transaction
may, in the general case, need to access some resources before it knows which others it will need.
Any implementation of serializability based on fine-grain locks will thus entail not only “exclu-
sive use,” but also both “hold and wait” and “circularity.” To address the possibility of deadlock,
a database or lock-based TM system must be prepared to break the “irrevocability” condition by
releasing locks, rolling back, and retrying conflicting transactions.
Like branch prediction or CAS-based fetch_and_Φ, this strategy of proceeding “in the
hope” that things will work out (and recovering when they don’t) is an example of speculation. So-
called lazy TM systems take this even further, allowing conflicting (non-serializable) transactions
to proceed in parallel until one of them is ready to commit —and only then aborting and rolling
back the others.
3.2 LIVENESS
Safety properties—the subject of the previous section—ensure that bad things never happen:
threads are never deadlocked; atomicity is never violated; invariants are never broken. To say that
code is correct, however, we generally want more: we want to ensure forward progress. Just as we
generally want to know that a sequential program will produce a correct answer eventually (not
just fail to produce an incorrect answer), we generally want to know that invocations of concurrent
operations will complete their work and return.
An object method is said to be blocking (in the theoretical sense described in the box on
page 7) if there is some reachable state of the system in which a thread that has called the method
will be unable to return until some other thread takes action. Lock-based algorithms are inher-
ently blocking: a thread that holds a lock precludes progress on the part of any other thread that
needs the same lock. Liveness proofs for lock-based algorithms require not only that the code be
deadlock-free, but also that critical sections be free of infinite loops, and that all threads continue
to execute.
A method is said to be nonblocking if there is no reachable state of the system in which
an invocation of the method will be unable to complete its execution and return. Nonblocking
algorithms have the desirable property that inopportune preemption (e.g., of a lock holder) never
precludes forward progress in other threads. In some environments (e.g., a system with high fault-
tolerance requirements), nonblocking algorithms may also allow the system to survive when a
thread crashes or is prematurely killed. We consider several variants of nonblocking progress in
Section 3.2.1.
In both blocking and nonblocking algorithms, we may also care about fairness—the relative
rates of progress of different threads. We consider this topic briefly in Section 3.2.2.
3.2.2 FAIRNESS
Obstruction freedom and lock freedom clearly admit behavior that defies any notion of fairness:
both allow an individual thread to take an unbounded number of steps without completing an
operation. Even wait freedom allows an operation to execute an arbitrary number of steps (helping
or deferring to peers) before completing, so long as the number is bounded in any given situation.
We shall often want stronger guarantees. In a wait-free algorithm, we might hope for a
static bound, across all invocations, on the number of steps required to complete an operation.
In a blocking algorithm, we might hope for a bound on the number of competing operations
that may complete before a given thread makes progress. If threads repeatedly invoke a certain
set of operations, we might even wish to bound the ratio of their “success” rates. These are only
a few of the possible ways in which “fairness” might be defined. Without dwelling on particular
definitions, we will consider algorithms in subsequent chapters whose behavior ranges from po-
tentially very highly skewed (e.g., test and set locks that avoid starvation only when there are
periodic quiescent intervals, when the lock is free and no thread wants it), to strictly first-come,
first-served (e.g., locks in which a thread employs a wait-free protocol to join a FIFO queue).
We will also consider intermediate options, such as locks that deliberately balance locality (for
performance) against uniformity of service to threads.
In any practical system, forward progress relies on the assumption that any continually
unblocked thread will eventually execute another program step. Without such minimal fairness
within the implementation, a system could be “correct” without doing anything at all! Signifi-
cantly, even this minimal fairness depends on scheduling decisions at multiple system levels—in
the hardware, the operating system, and the language runtime—all of which ensure that runnable
threads continue to run.
When threads may block for mutual exclusion or condition synchronization, we shall in
most cases want to insist that the system display what is known as weak fairness. This property
guarantees that any thread waiting for a condition that is continuously true (or a lock that is con-
tinuously available) eventually executes another program step. Without such a guarantee, pro-
gram behavior may be highly unappealing. Imagine a web server, for example, that never accepts
requests from a certain client connection if requests are available from any other client.
In the following program fragment, weak fairness precludes an execution in which thread
1 spins forever: thread 2 must eventually notice that f is false, complete its wait, and set f to true ,
after which thread 1 must notice the change to f and complete:
initially f = false

thread 1:        thread 2:
await f          await ¬f
                 f := true
Many more stringent definitions of fairness are possible. In particular, strong fairness re-
quires that any thread waiting for a condition that is true infinitely often (or a lock that is available
infinitely often) eventually executes another program step. In the following program fragment,
for example, weak fairness admits an execution in which thread 1 spins forever, but strong fairness
requires thread 2 to notice one of the “windows” in which g is true, complete its wait, and set f
to true , after which thread 1 must notice the change and complete:
initially f = g = false

thread 1:            thread 2:
while ¬f             await g
    g := true        f := true
    g := false
Strong fairness is difficult to truly achieve: it may, for example, require a scheduler to re-
check every awaited condition whenever one of its constituent variables is changed, to make sure
that any thread at risk of starving is given a chance to run. Any deterministic strategy that con-
siders only a subset of the waiting threads on each state change risks the possibility of determin-
istically ignoring some unfortunate thread every time it is able to run.
Fortunately, statistical “guarantees” typically suffice in practice. By considering a randomly
chosen thread—instead of all threads—when a scheduling decision is required, we can drive the
probability of starvation arbitrarily low. A truly random choice is difficult, of course, but various
pseudorandom approaches appear to work quite well. At the hardware level, interconnects and
coherence protocols are designed to make it unlikely that a “race” between two cores (e.g., when
performing near-simultaneous CAS instructions on a previously uncached location) will always
be resolved the same way. Within the operating system, runtime, or language implementation,
one can “randomize” the interval between checks of a condition using a pseudorandom number
generator or even the natural “jitter” in execution time of nontrivial instruction sequences on
complex modern cores.
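The following C++ fragment (my own sketch, with illustrative bounds) shows the kind of pseudorandom spacing described above: the delay between successive checks of a condition is drawn at random from a growing interval, so that no competitor is deterministically favored.

#include <atomic>
#include <chrono>
#include <random>
#include <thread>

// Spin until flag becomes true, sleeping a random, geometrically growing
// interval between checks so that competing threads do not recheck in
// lock step.  The bounds are arbitrary tuning choices.
void await_with_random_backoff(const std::atomic<bool>& flag) {
    thread_local std::mt19937 rng{std::random_device{}()};
    int limit = 1;                        // upper bound on delay, in microseconds
    while (!flag.load(std::memory_order_acquire)) {
        std::uniform_int_distribution<int> dist(0, limit);
        std::this_thread::sleep_for(std::chrono::microseconds(dist(rng)));
        if (limit < 1024) limit *= 2;     // cap the growth
    }
}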
Weak and strong fairness address worst-case behavior, and allow executions that still seem
grossly unfair from an intuitive perspective (e.g., executions in which one thread succeeds a million
times more often than another). Statistical “randomization,” by contrast, may achieve intuitively
very fair behavior without absolutely precluding worst-case starvation.
Much of the theoretical groundwork for fairness was laid by Nissim Francez [1986]. Proofs
of fairness are typically based on temporal logic, which provides operators for concepts like “always”
and “eventually.” A brief introduction to these topics can be found in the text of Ben-Ari [2006,
Chap. 4]; much more extensive coverage can be found in Schneider’s comprehensive work on the
theory of concurrency [1997].
3.3 THE CONSENSUS HIERARCHY
In Section 2.3 we noted that CAS and LL/SC are universal atomic primitives—capable of imple-
menting arbitrary single-word fetch_and_Φ operations. We suggested—implicitly, at least—that
they are fundamentally more powerful than simpler primitives like TAS, Swap, FAI, and FAA.
Herlihy formalized this notion of relative power in his work on wait-free synchronization [1991],
previously mentioned in Section 3.2.1. The formalization is based on the classic consensus problem.
Originally formalized by Fischer, Lynch, and Paterson [1985] in a distributed setting, the
consensus problem involves a set of potentially unreliable threads, each of which “proposes” a
value. e goal is for the reliable threads to agree on one of the proposed values—a task the au-
thors proved to be impossible with asynchronous messages. Herlihy adapted the problem to the
shared-memory setting, where powerful atomic primitives can circumvent impossibility. Specifi-
cally, Herlihy suggested that such primitives (or, more precisely, the objects on which those prim-
itives operate) be classified according to the number of threads for which they can achieve wait-free
consensus.
It is easy to see that an object with a TAS method can achieve wait-free consensus for two
threads:
// initially L = 0; proposal[0] and proposal[1] are immaterial
agree(i):
    proposal[self].store(i)
    if TAS(L) return i
    else return proposal[1 − self].load()
Herlihy was able to show that this is the best one can do: TAS objects (even an arbitrary num-
ber of them) cannot achieve wait-free consensus for more than two threads. Moreover ordinary
load s and store s cannot achieve wait-free consensus at all—even for only two threads. An object
supporting CAS , on the other hand (or equivalently LL / SC ), can achieve wait-free consensus for
an arbitrary number of threads:
// initially v = ⊥
agree(i):
    if CAS(&v, ⊥, i) return i
    else return v
One can, in fact, define an infinite hierarchy of atomic objects, where those appearing at
level k can achieve wait-free consensus for k threads but no more. Objects supporting CAS or
LL/SC are said to have consensus number ∞. Objects with other common primitives—including
TAS, swap, FAI, and FAA—have consensus number 2; ordinary loads and stores have consensus
number 1. One can define atomic objects at intermediate levels of the hierarchy, but these are not
typically encountered on real hardware.
Figure 3.2: Program executions, semantics, and implementations. A valid implementation must pro-
duce only those concrete executions whose output agrees with that of some abstract execution allowed
by language semantics for the given program and input.
Program order is the union of a collection of disjoint total orders, each of which captures the
steps performed by one of the program’s threads. Each thread’s steps must be allowable
under the language’s sequential semantics, given the values returned by read operations.
Synchronization order is a total order, across all threads, on all synchronizing steps. This order
must be consistent with program order within each thread. It must also explain the values
read and written by the synchronizing steps (this will ensure, for example, that acquire and
release operations on any given lock occur in alternating order). Crucially, synchronization
order is not specified by the source program. An execution is valid only if there exists a
synchronization order that leads, as described below, to a writes-seen relation that explains
the values read by ordinary steps.
To complete a memory model, these order definitions must be augmented with a writes-seen
relation. To understand such relations, we first must understand the notion of a data race.
Synchronization Races
The definition of a data race is designed to capture cases in which program behavior may depend on the order in
which two ordinary accesses occur, and this order is not constrained by synchronization. In a similar fashion, we
may wish to consider cases in which program behavior depends on the outcome of synchronization operations.
For each form of synchronization operation, we can define a notion of conflict. Acquire operations on the same
lock, for example, conflict with one another, while an acquire and a release do not—nor do operations on different
locks. A program is said to have a synchronization race if it has two sequentially consistent executions with a
common prefix, and the first steps that differ are conflicting synchronization operations. Together, data races and
synchronization races constitute the class of general races [Netzer and Miller, 1992].
Because we assume the existence of a total order on synchronizing steps, synchronization races never compromise
sequential consistency. Rather, they provide the means of controlling and exploiting nondeterminism in parallel
programs. In any case where we wish to allow conflicting high-level operations to occur in arbitrary order, we
design a synchronization race into the program to mediate the conflict.
subsequent acquire also force the release to appear before the acquire in the total order of any
sequentially consistent execution.
In an execution without any data races, the writes-seen relation is straightforward: the
lack of unordered conflicting accesses implies that all reads and writes of a given location are
ordered by happens-before. Each read can then return the value written by the (unique) most
recent prior write of the same location in happens-before order—or the initial value if there is
no such write. More formally, one can prove that all executions of a data-race-free program are
sequentially consistent: any total order consistent with happens-before will explain the program’s
reads. Moreover, since our (first) definition of a data race was based only on sequentially consistent
executions, we can provide the programmer with a set of rules that, if followed, will always lead
to sequentially consistent executions, with no need to reason about possible relaxed behavior of
the underlying hardware. Such a set of rules is said to constitute a programmer-centric memory
model [Adve and Hill, 1990].
In effect, a programmer-centric model is a contract between the programmer and the im-
plementation: if the programmer follows the rules (i.e., write data-race-free programs), the im-
plementation will provide the illusion of sequential consistency. Moreover, given the absence of
races, any region of code that contains no synchronization (and that does not interact with the
“outside world” via I/O or syscalls) can be thought of as atomic: it cannot—by construction—
interact with other threads.
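As a small illustration of the contract (my own, not from the lecture): the racy fragment in the comment below has undefined behavior in C and C++, while the version in which done is declared atomic is data-race-free, and therefore behaves as if sequentially consistent; the reader must see the value 42.

#include <atomic>
#include <thread>

// Racy version (do not write this): with plain 'int data' and 'bool done',
//   writer: data = 42; done = true;
//   reader: while (!done) {}  use data;
// the accesses to both variables form data races, so the program's behavior
// is undefined.
//
// Data-race-free version: 'done' is atomic (sequentially consistent by
// default), so the accesses to 'data' are ordered by happens-before.
int data = 0;
std::atomic<bool> done{false};

void writer() {
    data = 42;
    done.store(true);              // seq_cst store: publishes data
}

void reader() {
    while (!done.load()) {}        // seq_cst load
    // data == 42 here in every execution
}

int main() {
    std::thread t1(writer), t2(reader);
    t1.join(); t2.join();
}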
But what about programs that do have data races? Some researchers have argued that such
programs are simply buggy, and should have undefined behavior. This is the approach adopted by
C++ [Boehm and Adve, 2008] and, subsequently, C. It rules out certain categories of programs
(e.g., chaotic relaxation [Chazan and Miranker, 1969]), but the language designers had little in
the way of alternatives: in the absence of type safety it is nearly impossible to limit the potential
impact of a data race. The resulting model is quite simple (at least in the absence of variables that
have been declared atomic): if a C or C++ program has a data race on a given input, its behavior
is undefined; otherwise, it follows one of its sequentially consistent executions.
Unfortunately, in a language like Java, even buggy programs need to have well defined
behavior, to safeguard the integrity of the virtual machine (which may be embedded in some
larger, untrusting system). e obvious approach is to say that a read may see the value written
by the most recent write on any backward path through the happens-before graph, or by any
incomparable write (one that is unordered with respect to the read). Unfortunately, as described
by Manson et al. [2005], this approach is overly restrictive: it precludes the use of several important
compiler optimizations. The actual Java model defines a notion of “incremental justification” that
may allow a read to see a value that might have been written by an incomparable write in some
other hypothetical execution. The details are surprisingly subtle and complex, and as of 2012 it is
still unclear whether the current specification is correct, or could be made so.
3.4.3 REAL-WORLD MODELS
As of this writing, Java and C/C++ are the only widely used parallel programming languages
whose definitions attempt to precisely specify a memory model. Ada [Ichbiah et al., 1991] was
the first language to introduce an explicitly relaxed (if informally specified) memory model. It was
designed to facilitate implementation on both shared-memory and distributed hardware: variables
shared between threads were required to be consistent only in the wake of explicit message passing
(rendezvous). e reference implementations of several scripting languages (notably Ruby and
Python) are sequentially consistent, though other implementations [JRuby, Jython] are not.
A group including representatives of Intel, Oracle, IBM, and Red Hat has proposed trans-
actional extensions to C++ [Adl-Tabatabai et al., 2012]. In this proposal, begin transaction and
end transaction markers contribute to the happens-before order inherited from standard C++.
So-called relaxed transactions are permitted to contain other synchronization operations (e.g.,
lock acquire and release ); atomic transactions are not. Dalessandro et al. [2010b] have proposed
an alternative model in which atomic blocks are fundamental, and other synchronization mech-
anisms (e.g., locks) are built on top of them.
If we wish to allow programmers to create new synchronization mechanisms or nonblock-
ing data structures (and indeed if any of the built-in synchronization mechanisms are to be written
in high-level code, rather than assembler), then the memory model must define synchronizing
steps that are more primitive than lock acquire and release . Java allows a variable to be labeled
volatile , in which case reads and writes that access it are included in the global synchronization
order, with each read inducing a synchronizes-with arc (and thus a happens-before arc) from the
(unique) preceding write to the same location. C and C++ provide a substantially more complex
facility, in which variables are labeled atomic, and an individual read, write, or fetch_and_Φ
operation can be labeled as an acquire access, a release access, both, or neither. By default, oper-
ations on atomic variables are sequentially consistent: there is a global total order among them.
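For concreteness, a small C++ sketch (mine) of the facility just described: ready is an atomic variable, and the store and load on it are labeled as a release access and an acquire access respectively, which is weaker than the sequentially consistent default but sufficient to order the accesses to payload.

#include <atomic>

int payload = 0;
std::atomic<bool> ready{false};

void producer() {
    payload = 17;                                      // ordinary write
    ready.store(true, std::memory_order_release);      // "release access"
}

int consumer() {
    while (!ready.load(std::memory_order_acquire)) {}  // "acquire access"
    return payload;    // guaranteed to observe 17: the release/acquire pair
                       // creates a happens-before arc between the threads
}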
A crucial goal in the design of any practical memory model is to preserve, as much as pos-
sible, the freedom of compiler writers to employ code improvement techniques originally devel-
oped for sequential programs. e ordering constraints imposed by synchronization operations
necessitate not only hardware-level ordered accesses or memory fences, but also software-level
“compiler fences,” which inhibit the sorts of code motion traditionally used for latency toler-
ance, redundancy elimination, etc. (Recall that in our pseudocode synchronizing instructions are
intended to enforce both hardware and compiler ordering.) Much of the complexity of C/C++
atomic variables stems from the desire to avoid unnecessary hardware ordering and compiler
fences, across a variety of hardware platforms. Within reason, programmers should attempt in
C/C++ to specify the minimal ordering constraints required for correct behavior. At the same
time, they should resist the temptation to “get by” with minimal ordering in the absence of a
solid correctness argument. In particular, while the language allows the programmer to relax the
default sequential consistency of accesses to atomic variables (presumably to avoid paying for
write atomicity), the result can be very confusing. Recent work by Attiya et al. [2011] has also
shown that certain WkR orderings and fetch_and_Φ operations are essential in a fundamental
way: standard concurrent objects cannot be written without them.
CHAPTER 4
Practical Spin Locks
Peterson's Algorithm
The simplest known 2-thread spin lock (Figure 4.1) is due to Peterson [1981]. The lock is repre-
sented by a pair of Boolean variables, interested[self] and interested[other] (initially false), and
an integer turn that is either 0 or 1. To acquire the lock, thread i indicates its interest by setting
interested[self] and then waiting until either (a) the other thread is not interested or (b) turn is
set to the other thread, indicating that thread i set it first.
To release the lock, thread i sets interested[self] back to false. This allows the other thread,
if it is waiting, to enter the critical section. The initial value of turn in each round is immaterial:
it serves only to break the tie when both threads are interested in entering the critical section.
In his original paper, Peterson showed how to extend the lock to n threads by proceeding
through a series of n − 1 rounds, each of which eliminates a possible contender. Total (remote-
access) time for a thread to enter the critical section, however, is Θ(n²), even in the absence of
contention. In separate work, Peterson and Fischer [1977] showed how to generalize any 2-thread
solution to n threads with a hierarchical tournament that requires only O(log n) time, even in the
class lock
    (0, 1) turn
    bool interested[0..1] := { false, false }

lock.acquire():
    other := 1 − self
    interested[self].store(true)
    turn.store(self)
    while interested[other].load() and turn.load() ≠ other;    // spin
    fence(kRW)

lock.release():
    interested[self].store(false, RWk)

Figure 4.1: Peterson's 2-thread spin lock. Variable self must be either 0 or 1.
presence of contention. Burns and Lynch [1980] proved that any deadlock-free mutual exclusion
algorithm using only reads and writes requires Ω(n) space.
Figure 4.2: Lamport’s bakery algorithm. e max operation is not assumed to be atomic. It is, how-
ever, assumed to read each number field only once.
counters. His algorithm has the arguably more significant advantage that threads acquire the lock
in the order in which they first indicate their interest—i.e., in FIFO order.
Each thread begins by scanning the number array to find the largest “ticket” value held
by a waiting thread. During the scan it sets its choosing flag to true to let its peers know that
its state is in flux. After choosing a ticket higher than any it has seen, it scans the array again,
spinning until each peer's ticket is (a) stable and (b) greater than or equal to its own. The second
class lock
    T x
    T y := ⊥
    bool trying[T] := { false … }

lock.acquire():
    loop
        trying[self].store(true)
        x.store(self)
        if y.load() ≠ ⊥
            trying[self].store(false)
            while y.load() ≠ ⊥;            // spin
            continue                        // go back to top of loop
        y.store(self)
        if x.load() ≠ self
            trying[self].store(false)
            for i ∈ T
                while trying[i].load();     // spin
            if y.load() ≠ self
                while y.load() ≠ ⊥;        // spin
                continue                    // go back to top of loop
        break
    fence(kRW)

lock.release():
    y.store(⊥, RWk)
    trying[self].store(false)
and Taubenfeld [2000] show how to reduce this time to O(m), where m is the number of threads
concurrently competing for access.
class lock
    bool f := false

lock.acquire():
    while ¬TAS(&f)
        while f;        // spin
    fence(kRW)

lock.release():
    f.store(false, RWk)
Figure 4.5: The test-and-test_and_set lock. Unlike the test_and_set lock of Figure 4.4, this code will
typically induce interconnect traffic only when the lock is modified by another core.
not only with other threads that are attempting to acquire the lock, but also with any attempt by
the lock owner to release the lock.
Performance can be improved by arranging to obtain write permission on the lock only
when it appears to be free. Proposed by Rudolph and Segall [1984], this test-and-test_and_set
lock is still extremely simple (Figure 4.5), and tends to perform well on machines with a small
handful of cores. Whenever the lock is released, however, every competing thread will fall out of
its inner loop and attempt another TAS, each of which induces coherence traffic. With n threads
continually attempting to execute critical sections, total time per acquire–release pair will be
O(n), which is still unacceptable on a machine with more than a handful of cores.
Drawing inspiration from the classic Ethernet contention protocol [Metcalfe and Boggs,
1976], Anderson et al. [1990] proposed an exponential backoff strategy for test_and_set locks
(Figure 4.6). Experiments indicate that it works quite well in practice, leading to near-constant
overhead per acquire–release pair on many machines. Unfortunately, it depends on constants (the
base, multiplier, and limit for backoff) that have no single best value in all situations. Ideally, they
should be chosen individually for each machine and workload. Note that test_and_set suffices in
the presence of backoff; test-and-test_and_set is not required.
Figure 4.6: The test_and_set lock with exponential backoff. The pause(k) operation is typically an
empty loop that iterates k times. Ideal choices of base, limit, and multiplier values depend on the
machine architecture and, typically, the application workload.
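A C++ sketch (mine, with illustrative constants and a sleep standing in for pause) of the approach Figure 4.6 describes: spin on test_and_set, and after each failed attempt pause for an exponentially growing, bounded interval.

#include <algorithm>
#include <atomic>
#include <chrono>
#include <thread>

// Illustrative test_and_set lock with exponential backoff.
// base, multiplier, and limit are tuning parameters, as the caption notes.
class BackoffLock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    static constexpr int base = 1, multiplier = 2, limit = 1024;   // microseconds
public:
    void acquire() {
        int delay = base;
        while (f.test_and_set(std::memory_order_acquire)) {   // TAS failed: back off
            std::this_thread::sleep_for(std::chrono::microseconds(delay));
            delay = std::min(delay * multiplier, limit);
        }
    }
    void release() {
        f.clear(std::memory_order_release);
    }
};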
class lock
    int next_ticket := 0
    int now_serving := 0
    const int base = ...               // tuning parameter

lock.acquire():
    int my_ticket := FAI(&next_ticket)
        // returns old value; arithmetic overflow is harmless
    loop
        int ns := now_serving.load()
        if ns = my_ticket
            break
        pause(base × (my_ticket − ns))
            // overflow in subtraction is harmless
    fence(kRW)

lock.release():
    int t := now_serving + 1
    now_serving.store(t, RWk)
Figure 4.7: The ticket lock with proportional backoff. Tuning parameter base should be chosen to be
roughly the length of a trivial critical section.
been waiting a very long time to be passed up by a relative newcomer; in principle, a thread can
starve.
The ticket lock [Fischer et al., 1979, Reed and Kanodia, 1979] (Figure 4.7) addresses this
problem. Like Lamport's bakery lock, it grants the lock to competing threads in first-come-first-
served order. Unlike the bakery lock, it uses fetch_and_increment to get by with constant space,
and with time (per lock acquisition) roughly linear in the number of competing threads.
The code in Figure 4.7 employs a backoff strategy due to Mellor-Crummey and Scott
[1991b]. It leverages the fact that my_ticket − L.now_serving represents the number of threads
ahead of the calling thread in line. If those threads consume an average of k × base time per
critical section, the calling thread can be expected to probe now_serving about k times before
acquiring the lock. Under high contention, this can be substantially smaller than the O(n) probes
expected without backoff.
In a system that runs long enough, the next_ticket and now_serving counters can be ex-
pected to exceed the capacity of a fixed word size. Rollover is harmless, however: the maximum
number of threads in any reasonable system will be less than the largest representable integer, and
subtraction works correctly in the ring of integers mod 2^wordsize.
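A hedged C++ rendering (mine) of the ticket lock of Figure 4.7, with proportional backoff approximated by a simple yield loop; the wraparound of the unsigned counters is harmless, as just noted.

#include <atomic>
#include <thread>

// Illustrative ticket lock with proportional backoff.
class TicketLock {
    std::atomic<unsigned> next_ticket{0};
    std::atomic<unsigned> now_serving{0};
    static constexpr unsigned base = 10;    // tuning parameter (yield iterations)
public:
    void acquire() {
        unsigned my_ticket = next_ticket.fetch_add(1);      // FAI; overflow harmless
        for (;;) {
            unsigned ns = now_serving.load(std::memory_order_acquire);
            if (ns == my_ticket) break;
            unsigned ahead = my_ticket - ns;                // mod-2^wordsize subtraction
            for (unsigned i = 0; i < base * ahead; ++i)
                std::this_thread::yield();                  // stand-in for pause()
        }
    }
    void release() {
        // only the lock holder writes now_serving, so a load/store pair suffices
        now_serving.store(now_serving.load(std::memory_order_relaxed) + 1,
                          std::memory_order_release);
    }
};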
Threads holding or waiting for the lock are chained together, with the link in the qnode
of thread t pointing to the qnode of the thread to which t should pass the lock when done with
its critical section. The lock itself is simply a pointer to the qnode of the thread at the tail of the
queue, or null if the lock is free.
Operation of the lock is illustrated in Figure 4.9. The acquire method allocates a new
qnode , initializes its next pointer to null , and swap s it into the tail of the queue. If the value
returned by the swap is null , then the calling thread has acquired the lock (line 2). If the value
returned by the swap is non- null , it refers to the qnode of the caller’s predecessor in the queue
(indicated by the dashed arrow in line 3). Here thread B must set A’s next pointer to refer to its
own qnode . Meanwhile, some other thread C may join the queue (line 4).
When thread A has completed its critical section, the release method reads the next
pointer of A’s qnode to find the qnode of its successor B. It changes B’s waiting flag to false ,
thereby granting it the lock (line 5).
If release finds that the next pointer of its qnode is null , it attempts to CAS the lock tail
pointer back to null . If some other thread has already swap ped itself into the queue (line 5), the
CAS will fail, and release will wait for the next pointer to become non- null (line 6). If there are
no waiting threads (line 7), the CAS will succeed, returning the lock to the appearance in line 1.
The MCS lock has several important properties. Threads join the queue in a wait-free
manner (using swap), after which they receive the lock in FIFO order. Each waiting thread
spins on a separate location, eliminating contention for cache and interconnect resources. In fact,
because each thread allocates its own qnode , it can arrange for it to be local even on an NRC-
[Figure 4.9 diagram: queue states (1)–(7), showing the lock initially free, threads A, B, and C swapping themselves onto the tail of the queue, and the lock being passed from A to B and then to C.]
Figure 4.9: Operation of the MCS lock. An ‘R’ indicates that the thread owning the given qnode
is running its critical section (parentheses indicate that the value of the waiting flag is immaterial).
A ‘W’ indicates that the corresponding thread is waiting. A dashed arrow represents a local pointer
(returned to the thread by swap ).
NUMA machine. Total (remote access) time to pass the lock from one thread to the next is
constant. Total space is linear in the number of threads and locks.
As written (Figure 4.8), the MCS lock requires both swap and CAS . CAS can of course be
used to emulate the swap in the acquire method, but entry to the queue drops from wait-free to
lock-free (meaning that a thread can theoretically starve). Mellor-Crummey and Scott [1991b]
also show how to make do with only swap in the release method, but FIFO ordering may be
lost when a thread enters the queue just as its predecessor is releasing the lock.
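For reference, a simplified C++ sketch (mine, not the code of Figure 4.8) of the structure described above: a swap on the tail pointer to join the queue, a per-thread qnode on which to spin, and a CAS in release to detect an empty queue.

#include <atomic>

// Illustrative MCS queued lock.  Each thread passes its own qnode to
// acquire and the same qnode to release (the standard MCS interface).
struct MCSNode {
    std::atomic<MCSNode*> next{nullptr};
    std::atomic<bool> waiting{false};
};

class MCSLock {
    std::atomic<MCSNode*> tail{nullptr};
public:
    void acquire(MCSNode* me) {
        me->next.store(nullptr, std::memory_order_relaxed);
        me->waiting.store(true, std::memory_order_relaxed);
        MCSNode* pred = tail.exchange(me, std::memory_order_acq_rel);   // swap
        if (pred != nullptr) {                     // someone is ahead of us
            pred->next.store(me, std::memory_order_release);
            while (me->waiting.load(std::memory_order_acquire)) { /* spin locally */ }
        }
    }
    void release(MCSNode* me) {
        MCSNode* succ = me->next.load(std::memory_order_acquire);
        if (succ == nullptr) {
            MCSNode* expected = me;
            if (tail.compare_exchange_strong(expected, nullptr,
                    std::memory_order_acq_rel))    // no waiters: lock is now free
                return;
            // a newcomer swapped itself in; wait for it to link behind us
            while ((succ = me->next.load(std::memory_order_acquire)) == nullptr) {}
        }
        succ->waiting.store(false, std::memory_order_release);   // pass the lock
    }
};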
Figure 4.10: K42 variant of the MCS queued lock. Note the standard interface to acquire and
release , with no parameters other than the lock itself.
[Figure 4.11 diagram: states (1)–(5) of the K42 MCS lock's tail and next fields as threads A, B, and C arrive and thread A finishes.]
Figure 4.11: Operation of the K42 MCS lock. An ‘R’ (running) indicates a null “ tail ” pointer; ‘W’
indicates a “waiting” flag. Dashed boxes indicate qnode s that are no longer needed, and may safely be
freed by returning from the method in which they were declared. In (1) the lock is free. In (2) a single
thread is active in the critical section. In (3) and (4) two new threads have arrived. In (5) thread A has
finished and thread B is now active.
section: we cannot bound the time that may elapse before a successor needs to inspect that node.
This requirement is accommodated by having a thread provide a fresh qnode to acquire, and
return with a different qnode from release.
In their original paper, Magnussen, Landin, and Hagersten presented two versions of their
lock: a simpler “LH” lock and an enhanced “M” lock; the latter reduces the number of cache
misses in the uncontended case by allowing a thread to keep its original qnode when no other
thread is trying to acquire the lock. The M lock needs CAS to resolve the race between a thread
that is trying to release a heretofore uncontended lock and the arrival of a new contender. The
LH lock has no such races; all it needs is swap.
Craig’s lock is essentially identical to the LH lock: it differs only in the mechanism used to
pass qnodes to and from the acquire and release methods. It has become conventional to refer
to this joint invention by the initials of all three inventors: CLH.
Code for the CLH lock appears in Figure 4.12. An illustration of its operation appears
in Figure 4.13. A free lock (line 1 of the latter figure) contains a pointer to a qnode whose
succ must wait flag is false. Newly arriving thread A (line 2) obtains a pointer to this node
type qnode = record
    qnode* prev                  // read and written only by owner thread
    bool succ_must_wait

class lock
    qnode dummy := { null, false }
        // ideally, dummy and tail should lie in separate cache lines
    qnode* tail := &dummy

lock.acquire(qnode* p):
    p→succ_must_wait := true
    qnode* pred := p→prev := swap(&tail, p, Wk)
    while pred→succ_must_wait.load();    // spin
    fence(kRW)

lock.release(qnode** pp):
    qnode* pred := (*pp)→prev
    (*pp)→succ_must_wait.store(false, RWk)
    *pp := pred                           // take pred's qnode

Figure 4.12: The CLH queued lock.
(dashed arrow) by executing a swap on the lock tail pointer. It then spins on this node (or simply
observes that its succ_must_wait flag is already false). Before returning from acquire, it stores
the pointer into its own qnode so it can find it again in release. (In the LH version of the
lock [Magnussen, Landin, and Hagersten, 1994], there was no pointer in the qnode; rather, the
API for acquire returned a pointer to the predecessor qnode as an explicit parameter.)
To release the lock (line 4), thread A writes false to the succ_must_wait field of its own
qnode and then leaves that qnode behind, returning with its predecessor's qnode instead (here
previously marked 'X'). Thread B, which arrived at line 3, releases the lock in the same way. If no
other thread is waiting at this point, the lock returns to the state in line 1.
In his original paper, Craig [1993] explored several extensions to the CLH lock. By intro-
ducing an extra level of indirection, one can eliminate remote spinning even on an NRC-NUMA
machine—without requiring CAS, and without abandoning either strict FIFO ordering or wait-
free entry. By linking the list both forward and backward, and traversing it at acquire time, one
can arrange to grant the lock in order of some external notion of priority, rather than first-come-
first-served (Markatos [1991] presented a similar technique for MCS locks). By marking nodes
as abandoned, and skipping over them at release time, one can accommodate timeout (we will
consider this topic further in Section 7.5.2, together with the possibility—suggested by Craig as
future work—of skipping over threads that are currently preempted). Finally, Craig sketched a
technique to accommodate nested critical sections without requiring a thread to allocate multiple
qnodes : arrange for the thread to acquire its predecessor’s qnode when the lock is acquired rather
than when it is released, and maintain a separate thread-local stack of pointers to the qnodes that
must be modified in order to release the locks.
[Figure 4.13 diagram: states (1)–(5) of the CLH queue as threads A and B arrive, thread A releases the lock (taking over its predecessor's qnode X), and thread B runs.]
Figure 4.13: Operation of the CLH lock. An ‘R’ indicates that a thread spinning on this qnode (i.e.,
the successor of the thread that provided it) is free to run its critical section; a ‘W’ indicates that it
must wait. Dashed boxes indicate qnode s that are no longer needed by successors, and may be reused
by the thread releasing the lock. Note the change in label on such nodes, indicating that they now
“belong” to a different thread.
to threads currently waiting for a lock are overwritten, at the end of acquire , before being read
again.)
absence of dummy nodes, space needs are lower for MCS locks, but performance may be better
(by a small constant factor) for CLH locks on some machines.
Given the overhead of inspecting and updating owner and count fields, many designers choose
not to make locks reentrant by default.
The astute reader may notice that the read of owner in reentrant lock.acquire races with
the writes of owner in both acquire and release . In memory models that forbid data races, the
owner field may need to be declared as volatile or atomic .
in Sections 5.3.3, 7.4.3, and 8.6.3. It is also a key feature of flat combining, which we will consider
briefly in Section 5.4.
The problem, of course, is that load-store-only acquire routines invariably contain some
variant of the Dekker store–load sequence—
interested[self] := true // store
bool potential conflict := interested[other] // load
if potential conflict …
—and this code works correctly on a non-sequentially consistent machine only when augmented
with (presumably also expensive) WkR ordering between the first and second lines. e cost of
the ordering has led several researchers [Dice et al., 2001, Russell and Detlefs, 2006, Vasudevan
et al., 2010] to propose asymmetric Dekker-style synchronization. Applied to Peterson’s lock, the
solution looks as shown in Figure 4.16.
The key is the handshake operation on the “slow” (non-preferred) path of the lock. This
operation must interact with execution on the preferred thread's core in such a way that
1. if the preferred thread set fast interested before the interaction, then the non-preferred
thread is guaranteed to see it afterward.
2. if the preferred thread did not set fast interested before the interaction, then it (the pre-
ferred thread) is guaranteed to see slow interested afterward.
core. Dice et al. [2001] explore many of these options in detail. Because of their cost, they are
profitable only in cases where access by non-preferred threads is exceedingly rare. In subsequent
work, Dice et al. [2003] observe that handshaking can be avoided if the underlying hardware
provides coherence at word granularity, but supports atomic writes at subword granularity.
CHAPTER 5
Busy-Wait Synchronization with Conditions
5.1 FLAGS
In its simplest form, a flag is a Boolean variable, initially false, on which a thread can wait:

class flag
    bool f := false

flag.set():
    f.store(true, RWk)

flag.await():
    while ¬f.load();     // spin
    fence(kRW)
Methods set and await are presumably called by different threads. Code for set consists
of a release-annotated store; await ends with an acquire fence. These reflect the fact that one
typically uses set to indicate that previous operations of the calling thread (e.g., initialization
of a shared data structure) have completed; one typically uses await to ensure that subsequent
operations of the calling thread do not begin until the condition holds.
In some algorithms, it may be helpful to have a reset method:
flag.reset():
f.store(false, kW)
Before calling reset , a thread must ascertain (generally through application-specific means)
that no thread is still using the flag for its previous purpose. The kW ordering on the store ensures
that any subsequent updates (to be announced by a future set) are seen to happen after the reset.
In an obvious generalization of flags, one can arrange to wait on an arbitrary predicate:
class predicate
    abstract bool eval()     // to be extended by users

predicate.await():
    while ¬eval();           // spin
    fence(RkRW)
This latter form is the notation we employed in Chapters 1 and 3. It must be used with care: the
absence of an explicit set method means there is no obvious place to specify the release ordering
that typically accompanies the set ting of a Boolean flag. In any program that spins on nontrivial
conditions, a thread that changes a variable that may contribute to such a condition may need to
declare the variable as volatile or atomic , or update it with a RWk store . One must also consider
the atomicity of attempts to check the condition, and the monotonicity of the condition itself: an
await will generally be safe if the condition will become true due to a single store in some other
thread, and never again become false. Without such a guarantee, it is unclear what can safely be
assumed by the code that follows the await . We will return to generalized await statements when
we consider conditional critical regions in Section 7.4.1.
The cycle method of barrier b (sometimes called wait, next, or even barrier) forces each thread
i to wait until all threads have reached that same point in their execution. Calling cycle accomplishes
two things: it announces to other threads that all work prior to the barrier in the current
thread has been completed (this is the arrival part of the barrier), and it ensures that all work
prior to the barrier in other threads has been completed before continuing execution in the current
thread (this is the departure part). To avoid data races, the arrival part typically includes a
release (RW‖) fence or synchronizing store; the departure part typically ends with an acquire
(‖RW) fence.
The simplest barriers, commonly referred to as centralized, employ a small, fixed-size data
structure, and consume Θ(n) time between the arrival of the first thread and the departure of the
last. More complex barriers distribute the data structure among the threads, consuming O(n) or
O(n log n) space, but requiring only Θ(log n) time.
For any maximum number of threads n, of course, log n is a constant, and with hardware
support it can be a very small constant. Some multiprocessors (e.g., the Cray X/XE/Cascade,
SGI UV, and IBM Blue Gene series) exploit this observation to provide special constant-time
barrier operations (the Blue Gene machines, though, do not have a global address space). With
a large number of processors, constant-time hardware barriers can provide a substantial benefit
over log-time software barriers.
In effect, barrier hardware performs a global AND operation, setting a flag or asserting
a signal once all cores have indicated their arrival. It may also be useful, especially on NRC-
NUMA machines, to provide a global OR operation (sometimes known as Eureka) that can be
used to determine when any one of a group of threads has indicated its arrival. Eureka mechanisms
are commonly used for parallel search: as soon as one thread has found a desired element (e.g.,
in its portion of some large data set), the others can stop looking. The principal disadvantage of
hardware barriers and eureka mechanisms is that they are difficult to virtualize or share among
the dynamically changing processes and threads of a multiprogrammed workload.
The first subsection below presents a particularly elegant formulation of the centralized barrier.
The following three subsections present different log-time barriers; a final subsection summarizes
their relative advantages.
thread will have its own cached copy of the sense flag, and post-invalidation refills will generally
be able to pipeline or combine, for much lower per-access latency.
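As a point of reference, a sense-reversing centralized barrier of the kind discussed above might be sketched in C++ as follows; the class and variable names are illustrative, and local_sense would normally live in thread-local storage:

    #include <atomic>

    class CentralBarrier {
        std::atomic<int> count;
        std::atomic<bool> sense{false};
        const int n;
    public:
        explicit CentralBarrier(int nthreads) : count(nthreads), n(nthreads) {}
        void cycle(bool& local_sense) {
            local_sense = !local_sense;                        // each episode flips the sense
            if (count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                count.store(n, std::memory_order_relaxed);     // last arrival resets the count
                sense.store(local_sense, std::memory_order_release);   // and releases its peers
            } else {
                while (sense.load(std::memory_order_acquire) != local_sense) { }   // spin
            }
        }
    };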
Figure 5.2: A software combining tree barrier. FAD is fetch and decrement .
out of a second are used to allow them to continue. Figure 5.2 shows a variant of this combining
tree barrier, as modified by Mellor-Crummey and Scott [1991b] to incorporate sense reversal and
to replace the fetch and Φ instructions of the second combining tree with simple reads (since no
real information is returned).
Simulations by Yew et al. [1987] show that a software combining tree can significantly
decrease contention for reduction variables, and Mellor-Crummey and Scott [1991b] confirm
this result for barriers. At the same time, the need to perform (typically expensive) fetch and Φ
operations at each node of the tree induces substantial constant-time overhead. On an NRC-
NUMA machine, most of the spins can also be expected to be remote, leading to potentially
unacceptable contention. The barriers of the next two subsections tend to work much better in
practice, making combining tree barriers mainly a matter of historical interest. This said, the
notion of combining—broadly conceived—has proven useful in the construction of a wide range
of concurrent data structures. We will return to the concept briefly in Section 5.4.
Figure 5.3: Communication pattern for the dissemination barrier (adapted from Hensgen et al.
[1988]).
duces barrier latency by eliminating the separation between arrival and departure. The algorithm
proceeds through ⌈log₂ n⌉ (unsynchronized) rounds. In round k, each thread i signals thread
(i + 2^k) mod n. The resulting pattern (Figure 5.3), which works for arbitrary n (not just a power
of 2), ensures that by the end of the final round every thread has heard—directly or indirectly—
from every other thread.
Code for the dissemination barrier appears in Figure 5.4. The algorithm uses alternating
sets of variables (chosen via parity) in consecutive barrier episodes, avoiding interference without
requiring two separate spins in each round. It also uses sense reversal to avoid resetting variables
after every episode. The flags on which each thread spins are statically determined (allowing them
to be local even on an NRC-NUMA machine), and no two threads ever spin on the same flag.
Interestingly, while the critical path length of the dissemination barrier is ⌈log₂ n⌉, the
total amount of interconnect traffic (remote writes) is n⌈log₂ n⌉. (Space requirements are also
O(n log n).) This is asymptotically larger than the O(n) space and bandwidth of the centralized
and combining tree barriers, and may be a problem on machines whose interconnection networks
have limited cross-sectional bandwidth.
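The following C++ sketch captures the structure just described; it is not the code of Figure 5.4, and the flag layout, padding, and names are merely illustrative:

    #include <atomic>
    #include <vector>

    class DisseminationBarrier {
        int n, rounds;
        std::vector<std::atomic<bool>> flags;        // n threads x 2 parities x rounds flags
        std::atomic<bool>& flag(int i, int parity, int k) {
            return flags[(i * 2 + parity) * rounds + k];
        }
    public:
        explicit DisseminationBarrier(int nthreads) : n(nthreads), rounds(0) {
            while ((1 << rounds) < n) ++rounds;      // ceil(log2 n)
            flags = std::vector<std::atomic<bool>>(n * 2 * rounds);
            for (auto& f : flags) f.store(false);
        }
        // Each thread passes its id plus its own parity (initially 0) and sense
        // (initially true); both persist from one episode to the next.
        void cycle(int self, int& parity, bool& sense) {
            for (int k = 0; k < rounds; ++k) {
                int partner = (self + (1 << k)) % n;         // signal thread (self + 2^k) mod n
                flag(partner, parity, k).store(sense, std::memory_order_release);
                while (flag(self, parity, k).load(std::memory_order_acquire) != sense) { }  // spin
            }
            if (parity == 1) sense = !sense;         // sense flips every other episode
            parity = 1 - parity;
        }
    };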
the last arrival, however, it is a particular thread (say the one from the left-most child) that always
continues upward. Other threads set a flag in the node to let the “winner” know they have arrived.
If the winner arrives before its peers, it simply waits. Wakeup can proceed back down the tree, as
in the combining tree barrier, or (on a machine with broadcast-based cache coherence) it can use
a global flag. With care, the tree can be designed to avoid remote spinning, even on an NRC-
NUMA machine, though the obvious way to do so increases space requirements from O(n) to
O(n log₂ n) [Lee, 1990, Mellor-Crummey and Scott, 1991b].
Inspired by experience with tournament barriers, Mellor-Crummey and Scott [1991b] pro-
posed a static tree barrier that takes logarithmic time and linear space, spins only on local locations
(even on an NRC-NUMA machine), and performs the theoretical minimum number of remote
memory accesses (2n − 2) on machines that lack broadcast. Unlike a tournament barrier, the static
tree barrier associates threads with internal nodes as well as leaves, thereby reducing the overall
size of the tree. Each thread signals its parent, which in turn signals its parent when it has heard
from all of its children.
Code for the static tree barrier appears in Figure 5.5. It incorporates a minor bug fix from
Kishore Ramachandran. Each thread is assigned a unique tree node which is linked into an arrival
    type node = record
        bool parent sense := false
        bool* parent ptr
        bool have child[0..3]               // for arrival
        bool child not ready[0..3]
        bool* child ptrs[0..1]              // for departure
        bool dummy                          // pseudodata

    class barrier
        bool sense[T] := true
        node nodes[T]
            // on an NRC-NUMA machine, nodes[i] should be local to thread i
            // in nodes[i]:
            //     have child[j] = true iff 4i + j + 1 < n
            //     parent ptr = &nodes[⌊(i−1)/4⌋].child not ready[(i−1) mod 4],
            //         or &dummy if i = 0
            //     child ptrs[0] = &nodes[2i + 1].parent sense, or &dummy if 2i + 1 ≥ n
            //     child ptrs[1] = &nodes[2i + 2].parent sense, or &dummy if 2i + 2 ≥ n
            // initially child not ready := have child

        barrier.cycle():
            fence(RW‖)
            node* n := &nodes[self]
            bool my sense := sense[self]
            while n→child not ready.load() ≠ { false, false, false, false };   // spin
            n→child not ready.store(n→have child)      // prepare for next episode
            *n→parent ptr.store(false)                 // let parent know we’re ready
            // if not root, wait until parent signals departure:
            if self ≠ 0
                while n→parent sense.load() ≠ my sense;    // spin
            // signal children in departure tree:
            *n→child ptrs[0].store(my sense)
            *n→child ptrs[1].store(my sense)
            sense[self] := ¬my sense
            fence(‖RW)
Figure 5.5: A static tree barrier with local-spinning tree-based departure.
tree by a parent link and into a wakeup tree by a set of child links. It is useful to think of the trees
as separate because their arity may be different. The code shown here uses an arrival fan-in of 4
and a departure fan-out of 2, which worked well in the authors’ original (c. 1990) experiments.
Assuming that the hardware supports single-byte writes, fan-in of 4 (on a 32-bit machine) or 8
(on a 64-bit machine) allows a thread to use a single-word spin to wait for all of its arrival-tree
children simultaneously. Optimal departure fan-out is likely to be machine-dependent. As in the
tournament barrier, wakeup on a machine with broadcast-based global cache coherence could
profitably be effected with a single global flag.
Table 5.1: Tradeoffs among leading software barriers. Critical path lengths are in remote memory
references (assuming broadcast on a CC-NUMA machine); they may not correspond precisely to
wall-clock time. Space needs are in words. Constants a and d in the static tree barrier are arrival
fan-in and departure fan-out, respectively. Fuzzy barriers are discussed in Section 5.3.1.

                                   central         dissemination         static tree
    space needs
        CC-NUMA                    n + 1           n + 2n⌈log₂ n⌉        4n + 1
        NRC-NUMA                                                         (5 + d)n
    critical path length
        CC-NUMA                    n + 1           ⌈log_a n⌉ + 1
                                                   ⌈log₂ n⌉
        NRC-NUMA                   ∞                                     ⌈log_a n⌉ + ⌈log_d n⌉
    total remote refs
        CC-NUMA                    n + 1 .. 2n     n⌈log₂ n⌉             n
        NRC-NUMA                   ∞                                     2n − 2
    fuzzy barrier suitability      +
    tolerance of changes in n      +
Figure 5.6: Impact of variation across threads in phase execution times, with normal barriers (left) and
fuzzy barriers (right). Blue work bars are the same length in each version of the figure. Fuzzy intervals
are shown as outlined boxes. With fuzzy barriers, threads can leave the barrier as soon as the last peer
has entered its fuzzy interval. Overall performance improvement is shown by the double-headed arrow
at center.
    in parallel for i ∈ T
        repeat
            // do i’s portion of the work of a phase
            b.cycle()
        until terminating condition

becomes

    in parallel for i ∈ T
        repeat
            // do i’s critical work for this phase
            b.arrive()
            // do i’s non-critical work—its fuzzy interval
            b.depart()
        until terminating condition
As illustrated on the right side of Figure 5.6, the impact on overall run time can be a dramatic
improvement.
A centralized barrier is easily modified to produce a fuzzy variant (Figure 5.7). Unfortu-
nately, none of the logarithmic barriers we have considered has such an obvious fuzzy version.
We address this issue in the following subsection.
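In the spirit of Figure 5.7 (not reproduced here), the sense-reversing centralized barrier sketched earlier splits into arrive and depart almost trivially; the following C++ sketch uses illustrative names:

    #include <atomic>

    class FuzzyCentralBarrier {
        std::atomic<int> count;
        std::atomic<bool> sense{false};
        const int n;
    public:
        explicit FuzzyCentralBarrier(int nthreads) : count(nthreads), n(nthreads) {}
        void arrive(bool& local_sense) {              // announce arrival; do not wait
            local_sense = !local_sense;
            if (count.fetch_sub(1, std::memory_order_acq_rel) == 1) {
                count.store(n, std::memory_order_relaxed);
                sense.store(local_sense, std::memory_order_release);
            }
        }
        void depart(bool local_sense) {               // wait until every peer has arrived
            while (sense.load(std::memory_order_acquire) != local_sense) { }   // spin
        }
    };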
Figure 5.8: Dynamic modification of the arrival tree in an adaptive combining tree barrier.
original (non-adaptive) combining tree. They also employ test and set locks to arbitrate access
to each tree node. To improve performance—particularly but not exclusively on NRC-NUMA
machines—Scott and Mellor-Crummey [1994] present versions of the adaptive combining tree
barrier (both regular and fuzzy) that spin only on local locations and that adapt the tree in a
wait-free fashion, without the need for per-node locks. In the process they also fix several subtle
bugs in the earlier algorithms. Shavit and Zemach [2000] generalize the notion of combining to
support more general operations than simply “arrive at barrier”; we will return to their work in
Section 5.4.
Scott and Mellor-Crummey report mixed performance results for adaptive barriers: if ar-
rival times are skewed across threads, tree adaptation can make a significant difference, both by
reducing departure times in the wake of the last arrival and by making fuzzy intervals compatible
with a logarithmic critical path. If thread arrival times are very uniform, however, the overhead
of adaptation may yield a net loss in performance. As with many other tradeoffs, the break-even
point will vary with both the machine and workload.
CHAPTER 6
Read-mostly Atomicity
In Chapter 4 we considered the topic of busy-wait mutual exclusion, which achieves atomicity by
allowing only one thread at a time to execute a critical section. While mutual exclusion is suffi-
cient to ensure atomicity, it is by no means necessary. Any mechanism that satisfies the ordering
constraints of Section 3.1.2 will also suffice. In particular, read-mostly optimizations exploit the
fact that operations can safely execute concurrently, while still maintaining atomicity, if they read
shared data without writing it.
Section 6.1 considers the simplest read-mostly optimization: the reader-writer lock, which
allows multiple readers to occupy their critical section concurrently, but requires writers (that is,
threads that may update shared data, in addition to reading it) to exclude both readers and other
writers. To use the “reader path” of a reader-writer lock, a thread must know, at the beginning of
the critical section, that it will never attempt to write. Sequence locks, the subject of Section 6.2,
relax this restriction by allowing a reader to “upgrade” to writer status if it forces all concurrent
readers to back out and retry their critical sections. (Transactional memory, which we will consider
in Chapter 9, can be considered a generalization of sequence locks. TM systems typically auto-
mate the back-out-and-retry mechanism; sequence locks require the programmer to implement
it by hand.) Finally, read-copy update (RCU), the subject of Section 6.3, explores an extreme po-
sition in which the overhead of synchronization is shifted almost entirely off of readers and onto
writers, which are assumed to be quite rare.
did. Both of these options permit indefinite postponement and even starvation of non-preferred
threads when competition for the lock is high. Though not explicitly recognized by Courtois et al.,
it is also possible to construct a reader-writer lock (called a “fair” lock below) in which readers
wait for any earlier writer and writers wait for any earlier thread of either kind.
The locks of Courtois et al. were based on semaphores, a scheduler-based synchronization
mechanism that we will introduce in Section 7.2. In the current chapter we limit ourselves to
busy-wait synchronization. Like standard mutual-exclusion locks, reader-writer locks admit a
wide range of special-purpose adaptations. Calciu et al. [2013], for example, describe mechanisms
to extend the locality-conscious locking of Section 4.5.1 to the reader-writer case.
    class rw lock
        ⟨short, short, bool⟩ n := ⟨0, 0, false⟩
            // high half of word counts active readers; low half counts waiting writers,
            // except for low bit, which indicates whether a writer is active
        const int base, limit, multiplier = …        // tuning parameters

        rw lock.writer acquire():
            int delay := base
            loop
                ⟨short ar, short ww, bool aw⟩ := n.load()
                if aw = false and ar = 0             // no active writer or readers
                    if CAS(&n, ⟨ar, ww, false⟩, ⟨ar, ww, true⟩) break
                    // else retry
                else if CAS(&n, ⟨ar, ww, aw⟩, ⟨ar, ww+1, aw⟩)
                    // I’m registered as waiting
                    loop                             // spin
                        ⟨ar, ww, aw⟩ := n.load()
                        if aw = false and ar = 0     // no active writer or readers
                            if CAS(&n, ⟨ar, ww, false⟩, ⟨ar, ww-1, true⟩) break outer loop
                        pause(delay)                 // exponential backoff
                        delay := min(delay × multiplier, limit)
                // else retry
            fence(‖RW)

        rw lock.writer release():
            fence(RW‖)
            short ar, ww; bool aw
            repeat                                   // fetch-and-Φ
                ⟨ar, ww, aw⟩ := n.load()
            until CAS(&n, ⟨ar, ww, aw⟩, ⟨ar, ww, false⟩)

        rw lock.reader acquire():
            loop
                ⟨short ar, short ww, bool aw⟩ := n.load()
                if ww = 0 and aw = false
                    if CAS(&n, ⟨ar, 0, false⟩, ⟨ar+1, 0, false⟩) break
                    // else spin
                pause(ww × base)                     // proportional backoff
            fence(‖R)

        rw lock.reader release():
            fence(R‖)
            short ar, ww; bool aw
            repeat                                   // fetch-and-Φ
                ⟨ar, ww, aw⟩ := n.load()
            until CAS(&n, ⟨ar, ww, aw⟩, ⟨ar-1, ww, aw⟩)
Figure 6.2: A centralized writer-preference reader-writer lock, with proportional backoff for readers
and exponential backoff for writers.
    class rw lock
        ⟨short, short⟩ requests := ⟨0, 0⟩
        ⟨short, short⟩ completions := ⟨0, 0⟩
            // top half of each word counts readers; bottom half counts writers
        const int base = …        // tuning parameter

        rw lock.writer acquire():
            short rr, wr, rc, wc
            repeat                               // fetch-and-Φ increment of writer requests
                ⟨rr, wr⟩ := requests.load()
            until CAS(&requests, ⟨rr, wr⟩, ⟨rr, wr+1⟩)
            loop                                 // spin
                ⟨rc, wc⟩ := completions.load()
                if rc = rr and wc = wr break     // all previous readers and writers have finished
                pause((wr - wc) × base)
            fence(‖RW)

        rw lock.writer release():
            fence(RW‖)
            short rc, wc
            repeat                               // fetch-and-Φ increment of writer completions
                ⟨rc, wc⟩ := completions.load()
            until CAS(&completions, ⟨rc, wc⟩, ⟨rc, wc+1⟩)

        rw lock.reader acquire():
            short rr, wr, rc, wc
            repeat                               // fetch-and-Φ increment of reader requests
                ⟨rr, wr⟩ := requests.load()
            until CAS(&requests, ⟨rr, wr⟩, ⟨rr+1, wr⟩)
            loop                                 // spin
                ⟨rc, wc⟩ := completions.load()
                if wc = wr break                 // all previous writers have finished
                pause((wr - wc) × base)
            fence(‖R)

        rw lock.reader release():
            fence(R‖)
            short rc, wc
            repeat                               // fetch-and-Φ increment of reader completions
                ⟨rc, wc⟩ := completions.load()
            until CAS(&completions, ⟨rc, wc⟩, ⟨rc+1, wc⟩)
Figure 6.3: A centralized fair reader-writer lock with (roughly) proportional backoff for both readers
and writers. Addition is assumed to be modulo the precision of (unsigned) short integers.
observe that hardware transactional memory (HTM—Chapter 9) can be used both to fix the bug
and to significantly simplify the code. Dice et al. also provide a pair of software-only fixes; after
incorporating one of these, code for the lock of Krieger et al. appears in Figures 6.4 and 6.5.
As in the MCS spin lock, the acquire and release routines expect a qnode argument,
which they add to the end of the list. Each contiguous group of readers maintains both forward and
backward pointers in its segment of the list; segments consisting of writers are singly linked. A
reader can begin reading if its predecessor is a reader that is already active, though it must first
unblock its successor (if any) if that successor is a waiting reader.
In Mellor-Crummey and Scott’s reader-writer locks, as in the MCS spin lock, queue nodes
could be allocated in the stack frame of the routine that calls acquire and release . In the lock
of Krieger et al., this convention would be unsafe: it is possible for another thread to modify a
node an arbitrary amount of time after the node’s owner has removed it from the queue. To avoid
potential stack corruption, queue nodes must be managed by a dynamic type-preserving allocator,
as described in the box on page 26.
    rw lock.reader acquire(qnode* I):
        I→role := reader; I→waiting := true
        I→next := I→prev := null
        qnode* pred := swap(&tail, I, W‖)
        if pred ≠ null                        // lock is not free
            I→prev.store(pred)
            pred→next.store(I)
            if pred→role.load() ≠ active reader
                while I→waiting.load();       // spin
        qnode* succ := I→next.load()
        if succ ≠ null and succ→role.load() = reader
            succ→waiting.store(false)         // unblock contiguous readers
        I→role.store(active reader, ‖R)

    rw lock.reader release(qnode* I):
        fence(R‖)
        qnode* pred := I→prev.load()
        if pred ≠ null                        // need to disconnect from predecessor
            pred→mutex.acquire()
            while pred ≠ I→prev.load()
                pred→mutex.release()
                pred := I→prev.load()
                if pred = null break
                pred→mutex.acquire()
            // At this point we hold the mutex of our predecessor, if any.
            if pred ≠ null
                I→mutex.acquire()
                pred→next.store(null)
                qnode* succ := I→next.load()
                if succ = null and ¬CAS(tail, I, pred)
                    repeat succ := I→next.load() until succ ≠ null
                if succ ≠ null                // need to disconnect from successor
                    succ→prev.store(pred)
                    pred→next.store(succ)
                I→mutex.release()
                pred→mutex.release()
                return
        I→mutex.acquire()
        qnode* succ := I→next.load()
        if succ = null and ¬CAS(tail, I, null)
            repeat succ := I→next.load() until succ ≠ null
        if succ ≠ null                        // ∃ successor but no predecessor
            bool succ is writer := succ→role.load() = writer
            succ→waiting.store(false)
            if ¬succ is writer
                succ→prev.store(null)
        I→mutex.release()
Figure 6.5: A fair queued reader-writer lock (reader routines).
When a reader finishes its critical section, it removes itself from its doubly-linked group of
contiguous readers. To avoid races during the unlink operation, the reader acquires mutex locks
on its predecessor’s qnode and its own. (These can be very simple, since at most two threads will
ever contend for access.) If a reader finds that it is the last member of its reader group, it unblocks
its successor, if any. That successor will typically be a writer; the exception to this rule is the subject
of the bug repaired by Dice et al.
In their paper on phase-fair locks, Brandenburg and Anderson [2010] also present a queue-
based implementation with local-only spinning. As of this writing, their lock and the code of
Figures 6.4 and 6.5 appear to be the best all-around performers on medium-sized machines (up
to perhaps a few dozen hardware threads). For heavily contended locks on very large machines,
Lev et al. [2009b] show how to significantly reduce contention among concurrent readers, at the
cost of higher overhead when the thread count is low.
One additional case merits special attention. If reads are much more common than writes,
and the total number of threads is not too large, the fastest performance may be achieved with a
distributed reader-writer lock consisting of |T| “reader locks”—one per thread—and one “writer
lock” [Hsieh and Weihl, 1992]. The reader acquire routine simply acquires the reader lock corresponding
to the calling thread. The writer acquire routine acquires first the writer lock and then
all the reader locks, one at a time. The corresponding release routines release these same component
locks (in reverse order in the case of writer release). Reader locks can be very simple, since
they are accessed only by a single reader and the holder of the writer lock. Moreover, reader acquire
and reader release will typically be very fast: assuming reads are more common than writes,
the needed reader lock will be unheld and locally cached. The writer operations will be slow, of
course, and each lock will consume space linear in the number of threads. Linux uses locks of this
sort to synchronize some kernel-level operations, with per-core kernel instances playing the role
of threads.
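A minimal C++ sketch of such a distributed lock follows; MAX_THREADS, the padding, and the method names are illustrative rather than taken from Hsieh and Weihl or the Linux kernel:

    #include <array>
    #include <mutex>

    constexpr int MAX_THREADS = 64;   // illustrative bound

    class DistributedRWLock {
        struct alignas(64) PaddedMutex { std::mutex m; };   // one cache line per reader lock
        std::array<PaddedMutex, MAX_THREADS> reader_locks;
        std::mutex writer_lock;
    public:
        void reader_acquire(int self) { reader_locks[self].m.lock(); }
        void reader_release(int self) { reader_locks[self].m.unlock(); }
        void writer_acquire() {
            writer_lock.lock();                              // first the writer lock
            for (auto& r : reader_locks) r.m.lock();         // then every reader lock, in order
        }
        void writer_release() {
            for (auto it = reader_locks.rbegin(); it != reader_locks.rend(); ++it)
                it->m.unlock();                              // release in reverse order
            writer_lock.unlock();
        }
    };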
Figure 6.6: Centralized implementation of a sequence lock. The CAS instructions in writer acquire
and become writer need to be write atomic.
of a writer. Moreover, the reader’s actions must be simple enough that nothing a writer might do
can cause the reader to experience an unrecoverable error—divide by zero, dereference of an invalid
pointer, infinite loop, etc. Put another way, seqlocks provide mutual exclusion among writers, but
not between readers and writers. Rather, they allow a reader to discover, after the fact, that its
execution may not have been valid, and needs to be retried.
A simple, centralized implementation of a sequence lock appears in Figure 6.6. The lock
is represented by a single integer. An odd value indicates that the lock is held by a writer; an even
value indicates that it is not. For writers, the integer behaves like a test-and-test and set lock.
We assume that writers are rare.
A reader spins until the lock is even, and then proceeds, remembering the value it saw. If it
sees the same value in reader validate , it knows that no writer has been active, and that everything
it has read in its critical section is mutually consistent. (We assume that critical sections are short
enough—and writers rare enough—that n can never roll over and repeat a value before the reader
completes. For real-world integers and critical sections, this is a completely safe assumption.) If
a reader sees a different value in validate , however, it knows that it has overlapped a writer and
must repeat its critical section.
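A C++ sketch of this centralized sequence lock, using the interface named in the text (Figure 6.6 itself is not reproduced here), might look as follows; per the figure caption, the CAS operations in writer acquire and become writer are sequentially consistent:

    #include <atomic>

    class SeqLock {
        std::atomic<unsigned> n{0};        // odd value means a writer holds the lock
    public:
        unsigned reader_start() {
            unsigned s;
            while ((s = n.load(std::memory_order_acquire)) & 1) { }   // spin while writer active
            return s;
        }
        bool reader_validate(unsigned s) {
            std::atomic_thread_fence(std::memory_order_acquire);      // order prior reads first
            return n.load(std::memory_order_relaxed) == s;
        }
        void writer_acquire() {
            unsigned s;
            do {
                while ((s = n.load(std::memory_order_relaxed)) & 1) { }   // test-and-test_and_set style
            } while (!n.compare_exchange_weak(s, s + 1, std::memory_order_seq_cst));
        }
        void writer_release() {
            n.fetch_add(1, std::memory_order_release);                // back to an even value
        }
        bool become_writer(unsigned s) {                              // reader requests promotion
            return n.compare_exchange_strong(s, s + 1, std::memory_order_seq_cst);
        }
    };

The reader idiom shown next uses exactly this interface.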
repeat
int s := SL.reader start()
// critical section
until SL.reader validate(s)
It is essential here that the critical section be idempotent —harmlessly repeatable, even if a
writer has modified data in the middle of the operation, causing the reader to see inconsistent
state. In the canonical use case, seqlocks serve in the Linux kernel to protect multi-word time
information, which can then be read atomically and consistently. If a reader critical section updates
thread-local data (only shared data must be read-only), the idiom shown above can be modified
to undo the updates in the case where reader validate returns false .
If a reader needs to perform a potentially “dangerous” operation (integer divide, pointer
dereference, unbounded iteration, memory allocation/deallocation, etc.) within its critical section,
the reader validate method can be called repeatedly (with the same parameter each time). If
reader validate returns true , the upcoming operation is known to be safe (all values read so far
are mutually consistent); if it returns false , consistency cannot be guaranteed, and code should
branch back to the top of the repeat loop. In the (presumably rare) case where a reader discovers
that it really needs to write, it can request a “promotion” with become writer :
loop
int s := SL.reader start()
…
if unlikely condition
if ¬SL.become writer(s) continue // return to top of loop
…
SL.writer release()
break
else // still reader
…
if SL.reader validate(s) break
After becoming a writer, of course, a thread has no further need to validate its reads: it will exit
the loop above after calling writer release .
Unfortunately, because they are inherently speculative, seqlocks induce a host of data races
[Boehm, 2012]. Every read of a shared location in a reader critical section will typically race
with some write in a writer critical section. These races compound the problem of readers seeing
inconsistent state: the absence of synchronization means that updates made by writers may be seen
by readers out of order. In a language like C or C++, which forbids data races, a straightforward fix
is to label all read locations atomic; this will prevent the compiler from reordering accesses, and
cause it to issue special instructions that prevent the hardware from reordering them either. This
solution is overly conservative, however: it inhibits reorderings that are clearly acceptable within
idempotent read-only critical sections. Boehm [2012] explores the data-race issue in depth, and
describes other, less conservative options.
A related ordering issue arises from the fact that readers do not modify the state of a seq-
lock. Because they only read it, on some machines their accesses will not be globally ordered with
respect to writer updates. If threads inspect multiple seqlock-protected data structures, a situation
analogous to the IRIW example of Figure 2.4 can occur: threads 2 and 3 see updates to objects
X and Y , but thread 2 thinks that the update to X happened first, while thread 3 thinks that
the update to Y happened first. To avoid causality loops, writers must update the seqlock using
sequentially consistent (write-atomic) synchronizing store s.
Together, the problems of inconsistency and data races are subtle enough that seqlocks
are best thought of as a special-purpose technique, to be employed by experts in well constrained
circumstances, rather than as a general-purpose form of synchronization. That said, seqlock usage
can be safely automated by a compiler that understands the nature of speculation. Dalessandro
et al. [2010a] describe a system (in essence, a minimal implementation of transactional memory)
in which (1) a global sequence lock serializes all writer transactions, (2) fences and reader validate
calls are inserted automatically where needed, and (3) local state is checkpointed at the beginning
of each reader transaction, for restoration on abort. A follow-up paper [Dalessandro et al., 2010c]
describes a more concurrent system, in which writer transactions proceed speculatively, and a
global sequence lock serializes only the write-back of buffered updates. We will return to the
subject of transactional memory in Chapter 9.
No shared updates by readers. As in a sequence lock, readers modify no shared metadata before
or after performing an operation. While this makes them invisible to writers, it avoids the
characteristic cache misses associated with locks. To ensure a consistent view of memory,
readers may need to execute RkR fences on some machines, but these are typically much
cheaper than a cache miss.
Single-pointer updates. Writers synchronize with one another explicitly. They make their up-
dates visible to readers by performing a single atomic memory update—typically by “swing-
ing” a pointer (under protection of a lock, or using CAS ) to refer to the new version of (some
part of ) a data structure, rather than to the old version. Readers serialize before or after the
writer depending on whether they see this update. (In either case, they see data that was
valid at some point after they began their call.)
Unidirectional data traversal. To ensure consistency, readers must never inspect a pointer more
than once. To ensure serializability (when it is desired), users must additionally ensure (via
program logic) that if writers A and B modify different pointers, and A serializes before
B, it is impossible for any reader to see B’s update but not A’s. The most straightforward
way to ensure this is to require all structures to be trees, traversed from the root toward the
leaves, and to arrange for writers to replace entire subtrees (a sketch follows this list).
Delayed reclamation of deallocated data. When a writer updates a pointer, readers that have
already dereferenced the old version—but have not yet finished their operations—may con-
tinue to read old data for some time. Implementations of RCU must therefore provide a
(potentially conservative) way for writers to tell that all readers that could still access old data
have finished their operations and returned. Only then can the old data’s space be reclaimed.
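To make the single-pointer-update and delayed-reclamation properties concrete, a C++ sketch follows. The tree layout, wait_for_grace_period, and reclaim are placeholders for mechanisms discussed in the rest of this section:

    #include <atomic>

    struct Node { int key; Node* left; Node* right; };
    std::atomic<Node*> root;                 // the single pointer that writers swing

    void wait_for_grace_period();            // placeholder: see "Grace Periods" below
    void reclaim(Node* subtree);             // placeholder: frees an entire old subtree

    int reader_lookup(int key) {             // no writes to shared metadata
        Node* p = root.load(std::memory_order_acquire);   // inspect the pointer exactly once
        while (p != nullptr && p->key != key)
            p = (key < p->key) ? p->left : p->right;      // unidirectional, root-to-leaf traversal
        return (p != nullptr) ? p->key : -1;
    }

    void writer_replace(Node* new_version) { // writers exclude one another by other means
        Node* old = root.load(std::memory_order_relaxed);
        root.store(new_version, std::memory_order_seq_cst);   // single, write-atomic update
        wait_for_grace_period();             // wait until no reader can still see `old`
        reclaim(old);
    }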
Implementations and applications of RCU vary in many details, and may diverge from the
description above if the programmer is able to prove that (application-specific) semantics will
not be compromised. We consider relaxations of the single-pointer update and unidirectional
traversal properties below. First, though, we consider ways to implement relaxed reclamation and
to accommodate, at minimal cost, machines with relaxed memory order.
Grace Periods and Relaxed Reclamation. In a language and system with automatic garbage
collection, the delayed reclamation property is trivial: the normal collector will reclaim old data
versions when—and only when—no readers can see them any more. In the more common case
of manual memory management, a writer may wait until all readers of old data have completed,
and then reclaim space itself. Alternatively, it may append old data to a list for eventual reclama-
tion by some other, bookkeeping thread. The latter option reduces latency for writers, potentially
improving performance, but may also increase maximum space usage.
Arguably the biggest differences among RCU implementations concern the “grace period”
mechanism used (in the absence of a general-purpose garbage collector) to determine when all
old readers have completed. In a nonpreemptive OS kernel (where RCU was first employed), the
writer can simply wait until a (voluntary) context switch has occurred in every hardware thread.
Perhaps the simplest way to do this is to request migration to each hardware thread in turn: such
a request will be honored only after any active reader on the target thread has completed.
More elaborate grace period implementations can be used in more general contexts.
Desnoyers et al. [2012, App. D] describe several implementations suitable for user-level applica-
tions. Most revolve around a global counter C and a global set S of counters, indexed by thread
id. C is monotonically increasing (extensions can accommodate rollover): in the simplest im-
plementation, it is incremented at the end of each write operation. In a partial violation of the
no-shared-updates property, S is maintained by readers. Specifically, S[ i ] will be zero if thread
i is not currently executing a reader operation. Otherwise, S[ i ] will be j if C was j when thread
i ’s current reader operation began. To ensure a grace period has passed (and all old readers have
finished), a writer iterates through S , waiting for each element to be either zero or a value greater
than or equal to the value just written to C . Assuming that each set element lies in a separate
cache line, the updates performed by reader operations will usually be cache hits, with almost
no performance impact. Moreover, since each element is updated by only one thread, and the
visibility of updates can safely be delayed, no synchronizing instructions are required.
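A user-level C++ sketch of this counter-and-set scheme follows; MAX_THREADS, the padding, and the function names are illustrative, and the memory ordering shown is deliberately conservative (cheaper orderings are the subject of the next paragraphs):

    #include <atomic>

    constexpr int MAX_THREADS = 64;                       // illustrative bound

    std::atomic<unsigned long> C{1};                      // global counter
    struct alignas(64) Slot { std::atomic<unsigned long> v{0}; };
    Slot S[MAX_THREADS];                                  // S[i].v == 0: thread i not reading

    void reader_begin(int self) {
        S[self].v.store(C.load(std::memory_order_relaxed),
                        std::memory_order_seq_cst);       // publish snapshot before reading
    }
    void reader_end(int self) {
        S[self].v.store(0, std::memory_order_release);    // prior reads complete first
    }
    void end_of_write_operation() {                       // increment C, then wait out old readers
        unsigned long c = C.fetch_add(1, std::memory_order_seq_cst) + 1;
        for (int i = 0; i < MAX_THREADS; ++i) {
            unsigned long v;
            while ((v = S[i].v.load(std::memory_order_acquire)) != 0 && v < c) { }   // spin
        }
    }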
Memory Ordering. When beginning a read operation with grace periods based on the global
counter and set, a thread must update its entry in S using a ‖R store, or follow the update
with a W‖R fence. At the end of the operation, it must update its entry with a R‖ store, or
precede the update with a R‖W fence. When reading a pointer that might have been updated
by a writer, a reader must use a R‖ load, or follow the read with a R‖R fence. Among these
three forms of ordering, the W‖R case is typically the most expensive (the others will in fact be
free on a TSO machine). We can avoid the overhead of W‖R ordering in the common case by
requiring the writer to interrupt all potential readers (e.g., with a Posix signal) at the end of a write
operation. The signal handler can then “handshake” with the writer, with appropriate memory
barriers, thereby ensuring that (a) each reader’s update of its element in S is visible to the writer,
and (b) the writer’s updates to shared data are visible to all readers. Assuming that writers are rare,
the cost of the signal handling will be outweighed by the no-longer-required W‖R ordering in
the (much more numerous) reader operations.
For writers, the rules are similar to the seqlock case: to avoid causality loops when readers
inspect more than one RCU-updatable pointer, writers must use sequentially consistent (write-
atomic) synchronizing store s to modify those pointers.
Figure 6.7: Rebalancing of a binary tree via internal subtree replacement (rotation). Adapted from
Clements et al. [2012, Fig. 8(b)]. Prior to the replacement, node z is the right child of node x. After
the replacement, x′ is the left child of z′.
x and z can be reclaimed. In the meantime, readers that have traveled through x and z will still
be able to search correctly down to the fringe of the tree.
In-Place Updates. As described above, RCU is designed to incur essentially no overhead for
readers, at the expense of very high overhead for writers. In some cases, even this property can
be relaxed, extending the low-cost case to certain kinds of writers. In the same paper that intro-
duced RCU balanced trees, Clements et al. [2012] observe that trivial updates to page tables—
specifically, single-leaf modifications associated with demand page-in—are sufficiently common
to be a serious obstacle to scalability on large shared-memory multiprocessors. Their solution is
essentially a hybrid of RCU and sequence locks. Major (multi-page) update operations continue
to function as RCU writers: they exclude one another in time, install their changes via single-
pointer update, and wait for a grace period before reclaiming no-longer-needed space. e page
fault interrupt handler, however, functions as an RCU reader. If it needs to modify a page table
entry to effect demand page-in, it makes its modifications in place.
This relaxation of the rules introduces a variety of synchronization challenges. For exam-
ple: a fault handler that overlaps in time with a major update (e.g., an munmap operation that
invalidates a broad address range) may end up modifying the about-to-be-reclaimed version of a
page table entry, in which case it should not return to the user program as if nothing had gone
wrong. If each major update acquires and updates a (per-address-space) sequence lock, however,
then the fault handler can check the value of the lock both before and after its operation. If the
value has changed, it can retry, using the new version of the data. (Alternatively, if starvation is a
concern, it can acquire the lock itself.) Similarly, if fault handlers cannot safely run concurrently
with one another (e.g., if they need to modify more than a single word in memory), then they
need their own synchronization—perhaps a separate sequence lock in each page table entry. If
readers may inspect more than one word that is subject to in-place update, then they, too, may
need to inspect such a local sequence lock, and repeat their operation if they see a change. This
convention imposes some cost on the (presumably dominant) read-only code path, but the overhead
is still small—in particular, readers still make no updates to shared data.
CHAPTER 7
Synchronization and Scheduling
So far in this lecture, we have emphasized busy-wait synchronization. In the current chapter we
turn to mechanisms built on top of a scheduler, which multiplexes some collection of cores among
a (typically larger) set of threads, switching among them from time to time and—in particular—
when the current thread needs to wait for synchronization.
We begin with a brief introduction to scheduling in Section 7.1. We then discuss the oldest
(and still most widely used) scheduler-based synchronization mechanism—the semaphore—in
Section 7.2. Semaphores have a simple, subroutine-call interface. Many scheduler-based syn-
chronization mechanisms, however, were designed to be embedded in a concurrent program-
ming language, with special, non-procedural syntax. We consider the most important of these—
the monitor—in Section 7.3, and others—conditional critical regions, futures, and series-parallel
(split-merge) execution—in Section 7.4.
With these mechanisms as background, we return in Section 7.5 to questions surrounding
the interaction of user- and kernel-level code: specifically, how to minimize the number of context
switches, avoid busy-waiting for threads that are not running, and reduce the demand for kernel
resources.
7.1 SCHEDULING
As outlined in Section 1.3, scheduling often occurs at more than one level of a system. The op-
erating system kernel, for example, may multiplex kernel threads on top of hardware cores, while
a user-level run-time package multiplexes user threads on top of the kernel threads. On many
machines, the processor itself may schedule multiple hardware threads on the pipeline(s) of any
given core (in which case the kernel schedules its threads on top of hardware threads, not cores).
Library packages (e.g., in Java) may sometimes schedule run-to-completion (unblockable) tasks
on top of user threads. System-level virtual machine monitors may even multiplex the (virtual)
hardware threads seen by guest operating systems on top of some smaller number of physical
hardware threads.
Regardless of the level of implementation, we can describe the construction of a scheduler
by starting with an overly simple system and progressively adding functionality. e details are
somewhat tedious [Scott, 2009, Secs. 8.6, 12.2.4, and 12.3.4]; we outline the basic ideas here
in the interest of having “hooks” that we can call in subsequent descriptions of synchronization
mechanisms. We begin with coroutines—each of which is essentially a stack and a set of registers—
and a single core (or kernel thread) that can execute one coroutine at a time. To switch to a different
coroutine, the core (or kernel thread) calls an explicit transfer routine, passing as argument a
pointer to the context block (descriptor) of some other coroutine. The transfer routine (1) pushes
all registers other than the stack pointer onto the top of the (current) stack, (2) saves the (updated)
stack pointer into the context block of the current coroutine (typically found by examining a global
current thread variable), (3) sets current thread to the address of the new context block (the
argument to transfer ), and (4) retrieves a (new) stack pointer from that context block. Because
the new coroutine could only have stopped running by calling transfer (and new coroutines are
created in such a way that they appear to have just called transfer ), the program counter need not
change—it will already be at the right instruction. Consequently, the transfer routine simply (5)
pops registers from the top of the (new) stack and returns.
On top of coroutines, we implement non-preemptive threads (otherwise known as run-
until-block or cooperatively scheduled threads) by introducing a global ready list (often but not always
a queue) of runnable-but-not-now-running threads, and a parameterless reschedule routine that
pulls a thread off the ready list and transfer s to it. To avoid monopolizing resources, a thread
should periodically relinquish its core or kernel thread by calling a routine (often named yield )
that enqueues it at the tail of the ready list and then calls reschedule . To block for synchronization,
the thread can call reschedule after adding itself to some other data structure (other than the ready
list), with the expectation that another thread will move it from that structure to the ready list
when it is time for it to continue.
The problem with cooperatively scheduled threads, of course, is the need to cooperate—
to call yield periodically. At the kernel level, where threads may belong to mutually untrusting
applications, this need for cooperation is clearly unacceptable. And even at the user level, it is
highly problematic: how do we arrange to yield often enough (and uniformly enough) to ensure
fairness and interactivity, but not so often that we spend all of our time in the scheduler? The
answer is preemption: we arrange for periodic timer interrupts (at the kernel level) or signals (at
the user level) and install a handler for the timer that simulates a call to yield in the currently
running thread. To avoid races with handlers when accessing the ready list or other scheduler
data structures, we temporarily disable interrupts (signals) when executing scheduler operations
explicitly.
Given transfer , reschedule / yield , and preemption, we can multiplex concurrent kernel or
user threads on a single core or kernel thread. To accommodate true parallelism, we need a separate
current thread variable for each core or kernel thread, and we need one or more spin locks to
protect scheduler data structures from simultaneous access by another core or kernel thread. The
disabling of interrupts/signals eliminates races between normal execution and timer handlers;
spin locks eliminate races among cores or kernel threads. Explicit calls to scheduler routines first
disable interrupts (signals) and then acquire the appropriate spin lock(s); handlers simply acquire
the lock(s), on the assumption that nested interrupts (signals) are disabled automatically when
the first one is delivered.
7.2 SEMAPHORES
Semaphores are the oldest and probably still the most widely used of the scheduler-based syn-
chronization mechanisms. They were introduced by Dijkstra in the mid 1960s [Dijkstra, 1968b].
A semaphore is essentially a non-negative integer with two special operations, P and V .¹ P waits,
if necessary, for the semaphore’s value to become positive, and then decrements it. V increments
the value and, if appropriate, unblocks a thread that is waiting in P . If the initial value of the
semaphore is C, it is easy to see that #P ≤ #V + C, where #P is the number of completed P op-
erations and #V is the number of completed V operations.
¹The names stand for words in Dijkstra’s native Dutch: passeren (to pass) and vrijgeven (to release). English speakers may find
it helpful to pretend that P stands for “pause.”
    thread 1:
        if ¬condition
            Q.enqueue(self)
            reschedule()

    thread 2:
        if ¬Q.empty()
            ready list.enqueue(Q.dequeue())
Here it is important that thread 1 acquire the scheduler spin lock before it checks the awaited condition, and hold
it through the call to reschedule .
Priority Inversion
The problem addressed by disabling interrupts or signals during scheduler operations is an example of a more
general class of problems known as priority inversion. Priority inversion occurs when a high priority task (of any
sort) preempts a low priority task (of any sort), but is unable to proceed because it needs some resource held
by the low priority task. Cast in these terms, a program running above a preemption-based scheduler can be
thought of as a low-priority task; an arriving interrupt or signal preempts it, and runs a handler at high priority
instead. A spin lock on scheduler data structures ensures atomicity among explicit scheduler operations performed
by different cores (or kernel threads), but it cannot provide the same protection between normal execution and
interrupt (signal) handlers: a handler that tried to acquire a lock held by the normal code it preempted would end
up spinning forever; priority inversion would leave the system deadlocked.
If we let C = 1, the semaphore functions as a mutual exclusion lock: P is the acquire oper-
ation; V is the release operation. Assuming that the program uses acquire and release correctly
(never attempting to release a lock that is not held), the value of the semaphore will always be
either 0 (indicating that the lock is held) or 1 (indicating that the lock is free). In this case we say
we have a binary semaphore. In other cases, a semaphore may represent some general resource of
which there are C instances. In this case we say we have a general or counting semaphore. A thread
reserves a resource instance using P ; it releases it using V . Within the OS kernel, a semaphore
might represent a frame buffer, an optical drive, a physical page of memory, a recurring slot in
a time-based communication protocol, or any other resource with a limited, discrete set of in-
stances. Many (though not all) forms of condition synchronization can be captured by the notion
of waiting for such a resource.
In Section 1.2 we introduced condition synchronization using the example of a bounded
buffer, where insert operations would wait, if necessary, for the buffer to become nonfull,
and remove operations would wait for it to become nonempty. Code for such a buffer using
semaphores appears in Figure 7.1.
Figure 7.1: Implementation of a bounded buffer using semaphores. Semaphore mutex is used
to ensure the atomicity of updates to buf , next full , and next empty . Semaphores full slots and
empty slots are used for condition synchronization.
The code for insert and remove is highly symmetric. An initial P operation delays the
calling thread until it can claim the desired resource (a full or empty slot). A subsequent brief
critical section, protected by the binary mutex semaphore, updates the contents of the buffer
and the appropriate index atomically. Finally, a V operation on the complementary condition
semaphore indicates the availability of an empty or full slot, and unblocks an appropriate waiting
thread, if any.
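Since Figure 7.1 itself is not reproduced here, the following C++20 sketch conveys its structure; SIZE and the element type are illustrative, and std::binary_semaphore and std::counting_semaphore stand in for the figure's mutex, full slots, and empty slots:

    #include <semaphore>      // C++20

    constexpr int SIZE = 8;   // illustrative capacity

    class BoundedBuffer {
        int buf[SIZE];
        int next_full = 0, next_empty = 0;
        std::binary_semaphore mutex{1};                  // protects buf and the indices
        std::counting_semaphore<SIZE> full_slots{0};     // slots a consumer may remove
        std::counting_semaphore<SIZE> empty_slots{SIZE}; // slots a producer may fill
    public:
        void insert(int d) {
            empty_slots.acquire();                       // P(empty_slots)
            mutex.acquire();                             // P(mutex)
            buf[next_empty] = d;
            next_empty = (next_empty + 1) % SIZE;
            mutex.release();                             // V(mutex)
            full_slots.release();                        // V(full_slots)
        }
        int remove() {
            full_slots.acquire();                        // P(full_slots)
            mutex.acquire();                             // P(mutex)
            int d = buf[next_full];
            next_full = (next_full + 1) % SIZE;
            mutex.release();                             // V(mutex)
            empty_slots.release();                       // V(empty_slots)
            return d;
        }
    };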
Given the scheduler infrastructure outlined in Section 7.1, the implementation of
semaphores is straightforward. Each semaphore is represented internally by an integer counter
and a queue for waiting threads. The P operation disables signals and acquires the scheduler
spin lock. It then checks to see whether the counter is positive. If so, it decrements it; if not, it
adds itself to the queue of waiting threads and calls reschedule. Either way (immediately or after
subsequent wakeup), it releases the scheduler lock, reenables signals, and returns. (Note that the
reschedule operation, if called, will release the scheduler lock and reenable signals after pulling
a new thread off the ready list. That thread will, in turn, reacquire the lock and disable signals
before calling back into the scheduler.) The V operation disables signals, acquires the scheduler
lock, and checks to see whether the queue of waiting threads is nonempty. If so, it moves a thread
from that queue to the ready list; if not, it increments the counter. Finally, it releases the scheduler
lock, reenables signals, and returns.
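Outside the kernel, where we cannot manipulate the scheduler directly, the same structure can be approximated with a mutex and condition variable. In this C++ sketch the condition variable plays the role of the queue of waiting threads, and a woken thread re-checks the count itself:

    #include <condition_variable>
    #include <mutex>

    class Semaphore {
        int count;
        std::mutex m;                         // stands in for the scheduler spin lock
        std::condition_variable q;            // stands in for the queue of waiting threads
    public:
        explicit Semaphore(int initial) : count(initial) {}
        void P() {
            std::unique_lock<std::mutex> lock(m);
            q.wait(lock, [this] { return count > 0; });   // block while the count is zero
            --count;
        }
        void V() {
            {
                std::lock_guard<std::mutex> guard(m);
                ++count;
            }
            q.notify_one();                   // wake one waiter, if any
        }
    };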
With very similar implementation techniques, we can implement native support for
scheduler-based reader-writer locks (we could also build them on top of semaphores). Modest
changes to the internal representation (protected, of course, by the disabling of signals and the
scheduler lock), would lead to fair, reader-preference, or writer-preference versions. In a similar
vein, while we have described the behavior of counting semaphores in terms of a “queue” of wait-
ing threads (suggesting FIFO ordering), the choice of thread to resume in V could just as easily
be arbitrary, randomized, or based on some notion of priority.
7.3 MONITORS
To a large extent, the enduring popularity of semaphores can be attributed to their simple
subroutine-call interface: implemented by a run-time library or operating system, they can be used
in almost any language. At the same time, the subroutine-call interface can be seen as a liability. To
start with, while mutual exclusion constitutes the most common use case for semaphores, the syn-
tactic independence of P and V operations makes it easy to omit one or the other accidentally—
especially in the presence of deeply nested conditions, break and return statements, or exceptions.
This problem can be addressed fairly easily by adding syntax in which lock acquisition introduces
a nested scope, e.g.:
with lock(L) {
// ...
}
The compiler can ensure that the lock is released on any exit from the critical section, including
those that occur via break , return , or exception. In languages like C++, which provide a destructor
mechanism for objects, a similar effect can be achieved without extending language syntax:
{std::lock_guard<std::mutex> _(L);
// ...
}
This construct declares a dummy object (here named simply with an underscore) of class
lock_guard. The constructor for this object takes a parameter L of class mutex, and calls its
lock (acquire) method. The destructor for the unnamed object, which will be called automatically on
any exit from the scope, calls L’s unlock (release) method. Both mutex and lock_guard are defined in
the C++ standard library.
While scope-based critical sections help to solve the problem of missing acquire and
release calls, the association between a lock and the data it protects is still entirely a matter
of convention. Critical sections on a given lock may be widely scattered through the text of a
program, and condition synchronization remains entirely ad hoc.
To address these limitations, Dijkstra [1972], Brinch Hansen [1973], and Hoare [1974]
developed a language-level synchronization mechanism known as the monitor. In essence, a mon-
itor is a data abstraction (a module or class) whose methods (often called entries) are automatically
translated into critical sections on an implicit per-monitor lock. Since fields of the monitor are
visible only within its methods, language semantics ensure that the state of the abstraction will be
read or written only when holding the monitor lock. To accommodate condition synchronization,
monitors also provide condition variables. A thread that needs to wait for a condition within the
monitor executes a wait operation on a condition variable; a thread that has made a condition
true performs a signal operation to awaken a waiting thread. Unlike semaphores, which count
the difference in the number of P and V operations over time, condition variables contain only a
queue of waiting threads: if a signal operation occurs when no threads are waiting, the operation
has no effect.
Over the past 40 years, monitors have been incorporated into dozens of programming
languages. Historically, Concurrent Pascal [Brinch Hansen, 1975], Modula [Wirth, 1977], and
Mesa [Lampson and Redell, 1980] were probably the most influential. Today, Java [Goetz et al.,
2006] is probably the most widely used. There have also been occasional attempts to devise a
library interface for monitors, but these have tended to be less successful: the idea depends quite
heavily on integration into a language’s syntax and type system.
Details of monitor semantics vary from one language to another. In the first subsection
below we consider the classic definition by Hoare. Though it is not followed precisely (to the best
of the author’s knowledge) by any particular language, it is the standard against which all other
variants are compared. The following two subsections consider the two most significant areas of
disagreement among extant monitor variants. The final subsection describes the variant found in
Java.
Figure 7.2: A Hoare monitor. Only one thread is permitted “inside the box” at any given time.
    monitor buffer
        const int SIZE = ...
        data buf[SIZE]
        int next full, next empty := 0, 0
        int full slots := 0
        condition full slot, empty slot

        entry insert(data d):
            if full slots = SIZE
                empty slot.wait()
            buf[next empty] := d
            next empty := (next empty + 1) mod SIZE
            ++full slots
            full slot.signal()

        entry remove():
            if full slots = 0
                full slot.wait()
            data d := buf[next full]
            next full := (next full + 1) mod SIZE
            --full slots
            empty slot.signal()
            return d
Figure 7.3: Implementation of a bounded buffer as a Hoare monitor. Threads wait on condition vari-
ables full slot and empty slot only when the associated condition does not currently hold.
in a Mesa monitor. This change is certainly not onerous. It is also consistent with the notion of
covering conditions, discussed in the box on page 112. Most modern implementations of monitors
adopt Mesa semantics for signals.
As it turns out, many algorithms (including our bounded buffer) naturally place signal
operations only at the ends of entries. A few languages—notably Concurrent Pascal—have re-
quired this positioning of signal s, thereby maintaining the semantics of signals as absolutes while
avoiding any extra context switches for immediate transfer to the signalee.
7.3.3 NESTED MONITOR CALLS
A second major difference among monitor implementations concerns behavior in the event of
nested calls. Suppose a thread calls entry E of monitor M1 , which in turn calls entry F of monitor
M2 , and the code in F then wait s on a condition variable. Clearly M2 ’s monitor lock will be
released. But what about M1 ? If we leave it locked, the program will deadlock if the only way
for another thread to reach the necessary signal in M2 is through M1 . If we unlock it, however,
then the wait ing thread in M2 will need to reacquire it when it wakes up, and we may deadlock
if some other thread is holding M1 ’s lock at that time—especially if that thread can’t release M1 ’s
lock without making a nested call to M2 .
A possible solution, suggested by Wettstein [1978], is to release the outer monitor when
waiting, dictate that signals are only hints, and arrange for a re-awakened thread to re-acquire locks
from the outside in—i.e., first on M1 and then on M2. This strategy is deadlock free so long as the
programmer takes care to ensure that nested monitor calls always occur in the same order (i.e.,
always from M1 to M2 , and never from M2 to M1 ).
Unfortunately, any scheme in which a nested wait releases the locks on outer monitors
will require the programmer to restore the monitor invariant not only on monitor exit, wait ,
and possibly signal , but also whenever calling an entry of another monitor that may wait —or a
subroutine that may call such an entry indirectly. e designers of most languages—Java among
them—have concluded that this requirement constitutes an unacceptable burden, and have opted
to leave the outer monitor locked.
As a form of syntactic sugar, the declaration of a class method can be prefaced with
synchronized , in which case its body behaves as if surrounded by synchronized (this)
{ ... } . A class whose methods are all synchronized functions as a monitor.
Within a synchronized method or block, a thread can block for condition synchroniza-
tion by executing the wait method, which all objects inherit from the root class Object ; it can
unblock a waiting peer by executing notify . If threads need to wait for more than one condition
associated with some abstraction (as they do in our bounded buffer), one must either restructure
the code in such a way that each condition is awaited in a different object, or else use some single
object’s one condition variable to cover all the alternatives. To unblock all threads waiting in a
given object, one can execute notifyAll .
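By way of illustration, the bounded buffer of Figure 7.3 might be rendered in core Java along the following lines. This is only a sketch (the class name and array representation are ours); with a single implicit condition queue per object, notifyAll serves as a covering signal for both "not full" and "not empty," and each waiter re-tests its condition in a loop, per Mesa semantics:

class BoundedBuffer<T> {
    private final Object[] buf;
    private int nextFull = 0, nextEmpty = 0, fullSlots = 0;

    BoundedBuffer(int size) { buf = new Object[size]; }

    synchronized void insert(T d) throws InterruptedException {
        while (fullSlots == buf.length) wait();    // re-test: signals are only hints
        buf[nextEmpty] = d;
        nextEmpty = (nextEmpty + 1) % buf.length;
        fullSlots++;
        notifyAll();                               // covering signal
    }

    @SuppressWarnings("unchecked")
    synchronized T remove() throws InterruptedException {
        while (fullSlots == 0) wait();
        T d = (T) buf[nextFull];
        buf[nextFull] = null;
        nextFull = (nextFull + 1) % buf.length;
        fullSlots--;
        notifyAll();
        return d;
    }
}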
C# provides mechanisms similar to those of core Java. Its lock statement is analogous to
synchronized , and conditions are awaited and signaled with Wait , Pulse , and PulseAll .
The Java 5 revision of the language, released in 2004, introduced a new library-based in-
terface to monitors. Its Lock interface (with a capital ‘L’) has explicit lock (acquire) and unlock
(release) methods. These can be used for hand-over-hand locking (Section 3.1.2) and other tech-
niques that cannot easily be captured with scope-based critical sections. Lock s can also have an
arbitrary number of associated condition variables, eliminating many unnecessary uses of cover-
ing conditions. Unfortunately, the library-based interface makes programs somewhat awkward.
There is no equivalent of the synchronized label on methods, and the Lock -based equivalent
of a synchronized block looks like this:
Lock l = ...;
...
l.lock();
try {
// critical section
} finally {
l.unlock();
}
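For comparison, here is a sketch (ours, not taken from the Java documentation) of the buffer's insert method using this library interface, with one Condition per logical condition; remove is symmetric. Separate conditions let each signal wake only threads that can actually proceed:

import java.util.concurrent.locks.Condition;
import java.util.concurrent.locks.ReentrantLock;

class LockBasedBuffer<T> {
    private final Object[] buf;
    private int nextFull = 0, nextEmpty = 0, fullSlots = 0;
    private final ReentrantLock lock = new ReentrantLock();
    private final Condition fullSlot = lock.newCondition();
    private final Condition emptySlot = lock.newCondition();

    LockBasedBuffer(int size) { buf = new Object[size]; }

    void insert(T d) throws InterruptedException {
        lock.lock();
        try {
            while (fullSlots == buf.length) emptySlot.await();
            buf[nextEmpty] = d;
            nextEmpty = (nextEmpty + 1) % buf.length;
            fullSlots++;
            fullSlot.signal();               // wake at most one waiting consumer
        } finally {
            lock.unlock();
        }
    }
    // remove() acquires the same lock, awaits fullSlot, and signals emptySlot
}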
The C# standard library also provides more general synchronization mechanisms, via its
WaitHandle objects, but these are operating-system specific, and may behave differently on dif-
ferent systems.
class buffer
    const int SIZE = ...
    data buf[SIZE]
    int next full, next empty := 0, 0
    int full slots := 0

    buffer.insert(data d):
        region when full slots < SIZE
            buf[next empty] := d
            next empty := (next empty + 1) mod SIZE
            ++full slots

    buffer.remove():
        data d
        region when full slots > 0
            d := buf[next full]
            next full := (next full + 1) mod SIZE
            --full slots
        return d
Figure 7.4: Implementation of a bounded buffer using conditional critical regions. Here we have
assumed that regions are with respect to the current object ( this ) unless otherwise specified.
As in a Java synchronized block, the protected variable specifies an object whose im-
plicit lock is to be acquired. Some languages allow the programmer to specify a list of objects,
in which case their locks are acquired in some canonical order (to avoid deadlock). Significantly,
the when clause (also known as a guard ) can appear only at the beginning of the critical section.
The intent is that the enclosed code execute atomically at some point in time where the specified
condition is true. This convention avoids the issue of monitor signal semantics, but leaves the
issue of nested calls.
Figure 7.4 uses conditional critical regions to implement a bounded buffer. The code is
arguably more natural than the semaphore (Figure 7.1) or monitor (Figure 7.3) versions, but
raises a crucial implementation question: when and how are the guards evaluated?
With no restrictions on the conditions tested by guards, we are faced with the prospect,
when one thread leaves a region, of context switching into every other thread that is waiting to
enter a region on the same object, so that each can evaluate its own condition in its own referencing
environment. With a bit more sophistication, we may be able to determine—statically or at run
time—the set of variables on which a condition depends, and only switch into a thread when
one of these has changed value (raising the possibility that the condition may now be true). This
optimization turns out to be natural in the context of transactional memory; we will return to it
in Section 9.3.2. Depending on the cost of tracking writes, it may be cheaper in practice than
resuming every thread on every region exit, but worst-case overhead remains significant.
Another cost-reduction strategy, originally proposed by Kessels [1977] and adopted (in
essence) by Ada, is to require conditions to depend only on the state of the lockable object (never
on the parameters passed to methods), and to list these conditions explicitly in the object’s decla-
ration. These rules allow the implementation to associate each condition with an implicit queue of
waiting threads, and to evaluate it in a generic context, without restoring the referencing environ-
ment of any particular thread. When one thread leaves a region, each condition can be evaluated
exactly once, and a corresponding thread resumed if the condition is true.
As noted in Section 5.1, it is important that tests of a condition not race with updates
to the variables on which the condition depends. This property, too, can be ensured by allowing
conditions to depend only on the state of the lockable object—and perhaps also on parameters
passed by value, which are inaccessible to other threads.
7.4.2 FUTURES
Futures, first proposed by Halstead [1985] for the Multilisp dialect of Scheme, exploit the obser-
vation that function arguments, in most languages, are evaluated before they are passed, but may
not actually be used by the caller for some time. In Multilisp, any expression—but most commonly
a function argument—can be enclosed in a future construct:
(future expression)
Evaluation of the expression may then proceed in parallel with continued execution in the caller,
up until the point (if any) at which the caller actually needs the value of the expression.
Futures embody synchronization in the sense that evaluation of the enclosed expression will
not begin until execution in the parent thread reaches the point at which the future appears, and
execution in the parent thread will not proceed beyond the point where the value is needed until
evaluation has completed. Using future s, the key recursive step in quicksort might be written as
follows:
(append (future (sort elements less than or equal to pivot))
(list pivot)
(future (sort elements greater than pivot)))
In general, a future and the continuation of its caller need to be independent, up to the
point where the value of the future is needed. If the threads executing the future and the contin-
uation share a data or synchronization race, behavior of the program may be nondeterministic or
even undefined. As recognized by Halstead, future s are thus particularly appealing in the purely
functional subset of Scheme, where the lack of side effects means that an expression will always
evaluate to the same value in a given context.
Some thread libraries provide future s outside the language core—typically as a generic
(polymorphic) object whose constructor accepts a closure (a subroutine and its parameters) and
whose get method can be used to retrieve the computed value (waiting for it if necessary). In
Java, given a Callable<T> object c , the code
T val = c.call();
can be replaced by
FutureTask<T> f = new FutureTask<T>(c);
f.run();
...
T val = f.get();
Because Java is not a functional language, the programmer must exercise special care to
ensure that a future will execute safely. Welc et al. [2005] have proposed that future s be made
safe in all cases, using an implementation reminiscent of transactional memory. Specifically, they
use multiversioning to ensure that a future does not observe changes made by the continuation of
its caller, and speculation in the caller to force it to start over if it fails to observe a change made
by the future .
Conceptually, this code suggests that the main thread create (“fork”) n threads at the top
of each do loop iteration, and “join” them at the bottom. The Cilk runtime system, however, is
designed to make spawn and sync as inexpensive as possible. Concise descriptions of the tasks are
placed into a “work-stealing queue” [Blumofe et al., 1995] from which they are farmed out to a
collection of preexisting worker threads. Similar schedulers are used in a variety of other languages
as well. Source code syntax may vary, of course. X10 [Charles et al., 2005], for example, replaces
spawn and sync with async and finish .
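Java's fork/join framework provides a work-stealing scheduler of this general kind in library form. In the following sketch (the class and the sequential cutoff are our own), fork plays the role of spawn and join that of sync:

import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

class SumTask extends RecursiveTask<Long> {
    private final long[] a;
    private final int lo, hi;
    SumTask(long[] a, int lo, int hi) { this.a = a; this.lo = lo; this.hi = hi; }

    @Override protected Long compute() {
        if (hi - lo < 1024) {                 // sequential cutoff (tunable)
            long s = 0;
            for (int i = lo; i < hi; i++) s += a[i];
            return s;
        }
        int mid = (lo + hi) >>> 1;
        SumTask left = new SumTask(a, lo, mid);
        left.fork();                          // "spawn" the left half
        long right = new SumTask(a, mid, hi).compute();
        return right + left.join();           // "sync" with the left half
    }
}
// usage: long total = new ForkJoinPool().invoke(new SumTask(a, 0, a.length));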
Many languages (including the more recent Cilk++) include a “parallel for ” loop whose
iterations proceed logically in parallel. An implicit sync causes execution of the main program to
wait for all iterations to complete before proceeding with whatever comes after the loop. Similar
functionality can be added to existing languages in the form of annotations on sequential loops.
OpenMP [Chandra et al., 2001], in particular, defines a set of compiler- or preprocessor-based
pragma s that can be used to parallelize loops in C and Fortran. Like threads executing the same
phase of a barrier-based application, iterations of a parallel loop must generally be free of data
races. If occasional conflicts are allowed, they must be resolved using other synchronization.
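In Java, much the same effect can be had with a parallel stream; the call below does not return until every iteration has completed, giving the implicit sync just described (the method is our own illustration):

import java.util.stream.IntStream;

class ParallelLoops {
    static void scale(double[] a, double factor) {
        IntStream.range(0, a.length)
                 .parallel()
                 .forEach(i -> a[i] *= factor);   // iterations must be free of data races
    }
}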
In a very different vein, Fortran 95 and its descendants provide a forall loop whose iterations
are heavily synchronized. Code like the following
forall (i=1:n)
A[i] = expr1
B[i] = expr2
C[i] = expr3
end forall
contains (from a semantic perspective) a host of implicit barriers: All instances of expr1 are
evaluated first, then all writes are performed to A , then all instances of expr2 are evaluated,
followed by all writes to B , and so forth. A good compiler will elide any barriers it can prove to
be unneeded.
In contrast to unstructured fork-join parallelism, in which a thread may be created—or
its completion awaited—at any time, series-parallel programs always generate properly nested
groups of tasks. The difference is illustrated in Figure 7.5. With fork and join (a), tasks may join
their parent out of order, join with a task other than the parent, or terminate without join ing at
all. With spawn and sync (b), the parent launches tasks one at a time, but rejoins them as a group.
In split-merge parallelism (c), we think of the parent as dividing into a collection of children, all
at once, and then merging together again later. While less flexible, series-parallel execution leads
to clearer source code structure. Assuming that tasks do not conflict with each other, there is also
an obvious equivalence to serial execution. For debugging purposes, series-parallel semantics may
even facilitate the construction of efficient race detection tools, which serve to identify unintended
conflicts [Raman et al., 2012].
Recognizing the host of different patterns in which parallel threads may synchronize, Shi-
rako et al. [2008] have developed a barrier generalization known as phasers. Threads can join (reg-
ister with) or leave a phaser dynamically, and can participate as signalers, waiters, or both. Their
signal and wait operations can be separated by other code to effect a fuzzy barrier (Section 5.3.1).
Threads can also, as a group, specify a statement to be executed, atomically, as part of a phaser
episode. Finally, and perhaps most importantly, a thread that is registered with multiple phasers
can signal or wait at all of them together when it performs a signal or wait operation. This capa-
bility facilitates the management of stencil applications, in which a thread synchronizes with its
neighbors at the end of each phase, but not with other threads. Neighbor-only synchronization is
also supported, in a more limited fashion, by the topological barriers of Scott and Michael [1996].
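Java's library class java.util.concurrent.Phaser offers related functionality: parties may register and deregister dynamically, and arriveAndAwaitAdvance fuses the signal and wait of a barrier episode. A minimal sketch (ours) appears below:

import java.util.concurrent.Phaser;

class PhaserDemo {
    static void runPhases(int nThreads, int nPhases, Runnable[] work) {
        Phaser phaser = new Phaser(nThreads);           // register all parties up front
        for (int t = 0; t < nThreads; t++) {
            final int id = t;
            new Thread(() -> {
                for (int p = 0; p < nPhases; p++) {
                    work[id].run();                     // this thread's work for the phase
                    phaser.arriveAndAwaitAdvance();     // signal arrival, wait for the episode
                }
                phaser.arriveAndDeregister();           // leave the phaser
            }).start();
        }
    }
}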
Figure 7.5: Parallel task graphs for programs based on (a) fork and join , (b) spawn and sync , and (c)
parallel enumeration (split-merge).
In the message-passing world, barrier-like operations are supported by the collective communication
primitives of systems like MPI [Bruck et al., 1995], but these are beyond the scope of this lecture.
To avoid abuse, the kernel is free to ignore the do not preempt me flag if it stays set for too
long. It can also deduct any extra time granted a thread from the beginning of its subsequent
quantum. Other groups have proposed related mechanisms that can likewise be used to avoid
[Marsh et al., 1991] or recover from [Anderson et al., 1992] inopportune preemption. Solaris, in
particular [Dice, 2011], provides a schedctl mechanism closely related to that of Edler et al.
The code shown for test and set above can easily be adapted to many other locks, with
features including backoff, locality awareness, timeout, double-checked or asymmetric locking,
and adaptive spin-then-wait. Fair queueing is harder to accommodate. In a ticket, MCS, or CLH
lock, one must consider the possibility of preemption not only while holding a lock, but also while
waiting in line. So if several threads are waiting, preemption of any one may end up creating a
convoy.
To avoid passing a lock to a thread that has been preempted while waiting in line, Kon-
tothanassis et al. [1997] proposed extensions to the kernel interface in the spirit of Edler et al.
Specifically, they provide additional values for the do not preempt me flag, and make it vis-
ible to other threads. These changes allow one thread to pass a lock to another, and to make
the other nonpreemptable, atomically. In a different vein, He et al. [2005] describe a family of
queue-based locks in which a lock-releasing thread can estimate (with high confidence) whether
the next thread in line has been preempted, and if so dynamically remove it from the queue. The
key to these locks is for each spinning thread to periodically write the wall-clock time into its
lock queue node. If a thread discovers that the difference between the current time and the time
in its successor’s queue node exceeds some appropriate threshold, it assumes that the successor is
preempted. A thread whose node has been removed from the queue will try again the next time
it has a chance to run.
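A minimal sketch of this timestamp heuristic, in Java and with an arbitrarily chosen threshold, might look as follows. It is not a complete lock; it shows only the heartbeat published by a waiting thread and the staleness test applied by a releaser:

class QueueNode {
    volatile long lastSeen;                        // heartbeat written by the spinning waiter
    volatile boolean granted;                      // set by the releaser to pass the lock
}

class PreemptionHeuristic {
    static final long THRESHOLD_NS = 2_000_000;    // assumed threshold; tune per system

    static void spin(QueueNode me) {
        while (!me.granted) {
            me.lastSeen = System.nanoTime();       // make recent activity visible to others
            Thread.onSpinWait();
        }
    }

    static boolean looksPreempted(QueueNode successor) {
        return System.nanoTime() - successor.lastSeen > THRESHOLD_NS;
    }
}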
CHAPTER 8
Nonblocking Algorithms
When devising a concurrent data structure, we typically want to arrange for methods to be
atomic—most often linearizable (Section 3.1.2). Most concurrent algorithms achieve atomic-
ity by means of mutual exclusion, implemented using locks. Locks are blocking, however, in the
formal sense of the word: whether implemented by spinning or rescheduling, they admit system
states in which a thread is unable to make progress without the cooperation of one or more peers.
This in turn leads to the problems of inopportune preemption and convoys, discussed in Sec-
tion 7.5.2. Locks—coarse-grain locks in particular—are also typically conservative: in the course
of precluding unacceptable thread interleavings, they tend to preclude many acceptable interleav-
ings as well.
We have had several occasions in earlier chapters to refer to nonblocking algorithms, in
which there is never a reachable state of the system in which some thread is unable to make
forward progress. In effect, nonblocking algorithms arrange for every possible interleaving of
thread executions to be acceptable. They are thus immune to inopportune preemption. For certain
data structures (counters, stacks, queues, linked lists, hash tables—even skip lists) they can also
outperform lock-based alternatives even in the absence of preemption or contention.
The literature on nonblocking algorithms is enormous and continually growing. Rather
than attempt a comprehensive survey here, we will simply introduce a few of the most widely used
nonblocking data structures, and use them to illustrate a few important concepts and techniques.
A more extensive and tutorial survey can be found in the text of Herlihy and Shavit [2008].
Håkan Sundell’s Ph.D. thesis [2004] and the survey of Moir and Shavit [2005] are also excellent
sources of background information. Before proceeding here, readers may wish to refer back to the
discussion of liveness in Section 3.2.
The simplest nonblocking algorithms use the CAS - and LL/SC -based fetch and Φ construc-
tions of Section 2.3 to implement methods that update a single-word object. An atomic counter
(accumulator) object, for example, might be implemented as shown in Figure 8.1. Reads ( get )
and writes ( set ) can use ordinary load s and store s, though the store s must be write atomic to
avoid causality loops. Updates similarly require that fetch and Φ instructions be write atomic.
Note that in contrast to the lock algorithms of Chapter 4, we have not employed any fences or
other synchronizing instructions to order the operations of our object with respect to preceding
class counter
    int c

    int counter.get():
        return c

    void counter.set(int v):
        c := v

    int counter.increase(int v):
        int old, new
        repeat
            old := c
            new := old + v
        until CAS(&c, old, new)
        return old
Figure 8.1: A single-word atomic counter, implemented with CAS . If updates to the counter are to
be seen in consistent order by all threads, the store in set and the CAS in increase must both be write
atomic.
Figure 8.2: The lock-free “Treiber stack,” with a counted top-of-stack pointer to solve the ABA prob-
lem (reprised from Figure 2.7). It suffices to modify the count in pop only; if CAS is available in
multiple widths, it may be applied to only the pointer in push .
or following code in the calling thread. If such ordering is required, the programmer will need to
provide it.
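In Java, the same counter can be sketched with AtomicInteger, whose compareAndSet plays the role of CAS (getAndAdd would perform the whole update in one call, but the explicit loop mirrors Figure 8.1):

import java.util.concurrent.atomic.AtomicInteger;

class Counter {
    private final AtomicInteger c = new AtomicInteger(0);

    int get()        { return c.get(); }
    void set(int v)  { c.set(v); }

    int increase(int v) {                  // returns the previous value, as in Figure 8.1
        while (true) {
            int old = c.get();
            if (c.compareAndSet(old, old + v)) return old;
        }
    }
}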
Slightly more complicated than a single-word counter is the lock-free stack of Sec-
tion 2.3.1, originally published by Treiber [1986] for the IBM System/370, and very widely used
today. Code for this stack is repeated here as Figure 8.2. As discussed in the earlier section, a se-
quence count has been embedded in the top-of-stack pointer to avoid the ABA problem. Without
this count (or some other ABA solution [Jayanti and Petrovic, 2003, Michael, 2004a]), the stack
would not function correctly.
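A Java rendering of the Treiber stack is sketched below (the class is ours). In a garbage-collected language the counted pointer is unnecessary: a node cannot be recycled while any thread still holds a reference to it, so the ABA scenario just described cannot arise.

import java.util.concurrent.atomic.AtomicReference;

class TreiberStack<T> {
    private static final class Node<E> {
        final E value;
        Node<E> next;
        Node(E value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>(null);

    void push(T v) {
        Node<T> n = new Node<>(v);
        do {
            n.next = top.get();
        } while (!top.compareAndSet(n.next, n));   // retry if top has changed
    }

    T pop() {                                      // returns null if the stack is empty
        Node<T> n;
        do {
            n = top.get();
            if (n == null) return null;
        } while (!top.compareAndSet(n, n.next));
        return n.value;
    }
}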
If our stack had additional, read-only methods (e.g., an is empty predicate), then the
CAS es that modify top would need to be write atomic. Similar observations will apply to other
nonblocking data structures, later in this chapter. In an algorithm based on mutual exclusion or
reader-writer locks, linearizability is trivially ensured by the order of updates to the lock. With
seqlocks or RCU, as noted in Sections 6.2 and 6.3, write atomicity is needed to ensure that read-
ers see updates in consistent order. In a similar way, any updates in a nonblocking data structure
that might otherwise appear inconsistent to other threads (whether in read-only operations or in
portions of more general operations) will need to be write atomic.
Figure 8.3: Operation of the M&S queue. After appropriate preparation (“snapshotting”), dequeue
reads a value from the second node in the list, and updates head with a single CAS to remove the old
dummy node. In enqueue , two CAS es are required: one to update the next pointer in the previous
final node; the other to update tail .
Dummy Nodes. To avoid special cases found in prior algorithms, the M&S queue always keeps
a “dummy” node at the head of the queue. The first real item is the one in the node, if any, that
follows the dummy node. As each item is dequeue d, the old dummy node is reclaimed, and the
node in which the dequeue d item was located becomes the new dummy node.
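The following Java sketch of the M&S queue (ours; the JDK's ConcurrentLinkedQueue is a descendant of the same algorithm) shows the dummy node, the consistency snapshot, and the "helping" CAS that swings tail on behalf of a delayed enqueuer:

import java.util.concurrent.atomic.AtomicReference;

class MSQueue<T> {
    private static final class Node<E> {
        final E value;
        final AtomicReference<Node<E>> next = new AtomicReference<>(null);
        Node(E value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> head;
    private final AtomicReference<Node<T>> tail;

    MSQueue() {
        Node<T> dummy = new Node<>(null);           // queue always holds a dummy node
        head = new AtomicReference<>(dummy);
        tail = new AtomicReference<>(dummy);
    }

    void enqueue(T v) {
        Node<T> node = new Node<>(v);
        while (true) {
            Node<T> last = tail.get();
            Node<T> next = last.next.get();
            if (last == tail.get()) {               // consistent snapshot
                if (next == null) {
                    if (last.next.compareAndSet(null, node)) {   // link the new node
                        tail.compareAndSet(last, node);          // swing tail (may fail harmlessly)
                        return;
                    }
                } else {
                    tail.compareAndSet(last, next); // help a delayed enqueuer
                }
            }
        }
    }

    T dequeue() {                                   // returns null if the queue is empty
        while (true) {
            Node<T> first = head.get();
            Node<T> last = tail.get();
            Node<T> next = first.next.get();
            if (first == head.get()) {
                if (first == last) {
                    if (next == null) return null;  // empty
                    tail.compareAndSet(last, next); // help
                } else {
                    T v = next.value;               // read value before retiring the old dummy
                    if (head.compareAndSet(first, next)) return v;
                }
            }
        }
    }
}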
In recent years, several techniques have been proposed to improve the performance of non-
blocking queues. Ladan-Mozes and Shavit [2008] effect an enqueue with a single CAS by using
an MCS-style list in which the operation linearizes at the update of the tail pointer, and the
forward link is subsequently created with an ordinary store . If a dequeue -ing thread finds that
Figure 8.5: Atomic update of a singly linked list. Naive insertion (a) and deletion (b), if executed
concurrently, can leave the list in an inconsistent state (c). The H&M list therefore performs a two-
step deletion that first marks the next pointer of the to-be-deleted node (shown here with shading)
(d), thereby preventing conflicting insertion.
the next node has not yet been “linked in” (as may happen if a thread is delayed), it must traverse
the queue from the tail to fix the broken connection. Hendler et al. [2010b] use flat combining
(Section 5.4) to improve locality in high-contention workloads by arranging for multiple pending
operations to be performed by a single thread. Morrison and Afek [2013] observe that both the
overhead of memory management and the contention caused by failed-and-repeating CAS op-
erations can be dramatically reduced by storing multiple data items in each element of the queue,
and using fetch and increment to insert and delete them.
Figure 8.6: The H&M lock-free list, as presented by Michael [2002] (definitions and internal search
routine), with counted pointers to solve the ABA problem. Synchronizing instructions have been
added to the original. CAS instructions are assumed to be write atomic.
that occur after the call to search . Unsuccessful insertions (attempts to insert an already present
key), unsuccessful deletions (attempts to remove an already missing key), and all calls to lookup
linearize within the search routine. If the list is empty, the linearization point is the load of CURR
from *PREVp , immediately before the first fence ; if the list is non-empty, it is the last dynamic
load of NEXT from CURR.p→next , immediately before the R‖R fence in the last iteration of
the loop. In all these intra- search cases, we don’t know that the method has linearized until we
inspect the loaded value.
bool list.insert(value v):
    if search(v) return false
    node* n := new node(v, ⟨false, CURR.p, 0⟩)
    loop
        // note that CAS is ordered after initialization/update of node
        if CAS(PREVp, ⟨false, CURR.p, CURR.c⟩, ⟨false, n, CURR.c+1⟩, W‖)
            return true
        if search(v)
            free(n)    // node has never been seen by others
            return false
        n→next := ⟨false, CURR.p, 0⟩

bool list.delete(value v):
    loop
        if ¬search(v) return false
        // attempt to mark node as deleted:
        if ¬CAS(&CURR.p→next, ⟨false, NEXT.p, NEXT.c⟩, ⟨true, NEXT.p, NEXT.c+1⟩)
            continue    // list has been changed; start over
        // attempt to link node out of list:
        if CAS(PREVp, ⟨false, CURR.p, CURR.c⟩, ⟨false, NEXT.p, CURR.c+1⟩)
            fence(‖W)    // link node out before deleting!
            free for reuse(CURR.p)    // type-preserving
        else (void) search(v)    // list has been changed; re-scan and clean up deleted node(s)
        return true

bool list.lookup(value v):
    return search(v)
Figure 8.7: The H&M lock-free list (externally visible methods). Note in Figure 8.6 that PREVp ,
CURR , and NEXT are thread-private variables changed by list.search .
Figure 8.8: Searching within an H&M list. PREVp is a pointer to a markable counted pointer ( ptr );
CURR and NEXT are ptr s. Diagrams (a), (b), (c), and (d) show the final positions of PREVp , CURR ,
and NEXT when the searched-for value is 10, 20, 30, and > 30, respectively. The return value
of search will be true if and only if the value is found precisely at *CURR .
Ideally, we should like resizing to be a nonblocking operation that allows not only lookup
but also insert and delete operations to continue unimpeded. Shalev and Shavit [2006] describe
an algorithm that achieves precisely this objective. It is also incremental: the costs of a resizing
operation are spread over multiple insert , delete , and lookup operations, retaining O(1) expected
time for each. The basic idea is illustrated in Figure 8.9. Instead of a separate chain of nodes in
each bucket, it maintains a single list of all nodes, sorted by order number. Given a hash function
h with a range of 0 ... 2^n - 1, we obtain the order number of a node with key k by reversing the
n bits of h(k) and then adding an extra least-significant 1 bit.
Fast access into the list of nodes is provided by a collection of 2^j lazily initialized buckets,
where j is initialized to some small positive integer value i , and may increase at run time (up to a
limit of n) to accommodate increases in the length of the list. Each initialized bucket contains a
pointer to a so-called dummy node, linked into the list immediately before the data nodes whose
top j order number bits, when reversed, give the index of the bucket. To ensure that it appears in
the proper location, the dummy node for bucket b is given an order number obtained by reversing
the j bits of b , padding on the right with n - j zeros, and adding an extra least-significant 0 bit.
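Assuming hash values of at most 31 bits, the two order-number rules can be written as the following Java helpers (the names are ours):

class SplitOrder {
    // order number of a data node with n-bit hash h: reverse the n bits, append a 1 bit
    static long dataKey(int h, int n) {
        long r = Integer.reverse(h) >>> (32 - n);   // reverse the low n bits of h
        return (r << 1) | 1;
    }

    // order number of the dummy node for bucket b, when 2^j buckets are in use:
    // reverse the j bits of b, pad on the right with n-j zeros, append a 0 bit
    static long dummyKey(int b, int j, int n) {
        long r = Integer.reverse(b) >>> (32 - j);   // reverse the low j bits of b
        return r << (n - j + 1);
    }
}
// e.g., dataKey(5, 5) == 0b101001 and dummyKey(5, 3, 5) == 0b101000, matching the worked
// example later in this section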
The point of all this bit manipulation is to ensure, when we decide to increment j (and
thus double the number of buckets), that all of the old buckets will still point to the right places
Figure 8.9: The nonblocking, extensible S&S hash table. Dummy nodes are shaded. Data nodes are
labeled (for illustration purposes) with the hash of their key; order numbers are shown above. Starting
from the configuration shown in (a), we have inserted a data node with hash value 9 (b), doubled the
number of buckets (c), inserted a node with hash value 21 (d), and searched for a node with hash value
30 (e).
in the list, and the new buckets, once they are initialized, will point to new dummy nodes inter-
spersed among the old ones. We can see this dynamic at work in Figure 8.9. Part (a) shows a table
containing 4 elements whose keys are not shown, but whose hash values are 5, 15, 16, and 17.
For simplicity of presentation, we have assumed that n = 5, so hash values range from 0 to 31.
Above each node we have shown the corresponding order number. The node with hash value 5,
for example, has order number (00101₂)^R << 1 + 1 = 101001₂ .
Still in (a), we use the two low-order bits of the hash to index into an array of 2² = 4
buckets. Slots 0, 1, and 3 contain pointers to dummy nodes. All data whose hash values are
congruent to b mod 4 are contiguous in the node list, and immediately follow the dummy node
for bucket b . Note that bucket 2, which would not have been used in the process of inserting the
four initial nodes, is still uninitialized. In (b) we have inserted a new data node with hash value 9.
It falls in bucket 1, and is inserted, according to its order number, between the nodes with hash
values 17 and 5. In (c) we have incremented j and doubled the number of buckets in use. The
buckets themselves will be lazily initialized.
To avoid copying existing buckets (particularly given that their values may change due to
lazy initialization), we employ noncontiguous bucket arrays of exponentially increasing size. In
a simplification of the scheme of Shalev and Shavit, we access these arrays through a single,
second-level directory (not shown). The directory can be replaced with a single CAS . It indicates
the current value of j and the locations of j - i + 1 bucket arrays. The first two arrays are of size 2^i
(here i = 2); the next is of size 2^(i+1) , and so on. Given a key k , we compute b = h(k) mod 2^j
and d = b >> i . If d = 0, b ’s bucket can be found at directory[0][b mod 2^i] . Otherwise, let m
be the index of the most significant 1 bit in d ’s binary representation; b ’s bucket can be found at
directory[m+1][b mod 2^(m+i)] .
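A small Java helper (ours) makes the directory arithmetic concrete; it returns the bucket-array index and the slot within that array for bucket b, given the initial size parameter i:

class BucketDirectory {
    // locate bucket b: result[0] is the bucket-array index, result[1] the slot within it
    static int[] locate(int b, int i) {
        int d = b >>> i;
        if (d == 0) return new int[] { 0, b };                  // b mod 2^i == b here
        int m = 31 - Integer.numberOfLeadingZeros(d);           // most significant 1 bit of d
        return new int[] { m + 1, b & ((1 << (m + i)) - 1) };   // b mod 2^(m+i)
    }
}
// with i = 2: bucket 5 is array 1, slot 1; bucket 12 is array 2, slot 4; bucket 20 is array 3, slot 4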
In (d) we have inserted a new data node with hash value 21. This requires initialization of
bucket (21 mod 2³) = 5. We identify the “parent” of bucket 5 (namely, bucket 1) by zeroing out
the most significant 1 bit in 5’s binary representation. Traversing the parent’s portion of the node
list, we find that 5’s dummy node (with order number 101000₂ ) belongs between the data nodes
with hash values 9 and 5. Having inserted this node, we can then insert the data node with hash
value 21. Finally, in (e), we search for a node with hash value 30. This requires initialization of
bucket (30 mod 2³) = 6, which recursively requires initialization of bucket 6’s parent—namely
bucket 2. Shalev and Shavit prove that the entire algorithm is correct and nonblocking, and that
given reasonable assumptions about the hash function h, the amortized cost of insert , delete , and
lookup operations will be constant.
Unlike a stack, which permits insertions and deletions at one end of a list, and a queue, which per-
mits insertions at one end and deletions at the other, a double-ended queue, or deque (pronounced
“deck” or “deek”) permits insertions and deletions at both ends (but still not in the middle). In
comparison to stacks and queues, deques have fewer practical uses. The most compelling, perhaps,
is the elegant O(n) convex hull algorithm of Melkman [1987]. The most familiar is probably the
history or undo list of an interactive application: new operations are pushed onto the head of the
list, undone operations are popped off the head, and old operations are dropped off the tail as the
list continues to grow (there are, however, no insertions at the tail).
For nonblocking concurrent programming, deques have long been a subject of intrinsic
intellectual interest, because they are more complex than stacks and queues, but still simpler than
structures like search trees. The standard CAS -based lock-free deque is due to Michael [2003]; we
describe it in Section 8.6.1 below. We then consider, in Section 8.6.2, an algorithm due to Herlihy
et al. [2003a] that achieves a significant reduction in complexity by using obstruction freedom
rather than lock freedom as its liveness criterion. Michael’s deques employ an unbounded, doubly
linked list; those of Herlihy et al. employ a circular array. Other algorithms can be found in
the literature; in particular, Sundell and Tsigas [2008b] use their lock-free doubly linked lists
to construct an unbounded nonblocking deque in which operations on the head and tail can
proceed in parallel.
In addition to uses inherited from sequential programming, concurrent deques have a com-
pelling application of their own: the management of tasks in a work-stealing scheduler. We consider
this application in Section 8.6.3.
8.6.1 UNBOUNDED LOCK-FREE DEQUES
The lock-free deque of Michael [2003] uses a single, double-width, CAS -able memory location
(the “anchor”) to hold the head and tail pointers of the list, together with a 2-bit status flag that
can take on any of three possible values: STABLE , LPUSH , and RPUSH . For ABA-safe memory
allocation, the algorithm can be augmented with hazard pointers [Michael, 2004b]. Alternatively,
it can be modified to rely on counted pointers, but to fit two of these plus the status flag in a single
CAS -able anchor—even one of double width—the “pointers” must be indices into a bounded-
size pool of nodes. If this is unacceptable, double-wide LL / SC can be emulated with an extra
level of indirection [Michael, 2004a].
Operations on the deque are illustrated in Figure 8.10. At any given point in time, the
structure will be in one of seven functionally distinct states. Blue arrows in the figure indicate
state transitions effected by push left , pop left , push right , and pop right operations (arrows
labeled simply “ push ” or “ pop ” cover both left and right cases).
Three states are STABLE , as indicated by their status flag: they require no cleanup to com-
plete an operation. In S0 the deque is empty—the head and tail pointers are null. In S1 there
is a single node, referred to by both head and tail. In S2+ there are two or more nodes, linked
together with left and right pointers.
Four states—those with status flags LPUSH and RPUSH —are transitory: their contents
are unambiguous, but they require cleanup before a new operation can begin. To ensure non-
blocking progress, the cleanup can be performed by any thread. In a push right from state S2+ ,
for example, an initial CAS changes the status flag of the anchor from STABLE to RPUSH and
simultaneously updates the tail pointer to refer to a newly allocated node. (This node has pre-
viously been initialized to contain the to-be-inserted value and a left pointer that refers to the
previous tail node.) These changes to the anchor move the deque to the incoherent state Ri , in
which the right pointer of the second-to-rightmost node is incorrect. A second CAS fixes this
pointer, moving the deque to the coherent state Rc ; a final CAS updates the status flag, returning
the deque to state S2+ .
The actual code for the deque is quite complex. Various operations can interfere with one
another, but Michael proves that an operation fails and starts over only when some other operation
has made progress.
Figure 8.10: Operation of a lock-free deque (figure adapted from Michael [2003]). Blue arrows in-
dicate state transitions. In each state, the anchor word is shown at top, comprising the head pointer,
status flag, and tail pointer. Nodes in the queue (oriented vertically) contain right and left pointers
and a value (‘v’). Interior nodes are elided in the figure, as suggested by dashed arrows. A left or right
push from state S2 is a three-step process. Nodes in the process of being inserted are shown in gray.
Question marks indicate immaterial values, which will not be inspected. An ‘X’ indicates a temporarily
incorrect (incoherent ) pointer.
which we introduced in Section 3.2.1. Where a lock-free algorithm (such as Michael’s) guaran-
tees that some thread will make progress within a bounded number of steps (of any thread), an
obstruction-free algorithm guarantees only that a thread that runs by itself (without interference
from other threads) will always make progress, regardless of the starting state. In effect, Herlihy
et al. argued that since a lock-free algorithm already requires some sort of contention management
mechanism (separate from the main algorithm) to avoid the possibility of starvation, one might
as well ask that mechanism to address the possibility of livelock as well, thereby separating issues
Figure 8.11: The HLM obstruction-free deque. Each ‘v’ represents a data value. ‘ LN ’ is a left null
value; ‘ RN ’ is a right null value. The left and right (head and tail) pointers are hints; they point at or
near the rightmost LN and leftmost RN slots in the array.
of safety and liveness entirely. By doing so, the authors argue, one may be able to simplify the
main algorithm considerably. Double-ended queues provide an illustrative example. Nonblocking
versions of transactional memory (Chapter 9) provide another.
Michael’s lock-free deque employs a linked list whose length is limited only by the range
of pointers that will fit in the anchor word. By contrast, the deque of Herlihy et al. employs a
fixed-length circular array. It is most easily understood by first considering a noncircular version,
illustrated in Figure 8.11. At any given time, reading from left to right, the array will contain one
or more LN (“left null”) values, followed by zero or more data values, followed by one or more
RN (“right null”) values. To perform a right push, one must replace the leftmost RN with a data
value; to perform a right pop, one must read the rightmost data value and replace it with an RN .
The left-hand cases are symmetric. To find the leftmost RN , one can start at any entry of the
array: if it’s an RN , scan left to find the last RN ; if it’s an LN or data value, scan right to find
the first RN . To reduce the time consumed, it is helpful to know approximately where to start
looking, but the indication need not be exact.
Given these observations, the only two really tricky parts of the algorithm are, first, how
to make sure that every operation maintains the LN …v… RN structural invariant, and, second,
how to join the ends of the array to make it circular.
The first challenge is addressed by adding a count to every element of the array, and then
arranging for every operation to modify, in an appropriate order, a pair of consecutive elements.
A right push operation, for example, identifies the index, k , of the leftmost RN value. If k is the
rightmost slot in the array, the operation returns a “deque is full” error message. Otherwise, it
performs a pair of CAS es. The first increments the count in element k - 1; the second replaces
element k with a new data value and an incremented count. A right pop operation goes the other
way: it identifies the index, j , of the rightmost data value (if any). It then performs its own pair
of CAS es. The first increments the count in element j + 1; the second replaces element j with
RN and an incremented count. Left-hand operations are symmetric.
The key to linearizability is the observation that only the second CAS of a pair changes
the actual content of the deque; the first ensures that any conflict with a concurrent operation
will be noticed. Since we read both locations (k - 1 and k , or j + 1 and j ) before attempting a
CAS on either, if both CAS es succeed, no other operation modified either location in-between.
If the first CAS fails, no change has been made. If the second CAS fails, no substantive change
has been made. In either case, the operation can simply start over. Updates to the global left and
right pointers constitute cleanup. Because the pointers are just hints, atomicity with the rest of
the operation is not required. Moreover, updates to left and right need not interfere with one
another.
It is easy to see that the algorithm is obstruction free: an operation that observes an un-
changing array will always complete in a bounded number of steps. It is also easy to see that the
algorithm is not lock free: if a right push and a right pop occur at just the right time, each can,
in principle, succeed at its first CAS , fail at the second, and start over again, indefinitely. A right
push and a left push on an empty deque can encounter a similar cycle. In practice, randomized
backoff can be expected to resolve such conflicts quickly and efficiently.
To make the deque circular (as indeed it must be if pushes and pops at the two ends are
not precisely balanced), Herlihy et al. introduce one new dummy null ( DN ) value. The structural
invariant is then modified to allow the empty portion of the circular array to contain, in order, zero
or more RN values, zero or one DN values, and zero or more LN values. At all times, however,
there must be null values of at least two different kinds—at least one RN or DN , and at least one
DN or LN .
A right push that finds only one RN value in the array must change the adjacent DN value,
if any, into an RN first. If there is no adjacent DN , the operation must change the leftmost LN ,
if any, into a DN first. In all cases, changes are made with a pair of CAS es, the first of which
increments a count and the second of which is substantive.
In addition to wasting cycles and increasing contention (as spinning often does), this option
has the additional disadvantage that when a new datum is finally insert ed into an empty container,
the thread that gets to remove it will be determined, more or less accidentally, by the underlying
scheduler, rather than by the code of the data structure’s methods. To bring the choice under data
structure control and, optionally, avoid the use of spinning, Scherer and Scott [2004] developed
the notion of nonblocking dual data structures.
In addition to data, a dual data structure may also hold reservations. When an operation
discovers that a precondition does not hold, it inserts a reservation, with the expectation that
some subsequent operation (in another thread) will notify it when the precondition holds. The
authors describe a formal framework in which both the initial insertion of a reservation and the
eventual successful completion of an operation (once the precondition holds) are nonblocking
and linearizable, and any intermediate activity (spinning or blocking) results in only a constant
number of remote memory operations, and thus can be considered harmless.
As examples, Scherer and Scott present nonblocking dual versions of the Treiber stack
(Section 8.1) and the M&S queue (Section 8.2). In both, a remove operation must determine
whether to remove a datum or insert a reservation; an insert operation must determine whether
to insert a datum or fulfill a reservation. e challenge is to make this decision and then carry
it out atomically, as a single linearizable operation. Among other things, we must ensure that if
operation x satisfies the precondition on which thread t is waiting, then once x has linearized
(and chosen t as its successor), t must complete its operation within a bounded number of (its
own) time steps, with no other linearizations in between.
In the nonblocking dualqueue, atomicity requires a small extension to the consistent snap-
shot mechanism and a convention that tags each next pointer with a bit to indicate whether the
next node in the queue contains a datum or a reservation. (The only tag that is ever inspected is the
one in the next pointer of the dummy node: one can prove that at any given time the queue will
consist entirely of data nodes or entirely of reservations.) Fulfillment of reservations is straightfor-
ward: if a waiting thread spins on a field in the reservation node, we can use a CAS to change that
field from ? to the fulfilling datum before removing the node from the queue. (Alternatively, we
could signal a single-use condition variable on which the waiting thread was blocked.)
In the nonblocking dualstack, next pointers are also tagged, but the lack of a dummy node,
and the fact that insertions and deletions occur at the same end of the list, introduces an extra
complication. To ensure nonblocking progress, we must fulfill a request before pop ping it from
the stack; otherwise, if the fulfilling thread stalled after the pop , the waiting thread could execute
an unbounded number of steps after the pop linearized, without making progress, and other
operations could linearize in-between. A push operation therefore pushes a data node regardless
of the state of the stack. If the previous top-of-stack node was a reservation, the adjacent nodes
then “annihilate each other”: any thread that finds a data node and an underlying reservation at
the top of the stack attempts to write the address of the former into the latter, and then pop both
nodes from the stack.
Nonblocking dual data structures have proven quite useful in practice. In particular, the
Executor framework of Java 6 uses dualstacks and dualqueues to replace the lock-based task pools
of Java 5, resulting in improvements of 2–10× in the throughput of thread dispatch [Scherer et
al., 2009].
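The library class involved is java.util.concurrent.SynchronousQueue; as we understand its Java 6 implementation, its default (unfair) mode is built on a dualstack and its fair mode on a dualqueue. A minimal hand-off sketch:

import java.util.concurrent.SynchronousQueue;

class Handoff {
    public static void main(String[] args) throws InterruptedException {
        SynchronousQueue<String> chan = new SynchronousQueue<>(true);  // fair (FIFO) mode
        new Thread(() -> {
            try {
                chan.put("hello");                 // waits until a consumer arrives
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();
        System.out.println(chan.take());           // waits until a producer arrives
    }
}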
8.8 NONBLOCKING ELIMINATION
In Section 5.4 we described the notion of elimination, which allows operations in a fan-in tree
not only to combine (so that only one thread continues up the tree), but to “cancel each other
out,” so that neither thread needs to proceed.
Hendler et al. [2004] use elimination in a nonblocking stack to “back off ” adaptively in
the wake of contention. As in a Treiber stack (Section 8.1), a thread can begin by attempting a
CAS on the top-of-stack pointer. When contention is low, the CAS will generally succeed. If it
fails, the thread chooses a slot in (some subset of ) the elimination array. If it finds a matching
operation already parked in that slot, the two exchange data and complete. If the slot is empty, the
thread parks its own operation in it for some maximum time t , in hopes that a matching operation
will arrive. Modifications to a slot—parking or eliminating—are made with CAS to resolve races
among contending threads.
If a matching operation does not arrive in time, or if a thread finds a nonmatching operation
in its chosen slot (e.g., a push encounters another push ), the thread attempts to access the top-of-
stack pointer again. This process repeats—back and forth between the stack and the elimination
array—until either a push / pop CAS succeeds in the stack or an elimination CAS succeeds in the
array. If recent past experience suggests that contention is high, a thread can go directly to the
elimination array at the start of a new operation, rather than beginning with a top-of-stack CAS .
To increase the odds of success, threads dynamically adjust the subrange of the elimination
array in which they operate. Repeated failure to find a matching operation within the time interval
t causes a thread to use a smaller prefix of the array on its next iteration. Repeated failure to
eliminate successfully given a matching operation (as can happen when some other operation
manages to eliminate first) causes a thread to use a larger prefix. The value of t , the overall size of
the array, the number of failures required to trigger a subrange change, and the factor by which it
changes can all be tuned to maximize performance.
Similar techniques can be used for other abstractions in which operations may “cancel out.”
Scherer et al. [2005] describe an exchange channel in which threads must “pair up” and swap
information; a revised version of this code appears as the Exchanger class in the standard Java
concurrency library.
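Usage is straightforward: two threads that call exchange on the same Exchanger pair up and swap their arguments, as in this small sketch (ours):

import java.util.concurrent.Exchanger;

class Swap {
    public static void main(String[] args) throws InterruptedException {
        Exchanger<String> slot = new Exchanger<>();
        new Thread(() -> {
            try {
                System.out.println("helper got " + slot.exchange("from helper"));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }).start();
        System.out.println("main got " + slot.exchange("from main"));  // pairs with the helper
    }
}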
With care, elimination can even be applied to abstractions like queues, in which operations
cannot naively eliminate in isolation. As shown by Moir et al. [2005], one can delay an enqueue
operation until its datum, had it been inserted right away, would have reached the head of the
queue: at that point it can safely combine with any arriving dequeue operation. To determine
when an operation is “sufficiently old,” it suffices to augment the nodes of an M&S queue with
monotonically increasing serial numbers. Each enqueue operation in the elimination array is
augmented with the count found at the tail of the queue on the original (failed) CAS attempt.
When the count at the head of the queue exceeds this value, the enqueue can safely be eliminated.
This “FIFO elimination” has the nontrivial disadvantage of significantly increasing the latency
of dequeue operations that encounter initial contention, but it can also significantly increase
scalability and throughput under load.
CHAPTER 9
Transactional Memory
Transactional memory (TM) is among the most active areas of recent synchronization research,
with literally hundreds of papers published over the past ten years. The current chapter attempts
to outline the shape of the TM design space, the current state of the art, and the major open
questions. For further details, readers may wish to consult the encyclopedic lecture of Harris et al.
[2010].
At its core, TM represents the fusion of two complementary ideas: first, that we should
raise the level of abstraction for synchronization, allowing programmers to specify what should
be atomic without needing to specify how to make it atomic; second, that we should employ (at
least in many cases) an underlying implementation based on speculation. The user-level construct
is typically simply an atomic label attached to a block of code. The speculative implementation
allows transactions (executions of atomic blocks) to proceed in parallel unless and until they conflict
with one another (access the same location, with at least one of them performing a write). At most
one conflicting transaction is allowed to continue; the other(s) abort, roll back any changes they
have made, and try again.
Ideally, the combination of atomic blocks and speculation should provide (much of ) the
scalability of fine-grain locking with (most of ) the simplicity of coarse-grain locking, thereby
sidestepping the traditional tradeoff between clarity and performance. e combination also offers
a distinct semantic advantage over lock-based critical sections, namely composability.
An atomicity mechanism is said to be composable if it allows smaller atomic operations
to be combined into larger atomic operations without the possibility of introducing deadlock.
Critical sections based on fine-grain locks are not composable: if operations are composed in
different orders in different threads, they may attempt to acquire the same set of locks in different
orders, and deadlock can result. Speculation-based implementations of atomic blocks break the
“irrevocability” required for deadlock (Section 3.1.1): when some transactions abort and roll back,
others are able to make progress.
As noted at the end of Chapter 2, TM was originally proposed by Herlihy and Moss [1993].
A similar mechanism was proposed concurrently by Stone et al. [1993], and precursors can be
found in the work of Knight [1986] and Chang and Mergen [1988]. Originally perceived as too
complex for technology of the day, TM was largely ignored in the hardware community for a
decade. Meanwhile, as mentioned at the end of Chapter 8, several groups in the theory commu-
nity were exploring the notion of universal constructions [Anderson and Moir, 1999, Barnes, 1993,
Herlihy, 1993, Israeli and Rappoport, 1994, Shavit and Touitou, 1995, Turek et al., 1992], which
could transform a correct sequential data structure, mechanically, into a correct concurrent data
structure. Shortly after the turn of the century, breakthrough work in both hardware [Martínez
and Torrellas, 2002, Rajwar and Goodman, 2002, 2001] and software [Fraser and Harris, 2007,
Harris and Fraser, 2003, Herlihy et al., 2003b] led to a resurgence of interest in TM. This resur-
gence was fueled, in part, by the move to multicore processors, which raised profound concerns
about the ability of “ordinary” programmers to write code (correct code!) with significant amounts
of exploitable thread-level parallelism.
Much of the inspiration for TM, both originally and more recently, has come from the
database community, where transactions have been used for many years. Much of the theory of
database transactions was developed in the 1970s [Eswaran et al., 1976]. Haerder and Reuter
[1983] coined the acronym ACID to describe the essential semantics: a transaction should be
atomic – it should appear to execute all at once, or not at all
consistent – it should take the system from one consistent state to another
isolated – its internal behavior should not be visible to other transactions, nor should it see the
effects of other transactions during its execution
durable – once committed, its effects should survive subsequent system crashes
These same semantics apply to TM transactions, with two exceptions. First, there is an
(arguably somewhat sloppy) tendency in the TM literature to use the term “atomic” to mean
both atomic and isolated. Second, TM transactions generally dispense with durability. Because
they may encompass as little as two or three memory accesses, they cannot afford the overhead of
crash-surviving disk I/O. At the same time, because they are intended mainly for synchronization
among threads of a single program (which usually live and die together), durability is much less
important than it is in the database world.
Composability
We also used the term “composability” in Section 3.1.2, where it was one of the advantages of linearizability over
other ordering criteria. The meaning of the term there, however, was different from the meaning here. With lin-
earizability, we wanted to ensure, locally (i.e., on an object-by-object basis, without any need for global knowledge
or control), that the orders of operations on different objects would be mutually consistent, so we could compose
them into a single order for the program as a whole. In transactional memory, we want to combine small operations
(transactions) into larger, still atomic, operations. In other words, we’re now composing operations, not orders.
Interestingly, the techniques used to implement linearizable concurrent objects do not generally support the cre-
ation of atomic composite operations: a linearizable operation is designed to be visible to all threads before it
returns to its caller; its effect can’t easily be delayed until the end of some larger operation. Conversely, the tech-
niques used to implement composable transactions generally involve some sort of global control—exactly what
linearizability was intended not to need.
Given a correct sequential implementation of a data structure (a tree-based set, for exam-
ple), TM allows the author of a parallel program to reuse the sequential code, with guaranteed
correctness, in an almost trivial fashion:
class pset
    set S

    bool pset.lookup(x : item):
        atomic
            return S.lookup(x)

    pset.insert(x : item):
        atomic
            S.insert(x)

    pset.remove(x : item):
        atomic
            S.remove(x)
Moreover, unlike lock-based critical sections, transactions can safely be composed into
larger atomic operations:
P, Q : pset
…
atomic
if P.lookup(x)
P.remove(x)
Q.insert(x)
Here the fact that P.lookup , P.remove , and Q.insert contain nested transactions is entirely
harmless. Moreover, if some other thread attempts a concurrent, symmetric move from Q to P ,
deadlock can never result.
The original intent of TM was to simplify the construction of library-level concurrent data
abstractions, with relatively small operations. Current hardware (HTM) and (to a lesser extent)
software (STM) implementations serve this purpose well. How much larger transactions can get
before they conflict too often to scale is still an open question.
The following two sections consider software and hardware transactions in turn; the third
takes a closer look at challenges—many of them initially unanticipated—that have complicated
the development of TM, and may yet determine the degree of its success.
While early STM implementations were provided simply as library packages—with entry
points to begin a transaction, read or write a shared location transactionally, and (attempt to)
commit a transaction—experience suggests that such libraries are too cumbersome for most pro-
grammers to use [Dalessandro et al., 2007]. We assume through the rest of this chapter that TM
is embedded in a programming language, and that all necessary hooks (including instrumentation
of STM load s and store s) are generated by the compiler.
9.1 SOFTWARE TM
If two TM implementations provide the same functionality—one in hardware and the other in
software—the hardware version will almost certainly be faster. Software implementations have
other advantages, however: they can run on legacy hardware, they are more flexible (extensible),
and they can provide functionality that is considered too complex to implement in hardware.
As of this writing, hardware TM systems are just beginning to reach the consumer market. The
majority of research over the past decade has taken place in a software context.
Progress guarantees – Most of the early universal constructions were nonblocking, and many of
the original STM systems were likewise. The OSTM (object-based STM) of Fraser’s thesis
work was lock free [Fraser, 2003, Fraser and Harris, 2007]; several other systems have been
obstruction free [Harris and Fraser, 2003, Herlihy et al., 2003b, Marathe and Moir, 2008,
Marathe et al., 2005, 2006, Tabba et al., 2009]. Over time, however, most groups have
moved to blocking implementations in order to obtain better expected-case performance.
Access tracking and conflict resolution – When two transactions conflict, a TM system must
ensure that they do not both commit. Some systems are eager : they notice as soon as a
location already accessed in one transaction is accessed in a conflicting way by a second
transaction. Other systems are lazy: they delay the resolution of conflicts until one of the
transactions has finished execution and is ready to commit. A few systems are mixed : they
resolve write-write conflicts early but read-write conflicts late [Dragojević et al., 2009, Scott,
2006, Shriraman and Dwarkadas, 2009].
To detect conflicts, a TM system must track the accesses performed by each transaction.
In principle, with lazy conflict resolution, one could log accesses locally in each thread, and
intersect, at commit time, the logs of transactions that overlap in time. RingSTM [Spear
et al., 2008a], indeed, does precisely this. More commonly, TM systems employ some sort
of shared metadata for access tracking. Some object-oriented systems include metadata in
the header of each object. Most STM systems, however, use a hash function keyed on the
address of the accessed location to index into a global table of “ownership” records (Orecs).
By ignoring the low bits of the address when hashing, we can arrange for the bytes of a given
block (word, cache line, etc.) to share the same Orec. Given many-to-one hashing, a single
Orec will also, of course, be shared by many blocks: this false sharing means that logically
independent transactions will sometimes appear to conflict, forcing us to choose between
them.
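As a concrete illustration of this hashing scheme, the C++ fragment below is one plausible way to map addresses to Orecs; the table size, the block granularity, and the Orec layout are invented for the example, and real systems differ.

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t NUM_ORECS   = 1 << 20;   // about a million ownership records
constexpr std::size_t BLOCK_SHIFT = 5;         // 32-byte blocks share an Orec

struct Orec { std::atomic<uint64_t> word; };   // e.g., lock bit plus version number

static Orec orec_table[NUM_ORECS];

// Addresses within the same block map to the same Orec; distinct blocks may
// also collide (false sharing), because the table is finite.
inline Orec &orec_for(const void *addr) {
    std::uintptr_t p = reinterpret_cast<std::uintptr_t>(addr) >> BLOCK_SHIFT;  // drop low bits
    return orec_table[p % NUM_ORECS];                                          // many-to-one hash
}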
Lazy and mixed conflict resolution have the advantage that readers can avoid updating meta-
data to make themselves visible to writers. A system that skips these updates is said to have
invisible readers. Because metadata updates tend to induce cache misses, eliminating them
can dramatically improve the performance of read-only or read-mostly transactions.
Validation – It is straightforward to demonstrate that an STM system will guarantee strict se-
rializability if it never commits conflicting transactions that overlap in time. In a system
with invisible readers, we commonly distinguish between validation and the rest of conflict
resolution. In a given transaction A, validation serves to ensure that no other transaction B
has made in-place updates (or committed updates) to locations read by A. When a read-
only transaction (one that modifies no shared locations) completes its execution, successful
validation is all that it requires in order to commit. When a writer transaction completes, it
must also make its updates visible to other threads. In an Orec-based system with a redo-log
(e.g., TL2 [Dice et al., 2006]), a transaction will typically lock the Orecs of all locations
it wishes to modify, validate, write back the contents of its redo log, and then unlock the
Orecs.
In a system with lazy conflict resolution, validation must also be performed on occasion
during transaction execution—not just at the end. Otherwise a transaction that has read
mutually inconsistent values of memory locations (values that could not logically have been
valid at the same time) may perform operations that would never occur in any sequential
execution, possibly resulting in faults (e.g., divide-by-zero), infinite loops, nontransactional
(uninstrumented) stores to shared addresses, or branches to nontransactional (uninstru-
mented) code. A maximally pessimistic system may choose to validate immediately after
every shared read; such a system is said to preserve opacity [Guerraoui and Kapałka, 2008].
A more optimistic system may delay validation until the program is about to execute a “dan-
gerous” operation; such a system is said to be sandboxed [Dalessandro and Scott, 2012].
With the exception of progress guarantees, we will discuss each of these design space di-
mensions in its own subsection below. Readers who are interested in exploring the alternatives
may wish to download the RSTM suite [RSTM], which provides a wide variety of interchange-
able STM “back ends” for C++.
The design space dimensions are largely but not fully orthogonal. When transactions con-
flict, there is no way for a writer to defer to a reader it cannot see: invisible readers reduce the
flexibility of contention management. In a similar vein, private undo logs (not visible to other
threads) cannot be used in a nonblocking system, and private access logs cannot be used for eager
conflict resolution. Perhaps most important, there is no obvious way to combine in-place update
(undo logs) with lazy conflict resolution: Suppose transaction A reads x , transaction B writes x
(speculatively, in place), and transaction A is the first to complete. Without knowing whether
A’s read occurred before or after B ’s write, we have no way of knowing whether it is safe to
commit A.
9.1.4 VALIDATION
As described in Section 3.1.2, two-phase locking provides a straightforward way to ensure seri-
alizability. Each transaction, as it runs, acquires a reader-writer lock (in read or write mode as
appropriate) on the Orec of every location it wishes to access. (This implies eager conflict detection.) If an Orec is already held in an incompatible mode, the transaction stalls, aborts, or (perhaps) kills the transaction(s) that already hold the lock. (This implies eager conflict resolution.
To avoid deadlock, a transaction that stalls must do so provisionally; if it waits too long it must
time out and abort.) If all locks are held from the point of their acquisition to the end of the
transaction, serializability is ensured. As described in the previous subsection, SNZI can be used
to reduce the contention and cache misses associated with lock updates by readers, at the expense,
in a writer, of not being able to identify which transactions have already acquired an Orec in read
mode.
To implement invisible readers, we can use sequence locks to replace the reader-writer locks on Orecs. A reader makes no change to the (shared) lock, but does keep a (private) record of the value of the lock at the time it reads a covered location. The record of lock values constitutes a read log, analogous to the write log already required for redo on commit or undo on abort. Using its read log, a transaction can validate its reads by double-checking the values of Orec locks: if a lock has changed, then some other transaction has acquired the Orec as a writer, and the covered data can no longer be assumed to be consistent; the reader must abort.
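The C++ sketch below illustrates the pattern just described: each transactional read appends the Orec and the sequence-lock value it observed to a private read log, and validation simply re-checks those values. The types and names are illustrative rather than those of any particular STM.

#include <atomic>
#include <cstdint>
#include <vector>

struct Orec { std::atomic<uint64_t> seq; };    // sequence lock: odd means a writer holds it

struct ReadLogEntry { Orec *orec; uint64_t observed; };

struct TxDescriptor {
    std::vector<ReadLogEntry> read_log;

    // Called just after each transactional load of a location covered by o.
    void log_read(Orec *o, uint64_t observed) { read_log.push_back({o, observed}); }

    // Returns false if any location read so far may since have been overwritten.
    bool validate() const {
        for (const ReadLogEntry &e : read_log) {
            uint64_t now = e.orec->seq.load(std::memory_order_acquire);
            if (now != e.observed || (now & 1))   // changed, or currently write-locked
                return false;                      // caller must abort (or extend its timestamp)
        }
        return true;
    }
};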
To reduce the impact of false sharing, a transaction can choose to keep the values of loaded locations in its read log, instead of—or in addition to—the values of Orec locks. It can then perform value-based validation [Ding et al., 2007, Olszewski et al., 2007], verifying that previously-
read locations still (or again) contain the same values. Some mechanism—typically a check of
Orec lock values—must still be used, of course, to guarantee that the verified values are all present
at the same time.
In the degenerate case, Dalessandro et al. [2010c] use a single global Orec to provide this guarantee. Their “NOrec” system allows a read-only transaction to validate—and commit—without acquiring any locks: the transaction reads the global sequence lock, uses value-based validation to verify the consistency of all read locations, and then double-checks the sequence lock to make sure that no other transaction committed writes during the validation. As in any system with invisible readers, they employ a redo log rather than an undo log, and they validate during the transaction immediately after every shared read or (with sandboxing [Dalessandro and Scott, 2012]) immediately before every “dangerous” operation. NOrec forces transactions to write back their redo logs one at a time, in mutual exclusion, but it allows them to create those logs—to figure out what they want to write—in parallel. As of early 2013, no known STM system consistently outperforms NOrec for realistic workloads on single-chip multicore machines, though both TML [Dalessandro et al., 2010a] and FastLane [Wamhoff et al., 2013] are better in important cases.
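A minimal C++ sketch of the NOrec-style check for a read-only transaction may make the pattern concrete. Shared words are modeled as std::atomic values purely to keep the sketch race-free, and all names are illustrative; the real system is considerably more refined.

#include <atomic>
#include <cstdint>
#include <vector>

std::atomic<uint64_t> global_seq{0};            // even: quiescent; odd: a writer is writing back

struct ValueLogEntry { const std::atomic<uint64_t> *addr; uint64_t value; };

// Commit attempt for a transaction that wrote nothing: value-based validation
// bracketed by two reads of the single global sequence lock.
bool read_only_commit(const std::vector<ValueLogEntry> &value_log) {
    for (;;) {
        uint64_t s = global_seq.load(std::memory_order_acquire);
        if (s & 1) continue;                                   // wait out an in-progress write-back
        bool consistent = true;
        for (const ValueLogEntry &e : value_log)
            if (e.addr->load(std::memory_order_relaxed) != e.value) { consistent = false; break; }
        if (!consistent) return false;                         // something we read has changed: abort
        if (global_seq.load(std::memory_order_acquire) == s)
            return true;                                       // no writer committed while we validated
        // otherwise a writer slipped in mid-validation; try again
    }
}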
Time-based Validation
By validating its previous reads immediately after reading a new location x , transaction T ensures
that even if x has been modified very recently, all of T ’s work so far is still valid, because the
other locations have not been modified since they were originally read. An alternative approach,
pioneered by the TL2 system of Dice et al. [2006], is to verify that the newly read location x has
not been modified since T began execution. That is, instead of ensuring that all of T's work so far is correct as of the current moment, we ensure that it is correct as of T's start time. To implement this
approach, TL2 employs a global “clock” (actually, just a global count of committed transactions).
It then augments each Orec with a version number that specifies the value of the global clock as of
the most recent write to any location covered by the Orec. At the beginning of each transaction,
TL2 reads and remembers the global clock. On each read, it verifies that the version number in
the corresponding Orec is less than or equal to the remembered clock value. If not, the transaction
aborts.
If a read-only transaction completes its execution successfully, we know its behavior is cor-
rect as of its start time. No additional work is necessary; it trivially commits. A writer transaction,
however, must validate its read set. It locks the Orecs of all locations it wishes to write, atomi-
cally increments the global clock, checks the version numbers of (the Orecs of ) all locations it has
read, and verifies that all are still less than its start time (so the covered locations have not been
modified since). If it is unable to acquire any of the Orecs for the write set, or if any of the Orecs
for the read set have too-recent version numbers, the transaction aborts. Otherwise, it writes back
the values in its redo log and writes the (newly incremented) global clock value into each locked
Orec. By colocating the lock and version number in a single word, TL2 arranges for these writes
to also unlock the Orecs.
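The per-read check can be sketched as follows in C++; the packing of lock bit and version into one word, and all of the names, are illustrative rather than a faithful rendering of TL2.

#include <atomic>
#include <cstdint>

std::atomic<uint64_t> global_clock{0};          // count of committed writer transactions

struct Orec { std::atomic<uint64_t> word; };    // bit 0: locked; remaining bits: version

struct Tx { uint64_t start_time = 0; bool aborted = false; };

void tx_begin(Tx &tx) { tx.start_time = global_clock.load(std::memory_order_acquire); }

// True iff the location covered by o is safe to read: unlocked, and not
// written since this transaction's start time.
bool check_on_read(Tx &tx, const Orec &o) {
    uint64_t w = o.word.load(std::memory_order_acquire);
    bool locked      = (w & 1) != 0;
    uint64_t version = w >> 1;
    if (locked || version > tx.start_time) { tx.aborted = true; return false; }
    return true;
}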
When a transaction T in TL2 reads a location x that has been modified since T ’s start
time, the transaction simply aborts. Riegel et al. [2006] observe, however, that just as a writer
transaction must validate its reads at commit time, effectively “extending” them to its completion
time, a reader or writer transaction can update its reads incrementally. If T began at time t1 , but
x has been modified at time t2 > t1 , T can check to see whether any previously read location has
been modified since t2. If not, T can pretend it began at time t2 instead of t1, and continue. This
extensible timestamp strategy is employed in the TinySTM system of Felber et al. [2008], which
has invisible readers but eager conflict detection. It is also used in SwissTM [Dragojević et al.,
2009], with mixed conflict detection, and NOrec [Dalessandro et al., 2010c], with lazy detection.
Bloom Filters
For readers not familiar with the notion, a Bloom filter [Bloom, 1970] is a bit vector representation of a set that relies on one or more hash functions. Bit i of the vector is set if and only if, for some set member e and some hash function h_j, h_j(e) = i. Element e is inserted into the vector by setting bit h_j(e) for all j. The lookup method tests to see if e is present by checking all these bits. If all of them are set, lookup will return true; if any bit is unset, lookup will return false. These conventions allow false positives (an element may appear to be present when it is not), but not false negatives (a present element will never appear to be absent). In the basic implementation, deletions are not supported.
Note that Bloom filters do not introduce a qualitatively different problem for TM: Orec-based STM systems already suffer from false sharing. The actual rate of false positives in RingSTM depends on the application and the choice of Bloom filter size.
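A minimal C++ rendering of such a filter appears below; the filter size, the number of hash functions, and the hash mixing are arbitrary choices for the example. The intersects test is the kind of signature comparison a system such as RingSTM performs: like lookup, it can report spurious overlap, but it never misses an element the two sets really share.

#include <bitset>
#include <cstddef>
#include <cstdint>

class BloomFilter {
    static constexpr std::size_t BITS = 4096;
    static constexpr int K = 3;                           // number of hash functions
    std::bitset<BITS> bits;

    static std::size_t hash(std::uintptr_t e, int j) {    // the j-th hash function h_j
        uint64_t x = e + 0x9e3779b97f4a7c15ull * static_cast<uint64_t>(j + 1);
        x ^= x >> 33; x *= 0xff51afd7ed558ccdull; x ^= x >> 33;
        return static_cast<std::size_t>(x % BITS);
    }

public:
    void insert(std::uintptr_t e) { for (int j = 0; j < K; ++j) bits.set(hash(e, j)); }

    bool lookup(std::uintptr_t e) const {
        for (int j = 0; j < K; ++j)
            if (!bits.test(hash(e, j))) return false;     // definitely absent
        return true;                                      // probably present (false positives possible)
    }

    bool intersects(const BloomFilter &other) const {     // signature-style conflict check
        return (bits & other.bits).any();
    }
};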
of shared locations, or has been forced to abort more often, or has already killed off a larger num-
ber of competitors. In general, these strategies attempt to recognize and preserve the investment
that has already been made in a transaction. There was a flurry of papers on the subject back
around 2005 [Guerraoui et al., 2005a,b, Scherer III and Scott, 2005a,b], and work continues to
be published, but no one strategy appears to work best in all situations.
9.2 HARDWARE TM
While (as we have seen) TM can be implemented entirely in software, hardware implementations have several compelling advantages. They are faster, of course, at least for equivalent functionality. Most can safely (and speculatively) call code in unmodified (uninstrumented) binary libraries. Most guarantee that transactions will serialize not only with other transactions, but also with individual (non-transactional) loads, stores, and other atomic instructions. (This property is sometimes known as strong atomicity or strong isolation [Blundell et al., 2005].) Finally, most provide automatic, immediate detection of inconsistency, eliminating the need for explicit validation.
Most of the design decisions discussed in Section 9.1, in the context of STM, are relevant to
HTM as well, though hardware may impose additional restrictions. Contention management, for
example, will typically be quite simple, or else deferred to software handlers. More significantly,
buffer space for speculative updates is unlikely to exceed the size of on-chip cache: transactions
that exceed the limit may abort even in the absence of conflicts. Transactions may also abort for
any of several “spurious” reasons, including context switches and external interrupts.
In any new hardware technology, there is a natural incentive for vendors to leverage existing
components as much as possible, and to limit the scope of changes. Several HTM implemen-
tations have been designed for plug-compatibility with traditional cross-chip cache coherence
protocols. In the IBM Blue Gene/Q [Wang et al., 2012], designers chose to use an unmodified
processor core, and to implement HTM entirely within the memory system.
To accommodate hardware limitations, most HTM systems—and certainly any commer-
cial implementations likely to emerge over the next few years—will require software backup. In
the simple case, one can always fall back to a global lock. More ambitiously, we can consider
hybrid TM systems in which compatible STM and HTM implementations coexist.
In the first subsection below we discuss aspects of the TM design space of particular sig-
nificance for HTM. In Section 9.2.2 we consider speculative lock elision, an alternative ABI that
uses TM-style speculation to execute traditional lock-based critical sections. In Section 9.2.3 we consider
alternative ways in which to mix hardware and software support for TM.
ABI
Most HTM implementations include instructions to start a transaction, explicitly abort the cur-
rent transaction, and (attempt to) commit the current transaction. (In this chapter, we refer to
these, generically, as tx_start, tx_abort, and tx_commit.) Some implementations include addi-
tional instructions, e.g., to suspend and resume transactions, or to inspect their status.
While a transaction is active, load and store instructions are considered speculative: the
hardware automatically buffers updates and performs access tracking and conflict detection. Some
systems provide special instructions to access memory nonspeculatively inside of a transaction—
e.g., to spin on a condition or to save information of use to a debugger or performance analyzer.¹
Because these instructions violate isolation and/or atomicity, they must be used with great care.
On an abort, a transaction may retry automatically (generally no more than some fixed
number of times), retry the transaction under protection of an implicit global lock, or jump to a
software handler that figures out what to do (e.g., retry under protection of a software lock). In
Intel’s RTM (Restricted Transactional Memory—part of TSX), the address of the handler is an
argument to the tx_start instruction. In IBM's z and Power TM, tx_start sets a condition code,
in the style of the Posix setjmp routine, to indicate whether the transaction is beginning or has
just aborted; this code must be checked by the following instruction. With either style of abort
delivery, any speculative updates performed so far will be discarded.
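As a concrete illustration, the C++ sketch below uses Intel's RTM intrinsics (_xbegin, _xend, and _xabort from <immintrin.h>, compiled with RTM support) to attempt a critical section a few times in hardware and then fall back to a simple global test-and-set lock. The helper name, the retry count, and the fallback policy are all illustrative. Reading the fallback lock inside the transaction puts it in the read set, so a thread that later acquires the lock will abort any concurrent hardware transactions.

#include <immintrin.h>
#include <atomic>

static std::atomic<int> fallback_lock{0};

template <typename CriticalSection>
void atomic_region(CriticalSection cs) {
    for (int attempt = 0; attempt < 3; ++attempt) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (fallback_lock.load(std::memory_order_relaxed) != 0)
                _xabort(0xff);            // lock held: cannot safely run speculatively
            cs();                         // loads and stores here are tracked by hardware
            _xend();                      // attempt to commit
            return;
        }
        // status encodes the abort cause; a production version would inspect it
    }
    while (fallback_lock.exchange(1, std::memory_order_acquire) != 0) { /* spin */ }
    cs();                                 // nonspeculative execution under the lock
    fallback_lock.store(0, std::memory_order_release);
}

// Example use: atomic_region([&]{ shared_counter++; });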
In the IBM Blue Gene/Q, HTM operations are triggered not with special instructions,
but with stores to special locations in I/O space. Conflicts raise an interrupt, which is fielded by
the OS kernel.
¹Among commercial machines (as of this writing), z TM provides nontransactional stores (ordered at commit or abort time), but not loads. Sun's Rock provided both, with stores again ordered at commit/abort. Intel's TSX provides neither. Power TM allows transactions to enter a “suspended” state (see page 160) in which loads and stores will happen immediately and “for real.” Blue Gene/Q facilities can be used to similar ends, but only with kernel assistance. On both the Power 8 and Blue Gene/Q, the programmer must be aware of the potential for paradoxical memory ordering.
buffer only the speculative version; the original can always be found in some deeper level of cache
or memory.
Whatever the physical location used to buffer speculative updates, there will be a limit on
the space available. In most HTM systems, a transaction will abort if it overflows this space,
or exceeds the supported degree of associativity (footnote, page 13). Several academic groups
have proposed mechanisms to “spill” excess updates to virtual memory and continue to execute
a hardware transaction of effectively unbounded size [Blundell et al., 2007, Ceze et al., 2006,
Chuang et al., 2006, Chung et al., 2006, Rajwar et al., 2005, Shriraman et al., 2010], but such
mechanisms seem unlikely to make their way into commercial systems anytime soon.
In addition to the state of memory, a TM system must consider the state of in-core
resources—registers in particular. In most HTM systems, tx_begin checkpoints all or most of the registers, and restores them on abort. In a few systems (including Blue Gene/Q and the Azul Vega processors [Click, 2009]), software must checkpoint the registers prior to tx_begin, and
restore them manually on abort.
2. Even when data conflicts are relatively rare, it is common for a thread to find that a lock
was last accessed on a different core. By eliding acquisition of the lock (i.e., simply verifying
that it is not held), SLE may avoid the need to acquire the lock’s cache line in exclusive
mode. By leaving locks shared among cores, a program with many small critical sections
may suffer significantly fewer cache misses.
Both of these benefits may improve performance on otherwise comparable machines. They also have the potential to increase scalability, allowing programs in which locks were becoming a bottleneck to run well on larger numbers of cores.
Azul has indicated that lock elision was the sole motivation for their HTM design [Click,
2009], and the designers of most other commercial systems, including z [Jacobi et al., 2012],
Power [IBM, 2012], and TSX [Intel, 2012], cite it as a principal use case. On z, SLE is simply a
programming idiom, along the following lines:
    really_locked := false
    tx_begin
    if failure goto handler
    read lock value                  // add to transaction read set
    if not held goto cs
    abort
handler:
    really_locked := true
    acquire lock
cs: ...                              // critical section
    if really_locked goto release
    tx_commit
    goto over
release:
    release lock
over:
This idiom may be enhanced in various ways—e.g., to retry a few times in hardware if the abort appears to be transient—but the basic pattern is as shown. One shortcoming is that if the critical section (or a function it calls) inspects the value of the lock (e.g., if the lock is reentrant, and is needed by a nested operation), the lock will appear not to be held. The obvious remedy—to write a “held” value to the lock—would abort any similar transaction that is running concurrently. An “SLE-friendly” solution would require each transaction to remember, in thread-local storage, the locks it has elided.
Power TM provides a small ISA enhancement in support of SLE: the tx_commit instruction can safely be called when not in transactional mode, in which case it sets a condition code. The idiom above then becomes:
    tx_begin
    if failure goto handler
    read lock value                  // add to transaction read set
    if not held goto cs
    abort
handler:
    acquire lock
cs: ...                              // critical section
    tx_commit
    if commit succeeded goto over
    release lock
over:
Originally proposed in the thesis work of Ravi Rajwar [2002], SLE plays a significantly
more prominent role in Intel’s Transactional Synchronization Extensions (TSX), of which Rajwar
was a principal architect. TSX actually provides two separate ABIs, called Hardware Lock Elision
(HLE) and Restricted Transactional Memory (RTM). RTM’s behavior, to first approximation, is
similar to that of z or Power TM. There are instructions to begin, commit, or abort a transaction,
and to test whether one is currently active.
On legacy machines, RTM instructions will cause an unsupported instruction exception. To facilitate the construction of backward-compatible code, HLE provides an alternative interface in which traditional lock acquire and release instructions (typically CAS and store) can be tagged with an XACQUIRE or XRELEASE prefix byte. The prefixes were carefully chosen from among codes that function as nops on legacy machines; when run on such a machine, HLE-enabled code will acquire and release its locks “for real.” On a TSX machine, the hardware will refrain from acquiring exclusive ownership of the cache line accessed by an XACQUIRE-tagged instruction. Rather, it will enter speculative mode, add the lock to its speculative update set, and remember the lock's original value and location. If the subsequent XRELEASE-tagged instruction restores the original value to the same location (and no conflicts have occurred in the interim), the hardware will commit the speculation. Crucially, any loads of the lock within the critical section will see its value as “locked,” even though its line is never acquired in exclusive mode. The only way for code in a critical section to tell whether it is speculative or not is to execute a (non-backward-compatible) explicit XTEST instruction.
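For comparison with the z and Power idioms above, the sketch below shows one common way to express elision for TSX from C or C++ using GCC's __atomic builtins, whose memory-order argument can carry HLE hints. It mirrors the pattern in the GCC documentation rather than any code in this lecture; on hardware without TSX the emitted prefixes are simply ignored and the lock is acquired "for real."

// A test-and-set lock whose acquire and release carry XACQUIRE/XRELEASE hints.
static int hle_lock = 0;        // 0 = free, 1 = held

void hle_acquire() {
    // Spin until the (possibly elided) exchange observes the lock free.
    while (__atomic_exchange_n(&hle_lock, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE) != 0) {
        while (__atomic_load_n(&hle_lock, __ATOMIC_RELAXED) != 0) { /* spin */ }
    }
}

void hle_release() {
    // Restoring the original value (0) is what allows an elided critical
    // section to commit rather than abort.
    __atomic_store_n(&hle_lock, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}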
Because an XRELEASE-tagged instruction must restore the original value of a lock, several of the lock algorithms in Chapter 4 must be modified to make them HLE-compatible. The ticket lock (Figure 4.7, page 55), for example, can be rewritten as shown in Figure 9.1. Speculation will succeed only if ns = next_ticket on the first iteration of the loop in acquire, and no other thread increments next_ticket during the critical section. Note in particular that if now_serving ≠ next_ticket when a thread first calls acquire, the loop will continue to execute until the current lock holder updates either now_serving or next_ticket, at which point HLE will abort and retry the FAI “for real.” More significantly, if no two critical sections conflict, and if no aborts occur due to overflow or other “spurious” reasons, then an arbitrary number of threads can execute critical sections on the same lock simultaneously, each of them invisibly incrementing and restoring next_ticket, and never changing now_serving.

class lock
    int next_ticket := 0
    int now_serving := 0
    const int base = ...                       // tuning parameter

lock.acquire():
    int my_ticket := XACQUIRE FAI(&next_ticket)
    loop
        int ns := now_serving.load()
        if ns = my_ticket
            break
        pause(base × (my_ticket − ns))
    fence(R||RW)

lock.release():
    int ns := now_serving
    if ¬XRELEASE CAS(&next_ticket, ns+1, ns, RW||)
        now_serving.store(ns + 1)

Figure 9.1: The ticket lock of Figure 4.7, modified to make use of hardware lock elision.
9.2.3 HYBRID TM
While faster than the all-software TM implementations of Section 9.1, HTM systems seem
likely, for the foreseeable future, to have limitations that will sometimes lead them to abort even
in the absence of conflicts. It seems reasonable to hope that a hardware/software hybrid might
combine (most of ) the speed of the former with the generality of the latter.
Several styles of hybrid have been proposed. In some, hardware serves only to accelerate a
TM system implemented primarily in software. In others, hardware implements complete sup-
port for some subset of transactions. In this latter case, the hardware and software may be designed
together, or the software may be designed to accommodate a generic “best effort” HTM.
“Best effort” hybrids have the appeal of compatibility with near-term commercial HTM.
If transactions abort due to conflicts, the only alternative to rewriting the application would seem
to be fallback to a global lock. If transactions abort due to hardware limitations, however, fallback
to software transactions would seem to be attractive.
Hardware-accelerated STM
Experimental results with a variety of STM systems suggest a baseline overhead (single-thread
slowdown) of 3–10× for atomic operations. Several factors contribute to this overhead, including
conflict detection, the buffering of speculative writes (undo or redo logging), validation to ensure
consistency, and conflict resolution (arbitration among conflicting transactions). All of these are
potentially amenable to hardware acceleration.
Saha et al. [2006b] propose to simplify conflict detection by providing hardware mark bits
on cache lines. Set and queried by software, these bits are cleared when a cache line is invalidated—
e.g., by remote access. To avoid the need to poll the bits, Spear et al. [2007] propose a general-purpose alert-on-update mechanism that triggers a software handler when a marked line is
accessed remotely. Minh et al. [2007] propose an alternative conflict detection mechanism based
on hardware read and write signatures (Bloom filters).
Shriraman et al. [2007] propose to combine in-cache hardware buffering of speculative
cache lines with software conflict detection and resolution; alert-on-update provides immediate
notification of conflicts, eliminating the need for validation. In subsequent work, the authors add
signatures and conflict summary tables; these support eager conflict detection in hardware, leaving
software responsible only for conflict resolution, which may be lazy if desired [Shriraman et al.,
2010]. As suggested by Hill et al. [2007], the “decoupling” of mechanisms for access tracking,
buffering, notification, etc. serves to increase their generality: in various other combinations they
can be used for such non-TM applications as debugging, fine-grain protection, memory man-
agement, and active messaging.
Hardware/Software TM Codesign
In hardware-assisted STM, atomicity remains a program-level property, built on multiple (non-
atomic) hardware-level operations. To maximize performance, one would presumably prefer to
implement atomicity entirely in hardware. If hardware transactions are sometimes unsuccessful
for reasons other than conflicts, and if fallback to a global lock is not considered acceptable, the
challenge then becomes to devise a fallback mechanism that interoperates correctly with hardware
transactions.
One possible approach is to design the hardware and software together. Kumar et al. [2006]
propose an HTM to complement the object-cloning DSTM of Herlihy et al. [2003b]. Baugh
et al. [2008] assume the availability of fine-grain memory protection [Zhou et al., 2004], which
they use in software transactions to force aborts in conflicting hardware transactions. A more
common approach assumes that the hardware is given, and designs software to go with it.
Best-effort Hybrid TM
An HTM implementation is termed “best effort” if it makes no guarantees of completion, even in
the absence of conflicts, and makes no assumptions about the nature of software transactions that
might be running concurrently. All of the commercial HTM systems discussed in Section 9.2.1—
with the exception of constrained transactions in z TM—fit this characterization.
If “spurious” aborts are common enough to make fallback to a global lock unattractive,
one is faced with the question of how to make an STM fallback interoperate with HTM—in
particular, how to notice when transactions of different kinds conflict. One side of the interaction
is straightforward: if a software transaction writes—either eagerly or at commit time—a location
that has been read or written by a hardware transaction, the hardware will abort. The other side is
harder: if, say, a software transaction reads location X , a hardware transaction commits changes to
both X and Y , and then the software transaction reads Y , how are we to know that the software
transaction has seen inconsistent versions, and needs to abort?
Perhaps the most straightforward option, suggested by Damron et al. [2006], is to add ex-
tra instructions to the code of hardware transactions, so they update the software metadata of
any locations they write. Software transactions can then inspect this metadata to validate their
consistency, just as they would in an all-software system. Unfortunately, metadata updates can
significantly slow the HTM code path. Vallejo et al. [2011] show how to move much of the instru-
mentation inside if (hw txn) conditions, but the condition tests themselves still incur nontrivial
overhead. For object-oriented languages, Tabba et al. [2009] show how instrumented hardware
transactions can safely make in-place updates to objects that are cloned by software transactions.
To eliminate the need for instrumentation on the hardware path, Lev et al. [2007] suggest
never running hardware and software transactions concurrently. Instead, they switch between
hardware and software phases on a global basis. Performance can be excellent, but also somewhat
brittle: unless software phases are rare, global phase changes can introduce significant delays.
Arguably the most appealing approach to best-effort hybrid TM is to employ an STM
algorithm that can detect the execution of concurrent hardware transactions without the need to
instrument HTM loads and stores. Dalessandro et al. [2011] achieve this goal by using NOrec
(Section 9.1.4) on the software path, to leverage value-based validation. Significantly, the scalabil-
ity limitation imposed by NOrec’s serial write-back is mitigated in the hybrid version by counting
on most transactions to finish in hardware—STM is only a fallback.
9.3 CHALLENGES
To serve its original purpose—to facilitate the construction of small, self-contained concurrent
data structures—TM need not be exposed at the programming language level. Much as expert
programmers use CAS and other atomic hardware instructions to build library-level synchro-
nization mechanisms and concurrent data structures, so might they use HTM to improve per-
formance “under the hood,” without directly impacting “ordinary” programmers. Much of the
appeal of TM, however, is its potential to help those programmers write parallel code that is both
correct and scalable. To realize this potential, TM must be integrated into language semantics
and implementations. In this final section of the lecture, we discuss some of the issues involved in
this integration. Note that the discussion raises more questions than it answers: as of this writing,
language support for TM is still a work in progress.
9.3.1 SEMANTICS
The most basic open question for TM semantics is “what am I allowed to do inside?” Some
operations—interactive I/O in particular—are incompatible with speculation. (We cannot tell the
human user “please forget I asked you that.”) Rollback of other operations—many system calls
among them—may be so difficult as to force a nonspeculative implementation. The two most
obvious strategies for such irreversible (irrevocable) operations are to (1) simply disallow them in
transactions, or (2) force a transaction that performs them to become inevitable—i.e., guaranteed
to commit. While Spear et al. [2008b] have shown that inevitability does not always necessitate
mutual exclusion, it nonetheless imposes severe constraints on scalability.
Some TM implementations are nonblocking, as discussed in Section 9.1.1. Should this
property ever be part of the language-level semantics? Without it, performance may be less pre-
dictable on multiprogrammed cores, and event-driven code may be subject to deadlock if handlers
cannot be preempted.
In its role as a synchronization mechanism, language-level TM must be integrated into
the language memory model (Section 3.4). Some researchers have argued that since locks already
form the basis of many memory models, the behavior of transactions should be defined in terms
of implicit locking [Menon et al., 2008]. The C++ standards committee, which is currently considering TM language extensions [Adl-Tabatabai et al., 2012], is likely to adopt semantics in
which transactions are co-equal with locks and atomic variable access—all three kinds of oper-
ations will contribute to a program’s synchronization order. Given, however, that TM is often
promoted as a higher-level, more intuitive alternative to locks, there is something conceptually
unsatisfying about defining transactional behavior in terms of (or even in concert with) the thing
it is supposed to replace. Clearly any language that allows transactions and locks in the same pro-
gram must explain how the two interact. Considering that locks are typically implemented using
lower-level atomic operations like CAS , a potentially appealing approach is to turn the tables, as
it were, and define locks in terms of atomic blocks [Dalessandro et al., 2010b]. In the frame-
work of Section 3.4, a global total order on transactions provides a trivial synchronization order,
which combines with program order to yield the overall notion of happens-before. In the result-
ing framework, one can easily show that a data-race-free program is transactionally sequentially
consistent: all memory accesses appear to happen in a global total order that is consistent with
program order in each thread, and that keeps the accesses of any given transaction contiguous.
Some challenges of language integration are more pedestrian: If there are limits on the
operations allowed inside transactions, should these be enforced at compile time or at run time?
If the former, how do we tell whether it is safe to call a subroutine that is defined in a different
compilation unit? Must the subroutine interface explicitly indicate whether it is “transaction safe”?
9.3.2 EXTENSIONS
When adding transactions to a programming language, one may want—or need—to include a
variety of features not yet discussed in this chapter.
Nesting
In the chapter introduction we argued that one of the key advantages of transactions over lock-
based critical sections was their composability. Composability requires that we allow transactions
to nest. The simplest way to do so is to “flatten” them—to subsume the inner transaction(s) in
the outer, and allow the entire unit to commit or abort together. All current commercial HTM
implementations provide subsumption nesting, generally with some maximum limit on depth.
Several STM systems do likewise.
For performance reasons, it may sometimes be desirable to allow an inner transaction to
abort and retry while retaining the work that has been done so far in the outer transaction. This
option, known as “true” or closed nesting, will also be required in any system that allows a trans-
action to abort and not retry. We have already considered such a possibility for exceptions that
escape transactions. It will also arise in any language that provides the programmer with an explicit
abort command [Harris et al., 2005].
For the sake of both performance and generality, it may also be desirable to allow con-
currency within transactions—e.g., to employ multiple threads in a computationally demanding
operation, and commit their results atomically [Agrawal et al., 2008].
In some cases it may even be desirable to allow an inner transaction to commit when the sur-
rounding transaction aborts [Moss and Hosking, 2006, Ni et al., 2007]. This sort of open nesting
may violate serializability, and must be used with care. Possible applications include the preser-
vation of semantically neutral but performance-advantageous operations like garbage collection,
memoization, and rebalancing; the collection of debugging or performance information; and the
construction of “boosted” abstractions (Section 9.1.3).
Condition Synchronization
Like lock-based critical sections, transactions sometimes depend on preconditions, which may
or may not hold. In Chapter 5 we considered a variety of mechanisms whereby a thread could
wait for a precondition in a critical section. But a transaction cannot wait: because it is isolated,
changes to the state of the world, made by other threads, will not be visible to it.
There is an analogy here to nonblocking operations, which cannot wait and still be nonblocking. The analogy suggests a potential solution: insist that transactions be total—that their preconditions always be true—but allow them to commit “reservation” notices in the style of
dual data structures (Section 8.7). If, say, a dequeue operation on a transactional queue finds no
data to remove, it can enqueue a reservation atomically instead, and return an indication that it
has done so. e surrounding code can then wait for the reservation to be satisfied in normal,
nontransactional code.
A second alternative, suggested by Smaragdakis et al. [2007], is to suspend (“punctuate”) a
transaction at a conditional wait, and to make the sections of the transaction before and after the
wait individually (but not jointly) atomic. is alternative requires, of course, that any invariants
maintained by the transaction be true at the punctuation point. If a wait may be nested inside
called routines, the fact that they may wait probably needs to be an explicit part of their interface.
Perhaps the most appealing approach to transactional condition synchronization is the
retry primitive of Harris et al. [2005]. When executed by a transaction, it indicates that the
current operation cannot proceed, and should abort, to be retried at some future time. Exactly
when to retry is a question reminiscent of conditional critical regions (Section 7.4.1). There is a particularly elegant answer for STM: The transaction is sure to behave the same the next time around if it reads the same values from memory. Therefore, it should become a visible reader of every location in its read set, and wait for one of those locations to be modified by another transaction. (Modification by nontransactional code would imply the existence of a data race.) The wakeup mechanism for condition synchronization is then essentially the same as the abort mechanism for visible readers, and can share the same implementation.
• try blocks that roll back to their original state instead of stopping where they are when an exception arises. Shinnar et al. [2004] refer to such blocks as “try-all.”
9.3.3 IMPLEMENTATION
The discussion of STM in Section 9.1 conveys some sense of the breadth of possible imple-
mentation strategies. It is far from comprehensive, however. Drawing inspiration from database
systems, several groups have considered multi-version STM systems, which increase the success
rate for long-running read-only transactions by keeping old versions of modified data [Cachopo
and Rito-Silva, 2006, Lu and Scott, 2012, Perelman et al., 2011, Riegel et al., 2006]. Instead
of requiring that all loaded values be correct as of commit time (and then aborting every time
a location in the read set is updated by another transaction), multi-version TM systems arrange
for a reader to use the values that were current as of its start time, and thus to “commit in the
past.” To increase concurrency, it is conceivable that TM might also adopt the ability of some
database systems to forward updates from one (still active) transaction to another, making the
second dependent on the first [Ramadan et al., 2008].
The Privatization Problem
Informally, a transaction is said to privatize a data structure X if, prior to the transaction, X may be accessed by more than one thread, but after the transaction program logic guarantees that X is private to some particular thread. The canonical example of privatization arises with shared containers through which threads pass objects to one another. In a program with such a container, the convention may be that once an object has been removed from the container, it “belongs” to the thread that removed it, which can safely operate on it without any synchronization. If the thread returns the object to the same or a different container at a later time, it is said to publish the object. Publication of most shared objects also occurs at creation time: a thread typically allocates an object and initializes it before making it visible (publishing it) to other threads. Prior to publication, no synchronization is required. Dalessandro et al. [2010b] have observed that privatization is semantically equivalent to locking—it renders a shared object temporarily private. Publication is equivalent to unlocking—it makes the private object shared again.
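For concreteness, here is the privatization idiom written with GCC's experimental __transaction_atomic block (the -fgnu-tm extension); any language-level atomic block would serve equally well, and the data structure is invented for the example.

struct Node { int payload; Node *next; };

Node *shared_head = nullptr;     // a shared list through which threads pass work items

Node *privatize_one() {
    Node *n;
    __transaction_atomic {       // remove the node transactionally ...
        n = shared_head;
        if (n != nullptr)
            shared_head = n->next;
    }
    // ... after which, assuming all other accesses to the list are transactional,
    // no other thread can reach n, and it can be used without synchronization.
    return n;
}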
In their usual form, publication and privatization are race-free idioms, at least at the level of
the programming model: any accesses by different threads are always ordered by an intervening
transaction. Unfortunately, in many STM systems, privatization is not race-free at the imple-
mentation level. Races arise for two reasons, and may lead to incorrect behavior in programs
that are logically correct. First, completed transactions may perform “cleanup” operations (write-
back of redo or undo logs) after their serialization point. These cleanup writes may interfere with nontransactional reads in the thread that now owns the privatized data. Second, “zombie” transactions, which are doomed to abort but have not yet realized this fact, may read locations that are written nontransactionally by the thread that now owns the privatized data. The result may be an
inconsistent view of memory, which can cause the zombie to display erroneous, externally visible
behavior.
Early STM systems did not experience the “privatization problem” because they assumed
(implicitly or explicitly) that any datum that was ever accessed by more than one thread was
always accessed transactionally. One solution to the privatization problem is thus to statically
partition data into “always private” and “sometimes shared” categories. Unfortunately, attempts
to enforce this partition via the type system lead to programs in which utility routines and data
structures must be “cloned” to create explicitly visible transactional and nontransactional versions
[Dalessandro et al., 2007].
Absent a static partition of data, any modern STM system must be “privatization safe” to
be correct. Systems that serialize cleanup—RingSTM and NOrec among them—are naturally so.
Others can be made so with extra instrumentation. Marathe et al. [2008] describe and evaluate
several instrumentation alternatives. They identify an adaptive strategy whose performance is sta-
ble across a wide range of workloads. Dice et al. [2010] describe an additional mechanism that can
be used to reduce the cost of privatization when the number of active transactions is significantly
smaller than the number of extant threads. Even so, the overheads remain significant—enough
so that one must generally dismiss reported performance numbers for any prototype STM system
that is not privatization safe.
Publication, it turns out, can also lead to unexpected or erroneous behavior, but only in the
presence of program-level data races between transactional and nontransactional code [Menon
et al., 2008]. If data races are viewed as bugs, the “publication problem” can safely be ignored.
Compilation
While many researchers once expected that TM might be successfully implemented in a li-
brary / run-time system, most now agree that it requires language integration and compiler sup-
port. Compilers can be expected to instrument transactional loads and stores; clone code paths
for nontransactional, STM, and HTM execution; and insert validation where necessary to sand-
box dangerous operations. They can also be expected to implement a variety of performance op-
timizations:
• Identify accesses that are sure to touch the same location, and elide redundant instrumentation [Harris et al., 2006] (a brief sketch follows this list).
• Identify loads and stores that are certain to access private variables, and refrain from instrumenting them [Shpeisman et al., 2007]. (This task is rather tricky: if an access may touch either a shared datum or a private datum, then it must be instrumented. If the system uses redo logs, then any other accesses to the same private datum must also be instrumented, to ensure that a transaction always sees its own writes.)
• For strings of successive accesses, infer the minimum number of synchronizing instructions
required to maintain sequential consistency [Spear et al., 2009b].
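To illustrate the first of these optimizations, the sketch below reuses the hypothetical stm_read / stm_write names from the earlier sketch; the “before” and “after” versions have the same transactional effect, but the second performs half as many instrumented accesses.

#include <cstdint>

extern uint64_t stm_read(uint64_t *addr);               // hypothetical, as before
extern void     stm_write(uint64_t *addr, uint64_t v);

uint64_t x;

// Naive instrumentation: every source-level access becomes a runtime call.
void naive() {
    stm_write(&x, stm_read(&x) + 1);
    stm_write(&x, stm_read(&x) + 1);
}

// After analysis: both statements are known to touch the same location, so one
// instrumented read and one (final) instrumented write suffice.
void optimized() {
    uint64_t tmp = stm_read(&x);
    stm_write(&x, tmp + 2);
}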
More exotic optimizations may also be possible. Olszewski et al. [2007] have proposed dynamic
binary rewriting to allow arbitrary library routines to be instrumented on the fly, and called from
within transactions. More ambitiously, if subroutine foo is often called inside transactions, there
may be circumstances in which any subset of its arguments are known to be private, or to have
already been logged. To exploit these circumstances, a compiler may choose to generate a custom
clone of foo that elides instrumentation for one or more parameters.
As of this writing, compilers have been developed for transactional extensions to a variety
of programming languages, including Java [Adl-Tabatabai et al., 2006, Olszewski et al., 2007],
C# [Harris et al., 2006], C [Wang et al., 2007], C++ [Free Software Foundation, 2012, Intel,
2012, VELOX Project, 2011], and Haskell [HaskellWiki, 2012]. Language-level semantics are
currently the most mature in Haskell, though the implementation is slow. Among more “main-
stream” languages, C++ is likely to be the first to incorporate TM extensions into the language
standard [Adl-Tabatabai et al., 2012].
Bibliography
Martín Abadi, Tim Harris, and Mojtaba Mehrara. Transactional memory with strong atomicity
using off-the-shelf memory protection hardware. In Proceedings of the Fourteenth ACM Sympo-
sium on Principles and Practice of Parallel Programming (PPoPP), pages 185–196, Raleigh, NC,
February 2009. DOI: 10.1145/1504176.1504203. 166
Nagi M. Aboulenein, James R. Goodman, Stein Gjessing, and Philip J. Woest. Hardware support
for synchronization in the scalable coherent interface (SCI). In Proceedings of the Eighth Inter-
national Parallel Processing Symposium (IPPS), pages 141–150, Cancun, Mexico, April 1994.
DOI: 10.1109/IPPS.1994.288308. 25, 56
Ali-Reza Adl-Tabatabai, Brian T. Lewis, Vijay Menon, Brian R. Murphy, Bratin Saha, and Ta-
tiana Shpeisman. Compiler and runtime support for efficient software transactional memory.
In Proceedings of the Twenty-seventh ACM SIGPLAN Conference on Programming Language
Design and Implementation (PLDI), pages 26–37, Ottawa, ON, Canada, June 2006. DOI:
10.1145/1133255.1133985. 171
Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial.
Computer, 29(12):66–76, December 1996. DOI: 10.1109/2.546611. 15, 42
Sarita V. Adve and Mark D. Hill. Weak ordering—A new definition. In Proceedings of the Sev-
enteenth International Symposium on Computer Architecture (ISCA), pages 2–14, Seattle, WA,
May 1990. DOI: 10.1145/325096.325100. 46
Sarita V. Adve, Vijay S. Pai, and Parthasarathy Ranganathan. Recent advances in memory con-
sistency models for hardware shared-memory systems. Proceedings of the IEEE, 87(3):445–455,
1999. DOI: 10.1109/5.747865. 15
Kunal Agrawal, Jeremy Fineman, and Jim Sukha. Nested parallelism in transactional mem-
ory. In Proceedings of the Thirteenth ACM Symposium on Principles and Practice of Par-
allel Programming (PPoPP), pages 163–174, Salt Lake City, UT, February 2008. DOI:
10.1145/1345206.1345232. 167
Randy Allen and Ken Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-
Based Approach. Morgan Kaufmann, San Francisco, CA, 2002. 12
George S. Almasi and Allan Gottlieb. Highly Parallel Computing. Benjamin Cummings, Red-
wood City, CA, 1989. 73
Noga Alon, Amnon Barak, and Udi Manber. On disseminating information reliably without
broadcasting. In Proceedings of the International Conference on Distributed Computing Systems
(ICDCS), pages 74–81, Berlin, Germany, September 1987. 75
AMD. Advanced Synchronization Facility: Proposed Architectural Specification. Advanced Micro
Devices, March 2009. Publication #45432, Version 2.1. Available as amddevcentral.com/
assets/45432-ASF_Spec_2.1.pdf. 160
C. Scott Ananian, Krste Asanovic, Bradley C. Kuszmaul, Charles E. Leiserson, and Sean Lie.
Unbounded transactional memory. In Proceedings of the Eleventh International Symposium on
High Performance Computer Architecture (HPCA), pages 316–327, San Francisco, CA, February
2005. DOI: 10.1109/HPCA.2005.41. 157
James Anderson and Mark Moir. Universal constructions for large objects. IEEE
Transactions on Parallel and Distributed Systems, 10(12):1317–1332, December 1999. DOI:
10.1109/71.819952. 145
Thomas E. Anderson, Edward D. Lazowska, and Henry M. Levy. The performance of spin lock
alternatives for shared-memory multiprocessors. IEEE Transactions on Parallel and Distributed
Systems, 1(1):6–16, January 1990. DOI: 10.1109/71.80120. 54, 56
Tom E. Anderson, Brian N. Bershad, Edward D. Lazowska, and Henry M. Levy. Scheduler ac-
tivations: Effective kernel support for the user-level management of parallelism. ACM Trans-
actions on Computer Systems, 10(1):53–79, February 1992. DOI: 10.1145/121132.121151. 120
Jonathan Appavoo, Marc Auslander, Maria Burtico, Dilma Da Silva, Orran Krieger, Mark Mer-
gen, Michal Ostrowski, Bryan Rosenburg, Robert W. Wisniewski, and Jimi Xenidis. Expe-
rience with K42, an open-source Linux-compatible scalable operating system kernel. IBM
Systems Journal, 44(2):427–440, 2005. DOI: 10.1147/sj.442.0427. 59
Nimar S. Arora, Robert D. Blumofe, and C. Greg Plaxton. Thread scheduling for multipro-
grammed multiprocessors. In Proceedings of the Tenth Annual ACM Symposium on Parallel Al-
gorithms and Architectures (SPAA), pages 119–129, Puerto Vallarta, Mexico, June–July 1998.
DOI: 10.1145/277651.277678. 139
Hagit Attiya, Rachid Guerraoui, Danny Hendler, Petr Kuznetsov, Maged M. Michael, and Mar-
tin Vechev. Laws of order: Expensive synchronization in concurrent algorithms cannot be elim-
inated. In Proceedings of the Thirty-eighth ACM Symposium on Principles of Programming Lan-
guages (POPL), pages 487–498, Austin, TX, January 2011. DOI: 10.1145/1926385.1926442.
47
Marc A. Auslander, David Joel Edelsohn, Orran Yaakov Krieger, Bryan Savoye Rosenburg,
and Robert W. Wisniewski. Enhancement to the MCS lock for increased functionality and
improved programmability. U. S. patent application number 20030200457 (abandoned),
October 2003. http://appft1.uspto.gov/netacgi/nph-Parser?Sect1=PTO2&Sect2=
HITOFF&p=1&u=%2Fnetahtml%2FPTO%2Fsearch-bool.html&r=1&f=G&l=50&co1=AND&d=
PG01&s1=20030200457.PGNR.&OS=DN/20030200457&RS=DN/20030200457 59
David Bacon, Joshua Bloch, Jeff Bogda, Cliff Click, Paul Haahr, Doug Lea, Tom May, Jan-
Willem Maessen, Jeremy Manson, John D. Mitchell, Kelvin Nilsen, Bill Pugh, and Emin Gun
Sirer. The ‘double-checked locking is broken’ declaration, 2001. www.cs.umd.edu/~pugh/
java/memoryModel/DoubleCheckedLocking.html. 68
Greg Barnes. A method for implementing lock-free shared data structures (extended abstract). In
Proceedings of the Fifth Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA),
pages 261–270, Velen, Germany, June–July 1993. DOI: 10.1145/165231.165265. 145
Hans W. Barz. Implementing semaphores by binary semaphores. ACM SIGPLAN Notices, 18
(2):39–45, February 1983. DOI: 10.1145/948101.948103. 106
Lee Baugh, Naveen Neelakantan, and Craig Zilles. Using hardware memory protection to build
a high-performance, strongly atomic hybrid transactional memory. In Proceedings of the Thirty-
fifth International Symposium on Computer Architecture (ISCA), pages 115–126, Beijing, China,
June 2008. DOI: 10.1145/1394608.1382132. 164, 166
Rudolf Bayer and Mario Schkolnick. Concurrency of operations on B-trees. Acta Informatica, 9
(1):1–21, 1977. DOI: 10.1007/BF00263762. 33
Mordechai Ben-Ari. Principles of Concurrent and Distributed Programming. Addison-Wesley,
2006. 41, 49
Emery Berger, Ting Yang, Tongping Liu, and Gene Novark. Grace: Safe multithreaded pro-
gramming for C/C++. In Proceedings of the Twenty-fourth Annual ACM SIGPLAN Conference
on Object-oriented Programming Systems, Languages, and Applications (OOPSLA), pages 81–96,
Orlando, FL, October 2009. DOI: 10.1145/1640089.1640096. 168
Mike Blasgen, Jim Gray, Mike Mitoma, and Tom Price. The convoy phenomenon. ACM Oper-
ating Systems Review, 13(2):20–25, April 1979. http://dl.acm.org/citation.cfm?doid=
850657.850659 119
Burton H. Bloom. Space/time trade-off in hash coding with allowable errors. Communications
of the ACM, 13(7):422–426, July 1970. DOI: 10.1145/362686.362692. 155
Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work
stealing. In Proceedings of the Thirty-fifth Annual Symposium on Foundations of Computer Science (FOCS), pages 356–368, Santa Fe, NM, November 1994. http://doi.
ieeecomputersociety.org/10.1109/SFCS.1994.365680 139
Colin Blundell, Joe Devietti, E Christopher Lewis, and Milo M. K. Martin. Making the fast case
common and the uncommon case simple in unbounded transactional memory. In Proceedings
of the Thirty-fourth International Symposium on Computer Architecture (ISCA), pages 24–34, San
Diego, CA, June 2007. DOI: 10.1145/1273440.1250667. 158
Jayaram Bobba, Neelam Goyal, Mark D. Hill, Michael M. Swift, and David A. Wood. To-
kenTM: Efficient execution of large transactions with hardware transactional memory. In
Proceedings of the Thirty-fifth International Symposium on Computer Architecture (ISCA), pages
127–138, Beijing, China, June 2008. DOI: 10.1145/1394608.1382133. 157, 159
Hans-J. Boehm. Can seqlocks get along with programming language memory models? In Pro-
ceedings of the ACM SIGPLAN Workshop on Memory Systems Performance and Correctness, pages
12–20, Beijing, China, June 2012. DOI: 10.1145/2247684.2247688. 96
Hans-J. Boehm and Sarita V. Adve. Foundations of the C++ concurrency memory model.
In Proceedings of the Twenty-ninth ACM SIGPLAN Conference on Programming Lan-
guage Design and Implementation (PLDI), pages 68–78, Tucson, AZ, June 2008. DOI:
10.1145/1379022.1375591. 15, 42, 46
Per Brinch Hansen. Operating System Principles. Prentice-Hall, Englewood Cliffs, NJ, 1973.
108, 114
Eugene D. Brooks III. The butterfly barrier. International Journal of Parallel Programming, 15(4):
295–307, August 1986. DOI: 10.1007/BF01407877. 75
Paul J. Brown and Ronald M. Smith. Shared data controlled by a plurality of users. U. S. patent
number 3,886,525, May 1975. Filed June 1973. 21
Jehoshua Bruck, Danny Dolev, Ching-Tien Ho, Marcel-Cătălin Roşu, and Ray Strong. Efficient
message passing interface (MPI) for parallel computing on clusters of workstations. In Pro-
ceedings of the Seventh Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA),
pages 64–73, Santa Barbara, CA, July 1995. DOI: 10.1145/215399.215421. 118
Sebastian Burckhardt, Rajeev Alur, and Milo M. K. Martin. CheckFence: Checking consistency
of concurrent data types on relaxed memory models. In Proceedings of the Twenty-eighth ACM
SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 12–
21, San Diego, CA, June 2007. DOI: 10.1145/1273442.1250737. 19
James E. Burns and Nancy A. Lynch. Mutual exclusion using indivisible reads and writes.
In Proceedings of the Eighteenth Annual Allerton Conference on Communication, Control, and
Computing, pages 833–842, Monticello, IL, October 1980. A revised version of this paper
was published as “Bounds on Shared memory for Mutual Exclusion”, Information and Com-
putation, 107(2):171–184, December 1993. http://groups.csail.mit.edu/tds/papers/
Lynch/allertonconf.pdf 50
João Cachopo and António Rito-Silva. Versioned boxes as the basis for memory trans-
actions. Science of Computer Programming, 63(2):172–185, December 2006. DOI:
10.1016/j.scico.2006.05.009. 169
Irina Calciu, Dave Dice, Yossi Lev, Victor Luchangco, Virendra J. Marathe, and Nir Shavit.
NUMA-aware reader-writer locks. In Proceedings of the Eighteenth ACM Symposium on Princi-
ples and Practice of Parallel Programming (PPoPP), pages 157–166, Shenzhen, China, February
2013. DOI: 10.1145/2442516.2442532. 88
Luis Ceze, James Tuck, Călin Caşcaval, and Josep Torrellas. Bulk disambiguation of spec-
ulative threads in multiprocessors. In Proceedings of the Thirty-third International Sym-
posium on Computer Architecture (ISCA), pages 227–238, Boston, MA, June 2006. DOI:
10.1145/1150019.1136506. 158, 159
Rohit Chandra, Ramesh Menon, Leo Dagum, David Kohr, Dror Maydan, and Jeff McDonald.
Parallel Programming in OpenMP. Morgan Kaufmann, San Francisco, CA, 2001. 117
Albert Chang and Mark Mergen. 801 storage: Architecture and programming. ACM Transactions
on Computer Systems, 6(1):28–50, February 1988. DOI: 10.1145/35037.42270. 145
Philippe Charles, Christopher Donawa, Kemal Ebcioglu, Christian Grothoff, Allan Kielstra,
Christoph von Praun, Vijay Saraswat, and Vivek Sarkar. X10: An object-oriented approach
to non-uniform cluster computing. In Proceedings of the Twentieth Annual ACM SIGPLAN
Conference on Object-oriented Programming Systems, Languages, and Applications (OOPSLA),
pages 519–538, San Diego, CA, October 2005. DOI: 10.1145/1094811.1094852. 116
David Chase and Yossi Lev. Dynamic circular work-stealing deque. In Proceedings of the Sev-
enteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages
21–28, Las Vegas, NV, July 2005. DOI: 10.1145/1073970.1073974. 140
Dan Chazan and Willard L. Miranker. Chaotic relaxation. Linear Algebra and its Applications, 2
(2):199–222, April 1969. DOI: 10.1016/0024-3795(69)90028-7. 46
Weihaw Chuang, Satish Narayanasamy, Ganesh Venkatesh, Jack Sampson, Michael Van Bies-
brouck, Gilles Pokam, Brad Calder, and Osvaldo Colavin. Unbounded page-based transac-
tional memory. In Proceedings of the Twelfth International Symposium on Architectural Support
for Programming Languages and Operating Systems (ASPLOS), pages 347–358, San Jose, CA,
October 2006. DOI: 10.1145/1168918.1168901. 158, 159
JaeWoong Chung, Chi Cao Minh, Austen McDonald, Travis Skare, Hassan Chafi, Brian D.
Carlstrom, Christos Kozyrakis, and Kunle Olukotun. Tradeoffs in transactional memory vir-
tualization. In Proceedings of the Twelfth International Symposium on Architectural Support for
Programming Languages and Operating Systems (ASPLOS), pages 371–381, San Jose, CA, Oc-
tober 2006. DOI: 10.1145/1168919.1168903. 158
Austin T. Clements, M. Frans Kaashoek, and Nickolai Zeldovich. Scalable address spaces using
RCU balanced trees. In Proceedings of the Seventeenth International Symposium on Architectural
Support for Programming Languages and Operating Systems (ASPLOS), pages 199–210, London,
United Kingdom, March 2012. DOI: 10.1145/2189750.2150998. 99, 100
Cliff Click Jr. And now some hardware transactional memory comments. Au-
thor’s Blog, Azul Systems, February 2009. www.azulsystems.com/blog/cliff/
2009-02-25-and-now-some-hardware-transactional-memory-comments. 26, 156,
158, 161
Edward G. Coffman, Jr., Michael J. Elphick, and Arie Shoshani. System deadlocks. Computing
Surveys, 3(2):67–78, June 1971. DOI: 10.1145/356586.356588. 28
Pierre-Jacques Courtois, F. Heymans, and David L. Parnas. Concurrent control with ‘read-
ers’ and ‘writers’. Communications of the ACM, 14(10):667–668, October 1971. DOI:
10.1145/362759.362813. 87, 88
Travis S. Craig. Building FIFO and priority-queueing spin locks from atomic swap. Technical
Report TR 93-02-02, University of Washington Computer Science Department, February
1993. ftp://ftp.cs.washington.edu/tr/1993/02/UW-CSE-93-02-02.pdf 56, 59, 62
David E. Culler and Jaswinder Pal Singh. Parallel Computer Architecture: A Hardware/Software
Approach. Morgan Kaufmann, San Francisco, CA, 1998. With Anoop Gupta. 12
Luke Dalessandro and Michael L. Scott. Strong isolation is a weak idea. In Fourth ACM
SIGPLAN Workshop on Transactional Computing (TRANSACT), Raleigh, NC, February 2009.
166
Luke Dalessandro, Virendra J. Marathe, Michael F. Spear, and Michael L. Scott. Capabilities and
limitations of library-based software transactional memory in C++. In Second ACM SIGPLAN
Workshop on Transactional Computing (TRANSACT), Portland, OR, August 2007. http://
www.cs.rochester.edu/u/scott/papers/2007_TRANSACT_RSTM2.pdf 147, 170
Luke Dalessandro, Dave Dice, Michael L. Scott, Nir Shavit, and Michael F. Spear. Transactional
mutex locks. In Proceedings of the Sixteenth International Euro-Par Conference, pages II:2–13,
Ischia-Naples, Italy, August–September 2010a. DOI: 10.1007/978-3-642-15291-7_2. 97, 153
Luke Dalessandro, Michael L. Scott, and Michael F. Spear. Transactions as the foundation of
a memory consistency model. In Proceedings of the Twenty-fourth International Symposium
on Distributed Computing (DISC), pages 20–34, Cambridge, MA, September 2010b. DOI:
10.1007/978-3-642-15763-9_4. 47, 166, 169
Luke Dalessandro, Michael F. Spear, and Michael L. Scott. NOrec: Streamlining STM by
abolishing ownership records. In Proceedings of the Fifteenth ACM Symposium on Principles and
Practice of Parallel Programming (PPoPP), pages 67–78, Bangalore, India, January 2010c. DOI:
10.1145/1693453.1693464. 97, 153, 154
Luke Dalessandro, François Carouge, Sean White, Yossi Lev, Mark Moir, Michael L. Scott, and
Michael F. Spear. Hybrid NOrec: A case study in the effectiveness of best effort hardware
transactional memory. In Proceedings of the Sixteenth International Symposium on Architectural
Support for Programming Languages and Operating Systems (ASPLOS), pages 39–52, Newport
Beach, CA, March 2011. DOI: 10.1145/1950365.1950373. 165
Peter Damron, Alexandra Fedorova, Yossi Lev, Victor Luchangco, Mark Moir, and Dan Nuss-
baum. Hybrid transactional memory. In Proceedings of the Twelfth International Symposium
on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages
336–346, San Jose, CA, October 2006. DOI: 10.1145/1168919.1168900. 165
Mathieu Desnoyers, Paul E. McKenney, Alan S. Stern, Michel R. Dagenais, and Jonathan
Walpole. User-level implementations of read-copy update. IEEE Transactions on Parallel and
Distributed Systems, 23(2):375–382, February 2012. DOI: 10.1109/TPDS.2011.159. 97, 98
Dave Dice, Hui Huang, and Mingyao Yang. Asymmetric Dekker synchronization.
Lead author’s blog, Oracle Corp., July 2001. blogs.oracle.com/dave/resource/
Asymmetric-Dekker-Synchronization.txt. 69, 70
Dave Dice, Mark Moir, and William N. Scherer III. Quickly reacquirable locks. Technical
Report, Sun Microsystems Laboratories, 2003. Subject of U.S. Patent #7,814,488. 70
Dave Dice, Ori Shalev, and Nir Shavit. Transactional locking II. In Proceedings of the Twen-
tieth International Symposium on Distributed Computing (DISC), pages 194–208, Stockholm,
Sweden, September 2006. DOI: 10.1007/11864219_14. 148, 149, 154
Dave Dice, Yossi Lev, Mark Moir, and Daniel Nussbaum. Early experience with a com-
mercial hardware transactional memory implementation. In Proceedings of the Fourteenth
International Symposium on Architectural Support for Programming Languages and Operat-
ing Systems (ASPLOS), pages 157–168, Washington, DC, March 2009. Expanded ver-
sion available as SMLI TR-2009-180, Sun Microsystems Laboratories, October 2009. DOI:
10.1145/1508244.1508263. 26, 156, 157, 159
Dave Dice, Alexander Matveev, and Nir Shavit. Implicit privatization using private transac-
tions. In Fifth ACM SIGPLAN Workshop on Transactional Computing (TRANSACT), Paris,
France, April 2010. http://people.csail.mit.edu/shanir/publications/Implicit%
20Privatization.pdf or http://people.cs.umass.edu/~moss/transact-2010/
public-papers/03.pdf 170
Dave Dice, Yossi Lev, Yujie Liu, Victor Luchangco, and Mark Moir. Using hardware transac-
tional memory to correct and simplify a readers-writer lock algorithm. In Proceedings of the
Eighteenth ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), pages
261–270, Shenzhen, China, February 2013. DOI: 10.1145/2442516.2442542. 89, 92, 94
David Dice. Inverted schedctl usage in the JVM. Author’s Blog, Oracle Corp., June 2011.
blogs.oracle.com/dave/entry/inverted_schedctl_usage_in_the. 120
David Dice, Virendra J. Marathe, and Nir Shavit. Lock cohorting: A general technique for
designing NUMA locks. In Proceedings of the Seventeenth ACM Symposium on Principles and
Practice of Parallel Programming (PPoPP), pages 247–256, New Orleans, LA, February 2012.
DOI: 10.1145/2370036.2145848. 67
Stephan Diestelhorst, Martin Pohlack, Michael Hohmuth, Dave Christie, Jae-Woong Chung,
and Luke Yen. Implementing AMD’s Advanced Synchronization Facility in an out-of-
order x86 core. In Fifth ACM SIGPLAN Workshop on Transactional Computing (TRANS-
ACT), Paris, France, April 2010. http://people.cs.umass.edu/~moss/transact-2010/
public-papers/14.pdf 160
Edsger W. Dijkstra. Een algorithme ter voorkoming van de dodelijke omarming. Technical
Report EWD-108, IBM T. J. Watson Research Center, early 1960s. In Dutch. Circulated
privately. http://www.cs.utexas.edu/~EWD/ewd01xx/EWD108.PDF 29
Chen Ding, Xipeng Shen, Kirk Kelsey, Chris Tice, Ruke Huang, and Chengliang Zhang. Soft-
ware behavior oriented parallelization. In Proceedings of the Twenty-eighth ACM SIGPLAN
Conference on Programming Language Design and Implementation (PLDI), pages 223–234, San
Diego, CA, June 2007. DOI: 10.1145/1273442.1250760. 153, 168
Joe Duffy. Windows keyed events, critical sections, and new Vista synchronization fea-
tures. Author’s Blog, November 2006. www.bluebytesoftware.com/blog/2006/11/
29/WindowsKeyedEventsCriticalSectionsAndNewVistaSynchronizationFeatures.
aspx. 121
Jan Edler, Jim Lipkis, and Edith Schonberg. Process management for highly parallel UNIX
systems. In Proceedings of the Usenix Workshop on Unix and Supercomputers, pages 1–17, Pitts-
burgh, PA, September 1988. Also available as Ultracomputer Note #136, Courant Institute
of Mathematical Sciences, New York University, April 1988. http://citeseerx.ist.psu.
edu/viewdoc/summary?doi=10.1.1.45.4602 120
Faith Ellen, Yossi Lev, Victor Luchangco, and Mark Moir. SNZI: Scalable NonZero Indica-
tors. In Proceedings of the Twenty-sixth ACM Symposium on Principles of Distributed Computing
(PODC), pages 13–22, Portland, OR, August 2007. DOI: 10.1145/1281100.1281106. 152
Kapali P. Eswaran, Jim Gray, Raymond A. Lorie, and Irving L. Traiger. The notions of consis-
tency and predicate locks in a database system. Communications of the ACM, 19(11):624–633,
November 1976. DOI: 10.1145/360363.360369. 35, 146
Pascal Felber, Torvald Riegel, and Christof Fetzer. Dynamic performance tuning of word-based
software transactional memory. In Proceedings of the Thirteenth ACM Symposium on Principles
and Practice of Parallel Programming (PPoPP), pages 237–246, Salt Lake City, UT, February
2008. DOI: 10.1145/1345206.1345241. 154
Pascal Felber, Vincent Gramoli, and Rachid Guerraoui. Elastic transactions. In Proceedings of
the Twenty-third International Symposium on Distributed Computing (DISC), pages 93–107,
Elche/Elx, Spain, September 2009. DOI: 10.1007/978-3-642-04355-0_12. 172
Michael J. Fischer, Nancy A. Lynch, James E. Burns, and Allan Borodin. Resource allocation with
immunity to limited process failure. In Proceedings of the Twentieth Annual Symposium on
Foundations of Computer Science (FOCS), pages 234–254, San Juan, Puerto Rico,
October 1979. DOI: 10.1109/SFCS.1979.37. 52, 55
Hubertus Franke and Rusty Russell. Fuss, futexes and furwocks: Fast userlevel locking in Linux.
In Proceedings of the Ottawa Linux Symposium, pages 479–495, Ottawa, ON, Canada, July
2002. https://www.kernel.org/doc/ols/2002/ols2002-pages-479-495.pdf 119
Keir Fraser. Practical Lock-Freedom. PhD thesis, King’s College, University of Cambridge,
September 2003. Published as University of Cambridge Computer Laboratory technical re-
port #579, February 2004. www.cl.cam.ac.uk/techreports/UCAM-CL-TR-579.pdf. 135,
148
Keir Fraser and Tim Harris. Concurrent programming without locks. ACM Transactions on
Computer Systems, 25(2):article 5, May 2007. DOI: 10.1145/1233307.1233309. 146, 148
Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-
5 multithreaded language. In Proceedings of the Nineteenth ACM SIGPLAN Conference on
Programming Language Design and Implementation (PLDI), pages 212–223, Montreal, PQ,
Canada, June 1998. DOI: 10.1145/277652.277725. 116, 139
David Gifford, Alfred Spector, Andris Padegs, and Richard Case. Case study: IBM’s
System/360–370 architecture. Communications of the ACM, 30(4):291–307, April 1987. DOI:
10.1145/32232.32233. 21
Brian Goetz, Tim Peierls, Joshua Bloch, Joseph Bowbeer, David Holmes, and Doug Lea. Java
Concurrency in Practice. Addison-Wesley Professional, 2006. 109
James R. Goodman, Mary K. Vernon, and Philip J. Woest. Efficient synchronization primitives
for large-scale cache-coherent multiprocessors. In Proceedings of the Third International Sym-
posium on Architectural Support for Programming Languages and Operating Systems (ASPLOS),
pages 64–75, Boston, MA, April 1989. DOI: 10.1145/70082.68188. 25, 56
Allan Gottlieb, Ralph Grishman, Clyde P. Kruskal, Kevin P. McAuliffe, Larry Rudolph,
and Marc Snir. The NYU Ultracomputer: Designing an MIMD shared memory par-
allel computer. IEEE Transactions on Computers, 32(2):175–189, February 1983. DOI:
10.1109/TC.1983.1676201. 74
Gary Graunke and Shreekant Thakkar. Synchronization algorithms for shared-memory multi-
processors. Computer, 23(6):60–69, June 1990. DOI: 10.1109/2.55501. 56
Rachid Guerraoui, Maurice Herlihy, and Bastian Pochon. Polymorphic contention management
in SXM. In Proceedings of the Nineteenth International Symposium on Distributed Computing
(DISC), pages 303–323, Cracow, Poland, September 2005a. DOI: 10.1007/11561927_23. 156
Rachid Guerraoui, Maurice Herlihy, and Bastian Pochon. Toward a theory of transac-
tional contention managers. In Proceedings of the Twenty-fourth ACM Symposium on Prin-
ciples of Distributed Computing (PODC), pages 258–264, Las Vegas, NV, July 2005b. DOI:
10.1145/1073814.1073863. 156
Rajiv Gupta. The fuzzy barrier: A mechanism for high speed synchronization of processors.
In Proceedings of the Third International Symposium on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), pages 54–63, Boston, MA, April 1989. DOI:
10.1145/70082.68187. 80
Rajiv Gupta and Charles R. Hill. A scalable implementation of barrier synchronization using an
adaptive combining tree. International Journal of Parallel Programming, 18(3):161–180, June
1989. DOI: 10.1007/BF01407897. 82
Theo Haerder and Andreas Reuter. Principles of transaction-oriented database recovery. ACM
Computing Surveys, 15(4):287–317, December 1983. DOI: 10.1145/289.291. 146
Robert H. Halstead, Jr. Multilisp: A language for concurrent symbolic computation. ACM
Transactions on Programming Languages and Systems, 7(4):501–538, October 1985. DOI:
10.1145/4472.4478. 115
Lance Hammond, Vicky Wong, Mike Chen, Ben Hertzberg, Brian Carlstrom, Manohar
Prabhu, Honggo Wijaya, Christos Kozyrakis, and Kunle Olukotun. Transactional mem-
ory coherence and consistency. In Proceedings of the irty-first International Symposium
on Computer Architecture (ISCA), pages 102–113, München, Germany, June 2004. DOI:
10.1145/1028176.1006711. 159
Yijie Han and Raphael A. Finkel. An optimal scheme for disseminating information. In Proceed-
ings of the International Conference on Parallel Processing (ICPP), pages II:198–203, University
Park, PA, August 1988. 75
Tim Harris and Keir Fraser. Language support for lightweight transactions. In Proceedings
of the Eighteenth Annual ACM SIGPLAN Conference on Object-oriented Programming Systems,
Languages, and Applications (OOPSLA), pages 388–402, Anaheim, CA, October 2003. DOI:
10.1145/949343.949340. 146, 148
Tim Harris, Simon Marlow, Simon Peyton Jones, and Maurice Herlihy. Composable
memory transactions. In Proceedings of the Tenth ACM Symposium on Principles and
Practice of Parallel Programming (PPoPP), pages 48–60, Chicago, IL, June 2005. DOI:
10.1145/1065944.1065952. 167, 168
Timothy Harris, Mark Plesko, Avraham Shinnar, and David Tarditi. Optimizing memory trans-
actions. In Proceedings of the Twenty-seventh ACM SIGPLAN Conference on Programming Lan-
guage Design and Implementation (PLDI), pages 14–25, Ottawa, ON, Canada, June 2006. DOI:
10.1145/1133255.1133984. 148, 170, 171
Timothy L. Harris. A pragmatic implementation of non-blocking linked-lists. In Proceedings of
the Fifteenth International Symposium on Distributed Computing (DISC), pages 300–314, Lis-
bon, Portugal, October 2001. DOI: 10.1007/3-540-45414-4_21. 129
Timothy L. Harris, James R. Larus, and Ravi Rajwar. Transactional Memory. Morgan & Clay-
pool, San Francisco, CA, second edition, 2010. First edition, by Larus and Rajwar only, 2007.
DOI: 10.2200/S00272ED1V01Y201006CAC011. 26, 145
HaskellWiki. Software transactional memory, February 2012. www.haskell.org/
haskellwiki/Software_transactional_memory. 171
Bijun He, William N. Scherer III, and Michael L. Scott. Preemption adaptivity in time-
published queue-based spin locks. In Proceedings of the Twelfth International Conference on High
Performance Computing, pages 7–18, Goa, India, December 2005. DOI: 10.1007/11602569_6.
120
Danny Hendler and Nir Shavit. Non-blocking steal-half work queues. In Proceedings of the
Twenty-first ACM Symposium on Principles of Distributed Computing (PODC), pages 280–289,
Monterey, CA, July 2002. DOI: 10.1145/571825.571876. 140
Danny Hendler, Nir Shavit, and Lena Yerushalmi. A scalable lock-free stack algorithm. In Pro-
ceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures
(SPAA), pages 206–215, Barcelona, Spain, June 2004. DOI: 10.1145/1007912.1007944. 85,
142
Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Scalable flat-combining based syn-
chronous queues. In Proceedings of the Twenty-fourth International Symposium on Distributed
Computing (DISC), pages 79–93, Cambridge, MA, September 2010a. DOI: 10.1007/978-3-
642-15763-9_8. 85
Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Flat combining and the
synchronization-parallelism tradeoff. In Proceedings of the Twenty-second Annual ACM Sym-
posium on Parallelism in Algorithms and Architectures (SPAA), pages 355–364, Thira, Santorini,
Greece, June 2010b. DOI: 10.1145/1810479.1810540. 85, 128
Debra A. Hensgen, Raphael A. Finkel, and Udi Manber. Two algorithms for barrier synchro-
nization. International Journal of Parallel Programming, 17(1):1–17, February 1988. DOI:
10.1007/BF01379320. 73, 75, 76
Maurice Herlihy and Eric Koskinen. Transactional boosting: A methodology for highly-
concurrent transactional objects. In Proceedings of the Thirteenth ACM Symposium on Principles
and Practice of Parallel Programming (PPoPP), pages 207–216, Salt Lake City, UT, February
2008. DOI: 10.1145/1345206.1345237. 151
Maurice Herlihy and Yossi Lev. tm_db: A generic debugging library for transactional pro-
grams. In Proceedings of the Eighteenth International Conference on Parallel Architectures
and Compilation Techniques (PACT), pages 136–145, Raleigh, NC, September 2009. DOI:
10.1109/PACT.2009.23. 171
Maurice Herlihy and J. Eliot B. Moss. Transactional memory: Architectural support for lock-free
data structures. In Proceedings of the Twentieth International Symposium on Computer Architec-
ture (ISCA), pages 289–300, San Diego, CA, May 1993. DOI: 10.1109/ISCA.1993.698569.
26, 143, 145
Maurice Herlihy and Nir Shavit. The Art of Multiprocessor Programming. Morgan Kaufmann,
San Francisco, CA, 2008. 27, 123, 135
Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer III. Software trans-
actional memory for dynamic-sized data structures. In Proceedings of the Twenty-second ACM
Symposium on Principles of Distributed Computing (PODC), pages 92–101, Boston, MA, July
2003b. DOI: 10.1145/872035.872048. 38, 146, 148, 164, 172
Maurice Herlihy, Victor Luchangco, Paul Martin, and Mark Moir. Nonblocking memory man-
agement support for dynamic-sized data structures. ACM Transactions on Computer Systems,
23(2):146–196, May 2005. DOI: 10.1145/1062247.1062249. 25
Maurice P. Herlihy. A methodology for implementing highly concurrent data objects. ACM
Transactions on Programming Languages and Systems, 15(5):745–770, November 1993. DOI:
10.1145/161468.161469. 143, 145
Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A correctness condition for concur-
rent objects. ACM Transactions on Programming Languages and Systems, 12(3):463–492, July
1990. DOI: 10.1145/78969.78972. 31, 33
F. Ryan Johnson, Radu Stoica, Anastasia Ailamaki, and Todd C. Mowry. Decoupling contention
management from scheduling. In Proceedings of the Fifteenth International Symposium on Archi-
tectural Support for Programming Languages and Operating Systems (ASPLOS), pages 117–128,
Pittsburgh, PA, March 2010. DOI: 10.1145/1736020.1736035. 119
JRuby. Community website. jruby.org/. 47
Jython. Project website. jython.org/. 47
Anna R. Karlin, Kai Li, Mark S. Manasse, and Susan Owicki. Empirical studies of competitive
spinning for a shared-memory multiprocessor. In Proceedings of the irteenth ACM Symposium
on Operating Systems Principles (SOSP), pages 41–55, Pacific Grove, CA, October 1991. DOI:
10.1145/121132.286599. 118
Joep L. W. Kessels. An alternative to event queues for synchronization in monitors. Communi-
cations of the ACM, 20(7):500–503, July 1977. DOI: 10.1145/359636.359710. 115
Thomas F. Knight. An architecture for mostly functional languages. In Proceedings of the ACM
Conference on Lisp and Functional Programming (LFP), pages 105–112, Cambridge, MA, Au-
gust 1986. DOI: 10.1145/319838.319854. 145
Alex Kogan and Erez Petrank. A methodology for creating fast wait-free data structures. In
Proceedings of the Seventeenth ACM Symposium on Principles and Practice of Parallel Programming
(PPoPP), pages 141–150, New Orleans, LA, February 2012. DOI: 10.1145/2145816.2145835.
38
Leonidas I. Kontothanassis, Robert Wisniewski, and Michael L. Scott. Scheduler-conscious
synchronization. ACM Transactions on Computer Systems, 15(1):3–40, February 1997. DOI:
10.1145/244764.244765. 120
Eric Koskinen and Maurice Herlihy. Dreadlocks: Efficient deadlock detection. In Proceedings of
the Twentieth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA),
pages 297–303, Munich, Germany, June 2008. DOI: 10.1145/1378533.1378585. 29
Eric Koskinen, Matthew Parkinson, and Maurice Herlihy. Coarse-grained transactions. In Pro-
ceedings of the Thirty-seventh ACM Symposium on Principles of Programming Languages (POPL),
pages 19–30, Madrid, Spain, January 2010. DOI: 10.1145/1706299.1706304. 151
Orran Krieger, Michael Stumm, Ron Unrau, and Jonathan Hanna. A fair fast scalable reader-
writer lock. In Proceedings of the International Conference on Parallel Processing (ICPP), pages
II:201–204, St. Charles, IL, August 1993. DOI: 10.1109/ICPP.1993.21. 89, 92
Clyde P. Kruskal, Larry Rudolph, and Marc Snir. Efficient synchronization on multiprocessors
with shared memory. ACM Transactions on Programming Languages and Systems, 10(4):579–
601, October 1988. DOI: 10.1145/48022.48024. 21, 74
KSR. KSR1 Principles of Operation. Kendall Square Research, Waltham, MA, 1992. 25
Sanjeev Kumar, Michael Chu, Christopher J. Hughes, Partha Kundu, and Anthony Nguyen.
Hybrid transactional memory. In Proceedings of the Eleventh ACM Symposium on Principles and
Practice of Parallel Programming (PPoPP), pages 209–220, New York, NY, March 2006. DOI:
10.1145/1168857.1168900. 164
Michael Kuperstein, Martin Vechev, and Eran Yahav. Automatic inference of memory fences. In
Proceedings of the IEEE Conference on Formal Methods in Computer-Aided Design, pages 111–
120, Lugano, Switzerland, October 2010. DOI: 10.1145/2261417.2261438. 19
Edya Ladan-Mozes and Nir Shavit. An optimistic approach to lock-free FIFO queues. Dis-
tributed Computing, 20(5):323–341, February 2008. DOI: 10.1007/s00446-007-0050-0. 127
Richard E. Ladner and Michael J. Fischer. Parallel prefix computation. Journal of the ACM, 27
(4):831–838, October 1980. DOI: 10.1145/322217.322232. 74
Leslie Lamport. How to make a multiprocessor computer that correctly executes multipro-
cess programs. IEEE Transactions on Computers, C-28(9):690–691, September 1979. DOI:
10.1109/TC.1979.1675439. 14
Leslie Lamport. A fast mutual exclusion algorithm. ACM Transactions on Computer Systems, 5
(1):1–11, February 1987. DOI: 10.1145/7351.7352. 52
Butler W. Lampson and David D. Redell. Experience with processes and monitors in Mesa.
Communications of the ACM, 23(2):105–117, February 1980. DOI: 10.1145/358818.358824.
109, 111
Doug Lea. The JSR-133 cookbook for compiler writers, March 2001. g.oswego.edu/dl/jmm/
cookbook.html. 18
Daniel Lenoski, James Laudon, Kourosh Gharachorloo, Wolf-Dietrich Weber, Anoop Gupta,
John Hennessy, Mark Horowitz, and Monica S. Lam. The Stanford Dash multiprocessor.
Computer, 25(3):63–79, March 1992. DOI: 10.1109/2.121510. 20
Yossi Lev. Debugging and Profiling of Transactional Programs. PhD thesis, Department of
Computer Science, Brown University, May 2010. http://cs.brown.edu/research/pubs/
theses/phd/2010/lev.pdf 171
Yossi Lev, Mark Moir, and Dan Nussbaum. PhTM: Phased transactional memory. In Second
ACM SIGPLAN Workshop on Transactional Computing (TRANSACT), Portland, OR, August
2007. http://www.cs.rochester.edu/meetings/TRANSACT07/papers/lev.pdf 165
Yossi Lev, Victor Luchangco, Virendra Marathe, Mark Moir, Dan Nussbaum, and Marek
Olszewski. Anatomy of a scalable software transactional memory. In Fourth ACM SIG-
PLAN Workshop on Transactional Computing (TRANSACT), Raleigh, NC, February 2009a.
http://transact09.cs.washington.edu/25_paper.pdf 152
Yossi Lev, Victor Luchangco, and Marek Olszewski. Scalable reader-writer locks. In
Proceedings of the Twenty-first Annual ACM Symposium on Parallelism in Algorithms
and Architectures (SPAA), pages 101–110, Calgary, AB, Canada, August 2009b. DOI:
10.1145/1583991.1584020. 94
Li Lu and Michael L. Scott. Unmanaged multiversion STM. In Seventh ACM SIGPLAN Work-
shop on Transactional Computing (TRANSACT), New Orleans, LA, February 2012. http:
//www.cs.rochester.edu/u/scott/papers/2012_TRANSACT_umv.pdf 169
Boris D. Lubachevsky. Synchronization barrier and related tools for shared memory parallel
programming. In Proceedings of the International Conference on Parallel Processing (ICPP), pages
II:175–179, University Park, PA, August 1989. DOI: 10.1007/BF01407956. 76
Nancy Lynch. Distributed Algorithms. Morgan Kaufmann, San Francisco, CA, 1996. 27
Peter Magnusson, Anders Landin, and Erik Hagersten. Queue locks on cache coherent mul-
tiprocessors. In Proceedings of the Eighth International Parallel Processing Symposium (IPPS),
pages 165–171, Cancun, Mexico, April 1994. DOI: 10.1109/IPPS.1994.288305. 56, 59, 61,
62
Jeremy Manson, William Pugh, and Sarita V. Adve. The Java memory model. In Proceedings
of the Thirty-second ACM Symposium on Principles of Programming Languages (POPL), pages
378–391, Long Beach, CA, January 2005. DOI: 10.1145/1047659.1040336. 15, 42, 46
Virendra J. Marathe and Mark Moir. Toward high performance nonblocking software trans-
actional memory. In Proceedings of the Thirteenth ACM Symposium on Principles and Practice
of Parallel Programming (PPoPP), pages 227–236, Salt Lake City, UT, February 2008. DOI:
10.1145/1345206.1345240. 38, 148
Virendra J. Marathe, William N. Scherer III, and Michael L. Scott. Adaptive software transac-
tional memory. In Proceedings of the Nineteenth International Symposium on Distributed Comput-
ing (DISC), pages 354–368, Cracow, Poland, September 2005. DOI: 10.1007/11561927_26.
38, 148
Virendra J. Marathe, Michael F. Spear, Christopher Heriot, Athul Acharya, David Eisenstat,
William N. Scherer III, and Michael L. Scott. Lowering the overhead of software transactional
memory. In First ACM SIGPLAN Workshop on Transactional Computing (TRANSACT), Ot-
tawa, ON, Canada, June 2006. 148
Virendra J. Marathe, Michael F. Spear, and Michael L. Scott. Scalable techniques for trans-
parent privatization in software transactional memory. In Proceedings of the International
Conference on Parallel Processing (ICPP), pages 67–74, Portland, OR, September 2008. DOI:
10.1109/ICPP.2008.69. 170
José F. Martínez and Josep Torrellas. Speculative synchronization: Applying thread-level spec-
ulation to explicitly parallel applications. In Proceedings of the Tenth International Symposium
on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages
18–29, San Jose, CA, October 2002. DOI: 10.1145/605397.605400. 146
Paul E. McKenney, Jonathan Appavoo, Andi Kleen, Orran Krieger, Rusty Russel, Dipankar
Sarma, and Maneesh Soni. Read-copy update. In Proceedings of the Ottawa Linux Sympo-
sium, pages 338–367, Ottawa, ON, Canada, July 2001. Revised version available as http:
//www.rdrop.com/~paulmck/RCU/rclock_OLS.2001.05.01c.pdf 25, 97, 99
Avraham A. Melkman. On-line construction of the convex hull of a simple polyline. Information
Processing Letters, 25(1):11–12, April 1987. DOI: 10.1016/0020-0190(87)90086-X. 135
Vijay Menon, Steven Balensiefer, Tatiana Shpeisman, Ali-Reza Adl-Tabatabai, Richard L. Hud-
son, Bratin Saha, and Adam Welc. Practical weak-atomicity semantics for Java STM. In Pro-
ceedings of the Twentieth Annual ACM Symposium on Parallelism in Algorithms and Architectures
(SPAA), pages 314–325, Munich, Germany, June 2008. DOI: 10.1145/1378533.1378588.
166, 170
Michael Merritt and Gadi Taubenfeld. Computing with infinitely many processes. In Proceedings
of the Fourteenth International Symposium on Distributed Computing (DISC), pages 164–178,
Toledo, Spain, October 2000. DOI: 10.1007/3-540-40026-5_11. 52
Robert M. Metcalfe and David R. Boggs. Ethernet: Distributed packet switching for lo-
cal computer networks. Communications of the ACM, 19(7):395–404, July 1976. DOI:
10.1145/360248.360253. 54
Maged M. Michael. Practical lock-free and wait-free LL/SC/VL implementations using 64-
bit CAS. In Proceedings of the Eighteenth International Symposium on Distributed Computing
(DISC), pages 144–158, Amsterdam, The Netherlands, October 2004a. DOI: 10.1007/978-
3-540-30186-8_11. 124, 136
Maged M. Michael. CAS-based lock-free algorithm for shared deques. In Proceedings of the Ninth
Euro-Par Conference on Parallel Processing, pages 651–660, Klagenfurt, Austria, August 2003.
DOI: 10.1007/978-3-540-45209-6_92. 135, 136, 137
Maged M. Michael. High performance dynamic lock-free hash tables and list-based sets. In
Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures
(SPAA), pages 73–82, Winnipeg, MB, Canada, August 2002. DOI: 10.1145/564870.564881.
129, 130, 131
Maged M. Michael. Hazard pointers: Safe memory reclamation for lock-free objects.
IEEE Transactions on Parallel and Distributed Systems, 15(8):491–504, August 2004b. DOI:
10.1109/TPDS.2004.8. 25, 129, 136
Maged M. Michael and Michael L. Scott. Nonblocking algorithms and preemption-safe lock-
ing on multiprogrammed shared memory multiprocessors. Journal of Parallel and Distributed
Computing, 51(1):1–26, January 1998. DOI: 10.1006/jpdc.1998.1446. 38, 125
Maged M. Michael and Michael L. Scott. Simple, fast, and practical non-blocking and block-
ing concurrent queue algorithms. In Proceedings of the Fifteenth ACM Symposium on Princi-
ples of Distributed Computing (PODC), pages 267–275, Philadelphia, PA, May 1996. DOI:
10.1145/248052.248106. 125, 127
Chi Cao Minh, Martin Trautmann, JaeWoong Chung, Austen McDonald, Nathan Bronson,
Jared Casper, Christos Kozyrakis, and Kunle Olukotun. An effective hybrid transactional
memory system with strong isolation guarantees. In Proceedings of the Thirty-fourth Interna-
tional Symposium on Computer Architecture (ISCA), pages 69–80, San Diego, CA, June 2007.
DOI: 10.1145/1273440.1250673. 164
Mark Moir and James H. Anderson. Wait-free algorithms for fast, long-lived renaming. Science of
Computer Programming, 25(1):1–39, October 1995. DOI: 10.1016/0167-6423(95)00009-H.
52
Mark Moir and Nir Shavit. Concurrent data structures. In Dinesh P. Metha and Sartaj Sahni, ed-
itors, Handbook of Data Structures and Applications, chapter 47. Chapman and Hall / CRC
Press, San Jose, CA, 2005. 38, 123
Mark Moir, Daniel Nussbaum, Ori Shalev, and Nir Shavit. Using elimination to implement
scalable and lock-free FIFO queues. In Proceedings of the Seventeenth Annual ACM Symposium
on Parallelism in Algorithms and Architectures (SPAA), pages 253–262, Las Vegas, NV, July 2005.
DOI: 10.1145/1073970.1074013. 142
Adam Morrison and Yehuda Afek. Fast concurrent queues for x86 processors. In Proceedings
of the Eighteenth ACM Symposium on Principles and Practice of Parallel Programming (PPoPP),
pages 103–112, Shenzhen, China, February 2013. DOI: 10.1145/2442516.2442527. 128
J. Eliot B. Moss and Antony L. Hosking. Nested transactional memory: Model and archi-
tecture sketches. Science of Computer Programming, 63(2):186–201, December 2006. DOI:
10.1016/j.scico.2006.05.010. 167
Steven S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, San
Francisco, CA, 1997. 12
Robert H. B. Netzer and Barton P. Miller. What are race conditions?: Some issues and formal-
izations. ACM Letters on Programming Languages and Systems, 1(1):74–88, March 1992. DOI:
10.1145/130616.130623. 45
Yang Ni, Vijay S. Menon, Ali-Reza Adl-Tabatabai, Antony L. Hosking, Richard L. Hud-
son, J. Eliot B. Moss, Bratin Saha, and Tatiana Shpeisman. Open nesting in software
transactional memory. In Proceedings of the Twelfth ACM Symposium on Principles and
Practice of Parallel Programming (PPoPP), pages 68–78, San Jose, CA, March 2007. DOI:
10.1145/1229428.1229442. 167
Marek Olszewski, Jeremy Cutler, and J. Gregory Steffan. JudoSTM: A dynamic binary-rewriting
approach to software transactional memory. In Proceedings of the Sixteenth International Con-
ference on Parallel Architectures and Compilation Techniques (PACT), pages 365–375, Brasov,
Romania, September 2007. DOI: 10.1109/PACT.2007.4336226. 153, 171
John K. Ousterhout. Scheduling techniques for concurrent systems. In Proceedings of the Interna-
tional Conference on Distributed Computing Systems (ICDCS), pages 22–30, Miami/Ft. Laud-
erdale, FL, October 1982. 118
Christos H. Papadimitriou. The serializability of concurrent database updates. Journal of the
ACM, 26(4):631–653, October 1979. DOI: 10.1145/322154.322158. 34
Dmitri Perelman, Anton Byshevsky, Oleg Litmanovich, and Idit Keidar. SMV: Selective multi-
versioning STM. In Proceedings of the Twenty-fifth International Symposium on Distributed
Computing (DISC), pages 125–140, Rome, Italy, September 2011. DOI: 10.1007/978-3-642-
24100-0_9. 169
Gary L. Peterson. Myths about the mutual exclusion problem. Information Processing Letters, 12
(3):115–116, June 1981. DOI: 10.1016/0020-0190(81)90106-X. 18, 49
Gary L. Peterson and Michael J. Fischer. Economical solutions for the critical section problem
in a distributed system. In Proceedings of the Ninth ACM Symposium on the Theory of Computing
(STOC), pages 91–97, Boulder, CO, May 1977. DOI: 10.1145/800105.803398. 49
Sundeep Prakash, Yann-Hang Lee, and Theodore Johnson. A nonblocking algorithm for shared
queues using compare-and-swap. IEEE Transactions on Computers, 43(5):548–559, May 1994.
DOI: 10.1109/12.280802. 129
William Pugh. Skip lists: A probabilistic alternative to balanced trees. Communications of the
ACM, 33(6):668–676, June 1990. DOI: 10.1145/78973.78977. 134
Zoran Radović and Erik Hagersten. Hierarchical backoff locks for nonuniform communi-
cation architectures. In Proceedings of the Ninth International Symposium on High Perfor-
mance Computer Architecture (HPCA), pages 241–252, Anaheim, CA, February 2003. DOI:
10.1109/HPCA.2003.1183542. 67
Zoran Radović and Erik Hagersten. Efficient synchronization for nonuniform communication
architectures. In Proceedings, Supercomputing 2002 (SC), pages 1–13, Baltimore, MD, Novem-
ber 2002. DOI: 10.1109/SC.2002.10038. 67
Ravi Rajwar. Speculation-based Techniques for Lock-free Execution of Lock-based Programs. PhD
thesis, Department of Computer Sciences, University of Wisconsin–Madison, October 2002.
ftp://ftp.cs.wisc.edu/galileo/papers/rajwar_thesis.ps.gz 162
Ravi Rajwar and James R. Goodman. Transactional lock-free execution of lock-based programs.
In Proceedings of the Tenth International Symposium on Architectural Support for Programming
Languages and Operating Systems (ASPLOS), pages 5–17, San Jose, CA, October 2002. DOI:
10.1145/635506.605399. 146, 160
Ravi Rajwar and James R. Goodman. Speculative lock elision: Enabling highly concurrent
multithreaded execution. In Proceedings of the Thirty-fourth International Symposium on Mi-
croarchitecture (MICRO), pages 294–305, Austin, TX, December 2001. DOI: 10.1109/MI-
CRO.2001.991127. 146, 160
Ravi Rajwar, Maurice Herlihy, and Konrad Lai. Virtualizing transactional memory. In Proceedings
of the Thirty-second International Symposium on Computer Architecture (ISCA), pages 494–505,
Madison, WI, June 2005. DOI: 10.1145/1080695.1070011. 158
Hany E. Ramadan, Christopher J. Rossbach, and Emmett Witchel. Dependence-aware transac-
tional memory for increased concurrency. In Proceedings of the Forty-first International Sympo-
sium on Microarchitecture (MICRO), pages 246–257, Lake Como, Italy, November 2008. DOI:
10.1109/MICRO.2008.4771795. 169
Raghavan Raman, Jisheng Zhao, Vivek Sarkar, Martin Vechev, and Eran Yahav. Scalable and
precise dynamic datarace detection for structured parallelism. In Proceedings of the Thirty-third
ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI),
pages 531–542, Beijing, China, 2012. DOI: 10.1145/2254064.2254127. 117
David P. Reed and Rajendra K. Kanodia. Synchronization with eventcounts and sequencers.
Communications of the ACM, 22(2):115–123, February 1979. DOI: 10.1145/359060.359076.
52, 55
Torvald Riegel, Pascal Felber, and Christof Fetzer. A lazy snapshot algorithm with eager valida-
tion. In Proceedings of the Twentieth International Symposium on Distributed Computing (DISC),
pages 284–298, Stockholm, Sweden, September 2006. DOI: 10.1007/11864219 20. 154, 169
RSTM. Reconfigurable Software Transactional Memory website. code.google.com/p/rstm/.
150
Larry Rudolph and Zary Segall. Dynamic decentralized cache schemes for MIMD parallel pro-
cessors. In Proceedings of the Eleventh International Symposium on Computer Architecture (ISCA),
pages 340–347, Ann Arbor, MI, June 1984. DOI: 10.1145/773453.808203. 54
Kenneth Russell and David Detlefs. Eliminating synchronization-related atomic operations with
biased locking and bulk rebiasing. In Proceedings of the Twenty-first Annual ACM SIGPLAN
Conference on Object-oriented Programming Systems, Languages, and Applications (OOPSLA),
pages 263–272, Portland, OR, October 2006. DOI: 10.1145/1167473.1167496. 69
Bratin Saha, Ali-Reza Adl-Tabatabai, Richard L. Hudson, Chi Cao Minh, and
Benjamin Hertzberg. McRT-STM: A high performance software transactional memory sys-
tem for a multi-core runtime. In Proceedings of the Eleventh ACM Symposium on Principles and
Practice of Parallel Programming (PPoPP), pages 187–197, New York, NY, March 2006a. DOI:
10.1145/1122971.1123001. 148
Bratin Saha, Ali-Reza Adl-Tabatabai, and Quinn Jacobson. Architectural support for soft-
ware transactional memory. In Proceedings of the Thirty-ninth International Symposium on Mi-
croarchitecture (MICRO), pages 185–196, Orlando, FL, December 2006b. DOI: 10.1109/MI-
CRO.2006.9. 163
William N. Scherer III and Michael L. Scott. Nonblocking concurrent data structures with
condition synchronization. In Proceedings of the Eighteenth International Symposium on Dis-
tributed Computing (DISC), pages 174–187, Amsterdam, The Netherlands, October 2004.
DOI: 10.1007/978-3-540-30186-8_13. 141
William N. Scherer III and Michael L. Scott. Advanced contention management for dynamic
software transactional memory. In Proceedings of the Twenty-fourth ACM Symposium on Prin-
ciples of Distributed Computing (PODC), pages 240–248, Las Vegas, NV, July 2005a. DOI:
10.1145/1073814.1073861. 156
William N. Scherer III and Michael L. Scott. Randomization in STM contention management
(poster). In Proceedings of the Twenty-fourth ACM Symposium on Principles of Distributed Com-
puting (PODC), Las Vegas, NV, July 2005b. www.cs.rochester.edu/u/scott/papers/
2005_PODC_Rand_CM_poster_abstract.pdf. 156
William N. Scherer III, Doug Lea, and Michael L. Scott. A scalable elimination-based ex-
change channel. In Workshop on Synchronization and Concurrency in Object-Oriented Lan-
guages (SCOOL), San Diego, CA, October 2005. In conjunction with OOPSLA 2005.
http://www.cs.rochester.edu/u/scott/papers/2005_SCOOL_exchanger.pdf 142
William N. Scherer III, Doug Lea, and Michael L. Scott. Scalable synchronous queues. Com-
munications of the ACM, 52(5):100–108, May 2009. DOI: 10.1145/1506409.1506431. 141
Florian T. Schneider, Vijay Menon, Tatiana Shpeisman, and Ali-Reza Adl-Tabatabai. Dynamic
optimization for efficient strong atomicity. In Proceedings of the Twenty-third Annual ACM
SIGPLAN Conference on Object-oriented Programming Systems, Languages, and Applications
(OOPSLA), pages 181–194, Nashville, TN, October 2008. DOI: 10.1145/1449764.1449779.
166
Fred B. Schneider. On Concurrent Programming. Springer-Verlag, 1997. DOI: 10.1007/978-1-
4612-1830-2. 27, 41
Michael L. Scott. Programming Language Pragmatics. Morgan Kaufmann Publishers, Burling-
ton, MA, third edition, 2009. 103
Michael L. Scott. Sequential specification of transactional memory semantics. In First ACM
SIGPLAN Workshop on Transactional Computing (TRANSACT), Ottawa, ON, Canada, June
2006. http://www.cs.rochester.edu/u/scott/papers/2006_TRANSACT_formal_STM.
pdf 148
Michael L. Scott and John M. Mellor-Crummey. Fast, contention-free combining tree barriers
for shared-memory multiprocessors. International Journal of Parallel Programming, 22(4):449–
481, August 1994. DOI: 10.1007/BF02577741. 83
Michael L. Scott and Maged M. Michael. The topological barrier: A synchronization abstraction
for regularly-structured parallel applications. Technical Report TR 605, Department of Com-
puter Science, University of Rochester, January 1996. http://www.cs.rochester.edu/u/
scott/papers/1996_TR605.pdf 117
Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen.
x86-TSO: A rigorous and usable programmer’s model for x86 multiprocessors. Communications
of the ACM, 53(7):89–97, July 2010. DOI: 10.1145/1785414.1785443. 20
Ori Shalev and Nir Shavit. Split-ordered lists: Lock-free extensible hash tables. Journal of the
ACM, 53(3):379–405, May 2006. DOI: 10.1145/1147954.1147958. 132, 134
Nir Shavit and Dan Touitou. Software transactional memory. In Proceedings of the Fourteenth
ACM Symposium on Principles of Distributed Computing (PODC), pages 204–213, Ottawa, ON,
Canada, August 1995. DOI: 10.1145/224964.224987. 143, 145
Nir Shavit and Dan Touitou. Elimination trees and the construction of pools and stacks. Theory
of Computing Systems, 30(6):645–670, August 1997. DOI: 10.1145/215399.215419. 85
Nir Shavit and Asaph Zemach. Combining funnels: A dynamic approach to software combin-
ing. Journal of Parallel and Distributed Computing, 60(11):1355–1387, November 2000. DOI:
10.1006/jpdc.2000.1621. 83, 84, 85
Avraham Shinnar, David Tarditi, Mark Plesko, and Bjarne Steensgaard. Integrating support for
undo with exception handling. Technical Report MSR-TR-2004-140, Microsoft Research,
December 2004. http://research.microsoft.com/pubs/70125/tr-2004-140.pdf 168
Jun Shirako, David Peixotto, Vivek Sarkar, and William N. Scherer III. Phasers: A unified
deadlock-free construct for collective and point-to-point synchronization. In Proceedings of the
International Conference on Supercomputing (ICS), pages 277–288, Island of Kos, Greece, June
2008. DOI: 10.1145/1375527.1375568. 117
Tatiana Shpeisman, Vijay Menon, Ali-Reza Adl-Tabatabai, Steven Balensiefer, Dan Grossman,
Richard L. Hudson, Katherine F. Moore, and Bratin Saha. Enforcing isolation and order-
ing in STM. In Proceedings of the Twenty-eighth ACM SIGPLAN Conference on Programming
Language Design and Implementation (PLDI), pages 78–88, San Diego, CA, June 2007. DOI:
10.1145/1273442.1250744. 166, 170
Arrvindh Shriraman and Sandhya Dwarkadas. Refereeing conflicts in hardware transactional
memory. In Proceedings of the Twenty-third International Conference on Supercomputing, pages
136–146, Yorktown Heights, NY, June 2009. DOI: 10.1145/1542275.1542299. 148
Arrvindh Shriraman, Michael F. Spear, Hemayet Hossain, Sandhya Dwarkadas, and Michael L.
Scott. An integrated hardware-software approach to flexible transactional memory. In Pro-
ceedings of the Thirty-fourth International Symposium on Computer Architecture (ISCA), pages
104–115, San Diego, CA, June 2007. DOI: 10.1145/1273440.1250676. 164
Yannis Smaragdakis, Anthony Kay, Reimer Behrends, and Michal Young. Transactions with
isolation and cooperation. In Proceedings of the Twenty-second Annual ACM SIGPLAN Con-
ference on Object-oriented Programming Systems, Languages, and Applications (OOPSLA), pages
191–210, Montréal, PQ, Canada, October 2007. DOI: 10.1145/1297027.1297042. 168
Daniel J. Sorin, Mark D. Hill, and David A. Wood. A Primer on Memory Consis-
tency and Cache Coherence. Morgan & Claypool, San Francisco, CA, 2011. DOI:
10.2200/S00346ED1V01Y201104CAC016. 12, 42
Michael F. Spear, Virendra J. Marathe, William N. Scherer III, and Michael L. Scott. Conflict
detection and validation strategies for software transactional memory. In Proceedings of the
Twentieth International Symposium on Distributed Computing (DISC), pages 179–193, Stock-
holm, Sweden, September 2006. DOI: 10.1007/11864219_13. 153
Michael F. Spear, Arrvindh Shriraman, Luke Dalessandro, Sandhya Dwarkadas, and Michael L.
Scott. Nonblocking transactions without indirection using alert-on-update. In Proceedings of
the Nineteenth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA),
pages 210–220, San Diego, CA, June 2007. DOI: 10.1145/1248377.1248414. 163
Michael F. Spear, Maged M. Michael, and Christoph von Praun. RingSTM: Scalable transac-
tions with a single atomic instruction. In Proceedings of the Twentieth Annual ACM Symposium
on Parallelism in Algorithms and Architectures (SPAA), pages 275–284, Munich, Germany, June
2008a. DOI: 10.1145/1378533.1378583. 148, 154
Michael F. Spear, Michael Silverman, Luke Dalessandro, Maged M. Michael, and Michael L.
Scott. Implementing and exploiting inevitability in software transactional memory. In Proceed-
ings of the International Conference on Parallel Processing (ICPP), pages 59–66, Portland, OR,
September 2008b. DOI: 10.1109/ICPP.2008.55. 166
Michael F. Spear, Luke Dalessandro, Virendra J. Marathe, and Michael L. Scott. A comprehen-
sive contention management strategy for software transactional memory. In Proceedings of the
Fourteenth ACM Symposium on Principles and Practice of Parallel Programming (PPoPP), pages
141–150, Raleigh, NC, February 2009a. DOI: 10.1145/1504176.1504199. 155
Michael F. Spear, Maged M. Michael, Michael L. Scott, and Peng Wu. Reducing memory
ordering overheads in software transactional memory. In Proceedings of the International Sym-
posium on Code Generation and Optimization (CGO), pages 13–24, Seattle, WA, March 2009b.
DOI: 10.1109/CGO.2009.30. 171
Janice M. Stone, Harold S. Stone, Philip Heidelberger, and John Turek. Multiple reservations
and the Oklahoma update. IEEE Parallel and Distributed Technology, 1(4):58–71, November
1993. DOI: 10.1109/88.260295. 26, 145
Håkan Sundell. Efficient and Practical Non-Blocking Data Structures. PhD thesis, Department
of Computing Science, Chalmers University of Technology, Göteborg University, 2004. www.
cse.chalmers.se/~tsigas/papers/Haakan-Thesis.pdf. 123
Håkan Sundell and Philippas Tsigas. NOBLE: Non-blocking programming support via lock-
free shared abstract data types. Computer Architecture News, 36(5):80–87, December 2008a.
DOI: 10.1145/1556444.1556455. 38
Håkan Sundell and Philippas Tsigas. Lock-free deques and doubly linked lists. Journal of Parallel
and Distributed Computing, 68(7):1008–1020, July 2008b. DOI: 10.1016/j.jpdc.2008.03.001.
128, 135
Fuad Tabba, Mark Moir, James R. Goodman, Andrew W. Hay, and Cong Wang. NZTM:
Nonblocking zero-indirection transactional memory. In Proceedings of the Twenty-first Annual
ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 204–213, Calgary,
AB, Canada, August 2009. DOI: 10.1145/1583991.1584048. 148, 165
Peiyi Tang and Pen-Chung Yew. Software combining algorithms for distributing hot-spot ad-
dressing. Journal of Parallel and Distributed Computing, 10(2):130–139, February 1990. DOI:
10.1016/0743-7315(90)90022-H. 74, 84
Gadi Taubenfeld. Shared memory synchronization. Bulletin of the European Association for Theo-
retical Computer Science (BEATCS), (96):81–103, October 2008. http://www.faculty.idc.
ac.il/gadi/MyPapers/2008T-SMsync.pdf 49
Gadi Taubenfeld. The black-white bakery algorithm. In Proceedings of the Eighteenth International
Symposium on Distributed Computing (DISC), pages 56–70, Amsterdam, The Netherlands, Oc-
tober 2004. DOI: 10.1007/978-3-540-30186-8_5. 52
Gadi Taubenfeld. Synchronization Algorithms and Concurrent Programming. Pearson Education–
Prentice-Hall, 2006. 49
R. Kent Treiber. Systems programming: Coping with parallelism. Technical Report RJ
5118, IBM Almaden Research Center, April 1986. http://domino.research.ibm.
com/library/cyberdig.nsf/papers/58319A2ED2B1078985257003004617EF/$File/
rj5118.pdf 23, 24, 124
John Turek, Dennis Shasha, and Sundeep Prakash. Locking without blocking: Making lock
based concurrent data structure algorithms nonblocking. In Proceedings of the Eleventh ACM
Symposium on Principles of Database Systems (PODS), pages 212–222, Vancouver, BC, Canada,
August 1992. DOI: 10.1145/137097.137873. 145
Enrique Vallejo, Sutirtha Sanyal, Tim Harris, Fernando Vallejo, Ramón Beivide, Osman Unsal,
Adrián Cristal, and Mateo Valero. Hybrid transactional memory with pessimistic concur-
rency control. International Journal of Parallel Programming, 29(3):375–396, June 2011. DOI:
10.1007/s10766-010-0158-x. 165
Nalini Vasudevan, Kedar S. Namjoshi, and Stephen A. Edwards. Simple and fast biased
locks. In Proceedings of the Nineteenth International Conference on Parallel Architectures
and Compilation Techniques (PACT), pages 65–74, Vienna, Austria, September 2010. DOI:
10.1145/1854273.1854287. 69
Jons-Tobias Wamhoff, Christof Fetzer, Pascal Felber, Etienne Rivière, and Gilles Muller. Fast-
Lane: Improving performance of software transactional memory for low thread counts. In
Proceedings of the Eighteenth ACM Symposium on Principles and Practice of Parallel Programming
(PPoPP), pages 113–122, Shenzhen, China, February 2013. DOI: 10.1145/2442516.2442528.
153
Amy Wang, Matthew Gaudet, Peng Wu, José Nelson Amaral, Martin Ohmacht, Christopher
Barton, Raul Silvera, and Maged Michael. Evaluation of Blue Gene/Q hardware support for
transactional memories. In Proceedings of the Twenty-first International Conference on Parallel
Architectures and Compilation Techniques (PACT), pages 127–136, Minneapolis, MN, Septem-
ber 2012. DOI: 10.1145/2370816.2370836. 26, 156
Cheng Wang, Wei-Yu Chen, Youfeng Wu, Bratin Saha, and Ali-Reza Adl-Tabatabai. Code gen-
eration and optimization for transactional memory constructs in an unmanaged language. In
Proceedings of the International Symposium on Code Generation and Optimization (CGO), pages
34–48, San Jose, CA, March 2007. DOI: 10.1109/CGO.2007.4. 171
William E. Weihl. Local atomicity properties: Modular concurrency control for abstract data
types. ACM Transactions on Programming Languages and Systems, 11(2):249–282, February
1989. DOI: 10.1145/63264.63518. 33
Adam Welc, Suresh Jagannathan, and Antony L. Hosking. Safe futures for Java. In Proceedings
of the Twentieth Annual ACM SIGPLAN Conference on Object-oriented Programming Systems,
Languages, and Applications (OOPSLA), pages 439–453, San Diego, CA, October 2005. DOI:
10.1145/1103845.1094845. 116, 169
Horst Wettstein. The problem of nested monitor calls revisited. ACM Operating Systems Review,
12(1):19–23, January 1978. DOI: 10.1145/850644.850645. 112
Niklaus Wirth. Modula: A language for modular multiprogramming. Software—Practice and
Experience, 7(1):3–35, January–February 1977. DOI: 10.1002/spe.4380070102. 109
Philip J. Woest and James R. Goodman. An analysis of synchronization mechanisms in shared
memory multiprocessors. In Proceedings of the International Symposium on Shared Memory Mul-
tiprocessing (ISSMM), pages 152–165, Tokyo, Japan, April 1991. 25
Kenneth C. Yeager. The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2):28–40,
April 1996. DOI: 10.1109/40.491460. 19
Luke Yen, Jayaram Bobba, Michael R. Marty, Kevin E. Moore, Haris Volos, Mark D. Hill,
Michael M. Swift, and David A. Wood. LogTM-SE: Decoupling hardware transactional
memory from caches. In Proceedings of the Thirteenth International Symposium on High Per-
formance Computer Architecture (HPCA), pages 261–272, Phoenix, AZ, February 2007. DOI:
10.1109/HPCA.2007.346204. 157, 159
Pen-Chung Yew, Nian-Feng Tzeng, and Duncan H. Lawrie. Distributing hot-spot addressing
in large-scale multiprocessors. IEEE Transactions on Computers, 36(4):388–395, April 1987.
DOI: 10.1109/TC.1987.1676921. 74, 75
Pin Zhou, Feng Qin, Wei Liu, Yuanyuan Zhou, and Josep Torrellas. iWatcher: Efficient archi-
tectural support for software debugging. In Proceedings of the Thirty-first International Sympo-
sium on Computer Architecture (ISCA), pages 224–237, München, Germany, June 2004. DOI:
10.1145/1028176.1006720. 164
Ferad Zyulkyarov. Programming, Debugging, Profiling and Optimizing Transactional
Memory Programs. PhD thesis, Department of Computer Architecture, Polytech-
nic University of Catalunya (UPC), June 2011. http://www.feradz.com/
ferad-phdthesis-20110525.pdf 171
Ferad Zyulkyarov, Tim Harris, Osman S. Unsal, Adrián Cristal, and Mateo Valero. Debugging
programs that use atomic blocks and transactional memory. In Proceedings of the Fifteenth ACM
Symposium on Principles and Practice of Parallel Programming (PPoPP), pages 57–66, Bangalore,
India, January 2010. DOI: 10.1145/1693453.1693463. 171
Author’s Biography
MICHAEL L. SCOTT
Michael L. Scott is a Professor and past Chair of the Department of Computer Science
at the University of Rochester. He received his Ph.D. from the University of Wisconsin–
Madison in 1985. His research interests span operating systems, languages, architecture, and
tools, with a particular emphasis on parallel and distributed systems. He is best known for
work in synchronization algorithms and concurrent data structures, in recognition of which
he shared the 2006 SIGACT/SIGOPS Edsger W. Dijkstra Prize. Other widely cited work
has addressed parallel operating systems and file systems, software distributed shared mem-
ory, and energy-conscious operating systems and microarchitecture. His textbook on program-
ming language design and implementation (Programming Language Pragmatics, third edition,
Morgan Kaufmann, Feb. 2009) is a standard in the field. In 2003 he served as General Chair
for SOSP; more recently he has been Program Chair for TRANSACT ’07, PPoPP ’08, and
ASPLOS ’12. He was named a Fellow of the ACM in 2006 and of the IEEE in 2010. In 2001
he received the University of Rochester’s Robert and Pamela Goergen Award for Distinguished
Achievement and Artistry in Undergraduate Teaching.