Concurrent Programming Without Locks
KEIR FRASER
University of Cambridge Computer Laboratory
and
TIM HARRIS
Microsoft Research Cambridge
Mutual exclusion locks remain the de facto mechanism for concurrency control on shared-memory
data structures. However, their apparent simplicity is deceptive: It is hard to design scalable locking
strategies because locks can harbor problems such as priority inversion, deadlock, and convoying.
Furthermore, scalable lock-based systems are not readily composable when building compound
operations. In looking for solutions to these problems, interest has developed in nonblocking sys-
tems which have promised scalability and robustness by eschewing mutual exclusion while still
ensuring safety. However, existing techniques for building nonblocking systems are rarely suitable
for practical use, imposing substantial storage overheads, serializing nonconflicting operations, or
requiring instructions not readily available on today’s CPUs.
In this article we present three APIs which make it easier to develop nonblocking implemen-
tations of arbitrary data structures. The first API is a multiword compare-and-swap operation
(MCAS) which atomically updates a set of memory locations. This can be used to advance a data
structure from one consistent state to another. The second API is a word-based software transac-
tional memory (WSTM) which can allow sequential code to be reused more directly than with MCAS
and which provides better scalability when locations are being read rather than being updated.
The third API is an object-based software transactional memory (OSTM). OSTM allows a simpler
implementation than WSTM, but at the cost of reengineering the code to use OSTM objects.
We present practical implementations of all three of these APIs, built from operations available
across all of today’s major CPU families. We illustrate the use of these APIs by using them to build
highly concurrent skip lists and red-black trees. We compare the performance of the resulting
implementations against one another and against high-performance lock-based systems. These
results demonstrate that it is possible to build useful nonblocking data structures with performance
comparable to, or better than, sophisticated lock-based designs.
Categories and Subject Descriptors: D.4.1 [Operating Systems]: Process Management—Concur-
rency, mutual exclusion, synchronization
This work was supported by donations from the Scalable Synchronization Research Group at Sun
Labs Massachusetts. The evaluation was carried out using the Cambridge-Cranfield High Perfor-
mance Computing Facility.
Authors’ addresses: K. Fraser, Computer Laboratory, University of Cambridge, 15 JJ Thomson
Avenue, Cambridge, CB3 0FB, UK; T. Harris, Microsoft Research Cambridge, Roger Needham
Building, 7 JJ Thomson Ave., Cambridge CB3 0FB, UK; email: [email protected].
1. INTRODUCTION
Mutual-exclusion locks are one of the most widely used and fundamental ab-
stractions for synchronization. This popularity is largely due to their appar-
ently simple programming model and the availability of implementations which
are efficient and scalable. Unfortunately, without specialist programming care,
these benefits rarely hold for systems containing more than a handful of locks:
—For correctness, programmers must ensure that threads hold the neces-
sary locks to avoid conflicting operations being executed concurrently. To avoid
mistakes, this favors the development of simple locking strategies which pes-
simistically serialize some nonconflicting operations.
—For liveness, programmers must be careful to avoid introducing deadlock
and, consequently, they may cause software to hold locks for longer than would
otherwise be necessary. Also, without scheduler support, programmers must be
aware of priority inversion problems.
—For high performance, programmers must balance the granularity at
which locking operates against the time that the application will spend ac-
quiring and releasing locks.
This article is concerned with the design and implementation of software
which is safe for use on multithreaded multiprocessor shared-memory ma-
chines, but which does not involve the use of locking. Instead, we present three
different APIs for making atomic accesses to sets of memory locations. These
enable the direct development of concurrent data structures from sequential
ones. We believe this makes it easier to build multithreaded systems which are
correct. Furthermore, our implementations are nonblocking (meaning that even
if any set of threads is stalled, the remaining threads can still make progress)
and generally allow disjoint-access parallelism (meaning that updates made to
nonoverlapping sets of locations will be able to execute concurrently).
To introduce these APIs, we shall sketch their use in a code fragment that
inserts items into a singly-linked list which holds integers in ascending order.
In each case the list is structured with sentinel head and tail nodes whose keys
are, respectively, less than and greater than all other values. Each node’s key
remains constant after insertion. For comparison, Figure 1 shows the corre-
sponding insert operation when implemented for single-threaded use. In that
figure, as in each of our examples, the insert operation proceeds by identifying
nodes prev and curr between which the new node is to be placed.
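Figure 1 itself is not reproduced in this excerpt. For reference, the following is a minimal single-threaded sketch of the kind of insert operation it depicts, written in C; the node layout and function name are illustrative rather than taken from the paper.

#include <stdbool.h>
#include <stdlib.h>

typedef struct node { int key; struct node *next; } node_t;

/* head is the sentinel whose key is less than all others; the matching tail
   sentinel (key greater than all others) guarantees the loop terminates. */
bool list_insert(node_t *head, int key) {
    node_t *prev = head;
    node_t *curr = prev->next;
    while (curr->key < key) {            /* find prev and curr around key */
        prev = curr;
        curr = curr->next;
    }
    if (curr->key == key) return false;  /* key already present */
    node_t *n = malloc(sizeof(*n));
    n->key  = key;
    n->next = curr;
    prev->next = n;                      /* single-threaded: a plain store */
    return true;
}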
Our three alternative APIs all follow a common optimistic style [Kung
and Robinson 1981] in which the core sequential code is wrapped in a loop
Fig. 2. Insertion into a sorted list managed using MCAS. In this case the arrays specifying the
update need contain only a single element.
Fig. 3. Insertion into a sorted list managed using WSTM. The structure mirrors Figure 2 except
that the WSTM implementation tracks which locations have been accessed based on the calls to
WSTMRead and WSTMWrite.
Fig. 4. Insertion into a sorted list managed using OSTM. The code is more verbose than Figure 3
because data is accessed by indirection through OSTM handles which must be opened before use.
sequential code, offer performance which competes with and often surpasses
state-of-the-art lock-based designs.
1.1 Goals
We set ourselves a number of goals in order to ensure that our designs are
practical and perform well when compared with lock-based schemes:
—Concreteness. We must consider the full implementation path down to the
instructions available on commodity CPUs. This means we build from atomic
single-word read, write, and compare-and-swap (CAS) operations. We define
CAS to return the value it reads from the memory location.
atomically word CAS (word *a, word e, word n) {
word x := *a;
if (x = e) *a := n;
return x;
}
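On current processors this maps onto a single hardware compare-and-swap instruction. As a hedged illustration (not the paper's code), the same value-returning semantics can be expressed with C11 atomics:

#include <stdatomic.h>
#include <stdint.h>

typedef uintptr_t word;

/* Value-returning CAS with the semantics defined above: write n into *a
   only if *a currently holds e, and in either case return the value read. */
static inline word cas(_Atomic word *a, word e, word n) {
    word observed = e;
    atomic_compare_exchange_strong(a, &observed, n);
    return observed;  /* equals e on success, the conflicting value otherwise */
}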
—Linearizability. In order for functions such as MCAS to behave as expected
in a concurrent environment, we require that their implementations be lin-
earizable, meaning that they appear to occur atomically at some point between
when they are called and when they return [Herlihy and Wing 1990].
—Nonblocking progress guarantee. In order to provide robustness against
many liveness problems such as deadlock, implementations of our APIs should
be nonblocking. This means that even if any set of threads is stalled, the re-
maining threads can still progress.
—Disjoint-access parallelism. Implementations of our APIs should not intro-
duce contention in the sets of memory locations they access: Operations which
access disjoint parts of a data structure should be able to execute in parallel
[Israeli and Rappoport 1994].
Table I. Assessment of Our Implementations of These Three APIs Against Our Goals

Disjoint-access parallelism:
  MCAS: when accessing disjoint sets of words
  WSTM: when accessing words that map to disjoint sets of ownership records under a hash function used in the implementation
  OSTM: when accessing disjoint sets of objects

Space overhead (when no operations are in progress):
  MCAS: 2 bits reserved in each word
  WSTM: a fixed-size table (e.g., 65,536 double-word entries)
  OSTM: one word in each object handle
2. PROGRAMMING APIS
In this section we present the programming interfaces for using MCAS
(Section 2.1), WSTM (Section 2.2), and OSTM (Section 2.3). These provide
mechanisms for accessing and/or modifying multiple unrelated words in a sin-
gle atomic step; however, they differ in the way in which those accesses are
specified and the adaptation required to make a sequential operation safe for
multithreaded use.
After presenting the APIs themselves, Section 2.4 discusses how they may
be used in practice in shared-memory multithreaded programs.
Fig. 5. The need for read consistency: A move-to-front linked list subject to two searches for node
3. In snapshot (a), search A is preempted while passing over node 1. Meanwhile, in snapshot
(b), search B succeeds and moves node 3 to the head of the list. When A continues execution, it will
incorrectly report that 3 is not in the list.
As we will show later, the interface often results in reduced performance com-
pared with MCAS.
copy of the underlying object, that is, a private copy on which the thread can
work before attempting to commit its updates.
Both OSTMOpenForReading and OSTMOpenForWriting are idempotent: If
the object has already been opened in the same access mode within the same
transaction, then the same pointer will be returned.
The caller must ensure that objects are opened in the correct mode: The
OSTM implementation may share an object's data between multiple threads when
it has been opened for reading. The caller must also be careful if a trans-
action opens an object for reading (obtaining a reference to a shared copy C1)
and subsequently opens the same object for writing (obtaining a reference to
a shadow copy C2): Updates made to C2 will, of course, not be visible when
reading from C1.
The OSTM interface leads to a different cost profile from WSTM: OSTM
introduces a cost on opening objects for access and potentially producing shadow
copies to work on, but subsequent data access is made directly (rather than
through functions like WSTMRead and WSTMWrite). Furthermore, it admits a
simplified nonblocking commit operation.
The OSTM API is:
// Transaction management
ostm_transaction *OSTMStartTransaction();
bool OSTMCommitTransaction(ostm_transaction *tx);
bool OSTMValidateTransaction(ostm_transaction *tx);
void OSTMAbortTransaction(ostm_transaction *tx);
// Data access
t *OSTMOpenForReading(ostm_transaction *tx, ostm_handle<t*> *o);
t *OSTMOpenForWriting(ostm_transaction *tx, ostm_handle<t*> *o);
// Storage management
ostm_handle<void*> *OSTMNew(size_t size);
void OSTMFree(ostm_handle<void*> *ptr);
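Figure 4 is not reproduced in this excerpt. To illustrate the shape of code written against this API, here is a hedged sketch of a read-only lookup on the sorted list from Section 1, following the optimistic retry-loop style described there. It is written in the paper's C-like pseudocode; the node type is an assumption for the example, and only the OSTM calls come from the listing above.

typedef struct node {
    int key;
    ostm_handle<struct node*> *next;   /* handle to the successor node */
} node;

bool list_lookup(ostm_handle<node*> *head, int key) {
    bool found, done;
    do {
        ostm_transaction *tx = OSTMStartTransaction();
        node *curr = OSTMOpenForReading(tx, head);
        while (curr->key < key)                    /* tail sentinel bounds the walk */
            curr = OSTMOpenForReading(tx, curr->next);
        found = (curr->key == key);
        done = OSTMCommitTransaction(tx);          /* retry the whole operation on conflict */
    } while (!done);
    return found;
}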
those languages which support it) is to use runtime code generation to add the
level of indirection that programming using OSTM objects requires [Herlihy
2005]. Aside from CCRs, hybrid designs are possible which combine lock-based
abstractions with optimistic execution over transactional memory [Welc et al.
2004].
A further direction, again beyond the scope of the current article, is using a
language’s type system to ensure that the only operations attempted within a
transaction are made through an STM interface [Harris et al. 2005]. This can
avoid programming errors in which irrevocable operations (such as network
I/O) are performed in a transaction that subsequently aborts.
a mutually consistent snapshot of part of the heap. This is effectively the ap-
proach taken by Herlihy et al.’s [2003b] DSTM, and leads to the need either
to make reads visible to other threads (making read parallelism difficult in a
streamlined implementation) or to explicitly revalidate invisible reads (leading
to O(n²) behavior when a transaction opens n objects in turn). Either of these
approaches could be integrated with WSTM or OSTM if the API is to be ex-
posed directly to programmers whilst shielding them from the need to consider
invalidity during execution.
In the second scenario, where calls on the API are generated automatically
by a compiler or language runtime system, we believe it is inappropriate for the
application programmer to have to consider mutually inconsistent sets of values
within a transaction. For instance, when considering the operational semantics
of atomic blocks built over STM in Haskell [Harris et al. 2005], definitions where
transactions run in isolation appear to be a clean fit with the existing language,
while it is unclear how to define the semantics of atomic blocks that may expose
inconsistency to the programmer. Researchers have explored a number of ways
to shield the programmer from such inconsistency [Harris and Fraser 2003;
Harris et al. 2005; Riegel et al. 2006; Dice et al. 2006]. These can broadly be
classified as approaches based on hardening the runtime system against behav-
ior due to inconsistent reads, and approaches based on preventing inconsistent
reads from occurring. The selection between these goes beyond the scope of the
current article.
2.4.3 Optimizations and Hints. The final aspect we consider is the avail-
ability of tuning facilities for a programmer to improve the performance of an
algorithm using our APIs.
The key problem is false contention, where operations built using the APIs are
deemed to conflict even though logically they commute. For instance, if a set of
integers is held in numerical order in a linked list, then a thread transactionally
inserting 15 between 10 and 20 will perform updates that conflict with reads
from a thread searching through that point for a higher value.
It is not clear that this particular example can be improved automati-
cally when a tool is generating calls on our APIs; realizing that the opera-
tions do not logically conflict relies on knowledge of their semantics and the
set’s representation. Notwithstanding this, the ideas of disjoint-access par-
allelism and read-parallelism allow the programmer to reason about which
operations will be able to run concurrently without interfering with each
other.
However, our APIs can be extended with operations for use by expert pro-
grammers. As with Herlihy et al.’s DSTM [2003b], our OSTM supports an ad-
ditional early release operation that discards an object from the sets of accesses
that the implementation uses for conflict detection. For instance, in our list ex-
ample, a thread searching the list could release the list's nodes as it traverses
them, eventually trying to commit a minimal transaction containing only the
node it seeks (if it exists) and its immediate predecessor in the list. Similarly,
as we discuss in Section 6.5, WSTM supports discard operations to remove
addresses from a transaction.
These operations all require great care: Once released or discarded, data
plays no part in the transaction’s commit or abort. A general technique for
using them correctly is for the programmer to ensure that: (i) As a transaction
runs, it always holds enough data for invalidity to be detected; and (ii) when
a transaction commits, the operation it is performing is correct, given only the
data that is still held. For instance, in the case of searching a sorted linked
list, it would need to hold a pair of adjacent nodes to act as a “witness” of the
operation’s result. However, such extreme use of optimization APIs loses many
of the benefits of performing atomic multiword updates (the linked-list example
becomes comparably complex to a list built directly from CAS [Harris 2001] or
sophisticated locking [Heller et al. 2005]).
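For concreteness, the following hedged sketch shows how such a search might combine early release with the witness pair just described. The operation name OSTMEarlyRelease is hypothetical (the paper does not give the exact signature here), and the node type is the handle-linked one assumed in the earlier lookup sketch.

bool list_lookup_release(ostm_handle<node*> *head, int key) {
    bool found, done;
    do {
        ostm_transaction *tx = OSTMStartTransaction();
        ostm_handle<node*> *prev_h = head, *curr_h;
        node *n = OSTMOpenForReading(tx, prev_h);
        curr_h = n->next;
        node *curr = OSTMOpenForReading(tx, curr_h);
        while (curr->key < key) {
            OSTMEarlyRelease(tx, prev_h);   /* hypothetical: drop nodes already passed over */
            prev_h = curr_h;
            curr_h = curr->next;
            curr = OSTMOpenForReading(tx, curr_h);
        }
        /* Only the witness pair (prev_h, curr_h) remains in the transaction. */
        found = (curr->key == key);
        done = OSTMCommitTransaction(tx);
    } while (!done);
    return found;
}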
3. RELATED WORK
The literature contains several designs for abstractions, such as MCAS, WSTM,
and OSTM. However, many of the foundational designs have not shared our re-
cent goals of practicality; for instance, much work builds on instructions such as
strong-LL/SC or DCAS [Motorola 1985] which are not available as primitives in
contemporary hardware. Our experience is that although this work has identi-
fied the problems which exist and has introduced terminology and conventions
for presenting and reasoning about algorithms, it has not been possible to ef-
fectively implement or use these algorithms by layering them above software
implementations of strong-LL/SC or DCAS. For instance, when considering
strong-LL/SC, Jayanti and Petrovic’s recent design reserves four words of stor-
age per thread for each word that may be accessed [Jayanti and Petrovic 2003].
Other designs reserve N or log N bits of storage within each word when used
with N threads: Such designs can only be used when N is small. When consid-
ering DCAS, it appears no easier to build a general-purpose DCAS operation
than it is to implement our MCAS design.
In discussing related work, the section is split into three parts. Firstly,
in Section 3.1 we introduce the terminology of nonblocking systems and de-
scribe the progress guarantees that they make. These properties underpin the
liveness guarantees that are provided to users of our algorithms. Secondly, in
Section 3.2 we discuss the design of “universal” transformations that build
nonblocking systems from sequential or lock-based code. Finally, in Section 3.3,
we present previous designs for multiword abstractions, such as MCAS, WSTM,
and OSTM, and assess them against our goals.
A thread then makes updates to this in private before attempting to make them
visible by atomically updating a single “root” pointer of the structure [Herlihy
1993]. This means that concurrent updates will always conflict, even when they
modify disjoint sections of the data structure.
Turek et al. devised a hybrid scheme that may be applied to develop lock-free
systems from deadlock-free lock-based ones [Turek et al. 1992]. Each lock in
the original algorithm is replaced by an ownership reference which is either
NULL or points to a continuation describing the sequence of virtual instruc-
tions that remain to be executed by the lock “owner”. This allows conflicting
operations to avoid blocking: instead, they execute instructions on behalf of the
owner and then take ownership themselves. Interpreting a continuation is cum-
bersome: After each “instruction” is executed, a virtual program counter and
a nonwrapping version counter are atomically modified using a double-width
CAS operation which acts on an adjacent pair of memory locations.
Barnes proposes a similar technique in which mutual-exclusion locks are
replaced by pointers to operation descriptors [Barnes 1993]. Lock-based algo-
rithms are converted to operate on shadow copies of the data structure; then,
after determining the sequence of updates to apply, each “lock” is acquired in
turn by making it point to the descriptor, the updates are performed on the
structure itself, and finally the “locks” are released. Copying is avoided if con-
tention is low by observing that the shadow copy of the data structure may be
cached and reused across a sequence of operations. This two-phase algorithm
requires strong-LL/SC operations.
contains a word-size entry for each block in the memory, consisting of a block
identifier and a version number. The initial embodiment of this scheme required
arbitrary-sized memory words and suffered the same drawbacks as the condi-
tionally wait-free MCAS on which it builds: Bookkeeping space is statically
allocated for a fixed-size heap, and the read operation is potentially expensive.
Moir’s wait-free STM extends his lock-free design with a higher-level helping
mechanism.
Herlihy et al. have designed and implemented an obstruction-free STM con-
currently with our work [Herlihy et al. 2003b]. It shares many of our goals.
Firstly, the memory is dynamically sized: Memory blocks can be created and
destroyed on-the-fly. Secondly, a practical implementation is provided which is
built using CAS. Finally, the design is disjoint-access parallel and, in one im-
plementation, transactional reads do not cause contended updates to occur in
the underlying memory system. These features serve to significantly decrease
contention in many multiprocessor applications, and are all shared with our
lock-free OSTM. We include Herlihy et al.’s design in our performance evalua-
tion in Section 8.
Recently, researchers have returned to the question of building various
forms of hardware transactional memory (HTM) [Rajwar and Goodman 2002;
Hammond et al. 2004; Ananian et al. 2005; Moore et al. 2005; McDonald et al.
2005; Rajwar et al. 2005]. While production implementations of these schemes
are not available, and so it is hard to compare their performance with software
systems, in many ways they can be seen as complementary to the development
of STM. Firstly, if HTM becomes widely deployed, then effective STM imple-
mentations are necessary for machines without the new hardware features.
Secondly, HTM designs either place limits on the size of transactions or fix pol-
icy decisions into hardware; STM provides flexibility for workloads that exceed
those limits or benefit from different policies.
4. DESIGN METHOD
Our implementations of the three APIs in Sections 2.1–2.3 have to solve a set
of common problems and, unsurprisingly, use a number of similar techniques.
The key problem is that of ensuring that a set of memory accesses appears
to occur atomically when implemented by a series of individual instructions
accessing one word at a time. Our fundamental approach is to deal with this
problem by decoupling the notion of a location’s physical contents in memory
from its logical contents when accessed through one of the APIs. The physical
contents can, of course, only be updated one word at a time. However, as we
shall show, we arrange that the logical contents of a set of locations can be
updated atomically.
For each of the APIs there is only one operation which updates the logical con-
tents of memory locations: MCAS, WSTMCommitTransaction, and OSTMCommit-
Transaction. We call these operations (collectively) the commit operations and
they are the main source of complexity in our designs.
For each of the APIs we present the design of our implementation in a series
of four steps:
(1) Define the format of the heap, the temporary data structures used, and how
an application goes about allocating and deallocating memory for the data
structures that will be accessed through the API.
(2) Define the notion of logical contents in terms of these structures and show
how it can be computed using a series of single-word accesses. This un-
derpins the implementation of all functions other than the commit opera-
tions. In this step we are particularly concerned with ensuring nonblocking
progress and read-parallelism so that, for instance, two threads can per-
form WSTMRead operations to the same location at the same time, without
producing conflicts in the memory hierarchy.
(3) Show how the commit operation arranges to atomically update the logical
contents of a set of locations when it executes without interference from con-
current commit operations. In this stage we are particularly concerned with
ensuring disjoint-access parallelism so that threads can commit updates to
disjoint sets of locations at the same time.
(4) Show how contention is resolved when one commit operation’s progress
is impeded by another, conflicting, commit operation. In this step we are
concerned with ensuring nonblocking progress so that the progress is not
prevented if, for example, the thread performing the existing commit oper-
ation has been preempted.
Before considering the details of the three different APIs, we discuss the
common aspects of each of these four steps in Sections 4.1–4.4, respectively.
2 This does not mean that the designs can only be used in languages that traditionally provide
garbage collection. For instance, in our evaluation in Section 8.1.3, we use reference counting [Jones
and Lins 1996] on the descriptors to allow prompt memory reuse and affinity between descriptors
and threads.
wishes to retry a commit operation, for example, if the code in Figures 2–4
loops, then each retry uses a fresh descriptor. This means that threads reading
from a descriptor and seeing that the outcome has been decided can be sure
that the status field will not subsequently change.
The combination of the first two properties is important because it allows us
to avoid many A-B-A problems in which a thread is about to perform a CAS
conditional on a location holding a value A, but then a series of operations by
other threads changes the value to B and then back to A, allowing the delayed
CAS to succeed. These two properties mean that there is effectively a one-to-
one association between descriptor references and the intent to perform a given
atomic update.
Our implementations rely on being able to distinguish pointers to descriptors
from other values. In our pseudocode in Sections 5–7 we abstract these tests
with predicates, for instance, IsMCASDesc to test if a pointer refers to an MCAS
descriptor. We discuss ways in which these predicates can be implemented in
Section 8.1.2.
Fig. 6. Timeline for the three phases used in commit operations. The grey bar indicates when
the commit operation is executed; prior to this, the thread prepares the heap accesses that it
wants to commit. In this example location a1 has been read but not updated, and location a2 has
been updated. The first phase acquires exclusive access to the locations being updated. The second
phase checks that the locations read have not been updated by concurrent threads. The third phase
releases exclusive access after making any updates. The read-check made at point 2 ensures that
a1 is not updated between 0 and 2. The acquisition of a2 ensures exclusive access between 1 and 3.
succeeded; otherwise it is set to FAILED. These updates are always made using
CAS operations. If a thread initiating a commit operation is helped by another
thread, then both threads proceed through this series of steps, with the prop-
erties described in Section 4.1 ensuring that only one of these threads sets the
status to SUCCESSFUL or FAILED.
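As a minimal sketch (not the paper's code) of the mechanism just described: the status field only ever moves away from its undecided states, and it is decided by a single CAS, so when several helpers race, exactly one of them succeeds and all later readers observe a stable outcome. The enum names follow the prose; READ_CHECK is used only by the designs that have a read-check phase, and the other descriptor fields are elided.

#include <stdatomic.h>

typedef enum { UNDECIDED, READ_CHECK, FAILED, SUCCESSFUL } status_t;

typedef struct {
    _Atomic status_t status;     /* decided at most once, via CAS            */
    /* ... per-API fields: addresses, old/new values, version numbers ...    */
} descriptor_t;

/* Called by the initiating thread or by any helper.  Whichever CAS wins
   fixes the outcome; the losers simply observe the winner's decision.       */
static void decide(descriptor_t *d, status_t expected, status_t outcome) {
    status_t e = expected;       /* UNDECIDED, or READ_CHECK where applicable */
    atomic_compare_exchange_strong(&d->status, &e, outcome);
}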
In order to show that an entire commit operation appears atomic, we identify
within its execution a linearization point at which it appears to operate atomi-
cally on the logical contents of the heap from the point of view of other threads.3
There are two cases to consider, depending on whether an uncontended commit
operation is successful:
Firstly, considering unsuccessful uncontended commit operations, the lin-
earization point is straightforward: Some step of the commit operation observes
a value that prevents the commit operation from succeeding, either a location
that does not hold the expected value (in MCAS) or a value that has been written
by a conflicting concurrent transaction (in WSTM and OSTM).
Secondly, considering successful uncontended commit operations, the lin-
earization point depends on whether the algorithm has a read-check phase.
Without a read-check phase the linearization point and decision point coincide:
The algorithm has acquired ownership of the locations involved, and has not
observed any values that prevent the commit from succeeding.
However, introducing a read-check phase makes the identification of a lin-
earization point more complex. As Figure 6 shows, in this case the linearization
point occurs at the start of the read-check phase, whereas the decision point (at
which the outcome is actually signaled to other threads) occurs at the end of
the read-check phase.
This choice of linearization point may appear perverse for two reasons:
(1) The linearization point comes before its decision point: How can an opera-
tion appear to commit its updates before its outcome is decided?
3 In the presence of helping, the linearization point is defined with reference to the thread that
successfully performs a CAS on the status field at the decision point.
The rationale for this is that holding ownership of the locations being up-
dated ensures that these remain under the control of this descriptor from
acquisition until release (1 until 3 in Figure 6). Similarly, read-checks en-
sure that any locations accessed in a read-only mode have not been updated4
between points 0 and 2. Both of these intervals include the proposed lin-
earization point, even though it precedes the decision point.
(2) If the operation occurs atomically at its linearization point, then what are
the logical contents of the locations involved before the descriptor’s status
is updated at the decision point?
Following the definition in Section 4.2, the logical contents are depen-
dent on the descriptor’s status field, thus updates are not revealed to other
threads until the decision point is reached. We reconcile this definition with
the use of a read-check phase by ensuring that concurrent readers help com-
mit operations to complete, retrying the read operation once the transaction
has reached its decision point. This means that the logical contents do not
need to be defined during the read-check phase because they are never
required.
4 Of course, the correctness of this argument does not allow the read-checks to simply consider the
values in the locations because that would allow A-B-A problems to emerge if the locations are
updated multiple times between 0 and 2. Our WSTM and OSTM designs which use read-check
phases must check versioning information, rather than just values.
Fig. 7. An example of a dependent cycle of two operations A and B. Each needs the other to exit
its read-check phase before it can complete its own.
The third and final case, and the most complicated, is when t1’s status is not
decided and the algorithm does include a READ-CHECK phase.
The complexity stems from the fact that, as we described in Section 4.3, a
thread must acquire access to the locations that it is updating before it enters
its READ-CHECK phase. This constraint on the order in which locations are
accessed makes it impossible to eliminate cyclic helping by sorting accesses
into a canonical order. Figure 7 shows an example: A has acquired data x for
update and B has acquired data y for update, but A must wait for B before
validating its read from y, while B in turn must wait for A before validating its
read from x.
The solution is to abort at least one of the operations to break the cycle;
however, care must be taken not to abort them all if we wish to ensure lock-
freedom rather than obstruction-freedom. For instance, with OSTM, this can be
done by imposing a total order ≺ on all operations, based on the machine address
of each transaction’s descriptor. The loop is broken by allowing a transaction
tx1 to abort a transaction tx2 if and only if: (i) both are in their read phase;
(ii) tx2 owns a location that tx1 is attempting to read; and (iii) tx1 ≺ tx2. This
guarantees that every cycle will be broken, but the “least” transaction in the
cycle will continue to execute. Of course, other orderings can be used if fairness
is a concern.
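A hedged sketch of this rule follows, taking the total order to be comparison of descriptor addresses as described; the field and function names are illustrative. Condition (ii) is established by the caller, which applies the test only when it has found tx2 owning an object that tx1 is trying to read-check.

#include <stdbool.h>
#include <stdint.h>

/* Returns true if tx1 may abort tx2 to break a potential cycle:
   (i) both are in their read-check phase, and (iii) tx1 precedes tx2 in the
   total order given by descriptor addresses.  The "least" transaction in any
   cycle is therefore never aborted, preserving lock-freedom.                */
static bool may_abort(ostm_transaction *tx1, ostm_transaction *tx2) {
    return tx1->status == READ_CHECK
        && tx2->status == READ_CHECK
        && (uintptr_t)tx1 < (uintptr_t)tx2;
}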
Fig. 8. MCASRead operation used to read from locations which may be subject to concurrent
MCAS operations.
by the application; this is why the MCAS implementation needs two reserved
bits in the locations that it may update. However, for simplicity, in the pseu-
docode versions of our algorithms we use predicates to abstract these bitwise
operations. Many alternative implementation techniques are available: For in-
stance, some languages provide runtime type information, and in other cases
descriptors of a given kind can be placed in given regions of the process’ virtual
address space.
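One common way to implement these predicates, sketched here as an assumption rather than the paper's exact encoding, is to exploit the alignment of heap-allocated descriptors and tag references with reserved low-order bits (the two reserved bits mentioned above leave room to distinguish, say, MCAS and CCAS descriptors):

#include <stdbool.h>
#include <stdint.h>

typedef uintptr_t word;

#define MCAS_DESC_TAG ((word)1)   /* reserved low-order bit: MCAS descriptor */

/* Descriptors are word-aligned, so their low-order bits are normally zero.  */
static inline word make_mcas_ref(void *descriptor) {
    return (word)descriptor | MCAS_DESC_TAG;
}
static inline bool IsMCASDesc(word w) { return (w & MCAS_DESC_TAG) != 0; }
static inline void *mcas_ptr(word w)  { return (void *)(w & ~MCAS_DESC_TAG); }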
Fig. 9. An uncontended commit swapping the contents of a1 and a2. Grey boxes show where CAS
and CCAS operations are to be performed at each step. While a location is owned, its logical contents
remain available through the MCAS descriptor.
Fig. 11. Conditional compare-and-swap (CCAS). CCASRead is used to read from locations which
may be subject to concurrent CCAS operations.
communicate the success/failure result back to the first thread. This cannot be
done by extending the descriptor with a Boolean field for the result: There may
be multiple threads concurrently helping the same descriptor, each executing
line 25 with different result values.
CCASRead proceeds by reading the contents of the supplied address. If this
is a CCAS descriptor, then the descriptor’s operation is helped to completion
and the CCASRead retried. CCASRead returns a nondescriptor value once one
is read. Notice that CCASHelp ensures that the descriptor passed to it has been
removed from the address by the time that it returns, so CCASRead can only
loop while other threads are performing new CCAS operations on the same
address.
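Figure 11 is not reproduced in this excerpt; the line number cited above refers to it. The following hedged sketch captures the behaviour just described: install a reference to the CCAS descriptor, test the condition, and then replace the descriptor with either the new or the old value. The structure and helper names follow the prose rather than the paper's code; the tagging helpers use a second reserved bit, in the style of the earlier sketch, so that CCAS descriptors can be told apart from MCAS descriptors, and cas() is the value-returning CAS from Section 1.1.

typedef struct {
    _Atomic word     *a;       /* location to be conditionally updated       */
    word              e, n;    /* expected and new values                    */
    _Atomic status_t *cond;    /* the owning MCAS descriptor's status field  */
} ccas_desc;

#define CCAS_DESC_TAG ((word)2)    /* second reserved bit: CCAS descriptor   */
static inline word ccas_ref(ccas_desc *d)  { return (word)d | CCAS_DESC_TAG; }
static inline bool IsCCASDesc(word w)      { return (w & CCAS_DESC_TAG) != 0; }
static inline ccas_desc *ccas_ptr(word w)  { return (ccas_desc *)(w & ~(word)3); }

static void ccas_help(ccas_desc *d) {
    /* Keep the new value only if the MCAS is still UNDECIDED; either way the
       descriptor reference is removed from the location before returning.   */
    word v = (atomic_load(d->cond) == UNDECIDED) ? d->n : d->e;
    cas(d->a, ccas_ref(d), v);
}

static void ccas(ccas_desc *d) {
    word x;
    /* Install the descriptor in place of the expected value, helping (and
       retrying) whenever a conflicting CCAS descriptor is encountered.      */
    while (IsCCASDesc(x = cas(d->a, d->e, ccas_ref(d))))
        ccas_help(ccas_ptr(x));
    if (x == d->e)
        ccas_help(d);   /* we installed the descriptor: complete our own op  */
}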
5.5 Discussion
There are a number of final points to consider in our design for MCAS. The
first is to observe that when committing an update to a set of N locations
and proceeding without experiencing contention, the basic operation performs
3N + 1 updates using CAS: 2N CAS operations are performed by the calls
to CCAS, N CAS operations are performed when releasing ownership, and a
further single CAS is used to update the status field. However, although this
is more than a factor of three increase over updating the locations directly,
Fig. 12. The steps involved in performing the first CCAS operation needed in Figure 9. In this
case the first location a1 is being updated from 100 to refer to MCAS descriptor tx1. The update is
conditional on the descriptor tx1 being UNDECIDED.
it is worth noting that the three batches of N updates all act on the same
locations: Unless evicted during the MCAS operation, the cache lines holding
these locations need only be fetched once.
We did develop an alternative implementation of CCAS which uses an ordi-
nary write in place of its second CAS. This involves leaving the CCAS descriptor
linked into the location being updated and recording the success or failure of
the CCAS within that descriptor. This 2N + 1 scheme is not a worthwhile im-
provement over the 3N + 1 design: It writes to more distinct cache lines and
makes it difficult to reuse CCAS descriptors in the way we describe in Sec-
tion 8.1.3. However, this direction may be useful if there are systems in which
CAS operates substantially more slowly than an ordinary write.
Moir explained how to build an obstruction-free 2N + 1 MCAS which follows
the same general structure as our lock-free 3N + 1 design [Moir 2002]. His de-
sign uses CAS in place of CCAS to acquire ownership while still preserving the
logical contents of the location being updated. The weaker progress guarantee
makes this possible by avoiding recursive helping: If t2 encounters t1 perform-
ing an MCAS then t2 causes t1’s operation to abort if it is still UNDECIDED.
This avoids the need to CCAS because only the thread initiating an MCAS can
now update its status field to SUCCESSFUL: There is no need to check it upon
each acquisition.
Finally, notice that algorithms built over MCAS will not meet the goal of read-
parallelism from Section 1.1. This is because MCAS must still perform CAS
operations on addresses for which identical old and new values are supplied:
These CAS operations force the address’s cache line to be held in exclusive mode
on the processor executing the MCAS.
Fig. 13. The heap structure used in the lock-based WSTM. The commit operation acting on de-
scriptor tx1 is midway through an update to locations a1 and a2: 200 is being written to a1 and 100
to a2. Locations a101 and a102 are examples of other locations which happen to map to the same
ownership records, but which are not part of the update.
when an update is being committed, the orec refers to the descriptor of the
transaction involved. In the figure, transaction tx1 is committing an update to
addresses a1 and a2; it has acquired orec r1 and is about to acquire r2. Within
the transaction descriptor, we indicate memory accesses using the notation
ai: (oi, voi) → (ni, vni) to indicate that address ai is being updated from value oi
at version number voi to value ni at version number vni. For a read-only access,
oi = ni and voi = vni. For an update, vni = voi + 1.
Figure 14 shows the definition of the data types involved. A wstm_transaction
comprises a status field and a list of wstm_entry structures. As indicated in the
figure, each entry provides the old and new values and old and new version
numbers for a given memory access. In addition, the obstruction-free version
of WSTM includes a prev_ownership field to coordinate helping between trans-
actions.
The lock-based WSTM uses orecs of type orec_basic: Each orec is a simple
union holding either a version field or a reference to the owning transaction.
The obstruction-free WSTM requires orecs of type orec_obsfree holding either
a version field or a pair containing an integer count alongside a reference to
the owning transaction.5 In both implementations the current usage of a union
can be determined by applying the predicate IsWSTMDesc: If it evaluates true
then the orec contains a reference to a transaction, else it contains a version
number. As before, these predicates can be implemented by a reserved bit in
the space holding the union.
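Figure 14 itself is not reproduced in this excerpt. The following hedged C sketch mirrors the data types just described; the field names follow the prose, and the placement of prev_ownership reflects our reading of it, so both may differ in detail from the paper's figure. The word type is as in the earlier CAS sketch.

typedef struct {
    word *addr;                     /* accessed heap address                   */
    word  old_val, new_val;         /* value before / after the access         */
    word  old_version, new_version; /* version numbers before / after          */
    word  prev_ownership;           /* obstruction-free WSTM only: ownership
                                       observed while preparing the descriptor */
} wstm_entry;

typedef struct {
    int         status;             /* UNDECIDED, READ_CHECK, SUCCESSFUL, FAILED */
    wstm_entry *entries;            /* the list of wstm_entry structures       */
    int         n_entries;
} wstm_transaction;

/* Lock-based WSTM: an orec is either a version number or a reference to the
   owning transaction; IsWSTMDesc distinguishes the cases via a reserved bit.  */
typedef union {
    word              version;
    wstm_transaction *owner;
} orec_basic;

/* Obstruction-free WSTM: either a version number, or a count of in-flight
   commit operations paired with the owning transaction.  Laid out here as a
   two-word structure, matching the two-word-wide orecs noted in footnote 5.   */
typedef struct {
    word count;
    union { word version; wstm_transaction *owner; } u;
} orec_obsfree;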
A descriptor is well formed if, for each orec, it either: (i) contains at most one
entry associated with that orec, or (ii) contains multiple entries associated with
that orec, but the old version number is the same in all of them, as is the new.
5 The maximum count needed is the maximum number of concurrent commit operations. In practice,
this means that orecs in the obstruction-free design are two words wide. Both IA-32 and 32-bit
SPARC provide double-word-width CAS. On 64-bit machines without double-word-width CAS, it
is possible either to reserve sufficient high-order bits in a sparse 64-bit address space or to add a
level of indirection between orecs and temporary double-word structures.
Fig. 14. Data structures and helper functions used in the WSTM implementation.
LS1 : If the orec holds a version number then the logical contents comes
directly from the application heap. For instance, in Figure 13, the logical
contents of a2 is 200.
LS2 : If the orec refers to a descriptor that contains an entry for the address,
then that entry gives the logical contents (taking the new value from
the entry if the descriptor is SUCCESSFUL, and the old value if it is
Fig. 15. WSTMRead and WSTMWrite functions built over get_entry.
be set to SUCCESSFUL (lines 33–34) and if successful, the updates are made
(lines 35–37). Finally, ownership of the orecs is released (lines 38–39). Notice
how the definition of LS2 means that setting the status to SUCCESSFUL
atomically updates the logical contents of all of the locations written by the
transaction.
The read-check phase uses read_check_orec to check that the current version
number associated with an orec matches the old version in the entries in a
transaction descriptor. As with get_entry in Figure 15, if it encounters another
transaction descriptor then it ensures that its outcome is decided (line 12) before
examining it.
Fig. 17. Basic lock-based implementation of the helper functions for WSTMCommitTransaction.
Fig. 18. An uncontended commit swapping the contents of a1 and a2, showing where updates are
to be performed at each step.
If a thread is preempted just before one of the stores in line 37, then it can be
rescheduled at any time and perform that delayed update [Harris and Fraser
2005], overwriting updates from subsequent transactions.
Aside from the lock-based scheme from the previous section and the com-
plicated obstruction-free scheme we will present here, there are two further
solutions that we note for completeness:
(1) As described in an earlier paper [Harris and Fraser 2005], operating system
support can be used either to: (i) prevent a thread from being preempted
during its update phase, or (ii) to ensure that a thread preempted while
making its updates will remain suspended while being helped and then
be resumed at a safe point, for example, line 40 in WSTMCommitTransac-
tion with a new status value SUCCESSFUL HELPED, indicating that the
transaction has been helped (with the helper having made its updates and
released its ownership).
(2) CCAS from Section 5 can be used to make the updates in line 37, conditional
on the transaction descriptor still being SUCCESSFUL rather than SUC-
CESSFUL HELPED. Of course, using CCAS would require reserved space
in each word in the application heap, negating a major benefit of WSTM
over MCAS.
Fig. 19. Obstruction-free implementation of the prepare_descriptor helper function for WSTMCom-
mitTransaction.
Fig. 20. Obstruction-free implementation of the acquire_orec and release_orec helper functions for
WSTMCommitTransaction.
check in line 24). Otherwise, if there is no such entry, a new entry is added to
tx containing the logical contents of the address (lines 26–29).
Figure 20 shows how acquire_orec and release_orec are modified to enable
stealing. A third case is added to acquire_orec: Lines 13–17 steal ownership, so
long as the contents of the orec have not changed since the victim’s entries were
merged into tx in prepare_descriptor (i.e., the orec’s current owner matches the
prev_ownership value recorded during prepare_descriptor). Note that we reuse
the prev_ownership field in acquire_orec to indicate which entries led to a suc-
cessful acquisition; this is needed in release_orec to determine which entries to
release. release_orec itself is now split into three cases. Firstly, if the count field
will remain above 0, the count is simply decremented because other threads
may still be able to perform delayed writes to locations controlled by the orec
(lines 27–28). Secondly, if the count is to return to 0 and our descriptor tx is still
the owner, we take the old or new version number, as appropriate (lines 30–31).
The third case is that the count is to return to 0, but ownership has been
stolen from our descriptor (lines 33–36). In this case we must reperform the
updates from the current owner before releasing the orec (lines 34–36). This
ensures that the current logical contents are written back to the locations,
overwriting any delayed writes from threads that released the orec earlier. Note
that we do not need to call force_decision before reading from the current owner:
The count is 1, meaning that the thread committing tx is the only one working
on that orec and so the descriptor referred by the orec must have already been
decided (otherwise there would be at least two threads working on the orec).
6.5 Discussion
The obstruction-free design from Figures 19 and 20 is clearly extremely com-
plicated. Aside from its complexity, the design has an undesirable property
under high contention: If a thread is preempted between calling acquire_orec
and release_orec, then the logical contents of locations associated with that
orec cannot revert to being held in the application heap until the thread is
rescheduled.
Although we do not present them here in detail, there are a number of exten-
sions to the WSTM interface which add to the range of settings in which it can
be used. In particular, we make note of a WSTMDiscardUpdate operation which
takes an address and acts as a hint that the WSTM implementation is permitted
(but not required) to discard any updates made to that address in the current
transaction. This can simplify the implementation of some data structures in
which shared “write-only” locations exist. For instance, in the red-black trees
we use in Section 8, the implementation of rotations within the trees is simpli-
fied if references to dummy nodes are used in place of NULL pointers. If a single
dummy node is used then updates to its parent pointer produce contention be-
tween logically nonconflicting transactions: In this case we can either use sep-
arate nodes or use WSTMDiscardUpdate on the dummy node’s parent pointer.
This means that OSTM offers a very different programming model than
WSTM because OSTM requires data structures to be reorganized to use han-
dles. Introducing handles allows the STM to more directly associate its coor-
dination information with parts of the data structure, rather than using hash
functions within the STM to maintain this association. This may be an impor-
tant consideration in some settings: There is no risk of false contention due to
hash collisions. Also note that while WSTM supports obstruction-free transac-
tions, OSTM guarantees lock-free progress.
7.4 Pseudocode
Figures 22 and 23 present pseudocode for the OSTMOpenForReading, OST-
MOpenForWriting, and OSTMCommitTransaction operations. Both OSTMOpen
operations use obj_read to find the most recent data block for a given OSTM
handle; we therefore describe this helper function first. Its structure follows
the definitions from Section 7.2. In most circumstances, the logical contents are
defined by LS1: The latest data-block reference can be returned directly from
the OSTM handle (lines 6 and 17). However, if the object is currently owned by
a committing transaction then the correct reference is found by searching the
owner’s sorted write list (line 9) and selecting the old or new reference based
on the owner’s current status (line 15). As usual, LS2 is defined only for UNDE-
CIDED, FAILED, and SUCCESSFUL descriptors and so if the owner is in its read
phase, then the owner must be helped to completion or aborted, depending on
the status of the transaction that invoked obj_read and its ordering relative
to the owner (lines 10–14).
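Figure 22 itself is not reproduced in this excerpt, and the line numbers cited in the surrounding prose refer to that figure. As a hedged sketch of the logic just described, written in the paper's C-like pseudocode style, obj_read has roughly the shape below. The handle layout, the ostm_ref/ostm_ptr tagging helpers, the write_list_search and resolve_contention helpers, and the obj_entry fields are all assumptions for the illustration.

/* A handle's ref field normally points at the object's current data block;
   while a commit is in progress it holds a tagged reference to the owning
   transaction descriptor instead (IsOSTMDesc tests for this, in the style
   of the earlier tagging sketch).                                           */
void *obj_read(ostm_transaction *tx, ostm_handle<void*> *h) {
    word r = h->ref;
    if (!IsOSTMDesc(r))
        return (void *)r;                        /* LS1: unowned object      */

    ostm_transaction *owner = ostm_ptr(r);       /* LS2: object is owned     */
    if (owner->status == READ_CHECK)
        resolve_contention(tx, owner);           /* help it to completion or
                                                    abort it (Section 4.4)   */

    obj_entry *e = write_list_search(owner, h);  /* owner's sorted write list */
    return (owner->status == SUCCESSFUL) ? e->new_block
                                         : e->old_block;
}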
OSTMOpenForReading proceeds by checking whether the object is already
open by that transaction (lines 20–23). If not, a new list entry is created and
inserted into the read-only list (lines 24–28).
OSTMOpenForWriting proceeds by checking whether the object is already
open by that transaction; if so, the existing shadow copy is returned (lines 32–
33). If the object is present on the read-only list then the matching entry is
removed (line 35). If the object is present on neither list then a new entry is
allocated and initialized (lines 37–38). A shadow copy of the data block is made
(line 40) and the list entry inserted into the sorted write list (line 41).
OSTMCommitTransaction itself is divided into three phases. The first phase
attempts to acquire each object in the sorted write list (lines 4–9). If a more re-
cent data-block reference is found then the transaction is failed (line 7). If the
object is owned by another transaction then the obstruction is helped to comple-
tion (line 8). The second phase checks that each object in the read-only list has
not been updated subsequent to being opened (lines 11–12). If all objects were
successfully acquired or checked then the transaction will attempt to commit
successfully (lines 15–16). Finally, each acquired object is released (lines 17–18)
Fig. 22. OSTM’s OSTMOpenForReading and OSTMOpenForWriting interface calls. Algorithms for
read and sorted write lists are not given here. Instead, search, insert, and remove operations are
assumed to exist, for example, acting on linked lists of obj_entry structures.
and its data-block reference is returned to its previous value if the transaction
failed; otherwise it is updated to its new value.
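Figure 23 is likewise not reproduced here; the line numbers above refer to it. The following hedged sketch shows the three phases in the paper's C-like pseudocode style, reusing the obj_read sketch above and the value-returning CAS from Section 1.1. The descriptor fields (status, write_list, read_list), the ostm_ref/ostm_ptr tagging helpers, and help_commit are assumptions, and the treatment of helping is deliberately simplified.

bool OSTMCommitTransaction(ostm_transaction *tx) {
    status_t outcome = SUCCESSFUL;

    /* Phase 1: acquire each object in the sorted write list.                */
    for (obj_entry *e = tx->write_list; e != NULL && outcome == SUCCESSFUL; e = e->next) {
        for (;;) {
            word r = CAS(&e->handle->ref, (word)e->old_block, ostm_ref(tx));
            if (r == (word)e->old_block) break;               /* acquired           */
            if (!IsOSTMDesc(r)) { outcome = FAILED; break; }  /* newer block found  */
            help_commit(ostm_ptr(r));                         /* help owner, retry  */
        }
    }

    /* Phase 2: check that read-only objects have not been updated.          */
    CAS(&tx->status, UNDECIDED, READ_CHECK);
    for (obj_entry *e = tx->read_list; e != NULL && outcome == SUCCESSFUL; e = e->next)
        if (obj_read(tx, e->handle) != e->old_block)
            outcome = FAILED;

    /* Decision point: one CAS fixes the outcome for initiator and helpers.  */
    CAS(&tx->status, READ_CHECK, outcome);
    bool ok = (tx->status == SUCCESSFUL);

    /* Phase 3: release acquired objects, installing the new data block on
       success and restoring the old one on failure.                         */
    for (obj_entry *e = tx->write_list; e != NULL; e = e->next)
        CAS(&e->handle->ref, ostm_ref(tx), (word)(ok ? e->new_block : e->old_block));
    return ok;
}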
7.5 Discussion
Our lock-free OSTM was developed concurrently with an obstruction-free de-
sign by Herlihy et al. [2003b]. We include both in our experimental evaluation.
The two designs are similar in their use of handles as a point of indirection
and the use of transaction descriptors to publish the updates that a transaction
proposes to make.
The key difference lies in how transactions proceed before they attempt to
commit. In our scheme, transactions operate entirely in private and so descrip-
tors are only revealed when a transaction is ready to commit. In Herlihy et al.’s
DSTM design, their equivalent to our OSTMOpen operation causes the trans-
action to acquire the object in question. This allows a wider range of contention
management strategies because contention is detected earlier than with our
scheme. Conversely, it appears difficult to make DSTM transactions lock-free
using the same technique as our design: In our scheme, threads can help one an-
other’s commit operations, but in their scheme it appears it would be necessary
for threads to help one another’s entire transactions if one thread encounters
an object opened by another.
It is interesting to note that unlike MCAS, there is no clear way to simplify
our OSTM implementation by moving from lock freedom to obstruction freedom.
This is because the data pointers in OSTM handles serve to uniquely identify a
given object-state, and so lock-freedom can be obtained without needing CCAS
to avoid A-B-A problems when acquiring ownership.
8. EVALUATION
There is a considerable gap between the pseudocode designs presented for
MCAS, WSTM, and OSTM and a useful implementation of those algorithms
on which to base our evaluation. In this section we highlight a number of these
areas elided in the pseudocode and then assess the practical performance of
our implementations by using them to build concurrent skip lists and red-black
trees.
of Kung and Lehman [1980], in which tentative deallocations are queued until
all threads pass through a quiescent state, after which it is known that they
hold no private references to defunct objects.6
This leaves the former problem of managing descriptors: So far, we have
assumed that they are reclaimed by garbage collection and we have benefited
from this assumption by being able to avoid the A-B-A problems that would
otherwise be caused by reuse. We use Michael and Scott’s reference-counting
garbage collection method [1995], placing a count in each MCAS, WSTM, and
OSTM descriptor and updating this to count the number of threads which may
have active references to the descriptor.
We manage CCAS descriptors by embedding a fixed number of them within
each MCAS descriptor. This avoids the overheads of memory management head-
ers and reference counts that would otherwise be associated with the small
CCAS descriptors. Since the logical contents of a CCAS descriptor is computable
from the contents of the containing MCAS descriptor, our CCAS descriptors con-
tain nothing more than a back pointer to the MCAS descriptor. Apart from sav-
ing memory, the fact that a CCAS descriptor contains no data specific to a par-
ticular CCAS suboperation allows it to be safely reused by the thread to which it
was originally allocated. Since we know the number of threads participating in
our experiments, embedding a CCAS descriptor per thread within each MCAS
descriptor is sufficient to avoid any need to fall back to dynamic allocation. In
situations where the number of threads is unbounded, dynamically-allocated
descriptors can be managed by the same reference-counting mechanism as
MCAS descriptors. However, unless contention is very high, it is unlikely that
recursive helping will occur often, and so the average number of threads par-
ticipating in a single MCAS operation will not exhaust the embedded supply of
descriptors.
A similar storage method is used for the per-transaction object lists main-
tained by OSTM. Each transaction descriptor contains a pool of embedded en-
tries that are sequentially allocated as required. If a transaction opens a very
large number of objects, then further descriptors are allocated and chained
together to extend the list.
6 As we explain in Section 8.2, this article considers workloads with at least one CPU available for
each thread. Schemes like SMR or pass-the-buck would be necessary for prompt memory reuse in
workloads with more threads than processors.
The set data type that we implement for each benchmark supports lookup,
add, and remove operations, all of which act on a set and a key. lookup returns
a Boolean indicating whether the key is a member of the set; add and remove
update membership of the set in the obvious manner.
Each experiment is specified by three adjustable parameters:
S — the search structure that is being tested;
P — the number of parallel threads accessing the set; and
K — the initial (and mean) number of key values in the set.
The benchmark program begins by creating P threads and an initial set,
implemented by S, containing the keys 0, 2, 4, ..., 2(K − 1). All threads then
enter a tight loop which they execute for five wallclock seconds. On each it-
eration they randomly select whether to execute a lookup (p = 75%), add
(p = 12.5%), or remove (p = 12.5%) on a random key chosen uniformly from
the range 0 ... 2(K − 1). This distribution of operations is chosen because reads
dominate writes in many observed real workloads; it is also very similar to
the distributions used in previous evaluations of parallel algorithms [Mellor-
Crummey and Scott 1991b; Shalev and Shavit 2003]. Furthermore, by setting
equal probabilities for add and remove in a key space of size 2K , we ensure
that K is the mean number of keys in the set throughout the experiment.
When five seconds have elapsed, each thread records its total number of com-
pleted operations. These totals are summed to get a system-wide total. The
result of the experiment is the system-wide amount of CPU time used dur-
ing the experiment, divided by the system-wide count of completed operations.
When plotting this against P , a scalable design is indicated by a line parallel
with the x-axis (showing that adding extra threads does not make each oper-
ation require more CPU time), whereas faster designs are indicated by lines
closer to the x-axis (showing that less CPU time is required for each completed
operation).
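The per-thread benchmark loop can be summarized by the following sketch, which builds on the set interface sketched above; the timer, the random-number handling, and the helper names are illustrative assumptions rather than the code used to produce our results.

#include <stdlib.h>

#define DURATION_SECS 5.0

extern double now(void);                    /* assumed wallclock timer        */

void benchmark_thread(set_t *s, int K, unsigned seed, long *completed)
{
    long   ops   = 0;
    double start = now();

    while (now() - start < DURATION_SECS) {
        int key = rand_r(&seed) % (2 * K);  /* uniform over the 2K-key space  */
        int r   = rand_r(&seed) % 8;
        if (r < 6)                          /* p = 6/8 = 75%                  */
            set_lookup(s, key);
        else if (r == 6)                    /* p = 1/8 = 12.5%                */
            set_add(s, key);
        else                                /* p = 1/8 = 12.5%                */
            set_remove(s, key);
        ops++;
    }
    *completed = ops;                       /* summed across threads; the
                                               plotted result is total CPU time
                                               divided by this system-wide
                                               total                          */
}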
A timed duration of five seconds is sufficient to amortize the overheads asso-
ciated with warming each CPU’s data caches, as well as starting and stopping
the benchmark loop. We confirmed that doubling the execution time to ten sec-
onds does not measurably affect the final result. We plot results showing the
median of five benchmark runs with error bars indicating the minimum and
maximum results achieved.
For brevity, our experiments only consider nonmultiprogrammed workloads
where there are sufficient CPUs for running the threads in parallel. The main
reason for this setting is that it enables a fairer comparison between our non-
blocking designs and those based on locks. If we did use more threads than
available CPUs, then when testing lock-based designs, a thread could be pre-
empted at a point where it holds locks, potentially obstructing the progress of
the thread scheduled in its place. Conversely, when testing nonblocking designs,
even the weakest guarantee, obstruction freedom, prevents a preempted thread
from obstructing others for long: In all our designs, the time taken to remove or
help an obstructing thread is vastly less than a typical scheduling quantum. Comparisons of mul-
tiprogrammed workloads would consequently be highly dependent on how well
the lock implementation is integrated with the scheduler.
update is preceded by checking part of the tree in private to identify the sets of
locks needed, retrying this stage if inconsistency is observed.
(11) and (12) Red-black trees built using WSTM. As with skip lists, red-black
trees can be built straightforwardly from single-threaded code using WSTM.
However, there is one caveat. In order to reduce the number of cases to consider
during rotations, and in common with standard designs, we use a black-sentinel
node in place of NULL child pointers in the leaves of the tree. We use write
discard so that the needless updates made to the sentinel's parent pointer do
not introduce contention.
(13) and (14) Red-black trees built using OSTM. As with skip lists, each node
is represented by a separate OSTM object, so nodes must be opened for the
appropriate type of access as the tree is traversed. We consider implementations
using OSTM from Section 7 and Herlihy et al.’s obstruction-free STM [2003b]
coupled with the “polite” contention manager. As before, we use write discard
on the sentinel node.7
We now consider our performance results under a series of scenarios. Sec-
tion 8.2.1 looks at scalability under low contention. This shows the performance
of our nonblocking systems when running on machines with few CPUs, or when
they are being used carefully to reduce the likelihood that concurrent opera-
tions conflict. Our second set of results, in Section 8.2.2, considers performance
under increasing levels of contention.
8.2.1 Scalability Under Low Contention. The first set of results measures
performance when contention between concurrent operations is very low. Each
experiment runs with a mean of 2^19 keys in the set, which is sufficient to ensure
that parallel writers are extremely unlikely to update overlapping sections of
the data structure. A well-designed algorithm which provides disjoint-access
parallelism will avoid introducing contention between these logically noncon-
flicting operations.
Note that all graphs in this section show a significant drop in performance
when parallelism increases beyond five to ten threads. This is due to the
architecture of the underlying hardware: Small benchmark runs execute
within one or two CPU “quads”, each of which has its own on-board memory.
Most or all memory reads in small runs are therefore serviced from local
memory, which is considerably faster than transferring cache lines across the
switched interquad backplane.
Figure 24 shows the performance of each skip-list implementation. As ex-
pected, STM-based implementations perform poorly compared with other lock-
free schemes; this demonstrates that there are significant overheads associated
with the read and write operations (in WSTM), with maintaining the lists of
opened objects and constructing shadow copies of updated objects (in OSTM), or
7 Herlihy et al.’s DSTM cannot readily support write discard because only one thread may have
a DSTM object open for writing at a time. Their early release scheme applies only to read-only
accesses. To avoid contention on the sentinel node, we augmented their STM with a mechanism
for registering objects with nontransactional semantics: Such objects can be opened for writing by
multiple threads, but the shadow copies remain thread private and are discarded on commit or
abort.
Fig. 24. Graph (a) shows the performance of large skip lists (K = 2^19) as parallelism is increased
to 90 threads. Graph (b) is a “zoom” of (a), showing the performance of up to five threads. As with
all our graphs, lines marked with boxes represent lock-based implementations, circles are OSTMs,
triangles are WSTMs, and crosses are implementations built from MCAS or directly from CAS. The
ordering in the key reflects that of the lines at the right-hand side of the graph: Lower lines are
achieved by faster implementations.
with commit-time checks that verify whether the values read by an optimistic
algorithm represent a consistent snapshot.
Lock-free CAS-based and MCAS-based designs perform extremely well be-
cause, unlike the STMs, they add only minor overheads on each memory ac-
cess. Interestingly, under low contention, the MCAS-based design has almost
identical performance to the much more complicated CAS-based design, thus
indicating that the extra complexity of directly using hardware primitives is
not always worthwhile. Both schemes surpass the two lock-based designs, of
which the finer-grained scheme is slower because of the costs associated with
traversing and manipulating a larger number of locks.
Figure 25, presenting results for red-black trees, gives the clearest indication
of the kinds of setting where our different techniques are effective. Neither of
the lock-based schemes scales effectively with increasing parallelism; indeed,
both OSTM- and WSTM-based trees outperform the schemes using locking
with only two concurrent threads. Of course, the difficulty of designing effective
lock-based trees motivated the development of skip lists, so it is interesting to
observe that a straightforward tree implementation, layered over STM, does
scale well and often performs better than a skip list implemented over the
same STM.
Surprisingly, the lock-based scheme that permits parallel updates performs
hardly any better than the much simpler and more conservative design with
serialised updates. This is because the main performance bottleneck in both
schemes is contention when accessing the multi-reader lock at the root of the
tree. Although multiple readers can enter their critical region simultaneously,
there is significant contention for updating the shared synchronisation fields
within the lock itself. Put simply, using a more permissive type of lock (i.e.,
multi-reader) does not improve performance because the bottleneck is caused
by cache-line contention rather than lock contention.
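To make this concrete, the following sketch shows the read-acquire path of a typical centralized multi-reader lock (not the exact lock used in our benchmarks): every reader performs an atomic read-modify-write on a shared word, so the cache line holding that word migrates between CPUs even though no reader ever blocks another.

#include <stdatomic.h>

typedef struct {
    atomic_int readers;     /* count of readers inside the critical region    */
    atomic_int writer;      /* nonzero while a writer holds the lock          */
} rw_lock_t;                /* writer-side operations omitted                 */

void read_lock(rw_lock_t *l)
{
    for (;;) {
        while (atomic_load(&l->writer)) ;     /* wait for any writer          */
        atomic_fetch_add(&l->readers, 1);     /* contended RMW on shared line */
        if (!atomic_load(&l->writer))
            return;                           /* no writer slipped in         */
        atomic_fetch_sub(&l->readers, 1);     /* back out and retry           */
    }
}

void read_unlock(rw_lock_t *l)
{
    atomic_fetch_sub(&l->readers, 1);         /* another contended RMW        */
}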
In contrast, the STM schemes scale very well because transactional reads do
not cause potentially-conflicting memory writes in the underlying synchroni-
sation primitives. We suspect that, under low contention, OSTM is faster than
Herlihy’s design due to better cache locality. Herlihy’s STM requires a double
indirection when opening a transactional object: thus three cache lines are ac-
cessed when reading a field within a previously-unopened object. In contrast
our scheme accesses two cache lines; more levels of the tree fit inside each CPU’s
caches and, when traversing levels that do not fit in the cache, 50% fewer lines
must be fetched from main memory.
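The difference in indirection can be sketched as follows; the structures are simplified renderings rather than the actual DSTM or OSTM metadata.

/* DSTM-style: handle -> locator -> data (double indirection).  Reading a
 * field of a previously-unopened object touches three distinct cache lines:
 * the handle, the locator, and the data block.                              */
struct dstm_locator { void *owner_tx; void *old_data; void *new_data; };
struct dstm_handle  { struct dstm_locator *loc; };

/* OSTM-style: handle -> data (single indirection).  The same read touches
 * only two cache lines: the handle and the data block.                      */
struct ostm_handle  { void *data; /* or a pointer to an owning transaction */ };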
Fig. 25. Graph (a) shows the performance of large red-black trees (K = 2^19) as parallelism is
increased to 90 threads. Graph (b) is a “zoom” of (a), showing the performance of up to five threads.
requires expensive juggling of acquire and release invocations. The results here
allow us to investigate whether these overheads pay off as contention increases.
All experiments are executed with 90 parallel threads (P = 90). Contention
is varied by adjusting the benchmark parameter K and hence the mean size
of the data structure under test. Although not a general-purpose contention
metric, this is sufficient to allow us to compare the performance of a single data
structure implemented over different concurrency-control mechanisms.
Figure 26 shows the effect of contention on each of the skip-list implemen-
tations. It indicates that there is sometimes a price for using MCAS, rather
than programming directly using CAS. The comparatively poor performance of
MCAS when contention is high is because many operations must retry several
times before succeeding: It is likely that the data structure will have been mod-
ified before an update operation attempts to make its modifications globally
visible. In contrast, the carefully implemented CAS-based scheme attempts
to do the minimal work necessary to update its “view” when it observes a
change to the data structure. This effort pays off under very high contention;
in these conditions the CAS-based design performs as well as per-pointer
locks. Of course, we could postulate a hybrid implementation in which the pro-
grammer strives to perform updates using multiple smaller MCAS operations.
This could provide an attractive middle ground between the complexity of us-
ing CAS and the “one-fails-then-all-fail” contention problems of large MCAS
operations.
These results also demonstrate a particular weakness of locks: The optimal
granularity of locking depends on the level of contention. Here, per-pointer
locks are the best choice under very high contention, but introduce unnecessary
overheads compared with per-node locks under moderate to low contention.
Lock-free techniques avoid the need to make this particular tradeoff. Finally,
note that the performance of each implementation drops slightly as the mean
set size becomes very large. This is because the time taken to search the skip
list begins to dominate the execution time.
Figure 27 presents results for red-black trees, and shows that locks are not
always the best choice when contention is high. Both lock-based schemes suffer
contention for cache lines at the root of the tree, where most operations must
acquire the multireader lock. For this workload, the OSTM and WSTM schemes
using suspension perform well in all cases, although conflicts still significantly
affect performance.
Herlihy’s STM performs comparatively poorly under high contention when
using an initial contention-handling mechanism that introduces exponential
backoff to “politely” deal with conflicts. Other contention-management schemes
may work better [Scherer III and Scott 2005] and, since DSTM is only obstruction-
free, the choice of scheme can be expected to influence its performance more
than it influences OSTM’s. Once again, the results here are
intended primarily to compare our designs with lock-based alternatives and we
include DSTM in this sample configuration because it has been widely studied
in the literature. Marathe et al. perform further comparisons of DSTM and
OSTM [2004].
Note that, when using this basic contention manager, the execution times of
individual operations are highly variable, which explains the performance “spike”
at the left-hand side of the graph. This low and variable performance stems
from sensitivity to the choice of backoff rate: Our implementation uses the same
values as the original authors, but those values were tuned for a Java-based
implementation of red-black trees, and the original work does not discuss how to
choose more appropriate values for other settings.
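For reference, the backoff policy can be sketched as follows; the constants and helper routines are illustrative assumptions, not the values or code used by the original DSTM authors.

#include <stdlib.h>

extern void spin_wait_ns(long ns);        /* assumed busy-wait helper         */
extern void abort_rival(void);            /* assumed: abort the obstructing
                                             transaction                      */

#define MIN_DELAY_NS  16                  /* illustrative base delay          */
#define MAX_ATTEMPTS  22                  /* illustrative cap on retries      */

/* Called each time a transaction finds its object already owned by a rival:
 * back off for a randomized delay whose window doubles with every attempt,
 * then, after MAX_ATTEMPTS conflicts, stop being polite and abort the rival. */
void polite_on_conflict(int attempt)
{
    if (attempt < MAX_ATTEMPTS)
        spin_wait_ns(rand() % ((long)MIN_DELAY_NS << attempt));
    else
        abort_rival();
}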
9. CONCLUSION
We have presented three APIs for building nonblocking concurrency-safe soft-
ware and demonstrated that these can match or surpass the performance of
state-of-the-art lock-based alternatives. Thus, not only do nonblocking systems
have many functional advantages compared with locks (such as freedom from
deadlock and unfortunate scheduler interactions), but they can also be imple-
mented on modern multiprocessor systems without the runtime overheads that
have traditionally been feared.
Furthermore, APIs such as STM have benefits in ease of use compared with
traditional direct use of mutual-exclusion locks. An STM avoids the need to
consider issues such as granularity of locking, the order in which locks should
be acquired to avoid deadlock, and composability of different data structures
or subsystems. This ease of use is in contrast to traditional implementations
of nonblocking data structures based directly on hardware primitives such as
CAS.
In conclusion, using the APIs that we have presented in this article, it is now
practical to deploy lock-free techniques, with all their attendant advantages,
in situations where lock-based synchronization would traditionally be the only
viable option.
ACKNOWLEDGMENTS
SOURCE CODE
Source code for the MCAS, WSTM, and OSTM implementations described in
this article is available at http://www.cl.cam.ac.uk/netos/lock-free. Also
included are the benchmarking implementations of skip lists and red-black
trees, and the offline linearizability checker described in Section 8.2.
REFERENCES
ANANIAN, C. S., ASANOVIĆ, K., KUSZMAUL, B. C., LEISERSON, C. E., AND LIE, S. 2005. Unbounded
transactional memory. In Proceedings of the 11th International Symposium on High-Performance
Computer Architecture (HPCA). (San Francisco, CA). 316–327.
ANDERSON, J. H. AND MOIR, M. 1995. Universal constructions for multi-object operations. In Pro-
ceedings of the 14th Annual ACM Symposium on Principles of Distributed Computing (PODC).
184–193.
ANDERSON, J. H., RAMAMURTHY, S., AND JAIN, R. 1997. Implementing wait-free objects on priority-
based systems. In Proceedings of the 16th Annual ACM Symposium on Principles of Distributed
Computing (PODC). 229–238.
BARNES, G. 1993. A method for implementing lock-free data structures. In Proceedings of the 5th
Annual ACM Symposium on Parallel Algorithms and Architectures. 261–270.
DICE, D., SHALEV, O., AND SHAVIT, N. 2006. Transactional locking II. In Proceedings of the 20th
International Symposium on Distributed Computing (DISC).
FICH, F., LUCHANGCO, V., MOIR, M., AND SHAVIT, N. 2005. Obstruction-Free algorithms can be practi-
cally wait-free. In Distributed Algorithms, P. Fraigniaud, Ed. Lecture Notes in Computer Science,
vol. 3724. Springer Verlag, Berlin. 78–92.
FRASER, K. 2003. Practical lock freedom. Ph.D. thesis, Computer Laboratory, University of Cam-
bridge. Also available as Tech. Rep. UCAM-CL-TR-639, Cambridge University.
GREENWALD, M. 1999. Non-Blocking synchronization and system design. Ph.D. thesis, Stanford
University. Also available as Technical Rep. STAN-CS-TR-99-1624, Stanford University, Com-
puter Science Department.
GREENWALD, M. 2002. Two-Handed emulation: How to build non-blocking implementations of
complex data structures using DCAS. In Proceedings of the 21st Annual ACM Symposium on
Principles of Distributed Computing (PODC). 260–269.
HAMMOND, L., CARLSTROM, B. D., WONG, V., HERTZBERG, B., CHEN, M., KOZYRAKIS, C., AND OLUKOTUN,
K. 2004. Programming with transactional coherence and consistency (TCC). In Proceedings
of the 11th International Conference on Architectural Support for Programming Languages and
Operating Systems (ASPLOS). ACM Press, New York. 1–13.
HANKE, S. 1999. The performance of concurrent red-black tree algorithms. In Proceedings of the
3rd Workshop on Algorithm Engineering. Lecture Notes in Computer Science, vol. 1668. Springer
Verlag, Berlin. 287–301.
HANKE, S., OTTMANN, T., AND SOISALON-SOININEN, E. 1997. Relaxed balanced red-black trees. In Pro-
ceedings of the 3rd Italian Conference on Algorithms and Complexity. Lecture Notes in Computer
Science, vol. 1203. Springer Verlag, Berlin. 193–204.
HARRIS, T. 2001. A pragmatic implementation of non-blocking linked lists. In Proceedings of the
15th International Symposium on Distributed Computing (DISC). Springer Verlag, Berlin. 300–
314.
HARRIS, T. 2004. Exceptions and side-effects in atomic blocks. In Proceedings of the PODC Work-
shop on Synchronization in Java Programs. 46–53. Proceedings published as Memorial Univer-
sity of Newfoundland CS Tech. Rep. 2004-01.
HARRIS, T. AND FRASER, K. 2003. Language support for lightweight transactions. In Proceedings
of the 18th Annual ACM-SIGPLAN Conference on Object-Oriented Programming, Systems, Lan-
guages and Applications (OOPSLA). 388–402.
HARRIS, T. AND FRASER, K. 2005. Revocable locks for non-blocking programming. In Proceedings
of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
(PPoPP). ACM Press, New York. 72–82.
HARRIS, T., MARLOW, S., PEYTON-JONES, S., AND HERLIHY, M. 2005. Composable memory transactions.
In Proceedings of the 10th ACM SIGPLAN Symposium on Principles and Practice of Parallel
Programming (PPoPP). ACM Press, New York. 48–60.
HELLER, S., HERLIHY, M., LUCHANGCO, V., MOIR, M., SCHERER, B., AND SHAVIT, N. 2005. A lazy con-
current list-based set algorithm. In 9th International Conference on Principles of Distributed
Systems (OPODIS).
HENNESSY, J. L. AND PATTERSON, D. A. 2003. Computer Architecture—A Quantitative Approach,
3rd ed. Morgan Kaufmann, San Francisco, CA.
HERLIHY, M. 1993. A methodology for implementing highly concurrent data objects. ACM Trans.
Program. Lang. Syst. 15, 5 (Nov.), 745–770.
HERLIHY, M. 2005. SXM1.1: Software transactional memory package for C#. Tech. Rep., Brown
University and Microsoft Research. May.
HERLIHY, M., LUCHANGCO, V., MARTIN, P., AND MOIR, M. 2005. Nonblocking memory manage-
ment support for dynamic-sized data structures. ACM Trans. Comput. Syst. 23, 2, 146–
196.
HERLIHY, M., LUCHANGCO, V., AND MOIR, M. 2003a. Obstruction-Free synchronization: Double-
Ended queues as an example. In Proceedings of the 23rd IEEE International Con-
ference on Distributed Computing Systems (ICDCS). IEEE, Los Alamitos, CA. 522–
529.
HERLIHY, M., LUCHANGCO, V., MOIR, M., AND SCHERER, W. 2003b. Software transactional memory for
dynamic-sized data structures. In Proceedings of the 22nd Annual ACM Symposium on Principles
of Distributed Computing (PODC). 92–101.
HERLIHY, M. AND MOSS, J. E. B. 1993. Transactional memory: Architectural support for lock-
free data structures. In Proceedings of the 20th Annual International Symposium on Computer
Architecture (ISCA). ACM Press, New York. 289–301.
HERLIHY, M. AND WING, J. M. 1990. Linearizability: A correctness condition for concurrent objects.
ACM Trans. Program. Lang. Syst. 12, 3 (Jul.), 463–492.
HOARE, C. A. R. 1972. Towards a theory of parallel programming. In Operating Systems Tech-
niques. A.P.I.C. Studies in Data Processing, vol. 9. Academic Press, 61–71.
ISRAELI, A. AND RAPPOPORT, L. 1994. Disjoint-Access-Parallel implementations of strong shared
memory primitives. In Proceedings of the 13th Annual ACM Symposium on Principles of Dis-
tributed Computing (PODC). 151–160.
JAYANTI, P. AND PETROVIC, S. 2003. Efficient and practical constructions of LL/SC variables. In
Proceedings of the 22nd Annual Symposium on Principles of Distributed Computing. ACM Press,
285–294.
JONES, R. AND LINS, R. 1996. Garbage Collection: Algorithms for Automatic Dynamic Memory
Management. John Wiley and Sons.
KUNG, H. T. AND LEHMAN, P. L. 1980. Concurrent manipulation of binary search trees. ACM Trans.
Database Syst. 5, 3 (Sept.), 354–382.
KUNG, H. T. AND ROBINSON, J. T. 1981. On optimistic methods for concurrency control. ACM Trans.
Database Syst. 6, 2, 213–226.
MARATHE, V. J., SCHERER III, W. N., AND SCOTT, M. L. 2004. Design tradeoffs in modern software
transactional memory systems. In Proceedings of the 7th Workshop on Languages, Compilers,
and Run-Time Support for Scalable Systems.
MCDONALD, A., CHUNG, J., CHAFI, H., CAO MINH, C., CARLSTROM, B. D., HAMMOND, L., KOZYRAKIS, C.,
AND OLUKOTUN, K. 2005. Characterization of TCC on chip-multiprocessors. In Proceedings of
the 14th International Conference on Parallel Architectures and Compilation Techniques.
MELLOR-CRUMMEY, J. AND SCOTT, M. 1991a. Algorithms for scalable synchronization on shared-
memory multiprocessors. ACM Trans. Comput. Syst. 9, 1, 21–65.
MELLOR-CRUMMEY, J. AND SCOTT, M. 1991b. Scalable reader-writer synchronization for shared-
memory multiprocessors. In Proceedings of the 3rd ACM SIGPLAN Symposium on Principles
and Practice of Parallel Programming. 106–113.
MICHAEL, M. M. 2002. Safe memory reclamation for dynamic lock-free objects using atomic reads
and writes. In Proceedings of the 21st Annual ACM Symposium on Principles of Distributed
Computing (PODC).
MICHAEL, M. M. AND SCOTT, M. 1995. Correction of a memory management method for lock-
free data structures. Tech. Rep. TR599, University of Rochester, Computer Science Department.
December.
MOIR, M. 1997. Transparent support for wait-free transactions. In Distributed Algorithms, 11th
International Workshop. Lecture Notes in Computer Science, vol. 1320. Springer Verlag, Berlin.
305–319.
MOIR, M. 2002. Personal communication.
MOORE, K. E., HILL, M. D., AND WOOD, D. A. 2005. Thread-Level transactional memory. Tech. Rep.:
CS-TR-2005-1524, Department of Computer Sciences, University of Wisconsin, Motorola Inc.,
Phoenix, AZ. 1–11.
MOTOROLA. 1985. MC68020 32-Bit Microprocessor User’s Manual.
PUGH, W. 1990. Concurrent maintenance of skip lists. Tech. Rep. CS-TR-2222, Department of
Computer Science, University of Maryland. June.
RAJWAR, R. AND GOODMAN, J. R. 2002. Transactional lock-free execution of lock-based programs.
ACM SIGPLAN Not. 37, 10 (Oct.), 5–17.
RAJWAR, R., HERLIHY, M., AND LAI, K. 2005. Virtualizing transactional memory. In Proceedings of
the 32nd Annual International Symposium on Computer Architecture. IEEE Computer Society,
Los Alamitos, CA. 494–505.
RIEGEL, T., FELBER, P., AND FETZER, C. 2006. A lazy snapshot algorithm with eager validation. In
Proceedings of the 20th International Symposium on Distributed Computing (DISC).
RINGENBURG, M. F. AND GROSSMAN, D. 2005. AtomCaml: First-Class atomicity via rollback. In
Proceedings of the 10th ACM SIGPLAN International Conference on Functional Programming
(ICFP). ACM Press, New York. 92–104.
SCHERER III, W. N. AND SCOTT, M. L. 2005. Advanced contention management for dynamic software
transactional memory. In Proceedings of the 24th Annual ACM SIGACT-SIGOPS Symposium on
Principles of Distributed Computing. ACM Press, New York. 240–248.
SHALEV, O. AND SHAVIT, N. 2003. Split-ordered lists: Lock-Free extensible hash tables. In Pro-
ceedings of the 22nd Annual ACM Symposium on Principles of Distributed Computing (PODC).
102–111.
SHAVIT, N. AND TOUITOU, D. 1995. Software transactional memory. In Proceedings of the 14th An-
nual ACM Symposium on Principles of Distributed Computing (PODC). 204–213.
SUNDELL, H. AND TSIGAS, P. 2004. Scalable and lock-free concurrent dictionaries. In Proceedings
of the ACM Symposium on Applied Computing. ACM Press, New York, NY. 1438–1445.
TUREK, J., SHASHA, D., AND PRAKASH, S. 1992. Locking without blocking: Making lock-based con-
current data structure algorithms nonblocking. In Proceedings of the 11th ACM Symposium on
Principles of Database Systems. 212–222.
WELC, A., JAGANNATHAN, S., AND HOSKING, T. 2004. Transactional monitors for concurrent objects.
In Proceedings of the European Conference on Object-Oriented Programming (ECOOP). 519–542.
WING, J. M. AND GONG, C. 1993. Testing and verifying concurrent objects. J. Parallel Distrib.
Comput. 17, 1 (Jan.), 164–182.