Concurrent Programming:
Algorithms, Principles,
and Foundations
Michel Raynal
Institut Universitaire de France
IRISA-ISTIC
Université de Rennes 1
Rennes Cedex
France
… That day I truly believed I had grasped something, and that my life would be changed by it.
But nothing of that kind is ever definitively acquired.
Like water, the world passes through you and for a while lends you its colors.
Then it withdraws and leaves you facing the void you carry within yourself, facing that kind
of central insufficiency of the soul which you must learn to live with, to combat,
and which, paradoxically, is perhaps our surest driving force.
In L'usage du monde (1963), Nicolas Bouvier (1929–1998)
Preface
What synchronization is
Since the early work of E.W. Dijkstra (1965), who introduced the mutual exclu-
sion problem, the concept of a process, the semaphore object, the notion of a
weakest precondition, and guarded commands (among many other contributions),
synchronization is no longer a catalog of tricks but a domain of computing science
with its own concepts, mechanisms, and techniques whose results can be applied in
many domains. This means that process synchronization has to be a major topic of
any computer science curriculum.
Content
As stressed by its title, this book is on algorithms, base principles, and foundations
of concurrent objects and synchronization in shared memory systems, i.e., systems
where the entities communicate by reading and writing a common memory. (Such
a corpus of knowledge is becoming more and more important with the advent of
new technologies such as multicore architectures.)
The book is composed of six parts. Three parts are more focused on base
synchronization mechanisms and the construction of concurrent objects, while the
other three parts are more focused on the foundations of synchronization. (A
noteworthy feature of the book is that nearly all the algorithms that are presented
are proved.)
• Part I is on lock-based synchronization, i.e., on well-known synchronization
concepts, techniques, and mechanisms. It defines the most important synchro-
nization problem in reliable asynchronous systems, namely the mutual exclusion
problem (Chap. 1). It then presents several base approaches which have been
proposed to solve it with machine-level instructions (Chap. 2). It also presents
traditional approaches which have been proposed at a higher abstraction level to
solve synchronization problems and implement concurrent objects, namely the
concept of a semaphore and, at an even more abstract level, the concepts of
monitor and path expression (Chap. 3).
• After the reader has become familiar with base concepts and mechanisms suited
to classical synchronization in reliable systems, Part II, which is made up of a
single chapter, addresses a fundamental concept of synchronization; namely, it
presents and investigates the concept of atomicity and its properties. This allows
for the formalization of the notion of a correct execution of a concurrent pro-
gram in which processes cooperate by accessing shared objects (Chap. 4).
• Part I has implicitly assumed that the cooperating processes do not fail. Hence
the question: What happens when cooperating entities fail? This is the main
issue addressed in Part III (and all the rest of the book); namely, it considers that
cooperating entities can halt prematurely (crash failure). To face the net effect of
asynchrony and failures, it introduces the notions of mutex-freedom and asso-
ciated progress conditions such as obstruction-freedom, non-blocking, and wait-
freedom (Chap. 5).
The rest of Part III focuses on hybrid concurrent objects (Chap. 6), wait-free
implementations of paradigmatic concurrent objects such as counters and store-
collect objects (Chap. 7), snapshot objects (Chap. 8), and renaming objects
(Chap. 9).
• Part V returns to the foundations side. It shows how reliable atomic read/write
registers (shared variables) can be built from non-atomic bits. This part consists
of three chapters. Chapter 11 introduces the notions of safe register, regular
register, and atomic register. Then, Chap. 12 shows how to build an atomic bit
from a safe bit. Finally, Chap. 13 shows how an atomic register of any size can
be built from safe and atomic bits.
This part shows that, while atomic read/write registers are easier to use than safe
read/write registers, they are not more powerful from a computability point-of-
view.
• Part VI, which also concerns the foundations side, is on the computational
power of concurrent objects. It is made up of four chapters. It first introduces the
notion of a consensus object and shows that consensus objects are universal
objects (Chap. 14). This means that, as soon as a system provides us with atomic
read/write registers and consensus objects, it is possible to implement in a wait-
free manner any object defined from a sequential specification.
Part VI then introduces the notion of self-implementation and shows how atomic
registers and consensus objects can be built from base objects of the same type
which are not reliable (Chap. 15). Then, it presents the notion of a consensus
number and the associated consensus hierarchy which allows the computability
power of concurrent objects to be ranked (Chap. 16). Finally, the last chapter of
the book focuses on the wait-free implementation of consensus objects from
read/write registers and failure detectors (Chap. 17).
To gain a more complete feeling of the spirit of this book, the reader can also
consult the section "What Was the Aim of This Book" in the Afterword, which
describes what it is hoped has been learned from this book. Each chapter starts
with a short presentation of its content and a list of keywords; it terminates with a
summary of the main points that have been explained and developed. Each of the six
parts of the book is also introduced by a brief description of its aim and its
technical content.
Acknowledgments
This book originates from lecture notes for undergraduate and graduate courses on
process synchronization that I give at the University of Rennes (France) and, as an
invited professor, at several universities all over the world. I would like to thank
the students for their questions that, in one way or another, have contributed to this
book.
Last but not least (and maybe most importantly), I also want to thank all the
researchers whose results are presented in this book. Without their work, this book
would not exist. (Since I typeset the entire text myself (with xfig for the figures),
any typesetting or technical errors that remain are my responsibility.)
Michel Raynal
Professeur des Universités
Institut Universitaire de France
IRISA-ISTIC, Université de Rennes 1
Campus de Beaulieu, 35042 Rennes, France
Contents

Afterword
Bibliography
Index
Notation

No-op                  No operation
Process                Program in action
n                      Number of processes
Correct process        Process that does not crash during an execution
Faulty process         Process that crashes during an execution
Concurrent object      Object shared by several processes
AA[1..m]               Array with m entries
⟨a, b⟩                 Pair with two elements a and b
Mutex                  Mutual exclusion
Read/write register    Synonym of read/write variable
SWSR                   Single-writer/single-reader (register)
SWMR                   Single-writer/multi-reader (register)
MWSR                   Multi-writer/single-reader (register)
MWMR                   Multi-writer/multi-reader (register)
ABCD                   Identifiers in italics upper-case letters: shared objects
abcd                   Identifiers in italics lower-case letters: local variables
↑ X                    Pointer to object X
P ↓                    Object pointed to by the pointer P
AA[1..s] (a[1..s])     Shared (local) array of size s
for each i ∈ {1, ..., m} do statements end for    Order irrelevant
for each i from 1 to m do statements end for      Order relevant
wait (P)               while ¬P do no-op end while
return (v)             Returns v and terminates the operation invocation
% blablabla %          Comments
;                      Sequentiality operator between two statements
Figures and Algorithms
7.1   A simple wait-free counter for n processes (code for pi)
7.2   Wait-free weak counter (one-shot version, code for pi)
7.3   Proof of the weak increment property
7.4   Fast read of a weak counter (code for process pi)
7.5   Reading a weak counter (non-restricted version, code for process pi)
7.6   A trivial implementation of a store-collect object (code for pi)
7.7   A store-collect object has no sequential specification
7.8   A complete binary tree to implement a store-collect object
7.9   Structure of a vertex of the binary tree
7.10  An adaptive implementation of a store-collect object (code for pi)
7.11  Computing an upper bound on the number of marked vertices
7.12  Merging store() and collect() (code for process pi)
7.13  Incorrect versus correct implementation of the store_collect() operation
7.14  An efficient store_collect() algorithm (code for pi)
7.15  Sequential and concurrent invocations of store_collect()
7.16  Concurrent invocations of store_collect()
The concept of a process, which expresses the idea of an activity, has become an
indispensable tool for mastering the activity of multiprocessors. More precisely, a
concurrent algorithm (or concurrent program) is the description of a set of sequential
state machines that cooperate through a communication medium, e.g., a shared
memory. A concurrent algorithm is sometimes called a multiprocess program (each
process corresponding to the sequential execution of a given state machine).
This chapter considers processes that are reliable and asynchronous. “Reliable”
means that each process results from the correct execution of the code of the
corresponding algorithm. “Asynchronous” means that there is no assumption on the
time it takes a process to proceed from one state transition to the next (which
means that an asynchronous sequential process proceeds at an arbitrary speed).
1.2.2 Synchronization
More generally, synchronization is the set of rules and mechanisms that allows
the specification and implementation of sequencing properties on statements issued
by the processes so that all the executions of a multiprocess program are correct.
This type of process interaction occurs when processes have to compete to execute
some statements and only one process at a time (or a bounded number of them) is
allowed to execute them. This occurs, for example, when processes compete for a
shared resource. More generally, resource allocation is a typical example of process
competition.
i.e., from the disk D point of view, the execution corresponds to the sequence
D.seek(x); r ← D.read(); D.seek(y); D.write(v), from which we conclude that p
has read the value at address x and afterwards q has written the value v at address y.
Let us now consider the case where p and q simultaneously invoke disk_read(x)
and disk_write(y, v), respectively. The effect of the corresponding parallel execution
is produced by any interleaving of the primitives invoked by p and the primitives
invoked by q that respects the order of invocations issued by p and q. As an example,
a possible execution is depicted in Fig. 1.2. This figure is a classical space-time
diagram. Time flows from left to right, and each operation issued by a process is
represented by a segment on the time axis associated with this process. Two dashed
arrows are associated with each invocation of an operation. They meet at a point of
the “real time” line, which indicates the instant at which the corresponding operation
appears to have been executed instantaneously. This sequence of points defines the
order in which the execution is seen by an external sequential observer (i.e., an
observer who can see one operation invocation at a time).
In this example, the processes p and q have invoked in parallel D.seek(x) and
D.seek(y), respectively, and D.seek(x) appears to be executed before D.seek(y).
Then q executes D.write(v) while p executes in parallel D.read(), and the write by
q appears to an external observer to be executed before the read of p.
It is easy to see that, while the write by process q is correct (namely v has been
written at address y), the read by process p of the value at address x is incorrect
( p obtains the value written at address y and not the value stored at address x).
Other incorrect parallel executions (involving invocations of both disk_read() and
disk_write() or involving only invocations of disk_write() operations) in which a
value is not written at the correct address can easily be designed.
A solution to prevent this problem from occurring consists in allowing only
one operation at a time (either disk_read() or disk_write()) to be executed. Mutual
exclusion (addressed later in this chapter) provides such a solution.
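The scenario above can be reproduced in a few lines. The sketch below models the disk and brackets each seek-and-access pair with a lock, so that the bad interleaving D.seek(x); D.seek(y); D.write(v); D.read() can no longer occur. The Disk class and its list-based storage are illustrative assumptions, not the book's code.

```python
import threading

# A minimal model of the disk from this section: seek() moves the head,
# read()/write() act at the current head position.
class Disk:
    def __init__(self, size):
        self.blocks = [None] * size
        self.head = 0
    def seek(self, addr):
        self.head = addr
    def read(self):
        return self.blocks[self.head]
    def write(self, v):
        self.blocks[self.head] = v

D = Disk(8)
LOCK = threading.Lock()   # at most one seek-and-access pair at a time

def disk_read(x):
    with LOCK:            # seek and read cannot be separated by another process
        D.seek(x)
        return D.read()

def disk_write(y, v):
    with LOCK:
        D.seek(y)
        D.write(v)

# Without LOCK, the interleaving D.seek(0); D.seek(5); D.write("b"); D.read()
# would make the reader return the value stored at address 5 instead of 0.
disk_write(0, "a")
t = threading.Thread(target=disk_write, args=(5, "b"))
t.start()
r = disk_read(0)          # always "a": the two pairs cannot interleave
t.join()
print(r)
```

Whichever of the two bracketed pairs an external observer sees first, each pair executes without interference, so the read always returns the value stored at the address it was given.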
This section presents two examples of process cooperation. The first is a pure coor-
dination problem while the second is the well-known producer–consumer problem.
In both cases the progress of a process may depend on the progress of other processes.
To better understand the nature of what synchronization is, let us consider the
previous producer–consumer problem. Let # p and #c denote the number of data
items produced and consumed so far, respectively. The instance of the problem
(Figure: the authorized area and the forbidden areas in the (#p, #c) plane.)
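As a minimal illustration of such cooperation (a sketch with assumed names and a fixed capacity, not an algorithm developed in the book), a bounded buffer makes the consumer wait when the buffer is empty and the producer wait when it is full, which keeps #c ≤ #p ≤ #c + capacity at all times:

```python
import threading

# Bounded-buffer sketch of producer-consumer cooperation using a condition
# variable; the progress of each process may depend on the other's progress.
class BoundedBuffer:
    def __init__(self, capacity):
        self.buf = []
        self.capacity = capacity
        self.cond = threading.Condition()
    def produce(self, item):
        with self.cond:
            while len(self.buf) == self.capacity:  # full: wait for the consumer
                self.cond.wait()
            self.buf.append(item)
            self.cond.notify_all()
    def consume(self):
        with self.cond:
            while not self.buf:                    # empty: wait for the producer
                self.cond.wait()
            item = self.buf.pop(0)                 # FIFO order
            self.cond.notify_all()
            return item

B = BoundedBuffer(2)
consumed = []
def consumer():
    for _ in range(5):
        consumed.append(B.consume())

t = threading.Thread(target=consumer)
t.start()
for i in range(5):
    B.produce(i)
t.join()
print(consumed)
```

With a single producer and a single consumer the buffer behaves as a FIFO channel, so the items are consumed in production order.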
Critical section Let us consider a part of code A (i.e., an algorithm) or several parts
of code A, B, C . . . (i.e., different algorithms) that, for some consistency reasons,
must be executed by a single process at a time. This means that, if a process is execut-
ing one of these parts of code, e.g., the code B, no other process can simultaneously
execute the same or another part of code, i.e., any of the codes A or B or C or etc.
This is, for example, the case of the disk operations disk_read() and disk_write()
introduced in Sect. 1.2.2, where guaranteeing that, at any time, at most one process
can execute either of these operations ensures that each read or write of the disk is
correct. Such parts of code define what is called a critical section. It is assumed that a
code defining a critical section always terminates when executed by a single process
at a time.
In the following, the critical section code is abstracted into a procedure called
cs_code(in) where in denotes its input parameter (if any) and that returns a result
value (without loss of generality, the default value ⊥ is returned if there is no explicit
result).
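In programming-language terms, this bracketing can be sketched as follows (a minimal illustration assuming a lock object; the default value ⊥ is rendered as None):

```python
import threading

MUTEX = threading.Lock()

def cs_code(inp):
    # Body of the critical section; by assumption it always terminates when
    # executed by a single process at a time. Here it simply returns its
    # input parameter, or the default value None (standing for "bottom")
    # when there is no explicit input.
    return inp

def protected_cs(inp=None):
    with MUTEX:               # entry and exit brackets around cs_code(in)
        return cs_code(inp)

print(protected_cs(42))       # 42
print(protected_cs())         # None
```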
It is easy to extend this execution in such a way that, while p3 wants to enter the
critical section, it can never enter it. This execution is deadlock-free but (due to p3 )
is not starvation-free.
Finite bypass versus bounded bypass A liveness property that is stronger than
starvation-freedom is the following one. Let p and q be a pair of competing processes
such that q wins the competition. Let f (n) denote a function of n (where n is the
total number of processes).
• Bounded bypass. There is a function f (n) such that, each time a process invokes
acquire_mutex(), it loses at most f (n) competitions with respect to the other
processes.
Let us observe that starvation-freedom is nothing other than finite bypass, i.e., the
case where the number of times that a process can be bypassed is finite. More
generally, we have the following hierarchy of liveness properties: bounded bypass ⇒
starvation-freedom ≡ finite bypass ⇒ deadlock-freedom.
Definition A lock (say LOCK) is a shared object that provides the processes
with two operations denoted LOCK.acquire_lock() and LOCK.release_lock(). It
can take two values, free and locked, and is initialized to the value free. Its
behavior is defined by a sequential specification: from an external observer point
of view, all the acquire_lock() and release_lock() invocations appear as if they
have been invoked one after the other. Moreover, using the regular language
operators “;” and “∗”, this sequence corresponds to the regular expression
(LOCK.acquire_lock(); LOCK.release_lock())∗ (see Fig. 1.5).
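This sequential specification can be checked mechanically on a complete trace of invocations; the checker below is a small illustrative sketch (the string encoding of the operations is an assumption):

```python
# Checks whether a complete trace of operations, as seen by an external
# observer, matches the regular expression (acquire_lock(); release_lock())*.
def conforms_to_lock_spec(trace):
    state = "free"                  # the lock's initial value
    for op in trace:
        if op == "acquire_lock" and state == "free":
            state = "locked"
        elif op == "release_lock" and state == "locked":
            state = "free"
        else:
            return False            # e.g. two consecutive acquires
    return state == "free"          # a complete trace ends with a release

print(conforms_to_lock_spec(["acquire_lock", "release_lock"] * 3))  # True
print(conforms_to_lock_spec(["acquire_lock", "acquire_lock"]))      # False
```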
According to the operations and their properties provided to the processes by the
underlying shared memory communication system, several families of mutex algo-
rithms can be designed. We distinguish three distinct families of mutex algorithms
which are investigated in the next chapter.
1.4 Summary
This chapter has presented the mutual exclusion problem. Solving this problem con-
sists in providing a lock object, i.e., a synchronization object that allows a zone of
code to be bracketed to guarantee that a single process at a time can execute it.
• The mutual exclusion problem was first stated by E.W. Dijkstra [88].
• A theory of interprocess communication and mutual exclusion is described in
[185].
• The notions of safety and liveness were introduced by L. Lamport in [185]. The
notion of liveness is investigated in [20].
• An invariant-based view of synchronization is presented in [194].
Chapter 2
Solving Mutual Exclusion
The read/write register object is one of the most basic objects encountered in com-
puter science. When such an object is accessed only by a single process it is said to
be local to that process; otherwise, it is a shared register. A local register allows a
process to store and retrieve data. A shared register allows concurrent processes to
also exchange data.
Definition A register R can be accessed by two base operations: R.read(), which
returns the value of R (also denoted x ← R where x is a local variable of the invoking
process), and R.write(v), which writes a new value into R (also denoted R ← v,
where v is the value to be written into R). An atomic shared register satisfies the
following properties:
Let us observe that R.write(3) and R.write(2) are concurrent, which means that
they could appear to an external observer as if R.write(2) was executed before
R.write(3). If this was the case, the execution would be correct if the last two read
invocations (issued by p1 and p3 ) return the value 3; i.e., the external observer should
then see the following sequential execution:
Let us also observe that the second read invocation by p1 is concurrent with both
R.write(2) and R.write(3). This means that it could appear as having been executed
before these two write operations or even between them. If it appears as having been
executed before these two write operations, it should return the value 1 in order for
the register behavior to be atomic.
As shown by these possible scenarios (and as noticed before) concurrency is
intimately related to non-determinism. It is not possible to predict which execution
will be produced; it is only possible to enumerate the set of possible executions that
could be produced (we can only predict that the one that is actually produced is one
of them).
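This enumeration of possible executions can be made concrete: the sketch below generates every interleaving of two processes' operations that respects each process's program order, and collects the values a read can return. The operation sequences are illustrative assumptions, not the exact scenario of the figure.

```python
from itertools import combinations

# Every interleaving that respects each process's program order is a
# possible atomic execution; we enumerate them all.
def interleavings(seq_p, seq_q):
    n, m = len(seq_p), len(seq_q)
    for positions in combinations(range(n + m), n):   # slots taken by p's ops
        run, it_p, it_q = [], iter(seq_p), iter(seq_q)
        for slot in range(n + m):
            run.append(next(it_p) if slot in positions else next(it_q))
        yield run

def read_results(run, initial=1):
    value, results = initial, []
    for op, arg in run:
        if op == "write":
            value = arg
        else:                                         # a read returns the current value
            results.append(value)
    return tuple(results)

p_ops = [("write", 3), ("write", 2)]                  # program order of p
q_ops = [("read", None)]                              # q performs a single read
possible = {read_results(run) for run in interleavings(p_ops, q_ops)}
print(possible)                                       # {(1,), (3,), (2,)}
```

The read can appear before both writes (returning the initial value 1), between them (returning 3), or after both (returning 2); no other outcome is possible, which is exactly the non-determinism discussed above.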
Examples of non-atomic read and write operations will be presented in Sect. 2.3.
Why atomicity is important Atomicity is a fundamental concept because it allows
the composition of shared objects for free (i.e., their composition is at no additional
cost). This means that, when considering two (or more) atomic registers R1 and
R2, the composite object [R1, R2] which is made up of R1 and R2 and provides the
processes with the four operations R1.read(), R1.write(), R2.read(), and R2.write()
is also atomic. Everything appears as if at most one operation at a time was executed,
and the sub-sequence including only the operations on R1 is a correct behavior of
R1, and similarly for R2.
This is very important when one has to reason about a multiprocess program
whose processes access atomic registers. More precisely, we can keep reasoning
sequentially whatever the number of atomic registers involved in a concurrent com-
putation. Atomicity allows us to reason on a set of atomic registers as if they were a
single “bigger” atomic object. Hence, we can reason in terms of sequences, not only
for each atomic register taken separately, but also on the whole set of registers as if
they were a single atomic object.
The composition of atomic objects is formally addressed in Sect. 4.4, where
it is shown that, as atomicity is a “local property”, atomic objects compose for
free.
The mutex algorithm for two processes that is presented below is due to G.L. Peterson
(1981). This construction, which is fairly simple, is built from an “addition” of two
base components. Despite the fact that these components are nearly trivial, they allow
us to introduce simple basic principles.
Fig. 2.2 Peterson’s algorithm for two processes: first component (code for pi )
The processes are denoted pi and p j . As the algorithm for p j is the same as the
one for pi after having replaced i by j, we give only the code for pi .
First component This component is described in Fig. 2.2 for process pi . It is
based on a single atomic register denoted AFTER_YOU, the initial value of which
is irrelevant (a process writes into this register before reading it). The principle that
underlies this algorithm is a “politeness” rule used in current life. When pi wants
to acquire the critical section, it sets AFTER_YOU to its identity i and waits until
AFTER_YOU = i in order to enter the critical section. Releasing the critical section
entails no particular action.
It is easy to see that this algorithm satisfies the mutual exclusion property. When
both processes want to acquire the critical section, each assigns its identity to the
register AFTER_YOU and waits until this register contains the identity of the other
process. As the register is atomic, there is a “last” process, say p j , that updated it,
and consequently only the other process pi can proceed to the critical section.
Unfortunately, this simple algorithm is not deadlock-free. If one process alone
wants to enter the critical section, it remains blocked forever in the wait statement.
Actually, this algorithm ensures that, when both processes want to enter the critical
section, the first process that updates the register AFTER_YOU is the one that is
allowed to enter it.
Second component This component is described in Fig. 2.3. It is based on a simple
idea. Each process pi manages a flag (denoted FLAG[i]) the value of which is down
or up. Initially, both flags are down. When a process wants to acquire the critical
section, it first raises its flag to indicate that it is interested in the critical section. It
is then allowed to proceed only when the flag of the other process is equal to down.
To release the critical section, a process pi has only to reset FLAG[i] to its initial
value (namely, down), thereby indicating that it is no longer interested in the mutual
exclusion.
Fig. 2.3 Peterson’s algorithm for two processes: second component (code for pi )
It is easy to see that, if a single process pi wants to repeatedly acquire the critical
section while the other process is not interested in the critical section, it can do so
(hence this algorithm does not suffer the drawback of the previous one). Moreover,
it is also easy to see that this algorithm satisfies the mutual exclusion property. This
follows from the fact that each process follows the following pattern: first write its flag
and only then read the value of the other flag. Hence, assuming that pi has acquired
(and not released) the critical section, we had (FLAG[i] = up)∧(FLAG[ j] = down)
when it was allowed to enter the critical section. It follows that, after p j has set
FLAG[ j] to the value up, it reads up from FLAG[i] and is delayed until pi resets
FLAG[i] to down when it releases the critical section.
Unfortunately, this algorithm is not deadlock-free. If both processes concurrently
first raise their flags and then read the other flag, each process remains blocked until
the other flag is set to down, which will never happen.
Remark: the notion of a livelock In order to prevent the previous deadlock situa-
tion, one could think of replacing wait (FLAG[ j] = down) by the following statement:
while (FLAG[ j] = up) do
FLAG[i] ← down;
pi delays itself for an arbitrary period of time;
FLAG[i] ← up
end while.
This modification can reduce deadlock situations but cannot eliminate all of them.
This occurs, for example when both processes execute “synchronously” (both delay
themselves for the same duration and execute the same step—writing their flag and
reading the other flag—at the very same time). When it occurs, this situation is
sometimes called a livelock.
This tentative solution was obtained by playing with asynchrony (modifying the
process speed by adding delays). As a correct algorithm has to work despite any
asynchrony pattern, playing with asynchrony can eliminate bad scenarios but cannot
suppress all of them.
the flag of the other one was raised, the current value of the register AFTER_YOU
allows exactly one of them to progress.
It is important to observe that, in the wait statement of Fig. 2.4, the reading of
the atomic registers FLAG[ j] and AFTER_YOU are asynchronous (they are done at
different times and can be done in any order).
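Peterson's two-process algorithm can be transcribed directly. The sketch below relies on CPython executing one bytecode at a time, which stands in for the atomic registers the algorithm assumes (on a real weak memory model, memory fences would be needed); the time.sleep(0) in the wait statement merely yields the processor so the other thread can progress.

```python
import threading
import time

FLAG = ["down", "down"]   # FLAG[i] is up while p_i is interested
AFTER_YOU = 0             # initial value is irrelevant

def acquire_mutex(i):
    global AFTER_YOU
    j = 1 - i
    FLAG[i] = "up"
    AFTER_YOU = i
    # wait (FLAG[j] = down or AFTER_YOU != i); the two registers are read
    # asynchronously, at different times, as in the text
    while not (FLAG[j] == "down" or AFTER_YOU != i):
        time.sleep(0)     # yield so the other thread can progress

def release_mutex(i):
    FLAG[i] = "down"

counter = 0               # shared object protected by the critical section

def worker(i):
    global counter
    for _ in range(5000):
        acquire_mutex(i)
        counter += 1      # critical section
        release_mutex(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in (0, 1)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)            # always 10000: mutual exclusion holds
```

Without the mutex, the unprotected increments (each a read-modify-write) could interleave and lose updates; with it, every increment executes in mutual exclusion.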
Theorem 1 The algorithm described in Fig. 2.4 satisfies mutual exclusion and
bounded bypass (where the bound is f (n) = 1).
Preliminary remark for the proof The reasoning is based on the fact that the
three registers FLAG[i], FLAG[ j], and AFTER_YOU are atomic. As we have seen
when presenting the atomicity concept (Sect. 2.1.1), this allows us to reason as if at
most one read or write operation on any of these registers occurs at a time.
Proof Proof of the mutual exclusion property.
Let us assume by contradiction that both pi and p j are inside the critical section.
Hence, both have executed acquire_mutex() and we have then FLAG[i] = up,
FLAG[ j] = up and AFTER_YOU = j (if AFTER_YOU = i, the reasoning is the
same after having exchanged i and j). According to the predicate that allowed pi to
enter the critical section, there are two cases.
(Figure: pi executes FLAG[i] ← up and then AFTER_YOU ← i, while p j executes
AFTER_YOU ← j; at the current time AFTER_YOU = j and FLAG[i] = FLAG[ j] = up;
p j executes its wait statement afterwards.)
As p j executes the wait statement after writing j into AFTER_YOU and pi read
j from AFTER_YOU, it follows that p j cannot read down from FLAG[i] when it
executes the wait statement. This contradicts the assumption that p j is inside the
critical section.
(Figure: space-time diagram in which pi executes FLAG[i] ← up and then
AFTER_YOU ← i, while p j executes FLAG[ j] ← down, then FLAG[ j] ← up, and then
AFTER_YOU ← j; pi does not read FLAG[ j] in the meantime.)
which allows it to enter the critical section. For 1 ≤ x < n −1, FLAG_LEVEL[i] = x
means that pi is trying to enter level x + 1.
Moreover, to eliminate possible deadlocks at any level ℓ, 0 < ℓ < n − 1 (such as
the deadlock that can occur in the algorithm of Fig. 2.3), the processes use a second
array of atomic registers AFTER_YOU[1..(n − 1)] such that AFTER_YOU[ℓ] keeps
track of the last process that has entered level ℓ.
More precisely, a process pi executes a for loop to progress from one level to
the next one, starting from level 1 and finishing at level n − 1. At each level the
two-process solution is used to block a process (if needed). The predicate that allows
a process to progress from level ℓ, 0 < ℓ < n − 1, to level ℓ + 1 is similar to the
one of the two-process algorithm. More precisely, pi is allowed to progress to level
ℓ + 1 if, from its point of view,
• Either all the other processes are at a lower level (i.e., ∀ k ≠ i: FLAG_LEVEL
[k] < ℓ).
• Or it is not the last one that entered level ℓ (i.e., AFTER_YOU[ℓ] ≠ i).
Let us notice that the predicate used in the wait statement of line 4 involves all but one
of the atomic registers FLAG_LEVEL[·] plus the atomic register AFTER_YOU[ℓ].
As these registers cannot be read in a single atomic step, the predicate is repeatedly
evaluated asynchronously on each register.
When all processes compete for the critical section, at most (n − 1) processes can
concurrently be winners at level 1, (n − 2) processes can concurrently be winners
at level 2, and more generally (n − ℓ) processes can concurrently be winners at
level ℓ. Hence, there is a single winner at level (n − 1).
The code of the operation release_mutex(i) is similar to the one of the two-process
algorithm: a process pi resets FLAG_LEVEL[i] to its initial value 0 to indicate that
it is no longer interested in the critical section.
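The n-process algorithm can be sketched in the same way (again relying on CPython's one-bytecode-at-a-time execution as a stand-in for atomic registers; the number of processes and the iteration counts are arbitrary):

```python
import threading
import time

N = 3                                  # number of processes (arbitrary)
FLAG_LEVEL = [0] * N                   # FLAG_LEVEL[i] = level attained by p_i (0 = not competing)
AFTER_YOU = [0] * N                    # AFTER_YOU[l] = last process that entered level l

def acquire_mutex(i):
    for lvl in range(1, N):            # progress from level 1 to level N-1
        FLAG_LEVEL[i] = lvl
        AFTER_YOU[lvl] = i
        # wait until all other processes are at a lower level, or p_i is no
        # longer the last process that entered this level; the registers are
        # read one at a time, asynchronously
        while not (all(FLAG_LEVEL[k] < lvl for k in range(N) if k != i)
                   or AFTER_YOU[lvl] != i):
            time.sleep(0)              # yield so other threads can progress

def release_mutex(i):
    FLAG_LEVEL[i] = 0                  # back to level 0: no longer interested

counter = 0

def worker(i):
    global counter
    for _ in range(1000):
        acquire_mutex(i)
        counter += 1                   # critical section
        release_mutex(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(N)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                         # always 3000
```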
Theorem 2 The algorithm described in Fig. 2.8 satisfies mutual exclusion and
starvation-freedom.
Proof Initially, a process pi is such that FLAG_LEVEL[i] = 0 and we say that it is
at level 0. Let ℓ ∈ [1..(n − 1)]. We say that a process pi has “attained” level ℓ (or,
from a global state point of view, “is” at level ℓ) if it has exited the wait statement
of the ℓth loop iteration. Let us notice that, after it has set its loop index to α > 0
and until it exits the wait statement of the corresponding iteration, that process is at
level α − 1. Moreover, a process that attains level ℓ has also attained the levels ℓ′
with 0 ≤ ℓ′ ≤ ℓ ≤ n − 1, and consequently it is also at these levels ℓ′.
The proof of the mutual exclusion property amounts to showing that at most one
process is at level (n − 1). This is a consequence of the following claim when we
consider ℓ = n − 1.
Claim. For ℓ, 0 ≤ ℓ ≤ n − 1, at most n − ℓ processes are at level ℓ.
The proof of this claim is by induction on the level ℓ. The base case ℓ = 0 is
trivial. Assuming that the claim is true up to level ℓ − 1, i.e., at most n − (ℓ − 1)
(Fig. 2.9: the order of the read and write operations issued by px and p y on the
atomic registers.)
processes are simultaneously at level ℓ − 1, we have to show that at least one process
does not progress to level ℓ. The proof is by contradiction: let us assume that n − ℓ + 1
processes are at level ℓ.
Let px be the last process that wrote its identity into AFTER_YOU[ℓ] (hence,
AFTER_YOU[ℓ] = x). When considering the sequence of read and write operations
executed by every process, and the fact that these operations are on atomic registers,
this means that, for any of the n − ℓ other processes p y that are at level ℓ, these
operations appear as if they have been executed in the following order, where the
first two operations are issued by p y while the last two operations are issued by px
(Fig. 2.9):
1. FLAG_LEVEL[y] ← ℓ is executed before AFTER_YOU[ℓ] ← y (sequentiality
of p y ),
2. AFTER_YOU[ℓ] ← y is executed before AFTER_YOU[ℓ] ← x (assumption:
definition of px ),
3. AFTER_YOU[ℓ] ← x is executed before r ← FLAG_LEVEL[y] (sequentiality
of px ; r is px ’s local variable storing the last value read from FLAG_LEVEL[y]
before px exits the wait statement at level ℓ).
It follows from this sequence that r = ℓ. Consequently, as AFTER_YOU[ℓ] = x,
px exited the wait statement of the ℓth iteration because ∀ k ≠ x : FLAG_LEVEL
[k] < ℓ. But this is contradicted by the fact that we had then FLAG_LEVEL[y] = ℓ,
which concludes the proof of the claim.
The proof of the starvation-freedom property is by induction on the levels, starting
from level n − 1 and proceeding until level 1. The base case ℓ = n − 1 follows from
the previous claim: if there is a process at level (n − 1), it is the only process at that
level and it can exit the for loop. This process eventually enters the critical section
(which, by assumption, it will leave later). The induction assumption is the following:
each process that attains a level ℓ′ such that n − 1 ≥ ℓ′ ≥ ℓ eventually enters the
critical section.
The rest of the proof is by contradiction. Let us assume that ℓ is such that there is
a process (say px) that remains blocked forever in the wait statement during its ℓth
2.1 Mutex Based on Atomic Read/Write Registers 25
iteration (hence, px cannot attain level ℓ). It follows that, each time px evaluates the
predicate controlling the wait statement, we have
(∃ k ≠ x : FLAG_LEVEL[k] ≥ ℓ) ∧ (AFTER_YOU[ℓ] = x)
(let us remember that the atomic registers are read one at a time, asynchronously,
and in any order). There are two cases.
• Case 1: There is a process py that eventually executes AFTER_YOU[ℓ] ← y.
As only px can execute AFTER_YOU[ℓ] ← x, there is eventually a read of
AFTER_YOU[ℓ] that returns a value different from x, and this read allows px
to progress to level ℓ. This contradicts the assumption that px remains blocked
forever in the wait statement during its ℓth iteration.
• Case 2: No process py eventually executes AFTER_YOU[ℓ] ← y.
The other processes can be partitioned into two sets: the set G that contains the
processes at a level greater than or equal to ℓ, and the set L that contains the processes
at a level smaller than ℓ.
As the predicate AFTER_YOU[ℓ] = x remains forever true, it follows that no
process py in L enters the ℓth loop iteration (otherwise py would necessarily
execute AFTER_YOU[ℓ] ← y, contradicting the case assumption).
On the other hand, due to the induction assumption, all processes in G eventually
enter (and later leave) the critical section. When this has occurred, these
processes have moved from the set G to the set L and then the predicate
∀ k ≠ x : FLAG_LEVEL[k] < ℓ becomes true.
When this has happened, the values returned by the asynchronous reading of
FLAG_LEVEL[1..n] by px allow it to attain level ℓ, which contradicts the assumption
that px remains blocked forever in the wait statement during its ℓth iteration.
In both cases the assumption that a process remains blocked forever at level ℓ is
contradicted, which completes the proof of the induction step and concludes the
proof of the starvation-freedom property.
an arbitrarily long “sleeping” period (this is possible as the processes are asynchronous)
and consequently does not read AFTER_YOU[1] = 1 (which would allow it to
progress to the second level). Differently, p2 progresses to the second level and
enters the critical section. Later, p2 first invokes release_mutex() and immediately
after invokes acquire_mutex() again, updating AFTER_YOU[1] to 2. While p3 keeps
on “sleeping”, p1 progresses to level 2 and finally enters the critical section. This
scenario can be reproduced an arbitrary number of times until p3 wakes up. When this
occurs, p3 reads from AFTER_YOU[1] a value different from 3, and consequently
progresses to level 2. Hence:
• Due to asynchrony, a “sleeping period” can be arbitrarily long, and a process can
consequently lose an arbitrary number of competitions with respect to the other
processes,
• But, as a process does not sleep forever, it eventually progresses to the next level.
It is important to notice that, as shown in the proof of the bounded bypass property of
Theorem 1, this scenario cannot happen when n = 2.
Atomic register: size and number It is easy to see that the algorithm uses
2n − 1 atomic registers. The domain of each of the n registers FLAG_LEVEL[i] is
[0..(n − 1)], while the domain of each of the n − 1 registers AFTER_YOU[ℓ] is [1..n].
Hence, in both cases, ⌈log2 n⌉ bits are necessary and sufficient for each atomic
register.
Number of accesses to atomic registers Let us define the time complexity of a
mutex algorithm as the number of accesses to atomic registers for one use of the
critical section by a process.
It is easy to see that this cost is finite but not bounded when there is contention
(i.e., when several processes simultaneously compete to execute the critical section
code).
Differently in a contention-free scenario (i.e., when only one process pi wants to
use the critical section), the number of accesses to atomic registers is (n − 1)(n + 2)
in acquire_mutex(i) and one in release_mutex(i).
The case of k-exclusion This is the k-mutual exclusion problem where the critical
section code can be concurrently accessed by up to k processes (mutual exclusion
corresponds to the case where k = 1).
Peterson’s n-process algorithm can easily be modified to solve k-mutual exclusion.
The upper bound of the for loop (namely (n−1)) has simply to be replaced by (n−k).
No other statement modification is required. Moreover, let us observe that the size
of the array AFTER_YOU can then be reduced to [1..(n − k)].
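As an illustration, the entry and exit code just discussed can be sketched as follows in Python. This is only a sketch: it assumes that CPython's global interpreter lock makes each read and write of a list cell behave like an atomic register, and the variable names mirror those of the text. Replacing the loop bound n − 1 by n − k yields the k-exclusion variant described above.

```python
import threading

N = 3                   # number of processes
FLAG_LEVEL = [0] * N    # FLAG_LEVEL[i]: highest level attained by pi (0: not competing)
AFTER_YOU = [0] * N     # AFTER_YOU[lev]: last process to have entered level lev

def acquire_mutex(i):
    for lev in range(1, N):        # levels 1 .. n-1 (use n-k for k-exclusion)
        FLAG_LEVEL[i] = lev
        AFTER_YOU[lev] = i
        # wait until all other processes are below lev, or another
        # process wrote AFTER_YOU[lev] after pi did
        while AFTER_YOU[lev] == i and \
              any(FLAG_LEVEL[k] >= lev for k in range(N) if k != i):
            pass

def release_mutex(i):
    FLAG_LEVEL[i] = 0
```

A process that reaches the end of the for loop has eliminated one competitor per level and is alone at level n − 1, hence inside the critical section.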
being able to access the critical section. Said differently, it has to execute n − 1 loop
iterations (eliminating another process at each iteration), and consequently, the cost
(measured in number of accesses to atomic registers) in a contention-free scenario
is O(n) × the cost of one loop iteration, i.e., O(n²). Hence a natural question is the
following: Is it possible to reduce this cost and (if so) how?
Tournament tree A simple principle to reduce the number of shared memory
accesses is to use a tournament tree. Such a tree is a complete binary tree. To simplify
the presentation, we consider that the number of processes is a power of 2, i.e., n = 2^k
(hence k = log2 n). If n is not a power of two, it has to be replaced by n′ = 2^k′
where k′ = ⌈log2 n⌉ (i.e., n′ is the smallest power of 2 such that n′ > n).
Such a tree for n = 2^3 processes p1, . . . , p8 is represented in Fig. 2.10. Each
node of the tree is any two-process starvation-free mutex algorithm, e.g., Peterson’s
two-process algorithm. It is even possible to associate different two-process mutex
algorithms with different nodes. The important common feature of these algorithms
is that any of them assumes that it is used by two processes whose identities are 0
and 1.
As we have seen previously, any two-process mutex algorithm implements a lock
object. Hence, we consider in the following that the tournament tree is a tree of (n−1)
locks and we accordingly adopt the lock terminology. The locks are kept in an array
denoted LOCK[1..(n − 1)], and, for x ≠ y, LOCK[x] and LOCK[y] are independent
objects (the atomic registers used to implement LOCK[x] and the atomic registers
used to implement LOCK[y] are different).
The lock LOCK[1] is associated with the root of the tree, and if it is not a leaf, the
node associated with the lock LOCK[x] has two children associated with the locks
LOCK[2x] and LOCK[2x + 1].
According to its identity i, each process pi starts competing with a single other
process p j to obtain a lock that is a leaf of the tree. Then, when it wins, the process
pi proceeds to the next level of the tree to acquire the lock associated with the node
that is the father of the node currently associated with pi (initially the leaf node
associated with pi ). Hence, a process competes to acquire all the locks on the path
from the leaf it is associated with until the root node.
As (a) the length of such a path is
log2 n and (b) the cost to obtain a lock
associated with a node is O(1) in contention-free scenarios, it is easy to see that
the number of accesses to atomic registers in these scenarios is O(log2 n) (it
is exactly 4 log2 n when each lock is implemented with Peterson’s two-process
algorithm).
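The tournament-tree principle can be sketched as follows in Python. As an assumption of this sketch, each node's two-process lock is stood in for by a threading.Lock object (the text allows any two-process starvation-free mutex algorithm at a node), and processes are numbered 0..n − 1 instead of 1..n.

```python
import threading

N = 8                                       # number of processes, a power of 2
# LOCK[1..N-1]: one lock per tree node, LOCK[1] being the root; the children
# of the node of LOCK[x] are the nodes of LOCK[2x] and LOCK[2x + 1].
LOCK = [None] + [threading.Lock() for _ in range(N - 1)]

def path_to_root(i):
    """The nodes at which pi competes, from its leaf-level node up to the root."""
    node = (N + i) // 2          # leaf-level node assigned to pi (shared with one other process)
    path = []
    while node >= 1:
        path.append(node)
        node //= 2
    return path

def acquire_mutex(i):
    for node in path_to_root(i):             # win the locks bottom-up
        LOCK[node].acquire()

def release_mutex(i):
    for node in reversed(path_to_root(i)):   # release them top-down
        LOCK[node].release()
```

Each process acquires exactly log2 n locks (here three), which is the source of the O(log2 n) contention-free cost stated above.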
Remark Let us consider the case where each algorithm implementing an under-
lying two-process lock object uses a bounded number of bounded atomic regis-
ters (which is the case for Peterson’s two-process algorithm). In that case, as the
tournament-based algorithm uses (n−1) lock objects, it follows that it uses a bounded
number of bounded atomic registers.
Let us observe that this tournament-based algorithm has better time complexity
than Peterson’s n-process algorithm.
algorithm is O(n²) while the cost of the tournament tree-based algorithm is O(log2 n).
Hence, a natural question is the following: Is it possible to design a fast n-process
mutex algorithm, where fast means that the cost of the algorithm is constant in a
contention-free scenario?
The next section of this chapter answers this question positively. To that end, an
incremental presentation is adopted. A simple one-shot operation is first presented.
Each of its invocations returns a value r to the invoking process, where r is the value
abort or the value commit. Then, the next section enriches the algorithm implementing
this operation to obtain a deadlock-free fast mutual exclusion algorithm due
to L. Lamport (1987).
Concurrency-abortable operation A concurrency-abortable (also named
contention-abortable and usually abbreviated abortable) operation is an operation that is
allowed to return the value abort in the presence of concurrency. Otherwise, it has to
return the value commit. More precisely, let conc_abort_op() be such an operation.
Assuming that each process invokes it at most once (one-shot operation), the set of
invocations satisfies the following properties:
• Obligation. If the first process which invokes conc_abort_op() is such that its
invocation occurs in a concurrency-free pattern (i.e., no other process invokes
conc_abort_op() during its invocation), this process obtains the value commit.
• At most one. At most one process obtains the value commit.
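Fig. 2.12 itself is not reproduced in this excerpt; the following Python sketch is consistent with the line numbers cited in the surrounding discussion (X written at line 1, Y read at line 2 and written at line 4 and never reset to ⊥, X re-read at line 5). It assumes the GIL makes each read and write of the global registers atomic.

```python
BOT = None    # models the default value ⊥

X = None      # atomic MWMR register, written at line 1
Y = BOT       # atomic MWMR register, never reset to BOT by this operation

def conc_abort_op(i):
    """One-shot concurrency-abortable operation executed by process pi."""
    global X, Y
    X = i                    # line 1
    if Y != BOT:             # line 2: pi is "late": some invocation
        return "abort1"      # line 3   has already executed line 4
    Y = i                    # line 4
    if X == i:               # line 5: no other process wrote X after pi
        return "commit"      # line 6
    return "abort2"          # line 7
```

A solo invocation necessarily takes the path through lines 1, 2, 4, 5, and 6, which is the obligation property stated above.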
• Let us first consider the (possibly empty) set Q of processes pj that read Y at line
2 after this register was written by pi or another process (let us notice that, due to
the atomicity of the registers X and Y, the notion of after/before is well defined).
As Y is never reset to ⊥, it follows that each process pj ∈ Q obtains a non-⊥
value from Y and consequently executes return(abort1) at line 3.
Fig. 2.13 Access pattern to X and Y for a successful conc_abort_op() invocation by process pi
The next corollary follows from the proof of the previous theorem.
Corollary 1 (Y ≠ ⊥) ⇒ (a process has obtained the value commit or several
processes have invoked conc_abort_op()).
Remark: splitter object When we (a) replace the values commit, abort1, and
abort2 by stop, right, and left, respectively, and (b) rename the operation
Principle and description This section presents L. Lamport’s fast mutex algo-
rithm, which is built from the previous one-shot concurrency-abortable operation.
More specifically, this algorithm behaves similarly to the algorithm of Fig. 2.12 in
contention-free scenarios and (instead of returning abort) guarantees the deadlock-
freedom liveness property when there is contention.
The algorithm is described in Fig. 2.14. The line numbering is the same as in
Fig. 2.12: the lines with the same number are the same in both algorithms, line N0 is
new, line N3 replaces line 3, lines N7.1–N7.5 replace line 7, and line N10 is new.
To attain its goal (both fast mutex and deadlock-freedom) the algorithm works as
follows. First, each process pi manages a SWMR flag FLAG[i] (initialized to down)
that pi sets to up to indicate that it is interested in the critical section (line N0). This
flag is reset to down when pi exits the critical section (line N10). As we are about
to see, it can be reset to down also in other parts of the algorithm.
According to the contention scenario in which a process pi returns abort in the
algorithm of Fig. 2.12, there are two cases to consider, which have been differentiated
by the values abort1 and abort2.
• Eliminating abort1 (line N3).
In this case, as we have seen in Fig. 2.12, process pi is “late”. As captured by
Corollary 1, this is because there are other processes that currently compete for
the critical section or there is a process inside the critical section. Line 3 of Fig. 2.12
is consequently replaced by the following statements (new line N3):
– Process pi first resets its flag to down in order not to prevent other processes
from entering the critical section (if no other process is currently inside it).
– According to Corollary 1, it is useless for pi to retry entering the critical section
while Y = ⊥. Hence, process pi delays its request for the critical section until
Y = ⊥.
• Eliminating abort2 (lines N7.1–N7.5).
In this case, as we have seen in the base contention-abortable algorithm (Fig. 2.12),
several processes are competing for the critical section (or a process is already
inside the critical section). Differently from the base algorithm, one of the com-
peting processes has now to be granted the critical section (if no other process is
inside it). To that end, in order not to prevent another process from entering the
critical section, process pi first resets its flag to down (line N7.1). Then, pi tries
to enter the critical section. To that end, it first waits until all flags are down (line
N7.2). Then, pi checks the value of Y (line N7.3). There are two cases:
– If Y = i, process pi enters the critical section. This is due to the following
reason.
Let us observe that, if Y = i when pi reads it at line N7.3, then no process has
modified Y since pi set it to the value i at line 4 (the write of Y at line 4 and its
reading at line N7.3 follow the same access pattern as the write of X at line 1 and
its reading at line 5). Hence, process pi is the last process to have executed line
4. It then follows that, as it has (asynchronously) seen each flag equal to down
(line N7.2), process pi is allowed to enter the critical section (return() statement
at line N7.3).
– If Y ≠ i, process pi does the same as what is done at line N3. As it has already
set its flag to down, it has only to wait until the critical section is released before
retrying to enter it (line N7.4). (Let us remember that the only place where Y is
reset to ⊥ is when a process releases the critical section.)
Fast path and slow path The fast path to enter the critical section is when pi
executes only the lines N0, 1, 2, 4, 5, and 6. The fast path is open for a process pi
if it reads i from X at line 5. This is the path that is always taken by a process in
contention-free scenarios.
The cost of the fast path is five accesses to atomic registers. As release_mutex()
requires two accesses to atomic registers, it follows that the cost of a single use of the
critical section in a contention-free scenario is seven accesses to atomic registers.
The slow path is the path taken by a process which does not take the fast path.
Its cost in terms of accesses to atomic registers depends on the current concurrency
pattern.
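Putting the pieces together, the entry and exit code of Fig. 2.14 can be sketched as follows, with the same line labels as in the text. As before, the sketch assumes that the GIL makes each read and write of the shared variables behave like an atomic register.

```python
import threading

N = 3
BOT = None              # models the default value ⊥
X = None                # atomic MWMR register
Y = BOT                 # atomic MWMR register
FLAG = ["down"] * N     # FLAG[i]: SWMR flag of process pi

def acquire_mutex(i):
    global X, Y
    while True:
        FLAG[i] = "up"                    # line N0
        X = i                             # line 1
        if Y != BOT:                      # line 2
            FLAG[i] = "down"              # line N3: pi is late; step aside
            while Y != BOT:               #   and wait until the critical
                pass                      #   section is released
            continue                      #   before retrying
        Y = i                             # line 4
        if X == i:                        # line 5
            return                        # line 6: fast path
        FLAG[i] = "down"                  # line N7.1: slow path
        for j in range(N):                # line N7.2: wait until all
            while FLAG[j] != "down":      #   flags are down
                pass
        if Y == i:                        # line N7.3: pi wrote Y last
            return
        while Y != BOT:                   # line N7.4: wait for the release,
            pass                          #   then retry (line N7.5)

def release_mutex(i):
    global Y
    Y = BOT                               # reopen the entry
    FLAG[i] = "down"                      # line N10
```

In a solo run, acquire_mutex() executes exactly lines N0, 1, 2, 4, 5, and 6: the five shared accesses of the fast path counted above.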
A few remarks A register FLAG[i] is set to down when pi exits the critical section
(line N10) but also at line N3 or N7.1. It is consequently possible for a process pk to
be inside the critical section while all flags are down. But let us notice that, when this
occurs, the value of Y is different from ⊥, and as already indicated, the only place
where Y is reset to ⊥ is when a process releases the critical section.
When executed by a process pi, the aim of the wait statement at line N3 is to
allow any other process pj to see that pi has set its flag to down. Without such a
wait statement, a process pi could loop forever executing the lines N0, 1, 2, and N3
and could thereby favor a livelock by preventing the other processes from seeing
FLAG[i] = down.
Theorem 6 Lamport’s fast mutex algorithm satisfies mutual exclusion and
deadlock-freedom.
Proof Let us first consider the mutual exclusion property. Let pi be a process that
is inside the critical section. Trivially, we have then Y ≠ ⊥ and pi returned from
acquire_mutex() at line 6 or at line N7.3. Hence, there are two cases. Before consid-
ering these two cases, let us first observe that each process (if any) that reads Y after
it was written by pi (or another process) executes line N3: it resets its flag to down
and waits until Y = ⊥ (i.e., at least until pi exits the critical section, line N10). As
the processes that have read a non-⊥ value from Y at line 2 cannot enter the critical
section, it follows that we have to consider only the processes pj that have read ⊥
from Y at line 2.
It then follows from the fact that pi is the last process which wrote into X and τj2 > τi1
that pj reads i from X at line 5 and consequently does enter the repeat loop again
and waits until Y = ⊥. The mutual exclusion property follows.
Proof of the deadlock-freedom property. This is an immediate consequence of
the fact that, among the processes that have concurrently invoked the operation
acquire_mutex(), the last process that writes X ( pi in the previous reasoning) reads
its own identity from X at line 4.
Short discussion The main property of this algorithm is its simplicity. Moreover,
its code is independent of the number of processes.
The previous section presented mutual exclusion algorithms based on atomic read/
write registers. These algorithms are important because understanding their design
and their properties provides us with precise knowledge of the difficulty and subtleties
2.2 Mutex Based on Specialized Hardware Primitives 39
that have to be addressed when one has to solve synchronization problems. These
algorithms capture the essence of synchronization in a read/write shared memory
model.
Nearly all shared memory multiprocessors propose built-in primitives (i.e., atomic
operations implemented in hardware) specially designed to address synchroniza-
tion issues. This section presents a few of them (the ones that are the most
popular).
does not modify its local variable r between acquire_mutex() and release_mutex()
(or, equivalently, that it sets r to 1 before invoking release_mutex()). The test&set-
based algorithm and the swap-based algorithm are actually the very same algorithm.
Let ri be the local variable used by each process pi. Due to the atomicity property
and the “exchange of values” semantics of the swap() primitive, it is easy to see that
the swap-based algorithm is characterized by the invariant X + Σ1≤i≤n ri = 1.
The compare&swap() primitive Let X be a shared register and old and new
be two values. The semantics of the primitive X.compare&swap(old, new), which
returns a Boolean value, is defined by the following code that is assumed to be
executed atomically.
X.compare&swap(old, new) is
if (X = old) then X ← new; return(true)
else return(false)
end if.
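A spin lock built on compare&swap illustrates the use of this primitive. In the sketch below, the atomicity of X.compare&swap() is modeled with an internal Python lock; this internal lock is an artifact of the sketch, not part of the hardware primitive's definition.

```python
import threading

class CasRegister:
    """A register providing compare&swap; atomicity is modeled with a lock."""
    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_swap(self, old, new):
        with self._lock:                  # executed atomically
            if self._value == old:
                self._value = new
                return True
            return False

    def write(self, v):
        with self._lock:
            self._value = v

X = CasRegister(0)          # 0: critical section free, 1: taken

def acquire_mutex(i):
    while not X.compare_and_swap(0, 1):   # spin until the swap succeeds
        pass

def release_mutex(i):
    X.write(0)
```

As the text observes, this lock is deadlock-free but not starvation-free: nothing prevents the same process from winning the compare&swap again and again.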
A problem due to asynchrony The previous primitives allow for the (simple)
design of algorithms that ensure mutual exclusion and deadlock-freedom. However,
these algorithms do not ensure starvation-freedom.
the process pTURN is the process that has priority and p(TURN mod n)+1 is the next
process that will have priority.
• When a process pi invokes acquire_mutex(i) it first raises its flag to inform the
other processes that it is interested in the critical section (line 1). Then, it waits
(repeated checks at line 2) until it has priority (predicate TURN = i) or the process
that is currently given the priority is not interested (predicate FLAG[TURN] =
down). Finally, as soon as it can proceed, it invokes LOCK.acquire_lock(i)
in order to obtain the underlying lock (line 3). (Let us remember that reading
FLAG[TURN] requires two shared memory accesses.)
• When a process pi invokes release_mutex(i), it first resets its flag to down
(line 5). Then, if (from pi’s point of view) the process that is currently given priority
is not interested in the critical section (i.e., the predicate FLAG[TURN] = down
is satisfied), pi makes TURN progress to the next process on the ring (line 6)
before releasing the underlying lock (line 7).
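These two operations can be sketched as follows. Assumptions of the sketch: process identities are 1..n, and a plain threading.Lock stands in for the underlying deadlock-free lock LOCK of the text.

```python
import threading

N = 3
FLAG = ["down"] * (N + 1)   # FLAG[1..N]; entry 0 is unused
TURN = 1                    # pTURN is the process currently given priority
LOCK = threading.Lock()     # stand-in for any deadlock-free mutex lock

def acquire_mutex(i):
    FLAG[i] = "up"                                      # line 1
    while not (TURN == i or FLAG[TURN] == "down"):      # line 2
        pass
    LOCK.acquire()                                      # line 3

def release_mutex(i):
    global TURN
    FLAG[i] = "down"                    # line 5
    if FLAG[TURN] == "down":            # line 6: pass the priority
        TURN = (TURN % N) + 1           #   to the next process on the ring
    LOCK.release()                      # line 7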
Remark 1 Let us observe that the modification of TURN by a process pi is always
done in the critical section (line 6). This is due to the fact that pi modifies TURN
after it has acquired the underlying mutex lock and before it has released it.
Remark 2 Let us observe that a process pi can stop waiting at line 2 because it finds
TURN = i while another process p j increases TURN to ((i + 1) mod n) because it
does not see that FLAG[i] has been set to up. This situation is described in Fig. 2.21.
Theorem 8 Assuming that the underlying mutex lock LOCK is deadlock-free, the
algorithm described in Fig. 2.20 builds a starvation-free mutex lock.
Proof We first claim that, if at least one process invokes acquire_mutex(), then
at least one process invokes LOCK.acquire_lock() (line 3) and enters the critical
section.
(Fig. 2.21: on pj’s side, pj reads FLAG[i] = down, reads TURN = i, and then, executing line 6, updates TURN to ((i + 1) mod n))
Proof of the claim. Let us first observe that, if processes invoke LOCK.acquire_
lock(), one of them enters the critical section (this follows from the fact that the
lock is deadlock-free). Hence, X being the non-empty set of processes that invoke
acquire_mutex(), let us assume by contradiction that no process of X terminates
the wait statement at line 2. It follows from the waiting predicate that TURN ∉ X
and FLAG[TURN] = up. But FLAG[TURN] = up implies TURN ∈ X, which
contradicts the previous waiting predicate and concludes the proof of the claim.
Let pi be a process that has invoked acquire_mutex(). We have to show that
it enters the critical section. Due to the claim, there is a process pk that holds the
underlying lock. If pk is pi, the theorem follows; hence let pk ≠ pi. When pk exits
the critical section it executes line 6. Let TURN = j when pk reads it. We consider
two cases:
1. FLAG[j] = up. Let us observe that pj is the only process that can write into
FLAG[j] and that it will do so at line 5 when it exits the critical section. Moreover,
as TURN = j, pj is not blocked at line 2 and consequently invokes
LOCK.acquire_lock() (line 3).
We first show that eventually pj enters the critical section. Let us observe that
all the processes which invoke acquire_mutex() after FLAG[j] was set to up
and TURN was set to j remain blocked at line 2 (Observation OB). Let Y be
the set of processes that compete with pj for the lock, with y = |Y|. We have
0 ≤ y ≤ n − 1. It follows from observation OB and the fact that the lock is
deadlock-free that the number of processes that compete with pj decreases from
y to y − 1, y − 2, etc., until pj obtains the lock and executes line 5 (in the worst
case, pj is the last of the y processes to obtain the lock).
If pi is pj or a process that has obtained the lock before pj, the theorem follows
from the previous reasoning. Hence, let us assume that pi has not obtained the
lock. After pj has obtained the lock, it eventually executes lines 5 and 6. As
TURN = j and pj sets FLAG[j] to down, it follows that pj updates the register
TURN to ℓ = (j mod n) + 1. The previous reasoning, where k and j are replaced
by j and ℓ, is then applied again.
Fast starvation-free mutual exclusion Let us consider the case where a process pi
wants to enter the critical section, while no other process is interested in entering it.
We have the following:
• The invocation of acquire_mutex(i) requires at most three accesses to the shared
memory: one to set the register FLAG[i] to up, one to read TURN and save it in a
local variable turn, and one to read FLAG[turn].
• Similarly, the invocation by pi of release_mutex(i) requires at most four accesses
to the shared memory: one to reset FLAG[i] to down, one to read TURN and save
it in a local variable turn, one to read FLAG[turn], and a last one to update TURN.
It follows from this observation that the stacking of the algorithm of Fig. 2.20
on top of the algorithm described in Fig. 2.14 (Sect. 2.1.7), which implements a
deadlock-free fast mutex lock, provides a fast starvation-free mutex algorithm.
2.2.3 Fetch&Add
Let us observe that, while NEXT is an atomic MWMR register, the operation
NEXT ← NEXT + 1 is not atomic. It is easy to see that no increase of NEXT can be
missed. This follows from the fact that the increase statement NEXT ← NEXT + 1
appears in the operation release_mutex(), which is executed by a single process at a
time.
The mutual exclusion property follows from the uniqueness of each ticket number,
and the starvation-freedom property follows from the fact that the ticket numbers are
defined from a sequence of consecutive known values (here the increasing sequence
of positive integers).
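The fetch&add-based ticket algorithm discussed above (its figure is not reproduced in this excerpt) can be sketched as follows. As in the text, the increase of NEXT is protected only by the fact that it is executed inside the critical section; the atomicity of fetch&add itself is modeled here with an internal lock, an artifact of the sketch.

```python
import threading

class FetchAddRegister:
    """A register providing fetch&add; atomicity is modeled with a lock."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, delta=1):
        with self._lock:                  # executed atomically
            previous = self._value
            self._value += delta
            return previous

TICKET = FetchAddRegister()   # next ticket number to be delivered
NEXT = 0                      # ticket number currently being served

def acquire_mutex(i):
    my_turn = TICKET.fetch_and_add(1)   # take a unique ticket number
    while NEXT != my_turn:              # wait until this number is served
        pass

def release_mutex(i):
    global NEXT
    NEXT = NEXT + 1   # not atomic, but executed by one process at a time
```

Uniqueness of the ticket numbers gives mutual exclusion, and their consecutive increasing order gives starvation-freedom, exactly as argued above.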
This section presents two mutex algorithms which rely on shared read/write registers
weaker than read/write atomic registers. In that sense, they implement atomicity
without relying on underlying atomic objects.
The algorithms described in this section rely on safe registers. As shown here, safe
registers are the weakest type of shared register that one can imagine while still being
useful in the presence of concurrency.
Like an atomic register, a safe register (or a regular register) R provides the processes
with a write operation denoted R.write(v) (or R ← v), where v is the value that is
written, and a read operation denoted R.read() (or local ← R, where local is a local variable
of the invoking process). Safe, regular, and atomic registers differ in the value returned
by a read operation invoked in the presence of concurrent write operations.
Let us remember that the domain of a register is the set of values that it can contain.
As an example, the domain of a binary register is the set {0, 1}.
SWMR safe register An SWMR safe register is a register whose read operation
satisfies the following properties (the notion of an MWMR safe register will be
introduced in Sect. 2.3.3):
• A read that is not concurrent with a write operation (i.e., their executions do not
overlap) returns the current value of the register.
• A read that is concurrent with one (or several consecutive) write operation(s) (i.e.,
their executions do overlap) returns any value that the register can contain.
It is important to see that, in the presence of concurrent write operations, a read can
return a value that has never been written. The returned value has only to belong to
the register domain. As an example, let the domain of a safe register R be {0, 1, 2, 3}.
Assuming that R = 0, let R.write(2) be concurrent with a read operation. This read
can return 0, 1, 2, or 3. It cannot return 4, as this value is not in the domain of R, but
can return the value 3, which has never been written.
A binary safe register can be seen as modeling a flickering bit. Whatever its
previous value, the value of the register can flicker during a write operation and
stabilizes to its final value only when the write finishes. Hence, a read that overlaps
with a write can arbitrarily return either 0 or 1.
SWMR regular register An SWMR regular register is an SWMR safe register
that satisfies the following property, which addresses read operations in the
presence of concurrency. It replaces the second item of the definition of a safe register.
• A read that is concurrent with one or several write operations returns the value of
the register before these writes or the value written by any of them.
An example of a regular register R (whose domain is the set {0, 1, 2, 3, 4}) written
by a process p1 and read by a process p2 is described in Fig. 2.23. As there is no
concurrent write during the first read by p2 , this read operation returns the current
value of the register R, namely 1. The second read operation is concurrent with three
write operations. It can consequently return any value in {1, 2, 3, 4}. If the register
was only safe, this second read could return any value in {0, 1, 2, 3, 4}.
Atomic register The notion of an atomic register was defined in Sect. 2.1.1. Due
to the total order on all its operations, an atomic register is more constrained (i.e.,
stronger) than a regular register.
(Fig. 2.23: p2’s two read operations return 1 and v, respectively)
(Fig. 2.24: p2’s five read operations return 1, a, b, 0, and c, respectively)
To illustrate the differences between safe, regular, and atomic, Fig. 2.24 presents
an execution of a binary register R and Table 2.1 describes the values returned by
the read operations when the register is safe, regular, and atomic. The first and third
read by p2 are issued in a concurrency-free context. Hence, whatever the type of the
register, the value returned is the current value of the register R.
• If R is safe, as the other read operations are concurrent with a write operation,
they can return any value (i.e., 0 or 1 as the register is binary). This is denoted 0/1
in Table 2.1.
It follows that there are eight possible correct executions when the register R is
safe for the concurrency pattern depicted in Fig. 2.24.
• If R is regular, each of the values a and b returned by the read operations which
are concurrent with R.write(0) can be 1 (the value of R before this concurrent
write operation) or 0 (the value that is written concurrently with the read operation).
Differently, the value c returned by the last read operation can only be 0 (because
the value that is written concurrently does not change the value of R).
It follows that there are only four possible correct executions when the register R
is regular.
• If R is atomic, there are only three possible executions, each corresponding to a
correct sequence of read and write invocations (“correct” means that the sequence
respects the real-time order of the invocations and is such that each read invocation
returns the value written by the immediately preceding write invocation).
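These counts can be checked mechanically. The sketch below encodes which values a read may return under the safe and regular semantics, and enumerates the executions for the concurrency pattern just discussed; the encoding of the pattern itself is an assumption reconstructed from the text.

```python
from itertools import product

DOMAIN = (0, 1)   # R is a binary register

def safe_read(prev, concurrent_writes):
    """A safe read overlapping a write may return any value of the domain."""
    return set(DOMAIN) if concurrent_writes else {prev}

def regular_read(prev, concurrent_writes):
    """A regular read returns the value before the overlapping writes,
    or a value written by one of them."""
    return {prev} | set(concurrent_writes)

def executions(read_semantics):
    # Pattern of Fig. 2.24: the reads returning a and b overlap R.write(0)
    # while R = 1; the read returning c overlaps R.write(0) while R = 0.
    a = read_semantics(1, [0])
    b = read_semantics(1, [0])
    c = read_semantics(0, [0])
    return list(product(a, b, c))
```

Enumerating the triples (a, b, c) yields eight executions under the safe semantics and four under the regular one, matching the counts stated above.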
Principle of the algorithm The mutex algorithm presented in this section is due to
L. Lamport (1974) who called it the mutex bakery algorithm. It was the first algorithm
ever designed to solve mutual exclusion on top of non-atomic registers, namely on
top of SWMR safe registers. The principle that underlies its design (inspired from
bakeries where a customer receives a number upon entering the store, hence the
algorithm name) is simple. When a process pi wants to acquire the critical section,
it acquires a number x that defines its priority, and the processes enter the critical
section according to their current priorities.
As there are no atomic registers, it is possible that two processes obtain the same
number. A simple way to establish an order for requests that have the same number
consists in using the identities of the corresponding processes. Hence, let a pair ⟨x, i⟩
define the identity of the current request issued by pi. A total order is defined for
the requests competing for the critical section as follows, where ⟨x, i⟩ and ⟨y, j⟩
are the identities of two competing requests; ⟨x, i⟩ < ⟨y, j⟩ means that the request
identified by ⟨x, i⟩ has priority over the request identified by ⟨y, j⟩, where “<” is
defined as the lexicographical ordering on pairs of integers, namely
⟨x, i⟩ < ⟨y, j⟩ ≡ (x < y) ∨ ((x = y) ∧ (i < j)).
Description of the algorithm Two SWMR safe registers, denoted FLAG[i] and
MY_TURN[i], are associated with each process pi (hence these registers can be read
by any process but written only by pi).
• MY_TURN[i] (which is initialized to 0 and reset to that value when pi exits the
critical section) is used to contain the priority number of pi when it wants to use the
critical section. The domain of MY_TURN[i] is the set of non-negative integers.
• FLAG[i] is a binary control variable whose domain is {down, up}. Initialized to
down, it is set to up by pi while it computes the value of its priority number
MY_TURN[i].
The sequence of values taken by FLAG[i] is consequently the regular expression
down(up, down)∗. The reader can verify that a binary safe register whose write
operations of down and up alternate behaves as a regular register.
The algorithm of a process pi is described in Fig. 2.25. When it invokes acquire_
mutex(), process pi enters a “doorway” (lines 1–3) in which it computes its turn
number MY_TURN[i] (line 2). To that end it selects a number greater than all
MY_TURN[j], 1 ≤ j ≤ n. It is possible that pi reads some MY_TURN[j] while it is
written by pj. In that case the value obtained from MY_TURN[j] can be any value.
Moreover, a process informs the other processes that it is computing its turn value by
raising its flag before this computation starts (line 1) and resetting it to down when
it has finished (line 3). Let us observe that a process is never delayed while in the
doorway, which means no process can direct another process to wait in the doorway.
2.3 Mutex Without Atomicity 49
After it has computed its turn value, a process pi enters a “waiting room” (lines
4–7) which consists of a for loop with one loop iteration per process pj. There are
two cases:
• If pj does not want to enter the critical section, we have FLAG[j] = down ∧
MY_TURN[j] = 0. In this case, pi proceeds to the next iteration without being
delayed by pj.
• Otherwise, pi waits until FLAG[j] = down (i.e., until pj has finished computing
its turn, line 5) and then waits until either pj has exited the critical section (predicate
MY_TURN[j] = 0) or pi’s current request has priority over pj’s (predicate
⟨MY_TURN[i], i⟩ < ⟨MY_TURN[j], j⟩).
When pi has priority with respect to each other process (these priorities being
checked in an arbitrary order, one after the other) it enters the critical section
(line 8).
Finally, when it exits the critical section, the only thing a process pi has to do is
to reset MY_TURN[i] to 0 (line 9).
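As an illustration, this algorithm can be transcribed almost line for line into Python threads. This is only a sketch, not the book’s Fig. 2.25 itself: the class and method names are ours, and it relies on CPython giving at least safe-register behavior for reads and writes of list elements (its global interpreter lock in fact gives much more):

```python
import threading

class Bakery:
    """Sketch of Lamport's bakery algorithm for n threads (busy-waiting)."""
    def __init__(self, n):
        self.n = n
        self.flag = [False] * n   # FLAG[i]; True stands for up
        self.turn = [0] * n       # MY_TURN[i]

    def acquire_mutex(self, i):
        # Doorway (lines 1-3): never blocks.
        self.flag[i] = True                       # line 1
        self.turn[i] = 1 + max(self.turn)         # line 2
        self.flag[i] = False                      # line 3
        # Waiting room (lines 4-7).
        for j in range(self.n):
            if j == i:
                continue
            while self.flag[j]:                   # line 5: p_j is in its doorway
                pass
            # line 6: wait until p_j is out (turn 0) or p_i has priority
            while self.turn[j] != 0 and (self.turn[j], j) < (self.turn[i], i):
                pass

    def release_mutex(self, i):
        self.turn[i] = 0                          # line 9
```

A usage pattern: each thread i brackets its critical section with `acquire_mutex(i)`/`release_mutex(i)`.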
Remark: process crashes Let us consider the case where a process may crash (i.e.,
stop prematurely). It is easy to see that the algorithm works despite this type of failure
if, after a process pi has crashed, its two registers FLAG[i] and MY_TURN[i] are
eventually reset to their initial values. When this occurs, the process pi is considered
as being no longer interested in the critical section.
A first in first out (FIFO) order As already indicated, the priority of a process
pi over a process pj is defined from the identities of their requests, namely the pairs
⟨MY_TURN[i], i⟩ and ⟨MY_TURN[j], j⟩. Moreover, let us observe that it is not
possible to predict the values of these pairs when pi and pj concurrently compute
the values of MY_TURN[i] and MY_TURN[j].
50 2 Solving Mutual Exclusion
Let us consider two processes pi and pj that have invoked acquire_mutex() and
where pi has executed its doorway part (line 2) before pj has started executing its
doorway part. We will see that the algorithm guarantees a FIFO order property defined
as follows: pi terminates its invocation of acquire_mutex() (and consequently enters
the critical section) before pj. This FIFO order property is an instance of the bounded
bypass liveness property with f(n) = n − 1.
Definitions The following time instant definitions are used in the proof of
Theorem 9. Let px be a process. Let us remember that, as the read and write operations
on the registers are not atomic, they cannot be abstracted as having been executed
instantaneously. Hence, when considering the execution of such an operation, its
starting time and its ending time are considered instead.
The number that appears in the following definitions corresponds to a line number
(i.e., to a register operation). Moreover, “b” stands for “beginning” while “e” stands
for “end”.
1. τ_e^x(1) is the time instant at which px terminates the assignment FLAG[x] ← up
(line 1).
2. τ_e^x(2) is the time instant at which px terminates the execution of line 2. Hence,
at time τ_e^x(2) the non-atomic register MY_TURN[x] contains the value used by
px to enter the critical section.
3. τ_b^x(3) is the time instant at which px starts the execution of line 3. This means that
a process that reads FLAG[x] during the time interval [τ_e^x(1)..τ_b^x(3)] necessarily
obtains the value up.
4. τ_b^x(5, y) is the time instant at which px starts its last evaluation of the waiting
predicate (with respect to FLAG[y]) at line 5. This means that px has obtained
the value down from FLAG[y].
5. Let us notice that, as it is the only process which writes into MY_TURN[x],
px can save its value in a local variable. This means that the reading of
MY_TURN[x] entails no access to the shared memory. Moreover, as far as a
register MY_TURN[y] (y ≠ x) is concerned, we consider that px reads it once
each time it evaluates the predicate of line 6.
τ_b^x(6, y) is the time instant at which px starts its last reading of MY_TURN[y].
Hence, the value turn it reads from MY_TURN[y] is such that (turn = 0) ∨
(⟨MY_TURN[x], x⟩ < ⟨turn, y⟩).
Lemma 1 Let pi and pj be two processes such that pi is inside the bakery (i.e.,
executing lines 4–9) before pj enters the doorway (line 2). Then MY_TURN[i] <
MY_TURN[j].
Proof Let turn_i be the value used by pi at line 6. As pi is in the bakery (i.e., exe-
cuting lines 4–9) before pj enters the doorway (line 2), it follows that MY_TURN[i]
was assigned the value turn_i before pj reads it at line 2. Hence, when pj reads the
safe register MY_TURN[i], there is no concurrent write and pj consequently obtains
the value turn_i. It follows that the value turn_j assigned by pj to MY_TURN[j] is
such that turn_j ≥ turn_i + 1, from which the lemma follows.
Lemma 2 Let pi and pj be two processes such that pi is inside the critical section
while pj is in the bakery. Then ⟨MY_TURN[i], i⟩ < ⟨MY_TURN[j], j⟩.
Proof Let us notice that, as pj is inside the bakery, it can be inside the critical
section.
As process pi is inside the critical section, it has read down from FLAG[j] at
line 5 (and exited the corresponding wait statement). It follows that, according to the
timing of this read of FLAG[j] that returned the value down to pi and the updates
of FLAG[j] by pj to up at line 1 or down at line 3 (the only lines where FLAG[j]
is modified), there are two cases to consider (Fig. 2.26).
As pi reads down from FLAG[j], we have either τ_b^i(5, j) < τ_e^j(1) or
τ_e^i(5, j) > τ_b^j(3) (see Fig. 2.26). This is because, if we had both τ_b^i(5, j) >
τ_e^j(1) and τ_e^i(5, j) < τ_b^j(3), the read by pi would be entirely contained in
the period during which FLAG[j] = up and is not being written, and pi would then
necessarily have read up from FLAG[j]. Let us consider each case:
• Case 1: τ_b^i(5, j) < τ_e^j(1) (left part of Fig. 2.26). In this case, process pi has
entered the bakery before process pj enters the doorway. It then follows from Lemma 1
that MY_TURN[i] < MY_TURN[j], which proves the lemma for this case.
• Case 2: τ_e^i(5, j) > τ_b^j(3) (right part of Fig. 2.26). As pj is sequential, we have
τ_e^j(2) < τ_b^j(3) (P1). Similarly, as pi is sequential, we also have τ_e^i(5, j) <
τ_b^i(6, j) (P2). Combining (P1), (P2), and the case assumption, namely τ_b^j(3) <
τ_e^i(5, j), we obtain
τ_e^j(2) < τ_b^j(3) < τ_e^i(5, j) < τ_b^i(6, j);
Fig. 2.26 The two cases where pj updates the safe register FLAG[j]
i.e., τ_e^j(2) < τ_b^i(6, j), from which we conclude that the last read of
MY_TURN[j] by pi occurred after the safe register MY_TURN[j] obtained its
value (say turn_j).
As pi is inside the critical section (lemma assumption), it exited the second wait
statement because (MY_TURN[j] = 0) ∨ (⟨MY_TURN[i], i⟩ < ⟨MY_TURN[j], j⟩).
Moreover, as pj was in the bakery before pi executed line 6 (τ_e^j(2) < τ_b^i(6, j)), we
have MY_TURN[j] = turn_j ≠ 0. It follows that we have ⟨MY_TURN[i], i⟩ <
⟨MY_TURN[j], j⟩, which terminates the proof of the lemma.
Theorem 9 Lamport’s bakery algorithm satisfies mutual exclusion and the bounded
bypass liveness property where f(n) = n − 1.
Proof Proof of the mutual exclusion property. The proof is by contradiction. Let
us assume that pi and pj (i ≠ j) are simultaneously inside the critical section. We
have the following:
• As pi is inside the critical section and pj is inside the bakery, we can apply Lemma
2. We then obtain: ⟨MY_TURN[i], i⟩ < ⟨MY_TURN[j], j⟩.
• Similarly, as pj is inside the critical section and pi is inside the bakery, applying
Lemma 2, we obtain: ⟨MY_TURN[j], j⟩ < ⟨MY_TURN[i], i⟩.
As i ≠ j, the pairs ⟨MY_TURN[j], j⟩ and ⟨MY_TURN[i], i⟩ are totally ordered.
It follows that each item contradicts the other, from which the mutex property follows.
Proof of the FIFO order liveness property. The proof first shows that the algo-
rithm is deadlock-free. It then shows that the algorithm satisfies the bounded
bypass property where f(n) = n − 1 (i.e., the FIFO order as defined on the pairs
⟨MY_TURN[x], x⟩).
The proof that the algorithm is deadlock-free is by contradiction. Let us assume
that processes have invoked acquire_mutex() and no process exits the waiting room
(lines 4–7). Let Q be this set of processes. (Let us notice that, for any other process
pj, we have FLAG[j] = down and MY_TURN[j] = 0.) As the number of processes
is bounded and no process has to wait in the doorway, there is a time after which
we have ∀j ∈ {1, . . . , n} : FLAG[j] = down, from which we conclude that no
process of Q can be blocked forever in the wait statement of line 5.
By construction, the pairs ⟨MY_TURN[x], x⟩ of the processes px ∈ Q are totally
ordered. Let ⟨MY_TURN[i], i⟩ be the smallest one. It follows that, eventually, when
evaluated by pi, the predicate associated with the wait statement of line 6 is satisfied
for any j. Process pi then enters the critical section, which contradicts the deadlock
assumption and proves that the algorithm is deadlock-free.
To show the FIFO order liveness property, let us consider a pair of processes pi
and pj that are competing for the critical section and such that pj wins and, after
exiting the critical section, it invokes acquire_mutex() again, executes its doorway,
and enters the bakery. Moreover, let us assume that pi is still waiting to enter the
critical section. Let us observe that we are then in the context defined in Lemma 1: pi
and pj are in the bakery and pi entered the bakery before pj enters the doorway.
We then have MY_TURN[i] < MY_TURN[j], from which we conclude that pj
cannot bypass pi again. As there are n processes, in the worst case pi is competing
with all other processes. Due to the previous observation and the fact that there is
no deadlock, it can lose at most n − 1 competitions (one with respect to each other
process pj, which enters the critical section before pi), which proves the bounded
bypass liveness property with f(n) = n − 1.
This section presents a second mutex algorithm which does not require underlying
atomic registers. This algorithm is due to A. Aravind (2011). Its design principles
are different from the ones of the bakery algorithm.
Principle of the algorithm The idea that underlies the design of this algorithm is to
associate a date with each request issued by a process and favor the competing process
which has the oldest (smallest) request date. To that end, the algorithm ensures that
(a) the dates associated with requests are increasing and (b) no two process requests
have the same date.
More precisely, let us consider a process pi that exits the critical section. The
date of its next request (if any) is computed in advance when, just after pi has used
the critical section, it executes the corresponding release_mutex() operation. In that
way, the date of the next request of a process is computed while this process is still
“inside the critical section”. As a consequence, the sequence of dates associated with
the requests is an increasing sequence of consecutive integers and no two requests
(from the same process or different processes) are associated with the same date.
From a liveness point of view, the algorithm can be seen as ensuring a least
recently used (LRU) priority: the competing process whose previous access to the
critical section is the oldest (with respect to request dates) is given priority when it
wants to enter the critical section.
Safe registers associated with each process The following three SWMR safe
registers are associated with each process pi :
• FLAG[i], whose domain is {down, up}. It is set to up when pi wants to
enter the critical section and reset to down when pi exits the critical section.
• If pi is not competing for the critical section, the safe register DATE[i] contains the
(logical) date of its next request to enter the critical section. Otherwise, it contains
the logical date of its current request.
DATE[i] is initialized to i. Hence, no two processes start with the same date for
their first request. As already indicated, pi will compute its next date (the value
that will be associated with its next request for the critical section) when it exits
the critical section.
• STAGE[i] is a binary control variable whose domain is {0, 1}. Initialized to 0,
it is set to 1 by pi when pi sees DATE[i] as being the smallest date among the
dates currently associated with the processes that it perceives as competing for the
critical section. The sequence of successive values taken by STAGE[i] (including
its initial value) is defined by the regular expression 0((0, 1)+, 0)∗.
from which we conclude τ_e^i(4) < τ_b^j(5, i), i.e., the last read of STAGE[i] by pj at line
5 started after pi had written 1 into it. Hence, the last read of STAGE[i] by pj returned
1, which contradicts the fact that it is inside the critical section simultaneously with
pi. (A similar reasoning shows that, if pj is inside the critical section, pi cannot be.)
Before proving the liveness property, let us notice that at most one process at a
time can modify the array DATE[1..n]. This follows from the fact that the algorithm
satisfies the mutual exclusion property (proved above) and line 7 is executed by
a process pi before it resets STAGE[i] to 0 (at line 8), which is necessary to allow
another process pj to enter the critical section (as the predicate of line 5 has to be true
when evaluated by pj). It follows from the initialization of the array DATE[1..n] and
the previous reasoning that no two requests can have the same date and that the sequence
of dates computed in mutual exclusion at line 7 by the processes is the sequence of
natural integers (observation OB).
As in the proof of Lamport’s algorithm, let us first prove that there is no deadlock.
Let us assume (by contradiction) that there is a non-empty set of processes Q that have
invoked acquire_mutex() and that no process succeeds in entering the critical section.
Let pi be the process of Q with the smallest date. Due to observation OB, there is a
single such process pi. It then follows that, after some finite time, pi is the only process
whose predicate at line 3 is satisfied. Hence, after some time, pi is the only process
such that STAGE[i] = 1, which allows it to enter the critical section. This contradicts
the initial assumption and proves the deadlock-freedom property.
As a single process at a time can modify its entry of the array DATE, it follows
that a process pj that exits the critical section updates its register DATE[j] to a
value greater than all the values currently kept in DATE[1..n]. Consequently, after
pj has executed line 7, all the other processes pi which are currently competing
for the critical section are such that DATE[i] < DATE[j]. Hence, as we now have
(FLAG[i] = up) ∧ (DATE[i] < DATE[j]), the next request (if any) issued by pj
cannot bypass the current request of pi, from which the starvation-freedom property
follows.
Moreover, it also follows from the previous reasoning that, if pi and pj are
competing and pj wins, then as soon as pj has exited the critical section pi has
priority over pj and can no longer be bypassed by it. This is nothing else than the
bounded bypass property with f(n) = n − 1 (which defines a FIFO order property).
Bounded mutex algorithm Each safe register MY_TURN[i] of Lamport’s algo-
rithm and each safe register DATE[i] of Aravind’s algorithm can take arbitrarily large
values. It is shown in the following how a simple modification of Aravind’s algorithm
allows for bounded dates. This modification relies on the notion of an MWMR safe
register.
MWMR safe register An MWMR safe register is a safe register that can be written
and read by several processes. When the write operations are sequential, an MWMR
safe register behaves as an SWMR safe register. When write operations are concur-
rent, the value written into the register is any value of its domain (not necessarily a
value of a concurrent write).
Said differently, an algorithm based on MWMR safe registers has to prevent
write operations on an MWMR safe register from being concurrent in order for these
write operations to be meaningful. The behavior of an MWMR safe register is then
similar to the behavior of an SWMR safe register in which the “single writer” is
implemented by several processes that never write at the same time.
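When experimenting with algorithms that use such weak registers, the safe-register semantics can be mimicked in a single-threaded, discrete-event style: a read overlapping a write returns an arbitrary value of the domain. The class below is our own test scaffold, not something from the book:

```python
import random

class SafeRegisterSim:
    """Single-threaded simulation of MWMR safe-register semantics:
    a read that overlaps a write may return ANY value of the domain;
    a read with no overlapping write returns the last value written."""
    def __init__(self, domain, initial):
        self.domain = list(domain)
        self.value = initial
        self.writes_in_progress = 0

    def begin_write(self):
        self.writes_in_progress += 1

    def end_write(self, v):
        self.writes_in_progress -= 1
        self.value = v                       # last completed write wins

    def read(self):
        if self.writes_in_progress > 0:
            return random.choice(self.domain)  # concurrent: arbitrary value
        return self.value
```

Driving an algorithm against this model quickly reveals whether it really tolerates arbitrary values returned by concurrent reads.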
From unbounded dates to bounded dates Let us now consider that each safe
register DATE[i], 1 ≤ i ≤ n, is an MWMR safe register: any process pi can write
any register DATE[j]. MWMR safe registers allow for the design of a (particularly
simple) bounded mutex algorithm. The domain of each register DATE[j] is now
[1..N] where N ≥ 2n. Hence, all registers are safe and have a bounded domain.
In the following we consider N = 2n. A single bit is needed for each safe register
FLAG[j] and each safe register STAGE[j], and only ⌈log₂ N⌉ bits are needed for
each safe register DATE[j].
In a very interesting way, no statement has to be modified to obtain a bounded
version of the algorithm. A single new statement has to be added, namely the insertion
of the following line 7′ between line 7 and line 8:
(7′) if (DATE[i] ≥ N) then for all j ∈ [1..n] do DATE[j] ← j end for end if.
This means that, when a process pi exiting the critical section updates its register
DATE[i] and this update is such that DATE[i] ≥ N, pi resets all date registers
to their initial values. As for line 7, this new line 7′ is executed before STAGE[i] is
reset to 0 (line 8), from which it follows that it is executed in mutual exclusion and
consequently no two processes can concurrently write the same MWMR safe register
DATE[j]. Hence, the MWMR safe registers are meaningful.
Moreover, it is easy to see that the date resetting mechanism is such that each
date d, 1 ≤ d ≤ n, is used only by process pd, while each date d, n + 1 ≤ d ≤ 2n,
can be used by any process. Hence, ∀d ∈ {1, . . . , n} we have DATE[d] ∈ {d, n +
1, n + 2, . . . , 2n}.
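Since line 7 and line 7′ execute in mutual exclusion, their effect on the dates can be simulated sequentially. The sketch below (our own, using 0-based Python lists for the 1-based DATE array) checks the property just stated:

```python
n = 4
N = 2 * n                              # domain of dates is [1..N]
DATE = [d + 1 for d in range(n)]       # DATE[d] initialized to d (1-based)

def on_exit_critical_section(i):
    """What p_i does to the dates when leaving the critical section."""
    DATE[i] = max(DATE) + 1            # line 7: strictly larger than all dates
    if DATE[i] >= N:                   # line 7': reset all dates to initial values
        for j in range(n):
            DATE[j] = j + 1

def dates_invariant():
    """For every d: DATE[d] is its initial value d, or lies in [n+1..2n]."""
    return all(DATE[d] == d + 1 or n + 1 <= DATE[d] <= 2 * n
               for d in range(n))
```

Running any sequence of exits keeps `dates_invariant()` true, because a date below n + 1 can only reappear through the reset of line 7′.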
Theorem 11 When considering Aravind’s mutual exclusion algorithm enriched with
line 7′ with N ≥ 2n, a process encounters at most one reset of the array DATE[1..n]
while it is executing acquire_mutex().
Proof Let pi be a process that executes acquire_mutex() while a reset of the array
DATE[1..n] occurs. If pi is the next process to enter the critical section, the theorem
follows. Otherwise, let pj be the next process which enters the critical section. When
pj exits the critical section, DATE[j] is updated to max(DATE[1], . . . , DATE[n]) +
1 = n + 1. We then have FLAG[i] = up and DATE[i] < DATE[j]. It follows that,
if there is no new reset, pj cannot enter the critical section again before pi.
In the worst case, after the reset, all the other processes are competing with pi
and pi is pn (hence, DATE[i] = n, the greatest date value after a reset). Due to
line 3 and the previous observation, each other process pj enters the critical section
before pi and max(DATE[1], . . . , DATE[n]) becomes equal to n + (n − 1). As
2n − 1 < 2n ≤ N, none of these processes issues a reset. It follows that pi enters
the critical section before the next reset. (Let us notice that, after the reset, the
invocation issued by pi can be bypassed only by invocations (pending invocations
issued before the reset or new invocations issued after the reset) which have been
issued by processes pj such that j < i.)
The following corollary is an immediate consequence of the previous theorem.
Corollary 2 Let N ≥ 2n. Aravind’s mutual exclusion algorithm enriched with line
7′ satisfies the starvation-freedom property.
(Different progress conditions that this algorithm can ensure are investigated in
Exercise 6.)
Bounding the domain of the safe registers has a price. More precisely, the addition
of line 7′ has an impact on the maximal number of bypasses, which can now increase
up to f(n) = 2n − 2. This is because, in the worst case where all the processes always
compete for the critical section, before it is allowed to access the critical section, a
process can be bypassed (n − 1) times just before a reset of the array DATE and, due
to the new values of DATE[1..n], it can again be bypassed (n − 1) times just after
the reset.
2.4 Summary
This chapter has presented three families of algorithms that solve the mutual exclu-
sion problem. These algorithms differ in the properties of the base operations they
rely on to solve mutual exclusion.
Mutual exclusion is one way to implement atomic objects. Interestingly, it was
shown that implementing atomicity does not require the underlying read and write
operations to be atomic.
2.5 Bibliographic Notes
• The reader will find surveys on mutex algorithms in [24, 231, 262]. Mutex algo-
rithms are also described in [41, 146].
• Peterson’s algorithm for two processes and its generalization to n processes are
presented in [224].
The first tournament-based mutex algorithm is due to G.L. Peterson and M.J.
Fischer [227].
A variant of Peterson’s algorithm in which all atomic registers are SWMR registers,
due to J.L.W. Kessels, is presented in [175].
• The contention-abortable mutex algorithm is inspired by Lamport’s fast mutex
algorithm [191]. Fischer’s synchronous algorithm is described in [191].
Lamport’s fast mutex algorithm gave rise to the splitter object as defined in [209].
The notion of fast algorithms has given rise to the notion of adaptive algorithms
(algorithms whose cost is related to the number of participating processes) [34].
• The general construction from deadlock-freedom to starvation-freedom that was
presented in Sect. 2.2.2 is from [262]. It is due to Y. Bar-David.
• The notions of safe, regular, and atomic read/write registers are due to L. Lamport.
They are presented and investigated in [188, 189]. The first intuition on these types
of registers appears in [184].
It is important to insist on the fact that “non-atomic” does not mean “arbiter-free”.
As defined in [193], “An arbiter is a device that makes a discrete decision based on
a continuous range of values”. Binary arbiters are the most popular. Actually, the
implementation of a safe register requires an arbiter. The notion of arbitration-free
synchronization is discussed in [193].
• Lamport’s bakery algorithm is from [183], while Aravind’s algorithm and its
bounded version are from [28].
• A methodology based on model-checking for automatic discovery of mutual exclu-
sion algorithms has been proposed by Y. Bar-David and G. Taubenfeld [46]. Inter-
estingly enough, this methodology is both simple and computationally feasible.
New algorithms obtained in this way are presented in [46, 262].
• Techniques (and corresponding algorithms) suited to the design of locks for
NUMA and CC-NUMA architectures are described in [86, 200]. These techniques
take into account non-uniform memories and caching hierarchies.
• A combiner is a thread which, using a coarse-grain lock, serves (in addition to its
own synchronization request) the active requests announced by other threads, which
wait by performing some form of spinning. Two implementations of such a
technique are described in [173]. The first addresses systems that support
coherent caches, whereas the second works better in cacheless NUMA architec-
tures.
2.6 Exercises and Problems
1. Peterson’s algorithm for two processes uses an atomic register denoted TURN
that is written and read by both processes. Design a two-process mutual exclusion
algorithm (similar to Peterson’s algorithm) in which the register TURN is replaced
by two SWMR atomic registers TURN[i] (which can be written only by pi) and
TURN[j] (which can be written only by pj). The algorithm will be described for
pi where i ∈ {0, 1} and j = (i + 1) mod 2.
Solution in [175].
2. Considering the tournament-based mutex algorithm, show that if the base two-
process mutex algorithm is deadlock-free then the n-process algorithm is
deadlock-free.
After having introduced the notion of a concurrent object, this chapter presents
lock-based methodologies to implement such objects. The first one is based on a
low-level synchronization object called a semaphore. The other ones are based on
linguistic constructs. One of these constructs is based on an imperative approach
(monitor construct), while the other one is based on a declarative approach (path
expression construct). This chapter closes the first part of the book devoted to lock-
based synchronization.
value ⊥ when the stack is empty. Hence, both operations can always be executed
whatever the current state of the stack (such operations are said to be total). The
sequential specification of such a stack is the set of all the sequences of push() and
pop() operations that satisfy the “last in, first out” (LIFO) property (“last in” being ⊥
when the stack is empty). Differently, as indicated in the first chapter, a rendezvous
object has no sequential specification.
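The lock-based construction of such a stack can be sketched directly in Python (the class and method names are ours; `BOTTOM` stands for the default value ⊥ returned on an empty stack):

```python
import threading

class ConcurrentStack:
    """A sequential stack protected by a lock; push() and pop() are total:
    pop() returns BOTTOM (standing for the value ⊥) on an empty stack."""
    BOTTOM = None

    def __init__(self):
        self._items = []
        self._lock = threading.Lock()

    def push(self, v):
        with self._lock:               # the lock makes the operation atomic
            self._items.append(v)

    def pop(self):
        with self._lock:
            if not self._items:
                return self.BOTTOM     # total operation: defined on empty stack
            return self._items.pop()   # LIFO: last pushed item out first
```

Because both operations are total, a pop() on an empty stack never blocks; it simply returns ⊥.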
first uses a sequential stack S_STACK 1 and a lock instance LOCK 1 , while the sec-
ond uses another sequential stack S_STACK 2 and another lock instance LOCK 2 .
Hence, as LOCK 1 and LOCK 2 are distinct locks, the operations on C_STACK 1 and
C_STACK 2 are not prevented from being concurrent.
S.count = s0 + #(S.up) − #(S.down).
64 3 Lock-Based Concurrent Objects
Hence, when it is negative, the implementation counter S.count provides us with the
number of processes currently blocked on the semaphore S. Differently, when it is
non-negative, the value of S.count is the value of the semaphore S.
3.2 A Base Synchronization Object: the Semaphore 65
operation S.down() is
  S.count ← S.count − 1;
  if (S.count < 0) then
    the invoking process is blocked and added to S.queue; the control is given to the scheduler
  end if
end operation

operation S.up() is
  S.count ← S.count + 1;
  if (S.count ≤ 0) then
    remove the first process in S.queue, which can now be assigned a processor
  end if
end operation.
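The same behavior, with a (possibly negative) counter and a FIFO queue of blocked processes, can be sketched in Python. The names are ours, and a real system would implement the blocking inside the scheduler rather than with per-thread events:

```python
import threading
from collections import deque

class Semaphore:
    """Counting semaphore: count = s0 + #up - #down; when count is negative,
    -count is the number of blocked threads, queued in FIFO order."""
    def __init__(self, s0):
        self.count = s0
        self._lock = threading.Lock()
        self._blocked = deque()          # one Event per blocked thread

    def down(self):
        with self._lock:
            self.count -= 1
            if self.count >= 0:
                return                   # no need to block
            gate = threading.Event()
            self._blocked.append(gate)
        gate.wait()                      # "control is given to the scheduler"

    def up(self):
        with self._lock:
            self.count += 1
            if self.count <= 0:          # at least one thread is blocked
                self._blocked.popleft().set()   # wake the first one (FIFO)
```

Note that the invariant stated above is preserved: every down() decrements the counter exactly once, whether or not it blocks.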
The base read and write operations on BUF[x] are denoted BUF[x].read() and
BUF[x].write().
– in and out are two local variables containing array indexes whose domain is
[0..(k − 1)]; in is used by the producer to point to the next entry of BUF where
an item can be deposited; out is used by the consumer to point to the next entry
of BUF from which an item can be consumed. The law that governs the progress
of these index variables is the addition mod k, and we say that the buffer is
circular.
• Control part. This part comprises the synchronization objects that allow the
processes to never violate the buffer invariant. It consists of two semaphores:
– The semaphore FREE counts the number of entries of the array BUF that
can currently be used to deposit new items. This semaphore is initialized
to k.
– The semaphore BUSY counts the number of entries of the array BUF that cur-
rently contain items produced and not yet consumed. This semaphore is initial-
ized to 0.
operation B.produce(v) is
(1) FREE.down();
(2) BUF[in].write(v);
(3) in ← (in + 1) mod k;
(4) BUSY.up()
end operation

operation B.consume() is
(5) BUSY.down();
(6) r ← BUF[out].read(); out ← (out + 1) mod k;
(7) FREE.up();
(8) return(r)
end operation
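Using Python’s built-in threading.Semaphore for FREE and BUSY, the two operations can be sketched as follows (single producer and single consumer, as in the text; the buffer size and the function names are ours):

```python
import threading

K = 4                            # buffer capacity k
BUF = [None] * K
FREE = threading.Semaphore(K)    # entries that can receive an item
BUSY = threading.Semaphore(0)    # entries holding an unconsumed item

def produce(items):
    """Single producer: deposits each item of `items` in turn."""
    in_idx = 0                           # the producer's local index `in`
    for v in items:
        FREE.acquire()                   # line 1: FREE.down()
        BUF[in_idx] = v                  # line 2: BUF[in].write(v)
        in_idx = (in_idx + 1) % K        # line 3: in <- (in + 1) mod k
        BUSY.release()                   # line 4: BUSY.up()

def consume(nb, out_list):
    """Single consumer: withdraws nb items, in production order."""
    out_idx = 0                          # the consumer's local index `out`
    for _ in range(nb):
        BUSY.acquire()                   # line 5: BUSY.down()
        r = BUF[out_idx]                 # line 6: r <- BUF[out].read()
        out_idx = (out_idx + 1) % K
        out_list.append(r)
        FREE.release()                   # line 7: FREE.up()
```

With one producer and one consumer, no further synchronization is needed: the semaphores alone prevent any slot from being read before it is written or overwritten before it is read.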
When the consumer invokes B.consume(), it first checks if there is an entry of the
array BUF that contains an item not yet consumed. The semaphore BUSY is used to
that end (line 5). When it is allowed to consume, the consumer consumes the next
item value, i.e., the one kept in BUF[out], saves it in a local variable r, and increases
the index out (line 6). Finally, before returning the value saved in r (line 8), the
consumer signals that one entry has been freed; this is done by increasing the value of the
semaphore FREE (line 7).
Remark 1 It is important to repeat that, for any x, a register BUF[x] is not required
to satisfy special semantic requirements and that the value that is written (read) can be
of any type and as large as we want (e.g., a big file). This is an immediate consequence
of the fact that each register BUF[x] is accessed in mutual exclusion. This means that
what is abstracted as a register BUF[x] does not have to be constrained in any way. As
an example, the operation BUF[in].write(v) (line 2) can abstract several low-level
write operations involving accesses to underlying disks which implement the register
(and similarly for the operation BUF[out].read() at line 6). Hence, the size of the items
that are produced and consumed can be arbitrarily large, and reading and writing them
can take arbitrary (but finite) durations. This means that one can reasonably assume
that the duration of the operations BUF[in].write(v) and BUF[out].read() (i.e., the
operations which are in the data part of the algorithms) is usually several orders of
magnitude greater than the execution of the rest of the algorithms (which is devoted
to the control part).
Remark 2 It is easy to see that the values taken by the semaphores FREE and
BUSY are such that 0 ≤ FREE, BUSY ≤ k, but it is important to remark that a
semaphore object does not offer an operation such as FREE.read() that would return
the exact value of FREE. Actually, such an operation would be useless because there
is no guarantee that the value returned by FREE.read() would still be meaningful when
the invoking process used it (FREE may have been modified by FREE.up() or
FREE.down() just after its value was returned).
A semaphore S can be seen as an atomic register that can be modified by the
operations fetch&add() and fetch&sub() (which atomically add 1 and subtract 1,
respectively) with the additional constraint that S can never become negative.
The case of a buffer with a single entry This is the case k = 1. Each of the
semaphores FREE and BUSY takes then only the values 0 or 1. It is interesting
to look at the way these values are modified. A corresponding cycle of produc-
tion/consumption is depicted in Fig. 3.5.
Initially the buffer is empty, FREE = 1, and BUSY = 0. When the producer
starts to deposit a value, the semaphore FREE decreases from 1 to 0 and the buffer
starts being filled. When it has been filled, the producer raises BUSY from 0 to
1. Hence, FREE = 0 ∧ BUSY = 1 means that a value has been deposited and can
be consumed. When the consumer wants to read, it first decreases the semaphore
BUSY from 1 to 0 and then reads the value kept in the buffer. When the reading is
terminated, the consumer signals that the buffer is empty by increasing FREE from
0 to 1.
If there are several producers (consumers) the previous solution no longer works,
because the control register in (out) now has to be an atomic register shared by all
producers (consumers). Hence, the local variables in and out are replaced by the
atomic registers IN and OUT. Moreover, (assuming k > 1) the read and update
operations on each of these atomic registers have to be executed in mutual exclusion
so that no two producers simultaneously obtain the same value of IN, which
could entail the writing of an arbitrary value into BUF[IN]. (And similarly for OUT.)
A simple way to solve this issue consists in adding two semaphores initialized
to 1, denoted MP and MC. The semaphore MP is used by the producers to ensure
that at most one process at a time is allowed to execute B.produce(); (similarly
MC is used to ensure that no two consumers concurrently execute B.consume()).
Albeit correct, such a solution can be very inefficient. Let us consider the case of
a producer p1 that is very slow while another producer p2 is very rapid. If both p1
and p2 simultaneously invoke produce() and p1 wins the competition, p2 is forced to
wait for a long time before being able to produce. Moreover, if there are several free
entries in BUF[0..(k −1)], it should be possible for p1 and p2 to write simultaneously
in two different free entries of the array.
Additional control data To that end, in addition to the buffer BUF[0..(k − 1)] and
the atomic registers IN and OUT , two arrays of atomic Boolean registers denoted
3.2 A Base Synchronization Object: the Semaphore 69
FULL[0..(k − 1)] and EMPTY [0..(k − 1)] are used. They are such that, for every x,
the pair FULL[x], EMPTY [x] describes the current state of BUF[x] (full, empty,
being filled, being emptied). These registers have similar behaviors, one from the
producer point of view and the other one from the consumer point of view. More
precisely, we have the following:
• FULL[x] (which is initialized to false) is set to true by a producer p just after it has
written a new item value in BUF[x]. In that way, p informs the consumers that the
value stored in BUF[x] can be consumed. FULL[x] is reset to false by a consumer
c just after it has obtained the right to consume the item value kept in BUF[x]. In
that way, c informs the other consumers that the value in BUF[x] is not for them.
To summarize: FULL[x] ⇔ (BUF[x] can be read by a consumer) (Fig. 3.6).
• EMPTY [x] (which is initialized to true) is set back to true by a consumer c just after
it has read the item value kept in BUF[x]. In that way, the consumer c informs the
producers that BUF[x] can be used again to deposit a new item value. EMPTY [x]
is set to false by a producer p just before it writes a new item value in BUF[x]. In
that way, the producer p informs the other producers that BUF[x] is reserved and
they cannot write into it. To summarize: EMPTY [x] ⇔ (BUF[x] can be written
by a producer).
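A plausible realization of this scheme is sketched below (it follows the two bullets above but is a reconstruction, not the code of Fig. 3.7): the semaphores MP and MC now protect only the short scan that reserves an entry, so distinct producers (consumers) can fill (empty) distinct entries of BUF in parallel.

```python
import threading

k = 3
BUF = [None] * k
IN = OUT = 0
FULL = [False] * k               # FULL[x]  <=> BUF[x] can be read by a consumer
EMPTY = [True] * k               # EMPTY[x] <=> BUF[x] can be written by a producer
FREE = threading.Semaphore(k)    # counts entries that can still be reserved
BUSY = threading.Semaphore(0)    # counts values that can still be claimed
MP = threading.Semaphore(1)      # protects IN and the reservation of an entry
MC = threading.Semaphore(1)      # protects OUT and the claim of an entry

def produce(v):
    global IN
    FREE.acquire()
    MP.acquire()
    while not EMPTY[IN]:         # find an entry that can be written
        IN = (IN + 1) % k
    my_index = IN
    EMPTY[my_index] = False      # reserve it: other producers skip it
    MP.release()
    BUF[my_index] = v            # the deposit itself is done outside MP:
    FULL[my_index] = True        # producers may fill distinct entries in parallel
    BUSY.release()

def consume():
    global OUT
    BUSY.acquire()
    MC.acquire()
    while not FULL[OUT]:         # find an entry that can be read
        OUT = (OUT + 1) % k
    my_index = OUT
    FULL[my_index] = False       # claim it: other consumers skip it
    MC.release()
    v = BUF[my_index]            # the read itself is done outside MC
    EMPTY[my_index] = True
    FREE.release()
    return v
```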
(Figure 3.7: the algorithms implementing the operations B.produce() and B.consume() of the buffer of size k.)
The invariant ∀x : ¬(FULL[x] ∧ EMPTY[x])
characterizes the buffer implementation given in Fig. 3.7. This invariant states that
no BUF[x] can be simultaneously full and empty. Other facts can be deduced from
Fig. 3.7:
– ¬FULL[x] ∧ ¬EMPTY[x] ⇒ BUF[x] is currently being filled or emptied,
– FULL[x] ⇒ BUF[x] contains a value not yet consumed, and
– EMPTY[x] ⇒ BUF[x] does not contain a value to be consumed.
Let us observe that this priority scheme is without preemption: when a process p
has obtained the resource, it keeps it until it invokes release(), whatever the priority
of the requests that have been issued by other processes after p was granted the
resource.
Principle of the solution and base objects The object we are building can be
seen as a room made up of two parts: a blackboard room where processes can post
information to inform the other processes plus a sleeping room where processes go
to wait when the resource is used by another process (see Fig. 3.8).
In order for the information on the blackboard to be consistent, at most one
process at a time can access the room. To that end a semaphore denoted MUTEX and
initialized to 1 is used by the processes.
(Figure 3.8: internal structure of the resource object — the semaphore MUTEX and the control registers make up the blackboard room; one sleeping-chair semaphore per process makes up the sleeping room.)
(Figure 3.9: the algorithms implementing the operations acquire() and release() of the resource allocation object with priority.)
Remark Let us observe that, due to asynchrony, it is possible that a process pi wakes
up a waiting process pk (by executing SLEEP_CHAIR[k].up(), line 12), before pk
has executed SLEEP_CHAIR[k].down() (this can occur if pk takes a long time to
go from line 3 to line 4). The reader may check that this does not cause problems:
actually, the slowness of pk between line 3 and line 4 has the same effect as if pk was
waiting for SLEEP_CHAIR[k] to become positive.
On the way priorities are defined The previous resource allocation algorithm is
very general: it does not depend on the way priorities are defined. Priorities can be
statically associated with processes, or can be associated with each invocation of the
operation acquire() independently of the other invocations.
A token metaphor The algorithms in Fig. 3.9 can be seen as a token management
algorithm. There is initially a single token deposited on a table (this is expressed by
the predicate BUSY = false).
When there is no competition a process takes the token which is on the table
(statement BUSY ← true, line 5), and when it has finished using the resource, it
deposits it on the table (statement BUSY ← false, line 13).
When there is competition, the management of the token is different. The
process pi that owns the token gives it directly to a waiting process pk (state-
ment SLEEP_CHAIR[k].up(), line 12), and pk obtains it when it terminates executing
SLEEP_CHAIR[k].down() (line 4). In that case, due to priorities, the transmission
of the token is “direct” from pi to pk . The Boolean register BUSY is not reset to
false by pi : it remains equal to true until the token holder deposits the token back
on the table at line 13, which happens only when no process wants to access the
resource.
The readers-writers problem Let us consider a file that can be accessed by a single
process by the operations read_file() and write_file(). The readers-writers problem
consists in designing a concurrent object that allows several processes to access this
file in a consistent way. Consistent means here that any number of processes can
simultaneously read the file, but at most one process at a time can write the file, and
writing the file and reading the file are mutually exclusive.
To that end, an approach similar to the one used in Fig. 3.1 to go from a sequen-
tial stack to a concurrent stack can be used. As reading a file does not modify it,
using a lock would be too constraining (as it would allow at most one read at a
time). In order to allow several read operations to execute simultaneously, let us
instead design a concurrent object defined by four operations denoted begin_read(),
end_read(), begin_write(), and end_write(), and let us use them to bracket the operations
read_file() and write_file(), respectively. More precisely, the high-level operations
used by the processes, denoted conc_read_file() and conc_write_file(), are
defined as described in Fig. 3.10.
(Figure 3.10: the operations conc_read_file() and conc_write_file(), each bracketing read_file() or write_file() with the corresponding begin/end operations.)
(Figure 3.11: a semaphore-based implementation of begin_read(), end_read(), begin_write(), and end_write() providing weak priority to the readers.)
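A semaphore-based implementation consistent with the objects named in the text (the counter NBR and the semaphores MUTEX and GLOBAL_MUTEX) can be sketched as follows; it provides weak priority to the readers:

```python
import threading

MUTEX = threading.Semaphore(1)         # protects the counter NBR
GLOBAL_MUTEX = threading.Semaphore(1)  # the readers (as a class) vs one writer
NBR = 0                                # number of current readers

def begin_read():
    global NBR
    MUTEX.acquire()
    NBR += 1
    if NBR == 1:                       # first reader locks the writers out
        GLOBAL_MUTEX.acquire()
    MUTEX.release()

def end_read():
    global NBR
    MUTEX.acquire()
    NBR -= 1
    if NBR == 0:                       # last reader lets the writers in
        GLOBAL_MUTEX.release()
    MUTEX.release()

def begin_write():
    GLOBAL_MUTEX.acquire()

def end_write():
    GLOBAL_MUTEX.release()
```

The first reader acquires GLOBAL_MUTEX on behalf of the whole class of readers, and the last one releases it.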
To see why the readers do not have strong priority, let us assume that two writers pw1
and pw2 have invoked begin_write() and one of them (say pw1 ) has obtained the
mutual exclusion to write the file. GLOBAL_MUTEX is then equal to 0 and the
other writers are blocked at line 11. Let us assume that a process pr invokes
begin_read(). The counter NBR is increased from 0 to 1, and consequently pr invokes
GLOBAL_MUTEX.down() and becomes blocked. Hence, pw2 and pr are blocked on
the semaphore GLOBAL_MUTEX with pw2 blocked before pr . Hence, when later
pw1 executes end_write(), it invokes GLOBAL_MUTEX.up(), which unblocks the
first process blocked on that semaphore, i.e., pw2 .
Strong priority to the read operation is weak priority (a reader obtains the priority
for the class of readers) plus the following property: when a writer terminates and
readers are waiting, the readers have to immediately obtain the mutual exclusion.
As already noticed, the readers-writers object described in Fig. 3.11 satisfies weak
priority for the readers but not strong priority.
There is a simple way to enrich the previous implementation to obtain an object
implementation that satisfies strong priority for the readers. It consists in ensur-
ing that, when several processes invoke begin_write(), at most one of them is
allowed to access the semaphore GLOBAL_MUTEX (in the previous example, pw2
is allowed to invoke GLOBAL_MUTEX.down() while pw1 had not yet invoked
GLOBAL_MUTEX.up()). To that end a new semaphore used only by the writers
is introduced. As its aim is to ensure mutual exclusion among concurrent writers,
this semaphore, which is denoted WRITER_MUTEX, is initialized to 1.
(Figure 3.12: the operations begin_write() and end_write() enriched with the semaphore WRITER_MUTEX — lines 11.0 and 13.0 — ensuring strong priority to the readers.)
(Figure 3.13: a further readers-writers variant — lines NR.1–NR.2 added to begin_read() and lines NW.1–NW.8 added to the write operations.)
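The enrichment with WRITER_MUTEX can be sketched as follows (a sketch assuming, as in the text, that processes blocked on a semaphore are served in their arrival order):

```python
import threading

MUTEX = threading.Semaphore(1)          # protects the counter NBR
GLOBAL_MUTEX = threading.Semaphore(1)   # readers as a class vs one writer
WRITER_MUTEX = threading.Semaphore(1)   # at most one writer competes for GLOBAL_MUTEX
NBR = 0

def begin_read():
    global NBR
    MUTEX.acquire(); NBR += 1
    if NBR == 1: GLOBAL_MUTEX.acquire()
    MUTEX.release()

def end_read():
    global NBR
    MUTEX.acquire(); NBR -= 1
    if NBR == 0: GLOBAL_MUTEX.release()
    MUTEX.release()

def begin_write():
    WRITER_MUTEX.acquire()    # the other writers wait here, not on GLOBAL_MUTEX
    GLOBAL_MUTEX.acquire()

def end_write():
    GLOBAL_MUTEX.release()
    WRITER_MUTEX.release()
```

When a writer terminates, the only processes that can be blocked on GLOBAL_MUTEX are readers (the other writers queue on WRITER_MUTEX), which gives the readers strong priority.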
When considering the readers-writers problem, the mutual exclusion used in the
previous solutions between base read and write operations can entail that readers
and writers experience long delays when there is heavy contention for the shared
file. This section presents a solution to the readers-writers problem that reduces
waiting delays. Interestingly, this solution relies on the producer-consumer problem.
Read/write lock A read/write lock is a synchronization object that (a) provides
the processes with the four operations begin_read(), end_read(), begin_write(), and
end_write() and (b) ensures one of the specifications associated with the readers-
writers problem (no priority, weak/strong priority to the readers or the writers, etc.).
(Figure 3.14: the operations conc_read_file() and conc_write_file() built from the buffer entries BUF[0] and BUF[1], the read/write locks protecting them, and the register LAST.)
BUF[0]. Similarly, the last read (which is of BUF[1] because LAST = 1 when the
corresponding conc_read_file() operation starts) is concurrent with the write into
BUF[0]. Hence, the next write operation (namely conc_write_file(v5 )) will be on
BUF[1]. If conc_write_file(v5 ) is invoked while BUF[1].read() has not terminated,
the write must be constrained to wait until this read terminates. Hence, the mutual
exclusion requirement on the reads and writes on each entry of the buffer.
Starvation-freedom There are two algorithm instances that ensure the mutual
exclusion property: one between BUF[0].read() and BUF[0].write() and the other
one between BUF[1].read() and BUF[1].write().
Let us assume that these algorithms guarantee the starvation-freedom property
for the read operations (i.e., each invocation of BUF[x].read() terminates). On the
other hand, no specific liveness property is attached to the base BUF[x].write() operations.
The following theorem captures the liveness properties of the conc_read_file() and
conc_write_file() operations which are guaranteed by the algorithms described in
Fig. 3.14.
Theorem 12 If the underlying read/write lock objects ensure starvation-freedom for
the read operations, then the implementation of the operations conc_read_file() and
conc_write_file() given in Fig. 3.14 ensures starvation-freedom for both operations.
Proof Starvation-freedom of the invocations of the operation conc_read_file() fol-
lows trivially from the starvation-freedom of the base BUF[x].read() operation.
To show that each invocation of the operation conc_write_file() terminates, we
show that any invocation of the operation BUF[x].write() does terminate. Let us
consider an invocation of BUF[x].write(). Hence, LAST = 1 − x (lines 6 and 10).
This means that the read invocations that start after the invocation BUF[x].write() are
on BUF[1 − x]. Consequently, these read invocations cannot prevent BUF[x].write()
from terminating.
Let us now consider the invocations BUF[x].read() which are concurrent with
BUF[x].write(). There is a finite number of such invocations. As the underlying
mutual exclusion algorithm guarantees starvation-freedom for these read invocations,
there is a finite time after which they have all terminated. If the invocation BUF[x].write()
(Figure 3.16: the write algorithm of writer qi in the multi-writer generalization — lines NW.1–NW.2 replace line 6 of Fig. 3.14.)
has not yet been executed, it is the only operation on BUF[x] and is consequently
executed, which concludes the proof of the theorem.
The case of several writers and several readers The previous single writer/
multi-reader algorithm can be easily generalized to obtain a multi-writer/multi-reader
algorithm.
Let us assume that there are m writers denoted q1 , . . . , qm . The array of reg-
isters now has 2m entries: BUF[0..(2m − 1)], and the registers BUF[2i − 2] and
BUF[2i − 1] are associated with the writer number qi . As previously, a writer qi
writes alternately BUF[2i − 2] and BUF[2i − 1] and updates LAST after it has
written a base register. The corresponding write algorithm is described in Fig. 3.16.
The local index my_last of qi is initialized to 2i − 2. Basically line 6 of Fig. 3.14 is
replaced by the new lines NW.1–NW.2. Moreover, the local variable new_last is now
renamed my_last and the algorithm for the operation conc_read_file() is the same as
before.
Semaphores are synchronization objects that allow for the construction of application-
oriented concurrent objects. Unfortunately, they are low-level counting objects.
The concept of a monitor allows for the definition of concurrent objects at a
“programming language” abstraction level. Several variants of the monitor concept
have been proposed. This concept was developed by P. Brinch Hansen and C.A.R.
Hoare from an initial idea of E.W. Dijkstra. To introduce it, this section adopts Hoare’s
presentation (1974).
• Operation C.wait().
When a process p invokes C.wait() it stops executing and from an operational
point of view it waits in the queue C. As the invoking process is no longer active,
the mutual exclusion on the monitor is released.
• Operation C.signal().
When a process p invokes C.signal() there are two cases according to the value of
C:
– If no process is blocked in the queue C, the operation C.signal() has no effect.
– If at least one process is blocked in the queue C, the operation C.signal() reacti-
vates the first process blocked in C. Hence, there is one fewer process blocked in
C but two processes are now active inside the monitor. In order to guarantee that
a single process at a time can access the internal representation of the monitor
the following rule is applied:
* The process which was reactivated becomes active inside the monitor and
executes the statements which follow its invocation of C.wait().
* The process that has executed C.signal() becomes passive but has priority to
re-enter the monitor. When allowed to re-enter the monitor, it will execute the
statements which follow its invocation C.signal().
• Operation C.empty().
This operation returns a Boolean value indicating whether or not the queue C is empty.
3.3 A Construct for Imperative Languages: the Monitor 83
(Figure 3.17: the algorithm implementing the operation rendezvous().)
Let us observe that a rendezvous object implemented as in Fig. 3.17 is not restricted
to be used only once. It can be used repeatedly to synchronize a given set of processes
as many times as needed. We say that the rendezvous object is not restricted to be a
one-shot object.
A monitor-based rendezvous The internal representation of the rendezvous object
(for m participating processes) consists of a register denoted COUNTER (initialized
to 0) and a condition denoted QUEUE.
The algorithm implementing the operation rendezvous() is described in Fig. 3.18.
Let us remember that, when a process is active inside the monitor, it is the only
active process inside the monitor. Hence, the algorithm implementing the opera-
tion rendezvous() can be designed as a sequential algorithm which momentarily
stops when it executes QUEUE.wait() and restarts when it is reactivated by another
process that has executed QUEUE.signal(). As we are about to see, as an invocation
of QUEUE.signal() reactivates at most one process, the only tricky part is the man-
agement of the invocations of QUEUE.signal() so that all blocked processes are
eventually reactivated.
When one of the m participating processes invokes rendezvous(), it first increases
COUNTER (line 1) and then checks the value of COUNTER (line 2). If COUNTER <
m, it is blocked and waits in the condition QUEUE (line 2). It follows that the (m −1)
first processes which invoke rendezvous() are blocked and wait in that condition. The
mth process which invokes rendezvous() increases COUNTER to m and consequently
resets COUNTER to 0 (line 3) and reactivates the first process that is blocked in the
condition QUEUE. When reactivated, this process executes QUEUE.signal() (line
5), which reactivates another process, etc., until all m processes are reactivated and
terminate their invocations of rendezvous().
Let us notice that, after their first rendezvous has terminated, the m processes can
use again the very same object for a second rendezvous (if needed). As an example,
this object can be used to re-synchronize the processes at the beginning of each
iteration of a parallel loop.
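Python's `threading.Condition` has signal-and-continue (not signal-and-wait) semantics, so in the sketch below the cascade of signal() invocations of Fig. 3.18 is replaced by a generation counter and a notify_all(); the resulting object is reusable in the same way:

```python
import threading

class Rendezvous:
    """A reusable rendezvous (barrier) object for m processes."""
    def __init__(self, m):
        self.m = m
        self.counter = 0
        self.round = 0                      # generation number: makes the object reusable
        self.cond = threading.Condition()   # plays the role of the condition QUEUE

    def rendezvous(self):
        with self.cond:
            my_round = self.round
            self.counter += 1
            if self.counter == self.m:      # the m-th arrival opens the rendezvous
                self.counter = 0            # reset for the next use of the object
                self.round += 1
                self.cond.notify_all()      # reactivate all blocked processes at once
            else:
                while self.round == my_round:   # signal-and-continue: re-test on wake-up
                    self.cond.wait()
```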
(Figure 3.18: a monitor-based rendezvous object — register COUNTER and condition QUEUE.)
(Figure: a monitor-based producer-consumer object.)
(Figure: transfer of predicate from Consumer to Producer — each process alternates between active and passive phases inside the monitor.)
The base objects (semaphores and integers) used to implement the control part of
a signal-and-wait monitor are described below. This monitor internal structure is
depicted in Fig. 3.21 (let us remark that the control part of the internal structure is
similar to the structure depicted in Fig. 3.8).
• A semaphore MUTEX, initialized to 1, is used to ensure mutual exclusion on the
monitor (at most one process at a time can access the monitor internal representa-
tion).
(Figure 3.21: base objects implementing a signal-and-wait monitor — the semaphore MUTEX and integer counters form the blackboard room; the semaphore PRIO_SEM and, for each predicate P, the semaphore COND_SEM[P] are the waiting rooms.)
When a process p terminates a monitor operation, it first checks whether some process is blocked on the priority semaphore (predicate PRIO_NB > 0); if so, p reactivates the first of them by executing
PRIO_SEM.up() (line 2). Hence, the control inside the monitor passes directly from
p to the reactivated process. If there are no such processes, we have PRIO_NB = 0. In
that case p releases the mutual exclusion on the monitor by releasing the semaphore
MUTEX (line 3).
The code executed by a process p that invokes C[P].wait() is described at lines
5–10. Process p then has to be blocked on the semaphore COND_SEM[P], which
is the waiting room associated with the condition C[P] (line 9). Hence, p increases
NB[P] before being blocked (line 5) and will decrease it when reactivated (line 10).
Moreover, as p is going to wait, it has to release the mutual exclusion on the monitor
internal representation. If there is a process q which is blocked due to an invocation of
signal(), q has priority to re-enter the monitor. Consequently, p directly passes control
of the monitor to q (line 6). Otherwise, if no process is blocked on PRIO_SEM, p
releases the mutual exclusion on the monitor entrance (line 7) to allow a process that
invokes a monitor operation to enter it.
The code executed by a process p that invokes C[P].signal() is described at lines
11–15. If no process is blocked in the condition C[P] we have NB[P] = 0 and there
is nothing to do. If NB[P] > 0, the first process blocked in C[P] has to be reactivated
and p has to become a priority process to obtain the control of the monitor again. To
that end, p increases PRIO_NB (line 11) and reactivates the first process blocked in
C[P] (line 12). Then, it exits the monitor and goes to wait in the priority waiting room
PRIO_SEM (line 13). When later reactivated, it will decrease PRIO_NB in order to
indicate that one fewer process is blocked in the priority semaphore PRIO_SEM
(line 14).
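The construction just described can be written almost literally with Python semaphores (a sketch keeping the names MUTEX, PRIO_SEM, PRIO_NB, COND_SEM[P], and NB[P]; enter() and exit() are assumed to bracket every monitor operation):

```python
import threading

class HoareMonitor:
    """Skeleton of a signal-and-wait monitor built from semaphores,
    following the structure described in the text."""
    def __init__(self, nb_conditions):
        self.MUTEX = threading.Semaphore(1)          # monitor entrance
        self.PRIO_SEM = threading.Semaphore(0)       # priority waiting room (signalers)
        self.PRIO_NB = 0                             # processes blocked on PRIO_SEM
        self.COND_SEM = [threading.Semaphore(0) for _ in range(nb_conditions)]
        self.NB = [0] * nb_conditions                # processes blocked on condition P

    def enter(self):
        self.MUTEX.acquire()

    def exit(self):
        # a signaler waiting to re-enter has priority over new entries
        if self.PRIO_NB > 0: self.PRIO_SEM.release()
        else: self.MUTEX.release()

    def wait(self, P):
        self.NB[P] += 1
        if self.PRIO_NB > 0: self.PRIO_SEM.release() # pass control to a signaler
        else: self.MUTEX.release()                   # or free the monitor entrance
        self.COND_SEM[P].acquire()                   # sleep on the condition
        self.NB[P] -= 1                              # control re-obtained directly

    def signal(self, P):
        if self.NB[P] > 0:
            self.PRIO_NB += 1
            self.COND_SEM[P].release()               # reactivate a blocked process
            self.PRIO_SEM.acquire()                  # wait with priority to re-enter
            self.PRIO_NB -= 1
```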
This section considers several monitors suited to the readers-writers problem. They
differ in the type of priority they give to readers or writers. This family of monitors
gives a good illustration of the programming comfort provided by the monitor con-
struct (the fact that a monitor allows a programmer to use directly the power of a
programming language makes her/his life easier).
These monitors are methodologically designed. They systematically use the fol-
lowing registers (which are all initialized to 0 and remain always non-negative):
• NB_WR and NB_AR denote the number of readers which are currently waiting and
the number of readers which are currently allowed to read the file, respectively.
• NB_WW and NB_AW denote the number of writers which are currently waiting
and the number of writers which are currently allowed to write the file, respectively.
In some of the monitors that follow, NB_WR and NB_AR could be replaced by a single
register NB_R (number of readers) whose value would be NB_WR + NB_AR and,
as its value is always 0 or 1, NB_AW could be replaced by a Boolean register. This is not
done, in order to insist on the systematic design dimension of these monitors.
(Figure 3.23: a readers-writers monitor with strong priority to the readers.)
Let us now assume that NB_AW = 1. Then, due to the transfer of predicate
(between lines 6 and 8 or between lines 12 and 8) we have NB_AR + NB_AW = 0,
from which we conclude NB_AR = 0, and consequently NB_AR × NB_AW = 0.
Let us now assume that NB_AR > 0. This register is increased at line 3. Due to the
waiting predicate NB_AW > 0 used at line 2 and the transfer of predicate between
line 12 (where we also have NB_AW = 0) and line 2, it follows that NB_AW = 0
when line 3 is executed. Consequently, we have (NB_AR > 0) ⇒ (NB_AW = 0),
which completes the proof of the safety property of the readers-writers problem.
Let us now prove the liveness property, namely strong priority to the readers.
Let us first observe that, if a reader is allowed to read, we have NB_AR > 0 and,
consequently, NB_AW = 0. It then follows from the waiting predicate used at line 2
that all the readers which invoke begin_read() are allowed to read.
Let us now consider readers that invoke begin_read() while a writer has previously
been allowed to write. We then have NB_AW > 0 (line 9 executed by the writer)
and NB_WR > 0 (line 1 executed later by the readers). It follows that, when the writer
invokes end_write(), it will execute C_READERS.up() (line 12) and reactivate a
reader (which in turn will reactivate another reader, etc.). Consequently, when a
writer is writing and there are waiting readers, those readers proceed to read when
the writer terminates, which concludes the proof of the liveness property.
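Under the signal-and-continue semantics of Python's `threading.Condition`, the transfer of predicate is replaced by re-checking the waiting predicate in a while loop. With that caveat, the strong-priority-to-the-readers policy can be sketched as follows (a reconstruction using the four registers, not the code of Fig. 3.23):

```python
import threading

class RWMonitorStrongReaders:
    def __init__(self):
        self.lock = threading.Lock()
        self.c_readers = threading.Condition(self.lock)
        self.c_writers = threading.Condition(self.lock)
        self.nb_wr = self.nb_ar = self.nb_ww = self.nb_aw = 0  # the four registers

    def begin_read(self):
        with self.lock:
            self.nb_wr += 1
            while self.nb_aw > 0:              # readers wait only for an active writer
                self.c_readers.wait()
            self.nb_wr -= 1
            self.nb_ar += 1

    def end_read(self):
        with self.lock:
            self.nb_ar -= 1
            if self.nb_ar == 0:                # last reader lets a writer proceed
                self.c_writers.notify()

    def begin_write(self):
        with self.lock:
            self.nb_ww += 1
            while self.nb_aw + self.nb_ar + self.nb_wr > 0:
                self.c_writers.wait()          # waiting readers pass before any writer
            self.nb_ww -= 1
            self.nb_aw = 1

    def end_write(self):
        with self.lock:
            self.nb_aw = 0
            if self.nb_wr > 0:                 # strong priority: waiting readers first
                self.c_readers.notify_all()
            else:
                self.c_writers.notify()
```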
Strong priority to the writers The monitor described in Fig. 3.24 provides strong
priority to the writers. This means that, as soon as writers want to write, no more
readers are allowed to read until all these writers have terminated. The text of the
monitor is self-explanatory.
When looking at Fig. 3.24, as far as the management of priority is concerned,
it is important to insist on the role played by the register NB_WW . This register
stores the actual number of processes which want to write and are blocked. Hence,
giving strong priority to the writers is based on the testing of that register at line 1
and line 12. Moreover, when the priority is given to the writers, the register NB_WR
(which counts the number of waiting readers) is useless.
Similarly, the same occurs in the monitor described in Fig. 3.23. Strong priority
is given to the readers with the help of the register NB_WR while, in that case, the
register NB_WW becomes useless.
A type of fairness Let us construct a monitor in which, while all invocations
of conc_read_file() and conc_write_file() (as defined in Fig. 3.10) terminate, the
following two additional liveness properties are satisfied:
(Figure 3.24: a readers-writers monitor with strong priority to the writers.)
• Property P1: When a write terminates, all waiting readers are allowed to read
before the next write.
• Property P2: When there are readers which are reading the file, the newly arriving
readers have to wait if writers are waiting.
These properties are illustrated in Fig. 3.25, where indexes are used to distinguish
different executions of a same operation. During an execution of conc_write_file1 (),
two readers invoke conc_read_file1 () and conc_read_file2 () and a writer invokes
conc_write_file2 (). As there is currently a write on the file, these operations are
blocked inside the monitor (to preserve the monitor invariant). When
conc_write_file1 () terminates, due to property P1, the invocations conc_read_file1 ()
and conc_read_file2 () are executed. Then, while they are reading the file,
conc_read_file3 () is invoked. Due to property P2, this invocation must be blocked
because, despite the fact that the file is currently being read, there is a write waiting.
When conc_read_file1 () and conc_read_file2 () have terminated, conc_write_file2 ()
can be executed. When this write terminates, conc_read_file3 () and conc_read_file4 ()
are executed. Etc.
The corresponding monitor is described in Fig. 3.26. The difference from the
previous readers-writers monitors lies in the way the properties P1 and P2 are ensured.
The property P2 is ensured by the waiting predicate at line 2. If a writer is writing or
waiting (predicate NB_WW + NB_AW ≠ 0) when a reader arrives, the reader has
to wait, and when this reader is reactivated it will propagate the reactivation to another
waiting reader (if any) before starting to read. The property P1 is ensured by the
reactivating predicate (NB_WR > 0) used at line 13: if there are readers that are
waiting when a writer terminates, the first of them is reactivated, which reactivates
the following one (statement C_READERS.signal() at line 2), etc., until all waiting
readers have been reactivated.
The reader can check that the implementation of such fairness properties would
have been much more difficult if one was asked to implement them directly from
semaphores. (“Directly” meaning here: without using the translation described in
Fig. 3.22.)
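One way to realize P1 and P2 with signal-and-continue conditions is to open a "batch" of waiting readers each time a write terminates (a hedged sketch; read_phase and batch are our bookkeeping devices, not registers of Fig. 3.26):

```python
import threading

class FairRWMonitor:
    def __init__(self):
        self.lock = threading.Lock()
        self.c_readers = threading.Condition(self.lock)
        self.c_writers = threading.Condition(self.lock)
        self.nb_wr = self.nb_ar = self.nb_ww = self.nb_aw = 0
        self.read_phase = 0   # incremented by end_write(): opens a batch of reads (P1)
        self.batch = 0        # waiting readers admitted by the last end_write()

    def begin_read(self):
        with self.lock:
            if self.nb_ww + self.nb_aw > 0:      # P2: writers present => newcomers wait
                self.nb_wr += 1
                my_phase = self.read_phase
                while self.read_phase == my_phase:
                    self.c_readers.wait()        # released as a batch when a write ends
                self.nb_wr -= 1
                self.batch -= 1
            self.nb_ar += 1

    def end_read(self):
        with self.lock:
            self.nb_ar -= 1
            if self.nb_ar == 0:
                self.c_writers.notify()

    def begin_write(self):
        with self.lock:
            self.nb_ww += 1
            while self.nb_aw + self.nb_ar + self.batch > 0:
                self.c_writers.wait()            # P1: an admitted batch goes first
            self.nb_ww -= 1
            self.nb_aw = 1

    def end_write(self):
        with self.lock:
            self.nb_aw = 0
            if self.nb_wr > 0:
                self.batch = self.nb_wr          # P1: admit every waiting reader
                self.read_phase += 1
                self.c_readers.notify_all()
            else:
                self.c_writers.notify()
```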
(Figure 3.26: a readers-writers monitor satisfying the fairness properties P1 and P2.)
(Figure: a monitor implementing an alarm clock — register CLOCK, a bag BAG of wake-up dates, the wake-up operation, and the operation tic().)
The statement “add d to BAG” adds one copy of the date d to the bag, while the
statement “suppress d from BAG” removes one copy of d from the bag (if any). Then,
p invokes QUEUE.wait(wake_up_date) (line 3). When it is reactivated, p removes
its wake-up date from the bag (line 4).
The second operation, denoted tic(), is executed by the system at the end of each
time unit (it can also be executed by a specific process whose job is to measure the
physical or logical passage of time). This operation first increases the monitor clock
CLOCK (line 6). Then, it reactivates, one after the other, all the processes whose
wake-up time is equal to now, the current time value (lines 7–8).
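With a signal-and-continue condition, the bag of wake-up dates and the conditional wait can be replaced by a notify_all() plus a re-checked predicate; the following sketch (not the monitor of the figure) has the same external behavior:

```python
import threading

class AlarmClock:
    """Monitor-style alarm clock: wakeup(n) blocks its caller for n ticks;
    tic() is invoked once per time unit (e.g., by a clock process)."""
    def __init__(self):
        self.clock = 0
        self.cond = threading.Condition()

    def wakeup(self, n):
        with self.cond:
            wake_up_date = self.clock + n
            while self.clock < wake_up_date:    # signal-and-continue: re-test on wake-up
                self.cond.wait()

    def tic(self):
        with self.cond:
            self.clock += 1
            self.cond.notify_all()              # reactivate the processes whose date is reached
```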
The monitor concept allows concurrent objects to be built by providing (a) sequen-
tiality inside a monitor and (b) condition objects to solve internal synchronization
issues. Hence, as we have seen, monitor-based synchronization is fundamentally an
imperative approach to synchronization. This section shows that, similarly to sequen-
tial programming languages, the statement of synchronization can be imperative or
declarative.
As with monitors, several path expression formulations have been introduced. We
consider here the one that was introduced in a variant of the Pascal programming
language.
3.4.1 Definition
The idea of path expressions is to state constraints on the order in which the operations
on a concurrent object have to be executed. To that end, four base operators are used,
namely concurrency, sequentiality, restriction, and de-restriction. It is then up to the
compiler to generate the appropriate control code so that these constraints are always
satisfied.
Let us consider an object defined by a set of operations. A path expression
associated with this object has the form path pe end path, where pe is an expression
built from the operation names and the four operators above. As examples:
• path (1 : op1 ), (1 : op2 ) end path states that, at any time, there is at most one
execution of op1 and one execution of op2 , which can proceed concurrently.
• path 2 : (op1 ; op2 ) end path states that, at any time, (a) the number of executions
of op2 that have started never surpasses the number of executions of op1 that
have completed (this is due to the “;” internal operator), and (b) the number of
executions of op1 that have started never surpasses by more than two the number
of executions of op2 that have completed (this is due to the “2 : ()” operator).
• path 1 : [op1 ], op2 end path states that, at any time, there is at most either one
execution of op2 or any number of concurrent executions of op1 .
• path 4 : (3 : op1 ), (2 : op2 ) end path states that, at any time, there are at most
three concurrent executions of op1 and at most two concurrent executions of op2 ,
and at most four concurrent executions when adding the executions of op1 and the
executions of op2 .
(Figure: the operations read_file() and write_file() of the file object.)
file is given either to any number of readers (which have invoked read_file()) or any
number of writers (which have invoked write_file()). Moreover, path2 defines a
kind of alternating priority. If a reader is reading, it gives access to the file to all the
readers that arrive while it is reading and, similarly, if a writer is writing, it reserves
the file for all the writers that are waiting.
Producer-consumer Let us consider a buffer of size k shared by a single producer
and a single consumer. Using the same base objects (BUF[0..(k − 1)], in and out)
as in Fig. 3.4 (Sect. 3.2.2), the operations of such a buffer object B are defined in
Fig. 3.28.
The following path expression path4 defines the synchronization control associ-
ated with such a buffer:
path4 = path k : prod; cons end path.
If there are both several producers and several consumers, it is possible to use the same
object B. (The only difference for B is that now in is shared by the producers and out
is shared by the consumers, but this does not entail a modification of the code of B.)
The only modification is the addition of synchronization constraints specifying that
at most one producer at a time is allowed to produce and at most one consumer at a
time is allowed to consume.
Said differently, the only modification is the replacement of path4 by path5 ,
defined as follows:
path5 = path k : ((1 : prod); (1 : cons)) end path.
Generating prefixes and suffixes Let pe denote a path expression. prefix(pe) and
suffix(pe) denote the code prefix and the code suffix currently associated with pe.
These prefixes and suffixes are defined recursively starting with the path expression
pe and proceeding until the prefix and suffix of each operation is determined. Initially,
prefix(pe) and suffix(pe) are empty control sequences.
1. Concurrency rule. Let pe = pe1 , pe2 . The expression prefix(pe) pe1 ,
pe2 suffix(pe) gives rise to the two expressions prefix(pe) pe1 suffix(pe)
and prefix(pe) pe2 suffix(pe), which are then considered separately.
2. Sequentiality rule. Let pe = pe1 ; pe2 . The expression prefix(pe) pe1 ;
pe2 suffix(pe) gives rise to two expressions which are then considered sep-
arately, namely the expression prefix(pe) pe1 S.up() and the expression
S.down() pe2 suffix(pe) where S is a new semaphore initialized to 0. As we
can see, the aim of the semaphore S is to force pe2 to wait until pe1 is executed.
Hence, for the next step, as far as pe1 is concerned, we have prefix(pe1 ) =
prefix(pe) and suffix(pe1 ) = S.up(). Similarly, we have prefix(pe2 ) = S.down() and
suffix(pe2 ) = suffix(pe).
3. Restriction rule. Let pe = k : pe1 . The expression prefix(pe) k : pe1 suffix(pe)
gives rise to the expression S.down(); prefix(pe) pe1 suffix(pe); S.up(),
where S is a new semaphore initialized to k.
Hence, we have prefix(pe1 ) = S.down(); prefix(pe) and suffix(pe1 ) = suffix(pe);
S.up() to proceed recursively (if pe1 is not an operation name).
4. De-restriction rule. Let pe = [pe1]. The expression prefix(pe) [pe1]
suffix(pe) gives rise to the expression prio_down(CT, S′, prefix(pe))
pe1 prio_up(CT, S′, suffix(pe)), where
• CT is a counter initialized to 0,
• S′ is a new semaphore initialized to 1, and
• prio_down() and prio_up() are defined as indicated in Fig. 3.29.
The aim of these operations is to give priority to the processes that invoke an
operation involved in the path expression pe1 . (The reader can check that their
internal statements are the same as the ones used in Fig. 3.11 to give weak priority
to the readers.)
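The code of Fig. 3.29 is not reproduced above, but since the text states that its internal statements are those of the weak-priority algorithm of Fig. 3.11, prio_down() and prio_up() can plausibly be sketched as follows in Python (the factory function and all its names are ours; this is a reconstruction, not the book's figure):

```python
import threading

def make_prio_bracket(prefix, suffix):
    """Returns (prio_down, prio_up) for a de-restricted expression [pe1].
    CT counts the invocations currently inside [pe1]; the semaphore S
    (initialized to 1) protects CT, as in the weak-priority pattern."""
    CT = 0
    S = threading.Semaphore(1)

    def prio_down():
        nonlocal CT
        S.acquire()
        CT += 1
        if CT == 1:
            prefix()       # only the first invocation pays the outer prefix
        S.release()

    def prio_up():
        nonlocal CT
        S.acquire()
        CT -= 1
        if CT == 0:
            suffix()       # only the last invocation executes the outer suffix
        S.release()

    return prio_down, prio_up
```

Once one operation of pe1 has entered, further operations of pe1 bypass the outer prefix, which is what gives them priority.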
An example To illustrate the previous rules, let us consider the following path
expression involving three operations denoted op1 , op2 , and op3 :
path 1 : [op1 ; op2 ], op3 end path.
Hence, we have initially pe = 1 : [op1 ; op2 ], op3 , prefix(pe) = suffix(pe) = ε
(where ε represents the empty sequence). Let us now apply the rules as defined
by their precedence order. This is also described in Fig. 3.30 with the help of the
syntactical tree associated with the considered path expression.
• Let us first apply the restriction rule (item 3). We obtain k = 1 with pe1 =
[op1 ; op2 ], op3 . It follows from that rule that prefix(pe1 ) = S1.down(); and
suffix(pe1 ) = ; S1.up(), where S1 is a semaphore initialized to 1.
• Let us now apply the concurrency rule (item 1). We have pe1 = pe2 , pe3 , where
pe2 = [op1 ; op2 ] and pe3 = op3 . It follows from that rule that:
– prefix(op3 ) = prefix(pe1 ) = S1.down() and suffix(op3 ) = suffix(pe1 ) =
S1.up(). Hence, any invocation of op3 () has to be bracketed by S1.down()
and S1.up().
– Similarly, prefix(pe2 ) = prefix(pe1 ) = S1.down() and suffix(pe2 ) = suffix(pe1 )
= S1.up().
• Let us now consider pe2 = [op1 ; op2 ] = [pe4 ]. Applying the de-restriction rule
(item 4) we obtain prefix(pe4 ) = prio_down(CT, S2, prefix(pe2 )) and suffix(pe4 )
= prio_up(CT, S2, suffix(pe2 )), where CT is a counter initialized to 0 and S2 is
a new semaphore initialized to 1.
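The mechanical nature of rules 1–3 can be illustrated with a small recursive compiler (a Python sketch; the tuple encoding of path expressions and the semaphore naming S1, S2, ... are ours, and the de-restriction rule is omitted):

```python
def compile_pe(pe, prefix, suffix, out, ctr):
    """Computes the semaphore bracketing of each operation of pe.
    pe is either an operation name or a tuple:
      ('conc', pe1, pe2)  for  pe1 , pe2    (rule 1)
      ('seq',  pe1, pe2)  for  pe1 ; pe2    (rule 2)
      ('restr', k, pe1)   for  k : pe1      (rule 3)"""
    if isinstance(pe, str):                # an operation name: fix its bracketing
        out[pe] = (prefix, suffix)
    elif pe[0] == 'conc':                  # rule 1: both sides keep prefix/suffix
        compile_pe(pe[1], prefix, suffix, out, ctr)
        compile_pe(pe[2], prefix, suffix, out, ctr)
    elif pe[0] == 'seq':                   # rule 2: new semaphore initialized to 0
        ctr[0] += 1; S = f"S{ctr[0]}"
        compile_pe(pe[1], prefix, f"{S}.up()", out, ctr)
        compile_pe(pe[2], f"{S}.down()", suffix, out, ctr)
    elif pe[0] == 'restr':                 # rule 3: new semaphore initialized to k
        ctr[0] += 1; S = f"S{ctr[0]}"
        compile_pe(pe[2], f"{S}.down(); {prefix}",
                   f"{suffix}; {S}.up()", out, ctr)
    return out

# path k : ((1 : prod); (1 : cons)) end path
pe = ('restr', 'k', ('seq', ('restr', 1, 'prod'), ('restr', 1, 'cons')))
brackets = compile_pe(pe, "", "", {}, [0])
```

Running the compiler on path5 yields the bracketing of the producers-consumers sketch given earlier: the outer restriction semaphore opens every prod and is closed by cons, while the sequencing semaphore links each prod to a cons.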
3.5 Summary
This chapter has presented the semaphore object and two programming language
constructs (monitors and path expressions) that allow the design of lock-based atomic
objects. Such language constructs provide a higher abstraction level than semaphores
or base mutual exclusion when one has to reason about the implementation of concurrent
objects. Hence, they can make the job of programmers who have to implement
concurrent objects easier.
• A reader that arrives while another reader is reading can immediately access
the file if no writer is waiting. Otherwise, the reader has to wait until the writers
that arrived before it have accessed the file.
• A writer cannot bypass the writers and the readers which arrived before it.
• Prove that the algorithm is correct (a writer executes in mutual exclusion and
readers are allowed to proceed concurrently).
• What type of priority is offered by this algorithm?
• In the worst case, how many processes can be blocked on the wait statement?
• Let us replace the array FLAG[1..n] of SWMR atomic registers by a single
MWMR atomic register READERS initialized to 0, and
9. Let us associate the semantics “signal all” with the C.signal() operation on each
condition C of a monitor. This semantics means that all the processes (if any)
which are blocked on the condition C are reactivated and have priority to obtain
mutual exclusion and re-access the monitor. The process which has invoked
C.signal() continues its execution inside the monitor. Considering this “signal
all” semantics:
• Design a readers-writers monitor with strong priority to the writers.
• Design an implementation for “signal all” monitors from underlying
semaphores.
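As an aside, this “signal all” semantics is the broadcast offered by many thread libraries. The following Python sketch (not a solution to the exercise; the names are ours) shows its observable effect with Condition.notify_all(): every process blocked on the condition is reactivated.

```python
import threading

cond = threading.Condition()
waiting = 0        # number of processes currently blocked on the condition
woken = []

def waiter(i):
    global waiting
    with cond:
        waiting += 1
        cond.wait()          # blocked on the condition C
        woken.append(i)      # reactivated by the "signal all"

threads = [threading.Thread(target=waiter, args=(i,)) for i in range(3)]
for t in threads:
    t.start()
while True:                  # wait until all three are blocked, then broadcast
    with cond:
        if waiting == 3:
            cond.notify_all()    # the "signal all" semantics
            break
for t in threads:
    t.join()
```

After the join, all three waiters have been reactivated by the single broadcast.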
10. The last writer. Let us consider the monitor-based solution with strong priority
to the readers (Fig. 3.23). Modify this solution so that only the last writer can be
blocked (it can be blocked only because a reader is reading or a writer is writing).
This means that, when a writer p invokes begin_write(), it unblocks the waiting
writer q if there is one. The write (not yet done) of q is then “overwritten” by
the write of p and the invocation of begin_write() issued by q returns false.
To that end, the operation conc_write_file(v) defined in Fig. 3.10 is redefined
as follows:
operation conc_write_file(v) is
r ← begin_write();
if (r) then write_file(v); end_write() end if;
return(r)
end operation.
11. Implement semaphores (a) from monitors, and (b) from path expressions.
(c) leave_C_to_D() when it arrives at C and tries to enter the line CD.
(d) arrive_in_D() when it arrives at D.
The same four operations are defined for the trains that go from D to A (A, B, C,
and D are replaced by D, C, B, and A).
Design first a deadlock-free monitor that provides the processes (trains) with the
previous eight operations. Design then a starvation-free monitor.
Hints.
• The internal representation of the monitor will be made of:
– The integer variables NB_AB, NB_BA, NB_CD, and NB_DC, where NB_xy
represents the number of trains currently going from x to y.
– The binary variables NB_BC and NB_CB, whose values are 0 or 1.
All these control variables are initialized to 0.
– The six following conditions (queues): START_FROM_A, ENTER_BC,
ENTER_CD, START_FROM_D, ENTER_CB, ENTER_BA.
• The code of a process going from A to D is:
start_from_A(); . . . ; leave_B_to_C(); . . . ; leave_C_to_D(); . . . ;
arrive_in_D().
• Before deriving the predicates that allow a train to progress when it executes
a monitor operation, one may first prove that the following relation must be
an invariant of the monitor internal representation:
Let us observe that path2 defines mutual exclusion between a write invocation
and any other operation invocation while allowing concurrent read operations.
3.7 Exercises and Problems 109
The combination of the path expressions path1 and path2 defines the associated
priority. What type of priority is defined?
Solutions in [63].
Part II
On the Foundations Side:
The Atomicity Concept
This part of the book is made up of a single chapter that introduces the atomicity
concept (also called linearizability). This concept (which was sketched in the first
part of the book) is certainly (with non-determinism) one of the most important
concepts related to the concurrency and synchronization of parallel and distributed
programs. It is central to the understanding and the implementation of concurrent
objects. This chapter presents a formal definition of atomicity and its main
properties. Atomicity (which is different from sequential consistency or serializ-
ability) is the most popular consistency condition. This is due to the fact that
atomic objects compose “for free”.
Chapter 4
Atomicity:
Formal Definition and Properties
4.1 Introduction
in the middle of an operation? (This possibility was not considered in the previous
chapters.)
Example To give a flavor of these questions, let us consider an unbounded first-in
first-out (FIFO) queue denoted Q which provides the processes with the following
two operations:
• Q.enq(v), which adds the value v at the tail of the queue, and
• Q.deq(), which returns the value at the head of the queue and suppresses it from
the queue. If the queue is empty, the default value ⊥ is returned.
Figure 4.1 describes a sequential execution of a system made up of a single process
using the queue. The time line, going from left to right, describes the progress of
the process when it enqueues first the value a, then the value b, and finally the value
c. According to the expected semantics of a queue, and as depicted in the figure,
the first invocation of Q.deq() returns the value a, the second returns the value
b, etc.
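The sequential behavior of Fig. 4.1 can be captured in a few lines of Python (here None plays the role of the default value ⊥):

```python
from collections import deque

class Queue:
    """Unbounded FIFO queue offering the two operations of the text;
    None stands for the default value returned on an empty queue."""
    def __init__(self):
        self._items = deque()
    def enq(self, v):
        self._items.append(v)       # add v at the tail
    def deq(self):                  # return and suppress the value at the head
        return self._items.popleft() if self._items else None

Q = Queue()
Q.enq('a'); Q.enq('b'); Q.enq('c')
```

As in Fig. 4.1, successive dequeues return a, b, and c, and a further dequeue on the now-empty queue returns the default value.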
Figure 4.2 depicts an execution of a system made up of two processes sharing
the same queue. Now, process p1 enqueues first a and then b whereas process p2
concurrently enqueues c. As shown in the figure, the execution of Q.enq(c) by p2
overlaps the executions of both Q.enq(a) and Q.enq(b) by p1 . Such an execution
raises many questions, including the following: What values are dequeued by p1 and
p2 ? What values can be returned by a process, say p1 , if the other process, p2 , stops
forever in the middle of an operation? What happens if p1 and p2 share several queues
instead of a single one?
Addressing the previous questions and related issues starts from the definition of a
precise computation model. This chapter first presents the base elements of such a
model and the important notion of a concurrent computation history.
4.2.2 Objects
An object has a name and a type. A type is defined by (1) the set of possible values
for (the states of) objects of that type, (2) a finite set of operations through which the
objects of that type can be manipulated, and (3) a specification describing, for each
operation, the condition under which that operation can be invoked, and the effect
produced after the operation was executed. Figure 4.3 presents a structural view of a
set of n processes sharing m objects.
Sequential specification The object types we consider are defined by a sequential
specification. (We talk interchangeably about the specification of the object or the
specification of the type.) A sequential specification depicts the behavior of the object
when accessed sequentially, i.e., in a sequential execution. This means that, despite
concurrency, the implementation of any such object has to provide the illusion of
sequential accesses. As already noticed, the aim is to facilitate the task of application
programmers who have to reason only about sequential specifications.
It is common to define a sequential specification by associating two predicates
with each operation. These predicates are called pre-assertion and post-assertion.
Assuming the pre-assertion is satisfied before executing the operation, the post-
assertion describes the new value of the object and the result of the operation returned
to the calling process. We refine the notion of sequential specification in terms of
histories later in this chapter.
Total versus partial operations An object operation is total if it is defined for
every state of the object; otherwise it is partial. This means that, differently from a
pre-assertion associated with a partial operation, the pre-assertion associated with a
total operation is always satisfied.
Deterministic versus non-deterministic operations An object operation is deter-
ministic if, given any state of the object that satisfies the pre-assertion of the oper-
ation, and given any valid input parameters of the operation, the output parameters
and the final state of the object are uniquely defined. An object type that has only
4.2.3 Histories
Two operations op and op′ are said to overlap (or be concurrent) in a history H if
neither resp[op] <H inv[op′] nor resp[op′] <H inv[op]. Notice that two overlapping
operations are such that ¬(op →H op′) and ¬(op′ →H op).
A sequential history has no overlapping operations; i.e., for any pair of operations
op and op′, we have (op ≠ op′) ⇒ (op →H op′) ∨ (op′ →H op). →H is
consequently a total order if H is a sequential history.
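With operations represented by their invocation/reply times, these two relations can be written down directly (a sketch; the numeric time stamps below are ours):

```python
def precedes(op1, op2):
    """op1 ->H op2: the reply of op1 occurs before the invocation of op2.
    Operations are given as (inv_time, resp_time) pairs."""
    return op1[1] < op2[0]

def overlap(op1, op2):
    """op1 and op2 are concurrent: neither precedes the other."""
    return not precedes(op1, op2) and not precedes(op2, op1)
```

For instance, an operation spanning [0, 2] precedes one spanning [3, 5], while it overlaps one spanning [1, 4].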
Illustrating histories Figure 4.4 depicts the (well-formed) history H associated
with the queue object execution described in Fig. 4.2. This history comprises ten
events e1 . . . e10 (e4, e6, e7, and e9 are explicitly detailed). As there is a single object,
its name is omitted. Let us notice that the operation enq(c) by p2 is concurrent with
both enq(a) and enq(b) issued by p1 . Moreover, as the history H has no pending
operations, it is a complete history.
The sequence e1 . . . e9 is a partial history where the dequeue operation issued by
p1 is pending. The sequence e1 . . . e6 e7 e8 e10 is another partial history in which
the dequeue operation issued by p2 is pending. Finally, the history e1 . . . e8 has two
pending operations.
Definition A history is sequential if its first event is an invocation, and then (1) each
invocation event, except possibly the last, is immediately followed by the matching
reply event, and (2) each reply event, except possibly the last, is immediately followed
by an invocation event. The phrase “except possibly the last” associated with an
invocation event is due to the fact that a history can be partial. A complete sequential
history always ends with a reply event. A history that is not sequential is concurrent.
A sequential history models a sequential multiprocess computation (there are no
overlapping operations in such a computation), while a concurrent history models
a concurrent multiprocess computation (there are at least two overlapping opera-
tions in such a computation). Given that a sequential history S has no overlapping
operations, the associated partial order →S defined on its operations is actually a
total order. With a sequential history, one can thus reason about executions at the
granularity of the operations invoked by the processes, instead of at the granularity
of the underlying events.
Strictly speaking, the sequential specification of an object is a set of sequential
histories involving solely that object. Basically, the sequential specification repre-
sents all possible sequential ways according to which the object can be accessed such
that the pre-assertion and post-assertion of each of its operations are respected.
Example The history H = e1 e2 · · · e10 depicted in Fig. 4.4 is a complete concurrent
history. On the other hand, the complete history
H1 = e1 e3 e4 e6 e2 e5 e7 e9 e8 e10
is sequential. At the granularity of whole operations, it can be written
H1 = [e1 e3] [e4 e6] [e2 e5] [e7 e9] [e8 e10].
The histories
H2 = [e1 e3] [e4 e6] [e2 e5] [e8 e10] [e7 e9],
H3 = [e1 e3] [e4 e6] [e8 e10] [e2 e5] [e7 e9]
are also sequential. Let us also notice that H, H1, H2, and H3 are equivalent histories
(they have the same local histories). Let H4 be the history defined as
H4 = [e1 e3] [e4 e6] [e2 e5] [e8 e10] e7.
H4 is a partial sequential history. All these histories have the same local history for
process p1: H|p1 = H1|p1 = H2|p1 = H3|p1 = H4|p1 = [e1 e3] [e4 e6] [e8 e10],
and, as far as p2 is concerned, H4|p2 is a prefix of H|p2 = H1|p2 = H2|p2 = H3|p2 =
[e2 e5] [e7 e9].
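Equivalence (equality of local histories) is easy to check mechanically. A Python sketch, with each event of Fig. 4.4 tagged by its issuing process (the tagging follows the operation pairing used above):

```python
def local_history(H, p):
    """H|p: the subsequence of the events of H issued by process p."""
    return [e for e in H if e[0] == p]

def equivalent(Ha, Hb, procs=('p1', 'p2')):
    """Two histories are equivalent if they have the same local histories."""
    return all(local_history(Ha, p) == local_history(Hb, p) for p in procs)

# e1,e3  e4,e6  e8,e10 are issued by p1; e2,e5  e7,e9 by p2
e = {i: ('p1' if i in (1, 3, 4, 6, 8, 10) else 'p2', f'e{i}')
     for i in range(1, 11)}
H  = [e[i] for i in range(1, 11)]                      # e1 e2 ... e10
H1 = [e[i] for i in (1, 3, 4, 6, 2, 5, 7, 9, 8, 10)]   # the sequential history H1
```

Here equivalent(H, H1) holds, while any reordering of two events of the same process destroys equivalence.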
Hence, the notion of a history is an abstract way to depict the interactions between
a set of processes and a set of concurrent objects. In short, a history is a total order
on the set of (invocation and reply) events generated by the processes on the objects.
As we are about to see, the notion of a history is central to defining the notion of
atomicity through the very notion of atomic history.
4.3 Atomicity
The role of a correctness condition is to select, among all possible histories of a set
of processes accessing shared objects, those considered to be correct. This section
introduces the correctness condition called atomicity (also called linearizability). The
aim of atomicity is to transform the difficult problem of reasoning about a concurrent
execution into the simpler problem of reasoning about a sequential one.
Intuitively, atomicity states that a history is correct if its invocation and reply
events could have been obtained, in the same order, by a single sequential process. In
an atomic (or linearizable) history, each operation has to appear as if it was executed
alone and instantaneously at some point between its invocation event and its reply
event.
As the concurrent objects that are considered are defined by sequential specifications,
a definition of what is a “correct” history has to refer in one way or another to these
specifications. The notion of legal history captures this idea.
This section first defines atomicity for complete histories H, i.e., histories without
pending operations: each invocation event of H has a matching reply event in H. The
section that follows will extend this definition to partial histories.
Definition A complete history H is atomic (or linearizable) if there is a “witness”
history S such that:
1. H and S are equivalent,
2. S is sequential and legal, and
3. →H ⊆ →S .
The definition above states that, for a history H to be linearizable, there must
exist a permutation of H (namely the witness history S) which satisfies the following
requirements:
• First, S has to be composed of the same set of events as H and has to respect the
local history of each process [item 1].
• Second, S has to be sequential (interleave the process histories at the granularity of
complete operations) and legal (respect the sequential specification of each object)
[item 2]. Notice that, as S is sequential, →S is a total order.
• Finally, S has also to respect the real-time occurrence order of the operations as
defined by →H [item 3].
S represents a history that could have been obtained by executing all the opera-
tions, one after the other, while respecting the occurrence order of non-overlapping
operations. Such a sequential history S is called a linearization of H.
Proving that an algorithm implements an atomic object To this end, we need
to prove that all the histories generated by the algorithm are linearizable, i.e., to
identify, for each of them, a linearization of its operations that respects the “real-time”
occurrence order of the operations and that is consistent with the sequential
specification of the object.
It is important to notice that the notion of atomicity inherently includes a form
of non-determinism. More precisely, given a history H, several linearizations of H
might exist.
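This non-determinism, and the definition itself, can be made concrete with a brute-force checker (exponential in the number of operations, so a sketch for tiny histories only; the encoding of operations and the FIFO specification below are ours):

```python
from itertools import permutations

def fifo_legal(seq):
    """Sequential specification of the queue: every deq returns the head."""
    q = []
    for kind, v in seq:
        if kind == 'enq':
            q.append(v)
        elif not q or q.pop(0) != v:   # a deq returning the wrong value
            return False
    return True

def linearizable(ops, legal):
    """ops: list of (inv, resp, (kind, value)) with real-time stamps."""
    for perm in permutations(ops):
        # item 3: never place an operation before one that ended earlier
        respects_rt = all(not (perm[j][1] < perm[i][0])
                          for i in range(len(perm))
                          for j in range(i + 1, len(perm)))
        if respects_rt and legal([op for _, _, op in perm]):
            return True                # perm is a linearization
    return False

# the history of Fig. 4.2 in which p2 dequeues c and p1 dequeues a
ops = [(0, 2, ('enq', 'a')), (3, 5, ('enq', 'b')), (1, 4, ('enq', 'c')),
       (6, 8, ('deq', 'c')), (7, 9, ('deq', 'a'))]
```

This history admits a linearization (ordering enq(c) first, which the overlaps permit), while the history in which both dequeues return the same value admits none.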
defined in Sect. 4.2.4 is such a witness. At the granularity level defined by the opera-
tions, the witness history H1 can be represented as
[Q.enq(a) by p1] [Q.enq(b) by p1] [Q.enq(c) by p2] [Q.deq(a) by p2] [Q.deq(b) by p1].
This formulation highlights the intuition that underlies the definition of the atomicity
concept.
Linearization point The very existence of a linearization of an atomic history H
means that each operation of H could have been executed at an indivisible instant
between its invocation and reply time events (while providing the same result as H).
It is thus possible to associate a linearization point with each operation of an atomic
history. This is a point of the time line at which the corresponding operation could
have been “instantaneously” executed according to the witness sequential and legal
history.
To respect the real-time occurrence order, the linearization point associated with
an operation has always to appear within the interval defined by the invocation event
and the reply event associated with that operation.
Example Figure 4.5 depicts the linearization point of each operation. A triangle is
associated with each operation, such that the vertex at the bottom of a triangle (bold
dot) represents the associated linearization point. A triangle shows how atomicity
allows an operation (whose execution takes some duration) to be shrunk into a
single point on the time line.
In that sense, atomicity reduces the difficult problem of reasoning about a con-
current system to the simpler problem of reasoning about a sequential system where
the operations issued by the processes are instantaneously executed.
As a second example, let us consider a variant of the history depicted in Fig. 4.5
where the reply events e9 and e10 are “exchanged”, i.e., we have now that e9 =
resp[deq(b) by p2 ] and e10 = resp[deq(a) by p1 ]. It is easy to see that this history
is linearizable: the sequential history H2 described in Sect. 4.2.4 is a linearization
of it.
Similarly, the history where e9 = resp[deq(c) by p2 ] and e10 = resp[deq(a) by
p1 ] is also linearizable. It has the following sequential witness history:
[Q.enq(c) by p2] [Q.enq(a) by p1] [Q.enq(b) by p1] [Q.deq(c) by p2] [Q.deq(a) by p1].
In contrast, the history in which the two dequeue operations would return the
same value is not linearizable: it does not have a witness history which respects the
sequential specification of the queue.
This section extends the definition of atomicity to partial histories. As already indicated,
these are histories with at least one process whose last operation is pending: the invocation
event of this operation appears in the history while the corresponding reply event does
not. The history H4 described in Sect. 4.2.4 is a partial history. Extending atomicity to
partial histories is important as it allows arbitrary delays experienced by processes, or
even process crashes (when these delays become infinite), to be dealt with.
Definition A partial history H is linearizable if H can be modified in such a way
that every invocation of a pending operation is either removed or completed with a
reply event, and the resulting (complete) history H′ is linearizable.
Basically, the problem of determining whether a partial history H is linearizable is
reduced to the problem of determining whether a complete history H′, extracted from
H, is linearizable. We obtain H′ by adding reply events to certain pending operations
of H, as if these operations had indeed been completed, but also by removing
invocation events from some of the pending operations of H. We require, however,
that all complete operations of H be preserved in H′. It is important to notice that,
given a history H, we can extract several histories H′ that satisfy the required
conditions.
Example Let us consider Fig. 4.6, which depicts two processes accessing a register.
Process p1 first writes the value 0. The same process later issues a write for the value 1,
but p1 crashes during this second write (this is indicated by a cross on its time line).
Process p2 executes two consecutive read operations. The first read operation lies
between the two write operations of p1 and returns the value 0. A different value
would clearly violate atomicity. The situation is less obvious with the second read
operation, and it is not entirely clear what value v it has to return in order for the
history to be linearizable.
As explained below, both values 0 and 1 can be returned by that read operation
while preserving atomicity. The second write operation is pending in the partial
history H modeling this execution. This history H is made up of seven events (the
names of the object and the processes are omitted as there is no ambiguity): the
invocation and reply events of the first write and of the first read, the invocation of
the second write (whose reply is missing), and the invocation and reply events of
the second read.
We explain now why both 0 and 1 can be returned by the second read:
• Let us first assume that the returned value v is 0.
We can associate with history H a legal sequential witness history H0 which
includes only complete operations and respects the partial order defined by H
on these operations (see Fig. 4.6). To obtain H0, we first construct a history H′
by removing the event inv[write(1)] from H: we obtain a complete history, i.e.,
a history without pending operations.
History H with v = 0 is consequently linearizable. The associated witness history
H0 models the situation where p1 is considered as having crashed before invoking
the second write operation: everything appears as if this write had never been
issued.
• Assume now that the returned value v is 1.
Similarly to the previous case, we can associate with history H a legal sequential
witness history H1 that respects the partial order on the operations. We actually
derive H1 by first constructing H′, which we obtain by adding to H the reply event
resp[write(1)]. (In Fig. 4.6, the part added to H in order to obtain H′, from which
H1 is constructed, is indicated by dotted lines.)
The history where v = 1 is consequently linearizable. The associated witness
history H1 represents the situation where the second write is taken into account
despite the crash of the process that issued that write operation.
4.4 Object Composability and Guaranteed Termination Property
This section presents two fundamental properties of atomicity that make it partic-
ularly attractive. The first property states that atomic objects can be composed for
free, while the second property states that, as object operations are total, no operation
invocation can be prevented from terminating.
Basically, “→” totally orders all operations on the same object X, according to →X
(item 1), while preserving →H , i.e., the real-time occurrence order on the operations
(item 2).
Claim. “→ is acyclic”. This claim means that → defines a partial order on the set of
all the operations of H.
Assuming this claim (see its proof below), it is thus possible to construct a sequential
history S including all the events of H and respecting →. We trivially have → ⊆ →S ,
where →S is the total order on the operations defined from S. We have the three
following conditions: (1) H and S are equivalent (they contain the same events and
the same local histories), (2) S is sequential (by construction) and legal (due
to item 1 above), and (3) →H ⊆ →S (due to item 2 above and → ⊆ →S ). It follows
that H is linearizable.
Proof of the claim. We show (by contradiction) that → is acyclic. Assume first that
→ induces a cycle involving only operations on a single object X. As →X is
a total order, in particular transitive, there must be two operations opi and opj on X
such that opi →X opj and opj →H opi . But opi →X opj ⇒ inv[opi ] <H resp[opj ]
because X is linearizable. As <H is a total order on the whole set of events, the fact
that opj →H opi ⇒ resp[opj ] <H inv[opi ] establishes the contradiction.
It follows that any cycle must involve at least two objects. To obtain a contradiction
we show that, in that case, a cycle in → implies a cycle in →H (which is acyclic).
Let us examine the way the cycle could be obtained. If two consecutive edges of the
cycle are due to just some →X or just →H , then the cycle can be shortened, as any
of these relations is transitive. Moreover, opi →X opj →Y opk is not possible for
X ≠ Y , as each operation is on only one object (opi →X opj →Y opk would imply
that opj is on both X and Y ). So let us consider any sequence of edges of the cycle
such that: op1 →H op2 →X op3 →H op4 . We have:
– op1 →H op2 ⇒ resp[op1 ] <H inv[op2 ] (definition of →H ),
– op2 →X op3 ⇒ inv[op2 ] <H resp[op3 ] (as X is linearizable),
– op3 →H op4 ⇒ resp[op3 ] <H inv[op4 ] (definition of →H ).
Combining these statements, we obtain resp[op1 ] <H inv[op4 ], from which we can
conclude that op1 →H op4 . It follows that any cycle in → can be reduced to a cycle
in →H , which is a contradiction as →H is an irreflexive partial order. End of the
proof of the claim.
The benefit of locality Considering an execution of a set of processes that access
concurrently a set of objects, atomicity allows the programmer to reason as if
all the operations issued by the processes on the objects were executed one after
the other. The previous theorem is fundamental. It states that, to reason about
sequential processes that access concurrent atomic objects, one can reason on
each object independently, without losing the atomicity property of the whole
computation.
An example Locality means that atomic objects compose for free. As an example,
let us consider two atomic queue objects Q1 and Q2 each with its own implementation
I1 and I2, respectively (hence, the implementations can use different algorithms).
Let us define the object Q that is a composition of Q1 and Q2 defined as fol-
lows (Fig. 4.7). Q provides processes with the four following operations Q.enq1(),
Q.deq1(), Q.enq2(), and Q.deq2() whose effect is the same as Q1.enq(),
Q1.deq(), Q2.enq() and Q2.deq(), respectively.
Thanks to locality, an implementation of Q consists simply in piecing together I1
and I2 without any modification to their code. As we will see in Sect. 4.5, this object
composition property is no longer true for other consistency conditions.
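Locality can be illustrated directly: below, the composed object is obtained by gluing together two independently constructed atomic queues, with no synchronization added between them (a Python sketch; the lock-based queue is just one possible atomic implementation, and the class names are ours):

```python
import threading
from collections import deque

class AtomicQueue:
    """A simple lock-based linearizable FIFO queue."""
    def __init__(self):
        self._d = deque()
        self._lock = threading.Lock()
    def enq(self, v):
        with self._lock:
            self._d.append(v)
    def deq(self):
        with self._lock:
            return self._d.popleft() if self._d else None

class ComposedQueue:
    """The object Q of Fig. 4.7: I1 and I2 are pieced together unchanged."""
    def __init__(self, q1, q2):
        self._q1, self._q2 = q1, q2
    def enq1(self, v): self._q1.enq(v)
    def deq1(self):    return self._q1.deq()
    def enq2(self, v): self._q2.enq(v)
    def deq2(self):    return self._q2.deq()
```

No lock spans the two underlying queues: atomicity of the composition comes from the atomicity of each component.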
Due to the fact that operations are total, atomicity (linearizability) per se does not
require a pending invocation of an operation to wait for another operation to complete.
This means that, if a given implementation I of an atomic object entails blocking of a
total operation, this is not due to the atomicity concept but only to I. Blocking is an
artifact of particular implementations of atomicity, not an inherent feature of atomicity.
This property of the atomicity consistency condition is captured by the following
theorem, which states that any (atomic) history with a pending operation invocation
can be extended with a reply to that operation.
Theorem 15 Let inv[op(arg)] be the invocation event of a total operation that is
pending in a linearizable history H. There exists a matching reply event resp[op(res)]
such that the history H′ = H · resp[op(res)] is linearizable.
Proof Let S be a linearization of the partial history H. By definition of a linearization,
S has a matching reply to every invocation. Assume first that S includes a reply event
resp[op(res)] matching the invocation event inv[op(arg)]. In this case, the theorem
trivially follows, as then S is also a linearization of H′.
If S does not include a matching reply event, then S does not include inv[op(arg)].
Because the operation op() is total, there is a reply event resp[op(res)] matching
the invocation event inv[op(arg)] in every state of the shared object. Let S′ be the
sequential history S with the invocation event inv[op(arg)] and a matching reply
event resp[op(res)] added in that order at the end of S. S′ is trivially legal. It follows
that S′ is a linearization of H′.
<1 = [e1 e3] [e4 e6] [e8 e10] and <2 = [e2 e5] [e7 e9].
S = [Q.enq(b) by p2 ] [Q.enq(a) by p1 ] [Q.deq(b) by p2 ]
easy to see that, when we consider each object in isolation, we obtain the histories
H|Q and H|Q′ that are sequentially consistent. Unfortunately, there is no way to
exhibit a legal witness total order S that involves the six operations: if p1 dequeues
b′ from Q′, Q′.enq(a′) has to be ordered after Q′.enq(b′) in a witness sequential
history. But this means that (to respect process order) Q.enq(a) by p1 is necessarily
ordered before Q.enq(b) by p2. Consequently, Q.deq() by p2 should return a for S
to be legal. A similar reasoning can be done starting from the operation Q.deq(b)
by p2. It follows that there can be no legal witness total order. Hence, despite the
fact that H|Q and H|Q′ are sequentially consistent, the whole history H is not.
4.5.2 Serializability
(or abort) event after all other events of a pending transaction is called commit-
ting (or aborting) the transaction. A sequential history is a sequence of committed
transactions.
Let →trans denote the total order of events of the committed transactions. This
is analogous to the process-order relation defined above. We say that a history is
complete if all its transactions are complete.
Let H be a complete history. H is serializable if there is a “witness” history S
such that:
1. S is made up of all the events of the committed transactions of H,
2. S is sequential and legal, and
3. →trans ⊆ →S (S has to respect the transaction order).
Let H be a history that is not complete. H is serializable if we can derive from
H a complete history H′ (by completing or removing pending transactions from H)
such that: (1) H′ is complete, (2) H′ includes the complete transactions of H, and
(3) H′ is serializable.
Atomicity versus serializability As for atomicity, serializability is defined accord-
ing to the equivalence to a witness sequential history, but differently from atomicity,
no real-time ordering is required. In this sense, serializability can be viewed as an
extension of sequential consistency to transactions where a transaction is made up of
several invocations of object operations. Unlike atomicity, serializability is not a local
property (replacing processes with transactions in Fig. 4.9 gives a counter-example).
4.6 Summary
This chapter has introduced the basic elements that are needed to reason about exe-
cutions of a multiprocess program whose processes cooperate through concurrent
objects (defined by a sequential specification on total operations). More specifically,
this chapter has presented the basic notions from which the atomicity concept has
then been defined.
The fundamental modeling element is that of a history: a sequence of events depict-
ing the interaction between processes and objects. An event represents the invocation of an operation on an object or the return of a reply. A history is atomic if, despite concurrency, it
appears as if processes access the objects by invoking operations one after the other.
In this sense, the correctness of a concurrent computation is judged with respect to a
sequential behavior, itself determined by the sequential specification of the objects.
Hence, atomicity is what allows us to reason sequentially despite concurrency.
• The notion of atomic read/write objects (registers), as studied here, was investi-
gated and formalized by L. Lamport [189] and J. Misra [206].
• The generalization of the atomicity consistency condition to objects of any sequen-
tial type was developed by M. Herlihy and J. Wing under the name linearizability
[148].
• The notion of sequential consistency was introduced by L. Lamport [187].
The relation between atomicity and sequential consistency was investigated in
[40] and [232], where it was shown that, from a protocol design point of view,
sequential consistency can be seen as lazy linearizability. Examples of protocols
implementing sequential consistency can be found in [3, 40, 233].
• The concept of transactions is part of almost every textbook on database systems.
Books entirely devoted to transactions include [50, 97, 119]. The theory of serial-
izability is the main topic of [97, 222].
Part III
Mutex-Free Synchronization
While Part I was devoted to lock-based synchronization, this part of the book is on
the design of concurrent objects whose implementation does not rely on mutual
exclusion. It is made up of five chapters:
• The first chapter introduces the notion of a mutex-free implementation (i.e.,
implementations which are not allowed to rely on locks) and the associated
liveness properties, namely obstruction-freedom, non-blocking, and wait-
freedom.
• The second chapter introduces the notion of a hybrid implementation, namely
an implementation which is partly lock-based and partly mutex-free.
• The next three chapters are on the power of atomic read/write registers when one
has to design wait-free object implementations. These chapters show that non-
trivial objects can be built in such a particularly poor context. To that end, they
present wait-free implementations of the following concurrent objects: weak
counters, store-collect objects, snapshot objects, and renaming objects.
Remark on terminology As we are about to see, the term mutex-freedom is used
to indicate that the use of critical sections (locks) is prohibited. The term lock-
freedom could have been used instead of mutex-freedom. This has not been done
for the following reason: the term lock-freedom is already used in a lot of papers
on synchronization with different meanings. In order not to overload it and to
prevent confusion, the term mutex-freedom is used in this book.
Chapter 5
Mutex-Free Concurrent Objects
Locks are not always the panacea As we have seen in Chaps. 1 and 3, the systematic
use of locks constitutes a relatively simple method to implement atomic concurrent
objects defined by total operations. A lock is associated with every object O and all the
operation invocations on O are bracketed by acquire_lock() and release_lock() so that
at most one operation invocation on O at a time is executed. However, as we are about to
see in this chapter, locks are not the only approach to implement atomic objects. Locks
have drawbacks related to process blocking and the granularity of the underlying base
objects used in the internal representation of the object under construction.
As far as the granularity of the object protected by a lock is concerned, let us
consider a lock-based implementation of a bounded queue object Q with total operations (Q.deq() returns ⊥ when the queue is empty and Q.enq() returns a specific control value when the queue is full). The use of a single lock on the whole internal representation of
the queue prevents Q.enq() and Q.deq() from being executed concurrently. This can
decrease the queue efficiency, as nothing prevents these two operations from exe-
cuting concurrently when the queue is neither empty nor full. A solution consists
in using locks at a finer granularity level in order to benefit from concurrency and
increase efficiency. Unfortunately this makes deadlock prevention more difficult and,
due to their very nature, locks cannot eliminate the blocking problem.
The drawback related to process blocking is more severe. Let us consider a process
p that for some reason (e.g., page fault) stops executing during a long period in the
middle of an operation on an object O. If we use locks, as we have explained above,
the processes which have concurrently invoked an operation on O become blocked
until p terminates its own operation. When such a scenario occurs, processes suffer
delays due to other processes. Such an implementation is said to be blocking-prone.
The situation is even worse if the process p crashes while it is in the middle of an
operation execution. (In an asynchronous system a crash corresponds to the case
where the speed of the corresponding process becomes and remains forever equal to
0, this being never known by the other processes. This point is developed below at
the end of Sect. 5.1.2.) When this occurs, p never releases the lock, and consequently,
all the processes that will invoke an operation on O will become blocked forever.
Hence, the crash of a process creates an infinite delay that can entail a deadlock on
all operations accessing the object O.
These observations have motivated the design of concurrent object implementa-
tions that do not use locks in one way or another (i.e., explicitly or implicitly). These
implementations are called mutex-free.
Operation level versus implementation level Let us consider an object O with
two operations O.op1() and O.op2(). At the user level, the (correct) behaviors of O
are defined by the traces of its sequential specification.
When considering the implementation level, the situation is different. Each exe-
cution of O.op1() or O.op2() corresponds to a sequence of invocations of base
operations on the base objects that constitute the internal representation of O.
If the implementation of O is lock-based and we do not consider the execution of
the base operations that implement acquire_lock() and release_lock(), the sequence
of base operations produced by an invocation of O.op1() or O.op2() cannot be
interleaved with the sequence of base operations produced by another operation
invocation. When the implementation is mutex-free, this is no longer the case, as
depicted in Fig. 5.1.
Figure 5.1 shows that the invocations of O.op1() by p1 , O.op2() by p2 , and O.op1()
by p3 are linearized in that order (i.e., they appear to have been executed in that order
from an external observer point of view).
(Fig. 5.1: a history linearized at the object level, and the corresponding history at the implementation level.)
When processes may crash A process crashes when it stops its execution prema-
turely. Due to the asynchrony assumption on the speed of processes, a crash can be
seen as if the corresponding process pauses during an infinitely long period before
executing its next step. Asynchrony, combined with the fact that no base shared-
memory operation (read, write, compare&swap, etc.) provides processes with infor-
mation on failures, makes it impossible for a process to know if another process
has crashed or is only very slow. It follows that, when we consider mutex-free
object implementations, the definition of obstruction-freedom, non-blocking, and
wait-freedom copes naturally with any number of process crashes.
Of course, if a process crashes while executing an object operation, it is assumed
that this invocation trivially terminates. As we have seen in Chap. 4 devoted to the
atomicity concept, this operation invocation is then considered either as entirely
executed (and everything appears as if the process crashed just after the invocation)
or not at all executed (and everything appears as if the process crashed just before
the invocation). This is the all-or-nothing semantics associated with crash failures
from the atomicity consistency condition point of view.
Hierarchy of progress conditions It is easy to see that obstruction-freedom, non-
blocking, and wait-freedom define a hierarchy of progress conditions for mutex-free
implementations of concurrent objects.
More generally, the various progress conditions encountered in the implementa-
tion of concurrent objects are summarized in Table 5.1.
Definition The splitter object was implicitly used in Chap. 1 when presenting Lam-
port’s fast mutual exclusion algorithm. A splitter is a concurrent object that provides
processes with a single operation, denoted direction(). This operation returns a value to
the invoking process. The semantics of a splitter is defined by the following properties:
• Validity. The value returned by direction() is stop, left, or right.
• Concurrent execution. If x processes invoke direction(), then (a) at most x − 1 of them obtain the value left, (b) at most x − 1 of them obtain the value right, and (c) at most one of them obtains the value stop.
• Termination. Any invocation of direction() by a process that does not crash terminates.
(Fig. 5.2: x processes invoke direction(); at most x − 1 obtain left, at most x − 1 obtain right, and at most one obtains stop.)
A splitter (Fig. 5.2) ensures that (a) not all the invoking processes go in the same
direction, and (b) the direction stop is taken by at most one process and exactly one
process in a solo execution. As we will see in this chapter, splitters are base objects
used to build more sophisticated concurrent objects.
Let us observe that, for x = 1, the concurrent execution property becomes: if a
single process invokes direction(), only the value stop can be returned. This property
is sometimes called the “solo execution” property.
A wait-free implementation A very simple wait-free implementation of a splitter
object SP is described in Fig. 5.3. The internal representation is made up of two
MWMR atomic registers: LAST , which contains a process index (its initial value is
arbitrary), and a binary register DOOR, whose domain is {open, closed} and which
is initialized to open.
When a process pi invokes SP.direction() it first writes its index i in the atomic
register LAST (line 1). Then it checks if the door is open (line 2). If the door has been
closed by another process, pi returns right (line 3). Otherwise, pi closes the door
(which can be closed by several processes, line 4) and then checks if it was the last
process to have invoked direction() (line 5). If this is the case, we have LAST = i
and pi returns stop, otherwise it returns left.
A process that obtains the value right is actually a “late” process: it arrived late at
the splitter and found the door closed. Differently, a process pi that obtains the value
left is actually a “slow” process: it set LAST ← i but was not quick enough during
the period that started when it wrote its index i into LAST (line 1) and ended when
it read LAST (line 5). According to the previous meanings for “late” and “slow”,
not all the processes can be late, not all the processes can be slow, and at most one process can be neither late nor slow, being "timely" and obtaining the value stop.
Theorem 17 The algorithm described in Fig. 5.3 is a correct wait-free implemen-
tation of a splitter.
operation SP.direction() is
(1) LAST ← i;
(2) if (DOOR = closed)
(3) then return(right)
(4) else DOOR ← closed;
(5)      if (LAST = i)
(6)      then return(stop)
(7)      else return(left)
(8)      end if
(9) end if
end operation
Proof The algorithm of Fig. 5.3 is basically the same as the one implementing
the operation conc_abort_op() presented in Fig. 2.12 (Chap. 2); abort1, abort2, and commit are replaced by right, left, and stop. The following proof is consequently very
close to the proof of Theorem 4. We adapt and repeat it here for self-containment of
the chapter.
The validity property follows trivially from the fact that the only values that can
be returned are right (line 3), stop (line 6), and left (line 7).
As far as the termination property is concerned, let us observe that the code of the
algorithm contains neither loops nor wait statements. It follows that any invocation
of SP.direction() by a process (which does not crash) does terminate and returns a
value. The implementation is consequently wait-free.
As far as the solo execution property is concerned, it follows from a simple
examination of the code and the fact that the door is initially open that, if a single
process invokes SP.direction() (and does not crash before executing line 6), it returns
the value stop.
Let us now consider the concurrent execution property. For a process to obtain
right, the door must be closed (lines 2–3). As the door is initially open, it follows that
the door was closed by at least one process p and this was done at line 4 (which is the
only place where a process can close the door). According to the value of LAST (line
5), process p will return stop or left. It follows that, among the x processes which
invoke SP.direction(), at least one does not return the value right.
As far as the value left is concerned, we have the following. Let pi be the last
process that writes its index i into the register LAST (as this register is atomic, the
notion of “last” writer is well defined). If the door is closed, it obtains the value
right. If the door is open, it finds LAST = i and obtains the value stop. Hence, not
all processes can return left.
Let us finally consider the value stop. Let pi be the first process that finds LAST equal to its own index i (line 5). This means that no process pj, j ≠ i, has modified LAST during the period starting when it was written by pi at line 1 and ending when it was read by pi at line 5 (Fig. 5.4). It follows that any process pj that modifies LAST
after this register was read by pi will find the door closed (line 2). Consequently, any
such pj cannot obtain the value stop.
The reader may check that the proof of the splitter object remains valid if
processes crash.
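As an illustration, the splitter logic can be sketched in Python. This is an illustrative sketch, not the book's code: plain attribute accesses stand in for the MWMR atomic registers LAST and DOOR, and the class and method names are assumptions.

```python
class Splitter:
    """Sketch of the splitter of Fig. 5.3 (illustrative rendering).

    Plain attribute reads/writes stand in for the MWMR atomic
    registers LAST and DOOR used by the algorithm."""
    def __init__(self):
        self.LAST = None          # initial value is arbitrary
        self.DOOR = "open"

    def direction(self, i):
        self.LAST = i                  # line 1: announce myself
        if self.DOOR == "closed":      # line 2: door already closed?
            return "right"             # line 3: a "late" process
        self.DOOR = "closed"           # line 4: close the door
        if self.LAST == i:             # line 5: still the last writer?
            return "stop"              # line 6: the "timely" process
        return "left"                  # line 7: a "slow" process
```

A solo invocation obtains stop; any process arriving after the door has been closed obtains right, matching the late/slow/timely reading given in the text.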
This section presents a simple obstruction-free timestamp object built from atomic
registers. Actually, the object is built from splitters, which as we have just seen, are
in turn built from atomic read/write registers.
Definition The object is a weak timestamp generator object which provides the processes with a single operation, denoted get_timestamp(), which returns a natural integer. Its specification is the following:
• Validity. No two invocations of get_timestamp() return the same value.
• Consistency. Let gt1 () and gt2 () be two distinct invocations of get_timestamp().
If gt1 () returns before gt2 () starts, the timestamp returned by gt2 () is greater than
the one returned by gt1 ().
• Termination. Obstruction-freedom.
It is easy to see that a lock-based implementation of a timestamp object is triv-
ial: an atomic register protected by a lock is used to supply timestamps. But, as
already noticed, locking and obstruction-freedom are incompatible in asynchronous
crash-prone systems. It is also trivial to implement this object directly from the
fetch&add() primitive. The presentation of such a timestamp generator object is
mainly pedagogic, namely showing an obstruction-free implementation built on top
of read/write registers only.
An algorithm The obstruction-free implementation relies on the following under-
lying data structures:
• NEXT defines the value of the next integer that can be used as a timestamp. It is
initialized to 1.
• LAST is an unbounded array of atomic registers. A process pi deposits its index i
in LAST [k] to indicate it is trying to obtain the timestamp k.
• COMP is another unbounded array of atomic Boolean registers, each entry of which is initialized to false. A process pi sets COMP[k] to true to indicate that it is competing for the timestamp k (hence, several processes can write true into COMP[k]).
operation get_timestamp() is
(1) t ← NEXT;
(2) repeat forever
(3)    LAST[t] ← i;
(4)    if (¬COMP[t])
(5)    then COMP[t] ← true;
(6)         if (LAST[t] = i) then NEXT ← t + 1; return(t) end if
(7)    end if;
(8)    t ← t + 1
(9) end repeat
end operation
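The splitter-like loop over NEXT, LAST, and COMP described above can be sketched in Python. This is an illustrative sketch under assumptions: dictionaries emulate the unbounded arrays of atomic registers, and the sequential usage below stands in for concurrent invocations (without contention the loop always exits, which is exactly what obstruction-freedom promises).

```python
from collections import defaultdict

class WeakTimestampGenerator:
    """Sketch of the obstruction-free timestamp generator described in
    the text (illustrative rendering; names follow the text, the Python
    encoding is an assumption)."""
    def __init__(self):
        self.NEXT = 1
        self.LAST = {}                          # LAST[k]: last candidate for k
        self.COMP = defaultdict(lambda: False)  # COMP[k]: is k competed for?

    def get_timestamp(self, i):
        k = self.NEXT                # start at the next free timestamp
        while True:                  # obstruction-free: may loop under contention
            self.LAST[k] = i         # announce interest in timestamp k
            if not self.COMP[k]:
                self.COMP[k] = True  # close the "door" of splitter k
                if self.LAST[k] == i:   # timely at splitter k: k is mine
                    self.NEXT = k + 1
                    return k
            k += 1                   # otherwise move to the next splitter
```

Each entry k behaves like a splitter: a process that is "timely" at entry k obtains k as its timestamp; otherwise it moves on to entry k + 1.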
The ABA problem When using compare&swap(), a process pi usually does the
following. It first reads the atomic register X (obtaining the value a), then executes
statements (possibly involving accesses to the shared memory) and finally updates
X to a new value c only if X has not been modified by another process since it was
read by pi . To that end, pi invokes X.compare&swap(a, c) (Fig. 5.6).
Unfortunately, the fact that this invocation returns true to pi does not allow pi
to conclude that X has not been modified since the last time it read it. This is
because, between the read of X and the invocation X.compare&swap(a, c) both
issued by pi , X could have been updated twice, first by a process pj that success-
fully invoked X.compare&swap(a, b) and then by a process pk that has successfully
invoked X.compare&swap(b, a), thereby restoring the value a to X. This is called
the ABA problem.
Solving the ABA problem This problem can be solved by associating tags (sequence numbers) with each value that is written. The atomic register X is then composed of two fields ⟨content, tag⟩. When it reads X, a process pi obtains a pair ⟨x, y⟩ (where x is the current "data value" of X) and it later invokes X.compare&swap(⟨x, y⟩, ⟨c, y + 1⟩) to write a new value c into X. It is easy to see that the write succeeds only if X has continuously been equal to ⟨x, y⟩.
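The tag technique can be sketched as follows. This is a hedged Python sketch, not the book's code: the class TaggedRegister and its methods are illustrative names, and a lock emulates the atomicity of the hardware compare&swap primitive.

```python
import threading

class TaggedRegister:
    """Sketch of a register whose ⟨content, tag⟩ pair defeats the ABA
    problem (illustrative; a lock emulates the atomicity of the
    hardware compare&swap)."""
    def __init__(self, content):
        self._pair = (content, 0)      # ⟨content, tag⟩
        self._lock = threading.Lock()

    def read(self):
        return self._pair              # atomic snapshot of ⟨content, tag⟩

    def compare_and_swap(self, old_pair, new_content):
        with self._lock:               # emulates one atomic step
            if self._pair == old_pair:
                # the tag is incremented on every successful write
                self._pair = (new_content, old_pair[1] + 1)
                return True
            return False

# An A -> B -> A run: the value is back to 'a', but the tag betrays it.
X = TaggedRegister('a')
pair = X.read()                          # pi reads ⟨'a', 0⟩
X.compare_and_swap(X.read(), 'b')        # pj writes 'b': now ⟨'b', 1⟩
X.compare_and_swap(X.read(), 'a')        # pk restores 'a': now ⟨'a', 2⟩
assert X.compare_and_swap(pair, 'c') is False   # pi's stale CAS rightly fails
```

Without the tag, pi's final compare&swap would have succeeded even though X was modified twice in between; with the tag it fails, as required.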
Differently, (Q ↓).tail.ptr points to the dummy cell only when the list is empty.
Moreover, we have initially (Q ↓).head.tag = (Q ↓).tail.tag = 0.
It is assumed that the operation new_cell() creates a new cell in the shared memory,
while the operation free_cell(pt) frees the cell pointed to by pt.
The algorithm implementing the operation Q.enq() As already indicated, these
algorithms consist in handling pointers in an appropriate way. An interesting point is
the fact that they require processes to help other processes terminate their operations.
Actually, this helping mechanism is the mechanism that implements the non-blocking
property.
The algorithm implementing the enq() operation is described at lines 1–13 of
Fig. 5.9. The invoking process pi first creates a new cell in the shared memory,
assigns its address to the local pointer pt_cell, and updates its fields value and next.ptr
(line 1). Then pi enters a loop that it will exit when the value v will be enqueued.
In the loop, pi executes the following statements. It is important to notice that,
in order to obtain consistent pointer values, these statements include sequences of
read and re-read (with compare&swap) to check that pointer values have not been
modified.
• Process pi first makes local copies (kept in tail and next) of (Q ↓).tail and
(tail.ptr ↓).next, respectively. These values inform pi on the current state of the
tail of the queue (lines 3–4).
• Then pi checks if the content of (Q ↓).tail has changed since it read it (line 5).
If it has changed, tail.ptr no longer points to the last element of the queue.
Consequently, pi starts the loop again.
• If tail = (Q ↓).tail (line 6), pi optimistically considers that no other process is
currently trying to enqueue a value. It then checks if next.ptr is equal to ⊥.
operation Q.deq() is
(14) repeat forever
(15)    head ← (Q ↓).head;
(16)    tail ← (Q ↓).tail;
(17)    next ← (head.ptr ↓).next;
(18)    if (head = (Q ↓).head) then
(19)       if (head.ptr = tail.ptr)
(20)       then if (next.ptr = ⊥) then return(⊥) end if;
(21)            ((Q ↓).tail).compare&swap(tail, ⟨next.ptr, tail.tag + 1⟩)
(22)       else v ← ((next.ptr) ↓).value;
(23)            if ((Q ↓).head).compare&swap(head, ⟨next.ptr, head.tag + 1⟩)
(24)            then free_cell(head.ptr); return(v)
(25)            end if
(26)       end if
(27)    end if
(28) end repeat
end operation
∗ If process pi succeeds in appending its new cell to the list, it tries to update the content of (Q ↓).tail. This is done by executing ((Q ↓).tail).compare&swap(tail, ⟨pt_cell, tail.tag + 1⟩) (line 8). Finally, pi returns from its invocation.
Let us observe that it is possible that the second compare&swap does not
succeed. This is the case when, due to asynchrony, another process pj did the
work for pi by executing line 10 of enq() or line 21 of deq().
– If next.ptr ≠ ⊥, pi discovers that next does not point to the last element of the queue. Hence, the value of (Q ↓).tail was not up to date when pi read it: another process has added an element to the queue but had not yet updated (Q ↓).tail when pi read it. In that case, pi tries to help the other process terminate the update of (Q ↓).tail if not yet done. To that end, it executes the statement ((Q ↓).tail).compare&swap(tail, ⟨next.ptr, tail.tag + 1⟩) (line 10) before restarting the loop.
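The enqueue side described above can be sketched in Python. This is a Michael & Scott-style sketch under stated assumptions: ⟨ptr, tag⟩ pairs are modelled as tuples, a single lock _MEM emulates the atomicity of compare&swap, and all names are illustrative rather than the book's code.

```python
import threading

_MEM = threading.Lock()   # one lock emulating the atomicity of compare&swap

class Cell:
    def __init__(self, value):
        self.value = value
        self.next = (None, 0)          # ⟨ptr, tag⟩

class NonBlockingQueue:
    """Sketch of the linked-list queue's enqueue described in the text
    (Michael & Scott style; the Python rendering is an assumption)."""
    def __init__(self):
        dummy = Cell(None)             # dummy cell: head and tail point to it
        self.head = (dummy, 0)
        self.tail = (dummy, 0)

    def _cas(self, obj, field, old, new):
        with _MEM:                     # one atomic compare&swap step
            if getattr(obj, field) == old:
                setattr(obj, field, new)
                return True
            return False

    def enq(self, v):
        cell = Cell(v)
        while True:
            tail = self.tail                    # read the tail pointer...
            next_ = tail[0].next                # ...and its successor
            if tail == self.tail:               # tail still consistent?
                if next_[0] is None:            # tail really points to the last cell
                    # try to append the new cell
                    if self._cas(tail[0], 'next', next_, (cell, next_[1] + 1)):
                        # try to swing tail (a helper may already have done it)
                        self._cas(self, 'tail', tail, (cell, tail[1] + 1))
                        return
                else:
                    # tail is lagging: help the other enqueuer, then retry
                    self._cas(self, 'tail', tail, (next_[0], tail[1] + 1))
```

The second compare&swap in the append branch may fail without harm: as in the text, that simply means another process has already helped move the tail pointer forward.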
The stack and its operations The stack has two operations, denoted push(v)
(where v is the value to be added at the top of the stack) and pop(). It is a bounded
stack: it can contain at most k values. If the stack is full, push(v) returns the control
value full, otherwise v is added to the top of the stack and the control value done is
returned. The operation pop() returns the value that is at the top of the stack (and
suppresses it from the stack), or the control value empty if the stack is empty.
Internal representation of the stack This non-blocking implementation of an
atomic stack is due to N. Shafiei (2009). The stack is implemented with an atomic
register denoted TOP and an array of k + 1 atomic registers denoted STACK[0..k].
These registers can be read and can be modified only by using the compare&swap()
primitive.
• TOP has three fields that contain an index (to address an entry of STACK), a value, and a counter. It is initialized to ⟨0, ⊥, 0⟩.
• Each atomic register STACK[x] has two fields: the field STACK[x].val, which contains a value, and the field STACK[x].sn, which contains a sequence number (used to prevent the ABA problem as far as STACK[x] is concerned). STACK[0] is a dummy entry initialized to ⟨⊥, −1⟩. Its first field always contains the default value ⊥. As far as the other entries are concerned, STACK[x] (1 ≤ x ≤ k) is initialized to ⟨⊥, 0⟩.
The array STACK is used to store the contents of the stack, and the register TOP
is used to store the index and the value of the element at the top of the stack. The
contents of TOP and STACK[x] are modified with the help of the conditional write instruction compare&swap() (which is used to prevent erroneous modifications of the stack internal representation).
The implementation is lazy in the sense that a stack operation assigns its new
value to TOP and leaves the corresponding effective modification of STACK to the
next stack operation. Hence, while on the one hand a stack operation is lazy, on the
other hand it has to help terminate the previous stack operation (as far as the internal
representation of the stack is concerned).
The algorithm implementing the operation push(v) When a process pi invokes
push(v), it enters a repeat loop that it will exit at line 4 or line 7. The process first
reads the content of TOP (which contains the last operation on the stack) and stores
its three fields in its local variables index, value, and seqnb (line 2).
Then, pi calls the internal procedure help(index, value, seqnb) to help terminate
the previous stack operation (line 3). That stack operation (be it a push() or a pop()) is required to write the pair ⟨value, seqnb⟩ into STACK[index]. To that end, pi invokes STACK[index].compare&swap(old, new) with the appropriate values old and new so that the write is executed only if not yet done (lines 17–18).
After its help (which actually writes only if the write was not already done by another stack operation) to move the content of TOP into STACK[index], pi returns full if the stack is full (line 4). If the stack is not full, it tries to modify TOP so that it registers its push
operation. This invocation of TOP.compare&swap() (line 7) succeeds if no other
process has modified TOP since it was read by pi at line 2. If it succeeds, TOP takes its new value and push(v) returns the control value done (line 7). Otherwise, pi executes the body of the repeat loop again until its invocation of push() succeeds.
The triple of values to be written in TOP at line 7 is computed at lines 5–6. Process pi first computes the last sequence number sn_of_next used in STACK[index + 1] and then defines the new triple, namely newtop = ⟨index + 1, v, sn_of_next + 1⟩, to be written first in TOP and, later, in STACK[index + 1] thanks to the help provided by the next stack operation (let us remember that sn_of_next + 1 is used to prevent the ABA problem).
The algorithm implementing the operation pop() The algorithm implementing
this operation has exactly the same structure as the previous one and is nearly the
same. Its explanation is consequently left to the reader.
Linearization points of the push() and pop() operations The operations that
terminate are linearizable; i.e., they can be totally ordered on the time line, each
operation being associated with a single point of that line after its start event and
before its end event. Its start event corresponds to the execution of the first statement of
an operation, and its end event corresponds to the execution of the return() statement.
More precisely, an invocation of an operation appears as if it was atomically executed
• when it reads TOP (at line 2 or 10) if it returns full or empty (at line 4 or 12),
• or at the time at which its invocation of TOP.compare&swap(−, −) (at line 7 or 15) is successful (i.e., returns true).
operation push(v) is
(1) repeat forever
(2)    ⟨index, value, seqnb⟩ ← TOP;
(3)    help(⟨index, value, seqnb⟩);
(4)    if (index = k) then return(full) end if;
(5)    sn_of_next ← STACK[index + 1].sn;
(6)    newtop ← ⟨index + 1, v, sn_of_next + 1⟩;
(7)    if TOP.compare&swap(⟨index, value, seqnb⟩, newtop) then return(done) end if
(8) end repeat
end operation

operation pop() is
(9)  repeat forever
(10)    ⟨index, value, seqnb⟩ ← TOP;
(11)    help(⟨index, value, seqnb⟩);
(12)    if (index = 0) then return(empty) end if;
(13)    ⟨val, sn⟩ ← STACK[index − 1];
(14)    newtop ← ⟨index − 1, val, sn + 1⟩;
(15)    if TOP.compare&swap(⟨index, value, seqnb⟩, newtop) then return(value) end if
(16) end repeat
end operation

procedure help(⟨index, value, seqnb⟩) is
(17) ⟨val, sn⟩ ← STACK[index];
(18) if (sn ≠ seqnb) then STACK[index].compare&swap(⟨val, sn⟩, ⟨value, seqnb⟩) end if
end procedure
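The push()/pop()/help() interplay described in the text can be sketched in Python. This is an illustrative sketch only: the class name LazyStack, the rendering of the registers, and the exact guard used in the helping step are assumptions; a single lock emulates the atomicity of compare&swap.

```python
import threading

_MEM = threading.Lock()   # emulates the atomicity of compare&swap

class LazyStack:
    """Sketch of the lazy k-bounded stack described in the text
    (Shafiei style; TOP = ⟨index, value, seqnb⟩, STACK[x] = ⟨val, sn⟩)."""
    def __init__(self, k):
        self.k = k
        self.TOP = (0, None, 0)                       # ⟨0, ⊥, 0⟩
        self.STACK = [(None, -1)] + [(None, 0)] * k   # STACK[0] is a dummy

    def _cas(self, get, put, old, new):
        with _MEM:                    # one atomic compare&swap step
            if get() == old:
                put(new)
                return True
            return False

    def _help(self, index, value, seqnb):
        # finish the previous operation: install ⟨value, seqnb⟩ if not yet done
        val, sn = self.STACK[index]
        if sn != seqnb:
            self._cas(lambda: self.STACK[index],
                      lambda p: self.STACK.__setitem__(index, p),
                      (val, sn), (value, seqnb))

    def push(self, v):
        while True:
            index, value, seqnb = self.TOP     # read the last operation
            self._help(index, value, seqnb)    # help it terminate
            if index == self.k:
                return "full"
            newtop = (index + 1, v, self.STACK[index + 1][1] + 1)
            if self._cas(lambda: self.TOP,
                         lambda t: setattr(self, "TOP", t),
                         (index, value, seqnb), newtop):
                return "done"

    def pop(self):
        while True:
            index, value, seqnb = self.TOP
            self._help(index, value, seqnb)
            if index == 0:
                return "empty"
            below = self.STACK[index - 1]
            newtop = (index - 1, below[0], below[1] + 1)
            if self._cas(lambda: self.TOP,
                         lambda t: setattr(self, "TOP", t),
                         (index, value, seqnb), newtop):
                return value
```

The laziness is visible here: an operation only updates TOP, and the corresponding write into STACK is performed by the help step of the next operation.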
• NEXT is an atomic register that contains the index of the next entry where a value
can be deposited. It is initialized to 1. This register can be read by any process. It
can be modified by any process by invoking NEXT .fetch&add(), which adds 1 to
NEXT and returns its new value.
• Case 1: pi crashes after it has executed the atomic statement REG[in] ← v (line 2).
In this case, from an external observer point of view, everything appears as if pi
crashed after it invoked STACK.push(v).
• Case 2: pi crashes after it has obtained an index value (line 1) and before it invokes
the atomic statement REG[in] ← v. In this case, pi has obtained an entry in from
operation push(v) is
(1) in ← NEXT.fetch&add();
(2) REG[in] ← v;
(3) return()
end operation

operation pop() is
(4) last ← NEXT;
(5) for x from last down to 1 do
(6)    aux ← REG[x].swap(⊥);
(7)    if (aux ≠ ⊥) then return(aux) end if
(8) end for;
(9) return(empty)
end operation
NEXT but did not deposit a value into REG[in], which consequently will remain
forever equal to ⊥. In this case, from an external observer point of view, everything
appears as if the process crashed before invoking STACK.push(v).
From an internal point of view, the crash of pi just before executing REG[in] ← v
entails an increase of NEXT . But as the corresponding entry of the array REG will
remain forever equal to ⊥, this increase of NEXT can only increase the duration
of the loop but cannot affect its output.
If pi executes REG[x] ← c after all the values deposited at entries with an index
greater than x have been removed from the stack, and before new values are pushed
onto the stack, then the linearization point associated with push(c) is the time at
which pi executes REG[x] ← c.
While the definition of the linearization points associated with the operation invo-
cations on a concurrent object is sometimes fairly easy, the previous wait-free imple-
mentation of a stack (whose algorithms are simple) shows that this is not always the
case. This is due to the net effect of the mutex-freedom requirement and asynchrony.
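The stack discussed above can be sketched in Python. This is an illustrative sketch under assumptions: it takes pop() to extract an entry with an atomic swap, the control values "done" and "empty" are illustrative, and a lock emulates the atomicity of the fetch&add and swap primitives.

```python
import threading

class SimpleStack:
    """Sketch of the stack built from a fetch&add-based counter NEXT and
    an array REG of registers (illustrative rendering; the atomic swap
    used in pop() is an assumption consistent with the text)."""
    def __init__(self):
        self._lock = threading.Lock()
        self.NEXT = 0
        self.REG = {}                 # REG[x] = value, or ⊥ (None)

    def _fetch_and_add(self):
        with self._lock:              # atomic: NEXT ← NEXT + 1, return new value
            self.NEXT += 1
            return self.NEXT

    def _swap(self, x, new):
        with self._lock:              # atomic: exchange REG[x] and new
            old = self.REG.get(x)
            self.REG[x] = new
            return old

    def push(self, v):
        index = self._fetch_and_add() # line 1: reserve an entry
        self.REG[index] = v           # line 2: deposit the value
        return "done"

    def pop(self):
        last = self.NEXT              # line 4: highest entry possibly used
        for x in range(last, 0, -1):  # lines 5-8: scan from the top down
            aux = self._swap(x, None) # atomically extract REG[x]
            if aux is not None:
                return aux
        return "empty"                # line 9
```

A crashed push that reserved an entry but never wrote into it simply leaves a ⊥ that pop() skips over, which is exactly the case analysis made in the text.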
Let us consider the case where (a) the processes can cooperate by accessing base
read/write atomic registers only and (b) any number of processes may crash. Let
us suppose that, in such a context, we have an obstruction-free implementation of
a concurrent object (hence this implementation relies only on read/write atomic
registers). An important question is then the following: Is it possible to boost this
implementation in order to obtain a non-blocking or even a wait-free implementa-
tion? This section presents an approach based on failure detectors that answers this
question.
The failure detector ΩX (eventual leadership) This failure detector provides each process pi with a local variable denoted ev_leader(X) (eventual leader in the set X)
such that the following properties are always satisfied:
• Validity. At any time, the variable ev_leader(X) of any process contains a process
index.
• Eventual leadership. There is a finite time after which the local variables
ev_leader(X) of the correct processes of X contain the same index, which is
the index of one of them.
This means that there is an arbitrarily long anarchy period during which the
content of any local variable ev_leader(X) can change and, at the same time, distinct
processes can have different values in their local variables. However, this anarchy
period terminates for the correct processes of X, and when it has terminated, the local variables ev_leader(X) of the correct processes of X contain forever the same index, and it is the index of one of them. The time at which this occurs is finite but remains unknown to the processes. This means that, when a process of X reads x from ev_leader(X), it can never be sure that px is correct. In that sense, the information on failures (or the absence of failures) provided by ΩX is particularly weak.
Remark on the use of ΩX This failure detector is usually used in a context where
X denotes a dynamically defined subset of processes. It then allows these processes
to rely on the fact that one of them (which is correct) is eventually elected as their
common leader.
It is possible that, at some time, a process pi perceives locally X as being xi while another process pj perceives it as being xj ≠ xi. Consequently, the local read-only variables provided by ΩX are denoted ev_leader(xi) at pi and ev_leader(xj) at pj. As xi and xj may change with time, this means that ΩX may potentially be required to produce outputs for any non-empty subset x of Π (the whole set of processes composing the system).
The failure detector ♦P (eventually perfect) This failure detector provides each process pi with a local set variable denoted suspectedi such that the following properties are always satisfied:
• Completeness. Eventually, the set suspectedi of each correct process pi contains the indexes of all the crashed processes.
• Eventual accuracy. There is a finite time after which the set suspectedi of each correct process pi contains the indexes of the crashed processes and only them.
As with ΩX, (a) there is an arbitrarily long anarchy period during which each set suspectedi can contain arbitrary values, and (b) the time at which this anarchy period terminates remains unknown to the processes.
It is easy to see that ♦P is stronger than ΩX (actually, it is strictly stronger). Let us assume that we are given ♦P. The output of ΩX can be constructed as follows. For a process pi such that i ∉ X, the current value of ev_leader(X) is any process index and it can change at any time. For a process pi such that i ∈ X, the output of ΩX is defined as follows: ev_leader(X) = min((Π \ suspectedi) ∩ X). The reader can check that the local variables ev_leader(X) satisfy the validity and eventual leadership properties of ΩX.
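The extraction of an eventual leader from ♦P described above can be sketched as follows (a hypothetical helper: the function name and parameters are illustrative; processes are indexed 1..n, so Π = {1, …, n}, and suspected is the current local output of ♦P):

```python
def ev_leader(X, suspected, n):
    """Sketch of the construction described in the text: the leader of X
    is the smallest non-suspected index of X. Once the set suspected is
    accurate, all correct processes of X compute the same index."""
    # ev_leader(X) = min((Pi \ suspected) ∩ X)
    candidates = (set(range(1, n + 1)) - set(suspected)) & set(X)
    return min(candidates)

# With processes {1..5}: if 2 is suspected, the leader of X = {2, 4, 5} is 4.
assert ev_leader({2, 4, 5}, {2}, 5) == 4
```

During the anarchy period different processes may hold different suspected sets and hence compute different leaders; the eventual leadership property only requires agreement after ♦P has stabilized.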
operation get_timestamp() is
(1) need_help(i); t ← NEXT;
(2) repeat forever
(3)    LAST[t] ← i;
(4)    if (¬COMP[t])
(5)    then COMP[t] ← true;
(6)         if (LAST[t] = i) then NEXT ← t + 1; stop_help(i); return(t) end if
(7)    end if;
(8)    t ← t + 1
(9) end repeat
end operation
Let us remember that ♦P provides each process pi with a set suspectedi that eventually contains all crashed processes and only them.
This contention manager uses an underlying operation, denoted weak_ts(), that
generates locally increasing timestamps such that, if a process obtains a timestamp
value ts, then any process can obtain only a finite number of timestamp values
lower than ts. This operation weak_ts() can be implemented from atomic read/write
registers only. (Let us remark that weak_ts() is a weaker operation than the operation
get_timestamp() described in Fig. 5.5.)
The internal representation of the contention manager consists of an array of
SWMR atomic read/write registers TS[1..n] such that only pi can write TS[i]. This
array is initialized to [0, . . . , 0].
When pi invokes need_help(i), it assigns a weak timestamp to TS[i] (line 1). It will
reset TS[i] to 0 only when it executes stop_help(i). Hence, TS[i] ≠ 0 means that pi
is competing inside the contention manager. After it has assigned a value to TS[i], pi
waits (loops) until the pair (TS[i], i) is the smallest pair (according to lexicographical
ordering) among the pairs of the processes that (a) are competing inside the contention
manager and (b) are not locally suspected to have crashed (lines 2–4).
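The waiting rule just described can be sketched in Python as follows (the callable suspected stands for the local output of ♦P, and weak_ts for the timestamp generator; names are ours):

```python
# Sketch of the contention manager's waiting rule. TS[j] != 0 means p_j is
# competing; a process returns from need_help(i) only when its pair (TS[i], i)
# is the lexicographic minimum among the competing, non-suspected processes.
class ContentionManager:
    def __init__(self, n, weak_ts, suspected):
        self.TS = [0] * n
        self.weak_ts = weak_ts      # callable: i -> a new weak timestamp
        self.suspected = suspected  # callable: i -> current set suspected_i

    def need_help(self, i):
        if self.TS[i] == 0:
            self.TS[i] = self.weak_ts(i)
        while True:                 # lines 2-4: busy-wait on the minimum pair
            competing = {(self.TS[j], j)
                         for j in range(len(self.TS))
                         if self.TS[j] != 0 and j not in self.suspected(i)}
            if min(competing) == (self.TS[i], i):
                return

    def stop_help(self, i):
        self.TS[i] = 0              # p_i no longer competes
```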
Theorem 20 The contention manager described in Fig. 5.16 transforms an enriched
obstruction-free implementation of an object into a wait-free implementation.
Proof The proof is similar to the proof of Theorem 19. Let us suppose (by
contradiction) that there is an operation invocation by a correct process pi that never
terminates. Let tsi be its timestamp (obtained at line 1). Moreover, let this invocation
be the one with the smallest pair (tsi, i) among all the invocations issued by correct
processes that never terminate.
It follows from the property of weak_ts() that any other process obtains a finite
number of timestamp values smaller than tsi, from which we conclude that there is
a finite number of operation invocations that are lexicographically ordered before
(tsi, i). Let I be this set of invocations. There are two cases.
• If an invocation of I issued by a process pj that is not correct (i.e., a process
that will crash in the execution) does not terminate, it follows from the completeness
property of ♦P that eventually j is forever suspected by pi (i.e., j remains forever in
its set suspectedi).
operation need_help(i) is
(1) if (TS[i] = 0) then TS[i] ← weak_ts() end if;
(2) repeat
(3)   let competing be {(TS[j], j) such that (TS[j] ≠ 0) ∧ (j ∉ suspectedi)}
(4) until ((TS[i], i) = min(competing)) end repeat
end operation
It then follows from the set computed by pi at line 3 that there is a finite time
after which, whatever the value of the pair (tsj, j) attached to the invocation issued
by pj, j will never belong to the set competing repeatedly computed by pi. Hence,
these invocations cannot prevent pi from progressing.
• Let us now consider the invocations in I issued by correct processes. Due to
the definition of the pair (tsi, i) and of pi, all these invocations terminate. Moreover,
due to the definition of I, any of these processes pj that invokes again an operation
obtains a pair (tsj, j) that is greater than the pair (tsi, i). Consequently, the fact that
j belongs or not to the set suspectedi of pi cannot prevent pi from progressing.
To conclude the proof, as pi is correct, it follows from the eventual accuracy
property of ♦P that there is a finite time after which i never belongs to the set
suspectedk of any correct process pk.
Hence, there is a finite time after which, at any correct process pj, i ∉ suspectedj
and (tsi, i) is the smallest pair. As the number of processes is bounded, it follows
that, when this occurs, only pi can progress, which contradicts the initial assumption.
On the design principles of contention managers As one can see, this contention
manager and the previous one are based on the same design principle. When a process
asks for help, a priority is given to some process so that it can proceed alone and
benefit from the obstruction-freedom property.
In the case of non-blocking, it is required that at least one among the concurrent
processes progresses. This was obtained from Ω_X, and the only additional underlying
objects which are required are bounded atomic read/write registers. As any
invocation by a correct process has to terminate, the case of wait-freedom is more
demanding. This progress property is obtained from ♦P and unbounded atomic
read/write registers.
5.4 Summary
This chapter has introduced the notion of a mutex-free implementation and the asso-
ciated progress conditions, namely obstruction-freedom, non-blocking, and wait-
freedom.
To illustrate these notions, several mutex-free implementations of concurrent
objects have been described: wait-free splitter, obstruction-free counter, non-blocking
queue and stack based on compare&swap registers, and wait-free queues based
on fetch&add registers and swap registers. Techniques based on failure detectors
have also been described that allow boosting of an obstruction-free implementa-
tion of a concurrent object to a non-blocking or wait-free implementation of that
object.
1. Prove that the concurrent queue implemented by Michael & Scott’s non-blocking
algorithm presented in Sect. 5.2.4 is an atomic object (i.e., its operations are
atomic).
Solution in [205].
2. The hardware-provided primitives LL(), SC() and VL() are defined in Sect. 6.3.2.
Modify Michael & Scott’s non-blocking algorithm to obtain an algorithm that
uses the operations LL(), SC(), and VL() instead of compare&swap().
3. A one-shot atomic test&set register R allows each process to invoke the operation
R.test&set() once. This operation is such that one of the invoking processes
obtains the value winner while the other invoking processes obtain the value
loser.
Let us consider an atomic swap() operation that can be used by two (statically
determined) processes only. Assuming that there are n processes, this means
that there is a half-matrix of registers MSWAP such that (a) MSWAP[i, j] and
MSWAP[j, i] denote the same atomic register, (b) this register can be accessed
only by pi and pj , and (c) their accesses are invocations of MSWAP[j, i].swap().
Design, in such a context, a wait-free algorithm that implements R.test&set().
Solutions in [13].
As already mentioned, if a process crashes while holding a lock, the processes that
invoke a lock-based operation on the same object can be blocked forever. Hence, locks
cannot cope with process crashes. This means that the implementations described in
this chapter tolerate process crashes in all executions in which no process crashes
while holding a lock.
6.2 A Static Hybrid Implementation of a Concurrent Set Object
The internal representation of a set The set S is represented by a linked list pointed
to by a pointer kept in an atomic register HEAD. A cell of the list (say NEW_CELL)
is made up of four atomic registers:
• NEW_CELL.val, which contains a value (element of the set).
• NEW_CELL.out, a Boolean (initialized to false) that is set to true when the
corresponding element is suppressed from the list.
• NEW_CELL.lock, which is a lock used to ensure mutual exclusion (when needed)
on the registers composing the cell. This lock is accessed with the operations
acquire_lock() and release_lock().
• NEW_CELL.next, which is a pointer to the next cell.
The set is organized as a sorted linked list. Initially the list is empty and contains
two sentinel cells, as indicated in Fig. 6.1. The values associated with these cells are
the default values denoted ⊥ and ⊤. These values cannot belong to the set and are
such that, for any value v of the set, we have ⊥ < v < ⊤. All operations are based
on list traversal.
The algorithm implementing the operation S.remove(v) This algorithm is described
in lines 1–9 of Fig. 6.2. Using the fact that the list is sorted in increasing order,
the invoking process pi traverses the list from the beginning until the first cell whose
element v′ is greater than or equal to v (lines 1–2). Then it locks two cells: the cell
possibly containing the element v (which is pointed to by its local variable curr) and
the immediately preceding cell (which is pointed to by its local variable pred).
The list traversal and the locking of the two consecutive cells are asynchronous,
and other processes can concurrently access the list to add or remove elements. It is
consequently possible that there are synchronization conflicts that make the content
of pred and curr no longer valid. More specifically, the cell pointed to by pred or
curr could have been removed, or new cells could have been inserted between the
cells pointed to by pred and curr. Hence, before suppressing the cell containing
v (if any), pi checks that pred and curr are still valid. The Boolean procedure
validate(pred, curr) is used to this end (lines 10–11).
If the validation predicate is false, pi restarts the removal operation (line 9). This
is the price that has to be paid to have an optimistic removal operation (there is no
global locking of the whole list, which would prevent concurrent processes from
traversing the list). Let us remember that, as by assumption there are few invocations
of the remove() and add() operations, pi will eventually terminate its invocation.
If the validation predicate is satisfied, pi checks whether v belongs to the set or
not (Boolean pres, line 5). If v is present, it is suppressed from the set (line 6). This
is done in two steps:
• First the Boolean field out of the cell containing v is set to true. This is a logical
removal (logical because the pointers have not yet been modified to suppress the
cell from the list). This logical removal is denoted S1 in Fig. 6.3.
• Then, the physical removal occurs. The pointer (pred ↓).next is updated to its
new value, namely (curr ↓).next. This physical removal is denoted S2 in Fig. 6.3.
The algorithm implementing the operation S.add(v) This algorithm is described
in lines 12–23 of Fig. 6.2. It is very close to the algorithm implementing the
remove(v) operation. Process pi first traverses the list until it reaches the first cell
whose value field is greater than or equal to v (lines 12–13) and then locks the cell
that precedes it (line 14). Then, as previously, it checks if the values of its pointers
pred and curr are valid (line 15). If they are valid and v is not in the list, pi creates
a new cell that contains v and inserts it into the list (lines 17–20).
Finally, pi releases the lock on the cell pointed to by its local pointer variable
ptr. It returns a Boolean value if the validation predicate was satisfied and restarts
if it was not.
The algorithm implementing the operation S.contain(v) This algorithm is described
at lines 24 and following of Fig. 6.2. As it does not use locks and cannot be delayed
by locks used by the add() and remove() operations, it is wait-free. It consists of
a simple traversal of the list. Let us remark that, during this traversal, the list does
not necessarily remain constant: cells can be added or removed, and so the values of
the pointers are not necessarily up to date when they are read by the process pi that
invoked S.contain(). Let us consider Fig. 6.5. It is possible that the pointer values
predi and curri of the current invocation of contain(v) by pi are as indicated in the
figure while all the cells between those containing a1 and b are removed (let us remark
that it is also possible that a new cell containing the value v is concurrently added).
The list traversal is the same as for the add() and remove() operations. The value
true is returned if and only if v is currently the value of the cell pointed to by curr
and this cell has not been logically removed. The algorithm relies on the fact that a
cell cannot be recycled as long as it is reachable from a global or local pointer. (In
contrast, cells that are no longer accessible can be recycled.)
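The remove/add/contain scheme described above can be sketched as follows (Python; a simplified single-file sketch of such a lazy list, with ±∞ as sentinel values and names following the text; the book's Fig. 6.2 differs in its details):

```python
import threading

INF = float('inf')   # plays the role of the sentinels ⊥ and ⊤

class Cell:
    def __init__(self, val):
        self.val = val
        self.out = False                  # true after logical removal
        self.next = None
        self.lock = threading.Lock()

class LazySet:
    def __init__(self):
        tail = Cell(INF)                  # sentinel cell for ⊤
        self.HEAD = Cell(-INF)            # sentinel cell for ⊥
        self.HEAD.next = tail

    def _search(self, v):
        pred = self.HEAD
        curr = pred.next
        while curr.val < v:               # sorted-list traversal
            pred, curr = curr, curr.next
        return pred, curr

    def _validate(self, pred, curr):
        # pred and curr must still be in the list and adjacent
        return (not pred.out) and (not curr.out) and pred.next is curr

    def remove(self, v):
        while True:                       # restart on a synchronization conflict
            pred, curr = self._search(v)
            with pred.lock, curr.lock:
                if self._validate(pred, curr):
                    if curr.val != v:
                        return False      # v is not present
                    curr.out = True       # logical removal (step S1)
                    pred.next = curr.next # physical removal (step S2)
                    return True

    def add(self, v):
        while True:
            pred, curr = self._search(v)
            with pred.lock, curr.lock:
                if self._validate(pred, curr):
                    if curr.val == v:
                        return False      # v is already present
                    cell = Cell(v)
                    cell.next = curr
                    pred.next = cell      # single-pointer insertion
                    return True

    def contain(self, v):
        curr = self.HEAD                  # lock-free, wait-free traversal
        while curr.val < v:
            curr = curr.next
        return curr.val == v and not curr.out
```

Locks are taken on at most two adjacent cells and contain() takes none, matching the base properties listed below.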
Base properties The previous implementation of a concurrent set has the following
noteworthy features:
• The traversal of the list by an add()/remove() operation is wait-free (a cell locked
by an add()/remove() does not prevent another add()/remove() from progressing
until it locks a cell).
• Locks are used on at most two (consecutive) cells by an add()/remove() operation.
• Invocations of the add()/remove() operations on non-adjacent list entries do not
interfere, thereby favoring concurrency.
Linearization points Let us remember that the linearization point of an operation
invocation is a point of the time line such that the operation appears as if it had been
executed instantaneously at that time instant. This point must lie between the start
time and the end time of the operation.
The algorithm described in Fig. 6.2 provides the operations add(), remove(), and
contain() with the following linearization points. Let an operation be successful
(unsuccessful) if it returns true (false).
• remove() operation:
– The linearization point of a successful remove(v) operation is when it marks
the value v as being removed from the set, i.e., when it executes the statement
(curr ↓).out ← true (line 6).
– The linearization point of an unsuccessful remove(v) operation is when, during
its list traversal, it reads the first unmarked cell with a value v′ > v (line 2).
• add(v) operation:
– The linearization point of a successful add(v) operation is when it updates the
pointer ( pr ed ↓).next which, from then on, points to the new cell (line 19).
– The linearization point of an unsuccessful add(v) operation is when it reads the
value kept in (curr ↓).val and that value is v (line 16).
• contain(v) operation:
– The linearization point of a successful contain(v) operation is when it checks
whether the value v kept in (curr ↓).val belongs to the set, i.e., (curr ↓).out
is then false (line 26).
– The linearization point of an unsuccessful contain(v) operation is more tricky
to define. This is due to the fact that (as discussed previously with the help of
Fig. 6.5), while contain(v) executes, an execution of add(v) or remove(v) can
proceed concurrently.
Let τ1 be the time at which a cell containing v is found but its field out is marked
true (line 26), or a cell containing v′ > v is found (line 25). Let τ2 be the time
just before the linearization point of a new operation add(v) that adds v to the
set (if there is no such add(v), let τ2 = +∞). The linearization point of an
unsuccessful contain(v) operation is min(τ1, τ2).
The proof that this object construction is correct consists in (a) showing that
the operation contain() is wait-free and the operations add() and remove() are
deadlock-free, and (b) showing that, given any execution, the previous linearization
points associated with the operation invocations define a trace that belongs to the
sequential specification of the set object.
• Case 2. Both values v and 1 − v are written into AUX (line 1).
Let pi be a process that proposes v and reads ⊥ from AUX, and pj a process that
proposes 1 − v and reads ⊥ from AUX. As both pi and pj have read ⊥ from AUX,
we conclude that, at line 3, both pi and pj have read true from PROPOSED[1 − v]
and PROPOSED[v], respectively (Fig. 6.8). It follows that both of them execute
lines 4–8.
Let us now consider a process pk that proposes a value w and reads a non-⊥ value
from AUX. As it reads a non-⊥ value and both PROPOSED[0] and PROPOSED[1]
were equal to true when it read them, it follows that pk necessarily reads true from
PROPOSED[1 − w]. Hence, it executes lines 4–8.
It follows that all processes execute lines 4–8. The first process that acquires the
lock writes the current value of AUX into DECIDED, and that value becomes the
only decided value.
(Let us notice that, due to the arbitrary speed of processes, it is not possible to
predict if it is the first value written in AUX or the second one that will be the
decided value.)
Let us now show that the implementation satisfies the contention sensitiveness
property. We consider each case of “favorable circumstances” separately.
• Case 1: all participating processes propose the same value v.
In this case, PROPOSED[1 − v] remains forever equal to false. It follows that all
the processes that invoke C.propose() write v into the atomic register DECIDED
(line 3). Consequently none of the participating processes ever execute the lines
4–8, which proves the property.
• Case 2: the invocations of C.propose(v) are not concurrent.
Let us consider such an invocation. If it is the first one, it writes v into DECIDED
(line 3) and does not execute the lines 4–8, which proves the property. If other
invocations have been executed before this one, they have all terminated and at
least one of them has written a value into DECIDED (at line 3 or 6). Hence, the
considered invocation C.propose(v) executes line 4 and, as DECIDED ≠ ⊥, it does
not execute lines 5–8, which concludes the proof of the contention sensitiveness
property.
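The structure the proof relies on (a lock-free fast path, a lock-protected path used only under value contention) can be sketched as follows (Python; this is a plausible shape consistent with the proof's line references, not the exact text of the book's figure):

```python
import threading

# Hedged sketch of a contention-sensitive binary consensus object: under the
# "favorable circumstances" (a single proposed value, or no concurrency) only
# atomic reads/writes are used; the lock is taken only under value contention.
class Consensus:
    def __init__(self):
        self.AUX = None                    # None plays the role of ⊥
        self.PROPOSED = [False, False]
        self.DECIDED = None
        self.lock = threading.Lock()

    def propose(self, v):
        self.PROPOSED[v] = True            # announce the proposal
        if self.AUX is None:               # "line 1": write AUX if still ⊥
            self.AUX = v
        if not self.PROPOSED[1 - v]:       # "line 3": no conflicting proposal
            self.DECIDED = v
        elif self.DECIDED is None:         # "lines 4-8": resort to the lock
            with self.lock:
                if self.DECIDED is None:
                    self.DECIDED = self.AUX
        return self.DECIDED
```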
The double-ended queue A double-ended queue has two heads: one on its left side
and one on its right side. The head on one side is the last element of the queue seen
from the other side. Such an object has four operations:
• The operation right_enq(v) (or left_enq(v)) adds v to the queue such that v
becomes the last value on the right (or left) side of the queue.
• The operation right_deq() (or left_deq()) suppresses the last element at the right
(or left) of the queue. If the queue is empty, the operation returns the value empty.
A double-ended queue is defined by a sequential specification. This specification
contains all the correct sequences including all or a subset of the operations. It follows
that, in a concurrency context, a queue has to be an atomic object. A double-ended
queue is a powerful object that generalizes queues and stacks. More precisely, we
have the following (see Fig. 6.9, where the double-ended queue contains the list of
values a, b, c, d, e, f ):
• If either only the operations left_enq() and right_deq() or only the operations
right_enq() and left_deq() are used, the object is a queue.
• If either only the operations left_enq() and left_deq() or only the operations
right_enq() and right_deq() are used, the object is a stack.
Favorable circumstances The implementation that follows considers the following
notion of “favorable circumstances” from the contention sensitiveness point of view:
The operation invocations appear in a concurrency-free context. When this occurs,
such an operation invocation is not allowed to use locks.
Internal representation of a double-ended queue Let DQ be a double-ended
queue. Its internal representation is made up of the following objects:
∀x, y : (x < y) ⇒ [((Q[y] = ⊥ℓ) ⇒ (Q[x] = ⊥ℓ)) ∧ ((Q[x] = ⊥r) ⇒ (Q[y] = ⊥r))].
Hence, at any time, the list of values which have been enqueued and not yet dequeued
is the list kept in the array Q[(LI + 1)..(RI − 1)]. In Fig. 6.9, the current value of the
double-ended queue is represented by the array Q[−2..3].
Atomic operations for accessing a register Q[x] An atomic register Q[x] can
be accessed by three atomic operations, denoted LL() (linked load), SC() (store
conditional) and VL() (validate). These operations are provided by the hardware,
and their effects are described by the algorithms of Fig. 6.10.
Let X be any register Q[x]. The description given in Fig. 6.10 assumes there are
n processes whose indexes are in {1, . . . , n}. It considers that a distinct Boolean
array VALIDX[1..n] is associated with each register X. This array is initialized to
[false, . . . , false].
An invocation of X.LL() (linked load) returns the current value of X and links
this read (issued by a process pi) by setting VALIDX[i] to true (line 1).
An invocation of X.SC(−, v) (store conditional) by a process pi is successful if
no process has written X since pi ’s last invocation of X.LL(). In that case, the write
is executed (line 2) and the value true is returned (line 4). If it is not successful, the
value false is returned (line 5). Moreover, if X.SC(−, v) is successful, all the entries
of the array VALIDX[1..n] are set to false (line 3) to prevent the processes that have
previously invoked X.LL() from having a successful X.SC().
Fig. 6.10 Definition of the atomic operations LL(), SC(), and VL() (code for process pi)
An invocation of X.VL() (validate) by a process pi returns true if and only if
no other process has issued a successful X.SC() operation since the last X.LL()
invocation issued by pi.
It is important to notice that, between an invocation of X.LL() and an invocation
of X.SC() or X.VL(), a process pi can execute any code at any speed (including
invocations of Y.LL(), Y.SC(), and Y.VL(), where Y ≠ X).
LL/SC primitives appear in MIPS architectures. Variants of these atomic operations
are proposed in some architectures such as Alpha AXP (under the names ldl_l
and stl_c), IBM PowerPC (under the names lwarx and stwcx), or ARM (under the
names ldrex and strex).
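The VALIDX-based semantics just described can be simulated as follows (Python; a lock stands in for the hardware's atomicity, and the class name is ours):

```python
import threading

# Simulation of a register offering LL(), SC(), and VL(), following the
# description above: LL links the reader, a successful SC writes the value
# and invalidates every link, VL just tests the caller's link.
class LLSCRegister:
    def __init__(self, n, initial=None):
        self.value = initial
        self.valid = [False] * n           # the array VALID_X[1..n]
        self._m = threading.Lock()         # stands in for hardware atomicity

    def LL(self, i):
        with self._m:
            self.valid[i] = True           # link the read issued by p_i
            return self.value

    def SC(self, i, v):
        with self._m:
            if not self.valid[i]:          # X was written since p_i's last LL()
                return False
            self.value = v
            self.valid = [False] * len(self.valid)   # invalidate all links
            return True

    def VL(self, i):
        with self._m:
            return self.valid[i]
```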
Fig. 6.11 Implementation of the operations right_enq() and right_deq() of a double-ended queue
since it was read by pi at line 2, the write is successful and pi consequently increases
the right index RI and returns the control value done. This behavior, which entails
the enqueue of v on the right side, is described in Fig. 6.12.
If the previous invocations of LL() and SC() (issued at lines 2, 4, and 5) reveal
that the right side of the double-ended queue was modified, pi acquires the lock in
order to solve conflicts among the invocations of DQ.right_enq() (line 9). It then
executes a loop in which it does the same as before. Lines 10–16 are exactly the
same as lines 1–7 except for the statement return() at line 5, which is replaced at
line 14 by term ← true to indicate that the value v was added to the right side of the
double-ended queue. When this occurs, the process pi releases the lock and returns
the control value done.
Let us consider an invocation of right_enq() (by a process p) which is about to
terminate. More precisely, p starts executing the statements in the then part at line
5 (or line 14). If other processes are concurrently executing right_enq(), they will
loop until p has updated the right index RI to RI + 1. This is due to the fact that p
modifies Q[RI] at line 5 (or line 14) before updating RI.
While the right_enq() operation issues SC() invocations first on Q[my_index − 1]
and then on Q[my_index] (lines 4–5 or lines 13–14), the right_deq() operation
has to issue them in the opposite order, first on Q[my_index] and then
on Q[my_index − 1] (lines 24–25 or lines 34–35). This is due to the fact that
right_enq() writes (a value v) into Q[my_index] while right_deq() writes (⊥r)
into Q[my_index − 1].
The algorithms implementing the operations left_enq() and left_deq() These
algorithms are similar to the algorithms implementing the right_enq() and
right_deq() operations. The only modifications to be made to the previous algorithms
are the following: replace RI by LI, replace R_LOCK by L_LOCK, replace
each occurrence of ⊥r by ⊥ℓ, and replace the occurrence of ⊥ℓ at line 33 by ⊥r.
A left-side operation and a right-side operation can be concurrent and try to invoke
an atomic SC() operation on the same register Q[x]. In such a case, if one is unsuccessful,
it is because the other one was successful. More generally, the construction
is non-blocking.
The corresponding concurrency-abortable operations are described in Fig. 6.14,
where they are denoted ab_push() and ab_pop(). The internal representation
of the stack is the same as the one defined in Sect. 5.2.5 with the following
simple modification: in each operation, the loops are suppressed and replaced
by a return(⊥) statement. It is easy to see that this modification does not alter the
non-blocking property of the algorithm described in Fig. 5.10.
This section describes a simple contention-sensitive algorithm which transforms the
implementation of any non-blocking concurrency-abortable object into a starvation-free
implementation of the same object. This algorithm, which is based on a starvation-free
lock, is described in Fig. 6.15. (Let us remember that a simple algorithm which
builds a starvation-free lock from a deadlock-free lock was presented in Sect. 2.2.2.)
Notation The algorithm is presented in Fig. 6.15. Let oper(par) denote any
operation of the considered object O and ab_oper(par) denote the corresponding
operation on its non-blocking concurrency-abortable version ABO. This means that,
when considering the stack object presented in the previous section, push() or pop()
denote the non-abortable counterparts of ab_push() or ab_pop(), respectively. It is
assumed that any invocation of an object operation oper(par) returns a value which
is different from the default value ⊥. As in the previous section, ⊥ can be returned
by invocations of ab_oper(par) only to indicate that they failed.
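The transformation can be sketched as follows (Python; the Boolean name CONTENTION and the line structure are inferred from the proof that follows, and an ordinary lock stands in for a starvation-free lock):

```python
import threading

# Sketch: try the concurrency-abortable operation once outside the lock; on
# abort (⊥, modeled as None) or observed contention, fall back to the lock
# and retry until the abortable operation succeeds.
class StarvationFreeWrapper:
    def __init__(self, abo):
        self.abo = abo                 # non-blocking concurrency-abortable object
        self.CONTENTION = False
        self.lock = threading.Lock()   # stands in for a starvation-free lock

    def oper(self, *par):
        if not self.CONTENTION:        # "line 1": no contention observed
            r = self.abo.ab_oper(*par) # "line 2": optimistic, lock-free attempt
            if r is not None:
                return r
        with self.lock:                # "line 4": acquire the lock
            self.CONTENTION = True     # direct newcomers to the lock
            r = self.abo.ab_oper(*par)
            while r is None:           # eventually runs concurrency-free
                r = self.abo.ab_oper(*par)
            self.CONTENTION = False
            return r
```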
In a concurrency-free context, the Boolean CONTENTION is equal to false when
the operation starts and consequently oper() invokes ABO.ab_oper() at line 2. As
ABO is a concurrency-abortable object and there are no concurrent operation
invocations, the invocation of ab_oper() does not abort. It follows that this
invocation of oper() returns at line 2, which proves the property.
Let us now show that the implementation is starvation-free, i.e., that any invocation
of any operation oper() terminates. To this end, given an invocation inv_opp
of an operation oper() issued by a process p, we have to show that there is eventually
an underlying invocation of ABO.ab_oper() invoked by inv_opp that does not
return ⊥.
Let us first observe that, as ABO is a concurrency-abortable object, any invoca-
tion of ABO.ab_oper() terminates (returning ⊥ or another value). If the underlying
invocation ABO.ab_oper() issued at line 2 returns a non-⊥ value, inv_opp does
terminate. If the underlying invocation ABO.ab_oper() returns ⊥, or if the Boolean
CONTENTION was equal to true when p executed line 1, p tries to acquire the lock
(line 4).
Among the processes that compete for the lock, let q be the process which
has obtained and not yet released the lock. It repeatedly invokes some operation
ABO.ab_operq() until it obtains a non-⊥ value. It is possible that other processes
execute, concurrently with q, some underlying operations ABO.ab_oper1(),
ABO.ab_oper2(), etc. This happens if these processes have found CONTENTION =
false at line 1 (which means that they have read CONTENTION before it was written
by q). Hence, in the worst case, there are (n − 2) other processes executing operations
on ABO concurrently with q (all the processes but p and q). As ABO is non-blocking,
one of them returns a non-⊥ value and the corresponding process terminates its
invocation of an operation on the underlying object ABO. If this process is q, we are
done. If it is not q and it invokes an operation again, it is directed to acquire the lock
because now CONTENTION = true.
Hence, there are now at most (n − 3) processes executing operations on ABO
concurrently with q. It follows that, if q has not obtained a non-⊥ value before, it
eventually executes ABO.ab_operq() in a concurrency-free context. It then obtains
a non-⊥ value and releases the lock.
As the lock is starvation-free, it follows that p eventually obtains it. Then, replac-
ing q by p in the previous reasoning, it follows that p eventually obtains a non-⊥
value from an invocation of ABO.ab_oper() and accordingly terminates its upper-
layer invocation of the operation oper().
The proof of atomicity follows from the following definition of the linearization
points associated with the invocations on the underlying object ABO. Given an
invocation of an operation O.oper(), let us consider its last invocation of ABO.ab_oper()
(that invocation returned a non-⊥ value). The linearization point of oper() is the
linearization point of this underlying invocation.
6.5 Summary
• Without giving it the name “static hybrid”, the notion of static hybrid implemen-
tation of a concurrent object was implicitly introduced by S. Heller, M. Herlihy,
V. Luchangco, M. Moir, W. Scherer, and N. Shavit in [137].
The implementation of a concurrent set object described in Sect. 6.2 is due to
the same authors [137]. This implementation was formally proved correct in [78].
• The notion of contention sensitive implementation is due to G. Taubenfeld [263].
The contention sensitive implementations of a binary consensus object and of a
double-ended queue are due to G. Taubenfeld [263]. The second of these
implementations is an adaptation of an implementation of a double-ended queue based
on compare&swap() proposed by M. Herlihy, V. Luchangco, and M. Moir in
[143] (the notion of obstruction-freedom is also introduced in this paper).
• The notion of concurrency-abortable implementation used in this chapter is from
[214] where the methodology to go from a non-blocking abortable implementation
to a starvation-free implementation of an object is presented. This methodology
relies on a general approach introduced by G. Taubenfeld in [263].
• It is important to insist on the fact that the notion of “abortable object” used in this
chapter is different from the one used in [16] (where an operation that returns ⊥
may or may not have been executed).
The two previous chapters were on the implementation of concurrent atomic objects
(such as queues and stacks). More precisely, the aim of Chap. 5 was to introduce
and illustrate the notion of a mutex-free implementation and associated progress
conditions, namely obstruction-freedom, non-blocking and wait-freedom. The aim
of Chap. 6 was to introduce and investigate the notion of a hybrid implementation. In
both cases, the internal representation of the high-level object that was constructed
was based on atomic read/write registers and more sophisticated registers accessed
by stronger hardware-provided operations such as compare&swap(), fetch&add(),
or swap().
This chapter and the two following ones address another dimension when one is
interested in building wait-free implementations of concurrent objects, namely the
case where the only base objects that can be used are atomic read/write registers.
Hence, these chapters investigate the power of base read/write registers to construct
wait-free implementations. This chapter is on the wait-free implementation of weak
counters and store-collect objects, while Chap. 8 addresses snapshot objects, and
Chap. 9 focuses on renaming objects.
As we are concerned with wait-free implementations, let us remember that it is
assumed that any number of processes may crash. Let us also remember that, as far
as terminology is concerned, a process is correct in a run if it does not crash in that
run; otherwise, it is faulty.
This section has two aims: to present a wait-free implementation of a weak counter
object and to show how to cope with an unknown and arbitrarily large number of
processes. To that end, it first presents a very simple implementation of a (non-weak)
counter and then focuses on the wait-free implementation of a weak counter that can
be accessed by infinitely many processes.
A shared counter C is a concurrent object that has an integer value (initially 0) and
provides the processes with two operations denoted increment() and get_count().
Informally, the operation increment() increases the value of the counter by 1, while
the operation get_count() returns its current value. In a more precise way, the behav-
ior of a counter is defined by the three following properties:
• Liveness. Any invocation of increment() or get_count() by a correct process ter-
minates.
• Monotonicity. Let gt1 and gt2 be two invocations of get_count() such that gt1
returns c1, gt2 returns c2, and gt1 terminates before gt2 starts. Then, c1 ≤ c2.
• Freshness. Let gt be an invocation of get_count() and c the value it returns. Let ca
be the number of invocations of increment() that have terminated before gt starts
and cb be the number of invocations of increment() that have started before gt
terminates. Then, ca ≤ c ≤ cb .
The liveness property expresses that the implementation has to be wait-free.
Monotonicity and freshness are the safety properties which give meaning to the
object; namely, they define the domain of the value returned by a get_count() invo-
cation. As we will see in the proof of Theorem 23, the previous behavior can be
defined by a sequential specification.
A simple implementation A concurrent counter can be easily built as soon as
the number of processes n is known and the system provides one atomic SWMR
read/write register per process. More precisely, let REG[1..n] be an array of atomic
registers initialized to 0, such that, for any i, REG[i] can be read by any process but
is written only by pi .
The algorithms implementing the operations increment() and get_count() are triv-
ial (Fig. 7.1). The invocation of increment() by pi consists in adding 1 to REG[i]
(local_ct is a local variable of pi , initialized to 0). The invocation of get_count()
consists in reading (in any order) and summing up the values of all the entries of the
array REG[1..n].
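This simple construction can be sketched in a language with shared-memory threads. The following Python sketch is ours (the names SimpleCounter and worker are not from the book): each thread writes only its own array entry, mimicking an SWMR register, and get_count() reads and sums all entries.

```python
import threading

class SimpleCounter:
    """One SWMR register per process: reg[i] is written only by
    process i and can be read by everybody."""
    def __init__(self, n):
        self.reg = [0] * n

    def increment(self, i):
        self.reg[i] += 1          # only process i ever writes reg[i]

    def get_count(self):
        return sum(self.reg)      # read the n entries in any order

n = 4
c = SimpleCounter(n)

def worker(i):
    for _ in range(1000):
        c.increment(i)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(c.get_count())  # 4000: no increment is lost, each slot has a single writer
```

Because each slot has a single writer, the final value read after all threads have terminated is exactly the number of increments; a concurrent get_count() would return some value consistent with the counter specification.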
Theorem 23 The algorithms described in Fig. 7.1 are a wait-free implementation
of an atomic counter object.
Proof The fact that the operations are wait-free follows directly from their code.
The proof that the construction provides an atomic counter is based on the atomicity
of the underlying base registers. Let us associate a linearization point with each
invocation as follows:
7.1 A Wait-Free Weak Counter for Infinitely Many Processes 191
operation increment() is
   local_ct ← local_ct + 1; REG[i] ← local_ct; return()
end operation.

operation get_count() is
   ct ← 0;
   for each j ∈ {1, . . . , n} do ct ← ct + REG[j] end for;
   return(ct)
end operation.
Infinitely many processes This section focuses on dynamic systems where each
run can have an unknown, arbitrarily large, and possibly infinite number of processes.
The only constraint is that in each finite time interval only finitely many processes
execute operations. Each process pi has an identity i, and it is common knowledge
that no two processes have the same identity.
192 7 Wait-Free Objects from Read/Write Registers Only
Differently from the static model where there are n processes p1 , . . . , pn , each
process knowing n and the whole set of identities, now the identities of the processes
that are in the system are not necessarily consecutive, and no process has a priori
knowledge of which other processes can execute operations concurrently with it.
(Intuitively, this means that a process can “enter” or “leave” the system at any time.)
Moreover, no process is provided with an upper bound n on their number, which
could be used by the algorithms (as, for example, in the previous algorithm, where
the operation get_count() scans the whole array REG[1..n]). This model, called the
finite concurrency model, captures existing physical systems where the only source
of “infiniteness” is the passage of time.
It is important to see that the algorithms designed for this computation model have
to be inherently wait-free: they have to guarantee the progress of pending operations
even if new processes keep on arriving.
Helping mechanism A basic principle when designing algorithms suited to the
previous dynamic model with finite concurrency consists in using a helping mecha-
nism. More generally, such mechanisms are central when one has to design wait-free
implementations of concurrent objects.
More precisely, ensuring the wait-freedom property despite the fact that infinitely
many processes can be involved in an algorithm requires a process to help other
processes terminate their operations. This strategy prevents slow processes from
never terminating despite the continuous arrival of new processes. This will clearly
appear in the weak counter algorithms described below.
Weak counter: definition A weak counter is a counter whose increment() and
get_count() operations satisfy the liveness and monotonicity properties of a classical
counter (as defined previously), plus the following property (which replaces the
previous freshness property):
• Weak increment. Let gt1 and gt2 be two invocations of the get_count() opera-
tion that return c1 and c2 , respectively. Let incr be an invocation of increment()
that (a) has started after gt1 has terminated (i.e., res[gt1] <_H inv[incr] using
the notations defined in Chap. 4), and (b) has terminated before gt2 has started
(i.e., res[incr] <_H inv[gt2]). We have then c1 < c2.
With a classical counter, each invocation of the increment() operation, be it con-
current with other invocations or not, results in adding 1 to the value of the counter (if
the invoking process does not crash before updating the SWMR register it is associ-
ated with). The way the counter increases is different for a weak counter. Let k be the
number of concurrent invocations of increment() at some time. This concurrency
pattern entails the increase of the counter by a quantity x such that 1 ≤ x ≤ k.
This means that the effect of a batch of k concurrent invocations of the operation
increment() appears as being reduced to any number y, 1 ≤ y ≤ k, of invocations,
k − y of these invocations being overwritten by other ones. Increments can be lost
only when there are concurrent increment() invocations. It is easy to see that a counter
is a weak counter but not vice versa. Moreover, differently from a counter, a weak
counter has no sequential specification.
This section and the next one present an incremental construction of a weak counter
object in a finite concurrency computation model. More precisely, this section
presents and proves correct a wait-free implementation of a one-shot weak counter
which is based on read/write registers only. One-shot means here that a process issues
at most one operation invocation. The next section will remove this restriction.
This construction of a wait-free weak counter is due to M.K. Aguilera (2004). Its
main design principles are simplicity (more efficient constructions can be designed)
and generality (in the sense that these principles can be used to obtain wait-free
implementations of other objects in the presence of finite concurrency).
Internal representation of the weak counter object This internal representation
consists of three arrays of atomic read/write registers. The first represents the counter
itself, while the last two are used to implement the helping mechanism that provides
processes with the wait-freedom property.
• The array BIT is made up of a potentially infinite number of MRMW registers.
Each register BIT [x] is initialized to the value 0 and can be set to the value 1
(by one or several processes). Its aim is to contain the value of the counter. More
precisely, when the value of the counter is v, we have BIT [x] = 1 for 1 ≤ x ≤ v,
and BIT [x] = 0 for x > v.
• READING is a potentially infinite array of SWMR registers with one entry per
process. READING[i] = true means that the process pi is currently trying to
read the value of the counter and may require help to terminate its get_count()
operation.
• HELPED is a potentially infinite array of MWSR registers with one entry per
process. Its meaning is the following: HELPED[i] = v means that some process
helped pi by providing it with the value v that pi can use as the current value of
the counter.
The algorithms implementing the operations increment() and get_count() These
algorithms are described in Fig. 7.2. The algorithm implementing the increment()
operation is fairly simple. When a process wants to increment the counter, it first
obtains its current value v, and then sets to 1 the next bit of the array representing the
value of the counter, namely BIT[v + 1]. (As we can see, no entry of the BIT array
is reserved for a particular process.)
The algorithm implementing the get_count() operation lies at the core of the con-
struction. In addition to the helping arrays READING and HELPED, each invocation
of get_count() by a process pi uses a local index k, and a local variable to_help
which will contain the identities of the processes that pi can help terminate their
get_count() invocation. The algorithm is made up of two parts:
operation increment() is
(1) v ← get_count(); BIT[v + 1] ← 1; return()
end operation.

operation get_count() is
(2) k ← 1; to_help ← ∅;
(3) READING[i] ← true;
(4) while (BIT[k] = 1) ∧ (HELPED[i] = 0) do
(5)    if READING[k] then to_help ← to_help ∪ {k} end if;
(6)    k ← k + 1
(7) end while;
(8) READING[i] ← false;
(9) if (HELPED[i] ≠ 0) then return(HELPED[i])
(10) else let v = k − 1;
(11)    for each j ∈ to_help (in increasing order) do HELPED[j] ← v
(12)    end for;
(13)    return(v)
(14) end if
end operation.

Fig. 7.2 One-shot weak counter (code for process pi)
• The first part (lines 2–8) is a scan whose aim is to obtain the value of the counter.
A process pi starts scanning the BIT array until either it finds an entry equal to
0 or it discovers that it was helped by another process (line 4). During this scan-
ning, pi indicates to the other processes that it needs help by setting READING[i]
appropriately (lines 3 and 8). Moreover, it also registers the processes it could help
before terminating (line 5).
• In the second part (lines 9–14) a process pi returns a value. If it was helped by
another process (we have then HELPED[i] ≠ 0 at line 9), pi returns the helping
value supplied by another process. In that case, pi does not help other processes.
On the contrary, if it has not been helped, pi computes the value v to be returned
(line 10) and helps the processes in to_help with the value v it has obtained
(lines 11–12) before returning v (line 13).
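To make the scan-and-help structure concrete, here is an illustrative Python sketch of the one-shot weak counter. The registers are modeled by dictionaries (so this shows the logic only, not true wait-freedom under real concurrency), and the statement-level details are a reconstruction of the description above, not the book's exact figure.

```python
from collections import defaultdict

class OneShotWeakCounter:
    """Sketch of the one-shot weak counter; BIT, READING, and HELPED
    model the potentially infinite arrays of the text."""
    def __init__(self):
        self.BIT = defaultdict(int)       # BIT[x] in {0, 1}, all initially 0
        self.READING = defaultdict(bool)  # READING[i]: pi is scanning
        self.HELPED = defaultdict(int)    # HELPED[i]: 0 means "not helped"

    def get_count(self, i):
        k, to_help = 1, []
        self.READING[i] = True
        # scan BIT until a 0 entry is found or some process helped pi
        while self.BIT[k] == 1 and self.HELPED[i] == 0:
            if self.READING[k]:           # process with identity k may need help
                to_help.append(k)
            k += 1
        self.READING[i] = False
        if self.HELPED[i] != 0:
            return self.HELPED[i]         # use the value supplied by a helper
        v = k - 1
        for j in sorted(to_help):         # help in increasing identity order
            self.HELPED[j] = v
        return v

    def increment(self, i):
        v = self.get_count(i)
        self.BIT[v + 1] = 1

c = OneShotWeakCounter()
c.increment(1); c.increment(2)
print(c.get_count(3))  # 2
```

In a sequential run no helping occurs; the helping path matters only when a slow reader scans an ever-growing prefix of BIT while faster processes keep incrementing.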
As all the base read/write registers are atomic, it is possible to reason by considering
a global time frame defined by the linearization points associated with the read and
write operations on these registers. (This is the sequence S as defined in Chap. 4
from the linearization points.) Let f (τ ) denote the number of 1s in the array BIT
at time τ .
Lemma 7 Let gt be an invocation of get_count() that starts at time τ1, terminates
at time τ2, and returns v. We have f(τ1) ≤ v ≤ f(τ2).
Proof Let us first show that v ≤ f(τ2). As exactly f(τ2) entries of BIT are equal to
1 at time τ2 and f (τ ) is non-decreasing (Lemma 3), no invocation of the get_count()
operation that exits the while loop at time τ ≤ τ2 can find x > f (τ2 ) entries equal to
1. As the value v returned by gt is a value determined by an invocation of get_count()
that exits the while loop before τ2 , the assertion v ≤ f (τ2 ) follows.
Let us now show that f (τ1 ) ≤ v. Let τ be the time at which pi exits the while
loop. We have τ1 < τ < τ2 . There are two cases according to the line at which pi
returns its value:
• pi returns at line 13. In this case, due to Lemma 6 we have v = f (τ ). As,
f (τ1 ) ≤ f (τ ) due to Lemma 3, we obtain f (τ1 ) ≤ v, which proves the case.
• pi returns at line 9. In this case, pi returns the value v such that v = HELPED[i].
Let pj be the process that updated HELPED[i] to v. Let τ2^j be the time at which
pj reads READING[i] = true (line 5), τ3^j the time at which pj exits its while
loop, and τ4^j the time at which it writes v into HELPED[i] (line 12). We have
τ1 < τ2^j < τ3^j < τ4^j < τ. Due to Lemma 3, we have f(τ1) ≤ f(τ3^j), and,
due to Lemma 6, we have v = f(τ3^j). It follows that f(τ1) ≤ v.
Theorem 24 (Monotonicity) Let gt1 and gt2 be two invocations of get_count()
such that gt1 returns v1 , gt2 returns v2 , and gt1 terminates before gt2 starts. Then,
v1 ≤ v2 .
Proof Let τ1 be the time at which gt1 terminates and τ2 be the time at which gt2
starts. Due to Lemma 7 we have v1 ≤ f (τ1 ) and f (τ2 ) ≤ v2 . Due to Lemma 3, we
have f (τ1 ) ≤ f (τ2 ). Combining these inequalities results in v1 ≤ v2 .
Lemma 8 Let us consider an invocation of the increment() operation that starts at
time τ1 and terminates at time τ2 . We have f (τ1 ) + 1 ≤ f (τ2 ).
Proof Noticing that the increment() operation issues an internal call to the
get_count() operation, let τ3 and τ4 be the time instants at which this invocation
of get_count() starts and terminates, respectively. We have τ1 < τ3 < τ4 < τ2 , and
consequently (due to Lemma 3) we also have f (τ1 ) ≤ f (τ3 ).
Let v be the value returned by get_count(). Due to Lemma 7 we have f(τ3) ≤
v. Moreover, due to the second line of the increment() operation (namely, BIT
[v + 1] ← 1) and Corollary 3, we have f(τ2) ≥ v + 1. Combining the previous
inequalities, we obtain f(τ1) + 1 ≤ f(τ2).
Proof Let the time instants be as defined in Fig. 7.3. We have τ1 < τ2 < τ3 < τ4 .
Moreover, we have the following inequalities:
loop which contradicts the fact that it is executing the for loop. It follows that, if
a process executes the for loop, its local set contains a finite number of identities.
Hence, no process can loop forever in the for loop.
Let us now consider the case of the while loop. If pi loops forever in that loop,
HELPED[i] remains equal to 0 and, for every k, pi reads BIT[k] = 1. It follows
that infinitely many invocations of increment() have set an entry of BIT to 1 (even
if some of the invoking processes crashed just after this update). As each of these
invocations of increment() has first executed an invocation of get_count() that
terminated, there are infinitely many invocations of get_count() that terminate. We
consider two (non-exclusive) cases:
• Infinitely many invocations of get_count() terminate at line 13.
In this case, as at any time the number of operation invocations which are currently
executing is finite (model assumption), there is a process pj that has invoked
get_count() (and, due to the case assumption, this invocation terminates at line 13)
after pi has invoked its (non-terminating) get_count() operation. In such a setting,
pj reads READING[i] = true (line 5) and, consequently, adds i to its local variable
to_help. Then (due to the case assumption) pj executes lines 10–12, and sets
HELPED[i] to a value v ≠ 0. After this has been done, pi eventually exits its
while loop, contradicting the initial assumption.
• Infinitely many invocations of get_count() terminate at line 9.
In this case, due to the one-shot assumption, infinitely many processes pj with
identity j > i are such that HELPED[j] was set to a value different from 0. Each
such pj is helped by at least one process pj′ (that sets HELPED[j] to a non-0
value at line 12). As (1) there are infinitely many processes pj, (2) a process
executes at most one operation invocation, and (3) at any time the number of
current operation invocations is finite, it follows that there are infinitely many
helping processes pj′. (Let us notice that, for each pj, it is possible that the
processes that help it crash just after having updated HELPED[j] and before
helping other processes.) Consequently at least one of the helping processes
(say pg) has started its invocation of get_count() after pi started its non-terminating
invocation of get_count(). It follows that pg reads READING[i] = true and
consequently adds i to its local variable to_help.
As pg sets HELPED[j] to a non-0 value v, it has not crashed before. Moreover,
as it executes the loop of lines 11–12 according to the increasing order of the
process identities in its local variable to_help, it follows from the fact that i < j
that pg has updated HELPED[i] to the value v before updating HELPED[j] to
v. After this has been done, pi eventually exits its while loop, contradicting the
initial assumption.
Remark From an understanding point of view, it is important to notice that wait-
freedom of the implementation is the only property whose proof relies on both the
one-shot assumption and the order on process identities.
operation get_count() is
(1) let x = LAST_INV[i] + 1; LAST_INV[i] ← x;
(2) k ← 1; to_help ← ∅;
(3) READING[i, x] ← true;
(4) while (BIT[k] = 1) ∧ (HELPED[i, x] = 0) do
(5)    let ⟨j, z⟩ = kth pair in a fixed enumeration of all the pairs, by increasing value of j + z;
(6)    if READING[j, z] then to_help ← to_help ∪ {⟨j, z⟩} end if; k ← k + 1
(7) end while
(8) READING[i, x] ← false;
(9) if (HELPED[i, x] ≠ 0) then return(HELPED[i, x])
(10) else let v = k − 1;
(11)    foreach ⟨j, z⟩ ∈ to_help (by increasing value of j + z) do
(12)       if (HELPED[j, z] = 0) then HELPED[j, z] ← v end if end do
(13)    return(v)
(14) end if
end operation.
Fig. 7.5 Reading a weak counter (non-restricted version, code for process pi )
The one-shot restriction can be removed by associating an identity (a pair made up
of the process identity and a sequence number) with each invocation of get_count()
and enriching the shared data structures as follows:
• An SWMR atomic register LAST _INV [i] (initialized to 0) is associated with each
process pi . This register is used by pi to generate sequence numbers.
• The atomic registers READING[i] and HELPED[i] now become two-dimensional
arrays. Let us consider the xth invocation of get_count() issued by pi .
– READING[i, x] = true means that pi is looking for a counter value for its xth
invocation.
– HELPED[i, x] is destined to contain the help value for its xth invocation.
The wait-free property can be prevented only in runs where there are infinitely
many invocations of the get_count() operation. This can occur when there are
infinitely many processes that invoke get_count() or when a process invokes
get_count() infinitely many times. This observation can be used as follows by the
helping mechanism in order to obtain a wait-free algorithm. A process pi is required
to help the invocations of get_count() whose identities ⟨j, z⟩ have been collected in
its set to_help, in the order defined by the sum j + z. This order relation allows
replacing, in the proof of the second item of Theorem 26, the order on the process
identities by an order on invocations.
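The order defined by the sum j + z can be illustrated with a small Python snippet (the content of to_help below is hypothetical; ties between pairs with equal sums, which the proof does not constrain, are broken lexicographically here):

```python
# hypothetical content of a set to_help of invocation identities <j, z>
to_help = {(3, 2), (1, 1), (2, 4), (5, 1)}

# help order: increasing j + z, ties broken lexicographically
order = sorted(to_help, key=lambda p: (p[0] + p[1], p))
print(order)  # [(1, 1), (3, 2), (2, 4), (5, 1)]
```

Ordering by j + z guarantees that, for any given invocation, only finitely many invocations precede it in the help order, which is the property the wait-freedom argument needs.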
The corresponding general get_count() algorithm is described in Fig. 7.5. It is a
straightforward extension of the base one-shot algorithm. The increment() operation
is the same as before.
The proof of the general construction is left as an exercise. We consider here only
the proof of the wait-freedom property.
⟨i, x⟩ helped an invocation ⟨j, y⟩ such that i + x < j + y. Consequently, the
invocation ⟨g, z⟩ found READING[i, x] = true and added ⟨i, x⟩ to to_help. As (1)
This section introduces the notion of a store-collect object and presents several
wait-free implementations of it in a classical static system made up of n processes
p1 , . . . , pn .
A store-collect object provides each process with two operations, store() and
collect(): an invocation of store(v) deposits the value v, and an invocation of
collect() returns the values most recently deposited by the processes. These values
define what is sometimes called a view. More precisely, a view is a set of pairs
(process identity, value) with at most one pair per process. Initially, the view
associated with a store-collect object is empty.
Partial order on views An invocation of the operation collect() returns a view
containing the latest values made public by each process. To define precisely the
notion of “latest values” returned in a view, we use a partial order relation defined
on views. Let view1 and view2 be two views. We have view1 ≤ view2 if, for every
process pi such that (i, v1) ∈ view1, we have (i, v2) ∈ view2 where the invocation
of store(v2) follows (or is) the invocation of store(v1) (notice that both invocations
of the operation store() are issued by pi, which is a sequential process). Moreover,
view1 < view2 if view1 ≤ view2 and view1 ≠ view2.
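This partial order can be sketched in Python by representing a view as a map from a process identity to the sequence number of the store() invocation whose value the view reflects (this representation is ours, chosen to make "follows (or is)" checkable):

```python
def view_leq(view1, view2):
    # view maps a process identity to the sequence number of the
    # store() invocation whose value the view reflects
    return all(i in view2 and view2[i] >= s for i, s in view1.items())

# two concurrent collect() invocations may return incomparable views
view1 = {1: 2, 2: 1}
view2 = {1: 1, 2: 2}
print(view_leq(view1, view2), view_leq(view2, view1))  # False False
```

The pair of incomparable views shows why the relation is only a partial order: each view sees a "later" value for one process and an "earlier" value for the other.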
The store() and collect() operations: definition A store-collect object is formally
defined by the following properties, where notations such as inv[st] <_H resp[col]
refer to the order on events as defined in Chap. 4:
• Validity. Let col be an invocation of collect() that returns the set view. For any
(i, v) ∈ view, there is an invocation st of the operation store() with actual
parameter v that was issued by the process pi, and this invocation started before
the invocation col terminates (i.e., inv[st] <_H resp[col]).
This property means that a collect() operation can neither read from the future nor
output values that have not yet been deposited.
• Partial order consistency. Let col1 and col2 be two invocations of the collect()
operation that return the views view1 and view2 , respectively. If col1 terminates
before col2 starts (i.e., resp[col1] <_H inv[col2]), then view1 ≤ view2.
This property expresses the mutual consistency of non-concurrent invocations of
the operation collect(): an invocation of collect() cannot obtain values older than
the values obtained by a previous invocation of collect(). On the contrary, there is
no constraint on the views returned by concurrent invocations of collect() (hence
the name “partial order” for this consistency property).
• Freshness. Let st and col be invocations of the operations store(v) and collect()
issued by pi and pj, respectively, such that st has terminated before col has started
(i.e., resp[st] <_H inv[col]). The view returned by pj contains a pair (i, v′) such
that v′ is v or a value deposited by pi after v.
This property expresses the fact that the views returned by the invocations of the
operation collect() are up to date in the sense that, as soon as a value was deposited,
it cannot be ignored by future invocations of collect(). If store(v) is executed by a
process pi , the pair (i, v) must appear in a returned view (provided there are enough
invocations of collect()) unless v was overwritten by a more recent invocation of
store() issued by pi .
• Liveness. Any invocation of an operation by a process that does not crash
terminates.
7.2 Store-Collect Object 203
operation store(v) is
   REG[i] ← v; return()
end operation.

operation collect() is
   view ← ∅;
   for each j ∈ {1, . . . , n} do
      if (REG[j] ≠ ⊥) then view ← view ∪ {(j, REG[j])} end if
   end for;
   return(view)
end operation.
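Assuming the classical n-process model, this trivial array-based store-collect can be sketched as follows in Python (the class name is ours; REG models the array of SWMR registers, with None playing the role of ⊥):

```python
class StoreCollect:
    """Trivial store-collect: one SWMR register per process."""
    def __init__(self, n):
        self.REG = [None] * n          # None models the initial value ⊥

    def store(self, i, v):
        self.REG[i] = v                # only process i writes REG[i]

    def collect(self):
        # return the view: one pair per process that deposited a value
        return {(j, x) for j, x in enumerate(self.REG) if x is not None}

sc = StoreCollect(4)
sc.store(0, "a"); sc.store(2, "b"); sc.store(0, "c")
print(sc.collect())
```

A later store() by the same process simply overwrites its register, so a view never contains two pairs for the same process.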
process pi is the last value stored by pi before pi ’s axis is cut by the corresponding
read line. For convenience, a view is represented by an array (⊥ means that the
process associated with the corresponding entry has never deposited a value). The
value returned by an invocation of collect() is indicated after the arrow following the
corresponding invocation.
It is easy to see that we have view0 ≤ view1 ≤ view3 , from which we can
conclude that the invocations of collect() by p5 and p3 and the invocations of store()
from which they obtained their views can be totally ordered. Similarly, as view0 ≤
view2 ≤ view3 , the first invocation of collect() by p5 and the invocation of collect()
by p2 and p3 , plus invocations of store() from which they obtained their views can
also be totally ordered.
Differently, we have neither view1 ≤ view2 nor view2 ≤ view1 . These views are
not ordered. While both view1 and view2 are correct views, ordering one before the
other would make the other incorrect. It follows that the corresponding invocations
of collect() cannot be ordered, preventing a sequential specification of a store-collect
object.
The implementation described below is adaptive: the cost of a
collect() operation and, for each process, the cost of its first store() operation are
O(k), while the cost of any other store() operation is O(1).
Internal representation of a store-collect object This representation is based on
a complete binary tree of depth n − 1 (Fig. 7.8). Only a part of the tree is used in
a given execution, but this part is not known in advance; it is dynamically defined
according to the pattern of the accesses issued by the processes (the black vertices in
Fig. 7.8 denote the vertices which are currently used). The read-only shared register
ROOT contains a pointer to the root of the tree.
The idea is that the first invocation of the operation store() by a process is a
simple descent in the tree (without ever backtracking) from the root until a node
(also called a vertex in the following) where (due to some predicate) the descent
stops. An invocation of the operation collect() is a traversal of the part of the tree
defined by the vertices already visited by invocations of the operation store(). Each
vertex VTX is made up of the following fields (Fig. 7.9):
• VTX.marked is a Boolean atomic register (initialized to false) whose value
indicates whether or not the vertex has been attributed to a process to deposit its
successive values. Once a vertex was attributed to a process, it is assigned to that
process forever, and consequently only that process can deposit a value in that
vertex. A process is assigned a vertex when it invokes the operation store() for
the first time.
• VTX.pid is a write-once atomic register destined to contain the identity of the
process to which the vertex VTX was attributed.
• VTX.value is an atomic register where the process to which the vertex VTX was
attributed (i.e., ppid) deposits its value each time it executes a store() operation.
Initially, VTX.value = ⊥, for any vertex VTX.
• VTX.left_s and VTX.right_s are atomic registers containing pointers to the left
and the right successor vertices of VTX. They contain ⊥, if VTX is a leaf.
• VTX.splitter is a splitter object associated with the vertex VTX. This object (which
can be built wait-free from read/write registers only) was introduced in Sect. 5.2.1.
It is used to direct a process pi (when it executes its first store() operation) to
a free vertex. The vertex assigned to pi is the vertex whose splitter returns the
value stop to pi . The values left or right returned by a splitter are the routing
information which govern pi ’s descent of the tree.
Let VTX be a vertex of the tree and pt_vtx a local pointer to it used by some
process pi. The notation (pt_vtx ↓).field is used instead of VTX.field, where
field is any field of VTX.
The algorithm implementing the operation store() The algorithm implementing
the store() operation is described in Fig. 7.10. Each process pi is provided with a
local variable my_vertexi (initialized to ⊥) whose scope is the entire execution
and whose aim is to contain the pointer to the vertex of the tree assigned to it
(this means that pi will always deposit its values at the same place, namely in the
register my_vertexi.value). Each execution of a store() operation uses two local
variables denoted pt_vtx (a pointer to a vertex of the tree) and dir, whose values
are meaningful only during the execution of the corresponding invocation of the
operation store().
The first time it invokes store(), pi executes lines 2–11. Starting from the root,
pi descends along a path of the tree according to the routing information provided
by the splitters associated with the vertices it traverses. It stops at the first vertex
whose splitter returns the value stop. Due to the splitter property, pi is then the only
process that stops at that vertex (line 7). It then assigns this vertex to its variable
my_vertexi (line 10) and indicates that it was attributed this vertex (line 11) (so the
other processes are aware that the value deposited in my_vertexi.value is from pi).
Finally, pi deposits its value v in the register my_vertexi.value (line 13).
If it has already been attributed a vertex of the tree when it invokes a store()
operation (we have then my_vertexi ≠ ⊥ at line 1), pi directly deposits its value v
in the register my_vertexi.value. It is easy to see that, in that case, the cost of an
execution of store() is constant.
Let us observe that, given a vertex VTX, the atomic registers VTX. pid and
VTX.value are SWMR registers. Differently from the SWMR atomic registers which
have been previously used where the single writer is statically defined, here the single
writer is dynamically determined. The fact that no process is a priori aware of which
registers are assigned to which processes is a price to be paid to obtain an adaptive
implementation.
The algorithm implementing the operation collect() The algorithm implement-
ing the collect() operation is described in Fig. 7.10. It is a simple depth-first search
algorithm (line 15) that, starting at the root of the tree, traverses all the marked vertices
while collecting the values deposited at these vertices.
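The splitter-guided descent and the depth-first collect can be sketched in Python. The splitter's direction() below follows the classical two-register construction of Sect. 5.2.1 (Moir–Anderson style); the class names and the iterative DFS are ours, and the sketch is sequential, so it illustrates the routing and traversal logic rather than proving concurrent correctness.

```python
class Splitter:
    """Two-register splitter: among x concurrent callers, at most one
    obtains "stop", at most x-1 obtain "left", at most x-1 "right"."""
    def __init__(self):
        self.X, self.Y = None, False

    def direction(self, i):
        self.X = i
        if self.Y:
            return "right"
        self.Y = True
        return "stop" if self.X == i else "left"

class Vertex:
    def __init__(self, depth, max_depth):
        self.marked = False              # belongs to the used part of the tree
        self.pid = None                  # identity of the owning process
        self.value = None                # last value stored (None models ⊥)
        self.splitter = Splitter()
        if depth < max_depth:            # complete binary tree of depth max_depth
            self.left_s = Vertex(depth + 1, max_depth)
            self.right_s = Vertex(depth + 1, max_depth)
        else:
            self.left_s = self.right_s = None

class TreeStoreCollect:
    def __init__(self, n):
        self.root = Vertex(0, n - 1)
        self.my_vertex = {}              # per-process my_vertex pointer

    def store(self, i, v):
        if i not in self.my_vertex:      # first store(): descend from the root
            vtx = self.root
            while True:
                vtx.marked = True
                d = vtx.splitter.direction(i)
                if d == "stop":
                    break
                vtx = vtx.left_s if d == "left" else vtx.right_s
            vtx.pid = i
            self.my_vertex[i] = vtx
        self.my_vertex[i].value = v      # later stores cost O(1)

    def collect(self):
        view, stack = set(), [self.root]
        while stack:                     # DFS restricted to marked vertices
            vtx = stack.pop()
            if vtx is None or not vtx.marked:
                continue
            if vtx.value is not None:
                view.add((vtx.pid, vtx.value))
            stack.extend((vtx.left_s, vtx.right_s))
        return view

t = TreeStoreCollect(3)
t.store(0, "a"); t.store(1, "b"); t.store(2, "c"); t.store(0, "a2")
print(t.collect())
```

In this sequential run the first caller stops at the root and each later caller is routed right, so the marked part of the tree is a short path; under concurrency the splitter property still bounds the descent of k participating processes by depth k − 1.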
Let us notice that it is possible that, while a vertex VTX has been marked and
attributed to a process pk , pk crashes before depositing a value in the register
VTX.value or (due to asynchrony) takes a very long time before depositing a value
operation store(v) is
(1) if (my_vertexi = ⊥) then
(2)    pt_vtx ← ROOT;
(3)    repeat (pt_vtx ↓).marked ← true;
(4)       dir ← (pt_vtx ↓).splitter.direction();
(5)       case (dir = left) then pt_vtx ← (pt_vtx ↓).left_s
(6)            (dir = right) then pt_vtx ← (pt_vtx ↓).right_s
(7)            (dir = stop) then exit repeat loop
(8)       end case
(9)    end repeat;
(10)   my_vertexi ← pt_vtx;
(11)   (my_vertexi ↓).pid ← i
(12) end if;
(13) (my_vertexi ↓).value ← v;
(14) return()
end operation.

operation collect() is
(15) let view = tree_collect(ROOT); return(view)
end operation.

internal operation tree_collect(pt_vtx) is
(16) aux ← ∅;
(17) if ((pt_vtx ↓).marked) ∧ ((pt_vtx ↓).value ≠ ⊥) then aux ← {((pt_vtx ↓).pid, (pt_vtx ↓).value)} end if;
(18) if ((pt_vtx ↓).marked) ∧ ((pt_vtx ↓).left_s ≠ ⊥) then aux ← aux ∪ tree_collect((pt_vtx ↓).left_s) end if;
(19) if ((pt_vtx ↓).marked) ∧ ((pt_vtx ↓).right_s ≠ ⊥) then aux ← aux ∪ tree_collect((pt_vtx ↓).right_s) end if;
     return(aux)
end operation.
in VTX.value. Let us also notice that it is possible that, after VTX was marked, the
vertices VTX.left_s and VTX.right_s have been attributed to other processes,
independently of whether a value has or has not been deposited in VTX.value. This
means that, even when VTX.marked ∧ VTX.value = ⊥, the tree traversal has to
progress (lines 18–19) along both descendants of VTX (possibly until a leaf,
according to which vertices are marked). It stops and backtracks when it encounters
a non-marked vertex (i.e., a vertex not yet attributed to a process).
This section proves that the previous implementation is correct and adaptive.
Theorem 28 The implementation described in Fig. 7.10 is a correct wait-free imple-
mentation of a store-collect object.
Proof Let us first consider the liveness property. The first time a correct process (i.e.,
a process which does not crash) invokes store() it executes the repeat loop (lines 3–9).
The fact that it exits the loop follows from the three following observations: (a) there
are at most n processes which access the store-collect object, (b) the depth of the
binary tree is n − 1, and (c) the property of the splitters attached to each vertex, which
ensures that, if x processes access a splitter, at most x − 1 of them go left
and at most x − 1 of them go right. Hence, the first invocation of store() by
a correct process always terminates. Moreover, as my_vertexi ≠ ⊥ after the first
invocation by a process, its other invocations trivially terminate.
The proof that any invocation of collect() by a correct process terminates follows
from the observation that the tree is bounded and that the fields left_s and right_s of
any of its leaves always remain equal to ⊥, which stops the depth-first search at line 18
or 19, if not already stopped at another vertex.
The fact that a view cannot contain values read from the future follows directly
from the text of the algorithms. The rest of the validity property (a value which is
returned was deposited), the freshness property, and the partial order consistency
property follow from the following observations:
• If the view view returned by a collect() operation contains the pair (i, v), that
pair was obtained in the then part of line 17 during the tree traversal. As shown by
line 17 of the collect() operation, the corresponding vertex was previously marked.
• A vertex VTX is marked during invocations of the operation store() (line 3). Moreover,
as no process resets VTX.marked to false, once marked, a vertex remains
marked forever. It also keeps forever in VTX.pid (line 11) the identity i of the
only process that obtains the value stop from the associated splitter (line 7). That
vertex is then definitely attributed to pi (line 10) for it to deposit its future values
in VTX.value (line 13).
• When a process pi stores a new value in the vertex assigned to it (i.e., the vertex
VTX such that VTX.pid = i and pointed to by my_vertexi), it overwrites the
previous value (if any).
• If a store() operation issued by p j terminates before a collect() operation starts,
the tree traversal generated by that collect() visits the vertex attributed to p j (this
follows from the fact that any invocation of collect() visits all the vertices that are
currently marked).
It follows from the previous items that, if view is returned by an invocation of collect()
and contains the pair (i, v), then (a) the value v was written by pi in the vertex VTX
such that VTX.pid = i and (b) v is the last value written by pi before the invocation
of collect() reads VTX.value.
Notation Let k denote the number of processes that invoke the operation store() in
a run.
Lemma 9 If the depth of a vertex VTX is d, 0 ≤ d ≤ k, then at most k − d processes
access the vertex VTX among the k processes that invoke the store() operation.
Proof The proof is by induction on d, the depth of the vertex VTX. The base case
d = 0 follows from the fact that k processes invoke the operation store() (no more
than k processes access the root).
Assuming that the lemma holds for the vertices at depth d, 0 ≤ d < k, let us
consider a vertex VTX at depth d + 1 ≤ k. Let VTX′ be the parent of VTX
in the tree. The depth of VTX′ is d and, due to the induction assumption, at most
k − d processes access VTX′. As VTX is the left or the right successor of VTX′ in
the tree, it follows from the property of the splitter associated with VTX′ that at most
k − d − 1 = k − (d + 1) processes access the vertex VTX, which proves the induction
case.
Lemma 10 If a process writes its identity in a vertex (line 11), the depth of that
vertex is smaller than k, and no other process writes its identity in the same vertex.
Proof Let us first notice that a process pi writes its identity in a vertex only when
it invokes the store() operation for the first time. It follows from Lemma 9 (taking
d = k) that no process accesses a vertex at depth k. Hence, pi stops at a vertex with
depth smaller than k. Moreover, due to the splitter property, at most one process stops
at a given vertex. It follows that no two processes are assigned the same vertex.
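The splitter behavior these lemmas rely on (at most one process "stops"; not all go right; not all go left) can be illustrated with a minimal Python simulation of the classic splitter built from two read/write registers. This is only a sketch: the register name DOOR is an assumption (the text only says the operation direction() is built from two atomic registers), and Python threads stand in for asynchronous processes.

```python
import threading

class Splitter:
    """Sketch of a splitter built from two read/write registers.
    Among k entering processes: at most one obtains "stop",
    at most k-1 obtain "right", and at most k-1 obtain "left"."""
    def __init__(self):
        self.LAST = None      # register: last process seen entering
        self.DOOR = True      # register: True while the door is open

    def direction(self, pid):
        self.LAST = pid       # write my identity
        if not self.DOOR:     # door already closed: deflect right
            return "right"
        self.DOOR = False     # close the door
        if self.LAST == pid:  # still the last to have entered: stop here
            return "stop"
        return "left"         # someone entered meanwhile: deflect left

results = []
lock = threading.Lock()
sp = Splitter()

def run(pid):
    d = sp.direction(pid)
    with lock:
        results.append(d)

threads = [threading.Thread(target=run, args=(i,)) for i in range(8)]
for t in threads: t.start()
for t in threads: t.join()
```

Whatever the interleaving, the three splitter properties hold, which is exactly what Lemma 10 exploits to assign distinct vertices to distinct processes.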
Lemma 11 All the vertices from the root to a marked vertex are marked.
Proof If a vertex VTX is marked, some process pk has set VTX.marked to true.
It follows from line 3 inside the repeat loop of the operation store() that pk has
also marked all the vertices from the root until VTX. The lemma follows from this
observation and the fact that no process ever resets a marked vertex to false.
Theorem 29 Let pi be a process that invokes the store() operation. The cost of its
first invocation is O(k). The cost of its following invocations is O(1).
210 7 Wait-Free Objects from Read/Write Registers Only
Proof Let us consider the first invocation of the store() operation issued by a process
pi . It descends from the root until the first vertex VTX such that VTX.marked is equal
to false (due to Lemma 9, such a non-marked vertex is visited by pi ; this vertex is at
depth k − 1 in the “worst” case). As all the vertices (but the last one) visited by pi
are marked (Lemma 11), and there are at most k − 1 processes that have invoked the
store() operation before pi , it follows from Lemma 10 that pi executes at most k − 1
iterations of the repeat loop (lines 3–9). As both the statements executed k − 1 times
inside the loop and the statements executed once outside the loop are made up of a
constant number of accesses to read/write registers, it follows that the total number
of read/write operations on base atomic registers is upper bounded by O(k). (Let us
recall that the splitter operation direction() requires a bounded number of read/write
accesses on the two atomic registers from which it is built.)
As soon as a vertex has been attributed to pi , we have my_vertexi ≠ ⊥, and the cost
of its following invocations of store() is constant, i.e., O(1).
Theorem 30 The cost of an invocation of the operation collect() is O(k).
Proof Let us first observe that, for each vertex visited by an invocation of collect(),
the number of operations on base atomic read/write registers is constant. So, to
determine an upper bound on the number of operations on base atomic registers,
we only need to compute an upper bound on the number of marked vertices visited
by an invocation of collect(), when at most k processes have invoked the store()
operation. (Let us observe that a vertex child of a marked vertex is visited even if it is
not marked. As this can happens at most k times, this does not change the magnitude
order of the cost of an invocation of the collect() operation.)
Thanks to the property of the splitter objects associated with each vertex and the
fact that the vertices that are marked (by the store() operation) define a subtree rooted
at ROOT (Lemma 11), an upper bound uk on the number of marked vertices accessed
by an invocation of the collect() operation can be defined as follows:
uk = 1 + uℓ + ur with 0 ≤ ℓ, r ≤ k − 1 and ℓ + r ≤ k,
where (see Fig. 7.11 where the marked vertices are the black vertices):
• The number 1 comes from the fact that, as k ≥ 1, the root is always marked.
(Notice that it is possible that no process stops at the root.)
• uℓ (or ur ) is the number of marked vertices in the left (or right) subtree of the root.
It is easy to see that u0 = 0, u1 = 1, and u2 = 3.
Let us consider the more constraining recurrence equation
uk = a + uℓ + ur with 1 ≤ ℓ, r ≤ k − 1 and ℓ + r = k.
Unfolding it (using, by induction, u j = ( j − 1)a + j u1 ) yields
uk = a + uℓ + ur
= a + (ℓ − 1)a + ℓ u1 + (r − 1)a + r u1
= (ℓ + r − 1)a + (ℓ + r ) u1
= (k − 1) a + k u1 .
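The closed form uk = (k − 1)a + k u1 can be checked numerically. The sketch below (the helper function u is hypothetical, introduced only for this check) maximizes the recurrence over all splits ℓ + r = k:

```python
# Worst case of u_k = a + u_l + u_r over all splits l + r = k, 1 <= l, r <= k-1.
# The derivation above gives the closed form u_k = (k-1)*a + k*u_1.
def u(k, a, u1):
    if k == 0:
        return 0
    if k == 1:
        return u1
    return max(a + u(l, a, u1) + u(k - l, a, u1) for l in range(1, k))

a, u1 = 3, 1
checks = [u(k, a, u1) == (k - 1) * a + k * u1 for k in range(1, 10)]
```

Every split of the recurrence yields the same value, which is why the maximum coincides with the closed form.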
This section presents the notion of a fast store-collect object and an efficient wait-free
implementation of it.
Merging the operations store() and collect() In some applications, each time
a process deposits a new value, it also needs to collect the last values that have
been deposited. More specifically, the processes are interested in using a single
operation, denoted store_collect(). As shown in Fig. 7.12, such an operation can be
easily built from an array of SWMR atomic read/write registers by a simple merge
of the algorithms described in Fig. 7.6.
Looking for efficiency when there is no concurrency As we can see, each invo-
cation of the store_collect() operation costs O(n) read/write accesses to base atomic
registers. It appears that, in some applications, there are periods during which a
single process invokes store_collect() (that process possibly playing a special role,
(Fig. 7.12: the trivial store_collect() algorithm, a simple merge of the store() and
collect() algorithms of Fig. 7.6; its pseudo-code was lost in extraction.)
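The simple merge of store() and collect() can be sketched sequentially as follows (a sketch only: the names REG and store_collect follow the text, but atomicity and concurrency are not modeled):

```python
N = 4                    # number of processes (assumed size)
REG = [None] * N         # one SWMR atomic register per process

def store_collect(i, v):
    """Simple merge of store() and collect(): deposit v in p_i's
    register, then read all the registers; always O(n) accesses."""
    REG[i] = v
    return {j: REG[j] for j in range(N) if REG[j] is not None}

store_collect(0, "a")
view = store_collect(2, "c")
```

Each invocation pays the full O(n) collect, even when the caller is running alone; this is precisely the cost that the fast implementation below avoids.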
e.g., leader, during the corresponding period). Such periods are characterized by a
concurrency degree (i.e., the number of processes invoking operations) equal to 1.
We are interested here in an implementation of the store_collect() operation whose
cost eventually becomes constant when there is a single process that invokes it during
a long enough period. This search for such an efficient implementation is motivated
by the practical cases where the atomic registers REG[1..n] are actually abstracting
n physical blocks of one or several disks. It becomes evident that, in such a setting,
a store_collect() operation whose cost is O(1) read/write operations on disk blocks
in concurrency-free scenarios is much more interesting than an implementation of
store_collect() whose cost is O(n) disk accesses.
The design of a fast store_collect() algorithm is based on two ideas. The first consists
in adding a shared control variable and the second in considering a two-step algo-
rithm. We present them in an incremental way. This algorithm is due to B. Englert
and E. Gafni (2002).
Each process pi also maintains a local array regi [1..n] in which it keeps a copy of
the values that it considers to be the last deposited values. This array acts as a local
cache that can save accesses to shared atomic registers.
To simplify the text of the algorithm, this section considers that the view returned
by store_collect() is an array of n values. It is trivial to obtain the corresponding
(process identity, value) pairs by eliminating the pairs whose value is ⊥.
First step: adding a MWMR atomic register The first step of the construction
consists in adding a MWMR atomic register LAST whose aim is to contain the
identity of the last process that has terminated executing store_collect() (initially,
LAST = ⊥). This will allow a process to learn if, since its previous invocation,
another process has terminated execution of a store_collect() invocation. (When the
array of atomic registers REG[1..n] abstracts blocks of a shared disk, the atomic
register LAST can be placed on the same disk.)
Unfortunately, this simple LAST -based algorithm is incorrect. This is easily shown
in Fig. 7.13, where (as in Fig. 7.7) a read (dotted) line obtains the last value deposited
by a process at the time this dotted line cuts the corresponding process axis. The
figure considers that each pi has previously deposited the value vi0 . So, as a view
contains a value per process, it is represented by an array. (The values of the variable
step4 and the views view3 and view4 are meaningless for the present discussion,
they will be used later.)
Let view1 = [v10 , v20 , v30 , v41 ] be the view returned to p4 by its invocation
store_collect(v41 ), view2 be the view returned to p1 by its invocation store_collect
(v11 ), and view3 and view4 be the views returned to p4 by its last two consecutive
invocations store_collect(v42 ) and store_collect(v43 ), respectively.
In this execution, p1 obtains view2 after p4 has computed view1 but before it has
updated LAST to 4: p4 paused after having asynchronously read the registers
REG[1..n] one after the other and before updating LAST , and the store_collect()
issued by p1 occurred during this pause.
It is easy to see that LAST = 4 when the invocation that returns view3 starts.
Consequently, that invocation returns the values cached at p4 , kept in reg4 [1..n],
missing the value v11 deposited by the invocation of store_collect() issued by p1 . As
that invocation by p1 is terminated before the invocation of store_collect() issued
by p4 (which returns view3 ) started, it follows that the store_collect() returning
view3 is incorrect as it violates the freshness property (it does not return the last
values deposited by the invocations of store_collect() that have terminated before it
started). Similarly, the view view4 returned by the next invocation of store_collect()
issued by p4 is incorrect.
Second step: a two-step algorithm The previous problem arises because, on one
side, processes can overwrite each other when writing the MWMR atomic register
LAST and, on another side, the reading of the whole array REG[1..n] is not an
atomic operation. A way to solve this problem consists in forcing a process pi to
access the array of atomic registers REG[1..n] even when LAST = i, in the cases
where LAST = i when pi issued its previous invocation of store_collect(). This
means that a process pi can safely read the values from its local cache regi [1..n]
only when it sees that it is the last writer (LAST = i) twice in a row: during the first
of these two store_collect() invocations, pi has obtained and saved in its local
cache the last values that have been deposited.
The resulting algorithm for the store_collect() operation is described in Fig. 7.14.
Each process pi maintains a local variable stepi whose scope is the whole execu-
tion. Its aim is to track whether pi remains the last writer between its successive
invocations of store_collect(). Initialized to 0, this local variable is set to 1 when
LAST = i ∧ stepi = 0 (i.e., when pi considers it is the last writer) while it was not
the last writer before. Then, when LAST = i ∧ stepi = 1, pi considers that it is
still the last writer since its previous invocation of store_collect(). As we have seen,
this is the case where pi can read from its local cache.
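The two-step mechanism just described can be sketched as follows. This is a sequential Python simulation only (atomicity and interleavings are not modeled); the one-element list LAST, the array step, and the array cache play the roles of the register LAST and the local variables stepi and regi [1..n].

```python
N = 4
REG = [None] * N            # SWMR atomic registers REG[1..n]
LAST = [None]               # MWMR atomic register (boxed in a list)
step = [0] * N              # local flags step_i
cache = [[None] * N for _ in range(N)]   # local caches reg_i[1..n]

def store_collect(i, v):
    REG[i] = v                          # deposit the value first
    if LAST[0] == i and step[i] == 1:   # last writer twice in a row:
        cache[i][i] = v                 # O(1) fast path, reuse the cache
    else:
        step[i] = 1 if LAST[0] == i else 0
        cache[i] = list(REG)            # O(n) collect into the cache
    LAST[0] = i                         # declare itself the last writer
    return list(cache[i])

store_collect(0, "x")        # LAST was None: full collect, step stays 0
store_collect(0, "y")        # LAST = 0 but step = 0: full collect, step becomes 1
v2 = store_collect(0, "z")   # LAST = 0 and step = 1: fast path
store_collect(1, "w")        # another writer takes over
v3 = store_collect(0, "q")   # LAST = 1: p0 must do a full collect again
```

After two consecutive solo invocations, every further solo invocation touches only REG[i] and LAST, which is the constant-cost behavior established later by Theorem 32.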
As an example of the way the algorithm works, let us again consider Fig. 7.13 and
observe the management of the local variable step4 . One can see that p4 has to read
the array of atomic registers REG[1..n] before returning view3 . This is because we
have then LAST = 4 ∧ step4 = 0. Differently, it does not have to read that array
before computing view4 , as we then have LAST = 4 ∧ step4 = 1.
(Fig. 7.14: the algorithm implementing the fast store_collect() operation, based on
the register LAST and the local variables stepi ; its pseudo-code was lost in extraction.)
Lemma 12 The implementation described in Fig. 7.14 satisfies the freshness prop-
erty of a store-collect object.
Proof Let viewi be the vector view returned by an invocation st_coli1 of store_
collect() issued by a process pi . Let store_collect(v j ) be the last invocation by p j
that terminated before st_coli1 started. We have to show that viewi [ j] contains v j or
a value v′j such that p j invoked store_collect(v′j ) after store_collect(v j ).
The lemma is trivially satisfied if st_coli1 reads all the atomic registers REG[1..n].
So, let us consider the case where st_coli1 is such that LAST = i ∧ stepi = 1.
Let st_coli0 be the last invocation of store_collect() by pi that precedes st_coli1 and is
such that pi reads LAST ≠ i. Such an invocation exists because LAST is initialized
to ⊥. Moreover, there is another invocation st_coli2 issued by pi that is in between
to ⊥. Moreover, there is another invocation st_coli2 issued by pi that is in between
st_coli0 and st_coli1 (this follows from the management of stepi ), and pi read all the
atomic registers REG[1..n] when it executed st_coli2 . As any process p j that invokes
store_collect() first deposits its value in REG[ j] and only then writes its identity j
in LAST , it follows that the last invocation of store_collect() issued by p j which
terminates before st_coli1 started has terminated before st_coli0 . Consequently, the
value read from REG[ j] by st_coli1 is v j or a more recent value.
Lemma 13 The implementation described in Fig. 7.14 satisfies the partial order
property of a store-collect object.
Proof Let st_col and st_coli2 be two invocations of store_collect() such that st_col
terminates before st_coli2 starts. Moreover let pi be the process that has invoked
st_coli2 (hence the index i). We have to show that, for every process p j , the value
returned by st_coli2 is not older than the value returned by st_col.
If st_coli2 returns values just read from the atomic registers REG[1..n], the lemma
immediately follows. So, let us consider the case where st_coli2 does not read the
atomic registers REG[1..n] (see Fig. 7.15).
This means that pi reads LAST = i and stepi = 1 when it executes st_coli2 .
As in the previous lemma, let us consider the last invocation st_coli0 by pi such
that LAST ≠ i. We have seen that st_coli0 does exist. As pi has read all the atomic
registers REG[1..n] during st_coli0 , its invocations after st_coli0 do not return values
older than the ones returned by st_coli0 . Hence, if st_col terminates before st_coli0
starts, the lemma follows. Consequently, we have to consider only the invocations
st_col that are concurrent with st_coli0 or start after st_coli0 has terminated. Let us
observe that these invocations do not terminate between the end of st_coli0 (where
pi writes its identity in LAST ) and st_coli2 (where pi finds LAST = i), otherwise
pi would not read LAST = i. It follows that these invocations are concurrent with
st_coli2 and st_coli1 (the invocation issued by pi in between st_coli0 and st_coli2 whose
existence was proved in the previous lemma) or start after st_coli1 and are concurrent
with st_coli2 . It follows that all the invocations st_col that have terminated before
st_coli2 starts have terminated before st_coli0 terminates and updates LAST to i. As
st_coli2 returns the values read by st_coli1 from the atomic registers, it follows that
this invocation cannot return values older than the ones returned by st_col, which
concludes the proof of the lemma.
Remark As suggested in the previous lemma, the fact that a process reads from its
local cache does not mean that there are no concurrent invocations of store_collect()
while a process returns the values kept in its local cache. To visualize this, let us
consider the execution depicted in Fig. 7.16. This execution is nearly the same as
the one described in Fig. 7.13. The only difference is that there is now an additional
invocation of store_collect() by process p2 and this invocation lasts a long time and
is concurrent with all the other invocations of store_collect(). This invocation by
p2 , which is represented by a bold dotted line, is the one that defines v20 as the last
value of p2 (assuming now that its previous value was v2∗ ). The view it returns is the
vector view5 = [v10 , v20 , v30 , v43 ] which is comparable to neither view3 nor view4
(the vectors returned by the two consecutive invocations of store_collect() issued
by p4 ).
Theorem 31 The implementation described in Fig. 7.14 is a wait-free implementa-
tion of a store-collect object where the store() and collect() operations are merged
into a single store_collect() operation.
Proof The algorithm is trivially wait-free. The proof of the validity property is
similar to the validity proof of Theorem 28. The proofs of the freshness and partial
order properties follow from Lemmas 12 and 13, respectively.
Theorem 32 If there is a time τ0 after which there is a single process pi that invokes
store_collect(), then there exists a time τ1 > τ0 after which the cost of each of its
invocations is O(1) accesses to atomic registers.
Proof The proof is an immediate consequence of the management of the atomic
register LAST and the local variable stepi . If, after some time, only pi invokes
store_collect() operations, LAST remains permanently equal to i. It then follows
from the management of stepi that, after two invocations of store_collect() by pi ,
we always have LAST = i ∧ stepi = 1. From that time, each time it invokes
store_collect(), pi writes its new value into REG[i] and reads LAST (and learns that
it is always the last writer). There are then two accesses to atomic registers, whatever
the value of n.
7.4 Summary
This chapter has considered the base read/write asynchronous computation model
in which any number of processes may crash. Hence, implementations cannot use
strong base atomic operations such as compare&swap(), they can rely only on atomic
read/write registers.
Wait-free implementations suited to such a weak computation model have been
presented. The first considered a weak counter which can be accessed by infinitely
many processes. The second considered store-collect objects and fast store-collect
objects.
7.5 Bibliographic Notes
• The notion of a weak counter for infinitely many processes and the corresponding
implementation are due to M.K. Aguilera [14].
• The notions of infinitely many processes and finite concurrency used in this chapter
are due to E. Gafni, M. Merritt, and G. Taubenfeld [110, 204].
• The notion of an adaptive algorithm and its underlying theory is investigated in
[30, 34].
• The adaptive store-collect object presented in this chapter is due to H. Attiya,
A. Fouren, and E. Gafni [35].
• Long-lived adaptive store-collect objects are presented in [12].
• The fast implementation of the store_collect() operation is due to B. Englert and
E. Gafni [94] who proposed it to improve the disk version of the Paxos algorithm
[109].
8 Snapshot Objects from Read/Write Registers Only
This chapter is devoted to snapshot objects. Such an object can be seen as a store-
collect object whose two operations are atomic. After having defined the concept of
a snapshot object, this chapter presents wait-free implementations of it, which are
based on atomic read/write registers only. This chapter introduces also the notion of
an immediate snapshot object, which can be seen as the atomic counterpart of the
fast store-collect object presented in the previous chapter.
Single-writer snapshot object In a single-writer snapshot object, each process can
write only the component associated with it. In this case, the base atomic read/write
registers from which the snapshot object is built are SWMR registers.
As shown in Fig. 8.1, a process pi invokes update(v) to assign a new value to the
SWMR register REG[i] and any process p j invokes snapshot() to read atomically
the whole array REG[1..n].
Multi-writer snapshot object A multi-writer snapshot object is a snapshot object
for which each component can be written by any process. It follows that the base
atomic read/write registers from which a multi-writer snapshot object is built are
MWMR registers.
A multi-writer snapshot object with m components is described in Fig. 8.2.
A process pi invokes update(x, v) to assign the value v to the MWMR compo-
nent REG[x] and invokes snapshot() to read atomically the whole array REG[1..m].
Let REG[1..n] be the array which is the internal representation of the snapshot object.
A simple principle to be able to distinguish different updates of REG[i] by process
pi consists in considering that each atomic register REG[i] is made up of two fields,
REG[i].val, which contains the last value written by pi , and REG[i].sn, its associated
sequence number.
The corresponding algorithm for the update() operation is described in Fig. 8.3.
The local variable denoted sn i , which is initialized to 0, allows pi to generate
sequence numbers.
The algorithm implementing the operation snapshot() is based on what is called
a “sequential double collect” which is made up of two consecutive invocations of
the function collect(). As we have seen in the previous section, this operation reads
asynchronously all the registers of the array REG[1..n] (lines 10–11). (Its trivial
implementation considered in the implementations that follow can be replaced by a
more efficient adaptive implementation as seen previously).
An invocation of snapshot() repeatedly reads twice the array REG[1..n] (lines
4 and 6 or lines 8 and 6) until two consecutive collects obtain the same sequence
number values for each register REG[ j] (the local arrays aai and bbi are used to save
the values read from the array in the first and second collect, respectively). When,
∀ j : aai [ j] = bbi [ j], the corresponding double collect is said to be successful and
the algorithm then returns the array of values [aai [1].val, . . . , aai [n].val] (line 7).
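The update() and double-collect snapshot() algorithms just described can be sketched as follows (a sequential Python simulation only; each REG[j] holds the pair (val, sn) written in a single atomic assignment):

```python
N = 3
REG = [(None, 0)] * N    # REG[j] = (val, sn): one SWMR atomic register each
sn = [0] * N             # local sequence-number generators sn_i

def update(i, v):
    sn[i] += 1
    REG[i] = (v, sn[i])  # one atomic write of the (value, seq number) pair

def collect():
    return list(REG)     # asynchronous read of REG[1..n]

def snapshot():
    aa = collect()
    while True:
        bb = collect()
        # successful double collect: no register changed in between
        if all(aa[j][1] == bb[j][1] for j in range(N)):
            return [bb[j][0] for j in range(N)]
        aa = bb          # reuse the last collect as the first of the next pair

update(0, 1)
update(2, 9)
snap = snapshot()
```

Note that this version is not wait-free: an uninterrupted stream of concurrent updates can make the double collect fail forever, which is what motivates the helping mechanism presented next.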
• Let the linearization point of an update() operation issued by p j be the time instant
when p j writes REG[ j].
• Considering the invocation of a snapshot() operation that returns at line 7, let
collect1 and collect2 be its last two invocations of collect(). Moreover, let τbi
(or τei ) be the time at which collect1 (or collect2 ) read REG[i]. As both collect1
and collect2 obtain the same sequence number from REG[i], we can conclude
that, for any i, no update() operation issued by pi has terminated and modi-
fied REG[i] between τbi and τei . As collect2 starts after collect1 has terminated,
it follows that, between max({τbi }1≤i≤n ) and min({τei }1≤i≤n ), no atomic regis-
ter REG[i], 1 ≤ i ≤ n, was modified. It is consequently possible to associate
with the corresponding invocation of the snapshot() operation a linearization
point τ of the time line such that max({τbi }1≤i≤n ) < τ < min({τei }1≤i≤n ). Hence,
from an external observer point of view, we can consider that the invocation of
snapshot() occurred instantaneously at time τ (see Fig. 8.4) after all the invocations
of the first update() operation and before the invocations of the second update()
operation.
It follows from the previous definition of the linearization points that the operations
that terminate define an atomic snapshot object.
Fig. 8.5 The update() operation includes an invocation of the snapshot() operation
• As the invocation int_snap is entirely overlapped by the invocation snap, snap can
borrow the array help and return it as its own result. (Notice it is possible that
the internal snapshot invocation inside upd1 may be also entirely overlapped by
snap, but there is no way to know this.) This overlapping is important to satisfy
the atomicity property; namely, the values returned have to be consistent with the
real-time occurrence order of the operation invocations.
The proof will show that such an addition of an invocation of snapshot() to the
algorithm implementing the update() operation is not at the price of the wait-free
property.
The algorithms implementing the update() and snapshot() algorithms are
described in Fig. 8.6. The function collect() is the same as before. A SWMR atomic
register REG[i] is now made up of three fields: REG[i].val and REG[i].sn as before,
plus the new field REG[i].help_array, whose aim is to contain a helping array as
previously discussed. The final version of both update() and snapshot() algorithms
is a straightforward extension of their previous attempt versions.
The main novelty lies in the local variable can_helpi that is used by a process
pi when it executes the operation snapshot(). The aim of this set, initialized to
∅, is to contain the identity of the processes that have terminated an invocation of
update() since pi started its current invocation of the snapshot() operation. Its use
is described in the if statement at lines 12–17. More precisely, when a double collect
is unsuccessful, pi does the following with respect to each process p j that made the
double collect unsuccessful (i.e., such that aai [ j].sn ≠ bbi [ j].sn):
• If j ∉ can_helpi (line 15). In this case, pi discovers that p j has terminated an
invocation of update() since it started its invocation of the snapshot() operation.
Consequently, if p j terminates a new invocation of update() while pi has not yet
terminated its invocation of snapshot(), p j can help pi terminate.
• If j ∈ can_helpi (line 14). In this case, p j has entirely executed an update()
operation while pi was executing its snapshot() operation. As we have seen, pi
can benefit from the help provided by p j by returning the array that p j stored in
REG[ j] at the end of its invocation of the operation update().
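The helping mechanism can be sketched compactly as follows. This is a sequential Python simulation: each REG[j] holds the triple (val, sn, help_array); in a sequential run the double collect always succeeds, so the help branch mirroring lines 12–17 is shown but never exercised.

```python
N = 3
REG = [(None, 0, None)] * N   # REG[j] = (val, sn, help_array), SWMR atomic
sn = [0] * N                  # local sequence-number generators

def update(i, v):
    help_array = snapshot(i)         # embedded snapshot computed first...
    sn[i] += 1
    REG[i] = (v, sn[i], help_array)  # ...then written atomically with v

def snapshot(i):
    can_help = set()                 # processes seen terminating an update()
    aa = list(REG)
    while True:
        bb = list(REG)
        if all(aa[j][1] == bb[j][1] for j in range(N)):
            return [bb[j][0] for j in range(N)]  # successful double collect
        for j in range(N):
            if aa[j][1] != bb[j][1]:
                if j in can_help:    # p_j completed two updates: borrow its array
                    return bb[j][2]
                can_help.add(j)
        aa = bb

update(0, "a")
update(1, "b")
snap = snapshot(2)
```

Since each unsuccessful double collect adds at least one identity to can_help, a process returns after at most n iterations, which is the heart of the bounded wait-freedom argument of Theorem 34.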
Theorem 34 The algorithms described in Fig. 8.6 define a bounded wait-free
implementation of an atomic snapshot object.
Proof Let us first show that the implementation is bounded wait-free. As any
invocation of update() contains an invocation of snapshot(), we have to show that
any invocation of snapshot() terminates. Each unsuccessful double collect adds at
least one new process identity to the set can_helpi ; hence, after at most n unsuccessful
double collects, for one of the processes p j such that aai [ j].sn ≠ bbi [ j].sn at line 12,
we necessarily have j ∈ can_helpi at line 13, from which it follows that pi returns
at line 14. The
implementation is consequently wait-free, as pi terminates after a finite number of
operations on base registers have been executed.
Let us now replace “finite” by “bounded”, i.e., let us determine a bound on the
number of accesses to base registers. A collect() costs O(n) accesses to base regis-
ters. The cost of each iteration of the for loop (lines 11–18) is O(1), and there are
at most n iteration steps, which means that the cost of that loop is upper-bounded by
O(n). Finally, as the enclosing repeat loop is executed at most n times, it follows
that a process issues at most O(n²) accesses to base registers when it executes a
snapshot() or an update() operation.
Let us now show that the object that is built is atomic. To that end we have to show
that, for each execution (and according to the notation introduced in Chap. 4), there
is a total order S on the invocations of update() and snapshot() such that: (1) S
includes all the invocations issued by the processes, except possibly, for each process,
its last invocation if that process crashes, (2) S respects the real-time occurrence
order on these invocations (i.e., if the invocation op1 terminates before the invocation
op2 starts, op1 has to appear before op2 in S), and (3) S respects the semantics of
each operation (i.e., a snapshot() invocation has to return, for each process p j , the
value v j such that, in S, there is no update() invocation by p j between update(v j )
and that snapshot() invocation).
The definition of the sequence S relies on (a) the atomicity of the base registers
and (b) the fact that the operation snapshot() invokes sequentially the underlying
function collect(). Let us remember that item (a) means that the read and write oper-
ations on the base registers can be considered as being executed instantaneously,
each one at a point of the time line, and no two of them at the same time.
The sequence S is built as follows. The linearization point of an invocation of
the operation update() is the time at which it atomically executes the write in the
corresponding SWMR register (line 3).
The definition of the linearization point of an invocation of the operation
snapshot() depends on the line at which it returns:
• The linearization point of an invocation of snapshot() that terminates at line 10
(successful double collect) is at any time between the end of the first and the
beginning of the second of these collect invocations (see Theorem 33 and Fig. 8.4).
• The linearization point of an invocation of snapshot() that terminates at line 14
(i.e., pi terminates with the help of another process p j ) is defined inductively as
follows. (See Fig. 8.7, where a rectangle below an update() invocation represents
the internal invocation of snapshot(). The dotted help_array arrow shows the
way an array is conveyed from a successful double collect by a process pk until
an invocation of snapshot() issued by a process pi .)
The array (say help_array) returned by pi was provided by an invocation of
update() executed by some process p j . As already seen, this update() was entirely
executed within the time interval of pi ’s current invocation of snapshot(). This
array was obtained by p j from a successful double collect, or from another process
pk . If it was obtained from a process pk , let us consider the way help_array was
obtained by pk . As there are at most n concurrent invocations of snapshot(), it fol-
lows by induction that there is a process px that has invoked the snapshot() opera-
tion and has obtained help_array from a successful double collect. Moreover, that
invocation of snapshot() was inside an invocation of an update() operation that was
entirely executed within the time interval of pi ’s current invocation of snapshot().
The linearization point of the invocation of snapshot() issued by pi is defined
from the internal invocation of snapshot() whose successful double collect deter-
mined help_array. If several invocations of snapshot() are about to be linearized
at the same time, they are ordered according to the total order in which they were
invoked.
It follows directly from the previous definition of the linearization points asso-
ciated with the invocations of the update() and snapshot() operation issued by the
processes that S satisfies items (1) and (2) stated at the beginning of the proof. The
satisfaction of item (3) comes from the fact that the array returned by any invocation
of snapshot() has always been obtained from a successful double collect.
Fig. 8.8 Single-writer atomic snapshot for infinitely many processes (code for pi )
The lines without a prefix and the corresponding lines of Fig. 8.6 are the same. As far as the operation update() is concerned, the only
difference with respect to the base version is the addition of the first line marked N0.
Adapting the algorithm for the operation snapshot() A first problem that has
to be solved consists in making the collect() function always terminate. A simple
solution consists in adding an input parameter x to that function, indicating that
the collect has to be only from REG[1] until REG[x]. The value of this parameter is
defined as the current value of the counter WEAK_CT . The corresponding adaptation
of the algorithm implementing the snapshot() operation appears in the lines prefixed
by “M” (for modified); more precisely, the lines M.6, M.8, M.13, and M.22 in Fig. 8.8.
A second problem arises when new processes with higher identities invoke
update(), causing the counter WEAK_CT to increase forever. It is consequently
possible that, while it executes the repeat loop, an invocation of snapshot() never
finds a process p j that has terminated two invocations of update() during its invo-
cation of snapshot(): permanently, there are new invocations of update(), but those
are issued by new processes with higher and increasing identities.
To solve this problem, let us observe that, if WEAK_CT increases due to a
process p j , then p j has necessarily increased it (at line M0 when it executed the
update() operation) after pi started its snapshot operation. So, if n_init is the value
of WEAK_CT when pi starts invoking snapshot() (see line M.6), this means that we
have j > n_init. The solution to the problem (see Fig. 8.9) consists then in replacing
the test j ∈ could_helpi by the test j ∈ could_helpi ∨ j > n_init (line M.3):
even if p j has not executed two update(), REG[ j].help_array can be returned as
it was determined after pi started its invocation of the snapshot() operation.
Remarks As it is written, the returned value (at line 10 or 14) is an array that can
contain lots of ⊥. This depends on the identity of the processes that have previously
invoked the update() operation. It is possible to return instead a set of (process
identity, value) pairs. On another side, the array can be replaced by a list.
The proof that this is a wait-free implementation of an atomic snapshot object in
the finite concurrency model is left as an exercise. The reader can easily remark that
the construction is not bounded wait-free (this is because it is not possible a priori to
state a bound on the number of iterations of the while loop).
This section presents a multi-writer snapshot algorithm due to D. Imbs and M. Raynal
(2011). This implementation is based on a helping mechanism similar to the one used
in the previous section. The snapshot object has m components.
The algorithm implementing the operation update(x, v) Let pi be the invoking
process. First, pi increases the local sequence number generator sn i (initialized to 0)
and atomically writes the triple ⟨v, i, sn i ⟩ into REG[x].
It then computes a snapshot value and writes it into HELPSNAP[i] (line 3).
This constitutes the “write first, help later” strategy. The write of the value v into
the component x is executed before the computation and the write of a helping array.
The way HELPSNAP[i] can be used by other processes was described previously.
Finally, pi returns from its invocation of update().
It is important to notice that, differently from what is done in Fig. 8.6, the write
of v into REG[x] and the write of a snapshot value into HELPSNAP[i] are distinct
atomic writes (which access different atomic registers).
The algorithm implementing the operation snapshot(): try first to terminate
without help from a successful double collect This algorithm is described at lines
5–17 of Fig. 8.10.
The pair of lines 6 and 8 and the pair of lines 16 and 8 constitute “double collects”.
Similarly to what is done in Fig. 8.6, a process pi first issues a double collect to try
to compute a snapshot value by itself. The values obtained from the first collect are
saved in the local array aa, while the values obtained from the second collect are
saved in the local array bb. If aa[x] = bb[x] for each component x, pi has executed
a successful double collect: REG[1..m] contained the same values at any time during
the period starting at the end of the first collect and finishing at the beginning of
the second collect. Consequently, pi returns the array of values bb[1..m].val as the
result of its snapshot invocation (line 9).
The algorithm implementing the operation snapshot(): otherwise, try to benefit
from the help of other processes If the predicate ∀x : aa[x] = bb[x] is false,
pi looks for all the entries x that have been modified during its previous double collect.
Those are the entries x such that aa[x] ≠ bb[x]. Let x be such an entry. As witnessed
by bb[x] = ⟨−, w, −⟩, the component x has been modified by process pw (line 11).
The predicate w ∈ can_helpi (line 12) is the helping predicate. It means that
process pw issued two updates that are concurrent with pi ’s current snapshot invo-
cation. As we have seen in the algorithm implementing the operation update(x, v)
(line 3; see also Fig. 8.11), this means that pw has issued an invocation of snapshot()
as part of an invocation of update() concurrent with pi ’s snapshot invocation.
If this predicate is true, the corresponding snapshot value (which has been saved in
HELPSNAP[w]) can be returned by pi as output of its snapshot invocation (line 12).
If the predicate is false, process pi adds the identity w to the set can_helpi
(line 13). Hence, can_helpi (which is initialized to ∅, line 1) contains identities y
indicating that process py has issued its last update while pi is executing its snapshot
operation. Process pi then moves the array bb into the array aa (line 16) and re-enters
the repeat loop. (As already indicated, lines 16 and 8 constitute a new double collect.)
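For readers who prefer executable text, the control structure of the two operations of Fig. 8.10 can be sketched in Python. This is only a single-threaded model (so atomicity of the base registers is trivially obtained and the helping branch is never actually needed); the class name and the line-number comments are illustrative reconstructions, not the book's code:

```python
# Single-threaded sketch of the multi-writer snapshot of Fig. 8.10.
# REG[x] holds triples (value, writer identity, sequence number);
# HELPSNAP[w] holds the helping snapshot computed by process pw.
BOT = None  # stands for the initial value ⊥

class MWSnapshot:
    def __init__(self, n, m):
        self.n, self.m = n, m
        self.REG = [(BOT, 0, 0) for _ in range(m)]
        self.HELPSNAP = [None] * n     # helping snapshot, one per process
        self.sn = [0] * n              # local sequence number generators

    def update(self, i, x, v):
        # "write first, help later": write REG[x], then compute a helping snapshot
        self.sn[i] += 1
        self.REG[x] = (v, i, self.sn[i])        # line 2: atomic write
        self.HELPSNAP[i] = self.snapshot(i)     # line 3: help later

    def snapshot(self, i):
        can_help = set()                         # line 1
        aa = list(self.REG)                      # line 6: first collect
        while True:
            bb = list(self.REG)                  # lines 8/16: second collect
            if aa == bb:                         # line 9: successful double collect
                return [t[0] for t in bb]
            for x in range(self.m):              # look for modified entries
                if aa[x] != bb[x]:
                    w = bb[x][1]                 # writer that moved REG[x]
                    if w in can_help:            # line 12: w issued two updates
                        return self.HELPSNAP[w]
                    can_help.add(w)              # line 13
            aa = bb                              # line 16: new double collect
```

In this sequential model the first double collect always succeeds; the helping branch only matters under real concurrency, e.g., `S = MWSnapshot(3, 2); S.update(0, 1, 'a'); S.snapshot(2)` yields `[None, 'a']`.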
On the “write first, help later” strategy As we can see, this strategy is very simple.
It has several noteworthy advantages:
• This strategy first allows the atomic write operations (at line 2 and line 3) to write
values into the base atomic registers REG[x] and HELPSNAP[i] that have a smaller
size than the values written in the single-writer snapshot object implementation
of Fig. 8.6 (where an atomic write into REG[x] is on a triple of values). Atomic
writes of smaller values allow for more efficient solutions.
• Second, this simple strategy allows the atomic writes into the base atomic regis-
ters REG[x] and HELPSNAP[i] not to be synchronized (while they are strongly
synchronized in the single-writer snapshot implementation of Fig. 8.6, where they
are grouped into a single atomic write).
Fig. 8.11 A snapshot() with two concurrent update() by the same process
234 8 Snapshot Objects from Read/Write Registers Only
• Finally, as shown in the proof, the “write first, help later” strategy allows the
invocations of snapshot() to satisfy the strong freshness property (i.e., to return
component values that are “as fresh as possible”).
Cost of the implementation This section analyses the cost of the operations
update() and snapshot() in terms of the number of base atomic registers that are
accessed by a read or write operation.
• Operation snapshot().
– Best case. In the best case, an invocation of the operation snapshot() returns
after having read the array REG[1..m] only twice. The cost is then 2m.
– Worst case. Let pi be the process that invoked the operation snapshot(). The worst
case is when a process returns at line 12 and the local set can_helpi contains
n − 1 identities: an identity from every process but pi. In that case, pi has read
the array REG[1..m] n + 1 times and, consequently, has accessed the shared
memory (n + 1)m times.
• The cost of an update operation is the cost of a snapshot operation plus 1.
It follows that the cost of an operation is O(n × m).
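These counts can be checked with elementary arithmetic; the values n = 4 and m = 3 below are arbitrary illustrative choices:

```python
# Best and worst-case number of base-register accesses of snapshot(),
# for illustrative values n = 4 processes and m = 3 components.
n, m = 4, 3
best = 2 * m          # one successful double collect: two reads of REG[1..m]
worst = (n + 1) * m   # n + 1 collects before the helping predicate must hold
print(best, worst)    # → 6 15
```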
As snp has read REG[r1] twice (first to obtain aa1[r1], and then to obtain bb1[r1])
and aa1[r1] ≠ bb1[r1], it follows that up1 started after the start of snp, which
concludes the proof of the claim.
Lemma 16 For any h, the values returned by an h-helped invocation of snapshot()
are well-defined, mutually consistent, and strongly fresh.
Proof The proof is by induction. The base case is h = 0 (Lemma 14). Assum-
ing that the lemma is satisfied up to h − 1, the proof for h is similar to the proof
for h = 1 that relies on the fact that we have a proof for the case h − 1 = 0
(Lemma 15).
Lemma 17 Wait-freedom. Any invocation of update() or snapshot() issued by a
correct process terminates.
Proof Let us first observe that, if every snapshot() operation issued by a correct
process terminates, then all its update() operations terminate. Hence, the proof only
has to show that all snapshot() operations issued by a correct process terminate.
Let us consider an invocation of snapshot() issued by a correct process pi . If,
when pi executes line 9, the predicate is true, the invocation terminates. So, we have
to show that, if the predicate of line 9 is never satisfied, then the predicate of line 12
eventually becomes true. As the predicate of line 9 is never satisfied, each time pi
executes the loop body, there is a component x such that aa[x] ≠ bb[x]. The process
pk that modified REG[x] between the two readings by pi entails the addition of its
identity k to can_helpi (where k is extracted from bb[x]). In the worst case, n − 1
identities (one per process except pi , because it cannot execute an update operation
while it executes a snapshot operation) can be added to can_helpi while the predicate
of line 12 remains false. But, once can_helpi contains one identity per process (but
pi ), the test of line 12 necessarily becomes satisfied, which proves the lemma.
The next lemma shows that the implementation is atomic, i.e., that the invocations
of update() and snapshot() issued by the processes during a run (except possibly
the last operation issued by faulty processes) appear as if they have been executed
one after the other, each one being executed at some point of the time line between
its start event and its end event.
Lemma 18 The algorithms described in Fig. 8.10 implement an atomic multi-writer
snapshot object.
Proof The proof consists in associating with each invocation inv of update() and
snapshot() a single point of the time line denoted p(inv) (linearization point of inv)
such that:
• p(inv) lies between the beginning (start event) of inv and its end (end event),
• no two operations have the same linearization point,
• the sequence of the operation invocations defined by their linearization points is a
sequential execution of the snapshot object.
So, the proof consists in (a) an appropriate definition of the linearization points and
(b) showing that the associated sequence satisfies the specification of the snapshot
object.
• Point (a): definition of the linearization points.
The linearization point of each operation invocation (except possibly the last oper-
ation of faulty processes) is defined as follows:
– The linearization point of an invocation of update(r, −) is the linearization point
of its write of REG[r] (line 2). (Let us remember that, as the underlying base
registers are atomic, their read and write operations have well-defined linearization points.)
– The linearization point of an invocation psp of snapshot() depends on the line
at which the return() statement is executed:
∗ Case 1: psp returns at line 9 due to a successful double collect (i.e., psp is
0-helped). Its linearization point is any point of the time line between the
first and the second collect of that successful double collect.
∗ Case 2: psp returns at line 12 (i.e., psp is h-helped with h ≥ 1). In this case,
the array of values returned by psp was computed by some update operation
at line 3. Moreover, whatever the value of h, this array was computed by a
successful double collect executed by some process pz . When considering
this successful double collect, p(psp) is placed between the end of its first
collect and the beginning of its second collect.
If two operations are about to be linearized at the same point, they are arbitrarily
ordered (e.g., according to the identities of the processes that issued them).
It follows from the previous linearization point definitions that each invocation of
an operation is linearized between its beginning and its end, and no two operations
are linearized at the same point.
• Point (b): the sequence of invocations of update() and snapshot() defined by their
linearization points satisfies the specification of the snapshot object.
This follows directly from Lemma 16, which showed that the values returned
by every snapshot operation are well defined, mutually consistent, and strongly
fresh.
Theorem 35 The implementation described in Fig. 8.10 is a bounded wait-free
implementation of an atomic multi-writer snapshot object which also satisfies the
strong freshness property.
Proof The proof that the implementation satisfies the consistency property of a
snapshot object follows from Lemma 18. Wait-freedom follows from Lemma 17.
The freshness property follows from the definition of the linearization points given
in Lemma 18. Finally, the fact that the implementation is bounded wait-free follows
from the fact that an operation costs at most O(m × n) accesses to base atomic
registers.
Figure 8.12 depicts three processes p1 , p2 , and p3 . Each process pi first invokes
update(vi ) and later invokes snapshot(). (In the figure, process identities appear as
subscripts in the operation invoked).
According to the specification of the one-shot snapshot object, the invoca-
tion snapshot1 () returns the view {(1, v1 ), (2, v2 )}, while both the invocations
snapshot2 () issued by p2 and snapshot3 () issued by p3 return the view {(1, v1 ),
(2, v2 ), (3, v3 )}. This means that it is possible to associate with this execution the
following sequence of operation invocations S:
8.5 Immediate Snapshot Objects 239
update1 (v1 ) update2 (v2 ) snapshot1 () update3 (v3 ) snapshot2 () snapshot3 (),
which belongs to the sequential specification of a one-shot snapshot object and shows,
consequently, that this execution is atomic.
Figure 8.13 shows the same processes where the invocations updatei () and
snapshoti () issued by pi are replaced by a single invocation update_snapshoti ()
(that starts at the same time as updatei () starts and terminates at the same time as
snapshoti ()).
As update_snapshot1 (v1 ) terminates before update_snapshot3 (v3 ) starts, we nec-
essarily have (1, v1 ) ∈ view3 and (3, v3 ) ∉ view1 . More generally, each of the five
items that follow defines a set of correct outputs for the execution of Fig. 8.13 (said
differently, this execution is non-deterministic in the sense that its outputs are defined
by any one of these five items):
1. view1 = {(1, v1 )}, view2 = {(1, v1 ), (2, v2 )}, and view3 = {(1, v1 ), (2, v2 ),
(3, v3 )},
2. view1 = view2 = {(1, v1 ), (2, v2 )}, and view3 = {(1, v1 ), (2, v2 ), (3, v3 )},
3. view2 = {(2, v2 )}, view1 = {(1, v1 ), (2, v2 )}, and view3 = {(1, v1 ), (2, v2 ),
(3, v3 )},
4. view1 = {(1, v1 )}, and view2 = view3 = {(1, v1 ), (2, v2 ), (3, v3 )}, and
5. view1 = {(1, v1 )}, view3 = {(1, v1 ), (3, v3 )}, and view2 = {(1, v1 ), (2, v2 ),
(3, v3 )}.
When view1 , view2 and view3 are all different (item 1), everything appears as
if the three invocations of update_snapshot() have been executed sequentially (and
consistently with their real-time occurrence order). When two of them are equal, e.g.,
view1 = view2 = {(1, v1 ), (2, v2 )} (item 2), everything appears as if the invocations
update_snapshot1 (v1 ) and update_snapshot2 (v2 ) have been issued at the very same
time, both before update_snapshot3 (). This possibility of simultaneity is the very
essence of the “immediate” snapshot abstraction. It also shows that an immediate
snapshot object is not an atomic object.
Theorem 36 A one-shot immediate snapshot object satisfies the following property:
if (i, −) ∈ view j and ( j, −) ∈ viewi , then viewi = view j .
Proof If ( j, −) ∈ viewi (theorem assumption), we have view j ⊆ viewi from
the immediacy property. Similarly, (i, −) ∈ view j implies that viewi ⊆ view j . It
trivially follows that viewi = view j when ( j, −) ∈ viewi and (i, −) ∈ view j .
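The property stated by Theorem 36 can be checked mechanically on the five admissible outputs listed above for the execution of Fig. 8.13. In the illustrative encoding below, a view is a set of process identities (the associated values are omitted):

```python
# Check Theorem 36 on the five admissible outputs listed for Fig. 8.13:
# whenever (i,-) ∈ view_j and (j,-) ∈ view_i, the two views are equal.
cases = [
    {1: {1},     2: {1, 2},     3: {1, 2, 3}},   # item 1
    {1: {1, 2},  2: {1, 2},     3: {1, 2, 3}},   # item 2
    {1: {1, 2},  2: {2},        3: {1, 2, 3}},   # item 3
    {1: {1},     2: {1, 2, 3},  3: {1, 2, 3}},   # item 4
    {1: {1},     2: {1, 2, 3},  3: {1, 3}},      # item 5
]
for views in cases:
    for i in views:
        for j in views:
            if i in views[j] and j in views[i]:
                assert views[i] == views[j]      # Theorem 36
print("all five cases satisfy Theorem 36")
```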
Set-linearizability The previous theorem states that, while its operations appear as
if they were executed instantaneously, an immediate snapshot object is not an atomic
object. This is because it is not always possible to totally order all its operations.
The immediacy property states that, from a logical time point of view, it is possible
that several invocations appear as being executed simultaneously (they then return
the same view), making it impossible to consider that one occurred before the other.
This means that an immediate snapshot object has no sequential specification. We
then say that these invocations are set-linearized at the same point of the time line,
and the notion of a linearization point is replaced by set-linearization.
Hence, differently from a snapshot object, the specification of an immediate snap-
shot object allows for concurrent operations from an external observer point of view.
It then requires that the invocations which are set-linearized at the same point do
return the very same view.
Fig. 8.14 An algorithm for the operation update_snapshot() (code for process pi )
To catch the underlying intuition and understand how this idea works, let us
consider two extremal cases in which k processes invoke the update_snapshot()
operation:
• Sequential case.
In this case, the k processes invoke the operation sequentially; i.e., the next invo-
cation starts only after the previous one has returned. It is easy to see that the
first process pi1 that invokes the update_snapshot() operation proceeds from step
n + 1 until step number 1, and stops at this step. Then, the process pi2 starts and
descends from step n + 1 until step number 2, etc., and the last process pik stops
at step k.
Moreover, the set returned by pi1 is {i 1 }, the set returned by pi2 is {i 1 , i 2 }, etc., the
set returned by pik being {i 1 , i 2 , . . . , i k }. These sets trivially satisfy the inclusion
property.
• Synchronous case.
In this case, the k processes proceed synchronously. They all, simultaneously,
descend from step n + 1 to step n, then from step n to step n − 1, etc., and they all
stop at step number k because there are then k processes at steps from 1 to k (they
all are on the same kth step).
It follows that all the processes return the very same set of participating processes,
namely, the set including all of them {i 1 , i 2 , . . . , i k }, and the inclusion property is
trivially satisfied.
Other cases, where the processes proceed asynchronously and some of them crash,
can easily be devised. The general case is described in Fig. 8.15. If pi stops at level
x, its view includes all the pairs ⟨j, vj⟩ such that pj returns or crashes at level x.
How levels are implemented The main question is now: how can this idea be made
operational? This is done by three statements (Fig. 8.14). Let us consider a
process pi :
• First, when it is standing at a given step LEVEL[i], pi reads the steps at which the
other processes are (line 3). The aim of this asynchronous reading is to allow pi to
compute an approximate global state of the stairway. Let us notice that, as a process
pj can only go downstairs, leveli [j] is greater than or equal to the step
on which pj currently is. It follows that, despite the fact that the global state obtained
by pi is approximate, seti can be safely used by pi .
• Then (line 4), pi uses the global state of the stairway it has obtained to compute
the set (denoted seti ) of processes that, from its point of view, are standing at a
step between step 1 and step LEVEL[i] (the step where pi currently is).
• Finally (line 5), if seti contains k = LEVEL[i] or more processes, pi returns the
corresponding view (lines 6–7). Otherwise, it proceeds to the next stair
LEVEL[i] − 1 (line 2).
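The three statements above can be turned into a small executable sketch. The Python class below is only a sequential model of the algorithm of Fig. 8.14 (in a real run the processes execute concurrently over atomic registers); the names are illustrative:

```python
# Sequential sketch of the one-shot immediate snapshot of Fig. 8.14.
class ImmediateSnapshot:
    def __init__(self, n):
        self.n = n
        self.REG = [None] * n        # deposited values
        self.LEVEL = [n + 1] * n     # current stair of each process

    def update_snapshot(self, i, v):
        self.REG[i] = v                                   # line 1: deposit first
        while True:
            self.LEVEL[i] -= 1                            # line 2: go down one stair
            level_i = list(self.LEVEL)                    # line 3: asynchronous read
            set_i = {x for x in range(self.n)
                     if level_i[x] <= self.LEVEL[i]}      # line 4
            if len(set_i) >= self.LEVEL[i]:               # line 5: stop here?
                return {(j, self.REG[j]) for j in set_i}  # lines 6-7
```

Run sequentially, the model exhibits the behavior described above: the first invoker descends to stair 1 and returns only its own pair; the second invoker stops at stair 2 and returns both pairs.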
Two preliminary lemmas are proved before the main theorem.
Lemma 19 Let seti = {x | leveli [x] ≤ LEVEL[i]} (as computed at line 4). For any
process pi , the predicate |seti | ≤ LEVEL[i] is always satisfied at line 5.
Proof Let us first observe that leveli [i] and LEVEL[i] are always equal at lines 4
and 5. Moreover, any LEVEL[ j] register can only decrease, and for any pair (i, j)
we have LEVEL[ j] ≤ leveli [ j].
The proof is by contradiction. Let us assume that there is at least one process pi
such that |seti | = |{x | leveli [x] ≤ LEVEL[i]}| > LEVEL[i]. Let k be the current
value of LEVEL[i] when this occurs. |seti | > k and LEVEL[i] = k mean that at
least k + 1 processes have progressed at least to stair k. Moreover, as any process pj
descends one stair at a time (it proceeds from stair LEVEL[j] to stair LEVEL[j] − 1
without skipping stairs), at least k + 1 processes have proceeded from stair k + 1 to
stair k.
Among the at least k + 1 processes that are on a stair ≤ k, let pℓ be the last
process that updated its register LEVEL[ℓ] to k + 1 (due to the atomicity of the
base registers, there is such a last process). When pℓ was on stair k + 1 (we then had
LEVEL[ℓ] = k + 1), it obtained at line 4 a set setℓ such that |setℓ| = |{x | levelℓ[x] ≤
LEVEL[ℓ]}| ≥ k + 1 (this is because at least k + 1 processes have proceeded to stair
k + 1 and, as pℓ is the last of them, it read a value smaller than or equal to k + 1 from
its own register LEVEL[ℓ] and the ones of those processes). As |setℓ| ≥ k + 1, pℓ
stopped descending the stairway at line 5, at stair k + 1. It then returned, contradicting
the initial assumption stating that it progressed until stair k.
Lemma 20 If pi halts at stair k, we then have |seti | = k. Moreover, seti is composed
of the processes that are at a stair k′ ≤ k.
Proof Due to Lemma 19, we always have |seti | ≤ LEVEL[i] when pi executes line
5. If it stops, we also have |seti | ≥ LEVEL[i] (test of line 5). It follows that |seti | =
LEVEL[i]. Finally, if k is pi ’s current stair, we have LEVEL[i] = k (definition of
LEVEL[i] and line 2). Hence, |seti | = k.
The fact that seti is composed of the identities of the processes that are at a stair
smaller than or equal to k follows from the very definition of seti (namely, seti =
{x | leveli [x] ≤ LEVEL[i]}), the fact that, for any x, LEVEL[x] ≤ leveli [x], and
the fact that a process never climbs the stairway (it either halts on a stair, line 5, or
descends to the next one, line 2).
Theorem 37 The algorithm described in Fig. 8.14 is a bounded wait-free implemen-
tation of a one-shot immediate snapshot object.
Proof Let us observe that (1) LEVEL[i] is monotonically decreasing, and (2) at any
time, seti is such that |seti | ≥ 1 (because it contains at least the identity i). It follows
that the repeat loop always terminates (in the worst case when LEVEL[i] = 1).
Hence, the algorithm is wait-free. Moreover, pi executes the repeat loop at most n
times, and each computation inside the loop includes n reads of atomic base registers.
It follows that O(n²) is an upper bound on the number of read/write operations
on base registers issued in an invocation of update_snapshot(). The algorithm is
consequently bounded wait-free.
The self-inclusion property is a direct consequence of the way seti is computed
(line 4): trivially, the set {x | leveli [x] ≤ leveli [i]} always contains i.
For the containment property, let us consider two processes pi and pj that stop at
stairs ki and kj , respectively. Without loss of generality, let ki ≤ kj . Due to Lemma
20, there are exactly ki processes on the stairs 1 to ki , and kj processes on the stairs 1 to
kj . As ki ≤ kj and no process backtracks on the stairway (a process proceeds downwards
or stops), the set of kj processes returned by pj includes the set of ki processes
returned by pi .
Let us finally consider the immediacy property. Let us first observe that a process
deposits its value before starting its descent of the stairway (line 1), from which it
follows that, if j ∈ seti , REG[j] contains the value vj deposited by pj . Moreover,
it follows from lines 4 and 5 that, if a process pj stops at a stair kj and
i ∈ setj , then pi stopped at a stair ki ≤ kj . It then follows from Lemma 20 that the
set setj returned by pj includes the set seti returned by pi , from which the
immediacy property follows.
Fig. 8.16 Recursive construction of a one-shot immediate snapshot object (code for process pi )
case, as pk is the last process that wrote into the array REG[x][1..n], it follows
from |viewk | < x that fewer than x processes have written into REG[x][1..n], and
consequently, at most (x − 1) processes invoke rec_update_snapshot(x − 1, −).
End of the proof of claim C.
To prove the termination property, let us consider a correct process pi that
invokes update_snapshot(vi ). Hence, it invokes rec_update_snapshot(n, −). It fol-
lows from Claim C and the fact that at most n processes invoke rec_update_snapshot
(n, −) that either pi stops at that invocation or belongs to the set of at most n − 1
processes that invoke rec_update_snapshot(n − 1, −). It then follows by induction
from the claim that if pi has not stopped during a previous invocation, it is the only
process that invokes rec_update_snapshot(1, −). It then follows from the text of the
algorithm that it stops at that invocation.
The proof of the self-inclusion property is trivial. Before stopping at recursion level
x (line 6), a process pi has written vi into REG[x][i] (line 3), and consequently we
have then (i, vi ) ∈ viewi , which concludes the proof of the self-inclusion property.
To prove the self-containment and immediacy properties, let us first consider the
case of two processes that return at the same recursion level x. If a process pi returns
at line 6 of recursion level x, let viewi [x] denote the corresponding value of viewi .
Among the processes that stop at recursion level x, let pi be the last process which
writes into REG[x][1..n]. As pi stops, this means that REG[x][1..n] has exactly x
entries different from ⊥ and (due to Claim C) no more of its entries will be set to
a non-⊥ value. It follows that, as any other process p j that stops at recursion level
x reads x non-⊥ entries from REG[x][1..n], we have viewi [x] = view j [x] which
proves the properties.
Let us now consider the case of two processes pi and p j that return at line 6 of
recursion level x and y, respectively, with x > y; i.e., pi returns viewi [x] while p j
returns view j [y]. The self-containment follows then from x > y and the fact that p j
has written into all the arrays REG[z][1..n] with n ≥ z ≥ y, from which we conclude
that viewj [y] ⊆ viewi [x]. Moreover, as x > y, pi has not written into REG[y][1..n]
while pj has written into REG[x][1..n], and consequently (j, vj ) ∈ viewi [x] while
(i, vi ) ∉ viewj [y], from which the containment and immediacy properties follow.
As far as the number of shared memory accesses is concerned we have the follow-
ing. Let res be the set returned by an invocation of rec_update_snapshot(n, −). Each
recursive invocation costs n + 1 shared memory accesses (lines 3–4).
Moreover, the sequence of invocations, namely rec_update_snapshot(n, −), rec
_update_snapshot(n − 1, −), etc., until rec_update_snapshot(|res|, −) (where
x = |res| is the recursion level at which the recursion stops) contains n − |res| +
1 invocations. It follows that the cost is O(n(n − |res| + 1)) shared memory
accesses.
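The recursive construction of Fig. 8.16 can likewise be sketched in Python. This is a sequential model (concurrency and atomic registers are abstracted away), and the stopping test |view| = x at recursion level x is reconstructed from the proof above:

```python
# Sequential sketch of the recursive one-shot immediate snapshot of Fig. 8.16.
BOT = object()  # stands for ⊥

class RecImmediateSnapshot:
    def __init__(self, n):
        self.n = n
        # one array REG[x][1..n] per recursion level x = n, ..., 1
        self.REG = {x: [BOT] * n for x in range(1, n + 1)}

    def update_snapshot(self, i, v):
        return self._rec(self.n, i, v)

    def _rec(self, x, i, v):
        self.REG[x][i] = v                                  # write at level x
        view = {(j, w) for j, w in enumerate(self.REG[x])
                if w is not BOT}                            # read the level's array
        if len(view) == x:                                  # exactly x entries filled
            return view                                     # stop at this level
        return self._rec(x - 1, i, v)                       # recurse one level down
```

Run sequentially, the model reproduces the expected views: the first invoker descends to level 1 and returns its own pair, and a later invoker stops higher with a strictly larger view.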
8.6 Summary
This chapter was on the notion of a snapshot object. It has shown how such an object
can be implemented in a wait-free manner on top of base read/write registers despite
asynchrony and any number of process crashes. It has also shown how an implementation for a
fixed number of processes can be extended to cope with infinitely many processes. A
wait-free implementation of a multi-writer snapshot object has also been described.
Finally, the chapter has introduced the notion of an immediate snapshot object and
presented both an iterative implementation and a recursive implementation of it.
It is important to insist on the fact that snapshot objects are fundamental objects
in crash-prone concurrent environments. This is because they can be used to record
the last consistent global state of an application (the value deposited by each process
being its last meaningful local state).
This chapter is devoted to the renaming problem. After having presented the problem,
it describes three wait-free implementations of one-shot renaming objects. These
implementations, which are all based on read/write registers, differ in their design
principles, their cost, and the size of the new name space allowed to the processes.
Finally, the chapter presents a long-lived renaming object based on registers stronger
than read/write registers, namely test&set registers.
The underlying system The system that is considered is a static system made
up of n asynchronous sequential processes p1 , . . . , pn . Moreover, any number of
processes may crash. The processes communicate by accessing atomic read/write
registers only. A process participates in a run if it accesses the shared memory at
least once in the corresponding run.
Indexes When considering a process pi , the subscript i is not the identity (name)
of process pi but its index. Indexes are used for addressing only; they cannot be used
for computing new names (this will be precisely defined below in the statement of
the “index independence” property).
Initial names Each process pi has an initial (permanent) name denoted idi , which is
initially known only by itself. Such a name can be seen as a particular value defined
in pi ’s initial context that uniquely identifies it (e.g., its IP address). Hence, for any
process pi we have idi ∈ {1, . . . , N}, where N is the size of the initial name space.
The aim of such an object is to allow the processes to obtain new names in a new name
space whose size M is much smaller than N. Such an object solves the M-renaming
problem. To that end, the object offers the processes a single operation denoted
new_name(), which can be invoked at most once by each process and returns a new
name to the invoking process. The object is formally defined by the following properties:
• Liveness. The invocation of new_name() by a correct process terminates.
• Validity. A new name is an integer in the set [1..M].
• Agreement. No two processes obtain the same new name.
• Index independence. ∀ i, j, if a process whose index is i obtains the new name
v, that process could have obtained the very same new name v if its index had
been j.
The index independence property states that, for any process, the new name
obtained by that process is independent of its index. This means that, from an opera-
tional point of view, the indexes define only an underlying communication infrastruc-
ture, i.e., an addressing mechanism that can be used only to access entries of shared
arrays. Indexes cannot be used to compute new names. This property prevents a
process pi from choosing i as its new name without any communication.
Let p be the number of processes that participate in a renaming execution, i.e., the
number of processes that invoke new_name(). Let us observe that the renaming
problem cannot be solved when M < p. There are two types of adaptive renaming
algorithms:
• Size adaptive. An algorithm is size-adaptive if the size M of the new name space
depends only on p, the number of participating processes. We have then M = f (p),
where f (p) is a function of p such that f (1) = 1 and, for 2 ≤ p ≤ n, p − 1 ≤
f (p − 1) ≤ f (p). If M depends only on n (the total number of processes), the
algorithm is not size-adaptive.
9.1 Renaming Objects 251
A lower bound on the size of the new name space An important theoretical
result associated with the renaming problem in asynchronous read/write systems,
due to M. Herlihy and N. Shavit (1999), is the following one. Except for some
“exceptional” values of n, the value M = 2n − 1 is the lower bound on the size of the
new name space. For the exceptional values of n, which have been characterized by
A. Castañeda and S. Rajsbaum (2008), we have M = 2n − 2 (more precisely, there
is a (2n − 2)-renaming algorithm for the values of n such that the binomial coefficients
in the set { C(n, i) : 1 ≤ i ≤ ⌊n/2⌋ } are relatively prime).
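This characterization can be tested directly with integer arithmetic. The helper below is an illustrative encoding of the stated condition (C(n, i) denotes the binomial coefficient):

```python
# "Exceptional" values of n: those for which the binomial coefficients
# C(n, 1), ..., C(n, ⌊n/2⌋) are relatively prime (gcd equal to 1).
from math import comb, gcd
from functools import reduce

def exceptional(n):
    return reduce(gcd, (comb(n, i) for i in range(1, n // 2 + 1))) == 1

# e.g., n = 6: gcd(6, 15, 20) = 1, so M = 2n - 2 = 10 names suffice,
# while n = 4: gcd(4, 6) = 2, so the lower bound M = 2n - 1 = 7 applies.
print([n for n in range(2, 13) if exceptional(n)])   # → [6, 10, 12]
```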
This means that M = 2p − 1 is a lower bound for size-adaptive algorithms
(in that case, there is no specific value of p that allows for a lower bound smaller
than 2p − 1). Consequently, the use of an optimal size-adaptive algorithm means
that, if “today” p processes acquire new names, their new names belong to the
interval [1..2p − 1]. If “tomorrow” p′ additional processes acquire new names,
these processes will have their new names in the interval [1..2p′′ − 1], where
p′′ = p + p′.
The price due to the communication by read/write registers only The lower
bound M = 2n − 1 (or M = 2n − 2) for implementations which are not size-adaptive
or M = 2p − 1 for implementations which are size-adaptive defines the price that has
to be paid by any wait-free implementation of the renaming problem when processes
communicate by accessing read/write registers only.
This means that, when considering wait-free implementations of optimal size-
adaptive M-renaming (i.e., when considering M = 2p − 1, where p is the number
of participating processes), while only p new names are actually needed, obtaining
them requires a space of size 2p − 1 in which p − 1 new names will never be used and
it is impossible to know in advance which of these new names will not be used. This
intrinsic uncertainty is the price to pay to obtain wait-free implementations based on
read/write atomic registers only.
It follows that, if one wants to design an M-renaming object where p ≤ M <
2p − 1, one has to consider a system with operations which are computationally
stronger than simple read/write atomic registers (e.g., test&set(); see Sect. 9.7).
252 9 Renaming Objects from Read/Write Registers Only
In the long-lived renaming problem, a process can repeatedly acquire a new name
and then release it. (Long-lived renaming can be useful in systems in which processes
acquire and release identical resources.)
So, a long-lived renaming object offers two operations: new_name(), which
allows a process to acquire a new name; and release_name(), which allows it to
release the new name it has previously acquired.
While nearly all this chapter is devoted to one-shot renaming, its last section
presents a simple long-lived renaming object based on test&set registers.
The aim of this section is to give an intuition of the difficulty of the renaming problem
and its underlying algorithmic principles. To that end we use a simple example.
A simple example with three scenarios Let us consider a system with two asyn-
chronous crash-prone processes p and q that want to acquire new names. They have
to coordinate to ensure they do not choose the same new name.
To coordinate, they share two single-writer/multi-reader registers X[1] and X[2].
Only p can write into X[1], while only q can write into X[2]. Both of them can read
X[1] and X[2]. Let us consider only one communication exchange, namely p writes
to X[1], then reads X[2], and similarly q first writes to X[2], then reads from X[1].
Initially processes are identical except for their initial names, so the only useful thing
to communicate to the other process is the initial name. There are essentially three
scenarios:
• Scenario 1. In this scenario, process p writes (e.g., its initial name) to X[1] to
inform q that it wants to acquire a new name, but when p reads the shared register
X[2], q has not yet written it (e.g., because it is slow). Hence, p does not “see” that
q is competing for a new name. Differently, when q reads X[1], it “sees” that p is
competing for a new name.
• Scenario 2. This scenario is the same as the previous one, except that p and q are
inverted. Hence, in this scenario, q does not “see” that p is competing for a new
name, while p sees that q is competing for a new name.
• Scenario 3. In this symmetric scenario, concurrently p writes into X[1] while q
writes into X[2], and then each of them discovers that the other one is competing.
Here, each process “sees” that the other one is competing for a new name.
These three possible scenarios are represented in a single graph in Fig. 9.1. Each
vertex corresponds to the final state of a process, in one of the scenarios. The sce-
narios are represented with edges. Two vertices belong to an edge if and only if the
corresponding final states appear in the same scenario.
(Fig. 9.1: a path of four vertices whose labels are, from left to right: “p sees only
itself”, “q sees both p and q”, “p sees both p and q”, and “q sees only itself”.)
The difficulty comes from the fact that in scenario 1, q does not know if p sees it
or not. More explicitly, q cannot distinguish scenario 1 and scenario 3. A symmetric
situation occurs for p which cannot distinguish scenario 2 and scenario 3; it is in the
same state in both; that is, if p is going to choose a new name in this state, it will have
to be the same name for both scenario 2 and scenario 3. Hence, we could label the
corresponding vertex in the graph with that decision. The graph of Fig. 9.1 represents
the global structure of these indistinguishability relations: in this example, a simple
path of length 3.
How to address the problem In order to think about the design of an algorithm,
let us assume that, whenever a process does not see the other process (because it has
crashed or is very slow), it chooses the new name 1. This can be done without loss
of generality: as the space of initial names is large, for every algorithm there exist
two processes each of which picks the same new name when it does not see the
other (recall that processes are initially identical except for their initial names, so a
process's behavior in such a solo scenario depends only on its initial name).
Consequently, p chooses the new name 1 in scenario
1 and q chooses the new name 1 in scenario 2. Hence, the end vertices of the graph
in Fig. 9.1 may be labeled 1.
Let us now look at scenario 3. Process q sees p and is aware that p may have
not seen it (this is because q cannot distinguish scenario 1 and scenario 3). To avoid
conflict (in case we are in scenario 1 in which case p chooses new name 1), q chooses
new name 2. In that case, p (that does not know if the real scenario is scenario 2 or
scenario 3) has no choice: it has to choose the new name 3 to ensure that no two
processes have the same new name.
This simple observation shows that the renaming problem can be solved for two
processes with size of the new name space equal to 3. Let us observe that scenario 4, in
which no process sees the other one, cannot happen. This is due to the fact that processes
communicate by writing and reading a shared memory made up of atomic registers,
and each of them writes the shared memory before reading it. The corresponding
algorithm is as follows. When p or q wants to acquire a new name, it first deposits
its initial name into the shared memory. If it sees only itself, it takes the new name
1. Otherwise, if its initial name is greater than the one of the other process, it takes
the new name 2, and if it is smaller, it takes the new name 3. (This simple algorithm
will be extended to any number of processes in Sects. 9.5 and 9.6.)
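The three scenarios and the resulting name choices can be replayed in a few lines of code (a minimal sketch in Python, not the book's notation; the interleaving of the write/read steps is fixed by hand to reproduce each scenario):

```python
def decide(my_name, other_name_seen):
    """Name choice sketched above: 1 if alone, else 2/3 by comparing initial names."""
    if other_name_seen is None:        # the process "sees" only itself
        return 1
    return 2 if my_name > other_name_seen else 3

def run(schedule, name_p, name_q):
    """Replay one scenario; steps: 'wp'/'wq' write X[1]/X[2], 'rp'/'rq' read the
    other register. Returns the pair (new name of p, new name of q)."""
    X = {1: None, 2: None}             # X[1] written only by p, X[2] only by q
    seen = {}
    for step in schedule:
        if step == "wp": X[1] = name_p
        elif step == "wq": X[2] = name_q
        elif step == "rp": seen["p"] = X[2]
        else:            seen["q"] = X[1]     # step "rq"
    return decide(name_p, seen["p"]), decide(name_q, seen["q"])

assert run(["wp", "rp", "wq", "rq"], 7, 4) == (1, 3)   # scenario 1
assert run(["wq", "rq", "wp", "rp"], 7, 4) == (2, 1)   # scenario 2
assert run(["wp", "wq", "rp", "rq"], 7, 4) == (2, 3)   # scenario 3
```

In each of the three scenarios the two chosen names are distinct and belong to [1..3], as argued above.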
Could it be possible to solve the problem for two processes with two new names
only? The previous discussion shows that the answer is “no”, if each process is limited
to a single communication round (during which it writes and then reads). What if
processes are not restricted to one communication round? Perhaps surprisingly, the
answer remains “no”. This is because the two endpoints of the uncertainty graph
(Fig. 9.1) always remain connected. These two endpoints represent the extreme cases
in which a process does not see the other one; in each of them, the corresponding
process consequently has to choose the new name 1. It would then be impossible for
p and q to pick only the new names 1 and 2 at the internal vertices, because an edge
whose two endpoints carry the same new name would be unavoidable.
(Figure: the diagonal assignment of names to the splitters of the grid; the entry at
row r and column c is the name assigned to the splitter at that position.)

          c = 1   c = 2   c = 3   c = 4   c = 5
  r = 1     1       3       6      10      15
  r = 2     2       5       9      14
  r = 3     4       8      13
  r = 4     7      12
  r = 5    11
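Reading the table, the splitter at row r and column c lies on diagonal d = r + c − 1, and its name is d(d + 1)/2 − (r − 1) (a closed form inferred from the table entries; the check below is illustrative Python, not part of the book's algorithms):

```python
def splitter_name(r, c):
    """Name of the splitter at row r, column c of the triangular grid."""
    d = r + c - 1                       # index of the diagonal containing (r, c)
    return d * (d + 1) // 2 - (r - 1)   # names on diagonal d are d(d-1)/2+1 .. d(d+1)/2

# First row of the table: the triangular numbers 1, 3, 6, 10, 15.
assert [splitter_name(1, c) for c in range(1, 6)] == [1, 3, 6, 10, 15]
# First column: 1, 2, 4, 7, 11.
assert [splitter_name(r, 1) for r in range(1, 6)] == [1, 2, 4, 7, 11]
# The triangular grid for p = 5 uses exactly the names 1..15 = 1..p(p+1)/2.
assert sorted(splitter_name(r, c) for r in range(1, 6)
              for c in range(1, 7 - r)) == list(range(1, 16))
```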
3–6). It follows from the properties of the splitter objects that, if p processes invoke
SP[1, 1].direction(), each visits at most p splitters before stopping.
Let us remark that the initial name idi of pi is used only to distinguish two processes
inside the underlying splitter objects.
Theorem 39 The implementation described in Fig. 9.3 is a wait-free size-adaptive
implementation of an M-renaming object where M = p(p+1)/2 and p is the number
of participating processes. Moreover, this implementation requires O(p) accesses to
underlying atomic read/write shared registers and is consequently time-optimal.
Proof It follows from the properties of the splitter objects and their grid structure
that a process progresses along a path of length at most p before returning the value
stop and stopping. It follows that any invocation of new_name() by a correct process
terminates. Moreover, as no more than one process stops at each splitter, and any two
splitters have different names, it follows that no two processes can obtain the same
new name.
The validity property (the new names belong to the interval [1..p(p + 1)/2])
follows directly from the fact that the static assignment of names to splitters is a
diagonal assignment. The index independence property is trivially satisfied: indexes
are used neither in the code of new_name() nor in the code of direction().
Finally, as each invocation of direction() issues at most four accesses to atomic
read/write registers and a process travels along a path of at most p splitters, it fol-
lows that an invocation of new_name() issues at most 4p accesses to atomic reg-
isters. Hence, O(p) is an upper bound on the time complexity of an invocation of
new_name() (which is clearly optimal for p participating processes as each partici-
pating process needs to access the underlying shared memory at least once).
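A working sketch of this grid-based renaming can be written with threads (a hedged Python illustration, not the book's figure: thread scheduling plays the role of asynchrony, attribute reads/writes stand in for atomic registers, and the splitter follows the classic two-register code):

```python
import threading

class Splitter:
    """Classic splitter built from two read/write registers: among the processes
    that enter it, at most one obtains 'stop', not all obtain 'right', and not
    all obtain 'down'."""
    def __init__(self):
        self.last = None        # register written by every arriving process
        self.closed = False     # "door", closed by the first process past it

    def direction(self, ident):
        self.last = ident
        if self.closed:
            return "right"
        self.closed = True
        if self.last == ident:
            return "stop"
        return "down"

def new_name(ident, grid):
    """Traverse the grid from (1, 1); a process that obtains 'stop' at row r,
    column c takes that splitter's (diagonally assigned) name."""
    r = c = 1
    while True:
        move = grid[(r, c)].direction(ident)
        if move == "stop":
            d = r + c - 1                        # diagonal of splitter (r, c)
            return d * (d + 1) // 2 - (r - 1)    # its statically assigned name
        if move == "down":
            r += 1
        else:
            c += 1

def rename_all(idents):
    p = len(idents)
    # a triangular grid of splitters suffices: a process visits at most p of them
    grid = {(r, c): Splitter() for r in range(1, p + 1)
                               for c in range(1, p + 2 - r)}
    out, lock = [], threading.Lock()
    def worker(i):
        n = new_name(i, grid)
        with lock:
            out.append(n)
    ts = [threading.Thread(target=worker, args=(i,)) for i in idents]
    for t in ts: t.start()
    for t in ts: t.join()
    return out

names = rename_all(range(101, 109))      # p = 8 participating "processes"
assert len(set(names)) == 8              # no two equal new names
assert all(1 <= n <= 8 * 9 // 2 for n in names)   # names in [1 .. p(p+1)/2]
```

Whatever the interleaving produced by the thread scheduler, the splitter properties guarantee distinct names within [1..p(p + 1)/2].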
The principle that underlies the algorithm is the following. A new name can be
considered as a slot, and processes compete to acquire free slots in the interval of
slots [1..2p − 1]. After entering the loop, a process pi first updates STATE[i] (line 3)
to announce to all processes its current proposal for a new name (let us note that it
also implicitly announces it is competing for a new name).
Then, thanks to the snapshot() operation on the snapshot object STATE
(line 4), pi obtains a consistent view (saved in the local array statei ) of the sys-
tem global state. Let us note that this view is consistent because it was obtained from
an atomic snapshot operation. Then the behavior of pi depends on the state of the
shared memory it has obtained, more precisely on the value of the predicate
∀j ≠ i : statei [j].prop ≠ propi evaluated at line 5.
• Case 1: the predicate is true. This means that, according to the global state obtained
by pi , no process pj is competing with pi for the new name propi . In that case, pi
considers the current value of propi as its new name and consequently returns it
(line 6).
• Case 2: the predicate is false. In that case, several processes are competing to
obtain the same new name propi . So, pi constructs a new proposal for a new name
and enters the loop again. This proposal is built by pi from the global state of the
system it has obtained and has saved in statei (line 4).
The set set1 (line 7) contains the new name proposals (as known by pi ), while the
set set2 (line 9) contains the initial names of the processes that pi sees as competing
for obtaining a new name.
The determination of a new proposal by pi is based on these two sets: set1 is
used in order not to propose a new name already proposed, while set2 is used to
determine a free slot. This determination is done as follows.
First, pi considers the increasing sequence (denoted free) of the integers that are
“free” and can consequently be used to define new name proposals. This is the
sequence of the increasing positive integers from which the proposals in set1 have
been suppressed (line 8). Then, pi computes its rank r among the processes that
(from its point of view as given by statei [1..n]) want to acquire a new name (lines
9–10). Finally, given the sequence free and r, pi defines its new name proposal as
the rth integer in the sequence free (line 11).
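The computation of lines 7–11 can be illustrated as follows (a Python sketch with hypothetical values: set1 holds the new name proposals known to pi, and set2 holds the initial names of the competing processes, including pi's own name idi):

```python
def next_proposal(set1, set2, my_id):
    """Rank of my_id among the competing initial names (lines 9-10), then the
    r-th integer of the increasing sequence 'free' (lines 8 and 11)."""
    r = sorted(set2).index(my_id) + 1      # rank r of my_id in set2
    free, k = [], 0
    while len(free) < r:                   # enumerate the increasing integers,
        k += 1
        if k not in set1:                  # skipping names already proposed
            free.append(k)
    return free[-1]                        # the r-th free integer

# pi (initial name 17) sees proposals {1, 3} and competitors {5, 17, 42}:
# free = 2, 4, 5, ...; the rank of 17 is 2, so the new proposal is 4.
assert next_proposal({1, 3}, {5, 17, 42}, 17) == 4
```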
very definition of the value p that, when pi has defined its last proposal for its new
name (line 11), at most p − 1 processes have already defined new name proposals.
Hence, when considering the pair (set2, r) defined at lines 9 and 10, the rank of idi in
set2 is at most p (it is p if idi is the greatest initial identity among the p participating
processes). It then follows from (a) the definition of the sequence free (line 8), (b)
r ∈ {1, . . . , p}, and (c) the determination of propi at line 11 that pi proposed as a
new name a value ≤ p + (p − 1), which completes the proof of the validity property.
For the liveness property, let us assume by contradiction that there is a non-empty
subset Q of correct participating processes that do not terminate. Let τ be a time after
which all faulty participating processes have crashed and all correct participating
processes not in Q have terminated. It follows that there is a time τ′ ≥ τ after which
all the processes of Q repeatedly invoke STATE.snapshot() at line 4 and always
obtain the same array of initial names from STATE[1..n].init_id. Consequently, after
τ′, the processes of Q obtain distinct ranks in STATE[1..n].init_id (lines 9–10), each
process always obtaining the same rank. Moreover, let pi be the process of Q which
has the smallest initial name (idi ) among the processes of Q and r the rank of idi in
the array STATE[1..n].init_id.
As, after τ′, all the processes of Q repeatedly execute lines 7–11, there is a
time τ′′ ≥ τ′ such that pi is the only process that proposes propi = z as a new
name, where z is the rth integer in its sequence free (all other processes of Q propose
greater names). Hence, when pi evaluates the predicate ∀j ≠ i : statei [j].prop ≠ propi
(line 5) after τ′′, it finds it satisfied and consequently returns z as its new name (line
6), which contradicts the initial assumption and completes the proof of the liveness
property.
This section extends to any number of processes the simple renaming algorithm for
n = 2 processes sketched in Sect. 9.2.
Let us observe that, if only p processes invoke new_name(n, 1, up), p < n, then all
of them will invoke the algorithm recursively, first with new_name(n−1, 1, up), then
new_name(n − 2, 1, up), etc., until the call new_name(p, 1, up). Only at this point,
when p processes invoke new_name(p, 1, up), does the behavior of a participating
process pi depend on the concurrency pattern (namely, it may or may not invoke the
algorithm recursively, and with either up or down).
Splitter-like behavior associated with SM[x, first, dir] Considering the (at most)
x processes that invoke new_name(x, first, dir), the “splitter” behavior (adapted to
renaming) associated with SM[x, first, dir] is defined by the following properties.
Let x′ = x − 1.
• At most x′ = x − 1 processes invoke new_name(x − 1, first, dir) (line 9). Hence,
these processes will obtain new names in an interval of size (2x′ − 1) as follows:
– If dir = up, the new names will be in the “going up” interval [first..first +
(2x′ − 2)],
– If dir = down, the new names will be in the “going down” interval [first −
(2x′ − 2)..first].
• At most x′ = x − 1 processes invoke new_name(x − 1, last + dir′, dir′) (line 7),
where last = first + dir(2x − 2) (line 4) and dir′ denotes the direction opposite to
dir (with up = +1 and down = −1). Hence, these x′ = x − 1 processes will obtain
their new names in a renaming space of size (2x′ − 1) starting at last − 1 and going
from right to left if dir = up, or starting at last + 1 and going from left to right
if dir = down. Let us observe that the starting name is adjacent to last (and not
last itself) because the slot last is reserved for the new name of the process (if
any) that stops during its invocation of new_name(x, first, dir) (see next item).
• At most one process “stops”, i.e., defines its new name as last = first + dir(2x − 2)
(lines 4 and 6). Let us observe that the only process pk that can stop is the one such
that idk has the greatest value in the array SM[x, first, dir][1..n] (line 5), which
then contains exactly x old names (line 3).
9.5.2 An Example
(Figure: in SM[4, 1, up], p1 and p4 invoke new_name(3, 1, up); they see p = 3
processes and both compute last = 1 + (2 × 3 − 2) = 5, hence their new name
space is [1..5].)
After p4 has obtained its new name, p1 continues its execution and invokes
new_name(2, 4, down)() and computes last = 4 − (2 × 2 − 2) = 2. As, among the
two processes (p1 and p4 ) that access SM[2, 4, down], p1 does not have the greatest
initial name (id1 < id4 ), it invokes new_name(1, 3, up)() and obtains new name 3.
Let us observe that the new name space attributed to the p = 3 processes p1 ,
p3 , and p4 (the only ones that, up to now, have invoked new_name(4, 1, up)()) is
[1..2p − 1] = [1..5].
Finally process p2 invokes new_name() Let us now assume that p2 eventually
invokes new_name(4, 1, up). It sees that p = 4 processes have accessed SM[4, 1, up],
and computes last = 1 + (2 × 4 − 2) = 7. The size of the new name space becomes
consequently [1..2p − 1] = [1..7].
As it does not have the greatest initial name among the four processes, p2 invokes
new_name(3, 6, down), and then invokes recursively new_name(2, 6, down) and
new_name(1, 6, down) and obtains 6 as its new name.
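The arithmetic used throughout this example comes from line 4 of the algorithm, last = first + dir(2x − 2); a short Python check (with up = +1 and down = −1) replays the three computations above:

```python
def last(first, dirn, x):
    # line 4 of Fig. 9.5: last = first + dir * (2x - 2), dir in {+1 (up), -1 (down)}
    return first + dirn * (2 * x - 2)

assert last(1, +1, 3) == 5   # new_name(3, 1, up): name space [1..5]
assert last(4, -1, 2) == 2   # new_name(2, 4, down), invoked by p1
assert last(1, +1, 4) == 7   # new_name(4, 1, up), as seen by p2
```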
This section shows that the recursive algorithm described in Fig. 9.5 is correct, i.e.,
that all correct participating processes obtain a new name in the interval [1..2p − 1]
(where p is the number of participating processes), and no two new names are
identical. Moreover, the process indexes are used only as an addressing mechanism
(index independence).
Notation In the following the sentence “process pi stops in SM[x, f , d]” means that
pi executes line 6 during its invocation new_name(x, f , d).
Remark The proof is based on reasoning by induction. This is a direct consequence
of the recursive formulation of the algorithm. In that sense the proof provides us with
deeper insight into the way the algorithm works.
• Let y be the number of processes for which the predicate |competingj | = x (line 3)
is false when they invoke new_name(x, f , d). We have 0 ≤ y < x. It follows from
the text of the algorithm that these y processes invoke new_name(x − 1, f , d) at
line 9. As y < x, the lemma follows for these invocations.
• Let z be the number of processes for which the predicate |competingj | = x (line 3)
is true when they invoke new_name(x, f , d). We have 1 ≤ z ≤ x and y + z = x.
If one of these z processes pk is such that the predicate of line 5 is true (i.e.,
idk = max(SM[x, f , d])), then pk executes line 6 and stops inside SM[x, f , d].
Let us also note that this is always the case if x = 1. If x > 1, it follows from
y + z = x and y ≥ 0 that z − 1 ≤ x − 1. Then, the other z − 1 processes invoke
new_name(x −1, f , d). As z−1 ≤ x −1, the lemma follows for these invocations.
If the test of line 5 is false for each of the z processes, it follows that the process pk
that has the greatest old name among the x processes that invoke new_name(x, f , d)
is necessarily one of the y previous processes. Hence, in that case, we have y ≥ 1.
As y + z = x, this implies z < x. It then follows that at most z ≤ x − 1 processes
invoke new_name(x − 1, f , d), which concludes the proof of the lemma.
• Let Y be the set of processes pj (with |Y | = y) such that the predicate |competingj |
= p + 1 (line 3) is false. We have 0 ≤ y < p + 1. These processes invoke
new_name(p, f , d), etc., until new_name(y, f , d), and due to the induction assump-
tion, they rename (with distinct new names) in [f ..f + 2y − 2], namely [1..2y − 1],
since f = d = 1.
• Let Z be the set of processes pj (with |Z| = z) such that the predicate |competingj |
= p + 1 (line 3) is true. We have 1 ≤ z ≤ p + 1 and y + z = p + 1. At line 4, each
of these z processes obtains last = f + 2(p + 1) − 2 = f + 2p, i.e., last = 2p + 1
(as f = 1).
– If one of these z processes pk is such that idk = max(SM[p + 1, f , d]) (line 5),
it stops at SM[p + 1, f , d]), obtaining the name res = last = f + 2p = 2p + 1
(as f = 1).
– If no process stops at SM[p + 1, f , d], we have 1 ≤ z ≤ p and 1 ≤ y (this
is because the process with the greatest old name is then necessarily a process
of Y ).
Hence, the z′ = z ≤ p or z′ = z − 1 ≤ p processes that do not stop at
SM[p + 1, f , d] invoke new_name(p, last − 1, down) (the direction being reversed,
as d = up), etc., until new_name(z′ , f + 2p − 1, down). Due to the induction
assumption, these z′ processes rename (with distinct new names) in the interval
[(f + 2p − 1) − (2z′ − 2)..f + 2p − 1] = [2p − (2z′ − 2)..2p].
• Hence, when considering the y + z = p + 1 processes of Y ∪ Z, the y processes
of Y rename with distinct new names in [1..2y − 1] (where 0 ≤ y < p + 1), the z′
processes of Z rename with distinct names in [2p − (2z′ − 2)..2p] (where z′ ≤ z
and y + z = p + 1), and if z = z′ + 1, the remaining process of Z obtains the new
name 2p + 1. The new name space for the whole set of processes Y ∪ Z is
consequently [1..2p + 1].
• It remains to show that a process of Y and a process of Z cannot obtain the same
new name. To that end we have to show that the upper bound 2y − 1 of the new
name space of Y is smaller than the lower bound 2p − (2z′ − 2) of the new name
space of Z. As z′ ≤ z and y + z = p + 1, we have 2(y + z′ ) ≤ 2(y + z) = 2p + 2 <
2p + 3, from which 2y − 1 < 2p − (2z′ − 2) follows.
9.6 Variant of the Previous Recursion-Based Renaming Algorithm
This section shows how “useless” recursive invocations of the previous algorithms
can be saved with the help of immediate snapshot objects (which replace the store-
collect objects).
Eliminate recursive calls Let us consider the case where y < n processes par-
ticipate concurrently in the previous size-adaptive implementation described in
Fig. 9.5. It is easy to see that these processes first invoke new_name(n, 1, 1) and
then, at line 9, recursively invoke new_name(n − 1, 1, 1), new_name(n − 2, 1, 1),
etc., until new_name(y, 1, 1). It is only from this invocation that the processes start
doing
“interesting” work. Hence, the question: Is it possible to eliminate (whatever the
value of y) these useless invocations?
Solving this issue amounts to directing a process to “jump” directly to the invo-
cation new_name(y, 1, 1) such that we have |competingi | = y in order to execute
only lines 4–8 of the algorithm of Fig. 9.5. Interestingly, the property we are looking
for is exactly what is provided by the update_snapshot() operation of the one-shot
immediate snapshot object defined in Sect. 8.5. This operation directs a process to
the appropriate concurrency level as defined by the number of processes that the
invoking process perceives as executing concurrently with it.
The algorithm implementing the operation new_name() The resulting imple-
mentation based on immediate snapshot objects is described in Fig. 9.8. Interestingly,
this algorithm was proposed by E. Borowsky and E. Gafni (1993).
A process pi invokes new_name(ℓ, first, dir), where first = dir = 1 and ℓ is a
list initialized to ⟨n⟩. This list is the recursion parameter. The lists generated by the
recursive invocations of all participating processes define a tree that is the recursion
tree associated with the whole execution. The list ⟨n⟩ is associated with the root of
this tree. These lists are used to address the appropriate entry of the array of one-shot
immediate snapshot objects SM.
Similarly to the implementation described in Fig. 9.5 where each SM[x, f , d] is a
store-collect object, each SM[ℓ] is now a one-shot immediate snapshot object imple-
mented by an array of n SWMR atomic registers (such that only pi can write SM[ℓ][i]).
Moreover, at most s processes invoke the operation SM[ℓ].update_snapshot(), where
s is the integer which is the last element of the list ℓ. More generally, the elements
of the list ℓ indicate the current recursion path.
A process pi first invokes SM[ℓ].update_snapshot(idi ) (line 1). This allows it
to access its concurrency level, thereby skipping all useless recursive invocations.
It then executes only “useful work” (lines 2–7). When considering these lines, the
only difference with respect to the algorithm of Fig. 9.5 lies in the management
of the recursion parameter needed in the case where idi ≠ max(competingi ). The
value of the recursion parameter (which is now an extended list) used at line 6 is
defined from the current recursion parameter (the list ℓ = ⟨n1 , n2 , . . . , nα ⟩ where
n1 = n) and the size of the actual concurrency set competingi . The new list is
ℓ′ = ⟨n1 , n2 , . . . , nα , |competingi |⟩. This value is computed at line 5 where ⊕ is
used to denote concatenation.
Let us observe that the recursive invocations entailed by new_name(⟨n⟩, 1, 1)
issued by a process pi are such that n1 = n > n2 > · · · > nα > |competingi | > 0
(from which it is easy to prove that any invocation new_name(⟨n⟩, 1, 1) always
terminates).
Fig. 9.8 Borowsky and Gafni’s recursive size-adaptive renaming algorithm (code for pi )
Reminder: test&set registers Atomic test&set registers have already been used
in previous chapters. Let us remember that such a register TS can take two val-
ues: 1 (winner value) and 0 (loser value). It can be accessed by two operations
denoted TS.test&set() and TS.reset(), whose sequential specification is as fol-
lows. An invocation of TS.test&set() returns the current value of TS and, whatever
its previous value, sets it to 0 (loser value). An invocation of TS.reset() writes 1
into TS.
Internal representation The perfect long-lived renaming object is made up of an
array of n test&set registers denoted TS[1..n].
Algorithms for the operations new_name() and release_name() The algorithms
implementing these operations are described in Fig. 9.10. It is assumed that a process
invokes release_name(x) only after it has obtained (and not yet released) the new
name x.
When it invokes new_name(), a process executes a for loop (lines 1–4) until it
has obtained a new name. It first checks if the new name y = 1 is available. If
TS[1].test&set() returns 1, the new name y = 1 is available and the process returns
it (lines 2–3). Otherwise, the process proceeds to check whether the new name y = 2
is available, etc., until the new name y = n.
After it has obtained the new name x, a process can release it by invoking
release_name(x), which resets the test&set register TS[x] to the value 1 (line 5).
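These two operations can be sketched in Python as follows, emulating each atomic test&set register with a small lock-protected object (a hedged illustration of the logic just described, not the book's Fig. 9.10 itself):

```python
import threading

class TestAndSet:
    """Emulation of an atomic test&set register (initial value 1 = winner value)."""
    def __init__(self):
        self._lock = threading.Lock()
        self._val = 1
    def test_and_set(self):
        with self._lock:               # return the current value, then set it to 0
            v, self._val = self._val, 0
            return v
    def reset(self):
        with self._lock:               # write back the winner value 1
            self._val = 1

class PerfectRenaming:
    """Long-lived renaming from an array TS[1..n] of test&set registers."""
    def __init__(self, n):
        self.TS = [None] + [TestAndSet() for _ in range(n)]   # 1-based array
    def new_name(self):
        y = 1
        while True:                    # scan TS[1], TS[2], ... until a win
            if self.TS[y].test_and_set() == 1:
                return y
            y += 1
    def release_name(self, x):
        self.TS[x].reset()

R = PerfectRenaming(4)
a, b, c = R.new_name(), R.new_name(), R.new_name()   # names 1, 2, 3
R.release_name(b)
d = R.new_name()                                      # the released name 2 is reused
```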
Theorem 42 The algorithms described in Fig. 9.10 define a wait-free implemen-
tation of a perfect long-lived renaming object.
Proof Let us first observe that, if p processes are participating, it follows from the
fact that the base test&set registers are checked sequentially, first TS[1], then TS[2],
etc., that at most the registers TS[1] until TS[p] are accessed by the p processes. It
follows from this observation that the implementation is such that the new name
space is at most [1..p]. (It can be [1..p′ ], where p′ < p, if new names which have
been released are later reused by other processes.)
As there is no index, the implementation trivially satisfies the index independence
property. The agreement property follows from the fact that, at any time, any test&set
register TS[y] has at most one winner (the process which obtained from it the value
1 and has not yet invoked TS[y].reset()).
The liveness (wait-freedom) property of the operation release_name() follows
trivially from its code. For the operation new_name(), it follows from (a) the fact that
there are at most n participating processes and (b) that the processes are assumed
to release their previous new names before trying to acquire new ones, that when
a process invokes new_name(), there is at least one test&set register TS[y] whose
value is 1 (winner value).
9.8 Summary
The focus of this chapter was on the renaming problem. After having defined the
problem, the chapter has presented several wait-free implementations of renaming
objects in the base asynchronous read/write system. These implementations differ
in their time complexity and the size of the new name space granted to processes. A
perfect long-lived renaming object based on registers stronger than base read/write
registers, namely test&set registers, has also been presented.
A new base object called a reflector, from which an optimal size-adaptive renaming
algorithm can be built, is introduced in [33].
The lower bounds M = 2n − 1, and M = 2n − 2 for an infinite number of
exceptional values of n, are due to M. Herlihy and N. Shavit [145] and A. Castañeda
and S. Rajsbaum [65], respectively.
• The splitter-based renaming algorithm described in Fig. 9.3 is due to M. Moir and
J. Anderson [209]. An assertional proof of this time-adaptive renaming algorithm
can be found in that paper.
• The notion of long-lived renaming is due to M. Moir and J. Anderson [208, 209].
The algorithm described in [32], which is due to H. Attiya and A. Fouren, is both
size- and time-adaptive for the long-lived renaming problem. Let p be the number
of processes that are currently participating (those that have invoked new_name()
and not yet invoked release_name()). The algorithm is such that M = 2p − 1 and
its step complexity is O(p4 ).
• The size-adaptive renaming implementation based on atomic read/write registers
described in Fig. 9.4 is due to H. Attiya and J. Welch [41]. It is an adaptation to the
shared memory context of a message-passing implementation defined in [29]. It is
shown in [103] that this algorithm has runs in which invocations of new_name()
can give rise to an exponential number of shared memory accesses.
• The recursive size-adaptive renaming implementation based on store-collect
objects described in Fig. 9.5 is due to S. Rajsbaum and M. Raynal [229]. It is
inspired by a renaming implementation sketch described in [112] (this paper is
focused on recursion in distributed algorithms).
• The recursive size-adaptive renaming implementation based on immediate snap-
shot objects described in Fig. 9.8 is due to E. Borowsky and E. Gafni [53].
• A generalization of the renaming problem for groups of processes is proposed in
[106] and investigated in [7]. In this variant, each process belongs to a group and
knows the original name of its group. Each process has to choose a new name for
its group in such a way that two processes belonging to distinct groups choose
distinct new names.
• The relations between renaming objects and other objects which are central to dis-
tributed computability such as k-set-agreement [71] have received a lot of attention
(e.g., [18, 19, 107, 111, 113, 154, 160, 215, 216] to cite a few).
This part of the book is made up of a single chapter devoted to a new approach for
designing multiprocess programs, namely the software transactional memory
(STM) approach. The idea that underlies this approach is to free programmers
from “implementation details” associated with synchronization.
Chapter 10
Transactional Memory
This chapter is devoted to software transactional memories (STM). This concept was
first proposed by M. Herlihy and J. Moss (1993), and later refined by N. Shavit and
D. Touitou (1997). The idea is to provide the designers of multiprocess programs with
a language construct (namely, the notion of an atomic procedure called a transaction)
that discharges them from the management of synchronization issues. More precisely,
a programmer has to concentrate her efforts only on defining which parts of processes
have to be executed atomically and not on the way atomicity is realized, this last issue
being automatically handled by the underlying STM system.
This chapter, which is mainly on basic principles of algorithms implementing
STM systems, assumes that the asynchronous processes are reliable (i.e., they never
crash).
Unfortunately, this does not work: the move is not atomic as, between the invoca-
tion of Q1.dequeue() and the invocation of Q2.enqueue(x), an arbitrary number of
accesses to Q1 and Q2 can be issued by other transactions.
A first answer to this problem could be to define a polyadic procedure move_item
(Q1 , Q2 ) as a base atomic operation on queues (the procedure is polyadic because
it operates on two objects—here two queues—at the same time). This ad hoc solution
is not satisfactory for the following reasons:
• It can make the implementation of a queue more difficult (polyadic atomic oper-
ations are more difficult to implement than unary operations).
• It is not scalable. If a procedure has to be executed atomically on a queue and a
stack, where should the corresponding operation be defined? In the definition of
the stack? In the definition of the queue? Moreover, how can this be done if the
procedure that has to appear as being executed atomically involves more than two
base objects, or if this number is not statically defined?
Actually, the part of code that has to be executed atomically is not part of the definition
of the objects but is application-dependent.
A language construct What is needed is a language construct saying that the code
corresponding to some procedure has to appear as having been executed atomically.
This is exactly what the concept of a transactional memory offers to programmers.
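In Python terms, the intended construct can be approximated with a context manager (a deliberately naive sketch: the hypothetical transaction() helper below serializes all transactions with one global lock, which provides the required atomicity but none of the concurrency a real STM aims at):

```python
import threading
from collections import deque
from contextlib import contextmanager

_big_lock = threading.Lock()          # naive stand-in for a real STM runtime

@contextmanager
def transaction():
    """Hypothetical construct: everything inside appears to execute atomically."""
    with _big_lock:
        yield

def move(q1, q2):
    # the two queue operations execute with no interleaved access in between
    with transaction():
        x = q1.popleft()              # q1.dequeue()
        q2.append(x)                  # q2.enqueue(x)

Q1, Q2 = deque([1, 2]), deque()
move(Q1, Q2)
# Q1 is now deque([2]) and Q2 is deque([1])
```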
The keyword transaction means that the associated code has to appear as being
executed as an atomic operation. It is the job of the underlying software transactional
memory (STM) system to ensure this atomicity property.
STM transaction versus database transaction While transactions defined in an
STM system and transactions encountered in a database share the same name, they
are different computer science objects. A main difference lies in the following obser-
vation: the code of an STM transaction can be any code (accessing base atomic
objects), while the code of a database transaction is usually restricted to SQL-like
queries. STM transactions are actually atomic procedures.
The high-level view Moving beyond assembly languages, the design of high-level
programming languages was mainly motivated by hiding “implementation details”
to allow the programmer to concentrate on the solution to his problem and not on
the technicalities of specific machines or low-level machinery. As an example,
garbage collection is now implicit in many programming languages, and the
programmer does not have to worry about it.
Transactional memories constitute a similar effort as far as synchronization is
concerned. The programmer has to state what has to be executed atomically and
does not have to focus on the way this synchronization is realized. This is the job
of the underlying STM system.
(Figure: a process p1 executes the transactions T11 and T12, and a process p2
executes the transactions T21, T22, and T23; each transaction accesses some of the
shared objects O1, …, O5.)
To simplify the presentation, the rest of this chapter considers that the processes do
not contain non-transactional code and that the atomic objects shared by transactions
are MWMR registers.
It is easy to see that, if T1 and T2 are executed in parallel, the final value of X can be
1, 2 or 3. But, as a transaction is an atomic execution unit, the only correct value for
X after both transactions have been executed is the one produced by T1 followed by
T2 or T2 followed by T1 , i.e., the value 3.
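The code of T1 and T2 was lost from this excerpt. A plausible reconstruction, assuming X is initialized to 0, that T1 performs X ← X + 1, and that T2 performs X ← X + 2 (each as a read step followed by a write step), can be checked by enumerating all interleavings:

```python
# Assumed reconstruction of the missing example: X starts at 0,
# T1 does X <- X + 1 and T2 does X <- X + 2, each as read-then-write.
from itertools import permutations

def run(schedule):
    """Execute one interleaving of the two transactions' steps."""
    x = 0          # shared register X
    local = {}     # per-transaction private copy of X
    for (t, step, delta) in schedule:
        if step == "read":
            local[t] = x
        else:                      # write back the locally computed value
            x = local[t] + delta
    return x

T1 = [("T1", "read", 1), ("T1", "write", 1)]
T2 = [("T2", "read", 2), ("T2", "write", 2)]

# Keep only interleavings that preserve each transaction's internal order
finals = set()
for perm in permutations(T1 + T2):
    if perm.index(T1[0]) < perm.index(T1[1]) and perm.index(T2[0]) < perm.index(T2[1]):
        finals.add(run(perm))

print(sorted(finals))   # -> [1, 2, 3]; only 3 corresponds to a serial order
```

The non-serializable interleavings lose one of the two updates, which is exactly why only the value 3 is acceptable.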
This means that, when a speculative execution of a transaction is about to termi-
nate, it is required to check if it can be linearized at some point of the time line. If
it can, it is committed, otherwise it is aborted. Hence, a speculative execution of a
transaction has to return a control value commit or abort that defines its fate. When
a transaction execution is aborted, the STM system may decide to re-execute it or to
notify the corresponding process, which then decides whether or not to re-issue it.
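This abort-and-retry discipline can be sketched as a small driver loop. The function name `execute_once` and the retry bound are assumptions for illustration, not part of the text:

```python
# Hypothetical sketch of a retry policy on abort; `execute_once` is an
# assumed function that speculatively runs the transaction body once
# and returns either "commit" or "abort".
def run_transaction(execute_once, max_attempts=10):
    """Re-execute an aborted transaction until it commits (or give up)."""
    for attempt in range(max_attempts):
        if execute_once() == "commit":
            return attempt + 1      # number of attempts that were needed
    raise RuntimeError("transaction kept aborting")

# A body that aborts twice (e.g., due to conflicts) and then commits
outcomes = iter(["abort", "abort", "commit"])
attempts = run_transaction(lambda: next(outcomes))
print(attempts)   # -> 3
```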
It is important to notice that there is always a price to pay when processes access
shared objects. This price takes the form of either wait statements or re-executions.
In both cases, periods without progress cannot be prevented. This is an inescapable
price that has to be paid when there is synchronization.
An STM system interface for transactions that access MWMR atomic registers pro-
vides each transaction with four operations, denoted beginT(), X.readT(),
X.writeT(), and try_to_commitT(), where T is a transaction and X an atomic
t-object (a MWMR register shared by the transactions).
• beginT() is invoked by T when it starts. It initializes local control variables.
• X.readT() is invoked by the transaction T to read the base object X. That operation
returns a value of X or the control value abort. If abort is returned, the invoking
transaction is aborted (in that case, the corresponding read does not belong to the
read prefix associated with T).
• X.writeT(v) is invoked by the transaction T to update X to the new value v.
That operation returns the control value ok or the control value abort. As in the
operation X.readT(), if abort is returned, the invoking transaction is aborted.
(Fig. 10.3: structure of the execution of a transaction T: an incremental snapshot
(which can abort the transaction if the snapshot is about to become inconsistent),
local computation, and then atomic updates into the shared memory (which can
abort the transaction if the snapshot and the writes cannot appear as being atomic).)
• If a transaction reaches its last statement, it invokes the STM interface operation
try_to_commitT(). That operation decides the fate of T by returning commit or
abort. (Let us notice that a transaction T that invokes try_to_commitT() has not
been aborted during an invocation of X.readT() or X.writeT().)
In the transaction system model considered here, each transaction T uses a local
working space. When T invokes X.readT () for the first time, it reads the value of X
from the shared memory and copies it into its local working space. Later invocations
of X.readT () (if any) use this copy. So, if T reads X and then Y , these reads are
done incrementally, and the state of the shared memory may have been changed in
between by other transactions. Such incremental reads are also called incremental
snapshots.
When T invokes X.writeT(v), it writes v into its working space (and does
not access the shared memory). Finally, if T is not aborted when it executes
try_to_commitT(), it copies the values (if any) of the application registers it has
written from its local working space into the shared memory. (A similar deferred
update model is used in some database transaction systems.)
The corresponding structure of the execution of a transaction is represented in
Fig. 10.3.
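The deferred-update model just described can be sketched as follows. This is a minimal illustration with assumed names, and all validation logic is omitted:

```python
# Minimal sketch (no validation): the first read of X copies it into the
# working space, writes stay local until commit, and commit flushes the
# write buffer into the shared memory.
shared = {"X": 0, "Y": 0}          # the shared MWMR registers

class Transaction:
    def __init__(self):
        self.workspace = {}        # local copies (incremental snapshot)
        self.write_set = set()

    def read(self, name):
        if name not in self.workspace:        # first read: fetch from memory
            self.workspace[name] = shared[name]
        return self.workspace[name]

    def write(self, name, v):
        self.workspace[name] = v              # no shared-memory access here
        self.write_set.add(name)

    def commit(self):
        for name in self.write_set:           # deferred updates
            shared[name] = self.workspace[name]

t = Transaction()
t.write("X", t.read("X") + 1)
shared["Y"] = 5        # another process writes Y: t's buffer is unaffected
t.commit()
print(shared)          # -> {'X': 1, 'Y': 5}
```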
When the atomic application objects are atomic MWMR registers (which is our
case), a transaction that issues only read operations on application objects is called a
read-only transaction. Otherwise, it is called an update transaction. (Hence an update
transaction is either a read/write transaction or a write-only transaction.)
This section presents a simplified version of a lock-based STM system called TL2
(Transactional Locking 2) due to D. Dice, O. Shalev, and N. Shavit (2006). This
STM system satisfies the opacity consistency condition. It is particularly efficient
when there are few conflicts between concurrent transactions.
Underlying system The underlying system provides the processes with atomic
MWMR registers and an atomic fetch&add register. In addition to a read operation,
such a register A allows the processes to invoke the operation fetch&add() which
atomically adds 1 to A and returns the new value of A to the invoking process.
Global control variable An atomic fetch&add register denoted CLOCK, initialized
to 0, is used as a logical clock to measure the progress of the system, counted as the
number of transactions that have been committed so far.
As we will see, the abort/commit decision for a transaction T will involve this clock
and the dates associated with the application registers accessed by transaction T .
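Such a fetch&add register can be sketched as follows. A lock stands in for the atomic hardware primitive, and the class and method names are assumptions:

```python
# Sketch of the fetch&add logical clock: atomically add 1 and return the
# *new* value, as in the text (a lock emulates the atomic primitive).
import threading

class FetchAddRegister:
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self):
        with self._lock:
            self._value += 1
            return self._value

    def read(self):
        return self._value

CLOCK = FetchAddRegister()
dates = [CLOCK.fetch_and_add() for _ in range(3)]
print(dates)   # -> [1, 2, 3]: no two commits can obtain the same date
```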
Internal representation of application registers and associated control variables
As already indicated, the application registers are the MWMR atomic registers
defined at the application level and accessed by the transactions.
At the system level, an implementation MWMR register XX located in the shared
memory is associated with each application MWMR register X . Such an implemen-
tation register XX has two fields: XX.value, which contains the value of the appli-
cation register X , and XX.date, which contains the date of its last update. A lock
is also associated with each implementation register XX.
As far as notations are concerned, lc(XX) denotes a copy of XX in the local
working space of a process.
Final validation test Let us consider a transaction T that has not been aborted
during its incremental read phase, has done local computation, and wants to write
new values into some application register Z .
For T to appear as having been executed atomically at the logical date birthdate(T ),
the date that T has to associate with the values it wants to write has to be equal to
birthdate(T ) (this date is obtained after increasing the clock by 1).
But it is possible that another transaction has modified an application register X
after it was read by T. If this is the case, the date associated with the new value
of X will be greater than birthdate(T). Hence, the reads of application registers
issued by T would appear as being done at the date birthdate(T) while its writes of
new values would appear as being done at a date > birthdate(T). It follows that T
cannot appear as having been executed atomically. When this occurs, T is aborted
(Fig. 10.5).
From an operational point of view, a transaction T that executes try_to_commitT()
is required to read again from the shared memory the dates of the application registers
X it has previously read during the incremental read phase. If there is an application
register X such that XX.date ≥ birthdate(T), the transaction T is aborted.
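The two date-based tests, the incremental read validation and the final validation, can be sketched as follows. This is a simplified illustration with assumed names; locks and the commit itself are omitted:

```python
# Sketch of TL2's two validation tests. An implementation register is a
# {"value": ..., "date": ...} pair; a transaction carries its birthdate.
def read_validation(xx_date, birthdate):
    """Incremental read: abort if X was written at or after the birthdate."""
    return "ok" if xx_date < birthdate else "abort"

def final_validation(read_set, registers, birthdate):
    """try_to_commit: re-read the dates of every register read by T."""
    for x in read_set:
        if registers[x]["date"] >= birthdate:
            return "abort"          # some value read by T was overwritten
    return "commit"

registers = {"X": {"value": 0, "date": 3}, "Y": {"value": 0, "date": 1}}
# A transaction born at date 5 reads X (fine: 3 < 5) ...
assert read_validation(registers["X"]["date"], birthdate=5) == "ok"
# ... but X is then overwritten at date 7 before T tries to commit
registers["X"]["date"] = 7
print(final_validation({"X", "Y"}, registers, birthdate=5))   # -> abort
```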
Remark It is important to see that the consistency predicates which are used to
satisfy opacity are based on comparisons of dates defined from the logical time
produced by the atomic register CLOCK. This logical time is sufficient to obtain
opaque implementations, but it is not necessary.
The algorithms implementing the operations of the STM interface are described in
Fig. 10.6. Let pi be the process that has issued the transaction T .
The operation beginT () When a process pi issues a transaction T , it first invokes
beginT () to assign a birth date to T .
The operation X.readT() This operation is invoked by pi each time the transaction
T issues a read of the application register X. If there is a local copy lc(XX) of X, the
value of X stored in lc(XX).value is returned (lines 2–3). Otherwise, a local copy
is created and initialized to the value of XX read from the shared memory (line 4).
Finally, the validation test associated with the incremental read of X is done (line 5).
If it is successful, the current value of X is returned and X is added to lrsT (line 6).
Otherwise, T is aborted (line 7).
It is important to notice that this implementation of X.readT() satisfies the read
invisibility property: no information is written into the shared memory by an
invocation of X.readT().
The operation X.writeT(v) This operation is invoked each time the transaction
T, issued by pi, writes a new value into the application register X. Actually, this
algorithm does not write the new value into the shared memory but only into the
local copy lc(XX).value (line 11). Such a local copy is first created if none exists
(line 10). Moreover, the set lwsT is updated accordingly (line 11). It is easy to see
that no invocation of X.writeT(v) entails the abortion of T.
The operation try_to_commitT() This operation is called by pi when T has
reached its last statement without having been previously aborted. This means that
all the incremental reads issued by T are consistent with the birth date of T. As we
have seen, it remains to check whether the writes of the application registers issued
by T can appear as having been atomically done at the date birthdate(T) (i.e.,
together with the incremental reads issued by T).
When it executes try_to_commitT(), pi first locks both the application registers
that have been read by T and the ones that will be written by T (line 13). (In
order to prevent deadlocks, it is assumed that these lockings are done sequentially,
according to a predefined total order.)
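Locking in a predefined total order can be sketched as follows; taking the lexicographic order on register names as the canonical order is an assumed choice:

```python
# Deadlock avoidance by always acquiring locks in one global order
# (here, assumed to be the lexicographic order on register names).
import threading

locks = {name: threading.Lock() for name in ("X", "Y", "Z")}

def lock_all(names):
    """Acquire the locks sequentially, always in the same canonical order."""
    acquired = []
    for name in sorted(names):      # every transaction uses the same order
        locks[name].acquire()
        acquired.append(name)
    return acquired

def unlock_all(names):
    for name in names:
        locks[name].release()

order = lock_all({"Z", "X"})        # any transaction locks X before Z
print(order)                        # -> ['X', 'Z']
unlock_all(order)
```

Because every transaction acquires its locks along the same total order, no cycle of waiting transactions can form.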
Then, pi executes the final validation test (as explained above). To that end, for
each X ∈ lrsT, it reads the current value of the implementation register XX and
aborts the transaction if there is some X such that XX.date ≥ birthdate(T) (line
15). If T is aborted, all the locks are released.
If the final validation test succeeds, the transaction T can be committed. But before
releasing the locks and committing T (line 19), pi has first to compute the new date
that has to be associated with all the writes issued by T (line 17). It can then issue
the corresponding writes into the shared memory (line 18).
On the use of locks It is important to notice that locks are used only in the operation
try_to_commitT (). Hence, an implementation register can be read by a transaction
while it is locked by another transaction. The use of the fetch&add register CLOCK
ensures that no two committed transactions associate the same date with their writes.
The locking of all the registers accessed by a transaction T (whose names are saved
in lrsT ∪ lwsT) ensures that no date of a register XX can be modified while pi is
checking the final validation test (lines 14–18), thereby ensuring that, from an external
observer's point of view, all the writes into the shared memory of the registers in
lwsT (line 18) appear as having been executed (a) at the date birthdate(T) and (b)
simultaneously with the reads of the registers in lrsT.
Remark This presentation of TL2 does not take into account all its features. As
an example, if at line 13 a lock cannot be immediately obtained, TL2 aborts the
corresponding transaction. This can allow for a more efficient implementation.
If a transaction T does not write application registers, its STM interface can be
simplified as shown in Fig. 10.7. Without loss of generality, it is assumed that a
read-only transaction reads an application register at most once.
A local copy no longer plays the role of a local cache memory. When T invokes
X.readT (), XX is read from the shared memory (line 2) and the associated incremental
read validation test is executed (line 3). This modified version of X.readT () satisfies
the read invisibility property.
As T does not write application registers, the lines 17–18 of try_to_commitT ()
in Fig. 10.6 become useless. Moreover, the final validation test of line 15 of Fig. 10.6
is now done each time pi executes X.readT (). It follows that locks are no longer
needed in the operation try_to_commitT (), which boils down to a simple invocation
of the statement return(commit) (line 4).
Interestingly, these algorithms satisfy a strong read invisibility property. Not only
does X.readT () guarantee that there is no visible read, but even try_to_commitT ()
does not reveal which registers have been read by a read-only transaction.
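The read-only fast path can be sketched as follows (assumed names); the per-read validation replaces the final validation test, so try_to_commitT() has nothing left to check:

```python
# Sketch of the read-only fast path: each read re-validates immediately,
# so try_to_commit can always return commit without taking any lock.
def read_only_read(xx, birthdate):
    """Return X's value, or abort if X was written after T's birthdate."""
    if xx["date"] >= birthdate:
        return "abort"
    return xx["value"]

def read_only_try_to_commit():
    return "commit"                 # nothing left to validate

xx = {"value": 42, "date": 2}
v = read_only_read(xx, birthdate=5)   # date 2 < 5: the read is consistent
print(v, read_only_try_to_commit())   # -> 42 commit
```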
This section presents an STM system based on multi-versioning that satisfies the
opacity property. Multi-version means that an application MWMR atomic register is
implemented by a list of versions containing its successive values. This implementa-
tion, called JVSTM (Java STM), is due to J. Cachopo and A. Rito-Silva (2006). As
for TL2, the version presented here has been simplified for pedagogical purposes.
Interestingly, JVSTM provides a garbage collection mechanism that discards the
versions older than the oldest transaction still present in the system. This point is
not addressed here.
A description of the X.readT (), X.writeT (v), and try_to_commitT () operations for
an update transaction is given in Fig. 10.9. As in the previous section, pi is the process
that has issued the transaction T .
The operation beginT() Similarly to TL2, this operation consists in computing
the birth date of the transaction T.
If a transaction T does not modify the application registers, the algorithms imple-
menting the interface operations of the STM system can be simplified as shown in
Fig. 10.10. Without loss of generality, it is assumed that a read-only transaction reads
an application register at most once.
Each time a read-only transaction T reads an application register X, it obtains
(from the shared memory) the most recent value which is at least as old as the
transaction's birth date (lines 2–4). A read-only transaction is not required to manage
a local read set lrsT.
As in TL2, the operation try_to_commitT () always returns commit. This is
because a read-only transaction can never abort. This is due to the presence of multiple
versions which always allow a read-only transaction T to obtain mutually consistent
versions of the application registers it reads.
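The multi-version read can be sketched as follows. This is a minimal illustration; the version list of (date, value) pairs is an assumption about the representation, and garbage collection is ignored:

```python
# Sketch of a multi-version register: the version list keeps the successive
# committed values with their dates, and a read-only transaction born at
# date d takes the most recent version whose date is <= d. It therefore
# always finds a mutually consistent set of values and never aborts.
versions = [(0, "a"), (3, "b"), (7, "c")]   # successive committed values

def mv_read(versions, birthdate):
    """Most recent version not more recent than the transaction's birthdate."""
    for date, value in reversed(versions):
        if date <= birthdate:
            return value
    raise LookupError("version has been garbage-collected")

print(mv_read(versions, 5))   # -> 'b': the value committed at date 3
print(mv_read(versions, 9))   # -> 'c'
```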
Virtual world consistency While keeping its spirit, virtual world consistency is
a weaker consistency condition than opacity. It states that (a) no transaction (com-
mitted or aborted) reads values from an inconsistent global state, (b) the committed
transactions are linearizable (they can be totally ordered from an external observer's
point of view), and (c) each aborted transaction (reduced to a read prefix as defined
in Sect. 10.2.2 where opacity was introduced) reads values that are consistent with
respect to its causal past only.
This section presents an STM system that satisfies the virtual world consistency
condition. This system, due to D. Imbs and M. Raynal (2009), is based on vector
clocks and guarantees the read invisibility property.
While n denotes the number of processes, m is used to denote the number of
application MWMR atomic registers.
Internal representation of the application MWMR atomic registers Each appli-
cation MWMR atomic register X is represented by an implementation MWMR
atomic register XX made up of two fields:
• XX.value contains the current value of X .
• XX.depend[1..m] is a vector clock which tracks value dependencies. More pre-
cisely,
– XX.depend[X ] contains the sequence number associated with the current value
of X , and
– XX.depend[Y ] contains the sequence number associated with the value of Y on
which the current value of X depends.
Such a vector is called a vector clock because, sequence numbers being consid-
ered as logical dates associated with updates, depend[1..m] captures the causal
dependences among these updates.
Moreover, a starvation-free lock is associated with each application register X .
Local control variables associated with processes and transactions A process
issues transactions sequentially. So, when a process pi issues a new transaction,
that transaction has to work with object values that are not older than the ones used
by the previous transactions issued by pi . To that end, pi manages a local vector
p_dependi [1..m] such that p_dependi [X ] contains the sequence number of the last
value of X that (directly or indirectly) is known by pi .
In addition to the previous array, a process pi manages the following local vari-
ables whose scope is the one of the transaction T it is currently executing:
• t_dependT [1..m] is a copy of p_dependi [1..m] which is used instead of it during
the speculative execution of T (this is because p_dependi [1..m] must not be
modified if T aborts).
• lrsT and lwsT are the read set and write set used when pi executes T.
• Finally, for each application register X accessed by T , pi manages a local copy
denoted lc(XX) of the implementation register XX.
This section presents the algorithms implementing the four STM operations
beginT(), X.readT(), X.writeT(), and try_to_commitT() (Fig. 10.12) and some of
their properties. When the control value abort is returned, it carries a tag (1 or 2)
which indicates the cause of the abortion of the corresponding transaction. This tag
is used only for pedagogical purposes.
The operation beginT () This operation is a simple initialization of the local control
variables associated with the current transaction T . Let us notice that t_dependT is
initialized to p_dependi to take into account the causal dependencies on the values
accessed by the committed transactions previously issued by pi . This is due to the
fact that a process pi issues transactions one after the other and the next one inherits
the causal dependencies created by the previous ones.
The validation test for an incremental read and the operation X.readT() This
operation returns either a value of X or the control value abort (in which case T is
aborted). If (due to a previous read of X) there is a local copy, its value is returned
(lines 2 and 10).
If the call X.readT () is its first read of X , pi first builds a copy lc(XX) from
the shared memory (line 3), and updates accordingly its local control variables lrsT
and t_dependT [X ] (line 4). Hence, t_dependT [X ] contains the sequence number
associated with the last value of X which was saved in lc(XX).value.
As the reads are incremental ( pi does not read in one atomic action all the applica-
tion registers it wants to read), pi has to check that (a) the value it has just read from
the shared memory and stored in lc(XX).value and (b) the values of the application
registers Y it has previously read can belong to a consistent global state.
The corresponding incremental read test is done as follows. Let Y be an object
that was previously read by T (hence Y ∈ lr sT ). Let us observe that the sequence
number of the value of Y read by T is kept in t_dependT [Y ]. If the value of X just
read by T depends on a more recent value of Y , the values of X and Y are mutually
inconsistent. This is exactly what is captured by the predicate
∃ Y ∈ lrsT : t_dependT[Y] < lc(XX).depend[Y]
Fig. 10.12 An STM system guaranteeing the virtual world consistency condition
If this predicate is satisfied (the values previously read by T constituting its current
“incremental snapshot”), the addition of the value of X obtained from the shared
memory would make this snapshot inconsistent.
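The incremental read predicate above can be written as a small function. The names follow the text; representing the vectors by dictionaries indexed by register name is an implementation assumption:

```python
# The incremental read test: the value of X just read must be rejected if
# it depends on a version of some already-read Y that is more recent than
# the version the transaction actually read.
def inconsistent(lrs, t_depend, xx_depend):
    """True iff there exists Y in lrs with t_depend[Y] < xx_depend[Y]."""
    return any(t_depend[y] < xx_depend[y] for y in lrs)

t_depend = {"Y": 4, "Z": 2}     # sequence numbers of the values T has read
xx_ok  = {"Y": 4, "Z": 1}       # X depends on the same version of Y: fine
xx_bad = {"Y": 6, "Z": 1}       # X depends on a newer version of Y: abort

print(inconsistent({"Y", "Z"}, t_depend, xx_ok))    # -> False
print(inconsistent({"Y", "Z"}, t_depend, xx_bad))   # -> True
```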
The operation X.writeT(v) The algorithm implementing this operation is very
simple. If there is no local copy lc(XX) of the implementation register XX associated
with X, one is created (line 11). Then, the value v is written into lc(XX).value and
the control variable lwsT is updated (line 12).
The operation try_to_commitT () When a process pi executes try_to_commitT (),
it first locks all the registers accessed by T (line 14); those are the application registers
whose names have been saved in lrsT ∪ lwsT. This locking is done according to a
canonical order (e.g., on the register names) to prevent deadlock and starvation. If T
is a read-only transaction (that has read more than one application register), it can be
committed if its incremental snapshot is still valid, i.e., the values it has read from
the shared memory have not yet been overwritten. This is what is captured by the
predicate
∀ Z ∈ lrsT : t_dependT[Z] = Z.depend[Z]
whose negation is used at line 15. If this predicate is false, the transaction T is aborted
after it has released all its locks. If it is true, the transaction can appear as if both its
reads and its writes (if any) had been simultaneously executed just before the test of
line 15 was evaluated.
If the transaction T is a write-only transaction (i.e., lrsT = ∅, line 15), it follows
from the locks on the application registers of lwsT that the transaction T can write
the new values of the registers in lwsT, with the associated data dependencies
(captured in t_dependT), into the shared memory (line 20). Due to the fact that all
the registers in lwsT are locked when they are written, these writes appear as being
executed simultaneously. Before executing these writes, T has to update the sequence
number of each of these registers X so that the dependency vectors have correct
values (line 19).
If the transaction T is neither read-only nor write-only it can be committed only
if all its read and write operations could have been executed simultaneously. As
we have seen, this is ensured by the net effect of the predicate used at line 15 for
the simultaneity of the reads and the use of locks on all application registers in
lrsT ∪ lwsT.
Let us finally observe that, when a transaction returns the control value commit
(line 24), the dependency vector of the associated process pi has to be updated
accordingly (line 23) to take into account the new data dependencies created by the
newly committed transaction T .
A remark on (abort, 2) If (abort, 2) is returned to a read-only transaction T, the
values it has incrementally read define a consistent snapshot, but this snapshot cannot
be totally ordered (with certainty) with respect to the committed transactions. In that
case, all the read operations issued by the aborted transaction T belong to its read
prefix, and this read prefix is consistent with respect to the causal past of T .
A remark on write-only transactions and independent transactions As a write-
only transaction T is such that lrsT = ∅, it is easy to see that, due to the fact that a
transaction that executes try_to_commitT() can be aborted only at line 16, write-only
transactions cannot be aborted.
Two transactions T1 and T2 are independent if (lrsT1 ∪ lwsT1) ∩ (lrsT2 ∪ lwsT2) =
∅. It follows from the code of try_to_commitT() that independent concurrent trans-
actions can commit independently.
A remark on read-only transactions A simple modification of the previous algo-
rithms provides the following additional property: a read-only transaction T that
reads a single object X is never aborted. T is then only made up of X.readT (), and
this operation is implemented as follows:
When the implementation registers are not atomic When the implementation
MWMR registers are not atomic, there is a very simple way to enrich the STM
of Fig. 10.12 so that it works correctly; namely, using the locks associated with the
application registers, it is sufficient to replace the statement “lc(XX) ← XX” at line 3
by “lock X ; lc(XX) ← XX; unlock X ”.
Depending on implementation choices, concurrent transactions that try to access
X when it is locked could either abort or wait. A read operation does not modify the
value and holds the lock only for a short time. So, if X is locked because of a read
operation, it could be beneficial to let the transaction wait instead of aborting it.
10.6 Summary
This chapter has presented the notion of a software transactional memory (STM). An
STM system provides the programmer of a multiprocess program with the concept
of an atomic procedure (called a transaction). The programmer has then to focus his
effort on which parts of processes have to appear as being executed atomically and
not on the way synchronization is implemented.
Two consistency conditions for STM systems have been introduced: opacity and
virtual world consistency. Several STM systems which guarantee these conditions
have been presented.
1. The logical clock used in the TL2 STM system is a global clock, which can
constitute a bottleneck under heavy load.
Modify the algorithms in order to replace this centralized clock by a vector clock-
like distributed clock.
Solution in [42].
2. Add to the simplified version of JVSTM a garbage collection mechanism that
recycles the versions of the implementation registers which are too old to be used
by transactions.
Solution in [61].
3. Let us consider that, in the STM system presented in Sect. 10.5.2 (suited to virtual
world consistency), the implementation registers XX are MWMR regular registers
instead of being atomic registers. Let us observe that, due to the locks used in
try_to_commitT () (locking at line 14, and unlocking at line 16 or 22), no two
processes can write to the same register concurrently, from which it follows that
all the writes into an implementation register XX are sequential.
Modify the algorithms described in Fig. 10.12 to obtain an STM system that works
with such implementation registers. These modifications require only to:
• Add the statement “if predicate then return(abort, 3) end if” between
the two statements of line 22, where predicate has to be appropriately defined,
and
• Modify the second predicate used at line 15.
Prove then that these modifications guarantee that the implementation registers
behave as if they were atomic.
Solution in [163].
Part V
On the Foundations Side:
From Safe Bits to Atomic Registers
This part of the book is on the construction of atomic multi-valued registers from
safe bits (binary registers). It consists of three chapters. The first chapter starts with
a reminder of the definitions of safe, regular, and atomic registers (introduced in
Chap. 2), and then presents various constructions of “high-level” registers (regular
or atomic, respectively) from “low-level” registers (safe or regular, respectively).
The second chapter presents a bounded construction of a one-bit single-writer/
single-reader (SWSR) atomic register from three one-bit SWSR safe registers.
Finally, the last chapter presents two approaches which allow for the construction
of multi-valued multi-writer/multi-reader (MWMR) atomic registers from safe
bits.
Chapter 11
Safe, Regular, and Atomic
Read/Write Registers
For self-containment, this chapter starts with a short presentation of the notions of
safe, regular, and atomic read/write registers (which were introduced in Chap. 2).
It then presents simple wait-free implementations of “high-level” registers from
“low-level” registers. The notions of “high-level” and “low-level” used here are
not related to the computability power but to the abstraction level. This is because,
as we will see in the next two chapters, while a regular register is easier to use than a
safe register and an atomic register is easier to use than a regular register, they are all
computationally equivalent; i.e., any of them can be built wait-free from any other
without enriching the underlying system with additional computational power.
The proofs of the theorems stated in this chapter use the definitions and terminol-
ogy introduced in Chap. 4.
As just indicated, the following definitions have already been stated in Sect. 2.3.1.
Notation As far as its interface is concerned, as seen before, a register R provides
the processes with a write operation denoted R.write(v) (or R ← v), where v is the
value that is written, and a read operation R.read() (or local ← R, where local is
a local variable of the invoking process). We also use the notation R.read() → v to
indicate that the corresponding read of R returns the value v. Safe, regular, and atomic
registers differ in the value returned by a read operation invoked in the presence of
concurrent write operations.
The capacity of a register At a given time a register contains a single value, but
according to the write operations issued by the processes, a register can contain
distinct values at different times. So, the first dimension associated with a register is
related to its size, i.e., its capacity to contain more or less information.
The simplest type of register is the binary register, which can store a single bit: 0
or 1. Otherwise, a register is multi-valued. A multi-valued register can be bounded
or unbounded. A bounded register is one whose value domain includes b distinct
values (e.g., the values from 0 up to b − 1), where b is a constant known by the
processes. Otherwise the register is unbounded. A register that can contain b distinct
values is said to be b-valued. Its binary representation requires B = ⌈log2 b⌉ bits. Its
unary representation is more expensive as it requires b bits (the value v being then
represented by a bit equal to 1 followed by v − 1 bits equal to 0).
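The two representation costs can be checked directly (a minimal illustration):

```python
# Binary versus unary cost of a b-valued register, as in the text:
# ceil(log2 b) bits versus b bits.
from math import ceil, log2

def binary_bits(b):
    return ceil(log2(b))     # B = ceil(log2 b)

def unary_bits(b):
    return b                 # one bit position per possible value

print(binary_bits(5), unary_bits(5))   # -> 3 5
```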
Access to a register This dimension concerns the number of processes that can read
or write the register. As seen in Chap. 2, a register can be single- or multi-reader, and
single- or multi-writer, hence the notation XWYR, where both X and Y stand for
M(ulti) or S(ingle).
A register in the face of concurrency A fundamental question concerns the behav-
ior of a register when it is concurrently accessed by several processes. Three types
of registers can be distinguished.
SWMR safe register A SWMR safe register is a register whose read operation
satisfies the following properties:
• A read that is not concurrent with a write operation (i.e., their executions do not
overlap) returns the current value of the register.
• A read that is concurrent with one (or several) write operation(s) (i.e., their exe-
cutions do overlap) returns any value that the register can contain.
It is important to see that, in the presence of concurrent write operations, a read can
return a value that has never been written. The returned value only has to belong to
the register domain. As an example, let the domain of a safe register R be {0, 1, 2, 3}.
Assuming that R = 0, let R.write(2) be concurrent with a read operation. This read
can return either 0 or 1 or 2 or 3. It cannot return 4, as this value is not in the domain
of R, but can return 3 which has never been written.
A binary safe register can be seen as modeling a flickering bit. Whatever its
previous value, the value of the register can flicker during a write operation and
stabilizes to its final value only when the write finishes. Hence, a read that overlaps
with a write can arbitrarily return either 0 or 1.
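This behavior can be modeled directly. The following is a toy model, not an implementation; the domain {0, 1, 2, 3} follows the example above:

```python
# Toy model of a safe register read: with no concurrent write it returns
# the current value; during a write it may return *any* value of the
# domain, even one that has never been written.
import random

DOMAIN = (0, 1, 2, 3)

def safe_read(current, write_in_progress, rng=random):
    if not write_in_progress:
        return current              # sequential case: the real value
    return rng.choice(DOMAIN)       # concurrent case: anything in the domain

assert safe_read(0, write_in_progress=False) == 0
# While write(2) is in progress, a read may return 0, 1, 2, or 3:
assert all(safe_read(0, write_in_progress=True) in DOMAIN for _ in range(100))
```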
SWMR regular register A SWMR regular register is a SWMR safe register that
satisfies the following property. This property addresses read operations in the pres-
ence of concurrency. It replaces the second item of the definition of a safe register.
11.1 Safe, Regular, and Atomic Registers 307
• A read that is concurrent with one or several write operations (i.e., the read invoca-
tion overlaps with a write invocation or with consecutive write invocations) returns
the value of the register before these writes, or the value written by any of them.
An example of a regular register R (whose domain is the set {0, 1, 2, 3, 4}) written
by a process p1 and read by a process p2 is described in Fig. 11.1. As there is no
concurrent write during the first read by p2 , this read operation returns the current
value of the register R, namely 1. The second read operation is concurrent with three
write operations. It can consequently return any value in {1, 2, 3, 4}. If the register
were only safe, this second read could return any value in {0, 1, 2, 3, 4}.
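The two semantics can be compared mechanically on the scenario of Fig. 11.1. The helper below is our own (the interval encoding of operations as `(start, end)` pairs is invented for illustration): it computes the set of values a read may return under the safe and regular semantics.

```python
def overlaps(read, write):
    """Operations modeled as (start, end) intervals on a common time line."""
    rs, re = read
    ws, we = write
    return rs < we and ws < re

def allowed_values(read, writes, domain, initial, semantics):
    """Values a read may return; writes = time-ordered list of ((start, end), value)."""
    concurrent = [v for (w, v) in writes if overlaps(read, w)]
    # value of the register before the concurrent writes
    before = initial
    for (w, v) in writes:
        if w[1] <= read[0] and not overlaps(read, w):
            before = v
    if not concurrent:
        return {before}              # no concurrency: the current value
    if semantics == "safe":
        return set(domain)           # anything in the domain
    if semantics == "regular":
        return {before} | set(concurrent)

# Fig. 11.1 scenario: R starts at 0, write(1) completes, then the writes of
# 2, 3, and 4 all overlap the second read.
writes = [((0, 1), 1), ((2, 3), 2), ((3, 4), 3), ((4, 5), 4)]
read2 = (2.5, 4.5)
assert allowed_values(read2, writes, {0, 1, 2, 3, 4}, 0, "regular") == {1, 2, 3, 4}
assert allowed_values(read2, writes, {0, 1, 2, 3, 4}, 0, "safe") == {0, 1, 2, 3, 4}
```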
Atomic register An atomic MWMR register is such that all its operation invoca-
tions can be totally ordered in such a way that (a) each invocation appears as if it
had been executed instantaneously at a point of the time line between its start event
and its end event, and (b) the resulting sequence of invocations is such that any read
invocation returns the value written by the closest preceding write invocation. Said
differently, the execution of an atomic register is linearizable (see Chap. 4).
Due to the total order on all its operations and the fact that it can have several
writers, an atomic register is more constrained than a regular register.
Example To illustrate the differences between safe, regular, and atomic, let us
consider Fig. 11.2 (which has already been partly presented in Chap. 2). This figure
considers an execution of a binary register R and presents the associated history H
(at the base level defined by the start and end events of each operation invocation).
The first and third read by p2 are issued in a concurrency-free context. Hence,
whatever the type of the register (safe, regular, or atomic), the value returned is
the current value of the register R. More generally, Table 11.1 describes the values
returned by the read operations when the register is safe, regular, and atomic.
Let us consider the invocations of the read operation by p2 which return the values
a, b, or c.
• If R is safe, as these read invocations are concurrent with a write invocation, they
can return any value (i.e., 0 or 1 as the register is binary). This is denoted 0/1 in
Table 11.1.
• If R is regular, each of the values a and b returned by a read invocation can
be 1 (the value of R before the concurrent write invocation) or 0. This is because
each of these read invocations is concurrent with a write. Differently, the value c
returned by the last read invocation can only be 0 (because the value which is written
concurrently is the same as the previous value of R).

Table 11.1 Values returned by safe, regular, and atomic registers (again)

Value returned    a      b      c
Safe             1/0    1/0    1/0
Regular          1/0    1/0     0
Atomic            1     1/0     0
Atomic            0      0      0
• If R is atomic, there are only three possible executions, each corresponding to a
correct total order on the read and write operations (as indicated above, “correct”
means that the sequence of read and write invocations respects their real-time order
and is such that each read invocation returns the value written by the immediately
preceding write invocation).
When R is regular, it is possible that, of the two read invocations R.read() → a
and R.read() → b (both concurrent with R.write(0)), the first obtains the “new”
value (a = 0) while the “second” obtains the “old” value (b = 1). This is called a
new/old inversion. This is actually the difference between regularity and atomicity:
an atomic register is a regular register that does not allow for new/old inversions.
This important observation is formalized by the theorem that follows.
Associating a write invocation with each read invocation Considering the formal
notations introduced in Chap. 4, let H be the execution history of an SWMR regular
register. Let us recall that →H is an irreflexive partial order relation on the set
of the read and write invocations issued during that execution. Let op and op′ be
two operation invocations. op →H op′ if resp[op] occurs before inv[op′] (i.e.,
resp[op] <H inv[op′] in the associated event-based history). Let us assume without
loss of generality that all the write invocations write distinct values, and let π(r)
be the write operation that wrote the value obtained by the read invocation r (for
example, in Fig. 11.2, the first read operation r obtains the value 1, and consequently
π(r) is the first write operation issued by the writer).
Theorem 43 An SWMR atomic register is an SWMR regular register such that
any of its execution histories H satisfies the following property, where r1 and r2
are any two read invocations: (r1 →H r2) ⇒ ¬(π(r2) →H π(r1)).
This theorem states that an SWMR regular register without new/old inversion
is atomic. Looking again at Fig. 11.2, as (R.read() → a) →H (R.read() → b)
and R.write(1) →H R.write(0), it is not possible to have π(R.read() → b) =
R.write(1) and π(R.read() → a) = R.write(0) if the execution is atomic. This
theorem is particularly useful to show that a construction provides atomic registers.
This is done as follows. The atomicity proof consists in incrementally proving first
that the register is safe, then that it is regular, and finally that it does not allow for
new/old inversion. This last point then completes the atomicity proof.
Proof The fact that a SWMR atomic register is regular and satisfies the “no new/old
inversion” property is an immediate consequence of the definition of atomicity (see
Chap. 4); more precisely, any execution history H of an atomic register is equivalent
to a legal sequential history S that respects the partial order →H on its operations:
two successive read operations that overlap with the same write cannot obtain first
the new value and then the old value (otherwise, the witness sequential history would
not be legal).
So, we only have to show the other direction, namely that a regular register whose
executions have no new/old inversion is atomic. Let us first observe that, as there is
a single writer, all the write operations are totally ordered. Let →w be this total order
relation. Moreover, let us associate with each write operation w a sequence number
sn(w), the first write being numbered 1, etc. For consistency, we also assume that
there is an initial write invocation before any other read or write invocation (or,
equivalently, that the register has an initial value whose sequence number is 1). Let
us also associate with each read operation r the sequence number sn(π(r)), i.e., the
sequence number of the write invocation π(r) that wrote the value read by r.
Let us observe that, due to the fact that the register is regular and there is no
new/old inversion, we have, for any two read invocations r1 and r2, (r1 →H r2) ⇒
(sn(π(r1)) ≤ sn(π(r2))).
We now show that, for any execution history H, there is a total order S that is
equivalent to H (i.e., that includes the same operation invocations), is legal, and
respects the partial order on the operation invocations defined by H. S is built as
follows. We start from the total order →w on the write operations (which is included
in H) and insert the read operations as follows:
• A read operation r is inserted just after the associated write operation π(r).
• If two read operations r1 and r2 are such that sn(r1) = sn(r2), then insert first the
one that starts first in H. (Let us recall that, at the level of the invocation/response
events, any two events are totally ordered.)
Due to its very construction, S includes all the operation invocations of H , from
which it follows that S and H are equivalent. S is trivially a total order (as all the
operations are ordered according to their associated sequence numbers). Moreover,
this total order is an extension of H, as it only adds an order on operations that are
concurrent in H (when there is no read operation concurrent with a write operation,
there is no difference between regularity and atomicity). Finally, S is legal, as each
read obtains the last written value that precedes it in this total order. Hence, the cor-
responding history is linearizable. As this reasoning does not depend on a particular
history H, it follows that the register is atomic.
As atomicity (linearizability) is a local property (Theorem 14, Chap. 4), it follows
that a set of SWMR regular registers behave atomically as soon as each of them—
independently from the others—satisfies the “no new/old inversion” property.
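The “no new/old inversion” condition of Theorem 43 is easy to check on a finite history. The sketch below is our own encoding (not the book's): each read is represented as a triple `(start, end, sn)` where `sn` stands for sn(π(r)), the sequence number of the write whose value the read obtained.

```python
def no_new_old_inversion(reads):
    """reads: list of (start, end, sn) triples, sn = sn(pi(r)).

    Checks Theorem 43's condition for a single-writer regular register:
    r1 -> r2 (r1 ends before r2 starts) implies sn(pi(r1)) <= sn(pi(r2)).
    """
    for (s1, e1, sn1) in reads:
        for (s2, e2, sn2) in reads:
            if e1 < s2 and sn1 > sn2:   # r1 precedes r2 but saw a newer write
                return False
    return True

# No inversion: the earlier read saw write #1, the later read write #2.
assert no_new_old_inversion([(0, 1, 1), (2, 3, 2)])
# A new/old inversion: the earlier read saw write #2, the later read write #1.
assert not no_new_old_inversion([(0, 1, 2), (2, 3, 1)])
```

Overlapping reads are deliberately unconstrained, matching the theorem, which only orders reads related by →H.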
The constructions of high-level registers from base registers can be characterized
by the following two criteria:
• The number of base registers needed to build the high-level register is a constant
or depends on the number of processes. (Actually, a base register usually acts as
a “copy” of the register under construction.)
• The additional control information used to build the high-level register is bounded
or unbounded. Basically, unbounded means that the construction uses sequence
numbers that can grow arbitrarily. Except for a few cases, bounded constructions
are much more difficult to design and prove correct than unbounded solutions.
From a complexity (cost) point of view, however, bounded constructions are
always better.
11.2 Two Very Simple Bounded Constructions

This section presents two very simple bounded constructions. It focuses on safe and
regular registers (let us recall that these registers have been defined as having a single
writer). The first construction extends such a register from one reader to multiple
readers. The second shows how to transform a safe bit into a regular bit.
The aim of this construction is to provide an SWMR safe (or regular) register from
SWSR safe (regular) registers. So, the added value here is to allow for any number
of readers instead of a single reader.
The construction, described in Fig. 11.3, is very simple. The constructed SWMR
register R is built from n SWSR base registers REG[1 : n], one per reader process. A
reader pi reads the base register REG[i] it is associated with, while the single writer
writes all the base registers (in any order). It is important to see that this construction
is bounded: it uses no additional control information, and each base register has to
be of the same size (measured in number of bits) as the register we want to build.
Interestingly, this construction is independent of the fact that the base registers are
safe or regular.
Fig. 11.3 From SWSR safe/regular to SWMR safe/regular: a bounded construction (code for pi )
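Sequentially, the Fig. 11.3 construction can be sketched as follows. This is a toy Python model with an invented class name; the safe/regular behavior of the base registers under concurrency is not simulated, only the data flow of the construction.

```python
class SWMRRegister:
    """Sketch of Fig. 11.3: one base register per reader (indices 1..n);
    the single writer writes all of them, reader i reads only REG[i].
    Base registers are modeled as plain cells."""
    def __init__(self, n, initial):
        self.REG = {i: initial for i in range(1, n + 1)}

    def write(self, v):                 # code for the single writer
        for i in self.REG:
            self.REG[i] = v             # write every base register, in any order
        # the write returns only after all base registers have been written

    def read(self, i):                  # code for reader p_i
        return self.REG[i]

R = SWMRRegister(n=3, initial=0)
R.write(5)
assert all(R.read(i) == 5 for i in (1, 2, 3))
```

The construction is bounded precisely because the model carries no control field: each base cell stores only a copy of the current value.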
Theorem 44 Given one base safe (regular) register per reader, the algorithm
described in Fig. 11.3 constructs an SWMR safe (regular) register.
Proof Let us first consider safe base registers. It follows directly from the algorithm
that a read of R which is not concurrent with an invocation of R.write() obtains the
last value deposited in the register R. The register R is consequently safe.
Let us now consider the case where the base registers are regular. As a regular
register is safe, it only remains to show that an invocation of R.read() by a process
pi that is concurrent with one or more invocations of the write operation R.write(v),
R.write(v′), etc., returns one of the values v, v′, etc. written by these concurrent
write invocations, or the value of R before these write invocations. As REG[i] is
regular, it follows that, when pi reads REG[i], it obtains the value of a concurrent
write on this base register (if any) or the value of REG[i] before these concurrent
write operations. It follows that the constructed register R is regular.
It is important to see that, unfortunately, the construction of Fig. 11.3 does not build
an SWMR atomic register when every base register REG[i] is an SWSR atomic register.
To show this, let us consider the counter-example described in Fig. 11.4, with one
writer pw and two readers p1 and p2. Let us assume that the register R contains
initially the value 1 (which means that we initially have REG[1] = REG[2] = 1). To
write the value 2 in R, the writer first executes REG[1] ← 2 and then REG[2] ← 2.
The duration between these two write invocations on base registers can be arbitrary
(recall that, as the processes are asynchronous, there is no assumption on their speed).
Concurrently, p1 reads REG[1] and returns 2, while later (as indicated in the figure)
p2 reads REG[2] and returns 1. The linearization order on the two base atomic
registers is depicted in the figure (bold dots). The reader can easily see that, from the
point of view of the constructed register R, there is a new/old inversion, as p1 reads
first and obtains the new value, while p2 reads later and obtains the old value. The
constructed register is consequently not atomic.
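The interleaving of Fig. 11.4 can be replayed step by step. The explicit step ordering below is ours; it only makes the figure's schedule concrete.

```python
# Replay of the Fig. 11.4 counter-example: R initially 1, the writer writes 2
# into REG[1] then REG[2]; p1 reads its base register between the two base
# writes, p2 reads later but still before REG[2] has been updated.
REG = {1: 1, 2: 1}

REG[1] = 2          # writer: first base write
v1 = REG[1]         # p1 reads REG[1]: obtains the new value
v2 = REG[2]         # p2 reads REG[2] later: still obtains the old value
REG[2] = 2          # writer: second base write

assert (v1, v2) == (2, 1)   # new/old inversion: the construction is not atomic
```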
The aim here is to build a regular bit from a safe bit. As (a) a bit has only one out of
two possible values (0 or 1) and (b) regularity allows for new/old inversions when
several read invocations are concurrent with one or more write invocations, the only
problem that has to be solved is the following. Let us assume that the register has the
value 0 and there is an invocation of the operation R.write() that writes the very same
value 0. As a base register is only safe, it is possible that a concurrent invocation of
R.read() obtains the value 1 (that has maybe never been written into the register).
An easy way to fix this problem is to force invocations of R.write() to always write
a value different from the previous one. It then follows that invocations of R.read(),
which are concurrent with one or more invocations of R.write(), obtain the value
before these write invocations or the value written by one of these invocations.
The corresponding construction is described in Fig. 11.5. It only requires that the
(single) writer uses a local register prev_val that contains the previous value that it
wrote into the base safe register REG. The test guarantees that a value is written into
the safe base register only when it is different from its current value.
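A sequential sketch of the Fig. 11.5 construction follows (class name ours; the flickering of the underlying safe bit during a write is not simulated, only the writer-side test).

```python
class RegularBit:
    """Sketch of Fig. 11.5: build a regular bit on top of a safe bit by never
    rewriting the current value, so that a read concurrent with a write can
    only see the old value or the new (different) value."""
    def __init__(self, initial=0):
        self.REG = initial          # the underlying safe bit
        self.prev_val = initial     # writer-local copy of the last value written

    def write(self, v):
        if v != self.prev_val:      # write only when the value actually changes
            self.REG = v
            self.prev_val = v

    def read(self):
        return self.REG

B = RegularBit(0)
B.write(0)      # same value: the safe bit is not touched
B.write(1)      # different value: actually written
assert B.read() == 1
```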
Theorem 45 Given an SWMR binary safe register, the construction described in
Fig. 11.5 builds an SWMR binary regular register.
Proof The proof is an immediate consequence of the following facts: (1) As the
underlying register is safe, a read that is not concurrent with a write obtains the last
written value; (2) As the underlying safe register always alternates between 0 and 1,
a read invocation concurrent with one or more write invocations obtains the value of
the base register before these write invocations or one of the values written by such
a write invocation.
Remark The previous construction exploits the fact that the constructed regis-
ter R can only contain one out of two possible values. Unfortunately, it cannot be
extended to work for multi-valued registers, nor to implement an atomic binary
register.
11.3 From Bits to b-Valued Registers

This section presents bounded constructions from bits to registers whose value
domain is made up of b distinct values (b > 2). The base bits and the constructed
registers are SWMR registers. (Of course, the algorithms trivially work when the
base bits are SWSR, the constructed b-valued register then being SWSR.) Finally,
the abstraction level of the constructed register is the same as that of the base
bits from which it is built: if the base bits are safe (regular or atomic), the b-valued
constructed register is safe (regular or atomic).
Fig. 11.6 SWMR safe register: from binary domain to b-valued domain
operation R.write(v) is
   REG[v] ← 1;
   for j from v − 1 step −1 until 1 do REG[j] ← 0 end for;
   return()
end operation

operation R.read() is
   j ← 1;
   while (REG[j] = 0) do j ← j + 1 end while;
   return(j)
end operation

Fig. 11.7 SWMR regular register: from binary domain to b-valued domain
As shown by Lemma 24 (whose proof relies on the fact that the value domain of R
is bounded), the read loop always terminates, and the value j it returns is such that
1 ≤ j ≤ b.
Remark The previous lemma relies heavily on the fact that the register R can
contain only b distinct values. The lemma would no longer be true if the value
domain of R was unbounded. An invocation of R.read() could then never terminate
in the case where the writer continuously writes increasing values. This is due to the
following possible scenario. Let R.write(x) be the last write invocation terminated
before the invocation R.read(), and let us assume that there is no concurrent write
invocation R.write(y) such that y < x. It is possible that, when it reads REG[x],
the reader finds REG[x] = 0 because another R.write(y) operation (with y > x)
updated REG[x] from 1 to 0. Now when it reads REG[y], the reader finds REG[y] = 0
because another R.write(z) operation (with z > y) updated REG[y] from 1 to 0, and
so on. The read invocation may then never terminate.
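The write and read algorithms of Fig. 11.7 can be sketched sequentially in Python (class name ours; concurrency between the scans and the writes is not simulated).

```python
class BValuedRegister:
    """Sketch of Fig. 11.7: the value v in 1..b is encoded in unary as
    REG[v] = 1 with all lower entries equal to 0. Entries above v may stay
    dirty, which matches the construction (only entries below v are cleaned)."""
    def __init__(self, b, initial):
        self.b = b
        self.REG = [0] * (b + 1)        # index 0 unused
        self.REG[initial] = 1

    def write(self, v):
        self.REG[v] = 1
        for j in range(v - 1, 0, -1):   # set to 0 from REG[v-1] down to REG[1]
            self.REG[j] = 0

    def read(self):
        j = 1
        while self.REG[j] == 0:         # ascending scan; terminates (Lemma 24)
            j += 1
        return j

R = BValuedRegister(b=5, initial=3)
assert R.read() == 3
R.write(2)
assert R.read() == 2        # REG[3] is still 1, but the scan stops at 2
```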
Theorem 47 Given b SWMR regular bits, the construction described in Fig. 11.7
builds an SWMR b-valued regular register.
Proof Let us first consider a read operation that is not concurrent with a write, and let
v be the last written value. It follows from the write algorithm that, when R.write(v)
terminates, the first entry of the array equal to 1 is REG[v] (i.e., REG[x] = 0 for
1 ≤ x ≤ v − 1). As a read scans the array starting from REG[1], then REG[2], etc.,
it necessarily reads until REG[v] and returns accordingly the value v.
Let us now consider a read operation R.read() that is concurrent with one or more
write operations R.write(v1 ), . . . , R.write(vm ) (as depicted in Fig. 11.8). Moreover,
let v0 be the value written by the last write invocation that terminated before the
invocation R.read() starts (or the initial value if there is no such write invocation).
As a read invocation always terminates (Lemma 24), the number of write invocations
concurrent with the R.read() invocation is finite. We have to show that the value v
returned by R.read() is one of the values v0 , v1 , . . . , vm . We proceed by case analysis:
1. v < v0.
No value that is both smaller than v0 and different from the values vx (1 ≤ x ≤ m)
can be returned. This is because (1) R.write(v0) has set to 0 all the entries from
REG[v0 − 1] down to the first one, and only a write of a value vx can set REG[vx]
to 1; and (2) as the base registers are regular, if an entry REG[v′] (v′ < v0) is
updated by an invocation R.write(vx) from 0 to the same value 0, the reader cannot
concurrently read REG[v′] = 1. It follows from this observation that, if R.read()
returns a value v smaller than v0, that value has necessarily been written by a
concurrent write invocation, and consequently R.read() satisfies the regularity
property.
2. v = v0 .
In this case, R.read() trivially satisfies the regularity property. (Let us notice that
it is possible that the corresponding write invocation be some R.write(vx ) such
that vx = v0 .)
3. v > v0.
From v > v0 we can conclude that the read invocation obtained 0 when it read
REG[v0]. As REG[v0] was set to 1 by R.write(v0), this means that there is an
invocation R.write(v′), issued after R.write(v0) and concurrent with R.read(),
such that v′ > v0; that invocation has executed REG[v′] ← 1 and has then set
to 0 at least all the registers from REG[v′ − 1] down to REG[v0]. We consider
three cases:
(a) v0 < v < v′.
In this case, as REG[v] was set to 0 by R.write(v′), we can conclude that there
is an invocation R.write(v), issued after R.write(v′) and concurrent with R.read(),
that updated REG[v] from 0 to 1. The value returned by R.read() is consequently
a value written by a concurrent write invocation. The regularity property is
consequently satisfied by R.read().
(b) v0 < v = v′.
The regularity property is then trivially satisfied by R.read(), as R.write(v′)
and R.read() are concurrent.
(c) v0 < v′ < v.
In this case, R.read() missed the value 1 in REG[v′]. This can only be due
to an R.write(v″) operation, issued after R.write(v′) and concurrent with
R.read(), such that v″ > v′; that operation has executed REG[v″] ← 1
and has then set to 0 at least all the registers from REG[v″ − 1] down to REG[v′].
We are now in the same situation as the one described at the beginning
of item 3, where v0 and R.write(v′) are replaced by v′ and R.write(v″),
respectively. As the number of values between v0 and b is finite and as the
invocations of R.read() terminate, it follows that the analysis eventually ends in
case 3a or case 3b, which completes the proof of the theorem.
A counter-example for atomicity Figure 11.9 shows that, even if all the base reg-
isters are atomic, the previous construction does not provide an atomic b-valued
register.
Let us assume that b = 5 and the initial value of the register R is 3, which
means that we initially have REG[1] = REG[2] = 0, REG[3] = 1, and REG[4] =
REG[5] = 0. The writer issues first R.write(1) and then R.write(2). There are con-
currently two read invocations as indicated in the figure. The first returns the value 2,
while the second one returns the value 1. Hence, there is a new/old inversion. The last
line of the figure describes a linearization order S of the read and write invocations on
the base binary registers. (As we can see, each base object taken alone is linearizable.
This follows from the fact that linearizability is a local property; see Sect. 4.4.)
As just seen, the previous construction does not work to build a b-valued atomic
register from atomic bits. Interestingly, a relatively simple modification of its read
algorithm prevents new/old inversions from occurring. The construction that follows
is due to K. Vidyasankar (1989).
Principles of the construction The idea consists in decomposing a R.read() oper-
ation into two phases. The first phase is the previous read algorithm: it reads the
base registers in ascending order, until it finds an entry equal to 1; let j be that entry.
Then, the second phase traverses the array in the reverse direction (from j to 1), and
determines the smallest entry that contains the value 1 to return it. So, the returned
value is determined by a double scanning of a “meaningful” part of the REG array.
The construction is given in Fig. 11.10. To understand the way it works let us con-
sider the first invocation of R.read() depicted in Fig. 11.9. After it finds REG[2] = 1,
it changes its scanning direction. It then finds REG[1] = 1 and returns consequently
the value 1. The second read (that starts after the first one) will return the value 1
or 2 according to the value read from REG[1]. If it reads 1, as in the figure, the
read invocation returns 1. This shows that, in the presence of concurrency, this con-
struction does not strive to eagerly return a value. Instead, the value v returned by
a read operation has to be “validated” by an appropriate procedure, namely, all the
“preceding” base registers REG[v − 1] until REG[1] have to be found equal to 0
when read for the second time.
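The double scan can be sketched as follows. The `ScriptedBit` device is ours, not the book's: it replays the successive values read from each base bit, which lets a sequential program emulate the concurrent write into REG[1] of the Fig. 11.9 scenario.

```python
class ScriptedBit:
    """Base atomic bit whose successive reads return scripted values,
    emulating concurrent writes (a testing device, not part of the book)."""
    def __init__(self, *values):
        self.values = list(values)

    def read(self):
        # pop the next scripted value; the last one repeats forever
        return self.values.pop(0) if len(self.values) > 1 else self.values[0]

def double_scan_read(REG):
    """Sketch of the Fig. 11.10 read: ascend until a 1 is found (j_up),
    then descend back to 1, keeping the smallest entry read as 1."""
    j = 1
    while REG[j].read() == 0:           # first phase: ascending scan
        j += 1
    j_up = j
    for k in range(j_up - 1, 0, -1):    # second phase: descending scan
        if REG[k].read() == 1:
            j = k                       # a smaller entry equal to 1 wins
    return j

# Fig. 11.9 scenario: REG[1] is read as 0 during the ascending scan and as 1
# (written concurrently) during the descending scan; REG[2] holds 1.
REG = [None, ScriptedBit(0, 1), ScriptedBit(1),
       ScriptedBit(0), ScriptedBit(0), ScriptedBit(0)]
assert double_scan_read(REG) == 1
```

In a quiescent state the descending scan changes nothing; its role is precisely to "validate" the returned value against concurrent writes of smaller values.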
Theorem 48 Given b SWMR atomic bits, the construction described in Fig. 11.10
builds an SWMR atomic b-valued register.
Proof The proof consists of two parts: first showing that the constructed register is
regular, and then showing that it does not allow for new/old inversions. Applying
Theorem 43 proves then that the constructed register is an SWMR atomic register.
Let us first show that the constructed register is regular. Let R.read() be a read
invocation and j the value it returns. We consider two cases (let us observe that, due
to the construction, the case j > j_up cannot happen):
• j = j_up (j is determined at line 3).
The value returned is then the same as the one returned by the construction
described in Fig. 11.7. It follows from Theorem 47 that the value read is then either
the value of the last preceding write or the value of a concurrent write invocation.
• j < j_up (j is determined at line 5).
In this case, the read found REG[j] = 0 during the ascending loop (line 2) and
REG[j] = 1 during the descending loop (line 5). Due to the atomicity of the
register REG[j], this means that a write operation has written REG[j] = 1 between
these two readings of that base atomic register. It follows that the value j returned
has been written by a concurrent write operation.
To show that there is no new/old inversion, let us consider Fig. 11.11. There are
two write invocations and two read invocations r1 and r2, which are concurrent with
the second write invocation. (The fact that the read invocations are issued by the
same process or by different processes is irrelevant.) As the constructed register R is
regular, both read invocations can return v or v′. If the first read invocation r1 returns
v, the second read invocation r2 can return either v or v′ without entailing a new/old
inversion. So, let us consider the case where r1 returns v′. We show that r2 returns
v″, where v″ is v′ or a value written by a more recent write concurrent with this read.
If v″ = v′, there is no new/old inversion. So, let us consider v″ ≠ v′. As r1 returns
v′, r1 has sequentially read REG[v′] = 1 and then REG[v′ − 1] = 0 down to
REG[1] = 0 (lines 4–5). Moreover, r2 starts after r1 has terminated (r1 →H r2 in
the associated execution history H).
1. v″ < v′. In this case, a write invocation has written REG[v″] = 1 after r1 has
read REG[v″] = 0 (at line 5) and before r2 reads REG[v″] = 1 (at line 3 or
line 4), with 1 ≤ v″ < v′. It follows that this write invocation occurs after
R.write(v′) (there is a single sequential writer, and r1 returns v′). Consequently,
r2 obtains a value more recent than v′ (hence more recent than v), and there is
no new/old inversion.
2. v″ > v′. In this case, r2 has read 1 from REG[v″] and then 0 from REG[v′] (line
5). As r1 terminates (reading REG[v′] = 1 and returning v′) before the invocation
r2 starts, and the write invocations are sequential, it follows that there is a write
invocation, issued after R.write(v′), that has updated REG[v′] from 1 to 0. Hence,
the value v″ returned by r2 has been written by a write invocation more recent
than R.write(v′), and there is again no new/old inversion.
This section presents three constructions based on sequence numbers which, con-
sequently, are unbounded. The sequence numbers are used to identify each write
invocation and associate a total order with them, a total order that can then be easily
exploited. The use of sequence numbers makes these constructions relatively simple.
(It is possible to design equivalent constructions that are bounded: they use only a
constant number of safe bits and are much more involved.)
The high-level register R that we want to build can be bounded or not. This
depends only on the application that uses it. It is important to see that, whether R is
bounded or not, the base registers from which it is built contain sequence numbers
and are consequently potentially unbounded. More explicitly, in the constructions
presented in this section, a base register REG is made up of two fields:
• REG.val is the data part intended to contain the value v of the constructed register
R. As already noticed, whether this part is bounded or not depends only on the
upper-layer application.
• REG.sn is a control part containing a sequence number and possibly a process
identity. The sequence number values increase proportionally to the number of
write invocations, and consequently cannot be bounded.
As soon as one may use sequence numbers and those can be kept inside a base
regular register, it becomes easy to build an SWSR atomic register from an unbounded
SWSR regular register. The underlying principle consists in associating a sequence
number with each write operation and using it to prevent the new/old inversion
phenomenon from occurring. It then follows from Theorem 43 that the constructed
register is atomic.
The construction is described in Fig. 11.12. The local variable sn of the writer
is used to generate sequence numbers. Each time it writes a value v, the writer
deposits the pair ⟨sn, v⟩ in the base regular register REG. The reader manages two
local variables: last_sn stores the greatest sequence number it has ever seen, and
last_val stores the corresponding value. When it wants to read the high-level register
R, the reader first reads the base regular register REG, and then compares last_sn
with the sequence number it has just read in order to prevent new/old inversions. The
scope of the local variable aux used by the reader spans a read invocation; it is made
up of two fields: a sequence number (aux.sn) and a value (aux.val).
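A sequential sketch of the Fig. 11.12 logic follows (class name ours; the base regular register is modeled as a plain cell, so its behavior under read/write overlap is not simulated).

```python
class AtomicFromRegular:
    """Sketch of Fig. 11.12: pair each written value with a sequence number;
    the reader keeps the highest pair it has seen, which rules out
    new/old inversions."""
    def __init__(self, v0):
        self.REG = (0, v0)          # <sn, val>: the base regular register
        self.sn = 0                 # writer-local sequence number generator
        self.last_sn, self.last_val = 0, v0   # reader-local state

    def write(self, v):
        self.sn += 1
        self.REG = (self.sn, v)

    def read(self):
        aux = self.REG
        if aux[0] > self.last_sn:   # newer than anything seen: adopt it
            self.last_sn, self.last_val = aux
        return self.last_val        # otherwise stick to the latest known value

R = AtomicFromRegular(v0=0)
R.write(7)
assert R.read() == 7
# A stale pair obtained later (possible from a regular register under
# concurrency) is ignored by the sequence-number test:
R.last_sn, R.last_val = 2, 9        # pretend the reader already saw <2, 9>
R.REG = (1, 7)                      # base register yields an older pair
assert R.read() == 9
```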
Theorem 49 Given an unbounded SWSR regular register, the construction described
in Fig. 11.12 builds an SWSR atomic register.
operation R.write(v) is
   sn ← sn + 1; REG ← ⟨sn, v⟩; return()
end operation

operation R.read() is
   aux ← REG;
   if (aux.sn > last_sn) then last_sn ← aux.sn; last_val ← aux.val end if;
   return(last_val)
end operation

Fig. 11.12 SWSR atomic register from an SWSR regular register (unbounded construction)
Proof The proof is similar to the proof of Theorem 43. Let us associate with each
read invocation r the sequence number (denoted sn(r)) of the value it returns as a result
(this is possible as the base register is regular and consequently a read always returns
a value that was written, that value being the last written value or a value concurrently
written, if any). Considering an arbitrary execution history H of the register R, we
show that H is atomic (linearizable) by building an equivalent sequential history S
that is legal and respects the partial order on the operations defined by →H.
S is built from the sequence numbers associated with the invocations. First, let
us order all write invocations according to their sequence numbers. Then, let us
order each read invocation just after the write invocation that has the same sequence
number. If two read invocations have the same sequence number, we order first the
one which started first.
The history S is trivially sequential as all the invocations are placed one after the
other. Moreover, S is equivalent to H as it is made up of the same operation invo-
cations. S is trivially legal as each read follows the corresponding write invocation.
We now show that S respects →H.
• For any two write invocations w1 and w2, we have either w1 →H w2 or w2 →H w1.
This is because there is a single writer and it is sequential. As the variable sn is
increased by 1 between two consecutive write invocations, no two write invocations
have the same sequence number, and these numbers agree with the occurrence
order of the write invocations. As the total order on the write invocations in S is
determined by their sequence numbers, it agrees with their order in H.
• Let op1 be a write or a read invocation, and op2 be a read invocation such that
op1 →H op2. It follows from the construction that sn(op1) ≤ sn(op2) (where
sn(op) is the sequence number of the invocation op). The ordering rule guarantees
that op1 is ordered before op2 in S.
• Let op1 be a read invocation and op2 be a write invocation. Similarly to the previous
item, we then have sn(op1) < sn(op2), and consequently op1 is ordered before
op2 in S (which concludes the proof).
One might think of a naive extension of the previous algorithm to construct an
SWMR atomic register from base SWSR regular registers. Indeed, we could, at first
glance, consider a construction associating one SWSR regular register per reader,
and have the writer write in all of them, each reader reading its dedicated register.
Unfortunately, a fast reader might see a new concurrently written value, whereas a
reader that comes later sees the old value. This is because the second reader does not
know about the sequence number and the value returned by the first reader. The latter
stores them locally. In fact, with the previous construction, this can happen even if
the base SWSR registers are atomic. The construction of an SWMR atomic register
from base SWSR atomic registers is addressed in the next section.
• Helping the others. Just before returning the value v it has determined, a reader pi
helps each process pj by indicating to pj the last value it has read (namely v) and
its sequence number sn. This is realized by pi updating HELP[i, j] to the
pair ⟨sn, v⟩. This allows pj not to return in the future a value older than v, i.e., a
value whose sequence number would be smaller than sn.
• To be helped by the others. To determine the value returned by a read operation,
a reader pi first computes the greatest sequence number that it can know of. This
computation involves all the SWSR atomic registers that pi can read, i.e., REG[i]
and HELP[j, i] for any j. The process pi then returns the value associated with the
greatest sequence number it has computed.
Fig. 11.13 Atomic register: from one reader to multi-reader (unbounded construction)
can be used to contain the SWSR atomic register REG[i]. It follows that the protocol
requires exactly n² base SWSR atomic registers.
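The helping mechanism just described can be sketched as follows. This is an illustrative, sequential Python simulation of the unbounded construction (the pseudocode of Fig. 11.13 is not reproduced in this text, so all names below are assumptions); it ignores real concurrency and treats each register access as atomic.

```python
# Sequential sketch (illustrative, not the book's code) of the unbounded
# SWSR-to-SWMR construction: REG[i] is the SWSR atomic register read by
# reader p_i, HELP[i][j] is written by reader p_i and read by reader p_j.
# (As noted above, HELP[i][i] can play the role of REG[i], which is why
# n*n base registers suffice.)
class SWMRAtomicRegister:
    def __init__(self, n, v0):
        self.n = n
        self.sn = 0                                  # writer's sequence number
        self.REG = [(0, v0) for _ in range(n)]
        self.HELP = [[(0, v0)] * n for _ in range(n)]

    def write(self, v):
        self.sn += 1
        for i in range(self.n):                      # write <sn, v> for each reader
            self.REG[i] = (self.sn, v)

    def read(self, i):
        # greatest <sn, v> pair among everything reader p_i can read
        candidates = [self.REG[i]] + [self.HELP[j][i] for j in range(self.n)]
        sn, v = max(candidates, key=lambda pair: pair[0])
        for j in range(self.n):                      # help the other readers
            self.HELP[i][j] = (sn, v)
        return v
```

In this sketch, a reader that observes a freshly written pair propagates it through HELP before returning, so a later reader can never return an older value: exactly the property whose absence breaks the naive per-reader construction.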
This section shows how to use sequence numbers to build an MWMR atomic register
from n SWMR atomic registers (where n is the number of writers). The construction
is simpler than the previous one. An array REG[1..n] of n SWMR atomic registers
is used in such a way that pi is the only process that can write REG[i], while any
process can read it. Each register REG[i] stores a ⟨sequence number, value⟩ pair.
As before, X.sn and X.val are used to denote the sequence number field and the
value field of the register X, respectively. Each REG[i] is initialized to the same pair,
namely ⟨0, v0⟩, where v0 is the initial value of R.
Fig. 11.14 Atomic register: from one writer to multi-writer (unbounded construction)
The problem to solve consists in allowing the writers to totally order their write
operations. To that end, the idea is the following. A write invocation first computes
the highest sequence number that was used, say sn, and defines sn + 1 as the next
sequence number. Unfortunately, this does not prevent two distinct write invocations
from associating the same sequence number with their respective values. A simple
way to cope with this problem consists in associating a timestamp with each value,
where a timestamp is a pair made up of a sequence number plus the identity of the
process that issues the corresponding write invocation.
The timestamping mechanism can be used to define a total order on all the
timestamps as follows. Let ts1 = ⟨sn1, i⟩ and ts2 = ⟨sn2, j⟩ be any two timestamps.
We have

ts1 < ts2  =def  (sn1 < sn2) ∨ (sn1 = sn2 ∧ i < j).
The corresponding construction is described in Fig. 11.14. The meaning of the addi-
tional local variables that are used is clear from the context (and from the similar
variables used in the previous constructions).
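Under the same caveat (the code of Fig. 11.14 is not reproduced here, so this is only a sequential sketch with illustrative names), the timestamp-based MWMR construction can be rendered as:

```python
# Sketch of the timestamp-based MWMR construction (names illustrative):
# REG[i] holds a <timestamp, value> pair written only by process p_i, where
# a timestamp <sn, i> is a (sequence number, process id) tuple. Python's
# tuple comparison implements exactly the total order defined above:
# (sn1, i) < (sn2, j)  iff  sn1 < sn2, or sn1 = sn2 and i < j.
class MWMRAtomicRegister:
    def __init__(self, n, v0):
        self.REG = [((0, 0), v0) for _ in range(n)]  # initial timestamp <0, 0>

    def write(self, i, v):
        sn = max(ts[0] for ts, _ in self.REG)        # highest sn in use
        self.REG[i] = ((sn + 1, i), v)               # timestamp <sn+1, i>

    def read(self, i):
        ts, v = max(self.REG, key=lambda pair: pair[0])
        return v                                     # value with greatest timestamp
```

Two writers that compute the same sequence number produce the timestamps ⟨sn, i⟩ and ⟨sn, j⟩ with i ≠ j, so the process identities break the tie and the writes remain totally ordered.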
Theorem 51 Given n unbounded SWMR atomic registers, the construction described
in Fig. 11.14 builds an MWMR atomic register.
Proof As previously, the proof consists in showing that the timestamps allow the
definition of a linearization of any execution history H .
Considering an execution history H of the constructed register R, we first build
an equivalent sequential history S by ordering all the write invocations according to
their timestamps, then inserting the read invocations as in Theorem 49. This history
is trivially legal as each read invocation is ordered just after the write invocation that
wrote the read value. Finally, a reasoning similar to the one used in Theorem 49,
based on timestamps, shows that S respects → H .
11.5 Summary
Considering (a) the abstraction level hierarchy associated with registers as defined by
safe, regular, and atomic registers, (b) the fact that a register can be SWSR, SWMR,
MWSR, or MWMR, and (c) the size of the registers (binary versus multi-valued),
this chapter has presented bounded and unbounded constructions from one type of
read/write register to another type of read/write register. More precisely, the following
constructions have been presented:
• Bounded constructions:
– From safe (regular) SWSR registers to safe (regular) SWMR registers
(Sect. 11.2.1)
– From safe binary SWMR registers to regular binary SWMR registers
(Sect. 11.2.2)
– From binary safe SWMR registers to multi-valued safe SWMR registers
(Sect. 11.3.1)
– From binary regular SWMR registers to multi-valued regular SWMR registers
(Sect. 11.3.2)
– From binary atomic SWMR registers to multi-valued atomic SWMR registers
(Sect. 11.3.3).
• Unbounded constructions:
– From SWSR regular registers to SWSR atomic registers (Fig. 11.12)
– From SWSR atomic registers to SWMR atomic registers (Fig. 11.13)
– From SWMR atomic registers to MWMR atomic registers (Sect. 11.4.3).
11.6 Bibliographic Notes
• The notions of safe, regular, and atomic registers are due to L. Lamport [189].
Constructions from one type of register into another one were proposed by
L. Lamport [190]. Very interestingly, L. Lamport presented in [190] a stacking
of wait-free constructions which allows a b-valued atomic MWMR register to be
built from binary SWSR safe registers. The first intuition of these types of registers
can be found in [184].
• Axioms for asynchronous shared memory access are presented in [206].
• The bounded constructions described in Fig. 11.5 (from a safe bit to an atomic bit),
Fig. 11.6 (from safe bits to a b-valued safe register), and Fig. 11.7 (from regular
bits to a b-valued regular register) are due to L. Lamport [190].
• The bounded construction for building an atomic b-valued register from atomic
bits which is described in Fig. 11.10 is due to K. Vidyasankar [268].
• The unbounded constructions from SWSR to SWMR described in Fig. 11.13
and from SWMR to MWMR described in Sect. 11.4.3 are due to P. Vitányi and
B. Awerbuch [272]. This paper also presents corresponding bounded constructions,
and the previous two unbounded constructions are a very first step in the design of
these bounded constructions.
• Many constructions from one type of register into another one have been proposed
(e.g., [22, 43, 52, 59, 132, 168, 177, 178, 196, 219, 225, 257, 265, 268, 269, 270, 271]
to cite a few).
• The notion of a multi-valued regular register was also investigated in [72, 254] and
the cost of multi-valued register implementations is discussed in [73].
Chapter 12
From Safe Bits to Atomic Bits:
Lower Bound and Optimal Construction
As just indicated, the construction of an SWSR atomic register from an SWSR regular
register presented in Fig. 11.12 uses sequence numbers which increase forever and
consequently makes this construction unbounded. (These sequence numbers are used
to prevent new/old inversions.) This construction is such that (a) each invocation of
R.write() writes a pair seq number, value in the shared memory while (b) each
invocation of R.read() has only to read the shared memory.
Hence, a fundamental question is the following: Is it possible to build an atomic
register from a finite number of base safe (or regular) registers that can (a) contain
only a bounded number of values, and (b) be written only by the writer (of the atomic
register)? This section shows that such a construction is impossible.
The two following lemmas will be used in the impossibility proof. The first shows that
it is possible to replace several regular registers by a single register while preserving
regularity. The second states a property of a sub-sequence with respect to the sequence
it originated from.
Lemma 25 Let us consider a set of SWSR regular registers, all written by the same
writer process and read by the same reader process. These registers can be replaced
by a single SWSR regular register.
Proof Let REG1 , . . . , REGn be the set of n regular registers written by the writer and
read by the reader. Moreover, let VALi be the value domain of REGi and vi ∈ VALi
be a value of REGi. The construction is as follows:
• The set of n registers (REGi )1≤i≤n is replaced by a single regular register R whose
value domain is the cross-product VAL 1 × VAL 2 × . . . × VAL n .
• Let us first observe that, for any register REGi , the writer can always keep a copy of
the last value it has written in that register. Assuming REG1 = v1 , . . . , REGn = vn ,
the write of the value vi in the register REGi is replaced by the write of the
composite value v1 , . . . , vi−1 , vi , vi+1 , . . . , vn in the regular register R.
• A read of REGi is realized by a read of R followed by the extraction of the value
of its ith field.
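A minimal sketch of this Lemma 25 construction (illustrative names; a sequential simulation that treats each access to R as atomic):

```python
# Sketch of the Lemma 25 construction: n regular registers, all shared by
# one writer and one reader, are replaced by a single register R over the
# cross-product of their value domains. The writer keeps local copies of
# its last writes, so writing one field amounts to a single write of the
# whole composite value.
class CompositeRegular:
    def __init__(self, initial_values):
        self.R = tuple(initial_values)       # the single regular register
        self.local = list(initial_values)    # writer's copies of v1..vn

    def write(self, i, v):                   # implements "REG_i <- v"
        self.local[i] = v
        self.R = tuple(self.local)           # one write of <v1, ..., vn>

    def read(self, i):                       # implements a read of REG_i
        return self.R[i]                     # read R, extract the ith field
```

The writer's local copies are essential: without them, writing field i would require first reading R, and the writer is not supposed to need the other fields' current values from shared memory.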
To show that this construction is correct, let us consider a read of a register REGi
(denoted ri ). If that read is not concurrent with an invocation of a write of a register
REG j , the value returned by ri returns the current value of REGi (the ith field of R)
and consequently satisfies regularity.
So, let us consider the case where there are invocations of write operations
w1 , . . . , wk into some registers REG j which are concurrent with ri . It follows from
the regularity of R that the read of R implementing ri returns the value of R before
or after one of these write invocations. There are two cases:
• None of the write invocations w1 , . . . , wk writes REGi . Due to the construction, the
ith field of R is not changed by this sequence of write invocations. It follows that
ri returns the current value vi of REGi .
12.1 A Lower Bound Theorem 331
• One or more write invocations in the sequence w1 , . . . , wk write REGi . As the value
obtained by ri is the ith field of R, it follows from the regularity of R that it is the
value of REGi before or after one of these write invocations, which proves that ri
satisfies the regularity property.
This theorem, which is due to L. Lamport (1986), asserts that, when we want to
construct an SWSR atomic register from bounded regular registers, there is no con-
struction in which the writer only writes and the reader only reads. This means that
any such construction must involve two-way communication between the reader and
the writer.
Theorem 52 It is not possible to build an SWSR b-valued atomic register (b ≥ 2)
from a finite number of regular registers that can take a finite number of values and
are written only by the writer.
Proof The proof is by contradiction. Let us assume that it is possible to build an
SWSR atomic register R from a finite number of SWSR regular registers, each with a
finite value domain. Without loss of generality, let us consider that the atomic register
R that is built is binary. Let us first observe that, due to Lemma 25, it is sufficient to
consider the case where R is built from a single base regular register (denoted REG).
The proof considers a possible behavior for the writer and a possible behavior for
the reader, and then deduces a contradiction from these behaviors.
A write pattern. Assuming R is initialized to 0, let us consider an execution where
infinitely often the writer alternately writes 1 and 0 in R. Let wi , i ≥ 1, denote the
ith invocation of the operation R.write(v). This means that v = 1 when i is odd and
v = 0 when i is even.
Invocations of R.write(1). Each write invocation w2i+1 of R.write(1) is implemented
by a sequence of invocations of base write operations on the regular register REG.
Let ω1 , . . . , ωx be the sequence of base writes generated by w2i+1 , and Ai the cor-
responding sequence of values defined as follows: its first element a1 is the value of
REG before ω1 , its second element a2 is the value of REG just after ω1 and before
ω2 , etc.; its last element ax+1 is the value of REG after ωx . Let Bi be a digest derived
from Ai (due to Lemma 26 such a digest sequence exists).
As the number of distinct values that REG can have is finite, it follows that the
number of distinct digest sequences Bi is finite. As the sequence of writes on R is infi-
nite, there is a digest sequence Bi that appears infinitely often. Let B = b1 , . . . , b y+1
(y ≥ 1) be such a sequence.
Remark Let us observe that there is no constraint on the number of internal states
of the writer. This means that all the sequences Ai can be different. It is possible,
with an infinite number of internal states, for the writer never to perform the same
sequence of base write operations twice. That is why each sequence Ai is replaced by
its digest Bi , in such a way that the set of possible digests is finite. This observation
is the main motivation for Lemma 26.
Let us observe that, due to the first item of Lemma 26, b1 is the value of REG just
before the invocation of an operation R.write(1). As the writer writes alternately 0
and 1 in R, this means that an invocation of R.read() all of whose readings of REG
return the value b1 has to return 0 as the value of R (i.e., b1 is one of the low-level
encodings that represent the value 0 of R).
The contradiction is provided by showing that b1 is also a low-level encoding
representing the value 1 of R.
Invocations of R.read(). An invocation r of R.read() is implemented as a sequence
of base operations that read REG. However, in our quest for a contradiction, we
restrict our attention to the scenarios in which each read of REG issued by a read
invocation r returns the same value. That value is denoted λ(r) (notice that λ(r)
is obtained from a decoding of the value of REG). So, we may assume that each
invocation r of R.read() issues a single read on the base regular register REG, and
that read returns λ(r).
Let r and r′ be two consecutive read invocations of R.read(). It is possible that
these read invocations be such that λ(r) = b j+1 , λ(r′) = b j (where b j+1 and b j
are two consecutive values of the digest B), and both r and r′ return 1. An example
Fig. 12.1 Two read invocations r and r′ concurrent with an invocation w2i+1 of R.write(1)
where this scenario can occur is when both r and r′ are concurrent with an invocation
w2i+1 of R.write(1), as shown in Fig. 12.1 (where λ(r) is used to represent both a
read invocation r and the value obtained from REG by that read invocation; a write
of REG that changes its value from a to b is denoted “from a to b”). More explicitly
we have the following:
• Let us first look at the constructed register R (i.e., on the side of the semantics
provided to the upper layer). As r and r′ are concurrent with a write operation
w2i+1 = R.write(1), and r returns the new value of R, the immediately following
read r′ also has to return the new value, as R is atomic (initial assumption).
• Let us now look at the base register REG (i.e., on the side of the implementation).
Due to the third item of Lemma 26, when the invocation w2i+1 of the R.write(1)
operation is executed there are two consecutive base write operations ωz and ωz+1
such that ωz writes b j in REG and ωz+1 writes b j+1 in REG. As the register REG
is regular, it is possible that the two consecutive base invocations (that read REG)
issued by r and r′ while REG is concurrently written by the base write invocation
ωz+1 (that modifies its value from b j to b j+1 ) be such that the first obtains the new
value b j+1 while the second obtains the old value b j .
The contradiction. Let us now consider a sequence S = r0 , r1 , . . . , r y of consecutive
invocations of R.read() such that λ(r0 ) = b y+1 , λ(r1 ) = b y , . . . , λ(r y ) = b1 , where
B = b1 , . . . , b y+1 is the digest sequence defined above. According to the previous
observation applied to the pair of consecutive operations r0 and r1 , we conclude that
both r0 and r1 must return the value 1. Applying the same reasoning to the sequence
of pairs (r1 , r2 ), (r2 , r3 ), . . . , (r y−1 , r y ) (notice that this is possible as the length of
the sequence B is y + 1, see Fig. 12.2), we conclude that r y has to return the value 1.
But λ(r y ) = b1 and, as observed above, a read invocation all of whose readings of
REG return b1 has to return 0 as the value of R: a contradiction, which completes
the proof.
From safe to regular registers The three base safe registers are initialized to the
initial value of R, i.e., 0. Then, as we will see, the read and write algorithms defining
the construction are such that any write applied to a base register X changes its
value. So, the sequence of successive values written into any base safe register is the
sequence 0, 1, 0, etc. Consequently, to simplify the writing of the construction and
to stress the fact that it is always updated to the “other” value, the writing of a new value
in the base register X is denoted “X ← (1 − X )”.
As any two consecutive write operations of a base bit X write different values, it
follows that the base register X behaves as a regular register. A read not concurrent
with a write returns the last written value, while a read concurrent with a write returns
0 or 1: whatever the value returned, it is the value before the write or a value that is
currently being written. It still remains possible for two consecutive reads of X that
overlap with the same write that the first obtains the new value of X while the second
obtains the previous value (new/old inversion).
Although the writer and the reader of a base register X are different processes,
we consider in the text of the read and write algorithms that the writer of X can also
read it. This is done without loss of generality as the writer of X can keep a local
copy and read it directly.
As previously suggested, the basic idea on which the construction relies consists
in using the pair of control bits (WR, RR) to pass information from the reader to
the writer and vice versa in order to prevent new/old inversions. To implement this
passing of control information, we choose to have the following scheme:
• To indicate that a new value was written, the writer makes WR different from RR.
• To indicate that a new value was read, the reader makes RR equal to WR.
(Let us notice that this choice is arbitrary in the sense that we could have chosen the
writer to make the registers equal and the reader to make them different.)
This algorithm is described in Fig. 12.3. The writer first updates the base register
REG to its new value (line 1). Then, it strives to establish WR ≠ RR to inform the
reader that a new value was written (line 2).
As we are about to see, the reader strives to establish RR = WR before reading
REG. This is how it informs the writer that it is about to read the regular bit REG
implementing R. This allows the writer to know that its previous value was read.
Interestingly, the reader itself uses the predicate RR = WR to check if the value it
has read from REG can be returned as the value of R.
The read algorithm is much more involved than the write algorithm. It is described
in Fig. 12.3. To make it easier to understand, this section adopts an informal and
incremental presentation.
The construction: step 1 As we have seen previously, before reading the base
register REG, the reader modifies RR (if needed) in order to obtain wr = RR where
wr represents the value it obtains from the regular register WR. Then, after it has read
the base register REG (let val be the value it has obtained), the reader reads again
WR (that meanwhile may have been modified by the writer) to check the predicate
WR = RR. If it is true, it returns the value val, as from its point of view, WR has not
changed between the two consecutive reads of it, and consequently REG should not
have been changed. This is described by the following sequence of statements (the
line numbers refer to the final version of the read algorithm).
12.2 A Construction of an Atomic Bit from Three Safe Bits 337
It is easy to verify that the register R constructed by the read operation just
described (lines 5–8 and line 10) and the write operation (lines 1–3) is regular. A read
of R that is not concurrent with a write of R returns the current value of R, and a read
of R that is concurrent with one or more writes of R returns 0 or 1. Unfortunately,
this construction does not build an atomic register R as it is still possible to have
new/old inversions.
The construction: final step A way to prevent new/old inversions is for the reader
to help itself by requiring some invocations of R.read() to help future invocations
of R.read() by saving in its local memory (namely in its local variable val, whose
scope is now the whole execution) a value previously obtained from REG. That value
can be returned in appropriate situations to prevent new/old inversions as far as R
is concerned. The final read algorithm is described in Fig. 12.3. More precisely, we
have the following:
• When it invokes R.read(), the reader first checks the predicate WR = RR (line 4).
If it is true, the reader considers that no invocation of R.write() was issued since its
last reading of REG. In that case, the reader returns the last value it has previously
obtained from REG, namely the value that was saved in the persistent local variable
val.
Let us observe that this allows some read operations to return without accessing
the data register REG.
• Before executing return(aux) (line 10) the reader reads REG and saves its value
in val (new line 9). This allows val to contain a “fresh” value of REG (as the
content of REG may have been changed since its previous reading).
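Since Fig. 12.3 itself is not reproduced in this text, the following sequential Python sketch reconstructs the construction from the description above and from the line numbers cited in the proofs; the exact numbering and statements are therefore assumptions, and the sketch ignores the concurrency that the real algorithm must tolerate (each base-bit access is treated as atomic).

```python
# Reconstructed sketch of the three-safe-bit construction (illustrative;
# the line numbers below are an assumption consistent with the description
# and with the cost analysis in the text).
class AtomicBit:
    def __init__(self):
        self.REG = 0      # safe bit carrying the data
        self.WR = 0       # control bit written by the writer
        self.RR = 0       # control bit written by the reader
        self.val = 0      # reader's persistent local variable

    def write(self, v):                  # assumes v differs from the last value
        self.REG = 1 - self.REG          # line 1: flip the data bit
        if self.WR == self.RR:           # line 2: establish WR != RR
            self.WR = 1 - self.WR        #         to signal a new value

    def read(self):
        if self.WR == self.RR:           # line 4: nothing new was written
            return self.val
        aux = self.REG                   # line 5
        self.RR = self.WR                # line 6: establish RR = WR
        self.val = self.REG              # line 7
        if self.WR == self.RR:           # line 8: WR unchanged, val is current
            return self.val
        self.val = self.REG              # line 9: keep a fresh value for later
        return aux                       # line 10
```

With this numbering, a worst-case read performs seven base-register accesses (lines 4–9, where line 6 both reads WR and writes RR) and a worst-case write performs three, matching the maximal costs (7 and 3) given in the text.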
Remark The reader should be conscious of the fact that the previous presentation
relies mainly on intuitive explanations. As the base registers are only safe bits (that
behave as regular bits), this kind of intuition could easily be flawed. That is why
the only way to be convinced that the construction is correct consists in providing a
proof.
As far as memory space is concerned, the cost of the construction is three SWSR
safe bits plus a permanent local variable (val).
As in previous chapters, the time complexity of the R.read() and R.write() oper-
ations is measured by the maximal and minimal numbers of accesses to the base
registers. Let us recall that the writer (reader) does not have to read WR (RR), as it
can keep a local copy of that register.
• R.write(v): maximal cost: 3; minimal cost: 2.
• R.read(): maximal cost: 7; minimal cost: 1.
The minimal cost is realized when the same type of operation (i.e., read or write) is
repeatedly executed while the operation of the other type is not invoked.
Let us remark that we have assumed that, if R.write(v) and R.write(v′) are two
consecutive invocations of the write operation, we have v ≠ v′. This means that, if
the upper layer issues two consecutive write invocations with v = v′, the cost of the
second one is 0, as it is skipped and consequently there is no access to a base register.
• A2: ∀r1 , r2 : (r1 → H r2 ) ⇒ π(r1 ) = π(r2 ) ∨ π(r1 ) → H π(r2 ) (no new/old
inversion).
This theorem states that an execution of an SWMR register is atomic if and only
if it is a feasible history (A0), no read invocation obtains an overwritten value (A1),
and there is no new/old inversion (A2). The forbidden situations are described in
Fig. 12.4 (r1 and r2 can be issued by the same process or different processes; the
important point is that r2 starts after r1 has terminated).
Proof The “only if” part follows directly from the fact that, if one of the assertions
A0, A1, or A2 is not satisfied, it is clearly impossible to construct a legal sequential
history S satisfying → H ⊆→ S .
Let H be an execution history at the level of the invocations of the operations
R.read() and R.write(). The proof of the “if” part consists in finding a total order S
on all operations in H that is legal and respects the partial order → H on the operation
invocations defined by H .
A sequential history S is built as follows. We start from the invocations of the
operation R.write() because, as there is a single writer, they provide a seed to build
a total order. We then add the invocations of the operation R.read() into this total
order in such a way that each write invocation w is immediately followed by the
read invocations r such that π(r) = w. Moreover, if π(r1 ) = π(r2 ) = w, we place
first the read invocation that started first (i.e., the one whose start event precedes the
start event of the other; let us remind that, at the event level, an execution history
is defined by a total order). This total order on the operation invocations provides a
sequential history S that (1) is legal (a read obtains the value of the last preceding
write invocation) and (2) is equivalent to H (as it is made up of the same operations
as H).
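The way S is built can be sketched with a small helper (illustrative only: invocations are represented by identifiers, and the read-from function π by a dictionary):

```python
# Illustrative sketch of the linearization built in the proof: writes keep
# the writer's order, and each read r is inserted just after the write
# pi(r) whose value it returns; reads of the same write keep their
# start-event order.
def build_S(writes, reads, pi):
    # writes: write ids in writer order; reads: read ids in start order;
    # pi: maps each read id to the write id it reads from
    S = []
    for w in writes:
        S.append(w)
        S.extend(r for r in reads if pi[r] == w)  # reads of w, in start order
    return S
```

By construction every read is placed immediately after the write it reads from, which is why the resulting sequential history is trivially legal.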
It remains to show that → H ⊆→ S . As H contains all completed invocations of H ,
→ H ⊆→ S will imply that → H is equivalent to → S . We consider the four possible
cases (w1 and w2 , and r1 and r2 , denote write and read operations, respectively).
1. w1 → H w2 . Due to the very construction of S (that considers the order on the
write invocations as imposed by the writer), we have w1 → S w2 .
2. r1 → H r2 . Due to A2, we have π(r1 ) = π(r2 ) or π(r1 ) → H π(r2 ).
Theorem 54 Let H be any execution history of an SWSR register R built from three
safe SWSR bits by Tromp’s construction (Fig. 12.3). H is linearizable.
12.3 Proof of the Construction of an Atomic Bit 341
Proof Let H be an execution history. Let us recall that < H denotes the total order
on its start and response events, while → H denotes the relation induced by < H on
its invocations of the operations R.write() and R.read() (see Chap. 4). To show that
H is atomic (linearizable) we show that it satisfies the assertions A0, A1, and A2
defined in the statement of Theorem 53. Then, the result follows as a consequence
of that theorem.
To distinguish the invocations of the operations R.read() and R.write() (that,
as previously, we denote r and w) from the read and write invocations on the base
registers (e.g., “RR ← (1 − RR)”, “aux ← REG”), the latter are called actions. The
history defined from the action start and response events is denoted L (< L denotes
the total order on its events and → L the corresponding relation induced on its
read/write invocations; without loss of generality, < L is assumed to contain all the
start and response events defining H ).
Moreover, r being an invocation of R.read() and loc the local variable (aux or val)
containing the value returned by r (at line 4, 8 or 10), ρr denotes the last read action
“loc ← REG” executed before r returns. More explicitly, we have the following:
• If r returns at line 10, ρr is the read action “aux ← REG” executed at line 5 of r,
• If r returns at line 8, ρr is the read action “val ← REG” executed at line 7 of r,
and finally
• If r returns at line 4, ρr is the read action “val ← REG” executed at line 7 or 9 by
a previous invocation of R.read().
As each base register (REG, RR, and WR) behaves as a regular register, each
read action ρr has a corresponding write action, denoted π(ρr ). Finally, given a read
invocation r and its associated read action ρr , π(r) denotes the invocation of R.write()
which includes the write action π(ρr ). This means that the value returned by the read
invocation r was written in the base register REG by the action “REG ← 1 − REG”
issued by the invocation of R.write() denoted π(r). For notational convenience we
say a ∈ A when a is an action of the operation A.
Proof of A0 ( H is an execution history). Let us first observe the following:
• As REG behaves as a regular register, a value that is read is a value that was
previously written or is concurrently written. It follows that the read action ρr
cannot precede the corresponding write action π(ρr ), which means that we have
inv[π(ρr )] < L resp[ρr ].
• Due to the very definition of r and ρr , we have resp[ρr ] < L resp[r ].
It follows that inv[π(ρr )] < L resp[r ], from which we conclude ¬(resp[r ] < L
inv[π(ρr )]).
Let us now reason by contradiction. Let us assume that A0 is violated; i.e., at the
level defined by the invocations r and π(r), we have r → H π(r). This translates at
the action event level as resp[r] < L inv[π(ρr )], which contradicts the property that
was previously established and completes the proof of A0.
is updating REG, from which we conclude that WR does not change between the
events r esp[ρr1 ] and inv[ρr2 ]. Let P denote that property. We consider three cases
according to the line at which r1 returns:
– r1 returns at line 8.
Then, ρr1 is the action “val ← REG” executed by r1 at line 7, and r1 sees
RR = WR at line 8. Since ρr1 → L ρr2 , r2 does not return at line 4 (for r2 to
return at line 4, we would need RR = WR at line 4 of r2 , which means that we
would then have ρr1 = ρr2 ). Consequently, r2 sees RR ≠ WR when it executes
line 4, and ρr2 is line 5 or line 7 of r2 . It follows that WR was modified between
ρr1 and ρr2 : a contradiction with property P, which proves the case.
– r1 returns at line 4.
In this case, ρr1 is line 7 or line 9 of the read invocation that precedes r1 . The
reasoning is the same as in the previous case. Since ρr1 → L ρr2 , r2 does not
return at line 4, from which we conclude that it sees RR ≠ WR when it executes
line 4. It follows that WR was modified between ρr1 and ρr2 , which contradicts
property P and proves the case.
12.4 Summary
This chapter has presented two fundamental results. The first is a theorem due to
Lamport stating that any bounded construction of an SWSR atomic bit from safe bits
requires that both the reader and the writer write into the shared memory. This means
that the reader has to send information to the writer to cope with the fact that the
shared memory is bounded. The second result is an optimal bounded construction of
an SWSR atomic bit from three safe bits. Combined with the constructions presented
in the previous chapter, this allows for bounded constructions of SWSR b-valued
atomic registers and unbounded constructions of MWMR atomic registers.
12.5 Bibliographic Notes
• The main theorem stating that a bounded construction of an SWSR atomic bit from
a bounded number of safe bits requires that the reader writes the shared memory
is due to L. Lamport [189]. The proof given here is inspired by [70].
• The first bounded construction proposed to build an atomic bit from safe bits was
due to L. Lamport in 1986. This construction is described in his seminal paper
[190].
• As indicated, the optimal construction of an atomic bit from three safe bits pre-
sented here is due to J. Tromp [265].
• A short survey on the construction of atomic registers can be found in [177].
12.6 Exercise
1. Considering the write algorithm described in Fig. 12.3 and the read algorithm
described in step 2 of the construction of an atomic bit (Sect. 12.2.4), show that
the register R that is built is not atomic. To that end, find an execution where there
are new/old inversions.
Chapter 13
Bounded Constructions
of Atomic b-Valued Registers
The previous chapter has shown how to construct an SWSR atomic bit from a bounded
number (three) of safe bits. Moreover, a bounded construction, due to K. Vidyasankar,
of a b-valued atomic register (i.e., a register that can take b different values) from
atomic bits was presented in Chap. 11. It is consequently possible to obtain an SWSR
b-valued atomic register from a bounded number of SWSR safe bits. However,
stacking these two constructions requires O(b) safe bits, i.e., a number of safe bits
linear with respect to the size of the value domain of the atomic register we want to
construct.
This chapter presents two efficient bounded constructions of an SWSR b-valued
atomic register from a constant number of atomic bits and a constant number of
b-valued safe registers, each made up of ⌈log2 b⌉ safe bits. As an atomic bit can be
built from three safe bits, these constructions (due to J. Tromp and K. Vidyasankar)
require only O(log2 b) safe bits and are consequently efficient.
As in the previous chapters, the read and write operations associated with the
constructed SWSR atomic b-valued register R are denoted R.read() and R.write(),
respectively.
13.1 Introduction
ones containing older values or the value that is currently written by the writer).
The bits of a buffer can be read and written in any order, and there is no guarantee
on the value obtained from a buffer when it is concurrently read and written.
• The control part consists of atomic bits which implement switches. These switches
are used to direct the writer to write in a given buffer and the reader to read from
a buffer which contains the last value that was written into R.
Pure versus impure buffers The two constructions presented in this chapter dif-
fer in the way they use buffers. The first uses pure buffers, while the second uses
impure buffers. A buffer is pure if it is never accessed concurrently by the reader
and the writer. Otherwise, it is impure. A pure buffer is also called a collision-free
buffer.
Space and time complexities of the constructions The two constructions exhibit
an interesting tradeoff between the number of buffers and the cost of a read (or write)
operation.
• The first construction is due to J. Tromp (1989). Its switching mechanism ensures
collision-freedom on the operation invocations issued on the same buffer. To that end it needs four buffers (i.e., 4 log2 b safe bits). As there is no read/write collision, an
invocation of R.read() or R.write() needs to access a single buffer only once. The
switch mechanism ensuring collision-freedom is implemented with four atomic
bits. Hence, its control part needs 4 × 3 = 12 safe bits.
It follows that Tromp's construction requires 4 log2 b + O(1) safe bits and that a read or write operation involves log2 b + O(1) bit accesses (log2 b accesses to
the safe bits of the single buffer that is accessed plus O(1) accesses to the safe bits
implementing the switch).
• The second construction is due to K. Vidyasankar (1990). It uses only three buffers,
each made up of log2 b safe bits. This construction does not prevent a read and a
write operation on the same buffer from colliding. To ensure that a read operation
returns a correct value, it requires the read and the write operations to read or write
one or two buffers according to the value of the switch which is made up of three
atomic bits (i.e., 3 × 3 = 9 safe bits).
So, the space complexity of Vidyasankar's construction is 3 log2 b + 9 safe bits, and the time complexity of a read or write operation can be up to 2 log2 b + O(1) (2 log2 b accesses to safe bits of the two buffers that are accessed plus O(1)
accesses for the safe bits of the switch).
Remark From a practical point of view, the techniques developed in this chapter
can be used to provide wait-free implementations of read/write objects such as a
shared file or a read/write web page. The shared memory or the disks supporting
the buffers implementing the file or the web page can be accessed without requiring
particular constraints (such as a predefined access order) on the way the blocks of
the disks or the words of the shared memory have to be accessed.
13.2 A Collision-Free (Pure Buffers) Construction
As just indicated the internal representation of the register R consists of four buffers
plus switches built from four atomic bits. The global structure is represented in
Fig. 13.1.
The four buffers The four buffers are kept as an array denoted BUF[0..1, 0..1].
Each buffer BUF[i, j] is initialized to the initial value of the constructed register R.
The buffers are divided into two groups, denoted 0 and 1. The group i ∈ {0, 1} is
composed of the buffers BUF[i, 0] and BUF[i, 1].
The switch The switch is actually a two-level switch composed of four atomic bits
denoted WR, RR, D[0] and D[1] (each initialized to 0). WR, D[0], and D[1] are
written by the writer to convey control information to the reader, while RR is written
by the reader to pass control information to the writer.
• The atomic bit RR denotes the buffer group from which the reader has to read.
• Similarly, the atomic bit WR denotes the last group in which a value was written.
• The atomic bits D[0] and D[1] indicate which is the last buffer written in the
corresponding group.
• To avoid collision in a buffer, the writer uses the switch (as defined by the values
of WR and RR). It first makes WR different from RR in order to have a chance
not to write in a buffer of the same group where it sees the reader. Then it writes
the buffer BUF[WR, D[WR]], and consequently D[WR] denotes the most recently
written buffer in the group WR.
• The reader on its side first makes RR equal to WR in order to read from the last
written group. It then reads the current value of the buffer BUF[RR, D[RR]].
Due to asynchrony, it is nevertheless possible that, despite the first-level switch
implemented by the atomic bits RR and WR, the reader and the writer access con-
currently the same group of buffers. The atomic bits D[0] and D[1] are then used
to implement a second-level switch which prevents the reader and the writer from
accessing simultaneously the same buffer of a group.
Hence, the core of the construction lies in the design of a two-level switch mech-
anism ensuring collision-freedom. As just suggested, the four atomic bits RR, WR,
D[0], and D[1] implement this switch. RR and WR are used to direct a process to a
given group i (0 or 1) of buffers, while the atomic bit D[i] is used inside the group i
to prevent collision in the same buffer in the case where both processes would have
been directed to the same group.
Local variables The writer process manages three local variables: wr, which is a local copy of WR, and d[0] and d[1], where d[wr] (wr ∈ {0, 1}) is a local copy of the direction D[wr]. On the reader side, rr is a local copy of RR. These local copies are
initialized to the same value as their original counterpart, and managed as described
in the read and write algorithms.
To get a better intuition of the construction, the reader (1) can first consider the
case where the read and the write algorithms do not execute concurrently and (2)
always reason as if, at any time, there is at most one access to an atomic bit (this
follows from the fact that accesses to atomic bits are linearizable).
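The two algorithms of Fig. 13.2 can be sketched in Python as follows. This is a reconstruction from the prose description below, not the book's code: the class and field names are mine, and the line-number comments only approximate the numbering of Fig. 13.2.

```python
# Hypothetical transcription of Tromp's construction (reconstructed from the
# prose; run sequentially here, so reads and writes never overlap).
class TrompRegister:
    def __init__(self, init):
        self.BUF = [[init, init], [init, init]]  # four safe buffers, two groups
        self.WR = 0                              # atomic bit: last group written
        self.RR = 0                              # atomic bit: group the reader reads
        self.D = [0, 0]                          # atomic bits: last buffer per group
        self.wr = 0                              # writer-local copy of WR
        self.d = [0, 0]                          # writer-local copies of D[0], D[1]
        self.rr = 0                              # reader-local copy of RR

    def write(self, v):                          # executed by the single writer
        if self.RR == self.wr:                   # line 1: reader may be in group wr
            self.wr = 1 - self.wr                # line 2: switch to the other group
            self.BUF[self.wr][self.d[self.wr]] = v   # line 3
            self.WR = self.wr                    # line 4: now WR != RR
        else:                                    # reader may be in group RR != wr
            other = 1 - self.d[self.wr]
            self.BUF[self.wr][other] = v         # lines 5-6: other buffer, same group
            self.d[self.wr] = other
            self.D[self.wr] = other              # line 7: d[WR] = D[WR] again

    def read(self):                              # executed by the single reader
        if self.WR != self.rr:                   # line 10
            self.rr = self.WR                    # line 11
            self.RR = self.rr                    # line 12: establish RR = WR
        dd = self.D[self.rr]                     # line 14
        return self.BUF[self.rr][dd]             # lines 15-16

R = TrompRegister(0)
R.write(5)
print(R.read())   # -> 5
R.write(7)
R.write(9)
print(R.read())   # -> 9
```

With a single process at a time, a read always returns the last written value; the point of the construction, established next, is that the switch also keeps the two processes out of the same buffer when they run concurrently.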
The algorithm implementing the operation R.write() The algorithm of the writer
is described in lines 1–9 of Fig. 13.2. At the end of a write, the writer manages to have WR ≠ RR, which is how it informs the reader that a new value was written. The aim of this control information, passed through the pair of atomic bits WR, RR, is to allow the next read invocation to obtain the last written value by reading BUF[WR, D[WR]].
The writer starts by reading RR (the group from which the reader is assumed to read) and compares it with WR (line 1), which is then equal to wr (this equality follows from the fact that only the writer writes WR, line 4). There are two
cases according to the value of the first-level switch made up of the atomic bits RR
and WR.
• RR = WR.
As (if it is reading) the reader is reading a buffer in the group RR, the first-level
switch indicates in that case that the reader is not accessing the other group (the
group 1 − RR). So, the writer simply writes to a buffer in that other group. To do
so, it first determines this new group (setting wr ← 1 − wr , line 2), then writes
into the target buffer BUF[wr, d[wr]] (line 3), and finally sets WR to its new value wr (line 4). The aim of this statement is also to have WR ≠ RR in order to indicate to the reader that a new value was written.
• RR ≠ WR.
In this case the current value of the first-level switch indicates that (if it is reading)
the reader is currently reading from the other buffer group (namely the RR group).
The writer writes accordingly in the same group as its previous write, but in the other buffer, namely BUF[WR, 1 − D[WR]] (lines 5–6). This is in order to prevent
a possible collision in the case where the reader would terminate its read and
start a new read accessing now the same buffer group WR. (This can occur due
to asynchrony. As process speeds are independent, several read invocations can
overlap the same write invocation as shown in Fig. 13.3.) The writer finally updates
D[wr ] to have again d[WR] = D[WR] (line 7). Let us observe that, in that case,
the writer does not modify WR.
The algorithm implementing the operation R.read() The algorithm for the oper-
ation R.read() is described in lines 10–16 of Fig. 13.2.
As WR denotes the last group where a value was written, the reader first establishes
RR = WR before reading, in order to obtain the number of the buffer group containing
the last value that was written (lines 10–13). As we have seen in the write algorithm,
that value is kept in the buffer BUF[WR, D[WR]]. As the reader is the only process that
The proof of Tromp’s construction is made up of two parts. The first part (addressed
in this section) consists in establishing the collision-freedom property, namely, that
there is no concurrent write and read operation on the same buffer. The second part
shows that the constructed register R is an atomic b-valued register.
From atomic bits to finite state automata Let us consider an execution at an
abstraction level involving only the read and the write operations on the base atomic
bits WR, RR, D[0], and D[1]. As they are atomic, this execution is linearizable (all
the read and write operations on these bits can be totally ordered, in such a way
that this total order respects their real-time occurrence order and a read of a register
obtains the last value written in this register).
Said in another way, the fact that the bits are atomic allows reasoning as if the
operations on these bits were executed one at a time. This means that we can associate
with each algorithm an automaton where a transition corresponds to a read or a write
of an atomic bit. A state of the associated automaton is then the code of the algorithm
contained between two successive transitions. The idea is then to compute the cross-
product of these automata to analyze all the possible behaviors that can be produced
by concurrent read and write operations. The global automaton obtained in that way
will then be used to show the collision-freedom property.
Write automaton The automaton associated with the write operation is described
in Fig. 13.4. It has three states (denoted w0 , w1 , and w2 ) and four transitions. These
states and transitions are described on the right part of Fig. 13.2. All the line numbers that follow refer to the algorithms described in Fig. 13.2.
Initially, or after it has completed a write operation, the writer is in the local state
w0 . Then, according to the value it reads from RR (line 1), it executes one out of two possible transitions. The reading of RR such that RR = WR constitutes the transition from w0 to w1 , while the reading of RR such that RR ≠ WR constitutes the transition
from w0 to w2 . Then, when it is in the state w1 , the writer writes the safe bits of the buffer BUF[1 − WR, D[1 − WR]] (line 3). When it is in the state w2 , it writes the safe bits of the buffer BUF[WR, 1 − D[WR]] (line 6). Finally, the transition from w1 to the initial state w0 is the writing of the atomic bit WR (line 4), while the transition from w2 to w0 is the writing of the atomic bit D[WR] (line 7).
Read automaton The automaton associated with the read operation is described
in Fig. 13.5. It has four states (denoted r0 , r1 , r2 , and r3 ) and four transitions.
Initially, the reader is in the local state r0 . Then, according to the value it reads from WR (this atomic read constitutes the first transition, line 10), it proceeds to the local state r1 if WR ≠ RR, or to the local state r2 if WR = RR. When it is in state
r1 , the reader modifies its local variable rr, and then executes the second transition, namely it updates RR to obtain WR = RR (line 12), which makes it progress to the local state r2 (line 13). Then, the atomic read of D[RR] (line 14) constitutes its next transition which makes it proceed to state r3 . While it is in that state (lines 15–16) the reader reads the safe bits of the buffer whose coordinates have been previously determined, i.e., BUF[RR, D[RR]]. After it has read that buffer, the reader returns to the local state r0 .
The composite automaton As each transition is an access to an atomic bit, and
all these accesses are linearizable, a global state of the system can be represented
(at this observation level) by the values of the atomic bits. More specifically, the
behavior of the whole system can be represented by a global automaton describing
all the possible global states and all the transitions (operations on an atomic bit) that
make the system progress from one global state to another global state.
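This cross-product analysis can be mechanized. The breadth-first exploration below encodes the two automata as just described and asserts that in no reachable global state are the writer and the reader inside the same buffer. The state encoding, tuple layout, and helper names are mine, reconstructed from the prose; the line numbers in the comments refer to Fig. 13.2.

```python
from collections import deque

# Global state: (writer state, reader state, WR, RR, D[0], D[1],
#                value the reader read from WR in r1, value read from D[RR] in r3).
INIT = ('w0', 'r0', 0, 0, 0, 0, None, None)

def writer_moves(s):
    wpc, rpc, WR, RR, D0, D1, cwr, cd = s
    if wpc == 'w0':                      # transition: atomic read of RR (line 1)
        yield (('w1' if RR == WR else 'w2'), rpc, WR, RR, D0, D1, cwr, cd)
    elif wpc == 'w1':                    # exit: atomic write WR <- 1 - WR (line 4)
        yield ('w0', rpc, 1 - WR, RR, D0, D1, cwr, cd)
    elif wpc == 'w2':                    # exit: atomic write D[WR] <- 1 - D[WR] (line 7)
        D = [D0, D1]; D[WR] = 1 - D[WR]
        yield ('w0', rpc, WR, RR, D[0], D[1], cwr, cd)

def reader_moves(s):
    wpc, rpc, WR, RR, D0, D1, cwr, cd = s
    if rpc == 'r0':                      # transition: atomic read of WR (line 10)
        yield ((wpc, 'r1', WR, RR, D0, D1, WR, cd) if WR != RR
               else (wpc, 'r2', WR, RR, D0, D1, None, cd))
    elif rpc == 'r1':                    # transition: atomic write of RR (line 12)
        yield (wpc, 'r2', WR, cwr, D0, D1, None, cd)
    elif rpc == 'r2':                    # transition: atomic read of D[RR] (line 14)
        yield (wpc, 'r3', WR, RR, D0, D1, cwr, [D0, D1][RR])
    elif rpc == 'r3':                    # exit: the buffer read is finished
        yield (wpc, 'r0', WR, RR, D0, D1, cwr, None)

def busy_buffers(s):
    wpc, rpc, WR, RR, D0, D1, cwr, cd = s
    D = [D0, D1]
    wb = ((1 - WR, D[1 - WR]) if wpc == 'w1'   # writing BUF[1-WR, D[1-WR]]
          else (WR, 1 - D[WR]) if wpc == 'w2'  # writing BUF[WR, 1-D[WR]]
          else None)
    rb = (RR, cd) if rpc == 'r3' else None     # reading BUF[RR, D[RR]]
    return wb, rb

seen, todo = {INIT}, deque([INIT])
while todo:
    s = todo.popleft()
    wb, rb = busy_buffers(s)
    assert wb is None or rb is None or wb != rb, ('collision', s)
    for t in list(writer_moves(s)) + list(reader_moves(s)):
        if t not in seen:
            seen.add(t); todo.append(t)
print(len(seen), 'reachable states; no collision found')
```

The exploration terminates because the global state space is finite; the assertion is exactly the collision-freedom property established by Lemma 27 below.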
The reader proceeds directly from r0 to r2 if it reads WR = RR. (Similarly, the global state ⟨w1 , r1 ⟩ cannot be attained from ⟨w0 , r1 ⟩ as then a reading of RR by the writer obtains RR ≠ WR, which entails the writer transition from w0 to w2 .)
Lemma 27 The write and read algorithms described in Fig. 13.2 ensure the collision-
freedom property (i.e., if a process accesses a buffer, the other process does not
concurrently access the same buffer).
Proof The global state automaton shows that collisions can only occur in the global states ⟨w1 , r3 ⟩ and ⟨w2 , r3 ⟩. We examine each case separately:
• Global state ⟨w1 , r3 ⟩.
As indicated in Fig. 13.2, the writer is then accessing the buffer BUF[1−WR, D[1−WR]] and the reader is accessing the buffer BUF[RR, D[RR]]. As RR = WR in the global state ⟨w1 , r3 ⟩, it directly follows that the reader and the writer access buffers in distinct groups.
• Global state ⟨w2 , r3 ⟩.
The writer then accesses the buffer BUF[WR, 1 − D[WR]], while the reader accesses the buffer BUF[RR, D[RR]]. There are two sub-cases:
– RR = WR. It follows directly that the reader and the writer access buffers in
distinct groups.
– RR = WR. The reader and the writer access buffers in the same group. The reader accesses then BUF[RR, D[RR]] (lines 14–15) while the writer accesses BUF[RR, 1 − D[RR]]. As D[RR] and 1 − D[RR] are different values, it follows
that the reader and the writer access distinct buffers.
The previous lemma has shown that each buffer taken separately behaves as an atomic
buffer (let us recall that a safe or regular register that is never accessed concurrently by
the reader and the writer behaves as if it was atomic). Unfortunately, this observation
is not sufficient to prove that the register R that is built is atomic. The buffer that is
accessed by the reader could contain an old value that would make R non-atomic.
Theorem 55 Tromp’s construction described in Fig. 13.2 builds an SWSR b-valued
atomic register.
Proof To prove that R is atomic we use Theorem 53, stated and proved in Chap. 12.
Let r, r1 , and r2 denote invocations of R.read(), and w denote an invocation of
R.write(). Let us first observe that, thanks to Lemma 27, we can define a reading
function π() such that each read invocation r is mapped to the write invocation π(r)
that wrote (into some buffer BUF[i, j]) the value v read by r. Theorem 53 states that
an execution history H of a register R is atomic if the three following properties are
satisfied:
• A0: ∀ r : ¬ (r → H π(r)) (no read obtains a value from the future).
• A1: ∀ r, w : (w → H r) ⇒ (π(r) = w) ∨ (w → H π(r)) (no read obtains an overwritten value).
• A2: ∀ r1 , r2 : (r1 → H r2 ) ⇒ (π(r1 ) = π(r2 )) ∨ (π(r1 ) → H π(r2 )) (no new/old inversion).
Proof of A0: ∀ r : ¬ (r → H π(r)) (no read obtains a value from the future). This is an immediate consequence of the definition of the function π() given above. (Recall that the definition of π() rests on Lemma 27.)
Proof of A1: ∀ r, w : (w → H r) ⇒ (π(r) = w) ∨ (w → H π(r)) (no read obtains an overwritten value).
The proof is by contradiction. Let us assume that there is a write invocation w such that π(r) ≠ w and π(r) → H w → H r.
Let us assume without loss of generality the following context: (1) the operation w writes the buffer BUF[0, 0] and (2) D[1] = 0 when the event resp[w] occurs. (If D[1] = 1 when the event resp[w] occurs, the cases 1 and 2 remain unchanged in the case analysis that follows, while the cases 3 and 4 have to be exchanged.) It follows from this context and the write algorithm that, when resp[w] occurs, w has just written into BUF[WR, D[WR]], which means that we then have WR = 0 and D[0] = 0. We consider the four possible buffers from which r can read:
1. The read invocation r reads from BUF[0, 0].
According to the definition of π(), this means that π(r) = w or w → H π(r),
contradicting the initial assumption. So, this case is eliminated.
2. The read invocation r reads from BUF[0, 1].
In this case (as the reader reads BUF[RR, D[RR]]), we have RR = 0 and D[RR] = D[0] = 1, from which we can conclude that the writer has changed D[0] from 0 to 1 between the event resp[w] and the reading of D[0] by the reader (notice that
D[0] is updated only by the writer). But, according to the write algorithm, this
update of D[0] is done at line 7, i.e., just after the writer has written in BUF[0, 1],
which implies that w → H π(r), leading to a contradiction which eliminates this
case.
The last two cases are similar. We only need to show that the buffer read by the
read invocation r was written after the write invocation w.
3. The read invocation r reads from BUF[1, 0].
In this case, as the reader reads BUF[RR, D[RR]], we have RR = 1. We can then
conclude from lines 10 and 12 that the reader has previously read 1 from WR.
This means that the writer has changed WR from 0 to 1. As this change can be
done only at line 4, we conclude that the writer has previously written the buffer
BUF[WR, D[WR]], i.e., BUF[1, 0]. It follows that w → H π(r), leading to a
contradiction and eliminating this case.
As already indicated, the construction uses three SWSR b-valued buffers each made
up of log2 b safe bits and three SWSR atomic bits that implement a switch mech-
anism:
• The three b-valued buffers, denoted BUF[0], BUF[1], and HELP_BUF, are written
by the writer and read by the reader. They are initialized to the initial value of the
constructed register R.
• The three SWSR atomic bits implementing the switch are denoted RR, WR, and
LAST (they are all initialized to 0). RR is written by the reader, while WR and
LAST are written by the writer.
Incremental construction The way the atomic bits WR and RR and the b-valued
buffer HELP_BUF are managed constitutes the core of the construction. We present
their management in an incremental way. The first step focuses on the management
of WR and RR, while the second step focuses on the management of HELP_BUF.
To simplify the presentation, we allow the reader to read RR and the writer to read
WR and LAST . (This is done without loss of generality, as a process always knows
the exact value of an atomic bit for which it is the only writer.)
A first step towards the read and write algorithms As indicated previously, the
writer first writes the new value v in the buffer BUF[1 − LAST ] and then changes
accordingly the value of LAST . Moreover, it indicates that a new value was written
by setting WR = RR. This corresponds to the following sequence of statements:
1 BUF[1 − LAST ] ← v;
2 LAST ← (1 − LAST );
3 r ← RR;
4 if WR ≠ r then WR ← r end if.
The reader first sets RR different from WR to indicate that a read is currently
executing, and reads BUF[LAST ]. Then it checks for a possible collision. To attain
this goal, a “no-collision” predicate has to be defined. This predicate has to always answer false when there is a collision (moreover, it can be conservative in the sense that it may answer false in some cases where there is no collision).
Such a no-collision predicate can be defined from RR and WR. More specifically, it is RR ≠ WR. It follows intuitively from the following observation. When it is true, it means that, since the time the reader has set RR different from WR (line 1 of
the read operation described below), the writer has not accessed WR, otherwise we
would have WR = RR (see line 4 of the write operation). So, when the predicate is
true, between line 1 and line 4 of the read operation, LAST contains the number of
the last buffer that was written (its value has then been determined at line 2 of the
write operation). The proof will show that this intuition is correct.
Let us observe that, if the predicate is false, we can conclude that a write operation
is concurrent with the current read. Let us also notice that the “no-collision” predicate
is not a “no-concurrency” predicate: a write operation can be concurrent with a read
operation that evaluates the predicate to true. We finally obtain the read operation
described below:
• If RR ≠ WR, the predicate indicates that there was no collision on BUF[LAST ]. The value of BUF[LAST ] is then saved in a local variable denoted prev and returned. This no-collision case is indicated by the bit 1 returned with the value.
• If RR = WR, a collision was possible. In this case, the returned value is the most recent value whose reading was detected as being collision-free, i.e., the value prev. This is indicated by the bit 0 returned with the value.
1 RR ← (1 − WR);
2 val ← BUF[LAST ];
3 wr ← WR;
4 if RR ≠ wr then prev ← val; return(val, 1)
5 else return(prev, 0)
6 end if.
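Transcribed directly into Python (the class wrapper and the sequential driver at the end are mine, not the book's), the two algorithms read as follows; with no concurrency, the no-collision branch is always taken:

```python
# Sketch of the first-step construction; line-number comments refer to the
# pseudocode above.
class FirstStep:
    def __init__(self, init):
        self.BUF = [init, init]   # two b-valued safe buffers
        self.LAST = 0             # atomic bit: last buffer written
        self.WR = 0               # atomic bit, written by the writer
        self.RR = 0               # atomic bit, written by the reader
        self.prev = init          # reader-local: last collision-free value

    def write(self, v):
        self.BUF[1 - self.LAST] = v        # line 1
        self.LAST = 1 - self.LAST          # line 2
        r = self.RR                        # line 3
        if self.WR != r:                   # line 4: establish WR = RR
            self.WR = r

    def read(self):
        self.RR = 1 - self.WR              # line 1: establish RR != WR
        val = self.BUF[self.LAST]          # line 2
        wr = self.WR                       # line 3
        if self.RR != wr:                  # line 4: no-collision predicate
            self.prev = val
            return (val, 1)
        return (self.prev, 0)              # line 5: possible collision

R = FirstStep(0)
R.write(5)
print(R.read())   # -> (5, 1): no concurrent write, the last value is returned
R.write(7)
print(R.read())   # -> (7, 1)
```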
It is easy to show that the register R built by the previous read and write algorithms
satisfies the following properties:
• It is safe: an invocation of R.read() executed without concurrent write invocations
returns the last value that was written in R.
• The value returned by an invocation of R.read() is a value that was written. It
follows that the register is more than a safe register, as it never returns an arbitrary
value in the presence of concurrency.
• Due to the use of the prev local variable, a returned value is never older than the previously returned value. In this sense, the previous construction prevents new/old inversions.
But unfortunately this construction does not provide an atomic register R. It does
detect collision and prevent new/old inversion, but an invocation of R.read() can
return an older value than the last value written before that read invocation. This
can occur when consecutive read invocations detect possible collisions as depicted in Fig. 13.8. The first read invocation sets prev to v1, and then, as each of the following read invocations detects a collision, it returns the current value of prev without modifying it.
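The scenario of Fig. 13.8 can be replayed deterministically. In the sketch below (my own encoding of the algorithms above, not the book's code), the read operation is a Python generator yielding after each base operation, so that a complete write can be interposed inside a read; the second read starts after the write of v2 has completed, yet returns the older value v1:

```python
class SteppedRegister:
    def __init__(self, init):
        self.BUF = [init, init]
        self.LAST = self.WR = self.RR = 0
        self.prev = init

    def write(self, v):                    # the write algorithm, run in one step here
        self.BUF[1 - self.LAST] = v
        self.LAST = 1 - self.LAST
        r = self.RR
        if self.WR != r:
            self.WR = r                    # establish WR = RR

    def read_steps(self):                  # one yield per base operation of a read
        self.RR = 1 - self.WR              # line 1: establish RR != WR
        yield
        val = self.BUF[self.LAST]          # line 2
        yield
        wr = self.WR                       # line 3
        yield
        if self.RR != wr:                  # line 4: no-collision predicate
            self.prev = val
            yield (val, 1)
        else:
            yield (self.prev, 0)           # line 5: return the saved value prev

R = SteppedRegister('v0')
R.write('v1')
r = R.read_steps()
next(r); next(r); next(r)
result1 = next(r)     # clean read: ('v1', 1); prev is now 'v1'
R.write('v2')         # this write completes before the next read starts
r = R.read_steps()
next(r)               # the read executes line 1 only ...
R.write('v3')         # ... a concurrent write re-establishes WR = RR ...
next(r); next(r)
result2 = next(r)     # ... so the read returns ('v1', 0), older than v2
print(result1, result2)
```

The second read returns v1 even though the write of v2 terminated before the read started, which is exactly the violation of atomicity discussed above.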
Proof Let us first observe that the atomic bit RR is read only by the writer and written only by the reader. It is always updated to 1 − WR (line 8), i.e., the reader establishes RR ≠ WR each time it starts reading. (Let us notice that, if WR has not been modified since its last reading at line 8, this update of RR does not modify its value.)
The atomic bit WR is read only by the reader and written only by the writer. Its new value is then the value of RR (lines 3–5). More explicitly, the writer always establishes WR = RR when it terminates a write invocation. It follows that WR then takes the value RR = 1 − WR′, where WR′ denotes its previous value (see Fig. 13.10).
Lemma 30 Let a, b, c, and d be four operation invocations of H such that b and c are invocations on an atomic register. We have (a → H b → S_ab c → H d) ⇒ (a → H d).
Proof Due to Lemma 28, we can conclude from b → S_ab c that either b → H c or the invocations b and c are concurrent in H . This means that (a) the event inv[b] occurs before the event resp[c] (i.e., it is not the case that c terminates before b starts; see Fig. 13.11). As a → H b, (b) the event inv[b] occurs after the event resp[a]. Similarly, as c → H d, (c) the event resp[c] occurs before the event inv[d].
Combining (a), (b), and (c), we conclude that the event resp[a] occurs before the event inv[d], i.e., a → H d.
Lemma 31 Let w be an invocation of R.write() that writes WR. This invocation
w sets WR = RR. Moreover, if there is an invocation r of R.read() such that the
base operation w-write-WR occurs in between r-1read-WR and r-2read-WR, then
the equality WR = RR continues to hold until r terminates.
Proof Let wi (i ≥ 1) be the ith invocation of R.write() that writes the atomic bit
WR. The proof is by induction. It uses the fact that the base operations concerned are operations on atomic bits (and we can consequently base our reasoning on the associated linearization order).
Let us first consider w1 . Due to Lemma 29 and the fact that WR is initialized to 0,
we conclude that w1 writes 1 into WR. It has consequently read 1 from RR. As RR
is initialized to 0, it follows that there is an invocation of R.read() that has written
1 into RR. Let r1 be the first of these read operations (in case there are several such
read invocations). It follows from the previous observation that r1 -write-RR → S_ab
w1 -read-RR. Let us also notice that, as the base operation r1 -write-RR writes 1 in
RR, we can conclude that the base operation w1 -1read-WR reads 0 from WR.
Let us consider any subsequent invocation r′ of R.read() (if any) that performs r′-1read-WR before w1 -write-WR (i.e., such that r′-1read-WR → S_ab w1 -write-WR). Due to Lemma 29 and to the fact that the base invocation w1 -write-WR writes 1 in WR, we can conclude that such a read invocation r′ also reads 0 from WR and consequently the base operation r′-write-RR writes 1 in RR. (Let us observe that this is irrespective of the linearization order of the base operations r′-write-RR and w1 -read-RR.) This means that these read invocations r′ write in RR the same value as the read invocation r1 , namely the value 1.
It follows that, after w1 has executed w1 -write-WR, we have RR = WR = 1.
Moreover, if w1 -write-WR occurs in between r-1read-WR and r-2read-WR of some
read operation r, then the equality WR = RR continues to hold until the next read
operation that writes RR, i.e., at least until r terminates (this follows from the fact
that any other write operation does not modify WR because we have then WR = RR).
This completes the proof for the base case, namely for the first write invocation (w1 )
which writes WR.
Assume, as the induction hypothesis, that the assertion holds from w1 until wi . The same reasoning as for the base case shows that the assertion also holds for wi+1 .
Lemma 32 Let r be an invocation of R.read(). (a) There is at most one write invo-
cation w such that w-write-WR occurs in between r-1read-WR and r-2read-WR. (b)
If there is no invocation w of R.write() such that w-write-WR occurs in between
r-1read-WR and r-2read-WR, then the value v returned by r is a value obtained from
BUF[0] or BUF[1]. If there is such a write invocation w, then the value v returned
by r is a value obtained from HELP_BUF, and that safe register was written by w.
Proof The proof of (a) is an immediate consequence of Lemma 31 that states that,
if there is a write invocation w such that w-write-WR occurs in between r-1read-WR
and r-2read-WR, then the equality WR = RR holds until the end of the read operation
r. As WR = RR until the end of r, it follows from the test of line 4 that no other write invocation can write WR until the end of r, which proves item (a).
The proof of item (b) follows directly from the code of the read algorithm. If WR is
not modified between r-1read-WR and r-2read-WR, these base read invocations return
the same value from WR. It then follows from line 11 that the value returned by the
read operation r comes from BUF[0] or BUF[1]. If WR is modified between r-1read-
WR and r-2read-WR, it follows from (a) that there is at most one write invocation
w that writes WR, and the base read invocations r-1read-WR and r-2read-WR return
different values from WR. It follows from the test of line 11 that the value returned
by the read invocation r comes from HELP_BUF. Moreover, due to lines 4–5, the
invocation w has written into HELP_BUF before it writes WR, which concludes the
proof of the lemma.
The next lemma shows that, although the b-valued buffers BUF[0], BUF[1], and
HELP_BUF are only safe registers, if one of their values is returned by an invocation
r of R.read(), that value is a value that was written by an invocation w of R.write().
Lemma 33 Let B be the b-valued base buffer whose value was returned by an
invocation r of R.read() (B is BUF[0], BUF[1], or HELP_BUF). When the reader
was reading B, the writer was not writing B.
Proof We consider two cases. Let us first examine the case where there is a write
invocation w such that w-write-WR occurs in between r-1read-WR and r-2read-WR.
Due to item (b) of Lemma 32, r returns a value read from the base b-valued buffer
HELP_BUF that was written by w. Let us notice that, due to the code of the write
algorithm, we have w-write-HELP_BUF → H w-write-WR. Similarly, due to the
code of the read algorithm, we have r-2read-WR → H r-read-HELP_BUF. Moreover,
we also have w-write-WR → S_ab r-2read-WR (case assumption). Combining these
relations we obtain w-write-HELP_BUF → H w-write-WR → S_ab r-2read-WR → H
r-read-HELP_BUF.
Using now Lemma 30, we obtain w-write-HELP_BUF → H r-read-HELP_BUF,
which shows that the writing of HELP_BUF by w and its reading by r are not
concurrent. Finally, as the equality RR = WR holds until r terminates (Lemma 31), no subsequent write invocation writes HELP_BUF until r completes, i.e., until it has finished reading HELP_BUF.
Let us now examine the case where there is no invocation w of R.write() such that
w-write-WR occurs in between r-1read-WR and r-2read-WR. Due to Lemma 32, the
read operation r returns a value that it has obtained by reading BUF[0] or BUF[1].
Without loss of generality let us assume that LAST = 0 when r reads it (line 9). This
means that r reads from the safe b-valued register BUF[0].
Let us observe that BUF[0] either contains the initial value or was previously
written by some write operation w0 ; this is because a write operation first writes
BUF[0] and only then updates LAST to 0 (lines 1–2), that is, to the value of LAST
subsequently read by r. If there is no other write operation, r reads BUF[0] after it
was written and consequently the writer is not writing BUF[0] when the reader is
reading it, which proves the lemma.
• The value returned by r comes from HELP_BUF. In that case, due to the second item of Lemma 32, there is an invocation w′ of R.write() whose base invocation w′-write-WR occurs between r-1read-WR and r-2read-WR, from which we conclude that w′ and r are concurrent. Moreover, due to the last part of Lemma 32 and Lemma 33, the value returned by r is the value written by w′. It follows that π(r) = w′, which proves the regularity of R.
R is atomic. As R is regular, we show it is atomic by proving that there is no
new/old inversion. (The atomicity follows then directly from Theorem 43 of Chap. 11
that proved that an atomic register is a regular register with no new/old inversion.)
Let r1 and r2 be two invocations of R.read() such that r1 → H r2 , and let π(r1 ) = w.
We show that π(r2 ) is either w or a later write.
Let us first observe that, if r2 is not concurrent with w, due to the regularity of R,
π(r2 ) is either w or a later write. So, we assume in the following that r2 is concurrent
with w. Let us notice that r1 is then also concurrent with w, otherwise we would have
w → H r1 → H r2 , contradicting the fact that w and r2 are concurrent (Fig. 13.12).
Moreover, if r2 is concurrent with several write operations, w is the first of them. As
previously, let us assume without loss of generality that w writes BUF[0].
• If r1 returns the value from BUF[0], we have from the code of the algorithms and
the atomicity of LAST : w-write-BUF[0] → H w-write-LAST → S_ab r1 -read-LAST
→ H r1 -read-BUF[0].
• If r1 returns the value from HELP_BUF, we have from the code of the algorithms,
the atomicity of LAST and WR, and Lemma 32: w-write-LAST → H w-write-
HELP_BUF → H w-write-WR → S_ab r1 -2read-WR → H r1 -read-HELP_BUF.
As r1 → H r2 , we have w-write-LAST → S_ab r2 -read-LAST in both cases.
The write invocation w writes 0 in LAST (assumption), but the read invocation r2
may read 0 or 1 from the atomic bit WR (this can occur when r2 is concurrent with
other write operations that are issued after w). Whatever the value (0 or 1) returned by
r2 -read-LAST , due to w-write-LAST → S_ab r2 -read-LAST , we have the following:
• If r2 returns a value read from BUF[0] or BUF[1], that value has necessarily been
written by w or a successor of w, which proves the case.
• If r2 returns the value read from HELP_BUF, due to the second item of Lemma
32, that value was written by a write invocation w′ such that the base invocation
w′-write-WR is between r2-1read-WR and r2-2read-WR. Since w is the first write
that is concurrent with r2, it follows that w′ is w or a subsequent write invocation.
This completes the proof of the atomicity of R.
In a very interesting way, the previous construction, which builds an SWSR atomic
b-valued register R, can be easily extended to build an SWMR atomic register R′.
This simple construction is as follows.
The base b-valued safe buffers BUF[0] and BUF[1], together with the atomic bit
LAST , are used exactly as in the basic algorithms described in Fig. 13.9. The only
difference lies in the fact that now the writer explicitly takes into account that there are
n readers p1 , . . . , pn and executes the base write algorithm with respect to each of
them.
To that end, the control bits WR and RR and the additional b-valued safe
buffer HELP_BUF are replaced by the arrays WR[1..n], RR[1..n], and HELP_BUF
[1..n], each with one entry per reader process pi . More specifically, for each couple
made up of the writer and a reader process pi , the base registers WR[i], RR[i], and
HELP_BUF[i] replace the base registers WR, RR, and HELP_BUF used in the basic
SWSR construction.
The resulting SWMR construction is described in Fig. 13.13. It uses 2n + 1
atomic bits (WR[1..n], RR[1..n], and LAST ) and n + 2 b-valued safe buffers
(HELP_BUF[1..n], BUF[0], and BUF[1]) whose size is log2 b bits. The cost of
a read operation is the same as in the basic SWSR construction, while the cost of a
write operation now depends on n, the number of readers. The proof of this SWMR
construction is the same as the proof of the basic SWSR construction.
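The register count stated above (2n + 1 atomic bits and n + 2 safe buffers) can be tallied in a few lines. This is an illustrative sketch with names of our own, not part of the book's construction:

```python
import math

def swmr_cost(n, b):
    """Tally the base objects used by the SWMR construction for n readers
    and b-valued buffers (illustrative accounting only)."""
    atomic_bits = 1 + n + n                  # LAST, WR[1..n], RR[1..n]
    buffers = 2 + n                          # BUF[0..1], HELP_BUF[1..n]
    buffer_bits = math.ceil(math.log2(b))    # size of one b-valued safe buffer
    return atomic_bits, buffers, buffer_bits

bits, bufs, width = swmr_cost(n=4, b=256)
assert (bits, bufs, width) == (2 * 4 + 1, 4 + 2, 8)
```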
368 13 Bounded Constructions of Atomic b-Valued Registers
13.4 Summary
This chapter has presented two efficient constructions that build an SWSR b-valued
atomic register from a few atomic bits and safe buffers of log2 b bits. The atomic bits
are used to implement switches directing the writer and the reader to an appropriate
buffer to read or write a value.
These constructions differ in their underlying principles. The first one, due to
J. Tromp, ensures that the reader and the writer never access the same buffer simul-
taneously. The second one, due to K. Vidyasankar, allows conflicts in a buffer but
can direct the reader or the writer to sequentially access two buffers.
• The notions of safe register, regular register, and atomic register are due to
L. Lamport [189, 190] who has also presented (in these papers) a suite of algo-
rithms that allow the construction of an MWMR b-valued atomic register from
SWSR safe bits.
• As already indicated, the first construction presented in this chapter is due to
J. Tromp [265].
• As indicated, the second construction presented in this chapter is due to
K. Vidyasankar [270]. Its starting point is a construction due to G.L. Peterson
[225].
• Numerous other papers have presented constructions of SWSR b-valued atomic
registers from “lower-level” registers, e.g., [52, 59, 72, 132, 133, 168, 177, 178,
184, 190, 196, 219, 225, 257, 269, 272] to cite a few.
Part VI
On the Foundations Side:
The Computability Power of Concurrent
Objects (Consensus)
This chapter introduces the notion of a universal object and a universal construction
and shows that the consensus object is universal. To that end two consensus-based
universal constructions (which rest on different principles) are presented. This
chapter shows also that binary consensus is as powerful as multi-valued consen-
sus, hence binary consensus is universal.
The universality notions developed and used in this chapter concern the synchroniza-
tion power of concurrent objects in the presence of asynchrony and any number of
process crashes. Hence, they address the computability power of concurrent objects.
(Fig. 14.1: Universal construction.)
Let us recall that, as captured in Theorem 15 (Chap. 4), atomicity per se does not
entail that a pending operation invocation has to wait for another operation invocation.
Wait-freedom is the strongest liveness condition that can be associated with mutex-
free implementations. Given an object O, it states that any invocation of an operation
on O issued by a non-faulty process has to terminate; i.e., the corresponding invo-
cation has to terminate whatever the behavior of the other processes (which can be
slow or even crashed). It is easy to see that only objects with total operations can be
wait-free implemented.
Universal (synchronization) object and universal construction An object A of
type Ta , or more generally a type Ta , is universal if any object Z whose type Tz is
defined by a sequential specification on total operations can be wait-free implemented
from (any number of) atomic read/write registers and objects of type Ta .
Any wait-free algorithm implementing such a construction is called a universal
construction. The structure of a universal construction is depicted in Fig. 14.1.
The binary consensus object was introduced in Sect. 6.3.1. We extend here the definition
to multi-valued consensus. It will be shown in the rest of this chapter that consensus
objects are universal objects.
Definition A consensus object provides a single operation denoted propose(v),
where v is the value proposed by the invoking process. While only the values 0
and 1 can be proposed by processes to a binary consensus object, any value can be
proposed to a multi-valued consensus object.
An invocation of propose() returns a value which is said to be the value decided
by the invoking process. A process can invoke the operation propose() at most once
(hence, a consensus object is a one-shot object). Moreover, any number of processes
can invoke this operation. A process that invokes propose() is a participating process.
The object is defined by the following properties. Let us recall that a process is correct
in a run if it does not crash in that run; otherwise, it is faulty.
• Validity. A decided value is a proposed value.
• Integrity. A process decides at most once.
• Agreement. No two processes decide different values.
• Termination. An invocation of propose() by a correct process terminates.
14.1 Universal Object, Universal Construction, and Consensus Object 373
The validity property relates the output to the inputs (a value that has not been
proposed cannot be decided). The integrity property states that a decision is irrevo-
cable. The agreement property defines the coordination power of a consensus object:
no two processes can decide differently (in that sense, a consensus object solves
non-determinism among the proposed values by selecting any but only one of them).
Finally, the termination property states that the implementation has to be wait-free.
A consensus object is a one-write register Let ⊥ denote a default value that cannot
be proposed by a process. A consensus object C can be seen as maintaining an internal
state variable X initialized to ⊥. The effect of an invocation of propose(v) can be
described by the atomic execution of the following code:
if (X = ⊥) then X ← v end if; return (X).
As we can see, a consensus object can be defined from a sequential specification.
The operation propose() is a combination of a conditional write operation followed
by a read operation. More generally, a consensus object can be seen as a one-write
register that keeps forever the value proposed by the first invocation of propose().
Then, any subsequent invocation of propose() returns the value that was written. As
a consensus object is a concurrent atomic object, if a process crashes while executing
propose(), everything appears as if that invocation had been executed entirely or not
at all.
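The conditional-write behavior just described can be sketched in Python as follows. The lock merely stands in for the atomicity assumed of the consensus object; this is an illustrative model, not a wait-free implementation:

```python
import threading

class Consensus:
    """One-shot consensus object: the first proposed value is decided.
    The lock models the assumed atomicity of propose(); this sketch is
    illustrative, not a wait-free implementation."""
    _BOTTOM = object()          # the default value ⊥ (cannot be proposed)

    def __init__(self):
        self._x = Consensus._BOTTOM
        self._lock = threading.Lock()

    def propose(self, v):
        # atomically: if (X = ⊥) then X ← v end if; return (X)
        with self._lock:
            if self._x is Consensus._BOTTOM:
                self._x = v
            return self._x

c = Consensus()
assert c.propose(3) == 3    # first invocation: its value is decided
assert c.propose(7) == 3    # later invocations return the decided value
```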
Let Z be the object for which we want to build a wait-free implementation. This
object is defined by a sequential specification that will be the input of a universal
construction (see Fig. 14.1).
On the sequential specifications of the object Z The object Z is defined as a
type that consists of a bounded set of m total operations op1 (param1 , res1 ), . . .,
opm (paramm , resm ) and a sequential specification.
In the following, op(param, res) is used to denote any of the previous operations.
Each operation has a (possibly empty) set of input parameters (param) and returns
a result (res). “Sequential specification” means that, given any initial state s0 of Z,
its behavior can be defined by the set of all the sequences of operation invocations
where the output res of each invocation of an operation op() is entirely determined
by the value of its input parameters param and the operation invocations that precede
it in the corresponding sequence.
Alternatively, the sequential specification can also be defined by associating a pre-
assertion and a post-assertion with each operation. Assuming one operation at a time
is executed on the object, the pre-assertion describes the state of the object before the
374 14 Universality of Consensus
operation while the post-assertion defines the result output by the operation and the
new state of the object resulting from that operation execution. (Let us notice that,
as the operations defining Z are total, all pre-assertions are always satisfied and are
consequently equal to true.)
The sequential specification of Z used in the universal construction A sequence
of operation invocations on an object can be abstracted as an object state, and accord-
ingly, the semantics of each operation is defined by a transition function denoted δ().
More precisely, s being the current state of the object, δ(s, oper(param)) returns
a pair ⟨s′, res⟩ from a finite non-empty set of pairs {⟨s[1], res[1]⟩, . . . , ⟨s[x], res[x]⟩}.
Each pair of this set defines a possible output, where s′ is the new state of the object
and res is the output value returned to the calling process.
If, for any δ(s, oper(param)), the set {⟨s[1], res[1]⟩, . . . , ⟨s[x], res[x]⟩} contains a
single pair, the object is deterministic. Otherwise it is non-deterministic.
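As an illustration, here is a possible δ() for a small deterministic object, a FIFO queue with the two total operations enq(v) and deq() (deq() returns None on an empty queue, which makes it total). The function name and state encoding are our own; every returned set is a singleton, so the object is deterministic:

```python
# δ returns the set of possible ⟨state, result⟩ pairs; states are tuples.
def delta(state, op, param=None):
    if op == "enq":
        return {(state + (param,), "ok")}
    if op == "deq":
        if not state:
            return {(state, None)}      # defined even on the empty queue
        return {(state[1:], state[0])}  # dequeue the oldest element
    raise ValueError(op)

assert delta((), "enq", 5) == {((5,), "ok")}
assert delta((5, 7), "deq") == {((7,), 5)}
# deterministic: δ(s, op) is always a singleton set
assert all(len(delta(s, op, 0)) == 1 for s in [(), (1,)] for op in ("enq", "deq"))
```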
by the processes. It is then shown how atomic registers are used to ensure that no
invocation of an operation issued by a correct process remains pending forever. As
we will see, while the construction is wait-free, it is not bounded. This is because,
when counting the number of operation invocations on the base objects that constitute
the internal representation of Z, it is not possible to bound the period that elapses
between the time an operation op() starts (event inv[op]) and the time a result is
returned (event resp[op]).
The presentation concentrates first on the case of deterministic objects. It then
addresses non-deterministic objects in Sect. 14.3.3.
applied to that machine are applied in the same order on the copy managed by each
server.
The concurrent object Z is here the state machine, and the operation invocations
issued by the processes are the commands it has to execute. More precisely, we have
the following:
• Each local server manages a local copy si of the constructed object Z. This copy
si is initialized to the initial value of Z.
• In order that the servers apply the operation invocations in the same order to their
local copies of Z, they cooperate through objects kept in the shared memory.
The global structure is described in Fig. 14.2.
Step 1 of the construction: use consensus objects to define a total order To apply
the operation invocations in the same order to their local copies, the server threads
are implemented by a background task at each process pi . This task is the infinite
loop described below (as before, the line numbers correspond to the line numbers of
the final construction described in Fig. 14.3).
(5) while (true) do
(12) if (propi ≠ ⊥) then
(13) ki ← ki + 1;
(14) execi ← CONS[ki ].propose(propi );
(16) ⟨si, res⟩ ← δ(si, execi.op);
(17) let j = execi.proc;
(19) if (i = j) then propi ← ⊥; resulti ← res end if
(21) end if
(22) end while.
When propi ≠ ⊥, the task discovers that a new operation invocation has locally
been issued (line 12). So, in order both to inform the other processes and to guar-
antee that the operations will be applied in the same order to each copy, the local
task proposes that operation invocation to its next consensus instance (lines 13–14).
The consensus instances define a sequence {CONS[k]}k≥1, where CONS[k] denotes the kth consensus object instance.
As a value propi proposed to a consensus instance is a pair, a decided value is also
a pair, made up of an operation invocation with the identity of the invoking process
(see line 2). So execi (the local variable where the value decided by the last consensus
14.3 An Unbounded Wait-Free Universal Construction 377
instance is saved, line 14) is a pair composed of two fields, execi .op, which contains
the decided invocation, and execi .proc, which contains the identity of the process
that issued that invocation. The server then executes the state machine transition
⟨si, res⟩ ← δ(si, execi.op) to update its local copy si of Z (line 16). Moreover, if pi is
the process that has invoked this operation, it locally returns the result res by writing
it into the local variable resulti (line 19).
As (a) each consensus instance returns the same invocation to all the processes
that invoke that instance and (b) the processes invoke the instances in the same order,
it follows that, for any k, the processes pi that have invoked the first k consensus
instances applied the same length k sequence of operation invocations to their local
copy si of the object Z.
If, after it has participated in k consensus instances (as just observed, si is then
the state of Z resulting from the sequential application of the k operation invocations
output by the first k consensus instances), a process pi does not invoke an operation
during some period of time, its local copy si is not modified. When, later, it invokes
again an operation and proposes it to a consensus instance, it may have to execute
consensus instances (starting from k +1) to catch up with a consensus instance where
no value has yet been decided. While catching up, it will sequentially update its local
state si according to the operation invocations that have been decided after the kth
instance.
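The step-1 task can be simulated sequentially as follows. Consensus is modeled by a shared append-only list (the first pair deposited at index k stands for the value decided by CONS[k+1]), so the adversarial schedules that make the real construction merely non-blocking cannot arise here; all names are illustrative:

```python
def run_step1(pending, delta, s0):
    """pending: dict proc -> invocation. Returns (results, local copies)."""
    cons = []                         # cons[k] = pair decided by CONS[k+1]
    state = {p: s0 for p in pending}  # local copy si of Z at each process
    k = {p: 0 for p in pending}       # next consensus index per process
    result = {}
    while len(result) < len(pending):
        for p, inv in pending.items():
            if p in result:
                continue
            if k[p] == len(cons):      # propose to a fresh instance:
                cons.append((inv, p))  # the first proposal is decided
            op, proc = cons[k[p]]      # value decided by that instance
            k[p] += 1
            state[p], res = delta(state[p], op)   # apply δ (line 16)
            if proc == p:              # my own invocation was decided
                result[p] = res        # (line 19)
    for p in pending:                  # catch up with remaining decisions
        while k[p] < len(cons):
            op, _ = cons[k[p]]
            k[p] += 1
            state[p], _ = delta(state[p], op)
    return result, state

delta = lambda s, v: (s + [v], len(s) + 1)   # toy Z: append, return new size
results, copies = run_step1({"p1": "a", "p2": "b"}, delta, [])
assert results == {"p1": 1, "p2": 2}
assert copies["p1"] == copies["p2"]          # identical local copies of Z
```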
Non-blocking versus wait-free implementation Considering a consensus instance,
exactly one process is a winner in the sense that the value proposed to that instance
by that process is the decided value. In that sense the previous construction is non-
blocking: all the processes that participate (i.e., propose a value) in a consensus
instance and do not crash obtain the same value, and exactly one process (the win-
ner) terminates its operation invocation.
Unfortunately, that implementation is not wait-free. Indeed, it is easy to build
a scenario in which, while a process pi continuously proposes the same operation
invocation to successive consensus instances, it is never a winner because there are
always processes proposing their operation invocations to these successive consensus
instances and it is always a value proposed by one of these processes that is decided,
the value proposed by pi being never decided. In that case, the operation on Z invoked
by pi is never executed and, at the application level, pi is prevented from progressing.
It follows that the construction is non-blocking but not wait-free.
Step 2 of the construction: introduce a helping mechanism A way to go from a
non-blocking construction to a wait-free construction consists here in introducing a
helping mechanism that allows a process to propose to a consensus instance not only
its own pending invocation but all the pending invocations it is aware of. (Similar
helping mechanisms have been used in previous chapters to build wait-free objects
such as atomic registers or snapshot objects.)
To that end, an SWMR atomic register is associated with each process. That
register allows its writer process to inform the other processes about the last operation
it has invoked. More explicitly, LAST _OP[i] is an SWMR atomic register that can
be written only by pi and read by all the processes. When it writes its last operation
invocation into LAST _OP[i], pi “publishes” it, and all the other processes become
aware of it when they read this register.
Such a register LAST _OP[i] is made up of two fields: LAST _OP[i].op, which con-
tains the last operation invoked by pi with its input parameters (i.e., “op(param)”), and
LAST _OP[i].sn, which contains a sequence number. LAST _OP[i].sn = x means that
LAST _OP[i].op is the xth invocation issued by pi . Each atomic register LAST _OP[i]
is initialized to ⟨⊥, 0⟩.
In order to know whether the last operation that pj published in LAST _OP[j] has
been or has not been applied to its local copy si of Z, each process pi manages a local
array of sequence numbers denoted last_sni [1..n] such that last_sni [j] contains the
sequence number of the last operation invoked by pj that was applied to si (for any
j, last_sni [j] is initialized to 0).
Using the helping mechanism The previous helping mechanism is used as follows.
When it invokes a consensus instance, a process pi proposes all the operation invoca-
tions that have been published by the processes in LAST _OP[1..n] and that have not
yet been applied to its local copy si of the object. From its point of view, those are all
the invocations that have not yet been executed. This means that now, instead of its
own invocation, a process pi proposes a list of invocations to a consensus instance.
These design principles give rise to the universal construction described in
Fig. 14.3. As the value propi proposed by a process pi to a consensus instance is a
non-empty list of operation invocations, the value decided by that consensus instance
is a non-empty list of invocations. Consequently, the local variable execi , where the
value decided by the current consensus instance is saved, is now a list of invocations;
|execi | denotes its length and execi [r], 1 ≤ r ≤ |execi |, denotes its rth element. Let
us recall that such an element is a pair (namely, the pair ⟨execi[r].op, execi[r].proc⟩).
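The way a process builds its proposal from LAST_OP[1..n] and its array last_sn (the test described above, corresponding to line 8) can be sketched as follows, with 0-based indices and illustrative names:

```python
def build_proposal(last_op, last_sn_i):
    """last_op[j] = (op, sn) published by process j (0-based);
    last_sn_i[j] = sn of the last invocation of j applied to pi's copy.
    Returns the list prop_i of pairs (invocation, process identity)."""
    prop = []
    for j, (op, sn) in enumerate(last_op):
        # keep only published invocations not yet applied locally
        if op is not None and sn > last_sn_i[j]:
            prop.append((op, j))
    return prop

LAST_OP = [("enq(1)", 3), (None, 0), ("deq()", 5)]
last_sn = [3, 0, 4]   # the 5th invocation of process 2 not yet applied
assert build_proposal(LAST_OP, last_sn) == [("deq()", 2)]
```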
The proof of the universal construction described in Fig. 14.3 consists in showing
that the constructed concurrent object Z is atomic and its operations are wait-free.
Let us recall that atomicity means that, from an external observer point of view,
everything has to appear as if there was a single copy of the object, the operations
were executed one after the other on that copy and in an order that complies with the
sequential specification of the object and respects their real-time occurrence order.
The proof is decomposed in several lemmas.
• A first lemma shows that the construction is wait-free; i.e., each operation invoked
by a process that does not crash terminates despite the crash of any number of
processes.
• A second lemma shows that any process sees the invocations issued by all the
processes. Moreover, all the processes see them in the same total order. This
allows one to show that all the local copies of the object Z are modified according
Lemma 34 The construction described in Fig. 14.3 is wait-free (i.e., each operation
invocation issued by a correct process terminates).
Proof Let us consider a correct process pi (i.e., a process that does not crash) that
invokes Z.op(param). It deposits the operation description with its sequence number
into the shared register LAST _OP[i] to inform the processes on its pending operation
(line 2). To show that a result is returned at line 4, we have to show that the predicate
resulti = ⊥ is eventually satisfied (line 3).
We claim (C) that (1) there is a consensus instance CONS[k] that outputs (at
line 14) a list of pairs containing the pair ⟨“op(param)”, i⟩ and (2) pi participates in
that consensus instance.
It follows from that claim that, when pi executes the internal loop (lines 15–20)
associated with the consensus instance CONS[k], there is an internal loop iteration r
during which execi [r].proc = i and, consequently, pi applies op(param) to its local
copy si of Z. During that loop iteration, the result of the operation is deposited in
resulti , which proves the lemma.
Proof of the claim C. The proof is by contradiction. Let us assume that no consensus
instance outputs a list containing the pair ⟨“op(param)”, i⟩.
Due to the test done to build a list proposal (line 8), it follows that there is a time
after which all the lists proposed by the processes to consensus instances contain
forever the pair ⟨“op(param)”, i⟩. As, from then on, propi is never empty, process pi
participates in an infinite sequence of consensus instances with increasing sequence
number ki .
Let CONS[k] be the first consensus instance such that each participating process
proposes a list including the pair ⟨“op(param)”, i⟩. As the invocations of CONS[k].
propose() issued by the correct processes that participate in that consensus instance
return a list (consensus termination property) that is the same for all (consensus
agreement property), and that list is the list proposed by one of them (consensus
validity property), it follows that execi contains the pair ⟨“op(param)”, i⟩, which
contradicts the initial assumption and proves the claim. End of proof of the claim.
Lemma 35 All the operations invoked by the processes (except possibly the last
operation invoked by process pj , 1 ≤ j ≤ n, if pj crashes before depositing its
operation invocation in LAST _OP[j]) are totally ordered (this total order defines the
sequence S).
Proof We first show that any invocation Z.op() that is deposited by a process pi in
LAST _OP[i] is output by exactly one consensus instance.
Let us first observe that, as soon as pi has deposited the pair ⟨“op(param)”, i⟩
in LAST _OP[i] (line 2), it cannot invoke a new operation before this pair belongs to
the list decided by a consensus instance (this follows from the management of the
local variable resulti at lines 1, 3, 17 and 19). It follows that no operation invocation
issued by a process can be overwritten before being output by a consensus instance.
The proof that the pair ⟨“op(param)”, i⟩ is eventually output by a consensus
instance is the same as the proof of claim C in Lemma 34. We now have to prove
that this operation is not output by more than one consensus instance. To that aim,
assuming that several consensus instances can decide lists containing this pair, let
k be the first of them (due to the previous discussion, there is at least one such
instance). After it has obtained the list containing ⟨“op(param)”, i⟩ decided by the
kth consensus instance, a process pj increases last_snj [i] (line 18) before building a
new proposal for the next consensus instance. Due to the test of line 8 used by pj to
build a new list and propose it to the next consensus instance, it follows that the pair
⟨“op(param)”, i⟩ can no longer be in the list proposed by pj. We conclude then that,
as soon as an operation invocation has been deposited in LAST _OP[i] by a process
pi , that invocation is decided by exactly one consensus instance.
The fact that the processes apply the same sequence of operations to their local
copy of Z is a direct consequence of the use of consensus objects. Let us first observe
that the processes use the consensus instances in the same order: first CONS [1], then
CONS [2], etc. Moreover, each consensus instance orders a batch of invocations (the
invocations that appear in the list decided by that instance). The combination of the
single order on consensus instances and the single order on the invocations output
by each consensus instance provides each correct process pi with the same sequence
S of invocations of operation on Z. Moreover, as a process that crashes behaves as a
correct process until it crashes, it follows that a process that crashes is provided with
a prefix of the sequence S.
It follows from the previous lemma and lines 12–21 of the universal construction
that each correct process pi applies to its local copy si of Z the same sequence of
deterministic operations, and a process pj that crashes applies to sj a prefix of that
sequence. We have consequently the following corollary:
Corollary 6 The local copies si of all the correct processes behave as a single copy
that complies with the sequential specification of Z.
Lemma 36 The sequence S respects the real-time occurrence order on its operations.
Proof Let op(param) be an invocation issued by a process pi . Its start event inv[op]
corresponds to the execution of line 1, and its response event resp[op] corresponds
to the execution of line 4. If pi crashes during this invocation, there is no response
event.
Reminder. Let us remind that a linearization point associated with an operation
invocation op(param) is a point of the time line that appears between the points
associated with the events inv[op] and resp[op]. Its aim is to abstract the operation
invocation and create the illusion that it had been executed instantaneously at that
point of the time line. We have to show that it is possible to associate a linearization
point with each invocation appearing in S such that the total order on the linearization
points is the total order defined by S. (End of reminder.)
Let Lk ≥ 1 be the number of invocations decided and ordered by the consensus
object CONS[k], and op[k, x] be the xth operation invocation ordered by CONS[k].
The sequence S is then as follows: op[1, 1], . . . , op[1, L1 ], op[2, 1], . . . , op[k, Lk ],
op[k +1, 1], . . . For each invocation, let us consider the consensus instance CONS[k]
and the first invocation of CONS[k].propose() that returns the list of operation
invocations op[k, 1], . . . , op[k, Lk]. Let τk be the time at which this first invocation
of CONS[k].propose() starts. (This means that, if any, the other invocations
of CONS[k].propose() return later. This notion of “first/later” is well defined
as the consensus objects are atomic. Moreover, pi is not necessarily the process
that has issued the first invocation of CONS[k].propose().) We associate the lin-
earization points (times) τ [k, 1], . . . , τ [k, Lk ] with these invocations, such that
τk < τ [k, 1] < · · · < τ [k, Lk ] < τk+1 . Let us observe that the total order on
these linearization points is the same as the total order defined by S.
Proof The wait-free property is proved in Lemma 34. Lemma 35 has shown that S
is a sequential history including all invocations. Corollary 6 has shown that S is legal
(i.e., respects the sequential specification of the object Z). Finally, Lemma 36 has
shown that S respects the real-time order on the operation invocations. It follows that
the construction builds a wait-free atomic object.
When the object Z is non-deterministic, the function δ() can return any pair from a
set including more than one pair, and consequently two different processes pi and pj
can obtain different pairs for the same state transition, thereby entailing divergence
of their local copies of Z. When this occurs, the construction algorithm described
in Fig. 14.3 no longer ensures that all the local copies of Z behave as a single
copy. This section describes three solutions that allow the implementation of non-
deterministic objects Z.
A brute force solution A brute force strategy to solve the previous inconsistency
problem consists in replacing the non-deterministic transition function δ() by any
one of its deterministic restrictions δ′(). More precisely, the transition function δ′()
is such that, for any state s and any operation invocation op(param),
δ′(s, op(param)) has a single possible output.
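A deterministic restriction δ′() can be obtained, for instance, by always selecting the same element of the set returned by δ(). The following sketch applies a fixed selection rule; the toy non-deterministic object and all names are our own:

```python
def restrict(delta):
    """Turn a non-deterministic δ (returning a set of ⟨state, result⟩
    pairs) into a deterministic δ' by a fixed choice rule (the minimum
    under Python's ordering)."""
    def delta_prime(state, op, param=None):
        return min(delta(state, op, param))   # fixed, repeatable choice
    return delta_prime

# Toy non-deterministic object: a set whose pick() may return any element.
def delta_nd(state, op, param=None):
    if op == "add":
        return {(frozenset(state | {param}), "ok")}
    if op == "pick":                          # non-deterministic result
        return {(state, e) for e in state} or {(state, None)}
    raise ValueError(op)

dp = restrict(delta_nd)
s = frozenset({2, 1, 3})
assert dp(s, "pick") == dp(s, "pick")         # same output every time
assert dp(s, "pick") == (s, 1)
```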
Using additional consensus objects to cope with non-determinism A solution
preserving the non-deterministic dimension of Z consists in using additional consensus
objects {CONS_ND[k]}k≥1 to prevent possible divergences. Each time the
transition function δ(si, op(param)) can return a pair ⟨s, res⟩ from a set including at
least two pairs (line 16), each process pi proposes to the instance CONS_ND[k]
the pair ⟨s, res⟩ it has obtained from δ(si, op(param)). To that end, pi invokes
Finite versus bounded Lemma 34 has shown that the previous construction is wait-
free (each operation issued by a correct process terminates). We now show that it
is not bounded wait-free. This means that it is not possible to state a bound on
the number of operation invocations on base objects (consensus object and atomic
read/write registers) that need to be executed before an invocation of an operation
Z.op() terminates. This number is finite (hence the wait-free property), but there is
no bound that would hold in any execution.
Why the construction of Fig. 14.3 is not bounded wait-free Considering the
construction described in Fig. 14.3, let pi be a process that invokes an operation
op(). Moreover, let kinv be the value of ki when pi invokes op() and CONS[kresp ]
(where kinv < kresp ) be the consensus instance that outputs op(); i.e., op() belongs
to the list decided by CONS[kresp ]. This means that the task T of pi has to execute
K = kresp − kinv times the lines 6–21 in order to catch up with the consensus instance
in which its invocation op() is decided. The proof of Lemma 34 has shown that K is
finite. We show that it cannot be bounded.
To show this, consider the case where no operation is invoked by pi and its task T is
“sleeping” during an arbitrary long period (this is possible, due to asynchrony). Dur-
ing that period, the other processes pj issue an arbitrarily large number of operation
invocations and their values kj can become arbitrarily large. Then, at some time τ
after that arbitrary long period, the process pi invokes an operation and its task T
wakes up (let us observe that, despite the fact that the task T of a process is not
synchronized with its own operation invocations, this is a possible scenario). The
value K is then arbitrarily large, which shows that it cannot be bounded. Hence, the
construction is not bounded wait-free.
As we have seen in Sect. 14.3.4, what prevents the construction described in Fig. 14.3
from being bounded is the fact that each process has its own copy of Z and does
not strive to keep its local copy up to date when it does not invoke operations on Z.
The ideas on which Herlihy’s bounded wait-free construction relies are (1) a single
copy of the object Z (maintained in shared memory) plus (2) a helping mechanism that
allows bounding of the number of invocations of base object operations that can be
executed between the invocation of an operation and the return of the corresponding
result.
Internal representation of the object The object is represented as a linked list,
where the sequence of cells represents the sequence of operation invocations applied
to the object. A process executes an operation by adding a new cell to the list. So,
there is a single (centralized) representation of the object.
A cell is dynamically created by a process pi each time it invokes an operation. It
is made up of five fields (Fig. 14.4):
• The field sn is a sequence number initialized to 0 when the cell is created. Then,
it takes a positive value (its rank in the list) and keeps it forever.
• The field invoc contains the description of the operation (with its parameters
“op(param)”) invoked by the process pi that created the cell.
• The fields state and resp contain a pair of related values:
14.4 A Bounded Wait-Free Universal Construction 385
– state contains the state of the object Z after the sequential application of all the
operations in the list, from the first up to and including the operation op(param).
– resp is the result associated with the invocation op(param).
From an operational point of view, these fields are computed from the state of the
object defined in the immediately preceding cell of the list. Let prev_state be this
state. Given a cell, we have ⟨state, resp⟩ = δ(prev_state, op(param)).
• The last field, denoted next, is a consensus object created together with the cell.
The aim of this consensus object is to contain the pointer to the next cell of the list.
As a consensus object decides a single value, any cell can point towards a single
next cell. This is the way the consensus objects are used to create a total order on
the operation invocations issued by the processes.
A cell is not an atomic object, while each of its components is atomic (four atomic
registers plus a consensus object). While the field invoc refers to an SWMR register
(only the process that creates a cell can write this field), the three other fields sn,
state, and resp are MWMR atomic registers.
Initially, the list is represented by a unique cell, called anchor. Its field sn has the
value 1, while its field state has the value s0 (the initial state of the object). Its fields
invoc and resp are irrelevant. As far as the field next is concerned, let us recall that
a consensus object has no initial value.
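To fix ideas, the cell structure just described can be sketched in Python (a hypothetical rendering: in the book the cells live in shared memory, and the one-shot consensus placeholder below is only a sequential stand-in in which the first proposed value wins):

```python
class Consensus:
    """One-shot consensus placeholder: the first proposed value is
    decided, and every propose() returns that same decided value.
    (A real implementation must be wait-free and atomic.)"""
    def __init__(self):
        self._decided = None

    def propose(self, v):
        if self._decided is None:
            self._decided = v
        return self._decided


class Cell:
    """One cell of the linked list representing the object Z."""
    def __init__(self, invoc=None):
        self.sn = 0              # rank in the list; 0 until the cell is threaded
        self.invoc = invoc       # description "op(param)" of the invocation
        self.state = None        # state of Z after applying ops up to invoc
        self.resp = None         # result associated with invoc
        self.next = Consensus()  # decides the pointer to the next cell


def make_anchor(s0):
    """The initial list: a single cell with sn = 1 and state = s0."""
    anchor = Cell()
    anchor.sn = 1
    anchor.state = s0
    return anchor
```

Because next is a consensus object, a cell can point to a single next cell: whatever cell a process proposes, every propose() on the same next field returns the first cell that was proposed.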
Ensuring the bounded wait-freedom property To ensure the bounded wait-freedom
property, each process pi manages two atomic SWMR registers, denoted LAST_OP[i]
and HEAD[i], the values of which are pointers to cells. Initially, both LAST_OP[i]
and HEAD[i] point to the cell anchor. As far as terminology is concerned, the last
cell added to the list defines its “head”. More specifically, we have the following.
(The line numbers refer to Fig. 14.5.)
• The role of the register LAST_OP[i] is similar to that of the previous construction,
namely, it is used by pi to make public its last invocation of an operation on Z
so that the other processes can help it if needed. Hence, LAST_OP[i] is used to
ensure the wait-freedom property.
When pi invokes an operation op(param), it creates a new cell denoted CELL,
initializes its sn field to 0 and its invoc field to the description of the operation
invocation “op(param)”, and then makes LAST_OP[i] point to that cell (lines 1–
3 in Fig. 14.5). After this has been done, the invocation op(param) issued by pi is
published and can consequently be executed by any process.
Then, (LAST_OP[i] ↓).sn ≠ 0 indicates that the cell associated with the last
invocation issued by pi has been added to the linked list; i.e., the invocation saved
in (LAST_OP[i] ↓).invoc was executed (lines 6 and 13).
• The atomic register HEAD[i] is used to ensure the boundedness attribute of the
wait-freedom property. This atomic register contains a pointer to the last cell of the
linked list as known by pi. This means that (HEAD[i] ↓).state is the current state
of the constructed object Z known by pi. Let us notice that, due to asynchrony,
two distinct processes pi and pj do not necessarily have the same view of which
is the last cell added to the list; i.e., HEAD[i] can be different from HEAD[j], and
both can be different from the cell which is currently the head of the list.
When pi invokes an operation, it first obtains its best view of which is the last
operation invocation that was added to the list. This view, computed from the
sequence numbers associated with the cells pointed to by HEAD[1..n], is saved
in HEAD[i] (line 4).
The local variable last_sni (line 5) then represents, from pi’s point of view, the
sequence number associated with the last operation invocation that was executed
(i.e., added to the linked list). It is important to see here that the processes help each
other: a process reads the last sequence number known by each of them, namely it
reads all the registers (HEAD[1..n] ↓).sn (line 4), in order to obtain the best view
of which is the last operation invocation that was executed.
Remark Let us notice that a process pi updates LAST_OP[i] (line 3) before read-
ing asynchronously the entries of the array HEAD[1..n] (line 4). The fact that
LAST_OP[i] is updated first is important to ensure the boundedness attribute of
the wait-freedom property, as we will see in the proof of Lemma 37. End of remark.
• The second phase is a helping and computation phase (lines 6–15).
Its aim is to determine the result associated with the operation invoked by pi which
occurs when the corresponding cell is added to the list.
First, the helping mechanism is used. Inspired by the classical round-robin prin-
ciple, this mechanism works as follows. The priority to add the kth cell to the list
is systematically given to a process px if
– that process has a pending operation (i.e., (LAST_OP[x] ↓).sn = 0), and
– its index x is such that x = (k mod n) + 1.
This is expressed at lines 7–8 (where k = last_sni and x = to_help). It is easy to
see that, for each sequence number, exactly one process is given priority and no
process that has a pending operation can be missed.
Combined with the management of the array HEAD[1..n], this ensures that, as soon
as a process has made public its operation (line 3), at most one operation invoked
by each other process can be added to the list before its invocation. Hence, the
bounded wait-free property. This is operationally expressed in the construction at
lines 8–9, where propi is a pointer to the cell that pi has to try to add to the list.
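The round-robin priority rule of lines 7–8 can be sketched with a small helper (hypothetical Python; the name to_help mirrors the local variable of the text, and pending[x] stands for the test (LAST_OP[x] ↓).sn = 0):

```python
def to_help(last_sn, n, pending):
    """Round-robin helping rule: after the cell with sequence number
    last_sn, priority goes to process x = (last_sn mod n) + 1, provided
    that process has announced a not-yet-threaded invocation.
    Processes are numbered 1..n, as in the book."""
    x = (last_sn % n) + 1          # the process given priority for this slot
    return x if pending.get(x, False) else None
```

Over any window of n consecutive sequence numbers each process index is selected exactly once, which is why a process that has announced an invocation cannot be bypassed more than once by each other process.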
The addition of a cell into the list is as follows. pi considers the cell that (from its
point of view) is at the head of the list (namely, the cell pointed to by HEAD[i])
and tries to append after it the cell it has previously determined, that is, the cell
pointed to by propi (lines 10–12). This is where the consensus objects come into
play. As already mentioned, the field next of a cell is a consensus object destined
to contain the pointer to the next cell. Here the relevant consensus object to thread
the next cell is (HEAD[i] ↓).next. Let us recall that, if several processes try to
thread different cells, a single one will succeed (this is because (HEAD[i] ↓).next
is a consensus object), and next_cell then contains the pointer value decided by that
consensus object (line 11).
Then, pi executes the operation encapsulated in next_cell (line 12). As the oper-
ations are deterministic, if several processes write into the pair of atomic reg-
isters ⟨(next_cell ↓).state, (next_cell ↓).resp⟩, they write the same pair of values. The field
(next_cell ↓).sn is then updated to (HEAD[i] ↓).sn + 1 (line 13), and HEAD[i] is
advanced to next_cell at line 14.
Finally, the process pi stops looping when its invocation has been threaded into the
list. This is operationally detected when the predicate (LAST_OP[i] ↓).sn ≠ 0
becomes true. Let us observe that this predicate can be satisfied when pi evaluates
it for the first time at line 6 (this occurs when pi is slow: after it has announced its
operation invocation in LAST_OP[i], that invocation has been threaded into the
list by another process before pi starts executing line 7).
Let τ be the time at which pi has executed line 3 (i.e., just after LAST_OP[i] has
been updated). Let sn1 be the value of max{(HEAD[x] ↓).sn}1≤x≤n at time τ. Let
sn3 = sn1 + (n + 1) (see Fig. 14.6).
We claim that the cell pointed to by LAST_OP[i] is threaded into the list with a
sequence number smaller than or equal to sn3. It follows from this claim that pi executes at most
n + 1 times the body of the while loop. As each loop iteration is made up of a
bounded number of operations on base objects, it follows that pi executes a bounded
number of invocations on base objects (each base operation—be it a read or a write
of a shared register, or a propose() invocation on a consensus object—counts for 1
when counting the number of base object operations executed by pi ). Hence, there
are O(n) invocations on base objects between the time instant at which pi invokes
op(param) and the time instant at which pi executes the return() statement (line 17).
The construction is consequently bounded wait-free.
Proof of the claim. The proof is by contradiction. Let us assume that the cell pointed
to by LAST_OP[i] is not threaded into the list with a sequence number smaller than
or equal to sn3. From τ, and until the cell pointed to by LAST_OP[i] is threaded into
the list, all the processes that read (LAST_OP[i] ↓).sn obtain the value 0 (line 8).
As sn3 = sn1 + (n + 1), there is at least one sequence number sn2 (and at most two)
such that i = (sn2 mod n) + 1 and sn1 < sn2 ≤ sn3. When there are two such
sequence numbers, those are sn1 + 1 and sn3 = sn1 + (n + 1). We consider two cases:
• There is a single sequence number sn2 such that i = (sn2 mod n) + 1 and
sn1 + 1 ≤ sn2 ≤ sn3. In this case, sn1 + 1 < sn2 < sn3.
As sn2 > sn1 + 1, one or several cells have been threaded into the list with sequence
numbers sn1 + 1, . . ., sn2 − 1 when, after τ, a process reads last_snj = sn2 − 1 > sn1
(at line 5 or 13). It follows that, when this occurs, such a process necessarily obtains
0 when thereafter it reads the atomic register (LAST_OP[i] ↓).sn (line 13). It
follows from this reading that such a process pj sets propj to LAST_OP[i] (line 8).
This is true for all the processes that execute the loop when their last_sn local
variable becomes equal to sn2 − 1. Hence, all the processes that try to add the next
cell (i.e., the sn2th cell) to the list propose LAST_OP[i] to the consensus object of
the (sn2 − 1)th cell. Moreover, let us notice that at least one process executes the
loop with last_snj = sn2 − 1 (namely pi, as, due to the contradiction assumption,
its operation is not added before the sequence number sn3 + 1 and we consequently
have (LAST_OP[i] ↓).sn = 0). It follows that at least one process invokes the base
operation propose(LAST_OP[i]) on the consensus object of the (sn2 − 1)th cell.
It follows from the validity property of that base consensus object that the sn2th
cell that is threaded is the cell pointed to by LAST_OP[i], which contradicts the
initial assumption and proves the claim.
• There are two sequence numbers sn2 such that i = (sn2 mod n) + 1 and sn1 + 1 ≤
sn2 ≤ sn3. In that case, these numbers are sn1 + 1 and sn3.
It is possible that some processes pj obtain last_snj = sn1 before τ (see Fig. 14.6).
If this happens, these processes can thread a cell different from the one pointed
to by LAST_OP[i] with sequence number sn2 = sn1 + 1. This is because these
processes try to add the sn2th cell before reading 0 from (LAST_OP[i] ↓).sn,
and although this sequence number gives priority to the cell announced by pi,
those processes are not yet aware of that cell. So, LAST_OP[i] misses its turn
to be threaded. But, after this addition, we are after τ and all the processes see
(LAST_OP[i] ↓).sn = 0 when the priority is again given to pi. It follows from
the reasoning of the previous case that the cell pointed to by LAST_OP[i] will be
threaded at the latest as the sn3th cell into the list. End of the proof of the claim.
Lemma 38 All the operation invocations (except the last invocations of the processes
pj that crash before depositing their invocation in LAST_OP[j]) are totally ordered.
Let S be that sequence of invocations. S belongs to the sequential specification of Z.
Proof Each operation invocation issued by a process (except possibly its last oper-
ation if it crashes) is associated with a single cell that (as shown by Lemma 37) is
threaded into the list with the help of a consensus object. Due to the very definition
of a list (a list is a sequence), it follows that all the operations define a sequence S.
Moreover, when considering the list of cells, this sequence is defined starting
from the initial state s0 of the object Z (as defined in the cell anchor), and then going
from one cell to the next one by applying the transition function δ() defined by the
sequential specification of Z (line 12). It follows that S belongs to the sequential
specification of the constructed object Z.
Lemma 39 The sequence S respects the real-time occurrence order on its operation
invocations.
Proof Let us define the linearization point of an operation invocation Z.op(param)
as the time when the corresponding cell (built at line 2) is added to the list (line 11).
This means that the linearization point corresponds to the first invocation of
propose() on the consensus object CONS that threaded op(param) into the list.
This point in time is trivially between the time the operation was invoked (when
we had (LAST_OP[i] ↓).sn = 0) and the time a result is returned (when we had
(LAST_OP[i] ↓).sn ≠ 0). Moreover, as each consensus instance outputs a single
pointer to a cell and the consensus instances define the total order S on the operation
invocations, it follows that the total order on the linearization points is the same
as the total order S on the corresponding operation invocations, which proves the
lemma.
This section considers the case where the operations on Z are not deterministic. When
considering the previous construction, the line where a non-deterministic operation
can create a problem is line 12, namely, the line where the pair ⟨state, resp⟩ of a cell
is written according to the values output by δ(prev_state, op(param)).
As these fields of a cell can be written by several processes, different processes
can now write different pairs of values. As between any two such writings these
values can be read by other processes, the linked list representing the object Z would
no longer satisfy its sequential specification.
A simple way to address this problem consists in demanding that the processes
reach consensus on the pair ⟨state, resp⟩ of each cell. In that way, such a pair is made
unique for each cell. At the operational level, this can be done as follows:
• In addition to its previous fields, a cell now has a new field denoted state_res_cons,
which contains a consensus object. The consensus object of the cell anchor
is assumed to be properly initialized to ⟨s0, ⊥⟩ (this can easily be done by
requiring each process to invoke the operation instance anchor.state_res_cons.
propose(⟨s0, ⊥⟩) when it starts).
• Line 12 of Fig. 14.5 is replaced by the following lines (where prop_pairi is an
auxiliary local variable of pi that can contain a pair of values):
(12.a) prop_pairi ← δ((HEAD[i] ↓).state, (next_cell ↓).invoc);
(12.b) let CONS_ND = (HEAD[i] ↓).state_res_cons;
(12.c) ⟨(next_cell ↓).state, (next_cell ↓).resp⟩ ← CONS_ND.propose(prop_pairi).
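The effect of lines 12.a–12.c can be illustrated as follows (a hypothetical Python sketch: delta_nondet and the sequential first-proposal-wins consensus placeholder are ours; two simulated processes compute possibly different outputs of a non-deterministic δ, but the consensus object makes a single pair win):

```python
import random

class Consensus:
    """One-shot consensus placeholder: the first proposed value wins."""
    def __init__(self):
        self._decided = None

    def propose(self, v):
        if self._decided is None:
            self._decided = v
        return self._decided


def delta_nondet(state, invoc):
    """A non-deterministic transition function: two applications to the
    same state with the same invocation may return different pairs."""
    outcome = random.choice(["heads", "tails"])
    return (state + (outcome,), outcome)   # the pair <new state, resp>


# state_res_cons plays the role of the new cell field: each process
# proposes the pair it computed; a single pair is decided for the cell.
state_res_cons = Consensus()
pair_p = state_res_cons.propose(delta_nondet((), "flip()"))   # process p
pair_q = state_res_cons.propose(delta_nondet((), "flip()"))   # process q
```

Whatever the two random outcomes are, pair_p and pair_q are the same pair, so the pair written into the cell is unique.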
Preliminary notations The arrays that are used are arrays of b bits, where b is the
number of bits required to encode any value that can be proposed. Hence, the set of
values that can be proposed is bounded: it is {0, 1, . . . , 2^b − 1}. It is assumed that b
is known by the processes.
Let aa[1..b] be such an array and 0 ≤ k ≤ b. The notation aa[1..0] denotes an
empty array (it has no entries), and aa[1..k] denotes the sub-array containing the
entries from 1 to k. Finally, the predicate aa[1..k] = bb[1..k] is true if and only if
these two sub-arrays of bits are component-wise equal. By definition, the predicate
aa[1..0] = bb[1..0] is always satisfied.
Internal representation of the multi-valued consensus object The internal rep-
resentation of the multi-valued consensus object CONS is made up of two arrays of
n elements (where n is the number of processes):
• PROP[1..n] is an array of SWMR atomic registers initialized to [⊥, . . . , ⊥].
Then, the aim of each PROP[i] is to contain the b-bit array-based representation of
the value proposed by the process pi. Hence, when PROP[i] ≠ ⊥, PROP[i][1..b]
contains the value proposed by pi and its kth bit is kept in PROP[i][k].
• BC[1..b] is an array of b binary consensus objects.
Principle of the algorithm The principle of the algorithm is simple: the processes
agree sequentially, one bit after the other, on each of the b bits of the decided value.
This is done with the help of the underlying binary consensus objects BC[1..b].
In order for the decided array of bits to correspond to one of the proposed
values, when the processes progress from bit k to bit k + 1, they consider only
the binary coding of the proposed values whose first k bits are a prefix of the
sequence of bits on which they have already agreed. A process pi keeps this sequence
of agreed bits in a local array resi [1..b] of which only the sub-part resi [1..k] is
meaningful.
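The bit-by-bit agreement can be sketched as follows (hypothetical Python, executed here sequentially; the names to_bits and propose are ours, and the binary consensus placeholder is a sequential first-proposal-wins stand-in for a real wait-free object):

```python
class BinConsensus:
    """Binary consensus placeholder: the first proposed bit is decided."""
    def __init__(self):
        self._decided = None

    def bin_propose(self, bit):
        if self._decided is None:
            self._decided = bit
        return self._decided


def to_bits(v, b):
    """b-bit representation of v, most significant bit first."""
    return [(v >> (b - 1 - k)) & 1 for k in range(b)]


def propose(i, v, b, PROP, BC):
    """Process i proposes v (0 <= v < 2**b); returns the decided value."""
    PROP[i] = to_bits(v, b)                     # publish the proposal
    res = []                                    # agreed prefix so far
    for k in range(b):
        # consider only published values whose first k bits match res
        cand = next(p for p in PROP if p is not None and p[:k] == res)
        res.append(BC[k].bin_propose(cand[k]))  # agree on bit k+1
    return sum(bit << (b - 1 - k) for k, bit in enumerate(res))
```

The validity property of each binary consensus object keeps the invariant that the agreed prefix res[1..k] is a prefix of at least one published value, so after b rounds the decided array is the encoding of a proposed value.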
14.5 From Binary Consensus to Multi-Valued Consensus 393
Differently from the previous construction, this one places no constraint on the size
of the values that can be proposed.
Internal representation of the multi-valued consensus object This internal rep-
resentation is very similar to the one of the previous construction:
• PROP[1..n] is an array of SWMR atomic registers initialized to [⊥, . . . , ⊥]. The
aim of PROP[i] is to contain the value proposed by process pi .
• BC[1..n] is an array of n binary consensus objects.
Fig. 14.9 Linearization order for the proof of the termination property
To prove that CONS satisfies the consensus agreement property, let us consider
the first binary consensus object (BC[k]) that returns the value 1. As processes invoke
the binary consensus objects in the same order, it follows from the agreement prop-
erty of the underlying binary consensus objects that all the processes that invoke
BC[x].bin_propose() for x = 1, ..., (k − 1) obtain the value 0 from these invocations
and the value 1 from BC[k].bin_propose(). As a process exits the loop and decides
when it obtains the value 1, it follows that no value different from PROP[k] can be
decided.
To prove the termination property (Fig. 14.9) of the consensus object CONS, let
us first observe that, among all the processes that participate (i.e., execute line 1),
there is one that is the first to write the value it proposes into the array PROP. (This
follows from the fact that the registers PROP[1], . . . , PROP[n] are atomic.) Let pℓ
be that process. Due to the fact that any process pj writes the value vj it proposes into
PROP[j] before reading any entry of PROP, it follows that, when k = ℓ, every process
will read PROP[ℓ] ≠ ⊥. Hence, all the processes that invoke BC[ℓ].bin_propose(bp)
will do so with bp = 1. Due to the validity property of BC[ℓ], they decide the value
1 from BC[ℓ] and consequently exit the for loop, which concludes the proof of the
termination property of the object CONS.
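This second construction can be sketched as follows (hypothetical Python, run sequentially; BOT stands for ⊥, and the binary consensus placeholder is again a sequential first-proposal-wins stand-in):

```python
class BinConsensus:
    """Binary consensus placeholder: the first proposed bit is decided."""
    def __init__(self):
        self._decided = None

    def bin_propose(self, bit):
        if self._decided is None:
            self._decided = bit
        return self._decided


BOT = object()    # stands for the initial value "bottom"


def propose(i, v, n, PROP, BC):
    """Process i (0 <= i < n) proposes v; no constraint on v's size."""
    PROP[i] = v                              # first publish the value
    for k in range(n):                       # then scan in a fixed order
        bit = 1 if PROP[k] is not BOT else 0
        if BC[k].bin_propose(bit) == 1:
            return PROP[k]                   # decide the k-th proposal
```

Because every process scans the binary consensus objects in the same order, the first BC[k] that decides 1 determines the decided value PROP[k] for everybody, which is exactly the agreement argument given above.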
14.6 Summary
This chapter was devoted to the wait-free implementation of any concurrent object
defined by a sequential specification on total operations. It first introduced the notion
of a universal construction and the notion of a universal object. It also defined the
concept of a consensus object.
The chapter then presented two universal constructions. The first one, which is not
bounded, is based on the state machine replication paradigm. The second one, which
is bounded, is based on the management of a linked list kept in shared memory. Both
constructions rest on consensus objects to ensure that the processes agree on a single
order in which their operation invocations appear to have been executed.
The chapter also showed how a multi-valued consensus object can be built from
binary consensus objects and atomic read/write registers.
• The notions of a universal construction and a universal object are due to M. Herlihy
[138]. The wait-free universal construction presented in Sect. 14.4 is due to the
same author [138, 139].
Versions of this bounded construction that use bounded sequence numbers are
described in [138, 172].
• The universal construction based on the state machine replication paradigm pre-
sented in Sect. 14.3 is due to R. Guerraoui and M. Raynal [130]. This construction
is inspired from a total order broadcast algorithm described in [67]. (The reader
interested in the state machine replication paradigm will find more developments
on this paradigm in [186, 236, 250].)
• The interested reader will find other constructions in many other papers, e.g.,
[4, 47, 95, 96].
• A universal construction for large objects is presented in [26].
Universal constructions for multi-object operations are described in [10, 11, 25].
A multi-object operation is an atomic operation that accesses simultaneously sev-
eral objects.
• A universal construction in which a process that has invoked an operation on an
object can abort its invocation is presented in [77].
• A universal construction suited to transactional memory systems is described
in [83].
• The constructions of multi-valued consensus objects from binary consensus
objects presented in Sect. 14.5 are the read/write counterparts of constructions
designed for message-passing systems described in [217, 236].
• The first universal construction that was presented is based on the state machine
replication paradigm and consensus objects. What happens if each consensus
instance is replaced by a k-set agreement instance (as defined in Exercise 5, p. 273)?
This question is answered in [108]. This answer relies on the observation that, in
shared memory systems, solving k-set agreement is the same as solving concur-
rently k consensus instances where at least one of them is required to terminate [6].
We will see in Chap. 16 that it is possible to design a consensus object for any
number of processes from compare&swap objects or LL/SC registers (this type
of register was defined in Sect. 6.3.2). In the following, “directly” means without
building intermediate consensus objects.
14.8 Exercises and Problems 397
The previous chapter presented universal constructions that allow one to build
wait-free implementations of any object defined by a sequential specification (on
total operations), i.e., implementations that tolerate any number of process crashes. As we have
seen, these constructions rest on two types of objects: atomic read/write registers and
consensus objects. These universal constructions assume that these base objects are
reliable; namely, they implicitly consider that their behavior always complies with
their specification. As an example, given an atomic register R, it is assumed that an
invocation of R.read() always returns the last value that was written into R (“last” is
with respect to the linearization order). Similarly, given a consensus object CONS,
an invocation CONS.propose() is assumed to always return the single value decided
by this consensus object.
This chapter revisits the failure-free object assumption and investigates the case
where the base objects are prone to failure. It focuses on the self-implementation
of such objects. Self-implementation means that the internal representation of the
reliable object RO that is built relies on a bounded number m of objects O1, . . ., Om
of the very same type. Hence, a reliable atomic register is built from a set of atomic
registers of which some (not known in advance) can be faulty, and similarly for a
consensus object. Moreover, such a self-implementation has to be t-tolerant. This
means that the reliability of the object RO that is built has to be guaranteed despite the
fact that up to t of the base objects O1, . . ., Om which implement RO can be faulty
(Fig. 15.1). Hence, this chapter is devoted to the self-implementation of t-tolerant
atomic read/write registers and consensus objects.
From a terminology point of view, wait-freedom is related to process crashes
while t-tolerance is related to the failure of base objects.
Intuitively, an object crash failure occurs when the corresponding object stops work-
ing. More precisely, two different crash failure models can be distinguished: the
responsive crash model and the non-responsive crash model.
Responsive crashes In the responsive crash failure model, an object fails if it
behaves correctly until some time, after which every operation returns the default
value ⊥. This means that the object behaves according to its sequential specification
until it crashes (if it ever crashes), and then satisfies the property “once ⊥, forever
⊥”. The responsive crash model is sometimes called the fail-stop model.
Non-responsive crashes In the non-responsive crash model, an object does not
return ⊥ after it has crashed. There is no response after the object has crashed and
the invocations of object operations remain pending forever. The non-responsive
crash model is sometimes called the fail-silent model.
Facing non-responsive failures is more difficult than facing responsive failures.
Indeed, in the asynchronous computation model, a process that invokes an operation
on an object that has crashed and is not responsive has no means to know whether the
object has indeed crashed or is only very slow. As we will see, some objects, which
can be implemented in the responsive failure model, can no longer be implemented
in the non-responsive failure model.
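The responsive crash model can be captured by a simple wrapper (a hypothetical Python sketch, with None standing for ⊥; the non-responsive model cannot be rendered this way, since a non-responsive read never returns at all):

```python
BOT = None    # the default value returned after a responsive crash

class CrashableRegister:
    """Register with responsive crash failures: it behaves correctly
    until crash() occurs, then obeys 'once bottom, forever bottom'."""
    def __init__(self):
        self._value = BOT
        self._crashed = False

    def crash(self):
        self._crashed = True

    def write(self, v):
        if not self._crashed:
            self._value = v
        # after a crash, a write has no effect (it "returns bottom")

    def read(self):
        return BOT if self._crashed else self._value
```

This sequential simulation is used only to make the failure semantics concrete; the constructions below additionally rely on the atomicity of the base registers.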
A lower bound It is easy to see that t + 1 is a lower bound on the number of base
objects O1, . . ., Om required to mask up to t faulty base objects. If an operation on
the constructed object RO accesses only t base objects, and all of them fail, there is
no way for the constructed object to mask the base object failures.
Fig. 15.2 t-Tolerant SWSR atomic register: unbounded self-implementation (responsive crash)
402 15 The Case of Unreliable Base Objects
Let us observe that the version of the construction with such parallel invocations is
optimal as far as time complexity is concerned.
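The construction of Fig. 15.2 is not reproduced above, but it can be sketched consistently with the proof of Theorem 61 that follows (a hypothetical Python rendering: the writer attaches an increasing sequence number sn to each value and writes the pair into all t + 1 base registers, while the reader keeps a local variable last and returns the value with the highest sequence number it has ever obtained):

```python
BOT = None    # the value returned by a crashed base register

class CrashableRegister:
    """Base register with responsive crashes ('once bottom, forever bottom')."""
    def __init__(self):
        self._value, self._crashed = BOT, False
    def crash(self):
        self._crashed = True
    def write(self, v):
        if not self._crashed:
            self._value = v
    def read(self):
        return BOT if self._crashed else self._value


class TolerantRegister:
    """t-tolerant SWSR register built from t+1 crash-prone base registers
    (unbounded construction: sequence numbers grow forever)."""
    def __init__(self, t, init=None):
        self.REG = [CrashableRegister() for _ in range(t + 1)]
        for r in self.REG:
            r.write((0, init))
        self.sn = 0            # writer-local sequence number
        self.last = (0, init)  # reader-local highest <sn, value> pair

    def write(self, v):
        self.sn += 1
        for r in self.REG:     # these writes could be issued in parallel
            r.write((self.sn, v))

    def read(self):
        for r in self.REG:
            pair = r.read()
            if pair is not BOT and pair[0] > self.last[0]:
                self.last = pair   # keep the freshest value ever seen
        return self.last[1]
```

Since at most t of the t + 1 base registers can crash, a read always obtains at least one non-⊥ pair, and the local variable last prevents the reader from ever going back to an older value.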
Theorem 61 The construction described in Fig. 15.2 is a wait-free t-tolerant self-
implementation of an SWSR atomic register from t + 1 SWSR atomic registers that
may suffer responsive crash failures.
Proof As already noticed, the construction is trivially wait-free. Moreover, as by
assumption there is at least one base register that does not crash, each invocation
of RO.read() returns a non-⊥ value and consequently the register RO is reliable.
So, it remains to show that it behaves as an atomic register. This is done by (a) first
defining a total order on the invocations of RO.write() and RO.read(), and (b) then
showing that the resulting sequence satisfies the sequential specification of a register.
This second step uses the fact that there exists a total order on the accesses to the
base registers (as those registers are atomic).
Let us associate with each invocation of RO.write() the sequence number of the
value it writes. Similarly, let us associate with each invocation of RO.read() the
sequence number of the value it reads. Let S be the total order on the invocations of
RO.write() and RO.read() defined as follows. The invocations of RO.write() are
ordered according to their sequence numbers. Each invocation of RO.read() whose
sequence number is sn is ordered just after the invocation of RO.write() that has
the same sequence number. If two or more invocations of RO.read() have the same
sequence number, they are ordered in S according to the order in which they have
been issued by the reader. We have the following:
• It follows from its definition that S includes all the operation invocations issued
by the reader and the writer (except possibly their last operation if they crash).
• Due to the way the local variable sn is used by the writer, the invocations of
RO.write() appear in S in the order in which they have been issued by the writer.
• Similarly, the invocations of RO.read() appear in S according to the order in which
they have been issued by the reader. This is due to the local variable last used by
the reader (the reader returns the value with the highest sequence number it has
ever obtained from a base register).
• As the base registers are atomic, the base operations on these registers are totally
ordered. Consequently, when we consider this total order, a base read operation
that obtains the sequence number sn from an atomic register REG[ j] appears after
the base write operation that wrote sn into that register.
As S is such that an invocation of RO.read() that obtains a value whose sequence
number is sn appears after the snth and before the (sn + 1)th invocation of
RO.write(), it follows that S is consistent with the occurrence order defined by
the read and write invocations on the base objects.
It follows from the previous observations that S is a correct linearization of the
invocations of RO.read() and RO.write(). Consequently, the constructed register
RO is atomic.
15.2 SWSR Registers Prone to Crash Failures 403
operation RO.write(v) is
for j from 1 to t + 1 do REG[j] ← v end for;
return()
end operation

operation RO.read() is
for j from t + 1 down to 1 do
aux ← REG[j];
if aux ≠ ⊥ then return(aux) end if
end for
end operation

Fig. 15.3 t-Tolerant SWSR atomic register: bounded self-implementation (responsive crash)
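The two opposite scans of Fig. 15.3 can be sketched in Python (a hypothetical rendering; the crash-prone base registers are simulated sequentially, and BOT stands for ⊥):

```python
BOT = None    # the value returned by a crashed base register

class CrashableRegister:
    """Base register with responsive crashes ('once bottom, forever bottom')."""
    def __init__(self):
        self._value, self._crashed = BOT, False
    def crash(self):
        self._crashed = True
    def write(self, v):
        if not self._crashed:
            self._value = v
    def read(self):
        return BOT if self._crashed else self._value


def ro_write(REG, v):
    """Write: update the base registers in increasing order REG[1..t+1]."""
    for r in REG:
        r.write(v)


def ro_read(REG):
    """Read: scan in the opposite (decreasing) order and return the first
    non-bottom value; as at most t registers crash, one always exists."""
    for r in reversed(REG):
        aux = r.read()
        if aux is not BOT:
            return aux
```

A usage sketch matching the discussion below: after a complete write of y, the writer crashes while writing x (only REG[1] is updated); the reverse-order read keeps returning y until all the registers after REG[1] have crashed, so no new/old inversion can occur.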
Fig. 15.4 Order in which the operations access the base registers
Assume that the reader were to scan the base registers in the same order as the
writer, from REG[1] to REG[t + 1]. The write updates REG[1] to x and crashes just
after. Then, an invocation of RO.read() obtains the value x. Sometime later, REG[1]
crashes. After that crash occurs, the reader reads REG[1], obtains ⊥, then reads
REG[2] and obtains y, a value that was written before x. Consequently, the reader
suffers a new/old inversion and RO is not atomic. Forcing the reader to access the
base registers in the reverse order (with respect to the writer) ensures that, if the
reader returns v from REG[j], all the base registers REG[k] such that j < k ≤ t + 1
have crashed. More generally, as we have seen previously, if the reader and the writer
do not access the base registers in opposite orders, additional control information
(such as sequence numbers) has to be used to create some order.
Remark Scanning base registers in one direction when writing and in the other
direction when reading is a technique that has already been used in Sect. 11.3 devoted
to the construction of b-valued regular (atomic) registers from regular (atomic) bits.
The proof shows that the constructed register is safe, regular, and suffers no new/old
inversion, from which atomicity follows from Theorem 43 in Chap. 11 (this theorem
states that any execution of a regular register in which there is no new/old inversion
is linearizable and hence atomic).
• Safeness property. Let us consider an invocation of RO.read() when there is no
concurrent invocation of RO.write(). Safeness requires that, in this scenario, the
read returns the last value that was written into RO.
As (by assumption) there is no concurrent write on RO, we conclude that the
writer has not crashed during the last invocation of RO.write() issued before the
invocation of RO.read() (otherwise, this write invocation would not be terminated
and consequently would be concurrent with the read invocation).
The last write updated all the non-crashed registers to the same value v. It follows
that, whatever the base register from which the read operation obtains a non-⊥
value, it obtains and returns the value v, which is the last value written into RO.
• Regularity property. If an invocation of R O.read() is concurrent with one or several
invocations of R O.write(), we have to show that the read invocation has to obtain
the value of the constructed register before these write invocations or the value
written by one of them.
Let us first observe that an invocation r of R O.read() cannot obtain from a base
register a value that has not yet been written into it. We conclude from this obser-
vation that a read invocation cannot return a value that has not yet been written by
a write invocation.
Let v be the value of the register before the concurrent invocations of R O.write().
This means that all the non-crashed base registers are equal to v before the first of
these concurrent write invocations. If r obtains the value v, regularity is ensured.
So, let us assume that r obtains another value v′ from some register REG[x]. This
means that REG[x] has not crashed and was updated to v′ after having been updated
to v. This can only be done by a concurrent write invocation that writes v′ and was
issued by the writer after the write of v. The constructed register is consequently
regular.
• Atomicity property. We prove that there is no new/old inversion. Let us assume
that two invocations of R O.read(), say r1 and r2 , are such that r1 is invoked before
r2 , r1 returns v2 that was written by w2 , r2 returns v1 that was written by w1 , and
w1 was issued before w2 (Fig. 15.5).
The invocation r1 returns v2 from some base register REG[x]. It follows from the
read algorithm that all the base registers REG[y] such that x < y ≤ t + 1 have
crashed. It also follows from the write algorithm that the non-crashed registers
from REG[1] to REG[x − 1] contain v2 or a more recent value when r1 returns v2.
As the base registers from REG[t + 1] until REG[x + 1] have crashed when r2 is
invoked, that read invocation obtains ⊥ from all these registers. When it reads the
atomic register REG[x], it obtains either v2, or a more recent value, or ⊥.
– If r2 obtains v2 or a more recent value, there is no new/old inversion.
– If r2 obtains ⊥, it continues reading from REG[x − 1] until it finds a base
register REG[y] (y < x) from which it obtains a non-⊥ value. Also, as the write
algorithm writes the base registers in increasing order starting from REG[1], it
follows that no register from REG[1] until REG[x − 1] (which is not crashed
when it is read by r2 ) can contain a value older than v2; namely, it can only
contain v2 or a more recent value. It follows that there is no possibility of
new/old inversion also in this case.
A simple improvement An easy way to improve the time efficiency of the previous
operation R O.read() consists in providing the reader process with a local variable
(denoted shortcut and initialized to t + 1) that keeps an array index such that, to
the reader’s knowledge, each REG[k] has crashed for shortcut < k ≤ t + 1. The
resulting read algorithm is described in Fig. 15.6. It is easy to see that, if after some
time no more base registers crash, shortcut always points to the first (in descending
order) non-crashed base register. This means that there is a time after which the
duration of a read operation is constant in the sense that it depends neither on t nor
on the number of base registers that have actually crashed.
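To make the two scan directions concrete, the write algorithm of Fig. 15.5 and the shortcut-based read algorithm of Fig. 15.6 can be sketched as follows (an illustrative Python model, not the book's code; names such as BaseRegister and BOT are assumptions, and a crashed base register is modeled as answering ⊥ forever):

```python
BOT = None  # models the default value ⊥ returned by a crashed register

class BaseRegister:
    """A base SWSR register with responsive crash failures."""
    def __init__(self):
        self.value, self.crashed = BOT, False
    def write(self, v):
        if not self.crashed:
            self.value = v
    def read(self):
        return BOT if self.crashed else self.value  # once ⊥, forever ⊥

class TTolerantRegister:
    """t+1 base registers: written in ascending order, read in descending order."""
    def __init__(self, t, initial):
        self.t = t
        self.reg = [BaseRegister() for _ in range(t + 1)]
        for r in self.reg:
            r.write(initial)
        self.shortcut = t  # highest index not yet known to have crashed (0-based)
    def write(self, v):
        for k in range(self.t + 1):             # REG[1] up to REG[t+1]
            self.reg[k].write(v)
    def read(self):
        for k in range(self.shortcut, -1, -1):  # scan downwards from shortcut
            x = self.reg[k].read()
            if x is not BOT:
                self.shortcut = k               # skip crashed registers next time
                return x

t = 2
R = TTolerantRegister(t, 0)
R.write(5)
R.reg[2].crashed = True   # at most t base registers crash
R.reg[1].crashed = True
print(R.read())  # → 5, and shortcut now points at the first non-crashed register
```

After the two crashes, a subsequent read starts directly at the non-crashed register, illustrating the constant-time behavior described above.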
When crash failures are not responsive, the construction of an SWSR atomic register
R O is still possible but requires a higher cost in terms of base registers; namely,
m ≥ 2t + 1 base registers are then required.
Base principles The base principles of such a construction are relatively simple.
They are the following:
• The use of sequence numbers (as in the construction for responsive failures,
Fig. 15.2). These sequence numbers allow the most recent value written in R O
to be known.
• The use of the majority notion. As the model assumes that, among the m base
registers, at most t can be faulty, taking m > 2t ensures that there is a majority
of base registers that do not crash. Said differently, any set of t + 1 base registers
contains at least one register which is not faulty.
• The parallel activation of read operations on base registers. This allows one to cope
with the fact that some read of base registers are not responsive. Combined with
the majority of correct base registers, this ensures that invocations of R O.read()
do not remain blocked forever.
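The three principles above can be sketched with a small simulation (an illustration only: SimReg, the acknowledgment semaphore, and the way non-responsiveness is simulated are assumptions, not the book's construction):

```python
import threading

class SimReg:
    """Simulates a base register; a non-responsive one never answers."""
    def __init__(self, responsive=True):
        self.val = (None, 0)                 # (value, sequence number)
        self.responsive = responsive
    def write(self, v, sn, done):
        def run():
            if self.responsive:
                self.val = (v, sn)
                done.release()               # acknowledge the write
        threading.Thread(target=run, daemon=True).start()
    def read(self, out, done):
        def run():
            if self.responsive:
                out.append(self.val)
                done.release()               # answer the read
        threading.Thread(target=run, daemon=True).start()

class MajorityRegister:
    """SWSR register built from m = 2t+1 base registers (majority-based)."""
    def __init__(self, regs, t):
        self.regs, self.t, self.sn = regs, t, 0
    def write(self, v):
        self.sn += 1                         # sequence numbers identify the most recent value
        done = threading.Semaphore(0)
        for r in self.regs:
            r.write(v, self.sn, done)        # invoke all base writes in parallel
        for _ in range(self.t + 1):
            done.acquire()                   # wait for a majority of acknowledgments
    def read(self):
        out, done = [], threading.Semaphore(0)
        for r in self.regs:
            r.read(out, done)                # invoke all base reads in parallel
        for _ in range(self.t + 1):
            done.acquire()                   # a majority answers
        return max(list(out), key=lambda p: p[1])[0]  # highest sequence number wins

t = 1
R = MajorityRegister([SimReg(), SimReg(), SimReg(responsive=False)], t)
R.write("a")
R.write("b")
print(R.read())  # → 'b'
```

The operations never wait for more than a majority of answers, so a non-responsive minority cannot block them.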
Before presenting a construction that builds a t-tolerant consensus object, let us give
an intuitive explanation of the fact that the “replicated state machine with parallel
invocations” approach does not work. This approach considers copies of the object
(here base consensus objects) on which the same operation is applied in parallel.
So, assuming m = 2t + 1 base consensus objects CONS[1..m], let us consider
the following candidate implementation of R_CONS.propose(v):
the invoking process (1) invokes in parallel CONS[k].propose(v) for k ∈ {1, . . . , m}
and then (2) takes the value decided by a majority of the base consensus objects.
As there is a majority of base objects that are reliable, this algorithm does not
block, and the invoking process receives decided values at least from a majority of
base consensus objects. But, according to the values proposed by the other processes,
it is possible that none of the values it receives is a majority value. It is even possible
that it receives a different value from each of the 2t + 1 base consensus objects if
there are n ≥ m = 2t + 1 processes and all have proposed different values to the
consensus object R_CONS.
While the “parallel invocation” approach works for objects such as atomic
read/write registers (see above), it does not work for consensus objects. This comes
from the fact that registers are data objects, while consensus objects are synchronization
objects, and synchronization is inherently non-deterministic.
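The failure scenario described above can be checked in a few lines (an illustration only; each reliable base consensus object is modeled as deciding the first value proposed to it):

```python
from collections import Counter

BOT = None

class OneShotConsensus:
    """A reliable base consensus object: the first proposed value is decided."""
    def __init__(self):
        self.decision = BOT
    def propose(self, v):
        if self.decision is BOT:
            self.decision = v
        return self.decision

t = 1
CONS = [OneShotConsensus() for _ in range(2 * t + 1)]  # m = 2t+1 reliable objects

# Adversarial interleaving: the proposal of process pi reaches CONS[i] first.
for i, v in enumerate(["a", "b", "c"]):
    CONS[i].propose(v)

# Process p0 now completes its parallel invocations and collects the decisions.
decisions = [c.propose("a") for c in CONS]
print(decisions)                              # ['a', 'b', 'c']
assert max(Counter(decisions).values()) <= t  # no value reaches a majority
```

Even though every base object is reliable, no proposed value is decided by a majority of them, so the majority rule gives p0 no value to decide.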
The t + 1 base consensus objects are denoted CONS[1..(t + 1)]. The construc-
tion is described in Fig. 15.8. The variable esti is local to the invoking process
and contains its current estimate of the decision value. When a process pi invokes
R_CONS.propose(v), it first sets esti to the value v it proposes. Then, pi sequen-
tially visits the base consensus objects in a predetermined order (e.g., starting from
CONS[1] until CONS[t + 1]; the important point is that the processes use the same
visit order). At step k, pi invokes CONS[k].propose(esti ). Then, if the value it obtains
is different from ⊥, pi adopts it as its new estimate value esti . Finally, pi decides the
value of esti after it has visited all the base consensus objects. Let us observe that, as
at least one base consensus object does not crash, all the processes that invoke
propose() on that object obtain the same non-⊥ value from it.
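The construction can be sketched as follows (an illustrative Python model, not the book's code; a responsive crash is simulated by an object that always answers ⊥):

```python
BOT = None  # the default value ⊥

class BaseConsensus:
    """A correct base consensus object: the first proposed value is decided."""
    def __init__(self):
        self.d = BOT
    def propose(self, v):
        if self.d is BOT:
            self.d = v
        return self.d

class CrashedConsensus:
    """Responsive crash: every invocation returns ⊥."""
    def propose(self, v):
        return BOT

def propose(CONS, v):
    """R_CONS.propose(v): visit the t+1 base objects in the same order."""
    est = v
    for c in CONS:
        d = c.propose(est)
        if d is not BOT:
            est = d          # adopt the value decided by this base object
    return est

t = 1
CONS = [CrashedConsensus(), BaseConsensus()]  # t+1 objects, at most t crash
print(propose(CONS, "a"))  # → 'a'
print(propose(CONS, "b"))  # → 'a'  (agreement: the second process adopts 'a')
```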
Theorem 64 The construction described in Fig. 15.8 is a wait-free t-tolerant
self-implementation of a consensus object from t + 1 base consensus objects that
can suffer responsive crash failures.
Proof The proof has to show that, if at most t base consensus objects crash, the
object that is built satisfies the validity, agreement and wait-free termination proper-
ties of consensus.
As any base consensus object CONS[k] is responsive, it follows that any invocation
of CONS[k].propose() terminates (line 3). It follows that, when executed by a correct
process, the for loop always terminates. The wait-free termination follows directly
from these observations.
When a process pi invokes R_CONS.propose(v), it first initializes its local vari-
able esti to the value v it proposes. Then, if esti is modified, it is modified at line 4 and
takes the value proposed by a process to the corresponding base consensus object.
By backward induction, that value was proposed by a process, and the consensus
validity property follows.
Let CONS[x] be the first (in the increasing order on x) non-faulty base consensus
object (by assumption, such a base object exists). Let v be the value decided by that
consensus object. It follows from the agreement property of CONS[x] that all the
processes that invoke CONS[x].propose() obtain the same value v and adopt it as
their estimate. From then on, the estimate of a process can no longer change (every
base consensus object visited after CONS[x] returns either ⊥ or a value that was
proposed to it, and all the values proposed to it are v). Hence, all the processes that
terminate decide v, which proves the agreement property.
The ideas and algorithms developed in this section are mainly due to P. Jayanti,
T.D. Chandra, and S. Toueg (1998).
We have seen in the previous sections the notions of responsive and non-responsive
object failures. With responsive failures, an invocation of an object operation always
returns a response, but that response can be ⊥. With non-responsive failures, an
invocation of an object operation may never return. Except for a few objects (including
atomic registers), it is usually impossible to design a t-tolerant wait-free implementation
of an object (e.g., a consensus object) when failures are non-responsive.
Failure modes In addition to the responsiveness dimension of object failures, three
modes of failure can be defined. These modes define a second dimension with respect
to object failures:
• Crash failure mode. An object experiences a crash if there is a time τ such that
the object behaves correctly up to time τ , after which all its operation invocations
return the default value ⊥. As we have seen, this can be summarized as “once
⊥, forever ⊥”. This failure mode was investigated in Sects. 15.1–15.3 for atomic
read/write registers and consensus objects.
This section presents two simple t-tolerant object self-implementations. The first
considers the omission failure mode, while the second considers the arbitrary failure
mode.
A t-tolerant self-implementation of consensus for responsive omission failures
Let us consider the t-tolerant self-implementation of a consensus object for the
responsive crash failure mode described in Fig. 15.8. It is easy to see that this
algorithm still works when the failure mode is omission (instead of crash). This is
because, by definition of t, at least one of the (t + 1) sequential iterations
necessarily uses a correct base consensus object.
A t-tolerant self-implementation of an SWSR safe register for responsive arbi-
trary failures Responsive arbitrary failure of a base read/write register BR means
that a read of BR can return an arbitrary value (even when there is no concurrent
write) and a write of a value v into BR can actually deposit any value v′ ≠ v.
The t-tolerant SWSR safe register SR is built from m = 2t + 1 base safe registers
REG[1..m]. Each base register is initialized to the initial value of SR. The
self-implementation of SR for responsive arbitrary failures is described in Fig. 15.9. The
algorithm implementing SR.write(v) consists in writing the value v in each base safe
register. When the reader invokes SR.read() it first reads all the base safe registers
(line 3) and then returns the value that is the most present in REG[1..m] (lines 4–5;
if several values are equally most present, any of them is chosen).
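The majority-voting construction of Fig. 15.9 can be sketched as follows (an illustrative Python model; variable names are assumptions):

```python
from collections import Counter

t = 1
m = 2 * t + 1
REG = ["init"] * m          # base safe registers, all holding SR's initial value

def sr_write(v):
    for k in range(m):      # SR.write(v): write v into every base register
        REG[k] = v

def sr_read():
    count = Counter(REG)    # SR.read(): return the most frequent value
    return count.most_common(1)[0][0]

sr_write("x")
REG[2] = "garbage"          # one base register fails arbitrarily (at most t)
print(sr_read())  # → 'x' (the t+1 correct copies form a majority)
```

As at most t of the 2t + 1 copies can be corrupted, the t + 1 correct copies always outnumber them.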
Theorem 66 The construction of Fig. 15.9 is a bounded wait-free t-resilient imple-
mentation of an SWSR safe register.
Proof The fact that the construction is bounded follows directly from an examination
of the algorithm. Moreover, as the read and write operations on the base registers are
responsive, the construction is wait-free.
Fig. 15.9 Wait-free t-tolerant (and gracefully degrading) self-implementation of an SWSR safe
register (responsive arbitrary failures)
Let us assume that at most t base registers fail. This means that at most t base
registers can contain arbitrary values. Hence, if SR.read() is invoked while
the writer does not execute SR.write(), it obtains at least t + 1 copies of the same
value v (which is the value written by the last invocation of SR.write()). As t + 1
defines a majority, the value v is returned. Finally, the fact that any value can be
returned when SR.read() is concurrent with an invocation of SR.write() concludes
the proof of the theorem.
Let us remember that, due to the suite of constructions presented in Part V of
this book, it is possible to build an MWMR atomic register from reliable safe bits.
Hence, stacking these constructions on top of the previous one, it is possible to
build a t-tolerant wait-free MWMR atomic register from safe bits prone to
responsive arbitrary failures.
Let us now consider the case where all the (t + 1) base consensus objects fail by crash from the very beginning. It follows that,
for any k, all the invocations CONS[k].propose() return ⊥. Hence, the local vari-
able esti of each process pi remains forever equal to the value proposed by pi , and
consequently each process decides its initial value. It follows that, when more than
t base consensus objects fail, the consensus object that is built can fail according to
the arbitrary failure mode.
A gracefully degrading t-tolerant self-implementation of a safe SWSR read/
write register (with respect to arbitrary failures) The construction described in
Fig. 15.9 is a gracefully degrading t-tolerant self-implementation for the arbitrary
failure mode. This means that, when more than t base safe registers experience
arbitrary failures, the constructed object fails according to the arbitrary failure mode.
This follows directly from the fact that this construction is t-tolerant (Theorem 66)
and that arbitrary failures define the most severe failure mode.
Consensus with responsive omission failures: specification As we have seen,
responsive omission failure means that an invocation of an object operation may
return the default value ⊥ instead of returning a correct value.
The definition of a consensus object (which is a one-shot object) has to be revisited
to take into account omission failures. The corresponding weakened definition is as
follows: the validity property becomes “a decided value is a proposed value or ⊥”,
and the agreement property becomes “no two processes decide different non-⊥
values”; the wait-free termination property is unchanged.
As we can see, taking into account omission failures requires one to modify only
the validity and agreement properties. It is easy to see that this definition boils down
to the usual definition if we eliminate the possibility of deciding ⊥.
A gracefully degrading t-tolerant self-implementation of consensus (with respect
to omission failures) A wait-free t-tolerant (with respect to responsive omission
failures) gracefully degrading self-implementation of a consensus object is described
in Fig. 15.10. This construction is due to P. Jayanti, T.D. Chandra and S. Toueg (1999).
It uses m = 2t + 1 base consensus objects which are kept in the array CONS[1..m].
In this construction, each process pi manages a local variable esti which contains
its estimate of the decision value and an array deci [1..(2t + 1)] that it will use to
take its final decision. Similarly to the construction described in Fig. 15.8 (which is
t-tolerant with respect to responsive omission failures but not gracefully degrading),
a process pi visits sequentially all the base consensus objects. The main differences
between the construction of Fig. 15.8 (which is not gracefully degrading) and the one
of Fig. 15.10 (which is designed to be gracefully degrading) are (a) the use of
m = 2t + 1 base consensus objects instead of t + 1 and (b) the fact that a process
records in deci [1..(2t + 1)] the value returned by each base consensus object and
takes its final decision from the content of this array instead of simply deciding its
last estimate.
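Based on the description above and on the references to lines 4 and 9 in the proof that follows, the construction of Fig. 15.10 can be sketched as follows (a plausible reconstruction, not the book's exact code):

```python
BOT = None  # the default value ⊥

class OmissiveConsensus:
    """A base consensus object that may fail by (responsive) omission."""
    def __init__(self, omissive=False):
        self.d, self.omissive = BOT, omissive
    def propose(self, v):
        if self.omissive:
            return BOT               # omission failure: answer ⊥
        if self.d is BOT:
            self.d = v               # the first proposed value is decided
        return self.d

def propose(CONS, t, v):
    """R_CONS.propose(v) on m = 2t+1 base objects."""
    est, dec = v, []
    for c in CONS:
        d = c.propose(est)
        dec.append(d)                # record every answer in dec[1..2t+1]
        if d not in (BOT, est):      # cf. the predicate of line 4
            est = d
    if sum(1 for d in dec if d == est) <= t:
        return BOT                   # cf. line 9: not enough support for est
    return est

t = 1
CONS = [OmissiveConsensus(omissive=True)] + [OmissiveConsensus() for _ in range(2 * t)]
print(propose(CONS, t, "a"))  # → 'a'
print(propose(CONS, t, "b"))  # → 'a'  (agreement despite one omissive object)
```

A process decides a non-⊥ value only when that value is supported by at least t + 1 base objects, which is what makes the graceful degradation argument work.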
To prove the consensus validity property and the graceful degradation property,
we have to show that the value that is decided by a process (i) is either a proposed
value or ⊥ and (ii) is not ⊥ when no more than t base consensus objects fail.
• Part (i). Let us first observe that each local variable esti is initialized to a proposed
value (line 1). Then, it follows from (a) the validity of the omission-prone base
consensus objects (a decided value is a proposed value or ⊥) and (b) a simple
induction on k that the values written in deci [1..(2t + 1)] are only ⊥ or the value
of some est j . As the value decided by pi is a value stored in deci [1..(2t + 1)]
(lines 3–5) part (i) of the consensus validity property follows.
• Part (ii). By assumption, there are c correct base consensus objects where
t + 1 ≤ c ≤ 2t + 1. Let CONS[k1 ], CONS[k2 ], . . . , CONS[kc ] be this sequence
of correct base objects. As there are at most t faulty base consensus objects, we
have k1 ≤ t + 1 ≤ c ≤ 2t + 1.
As CONS[k1 ] is correct and no process proposes ⊥, CONS[k1 ] returns the same
value v to all the processes that invoke CONS[k1 ].propose(). Hence, each non-
crashed process pi is such that esti = v at the end of its k1 th loop iteration. From
then on, as (a) a base object can fail only by omission (it then returns ⊥ instead
of returning a proposed value) and (b) due to its agreement property each of
the c − 1 remaining correct consensus objects returns v, it follows that, for any x,
k1 < x ≤ 2t + 1, the predicate of line 4 (deci [x] ∉ {⊥, esti }) is never satisfied.
Consequently, lines 5–6 are never executed after the k1 th loop iteration.
Moreover, as there are c ≥ t + 1 consensus objects that do not fail, it follows that,
at the end of the last iteration, at least c entries of the array deci contain the value
v, the other entries containing ⊥. It follows that, when a process exits the loop, the
predicate at line 9 cannot be satisfied. Consequently, pi decides the value v ≠ ⊥
kept in esti , which concludes the proof of part (ii).
To prove the consensus agreement property we have to show that, if a process pi
decides v ≠ ⊥ and a process p j decides w ≠ ⊥, then v = w.
The proof is by contradiction. Let us assume that pi decides v ≠ ⊥ while p j
decides w ≠ ⊥ and w ≠ v. As pi decides v, it follows from line 9 that at least
(t + 1) entries of deci [1..(2t + 1)] contain v. Similarly, at least (t + 1) entries of
dec j [1..(2t + 1)] contain w. As there are only (2t + 1) base consensus objects, it
follows from the previous observation that there is a base consensus object CONS[k]
such that the invocations of CONS[k].propose() by pi and p j returned v ≠ ⊥ to
pi and w ≠ ⊥ to p j . But this is impossible because, as CONS[k] can fail only by
omission, it cannot return different non-⊥ values to distinct processes. It
follows that v = w.
15.5 Summary
• The notions of t-tolerance and graceful degradation presented in this chapter are
due to P. Jayanti, T.D. Chandra, and S. Toueg [169].
• The constructions presented in Figs. 15.8 and 15.9 are from [169]. This paper
presents a suite of t-tolerant constructions (some being gracefully degrading) and
impossibility results.
• The gracefully degrading consensus construction for responsive omission failure
presented in Fig. 15.10 is due to P. Jayanti, T.D. Chandra, and S. Toueg [170].
It is shown in [170] that this construction (which uses 2t + 1 base consensus objects)
is space-optimal. As the construction described in Fig. 15.8 (which uses t + 1 base
consensus objects) is t-tolerant (but not gracefully degrading) with respect to the
omission failure mode, it follows that graceful degradation for consensus and
omission failures has a price: the number of base objects increases by t.
• The space-optimal t-tolerant self-implementation of an SWSR atomic register
described in Fig. 15.3 is due to R. Guerraoui and M. Raynal [131].
• The notion of memory failure is investigated in [8]. Such a failure corresponds to
a faulty write.
An object type characterizes the possible behaviors of a set of objects (namely, the
objects of that type). As an example the type consensus defines the behavior of
all consensus objects. Similarly, the type atomic register defines the behavior of
all atomic registers. This chapter considers concurrent object types defined by a
sequential specification on a finite set of operations.
We have seen in Chap. 14 that an object type T is universal if, together with
atomic registers, objects of type T allow for the wait-free construction of objects of
any type T′ (defined by a sequential specification on total operations). As consensus
objects are universal, a natural and fundamental question that comes to mind is the
following: given an object type T, in systems of how many processes do objects of
type T (together with atomic registers) allow consensus objects to be wait-free
implemented?
Consensus number The consensus number associated with an object type T is the
largest number n such that it is possible to wait-free implement a consensus object
from atomic read/write registers and objects of type T in a system of n processes.
16.2 Fundamentals
This section defines notions (schedule, configuration, and valence) which are central
to proving the impossibility of wait-free implementing a consensus object from “too
weak” object types in a system of n processes.
Reminder Let us consider an execution made up of sequential processes that invoke
operations on atomic objects of types T1 , . . . , Tx . These objects are called “base
objects” (equivalently, the types T1 , . . . , Tx are called “base types”). We have seen
in Chap. 4 (Theorem 14) that, as each base object is atomic, an execution at the
operation level can be modeled by an atomic history (linearization) S on the operation
invocations issued by the processes. This means that (a) S is a sequential history that
includes all the operation invocations issued by the processes (except possibly, for
each process, its last operation invocation if that invocation has no response event)
and (b) S is legal and respects the real-time occurrence order on these invocations.
The next theorem shows that, for any wait-free consensus algorithm A, there is at
least one initial bivalent configuration C, i.e., a configuration in which the decided
value is not predetermined: there are several proposed values that can still be decided
(for each of these values v, there is a schedule generated by the algorithm A that,
starting from that configuration, decides v).
This means that, while the decided value is only determined from the inputs
when the initial configuration is univalent, this is not true for all configurations, as
there is at least one initial bivalent configuration: the value decided by a wait-free
consensus algorithm cannot always be deterministically determined from the inputs.
As indicated previously, it may depend on the execution of the algorithm A itself:
for the very same set of proposed values, one value can be decided in some executions
while the other value can be decided in other executions.
Theorem 69 Let us assume that there is an algorithm A that wait-free implements
a binary consensus object in a system of n processes. There is then a bivalent initial
configuration.
Proof Let C0 be the initial configuration in which all the processes propose 0 to the
consensus object, and Ci , 1 ≤ i ≤ n, the initial configuration in which the processes
from p1 to pi propose the value 1, while all the other processes propose 0. So, all
the processes propose 1 in Cn . These configurations constitute a sequence in which
any two adjacent configurations Ci−1 and Ci , 1 ≤ i ≤ n, differ only in the value
proposed by the process pi : it proposes the value 0 in Ci−1 and the value 1 in Ci .
Moreover, it follows from the validity property of the consensus algorithm A that C0
is 0-valent while Cn is 1-valent.
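The chain of initial configurations C0, . . . , Cn used in this proof can be made concrete in a few lines (a small illustration, not part of the book's text):

```python
n = 4  # number of processes

# Ci: processes p1..pi propose 1, the others propose 0 (inputs as a vector)
configs = [[1] * i + [0] * (n - i) for i in range(n + 1)]

print(configs[0], configs[n])   # C0 = all 0s (0-valent), Cn = all 1s (1-valent)
for i in range(1, n + 1):
    diff = [j for j in range(n) if configs[i - 1][j] != configs[i][j]]
    assert diff == [i - 1]      # Ci-1 and Ci differ only in pi's proposed value
```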
Let us assume that all the previous configurations are univalent. It follows that, in
the previous sequence, there is (at least) one pair of consecutive configurations, say
Ci−1 and Ci , such that Ci−1 is 0-valent and Ci is 1-valent. We show a contradiction.
Assuming that no process crashes, let us consider an execution history H of the
algorithm A that starts from the configuration Ci−1 , in which the process pi invokes
no base operation for an arbitrarily long period (the end of that period will be defined
below). As the algorithm is wait-free, the processes decide after a finite number of
invocations of base operations. The sequence of operations that starts at the very
beginning of the history and ends when all the processes have decided (except pi ,
which has not yet started invoking base operations) defines a schedule S. (See the
upper part of Fig. 16.1. Within the vector containing the values proposed by each
process, the value proposed by pi is placed inside a box.) Then, after S terminates,
pi starts executing and eventually decides. As Ci−1 is 0-valent, S(Ci−1 ) is also
0-valent.
Let us observe (lower part of Fig. 16.1) that the same schedule S can be produced
by the algorithm A from the configuration Ci . This is because (1) the configurations
Ci−1 and Ci differ only in the value proposed by pi and (2) as pi invokes no base
operations in S, that schedule cannot depend on the value proposed by pi . As S(Ci−1 )
is 0-valent, it follows that the configuration S(Ci ) is also 0-valent. But, as also Ci is
1-valent, we conclude that S(Ci ) is 1-valent, a contradiction.
Crash versus asynchrony The previous proof is based on (1) the assumption stating
that the consensus algorithm A is wait-free (intuitively, the progress of a process does
not depend on the “speed” of the other processes) and (2) asynchrony (a process
progresses at its “own speed”). This allows the proof to play with process speeds
and consider a schedule (part of an execution history) in which a process pi does not
execute operations. We could have instead considered that pi has initially crashed
(i.e., pi crashes before executing any operation). During the schedule S, the wait-free
consensus algorithm A (the existence of which is a theorem assumption) has no way
to distinguish between these two cases (has pi initially crashed or is it only very
slow?). This shows that, for some problems, asynchrony and process crashes are two
facets of the same “uncertainty” with which wait-free algorithms have to cope.
We have seen in Part III that the computability power of atomic read/write registers is
sufficient to build wait-free implementations of concurrent objects such as splitters,
weak counters, store-collect objects, and snapshot objects. As atomic read/write
registers are very basic objects, an important question from a computability point
of view is then: are atomic read/write registers powerful enough to build wait-free
implementations of other concurrent objects such as queues, stacks, etc.?
This section shows that the answer to this question is “no”. More precisely, it
shows that MWMR atomic registers are not powerful enough to wait-free implement a
consensus object in a system of two processes. This means that the consensus number
of an atomic read/write register is 1; i.e., read/write registers allow a consensus
object to be wait-free implemented only in a system made up of a single process.
Stated another way, atomic read/write registers have the “poorest” computability
power when one is interested in wait-free implementations of atomic objects in
asynchronous systems prone to any number of process crashes.
(Figure: starting from the bivalent configuration D, applying R1.op1() by p and then
R2.op2() by q, or applying R2.op2() by q and then R1.op1() by p, leads to the very
same configuration when R1 and R2 are distinct registers.)
p(q(D)) and q( p(D)) are the very same configuration (each process is in the
same local state and each shared register has the same value in both configura-
tions).
As q(D) is 1-valent, it follows that p(q(D)) is also 1-valent. Similarly, as p(D)
is 0-valent, it follows that q( p(D)) is also 0-valent. This is a contradiction, as
the configuration p(q(D)) ≡ q( p(D)) cannot be both 0-valent and 1-valent.
2. R1 and R2 are the same register R.
• Both p and q read R.
As a read operation on an atomic register does not modify its value, this case is
the same as the previous one (where p and q access distinct registers).
• p invokes R.read(), while q invokes R.write() (Fig. 16.3).
(Let us notice that the case where q reads R while p writes R is similar.) Let
Read p be the read operation issued by p on R, and W riteq be the write oper-
ation issued by q on R. As Read p (D) is 0-valent, so is Writeq (Read p (D)).
Moreover, Writeq (D) is 1-valent.
The configurations D and Read p (D) differ only in the local state of p (it
has read R in the configuration Read p (D), while it has not in D). These two
configurations cannot be distinguished by q. Let us consider the following
two executions:
– After the configuration D was attained by the algorithm A, p stops exe-
cuting for an arbitrarily long period, and during that period only q exe-
cutes base read/write operations (on base atomic read/write registers). As
by assumption the algorithm A is wait-free, there is a finite sequence of
read/write invocations on atomic registers issued by q at the end of which q
q decides
Schedule S
q decides
– Similarly, after the configuration Read p (D) was attained by the algorithm
A, p stops executing for an arbitrarily long period. The same schedule S′
(defined in the previous item) can be issued by q after the configuration
Read p (D). This is because, as p issues no read/write invocations on base
atomic registers, q cannot distinguish D from Read p (D). It follows that q
decides at the end of that schedule and, as Writeq (Read p (D)) is 0-valent,
q decides 0.
While executing the schedule S′, q cannot know which (D or Read p (D)) was
the configuration when it started executing S′ (this is because these configurations
differ only in a read of R by p). As the schedule S′ is deterministic (it
is composed only of read and write invocations issued by q on base atomic
registers), q must decide the same value, whatever the configuration at the
beginning of S′. This is a contradiction, as it decides 1 in the first case and 0
in the second case.
• Both p and q invoke R.write().
Let Writep and Writeq be the write operations issued by p and q on R,
respectively. By assumption the configurations Writep (D) and Writeq (D) are
0-valent and 1-valent, respectively.
The configurations Writeq (Writep (D)) and Writeq (D) cannot be distinguished
by q: the write of R by p in the configuration D that produces the config-
uration Writep (D) is overwritten by q when it produces the configuration
Writeq (Writep (D)).
The reasoning is then the same as the previous one. It follows that, if q executes
alone from D until it decides, it decides 1 after executing a schedule S′′. The
same schedule from the configuration Writep (D) leads it to decide 0. But, as q
cannot distinguish D from Writep (D), and S′′ is deterministic, it follows that
it has to decide the same value in both executions, a contradiction as it decides
1 in the first case and 0 in the second case. (Let us observe that Fig. 16.3 is
still valid. We have only to replace Readp (D) and S′ by Writep (D) and S′′,
respectively.)
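The overwriting argument used in this last case can be checked on a toy model (an illustration only: it models the content of the shared memory, not the processes' local states):

```python
# Configuration D: only the content of the shared register R is modeled here
D = {"R": "v0"}

def write(config, val):
    c = dict(config)             # applying a write yields a new configuration
    c["R"] = val
    return c

# q's write overwrites p's write: starting from D, Write_q(Write_p(D)) and
# Write_q(D) leave the shared memory in the same state.
assert write(write(D, "vp"), "vq") == write(D, "vq")
print("q cannot distinguish the two configurations")
```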
As atomic registers are too weak to wait-free implement a consensus object for two
processes, the question posed at the beginning of the chapter becomes: are there
objects that allow consensus objects to be wait-free implemented in systems of two
or more processes?
This section first considers three base objects (test&set objects, queues, and swap
objects) and shows that, together with shared registers, they can wait-free implement
consensus in a set of two processes. These objects and their operations have been
introduced in Sect. 2.2 (their definition is repeated in the following to make this
chapter easier to read). This section then shows that these concurrent objects cannot
wait-free implement a consensus object in a system of three or more processes. It
follows that their consensus number is 2.
In this section, when considering two processes, they are denoted p0 and p1
(considering the process indexes 0 and 1 makes the presentation simpler). When
considering three processes, they are denoted p, q, and r .
Test&set objects A test&set object TS is a one-shot atomic object that provides the
processes with a single operation called test&set() (hence the name of the object).
Such an object can be seen as maintaining an internal state variable X which is
initialized to 1 and can contain only the values 0 or 1. The effect of an invocation of
TS.test&set() can be described by the atomic execution of the following code:
prev ← X; X ← 0; return(prev).
Hence, the test&set object in which we are interested is an atomic one-shot object
whose first invocation returns 1 while all the other invocations return 0. The process
that obtains the value 1 is called the winner. The other ones (or the other one in the
case of two processes) are called losers.
From test&set to consensus for two processes The algorithm described in Fig. 16.4
builds a wait-free implementation of a consensus object for two processes ( p0 and
p1 ) from a test&set object TS. It uses two additional SWSR registers R E G[0] and
R E G[1] (a process pi can always keep a local copy of the atomic register it writes,
so we do not count it as one of its readers). Moreover, as each register R E G[i] is
always written by pi before being read by p1−i , they do not need to be atomic (or
even safe). Let C be the consensus object that is built. The construction is made up
of two parts:
• When process pi invokes C.propose(v), it deposits the value v it proposes into
REG[i] (line 1). This part consists for pi in making public the value it proposes.
• Then pi executes a control part to know which value has to be decided. To that
aim, it uses the underlying test&set object TS (line 2). If it obtains the initial value
of the test&set object (1), it is the winner and decides the value it has proposed
(line 3). Otherwise pi is the loser and accordingly it decides the value proposed
by the winner p1−i (line 4).
Let us remember that, as the test&set object is atomic, the winner is the process
whose invocation TS.test&set() is the first that appears in the linearization order
associated with the object TS.
Theorem 72 The algorithm described in Fig. 16.4 is a wait-free construction, in a
system of two processes, of a consensus object from a test&set object.
Proof The algorithm is clearly wait-free, from which follows the consensus termi-
nation property.
Let pi be the winner. When it executes line 2, the test&set object TS changes its
value from 1 to 0, and then, as any other invocation finds TS = 0, the test&set object
keeps forever the value 0. As pi is the only process that obtains the value 1 from TS,
it decides the value v it has just deposited in REG[i] (line 3).
Moreover, as the other process p1−i obtains the value 0 from TS, it decides the
value deposited in REG[i] by the winner pi (line 4). Let us observe that, due to the
values that are returned to pi and p1−i by the test&set object, we conclude that the
invocation of TS.test&set() by pi is linearized before the one issued by p1−i , from
which it follows that REG[i] was written before being read by p1−i .
It follows that a single value is decided (consensus agreement property) and that
value was proposed by a process (consensus validity property). Hence, the algorithm
described in Fig. 16.4 is a wait-free implementation of a consensus object in a system
of two processes.
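Since Fig. 16.4 is not reproduced here, the construction it describes can be rendered as a Python sketch; the lock only models the atomicity of test&set(), and the class and method names are illustrative, not the book's notation:

```python
import threading

class TestAndSet:
    """One-shot test&set: the first invocation returns 1, all others 0."""
    def __init__(self):
        self._x = 1                      # internal state X, initialized to 1
        self._lock = threading.Lock()    # models the atomicity of test&set()
    def test_and_set(self):
        with self._lock:
            prev, self._x = self._x, 0
            return prev

class TSConsensus:
    """Two-process consensus from one test&set object (pattern of Fig. 16.4)."""
    def __init__(self):
        self.TS = TestAndSet()
        self.REG = [None, None]          # REG[i]: value proposed by p_i
    def propose(self, i, v):
        self.REG[i] = v                  # line 1: make the proposed value public
        if self.TS.test_and_set() == 1:  # line 2: the winner obtains 1
            return v                     # line 3: winner decides its own value
        return self.REG[1 - i]           # line 4: loser decides the winner's value
```

In a concurrent run, whichever process is linearized first on TS decides its own value and the other adopts it.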
Fig. 16.5 From an atomic concurrent queue to consensus (code for pi , i ∈ {0, 1})
Fig. 16.6 From a swap register to consensus (code for pi , i ∈ {0, 1})
value w which is initially at the head of the queue. As suggested by the text of the
algorithm, the proof is then verbatim the same.
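Along the same lines, the construction of Fig. 16.5 can be sketched in Python: the queue is initialized with a single winner item w, and the first process to dequeue it is the winner. This is a hedged model; AtomicQueue and the token object are illustrative assumptions:

```python
import threading
from collections import deque

class AtomicQueue:
    def __init__(self, items):
        self._q = deque(items)
        self._lock = threading.Lock()    # models the atomicity of dequeue()
    def dequeue(self):
        with self._lock:
            return self._q.popleft() if self._q else None

WINNER = object()                        # models the item w initially in the queue

class QueueConsensus:
    """Two-process consensus from a queue initialized with the single item w."""
    def __init__(self):
        self.Q = AtomicQueue([WINNER])
        self.REG = [None, None]
    def propose(self, i, v):
        self.REG[i] = v                  # make the proposed value public
        if self.Q.dequeue() is WINNER:   # the first dequeuer obtains w: winner
            return v
        return self.REG[1 - i]           # the loser decides the winner's value
```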
Swap objects The swap object R considered here is a one-shot atomic register
that can be accessed by an operation denoted R.swap(). This operation, which has
an input parameter v, assigns it to R and returns the previous value of R. Its effect
can be described by the atomic execution of the following statements:
prev ← R; R ← v; return(prev).
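Fig. 16.6 (not reproduced here) builds a two-process consensus object from such a swap register. One plausible instantiation, following the same pattern as the test&set-based construction (the winner is the process whose swap() returns the initial value ⊥), can be sketched as:

```python
import threading

class SwapRegister:
    def __init__(self, init):
        self._v = init
        self._lock = threading.Lock()    # models the atomicity of swap()
    def swap(self, v):
        with self._lock:
            prev, self._v = self._v, v
            return prev

class SwapConsensus:
    """Two-process consensus from a swap register R initialized to bottom (None)."""
    def __init__(self):
        self.R = SwapRegister(None)
        self.REG = [None, None]
    def propose(self, i, v):
        self.REG[i] = v                  # make the proposed value public
        if self.R.swap(i) is None:       # winner: its swap returns the initial value
            return v
        return self.REG[1 - i]           # loser decides the winner's value
```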
Fetch&add objects A fetch&add object X is an atomic register that can be accessed by an operation denoted X.fetch&add(), whose input parameter x is a positive value. Its effect can be described by the atomic execution of the following statements:

prev ← X; X ← X + x; return(prev).
The computational power of the previous objects allows for wait-free implementa-
tions of a consensus object in a system of two processes, hence the question: Do they
allow for the wait-free implementation of a consensus object in a system of three or
more processes? Surprisingly, the answer to this question is “no”.
This section gives a proof that the consensus number of an atomic concurrent
queue is exactly 2. (The proofs for the other objects presented in this section are
similar.)
Theorem 74 Atomic wait-free queues have consensus number 2.
Proof This proof has the same structure as the proof of Theorem 70. Considering
binary consensus, it assumes that there is an algorithm A based on queues and atomic
read/write registers that wait-free implements a consensus object in a system of three
processes (denoted p, q, and r). As in Theorem 70 we show that, starting from an
initial bivalent configuration C (due to Theorem 69, such a configuration does exist),
there is an arbitrarily long schedule S produced by A that leads from C to another
bivalent configuration S(C). This shows that A has an execution in which no process
ever decides, which proves the theorem by contradiction.
Starting the algorithm A in a bivalent configuration C, let S be a maximal schedule
produced by A such that the configuration D = S(C) is bivalent. As we have seen
in the proof of Theorem 70, “maximal” means that the configurations p(D), q(D),
and r(D) are monovalent. Moreover, as D is bivalent, two of these configurations
have different valences. Without loss of generality let us say that p(D) is 0-valent
and q(D) is 1-valent. r(D) is either 0-valent or 1-valent (the important point here is
that r(D) is not bivalent).
Let Opp be the invocation of the operation issued by p that leads from D to p(D),
Opq the invocation of the operation issued by q that leads from D to q(D), and Opr
the operation issued by r that leads from D to r(D). Each of Opp, Opq, and Opr is a
read or a write of an atomic register, or an enqueue or a dequeue on an atomic queue.
Let us consider p and q (the processes that produce configurations with different
valences), and let us consider that, from D, process r does not execute operations
for an arbitrarily long period:
• If both Opp and Opq are invocations of operations on atomic registers, the proof
follows from the proof of Theorem 70.
• If one of Opp and Opq is an invocation of an operation on an atomic register while
the other is an invocation of an operation on an atomic queue, the reasoning used in
item 1 of the proof of Theorem 70 applies. This reasoning, based on the argument
depicted in Fig. 16.2, permits one to conclude that p(q(D)) ≡ q(p(D)), while
one is 0-valent and the other is 1-valent.
It follows that the only case that remains to be investigated is when both Opp and
Opq are operations on the same atomic queue Q. We proceed by a case analysis.
There are three cases:
(Fig. 16.7: from the bivalent configuration D, in which the queue Q contains k ≥ 0 items, p invokes Q.enqueue(a) and q invokes Q.enqueue(b).)
• Q is not empty. In this case, the configurations Opq(Opp(D)) and Opp(Opq(D))
are the same configuration: in both, each object has the same state and each
process is in the same local state. This is a contradiction because Opq(Opp(D))
is 0-valent while Opp(Opq(D)) is 1-valent.
• Q is empty. In this case, r cannot distinguish the configurations Opp(D) and
Opp(Opq(D)). The same reasoning as that of case 1 above shows a contradiction
(the same schedule S starting from any of these configurations and involving
only operation invocations issued by r has to decide both 0 and 1).
3. Opp is the invocation of Q.enqueue(a) and Opq is the invocation of
Q.enqueue(b). (This case is described in Fig. 16.7.)
Let k be the number of items in the queue Q in the configuration D. This means
that p(D) contains k + 1 items, and q(p(D)) (or p(q(D))) contains k + 2 items
(see Fig. 16.8).
As the algorithm A is wait-free and p(D) is 0-valent, there is a schedule Sp
starting from the configuration q(p(D)) that includes only operation invocations
issued by p (these invocations are on atomic registers and possibly other atomic
queues) and that ends with p deciding 0.
Claim C1. The schedule Sp contains an invocation by p of Q.dequeue() that
dequeues the (k + 1)th element of Q.
Proof of the claim C1. Assume by contradiction that p issues at most k invocations
of Q.dequeue() in Sp (and so it never dequeues the value a it has enqueued).
In that case (see Fig. 16.9), if we apply the schedule p·Sp (i.e., the invocation of
Q.enqueue(a) by p followed by the schedule Sp) to the configuration q(D), we
obtain the configuration Sp(p(q(D))) in which p decides 0. It decides 0 for the
following reason: as it dequeues at most k items from Q, p cannot distinguish the
configurations p(q(D)) and q(p(D)). This is because these two configurations
differ only in the state of the queue Q: its two last items in q(p(D)) are a followed
by b, while they are b followed by a in p(q(D)); but, as we have just seen,
p decides 0 in Sp(q(p(D))) without having dequeued a or b (let us remember
that, due to the contradiction assumption, Sp contains at most k dequeues on Q).
But this contradicts the fact that, as q(D) is 1-valent, p should have decided
1. Hence, Sp contains an invocation of Q.dequeue() that dequeues the (k + 1)th
element of Q. End of proof of the claim C1.
It follows from this claim that Sp contains at least k + 1 invocations of
Q.dequeue() issued by p. Let S′p be the longest prefix of Sp that does not
contain the (k + 1)th dequeue invocation on Q by p:
(Figure: from the bivalent configuration D, the enqueues Q.enqueue(a) by p and Q.enqueue(b) by q executed in both orders, each order followed by the schedule Sp, and then by a schedule Sq issued by q.)
Let us now consider the third process r (that has not invoked operations since
configuration D).
All the objects have the same state in the configurations D0 and D1. Moreover,
process r also has the same local state in both configurations. It follows that D0
and D1 cannot be distinguished by r.
(Figure: from D, both enqueue orders are followed by Q.dequeue() by p, which obtains a in one order and b in the other, and then by Q.dequeue() by q, which obtains the remaining value; the two resulting configurations are D0 and D1.)
This section shows that there are atomic objects whose consensus number is +∞.
Hence, it is possible to wait-free implement a consensus object in a system made
up of read/write registers and such objects, whatever the number n of processes.
Remark on the number and the indexes of processes The previous consensus
construction based on a compare&swap object uses neither the process indexes nor the
number n of processes. It consequently works in anonymous systems with infinitely
many processes.
A base SWMR register REG[i] is associated with each process pi . This register
is used to make public the value proposed by pi (line 1). A process pj reads it at line
4 if it has to decide the value proposed by pi .
There are n + 1 mem-to-mem-swap objects. The array A[1..n] is initialized to
[0, . . . , 0], while the object R is initialized to 1. The object A[i] is written only by
pi , and this write is due to a mem-to-mem-swap operation: pi exchanges the content
of A[i] with the content of R (line 2). As we can see, differently from A[i], the
mem-to-mem-swap object R can be written by any process. As described in lines
2–4, these objects are used to determine the decided value. After it has exchanged
A[i] and R, a process looks for the first entry j of the array A such that A[j] ≠ 0,
and decides the value deposited by the corresponding process pj.
Remark: a simple invariant To better understand the algorithm, let us observe
that the following relation remains invariant: R + Σ1≤i≤n A[i] = 1. As initially
R = 1 and A[i] = 0 for each i, this relation is initially satisfied. Then, due to the fact
that the binary operation A[i].mem_to_mem_swap(R) issued at line 2 is atomic, it
follows that the relation remains true forever.
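The lines 1–4 referred to above can be sketched in Python; a global lock models the hardware atomicity of the mem-to-mem-swap operation, and all names are illustrative:

```python
import threading

_MEM_LOCK = threading.Lock()             # models the atomicity of the operation

def mem_to_mem_swap(cells, i, reg):
    """Atomically exchange the two memory words cells[i] and reg[0]."""
    with _MEM_LOCK:
        cells[i], reg[0] = reg[0], cells[i]

class MemSwapConsensus:
    """n-process consensus from n+1 mem-to-mem-swap objects (A[1..n] and R)."""
    def __init__(self, n):
        self.n = n
        self.A = [0] * n                 # A[i] is written only by p_i
        self.R = [1]                     # R is initialized to 1
        self.REG = [None] * n
    def propose(self, i, v):
        self.REG[i] = v                      # line 1: publish the proposed value
        mem_to_mem_swap(self.A, i, self.R)   # line 2: exchange A[i] and R
        for j in range(self.n):              # lines 3-4: first j with A[j] != 0
            if self.A[j] != 0:
                return self.REG[j]
```

The invariant R + ΣA[i] = 1 is preserved because each exchange is atomic, so exactly one process (the winner) ends up with A[i] = 1.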
Lemma 40 The mem-to-mem-swap object type has consensus number n in a system
of n processes.
Proof The algorithm is trivially wait-free (the loop is bounded). As before, let the
winner be the process pi that sets A[i] to 1 when it executes line 2. As any mem-
to-mem-swap register A[j] is written at most once, we conclude from the previous
invariant that there is a single winner. Moreover, due to the atomicity of the mem-
to-mem-swap objects, the winner is the first process that executes line 2. As, before
becoming the winner, a process pi has deposited into REG[i] the value v it proposes to
the consensus object, we have REG[i] = v and A[i] = 1 before the other processes
terminate the execution of line 2. It follows that all the processes that decide return
the value proposed by the single winner process.
Remark Differently from the construction based on compare&swap objects, the
construction based on mem-to-mem-swap objects uses the indexes and the number
of processes.
Definition 5 An object type is universal in a system of n processes if, together with
atomic registers, it permits a wait-free implementation of a consensus object for n
processes to be built.
This notion of universality completes the one introduced in Chap. 14. It states
that, in an n-process system, universal constructions can be built as soon as the base
read/write system is enriched with an object type that is universal in a system of
m ≥ n processes.
The following theorem is an immediate consequence of the previous lemma.
Theorem 76 For any n, mem-to-mem-swap objects are universal in a system of n
processes.
Sticky bit A sticky bit SB is an atomic register which is initialized to ⊥ and can
then contain forever either the value 0 or the value 1. It can be accessed only by the
operation SB.write(v), where the value of the parameter v is 0 or 1.
The effect of the invocation of SB.write(v) can be described by the following
sequence of statements executed atomically (X represents the current value of SB).
The first invocation (as defined by the linearization order) gives its value to the sticky
bit SB and returns the value true. Any other invocation returns true if the value v it
wants to write is the current value of X, and false if v ≠ X.
operation SB.write(v) is
if (X = ⊥)
then X ← v; return(true)
else if (X = v) then return(true) else return(false) end if
end if
end operation.
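A direct transliteration of this pseudocode, together with the binary consensus object it yields for any number of processes, might look as follows (an illustrative sketch; the lock models the atomicity of write()):

```python
import threading

class StickyBit:
    def __init__(self):
        self._x = None                   # models the initial value (bottom)
        self._lock = threading.Lock()    # models the atomicity of write()
    def write(self, v):
        with self._lock:
            if self._x is None:          # the first write gives SB its value
                self._x = v
                return True
            return self._x == v          # later: true iff v is the stuck value

class StickyConsensus:
    """Binary consensus from a single sticky bit (any number of processes)."""
    def __init__(self):
        self.SB = StickyBit()
    def propose(self, v):                # v in {0, 1}
        return v if self.SB.write(v) else 1 - v
```

If write(v) returns false, the stuck value is 1 − v, and it was written by a process that proposed it, so validity is preserved.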
• This means that, while read/write registers are universal in failure-free systems,
this is no longer true in asynchronous systems prone to any number of process
crashes when one is interested in wait-free implementations. More generally, the
synchronization primitives (provided by shared memory distributed machines)
have different power in the presence of process crashes: compare&swap is stronger
than test&set, which is in turn stronger than atomic read/write operations.
• Interestingly, consensus numbers show also that, in a system of two processes,
the concurrent versions of classical objects encountered in sequential computing
such as stacks, lists, sets, and queues are as powerful as the test&set or fetch&add
synchronization primitives when one is interested in providing processes with
wait-free objects.
More generally, consensus numbers define an infinite hierarchy on concurrent
objects. Let X be an object at level x, Y an object at level y, and x < y. This
hierarchy is such that:
1. It is not possible to wait-free implement Y from objects X and atomic registers.
2. It is possible to wait-free implement X from objects Y and atomic registers.
This hierarchy is called a consensus hierarchy, wait-freedom hierarchy, or
Herlihy’s hierarchy. Some of its levels and corresponding objects are described in
Table 16.1. (LL/SC registers have been defined in Sect. 6.3.2, m-register assignment
objects are defined in the exercise section.)
The previous hierarchy shows that fault-masking can be impossible to achieve when
the designer is provided with base atomic objects whose computability power (as
measured by their consensus number) is too weak.
As an example, a wait-free FIFO queue that has to tolerate the crash of a single
process cannot be built from atomic registers. This follows from the fact that the
consensus number of a queue is 2 while the consensus number of atomic registers
is 1.
16.6 Hierarchy of Atomic Objects
The previous hierarchy of concurrent objects is robust in the sense that any number
of objects of the same deterministic type Tx whose consensus number is x cannot
wait-free implement (with the additional help of any number of read/write registers)
an object Y whose consensus number y is such that y > x.
16.7 Summary
This chapter has introduced the notions of consensus number and consensus hierar-
chy. These notions are related to the wait-free implementation of concurrent objects.
Important results related to the consensus hierarchy are: (a) atomic read/write
registers, snapshot objects, and (2n − 1)-renaming objects have consensus number
1, (b) test&set, fetch&add, queues, and stacks have consensus number 2 and (c)
compare&swap, LL/SC, and sticky bits have consensus number +∞.
Combined with a universal construction, the consensus hierarchy states which
synchronization computability power is needed when one wants to be able to wait-
free implement, in a system of n processes, any concurrent object defined by a
sequential specification.
16.8 Bibliographic Notes
• The consensus number notion, the consensus hierarchy, and wait-free synchroniza-
tion are due to M. Herlihy [138].
• The impossibility of solving consensus from atomic registers in a system of two
processes was proved in several papers [76, 199, 257].
This impossibility result is the shared memory counterpart of the well-known
result named FLP (from the names of the persons who proved it, namely M.J.
Fischer, N.A. Lynch, and M.S. Paterson). This result concerns the impossibility
of solving consensus in asynchronous message-passing systems prone to even a
single process crash [102].
• The reader interested in synchronization primitives that exist (or could be realized)
on multiprocessor machines can consult [117, 118, 157, 180, 260].
Chapter 16 has shown that a base read/write system enriched with objects whose con-
sensus number is at least x allows consensus objects to be wait-free implemented in a
system of at most x processes. Moreover, thanks to universal constructions (Chap. 14)
these objects allow construction of wait-free implementations of any object defined
by a sequential specification on total operations.
Hence, the question: is the enrichment of a read/write asynchronous system with
stronger operations such as compare&swap the only means to obtain wait-free imple-
mentations of consensus objects? This chapter shows that the answer to this question
is “no”. To that end it presents another approach which is based on information
on failures (the failure detector-based approach). Each process is enriched with a
specific module that gives it (possibly unreliable) information on the faulty/correct
status of processes.
The chapter presents first a de-construction of compare&swap that allows us to
introduce two types of objects whose combination allows consensus to be solved
despite asynchrony and any number of process crashes. One is related to the safety
of consensus, the other one to its liveness. The chapter then presents the failure
detector Ω, which has been proved to be the one providing the weakest information
on failures which allows consensus to be solved. Ω is used to ensure the consensus
termination property. Several base objects that can be used to guarantee the consensus
safety properties are then introduced. Differently from Ω, these objects can be wait-
free built from read/write registers only. The chapter finally investigates additional
synchrony assumptions that allow Ω to be implemented.
As noticed in Chap. 14, a consensus object CONS can be seen as a write-once register
whose value is determined by the first invocation of its operation CONS.propose().
Moreover, any invocation of CONS.propose() (which terminates) returns the value
of the consensus object.
From a trivial failure-free case Let us consider a failure-free asynchronous sys-
tem made up of processes cooperating through atomic registers. A trivial consensus
algorithm consists in deciding the value proposed by a predetermined process, say
pℓ. That process (which can be any predetermined process) deposits the value it
proposes into an SWMR register DEC (initialized to ⊥, a default value that cannot
be proposed), and a process pi ≠ pℓ reads the shared register DEC until it obtains a
value different from ⊥. Assuming that pℓ eventually proposes a value, this constitutes
a trivial starvation-free implementation of a consensus object. (Let us notice that the
shared register is not required to be atomic. It could be a regular register.)
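This trivial failure-free scheme can be sketched as follows (a sequential-friendly model; the busy-wait loop stands for the repeated reads of DEC, and None models ⊥):

```python
import time

class LeaderConsensus:
    """Failure-free scheme: the predetermined leader writes DEC once;
    every other process reads DEC until it is no longer bottom."""
    def __init__(self, leader=0):
        self.leader = leader
        self.DEC = None                  # the shared register, initialized to bottom
    def propose(self, i, v):
        if i == self.leader:
            self.DEC = v                 # the leader deposits its value
        while self.DEC is None:          # the others re-read DEC until it is set
            time.sleep(0.001)
        return self.DEC
```

If a non-leader invokes propose() before the leader, it spins until the leader writes; in a real run the processes would execute in separate threads.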
First to compare&swap An attempt to adapt this algorithm to a system where
any number of processes may crash and in which not all the processes propose a
value could be the following. Let the “leader” be the process that “imposes” a value
as the decided value. As there is no statically defined leader pℓ that could impose
its value on all (any static choice could select a process that crashes before writing
into DEC), any process is entitled to compete to play the leader role. Moreover,
to ensure the wait-freedom property of the algorithm implementing the operation
CONS.propose(), as soon as a process proposes a value, it must try to play the leader
role just in case no other process writes into DEC (because it crashes before or never
invokes CONS.propose()).
So, when it proposes a value, a process pi first reads DEC, and then writes the
value v it proposes if its previous read returned ⊥. Unfortunately, between a read
obtaining the value ⊥ and the subsequent write into DEC by a process pi, the value
of DEC may have been changed from ⊥ to some value v′ by another process pj
(and maybe v′ has then been read and decided by other processes). Hence, this naive
approach does not work (we already knew that because the consensus number of
atomic registers is 1).
A way to solve the previous problem consists in forging an atomic operation
that includes both the read and the associated conditional write. As we have seen
in Chap. 16, this is exactly what is done by the operation compare&swap(). If the
register DEC is provided with that operation, CONS.propose(v) is implemented by
DEC.compare&swap(⊥, v) and the first process that invokes it becomes the dynam-
ically defined leader that imposes the value v as the value of the consensus object.
(The term “winner” was used in Chap. 16 instead of the term “leader”). It is impor-
tant to notice that, as it is atomic, the operation compare&swap() allows one to
cope both with process crashes and the fact that not all processes propose a value.
This intuitively explains why the consensus number of a compare&swap register
is +∞.
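Under these definitions, a consensus object built from a compare&swap register can be sketched as follows (the lock only models the atomicity of compare&swap(), and a private sentinel object models ⊥):

```python
import threading

BOT = object()                           # a sentinel modeling the default value

class CASRegister:
    def __init__(self, init):
        self._v = init
        self._lock = threading.Lock()    # models the atomicity of compare&swap()
    def compare_and_swap(self, old, new):
        with self._lock:
            prev = self._v
            if prev is old:              # conditional write: only if value is old
                self._v = new
            return prev                  # always return the previous value

class CASConsensus:
    """Consensus for any number of processes from one compare&swap register."""
    def __init__(self):
        self.DEC = CASRegister(BOT)
    def propose(self, v):
        prev = self.DEC.compare_and_swap(BOT, v)
        return v if prev is BOT else prev    # the first invoker imposes its value
```

The first invocation (in the linearization order) finds ⊥ and imposes its value; every later invocation obtains that value, whoever crashes and whoever abstains.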
The notion of a failure detector, which is due to T.D. Chandra and S. Toueg (1996),
was introduced in Sect. 5.3.1 (where the eventually restricted leader failure detector
ΩX and the eventually perfect failure detector ♦P were introduced).
Let us remember that a failure detector is a device (object) that provides each
process with a read-only local variable which contains (possibly incomplete and
unreliable) information related to failures. A given class of failure detector is defined
by the type and the quality of this information.
A failure detector FD is non-trivial with respect to a system S and an object O if
the object O can be wait-free implemented in S enriched with FD while it cannot in
S alone.
17.2.1 Definition of Ω
The failure detector Ω was introduced by T.D. Chandra, V. Hadzilacos, and S. Toueg
(1996), who showed that this failure detector captures the weakest information
on failures that allows one to build a consensus object in an asynchronous system
prone to process crash failures. As there is no wait-free implementation of a consensus
object in asynchronous systems where (a) processes communicate by read/write
registers only and (b) any number of processes may crash (as shown in Chap. 16),
it follows that Ω is a non-trivial failure detector. Consequently (as we will see in
Sect. 17.7), Ω cannot be implemented in a pure asynchronous system.
Ω is sometimes called an eventual leader oracle. From a more practical point of
view, it is sometimes called an eventual leader service.
Definition of Ω This failure detector provides each process pi with a local variable
denoted leaderi that pi can only read. These local variables satisfy the following
properties (let us remember that a process is correct in an execution if it does not
crash in that execution):
• Validity. For any pi and at any time, leaderi contains a process index (i.e., leaderi ∈
{1, . . . , n}).
• Eventual leadership. There is a finite time after which the local variables leaderi
of the correct processes contain forever the same index ℓ and the corresponding
process pℓ is a correct process.
Taking X = Π = {1, . . . , n} (the whole set of process indexes), Ω corresponds to
ΩX, the failure detector introduced in Sect. 5.3.1. As we have seen in that chapter, this
means that there is a finite anarchy period during which processes can have arbitrary
leaders (distinct processes having different leaders and some leaders being crashed
processes). But this anarchy period eventually terminates, and after it has terminated,
all correct processes have “forever” the same correct leader. However, it is important
to notice that no process knows when the anarchy period terminates and the eventual
leader is elected.
The term “forever” is due to asynchrony. As there is no notion of physical time
accessible to the processes, once elected, the leader process has to remain leader for
a long enough period in order to be useful. Even if this period is finite, it cannot be
bounded. Hence the term “forever”.
A formal definition Let us consider a discrete global clock, not accessible to the
processes, whose domain is the set of integers, denoted N. As previously indicated,
let Π be the set of the process indexes.
A failure pattern is a function F : N → 2Π such that F(τ) denotes the set of
processes that have crashed by time τ. As a crashed process does not recover, we
have (τ ≤ τ′) ⇒ F(τ) ⊆ F(τ′). The set C of processes that are correct in a
given execution is the set of processes that do not crash during that execution, i.e.,
C = Π \ (∪τ≥1 F(τ)).
∀i ∈ Π, ∀τ ∈ N, let leaderiτ be the value of leaderi at time τ. Let CF be the set
of indexes of the correct processes in the failure pattern F(). With these notations,
Ω is defined as follows:
• Validity. ∀ F(), ∀i ∈ Π, ∀τ ∈ N : leaderiτ ∈ Π.
• Eventual leadership. ∀ F(), ∃ℓ ∈ CF, ∃τ ∈ N : ∀τ′ ≥ τ : ∀i ∈ CF : leaderiτ′ = ℓ.
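The two properties can be checked mechanically on a finite trace; the following function is an illustrative rendering of the definition (the trace encoding and parameter names are assumptions, not the book's notation):

```python
def satisfies_omega(leader, correct, n, tau0):
    """Check a finite trace against Omega's two properties.
    leader[i] is the sequence of values read from leader_i over time,
    correct is the set of correct process indexes, and tau0 is a
    candidate stabilization time (all names here are illustrative)."""
    # Validity: every output is a process index in {1, ..., n}
    for outs in leader.values():
        if any(l not in range(1, n + 1) for l in outs):
            return False
    # Eventual leadership: from tau0 on, all correct processes output
    # the same index, and that index is a correct process
    vals = {leader[i][t] for i in correct for t in range(tau0, len(leader[i]))}
    return len(vals) == 1 and vals.pop() in correct
```

For instance, a trace where p1 and p3 disagree at time 0 but both output the correct process 3 from time 1 on satisfies the definition with stabilization time 1, but not with stabilization time 0.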
Ω is a failure detector The previous definition makes clear the fact that Ω is a
failure detector. This comes from the fact that it is defined from a failure pattern.
When it reads leaderi = x at some time τ, a process pi learns from the failure detector
that px is its current leader. As just seen, this information may be unreliable only
during a finite period of time. At any time, pi knows that (a) leaderi will eventually
contain forever the index of a correct process and (b) all these local variables will
eventually contain the same process index.
Such an alpha object, called here alpha1 , was introduced by E. Gafni (1998) (under
the name adopt-commit object).
Alpha1 is a one-shot object that provides the processes with a single operation
denoted adopt_commit(). This operation takes a value as input parameter (we then
say that the invoking process proposes that value) and returns a pair ⟨d, v⟩, where d
is a control tag and v a value (we then say that the invoking process decides the pair
⟨d, v⟩).
Definition Let ALPHA1 be an alpha1 object. Its behavior is defined by the following
properties:
• Termination. An invocation of ALPHA1.adopt_commit() by a correct process
terminates.
• Validity. This property is made up of two parts:
– Output domain. If a process decides ⟨d, v⟩ then d ∈ {commit, adopt, abort}
and v is a value that was proposed to ALPHA1.
– Obligation. If all the processes that invoke ALPHA1.adopt_commit() propose
the same value v, then the only pair that can be decided is ⟨commit, v⟩.
• Quasi-agreement. If a process decides ⟨commit, v⟩ then any other deciding process
decides ⟨d, v⟩ with d ∈ {commit, adopt}.
Intuitively, an alpha1 object is an abortable consensus object. A single value v
can be committed. The control tags adopt or abort are used to indicate that no value
can be committed. It is easy to see (quasi-agreement property) that, if a process
decides ⟨abort, −⟩, no process decides ⟨commit, −⟩. Differently, if a process decides
⟨adopt, −⟩, it cannot conclude which control tag was decided by other processes.
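As an illustration, a write-then-scan rendering in the style of Gafni's adopt-commit protocol can be sketched as follows. This is an illustrative sequential sketch, not the book's algorithm; its behavior is only argued here for sequential or well-separated interleavings:

```python
class AdoptCommit:
    """Write-then-scan sketch in the style of an adopt-commit protocol:
    round 1 publishes proposals, round 2 publishes (flag, value) pairs."""
    def __init__(self, n):
        self.A = [None] * n              # round 1: published proposals
        self.B = [None] * n              # round 2: (flag, value) pairs
    def adopt_commit(self, i, v):
        self.A[i] = v
        seen = [x for x in self.A if x is not None]
        flag = all(x == v for x in seen)         # did p_i see only its own value?
        self.B[i] = (flag, v)
        pairs = [p for p in self.B if p is not None]
        if all(f and w == v for (f, w) in pairs):
            return ("commit", v)
        for (f, w) in pairs:
            if f:
                return ("adopt", w)              # some process may have committed w
        return ("abort", v)
```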
This object, called here alpha2 , was introduced by L. Lamport (1998). It is a round-
based object which takes a round number as a parameter. Hence, differently from the
previous alpha1 object, an alpha2 object is a multi-shot object and a single instance
is required to implement a consensus object.
An alpha2 object provides the processes with a single operation denoted deposit()
which takes two input parameters, a round number r and a proposed value v (v is the
value that the invoking process wants to deposit into the alpha2 object). An invocation
of deposit() returns a value which can be the default value ⊥ (that cannot be proposed
by a process). Moreover, it is assumed that (a) no two processes use the same round
numbers and (b) a process uses a given round number only once and its successive
round numbers are increasing.
Definition Let ALPHA2 be an alpha2 object. Its behavior is defined by the following
properties:
• Termination. An invocation of ALPHA2.deposit() by a correct process terminates.
• Validity. This property is made up of two parts:
– Output domain. An invocation ALPHA2.deposit(r, −) returns either ⊥ or a
value v such that ALPHA2.deposit(−, v) has been invoked by a process.
– Obligation. If there is an invocation I = ALPHA2.deposit(r, −) such that any
invocation I′ = ALPHA2.deposit(r′, −) that started before I terminates is such
that r′ < r, then I returns a non-⊥ value.
• Quasi-agreement. Let ALPHA2.deposit(r, v) and ALPHA2.deposit(r′, v′) be two
invocations that return w and w′, respectively. We have (w ≠ ⊥) ∧ (w′ ≠ ⊥) ⇒
(w = w′).
It is easy to see (quasi-agreement property) that an alpha2 object is a weakened
consensus object in the sense that two processes cannot decide different non-⊥ values.
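The three properties can be illustrated with a coarse, lock-based sketch of the sequential behavior (this is not Lamport's read/write implementation; it only mimics the interface, with None modeling ⊥):

```python
import threading

class Alpha2:
    """Sequential sketch of the alpha2 interface: an invocation overtaken
    by a higher round aborts with bottom; otherwise it returns the first
    value that was successfully deposited."""
    def __init__(self):
        self._highest = 0                # highest round seen so far
        self._val = None                 # the value fixed so far, if any
        self._lock = threading.Lock()    # stands in for the real algorithm
    def deposit(self, r, v):
        with self._lock:
            if r < self._highest:        # overtaken by a higher round: abort
                return None
            self._highest = r
            if self._val is None:
                self._val = v            # the first successful deposit fixes v
            return self._val
```

Quasi-agreement holds because the fixed value is written once; obligation holds because an invocation whose round dominates all earlier ones is never turned away.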
An example To illustrate the round numbers and the obligation property, let us
consider Fig. 17.2, where there are six processes, among which the process p4 has
(initially) crashed. The invocations of ALPHA2.deposit() are indicated by double-
arrow segments, the round number associated with an invocation being indicated
above the corresponding segment.
As required by alpha2 , no two invocations of ALPHA2.deposit() use the same
round number and the consecutive round numbers used by each process are increas-
ing. When we consider all the invocations with round number smaller than 11, it
is possible that they all return ⊥. Indeed, these invocations satisfy the termina-
tion, obligation, and quasi-agreement properties stated above. As any invocation
with a round number smaller than 11 is concurrent with another invocation with
a greater round number, the obligation property allows them to return the default
value ⊥.
The situation is different for the invocation I = ALPHA2.deposit(11, −) issued
by p5 . As there is no invocation with a greater round number that started before I
terminates, the obligation property states that I must return a non-⊥ value, namely a
value v such that deposit(r, v) was invoked with r ≤ 11.
Due to the very same obligation property, the invocation I′ = ALPHA2.deposit
(15, −) issued by p3 also has to return a non-⊥ value. Moreover, due to the quasi-
agreement property, that value has to be the same as the one returned by I.
Another abstraction that can be used to maintain the safety of a consensus object is
a store-collect object. Such an object was introduced in Sect. 7.2, and the notion of
a fast store-collect object was introduced in Sect. 7.3.
Reminder: store-collect object Let ALPHA3 be a store-collect object. As we have
seen, such an object provides the processes with two operations. The first one, denoted
store(), allows the invoking process pi to deposit a new value into ALPHA3 while
discarding the value it has previously deposited (if any). The second operation,
denoted collect(), returns to the invoking process a set of pairs ⟨j, val⟩, where
val is the last value deposited by the process pj. The set of pairs returned by an
invocation of collect() is called a view.
As we have seen in Chap. 7, a store-collect object has no sequential specification.
We have also seen that such an object has an efficient adaptive wait-free implemen-
tation based on read/write registers. More precisely, the step complexity (number of
shared memory accesses) of the operation collect() is O(k), where k is the number of
different processes which have invoked the operation store() (Theorem 30, Chap. 7).
The step complexity of the first invocation of the operation store() by a process pi
is also O(k), while the step complexity of its next invocations of store() is O(1)
(Theorem 29, Chap. 7).
Reminder: fast store-collect object Such an object is a store-collect object in
which the operations store() and collect() are merged to form a single operation
denoted store_collect(). Its effect is that of an invocation of store() immediately
followed by an invocation of collect(). This operation is not atomic.
17.3 Three Safety-Oriented Abstractions: Alpha1 , Alpha2 , and Alpha3 457
We have seen that the step complexity of such an operation is O(1) when, after
some time, a single process invokes that operation (Theorem 32, Chap. 7).
Let us consider that, according to the alpha object that is used, due to Ω, there is a
time τ after which a single process repeatedly invokes ALPHA1.adopt_commit(),
ALPHA2.deposit(), or ALPHA3.store_collect().
As we will see in the next section, the role of Ω is to be a round allocator in such
a way that after some finite time, a single process executes rounds. According to
the object that is used, this will allow an invocation of ALPHA1.adopt_commit() to
return commit, v, or an invocation of ALPHA2.deposit() to return a value v ≠ ⊥,
or an invocation of ALPHA3.store_collect() to return a set which allows the invoking
process to decide.
Let us remember that each process is endowed with a read-only local variable leaderi
whose content is supplied by the failure detector Ω. Let CONS be the consensus object
we want to build.
The three consensus algorithms described below assume that the common cor-
rect leader eventually elected by Ω invokes CONS.propose(v). The case where this
assumption is not required is considered in Sect. 17.4.4.
The consensus agreement property states that no two processes can decide dif-
ferent values. Let us consider the first round r during which a process assigns a
value to the atomic register DEC (line 7). Let pi be a process that issues such
an assignment. It follows from lines 5–6 that pi has obtained resi = commit, v
from ALPHA1[r]. Moreover, it follows from the quasi-agreement property of this
alpha1 object that any other process pj that returns from ALPHA1[r].adopt_commit()
obtains resj = commit, v or resj = adopt, v. If pj obtains commit, v, it writes
v into the atomic register DEC, and if it obtains adopt, v, it writes v into estj . It
follows that, any process that executes round r either decides before entering round
r + 1 or starts round r + 1 with estj = v. This means that, from round r + 1, the only
value that a process can propose to ALPHA1[r + 1] is v. It follows that, from then
on, only v can be written into the atomic register DEC, which concludes the proof
of the consensus agreement property.
The consensus termination property states that any invocation of CONS.propose()
by a correct process terminates. Due to the termination property of the alpha1
objects, it follows that, whatever r, any invocation of ALPHA1[r].adopt_commit()
terminates. It follows that the proof of the consensus termination property amounts
to showing that a value is eventually written in DEC. We prove this by contradiction.
Let us assume that no value is ever written in DEC. It follows from the eventual
leadership property of Ω that there is a finite time τ1 after which there is a correct
process p that is elected as permanent leader. Moreover, there is a time τ2 after
which only correct processes execute the algorithm. It follows that there is a time
τ ≥ max(τ1 , τ2 ) when all the processes pi that have invoked CONS.propose() and
have not crashed are such that leaderi forever contains the index of p. It follows that, after τ ,
p is the only process that repeatedly executes lines 4–10. As (by assumption) it never
writes into DEC, p executes an infinite number of rounds. Let r be a round number
strictly greater than any round number attained by any other process. It follows that
p is the only process that invokes ALPHA1[r].adopt_commit(est). It then follows
from the obligation property of ALPHA1[r] that this invocation returns the pair
commit, est. Hence, p executes line 7 and writes est into DEC, which contradicts
the initial assumption and concludes the proof of the termination property.
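The round structure referred to in the two proofs above (lines 4–10 of Fig. 17.3, which is not reproduced here) can be sketched as follows. The AdoptCommit class is a hypothetical sequential simplification that always commits the first value proposed in a round, and the Ω-based guard restricting which process executes rounds is omitted; the function and class names are ours.

```python
from collections import defaultdict

BOT = None  # stands for the default value ⊥

class AdoptCommit:
    """Hypothetical sequential stand-in for an adopt-commit object:
    every invocation commits the first value proposed in the round."""
    def __init__(self):
        self._first = None

    def adopt_commit(self, v):
        if self._first is None:
            self._first = v
        return ("commit", self._first)

def propose(v, alphas, shared):
    """Round loop sketched after the proofs above: est is submitted to
    ALPHA1[r]; a commit writes DEC (line 7), an adopt updates est."""
    est, r = v, 0
    while shared["DEC"] is BOT:
        r += 1
        tag, w = alphas[r].adopt_commit(est)
        if tag == "commit":
            shared["DEC"] = w            # decide the committed value
        else:                            # tag == "adopt"
            est = w                      # carry w into round r + 1
    return shared["DEC"]
```

Here `defaultdict(AdoptCommit)` plays the role of the array ALPHA1[1], ALPHA1[2], etc.; once DEC is written, every later invocation returns the same decided value.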
Let us remember that no two processes are allowed to use the same round numbers,
and round numbers used by a process must increase. To that end, the process pi is
statically assigned the rounds i, n+i, 2n+i, etc. (where n is the number of processes).
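The static round assignment just described can be sketched as a generator; the function name rounds_of is ours.

```python
import itertools

def rounds_of(i, n):
    """Round numbers statically pre-assigned to process pi in the
    alpha2-based algorithm: i, n+i, 2n+i, ... (processes are numbered
    1..n, so no two processes ever share a round number)."""
    r = i
    while True:
        yield r
        r += n
```

With n = 4, for instance, process p3 executes the rounds 3, 7, 11, 15, etc.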
The code of the algorithm is very similar to the code based on alpha1 objects. In this
algorithm, res is not a pair; it contains the value returned by ALPHA2.deposit(),
which is a proposed value or the default value ⊥.
Theorem 80 Let us assume that the eventual leader elected by Ω participates (i.e.,
invokes CONS.propose()). The construction described in Fig. 17.4 is a wait-free
implementation of a consensus object.
Proof The consensus validity and agreement properties follow directly from the
properties of the alpha2 object. More precisely, we have the following. Let us first
observe that, due to the test of line 2, ⊥ cannot be decided. The consensus validity
property follows then from the “output domain” validity property of the alpha2 object
and the fact that a process pi always invokes ALPHA2.deposit(ri , vi ), where vi is
the value it proposes to the consensus object. The consensus agreement property
is a direct consequence of the fact that ⊥ cannot be decided combined with the
quasi-agreement property of the alpha2 object.
The proof of the consensus termination property is similar to the previous one. It
follows from (a) the eventual leadership property of Ω, (b) the fact that the eventual
leader proposes a value, and (c) the fact that, if no process has deposited a value
into DEC before, there is a time τ after which the eventual leader is the only one to
execute rounds. The proof is consequently the same as in Theorem 79.
• If it is late (ri < rmaxi ), pi jumps to round rmaxi and adopts as a new esti-
mate a value that is associated with rmaxi in the view it has previously obtained
(line 13).
• If it is “on time” from a round number point of view (ri = rmaxi ), pi checks if it
can write a value into DEC and decide. To that end, it executes lines 8–12. It first
computes the set seti of the values that are registered in the store-collect object
with round number rmaxi or rmaxi − 1, i.e., the values registered by the processes
that (from pi ’s point of view) have attained one of the last two rounds.
If it has passed the first round (ri > 1) and its set seti contains only the value kept
in esti , pi writes it into DEC (line 10), just before going to decide at line 17. If it
cannot decide, pi proceeds to the next round without modifying its estimate esti
(line 11).
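Under the assumption that a view is a mapping from process indexes to (round, estimate) pairs, the late/on-time case analysis above can be sketched as follows; the function name and the return convention are ours.

```python
def decision_step(r, est, view):
    """One observation step, as described above.  view maps a process
    index to the (round, estimate) pair it last registered.  Returns
    (new_round, new_est, decided), where decided is None or the value
    written into DEC (our convention)."""
    rmax = max(rnd for rnd, _ in view.values())
    if r < rmax:
        # late: jump to rmax and adopt a value associated with rmax
        new_est = next(v for rnd, v in view.values() if rnd == rmax)
        return rmax, new_est, None
    # on time: values registered with one of the last two rounds
    vals = {v for rnd, v in view.values() if rnd >= rmax - 1}
    if r > 1 and vals == {est}:
        return r, est, est        # decision predicate satisfied
    return r + 1, est, None       # proceed to the next round
```

When every value seen in the last two rounds equals the local estimate, the "nothing relevant has changed" predicate holds and the estimate is decided.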
Hence, the base principle on which this algorithm rests is fairly simple to state. (It
is worth noticing that this principle is encountered in other algorithms that solve other
problems such as termination detection of distributed computations.) This principle
can be stated as follows: processes execute asynchronous rounds (observation peri-
ods) until a process sees two consecutive rounds in which “nothing which is relevant
has changed”.
Particular cases It is easy to see that, when all processes propose the same value,
no process decides in more than two rounds, whatever the failure pattern and
the behavior of Ω. Similarly, only two rounds are needed when Ω elects a correct
common leader from the very beginning. In that sense, the algorithm is optimal from
the “round number” point of view.
On the management of round numbers In adopt-commit-based or alpha-based
consensus algorithms, the processes that execute rounds execute a predetermined
sequence of rounds. More precisely, in the adopt-commit-based consensus algo-
rithm presented in Fig. 17.3, each process that executes rounds does execute the
predetermined sequence of rounds numbered 1, 2, etc. Similarly, in the alpha2 -based
consensus algorithm presented in Fig. 17.4, each process pi that executes rounds
executes the predetermined sequence of rounds numbered i, i + n, i + 2n, etc.
In contrast, the proposed algorithm allows a process pi that executes rounds to
jump from its current round ri to the round rmaxi , which can be arbitrarily large (line
13). These jumps make the algorithm particularly efficient. More specifically, let us
consider a time τ of an execution such that (a) up to time τ , when a process executes
line 9, the decision predicate is never satisfied, (b) processes have executed rounds
and mr is the last round that was attained at time τ , (c) from time τ , Ω elects the
same correct leader p for any process pi , and (d) p starts participating at time τ . It
follows from the algorithm that p executes the first round, during which it updates
r to mr, and then, according to the values currently contained in ALPHA3, at most
the rounds mr and mr + 1 or the rounds mr, mr + 1, and mr + 2. As the sequence
of rounds is not predetermined, p saves at least mr − 2 rounds.
17.4 Ω-Based Consensus
When the process p eventually elected as common leader by Ω does not participate,
the termination property of the three previous consensus algorithms can no longer
be ensured in all executions.
When the subset of processes that participate in the consensus algorithm can be
any non-empty subset of processes, the failure detector Ω has to be replaced by the
failure detector ΩX introduced in Sect. 5.3.1.
Reminder: the failure detector ΩX Let X be any non-empty subset of process
indexes. The failure detector denoted ΩX provides each process pi with a local
variable denoted ev_leader(X) (eventual leader in the set X) such that the following
properties are always satisfied:
• Validity. At any time, the variable ev_leader(X) of any process contains a process
index.
• Eventual leadership. There is a finite time after which the local variables
ev_leader(X) of the correct processes of X contain the same index which is the
index of one of them.
The extended algorithms are correct The fact that this extended algorithm is
correct follows from the correctness of the base algorithm plus the following two
observations:
1. The fact that two processes pi and pj , while executing the same round r, are such
that ev_leader i (X) = i and ev_leader j (X′) = j with X ≠ X′ does not create a
problem. This is because the situation is exactly as if X = X′ and ΩX has not
yet stabilized to a single leader. Hence, the consensus safety property cannot be
compromised.
464 17 The Alpha(s) and Omega of Consensus
When the failure detector Ω is used to implement a consensus object, no process ever
knows from which time instant the failure detector provides forever the processes
with the index (identity) of the same correct process. A failure detector is a service
that never terminates. Its behavior depends on the failure pattern and has no sequential
specification.
This section is devoted to the wait-free constructions of the alpha objects defined in
Sect. 17.3 and used in Sect. 17.4.
• Finally, pi computes the final value it returns as the result of its invocation of
adopt_commit(vi ):
– If a single proposed value v was seen by the processes that (to pi ’s knowledge)
have written into BB[1..n] (those processes have consequently terminated the
second communication phase), pi commits the value v by returning the pair
commit, v (line 5).
– If the set of pairs read by pi from BB[1..n] contains several pairs and one of
them is single, v, pi adopts v by returning adopt, v.
– Otherwise, pi has not read single, v from BB[1..n]. In that case, it simply
returns the pair abort, vi .
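The three-phase construction described above (write into AA, collect AA, write single/several into BB, then read BB) can be sketched sequentially in Python. Dicts stand in for the atomic arrays AA[1..n] and BB[1..n], so the interleavings that the proof of Theorem 81 has to deal with are not simulated; the class name is ours.

```python
class Alpha1:
    """Sequential sketch of the adopt-commit construction described above."""
    def __init__(self):
        self.AA = {}   # AA[i]: value written in the first phase
        self.BB = {}   # BB[i]: pair ('single' | 'several', value)

    def adopt_commit(self, i, v):
        self.AA[i] = v                                               # line 1
        aa = set(self.AA.values())                                   # line 2
        self.BB[i] = ("single", v) if aa == {v} else ("several", v)  # line 3
        bb = set(self.BB.values())                                   # line 4
        if all(tag == "single" for tag, _ in bb):
            return ("commit", v)     # line 5: only one value was seen
        for tag, w in bb:
            if tag == "single":
                return ("adopt", w)  # line 6: a single, w pair was read
        return ("abort", v)          # line 7: no single, - pair was read
```

In a sequential run the first invocation commits its value and later invocations with a different value adopt it, in accordance with the quasi-agreement property.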
Theorem 81 The construction described in Fig. 17.6 is a wait-free implementation
of an alpha1 object from read/write atomic registers.
Proof The proofs of the termination, input domain, and obligation properties of
alpha1 are trivial. Before proving the quasi-agreement property, we prove the fol-
lowing claim.
Claim. (single, v ∈ BB) ⇒ (∄ v′ ≠ v : single, v′ ∈ BB).
Proof of the claim. Let single, v be the first pair single, − written in BB and let px
be the process that wrote it (due to the atomicity of the registers of the array BB[1..n]
such a first write exists).
Due to the sequentiality of px we have τ1 < τ2 < τ3 , where τ1 is the time at which
the statement AA[x] ← v issued by px terminates (line 1), τ2 is the time instant at
which the invocation by px of AA.val_collect() starts (line 2), and τ3 is the time at
which the statement BB[x] ← single, v starts (line 3). See Fig. 17.7.
Fig. 17.7 Timing on the accesses to AA for the proof of the quasi-agreement property
Fig. 17.8 Timing on the accesses to BB for the proof of the quasi-agreement property
Let py be a process that wrote v′ ≠ v into the array AA[1..n]. As px writes
single, v into BB[x], it follows that aax = {v} (line 3), from which we conclude
that, at time τ2 , there is no v′ ≠ v such that v′ ∈ AA[1..n]. It follows that py has
written v′ into AA[y] after time τ2 . Consequently, as τ1 < τ2 , py executes aay ←
AA.val_collect() (line 2) and it necessarily reads v from AA[x]. It follows that py
writes several, v′ into BB[y], which concludes the proof of the claim.
The proof of the quasi-agreement property consists in showing that, for any pair
of processes that execute line 4, we have bbi = {single, v} ⇒ single, v ∈ bbj
(i.e., the lines 5 and 7 are mutually exclusive: if a process executes one of them, no
process can execute the other one).
Let pi be a process that returns single, v at line 5. It follows that single, v was
written into the array BB. It follows from the claim that single, v is the single pair
single, − written in BB. The rest of the proof consists in showing that every process
pj that executes return() reads single, v from BB at line 4 (from which it follows
that it does not execute line 7). The reasoning is similar to the one used in the proof
of the claim. See Fig. 17.8 which can be seen as an extension of Fig. 17.7.
If a process pj writes several, − into BB[j] (line 3), it does it after time τ5
(otherwise, pi could not have bbi = {single, v} at time τ6 and execute line 5).
As τ4 < τ5 , it follows from this observation that, when pj reads the array BB[1..n]
(line 4), it reads single, v from BB[i] and consequently executes line 6 and returns
adopt, v, which concludes the proof of the quasi-agreement property.
process are increasing). Intuitively, the algorithm aims to ensure that, if there is a
“last” invocation of deposit() (“last” with respect to the round numbers, as required by
the obligation property), this invocation succeeds in associating a definitive value with
the alpha2 object. To that aim, the algorithm manages and uses control information,
namely the “date” fields of each shared register (i.e., REG[i].lre and REG[i].lrww).
The algorithm implementing the operation deposit() This algorithm is described
in Fig. 17.10. Each process pi has a local array regi [1..n] in which pi stores the last
copy of REG[1..n] it has asynchronously read. At the operational level, the previous
principles convert into four computation phases. More precisely, when a process pi
invokes deposit(r, v), it executes a sequence of statements which can be decomposed
into four phases:
• Phase 1 (lines 1–3):
– The process pi first informs the other processes that the date r has been attained
(line 1).
– Then, pi reads asynchronously the array REG[1..n] to know the “last date”
attained by each of the other processes (line 2).
– If it discovers that it is late (i.e., other processes have invoked deposit() with
higher dates), pi aborts its current attempt and returns ⊥ (line 3). Let us observe
that this preserves the quasi-agreement property without contradicting the oblig-
ation property.
• Phase 2 (lines 4–5). If it is not late, pi computes a value that it will try to deposit in
the alpha2 object. In order not to violate quasi-agreement, it selects the last value
(“last” according to the round numbers/logical dates) that has been written in a
regular register REG[j]. If there is no such value, pi considers the value v that it
wants to deposit in the alpha2 object.
• Phase 3 (lines 6–8):
– Then, pi writes into its regular register REG[i] a pair made up of the value
(denoted value) it has previously computed, together with its date r (line 6).
– The process pi then reads again the regular registers to check again if it is late (in
which case there are concurrent invocations of deposit() with higher dates). As
before, if this is the case, pi aborts its current attempt and returns ⊥ (lines 7–8).
• Phase 4 (line 9). Otherwise, as pi is not late, it has succeeded in depositing v in
the alpha2 object, giving it its final value. It consequently returns that value.
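The four phases of deposit() can be sketched as follows. This sequential sketch replaces the regular registers REG[1..n] by plain records, so it illustrates the control flow (lines 1–9) rather than the shared-memory subtleties addressed in the proof of Theorem 82; the class name is ours.

```python
BOT = None  # stands for ⊥

class Alpha2:
    """Sequential sketch of deposit() (Fig. 17.10): each REG[j] is a record
    with fields lre (last round entered), lrww (last round with a write
    of a value), and val."""
    def __init__(self, n):
        self.REG = [{"lre": 0, "lrww": 0, "val": BOT} for _ in range(n)]

    def deposit(self, i, r, v):
        reg = self.REG
        reg[i]["lre"] = r                              # phase 1: date r attained
        if any(x["lre"] > r for x in reg):
            return BOT                                 # phase 1: pi is late
        # phase 2: adopt the most recently written value, if any
        k = max(range(len(reg)), key=lambda j: reg[j]["lrww"])
        value = reg[k]["val"] if reg[k]["lrww"] > 0 else v
        reg[i] = {"lre": r, "lrww": r, "val": value}   # phase 3: write pair
        if any(x["lre"] > r for x in reg):
            return BOT                                 # phase 3: late again
        return value                                   # phase 4: success
```

A later invocation with a higher round adopts the value already deposited, while an invocation with a round lower than a date already attained aborts with ⊥.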
Theorem 82 The construction described in Fig. 17.10 is a wait-free implementation
of an alpha2 object from regular registers.
Proof Proof of the wait-freedom property.
A simple examination of the code of the algorithm shows that it is a wait-free algo-
rithm: if pi does not crash while executing deposit(r, v), it terminates (at line
3, 8, or 9) whatever the behavior of the other processes.
Proof of the validity property.
Let us observe that a non-⊥ value v written in a register REG[i] (line 6) is a value
that was previously passed as a parameter in a deposit() invocation (lines 4–6). The
validity property follows from this observation and the fact that only ⊥ or a value
written in a register REG[i] can be returned from a deposit() invocation.
Proof of the obligation property.
Let I = deposit(r, −) be an invocation (by a process pi ) such that all the invocations
I′ = deposit(r′, −) that have started before I terminates are such that r′ < r. It
follows that the predicate ∀j ≠ i : REG[j].lre < r = REG[i].lre is true
during the whole execution of I. Consequently, I cannot be stopped prematurely and
forced to output ⊥ at line 3 or line 8. It follows that I terminates at line 9 and returns
a non-⊥ value.
Proof of the quasi-agreement property.
If none or a single invocation of deposit() executes line 9, the quasi-agreement
property is trivially satisfied. So, among all the invocations that terminate at line
9 (i.e., that return a non-⊥ value), let I = deposit(r, −) be the invocation with
the smallest round number and I′ = deposit(r′, −) be any other of these invocations
(hence, r′ > r). (Let us remember that the total ordering on the deposit() invocations
defined by their round numbers constitutes the basic idea that underlies the design of
the algorithm.) Let pi (pj ) be the process that invoked I (I′), and v (v′) the returned
value. To show that v′ = v, we use the following time instant definitions (Fig. 17.11):
• Definitions of time instants related to I:
– Let w6(I) be the time at which I terminates the write of the regular register
REG[i] at line 6. We have then REG[i] = r, r, v.
– Let r7(I, j) be the time at which I starts reading REG[j] at line 7. As pi is
sequential, we have w6(I) < r7(I, j).
• Definitions of time instants related to I′:
– Let w1(I′) be the time at which I′ terminates the write of the regular register
REG[j] at line 1. We then have REG[j] = r′, −, −.
– Let r2(I′, i) be the time at which I′ starts reading REG[i] at line 2. As pj is
sequential, we have w1(I′) < r2(I′, i).
Let us first observe that, as I returns a non-⊥ value, it successfully passed the test
of line 8; i.e., the value it read from REG[j].lre was smaller than r. Moreover, when
I′ executed line 1, it updated REG[j].lre to r′ > r. As the register REG[j] is regular,
we conclude that I started reading REG[j] before I′ finished writing it (otherwise, pi
would have read r′ from REG[j].lre and not a value smaller than r). Consequently we
have r7(I, j) < w1(I′), and by transitivity w6(I) < r7(I, j) < w1(I′) < r2(I′, i).
This is illustrated in Fig. 17.11.
It follows that, when I′ reads REG[i] at line 2, it obtains x, x′, − with x ≥
x′ ≥ r (this is because, after I, pi has possibly executed other invocations with higher
round numbers). Moreover, as I′ does not return ⊥ at line 3, when it read REG[1..n]
at line 2 it saw no REG[k] such that REG[k].lre > r′. This means that, when I′
determines a value val at line 4, it obtains v′ from some register REG[k] (the value
in REG[k].val) such that ∀ℓ : r′ > REG[k].lrww ≥ REG[ℓ].lrww, and we have
REG[k].lrww ≥ REG[i].lrww ≥ r. Let I″ be the invocation that deposited v′ in
REG[k].val. If REG[k].lrww = r, we then have i = k and I″ is I (this is because
r can be generated only by pi ). Consequently v′ = v. Otherwise, the invocation I″
by pk deposited v′ into REG[k].val at line 6, with a corresponding round number r″
such that r < REG[k].lrww = r″ < r′. Observing that only lines 1–4 executed by
I′ are relevant in the previous reasoning, we can reuse the same reasoning replacing
the pair of invocations (I, I′) with the pair (I, I″). So, either I″ obtained v′ deposited
Building a reliable regular register from unreliable disks As just indicated, the
disk-based implementation of alpha2 consists in using the quorum-based replication
technique to translate the read and write operations on a “virtual” object REG[i] into
its disk access counterpart on the m copies DISK_BK[i, 1], . . . , DISK_BK[i, m].
Fig. 17.13 Implementing an SWMR regular register from unreliable read/write disks
To that aim, each time it has to write a new triple into the virtual object REG[i],
pi associates a new sequence number sn with that triple and writes the pair
triple, sn into the corresponding block DISK_BK[i, d] of each of the m disks (1 ≤
d ≤ m). The write invocation terminates when the pair has been written into a
majority of disks.
Similarly, a read of the virtual object REG[i] is translated into m read operations,
each one reading the corresponding block DISK_BK[i, d] on disk d (1 ≤ d ≤ m).
The read terminates when a pair was received from a majority of the disks; the triple
with the greatest sequence number is then delivered as the result of the read of the
virtual object REG[i]. Let us observe that, due to the “majority of correct disks”
assumption, every read or write operation on a virtual object REG[i] terminates.
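The quorum-based emulation of a virtual register REG[i] can be sketched as follows. Crashes are modelled synchronously by a set of crashed disk indices, whereas the real construction issues the m accesses in parallel and waits for answers from a majority; the class name is ours.

```python
class DiskRegister:
    """Sketch of the majority-quorum emulation of REG[i] over m disk
    blocks; a 'majority of correct disks' is assumed."""
    def __init__(self, m):
        self.m = m
        self.blocks = [(None, 0)] * m    # (triple, sequence number)
        self.crashed = set()             # indices of crashed disks

    def _alive(self):
        return [d for d in range(self.m) if d not in self.crashed]

    def write(self, triple, sn):
        # the write terminates once the pair reaches a majority of disks
        acks = 0
        for d in self._alive():
            self.blocks[d] = (triple, sn)
            acks += 1
        assert acks > self.m // 2, "majority of correct disks required"

    def read(self):
        # collect pairs from a majority and keep the highest sequence number
        pairs = [self.blocks[d] for d in self._alive()]
        assert len(pairs) > self.m // 2
        return max(pairs, key=lambda p: p[1])[0]
```

With m = 5 the register survives two disk crashes: a read still intersects every write quorum in at least one disk holding the latest pair.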
It is easy to see that a read of the virtual object REG[i] that is not concur-
rent with a write of it obtains the last triple written in REG[i] (or ⊥ if REG[i]
has never been written). For the read operations of REG[i] that are concurrent with
a write in REG[i], let us consider Fig. 17.13, which considers five disk blocks:
DISK_BK[i, 1], . . . , DISK_BK[i, 5]. A write of a virtual object REG[i] by pi is
represented by a “write line”, the meaning of which is the following: the point where
the “write line” crosses the time line of a disk is the time at which that disk executes
the write of the corresponding triple, sn pair. As the system is asynchronous, these
physical writes can occur as indicated in the figure, whatever their invocation times.
When considering the figure, x is the sequence number associated with the value of
REG[i] in each disk block DISK_BK[i, d] , 1 ≤ d ≤ m = 5, before the write; x + 1
is consequently the sequence number after the write operation. (Let us observe that,
in general, due to asynchrony and the fact that an operation terminates when a new
value was written into a majority of disks, it is possible that two disks DISK_BK[i, d]
and DISK_BK[i, d′] do not have the same sequence number.)
The figure considers also three reads of REG[i]: each is represented by an
ellipse and obtains the <triple, sequence number> pairs of the disks contained in
the corresponding ellipse (e.g., read2 obtains the current pairs of the disk blocks
DISK_BK[i, 3], DISK_BK[i, 4], and DISK_BK[i, 5]). As we can see, each read
obtains pairs from a majority of disks (let us notice that this is the best that can be done
as, from the invoking process point of view, all the other disks may have crashed).
read1 and read2 are concurrent with the write into the virtual object REG[i], and
read1 obtains the new triple (whose sequence number is x + 1), while read2 obtains
the old triple (whose sequence number is x). This is a new/old inversion. As we have
seen, the regular register definition allows such new/old inversions.
Let us finally notice that a write on the virtual object REG[i] such that pi
crashes during that write operation can leave the disk blocks DISK_BK[i, 1], . . . ,
DISK_BK[i, m] in a state where the pair <new value of REG[i], associated sequence
number> has not been written into a majority of disks. Considering that a write during
which the corresponding process crashes never terminates, this remains consistent
with the definition of a regular register. As shown by Fig. 17.13, all future reads will
then be concurrent with that write and each of them can consequently return the old
or the new value of REG[i].
17.6 Wait-Free Implementations of the Alpha2 Abstraction from Shared Disks 475
The algorithm implementing the operation deposit() When we look at the basic
construction described in Fig. 17.10 with the “sequence number” notion in mind, we
can see that the fields REG[i].lre and REG[i].lrww of each virtual object REG[i]
actually play a sequence number role: the first for the writes of REG[i] issued at line
1, and the second for the writes of REG[i] issued at line 6. These fields of each virtual
register REG[i] can consequently be used as sequence numbers.
On another side, as far as disk accesses are concerned, we can factor out the
writing of REG[i] at line 1 and the reading of REG[1..n] at line 2. This means
that we can issue, for each disk d, the writing of DISK_BK[i, d] and the reading
of DISK_BK[1, d], . . . , DISK_BK[n, d], and wait until these operations have been
executed on a majority of disks. The same factorization can be done for the writing
of REG[i] at line 6 and the reading of REG[1..n] at line 7 of Fig. 17.10.
The resulting construction based on unreliable disks is described in Fig. 17.14. The
variable reg[i] is a local variable where pi stores the last value of REG[i] (while there
is no local array reg[1..n], we keep the array-like notation reg[i] for homogeneity
and notational convenience). The algorithm tolerates any number of process crashes,
and up to (m − 1)/2 disk crashes. It is wait-free as a process can always progress
and terminate its deposit() invocation despite the crash of any number of processes.
Fig. 17.14 Wait-free construction of an alpha2 object from unreliable read/write disks
This algorithm can be easily improved. As an example, when pi receives at line 4
a triple block[j, d] such that block[j, d].lre > r, it can abort the current attempt and
return ⊥ without waiting for triples from a majority of disks. (The same improvement
can be done at line 13.)
An active disk is a disk that can atomically execute a few operations more sophisti-
cated than read or write. (As an example, active disks have been implemented that
provide their users with an atomic create() operation that atomically creates a new
file object and updates the corresponding file directory.)
We consider here an active disk, denoted ACT _DISK, that is made up of three
fields: ACT _DISK.lre, ACT _DISK.lrww, and ACT _DISK.val. This disk can be
atomically accessed by the two operations described in Fig. 17.15. The first operation,
denoted write_round + read(), takes a round number r as input parameter. It updates
the field ACT _DISK.lre to max(ACT _DISK.lre, r) and returns the triple contained
in ACT _DISK. The second operation, denoted cond_write + read_round(), takes a
triple as input parameter. It is a conditional write that returns a round number.
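A sketch of the active disk object follows. The operation write_round_read() implements the semantics stated above, but the text specifies cond_write_read_round() only partially, so the guard used here (the write is applied only if no higher round has been entered) is an assumption of ours; the class and method names are ours as well.

```python
class ActiveDisk:
    """Sketch of an active disk holding the triple (lre, lrww, val)."""
    def __init__(self):
        self.lre = 0       # last round entered
        self.lrww = 0      # last round with a write of a value
        self.val = None

    def write_round_read(self, r):
        # atomically: lre <- max(lre, r), then return the triple
        self.lre = max(self.lre, r)
        return (self.lre, self.lrww, self.val)

    def cond_write_read_round(self, triple):
        # ASSUMED semantics: install the triple only if its round is
        # still the highest round entered; always return lre
        r, _, v = triple
        if self.lre <= r:
            self.lrww, self.val = r, v
        return self.lre
```

With this guard, an invocation of deposit() can detect that it is late (a returned round number greater than its own) and abort with ⊥, as in the register-based algorithm.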
The algorithm implementing the operation deposit() This algorithm is described
in Fig. 17.16. It is a simple adaptation of the algorithm described in Fig. 17.10 to the
active disk context in which the array REG[1..n] of base regular registers is replaced
by a more sophisticated object, namely the active disk. Each process pi manages two
local variables: act_diski , which is a local variable that contains the last value of the
active disk read by pi , and rri , which is used to contain a round number. It is easy to
see that this construction is wait-free.
Unreliable active disks The previous construction assumes that the underlying
active disk is reliable. An interesting issue is to build a reliable virtual active disk
from base unreliable active disks. Unreliability means here that an active disk can
crash but does not corrupt its values. After it has crashed (if it ever crashes) a disk
no longer executes the operations that are applied to it. Before crashing it executes
them atomically. A disk that crashes during a run is said to be faulty with respect to
that run; otherwise, it is correct. Let us assume there are m active disks. Using the
quorum-based approach described in Sect. 17.6.1, it is possible to build a correct
(virtual) active disk from this set of m active disks when no more than (m − 1)/2 of
them crash.
17.7 Implementing Ω
As Ω cannot be implemented from read/write registers only, the underlying system
has to be enriched (or, equivalently, restricted) in order for Ω to be built. This is
captured by additional behavioral assumptions that have to be at the same time
"weak enough" (to be practically nearly always satisfied) and "strong enough" (to
allow Ω to be implemented).
The section presents first timing assumptions which are particularly weak and
then shows that these timing assumptions are strong enough to implement Ω in
any execution where they are satisfied. These assumptions and the corresponding
construction of Ω are due to A. Fernández, E. Jiménez, M. Raynal, and G. Trédan
(2007, 2010).
Each process is assumed to be equipped with a timer timeri , and any two timers
measure time durations the same way. The assumption is denoted EWB (for eventually
well behaved). The intuition that underlies EWB is that (a) on the one side the system
must be “synchronous” enough for a process to have a chance to be elected and (b)
on the other side the other processes must be able to recognize it.
EWB is consequently made up of two parts: EWB1 , which is on the existence of a
process whose behavior satisfies some synchrony assumption, and EWB2 , which is on
the timers of other processes. Moreover, EWB1 and EWB2 have to be complementary
assumptions that have to fit each other.
Critical registers Some SWMR atomic registers are critical, while others are not. A
critical register is an atomic register on which some timing constraint is imposed by
EWB on the single process that is allowed to write this register. This attribute allows
one to restrict the set of registers involved in the EWB assumption.
The assumption EWB1 This assumption restricts the asynchronous behavior of a
single process. It is defined as follows:
EWB1 : There are a time τEWB1 , a bound Δ, and a correct process pℓ (τEWB1 ,
Δ, and pℓ may never be explicitly known) such that, after τEWB1 , any two
consecutive write accesses issued by pℓ to its critical register are completed
in at most Δ time units.
This property means that, after some arbitrary (but finite) time, the speed of pℓ
is lower-bounded; i.e., its behavior is partially synchronous (let us notice that, while
there is a lower bound, no upper bound is required on the speed of pℓ , except the fact
that it is not +∞). In the following we say "pℓ satisfies EWB1 " to say that pℓ is a
process that satisfies this assumption.
The assumption EWB2 This assumption, which is on timers, is based on the fol-
lowing timing property. Let a timer be eventually well behaved if there is a time
τEWB2 after which, whatever the finite duration δ and the time τ ≥ τEWB2 at which
the timer is set to δ, that timer does not expire before time τ + δ.
When we consider EWB, it is important to notice that any process (except one,
which is constrained by a lower bound on its speed) can behave in a fully asynchro-
nous way. Moreover, the local clocks used to implement the timers are not required
to be synchronized.
Let us also observe that the timers of up to n − (t − f) correct processes can behave
arbitrarily. In particular, in the executions where f = t, the timers of all the correct
processes can behave arbitrarily. It follows from these observations that the timing
assumption EWB is particularly weak.
In the following we say “px is involved in EWB2 ” to say that px is a correct process
that has an eventually well-behaved timer.
ordering, in the set X (let us remember that ⟨a, i⟩ < ⟨b, j⟩ if and only if (a <
b) ∨ (a = b ∧ i < j)).
Internal representation of Ω To implement Ω, the processes cooperate by reading
and writing two arrays of SWMR atomic registers:
• PROGRESS[1..n] is an array of SWMR atomic registers that contain positive
integers. It is initialized to [1, . . . , 1]. Only pi can write PROGRESS[i], and it
does so regularly to inform the other processes that it is alive. Each PROGRESS[i]
register is a critical register.
• SUSPICIONS[1..n, 1..n] is an array of (non-critical) SWMR atomic registers
that contain positive integers (each entry being initialized to 1). The vector
SUSPICIONS[i, 1..n] can be written only by pi . SUSPICIONS[i, j] = x means
that pi has suspected the process pj to have crashed x − 1 times.
2. Timer setting part (lines 19–23). Then, pi resets its timer to an appropriate timeout
value. That value is computed from the current relevant suspicions. Let us observe
that this timeout value increases when these suspicions increase. Let us also remark
that, if after some time the number of relevant suspicions no longer increases,
timeouti keeps forever the same value.
As we can see, the construction is relatively simple. It uses n² + n atomic registers.
(As, for any i, SUSPICIONS[i, i] is always equal to 1, it is possible to use the diagonal
of that matrix to store the array PROGRESS[1..n].)
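To make the roles of these registers concrete, the monitoring mechanism can be sketched as a small single-threaded Python simulation. This is only an illustration of the suspicion-counting idea, not the book's algorithm: the timer-based pacing and the line structure are omitted, the election rule follows Definition 6 (smallest pair (Mk, k)), and the names heartbeat, monitor_step, M, and leader are ours.

```python
# Sketch (our simplification): PROGRESS[j] is pj's heartbeat register,
# SUSPICIONS[i][j] counts how often pi saw no progress from pj.

n, t = 4, 2  # n processes, at most t crashes (illustrative values)

PROGRESS = [1] * n                        # SWMR: PROGRESS[j] written only by pj
SUSPICIONS = [[1] * n for _ in range(n)]  # row i written only by pi
last = [[1] * n for _ in range(n)]        # pi's local copies of PROGRESS

def heartbeat(j):
    """pj signals it is alive by incrementing its critical register."""
    PROGRESS[j] += 1

def monitor_step(i):
    """pi checks each other process for progress since its last read."""
    for j in range(n):
        if j != i:
            v = PROGRESS[j]
            if v != last[i][j]:
                last[i][j] = v         # progress observed: no new suspicion
            else:
                SUSPICIONS[i][j] += 1  # no progress observed: suspect pj

def M(k):
    """Sum of the (t + 1) smallest entries of column k (as in Definition 6)."""
    return sum(sorted(SUSPICIONS[i][k] for i in range(n))[:t + 1])

def leader():
    """The least-suspected process, identities breaking ties."""
    return min(range(n), key=lambda k: (M(k), k))

# p0 "crashes": only p1, p2, p3 keep writing heartbeats.
for _ in range(5):
    for j in range(1, n):
        heartbeat(j)
    for i in range(1, n):
        monitor_step(i)

print(leader())  # prints 1: p0's suspicion counts grew, p1 is least suspected
```

In this toy run, a crashed process stops writing PROGRESS, so its column of SUSPICIONS grows without bound (as in Lemma 42), while a process that keeps writing fast enough is eventually no longer suspected (as in Lemma 43); the minimum over the pairs (Mk, k) then stabilizes on a live process.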
shows that an eventual leader is elected in that run. The proof is decomposed into
several lemmas.
Lemma 41 Let pi be a faulty process. For any pj , SUSPICIONS[i, j] is bounded.
Proof Let us first observe that the vector SUSPICIONS[i, 1..n] is updated only by
pi . The proof follows immediately from the fact that, after it has crashed, a process
no longer updates atomic registers.
Lemma 42 Let pi and pj be a correct and a faulty process, respectively.
SUSPICIONS[i, j] grows forever.
Proof After a process pj has crashed, it no longer increases the value of
PROGRESS[j], and consequently, due to the update of line 14, there is a finite time
after which the test of line 13 remains forever false for any correct process pi . It
follows that SUSPICIONS[i, j] increases without bound at line 15.
Lemma 43 Let pi be a correct process involved in the assumption EWB2 (i.e., its
timer is eventually well behaved) and let us assume that, after some point in time,
timeri is always set to a value > Δ. Let pj be a correct process that satisfies the
assumption EWB1 . Then, SUSPICIONS[i, j] is bounded.
Proof As pi is involved in EWB2 , there is a time τEWB2 such that timeri never expires
before τ + δ if it was set to δ at time τ , with τ ≥ τEWB2 . Similarly, as pj satisfies
EWB1 , there are a bound Δ and a time τEWB1 after which two consecutive write
operations issued by pj into PROGRESS[j] are separated by at most Δ time units (let
us recall that PROGRESS[j] is the only critical register written by pj ).
Let τ1 be the time after which timeouti takes only values > Δ, and let τ2 =
max(τ1 , τEWB1 , τEWB2 ). As after time τ2 any two consecutive write operations into
PROGRESS[j] issued by pj are separated by at most Δ time units, while any two
readings of PROGRESS[j] by pi are separated by at least Δ time units, it follows that
there is a finite time τ3 ≥ τ2 after which we always have PROGRESS[j] ≠ lasti [j]
when evaluated by pi (line 12). Hence, after τ3 , the shared variable SUSPICIONS[i, j]
is no longer increased, which completes the proof of the lemma.
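The core timing argument of this proof can be checked on a tiny schedule (all numbers are ours): if, after τ2, pj writes PROGRESS[j] at most Δ time units apart while pi's successive reads are at least Δ apart, then every interval between two consecutive reads of pi contains at least one write of pj, so pi keeps observing fresh values.

```python
# Illustrative schedule (values ours): Δ = 3; writes 3 apart, reads 4 apart.
DELTA = 3
writes = [3 * k for k in range(20)]  # pj's writes, at most DELTA apart
reads = [4 * k for k in range(10)]   # pi's reads, at least DELTA apart

# Every interval between two consecutive reads contains a write,
# so PROGRESS[j] differs from pi's stored copy at each evaluation.
for a, b in zip(reads, reads[1:]):
    assert any(a < w <= b for w in writes)
print("every read interval contains a write")
```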
Definition 6 Given a process pk , let s_k^1(τ) ≤ s_k^2(τ) ≤ ··· ≤ s_k^{t+1}(τ) denote the
(t + 1) smallest values among the n values in the vector SUSPICIONS[1..n, k] at
time τ . Let Mk (τ) denote s_k^1(τ) + s_k^2(τ) + ··· + s_k^{t+1}(τ).
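For instance (values ours), with t = 2 and SUSPICIONS[1..5, k] = [4, 1, 7, 2, 9] at time τ, the (t + 1) = 3 smallest values are 1, 2 and 4, so Mk(τ) = 7:

```python
# Mk(τ): sum of the (t + 1) smallest suspicion counts for pk (values ours)
t = 2
column_k = [4, 1, 7, 2, 9]           # SUSPICIONS[1..5, k] at time τ
M_k = sum(sorted(column_k)[:t + 1])  # 1 + 2 + 4
print(M_k)  # prints 7
```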
Definition 7 Let S denote the set containing the f faulty processes plus the (t −
f ) processes involved in the assumption EWB2 (whose timers are eventually well
behaved). Then, for each process pk ∉ S, let Sk denote the set S ∪ {pk }. (Let us notice
that |Sk | = t + 1.)
Lemma 44 At any time τ , there is a process pi ∈ Sk such that the predicate
SUSPICIONS[i, k] ≥ s_k^{t+1}(τ) is satisfied.
Proof Let K(τ) be the set of the (t + 1) processes px such that, at time τ ,
SUSPICIONS[x, k] ≤ s_k^{t+1}(τ). We consider two cases:
Proof The lemma follows directly from the following observations: B does not
contain faulty processes (Lemma 46), it is not empty (Lemma 45), and no two
processes have the same identity.
Theorem 83 There is a time after which the local variables leaderi of all the correct
processes remain forever equal to the same identity, which is the identity of a correct
process.
Proof The theorem follows from Lemma 47 and Lemma 48.
17.7.4 Discussion
On the process that is elected The proof of the construction relies on the assump-
tion EWB1 to guarantee that at least one correct process can be elected; i.e., the
set B is not empty (Lemma 45) and does not contain faulty processes (Lemma 46).
This does not mean that the elected process is a process that satisfies the assumption
EWB1 . There are cases where it can be another process.
To see when this can happen, let us consider two correct processes pi and pj
such that pi does not satisfy EWB2 (its timer is never well behaved) and pj does
not satisfy EWB1 (it never behaves synchronously with respect to its critical register
PROGRESS[j]). (A re-reading of the statement of Lemma 43 will make the following
description easier to understand.) Despite the fact that (1) pi is not synchronous with
respect to the processes that satisfy EWB1 (and can consequently suspect these
processes infinitely often), and (2) pj is not synchronous with respect to the processes
involved in EWB2 (and can consequently be suspected infinitely often by such processes), it is
still possible that pi and pj behave synchronously with respect to each other in such
a way that pi never suspects pj . If this happens, SUSPICIONS[i, j] remains bounded,
and it is possible that the value Mj not only remains bounded but becomes the smallest
value in the set B. If this occurs, pj is elected as the common leader.
Of course, there are runs in which the previous scenario does not occur. That is
why the protocol has to rely on EWB1 in order to guarantee that the set B is never
empty.
On the timeout values It is important to notice that the timeout values are determined
from the least suspected processes. Moreover, after the common leader (say
pℓ ) has been elected, any timeout value is set to Mℓ . It follows that, given any run, be it
finite or infinite, the timeout values are always bounded with respect to that run (two
executions can have different bounds).
17.8 Summary
• The failure detector abstraction was introduced by T. Chandra and S. Toueg [67].
The failure detector Ω was introduced by T. Chandra, V. Hadzilacos and S. Toueg
[68], who proved in that paper that it captures the weakest information on failures
that allows the consensus problem to be solved.
• The first use of a failure detector to solve the consensus problem in a shared
memory system appeared in [197].
In addition to the Ω-based consensus algorithms described in this chapter, other
Ω-based consensus algorithms for shared memory systems can be found in [84,
240]. Ω-based consensus algorithms suited to message-passing systems can be
found in several papers (e.g., [213, 127, 129]).
Relations between failure detectors and wait-freedom are studied in [218]. An
introduction to failure detectors can be found in [235].
An extension of failure detectors to bounded lifetime failure detectors was intro-
duced in [104]. Such an extension allows a leader to be elected for a finite period
of time (and not “forever”).
• The alpha2 object is due to L. Lamport [192], who introduced it in the context of
message-passing systems (Paxos algorithm).
• A technique similar to the one used to implement an alpha2 object was used in
timestamp-based transaction systems. A timestamp is associated with each trans-
action, and a transaction is aborted when it accesses data that has already been
accessed by another transaction with a higher timestamp (an aborted transaction
has to be re-issued with a higher timestamp [50]).
• The implementation of an alpha2 object based on unreliable disks is due to E. Gafni
and L. Lamport [109].
• The consensus algorithm based on a store-collect object presented in Fig. 17.5
(Sect. 17.4.3) is due to M. Raynal and J. Stainer [239]. It originates from an algo-
rithm described in [240] (which itself was inspired by an algorithm presented by
C. Delporte and H. Fauconnier in [84]).
• The power of read/write disks encountered in storage area networks is investigated
in [15].
• The implementation of an alpha2 object based on active disks is due to G. Chockler
and D. Malkhi [75]. Generalization to Byzantine failures can be found in [1].
• The notion of an indulgent algorithm is due to R. Guerraoui [120]. Its underlying
theory is developed in [126], and its application to message-passing systems is
investigated in [92, 127, 129, 247, 274].
• The implementation of Ω described in Sect. 17.7 is due to A. Fernández, E.
Jiménez, M. Raynal, and G. Trédan [100]. An improvement of this construction in
which, after some finite time, only the eventual leader writes the shared memory
is presented in [101] (see also [99]).
• The implementation of Ω in crash-prone asynchronous message-passing systems
was addressed in several papers (e.g., [75, 99, 128]).
Fig. 17.18 The operations collect() and deposit() on a closing set object (code for process pi )
Unlike the alpha1 (adopt-commit) and alpha2 objects, a closing set has
no sequential specification and, consequently, is not an atomic object. Moreover,
like alpha1 , it is a round-free object (rounds do not appear in its definition).
A wait-free implementation of a closing set is described in Fig. 17.18. This imple-
mentation uses an array of SWMR atomic registers REG[1..n] initialized to
[⊥, . . . , ⊥]. The aim of REG[i] is to contain the value deposited by pi into the
closing set CS.
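Figure 17.18 itself is not reproduced here, but the description above suggests the following shape for the two operations (a sketch under our reading of the text, not the figure's exact code): deposit() writes the caller's value into its own register, and both operations return the set of non-⊥ values found in a scan of REG[1..n].

```python
# Sketch (our reading of the description, not Fig. 17.18 verbatim):
# REG[i] holds the value deposited by pi, with ⊥ modeled as None.
BOT = None
n = 3
REG = [BOT] * n  # SWMR: REG[i] written only by pi

def collect():
    """Return the set of values deposited so far (non-bottom entries)."""
    return {v for v in REG if v is not BOT}

def deposit(i, v):
    """pi deposits v, then returns the set of values it sees."""
    REG[i] = v
    return collect()

s1 = deposit(0, 'a')
s2 = deposit(2, 'b')
# s1 == {'a'} and s2 == {'a', 'b'}: a later scan contains earlier deposits
```

In this sequential sketch successive scans only grow; the object's actual guarantees under concurrency are those stated in its (non-sequential) specification.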
Considering an unbounded sequence of closing set objects, CS[0], CS[1],
CS[2], . . ., design an Ω-based wait-free construction of a consensus object and
prove that it is correct.
Solution in [240].
Afterword
The practice of sequential computing has greatly benefited from the results of the
theory of sequential computing that were captured in the study of formal languages
and automata theory. Everyone knows what can be computed (computability) and
what can be computed efficiently (complexity). All these results constitute the
foundations of sequential computing, which, thanks to them, has become a science.
These theoretical results and algorithmic principles have been described in many
books from which students can learn basic results, algorithms, and principles of
sequential computing (e.g., [79, 85, 114, 151, 176, 203, 211, 258] to cite a few).
Synchronization is coming back, but is it the same? While books do exist for
traditional synchronization (e.g., [27, 58]), very few books present in a
comprehensive way the important results that have been discovered in the past
20 years.1 Hence, even if it describes a lot of algorithms implementing concurrent
objects, the aim of this book is not to be a catalog of algorithms. Its ambition is not
only to present synchronization algorithms but also to introduce the reader to the
theory that underlies the implementation of concurrent objects in the presence of
asynchrony and process crashes.
1. The book by Gadi Taubenfeld [262] and the book by Maurice Herlihy and Nir Shavit [146] are
two such books.
2. For message-passing systems, the reader can consult [40, 60, 115, 176, 201, 236, 237, 248].
This section presents courses on synchronization and concurrent objects which can
benefit from the concepts, algorithms, and principles presented in this book. The
first is a one-semester course for the last year of the undergraduate level, while
the two other one-semester courses are more appropriate for the graduate level.
• Undergraduate students.
A one-semester course could first focus on Part I devoted to mutual exclusion.
(The reader can notice that this part of the book has plenty of exercises.)
Then, the course could address (a) the underlying theory, namely Part II devoted
to the formalization of the atomicity concept (in order for students to have a
clear view of the foundations on what they have learned), and (b) the notion of
mutex-freedom and associated progress conditions introduced in Chap. 5.
Examples taken from Chaps. 7–9 can be used to illustrate these notions.
This section briefly presents a few other books which address synchronization in
shared memory systems.
• Books on message-passing and shared memory:
– The book by H. Attiya and J. Welch [41], the book by A. Kshemkalyani and
M. Singhal [181], and the book by N. Lynch [201] consider both the
message-passing model and the shared memory model.
More precisely, Part IIA of Lynch’s book is devoted to the asynchronous
shared memory model. The algorithms are described in the input/output
automata model and formally proved using this formalism. In addition to
mutual exclusion, this part of the book also visits resource allocation.
Chapter 5 of Attiya and Welch’s book, and Chap. 9 of Kshemkalyani and
Singhal’s book, are fully devoted to the mutual exclusion problem in shared
1. I. Abraham, G.V. Chockler, I. Keidar, D. Malkhi, Byzantine disk Paxos, optimal resilience
with Byzantine shared memory. Proceedings of the 23rd ACM Symposium on Principles of
Distributed Computing (PODC’04), St. John’s, 2004 (ACM Press, New York, 2004),
pp. 226–235
2. Y. Afek, H. Attiya, D. Dolev, E. Gafni, M. Merritt, N. Shavit, Atomic snapshots of shared
memory. J. ACM 40(4), 873–890 (1993)
3. Y. Afek, G. Brown, M. Merritt, Lazy caching. ACM Trans. Program. Lang. Syst. 15(1),
182–205 (1993)
4. Y. Afek, D. Dauber, D. Touitou, Wait-free made fast. Proceedings of the 27th ACM
Symposium on Theory of Computing (STOC’00), Portland, 2000 (ACM Press, New York,
2000), pp. 538–547
5. Y. Afek, E. Gafni, A. Morrison, Common2 extended to stacks and unbounded concurrency.
Distrib. Comput. 20(4), 239–252 (2007)
6. Y. Afek, E. Gafni, S. Rajsbaum, M. Raynal, C. Travers, The k-simultaneous consensus
problem. Distrib. Comput. 22(3), 185–195 (2010)
7. Y. Afek, I. Gamzu, I. Levy, M. Merritt, G. Taubenfeld, Group renaming. Proceedings of the
12th International Conference on Principles of Distributed Systems (OPODIS’08), Luxor,
2008. LNCS, vol. 5401 (Springer, Berlin, 2008), pp. 58–72
8. Y. Afek, D. Greenberg, M. Merritt, G. Taubenfeld, Computing with faulty shared objects.
J. ACM 42(6), 1231–1274 (1995)
9. Y. Afek, M. Merritt, Fast wait-free (2k − 1)-renaming. Proceedings of the 18th ACM
Symposium on Principles of Distributed Computing (PODC’99), Atlanta, 1999 (ACM Press,
New York, 1999), pp. 105–112
10. Y. Afek, M. Merritt, G. Taubenfeld, The power of multi-objects. Inf. Comput. 153, 213–222
(1999)
11. Y. Afek, M. Merritt, G. Taubenfeld, D. Touitou, Disentangling multi-object operations.
Proceedings of the 16th International ACM Symposium on Principles of Distributed
Computing (PODC’97), Santa Barbara, 1997 (ACM Press, New York, 1997), pp. 262–272
12. Y. Afek, G. Stupp, D. Touitou, Long-lived adaptive collect with applications. Proceedings
of the 40th IEEE Symposium on Foundations of Computer Science Computing (FOCS’99),
New York, 1999 (IEEE Computer Press, New York, 1999), pp. 262–272
13. Y. Afek, E. Weisberger, H. Weisman, A completeness theorem for a class of
synchronization objects. Proceedings of the 12th International ACM Symposium on
Principles of Distributed Computing (PODC’93), Ithaca, 1993 (ACM Press, New York,
1993), pp. 159–168
14. M.K. Aguilera, A pleasant stroll through the land of infinitely many creatures. ACM
SIGACT News, Distrib. Comput. Column 35(2), 36–59 (2004)
15. M.K. Aguilera, B. Englert, E. Gafni, On using network attached disks as shared memory.
Proceedings of the 21st ACM Symposium on Principles of Distributed Computing
(PODC’03), Boston, 2003 (ACM Press, New York, 2003), pp. 315–324
16. M.K. Aguilera, S. Frolund, V. Hadzilacos, S.L. Horn, S. Toueg, Abortable and query-
abortable objects and their efficient implementation. Proceedings of the 26th ACM
Symposium on Principles of Distributed Computing (PODC’07), Portland, 2007 (ACM
Press, New York, 2007), pp. 23–32
17. M. Ahamad, G. Neiger, J.E. Burns, P. Kohli, Ph.W. Hutto, Causal memory: definitions,
implementation, and programming. Distrib. Comput. 9(1), 37–49 (1995)
18. D. Alistarh, J. Aspnes, S. Gilbert, R. Guerraoui, The complexity of renaming. Proceedings
of the 52nd Annual IEEE Symposium on Foundations of Computer Science (FOCS 2011),
Palm Springs, 2011 (IEEE Press, New York, 2011), pp. 718–727
19. D. Alistarh, H. Attiya, S. Gilbert, A. Giurgiu, R. Guerraoui, Fast randomized test-and-set
and renaming. Proceedings of the 24th International Symposium on Distributed Computing
(DISC’10), Cambridge, 2010. LNCS, vol. 6343 (Springer, Heidelberg, 2010), pp. 94–108
20. B. Alpern, F.B. Schneider, Defining liveness. Inf. Process. Lett. 21(4), 181–185 (1985)
21. J.H. Anderson, Composite registers. Proceedings of the 9th ACM Symposium on Principles
of Distributed Computing (PODC’90), Quebec City, 1990 (ACM Press, New York, 1990),
pp. 15–29
22. J.H. Anderson, Multi-writer composite registers. Distrib. Comput. 7(4), 175–195 (1994)
23. J.H. Anderson, Y.J. Kim, Adaptive local exclusion with local spinning. Proceedings of the
14th International Symposium on Distributed Computing (DISC’00), Toledo, 2000. LNCS,
vol. 1914 (Springer, Heidelberg, 2000), pp. 29–43
24. J.H. Anderson, Y.J. Kim, T. Herman, Shared memory mutual exclusion: major research
trends since 1986. Distrib. Comput. 16, 75–110 (2003)
25. J. Anderson, M. Moir, Universal constructions for multi-object operations. Proceedings of
the 14th International ACM Symposium on Principles of Distributed Computing
(PODC’95), Ottawa, 1995 (ACM Press, New York, 1995), pp. 184–195
26. J. Anderson, M. Moir, Universal constructions for large objects. IEEE Trans. Parallel
Distrib. Syst. 10(12), 1317–1332 (1999)
27. G.R. Andrews, Concurrent Programming, Principles and Practice (Benjamin/Cumming,
Redwood City, 1993), 637 pp
28. A.A. Aravind, Yet another simple solution to the concurrent programming control problem.
IEEE Tran. Parallel Distrib. Syst. 22(6), 1056–1063 (2011)
29. H. Attiya, A. Bar-Noy, D. Dolev, D. Peleg, R. Reischuk, Renaming in an asynchronous
environment. J. ACM 37(3), 524–548 (1990)
30. H. Attiya, E. Dagan, Improved implementations of universal binary operations. J. ACM
48(5), 1013–1037 (2001)
31. H. Attiya, F. Ellen, P. Fatourou, The complexity of updating snapshot objects. J. Parallel
Distrib. Comput. 71(12), 1570–1577 (2010)
32. H. Attiya, A. Fouren, Polynomial and adaptive long-lived (2p − 1)-renaming. Proceedings of
the 14th International Symposium on Distributed Computing (DISC’00), Toledo, 2000.
LNCS, vol. 1914 (Springer, Heidelberg, 2000), pp. 149–163
33. H. Attiya, A. Fouren, Adaptive and efficient algorithms for lattice agreement and renaming.
SIAM J. Comput. 31(2), 642–664 (2001)
34. H. Attiya, A. Fouren, Algorithms adapting to point contention. J. ACM 50(4), 444–468
(2003)
35. H. Attiya, A. Fouren, E. Gafni, An adaptive collect algorithm with applications. Distrib.
Comput. 15(2), 87–96 (2002)
36. H. Attiya, R. Guerraoui, E. Ruppert, Partial snapshot objects. Proceedings of the 20th ACM
Symposium on Parallel Architectures and Algorithms (SPAA’08), Munich, 2008 (ACM
Press, New York, 2008), pp. 336–343
37. H. Attiya, E. Hillel, Highly concurrent multi-word synchronization. Proceedings of the 9th
International Conference on Distributed Computing and Networking (ICDCN’08), Kolkata,
2008. LNCS, vol. 4904 (Springer, Heidelberg, 2008), pp. 112–123
59. J.E. Burns, G.L. Peterson, Constructing multireader atomic values from non-atomic values.
Proceedings of the 6th ACM Symposium on Principles of Distributed Computing
(PODC’87), Vancouver, 1987 (ACM Press, New York, 1987), pp. 222–231
60. Ch. Cachin, R. Guerraoui, L. Rodrigues, Introduction to Reliable and Secure Distributed
Programming (Springer, New York, 2011), 320 pp
61. J. Cachopo, A. Rito-Silva, Versioned boxes as the basis for transactional memory. Sci.
Comput. Program. 63(2), 172–175 (2006)
62. R.H. Campbell, Predicates path expressions. Proceedings of the 6th ACM Symposium on
Principles of Programming Languages (POPL’79), San Antonio, 1979 (ACM Press, New
York, 1979), pp. 226–236
63. R.H. Campbell, N. Haberman, The specification of process synchronization by path
expressions. Proceedings of the International Conference on Operating Systems, LNCS,
vol. 16 (Springer, Berlin, 1974), pp. 89–102
64. R.H. Campbell, R.B. Kolstad, Path expressions in Pascal. Proceedings of the 4th
International Conference on Software Engineering (ICSE’79), Munich, 1979 (ACM
Press, New York, 1979), pp. 212–219
65. A. Castañeda, S. Rajsbaum, New combinatorial topology upper and lower bounds for
renaming: the lower bound. Distrib. Comput. 22(5), 287–301 (2010)
66. A. Castañeda, S. Rajsbaum, M. Raynal, The renaming problem in shared memory systems:
an introduction. Elsevier Comput. Sci. Rev. 5, 229–251 (2011)
67. T. Chandra, S. Toueg, Unreliable failure detectors for reliable distributed systems. J. ACM
43(2), 225–267 (1996)
68. T. Chandra, V. Hadzilacos, S. Toueg, The weakest failure detector for solving consensus.
J. ACM 43(4), 685–722 (1996)
69. K.M. Chandy, J. Misra, Parallel Program Design (Addison-Wesley, Reading, 1988), 516 pp
70. B. Charron-Bost, R. Cori, A. Petit, Introduction à l’algorithmique des objets partagés.
RAIRO Informatique Théorique et Applications 31(2), 97–148 (1997)
71. S. Chaudhuri, More choices allow more faults: set consensus problems in totally
asynchronous systems. Inf. Comput. 105(1), 132–158 (1993)
72. S. Chaudhuri, M.J. Kosa, J. Welch, One-write algorithms for multivalued regular and
atomic registers. Acta Inform. 37, 161–192 (2000)
73. S. Chaudhuri, J. Welch, Bounds on the cost of multivalued registers implementations. SIAM
J. Comput. 23(2), 335–354 (1994)
74. G.V. Chockler, D. Malkhi, Active disk Paxos with infinitely many processes. Distrib.
Comput. 18(1), 73–84 (2005)
75. G.V. Chockler, D. Malkhi, Light-weight leases for storage-centric coordination. Int.
J. Parallel Prog. 34(2), 143–170 (2006)
76. B. Chor, A. Israeli, M. Li, On processor coordination with asynchronous hardware.
Proceedings of the 6th ACM Symposium on Principles of Distributed Computing
(PODC’87), Vancouver, 1987 (ACM Press, New York, 1987), pp. 86–97
77. Ph. Chuong, F. Ellen, V. Ramachandran, A universal construction for wait-free transaction
friendly data structures. Proceedings of the 22nd International ACM Symposium on
Parallelism in Algorithms and Architectures (SPAA’10), Santorini, 2010 (ACM Press, New
York, 2010), pp. 335–344
78. R. Colvin, L. Groves, V. Luchangco, M. Moir, Formal verification of a lazy concurrent list-
based set algorithm. Proceedings of the 18th International Conference on Computer Aided
Verification (CAV’06), Seattle, 2006. LNCS, vol. 4144 (Springer, Heidelberg, 2006),
pp. 475–488
79. Th.M. Cormen, Ch.E. Leiserson, R.L. Rivest, Introduction to Algorithms (The MIT Press,
Cambridge, 1998), 1028 pp
80. P.J. Courtois, F. Heymans, D.L. Parnas, Concurrent control with “readers” and “writers”.
Commun. ACM 14(5), 667–668 (1971)
81. T. Crain, V. Gramoli, M. Raynal, A speculation-friendly binary search tree. Proceedings of
the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
(PPoPP’12), New Orleans, 2012 (ACM Press, New York, 2012), pp. 161–170
82. T. Crain, D. Imbs, M. Raynal, Read invisibility, virtual world consistency and probabilistic
permissiveness are compatible. Proceedings of the 11th International Conference on
Algorithms and Architectures for Parallel Processing (ICA3PP’11), Melbourne, 2011.
LNCS, vol. 7016 (Springer, Berlin, 2011), pp. 245–258
83. T. Crain, D. Imbs, M. Raynal, Towards a universal construction for transaction-based
multiprocess programs. Proceedings of the 13th International Conference on Distributed
Computing and Networking (ICDCN’12), Hong Kong, 2012. LNCS, vol. 7129, (Springer,
Berlin, 2012), pp. 61–75
84. C. Delporte-Gallet, H. Fauconnier, Two consensus algorithms with atomic registers and
failure detector X. Proceedings of the 10th International Conference on Distributed
Computing and Networking (ICDCN’09), Hyderabad, 2009. LNCS, vol. 5408 (Springer,
Heidelberg, 2009), pp. 251–262
85. P.J. Denning, J.B. Dennis, J.E. Qualitz, Machines, Languages, and Computation (Prentice
Hall, Englewood Cliffs, 1978), 612 pp
86. D. Dice, V.J. Marathe, N. Shavit, Lock cohorting: a general technique for designing NUMA
locks. Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPoPP’12), New Orleans, 2012 (ACM Press, New York, 2012),
pp. 247–256
87. D. Dice, O. Shalev, N. Shavit, Transactional locking II. Proceedings of the 20th
International Symposium on Distributed Computing (DISC’06), Stockholm, 2006. LNCS,
vol. 4167 (Springer, Heidelberg, 2006), pp. 194–208
88. E.W. Dijkstra, Solution of a problem in concurrent programming control. Commun. ACM
8(9), 569 (1965)
89. E.W. Dijkstra, Cooperating Sequential Processes. In Programming Languages, ed. by F.
Genuys (Academic Press, New York, 1968), pp. 43–112
90. E.W. Dijkstra, Hierarchical ordering of sequential processes. Acta Inform. 1(1), 115–138
(1971)
91. A.B. Downey, The Little Book of Semaphores, 2nd edn, version 2.1.2. (Green Tea Press,
Virginia, 2005), 291 pp. http://www.greenteapress.com/semaphores/downey05semaphores.pdf
92. P. Dutta, R. Guerraoui, Fast indulgent consensus with zero degradation. Proceedings of the
4th European Dependable Computing Conference (EDCC’02), Toulouse, 2002. LNCS, vol.
2485 (Springer, Heidelberg, 2002), pp. 191–208
93. F. Ellen, How hard is it to take a snapshot? Proceedings of the 31st Conference on Current
Trends in Theory and Practice of Computer Science (SOFSEM’05), Liptovský Ján, 2005.
LNCS, vol. 3381 (Springer, Heidelberg, 2005), pp. 28–37
94. B. Englert, E. Gafni, Fast collect in the absence of contention. Proceedings of the IEEE
International Conference on Distributed Computing Systems (ICDCS’02), Vienna, 2002
(IEEE Press, New York, 2002), pp. 537–543
95. P. Fatourou, N.D. Kallimanis, The red-blue adaptive universal construction. Proceedings of
the 22nd International Symposium on Distributed Computing (DISC’09), Elche, 2009.
LNCS, vol. 5805 (Springer, Berlin, 2009), pp. 127–141
96. P. Fatourou, N.D. Kallimanis, A highly-efficient wait-free universal construction.
Proceedings of the 23rd Annual ACM Symposium on Parallelism in Algorithms and
Architectures (SPAA’11), San Jose, 2011 (ACM Press, New York, 2011), pp. 325–334
97. A. Fekete, N. Lynch, M. Merritt, W. Weihl, Atomic Transactions (Morgan Kaufmann, San
Mateo, 1994)
98. P. Felber, Ch. Fetzer, R. Guerraoui, T. Harris, Transactions are coming back, but are they
the same? ACM SIGACT News, Distrib. Comput. Column 39(1), 48–58 (2008)
99. A. Fernández, E. Jiménez, M. Raynal, Electing an eventual leader in an asynchronous
shared memory system. Proceedings of the 37th International IEEE Conference on
Dependable Systems and Networks (DSN’07), Edinburgh, 2007 (IEEE Computer Society
Press, New York, 2007), pp. 399–408
100. A. Fernández, E. Jiménez, M. Raynal, G. Trédan, A timing assumption and a t-resilient
protocol for implementing an eventual leader in asynchronous shared memory systems.
Proceedings of the 10th International IEEE Symposium on Objects and Component-Oriented
Real-Time Computing (ISORC 2007), Santorini Island, May 2007 (IEEE Computer
Society Press, New York, 2007), pp. 71–78
101. A. Fernández, E. Jiménez, M. Raynal, G. Trédan, A timing assumption and two t-resilient
protocols for implementing an eventual leader service in asynchronous shared-memory
systems. Algorithmica 56(4), 550–576 (2010)
102. M.J. Fischer, N.A. Lynch, M.S. Paterson, Impossibility of distributed consensus with one
faulty process. J. ACM 32(2), 374–382 (1985)
103. A. Fouren, Exponential examples of two renaming algorithms. Technion Tech Report, 1999.
http://www.cs.technion.ac.il/hagit/pubs/expo.ps.gz
104. R. Friedman, A. Mostéfaoui, M. Raynal, Asynchronous bounded lifetime failure detectors.
Inf. Process. Lett. 94(2), 85–91 (2005)
105. E. Gafni, Round-by-round fault detectors: unifying synchrony and asynchrony. Proceedings
of the 17th ACM Symposium on Principles of Distributed Computing (PODC), Puerto
Vallarta, 1998 (ACM Press, New York, 1998), pp. 143–152
106. E. Gafni, Group solvability. Proceedings of the 18th International Symposium on
Distributed Computing (DISC’04), Amsterdam, 2004. LNCS, vol. 3274 (Springer,
Heidelberg, 2004), pp. 30–40
107. E. Gafni, Renaming with k-set consensus: an optimal algorithm in n + k − 1 slots.
Proceedings of the 10th International Conference on Principles of Distributed Systems
(OPODIS’06), Bordeaux, 2006. LNCS, vol. 4305 (Springer, Heidelberg, 2006), pp. 36–44
108. E. Gafni, R. Guerraoui, Generalizing universality. Proceedings of the 22nd International
Conference on Concurrency Theory (CONCUR’11), Aachen, 2011. LNCS, vol. 6901
(Springer, Berlin, 2011), pp. 17–27
109. E. Gafni, L. Lamport, Disk Paxos. Distrib. Comput. 16(1), 1–20 (2003)
110. E. Gafni, M. Merritt, G. Taubenfeld, The concurrency hierarchy, and algorithms for
unbounded concurrency. Proceedings of the 20th ACM Symposium on Principles of
Distributed Computing (PODC’01), Newport, 2001, pp. 161–169
111. E. Gafni, A. Mostéfaoui, M. Raynal, C. Travers, From adaptive renaming to set agreement.
Theoret. Comput. Sci. 410(14–15), 1328–1335 (2009)
112. E. Gafni, S. Rajsbaum, Recursion in distributed computing. Proceedings of the 12th
International Symposium on Stabilization, Safety, and Security of Distributed Systems
(SSS’10), New York, 2010. LNCS, vol. 6366 (Springer, Heidelberg, 2010), pp. 362–376
113. E. Gafni, M. Raynal, C. Travers, Test&set, adaptive renaming and set agreement: a guided
visit to asynchronous computability. Proceedings of the 26th IEEE Symposium on Reliable
Distributed Systems (SRDS’07), Beijing, 2007 (IEEE Computer Society Press, New York,
2007), pp. 93–102
114. M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-
Completeness (W.H. Freeman, New York, 1979), 340 pp
115. V.K. Garg, Elements of Distributed Computing (Wiley-Interscience, New York, 2002), 423
pp
116. A.J. Gerber, Process synchronization by counter variables. ACM Operating Syst. Rev.
11(4), 6–17 (1977)
117. A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph, M. Snir, The NYU
ultracomputer: designing an MIMD parallel computer. IEEE Trans. Comput. C-32(2), 175–
189 (1984)
118. A. Gottlieb, B.D. Lubachevsky, L. Rudolph, Basic techniques for the efficient coordination
of very large number of cooperating sequential processes. ACM Trans. Program. Lang.
Syst. 5(2), 164–189 (1983)
119. J. Gray, A. Reuter, Transaction Processing: Concepts and Techniques (Morgan Kaufmann,
San Mateo, 1992), 1070 pp
120. R. Guerraoui, Indulgent algorithms. Proceedings of the 19th ACM Symposium on Principles
of Distributed Computing (PODC’00), Portland, 2000 (ACM Press, New York, 2000),
pp. 289–298
121. R. Guerraoui, Th.A. Henzinger, V. Singh, Permissiveness in transactional memories.
Proceedings of the 22nd International Symposium on Distributed Computing (DISC’08),
Arcachon, 2008. LNCS, vol. 5218 (Springer, Heidelberg, 2008), pp. 305–319
Bibliography 501
165. D. Inoue, W. Chen, T. Masuzawa, N. Tokura, Linear time snapshot using multi-reader
multi-writer registers. Proceedings of the 8th International Workshop on Distributed
Algorithms (WDAG’94), Terschelling, 1994. LNCS, vol. 857 (Springer, London, 1994),
pp. 130–140
166. P. Jayanti, Robust wait-free hierarchies. J. ACM 44(4), 592–614 (1997)
167. P. Jayanti, An optimal multiwriter snapshot algorithm. Proceedings of the 37th ACM
Symposium on Theory of Computing (STOC’05), Baltimore, 2005 (ACM Press, New York,
2005), pp. 723–732
168. P. Jayanti, J.E. Burns, G.L. Peterson, Almost optimal single reader single writer atomic
register. J. Parallel Distrib. Comput. 60, 150–168 (2000)
169. P. Jayanti, T.D. Chandra, S. Toueg, Fault-tolerant wait-free shared objects. J. ACM 45(3),
451–500 (1998)
170. P. Jayanti, T.D. Chandra, S. Toueg, The cost of graceful degradation for omission failures.
Inf. Process. Lett. 71, 167–172 (1999)
171. P. Jayanti, K. Tan, G. Friedland, A. Katz, Bounding Lamport’s bakery algorithm.
Proceedings of the 28th Conference on Current Trends in Theory and Practice of
Informatics (SOFSEM'01), Piestany, 2001. LNCS, vol. 2234 (Springer, Berlin, 2001),
pp. 261–270
172. P. Jayanti, S. Toueg, Some results on the impossibility, universality, and decidability of
consensus. Proceedings of the 6th International Workshop on Distributed Algorithms
(WDAG’92), Haifa, 1992. LNCS, vol. 647 (Springer, Heidelberg, 1992), pp. 69–84
173. N.D. Kallimanis, P. Fatourou, Revisiting the combining synchronization technique.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPoPP’12), New Orleans, 2012 (ACM Press, New York, 2012)
pp. 257–266
174. J.L.W. Kessels, An alternative to event queues for synchronization in monitors. Commun.
ACM 20(7), 500–503 (1977)
175. J.L.W. Kessels, Arbitration without common modifiable variables. Acta Inform. 17(2),
135–141 (1982)
176. J. Kleinberg, E. Tardos, Algorithm Design (Addison-Wesley, Pearson Education, New
York, 2005), 838 pp
177. L.M. Kirousis, E. Kranakis, A survey of concurrent readers and writers. CWI Q. 2, 307–330
(1989)
178. L.M. Kirousis, E. Kranakis, P. Vitányi, Atomic multireader register. Proceedings of the 2nd
International Workshop on Distributed Algorithms (WDAG’87), Amsterdam, 1987. LNCS,
vol. 312 (Springer, Berlin, 1987), pp. 278–296
179. A. Kogan, E. Petrank, A methodology for creating fast wait-free data structures.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of
Parallel Programming (PPoPP’12), New Orleans, 2012 (ACM Press, New York, 2012),
pp. 141–150
180. C.P. Kruskal, L. Rudolph, M. Snir, Efficient synchronization on multiprocessors with shared
memory. ACM Trans. Program. Lang. Syst. 10(4), 579–601 (1988)
181. A. Kshemkalyani, M. Singhal, Distributed Computing: Principles, Algorithms and Systems
(Cambridge University Press, Cambridge, 2008), 736 pp
182. E. Ladan-Mozes, N. Shavit, An optimistic approach to lock-free FIFO queues. Proceedings
of the 18th International Symposium on Distributed Computing (DISC’04), Amsterdam,
2004. LNCS, vol. 3274 (Springer, Heidelberg, 2004), pp. 117–131
183. L. Lamport, A new solution of Dijkstra’s concurrent programming problem. Commun.
ACM 17(8), 453–455 (1974)
184. L. Lamport, Concurrent reading while writing. Commun. ACM 20(11), 806–811 (1977)
185. L. Lamport, Proving the correctness of multiprocess programs. IEEE Trans. Softw. Eng.
SE-3(2), 125–143 (1977)
186. L. Lamport, Time, clocks, and the ordering of events in a distributed system. Commun.
ACM 21(7), 558–565 (1978)
187. L. Lamport, How to make a multiprocessor computer that correctly executes multiprocess
programs. IEEE Trans. Comput. C-28(9), 690–691 (1979)
188. L. Lamport, The mutual exclusion problem. Part I: a theory of interprocess communication,
Part II: statement and solutions. J. ACM 33, 313–348 (1986)
189. L. Lamport, On interprocess communication, Part I: basic formalism. Distrib. Comput. 1(2),
77–85 (1986)
190. L. Lamport, On interprocess communication, Part II: algorithms. Distrib. Comput. 1(2),
77–101 (1986)
191. L. Lamport, Fast mutual exclusion. ACM Trans. Comput. Syst. 5(1), 1–11 (1987)
192. L. Lamport, The part-time parliament. ACM Trans. Comput. Syst. 16(2), 133–169 (1998)
(first version appeared as DEC Research, Report #49, September 1989)
193. L. Lamport, Arbitration-free synchronization. Distrib. Comput. 16(2–3), 219–237 (2003)
194. L. Lamport, Teaching concurrency. ACM SIGACT News Distrib. Comput. Column 40(1),
58–62 (2009)
195. J. Larus, Ch. Kozyrakis, Transactional memory: is TM the answer for improving parallel
programming? Commun. ACM 51(7), 80–89 (2008)
196. M. Li, J. Tromp, P. Vitányi, How to share concurrent wait-free variables. J. ACM 43(4),
723–746 (1996)
197. W.-K. Lo, V. Hadzilacos, Using failure detectors to solve consensus in asynchronous shared
memory systems. Proceedings of the 8th International Workshop on Distributed Algorithms
(WDAG’94), 1994. LNCS, vol. 857 (Springer, Heidelberg, 1994), pp. 280–295
198. W.-K. Lo, V. Hadzilacos, All of us are smarter than any of us: wait-free hierarchies are not
robust. Proceedings of the 29th ACM Symposium on Theory of Computing (STOC’97), El
Paso, 1997 (ACM Press, New York, 1997), pp. 579–588
199. M. Loui, H. Abu-Amara, Memory requirements for agreement among unreliable
asynchronous processes. Adv. Comput. Res. 4, 163–183 (1987). JAI Press
200. V. Luchangco, D. Nussbaum, N. Shavit, A hierarchical CLH queue lock. Proceedings of the
12th European Conference on Parallel Computing (Euro-Par’06), Dresden, 2006. LNCS,
vol. 4128 (Springer, Berlin, 2006), pp. 801–810
201. N.A. Lynch, Distributed Algorithms (Morgan Kaufmann, San Mateo, 1996), 872 pp
202. F. Mattern, Virtual time and global states of distributed systems, in Proceedings of the
International Workshop on Parallel and Distributed Algorithms, ed. by M. Cosnard,
P. Quinton, M. Raynal, Y. Robert (North-Holland, 1989), pp. 215–226
203. K. Mehlhorn, P. Sanders, Algorithms and Data Structures (Springer, Berlin, 2008), 300 pp
204. M. Merritt, G. Taubenfeld, Computing with infinitely many processes. Proceedings of the
14th International Symposium on Distributed Computing (DISC’00), Toledo, 2000. LNCS,
vol. 1914 (Springer, Heidelberg, 2000), pp. 164–178
205. M.M. Michael, M.L. Scott, Simple, fast, and practical non-blocking and blocking concurrent
queue algorithms. Proceedings of the 15th International ACM Symposium on Principles of
Distributed Computing (PODC’96), Philadelphia, 1996 (ACM Press, New York, 1996),
pp. 267–275
206. J. Misra, Axioms for memory access in asynchronous hardware systems. ACM Trans.
Program. Lang. Syst. 8(1), 142–153 (1986)
207. M. Moir, Practical implementation of non-blocking synchronization primitives.
Proceedings of the 16th ACM Symposium on Principles of Distributed Computing
(PODC’97), Santa Barbara, 1997 (ACM Press, New York, 1997), pp. 219–228
208. M. Moir, Fast, long-lived renaming improved and simplified. Sci. Comput. Program. 30,
287–308 (1998)
209. M. Moir, J. Anderson, Wait-free algorithms for fast, long-lived renaming. Sci. Comput.
Program. 25(1), 1–39 (1995)
210. M. Moir, D. Nussbaum, O. Shalev, N. Shavit, Using elimination to implement scalable and lock-
free FIFO queues. Proceedings of the 17th ACM Symposium on Parallelism in Algorithms and
Architectures (SPAA’05), Las Vegas, 2005 (ACM Press, New York, 2005), pp. 253–262
211. B. Moret, The Theory of Computation (Addison-Wesley, Reading, 1998), 453 pp
212. A. Mostéfaoui, M. Raynal, Solving consensus using Chandra-Toueg’s unreliable failure
detectors: a general quorum-based approach. Proceedings of the 13th International
Symposium on Distributed Computing (DISC’99), Bratislava, 1999. LNCS, vol. 1693
(Springer, Berlin, 1999), pp. 49–63
213. A. Mostéfaoui, M. Raynal, Leader-based consensus. Parallel Process. Lett. 11(1), 95–107 (2001)
214. A. Mostéfaoui, M. Raynal, Looking for efficient implementations of concurrent objects.
Proceedings of the 11th International Conference on Parallel Computing Technologies
(PaCT’11), Kazan, 2011. LNCS, vol. 6873 (Springer, Berlin, 2011), pp. 74–87
215. A. Mostéfaoui, M. Raynal, C. Travers, Exploring Gafni's reduction land: from Ω^k to
wait-free adaptive (2p − ⌈p/k⌉)-renaming via k-set agreement. Proceedings of the 20th
International Symposium on Distributed Computing (DISC'06), Stockholm, 2006. LNCS,
vol. 4167 (Springer, Heidelberg, 2006), pp. 1–15
216. A. Mostéfaoui, M. Raynal, C. Travers, From renaming to k-set agreement. 14th
International Colloquium on Structural Information and Communication Complexity
(SIROCCO’07), Castiglioncello, 2007, LNCS, vol. 4474 (Springer, Berlin, 2007), pp. 62–76
217. A. Mostéfaoui, M. Raynal, F. Tronel, From binary consensus to multivalued consensus in
asynchronous message-passing systems. Inf. Process. Lett. 73, 207–213 (2000)
218. G. Neiger, Failure detectors and the wait-free hierarchy. Proceedings of the 14th ACM
Symposium on Principles of Distributed Computing (PODC’95), Ottawa, 1995 (ACM Press,
New York, 1995), pp. 100–109
219. R. Newman-Wolfe, A protocol for wait-free atomic multi-reader shared variables.
Proceedings of the 6th ACM Symposium on Principles of Distributed Computing
(PODC’87), Vancouver, 1987 (ACM Press, New York, 1987), pp. 232–248
220. S. Owicki, D. Gries, Verifying properties of parallel programs. Commun. ACM 19(5),
279–285 (1976)
221. Ch. Papadimitriou, The serializability of concurrent updates. J. ACM 26(4), 631–653 (1979)
222. Ch. Papadimitriou, The Theory of Database Concurrency Control (Computer Science Press,
Cambridge, 1988), 239 pp
223. D. Perelman, R. Fan, I. Keidar, On maintaining multiple versions in STM. Proceedings of
the 29th Annual ACM Symposium on Principles of Distributed Computing (PODC’10),
Zurich, 2010 (ACM Press, New York, 2010), pp. 16–25
224. G.L. Peterson, Myths about the mutual exclusion problem. Inf. Process. Lett. 12(3),
115–116 (1981)
225. G.L. Peterson, Concurrent reading while writing. ACM Trans. Program. Lang. Syst. 5,
46–55 (1983)
226. G.L. Peterson, R. Bazzi, G. Neiger, A gap theorem for consensus types (extended abstract).
Proceedings of the 13th ACM Symposium on Principles of Distributed Computing
(PODC’94), Los Angeles, 1994 (ACM Press, New York, 1994), pp. 344–353
227. G.L. Peterson, M.J. Fischer, Economical solutions for the critical section problem in
distributed systems. Proceedings of the 9th ACM Symposium on Theory of Computing
(STOC’77), Boulder, 1977 (ACM Press, New York, 1977), pp. 91–97
228. S.A. Plotkin, Sticky bits and universality of consensus. Proceedings of the 8th ACM
Symposium on Principles of Distributed Computing (PODC’89), Edmonton, 1989 (ACM
Press, New York, 1989), pp. 159–176
229. S. Rajsbaum, M. Raynal, A theory-oriented introduction to wait-free synchronization based
on the adaptive renaming problem. Proceedings of the 25th International Conference on
Advanced Information Networking and Applications (AINA’11), Singapore, 2011 (IEEE
Press, New York, 2011), pp. 356–363
230. S. Rajsbaum, M. Raynal, C. Travers, The iterated restricted immediate snapshot model.
Proceedings of the 14th Annual International Conference on Computing and Combinatorics
(COCOON’08), Dalian, 2008. LNCS, vol. 5092 (Springer, Heidelberg, 2008), pp. 487–497
231. M. Raynal, Algorithms for Mutual Exclusion (The MIT Press, New York, 1986), 107 pp.
ISBN 0-262-18119-3
232. M. Raynal, Sequential consistency as lazy linearizability. Proceedings of the 14th ACM
Symposium on Parallel Algorithms and Architectures (SPAA’02), Winnipeg, 2002 (ACM
Press, New York, 2002), pp. 151–152
233. M. Raynal, Token-based sequential consistency. Int. J. Comput. Syst. Sci. Eng. 17(6),
359–366 (2002)
254. C. Shao, E. Pierce, J. Welch, Multi-writer consistency conditions for shared memory
objects. Proceedings of the 17th International Symposium on Distributed Computing
(DISC’03), Sorrento, 2003. LNCS, vol. 2848 (Springer, Heidelberg, 2003), pp. 106–120
255. N. Shavit, D. Touitou, Software transactional memory. Distrib. Comput. 10(2), 99–116 (1997)
256. E. Shenk, The consensus hierarchy is not robust. Proceedings of the 16th ACM Symposium on
Principles of Distributed Computing (PODC’97), Santa Barbara, 1997 (ACM Press, New
York, 1997), p. 279
257. A.K. Singh, J.H. Anderson, M.G. Gouda, The elusive atomic register. J. ACM 41(2),
311–339 (1994)
258. M. Sipser, Introduction to the Theory of Computation (PWS, Boston, 1996), 396 pp
259. M.F. Spear, V.J. Marathe, W.N. Scherer III, M.L. Scott, Conflict detection and validation
strategies for software transactional memory. Proceedings of the 20th Symposium on
Distributed Computing (DISC’06), Stockholm, 2006. LNCS, vol. 4167 (Springer,
Heidelberg, 2006), pp. 179–193
260. H.S. Stone, Database applications of the fetch&add instruction. IEEE Trans. Comput. C-
33(7), 604–612 (1984)
261. G. Taubenfeld, The black-white bakery algorithm. Proceedings of the 18th International
Symposium on Distributed Computing (DISC’04), Amsterdam, 2004. LNCS, vol. 3274
(Springer, Heidelberg, 2004), pp. 56–70
262. G. Taubenfeld, Synchronization Algorithms and Concurrent Programming (Pearson
Education/Prentice Hall, Upper Saddle River, 2006), 423 pp. ISBN 0-131-97259-6
263. G. Taubenfeld, Contention-sensitive data structure and algorithms. Proceedings of the 23rd
International Symposium on Distributed Computing (DISC’09), Elche, 2009. LNCS, vol.
5805 (Springer, Heidelberg, 2009), pp. 157–171
264. G. Taubenfeld, The computational structure of progress conditions. Proceedings of the 24th
International Symposium on Distributed Computing (DISC’10), Cambridge, 2010. LNCS,
vol. 6343 (Springer, Heidelberg, 2010), pp. 221–235
265. J. Tromp, How to construct an atomic variable. Proceedings of the 3rd International
Workshop on Distributed Algorithms (WDAG’89), Nice, 1989. LNCS, vol. 392 (Springer,
Heidelberg, 1989), pp. 292–302
266. Ph. Tsigas, Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for
shared memory multiprocessor systems. Proceedings of the 13th ACM Symposium on
Parallelism in Algorithms and Architectures (SPAA’01), Heraklion, 2001 (ACM Press, New
York, 2001), pp. 134–143
267. J.D. Valois, Implementing lock-free queues. Proceedings of the 7th International
Conference on Parallel and Distributed Computing Systems (PDCS’94), Las Vegas, 1994
(IEEE Press, New York, 1994), pp. 64–69
268. K. Vidyasankar, Converting Lamport's regular register to atomic register. Inf. Process.
Lett. 28(6), 287–290 (1988)
269. K. Vidyasankar, An elegant 1-writer multireader multivalued atomic register. Inf. Process.
Lett. 30(5), 221–223 (1989)
270. K. Vidyasankar, Concurrent reading while writing revisited. Distrib. Comput. 4, 81–85 (1990)
271. K. Vidyasankar, A very simple construction of 1-writer multireader multivalued atomic
variable. Inf. Process. Lett. 37, 323–326 (1991)
272. P. Vitányi, B. Awerbuch, Atomic shared register access by asynchronous hardware.
Proceedings of the 27th IEEE Symposium on Foundations of Computer Science (FOCS'86),
Toronto, 1986 (IEEE Press, New York, 1986), pp. 223–243 (errata, ibid., 1987)
273. J.-T. Wamhoff, Ch. Fetzer, The universal transactional memory construction. Technical
Report, 12 pp, University of Dresden (Germany), 2010
274. W. Wu, J. Cao, J. Yang, M. Raynal, Using asynchrony and zero degradation to speed up
indulgent consensus protocols. J. Parallel Distrib. Comput. 68(7), 984–996 (2008)
275. J. Yang, G. Neiger, E. Gafni, Structured derivations of consensus algorithms for failure
detectors. Proceedings of the 17th Symposium on Principles of Distributed Computing
(PODC), Puerto Vallarta, 1998 (ACM Press, New York, 1998), pp. 297–308
Index
B
Binary consensus object
  Hybrid dynamic implementation, 172
Binary to multi-valued consensus
  construction based on a bit representation, 392
  construction for unbounded proposed values, 394
Boosting progress condition, 155
  Guerraoui–Kapalka–Kuznetsov's boosting from obstruction-freedom to non-blocking, 158
  Guerraoui–Kapalka–Kuznetsov's boosting from obstruction-freedom to wait-freedom, 159

C
Clock
  logical scalar clock, 284
  notion of a birthdate, 285
  vector clock, 295
Commit versus abort, 281
Compare&swap
  a de-construction, 450
  ABA problem, 145
  compare&swap-based mutual exclusion, 40
  consensus number, 439
  definition, 40
  double compare&swap, 164
  Michael–Scott non-blocking queue construction, 146
  Shafiei's non-blocking stack construction, 150
Concurrency-abortable non-blocking stack, 181
Concurrency-abortable object, 181
Concurrency-abortable operation, 30
Concurrent object
  composability, 125
  definition, 61
  fundamental issues, 113
  hybrid implementation, 165
  operations, 115
  partial order on operations, 118
Concurrent set object
  Heller et al. hybrid static implementation, 167
Configuration (global state), 423
Consensus based on Ω
  and alpha1 (adopt-commit), 457
  and alpha2, 459
  and alpha3 (store-collect), 460
Consensus hierarchy
  definition, 443
  robustness, 445
Consensus number
  augmented queue, 442
  compare&swap, 439
  consensus hierarchy, 443
  definition, 421
  fetch&add object, 432
  mem-to-mem-swap, 440
  queue object, 431
  read/write register, 425
  renaming object, 429
  snapshot object, 429
  sticky bit, 443
  swap object, 432
  Test&set object, 429
Consensus object
  binary, 172, 391
  from binary to multi-valued, 391
  multi-valued, 372
  self-implementation despite responsive crash failures, 409
  self-implementation despite responsive omission failures, 409
Consistency condition
  atomicity, 113
  opacity (STM), 282
  sequential consistency, 128
  serializability, 130
  virtual world consistency (STM), 293
Contention manager
  definition, 157
  design principles, 161
  is not mutual exclusion, 161
Contention-sensitive implementation, see Hybrid implementation
Critical section, 9

D
Declarative synchronization, see Path expressions
Double-ended queue
  hybrid dynamic implementation, 176

E
Event, 115
Execute in isolation, 137
Execution (run)
  informal view, 115
Execution history
  complete histories, 121

F
Fail-silent object, see Object non-responsive failures
Fail-stop object, see Object responsive failures
Failure detector, 155
  definition, 155
  the class ◇P (eventually perfect), 156
  the class Ω (eventual leadership), 452
  the class Ω_X (eventually restricted leadership), 155
Fast mutex
  fast path versus slow path, 34
  Lamport's fast algorithm, 33
Fast store-collect object
  definition, 211
  Englert–Gafni adaptive wait-free construction, 212
Fault masking, 444
Fault-tolerance versus graceful degradation, 417
Fault-tolerant self-implementation, 411
Fetch&add
  Afek–Gafni–Morrison wait-free stack, 152
  consensus number, 432
  definition, 44
  fetch&add-based mutual exclusion, 44

G
Graceful degradation, 413

H
Helping mechanism
  from unbounded SWSR to SWMR, 324
  in a universal construction, 377
  multi-writer snapshot object, 233
  single writer snapshot object, 223
  universal construction, 387
  wait-free weak counter, 192
Hybrid dynamic implementation
  Taubenfeld's binary consensus construction, 172

I
Immediate snapshot object
  Borowsky–Gafni's wait-free construction, 240
  definition, 238
  Gafni–Rajsbaum's recursive wait-free construction, 244
  set linearizability, 240
  versus snapshot object, 238
Imperative synchronization, see Monitor, Semaphore
Indulgent algorithm, 464
Infinitely many processes, 191
Invariant, 7

J
JVSTM, 289

K
k-Set agreement, 273
k-Test&set, 273

L
Linearizability, see Atomicity
  linearization point, 122
Livelock, 19
LL/SC
  definition of the primitives, 177
  implementing a non-blocking queue, 163
Local property, 125
Lock object
  a panacea?, 135
  definition, 11
  do not compose for free, 278
  with respect to mutex, 11
Lock-based implementation
  a simple example, 62
  versus mutex-free implementation, 166
Lock-freedom, 133
Long-lived renaming
  definition, 252
  perfect renaming from test&set, 270

M
Mem-to-mem-swap
  consensus number, 440
  definition, 440
Modular construction: Peterson's algorithm, 17
Monitor
  conditions (queues), 82
  definition, 82
  implementation from semaphores, 87
  transfer of predicates, 85
Multi-writer snapshot object, 230
  Imbs–Raynal wait-free construction, 231
  strong freshness property, 231
Mutex, see Mutual exclusion
Mutex for n processes
  Fischer's synchronous algorithm, 37–38
  Lamport's fast algorithm, 33–36
  Peterson's algorithm, 22–26
  Tournament-based algorithm, 26–29
Mutex for n processes without atomicity
  Aravind's algorithm, 53–58
  Lamport's bakery algorithm, 48–53
Mutex for two processes
  Peterson's algorithm, 17–22
Mutex-freedom
  definition, 137
Mutual exclusion
  k-Mutual exclusion, 26
  abstraction level, 9
  bounded bypass property, 11
  deadlock-freedom, 10
  definition, 9
  entry algorithm, 9
  exit algorithm, 9
  liveness property, 10
  property-based definition, 10
  safety property, 10
  starvation-freedom property, 10

N
New/old inversion, 308
Non-blocking, 138
  from obstruction-freedom, 158
  is wait-freedom for one-shot objects, 140
  Michael–Scott queue construction from compare&swap, 146
  practical interest, 140
  Shafiei's stack construction from compare&swap, 150
  to starvation-freedom, 183
Non-blocking to starvation-freedom
  Taubenfeld's construction, 183
Non-determinism, 6
  concurrency, 17

O
Object
  atomic objects compose for free, 125
  concurrency-abortable, 181
  deterministic/non-deterministic operation, 116
  sequential specification, 116
  total versus partial operations, 116
Object crash failures
  non-responsive failures, 400
  responsive failures, 400
Object failure modes, 410
  t-Tolerance with respect to a failure mode, 411
  arbitrary failure, 411
  crash failure, 400, 410
  failure mode hierarchy, 411
  graceful degradation, 413
  omission failure, 413
Object operation, 115
Object specification for a universal construction, 373
Obstruction-freedom, 137
  boosting snapshot to wait-freedom, 223
  boosting to non-blocking, 158
  boosting to wait-freedom, 159
  single-writer snapshot object, 221
  timestamp generator, 143
Omega
  a liveness abstraction, 452
  an additional timing assumption EWB to implement Ω, 478
  an implementation, 471
  an implementation based on EWB, 479
  as an (eventual) resource allocator or a scheduler, 453
Operation
  complete versus pending, 118
Operation level versus implementation level, 136

P
Path expressions
  an implementation from semaphores, 98–101
  definition, 95
  solving synchronization problems, 97–98
Pre/post assertion, 116
Problem