Pprog Skript PDF
This document aims to be a complete replacement for the usual ways of revising the parallel programming course: watching the lecture recordings, studying the slides, reading additional literature and doing research on the internet. By the very nature of the examination process, it is still highly recommended to solve old exams and exercises for ideal preparation. The author wishes you the best of luck with your studies.
In this document, references are color-coded in this shade of violet. In most PDF viewers, clicking any such reference jumps to the corresponding page in the document. This holds for the table of contents as well as for in-text references to chapters, specific pages, figures and the like.
In a lot of code snippets, a while(condition); is used for spinning (busy waiting). While not necessarily apparent at first glance, this is shorthand for
while(condition){
// Not doing anything but waiting
}
In this document, care has been taken to ensure proper indentation of code, so that it is obvious when or whether a piece of code belongs inside a while() loop.
While it was considered, the author has ultimately decided against including a section on the
assignments given during the semester. They serve mostly as a hands-on experience for the
various topics discussed during the lecture, and yield little to no additional information. Includ-
ing them in this document would serve no purpose, as no knowledge can be gained from them
without solving them yourself. That is not to say that the exercises can be skipped - they are
still excellent preparation for the exam.
In appendix A one can find many slides that have been deemed important or good visualizations of a problem, yet did not fit into the main document because of space constraints or their exemplary nature. The main text of this document refers to this appendix where appropriate.
In appendix B one can find full code that is not found on the slides. Most of it is extensive code, or a “full” program contrasting with the many smaller pieces of code that are discussed in the main text of this document.
This document is not meant to be shared. Please contact the author if you wish to make
this document available to others.
Contents
1 Introduction  6
  1.1 Course overview  6
  1.2 Three stories  7
    1.2.1 Mutual exclusion, or, The shared backyard  7
    1.2.2 Producer-Consumer, or, The very hungry cat  7
    1.2.3 Readers-Writers, or, A therapy for communication issues  7
    1.2.4 The moral of the story  8
  1.3 Some parallel programming guidelines  8
3 Hardware Parallelism  15
  3.1 Basic principles of today's computers  15
  3.2 Hardware speed-up possibilities  16
    3.2.1 Vectorization  16
    3.2.2 Instruction Level Parallelism (ILP)  16
    3.2.3 Pipelining  16
6 Cilk-Style bounds  27
15 Locking tricks  62
  15.1 Reader / Writer Locks  62
  15.2 Coarse-grained locking  62
  15.3 Fine-grained locking  63
  15.4 Optimistic synchronization  64
  15.5 Lazy synchronization  65
  15.6 Lazy Skip Lists  65
16 Lock-free synchronization  67
  16.1 Recap: Definitions with locks  67
  16.2 Definitions for Lock-free Synchronization  68
  16.3 Lock-free Stack  69
  16.4 Lock Free List Set  69
20 Consensus  84
21 Transactional Memory  86
  21.1 TM semantics  86
  21.2 Implementing transactional memory  87
  21.3 Design choices  88
  21.4 Scala-STM  88
  21.5 Simplest STM implementation  89
  21.6 Dining Philosophers with STM  90
  21.7 Final remarks  91
A Slides  104
B Code-snippets  125
  B.1 Skip list  125
    B.1.1 Constructor, fields and node class  125
    B.1.2 find() method  126
    B.1.3 add() method  126
    B.1.4 remove() method  127
    B.1.5 contains() method  128
  B.2 Concurrent prime sieve in Go  128
  B.3 Calculating Pi in MPI  128
Part I
Chapter 1
Introduction
Parallel Programming has become a necessity in modern-day programming, as CPUs are limited
by heat or power consumption. While there are many intuitive reasons to deal with a problem
with “multiple problem-solvers”, parallel programming also brings its fair share of challenges
and (unintuitive) problems.
Learning Objectives
Both writing parallel programs and understanding the underlying fundamental concepts are thus important parts of this lecture.
In order to achieve the stated objectives, this course is split into different parts, in order of
appearance:
1. (Parallel) Programming: Programming and Parallelism in Java (Threads)
2. Parallelism: Understanding and detecting, intro to PC architectures, formalizing and
programming models
3. Concurrency: Shared data, locks, race conditions, lock-free programming, communica-
tion
4. Parallel Algorithms: Useful & common algorithms in parallel, data structures for par-
allelism, sorting & searching
Note: The section about JVMs is either explicitly non-examinable (JVM) or should already be
familiar (Java). Thus, only the recommended guidelines for this course are mentioned here.
• Keep variables as local as possible
• Avoid aliasing (i.e., do not reference the same object using multiple variables)
• If possible, avoid mutable state, especially when aliased - immutable objects can be shared
without much hassle, but concurrent modifications cause a lot of headaches
• Do not expect thrown exceptions to be very informative - the cause of the error may be
far earlier in the execution than where the exception triggers
Chapter 2
Threads and Synchronization
The concurrent execution of multiple tasks can be simulated even on a single core: By using a
technique called time multiplexing, the impression of parallelism is created. In truth, this is
just the core switching rapidly between different tasks.
This principle allows for asynchronous I/O: If a process has to wait, for example due to reading
data from memory, other processes may be able to use the computing power that is currently
not needed for the waiting process.
Each process has a context - things like the instruction counter, resource handles etc. - and a state, all of which are captured in a Process Control Block (PCB). The most important states are waiting, running and blocked:
• A waiting process is ready to execute - it only needs to be allocated CPU time
• A running process is currently running - its instructions are being executed by a CPU
• A blocked process needs some external change in state to proceed. Most of the time, this
is I/O.
The OS is responsible for assigning resources like memory or computing time to processes. It
would be massively overkill to dive into the details of implementations in this lecture, but the
takeaway should be that process level parallelism can be complex and expensive.
2.1.1 Multithreading
The concept of a thread appears on different levels: It can be on a hardware level, the OS level
or even inside (for example) a JVM.
Multiple threads share the same address space - contrary to processes, they can thus share
resources much more easily, and switching between different threads is very efficient, since no
change of address space or loading of process states are necessary. However, this also makes
them more vulnerable for programming mistakes.
Within Java, there are some different, easy ways of using threads. Java supplies a special Thread
class as part of the core language, which enables the programmer to easily create and start new
threads.
How does one go about using Java threads? The first option is to extend the java.lang.Thread class. The custom class needs to override the run() method - however, in order to start the thread, the method start() needs to be invoked on the respective Thread object.
A better way is to implement the java.lang.Runnable interface, for example:
public class ConcurrWriter implements Runnable {
    public int data;
    public void run() { data = 42; }   // placeholder work done by this thread
}
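To run it, the Runnable is wrapped in a Thread object (a minimal usage sketch):

ConcurrWriter task = new ConcurrWriter();
Thread t = new Thread(task);   // wrap the Runnable in a Thread
t.start();                     // start() spawns a new thread, which calls task.run()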
While we can force the creation of new threads, every Java program has at least one execution
thread - the first one calls main(). Interestingly, threads can continue to run even if main
returns.
Creating a Thread object does not start a thread (neither does calling run(); the thread has to be started using start()) - conversely, if a thread has finished executing, the respective Thread object still exists.
How do we “wait” for our threads to return results? Obviously, we may want to wait with continuing the execution of main(), for example if we need data that the threads calculate.
Now, the problem could be solved by busy waiting - essentially looping until each thread has
the state terminated. This is inefficient, and instead our main() thread should wait for the
threads to “wake it up”. This can be achieved using the Thread.join() method.
Exceptions in threads do not terminate the entire program - even the behaviour of joining via thread.join() is unaffected. It is thus paramount to either watch out for exceptions or to install a custom UncaughtExceptionHandler.
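A small sketch of joining a worker thread and installing such a handler (the thread body is illustrative):

Thread worker = new Thread(() -> { throw new RuntimeException("boom"); });
worker.setUncaughtExceptionHandler(
        (t, e) -> System.err.println(t.getName() + " died: " + e));
worker.start();
try {
    worker.join();   // returns normally even though the worker threw an exception
} catch (InterruptedException e) {
    Thread.currentThread().interrupt();
}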
A few additional, useful methods that can be used:
Thread t = Thread.currentThread(); // get the current thread
t.setName("PP" + 2019); // the name, not the ID, can be modified like this
The Battle of the Threads highlights an important issue: Bad interleavings. Since the
exact order of execution between two threads is unknown to us, bad things might happen if
both access a shared resource, for example a simple counter.
If we create two different threads, one adding and one subtracting thread, that we both execute to
do a certain, equal amount of up- or downticks respectively, we would assume that at the end, the
value would be 0. That is not necessarily the case. The reason for that is that incrementing and
decrementing are not atomic operations: In bytecode, it is visible that it consists of loading the
value, incrementing, and storing the value. In parallel execution, the variable could be updated
after loading, but before storing - which completely negates the operation that happened between
loading and storing. How may we solve this? Enter the synchronized keyword:
public synchronized void inc(long delta) {
this.value += delta;
}
A method that is synchronized can only ever be used by one thread at a time. A synchronized
method is thus a critical section with guaranteed mutual exclusion.
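As a self-contained illustration of both the problem and the fix (class name and iteration counts are made up for this sketch):

class Counter {
    private long value = 0;
    public synchronized void inc() { value++; }   // mutual exclusion per Counter instance
    public synchronized void dec() { value--; }
    public synchronized long get() { return value; }
}

public class UpDownDemo {
    public static void main(String[] args) throws InterruptedException {
        Counter c = new Counter();
        Thread up   = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) c.inc(); });
        Thread down = new Thread(() -> { for (int i = 0; i < 1_000_000; i++) c.dec(); });
        up.start(); down.start();
        up.join(); down.join();
        System.out.println(c.get());   // always 0; without synchronized, usually some other value
    }
}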
Technically, a synchronized method is just syntactic sugar:
public synchronized void inc(long delta) {
    this.value += delta;
}

is equivalent to

public void inc(long delta) {
    synchronized (this) {
        this.value += delta;
    }
}
The argument passed to synchronized is the object that will be used as a lock. Every object in
Java has a lock, called intrinsic or monitor lock, which can act as a lock for concurrency purposes,
in addition to the more sophisticated external locks from java.util.concurrent.locks. The
latter ones are especially useful for things like reader-writer scenarios, but can be complicated
to use.
synchronized can be used for static methods as well, which then synchronizes on the class rather than on an instance. Meaning, instead of the implied synchronized(this), we have an implied synchronized on the Class object (ClassName.class) - locking the class object instead of a single instance.
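For illustration, a small sketch (the class and field names are made up):

class Registry {
    static int count = 0;
    static synchronized void inc() {    // locks the Class object Registry.class
        count++;
    }
    static void incExplicit() {         // the equivalent explicit form
        synchronized (Registry.class) {
            count++;
        }
    }
}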
In Java, locks are recursive (or reentrant) - meaning, if a thread has acquired a certain lock, it
can still request to lock this object - i.e. a thread can call methods that are synchronized with
the same lock that it is already in possession of.
In general, it is advisable to keep critical sections under lock as small as possible for performance
reasons. Not synchronizing an entire method is efficient, but can lead to incorrectness:
• Not using the same lock for two synchronized methods: This effectively invalidates the
usage of synchronicity, since the threads can still acquire their respective lock at the same
time
• Not using synchronized for all methods that access the shared resource: Obviously, this
is a more extreme case of the first possibility: While one thread may always request and
acquire the lock, the other is free to do as it pleases.
Luckily for us, synchronized handles exceptions very well. It releases the lock that was acquired,
and then the exception handler is executed. One thing that needs to be taken care of:
public void foo() {
synchronized (this) {
longComputation(); // say this takes a while
divisionbyZero(); // this throws an exception
someOtherCode(); // something else
}
}
Here, if divisionbyZero() throws, the lock is released as promised, but someOtherCode() is never executed, and any state already modified by longComputation() may be left inconsistent.
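As a concrete setting, consider a producer thread that puts computed primes into a shared buffer and a consumer thread that removes them. A sketch of such a problematic consumer - it holds the buffer's lock while spinning:

//Consumer (broken)
synchronized (buffer) {
    while (buffer.isEmpty());   // busy-wait for the producer to add something
    prime = buffer.remove();
}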
Assuming that our producer also uses synchronized on the buffer object, this seems fine - we
have guaranteed mutual exclusion.
Except that it is not fine, since a Deadlock can occur: If the consumer locks the buffer, it
spins on isEmpty(), while the producer cannot add anything because the lock for buffer never
becomes available. The solution to this problem is using the wait() and notify() methods:
//Consumer
synchronized (buffer) {
while (buffer.isEmpty())
buffer.wait();
prime = buffer.remove();
}
//Producer
synchronized (buffer) {
buffer.add(prime);
buffer.notifyAll();
}
wait() releases the object lock, and the thread enters into a waiting state. notify() and
notifyAll() wake up waiting threads (non deterministically) and thus allow us to ensure correct
producer-consumer behaviour. notify() does however not release the lock, it just informs
waiting threads (which then compete in the usual manner for acquisition of the (hopefully)
soon-to-be released lock).
The while loop is a necessity: If we only used a simple if condition and no synchronized
at all, it could happen that the producer completes successfully just after the condition check,
but before the wait() call. Also, even with synchronized there might be other reasons for
the consumer returning from a wait() (e.g. due to a thread interrupt or different consumers
needing different conditions), so checking the condition again is considered a necessity. It might
be possible to have a correct program without a while loop, but it is highly recommended,
as the Java documentation mentions:
As in the one argument version, interrupts and spurious wakeups are possible, and
this method should always be used in a loop:
synchronized (obj) {
while (<condition does not hold>)
obj.wait();
// Perform action appropriate to condition
}
We briefly touched on the different thread states already (see figure 2.1), let us now insert the
concrete methods that are responsible for state changes. For sake of ease, the full-size-slide for
the model is attached in appendix A on page 106.
• Thread is created when an object derived from the Thread class is created. It is in the
new state.
• Once start is called, the thread becomes eligible for execution by the scheduler, entering
the runnable state.
• If the thread calls the wait() method, or join() to wait for another thread, it becomes
not runnable
• Once it is either notified or the thread it joined on terminates, the thread is once again
runnable
• Exiting the run() method (normally or via exception) results in the thread entering the
terminated state.
• Alternatively, its destroy() method can be called - which results in an abrupt move to the terminated state, possibly leaving other objects locked or causing other undesired side effects.
Chapter 3
Hardware Parallelism
While this section focuses mostly on things that are not directly tied to parallel programming
on the software level, it serves as a general intuition as to why parallel programming has become
more and more important as well as showing some performance implications and some challenges
that transfer to the software level.
While computers have very different shapes and sizes, they are similar from the inside. They are
based on the Von Neumann architecture (or Princeton architecture). For more details
one can refer to the digital design and computer architecture lecture.
A problem that presented itself to hardware architects is the speed difference between mem-
ory and CPUs: While CPUs got (a lot) faster, accessing memory became much slower than
accessing CPU registers. Thus, caches - faster, but much more expensive memory - became
important. Since the size of caches is limited, it is impossible to have all data in them at the
same time. This is where locality plays an important role: Since related storage locations are
often accessed shortly after each other (i.e., accessing array cells one after another), it makes
sense to design hardware with this aspect in mind. As programmers, we can use this locality to
increase performance, as the difference in the time needed to execute two (C++) programs that both process N = 800 000 000 elements but differ only in their memory access pattern shows: the variant with good locality finishes dramatically faster.
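As a stand-in for that comparison, a Java sketch (the lecture used C++; the method names and array shape are made up) - both methods read exactly the same elements, but only the first one accesses memory consecutively:

static long rowMajor(int[][] a) {   // good locality: walks each row front to back
    long sum = 0;
    for (int i = 0; i < a.length; i++)
        for (int j = 0; j < a[i].length; j++)
            sum += a[i][j];
    return sum;
}

static long colMajor(int[][] a) {   // poor locality: jumps from row to row on every access
    long sum = 0;
    for (int j = 0; j < a[0].length; j++)
        for (int i = 0; i < a.length; i++)
            sum += a[i][j];
    return sum;
}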
In addition, the cache distribution and things like the MESI cache coherence protocol are theoretically none of our concern. However, the CPU itself may reorder (i.e., postpone) writes from its own registers to the relevant cache. Therefore, it is paramount to use memory barriers or fences. In our case, we can use the already discussed synchronized. (Technically, volatile can be used for a very similar purpose, but since we often need to lock anyway for different reasons, synchronized is the more comfortable option.)
3.2.1 Vectorization
Vectorization can be classified as single instruction applied to multiple data. Of course,
such actions are inherently parallel - think about adding two vectors componentwise - and are
thus supported by special hardware instructions. As with many things, we cannot really control
vectorization of our code in Java - we just have to trust the compiler and the JVM to do this
for us.
3.2.3 Pipelining
Pipelining, while CPU-internal, is a very universal idea that has made it into the software world.
There are two main concepts: throughput, the number of operations completed per unit of time, and latency, the time it takes for a single operation to complete from start to finish.
Chapter 4
Scalability is a term used for a plethora of things, e.g., how well a system reacts to increased load. In this course, we are interested in:
• Speedup when increasing processor count
• What will happen if # of processors → ∞?
• Ideally, a program scales linearly - we achieve linear speedup
Of course, some mathematical definitions are in order if we want to make statements about our
programs. Those are, luckily, not very complicated.
Parallel Performance
The speedup with p processors is defined as Sp = T1/Tp, where T1 is the execution time with a single processor and Tp the execution time with p processors.
Why do we incur a performance loss, i.e. why is Sp < p? Once again, introducing parallelization induces some overhead (typically associated with synchronization), which reduces performance. Additionally, some programs may simply not contain “enough” parallelism - that is, some parts of the program might be sequential due to their nature. (There are also architectural limitations - e.g., memory contention - which are less of a focus than the program-influenced part.)
Additionally, one should be careful when choosing whether to use efficiency or absolute speedup.
Sometimes, there is a sequential algorithm that doesn’t parallelize well that outperforms the
parallel algorithm with one processing unit. In these cases, it is fairer to use that sequential
algorithm for T1 , since using an unnecessarily poor baseline artificially inflates speedup and
efficiency.
Amdahl’s law provides a (“pessimistic”) bound on the speedup we can achieve. It is based on the separation of T1 into the time spent on parallelizable work and the time spent on non-parallelizable work:
Sp ≤ (Wser + Wpar) / (Wser + Wpar/P)
And a simple corollary, where f denotes the serial fraction of the total work:
Wser = f · T1
Wpar = (1 − f) · T1
⟹  Sp ≤ 1 / (f + (1 − f)/P)
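For instance, if f = 5% of the total work is serial, then even with P → ∞ the speedup is bounded by Sp ≤ 1/f = 20.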
Amdahl’s law is mostly bad news, as it puts a limit on scalability. A key takeaway is that all
non-parallel parts of a program can cause problems, no matter how small. For a
more visual explanation why, consider appendix A page 108.
Gustafson’s law shows an alternative, more optimistic view to Amdahl’s. Gustafson bases his
law on the consideration of constant runtime. Meaning, instead of trying to find out how
much we can speed up a given program, we try to find out how much work can be done in a
given timeframe.
Gustafson’s law
Let f be the sequential part of a program and Twall the total amount of available time.
Then it holds that
Sp = f + p(1 − f )
= p − f (p − 1)
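For instance, with f = 0.1 and p = 8, Gustafson's law gives Sp = 8 − 0.1 · 7 = 7.3, whereas Amdahl's law with the same f and p allows at most Sp ≤ 1/(0.1 + 0.9/8) ≈ 4.7 for a fixed problem size.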
Comparing the two laws can be difficult when only comparing formulae. Therefore, it is highly
recommended to compare the formulaic definitions to figure 4.2.
[Figure 4.2: Amdahl's law vs. Gustafson's law, illustrated for p = 4]
Chapter 5
Fork/Join style programming I
5.1 Introduction
Having clarified exactly how much we can theoretically improve our programs with parallelism,
let us consider an example of a parallel program that actually solves a problem: Summing up
the elements of an array. We base our parallel program on the following, simple sequential code:
public static int sum(int[] input){
int sum = 0;
for(int i=0; i<input.length; i++){
sum += input[i];
}
return sum;
}
The idea of parallelizing is rather simple: We choose an arbitrary amount of threads - let us
consider 4 for this example - and then have each of them run a part of the array - in our case,
every thread gets 1/4 of the array. With our knowledge of Java threads, we might come up with
a program like this:
class SumThread extends java.lang.Thread {
    int lo, hi; int[] arr;          // arguments
    int ans = 0;                    // result
    SumThread(int[] a, int l, int h) {
        arr = a; lo = l; hi = h;    // pass defined sector
    }
    public void run() {             // override, calculate the sum of our sector
        for (int i = lo; i < hi; i++)
            ans += arr[i];
    }
}
The wrapper then creates one SumThread per quarter of the array, starts all of them, and finally joins each thread and adds up the partial results - a sketch of such a wrapper (the signature and thread count follow the description above):

static int sum(int[] arr) {
    int len = arr.length;
    int ans = 0;
    SumThread[] ts = new SumThread[4];
    for (int i = 0; i < 4; i++) {
        ts[i] = new SumThread(arr, i * len / 4, (i + 1) * len / 4);
        ts[i].start();              // fork: compute the four partial sums in parallel
    }
    for (int i = 0; i < 4; i++) {
        ts[i].join();               // wait for each helper thread
        ans += ts[i].ans;           // combine the partial results
    }
    return ans;
}
That code is technically correct and produces the expected results (note that join() may throw an exception, so we need to insert a try-catch block; catching and exiting should be fine for basic parallel code). This style of parallel programming is called fork/join - being named after its most important methods. Luckily for us, fork/join programs do not require much focus on sharing memory among threads. In our example, we used fields which only had one writer (the main thread or a helper thread respectively), but in general one should be careful to avoid data races with shared memory.
There are a few issues remaining with our code. First of all, it is not very parameterized - at
least the number of threads should be able to be changed easily. We also would like to only use
processors that are available to our program, not just how many cores are in our machine. And,
probably most devastating of all, we can have load imbalance depending on the structure
of our tackled problem - maybe a divisor calculation is happening, the duration of which is
obviously vastly different for different inputs - which would result in our program’s speedup
being limited by one slow, overburdened processor. How can we alleviate those problems?
The solution to all of these problems is the perhaps counter-intuitive notion of using far more threads than there are processors available. All of the aforementioned issues are handily solved (although load imbalance, with “unlucky” scheduling, could still be a small problem - the variance in workload should be small anyway if the pieces of work are small), but this will require both a change of algorithm and, due to the immense overhead generated by Java threads, abandoning them.
Our first concern is the changing of our algorithm to accommodate the idea of small pieces of
work. The straightforward way of implementing those changes is, as alluded to by the title of
this section, the divide-and-conquer paradigm. Our (sequential) implementation for the problem
of summing up could look like this:
public static int do_sum_rec(int[] xs, int l, int h) {
    // l and h are the boundaries of our part
    int size = h - l;
    if (size == 1)                  // termination criterion: a single element
        return xs[l];
    int mid = size / 2;             // otherwise, divide and conquer
    return do_sum_rec(xs, l, l + mid) + do_sum_rec(xs, l + mid, h);
}
Before using this code in a threaded version, adjustments need to be made due to the overhead generated by creating all those threads and communicating between them. (This is purely for practical reasons - in theory, the changes make no difference to our speedup, but the real world is a tad bit different.)
• Use a sequential cutoff - we do not need to split down to single elements, and by shortening the height of the tree generated by our algorithm, we significantly cut down on thread creation. Typically, we use a value around 500-1000.
• Do not create two recursive threads, instead create only one and do the other work “your-
self” - this reduces the number of threads created by another factor of two.
Implementing the sequential cutoff is an easy task. When improving the recursive thread cre-
ation, one needs to be careful with ordering this.run() and other.start() - otherwise the
run() method just runs sequentially. Our new and improved program thus looks like this:
public void run(){
int size = h-l;
if (size < SEQ_CUTOFF)
for (int i=l; i<h; i++)
result += xs[i];
else {
int mid = size / 2;
SumThread t1 = new SumThread(xs, l, l + mid);
SumThread t2 = new SumThread(xs, l + mid, h);
t1.start();
t2.run();
t1.join();
result=t1.result+t2.result;
}
}
Luckily, in this case we are dealing with a very regular workload. If we were dynamically
allocating workloads, think doing a breadth-first-search in a graph, the workload might be highly
irregular. Making sure that the work is split fairly between threads is difficult, and without
prior knowledge maybe even impossible. The next model we discuss can deal with this issue a
lot easier.
Until now, we have always used one thread per task. This is not ideal, since Java threads are
very heavyweight and (in most real-world implementations) mapped to OS threads. Using one
thread per small task is horribly inefficient. Instead, we approach from a new angle - scheduling
tasks on threads.
[Figure 5.1: tasks are submitted to the ExecutorService interface, which assigns them to its worker threads]
As shown in figure 5.1, we now focus on generating and submitting tasks, while leaving the
allocation of threads to tasks to an interface.
To use the executor service, we have to submit tasks (objects of a subclass of Runnable or
Callable<T>), hand them over to a previously created ExecutorService and get our results.
The following simple program shows how it can be done, starting with our task template:
static class HelloTask implements Runnable {
    String msg;
    HelloTask(String m) { msg = m; }            // (sketch) store the message to print
    public void run() {
        System.out.println(msg + " from " + Thread.currentThread().getName());
    }
}

The tasks are then handed to a previously created ExecutorService, e.g. ExecutorService exs = Executors.newFixedThreadPool(n), and submitted with exs.submit(new HelloTask("Hello " + i)). Finally, the service is shut down:

exs.shutdown(); // initiate shutdown, does not wait, but can’t submit more tasks
The executor service is not meant for parallel tasks that have to wait on each other, as it has
a fixed amount of threads - which will quickly run out. We could conceivably decouple work
partitioning from solving the problem, or use a framework which we will discuss in chapter 7.
Chapter 6
Cilk-Style bounds
In this chapter, we return to a more theoretical standpoint, which allows us to give some more guarantees on the performance of task parallel programming - also called Cilk-style. (Cilk++ and Cilk Plus are general-purpose programming languages designed for multithreaded parallel computing; they are based on the C and C++ programming languages, which they extend with constructs to express parallel loops and the fork-join idiom. - Wikipedia)
For visualizing via task graph, we need to define tasks. This is done relatively simply:
Tasks
• execute code
• spawn other tasks
• wait for results from other tasks
Now we create a graph based on spawning tasks: Every task is represented by a node. An edge from node A to node B means that task B was created by task A.
Let us familiarize ourselves with one example. Consider the following code for a simple program
that calculates the fibonacci numbers:
public class sequentialFibonacci {
    public static long fib(int n) {
        if (n < 2)
            return n;
        long x1 = fib(n-1);
        long x2 = fib(n-2);
        return x1 + x2;
    }
}

public class parallelFibonacci {
    public static long fib(int n) {
        if (n < 2)
            return n;
        spawn task for fib(n-1);
        spawn task for fib(n-2);
        wait for tasks to complete
        return addition of task results
    }
}
And now consider the task graph of it. Here, the meaning of node and edges is shown as
spawning/joining a task. The exact meaning is not as important, as we’ll analyse the graph just
with the amount of nodes given to us.
Confusingly enough, in some interpretations there are different meanings attached to nodes and edges. According to professor Vechev, the meaning of nodes and edges will always be explained for a given problem. This ambiguity is inherent even to the Cilk literature, and the author sees no way of resolving it; it is thus best to familiarize oneself with multiple examples of such graphs, for example from old exams. The author apologizes for the lackluster explanation.
But how does this help us get guarantees for performance? We define the following terms:
Task parallelism: performance bounds
TP ≥ T1 /P
TP ≥ T∞
Let’s examine figure 6.1: T1 is the sum of all node costs; in this case, that is 17. T∞ is 8 (the path from f(4) down to f(0) and then to the sink in f(4), assuming every node has cost 1). Thus, the fraction that defines parallelism is T1/T∞ ≈ 2.1. This is a hard upper limit for the speedup we can achieve - this follows directly from the definitions (think about the mathematical implications of ≤ and ≥ in fractions).
(The ambiguous definition of what a node or an edge means leads to some difficulties when deciding what needs to be counted, especially when there is no explicit, differing cost per node. In past exams, it has always been the nodes that carry a cost, and thus the longest path was always the maximum sum of node costs along a path. The author apologizes again for this unrectifiable inconvenience.)
Why do we not calculate TP? As mentioned above, TP depends on the scheduler. How so? Depending on the order in which the tasks get executed and on which dependencies exist between them, more or less may be executed in parallel. Observe the following figure:
[Figure: a small task graph with a cost attached to each node. Question posed on the slide: What is T2 for this graph? That is, we have 2 processors.]
Nowadays, a standard method is the so-called work stealing scheduler. This is due to the
following guarantee this scheduler can give:
TP = T1 /P + O(T∞ )
The proof would go beyond this lecture, but empirically we also get that TP ≈ T1 /P + T∞ .
Chapter 7
Fork/Join style programming II
We had issues with the executor service when trying to execute divide and conquer algorithms
due to the allocation of threads to tasks. In this chapter, we will see a framework that supports
divide and conquer style parallelism - Java’s ForkJoin Framework.
The ForkJoin Framework is designed to meet the needs of divide-and-conquer fork-join paral-
lelism - that is, when a task is waiting, it is suspended and other tasks can run. There are
similar libraries available for other languages, most notably Cilk for C/C++.
The usage does not differ much from what we have previously seen, although the terms are a
bit different: We have to subclass either RecursiveTask<V> or RecursiveAction, depending
on whether we want to return something or not. We have to override the compute method,
and return a V if we subclass RecursiveTask<V>. Instead of starting and joining threads, we
call fork (or invoke) and join respectively. Similarly to the ExecutorService, we need to create a ForkJoinPool. Let’s use this framework to solve the recursive sum the way we initially wanted to:
class SumForkJoin extends RecursiveTask<Long> {
int low;
int high;
int[] array;
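    // The remaining members are not shown here; a sketch of how they could look
    // (constructor stores the sector, compute() uses a base case of a single element):
    SumForkJoin(int[] arr, int lo, int hi) { array = arr; low = lo; high = hi; }

    protected Long compute() {
        if (high - low == 1)
            return (long) array[low];
        int mid = low + (high - low) / 2;
        SumForkJoin left  = new SumForkJoin(array, low, mid);
        SumForkJoin right = new SumForkJoin(array, mid, high);
        left.fork();                      // schedule the left half asynchronously
        long rightAns = right.compute();  // compute the right half in this task
        long leftAns  = left.join();      // wait for the left half
        return leftAns + rightAns;
    }
}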
Now that we have our individual tasks, we only need some wrapper code:
static ForkJoinPool fjPool = new ForkJoinPool();
//number of threads equal to available processors
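The root task is then handed to the pool, for example like this (a sketch; the wrapper method name is arbitrary):

static long sum(int[] array) {
    return fjPool.invoke(new SumForkJoin(array, 0, array.length)); // submits the task and waits for its result
}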
Had we used submit(), instead of receiving a <V>, in this case a long, we would instead get a
Future<V> object. While invoke() submits and waits for task completion, with the Future we
have to explicitly ask for this behaviour by calling Future.get(). Overall, it makes most sense
for us to simply use invoke().
Once again, we can significantly improve performance by introducing a sequential threshold -
this is, once more, a flaw of the implementation - and thus not a flaw inherent to the model!
Many problems can be tackled in exactly the same way: Finding the maximum or minimum,
counting occurrences of certain objects etc. These problems are “just summing with a different
base case”.
Computations of this form are called reductions (in the context of MPI, explained in more
detail in chapter 23, also just reduce), which produce a single answer from a collection via
an associative operator. To name a few operations that are not reductions/reducible: Median,
subtraction, exponentiation etc. However, results need not be single numbers or strings - they
can also be arrays or objects with multiple fields, for example counting occurrences of multiple
different elements (think of a histogram).
An even simpler form of parallel computations are maps. We had already discussed vectorization
in chapter 3.2.1 - maps are basically exactly this - operating on each element of a collection
independently to create a new collection of the same size. Vectorization is an array-map
supported on hardware level.
Both maps and reductions are the “work horses” of parallel programming - they are the most
important and common patterns in which we can write a parallel algorithm.
One thing to be mindful of is that maps and reduces over suboptimal data struc-
tures, for example a linked list, may not necessarily yield a great effect. Parallelism is still
beneficial for expensive per-element operations - but traversing the list over and over again
takes much longer. Trees should be used instead, where applicable.
Similar to the task graphs in Cilk-style analysis, we can create a DAG (Directed Acyclic Graph), where every fork “ends” a node and creates two outgoing edges (no matter whether we continue with two new threads or with one new and the current one), and every join “ends” a node and creates a node with two incoming edges. For most divide-and-conquer algorithms, the graph will look like this:
[Figure: the fork/join DAG of a divide-and-conquer computation - a divide phase branches out to the base cases, followed by a combine phase that merges the results.]
Luckily, in most Cilk-style literature, we group the forking and working somewhat differently
(see chapter 6) - those groups of nodes are called strands, and are usually the easier (and more
exam-relevant) way to compute T∞ and the like.
The ForkJoin library yields an asymptotically optimal execution, that is, we can expect the
following:
TP = O((T1 /P ) + T∞ )
So far, we have analyzed parallel programs in terms of work and span (i.e., total amount of node
cost and longest path in the DAG). In practice, most programs have parts that parallelize well
(maps/reductions) and parts that do not (reading linked lists, getting input, etc.). Amdahl’s
Law shows us that unparallelized parts become a bottleneck very quickly. Thus, we need to find
new and improved parallel algorithms. For problems that seem sequential, it turns out they are
actually parallelizable, if we introduce a trade-off: A bit more work or memory for a greatly
reduced span. In this section, we focus on one such problem: The prefix-sum problem. Solving
this problem will give us a template, similar to summing an array, that we can use to parallelize
other things - like quicksort.
Prefix-sum problem
Given an array input, produce an array output where output[i] = input[0] + input[1] + ... + input[i].
As with most parallel problems, we should take a look at the sequential solution:
int[] prefix_sum(int[] input){
int[] output = new int[input.length];
output[0] = input[0];
for(int i=1; i < input.length; i++)
output[i] = output[i-1]+input[i];
return output;
}
This does not seem parallelizable at all - and indeed, this particular algorithm is strictly sequential; but a different algorithm improves the span from O(n) to O(log n).
This algorithm calculates the result in two “passes”, each with work O(n) and span O(log n):
First, we build a tree bottom-up, where the leaves map to an element in the array, and every
node contains the sum of all its children (or the respective array element). Then, we pass down
a value we call fromLeft with the following invariant: fromLeft is the sum of all elements left
of the node’s range. In order to achieve this, we assign the root fromLeft=0, and then every
node passes its left child its own fromLeft value and its right child its own fromLeft plus its
left child’s sum from the first pass. Each leaf calculates the output array by adding its own
fromLeft value to the input array.
[Figure: prefix-sum example. The root of the tree covers the range (0,8) with sum 76 and fromLeft 0. input: 6 4 16 10 16 14 2 8; output: 6 10 26 36 52 66 68 76]
As always, we could easily add a sequential cut-off by having the leaves hold the sum of a
range, and then calculating the output by beginning the same way as without cut-off, and then
sequentially prefix-summing all elements in our range, or as a simple code snippet:
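// A sketch of that leaf-level snippet (lo, hi and fromLeft as described in the surrounding text):
output[lo] = fromLeft + input[lo];
for (int i = lo + 1; i < hi; i++)
    output[i] = output[i - 1] + input[i];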
Where lo and hi are the boundaries for the range given to our leaf node.
7.4.1 Pack
In this section, we want to apply what we learned from parallel prefix-sum to a more general
context. We coin this term a Pack1 :
Pack
Given an array input, produce an array output containing only elements such that a
certain f (elmnt) is true.
Example: input [17, 4, 6, 8, 11, 5, 13, 19, 0, 24]
f : is elmnt > 10?
output [17, 11, 13, 19, 24]
O(n) work and O(log n) span
How do we parallelize those problems? Finding the elements for the output is simple - but we
somehow need to make sure that they are put in the right place, without sacrificing parallelism.
The idea is to compute a bit-vector for elements that fulfill f , and to then use prefix-sum
on that very bit-vector. Then, each true element can just check the bitsum array to find its
position.
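For the example from the box above, the intermediate arrays would look like this (a worked sketch):

input:   17  4  6  8 11  5 13 19  0 24
bits:     1  0  0  0  1  0  1  1  0  1    (1 where f(elmnt) holds)
bitsum:   1  1  1  1  2  2  3  4  4  5    (prefix-sum of the bits)

Each element with bit 1 is then written to output[bitsum[i] - 1], yielding output = [17, 11, 13, 19, 24].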
(When two packs are needed - for example for a quicksort partition - it is possible to do both at once with some very fancy parallel prefixing, but that has no effect on the asymptotic complexity.)
Chapter 8
Shared memory concurrency, locks and data races
When talking about parallel algorithms in the context of the ForkJoin framework, we never
talked about locks, synchronization and the like. This is due to the structure of our algorithms
- each thread had memory that “only it accessed”, for example a sub-range of an array. Thus,
we could avoid bad interleavings and the like. However, it is not always this simple, and we
have to make sure that we are aware of how we should manage state.
Managing state(s) is the main challenge for parallel programs, for the reasons mentioned so far.
There are a few approaches on how to go about this:
Approaches to managing state
• immutability
– data does not change at all
– best option
• isolated mutability
– data can change, but only one thread/task can access them
• mutable/shared data
– data can change, all tasks/threads can potentially access them
When dealing with mutable/shared state, one needs to protect the state via exclusive access,
where intermediate inconsistent states should not be observed. This can be achieved via locks
or transactional memory. The former we’ve already covered extensively, while the latter will
be part of the second half of the course (see chapter 21).
For the rest of the course, we will very often consider a canonical example of managing a banking
system and the problems that arise when parallelizing that system. Consider the following
sequential code as a baseline on how the system should work:
class BankAccount {
private int balance = 0;
int getBalance() { return balance; }
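    // Sketch of the remaining methods (setBalance, withdraw) as discussed below:
    void setBalance(int x) { balance = x; }
    void withdraw(int amount) {
        int b = getBalance();
        setBalance(b - amount);   // fine sequentially, but not atomic once several threads call it
    }
    // deposit(int amount) is analogous
}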
Were we to port this program directly to the multithreaded world, one could easily find a bad interleaving (as a recap, we say that calls interleave if a second call starts before the first ends), for example when both threads execute getBalance() before the other has finished, etc. This could also happen with only one processor, since threads may be subject to “strange” scheduling by the OS; however, since we largely use processor and thread interchangeably, this is only of minor concern.
While tempting, it is almost always wrong to fix a bad interleaving by rearranging or repeating operations. This generally only shifts the problem or might even compile into the same version, since the compiler does not know of any need to synchronize. Thus, we have to use mutual exclusion and critical sections. We could, under certain assumptions, implement our own mutual exclusion protocol, but this won't work in real languages anyway. We should instead use Locks, a basic synchronization primitive with operations new, acquire and release. (The implementation of Lock is quite complicated and uses special hardware and OS support; in this course, we take it as a given primitive.)
Recall the required properties of mutual exclusion: at most one process executes the critical section, and the acquire method of the mutex must terminate in finite time when no process is currently in the critical section (the safety and liveness property, respectively). Using locks, we can implement such a mutual exclusion.
When using locks to make our bank account parallel-proof, there are quite a few things that one
needs to be aware of, some of which we’ll discuss in the second half of the lecture (especially
when it comes to transferring money between accounts), but here are a few possible mistakes
one might encounter:
• Using different locks for withdraw and deposit
• Using the same lock for every bank account (very poor performance)
• Forgetting to release a lock before throwing an exception
What about getBalance and setBalance? If they can be called outside of a locked section (i.e.
if they are public), it could lead to a race condition. However, if they acquire the same lock as
withdraw would, it will block forever, since the thread would need to acquire a lock it already
has. One simple approach would be to have a private setBalance that does not lock and a
public setBalance that does lock. However, more intuitive is the use of re-entrant locks -
Java uses this type of lock, and thus allows us to use the locked setBalance without issues.
Re-entrant lock
A re-entrant or recursive lock has to “remember” two things: Which thread (if any)
holds it, and a counter for how many times it has been entered.
On lock, if the lock goes from not-held to held, the count is set to 0. If the current holder
calls acquire, the count is incremented.
On release, if the count is > 0, the count is decremented. If the count is 0, the lock
becomes not-held.
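A sketch of such a counter-based re-entrant lock (for illustration only; it is built here on top of Java's intrinsic monitor, which a real lock primitive would not do):

class ReentrantLockSketch {
    private Thread holder = null;   // which thread (if any) holds the lock
    private int count = 0;          // how many times it has been entered

    public synchronized void acquire() throws InterruptedException {
        Thread me = Thread.currentThread();
        if (holder == me) { count++; return; }   // current holder re-enters: increment the counter
        while (holder != null)
            wait();                              // some other thread holds the lock
        holder = me;                             // the lock goes from not-held to held
        count = 0;                               // counter convention as in the definition above
    }

    public synchronized void release() {
        if (holder != Thread.currentThread())
            throw new IllegalMonitorStateException();
        if (count > 0)
            count--;                             // still held by the current thread
        else {
            holder = null;                       // the lock becomes not-held
            notifyAll();
        }
    }
}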
The Java statement synchronized is a bit more advanced than the primitive re-entrant lock,
for example, it releases the lock even if it leaves the synchronized block due to throw, return
etc. If we want a lock that works more akin to the primitive, we can use
java.util.concurrent.locks.ReentrantLock - but we then need to use a lot of try and
finally blocks to avoid forgetting to release it. In the second half of the course, we will use
more locks from this library.
8.2 Races
There is quite a bit of confusion regarding the term race. It is important to know what exactly is
meant when one talks about a race (condition), and thus this section deals with those definitions.
A race condition occurs when the computation result depends on the scheduling (how threads are interleaved). These bugs exist only due to concurrency - they are not possible with a single thread, though they are possible with only a single processor. One can further define data races and bad interleavings:
The distinction
Data Race [aka Low Level Race Condition, low semantic level]
Erroneous program behavior caused by insufficiently synchronized accesses of a shared
resource by multiple threads, e.g. Simultaneous read/write or write/write of the same
memory location
(for mortals) always an error, due to compiler & HW
Bad Interleaving [aka High Level Race Condition, high semantic level]
Erroneous program behavior caused by an unfavorable execution order of a
multithreaded algorithm that makes use of otherwise well synchronized resources.
“Bad” depends on your specification
Chapter 9
Guidelines and recap
Decades of bugs have led to some conventional wisdom - general techniques that are known to
work. In this chapter, we mention some of those important guidelines and recap the first half
of this lecture.
Memory Location
For every memory location (e.g., object field), we must use one of the following three
possibilities:
1. Thread-local: Use location in one thread only
2. Immutable: Do not write to the memory location
3. Synchronized: Control access via synchronization
After making sure the amount of data that is thread-shared and mutable is minimized, we work
with some guidelines on how to use locks to keep the remaining data consistent:
Guidelines for synchronization
1. No data races - Never allow two threads to read/write or write/write the same
location at the same time
2. For each location needing synchronization, have a lock that is always held
when reading or writing the location
3. Start with coarse-grained (i.e., fewer locks that guard more), simpler locking,
move to fine-grained (i.e. more locks that guard less, better performance) locking
only if contention (threads waiting for locks to be released) becomes an issue
4. Do not do expensive computations or I/O in critical sections (contention), but do
not introduce data races - this is the so-called critical-section granularity
5. Think in terms of what operations need to be atomic - that is, for other threads
the operation can never be seen partly executed - first, locks second. Typically,
operations on abstract data types (ADT) - stack, table etc. - need to be atomic
even to other threads running operations on the same ADT
6. Generally, use provided libraries for concurrent data structures. For this course
and your understanding, try to implement things yourself!
9.1 Recap
Use this section to see at a glance what the lecture covered in the first half:
Java Threads: wait, notify, start, join
synchronized and its usage
Producer/Consumer
Parallelism: Vectorization, ILP
Pipelining: Latency, Throughput
Concepts: T1 , TP , T∞
Amdahl’s and Gustafson’s Law
Cilk-Style bounds, Taskgraphs and finding T1 , T∞ on them
Divide-and-Conquer
ForkJoin
Prefix-sum, packs - reducing T∞ for complicated algorithms, i.e. Quicksort
High-level and low-level data races
Overall, broad view of parallel programming - ready to be expanded in the second
half
Part II
Chapter 10
Memory Models: An introduction
We already had an extensive look at bad interleavings between two threads. In the real world,
we are also presented with the unfortunate reality of memory reordering. What does that entail?
As a rule of thumb: The compiler and hardware are allowed to make changes that do not affect
the semantics of a sequentially executed program - meaning that instructions may be reordered
or optimized away completely if the resulting program still conforms to the same semantics.
Consider the following simple example:
int x;

void wait() {
    x = 1;
    while (x == 1);
}

void arrive() {
    x = 2;
}

Thread A calls wait, then thread B calls arrive. Naively, we would expect the code to run until arrive is called. However, the compiler could optimize the wait method as follows:

void wait() {
    while (true);
}
On the altar of performance, our program’s correctness is seemingly sacrificed. It gets worse:
The same thing happens on hardware as well! The exact behaviour of threads that interact
with shared memory thus depends on hardware, runtime system and of course programming
language (and by extension its compiler).
A memory model provides guarantees for the effects of memory operations - these leave
open optimization possibilities for hardware and compiler, but include guidelines for writing
correct multi-threaded programs. Memory models can be understood as a contract between
programmer, compiler, runtime, and architecture about the semantics of a program. It is thus
paramount to understand those guarantees and guidelines.
How would we fix code that we know could cause problems? We could simply use synchronized.
Additionally, Java has volatile fields, whose access counts as a synchronization operation (more
on that later). Generally, volatile is more for experts, we should rely on standard libraries
instead.
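A sketch of a volatile-based fix for the wait/arrive example above (the waiting method is renamed because wait() is already a final method of Object):

class Flag {
    volatile int x;             // reads and writes of x are now synchronization operations

    void awaitArrival() {       // can no longer be optimized into an infinite loop
        x = 1;
        while (x == 1);         // spin until another thread calls arrive()
    }

    void arrive() {
        x = 2;
    }
}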
How exactly do those language constructs that forbid reordering work? For this, we need to dig deeper into the JMM.
The JMM defines Actions: read(x):1 means “read variable x, read value is 1”
Executions (of a program) combine those actions with ordering, of which there are multiple:
• Program Order
• Synchronizes-with
• Synchronization Order
• Happens-before
Program order is a total order of intra-thread actions - it is not a total order across threads.
It is what we see when we write code and serves as a link between possible executions and the
original program.
The synchronization order is formed by synchronization actions:
Synchronization Actions
The synchronization order is a total order: all threads see the synchronization actions in the same order; within a thread, all synchronization actions appear in program order; and the synchronization order is consistent, i.e. all reads in synchronization order see the last writes in synchronization order.
Synchronizes-with only pairs specific actions which “see” each other - a volatile write to x
synchronizes with subsequent (in synchronization order) read of x.
The combination of program and synchronizes-with order creates a happens-before order - this
allows us to reason about a program and its possible states - see figure 10.1 for a detailed
example.
This covers what we need to know about the JMM for the moment. While an extraordinarily
complex topic that spans both hardware and software, memory models are essential for making
guarantees for parallel programs with shared memory. One should gain an intuition for what
guarantees we are given, and what actions are synchronization actions that enforce an ordering.
Chapter 11
Behind Locks - Implementation of Mutual Exclusion
For this chapter (and most of the rest of this document), we make the following assumptions:
1. atomic reads and writes of variables of primitive type
2. no reordering of read and write sequences (this is not true in practice!)
3. threads entering a critical section will leave it eventually
The implementation of a critical section on a single core system is very simple: Before a critical
section, we disallow the usage of interrupt requests by the operating system - in effect, our
thread can’t be switched while inside the critical section.
Of course, we want to tackle this problem for two processes that run on different cores. For
analysing a simple program, we will draw a state space diagram - A diagram that lists all
possible states and transitions between states. If we can reach a state where both processes are
in their critical section, we have no mutual exclusion. If we can reach a state (that is not the
final state) without a possibility of leaving it, we have a deadlock. If we simply never reach the
final state, we have starvation. Consider the simple example of a state space diagram below in
figure 11.1. More detailed examples can be found in appendix A, on page 109.
[Figure 11.1: a state space diagram. Each state is a tuple of P's location (p1-p4, with p1 being the non-critical section), Q's location (q1-q4) and the two flags. The state p4, q4, true, true is reachable, i.e. both processes can be in their critical section at the same time: no mutual exclusion!]
Dekker’s algorithm (often misspelled as “Decker”) solves the problem of mutual exclusion for two processes. Essentially, we implement both a turn variable and a “show-of-intent” flag per process, which together decide whose “turn” it is. In the following snippet, all boolean variables are initialized as false, and turn = 1:

// Process P
loop
    non-critical section
    wantp = true
    while (wantq) {            // only when q tries to get the lock
        if (turn == 2) {       // and q has precedence
            wantp = false;     // let q go on
            while (turn != 1); // wait
            wantp = true;      // try again
        }
    }
    critical section
    turn = 2
    wantp = false

// Process Q
loop
    non-critical section
    wantq = true
    while (wantp) {
        if (turn == 1) {
            wantq = false;
            while (turn != 2);
            wantq = true;
        }
    }
    critical section
    turn = 1
    wantq = false
We can make this a bit more concise - introducing the Peterson Lock:
Let P=1, Q=2, victim=1, array flag[1,2]=[false,false]

// Process P (1)
loop
    non-critical section
    flag[P] = true;                   // I'm interested
    victim = P;                       // but you go first
    while (flag[Q] && victim == P);   // we are both interested and I'm the victim, so I wait
    critical section
    flag[P] = false;

// Process Q (2)
loop
    non-critical section
    flag[Q] = true;
    victim = Q;
    while (flag[P] && victim == Q);
    critical section
    flag[Q] = false;
How would we prove that the Peterson Lock satisfies mutual exclusion and is starvation free?
For that (and the definition of atomic register) we need to introduce some notation.
Threads produce a sequence of events: P produces events p0, p1, . . ., where p1 = “flag[P]=true” etc.
Since most of our examples consist of loops, we might need to count occurrences. This is done via a superscript for the iteration, i.e. p₅³ refers to the event p₅ - “flag[P]=false” - in the third iteration.
For precedence, we write a → b for a occurs before b. The → relation is a total order for
events. We also define the intuitive intervals of events as we understand it mathematically,
where an interval IA = (a0 , a1 ) precedes an interval IB = (b0 , b1 ) if a1 → b0 . Now, we can
properly define an atomic register. For further examples of events and precedence, see appendix
A page 110.
Atomic Register
For the proof of the correctness of Peterson’s lock, refer to the slides in appendix A on page 111.
The implementation of Peterson’s lock in Java is pretty simple:
class PetersonLock {
    volatile boolean flag[] = new boolean[2];
    // Note: the volatile keyword refers to the reference, not the array contents
    // This example may still work in practice
    // It is recommended to instead use Java's AtomicInteger and AtomicIntegerArray
    volatile int victim;

    public void lock(int id) {                      // id is 0 or 1; completed to mirror the pseudocode above
        flag[id] = true; victim = id;               // I'm interested, but you go first
        while (flag[1 - id] && victim == id) {}     // spin while both are interested and I'm the victim
    }
    public void unlock(int id) { flag[id] = false; }
}
We can actually extend Peterson’s lock to n processes. This is the so-called Filter lock. Every thread knows its level in the filter. In order to enter the critical section, a thread has to make it through all levels. For each level, we use Peterson’s mechanism to filter out at most one thread, i.e. in every level there is one thread that can be “stuck” there, namely the victim of that level.
The algorithm is much easier to understand if we simply rename victim to lastToArrive - and visualize each level as a waiting room. A thread can only progress if there are either no more threads waiting in front of it, or if another thread enters its room - because the initial thread then loses the lastToArrive property. With a similar thought, one can also prove this lock's correctness by showing that only n − 1 threads can be in level 1, n − 2 in level 2 etc.
Expressed in (pseudo)code, the filter lock could look like this:
lock(me) {
for (int i=1; i<n; ++i) {
level[me] = i;
victim[i] = me;
while (exists(k != me): level[k] >= i && victim[i] == me) {};
}
}
unlock(me) {
level[me] = 0;
}
The filter lock is not fair. Usually, we define fairness as "first-come-first-served", where a thread counts as "first come" if it completes its finite doorway section (the bounded number of steps before the waiting loop) before another thread does. This is not guaranteed for a very slow thread in a filter lock, which might have to wait far longer than other threads moving through the filter - if the slow thread always ends up as lastToArrive, it gets only a very small fraction of the throughput a fair lock would guarantee.
Additionally, the filter lock requires 2n fields just for storing levels and victims - which, depending on the number of threads, can be vastly inefficient. The time to move through the lock is always O(n) - even if the thread trying to acquire the lock faces no contention at all!
The word Safe might be misleading - in this context, it simply means that any read that is
not concurrent to a write returns the current value of the register. This allows implementation
of mutual exclusion with non-atomic registers. However, any read concurrent with a write can
return any value of the domain of the register - even values that have never been entered! If it
could only return the previous or the new value, the register would be called regular.
[Slide (example): thread A's r.read() returns 1; thread B performs r.write(4), and a later r.read() returns 4.]
The bakery algorithm, or, for the Swiss probably more aptly nicknamed “post-office algorithm”,
relies on the intuitive system of taking a numbered ticket. We achieve this by using two
arrays with entries for each thread: One for the labels or “ticket-numbers”, and one for the flag
“thread interested in lock”. These entries can be SWMR registers!
Of course, we still need to cover the problem that perhaps multiple threads draw the same
number at the same time. For this, we use a lexicographical comparator:
BakeryLock(int n) {
this.n = n;
flag = new AtomicIntegerArray(n);
label = new AtomicIntegerArray(n);
}
int MaxLabel() {
int max = label.get(0);
for (int i = 1; i<n; ++i)
max = Math.max(max, label.get(i));
return max;
}
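The snippet above only shows the constructor and MaxLabel(); the lock and unlock methods themselves are not reproduced here. A minimal sketch of them, using the lexicographical (label, id) comparison described above and assuming thread ids 0 … n−1, could look like this:

void lock(int me) {
    flag.set(me, 1);                    // announce interest
    label.set(me, MaxLabel() + 1);      // draw a ticket
    for (int k = 0; k < n; ++k) {       // wait for everyone holding a "smaller" ticket
        while (k != me && flag.get(k) == 1 &&
               (label.get(k) < label.get(me) ||
                (label.get(k) == label.get(me) && k < me))) {
            // spin: (label, id) of thread k is lexicographically smaller than ours
        }
    }
}

void unlock(int me) {
    flag.set(me, 0);                    // no longer interested
}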
This lock still suffers from two problems: First, an overflow could occur, especially with many
threads. Second, the memory requirement and runtime for acquiring is still O(n)!
Chapter 12
Beyond Locks I - TAS & CAS
We have used atomic registers (SWMR in the Bakery lock, MWMR in Peterson's), yet we have not found a very efficient algorithm. That is because it is not possible! As a theorem in a paper states:
“If S is a [atomic] read/write system with at least two processes and S solves mutual exclusion
with global progress [deadlock-freedom], then S must have at least as many variables as processes”.
To fix this issue, modern multiprocessor architectures provide special instructions for atomically
reading and writing at once!
There are a ton of different hardware-supported operations, which differ between architectures. For two examples, see appendix A on page 112. In this section, we will focus mainly on the abstracted version of those operations: Test-And-Set (TAS) and Compare-And-Swap (CAS).1 One should stress that these are read-modify-write operations, that is, they occur atomically, and they enable implementations of mutual exclusion with O(1) space.
The semantics of TAS and CAS are easy to understand:
boolean TAS(memref s) {
    if (mem[s] == 0) {
        mem[s] = 1;
        return true;
    } else
        return false;
}

int CAS(memref a, int old, int new) {
    oldval = mem[a];
    if (old == oldval)
        mem[a] = new;
    return oldval;
}
CAS can be seen as an extension of TAS - instead of checking for a constant and setting a
constant, we can instead pass both what to check for and what to set. TAS however can already
be used on its own to determine one thread that can go ahead in a critical section. A spinlock
implemented with those instructions is very easy as well:
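The spinlock itself is not reproduced here; a minimal sketch of such a TAS-based spinlock, using Java's AtomicBoolean.getAndSet as the TAS primitive, could look like this:

import java.util.concurrent.atomic.AtomicBoolean;

class TASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (state.getAndSet(true)) {}  // keep doing the atomic TAS until we win
    }

    public void unlock() {
        state.set(false);                 // release: a spinning thread may now acquire
    }
}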
1 These are not the "standard" operations, due to performance - simple read and write operations are just way faster than our atomic instructions
This lock is still a spinlock - threads keep trying until the lock is acquired. This is a major
performance issue, especially since the atomic operations are relatively slow on their own. We
have a new bottleneck - the variable that all the threads are fighting over. How can we fix
this? Instead of always trying to TAS, we first check by only reading, which is much easier on
performance - we Test-and-Test-and-Set - TATAS:
public class TASLock implements Lock {
AtomicBoolean state = new AtomicBoolean(false);
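The snippet breaks off here. A sketch of what a complete TATAS lock could look like (the class name and the omission of the full java.util.concurrent.locks.Lock interface are simplifications made here):

import java.util.concurrent.atomic.AtomicBoolean;

class TTASLock {
    private final AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        while (true) {
            while (state.get()) {}        // spin reading only - cheap and cache-friendly
            if (!state.getAndSet(true))   // only then attempt the expensive atomic TAS
                return;
        }
    }

    public void unlock() {
        state.set(false);
    }
}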
TATAS as an algorithm works, but the Java implementation does not generalize due to nontrivial
interactions with the JMM and must be used with a lot of care. It is not recommended to use
it in practice.
1 Slightly mightier than TAS, it simply exchanges the passed argument and the memory value
There is still one aspect of performance we could improve: If many threads go to the line after
the state.get() at the same time, we have a lot of contention again. This is easily solved by
implementing a backoff : If a check fails, we let the thread go to sleep with a random duration.
Multiple failed attempts lead to an increase in the expected waiting duration we assign. Let’s
see the implementation:
public void lock() {
Backoff backoff = null;
while (true) {
while (state.get()) {};
// spin reading only (TTAS)
if (!state.getAndSet(true))
// try to acquire, returns previous val
return;
else { // backoff on failure
try {
if (backoff == null)
// allocation only on demand
backoff = new Backoff(MIN_DELAY, MAX_DELAY);
backoff.backoff();
} catch (InterruptedException ex) {}
}
    }
}
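The Backoff helper used above is not shown here. A minimal sketch (exponential backoff with a random sleep, capped at maxDelay, assuming minDelay ≥ 1; MIN_DELAY and MAX_DELAY map to the constructor parameters) could be:

import java.util.Random;

class Backoff {
    private final int maxDelay;
    private int limit;                         // current upper bound for the random delay
    private final Random random = new Random();

    Backoff(int minDelay, int maxDelay) {
        this.limit = minDelay;
        this.maxDelay = maxDelay;
    }

    public void backoff() throws InterruptedException {
        int delay = random.nextInt(limit);     // sleep a random time below the current limit
        limit = Math.min(maxDelay, 2 * limit); // double the limit for the next failure
        Thread.sleep(delay);
    }
}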
Chapter 13
Beyond Locks II - Deadlocks, Semaphores and Barriers
13.1 Deadlocks
The dreaded Deadlock: Two or more processes are mutually blocked because each process waits
for another of these processes to proceed. Consider our canonical banking system example:
class BankAccount {
    ...
    synchronized void withdraw(int amount) { ... }
    synchronized void deposit(int amount) { ... }
    ...
}
If two threads want to transfer money between two bank accounts A and B in opposite directions, they might get stuck in a deadlock: the first thread acquires the lock on A, the second thread acquires the lock on B, and now both are deadlocked - neither thread can obtain the other account's lock.
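A hedged sketch of such a transfer method (this particular method is not shown here) makes the cycle explicit - each thread holds its own account's lock while waiting for the other's:

// naive transfer, belonging to class BankAccount (a sketch)
void transferTo(int amount, BankAccount to) {
    synchronized (this) {        // lock our own account first ...
        synchronized (to) {      // ... then the target - two opposite transfers can deadlock here
            withdraw(amount);
            to.deposit(amount);
        }
    }
}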
To look at deadlocks more formally, we can use a graph: Each thread and each resource (lock)
is a node. An edge from a thread to a resource means that thread attempts to acquire that
resource, an edge from a resource to a thread means that the resource is held by that thread. A
deadlock occurs if the resulting graph contains a cycle (see figure 13.1).
Deadlocks can, in general, not be healed. Releasing the locks generally leads to an inconsistent
state. Therefore, it is paramount to understand Deadlock avoidance. In databases, where
transactions can (generally) easily be aborted, one could implement two-phase locking with retry
- and releasing the locks on fail. However, in this course, we use resource ordering.1
By creating a global ordering of resources, we can avoid cycles. If there is no suitable global
order available, one could just implement a global (atomic) counter that each bank account
gets an ID from and increments the counter on creation.
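A minimal sketch of such a counter, matching the accountNr field used in the code below (the field and class names are assumptions):

import java.util.concurrent.atomic.AtomicLong;

class BankAccount {
    private static final AtomicLong nextAccountNr = new AtomicLong();
    final long accountNr = nextAccountNr.getAndIncrement();  // unique, globally ordered ID
    // ...
}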
1 In general, decreasing the size of critical sections will probably lead to transient inconsistent states, and only using one lock for all accounts is terribly slow - resource ordering is far better
This even works for working with different data types: When transferring from a Hashtable to a
Queue, one could make sure that the Hashtable’s Lock is always acquired first. If a datatype is
acyclic by itself (lists, trees), we can use this in determining a global order. Let’s see the simple
solution for our bank accounts:
class BankAccount {
...
void transferTo(int amount, BankAccount to) {
if (to.accountNr < this.accountNr)
synchronized(this){
synchronized(to) {
withdraw(amount);
to.deposit(amount);
}}
else
synchronized(to){
synchronized(this) {
withdraw(amount);
to.deposit(amount);
}}
}
}
Figure 13.1: A deadlock for threads T₁ … Tₙ occurs when the directed graph describing the relation of T₁ … Tₙ and resources R₁ … Rₘ contains a cycle (in the example, T1 wants R3 while T2 holds R3).
13.2 Semaphores
Semaphores can be seen as an extension of locks. A semaphore is an abstract data type with
one integer value. It supports the following (atomic) operations:
acquire(S)
{
wait until S > 0
dec(S)
}
release(S)
{
inc(S)
}
We can easily build a lock with a semaphore: we set the initial value of S to 1 (see the sketch after this list). For quick reference, remember that the semaphore value signifies the following:
• ≥ 1 → unlocked
• = 0 → locked
• x > 0 → x threads will be let into the protected piece of code
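A minimal sketch of a lock built from java.util.concurrent.Semaphore (the class name is an assumption):

import java.util.concurrent.Semaphore;

class SemaphoreLock {
    private final Semaphore s = new Semaphore(1);   // binary semaphore: initially "unlocked"

    public void lock() throws InterruptedException {
        s.acquire();    // wait until S > 0, then decrement
    }

    public void unlock() {
        s.release();    // increment, letting one waiting thread proceed
    }
}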
We can now tackle some problems that would have been much more difficult with locks - such
as the rendezvous.
We define a rendezvous as a location in code, where two respective threads must wait for the
other to arrive, i.e. to “synchronize” the two of them. For two threads, that problem can be
solved relatively simply by using two semaphores:
             P                      Q
init         P_Arrived = 0          Q_Arrived = 0
pre          ...                    ...
rendezvous   release(P_Arrived)     release(Q_Arrived)
             acquire(Q_Arrived)     acquire(P_Arrived)
post         ...                    ...
Why do we release first instead of acquiring? If we were to put acquire first on both threads,
we’d simply deadlock. If we put only one acquire first, we may encounter both threads having
to wait at one point. Thus, we release first (for a detailed look, see appendix A page 113).
We can further increase performance by implementing semaphores without spinning: For this,
we simply modify the (atomic) acquire method to put us into a queue on fail and block1 us, and
the (atomic as well) release method to get the first thread in the queue and unblock it. In case
that we succeed in entering or release with an empty queue, we simply decrement or increment
the semaphore variable. In fact, we have already seen something very similar in the first half of
the course as wait and notify!
1 This will be explained later, but it is basically going to sleep until woken up
13.3 Barrier
How do we apply the rendezvous scenario to n threads? We need a set of a few building blocks:
1. A counter that increases with every thread that passes to make sure we only allow con-
tinuation of the program as soon as all processes have reached the barrier. We need to
make sure to make this counter mutually exclusive by using a semaphore (volatile is not
enough to make count++ mutex). We initialize the mutex semaphore with state 1 (i.e.,
unlocked):
acquire(mutex)
count++
release(mutex)
2. We need to make sure that everyone gets past the barrier, i.e. we call release often
enough that every thread may pass. We use a semaphore barrier which we initialize with
state 0 (i.e., locked):
if (count == n) release(barrier)
acquire(barrier)
release(barrier)
3. Additionally, our barrier should be reusable. For this, we need to make sure that counter is
decreased and barrier only gets released n times after threads pass the turnstile of acquire
and release. Even then, we need to make sure that it is not possible for a single thread
to pass other processes in a different iteration of the barrier. To make sure all of those
invariants hold, we implement a second semaphore that does things “in reverse” - and have
completed our code:
mutex=1; barrier1=0; barrier2=1; count=0
acquire(mutex)
count++;
if (count==n)
acquire(barrier2); release(barrier1)
release(mutex)
acquire(barrier1); release(barrier1);
// barrier1 = 1 for all processes, barrier2 = 0 for all processes
acquire(mutex)
count--;
if (count==0)
acquire(barrier1); release(barrier2)
release(mutex)
acquire(barrier2); release(barrier2)
// barrier2 = 1 for all processes, barrier1 = 0 for all processes
For a full overview of the improvement process, refer to appendix A page 114.
Chapter 14
Producer-Consumer and Monitors
Queue(int size) {
this.size = size;
in = out = 0;
buffer = new long[size];
}
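Only the constructor is shown above. The class it belongs to is presumably a circular buffer whose operations are simply made mutually exclusive; a sketch of what such a class could look like (the method bodies are assumptions):

class Queue {
    int in, out, size;
    long[] buffer;

    Queue(int size) {
        this.size = size;
        in = out = 0;
        buffer = new long[size];
    }

    synchronized void enqueue(long x) {
        buffer[in] = x;
        in = (in + 1) % size;      // no check whether the queue is already full!
    }

    synchronized long dequeue() {
        long x = buffer[out];
        out = (out + 1) % size;    // no check whether the queue is empty!
        return x;
    }
}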
The problem with this implementation is, that we could try to dequeue from an empty queue
or enqueue into a full queue - we’d need to fix this by writing helper functions that check if the
queue is empty or full. However, what would our functions do while they can’t en- or dequeue?
We can’t let them spin, since they are still holding the lock! Sleeping with a timeout would
work, but what is the proper value for the timeout? Maybe Semaphores can help implement
this more easily?
import java.util.concurrent.Semaphore;
class Queue {
int in, out, size;
long buf[];
Semaphore nonEmpty, nonFull, manipulation;
Queue(int s) {
size = s;
buf = new long[size];
in = out = 0;
nonEmpty = new Semaphore(0); // use the counting feature of semaphores!
nonFull = new Semaphore(size); // use the counting feature of semaphores!
manipulation = new Semaphore(1); // binary semaphore
}
}
With careful ordering of acquiring the semaphores (swapping nonFull and manipulation would
deadlock!), we can now properly implement enqueue and dequeue:
void enqueue(long x) {
    try {
        nonFull.acquire();
        manipulation.acquire();
        buf[in] = x;
        in = next(in);
    }
    catch (InterruptedException ex) {}
    finally {
        manipulation.release();
        nonEmpty.release();
    }
}

long dequeue() {
    long x = 0;
    try {
        nonEmpty.acquire();
        manipulation.acquire();
        x = buf[out];
        out = next(out);
    }
    catch (InterruptedException ex) {}
    finally {
        manipulation.release();
        nonFull.release();
    }
    return x;
}
This is a correct solution, but not the best one. Semaphores are unstructured: we as programmers have to manage the semantics of the semaphores ourselves. Correct use requires a high level of discipline, and it is very easy to introduce deadlocks with semaphores. What we need is a lock that we can temporarily escape from while waiting on a condition.
A Monitor is an abstract data structure equipped with a set of operations that run in mutual
exclusion. Luckily, we already know monitors as the wait/notify-system! Now we can easily
implement the enqueue/dequeue as follows:
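The code itself is not reproduced here; a sketch of the monitor-based queue using Java's intrinsic lock and wait/notifyAll (the helper names and the "sacrifice one slot" full/empty convention are assumptions) might look like this:

class Queue {
    int in = 0, out = 0, size;
    long[] buf;

    Queue(int s) { size = s; buf = new long[size]; }

    private boolean isFull()  { return (in + 1) % size == out; } // one slot stays unused
    private boolean isEmpty() { return in == out; }

    synchronized void enqueue(long x) throws InterruptedException {
        while (isFull())
            wait();                   // releases the monitor while waiting
        buf[in] = x;
        in = (in + 1) % size;
        notifyAll();                  // wake up potential dequeuers
    }

    synchronized long dequeue() throws InterruptedException {
        while (isEmpty())
            wait();
        long x = buf[out];
        out = (out + 1) % size;
        notifyAll();                  // wake up potential enqueuers
        return x;
    }
}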
We can further enhance the usage of monitors by not only using the intrinsic lock (what we lock
onto when using synchronized), but instead using the Lock interface that Java offers. This JIL
(Java Interface Lock) can also provide conditions that can be individually used to wait or signal
on.
Considering that signal might be slow, we can use the sleeping barber variant, a term coined by
Dijkstra - instead of simply calling signal all the time, we maintain a count of waiting producers
and consumers:
class Queue {
int in=0, out=0, size;
long buf[];
final Lock lock = new ReentrantLock();
int n = 0; final Condition notFull = lock.newCondition();
int m; final Condition notEmpty = lock.newCondition();
Queue(int s) {
size = s; m=size-1;
buf = new long[size];
}
void enqueue(long x) {
lock.lock();
m--; if (m<0)
while (isFull())
try { notFull.await(); }
catch(InterruptedException e){}
doEnqueue(x);
n++;
if (n<=0) notEmpty.signal();
lock.unlock();
}
long dequeue() {
long x;
lock.lock();
n--; if (n<0)
while (isEmpty())
try { notEmpty.await(); }
catch(InterruptedException e){}
x = doDequeue();
m++;
if (m<=0) notFull.signal();
lock.unlock();
return x;
} }
Of course, the guidelines from the first half of the course still apply when working with monitors: only call wait/await while holding the corresponding lock, always wait inside a loop that re-checks the condition, and signal whenever a condition that other threads may be waiting on could have become true.
Chapter 15
Locking tricks
15.1 Reader / Writer Locks
We know that concurrent reads of the same memory are not a problem. So far, whenever a concurrent write/write or read/write might occur, we have used synchronization to ensure that only one thread can access the memory at a time. This is too conservative: we should allow multiple readers where appropriate. We introduce a new abstract data type - the reader/writer lock:
Reader/Writer Lock
A lock with two modes: any number of readers may hold the lock at the same time, while a writer holds it exclusively, i.e. a writer excludes all readers and all other writers.
An implementation of a reader/writer lock has to consider how to prioritize readers and writers:
If there are no priorities given, a substantial amount of readers may lock out the writer forever.
Thus, usually, priority is given to writers.
In Java, we use java.util.concurrent.locks.ReentrantReadWriteLock. Using the methods
readLock and writeLock we get objects that themselves have lock and unlock methods. This
implementation does not have writer priority or reader-to-writer upgrading.
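A typical usage sketch (class and field names are assumptions):

import java.util.concurrent.locks.ReentrantReadWriteLock;

class SharedCounter {
    private final ReentrantReadWriteLock rwl = new ReentrantReadWriteLock();
    private int value;

    int read() {
        rwl.readLock().lock();       // many readers may hold the read lock at once
        try { return value; }
        finally { rwl.readLock().unlock(); }
    }

    void increment() {
        rwl.writeLock().lock();      // exclusive against all readers and writers
        try { value++; }
        finally { rwl.writeLock().unlock(); }
    }
}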
15.2 Coarse-grained locking
This "technique" barely deserves its own section; it is the easy (and very likely not the best) solution: one lock for the entire system. This of course fixes parallelism issues, but it does so by essentially eliminating all parallelism and bottlenecking all threads in the critical sections. It is very simple - but that is pretty much all it has going for it.
Fine grained locking, while in general performing better, is often more intricate than visible at
first sight. It requires careful consideration of special cases.
The basic idea of fine grained locking is to split the to-be-protected object into pieces with
separate locks - no mutual exclusion for algorithms on disjoint pieces. In our canonical example,
we only need to lock a bank account when we are actively transferring money to or from it - we
don’t need to lock every account every time a transaction fires. In reality, many objects require
careful thought what one needs to lock, as we’ll see in the following example.
Given a linked list, we want to remove an element. What do we need to lock?
Try 1: Lock the element in front of the one we want to remove. We modify the next-pointer of
our locked element. Problematic: If two threads decide to delete two adjacent elements,
we may not remove the item at all:
[Slide: thread A runs remove(c) and thread B runs remove(b) on the list a → b → c → d; with only the predecessor locked, both removals can succeed concurrently and c is not actually deleted.]
Try 2: The problem with the 1st try was that we also read the next field of the node we want to
delete. A thread thus needs to lock both predecessor and the node to be deleted. We call
this hand-over-hand locking. The real life equivalent would be the safety systems used
in “adventure parks” when climbing: Secured by two snap hooks, you only move one at a
time to always be secured. The remove method works as follows:
public boolean remove(T item) {
Node pred = null, curr = null;
int key = item.hashCode();
head.lock();
try {
pred = head;
curr = pred.next;
curr.lock();
try {
// find and remove
while (curr.key < key) {
pred.unlock();
pred = curr; // pred still locked
curr = curr.next;
curr.lock(); // lock hand over hand
}
if (curr.key == key) {
pred.next = curr.next; // delete
return true;
// remark: We use sentinels at front and end
// so no exceptions will occur
}
return false;
} finally { curr.unlock(); }
} finally { pred.unlock(); }
}
The disadvantage of this method is that we potentially need a very long sequence of acquire/release operations before the deletion can take place. Also, one slow thread locking "early" nodes can block another thread wanting to acquire "late" nodes - even if the two operations would not interfere with each other.
Let us try to improve our locking method. The idea of optimistic synchronization (or
optimistic locking, the terms are used interchangeably) is to find the nodes without locking,
then locking the nodes and checking if everything is okay (i.e., validating before operating).
What do we need to check in order to proceed?
We can reason as follows: If
• nodes b and c are both locked
• node b is still reachable from head
• node c is still successor to b
then neither is in the process of being deleted, nor can an item have been added between the
two nodes. Thus, we can safely remove c.
Consider the good and bad things about this “optimistic list”:
Good:
• No contention on traversals
• Traversals are wait-free
• Overall fewer lock acquisitions

Bad:
• Need to traverse the list twice
• A contains() method needs to acquire locks
• Not starvation-free: one thread may have to redo the validation over and over because other threads keep inserting or removing elements.
We mentioned wait-free above (a stronger guarantee than obstruction-freedom). We define it as follows:
Wait-Free
Every call finishes in a finite number of steps, i.e. never waits for other threads. Wait-
freedom implies lock-freedom!
A skip list is a practical representation for sets that is much easier to implement than a balanced
tree, since the latter requires rebalancing - a global operation - which is very hard to implement
in a (mostly) lock-free way. The skip list runs on the assumption that we have many calls to
find(), fewer to add() and much fewer to remove(). It solves the challenge of sorting and
finding probabilistically. With this, we can achieve an expected runtime of find() in O(log n) -
similar to a tree!
We represent different levels with different lists, emulating a tree. Each node gets a random
“height”, except for two sentinels at the start and end of the set.
[Slide: a skip list with sentinels −∞ and +∞ and elements 2, 4, 5, 7, 8, 9, where each node has a randomly chosen height.]
For searching, we start at the top level at the head of the respective list. We move forward in the
list until we either find the sought element or are “in-between” items that are smaller than and
greater than the sought value. We then move down a level and continue this searching pattern.
For adding and removing, we simply find predecessors, lock those and validate. contains() is
once more wait-free (add and remove are not).
The full code for such a list can be found in appendix B, on page 125 ff. The visual representation
can be found in appendix A on page 115.
Chapter 16
Lock-free synchronization
Throughout this course, we have seen many reasons why synchronizing with locks is not partic-
ularly optimal. Let us recap some points:
• Missing scheduling fairness / missing FIFO-behaviour1
• No notification mechanism
• Computing resources are wasted, thus performance degrades, particularly for long-lived
contention, i.e. long locked sections2
How about locks that support waiting/scheduling? Such locks require support from the runtime system (i.e. OS, scheduler); moreover, the queues behind the implementation of monitors etc. also need protection, typically again with spinlocks.3 These locks also suffer from a higher wakeup latency.
Overall, locks have some disadvantages by design:
• Locks are pessimistic - assume the worst and enforce mutual exclusion
• Every protected part of a program is not parallelizable - remember Amdahl’s law!
• If a thread is delayed (e.g., scheduler) while in a critical section, all threads suffer
• If a thread dies in a critical section, the system is basically dead
Let us also recap some central definitions for blocking synchronization:
• Deadlock: group of two or more competing processes are mutually blocked because each
process waits for another blocked process in the group to proceed
• Livelock: competing processes are able to detect a potential deadlock but make no ob-
servable progress while trying to resolve it4
• Starvation: repeated but unsuccessful attempt of a recently unblocked process to continue
its execution
1 One can solve this with queue locks, which have not been presented in the lecture
2 Note that this does not imply that lock-free algorithms are always faster than locked ones
3 If they are not implemented lock-free, which is the topic of this chapter
4 While taken from the lecture slides, the more intuitive version of this point is: the threads still do something, but make no tangible progress
With Locks/blocking algorithms, a thread can indefinitely delay another thread (i.e. by holding
a lock). In a non-blocking algorithm, failure or suspension of one thread cannot cause failure
or suspension of another thread!
The main tool that we use is CAS (refer to chapter 12). Now we can implement a very simple
non-blocking counter. We use AtomicInteger since we now need to be more careful with data-
types (race conditions make their return if we are not!).
public class CasCounter {
private AtomicInteger value;
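The rest of the class is not reproduced here; a sketch of how the counter's increment could be completed with CAS:

import java.util.concurrent.atomic.AtomicInteger;

public class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    public int increment() {
        int v;
        do {
            v = value.get();                       // read the current value
        } while (!value.compareAndSet(v, v + 1));  // retry if another thread wrote in between
        return v + 1;
    }

    public int get() { return value.get(); }
}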
This counter is now lock-free. No deadlocks may occur and a thread dying does not hinder the
other threads, but a thread can still starve.
A positive result of CAS suggests that no other thread has written in between reading and
modifying our local v. However, especially if we also have a decrement() function, this is only
a suggestion - we’ll discuss this in the form of the ABA-Problem in chapter 17.
Let us now implement a proper data structure in a lock-free way: A lock-free stack.
The advantage of a stack for lock-free synchronization is that we only ever have to take care
of one single thing: The head pointer. Knowing this, we can use an AtomicReference for our
pointer, and then we implement our operations with CAS:
public void push(Long item) {
    Node newi = new Node(item);
    Node head;
    do {
        head = top.get();
        newi.next = head;
    } while (!top.compareAndSet(head, newi));
}

public Long pop() {
    Node head, next;
    do {
        head = top.get();
        if (head == null) return null;
        next = head.next;
    } while (!top.compareAndSet(head, next));
    return head.item;
}
Surprisingly easy. Performance is, however, worse than a locked variant - this is because of how
expensive atomic operations are, and contention can still be a problem. With a simple backoff,
this can be fixed:
[Plot: running time (ms) versus number of threads for the locked/blocking stack, the plain lock-free stack, and the lock-free stack with backoff; the plain lock-free stack is slowest under high contention, while the lock-free stack with backoff performs best.]
Let us return to the example of linked lists. Can CAS help us out?
If the only matter of contention is the next pointer of a single node, CAS does indeed work.
With multiple different pointers, it does not:
[Slide: thread A runs remove(c) and thread B runs remove(b) on a → b → c → d, each using CAS on the respective predecessor's next pointer; both CAS operations succeed, and c is not deleted.]
Maybe the marked bit approach from lazy synchronization could help us out?
The difficulty in this (and many other similar problems!) is that while we do not want to use
locks, we still want to atomically establish consistency of two things - here the mark bit and
the next pointer. The Java solution? We use one bit of the address pointer (the next pointer of
every node) as a mark bit. Since a normal AtomicReference in Java is 64 bits long, the storage
one would need before that bit is actually required for addressing is in the trillions of petabytes. By using one bit as a mark bit, we effectively execute a hacky version of DCAS - a double compare-and-swap. Does this fix all our problems?
Figure 16.5: Thread A removes c by (1) setting the mark on c.next and (2) trying CAS([b.next.reference, b.next.marked], [c, unmarked], [d, unmarked]); thread B concurrently marks b.next for its own remove(b). Because the mark bit of b.next has changed, A's DCAS fails and c remains only logically deleted.
In figure 16.5, it is noted that “c remains marked (frowning emoji)” - meaning, we still have to
physically delete it. Luckily, we can simply have another thread “help us out”: When a thread
traverses the list and finds a logically deleted node, it can CAS the predecessor’s next field and
then proceed.1 This “helping” is a recurring theme in wait-free algorithms, where threads help
each other to make progress.
At the heart of an operating system is a scheduler, which basically moves tasks between queues
(or similar structures) and selects threads to run on a processor core. Data structures of a run-
time or kernel need to be protected against concurrent access on different cores. Conventionally,
spinlocks are used. If we want to do this lock-free, we need a lock-free unbounded queue. Also,
we usually cannot rely on garbage collection, thus we need to reuse elements of the queue.2
First things first: What parts do we have to make sure to protect? In this case, we need to
protect three pointers that might be updated at the same time: head (the next item to be
removed), tail (the newest item) and tail.next (when enqueuing).
Our first idea is to use a Sentinel as the consistent head of the queue (which is especially great
when dealing with an empty queue etc. due to null-pointers) and AtomicReference in every
node for its respective next-pointer. Now we try to use CAS.
As we can see in figure 16.6, this version seems okay for the most part.
1 If other threads would have to wait for one thread to clean up the inconsistency, the approach would of course not be lock-free!
2 Reusing elements will introduce the ABA problem, see chapter 17
Figure 16.6 (lock-free queue with a sentinel node S):
Enqueuer: read tail into last; try to set last.next via CAS(last.next, null, new); if unsuccessful, retry; if successful, try to set tail without retrying: CAS(tail, last, new).
Dequeuer: read head into first and first.next into next; if next is available, read the item value of next; then try to swing head from first to next via CAS(head, first, next); if unsuccessful, retry.
There are still some possible inconsistencies! In the enqueuer protocol, if a thread dies after
successfully performing the first CAS, then tail can never be updated because the first CAS will
fail for every thread thereafter. The solution: Threads helping threads when their check fails:
public void enqueue(T item) {
Node node = new Node(item);
while(true) { // retry
Node last = tail.get();
Node next = last.next.get();
if (next == null) {
if (last.next.compareAndSet(null, node)) {
tail.compareAndSet(last, node);
//everything okay, return
return;
}
}
else // Our tail is outdated, help others progress if necessary!
tail.compareAndSet(last, next);
}
}
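The matching dequeue is not shown here. A sketch following the protocol of figure 16.6 (the field names head, tail and Node.item/next mirror the enqueue snippet and are assumptions) could be:

public T dequeue() {
    while (true) {                                // retry loop
        Node first = head.get();
        Node last  = tail.get();
        Node next  = first.next.get();
        if (first == last) {                      // queue looks empty, or tail is lagging behind
            if (next == null) return null;        // really empty (only the sentinel)
            tail.compareAndSet(last, next);       // help a slow enqueuer before retrying
        } else {
            T value = next.item;                  // read the value before swinging head
            if (head.compareAndSet(first, next))  // the old sentinel is dropped
                return value;
        }
    }
}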
This implementation works. However, we mentioned that we should reuse nodes instead of
relying on garbage collection. Sadly, this introduces one of the most complex pitfalls in parallel
programming: The ABA problem.
Chapter 17
Memory Reuse and the ABA Problem
Let us assume that we want to implement a lock-free stack (page 69), but we do not want to
always create new nodes, and instead maintain a node pool. We can implement this as a second
stack. We switch elements between the different stacks - calling get() on the node-pool stack
creates a new node, while push() gets a node from the node-pool and pop() on the “real” stack
puts the node onto the node-pool. This means that the stack is now in-place (since objects never
change their address). Otherwise, the two stacks are identical to the ones discussed earlier.
For a very large number of threads (≈ 32 or more), we can see that this actually speeds up our program. Problematically, the program does not always work correctly. The reason for this is the ABA problem:
[Slide (ABA problem): Thread X is in the middle of a pop - after reading the head A, but before its CAS. Thread Y pops A, which goes to the node pool; thread Z pushes B; thread Z' pushes A again (reusing it from the pool). Thread X then completes its pop: the CAS succeeds although the stack has changed underneath it.]
A larger figure can be found in appendix A on page 116. Another note: For the above to work,
thread Z has to have gotten B from the node-pool just before Y has returned A to the pool,
since that is a stack as well.
ABA Problem
“The ABA problem ... occurs when one activity fails to recognize that a single memory
location was modified temporarily by another activity and therefore erroneously assumes
that the overall state has not been changed.”
How do we solve this conundrum? DCAS would actually work - however, no hardware supports it. We have used a variant of it for the lock-free list set, but that was more of a "hacky" solution than proper DCAS. DCAS is, at least today, more hypothetical than practical.
Maybe we just rely on garbage collection? This is much too slow to use in the inner loop of a
runtime kernel - and how would we implement a lock-free garbage collector relying on garbage
collection?
Then there are three practical solutions: Pointer Tagging, Hazard Pointers and Transac-
tional Memory. In this chapter, we will discuss the first two. Transactional memory will be
covered in depth in chapter 21.
The ABA problem usually occurs with CAS on pointers. We can reuse the trick from earlier, where we stole bits from the pointer - and indeed we do: we only choose addresses (values of pointers) that are aligned modulo 32. This makes the last 5 bits available for tagging. Every time we store a pointer in a data structure (i.e., on pop() from the node allocator), we increment this 5-bit counter by one. This makes the ABA problem much less probable, because now 32 versions of each pointer exist. It does not solve the ABA problem, but merely delays it: in the rather unlikely case that a pointer gets reused 32 times (wrapping the counter) between the read and the CAS, the CAS would still succeed.
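In Java, a similar version-tagging trick is available out of the box through AtomicStampedReference, which pairs a reference with a 32-bit stamp that is compared and updated together with the reference. A sketch of a node-based stack using it (class, field and method names are assumptions):

import java.util.concurrent.atomic.AtomicStampedReference;

class StampedStack {
    static class Node { Long item; Node next; Node(Long i) { item = i; } }

    // reference plus 32-bit version stamp, CASed as a pair
    private final AtomicStampedReference<Node> top = new AtomicStampedReference<>(null, 0);

    void push(Long item) {
        Node n = new Node(item);
        int[] stamp = new int[1];
        Node head;
        do {
            head = top.get(stamp);        // read reference and stamp together
            n.next = head;
        } while (!top.compareAndSet(head, n, stamp[0], stamp[0] + 1));
    }

    Long pop() {
        int[] stamp = new int[1];
        while (true) {
            Node head = top.get(stamp);
            if (head == null) return null;
            // succeeds only if neither the pointer nor its version changed in between
            if (top.compareAndSet(head, head.next, stamp[0], stamp[0] + 1))
                return head.item;
        }
    }
}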
Hazard pointers are a true solution to the ABA problem. Consider the reason for the existence
of the ABA problem:
The ABA problem stems from reuse of a pointer P that has been read by some thread X but
not yet written with CAS by the same thread. Modification takes place meanwhile by some
other thread Y.
Our idea to solve this, is that we introduce an array with n slots, where n is the number of
threads. Before X now reads P, it marks it as hazardous by entering it into the array. After the
CAS, X removes P from the array. If a process Y tries to reuse P, it first checks all entries of
the hazard array, and, if it finds P in there, it simply requests a new pointer for use. Examine
the changed pop() method:
public int pop(int id) {
    Node head, next = null;
    do {
        do {
            head = top.get();
            setHazarduous(head);                      // announce: I am about to use this node
        } while (head == null || top.get() != head);  // re-check that head is still current
        next = head.next;
    } while (!top.compareAndSet(head, next));
    setHazarduous(null);                              // done - clear our hazard entry
    return head.item;
}
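The bookkeeping behind setHazarduous is not shown. A generic sketch of the hazard array and the check a node pool would perform before reusing a node (class and method names are assumptions; the thread id is passed explicitly here):

import java.util.concurrent.atomic.AtomicReferenceArray;

class HazardArray<T> {
    private final AtomicReferenceArray<T> hazard;

    HazardArray(int nThreads) {
        hazard = new AtomicReferenceArray<>(nThreads);   // one slot per thread
    }

    void setHazarduous(int id, T node) {    // mark node as in use by thread id (null to clear)
        hazard.set(id, node);
    }

    boolean isHazarduous(T node) {          // the node pool calls this before reusing a node
        for (int i = 0; i < hazard.length(); ++i)
            if (hazard.get(i) == node)
                return true;
        return false;
    }
}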
The ABA problem also occurs on the node pool. What do we do? We could make the pools
thread-local. This does not help when push/pop operations aren’t well balanced within the
thread. Alternatively, we could just use Hazard pointers on the global node pool.
The Java code above does not really improve performance compared to plain memory allocation with garbage collection, but it demonstrates how to solve the ABA problem. Note that the ABA problem is not tied to this kind of performance optimization - node reuse is simply the easiest example in which it shows up.
Chapter 18
Concurrency theory I -
Linearizability
For sequential programs, we have learned of the Floyd-Hoare logic to prove correctness. Defining
pre- and postconditions for each method is inherently sequential - can we somehow carry that
forward to a parallel formulation? In this chapter, we define the central aspects of a formal approach to giving certain guarantees about parallel programs. A first definition is the method call:
A method call is the interval that starts with an invocation and ends with a response. A
method call is called pending between invocation and response.
Linearizability is a theoretical concept: Each method should appear to take effect instanta-
neously between invocation and response events. We call this a linearization point (in code,
typically a CAS). An object for which this is true for all possible executions is called lineariz-
able, and the object is correct if the associated sequential behaviour is correct. We can take a
look at a particular execution and the question: Is this execution linearizable?
[Slide: A runs q.enq(x) and then q.deq() returning y; B runs q.enq(y) and then q.deq() returning x. Yes - this execution is linearizable.]
18.1 Histories
History
A history H is a finite sequence of method invocation and response events. The projection H | x is the subsequence of events in H that concern the object x; likewise, H | A is the subsequence of events performed by thread A. A history is sequential if method calls do not overlap.
Now we can decide whether a history H is legal: if for every object x the corresponding projection H | x adheres to the sequential specification (i.e., what we want the object to do) - which we can prove using sequential Hoare logic - then the history is legal.
A method call precedes another call if its response event precedes the other call's invocation event. If neither precedes the other, the method calls overlap. We note that a method execution m0 precedes m1 in history H as
m0 →H m1
→H is a partial order on H. The order is total when H is sequential.
18.2 Linearizability
For locks, the linearization points are just the unlock points. A few more complicated examples
can be found in appendix A on page 119.
As a general guideline: We need to identify one atomic step where the method “happens” - often
the critical section or a machine instruction. Additionally, there may be multiple linearization
points when there are considerations on the state of the object - i.e. is the queue empty or full
etc.
[Slide: completing a history with pending calls - A's pending q.enq(x) cannot be removed, because B's q.deq() returning x has already taken its effect into account; C's pending flag.read() can be removed, since nobody relies on it.]
[Slide: linearizability as a limitation on the possible choice of the sequential order →S: whenever a →G b holds in the global (real-time) order, a must also precede b in →S.]
Chapter 19
Concurrency theory II - Sequential consistency
[Slide: history in which A runs q.enq(x) and then q.deq() returning y, while B runs q.enq(y).]
Sequential consistency is not a local property, and we lose composability: that each object in a history H is sequentially consistent does not imply that H itself is sequentially consistent!
1 The definition is the same as for linearizability without the last condition - sequential consistency is weaker than linearizability
[Slide: history in which A runs q.enq(X) and q.deq() returning Y, while B runs q.enq(Y) and q.deq() returning X.]
[Slide: history in which A runs q.deq() returning X and then q.enq(X), while B observes q.size() returning 1.]
As we can see in figure 19.4, we need sequential consistency. In the real world, hardware
architects do not adhere to this by default, as we’ve seen with reordering operations earlier - the
operations are simply too expensive. We need to explicitly announce that we want this property
(i.e. volatile keyword).
[Slide (Peterson's lock needs sequential consistency): at least one of the processes A and B reads flag[1−id] == true; if both processes read flag == true, then both eventually read the same value for victim.]
Chapter 20
Consensus
Consensus is yet another theoretical object, however one of great importance. Consider a simple
object c which implements the following interface:
public interface Consensus<T> {
T decide (T value);
}
In the consensus problem, each of n threads calls decide(v) exactly once with its own proposed value; all calls must return the same value, that value must have been proposed by some thread, and the protocol must be wait-free. The consensus number of an object class is the largest n for which objects of that class (together with atomic registers) can solve consensus for n threads.
Two small theorems: atomic registers have consensus number 1, while CAS has consensus number ∞. The latter can be shown by construction:
class CASConsensus {
    private final int FIRST = -1;
    private AtomicInteger r = new AtomicInteger(FIRST); // supports CAS
    private AtomicIntegerArray proposed;                // suffices to be an atomic register
    CASConsensus(int n) { proposed = new AtomicIntegerArray(n); }

    // sketch: i is the calling thread's id (0..n-1)
    public int decide(int value, int i) {
        proposed.set(i, value);                 // announce my proposal
        if (r.compareAndSet(FIRST, i))          // exactly one thread wins this CAS
            return proposed.get(i);             // the winner decides its own value
        else
            return proposed.get(r.get());       // everyone else adopts the winner's value
    }
}
Why is consensus this important? It creates the consensus hierarchy, that is, a class-system
of protocols and their respective consensus number:
[Slide (consensus hierarchy):
1   read/write registers
2   getAndSet, getAndIncrement, …; FIFO queue, LIFO stack
⋮
∞   compareAndSet]
This is backed by mathematical proof, and can thus help us decide what algorithms are impos-
sible to implement with certain operations: It is simply impossible to implement a (wait-free)
FIFO-Queue with atomic registers. CompareAndSet also cannot be implemented using atomic
registers. In general:
Higher consensus number operations can implement lower consensus number operations. It
is impossible for lower consensus number operations to implement higher consensus number
operations.
Chapter 21
Transactional Memory
We have seen that programming with locks is difficult, and that lock-free programming is even
more difficult. The goal of transactional memory is to remove the burden of synchronization
away from the programmer and place it in the system (be that hardware or software). Ideally,
the programmer only has to say (in the context of our canonical banking system):
atomic {
a.withdraw(amount);
b.deposit(amount);
}
We have already seen that this is the idea behind locks, and it is also the idea behind transac-
tional memory. The difference is the execution - we have already extensively covered why locks
are (sadly) not this convenient. This is where transactional memory (TM) comes in.
The benefits of TM are manifold:
• simpler, less error-prone code
• higher-level semantics (what vs. how)
• composable (unlike e.g. locks)
• analogy to garbage collection
• optimistic by design (does not require mutual exclusion)
In short: TM is awesome.
21.1 TM semantics
Changes made by a transaction are made visible atomically. Other threads observe either the
initial or final state, but no intermediate states.
Transactions run in isolation: while a transaction is running, effects from other transactions are not observed - as if the transaction took a snapshot of the global state when it began and then operated only on that snapshot.
Transactions appear serialized, i.e. as if they had been executed sequentially.
Transactional memory is heavily inspired by database transactions and their properties, ACID, although the last property is not really important for TM:
• Atomicity
• Consistency
• Isolation
• Durability (of little relevance for TM, which operates on volatile memory)
Of course we need a way to actually implement this TM. We could just use the Big-lock approach
of locking every atomic section with one big lock. That isn’t done in practice for obvious reasons.
The other approach (which we are going to use) is to keep track of operations performed by
each transaction - concurrency control - where the system ensures the atomicity and isolation
properties. What does that mean?
As mentioned before, we create a “snapshot” of the current state and make sure that the
transaction only affects a local copy of this state, which can then be either committed or tossed
away. If a transaction which has yet to commit has read a value (at the start of its operation)
that was changed by a transaction that has committed, a conflict may arise. Consider the
following example, where the initial state is a=0:
// Transaction A
atomic {
    ...
    x = a;                  // read a
    if (x == 0) {
        // do something
    } else {
        // do something else
    }
}

// Transaction B
atomic {
    ...
    a = 10;                 // write a
    ...
}
Now assume that transaction B commits the changes it has made before A does. Now, in a
serialized view, the execution with a==0 is invalid!
[Slide (serialized view, initially a = 0): if TX_B, which writes a = 10, is serialized before TX_A, then TX_A should have read a == 10; executions of TX_A that read a == 0 are invalid.]
Issues like this are handled by a concurrency control mechanism. This means, the transaction
has to be aborted, upon which it can either be retried automatically or the user is notified.
Additional care must be taken that a transaction running on an inconsistent snapshot does not cause things like a division by zero - that would throw an exception which escapes the transaction and is therefore globally visible.
We could implement TM either in hard- or software. HTM is fast, but has bounded resources
that often cannot handle big transactions. STM allows greater flexibility, but achieving good
performance might be very challenging. Ideally, we would wish for a hybrid TM, but due to the
relatively young age of TM there is no such solution widely available (yet).
21.4 Scala-STM
This course uses Scala-STM, where mutable state is put into special variables - everything
else is immutable or not shared. We call this reference-based STM. Scala-STM has a Java
interface (which we will use), which sadly does not have compiler support, e.g. for ensuring that
references are only accessed inside a transaction. Our goal is to get a first idea of how to use
STM.
For that, let us start with our banking system.
class AccountSTM {
private final Integer id; // account id
private final Ref.View<Integer> balance;
As we can see, the Scala-STM Java interface requires us to write quite a bit of boilerplate code - this is a flaw of this specific Java binding; in theory, we could just write atomic blocks directly.
For actually using transactions, we also need to define Runnables and Callables - once again,
this is just annoying boilerplate code. For a full example, see appendix A page 120.
How do we deal with waiting for a certain condition to come true? With locks, we used con-
ditional variables, with TM we use retry: Abort the transaction and retry when conditions
change. Using our bank accounts again, this time with theoretical notation:
static void transfer_retry(final AccountSTM a, final AccountSTM b, final int amount) {
atomic {
if (a.balance.get() < amount)
STM.retry();
a.withdraw(amount);
b.deposit(amount);
}
}
Usually, implementations of retry track what reads/writes a transaction performed, and when
retry is called, a retry will occur when any of the variables that were read, change. In this
example, when a.balance is updated, the transaction will be retried.
In this section we create a very simple, theoretical implementation of STM. Our ingredients
are transactions (with threads to run them) and objects. Transactions can either be active,
aborted or committed. Objects represent state stored in memory (the variables affected by the
transaction), and offer methods like read and write - and can of course be copied.
We wish to create a Clock-based STM-System. This clock is not some real-time clock, but
instead offers an absolute order to transactions and their commits. Why do we need this? Using
a global clock (implemented with locks or similar), we can timestamp transactions’ birth-time
and when exactly a commit has been made.
Each transaction has a local read-set and a local write-set, holding all locally read and written
objects. If a transaction calls read, it checks first if the object is in the write set. If so, it uses
this new version. If it is not in the write set, the transaction checks whether the object’s
timestamp is smaller than its own birth-timestamp, i.e. the last modification happened before
the transaction began. If it is not, it throws an exception, otherwise it adds a new copy to the
read set. Similarly, a call to write simply modifies the write-set by either changing the value
or copying the object into the write-set. In figure 21.2, we can see that transaction T continues
until it reads Z and sees that the modification happened after T’s birth-date.
A commit is a central part of our system. It works as follows:
• All objects of read- and write-set get locked (in specific order to avoid deadlocks)
• Check that all objects in the read set provide a time stamp ≤ birth-date of the transaction,
otherwise abort
• Increment and get the value T of the global clock
• Copy each element of the write set back to global memory with timestamp T
• Release all locks
We can see a commit in figure 21.3. If we were to swap “T writes X” with “T writes Z”, then
the commit would be unsuccessful.
[Figure 21.2: the read set of transaction T over time; T keeps reading objects whose timestamps are older than its birth-date and aborts when it reads Z, which was modified after T's birth-date.]
[Figure 21.3: a successful commit; X and Y are written as local copies in T's write set, and at commit time every object in T's read set still carries a timestamp ≤ T's birth-date.]
The dining philosopher problem goes as follows: 5 philosophers sit at a round table. Between
each pair of philosophers, a fork is placed (totalling 5). Every philosopher thinks for some time,
then wants to eat. For this, he needs to acquire the two neighbouring forks. Design a concurrent
algorithm so that no philosopher will starve, i.e. each can continue to forever alternate between
thinking and eating. No communication between the philosophers is allowed.
Solving this problem with TM is very easy! Besides managing the syntax, one only has to make sure that picking up both forks happens inside a single atomic block (the surrounding boilerplate is not reproduced here):
if (left.inUse.get() || right.inUse.get())
STM.retry();
left.inUse.set(true);
right.inUse.set(true);
}});
}
...
}
TM is not without its issues: the best semantics are not clear (e.g. for nesting), achieving good performance can be challenging, and it is unclear how to deal with I/O inside transactions (how would one roll back such effects?).
TM is still very much in development. It remains to be seen whether it will be widely adopted in the future, and which semantics or type of TM will prevail.
Chapter 22
Distributed Memory and Message Passing
Many of the problems of parallel/concurrent programming come from sharing state. What if we simply avoid this? Functional programming, for example, relies on an immutable state - no synchronization required!
Message Passing, which will be the main topic of this chapter, has isolated mutable state,
that is, each thread/task has its private, mutable state, and separate tasks only cooperate via
message passing - hence the name.
We differentiate the theoretical programming model (CSP: Communicating Sequential Pro-
cesses and Actor programming model) and the practical framework/library (MPI: Message
Passing Interface).
The actor model uses different actors (i.e. threads) that communicate by directly sending
messages to other actors. This model lends itself very well to event-driven programming: Actors
react to messages, and a program is written as a set of event handlers for events (where events
can be seen as received messages). A good example for this is a GUI: Every button can be a
very small actor, which on click (which can be perceived as message) does something and sends
relevant messages to other actors (e.g. to a window that it can be closed etc.).
An example for this is the functional programming language Erlang. It was initially developed
for distributed fault-tolerant applications, since recovering from errors becomes much easier if
no state is shared. The most one has to do is restart an actor and perhaps make sure that
messages are sent again etc. Consider the following, simple Erlang program:
start() ->
Pid = spawn(fun() -> hello() end),
Pid ! hello,
Pid ! bye.
hello() ->
receive
hello ->
io:fwrite("Hello world\n"),
hello();
bye ->
io:fwrite("Bye cruel world\n"),
ok
end.
This simple program creates a new actor that executes the hello function. Then, the start()
function sends two messages to that actor, “hello” and “bye”. When the actor receives a message,
it is handled similarly to a switch-statement: For “hello” the actor writes something and then
executes the hello() function again. On “bye”, it prints and then exits.
CSP was designed as a formal algebra for concurrent systems. Its main difference when compared
to the actor model is the existence of channels: Instead of directly addressing certain actors,
messages are sent to a channel. These channels are more flexible, as they can also be passed to
other processes. CSP was first implemented in 1983 in OCCAM.
A more modern example is Go - a concurrent programming language from Google that is
inspired by CSP. It features lightweight tasks and typed channels for task communications.
These channels are synchronous by default, but asynchronous channels are also supported. If
we recreate the Erlang example in Go, it would look like this:
func main() {
msgs := make(chan string)
done := make(chan bool)
go hello(msgs, done);
ok := <-done
fmt.Println("Done:", ok);
}
The similarities are apparent. The main difference, as mentioned before, is the existence of
channels, and of course the syntax (go is the equivalent to spawn etc.). In appendix B, on page
128, another example of a concurrent program in Go can be found: A prime sieve.
Chapter 23
Message Passing II - MPI
[Slide (communicators): mpiexec -np 16 ./test starts 16 processes. Every process in a communicator has an ID called its "rank". A communicator can be copied, yielding the same group of processes with the same ranks but under a different "alias".]
We can already write a very simple (and very useless) MPI program:
public static void main(String args []) throws Exception {
MPI.Init(args);
// Get total number of processes (p)
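The snippet breaks off here; a hedged completion of this minimal program in the mpiJava-style API used in the course (exact method names may differ between MPI bindings):

public static void main(String[] args) throws Exception {
    MPI.Init(args);
    int p    = MPI.COMM_WORLD.Size();   // total number of processes (p)
    int rank = MPI.COMM_WORLD.Rank();   // id of this process within the communicator
    System.out.println("Hello from rank " + rank + " of " + p);
    MPI.Finalize();
}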
Note that this works as SPMD: Single Program Multiple Data (Multiple Instances). We compile
only one program, which gets executed by multiple different instances.
Of course, now we need to communicate between those processes. This is achieved by the
Comm.Send function, which is called on a Communicator (not a process!):
void Comm.Send(            // called on a communicator
    Object buf,            // the data to be sent, e.g. an array
    int offset,            // offset within buf
    int count,             // number of items to be sent
    Datatype datatype,     // datatype of the items
    int dest,              // destination process id
    int tag                // data id tag
)
Send has a tag argument. This can be used to differentiate between different messages being
sent, for example as numbering.
How do messages get matched, i.e. how do we receive a message? Three things have to match:
The communicator, the tag, and source/dest.
void Comm.Recv(            // called on a communicator
    Object buf,            // where the received data is stored
    int offset,            // offset within buf
    int count,             // number of items to be received
    Datatype datatype,     // datatype of the items
    int src,               // source process id (or MPI_ANY_SOURCE)
    int tag                // data id tag (or MPI_ANY_TAG)
)
A receiver can get a message without knowing the sender or the tag of the message!
One can specify a send operation to be synchronous: Ssend. That means, the method waits
until the message can be accepted by the receiving process before returning. Of course, receive
Recv is synchronous by nature (a message can only be received if it has been sent). Synchronous
routines can perform two actions: Transferring data and synchronizing processes!
Messages can also be passed asynchronously - however then the buffer needs to be stored some-
where, which depending on the MPI implementation might need to be taken care of by the
programmer.
A second concept is that of blocking and non-blocking sends/receives. Non-blocking calls return immediately, even before the local actions are complete. This assumes that the data storage used for the transfer is not modified by subsequent statements until the transfer is complete!
What are the MPI defaults? For send, it is blocking, but the synchronicity is implementation
dependent. Receiving is blocking by default and synchronous by nature.
All is not well though, since we have now introduced the possibility of deadlocks again: If two
processes want to synchronously send to each other at the same time and receive after sending,
they would block each other. Luckily, MPI offers easy solutions: Using Sendrecv allows to do
both statements at the same time, or we can use explicit non-blocking operations. These have
a prefixed “i”. After executing Isend and Irecv, we can instruct a process to wait for the
completion of all non-blocking methods by calling the Waitall method.
Essentially, every single MPI program can be written using a mere six functions:
• MPI_INIT - initialize the MPI library (always the first routine called)
• MPI_COMM_SIZE - get the size of the communicator
• MPI_COMM_RANK - get the rank of the calling process in the communicator
• MPI_SEND - send a message to another process
• MPI_RECV - receive a message from another process
• MPI_FINALIZE - clean up all MPI state (must be the final routine called)
An example program where only these are used can be found in appendix B, on page 128. For performance, however, we need to use other MPI features.
Up until now, we used point-to-point communication. MPI also supports communications among
groups of processors! These collectives will be discussed in this section.
An important notice ahead of time: Collectives need to be called by every process to make sense!
One must not use an if-statement to single out a thread to receive like in point-to-point!
Reduce: Similar to what we heard in the first half of the semester, reduce makes use of an
associative operator (e.g. MPI_SUM) to reduce a result from different processes to one process
(called root).
public void Reduce(
    java.lang.Object sendbuf,
    int sendoffset,
    java.lang.Object recvbuf,
    int recvoffset,
    int count,
    Datatype datatype,
    Op op,
    int root
)
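As a usage sketch (again assuming MPJ-Express-style bindings and the predefined operator MPI.SUM; localResult is a placeholder for whatever each process computed): every rank, including the root, issues the very same Reduce call.

double[] sendBuf = new double[]{ localResult };  // one partial result per process
double[] recvBuf = new double[1];                // only meaningful at the root

// every process calls Reduce - no if-statement singling out the root
MPI.COMM_WORLD.Reduce(sendBuf, 0, recvBuf, 0, 1, MPI.DOUBLE, MPI.SUM, 0);

if (MPI.COMM_WORLD.Rank() == 0) {
    System.out.println("global sum = " + recvBuf[0]);
}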
A complete visualization of the different collectives can be found in appendix A, page 121.
A sample program using those collectives can be found on the following page, page 122.
Chapter 24
Parallel Sorting - Sorting Networks
Recall what we know about sorting: the lower bound for comparison-based sorting is Ω(n log n).¹
The basic building block for sorting algorithms is the comparator. Using the following notation,
we can visualize sorting networks:
[Figure: comparator notation - two horizontal wires carrying a[i] and a[j] are connected by a vertical comparator that outputs the two values in sorted order.]
Note that sorting networks are data-oblivious: they perform the exact same comparisons for each and every
input. They can also be redundant, performing “unnecessary” comparisons. This makes it
far easier to reason about them, since there is no worst- or best-case scenario - they are one and
the same!
¹ In computer science, assuming limited bit-width, one can also construct specialized algorithms that actually
achieve O(n).
[Figure: Sorting networks - an example network of comparators sorting a four-element input step by step into the ascending order 1, 3, 4, 5.]
Sorting networks are basically a sequence of comparisons that either swap the elements being
compared or leave them the way they are. One can construct such a network recursively, as
can be seen in figure 24.3. To improve our parallel algorithms, we can once again argue about
the depth and width of our graph (see appendix A, page 123). One relatively simple improvement
is Odd-Even Transposition Sort: in alternating fashion, we compare odd indices with even indices,
then even with odd - in numbers, first we compare index 1 to 2, 3 to 4, etc., then 2 to 3, 4 to 5,
etc. This sorting network has the same width and the same number of comparisons as the sorting
network for bubblesort, but a smaller depth - only n.
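A sequential sketch of this network in Java (illustrative code, not from the lecture); every iteration of the outer loop is one layer of independent comparators, so a parallel implementation could execute each layer concurrently:

// Odd-even transposition sort: n alternating layers of compare-exchange operations.
static void oddEvenTranspositionSort(int[] a) {
    int n = a.length;
    for (int layer = 0; layer < n; layer++) {
        // even layers compare (0,1), (2,3), ...; odd layers compare (1,2), (3,4), ...
        for (int i = layer % 2; i + 1 < n; i += 2) {
            compareExchange(a, i, i + 1);   // comparators within one layer are independent
        }
    }
}

// The comparator building block: afterwards a[i] <= a[j].
static void compareExchange(int[] a, int i, int j) {
    if (a[i] > a[j]) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}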
In general, there is no easy way to construct a good sorting network. Even for fairly small n
(n > 10), no size-optimal sorting networks are known.
How would we prove the correctness of sorting networks? Enter the Zero-one principle:
“If a network with n input lines sorts all 2^n sequences of 0s and 1s into non-decreasing order, it
will sort any arbitrary sequence of n numbers in non-decreasing order.”
The proof for this theorem has been visualized in figure 24.4.
This principle can now be used to reduce the number of cases for a proof by exhaustion from n!
down to “only” 2^n.
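The principle directly suggests a brute-force correctness check: run the network on all 2^n zero-one inputs and verify that every output is non-decreasing. A sketch (illustrative code, not from the lecture; the network is given as an ordered list of comparator index pairs):

// Checks a candidate sorting network using the zero-one principle.
// comparators[k] = {i, j} means "compare-exchange positions i and j", applied in order.
static boolean isSortingNetwork(int n, int[][] comparators) {
    for (long mask = 0; mask < (1L << n); mask++) {
        int[] a = new int[n];
        for (int bit = 0; bit < n; bit++) {
            a[bit] = (int) ((mask >> bit) & 1);   // one of the 2^n zero-one inputs
        }
        for (int[] c : comparators) {
            if (a[c[0]] > a[c[1]]) {              // comparator: smaller value ends up at c[0]
                int tmp = a[c[0]]; a[c[0]] = a[c[1]]; a[c[1]] = tmp;
            }
        }
        for (int k = 0; k + 1 < n; k++) {
            if (a[k] > a[k + 1]) return false;    // output not non-decreasing
        }
    }
    return true;
}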
[Figure 24.3: recursive construction of a sorting network - a network sorting x_1, ..., x_n is extended by one additional input x_{n+1}.]
[Figure 24.4: proof sketch of the zero-one principle. Show: if x is not sorted by network N, then there is a monotonic function f that maps x to 0s and 1s such that f(x) is not sorted by the network. If the output y has y_i > y_{i+1} for some i, consider the monotonic f(x) = 0 if x < y_i, 1 if x >= y_i. In the slide's example, the input 1 8 20 30 5 9 is mapped to the 0/1 sequence 0 0 1 1 0 1, and the network's unsorted output 1 5 9 8 20 30 corresponds to the unsorted 0/1 output 0 0 1 0 1 1.]
Note: There exists a sorting algorithm - Bitonic sort - which, given enough processors, is faster
than the sequential lower bound for comparison sorts would suggest: its time complexity in sequential
execution is O(n log² n), but its parallel time is O(log² n).
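For completeness, a compact sketch of the corresponding sorting network for power-of-two input sizes (illustrative code, not from the lecture): every pass of the two outer loops is a layer of independent comparators, and there are O(log² n) such layers, which is where the parallel bound comes from.

// Bitonic sorting network; requires a.length to be a power of two.
static void bitonicSort(int[] a) {
    int n = a.length;
    for (int k = 2; k <= n; k *= 2) {          // size of the bitonic sequences being merged
        for (int j = k / 2; j > 0; j /= 2) {   // comparator distance within the merge step
            for (int i = 0; i < n; i++) {      // all comparators of this layer are independent
                int partner = i ^ j;
                if (partner > i) {
                    boolean ascending = (i & k) == 0;
                    if ((ascending && a[i] > a[partner]) || (!ascending && a[i] < a[partner])) {
                        int tmp = a[i];
                        a[i] = a[partner];
                        a[partner] = tmp;
                    }
                }
            }
        }
    }
}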
Sorting networks were allotted only 45 minutes of lecture time, and thus this is a very, very shallow
look at them.
Part III
Appendices
103
Appendix A
Slides
Attached are numerous slides that offer a good overview of certain problems, or show things that
are just easier to visualize with full-page diagrams. Full credit goes to the lecturers who put
these slides together!
104
UncaughtExceptionHandlers: Example

public class ExceptionHandler implements UncaughtExceptionHandler {

    public Set<Thread> threads = new HashSet<>();

    @Override
    public void uncaughtException(Thread thread, Throwable throwable) {
        ...
        println("An exception has been captured");
        println(thread.getName());
        println(throwable.getMessage());
        ...
        threads.add(thread);
    }
}

public class Main {
    public static void main(String[] args) {
        ...
        ExceptionHandler handler = new ExceptionHandler();
        thread.setUncaughtExceptionHandler(handler);
        thread.join();
        if (handler.threads.contains(thread)) {
            // bad
        } else {
            // good
        }
    }
}
105
Thread state model in Java (repetition)
[Diagram taken from http://pervasive2.morselli.unimo.it/~nicola/courses/IngegneriaDelSoftware/java/J5e_multithreading.html]
106
[Slides: Designing a pipeline - washing clothes with 5 loads. In the unbalanced first attempt the washer stage (“w”) takes 5 seconds while the dryer (“d”) and the remaining “f” and “c” stages take up to 10 seconds; the timing table for loads 1-5 shows that the latency is bounded at 40 seconds per load and the throughput is about 1 load / 10 seconds (about 6 loads / minute), but the total time for all 5 loads rises to 80 seconds. To bound latency while improving throughput, Step 1 makes the pipeline from the first attempt more fine-grained, and Step 2 makes each stage take as much time as the longest stage from Step 1 (6 seconds, due to d2 and c2).]
107
[Slides: Speedup and Efficiency - definitions.]
108
[Slides: Mutual exclusion for 2 processes and their state space diagrams [p, q, wantp, wantq].
1st try: each process first spins while the other's want flag is set and only then sets its own flag (p1 non-critical section; p2 while(wantq); p3 wantp = true; p4 critical section; p5 wantp = false, symmetrically for Q). The state space diagram shows a reachable state with both processes in their critical sections - no mutual exclusion!
Since the full state space diagram is too large, only the state transitions of the protocol are of interest: p1/q1 is identical to p2/q2 (call it state 2) and p4/q4 is identical to p5/q5 (call it state 5); forbidden is both processes being in state 5. The reduced diagram still reaches p5, q5 - no mutual exclusion!
2nd try: set the own flag first, then spin on the other's flag (p2 wantp = true; p3 while(wantq);). Now mutual exclusion holds, but the state where both flags are true and both processes spin is reachable - deadlock!
3rd try: a shared variable volatile int turn = 1; each process spins until it is its turn (p2 while(turn != 1);) and hands the turn over after its critical section (p4 turn = 2). This guarantees mutual exclusion, but since we have not made any assumptions about progress outside of the critical section, a process can wait forever for its turn - starvation!]
109
[Slide: Intervals - (a0, a1) denotes the interval of the events a0, a1 with a0 → a1. With I_A = (a0, a1) and I_B = (b0, b1) we write I_A → I_B if a1 → b0 and say “I_A precedes I_B”. In the diagram, I_A → I_B and I_B → I_A', while I_A' ↛ I_B' and I_B' ↛ I_A', so I_A' and I_B' are concurrent.]
[Slide: Example - three threads A, B, C perform reads and writes on a register r (r.write(4), r.write(1), r.write(8) and reads returning 1, 4 and 8), with the chosen linearization points τ_J, τ_K, τ_M, τ_N, τ_L, τ_O marked on the time axis.]
110
[Slides: Peterson lock (flag[P] = true; victim = P; while (flag[Q] && victim == P) {}; critical section; flag[P] = false).
Proof of mutual exclusion: by contradiction, assume concurrent CSP and CSQ. Assume without loss of generality that P wrote victim last; then victim == P, so P must have read flag[Q] == false to enter. But Q's write of flag[Q] = true precedes its write of victim, which precedes P's write of victim, which precedes P's read of flag[Q] - by transitivity of “→”, P must read true. Contradiction.
Proof of freedom from starvation, by (exhaustive) contradiction: assume without loss of generality that P runs forever in its lock loop, waiting until flag[Q] == false or victim != P. Possibilities for Q:
• stuck in its non-critical section ⇒ flag[Q] = false and P can continue. Contradiction.
• repeatedly entering and leaving its CS ⇒ Q sets victim to Q when entering; now victim cannot be changed ⇒ P can continue. Contradiction.
• stuck in its lock loop waiting until flag[P] == false or victim != Q. But victim == P and victim == Q cannot hold at the same time. Contradiction.]
111
112
[Slide: Deadlock - a resource allocation graph in which P owns one resource and requires the one owned by Q, while Q in turn requires the one owned by P.]
[Slides: Rendezvous with Semaphores, init P_Arrived = 0, Q_Arrived = 0.
Wrong solution with deadlock: P executes acquire(Q_Arrived); release(P_Arrived) while Q executes acquire(P_Arrived); release(Q_Arrived) - both block in their acquire and nobody ever releases.
Working solution: P executes release(P_Arrived); acquire(Q_Arrived) while Q executes acquire(P_Arrived); release(Q_Arrived). In the accompanying timing diagrams a release signals (arrow) and an acquire may wait (filled box); the rendezvous works regardless of whether P or Q arrives first.]
113
[Slides: Barrier with semaphores.
Specification: “Each of the processes eventually reaches the acquire statement”, “The barrier will be opened if and only if all processes have reached the barrier”, “count provides the number of processes that have passed the barrier”, “when all processes have reached the barrier then all waiting processes can continue”.
First attempt: init barrier = 0; volatile count = 0; every process executes
    count++
    if (count == n) release(barrier)
    acquire(barrier)
The unsynchronized count++ is a race condition (illustrated by interleaving x++ and x-- on a shared variable x, each compiled into read x, modify register, write x), and since the barrier semaphore is released only once, the remaining waiting processes deadlock - both corresponding specification items are violated.
Fixed version: init mutex = 1; barrier = 0; count = 0; every process executes
    acquire(mutex); count++; release(mutex)
    if (count == n) release(barrier)
    acquire(barrier)
    release(barrier)      // turnstile
]
114
[Slides: searching a sorted list structure with sentinels −∞ and +∞ over the keys 2, 4, 5, 7, 8, 9 - the comparisons performed by add(6), remove(5) and contains(8).]
115
[Slides: ABA Problem - Thread X is in the middle of a pop (after reading top but before its CAS) while Thread Y pops A, Thread Z pushes B and Thread Z' pushes A again; X's CAS then succeeds even though the stack underneath (pool, top and next pointers) has changed.]
116
[Slides: a series of “Linearizable?” quiz histories - overlapping queue enqueue/dequeue and register read/write operations on threads A and B, each answered yes or no together with the reasoning (e.g. “x is first in queue”, “write(1) must have happened”).]
117
118
Collective Computation - Reduce:
public void Reduce(java.lang.Object sendbuf, int sendoffset, java.lang.Object recvbuf, int recvoffset, int count, Datatype datatype, Op op, int root)
With root = rank 0, the values A, B, C, D held by P0-P3 are combined into A+B+C+D at P0. Scan instead leaves the prefix results A, A+B, A+B+C, A+B+C+D on P0-P3.
Collective Data Movement - Broadcast: P0's value A is copied to all processes.
Collective Computation - Allreduce:
public void Allreduce(java.lang.Object sendbuf, int sendoffset, java.lang.Object recvbuf, int recvoffset, int count, Datatype datatype, Op op)
Every process ends up with A+B+C+D. Useful in a situation in which all of the processes need the result of a global sum in order to complete some larger computation.
Collective Data Movement - Scatter/Gather: Scatter distributes the blocks A, B, C, D of P0's buffer to P0-P3, Gather collects them back onto the destination process. Scatter can be used in a function that reads in an entire vector on process 0 but only sends the needed components to each of the other processes; Gather collects all of the components of the vector onto the destination process, which can then process all of them.
Allgather: every process ends up with the full sequence A B C D. Alltoall: the blocks are transposed, i.e. P0 ends up with A0 B0 C0 D0, P1 with A1 B1 C1 D1, and so on.
121
Matrix-Vector-Multiply: compute y = A · x, e.g. A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]] and x = (10, 20, 30), so that y consists of the row products A1· · x, A2· · x, A3· · x. Assume A and x are available only at rank 0!
1. Broadcast x: afterwards P0, P1 and P2 all hold (10, 20, 30).
2. Scatter A: P0 gets row (1, 2, 3), P1 gets (4, 5, 6), P2 gets (7, 8, 9).
3. Compute locally: P0: 1·10 + 2·20 + 3·30 = 140, P1: 320, P2: 500.
4. Gather the result y = (140, 320, 500) at rank 0.
122
[Slides: constructing sorting networks - enumerating sequences of comparators (many cases are redundant), the recursive construction that extends a network for x_1 ... x_n by an additional input x_{n+1}, and the resulting insertion sort and bubble sort networks with their depth and width.]
123
124
[Slide: exam tips]
• Watch the clock! If you are taking too long on a question, consider dropping it and moving on to another one.
• Always show your working.
• You should be able to explain most of the slides. Tip: form learning groups and present the slides to each other.
• If something is unclear: ask your friends, read the book (Herlihy and Shavit for the second part), ask your TAs.
Appendix B
Code-snippets
125
126
            return true;
        } finally {
            for (int level = 0; level <= highestLocked; level++)
                preds[level].unlock();
        }
    }
}
127
if (rank != 0) {
double [] sendBuf = new double []{sum};
128
129