
Parallel Programming

[email protected]

2nd Semester BSc Computer Science, 2020


Preface

This document aims to be a total replacement for revising the lectures in the parallel program-
ming course. This includes, but is not limited to, watching the lecture recordings, studying the
slides, reading additional literature and doing research on the internet. By the very nature of
the examination process, it is still highly recommended to solve old exams and exercises for ideal
preparation. The author wishes you the best of luck with your studies.
In this document, references are color-coded in this shade of violet. In most PDF viewers, clicking
any such reference jumps to the respective page in the document. This is true for both the table
of contents and in-text references to chapters and specific pages, as well as figures and the
like.
In a lot of code-snippets, a while(condition); is used for spinning. While not necessarily
initially apparent, this is a shorthand for
while(condition){
// Not doing anything but waiting
}

In this document, care has been taken to ensure proper indentation of code, so that it is obvious
when or if a piece of code belongs into a while()-loop.
While it was considered, the author has ultimately decided against including a section on the
assignments given during the semester. They serve mostly as a hands-on experience for the
various topics discussed during the lecture, and yield little to no additional information. Includ-
ing them in this document would serve no purpose, as no knowledge can be gained from them
without solving them yourself. That is not to say that the exercises can be skipped - they are
still excellent preparation for the exam.
In appendix A one can find many slides that have been deemed important or good visualizations
of a problem, yet did not fit into the main document because of space constraints or their
exemplary nature. The main text of this document refers to this appendix when appropriate.
In appendix B one can find full code that is not found on slides. Most of it is extensive code, or
is a “full” program contrasting the many smaller pieces of code that are discussed in the main
text of this document.
This document is not meant to be shared. Please contact the author if you wish to make
this document available to others.

Contents

I Lecture - First half 5

1 Introduction 6
1.1 Course overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.2 Three stories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
1.2.1 Mutual exclusion, or, The shared backyard . . . . . . . . . . . . . . . . . 7
1.2.2 Producer-Consumer, or, The very hungry cat . . . . . . . . . . . . . . . . 7
1.2.3 Readers-Writers, or, A therapy for communication issues . . . . . . . . . . 7
1.2.4 The moral of the story . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.3 Some parallel programming guidelines . . . . . . . . . . . . . . . . . . . . . . . . 8

2 Threads and Synchronization 9


2.1 Different kinds of multitasking . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.1.1 Multithreading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Java threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.3 Shared Resources and synchronized . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.4 Wait and Notify . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.4.1 Nested Lockout . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.5 Short recap - Java thread state model . . . . . . . . . . . . . . . . . . . . . . . . 14

3 Hardware Parallelism 15
3.1 Basic principles of today’s computers . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Hardware speed-up possibilities . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.1 Vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.2.2 Instruction Level Parallelism (ILP) . . . . . . . . . . . . . . . . . . . . . . 16
3.2.3 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

4 Basic Concepts in Parallelism 18


4.1 Work Partitioning and Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 Scalability, Speedup and Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 Amdahl’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
4.4 Gustafson’s Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

5 Fork/Join style programming I 22


5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
5.2 Divide et impera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
5.3 Abandoning the one thread per task model . . . . . . . . . . . . . . . . . . . . . 24

6 Cilk-Style bounds 27

7 Fork/Join style programming II 30


7.1 The ForkJoin Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30


7.2 Maps and reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31


7.3 Analysis of ForkJoin’s performance . . . . . . . . . . . . . . . . . . . . . . . . . . 31
7.4 The prefix-sum problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
7.4.1 Pack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
7.4.2 Parallel Quicksort . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

8 Shared memory concurrency, locks and data races 36


8.1 Managing state . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
8.2 Races . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

9 Guidelines and recap 39


9.1 Recap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

II Lecture - Second half 41

10 Memory Models: An introduction 42


10.1 Why we care - What can go wrong . . . . . . . . . . . . . . . . . . . . . . . . . . 42
10.2 Java’s Memory Model (JMM) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

11 Behind Locks - Implementation of Mutual Exclusion 45


11.1 Critical sections and state space diagrams . . . . . . . . . . . . . . . . . . . . . . 45
11.2 Dekker’s Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
11.3 Atomic registers and the filter lock . . . . . . . . . . . . . . . . . . . . . . . . . . 47
11.4 Safe SWMR Registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
11.5 Bakery Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50

12 Beyond Locks I - TAS & CAS 51


12.1 Read-Modify-Write Operations - TAS, TATAS and CAS . . . . . . . . . . . . . . 51
12.2 Read-Modify-Write in Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

13 Beyond Locks II - Deadlocks, Semaphores and Barriers 54


13.1 Deadlocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
13.2 Semaphores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
13.3 Barrier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

14 Producer-Consumer and Monitors 58


14.1 Producer Consumer Pattern . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
14.2 Monitors (in Java) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

15 Locking tricks 62
15.1 Reader / Writer Locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
15.2 Coarse-grained locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
15.3 Fine-grained locking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
15.4 Optimistic synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
15.5 Lazy synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
15.6 Lazy Skip Lists . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

16 Lock-free synchronization 67
16.1 Recap: Definitions with locks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
16.2 Definitions for Lock-free Synchronization . . . . . . . . . . . . . . . . . . . . . . . 68
16.3 Lock-free Stack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
16.4 Lock Free List Set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69


16.5 Lock-free Unbounded Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

17 Memory Reuse and the ABA Problem 74


17.1 Pointer Tagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
17.2 Hazard Pointers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

18 Concurrency theory I - Linearizability 77


18.1 Histories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
18.2 Linearizability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79

19 Concurrency theory II - Sequential consistency 81


19.1 A sidenote: Quiescent Consistency . . . . . . . . . . . . . . . . . . . . . . . . . . 82
19.2 Sequential consistency and the real world . . . . . . . . . . . . . . . . . . . . . . 83

20 Consensus 84

21 Transactional Memory 86
21.1 TM semantics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
21.2 Implementing transactional memory . . . . . . . . . . . . . . . . . . . . . . . . . 87
21.3 Design choices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
21.4 Scala-STM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
21.5 Simplest STM implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
21.6 Dining Philosophers with STM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
21.7 Final remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91

22 Distributed Memory and Message Passing 92


22.1 Rethinking managing state . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
22.2 Actor Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
22.3 Communicating Sequential Processes : CSP . . . . . . . . . . . . . . . . . . . . . 93

23 Message Passing II - MPI 95


23.1 Sending and Receiving . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
23.2 Collective Communication . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97

24 Parallel Sorting - Sorting Networks 100

III Appendices 103

A Slides 104

B Code-snippets 125
B.1 Skip list . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
B.1.1 Constructor, fields and node class . . . . . . . . . . . . . . . . . . . . . . 125
B.1.2 find() method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
B.1.3 add() method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
B.1.4 remove() method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
B.1.5 contains() method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
B.2 Concurrent prime sieve in Go . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
B.3 Calculating Pi in MPI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128

Part I

Lecture - First half


Lectures by H. Lehner, M. Schwerhoff and M. Vechev

Chapter 1

Introduction

1.1 Course overview

Parallel Programming has become a necessity in modern-day programming, as CPUs are limited
by heat or power consumption. While there are many intuitive reasons to deal with a problem
with “multiple problem-solvers”, parallel programming also brings its fair share of challenges
and (unintuitive) problems.
Learning Objectives

By the end of this course you should


1. have mastered fundamental concepts in parallelism
2. know how to construct parallel algorithms using different parallel programming
paradigms (e.g. task parallelism, data parallelism) and mechanisms (e.g. threads,
tasks, locks, communication channels)
3. be qualified to reason about correctness and performance of parallel algo-
rithms
4. be ready to implement parallel programs for real-world application tasks
(e.g. searching large data sets)

This means that both writing parallel programs and understanding the underlying fundamental
concepts are an important part of this lecture.
In order to achieve the stated objectives, this course is split into different parts, in order of
appearance:
1. (Parallel) Programming: Programming and Parallelism in Java (Threads)
2. Parallelism: Understanding and detecting, intro to PC architectures, formalizing and
programming models
3. Concurrency: Shared data, locks, race conditions, lock-free programming, communica-
tion
4. Parallel Algorithms: Useful & common algorithms in parallel, data structures for par-
allelism, sorting & searching


1.2 Three stories

These stories give an (abstracted) overview of common problems in parallel programming.

1.2.1 Mutual exclusion, or, The shared backyard


Let us assume that Alice and Bob share a backyard. Alice’s cat and Bob’s dog both want to
use it, but only one of them is allowed there at a time due to vastly differing philosophies of life
(read: uncontrolled aggression). In addition to this first requirement (mutual exclusion) we
also want to prevent a lockout: No pet should be locked out of the yard if it is not in use.
Idea 1: We alternate turns and communicate whose turn it is in some way (let’s say a generic
boolean).
Problem: Starvation: If one pet doesn’t use its turn, the other will never be able to go into
the yard - even if it is free!
Idea 2: We notify the other party of the status of our pet: i.e., we check if the other party’s flag
is up, and, if that is not the case, we hoist our own flag and let the pet out.
Problem: No mutual exclusion: If both parties check at the same time, it could lead to both
pets being pitted against one another in a heated debate (read: duel to the death)
Idea 2.1: How about checking once more after hoisting our flag?
Problem: Deadlock: If, once again, both flags are up, both parties wait forever for the other
to lower it.
Idea 2.2: We hoist the flag to show our intent, afterwards check, and then either release the pet
or lower our flag and try again some time later.
Problem(s): Livelock and Starvation: One party may still hog the backyard for itself forever.
In addition, a simultaneous declaration of intent leads to the system changing (lowering and
hoisting flags), but never progressing.
Final Solution: We use the above idea, but, to prevent locks, we have a “tie-breaker alternator”
from our very first idea. This is known as Dekker’s algorithm.
The problem of one party constantly using the yard (and thereby, starvation), is still present
though.

1.2.2 Producer-Consumer, or, The very hungry cat


Let’s assume Alice’s kitten has grown up into a fierce, and very hungry tiger. The backyard is
its feeding ground, and Bob is responsible for depositing food there. Of course, the two can’t
be in the yard at the same time.
Luckily, we can just use the simple alternating solution for this problem: That is, we have a
common flag for showing that there’s (no) food, and thus allowing the other party to move in.

1.2.3 Readers-Writers, or, A therapy for communication issues


Let us assume Alice and Bob communicate via Alice’s crude whiteboard and Bob’s telescope.
Alice may write the sentence “Pet loves ham” by writing “Pet”, erasing it, and then continuing
with her writing. Bob just waits for new words and writes them down.
If Bob’s notetaking is interrupted in some way, he might miss some words. This is especially
critical if there are even more writing parties, i.e., the cat has mastered human language and
starts writing down “I hate salad”. In the worst case, Bob writes down “Pet loves salad”.


1.2.4 The moral of the story


Of course, reality is much more complicated than those simplified stories: e.g., flipping a bit to
simulate the raised flag might only become visible after a delay, or other unwanted side-effects
might occur that will become clear later on in our studies.
The good news is that there is sufficient abstraction in the programming models of different
languages that allows us to (later on) mostly ignore such low-level concurrency issues.

1.3 Some parallel programming guidelines

Note: The section about JVMs is either explicitly non-examinable (JVM) or should already be
familiar (Java). Thus, only the recommended guidelines for this course are mentioned here.
• Keep variables as local as possible
• Avoid aliasing (i.e., do not reference the same object using multiple variables)
• If possible, avoid mutable state, especially when aliased - immutable objects can be shared
without much hassle, but concurrent modifications cause a lot of headaches
• Do not expect thrown exceptions to be very informative - the cause of the error may be
far earlier in the execution than where the exception triggers

Chapter 2

Threads and Synchronization

2.1 Different kinds of multitasking

The concurrent execution of multiple tasks can be simulated even on a single core: By using a
technique called time multiplexing, the impression of parallelism is created. In truth, this is
just the core switching rapidly between different tasks.
This principle allows for asynchronous I/O: If a process has to wait, for example due to reading
data from memory, other processes may be able to use the computing power that is currently
not needed for the waiting process.
Each process has a context, things like instruction counters, resource handles etc., and a state,
all of which are captured in a Process Control Block (PCB). The most important states are
waiting, running and blocked:
• A waiting process is ready to execute - it only needs to be allocated CPU time
• A running process is currently running - its instructions are being executed by a CPU
• A blocked process needs some external change in state to proceed. Most of the time, this
is I/O.
The OS is responsible for assigning resources like memory or computing time to processes. It
would be massive overkill to dive into the details of implementations in this lecture, but the
takeaway should be that process level parallelism can be complex and expensive.

2.1.1 Multithreading
The concept of a thread appears on different levels: It can be on a hardware level, the OS level
or even inside (for example) a JVM.
Multiple threads share the same address space - contrary to processes, they can thus share
resources much more easily, and switching between different threads is very efficient, since no
change of address space or loading of process states are necessary. However, this also makes
them more vulnerable to programming mistakes.

2.2 Java threads

Within Java, there are some different, easy ways of using threads. Java supplies a special Thread
class as part of the core language, which enables the programmer to easily create and start new
threads.


Similar to the process states, there is a thread state model in Java:

Figure 2.1: The Java thread state model

How does one go about using Java threads? The first option is to extend the java.lang.Thread
class. The custom class needs to override the run() method - however, in order to start the
thread the function start() needs to be invoked on the respective thread.
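For completeness, a minimal sketch of this first option (the class name and output are made up for illustration):

public class MyThread extends java.lang.Thread {
    @Override
    public void run() {
        // code here executes in the newly started thread
        System.out.println("Hello from thread " + getId());
    }
}

MyThread t = new MyThread();
t.start(); // starts a new thread that executes run(); calling t.run() directly would not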
A better way is to implement the java.lang.Runnable interface, for example:
public class ConcurrWriter implements Runnable {
    public int data;

    public ConcurrWriter(int number) {
        this.data = number;
    }

    @Override
    public void run() {
        // code here executes concurrently with caller
    }
}

ConcurrWriter writerTask = new ConcurrWriter(42); // 42 is an arbitrary value
Thread t = new Thread(writerTask);
t.start(); // calls ConcurrWriter.run() in a new thread

While we can force the creation of new threads, every Java program has at least one execution
thread - the first one calls main(). Interestingly, threads can continue to run even if main
returns.
Creating a Thread object does not start a thread1 - conversely, if a thread has finished exe-
cuting, the respective Thread object still exists.
How do we “wait” for our threads to return results? Obviously, we may want to wait with
continuing the execution of main(), for example if we need data that the threads calculate.
1 Neither does calling run(); the thread has to be started using start()


Now, the problem could be solved by busy waiting - essentially looping until each thread has
the state terminated. This is inefficient, and instead our main() thread should wait for the
threads to “wake it up”. This can be achieved using the Thread.join() method.1
Exceptions in threads do not terminate the entire program - even the behaviour of joining via
thread.join() is unaffected. It is thus paramount to either watch out for exceptions, or use a
personalized UncaughtExceptionHandler.2
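As a minimal sketch of the typical joining pattern (reusing the ConcurrWriter class from above; the value 42 is arbitrary):

Thread worker = new Thread(new ConcurrWriter(42));
worker.start();
try {
    worker.join(); // blocks until the worker thread has terminated
} catch (InterruptedException e) {
    // join() may be interrupted while waiting; handle or propagate
    Thread.currentThread().interrupt();
}
// the worker's results can now be read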
A few additional, useful methods that can be used:
Thread t = Thread.currentThread(); // get the current thread

System.out.println("Thread ID" + t.getId()); // prints the current ID

t.setName("PP" + 2019); // the name, not the ID, can be modified like this

t.setPriority(Thread.MAX_PRIORITY); // updates the thread's priority - this is a general
// indication of priorities, the JVM may or may not follow these (strictly)

if (t.getState() == State.TERMINATED) {} // check if thread's status is terminated -
// functions similarly for other states

2.3 Shared Resources and synchronized

The Battle of the Threads highlights an important issue: Bad interleavings. Since the
exact order of execution between two threads is unknown to us, bad things might happen if
both access a shared resource, for example a simple counter.
If we create two different threads, one adding and one subtracting thread, that we both execute to
do a certain, equal amount of up- or downticks respectively, we would assume that at the end, the
value would be 0. That is not necessarily the case. The reason for that is that incrementing and
decrementing are not atomic operations: In bytecode, it is visible that it consists of loading the
value, incrementing, and storing the value. In parallel execution, the variable could be updated
after loading, but before storing - which completely negates the operation that happened between
loading and storing. How may we solve this? Enter the synchronized keyword:
public synchronized void inc(long delta) {
this.value += delta;
}

A method that is synchronized can only ever be used by one thread at a time. A synchronized
method is thus a critical section with guaranteed mutual exclusion.
Technically, a synchronized method is just syntactic sugar:
public synchronized void inc(long delta) {
this.value += delta;
}

is just a “better-looking” version of


public void inc(long delta) {
    synchronized (this) {
        this.value += delta;
    }
}

1 Joining involves some slight overhead, and thus in very small scenarios, busy waiting may outperform it
2 Refer to Appendix A, page 105

The argument passed to synchronized is the object that will be used as a lock. Every object in
Java has a lock, called intrinsic or monitor lock, which can act as a lock for concurrency purposes,
in addition to the more sophisticated external locks from java.util.concurrent.locks. The
latter ones are especially useful for things like reader-writer scenarios, but can be complicated
to use.
synchronized can be used for static methods as well, which will synchronize the class methods
and attributes instead of the non-static instance methods and attributes. Meaning, instead of the
implied synchronized(this), we have an implied synchronized(MyClass.class) - locking
the class object instead of a single instance.
In Java, locks are recursive (or reentrant) - meaning, if a thread has acquired a certain lock, it
can still request to lock this object - i.e. a thread can call methods that are synchronized with
the same lock that it is already in possession of.
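As a small illustration of reentrancy (a sketch; the Counter class is made up): a synchronized method may call another synchronized method on the same object without deadlocking, because the thread already holds the intrinsic lock:

public class Counter {
    private long value = 0;

    public synchronized void inc(long delta) {
        this.value += delta;
    }

    public synchronized void incTwice(long delta) {
        // the lock on 'this' is already held - re-acquiring it succeeds immediately
        inc(delta);
        inc(delta);
    }
}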
In general, it is advisable to keep critical sections under lock as small as possible for performance
reasons. Not synchronizing an entire method is efficient, but can lead to incorrectness:
• Not using the same lock for two synchronized methods: This effectively invalidates the
usage of synchronicity, since the threads can still acquire their respective lock at the same
time
• Not using synchronized for all methods that access the shared resource: Obviously, this
is a more extreme case of the first possibility: While one thread may always request and
acquire the lock, the other is free to do as it pleases.
Luckily for us, synchronized handles exceptions very well. It releases the lock that was acquired,
and then the exception handler is executed. One thing that needs to be taken care of:
public void foo() {
synchronized (this) {
longComputation(); // say this takes a while
divisionByZero(); // this throws an exception
someOtherCode(); // something else
}
}

someOtherCode() will not be executed if divisionByZero() throws an exception, and the
changes made by longComputation() will not be reverted.
A final note: The implementation of synchronized differs based on the JVM implementation,
which in turn differs for different hardware/OS combinations. For two threads, we already know
Dekker's algorithm as a way to ensure synchronized works; later on, we will also see instructions
that implement synchronized for three or more threads.

2.4 Wait and Notify

Let us consider a simple Producer-Consumer scenario:


We have a shared buffer object, which is used by the producer to put items into, and by the
consumer to take items out of the shared buffer. How would we implement such a shared buffer
using our newly acquired knowledge about synchronized?
Consider the following Code-snippet:


public class Consumer extends Thread {


public void run() {
long prime;
while (true) {
synchronized (buffer) {
while (buffer.isEmpty());
prime = buffer.remove();
}
performLongRunningComputation(prime);
}
}
}

Assuming that our producer also uses synchronized on the buffer object, this seems fine - we
have guaranteed mutual exclusion.
Except that it is not fine, since a Deadlock can occur: If the consumer locks the buffer, it
spins on isEmpty(), while the producer cannot add anything because the lock for buffer never
becomes available. The solution to this problem is using the wait() and notify() methods:
// Consumer
synchronized (buffer) {
    while (buffer.isEmpty())
        buffer.wait();
    prime = buffer.remove();
}

// Producer
synchronized (buffer) {
    buffer.add(prime);
    buffer.notifyAll();
}

wait() releases the object lock, and the thread enters a waiting state. notify() and
notifyAll() wake up waiting threads (non-deterministically) and thus allow us to ensure correct
producer-consumer behaviour. notify() does, however, not release the lock; it just informs
waiting threads (which then compete in the usual manner for acquisition of the (hopefully)
soon-to-be released lock).
The while loop is a necessity: If we only used a simple if condition and no synchronized
at all, it could happen that the producer completes successfully just after the condition check,
but before the wait() call. Also, even with synchronized there might be other reasons for
the consumer returning from a wait() (e.g. due to a thread interrupt or different consumers
needing different conditions), so checking the condition again is considered a necessity. It might
be possible to have a correct program without a while loop, but using one is highly recommended,
as the Java documentation mentions:
As in the one argument version, interrupts and spurious wakeups are possible, and
this method should always be used in a loop:
synchronized (obj) {
while (<condition does not hold>)
obj.wait();
// Perform action appropriate to condition
}


2.4.1 Nested Lockout


Calling wait() inside of a nested synchronized block can lead to a deadlock by not releasing
every lock:
class Stack {
    LinkedList list = new LinkedList();

    public synchronized void push(Object x) {
        synchronized (list) {
            list.addLast(x);
            notify();
        }
    }

    public synchronized Object pop() {
        synchronized (list) {
            if (list.size() <= 0)
                wait();
            return list.removeLast();
        }
    }
}

Calling wait() or notify() without an explicit receiver object is an implicit this.wait() and
this.notify() respectively. Meaning, the pop() method only releases the lock on the this
object, here a Stack, but not on the list!
There is no general solution to this problem; however, it is advisable not to use nested synchronized
blocks, unless it is guaranteed that no problems arise - being careful is important, as
always in parallel programming.

2.5 Short recap - Java thread state model

We briefly touched on the different thread states already (see figure 2.1), let us now insert the
concrete methods that are responsible for state changes. For sake of ease, the full-size-slide for
the model is attached in appendix A on page 106.
• Thread is created when an object derived from the Thread class is created. It is in the
new state.
• Once start() is called, the thread becomes eligible for execution by the scheduler, entering
the runnable state.
• If the thread calls the wait() method, or join() to wait for another thread, it becomes
not runnable
• Once it is either notified or the requested join thread terminates, the thread is once again
runnable
• Exiting the run() method (normally or via exception) results in the thread entering the
terminated state.
• Alternatively, its destroy() method can be called - which results in an abrupt move to
the terminated state, possibly leaving other objects locked or causing other undesired side
effects.

Chapter 3

Hardware Parallelism

While this section focuses mostly on things that are not directly tied to parallel programming
on the software level, it serves as a general intuition as to why parallel programming has become
more and more important as well as showing some performance implications and some challenges
that transfer to the software level.

3.1 Basic principles of today’s computers

While computers have very different shapes and sizes, they are similar from the inside. They are
based on the Von Neumann architecture (or Princeton architecture). For more details
one can refer to the digital design and computer architecture lecture.
A problem that presented itself to hardware architects is the speed difference between mem-
ory and CPUs: While CPUs got (a lot) faster, accessing memory became much slower than
accessing CPU registers. Thus, caches - faster, but much more expensive memory - became
important. Since the size of caches is limited, it is impossible to have all data in them at the
same time. This is where locality plays an important role: Since related storage locations are
often accessed shortly after each other (i.e., accessing array cells one after another), it makes
sense to design hardware with this aspect in mind. As programmers, we can use this locality to
increase performance, as the time needed to execute the following two (C++) programs show:
// Program 1: cache friendly, needs about 8 secs
#include <algorithm>
#include <cassert>
#include <vector>

int main() {
    const int N = 800'000'000;
    std::vector<int> data(N);
    std::fill(data.begin(), data.end(), 1);

    // Given an array of N ones, sum up the first 8 array slots,
    // and that N/8 times. This is cache friendly.
    long result = 0;
    for (int i = 0; i < N/8; ++i)
        for (int j = 0; j < 8; ++j)
            result += data[j];
    assert(result == N);
}

// Program 2: NOT cache friendly, needs about 12 secs
#include <algorithm>
#include <cassert>
#include <vector>

int main() {
    const int N = 800'000'000;
    std::vector<int> data(N);
    std::fill(data.begin(), data.end(), 1);

    // Given an array of N ones, sum up every 32nd array slot,
    // and that 32 times. This is NOT cache friendly and slower.
    long result = 0;
    for (int i = 0; i < 32; ++i)
        for (int j = 0; j < N; j += 32)
            result += data[j];
    assert(result == N);
}


In addition, the cache distribution and things like MESI (a cache coherence protocol) are theoretically
none of our concern. However, the CPU itself may reorder (i.e., postpone) writes from its
own registers to the relevant cache. Therefore, it is paramount to use memory barriers
or fences. In our case, we can use the already discussed synchronized.1
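To make the visibility problem concrete, here is a commonly used sketch (not from the lecture; the class is made up). Without the volatile modifier (or synchronized accessors), the worker thread is allowed to keep reading a stale value of the flag and might spin forever; the modifier acts as the memory barrier mentioned above:

class StopFlag {
    static volatile boolean stop = false; // try removing 'volatile' - the loop may never end

    public static void main(String[] args) throws InterruptedException {
        Thread worker = new Thread(() -> {
            while (!stop); // spin until the write below becomes visible
            System.out.println("stopped");
        });
        worker.start();
        Thread.sleep(100);
        stop = true;
        worker.join();
    }
}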

3.2 Hardware speed-up possibilities

3.2.1 Vectorization
Vectorization can be classified as single instruction applied to multiple data. Of course,
such actions are inherently parallel - think about adding two vectors componentwise - and are
thus supported by special hardware instructions. As with many things, we cannot really control
vectorization of our code in Java - we just have to trust the compiler and the JVM to do this
for us.

3.2.2 Instruction Level Parallelism (ILP)


The principle of ILP is relatively simple: If certain instructions are independent of each other,
they can be calculated in parallel. Combined with speculative execution or Out-of-Order exe-
cution, a speedup can be achieved without a noticeable difference.
ILP is also affected by locality, although this kind of locality is more about jumps in code. While
Caches improve performance when there are no jumps in memory, ILP improves performance
when there aren’t jumps in the code.
ILP, while helpful, is not ‘controllable’ from a software standpoint. More interesting to delve
into is the topic of pipelining.

3.2.3 Pipelining
Pipelining, while CPU-internal, is a very universal idea that has made it into the software world.
There are two main concepts:
Throughput

• Throughput: Amount of work that can be done by a system in a given period of time
• In CPUs: # of instructions completed per second
• Larger is better
• Bound on throughput: 1 / max(computation time(stages))

Latency

• Latency: Time to perform a computation
• In CPU: Time required to execute a single instruction in the pipeline
• Bound on latency: #stages · max(computation time(stages))
• Constant over time only if the pipeline is balanced.

1 Technically, volatile can be used for a very similar purpose, but, since we often need to lock anyways for different reasons, synchronized is the comfortable option


In addition, pipelines can be characterized by certain timeframes:


• Lead-In: Time until all the stages are busy
• Full Utilization: All stages are busy
• Lead-Out: Time until all stages are free after exiting full utilization
These definitions are, of course, highly theoretical. For a more ‘hands-on’ approach, refer to the
example in appendix A on page 107 and assignment 4.
We very often consider balanced pipelines, i.e., all stages take the same amount of time, so that
we may have a bound on latency. Technically, we could of course create unbalanced pipelines,
although those are not as relevant as balanced ones.
Throughput optimization may increase the latency: Pipelining typically adds some constant
overhead between individual stages for synchronization and communication, leading to reduced
performance when splitting the pipeline stages into smaller steps. That in turn makes infinitely
small pipeline steps not practical, and it may even take longer than with a serial implementation.
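A small numeric illustration of this trade-off (the numbers are invented): assume every stage needs a constant 1 ns of synchronization overhead on top of its computation time.

4 balanced stages of 8 ns each:
    throughput bound = 1 / (8 ns + 1 ns) ≈ 0.11 instructions per ns
    latency = 4 · (8 ns + 1 ns) = 36 ns

Splitting every stage in two (8 stages of 4 ns each):
    throughput bound = 1 / (4 ns + 1 ns) = 0.2 instructions per ns
    latency = 8 · (4 ns + 1 ns) = 40 ns

Throughput improves, but latency gets worse - and because the per-stage overhead stays constant, splitting stages ever further eventually stops paying off.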
In CPUs one can often find the classical RISC pipeline, which allows some level of ILP within
a single processor. Details for this are not exam relevant, and may be studied further in the
digital design and computer architecture course.

Chapter 4

Basic Concepts in Parallelism

4.1 Work Partitioning and Scheduling

In order to profit from the availability of multiple threads/processors/cores, work needs to be


split up into parallel tasks. Each task is a unit of work, and this splitting has to be done by
the user. It is also called task/thread decomposition.
When the different tasks are finalized, they need to be assigned to processors. This is (typically)
done by the system, with the goal of full utilization.
To achieve best performance, the task granularity has to be chosen correctly. A fine gran-
ularity (opposite of coarse granularity) is more portable, i.e. can be executed with more pro-
cessors, and is in general better for scheduling. However, scheduling induces overhead, and thus
a trade-off must be made. In general, tasks should be as small as possible, but significantly
bigger than scheduling overhead.

4.2 Scalability, Speedup and Efficiency

Scalability is a term used for a multitude of things, e.g., how well a system reacts to increased
load. In this course, we are interested in:
• Speedup when increasing processor count
• What will happen if # of processors → ∞?
• Ideally, a program scales linearly - we achieve linear speedup
Of course, some mathematical definitions are in order if we want to make statements about our
programs. Those are, luckily, not very complicated.
Parallel Performance

Sequential execution time: T1


Execution time on p CPUs: Tp
• Tp = T1 /p - This is the ideal case
• Tp > T1 /p - Performance loss, this is usually the case
• Tp < T1 /p - Sorcery!


Speedup & Efficiency

Speedup (absolute) on p CPUs: Sp

Sp = T1 /Tp

• Sp = p - linear speedup (ideal)


• Sp < p - sub-linear speedup (performance loss)
• Sp > p - super-linear speedup (Sorcery!)
Efficiency (relative speedup): Sp /p

Why do we incur a performance loss, i.e. why is Sp < p? Once again, introducing parallelization
induces some overhead (typically associated with synchronization), which reduces performance.
Additionally, some programs may simply not contain “enough” parallelism - that is, some parts
of the program might be sequential due to their nature.1
Additionally, one should be careful when choosing whether to use efficiency or absolute speedup.
Sometimes, there is a sequential algorithm that doesn’t parallelize well that outperforms the
parallel algorithm with one processing unit. In these cases, it is fairer to use that sequential
algorithm for T1 , since using an unnecessarily poor baseline artificially inflates speedup and
efficiency.
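A small worked example (numbers invented for illustration): suppose the best sequential algorithm needs T1 = 10 s and the parallel program needs T4 = 4 s on p = 4 CPUs. Then

S4 = T1 / T4 = 10 / 4 = 2.5          (sub-linear speedup, since 2.5 < 4)
Efficiency = S4 / p = 2.5 / 4 = 0.625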

Figure 4.1: A typical graph comparing actual to linear speedup

4.3 Amdahl’s Law

Amdahl’s law provides a (“pessimistic”) bound on the speedup we can achieve. It is based on
the separation of T1 into the time spent on parallelizable work and the time spent on non-parallelizable, serial work:

T1 = Wser + Wpar

Given P workers, the (bounded) time for parallel execution is

Tp ≥ Wser + Wpar/P

1 There are also architectural limitations - e.g., memory contention - which are less of a focus than the program-influenced part
Simply inserting these relations into the definition of speedup yields Amdahl’s law:
Amdahl’s law

Sp ≤ (Wser + Wpar) / (Wser + Wpar/P)

And a simple corollary, where f denotes the serial fraction of the total work:

Wser = f · T1
Wpar = (1 − f) · T1

=⇒ Sp ≤ 1 / (f + (1 − f)/P)

Applying to infinite workers:

S∞ ≤ 1/f

Amdahl’s law is mostly bad news, as it puts a limit on scalability. A key takeaway is that all
non-parallel parts of a program can cause problems, no matter how small. For a
more visual explanation why, consider appendix A page 108.
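A short worked example (values chosen for illustration): with a serial fraction f = 0.1 and P = 10 processors,

S10 ≤ 1 / (0.1 + 0.9/10) = 1 / 0.19 ≈ 5.3
S∞ ≤ 1 / f = 10

Even with only 10% serial work, ten processors give at most a speedup of about 5.3, and no number of processors pushes it beyond 10.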

4.4 Gustafson’s Law

Gustafson’s law shows an alternative, more optimistic view to Amdahl’s. Gustafson bases his
law on the consideration of constant runtime. Meaning, instead of trying to find out how
much we can speedup a given program, we try to find out how much work can be done in a
given timeframe.

Gustafson’s law

Let f be the sequential part of a program and Twall the total amount of available time.
Then it holds that

W = p(1 − f)Twall + f Twall

Which, when inserted into the formula for speedup, yields

Sp = f + p(1 − f) = p − f(p − 1)
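Using the same numbers as in the Amdahl example above (f = 0.1, p = 10), Gustafson's law yields

S10 = p − f(p − 1) = 10 − 0.1 · 9 = 9.1

in contrast to the bound of roughly 5.3 from Amdahl's law - the difference stems from the changed assumption that the problem is scaled up with the number of processors.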

Comparing the two laws can be difficult when only comparing formulae. Therefore, it is highly
recommended to compare the formulaic definitions to figure 4.2.


Figure 4.2: Comparison of Amdahl’s and Gustafson’s law, with p = 4.

Chapter 5

Fork/Join style programming I

5.1 Introduction

Having clarified exactly how much we can theoretically improve our programs with parallelism,
let us consider an example of a parallel program that actually solves a problem: Summing up
the elements of an array. We base our parallel program on the following, simple sequential code:
public static int sum(int[] input){
int sum = 0;
for(int i=0; i<input.length; i++){
sum += input[i];
}
return sum;
}

The idea of parallelizing is rather simple: We choose an arbitrary amount of threads - let us
consider 4 for this example - and then have each of them run a part of the array - in our case,
every thread gets 1/4 of the array. With our knowledge of Java threads, we might come up with
a program like this:
class SumThread extends java.lang.Thread {
    int lo, hi; int[] arr; // arguments
    int ans = 0; // result

    SumThread(int[] a, int l, int h) {
        // pass defined sector
        arr = a; lo = l; hi = h;
    }

    public void run() {
        // override, calculate sum of sector
        for (int i = lo; i < hi; i++)
            ans += arr[i];
    }
}

int sum(int[] arr){// can be a static method


int len = arr.length;
int ans = 0;
SumThread[] ts = new SumThread[4];
for(int i=0; i < 4; i++){// do parallel computations
ts[i] = new SumThread(arr,i*len/4,(i+1)*len/4);
ts[i].start();
}
for(int i=0; i < 4; i++) { // combine results
ts[i].join(); // wait for helper to finish!
ans += ts[i].ans;

}
return ans;
}

That code is technically correct1 and produces expected results. This style of parallel pro-
gramming is called fork/join - being named after its most important methods. Luckily for
us, fork-join programs do not require much focus on sharing memory among threads. In our
example, we used fields which only had one writer (the main or a helper thread respectively),
but in general one should be careful to avoid data races with shared memory.
There are a few issues remaining with our code. First of all, it is not very parameterized - at
least the number of threads should be able to be changed easily. We also would like to only use
processors that are available to our program, not just how many cores are in our machine. And,
probably most devastating of all, we can have load imbalance depending on the structure
of our tackled problem - maybe a divisor calculation is happening, the duration of which is
obviously vastly different for different inputs - which would result in our program’s speedup
being limited by one slow, overburdened processor. How can we alleviate those problems?

5.2 Divide et impera

The solution to all of these problems is the perhaps counter-intuitive notion of using far more
threads than processors available. All of the aforementioned issues are handily solved,2 but this
will require both a change of algorithm and, due to the immense overhead generated by Java
threads, abandoning them.
Our first concern is the changing of our algorithm to accommodate the idea of small pieces of
work. The straightforward way of implementing those changes is, as alluded to by the title of
this section, the divide-and-conquer paradigm. Our (sequential) implementation for the problem
of summing up could look like this:
public static int do_sum_rec(int[] xs, int l, int h) {
// l and h are the boundaries of our part
int size = h-l;
if (size == 1) /*check for termination criteria*/
return xs[l];

/* split array in half and call self recursively*/


int mid = size / 2;
int sum1 = do_sum_rec(xs, l, l + mid);
int sum2 = do_sum_rec(xs, l + mid, h);
return sum1 + sum2;
}

Before using this code in a threaded version, adjustments need to be made3 due to the overhead
generated by creating all those threads and communicating between them:
• Use a sequential cutoff - we do not need to split down to single elements, and by
shortening height of the tree generated by our algorithm, we significantly cut down on
thread creation. Typically, we use a value around 500-1000.
1 join() may throw an exception, so we need to insert a try-catch block. Catching and exiting should be fine for basic parallel code
2 Although load imbalance, with “unlucky” scheduling, could be a small problem - variance in workload should be small anyway if the pieces of work are small
3 This is purely for practical reasons. In theory, the changes will make no difference to our speedup - but the real world is a tad bit different


• Do not create two recursive threads, instead create only one and do the other work “your-
self” - this reduces the number of threads created by another factor of two.
Implementing the sequential cutoff is an easy task. When improving the recursive thread cre-
ation, one needs to be careful with ordering this.run() and other.start() - otherwise the
run() method just runs sequentially. Our new and improved program thus looks like this:
public void run(){
int size = h-l;
if (size < SEQ_CUTOFF)
for (int i=l; i<h; i++)
result += xs[i];
else {
int mid = size / 2;
SumThread t1 = new SumThread(xs, l, l + mid);
SumThread t2 = new SumThread(xs, l + mid, h);
t1.start();
t2.run();
t1.join();
result=t1.result+t2.result;
}
}

Luckily, in this case we are dealing with a very regular workload. If we were dynamically
allocating workloads, think doing a breadth-first-search in a graph, the workload might be highly
irregular. Making sure that the work is split fairly between threads is difficult, and without
prior knowledge maybe even impossible. The next model we discuss can deal with this issue a
lot easier.

5.3 Abandoning the one thread per task model

Until now, we have always used one thread per task. This is not ideal, since Java threads are
very heavyweight and (in most real-world implementations) mapped to OS threads. Using one
thread per small task is horribly inefficient. Instead, we approach from a new angle - scheduling
tasks on threads.


Figure 5.1: Overview of Java’s executor service: tasks are submitted to the ExecutorService interface, which is backed by a thread-pool implementation, e.g. ThreadPoolExecutor

As shown in figure 5.1, we now focus on generating and submitting tasks, while leaving the
allocation of threads to tasks to an interface.
To use the executor service, we have to submit tasks (objects of classes implementing Runnable or
Callable<T>), hand them over to a previously created ExecutorService and get our results.
The following simple program shows how it can be done, starting with our task template:
static class HelloTask implements Runnable {
    String msg;

    public HelloTask(String msg) {
        this.msg = msg;
    }

    public void run() {
        long id = Thread.currentThread().getId();
        System.out.println(msg + " from thread:" + id);
    }
}

Then creating our executor and submitting to it:


int ntasks = 1000;
ExecutorService exs = Executors.newFixedThreadPool(4);

for (int i = 0; i < ntasks; i++) {
    HelloTask t = new HelloTask("Hello from task " + i);
    exs.submit(t);
}

exs.shutdown(); // initiate shutdown, does not wait, but can’t submit more tasks

And resulting in:





Hello from task 803 from thread:8
Hello from task 802 from thread:10
Hello from task 807 from thread:8
Hello from task 806 from thread:9
Hello from task 805 from thread:11
Hello from task 810 from thread:9
Hello from task 809 from thread:8
Hello from task 808 from thread:10
Hello from task 813 from thread:8
Hello from task 812 from thread:9
Hello from task 811 from thread:11
...

Figure 5.2: Sample output of our simple executor service program

The executor service is not meant for parallel tasks that have to wait on each other, as it has
a fixed amount of threads - which will quickly run out. We could conceivably decouple work
partitioning from solving the problem, or use a framework which we will discuss in chapter 7.
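Since Callable<T> was mentioned above, here is a small hedged sketch of submitting one and reading its result through the returned Future (the task itself is made up; requires java.util.concurrent.*):

ExecutorService exs = Executors.newFixedThreadPool(4);

Future<Integer> future = exs.submit(() -> {
    // some computation running on a pool thread
    return 21 + 21;
});

try {
    Integer result = future.get(); // blocks until the task has completed
    System.out.println("result = " + result);
} catch (InterruptedException | ExecutionException e) {
    e.printStackTrace();
}
exs.shutdown();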

Chapter 6

Cilk-Style bounds

In this chapter, we return to a more theoretical standpoint, which allows us to make some more
guarantees on performance of task parallel programming - also called Cilk-style1 .
For visualizing via task graph, we need to define tasks. This is done relatively simply:
Tasks

• execute code
• spawn other tasks
• wait for results from other tasks

Now we create a graph based on spawning tasks: Every task is represented by a node. An edge
from node A to node B means that Task B was created by Task A.2
Let us familiarize ourselves with one example. Consider the following code for a simple program
that calculates the fibonacci numbers:
public class sequentialFibonacci {
    public static long fib(int n) {
        if (n < 2)
            return n;
        long x1 = fib(n-1);
        long x2 = fib(n-2);
        return x1 + x2;
    }
}

public class parallelFibonacci {
    public static long fib(int n) {
        if (n < 2)
            return n;
        spawn task for fib(n-1);
        spawn task for fib(n-2);
        wait for tasks to complete
        return addition of task results
    }
}

And now consider its task graph. Here, the meaning of nodes and edges is shown as
spawning/joining a task. The exact meaning is not as important, as we’ll analyse the graph just
with the amount of nodes given to us.
1 Cilk++ and Cilk Plus are general-purpose programming languages designed for multi-threaded parallel computing. They are based on the C and C++ programming languages, which they extend with constructs to express parallel loops and the fork–join idiom. — Wikipedia
2 Confusingly enough, in some interpretations, there are different meanings to nodes and edges. At least, according to professor Vechev, the meaning of nodes and edges will always be explained with a given problem. Confusion inducing, yet the author sees no way of resolving this conflict inherent to even Cilk literature. Thus, it is best to familiarize oneself with multiple examples of such graphs, for example with old exams. The author apologizes for the lackluster explanation.


Figure 6.1: Task graph of a simple recursive fibonacci function, fib(4). Edge types: spawn, join, step in the same procedure. The task graph is a directed acyclic graph (DAG).

But how does this help us get guarantees for performance? We define the following terms:
Task parallelism: performance bounds

• T1 : work - sum of all nodes in graph


• T∞ : span, critical path - Time it takes on infinite processors, longest (=maxi-
mum cost of nodes on it) path from root to sink
• T1 /T∞ : parallelism - “wider” is better
• T1 /TP : speedup
With the bounds

TP ≥ T1 /P
TP ≥ T∞

Whereby TP depends on scheduler, while T1 /P and T∞ are fixed.

Let’s examine figure 6.1: T1 is the sum of all nodes, in this case, that is 17. T∞ is 8 (path from
f(4) down to f(0) to the sink in f(4), assuming every node is of cost 1)1 . Thus, the fraction that
defines parallelism is T1 /T∞ ≈ 2.1. This is a hard upper limit for the speedup we can achieve -
this follows directly from the definitions (think about the mathematical implications of ≤ and
≥ in fractions).
Why do we not calculate TP ? As mentioned above, TP depends on scheduler. How so?
Depending on which order the tasks get executed and which dependencies exist between them,
more or less may be executed in parallel. Observe the following figure:
1 The ambiguous definition of what a node or edge means leads to some difficulties when defining what needs to be counted, especially when there is no explicit, differing cost per node. In past exams, it has always been the nodes having a cost and thus the longest path was always the maximum sum of nodes on a path. The author apologizes again for this unrectifiable inconvenience


What is T2 for this graph? That is, we have 2 processors. With one scheduling, T2 will be 5 (we have 5 time steps); with another scheduling, T2 will be 4 (we have 4 time steps). A bound on how fast you can get on P processors with a greedy scheduler: TP ≤ T1/P + T∞

Figure 6.2: TP difference depending on scheduler

Nowadays, a standard method is the so-called work stealing scheduler. This is due to the
following guarantee this scheduler can give:

TP = T1 /P + O(T∞ )

The proof would go beyond this lecture, but empirically we also get that TP ≈ T1 /P + T∞ .
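Plugging the numbers from the fib(4) graph above (T1 = 17, T∞ = 8, every node of cost 1) into these bounds for P = 2 gives

T2 ≥ max(T1/2, T∞) = max(8.5, 8) = 8.5
T2 ≤ T1/2 + T∞ = 8.5 + 8 = 16.5      (greedy scheduler)

so any greedy schedule on two processors finishes somewhere between 8.5 and 16.5 time units.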

Chapter 7

Fork/Join style programming II

7.1 The ForkJoin Framework

We had issues with the executor service when trying to execute divide and conquer algorithms
due to the allocation of threads to tasks. In this chapter, we will see a framework that supports
divide and conquer style parallelism - Java’s ForkJoin Framework.
The ForkJoin Framework is designed to meet the needs of divide-and-conquer fork-join paral-
lelism - that is, when a task is waiting, it is suspended and other tasks can run. There are
similar libraries available for other languages, most notably Cilk for C/C++.
The usage does not differ much from what we have previously seen, although the terms are a
bit different: We have to subclass either RecursiveTask<V> or RecursiveAction, depending
on whether we want to return something or not. We have to override the compute method,
and return a V if we subclass RecursiveTask<V>. Instead of starting and joining threads, we
call fork (or invoke) and join respectively. Similarly to the ExecutorService, we need to create a
ForkJoinPool. Let’s use this framework to solve the recursive sum the way we initially wanted to:
class SumForkJoin extends RecursiveTask<Long> {
int low;
int high;
int[] array;

SumForkJoin(int[] arr, int lo, int hi) {


array = arr;
low = lo;
high = hi;
}

protected Long compute() {


if(high - low <= 1)
return array[low]; // the single remaining element (low inclusive, high exclusive)
else {
int mid = low + (high - low) / 2;
SumForkJoin left = new SumForkJoin(array, low, mid);
SumForkJoin right = new SumForkJoin(array, mid, high);
left.fork();
right.fork();
return left.join() + right.join();
}
}
}


Now that we have our individual tasks, we only need some wrapper code:
static ForkJoinPool fjPool = new ForkJoinPool();
//number of threads equal to available processors

static long sumArray(int[] array) {


return fjPool.invoke(new SumForkJoin(array,0,array.length));
}

Had we used submit() instead, we would receive a Future<V> object rather than the result <V>
(in this case a Long). While invoke() submits and waits for task completion, with the Future we
have to explicitly ask for this behaviour by calling Future.get(). Overall, it makes most sense
for us to simply use invoke().
Once again, we can significantly improve performance by introducing a sequential threshold -
this is, once more, a flaw of the implementation - and thus not a flaw inherent to the model!

7.2 Maps and reductions

Many problems can be tackled in exactly the same way: Finding the maximum or minimum,
counting occurrences of certain objects etc. These problems are “just summing with a different
base case”.
Computations of this form are called reductions (in the context of MPI, explained in more
detail in chapter 23, also just reduce), which produce a single answer from a collection via
an associative operator. To name a few operations that are not reductions/reducible: Median,
subtraction, exponentiation etc. However, results need not be single numbers or strings - they
can also be arrays or objects with multiple fields, for example counting occurrences of multiple
different elements (think of a histogram).
An even simpler form of parallel computations are maps. We had already discussed vectorization
in chapter 3.2.1 - maps are basically exactly this - operating on each element of a collection
independently to create a new collection of the same size. Vectorization is an array-map
supported on hardware level.
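As an illustration (a sketch, not lecture code), an in-place array-map that squares every element can be written with RecursiveAction; the class name and the cutoff value are assumptions:

import java.util.concurrent.RecursiveAction;

class SquareMap extends RecursiveAction {
    static final int CUTOFF = 1000;   // assumed sequential threshold
    final int[] array;
    final int low, high;

    SquareMap(int[] array, int low, int high) {
        this.array = array; this.low = low; this.high = high;
    }

    protected void compute() {
        if (high - low <= CUTOFF) {
            for (int i = low; i < high; i++)
                array[i] = array[i] * array[i];   // the per-element operation
        } else {
            int mid = low + (high - low) / 2;
            SquareMap left = new SquareMap(array, low, mid);
            SquareMap right = new SquareMap(array, mid, high);
            left.fork();         // process left half asynchronously
            right.compute();     // process right half in this thread
            left.join();
        }
    }
}

It would be invoked just like the summation example, e.g. via fjPool.invoke(new SquareMap(arr, 0, arr.length)).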
Both maps and reductions are the “work horses” of parallel programming - they are the most
important and common patterns in which we can write a parallel algorithm.
One thing that one has to be mindful of, is that maps and reduces over suboptimal data struc-
tures, for example a linked list, may not necessarily yield a great effect. Parallelism is still
beneficial for expensive per-element operations - but traversing the list over and over again
takes much longer. Trees should be used instead, where applicable.

7.3 Analysis of ForkJoin’s performance

Similar to the task graphs in cilk-style, we can create a DAG1 , where every fork “ends” a node
and makes two outgoing edges (no matter whether we continue with two threads or continue
with one new and the current one), and every join “ends” a node and makes a node with two
incoming edges. For most divide-and-conquer algorithms, the graph will look like this:
1
Directed Acyclic Graph


(Figure content: fork and join are very flexible, but divide-and-conquer maps and reductions use them in a very basic way - a tree on top of an upside-down tree: divide at the top, base cases in the middle, combine results at the bottom.)

Figure 7.1: Graph investigating ForkJoin style divide-and-conquer

Luckily, in most Cilk-style literature, we group the forking and working somewhat differently
(see chapter 6) - those groups of nodes are called strands, and are usually the easier (and more
exam-relevant) way to compute T∞ and the like.
The ForkJoin library yields an asymptotically optimal execution, that is, we can expect the
following:

TP = O((T1 /P ) + T∞ )

This is an expected-time guarantee - how exactly this is achieved is the topic of an advanced course. In order for this guarantee to hold, a few assumptions about our code are being made, most importantly: We need to make all nodes small(-ish) and give them approximately equal amounts of work. As an example, let's consider the summation of an array:
• T1 = O(n)
• T∞ = O(log n)
• We thus expect: TP = O(n/P + log n)

7.4 The prefix-sum problem

So far, we have analyzed parallel programs in terms of work and span (i.e., total amount of node
cost and longest path in the DAG). In practice, most programs have parts that parallelize well
(maps/reductions) and parts that do not (reading linked lists, getting input, etc.). Amdahl’s
Law shows us that unparallelized parts become a bottleneck very quickly. Thus, we need to find
new and improved parallel algorithms. For problems that seem sequential, it turns out they are
actually parallelizable, if we introduce a trade-off: A bit more work or memory for a greatly
reduced span. In this section, we focus on one such problem: The prefix-sum problem. Solving
this problem will give us a template, similar to summing an array, that we can use to parallelize
other things - like quicksort.


Prefix-sum problem

Given int[] input, produce int[] output where:


output[i] = input[0] + input[1] + ... + input[i]

As with most parallel problems, we should take a look at the sequential solution:
int[] prefix_sum(int[] input){
int[] output = new int[input.length];
output[0] = input[0];
for(int i=1; i < input.length; i++)
output[i] = output[i-1]+input[i];
return output;
}

This does not seem parallelizable at all - and indeed this particular algorithm is strictly sequential. A different algorithm, however, improves the span from O(n) to O(log n).
This algorithm calculates the result in two “passes”, each with work O(n) and span O(log n):
First, we build a tree bottom-up, where the leaves map to an element in the array, and every
node contains the sum of all its children (or the respective array element). Then, we pass down
a value we call fromLeft with the following invariant: fromLeft is the sum of all elements left
of the node’s range. In order to achieve this, we assign the root fromLeft=0, and then every
node passes its left child its own fromLeft value and its right child its own fromLeft plus its
left child’s sum from the first pass. Each leaf calculates the output array by adding its own
fromLeft value to the input array.
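A minimal sketch of the two passes (written with plain recursion for readability - in the parallel version both recursive calls would be forked; the node class and all names are illustrative, not lecture code):

// Hypothetical helper class for the prefix-sum tree.
class PrefixNode {
    int sum;                  // sum of input[lo..hi)
    int lo, hi;
    PrefixNode left, right;   // null for leaves
}

// Pass 1: build the tree bottom-up; every node holds the sum of its range.
static PrefixNode buildSums(int[] input, int lo, int hi) {
    PrefixNode node = new PrefixNode();
    node.lo = lo; node.hi = hi;
    if (hi - lo == 1) {
        node.sum = input[lo];
    } else {
        int mid = lo + (hi - lo) / 2;
        node.left = buildSums(input, lo, mid);     // could be forked in parallel
        node.right = buildSums(input, mid, hi);
        node.sum = node.left.sum + node.right.sum;
    }
    return node;
}

// Pass 2: pass fromLeft down; every leaf writes output[i] = fromLeft + input[i].
static void pushFromLeft(PrefixNode node, int fromLeft, int[] input, int[] output) {
    if (node.left == null) {
        output[node.lo] = fromLeft + input[node.lo];
    } else {
        pushFromLeft(node.left, fromLeft, input, output);                  // could be forked
        pushFromLeft(node.right, fromLeft + node.left.sum, input, output);
    }
}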

(Figure content: the prefix-sum tree for input [6, 4, 16, 10, 16, 14, 2, 8]. The root covers range [0, 8) with sum 76 and fromLeft 0; its children cover [0, 4) with sum 36 / fromLeft 0 and [4, 8) with sum 40 / fromLeft 36, and so on down to the leaves, which each hold one input element and its fromLeft value. The resulting output is [6, 10, 26, 36, 52, 66, 68, 76].)

Figure 7.2: Parallel prefix-sum visualized

As always, we could easily add a sequential cut-off: The leaves then hold the sum of a whole range, the second pass starts the same way as without the cut-off, and each leaf sequentially prefix-sums all elements in its range - as a simple code snippet:


output[lo] = fromLeft + input[lo];


for(i=lo+1; i < hi; i++)
output[i] = output[i-1] + input[i]

Where lo and hi are the boundaries for the range given to our leaf node.

7.4.1 Pack
In this section, we want to apply what we learned from parallel prefix-sum to a more general
context. We call this operation a Pack1 :
Pack

Given an array input, produce an array output containing only elements such that a
certain f (elmnt) is true.
Example: input [17, 4, 6, 8, 11, 5, 13, 19, 0, 24]
f : is elmnt > 10?
output [17, 11, 13, 19, 24]
O(n) work and O(log n) span

How do we parallelize those problems? Finding the elements for the output is simple - but we
somehow need to make sure that they are put in the right place, without sacrificing parallelism.
The idea is to compute a bit-vector for elements that fulfill f , and to then use prefix-sum
on that very bit-vector. Then, each true element can just check the bitsum array to find its
position.

Parallel prefix to the rescue:

1. Parallel map to compute a bit-vector for true elements
   input  [17, 4, 6, 8, 11, 5, 13, 19, 0, 24]
   bits   [1, 0, 0, 0, 1, 0, 1, 1, 0, 1]

2. Parallel-prefix sum on the bit-vector
   bitsum [1, 1, 1, 1, 2, 2, 3, 4, 4, 5]

3. Parallel map to produce the output
   output [17, 11, 13, 19, 24]

   output = new array of size bitsum[n-1]
   FORALL(i=0; i < input.length; i++){
       if(bits[i]==1)
           output[bitsum[i]-1] = input[i];
   }

Figure 7.3: Packing with parallel prefix-sum
1
Non-standard terminology


7.4.2 Parallel Quicksort


Let us recall some properties of Quicksort: It is a sequential, in-place sorting algorithm with
expected time O(n log n). How should we parallelize this?
One easy way of achieving a performance gain is to do the recursive sorting of the different
partitions (i.e., after placing our pivot in the correct position) in parallel. With this, we can
decrease the span from O(n log n) down to O(n), yielding O(log n) for parallelism (work divided
by span).
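A sketch of this first variant, where only the recursive calls run in parallel (the partition helper is a standard sequential partition and, like the class name, an assumption rather than lecture code):

import java.util.concurrent.RecursiveAction;

class ParallelQuicksort extends RecursiveAction {
    final int[] array;
    final int lo, hi;   // sorts the range [lo, hi)

    ParallelQuicksort(int[] array, int lo, int hi) {
        this.array = array; this.lo = lo; this.hi = hi;
    }

    protected void compute() {
        if (hi - lo <= 1) return;
        int p = partition(array, lo, hi);   // sequential partition: O(n) work and span
        ParallelQuicksort left = new ParallelQuicksort(array, lo, p);
        ParallelQuicksort right = new ParallelQuicksort(array, p + 1, hi);
        left.fork();        // sort both partitions in parallel
        right.compute();
        left.join();
    }

    // Lomuto-style partition: puts the pivot at its final index and returns it.
    static int partition(int[] a, int lo, int hi) {
        int pivot = a[hi - 1], i = lo;
        for (int j = lo; j < hi - 1; j++)
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        int t = a[i]; a[i] = a[hi - 1]; a[hi - 1] = t;
        return i;
    }
}

It would be invoked via a ForkJoinPool, e.g. fjPool.invoke(new ParallelQuicksort(arr, 0, arr.length)).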
By giving up the in-place property, we can actually achieve a span of O(log² n). Why is that
so? The limiting factor for further improvement was the partitioning, which always happened
in O(n). However, partitioning is just two packs - one for elements less than pivot, one for
elements greater than pivot.1
In general, by using packs we give up in-place due to the need for auxiliary storage, but increase
parallelism.

1
It is possible to do both packs at once with some very fancy parallel prefixing, but that has no effect on
asymptotic complexity

Chapter 8

Shared memory concurrency, locks


and data races

When talking about parallel algorithms in the context of the ForkJoin framework, we never
talked about locks, synchronization and the like. This is due to the structure of our algorithms
- each thread had memory that “only it accessed”, for example a sub-range of an array. Thus,
we could avoid bad interleavings and the like. However, it is not always this simple, and we
have to make sure that we are aware how we should manage state.

8.1 Managing state

Managing state(s) is the main challenge for parallel programs, for the reasons mentioned so far.
There are a few approaches on how to go about this:
Approaches to managing state

• immutability
– data does not change at all
– best option
• isolated mutability
– data can change, but only one thread/task can access them
• mutable/shared data
– data can change, all tasks/threads can potentially access them

When dealing with mutable/shared state, one needs to protect the state via exclusive access,
where intermediate inconsistent states should not be observed. This can be achieved via locks
or transactional memory. The former we’ve already covered extensively, while the latter will
be part of the second half of the course (see chapter 21).
For the rest of the course, we will very often consider a canonical example of managing a banking
system and the problems that arise when parallelizing that system. Consider the following
sequential code as a baseline on how the system should work:
class BankAccount {
private int balance = 0;
int getBalance() { return balance; }


void setBalance(int x) { balance = x; }


void withdraw(int amount) {
int b = getBalance();
if(amount > b)
throw new WithdrawTooLargeException();
setBalance(b - amount);
}
// other operations like deposit, etc.
}

Were we to port this program directly to the multithreaded world, one could easily find a
bad interleaving1 , for example when both threads execute getBalance() before the other has
finished etc.2
While tempting, it is almost always wrong to fix a bad interleaving by rearranging or repeating
operations. This generally only shifts the problem or might even compile into the same version
since the compiler does not know of any need to synchronize. Thus, we have to use mutual
exclusion and critical sections. We could under certain assumptions implement our own
mutual exclusion protocol, but this won’t work in real languages anyway. We should instead use
Locks, a basic synchronization primitive with operations new, acquire and release.3
Recall the required properties of mutual exclusion: At most one process executes the critical
section, and the acquire mutex method must terminate in finite time when no process is cur-
rently in the critical section (Safety Property and Liveliness respectively). Using locks, we can
implement such a mutual exclusion.
When using locks to make our bank account parallel-proof, there are quite a few things that one
needs to be aware of, some of which we’ll discuss in the second half of the lecture (especially
when it comes to transferring money between accounts), but here are a few possible mistakes
one might encounter:
• Using different locks for withdraw and deposit
• Using the same lock for every bank account (very poor performance)
• Forgetting to release a lock before throwing an exception
What about getBalance and setBalance? If they can be called outside of a locked section (i.e.
if they are public), it could lead to a race condition. However, if they acquire the same lock as
withdraw would, it will block forever, since the thread would need to acquire a lock it already
has. One simple approach would be to have a private setBalance that does not lock and a
public setBalance that does lock. However, more intuitive is the use of re-entrant locks -
Java uses this type of lock, and thus allows us to use the locked setBalance without issues.
Re-entrant lock

A re-entrant or recursive lock has to “remember” two things: Which thread (if any)
holds it, and a counter for how many times it has been entered.
On lock, if the lock goes from not-held to held, the count is set to 0. If the current holder
calls acquire, the count is incremented.

1
As a recap, we say that calls interleave if a second call starts before the first ends
2
This could also happen with only one processor since threads may be subject to “strange” scheduling by the
OS, however, since we largely use processor/thread interchangeably, this is only of minor concern
3
The implementation of Lock is quite complicated and uses special hardware and OS support. In this course,
we take it as a given primitive to use


On release, if the count is > 0, the count is decremented. If the count is 0, the lock
becomes not-held.

The Java statement synchronized is a bit more advanced than the primitive re-entrant lock - for example, it releases the lock even if the synchronized block is left due to throw, return etc. If we want a lock that works more akin to the primitive, we can use java.util.concurrent.locks.ReentrantLock - but we then need to use a lot of try and finally blocks to avoid forgetting to release it. In the second half of the course, we will use more locks from this library.
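A minimal sketch of withdraw using such an explicit lock (assuming the balance field and the WithdrawTooLargeException from the sequential example, and that the exception is unchecked):

import java.util.concurrent.locks.ReentrantLock;

class BankAccount {
    private int balance = 0;
    private final ReentrantLock lock = new ReentrantLock();

    void withdraw(int amount) {
        lock.lock();
        try {
            int b = balance;
            if (amount > b)
                throw new WithdrawTooLargeException();
            balance = b - amount;
        } finally {
            lock.unlock();   // released even if the exception was thrown
        }
    }
    // deposit, getBalance, setBalance analogously
}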

8.2 Races

There is quite a bit of confusion regarding the term race. It is important to know what exactly is
meant when one talks about a race (condition), and thus this section deals with those definitions.
A race condition occurs when the computation result depends on the scheduling (how threads
are interleaved). These bugs exist only due to concurrency (i.e. not possible with one thread)1 .
One can further define data races and bad interleavings:

The distinction

Data Race [aka Low Level Race Condition, low semantic level]
Erroneous program behavior caused by insufficiently synchronized accesses of a shared resource by multiple threads, e.g. simultaneous read/write or write/write of the same memory location.
(For mortals) always an error, due to compiler & HW.
The original peek example has no data races.

Bad Interleaving [aka High Level Race Condition, high semantic level]
Erroneous program behavior caused by an unfavorable execution order of a multithreaded algorithm that makes use of otherwise well synchronized resources.
"Bad" depends on your specification.
The original peek example had several.

Figure 8.1: Data Races vs. Bad Interleavings

While tempting, synchronization should not be skipped even for methods that only read, due to possible data races with concurrently writing methods.

1
Possible with one processor, however

Chapter 9

Guidelines and recap

Decades of bugs have led to some conventional wisdom - general techniques that are known to
work. In this chapter, we mention some of those important guidelines and recap the first half
of this lecture.
Memory Location

For every memory location (e.g., object field), we must use one of the following three
possibilities:
1. Thread-local: Use location in one thread only
2. Immutable: Do not write to the memory location
3. Synchronized: Control access via synchronization

After making sure the amount of data that is thread-shared and mutable is minimized, we work
with some guidelines how to use locks to keep other data consistent:
Guidelines for synchronization

1. No data races - Never allow two threads to read/write or write/write the same
location at the same time
2. For each location needing synchronization, have a lock that is always held
when reading or writing the location
3. Start with coarse-grained (i.e., fewer locks that guard more), simpler locking,
move to fine-grained (i.e. more locks that guard less, better performance) locking
only if contention (threads waiting for locks to be released) becomes an issue
4. Do not do expensive computations or I/O in critical sections (contention), but do
not introduce data races - this is the so-called critical-section granularity
5. Think in terms of what operations need to be atomic - that is, for other threads
the operation can never be seen partly executed - first, locks second. Typically,
operations on abstract data types (ADT) - stack, table etc. - need to be atomic
even to other threads running operations on the same ADT
6. Generally, use provided libraries for concurrent data structures. For this course
and your understanding, try to implement things yourself !


9.1 Recap

Use this section to see at a glance what the lecture covered in the first half:
 Java Threads: wait, notify, start, join
 synchronized and its usage
 Producer/Consumer
 Parallelism: Vectorization, ILP
 Pipelining: Latency, Throughput
 Concepts: T1 , TP , T∞
 Amdahl’s and Gustafson’s Law
 Cilk-Style bounds, Taskgraphs and finding T1 , T∞ on them
 Divide-and-Conquer
 ForkJoin
 Prefix-sum, packs - reducing T∞ for complicated algorithms, i.e. Quicksort
 High-level and low-level data races
 Overall, broad view of parallel programming - ready to be expanded in the second
half

Part II

Lecture - Second half


Lectures by T. Hoefler

Chapter 10

Memory Models: An introduction

10.1 Why we care - What can go wrong

We already had an extensive look at bad interleavings between two threads. In the real world,
we are presented the unfortunate reality of memory reordering. What does that entail?
As a rule of thumb: The compiler and hardware are allowed to make changes that do not affect
the semantics of a sequentially executed program - meaning that instructions may be reordered
or optimized away completely if the resulting program still conforms to the same semantics.
Consider the following simple example:
Thread A calls wait, then thread B calls arrive. Naively, we would expect the code to run until arrive is called.

int x;

void wait() {
    x = 1;
    while(x==1);
}

void arrive(){
    x = 2;
}

However, the compiler could optimize the wait method as follows:

void wait() {
    while(true);
}

On the altar of performance, our program’s correctness is seemingly sacrificed. It gets worse:
The same thing happens on hardware as well! The exact behaviour of threads that interact
with shared memory thus depends on hardware, runtime system and of course programming
language (and by extension its compiler).
A memory model provides guarantees for the effects of memory operations - these leave
open optimization possibilities for hardware and compiler, but include guidelines for writing
correct multi-threaded programs. Memory models can be understood as a contract between
programmer, compiler, runtime, and architecture about the semantics of a program. It is thus
paramount to understand those guarantees and guidelines.

10.2 Java’s Memory Model (JMM)

How would we fix code that we know could cause problems? We could simply use synchronized.
Additionally, Java has volatile fields, whose access counts as a synchronization operation (more
on that later). Generally, volatile is more for experts, we should rely on standard libraries
instead.
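As an illustration, the wait/arrive example from the beginning of this chapter can be repaired by declaring the flag volatile (a sketch; the class name is made up, and the method is called waitForArrival here because Object.wait() already exists and is final in Java):

class Flag {
    volatile int x;   // reads/writes of a volatile field are synchronization actions

    void waitForArrival() {
        x = 1;
        while (x == 1) { }   // the read of x may not be optimized away or hoisted
    }

    void arrive() {
        x = 2;
    }
}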


How exactly do those language constructs that forbid reordering work? For this, we need to dig deeper into the JMM.
The JMM defines Actions: read(x):1 means “read variable x, read value is 1”
Executions (of a program) combine those actions with ordering, of which there are multiple:
• Program Order
• Synchronizes-with
• Synchronization Order
• Happens-before
Program order is a total order of intra-thread actions - it is not a total order across threads.
It is what we see when we write code and serves as a link between possible executions and the
original program.
The synchronization order is formed by synchronization actions:
Synchronization Actions

Synchronization actions are:


• Read/write of a volatile variable
• Lock/unlock a monitor
• First/last action of a thread
• Actions which start a thread
• Actions which determine if a thread has terminated

The synchronization order is a total order - all threads see the synchronization actions in the
same order, within a thread, all synchronization actions are in program order, and of course the
synchronization order is consistent, i.e. all reads in synchronization order see the last writes in
synchronization order.
Synchronizes-with only pairs specific actions which “see” each other - a volatile write to x
synchronizes with subsequent (in synchronization order) read of x.
The combination of program and synchronizes-with order creates a happens-before order - this
allows us to reason about a program and its possible states - see figure 10.1 for a detailed
example.
This covers what we need to know about the JMM for the moment. While an extraordinarily
complex topic that spans both hardware and software, memory models are essential for making
guarantees for parallel programs with shared memory. One should gain an intuition for what
guarantees we are given, and what actions are synchronization actions that enforce an ordering.


Figure 10.1: Example of possible executions with happens-before ordering

Chapter 11

Behind Locks - Implementation of


Mutual Exclusion

For this chapter (and most of the rest of this document), we make the following assumptions:
1. atomic reads and writes of variables of primitive type
2. no reordering of read and write sequences (this is not true in practice!)
3. threads entering a critical section will leave it eventually

11.1 Critical sections and state space diagrams

We define a critical section as follows:


Critical sections

Critical sections are pieces of code with the following conditions:


1. Mutual exclusion: Statements from critical sections of two (or more) processes
must not be interleaved
2. Deadlock-freedom: If some processes are trying to enter a critical section, one
of them must eventually succeed
3. Starvation-freedom: If any process tries to enter its critical section, that process
must eventually succeed

The implementation of a critical section on a single core system is very simple: Before a critical
section, we disallow the usage of interrupt requests by the operating system - in effect, our
thread can’t be switched while inside the critical section.
Of course, we want to tackle this problem for two processes that run on different cores. For
analysing a simple program, we will draw a state space diagram - A diagram that lists all
possible states and transitions between states. If we can reach a state where both processes are
in their critical section, we have no mutual exclusion. If we can reach a state (that is not the
final state) without a possibility of leaving it, we have a deadlock. If we simply never reach the
final state, we have starvation. Consider the simple example of a state space diagram below in
figure 11.1. More detailed examples can be found in appendix A, on page 109.


Mutual exclusion for 2 processes - 1st try (volatile boolean wantp = false, wantq = false):

Process P                         Process Q
loop                              loop
p1   non-critical section         q1   non-critical section
p2   while(wantq);                q2   while(wantp);
p3   wantp = true                 q3   wantq = true
p4   critical section             q4   critical section
p5   wantp = false                q5   wantq = false

(Figure content: the state space diagram over states [p, q, wantp, wantq] for this first attempt. Starting from (p1, q1, false, false), the state (p4, q4, true, true) - both processes inside their critical section - is reachable, so there is no mutual exclusion!)

Figure 11.1: Simple example of a state space diagram


11.2 Dekker’s Algorithm

Dekker’s1 algorithm solves the problem of mutual exclusion with two processes. Essentially,
we implement both a turn and a “show-of-intent” flag that decide whose “turn” it is. In the
following snippet, all boolean variables are initialized as false, and turn = 1
// Process P
loop
    non-critical section
    wantp = true
    while (wantq) {              // only when q tries to get lock
        if (turn == 2) {         // and q has precedence
            wantp = false;       // let q go on
            while (turn != 1);   // wait
            wantp = true;        // try again
        }
    }
    critical section
    turn = 2
    wantp = false

// Process Q
loop
    non-critical section
    wantq = true
    while (wantp) {
        if (turn == 1) {
            wantq = false;
            while (turn != 2);
            wantq = true;
        }
    }
    critical section
    turn = 1
    wantq = false

We can make this a bit more concise - introducing the Peterson Lock:
Let P=1, Q=2, victim=1, array flag[1,2]=[false,false]

// Process P (1)
loop
    non-critical section
    flag[P] = true;                    // I'm interested
    victim = P;                        // but you go first
    while (flag[Q] && victim == P);    // we are both interested, I'm victim, so I wait
    critical section
    flag[P] = false;

// Process Q (2)
loop
    non-critical section
    flag[Q] = true;
    victim = Q;
    while (flag[P] && victim == Q);
    critical section
    flag[Q] = false;

11.3 Atomic registers and the filter lock

How would we prove that the Peterson Lock satisfies mutual exclusion and is starvation free?
For that (and the definition of atomic register) we need to introduce some notation.
Threads produce a sequence of events: P produces events p0, p1, . . ., where p1 = "flag[P]=true" etc.
Since most of our examples consist of loops, we might need to count occurrences. This is done via superscript, i.e. p₅³ refers to flag[P]=false in the third iteration.
For precedence, we write a → b for a occurs before b. The → relation is a total order for
events. We also define the intuitive intervals of events as we understand it mathematically,
where an interval IA = (a0 , a1 ) precedes an interval IB = (b0 , b1 ) if a1 → b0 . Now, we can
properly define an atomic register. For further examples of events and precedence, see appendix
A page 110.
1
Often misspelled as “Decker”


Atomic Register

Register: A basic memory object, can be shared or not (here ≠ CPU-register).


Operations on a register r: r.read() and r.write(v)
Atomic Register:
• An invocation J of r.read or r.write takes effect at a single point in time τ (J)
• τ (J) always lies between start and end of operation J (in the respective interval)
• Two operations J and K on the same register always have a different effect time:
∀J, K: J ≠ K =⇒ τ(J) ≠ τ(K)
• An invocation J of r.read() returns the value written by the most recent invoca-
tion K of r.write(v), i.e. with the closest preceding effect time τ (K)

For the proof of the correctness of Peterson’s lock, refer to the slides in appendix A on page 111.
The implementation of Peterson’s lock in Java is pretty simple:
class PetersonLock
{
volatile boolean flag[] = new boolean[2];
// Note: the volatile keyword refers to the reference, not the array contents
// This example may still work in practice
// It is recommended to instead use Java’s AtomicInteger and AtomicIntegerArray
volatile int victim;

public void Acquire(int id)


{
flag[id] = true;
victim = id;
while (flag[1-id] && victim == id);
}

public void Release(int id)


{
flag[id] = false;
}
}

We can actually extend Peterson’s lock to n processes. This is the so called Filter lock.
Every thread t knows his level in the filter. In order to enter the critical section, a thread has
to elevate all levels. For each level, we use Peterson’s mechanism to filter at most one thread,
i.e. in every level there’s one thread that’s “stuck” in there, that is the victim of that level.
The algorithm is much easier to understand if we simply rename victim to lastToArrive - and
visualize each level as a waiting room. A thread can only progress if there are either no more
threads waiting in front of it, or if another thread enters his room - because the initial thread
then loses the lastToArrive property. With a similar thought, one can also prove this lock's
correctness by showing that only n − 1 threads can be in level 1, n − 2 in level 2 etc.
Expressed in (pseudo)code, the filter lock could look like this:


int[] level(#threads), int[] victim(#threads)

lock(me) {
for (int i=1; i<n; ++i) {
level[me] = i;
victim[i] = me;
while (exists(k != me): level[k] >= i && victim[i] == me) {};
}
}

unlock(me) {
level[me] = 0;
}
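A Java sketch of the same idea, with the exists-check written out as a loop (this is an illustration, not the lecture's reference implementation; AtomicIntegerArray is used so that the array entries are safely shared between threads):

import java.util.concurrent.atomic.AtomicIntegerArray;

class FilterLock {
    final int n;                       // number of threads
    final AtomicIntegerArray level;    // level[t]  = current level of thread t
    final AtomicIntegerArray victim;   // victim[l] = last thread to arrive at level l

    FilterLock(int n) {
        this.n = n;
        level = new AtomicIntegerArray(n);
        victim = new AtomicIntegerArray(n);
    }

    public void acquire(int me) {
        for (int i = 1; i < n; i++) {
            level.set(me, i);
            victim.set(i, me);
            // spin while some other thread is at my level or above
            // and I am still the last one to have arrived at this level
            boolean wait = true;
            while (wait) {
                wait = false;
                for (int k = 0; k < n; k++)
                    if (k != me && level.get(k) >= i && victim.get(i) == me)
                        wait = true;
            }
        }
    }

    public void release(int me) {
        level.set(me, 0);
    }
}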

The filter lock is not fair: Usually, we define fairness as "first-come-first-served", where "first-come" means completing the finite, bounded part of the acquire (the steps before the waiting loop) before another thread does. This is not necessarily respected by the filter lock: a very slow thread might need to wait far longer than other threads moving through the filter - if the slow thread always ends up as lastToArrive, it only gets a very small fraction of the throughput a fair lock would guarantee.
Additionally, the filter lock requires 2n fields just for storing levels and victims - which, depending
on the amount of threads, can be vastly inefficient. The time to move through the lock is always
O(n) - even if the thread trying to acquire the lock is doing so without other threads!

11.4 Safe SWMR Registers

The word Safe might be misleading - in this context, it simply means that any read that is
not concurrent to a write returns the current value of the register. This allows implementation
of mutual exclusion with non-atomic registers. However, any read concurrent with a write can
return any value of the domain of the register - even values that have never been entered! If it
could only return the previous or the new value, the register would be called regular.
(Figure content: a timeline with three threads A, B and C accessing a safe register r. Reads that do not overlap any write return the current value - e.g. 1 after r.write(1), 4 after r.write(4) - while a read that overlaps a concurrent write may return any value of the register's domain.)

Figure 11.2: Example of a safe SWMR register


11.5 Bakery Algorithm

The bakery algorithm, or, for the Swiss probably more aptly nicknamed “post-office algorithm”,
relies on the intuitive system of taking a numbered ticket. We achieve this by using two
arrays with entries for each thread: One for the labels or “ticket-numbers”, and one for the flag
“thread interested in lock”. These entries can be SWMR registers!
Of course, we still need to cover the problem that perhaps multiple threads draw the same
number at the same time. For this, we use a lexicographical comparator:

(k, lk) <lex (j, lj) ⇐⇒ lk < lj ∨ (lk = lj ∧ k < j)

Now, we can implement this in Java:


class BakeryLock
{
AtomicIntegerArray flag; // there is no AtomicBooleanArray
AtomicIntegerArray label;
final int n;

BakeryLock(int n) {
this.n = n;
flag = new AtomicIntegerArray(n);
label = new AtomicIntegerArray(n);
}

int MaxLabel() {
int max = label.get(0);
for (int i = 1; i<n; ++i)
max = Math.max(max, label.get(i));
return max;
}

boolean Conflict(int me) {


for (int i = 0; i < n; ++i)
if (i != me && flag.get(i) != 0) {
int diff = label.get(i) - label.get(me);
if (diff < 0 || diff == 0 && i < me)
return true;
}
return false;
}

public void Acquire(int me) {


flag.set(me, 1);
label.set(me, MaxLabel() + 1);
while(Conflict(me));
}

public void Release(int me) {


flag.set(me, 0);
}
}

This lock still suffers from two problems: First, an overflow could occur, especially with many
threads. Second, the memory requirement and runtime for acquiring is still O(n)!

Chapter 12

Beyond Locks I - TAS & CAS

We have used atomic registers, (SWMR in Bakery, MWMR in Peterson) but yet we haven’t
found a very efficient algorithm. That is, because it is not possible! As a theorem in a paper
states:
“If S is a [atomic] read/write system with at least two processes and S solves mutual exclusion
with global progress [deadlock-freedom], then S must have at least as many variables as processes”.
To fix this issue, modern multiprocessor architectures provide special instructions for atomically
reading and writing at once!

12.1 Read-Modify-Write Operations - TAS, TATAS and CAS

There is a ton of different hardware support operations, which differ between different archi-
tectures. For two examples, see appendix A on page 112. In this section, we will focus mainly
on the abstracted version of those operations: Test-And-Set (TAS) and Compare-And-
Swap (CAS).1 One should stress that these are Read-Modify-Write operations, that is, occur
atomically, and enable implementation of mutual exclusion with O(1) space.
The semantics of TAS and CAS are easy to understand:
boolean TAS(memref s)
    if (mem[s] == 0) {
        mem[s] = 1;
        return true;
    } else
        return false;

int CAS(memref a, int old, int new)
    oldval = mem[a];
    if (old == oldval)
        mem[a] = new;
    return oldval;

CAS can be seen as an extension of TAS - instead of checking for a constant and setting a
constant, we can instead pass both what to check for and what to set. TAS however can already
be used on its own to determine one thread that can go ahead in a critical section. A spinlock
implemented with those instructions is very easy as well:
1
These are not the “standard” for operations due to performance - simple read and write operations are just
way faster than our atomic instructions


// Spinlock with TAS
Init(lock):     lock = 0;
Acquire(lock):  while (!TAS(lock));            // wait
Release(lock):  lock = 0;

// Spinlock with CAS
Init(lock):     lock = 0;
Acquire(lock):  while (CAS(lock, 0, 1) != 0);  // wait
Release(lock):  CAS(lock, 1, 0);               // ignore result

12.2 Read-Modify-Write in Java

Java has some high level support for atomic operations:


java.util.concurrent.atomic.AtomicBoolean for example, which has operations set() and
the corresponding get(), compareAndSet(boolean expect, boolean update) and the more
unusual getAndSet(boolean newVal)1 .
Great! But the Java bytecode does not offer CAS. Classes like the above are implemented
using the aptly named sun.misc.Unsafe to map directly to the underlying machine/OS. Direct
mapping to hardware is not guaranteed however, the operations on AtomicBoolean are thus not
guaranteed lock-free. We can still implement a simple TASLock in Java though:
public class TASLock implements Lock {
AtomicBoolean state = new AtomicBoolean(false);

public void lock() {


while(state.getAndSet(true)) {}
}

public void unlock() {


state.set(false);
}
}

This lock is still a spinlock - threads keep trying until the lock is acquired. This is a major
performance issue, especially since the atomic operations are relatively slow on their own. We
have a new bottleneck - the variable that all the threads are fighting over. How can we fix
this? Instead of always trying to TAS, we first check by only reading, which is much easier on
performance - we Test-and-Test-and-Set - TATAS:
public class TASLock implements Lock {
    AtomicBoolean state = new AtomicBoolean(false);

    public void lock() {
        do {
            while (state.get()) {}   // spin reading only
        } while (!state.compareAndSet(false, true));
    }
    ...
}

TATAS as an algorithm works, but the Java implementation does not generalize due to nontrivial
interactions with the JMM and must be used with a lot of care. It is not recommended to use
it in practice.
1
Slightly mightier than TAS, it simply exchanges the passed argument and the memory value


There is still one aspect of performance we could improve: If many threads go to the line after
the state.get() at the same time, we have a lot of contention again. This is easily solved by
implementing a backoff : If a check fails, we let the thread go to sleep with a random duration.
Multiple failed attempts lead to an increase in the expected waiting duration we assign. Let’s
see the implementation:
public void lock() {
    Backoff backoff = null;
    while (true) {
        while (state.get()) {};       // spin reading only (TTAS)
        if (!state.getAndSet(true))   // try to acquire, returns previous value
            return;
        else {                        // backoff on failure
            try {
                if (backoff == null)  // allocation only on demand
                    backoff = new Backoff(MIN_DELAY, MAX_DELAY);
                backoff.backoff();
            } catch (InterruptedException ex) {}
        }
    }
}

public void unlock() {
    state.set(false);
}

It should be noted that we use !state.getAndSet(true) instead of the non-negated variant. The reason is that the while loop in our previous TASLock keeps spinning as long as getAndSet returns true; the same happens here, but the if-statement serves as the exit condition.
And for completion the Backoff class:
class Backoff
{ ...
public void backoff() throws InterruptedException {
int delay = random.nextInt(limit);
if (limit < maxDelay) { // double limit if less than max
limit = 2 * limit;
}
Thread.sleep(delay);
}
}

Chapter 13

Beyond Locks II - Deadlocks,


Semaphores and Barriers

13.1 Deadlocks

The dreaded Deadlock: Two or more processes are mutually blocked because each process waits
for another of these processes to proceed. Consider our canonical banking system example:
class BankAccount {
...
synchronized void withdraw(int amount) {...}
synchronized void deposit(int amount) {...}

synchronized void transferTo(int amount, BankAccount a) {


this.withdraw(amount);
a.deposit(amount);
}
}

If two threads with respective bank accounts A and B want to transfer money to each other, they might get stuck in a deadlock: A acquires the lock on its bank account, then B acquires the lock on its bank account, and now both threads are deadlocked: A cannot get B's lock, and B cannot get A's.
To look at deadlocks more formally, we can use a graph: Each thread and each resource (lock)
is a node. An edge from a thread to a resource means that thread attempts to acquire that
resource, an edge from a resource to a thread means that the resource is held by that thread. A
deadlock occurs if the resulting graph contains a cycle (see figure 13.1).
Deadlocks can, in general, not be healed. Releasing the locks generally leads to an inconsistent
state. Therefore, it is paramount to understand Deadlock avoidance. In databases, where
transactions can (generally) easily be aborted, one could implement two-phase locking with retry
- and releasing the locks on fail. However, in this course, we use resource ordering.1
By creating a global ordering of resources, we can avoid cycles. If there is no suitable global
order available, one could just implement a global (atomic) counter that each bank account
gets an ID from and increments the counter on creation.
1
In general, decreasing the size of critical sections will probably lead to transient inconsistent states, and only
using one lock for all accounts is terribly slow - resource ordering is far better


This even works across different data types: When transferring from a Hashtable to a
Queue, one could make sure that the Hashtable’s Lock is always acquired first. If a datatype is
acyclic by itself (lists, trees), we can use this in determining a global order. Let’s see the simple
solution for our bank accounts:
class BankAccount {
...
void transferTo(int amount, BankAccount to) {
if (to.accountNr < this.accountNr)
synchronized(this){
synchronized(to) {
withdraw(amount);
to.deposit(amount);
}}
else
synchronized(to){
synchronized(this) {
withdraw(amount);
to.deposit(amount);
}}
}
}
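The accountNr used for the ordering above can, as mentioned, come from a global atomic counter - a minimal sketch (the field names are illustrative):

import java.util.concurrent.atomic.AtomicInteger;

class BankAccount {
    private static final AtomicInteger nextNr = new AtomicInteger();
    final int accountNr = nextNr.getAndIncrement();   // unique ID, defines the global lock order
    // ... balance, withdraw, deposit and transferTo as above
}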

Deadlocks - more formally: A deadlock for threads T1 . . . Tn occurs when the directed graph describing the relation of T1 . . . Tn and resources R1 . . . Rm contains a cycle.

(Figure content: an example graph with threads T1 . . . T4 and resources R1 . . . R4, containing edges such as "T1 wants R3" and "T2 has R3" that together form a cycle.)

Figure 13.1: Deadlock detection via graph

13.2 Semaphores

Semaphores can be seen as an extension of locks. A semaphore is an abstract data type with
one integer value. It supports the following (atomic) operations:
acquire(S)
{
wait until S > 0
dec(S)
}


release(S)
{
inc(S)
}

We can easily build a lock with a Semaphore: We set the initial value of S to 1. For quick
reference, remember that the semaphore number signifies the following:
• ≥ 1 → unlocked
• = 0 → locked
• x > 0 → x threads will be let into the protected piece of code
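Using Java's java.util.concurrent.Semaphore, such a lock could look as follows (a sketch):

import java.util.concurrent.Semaphore;

class SemaphoreLock {
    private final Semaphore s = new Semaphore(1);   // initial value 1 = unlocked

    void lock() throws InterruptedException {
        s.acquire();   // waits until the value is > 0, then decrements it
    }

    void unlock() {
        s.release();   // increments the value
    }
}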
We can now tackle some problems that would have been much more difficult with locks - such
as the rendezvous.
We define a rendezvous as a location in code, where two respective threads must wait for the
other to arrive, i.e. to “synchronize” the two of them. For two threads, that problem can be
solved relatively simply by using two semaphores:

Synchronize processes P and Q at one location (rendezvous). Assume semaphores P_Arrived and Q_Arrived, both initialized to 0:

                  Process P               Process Q
pre               ...                     ...
rendezvous        release(P_Arrived)      release(Q_Arrived)
                  acquire(Q_Arrived)      acquire(P_Arrived)
post              ...                     ...

Figure 13.2: Rendezvous with Semaphores

Why do we release first instead of acquiring? If we were to put acquire first on both threads,
we’d simply deadlock. If we put only one acquire first, we may encounter both threads having
to wait at one point. Thus, we release first (for a detailed look, see appendix A page 113).
We can further increase performance by implementing semaphores without spinning: For this,
we simply modify the (atomic) acquire method to put us into a queue on fail and block1 us, and
the (atomic as well) release method to get the first thread in the queue and unblock it. In case
that we succeed in entering or release with an empty queue, we simply decrement or increment
the semaphore variable. In fact, we have already seen something very similar in the first half of
the course as wait and notify!
1
This will be explained later, but it is basically going to sleep until woken up


13.3 Barrier

How do we apply the rendezvous scenario to n threads? We need a set of a few building blocks:
1. A counter that increases with every thread that passes to make sure we only allow con-
tinuation of the program as soon as all processes have reached the barrier. We need to
make sure to make this counter mutually exclusive by using a semaphore (volatile is not
enough to make count++ mutex). We initialize the mutex semaphore with state 1 (i.e.,
unlocked):
acquire(mutex)
count++
release(mutex)

2. We need to make sure that everyone gets past the barrier, i.e. we call release often
enough that every thread may pass. We use a semaphore barrier which we initialize with
state 0 (i.e., locked):
if (count == n) release(barrier)
acquire(barrier)
release(barrier)

3. Additionally, our barrier should be reusable. For this, we need to make sure that counter is
decreased and barrier only gets released n times after threads pass the turnstile of acquire
and release. Even then, we need to make sure that it is not possible for a single thread
to pass other processes in a different iteration of the barrier. To make sure all of those
invariants hold, we implement a second semaphore that does things “in reverse” - and have
completed our code:
mutex=1; barrier1=0; barrier2=1; count=0

acquire(mutex)
count++;
if (count==n) { acquire(barrier2); release(barrier1) }
release(mutex)

acquire(barrier1); release(barrier1);
// barrier1 = 1 for all processes, barrier2 = 0 for all processes

acquire(mutex)
count--;
if (count==0) { acquire(barrier1); release(barrier2) }
release(mutex)

acquire(barrier2); release(barrier2)
// barrier2 = 1 for all processes, barrier1 = 0 for all processes

For a full overview of the improvement process, refer to appendix A page 114.

Chapter 14

Producer-Consumer and Monitors

14.1 Producer Consumer Pattern

The Producer Consumer Pattern is a very common pattern in parallel programming. It


is also preferable to other patterns due to its relative simplicity. We do not need to lock the
resource that is being produced/consumed, we only need a synchronized mechanism to pass this
resource from the producer to the consumer. Of course, there could be multiple producers and
consumers - but this won’t be a problem.
We can implement a relatively simple circular buffer, or queue with wraparound, to solve this
problem:
class Queue {
private int in; // next new element
private int out; // next element
private int size; // queue capacity
private long[] buffer;

Queue(int size) {
this.size = size;
in = out = 0;
buffer = new long[size];
}

private int next(int i) {


return (i + 1) % size;
}

public synchronized void enqueue(long item) {


buffer[in] = item;
in = next(in);
}

    public synchronized long dequeue() {
        long item = buffer[out];
        out = next(out);
        return item;
    }
}

The problem with this implementation is, that we could try to dequeue from an empty queue
or enqueue into a full queue - we’d need to fix this by writing helper functions that check if the


queue is empty or full. However, what would our functions do while they can’t en- or dequeue?
We can’t let them spin, since they are still holding the lock! Sleeping with a timeout would
work, but what is the proper value for the timeout? Maybe Semaphores can help implement
this more easily?
import java.util.concurrent.Semaphore;

class Queue {
int in, out, size;
long buf[];
Semaphore nonEmpty, nonFull, manipulation;

Queue(int s) {
size = s;
buf = new long[size];
in = out = 0;
nonEmpty = new Semaphore(0); // use the counting feature of semaphores!
nonFull = new Semaphore(size); // use the counting feature of semaphores!
manipulation = new Semaphore(1); // binary semaphore
}
}

With careful ordering of acquiring the semaphores (swapping nonFull and manipulation would
deadlock!), we can now properly implement enqueue and dequeue:
void enqueue(long x) {
    try {
        nonFull.acquire();
        manipulation.acquire();
        buf[in] = x;
        in = next(in);
    }
    catch (InterruptedException ex) {}
    finally {
        manipulation.release();
        nonEmpty.release();
    }
}

long dequeue() {
    long x = 0;
    try {
        nonEmpty.acquire();
        manipulation.acquire();
        x = buf[out];
        out = next(out);
    }
    catch (InterruptedException ex) {}
    finally {
        manipulation.release();
        nonFull.release();
    }
    return x;
}

This is a correct solution, but not the best. Semaphores are unstructured, meaning we as
programmers have to manage the semantics of the semaphores ourselves. Correct use requires
high level of discipline, and it is very easy to introduce deadlocks with semaphores. We need a
lock that we can temporarily escape from when we are waiting on a condition.

14.2 Monitors (in Java)

A Monitor is an abstract data structure equipped with a set of operations that run in mutual
exclusion. Luckily, we already know monitors as the wait/notify-system! Now we can easily
implement the enqueue/dequeue as follows:


synchronized void enqueue(long x) {
    while (isFull())
        try {
            wait();
        } catch (InterruptedException e) { }
    doEnqueue(x);
    notifyAll();
}

synchronized long dequeue() {
    long x;
    while (isEmpty())
        try {
            wait();
        } catch (InterruptedException e) { }
    x = doDequeue();
    notifyAll();
    return x;
}

We can further enhance the usage of monitors by not only using the intrinsic lock (what we lock
onto when using synchronized), but instead using the Lock interface that Java offers. This JIL
(Java Interface Lock) can also provide conditions that can be individually used to wait or signal
on.
Considering that signal might be slow, we can use the sleeping barber variant, a term coined by
Dijkstra - instead of simply calling signal all the time, we maintain a count of waiting producers
and consumers:
class Queue {
    int in = 0, out = 0, size;
    long buf[];
    final Lock lock = new ReentrantLock();
    final Condition notFull = lock.newCondition();
    final Condition notEmpty = lock.newCondition();
    int n = 0;
    int m;

    Queue(int s) {
        size = s; m = size - 1;
        buf = new long[size];
    }

    void enqueue(long x) {
        lock.lock();
        m--;
        if (m < 0)
            while (isFull())
                try { notFull.await(); }
                catch (InterruptedException e) {}
        doEnqueue(x);
        n++;
        if (n <= 0) notEmpty.signal();
        lock.unlock();
    }

    long dequeue() {
        long x;
        lock.lock();
        n--;
        if (n < 0)
            while (isEmpty())
                try { notEmpty.await(); }
                catch (InterruptedException e) {}
        x = doDequeue();
        m++;
        if (m <= 0) notFull.signal();
        lock.unlock();
        return x;
    }
}


Of course, we still need to use the lessons we learned in the first half, meaning, as guidelines:

Guidelines for condition waits

• Always have a condition predicate


• Always test the condition predicate before and after calling wait =⇒ while-loop!
• Ensure state is protected by lock associated with the condition

Chapter 15

Locking tricks

15.1 Reader / Writer Locks

We know that concurrent reads of the same memory is not a problem. So far, whenever a
concurrent write/write or read/write might occur, we have used synchronization to ensure only
one thread can access the memory at a time. This is too conservative: We should allow multiple
readers where appropriate. We introduce a new abstract data type - the reader/writer lock:

Reader/Writer Lock

Abstract data type which has three states:


• Not held
• Held for writing by one thread
• Held for reading by one or more threads
The most important invariant is that the above statements can not occur simultaneously.
That is, we can never have readers and writer(s) or multiple writers at the same time.

An implementation of a reader/writer lock has to consider how to prioritize readers and writers:
If there are no priorities given, a substantial amount of readers may lock out the writer forever.
Thus, usually, priority is given to writers.
In Java, we use java.util.concurrent.locks.ReentrantReadWriteLock. Using the methods
readLock and writeLock we get objects that themselves have lock and unlock methods. This
implementation does not have writer priority or reader-to-writer upgrading.
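A usage sketch (the protected hash map is just an example, not from the lecture):

import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class SharedMap {
    private final Map<String, Integer> map = new HashMap<>();
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();

    Integer get(String key) {
        rw.readLock().lock();           // many readers may hold the read lock at once
        try { return map.get(key); }
        finally { rw.readLock().unlock(); }
    }

    void put(String key, int value) {
        rw.writeLock().lock();          // exclusive: no readers and no other writers
        try { map.put(key, value); }
        finally { rw.writeLock().unlock(); }
    }
}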

15.2 Coarse-grained locking

This “technique” barely deserves its own section, it is the easy (and very likely not the best)
solution: One lock for the entire system. This of course fixes parallelism issues, however it does
so by essentially eliminating all parallelism and bottle-necking all threads in the critical sections.
It is very simple - but that is pretty much all that it has going for it.


15.3 Fine-grained locking

Fine grained locking, while in general performing better, is often more intricate than visible at
first sight. It requires careful consideration of special cases.
The basic idea of fine grained locking is to split the to-be-protected object into pieces with
separate locks - no mutual exclusion for algorithms on disjoint pieces. In our canonical example,
we only need to lock a bank account when we are actively transferring money to or from it - we
don’t need to lock every account every time a transaction fires. In reality, many objects require
careful thought what one needs to lock, as we’ll see in the following example.
Given a linked list, we want to remove an element. What do we need to lock?
Try 1: Lock the element in front of the one we want to remove. We modify the next-pointer of
our locked element. Problematic: If two threads decide to delete two adjacent elements,
we may not remove the item at all:

(Figure content: a linked list a → b → c → d. Thread A calls remove(c) and locks b, while thread B calls remove(b) and locks a. After both redirect the next-pointer of their locked node, c is not deleted!)

Figure 15.1: Two threads trying to delete two elements

Try 2: The problem with the 1st try was that we also read the next field of the node we want to
delete. A thread thus needs to lock both predecessor and the node to be deleted. We call
this hand-over-hand locking. The real life equivalent would be the safety systems used
in “adventure parks” when climbing: Secured by two snap hooks, you only move one at a
time to always be secured. The remove method works as follows:
public boolean remove(T item) {
Node pred = null, curr = null;
int key = item.hashCode();
head.lock();
try {
pred = head;
curr = pred.next;
curr.lock();
try {
// find and remove
while (curr.key < key) {


pred.unlock();
pred = curr; // pred still locked
curr = curr.next;
curr.lock(); // lock hand over hand
}
if (curr.key == key) {
pred.next = curr.next; // delete
return true;
// remark: We use sentinels at front and end
// so no exceptions will occur
}
return false;
} finally { curr.unlock(); }
} finally { pred.unlock(); }
}

The disadvantage of this method is that we potentially need a very long sequence of acquire/release operations before the deletion can take place. Also, one slow thread locking "early" nodes can block another thread wanting to acquire "late" nodes - even if the two operations wouldn't interfere with each other.

15.4 Optimistic synchronization

Let us try to improve our locking method. The idea of optimistic synchronization (or
optimistic locking, the terms are used interchangeably) is to find the nodes without locking,
then locking the nodes and checking if everything is okay (i.e., validating before operating).
What do we need to check in order to proceed?
We can reason as follows: If
• nodes b and c are both locked
• node b is still reachable from head
• node c is still successor to b
then neither is in the process of being deleted, nor can an item have been added between the
two nodes. Thus, we can safely remove c.
Consider the good and bad things about this "optimistic list":

Good:
• No contention on traversals
• Traversals are wait-free
• Overall less lock acquisitions

Bad:
• Need to traverse list twice
• A contains() method needs to acquire locks
• Not starvation-free: One thread may have to redo the validation over and over due to other threads inserting or removing other elements.
We mentioned wait-free (or sometimes obstruction-free) above. We define that as follows:

Wait-Free

Every call finishes in a finite number of steps, i.e. never waits for other threads. Wait-
freedom implies lock-freedom!


15.5 Lazy synchronization

An alternative to optimistic synchronization is lazy synchronization. We use a very similar approach to the optimistic list, but we scan only once and contains never locks. We achieve this by using deleted-markers which remove nodes "lazily" after marking. How would a remove look?
As before, we scan the list and lock predecessor and current. If both are not marked and are still predecessor and successor respectively, we go ahead and continue: We mark the current node as removed (logical delete), then we redirect the predecessor's next (physical remove) and, as always, unlock the two. The key invariant of this "lazy list" is that an unmarked node is reachable from the head of the list and reachable from its predecessor.
Contains() is now wait-free: We simply check if the item exists and if it is not marked for
deletion.
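A sketch of such a wait-free contains, assuming node fields key, next and a marked flag as just described (and sentinel nodes at both ends so the traversal always terminates):

public boolean contains(T item) {
    int key = item.hashCode();
    Node curr = head;
    while (curr.key < key)     // traverse without taking any locks
        curr = curr.next;
    return curr.key == key && !curr.marked;   // present and not logically deleted
}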

15.6 Lazy Skip Lists

A skip list is a practical representation for sets that is much easier to implement than a balanced
tree, since the latter requires rebalancing - a global operation - which is very hard to implement
in a (mostly) lock-free way. The skip list runs on the assumption that we have many calls to
find(), fewer to add() and much fewer to remove(). It solves the challenge of sorting and
finding probabilistically. With this, we can achieve an expected runtime of find() in O(log n) -
similar to a tree!
We represent different levels with different lists, emulating a tree. Each node gets a random
“height”, except for two sentinels at the start and end of the set.

Skip list property: Sublist relationship between levels - higher-level lists are always contained in lower-level lists; the lowest level is the entire list.

(Figure content: a skip list over the keys 2, 4, 5, 7, 8, 9 with sentinels −∞ and +∞; higher levels skip over some nodes of the level below.)

Figure 15.2: Skip list visualized

For searching, we start at the top level at the head of the respective list. We move forward in the
list until we either find the sought element or are “in-between” items that are smaller than and
greater than the sought value. We then move down a level and continue this searching pattern.


For adding and removing, we simply find predecessors, lock those and validate. contains() is
once more wait-free (add and remove are not).
The full code for such a list can be found in appendix B, on page 125 ff. The visual representation
can be found in appendix A on page 115.

Chapter 16

Lock-free synchronization

16.1 Recap: Definitions with locks

Throughout this course, we have seen many reasons why synchronizing with locks is not partic-
ularly optimal. Let us recap some points:
• Missing scheduling fairness / missing FIFO-behaviour1
• No notification mechanism
• Computing resources are wasted, thus performance degrades, particularly for long-lived
contention, i.e. long locked sections2
How about locks that support waiting/scheduling? Such locks require support from the runtime
system (i.e. OS, scheduler), the queues behind the implementation of monitors etc. also need
protection, again using spinlocks.3 These locks also suffer from a higher wakeup latency.
Overall, locks have some disadvantages by design:
• Locks are pessimistic - assume the worst and enforce mutual exclusion
• Every protected part of a program is not parallelizable - remember Amdahl’s law!
• If a thread is delayed (e.g., scheduler) while in a critical section, all threads suffer
• If a thread dies in a critical section, the system is basically dead
Let us also recap some central definitions for blocking synchronization:
• Deadlock: group of two or more competing processes are mutually blocked because each
process waits for another blocked process in the group to proceed
• Livelock: competing processes are able to detect a potential deadlock but make no ob-
servable progress while trying to resolve it4
• Starvation: repeated but unsuccessful attempt of a recently unblocked process to continue
its execution
1 One can solve this with queue locks, which have not been presented in the lecture.
2 Note that this does not imply that lock-free algorithms are always faster than locked ones.
3 If they are not implemented lock-free, which is the topic of this chapter.
4 While taken from the lecture slides, the more intuitive version of this point is: the threads still do something, but make no tangible progress.


16.2 Definitions for Lock-free Synchronization

Analogous to the above, but for lock-free synchronization, we define:


Lock-freedom: at least one thread always makes progress, even if other threads run concur-
rently. Implies system-wide progress but not freedom from starvation. In particular, if one
thread is suspended, then a lock-free algorithm guarantees that the remaining threads can still
make progress.
Wait-freedom: all threads eventually make progress. Implies freedom from starvation. In
particular, implies lock-freedom.

                          Non-blocking (no locks)    Blocking (locks)
Everyone makes progress   Wait-free                  Starvation-free
Someone makes progress    Lock-free                  Deadlock-free

Figure 16.1: Progress conditions with and without locks

With Locks/blocking algorithms, a thread can indefinitely delay another thread (i.e. by holding
a lock). In a non-blocking algorithm, failure or suspension of one thread cannot cause failure
or suspension of another thread!
The main tool that we use is CAS (refer to chapter 12). Now we can implement a very simple
non-blocking counter. We use AtomicInteger since we now need to be more careful with data-
types (race conditions make their return if we are not!).
public class CasCounter {
    private AtomicInteger value = new AtomicInteger(0); // needs to be initialized

    public int getVal() {
        return value.get();
    }

    // increment and return new value
    public int inc() {
        int v;
        do {
            v = value.get();
        } while (!value.compareAndSet(v, v+1));
        return v+1;
    }
}


This counter is now lock-free. No deadlocks may occur and a thread dying does not hinder the
other threads, but a thread can still starve.
A positive result of CAS suggests that no other thread has written in between reading and
modifying our local v. However, especially if we also have a decrement() function, this is only
a suggestion - we’ll discuss this in the form of the ABA-Problem in chapter 17.
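For completeness, a decrement in the same style would be a minimal sketch along these lines (assuming the same value field):
// decrement and return new value; retries until no other thread
// modified value between our read and our compareAndSet
public int decrement() {
    int v;
    do {
        v = value.get();
    } while (!value.compareAndSet(v, v - 1));
    return v - 1;
}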

16.3 Lock-free Stack

Let us now implement a proper data structure in a lock-free way: A lock-free stack.
The advantage of a stack for lock-free synchronization is that we only ever have to take care
of one single thing: The head pointer. Knowing this, we can use an AtomicReference for our
pointer, and then we implement our operations with CAS:
public Long pop() {
    Node head, next;
    do {
        head = top.get();
        if (head == null) return null;
        next = head.next;
    } while (!top.compareAndSet(head, next));
    return head.item;
}

public void push(Long item) {
    Node newi = new Node(item);
    Node head;
    do {
        head = top.get();
        newi.next = head;
    } while (!top.compareAndSet(head, newi));
}

Surprisingly easy. Performance is, however, worse than a locked variant - this is because of how
expensive atomic operations are, and contention can still be a problem. With a simple backoff,
this can be fixed:

Figure 16.2: Performance of a locking, lock-free and lock-free-backoff stack
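A minimal sketch of what such a backoff could look like for push() (the constants and the use of Thread.sleep with ThreadLocalRandom from java.util.concurrent are illustrative only):
public void push(Long item) {
    Node newi = new Node(item);
    int backoff = 1;
    while (true) {
        Node head = top.get();
        newi.next = head;
        if (top.compareAndSet(head, newi))
            return;
        // CAS failed: contention on top - back off for a short, random time
        try {
            Thread.sleep(ThreadLocalRandom.current().nextInt(backoff + 1));
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        backoff = Math.min(2 * backoff, 64); // bounded exponential growth
    }
}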

16.4 Lock Free List Set

Let us return to the example of linked lists. Can CAS help us out?


If the only matter of contention is the next pointer of a single node, CAS does indeed work.
With multiple different pointers, it does not:

Scenario: on the list a → b → c → d, thread A executes remove(c) and thread B executes remove(b). A's CAS on b.next and B's CAS on a.next both succeed, since they target different pointers - and c is not deleted!

Figure 16.3: CAS is not a universal solution

Maybe the marked bit approach from lazy synchronization could help us out?

Scenario: thread A executes remove(c) by first CAS(c.mark, false, true) and then CAS(b.next, c, d), while thread B executes add(c') after c by first checking c.mark and then CAS(c.next, d, c'). If B reads the mark before A sets it, all operations succeed - and c' is not added!

Figure 16.4: CAS is definitely not a universal solution

The difficulty in this (and many other similar problems!) is that while we do not want to use locks, we still want to atomically establish consistency of two things - here the mark bit and the next pointer. The Java solution? We use one bit of the address pointer (the next pointer of every node) as a mark bit. Since a normal AtomicReference in Java is 64 bits long, giving up one bit is harmless in practice - the remaining bits still address millions of petabytes. By using one bit as a mark bit, we execute a hacky version of DCAS - a double compare-and-swap. Does this fix all our problems?

A: remove(c): ① try to set the mark (on c.next), ② try CAS([b.next.reference, b.next.marked], [c, unmarked], [d, unmarked]).
B: remove(b): ① try to set the mark (on b.next), ② try CAS([a.next.reference, a.next.marked], [b, unmarked], [c, unmarked]).
Since B has marked b.next, A's DCAS fails - c remains marked (logically deleted), but has not yet been physically removed.

Figure 16.5: Using DCAS to implement a lock-free list

In figure 16.5, it is noted that “c remains marked (frowning emoji)” - meaning, we still have to
physically delete it. Luckily, we can simply have another thread “help us out”: When a thread
traverses the list and finds a logically deleted node, it can CAS the predecessor’s next field and
then proceed.1 This “helping” is a recurring theme in wait-free algorithms, where threads help
each other to make progress.
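In Java, the same pair of reference and mark bit is available as java.util.concurrent.atomic.AtomicMarkableReference, whose compareAndSet checks and updates both together - essentially the “DCAS” just described. A minimal sketch of the helping step during a traversal (assuming every node's next field is an AtomicMarkableReference&lt;Node&gt;, where a set mark means the node itself is logically deleted):
void helpClean(Node head) {
    boolean[] marked = { false };
    Node pred = head, curr = pred.next.getReference();
    while (curr != null) {
        Node succ = curr.next.get(marked);  // read successor and curr's mark together
        if (marked[0]) {
            // curr is logically deleted: try to unlink it; if the CAS fails,
            // another thread has already helped - simply move on.
            pred.next.compareAndSet(curr, succ, false, false);
        } else {
            pred = curr;
        }
        curr = succ;
    }
}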

16.5 Lock-free Unbounded Queue

At the heart of an operating system is a scheduler, which basically moves tasks between queues
(or similar structures) and selects threads to run on a processor core. Data structures of a run-
time or kernel need to be protected against concurrent access on different cores. Conventionally,
spinlocks are used. If we want to do this lock-free, we need a lock-free unbounded queue. Also,
we usually cannot rely on garbage collection, thus we need to reuse elements of the queue.2
First things first: What parts do we have to make sure to protect? In this case, we need to
protect three pointers that might be updated at the same time: head (the next item to be
removed), tail (the newest item) and tail.next (when enqueuing).
Our first idea is to use a Sentinel as the consistent head of the queue (which is especially great
when dealing with an empty queue etc. due to null-pointers) and AtomicReference in every
node for its respective next-pointer. Now we try to use CAS.
As we can see in figure 16.6, this version seems okay for the most part.
1 If other threads had to wait for one thread to clean up the inconsistency, the approach would of course not be lock-free!
2 Reusing elements will introduce the ABA problem, see chapter 17.


Protocol (initial version):
Enqueuer: read tail into last; then try to set last.next: CAS(last.next, null, new). If unsuccessful, retry! If successful, try to set tail without retry: CAS(tail, last, new).
Dequeuer: read head into first; read first.next into next; if next is available, read the item value of next; then try to set head from first to next: CAS(head, first, next). If unsuccessful, retry!

Figure 16.6: Initial Version of a lock-free unbounded queue

There are still some possible inconsistencies! In the enqueuer protocol, if a thread dies after
successfully performing the first CAS, then tail can never be updated because the first CAS will
fail for every thread thereafter. The solution: Threads helping threads when their check fails:
public void enqueue(T item) {
Node node = new Node(item);
while(true) { // retry
Node last = tail.get();
Node next = last.next.get();
if (next == null) {
if (last.next.compareAndSet(null, node)) {
tail.compareAndSet(last, node);
//everything okay, return
return;
}
}
else // Our tail is outdated, help others progress if necessary!
tail.compareAndSet(last, next);
}
}

Dequeue works similarly, but checks for “true” emptiness of queue:


public T dequeue() {
    while (true) { // retry
        Node first = head.get();
        Node last = tail.get();
        Node next = first.next.get();
        if (first == last) {
            if (next == null) return null; // queue truly empty
            else tail.compareAndSet(last, next); // help update tail!
        }
        else {
            // queue not empty, get first item
            T value = next.item;
            if (head.compareAndSet(first, next))
                return value;
        }
    }
}

This implementation works. However, we mentioned that we should reuse nodes instead of
relying on garbage collection. Sadly, this introduces one of the most complex pitfalls in parallel
programming: The ABA problem.

Chapter 17

Memory Reuse and the ABA Problem

Let us assume that we want to implement a lock-free stack (page 69), but we do not want to
always create new nodes, and instead maintain a node pool. We can implement this as a second
stack. We switch elements between the different stacks - get() on the node-pool returns a recycled node (and creates a new one only if the pool is empty), push() on the “real” stack obtains its node from the node-pool, and pop() on the “real” stack puts the removed node back onto the node-pool. This means that the stack is now in-place (since objects never change their address). Otherwise, the two stacks are identical to the ones discussed earlier.
For a very large number of threads (≈ 32 or more), we can see that this actually speeds up our program. Problematically, our program does not always work correctly. The reason for this is the ABA problem:
Timeline: thread X is in the middle of pop() - it has read top == A and A's next, but has not yet performed its CAS. Then thread Y pops A (returning the node A to the pool), thread Z pushes B, and thread Z' pushes A (reusing the node just returned to the pool). When X now completes its pop, the CAS on top succeeds, because top again points to A - but X installs its stale next pointer, and B is lost.

Figure 17.1: The ABA problem visualized

A larger figure can be found in appendix A on page 116. Another note: For the above to work,
thread Z has to have gotten B from the node-pool just before Y has returned A to the pool,
since that is a stack as well.


ABA-Problem

“The ABA problem ... occurs when one activity fails to recognize that a single memory location was modified temporarily by another activity and therefore erroneously assumes that the overall state has not been changed.”

How do we solve this conundrum? DCAS would actually work, however - no hardware supports
it. We have used a variant for the lock-free list set, but that was more of a “hacky” solution
rather than proper DCAS. DCAS is, at least today, more hypothetical than practical.
Maybe we just rely on garbage collection? This is much too slow to use in the inner loop of a
runtime kernel - and how would we implement a lock-free garbage collector relying on garbage
collection?
Then there are three practical solutions: Pointer Tagging, Hazard Pointers and Transac-
tional Memory. In this chapter, we will discuss the first two. Transactional memory will be
covered in depth in chapter 21.

17.1 Pointer Tagging

The ABA problem usually occurs with CAS on pointers. We could maybe reuse the trick from
earlier, where we reused bits from the pointer - and indeed we do: We could only choose addresses
(values of pointers) that are aligned modulo 32. This would make the last 5 bits available for
tagging. Every time we store a pointer in a data structure (i.e., pop() on the node allocator), we
increment this 5-bit counter by one. This makes the ABA problem much less probable because
now 32 versions of each pointer exist. This does not solve the ABA problem, but rather delays it: in the rather unlikely case that a pointer gets reused 32 times in between accesses, the CAS would still erroneously succeed.
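In Java one cannot steal bits from a reference, but the same idea is available as java.util.concurrent.atomic.AtomicStampedReference, which pairs a reference with an integer stamp that is compared and updated together with it. A sketch of the stack's pop() with a stamped top pointer (here the full 32-bit stamp takes the role of the tag, so wrap-around is far less likely than with 5 bits):
AtomicStampedReference<Node> top = new AtomicStampedReference<>(null, 0);

public Long pop() {
    int[] stamp = new int[1];
    Node head, next;
    do {
        head = top.get(stamp);          // read reference and stamp atomically
        if (head == null) return null;
        next = head.next;
    } while (!top.compareAndSet(head, next, stamp[0], stamp[0] + 1));
    return head.item;                   // any intermediate update bumped the stamp, so a stale read fails the CAS
}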

17.2 Hazard Pointers

Hazard pointers are a true solution to the ABA problem. Consider the reason for the existence
of the ABA problem:
The ABA problem stems from reuse of a pointer P that has been read by some thread X but
not yet written with CAS by the same thread. Modification takes place meanwhile by some
other thread Y.
Our idea to solve this, is that we introduce an array with n slots, where n is the number of
threads. Before X now reads P, it marks it as hazardous by entering it into the array. After the
CAS, X removes P from the array. If a process Y tries to reuse P, it first checks all entries of
the hazard array, and, if it finds P in there, it simply requests a new pointer for use. Examine
the changed pop() method:
public int pop(int id) {
    Node head, next = null;
    do {
        do {
            head = top.get();
            setHazarduous(head);
        } while (head == null || top.get() != head);
        next = head.next;
    } while (!top.compareAndSet(head, next));
    setHazarduous(null);
    int item = head.item;
    if (!isHazardous(head))
        pool.put(id, head);
    return item;
}
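The code assumes two small helper methods backed by one hazard slot per thread. A minimal sketch (MAX_THREADS and the field name are hypothetical; ThreadID.get() returns the calling thread's index, as elsewhere in this document):
private final AtomicReferenceArray<Node> hazard =
        new AtomicReferenceArray<>(MAX_THREADS);

// publish (or, with null, clear) the calling thread's hazard pointer
void setHazarduous(Node node) {
    hazard.set(ThreadID.get(), node);
}

// check whether any thread has published this node as hazardous
boolean isHazardous(Node node) {
    for (int i = 0; i < hazard.length(); i++)
        if (hazard.get(i) == node)
            return true;
    return false;
}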

The ABA problem also occurs on the node pool. What do we do? We could make the pools
thread-local. This does not help when push/pop operations aren’t well balanced within the
thread. Alternatively, we could just use Hazard pointers on the global node pool.
The Java code above does not really improve performance in comparison to memory allocation plus garbage collection, but it demonstrates how to solve the ABA problem. The ABA problem does not only arise in the context of such performance optimizations, but this is the easiest example.

Chapter 18

Concurrency theory I - Linearizability

For sequential programs, we have learned of the Floyd-Hoare logic to prove correctness. Defining
pre- and postconditions for each method is inherently sequential - can we somehow carry that
forward to a parallel formulation? In this chapter, we define the central aspects of a formulaic
approach to giving certain guarantees about parallel programs. A first definition is that of method calls:
A method call is the interval that starts with an invocation and ends with a response. A
method call is called pending between invocation and response.
Linearizability is a theoretical concept: Each method should appear to take effect instanta-
neously between invocation and response events. We call this a linearization point (in code,
typically a CAS). An object for which this is true for all possible executions is called lineariz-
able, and the object is correct if the associated sequential behaviour is correct. We can take a
look at a particular execution and the question: Is this execution linearizable?

Execution: thread A calls q.enq(x) and later q.deq() → y; thread B calls q.enq(y) and later q.deq() → x. Is this execution linearizable? Yes.

Figure 18.1: A specific execution of a queue


It is highly recommended to understand this concept by examining examples. Some of those examples are found in appendix A, on page 117. Linearization points are also highlighted in those examples. Additionally, refer to assignment 12 for an example exercise.
Of course, these executions abstract away from the actual code. The linearization points can
often be specified, but may depend on the execution instead of the code.

18.1 Histories

First, we need to clarify notation for invocation and response:


The invocation follows the pattern thread-object-method-arguments, the response follows the
pattern thread-object-result. For example, a call of enqueue on the object q by thread A may
look like this:
Invocation: A q.enq(x)
Response: A q:void
We then define a History as a sequence of invocations and responses:

History H = sequence of invocations and responses, for example:
A q.enq(3)
A q:void
A q.enq(5)
B p.enq(4)
B p:void
B q.deq()
B q:3
Invocations and responses match if their thread names agree and their object names agree. An invocation is pending if it has no matching response. A subhistory is complete when it has no pending responses.

Figure 18.2: History and definitions

We can project a history H onto objects or threads, denoted H | q or H | B. Projections consist only of calls that are made onto the respective object or by the respective thread.
Now we can define subtypes of histories and their properties:
A Complete subhistory is a history H without its pending invocations.
A sequential history consists only of non-interleaving method calls of different threads. A
final pending invocation is okay.
A history is well formed if the per-thread projections are sequential. Generally, only well-
formed histories “make sense”.
Two histories H and G are equivalent if the per-thread projections are identical, e.g. for two
threads A and B: H | A = G | A ∧ H | B = G | B.


Now we can decide if a history H is legal: if for every object x the corresponding projection H | x adheres to the sequential specification (i.e., what we want it to do), which we can prove using sequential Hoare logic, then the history is legal.
A method call precedes another call if the response event precedes the invocation event. If
there is no precedence, the method calls overlap. We note that a method execution m0 precedes
m1 on History H as

m0 →H m1

→H is a relation that implies a partial order on H. The order is total when H is sequential.

18.2 Linearizability

A history H is linearizable if it can be extended to a history G by appending zero or more responses to pending invocations that took effect and discarding zero or more pending invocations that did not take effect, such that G is equivalent to a legal sequential history S with →G ⊂ →S.
On the next page, an example that shows what those highly theoretical definitions mean in
practice can be found. Important as to why we care about linearizability that much is its
composability:
Composability Theorem

A History H is linearizable if and only if for every object x, H | x is linearizable.


Consequently, linearizability of objects can be proven in isolation, i.e. independently
implemented objects can be composed.

For locks, the linearization points are just the unlock points. A few more complicated examples
can be found in appendix A on page 119.
As a general guideline: We need to identify one atomic step where the method “happens” - often
the critical section or a machine instruction. Additionally, there may be multiple linearization
points when there are considerations on the state of the object - i.e. is the queue empty or full
etc.


Invocations that took effect: A's pending q.enq(x) cannot be removed, because B's q.deq() → x already took its effect into account. C's pending flag.read() can be removed - nobody relies on it.

Figure 18.3: Invocations that took effect

What does →G ⊂ →S mean? For example, →G = {a → c, b → c} and →S = {a → b, a → c, b → c}. In other words: S respects the real-time order of G; linearizability is a limitation on the possible choice of S.

Figure 18.4: Second part of the definition, visualized

Chapter 19

Concurrency theory II - Sequential consistency

A history H is sequentially consistent if it can be extended to a history G by appending zero or more responses to pending invocations that took effect and discarding zero or more pending invocations that did not take effect, such that G is equivalent to a legal sequential history S.1
This means that operations done by one thread respect program order - but there is no need to
preserve real-time order, i.e. we can’t reorder operations done by the same thread, but we
can reorder operations done by different threads:

Example execution: A: q.enq(x), later q.deq() → y; B: q.enq(y) - yet sequentially consistent!

Figure 19.1: Visualization of what sequential consistency allows

Sequential consistency is not a local property and we lose composability: that each object in a history H is sequentially consistent does not imply that the history is sequentially consistent!
1 The same definition as linearizability without the last condition - sequential consistency is weaker than linearizability.


19.1 A sidenote: Quiescent Consistency

Quiescent consistency is another, incomparable type of consistency. Quiescent consistency essentially means that we can reorder overlapping operations as we choose - but non-overlapping methods need to take place in their real-time order!

Example: A: q.enq(X), then q.deq() → Y; B: q.enq(Y), then q.deq() → X. This example is sequentially consistent but not quiescently consistent.

Figure 19.2: Quiescent vs. Sequential Consistency I

Example: A: q.deq() → X, followed by q.enq(X); B: q.size() → 1. This example is quiescently consistent but not sequentially consistent (note that initially the queue is empty).

Figure 19.3: Quiescent vs. Sequential Consistency II


19.2 Sequential consistency and the real world

As we can see in figure 19.4, we need sequential consistency. In the real world, hardware architects do not adhere to this by default - as we have seen with reordering operations earlier, enforcing it everywhere would simply be too expensive. We need to explicitly announce that we want this property (e.g. with the volatile keyword).

Reminder - consequence for the Peterson lock (flag principle). Entry protocol:
flag[id] = true;
victim = id;
while (flag[1-id] && victim == id);
Each process writes its flag, then victim, and then reads the other process' flag and victim. Sequential consistency implies: at least one of the processes A and B reads flag[1-id] == true, and if both processes read true, then both eventually read the same value for victim.

Figure 19.4: Sequential Consistency for Peterson Lock

Chapter 20

Consensus

Consensus is yet another theoretical object, however one of great importance. Consider a simple
object c which implements the following interface:
public interface Consensus<T> {
T decide (T value);
}

A number of threads now all call c.decide(v) with an input value v.


Consensus

A consensus protocol must be:


• wait-free: consensus returns in finite time for each and every thread
• consistent: all threads decide the same value (i.e. reach consensus)
• valid: the decision value is some thread’s input
A simple implication: Whichever thread’s linearization point is first, gets to decide the
value adopted for all threads.
A class C solves n-thread consensus if there exists a consensus protocol using any number
of objects of class C and any number of atomic registers.
The consensus number of C is the largest n for which C solves n-thread consensus.

Two small theorems: Atomic Registers have consensus number 1. CAS has consensus number
∞. The latter can be shown by construction:
class CASConsensus {
    private final int FIRST = -1;
    private AtomicInteger r = new AtomicInteger(FIRST); // supports CAS
    private AtomicReferenceArray<Object> proposed; // suffices to be atomic registers

    ... // constructor (allocate array proposed etc.)

    public Object decide(Object value) {
        int i = ThreadID.get();
        proposed.set(i, value);
        if (r.compareAndSet(FIRST, i)) // I won
            return proposed.get(i); // = value
        else
            return proposed.get(r.get());
    }
}

Why is consensus this important? It creates the consensus hierarchy, that is, a class-system
of protocols and their respective consensus number:

Consensus number 1: read/write registers
Consensus number 2: getAndSet, getAndIncrement, ..., FIFO queue, LIFO stack
...
Consensus number ∞: compareAndSet, ..., multiple assignment

Figure 20.1: The consensus hierarchy

This is backed by mathematical proof, and can thus help us decide what algorithms are impos-
sible to implement with certain operations: It is simply impossible to implement a (wait-free)
FIFO-Queue with atomic registers. CompareAndSet also cannot be implemented using atomic
registers. In general:
Higher consensus number operations can implement lower consensus number operations. It
is impossible for lower consensus number operations to implement higher consensus number
operations.

Chapter 21

Transactional Memory

We have seen that programming with locks is difficult, and that lock-free programming is even
more difficult. The goal of transactional memory is to remove the burden of synchronization
away from the programmer and place it in the system (be that hardware or software). Ideally,
the programmer only has to say (in the context of our canonical banking system):
atomic {
a.withdraw(amount);
b.deposit(amount);
}

We have already seen that this is the idea behind locks, and it is also the idea behind transac-
tional memory. The difference is the execution - we have already extensively covered why locks
are (sadly) not this convenient. This is where transactional memory (TM) comes in.
The benefits of TM are manifold:
• simpler, less error-prone code
• higher-level semantics (what vs. how)
• composable (unlike e.g. locks)
• analogy to garbage collection
• optimistic by design (does not require mutual exclusion)
In short: TM is awesome.

21.1 TM semantics

Changes made by a transaction are made visible atomically. Other threads observe either the
initial or final state, but no intermediate states.
Transactions run in isolation: While a transaction is happening, effects from other transactions
are not observed - as if the transaction takes a snapshot of the global state when it begins, and
that operates only on that snapshot.
Transactions appear serialized, i.e. as if they had been executed sequentially.
Transactional memory is heavily inspired by database transactions and their properties, ACID,
although the last property is not really important for TM:
• Atomicity


• Consistency (database remains in a consistent state)


• Isolation (no mutual corruption of data)
• Durability (e.g., transaction effects survive power loss)

21.2 Implementing transactional memory

Of course we need a way to actually implement this TM. We could just use the Big-lock approach
of locking every atomic section with one big lock. That isn’t done in practice for obvious reasons.
The other approach (which we are going to use) is to keep track of operations performed by
each transaction - concurrency control - where the system ensures the atomicity and isolation
properties. What does that mean?
As mentioned before, we create a “snapshot” of the current state and make sure that the
transaction only affects a local copy of this state, which can then be either committed or tossed
away. If a transaction which has yet to commit has read a value (at the start of its operation)
that was changed by a transaction that has committed, a conflict may arise. Consider the
following example, where the initial state is a = 0:
// Transaction A
atomic {
    ...
    x = a; // read a
    if (x == 0) {
        // do something
    } else {
        // do something else
    }
}

// Transaction B
atomic {
    ...
    a = 10; // write a
    ...
}

Now assume that transaction B commits the changes it has made before A does. Now, in a
serialized view, the execution with a==0 is invalid!

Serialized view (initially a = 0): TXB, which writes a = 10, commits and is serialized before TXA. TXA should therefore have read a == 10 - executions in which TXA read a == 0 are invalid!

Figure 21.1: A conflict occurs in transactional memory


Issues like this are handled by a concurrency control mechanism. This means, the transaction
has to be aborted, upon which it can either be retried automatically or the user is notified.
Additional care must be taken that inconsistent transactions do not cause things like division
by zero - this would likely throw a global exception! - which would be a global inconsistency.
We could implement TM either in hard- or software. HTM is fast, but has bounded resources
that often cannot handle big transactions. STM allows greater flexibility, but achieving good
performance might be very challenging. Ideally, we would wish for a hybrid TM, but due to the
relatively young age of TM there is no such solution widely available (yet).

21.3 Design choices

A few design choices need to be made when implementing TM:


• Strong vs Weak isolation: What happens when a shared state accessed by a transaction
is also accessed outside of a transaction? Are the transactional guarantees still maintained?
Strong isolation: Yes. This is easier for porting existing code, but more difficult to imple-
ment and may lead to overhead.
Weak isolation: No.
• Nesting: What are the semantics of nested transactions?
Flat/Flattened nesting: Nested transactions are handled as if they are one transaction. If
an inner transaction aborts, the outer transaction(s) abort as well. If the inner commits,
the changes are made visible only if the outer transaction commits.
Closed nesting: Similar to flattened, but an abort of an inner transaction does not result
in an abort for the outer transaction. If the inner commits, the changes are made visible
to the outer transaction, but not to other transactions. Only when the outer transaction
commits do the inner changes become visible to other transactions.
There are other approaches, but the above are the only ones covered by this lecture.

21.4 Scala-STM

This course uses Scala-STM, where mutable state is put into special variables - everything
else is immutable or not shared. We call this reference-based STM. Scala-STM has a Java
interface (which we will use), which sadly does not have compiler support, e.g. for ensuring that
references are only accessed inside a transaction. Our goal is to get a first idea of how to use
STM.
For that, let us start with our banking system.
class AccountSTM {
    private final Integer id; // account id
    private final Ref.View<Integer> balance;

    AccountSTM(int id, int balance) {
        this.id = new Integer(id);
        this.balance = STM.newRef(balance);
    }
}

As we can see, the Scala-STM requires us to use quite a bit of boilerplate code - this is a flaw
inherent to this specific Java implementation, in theory we can just use atomic.


For actually using transactions, we also need to define Runnables and Callables - once again,
this is just annoying boilerplate code. For a full example, see appendix A page 120.
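As a minimal sketch of what a transaction then looks like with this Java interface (assuming withdraw() and deposit() methods on AccountSTM that update the balance Ref):
static void transfer(final AccountSTM a, final AccountSTM b, final int amount) {
    // the Runnable is executed as one transaction: other threads observe
    // either both updates or neither
    STM.atomic(new Runnable() { public void run() {
        a.withdraw(amount);
        b.deposit(amount);
    }});
}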
How do we deal with waiting for a certain condition to come true? With locks, we used condition variables; with TM we use retry: abort the transaction and retry when conditions change. Using our bank accounts again, this time with theoretical notation:
static void transfer_retry(final AccountSTM a, final AccountSTM b, final int amount) {
atomic {
if (a.balance.get() < amount)
STM.retry();
a.withdraw(amount);
b.deposit(amount);
}
}

Usually, implementations of retry track what reads/writes a transaction performed, and when
retry is called, a retry will occur when any of the variables that were read, change. In this
example, when a.balance is updated, the transaction will be retried.

21.5 Simplest STM implementation

In this section we create a very simple, theoretical implementation of STM. Our ingredients
are transactions (with threads to run them) and objects. Transactions can either be active,
aborted or committed. Objects represent state stored in memory (the variables affected by the
transaction), and offer methods like read and write - and can of course be copied.
We wish to create a Clock-based STM-System. This clock is not some real-time clock, but
instead offers an absolute order to transactions and their commits. Why do we need this? Using
a global clock (implemented with locks or similar), we can timestamp transactions’ birth-time
and when exactly a commit has been made.
Each transaction has a local read-set and a local write-set, holding all locally read and written
objects. If a transaction calls read, it checks first if the object is in the write set. If so, it uses
this new version. If it is not in the write set, the transaction checks whether the object’s
timestamp is smaller than its own birth-timestamp, i.e. the last modification happened before
the transaction began. If it is not, it throws an exception, otherwise it adds a new copy to the
read set. Similarly, a call to write simply modifies the write-set by either changing the value
or copying the object into the write-set. In figure 21.2, we can see that transaction T continues
until it reads Z and sees that the modification happened after T’s birth-date.
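A sketch of this read path, with all names hypothetical (read/write sets as maps, objects carrying a timestamp):
Object read(TxObject obj) {
    if (writeSet.containsKey(obj))            // already written by this transaction?
        return writeSet.get(obj).value;       // use our own new version
    if (obj.timestamp > birthdate)            // modified after this transaction began
        throw new TxAbortException();         // inconsistent snapshot: abort (and retry)
    TxObject copy = obj.copy();               // take a private copy
    readSet.put(obj, copy);
    return copy.value;
}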
A commit is a central part of our system. It works as follows:
• All objects of read- and write-set get locked (in specific order to avoid deadlocks)
• Check that all objects in the read set provide a time stamp ≤ birth-date of the transaction,
otherwise abort
• Increment and get the value T of the global clock
• Copy each element of the write set back to global memory with timestamp T
• Release all locks
We can see a commit in figure 21.3. If we were to swap “T writes X” with “T writes Z”, then
the commit would be unsuccessful.


(Timeline: X.date and Y.date lie before the birthdate of T, so T's reads of Y and X succeed and enter its read set; Z.date lies after T's birthdate, so the read of Z fails.)

Figure 21.2: A transaction fails due to outdated variables

(Timeline: T reads Y and X, writes local copies of Y and X into its write set, and commits successfully - every object in its read set was last modified before T's birthdate.)

Figure 21.3: A successful commit

21.6 Dining Philosophers with STM

The dining philosopher problem goes as follows: 5 philosophers sit at a round table. Between
each pair of philosophers, a fork is placed (totalling 5). Every philosopher thinks for some time,
then wants to eat. For this, he needs to acquire the two neighbouring forks. Design a concurrent
algorithm so that no philosopher will starve, i.e. each can continue to forever alternate between
thinking and eating. No communication between the philosophers is allowed.
Solving this problem with TM is very easy! Besides managing the syntax, one only has to make


the philosophers’ “picking-up-forks” routine as follows:


class PhilosopherThread extends Thread {
    ...
    private void pickUpBothForks() {
        STM.atomic(new Runnable() { public void run() {
            if (left.inUse.get() || right.inUse.get())
                STM.retry();
            left.inUse.set(true);
            right.inUse.set(true);
        }});
    }
    ...
}

21.7 Final remarks

TM is not without its issues: The best semantics are not clear (e.g. nesting), getting a good per-
formance can be challenging and also the method of how we should deal with I/O in transactions
(i.e. how would one rollback these changes?) is not clear.
TM is still very much in development. It remains to be seen whether it will be widely adopted in the future and what semantics or type of TM will be used.

Chapter 22

Distributed Memory and Message Passing

Many of the problems of parallel/concurrent programming come from sharing state. What if
we simply avoid this? Functional programming, for example, relies on an immutable state - no
synchronization required!
Message Passing, which will be the main topic of this chapter, has isolated mutable state,
that is, each thread/task has its private, mutable state, and separate tasks only cooperate via
message passing - hence the name.
We differentiate the theoretical programming model (CSP: Communicating Sequential Pro-
cesses and Actor programming model) and the practical framework/library (MPI: Message
Passing Interface).

22.1 Rethinking managing state

We reconsider the state of our bank-account example. In sequential programming, we only had a single balance. In parallel programming (so far), we had a single balance with protection. Now, instead of sharing the state, we have a distributed state: each thread has a local balance, and we exchange messages between threads.
These messages can be divided into synchronous and asynchronous: Synchronous messages
mean that the sender of the message blocks/waits until the message is received. Asynchronous
messages do not block, but are placed into a buffer (“postbox”) for the receiver to get at some
point.

22.2 Actor Model

The actor model uses different actors (i.e. threads) that communicate by directly sending
messages to other actors. This model lends itself very well to event-driven programming: Actors
react to messages, and a program is written as a set of event handlers for events (where events
can be seen as received messages). A good example for this is a GUI: Every button can be a
very small actor, which on click (which can be perceived as message) does something and sends
relevant messages to other actors (e.g. to a window that it can be closed etc.).
An example for this is the functional programming language Erlang. It was initially developed
for distributed fault-tolerant applications, since recovering from errors becomes much easier if
no state is shared. The most one has to do is restart an actor and perhaps make sure that
messages are sent again etc. Consider the following, simple Erlang program:


start() ->
Pid = spawn(fun() -> hello() end),

Pid ! hello,
Pid ! bye.
hello() ->
receive
hello ->
io:fwrite("Hello world\n"),
hello();
bye ->
io:fwrite("Bye cruel world\n"),
ok
end.

This simple program creates a new actor that executes the hello function. Then, the start()
function sends two messages to that actor, “hello” and “bye”. When the actor receives a message,
it is handled similarly to a switch-statement: For “hello” the actor writes something and then
executes the hello() function again. On “bye”, it prints and then exits.

22.3 Communicating Sequential Processes : CSP

CSP was designed as a formal algebra for concurrent systems. Its main difference when compared
to the actor model is the existence of channels: Instead of directly addressing certain actors,
messages are sent to a channel. These channels are more flexible, as they can also be passed to
other processes. CSP was first implemented in 1983 in OCCAM.
A more modern example is Go - a concurrent programming language from Google that is
inspired by CSP. It features lightweight tasks and typed channels for task communications.
These channels are synchronous by default, but asynchronous channels are also supported. If
we recreate the Erlang example in Go, it would look like this:
func main() {
msgs := make(chan string)
done := make(chan bool)

go hello(msgs, done);

msgs <- "Hello"


msgs <- "bye"

ok := <-done

fmt.Println("Done:", ok);
}

func hello(msgs chan string, done chan bool) {


for {
msg := <-msgs
fmt.Println("Got:", msg)
if msg == "bye" {
break
}
}
done <- true;
}


The similarities are apparent. The main difference, as mentioned before, is the existence of
channels, and of course the syntax (go is the equivalent to spawn etc.). In appendix B, on page
128, another example of a concurrent program in Go can be found: A prime sieve.

Chapter 23

Message Passing II - MPI

MPI is a standard application/programming interface (API), meaning it is a portable, flexible library not bound to a particular language. It is the de-facto interface for distributed parallel computing, which is nearly the entirety of high performance computing.
The main concept in MPI are the processes (for us the equivalent to threads), which can be collected into groups. Each group can have multiple colors (sometimes “context”), which together make a communicator. Initially, all processes are in the communicator MPI_COMM_WORLD. Processes communicate with each other inside their respective communicators. Communicators can overlap, i.e. a process can belong to multiple communicators. Processes are identified by a unique number (called “rank”) ranging from 0 to n − 1 (where n is the number of processes within each communicator), meaning the rank is always relative to a communicator!

Communicators (example: mpiexec -np 16 ./test):
• When you start an MPI program, there is one predefined communicator, MPI_COMM_WORLD.
• Communicators do not need to contain all processes in the system.
• Every process in a communicator has an ID called its “rank”.
• One can make copies of a communicator (same group of processes, same ranks, but different “aliases”).
• The same process might have different ranks in different communicators.
• Communicators can be created “by hand” or using tools.
• Simple programs typically only use the predefined communicator MPI_COMM_WORLD (which is sometimes considered bad practice because of modularity issues).

Figure 23.1: Communicators visualized

We can already write a very simple (and very useless) MPI program:
public static void main(String args[]) throws Exception {
    MPI.Init(args);
    // Get total number of processes (p)
    int size = MPI.COMM_WORLD.Size();
    // Get rank of current process (in [0..p-1])
    int rank = MPI.COMM_WORLD.Rank();
    MPI.Finalize();
}

Note that this works as SPMD: Single Program Multiple Data (Multiple Instances). We compile
only one program, which gets executed by multiple different instances.

23.1 Sending and Receiving

Of course, now we need to communicate between those processes. This is achieved by the
Comm.Send function, which is called on a Communicator (not a process!):
void Comm.Send(          // called on a communicator
    Object buf,          // pointer to data to be sent, e.g. an array
    int offset,          // offset within buf
    int count,           // number of items to be sent
    Datatype datatype,   // datatype of items
    int dest,            // destination process id
    int tag              // data id tag
)

Send has a tag argument. This can be used to differentiate between different messages being
sent, for example as numbering.
How do messages get matched, i.e. how do we receive a message? Three things have to match:
The communicator, the tag, and source/dest.
void Comm.Recv(          // called on a communicator
    Object buf,          // pointer to where the received data is stored
    int offset,          // offset within buf
    int count,           // number of items to be received
    Datatype datatype,   // datatype of items
    int source,          // source process id (or MPI_ANY_SOURCE)
    int tag              // data id tag (or MPI_ANY_TAG)
)

A receiver can get a message without knowing the sender or the tag of the message!
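As a minimal sketch in the same Java binding (process 0 sends one integer to process 1 with tag 0; the buffer is an int array and the datatype MPI.INT):
public static void main(String args[]) throws Exception {
    MPI.Init(args);
    int rank = MPI.COMM_WORLD.Rank();
    int[] buf = new int[1];
    if (rank == 0) {
        buf[0] = 42;
        MPI.COMM_WORLD.Send(buf, 0, 1, MPI.INT, 1, 0);   // to rank 1, tag 0
    } else if (rank == 1) {
        MPI.COMM_WORLD.Recv(buf, 0, 1, MPI.INT, 0, 0);   // from rank 0, tag 0
        System.out.println("Got " + buf[0]);
    }
    MPI.Finalize();
}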
One can specify a send operation to be synchronous: Ssend. That means, the method waits
until the message can be accepted by the receiving process before returning. Of course, receive
Recv is synchronous by nature (a message can only be received if it has been sent). Synchronous
routines can perform two actions: Transferring data and synchronizing processes!
Messages can also be passed asynchronously - however then the buffer needs to be stored some-
where, which depending on the MPI implementation might need to be taken care of by the
programmer.
A second concept is blocking and non-blocking sends/receives. These return immediately,
even before the local actions are complete. This assumes that the data storage used for transfer
won’t be modified by subsequent statements until the transfer is complete!
What are the MPI defaults? For send, it is blocking, but the synchronicity is implementation
dependent. Receiving is blocking by default and synchronous by nature.


All is not well though, since we have now introduced the possibility of deadlocks again: If two
processes want to synchronously send to each other at the same time and receive after sending,
they would block each other. Luckily, MPI offers easy solutions: Using Sendrecv allows to do
both statements at the same time, or we can use explicit non-blocking operations. These have
a prefixed “i”. After executing Isend and Irecv, we can instruct a process to wait for the
completion of all non-blocking methods by calling the Waitall method.
Essentially, every single MPI program can be written using a mere six functions:
• MPI_INIT - initialize the MPI library (always the first routine called)
• MPI_COMM_SIZE - get the size of the communicator
• MPI_COMM_RANK - get the rank of the calling process in the communicator
• MPI_SEND - send a message to another process
• MPI_RECV - receive a message from another process
• MPI_FINALIZE - clean up all MPI state (must be the final routine called)
An example code where only these are used can be found in B, on page 128. For performance,
however, we need to use other MPI features.

23.2 Collective Communication

Up until now, we used point-to-point communication. MPI also supports communications among
groups of processors! These collectives will be discussed in this section.
An important notice ahead of time: Collectives need to be called by every process to make sense!
One must not use an if-statement to single out a thread to receive like in point-to-point!
Reduce: Similar to what we heard in the first half of the semester, reduce makes use of an associative operator (e.g. MPI.SUM) to reduce a result from different processes to one (called root).
public void Reduce(
    java.lang.Object sendbuf,
    int sendoffset,
    java.lang.Object recvbuf,
    int recvoffset,
    int count,
    Datatype datatype,
    Op op,
    int root
)

Broadcast: Broadcasts a message from the root process to all other processes in the group.
public void Bcast(
    java.lang.Object buf,
    int offset,
    int count,
    Datatype type,
    int root
)


Allreduce: Similar to reduce, but hands the result to all processes involved. Performs better than reduce followed by broadcast, but has the same effect.
public void Allreduce(
    java.lang.Object sendbuf,
    int sendoffset,
    java.lang.Object recvbuf,
    int recvoffset,
    int count,
    Datatype datatype,
    Op op
)

Gather: Each process sends the contents of its send buffer to the root process.
public void Gather(
    java.lang.Object sendbuf,
    int sendoffset,
    int sendcount,
    Datatype sendtype,
    java.lang.Object recvbuf,
    int recvoffset,
    int recvcount,
    Datatype recvtype,
    int root
)

Scatter: Inverse operation of gather, the root process sends the specified count to each other process (different parts).
public void Scatter(
    java.lang.Object sendbuf,
    int sendoffset,
    int sendcount,
    Datatype sendtype,
    java.lang.Object recvbuf,
    int recvoffset,
    int recvcount,
    Datatype recvtype,
    int root
)

Allgather: Like gather, but hands the result to all processes involved.
public void Allgather(
    java.lang.Object sendbuf,
    int sendoffset,
    int sendcount,
    Datatype sendtype,
    java.lang.Object recvbuf,
    int recvoffset,
    int recvcount,
    Datatype recvtype
)


AllToAll: Extension of Allgather to the case where each process sends distinct data to each of the receivers.
public void Alltoall(
    java.lang.Object sendbuf,
    int sendoffset,
    int sendcount,
    Datatype sendtype,
    java.lang.Object recvbuf,
    int recvoffset,
    int recvcount,
    Datatype recvtype
)

A complete visualization of the different collectives can be found in appendix A, page 121.
A sample program using those collectives can be found on the following page, page 122.
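As a small additional sketch (inside a program that has already called MPI.Init): summing up all ranks into process 0 with Reduce - note again that every process must make this call:
int[] send = { MPI.COMM_WORLD.Rank() };
int[] recv = new int[1];
MPI.COMM_WORLD.Reduce(send, 0, recv, 0, 1, MPI.INT, MPI.SUM, 0);
if (MPI.COMM_WORLD.Rank() == 0)
    System.out.println("Sum of all ranks: " + recv[0]);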

Chapter 24

Parallel Sorting - Sorting Networks

Recall what we know about sorting: the lower bound for comparison-based sorting is Ω(n log n).1
The basic building block for sorting algorithms is the comparator. Using the following notation,
we can visualize sorting networks:

void compare(int[] a, int i, int j, boolean dir) {
    if (dir == (a[i] > a[j])) {
        int t = a[i];
        a[i] = a[j];
        a[j] = t;
    }
}

Figure 24.1: Visualizing Comparators

Note that sorting networks are data-oblivious: They function the exact same for each and every
input. They can also be redundant: Performing “unnecessary” comparisons. This makes it
far easier to reason about them, since there is no worst- or best-case scenario - they are one and
the same!
1 In computer science, assuming limited bit-width, one can also construct specialized algorithms that actually achieve O(n).



Figure 24.2: Sorting Networks - An example

Sorting networks are basically a sequence of comparisons that either swap the elements that are
compared or leave them the way they are. One can construct such a network recursively, as
can be seen in figure 24.3. To improve our parallel algorithms, we can once again argue about
depth and width of our graph (see appendix A page 123) - One relatively simple improvement
is to implement Odd-Even Transposition Sort. We then compare, in alternating fashion, odd
indices with even indices, then even with odds - in numbers, first we compare index 1 to 2, 3 to
4 etc., then 2 to 3, 4 to 5 etc. This sorting network has the same width as the sorting network
for bubblesort, the same number of comparisons but a smaller depth - only n.
In general, there is no easy way to get a sorting network. Even for fairly small n (n > 10), no size-optimal sorting networks are known.
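The odd-even transposition network described above can be written down directly with the comparator from figure 24.1; a minimal sketch (0-based indices, ascending order):
void oddEvenTranspositionSort(int[] a) {
    int n = a.length;
    for (int round = 0; round < n; round++) {
        // all comparisons within one round are independent and could run in parallel
        int start = (round % 2 == 0) ? 0 : 1;   // alternate between the two pairings
        for (int i = start; i + 1 < n; i += 2)
            compare(a, i, i + 1, true);          // dir = true: sort ascending
    }
}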
How would we prove the correctness of sorting networks? Enter the Zero-one-principle:
“If a network with n input lines sorts all 2^n sequences of 0s and 1s into non-decreasing order, it will sort any arbitrary sequence of n numbers in non-decreasing order.”
The proof for this theorem has been visualized in figure 24.4.
This principle can now be used to reduce the number of cases for a proof by exhaustion from n! down to “only” 2^n.


(Diagram: a sorting network for x1, ..., xn, followed by a chain of comparators that inserts x(n+1) into the sorted output - the recursive construction corresponding to insertion sort.)

Figure 24.3: Insertion Sort recursively constructed as sorting network

Proof sketch: Let f be a monotonic function, i.e. f(x) ≤ f(y) whenever x ≤ y. If a network N transforms (x1, x2, ..., xn) into (y1, y2, ..., yn), then it also transforms (f(x1), f(x2), ..., f(xn)) into (f(y1), f(y2), ..., f(yn)). Now assume some input x is not sorted by N, i.e. yi > yi+1 for some i, and consider the monotonic function f with f(x) = 0 if x < yi and f(x) = 1 if x ≥ yi. Then N transforms (f(x1), ..., f(xn)) into a 0/1-sequence that contains f(yi) = 1 directly before f(yi+1) = 0, which is not sorted. Hence, if N sorts all 0/1-sequences, it sorts every input.

Figure 24.4: Proof of the Zero-One-Principle

Note: There exists a sorting algorithm - bitonic sort - which, with enough processors, sorts in parallel time below the sequential lower bound for comparison sorts. Its time complexity in sequential execution is O(n log² n), in parallel time O(log² n).
Sorting networks were only allotted 45 minutes of lecture time, and thus this is a very, very shallow look at them.

Part III

Appendices
Appendix A

Slides

Attached are numerous slides that offer a good overview over certain problems, or things that
are just easier to visualize with full-page diagrams. Full credit goes to the lecturers that put
these slides together!

UncaughtExceptionHandlers: Example

public class ExceptionHandler implements UncaughtExceptionHandler {
    public Set<Thread> threads = new HashSet<>();

    @Override
    public void uncaughtException(Thread thread, Throwable throwable) {
        println("An exception has been captured");
        println(thread.getName());
        println(throwable.getMessage());
        threads.add(thread);
    }
}

public class Main {
    public static void main(String[] args) {
        ...
        ExceptionHandler handler = new ExceptionHandler();
        thread.setUncaughtExceptionHandler(handler);
        ...
        thread.join();
        if (handler.threads.contains(thread)) {
            // bad
        } else {
            // good
        }
    }
}
Thread state model in Java (repetition)
http://pervasive2.morselli.unimo.it/~nicola/courses/IngegneriaDelSoftware/java/J5e_multithreading.html

Thread States in Java

• NEW: the thread has not yet started.
• RUNNABLE: the thread is runnable; it may or may not be currently scheduled by the OS.
• BLOCKED: the thread is waiting for entry to a monitor lock; it becomes RUNNABLE once the monitor is free and has been obtained.
• WAITING: the thread is waiting for a condition or a join (entered via wait or join, left via notify/notifyAll or when the joined thread finishes).
• TIMED_WAITING: a waiting state with a specified waiting time, e.g. sleep.
• TERMINATED: the thread has finished execution.
Designing a pipeline: 1st Attempt
Washing clothes – Unbalanced Pipeline
(lets consider 5 washing loads)
Time (s) 0 5 10 15 20 25 30 35 40 45 50 55 60 65 70
Takes 5 seconds. We use “w” for Washer next. Load #
Load 1 w d d f c c
Load 2 w _ d d f c c
Load 3 w _ _ d d f c c
Takes 10 seconds. We use “d” for Dryer next. Load 4 w _ _ _ d d f c c
Load 5 w _ _ _ _ d d f c c

The total time for all 5 loads is 70 seconds.


Takes 5 seconds. We use “f” for Folding next.
This pipeline can work, however it cannot bound the latency of a Load as it keeps
growing. If we want to bound this latency, one approach is to make each stage take
Takes 10 seconds. We use “c” for Closet next. as much time as the longest one, thus balancing it. In our example, the longest
time is 10 seconds, so we can do the following:

Make Pipeline balanced by increasing time for


Designing a pipeline: 2nd Attempt
each stage to match longest stage
Time (s) 0 10 20 30 40 50 60 70 80 90 100 110 60 65 70 75 80 85 90
Now takes 10 seconds. Load #
Load 1 w d f c
Load 2 w d f c
Load 3 w d f c
Takes 10 seconds, as before. Load 4 w d f c
Load 5 w d f c

This pipeline is a bit wasteful, but the latency is bound at 40 seconds for each Load.
Now takes 10 seconds.
Throughput here is about 1 load / 10 seconds, so about 6 loads / minute.

So now we have the total time for all 5 loads at 80 seconds, higher than before.
Takes 10 seconds, as before.
Can we somehow get a bound on latency while improving the time/throughput?

Step 2: and also, like in the 2nd pipeline, make each stage take as much time
Step 1: make the pipeline from 1st attempt a bit more fine-grained: as the longest stage does from Step 1 [this is 6 seconds due to d2 and c2]

Like in the 1st attempt, this takes 5 seconds. It now takes 6 seconds.

Lets have 2 dryers working in a row. The first dryer is referred to as d1


and takes 4 seconds, the second as d2 and takes 6 sec.
Each of d1 and d2 dryers take 6 seconds.

Like in the 1st attempt, it takes 5 seconds.


Now takes 6 seconds.

Lets have 2 closets working in a row. The first closet is referred to as c1


and takes 4 seconds, the second as c2 and takes 6 sec. Each of c1 and c2 closets now take 6 seconds.

Designing a pipeline: 3rd Attempt


(lets consider 5 washing loads)
Time (s) 0 6 12 18 24 30 36 42 48 54 60 110 60 65 70 75 80 85 90
Load #
Load 1 w d1 d2 f c1 c2
Load 2 w d1 d2 f c1 c2
Load 3 w d1 d2 f c1 c2
Load 4 w d1 d2 f c1 c2
Load 5 w d1 d2 f c1 c2

The bound on latency for each load is now: 6 * 6 = 36 seconds.

The throughput is approximately: 1 load / 6 seconds = ~ 10 loads / minute.

The total time for all 5 loads is 60 seconds.

Speedup

Efficiency

(These two slides contain only the defining formulas, which were images.)
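As a reminder, the standard definitions (not taken from the slides): if $T_p$ denotes the execution time on $p$ processors, then

$$S_p = \frac{T_1}{T_p} \quad\text{(speedup)}, \qquad E_p = \frac{S_p}{p} = \frac{T_1}{p\,T_p} \quad\text{(efficiency)}.$$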

Mutual exclusion for 2 processes -- 1st Try

volatile boolean wantp = false, wantq = false

Process P (process Q is symmetric, with the roles of wantp and wantq swapped):

p1  non-critical section
p2  while (wantq);
p3  wantp = true
p4  critical section
p5  wantp = false

Do you see the problem? The state space diagram over [p, q, wantp, wantq] is large, but only the state transitions of the protocol are of interest: p1/q1 behaves like p2/q2 (call it state 2) and p4/q4 like p5/q5 (call it state 5). Forbidden is that both processes are in state 5 at the same time. The reduced state space diagram (only states 2, 3 and 5) shows that the state (p5, q5, true, true) is reachable: both processes can pass their while loop before either one has set its flag. No mutual exclusion!

Mutual exclusion for 2 processes -- 2nd Try

Each process now first announces its interest and only then waits for the other:

p1  non-critical section
p2  wantp = true
p3  while (wantq);
p4  critical section
p5  wantp = false

Do you see the problem? The state (p3, q3, true, true) is reachable: both processes set their flag and then spin forever on the other one's flag. Deadlock!

Mutual exclusion for 2 processes -- 3rd Try

volatile int turn = 1;

Process P                      Process Q
p1  non-critical section       q1  non-critical section
p2  while (turn != 1);         q2  while (turn != 2);
p3  critical section           q3  critical section
p4  turn = 2                   q4  turn = 1

Do you see the problem? We have not made any assumptions about progress outside of the critical section: if Q stays in its non-critical section forever, turn remains 2 and P waits forever. Starvation!
Intervals

[a0, a1]: interval of events a0, a1 with a0 → a1. With IA = (a0, a1) and IB = (b0, b1) we write IA → IB if a1 → b0. In the timeline on the slide, IA → IB and IB → IA', while IB' ↛ IA' and IA' ↛ IB'. We say "IA precedes IB" and "IB' and IA' are concurrent".

Example

Three threads A, B, C operate on a shared register r: A performs r.read() → 1 and later r.write(8); B performs r.write(4) and later r.read() → 4; C performs r.write(1) and later r.read() → 8. Each operation interval (labelled J, K, L, M, N, O) is assigned a single point in time τ(J), τ(K), τ(M), τ(N), τ(L), τ(O) on the time axis, which orders the overlapping operations.
Proof: Mutual exclusion (Peterson)

Lock code for process P (Q symmetric):

flag[P] = true
victim = P
while (flag[Q] && victim == P) {}
CS_P
flag[P] = false

By contradiction: assume concurrent CS_P and CS_Q [A]. Assume without loss of generality that W_Q(victim = Q) → W_P(victim = P) [B], i.e. P wrote victim last. Then victim stays P, so P's read of victim returns P [C]; by A and C, P's read of flag[Q] must have returned false for P to leave its loop.

From the code and the transitivity of "→":

W_P(flag[P] = true) → W_P(victim = P) → R_P(flag[Q]) → R_P(victim) → CS_P
W_Q(flag[Q] = true) → W_Q(victim = Q) → R_Q(flag[P]) → R_Q(victim) → CS_Q

Combining the second chain with B gives W_Q(flag[Q] = true) → W_Q(victim = Q) → W_P(victim = P) → R_P(flag[Q]), and flag[Q] is not reset before Q leaves its critical section, so P's read of flag[Q] must return true, a contradiction.

Proof: Freedom from starvation (Peterson)

By (exhaustive) contradiction. Assume without loss of generality that P runs forever in its lock loop, waiting until flag[Q] == false or victim != P. Possibilities for Q:
- Q is stuck in its non-critical section ⇒ flag[Q] = false and P can continue. Contradiction.
- Q repeatedly enters and leaves its CS ⇒ Q sets victim to Q when entering. Now victim cannot be changed back to P ⇒ P can continue. Contradiction.
- Q is stuck in its own lock loop, waiting until flag[P] == false or victim != Q. But victim == P and victim == Q cannot hold at the same time. Contradiction.
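For reference, a minimal Java sketch of the Peterson lock from these proofs (not from the slides). Note that elements of a plain boolean[] are not volatile in Java, so this sketch does not by itself guarantee the visibility and ordering that the proofs assume; a faithful implementation would use, e.g., an AtomicIntegerArray for the flags.

// Sketch only: thread ids are assumed to be 0 and 1.
class PetersonLock {
    private final boolean[] flag = new boolean[2]; // "I want to enter" (not truly volatile!)
    private volatile int victim;                   // who yields if both want to enter

    public void lock(int me) {
        int other = 1 - me;
        flag[me] = true;      // announce interest
        victim = me;          // let the other one go first
        while (flag[other] && victim == me) { /* spin */ }
    }

    public void unlock(int me) {
        flag[me] = false;
    }
}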

Hardware support for atomic operations: Example (x86)

CMPXCHG mem, reg: "compares the value in register A with the value in a memory location. If the two values are equal, the instruction copies the value in the second operand to the first operand and sets the ZF flag in the flags register to 1. Otherwise it copies the value in the first operand to the A register and clears the ZF flag to 0." (From the AMD64 Architecture Programmer's Manual.) "The LOCK prefix causes certain kinds of memory read-modify-write instructions to occur atomically."

R. Hudson: IA memory ordering: https://www.youtube.com/watch?v=WUfvvFD5tAA (2008)

Hardware support for atomic operations: Example (ARM)

LDREX <rd>, <rn>: "loads a register from memory and, if the address has the shared memory attribute, marks the physical address as exclusive access for the executing processor in a shared monitor."

STREX <rd>, <rm>, <rn>: "performs a conditional store to memory. The store only occurs if the executing processor has exclusive access to the memory addressed." (From the ARM Architecture Reference Manual.)
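In Java these primitives are not used directly; they back the compareAndSet operations in java.util.concurrent.atomic. A minimal sketch (the class name CasCounter is just for illustration) of a lock-free counter built on compare-and-set:

import java.util.concurrent.atomic.AtomicInteger;

class CasCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            // the CAS fails (and we retry) if another thread changed the value in between
            if (value.compareAndSet(current, next)) {
                return next;
            }
        }
    }
}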

Deadlock

The slide shows the classic resource graph: P owns a resource that Q requires, and Q owns a resource that P requires. This is a cyclic wait, so neither process can make progress.

Rendezvous with Semaphores: wrong solution with deadlock

Goal: synchronize processes P and Q at one location (rendezvous). Assume semaphores P_Arrived and Q_Arrived, both initialized to 0.

             P                       Q
pre          ...                     ...
rendezvous   acquire(Q_Arrived)      acquire(P_Arrived)
             release(P_Arrived)      release(Q_Arrived)
post         ...                     ...

Both processes first wait for the other one's signal before signalling themselves, exactly the cyclic wait from above: deadlock.

Rendezvous with Semaphores: working solution

             P                       Q
pre          ...                     ...
rendezvous   release(P_Arrived)      acquire(P_Arrived)
             acquire(Q_Arrived)      release(Q_Arrived)
post         ...                     ...

Scheduling scenarios (a release signals, an acquire may wait): if P arrives first, it signals P_Arrived and then waits in acquire(Q_Arrived) until Q arrives and releases; if Q arrives first, it waits in acquire(P_Arrived) until P arrives. In both cases both processes pass the rendezvous.

That's even better

             P                       Q
pre          ...                     ...
rendezvous   release(P_Arrived)      release(Q_Arrived)
             acquire(Q_Arrived)      acquire(P_Arrived)
post         ...                     ...

Each process first signals its own arrival and only then waits for the other. Whichever process arrives second finds the other's semaphore already released, so at most the first arrival ever blocks, regardless of whether P or Q comes first.
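A hedged Java sketch of the symmetric rendezvous using java.util.concurrent.Semaphore (the class and method names are made up for illustration):

import java.util.concurrent.Semaphore;

class Rendezvous {
    private final Semaphore pArrived = new Semaphore(0);
    private final Semaphore qArrived = new Semaphore(0);

    void p() throws InterruptedException {
        // pre ...
        pArrived.release();   // signal own arrival
        qArrived.acquire();   // wait for the other process
        // post ...
    }

    void q() throws InterruptedException {
        // pre ...
        qArrived.release();
        pArrived.acquire();
        // post ...
    }
}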

Barrier

Synchronize a number n of processes, using a semaphore barrier and an integer count. Invariants: each of the processes eventually reaches the acquire statement; the barrier will be opened if and only if all processes have reached it; count provides the number of processes that have passed the barrier; when all processes have reached the barrier, then all waiting processes can continue.

First attempt (init barrier = 0; count = 0):

barrier:   count++
           if (count == n) release(barrier)
           acquire(barrier)

Recap (race condition): count++ is not atomic. It compiles to reg = x; reg = reg + 1; x = reg, and interleaving two such sequences on a shared variable loses updates. The cure is mutual exclusion around this critical section. In addition, the barrier semaphore is released only once, so only a single process ever gets past acquire(barrier): the remaining processes deadlock, violating the invariant that all waiting processes can continue once everyone has reached the barrier.

Barrier with mutual exclusion and a turnstile (init mutex = 1; barrier = 0; count = 0):

barrier:   acquire(mutex)
           count++
           release(mutex)
           if (count == n) release(barrier)
           acquire(barrier)   // turnstile:
           release(barrier)   // each process lets the next one through

Reusable Barrier, 1st trial. To make the barrier reusable, count the processes out again after the turnstile:

           acquire(mutex)
           count--
           release(mutex)
           if (count == 0) acquire(barrier)

Illustration of the problem (scheduling scenario): starting with barrier = 0, the processes increment count up to n = 3; now two of them evaluate if (count == n) with count = 3 and both release the barrier, so barrier ends up at 2; after all processes have passed the turnstile, barrier is still 2. The invariants "only when all processes have reached the turnstile will it be opened the first time" and "when all processes have run through the barrier, then count = 0 and barrier = 0" are violated: the checks happen outside the mutex, a race condition.

Reusable Barrier, 2nd trial. Move the checks inside the mutex:

barrier:   acquire(mutex)
           count++
           if (count == n) release(barrier)
           release(mutex)

           acquire(barrier)
           release(barrier)

           acquire(mutex)
           count--
           if (count == 0) acquire(barrier)
           release(mutex)

Do you see the problem? A process can pass the other processes: it can run through the turnstile, finish, and re-enter the barrier before the slower processes have left the turnstile. The invariant "even when a single process has passed the barrier, it holds that barrier = 0" is violated.
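A hedged Java sketch of the (non-reusable) barrier with mutex and turnstile for n participating threads; in production code, java.util.concurrent.CyclicBarrier already provides a correct reusable barrier.

import java.util.concurrent.Semaphore;

class SimpleBarrier {
    private final int n;               // number of participating threads
    private int count = 0;             // threads that have arrived so far
    private final Semaphore mutex = new Semaphore(1);
    private final Semaphore barrier = new Semaphore(0);

    SimpleBarrier(int n) { this.n = n; }

    void await() throws InterruptedException {
        mutex.acquire();
        count++;
        if (count == n) barrier.release();  // last arrival opens the barrier
        mutex.release();
        barrier.acquire();                  // turnstile:
        barrier.release();                  // let the next waiting thread through
    }
}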

Skip list property

Sublist relationship between levels: higher-level lists are always contained in lower-level lists. The lowest level is the entire list.

add(6)
- Find predecessors (lock-free)
- Lock predecessors
- Validate (cf. lazy synchronisation)
- Splice
- Mark fully linked
- Unlock

remove(5)
- Find predecessors
- Lock victim
- Logically remove victim (mark)
- Lock predecessors and validate
- Physically remove
- Unlock

contains(8)
- Sequential find(), not logically removed, and fully linked
- Even if other nodes are removed, the node stays reachable
- contains is wait-free (while add and remove are not)
ABA Problem

Scenario on a lock-free stack that recycles its nodes through a pool: thread X is in the middle of pop(), after reading top (pointing to A) and its next pointer, but before the CAS. Meanwhile thread Y pops A, thread Z pushes B, and thread Z' pushes A again (the node for A is reused from the pool, now with B below it). When X resumes, its compareAndSet on top succeeds because top again points to A, but the stack has changed underneath it and B is lost.

public Long pop() {
    Node head, next;
    do {
        head = top.get();
        if (head == null) return null;
        next = head.next;
    } while (!top.compareAndSet(head, next));
    Long item = head.item;
    pool.put(head);
    return item;
}

public void push(Long item) {
    Node head;
    Node node = pool.get(item);   // node recycled from the pool
    do {
        head = top.get();
        node.next = head;
    } while (!top.compareAndSet(head, node));
}
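One standard mitigation (a hedged sketch, not from the slides) is to pair the top pointer with a version stamp via java.util.concurrent.atomic.AtomicStampedReference, so a recycled node with the same address no longer looks "unchanged" to the CAS:

import java.util.concurrent.atomic.AtomicStampedReference;

class StampedStack {
    static class Node {
        final Long item;
        Node next;
        Node(Long item) { this.item = item; }
    }

    private final AtomicStampedReference<Node> top =
            new AtomicStampedReference<>(null, 0);

    public void push(Long item) {
        Node node = new Node(item);
        int[] stamp = new int[1];
        Node head;
        do {
            head = top.get(stamp);
            node.next = head;
        } while (!top.compareAndSet(head, node, stamp[0], stamp[0] + 1));
    }

    public Long pop() {
        int[] stamp = new int[1];
        while (true) {
            Node head = top.get(stamp);
            if (head == null) return null;
            Node next = head.next;
            // succeeds only if both the reference and the stamp are unchanged
            if (top.compareAndSet(head, next, stamp[0], stamp[0] + 1)) {
                return head.item;
            }
        }
    }
}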
Linearizability: example histories

A sequence of timeline diagrams (threads A and B, FIFO queue q, then a read/write register) asks for each execution whether it is linearizable:

- A: q.enq(x), later q.deq() → y; B: q.enq(y), later q.deq() → x, with the enqueues overlapping. Yes: the overlapping operations can be ordered so that each dequeue is legal for a FIFO queue.
- A: q.enq(x), later q.deq() → y; B: q.enq(y), where q.enq(x) completes before q.enq(y) starts. No: x is first in the queue, so the dequeue must not return y.
- Two further queue histories with overlapping enqueues and dequeues. Yes: in both cases valid linearization points can be placed.
- Register history: A performs write(0) and later write(2); B performs write(1) and later read() → 1. Linearizable.
- Register history: A performs write(0), read() → 1, write(2); B performs write(1) and a later read() → 1. No: A's read() → 1 means write(1) must have happened before it, and B's second read, which starts after write(2) has completed, would then have to return 2.

Reasoning About Linearizability (Locking)

public T deq() throws EmptyException {
    lock.lock();
    try {
        if (tail == head)
            throw new EmptyException();
        T x = items[head % items.length];
        head++;
        return x;
    } finally {
        lock.unlock();
    }
}

For lock-based code, the linearization points are when the locks are released.

Reasoning About Linearizability (Wait-free example)

class WaitFreeQueue {
    volatile int head = 0, tail = 0;
    AtomicReferenceArray<T> items =
        new AtomicReferenceArray<T>(capacity);

    public boolean enq(T x) {
        if (tail - head == capacity) return false;  // linearization point of a failed enq
        items.set(tail % capacity, x);
        tail++;                                     // linearization point (for the only enqueuer)
        return true;
    }

    public T deq() {
        if (tail - head == 0) return null;          // linearization point of a failed deq
        T x = items.get(head % capacity);
        head++;                                     // linearization point (for the only dequeuer)
        return x;
    }
}

Reasoning About Linearizability (Lock-free example)

public T dequeue() {
    while (true) {
        Node first = head.get();
        Node last = tail.get();
        Node next = first.next.get();               // linearization point when the queue is observed empty
        if (first == last) {
            if (next == null) return null;
            else tail.compareAndSet(last, next);
        } else {
            T value = next.item;
            if (head.compareAndSet(first, next))    // linearization point of a successful dequeue
                return value;
        }
    }
}
Bank account (ScalaSTM)

class AccountSTM {
    private final Integer id;                 // account id
    private final Ref.View<Integer> balance;

    AccountSTM(int id, int balance) {
        this.id = new Integer(id);
        this.balance = STM.newRef(balance);
    }
    ...
}

Real world: bank account in ScalaSTM

void withdraw(final int amount) {
    // assume that there are always sufficient funds...
    STM.atomic(new Runnable() { public void run() {
        int old_val = balance.get();
        balance.set(old_val - amount);
    }});
}

void deposit(final int amount) {
    STM.atomic(new Runnable() { public void run() {
        int old_val = balance.get();
        balance.set(old_val + amount);
    }});
}

GetBalance (return a value):

public int getBalance() {
    int result = STM.atomic(
        new Callable<Integer>() {
            public Integer call() {
                int result = balance.get();
                return result;
            }
        });
    return result;
}
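The benefit of STM is that such operations compose. A hedged sketch (not from the slides) of a transfer that wraps both updates in one atomic block, assuming, as ScalaSTM allows, that nested atomic blocks are flattened into the enclosing transaction:

static void transfer(final AccountSTM from, final AccountSTM to, final int amount) {
    STM.atomic(new Runnable() { public void run() {
        // both updates commit together or not at all
        from.withdraw(amount);
        to.deposit(amount);
    }});
}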
Collective Computation - Reduce

public void Reduce(java.lang.Object sendbuf, int sendoffset,
                   java.lang.Object recvbuf, int recvoffset,
                   int count, Datatype datatype, Op op, int root)

With root = rank 0: P0 contributes A, P1 contributes B, P2 contributes C, P3 contributes D, and P0 receives A+B+C+D. The related Scan is a prefix reduction: P0 gets A, P1 gets A+B, P2 gets A+B+C, P3 gets A+B+C+D.

Collective Data Movement - Broadcast

The root's buffer A is copied to every rank: all of P0 to P3 end up with A.

Collective Computation - Allreduce

public void Allreduce(java.lang.Object sendbuf, int sendoffset,
                      java.lang.Object recvbuf, int recvoffset,
                      int count, Datatype datatype, Op op)

Like Reduce, but every rank receives the result A+B+C+D. Useful in a situation in which all of the processes need the result of a global sum in order to complete some larger computation.

Collective Data Movement - Scatter/Gather

Scatter distributes the blocks A, B, C, D of a buffer on the root to the individual ranks; Gather collects them back onto the destination rank. Scatter can be used in a function that reads in an entire vector on process 0 but only sends the needed components to each of the other processes. Gather collects all of the components of the vector onto the destination process, which can then process all of the components.

More Collective Data Movement (16 functions in total!)

Allgather: every rank contributes one block and ends up with the full sequence A B C D.
Alltoall: rank i sends its j-th block to rank j, effectively a transpose of the data blocks across ranks.
Matrix-Vector-Multiply

Compute y = A · x, where A is the 3x3 matrix with rows (1 2 3), (4 5 6), (7 8 9) and x = (10, 20, 30), so that y_i is row i of A times x. Assume A and x are available only at rank 0.

1. Broadcast x: every rank (P0, P1, P2) gets (10, 20, 30).
2. Scatter A row-wise: P0 keeps (1 2 3), P1 gets (4 5 6), P2 gets (7 8 9).
3. Compute locally: P0 computes 1*10 + 2*20 + 3*30 = 140, P1 computes 320, P2 computes 500.
4. Gather the result y at rank 0: (140, 320, 500).
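A hedged sketch of these four steps in the mpiJava style used elsewhere in this document; the Bcast/Scatter/Gather parameter lists follow the same convention as the Reduce signature above, but should be checked against the actual Java MPI binding. Three ranks and one matrix row per rank are assumed.

MPI.Init(args);
int rank = MPI.COMM_WORLD.Rank();

double[] x = new double[3];      // the vector, significant at rank 0
double[] A = new double[9];      // the full matrix (row-major), significant at rank 0
double[] row = new double[3];    // this rank's row of A
double[] local = new double[1];  // this rank's component of y
double[] y = new double[3];      // the result, gathered at rank 0

if (rank == 0) {
    x[0] = 10; x[1] = 20; x[2] = 30;
    double[] init = {1, 2, 3, 4, 5, 6, 7, 8, 9};
    System.arraycopy(init, 0, A, 0, 9);
}

// 1. Broadcast x to all ranks
MPI.COMM_WORLD.Bcast(x, 0, 3, MPI.DOUBLE, 0);
// 2. Scatter the rows of A (3 doubles per rank)
MPI.COMM_WORLD.Scatter(A, 0, 3, MPI.DOUBLE, row, 0, 3, MPI.DOUBLE, 0);
// 3. Compute the local dot product
local[0] = row[0] * x[0] + row[1] * x[1] + row[2] * x[2];
// 4. Gather the partial results into y at rank 0
MPI.COMM_WORLD.Gather(local, 0, 1, MPI.DOUBLE, y, 0, 1, MPI.DOUBLE, 0);

MPI.Finalize();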
Sorting networks are data-oblivious (and redundant)

A data-oblivious comparison tree performs the same comparisons (1:2, 3:4, 1:3, 2:4, ...) regardless of the input values; many of the resulting cases are redundant. A sorting network is exactly such a data-oblivious arrangement of comparators.

Recursive construction: Insertion / Selection

A sorting network for n inputs plus a chain of comparators for one additional input x_{n+1} yields a network for n + 1 inputs. Inserting the new element after the sub-network corresponds to insertion sort; selecting the extreme element before the sub-network corresponds to selection (bubble) sort. Applied recursively, and with parallelism: insertion sort = bubble sort!

Question

How many steps (depth) does a computer with an infinite number of processors (comparators) require in order to sort using parallel bubble sort? Answer: 2n - 3. Can this be improved?
How many comparisons? Answer: (n-1)·n/2.
How many comparators are required (at a time)? Answer: n/2 (with reusable comparators: n-1).

Improving parallel Bubble Sort: Odd-Even Transposition Sort

In round i, all pairs (j, j+1) with j ≡ i (mod 2) are compared (and swapped if necessary) in parallel. The slide shows the example sequence 9 8 2 7 3 1 5 6 4 being sorted this way.

void oddEvenTranspositionSort(int[] a, boolean dir) {
    int n = a.length;
    for (int i = 0; i < n; ++i) {
        for (int j = i % 2; j + 1 < n; j += 2)
            compare(a, j, j + 1, dir);
    }
}
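The compare() helper is not shown on the slides; a plausible implementation (an assumption, with dir = true meaning ascending order) swaps the two elements whenever they are out of order:

static void compare(int[] a, int i, int j, boolean dir) {
    // for ascending order (dir == true), swap if a[i] > a[j];
    // for descending order, swap if a[i] < a[j]
    if (dir == (a[i] > a[j])) {
        int tmp = a[i];
        a[i] = a[j];
        a[j] = tmp;
    }
}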

Last lecture -- basic exam tips

- First of all, read all instructions.
- Then, read the whole exam paper through.
- Look at the number of points for each question: this shows how long we think it will take to answer!
- Find one you know you can answer, and answer it. This will make you feel better early on.
- Watch the clock! If you are taking too long on a question, consider dropping it and moving on to another one.
- Always show your working.
- You should be able to explain most of the slides. Tip: form learning groups and present the slides to each other.
- If something is unclear: ask your friends, read the book (Herlihy and Shavit for the second part), ask your TAs.
Appendix B

Code-snippets

B.1 Skip list

B.1.1 Constructor, fields and node class

public final class LazySkipList<T> {


static final int MAX_LEVEL = ...;
final Node<T> head = new Node<T>(Integer.MIN_VALUE);
final Node<T> tail = new Node<T>(Integer.MAX_VALUE);
public LazySkipList() {
for (int i = 0; i < head.next.length; i++) {
head.next[i] = tail;
}
}
...
private static final class Node<T> {
final Lock lock = new ReentrantLock();
final T item;
final int key;
final Node<T>[] next;
volatile boolean marked = false;
volatile boolean fullyLinked = false;
private int topLevel;
public Node(int key) { // sentinel node constructor
this.item = null;
this.key = key;
next = new Node[MAX_LEVEL + 1];
topLevel = MAX_LEVEL;
}
public Node(T x, int height) {
item = x;
key = x.hashCode();
next = new Node[height + 1];
topLevel = height;
}
public void lock() {
lock.lock();
}
public void unlock() {
lock.unlock();
}
}


B.1.2 find() method


int find(T x, Node<T>[] preds, Node<T>[] succs) {
int key = x.hashCode();
int lFound = -1;
Node<T> pred = head;
for (int level = MAX_LEVEL; level >= 0; level--) {
Node<T> curr = pred.next[level];
while (key > curr.key) {
pred = curr; curr = pred.next[level];
}
if (lFound == -1 && key == curr.key) {
lFound = level;
}
preds[level] = pred;
succs[level] = curr;
}
return lFound;
}

B.1.3 add() method


boolean add(T x) {
int topLevel = randomLevel();
Node<T>[] preds = (Node<T>[]) new Node[MAX_LEVEL + 1];
Node<T>[] succs = (Node<T>[]) new Node[MAX_LEVEL + 1];
while (true) {
int lFound = find(x, preds, succs);
if (lFound != -1) {
Node<T> nodeFound = succs[lFound];
if (!nodeFound.marked) {
while (!nodeFound.fullyLinked) {}
return false;
}
continue;
}
int highestLocked = -1;
try {
Node<T> pred, succ;
boolean valid = true;
for (int level = 0; valid && (level <= topLevel); level++) {
pred = preds[level];
succ = succs[level];
pred.lock.lock();
highestLocked = level;
valid = !pred.marked && !succ.marked && pred.next[level]==succ;
}
if (!valid) continue;
Node<T> newNode = new Node(x, topLevel);
for (int level = 0; level <= topLevel; level++)
newNode.next[level] = succs[level];
for (int level = 0; level <= topLevel; level++)
preds[level].next[level] = newNode;
newNode.fullyLinked = true; // successful add linearization point

return true;
} finally {
for (int level = 0; level <= highestLocked; level++)
preds[level].unlock();
}
}
}

B.1.4 remove() method


boolean remove(T x) {
Node<T> victim = null; boolean isMarked = false; int topLevel = -1;
Node<T>[] preds = (Node<T>[]) new Node[MAX_LEVEL + 1];
Node<T>[] succs = (Node<T>[]) new Node[MAX_LEVEL + 1];
while (true) {
int lFound = find(x, preds, succs);
if (lFound != -1) victim = succs[lFound];
if (isMarked ||
(lFound != -1 &&
(victim.fullyLinked
&& victim.topLevel == lFound
&& !victim.marked))) {
if (!isMarked) {
topLevel = victim.topLevel;
victim.lock.lock();
if (victim.marked) {
victim.lock.unlock();
return false;
}
victim.marked = true;
isMarked = true;
}
int highestLocked = -1;
try {
Node<T> pred, succ; boolean valid = true;
for (int level = 0; valid && (level <= topLevel); level++) {
pred = preds[level];
pred.lock.lock();
highestLocked = level;
valid = !pred.marked && pred.next[level]==victim;
}
if (!valid) continue;
for (int level = topLevel; level >= 0; level--) {
preds[level].next[level] = victim.next[level];
}
victim.lock.unlock();
return true;
} finally {
for (int i = 0; i <= highestLocked; i++) {
preds[i].unlock();
}
}
} else return false;
}
}


B.1.5 contains() method


boolean contains(T x) {
Node<T>[] preds = (Node<T>[]) new Node[MAX_LEVEL + 1];
Node<T>[] succs = (Node<T>[]) new Node[MAX_LEVEL + 1];
int lFound = find(x, preds, succs);
return (lFound != -1
&& succs[lFound].fullyLinked
&& !succs[lFound].marked);
}

B.2 Concurrent prime sieve in Go


func main() {
ch := make(chan int)
go Generate(ch)
for i := 0; i < 10; i++ {
prime := <-ch
fmt.Println(prime)
ch1 := make(chan int)
go Filter(ch, ch1, prime)
ch = ch1
}
}

func Generate(ch chan<- int) {


for i := 2; ; i++ {
ch <- i
}
}
func Filter(in <-chan int, out chan<- int, prime int) {
for {
i := <-in // Receive value from ’in’.
if i%prime != 0 {
out <- i // Send ’i’ to ’out’.
}
}
}

B.3 Calculating Pi in MPI

We use a mathematical formula to compute an approximation of π: the integral of 4/(1 + x²) from 0 to 1 equals π, and it is approximated by a midpoint sum over numSteps rectangles of width h = 1/numSteps. Since the result is a sum, it can be parallelized by letting each rank compute an independent, smaller partial sum.
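Concretely, the sum computed in the code below corresponds to

$$\pi = \int_0^1 \frac{4}{1+x^2}\,dx \;\approx\; h \sum_{i=0}^{\text{numSteps}-1} \frac{4}{1 + x_i^2}, \qquad x_i = (i + 0.5)\,h, \quad h = \frac{1}{\text{numSteps}},$$

where rank r computes the terms with i ≡ r (mod size).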
MPI.Init(args);
// declare and initialize variables (sum=0 etc.)
int size = MPI.COMM_WORLD.Size();
int rank = MPI.COMM_WORLD.Rank();

for(int i=rank; i<numSteps; i=i+size) {


double x=(i + 0.5) * h;
sum += 4.0/(1.0 + x*x);
}

if (rank != 0) {
double [] sendBuf = new double []{sum}; // 1-element array containing sum
MPI.COMM_WORLD.Send(sendBuf, 0, 1, MPI.DOUBLE, 0, 10);
}
else { // rank == 0
double [] recvBuf = new double [1] ;
for (int src=1; src<size; src++) {
MPI.COMM_WORLD.Recv(recvBuf, 0, 1, MPI.DOUBLE, src, 10);
sum += recvBuf[0];
}
}
double pi = h * sum; // output pi at rank 0 only!
MPI.Finalize();
