Pap 3 Shared Memory Algos
Programming
Email: [email protected]
Website: tropars.github.io
References
The content of this lecture is inspired by:
Parallel algorithms (Chapter 1) by H. Casanova, Y. Robert, A. Legrand.
A survey of parallel algorithms for shared-memory machines by R.
Karp, V. Ramachandran.
Parallel Algorithms by G. Blelloch and B. Maggs.
Data Parallel Thinking by K. Fatahalian
Outline
The PRAM model
Need for a model
A parallel algorithm
Defines multiple operations to be executed in each step
Includes communication/coordination between the processing units
The problem
A wide variety of parallel architectures
Different number of processing units
Multiple network topologies
Parallel RAM
A shared central memory
A set of processing units (PUs)
Any PU can access any memory location in one unit of time
The number of PUs and the size of the memory is unbounded
Details about the PRAM model
Lock-step execution
A 3-phase cycle:
1. Read memory cells
2. Run local computations
3. Write to the shared memory
All PUs execute these steps synchronously
No need for explicit synchronization
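The lock-step cycle can be illustrated with a minimal Python sketch. The helper `pram_step` below is a hypothetical simulation (not part of the PRAM model itself): every PU reads the old memory state, computes its writes locally, and all writes are applied together at the end of the step.

```python
# Minimal sketch (hypothetical helper): simulating one PRAM lock-step
# cycle in Python. Each PU first reads, then computes, then writes; the
# write phase starts only after every PU has finished reading.

def pram_step(memory, pus, local_step):
    """Apply one synchronous read-compute-write cycle.

    memory: shared memory as a dict {address: value}
    pus: iterable of PU indices
    local_step: function (pu, snapshot) -> dict of writes {address: value}
    """
    # Phases 1+2: every PU reads the *old* memory and computes its writes.
    pending = [local_step(i, memory.copy()) for i in pus]
    # Phase 3: all writes are applied together (exclusive writes assumed).
    for writes in pending:
        memory.update(writes)
    return memory

# Example: every PU doubles its own cell in one synchronous step.
mem = {0: 1, 1: 2, 2: 3}
pram_step(mem, range(3), lambda i, read: {i: 2 * read[i]})
print(mem)  # {0: 2, 1: 4, 2: 6}
```

Because every PU computes against the same snapshot, no PU can observe a half-updated memory state, which is exactly what the synchronous model guarantees.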
About the CRCW model
Semantics of concurrent writes:
Arbitrary mode: select one value from the concurrent writes
Priority mode: select the value of the PU with the lowest index
Fusion mode: a commutative and associative operation is applied to the values (logical OR, AND, sum, maximum, etc.)
CRCW > CREW > EREW
A model is more powerful if there is one problem for which this model
allows implementing a strictly faster solution with the same number of PUs
Some shared-memory
algorithms
List ranking
Description of the problem
A linked list of n objects
Doubly-linked list
We want to compute the distance of each element to the end of the list
List ranking
# One pointer-jumping step written as a single assignment
# (next[] is read and written in the same statement):
forall i in parallel:
    next[i] = next[next[i]]

# The same step split into a read phase and a write phase,
# using a temporary array to separate reads from writes:
forall i in parallel:
    temp[i] = next[next[i]]
forall i in parallel:
    next[i] = temp[i]
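The full list-ranking algorithm combines pointer jumping with a distance array. A minimal Python sketch, simulated sequentially (the function name `list_rank` and the snapshot arrays are illustrative assumptions, not from the lecture):

```python
# Hedged sketch of list ranking by pointer jumping, simulated sequentially.
# next_[i] is the successor of element i (None at the tail); d[i] ends up
# holding the distance of element i to the end of the list.

def list_rank(next_):
    n = len(next_)
    next_ = list(next_)
    d = [0 if next_[i] is None else 1 for i in range(n)]
    # Each round emulates one synchronous PRAM step: read the old next/d,
    # then write the new values, so no PU sees a half-updated pointer.
    while any(p is not None for p in next_):
        old_next, old_d = list(next_), list(d)
        for i in range(n):  # "forall i in parallel"
            if old_next[i] is not None:
                d[i] = old_d[i] + old_d[old_next[i]]
                next_[i] = old_next[old_next[i]]
    return d

# List 0 -> 1 -> 2 -> 3 (3 is the tail):
print(list_rank([1, 2, 3, None]))  # [3, 2, 1, 0]
```

Each round halves the remaining pointer distance to the tail, so the algorithm terminates after O(log n) rounds.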
Comments on the previous algorithm
About the termination test
Note that the test in the while loop can be done in constant time only
in the CRCW model.
The problem is having all PUs share the result of their local test
(next[i] != None):
In a CW model, all PUs can write to the same variable and a fusion
operation can be used.
In an EW model, the results of the tests can only be aggregated two-by-two,
leading to a solution with a complexity in O(log n) for this operation.
Point to root
Description of the problem
A tree data structure
Each node should get a pointer to the root
PointToRoot(P):
    for k in 1..ceiling(log(sizeof(P))):
        forall i in parallel:
            P[i] = P[P[i]]
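A minimal Python simulation of PointToRoot (the convention that the root points to itself is an assumption made for this sketch):

```python
# Hedged sketch of PointToRoot, simulated sequentially. P[i] is the parent
# of node i; the root points to itself. After ceil(log2(n)) doubling rounds
# every node points directly to the root.
import math

def point_to_root(P):
    P = list(P)
    for _ in range(math.ceil(math.log2(len(P)))):
        old = list(P)                 # emulate the synchronous read phase
        for i in range(len(P)):       # "forall i in parallel"
            P[i] = old[old[i]]
    return P

# A chain 3 -> 2 -> 1 -> 0, with root 0:
print(point_to_root([0, 0, 1, 2]))  # [0, 0, 0, 0]
```

Each round doubles the distance that every pointer skips, which is why ceil(log2(n)) rounds suffice even for a degenerate chain.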
Divide and conquer
Split the problem into sub-problems that can be solved independently
Merge the solutions
Example: Mergesort
Mergesort(A):
    if sizeof(A) is 1:
        return A
    else:
        Do in parallel:
            L = Mergesort(A[0 .. sizeof(A)/2])
            R = Mergesort(A[sizeof(A)/2 .. sizeof(A)])
        return Merge(L, R)
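The recursive structure above can be written in plain Python. This is a sequential sketch for illustration; on a PRAM, the two recursive calls would execute in parallel:

```python
# Hedged sketch of Mergesort in Python. The two recursive calls are
# independent, so on a PRAM they can run in parallel; here they run
# sequentially for illustration.

def merge(L, R):
    out, i, j = [], 0, 0
    while i < len(L) and j < len(R):
        if L[i] <= R[j]:
            out.append(L[i]); i += 1
        else:
            out.append(R[j]); j += 1
    return out + L[i:] + R[j:]

def mergesort(A):
    if len(A) <= 1:
        return A
    mid = len(A) // 2
    # On a PRAM these two calls execute in parallel ("Do in parallel").
    L = mergesort(A[:mid])
    R = mergesort(A[mid:])
    return merge(L, R)

print(mergesort([5, 2, 4, 1, 3]))  # [1, 2, 3, 4, 5]
```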
Analysis of PRAM models
Comparison of PRAM models
CRCW vs CREW
To compare CRCW and CREW, we consider a reduce operation over n
elements with an associative operation.
Example: the sum of n elements
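In CRCW fusion mode, all n PUs can write their value to the same cell in a single step, so the reduce takes O(1) time; without concurrent writes, values must be combined pairwise. A minimal sketch of the pairwise (tree) reduction, simulated sequentially (the helper name `tree_reduce` is an assumption for this example):

```python
# Hedged sketch of a tree reduction without concurrent writes: values are
# combined pairwise, halving the number of active elements each round, so
# reducing n elements takes ceil(log2(n)) synchronous steps.

def tree_reduce(values, op):
    vals = list(values)
    while len(vals) > 1:
        # One synchronous round: PU k combines vals[2k] and vals[2k+1].
        pairs = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2 == 1:          # an odd leftover carries over
            pairs.append(vals[-1])
        vals = pairs
    return vals[0]

print(tree_reduce([1, 2, 3, 4, 5], lambda a, b: a + b))  # 15
```

This is the same two-by-two aggregation pattern as the termination test discussed earlier, which is why exclusive-write models pay an O(log n) cost where CRCW fusion mode pays O(1).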
Comparison of PRAM models
CREW vs EREW
To compare CREW and EREW, we consider the problem of determining
whether an element e belongs to a set (e1, ..., en).
Solution with CREW:
A boolean res is initialized to false and n PUs are used
PU k runs the test (ek == e )
If one PU finds e, it sets res to true
Solution with EREW:
Same algorithm except e cannot be read simultaneously by multiple
PUs
n copies of e should be created (broadcast)
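The broadcast proceeds by doubling: each PU that already holds a copy of e writes one new copy per round, so n copies exist after O(log n) steps with no cell read or written concurrently. A minimal sequential sketch (the function name `erew_broadcast` is an assumption for this example):

```python
# Hedged sketch of an EREW broadcast: the number of copies of e at most
# doubles each round, so n private copies are available after
# ceil(log2(n)) steps; each PU then reads only its own copy.

def erew_broadcast(e, n):
    copies = [e]
    while len(copies) < n:
        # One round: each PU holding a copy writes exactly one new copy,
        # so no memory cell is accessed concurrently.
        copies += copies[: n - len(copies)]
    return copies

copies = erew_broadcast(42, 5)
print(copies)  # [42, 42, 42, 42, 42]

# Membership test: PU k compares its private copy with element k.
x = [7, 42, 9, 1, 3]
print(any(copies[k] == x[k] for k in range(5)))  # True
```

The broadcast dominates the cost: the test itself is one parallel step, but EREW pays O(log n) to distribute e, whereas CREW reads e concurrently for free.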
Limits of the PRAM model
Unrealistic memory model
Constant-time access to all memory locations
Synchronous execution
Removes some flexibility
Study of Parallel scans
Scans (Prefix sums)
Description of the problem
Inputs:
A sequence of elements x1, x2, ..., xn
An associative operation *
Output:
A sequence of elements y1, y2, ..., yn such that yk = x1 * x2 * ... * xk
Scan(L):
    forall i in parallel:  # initialization
        y[i] = x[i]
    for k in 1..ceiling(log(sizeof(L))):
        forall i in parallel:
            if next[i] != None:
                y[next[i]] = y[i] * y[next[i]]
                next[i] = next[next[i]]
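A minimal Python simulation of this list-based scan, using addition as the operation * and storing the list in array order (next[i] = i+1); the snapshot arrays emulate the synchronous read-then-write discipline (the function name `list_scan` is an assumption for this sketch):

```python
# Hedged sketch of the list-based scan, simulated sequentially with
# addition as the associative operation. Each round applies the y-update
# and the pointer jump against the *old* state, as in a synchronous
# PRAM step.
import math

def list_scan(x):
    n = len(x)
    y = list(x)
    nxt = [i + 1 if i + 1 < n else None for i in range(n)]
    for _ in range(math.ceil(math.log2(n))):
        old_y, old_nxt = list(y), list(nxt)
        for i in range(n):               # "forall i in parallel"
            if old_nxt[i] is not None:
                y[old_nxt[i]] = old_y[i] + old_y[old_nxt[i]]
                nxt[i] = old_nxt[old_nxt[i]]
    return y

print(list_scan([1, 2, 3, 4]))  # [1, 3, 6, 10]
```

Each element's successor pointer is unique, so the writes to y are exclusive; only the reads of y are concurrent.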
Scans (Prefix sums)
Performance of this algorithm
Work: W(n) = O(n log n)
Depth: D(n) = O(log n)
Parallel scan with 2 processing units
Solution
Scan(L):
    # input: x; output: y
    # first phase: each PU sequentially scans one half of the input
    half = sizeof(L)/2
    for i in 0..1 in parallel:
        SequentialScan(x[half*i .. half*(i+1)-1])
    # second phase: add the total of the first half to the second half
    base = y[half-1]
    quarter = half / 2
    for i in 0..1 in parallel:
        add base to elems in y[half+quarter*i .. half+quarter*(i+1)-1]
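A minimal Python simulation of this two-phase scan for p = 2 PUs, using addition as the operation (the function name `two_pu_scan` is an assumption; the input length is assumed divisible by 4 for simplicity):

```python
# Hedged sketch of the two-phase scan for 2 PUs, simulated sequentially
# with addition as the operation. Assumes len(x) is divisible by 4.

def two_pu_scan(x):
    n = len(x)
    half = n // 2
    y = [0] * n
    # First phase: each PU sequentially scans one half of the input.
    for i in (0, 1):                      # "in parallel"
        acc = 0
        for j in range(half * i, half * (i + 1)):
            acc += x[j]
            y[j] = acc
    # Second phase: add the total of the first half (the last prefix of
    # the first half) to the second half, one quarter per PU.
    base = y[half - 1]
    quarter = half // 2
    for i in (0, 1):                      # "in parallel"
        for j in range(half + quarter * i, half + quarter * (i + 1)):
            y[j] += base
    return y

print(two_pu_scan([1, 2, 3, 4, 5, 6, 7, 8]))  # [1, 3, 6, 10, 15, 21, 28, 36]
```

Note the work split: the first phase is a pair of sequential scans of n/2 elements each, and the second phase is n/2 additions shared by the two PUs.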
The algorithm with a larger depth and less work per iteration
performs better with up to 16 PUs