Technische Universität München
Fundamental Algorithms
Chapter 3: Parallel Algorithms – The PRAM Model
Michael Bader
Winter 2014/15
M. Bader: Fundamental Algorithms, Chapter 3: Parallel Algorithms – The PRAM Model, Winter 2014/15
Example: Parallel Sorting
Definition
Sorting is required to order a given sequence of elements, or more
precisely:
Input: a sequence of n elements a1, a2, . . . , an
Output: a permutation (reordering) a′1, a′2, . . . , a′n of the input
sequence, such that a′1 ≤ a′2 ≤ · · · ≤ a′n.
A naive(?) solution:
pairwise comparison of all elements
count wins for each element to obtain its position
use one processor for each comparison!
A (Naive?) Parallel Example: AccumulateSort
AccumulateSort (A: Array[1..n]) {
    Create Array P[1..n] of Integer;
    // all P[i] = 0 at start
    for 1 <= i, j <= n and i < j do in parallel {
        if A[i] > A[j]
            then P[i] := P[i] + 1
            else P[j] := P[j] + 1;
    }
    for i from 1 to n do in parallel {
        A[ P[i] + 1 ] := A[i];
    }
}
AccumulateSort Discussion
Implementation:
do all n² comparisons at once and in parallel
use n² processors
count wins for each element; then move them to their
respective rank
complexity: TAS(n) = Θ(1) on n(n − 1)/2 processors
Assumptions:
all read accesses to A can be done in parallel
increments of P[i] and P[j] can be done in parallel
second for-loop is executed after the first one (on all processors)
all moves A[ P[i]+1 ] := A[i] happen in one atomic step
(no overwrites due to sequential execution)
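Under these assumptions, the scheme can be checked with a sequential Python simulation (a sketch; the two phases that a PRAM would execute in one parallel step each are written here as ordinary loops):

```python
def accumulate_sort(a):
    """Sequential simulation of AccumulateSort.

    Phase 1: simulate the n(n-1)/2 parallel comparisons by counting
    "wins" for each element; phase 2: simulate the parallel move step.
    """
    n = len(a)
    p = [0] * n  # p[i] = number of comparisons won by a[i]
    for i in range(n):
        for j in range(i + 1, n):
            if a[i] > a[j]:
                p[i] += 1
            else:
                p[j] += 1
    # element i moves to rank p[i] (0-based here, A[P[i]+1] above)
    result = [None] * n
    for i in range(n):
        result[p[i]] = a[i]
    return result
```

Note that ties are broken by index (for equal elements, the one with the larger index collects the win), so all ranks p[i] are distinct and no element is overwritten in the move phase.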
Example: Parallel Searching
Definition (Search Problem)
Input: an array A of n elements, and an element x ∈ A.
Output: the (smallest) index i ∈ {1, . . . , n} with x = A[i].
An immediate solution:
use n processors
on each processor: compare x with A[i]
return matching index/indices i
Simple Parallel Searching
ParSearch (A: Array[1..n], x: Element): Integer {
    for i from 1 to n do in parallel {
        if x = A[i] then return i;
    }
}
Discussion:
Can all n processors access x simultaneously?
exclusive or concurrent read
What happens if more than one processor finds an x?
exclusive or concurrent write (of multiple returns)
general approach: parallelisation by competition
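The competition approach can be simulated sequentially in Python (a sketch; the concurrent-write question is resolved here by keeping the smallest matching index, one common convention):

```python
def par_search(a, x):
    """Sequential simulation of ParSearch (returns a 1-based index).

    On a PRAM, all comparisons x = A[i] would run in one step; if
    several processors find a match, the smallest index wins here.
    """
    matches = [i + 1 for i in range(len(a)) if a[i] == x]
    return min(matches) if matches else None
```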
Towards Parallel Algorithms
First Problems and Questions:
parallel read access to variables possible?
parallel write access (or increments?) to variables possible?
are parallel/global copy statements realistic?
how do we synchronise parallel executions?
Reality vs. Theory:
on real hardware: probably lots of restrictions
(e.g., no parallel reads/writes; no global operations on or access
to memory)
in theory: if there were no such restrictions, how far can we get?
or: for different kinds of restrictions, how far can we get?
The PRAM Models
[Figure: processors P1, P2, P3, . . . , Pn under a central control, all connected to a shared memory]
Concurrent or Exclusive Read/Write Access:
EREW – exclusive read, exclusive write
CREW – concurrent read, exclusive write
ERCW – exclusive read, concurrent write
CRCW – concurrent read, concurrent write
Exclusive/Concurrent Read and Write Access
[Figure: access patterns – exclusive read/write: each memory cell X1, . . . , X6 is accessed by at most one processor per step; concurrent read/write: several processors access the same cells X and Y in one step]
The PRAM Models (2)
[Figure: the same PRAM architecture – processors P1, . . . , Pn with shared memory – now with the central control issuing a single instruction stream (SIMD)]
Underlying principle for parallel hardware architecture:
strict single instruction, multiple data (SIMD)
All parallel instructions of a parallelized loop are performed
synchronously (applies even to simple if-statements)
Loops and If-Statements in PRAM Programs
Lockstep Execution of parallel for:
Parallel for-loops (i.e., with extension in parallel) are executed in lockstep.
Any instruction in a parallel for-loop is executed at the same time (and in sync) by all
involved processors.
If an instruction consists of several substeps, all substeps are executed in sync.
If an if-then-else statement appears in a parallel for-loop, all processors first evaluate the
condition at the same time. Then, all processors on which the condition evaluates to true
execute the then branch. Finally, all processors on which the condition evaluates to false
execute the else branch.
Lockstep Example:
for i from 1 to n do in parallel {
    if U[i] > 0
        then F[i] := ( U[i] - U[i-1] ) / dx
        else F[i] := ( U[i+1] - U[i] ) / dx
    end if
}
First, all processors perform the comparison U[i] > 0.
All processors where U[i] > 0 then compute F[i]; note that first all processors read U[i] and
then all processors read U[i-1] (substeps!); hence, there is no concurrent read access!
Afterwards, the else-part is executed in the same manner by all processors with U[i] <= 0.
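The lockstep semantics can be made explicit in a sequential Python simulation (a sketch; U is assumed to be padded so that U[i-1] and U[i+1] exist for all interior points):

```python
def lockstep_flux(u, dx):
    """Sequential simulation of the lockstep if-then-else example.

    Substep 1: all processors evaluate U[i] > 0 at the same time.
    Substep 2: the then-branch runs on all "true" processors.
    Substep 3: the else-branch runs on all "false" processors.
    u[0] and u[-1] act as padding; fluxes are computed for u[1..n].
    """
    n = len(u) - 2
    f = [0.0] * n
    cond = [u[i] > 0 for i in range(1, n + 1)]       # substep 1
    for i in range(1, n + 1):                        # substep 2 (then)
        if cond[i - 1]:
            f[i - 1] = (u[i] - u[i - 1]) / dx
    for i in range(1, n + 1):                        # substep 3 (else)
        if not cond[i - 1]:
            f[i - 1] = (u[i + 1] - u[i]) / dx
    return f
```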
Parallel Search on an EREW PRAM
ToDos for exclusive read and exclusive write:
avoid concurrent access to x
replicate x for all processors (broadcast)
determine smallest index of all elements found:
determine minimum in parallel
Broadcast on the PRAM:
copy x into all elements of an array X[1..n]
note: each processor can only produce one copy per step
Broadcast on the PRAM – Copy Scheme
[Copy scheme for n = 8: starting from one copy, the number of copies of the value 5 doubles in each step]
5
5 5
5 5 5 5
5 5 5 5 5 5 5 5
Broadcast on the PRAM – Implementation
BroadcastPRAM (x: Element, A: Array[1..n]) {
    // n assumed to be 2^k
    // Model: EREW PRAM
    A[1] := x;
    for i from 0 to k-1 do
        for j from 2^i + 1 to 2^(i+1) do in parallel {
            A[j] := A[ j - 2^i ];
        }
}
Complexity: T(n) = Θ(log n) on n/2 processors
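A sequential Python simulation of the doubling scheme (a sketch; in step i, the sources A[j - 2^i] and the targets A[j] never overlap, which is what makes the scheme EREW):

```python
import math

def broadcast_pram(x, n):
    """Sequential simulation of BroadcastPRAM; n must be a power of 2.

    Returns the array A[1..n] (as a 0-based Python list) with x
    copied into every cell after k = log2(n) doubling steps.
    """
    a = [None] * (n + 1)  # 1-based indexing; a[0] unused
    a[1] = x
    k = int(math.log2(n))
    for i in range(k):
        # on a PRAM, this inner loop runs in parallel
        for j in range(2**i + 1, 2**(i + 1) + 1):
            a[j] = a[j - 2**i]
    return a[1:]
```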
Minimum Search on the PRAM – Binary Fan-In
4 7 3 9 5 6 10 8
4 3 5 8
3 5
3
Minimum on the PRAM – Implementation
MinimumPRAM (A: Array[1..n]): Integer {
    // n assumed to be 2^k
    // Model: EREW PRAM
    for i from 1 to k do
        for j from 1 to n/(2^i) do in parallel {
            if A[2j - 1] < A[2j]
                then A[2j] := A[2j - 1];
            end if;
            A[j] := A[2j];
        }
    return A[1];
}
Complexity: T(n) = Θ(log n) on n/2 processors
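The binary fan-in can likewise be checked with a sequential Python simulation (a sketch):

```python
import math

def minimum_pram(values):
    """Sequential simulation of MinimumPRAM; len(values) must be a
    power of 2.

    In round i, pairs (A[2j-1], A[2j]) are compared; the smaller
    value ends up in A[j], halving the problem size each round.
    """
    a = [None] + list(values)  # 1-based working copy
    n = len(values)
    k = int(math.log2(n))
    for i in range(1, k + 1):
        # on a PRAM, this inner loop runs in parallel
        for j in range(1, n // 2**i + 1):
            if a[2*j - 1] < a[2*j]:
                a[2*j] = a[2*j - 1]
            a[j] = a[2*j]
    return a[1]
```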
Binary Fan-In (2)
Comment: Concerned about the synchronous copy statement? Modify the stride!
4 7 3 9 5 6 10 8
4 3 5 8
3 5
3
Searching on the PRAM – Parallel Implementation
SearchPRAM (A: Array[1..n], x: Element): Integer {
    // n assumed to be 2^k
    // Model: EREW PRAM
    BroadcastPRAM (x, X[1..n]);
    for i from 1 to n do in parallel {
        if A[i] = X[i]
            then X[i] := i;
            else X[i] := n + 1;  // (invalid index)
        end if;
    }
    return MinimumPRAM (X[1..n]);
}
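Combining both building blocks gives a compact sequential Python simulation (a sketch; broadcast and fan-in minimum are written as plain Python here, since the parallel versions above compute the same values):

```python
def search_pram(a, x):
    """Sequential simulation of SearchPRAM (returns a 1-based index).

    Broadcast phase: every "processor" gets its own copy of x.
    Comparison phase: processor i writes its index i on a match,
    n+1 (an invalid index) otherwise.
    Minimum phase: the smallest written value is the answer.
    """
    n = len(a)
    xs = [x] * n                                 # broadcast
    idx = [i + 1 if a[i] == xs[i] else n + 1     # compare
           for i in range(n)]
    result = min(idx)                            # fan-in minimum
    return result if result <= n else None
```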
The Prefix Problem
Definition (Prefix Problem)
Input: an array A of n elements ai.
Output: all terms a1 ∘ a2 ∘ · · · ∘ ak for k = 1, . . . , n.
∘ may be any associative operation.
Straightforward serial implementation:
Prefix (A: Array[1..n]) {
    // in-place computation:
    for i from 2 to n do {
        A[i] := A[i-1] ∘ A[i];
    }
}
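In Python, with the associative operation passed as a parameter (a sketch):

```python
from operator import add

def prefix(a, op=add):
    """Serial in-place prefix computation; op must be associative."""
    for i in range(1, len(a)):
        a[i] = op(a[i - 1], a[i])
    return a
```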
The Prefix Problem – Divide and Conquer
Idea:
1. compute prefix problem for A1 , . . . , An/2
gives A1:1 , . . . , A1:n/2
2. compute prefix problem for An/2+1 , . . . , An
gives An/2+1:n/2+1 , . . . , An/2+1:n
3. multiply A1:n/2 with all An/2+1:n/2+1 , . . . , An/2+1:n
gives A1:n/2+1 , . . . , A1:n
Parallelism:
steps 1 and 2 can be computed in parallel (divide)
all multiplications in step 3 can be computed in parallel
recursive extension leads to parallel prefix scheme
Parallel Prefix – Divide and Conquer
A1   A2     A3     A4     A5     A6     A7     A8
A1   A1:2   A3     A3:4   A5     A5:6   A7     A7:8
A1   A1:2   A1:3   A1:4   A5     A5:6   A5:7   A5:8
A1   A1:2   A1:3   A1:4   A1:5   A1:6   A1:7   A1:8
Parallel Prefix Scheme on a CREW PRAM
Additional Feature: In-Place Computation, Pin Elements to Cores
A1 A8 A2 A7 A3 A6 A4 A5
A1 A7:8 A1:2 A7 A3 A5:6 A3:4 A5
A1 A5:8 A1:2 A5:7 A1:3 A5:6 A1:4 A5
A1 A1:8 A1:2 A1:7 A1:3 A1:6 A1:4 A1:5
Outlook: Parallel Prefix on Distributed Memory
Consider scheme from previous slide:
A1 A8 A2 A7 A3 A6 A4 A5
A1 A7:8 A1:2 A7 A3 A5:6 A3:4 A5
A1 A5:8 A1:2 A5:7 A1:3 A5:6 A1:4 A5
A1 A1:8 A1:2 A1:7 A1:3 A1:6 A1:4 A1:5
Execution on Distributed Memory:
Each color corresponds to one compute node
Nodes cannot directly access matrices from a node with different colour
explicit data transfer (communication) required
Properties of the Distributed-Memory Parallel Prefix Scheme:
In-place computation; A[1:n] will overwrite A[n]; all A[j:n] stored on the same node
One of the two multiplied matrices is always local
Still, n/2 outgoing messages from A[1:n/2] in the last step (bottleneck!)
Parallel Prefix – CREW PRAM Implementation
PrefixPRAM (A: Array[1..n]) {
    // n assumed to be 2^k
    // Model: CREW PRAM (n/2 processors)
    for l from 0 to k-1 do
        for p from 2^l by 2^(l+1) to n do in parallel
            for j from 1 to 2^l do in parallel {
                A[p + j] := A[p] ∘ A[p + j];
            }
}
Comments:
p- and j-loop together: n/2 multiplications per l-loop
concurrent read access to A[p] in the innermost loop
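A sequential Python simulation of the CREW scheme (a sketch; within one l-step, all 2^l processors of a group read the same cell A[p], hence the concurrent read):

```python
import math
from operator import add

def prefix_crew(values, op=add):
    """Sequential simulation of the CREW parallel prefix scheme;
    len(values) must be a power of 2 and op must be associative.
    """
    a = [None] + list(values)  # 1-based working copy
    n = len(values)
    k = int(math.log2(n))
    for l in range(k):
        # both inner loops run in parallel on a PRAM
        for p in range(2**l, n + 1, 2**(l + 1)):
            for j in range(1, 2**l + 1):
                # a[p] is read by 2^l processors at once (CREW)
                a[p + j] = op(a[p], a[p + j])
    return a[1:]
```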
Parallel Prefix Scheme on an EREW PRAM
A1 A2 A3 A4 A5 A6 A7 A8
A1 A1:2 A2:3 A3:4 A4:5 A5:6 A6:7 A7:8
A1 A1:2 A1:3 A1:4 A2:5 A3:6 A4:7 A5:8
A1 A1:2 A1:3 A1:4 A1:5 A1:6 A1:7 A1:8
Parallel Prefix – EREW PRAM Implementation
PrefixPRAM (A: Array[1..n]) {
    // n assumed to be 2^k
    // Model: EREW PRAM (n-1 processors)
    for l from 0 to k-1 do
        for j from 2^l + 1 to n do in parallel {
            tmp[j] := A[ j - 2^l ];
            A[j] := tmp[j] ∘ A[j];
        }
}
Comment:
all processors execute tmp[j] := A[j - 2^l] (in lockstep) before the multiplication!
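A sequential Python simulation of the EREW scheme (a sketch; the tmp[] array separates the read substep from the write substep, so no memory cell is accessed by two processors in the same substep):

```python
import math
from operator import add

def prefix_erew(values, op=add):
    """Sequential simulation of the EREW parallel prefix scheme;
    len(values) must be a power of 2 and op must be associative.
    """
    a = [None] + list(values)  # 1-based working copy
    n = len(values)
    k = int(math.log2(n))
    for l in range(k):
        # substep 1: all processors read A[j - 2^l] in lockstep
        tmp = {j: a[j - 2**l] for j in range(2**l + 1, n + 1)}
        # substep 2: all processors write A[j] in lockstep
        for j in range(2**l + 1, n + 1):
            a[j] = op(tmp[j], a[j])
    return a[1:]
```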