Chapter 02

PRAM
A Model of Serial Computation (RAM)
Time & Space Complexities
The PRAM Model of Parallel Computation
Various Models of PRAM
PRAM Algorithms
Parallel Reduction
Parallel Prefix Sums
Parallel Merge Sort
Reducing the Number of Processors
Brent’s Theorem
Complexity Theory & PRAM
PRAM

PRAM stands for Parallel Random Access Machine.
It is unrealistically simple; it ignores the complexity of interprocessor communication. Why?
Because it allows parallel-algorithm designers to treat processing power as an unlimited resource.
RAM

The RAM is a theoretical model of serial computation: a one-address computer.
It consists of:
A memory
A read-only input tape
A write-only output tape
A program
[Figure: RAM organization. A read-only input tape (x1 x2 … xn), a program with a location counter, a memory consisting of the accumulator r0 and registers r1, r2, r3, …, and a write-only output tape (y1 y2 … yn).]
Time & Space Complexities

Time complexity: the maximum time taken by the program over all inputs of size n.
Space complexity: the maximum space used by the program over all inputs of size n.
Two criteria for measuring time & space:
Uniform cost: every operation and every memory cell counts as one unit.
Logarithmic cost: the cost of an operation or cell grows with the number of bits involved.
The PRAM Model of Parallel Computation

The PRAM consists of:
A control unit
A global memory
An unbounded set of processors, each with its own private memory and a unique index, which is needed to enable and disable a processor or to influence which memory location it accesses.
[Figure: PRAM organization. A control unit; processors P1, P2, …, Pp, each with its own private memory; an interconnection network; and a global memory.]
How a PRAM Works

A PRAM computation begins with:
the input stored in global memory, and
a single active processing element.
The computation terminates when the last processor halts.
…

During each step of the computation an active processor may:
Read a value from a single private (local) or global memory location.
Write a value to a single private (local) or global memory location.
Perform a single RAM operation.
During a computation step, a processor may also activate another processor.
Cost

The cost of a PRAM computation is the product of the parallel time complexity and the number of processors used.
A PRAM algorithm with time complexity Θ(log p) using p processors has cost Θ(p log p).
Various Models of PRAM

EREW (Exclusive Read, Exclusive Write)
CREW (Concurrent Read, Exclusive Write), which is the default model.
CRCW (Concurrent Read, Concurrent Write)
…

The following policies handle concurrent writes:
Common
all processors concurrently writing into the same global address must write the same value.
Arbitrary
if multiple processors write to the same address, one of them is chosen arbitrarily and its value is stored.
Priority
if multiple processors write to the same address, the processor with the lowest index succeeds (highest priority).
Compare these policies: which is the weakest, which is the strongest, and why? (A small sketch simulating the three policies follows below.)
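Below is a minimal Python sketch (not from the slides) of how a single CRCW memory cell could resolve simultaneous writes under the Common, Arbitrary, and Priority policies. The function name resolve_concurrent_writes and the (processor index, value) representation are assumptions made for illustration.

# Hypothetical illustration: resolving concurrent writes to one global address.
# Each write is a (processor_index, value) pair arriving in the same PRAM step.
import random

def resolve_concurrent_writes(writes, policy):
    """Return the value stored after one CRCW step, or raise on an illegal step."""
    if not writes:
        return None
    if policy == "common":
        # All processors must write the same value, otherwise the step is illegal.
        values = {v for _, v in writes}
        if len(values) != 1:
            raise ValueError("Common CRCW: processors wrote different values")
        return values.pop()
    if policy == "arbitrary":
        # Any one of the competing writes may succeed.
        return random.choice(writes)[1]
    if policy == "priority":
        # The processor with the lowest index wins.
        return min(writes, key=lambda w: w[0])[1]
    raise ValueError("unknown policy: " + policy)

writes = [(3, 7), (0, 5), (2, 7)]
print(resolve_concurrent_writes(writes, "priority"))   # 5 (processor 0 wins)
print(resolve_concurrent_writes(writes, "arbitrary"))  # 5 or 7, chosen arbitrarily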
PRAM Algorithms

PRAM algorithms have two phases:
a) A sufficient number of processors is activated.
b) The activated processors perform the computation in parallel.
…

To activate p processors, we need ⌈log p⌉ steps, since each already-active processor can activate one more processor per step, doubling the number of active processors.
The meta-instruction
spawn (<processor names>)
is used to activate processors.
To denote a code segment to be executed in parallel by the specified processors, we use a parallel construct.
…

The general format is:
for all <processor list> do
  <statement list>
end for
We can also use the other usual control structures, such as:
if-then-else-endif
for-endfor
while-endwhile
repeat-until
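As a rough illustration (not from the slides), the spawn / for all construct can be thought of as a synchronous loop over processor indices: in each step every active processor executes the same statement, and all writes take effect together. The Python sketch below simulates this on one machine; the helper name for_all is an assumption made for illustration.

# Hypothetical sketch: simulating one synchronous "for all p_i ... do" PRAM step.
# All processors read the old state, then all writes are applied together,
# which mimics the lock-step execution assumed by the PRAM model.

def for_all(processor_indices, step, memory):
    """Run one parallel step: every processor computes its writes, then all writes commit."""
    writes = []
    for i in processor_indices:          # conceptually simultaneous
        writes.extend(step(i, memory))   # step returns a list of (address, value) pairs
    for address, value in writes:        # commit phase
        memory[address] = value
    return memory

# Example: every processor p_i doubles A[i] in parallel.
A = [4, 3, 8, 2]
for_all(range(len(A)), lambda i, mem: [(i, 2 * mem[i])], A)
print(A)  # [8, 6, 16, 4]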
Parallel Reduction

Binary trees are important for designing parallel algorithms:
Starting from the root (fan-out):
broadcasting
divide and conquer
Starting from the leaves (fan-in):
reduction
…

Given a set of n values a1, a2, …, an and an associative binary operation ⊕, reduction is the process of computing
a1 ⊕ a2 ⊕ a3 ⊕ … ⊕ an
Ex:
Array A (indices 0 … 7): 4 3 8 2 9 1 0 5
j = 0 (P0 … P3): partial sums 7, 10, 10, 5 stored in A[0], A[2], A[4], A[6]
j = 1 (P0, P2): 17 stored in A[0], 15 stored in A[4]
j = 2 (P0): 32 stored in A[0]
Algorithm

SUM (EREW PRAM)
Initial condition: list of n ≥ 1 elements stored in A[0 … (n−1)]
Final condition: sum of the elements stored in A[0]
Global variables: n, A[0 … (n−1)], j
…

begin
  spawn (p0, p1, p2, …, p⌊n/2⌋−1)
  for all pi where 0 ≤ i ≤ ⌊n/2⌋ − 1 do
    for j ← 0 to ⌈log n⌉ − 1 do
      if i mod 2^j = 0 and 2i + 2^j < n then
        A[2i] ← A[2i] + A[2i + 2^j]
      end if
    end for
  end for
end
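A minimal single-machine simulation of the SUM algorithm above (a Python illustration, not part of the slides): the loop over i plays the role of the ⌊n/2⌋ spawned processors, and each value of j corresponds to one synchronous PRAM step.

import math

def pram_sum(A):
    """Simulate the EREW SUM reduction; on return A[0] holds the total."""
    n = len(A)
    if n <= 1:
        return A[0] if A else 0
    for j in range(math.ceil(math.log2(n))):         # one iteration per PRAM step
        for i in range(n // 2):                      # processors p_0 ... p_{n/2-1} in parallel
            if i % (2 ** j) == 0 and 2 * i + 2 ** j < n:
                A[2 * i] += A[2 * i + 2 ** j]        # each step touches disjoint cells (EREW)
    return A[0]

A = [4, 3, 8, 2, 9, 1, 0, 5]
print(pram_sum(A))  # 32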
Tracing

When j = 0
0 mod 2^0 = 0  a[0] ← a[0] + a[1]
1 mod 2^0 = 0  a[2] ← a[2] + a[3]
2 mod 2^0 = 0  a[4] ← a[4] + a[5]
3 mod 2^0 = 0  a[6] ← a[6] + a[7]
When j = 1
0 mod 2^1 = 0  a[0] ← a[0] + a[2]
1 mod 2^1 = 1  (no operation)
2 mod 2^1 = 0  a[4] ← a[4] + a[6]
3 mod 2^1 = 1  (no operation)
…

When j = 2
0 mod 2^2 = 0  a[0] ← a[0] + a[4]
1 mod 2^2 = 1  (no operation)
2 mod 2^2 = 2  (no operation)
3 mod 2^2 = 3  (no operation)
Analysis

Time complexity: Θ(log n), given ⌊n/2⌋ processors.
Spawning the ⌊n/2⌋ processors takes ⌈log ⌊n/2⌋⌉ steps.
Cost: Θ(log n) · ⌊n/2⌋ = Θ(n log n).
Prefix Sums

Given a set of n values a1, a2, …, an and an associative binary operation ⊕, the prefix sums problem is to compute the n quantities:
a1
a1 ⊕ a2
a1 ⊕ a2 ⊕ a3
…
a1 ⊕ a2 ⊕ a3 ⊕ … ⊕ an
Analysis

Time complexity: Θ(log n), given n − 1 processors.
Spawning the n − 1 processors takes ⌈log (n − 1)⌉ steps.
Cost: Θ(log n) · (n − 1) = Θ(n log n).
Algorithm

PREFIX-SUMS (CREW PRAM)
Initial condition: list of n ≥ 1 elements stored in A[0 … (n−1)]
Final condition: each element A[i] contains A[0] ⊕ A[1] ⊕ … ⊕ A[i]
Global variables: n, A[0 … (n−1)], j
…

begin
  spawn (p1, p2, …, pn−1)
  for all pi where 1 ≤ i ≤ n − 1 do
    for j ← 0 to ⌈log n⌉ − 1 do
      if i − 2^j ≥ 0 then
        A[i] ← A[i] + A[i − 2^j]
      end if
    end for
  end for
end
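A similar single-machine simulation of PREFIX-SUMS (a Python illustration, not part of the slides). Because a processor may read a cell that another processor writes during the same step, the sketch snapshots the array before each step, mirroring the PRAM convention that all reads in a step happen before any write.

import math

def pram_prefix_sums(A):
    """Simulate the CREW PREFIX-SUMS algorithm; A[i] ends up holding A[0]+...+A[i]."""
    n = len(A)
    if n < 2:
        return A
    for j in range(math.ceil(math.log2(n))):   # one iteration per PRAM step
        old = A[:]                             # reads in a step see the values from
                                               # before any of the step's writes
        for i in range(1, n):                  # processors p_1 ... p_{n-1} in parallel
            if i - 2 ** j >= 0:
                A[i] = old[i] + old[i - 2 ** j]
    return A

print(pram_prefix_sums([4, 3, 8, 2, 9, 1, 0, 5]))
# [4, 7, 15, 17, 26, 27, 27, 32]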
Ex

Array A (indices 0 … 7): 4 3 8 2 9 1 0 5
j = 0 (p1 … p7): 4 7 11 10 11 10 1 5
j = 1 (p2 … p7): 4 7 15 17 22 20 12 15
j = 2 (p4 … p7): 4 7 15 17 26 27 27 32
Tracing

When j = 0
1 − 2^0 ≥ 0  a[1] ← a[1] + a[0]
2 − 2^0 ≥ 0  a[2] ← a[2] + a[1]
3 − 2^0 ≥ 0  a[3] ← a[3] + a[2]
4 − 2^0 ≥ 0  a[4] ← a[4] + a[3]
5 − 2^0 ≥ 0  a[5] ← a[5] + a[4]
6 − 2^0 ≥ 0  a[6] ← a[6] + a[5]
7 − 2^0 ≥ 0  a[7] ← a[7] + a[6]
…

When j = 1
1 − 2^1 < 0  (no operation)
2 − 2^1 ≥ 0  a[2] ← a[2] + a[0]
3 − 2^1 ≥ 0  a[3] ← a[3] + a[1]
4 − 2^1 ≥ 0  a[4] ← a[4] + a[2]
5 − 2^1 ≥ 0  a[5] ← a[5] + a[3]
6 − 2^1 ≥ 0  a[6] ← a[6] + a[4]
7 − 2^1 ≥ 0  a[7] ← a[7] + a[5]
…

When j = 2
1 − 2^2 < 0  (no operation)
2 − 2^2 < 0  (no operation)
3 − 2^2 < 0  (no operation)
4 − 2^2 ≥ 0  a[4] ← a[4] + a[0]
5 − 2^2 ≥ 0  a[5] ← a[5] + a[1]
6 − 2^2 ≥ 0  a[6] ← a[6] + a[2]
7 − 2^2 ≥ 0  a[7] ← a[7] + a[3]
Merging Two Sorted Lists

An optimal RAM algorithm creates the merged list one element at a time. It requires at most n − 1 comparisons to merge two sorted lists of n/2 elements each.
The complexity of the RAM algorithm is Θ(n).
…

Each processor performs a binary search on the half of the array opposite to its own element, and finds the final position of its element by adding its own index to the result of the search.
For example, P4 is in the lower half and is associated with element 9; it searches the upper half and finds that three of its elements precede 9 (9 belongs after index 3 of that half). Therefore the final position is 4 + 3 = 7.
Algorithm

MERGE-LISTS (CREW PRAM)
Initial condition: two sorted lists of n/2 elements each, stored in A[1] … A[n/2] and A[(n/2)+1] … A[n]
Final condition: merged list in locations A[1] … A[n]
Global variables: A[1 … n]
Local variables: x, low, high, index
…

begin
  spawn (p1, p2, …, pn)
  for all pi where 1 ≤ i ≤ n do
    if i ≤ n/2 then
      low ← (n/2) + 1
      high ← n
    else
      low ← 1
      high ← n/2
    end if
…
X a[i]
repeat
index ⌊(low + high)/2⌋
If x < A[index] then
high index –1
else
low index +1
endif
until low > high
A[high+i–n/2] x
endfor
end
37
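A single-machine simulation of MERGE-LISTS (a Python illustration, not part of the slides), using 0-based indexing instead of the 1-based A[1 … n] of the pseudocode, and writing to a separate output list so the sequential simulation cannot overwrite values it still needs to read. It assumes all elements are distinct, which the exclusive-write version of the algorithm requires.

def pram_merge(A):
    """Simulate MERGE-LISTS on a list of even length whose two halves are sorted.

    Assumes all elements are distinct (needed for exclusive writes).
    Returns the fully merged list.
    """
    n = len(A)
    half = n // 2
    result = [None] * n
    for i in range(n):                       # processors p_1 ... p_n in parallel
        x = A[i]
        if i < half:                         # element from the lower half:
            low, high = half, n - 1          # binary-search the upper half
        else:                                # element from the upper half:
            low, high = 0, half - 1          # binary-search the lower half
        while low <= high:
            index = (low + high) // 2
            if x < A[index]:
                high = index - 1
            else:
                low = index + 1
        # 'high' now points to the last element of the other half that precedes x.
        if i < half:
            result[i + (high - half + 1)] = x    # local rank + upper elements before x
        else:
            result[(i - half) + (high + 1)] = x  # local rank + lower elements before x
    return result

A = [1, 5, 7, 9, 13, 17, 19, 23, 2, 4, 8, 11, 12, 21, 22, 24]
print(pram_merge(A))
# [1, 2, 4, 5, 7, 8, 9, 11, 12, 13, 17, 19, 21, 22, 23, 24]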
Ex

Lower half (p1 … p8): 1 5 7 9 13 17 19 23
Upper half (p9 … p16): 2 4 8 11 12 21 22 24
Merged result A[1] … A[16]: 1 2 4 5 7 8 9 11 12 13 17 19 21 22 23 24
Reducing the Number of Processors

A cost-optimal parallel algorithm is an algorithm whose cost is in the same complexity class as an optimal sequential algorithm.
For example, parallel reduction is not cost optimal:
PRAM cost = time complexity × number of processors
= Θ(log n) × Θ(n)
= Θ(n log n),
which is greater than the Θ(n) complexity of the optimal sequential algorithm.
…

A cost-optimal parallel algorithm exists if the total number of operations performed by the parallel algorithm is in the same complexity class as that of an optimal sequential algorithm.
For example, the number of operations performed by parallel reduction is
n/2 + n/4 + n/8 + … + 1 = n − 1 ∈ Θ(n),
while the number of operations performed sequentially is
n − 1 ∈ Θ(n).
…

Since both the parallel algorithm and the sequential algorithm perform n − 1 operations, a cost-optimal variant exists.
The number of processors required to perform the n − 1 operations in Θ(log n) time is
p = ⌈(n − 1) / log n⌉ ≈ n / log n
Ex: n = 16, p = 16/4 = 4
[Figure: the 16 values are split into 4 blocks of 4. In steps 1-3 each processor sequentially adds the values of its own block; in steps 4-5 the 4 partial sums are combined by parallel reduction into the final total.]
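The Python sketch below (an illustration, not part of the slides) shows this cost-optimal strategy: with p ≈ n/log n simulated processors, each processor first adds its own block of about log n values sequentially, and the p partial sums are then combined with the parallel reduction used earlier.

import math

def cost_optimal_sum(values):
    """Sum n values using roughly n/log n simulated processors.

    Phase 1: each processor sequentially adds its own block of about log n values.
    Phase 2: the partial sums are combined by the earlier parallel reduction.
    """
    n = len(values)
    if n <= 2:
        return sum(values)
    block = math.ceil(math.log2(n))                     # ~log n elements per processor
    p = math.ceil(n / block)                            # ~n / log n processors
    partial = [sum(values[k * block:(k + 1) * block])   # phase 1: one block per processor,
               for k in range(p)]                       # conceptually done in parallel
    for j in range(math.ceil(math.log2(p))):            # phase 2: ceil(log p) parallel steps
        for i in range(p // 2):                         # processors p_0 ... p_{p/2-1}
            if i % (2 ** j) == 0 and 2 * i + 2 ** j < p:
                partial[2 * i] += partial[2 * i + 2 ** j]
    return partial[0]

print(cost_optimal_sum(list(range(16))))  # 120, in Theta(log n) parallel steps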
Brent’s Theorem

Given A, a parallel algorithm with computation time t: if parallel algorithm A performs m computational operations, then p processors can execute algorithm A in time
t + (m − t)/p
Ex: Parallel Reduction

For parallel reduction, m = n − 1 operations, t = log n, and p = n/log n processors, so
t + (m − t)/p = log n + ((n − 1) − log n) · (log n / n)
             = log n + ((n − 1) log n)/n − (log² n)/n
             ≤ 2 log n
             = Θ(log n)
Complexity Theory & PRAM

• P: the class of problems solvable by a deterministic algorithm in polynomial time.
• NP: the class of problems solvable by a nondeterministic algorithm in polynomial time.
• P-complete: a problem L ∈ P is P-complete if every other problem in P can be transformed to L by a log-space reduction.
• NP-complete: a problem L ∈ NP is NP-complete if every other problem in NP can be transformed to L in polynomial time.
…

• NC: the class of problems solvable on a PRAM in polylogarithmic time using a polynomial number of processors.
[Figure: containment of the classes. NC ⊆ P ⊆ NP, with the P-complete problems drawn inside P and the NP-complete problems inside NP.]