PRAM Model
for
Parallel Computation
References
1. Selim Akl, Parallel Computation: Models and Methods, Prentice
Hall, 1997, Updated online version available through website.
2. Selim Akl, The Design of Efficient Parallel Algorithms, Chapter 2 in
“Handbook on Parallel and Distributed Processing” edited by J.
Blazewicz, K. Ecker, B. Plateau, and D. Trystram, Springer Verlag,
2000.
3. Selim Akl, Design & Analysis of Parallel Algorithms, Prentice Hall,
1989.
4. Henri Casanova, Arnaud Legrand, and Yves Robert, Parallel
Algorithms, CRC Press, 2009.
5. Cormen, Leiserson, and Rivest, Introduction to Algorithms, 1st
edition (i.e., older), 1990, McGraw Hill and MIT Press, Chapter 30
on parallel algorithms.
6. Phillip Gibbons, Asynchronous PRAM Algorithms, Ch 22 in
Synthesis of Parallel Algorithms, edited by John Reif, Morgan
Kaufmann Publishers, 1993.
7. Joseph JaJa, An Introduction to Parallel Algorithms, Addison
Wesley, 1992.
8. Michael Quinn, Parallel Computing: Theory and Practice, McGraw
Hill, 1994.
9. Michael Quinn, Designing Efficient Algorithms for Parallel
Computers, McGraw Hill, 1987.
Outline
• Computational Models
• Definition and Properties of the PRAM Model
• Parallel Prefix Computation
• The Array Packing Problem
• Cole’s Merge Sort for PRAM
• PRAM Convex Hull algorithm using divide &
conquer
• Issues regarding implementation of PRAM
model
Concept of “Model”
• An abstract description of a real world entity
• Attempts to capture the essential features while
suppressing the less important details.
• Important to have a model that is both precise
and as simple as possible to support theoretical
studies of the entity modeled.
• If experiments or theoretical studies show the
model does not capture some important aspects
of the physical entity, then the model should be
refined.
• Some people will not accept an abstract model of
reality, but instead insist on reality itself.
• They sometimes reject a model as invalid if it does
not capture every tiny detail of the physical entity.
Parallel Models of Computation
• Describes a class of parallel computers
• Allows algorithms to be written for a
general model rather than for a specific
computer.
• Allows the advantages and disadvantages
of various models to be studied and
compared.
• Important, since the life-time of specific
computers is quite short (e.g., 10 years).
Controversy over Parallel Models
• Some professionals (often engineers) will not accept a
parallel model if
– It does not capture every detail of reality
– It cannot currently be built
• Engineers often insist that a model must be valid for any
number of processors
d[i] = 0                  if next[i] = nil
d[i] = d[next[i]] + 1     if next[i] ≠ nil
Backup of Previous Diagram
Potential Problems?
• Consider following steps:
7. d[i] = d[i] + d[next[i]]
8. next[i] = next[next[i]]
• Casanova et al. pose the following problem in Step 7:
– Pi reads d[i+1] and uses this value to update d[i]
– Pi-1 must read d[i] to update d[i-1]
– The computation fails if Pi changes the value of d[i] before
Pi-1 can read it.
• This problem should not occur, as all PEs in
PRAM should execute algorithm synchronously.
– The same problem is avoided in Step 8 for the same
reason
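The synchronous execution argued for above can be made concrete with a small sketch (my own illustration, not code from the slides): pointer jumping is simulated sequentially, with every PE's reads for a step completed before any PE writes, so the race described by Casanova et al. cannot occur.

```python
import math

def list_rank(next_ptr):
    """Pointer-jumping rank computation (Steps 7-8 above).

    next_ptr[i] is the successor of node i, or None for the tail.
    Returns d, where d[i] is the distance from node i to the tail.
    """
    n = len(next_ptr)
    d = [0 if next_ptr[i] is None else 1 for i in range(n)]
    nxt = list(next_ptr)
    for _ in range(max(1, math.ceil(math.log2(n)))):
        # Synchronous read phase: snapshot everything the PEs would read.
        read_d = [None if nxt[i] is None else d[nxt[i]] for i in range(n)]
        read_next = [None if nxt[i] is None else nxt[nxt[i]] for i in range(n)]
        # Synchronous write phase.
        for i in range(n):
            if nxt[i] is not None:
                d[i] += read_d[i]        # Step 7: d[i] = d[i] + d[next[i]]
                nxt[i] = read_next[i]    # Step 8: next[i] = next[next[i]]
    return d

# List 0 -> 1 -> 2 -> 3 (node 3 is the tail):
print(list_rank([1, 2, 3, None]))  # [3, 2, 1, 0]
```

Separating the read and write phases is exactly what lockstep PRAM execution guarantees for free.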
Potential Problems? (cont.)
• Does Step 7 (& Step 8) require a CR PRAM?
d[i] = d[i] + d[next[i]]
– Let j = next[i]
– Casanova et al. suggest that Pi and Pj may try to read
d[j] concurrently, requiring a CR PRAM model
– Again, if PEs are stepping through the computations
synchronously, EREW PRAM is sufficient here
• In Step 4, the PRAM must determine whether there is
a node i with next[i] ≠ nil. A CRCW solution is:
– In Step 4a, set done to false
– In Step 4b, all PEs write the boolean value of “next[i] = nil”
using a common CW write.
• An EREW solution for Step 7 is given next
Rank-Computation using EREW
• Theorem: The Rank-Computation
algorithm only requires EREW PRAM
– Replace Step 4 with
• For step = 1 to log n do,
• Akl raises the question of what to do if an
unknown number of processors Pi, each of
which is in charge of node i (see pg 236).
– In this case, it would be necessary to go back
to the CRCW solution suggested earlier.
PRAM Model Separation
• We next consider the following two
questions
– Is CRCW strictly more powerful than CREW
– Is CREW strictly more powerful than EREW
• We can answer each of the above questions by
finding a problem that the first (stronger) PRAM
model can solve faster than the second
CRCW Maximum Array Value Algorithm
CRCW Compute_Maximum (A,n)
• Algorithm requires O(n²) PEs, Pi,j.
1. forall i ∈ {0, 1, … , n-1} in parallel do
• Pi,0 sets m[i] = True
2. forall (i, j) ∈ {0, 1, … , n-1}², i ≠ j, in parallel do
• if A[i] < A[j] then Pi,j sets m[i] = False
3. forall i ∈ {0, 1, … , n-1} in parallel do
• If m[i] = True, then Pi,0 sets max = A[i]
4. Return max
• Note that the n PEs do an EW in Steps 1 and 3
• The write in Step 2 can be a common CW
• Cost is O(1) × O(n²), which is O(n²)
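A sequential sketch of Compute_Maximum (my own illustration; the nested loops stand in for the O(n²) PEs acting within a single O(1) parallel step):

```python
def crcw_maximum(A):
    """Simulate Compute_Maximum on a common-CRCW PRAM with n^2 PEs P_{i,j}."""
    n = len(A)
    m = [True] * n                   # Step 1: P_{i,0} does an exclusive write
    for i in range(n):               # Step 2: all pairs (i, j), i != j, in parallel
        for j in range(n):
            if i != j and A[i] < A[j]:
                m[i] = False         # common CW: every writer writes the same value
    for i in range(n):               # Step 3: surviving PE(s) write the maximum
        if m[i]:
            maximum = A[i]
    return maximum

print(crcw_maximum([3, 7, 2, 9, 5]))  # 9
```

Note that Step 2 satisfies the common-CW rule: all PEs writing m[i] write the identical value False, and if the maximum value is duplicated, all survivors in Step 3 write the same value to max.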
CRCW More Powerful Than
CREW
• The previous algorithm establishes that
CRCW can calculate the maximum of an
array in O(1) time
• Using CREW, only two values can be
merged into a single value by one PE in a
single step.
– Therefore the number of values that need to be
merged can be halved at each step.
– So the fastest possible time for CREW is Ω(log n)
CREW More Powerful Than
EREW
• Determine if a given element e belongs to a set {e1,
e2, … , en} of n distinct elements
• CREW can solve this in O(1) using n PEs
– One PE initializes a variable result to false
– All PEs compare e to one ei.
– If any PE finds a match, it writes “true” to result.
• On EREW, it takes Ω(log n) steps to broadcast the
value of e to all PEs.
– The number of PEs with the value of e can be doubled at
each step.
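The doubling broadcast can be sketched as follows (a hypothetical simulation of my own; `cells` plays the role of shared memory, and within each round every copy of e is read by exactly one new PE, so all reads are exclusive):

```python
def erew_broadcast(e, n):
    """Broadcast e to n PEs on an EREW PRAM by doubling; returns (cells, rounds)."""
    cells = [None] * n
    cells[0] = e
    have = 1        # number of PEs that already hold e
    rounds = 0
    while have < n:
        # PE k (have <= k < 2*have) reads cells[k - have]: the source indices
        # are distinct, so no concurrent reads occur in this round.
        for k in range(have, min(2 * have, n)):
            cells[k] = cells[k - have]
        have = min(2 * have, n)
        rounds += 1
    return cells, rounds

cells, rounds = erew_broadcast(42, 8)  # 3 rounds = log2(8)
```

Since the number of holders at most doubles per round, ⌈log₂ n⌉ rounds are both sufficient and necessary, which is the Ω(log n) gap exploited above.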
Simulating CRCW with EREW
Theorem: An EREW PRAM with p PEs can simulate
a common CRCW PRAM with p PEs in O(log p)
steps using O(p) extra memory.
• See pg 14 of Casanova et al.
• The only additional capabilities of CRCW that
EREW PRAM has to simulate are CR and CW.
• Consider a CW first, and initially assume all PE
participate.
• The EREW PRAM simulates this CW by creating a
p × 2 array A, with one row per PE
Simulating Common CRCW with EREW
• When a CW write is simulated, EREW PRAM PE j
writes
– The memory cell address it wishes to write to in A(j,0)
– The value it wishes to write to memory in A(j,1)
– If PE j does not participate in the CW, it writes -1 to
A(j,0).
• Next, sort A by its first column. This brings all of
the CWs to the same location together.
• If the memory address in A(0,0) is not -1, then PE 0
writes the data value in A(0,1) to the memory
address stored in A(0,0).
PRAM Simulations (cont)
• All PEs j with j > 0 read the memory addresses in A(j,0)
and A(j-1,0)
– If the address in A(j,0) is -1, PE j does not write.
– Also, if the two memory addresses are the same, PE j does
not write to memory.
– Otherwise, PE j writes the data value in A(j,1) to the
memory address in A(j,0).
• Cole’s result that an EREW PRAM can sort n items in
O(log n) time is needed to complete this proof. It is
discussed next in Casanova et al. for CREW.
Problem:
• This proof is invalid for CRCW versions stronger than
common CRCW, such as combining.
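The simulation of one common-CW step described above can be sketched sequentially (a hypothetical helper of my own; Python's built-in sort stands in for Cole's O(log p) EREW sort):

```python
def simulate_common_cw(p, writes, memory):
    """Simulate one common-CW step of a CRCW PRAM on an EREW PRAM.

    writes[j] = (addr, value) for PE j, or None if PE j does not participate.
    """
    # Each PE j fills its row of the p x 2 array A.
    A = [(-1, None) if writes[j] is None else writes[j] for j in range(p)]
    # Sort A by target address (Cole's EREW sort in O(log p) on a real PRAM).
    A.sort(key=lambda row: row[0])
    # PE 0 writes if its address is valid.
    if A[0][0] != -1:
        memory[A[0][0]] = A[0][1]
    # PE j > 0 writes only if its address is valid and differs from PE j-1's,
    # leaving exactly one writer per address: an exclusive write.
    for j in range(1, p):
        if A[j][0] != -1 and A[j][0] != A[j - 1][0]:
            memory[A[j][0]] = A[j][1]
    return memory
```

For example, if PEs 0 and 1 both write value 5 to address 2 (legal under common CW, since the values agree) and PE 3 writes 9 to address 0, only one writer per address survives the sort-and-compare step.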
Cole’s Merge Sort for PRAM
• Cole’s Merge Sort runs on EREW PRAM in O(lg n) using
O(n) processors, so it is cost optimal.
– The Cole sort is significantly more efficient than most
other PRAM sorts.
• Akl calls this sort “PRAM SORT” in his book & book chapter (pg 54)
– A high level presentation of EREW version is given in
Ch. 4 of Akl’s online text and also in his book chapter
• A complete presentation for CREW PRAM is in JaJa.
– JaJa states that the algorithm he presents can be
modified to run on EREW, but that the details are
non-trivial.
• Currently, this sort is the best-known PRAM sort & is
usually the one cited when a cost-optimal PRAM sort
using O(n) PEs is needed.
References for Cole’s EREW Sort
Two references are listed below.
• Richard Cole, Parallel Merge Sort, SIAM Journal
on Computing, Vol. 17, 1988, pp. 770-785.
• Richard Cole, Parallel Merge Sort, Book-chapter
in “Synthesis of Parallel Algorithms”, Edited by
John Reif, Morgan Kaufmann, 1993, pg.453-496
Comments on Sorting
• A much simpler CREW PRAM algorithm that runs in
O((lg n) lg lg n) time
and uses O(n) processors is given in JaJa’s book
(pg 158-160).
– This algorithm is shown to be work optimal.
• Also, JaJa gives an O(lg n) time randomized sort for
CREW PRAM on pages 465-473.
– With high probability, this algorithm terminates in O(lg
n) time and requires O(n lg n) operations
• i.e., with high-probability, this algorithm is work-
optimal.
• Sorting is often called the “queen of the algorithms”:
• A speedup in the best-known sort for a parallel
model usually results in a similar speedup in other
algorithms that use sorting.
Cole’s CREW Sort
• Given in 1986 by Cole [43 in Casanova]
• An EREW sort is also given in the same paper, but it is
even more difficult.
• The general idea of algorithm technique follows:
– Based on classical merge sort, represented as a binary tree.
– All merging steps at a given level of the tree must be done in
parallel
– At each level, two sequences each of arbitrary size must be
merged in O(1) time.
• Partial information from previous merges is used to merge in constant
time, using a very clever technique.
– Since there are log n levels, this yields a log n running time.
Cole’s EREW Sort (cont)
• Defn: A sequence L is called a good sampler (GS) of
sequence J if, for any k ≥ 1, there are at most 2k+1
elements of J between k+1 consecutive elements of
{-∞} ∪ L ∪ {+∞}
– Intuitively, the elements of L are almost uniformly
distributed among the elements of J.
The key is to use the sorting tree of Fig 1.6 in a pipelined fashion. A
good sampler sequence is built at each level for the next level.
Divide & Conquer PRAM Algorithms
(Reference: Akl, Chapter 5)
• Three Fundamental Operations
– Divide is the partitioning process
– Conquer is the process of solving the base problem
(without further division)
– Combine is the process of combining the solutions to
the subproblems
• Merge Sort Example
– Divide repeatedly partitions the sequence into halves.
– Conquer sorts the base set of one element
– Combine does most of the work. It repeatedly merges
two sorted halves
• Quicksort Example
– The divide stage does most of the work.
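The merge sort example above can be sketched directly (an illustration of my own; written sequentially, with a comment marking where a PRAM would run the recursive calls in parallel):

```python
def merge_sort(seq):
    """Merge sort organized around the three fundamental operations."""
    if len(seq) <= 1:                 # conquer: base problem, no further division
        return list(seq)
    mid = len(seq) // 2               # divide: partition the sequence into halves
    left = merge_sort(seq[:mid])      # on a PRAM, these two recursive calls
    right = merge_sort(seq[mid:])     # would execute in parallel
    # combine: merging the two sorted halves does most of the work
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    return merged + left[i:] + right[j:]
```

The combine step here is the sequential O(n) merge; Cole's contribution, discussed above, is replacing it with an O(1)-per-level pipelined merge.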
An Optimal CRCW PRAM
Convex Hull Algorithm
• Let Q = {q1, q2, . . . , qn} be a set of points in the
Euclidean plane (i.e., E2-space).
• The convex hull of Q is denoted by CH(Q) and
is the smallest convex polygon containing Q.
– It is specified by listing convex hull corner points
(which are from Q) in order (e.g., clockwise order).
• Usual Computational Geometry Assumptions:
– No three points lie on the same straight line.
– No two points have the same x or y coordinate.
– There are at least 4 points, as CH(Q) = Q for n ≤ 3.
PRAM CONVEX HULL(n,Q, CH(Q))
1. Sort the points of Q by x-coordinate.
2. Partition Q into k = √n subsets Q1, Q2, . . . , Qk of √n
points each such that a vertical line can separate Qi
from Qj
– Also, if i < j, then Qi is left of Qj.
3. For i = 1 to k, compute the convex hulls of Qi in
parallel, as follows:
– if |Qi| ≤ 3, then CH(Qi) = Qi
– else (using k = √n PEs) call PRAM CONVEX HULL(k, Qi,
CH(Qi))
4. Merge the convex hulls in {CH(Q1),CH(Q2), . . .
,CH(Qk)} into a convex hull for Q.
Merging √n Convex Hulls
Details for Last Step of Algorithm
• The last step is somewhat tedious.
• The upper hull is found first; then the lower hull is
found using the same method.
– Only finding the upper hull is described here
– Upper & lower convex hull points merged into ordered
set
• Each CH(Qi) has √n PEs assigned to it.
• The PEs assigned to CH(Qi) (in parallel)
compute the upper tangent from CH(Qi) to
another CH(Qj) .
– A total of √n - 1 tangents are computed for each CH(Qi)
– Details for computing the upper tangents will be
discussed separately
The Upper and Lower Hull
Last Step of Algorithm (cont)
• Among the tangent lines from CH(Qi) to the polygons to
the left of CH(Qi), let Li be the one with the smallest slope.
– Use a MIN CW to a shared memory location
• Among the tangent lines from CH(Qi) to the polygons to
the right, let Ri be the one with the largest slope.
– Use a MAX CW to a shared memory location
• If the angle between Li and Ri is less than 180 degrees,
no point of CH(Qi) is in CH(Q).
– See Figure 5.13 on next slide (from Akl’s Online text)
• Otherwise, all points of CH(Qi) between where Li touches
CH(Qi) and where Ri touches CH(Qi) are in CH(Q).
• Array Packing is used to combine all convex hull points
of CH(Q) after they are identified.
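Array packing reduces to a prefix sum over the keep/discard flags (a sketch of my own; `accumulate` stands in for the O(log n) parallel prefix computation mentioned in the outline):

```python
from itertools import accumulate

def pack(items, keep):
    """Pack the items whose keep flag is True into a contiguous array.

    prefix[i] counts kept items among positions 0..i, so kept item i
    goes to slot prefix[i] - 1; each PE then writes to a distinct slot.
    """
    prefix = list(accumulate(1 if k else 0 for k in keep))
    packed = [None] * prefix[-1] if prefix else []
    for i, k in enumerate(keep):     # exclusive writes: distinct destinations
        if k:
            packed[prefix[i] - 1] = items[i]
    return packed

print(pack(['a', 'b', 'c', 'd'], [True, False, True, True]))  # ['a', 'c', 'd']
```

On a PRAM the prefix sum dominates, so packing the identified hull points takes O(log n) time, consistent with the overall bound.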
Algorithm for Upper Tangents
• Requires finding a straight line segment tangent
to CH(Qi) and CH(Qj), as given by line sw, using
a binary search technique
– See Fig 5.14(a) on next slide
• Let s be the mid-point of the ordered sequence
of corner points in CH(Qi) .
• Similarly, let w be the mid-point of the ordered
sequence of convex hull points in CH(Qj).
• Two cases arise:
– sw is the upper tangent of CH(Qi) and CH(Qj), and we
are done.
– Otherwise, one-half of the remaining corner points of
CH(Qi) and/or CH(Qj) can be removed from
consideration.
• Preceding process is now repeated with the mid-
points of two remaining sequences.
PRAM Convex Hull Complexity Analysis
• Step 1: The sort takes O(lg n) time.
• Step 2: Partition of Q into subsets takes O(1) time.
– Here, Qi consists of the points qk where k = (i-1)√n + r for 1 ≤ r ≤ √n
• Step 3: The recursive calculations of CH(Qi) for 1 ≤ i ≤ √n
in parallel take t(√n) time (using √n PEs for each Qi).
• Step 4: The big steps here require O(lg n) time and are
– Finding the upper tangent from CH(Qi) to CH(Qj) for
each i, j pair takes O(lg √n) = O(lg n)
– Array packing used to form the ordered sequence of
upper convex hull points for Q.
• Above steps find the upper convex hull. The lower
convex hull is found similarly.
– Upper & lower hulls can be merged in O(1) time to be
an (counter)/clockwise ordered set of hull points.
Complexity Analysis (Cont)
• Cost for Step 3: Solving the recurrence relation
t(n) = t(√n) + O(lg n)
yields
t(n) = O(lg n)
• Running time for PRAM Convex Hull is O(lg n)
since this is maximum cost for each step.
• Then the cost for PRAM Convex Hull is
C(n) = O(n lg n).
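A sketch of why the recurrence for Step 3 solves to O(lg n): each square-root halves the logarithm, since lg √n = (1/2) lg n, so the per-level costs form a geometric series.

```latex
t(n) = t(n^{1/2}) + c\lg n
     = c\lg n + \tfrac{c}{2}\lg n + \tfrac{c}{4}\lg n + \cdots
     \le 2c\,\lg n = O(\lg n)
```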
Optimality of PRAM Convex Hull
Theorem: A lower bound for the number of
sequential steps required to find the convex hull
of a set of planar points is Ω(n lg n)