Parallel Algorithms
TOPICS:
Sorting
Merging
List ranking in PRAMs
Applications
What is a parallel approach?
Imagine you needed to find a lost
child in the woods. Even in a small
area, searching by yourself would be
very time consuming. But if you
gathered some friends and family to
help you, you could cover the woods
much faster…
Parallel Architectures
Single Instruction Stream, Multiple
Data Stream (SIMD)
One global control unit connected
to each processor
Multiple Instruction Stream,
Multiple Data Stream (MIMD)
Each processor has a local control
unit
Parallel Architectures
Shared-Address-Space
Each processor has access to main memory
Processors may be given a small private memory for local
variables
Message-Passing
Each processor is given its own block of memory
Processors communicate by passing messages directly
instead of modifying memory locations
Interconnection Networks
Static
Processors are hard-wired to one another
in a fixed topology
Examples: completely connected,
star-connected, bounded-degree (degree 4)
Dynamic
Processors are connected to a series of
switches
Parallel Algorithms
A parallel algorithm is an algorithm that has been
specifically written for execution on a computer with
two or more processing units.
Can be run on computers with single
processor
(multiple functional units, pipelined
functional units, pipelined memory
systems).
When designing a parallel algorithm, take
into account the cost of communication and
the number of processors (efficiency)
Parallel Algorithms
Parallel: perform more than one operation at a time.
PRAM model: Parallel Random Access Machine.
Multiple processors p0, p1, …, pn-1 connected to a
shared memory. Each processor can access any
memory location in unit time. All processors can
access memory in parallel, and all processors
can perform operations in parallel.
The PRAM Model
Parallel Random Access Machine
Number of processors is not limited
All processors have local memory
One global memory accessible to all
processors
Processors communicate by reading and
writing the global memory
The PRAM Model
Parallel Random Access Machine
Theoretical model for parallel
machines
P processors with uniform access to a
large memory bank
MIMD
UMA (uniform memory access) – Equal
memory access time for any processor
to any address
Memory access protocols
Exclusive-Read Exclusive-Write (EREW)
Exclusive-Read Concurrent-Write (ERCW)
Concurrent-Read Exclusive-Write (CREW)
Concurrent-Read Concurrent-Write (CRCW)
Example: Finding the Largest key in an array
In order to find the largest key in an
array of size n, at least n-1
comparisons must be done.
A parallel version of this algorithm
still performs the same number of
comparisons, but by doing them in
parallel it finishes sooner.
Example: Finding the Largest key in an array
Assume that n is a power of 2 and
we have n/2 processors executing
the algorithm in parallel.
Each Processor reads two array
elements into local variables called
first and second
It then writes the larger value into
the first of the array slots that it
has to read.
It takes log2(n) steps for the largest
key to be placed into the first slot
of the array.
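The procedure just described can be sketched as follows. This is a minimal sequential simulation (an assumption of this sketch: the inner loop stands in for the n/2 processors, so each pass of the while loop corresponds to one parallel step):

```python
def parallel_max(a):
    """Largest key of a (length a power of 2) in log2(n) parallel steps."""
    a = list(a)                        # work on a copy of the array
    gap = 1                            # distance between 'first' and 'second'
    while gap < len(a):
        # one PRAM step: every iteration below runs on its own processor
        for i in range(0, len(a), 2 * gap):
            first, second = a[i], a[i + gap]
            a[i] = max(first, second)  # write the larger key into the first slot
        gap *= 2
    return a[0]                        # largest key ends up in the first slot
```

For the array used on the next slides, `parallel_max([3, 12, 7, 6, 8, 11, 19, 13])` returns 19 after log2(8) = 3 passes.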
Example: Finding the Largest key in an array
Array: 3 12 7 6 8 11 19 13
Read (each processor loads a pair into local variables first and second):
P1: first 3, second 12   P2: first 7, second 6   P3: first 8, second 11   P4: first 19, second 13
Write (larger value into the first slot):
12 12 7 6 11 11 19 13
Example: Finding the Largest key in an array
Array: 12 12 7 6 11 11 19 13
Read: P1: first 12, second 7   P3: first 11, second 19
Write:
12 12 7 6 19 11 19 13
Example: Finding the Largest key in an array
Array: 12 12 7 6 19 11 19 13
Read: P1: first 12, second 19
Write:
19 12 7 6 19 11 19 13
The largest key, 19, is now in the first slot.
Example: Merge Sort
8 1 4 5 2 7 3 6
P1 P2 P3 P4
1 8 4 5 2 7 3 6
P1 P2
1 4 5 8 2 3 6 7
P1
1 2 3 4 5 6 7 8
Merge Sort Analysis
Number of comparisons (executed as parallel steps)
= 1 + 3 + … + (2^i − 1) + … + (n − 1)
= Σ i=1..log2(n) (2^i − 1) < 2n − log2(n) = Θ(n)
We have improved from Θ(n log2(n))
to Θ(n) simply by applying the old
algorithm to parallel computing;
by altering the algorithm we can
further improve merge sort to
Θ((log2(n))^2)
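The Θ(n) bound can be checked with a small simulation. The sketch below (an illustration, not the slides' PRAM program) performs the merges level by level; every merge within a level is independent and could run on its own processor, so the parallel time is dominated by the final merge:

```python
from heapq import merge  # standard-library merge of sorted iterables

def parallel_merge_sort(a):
    """Level-by-level merge sort; each level is one parallel round."""
    runs = [[x] for x in a]      # level 0: n sorted runs of length 1
    while len(runs) > 1:
        nxt = []
        for i in range(0, len(runs) - 1, 2):   # each pair -> one processor
            nxt.append(list(merge(runs[i], runs[i + 1])))
        if len(runs) % 2:        # odd run out: carried up unmerged
            nxt.append(runs[-1])
        runs = nxt
    return runs[0]
```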
O(1) Sorting Algorithm
We assume a CRCW PRAM where concurrent
writes to the same location are combined by addition:
for(int i=1; i<=n; i++)
{
  for(int j=1; j<=n; j++)   // all n*n iterations run in parallel, one per processor Pij
  {
    if(X[i] > X[j])
      Processor Pij stores 1 in memory location m[i]
    else
      Processor Pij stores 0 in memory location m[i]
  }
}
All n^2 processors run in one step; the combined writes leave the rank of X[i] in m[i].
O(1) Sorting Algorithm
Unsorted list: {4, 2, 1, 3}
P11(4,4) P12(4,2) P13(4,1) P14(4,3): 0+1+1+1 = 3
P21(2,4) P22(2,2) P23(2,1) P24(2,3): 0+0+1+0 = 1
P31(1,4) P32(1,2) P33(1,1) P34(1,3): 0+0+0+0 = 0
P41(3,4) P42(3,2) P43(3,1) P44(3,3): 0+1+1+0 = 2
Each key's sum is its rank, i.e. its final position: {1, 2, 3, 4}
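The example can be reproduced with the sketch below, which simulates the n² processors with a double loop (0-indexed, and assuming distinct keys); the `+=` stands in for the combining concurrent write:

```python
def crcw_rank_sort(x):
    """Simulate the O(1) CRCW sort: m[i] collects the summed concurrent
    writes of processors Pij, which equals the rank of x[i]."""
    n = len(x)
    m = [0] * n
    for i in range(n):
        for j in range(n):       # each (i, j) pair is one processor Pij
            if x[i] > x[j]:
                m[i] += 1        # concurrent writes combined by addition
    out = [0] * n
    for i in range(n):
        out[m[i]] = x[i]         # rank = final position (keys distinct)
    return out
```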
Applications
Computer Graphics Processing
Video Encoding
Accurate weather forecasting
Potential Speedup
O(n logn) optimal sequential sorting algorithm
Best we can expect based upon a sequential sorting algorithm
using n processors is:
optimal parallel time complexity = O((n log n) / n) = O(log n)
Odd-Even Transposition Sort - example
Parallel time complexity: Tpar = O(n) (for P=n)
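Since the example figure did not survive extraction, here is a sketch of the algorithm itself. The inner loop simulates the processors of one phase (all compare-exchanges within a phase are independent); starting with the odd phase is a convention assumed in this sketch:

```python
def odd_even_transposition_sort(a):
    """n phases of alternating compare-exchanges; one phase = one
    parallel step, giving Tpar = O(n) with P = n processors."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        start = 1 if phase % 2 == 0 else 0     # odd phase, then even phase
        for i in range(start, n - 1, 2):       # independent pairs, in parallel
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```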
Odd-Even Transposition Sort – Example (N >> P)
Each PE gets n/p numbers. First, PEs sort n/p locally, then they run
odd-even trans. algorithm each time doing a merge-split for 2n/p
numbers.
P0 | P1 | P2 | P3
13 7 12 8 5 4 6 1 3 9 2 10
Local sort
7 12 13 4 5 8 1 3 6 2 9 10
O-E
4 5 7 8 12 13 1 2 3 6 9 10
E-O
4 5 7 1 2 3 8 12 13 6 9 10
O-E
1 2 3 4 5 7 6 8 9 10 12 13
E-O
SORTED: 1 2 3 4 5 6 7 8 9 10 12 13
Time complexity: Tpar = (Local Sort) + (p merge-splits) +(p exchanges)
Tpar = (n/p)log(n/p) + p*(n/p) + p*(n/p) = (n/p)log(n/p) + 2n
Parallelizing Merge sort
Merge sort - Time complexity
Sequential:
Each of the log2(n) merge levels costs O(n) in total
(n/2 merges of 1-element lists, n/4 merges of 2-element lists, …,
one merge of two n/2-element lists):
Tseq = n + n + … + n  (log2(n) times)
Tseq = O(n log n)
Parallel:
With one processor per merge, step i produces lists of length 2^i in at most
2 · 2^i comparisons, and all merges of a step run in parallel:
Tpar = 2·2 + 2·4 + … + 2·n = 2n(2^0 + 2^-1 + … + 2^-(log n - 1)) < 4n
Tpar = O(4n) = O(n)
Bitonic Merge sort
Bitonic Sequence
A bitonic sequence is defined as a list with no more than one
LOCAL MAXIMUM and no more than one LOCAL
MINIMUM. (Endpoints must be considered with wraparound.)
Binary Split
1. Divide the bitonic list into two equal halves.
2. Compare-Exchange each item on the first half
with the corresponding item in the second half.
Result:
Two bitonic sequences where the numbers in one sequence are all less
than the numbers in the other sequence.
Repeated application of binary split
Bitonic list:
24 20 15 9 4 2 5 8 | 10 11 12 13 22 30 32 45
Result after Binary-split:
10 11 12 9 4 2 5 8 | 24 20 15 13 22 30 32 45
If you keep applying the BINARY-SPLIT to each half repeatedly, you
will get a SORTED LIST!
10 11 12 9 . 4 2 5 8 | 24 20 15 13 . 22 30 32 45
4 2 . 5 8 10 11 . 12 9 | 22 20 . 15 13 24 30 . 32 45
4 . 2 5 . 8 10 . 9 12 . 11 15 . 13 22 . 20 24 . 30 32 . 45
2 4 5 8 9 10 11 12 13 15 20 22 24 30 32 45
Sorting a bitonic sequence
Compare-and-exchange moves smaller numbers of each pair to left
and larger numbers of pair to right.
Given a bitonic sequence,
recursively performing ‘binary split’ will sort the list.
Sorting an arbitrary sequence
To sort an unordered sequence, sequences are merged into larger
bitonic sequences, starting with pairs of adjacent numbers.
By a compare-and-exchange operation, pairs of adjacent numbers
are formed into increasing sequences and decreasing sequences;
each pair then forms a bitonic sequence of twice the size of the
original sequences.
By repeating this process, bitonic sequences of larger and larger
lengths are obtained.
In the final step, a single bitonic sequence is sorted into a single
increasing sequence.
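The whole scheme can be sketched as follows: `binary_split` is the compare-exchange pass from the earlier slides, and the recursive calls stand in for the parallel stages of the sorting network (power-of-two length assumed):

```python
def binary_split(a, ascending=True):
    """One compare-exchange pass over a bitonic list (in place)."""
    half = len(a) // 2
    for i in range(half):                     # all pairs are independent
        if (a[i] > a[i + half]) == ascending:
            a[i], a[i + half] = a[i + half], a[i]

def bitonic_merge(a, ascending=True):
    """Sort a bitonic list by repeated binary splits (in place)."""
    if len(a) <= 1:
        return
    binary_split(a, ascending)
    half = len(a) // 2
    left, right = a[:half], a[half:]
    bitonic_merge(left, ascending)
    bitonic_merge(right, ascending)
    a[:half], a[half:] = left, right

def bitonic_sort(a, ascending=True):
    """Sort an arbitrary list of power-of-two length."""
    if len(a) <= 1:
        return list(a)
    # build a bitonic sequence: increasing first half, decreasing second half
    merged = (bitonic_sort(a[: len(a) // 2], True)
              + bitonic_sort(a[len(a) // 2:], False))
    bitonic_merge(merged, ascending)
    return merged
```

Applying `bitonic_merge` to the slides' bitonic list (24 20 15 9 4 2 5 8 | 10 11 12 13 22 30 32 45) performs exactly the repeated binary splits traced above.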
Number of steps (P=n)
In general, with n = 2^k, there are k phases, each of 1, 2, 3,
…, k steps. Hence the total number of steps is:
Tpar(bitonic) = Σ i=1..log n of i = log n (log n + 1) / 2 = O(log^2 n)
Bitonic sort (for N >> P)
[Figure: bitonic sort network for N >> P]
Parallel sorting - summary
Computational time complexity using P=n processors
• Odd-even transposition sort - O(n)
• Parallel merge sort - O(n); unbalanced processor load and communication
• Bitonic merge sort - O(log^2 n) (** BEST! **)
Pointer Jumping –list ranking
Given a single linked list L with n objects, compute, for each
object in L, its distance from the end of the list.
Formally: suppose next is the pointer field:
d[i] = 0                if next[i] = nil
d[i] = d[next[i]] + 1   if next[i] ≠ nil
Serial algorithm: Θ(n).
List ranking –EREW algorithm
LIST-RANK(L)  (in O(log2 n) time)
1. for each processor i, in parallel
2.   do if next[i] = nil
3.     then d[i] ← 0
4.     else d[i] ← 1
5. while there exists an object i such that next[i] ≠ nil
6.   do for each processor i, in parallel
7.     do if next[i] ≠ nil
8.       then d[i] ← d[i] + d[next[i]]
9.            next[i] ← next[next[i]]
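LIST-RANK can be simulated as below. The list is represented as an array of successor indices with None for nil (a representation chosen for this sketch); computing all new d and next values before installing them models the synchronous PRAM step where the reads precede the writes:

```python
def list_rank(next_ptr):
    """EREW pointer jumping: returns d[i] = distance of object i
    from the end of the list, in O(log n) simulated parallel steps."""
    n = len(next_ptr)
    nxt = list(next_ptr)
    d = [0 if nxt[i] is None else 1 for i in range(n)]
    while any(p is not None for p in nxt):
        # all processors read first (synchronously), then write
        d = [d[i] if nxt[i] is None else d[i] + d[nxt[i]] for i in range(n)]
        nxt = [None if nxt[i] is None else nxt[nxt[i]] for i in range(n)]
    return d
```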
List-ranking –EREW algorithm
List: 3 → 4 → 6 → 1 → 0 → 5
(a) d: 1 1 1 1 1 0
(b) d: 2 2 2 2 1 0
(c) d: 4 4 3 2 1 0
(d) d: 5 4 3 2 1 0
List ranking –correctness of EREW algorithm
Loop invariant: for each i, the sum of d
values in the sub list headed by i is the
correct distance from i to the end of the
original list L.
Parallel memory must be synchronized:
the reads on the right-hand side must occur
before the writes on the left-hand side.
Moreover, d[i] is read first, and then d[next[i]].
An EREW algorithm: every read and write
is exclusive. For an object i, its own
processor reads d[i], and then the processor
of its predecessor reads d[i].
Writes are all to distinct locations.
LIST ranking EREW algorithm running time
O(log2 n):
The initialization for loop runs in O(1).
Each iteration of while loop runs in O(1).
There are exactly ⌈log2 n⌉ iterations:
each iteration transforms each list into two
interleaved lists: one consisting of the objects
in even positions, and the other of the objects in odd
positions. Thus, each iteration doubles the
number of lists but halves their lengths.
The termination test in line 5 runs in O(1).
Define work = #processors × running time.
Work = O(n log2 n).
Sorting on Specific Networks
• Two network structures have received special attention:
mesh and hypercube
Parallel computers have been built with these networks.
• However, it is of less interest nowadays because networks got
faster and clusters became a viable option.
• Besides, network architecture is often hidden from the user.
• MPI provides libraries for mapping algorithms onto meshes,
and one can always use a mesh or hypercube algorithm even
if the underlying architecture is not one of them.
Shear-sort
Alternate row and column sorting until list is fully sorted.
Alternate row directions to get snake-like sorting:
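A sequential sketch of shear-sort on an r × c grid: each row sort and each column sort stands for one parallel phase, and the round count ⌈log₂ r⌉ + 1 with a final row pass follows the standard analysis (an assumption of this sketch; the slides give no pseudocode):

```python
import math

def shear_sort(grid):
    """Shear sort an r x c grid into snake-like order: even rows
    ascend left-to-right, odd rows descend, columns sort downward."""
    r = len(grid)
    rounds = math.ceil(math.log2(r)) + 1 if r > 1 else 1
    for _ in range(rounds):
        for i in range(r):                    # row phase (alternating direction)
            grid[i].sort(reverse=(i % 2 == 1))
        for j in range(len(grid[0])):         # column phase (top-to-bottom)
            col = sorted(grid[i][j] for i in range(r))
            for i in range(r):
                grid[i][j] = col[i]
    for i in range(r):                        # final row phase
        grid[i].sort(reverse=(i % 2 == 1))
    return grid
```

Reading the result row by row, reversing every odd row, yields the fully sorted sequence.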