
Parallel Algorithms

TOPICS:
• Sorting
• Merging
• List ranking in PRAMs
• Applications
What is a parallel approach?
Imagine you needed to find a lost child in the woods. Even in a small
area, searching by yourself would be very time consuming. Now if you
gathered some friends and family to help you, you could cover the
woods in a much faster manner…
Parallel Architectures

• Single Instruction Stream, Multiple Data Stream (SIMD)
  One global control unit connected to each processor
• Multiple Instruction Stream, Multiple Data Stream (MIMD)
  Each processor has a local control unit
Parallel Architectures

• Shared-Address-Space
  Each processor has access to main memory.
  Processors may be given a small private memory for local variables.
• Message-Passing
  Each processor is given its own block of memory.
  Processors communicate by passing messages directly instead of
  modifying memory locations.
Interconnection Networks

• Static
  Processors are hard-wired directly to one another, e.g. completely
  connected, star-connected, or bounded-degree (degree 4) networks.
• Dynamic
  Processors are connected to a series of switches.
Parallel Algorithms

• A parallel algorithm is an algorithm that has been specifically
  written for execution on a computer with two or more processing
  units.
• It can also be run on a computer with a single processor that has
  multiple functional units, pipelined functional units, or a
  pipelined memory system.
• When designing a parallel algorithm, take into account the cost of
  communication and the number of processors (efficiency).
Parallel Algorithms

• Parallel: perform more than one operation at a time.
• PRAM model: Parallel Random Access Machine.
  Multiple processors p0, p1, …, pn-1 are connected to a shared
  memory. Each processor can access any memory location in unit time.
  All processors can access memory in parallel, and all processors
  can perform operations in parallel.
The PRAM Model

• Parallel Random Access Machine
• The number of processors is not limited
• All processors have local memory
• One global memory is accessible to all processors
• Processors communicate by reading and writing the global memory
The PRAM Model

• Parallel Random Access Machine
  A theoretical model for parallel machines:
  P processors with uniform access to a large memory bank
  MIMD
  UMA (uniform memory access) – equal memory access time for any
  processor to any address
• Memory access protocols
  Exclusive-Read Exclusive-Write (EREW)
  Exclusive-Read Concurrent-Write (ERCW)
  Concurrent-Read Exclusive-Write (CREW)
  Concurrent-Read Concurrent-Write (CRCW)
Example: Finding the Largest Key in an Array

• To find the largest key in an array of size n, at least n − 1
  comparisons must be done.
• A parallel version of this algorithm still performs the same number
  of comparisons, but by doing them in parallel it finishes sooner.
Example: Finding the Largest Key in an Array

• Assume that n is a power of 2 and we have n/2 processors executing
  the algorithm in parallel.
• Each processor reads two array elements into local variables called
  first and second, then writes the larger value into the first of
  the two array slots it read.
• It takes log2(n) steps for the largest key to be placed in the
  first slot of the array (a C sketch follows).
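As a concrete illustration, here is a minimal C sketch of this
tournament scheme; the n/2 processors are simulated by a sequential
inner loop whose iterations are independent and would each run on
their own processor in one PRAM step. The function name parallel_max
and the test array are illustrative assumptions, not from the slides.

#include <stdio.h>

/* Tournament maximum: in each of the log2(n) rounds, "processor" j
 * compares two slots that are `stride` apart and writes the larger
 * value into the left one. The inner-loop iterations are independent,
 * so on a PRAM each would run on its own processor in one time step. */
int parallel_max(int a[], int n) {        /* n assumed a power of 2 */
    for (int stride = 1; stride < n; stride *= 2) {  /* log2(n) rounds */
        for (int j = 0; j + stride < n; j += 2 * stride) {
            int first  = a[j];            /* each processor reads two  */
            int second = a[j + stride];   /* elements into local vars  */
            if (second > first)
                a[j] = second;            /* larger value -> first slot */
        }
    }
    return a[0];                          /* largest key ends up here  */
}

int main(void) {
    int a[] = {3, 12, 7, 6, 8, 11, 19, 13};  /* array from the example */
    printf("%d\n", parallel_max(a, 8));      /* prints 19              */
    return 0;
}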
Example: Finding the Largest Key in an Array

Initial array:   3 12 7 6 8 11 19 13

Step 1: P1 reads (3, 12), P2 reads (7, 6), P3 reads (8, 11),
        P4 reads (19, 13); each writes the larger value back:

                 12 12 7 6 11 11 19 13

Step 2: P1 reads (12, 7), P3 reads (11, 19):

                 12 12 7 6 19 11 19 13

Step 3: P1 reads (12, 19):

                 19 12 7 6 19 11 19 13

The largest key, 19, is now in the first slot.
Example: Merge Sort

  8 1   4 5   2 7   3 6
   P1    P2    P3    P4     (each sorts one pair)

  1 8   4 5   2 7   3 6
      P1          P2        (each merges two sorted pairs)

  1 4 5 8   2 3 6 7
          P1                (merges the two sorted halves)

  1 2 3 4 5 6 7 8
Merge Sort Analysis

• Number of comparisons (summing one merge per level, since merges
  within a level run in parallel):
  1 + 3 + … + (2^i − 1) + … + (n − 1)
  = ∑i=1..log2(n) (2^i − 1) < 2n − log2(n) = Θ(n)

• We have improved from Θ(n log2(n)) to Θ(n) simply by running the
  old algorithm in parallel; by altering the algorithm we can further
  improve merge sort to Θ((log2(n))^2).
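To make the level structure concrete, here is a hedged C sketch of a
bottom-up merge sort: all merges at a given width are independent, so
on a PRAM a level costs only the time of one of its merges, which is
exactly what the sum above counts. The names merge_sort and tmp and
the test data are illustrative assumptions.

#include <stdio.h>
#include <string.h>

/* Merge a[lo..mid) and a[mid..hi) into tmp, then copy back. */
static void merge(int a[], int tmp[], int lo, int mid, int hi) {
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}

/* Bottom-up merge sort: all merges at one width are independent, so
 * on a PRAM they run concurrently and each level costs the time of a
 * single merge, giving the O(n) parallel bound derived above. */
void merge_sort(int a[], int tmp[], int n) {    /* n a power of 2 */
    for (int width = 1; width < n; width *= 2)      /* log2(n) levels  */
        for (int lo = 0; lo < n; lo += 2 * width)   /* parallel on PRAM */
            merge(a, tmp, lo, lo + width, lo + 2 * width);
}

int main(void) {
    int a[] = {8, 1, 4, 5, 2, 7, 3, 6}, tmp[8];     /* list from slide */
    merge_sort(a, tmp, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]); /* 1 2 3 4 5 6 7 8 */
    return 0;
}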
O(1) Sorting Algorithm

We assume a CRCW PRAM where concurrent writes are combined by
addition, and every location m_i starts at 0. All n^2 processors P_ij
run concurrently; the nested loops below enumerate them:

for(int i=1; i<=n; i++) {
    for(int j=1; j<=n; j++) {
        if(X[i] > X[j])
            m[i] += 1;   /* processor P_ij stores 1 in location m_i */
        /* else P_ij stores 0, contributing nothing to the sum */
    }
}

Afterwards m_i holds the number of keys smaller than X[i], i.e. the
final position of X[i] in the sorted order.
O(1) Sorting Algorithm

• Unsorted list: {4, 2, 1, 3}

  P11(4,4) P12(4,2) P13(4,1) P14(4,3):  0+1+1+1 = 3
  P21(2,4) P22(2,2) P23(2,1) P24(2,3):  0+0+1+0 = 1
  P31(1,4) P32(1,2) P33(1,1) P34(1,3):  0+0+0+0 = 0
  P41(3,4) P42(3,2) P43(3,1) P44(3,3):  0+1+1+0 = 2

• Each sum is the position of its key in the sorted list: {1, 2, 3, 4}.
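A minimal C simulation of this enumeration scheme follows. The
function name crcw_sort is illustrative, and the index-based
tie-break for equal keys is our addition (the slide assumes distinct
keys); the concurrent additive writes are simulated by a running sum.

#include <stdio.h>

/* Enumeration (rank) sort, simulating the n*n CRCW processors P_ij.
 * Concurrent writes to m_i are combined by addition, so m_i ends up
 * holding the rank of X[i], i.e. its final position. The index pair
 * (i, j) plays the role of processor P_ij. */
void crcw_sort(const int X[], int out[], int n) {
    for (int i = 0; i < n; i++) {
        int m_i = 0;                        /* memory location m_i      */
        for (int j = 0; j < n; j++)         /* all P_ij write in 1 step */
            if (X[i] > X[j] || (X[i] == X[j] && i > j))
                m_i += 1;                   /* additive write rule      */
        out[m_i] = X[i];                    /* rank = sorted position   */
    }
}

int main(void) {
    int X[] = {4, 2, 1, 3}, out[4];         /* list from the example    */
    crcw_sort(X, out, 4);
    for (int i = 0; i < 4; i++) printf("%d ", out[i]);  /* 1 2 3 4 */
    return 0;
}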
Applications

• Computer graphics processing
• Video encoding
• Accurate weather forecasting
Potential Speedup

• O(n log n) is the optimal sequential sorting time complexity.
• The best we can expect from parallelizing a sequential sorting
  algorithm over n processors is:

      O(n log n) / n = O(log n)

  i.e. the optimal parallel time complexity is O(log n).
Odd-Even Transposition Sort – Example

• Parallel time complexity: Tpar = O(n) (for P = n); a sketch of the
  algorithm follows below.
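As referenced above, here is a sequential C sketch of odd-even
transposition sort: within one phase all compare-exchanges touch
disjoint pairs, so on a PRAM with P = n they run concurrently and
each phase costs O(1), giving Tpar = O(n). The function name and
test data are illustrative assumptions.

#include <stdio.h>

/* Odd-even transposition sort with P = n. In phase k, neighbouring
 * pairs starting at index (k mod 2) are compare-exchanged; all
 * exchanges of one phase are disjoint, so they run concurrently,
 * giving n phases of O(1) parallel time each. */
void odd_even_sort(int a[], int n) {
    for (int phase = 0; phase < n; phase++) {
        int start = phase % 2;       /* even phase: pairs (0,1),(2,3)… */
        for (int i = start; i + 1 < n; i += 2)    /* parallel on PRAM */
            if (a[i] > a[i + 1]) {
                int t = a[i]; a[i] = a[i + 1]; a[i + 1] = t;
            }
    }
}

int main(void) {
    int a[] = {5, 2, 8, 1, 4, 7, 3, 6};
    odd_even_sort(a, 8);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);  /* 1 2 … 8 */
    return 0;
}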
Odd-Even Transposition Sort – Example (N >> P)

Each PE gets n/p numbers. First, the PEs sort their n/p numbers
locally; then they run the odd-even transposition algorithm, each
step doing a merge-split on 2n/p numbers.

              P0        P1         P2        P3
            13 7 12    8 5 4     6 1 3     9 2 10
Local sort:  7 12 13   4 5 8     1 3 6     2 9 10
O-E:         4 5 7     8 12 13   1 2 3     6 9 10
E-O:         4 5 7     1 2 3     8 12 13   6 9 10
O-E:         1 2 3     4 5 7     6 8 9     10 12 13
E-O:         1 2 3     4 5 6     7 8 9     10 12 13

SORTED: 1 2 3 4 5 6 7 8 9 10 12 13

Time complexity: Tpar = (local sort) + (p merge-splits) + (p exchanges)

Tpar = (n/p)log(n/p) + p·(n/p) + p·(n/p) = (n/p)log(n/p) + 2n

Parallelizing Merge Sort

Merge sort – time complexity

Sequential:
  Tseq = 1·n + 2·(n/2) + 2^2·(n/2^2) + … + 2^log n·(n/2^log n)
  Tseq = O(n log n)

Parallel:
  Tpar = 2(n/2^0 + n/2^1 + n/2^2 + … + n/2^k),   k = log n
       = 2n(2^0 + 2^-1 + 2^-2 + … + 2^-log n) < 4n
  Tpar = O(4n) = O(n)
Bitonic Merge Sort

• Bitonic sequence
  A bitonic sequence is defined as a list with no more than one local
  maximum and no more than one local minimum. (Endpoints must be
  considered – wraparound.)
• Binary split
  1. Divide the bitonic list into two equal halves.
  2. Compare-exchange each item in the first half with the
     corresponding item in the second half.
  Result: two bitonic sequences, where the numbers in one sequence
  are all less than the numbers in the other sequence.
Repeated application of binary split

Bitonic list:
  24 20 15 9 4 2 5 8 | 10 11 12 13 22 30 32 45

Result after binary split:
  10 11 12 9 4 2 5 8 | 24 20 15 13 22 30 32 45

If you keep applying the binary split to each half repeatedly, you
will get a sorted list:

  10 11 12 9 · 4 2 5 8 | 24 20 15 13 · 22 30 32 45
  4 2 · 5 8 | 10 11 · 12 9 | 22 20 · 15 13 | 24 30 · 32 45
  4·2  5·8  10·9  12·11  15·13  22·20  24·30  32·45
  2 4 5 8 9 10 11 12 13 15 20 22 24 30 32 45
Sorting a bitonic sequence

• Compare-and-exchange moves the smaller number of each pair to the
  left and the larger number to the right.
• Given a bitonic sequence, recursively performing binary splits will
  sort the list.
Sorting an arbitrary sequence

• To sort an unordered sequence, sequences are merged into larger
  bitonic sequences, starting with pairs of adjacent numbers.
• By a compare-and-exchange operation, pairs of adjacent numbers are
  formed into increasing sequences and decreasing sequences; adjacent
  pairs then form a bitonic sequence of twice the size of each
  original pair.
• By repeating this process, bitonic sequences of larger and larger
  length are obtained.
• In the final step, a single bitonic sequence is sorted into a
  single increasing sequence (a recursive sketch is given below).
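A compact recursive C sketch of this scheme, assuming n is a power
of 2: bitonic_sort builds bitonic sequences by sorting the two halves
in opposite directions, and bitonic_merge is the repeated binary
split shown earlier. All names and test data are illustrative.

#include <stdio.h>

#define ASC  1
#define DESC 0

/* Binary split: compare-exchange each element of the first half with
 * its partner in the second half, keeping the smaller on the side
 * given by `dir`, then recurse on both halves. Applied to a bitonic
 * sequence this sorts it. The compare-exchanges of one split are
 * independent and would run concurrently on a PRAM. */
static void bitonic_merge(int a[], int lo, int n, int dir) {
    if (n <= 1) return;
    int half = n / 2;
    for (int i = lo; i < lo + half; i++)      /* parallel on PRAM */
        if ((a[i] > a[i + half]) == dir) {
            int t = a[i]; a[i] = a[i + half]; a[i + half] = t;
        }
    bitonic_merge(a, lo, half, dir);
    bitonic_merge(a, lo + half, half, dir);
}

/* Sort the halves in opposite directions to build a bitonic sequence
 * of twice the size, then merge it -- exactly the scheme above. */
void bitonic_sort(int a[], int lo, int n, int dir) {  /* n power of 2 */
    if (n <= 1) return;
    bitonic_sort(a, lo, n / 2, ASC);
    bitonic_sort(a, lo + n / 2, n / 2, DESC);
    bitonic_merge(a, lo, n, dir);
}

int main(void) {
    int a[] = {8, 3, 6, 1, 7, 2, 5, 4};
    bitonic_sort(a, 0, 8, ASC);
    for (int i = 0; i < 8; i++) printf("%d ", a[i]);  /* 1 2 … 8 */
    return 0;
}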
Number of steps (P = n)

• In general, with n = 2^k, there are k phases, each of 1, 2, 3, …, k
  steps. Hence the total number of steps is:

  Tpar(bitonic) = ∑i=1..log n  i = log n (log n + 1) / 2 = O(log^2 n)
Bitonic sort (for N >> P)

[diagram omitted]
Parallel sorting – summary

Computational time complexity using P = n processors:

• Odd-even transposition sort – O(n)

• Parallel merge sort – O(n)
  (suffers from unbalanced processor load and communication overhead)

• Bitonic merge sort – O(log^2 n)  (** BEST! **)
Pointer Jumping – List Ranking

Given a singly linked list L with n objects, compute, for each object
in L, its distance from the end of the list.
Formally, where next is the pointer field:

  d[i] = 0                  if next[i] = nil
  d[i] = d[next[i]] + 1     if next[i] ≠ nil

Serial algorithm: Θ(n).
List ranking – EREW algorithm

LIST-RANK(L)   (in O(log2 n) time)
1. for each processor i, in parallel
2.     do if next[i] = nil
3.            then d[i] ← 0
4.            else d[i] ← 1
5. while there exists an object i such that next[i] ≠ nil
6.     do for each processor i, in parallel
7.            do if next[i] ≠ nil
8.                   then d[i] ← d[i] + d[next[i]]
9.                        next[i] ← next[next[i]]
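Below is a C simulation of LIST-RANK. Snapshot arrays stand in for
the synchronous PRAM step, so all reads of a round happen before its
writes, matching the EREW discussion that follows. NIL, the array
names, and the six-object list are illustrative assumptions.

#include <stdio.h>

#define NIL -1

/* EREW list ranking by pointer jumping, simulating the per-object
 * processors with loops. Each round is computed into snapshots and
 * then committed, mimicking the reads-before-writes PRAM semantics. */
void list_rank(int next[], int d[], int n) {
    for (int i = 0; i < n; i++)              /* parallel initialization */
        d[i] = (next[i] == NIL) ? 0 : 1;

    for (;;) {
        int done = 1;                        /* termination test (line 5) */
        for (int i = 0; i < n; i++)
            if (next[i] != NIL) { done = 0; break; }
        if (done) break;

        int d2[n], next2[n];                 /* snapshot = synchronous step */
        for (int i = 0; i < n; i++) {        /* parallel jumping step   */
            d2[i]    = d[i];
            next2[i] = next[i];
            if (next[i] != NIL) {
                d2[i]    = d[i] + d[next[i]];
                next2[i] = next[next[i]];
            }
        }
        for (int i = 0; i < n; i++) { d[i] = d2[i]; next[i] = next2[i]; }
    }
}

int main(void) {
    /* six objects: 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> NIL */
    int next[] = {1, 2, 3, 4, 5, NIL}, d[6];
    list_rank(next, d, 6);
    for (int i = 0; i < 6; i++) printf("%d ", d[i]);  /* 5 4 3 2 1 0 */
    return 0;
}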
List ranking – EREW algorithm

List: 3 → 4 → 6 → 1 → 0 → 5 → nil   (object labels in list order)

(a) d = 1 1 1 1 1 0    (after initialization)
(b) d = 2 2 2 2 1 0    (after one pointer jump)
(c) d = 4 4 3 2 1 0    (after two pointer jumps)
(d) d = 5 4 3 2 1 0    (after three pointer jumps)
List ranking – correctness of EREW algorithm

• Loop invariant: for each i, the sum of d values in the sublist
  headed by i is the correct distance from i to the end of the
  original list L.
• Parallel memory must be synchronized: the reads on the right-hand
  side must occur before the writes on the left-hand side. Moreover,
  each processor reads d[i] first and then reads d[next[i]].
• This is an EREW algorithm: every read and write is exclusive. For
  an object i, its own processor reads d[i], and only then does the
  processor of its predecessor read d[i]. Writes are all to distinct
  locations.
List ranking – EREW algorithm running time

O(log2 n):
• The initialization for loop runs in O(1).
• Each iteration of the while loop runs in O(1).
• There are exactly ⌈log2 n⌉ iterations:
  each iteration transforms each list into two interleaved lists, one
  consisting of the objects in even positions and the other of those
  in odd positions. Thus, each iteration doubles the number of lists
  but halves their lengths.
• The termination test in line 5 runs in O(1).
• Define work = #processors × running time. Here the work is
  O(n log2 n).
Sorting on Specific Networks

• Two network structures have received special attention: the mesh
  and the hypercube. Parallel computers have been built with these
  networks.

• However, they are of less interest nowadays because networks got
  faster and clusters became a viable option.

• Besides, the network architecture is often hidden from the user.

• MPI provides libraries for mapping algorithms onto meshes, and one
  can always use a mesh or hypercube algorithm even if the underlying
  architecture is not one of them.
Shear-sort

• Alternate row and column sorting until the list is fully sorted.
• Alternate the row directions to get snake-like ordering (a sketch
  follows below).
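A small C sketch of shear sort on a 4 × 4 mesh, using qsort for the
row and column phases that the mesh rows and columns would perform in
parallel; the grid size, values, and function names are illustrative.

#include <stdio.h>
#include <stdlib.h>

#define N 4   /* side of the N x N mesh (illustrative size) */

static int asc(const void *a, const void *b)  { return *(int*)a - *(int*)b; }
static int desc(const void *a, const void *b) { return *(int*)b - *(int*)a; }

/* Shear sort: alternate row phases (even rows sorted left-to-right,
 * odd rows right-to-left, giving the snake order) with column phases
 * (always ascending). N+1 rounds are used here for simplicity;
 * ceil(log2(N)) + 1 rounds already suffice. On the mesh, every row
 * (and then every column) is sorted in parallel by its processors. */
void shear_sort(int g[N][N]) {
    for (int round = 0; round <= N; round++) {
        for (int r = 0; r < N; r++)               /* parallel row sorts */
            qsort(g[r], N, sizeof(int), (r % 2 == 0) ? asc : desc);
        int col[N];
        for (int c = 0; c < N; c++) {             /* parallel col sorts */
            for (int r = 0; r < N; r++) col[r] = g[r][c];
            qsort(col, N, sizeof(int), asc);
            for (int r = 0; r < N; r++) g[r][c] = col[r];
        }
    }
}

int main(void) {
    int g[N][N] = {{14,2,11,5},{7,16,3,9},{1,12,6,15},{10,4,13,8}};
    shear_sort(g);
    for (int r = 0; r < N; r++) {                 /* snake order: 1..16 */
        for (int c = 0; c < N; c++) printf("%2d ", g[r][c]);
        printf("\n");
    }
    return 0;
}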

