Parallel Distributed Computing Unit-4

The document summarizes sorting algorithms that can be used for parallel computing. It discusses bubble sort and its variants like odd-even transposition sort. It also covers parallel implementations of sorting algorithms like shellsort and quicksort. For quicksort, it describes a shared address space formulation where the list is divided among processors, a pivot is selected, and the lists are merged and sorted recursively.

Chapter 9 (Secs. 9.1, 9.3, 9.4)
Sorting Algorithms
A. Grama, A. Gupta, G. Karypis, and V. Kumar

To accompany the text “Introduction to Parallel Computing”, Addison Wesley, 2003.
Topic Overview

• Issues in Sorting on Parallel Computers

• Bubble Sort and its Variants

• Quicksort
Sorting: Basics

• One of the most commonly used and well-studied kernels.

• The fundamental operation of comparison-based sorting is compare-exchange.

• The lower bound on any comparison-based sort of n numbers is Θ(n log n) on a serial computer.

• In the case of parallel sorting, the sorted list is partitioned among a number of processors, such that (1) each sublist is sorted, and (2) for i < j, each element in processor Pi’s sublist is less than those in Pj’s sublist.
Sorting: Parallel Compare Exchange Operation
[Figure: a parallel compare-exchange operation in three steps.]

A parallel compare-exchange operation (each process is responsible for one element). Processes Pi and Pj send their elements to each other. Process Pi keeps min{ai, aj}, and Pj keeps max{ai, aj}.
Sorting: Parallel Compare Split Operation
[Figure: a parallel compare-split operation in four steps. Pi starts with the sorted block 2 7 9 10 12 and Pj with 1 6 8 11 13; after the exchange and merge, Pi holds 1 2 6 7 8 and Pj holds 9 10 11 12 13.]

A compare-split operation (each process is responsible for a block of elements). Each process sends all its elements to another process. Each process merges the received block with its own block and retains only the appropriate half of the merged block. In this example, process Pi retains the smaller elements and process Pj retains the larger elements.
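Both primitives are straightforward to express in code. Below is a minimal sequential sketch in Python (the message exchange is simulated; the function names are mine, not from the text):

def compare_exchange(ai, aj):
    # Compare-exchange, simulated: Pi keeps the min, Pj keeps the max.
    return min(ai, aj), max(ai, aj)

def compare_split(block_i, block_j):
    # Compare-split, simulated: each process holds a sorted block of size k.
    # Pi keeps the k smallest elements of the merged sequence, Pj the k largest.
    # (A real implementation merges two sorted blocks in Θ(k) instead of sorting.)
    k = len(block_i)
    merged = sorted(block_i + block_j)
    return merged[:k], merged[k:]

# The example from the figure above:
lo, hi = compare_split([2, 7, 9, 10, 12], [1, 6, 8, 11, 13])
print(lo, hi)  # [1, 2, 6, 7, 8] [9, 10, 11, 12, 13]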
Bubble Sort and its Variants

The sequential bubble sort algorithm compares and exchanges adjacent elements in the sequence to be sorted:

1. procedure BUBBLE SORT(n)
2. begin
3. for i := n − 1 downto 1 do
4. for j := 1 to i do
5. compare-exchange(aj, aj+1);
6. end BUBBLE SORT

Sequential bubble sort algorithm.


Bubble Sort and its Variants

• The complexity of bubble sort is Θ(n²).

• Bubble sort is difficult to parallelize since the algorithm has no concurrency.

• A simple variant, though, uncovers the concurrency.


Odd-Even Transposition

1. procedure ODD-EVEN(n)
2. begin
3. for i := 1 to n do
4. begin
5. if i is odd then
6. for j := 0 to n/2 − 1 do
7. compare-exchange(a2j+1, a2j+2);
8. if i is even then
9. for j := 1 to n/2 − 1 do
10. compare-exchange(a2j , a2j+1);
11. end for
12. end ODD-EVEN

Sequential odd-even transposition sort algorithm.
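A direct translation of this pseudocode into runnable Python (a minimal sketch, 0-indexed rather than 1-indexed):

def odd_even_sort(a):
    # n phases alternating odd and even pair exchanges. Odd phases compare
    # pairs (a1, a2), (a3, a4), ...; even phases compare (a2, a3), (a4, a5), ...
    # (1-indexed, as in the pseudocode above).
    n = len(a)
    for i in range(1, n + 1):
        start = 0 if i % 2 == 1 else 1
        for j in range(start, n - 1, 2):
            if a[j] > a[j + 1]:          # compare-exchange(a[j], a[j+1])
                a[j], a[j + 1] = a[j + 1], a[j]
    return a

print(odd_even_sort([3, 2, 3, 8, 5, 6, 4, 1]))  # [1, 2, 3, 3, 4, 5, 6, 8]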


Odd-Even Transposition
Unsorted

3 2 3 8 5 6 4 1

Phase 1 (odd)

2 3 3 8 5 6 1 4

Phase 2 (even)

2 3 3 5 8 1 6 4

Phase 3 (odd)

2 3 3 5 1 8 4 6

Phase 4 (even)

2 3 3 1 5 4 8 6

Phase 5 (odd)

2 3 1 3 4 5 6 8

Phase 6 (even)

2 1 3 3 4 5 6 8

Phase 7 (odd)

1 2 3 3 4 5 6 8

Phase 8 (even)

1 2 3 3 4 5 6 8

Sorted

Sorting n = 8 elements, using the odd-even transposition sort algorithm. During each phase, n = 8 elements are compared.
Odd-Even Transposition

• After n phases of odd-even exchanges, the sequence is sorted.

• Each phase of the algorithm (either odd or even) requires Θ(n) comparisons.

• Serial complexity is Θ(n²).


Parallel Odd-Even Transposition

• Consider the one item per processor case.

• There are n iterations; in each iteration, each processor does one compare-exchange.

• The parallel run time of this formulation is Θ(n).

• This is cost-optimal with respect to the base serial algorithm (bubble sort), but not with respect to the optimal Θ(n log n) serial sort.
Parallel Odd-Even Transposition

1. procedure ODD-EVEN PAR(n)
2. begin
3. id := process’s label
4. for i := 1 to n do
5. begin
6. if i is odd then
7. if id is odd then
8. compare-exchange min(id + 1);
9. else
10. compare-exchange max(id − 1);
11. if i is even then
12. if id is even then
13. compare-exchange min(id + 1);
14. else
15. compare-exchange max(id − 1);
16. end for
17. end ODD-EVEN PAR

Parallel formulation of odd-even transposition.


Parallel Odd-Even Transposition

• Consider a block of n/p elements per processor.

• The first step is a local sort.

• In each subsequent step, the compare-exchange operation is replaced by the compare-split operation.

• The parallel run time of the formulation is

TP = Θ((n/p) log (n/p)) + Θ(n) + Θ(n),

where the first term is the local sort, the second the comparisons, and the third the communication.
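A sequential simulation of this block-based formulation, building on the compare_split sketch above (assumes equal block sizes):

def parallel_odd_even_blocks(blocks):
    # p processes, n/p elements each. Local sort first, then p phases in
    # which neighboring processes compare-split their blocks.
    p = len(blocks)
    blocks = [sorted(b) for b in blocks]      # local sort: Θ((n/p) log (n/p))
    for i in range(1, p + 1):
        start = 0 if i % 2 == 1 else 1        # odd phase: pairs (0,1), (2,3), ...
        for j in range(start, p - 1, 2):
            blocks[j], blocks[j + 1] = compare_split(blocks[j], blocks[j + 1])
    return blocks

print(parallel_odd_even_blocks([[8, 3], [5, 1], [7, 2], [6, 4]]))
# [[1, 2], [3, 4], [5, 6], [7, 8]]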
Shellsort

• Let n be the number of elements to be sorted and p be the number of processes.

• During the first phase, processes that are far away from each other in the array compare-split their elements.

• During the second phase, the algorithm switches to an odd-even transposition sort.
Parallel Shellsort

[Figure: the first phase of parallel shellsort on the eight-process array 0 3 4 5 6 7 2 1; in successive steps, each process is paired with its mirror image in the whole array, then within each half, then within each quarter.]

An example of the first phase of parallel shellsort on an eight-process array.
Parallel Shellsort

• Each process performs d = log p compare-split operations.

• With O(p) bisection width, each communication can be performed in time Θ(n/p), for a total time of Θ((n log p)/p).

• In the second phase, l odd and even phases are performed, each requiring time Θ(n/p).

• The parallel run time of the algorithm is:

TP = Θ((n/p) log (n/p)) + Θ((n/p) log p) + Θ(l (n/p)),   (1)

where the first term is the local sort, the second the first phase, and the third the second phase.
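The first-phase pairing pattern sketched below is one common reading of the description above (a simulated sketch; assumes p is a power of two and blocks are already locally sorted):

def shellsort_phase1(blocks):
    # In step k the p processes form 2**k contiguous groups; each process
    # compare-splits with its mirror image within its group, so the most
    # distant blocks are exchanged first. log p steps in total.
    p = len(blocks)
    group = p
    while group >= 2:
        for base in range(0, p, group):
            for i in range(group // 2):
                lo, hi = base + i, base + group - 1 - i
                blocks[lo], blocks[hi] = compare_split(blocks[lo], blocks[hi])
        group //= 2
    return blocks

After this phase the blocks are close to their final positions, and a few odd-even phases (as above) complete the sort.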
Quicksort

• Quicksort is one of the most common sorting algorithms for sequential computers because of its simplicity, low overhead, and optimal average complexity.

• Quicksort selects one of the entries in the sequence to be the pivot and divides the sequence into two parts – one with all elements less than the pivot and the other with all elements greater.

• The process is recursively applied to each of the sublists.


Quicksort
(a) 3 2 1 5 8 4 3 7   (pivot: 3)
(b) 1 2 3 5 8 4 3 7   (pivot now in its final position)
(c) 1 2 3 3 4 5 8 7
(d) 1 2 3 3 4 5 7 8
(e) 1 2 3 3 4 5 7 8   (sorted)

Example of the quicksort algorithm sorting a sequence of size n = 8.
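For reference, a minimal runnable Python version of the sequential algorithm (list-based rather than in-place, for clarity):

def quicksort(a):
    # Pick a pivot, partition into elements <= pivot and > pivot, recurse.
    if len(a) <= 1:
        return a
    pivot, rest = a[0], a[1:]
    less = [x for x in rest if x <= pivot]
    greater = [x for x in rest if x > pivot]
    return quicksort(less) + [pivot] + quicksort(greater)

print(quicksort([3, 2, 1, 5, 8, 4, 3, 7]))  # [1, 2, 3, 3, 4, 5, 7, 8]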
Quicksort

• The performance of quicksort depends critically on the quality of the pivot.

• In the best case, the pivot divides the list in such a way that the larger of the two lists does not have more than αn elements (for some constant α).

• In this case, the complexity of quicksort is O(n log n).


Parallelizing Quicksort: Shared Address Space
Formulation

• Consider a list of size n equally divided across p processors.

• A pivot is selected by one of the processors and made known to all processors.

• Each processor partitions its list into two, say Si and Li, based on the selected pivot.

• All of the Si lists are merged and all of the Li lists are merged separately.

• The set of processors is partitioned into two (in proportion to the sizes of lists S and L). The process is recursively applied to each of the lists.
Shared Address Space Formulation
[Figure: shared-address-space quicksort on p = 5 processes, four elements each, sorting 20 elements.
First step: pivot = 7; after local rearrangement within each process and global rearrangement, the array becomes 7 2 1 6 3 4 5 | 18 13 17 14 20 10 15 9 19 16 12 11 8, and the processes are partitioned between the two parts.
Second step: pivots 5 and 17 are selected in the two parts; after local and global rearrangement the array becomes 1 2 3 4 5 | 7 6 | 14 13 17 10 15 9 16 12 11 8 | 18 20 19.
Third step: pivot = 11 is selected for the part held by P2 and P3, and that part is rearranged.
Fourth step: a final local rearrangement of each process’s sublist yields the solution 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20.]
Parallelizing Quicksort: Shared Address Space
Formulation

• How to globally merge the local lists (S0, L0, S1, L1, . . .) to form S and L?

• Each processor needs to determine the right location for its elements in the merged list.

• Each processor first counts the number of elements locally less than and greater than the pivot.

• It then computes two sum-scans to determine the starting location for its elements in the merged S and L lists.

• Once it knows the starting locations, it can write its elements safely.
Parallelizing Quicksort: Shared Address Space
Formulation
[Figure: efficient global rearrangement of the array for the first step of the example (pivot = 7). After local rearrangement the array is 7 2 18 13 | 1 17 14 20 | 6 10 15 9 | 3 4 19 16 | 5 12 11 8, with |Si| = 2, 1, 1, 2, 1 and |Li| = 2, 3, 3, 2, 3. Prefix sums of these counts (0 2 3 4 6 and 0 2 5 8 10) give each process the starting index at which to write its Si and Li elements, producing the globally rearranged array 7 2 1 6 3 4 5 18 13 17 14 20 10 15 9 19 16 12 11 8.]

Efficient global rearrangement of the array.
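A sequential simulation of this scan-based rearrangement (a sketch; the function name is mine), reproducing the step shown above:

def global_rearrange(blocks, pivot):
    # Each process counts its elements <= pivot (Si) and > pivot (Li);
    # exclusive prefix sums of the counts give each process its write
    # offsets into the S and L regions of the shared array.
    p = len(blocks)
    S = [[x for x in b if x <= pivot] for b in blocks]
    L = [[x for x in b if x > pivot] for b in blocks]
    s_off, l_off, s, l = [], [], 0, 0
    for i in range(p):                       # exclusive sum-scans
        s_off.append(s); l_off.append(l)
        s += len(S[i]); l += len(L[i])
    out = [None] * (s + l)                   # s = |S| = start of the L region
    for i in range(p):                       # every process writes independently
        out[s_off[i]:s_off[i] + len(S[i])] = S[i]
        out[s + l_off[i]:s + l_off[i] + len(L[i])] = L[i]
    return out

arr = [[7, 2, 18, 13], [1, 17, 14, 20], [6, 10, 15, 9], [3, 4, 19, 16], [5, 12, 11, 8]]
print(global_rearrange(arr, 7))
# [7, 2, 1, 6, 3, 4, 5, 18, 13, 17, 14, 20, 10, 15, 9, 19, 16, 12, 11, 8]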


Parallelizing Quicksort: Shared Address Space
Formulation

• The parallel time depends on the split and merge time, and the
quality of the pivot.

• The latter is an issue independent of parallelism, so we focus on the first aspect, assuming ideal pivot selection.

• The algorithm executes in four steps: (i) determine and broadcast the pivot; (ii) locally rearrange the array assigned to each process; (iii) determine the locations in the globally rearranged array that the local elements will go to; and (iv) perform the global rearrangement.

• The first step takes time Θ(log p), the second, Θ(n/p), the third, Θ(log p), and the fourth, Θ(n/p).

• The overall complexity of splitting an n-element array is Θ(n/p) + Θ(log p).
Parallelizing Quicksort: Shared Address Space
Formulation

• The process recurses until there are p lists, at which point, the
lists are sorted locally.

• Therefore, the total parallel time is:

TP = Θ((n/p) log (n/p)) + Θ((n/p) log p) + Θ(log² p),   (2)

where the first term is the local sort and the second the array splits.

• The corresponding isoefficiency is Θ(p log² p) due to broadcast and scan operations.
Parallelizing Quicksort: Message Passing Formulation

• A simple message passing formulation is based on the recursive halving of the machine.

• Assume that each processor in the lower half of a p-processor ensemble is paired with a corresponding processor in the upper half.

• A designated processor selects and broadcasts the pivot.

• Each processor splits its local list into two lists, one with elements less than the pivot (Si) and the other with elements greater (Li).

• A processor in the low half of the machine sends its list Li to the
paired processor in the other half. The paired processor sends
its list Si.

• It is easy to see that after this step, all elements less than the
pivot are in the low half of the machine and all elements
greater than the pivot are in the high half.
Parallelizing Quicksort: Message Passing Formulation

• The above process is recursed until each processor has its own
local list, which is sorted locally.

• The time for a single reorganization is Θ(log p) for broadcasting the pivot element, Θ(n/p) for splitting the locally assigned portion of the array, and Θ(n/p) for the exchange and local reorganization.

• We note that this time is identical to that of the corresponding shared address space formulation.

• It is important to remember that the reorganization of elements is a bandwidth-sensitive operation.
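A sequential simulation of the recursive-halving scheme described above (a sketch; list indices stand in for process labels, and the number of processes is assumed to be a power of two):

def mp_quicksort(blocks):
    # All processes in the group use one pivot. A process in the low half
    # keeps its Si and receives its partner's Sj; the partner in the high
    # half keeps Lj and receives Li. Then recurse on each half.
    p = len(blocks)
    if p == 1:
        return [sorted(blocks[0])]               # final local sort
    pivot = blocks[0][0] if blocks[0] else 0     # designated process broadcasts the pivot
    half = p // 2
    new = [None] * p
    for i in range(half):
        j = i + half                             # partner in the upper half
        Si = [x for x in blocks[i] if x <= pivot]
        Li = [x for x in blocks[i] if x > pivot]
        Sj = [x for x in blocks[j] if x <= pivot]
        Lj = [x for x in blocks[j] if x > pivot]
        new[i] = Si + Sj                         # low half collects the small elements
        new[j] = Li + Lj                         # high half collects the large elements
    return mp_quicksort(new[:half]) + mp_quicksort(new[half:])

print(mp_quicksort([[7, 13, 2, 17], [1, 14, 20, 6], [10, 15, 9, 3], [16, 4, 11, 5]]))
# [[1, 2, 3, 4, 5, 6, 7], [], [9, 10, 11, 13], [14, 15, 17, 20]]

Note how the naive pivot choice leaves one process with an empty list: exactly the load-imbalance issue that makes pivot quality matter.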
