Module #5: Parallel Algorithms (October 30, 2024)
Example 2: Construct an algorithm (in Pseudo-code) for Searching (finding out) if a given integer, x, exists in a set
of integers. Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7} The targeted integer is, x= -12
Example 3: Construct an algorithm (in Pseudo-code) for finding out if a given integer, x, exists in a set of sorted
integers. Illustration: {-22, -12 , -7 , -1 , 1 , 5 , 10, 15, 23, 35} The targeted integer is, x= -12
Example 4: Construct an algorithm (in Pseudo-code) for sorting a set of integers in increasing order using the
Bubble sort technique. Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7}
Example 5: Construct an algorithm (in Pseudo-code) for sorting a set of integers in increasing order using the
insertion sort technique. Illustration {8, 2, 4, 9, 3, 6}
Example 6: Construct an algorithm (in Pseudo-code) for sorting a set of integers in increasing order using the
selection sort technique. Illustration {8, 2, 4, 9, 3, 6}
Introduction (Conventional Algorithms Processing)
Pseudo-Code
• Pseudo-Code: a high-level abstraction of code, used to outline the general steps of an algorithm
without writing actual code (done for the reader's or programmer's benefit).
Example Algorithms in Pseudo-code
Example 1: Construct an algorithm (in Pseudo-code) for finding the minimum (smallest) value in a finite set of
integers.
• Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7}
• Algorithm 1 Find-Min
1. Input: set of integers a1, a2, ..., an
2. Output: min, the minimum value in the set
3. Steps:
3.1. min := a1
3.2. for i := 2 to n
3.3. if min > ai then min := ai
3.4. end for
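The Find-Min pseudocode above can be sketched as a small Python function (illustrative, not part of the original slides):

```python
def find_min(a):
    """Scan the list once, keeping the smallest value seen so far."""
    minimum = a[0]          # step 3.1: min := a1
    for x in a[1:]:         # step 3.2: for i := 2 to n
        if minimum > x:     # step 3.3
            minimum = x
    return minimum

print(find_min([5, -1, 10, 35, -22, 15, -12, 23, 1, -7]))  # -22
```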
◼ Example 2: Construct an algorithm (in Pseudo-code) for Searching (finding out) if a given
integer, x, exists in a set of integers.
◼ Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7} The targeted integer is, x= -12
◼ Algorithm 2 Linear-Search
1. Input: set of integers a1, a2, ..., an and integer x
2. Output: location = i if x = ai; otherwise, location = 0
3. Steps:
3.1. i := 1
3.2. while (i ≤ n and x ≠ ai)
i := i + 1
3.3. end while
3.4. if i ≤ n then location := i else location := 0
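The Linear-Search algorithm, rendered as a Python sketch (illustrative; it keeps the slides' 1-based location convention):

```python
def linear_search(a, x):
    """Return the 1-based location of x in a, or 0 if x is absent (Algorithm 2)."""
    i = 1
    while i <= len(a) and a[i - 1] != x:   # step 3.2
        i += 1
    return i if i <= len(a) else 0         # step 3.4

print(linear_search([5, -1, 10, 35, -22, 15, -12, 23, 1, -7], -12))  # 7
```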
◼ Example 3: Construct an algorithm (in Pseudo-code) for finding out if a given integer, x,
exists in a set of sorted integers.
◼ Illustration: {-22, -12 , -7 , -1 , 1 , 5 , 10, 15, 23, 35}
The targeted integer is, x= -12
◼ Algorithm 3 Binary-Search
1. Input: set of n sorted integers a1, a2, ..., an and integer x
2. Output: location = k if x = ak; otherwise location = 0
3. Steps:
3.1. i := 1, j := n, k := 0
3.2. while ((i ≤ j) and (k = 0))
3.3. m := ⌊(i + j)/2⌋
3.4. if x = am then k := m
3.5. else if x < am then j := m-1
3.6. else i := m+1
3.7. end while
Trace the Algorithm.
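One way to trace it is to run a Python sketch of the same algorithm (illustrative, 1-based locations as in the slides):

```python
def binary_search(a, x):
    """1-based binary search over a sorted list; returns 0 if x is absent."""
    i, j, k = 1, len(a), 0
    while i <= j and k == 0:
        m = (i + j) // 2        # step 3.3: midpoint
        if x == a[m - 1]:
            k = m
        elif x < a[m - 1]:
            j = m - 1           # discard the upper half
        else:
            i = m + 1           # discard the lower half
    return k

print(binary_search([-22, -12, -7, -1, 1, 5, 10, 15, 23, 35], -12))  # 2
```

On the illustration, the first midpoint is m = 5 (value 1); since -12 < 1 the search narrows to the lower half and finds -12 at location 2.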
• Example 4: Construct an algorithm (in Pseudo-code) for sorting a set of integers in increasing
order.
• Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7}
• Algorithm 4 Bubble-Sort
1. Input: set of integers a1, a2, ..., an
2. Output: a1, a2, ..., an sorted in increasing order
3. Steps:
3.1. for i := 1 to n-1
3.2. for j := 1 to n-i
3.3. if aj > aj+1 then interchange aj and aj+1
3.4. end for
▪ The size is measured in terms of the number of inputs processed by the algorithm.
▪ The number of basic operations to process an input of certain size is important in the analysis.
▪ The time taken to complete a basic operation is considered to be independent of the particular values of its operands.
n          1    2    4    8     16    32
C = 1      1    1    1    1     1     1
log2 n     0    1    2    3     4     5
n·log2 n   0    2    8    24    64    160
n²         1    4    16   64    256   1K
n³         1    8    64   512   4K    32K
2^n        2    4    16   256   64K   4G
Introduction (Conventional Sorting Algorithms Processing)
1. Bubble Sort
▪ Basic Idea: the idea is to repeatedly move the largest element to the highest index
position of the array.
▪ Each iteration reduces the effective size of the array by one.
▪ The focus is on successive adjacent pairs of elements in the array: each pair is compared and
swapped if out of order. In either case, after such a step, the larger of the two
elements will be in the higher index position.
▪ The focus then moves to the next higher position, and the process is repeated.
▪ When the focus reaches the end of the effective array, the largest element will have
``bubbled'' from whatever its original position to the highest index position in the
effective array.
Bubble Sort (Cont’d)
◼ Example: Consider the array A = {45, 67, 12, 34, 25, 39} (indices 0-5). One bubble step proceeds:
45 67 12 34 25 39   (compare 45, 67: no swap)
45 12 67 34 25 39   (compare 67, 12: swap)
45 12 34 67 25 39   (compare 67, 34: swap)
45 12 34 25 67 39   (compare 67, 25: swap)
45 12 34 25 39 67   (compare 67, 39: swap — 67 has bubbled to the highest index)
Bubble Sort (Cont’d)
• A bubble step is done by the following loops:
for j ← 0 to n-2
    for i ← 0 to n-2-j
        if A[i] > A[i+1] then
            Temp ← A[i]
            A[i] ← A[i+1]
            A[i+1] ← Temp
Worst-case Analysis: O(n²)
▪ The loop compares all adjacent elements at index i and i + 1. If they are not in the correct order, they are
swapped.
▪ One complete bubble step moves the largest element to the last position, which is the correct position for that
element in the final sorted array.
▪ The effective size of the array is reduced by one and the process is repeated until the effective size becomes one.
▪ Each bubble step moves the largest element in the effective array to the highest index of the effective array.
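The nested loops above can be sketched in Python (an illustrative helper, not from the slides):

```python
def bubble_sort(a):
    """In-place bubble sort: each outer pass bubbles the largest remaining element up."""
    n = len(a)
    for j in range(n - 1):            # n-1 bubble steps
        for i in range(n - 1 - j):    # effective size shrinks by one each pass
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(bubble_sort([45, 67, 12, 34, 25, 39]))  # [12, 25, 34, 39, 45, 67]
```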
2. Insertion Sort
• Basic Idea: For each element in the list of elements, find the proper slot where it should belong, and insert it.
• One element by itself is already sorted.
• Two elements are then considered and sorted, i.e., swapped if needed.
• With three elements, the third element is swapped leftward until it is in its proper order with the first two.
• With four elements, the fourth element is swapped leftward until it is in its proper order with the first three.
• Continue in this manner with the fifth element, the sixth element, and so on until the whole list is sorted.
Insertion Sort
Algorithm InsertionSort(A):
Input: An Array A of n elements
Output: The array A with its n elements sorted in a non-decreasing order
for i ← 1 to n-1 do
Temp ← A[i]
j ← i-1
while j ≥ 0 and A[j] > Temp do
A[j+1] ← A[j]
j ← j-1
end while
A[j+1] ← Temp
Trace the algorithm.
end for
• Best-case analysis: the elements are already sorted.
The inner loop is never executed; the outer loop is executed n-1 times, i.e., O(n).
• Worst-case analysis: the elements are in reverse order.
The inner loop is executed the maximum number of times; the outer loop is executed n-1 times, i.e., O(n²).
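Algorithm InsertionSort above, rendered as a Python sketch (illustrative):

```python
def insertion_sort(a):
    """In-place insertion sort: grow a sorted prefix one element at a time."""
    for i in range(1, len(a)):
        temp = a[i]
        j = i - 1
        while j >= 0 and a[j] > temp:   # shift larger elements right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = temp                 # drop the key into its slot
    return a

print(insertion_sort([8, 2, 4, 9, 3, 6]))  # [2, 3, 4, 6, 8, 9]
```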
INSERTION-SORT(A)
1 for j ← 2 to length[A]
2     do key ← A[j]
3        ▷ Insert A[j] into the sorted sequence A[1 .. j-1].
4        i ← j-1
5        while i > 0 and A[i] > key
6            do A[i+1] ← A[i]
7               i ← i-1
8        A[i+1] ← key
[Figure: the current key is compared against the sorted prefix A[1 .. j-1] and inserted into its proper slot.]
Example of insertion sort on {8, 2, 4, 9, 3, 6}:
8 2 4 9 3 6
2 8 4 9 3 6   (insert 2)
2 4 8 9 3 6   (insert 4)
2 4 8 9 3 6   (9 already in place)
2 3 4 8 9 6   (insert 3)
2 3 4 6 8 9   (insert 6)
Cost/times entries for lines 6-8 of INSERTION-SORT:
6  do A[i+1] ← A[i]   cost c6   times Σ_{j=2..n} (t_j − 1)
7     i ← i-1         cost c7   times Σ_{j=2..n} (t_j − 1)
8  A[i+1] ← key       cost c8   times n − 1
Analysis of INSERTION-SORT
The total running time is
T(n) = c1·n + c2·(n−1) + c4·(n−1) + c5·Σ_{j=2..n} t_j + c6·Σ_{j=2..n} (t_j − 1) + c7·Σ_{j=2..n} (t_j − 1) + c8·(n−1).
• The best case: The array is already sorted. (tj =1 for j=2,3, ...,n)
T(n) = c1·n + c2·(n−1) + c4·(n−1) + c5·(n−1) + c8·(n−1)
     = (c1 + c2 + c4 + c5 + c8)·n − (c2 + c4 + c5 + c8).
• The worst case: The array is reverse sorted (tj =j for j=2,3, ...,n).
T(n) = c1·n + c2·(n−1) + c4·(n−1) + c5·(n(n+1)/2 − 1) + c6·(n(n−1)/2) + c7·(n(n−1)/2) + c8·(n−1)
     = (c5/2 + c6/2 + c7/2)·n² + (c1 + c2 + c4 + c5/2 − c6/2 − c7/2 + c8)·n − (c2 + c4 + c5 + c8)
T(n) = a·n² + b·n + c, i.e., O(n²).
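The best/worst-case values of t_j can be checked empirically by counting executions of the while-loop test (an illustrative sketch, not from the slides):

```python
def insertion_sort_comparisons(a):
    """Count the while-loop tests (the t_j values summed) during insertion sort."""
    a = list(a)
    tests = 0
    for i in range(1, len(a)):
        temp, j = a[i], i - 1
        while True:
            tests += 1                  # one execution of the while-test
            if j >= 0 and a[j] > temp:
                a[j + 1] = a[j]
                j -= 1
            else:
                break
        a[j + 1] = temp
    return tests

n = 6
print(insertion_sort_comparisons(range(n)))         # sorted input: each t_j = 1, total n-1 = 5
print(insertion_sort_comparisons(range(n, 0, -1)))  # reversed input: t_j = j, total n(n+1)/2 - 1 = 20
```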
3. Selection Sort
• Basic Idea: repeatedly select the smallest element of the unsorted part and swap it into the next position of the sorted prefix.
for I ← 0 to n-2
    Temp ← A[I]
    Location ← I
    for J ← I+1 to n-1
        if A[J] < A[Location] then Location ← J
    A[I] ← A[Location]
    A[Location] ← Temp
Trace the algorithm on {126, 43, 26, 1, 113}.
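The selection-sort loops above, sketched in Python (illustrative):

```python
def selection_sort(a):
    """In-place selection sort: swap the minimum of the unsorted suffix into place."""
    n = len(a)
    for i in range(n - 1):
        location = i                      # index of the smallest element seen so far
        for j in range(i + 1, n):
            if a[j] < a[location]:
                location = j
        a[i], a[location] = a[location], a[i]
    return a

print(selection_sort([126, 43, 26, 1, 113]))  # [1, 26, 43, 113, 126]
```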
Introduction (Conventional Matrix Multiplication Processing)
Example 1: Matrix-Vector Multiplication
• A x b = y
• Allocate tasks to the rows of A: y[i] = Σ_j A[i,j]·b[j]
• Dependencies: computing each element of y can be done independently.
• Speedup?
For matrix-matrix multiplication, A x B = C, each entry is a row-by-column inner product: A[i,:] · B[:,j] = C[i,j].
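Because each y[i] depends only on row i of A and on b, the rows can be computed in parallel. A minimal sketch using a thread pool (illustrative; names are not from the slides):

```python
from concurrent.futures import ThreadPoolExecutor

def matvec_row(A, b, i):
    """Compute one element of y = A x b; rows are independent of one another."""
    return sum(A[i][j] * b[j] for j in range(len(b)))

def matvec(A, b):
    """Map each row computation to a worker; no synchronization is needed."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda i: matvec_row(A, b, i), range(len(A))))

print(matvec([[1, 2], [3, 4]], [5, 6]))  # [17, 39]
```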
Amdahl’s law: if a fraction f of the computation is unaffected by an enhancement and the rest is sped up by a factor p, the overall speedup is
s = 1 / (f + (1 − f)/p), which is bounded by min(p, 1/f).
[Figure: speedup s versus enhancement factor p for f = 0, 0.01, 0.02, 0.05, and 0.1; each curve with f > 0 saturates at 1/f.]
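The formula is easy to evaluate directly (an illustrative helper, not from the slides):

```python
def amdahl_speedup(f, p):
    """Amdahl's law: f is the unaffected (serial) fraction, p the enhancement factor."""
    return 1.0 / (f + (1.0 - f) / p)

print(amdahl_speedup(0.0, 10))   # 10.0 (no serial part: speedup equals p)
print(amdahl_speedup(0.1, 50))   # ~8.47, bounded above by 1/f = 10
```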
Introduction to Parallel Architectures and Algorithms
Types of Parallelism: the (Flynn and Johnson) Taxonomy
Flynn’s categories classify machines by single vs. multiple instruction streams and data streams; Johnson’s expansion further divides the multiple-stream class by global vs. distributed memory.
[Figure: taxonomy grid. Recoverable entries: SISD — uniprocessors; SIMD — array or vector processors; GMSV — shared-memory multiprocessors; GMMP — rarely used.]
✓ Common Parallel Architecture Models
Message Passing
• Mapping technique: sending and receiving messages over an interconnection network.
• The most widely used model for programming parallel computers (clusters of workstations).
• Communication primitives: send(buff, size, destination) and receive(buff, size, source); blocking vs. non-blocking, buffered vs. non-buffered.
• Key attributes: partitioned address space; explicit parallelization; process interactions by sending and receiving data.
• The Message Passing Interface (MPI) is a popular message-passing library (~125 functions).
Work Pool
• Mapping of work/data: no desired pre-mapping; any task may be performed by any process; tasks are mapped dynamically to processes.
• Computation: processes take work from an input queue as data becomes available (or requests arrive) and place results on an output queue.
• Synchronization: adding/removing work from the input queue.
• Example: a web server.
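The work-pool model can be sketched with a thread pool whose workers pull tasks dynamically (illustrative; the handler is a hypothetical stand-in for, e.g., serving a web request):

```python
from multiprocessing.pool import ThreadPool

def work_pool_demo(tasks, workers=4):
    """Work-pool sketch: tasks go to whichever worker is free, with no pre-mapping."""
    def handle(request):          # hypothetical request handler
        return request * request
    with ThreadPool(workers) as pool:
        return pool.map(handle, tasks)

print(work_pool_demo([1, 2, 3, 4, 5]))  # [1, 4, 9, 16, 25]
```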
o The speed at which sequential computers operate has been improving at an exponential rate for many years, but the improvement now comes at greater and greater cost.
o This motivates designing algorithms that specify multiple operations on each step, i.e., parallel algorithms.
o Example: computing the sum of a sequence A of n numbers.
o Example: computing the sum of a sequence A of n numbers.
o It is not difficult, however, to devise an algorithm for computing the sum that performs many operations in parallel. For example,
o suppose that, in parallel, each element of A with an even index is paired and summed with the next element of A, which has an odd index,
o A[0] is paired with A[1], A[2] with A[3], and so on.
o The result is a new sequence of ⌈n/2⌉ numbers that sum to the same value as the sum that we wish to compute.
o This pairing and summing step can be repeated until, after ⌈log2 n⌉ steps, a sequence consisting of a single value is produced, and this value
is equal to the final sum.
o It is important to make a distinction between the parallelism in an algorithm and the ability of any particular computer to perform multiple
operations in parallel.
o In order for a parallel algorithm to run efficiently on any type of computer, the algorithm must contain at least as much parallelism as the
computer.
o The converse does not always hold: some parallel computers cannot efficiently execute all algorithms, even if the algorithms contain a great
deal of parallelism.
o Experience has shown that it is more difficult to build a general-purpose parallel machine than a general-purpose sequential machine.
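The pairing-and-summing scheme described above can be sketched as follows (illustrative; each list comprehension stands for one parallel round):

```python
def pairwise_sum(a):
    """Tree summation: each round pairs adjacent elements, halving the sequence."""
    a = list(a)
    rounds = 0
    while len(a) > 1:
        # In parallel: each even-indexed element is summed with its odd-indexed neighbour;
        # a leftover element (odd length) passes through unchanged.
        a = [a[i] + a[i + 1] if i + 1 < len(a) else a[i] for i in range(0, len(a), 2)]
        rounds += 1
    return a[0], rounds

print(pairwise_sum(range(1, 9)))  # (36, 3): the sum of 1..8 in log2(8) = 3 rounds
```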
[Figure: pairwise summation tree — 8 additions in the first round, then 4, 2, and 1.]
[Figure: memory modules M1, M2, ..., Mn connected to processors P1, P2, ..., Pn.]
▪ In all three types of models, there may be differences in the operations that the processors and networks are
allowed to perform.
✓ Network topology
Common topologies include the bus, mesh, hypercube, and multistage network.
[Figure: (a) a bus; (b) a mesh of switches numbered 1-12; (c) a hypercube with nodes labeled 000-111; (d) a multistage network.]
The Bus:
The simplest network topology is a bus.
This network can be used in both local memory machine models and modular memory machine models. In either
case, all processors and memory modules are typically connected to a single bus. In each step, at most one piece of
data can be written onto the bus. This data might be a request from a processor to read or write a memory value, or
it might be the response from the processor or memory module that holds the value.
In practice, the advantage of using a bus is that it is simple to build and, because all processors and memory modules
can observe the traffic on the bus, it is relatively easy to develop protocols that allow processors to cache memory
values locally.
The disadvantage of using a bus is that the processors have to take turns accessing the bus. Hence, as more
processors are added to a bus, the average time to perform a memory access grows proportionately.
Mesh Topology
Several variations on meshes are also popular, including 3-dimensional meshes, toruses, and hypercubes. A torus is
a mesh in which the switches on the sides have connections to the switches on the opposite sides. Thus, every switch
(x, y) is connected to four other switches: (x, y+1 mod Y), (x, y−1 mod Y), (x+1 mod X, y), and (x−1 mod X, y). The figure
shows an example of a 2-dimensional mesh.
Multistage network
A multistage network is used to connect one set of switches called the input switches to another set called the output
switches through a sequence of stages of switches.
The stages of a multistage network are numbered 1 through L, where L is the depth of the network. The switches on
stage 1 are the input switches, and those on stage L are the output switches. In most multistage networks, it is possible
to send a message from any input switch to any output switch along a path that traverses the stages of the network in
order from 1 to L.
Multistage networks are frequently used in modular memory computers; typically, processors are attached to input
switches, and memory modules to output switches.
A processor accesses a word of memory by injecting a memory access request message into the network.
This message then travels through the network to the appropriate memory module.
If the request is to read a word of memory, then the memory module sends the data back through the network to the
requesting processor.
Routing of Networks
An alternative to modeling the topology of a network is to summarize its routing capabilities in terms of two
parameters, its latency and bandwidth.
The latency, L, of a network is the time it takes for a message to traverse the network. In actual networks this will
depend on the topology of the network, which particular ports the message is passing between, and the congestion of
messages in the network. The latency, is often modeled by considering the worst-case time assuming that the network
is not heavily congested.
The bandwidth at each port of the network is the rate at which a processor can inject data into the network. In actual
networks this will depend on the topology of the network, the bandwidths of the network’s individual communication
channels, and, again, the congestion of messages in the network. The bandwidth can often be usefully modeled as the
maximum rate at which processors can inject messages into the network without causing it to become heavily
congested, assuming a uniform distribution of message destinations.
✓ Primitive operations
We assume that all processors are allowed to perform the same local instructions as the single processor in the standard
sequential RAM model. (This issue will be discussed in detail in the Abstract Model Module.)
Work-depth models (focusing on the algorithm not the machine)
In a work-depth model, the cost of an algorithm is determined by examining the total number of operations that it
performs, and the dependencies among those operations.
An algorithm’s work W is the total number of operations that it performs; its depth D is the longest chain of
dependencies among its operations.
We call the ratio P = W/D the parallelism of the algorithm.
The advantage of using a work-depth model is that there are no machine-dependent details to complicate the
design and analysis of algorithms.
Figure: Summing 16 numbers on a tree. The total depth (longest chain of dependencies) is 4 and the total
work (number of operations) is 15. For this family of circuits, W(n) = n − 1 and D(n) = log2 n.
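The W(n) = n − 1 and D(n) = log2 n claims can be checked with a small counter that simulates the rounds of the summation tree (an illustrative sketch, not from the slides):

```python
def summation_work_depth(n):
    """Count the additions (work W) and parallel rounds (depth D) of tree summation."""
    work, depth, m = 0, 0, n
    while m > 1:
        pairs = m // 2      # additions performed in parallel this round
        work += pairs
        m -= pairs          # pair results, plus a possible leftover element
        depth += 1
    return work, depth

print(summation_work_depth(16))  # (15, 4): W = 16 - 1, D = log2(16)
```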