
Module #5

Introduction to Parallel Algorithms Processing

Professor Mostafa Abd-El-Barr

Fall Term 2024-2025



Outline
1. Conventional Algorithms Processing
2. Conventional Sorting Algorithms Processing
3. Introduction (Conventional Matrix Multiplication Processing)
4. Introduction to Parallel Architectures and Algorithms
Introduction (Conventional Algorithms Processing)
Example Algorithms in Pseudo-code
Example 1: Construct an algorithm (in Pseudo-code) for finding the minimum (smallest) value in a finite set of
integers. Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7}

Example 2: Construct an algorithm (in Pseudo-code) for Searching (finding out) if a given integer, x, exists in a set
of integers. Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7} The targeted integer is, x= -12

Example 3: Construct an algorithm (in Pseudo-code) for finding out if a given integer, x, exists in a set of sorted
integers. Illustration: {-22, -12 , -7 , -1 , 1 , 5 , 10, 15, 23, 35} The targeted integer is, x= -12
Example 4: Construct an algorithm (in Pseudo-code) for sorting a set of integers in increasing order using the
Bubble sort technique. Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7}
Example 5: Construct an algorithm (in Pseudo-code) for sorting a set of integers in increasing order using the
insertion sort technique. Illustration {8, 2, 4, 9, 3, 6}
Example 6: Construct an algorithm (in Pseudo-code) for sorting a set of integers in increasing order using the
selection sort technique. Illustration {8, 2, 4, 9, 3, 6}
Introduction (Conventional Algorithms Processing)
Pseudo-Code
• Pseudo-Code: a high-level abstraction of code, used to outline the general steps of an algorithm without
writing actual code (usually done for the reader's or programmer's benefit).
Example Algorithms in Pseudo-code
Example 1: Construct an algorithm (in Pseudo-code) for finding the minimum (smallest) value in a finite set of
integers.
• Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7}
• Algorithm 1 Find-Min
1. Input: set of integers a1, a2, ..., an
2. Output: min, the smallest value in the set
3. Steps:
   3.1. min := a1
   3.2. for i := 2 to n
   3.3.     if min > ai then min := ai
   3.4. end for
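A minimal runnable sketch of Algorithm 1 (Python is used here only for illustration; the slides use pseudocode, and the name find_min is assumed):

def find_min(a):
    minimum = a[0]            # min := a1
    for x in a[1:]:           # for i := 2 to n
        if minimum > x:       # if min > ai then min := ai
            minimum = x
    return minimum

print(find_min([5, -1, 10, 35, -22, 15, -12, 23, 1, -7]))  # prints -22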
Introduction (Conventional Algorithms Processing)
Example Algorithms in Pseudo-code
◼ Example 2: Construct an algorithm (in Pseudo-code) for Searching (finding out) if a given
integer, x, exists in a set of integers.
◼ Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7} The targeted integer is, x= -12
◼ Algorithm 2 Linear-Search
1. Input: set of integers a1, a2, ..., an and integer x
2. Output: location = i if x = ai; otherwise, location = 0
3. Steps:
   3.1. i := 1
   3.2. while (i ≤ n and x ≠ ai)
            i := i + 1
   3.3. end while
   3.4. if i ≤ n then location := i else location := 0
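A runnable Python sketch of Algorithm 2 (linear_search is an assumed name; locations are 1-based, as in the pseudocode):

def linear_search(a, x):
    i = 1
    while i <= len(a) and x != a[i - 1]:   # while (i <= n and x != ai)
        i += 1
    return i if i <= len(a) else 0         # location := i, or 0 if not found

print(linear_search([5, -1, 10, 35, -22, 15, -12, 23, 1, -7], -12))  # prints 7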
Introduction (Conventional Algorithms Processing)
Example Algorithms in Pseudo-code
◼ Example 3: Construct an algorithm (in Pseudo-code) for finding out if a given integer, x,
exists in a set of sorted integers.
◼ Illustration: {-22, -12 , -7 , -1 , 1 , 5 , 10, 15, 23, 35}
The targeted integer is, x= -12
◼ Algorithm 3 Binary-Search
1. Input: set of n sorted integers a1, a2, ..., an and integer x
2. Output: location = k if x = ak; otherwise, location = 0
3. Steps:
   3.1. i := 1, j := n, k := 0
   3.2. while ((i ≤ j) and (k = 0))
   3.3.     m := ⌊(i + j) / 2⌋
   3.4.     if x = am then k := m
   3.5.     else if x < am then j := m - 1
   3.6.     else i := m + 1
   3.7. end while
Trace the Algorithm.
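A runnable Python sketch of Algorithm 3 (binary_search is an assumed name; the input must already be sorted):

def binary_search(a, x):
    i, j, k = 1, len(a), 0
    while i <= j and k == 0:
        m = (i + j) // 2          # m := floor((i + j) / 2)
        if x == a[m - 1]:
            k = m                 # found: record the 1-based location
        elif x < a[m - 1]:
            j = m - 1             # discard the upper half
        else:
            i = m + 1             # discard the lower half
    return k

print(binary_search([-22, -12, -7, -1, 1, 5, 10, 15, 23, 35], -12))  # prints 2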
Introduction (Conventional Algorithms Processing)
Example Algorithms in Pseudo-code
• Example 4: Construct an algorithm (in Pseudo-code) for sorting a set of integers in increasing
order.
• Illustration: {5, -1, 10, 35, -22, 15, -12, 23, 1, -7}
• Algorithm 4 Bubble-Sort
1. Input: set of integers a1, a2, ..., an
2. Output: a1, a2, ..., an sorted in increasing order
3. Steps:
   3.1. for i := 1 to n-1
   3.2.     for j := 1 to n-i
   3.3.         if aj > aj+1 then interchange aj and aj+1
   3.4.     end for
   3.5. end for

Trace the algorithm.


Introduction (Conventional Algorithms Processing)
✓ Assigning costs to algorithms: Asymptotic Algorithm Time Complexity (cost measures)
▪ Asymptotic (algorithm analysis) measures the efficiency of an algorithm as the size of the input becomes large.

▪ The size is measured in terms of the number of inputs processed by the algorithm.

▪ The number of basic operations to process an input of certain size is important in the analysis.

▪ The time taken to complete a basic operation is considered to be independent of the particular values of its operands.

Worst-case: (usually) the O-notation


• T(n) = maximum time of algorithm on any input of size n.

Average-case: (sometimes) the Θ-Theta notation


• T(n) = expected time of algorithm over all inputs of size n.
• Need assumption of statistical distribution of inputs.

Best-case: (Rarely) the Ω-Omega notation


• A slow algorithm can "cheat" by running fast on some favorable input, so best-case bounds say little.
Introduction (Conventional Algorithms Processing)
The O-notation
• For a given function g(n), we denote by O(g(n)) the set of functions

  O(g(n)) = { f(n) : there exist positive constants c and n0 such that
              0 ≤ f(n) ≤ c·g(n) for all n ≥ n0 }

• We use O-notation to give an asymptotic upper bound on a function, to within a constant factor.
• f(n) = O(g(n)) means that there exists some constant c such that f(n) is always ≤ c·g(n) for large enough n.
Introduction (Conventional Algorithms Processing)
Θ (Theta) notation
• For a given function g(n), we denote by Θ(g(n)) the set of functions

  Θ(g(n)) = { f(n) : there exist positive constants c1, c2, and n0 such that
              0 ≤ c1·g(n) ≤ f(n) ≤ c2·g(n) for all n ≥ n0 }

• A function f(n) belongs to the set Θ(g(n)) if there exist positive constants c1 and c2 such that it can be
"sandwiched" between c1·g(n) and c2·g(n) for sufficiently large n.
• f(n) = Θ(g(n)) means that there exist constants c1 and c2 such that c1·g(n) ≤ f(n) ≤ c2·g(n) for large enough n.
Introduction (Conventional Algorithms Processing)
Ω (Omega) notation
• For a given function g(n), we denote by Ω(g(n)) the set of functions

  Ω(g(n)) = { f(n) : there exist positive constants c and n0 such that
              0 ≤ c·g(n) ≤ f(n) for all n ≥ n0 }

• We use Ω-notation to give an asymptotic lower bound on a function, to within a constant factor.
• f(n) = Ω(g(n)) means that there exists some constant c such that f(n) is always ≥ c·g(n) for large enough n.
Introduction (Conventional Algorithms Processing)
Asymptotic notation

[Figure: graphic examples of Θ, O, and Ω.]


Introduction (Conventional Algorithms Processing)
✓Order of Growth of Functions
• We are interested in the running time of an algorithm for large input size.
• This allows us to consider the rate of growth or the order of growth of the running time.
• Example: consider the function f(n) = n² log n + 10n² + n.
• As n grows larger, the term n² log n dominates, and the lower-order terms 10n² and n contribute
progressively less to the total.
Rate of Growth of Functions

n           1     2     4     8      16     32
C = 1       1     1     1     1      1      1
log2 n      0     1     2     3      4      5
n·log2 n    0     2     8     24     64     160
n²          1     4     16    64     256    1K
n³          1     8     64    512    4K     32K
2^n         2     4     16    256    64K    4G
Introduction (Conventional sorting Algorithms Processing)

1. Bubble Sort
▪ Basic Idea: repeatedly move the largest element to the highest index position of the array.
▪ Each iteration reduces the effective size of the array by one.
▪ The focus is on successive adjacent pairs of elements in the array: they are compared and swapped if they
are out of order. In either case, after such a step, the larger of the two elements is in the higher index
position.
▪ The focus then moves to the next higher position, and the process is repeated.
▪ When the focus reaches the end of the effective array, the largest element will have
``bubbled'' from whatever its original position to the highest index position in the
effective array.

Introduction (Conventional sorting Algorithms Processing)
Bubble Sort (Cont’d)
◼ Example: Consider the array A holding the unsorted set of integers {45, 67, 12, 34, 25, 39}. One bubble
step proceeds as follows (indices 0–5):

45 67 12 34 25 39   (initial array)
45 67 12 34 25 39   (compare A[0], A[1]: no swap)
45 12 67 34 25 39   (compare A[1], A[2]: swap)
45 12 34 67 25 39   (compare A[2], A[3]: swap)
45 12 34 25 67 39   (compare A[3], A[4]: swap)
45 12 34 25 39 67   (compare A[4], A[5]: swap; 67 has "bubbled" to the highest index)
Introduction (Conventional Sorting Algorithms Processing)
Bubble Sort (Cont’d)
• A bubble step is done by the following loops:

for j ← 0 to n-2
    for i ← 0 to n-j-2
        if A[i] > A[i+1] then
            Temp ← A[i]
            A[i] ← A[i+1]
            A[i+1] ← Temp
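The same loops as a runnable Python sketch (bubble_sort is an assumed name, not from the slides):

def bubble_sort(a):
    n = len(a)
    for j in range(n - 1):           # one bubble step per pass
        for i in range(n - 1 - j):   # effective array shrinks by one each pass
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(bubble_sort([45, 67, 12, 34, 25, 39]))  # prints [12, 25, 34, 39, 45, 67]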
Worst-case Analysis: O(n²)
▪ The loop compares all adjacent elements at index i and i + 1. If they are not in the correct order, they are
swapped.
▪ One complete bubble step moves the largest element to the last position, which is the correct position for that
element in the final sorted array.
▪ The effective size of the array is reduced by one and the process repeated until the effective size becomes one.
▪ Each bubble step moves the largest element in the effective array to the highest index of the effective array.
Introduction (Conventional Sorting Algorithms Processing)
2. Insertion Sort
• Basic Idea: For each element in the list of elements, find the proper slot where it should belong, and insert it.
• One element by itself is already sorted.
• Two elements are then considered and sorted, i.e., swapped if needed.
• Three elements, the third element is swapped leftward until it is in its proper order with the first two elements.
• Four elements, the fourth element is swapped leftward until it is in its proper order with the first three
elements.
• Continue in this manner with the fifth element, the sixth element, and so on until the whole list is sorted.

• How does it work?


✓ Each element A[j] is taken one at a time, for j from 1 to n-1.
✓ Before insertion: sub-array from A[0] to A[j-1] is sorted, and the remainder of the array is unsorted.
✓ After insertion A[0] to A[j] is correctly ordered while the sub-array with elements A[j+1]…A[n-1] is unsorted.

Introduction (Conventional sorting Algorithms Processing)
Insertion Sort
Algorithm InsertionSort(A):
Input: An Array A of n elements
Output: The array A with its n elements sorted in non-decreasing order
for i ← 1 to n-1 do
Temp ← A[i]
j ← i-1
while j ≥ 0 and A[j] > Temp do
A[j+1] ← A[j]
j ← j-1
end while
A[j+1] ← Temp
end for
Trace the algorithm.
• Best Case Analysis: Elements are already sorted
The inner loop will never be executed. The outer loop is executed n-1 times, i.e. O(n)
• Worst Case Analysis: Elements are in reverse order
The inner loop is executed the maximum number of times; the outer loop is executed n-1 times, i.e., O(n²)
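A runnable Python version of InsertionSort (a sketch; insertion_sort is an assumed name):

def insertion_sort(a):
    for i in range(1, len(a)):
        temp = a[i]                     # element to be inserted
        j = i - 1
        while j >= 0 and a[j] > temp:   # shift larger elements one slot right
            a[j + 1] = a[j]
            j -= 1
        a[j + 1] = temp
    return a

print(insertion_sort([8, 2, 4, 9, 3, 6]))  # prints [2, 3, 4, 6, 8, 9]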
Introduction (Conventional Sorting Algorithms Processing)

A pseudocode for insertion sort (INSERTION-SORT):

INSERTION-SORT(A)
1  for j ← 2 to length[A]
2      do key ← A[j]
3      ▷ Insert A[j] into the sorted sequence A[1 .. j-1].
4      i ← j - 1
5      while i > 0 and A[i] > key
6          do A[i+1] ← A[i]
7          i ← i - 1
8      A[i+1] ← key
Introduction (Conventional Sorting Algorithms Processing)

Insertion sort Step


INSERTION-SORT(A, n)  ⊳ A[1 . . n]
for j ← 2 to n
    do key ← A[j]
       i ← j - 1
       while i > 0 and A[i] > key
           do A[i+1] ← A[i]
              i ← i - 1
       A[i+1] ← key

[Figure: array A[1 . . n]; the prefix A[1 . . j-1] is sorted, and key is taken from position j.]
Introduction (Conventional Sorting Algorithms Processing)
Example of insertion sort
Trace on the input {8, 2, 4, 9, 3, 6}; each line shows the array after one insertion step:

8 2 4 9 3 6
2 8 4 9 3 6   (insert 2)
2 4 8 9 3 6   (insert 4)
2 4 8 9 3 6   (insert 9: already in place)
2 3 4 8 9 6   (insert 3)
2 3 4 6 8 9   (insert 6: the array is sorted)

Trace of the algorithm
Introduction (Conventional Sorting Algorithms Processing)
Analysis of INSERTION-SORT

INSERTION-SORT(A)                                  cost    times
1  for j ← 2 to length[A]                          c1      n
2      do key ← A[j]                               c2      n - 1
3      ▷ Insert A[j] into the sorted
       sequence A[1 .. j-1]                        0       n - 1
4      i ← j - 1                                   c4      n - 1
5      while i > 0 and A[i] > key                  c5      Σ(j=2..n) tj
6          do A[i+1] ← A[i]                        c6      Σ(j=2..n) (tj - 1)
7          i ← i - 1                               c7      Σ(j=2..n) (tj - 1)
8      A[i+1] ← key                                c8      n - 1

(Here tj is the number of times the while-loop test on line 5 is executed for that value of j.)
Introduction (Conventional Sorting Algorithms Processing)

Analysis of INSERTION-SORT
The total running time is

T(n) = c1·n + c2(n-1) + c4(n-1) + c5·Σ(j=2..n) tj + c6·Σ(j=2..n) (tj - 1) + c7·Σ(j=2..n) (tj - 1) + c8(n-1).

• The best case: the array is already sorted (tj = 1 for j = 2, 3, ..., n).
  T(n) = c1n + c2(n-1) + c4(n-1) + c5(n-1) + c8(n-1)
       = (c1 + c2 + c4 + c5 + c8)n - (c2 + c4 + c5 + c8), i.e., linear in n.
• The worst case: the array is reverse sorted (tj = j for j = 2, 3, ..., n).
  T(n) = c1n + c2(n-1) + c4(n-1) + c5(n(n+1)/2 - 1) + c6(n(n-1)/2) + c7(n(n-1)/2) + c8(n-1)
       = (c5/2 + c6/2 + c7/2)n² + (c1 + c2 + c4 + c5/2 - c6/2 - c7/2 + c8)n - (c2 + c4 + c5 + c8)
  T(n) = an² + bn + c, i.e., quadratic in n.
Introduction (Conventional Sorting Algorithms Processing)
• Selection Sort (Cont'd): on each pass, select the smallest element remaining in the unsorted portion and
swap it into the next position. The inner loop below finds the location of that minimum:
for I ← 0 to n-2
    Temp ← A[I]
    Location ← I
    for J ← I+1 to n-1
        if A[J] < A[Location]
            Location ← J
    A[I] ← A[Location]
    A[Location] ← Temp
Trace of the algorithm on {126, 43, 26, 1, 113}:

I    Temp    Location    Inner Loop    Final Selection
0    126     0           J = 1
……
✓ The total number of comparisons = (n-1) + (n-2) + … + 2 + 1 = n(n-1)/2 = O(n²)
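A runnable Python sketch of the same selection sort loop (selection_sort is an assumed name):

def selection_sort(a):
    n = len(a)
    for i in range(n - 1):
        location = i                    # index of the smallest element seen so far
        for j in range(i + 1, n):
            if a[j] < a[location]:
                location = j
        a[i], a[location] = a[location], a[i]   # swap the minimum into position i
    return a

print(selection_sort([126, 43, 26, 1, 113]))  # prints [1, 26, 43, 113, 126]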
Summary of Some Sorting Algorithms

Algorithm         Time (worst case)    Notes
Bubble-sort       O(n²)                slow; for small data sets (< 1K)
Selection-sort    O(n²)                slow; for all data sets (< 1K)
Insertion-sort    O(n²)                slow; for all data sets (< 1K)
Introduction (Conventional Matrix Multiplication Processing)
Example 1: Matrix-Vector Multiplication
• A × b = y, computed row-wise: y[i] = Σj A[i,j] · b[j]
• Allocate tasks to rows of A.
• Dependencies: computing each element of y can be done independently of the others.
• Speedup?

Source: Introduction to Parallel Computing, University of Oregon, IPCC
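A sequential Python sketch of the computation being parallelized (matvec is an assumed name); since each y[i] depends only on row i of A and on b, the iterations of the outer loop could run on different processors:

def matvec(A, b):
    # y[i] = sum over j of A[i][j] * b[j]; each y[i] is independent
    return [sum(A[i][j] * b[j] for j in range(len(b))) for i in range(len(A))]

print(matvec([[1, 2], [3, 4]], [5, 6]))  # prints [17, 39]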


Introduction (Conventional Matrix Multiplication Processing)
Example 2: Matrix Multiplication
• A × B = C
• A[i,:] · B[:,j] = C[i,j] (the dot product of row i of A with column j of B)

[Figure: A × B = C]

Source: Introduction to Parallel Computing, University of Oregon, IPCC
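A sequential Python sketch of the same product (matmul is an assumed name); every C[i][j] is an independent dot product, so all of the entries could in principle be computed in parallel:

def matmul(A, B):
    # C[i][j] = dot product of row i of A with column j of B
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

print(matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # prints [[19, 22], [43, 50]]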


Introduction to Parallel Architectures and Algorithms
✓ What is meant by Parallelism?
o The ability to execute different parts of a program at the same time (concurrently) on different
processors, in order to shorten execution time.
✓ Speedup
o Speedup (of algorithm) = sequential execution time/execution time on p processors (for the same
data set).
✓ Scalability
o a program is said to scale to a certain number of processors p, if going from p-1 to p processors
results in some acceptable improvement in speedup (for instance, an increase of 20%).
[Figures: (left) solution time vs. number of processors p, decomposed into computation time and
communication time; (right) ideal vs. actual speedup as a function of the number of processors p.]
Introduction to Parallel Architectures and Algorithms
✓ Amdahl’s Law
o If f = 1/s of the program is sequential, then you can never get a speedup better than s.
▪ (Normalized) sequential execution time = 1/s + (1- 1/s) = 1
▪ Best parallel execution time on p processors = 1/s + (1 - 1/s) /p
▪ When p goes to infinity, parallel execution time= 1/s
▪ (maximum) Speedup = s.
The speedup from an enhancement is

s = 1 / (f + (1 - f)/p) ≤ min(p, 1/f),

where f is the fraction of the program unaffected by the enhancement and p is the speedup of the rest
(the enhancement factor).

[Figure: speedup s vs. enhancement factor p for f = 0, 0.01, 0.02, 0.05, and 0.1.]
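A small Python sketch of the speedup formula above (amdahl_speedup is an assumed name):

def amdahl_speedup(f, p):
    # s = 1 / (f + (1 - f) / p); bounded above by min(p, 1/f) for f > 0
    return 1.0 / (f + (1.0 - f) / p)

for f in (0.01, 0.02, 0.05, 0.1):
    print(f, round(amdahl_speedup(f, 50), 1))
# even with p = 50, f = 0.1 limits the speedup to about 8.5 (and to 1/f = 10 as p grows)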
Introduction to Parallel Architectures and Algorithms
Types of Parallelism: (Flynn and Johnson) Taxonomy

Flynn's categories (instruction streams × data streams):

                          Single data stream           Multiple data streams
Single instr stream       SISD: uniprocessors          SIMD: array or vector processors
Multiple instr streams    MISD: rarely used            MIMD: multiprocessors or multicomputers

Johnson's expansion of MIMD (memory organization × communication mechanism):

                          Shared variables                       Message passing
Global memory             GMSV: shared-memory multiprocessors    GMMP: rarely used
Distributed memory        DMSV: distributed shared memory        DMMP: distributed-memory multicomputers
Introduction to Parallel Architectures and Algorithms
✓ Common Parallel Architecture Models

Message Passing
• Mapping technique: sending and receiving messages.
• Most widely used for programming parallel computers (clusters of workstations).
• Features (key attributes): partitioned address space; explicit parallelization; processes interact by
sending and receiving data.
• Communication primitives: send(buff, size, destination) and receive(buff, size, source);
blocking vs. non-blocking, buffered vs. non-buffered.
• Message Passing Interface (MPI): popular message-passing library (~125 functions).
• [Illustration: nodes N1..Nn, each a processor with local memory, joined by links through an
interconnection network.]

Shared Address Space
• Mapping technique: communication through memory reads/writes.
• Mostly used for programming SMP machines (multi-core chips).
• Features (key attributes): shared address space; implicit parallelization; processes/threads communicate
through memory reads/stores.
• POSIX thread (Pthreads) API: popular thread API. Operations: creation/deletion of threads,
synchronization (mutexes, semaphores), thread management.
• [Illustration: processors P1..Pn sharing memory modules M1..Mm through an interconnection network.]
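As an illustration of the shared-address-space model (a sketch using Python's threading module rather than the Pthreads C API; the mutex plays the same synchronization role):

import threading

counter = 0                       # shared variable in the common address space
lock = threading.Lock()           # mutex protecting the shared variable

def worker(n):
    global counter
    for _ in range(n):
        with lock:                # synchronize: one thread updates at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)                    # prints 40000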
Introduction to Parallel Architectures and Algorithms
✓Types of Parallel execution of programs models
• Data parallel: all processors do same thing on different data
• Task graph: processors are assigned tasks that do different things
• Work pool: data grouped according to the work to be done
• Pipelining: Data is processed in a pipelined form
• Master-Worker: Two types of processors: Master Processor and Workers Processors

Source: Introduction to Parallel Computing, University of Oregon, IPCC


Introduction to Parallel Architectures and Algorithms
Common Parallel Execution of Programs Models

Data-parallel
• Mapping technique: static; tasks map to processes; independent data items are assigned to processes
(data parallelism).
• Computation: tasks process data, synchronize to get new data or exchange results, and continue until all
data is processed.
• Load balancing: uniform partitioning of data.

Task graph
• Mapping technique: static; tasks are mapped to nodes in a task dependency graph (task parallelism);
data moves through the graph from source to sink.
• Computation: each node processes input from previous node(s) and sends output to next node(s).
• Load balancing: assign more processes to a given task; eliminate graph bottlenecks.
• Synchronization: node data exchange.
• Examples: parallel Quicksort, divide-and-conquer approaches.

Work pool
• Mapping technique: dynamic mapping of work/data to processes; no desired pre-mapping; any task may be
performed by any process.
• Computation: processes take up work as data becomes available (or as requests arrive), drawing from an
input queue and writing to an output queue.
• Synchronization: adding/removing work from the input queue.
• Example: web server.
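A minimal work-pool sketch using Python's multiprocessing.Pool (illustrative only; the library hands tasks from a shared input queue to whichever worker process is free, matching the dynamic mapping described above):

import multiprocessing as mp

def square(x):
    return x * x       # stand-in for a unit of work

if __name__ == "__main__":
    with mp.Pool(processes=4) as pool:            # four worker processes
        results = pool.map(square, range(10))     # tasks assigned dynamically
    print(results)     # prints [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]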
Introduction to Parallel Architectures and Algorithms
o The speed at which sequential computers operate has been improving at an exponential rate for many years, but the
improvement now comes at greater and greater cost.
o An alternative is to design an algorithm that specifies multiple operations on each step, i.e., a parallel algorithm.
o Example: computing the sum of a sequence A of n numbers.
o It is not difficult, however, to devise an algorithm for computing the sum that performs many operations in parallel. For
example:
o Suppose that, in parallel, each element of A with an even index is paired and summed with the next element of A, which
has an odd index;
o i.e., A[0] is paired with A[1], A[2] with A[3], and so on.
o The result is a new sequence of ⌈n/2⌉ numbers that sum to the same value as the sum we wish to compute.
o This pairing-and-summing step can be repeated until, after ⌈log2 n⌉ steps, a sequence consisting of a single value is
produced, and this value is equal to the final sum (see the sketch after the figure below).
o It is important to make a distinction between the parallelism in an algorithm and the ability of any particular computer
to perform multiple operations in parallel.
o In order for a parallel algorithm to run efficiently on any type of computer, the algorithm must contain at least as much
parallelism as the computer.
o The converse does not always hold: some parallel computers cannot efficiently execute all algorithms, even if the
algorithms contain a great deal of parallelism.
o Experience has shown that it is more difficult to build a general-purpose parallel machine than a general-purpose
sequential machine.

[Figure: summing on a balanced binary tree — 8 additions at the first level, then 4, then 2, then 1.
Depth total = 4; work total = 15.]
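A sequential Python simulation of the pairing-and-summing scheme (pairwise_sum is an assumed name); each list comprehension corresponds to one parallel step, and the loop runs ⌈log2 n⌉ times:

def pairwise_sum(a):
    a = list(a)
    while len(a) > 1:
        if len(a) % 2:                 # odd length: pad so every element has a partner
            a.append(0)
        a = [a[i] + a[i + 1] for i in range(0, len(a), 2)]  # one parallel step
    return a[0]

print(pairwise_sum([1] * 16))  # prints 16, after log2(16) = 4 pairing steps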


Introduction to Parallel Architectures and Algorithms
✓ Multiprocessor models
o Multiprocessor models can be classified into three basic types:
▪ local memory machine models,
▪ modular memory machine models, and
▪ parallel random-access machine (PRAM) models.
▪ The Figure illustrates the structure of these machine models.

[Figure: three machine models — (a) local memory machine: processors P1..Pn, each with its own memory
M1..Mn, connected by an interconnection network; (b) modular memory machine: memory modules M1..Mm
connected to processors P1..Pn through an interconnection network; (c) PRAM: processors P1..Pn connected
to a single shared memory.]

▪ In all three types of models, there may be differences in the operations that the processors and networks are
allowed to perform.
Introduction to Parallel Architectures and Algorithms
✓ Network topology
[Figure: network topologies — (a) bus; (b) mesh; (c) hypercube (nodes labeled 000–111); (d) multistage
network.]
Introduction to Parallel Architectures and Algorithms
The Bus:
The simplest network topology is a bus.
This network can be used in both local memory machine models and modular memory machine models. In either
case, all processors and memory modules are typically connected to a single bus. In each step, at most one piece of
data can be written onto the bus. This data might be a request from a processor to read or write a memory value, or
it might be the response from the processor or memory module that holds the value.
The advantage of using a bus is that it is simple to build and, because all processors and memory modules can
observe the traffic on the bus, it is relatively easy to develop protocols that allow processors to cache memory
values locally.
The disadvantage of using a bus is that the processors have to take turns accessing the bus. Hence, as more
processors are added to a bus, the average time to perform a memory access grows proportionately.
Introduction to Parallel Architectures and Algorithms
Mesh Topology
Several variations on meshes are also popular, including 3-dimensional meshes, toruses, and hypercubes. A torus is
a mesh in which the switches on the sides have connections to the switches on the opposite sides: every switch
(x, y) is connected to four other switches, (x, y+1 mod Y), (x, y−1 mod Y), (x+1 mod X, y), and (x−1 mod X, y).
The figure shows an example of a 2-dimensional mesh.
Introduction to Parallel Architectures and Algorithms
Multistage network
A multistage network is used to connect one set of switches called the input switches to another set called the output
switches through a sequence of stages of switches.
The stages of a multistage network are numbered 1 through L, where L is the depth of the network. The switches on
stage 1 are the input switches, and those on stage L are the output switches. In most multistage networks, it is possible
to send a message from any input switch to any output switch along a path that traverses the stages of the network in
order from 1 to L.
Multistage networks are frequently used in modular memory computers; typically, processors are attached to input
switches, and memory modules to output switches.
A processor accesses a word of memory by injecting a memory access request message into the network.
This message then travels through the network to the appropriate memory module.
If the request is to read a word of memory, then the memory module sends the data back through the network to the
requesting processor.
Introduction to Parallel Architectures and Algorithms
Routing of Networks
An alternative to modeling the topology of a network is to summarize its routing capabilities in terms of two
parameters, its latency and bandwidth.
The latency, L, of a network is the time it takes for a message to traverse the network. In actual networks this
depends on the topology of the network, which particular ports the message is passing between, and the congestion
of messages in the network. The latency is often modeled by considering the worst-case time, assuming that the
network is not heavily congested.
The bandwidth at each port of the network is the rate at which a processor can inject data into the network. In actual
networks this depends on the topology of the network, the bandwidths of the network's individual communication
channels, and, again, the congestion of messages in the network. The bandwidth can often be usefully modeled as
the maximum rate at which processors can inject messages into the network without causing it to become heavily
congested, assuming a uniform distribution of message destinations.
✓ Primitive operations
We assume that all processors are allowed to perform the same local instructions as the single processor in the standard
sequential RAM model. (This issue will be discussed in detail in the Abstract Model Module.)
Introduction to Parallel Architectures and Algorithms
Work-depth models (focusing on the algorithm not the machine)
In a work-depth model, the cost of an algorithm is determined by examining the total number of operations that it
performs, and the dependencies among those operations.
An algorithm’s work W is the total number of operations that it performs; its depth D is the longest chain of
dependencies among its operations.
We call the ratio P = W/D the parallelism of the algorithm.
The advantage of using a work-depth model is that there are no machine-dependent details to complicate the
design and analysis of algorithms.
The figure below: summing 16 numbers on a tree. The total depth (longest chain of dependencies) is 4 and the total
work (number of operations) is 15. For this family of circuits, W(n) = n − 1 and D(n) = log2 n.

[Figure: summing 16 numbers on a balanced binary tree — 8 additions at the first level, then 4, then 2,
then 1. Depth (D) total: 4; Work (W) total: 15.]
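A small Python check of the work and depth formulas for this summation tree (work_depth is an assumed name):

import math

def work_depth(n):
    # W(n) = n - 1 additions; D(n) = ceil(log2 n) levels; parallelism P = W / D
    W = n - 1
    D = math.ceil(math.log2(n))
    return W, D, W / D

print(work_depth(16))  # prints (15, 4, 3.75)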
