
UNIT-3

Concurrent and Parallel Programming: Parallel algorithms – sorting, ranking, searching, traversals, prefix sum, etc.

SOME SIMPLE COMPUTATIONS:

In this section, five fundamental building-block computations are defined:

1. Semigroup (reduction, fan-in) computation
2. Parallel prefix computation
3. Packet routing
4. Broadcasting, and its more general version, multicasting
5. Sorting records in ascending/descending order of their keys

1. Semigroup Computation. Let ⊕ be an associative binary operator; i.e., (x ⊕ y) ⊕ z = x ⊕ (y ⊕ z) for all x, y, z ∈ S. A semigroup is simply a pair (S, ⊕), where S is a set of elements on which ⊕ is defined. Semigroup (also known as reduction or fan-in) computation is defined as: Given a list of n values x0, x1, ..., xn–1, compute x0 ⊕ x1 ⊕ ... ⊕ xn–1. The operator ⊕ may or may not be commutative, i.e., it may or may not satisfy x ⊕ y = y ⊕ x (common examples such as addition and maximum are commutative, but the carry computation, e.g., is not). This last point is important; while the parallel algorithm can compute chunks of the expression using any partitioning scheme, the chunks must eventually be combined in left-to-right order. Figure 3.1 depicts a semigroup computation on a uniprocessor.
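The semigroup (reduction) computation above can be sketched in a few lines of Python; the function name is illustrative, and the left-to-right combination order matters precisely because ⊕ need not be commutative:

```python
from functools import reduce

def semigroup(op, xs):
    """Combine x0 (+) x1 (+) ... (+) x(n-1) strictly left to right,
    which is safe even when the operator is not commutative."""
    return reduce(op, xs)

# Maximum finding is one instance of a semigroup computation.
print(semigroup(max, [5, 17, 2, 9]))                 # 17
print(semigroup(lambda a, b: a + b, [5, 17, 2, 9]))  # 33
```

String concatenation, which is associative but not commutative, also fits this definition, illustrating why chunks must be recombined in order.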

2. Parallel Prefix Computation. With the same assumptions as in the preceding paragraph, a parallel prefix computation is defined as simultaneously evaluating all of the prefixes of the expression x0 ⊕ x1 ⊕ ... ⊕ xn–1; i.e., x0, x0 ⊕ x1, x0 ⊕ x1 ⊕ x2, ..., x0 ⊕ x1 ⊕ ... ⊕ xn–1. Note that the ith prefix expression is si = x0 ⊕ x1 ⊕ ... ⊕ xi.

The graph representing the prefix computation on a uniprocessor is similar to Fig. 3.1, but with the intermediate values also output.
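A sequential sketch of the prefix computation, outputting every intermediate value si (function name illustrative):

```python
def prefix_scan(op, xs):
    """Return all prefixes s_i = x0 (+) x1 (+) ... (+) xi of the list xs."""
    out, acc = [], None
    for x in xs:
        acc = x if acc is None else op(acc, x)
        out.append(acc)
    return out

print(prefix_scan(lambda a, b: a + b, [5, 2, 8, 6]))  # [5, 7, 15, 21]
```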

3. Packet Routing. A packet of information resides at Processor i and must be sent to Processor j. The problem is to route the packet through intermediate processors, if needed,

Fig 3.1: Semigroup computation on a uniprocessor.

such that it gets to the destination as quickly as possible. The problem becomes more challenging when multiple packets reside at different processors, each with its own destination. In this case, the packet routes may interfere with one another as they go through common intermediate processors. When each processor has at most one packet to send and one packet to receive, the packet routing problem is called one-to-one communication or 1–1 routing.

4. Broadcasting: Given a value a known at a certain processor i, disseminate it to all p processors as quickly as possible, so that at the end, every processor has access to, or “knows,” the value. This is sometimes referred to as one-to-all communication. The more general case of this operation, i.e., one-to-many communication, is known as multicasting. From a programming viewpoint, we make the assignments xj := a for 1 ≤ j ≤ p (broadcasting) or for j ∈ G (multicasting), where G is the multicast group and xj is a local variable in processor j.

5. Sorting: Rather than sorting a set of records, each with a key and data elements, we focus on sorting a set of keys for simplicity. Our sorting problem is thus defined as: Given a list of n keys x0, x1, ..., xn–1, and a total order ≤ on key values, rearrange the n keys as x_i0, x_i1, ..., x_i(n–1), such that x_i0 ≤ x_i1 ≤ ... ≤ x_i(n–1). We consider only sorting the keys in nondescending order.

SOME SIMPLE ARCHITECTURES
In this section, we define four simple parallel architectures:

1. Linear array of processors


2. Binary tree of processors
3. Two-dimensional mesh of processors
4. Multiple processors with shared variables

Linear Array: Figure 3.2 shows a linear array of nine processors, numbered 0 to 8. The diameter of a p-processor linear array, defined as the longest of the shortest distances between pairs of processors, is D = p – 1. The (maximum) node degree, defined as the largest number of links or communication channels associated with a processor, is d = 2. The ring variant, also shown in Fig. 3.2, has the same node degree of 2 but a smaller diameter of D = ⌊p/2⌋.

Fig 3.2: A linear array of nine processors and its ring variant.
Binary Tree: Figure 3.3 shows a binary tree of nine processors. This binary tree is balanced in that the leaf levels differ by at most 1. If all leaf levels are identical and every nonleaf processor has two children, the binary tree is said to be complete. The diameter

Fig 3.3: A balanced (but incomplete) binary tree of nine processors.

of a p-processor complete binary tree is 2 log2(p + 1) – 2. More generally, the diameter of a p-processor balanced binary tree architecture is 2⌊log2 p⌋ or 2⌊log2 p⌋ – 1, depending on the placement of leaf nodes at the last level. Unlike the linear array, several different p-processor binary tree architectures may exist. This is usually not a problem, as we almost always deal with complete binary trees. The (maximum) node degree in a binary tree is d = 3.

2D Mesh: Figure 3.4 shows a square 2D mesh of nine processors. The diameter of a p-processor square mesh is D = 2√p – 2. More generally, the mesh does not have to be square. The diameter of a p-processor r × (p/r) mesh is D = r + p/r – 2. Again, multiple 2D meshes may exist for the same number p of processors, e.g., 2 × 8 or 4 × 4 for p = 16. Square meshes are usually preferred because they minimize the diameter. The torus variant, also shown in Fig. 3.4, has end-around or wraparound links for rows and columns. The node degree for both meshes and tori is d = 4, but a p-processor r × (p/r) torus has a smaller diameter of D = ⌊r/2⌋ + ⌊p/(2r)⌋.

Shared Memory: A shared-memory multiprocessor can be modeled as a complete graph, in which every node is connected to every other node, as shown in Fig. 3.5 for p = 9.

In the 2D mesh of Fig. 3.4, Processor 0 can send/receive data directly to/from P1 and P3. However, it has to go through an intermediary to send/receive data to/from P4, say. In a shared-memory multiprocessor, every piece of data is directly accessible to every processor (we assume that each processor can simultaneously send/receive data over all of its p – 1 links). The diameter D = 1 of a complete graph is an indicator of this direct access. The node

Fig 3.4: A 2D mesh of nine processors and its torus variant.

Fig 3.5: A shared-variable architecture modeled as a complete graph.

degree d = p – 1, on the other hand, indicates that such an architecture would be quite costly to implement if no restriction is placed on data accesses.

ALGORITHMS FOR A LINEAR ARRAY


Semigroup Computation: Let us consider first a special case of semigroup computation, namely, that of maximum finding. Each of the p processors holds a value initially, and our goal is for every processor to know the largest of these values. A local variable, max-thus-far, can be initialized to the processor’s own data value. In each step, a processor sends its max-thus-far value to its two neighbors. Each processor, on receiving values from its left and right neighbors, sets its max-thus-far value to the largest of the three values, i.e., max(left, own, right). Figure 3.6 depicts the execution of this algorithm for p = 9 processors. The dotted lines in Fig. 3.6 show how the maximum value propagates from P6 to all other processors. Had there been two maximum values, say in P2 and P6, the propagation would have been faster. In the worst case, p – 1 communication steps (each involving sending a processor’s value to both neighbors), and the same number of three-way comparison steps, are needed. This is the best one can hope for, given that the diameter of a p-processor linear array is D = p – 1 (diameter-based lower bound).
Fig 3.6: Maximum-finding on a linear array of nine processors
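The synchronous max-finding algorithm can be simulated sequentially; one list snapshot per step stands in for the processors' max-thus-far registers (function name illustrative):

```python
def linear_array_max(vals):
    """Simulate max-finding on a linear array: in each synchronous step,
    every processor replaces its max-thus-far value with
    max(left, own, right). After p - 1 steps, every processor holds
    the global maximum."""
    p = len(vals)
    cur = list(vals)
    for _ in range(p - 1):
        nxt = []
        for i in range(p):
            left = cur[i - 1] if i > 0 else cur[i]
            right = cur[i + 1] if i < p - 1 else cur[i]
            nxt.append(max(left, cur[i], right))
        cur = nxt
    return cur

# All nine processors end up holding the maximum value, 9.
print(linear_array_max([3, 1, 4, 1, 5, 9, 2, 6, 5]))
```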
For a general semigroup computation, the processor at the left end of the array (the one with no left neighbor) becomes active and sends its data value to the right (initially, all processors are dormant or inactive). On receiving a value from its left neighbor, a processor becomes active, applies the semigroup operation ⊕ to the value received from the left and its own data value, sends the result to the right, and becomes inactive again. This wave of activity propagates to the right until the rightmost processor obtains the desired result. The computation result is then propagated leftward to all processors. In all, 2p – 2 communication steps are needed.
Parallel Prefix Computation. Let us assume that we want the ith prefix result to be obtained at the ith processor, 0 ≤ i ≤ p – 1. The general semigroup algorithm described in the preceding paragraph in fact performs a parallel prefix computation first and then does a broadcast of the final value to all processors. Thus, we already have an algorithm for parallel prefix computation that takes p – 1 communication/combining steps. A variant of the parallel prefix computation, in which Processor i ends up with the prefix result up to the (i – 1)th value, is sometimes useful. This diminished prefix computation can be performed just as easily if each processor holds onto the value received from the left rather than the one it sends to the right. The diminished prefix sum results for the example of Fig. 3.7 would be 0, 5, 7, 15, 21, 24, 31, 40, 41.
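A sequential sketch of the diminished prefix computation follows. The data values used are those implied by the results quoted above for Fig. 3.7 (the last processor's own value does not affect any diminished result, so 4 is an arbitrary placeholder):

```python
def diminished_prefix_sums(xs):
    """Processor i ends up with x0 + ... + x(i-1); the first gets 0."""
    out, acc = [], 0
    for x in xs:
        out.append(acc)
        acc += x
    return out

# Values consistent with the diminished prefix sums quoted for Fig. 3.7.
data = [5, 2, 8, 6, 3, 7, 9, 1, 4]
print(diminished_prefix_sums(data))  # [0, 5, 7, 15, 21, 24, 31, 40, 41]
```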

Thus far, we have assumed that each processor holds a single data item. Extension of the semigroup and parallel prefix algorithms to the case where each processor initially holds several data items is straightforward. Figure 3.8 shows a parallel prefix sum computation with each processor initially holding two data items. The algorithm consists of each processor doing a prefix computation on its own data set of size n/p (this takes n/p – 1 combining steps), then doing a diminished parallel prefix computation on the linear array as above (p – 1 communication/combining steps), and finally combining the local prefix result from this last computation with the locally computed prefixes (n/p combining steps). In all, 2n/p + p – 2 combining steps and p – 1 communication steps are required.
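The three-phase algorithm just described can be sketched sequentially, with each sublist standing in for one processor's block of n/p items (names illustrative):

```python
def blockwise_prefix_sums(blocks):
    """Three-phase prefix sums when each 'processor' holds a block:
    (1) local prefix sums per block, (2) diminished prefix of the
    block totals, (3) add each block's offset to its local prefixes."""
    local = []
    for b in blocks:
        acc, pre = 0, []
        for x in b:
            acc += x
            pre.append(acc)
        local.append(pre)
    offset, result = 0, []
    for pre in local:
        result.extend(v + offset for v in pre)
        offset += pre[-1]          # running total = diminished prefix
    return result

print(blockwise_prefix_sums([[1, 2], [3, 4], [5, 6]]))  # [1, 3, 6, 10, 15, 21]
```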

Packet Routing. To send a packet of information from Processor i to Processor j on a linear array, we simply attach a routing tag with the value j – i to it. The sign of a routing tag determines the direction in which it should move (+ = right, – = left) while its magnitude indicates the action to be performed (0 = remove the packet, nonzero = forward the packet). With each forwarding, the magnitude of the routing tag is decremented by 1. Multiple packets

Fig 3.7: Computing prefix sums on a linear array of nine processors.

Fig 3.8: Computing prefix sums on a linear array with two items per processor.

originating at different processors can flow rightward and leftward in lockstep, without ever interfering with each other.
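The routing-tag mechanism can be sketched for a single packet as follows (function name illustrative):

```python
def route(src, dst):
    """Simulate routing one packet on a linear array: the tag dst - src
    moves the packet right (+) or left (-); each hop decrements the
    tag's magnitude, and the packet is removed when the tag reaches 0."""
    pos, tag, hops = src, dst - src, 0
    while tag != 0:
        step = 1 if tag > 0 else -1
        pos += step
        tag -= step
        hops += 1
    return pos, hops

print(route(2, 7))  # (7, 5): arrives at Processor 7 after 5 hops
```

The number of hops is |j – i|, which matches the magnitude of the initial routing tag.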
Broadcasting. If Processor i wants to broadcast a value a to all processors, it sends an rbcast(a) (read r-broadcast) message to its right neighbor and an lbcast(a) message to its left neighbor. Any processor receiving an rbcast(a) message simply copies the value a and forwards the message to its right neighbor (if any). Similarly, receiving an lbcast(a) message causes a to be copied locally and the message forwarded to the left neighbor. The worst-case number of communication steps for broadcasting is p – 1.

Sorting. We consider two versions of sorting on a linear array: with and without I/O. Figure 3.9
depicts a linear-array sorting algorithm when p keys are input, one at a time, from the left end.
Each processor, on receiving a key value from the left, compares the received value with the value stored in its local register. The smaller of the two values is kept in the local register and the larger value is passed on to the right. Once all p inputs have been received, we must allow p – 1
additional communication cycles for the key values that are in transit to settle into their
respective positions in the linear array. If the sorted list is to be output from the left, the output
phase can start immediately after the last key value has been received. In this case, an array half
the size of the input list would be adequate and we effectively have zero-time sorting, i.e., the
total sorting time is equal to the I/O time.
If the key values are already in place, one per processor, then an algorithm known as odd–even
transposition can be used for sorting. A total of p steps are required. In an odd-numbered step,
odd-numbered processors compare values with their even-numbered right neighbors. The two
processors exchange their values if they are out of order. Similarly, in an even-numbered step,
even-numbered processors compare–exchange values with their right neighbors (see Fig. 3.10).
In the worst case, the largest key value resides in Processor 0 and must move all the way to the
other end of the array. This needs p – 1 right moves. One step must be added because no movement occurs in the first step. Of course one could use even–odd transposition, but this will not affect the worst-case time complexity of the algorithm for our nine-processor linear array.

Fig 3.9: Sorting on a linear array with the keys input sequentially from the left.

Fig 3.10: Odd–even transposition sort on a linear array.
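Odd–even transposition sort can be simulated sequentially. This sketch uses 0-based indexing, so "odd" steps compare pairs starting at an even index and vice versa; the phase labeling differs from the 1-based description above but the algorithm is the same:

```python
def odd_even_transposition_sort(a):
    """p synchronous steps on p keys: alternate between compare-exchanging
    pairs (0,1), (2,3), ... and pairs (1,2), (3,4), ...; after p steps
    the list is sorted."""
    a = list(a)
    p = len(a)
    for step in range(p):
        start = 0 if step % 2 == 0 else 1
        for i in range(start, p - 1, 2):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a

print(odd_even_transposition_sort([9, 8, 7, 6, 5, 4, 3, 2, 1]))
```

The worst case shown (largest key at the far end, reversed order) still finishes within the p steps.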


Let us evaluate the odd–even transposition algorithm with respect to the various measures introduced in Section 1.6. The best sequential sorting algorithms take on the order of p log p compare–exchange steps to sort a list of size p. Let us assume, for simplicity, that they take exactly p log2 p steps. Then, we have T(1) = W(1) = p log2 p, T(p) = p, W(p) = p²/2, S(p) = log2 p (Minsky’s conjecture?), E(p) = (log2 p)/p, R(p) = p/(2 log2 p), U(p) = 1/2, and Q(p) = 2(log2 p)³/p².
Ranking the Elements of a Linked List

Our next example computation is important not only because it is a very useful building block in many applications, but also because it demonstrates how a problem that seems hopelessly sequential can be efficiently parallelized.

The problem will be presented in terms of a linear linked list of size p, but in practice it often
arises in the context of graphs of the types found in image processing and computer vision
applications. Many graph-theoretic problems deal with (directed) paths between various pairs of
nodes. Such a path essentially consists of a sequence of nodes, each “pointing” to the next node
on the path; thus, a directed path can be viewed as a linear linked list.
The problem of list ranking can be defined as follows: Given a linear linked list of the type
shown in Fig. 3.11, rank the list elements in terms of the distance from each to the terminal

Fig 3.11: Another divide-and-conquer scheme for parallel prefix computation.

Fig 3.12: Example linked list and the ranks of its elements.

element. The terminal element is thus ranked 0, the one pointing to it 1, and so forth. In a list of
length p, each element’s rank will be a unique integer between 0 and p–1.
A sequential algorithm for list ranking requires Θ(p) time. Basically, the list must be traversed once to determine the distance of each element from the head, storing the results in the linked list itself or in a separate integer vector. This first pass can also yield the length of the list (six in the example of Fig. 3.12). A second pass, through the list or the vector of p intermediate results, then suffices to compute all of the ranks.

The list ranking problem for the example linked list of Fig. 3.12 may be approached with the
PRAM input and output data structures depicted in Fig. 3.13. The info and next vectors are given,
as is the head pointer (in our example, head = 2). The rank vector must be filled with the unique
element ranks at the termination of the algorithm.

The parallel solution method for this problem is known as pointer jumping:
Repeatedly make each element point to the successor of its successor (i.e., make the pointer jump
over the current successor) until all elements end up pointing to the terminal node, keeping track
of the number of list elements that have been skipped over. If the original list is not to be
modified, a copy can be made in the PRAM’s shared memory in constant time before the
algorithm is applied.

Processor j, 0 ≤j <p, will be responsible for computing rank [j]. The invariant of the list ranking
algorithm given below is that initially and after each iteration, the partial computed rank of each
element is the difference between its rank and the rank of its successor. With the difference
between the rank of a list element and the rank of its successor available, the rank of an element
can be determined as soon as the rank of its successor becomes known. Again, a doubling
process takes place. Initially, only the rank of the terminal element (the only node that points to
itself) is known. In successive iterations of the algorithm, the ranks

Fig 3.13: PRAM data structures representing a linked list and the ranking results.
of two elements, then four elements, then eight elements, and so forth become known, until the ranks of all elements have been determined.
PRAM list ranking algorithm (via pointer jumping)

Processor j, 0 ≤ j < p, do  {initialize the partial ranks}
    if next[j] = j then rank[j] := 0 else rank[j] := 1 endif
while rank[next[head]] ≠ 0
    Processor j, 0 ≤ j < p, do
        rank[j] := rank[j] + rank[next[j]]
        next[j] := next[next[j]]
endwhile
Figure 3.14 shows the intermediate values in the vectors rank (numbers within boxes) and next (arrows) as the above list ranking algorithm is applied to the example list. Because the number of elements that are skipped doubles with each iteration, the number of iterations, and thus the running time of the algorithm, is logarithmic in p.
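The pointer-jumping algorithm can be simulated sequentially; the list comprehensions mimic the PRAM's synchronous updates by always reading the old rank and next vectors. The example next vector and head below are illustrative, not the data of Fig. 3.13:

```python
def list_rank(next_, head):
    """Pointer jumping: repeatedly make every element point to the
    successor of its successor, accumulating skipped-element counts in
    rank. The terminal node is the one that points to itself."""
    p = len(next_)
    nxt = list(next_)
    rank = [0 if nxt[j] == j else 1 for j in range(p)]
    while rank[nxt[head]] != 0:
        # Both vectors are rebuilt from the old values, as on a
        # synchronous PRAM.
        rank = [rank[j] + rank[nxt[j]] for j in range(p)]
        nxt = [nxt[nxt[j]] for j in range(p)]
    return rank

# Hypothetical 6-element list: 2 -> 4 -> 1 -> 0 -> 3 -> 5 (terminal).
print(list_rank([3, 0, 4, 5, 1, 5], head=2))  # [2, 3, 5, 1, 4, 0]
```

Three iterations suffice here, consistent with the logarithmic iteration count noted above.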

List ranking appears to be hopelessly sequential, as no access to list elements is possible without traversing all previous elements. However, the list ranking algorithm presented above shows that we can in fact use a recursive doubling scheme to determine the rank of each element in a logarithmic number of steps. The problems at the end of the chapter contain other examples of computations on lists that can be performed just as efficiently. This is why intuition can be misleading when it comes to determining which computations are or are not efficiently parallelizable (formally, whether a computation is or is not in NC).

Fig 3.14: Element ranks initially and after each of the three iterations

Parallel Algorithm - Introduction
An algorithm is a sequence of steps that takes inputs from the user and, after some computation, produces an output. A parallel algorithm is an algorithm that can execute several instructions simultaneously on different processing devices and then combine all the individual outputs to produce the final result.

Concurrent Processing

The easy availability of computers along with the growth of the Internet has changed the way we store and process data. We are living in a day and age where data is available in abundance. Every day we deal with huge volumes of data that require complex computing, and in quick time. Sometimes, we need to fetch data from similar or interrelated events that occur simultaneously. This is where we require concurrent processing, which can divide a complex task and process it on multiple systems to produce the output in quick time.

Concurrent processing is essential where the task involves processing a huge bulk of complex data. Examples include accessing large databases, aircraft testing, astronomical calculations, atomic and nuclear physics, biomedical analysis, economic planning, image processing, robotics, weather forecasting, web-based services, etc.

What is Parallelism?

Parallelism is the process of processing several sets of instructions simultaneously. It reduces the total computational time. Parallelism can be implemented by using parallel computers, i.e., computers with many processors. Parallel computers require parallel algorithms, programming languages, compilers, and operating systems that support multitasking.

In this tutorial, we will discuss only about parallel algorithms. Before moving further, let us first
discuss about algorithms and their types.
What is an Algorithm?
An algorithm is a sequence of instructions followed to solve a problem. While designing an
algorithm, we should consider the architecture of computer on which the algorithm will be
executed. As per the architecture, there are two types of computers–
• Sequential computer
• Parallel computer
Depending on the architecture of computers, we have two types of algorithms –
• Sequential Algorithm − An algorithm in which the instructions are executed one after another, in chronological order, to solve a problem.

• Parallel Algorithm − The problem is divided into sub-problems, which are executed in parallel to get individual outputs. Later on, these individual outputs are combined together to get the final desired output.

It is not easy to divide a large problem into sub-problems. Sub-problems may have data
dependency among them. Therefore, the processors have to communicate with each other to
solve the problem.

It is often found that the time the processors spend communicating with one another exceeds the actual processing time. So, while designing a parallel algorithm, proper CPU utilization should be considered to get an efficient algorithm.

To design an algorithm properly, we must have a clear idea of the basic model of computation in
a parallel computer.
Model of Computation
Both sequential and parallel computers operate on a set (stream) of instructions called
algorithms. These set of instructions (algorithm) instruct the computer about what it has to do in
each step.
Depending on the instruction stream and data stream, computers can be classified into four categories:
• Single Instruction stream, Single Data stream (SISD) computers
• Single Instruction stream, Multiple Data stream (SIMD) computers
• Multiple Instruction stream, Single Data stream (MISD) computers
• Multiple Instruction stream, Multiple Data stream (MIMD) computers

SISD Computers
SISD computers contain one control unit, one processing unit, and one memory unit.

Fig 3.15: SISD computers

In this type of computers, the processor receives a single stream of instructions from the control
unit and operates on a single stream of data from the memory unit. During computation, at each
step, the processor receives one instruction from the control unit and operates on a single data
received from the memory unit.
SIMD Computers:
SIMD computers contain one control unit, multiple processing units, and shared memory or
interconnection network.

Fig 3.16: Control Unit and Shared Memory


Here, one single control unit sends instructions to all processing units. During computation, at each step, all the processors receive a single set of instructions from the control unit and operate on different sets of data from the memory unit.

Each of the processing units has its own local memory unit to store both data and instructions. In SIMD computers, processors need to communicate among themselves. This is done by shared memory or by an interconnection network.

While some of the processors execute a set of instructions, the remaining processors wait for their next set of instructions. Instructions from the control unit decide which processors will be active (execute instructions) or inactive (wait for the next instruction).

MISD Computers

As the name suggests, MISD computers contain multiple control units, multiple processing units,
and one common memory unit.

Fig 3.17: Flow of instruction from control unit to memory


Here, each processor has its own control unit and they share a common memory unit. All the processors get instructions individually from their own control units and operate on a single stream of data as per the instructions they have received from their respective control units. These processors operate simultaneously.

MIMD Computers

MIMD computers have multiple control units, multiple processing units, and a shared
memory or interconnection network.
Fig 3.18: Instruction and data stream

Here, each processor has its own control unit, local memory unit, and arithmetic and logic
unit. They receive different sets of instructions from their respective control units and
operate on different sets of data.

• An MIMD computer that shares a common memory is known as a multiprocessor, while one that uses an interconnection network is known as a multicomputer.
• Based on the physical distance between the processors, multicomputers are of two types:

➢ Multicomputer − When all the processors are very close to one another (e.g., in the same room).

➢ Distributed system − When all the processors are far away from one another (e.g., in different cities).

Parallel Algorithm-Structure

To apply any algorithm properly, it is very important that you select a proper data structure. It is
because a particular operation performed on a data structure may take more time as compared to
the same operation performed on another data structure.
Example−To access the ith element in a set by using an array, it may take a constant time but by
using a linked list, the time required to perform the same operation may become a polynomial.
Therefore, the selection of a data structure must be done considering the architecture and the type
of operations to be performed.
The following data structures are commonly used in parallel programming:
• Linked List
• Arrays
• Hypercube Network

Linked List

A linked list is a data structure having zero or more nodes connected by pointers. Nodes may or may not occupy consecutive memory locations. Each node has two or three parts: one data part that stores the data, and one or two link fields that store the address of the previous and/or next node. The first node’s address is stored in an external pointer called head. The last node, known as tail, generally does not contain any address.
There are three types of linked lists:
• Singly Linked List
• Doubly Linked List
• Circular Linked List
Singly Linked List
A node of a singly linked list contains data and the address of the next node. An external pointer called head stores the address of the first node.

Fig 3.19: Singly linked list


Doubly Linked List
A node of a doubly linked list contains data and the address of both the previous and the next
node. An external pointer called head stores the address of the first node and the external pointer
called tail stores the address of the last node.

Fig 3.20: Doubly linked list


Circular Linked List
A circular linked list is very similar to the singly linked list except that the last node stores the address of the first node.
Arrays
An array is a data structure where we can store similar types of data. It can be one-dimensional
or multi- dimensional. Arrays can be created statically or dynamically.
In statically declared arrays, dimension and size of the arrays are known at the time of
compilation.
In dynamically declared arrays, dimension and size of the array are known at runtime.
For shared memory programming, arrays can be used as a common memory and for data parallel
programming, they can be used by partitioning into sub-arrays.

Hypercube Network
Hyper cube architecture is helpful for those parallel algorithms where each task has to
communicate with other tasks. Hypercube topology can easily embed other topologies such as
ring and mesh. It is also known as n-cubes, where n is the number of dimensions. A hypercube
can be constructed recursively.
Parallel Algorithm-Matrix Multiplication
A matrix is a set of numerical and non-numerical data arranged in a fixed number of rows and
column. Matrix multiplication is an important multiplication design in parallel communication.
Here we will discuss the implementation of matrix multiplication on various communication
networks like mesh and hypercube. Mesh and hypercube have higher network connectivity, so
they allow faster algorithm than other networks like ring network.

Mesh Network
A topology where a set of nodes forms a p-dimensional grid is called a mesh topology. Here, all the edges are parallel to the grid axes and all adjacent nodes can communicate with each other.
Total number of nodes = (number of nodes in a row) × (number of nodes in a column)
A mesh network can be evaluated using the following factors−

• Diameter
• Bisection width

Diameter − In a mesh network, the longest distance between two nodes is its diameter. A p-dimensional mesh network having k^p nodes has a diameter of p(k – 1).

Bisection width − Bisection width is the minimum number of edges needed to be removed from
a network to divide the mesh network into two halves.

Matrix Multiplication Using Mesh Network

We have considered a 2D mesh-network SIMD model having wraparound connections. We will design an algorithm to multiply two n × n matrices using n² processors in a particular amount of time.

Matrices A and B have elements aij and bij respectively. Processing element PEij holds aij and bij. Arrange the matrices A and B in such a way that every processor has a pair of elements to multiply. The elements of matrix A will move in the left direction and the elements of matrix B will move in the upward direction. These changes in the positions of the elements present each processing element, PE, with a new pair of values to multiply.

Steps in Algorithm

1. Stagger the two matrices.
2. Calculate all products aik × bkj.
3. Calculate the sums when step 2 is complete.

Algorithm

Procedure MatrixMulti
Begin
   for k = 1 to n-1
      forall Pij, where i and j range from 1 to n
         if i is greater than k then
            rotate a in left direction
         endif
         if j is greater than k then
            rotate b in the upward direction
         endif
   forall Pij, where i and j lie between 1 and n
      compute the product of a and b and store it in c
   for k = 1 to n-1 step 1
      forall Pij, where i and j range from 1 to n
         rotate a in left direction
         rotate b in the upward direction
         c = c + a × b
End
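The stagger-then-rotate scheme above is the standard mesh algorithm commonly known as Cannon's algorithm. A sequential simulation follows (function name illustrative); the initial stagger is done directly by indexing instead of by the k rotation loop, which produces the same alignment:

```python
def mesh_matrix_multiply(A, B):
    """Simulate Cannon's algorithm on an n x n mesh with wraparound:
    stagger row i of A left by i and column j of B up by j, then repeat
    n times: multiply-accumulate locally, rotate A left and B up."""
    n = len(A)
    a = [[A[i][(j + i) % n] for j in range(n)] for i in range(n)]   # staggered A
    b = [[B[(i + j) % n][j] for j in range(n)] for i in range(n)]   # staggered B
    c = [[0] * n for _ in range(n)]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                c[i][j] += a[i][j] * b[i][j]                        # local product
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]  # rotate left
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]  # rotate up
    return c

print(mesh_matrix_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```

After the stagger, PEij holds A[i][(i+j) mod n] and B[(i+j) mod n][j], so the n rotate-and-accumulate steps sweep through all k, giving cij = Σk aik bkj.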

Hypercube Network
A hypercube is an n-dimensional construct where edges are perpendicular among themselves and are of the same length. An n-dimensional hypercube is also known as an n-cube or an n-dimensional cube.

Features of a Hypercube with 2^k nodes
• Diameter = k
• Bisection width = 2^(k–1)
• Edges per node = k

Matrix Multiplication using Hypercube Network
General specification of hypercube networks −
• Let N = 2^m be the total number of processors, labelled P0, P1, ..., PN-1.
• Let i and i^b be two integers, 0 ≤ i, i^b ≤ N−1, whose binary representations differ
only in position b, 0 ≤ b ≤ m−1.
• Let us consider two n × n matrices, matrix A and matrix B.
• Step 1 − The elements of matrix A and matrix B are assigned to the n^3 processors such that
the processor in position (i, j, k) will have aji and bik.
• Step 2 − Every processor in position (i, j, k) computes the product
C(i, j, k) = A(i, j, k) × B(i, j, k)
• Step 3 − The sum C(0, j, k) = Σ C(i, j, k) for 0 ≤ i ≤ n−1, where 0 ≤ j, k ≤ n−1.
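The three steps can be checked with a short sequential sketch: every position (i, j, k) forms the single product aji × bik, and the products are then summed over i. The function name and the matrix representation below are illustrative assumptions.

```python
def hypercube_style_matmul(A, B):
    """Simulate the n^3-processor scheme: one product per (i, j, k), then reduce over i."""
    n = len(A)
    # Steps 1-2: processor (i, j, k) holds a_ji and b_ik and multiplies them
    prod = [[[A[j][i] * B[i][k] for k in range(n)]
             for j in range(n)] for i in range(n)]
    # Step 3: the reduction leaves C(0, j, k) = sum over i of C(i, j, k)
    return [[sum(prod[i][j][k] for i in range(n)) for k in range(n)]
            for j in range(n)]
```

Since C(0, j, k) = Σ aji × bik over i, the result is exactly the ordinary matrix product A × B.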

Block Matrix
A block matrix or partitioned matrix is a matrix where each element itself represents an individual
matrix. These individual sections are known as blocks or sub-matrices.

Example
In Figure (a), X is a block matrix where A, B, C, D are matrices themselves. Figure (f) shows the
total matrix.

Block Matrix Multiplication


When two block matrices are square matrices, they are multiplied just the way we perform
simple matrix multiplication. For example,
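Since the worked figure is not reproduced here, a minimal sketch of block multiplication may help. It assumes equal square blocks of size s; the helper names are illustrative.

```python
def sub(M, r, c, s):
    """Extract the s x s block at block-row r, block-column c."""
    return [row[c * s:(c + 1) * s] for row in M[r * s:(r + 1) * s]]

def block_matmul(A, B, s):
    """Multiply two square matrices block by block, exactly as in plain
    matrix multiplication but with s x s blocks as the 'elements'."""
    n = len(A)
    nb = n // s                       # number of blocks per row/column
    C = [[0] * n for _ in range(n)]
    for r in range(nb):
        for c in range(nb):
            for k in range(nb):       # C_rc += A_rk * B_kc (block product)
                Ab, Bb = sub(A, r, k, s), sub(B, k, c, s)
                for i in range(s):
                    for j in range(s):
                        C[r * s + i][c * s + j] += sum(
                            Ab[i][t] * Bb[t][j] for t in range(s))
    return C
```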

Parallel Algorithm-Sorting
Sorting is a process of arranging elements in a group in a particular order, i.e., ascending order,
descending order, alphabetic order, etc. Here we will discuss the following −

• Enumeration Sort
• Odd-Even Transposition Sort
• Parallel Merge Sort
• Hyper Quick Sort

Sorting a list of elements is a very common operation. A sequential sorting algorithm may not be
efficient enough when we have to sort a huge volume of data. Therefore, parallel algorithms are
used in sorting.

Enumeration Sort
Enumeration sort is a method of arranging all the elements in a list by finding the final position
of each element in a sorted list. It is done by comparing each element with all other elements and
finding the number of elements having smaller value.

Therefore, for any two elements ai and aj, exactly one of the following cases must be true −

ai < aj,  ai > aj,  or  ai = aj
Algorithm

procedure ENUM_SORTING (n)

begin
   for each process P1,j do
       C[j] := 0;

   for each process Pi,j do
       if (A[i] < A[j]) or (A[i] = A[j] and i < j) then
           C[j] := 1;
       else
           C[j] := 0;

   for each process P1,j do
       A[C[j]] := A[j];

end ENUM_SORTING
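A sequential sketch of the same rank-counting idea: the comparisons that the Pi,j processes perform in parallel are done in a loop here, and the tie-breaking test (A[i] = A[j] and i < j) keeps equal keys stable. Names are illustrative.

```python
def enum_sort(A):
    """Enumeration (rank) sort: place each element at the count of
    elements that must precede it; ties are broken by index."""
    n = len(A)
    out = [None] * n
    for j in range(n):
        rank = sum(1 for i in range(n)
                   if A[i] < A[j] or (A[i] == A[j] and i < j))
        out[rank] = A[j]
    return out
```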

Odd-Even Transposition Sort


Odd-Even Transposition Sort is based on the bubble sort technique. It compares two adjacent
numbers and switches them if the first number is greater than the second, to get an
ascending-order list. The opposite case applies for a descending-order series. Odd-even
transposition sort operates in two phases − an odd phase and an even phase. In both phases,
processes exchange numbers with their adjacent neighbor to the right.

Fig 3.21: Example for Odd-Even Transposition sort


Algorithm

procedure ODD-EVEN_PAR (n)

begin
   id := process's label

   for i := 1 to n do
   begin
       if i is odd then
           if id is odd then
               compare-exchange_min(id + 1);
           else
               compare-exchange_max(id - 1);

       if i is even then
           if id is even then
               compare-exchange_min(id + 1);
           else
               compare-exchange_max(id - 1);
   end for

end ODD-EVEN_PAR
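A sequential simulation of the two alternating phases, assuming the 1-based pair numbering used in the figure (odd phase compares pairs (1,2), (3,4), ...; even phase compares (2,3), (4,5), ...):

```python
def odd_even_transposition_sort(a):
    """Simulate n phases of odd-even transposition sort; n phases always
    suffice to sort n elements."""
    a = list(a)
    n = len(a)
    for phase in range(n):
        # 0-based starts: odd phase pairs begin at index 0, even phase at 1
        start = 0 if phase % 2 == 0 else 1
        for i in range(start, n - 1, 2):
            if a[i] > a[i + 1]:          # compare-exchange with right neighbor
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```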

Parallel Merge Sort


Merge sort first divides the unsorted list into the smallest possible sub-lists, compares each with
the adjacent list, and merges them in sorted order. It implements parallelism very nicely by
following the divide-and-conquer approach.
Fig 3.22: Parallel Merge Sort
Algorithm

procedure parallelmergesort(id, n, data, newdata)

begin
   data = sequentialmergesort(data)

   for dim = 1 to n
       data = parallelmerge(id, dim, data)
   end for

   newdata = data
end
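A sketch of the same divide-sort-merge structure using a thread pool in place of the hypercube merge steps. The pool size, the recursion-depth cutoff, and the helper names are assumptions, not part of the pseudocode.

```python
from concurrent.futures import ThreadPoolExecutor

def merge(left, right):
    """Merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    return out + left[i:] + right[j:]

def parallel_merge_sort(data, pool, depth=2):
    """Sort the left half in a pool task while the caller sorts the right
    half; stop spawning new tasks below the given depth."""
    if len(data) <= 1:
        return list(data)
    mid = len(data) // 2
    if depth > 0:
        future = pool.submit(parallel_merge_sort, data[:mid], pool, depth - 1)
        right = parallel_merge_sort(data[mid:], pool, depth - 1)
        left = future.result()
    else:
        left = parallel_merge_sort(data[:mid], pool, 0)
        right = parallel_merge_sort(data[mid:], pool, 0)
    return merge(left, right)
```

Typical usage: `with ThreadPoolExecutor(max_workers=4) as pool: parallel_merge_sort(xs, pool)`. The depth cutoff keeps the number of in-flight tasks small so recursive submissions cannot exhaust the pool.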

Hyper Quick Sort

Hyper quick sort is an implementation of quick sort on hypercube. Its steps are as follows −

• Divide the unsorted list among the nodes.
• Sort each node locally.
• From node 0, broadcast the median value.
• Split each list locally, then exchange the halves across the highest dimension.
• Repeat steps 3 and 4 in parallel until the dimension reaches 0.
Algorithm

procedure HYPERQUICKSORT (B, n)

begin
   id := process's label;

   for i := 1 to d do
   begin
       x := pivot;
       partition B into B1 and B2 such that B1 ≤ x < B2;
       if ith bit is 0 then
       begin
           send B2 to the process along the ith communication link;
           C := subsequence received along the ith communication link;
           B := B1 U C;
       end
       else
       begin
           send B1 to the process along the ith communication link;
           C := subsequence received along the ith communication link;
           B := B2 U C;
       end
   end for

   sort B using sequential quicksort;

end HYPERQUICKSORT
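The exchange pattern can be simulated single-threaded by keeping one list per process. In this sketch the pivot is taken as the median of the lowest-numbered process in each subcube, which is an assumption about how the broadcast step is realized.

```python
def hyperquicksort(blocks, d):
    """Simulate hyperquicksort on 2**d 'processes', one list per process."""
    blocks = [list(b) for b in blocks]
    p = 1 << d
    for b in reversed(range(d)):              # split across highest dimension first
        group = 1 << (b + 1)                  # size of the current subcube
        for base in range(0, p, group):
            src = blocks[base]                # pivot broadcast from lowest id
            if not src:
                continue
            pivot = sorted(src)[len(src) // 2]
            for pid in range(base, base + group // 2):
                partner = pid | (1 << b)      # neighbor across dimension b
                both = blocks[pid] + blocks[partner]
                # low half stays on the 0-bit side, high half on the 1-bit side
                blocks[pid] = [x for x in both if x <= pivot]
                blocks[partner] = [x for x in both if x > pivot]
    return [sorted(blk) for blk in blocks]    # final local sequential sort
```

After the dimension loop, concatenating the blocks in process order yields a globally sorted sequence.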

Parallel Search Algorithm

Searching is one of the fundamental operations in computer science. It is used in all
applications where we need to find whether an element is in the given list or not. In this
chapter, we will discuss the following search algorithms −

• Divide and Conquer
• Depth-First Search
• Breadth-First Search
• Best-First Search

Divide and Conquer

In the divide and conquer approach, the problem is divided into several small sub-problems.
Then the sub-problems are solved recursively and combined to get the solution of the
original problem.
The divide and conquer approach involves the following steps at each level −
Divide − The original problem is divided into sub-problems.
Conquer − The sub-problems are solved recursively.

Combine − The solutions of the sub-problems are combined to get the solution of the
original problem.

Binary search is an example of divide and conquer algorithm.

Pseudocode

BinarySearch(a, b, low, high)

if low > high then
   return NOT FOUND
else
   mid ← (low + high) / 2
   if b = key(mid) then
       return key(mid)
   else if b < key(mid) then
       return BinarySearch(a, b, low, mid−1)
   else
       return BinarySearch(a, b, mid+1, high)
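An index-returning Python version of the pseudocode; returning −1 for NOT FOUND is an assumption of this sketch.

```python
def binary_search(a, key, low, high):
    """Return the index of key in the sorted list a[low..high], or -1."""
    if low > high:
        return -1                      # NOT FOUND
    mid = (low + high) // 2
    if a[mid] == key:
        return mid
    elif key < a[mid]:
        return binary_search(a, key, low, mid - 1)
    else:
        return binary_search(a, key, mid + 1, high)
```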

Depth-First Search

Depth-First Search (or DFS) is an algorithm for searching a tree or an undirected graph
data structure. Here, the concept is to start from the starting node known as the root and
traverse as far as possible along the same branch. If we reach a node with no successor node,
we return and continue with the vertex that is yet to be visited.
Steps of Depth-First Search

• Consider a node (root) that has not been visited previously and mark it visited. Visit the first
adjacent successor node and mark it visited.

• If all the successor nodes of the considered node are already visited, or it doesn’t have
any more successor nodes, return to its parent node.
Pseudo code
Let v be the vertex where the search starts in Graph G.
DFS(G, v)

Stack S := {};

for each vertex u, set visited[u] := false;
push S, v;

while (S is not empty) do
   u := pop S;
   if (not visited[u]) then
       visited[u] := true;
       for each unvisited neighbour w of u
           push S, w;
   end if
end while

END DFS()
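The stack-based pseudocode translates directly to Python; the adjacency-dict graph representation is an assumption of this sketch.

```python
def dfs(graph, start):
    """Iterative depth-first search; returns vertices in visit order.
    graph: dict mapping each vertex to a list of neighbors."""
    visited, order, stack = set(), [], [start]
    while stack:
        u = stack.pop()
        if u not in visited:
            visited.add(u)
            order.append(u)
            for w in graph[u]:
                if w not in visited:
                    stack.append(w)
    return order
```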

Breadth-First Search

Breadth-First Search (or BFS) is an algorithm for searching a tree or an undirected graph data
structure. Here, we start with a node and then visit all the adjacent nodes in the same level and
then move to the adjacent successor node in the next level. This is also known as level-by-level
search.
Steps of Breadth-First Search
• Start with the root node, mark it visited.
• As the root node has no node in the same level, go to the next level. Visit all adjacent
nodes and mark them visited.
• Go to the next level and visit all the unvisited adjacent nodes. Continue this process until
all the nodes are visited.
Pseudocode
Let v be the vertex where the search starts in Graph G.

BFS(G, v)

Queue Q := {};

for each vertex u, set visited[u] := false;
insert Q, v;

while (Q is not empty) do
   u := delete Q;
   if (not visited[u]) then
       visited[u] := true;
       for each unvisited neighbor w of u
           insert Q, w;
   end if
end while

END BFS()
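A Python version of the queue-based traversal, again assuming an adjacency-dict graph:

```python
from collections import deque

def bfs(graph, start):
    """Breadth-first search; returns vertices in level-by-level visit order."""
    visited, order = {start}, []
    q = deque([start])
    while q:
        u = q.popleft()
        order.append(u)
        for w in graph[u]:
            if w not in visited:     # mark on insertion to avoid duplicates
                visited.add(w)
                q.append(w)
    return order
```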

Best-First Search
Best-First Search is an algorithm that traverses a graph to reach a target in the shortest possible
path. Unlike BFS and DFS, Best-First Search follows an evaluation function to determine which
node is the most appropriate to traverse next.
Steps of Best-First Search
• Start with the root node, mark it visited.
• Find the next appropriate node and mark it visited.
• Go to the next level and find the appropriate node and mark it visited. Continue this
process until the target is reached.

Pseudocode
BFS(m)

Insert(m.StartNode)
Until PriorityQueue is empty
   c ← PriorityQueue.DeleteMin
   If c is the goal
       Exit
   Else
       For each neighbor n of c
           If n "Unvisited"
               Mark n "Visited"
               Insert(n)
       Mark c "Examined"

End procedure
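A sketch using Python's heapq as the priority queue; the heuristic table h is an assumed input that plays the role of the evaluation function.

```python
import heapq

def best_first_search(graph, h, start, goal):
    """Greedy best-first search. h maps each node to its heuristic estimate;
    returns the sequence of expanded nodes, ending at goal if it is reached."""
    visited = {start}
    pq = [(h[start], start)]          # DeleteMin pops the lowest estimate
    expanded = []
    while pq:
        _, c = heapq.heappop(pq)
        expanded.append(c)
        if c == goal:
            return expanded
        for n in graph[c]:
            if n not in visited:
                visited.add(n)
                heapq.heappush(pq, (h[n], n))
    return expanded
```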

Graph Algorithm
A graph is an abstract notation used to represent the connection between pairs of objects. A
graph consists of −

• Vertices − Interconnected objects in a graph are called vertices. Vertices are also known
as nodes.

• Edges − Edges are the links that connect the vertices.

There are two types of graphs −

• Directed graph − In a directed graph, edges have direction, i.e., edges go from one vertex
to another.

• Undirected graph − In an undirected graph, edges have no direction.

Graph Coloring
Graph coloring is a method to assign colors to the vertices of a graph so that no two adjacent
vertices have the same color. Some graph coloring problems are −

• Vertex coloring − A way of coloring the vertices of a graph so that no two adjacent
vertices share the same color.

• Edge coloring − The method of assigning a color to each edge so that no two adjacent
edges have the same color.

• Face coloring − It assigns a color to each face or region of a planar graph so that no two
faces that share a common boundary have the same color.
Chromatic Number
Chromatic number is the minimum number of colors required to color a graph. For example, the
chromatic number of the following graph is 3.

Fig 3.23: Chromatic Number


The concept of graph coloring is applied in preparing timetables, mobile radio frequency
assignment, Sudoku, register allocation, and coloring of maps.
Steps for graph coloring

• Set the initial value of each processor in the n-dimensional array to 1.

• Now, to assign a particular color to a vertex, determine whether that color is already
assigned to the adjacent vertices or not.

• If a processor detects the same color in the adjacent vertices, it sets its value in the array to 0.

After making n^2 comparisons, if any element of the array is 1, then it is a valid coloring.

Pseudocode for graph coloring


begin

   create the processors P(i0, i1, ..., in-1) where 0 ≤ iv < m, 0 ≤ v < n
   status[i0, ..., in-1] = 1

   for j varies from 0 to n-1 do
   begin
       for k varies from 0 to n-1 do
       begin
           if aj,k = 1 and ij = ik then
               status[i0, ..., in-1] = 0
       end
   end

   ok = Σ status

   if ok > 0, then
       display valid coloring exists
   else
       display invalid coloring

end
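The validity test at the heart of the pseudocode can be done sequentially over an adjacency matrix; the names below are illustrative.

```python
def valid_coloring(adj, colors):
    """Return True iff no edge of the graph (given as an n x n 0/1 adjacency
    matrix) joins two vertices with the same color."""
    n = len(adj)
    for j in range(n):
        for k in range(n):
            if adj[j][k] == 1 and j != k and colors[j] == colors[k]:
                return False          # same color across an edge: invalid
    return True
```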

Minimal Spanning Tree


A spanning tree whose sum of edge weights (or lengths) is less than that of all other possible
spanning trees of graph G is known as a minimal spanning tree or minimum-cost spanning tree.
The following figure shows a weighted connected graph.

Some possible spanning trees of the above graph are shown below −
Among all the above spanning trees, figure (d) is the minimum spanning tree. The concept of
minimum-cost spanning tree is applied in the travelling salesman problem, designing electronic
circuits, designing efficient networks, and designing efficient routing algorithms.
To implement the minimum-cost spanning tree, the following two methods are used −

• Prim’s Algorithm
• Kruskal’s Algorithm
Prim's Algorithm
Prim’s algorithm is a greedy algorithm, which helps us find the minimum spanning tree for a
weighted undirected graph. It selects a vertex first and finds an edge with the lowest weight
incident on that vertex.
Steps of Prim’s Algorithm
• Select any vertex, say v1, of graph G.
• Select an edge, say e1, of G such that e1 = v1 v2, v1 ≠ v2, and e1 has the minimum
weight among the edges incident on v1 in graph G.
• Now, following step 2, select the minimum-weight edge incident on v2.
• Continue this till n−1 edges have been chosen. Here n is the number of vertices.

The minimum spanning tree is −
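A sequential sketch of Prim's algorithm using a priority queue of candidate edges; the adjacency-list input format is an assumption.

```python
import heapq

def prim_mst(graph, start):
    """graph: dict vertex -> list of (weight, neighbor) pairs.
    Returns (total cost, list of chosen (u, v, weight) edges)."""
    visited = {start}
    edges = []
    pq = [(w, start, v) for w, v in graph[start]]
    heapq.heapify(pq)
    total = 0
    while pq and len(visited) < len(graph):
        w, u, v = heapq.heappop(pq)   # lowest-weight edge leaving the tree
        if v in visited:
            continue
        visited.add(v)
        total += w
        edges.append((u, v, w))
        for w2, x in graph[v]:
            if x not in visited:
                heapq.heappush(pq, (w2, v, x))
    return total, edges
```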

Kruskal's Algorithm
Kruskal’s algorithm is a greedy algorithm, which helps us find the minimum spanning tree
for a connected weighted graph, adding increasing-cost arcs at each step. It is a minimum-
spanning-tree algorithm that finds an edge of the least possible weight that connects any two
trees in the forest.
Steps of Kruskal’s Algorithm
• Select an edge of minimum weight, say e1, of graph G such that e1 is not a loop.
• Select the next minimum-weight edge connected to e1.
• Continue this till n−1 edges have been chosen. Here n is the number of vertices.
The minimum spanning tree of the above graph is −
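A sequential sketch of Kruskal's algorithm; the union-find structure used to detect loops is the standard implementation choice, not something mandated by the steps above.

```python
def kruskal_mst(n, edges):
    """edges: list of (weight, u, v) with vertices numbered 0..n-1.
    Returns (total cost, list of chosen (u, v, weight) edges)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    total, chosen = 0, []
    for w, u, v in sorted(edges):           # increasing-cost order
        ru, rv = find(u), find(v)
        if ru != rv:                        # skip edges that would form a loop
            parent[ru] = rv
            total += w
            chosen.append((u, v, w))
            if len(chosen) == n - 1:
                break
    return total, chosen
```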

Shortest Path Algorithm


Shortest path algorithm is a method of finding the least-cost path from the source node (S) to
the destination node (D). Here, we will discuss Moore’s algorithm, also known as the Breadth-
First Search algorithm.
Moore’s algorithm

• Label the source vertex S with i, and set i = 0.

• Find all unlabeled vertices adjacent to the vertices labeled i. If no vertices are
connected to the vertex S, then vertex D is not connected to S. If there are vertices
connected to S, label them i + 1.

• If D is labeled, then go to step 4; else set i = i + 1 and go to step 2.

• Stop once the length of the shortest path is found.
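The labelling scheme above can be sketched in Python, returning the shortest-path length in edges; using −1 for "not connected" is an assumption of this sketch.

```python
from collections import deque

def moore_shortest_path(graph, s, d):
    """BFS labelling as in Moore's algorithm: label[s] = 0, neighbors of a
    vertex labeled i get label i + 1; stops when d is labeled."""
    label = {s: 0}
    q = deque([s])
    while q:
        u = q.popleft()
        if u == d:
            return label[u]
        for w in graph[u]:
            if w not in label:
                label[w] = label[u] + 1
                q.append(w)
    return -1                         # d is not connected to s
```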
