
Unit-2

Parallel Algorithm Design


By: NEETHU V A
Assistant Professor,
CSE department, PIET, PU
 Algorithm
 Sequential Algorithm
 Parallel Algorithm
 Principles of Parallel Algorithms
 Preliminaries
Example: Multiplying a Dense Matrix with a Vector
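The slide illustrates this example with a figure. As a rough serial sketch (not from the slides; the function name and row-major layout are assumptions), each entry y[i] of the result can be treated as one independent task that forms the dot product of row i of A with the vector b:

#include <stddef.h>

/* Row-wise decomposition sketch: task i computes y[i] = A[i, *] . b.
   Illustrative only; row-major storage is assumed. */
void dense_matvec(size_t n, const double A[], const double b[], double y[])
{
    for (size_t i = 0; i < n; i++) {        /* each iteration is one independent task */
        double sum = 0.0;
        for (size_t j = 0; j < n; j++)
            sum += A[i * n + j] * b[j];
        y[i] = sum;
    }
}

With n tasks of equal size, this decomposition gives a maximum degree of concurrency of n.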
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
Consider the execution of the query:

MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE")
Task-dependency graph
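The task-dependency graph itself appears as a figure in the slides. The following sketch (hypothetical names, one possible grouping of the predicates) shows the tasks and their dependencies for this query:

#include <stdbool.h>
#include <stddef.h>

/* One possible decomposition: T1..T4 scan one attribute each and are independent,
   so the maximum concurrency is 4; T5 and T6 combine their results and form the
   critical path. The figure in the slides may group the predicates differently. */
void query_tasks(size_t n, const int model[], const int year[], const int color[],
                 int CIVIC, int GREEN, int WHITE, bool result[])
{
    for (size_t i = 0; i < n; i++) {
        bool is_civic = (model[i] == CIVIC);        /* T1 */
        bool is_2001  = (year[i]  == 2001);         /* T2 */
        bool is_green = (color[i] == GREEN);        /* T3 */
        bool is_white = (color[i] == WHITE);        /* T4 */
        bool g_or_w   = is_green || is_white;       /* T5: depends on T3 and T4 */
        result[i] = is_civic && is_2001 && g_or_w;  /* T6: depends on T1, T2, T5 */
    }
}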
Example: Database Query Processing
ID# Model Year Color Dealer Price
4523 Civic 2002 Blue MN $18,000
3476 Corolla 1999 White IL $15,000
7623 Camry 2001 Green NY $21,000
9834 Prius 2001 Green CA $18,000
6734 Civic 2001 White OR $17,000
5342 Altima 2001 Green FL $19,000
3845 Maxima 2001 Blue NY $22,000
8354 Accord 2000 Green VT $18,000
4395 Civic 2001 Red CA $17,000
7352 Civic 2002 Red WA $18,000
An alternate decomposition of the given problem into subtasks, along with their data dependencies.
Fine-grained decomposition of the task
Coarse-grained decomposition of the task
Abstraction of task graphs
Limits on Parallel Performance
 Granularity of the decomposition and how well the resulting concurrency can be utilized
 Inherent bounds on how far a problem can be decomposed
Task Interaction Graph
 To capture interactions among tasks
 Node = task
 Edge (undirected/directed) = interaction or data exchange

 Task-dependency graph vs. task-interaction graph


Example: Sparse matrix vector multiplication
Tasks: each task computes one entry of y[].
• Assign the i-th row of A and the entry b[i] to Task i.

We need to compute the products A[i, j] × b[j] only for those values of j for
which A[i, j] ≠ 0.
For example, y[0] = A[0, 0]·b[0] + A[0, 1]·b[1] + A[0, 4]·b[4] + A[0, 8]·b[8].
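A minimal serial sketch of this decomposition, assuming CSR (compressed sparse row) storage, which the slides do not specify; in the parallel setting each task would first obtain the b[j] values it needs from the tasks that own them:

#include <stddef.h>

/* Task i computes y[i] using only the nonzeros of row i of A (CSR storage assumed). */
void sparse_matvec(size_t n, const size_t row_ptr[], const size_t col_idx[],
                   const double val[], const double b[], double y[])
{
    for (size_t i = 0; i < n; i++) {                 /* one task per output entry */
        double sum = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * b[col_idx[k]];           /* only A[i, j] != 0 contribute */
        y[i] = sum;
    }
}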
Processes
• The tasks into which a problem is decomposed run on physical processors; we use the term process for the computing agent that performs the tasks.
• A process is an abstract entity that uses the code and data corresponding to a task to produce the output of that task within a finite amount of time after the task is activated by the parallel program.
• Process = task + task data + task code required to produce the task's output.
• Mapping: the mechanism by which tasks are assigned to processes for execution.
• Processor: the hardware unit that physically performs the computations.
Why use processes rather than processors?
• We refer to the mapping as being from tasks to processes, as opposed to
processors.
• This is because typical programming APIs do not allow easy binding of
tasks to physical processors.
• Rather, we aggregate tasks into processes and rely on the system to map
these processes to physical processors.
• We use processes, not in the UNIX sense of a process, rather, simply as a
collection of tasks and associated data.
Basis for Choosing Mapping
• Task-dependency graph: used to ensure maximum concurrency.
• Task-interaction graph: used to ensure minimum communication.
Criteria of Mapping

1. Mapping independent tasks to different processes.
2. Assigning tasks on the critical path to processes as soon as they become available.
3. Minimizing interaction between processes by mapping tasks with dense interactions to the same process.
Mapping Database Query to Processes

• 4 processes can be used in total since the maximum concurrency is 4.
• Assign all tasks within a level to different processes.
Decomposition Techniques

1. Recursive decomposition
2. Data decomposition
3. Exploratory decomposition
4. Speculative decomposition
Recursive Decomposition Technique
• Ideal for problems that can be solved by the divide-and-conquer method.
Steps:
1. Decompose the problem into a set of independent sub-problems.
2. Recursively decompose each sub-problem.
3. Stop decomposing when the minimum desired granularity is reached or a (partial) result is obtained.
Quick Sort Example
Recursive Decomposition for Finding Min
Find the minimum in an array of numbers A of length n
1. procedure SERIAL_MIN(A, n)
2. begin
3.   min := A[0];
4.   for i := 1 to n − 1 do
5.     if (A[i] < min) min := A[i];
6.   endfor;
7.   return min;
8. end SERIAL_MIN
Finding Min by using Recursive Procedure
1. procedure RECURSIVE_MIN (A, n)
2. begin
3. if ( n = 1 ) then
4. min := A [0] ;
5. else
6. lmin := RECURSIVE_MIN ( A, n/2 );
7. rmin := RECURSIVE_MIN ( &(A[n/2]), n - n/2 );
8. if (lmin < rmin) then
9. min := lmin;
10. else
11. min := rmin;
12. endelse;
13. endelse;
14. return min;
15. end RECURSIVE_MIN
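A compilable C rendering of RECURSIVE_MIN is sketched below. The OpenMP task directives are not part of the original pseudocode; they are added only to mark where the two independent halves could run concurrently (the function would be called from inside a parallel/single region for that to happen):

#include <omp.h>

/* C version of RECURSIVE_MIN: the two recursive calls work on disjoint halves of A,
   so they can run as independent tasks; the taskwait reflects that both lmin and
   rmin are needed before the final comparison. */
static double recursive_min(const double *A, int n)
{
    if (n == 1)
        return A[0];

    double lmin, rmin;
    #pragma omp task shared(lmin)
    lmin = recursive_min(A, n / 2);
    #pragma omp task shared(rmin)
    rmin = recursive_min(A + n / 2, n - n / 2);
    #pragma omp taskwait

    return (lmin < rmin) ? lmin : rmin;
}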
Data Decomposition Technique:
• Ideal for problems that operate on large data structures
• Steps:
1. The data on which the computations are performed are partitioned
2. Data partition is used to induce a partitioning of the computations
into tasks.
• Data Partitioning –
Partition output data
Partition input data
Partition input + output data
Partition intermediate data
Data Decomposition Based on Partitioning Output Data
• If each element of the output can be computed independently of the others as a function of the input, partitioning the computations into tasks is natural.
• Each task is assigned the work of computing a portion of the output.
Matrix Multiplication Example

Matrix-matrix multiplication: 𝐶 = 𝐴 × 𝐵
• Partition matrix C into 2 × 2 submatrices
• Computation of C then can be partitioned into four tasks.
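With A, B, and C each viewed as a 2 × 2 grid of submatrices, one possible four-task decomposition, in which each task produces one submatrix of C, is:

Task 1: C1,1 = A1,1·B1,1 + A1,2·B2,1
Task 2: C1,2 = A1,1·B1,2 + A1,2·B2,2
Task 3: C2,1 = A2,1·B1,1 + A2,2·B2,1
Task 4: C2,2 = A2,1·B1,2 + A2,2·B2,2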
• A partitioning of output data does not result in a unique
decomposition into tasks.
• For example, for the same problem as in previous, with
identical output data distribution, we can derive the
following two (other) decompositions:
Problem: Counting itemsets in a transaction database
• Consider the problem of computing the frequency of a set of itemsets in
a transaction database.
• In this problem there is a set T containing n transactions and a set I
containing m itemsets.
• Each transaction and itemset contains a small number of items, out of a
possible set of items.
Data Decomposition Based on Partitioning Input Data
• Ideal if the output is a single unknown value or if the individual elements of the output cannot be efficiently determined in isolation.
• Example: finding the minimum, maximum, or sum of a set of numbers (see the sketch below).
• Example: sorting a set.
• Partition the input data and associate a task with each partition of the input data.
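A minimal serial sketch of input partitioning for the sum example, assuming P contiguous chunks (chunk boundaries and names are illustrative, not from the slides): each task reduces its own chunk, and a follow-up task combines the partial results.

#include <stddef.h>

/* Input-data decomposition sketch: task p sums the contiguous chunk
   [p*n/P, (p+1)*n/P); a final combining task adds up the P partial sums. */
double partitioned_sum(const double x[], size_t n, size_t P, double partial[])
{
    for (size_t p = 0; p < P; p++) {                 /* task p */
        double s = 0.0;
        for (size_t i = p * n / P; i < (p + 1) * n / P; i++)
            s += x[i];
        partial[p] = s;
    }
    double total = 0.0;                              /* combining task */
    for (size_t p = 0; p < P; p++)
        total += partial[p];
    return total;
}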
Data Decomposition Based on Partitioning Input/Output Data
• Often input and output data decomposition can be combined for a higher degree of concurrency.
• Decomposition based on partitioning input or output data is also referred to as the owner-computes rule.
• Each partition performs all the computations involving data that it owns.
• Input data decomposition: – A task performs all the computations that
can be done using these input data.
• Output data decomposition: – A task computes all the results in the
partition assigned to it.
For the itemset counting example, the transaction set (input) and itemset counts
(output) can both be decomposed as follows:
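The exact decomposition appears as a figure in the slides. As a rough sketch (the bitmask encoding and names below are assumptions, not from the slides), the transactions and the itemsets can each be split into chunks; the task owning transaction chunk [t_lo, t_hi) and itemset chunk [s_lo, s_hi) produces partial counts, which are later summed across transaction chunks to give each itemset's frequency:

#include <stddef.h>

/* Combined input/output decomposition sketch: transactions and itemsets are encoded
   as bitmasks over the item universe; this task counts only its own itemset chunk
   against its own transaction chunk. */
void count_chunk(const unsigned long long txn[], size_t t_lo, size_t t_hi,
                 const unsigned long long set[], size_t s_lo, size_t s_hi,
                 size_t partial_count[])
{
    for (size_t s = s_lo; s < s_hi; s++) {
        size_t c = 0;
        for (size_t t = t_lo; t < t_hi; t++)
            if ((txn[t] & set[s]) == set[s])   /* itemset s is contained in transaction t */
                c++;
        partial_count[s - s_lo] = c;           /* partial count, to be summed over t-chunks */
    }
}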
Data Decomposition Based on Partitioning Intermediate Data
• Applicable for problems which can be solved by
multi-stage computations such that the output of
one stage is the input to the subsequent stage.
• Partitioning can be based on input or output of an
intermediate stage.
Example: Dense matrix multiplication
• Original output data decomposition yields a maximum degree of
concurrency of 4.
Exploratory Decomposition
• In many cases, the decomposition of the problem goes hand-in-
hand with its execution.
• Exploratory decomposition is used to decompose problems
whose underlying computations correspond to a search of a
space for solutions.
• In exploratory decomposition, the search space is partitioned into smaller parts, and each of these parts is searched concurrently until the desired solutions are found.
• Problems in this class include a variety of discrete optimization problems, theorem proving, game playing, etc.
• A simple application of exploratory decomposition is the solution of a 15-puzzle (a tile puzzle). A sequence of three moves transforms a given initial state (a) into the desired final state (d).
Speculative Decomposition
• In some applications, dependencies between tasks are not known a priori.
• For such applications, it is impossible to identify independent tasks.
• There are generally two approaches to dealing with such applications:
conservative approaches, which identify independent tasks only when
they are guaranteed to not have dependencies, and, optimistic
approaches, which schedule tasks even when they may potentially be
erroneous.
• Conservative approaches may yield little concurrency and optimistic
approaches may require roll-back mechanism in the case of an error.
Example
• A classic example of speculative decomposition is in discrete event
simulation.
• The central data structure in a discrete event simulation is a time-ordered
event list.
• Events are extracted precisely in time order, processed, and if required,
resulting events are inserted back into the event list.
• Consider your day today as a discrete event system - you get up, get
ready, drive to work, work, eat lunch, work some more, drive back, eat
dinner, and sleep.
• Each of these events may be processed independently, however, in
driving to work, you might meet with an unfortunate accident and not
get to work at all.
• Therefore, an optimistic scheduling of other events will have to be rolled
back.
• Another Example: The simulation of a network of nodes (for instance, an
assembly line or a computer network through which packets pass). The
task is to simulate the behavior of this network for various inputs and
node delay parameters (note that networks may become unstable for
certain values of service rates, queue sizes, etc.).
Example: A simple network for discrete event
simulation
Characteristics of Tasks
• Identify the concurrency that is available in a problem and decompose it
into tasks that can be executed in parallel.
• The nature of the tasks and the interactions among them has a bearing on
the mapping.
• The characteristics of these tasks critically impact choice and performance
of parallel algorithms.
• Relevant task characteristics include:
1. Task generation.
2. Task sizes.
3. Size of data associated with tasks.
Task Generation: The tasks that constitute a parallel algorithm may be generated
either statically or dynamically.
• Static task generation: all the tasks are known before the algorithm starts execution.
• Dynamic task generation: tasks are generated on the fly as the computation proceeds, as in recursive or exploratory decompositions.
Task Sizes: the size of a task is the relative amount of time required to complete it.
• The complexity of mapping schemes often depends on whether or not the
tasks are uniform; i.e., whether or not they require roughly the same amount
of time. If the amount of time required by the tasks varies significantly, then
they are said to be non-uniform.
• If the size of all the tasks is known, then this information can often be
used in mapping of tasks to processes.

Size of Data Associated with Tasks:


• It is important because the data associated with a task must be available
to the process performing that task, and the size and the location of
these data may determine the process that can perform the task without
incurring excessive data-movement overheads.
Characteristics of Inter-Task Interactions
• Tasks need to interact with each other to share data, work, or synchronization
information.
• Different parallel algorithms require different types of interactions among concurrent
tasks.
• The nature of these interactions makes them more suitable for certain programming
paradigms and mapping schemes
1) Static versus dynamic –
Static: interactions are known prior to execution.
2) Regular versus irregular –
Regular: interaction pattern can be exploited for efficient implementation.
3) Read-only versus read-write
4) One-way versus two-way
Static vs. Dynamic Interactions
• Static interaction –
• Tasks and associated interactions are predetermined:
• Task-interaction graph and times that interactions occur are
known:
• Example: matrix multiplication – Easy to program
• Dynamic interaction –
• Timing of interaction or sets of tasks to interact with can not be
determined prior to the execution.
• Difficult to program using message passing;
• shared-address-space programming may be simpler.
Regular vs. Irregular Interactions
• Regular interactions –
• Interaction has a spatial structure that can be exploited for
efficient implementation:
• ring, mesh
• Irregular interactions –
• Interactions have no well-defined structure
• Example: Sparse matrix-vector multiplication
Example: Image dithering
Read-Only versus Read-Write
• Read-Only interactions:- tasks require only a read-access
to the data shared among many concurrent tasks.

• Read-Write interactions:- multiple tasks need both read and write access to some shared data.
One-way versus Two-way
• One-way interaction:-
only one of a pair of communicating tasks initiates the interaction
and completes it without interrupting the other one.
• Two-way interactions:-
The data or work needed by a task or a subset of tasks is explicitly supplied by another task or subset of tasks; such interactions usually involve predefined producer and consumer tasks.
• All read-only interactions can be formulated as one-way interactions.
• Read-write interactions can be either one way or two-way.
Mapping Techniques for Load Balancing
• Problem → decomposed into a number of tasks → tasks mapped onto processes
• Objective: minimize the execution time
 In order to achieve a small execution time, the overheads of executing the tasks in parallel must be minimized.
• For a given decomposition, there are two key sources of overhead:
1. time spent in inter-process interaction and
2. time that some processes may spend being idle.
• A good mapping must ensure that the computations and interactions among
processes at each stage of the execution of the parallel algorithm are well balanced.
Dense Matrix example
Schemes for Static Mapping
• Mapping must simultaneously minimize idling and maintain good load balance.
• It distributes the tasks among processes prior to the execution
of the algorithm.
• For this to work, we must have a good estimate of the size of
each task.
• Types of techniques:
1. Mapping Based on Data Partitioning
2. Task Graph Partitioning
3. Hybrid Strategies
Mapping Based on Data Partitioning
• By owner-computes rule, mapping the relevant data onto processes is
equivalent to mapping tasks onto processes
• For mappings based on data partitioning, the two most common ways of representing data in algorithms are arrays (or matrices) and graphs.
• Arrays or matrices (regular data):
– Block distributions
– Cyclic and block-cyclic distributions
• Irregular data:
– Graph partitioning
Block Distribution
• In these distributions, an n-dimensional array is distributed among the processes such that each process receives a contiguous block of array entries along a specified subset of array dimensions.
• Consider a 1-D block distribution for the dense matrix example:
• Number of processes = 8
• Hint: distribute rows or columns of the matrix to different processes, as in the sketch below.
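A minimal sketch of the row-wise 1-D block distribution (the closed-form block boundaries are the usual ones; they are not spelled out on the slide):

#include <stddef.h>

/* 1-D block distribution: process p owns the contiguous rows [p*n/P, (p+1)*n/P).
   With P = 8, each process receives roughly n/8 consecutive rows of the matrix. */
void block_range(size_t p, size_t n, size_t P, size_t *lo, size_t *hi)
{
    *lo = p * n / P;
    *hi = (p + 1) * n / P;
}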
1-D Dense matrix multiplication
Multi-D Block Distribution
• Distribute blocks of matrix to different processes

Examples of two-dimensional distributions of an array, (a) on a 4 x 4 process grid, and (b) on a 2 x 8 process grid.
• For multiplying two dense matrices A and B, we can partition the output
matrix C using a block decomposition.
• For load balance, we give each task the same number of elements of C.
(Note that each element of C corresponds to a single dot product.)
• The choice of precise decomposition (1-D or 2-D) is determined by the
associated communication overhead.
• In general, a higher-dimensional decomposition allows the use of a larger number of processes.
• Example: 𝑛 × 𝑛 dense matrix multiplication 𝐶 = 𝐴 × 𝐵 using 𝑝
processes

• Each entry of 𝐶 use the same amount of computation.


• Decomposition based on output data.

• Either 1D or 2D block distribution can be used


• Multi-D distribution allows higher degree of concurrency.
• Multi-D distribution can also help to reduce interactions
Block Distribution
Block Cyclic Distributions
• If the amount of work differs for different entries of a matrix, a block
distribution can lead to load imbalances.
• Example. Doolittle’s method of LU factorization of dense matrix
• The amount of computation increases from the top left to the bottom
right of the matrix.
Work used to compute Entries of L and U
• Block distribution of LU factorization tasks leads to load imbalance.
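For reference, a serial Doolittle-style elimination sketch (a standard textbook formulation without pivoting, not taken from the slides). Step k updates only the trailing (n−k−1) × (n−k−1) submatrix, so the entry at position (i, j) is touched in roughly min(i, j) steps; entries toward the bottom-right therefore carry far more work, which is why a plain block distribution of these tasks is imbalanced.

#include <stddef.h>

/* In-place LU factorization sketch (Doolittle form, no pivoting). A is n x n, row-major;
   on return the strictly lower triangle holds L (unit diagonal implied) and the upper
   triangle holds U. */
void lu_doolittle(size_t n, double A[])
{
    for (size_t k = 0; k < n; k++) {
        for (size_t i = k + 1; i < n; i++) {
            A[i * n + k] /= A[k * n + k];                    /* multiplier L(i, k) */
            for (size_t j = k + 1; j < n; j++)
                A[i * n + j] -= A[i * n + k] * A[k * n + j]; /* trailing-submatrix update */
        }
    }
}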
Block-Cyclic Distribution
• A variation of block distribution that can be used to alleviate the
load-imbalance.
• Steps:
1. Partition an array into many more blocks than the
number of available processes
2. Assign blocks to processes in a round-robin manner so
that each process gets several non-adjacent blocks.
Cyclic Distribution
• A cyclic distribution is an extreme case of a block-cyclic
distribution and can result in an almost perfect load balance due
to the extreme fine-grained underlying decomposition.
• Cyclic distribution → block-cyclic distribution with block size = 1
→ achieves fine-grained granularity (see the sketch below)
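The owner of a row (or block) under these distributions reduces to one line; a sketch, assuming a 1-D distribution with block size b over P processes (b = 1 gives the pure cyclic case):

#include <stddef.h>

/* 1-D block-cyclic mapping: row i lies in block i/b, and blocks are dealt out
   round-robin, so process (i/b) % P owns row i. Setting b = 1 yields the cyclic
   distribution; setting b = n/P yields the plain block distribution. */
size_t block_cyclic_owner(size_t i, size_t b, size_t P)
{
    return (i / b) % P;
}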
Task Graph Partitioning

• Partitioning a given task-dependency graph across processes.

• Determining an optimal mapping for a general task-dependency


graph is an NP-complete problem.

• Excellent heuristics exist for structured graphs.


• In case of sparse matrices, block decompositions are more complex.

• Consider the problem of multiplying a sparse matrix with a vector.

• The graph of the matrix is a useful indicator of the work (number of


nodes) and communication (the degree of each node).

• In this case, we would like to partition the graph so as to assign an equal number of nodes to each process, while minimizing the number of edges cut by the partition.
Task Graph Partitioning
Sparse-matrix vector multiplication
Dynamic Mapping Techniques
• Dynamic mapping is sometimes also referred to as dynamic load
balancing, since load balancing is the primary motivation for dynamic
mapping.

• Dynamic mapping is necessary in situations where a static mapping may


result in a highly imbalanced distribution of work among processes

• Dynamic mapping schemes can be


1. centralized
2. distributed
Centralized Schemes
• All executable tasks are maintained in a common central data structure or they are
maintained by a special process or a subset of processes.
• Processes are designated as masters or slaves.
• When a process runs out of work, it requests the master for more work.
• When the number of processes increases, the master may become the bottleneck.
• To alleviate this, a process may pick up a number of tasks (a chunk) at one time.
This is called Chunk scheduling.
• Selecting large chunk sizes may lead to significant load imbalances as well. A
number of schemes have been used to gradually decrease chunk size as the
computation progresses.
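A rough shared-counter sketch of chunk scheduling (OpenMP atomics are used here only as one possible realization; the slides do not prescribe an API or a chunk-shrinking policy):

#include <stddef.h>

/* The shared counter `next` plays the role of the master's central work pool.
   Each worker repeatedly claims the next chunk of `chunk` task indices until the
   pool is exhausted, then runs do_task() on each index it claimed. */
void chunk_scheduled(size_t n_tasks, size_t chunk, void (*do_task)(size_t))
{
    size_t next = 0;                             /* central pool of unassigned tasks */
    #pragma omp parallel
    for (;;) {
        size_t start;
        #pragma omp atomic capture
        { start = next; next += chunk; }         /* one request for a chunk of work */
        if (start >= n_tasks)
            break;                               /* pool empty: this worker stops */
        size_t end = (start + chunk < n_tasks) ? start + chunk : n_tasks;
        for (size_t i = start; i < end; i++)
            do_task(i);
    }
}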
Distributed Schemes
• Each process can send or receive work from other processes.

• This alleviates the bottleneck in centralized schemes.

• There are four critical questions:


• How are sending and receiving processes paired together,
• Who initiates work transfer,
• How much work is transferred, and
• When is a transfer triggered?
Methods for Containing Interaction Overheads
• Maximize data locality: Where possible, reuse intermediate data. Restructure
computation so that data can be reused in smaller time windows.
• Minimize volume of data exchange: There is a cost associated with each word
that is communicated. For this reason, we must minimize the volume of data
communicated.
• Minimize frequency of interactions: There is a startup cost associated with
each interaction. Therefore, try to merge multiple interactions to one, where
possible.
• Minimize contention and hot-spots: Use decentralized techniques, replicate
data where necessary.
• Overlapping computations with interactions: Use non-blocking communications, multithreading, and prefetching to hide latencies.
• Replicating data or computations.
• Using group communications instead of point-to-point primitives.
• Overlapping interactions with other interactions.

Parallel Algorithm Models
• An algorithm model is a way of structuring a parallel algorithm by selecting a
decomposition and mapping technique and applying the appropriate strategy to
minimize interactions.
• Data Parallel Model: Tasks are statically (or semi-statically) mapped to processes and
each task performs similar operations on different data.
• Usually based on data decomposition followed by static mapping
• Uniform partitioning of data followed by static mapping guarantees load balance
• Example algorithm: dense matrix multiplication
• Task Graph Model: Starting from a task dependency graph, the interrelationships
among the tasks are utilized to promote locality or to reduce interaction costs.
• Typically used to solve problems where amount of data associated with a task is
large relative to computation
• Static mapping usually used to optimize data movement costs
• Example algorithm: parallel quicksort, sparse matrix factorization
• Master-Slave Model: One or more processes generate work and allocate it
to worker processes. This allocation may be static or dynamic.
• Pipeline / Producer-Consumer Model: A stream of data is passed through a succession of processes, each of which performs some task on it.
• Hybrid Models: A hybrid model may be composed either of multiple models
applied hierarchically or multiple models applied sequentially to different
phases of a parallel algorithm.
