Lecture 4: Principles of Parallel Algorithm Design

1
Constructing a Parallel Algorithm

• Identify portions of work that can be performed concurrently
• Map concurrent portions of work onto multiple processes running in parallel
• Distribute a program’s input, output, and intermediate data
• Manage accesses to shared data: avoid conflicts
• Synchronize the processes at stages of the parallel program execution

2
Task Decomposition and Dependency Graphs

Decomposition: divide a computation into smaller parts that can be executed concurrently.
Task: a programmer-defined unit of computation.

Task-dependency graph:
Node represents a task.
Directed edge represents a control dependence.

3
Example 1: Dense Matrix-Vector Multiplication

• Computing y[i] only uses the ith row of A and the vector b – treat computing y[i] as a task (a sketch follows below).
• Remarks:
– Task sizes are uniform
– No dependence between tasks
– All tasks need b
4
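A minimal C sketch of the fine-grained decomposition above, one task per output element y[i]; the OpenMP pragma, the row-major layout, and the function name are illustrative assumptions, not part of the lecture.

/* One task per y[i]: task i reads row i of A and the shared vector b. */
void matvec_fine(const double *A, const double *b, double *y, int n)
{
    #pragma omp parallel for            /* each iteration is an independent task */
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * b[j]; /* only row i of A, but all of b */
        y[i] = sum;
    }
}

The tasks are uniform in size and independent, matching the remarks above; the only sharing is the read-only access to b.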
Example 2: Database Query Processing
• Executing the query:
Model =“civic” AND Year = “2001” AND (Color = “green” OR
Color = “white”)
on the following database:

5
• Task: create the set of elements that satisfy one criterion (or several criteria).
• Edge: the output of one task serves as input to the next.

6
• An alternate task-dependency graph for the same query

• Different task decompositions lead to different degrees of parallelism
7
Granularity of Task Decomposition
• Fine-grained decomposition: a large number of small tasks
• Coarse-grained decomposition: a small number of large tasks
Matrix-vector multiplication example
-- coarse-grained: each task computes 3 elements of y[] (see the sketch below)

8
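A hypothetical coarse-grained counterpart of the previous sketch: each task now computes a block of three consecutive entries of y, as on the slide, so there are roughly n/3 larger tasks.

#define BLOCK 3   /* entries of y per task, as in the slide's example */

void matvec_coarse(const double *A, const double *b, double *y, int n)
{
    int ntasks = (n + BLOCK - 1) / BLOCK;
    #pragma omp parallel for            /* one iteration = one coarse task */
    for (int t = 0; t < ntasks; t++) {
        int hi = (t + 1) * BLOCK < n ? (t + 1) * BLOCK : n;
        for (int i = t * BLOCK; i < hi; i++) {
            double sum = 0.0;
            for (int j = 0; j < n; j++)
                sum += A[i * n + j] * b[j];
            y[i] = sum;
        }
    }
}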
Degree of Concurrency

• Degree of concurrency: the number of tasks that can execute in parallel
-- maximum degree of concurrency: the largest number of concurrent tasks at any point of the execution
-- average degree of concurrency: the average number of tasks that can be executed concurrently over the whole execution
• Degree of concurrency vs. task granularity
– Inverse relation: coarser tasks generally mean a lower degree of concurrency

9
Critical Path of Task Graph
• Critical path: the longest directed path between any pair of start node (node with no incoming edges) and finish node (node with no outgoing edges).
• Critical path length: the sum of the weights of the nodes along the critical path.
– The weight of a node is the size or amount of work associated with the corresponding task.
• Average degree of concurrency = total amount of
work / critical path length

10
Example: Critical Path Length

Task-dependency graphs of query processing operation

Left graph:
Critical path length = 27
Average degree of concurrency = 63/27 = 2.33
Right graph:
Critical path length = 34
Average degree of concurrency = 64/34 = 1.88
11
Limits on Parallelization
• Facts bounds on parallel execution
– Maximum task granularity is finite
• Matrix-vector multiplication O(n2)
– Interactions between tasks
• Tasks often share input, output, or intermediate data, which may
lead to interactions not shown in task-dependency graph.

Ex. For the matrix-vector multiplication problem, all tasks are independent, and all need access to the entire input vector b.

12
• Speedup = sequential execution time / parallel execution time
• Parallel efficiency = sequential execution time / (parallel execution time × number of processors used)
(a small worked example follows below)

13
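A small worked instance of the two definitions above, using hypothetical timings (the numbers are illustrative, not from the lecture):

% Hypothetical timings: T_serial = 120 s, T_parallel = 40 s on p = 4 processors.
\[
  S = \frac{T_{\text{serial}}}{T_{\text{parallel}}} = \frac{120}{40} = 3,
  \qquad
  E = \frac{T_{\text{serial}}}{p \cdot T_{\text{parallel}}} = \frac{120}{4 \times 40} = 0.75 .
\]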
Task Interaction Graphs
• Tasks generally share input, output, or intermediate data
– Ex. Matrix-vector multiplication: if originally there is only one copy of b, tasks will have to communicate to obtain b.
• Task-interaction graph
– Captures interactions among tasks
– Node = task
– Edge (undirected or directed) = interaction or data exchange
• Task-dependency graph vs. task-interaction graph
– Task-dependency graph represents control dependencies
– Task-interaction graph represents data dependencies
– The edge set of the task-interaction graph is usually a superset of the edge set of the task-dependency graph

14
Example: Task-Interaction Graph
Sparse matrix-vector multiplication (a CSR sketch follows below)
• Tasks: each task computes one entry of y[]
• Assign the ith row of A to Task i. Also assign b[i] to Task i.

15
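A hypothetical C sketch of this decomposition using a CSR (compressed sparse row) layout; the struct fields and names are illustrative assumptions. Task i owns row i of A and b[i], and only reads the b[col] entries corresponding to its nonzeros, which is exactly the edge set of the task-interaction graph.

typedef struct {
    int n;              /* number of rows                                  */
    const int *rowptr;  /* nonzeros of row i are rowptr[i]..rowptr[i+1]-1  */
    const int *col;     /* column index of each nonzero                    */
    const double *val;  /* value of each nonzero                           */
} csr_matrix;

void spmv(const csr_matrix *A, const double *b, double *y)
{
    #pragma omp parallel for
    for (int i = 0; i < A->n; i++) {                 /* task i computes y[i] */
        double sum = 0.0;
        for (int k = A->rowptr[i]; k < A->rowptr[i + 1]; k++)
            sum += A->val[k] * b[A->col[k]];         /* reads b[j] owned by task j */
        y[i] = sum;
    }
}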
Processes and Mapping

• Mapping: the mechanism by which tasks are assigned to processes for execution.
• Process: a logical computing agent that performs tasks; an abstract entity that uses the code and data corresponding to a task to produce the output of that task.
• Why use processes rather than processors?
– We rely on the OS to map processes to physical processors.
– We can aggregate multiple tasks into a single process.

16
Criteria of Mapping
1. Maximize the use of concurrency by mapping independent
tasks onto different processes
2. Minimize the total completion time by making sure that
processes are available to execute the tasks on critical path
as soon as such tasks become executable
3. Minimize interaction among processes by mapping tasks
with a high degree of mutual interaction onto the same
process.

Basis for Choosing a Mapping

Task-dependency graph: ensures maximum concurrency.
Task-interaction graph: minimizes communication.
17
Example: Mapping Database Query to Processes

[Figure: the two task-dependency graphs of the query, with tasks mapped onto processes P0–P3.]

• 4 processes can be used in total since the maximum concurrency is 4.
• Assign all tasks within a level to different processes.

18
Decomposition Techniques

How to decompose a computation into a set of tasks?
• Recursive decomposition
• Data decomposition
• Exploratory decomposition
• Speculative decomposition

19
Recursive Decomposition

• Ideal for problems that can be solved by the divide-and-conquer method.
• Steps
1. Decompose the problem into a set of independent sub-problems
2. Recursively decompose each sub-problem
3. Stop decomposition when the minimum desired granularity is reached or a (partial) result is obtained
20
Quicksort Example
Sort a sequence A of n elements in increasing order.

• Select a pivot
• Partition the sequence around the pivot
• Recursively sort each sub-sequence

Task: the work of partitioning a given sub-sequence (a task-based sketch follows below)

21
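A hypothetical C/OpenMP-task sketch of this recursive decomposition: each partitioning step spawns two tasks for the two sub-sequences. The pivot choice (last element) and the cutoff of 1000 elements are illustrative assumptions.

static void swap_d(double *a, double *b) { double t = *a; *a = *b; *b = t; }

static void qsort_task(double *A, long lo, long hi)
{
    if (lo >= hi) return;
    double pivot = A[hi];                     /* partition around the pivot */
    long p = lo;
    for (long i = lo; i < hi; i++)
        if (A[i] < pivot) swap_d(&A[i], &A[p++]);
    swap_d(&A[p], &A[hi]);

    /* Each recursive call is an independent task in the decomposition;
       small sub-sequences are handled inline to limit task overhead. */
    #pragma omp task shared(A) if (hi - lo > 1000)
    qsort_task(A, lo, p - 1);
    #pragma omp task shared(A) if (hi - lo > 1000)
    qsort_task(A, p + 1, hi);
    #pragma omp taskwait
}

void parallel_quicksort(double *A, long n)
{
    #pragma omp parallel
    #pragma omp single                        /* one thread seeds the task tree */
    qsort_task(A, 0, n - 1);
}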
Recursive Decomposition for Finding Min
Find the minimum in an array of numbers A of length n

procedure Serial_Min(A, n)
begin
  min := A[0];
  for i := 1 to n-1 do
    if (A[i] < min) min := A[i];
  endfor;
  return min;
end Serial_Min

procedure Recursive_MIN(A, n)
begin
  if (n == 1) then
    min := A[0];
  else
    lmin := Recursive_MIN(A, n/2);
    rmin := Recursive_MIN(&(A[n/2]), n - n/2);
    if (lmin < rmin) then
      min := lmin;
    else
      min := rmin;
    endelse;
  endelse;
  return min;
end Recursive_MIN

22
Data Decomposition
• Ideal for problems that operate on large data
structures
• Steps
1. The data on which the computations are
performed are partitioned
2. Data partition is used to induce a partitioning of
the computations into tasks.
• Data Partitioning
– Partition output data
– Partition input data
– Partition input + output data
– Partition intermediate data
23
Data Decomposition Based on Partitioning Output Data

• Applicable if each element of the output can be computed independently of the others as a function of the input.
• Partitioning the computation into tasks is then natural: each task is assigned the work of computing a portion of the output.
• Example. Dense matrix-vector multiplication.

24
Example: Output Data Decomposition
Matrix-matrix multiplication: C = A × B
• Partition matrix C into 2 × 2 submatrices
• The computation of C can then be partitioned into four tasks (a sketch follows below).

Remark: data decomposition is different from task decomposition. The same data decomposition can admit different task decompositions.
25
26
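A hypothetical C sketch of the output-data decomposition above: C is split into 2 × 2 blocks and each of the four tasks computes one block C_{i,j} = Σ_k A_{i,k} B_{k,j}. Row-major n × n matrices with n even are illustrative assumptions.

static void compute_block(const double *A, const double *B, double *C,
                          int n, int bi, int bj)
{
    int h = n / 2;                                   /* block dimension */
    for (int i = bi * h; i < (bi + 1) * h; i++)
        for (int j = bj * h; j < (bj + 1) * h; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++)              /* full row of A, full column of B */
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}

void matmul_four_tasks(const double *A, const double *B, double *C, int n)
{
    #pragma omp parallel for collapse(2)             /* four independent tasks */
    for (int bi = 0; bi < 2; bi++)
        for (int bj = 0; bj < 2; bj++)
            compute_block(A, B, C, n, bi, bj);
}

The same data decomposition of C could also be turned into a different task decomposition, e.g. tasks that each compute only part of a block's sum, which is the point of the remark above.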
Data Decomposition Based on Partitioning Input Data

• Ideal if the output is a single unknown value, or if the individual elements of the output cannot be efficiently determined in isolation.
– Example. Finding the minimum, maximum, or sum of a set of numbers.
– Example. Sorting a set.
• Partition the input data and associate a task with each partition of the input data.

27
Data Decomposition Based on Partitioning Intermediate Data

• Applicable to problems that can be solved by multi-stage computations in which the output of one stage is the input to the subsequent stage.
• The partitioning can be based on the input or output of an intermediate stage.

28
Example: Intermediate Data Decomposition

Dense matrix-matrix multiplication
• The original output-data decomposition yields a maximum degree of concurrency of 4.

29
Stage 1: D_{k,i,j} = A_{i,k} × B_{k,j} (eight independent block products, k, i, j ∈ {1, 2})

[Figure: the 2 × 2 blocks of A and B combine into the eight intermediate blocks D_{1,1,1}, ..., D_{2,2,2}.]

Stage 2: C_{i,j} = D_{1,i,j} + D_{2,i,j} (four independent block sums)

[Figure: corresponding blocks of D are added pairwise to give the four blocks of C.]
30
Let D_{k,i,j} = A_{i,k} · B_{k,j}

Task-dependency graph of the two-stage decomposition (a code sketch follows below)

31
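A hypothetical C sketch of the two-stage intermediate-data decomposition above: stage 1 has eight independent block-product tasks D_{k,i,j} = A_{i,k} B_{k,j}, and stage 2 has four block-sum tasks C_{i,j} = D_{1,i,j} + D_{2,i,j}. The scratch array D of size 2·n·n and the 0-based block indices are illustrative assumptions.

void matmul_two_stage(const double *A, const double *B, double *C,
                      double *D /* scratch, 2*n*n */, int n)
{
    int h = n / 2;                                    /* block dimension, n even */

    /* Stage 1: eight independent block-product tasks (k, bi, bj in {0,1}). */
    #pragma omp parallel for collapse(3)
    for (int k = 0; k < 2; k++)
        for (int bi = 0; bi < 2; bi++)
            for (int bj = 0; bj < 2; bj++)
                for (int i = bi * h; i < (bi + 1) * h; i++)
                    for (int j = bj * h; j < (bj + 1) * h; j++) {
                        double sum = 0.0;
                        for (int t = k * h; t < (k + 1) * h; t++)
                            sum += A[i * n + t] * B[t * n + j];
                        D[k * n * n + i * n + j] = sum;
                    }

    /* Stage 2: four independent block-sum tasks, one per block of C. */
    #pragma omp parallel for collapse(2)
    for (int bi = 0; bi < 2; bi++)
        for (int bj = 0; bj < 2; bj++)
            for (int i = bi * h; i < (bi + 1) * h; i++)
                for (int j = bj * h; j < (bj + 1) * h; j++)
                    C[i * n + j] = D[i * n + j] + D[n * n + i * n + j];
}

Stage 1 raises the maximum degree of concurrency from 4 to 8, at the cost of the extra intermediate array D.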
Owner-Computes Rule

• Decomposition based on partitioning input/output data is referred to as the owner-computes rule.
– Each partition performs all the computations involving data that it owns.
• Input data decomposition
– A task performs all the computations that can be done using its input data.
• Output data decomposition
– A task computes all the results in the output partition assigned to it.

32
Characteristics of Tasks
Key characteristics of tasks influencing choice of mapping and
performance of parallel algorithm:
1. Task generation
• Static or dynamic generation
– Static: all tasks are known before the algorithm starts execution. Data or
recursive decomposition often leads to static task generation.
Ex. Matrix-multiplication. Recursive decomposition in finding min. of a set of
numbers.
– Dynamic: the actual tasks and the task-dependency graph are not explicitly
available a priori. Recursive, exploratory decomposition can generate tasks
dynamically.
Ex. Recursive decomposition in Quicksort, in which tasks are generated
dynamically.
2. Task sizes
• Amount of time required to complete a task: uniform or non-uniform
3. Knowledge of task sizes
4. Size of data associated with tasks
• Data associated with the task must be available to the process
performing the task. The size and location of data may determine the
data-movement overheads.
33
Characteristics of Task Interactions

1) Static versus dynamic
– Static: interactions are known prior to execution.
2) Regular versus irregular
– Regular: the interaction pattern can be exploited for efficient implementation.
3) Read-only versus read-write
4) One-way versus two-way

34
Static vs. Dynamic Interactions

• Static interaction
– Tasks and their interactions are predetermined: the task-interaction graph and the times at which interactions occur are known (e.g., matrix multiplication)
– Easy to program
• Dynamic interaction
– The timing of interactions, or the set of tasks to interact with, cannot be determined prior to execution.
– Difficult to program using message passing; shared-address-space programming may be simpler
35
Regular vs. Irregular Interactions

• Regular interactions
– The interaction has a spatial structure that can be exploited for efficient implementation: ring, mesh
Example: explicit finite differences for solving PDEs.
• Irregular interactions
– The interactions have no well-defined structure
Example: sparse matrix-vector multiplication

36
37
Mapping Technique for Load Balancing
Minimize execution time → Reduce overheads of execution
• Sources of overheads:
– Inter-process interaction
– Idling
– Both interaction and idling are often a function of mapping
• Goals to achieve:
– Reduce interaction time
– Reduce the total amount of time processes spend idle (the goal of load balancing)
– Remark: these two goals often conflict
• Classes of mapping:
– Static
– Dynamic
38
Remark:
1. Load balancing is a necessary but not sufficient condition for reducing idling.
• The task-dependency graph determines which tasks can execute in parallel and which must wait for others to finish at a given stage.
2. A good mapping must ensure that computations and interactions among processes at each stage of execution are well balanced.

Two mappings of a 12-task decomposition in which the last 4 tasks can start only after the first 8 have finished, due to task dependencies.
39
Schemes for Static Mapping

Static mapping: distributes the tasks among processes prior to the execution of the algorithm.

• Mapping based on data partitioning
• Task-graph partitioning
• Hybrid strategies

40
Mapping Based on Data Partitioning

• By the owner-computes rule, mapping the relevant data onto processes is equivalent to mapping tasks onto processes
• Arrays or matrices
– Block distributions
– Cyclic and block-cyclic distributions
• Irregular data
– Example: data associated with an unstructured mesh
– Graph partitioning

41
1D Block Distribution
Example. Distribute rows or columns of the matrix to different processes (an index-range sketch follows below)

42
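A small C sketch of a 1D (row) block distribution: process rank out of p owns a contiguous range of rows [lo, hi), with the first n % p processes getting one extra row when p does not divide n. The function name is an illustrative assumption.

void block_range_1d(int n, int p, int rank, int *lo, int *hi)
{
    int base = n / p;                       /* minimum rows per process          */
    int rem  = n % p;                       /* first `rem` processes get +1 row  */
    *lo = rank * base + (rank < rem ? rank : rem);
    *hi = *lo + base + (rank < rem ? 1 : 0);
}

For example, with n = 10 rows and p = 4 processes this yields the row ranges {0–2}, {3–5}, {6–7}, {8–9}.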
Multi-D Block Distribution
Example. Distribute 2D blocks of the matrix to different processes

43
Load-Balance for Block Distribution

Example. n × n dense matrix multiplication C = A × B using p processes
– Decomposition based on output data.
– Each entry of C takes the same amount of computation.
– Either a 1D or a 2D block distribution can be used:
• 1D distribution: n/p rows are assigned to each process
• 2D distribution: a block of size (n/√p) × (n/√p) is assigned to each process
– A multi-dimensional distribution allows a higher degree of concurrency.
– A multi-dimensional distribution can also help to reduce interactions.
44
Suppose the matrix is n × n and p processes are used.
(a) 1D distribution: a process needs to access n²/p + n² data elements.
(b) 2D distribution: a process needs to access O(n²/√p) data elements.
45
Cyclic and Block Cyclic Distributions

• If the amount of work differs for different entries of a matrix, a block distribution can lead to load imbalance.
• Example. Doolittle’s method of LU factorization of a dense matrix
– The amount of computation increases from the top left to the bottom right of the matrix.

46
Doolittle’s method of LU factorization

A = LU, where L is unit lower triangular (ones on the diagonal) and U is upper triangular.

By equating entries of the matrix-matrix product LU with A:

u_{1j} = a_{1j}, j = 1, 2, ..., n            (1st row of U)
l_{j1} = a_{j1} / u_{11}, j = 1, 2, ..., n   (1st column of L)
For i = 2, 3, ..., n-1 do
  u_{ii} = a_{ii} - Σ_{t=1}^{i-1} l_{it} u_{ti}
  u_{ij} = a_{ij} - Σ_{t=1}^{i-1} l_{it} u_{tj},             j = i+1, ..., n   (ith row of U)
  l_{ji} = (a_{ji} - Σ_{t=1}^{i-1} l_{jt} u_{ti}) / u_{ii},  j = i+1, ..., n   (ith column of L)
End
u_{nn} = a_{nn} - Σ_{t=1}^{n-1} l_{nt} u_{tn}

(a serial in-place C sketch follows below)
47
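A hypothetical serial C sketch of Doolittle's factorization above, storing L and U in place of A (the unit diagonal of L is not stored), which matches the "L and U share space with A" remark on the next slide. Row-major storage and the absence of pivoting are illustrative assumptions.

void doolittle_lu(double *A, int n)
{
    for (int i = 0; i < n; i++) {
        /* ith row of U: u[i][j] = a[i][j] - sum_{t<i} l[i][t]*u[t][j] */
        for (int j = i; j < n; j++) {
            double sum = 0.0;
            for (int t = 0; t < i; t++)
                sum += A[i * n + t] * A[t * n + j];
            A[i * n + j] -= sum;
        }
        /* ith column of L: l[j][i] = (a[j][i] - sum_{t<i} l[j][t]*u[t][i]) / u[i][i] */
        for (int j = i + 1; j < n; j++) {
            double sum = 0.0;
            for (int t = 0; t < i; t++)
                sum += A[j * n + t] * A[t * n + i];
            A[j * n + i] = (A[j * n + i] - sum) / A[i * n + i];
        }
    }
}

The inner sums grow longer as i increases, so entries toward the bottom right of the matrix cost more work; this is the source of the load imbalance discussed on the following slides.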
Serial Column-Based LU

• Remark: Matrices L and U share space with A


48
Work used to compute Entries of L and U

49
• Block distribution of LU factorization tasks
leads to load imbalance.

50
Block-Cyclic Distribution

• A variation of the block distribution that can be used to alleviate load imbalance.

• Steps
1. Partition the array into many more blocks than the number of available processes
2. Assign blocks to processes in a round-robin manner so that each process gets several non-adjacent blocks (an owner-computation sketch follows below)

51
(a) The rows of the array are grouped into blocks of two rows each, resulting in eight row blocks. These blocks are distributed to four processes in a wrap-around fashion.
(b) The matrix is blocked into 16 blocks, each of size 4 × 4, and mapped onto a 2 × 2 grid of processes in a wrap-around fashion.
• Cyclic distribution: the special case where the block size = 1
52
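A minimal C sketch of the block-cyclic owner computation: row blocks of size bs are dealt out to p processes round-robin, so each process owns several non-adjacent blocks; bs = 1 gives the cyclic distribution. The function name is an illustrative assumption.

int block_cyclic_owner(int row, int bs, int p)
{
    int block = row / bs;       /* which block the row falls in    */
    return block % p;           /* blocks are assigned round-robin */
}

With 16 rows, bs = 2, and p = 4 this reproduces case (a) above: process 0 owns rows {0, 1, 8, 9}, process 1 owns {2, 3, 10, 11}, and so on.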
Randomized Block Distribution

53
Graph Partitioning
Sparse matrix-vector multiplication

Work: nodes
Interaction/communication: edges

Partition the graph:
Assign roughly the same number of nodes to each process
Minimize the edge count of the graph partition
54
Finite element simulation of water contaminant in a lake.
• Goal of partitioning: balance work & minimize communication

[Figure: random partitioning vs. partitioning for minimizing edge count.]

• Assign an equal number of nodes (or cells) to each process
– Random partitioning may lead to high interaction overhead due to data sharing
• Minimize the edge count of the graph partition
– Each process should get roughly the same number of elements, and the number of edges that cross partition boundaries should be minimized as well.
55
Mappings Based on Task Partitioning

• Mapping based on task partitioning can be used when the computation is naturally expressed as a static task-dependency graph with known task sizes.
• Finding an optimal mapping that minimizes idle time and interaction time is NP-complete.
• Heuristic solutions exist for many structured graphs.

56
Mapping a Sparse Graph
Example. Sparse matrix-vector multiplication using 3 processes
• Arrow distribution

57
• Partitioning task-interaction graph to reduce
interaction overhead

58
Techniques to Minimize Interaction Overheads

• Maximize data locality
– Maximize the reuse of recently accessed data
– Minimize the volume of data exchange
• Use a higher-dimensional distribution. Example: 2D block distribution for matrix multiplication
– Minimize the frequency of interactions
• Restructure the algorithm so that shared data are accessed and used in large chunks.
• Combine messages between the same source-destination pair

59
• Minimize contention and hot spots
– Contention occurs when multiple tasks try to access the same resource concurrently: multiple processes sending messages to the same process; multiple simultaneous accesses to the same memory block

• Using C_{i,j} = Σ_{k=0}^{√p−1} A_{i,k} B_{k,j} causes contention. For example, C_{0,0}, C_{0,1}, ..., C_{0,√p−1} all attempt to read A_{0,0} at the same time.
• A contention-free alternative is to use:
  C_{i,j} = Σ_{k=0}^{√p−1} A_{i,(i+j+k) % √p} B_{(i+j+k) % √p, j}
  All tasks P_{*,j} that work on the same row of C access block A_{i,(i+j+k) % √p}, which is different for each task (a sketch follows below).
60
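A hypothetical C sketch of the contention-free ordering above for a √p × √p grid of block tasks: task (i, j) owns block C_{i,j} and walks the summation index starting at offset (i + j), so the tasks avoid reading the same block of A in the same step. The block-array layout and names are illustrative assumptions.

/* Ablk, Bblk: q*q arrays of pointers to nb x nb blocks; Cblk: this task's block. */
void compute_block_staggered(const double **Ablk, const double **Bblk, double *Cblk,
                             int i, int j, int q /* q = sqrt(p) */, int nb)
{
    for (int step = 0; step < q; step++) {
        int k = (i + j + step) % q;                   /* staggered block index */
        const double *Ab = Ablk[i * q + k];           /* block A_{i,k} */
        const double *Bb = Bblk[k * q + j];           /* block B_{k,j} */
        for (int r = 0; r < nb; r++)                  /* Cblk += Ab * Bb */
            for (int c = 0; c < nb; c++) {
                double sum = 0.0;
                for (int t = 0; t < nb; t++)
                    sum += Ab[r * nb + t] * Bb[t * nb + c];
                Cblk[r * nb + c] += sum;
            }
    }
}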
• Overlap computations with interactions
– Use non-blocking communication
• Replicate data or computations
– Some parallel algorithms only need read-only access to a shared data structure. If local memory is available, replicate a copy of the shared data on each process if possible, so that the only interaction is the initial one during replication.
• Use collective interaction operations
• Overlap interactions with other interactions

61
Parallel Algorithm Models
• Data parallel
– Each task performs similar operations on different data
– Typically statically map tasks to processes
• Task graph
– Use task dependency graph to promote locality or reduce
interactions
• Master-slave
– One or more master processes generate tasks
– Tasks are allocated to slave processes
– Allocation may be static or dynamic
• Pipeline/producer-consumer
– Pass a stream of data through a sequence of processes
– Each performs some operation on it
• Hybrid
– Apply multiple models hierarchically, or apply multiple models
in sequence to different phases
62
• Reference
– A. Grama, et al. Introduction to Parallel
Computing. Chapter 3.

63
