Chapter 7 - Parallel Programming Issues
7.1 Parallel Model and Domain Decomposition
Some Serial Algorithms - Working Examples
• Dense Matrix-Matrix & Matrix-Vector Multiplication
• Sparse Matrix-Vector Multiplication
• Floyd’s All-Pairs Shortest Path
• Minimum/Maximum Finding
• Heuristic Search: 15-puzzle problem
Dense Matrix-Vector Multiplication
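The figure for this slide is not reproduced here. As a reminder of the serial baseline, a minimal C sketch of dense matrix-vector multiplication (y = A*x, row-major storage; the function name and signature are illustrative):

/* Minimal serial sketch: dense matrix-vector multiplication y = A*x
 * for an n x n matrix A stored row-major. */
void matvec(int n, const double A[], const double x[], double y[]) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* row i of A dotted with x */
        y[i] = sum;
    }
}

Each y[i] can be computed independently, which is what the decomposition examples later in this chapter exploit.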
Dense Matrix-Matrix Multiplication
Sparse Matrix-Vector Multiplication
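The slide’s figure is not reproduced here. A minimal C sketch, assuming the common CSR (compressed sparse row) storage format (the format itself is an assumption, not stated on the slide):

/* Sparse matrix-vector multiplication y = A*x with A in CSR form:
 * row_ptr[i]..row_ptr[i+1]-1 index the nonzeros of row i. */
void spmv_csr(int n, const int row_ptr[], const int col_idx[],
              const double val[], const double x[], double y[]) {
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += val[k] * x[col_idx[k]];   /* only the nonzeros of row i */
        y[i] = sum;
    }
}

Unlike the dense case, the work per row varies with the number of nonzeros, which is why load balance becomes an issue later.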
Floyd’s All-Pairs Shortest Path
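The slide’s figure is not reproduced here. A minimal serial C sketch of Floyd’s algorithm (array size and names are illustrative):

/* Floyd's all-pairs shortest paths on an n x n distance matrix d,
 * where d[i][j] holds the edge weight (a large value if no edge). */
#define N 128   /* illustrative number of vertices */
void floyd(double d[N][N]) {
    for (int k = 0; k < N; k++)          /* allow vertex k as intermediate */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                if (d[i][k] + d[k][j] < d[i][j])
                    d[i][j] = d[i][k] + d[k][j];   /* relax path i -> k -> j */
}

The iterations of the k loop must run in order, while the (i, j) updates within one k step are independent, which is what parallel formulations exploit.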
Minimum Finding
15-Puzzle Problem
Parallel Algorithm vs Parallel Formulation
• Parallel Formulation: refers to a parallelization of a serial algorithm.
• Parallel Algorithm: may represent an entirely different algorithm than the one used serially.
Elements of a Parallel Algorithm/Formulation
• Pieces of work that can be done concurrently: tasks
• Mapping of the tasks onto multiple processors: processes vs processors
• Distribution of input/output and intermediate data across the different processors
• Managing access to shared data: either input or intermediate
• Synchronization of the processors at various points of the parallel execution
Note: maximize concurrency and reduce overheads due to parallelization! Maximize the potential speedup!
Finding Concurrent Pieces of Work
Decomposition: the process of dividing the computation into smaller pieces of work, i.e., tasks.
Tasks are programmer-defined and are considered to be indivisible.
Example: Dense Matrix-Vector Multiplication
Example: Query Processing
Query:
Example: Query Processing
Finding concurrent tasks?
Task-Dependency Graph
• In most cases, there are dependencies between the different tasks: certain task(s) can only start once some other task(s) have finished, e.g., producer-consumer relationships.
• These dependencies are represented using a DAG called the task-dependency graph.
Task-Dependency Graph (cont)
Task-Dependency/Interaction Graphs
These graphs are important in developing an effective mapping of the tasks onto the different processors.
Goal: maximize concurrency and minimize overheads.
Task decomposition methods
• Data Decomposition
• Recursive Decomposition
• Exploratory Decomposition
• Speculative Decomposition
• Hybrid Decomposition
Recursive Decomposition
Suitable for problems that can be solved using the divide-and-conquer paradigm.
Each of the subproblems generated by the divide step becomes a task.
Example: Finding the Minimum
Note that we can obtain divide-and-conquer algorithms for problems that are traditionally solved using non-divide-and-conquer approaches.
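A minimal recursive C sketch (illustrative, not from the slide) of this divide-and-conquer min-finding decomposition; each recursive call on one half of the array is an independent task:

/* Divide-and-conquer minimum of a[lo..hi]; the two recursive calls
 * are independent, so each becomes a task in the decomposition. */
double rec_min(const double a[], int lo, int hi) {
    if (lo == hi)
        return a[lo];                        /* base case: one element */
    int mid = lo + (hi - lo) / 2;
    double left  = rec_min(a, lo, mid);      /* subproblem 1 (a task) */
    double right = rec_min(a, mid + 1, hi);  /* subproblem 2 (a task) */
    return left < right ? left : right;      /* combine step */
}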
Recursive Decomposition
How good are the decompositions that it produces?
• Average concurrency?
• Critical path?
How do the quicksort and min-finding decompositions measure up?
Data Decomposition
• Used to derive concurrency for problems that operate on large amounts of data.
• The idea is to derive the tasks by focusing on the multiplicity of data.
• Data decomposition is often performed in two steps:
Step 1: Partition the data.
Step 2: Induce a computational partitioning from the data partitioning.
• Which data should we partition? Input, output, or intermediate? Well, all of the above, leading to different data decomposition methods.
• How do we induce a computational partitioning? The owner-computes rule.
Example: Matrix-Matrix Multiplication
Partitioning the output data
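A hedged C sketch of this idea (block layout and names are assumptions, not from the slide): the output matrix C of C = A*B is partitioned into 2D blocks, and by the owner-computes rule each task performs all computation that writes its own block.

#define N  8          /* matrix dimension (illustrative) */
#define NB 2          /* blocks per dimension -> NB*NB tasks */
#define BS (N / NB)   /* block size */

/* Task (bi, bj): compute the output block C[bi*BS..][bj*BS..]. */
void compute_block(double A[N][N], double B[N][N], double C[N][N],
                   int bi, int bj) {
    for (int i = bi * BS; i < (bi + 1) * BS; i++)
        for (int j = bj * BS; j < (bj + 1) * BS; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)
                sum += A[i][k] * B[k][j];   /* needs a row of A and a column of B */
            C[i][j] = sum;                  /* only the owning task writes C[i][j] */
        }
}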
Example: Matrix-Matrix Multiplication
Partitioning the intermediate data
Data Decomposition
• It is the most widely used decomposition technique: after all, parallel processing is often applied to problems that have a lot of data, and splitting the work based on this data is the natural way to extract a high degree of concurrency.
• It is used by itself or in conjunction with other decomposition methods (hybrid decomposition).
Exploratory Decomposition
Used to decompose computations that correspond to a search of a space of solutions.
Example: 15-puzzle Problem
Exploratory Decomposition
Speculative Decomposition
• Used to extract concurrency in problems in which the next step is one of many possible actions that can only be determined when the current task finishes.
• This decomposition assumes a certain outcome of the currently executing task and executes some of the next steps.
• Just like speculative execution at the microprocessor level.
Example: Discrete Event Simulation
Speculative Execution
Mapping the Tasks
Why do we care about task mapping? Can I just randomly assign them to the available processors?
Proper mapping is critical, as it needs to minimize the parallel processing overheads.
If Tp is the parallel runtime on p processors and Ts is the serial runtime, then the total overhead is To = p*Tp - Ts: the work done by the parallel system beyond that required by the serial system.
Overhead sources (they can be at odds with each other; remember the holy grail!):
• Load imbalance
• Inter-process communication
• Coordination/synchronization/data-sharing
Why Mapping Can Be Complicated
Proper mapping needs to take into account the task-dependency and task-interaction graphs.
Task-dependency graph:
• Are the tasks available a priori? Static vs dynamic task generation.
• How about their computational requirements? Are they uniform or non-uniform? Do we know them a priori?
• How much data is associated with each task?
Task-interaction graph:
• How about the interaction patterns between the tasks? Are they static or dynamic? Do we know them a priori?
• Are they data-instance dependent? Are they regular or irregular? Are they read-only or read-write?
Depending on the above characteristics, different mapping techniques of different complexity and cost are required.
Example: Simple & Complex Task Interaction
Mapping Techniques for Load Balancing
Be aware: the assignment of tasks whose aggregate computational requirements are the same does not automatically ensure load balance.
Each processor is assigned three tasks, but (a) is better than (b)!
Load Balancing Techniques
Static
• The tasks are distributed among the processors prior to the execution.
• Applicable for tasks that are generated statically and have known and/or uniform computational requirements.
Dynamic
• The tasks are distributed among the processors during the execution of the algorithm, i.e., tasks and data are migrated.
• Applicable for tasks that are generated dynamically or have unknown computational requirements.
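As an illustration of dynamic load balancing on a shared-memory system (a minimal sketch, not from the slides; OpenMP's schedule(dynamic) hands out chunks of iterations to whichever thread becomes idle; compile with -fopenmp):

#include <math.h>

/* Tasks with non-uniform, data-dependent cost: a static division of the
 * iteration space could leave some threads idle, so chunks of iterations
 * are handed out dynamically at run time. */
double process_all(int n, const double work[]) {
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 4) reduction(+ : total)
    for (int i = 0; i < n; i++) {
        int iters = 1 + (int)(work[i] * 1000.0);  /* cost varies per task */
        double t = work[i];
        for (int k = 0; k < iters; k++)
            t = sqrt(t + 1.0);                    /* simulated work */
        total += t;
    }
    return total;
}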
Static Mapping—Array Distribution
Examples: Block Distributions
Random Block Distributions
Better load balance can be achieved via a random block distribution.
Dynamic Load Balancing Schemes
There is a huge body of research.
Centralized schemes
• A certain processor is responsible for giving out work (master-slave paradigm).
• Issue: task granularity.
Distributed schemes
• Work can be transferred between any pair of processors.
• Issues: How do the processors get paired? Who initiates the work transfer (push vs pull)? How much work is transferred?
Mapping to Minimize Interaction Overheads
• Maximize data locality
• Minimize volume of data exchange
• Minimize frequency of interactions
• Minimize contention and hot spots
• Overlap computation with interactions
• Selective data and computation replication
7.2 Dependency in parallel computing
• Definition
• Types of dependency
• Solution
• Example
What is data dependency?
• A data dependency is a situation in which a program statement (instruction) refers to the data of a preceding statement.
• A dependency exists between statements when the order of statement execution affects the results of the program.
• A data dependency results from multiple uses of the same location(s) in storage by different tasks.
• In parallel computing, a data dependency is a situation in which a calculation on one thread/core/CPU/node uses data calculated by another thread/core/CPU/node and/or uses data stored in memory managed by another thread/core/CPU/node.
Types of data dependency
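The slide’s own figure is not reproduced here. As a hedged reminder of the standard classification (flow/true, anti, and output dependency), a small C illustration:

/* Data dependency types between two statements S1 and S2. */
void dependency_types(void) {
    int a = 1, b = 2, c, d;

    /* Flow (true) dependency, read-after-write: S2 reads what S1 wrote. */
    c = a + b;      /* S1 writes c */
    d = c + 1;      /* S2 reads c  */

    /* Anti dependency, write-after-read: S2 overwrites what S1 read. */
    d = a + 1;      /* S1 reads a  */
    a = b + 2;      /* S2 writes a */

    /* Output dependency, write-after-write: both statements write b. */
    b = c + 1;      /* S1 writes b */
    b = d + 2;      /* S2 writes b */
}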
Control dependency
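The slide body is not reproduced here. A minimal illustrative sketch: a control dependency exists when whether a statement executes depends on the outcome of an earlier branch.

/* S2 executes only if the condition in S1 holds, so S2 is
 * control-dependent on S1 and cannot be reordered before it. */
double safe_divide(double x, double y) {
    if (y != 0.0)        /* S1: the branch decides what runs next */
        return x / y;    /* S2: control-dependent on S1 */
    return 0.0;          /* taken only when the branch fails */
}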
Loop dependency
Loop-Carried dependency
Loop-Independent dependency
• In loop-independent dependency, loops have dependences within each iteration, but no dependences between iterations (no loop-carried dependence).
• Each iteration may therefore be treated as a block and executed in parallel without additional synchronization effort.
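A minimal C sketch contrasting the two cases (illustrative loops, not from the slides):

/* Loop-CARRIED dependency: iteration i reads a[i-1], which iteration i-1
 * wrote, so the iterations cannot simply be run in parallel. */
void loop_carried(int n, double a[]) {
    for (int i = 1; i < n; i++)
        a[i] = a[i - 1] + 1.0;
}

/* Loop-INDEPENDENT dependency: s is written and then read inside the SAME
 * iteration; different iterations touch disjoint data, so the iterations
 * can run in parallel (e.g., with an OpenMP parallel for). */
void loop_independent(int n, const double b[], const double c[], double a[]) {
    for (int i = 0; i < n; i++) {
        double s = b[i] * c[i];   /* written in iteration i       */
        a[i] = s + 1.0;           /* read in the same iteration i */
    }
}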
Solution
• Parallel computing in a shared-memory system (OpenMP, CUDA): synchronization.
• Parallel computing in a distributed-memory system (MPI, Cloud, Grid): synchronization and communication.
Data dependency: Example
• Heat diffusion equation:
$\frac{\partial C}{\partial t} = D \nabla^2 C, \qquad \nabla^2 C = \frac{\partial^2 C}{\partial x^2} + \frac{\partial^2 C}{\partial y^2}$
• Solving approach (finite differences on a grid with spacing dx; grid point (i,j)):
Initialize inputs: $C^{0}_{i,j}$
At step n+1:
$\nabla^2 C^{t_n}_{i,j} = FD^{t_n}_{i,j} = \frac{C^{t_n}_{i+1,j} + C^{t_n}_{i-1,j} + C^{t_n}_{i,j+1} + C^{t_n}_{i,j-1} - 4\,C^{t_n}_{i,j}}{dx^2}$
$C^{t_{n+1}}_{i,j} = C^{t_n}_{i,j} + dt \cdot D \cdot FD^{t_n}_{i,j}$
• Data dependency?
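A minimal serial C sketch of this update (names, grid size, and boundary handling are illustrative), which makes the neighbor dependency explicit:

#define NX 64   /* illustrative grid size */
#define NY 64

/* One explicit time step: c_new at (i,j) needs c_old at (i,j) and its four
 * neighbors from the previous step -- the data dependency discussed next. */
void heat_step(double c_old[NX][NY], double c_new[NX][NY],
               double D, double dt, double dx) {
    for (int i = 1; i < NX - 1; i++)
        for (int j = 1; j < NY - 1; j++) {
            double fd = (c_old[i + 1][j] + c_old[i - 1][j] +
                         c_old[i][j + 1] + c_old[i][j - 1] -
                         4.0 * c_old[i][j]) / (dx * dx);
            c_new[i][j] = c_old[i][j] + dt * D * fd;
        }
}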
Data dependency
• Calculation at point (i,j) needs data from the neighboring points (i-1,j), (i+1,j), (i,j-1), (i,j+1). This is a data dependency.
• Solution:
Shared-memory system: synchronization.
Distributed-memory system: communication and synchronization (more difficult, needs optimization).
• Exercise:
Write an OpenMP program to implement the heat equation problem.
Write an MPI program to implement the heat equation problem.
7.3 Performance Analysis
Sources of Overhead in Parallel Programs
The total time spent by a parallel system is usually higher than that spent by a serial system to solve the same problem: overheads!
• Interprocessor communication and interactions
• Idling: load imbalance, synchronization, serial components
• Excess computation: sub-optimal serial algorithm, more aggregate computation
The goal is to minimize these overheads!
Performance Metrics
• Parallel execution time Tp: time spent to solve a problem on p processors.
• Total overhead function: To = p*Tp - Ts
• Speedup: S = Ts/Tp. Can we have superlinear speedup? (exploratory computations, hardware features)
• Efficiency: E = S/p
• Cost: p*Tp (the processor-time product); cost-optimal formulation.
Working example: adding n elements on n processors.
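As a worked instance of these metrics (a sketch for the textbook case of adding n numbers on n processors with a tree reduction, asymptotic values only):

$T_s = \Theta(n), \qquad T_p = \Theta(\log n), \qquad S = \frac{T_s}{T_p} = \Theta\!\left(\frac{n}{\log n}\right),$
$E = \frac{S}{p} = \Theta\!\left(\frac{1}{\log n}\right), \qquad p\,T_p = \Theta(n \log n) \neq \Theta(n),$

so this formulation is not cost-optimal.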
Effect of Granularity on Performance
• Scaling down the number of processors
• Achieving cost optimality
• Naïve emulation vs intelligent scaling down
• Example: adding n elements on p processors
Scaling Down by Emulation
Intelligent Scaling Down
Scalability of a Parallel System
• The need to predict the performance of a parallel algorithm as p increases.
• Characteristics of the To function:
Linear in the number of processors (serial components).
Dependence on Ts (usually sub-linear).
• Efficiency drops as we increase the number of processors and keep the problem size fixed.
• Efficiency increases as we increase the problem size and keep the number of processors fixed.
Scalable Formulations
• A parallel formulation is called scalable if we can keep the efficiency constant when increasing p by increasing the size of the problem.
• Scalability and cost-optimality are related.
• Which system is more scalable?
Measuring Scalability
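The slide’s content is not reproduced here. As a hedged sketch of one standard way to measure scalability, using the overhead function To defined earlier (W denotes the problem size, i.e., Ts):

$E = \frac{S}{p} = \frac{T_s}{p\,T_p} = \frac{1}{1 + T_o(W, p)/W}$

Keeping E fixed while p grows therefore requires the problem size to grow as $W = K\,T_o(W, p)$ with $K = E/(1 - E)$; the slower this required growth in W, the more scalable the parallel system.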
Thank you for your attention!