Chapter 9 - Parallel Computation Problems

Parallel Computation Problems
References
• Michael J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill.
• Albert Y. Zomaya. Parallel and Distributed Computing Handbook. McGraw-Hill.
• Ian Foster. Designing and Building Parallel Programs. Addison-Wesley.
• Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley.
• Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley.
• Nguyễn Đức Nghĩa. Tính toán song song (Parallel Computing). Hanoi, 2003.
9.1 Numerical approach for dense matrices
Review

Matrix-Vector Multiplication

Compute y = Ax, where
x and y are n×1 vectors, and
A is an n×n dense matrix.
Serial complexity: W = O(n²).
We will consider 1D and 2D partitionings.
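As a serial baseline for the parallel formulations that follow, the computation y = Ax can be sketched as below (the helper name `matvec` and the row-major layout are choices made here, not from the slides):

```c
#include <assert.h>

/* Serial baseline for y = Ax: n^2 multiply-add operations, so W = O(n^2). */
void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];   /* dot product of row i with x */
    }
}
```

Both the 1D and 2D partitionings below distribute exactly this loop nest; the serial version is the reference their speedup is measured against.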
Row-wise 1D Partitioning

How do we perform the operation?

Row-wise 1D Partitioning

Each processor needs to have the entire x vector.
Steps: an all-to-all broadcast of x, followed by local computations.

Analysis?
Block 2D Partitioning

How do we perform the operation?

Block 2D Partitioning
Each processor needs to have the portion of the x vector that corresponds to the set of columns it stores.

Analysis?
1D vs 2D Formulation

Which one is better?

Matrix-Matrix Multiplication

Compute C = AB, where
A, B, and C are n×n dense matrices.
Serial complexity: W = O(n³).
We will consider 2D and 3D partitionings.
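As with matrix-vector multiplication, a serial baseline makes the cost explicit (the helper name `matmul` and the row-major layout are illustrative choices):

```c
#include <assert.h>

/* Serial baseline for C = AB: n^3 multiply-adds, so W = O(n^3). */
void matmul(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;   /* C[i][j] = row i of A . column j of B */
        }
}
```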
Simple 2D Algorithm

Processors are arranged in a logical √p × √p 2D topology.
Each processor gets an (n/√p) × (n/√p) block of A, B, and C.
Each processor is responsible for computing the entries of C that it has been assigned.
Analysis? What about the memory complexity?
Cannon’s Algorithm

Memory-efficient variant of the simple algorithm.
Key idea: replace the traditional accumulation loop with a skewed (shifted) loop, so that during each step processors operate on different blocks of A and B.
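The skewed loop can be illustrated with a serial simulation of Cannon's algorithm (the names `GRID`/`BLK` and the helper itself are illustrative; block shifts are simulated by index arithmetic instead of actual messages):

```c
#include <assert.h>

#define GRID 2            /* sqrt(p): dimension of the processor grid */
#define BLK  2            /* block size held by each processor */
#define N (GRID * BLK)

/* Serial simulation of Cannon's algorithm on a GRID x GRID grid of
   BLK x BLK blocks. */
void cannon(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;
    /* After the initial skew, processor (pi,pj) holds A-block (pi, pi+pj)
       and B-block (pi+pj, pj); each of the GRID steps then shifts every
       A-block left and every B-block up by one position. */
    for (int step = 0; step < GRID; step++)
        for (int pi = 0; pi < GRID; pi++)
            for (int pj = 0; pj < GRID; pj++) {
                int k = (pi + pj + step) % GRID;  /* block index this step */
                for (int i = 0; i < BLK; i++)
                    for (int j = 0; j < BLK; j++)
                        for (int t = 0; t < BLK; t++)
                            C[pi*BLK + i][pj*BLK + j] +=
                                A[pi*BLK + i][k*BLK + t] *
                                B[k*BLK + t][pj*BLK + j];
            }
}
```

Because each processor holds only one block of A and one of B at any step, memory per processor stays at O(n²/p), which is what makes the variant memory efficient.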
Can we do better?

Can we use more than O(n²) processors?
So far each task corresponded to the dot-product of two vectors, i.e., C_{i,j} = A_{i,*} · B_{*,j}.
How about performing this dot-product in parallel?
What is the maximum concurrency that we can extract?
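Performing the dot-product itself in parallel amounts to a binary-tree reduction: n products computed concurrently, then combined in O(log n) steps. A serial sketch (the helper `tree_dot` and its 64-element cap are assumptions of this sketch, not from the slides):

```c
#include <assert.h>

/* Dot product in the parallel style: conceptually, n "processors" each
   form one product a[i]*b[i], then a binary-tree reduction combines the
   partial results in O(log n) steps (simulated serially here). */
double tree_dot(const double *a, const double *b, int n) {
    assert(n >= 1 && n <= 64);        /* fixed cap for this sketch */
    double partial[64];
    for (int i = 0; i < n; i++)
        partial[i] = a[i] * b[i];     /* all products "in parallel" */
    for (int stride = 1; stride < n; stride *= 2)   /* log n combine steps */
        for (int i = 0; i + stride < n; i += 2 * stride)
            partial[i] += partial[i + stride];
    return partial[0];
}
```

This is exactly the extra concurrency the 3D (DNS) formulation exploits: with n processors per dot-product, up to O(n³) processors can be used in total.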
3D Algorithm—DNS Algorithm
Partitioning the intermediate data

Gaussian Elimination
Solve Ax = b, where
A is an n×n dense matrix, and
x and b are dense vectors.
Serial complexity: W = O(n³).
There are two key steps in each iteration:
• Division step
• Rank-1 update
We will consider 1D and 2D partitioning, and introduce the notion of pipelining.
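The two key steps can be seen in a plain serial sketch (no pivoting; the helper name `gauss_solve` and the 3×3 size are illustrative choices):

```c
#include <assert.h>
#include <math.h>

#define N 3

/* Serial Gaussian elimination written to expose the two steps the
   parallel formulations distribute: the division step on row i and the
   rank-1 update of the trailing submatrix. */
void gauss_solve(double A[N][N], double b[N], double x[N]) {
    for (int i = 0; i < N; i++) {
        /* Division step: normalize row i by the pivot. */
        double pivot = A[i][i];
        for (int j = i; j < N; j++) A[i][j] /= pivot;
        b[i] /= pivot;
        /* Rank-1 update: eliminate column i from the trailing rows. */
        for (int r = i + 1; r < N; r++) {
            double f = A[r][i];
            for (int j = i; j < N; j++) A[r][j] -= f * A[i][j];
            b[r] -= f * b[i];
        }
    }
    /* Back substitution on the resulting upper-triangular system. */
    for (int i = N - 1; i >= 0; i--) {
        x[i] = b[i];
        for (int j = i + 1; j < N; j++) x[i] -= A[i][j] * x[j];
    }
}
```

In the partitionings below, iteration i broadcasts the normalized row i, and every processor applies the rank-1 update only to the rows (or blocks) it owns.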
1D Partitioning

Assign n/p rows of A to each processor.
During the ith iteration:
• The divide operation is performed by the processor that stores row i.
• The result is broadcast to the rest of the processors.
• Each processor performs the rank-1 update for its local rows.
Analysis? (one element per processor)
1D Pipelined Formulation

Existing algorithm: the next iteration starts only when the previous iteration has finished.
Key idea: the next iteration can start as soon as the rank-1 update involving the next row has finished.
Essentially, multiple iterations are performed simultaneously!

Cost-optimal with n processors.
1D Partitioning

Is the block mapping a good idea?

2D Mapping

Each processor gets a 2D block of the matrix.
Steps:
• Broadcast of the “active” column along the rows.
• Divide step performed in parallel by the processors that own portions of the row.
• Broadcast along the columns.
• Rank-1 update.
Analysis?
2D Pipelined

Cost-optimal with n² processors.
9.2 Numerical approach for PDE problems

PARALLEL SOLUTION TO PDEs
CASE STUDY: HEAT EQUATIONS
Mathematical model and algorithm
• Heat equation (PDE):

  ∂C/∂t = D ∇²C,   where   ∇²C = ∂²C/∂x² + ∂²C/∂y²

• Algorithm (finite differences on a grid with spacing dx and time step dt):
• Initial input value: C⁰_{i,j} at each grid point (i,j)
• At step n+1:

  ∇²C^tn_{i,j} = FD^tn_{i,j} = (C^tn_{i+1,j} + C^tn_{i-1,j} + C^tn_{i,j+1} + C^tn_{i,j-1} − 4C^tn_{i,j}) / dx²

  C^{tn+1}_{i,j} = C^tn_{i,j} + dt · D · FD^tn_{i,j}

• Data dependency?
Data dependency
• The calculation at point (i,j) needs data from the neighboring points (i-1,j), (i+1,j), (i,j-1), (i,j+1).
• This is a data dependency.
• Solution:
• Shared memory system: synchronization.
• Distributed memory system: communication and synchronization (difficult; optimization needed).
• Exercise:
• Write an OpenMP program implementing the heat equation problem.
• Write an MPI program implementing the heat equation problem.
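For the OpenMP exercise, one possible sketch parallelizes the FD loop nest directly: every (i,j) update reads only values from the previous time level, so the iterations are independent. The function name `FD_omp` and passing m, n, dx as parameters are choices made here; the stencil itself follows the serial FD() code shown later.

```c
#include <assert.h>

/* OpenMP sketch of the spatial-discretization step. Each (i,j) update is
   independent, so the loop nest can be split across threads. The pragma
   is ignored by compilers built without OpenMP, leaving a serial loop. */
void FD_omp(const float *C, float *dC, int m, int n, float dx) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float c = C[i*n + j];
            /* boundary condition: replicate the border value */
            float u = (i == 0)     ? c : C[(i-1)*n + j];
            float d = (i == m-1)   ? c : C[(i+1)*n + j];
            float l = (j == 0)     ? c : C[i*n + j - 1];
            float r = (j == n-1)   ? c : C[i*n + j + 1];
            dC[i*n + j] = (1.0f/(dx*dx)) * (u + d + l + r - 4.0f*c);
        }
}
```

On a shared-memory system this needs no explicit communication; the only synchronization is the implicit barrier at the end of the parallel loop, before the time-integration update reads dC.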
Mathematical model and algorithm

• Notation (the five-point stencil around the center point c):

      u
   l  c  r
      d

c: C_{i,j}
u: C_{i-1,j}
d: C_{i+1,j}
l: C_{i,j-1}
r: C_{i,j+1}
Implementation: Spatial Discretization (FD)

∇²C^tn_{i,j} = FD^tn_{i,j} = (C^tn_{i+1,j} + C^tn_{i-1,j} + C^tn_{i,j+1} + C^tn_{i,j-1} − 4C^tn_{i,j}) / dx²

void FD(float *C, float *dC) {
  int i, j;
  float c, u, d, l, r;
  for ( i = 0 ; i < m ; i++ )
    for ( j = 0 ; j < n ; j++ )
    {
      c = *(C+i*n+j);
      /* boundary condition: replicate the border value at the domain edge */
      u = (i==0)   ? *(C+i*n+j) : *(C+(i-1)*n+j);
      d = (i==m-1) ? *(C+i*n+j) : *(C+(i+1)*n+j);
      l = (j==0)   ? *(C+i*n+j) : *(C+i*n+j-1);
      r = (j==n-1) ? *(C+i*n+j) : *(C+i*n+j+1);
      *(dC+i*n+j) = (1/(dx*dx))*(u+d+l+r-4*c);
    }
}
Implementation: Time Integration

C^{tn+1}_{i,j} = C^tn_{i,j} + dt · D · FD^tn_{i,j}

while (t <= T)
{
  FD(C, dC);
  for ( i = 0 ; i < m ; i++ )
    for ( j = 0 ; j < n ; j++ )
      *(C+i*n+j) = *(C+i*n+j) + dt*D*(*(dC+i*n+j));
  t = t + dt;
}
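The FD routine and the time loop above can be combined into a self-contained serial solver. The grid size and the constants dx, D, dt, T below are illustrative choices, not values from the slides (note dt·D/dx² = 0.01, well inside the explicit-scheme stability limit of 0.25):

```c
#include <assert.h>
#include <math.h>

/* Self-contained serial heat-equation solver on a small M x NN grid,
   combining the FD stencil with the explicit time-integration loop. */
enum { M = 8, NN = 8 };
static const float dx = 1.0f, D = 0.1f, dt = 0.1f;

static void fd_step(const float *C, float *dC) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < NN; j++) {
            float c = C[i*NN + j];
            float u = (i == 0)    ? c : C[(i-1)*NN + j];
            float d = (i == M-1)  ? c : C[(i+1)*NN + j];
            float l = (j == 0)    ? c : C[i*NN + j - 1];
            float r = (j == NN-1) ? c : C[i*NN + j + 1];
            dC[i*NN + j] = (1.0f/(dx*dx)) * (u + d + l + r - 4.0f*c);
        }
}

void heat_solve(float *C, float T) {
    float dC[M*NN];
    for (float t = 0.0f; t <= T; t += dt) {
        fd_step(C, dC);
        for (int k = 0; k < M*NN; k++)
            C[k] += dt * D * dC[k];   /* C^{n+1} = C^n + dt*D*FD^n */
    }
}
```

Because the boundary replicates border values, no heat leaves the domain: the total amount of C is conserved while the peak spreads out and decays.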
SPMD Parallel Algorithm (1)

• SPMD: Single Program Multiple Data

(Domain decomposition: the domain is split into one subdomain per CPU: CPU0, CPU1, CPU2)
SPMD Parallel Algorithm (2)

• B1: Input data
  • Usually, the initial input data is at CPU 0 (Root).
• B2: Domain decomposition.
• B3: Distribute input data from Root to all other CPUs.
• B4: Computation (each CPU calculates on its subdomain).
• B5: Gather output from all other CPUs to Root.

B3 and B5 are communication steps (input and output).
SPMD Parallel Algorithm (3)

• B1: Input data
  • Depends on the requirements of each problem.
  • Different input results in different output.
SPMD Parallel Algorithm (4)
• B2: Domain decomposition
  • There are many approaches, and different approaches have different efficiency.
  • The following is a row-wise domain decomposition:
  • Given that the size of the domain is m×n,
  • the subdomain for each CPU is mc×n, where mc = m/NP and NP is the number of CPUs.

(Row-wise strips, one per CPU: CPU0, CPU1, CPU2)
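The slide assumes mc = m/NP divides evenly. A common generalization (an assumption beyond the slides, not part of the original code) gives the first m mod NP CPUs one extra row, so any m is covered exactly; the helper `row_range` below is hypothetical:

```c
#include <assert.h>

/* Compute the contiguous row range [first, first+rows) owned by `rank`
   when m rows are split among NP CPUs: the first m % NP ranks get one
   extra row each, so the ranges tile [0, m) exactly. */
void row_range(int m, int NP, int rank, int *first, int *rows) {
    int base = m / NP, rem = m % NP;
    *rows  = base + (rank < rem ? 1 : 0);
    *first = rank * base + (rank < rem ? rank : rem);
}
```

With uneven counts per rank, the fixed-size MPI_Scatter/MPI_Gather calls on the next slides would be replaced by their variable-count forms (MPI_Scatterv/MPI_Gatherv).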
SPMD Parallel Algorithm (5)
• B3: Distribute input data from Root to all other CPUs:

MPI_Scatter (C,  mc*n, MPI_FLOAT,
             Cs, mc*n, MPI_FLOAT, 0,
             MPI_COMM_WORLD);

(Input data initialized at Root is scattered to CPU0, CPU1, CPU2)
SPMD Parallel Algorithm (6)
• B5: Gather output from all other CPUs to Root:

MPI_Gather (Cs, mc*n, MPI_FLOAT,
            C,  mc*n, MPI_FLOAT, 0,
            MPI_COMM_WORLD);

(Results calculated at the CPUs are gathered to CPU0)
SPMD Parallel Algorithm (7)
• B4: Computation
  - B4.1: Communication
  - B4.2: Calculation
SPMD Parallel Algorithm (8)

• B4.1: Communication
  - B4.1a): Communicate array Cu
  - B4.1b): Communicate array Cd

(Each CPU exchanges boundary rows with its neighbors: Cu on CPU k is row mc-1 of CPU k-1, and Cd on CPU k is row 0 of CPU k+1.)
SPMD Parallel Algorithm (9)
• B4.1a): Communicate array Cu

if (rank==0) {
    /* no upper neighbor: replicate own first row into Cu */
    for (j=0; j<n; j++) *(Cu+j) = *(Cs+0*n+j);
    MPI_Send (Cs+(mc-1)*n, n, MPI_FLOAT, rank+1, rank, …);
} else if (rank==NP-1) {
    MPI_Recv (Cu, n, MPI_FLOAT, rank-1, rank-1, …);
} else {
    MPI_Send (Cs+(mc-1)*n, n, MPI_FLOAT, rank+1, rank, …);
    MPI_Recv (Cu, n, MPI_FLOAT, rank-1, rank-1, …);
}
SPMD Parallel Algorithm (10)
• B4.1b): Communicate array Cd

if (rank==NP-1) {
    /* no lower neighbor: replicate own last row into Cd */
    for (j=0; j<n; j++) *(Cd+j) = *(Cs+(mc-1)*n+j);
    MPI_Send (Cs, n, MPI_FLOAT, rank-1, rank, …);
} else if (rank==0) {
    MPI_Recv (Cd, n, MPI_FLOAT, rank+1, rank+1, …);
} else {
    MPI_Send (Cs, n, MPI_FLOAT, rank-1, rank, …);
    MPI_Recv (Cd, n, MPI_FLOAT, rank+1, rank+1, …);
}
SPMD Parallel Algorithm (11)
• B4.2: Calculation

void FD(float *Cs, float *Cu, float *Cd, float *dCs, int ms) {
  int i, j;
  float c, u, d, l, r;
  for ( i = 0 ; i < ms ; i++ )
    for ( j = 0 ; j < n ; j++ )
    {
      c = *(Cs+i*n+j);
      /* at subdomain edges, use the ghost rows Cu and Cd */
      u = (i==0)    ? *(Cu+j) : *(Cs+(i-1)*n+j);
      d = (i==ms-1) ? *(Cd+j) : *(Cs+(i+1)*n+j);
      l = (j==0)    ? *(Cs+i*n+j) : *(Cs+i*n+j-1);
      r = (j==n-1)  ? *(Cs+i*n+j) : *(Cs+i*n+j+1);
      *(dCs+i*n+j) = (1/(dx*dx))*(u+d+l+r-4*c);
    }
}
Thank you for your attention!
