Chapter 9 - Parallel Computation Problems

Parallel Computation Problems
References
• Michael J. Quinn. Parallel Computing: Theory and Practice. McGraw-Hill.
• Albert Y. Zomaya. Parallel and Distributed Computing Handbook. McGraw-Hill.
• Ian Foster. Designing and Building Parallel Programs. Addison-Wesley.
• Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar. Introduction to Parallel Computing, Second Edition. Addison-Wesley.
• Joseph JaJa. An Introduction to Parallel Algorithms. Addison-Wesley.
• Nguyễn Đức Nghĩa. Tính toán song song (Parallel Computing). Hanoi, 2003.
9.1 Numerical approach for dense matrices
Review

Matrix-Vector Multiplication

Compute y = Ax, where
x and y are n×1 vectors, and
A is an n×n dense matrix.
Serial complexity: W = O(n²).
We will consider 1D and 2D partitionings.
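As a serial baseline for the parallel formulations that follow, the computation y = Ax can be sketched as below (the helper name `matvec` and the row-major layout are choices made here, not from the slides):

```c
#include <assert.h>

/* Serial baseline for y = Ax: n^2 multiply-add operations, so W = O(n^2). */
void matvec(int n, const double *A, const double *x, double *y) {
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];   /* dot product of row i with x */
    }
}
```

Both the 1D and 2D partitionings below distribute exactly this loop nest; the serial version is the reference their speedup is measured against.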
Row-wise 1D Partitioning

How do we perform the operation?

Row-wise 1D Partitioning

Each processor needs to have the entire x vector.
Steps: an all-to-all broadcast of x, followed by local computations.

Analysis?
Block 2D Partitioning

How do we perform the operation?

Block 2D Partitioning
Each processor needs to have the portion of the x vector that corresponds to the set of columns it stores.

Analysis?
1D vs 2D Formulation

Which one is better?

Matrix-Matrix Multiplication

Compute C = AB, where
A, B, and C are n×n dense matrices.
Serial complexity: W = O(n³).
We will consider 2D and 3D partitionings.
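As with matrix-vector multiplication, a serial baseline makes the cost explicit (the helper name `matmul` and the row-major layout are illustrative choices):

```c
#include <assert.h>

/* Serial baseline for C = AB: n^3 multiply-adds, so W = O(n^3). */
void matmul(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;   /* C[i][j] = row i of A . column j of B */
        }
}
```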
Simple 2D Algorithm

Processors are arranged in a logical √p × √p 2D topology.
Each processor gets an (n/√p) × (n/√p) block of A, B, and C.
Each processor is responsible for computing the entries of C that it has been assigned.
Analysis? What about the memory complexity?
Cannon’s Algorithm

Memory-efficient variant of the simple algorithm.
Key idea: replace the traditional accumulation loop with a skewed (shifted) loop, so that during each step processors operate on different blocks of A and B.
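The skewed loop can be illustrated with a serial simulation of Cannon's algorithm (the names `GRID`/`BLK` and the helper itself are illustrative; block shifts are simulated by index arithmetic instead of actual messages):

```c
#include <assert.h>

#define GRID 2            /* sqrt(p): dimension of the processor grid */
#define BLK  2            /* block size held by each processor */
#define N (GRID * BLK)

/* Serial simulation of Cannon's algorithm on a GRID x GRID grid of
   BLK x BLK blocks. */
void cannon(const double A[N][N], const double B[N][N], double C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            C[i][j] = 0.0;
    /* After the initial skew, processor (pi,pj) holds A-block (pi, pi+pj)
       and B-block (pi+pj, pj); each of the GRID steps then shifts every
       A-block left and every B-block up by one position. */
    for (int step = 0; step < GRID; step++)
        for (int pi = 0; pi < GRID; pi++)
            for (int pj = 0; pj < GRID; pj++) {
                int k = (pi + pj + step) % GRID;  /* block index this step */
                for (int i = 0; i < BLK; i++)
                    for (int j = 0; j < BLK; j++)
                        for (int t = 0; t < BLK; t++)
                            C[pi*BLK + i][pj*BLK + j] +=
                                A[pi*BLK + i][k*BLK + t] *
                                B[k*BLK + t][pj*BLK + j];
            }
}
```

Because each processor holds only one block of A and one of B at any step, memory per processor stays at O(n²/p), which is what makes the variant memory efficient.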
Can we do better?

Can we use more than O(n²) processors?
So far each task corresponded to the dot-product of two vectors, i.e., C_{i,j} = A_{i,*} · B_{*,j}.
How about performing this dot-product in parallel?
What is the maximum concurrency that we can extract?
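Performing the dot-product itself in parallel amounts to a binary-tree reduction: n products computed concurrently, then combined in O(log n) steps. A serial sketch (the helper `tree_dot` and its 64-element cap are assumptions of this sketch, not from the slides):

```c
#include <assert.h>

/* Dot product in the parallel style: conceptually, n "processors" each
   form one product a[i]*b[i], then a binary-tree reduction combines the
   partial results in O(log n) steps (simulated serially here). */
double tree_dot(const double *a, const double *b, int n) {
    assert(n >= 1 && n <= 64);        /* fixed cap for this sketch */
    double partial[64];
    for (int i = 0; i < n; i++)
        partial[i] = a[i] * b[i];     /* all products "in parallel" */
    for (int stride = 1; stride < n; stride *= 2)   /* log n combine steps */
        for (int i = 0; i + stride < n; i += 2 * stride)
            partial[i] += partial[i + stride];
    return partial[0];
}
```

This is exactly the extra concurrency the 3D (DNS) formulation exploits: with n processors per dot-product, up to O(n³) processors can be used in total.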
3D Algorithm—DNS Algorithm
Partitioning the intermediate data

Gaussian Elimination
Solve Ax = b, where
A is an n×n dense matrix, and
x and b are dense vectors.
Serial complexity: W = O(n³).
There are two key steps in each iteration:
• Division step
• Rank-1 update
We will consider 1D and 2D partitioning, and introduce the notion of pipelining.
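The two key steps can be seen in a plain serial sketch (no pivoting; the helper name `gauss_solve` and the 3×3 size are illustrative choices):

```c
#include <assert.h>
#include <math.h>

#define N 3

/* Serial Gaussian elimination written to expose the two steps the
   parallel formulations distribute: the division step on row i and the
   rank-1 update of the trailing submatrix. */
void gauss_solve(double A[N][N], double b[N], double x[N]) {
    for (int i = 0; i < N; i++) {
        /* Division step: normalize row i by the pivot. */
        double pivot = A[i][i];
        for (int j = i; j < N; j++) A[i][j] /= pivot;
        b[i] /= pivot;
        /* Rank-1 update: eliminate column i from the trailing rows. */
        for (int r = i + 1; r < N; r++) {
            double f = A[r][i];
            for (int j = i; j < N; j++) A[r][j] -= f * A[i][j];
            b[r] -= f * b[i];
        }
    }
    /* Back substitution on the resulting upper-triangular system. */
    for (int i = N - 1; i >= 0; i--) {
        x[i] = b[i];
        for (int j = i + 1; j < N; j++) x[i] -= A[i][j] * x[j];
    }
}
```

In the partitionings below, iteration i broadcasts the normalized row i, and every processor applies the rank-1 update only to the rows (or blocks) it owns.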
1D Partitioning

Assign n/p rows of A to each processor.
During the ith iteration:
• The divide operation is performed by the processor that stores row i.
• The result is broadcast to the rest of the processors.
• Each processor performs the rank-1 update for its local rows.
Analysis? (one element per processor)
1D Pipelined Formulation

Existing algorithm: the next iteration starts only when the previous iteration has finished.
Key idea: the next iteration can start as soon as the rank-1 update involving the next row has finished.
Essentially, multiple iterations are performed simultaneously!

Cost-optimal with n processors.
1D Partitioning

Is the block mapping a good idea?

2D Mapping

Each processor gets a 2D block of the matrix.
Steps:
• Broadcast of the “active” column along the rows.
• Divide step performed in parallel by the processors that own portions of the row.
• Broadcast along the columns.
• Rank-1 update.
Analysis?
2D Pipelined

Cost-optimal with n² processors.
9.2 Numerical approach for PDE problems

PARALLEL SOLUTION TO PDEs
CASE STUDY: HEAT EQUATIONS
Mathematical model and algorithm
• Heat equation (PDE):

  ∂C/∂t = D ∇²C,   where   ∇²C = ∂²C/∂x² + ∂²C/∂y²

• Algorithm (finite differences on a grid with spacing dx and time step dt):
• Initial input value: C⁰_{i,j} at each grid point (i,j)
• At step n+1:

  ∇²C^tn_{i,j} = FD^tn_{i,j} = (C^tn_{i+1,j} + C^tn_{i-1,j} + C^tn_{i,j+1} + C^tn_{i,j-1} − 4C^tn_{i,j}) / dx²

  C^{tn+1}_{i,j} = C^tn_{i,j} + dt · D · FD^tn_{i,j}

• Data dependency?
Data dependency
• The calculation at point (i,j) needs data from the neighboring points (i-1,j), (i+1,j), (i,j-1), (i,j+1).
• This is a data dependency.
• Solution:
• Shared memory system: synchronization.
• Distributed memory system: communication and synchronization (difficult; optimization needed).
• Exercise:
• Write an OpenMP program implementing the heat equation problem.
• Write an MPI program implementing the heat equation problem.
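For the OpenMP exercise, one possible sketch parallelizes the FD loop nest directly: every (i,j) update reads only values from the previous time level, so the iterations are independent. The function name `FD_omp` and passing m, n, dx as parameters are choices made here; the stencil itself follows the serial FD() code shown later.

```c
#include <assert.h>

/* OpenMP sketch of the spatial-discretization step. Each (i,j) update is
   independent, so the loop nest can be split across threads. The pragma
   is ignored by compilers built without OpenMP, leaving a serial loop. */
void FD_omp(const float *C, float *dC, int m, int n, float dx) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++) {
            float c = C[i*n + j];
            /* boundary condition: replicate the border value */
            float u = (i == 0)     ? c : C[(i-1)*n + j];
            float d = (i == m-1)   ? c : C[(i+1)*n + j];
            float l = (j == 0)     ? c : C[i*n + j - 1];
            float r = (j == n-1)   ? c : C[i*n + j + 1];
            dC[i*n + j] = (1.0f/(dx*dx)) * (u + d + l + r - 4.0f*c);
        }
}
```

On a shared-memory system this needs no explicit communication; the only synchronization is the implicit barrier at the end of the parallel loop, before the time-integration update reads dC.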
Mathematical model and algorithm

• Notation (the five-point stencil around the center point c):

      u
   l  c  r
      d

c: C_{i,j}
u: C_{i-1,j}
d: C_{i+1,j}
l: C_{i,j-1}
r: C_{i,j+1}
Implementation: Spatial Discretization (FD)

∇²C^tn_{i,j} = FD^tn_{i,j} = (C^tn_{i+1,j} + C^tn_{i-1,j} + C^tn_{i,j+1} + C^tn_{i,j-1} − 4C^tn_{i,j}) / dx²

void FD(float *C, float *dC) {
  int i, j;
  float c, u, d, l, r;
  for ( i = 0 ; i < m ; i++ )
    for ( j = 0 ; j < n ; j++ )
    {
      c = *(C+i*n+j);
      /* boundary condition: replicate the border value at the domain edge */
      u = (i==0)   ? *(C+i*n+j) : *(C+(i-1)*n+j);
      d = (i==m-1) ? *(C+i*n+j) : *(C+(i+1)*n+j);
      l = (j==0)   ? *(C+i*n+j) : *(C+i*n+j-1);
      r = (j==n-1) ? *(C+i*n+j) : *(C+i*n+j+1);
      *(dC+i*n+j) = (1/(dx*dx))*(u+d+l+r-4*c);
    }
}
Implementation: Time Integration

C^{tn+1}_{i,j} = C^tn_{i,j} + dt · D · FD^tn_{i,j}

while (t <= T)
{
  FD(C, dC);
  for ( i = 0 ; i < m ; i++ )
    for ( j = 0 ; j < n ; j++ )
      *(C+i*n+j) = *(C+i*n+j) + dt*D*(*(dC+i*n+j));
  t = t + dt;
}
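The FD routine and the time loop above can be combined into a self-contained serial solver. The grid size and the constants dx, D, dt, T below are illustrative choices, not values from the slides (note dt·D/dx² = 0.01, well inside the explicit-scheme stability limit of 0.25):

```c
#include <assert.h>
#include <math.h>

/* Self-contained serial heat-equation solver on a small M x NN grid,
   combining the FD stencil with the explicit time-integration loop. */
enum { M = 8, NN = 8 };
static const float dx = 1.0f, D = 0.1f, dt = 0.1f;

static void fd_step(const float *C, float *dC) {
    for (int i = 0; i < M; i++)
        for (int j = 0; j < NN; j++) {
            float c = C[i*NN + j];
            float u = (i == 0)    ? c : C[(i-1)*NN + j];
            float d = (i == M-1)  ? c : C[(i+1)*NN + j];
            float l = (j == 0)    ? c : C[i*NN + j - 1];
            float r = (j == NN-1) ? c : C[i*NN + j + 1];
            dC[i*NN + j] = (1.0f/(dx*dx)) * (u + d + l + r - 4.0f*c);
        }
}

void heat_solve(float *C, float T) {
    float dC[M*NN];
    for (float t = 0.0f; t <= T; t += dt) {
        fd_step(C, dC);
        for (int k = 0; k < M*NN; k++)
            C[k] += dt * D * dC[k];   /* C^{n+1} = C^n + dt*D*FD^n */
    }
}
```

Because the boundary replicates border values, no heat leaves the domain: the total amount of C is conserved while the peak spreads out and decays.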
SPMD Parallel Algorithm (1)

• SPMD: Single Program Multiple Data

(Domain decomposition: the domain is split into one subdomain per CPU: CPU0, CPU1, CPU2)
SPMD Parallel Algorithm (2)

• B1: Input data
  • Usually, the initial input data is at CPU 0 (Root).
• B2: Domain decomposition.
• B3: Distribute input data from Root to all other CPUs.
• B4: Computation (each CPU calculates on its subdomain).
• B5: Gather output from all other CPUs to Root.

B3 and B5 are communication steps (input and output).
SPMD Parallel Algorithm (3)

• B1: Input data
  • Depends on the requirements of each problem.
  • Different input results in different output.
SPMD Parallel Algorithm (4)
• B2: Domain decomposition
  • There are many approaches, and different approaches have different efficiency.
  • The following is a row-wise domain decomposition:
  • Given that the size of the domain is m×n,
  • the subdomain for each CPU is mc×n, where mc = m/NP and NP is the number of CPUs.

(Row-wise strips, one per CPU: CPU0, CPU1, CPU2)
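The slide assumes mc = m/NP divides evenly. A common generalization (an assumption beyond the slides, not part of the original code) gives the first m mod NP CPUs one extra row, so any m is covered exactly; the helper `row_range` below is hypothetical:

```c
#include <assert.h>

/* Compute the contiguous row range [first, first+rows) owned by `rank`
   when m rows are split among NP CPUs: the first m % NP ranks get one
   extra row each, so the ranges tile [0, m) exactly. */
void row_range(int m, int NP, int rank, int *first, int *rows) {
    int base = m / NP, rem = m % NP;
    *rows  = base + (rank < rem ? 1 : 0);
    *first = rank * base + (rank < rem ? rank : rem);
}
```

With uneven counts per rank, the fixed-size MPI_Scatter/MPI_Gather calls on the next slides would be replaced by their variable-count forms (MPI_Scatterv/MPI_Gatherv).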
SPMD Parallel Algorithm (5)
• B3: Distribute input data from Root to all other CPUs:

MPI_Scatter (C,  mc*n, MPI_FLOAT,
             Cs, mc*n, MPI_FLOAT, 0,
             MPI_COMM_WORLD);

(Input data initialized at Root is scattered to CPU0, CPU1, CPU2)
SPMD Parallel Algorithm (6)
• B5: Gather output from all other CPUs to Root:

MPI_Gather (Cs, mc*n, MPI_FLOAT,
            C,  mc*n, MPI_FLOAT, 0,
            MPI_COMM_WORLD);

(Results calculated at the CPUs are gathered to CPU0)
SPMD Parallel Algorithm (7)
• B4: Computation
  - B4.1: Communication
  - B4.2: Calculation
SPMD Parallel Algorithm (8)

• B4.1: Communication
  - B4.1a): Communicate array Cu
  - B4.1b): Communicate array Cd

(Each CPU exchanges boundary rows with its neighbors: Cu on CPU k is row mc-1 of CPU k-1, and Cd on CPU k is row 0 of CPU k+1.)
SPMD Parallel Algorithm (9)
• B4.1a): Communicate array Cu

if (rank==0) {
    /* no upper neighbor: replicate own first row into Cu */
    for (j=0; j<n; j++) *(Cu+j) = *(Cs+0*n+j);
    MPI_Send (Cs+(mc-1)*n, n, MPI_FLOAT, rank+1, rank, …);
} else if (rank==NP-1) {
    MPI_Recv (Cu, n, MPI_FLOAT, rank-1, rank-1, …);
} else {
    MPI_Send (Cs+(mc-1)*n, n, MPI_FLOAT, rank+1, rank, …);
    MPI_Recv (Cu, n, MPI_FLOAT, rank-1, rank-1, …);
}
SPMD Parallel Algorithm (10)
• B4.1b): Communicate array Cd

if (rank==NP-1) {
    /* no lower neighbor: replicate own last row into Cd */
    for (j=0; j<n; j++) *(Cd+j) = *(Cs+(mc-1)*n+j);
    MPI_Send (Cs, n, MPI_FLOAT, rank-1, rank, …);
} else if (rank==0) {
    MPI_Recv (Cd, n, MPI_FLOAT, rank+1, rank+1, …);
} else {
    MPI_Send (Cs, n, MPI_FLOAT, rank-1, rank, …);
    MPI_Recv (Cd, n, MPI_FLOAT, rank+1, rank+1, …);
}
SPMD Parallel Algorithm (11)
• B4.2: Calculation

void FD(float *Cs, float *Cu, float *Cd, float *dCs, int ms) {
  int i, j;
  float c, u, d, l, r;
  for ( i = 0 ; i < ms ; i++ )
    for ( j = 0 ; j < n ; j++ )
    {
      c = *(Cs+i*n+j);
      /* at subdomain edges, use the ghost rows Cu and Cd */
      u = (i==0)    ? *(Cu+j) : *(Cs+(i-1)*n+j);
      d = (i==ms-1) ? *(Cd+j) : *(Cs+(i+1)*n+j);
      l = (j==0)    ? *(Cs+i*n+j) : *(Cs+i*n+j-1);
      r = (j==n-1)  ? *(Cs+i*n+j) : *(Cs+i*n+j+1);
      *(dCs+i*n+j) = (1/(dx*dx))*(u+d+l+r-4*c);
    }
}
Thank you for your attention!
