
Domain Decomposition With MPI

Numerical Problem
• Transient Large Eddy Simulation (LES) of turbulence
• Finite difference scheme
• Three-dimensional rectangular domain
• Uniform domain decomposition in each direction – each CPU owns one sub-domain of the grid

MPI Communications
• Computation of variable derivatives
• Filtering of variables
• Implicit solution of variables, i.e. iteration of coupled equations
• Determination of the maximum error while converging
• Determination of maximum variable values each time step
• Post processing

(Figure: buffers exchanged with neighboring CPUs for the array operations above; scalars exchanged for the convergence and maximum-value checks each time step)
Motivation

4th Order First Derivative:
$$\frac{\partial f}{\partial x} \approx \frac{-f_{i+2} + 8 f_{i+1} - 8 f_{i-1} + f_{i-2}}{12\,\Delta}$$
Required buffer width = 2

4th Order Second Derivative:
$$\frac{\partial^2 f}{\partial x^2} \approx \frac{-f_{i+2} + 16 f_{i+1} - 30 f_i + 16 f_{i-1} - f_{i-2}}{12\,\Delta^2}$$
Required buffer width = 2

8th Order Filter:
$$\hat{f}_i = 0.7265625\, f_i + 0.21875\,(f_{i+1} + f_{i-1}) - 0.109375\,(f_{i+2} + f_{i-2}) + 0.03125\,(f_{i+3} + f_{i-3}) - 0.00390625\,(f_{i+4} + f_{i-4})$$
Required buffer width = 4
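To make the buffer requirement concrete, a minimal Fortran sketch (not taken from the original code) of the 4th-order first derivative at the first local point I = 1 of a sub-domain: it assumes the two ghost planes received from the left neighbor are stored in BUF, with BUF(2,J,K) the plane adjacent to the local boundary; F, BUF, DX and DFDX are illustrative names.

C     4th-order first derivative at the first local point I=1 (fixed J,K);
C     the stencil reaches two points to the left of I=1, which is why the
C     buffer width must be 2 for this operator
DFDX=-F(3,J,K)+8.0*F(2,J,K)-8.0*BUF(2,J,K)+BUF(1,J,K)
DFDX=DFDX/(12.0*DX)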

Development

1 Dimensional Domain Decomposition


• NX,NY,NZ = grid points in x,y and z directions
• NXP = NX/NPROC must be an integer greater than 2 x NB (NPROC = number of CPUs, NB = buffer width)
• NXP = grid points on 1 CPU in x direction
• f(NXP,NY,NZ) = array dimensions on each CPU
• buf(NB,NY,NZ) = buffer dimensions at each sub-domain boundary (see the declaration sketch below)
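For illustration, a minimal Fortran sketch of the per-CPU storage implied by this 1D decomposition; the grid and CPU counts below are assumed example values, not the values used in the actual code.

C     Illustrative declarations only; NX, NY, NZ, NPROC and NB are example values
INTEGER NX,NY,NZ,NPROC,NXP,NB
PARAMETER (NX=256,NY=128,NZ=128,NPROC=16,NB=4)
PARAMETER (NXP=NX/NPROC)
C     Local portion of a flow variable on this CPU
REAL F(NXP,NY,NZ)
C     Buffers holding NB planes at the low-x and high-x sub-domain boundaries
REAL BUFLO(NB,NY,NZ),BUFHI(NB,NY,NZ)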

Communication Pattern

(Figure: neighboring CPUs exchange buffers, with a "forward" tag for data sent to the next rank and a "backward" tag for data sent to the previous rank)

Note:
• Blocking SEND/RECV requires a synchronized ordering to avoid deadlock
• Distributed memory with proper parallelization keeps the per-CPU RAM requirement bounded

Implementation (FORTRAN)

Blocking SEND/RECV
CALL MPI_SEND(buffer,size,MPI_Datatype,destination,tag,MPI_COMM_WORLD,IERR)
CALL MPI_RECV(buffer,size,MPI_Datatype,source,tag,MPI_COMM_WORLD,STATUS,IERR)

Synchronization
send(…,destination,tag,…)
recv(…,source,tag,…)
NPROC = total number of CPUs

Rank = 0:
send(…,Rank+1,NFOR+Rank,…)
recv(…,Rank+1,NBAC+Rank,…)

0 < Rank < NPROC-1:
recv(…,Rank-1,NFOR+Rank-1,…)
send(…,Rank-1,NBAC+Rank-1,…)
send(…,Rank+1,NFOR+Rank,…)
recv(…,Rank+1,NBAC+Rank,…)

Rank = NPROC-1:
recv(…,Rank-1,NFOR+Rank-1,…)
send(…,Rank-1,NBAC+Rank-1,…)

Note:
• All send/recv calls are ordered to match across ranks; otherwise the blocking calls deadlock
• All tags are unique positive integers; otherwise messages can be mismatched and the code can deadlock

SEND/RECV Example Code (FORTRAN)

C     TAGS: unique forward and backward tags;
C     advancing NSTRT gives the next subroutine a fresh tag range
NFOR=NSTRT
NBAC=NFOR+NPROC
NSTRT=NBAC+1
C     Destination/source ranks and tags follow the synchronization pattern
IF(RANK .EQ. 0)THEN
CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NFOR+RANK,
~ MPI_COMM_WORLD,IERR)
CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NBAC+RANK,
~ MPI_COMM_WORLD,STATUS,IERR)
ELSEIF(RANK .EQ. NPROC-1)THEN
CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NFOR+RANK-1,
~ MPI_COMM_WORLD,STATUS,IERR)
CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NBAC+RANK-1,
~ MPI_COMM_WORLD,IERR)
ELSE
C     Interior ranks: SEND/RECV order matches the synchronization pattern
CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NFOR+RANK-1,
~ MPI_COMM_WORLD,STATUS,IERR)
CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NBAC+RANK-1,
~ MPI_COMM_WORLD,IERR)
CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NFOR+RANK,
~ MPI_COMM_WORLD,IERR)
CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NBAC+RANK,
~ MPI_COMM_WORLD,STATUS,IERR)
ENDIF

Note:
• All multidimensional arrays are packed into a 1-dimensional "BUF" before SEND
• After RECV, each 1-dimensional "BUF" is unpacked back into 3 dimensions (see the pack/unpack sketch below)
• The three branches correspond to the first, interior, and last ranks (the original slide's color coding assumed NPROC = 3)
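For concreteness, a minimal sketch of the pack/unpack step described in the notes, assuming a local array F(NXP,NY,NZ), buffer width NB at the high-x boundary, and an illustrative ghost-plane array GHOST(NB,NY,NZ); this is not the actual routine.

C     Pack the NB planes nearest the high-x boundary into the 1D buffer
II=0
DO K=1,NZ
DO J=1,NY
DO I=NXP-NB+1,NXP
II=II+1
BUF(II)=F(I,J,K)
ENDDO
ENDDO
ENDDO
C     ... SEND/RECV of BUF as shown on this slide ...
C     Unpack the received buffer into the ghost planes
II=0
DO K=1,NZ
DO J=1,NY
DO I=1,NB
II=II+1
GHOST(I,J,K)=BUF(II)
ENDDO
ENDDO
ENDDO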

Current Practices

3 Dimensional Parallelization
• Smaller buffers
- 1D buffers, buf(NB=4,NY,NZ), are large slabs, which limits the number of CPUs
- 3D buffers, buf(NB=4,NYP,NZP), are much smaller, e.g. NYP = NY/4
- 3D buffering requires more communication calls
• Greater potential for scaling up
- a 1D slab sub-domain cannot be narrower than 2 x NB (= 8 grid points for NB = 4), which caps the number of CPUs
- arbitrary rectangular proportions possible
• Buffer arrays are converted to vectors before and after MPI communication
• Use non-blocking ISEND/IRECV + WAIT for buffer arrays
• Use blocking SEND/RECV for scalars
• Extensive use of COMMON variables for RAM minimization
• Possible recalculation in different subroutines for RAM minimization
• As much as possible, locate MPI communication in separate subroutines
Lessons Learned using SEND/RECV:
• For a relatively small number of processors (NPROC < 16), the anticipated speed-up was achieved
• For NPROC > 16, performance degraded

3D Domain Decomposition

Utilize 1-dimensional flags (IZNX, IZNY and IZNZ) for implementation of boundary conditions

(Figure: ranks arranged in a 3D process grid, indexed by the flags IZNX, IZNY and IZNZ; NPROC = 64 processors)
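As an illustration, one possible mapping from Rank to the flags, assuming a 4 x 4 x 4 process grid for NPROC = 64; the actual code may order the ranks differently.

C     Assumed 4 x 4 x 4 process grid for NPROC = 64 (illustration only)
INTEGER NPX,NPY,NPZ
PARAMETER (NPX=4,NPY=4,NPZ=4)
C     One possible mapping from RANK (0..NPROC-1) to process-grid indices
IZNX=MOD(RANK,NPX)
IZNY=MOD(RANK/NPX,NPY)
IZNZ=RANK/(NPX*NPY)
C     e.g. IZNX=0 would flag the low-x physical boundary on this CPU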

Buffer Subroutines, ISEND/IRECV (FORTRAN)

Time loop tag initiation:
DO IT=1,NT
C     NSTRT=1 gives the first tag of each time iteration
NSTRT=1

Call buffer subroutine:
CALL BUFFERSXX(NSTRT,
~ RANK,MPI_REAL,MPI_COMM_WORLD,MPI_STATUS_SIZE,STATUS,IERR)

Buffer subroutine:
SUBROUTINE BUFFERSXX(NSTRT,
~ RANK,MPI_REAL,MPI_COMM_WORLD,MPI_STATUS_SIZE,STATUS,IERR)
C     TAGS: unique forward and backward tags;
C     advancing NSTRT gives the next subroutine a fresh tag range
NXF=NSTRT
NXB=NXF+NPROC
NSTRT=NXB+1
C     Use MPI_WAIT for BOTH ISEND and IRECV
CALL MPI_ISEND(SENDBUFX,NB*NYP*NZP,MPI_REAL,RANK+1,NXF+RANK,
~ MPI_COMM_WORLD,SENDXF,IERR)
CALL MPI_IRECV(RECVBUFX,NB*NYP*NZP,MPI_REAL,RANK+1,NXB+RANK,
~ MPI_COMM_WORLD,RECVXB,IERR)
C     Each MPI_WAIT completes its own request handle (SENDXF, RECVXB),
C     which identifies the operation much like a tag
CALL MPI_WAIT(SENDXF,STATUS,IERR)
CALL MPI_WAIT(RECVXB,STATUS,IERR)
C     Synchronize at the end of each buffer communication
CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
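The calls above cover only the exchange with the RANK+1 neighbor (and would themselves be skipped on RANK = NPROC-1); a minimal sketch of the complementary exchange with the RANK-1 neighbor, using the same tag convention and illustrative buffer and request names (SENDBUFXB, RECVBUFXF, SENDXB, RECVXF), would be:

C     Complementary exchange with the RANK-1 neighbor (skipped on RANK=0)
IF(RANK .GT. 0)THEN
CALL MPI_ISEND(SENDBUFXB,NB*NYP*NZP,MPI_REAL,RANK-1,NXB+RANK-1,
~ MPI_COMM_WORLD,SENDXB,IERR)
CALL MPI_IRECV(RECVBUFXF,NB*NYP*NZP,MPI_REAL,RANK-1,NXF+RANK-1,
~ MPI_COMM_WORLD,RECVXF,IERR)
CALL MPI_WAIT(SENDXB,STATUS,IERR)
CALL MPI_WAIT(RECVXF,STATUS,IERR)
ENDIF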
Note:
• Tags are recycled each time step; MPI tags must be non-negative integers no larger than MPI_TAG_UB
• Non-blocking ISEND/IRECV calls need not be posted in a synchronized pattern; completion is handled through the request handles passed to MPI_WAIT

Buffer Subroutines for Convergence

• Crank-Nicolson method for x, y and z velocity components


• Implicit Poisson-type equation
• Solution by Jacobi iteration
- all grid points advance together at each iteration level
- slowest method to converge
- most stable iterative method (Neumann problem)
- number of iterations is a function of the time-step size

$$u^{n+1}_{i,j,k} = c_1\left(u^{n}_{i+1,j,k} + u^{n}_{i-1,j,k}\right) + c_2\left(u^{n}_{i,j+1,k} + u^{n}_{i,j-1,k}\right) + c_3\left(u^{n}_{i,j,k+1} + u^{n}_{i,j,k-1}\right) + c_4\, g_{i,j,k}$$

Note: Stencil size is 1 in each direction, so a buffer width of 1 suffices for this exchange
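A minimal sketch of one Jacobi sweep over a sub-domain interior, assuming the ghost values received from neighboring CPUs have already been copied into the boundary planes of U; array names and dimensions are illustrative.

C     One Jacobi sweep; U and UNEW are assumed dimensioned
C     (0:NXP+1,0:NYP+1,0:NZP+1) so the received ghost values occupy
C     the index-0 and index-(N+1) planes
DO K=1,NZP
DO J=1,NYP
DO I=1,NXP
UNEW(I,J,K)=C1*(U(I+1,J,K)+U(I-1,J,K))+C2*(U(I,J+1,K)+U(I,J-1,K))
~ +C3*(U(I,J,K+1)+U(I,J,K-1))+C4*G(I,J,K)
ENDDO
ENDDO
ENDDO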

Scalar Communication, SEND/RECV (FORTRAN)

• Iterative solution requires the maximum error over the entire domain
• Poisson pressure equation also requires communication for the compatibility condition
• Treat Rank = 0 as Master

Step 1: Master receives the slaves' maximum errors
Step 2: Master sends the global maximum error to the slaves

Note: Typically, domain decomposition would not require a Master/Slave method; a distributed-memory approach would be used instead

Scalar Communication (FORTRAN)

Determine max error during convergence:


C     TAGS: index the same set of tags as the buffer communication
NFOR=NSTRT
NBAC=NFOR+NPROC
NSTRT=NBAC+1
C     Synchronized SEND/RECV pattern is required to avoid deadlock
IF(RANK .EQ. 0)THEN
DO I=1,NPROC-1
CALL MPI_RECV(ERR,1,MPI_REAL,I,NBAC+I,
~ MPI_COMM_WORLD,STATUS,IERR)
IF(ERR .GT. ERRMAX)THEN
ERRMAX=ERR
ENDIF
ENDDO
DO I=1,NPROC-1
CALL MPI_SEND(ERRMAX,1,MPI_REAL,I,NFOR+I,
~ MPI_COMM_WORLD,IERR)
ENDDO
ELSE
CALL MPI_SEND(ERRMAX,1,MPI_REAL,0,NBAC+RANK,
~ MPI_COMM_WORLD,IERR)
CALL MPI_RECV(ERRMAX,1,MPI_REAL,0,NFOR+RANK,
~ MPI_COMM_WORLD,STATUS,IERR)
ENDIF
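As the previous slide notes, a Master/Slave exchange is not strictly required for this reduction; for comparison only, the same global maximum could be obtained with the standard collective MPI_ALLREDUCE (this is not the approach used in the code above; ERRMAXG is an illustrative name).

C     Every rank contributes its local ERRMAX and receives the global maximum
CALL MPI_ALLREDUCE(ERRMAX,ERRMAXG,1,MPI_REAL,MPI_MAX,
~ MPI_COMM_WORLD,IERR)
ERRMAX=ERRMAXG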

Lessons Learned and Open Questions

• Three-dimensional domain decomposition yields maximum efficiency


- more communication
- smaller buffer sizes
- lower sub-domain surface-to-volume ratio, so less data is exchanged per CPU
• SEND/RECV is fine for small buffer sizes
• ISEND/IRECV provides the appropriate scaling performance
• Both MPI_ISEND and MPI_IRECV require a matching MPI_WAIT on the request handle they return; the request identifies the operation much as a tag does

• Open question: should buffer sizes be tailored to what each communication actually requires, for communication speed-up (particularly inside an iterative loop) and memory reduction?
