
Domain Decomposition With MPI

Numerical Problem
• Transient Large Eddy Simulation (LES) of turbulence
• Finite difference scheme
• Three-dimensional rectangular domain
• Uniform domain decomposition in each direction – each CPU owns one sub-domain of the grid

MPI Communications
• Computation of variable derivatives
• Filtering of variables
• Implicit solution of variables, i.e. iteration of coupled equations
• Determination of the maximum error while converging
• Determination of maximum variable values each time step
• Post processing

(Figure: buffers exchanged with neighboring CPUs for the array operations above; scalars exchanged for the convergence and maximum-value checks each time step)
Motivation

4th Order First Derivative:
$$\frac{\partial f}{\partial x} \approx \frac{-f_{i+2} + 8 f_{i+1} - 8 f_{i-1} + f_{i-2}}{12\,\Delta}$$
Required buffer width = 2

4th Order Second Derivative:
$$\frac{\partial^2 f}{\partial x^2} \approx \frac{-f_{i+2} + 16 f_{i+1} - 30 f_i + 16 f_{i-1} - f_{i-2}}{12\,\Delta^2}$$
Required buffer width = 2

8th Order Filter:
$$\hat{f}_i = 0.7265625\, f_i + 0.21875\,(f_{i+1} + f_{i-1}) - 0.109375\,(f_{i+2} + f_{i-2}) + 0.03125\,(f_{i+3} + f_{i-3}) - 0.00390625\,(f_{i+4} + f_{i-4})$$
Required buffer width = 4
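To make the buffer requirement concrete, a minimal Fortran sketch (not taken from the original code) of the 4th-order first derivative at the first local point I = 1 of a sub-domain: it assumes the two ghost planes received from the left neighbor are stored in BUF, with BUF(2,J,K) the plane adjacent to the local boundary; F, BUF, DX and DFDX are illustrative names.

C     4th-order first derivative at the first local point I=1 (fixed J,K);
C     the stencil reaches two points to the left of I=1, which is why the
C     buffer width must be 2 for this operator
DFDX=-F(3,J,K)+8.0*F(2,J,K)-8.0*BUF(2,J,K)+BUF(1,J,K)
DFDX=DFDX/(12.0*DX)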

Development

1 Dimensional Domain Decomposition


• NX,NY,NZ = grid points in x,y and z directions
• NXP = NX/NPROC must be an integer greater than 2 x NB (NPROC = number of CPUs, NB = buffer width)
• NXP = grid points on 1 CPU in x direction
• f(NXP,NY,NZ) = array dimensions on each CPU
• buf(NB,NY,NZ) = buffer dimensions at each sub-domain boundary (see the declaration sketch below)
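For illustration, a minimal Fortran sketch of the per-CPU storage implied by this 1D decomposition; the grid and CPU counts below are assumed example values, not the values used in the actual code.

C     Illustrative declarations only; NX, NY, NZ, NPROC and NB are example values
INTEGER NX,NY,NZ,NPROC,NXP,NB
PARAMETER (NX=256,NY=128,NZ=128,NPROC=16,NB=4)
PARAMETER (NXP=NX/NPROC)
C     Local portion of a flow variable on this CPU
REAL F(NXP,NY,NZ)
C     Buffers holding NB planes at the low-x and high-x sub-domain boundaries
REAL BUFLO(NB,NY,NZ),BUFHI(NB,NY,NZ)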

Communication Pattern

(Figure: neighboring CPUs exchange buffers, with a "forward" tag for data sent to the next rank and a "backward" tag for data sent to the previous rank)

Note:
• Blocking SEND/RECV requires a synchronized ordering to avoid deadlock
• Distributed memory with proper parallelization keeps the per-CPU RAM requirement bounded

Implementation (FORTRAN)

Blocking SEND/RECV
CALL MPI_SEND(buffer,size,MPI_Datatype,destination,tag,MPI_COMM_WORLD,IERR)
CALL MPI_RECV(buffer,size,MPI_Datatype,source,tag,MPI_COMM_WORLD,STATUS,IERR)

Synchronization
send(…,destination,tag,…)
recv(…,source,tag,…)
NPROC = total number of CPUs

Rank = 0:
send(…,Rank+1,NFOR+Rank,…)
recv(…,Rank+1,NBAC+Rank,…)

0 < Rank < NPROC-1:
recv(…,Rank-1,NFOR+Rank-1,…)
send(…,Rank-1,NBAC+Rank-1,…)
send(…,Rank+1,NFOR+Rank,…)
recv(…,Rank+1,NBAC+Rank,…)

Rank = NPROC-1:
recv(…,Rank-1,NFOR+Rank-1,…)
send(…,Rank-1,NBAC+Rank-1,…)

Note:
• All send/recv calls are ordered to match across ranks; otherwise the blocking calls deadlock
• All tags are unique positive integers; otherwise messages can be mismatched and the code can deadlock

SEND/RECV Example Code (FORTRAN)

C     TAGS: unique forward and backward tags;
C     advancing NSTRT gives the next subroutine a fresh tag range
NFOR=NSTRT
NBAC=NFOR+NPROC
NSTRT=NBAC+1
C     Destination/source ranks and tags follow the synchronization pattern
IF(RANK .EQ. 0)THEN
CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NFOR+RANK,
~ MPI_COMM_WORLD,IERR)
CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NBAC+RANK,
~ MPI_COMM_WORLD,STATUS,IERR)
ELSEIF(RANK .EQ. NPROC-1)THEN
CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NFOR+RANK-1,
~ MPI_COMM_WORLD,STATUS,IERR)
CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NBAC+RANK-1,
~ MPI_COMM_WORLD,IERR)
ELSE
C     Interior ranks: SEND/RECV order matches the synchronization pattern
CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NFOR+RANK-1,
~ MPI_COMM_WORLD,STATUS,IERR)
CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK-1,NBAC+RANK-1,
~ MPI_COMM_WORLD,IERR)
CALL MPI_SEND(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NFOR+RANK,
~ MPI_COMM_WORLD,IERR)
CALL MPI_RECV(BUF,NB*NY*NZ,MPI_REAL,RANK+1,NBAC+RANK,
~ MPI_COMM_WORLD,STATUS,IERR)
ENDIF

Note:
• All multidimensional arrays are packed into a 1-dimensional "BUF" before SEND
• After RECV, each 1-dimensional "BUF" is unpacked back into 3 dimensions (see the pack/unpack sketch below)
• The three branches correspond to the first, interior, and last ranks (the original slide's color coding assumed NPROC = 3)
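For concreteness, a minimal sketch of the pack/unpack step described in the notes, assuming a local array F(NXP,NY,NZ), buffer width NB at the high-x boundary, and an illustrative ghost-plane array GHOST(NB,NY,NZ); this is not the actual routine.

C     Pack the NB planes nearest the high-x boundary into the 1D buffer
II=0
DO K=1,NZ
DO J=1,NY
DO I=NXP-NB+1,NXP
II=II+1
BUF(II)=F(I,J,K)
ENDDO
ENDDO
ENDDO
C     ... SEND/RECV of BUF as shown on this slide ...
C     Unpack the received buffer into the ghost planes
II=0
DO K=1,NZ
DO J=1,NY
DO I=1,NB
II=II+1
GHOST(I,J,K)=BUF(II)
ENDDO
ENDDO
ENDDO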

Current Practices

3 Dimensional Parallelization
• Smaller buffers
- 1D buffers, buf(NB=4,NY,NZ), are large slabs, which limits the number of CPUs
- 3D buffers, buf(NB=4,NYP,NZP), are much smaller, e.g. NYP = NY/4
- 3D buffering requires more communication calls
• Greater potential for scaling up
- a 1D slab sub-domain cannot be narrower than 2 x NB (= 8 grid points for NB = 4), which caps the number of CPUs
- arbitrary rectangular proportions possible
• Buffer arrays are converted to vectors before and after MPI communication
• Use non-blocking ISEND/IRECV + WAIT for buffer arrays
• Use blocking SEND/RECV for scalars
• Extensive use of COMMON variables for RAM minimization
• Possible recalculation in different subroutines for RAM minimization
• As much as possible, locate MPI communication in separate subroutines
Lessons Learned using SEND/RECV:
• For a relatively small number of processors (NPROC < 16), the anticipated speed-up was achieved
• For NPROC > 16, performance degraded

3D Domain Decomposition

Utilize 1-dimensional flags (IZNX, IZNY and IZNZ) for implementation of boundary conditions

(Figure: ranks arranged in a 3D process grid, indexed by the flags IZNX, IZNY and IZNZ; NPROC = 64 processors)
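As an illustration, one possible mapping from Rank to the flags, assuming a 4 x 4 x 4 process grid for NPROC = 64; the actual code may order the ranks differently.

C     Assumed 4 x 4 x 4 process grid for NPROC = 64 (illustration only)
INTEGER NPX,NPY,NPZ
PARAMETER (NPX=4,NPY=4,NPZ=4)
C     One possible mapping from RANK (0..NPROC-1) to process-grid indices
IZNX=MOD(RANK,NPX)
IZNY=MOD(RANK/NPX,NPY)
IZNZ=RANK/(NPX*NPY)
C     e.g. IZNX=0 would flag the low-x physical boundary on this CPU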

Buffer Subroutines, ISEND/IRECV (FORTRAN)

Time loop tag initiation:
DO IT=1,NT
C     NSTRT=1 gives the first tag of each time iteration
NSTRT=1

Call buffer subroutine:
CALL BUFFERSXX(NSTRT,
~ RANK,MPI_REAL,MPI_COMM_WORLD,MPI_STATUS_SIZE,STATUS,IERR)

Buffer subroutine:
SUBROUTINE BUFFERSXX(NSTRT,
~ RANK,MPI_REAL,MPI_COMM_WORLD,MPI_STATUS_SIZE,STATUS,IERR)
C     TAGS: unique forward and backward tags;
C     advancing NSTRT gives the next subroutine a fresh tag range
NXF=NSTRT
NXB=NXF+NPROC
NSTRT=NXB+1
C     Use MPI_WAIT for BOTH ISEND and IRECV
CALL MPI_ISEND(SENDBUFX,NB*NYP*NZP,MPI_REAL,RANK+1,NXF+RANK,
~ MPI_COMM_WORLD,SENDXF,IERR)
CALL MPI_IRECV(RECVBUFX,NB*NYP*NZP,MPI_REAL,RANK+1,NXB+RANK,
~ MPI_COMM_WORLD,RECVXB,IERR)
C     Each MPI_WAIT completes its own request handle (SENDXF, RECVXB),
C     which identifies the operation much like a tag
CALL MPI_WAIT(SENDXF,STATUS,IERR)
CALL MPI_WAIT(RECVXB,STATUS,IERR)
C     Synchronize at the end of each buffer communication
CALL MPI_BARRIER(MPI_COMM_WORLD,IERR)
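The calls above cover only the exchange with the RANK+1 neighbor (and would themselves be skipped on RANK = NPROC-1); a minimal sketch of the complementary exchange with the RANK-1 neighbor, using the same tag convention and illustrative buffer and request names (SENDBUFXB, RECVBUFXF, SENDXB, RECVXF), would be:

C     Complementary exchange with the RANK-1 neighbor (skipped on RANK=0)
IF(RANK .GT. 0)THEN
CALL MPI_ISEND(SENDBUFXB,NB*NYP*NZP,MPI_REAL,RANK-1,NXB+RANK-1,
~ MPI_COMM_WORLD,SENDXB,IERR)
CALL MPI_IRECV(RECVBUFXF,NB*NYP*NZP,MPI_REAL,RANK-1,NXF+RANK-1,
~ MPI_COMM_WORLD,RECVXF,IERR)
CALL MPI_WAIT(SENDXB,STATUS,IERR)
CALL MPI_WAIT(RECVXF,STATUS,IERR)
ENDIF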
Note:
• Tags are recycled each time step; MPI tags must be non-negative integers no larger than MPI_TAG_UB
• Non-blocking ISEND/IRECV calls need not be posted in a synchronized pattern; completion is handled through the request handles passed to MPI_WAIT

Buffer Subroutines for Convergence

• Crank-Nicolson method for x, y and z velocity components


• Implicit Poisson-type equation
• Solution by Jacobi iteration
- all grid points advance together at each iteration level
- slowest method to converge
- most stable iterative method (Neumann problem)
- number of iterations is a function of the time-step size

$$u^{n+1}_{i,j,k} = c_1\left(u^{n}_{i+1,j,k} + u^{n}_{i-1,j,k}\right) + c_2\left(u^{n}_{i,j+1,k} + u^{n}_{i,j-1,k}\right) + c_3\left(u^{n}_{i,j,k+1} + u^{n}_{i,j,k-1}\right) + c_4\, g_{i,j,k}$$

Note: Stencil size is 1 in each direction, so a buffer width of 1 suffices for this exchange
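A minimal sketch of one Jacobi sweep over a sub-domain interior, assuming the ghost values received from neighboring CPUs have already been copied into the boundary planes of U; array names and dimensions are illustrative.

C     One Jacobi sweep; U and UNEW are assumed dimensioned
C     (0:NXP+1,0:NYP+1,0:NZP+1) so the received ghost values occupy
C     the index-0 and index-(N+1) planes
DO K=1,NZP
DO J=1,NYP
DO I=1,NXP
UNEW(I,J,K)=C1*(U(I+1,J,K)+U(I-1,J,K))+C2*(U(I,J+1,K)+U(I,J-1,K))
~ +C3*(U(I,J,K+1)+U(I,J,K-1))+C4*G(I,J,K)
ENDDO
ENDDO
ENDDO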

Scalar Communication, SEND/RECV (FORTRAN)

• Iterative solution requires the maximum error over the entire domain
• Poisson pressure equation also requires communication for the compatibility condition
• Treat Rank = 0 as Master

Step 1: Master receives the slaves' maximum errors
Step 2: Master sends the global maximum error to the slaves

Note: Typically, domain decomposition would not require a Master/Slave method; a distributed-memory approach would be used instead

Scalar Communication (FORTRAN)

Determine max error during convergence:


C     TAGS: index the same set of tags as the buffer communication
NFOR=NSTRT
NBAC=NFOR+NPROC
NSTRT=NBAC+1
C     Synchronized SEND/RECV pattern is required to avoid deadlock
IF(RANK .EQ. 0)THEN
DO I=1,NPROC-1
CALL MPI_RECV(ERR,1,MPI_REAL,I,NBAC+I,
~ MPI_COMM_WORLD,STATUS,IERR)
IF(ERR .GT. ERRMAX)THEN
ERRMAX=ERR
ENDIF
ENDDO
DO I=1,NPROC-1
CALL MPI_SEND(ERRMAX,1,MPI_REAL,I,NFOR+I,
~ MPI_COMM_WORLD,IERR)
ENDDO
ELSE
CALL MPI_SEND(ERRMAX,1,MPI_REAL,0,NBAC+RANK,
~ MPI_COMM_WORLD,IERR)
CALL MPI_RECV(ERRMAX,1,MPI_REAL,0,NFOR+RANK,
~ MPI_COMM_WORLD,STATUS,IERR)
ENDIF
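As the previous slide notes, a Master/Slave exchange is not strictly required for this reduction; for comparison only, the same global maximum could be obtained with the standard collective MPI_ALLREDUCE (this is not the approach used in the code above; ERRMAXG is an illustrative name).

C     Every rank contributes its local ERRMAX and receives the global maximum
CALL MPI_ALLREDUCE(ERRMAX,ERRMAXG,1,MPI_REAL,MPI_MAX,
~ MPI_COMM_WORLD,IERR)
ERRMAX=ERRMAXG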

Lessons Learned and Open Questions

• Three-dimensional domain decomposition yields maximum efficiency


- more communication
- smaller buffer sizes
- lower sub-domain surface-to-volume ratio, so less data is exchanged per CPU
• SEND/RECV is fine for small buffer sizes
• ISEND/IRECV provides the appropriate scaling performance
• Both MPI_ISEND and MPI_IRECV require a matching MPI_WAIT on the request handle they return; the request identifies the operation much as a tag does

• Open question: should buffer sizes be tailored to what each communication actually requires, for communication speed-up (particularly inside an iterative loop) and memory reduction?
