CUDA-Multiple GPUs
Oswald Haan
[email protected]
A Typical Compute-Cluster Configuration
with Multiple GPUs
[Figure: a compute cluster of several nodes connected by a high-speed interconnect; each node contains multiple GPUs, each built from many SMs]
Using Multiple GPUs in a Single Node
• start as many OpenMP threads in the node as GPUs to be used
• each thread selects a different GPU and controls all activities of this
device independently of the activities of the other devices
• communication between different GPUs proceeds in two steps:
device → host and host → device (sketched below)
[Figure: one node with two GPUs, each built from several SMs, attached to the host memory]
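A minimal sketch of such a two-step copy between two GPUs through a host buffer (the function and variable names are illustrative, not part of the course code):

#include <omp.h>
#include <cuda_runtime.h>

/* two-step copy of n floats from device 0 to device 1 via a host buffer */
void copy_dev0_to_dev1( float *d_src0, float *d_dst1, float *h_buf, int n )
{
  #pragma omp parallel num_threads(2)
  {
    int did = omp_get_thread_num();
    cudaSetDevice( did );
    if( did == 0 )                     /* step 1: device 0 -> host buffer */
      cudaMemcpy( h_buf, d_src0, n*sizeof(float), cudaMemcpyDeviceToHost );
    #pragma omp barrier                /* make the host data visible to the other thread */
    if( did == 1 )                     /* step 2: host buffer -> device 1 */
      cudaMemcpy( d_dst1, h_buf, n*sizeof(float), cudaMemcpyHostToDevice );
  }
}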
Using Multiple GPUs in Multiple Nodes
• start as many MPI tasks on different nodes as GPUs to be used
• each MPI task controls a GPU on a different node
• communication between different GPUs proceeds in three steps:
device → host, MPI transfer between the hosts, host → device (sketched below)
[Figure: two nodes connected by a high-speed interconnect, each node with a GPU built from several SMs]
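A minimal sketch of the three-step copy between GPUs on different nodes, again staged through host buffers (names are illustrative):

#include <mpi.h>
#include <cuda_runtime.h>

/* three-step copy of n floats from the GPU of task src to the GPU of task dst */
void copy_gpu_to_gpu( float *a_d, float *h_buf, int n, int src, int dst, int me )
{
  if( me == src ) {
    cudaMemcpy( h_buf, a_d, n*sizeof(float), cudaMemcpyDeviceToHost );   /* step 1: device -> host */
    MPI_Send( h_buf, n, MPI_FLOAT, dst, 0, MPI_COMM_WORLD );             /* step 2: host -> host  */
  } else if( me == dst ) {
    MPI_Recv( h_buf, n, MPI_FLOAT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );
    cudaMemcpy( a_d, h_buf, n*sizeof(float), cudaMemcpyHostToDevice );   /* step 3: host -> device */
  }
}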
Separate Files for Different Code
• File code_cuda.cu contains all program parts with CUDA code
• File code_par.c contains all program parts for managing OpenMP threads
or MPI tasks, respectively
• compile/link in two steps:
1. compile code_cuda.cu (nvcc uses the CC compiler)
   nvcc -arch=sm_35 -c code_cuda.cu
2. compile and link code_par.c (gcc and mpicc use the C compiler)
   gcc -fopenmp -lcudart code_cuda.o code_par.c -o code.exe
   or
   mpicc -lcudart code_cuda.o code_par.c -o code.exe
Important
all functions in code_cuda.cu that are to be called from code_par.c have to be
declared with the qualifier extern "C"
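For example, a wrapper such as powser_by_device from the following slides would be declared like this (the exact parameter list here is an assumption):

/* in code_cuda.cu: compiled by nvcc, callable from C code */
extern "C" void powser_by_device( int did, int n, int offset,
                                  double eps, float *a_h, float *coeff_h )
{
  cudaSetDevice( did );
  /* allocate, copy, launch kernel, copy back -- see the following slides */
}

/* in code_par.c: plain C prototype of the same function */
void powser_by_device( int did, int n, int offset,
                       double eps, float *a_h, float *coeff_h );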
Example: Power Series with Multiple Devices
in One Node, Using OpenMP
In the OpenMP part of the program, set the number of devices to be used:
int nd = 2;
and generate nd threads with OpenMP:
omp_set_num_threads(nd);
#pragma omp parallel default(shared)
{
  int did = omp_get_thread_num();
  ...
  powser_by_device(did, ... );
}
Note: variables declared inside a parallel section are thread-local.
In the CUDA part of the program, select a device according to the thread number did:
cudaSetDevice(did);
Power Series with Multiple Devices:
Work Distribution
Distribute N values onto nd devices
cudaSetDevice(did);
cudaMalloc((void **) &a_d, sizea);
cudaMemcpyToSymbol(coeff, coeff_h, sizeco, 0, cudaMemcpyHostToDevice);
ps_d<<<N_blks,N_thrpb>>>(N,offset,eps,a_d);
cudaMemcpy(&a_h[offset], a_d, sizea, cudaMemcpyDeviceToHost);
cudaFree(a_d);
}
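The sizes and the launch configuration used in this fragment are not shown. Assuming the fragment is the body of powser_by_device(did, N, offset, eps, a_h, coeff_h), as the calls on the neighbouring slides suggest, they could be set up like this (block size and coefficient type are assumptions):

int    N_thrpb = 256;                        /* threads per block */
int    N_blks  = (N + N_thrpb - 1)/N_thrpb;  /* enough blocks to cover the N local values */
size_t sizea   = N*sizeof(float);            /* per-device part of the result array */
size_t sizeco  = (nord+1)*sizeof(float);     /* coefficients c_0 ... c_nord */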
Code Example for powerseries in File
powser_cuda.cu
__global__ void ps_d( int n, int offset, double eps, float *a ) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
if( i<n ) {
float x=(offset+i)*eps;
a[i] = powerseries(x);
}
}
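The device function powerseries and the coefficient array coeff are not shown on this slide; a minimal sketch of what they could look like (the constant-memory layout, the maximum order NORDMAX, the order variable nord_d and the Horner scheme are assumptions):

#define NORDMAX 256                   /* assumed maximum series order */
__constant__ float coeff[NORDMAX];    /* filled by cudaMemcpyToSymbol(coeff, coeff_h, ...) */
__constant__ int   nord_d;            /* series order, copied from the host variable nord */
int nord = 160;                       /* host-side order, declared extern in powser_mpi.c */

__device__ float powerseries( float x )
{
  /* Horner scheme: (...(c_nord*x + c_{nord-1})*x + ...)*x + c_0 */
  float s = coeff[nord_d];
  for( int k = nord_d-1; k >= 0; k-- ) s = s*x + coeff[k];
  return s;
}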
Task me controls the GPU number me_on of its node's GPUs by calling the function
powser_by_device(me_on, ... )
from the CUDA part of the program in file powser_cuda.cu
Code Example for powerseries in File powser_mpi.c
Complete code in ~ohaan/cuda_kurs/multiple_gpus/powser_mpi.c
#include <mpi.h>
extern int nord;
int main(void){
int N = 10000000, nd, me; float a[N];
// initialize MPI (nd, me, me_on) and coefficients coeff_h
...
// Define first[nd+1] and N_loc[nd] for distributing N values onto nd devices
// Every task calculates its local values for its own device
float *al; al = (float *)malloc(N_loc[me]*sizeof(float));
powser_by_device(me_on,N_loc[me],first[me],eps,al,coeff_h);
// gather local values in al into a in task 0
MPI_Gatherv(al,N_loc[me],MPI_FLOAT,a,N_loc,first,MPI_FLOAT,0,MPI_COMM_WORLD);
MPI_Finalize();
}
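The distribution arrays hidden behind the comment above could be filled with an even block distribution like this (the remainder handling is an assumption):

int first[nd+1], N_loc[nd];
first[0] = 0;
for( int i = 0; i < nd; i++ ) {
  N_loc[i]   = N/nd + ( i < N%nd ? 1 : 0 );   /* spread the remainder over the first tasks */
  first[i+1] = first[i] + N_loc[i];           /* first[i] = global start index of task i   */
}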
Compile and Link:
Multiple GPUs on Multiple Nodes
• Load environment:
  > module load openmpi cuda/10.2
• compile CUDA code:
  > nvcc -O3 -c powser_cuda.cu
• compile C code:
  > mpicc -O3 -c powser_mpi.c
• link step:
  > mpicc powser_cuda.o powser_mpi.o -lcudart -o powser_mpi.exe
• run:
  > mpirun ./powser_mpi.exe
Job script: ~ohaan/cuda_kurs/multiple_gpus/job.script
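The referenced job.script is not reproduced here; a minimal Slurm script consistent with the options used later in this course could look like this (node count, partition defaults and time limit are placeholders):

#!/bin/bash
#SBATCH -N 2
#SBATCH --tasks-per-node=1
#SBATCH --gpus-per-node=1
#SBATCH -t 00:10:00

module load openmpi cuda/10.2
mpirun ./powser_mpi.exe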
Performance with Multiple GPUs (MPI)
N: 10000000, series order: 160, gpus: 1
hello 1 0 dge003
on gpu 0>: N = 10000000, offs = 0
time for allocate 2.3e-01, copy coeff = 2.7e-05, powser = 9.8e-04, copy a = 1.2e-02
speed = 4893
Border Exchange between GPUs in one Node

Each OpenMP thread tid first copies the first and the last interior row of its device array v
into the shared host buffers cpf and cpl; after a barrier it copies its neighbours' boundary
rows from these buffers into the halo rows of its own v:

cudaMemcpy(&cpf[tid*n2],&v[1+n2+2],sil,...);
cudaMemcpy(&cpl[tid*n2],&v[1+n1*(n2+2)],sil,...);
ompbarrier();
if (tid == 0)
  { cudaMemcpy(&v[1+(n1+1)*(n2+2)],&cpf[(tid+1)*n2],sil,...); }
else if (tid == n_thrds-1)
  { cudaMemcpy(&v[1],&cpl[(tid-1)*n2],sil,...); }
else
  { cudaMemcpy(&v[1+(n1+1)*(n2+2)],&cpf[(tid+1)*n2],sil,...);
    cudaMemcpy(&v[1],&cpl[(tid-1)*n2],sil,...); }

[Figure: the device memories v of the participating GPUs connected through the host shared-memory buffers cpf and cpl]

Complete code in ~ohaan/cuda_kurs/multiple_gpus/diff_cuda.cu
Build with: make diff_omp in directory ~ohaan/cuda_kurs/multiple_gpus
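The host-side setup around this exchange is not on the slide; a rough sketch of what it could look like, assuming float data, n_thrds OpenMP threads and an (n1+2) x (n2+2) grid with halo rows on every device (all names and sizes beyond those used above are assumptions):

int    n_thrds = 2, n1 = 1024, n2 = 1024, n_iter = 100;
size_t sil = n2*sizeof(float);                              /* one interior row            */
float *cpf = (float *) malloc( n_thrds*n2*sizeof(float) );  /* first interior rows (host)  */
float *cpl = (float *) malloc( n_thrds*n2*sizeof(float) );  /* last interior rows (host)   */

omp_set_num_threads( n_thrds );
#pragma omp parallel default(shared)
{
  int tid = omp_get_thread_num();
  float *v;                                                 /* per-device grid incl. halos */
  cudaSetDevice( tid );
  cudaMalloc( (void **) &v, (n1+2)*(n2+2)*sizeof(float) );
  for( int it = 0; it < n_iter; it++ ) {
    /* launch the update kernel for the interior rows on this device,
       then exchange the borders with the cudaMemcpy calls shown above */
  }
  cudaFree( v );
}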
Numerical Integration

Example: compute $\pi = \int_0^1 \frac{4}{1+x^2}\,dx$
Divide the integration domain $[l,h]$ into $n$ strips of width $\Delta = (h-l)/n$:

$I = \int_l^h f(x)\,dx = \sum_{i=1}^{n} I_i \,, \qquad I_i = \int_{x_{i-1}}^{x_i} f(x)\,dx \,, \qquad x_0 = l\,,\; x_n = h\,,\; x_i = l + i\,\Delta$

[Figure: the strips $I_1, I_2, \dots, I_n$ over the points $x_0, x_1, x_2, \dots, x_n$]
Approximation: $I = A + R$. For a monotonically decreasing integrand such as $4/(1+x^2)$ the remainder is bounded by $|R| \le \frac{(h-l)\,(f(l)-f(h))}{2n}$, so choosing

$n = \frac{1}{2E}\,(h-l)\,\bigl(f(l)-f(h)\bigr) \;\Rightarrow\; R \le E$
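As a worked instance of this bound for the example $f(x) = 4/(1+x^2)$ on $[0,1]$: $f(0)-f(1) = 4-2 = 2$ and $h-l = 1$, so $n = 1/E$; an accuracy of $E = 10^{-6}$ therefore needs $n = 10^6$ strips.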
CUDA device function

Each strip $[a,b]$ is subdivided into $n_{in}$ sub-strips that are integrated by a CUDA device function:

$\int_a^b f(x)\,dx = \sum_{i=0}^{n_{in}-1} \int_{a_i}^{b_i} f(x)\,dx \,, \qquad a_i = a + i\,\frac{b-a}{n_{in}} \,, \quad b_i = a_i + \frac{b-a}{n_{in}}$

[Figure: the interval $[a,b]$ split into sub-strips $i = 1, 2, 3, \dots, n_{in}$]
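The device routine itself is only named on the slide; a minimal sketch of how such a routine could look, using the trapezoidal rule over the nin sub-strips (the function names, the quadrature rule and the kernel around it are assumptions, not the contents of piprec_cuda.cu):

__device__ double f_d( double x ) { return 4.0/(1.0 + x*x); }   /* example integrand */

/* integrate f over the strip [a,b] with nin trapezoidal sub-strips */
__device__ double strip_integral( double a, double b, int nin )
{
  double d = (b - a)/nin, s = 0.5*( f_d(a) + f_d(b) );
  for( int i = 1; i < nin; i++ ) s += f_d( a + i*d );
  return s*d;
}

/* one thread per strip; the per-strip results in res are summed afterwards */
__global__ void strips_d( double l, double h, int n, int nin, double *res )
{
  int i = threadIdx.x + blockDim.x * blockIdx.x;
  if( i < n ) {
    double d = (h - l)/n;
    res[i] = strip_integral( l + i*d, l + (i+1)*d, nin );
  }
}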
The Farmer-Worker Program
piprec_mpi.c
Slurm configuration:
-N <nn> --tasks-per-node=<nt_pn> --gpus-per-node=<nd_pn>
gives 1 farmer, nn*nt_pn-1 workers, of which nn*nd_pn are GPU workers
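A minimal sketch of the farmer-worker dispatch, assuming a self-scheduling scheme in which idle workers request strip indices from the farmer (the rank of the farmer, the tags and the function names are assumptions, not the actual piprec_mpi.c):

#include <mpi.h>
#include <stdio.h>

#define TAG_WORK 1
#define TAG_STOP 2

int main( int argc, char **argv )
{
  int np, me, nstrips = 100;
  MPI_Init( &argc, &argv );
  MPI_Comm_size( MPI_COMM_WORLD, &np );
  MPI_Comm_rank( MPI_COMM_WORLD, &me );
  int farmer = np - 1;                         /* assumption: last rank is the farmer */

  if( me == farmer ) {                         /* farmer: hand out strip indices on demand */
    int next = 1, dummy;
    MPI_Status st;
    for( int done = 0; done < np-1; ) {
      MPI_Recv( &dummy, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &st );
      if( next <= nstrips ) {
        MPI_Send( &next, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD );
        next++;
      } else {
        MPI_Send( &next, 1, MPI_INT, st.MPI_SOURCE, TAG_STOP, MPI_COMM_WORLD );
        done++;
      }
    }
  } else {                                     /* worker: request strips until stopped */
    int is, req = 0;
    MPI_Status st;
    while( 1 ) {
      MPI_Send( &req, 1, MPI_INT, farmer, TAG_WORK, MPI_COMM_WORLD );
      MPI_Recv( &is, 1, MPI_INT, farmer, MPI_ANY_TAG, MPI_COMM_WORLD, &st );
      if( st.MPI_TAG == TAG_STOP ) break;
      /* integrate strip is, on the GPU for GPU workers, on the CPU otherwise */
      printf( "worker %d got strip %d\n", me, is );
    }
    /* partial sums would be combined with MPI_Reduce */
  }
  MPI_Finalize();
  return 0;
}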
cudaSetDevice(did);
...
return 0;
}
Complete code in ~ohaan/cuda_kurs/multiple_gpus/piprec_cuda.cu
Output from run with SLURM options
nn=2, tasks-per-node=4, gpus-per-node=2
1 1 >> ils: 5 9 12 15 18 22 24 29 32 36 40 44 48
53 56 61 66 69 74 77 82 85 90 93 100
5 1 >> ils: 3 10 11 16 19 21 25 28 33 37 41 45 49
52 57 60 65 70 73 78 81 86 89 94 99
0 0 >> ils: 6 8 13 14 17 20 23 26 30 34 38 43 46
51 54 58 63 68 71 76 79 84 87 92 95 98
4 0 >> ils: 1 27 31 35 39 42 47 50 55 59 62 67 72
75 80 83 88 91 96 97
2 2 >> ils: 7 64
3 3 >> ils: 4
6 2 >> ils: 2