
Using Multiple GPUs

Oswald Haan
[email protected]
A Typical Compute-Cluster Configuration
with Multiple GPUs
[Diagram: two nodes connected by a high speed interconnect; each node contains CPU cores with shared host memory and several GPUs, each with its own device memory and streaming multiprocessors (SMs).]
Using Multiple GPUs in a Single Node
• start as many OpenMP threads in the node as GPUs are to be used
• each thread selects a different GPU and controls all activities of this
  device independently of the activities of the other devices
• communication between different GPUs proceeds in two steps via host memory
  (see the sketch below)

[Diagram: one node with CPU cores and shared host memory, plus two GPUs, each with its own device memory and SMs.]
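A minimal sketch of this two-step pattern, assuming two OpenMP threads that each own one GPU and a host buffer large enough to hold both partial buffers (the function and variable names are illustrative, not from the course code):

#include <omp.h>
#include <cuda_runtime.h>

void exchange_two_gpus(float *host_buf, int n)   /* host_buf holds 2*n floats */
{
    #pragma omp parallel num_threads(2)
    {
        int did = omp_get_thread_num();
        cudaSetDevice(did);

        float *d_buf;
        cudaMalloc((void **)&d_buf, n * sizeof(float));

        /* step 1: each device copies its data into its part of the shared host buffer */
        cudaMemcpy(host_buf + did * n, d_buf, n * sizeof(float),
                   cudaMemcpyDeviceToHost);

        #pragma omp barrier      /* wait until both device-to-host copies are done */

        /* step 2: each device reads the other device's data back from host memory */
        cudaMemcpy(d_buf, host_buf + (1 - did) * n, n * sizeof(float),
                   cudaMemcpyHostToDevice);

        cudaFree(d_buf);
    }
}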
Using Multiple GPUs in Multiple Nodes
• start as many MPI tasks on different nodes as GPUs are to be used
• each MPI task controls a GPU on a different node
• communication between different GPUs proceeds in three steps: device → host,
  MPI transfer between the hosts, host → device (see the sketch below)
[Diagram: two nodes connected by a high speed interconnect; each node has CPU cores, host memory and one GPU with its own device memory and SMs.]
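A minimal sketch of the three-step pattern, assuming two MPI tasks on different nodes with one GPU each (buffer size, tag and function name are illustrative, not from the course code):

#include <stdlib.h>
#include <mpi.h>
#include <cuda_runtime.h>

void exchange_two_nodes(float *d_send, float *d_recv, int n, int me)
{
    float *h_buf = (float *)malloc(n * sizeof(float));
    int other = 1 - me, tag = 0;

    /* step 1: device memory -> host memory of the sending node */
    cudaMemcpy(h_buf, d_send, n * sizeof(float), cudaMemcpyDeviceToHost);

    /* step 2: host -> host transfer over the high speed interconnect via MPI */
    MPI_Sendrecv_replace(h_buf, n, MPI_FLOAT, other, tag,
                         other, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* step 3: host memory of the receiving node -> device memory */
    cudaMemcpy(d_recv, h_buf, n * sizeof(float), cudaMemcpyHostToDevice);

    free(h_buf);
}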
Separate Files for Different Code
• File code_cuda.cu contains all program parts with CUDA code
• File code_par.c contains all program parts for managing OpenMP threads
  or MPI tasks, respectively
• compile/link in two steps:
1. compile code_cuda.cu (nvcc uses the C++ host compiler)
nvcc -arch=sm_35 -c code_cuda.cu
2. compile and link code_par.c (gcc and mpicc use the C compiler)
gcc -fopenmp code_cuda.o code_par.c -lcudart -o code.exe   or
mpicc code_cuda.o code_par.c -lcudart -o code.exe

Important
All functions in code_cuda.cu that are called from code_par.c have to be
declared with the qualifier extern "C" (see the example below).
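A sketch of what this declaration looks like, using the powser_by_device signature from the example later in these slides:

/* in code_cuda.cu: definition with C linkage, so the C-compiled object can link to it */
extern "C" void powser_by_device(int did, int N, int offset, double eps,
                                 float *a_h, float *coeff_h)
{
   /* ... CUDA code ... */
}

/* in code_par.c: plain C prototype of the same function */
void powser_by_device(int did, int N, int offset, double eps,
                      float *a_h, float *coeff_h);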
Example: Power Series with Multiple Devices
in One Node, Using OpenMP
In the OpenMP part of the program, set the number of devices to be used:
int nd = 2;
and generate nd threads with OpenMP:
omp_set_num_threads(nd);
#pragma omp parallel default(shared)
{
   int did = omp_get_thread_num();
   ...
   powser_by_device(did, ... );
}
Note: variables declared inside a parallel section are thread-local.

In the CUDA part of the program, select a device according to the thread number did:
cudaSetDevice(did);
Power Series with Multiple Devices:
Work Distribution
Distribute N values onto nd devices

• Define N_loc = integer part of (N+nd-1)/nd


• Device did works on indices ranging from:
did*N_loc to min((did+1)*N_loc , N)-1

[Diagram: the index range 0 … N-1 split into blocks of length N_loc starting at 0, N_loc, 2*N_loc, …, (nd-1)*N_loc, assigned to devices 0, 1, …, nd-1.]
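A small host-only sketch (not part of the course code) that prints the resulting index range for every device:

#include <stdio.h>

int main(void)
{
    int N = 10000000, nd = 4;
    int N_loc = (N + nd - 1) / nd;          /* ceiling of N/nd */

    for (int did = 0; did < nd; did++) {
        int first = did * N_loc;
        int last  = (did + 1) * N_loc < N ? (did + 1) * N_loc - 1 : N - 1;
        printf("device %d: indices %d ... %d\n", did, first, last);
    }
    return 0;
}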


Code Example for powerseries in File
powser_omp.c
#include <omp.h>
int main(void){
   extern int nord;
   int N = 10000000, nd = 4, N_loc, first[nd+1];
   float a[N];
   double eps = 0.5/(double)N;
   // initialize coefficients coeff_h
   ...
   N_loc = (N+nd-1)/nd;
   omp_set_num_threads(nd);
   #pragma omp parallel default(shared)
   {
      int did = omp_get_thread_num();
      // specify index range for gpu number did
      int locN = N_loc; if (did==nd-1) locN = N-(nd-1)*N_loc;
      int offset = did*N_loc;
      powser_by_device(did, locN, offset, eps, a, coeff_h);
   }
}

Note: all variables are shared by all threads; only variables declared inside
the parallel region are thread-local.

Complete code in
~ohaan/cuda_kurs/multiple_gpus/powser_omp.c
Code Example for powerseries in File
powser_cuda.cu
extern const int nord=160;
__device__ __constant__ float coeff[nord];
extern "C" void powser_by_device(int did, int N, int offset, double eps,
float *a_h, float *coeff_h)
{
int N_thrpb =1024, N_blks = (N+N_thrpb-1)/N_thrpb;
float *a_d;
size_t sizea = N*sizeof(float), sizeco = nord*sizeof(float);

cudaSetDevice(did);
cudaMalloc((void **) &a_d, sizea);
cudaMemcpyToSymbol(coeff, coeff_h, sizeco, 0, cudaMemcpyHostToDevice);
ps_d<<<N_blks,N_thrpb>>>(N,offset,eps,a_d);
cudaMemcpy(&a_h[offset], a_d, sizea, cudaMemcpyDeviceToHost);
cudaFree(a_d);
}
Code Example for powerseries in File
powser_cuda.cu
__device__ float powerseries(float x);   // forward declaration; defined below

__global__ void ps_d( int n, int offset, double eps, float *a ) {
   int i = threadIdx.x + blockDim.x * blockIdx.x;
   if( i<n ) {
      float x=(offset+i)*eps;
      a[i] = powerseries(x);
   }
}

__device__ float powerseries(float x) {
   float ps, xi;
   int i;
   ps = coeff[0]; xi = 1.0;
   for(i=1; i<nord; i++) {
      xi = xi*x;
      ps = ps + xi*coeff[i];
   }
   return ps;
}

Complete code in
~ohaan/cuda_kurs/multiple_gpus/powser_cuda.cu
Performance with Multiple GPUs (OpenMP)
make powser_omp in directory ~ohaan/cuda_kurs/multiple_gpus

N: 10000000, series order: 160, gpus: 1


on gpu 0>: N = 10000000, offs = 0
time for allocate 2.5e-01, copy coeff = 2.8e-05, powser = 9.8e-04, copy a = 1.3e-02
speed = 4874

N: 10000000, series order: 160, gpus: 2


on gpu 1>: N = 5000000, offs = 5000000
time for allocate 4.5e-01, copy coeff = 3.2e-05, powser = 5.2e-04, copy a = 6.7e-03
speed = 4643
on gpu 0>: N = 5000000, offs = 0
time for allocate 4.7e-01, copy coeff = 1.1e-05, powser = 4.9e-04, copy a = 6.5e-03
speed = 4870
Example: Power Series with nd Devices,
on nn Nodes, Using MPI
Resources allocated with Slurm options (with nd >= nn)
-N <nn>
-G <nd>
--cpus-per-gpu=1
The total number of tasks and devices, nd, is returned by
MPI_Comm_size(MPI_COMM_WORLD,&nd);
Each task has a unique identifier me, with values from 0 to nd-1, returned by
MPI_Comm_rank(MPI_COMM_WORLD,&me);
The number of tasks and devices in node i, i = 0,…,nn-1, is nd_i;
each local task has a unique on-node identifier me_on with values from 0 to nd_i-1.

How can the value of me_on corresponding to me be determined?


MPI Function MPI_Comm_split_type
MPI groups the tasks residing in the same node into a node-local communicator by the call
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
MPI_INFO_NULL, &comm_on);
The local on-node task id me_on corresponding to the global task id me is then
obtained by a call to
MPI_Comm_rank(comm_on, &me_on);
Example: Power Series with nd Devices,
and nn Nodes, Using MPI
In the MPI part of the program, the MPI environment is set up by
MPI_Init(NULL,NULL);
MPI_Comm_size(MPI_COMM_WORLD,&nd);
MPI_Comm_rank(MPI_COMM_WORLD,&me);
MPI_Comm comm_on;
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
MPI_INFO_NULL, &comm_on);
MPI_Comm_rank(comm_on, &me_on);

Task me will control GPU number me_on among its node's GPUs by calling the function
powser_by_device(me_on, ... )
from the CUDA part of the program in file powser_cuda.cu
Code Example for powerseries in File
powser_mpi.c
Complete code in
~ohaan/cuda_kurs/multiple_gpus/powser_mpi.c
#include <mpi.h>
extern int nord;

int main(void){
int N = 10000000, nd, me; float a[N];
// initialize MPI (nd, me, me_on) and coefficients coeff_h
...
// Define first[nd+1] and N_loc[nd] for distributing N values onto nd devices
// Every task calculates its local values for its own device
float *al; al = (float *)malloc(N_loc[me]*sizeof(float));
powser_by_device(me_on,N_loc[me],first[me],eps,al,coeff_h);
// gather local values in al into a in task 0
MPI_Gatherv(al,N_loc[me],MPI_FLOAT,a,N_loc,first,MPI_FLOAT,0,MPI_COMM_WORLD);

MPI_Finalize();
}
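One possible way to fill first[] and N_loc[] so that they match the MPI_Gatherv call above (a sketch placed after MPI_Comm_size has set nd; the actual definition is in the complete file):

int blk = (N + nd - 1) / nd;          // ceiling of N/nd
int first[nd+1], N_loc[nd];           // variable-length arrays; nd is known here
for (int i = 0; i < nd; i++) {
    first[i] = i * blk;                            // offset of task i's block
    N_loc[i] = (i < nd-1) ? blk : N - (nd-1)*blk;  // last task gets the remainder
}
first[nd] = N;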
Compile and Link:
Multiple GPUs on Multiple Nodes
• Load environment:
> module load openmpi cuda/10.2
• compile CUDA code
> nvcc -O3 -c powser_cuda.cu
• compile C code
> mpicc -O3 -c powser_mpi.c
• link step
> mpicc powser_cuda.o powser_mpi.o -lcudart -o powser_mpi.exe

(make powser_mpi in directory ~ohaan/cuda_kurs/multiple_gpus)


Slurm Script for job submission:
Multiple GPUs on Multiple Nodes
#!/bin/bash
#SBATCH -p gpu
#SBATCH -t 10:00
#SBATCH -N 2
#SBATCH --tasks-per-node=2
##SBATCH --reservation=gpu-course
#SBATCH --gpus-per-node=2

mpirun ./powser_mpi.exe
~ohaan/cuda_kurs/multiple_gpus/job.script
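The script is submitted with sbatch job.script; Slurm allocates two nodes with two GPUs each, and mpirun starts the four MPI tasks (two per node).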
Performance with Multiple GPUs (MPI)
N: 10000000, series order: 160, gpus: 1
hello 1 0 dge003
on gpu 0>: N = 10000000, offs = 0
time for allocate 2.3e-01, copy coeff = 2.7e-05, powser = 9.8e-04, copy a = 1.2e-02
speed = 4893

N: 10000000, series order: 160, gpus: 2


hello 2 0 dge003
hello 2 1 dge006
on gpu 0>: N = 5000000, offs = 0
time for allocate 2.3e-01, copy coeff = 2.6e-05, powser = 5.1e-04, copy a = 5.3e-03
speed = 4726
on gpu 0>: N = 5000000, offs = 5000000
time for allocate 2.4e-01, copy coeff = 2.7e-05, powser = 5.1e-04, copy a = 5.7e-03
speed = 4715
Performance with Multiple GPUs (MPI)
N: 10000000, series order: 160, gpus: 4
hello 4 1 dge003
hello 4 0 dge003
hello 4 2 dge006
hello 4 3 dge006

on gpu 1>: N = 2500000, offs = 2500000


time for allocate 4.3e-01, copy coeff = 6.7e-05, powser = 2.9e-04, copy a = 2.3e-03
speed = 4149
on gpu 0>: N = 2500000, offs = 0
time for allocate 4.3e-01, copy coeff = 3.1e-05, powser = 2.7e-04, copy a = 2.5e-03
speed = 4458
on gpu 1>: N = 2500000, offs = 7500000
time for allocate 4.4e-01, copy coeff = 7.0e-05, powser = 2.9e-04, copy a = 2.5e-03
speed = 4136
on gpu 0>: N = 2500000, offs = 5000000
time for allocate 4.4e-01, copy coeff = 3.4e-05, powser = 2.7e-04, copy a = 3.0e-03
speed = 4446
Diffusion on Multiple GPUs
Border Exchange between GPUs in one Node
Each OpenMP thread tid copies the first and the last inner row of its device field v
into the shared host buffers cpf and cpl, waits at a barrier, and then copies the
neighbouring threads' rows from host memory into the halo rows of v:

// device memory -> shared host memory (cpf: first inner rows, cpl: last inner rows)
cudaMemcpy(&cpf[tid*n2], &v[1+n2+2],      sil, ...);
cudaMemcpy(&cpl[tid*n2], &v[1+n1*(n2+2)], sil, ...);

ompbarrier();

// shared host memory -> halo rows of v in device memory
if (tid == 0)
  { cudaMemcpy(&v[1+(n1+1)*(n2+2)], &cpf[(tid+1)*n2], sil, ...); }
else if (tid == n_thrds-1)
  { cudaMemcpy(&v[1],               &cpl[(tid-1)*n2], sil, ...); }
else
  { cudaMemcpy(&v[1+(n1+1)*(n2+2)], &cpf[(tid+1)*n2], sil, ...);
    cudaMemcpy(&v[1],               &cpl[(tid-1)*n2], sil, ...); }
Complete code in
~ohaan/cuda_kurs/multiple_gpus/diff_cuda.cu
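A self-contained sketch of the structure around this exchange (the grid sizes, the number of GPUs and the explicit #pragma omp barrier in place of ompbarrier() are assumptions, not the course code):

#include <omp.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    int nd = 2;                       /* number of GPUs / threads (assumption)   */
    int n1 = 1000, n2 = 1000;         /* inner grid size per device (assumption) */
    size_t sil = n2 * sizeof(float);  /* size of one inner row                   */

    /* host buffers shared by all threads: first and last inner row of every block */
    float *cpf = (float *)malloc(nd * n2 * sizeof(float));
    float *cpl = (float *)malloc(nd * n2 * sizeof(float));

    omp_set_num_threads(nd);
    #pragma omp parallel default(shared)
    {
        int tid = omp_get_thread_num();
        cudaSetDevice(tid);

        float *v;                     /* device field including halo rows */
        cudaMalloc((void **)&v, (n1 + 2) * (n2 + 2) * sizeof(float));

        /* phase 1: device -> shared host buffers */
        cudaMemcpy(&cpf[tid * n2], &v[1 + n2 + 2],        sil, cudaMemcpyDeviceToHost);
        cudaMemcpy(&cpl[tid * n2], &v[1 + n1 * (n2 + 2)], sil, cudaMemcpyDeviceToHost);

        #pragma omp barrier           /* all rows must have arrived in host memory */

        /* phase 2: neighbours' rows -> own halo rows */
        if (tid < nd - 1)
            cudaMemcpy(&v[1 + (n1 + 1) * (n2 + 2)], &cpf[(tid + 1) * n2],
                       sil, cudaMemcpyHostToDevice);
        if (tid > 0)
            cudaMemcpy(&v[1], &cpl[(tid - 1) * n2], sil, cudaMemcpyHostToDevice);

        cudaFree(v);
    }
    free(cpf); free(cpl);
    return 0;
}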
Border Exchange between GPUs in one Node
make diff_omp in directory ~ohaan/cuda_kurs/multiple_gpus

n_thrds = 1, nt = 500, n = 20000


0 > alloc :0.221725
0 > copy field to device :0.524632
0 > update :13.977505
0 > copy field to host :0.245590
Time, compute speed host [Mflop/s]: 14.982562 160186.218750
n_thrds = 2, nt = 500, n = 20000
0 > alloc :0.434437
0 > copy field to device :0.221996
0 > update :7.101686
0 > copy field to host :0.062721
1 > alloc :0.435305
1 > copy field to device :0.277471
1 > update :7.045307
1 > copy field to host :0.062992
Time, compute speed host [Mflop/s]: 7.865795 305118.531250
Calculating p

Area of a quarter circle = π/4

    π = 4 · ∫_0^1 √(1 − x²) dx
Numerical Integration

Divide the integration domain [l, h] into n strips of width Δ = (h − l)/n:

    I = ∫_l^h f(x) dx = Σ_{i=1..n} I_i ,    I_i = ∫_{x_{i-1}}^{x_i} f(x) dx

    x_0 = l ,  x_n = h ,  x_i = l + i·Δ

[Figure: strips I_1, I_2, …, I_n between x_0, x_1, x_2, …, x_n under the graph of f.]
Approximation: I = A + R

I_i : exact area of strip i

Upper and lower estimates (f decreasing on [l, h]):

    I_i^> = Δ·f(x_{i-1}) ≥ I_i ,    I_i^< = Δ·f(x_i) ≤ I_i

Approximation A and remainder R:

    A = (1/2) · Σ_{i=1..n} ( I_i^> + I_i^< )

    R ≤ (1/2) · Σ_{i=1..n} ( I_i^> − I_i^< ) = (Δ/2) · ( f(l) − f(h) )

Choosing   n = (h − l) · ( f(l) − f(h) ) / (2E)   guarantees   R ≤ E.
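A small CPU sketch (not from the course code) of this scheme for f(x) = √(1 − x²) on [l, h]: choose n from the error bound E, then sum the averaged upper/lower strip estimates:

#include <math.h>
#include <stdio.h>

int main(void)
{
    double l = 0.0, h = 1.0, E = 1e-6;               /* prescribed error bound */
    double fl = sqrt(1.0 - l*l), fh = sqrt(1.0 - h*h);

    int n = (int)((h - l) * (fl - fh) / (2.0 * E)) + 1;  /* strips for R <= E */
    double del = (h - l) / n;

    double A = 0.0;
    for (int i = 0; i < n; i++) {
        double xl = l + i * del, xh = xl + del;
        A += 0.5 * (sqrt(1.0 - xl*xl) + sqrt(1.0 - xh*xh));
    }
    A *= del;                                        /* area of the quarter circle */
    printf("pi approx = %.8f (n = %d strips)\n", 4.0 * A, n);
    return 0;
}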
CUDA device function

__global__ void numint_d( int N, float offs, float eps, float *pres ) {
   int i = threadIdx.x + blockIdx.x*blockDim.x;
   float xl, xh, fl, fh, flh;
   if (i==0) *pres = 0.;                  // thread 0 clears the accumulator
   if (i<N) {
      xl = offs + i*eps; xh = xl + eps;   // left and right edge of strip i
      fl = sqrt(1.-xl*xl); fh = sqrt(1.-xh*xh);
      flh = 0.5*(fl+fh);                  // averaged strip estimate (factor eps applied on the host)
      atomicAdd( pres, flh);              // accumulate the strip contributions
   }
}
Complete code in
~ohaan/cuda_kurs/multiple_gpus/piprec_cuda.cu
CUDA host function
extern "C" int numprec_d(int did, float lo, float hi,
                         float prec, float *pres, int *pn){
   float *pres_d, del, flo, fhi; int n, blksz = 1024;
   flo = sqrt(1.-lo*lo); fhi = sqrt(1.-hi*hi);
   n = 0.5*(hi-lo)*(flo-fhi)/prec; *pn = n;   // n: number of strips for required precision prec
   del = (hi-lo)/((double) n);
   int N_blk = (n+blksz-1)/blksz;
   cudaSetDevice(did);
   cudaMalloc((void **) &pres_d, sizeof(float));
   numint_d<<<N_blk,blksz>>>(n,lo,del,pres_d);
   cudaMemcpy(pres, pres_d, sizeof(float), cudaMemcpyDeviceToHost);
   cudaFree(pres_d);
   *pres = *pres *4.*del;
   return 0;
}

Complete code in
~ohaan/cuda_kurs/multiple_gpus/piprec_cuda.cu
Numerical Approximation with Prescribed Precision prec
    ∫_a^b f(x) dx = Σ_{i=0..nin-1} ∫_{a_i}^{b_i} f(x) dx

    a_i = a + i · (b − a)/nin ,    b_i = a_i + (b − a)/nin

Divide the integration domain [a, b] into nin intervals of size (b − a)/nin.

In each interval i, approximate the integral using the CUDA function
numprec_d(did, lo, hi, pr, *pres, *pn);
with lo = a_i, hi = b_i, pr = prec/nin.

The approximation for each interval is returned at address pres.

The number of strips, 0.5·(b_i − a_i)·(f(a_i) − f(b_i))/pr, is returned at address pn.
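For illustration, a serial driver along these lines (a sketch; the course code instead distributes the intervals with MPI, as shown below) sums the nin partial results:

int numprec_d(int did, float lo, float hi, float prec, float *pres, int *pn);

int main(void)
{
    int   nin = 100, n;
    float prec = 1.e-4f, del = 1.f / nin;           /* integration domain [0,1]     */
    float pia = 0.f, res;

    for (int iw = 1; iw <= nin; iw++) {
        float lo = 1.f - iw * del, hi = lo + del;   /* interval mapping as in the worker code */
        numprec_d(0, lo, hi, prec * del, &res, &n); /* device 0, precision prec/nin */
        pia += res;
    }
    /* pia now approximates pi */
    return 0;
}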
Distribute nin evaluations of numprec_d to np processes

[Diagram: the nin interval evaluations i = 1, 2, 3, …, nin distributed over processes 0, 1, …, np-1.]

Problem: unequal workload for different evaluations


workload not known at run time
Distribute Work to Idle Workers

Farmer (me = np-1):
    tres = 0
    for iw = 1 , nintv + (np-1):
        recv res from any task
        tres = tres + res
        ipw = status(mpi_source)
        send iw to ipw          (a value iw > nintv tells the worker to stop)

Worker (me < np-1):
    res = 0
    send res to farmer
    for i = 1 , nintv + 1:
        recv iw from farmer
        if iw > nintv: exit
        res = work(iw)
        send res to farmer
The Farmer-Worker Program
piprec_mpi.c
Slurm configuration:
-N <nn> --tasks-per-node=<nt_pn> --gpus-per-node=<nd_pn>
1 farmer, nn*nt_pn-1 workers, nn*nd_pn GPU workers

Identification of the global task number me and the on-node task number me_on:


MPI_Comm_size(MPI_COMM_WORLD,&np);
MPI_Comm_rank(MPI_COMM_WORLD,&me);
MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED,
0,MPI_INFO_NULL, &comm_on);
MPI_Comm_rank(comm_on, &me_on);
Complete code in
~ohaan/cuda_kurs/multiple_gpus/piprec_mpi.c
The Farmer-Part
if (me == np-1) {
   MPI_Barrier(MPI_COMM_WORLD);
   pia = 0.;
   for (iw = 1; iw <= nintv+(np-1); iw++){
      MPI_Recv(&res, 1, MPI_FLOAT, MPI_ANY_SOURCE,
               tag, MPI_COMM_WORLD, &stat);
      pia = pia + res;
      ipw = stat.MPI_SOURCE;
      MPI_Send(&iw, 1, MPI_INT, ipw,
               tag, MPI_COMM_WORLD);
   }
}
The Worker-Part
if (me < np-1) {
   res = 0.;
   MPI_Send(&res, 1, MPI_FLOAT, np-1, tag, MPI_COMM_WORLD);
   for (i = 1; i <= nintv+1; i++){
      MPI_Recv(&iw, 1, MPI_INT, np-1, tag, MPI_COMM_WORLD,
               &stat);
      if (iw > nintv) break;
      lo = 1. - iw*del; hi = lo + del;
      pr = prec*del;
      iex = numprec_d(me_on, lo, hi, pr, &res, &n);
      if (iex < 0) numprec(lo, hi, pr, &res, &n);   // CPU fallback if no GPU is available
      MPI_Send(&res, 1, MPI_FLOAT, np-1, tag, MPI_COMM_WORLD);
   }
}
The Worker-function numprec_d
int numprec_d(int did, float lo, float hi, float prec,
float *pres, int *pn){
float *pres_d, del, flo, fhi;
int nd, n, blksz = 1024;
cudaGetDeviceCount(&nd);
if (did>=nd) {return -1;}

cudaSetDevice(did);
...
return 0;
}

Complete code in
~ohaan/cuda_kurs/multiple_gpus/piprec_cuda.cu
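The CPU fallback numprec called by the workers is not shown in the slides; a possible implementation following the same strip scheme (a hypothetical sketch, not the course code) could be:

#include <math.h>

int numprec(float lo, float hi, float prec, float *pres, int *pn)
{
    float flo = sqrtf(1.f - lo*lo), fhi = sqrtf(1.f - hi*hi);
    int   n   = 0.5f * (hi - lo) * (flo - fhi) / prec;   /* strips for precision prec */
    if (n < 1) n = 1;
    float del = (hi - lo) / n, sum = 0.f;

    for (int i = 0; i < n; i++) {
        float xl = lo + i * del, xh = xl + del;
        sum += 0.5f * (sqrtf(1.f - xl*xl) + sqrtf(1.f - xh*xh));
    }
    *pres = 4.f * del * sum;   /* same scaling as in numprec_d */
    *pn   = n;
    return 0;
}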
Output from run with SLURM options
nn=2, tasks-per-node=4, gpus-per-node=2
1 1 >> ils: 5 9 12 15 18 22 24 29 32 36 40 44 48
53 56 61 66 69 74 77 82 85 90 93 100
5 1 >> ils: 3 10 11 16 19 21 25 28 33 37 41 45 49
52 57 60 65 70 73 78 81 86 89 94 99
0 0 >> ils: 6 8 13 14 17 20 23 26 30 34 38 43 46
51 54 58 63 68 71 76 79 84 87 92 95 98
4 0 >> ils: 1 27 31 35 39 42 47 50 55 59 62 67 72
75 80 83 88 91 96 97
2 2 >> ils: 7 64
3 3 >> ils: 4
6 2 >> ils: 2
