OpenACC (Spring 2017)
Shaohao Chen
Research Computing Services
Information Services and Technology
Boston University
Outline
• Introduction to GPU and OpenACC
• Basic syntax and the first OpenACC program: SAXPY
• Kernels vs. parallel directives
• An example: Laplace solver in OpenACC
The first try
Data transfer between GPU and CPU/host
Data race and the reduction clause
• GPU and OpenACC task granularities
GPU and GPGPU
• Originally, the graphics processing unit (GPU) was dedicated to manipulating computer graphics and image processing. Traditionally the GPU was known as the "video card".
• The GPU's highly parallel structure makes it efficient for parallel programs. Nowadays GPUs are used for tasks that were formerly the domain of CPUs, such as scientific computation. This kind of GPU is called a general-purpose GPU (GPGPU).
• In many cases, a parallel program runs faster on a GPU than on a CPU. Note that a serial program runs faster on a CPU than on a GPU.
• The most popular type of GPU in the high-performance computing world is the NVIDIA GPU. We will focus only on NVIDIA GPUs here.
GPU is an accelerator
• A GPU is a device in a CPU-based system; it is connected to the CPU through the PCI Express (PCIe) bus.
• Computer programs can be parallelized and accelerated on the GPU.
• The CPU and the GPU have separate memories, so data transfer between CPU and GPU memory is required when programming.
Three ways to accelerate applications on GPU:
  GPU-accelerated libraries
  Compiler directives (e.g. OpenACC)
  Programming languages (e.g. CUDA)
What is OpenACC
• OpenACC (for Open Accelerators) is a programming standard for parallel computing on accelerators (mostly NVIDIA GPUs).
• It is designed to simplify GPU programming.
• The basic approach is to insert special comments (directives) into the code so as to offload computation onto the GPU and parallelize the code at the level of GPU (CUDA) cores.
• It is possible for programmers to create an efficient parallel OpenACC code with only minor modifications to a serial CPU code.
What are compiler directives?
The directives tell the compiler or runtime to:
  Generate parallel code for the GPU
  Allocate GPU memory and copy input data to the GPU
  Execute the parallel code on the GPU
  Copy output data back to the CPU and deallocate GPU memory
• C
  #pragma acc directive [clause [[,] clause]…]
  often followed by a structured code block
• Fortran
  !$acc directive [clause [[,] clause]…]
  often paired with a matching end directive surrounding a structured code block:
  !$acc end directive
The first OpenACC program: SAXPY
Example: Compute a*x + y, where x and y are vectors, and a is a scalar.
C:
int main(int argc, char **argv){
  int N=1000;
  float a = 3.0f;
  float x[N], y[N];
  for (int i = 0; i < N; ++i) {
    x[i] = 2.0f;
    y[i] = 1.0f;
  }
  #pragma acc kernels
  for (int i = 0; i < N; ++i) {
    y[i] = a * x[i] + y[i];
  }
}

Fortran:
program main
  integer :: n=1000, i
  real :: a=3.0
  real, allocatable :: x(:), y(:)
  allocate(x(n),y(n))
  x(1:n)=2.0
  y(1:n)=1.0
  !$acc kernels
  do i=1,n
    y(i) = a * x(i) + y(i)
  enddo
  !$acc end kernels
end program main
Using OpenACC on BU SCC (1): Get GPU resources
• Log in to BU SCC:
% ssh <BU username>@<SCC login node>
• Request an interactive session with one CPU core and one GPU:
% qlogin -l gpus=1
Note: if the compiler cannot parallelize a loop, it generates a serial (scalar) kernel, which runs slower on the GPU than on the CPU (see the pointer-aliasing example below).
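• Compile and run (a minimal sketch; it assumes the PGI compilers are available through an environment module, and the module and file names here are placeholders):
% module load pgi
% pgcc -acc -Minfo=accel saxpy.c -o saxpy          (C; -Minfo=accel reports which loops were parallelized)
% pgfortran -acc -Minfo=accel saxpy.f90 -o saxpy   (Fortran)
% ./saxpy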
Pointer aliasing in C (1)
An improper version of the SAXPY code (using pointers):
Pointer aliasing: different pointers are allowed to access the same object, which may induce an implicit data dependency in a loop. In this case it is possible that the pointers x and y access the same object, so potentially there is a data dependency in the loop.

int N=1000;
float a = 3.0f;
float * x = (float*)malloc(N * sizeof(float));
float * y = (float*)malloc(N * sizeof(float));
for (int i = 0; i < N; ++i) {
  x[i] = 2.0f;
  y[i] = 1.0f;
}
#pragma acc kernels
for (int i = 0; i < N; ++i) {
  y[i] = a * x[i] + y[i];
}
Pointer aliasing in C (2)
The compiler refuses to parallelize a loop that involves pointer aliasing.
Compiling output of the improper SAXPY code:
……
20, Loop carried dependence of y-> prevents parallelization
Complex loop carried dependence of x-> prevents parallelization
Loop carried backward dependence of y-> prevents vectorization
Accelerator scalar kernel generated
Use restrict to avoid pointer aliasing
A proper version of the SAXPY code (using pointers):
To avoid pointer aliasing, use the keyword restrict. float * restrict ptr means: for the lifetime of the pointer ptr, only ptr itself or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points.

int N=1000;
float a = 3.0f;
float * x = (float*)malloc(N * sizeof(float));
float * restrict y = (float*)malloc(N * sizeof(float));
for (int i = 0; i < N; ++i) {
  x[i] = 2.0f;
  y[i] = 1.0f;
}
#pragma acc kernels
for (int i = 0; i < N; ++i) {
  y[i] = a * x[i] + y[i];
}
Parallel directive (1)
An improper version of the SAXPY code (using the parallel directive):

C:
#pragma acc parallel
for (int i = 0; i < N; ++i) {
  y[i] = a * x[i] + y[i];
}

Fortran:
!$acc parallel
do i=1,n
  y(i) = a*x(i)+y(i)
enddo
!$acc end parallel

The parallel directive tells the compiler to create a parallel region. But unlike the kernels region, the code in the parallel region (the loop in this case) is executed redundantly by all gangs. There is no work sharing among gangs!
Parallel directive (2)
A proper version of SAXPY code (using parallel loop directive):
C:
#pragma acc parallel loop
for (int i = 0; i < N; ++i) {
  y[i] = a * x[i] + y[i];
}

Fortran:
!$acc parallel loop
do i=1,n
  y(i) = a*x(i)+y(i)
enddo
!$acc end parallel loop
It is necessary to add the keyword loop to share the work (among gangs).
In Fortran, the keyword loop can be replaced by do here.
In C, the keyword loop can be replaced by for.
kernels vs. parallel (1)
kernels
• More implicit.
• Gives the compiler more freedom to find and map parallelism.
• Compiler performs parallel analysis and parallelizes what it believes safe.
parallel
• More explicit.
• Requires analysis by the programmer to ensure that parallelization is safe.
• Straightforward path from OpenMP
kernels vs. parallel (2)
Parallelize a code block with two loops:
kernels:
#pragma acc kernels
{
  for (i=0; i<n; i++)
    a[i] = 3.0f*(float)(i+1);
  for (i=0; i<n; i++)
    b[i] = 2.0f*a[i];
}

parallel:
#pragma acc parallel
{
  #pragma acc loop
  for (i=0; i<n; i++)
    a[i] = 3.0f*(float)(i+1);
  #pragma acc loop
  for (i=0; i<n; i++)
    b[i] = 2.0f*a[i];
}
Laplace Solver (1)
• Discretize the Laplacian with the finite-difference method and express the solution at each grid point as
  A_new(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )
• The solution at one point depends only on the four neighboring points.
Laplace Solver (2)
Starting from a serial code (in C or Fortran) that solves the two-dimensional Laplace equation, parallelize it using OpenACC directives. Then compare the performance of the serial code and the OpenACC code.
• Hints:
1. Find the “hot spots”, the most time-consuming parts of the code. Usually they are loops.
2. Analyze parallelism. Which loops are parallelizable?
3. Which directives should be used, and where should they be inserted?
Laplace Solver (OpenACC in C, version 1)
• The outer while loop is not parallelizable due to the data dependency between iterations.
• The nested for loops are parallelizable. Create a kernels region around each loop nest and ask the compiler to determine the parallelism and the data transfers.

while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++)
        for(j = 1; j <= COLUMNS; j++) {
            A_new[i][j] = 0.25 * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
        }

    dt = 0.0;

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++)
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(A_new[i][j]-A[i][j]), dt);
            A[i][j] = A_new[i][j];
        }

    iteration++;
}
Laplace Solver (OpenACC in Fortran, version 1)
• The outer do while loop is not parallelizable due to the data dependency between iterations.
• The nested do loops are parallelizable. Create a kernels region around each loop nest and ask the compiler to determine the parallelism and the data transfers.

do while ( dt > max_temp_error .and. iteration <= max_iterations )

    !$acc kernels
    do j=1,columns
        do i=1,rows
            A_new(i,j) = 0.25*( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )
        enddo
    enddo
    !$acc end kernels

    dt=0.0

    !$acc kernels
    do j=1,columns
        do i=1,rows
            dt = max( abs(A_new(i,j) - A(i,j)), dt )
            A(i,j) = A_new(i,j)
        enddo
    enddo
    !$acc end kernels

    iteration = iteration + 1
enddo
Analysis of performance (version 1)
The OpenACC code is much slower than the serial code. What went wrong? The arrays A and A_new are transferred between the host and the GPU at every kernels region, i.e. twice in every iteration of the while loop, and this data movement dominates the run time.
The data directive
• Syntax for C
  #pragma acc data copy(a[0:size]) copyin(b[0:size]) copyout(c[0:size]) create(d[0:size]) present(d[0:size])
• Syntax for Fortran
  !$acc data copy(a(0:size)) copyin(b(0:size)) copyout(c(0:size)) create(d(0:size)) present(d(0:size))
  !$acc end data
• copy transfers the data to the GPU at region entry and back to the host at region exit; copyin only transfers to the GPU; copyout only transfers back to the host; create allocates GPU memory without any transfer; present declares that the data is already on the GPU.
• If the compiler can determine the size of the arrays, it is unnecessary to specify it explicitly.
Laplace Solver (OpenACC in C, version 2)
• Create a data region around the while loop: A is copied to the GPU before the while loop starts and copied back to the host after the while loop ends; A_new is allocated directly in GPU memory and never needs to be copied to the host.
• The kernels regions inside the loop still parallelize the for loops, but they no longer cause any data transfer of the arrays.

#pragma acc data copy(A), create(A_new)
while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++)
        for(j = 1; j <= COLUMNS; j++) {
            A_new[i][j] = 0.25 * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
        }

    dt = 0.0;

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++)
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(A_new[i][j]-A[i][j]), dt);
            A[i][j] = A_new[i][j];
        }

    iteration++;
}
Laplace Solver (OpenACC in Fortran, version 2)
• Create a data region around the do while loop: A is copied to the GPU before the loop starts and copied back to the host after the loop ends; A_new is allocated directly in GPU memory and never needs to be copied to the host.
• The kernels regions inside the loop still parallelize the do loops, but they no longer cause any data transfer of the arrays.

!$acc data copy(A), create(A_new)
do while ( dt > max_temp_error .and. iteration <= max_iterations )

    !$acc kernels
    do j=1,columns
        do i=1,rows
            A_new(i,j) = 0.25*( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )
        enddo
    enddo
    !$acc end kernels

    dt=0.0

    !$acc kernels
    do j=1,columns
        do i=1,rows
            dt = max( abs(A_new(i,j) - A(i,j)), dt )
            A(i,j) = A_new(i,j)
        enddo
    enddo
    !$acc end kernels

    iteration = iteration + 1
enddo
!$acc end data
Profiling (version 2)
Set export PGI_ACC_TIME=1 to activate profiling, then run again.

• There are only 2 data movements of the arrays in total: one copyin before the while loop and one copyout after it.
• There are data movements for the variable dt in every iteration, but it is a scalar, not an array, so these transfers cost very little time.
• The total time for data movement is around 0.09 second (1,564 + 1,773 + 20,344 + 68,104 ≈ 91,785 microseconds), which is much smaller than the computing time (around 2.5 seconds)!

Profiling output:
time(us): 2,374,331
59: data region reached 2 times
    59: data copyin transfers: 1
        device time(us): total=1,564 max=1,564 min=1,564 avg=1,564
    91: data copyout transfers: 1
        device time(us): total=1,773 max=1,773 min=1,773 avg=1,773
63: compute region reached 3372 times
    65: kernel launched 3372 times
        grid: [32x250] block: [32x4]
        device time(us): total=1,005,947 max=313 min=296 avg=298
        elapsed time(us): total=1,102,391 max=946 min=324 avg=326
74: compute region reached 3372 times
    74: data copyin transfers: 3372
        device time(us): total=20,344 max=16 min=6 avg=6
    76: kernel launched 3372 times
        grid: [32x250] block: [32x4]
        device time(us): total=1,150,552 max=344 min=327 avg=341
        elapsed time(us): total=1,235,344 max=856 min=352 avg=366
    76: reduction kernel launched 3372 times
        grid: [1] block: [256]
        device time(us): total=67,484 max=21 min=19 avg=20
        elapsed time(us): total=151,147 max=358 min=43 avg=44
    76: data copyout transfers: 3372
        device time(us): total=68,104 max=46 min=17 avg=20
Analysis of performance (version 2)
Compare the computation time (for 1000*1000 grid):
• Serial code: 17.610445 seconds.
• OpenACC code (version 1): 48.796347 seconds
• OpenACC code (version 2): 2.592581 seconds
The OpenACC code (version 2) is around 6.8 times faster than the serial code.
Cheers!
The speed-up would be even larger if the size of the problem were increased. Note, however, that GPU memory (typically 6 GB or 12 GB) is much smaller than regular CPU memory (e.g. 128 GB on BU SCC), which limits how large the problem on the GPU can be.
Reduction
As we can see from the profiling results, a reduction kernel is created by the compiler for the dt update: all loop iterations update the same scalar dt, which would otherwise be a data race.
Using the parallel directive is a little faster than using the kernels directive in this case (mostly due to different task granularities).
It is a good habit to explicitly specify the reduction operator and the reduction variable.
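For example, a minimal sketch in C that reuses the second loop nest of the Laplace code above and names the reduction operator and variable explicitly (only the reduction clauses are new):

#pragma acc kernels loop reduction(max:dt)
for(i = 1; i <= ROWS; i++) {
    #pragma acc loop reduction(max:dt)
    for(j = 1; j <= COLUMNS; j++){
        /* every iteration updates the shared scalar dt: a max reduction removes the data race */
        dt = fmax( fabs(A_new[i][j]-A[i][j]), dt);
        A[i][j] = A_new[i][j];
    }
}

With the parallel directive the clause is written the same way: #pragma acc parallel loop reduction(max:dt).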
NVIDIA GPU (CUDA) Task Granularity
OpenACC exposes three levels of parallelism: gang, worker, and vector. In CUDA terms, a gang roughly corresponds to a thread block, a worker to a warp, and a vector lane to a CUDA thread.
Syntax for C
#pragma acc kernels loop gang(n) worker(m) vector(k)
#pragma acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
Syntax for Fortran
!$acc kernels loop gang(n) worker(m) vector(k)
!$acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
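For example, a sketch that sets the granularity explicitly on the Laplace update loops. The values worker(4) and vector(32) mirror the block size [32x4] reported in the profiling output above, but they are purely illustrative; the best values depend on the GPU and should be found by experiment:

#pragma acc kernels loop gang worker(4)   /* rows spread over gangs, 4 workers per gang */
for(i = 1; i <= ROWS; i++) {
    #pragma acc loop vector(32)           /* 32 vector lanes across the columns */
    for(j = 1; j <= COLUMNS; j++) {
        A_new[i][j] = 0.25 * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
    }
}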
Appendix A: Submit a GPU job on SCC
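A rough sketch of a batch script for the SGE-based scheduler on SCC; it requests a GPU with the same -l gpus=1 resource used for qlogin above (the wall-clock option and the executable name are illustrative assumptions):

#!/bin/bash -l
#$ -l gpus=1          # request one GPU, as with qlogin above
#$ -l h_rt=01:00:00   # assumed wall-clock limit
./laplace_acc         # assumed executable name

Submit the script with:
% qsub job.sh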
Topics for further study:
• Architecture of GPU
• Advanced OpenACC (vector, worker, gang, synchronization, etc)
• Using OpenACC with CUDA
• Using OpenACC with OpenMP (to use a few GPUs on one node)
• Using OpenACC with MPI (to use many GPUs on multiple nodes)
Further information