OpenACC (Spring 2017)
Shaohao Chen
Research Computing Services
Information Services and Technology
Boston University
Outline
• Introduction to GPU and OpenACC
• Basic syntax and the first OpenACC program: SAXPY
• Kernels vs. parallel directives
• An example: Laplace solver in OpenACC
The first try
Data transfer between GPU and CPU/host
Data race and the reduction clause
• GPU and OpenACC task granularities
GPU and GPGPU
• Originally, the graphics processing unit (GPU) was dedicated to manipulating computer graphics and image processing. Traditionally the GPU was known as the "video card".
• The GPU's highly parallel structure makes it efficient for parallel programs. Nowadays GPUs are used for tasks that were formerly the domain of CPUs, such as scientific computation. This kind of GPU is called a general-purpose GPU (GPGPU).
• In many cases, a parallel program runs faster on a GPU than on a CPU. Note that a serial program runs faster on a CPU than on a GPU.
• The most popular type of GPU in the high-performance computing world is the NVIDIA GPU. We will focus only on NVIDIA GPUs here.
GPU is an accelerator
• A GPU is a device in a CPU-based system; it is connected to the CPU through the PCI Express (PCIe) bus.
• Computer programs can be parallelized and accelerated on the GPU.
• The CPU and the GPU have separate memories, so data transfer between CPU and GPU memory is required when programming.
Three ways to accelerate applications on GPU:
  GPU-accelerated libraries
  Compiler directives (e.g. OpenACC)
  Programming languages (e.g. CUDA)
What is OpenACC
• OpenACC (for Open Accelerators) is a programming standard for parallel computing on accelerators (mostly NVIDIA GPUs).
• It is designed to simplify GPU programming.
• The basic approach is to insert special comments (directives) into the code so as to offload computation onto the GPU and parallelize the code at the level of GPU (CUDA) cores.
• It is possible for programmers to create an efficient parallel OpenACC code with only minor modifications to a serial CPU code.
What are compiler directives?
The directives tell the compiler or runtime to:
  Generate parallel code for the GPU
  Allocate GPU memory and copy input data to the GPU
  Execute the parallel code on the GPU
  Copy output data back to the CPU and deallocate GPU memory
• C
  #pragma acc directive [clause [[,] clause]…]
  often followed by a structured code block
• Fortran
  !$acc directive [clause [[,] clause]…]
  often paired with a matching end directive surrounding a structured code block:
  !$acc end directive
The first OpenACC program: SAXPY
Example: Compute a*x + y, where x and y are vectors, and a is a scalar.
C:
int main(int argc, char **argv){
  int N=1000;
  float a = 3.0f;
  float x[N], y[N];
  for (int i = 0; i < N; ++i) {
    x[i] = 2.0f;
    y[i] = 1.0f;
  }
  #pragma acc kernels
  for (int i = 0; i < N; ++i) {
    y[i] = a * x[i] + y[i];
  }
}

Fortran:
program main
  integer :: n=1000, i
  real :: a=3.0
  real, allocatable :: x(:), y(:)
  allocate(x(n),y(n))
  x(1:n)=2.0
  y(1:n)=1.0
  !$acc kernels
  do i=1,n
    y(i) = a * x(i) + y(i)
  enddo
  !$acc end kernels
end program main
Using OpenACC on BU SCC (1): Get GPU resources
• Log in to BU SCC:
% ssh <BU username>@<SCC login node>
• Request an interactive session with one CPU core and one GPU:
% qlogin -l gpus=1
Note: if the compiler cannot parallelize a loop, it generates a serial (scalar) kernel, which runs slower on the GPU than on the CPU (see the pointer-aliasing example below).
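• Compile and run (a minimal sketch; it assumes the PGI compilers are available through an environment module, and the module and file names here are placeholders):
% module load pgi
% pgcc -acc -Minfo=accel saxpy.c -o saxpy          (C; -Minfo=accel reports which loops were parallelized)
% pgfortran -acc -Minfo=accel saxpy.f90 -o saxpy   (Fortran)
% ./saxpy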
Pointer aliasing in C (1)
An improper version of the SAXPY code (using pointers):
Pointer aliasing: different pointers are allowed to access the same object, which may induce an implicit data dependency in a loop. In this case it is possible that the pointers x and y access the same object, so potentially there is a data dependency in the loop.

int N=1000;
float a = 3.0f;
float * x = (float*)malloc(N * sizeof(float));
float * y = (float*)malloc(N * sizeof(float));
for (int i = 0; i < N; ++i) {
  x[i] = 2.0f;
  y[i] = 1.0f;
}
#pragma acc kernels
for (int i = 0; i < N; ++i) {
  y[i] = a * x[i] + y[i];
}
Pointer aliasing in C (2)
The compiler refuses to parallelize a loop that involves pointer aliasing.
Compiling output of the improper SAXPY code:
……
20, Loop carried dependence of y-> prevents parallelization
Complex loop carried dependence of x-> prevents parallelization
Loop carried backward dependence of y-> prevents vectorization
Accelerator scalar kernel generated
Use restrict to avoid pointer aliasing
A proper version of the SAXPY code (using pointers):
To avoid pointer aliasing, use the keyword restrict. float * restrict ptr means: for the lifetime of the pointer ptr, only ptr itself or a value directly derived from it (such as ptr + 1) will be used to access the object to which it points.

int N=1000;
float a = 3.0f;
float * x = (float*)malloc(N * sizeof(float));
float * restrict y = (float*)malloc(N * sizeof(float));
for (int i = 0; i < N; ++i) {
  x[i] = 2.0f;
  y[i] = 1.0f;
}
#pragma acc kernels
for (int i = 0; i < N; ++i) {
  y[i] = a * x[i] + y[i];
}
Parallel directive (1)
An improper version of the SAXPY code (using the parallel directive):

C:
#pragma acc parallel
for (int i = 0; i < N; ++i) {
  y[i] = a * x[i] + y[i];
}

Fortran:
!$acc parallel
do i=1,n
  y(i) = a*x(i)+y(i)
enddo
!$acc end parallel

The parallel directive tells the compiler to create a parallel region. But unlike the kernels region, the code in the parallel region (the loop in this case) is executed redundantly by all gangs. There is no work sharing among gangs!
Parallel directive (2)
A proper version of SAXPY code (using parallel loop directive):
C:
#pragma acc parallel loop
for (int i = 0; i < N; ++i) {
  y[i] = a * x[i] + y[i];
}

Fortran:
!$acc parallel loop
do i=1,n
  y(i) = a*x(i)+y(i)
enddo
!$acc end parallel loop
It is necessary to add the keyword loop to share the work (among gangs).
In Fortran, the keyword loop can be replaced by do here.
In C, the keyword loop can be replaced by for.
kernels vs. parallel (1)
kernels
• More implicit.
• Gives the compiler more freedom to find and map parallelism.
• Compiler performs parallel analysis and parallelizes what it believes safe.
parallel
• More explicit.
• Requires analysis by the programmer to ensure that parallelization is safe.
• Straightforward path from OpenMP
kernels vs. parallel (2)
Parallelize a code block with two loops:
kernels:
#pragma acc kernels
{
  for (i=0; i<n; i++)
    a[i] = 3.0f*(float)(i+1);
  for (i=0; i<n; i++)
    b[i] = 2.0f*a[i];
}

parallel:
#pragma acc parallel
{
  #pragma acc loop
  for (i=0; i<n; i++)
    a[i] = 3.0f*(float)(i+1);
  #pragma acc loop
  for (i=0; i<n; i++)
    b[i] = 2.0f*a[i];
}
Laplace Solver (1)
• Discretize the Laplacian with the finite-difference method and express the solution at each grid point as
  A_new(i,j) = 0.25 * ( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )
• The solution at one point depends only on the four neighboring points.
Laplace Solver (2)
Starting from a serial code (in C or Fortran) that solves the two-dimensional Laplace equation, parallelize it using OpenACC directives. Then compare the performance of the serial code and the OpenACC code.
• Hints:
1. Find the “hot spots”, the most time-consuming parts of the code. Usually they are loops.
2. Analyze parallelism. Which loops are parallelizable?
3. Which directives should be used, and where should they be inserted?
Laplace Solver (OpenACC in C, version 1)
• The outer while loop is not parallelizable due to the data dependency between iterations.
• The nested for loops are parallelizable. Create a kernels region around each loop nest and ask the compiler to determine the parallelism and the data transfers.

while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++)
        for(j = 1; j <= COLUMNS; j++) {
            A_new[i][j] = 0.25 * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
        }

    dt = 0.0;

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++)
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(A_new[i][j]-A[i][j]), dt);
            A[i][j] = A_new[i][j];
        }

    iteration++;
}
Laplace Solver (OpenACC in Fortran, version 1)
• The outer do while loop is not parallelizable due to the data dependency between iterations.
• The nested do loops are parallelizable. Create a kernels region around each loop nest and ask the compiler to determine the parallelism and the data transfers.

do while ( dt > max_temp_error .and. iteration <= max_iterations )

    !$acc kernels
    do j=1,columns
        do i=1,rows
            A_new(i,j) = 0.25*( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )
        enddo
    enddo
    !$acc end kernels

    dt=0.0

    !$acc kernels
    do j=1,columns
        do i=1,rows
            dt = max( abs(A_new(i,j) - A(i,j)), dt )
            A(i,j) = A_new(i,j)
        enddo
    enddo
    !$acc end kernels

    iteration = iteration + 1
enddo
Analysis of performance (version 1)
The OpenACC code is much slower than the serial code. What went wrong? The arrays A and A_new are transferred between the host and the GPU at every kernels region, i.e. twice in every iteration of the while loop, and this data movement dominates the run time.
The data directive
• Syntax for C
  #pragma acc data copy(a[0:size]) copyin(b[0:size]) copyout(c[0:size]) create(d[0:size]) present(d[0:size])
• Syntax for Fortran
  !$acc data copy(a(0:size)) copyin(b(0:size)) copyout(c(0:size)) create(d(0:size)) present(d(0:size))
  !$acc end data
• copy transfers the data to the GPU at region entry and back to the host at region exit; copyin only transfers to the GPU; copyout only transfers back to the host; create allocates GPU memory without any transfer; present declares that the data is already on the GPU.
• If the compiler can determine the size of the arrays, it is unnecessary to specify it explicitly.
Laplace Solver (OpenACC in C, version 2)
• Create a data region around the while loop: A is copied to the GPU before the while loop starts and copied back to the host after the while loop ends; A_new is allocated directly in GPU memory and never needs to be copied to the host.
• The kernels regions inside the loop still parallelize the for loops, but they no longer cause any data transfer of the arrays.

#pragma acc data copy(A), create(A_new)
while ( dt > MAX_TEMP_ERROR && iteration <= max_iterations ) {

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++)
        for(j = 1; j <= COLUMNS; j++) {
            A_new[i][j] = 0.25 * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
        }

    dt = 0.0;

    #pragma acc kernels
    for(i = 1; i <= ROWS; i++)
        for(j = 1; j <= COLUMNS; j++){
            dt = fmax( fabs(A_new[i][j]-A[i][j]), dt);
            A[i][j] = A_new[i][j];
        }

    iteration++;
}
Laplace Solver (OpenACC in Fortran, version 2)
• Create a data region around the do while loop: A is copied to the GPU before the loop starts and copied back to the host after the loop ends; A_new is allocated directly in GPU memory and never needs to be copied to the host.
• The kernels regions inside the loop still parallelize the do loops, but they no longer cause any data transfer of the arrays.

!$acc data copy(A), create(A_new)
do while ( dt > max_temp_error .and. iteration <= max_iterations )

    !$acc kernels
    do j=1,columns
        do i=1,rows
            A_new(i,j) = 0.25*( A(i+1,j) + A(i-1,j) + A(i,j+1) + A(i,j-1) )
        enddo
    enddo
    !$acc end kernels

    dt=0.0

    !$acc kernels
    do j=1,columns
        do i=1,rows
            dt = max( abs(A_new(i,j) - A(i,j)), dt )
            A(i,j) = A_new(i,j)
        enddo
    enddo
    !$acc end kernels

    iteration = iteration + 1
enddo
!$acc end data
Profiling (version 2)
Set export PGI_ACC_TIME=1 to activate profiling, then run again.

• There are only 2 data movements of the arrays in total: one copyin before the while loop and one copyout after it.
• There are data movements for the variable dt in every iteration, but it is a scalar, not an array, so these transfers cost very little time.
• The total time for data movement is around 0.09 second (1,564 + 1,773 + 20,344 + 68,104 ≈ 91,785 microseconds), which is much smaller than the computing time (around 2.5 seconds)!

Profiling output:
time(us): 2,374,331
59: data region reached 2 times
    59: data copyin transfers: 1
        device time(us): total=1,564 max=1,564 min=1,564 avg=1,564
    91: data copyout transfers: 1
        device time(us): total=1,773 max=1,773 min=1,773 avg=1,773
63: compute region reached 3372 times
    65: kernel launched 3372 times
        grid: [32x250] block: [32x4]
        device time(us): total=1,005,947 max=313 min=296 avg=298
        elapsed time(us): total=1,102,391 max=946 min=324 avg=326
74: compute region reached 3372 times
    74: data copyin transfers: 3372
        device time(us): total=20,344 max=16 min=6 avg=6
    76: kernel launched 3372 times
        grid: [32x250] block: [32x4]
        device time(us): total=1,150,552 max=344 min=327 avg=341
        elapsed time(us): total=1,235,344 max=856 min=352 avg=366
    76: reduction kernel launched 3372 times
        grid: [1] block: [256]
        device time(us): total=67,484 max=21 min=19 avg=20
        elapsed time(us): total=151,147 max=358 min=43 avg=44
    76: data copyout transfers: 3372
        device time(us): total=68,104 max=46 min=17 avg=20
Analysis of performance (version 2)
Compare the computation time (for 1000*1000 grid):
• Serial code: 17.610445 seconds.
• OpenACC code (version 1): 48.796347 seconds
• OpenACC code (version 2): 2.592581 seconds
The OpenACC code (version 2) is around 6.8 times faster than the serial code.
Cheers!
The speed-up would be even larger if the size of the problem were increased. Note, however, that GPU memory (typically 6 GB or 12 GB) is much smaller than regular CPU memory (e.g. 128 GB on BU SCC), which limits how large the problem on the GPU can be.
Reduction
As we can see from the profiling results, a reduction kernel is created by the compiler for the dt update: all loop iterations update the same scalar dt, which would otherwise be a data race.
Using the parallel directive is a little faster than using the kernels directive in this case (mostly due to different task granularities).
It is a good habit to explicitly specify the reduction operator and the reduction variable.
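For example, a minimal sketch in C that reuses the second loop nest of the Laplace code above and names the reduction operator and variable explicitly (only the reduction clauses are new):

#pragma acc kernels loop reduction(max:dt)
for(i = 1; i <= ROWS; i++) {
    #pragma acc loop reduction(max:dt)
    for(j = 1; j <= COLUMNS; j++){
        /* every iteration updates the shared scalar dt: a max reduction removes the data race */
        dt = fmax( fabs(A_new[i][j]-A[i][j]), dt);
        A[i][j] = A_new[i][j];
    }
}

With the parallel directive the clause is written the same way: #pragma acc parallel loop reduction(max:dt).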
NVIDIA GPU (CUDA) Task Granularity
OpenACC exposes three levels of parallelism: gang, worker, and vector. In CUDA terms, a gang roughly corresponds to a thread block, a worker to a warp, and a vector lane to a CUDA thread.
Syntax for C
#pragma acc kernels loop gang(n) worker(m) vector(k)
#pragma acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
Syntax for Fortran
!$acc kernels loop gang(n) worker(m) vector(k)
!$acc parallel loop num_gangs(n) num_workers(m) vector_length(k)
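For example, a sketch that sets the granularity explicitly on the Laplace update loops. The values worker(4) and vector(32) mirror the block size [32x4] reported in the profiling output above, but they are purely illustrative; the best values depend on the GPU and should be found by experiment:

#pragma acc kernels loop gang worker(4)   /* rows spread over gangs, 4 workers per gang */
for(i = 1; i <= ROWS; i++) {
    #pragma acc loop vector(32)           /* 32 vector lanes across the columns */
    for(j = 1; j <= COLUMNS; j++) {
        A_new[i][j] = 0.25 * (A[i+1][j] + A[i-1][j] + A[i][j+1] + A[i][j-1]);
    }
}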
Appendix A: Submit a GPU job on SCC
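A rough sketch of a batch script for the SGE-based scheduler on SCC; it requests a GPU with the same -l gpus=1 resource used for qlogin above (the wall-clock option and the executable name are illustrative assumptions):

#!/bin/bash -l
#$ -l gpus=1          # request one GPU, as with qlogin above
#$ -l h_rt=01:00:00   # assumed wall-clock limit
./laplace_acc         # assumed executable name

Submit the script with:
% qsub job.sh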
Topics for further study:
• Architecture of GPU
• Advanced OpenACC (vector, worker, gang, synchronization, etc)
• Using OpenACC with CUDA
• Using OpenACC with OpenMP (to use a few GPUs on one node)
• Using OpenACC with MPI (to use many GPUs on multiple nodes)
Further information