Module4
GPU PROGRAMMING
Dr. Volker Weinberg | LRZ
MODULE OVERVIEW
OpenACC Directives
[Figure: the compute-intensive functions (a small % of the code, a large % of the runtime) run on the GPU; the rest of the sequential CPU code stays on the CPU.]
GPU PROGRAMMING IN OPENACC
CPU + GPU
Physical Diagram
CPU and GPU memory are usually separate, connected by an I/O bus (traditionally PCI-e).
[Figure: the CPU, with per-core caches ($) and a shared cache, is attached to high-capacity host memory; the GPU, with more per-core caches and a shared cache, is attached to high-bandwidth GPU memory, which offers more bandwidth; the host and accelerator memories are linked by the I/O bus.]
CUDA MANAGED MEMORY
Simplified Developer Effort
Commonly referred to as “unified memory.”
Handling explicit data transfers between the host and device (CPU and GPU) can be difficult.
The PGI compiler can utilize CUDA Managed Memory to defer data management.
This allows the developer to concentrate on parallelism and think about data movement as an optimization.
C/C++
#pragma acc kernels
for(int i = 0; i < N; i++){
    a[i] = 0;
}
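To make the snippet above concrete, here is a minimal sketch of a complete function that relies on managed memory rather than data clauses. The helper name `make_zeroed` is hypothetical, and the sketch assumes the code is built with OpenACC and managed memory enabled (for example `-acc -ta=tesla:managed` with the PGI compiler; check your compiler's documentation). Without OpenACC support the pragma is ignored and the loop simply runs on the CPU.

```c
#include <stdlib.h>

/* Hypothetical helper (not from the slides): allocate n floats on the heap,
 * fill them with 1.0 on the host, then zero them inside a kernels region.
 * No data clauses are given; with managed memory enabled, the runtime is
 * expected to move the heap allocation between host and device on demand. */
float *make_zeroed(int n)
{
    float *a = (float *)malloc((size_t)n * sizeof(float));
    if (a == NULL) {
        return NULL;
    }

    /* Host-side initialisation, so the effect of the region is visible. */
    for (int i = 0; i < n; i++) {
        a[i] = 1.0f;
    }

    /* Same loop as on the slide: no copy, copyin, or copyout clause. */
    #pragma acc kernels
    for (int i = 0; i < n; i++) {
        a[i] = 0.0f;
    }

    return a; /* caller frees with free() */
}
```

A caller would write `float *a = make_zeroed(1000);`, read the values on the host as usual, and release the buffer with `free(a)`. Data clauses can then be added later as an optimization.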
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy
Data clauses allow the programmer to tell the compiler which data to move and when.
Data clauses may be added to kernels or parallel regions, and also to the data, enter data, and exit data directives, which will be discussed shortly.
C/C++
// I don't need the initial value of a, so I'll only copy it out
// of the region at the end.
#pragma acc parallel loop copyout(a[0:N])
for(int i = 0; i < N; i++){
    a[i] = 0;
}
[Figure: array A is copied from the host to the device, updated there to A', and A' is copied back to the host.]
DATA CLAUSES
copy( list )
    Allocates memory on the GPU, copies data from the host to the GPU when entering the region, and copies data back to the host when exiting the region.
    Principal use: For many important data structures in your code, this is a logical default to input, modify and return the data.
copyin( list )
    Allocates memory on the GPU and copies data from the host to the GPU when entering the region.
    Principal use: Think of this like an array that you would use as just an input to a subroutine.
copyout( list )
    Allocates memory on the GPU and copies data to the host when exiting the region.
    Principal use: A result that isn't overwriting the input data structure.
copy(array[starting_index:length]) C/C++
copy(array(starting_index:ending_index)) Fortran
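The three clauses can be seen side by side in a small sketch. The function names `vector_add` and `scale_in_place` are hypothetical and chosen only for illustration; the array shapes use the C/C++ form `[starting_index:length]` described above. The sketch assumes an OpenACC-capable compiler; without one the pragmas are ignored and the loops run sequentially with the same result.

```c
/* c = a + b. The arrays a and b are only read, so copyin is enough.
 * c is only written, so copyout avoids sending its initial contents
 * to the device. */
void vector_add(const float *a, const float *b, float *c, int n)
{
    #pragma acc parallel loop copyin(a[0:n], b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];
    }
}

/* x = s * x. The array x is read and then overwritten, so it has to
 * travel in both directions: copy. */
void scale_in_place(float *x, float s, int n)
{
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; i++) {
        x[i] = s * x[i];
    }
}
```

Choosing `copyin` or `copyout` over `copy` where possible halves the traffic over the I/O bus for that array, which is why the principal uses above distinguish pure inputs from pure results.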
BASIC DATA MANAGEMENT
Multi-dimensional Array shaping
copy(array[0:N][0:M]) C/C++
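As an illustration of the multi-dimensional shape, the following sketch applies `copy` to a statically sized two-dimensional array. The sizes `N` and `M` and the function name `add_one_2d` are assumptions made for this example only. Only the outer loop is annotated here; further loop-level tuning is left out on purpose.

```c
#define N 4 /* number of rows, hypothetical size */
#define M 3 /* number of columns, hypothetical size */

/* Add 1 to every element of an N x M array. The array is both read and
 * written, so the whole shape [0:N][0:M] is copied in and back out. */
void add_one_2d(float grid[N][M])
{
    #pragma acc parallel loop copy(grid[0:N][0:M])
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < M; j++) {
            grid[i][j] += 1.0f;
        }
    }
}
```

Each dimension gets its own `[starting_index:length]` pair, in the same order as the array's declaration.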