
MODULE FOUR:

GPU PROGRAMMING
Dr. Volker Weinberg | LRZ
MODULE OVERVIEW
OpenACC Directives

▪ Multicore CPU vs GPU
▪ Introduction to GPU Data Management
▪ CUDA Managed Memory
▪ GPU Profiling with Nsight Systems
CPU VS GPU
CPU VS GPU
Number of cores and parallelism

▪ Both are extremely popular parallel processors, but with different degrees of parallelism
▪ CPUs generally have a small number of very fast physical cores
▪ GPUs have thousands of simple cores able to achieve high performance in aggregate
▪ Both require parallelism to be fully utilized, but GPUs require much more
CPU + GPU WORKFLOW
Application Code

▪ Compute-intensive functions (a small % of the code, but a large % of the runtime) run on the GPU
▪ The rest of the sequential CPU code runs on the CPU
GPU PROGRAMMING IN OPENACC

▪ Execution always begins and ends on the host CPU
▪ Compute-intensive loops are offloaded to the GPU using directives, which act as compiler hints
▪ Offloading may or may not require data movement between the host and device

int main(){
    <sequential code>

    #pragma acc kernels    // compiler hint
    {
        <parallel code>
    }
}
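As a minimal sketch of this pattern (an illustration, not part of the original lab code; the array names, size, and scaling factor are assumptions), a complete program might look like:

#include <stdio.h>

#define N 1000000

int main(){
    static double a[N], b[N];                 // data lives in host memory

    for(int i = 0; i < N; i++) a[i] = i;      // sequential initialization on the CPU

    #pragma acc kernels                       // compiler hint: offload this region to the GPU
    for(int i = 0; i < N; i++){
        b[i] = 2.0 * a[i];                    // loop iterations run in parallel on the device
    }

    printf("b[%d] = %f\n", N-1, b[N-1]);      // execution ends back on the host CPU
    return 0;
}

This could be built with the PGI compiler as, e.g., pgcc -fast -acc example.c (a hedged example invocation; the file name is an assumption).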
CPU + GPU
Physical Diagram

▪ CPU memory is larger; GPU memory has more bandwidth
▪ CPU and GPU memory are usually separate, connected by an I/O bus (traditionally PCI-e)
▪ Any data transferred between the CPU and GPU will be handled by the I/O bus
▪ The I/O bus is relatively slow compared to memory bandwidth
▪ The GPU cannot perform computation until the data is within its memory

[Diagram: a multicore CPU with shared cache and high-capacity memory, connected over an I/O bus to a GPU with its own caches and high-bandwidth memory]
BASIC DATA MANAGEMENT
BASIC DATA MANAGEMENT
Between the host and device

▪ The host is traditionally a CPU
▪ The device is some parallel accelerator
▪ When our target hardware is multicore, the host and device are the same, meaning that their memory is also the same
▪ There is no need to explicitly manage data when using a shared-memory accelerator, such as the multicore target
BASIC DATA MANAGEMENT
Between the host and device

▪ When the target hardware is a GPU, data will usually need to migrate between CPU and GPU memory
▪ The next lecture will discuss OpenACC data management; for now we'll assume a unified host/accelerator memory
CUDA MANAGED MEMORY
CUDA MANAGED MEMORY
Simplified Developer Effort

▪ Commonly referred to as "unified memory"
▪ CPU and GPU memories are combined into a single, shared pool

[Diagram: without Managed Memory, system memory and GPU memory are separate pools; with Managed Memory, they are presented as a single managed pool]
CUDA MANAGED MEMORY
Usefulness

▪ Handling explicit data transfers between the host and device (CPU and GPU) can be difficult
▪ The PGI compiler can utilize CUDA Managed Memory to defer data management
▪ This allows the developer to concentrate on parallelism and think about data movement as an optimization

$ pgcc -fast -acc -ta=tesla:managed -Minfo=accel main.c
$ pgfortran -fast -acc -ta=tesla:managed -Minfo=accel main.f90
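A hedged sketch of what this enables (the file and variable names are assumptions, not from the slides): an ordinary malloc'd array can be used inside a compute region without any data clauses, and the managed-memory runtime migrates it on demand when the code is built with -ta=tesla:managed.

#include <stdlib.h>

int main(){
    int n = 1<<20;
    double *a = (double*)malloc(n * sizeof(double));   // ordinary host allocation

    for(int i = 0; i < n; i++) a[i] = 1.0;             // first touched on the CPU

    #pragma acc parallel loop                          // no data clauses: with -ta=tesla:managed
    for(int i = 0; i < n; i++){                        // the runtime migrates 'a' on demand
        a[i] = 2.0 * a[i];
    }

    free(a);
    return 0;
}

$ pgcc -fast -acc -ta=tesla:managed -Minfo=accel example.c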
MANAGED MEMORY
Limitations

▪ The programmer will almost always be able to get better performance by manually handling data transfers
▪ Memory allocation/deallocation takes longer with managed memory
▪ Data cannot be transferred asynchronously
▪ Currently only available from PGI on NVIDIA GPUs
OPENACC WITH MANAGED MEMORY
An Example from the Lab Code
▪ Without Managed Memory the compiler must determine the size of A and Anew and copy their data to and from the GPU each iteration to ensure correctness
▪ With Managed Memory the underlying runtime will move the data only when needed

while ( error > tol && iter < iter_max )
{
    error = 0.0;

    #pragma acc kernels
    {
        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                Anew[j][i] = 0.25 * ( A[j][i+1] + A[j][i-1]
                                    + A[j-1][i] + A[j+1][i]);
                error = fmax( error, fabs(Anew[j][i] - A[j][i]));
            }
        }

        for( int j = 1; j < n-1; j++)
        {
            for( int i = 1; i < m-1; i++ )
            {
                A[j][i] = Anew[j][i];
            }
        }
    }
}
INTRODUCTION TO DATA CLAUSES
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy

▪ Data clauses allow the programmer to tell the compiler which data to move and when
▪ Data clauses may be added to kernels or parallel regions, but also to data, enter data, and exit data, which will be discussed shortly

C/C++
#pragma acc kernels
for(int i = 0; i < N; i++){
    a[i] = 0;
}
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy

C/C++
#pragma acc parallel loop copyout(a[0:N])
for(int i = 0; i < N; i++){
    a[i] = 0;
}

▪ I don't need the initial value of a, so I'll only copy it out of the region at the end.
BASIC DATA MANAGEMENT
Moving data between the Host and Device using copy

Allocate 'a' on GPU → Copy 'a' from CPU to GPU → Execute Kernels → Copy 'a' from GPU to CPU → Deallocate 'a' from GPU

#pragma acc parallel loop copy(a[0:N])
for(int i = 0; i < N; i++){
    a[i] = 2 * a[i];
}

[Diagram: array A starts in CPU memory, is copied to GPU memory, updated on the GPU (A'), and the updated A' is copied back to CPU memory]
DATA CLAUSES
copy( list )     Allocates memory on the GPU, copies data from host to GPU when entering the region, and copies data back to the host when exiting the region.
                 Principal use: For many important data structures in your code, this is a logical default to input, modify and return the data.

copyin( list )   Allocates memory on the GPU and copies data from host to GPU when entering the region.
                 Principal use: Think of this like an array that you would use as just an input to a subroutine.

copyout( list )  Allocates memory on the GPU and copies data to the host when exiting the region.
                 Principal use: A result that isn't overwriting the input data structure.

create( list )   Allocates memory on the GPU but does not copy.
                 Principal use: Temporary arrays.
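As a short, hedged sketch of how these clauses combine (the array names, sizes, and the computation are illustrative assumptions, not from the lab code):

// x is input only, y is output only, tmp is device-only scratch space
#pragma acc parallel loop copyin(x[0:N]) copyout(y[0:N]) create(tmp[0:N])
for(int i = 0; i < N; i++){
    tmp[i] = 2.0 * x[i];    // written and read only in GPU memory
    y[i]   = tmp[i] + 1.0;  // copied back to the host when the region ends
}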
ARRAY SHAPING

▪ Sometimes the compiler needs help understanding the shape of an array
▪ The first number is the start index of the array
▪ In C/C++, the second number is how much data is to be transferred (the length)
▪ In Fortran, the second number is the ending index

copy(array[starting_index:length])          C/C++

copy(array(starting_index:ending_index))    Fortran
BASIC DATA MANAGEMENT
Multi-dimensional Array shaping

copy(array[0:N][0:M]) C/C++

copy(array(1:N, 1:M)) Fortran
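As a hedged illustration of shaping in practice (the array name, bounds, and computation are assumptions), the clause below transfers only rows 1 through N-2 of a statically sized C array, using the start-index:length form for the outer dimension:

// double A[N][M];  transfer rows 1..N-2 (start 1, length N-2), all M columns
#pragma acc parallel loop copy(A[1:N-2][0:M])
for(int j = 1; j < N-1; j++){
    for(int i = 0; i < M; i++){
        A[j][i] = 0.5 * A[j][i];
    }
}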


PROFILING GPU CODE
PROFILING GPU CODE (PGI)
Obtaining information about your GPU

▪ Using the pgaccelinfo command will display information about available accelerators
▪ Each device is numbered starting with 0
▪ The Device Name identifies the type of accelerator
▪ Can Managed Memory be used?
▪ What compiler options should be used to target this device?

Terminal Window
$ pgaccelinfo
Device Number: 0
Device Name: Tesla P100-PCIE-16GB
...
Managed Memory: Yes
PGI Compiler Option: -ta=tesla:cc60

Without Managed Memory
$ pgcc -ta=tesla:cc60 main.c

With Managed Memory
$ pgcc -ta=tesla:cc60,managed main.c
COMPILING GPU CODE

Terminal Window
$ pgcc -fast -ta=tesla:cc60 -Minfo=accel jacobi.c laplace2d.c
calcNext:
37, Generating copy(Anew[:m*n],A[:m*n])
    Accelerator kernel generated
    Generating Tesla code
    37, Generating reduction(max:error)
    38, #pragma acc loop gang /* blockIdx.x */
    41, #pragma acc loop vector(128) /* threadIdx.x */
41, Loop is parallelizable
swap:
56, Generating copy(Anew[:m*n],A[:m*n])
    Accelerator kernel generated
    Generating Tesla code
    57, #pragma acc loop gang /* blockIdx.x */
    60, #pragma acc loop vector(128) /* threadIdx.x */
60, Loop is parallelizable

▪ We can see that our data copies are being applied by the compiler ("Generating copy")
▪ We also see that the compiler is generating code for our GPU ("Generating Tesla code")
▪ The gang and vector loop lines show the parallelization of the outer loops (across thread blocks) and the inner loops (across threads within a block)
PROFILING GPU CODE (NSIGHT SYSTEMS)
Using nsys to profile GPU code
▪ Nsight Systems presents far more information when running on a GPU
▪ It is capable of capturing information about CUDA execution in the profiled process
▪ In the Timeline view, you can see all the information about kernels and memory movements (expand the CUDA row)
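A typical way to collect such a report from the command line might look like this (the trace options and output name are assumptions based on common nsys usage, not taken from these slides):

Terminal Window
$ nsys profile -t cuda,openacc -o laplace ./a.out

The resulting report file can then be opened in the Nsight Systems GUI to inspect the timeline.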
PROFILING GPU CODE (NSIGHT SYSTEMS)
Using nsys to profile GPU code

▪ Kernels: These are our computational functions. We can see our calcNext and swap functions
▪ MemCpy(HtoD): This includes data transfers from the Host to the Device (CPU to GPU)
▪ MemCpy(DtoH): These are data transfers from the Device to the Host (GPU to CPU)
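For a quick text summary of the same information, nsys can also post-process the report on the command line (a hedged example; the report file name and extension depend on the nsys version and the output name chosen when profiling):

Terminal Window
$ nsys stats laplace.nsys-rep

This prints summary tables of the time spent in each GPU kernel (e.g. calcNext and swap) and in HtoD/DtoH memory copies.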
PROFILING GPU CODE
Receiving unexpected performance results

▪ Here we can see the runtime of our application: 151 seconds
▪ The program is now performing over 3 times worse than the sequential version
▪ A profiler can help us understand why the performance is worse

Terminal Window
$ pgcc -ta=tesla:cc60 jacobi.c laplace2d.c
$ ./a.out
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 151.772627 s
PROFILING GPU CODE
Inspecting the Nsight Systems timeline

▪ Let's focus on the data movement (Memory row)
▪ At first glance, it looks like our program is spending a significant amount of time transferring data between the host and device
▪ We also see that the compute regions are very small and spread out
▪ What if we try Managed Memory?
PROFILING GPU CODE
Using managed memory
▪ Using managed memory drastically improves performance
▪ This managed memory version is performing over 20x better than the sequential code
▪ What does the profiler tell us about this?

Terminal Window
$ pgcc -ta=tesla:cc60,managed jacobi.c laplace2d.c
$ ./a.out
0, 0.250000
100, 0.002397
200, 0.001204
300, 0.000804
400, 0.000603
500, 0.000483
600, 0.000403
700, 0.000345
800, 0.000302
900, 0.000269
total: 1.474951 s
PROFILING GPU CODE
Using managed memory

▪ The data no longer needs to transfer between each kernel
▪ The data is only moved when it is first accessed on the GPU or CPU
▪ During the timestepping, data remains on the device
▪ Now a higher percentage of time is spent computing
KEY CONCEPTS
In this module we discussed…

▪ The fundamental differences between CPUs and GPUs
▪ Assisting the compiler by providing information about array sizes for data management
▪ Managed memory
THANK YOU
