CUDA C Best Practices Guide
Version 3.2
8/20/2010
Contents Summary
The remainder of this guide is divided into the following sections:
Parallel Computing with CUDA: Important aspects of the parallel
programming architecture.
Performance Metrics: How should performance be measured in CUDA
applications and what are the factors that most influence performance?
Memory Optimizations: Correct memory management is one of the most
effective means of improving performance. This chapter explores the different
kinds of memory available to CUDA applications.
Execution Configuration Optimizations: How to make sure your CUDA
application is exploiting all the available resources on the GPU.
Instruction Optimizations: Certain operations run faster than others. Using
faster operations and avoiding slower ones often confers remarkable benefits.
Control Flow: Carelessly designed control flow can force parallel code into
serial execution, whereas thoughtfully designed control flow can help the
hardware perform the maximum amount of work per clock cycle.
Getting the Right Answer: How to debug code and how to handle differences
in how the CPU and GPU represent floating-point values.
This chapter reviews heterogeneous computing with CUDA, explains the limits of
performance improvement, and helps you choose the right version of CUDA and
which application programming interface (API) to use when programming.
Threads. Threads on the GPU are extremely lightweight compared with CPU
threads. In a typical system, thousands of threads are queued up for work (in
warps of 32 threads each). If the GPU must wait on one warp of threads, it
simply begins executing work on another. Because separate registers are
allocated to all active threads, no swapping of registers or state need occur
between GPU threads. Resources stay allocated to each thread until it completes
its execution.
RAM. Both the host system and the device have RAM. On the host system,
RAM is generally equally accessible to all code (within the limitations enforced
by the operating system). On the device, RAM is divided virtually and physically
into different types, each of which has a special purpose and fulfills different
needs. The types of device RAM are explained in the CUDA C Programming
Guide and in Chapter 3 of this document.
These are the primary hardware differences between CPU hosts and GPU devices
with respect to parallel programming. Other differences are discussed as they arise
elsewhere in this document.
The maximum speed-up S that can be expected from parallelizing a program is
given by Amdahl's law:

S = 1 / ((1 - P) + P/N)

where P is the fraction of the total serial execution time taken by the portion of code
that can be parallelized and N is the number of processors over which the parallel
portion of the code runs.
The larger N is (that is, the greater the number of processors), the smaller the P/N
fraction. It can be simpler to view N as a very large number, which essentially
transforms the equation into S = 1 / (1 – P). Now, if ¾ of a program is parallelized,
the maximum speed-up over serial code is 1 / (1 – ¾) = 4.
For most purposes, the key point is that the greater P is, the greater the speed-up.
An additional caveat is implicit in this equation: if P is small (that is, the program is
not substantially parallelizable), increasing N does little to improve performance. To
get the largest speed-up, best practices suggest spending most effort on increasing P;
that is, on maximizing the amount of code that can be parallelized.
The major and minor revision numbers of the compute capability are shown on the
third and fourth lines of Figure 1.1. Device 0 of this system has compute capability
1.1.
More details about the compute capabilities of various GPUs are in Appendix A of
the CUDA C Programming Guide. In particular, developers should note the number of
multiprocessors on the device, the number of registers and the amount of memory
available, and any special capabilities of the device.
The CUDA driver API is delivered through the nvcuda dynamic library, and all its
entry points are prefixed with cu; the C runtime for CUDA is delivered through the
cudart dynamic library, and all its entry points are prefixed with cuda.
delete[] pA;
delete[] pB;
delete[] pC;
cudaFree(pDeviceMemA);
cudaFree(pDeviceMemB);
cudaFree(pDeviceMemC);
Listing 1.1 Host code for adding two vectors using the C runtime for
CUDA
Listing 1.1 consists of 27 lines of code. Listing 1.2 shows the same functionality
implemented using the CUDA driver API.
const unsigned int cnBlockSize = 512;
const unsigned int cnBlocks = 3;
const unsigned int cnDimension = cnBlocks * cnBlockSize;
CUdevice hDevice;
CUcontext hContext;
CUmodule hModule;
CUfunction hFunction;
cuModuleLoad(&hModule, "vectorAdd.cubin");
cuModuleGetFunction(&hFunction, hModule, "vectorAdd");
// execute kernel
cuLaunchGrid(hFunction, cnBlocks, 1);
delete[] pA;
delete[] pB;
delete[] pC;
cuMemFree(pDeviceMemA);
cuMemFree(pDeviceMemB);
cuMemFree(pDeviceMemC);
Listing 1.2 Host code for adding two vectors using the CUDA driver API
Listing 1.2 contains 50 lines of code and performs several lower-level operations
that the runtime API handles implicitly. The additional calls are evident in several
places, especially the setup necessary in the driver API prior to the kernel call.
2.1 Timing
CUDA calls and kernel executions can be timed using either CPU or GPU timers.
This section examines the functionality, advantages, and pitfalls of both approaches.
Because the default stream, stream 0, exhibits synchronous behavior (an operation
in the default stream can begin only after all preceding calls in any stream have
completed; and no subsequent operation in any stream can begin until it finishes),
these functions can be used reliably for timing in the default stream.
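As a sketch of host-side timing with a CPU timer (the cpuTimer() helper and the
kernel arguments are placeholders, not functions from the guide), the
synchronization described above might be used as follows:

// Hypothetical host-timing sketch; cpuTimer() stands in for any
// high-resolution host timer.
double t0 = cpuTimer();
kernel<<<grid, threads>>>(d_odata, d_idata);  // asynchronous launch
cudaThreadSynchronize();   // block the host until the GPU has finished
double t1 = cpuTimer();
double elapsedSeconds = t1 - t0;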
Be aware that CPU-to-GPU synchronization points such as those mentioned in this
section imply a stall in the GPU’s processing pipeline and should thus be used
sparingly to minimize their performance impact.
cudaEvent_t start, stop;
float time;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord( start, 0 );
kernel<<<grid,threads>>> ( d_odata, d_idata, size_x, size_y,
                           NUM_REPS);
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );
cudaEventElapsedTime( &time, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );
Here cudaEventRecord() is used to place the start and stop events into the
default stream, stream 0. The device will record a timestamp for the event when it
reaches that event in the stream. The cudaEventElapsedTime() function returns
the time elapsed between the recording of the start and stop events. This value is
expressed in milliseconds and has a resolution of approximately half a microsecond.
Like the other calls in this listing, its specific operation, parameters, and return
values are described in the CUDA Reference Manual. Note that the timings are
measured on the GPU clock, so the timing resolution is operating-system-independent.
2.2 Bandwidth
Bandwidth—the rate at which data can be transferred—is one of the most
important gating factors for performance. Almost all changes to code should be
made in the context of how they affect bandwidth. As described in Chapter 3 of this
guide, bandwidth can be dramatically affected by the choice of memory in which
data is stored, how the data is laid out and the order in which it is accessed, as well
as other factors.
High Priority: Use the effective bandwidth of your computation as a metric when
measuring performance and optimization benefits.
Memory optimizations are the most important area for performance. The goal is to
maximize the use of the hardware by maximizing bandwidth. Bandwidth is best
served by using as much fast memory and as little slow-access memory as possible.
This chapter discusses the various kinds of memory on the host and device and how
best to set up data items to use the memory effectively.
High Priority: Minimize data transfer between the host and the device, even if it
means running some kernels on the device that do not show performance gains when
compared with running them on the host CPU.
The last argument to the cudaMemcpyAsync() function is the stream ID, which in
this case uses the default stream, stream 0. The kernel also uses the default stream,
and it will not begin execution until the memory copy completes; therefore, no
explicit synchronization is needed. Because the memory copy and the kernel both
return control to the host immediately, the host function cpuFunction() overlaps
their execution.
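Listing 3.1 itself is not reproduced in this excerpt; a minimal sketch consistent with
the description, assuming a_d, a_h, size, grid, and block are already defined and
a_h is pinned host memory, is:

cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, 0);
kernel<<<grid, block>>>(a_d);
cpuFunction();   // executes on the host while the copy and kernel proceed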
In Listing 3.1, the memory copy and kernel execution occur sequentially. On devices
that are capable of "concurrent copy and execute," it is possible to overlap kernel
execution on the device with data transfers between the host and the device.
Whether a device has this capability is indicated by the deviceOverlap field of a
cudaDeviceProp variable (or listed in the output of the deviceQuery SDK sample).
On devices that have this capability, the overlap once again requires pinned host
memory, and, in addition, the data transfer and kernel must use different, non-
default streams (streams with non-zero stream IDs). Non-default streams are
required for this overlap because memory copy, memory set functions, and kernel
calls that use the default stream begin only after all preceding calls on the device (in
any stream) have completed, and no operation on the device (in any stream)
commences until they are finished.
Listing 3.2 illustrates the basic technique.
cudaStream_t stream1, stream2;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync(a_d, a_h, size, cudaMemcpyHostToDevice, stream1);
kernel<<<grid, block, 0, stream2>>>(otherData_d);
In this code, two streams are created and used in the data transfer and kernel
executions as specified in the last arguments of the cudaMemcpyAsync call and the
kernel’s execution configuration.
Listing 3.2 demonstrates how to overlap kernel execution with asynchronous data
transfer. This technique could be used when the data dependency is such that the
data can be broken into chunks and transferred in multiple stages, launching
multiple kernels to operate on each chunk as it arrives. Listings 3.3a and 3.3b
demonstrate this. They produce equivalent results. The first segment shows the
reference sequential implementation, which transfers and operates on an array of N
floats (where N is assumed to be evenly divisible by nThreads).
cudaMemcpy(a_d, a_h, N*sizeof(float), dir);
kernel<<<N/nThreads, nThreads>>>(a_d);
Listing 3.3b shows how the transfer and kernel execution can be broken up into
nStreams stages. This approach permits some overlapping of the data transfer and
execution.
size=N*sizeof(float)/nStreams;
for (i=0; i<nStreams; i++) {
offset = i*N/nStreams;
cudaMemcpyAsync(a_d+offset, a_h+offset, size, dir, stream[i]);
}
for (i=0; i<nStreams; i++) {
offset = i*N/nStreams;
kernel<<<N/(nThreads*nStreams), nThreads,
0, stream[i]>>>(a_d+offset);
}
allow any other CUDA call to begin until it has completed.) A diagram depicting the
timeline of execution for the two code segments is shown in Figure 3.1; the sequential
transfer and execution of Listing 3.3a appears in the top half, and the staged version
with nStreams = 4 for Listing 3.3b in the bottom half.
For this example, it is assumed that the data transfer and kernel execution times are
comparable. In such cases, and when the execution time (tE) exceeds the transfer
time (tT), a rough estimate for the overall time is tE + tT/nStreams for the staged
version versus tE + tT for the sequential version. If the transfer time exceeds the
execution time, a rough estimate for the overall time is tT + tE/nStreams.
Low Priority: On version 2.2 of the CUDA Toolkit (and later), use zero-copy operations
on integrated GPUs.
The host code in Listing 3.4 shows how zero copy is typically set up.
float *a_h, *a_map;
…
cudaGetDeviceProperties(&prop, 0);
if (!prop.canMapHostMemory)
exit(0);
cudaSetDeviceFlags(cudaDeviceMapHost);
cudaHostAlloc(&a_h, nBytes, cudaHostAllocMapped);
cudaHostGetDevicePointer(&a_map, a_h, 0);
kernel<<<gridSize, blockSize>>>(a_map);
Of these different memory spaces, global and texture memory are the most
plentiful; see Section G.1 of the CUDA C Programming Guide for the amounts of
memory available in each memory space at each compute capability level. Global,
local, and texture memory have the greatest access latency, followed by constant
memory, registers, and shared memory.
The various principal traits of the memory types are shown in Table 3.1.
The access requirements for coalescing depend on the compute capability of the
device:
On devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must
access the k-th word in a segment aligned to 16 times the size of the elements
being accessed; however, not all threads need to participate.
On devices of compute capability 1.2 or 1.3, coalescing is achieved for any
pattern of accesses that fits into a segment size of 32 bytes for 8-bit words,
64 bytes for 16-bit words, or 128 bytes for 32- and 64-bit words. Smaller
transactions may be issued to avoid wasting bandwidth. More precisely, the
following protocol is used to issue a memory transaction for a half warp:
Find the memory segment that contains the address requested by the lowest
numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for
16-bit data, and 128 bytes for 32-, 64-, and 128-bit data.
Find all other active threads whose requested address lies in the same
segment, and reduce the transaction size if possible:
If the transaction is 128 bytes and only the lower or upper half is used,
reduce the transaction size to 64 bytes.
If the transaction is 64 bytes and only the lower or upper half is used,
reduce the transaction size to 32 bytes.
Carry out the transaction and mark the serviced threads as inactive.
Repeat until all threads in the half warp are serviced.
On devices of compute capability 2.x, memory accesses by the threads of a
warp are coalesced into the minimum number of L1-cache-line-sized aligned
transactions necessary to satisfy all threads; see Section G.4.2 of the CUDA C
Programming Guide.
These concepts are illustrated in the following simple examples.
Figure 3.4 Coalesced access in which all threads but one access the
corresponding word in a segment
This access pattern results in a single 64-byte transaction, indicated by the red
rectangle. Note that even though one word is not requested, all data in the segment
are fetched. If accesses by the threads were permuted within this segment, a single
64-byte transaction would still be performed by a device with compute capability 1.2
or higher, but 16 serialized transactions would be performed by a device with compute
capability 1.1 or lower.
Figure 3.5 Unaligned sequential addresses that fit within a single 128-
byte segment
If a half warp accesses memory that is sequential but split across two 128-byte
segments, then two transactions are performed. In the following case, illustrated in
Figure 3.6, one 64-byte transaction and one 32-byte transaction result. Again, this
figure assumes a device of compute capability 1.x.
Figure 3.6 Misaligned sequential addresses that fall within two 128-byte
segments
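The offsetCopy kernel of Listing 3.5 is not reproduced in this excerpt; a sketch
consistent with the discussion below is:

__global__ void offsetCopy(float *odata, float *idata, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}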
In Listing 3.5, data is copied from the input array idata to the output array, both of
which exist in global memory. The kernel is executed within a loop in host code that
varies the parameter offset from 1 to 32. (Figures 3.5 and 3.6 correspond to
offsets of 1 and 17, respectively.) The effective bandwidth for the copy with various
offsets on an NVIDIA GeForce GTX 280 (with compute capability 1.3) and an
NVIDIA GeForce GTX 8800 (compute capability 1.0) are shown in Figure 3.7.
For the NVIDIA GeForce GTX 8800 device, global memory accesses with no
offset or with offsets that are multiples of 16 result in a single transaction per half
warp and an effective bandwidth of approximately 74 GBps. Otherwise, 16
transactions are issued per half warp resulting in an effective bandwidth of
approximately 7 GBps. This roughly 8x performance degradation is due to the fact
that 32 bytes, the minimum transaction size, are fetched for each thread. However,
only 4 bytes of data are used for each 32 bytes fetched—resulting in the 4/32=1/8
performance relative to the fully coalesced case. The two numbers also reflect the
different data represented by effective bandwidth (4 bytes) versus actual bandwidth
(32 bytes).
Because of this possible performance degradation, memory coalescing is the most
critical aspect of performance optimization of device memory. For devices of
compute capability 1.2 and 1.3, the situation is less dire for misaligned accesses
because, in all cases, access by a half warp of threads in this kernel results in either
one or two transactions.
On the NVIDIA GeForce GTX 280 device, this results in an effective bandwidth
of between 120 GBps for a single transaction and 70 GBps for two transactions per
half warp. The number of transactions issued for a half warp of threads depends on
the offset and whether the warp is even- or odd-numbered. For offsets of 0 or 16,
each half warp results in a single 64-byte transaction (Figure 3.4). For offsets of 1
through 7 or 9 through 15, even-numbered warps result in a single 128-byte
transaction (Figure 3.5) and odd-numbered warps result in two transactions: one 64-
byte and one 32-byte (Figure 3.6). For offsets of 8, even-numbered warps result in
one 128-byte transaction and odd-numbered warps result in two 32-byte
transactions. The two 32-byte transactions, rather than a 64- and a 32-byte
transaction, are responsible for the blip at the offset of 8 in Figure 3.7.
Figure 3.8 illustrates a situation that can be created using the code in Listing 3.6;
namely, threads within a half warp access memory with a stride of 2. This action is
coalesced into a single 128-byte transaction on an NVIDIA GeForce GTX 280
(compute capability 1.3).
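Listing 3.6 is likewise not reproduced here; a strided-copy kernel along the lines it
describes is:

__global__ void strideCopy(float *odata, float *idata, int stride)
{
    int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    odata[xid] = idata[xid];
}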
Although a stride of 2 results in a single transaction, note that half the elements in
the transaction are not used and represent wasted bandwidth. As the stride
increases, the effective bandwidth decreases until the point where 16 transactions
are issued for the 16 threads in a half warp, as indicated in Figure 3.9.
Note, however, that on the NVIDIA GTX 8800 device (compute capability 1.0),
any non-unit stride results in 16 separate transactions per half warp.
As illustrated in Figure 3.9, non-unit stride global memory accesses should be
avoided whenever possible. One method for doing so utilizes shared memory,
which is discussed in the next section.
Shared memory banks are organized such that successive 32-bit words are assigned
to successive banks, and each bank has a bandwidth of 32 bits per clock cycle.
For devices of compute capability 1.x, the warp size is 32 threads and the number of
banks is 16. A shared memory request for a warp is split into one request for the
first half of the warp and one request for the second half of the warp. Note that no
bank conflict occurs if only one memory location per bank is accessed by a half
warp of threads.
For devices of compute capability 2.x, the warp size is 32 threads and the number of
banks is also 32. A shared memory request for a warp is not split as with devices of
compute capability 1.x, meaning that bank conflicts can occur between threads in the
first half of a warp and threads in the second half of the same warp (see Section
G.4.3 of the CUDA C Programming Guide).
Refer to the CUDA C Programming Guide for more information on how accesses and
banks can be matched to avoid conflicts.
To illustrate, the simpleMultiply kernel (Listing 3.7) computes the product C = AB
by calculating the output elements of a tile of matrix C.
__global__ void simpleMultiply(float *a, float* b, float *c,
                               int N)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row*TILE_DIM+i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
In Listing 3.7, a, b, and c are pointers to global memory for the matrices A, B, and
C, respectively; blockDim.x, blockDim.y, and TILE_DIM are all 16. Each thread in
the 16×16 block calculates one element in a tile of C. row and col are the row and
column of the element in C being calculated by a particular thread. The for loop
over i multiplies a row of A by a column of B, which is then written to C.
The effective bandwidth of this kernel is only 8.7 GBps on an NVIDIA GeForce
GTX 280 and 0.7 GBps on an NVIDIA GeForce GTX 8800. To analyze
performance, it is necessary to consider how half warps of threads access global
memory in the for loop. Each half warp of threads calculates one row of a tile of C,
which depends on a single row of A and an entire tile of B as illustrated in Figure
3.11.
Figure 3.11 Computing a row (half warp) of a tile in C using one row of A
and an entire tile of B
For each iteration i of the for loop, all threads in a half warp read the same value
from global memory (the index row*TILE_DIM+i is constant within a half warp),
resulting in 16 transactions for compute capability 1.1 or lower, and 1 transaction
for compute capability 1.2 or higher. Even though the operation requires only 1
transaction for compute capability 1.2 or higher, there is wasted bandwidth in the
transaction because only 4 bytes out of a 32-byte transaction are used. For each
iteration, the 16 threads in a half warp read a row of the B tile, which is a sequential
and coalesced access for all compute capabilities.
The performance on a device of any compute capability can be improved by reading
a tile of A into shared memory as shown in Listing 3.8.
__global__ void coalescedMultiply(float *a, float* b, float *c,
                                  int N)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * b[i*N+col];
    }
    c[row*N+col] = sum;
}
Listing 3.8 Using shared memory to improve the global memory load
efficiency in matrix multiplication
In Listing 3.8, each element in a tile of A is read from global memory only once, in a
fully coalesced fashion (with no wasted bandwidth), to shared memory. Within each
iteration of the for loop, a value in shared memory is broadcast to all threads in a
half warp.
In Listing 3.8, a __syncthreads() synchronization barrier call is not needed after
reading the tile of A into shared memory because only the threads within the half warp
that write the data into shared memory read the data. This kernel has an effective
bandwidth of 14.3 GBps on an NVIDIA GeForce GTX 280, and 8.2 GBps on an
NVIDIA GeForce GTX 8800.
A further improvement can be made to how Listing 3.8 deals with matrix B. In
calculating a tile’s row of matrix C, the entire tile of B is read. The repeated reading
of the B tile can be eliminated by reading it into shared memory once (Listing 3.9).
__global__ void sharedABMultiply(float *a, float* b, float *c,
                                 int N)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM],
                     bTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
    bTile[threadIdx.y][threadIdx.x] = b[threadIdx.y*N+col];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * bTile[i][threadIdx.x];
    }
    c[row*N+col] = sum;
}
Note that in Listing 3.9, a __syncthreads() call is required after reading the B tile
because a warp reads data from shared memory that were written to shared memory
by different warps. The effective bandwidth of this routine is 29.7 GBps on an
NVIDIA GeForce GTX 280 and 15.7 GBps on an NVIDIA GeForce GTX 8800.
Note that the performance improvement is not due to improved coalescing in either
case, but to avoiding redundant transfers from global memory.
The results of the various optimizations are summarized in Table 3.2.

Table 3.2 Performance improvements optimizing C = AB matrix multiplication

Optimization                                            GeForce GTX 280   GeForce GTX 8800
No optimization (Listing 3.7)                           8.7 GBps          0.7 GBps
Shared memory used to coalesce reads of a tile of A     14.3 GBps         8.2 GBps
Shared memory used to eliminate redundant reads of B    29.7 GBps         15.7 GBps
Medium Priority: Use shared memory to avoid redundant transfers from global
memory.
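Listing 3.10, the unoptimized C = AAᵀ kernel analyzed below, is not reproduced in
this excerpt; a sketch consistent with the analysis (using the same TILE_DIM
convention as Listing 3.7, with M the width of C) is:

__global__ void simpleMultiply(float *a, float *c, int M)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    for (int i = 0; i < TILE_DIM; i++) {
        sum += a[row*TILE_DIM+i] * a[col*TILE_DIM+i];
    }
    c[row*M+col] = sum;
}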
In Listing 3.10, the row-th, col-th element of C is obtained by taking the dot product
of the row-th and col-th rows of A. The effective bandwidth for this kernel is
1.1 GBps on an NVIDIA GeForce GTX 280 and 0.5 GBps on an NVIDIA
GeForce GTX 8800. These results are substantially lower than the corresponding
measurements for the C = AB kernel. The difference is in how threads in a half
warp access elements of A in the second term, a[col*TILE_DIM+i], for each
iteration i. For a half warp of threads, col represents sequential columns of the
transpose of A, and therefore col*TILE_DIM represents a strided access of global
memory with a stride of 16. This results in uncoalesced memory accesses on devices
with compute capability 1.1 or lower and plenty of wasted bandwidth on devices
with compute capability 1.2 or higher. The way to avoid strided access is to use
shared memory as before, except in this case a half warp reads a row of A into a
column of a shared memory tile, as shown in Listing 3.11.
__global__ void coalescedMultiply(float *a, float *c, int M)
{
    __shared__ float aTile[TILE_DIM][TILE_DIM],
                     transposedTile[TILE_DIM][TILE_DIM];
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0.0f;
    aTile[threadIdx.y][threadIdx.x] = a[row*TILE_DIM+threadIdx.x];
    transposedTile[threadIdx.x][threadIdx.y] =
        a[(blockIdx.x*blockDim.x + threadIdx.y)*TILE_DIM +
          threadIdx.x];
    __syncthreads();
    for (int i = 0; i < TILE_DIM; i++) {
        sum += aTile[threadIdx.y][i] * transposedTile[i][threadIdx.x];
    }
    c[row*M+col] = sum;
}
Listing 3.11 uses the shared transposedTile to avoid uncoalesced accesses in the
second term in the dot product, and the shared aTile technique from the previous
example to avoid uncoalesced accesses in the first term. The effective bandwidth of
this kernel is 24.9 GBps on an NVIDIA GeForce GTX 280 and 13.2 GBps on an
NVIDIA GeForce GTX 8800. These results are slightly lower than those obtained
by the final kernel for C = AB. The cause of the difference is shared memory bank
conflicts.
The reads of elements in transposedTile within the for loop are free of conflicts,
because threads of each half warp read across rows of the tile, resulting in unit stride
across the banks. However, bank conflicts occur when copying the tile from global
memory into shared memory. To enable the loads from global memory to be
coalesced, data are read from global memory sequentially. However, this requires
writing to shared memory in columns, and because of the use of 16×16 tiles in
shared memory, this results in a stride between threads of 16 banks. These 16-way
bank conflicts are very expensive. The simple remedy is to pad the shared memory
array so that it has an extra column, as in the following line of code.
__shared__ float transposedTile[TILE_DIM][TILE_DIM+1];
This padding eliminates the conflicts entirely, because now the stride between
threads is 17 banks (33 banks for compute capability 2.x), which, due to modular
arithmetic used to compute bank indices, is equivalent to a unit stride. After this
change, the effective bandwidth is 30.4 GBps on an NVIDIA GeForce GTX 280
and 15.6 GBps on an NVIDIA GeForce GTX 8800, which is comparable to the
results from the last C = AB kernel.
The results of these optimizations are summarized in Table 3.3.

Table 3.3 Performance improvements optimizing C = AAᵀ matrix multiplication

Optimization                                            GeForce GTX 280   GeForce GTX 8800
No optimization (Listing 3.10)                          1.1 GBps          0.5 GBps
Shared memory used to coalesce global reads of A        24.9 GBps         13.2 GBps
Bank conflicts removed by padding the shared tile       30.4 GBps         15.6 GBps
These results should be compared with those in Table 3.2. As can be seen from
these tables, judicious use of shared memory can dramatically improve performance.
The examples in this section have illustrated three ways to use shared memory:
To enable coalesced accesses to global memory, especially to avoid large strides
(for general matrices, strides are much larger than 16)
To eliminate (or reduce) redundant loads from global memory
To avoid wasted bandwidth
Low Priority: For kernels with long argument lists, place some arguments into
constant memory to save shared memory.
In certain addressing situations, reading device memory through texture fetching can
be an advantageous alternative to reading device memory from global or constant
memory.
This copy kernel applies a shift to the global memory location when reading from
idata, but writes to unshifted global memory locations in odata. The amount of
shift is specified as a function argument to the kernel. Some degradation of
performance occurs when the shift is neither zero nor a multiple of 16 because
reading from idata will be either uncoalesced (compute capability 1.1 or lower) or
result in transactions with wasted bandwidth (compute capability 1.2 or higher).
Note that regardless of compute capability, writing to odata is fully coalesced.
The version of this code that uses textures to perform the shifted read is shown in
Listing 3.13.
__global__ void textureShiftCopy(float *odata, float *idata,
                                 int shift)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x;
    odata[xid] = tex1Dfetch(texRef, xid+shift);
}
Here, the texture reference texRef is bound to the idata array in the host code and
the function tex1Dfetch() reads the shifted memory locations of idata via a
texture fetch. The results of both kernels (using global memory and textures for
loads) on an NVIDIA GeForce GTX 280 and an NVIDIA GeForce GTX 8800 are
given in Figure 3.12.
The benefit of using textures for cases that are not optimally coalesced is clear.
Textured reads can maintain effective bandwidth of the unshifted, fully coalesced
cases within a few percent. Note that shifts that are neither zero nor multiples of 16
show greater effective bandwidth than the offsetCopy kernel in Figure 3.7. Because
all the stores in the shift kernels are fully coalesced with no wasted bandwidth, the
shift applies only to the loads.
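For reference, the host-side binding assumed by this discussion might look like the
following sketch (the element count N, launch configuration, and omission of error
handling are illustrative assumptions, not taken from the guide's listings):

// Texture references must be declared at file scope.
texture<float, 1, cudaReadModeElementType> texRef;

// Host code: bind the texture reference to the device array idata before
// launching textureShiftCopy, and unbind it afterward.
cudaBindTexture(0, texRef, idata, N * sizeof(float));
textureShiftCopy<<<N/nThreads, nThreads>>>(odata, idata, shift);
cudaUnbindTexture(texRef);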
¹The automatic handling of boundary cases in the bottom row of Table 3.4 refers to how a texture coordinate is
resolved when it falls outside the valid addressing range. There are two options: clamp and wrap. If x is the
coordinate and N is the number of texels for a one-dimensional texture, then with clamp, x is replaced by 0 if x < 0
and by 1-1/N if 1 ≤ x. With wrap, x is replaced by frac(x), where frac(x) = x – floor(x) and floor(x) is the largest
integer not greater than x. So, in clamp mode where N = 1, an x of 1.3 is clamped to 1.0, whereas in wrap
mode it is converted to 0.3.
Within a kernel call, the texture cache is not kept coherent with respect to global
memory writes, so texture fetches from addresses that have been written via global
stores in the same kernel call return undefined data. That is, a thread can safely read
a memory location via texture if the location has been updated by a previous kernel
call or memory copy, but not if it has been previously updated by the same thread or
another thread within the same kernel call. This is relevant only when fetching from
linear or pitch-linear memory because a kernel cannot write to CUDA arrays.
3.2.6 Registers
Generally, accessing a register consumes zero extra clock cycles per instruction, but
delays may occur due to register read-after-write dependencies and register memory
bank conflicts.
The latency of read-after-write dependencies is approximately 24 cycles, but this
latency is completely hidden on multiprocessors that have at least 192 active threads
(that is, 6 warps) for devices of compute capability 1.x (8 CUDA cores per
multiprocessor * 24 cycles of latency = 192 active threads to cover that latency). For
devices of compute capability 2.0, which have 32 CUDA cores per multiprocessor,
as many as 768 threads might be required to completely hide latency.
The compiler and hardware thread scheduler will schedule instructions as optimally
as possible to avoid register memory bank conflicts. They achieve the best results
when the number of threads per block is a multiple of 64. Other than following this
rule, an application has no direct control over these bank conflicts. In particular,
there is no register-related reason to pack data into float4 or int4 types.
3.3 Allocation
Device memory allocation and de-allocation via cudaMalloc() and cudaFree() (or
their Driver API equivalents) are expensive operations, so device memory should be
reused and/or sub-allocated by the application wherever possible to minimize the
impact of allocations on overall performance.
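As a minimal sketch of this recommendation (the buffer names, sizes, and the
process kernel are placeholders), allocate once outside a processing loop and reuse
the buffer rather than allocating and freeing on every iteration:

float *d_buf;
cudaMalloc((void**)&d_buf, nBytes);            // allocate once
for (int i = 0; i < nIterations; i++) {
    cudaMemcpy(d_buf, h_in[i], nBytes, cudaMemcpyHostToDevice);
    process<<<grid, block>>>(d_buf);           // hypothetical kernel
    cudaMemcpy(h_out[i], d_buf, nBytes, cudaMemcpyDeviceToHost);
}
cudaFree(d_buf);                               // free once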
One of the keys to good performance is to keep the multiprocessors on the device
as busy as possible. A device in which work is poorly balanced across the
multiprocessors will deliver suboptimal performance. Hence, it’s important to
design your application to use threads and blocks in a way that maximizes hardware
utilization and to limit practices that impede the free distribution of work. A key
concept in this effort is occupancy, which is explained in the following sections.
Another important concept is the management of system resources allocated for a
particular task. How to manage this resource utilization is discussed in the final
sections of this chapter.
4.1 Occupancy
Thread instructions are executed sequentially in CUDA, and, as a result, executing
other warps when one warp is paused or stalled is the only way to hide latencies and
keep the hardware busy. Some metric related to the number of active warps on a
multiprocessor is therefore important in determining how effectively the hardware is
kept busy. This metric is occupancy.
Occupancy is the ratio of the number of active warps per multiprocessor to the
maximum number of possible active warps. (To determine the latter number, see
the deviceQuery.cu program in the CUDA SDK or refer to Appendix A in the
CUDA C Programming Guide.) Another way to view occupancy is the percentage of
the hardware’s ability to process warps that is actively in use.
Higher occupancy does not always equate to higher performance—there is a point
above which additional occupancy does not improve performance. However, low
occupancy always interferes with the ability to hide memory latency, resulting in
performance degradation.
Registers are allocated to an entire thread block all at once. So, if each thread block
uses many registers, the number of thread blocks that can be resident on a
multiprocessor is reduced, thereby lowering the occupancy of the multiprocessor.
The maximum number of registers per thread can be set manually at compilation
time per-file using the -maxrregcount option or per-kernel using the
__launch_bounds__ qualifier (see Section 3.2.6.1).
For purposes of calculating occupancy, the number of registers used by each thread
is one of the key factors. For example, devices with compute capability 1.0 and 1.1
have 8,192 32-bit registers per multiprocessor and can have a maximum of 768
simultaneous threads resident (24 warps x 32 threads per warp). This means that in
one of these devices, for a multiprocessor to have 100% occupancy, each thread can
use at most 10 registers. However, this approach of determining how register count
affects occupancy does not take into account the register allocation granularity. For
example, on a device of compute capability 1.0, a kernel with 128-thread blocks
using 12 registers per thread results in an occupancy of 83% with 5 active 128-
thread blocks per multiprocessor, whereas a kernel with 256-thread blocks using the
same 12 registers per thread results in an occupancy of 66% because only two 256-
thread blocks can reside on a multiprocessor. Furthermore, register allocations are
rounded up to the nearest 256 registers per block on devices with compute
capability 1.0 and 1.1.
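The arithmetic above can be captured in a small host-side helper. The following
sketch is not part of the guide and considers only the register limit on a compute
capability 1.0/1.1 device; shared memory and other limits are ignored:

// Estimate occupancy from register usage alone (compute capability 1.0/1.1).
float estimateOccupancySM10(int threadsPerBlock, int regsPerThread)
{
    const int regsPerSM    = 8192;  // 32-bit registers per multiprocessor
    const int maxThreadsSM = 768;   // 24 warps x 32 threads per warp
    const int maxBlocksSM  = 8;     // resident blocks per multiprocessor
    // Register allocation is rounded up to the nearest 256 registers per block.
    int regsPerBlock = ((threadsPerBlock * regsPerThread + 255) / 256) * 256;
    int blocks = regsPerSM / regsPerBlock;
    if (blocks > maxThreadsSM / threadsPerBlock) blocks = maxThreadsSM / threadsPerBlock;
    if (blocks > maxBlocksSM) blocks = maxBlocksSM;
    return (float)(blocks * threadsPerBlock) / maxThreadsSM;
}
// estimateOccupancySM10(128, 12) returns about 0.83 and
// estimateOccupancySM10(256, 12) about 0.67, matching the figures above.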
The number of registers available, the maximum number of simultaneous threads
resident on each multiprocessor, and the register allocation granularity vary over
different compute capabilities. Because of these nuances in register allocation and
the fact that a multiprocessor’s shared memory is also partitioned between resident
thread blocks, the exact relationship between register usage and occupancy can be
difficult to determine. The --ptxas-options=-v option of nvcc details the
number of registers used per thread for each kernel. See Section 4.2 of the CUDA C
Programming Guide for the register allocation formulas for devices of various compute
capabilities and Section G.1 of the programming guide for the total number of
registers available on those devices. Alternatively, NVIDIA provides an occupancy
calculator in the form of an Excel spreadsheet that enables developers to home in on
the optimal balance and to test different possible scenarios more easily. This
spreadsheet, shown in Figure 4.1, is called CUDA_Occupancy_calculator.xls and
is located in the tools directory of the CUDA SDK.
Figure 4.1 Use the CUDA GPU Occupancy Calculator to project occupancy
Medium Priority: The number of threads per block should be a multiple of 32 threads,
because this provides optimal computing efficiency and facilitates coalescing.
The dimension and size of blocks per grid and the dimension and size of threads
per block are both important factors. The multidimensional aspect of these
parameters allows easier mapping of multidimensional problems to CUDA and does
not play a role in performance. As a result, this section discusses size but not
dimension.
Latency hiding and occupancy depend on the number of active warps per
multiprocessor, which is implicitly determined by the execution parameters along
with resource (register and shared memory) constraints. Choosing execution
parameters is a matter of striking a balance between latency hiding (occupancy) and
resource utilization.
Choosing the execution configuration parameters should be done in tandem;
however, there are certain heuristics that apply to each parameter individually. When
choosing the first execution configuration parameter—the number of blocks per
grid, or grid size—the primary concern is keeping the entire GPU busy. The number
of blocks in a grid should be larger than the number of multiprocessors so that all
multiprocessors have at least one block to execute. Furthermore, there should be
multiple active blocks per multiprocessor so that blocks that aren’t waiting for a
__syncthreads() can keep the hardware busy. This recommendation is subject to
resource availability; therefore, it should be determined in the context of the second
execution parameter—the number of threads per block, or block size—as well as
shared memory usage. To scale to future devices, the number of blocks per kernel
launch should be in the thousands.
When choosing the block size, it is important to remember that multiple concurrent
blocks can reside on a multiprocessor, so occupancy is not determined by block size
alone. In particular, a larger block size does not imply a higher occupancy. For
example, on a device of compute capability 1.1 or lower, a kernel with a maximum
block size of 512 threads results in an occupancy of 66 percent because the
maximum number of threads per multiprocessor on such a device is 768. Hence,
only a single block can be active per multiprocessor. However, a kernel with 256
threads per block on such a device can result in 100 percent occupancy with three
resident active blocks.
As mentioned in Section 4.1, higher occupancy does not always equate to better
performance. For example, improving occupancy from 66 percent to 100 percent
generally does not translate to a similar increase in performance. A lower occupancy
kernel will have more registers available per thread than a higher occupancy kernel,
which may result in less register spilling to local memory. Typically, once an
occupancy of 50 percent has been reached, additional increases in occupancy do not
translate into improved performance.
There are many such factors involved in selecting block size, and inevitably some
experimentation is required. However, a few rules of thumb should be followed:
Threads per block should be a multiple of warp size to avoid wasting
computation on under-populated warps and to facilitate coalescing.
A minimum of 64 threads per block should be used, but only if there are
multiple concurrent blocks per multiprocessor.
Between 128 and 256 threads per block is a better choice and a good initial
range for experimentation with different block sizes.
Use several (3 to 4) smaller thread blocks rather than one large thread block per
multiprocessor if latency affects performance. This is particularly beneficial to
kernels that frequently call __syncthreads().
Note that when a thread block allocates more registers than are available on a
multiprocessor, the kernel launch fails, as it will when too much shared memory or
too many threads are requested.
Integer division and modulo operations are particularly costly and should be avoided
or replaced with bitwise operations whenever possible: if n is a power of 2, (i/n) is
equivalent to (i >> log2(n)) and (i % n) is equivalent to (i & (n-1)).
The compiler will perform these conversions if n is a literal. (For further information,
refer to Chapter 5 of the CUDA C Programming Guide.)
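As a quick illustration (the variable names are arbitrary), for n = 32 the division and
modulo can be rewritten as:

unsigned int quotient  = i >> 5;         // i / 32, since log2(32) = 5
unsigned int remainder = i & (32 - 1);   // i % 32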
Two types of runtime math operations are supported. They can be distinguished by
their names: some have names with prepended underscores, whereas others do not
(e.g., __functionName() versus functionName()). Functions following the
__functionName() naming convention map directly to the hardware level. They
are faster but provide somewhat lower accuracy (e.g., __sinf(x) and __expf(x)).
Functions following the functionName() naming convention are slower but have
higher accuracy (e.g., sinf(x) and expf(x)). The throughput of __sinf(x),
__cosf(x), and __expf(x) is much greater than that of sinf(x), cosf(x), and
tanf(x). The latter become even more expensive (about an order of magnitude
slower) if the magnitude of the argument x needs to be reduced. Moreover, in such
cases, the argument-reduction code uses local memory, which can affect
performance even more because of the high latency of local memory. More details
are available in the CUDA C Programming Guide.
Note also that whenever sine and cosine of the same argument are computed, the
sincos… family of instructions should be used to optimize performance:
__sincosf() for single-precision fast math, sincosf() for regular single precision,
and sincos() for double precision.
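For example, a single call computes both values (a minimal sketch; s and c are
arbitrary names):

float s, c;
sincosf(x, &s, &c);   // computes sinf(x) and cosf(x) together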
Medium Priority: Prefer faster, more specialized math functions over slower, more
general ones when possible.
For exponentiation using base 2 or 10, use the functions exp2() or exp2f() and
exp10() or exp10f() rather than the functions pow() or powf(). Both pow() and
powf() are heavy-weight functions in terms of register pressure and instruction
count due to the numerous special cases arising in general exponentiation and the
difficulty of achieving good accuracy across the entire ranges of the base and the
exponent. The functions exp2(), exp2f(), exp10(), and exp10f(), on the other
hand, are similar to exp() and expf() in terms of performance, and can be as
much as ten times faster than their pow()/powf() equivalents.
High Priority: Minimize the use of global memory. Prefer shared memory access
where possible.
Memory instructions include any instruction that reads from or writes to shared,
local, or global memory. When accessing uncached local or global memory, there are
400 to 600 clock cycles of memory latency.
As an example, the assignment operator in the following sample code has a high
throughput, but, crucially, there is a latency of 400 to 600 clock cycles to read data
from global memory:
__shared__ float shared[32];
__device__ float device[32];
shared[threadIdx.x] = device[threadIdx.x];
Much of this global memory latency can be hidden by the thread scheduler if there
are sufficient independent arithmetic instructions that can be issued while waiting
for the global memory access to complete. However, it is best to avoid accessing
global memory whenever possible.
High Priority: Avoid different execution paths within the same warp.
Any flow control instruction (if, switch, do, for, while) can significantly affect
the instruction throughput by causing threads of the same warp to diverge; that is,
to follow different execution paths. If this happens, the different execution paths
must be serialized, increasing the total number of instructions executed for this
warp. When all the different execution paths have completed, the threads converge
back to the same execution path.
To obtain best performance in cases where the control flow depends on the thread
ID, the controlling condition should be written so as to minimize the number of
divergent warps.
This is possible because the distribution of the warps across the block is
deterministic as mentioned in Section 4.1 of the CUDA C Programming Guide. A
trivial example is when the controlling condition depends only on (threadIdx /
WSIZE) where WSIZE is the warp size.
In this case, no warp diverges because the controlling condition is perfectly aligned
with the warps.
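A minimal sketch of the contrast (WSIZE is the warp size; the work functions are
hypothetical placeholders):

// No divergence: the condition is constant across each warp.
if ((threadIdx.x / WSIZE) % 2 == 0) {
    evenWarpWork();    // hypothetical device function
} else {
    oddWarpWork();     // hypothetical device function
}

// Divergence: threads within the same warp take different paths.
if (threadIdx.x % 2 == 0) {
    evenThreadWork();  // hypothetical device function
} else {
    oddThreadWork();   // hypothetical device function
}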
Low Priority: Make it easy for the compiler to use branch predication in lieu of loops
or control statements.
Sometimes, the compiler may unroll loops or optimize out if or switch statements
by using branch predication instead. In these cases, no warp can ever diverge. The
programmer can also control loop unrolling using
#pragma unroll
For more information on this pragma, refer to the CUDA C Programming Guide.
When using branch predication, none of the instructions whose execution depends
on the controlling condition is skipped. Instead, each such instruction is associated
with a per-thread condition code or predicate that is set to true or false according to
the controlling condition. Although each of these instructions is scheduled for
execution, only the instructions with a true predicate are actually executed.
Instructions with a false predicate do not write results, and they also do not evaluate
addresses or read operands.
The compiler replaces a branch instruction with predicated instructions only if the
number of instructions controlled by the branch condition is less than or equal to a
certain threshold: If the compiler determines that the condition is likely to produce
many divergent warps, this threshold is 7; otherwise it is 4.
Medium Priority: Use signed integers rather than unsigned integers as loop counters.
In the C language standard, unsigned integer overflow semantics are well defined,
whereas signed integer overflow causes undefined results. Therefore, the compiler
can optimize more aggressively with signed arithmetic than it can with unsigned
arithmetic. This is of particular note with loop counters: since it is common for loop
counters to have values that are always positive, it may be tempting to declare the
counters as unsigned. For slightly better performance, however, they should instead
be declared as signed.
For example, consider the following code:
for (i = 0; i < n; i++) {
out[i] = in[offset + stride*i];
}
Obtaining the right answer is clearly the principal goal of all computation. On
parallel systems, it is possible to run into difficulties not typically found in traditional
serial-oriented programming. These include threading issues, unexpected values due
to the way floating-point values are computed, and challenges arising from
differences in the way CPU and GPU processors operate. This chapter examines
issues that can affect the correctness of returned data and points to appropriate
solutions.
7.1 Debugging
The CUDA debugger, CUDA-GDB, is a valuable debugging tool. It is a port of the
GNU Debugger version 6.6 and runs on 32-bit and 64-bit Linux. See the CUDA-
GDB User Manual for more details.
Whenever doubles are used, use at least the -arch=sm_13 switch on the nvcc
command line; see Sections 3.1.3 and 3.1.4 of the CUDA C Programming Guide for
more details.
Even though a GPU can execute calls from only one context at a time, it can belong to
multiple contexts. For example, it is possible for several CPU threads to establish
contexts with the same GPU, which allows multi-GPU applications to be developed
on a single GPU. The GPU driver manages switching between the contexts, as well as
partitioning memory among the contexts (GPU memory allocated in one context
cannot be accessed from another context).
Lightweight CPU threads exchange data most efficiently via shared memory. Note
that in order for a pinned memory region to be viewed as pinned by CPU threads
other than the one that allocated it, one must call cudaHostAlloc() with the
cudaHostAllocPortable flag. A common communication pattern will be for one
CPU thread to copy data from its GPU to a shared host memory region, after which
another CPU thread will copy the data to its GPU. Users of NUMA systems will
have to follow the same best practices as for communication between non-GPU
accelerated CPU threads.
Communication between heavy-weight processes takes place via message passing,
for example MPI. Once data has been copied from GPU to CPU it is transferred to
another process by calling one of the MPI functions. For example, one possible
pattern when exchanging data between two GPUs is for a CPU thread to call a
device-to-host cudaMemcpy(), then MPI_Sendrecv(), then a host-to-device
cudaMemcpy(). Note that performance of the MPI function is not dependent on
the fact that data originated at or is destined for a GPU. Since MPI provides several
variations for most of its communication functions, the choice of a function should
be dictated by the best practices guide for the MPI implementation as well as the
system and network.
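A hedged sketch of the exchange pattern described above follows; the buffer names,
counts, message tag, and omission of error handling are illustrative assumptions:

#include <mpi.h>
#include <cuda_runtime.h>

void exchange(float *d_send, float *d_recv, float *h_send, float *h_recv,
              int n, int peer, MPI_Comm comm)
{
    size_t bytes = n * sizeof(float);
    // Device-to-host copy of the outgoing data
    cudaMemcpy(h_send, d_send, bytes, cudaMemcpyDeviceToHost);
    // Exchange host buffers with the peer rank
    MPI_Sendrecv(h_send, n, MPI_FLOAT, peer, 0,
                 h_recv, n, MPI_FLOAT, peer, 0,
                 comm, MPI_STATUS_IGNORE);
    // Host-to-device copy of the incoming data
    cudaMemcpy(d_recv, h_recv, bytes, cudaMemcpyHostToDevice);
}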
8.6 Infiniband
NVIDIA GPUDirect™ technology allows the sharing of CUDA pinned host
memory with other devices. This allows accelerated transfers of GPU data to other
devices, such as supported Infiniband network adapters. If GPUDirect support is
not available for your network device, network transfer throughput can be reduced.
A possible workaround is to disable RDMA. For the OpenMPI implementation,
this can be achieved by passing the flag -mca btl_openib_flags 1 to mpirun.
This appendix contains a list of all the recommendations for optimization and the
list of best practices that are explained in this document.
Particular attention must be paid to control flow instructions due to the SIMT
(single instruction multiple thread) nature of the device.
For kernels with long argument lists, place some arguments into constant
memory to save shared memory. (Section 3.2.2.4)
Use shift operations to avoid expensive division and modulo calculations.
(Section 5.1.1)
Avoid automatic conversion of doubles to floats. (Section 5.1.3)
Make it easy for the compiler to use branch predication in lieu of loops or
control statements. (Section 6.2)
B.1 NVCC
nvcc is the compiler that converts .cu files into C for the host system and CUDA
assembly or binary instructions for the device. It supports a spate of switches, of
which the following are especially useful for optimization and related best practices:
-arch=sm_13 or higher is required for double precision. See Section 7.2.1.
-maxrregcount=N specifies the maximum number of registers kernels can use
at a per-file level. See Section 3.2.6.1. (See also the __launch_bounds__
qualifier discussed in Section B.17 of the CUDA C Programming Guide to control
the number of registers used on a per-kernel basis.)
--ptxas-options=-v or -Xptxas=-v lists per-kernel register, shared, and
constant memory usage.
The -use_fast_math compiler option of nvcc coerces every functionName() call
to the equivalent __functionName() call. This makes the code run faster at the
cost of slightly diminished precision and accuracy. See Section 5.1.4.
Trademarks
NVIDIA, the NVIDIA logo, CUDA, GeForce, NVIDIA Quadro, and Tesla are trademarks or registered
trademarks of NVIDIA Corporation. Other company and product names may be trademarks of the respective
companies with which they are associated.
Copyright
© 2010 NVIDIA Corporation. All rights reserved.