OpenCL Best Practices Guide
OpenCL Best Practices Guide
REVISIONS
July 2009 (Original Release) April 2010 May 2010
ii
Table of Contents
Preface............................................................................................................................ v What Is This Document? Who Should Read This Guide? Recommendations and Best Practices Contents Summary v v v vi
Chapter 1. Heterogeneous Computing with OpenCL ..................................................... 1 1.1 1.2 1.3 Differences Between Host and Device What Runs on an OpenCL-Enabled Device? Maximum Performance Benefit 1 2 3
Chapter 2. Performance Metrics ..................................................................................... 5 2.1 Timing 2.1.1 Using CPU Timers 2.1.2 Using OpenCL GPU Timers 2.2 Bandwidth 2.2.1 Theoretical Bandwidth Calculation 2.2.2 Effective Bandwidth Calculation 2.2.3 Throughput Reported by the OpenCL Visual Profiler 5 5 6 6 6 7 7
Chapter 3. Memory Optimizations .................................................................................. 9 3.1 Data Transfer Between Host and Device 3.1.1 Pinned Memory 3.1.2 Asynchronous Transfers 3.1.3 Overlapping Transfers and Device Computation 3.2 Device Memory Spaces 3.2.1 Coalesced Access to Global Memory 3.2.1.1 A Simple Access Pattern 3.2.1.2 A Sequential but Misaligned Access Pattern 3.2.1.3 Effects of Misaligned Accesses 3.2.1.4 Strided Accesses 3.2.2 Shared Memory 3.2.2.1 Shared Memory and Memory Banks 3.2.2.2 Shared Memory in Matrix Multiplication (C = AB) 3.2.2.3 Shared Memory in Matrix Multiplication (C = AA )
T
9 9 12 12 16 17 18 18 19 21 22 22 23 27
iii
3.2.2.4 Shared Memory Use by Kernel Arguments 3.2.3 Local Memory 3.2.4 Texture Memory 3.2.4.1 Textured Fetch vs. Global Memory Read 3.2.4.2 Additional Texture Capabilities 3.2.5 Constant Memory 3.2.6 Registers 3.2.6.1 Register Pressure
29 29 30 30 30 31 31 31
Chapter 4. NDRange Optimizations..............................................................................33 4.1 Occupancy 4.2 Calculating Occupancy 4.3 Hiding Register Dependencies 4.4 Thread and Block Heuristics 4.5 Effects of Shared Memory 33 33 35 35 36
Chapter 5. Instruction Optimizations ...........................................................................39 5.1 Arithmetic Instructions 5.1.1 Division and Modulo Operations 5.1.2 Reciprocal Square Root 5.1.3 Other Arithmetic Instructions 5.1.4 Math Libraries 5.2 Memory Instructions 39 40 40 40 41 41
Chapter 6. Control Flow................................................................................................43 6.1 6.2 Branching and Divergence Branch Predication 43 43
Appendix A. Recommendations and Best Practices .....................................................45 A.1 Overall Performance Optimization Strategies A.2 High-Priority Recommendations A.3 Medium-Priority Recommendations A.4 Low-Priority Recommendations 45 46 46 47
Preface
applications have the highest priority, while small optimizations that affect only very specific situations are given a lower priority. Before implementing lower priority recommendations, it is good practice to make sure all higher priority recommendations that are relevant have already been applied. This approach will tend to provide the best results for the time invested and will avoid the trap of premature optimization. The criteria of benefit and scope for establishing priority will vary depending on the nature of the program. In this guide, they represent a typical case. Your code might reflect different priority factors. Regardless of this possibility, it is good practice to verify that no higher priority recommendations have been overlooked before undertaking lower priority items. Appendix A of this document lists all the recommendations and best practices, grouping them by priority and adding some additional helpful observations.
Contents Summary
The remainder of this guide is divided into the following sections: Introduction to Parallel Computing with OpenCL: Important aspects of the parallel programming architecture. Performance Metrics: How should performance be measured in OpenCL applications and what are the factors that most influence performance? Memory Optimizations: Correct memory management is one of the most effective means of improving performance. This chapter explores the different kinds of memory available to OpenCL applications, and it explains in detail how memory is handled behind the scenes. NDRanges Optimizations: How to make sure your OpenCL application is exploiting all the available resources on the GPU. Instruction Optimizations: Certain operations run faster than others. Using faster operations and avoiding slower ones often confers remarkable benefits. Control Flow: Carelessly designed control flow can force parallel code into serial execution; whereas thoughtfully designed control flow can help the hardware perform the maximum amount of work per clock cycle. Getting the Right Answer: How to debug code and how to handle differences in how the CPU and GPU represent floating-point values.
OpenCL programming involves running code on two different platforms: a host system that relies on one or more CPUs to perform calculations, and a card (frequently a graphics adapter) with one or more OpenCL-enabled NVIDIA GPUs (the device). While NVIDIA devices are primarily associated with rendering graphics, they also are powerful arithmetic engines capable of running thousands of lightweight threads in parallel. This capability makes them well suited to computations that can leverage parallel execution well. However, the device is based on a distinctly different design from the host system and, to use OpenCL effectively, its important to understand those differences and how they determine the performance of OpenCL applications.
into different types, each of which has a special purpose and fulfills different needs. The types of device RAM are explained in the NVIDIA OpenCL Programming Guide and in Chapter 3 of this document. These are the primary hardware differences between CPU hosts and GPU devices with respect to parallel programming. Other differences are discussed as they arise elsewhere in this document.
operations are an additional factor, as additions versus trigonometric functions have different complexity profiles. It is important to include transfers to and from the device in determining where operations should be performed. Data should be kept on the device as long as possible. Because transfers should be minimized, programs that run multiple kernels on the same data should favor leaving the data on the device between kernel calls, rather than transferring intermediate results to the host and then sending them back to the device for subsequent calculations. So if the data were already on the device in the previous example, the matrix addition should be performed locally on the device. This approach should be used even if one of the steps in a sequence of calculations could be performed faster on the host. Even a relatively slow kernel may be advantageous if it avoids one or more PCI Express (PCIe) transfers. Section 3.1 provides further details, including the measurements of bandwidth between host and device versus within the device proper.
The amount of performance benefit an application will realize by using OpenCL depends entirely on the extent to which it can be parallelized. As mentioned previously, code that cannot be sufficiently parallelized should run on the host, unless doing so would result in excessive transfers between host and device. Amdahls law specifies the maximum speed-up that can be expected by parallelizing portions of a serial program. Essentially, it states that the maximum speed-up (S) of a program is = 1 (1 ) + where P is the fraction of the total serial execution time taken by the portion of code that can be parallelized and N is the number of processors over which the parallel portion of the code runs. The larger N is (that is, the greater the number of processors), the smaller the P/N fraction. It can be simpler to view N as a very large number, which essentially transforms the equation into S = 1 / 1P. Now, if of a program is parallelized, the maximum speed-up over serial code is 1 / (1 ) = 4. For most purposes, the key point is that the greater P is, the greater the speed-up. An additional caveat is implicit in this equation, which is that if P is a small number (so not substantially parallel), increasing N does little to improve performance. To get the largest lift, best practices suggest spending most effort on increasing P; that is, by maximizing the amount of code that can be parallelized.
When attempting to optimize OpenCL code, it pays to know how to measure performance accurately and to understand the role that bandwidth plays in performance measurement. This chapter discusses how to correctly measure performance using CPU timers and OpenCL events. It then explores how bandwidth affects performance metrics and how to mitigate some of the challenges it poses.
2.1 Timing
OpenCL calls and kernel executions can be timed using either CPU or GPU timers. This section examines the functionality, advantages, and pitfalls of both approaches.
Listing 2.1 How to time code using OpenCL events Note that the timings are measured on the GPU clock, and so are operating system independent. The resolution of the GPU timer is approximately half a microsecond.
2.2 Bandwidth
Bandwidth is one of the most important gating factors for performance. Almost all changes to code should be made in the context of how they affect bandwidth. As described in Chapter 3 of this guide, bandwidth can be dramatically affected by the choice of memory in which data is stored, how the data is stored and accessed, as well as other factors. To measure performance accurately, it is useful to calculate theoretical and effective bandwidth. When the latter is much lower than the former, design or implementation details are likely to reduce bandwidth, and it should be the primary goal of subsequent optimization efforts to increase it.
High Priority: Use the effective bandwidth of your computation as a metric when measuring performance and optimization benefits.
Performance Metrics
In this calculation, the memory clock rate is converted in to Hz, multiplied by the interface width (divided by 8, to convert bits to bytes) and multiplied by 2 due to the double data rate. Finally, this product is divided by 109 to convert the result to GB/sec (GBps). Note that some calculations use 1,0243 instead of 109 for the final calculation. In such a case, the bandwidth would be 131.9 GBps. It is important to use the same divisor when calculating theoretical and effective bandwidth, so that the comparison is valid.
Memory optimizations are the most important area for performance. The goal is to maximize the use of the hardware by maximizing bandwidth. Bandwidth is best served by using as much fast memory and as little slow-access memory as possible. This chapter discusses the various kinds of memory on the host and device and how best to set up data items to use the memory effectively.
Intermediate data structures should be created in device memory, operated on by the device, and destroyed without ever being mapped by the host or copied to host memory. Also, because of the overhead associated with each transfer, batching many small transfers into one larger transfer performs significantly better than making each transfer separately. Finally, higher bandwidth between host and device is achieved when using pagelocked (or pinned) memory, as discussed in the NVIDIA OpenCL Programming Guide and section 3.1.1 of this document.
OpenCL applications do not have direct control over whether memory objects are allocated in pinned memory or not, but they can create objects using the CL_MEM_ALLOC_HOST_PTR flag and such objects are likely to be allocated in pinned memory by the driver for best performance. The oclBandwidthTest program in the NVIDIA GPU Computing SDK shows how to use these functions as well as how to measure memory transfer performance. Additional examples of pinned memory usage are provided in the oclSobelFilter , oclMedianFilter and oclCopyComputeOverlap program samples in the NVIDIA GPU Computing SDK. Pinned memory should not be overused. Excessive use can reduce overall system performance because pinned memory is a scarce resource. How much is too much is difficult to tell in advance, so as with all optimizations, test the applications and the systems they run on for optimal performance parameters. The steps normally needed to use pinned memory are briefly summarized in the following example. 1) Declare cl_mem buffer objects for the pinned host memory and the GPU device GMEM, respectively, and standard pointers to reference pinned host memory.
cl_context cxGPUContext; cl_mem cmPinnedBufIn = NULL; cl_mem cmPinnedBufOut = NULL; cl_mem cmDevBufIn = NULL; cl_mem cmDevBufOut = NULL;
unsigned char* cDataIn = NULL; unsigned char* cDataOut = NULL;
2) Allocate cl_mem buffer objects for the pinned host memory and the GPU device GMEM, respectively. Because these are time consuming operations, and because many applications dont need to change the size of these buffers within time-critical code paths, these functions are commonly executed in an application initialization function or event driven function (not in any program loop to be executed quickly and frequently).
cmPinnedBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY |
CL_MEM_ALLOC_HOST_PTR, memSize, NULL, NULL); cmPinnedBufOut = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, memSize, NULL, NULL);
cmDevBufIn = clCreateBuffer(cxGPUContext, CL_MEM_READ_ONLY, memSize, NULL, NULL); cmDevBufOut = clCreateBuffer(cxGPUContext, CL_MEM_WRITE_ONLY, memSize, NULL, NULL);
Memory Optimizations
3) Map standard pointer to reference the pinned host memory input and output buffers with standard pointers.
cDataIn = (unsigned char*)clEnqueueMapBuffer(cqCommandQue,
cmPinnedBufIn, CL_TRUE, CL_MAP_WRITE, 0, memSize, 0, NULL, NULL, NULL); cDataOut = (unsigned char*)clEnqueueMapBuffer(cqCommandQue, cmPinnedBufOut, CL_TRUE, CL_MAP_READ, 0, memSize, 0, NULL, NULL, NULL);
4) Initialize or update the pinned memory content, using the standard host pointer and standard host code. This might be done during program initialization function or at any appropriate time by such means as an asynchronous data acquisition function.
for(unsigned int i = 0; i < memSize; i++) { cDataIn[i] = (unsigned char)(i & 0xff); }
5) Write data from pinned host memory to the GPU device GMEM any time in the application that fresh data has been written to the pinned host memory. This step #5, along with steps #6 and #7, commonly constitute a core sequence (copy input data to GPU, compute on GPU, copy results back to CPU) in an application with a recurring main loop, such as an application with a GLUT display callback loop.
clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0, szBuffBytes, cDataIn, 0, NULL, NULL);
7) Read data from GPU device GMEM to pinned host memory. Note that this example uses a blocking read to assure the read is complete (which would make sense if the next step in the application was to display or otherwise use the processed data from the host). Also note that this read would be unnecessary in applications using CL-GL interop if the destination for the computed data is only a graphics window, not main CPU memory.
clEnqueueReadBuffer(cqCommandQue, cmDevBufOut, CL_TRUE, 0, szBuffBytes, cDataOut, 0, NULL, NULL);
11
3.1.2
Asynchronous Transfers
Data transfers between the host and the device using clEnqueueReadBuffer() and clEnqueueWriteBuffer() can be synchronous, blocking transfers or asynchronous non-blocking transfers with respect to the calling host thread. Whether the transfer is blocking or non-blocking is specified by the 3d argument in the API call. With a blocking clEnqueueReadBuffer() or clEnqueueWriteBuffer () call, control is not returned to the host thread until after the data transfer is complete. With the non-blocking read or write transfer, control is returned immediately to the host thread, allowing operations to occur concurrently in the host thread while the transfer between host and device proceeds. The following snippet demonstrates a non-blocking write from the host to the device (note the 3d argument is CL_FALSE). Because the clEnqueueWriteBuffer call is asynchronous with respect to the host thread and returns immediately, the subsequent printf statement will execute in parallel with the requested transfer. The host thread will then block at the clFinish call until the transfer is completed.
clEnqueueWriteBuffer(cqCommandQue, cmDevBufIn, CL_FALSE, 0, szBuffBytes, cDataIn, 0, NULL, NULL); printf(Asynchronous printf after non-blocking write); clFinish(cqCommandQue);
The next code snippet demonstrates a blocking read from the device to the host (note the 3d argument is CL_TRUE). Because the clEnqueueReadBuffer call is synchronous with respect to the host thread and does not return until the transfer is completed, the subsequent printf statement will executes in series after completion of the requested transfer.
clEnqueueReadBuffer(cqCommandQue, cmDevBufOut, CL_TRUE, 0, szBuffBytes, cDataOut, 0, NULL, NULL); printf(Synchronous printf after blocking read);
3.1.3
Memory Optimizations
The SDK sample oclCopyComputeOverlap is devoted exclusively to exposition of the techniques required to achieve concurrent copy and device computation and demonstrates the advantages of this relative to purely synchronous operations. The remainder of this section also describes and discusses the necessary techniques. Achieving maximal overlapped transfer and kernel execution requires: 1) Device with compute capability >= 1.2 (single or dual independent copy engines) 2) Host side memory buffers using pinned memory as described above in section 3.1.1 3) 2 OpenCL command queues and 1 OpenCL context 4) Properly balanced compute / copy phases 5) Proper use of clFlush commands in concert with copy/compute phases (Required only if using WDDM Windows (Windows Vista and Windows 7) 6) Warmed up GPU The following code snippet (extracted from the oclCopyComputeOverlap SDK sample illustrates the core sequence to obtain copy/compute overlap, assuming the involved buffers and command queues have already been setup properly. Refer directly to that sample for additional details. The core sequence of the code copies 2 input vectors from the host to the GPU, runs a hypotenuse computation on the GPU for each pair of elements of those 2 input vectors and then copiesthe resulting hypotenuse vector from the GPU back to the host. Each input and output vector is broken into 2 halves to be managed by each of 2 command queues, and each of the 2 command queues will manage of the hypotenuse computations. The sequence executes 2 staggered copy/compute/copy sequences in 2 command queues in 3 phases.
// Nonblocking Write of 1st half of input data, queue 0 clEnqueueWriteBuffer(cqCommandQueue[0], cmDevSrcA, CL_FALSE, 0, szHalfBuffer, (void*)&fSourceA[0], 0, NULL, NULL); clEnqueueWriteBuffer(cqCommandQueue[0], cmDevSrcB, CL_FALSE, 0, szHalfBuffer, (void*)&fSourceB[0], 0, NULL, NULL); // Push the write for queue 0 clFlush(cqCommandQueue[0]); // Start new phase *********************************** // Launch kernel computation, queue 0 clEnqueueNDRangeKernel(cqCommandQueue[0], ckKernel[0], 1, NULL, &szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL);
// Nonblocking Write of 2nd half of input data, queue1 clEnqueueWriteBuffer(cqCommandQueue[1], cmDevSrcA, CL_FALSE, szHalfBuffer, szHalfBuffer,
13
(void*)&fSourceA[szHalfOffset], 0, NULL, NULL); clEnqueueWriteBuffer(cqCommandQueue[1], cmDevSrcB, CL_FALSE, szHalfBuffer, szHalfBuffer, (void*)&fSourceB[szHalfOffset], 0, NULL, NULL); // Push the compute for queue 0 & write for queue 1 clFlush(cqCommandQueue[0]); clFlush(cqCommandQueue[1]); // Start new phase ********************************** // Launch kernel computation, queue 1 clEnqueueNDRangeKernel(cqCommandQueue[1], ckKernel[1], 1, NULL, &szGlobalWorkSize, &szLocalWorkSize, 0, NULL, NULL); // Non Blocking Read of 1st half of output data, queue 0 clEnqueueReadBuffer(cqCommandQueue[0], cmDevResult, CL_FALSE, 0, szHalfBuffer, (void*)&fResult[0], 0, NULL, NULL); // Push the compute for queue 1 & the read for queue 0 clFlush(cqCommandQueue[0]); clFlush(cqCommandQueue[1]); // Start new phase ****************************** // Non Blocking Read of 2nd half of output data, queue 1 clEnqueueReadBuffer(cqCommandQueue[1], cmDevResult, CL_FALSE, szHalfBuffer, szHalfBuffer, (void*)&fResult[szHalfOffset], 0, NULL, NULL);
Note that the clFlush commands at the each phase are not required if using Linux, Mac OSX or Windows XP. The clFlush is needed on WDDM Windows versions (e.g. Windows Vista and Windows 7) due to a side effect of driver optimizations that normally yield application performance improvements on WDDM versions of Windows with standard NVIDIA graphics drivers. Also note that the degree of overlap that can be achieved depends upon many variables, including: bandwidth of the PCIe bus on the host system, performance of the GPU being used, size of the input and output data vectors to be overlapped, OpenCL kernel computation intensity and duration, etc. In the oclCopyCompute sample, an attempt is made to dynamically configure the kernel computation in such a fashion that a best-case overlap will occur for the hypotenuse computation involved. Ideally, this should result in up to 30% overlap when using a device with one independent copy engine and up to 50% for a GPU with 2 copy engines. (The % overlap represents the proportion by which execution time is reduced when using a 2 queue overlapped approach than when using a 1 queue approach with no overlap).
Memory Optimizations
The following pictorial depicts the approximate time sequence for the oclCopyComputeOverlap sample running with a single command queue without overlap and then with overlap using 2 command queues. Note that the total time for completion of the task using the overlapped technique with 2 command queues is substantially shorter than the non-overlapped approach with 1 command queue.
Additional average overlap is possible on devices with 2 independent copy engines (Tesla and Quadro GPUs with compute capability >= 2.0) if an application such as an image processing application repeats a copy/compute/copy sequence (with fresh data at each iteration). This is depicted in the following pictorial time sequence.
15
To Host
Figure 3.2 The various memory spaces on a CUDA device Of these different memory spaces, global and texture memory are the most plentiful. There is a 16 KB per thread limit on local memory, a total of 64 KB of constant memory, and a limit of 16 KB of shared memory, and either 8,192 or 8 16,384 32-bit registers per multiprocessor. Global, local, and texture memory have bit the greatest access latency (although texture is cached), followed by constant , memory, registers and shared memory. , registers, The various principal traits of the memory types are shown in Table 3.1. 3.1 Table 3.1 Salient features of device memory Memory Register Local Shared Global Location on/off chip On Off On Off Cached n/a No n/a No Access R/W R/W R/W R/W Scope 1 thread 1 thread All threads in block All threads + host Lifetime Thread Thread Block Host allocation
Memory Optimizations
Constant Texture
Off Off
Yes Yes
R R
Perhaps the single most important performance consideration in programming for the CUDA architecture is coalescing global memory accesses. Global memory loads and stores by threads of a half warp (16 threads) are coalesced by the device in as few as one transaction (or two transactions in the case of 128-bit words) when certain access requirements are met. To understand these access requirements, global memory should be viewed in terms of aligned segments of 16 and 32 words. Figure 3.3 helps explain coalescing of a half warp of 32-bit words, such as floats. It shows global memory as rows of 64-byte aligned segments (16 floats). Two rows of the same color represent a 128-byte aligned segment. A half warp of threads that accesses the global memory is indicated at the bottom of the figure.
Figure 3.3 Linear memory segments and threads in a half warp The access requirements for coalescing depend on the compute capability of the device: On devices of compute capability 1.0 or 1.1, the k-th thread in a half warp must access the k-th word in a segment aligned to 16 times the size of the elements being accessed; however, not all threads need to participate. On devices of compute capability 1.2 or higher, coalescing is achieved for any pattern of accesses that fits into a segment size of 32 bytes for 8-bit words, 64 bytes for 16-bit words, or 128 bytes for 32- and 64-bit words. Smaller transactions may be issued to avoid wasting bandwidth. More precisely, the following protocol is used to issue a memory transaction for a half warp: Find the memory segment that contains the address requested by the lowest numbered active thread. Segment size is 32 bytes for 8-bit data, 64 bytes for 16-bit data, and 128 bytes for 32-, 64-, and 128-bit data.
17
Find all other active threads whose requested address lies in the same segment, and reduce the transaction size if possible: If the transaction is 128 bytes and only the lower or upper half is used, reduce the transaction size to 64 bytes. If the transaction is 64 bytes and only the lower or upper half is used, reduce the transaction size to 32 bytes. Carry out the transaction and mark the serviced threads as inactive. Repeat until all threads in the half warp are serviced. These concepts are illustrated in the following simple examples.
Figure 3.4 Coalesced access in which all threads but one access the corresponding word in a segment This access pattern results in a single 64-byte transaction, indicated by the red rectangle. Note that even though one word is not requested, all data in the segment are fetched. If accesses by threads were permuted within this segment, still one 64byte transaction would be performed by a device with compute capability 1.2 or higher, but 16 serialized transactions would be performed by a device with compute capability 1.1 or lower.
Memory Optimizations
Figure 3.5 Unaligned sequential addresses that fit within a single 128byte segment If a half warp accesses memory that is sequential but split across two 128-byte segments, then two transactions are performed. In the following case, illustrated in Figure 3.6, one 64-byte transaction and one 32-byte transaction result.
Figure 3.6 Misaligned sequential addresses that fall within two 128-byte segments Device memory allocated through OpenCL is guaranteed to be aligned to at least 256 bytes. Therefore, choosing sensible thread block sizes, such as multiples of 16, facilitates memory accesses by half warps that are aligned to segments. In addition, the qualifiers __attribute__ ((aligned(8))) and __attribute__ ((aligned(16))) can be used when defining structures to ensure alignment to segments.
Listing 3.5 A copy kernel that illustrates misaligned accesses In Listing 3.5, data is copied from the input array idata to the output array, both of which exist in global memory. The kernel is executed within a loop in host code that
19
varies the parameter offset from 1 to 32. (Figures 3.5 and 3.6 correspond to offsets of 1 and 17, respectively.) The effective bandwidth for the copy with various offsets on an NVIDIA GeForce GTX 280 (with compute capability 1.3) and an NVIDIA GeForce GTX 8800 (compute capability 1.0) are shown in Figure 3.7.
Figure 3.7 Performance of offsetCopy kernel For the NVIDIA GeForce GTX 8800 device, global memory accesses with no offset or with offsets that are multiples of 16 result in a single transaction per half warp and an effective bandwidth of approximately 74 GBps. Otherwise, 16 transactions are issued per half warp resulting in an effective bandwidth of approximately 7 GBps. This roughly 8x performance degradation is due to the fact that 32 bytes, the minimum transaction size, are fetched for each thread. However, only 4 bytes of data are used for each 32 bytes fetchedresulting in the 4/32=1/8 performance relative to the fully coalesced case. The two numbers also reflect the different data represented by effective bandwidth (4 bytes) versus actual bandwidth (32 bytes). Because of this possible performance degradation, memory coalescing is the most critical aspect of performance optimization of device memory. For the NVIDIA GeForce GTX 280 device, the situation is less dire for misaligned accesses because, in all cases, access by a half warp of threads in this kernel results in either one or two transactions. As such, the effective bandwidth is between 120 GBps for a single transaction and 70 GBps for two transactions per half warp. The number of transactions issued for a half warp of threads depends on the offset and whether the warp is even- or odd-numbered. For offsets of 0 or 16, each half warp results in a single 64-byte transaction (Figure 3.4). For offsets of 1 through 7 or 9 through 15, even-numbered warps result in a single 128-byte transaction (Figure 3.5) and oddnumbered warps result in two transactions: one 64-byte and one 32-byte (Figure 3.6). For offsets of 8, even-numbered warps result in one 128-byte transaction and odd-numbered warps result in two 32-byte transactions. The two 32-byte
Memory Optimizations
transactions, rather than a 64- and a 32-byte transaction, are responsible for the blip at the offset of 8 in Figure 3.7.
Listing 3.6 A kernel to illustrate non-unit stride data copy Figure 3.8 illustrates a situation that can be created using the code in Listing 3.6; namely, threads within a half warp access memory with a stride of 2. This action is coalesced into a single 128-byte transaction on an NVIDIA GeForce GTX 280 (compute capability 1.3).
Figure 3.8 A half warp accessing memory with a stride of 2 Although a stride of 2 results in a single transaction, note that half the elements in the transaction are not used and represent wasted bandwidth. As the stride increases, the effective bandwidth decreases until the point where 16 transactions are issued for the 16 threads in a half warp, as indicated in Figure 3.9.
21
Figure 3.9 Performance of strideCopy kernel Note, however, that on the NVIDIA GTX 8800 device (compute capability 1.0), any non-unit stride results in 16 separate transactions per half warp. As illustrated in Figure 3.9, non-unit stride global memory accesses should be avoided whenever possible. One method for doing so utilizes shared memory, which is discussed in the next section.
Memory Optimizations
Medium Priority: Accesses to shared memory should be designed to avoid serializing requests due to bank conflicts.
Shared memory banks are organized such that successive 32-bit words are assigned to successive banks and each bank has a bandwidth of 32 bits per clock cycle. The bandwidth of shared memory is 32 bits per bank per clock cycle. For devices of compute capability 1.x, the warp size is 32 threads and the number of banks is 16. A shared memory request for a warp is split into one request for the first half of the warp and one request for the second half of the warp. Note that no bank conflict occurs if only one memory location per bank is accessed by a half warp of threads. Refer to the NV IDIA OpenCL Programming Guide for more information on how accesses and banks can be matched to avoid conflicts.
23
Figure 3.10 A block-column matrix (A) multiplied by a block-row matrix row (B) and the resulting product matrix (C) To do this, the simpleMultiply kernel (Listing 3.7) calculates the output elements of a tile of matrix C.
__kernel void simpleMultiply( simpleMultiply(__global float* a, __global float* b, __global float* c, int N) { int row = get_glob get_global_id(1); int col = get_global_id(0) get_global_id(0); float sum = 0.0f; for (int i = 0; i < TILE_DIM; i++) { sum += a[row*TILE_DIM+i] * b[i*N+col]; } c[row*N+col] = sum; }
Listing 3.7 Unoptimized matrix multiplication In Listing 3.7, a, b, and c are pointers to global memory for the matrices A, B, and C, respectively; blockDim.x, blockDim.y, and TILE_DIM are all 16. Each thread in the 16x16 block calculates one element in a tile of C. row and col are the row and column of the element in C being calculated by a particular thread. The for loop calculated over i multiplies a row of A by a column of B, which is then written to C. The effective bandw he bandwidth of this kernel is only 8.7 GBps on an NVIDIA GeForce n GTX 280 and 0. GBps on an NVIDIA GeForce GTX 8800. To analyze 0.7 . performance, it is necessary to consider how half warps of threads access global ps memory in the for loop. Each half warp of threads calculates one row of a tile of C, which depends on a single row of A and an entire tile of B as illustrated in Figure 3.11.
Memory Optimizations
Figure 3.11 Computing a row (half warp) of a tile in C using one row of A and an entire tile of B For each iteration i of the for loop, all threads in a half warp read the same value from global memory (the index row*TILE_DIM+i is constant within a half warp), resulting in 16 transactions for compute capability 1.1 or lower, and 1 transaction , for compute capability 1.2 or higher. Even though the operation requires only 1 transaction for compute capability 1.2 or higher, there is wasted bandwidth in the bandw transaction because only 4 bytes out of a 32-byte transaction are used. For each iteration, the 16 threads in a half warp read a row of the B tile, which is a sequential and coalesced access for all compute capabilities. The performance on a device of any compute capability can be improved by reading a tile of A into shared memory as shown in Listing 3.8.
__kernel void coalescedMultiply( coalescedMultiply(__global float* a, __global float* b, __global float* c, int N, __local float aTile[TILE_DIM][TILE_DIM]) aTile[TILE_DIM][TILE_DIM] { int row = get_global_id(1) get_global_id(1); int col = get_global_id(0) get_global_id(0); float sum = 0.0f; int x = get_l get_local_id(0); int y = get_local_id(1); aTile[y][x] = a[row*TILE_DIM+x]; for (int i = 0; i < TILE_DIM; i++) { sum += aTile[y][i]* b[i*N+col]; } c[row*N+col] = sum; }
25
Listing 3.8 Using shared memory to improve the global memory load efficiency in matrix multiplication In Listing 3.8, each element in a tile of A is read from global memory only once, in a fully coalesced fashion (with no wasted bandwidth), to shared memory. Within each iteration of the for loop, a value in shared memory is broadcast to all threads in a half warp. In Listing 3.8, a synchronization barrier call is not needed after reading the tile of A into shared memory because only threads within the half warp that write the data into shared memory read the data. This kernel has an effective bandwidth of 14.3 GBps on an NVIDIA GeForce GTX 280, and 8.2 GBps on an NVIDIA GeForce GTX 8800. A further improvement can be made to how Listing 3.8 deals with matrix B. In calculating a tiles row of matrix C, the entire tile of B is read. The repeated reading of the B tile can be eliminated by reading it into shared memory once (Listing 3.9).
__kernel void sharedABMultiply(__global float* a, __global float* b, __global float* c, int N, __local float aTile[TILE_DIM][TILE_DIM], __local float bTile[TILE_DIM][TILE_DIM]) { int row = get_global_id(1); int col = get_global_id(0); float sum = 0.0f; int x = get_local_id(0); int y = get_local_id(1); aTile[y][threadIdx.x] = a[row*TILE_DIM+x]; bTile[y][threadIdx.x] = b[y*N+col]; barrier(CLK_LOCAL_MEM_FENCE); for (int i = 0; i < TILE_DIM; i++) { sum += aTile[y][i]* bTile[i][x]; } c[row*N+col] = sum; }
Listing 3.9 Improvement by reading additional data into shared memory Note that in Listing 3.9, a barrier() call is required after reading the B tile because a warp reads data from shared memory that were written to shared memory by different warps. The effective bandwidth of this routine is 29.7 GBps on an NVIDIA GeForce GTX 280 and 15.7 GBps on an NVIDIA GeForce GTX 8800. Note that the performance improvement is not due to improved coalescing in either case, but to avoiding redundant transfers from global memory. The results of the various optimizations are summarized in Table 3.2. Table 3.2 Performance improvements optimizing C = AB matrix multiply Optimization No optimization NVIDIA GeForce GTX 280 8.8 GBps NVIDIA GeForce GTX 8800 0.7 GBps
Memory Optimizations
Coalesced using shared memory to store a tile of A Using shared memory to eliminate redundant reads of a tile of B
14.3 GBps
8.2 GBps
29.7 GBps
15.7 GBps
Medium Priority: Use shared memory to avoid redundant transfers from global memory.
Listing 3.10 Unoptimized handling of strided accesses to global memory In Listing 3.10, the row-th, col-th element of C is obtained by taking the dot product of the row-th and col-th rows of A. The effective bandwidth for this kernel is 1.1 GBps on an NVIDIA GeForce GTX 280 and 0.5 GBps on an NVIDIA GeForce GTX 8800. These results are substantially lower than the corresponding measurements for the C = AB kernel. The difference is in how threads in a half warp access elements of A in the second term, a[col*TILE_DIM+i], for each iteration i. For a half warp of threads, col represents sequential columns of the transpose of A, and therefore col*TILE_DIM represents a strided access of global memory with a stride of 16. This results in uncoalesced memory accesses on devices with compute capability 1.1 or lower and plenty of wasted bandwidth on devices with compute capability 1.2 or higher. The way to avoid strided access is to use shared memory as before, except in this case a half warp reads a row of A into a column of a shared memory tile, as shown in Listing 3.11.
__kernel void coalescedMultiply(__global float *a, __global float *c, int M, __local float aTile[TILE_DIM][TILE_DIM], __local float transposedTile[TILE_DIM][TILE_DIM]) {
27
int row = get_global_id(1); int col = get_global_id(0); float sum = 0.0f; int x = get_local_id(0); int y = get_local_id(1); aTile[y][x] = a[row*TILE_DIM+x]; transposedTile[x][y] = a[(get_group_id(0) * get_num_groups(0) + y) * TILE_DIM + x]; barrier(CLK_LOCAL_MEM_FENCE); for (int i = 0; i < TILE_DIM; i++) { sum += aTile[y][i] * transposedTile[i][x]; } c[row*M+col] = sum; }
Listing 3.11 An optimized version of Listing 3.10 using coalesced reads from global memory Listing 3.11 uses the shared transposedTile to avoid uncoalesced accesses in the second term in the dot product, and the shared aTile technique from the previous example to avoid uncoalesced accesses in the first term. The effective bandwidth of this kernel is 24.9 GBps on an NVIDIA GeForce GTX 280 and 13.2 GBps on an NVIDIA GeForce GTX 8800. These results are slightly lower than those obtained by the final kernel for C = AB. The cause of the difference is shared memory bank conflicts. The reads of elements in transposedTile within the for loop are free of conflicts, because threads of each half warp read across rows of the tile, resulting in unit stride across the banks. However, bank conflicts occur when copying the tile from global memory into shared memory. To enable the loads from global memory to be coalesced, data are read from global memory sequentially. However, this requires writing to shared memory in columns, and because of the use of 16x16 tiles in shared memory, this results in a stride between threads of 16 banks. These 16-way bank conflicts are very expensive. The simple remedy is to pad the shared memory array so that it has an extra column, as in the following line of code.
__local float transposedTile[TILE_DIM][TILE_DIM+1];
This padding eliminates the conflicts entirely, because now the stride between threads is 17 banks, which, due to modular arithmetic used to compute bank indices, is equivalent to a unit stride. After this change, the effective bandwidth is 30.4 GBps on an NVIDIA GeForce GTX 280 and 15.6 GBps on an NVIDIA GeForce GTX 8800, which is comparable to the results from the last C = AB kernel. The results of these optimizations are summarized in Table 3.3. Table 3.3 Performance improvements optimizing C = AAT matrix multiplication Optimization No optimization NVIDIA GeForce GTX 280 1.1 GBps NVIDIA GeForce GTX 8800 0.5 GBps
Memory Optimizations
These results should be compared with those in Table 3.2. As can be seen from these tables, judicious use of shared memory can dramatically improve performance. The examples in this section have illustrated three ways to use shared memory: To enable coalesced accesses to global memory, especially to avoid large strides (for general matrices, strides are much larger than 16) To eliminate (or reduce) redundant loads from global memory To avoid wasted bandwidth
29
Memory Optimizations
The automatic handling of boundary cases in the bottom row of Table 3.4 refers to how a texture coordinate is resolved when it falls outside the valid addressing range. There are two options: clamp and repeat. If x is the coordinate and N is the number of texels for a one-dimensional texture, then with clamp, x is replaced by 0 if x < 0 and by 1-1/N if 1 x. With repeat, x is replaced by frac(x) where frac(x) = x floor(x). Floor returns the largest integer less than or equal to x. So, in clamp mode where N = 1, an x of 1.3 is clamped to 1.0; whereas in repeat mode, it is converted to 0.3
3.2.6 Registers
Generally, accessing a register consumes zero extra clock cycles per instruction, but delays may occur due to register read-after-write dependencies and register memory bank conflicts. The latency of read-after-write dependencies is approximately 24 cycles, but this latency is completely hidden on multiprocessors that have at least 192 active threads (that is, 6 warps). The compiler and hardware thread scheduler will schedule instructions as optimally as possible to avoid register memory bank conflicts. They achieve the best results when the number of threads per block is a multiple of 64. Other than following this rule, an application has no direct control over these bank conflicts. In particular, there is no register-related reason to pack data into float4 or int4 types.
31
One of the keys to good performance is to keep the multiprocessors on the device as busy as possible. A device in which work is poorly balanced across the multiprocessors will deliver suboptimal performance. Hence, its important to design your application to use threads and blocks in a way that maximizes hardware utilization and to limit practices that impede the free distribution of work. A key concept in this effort is occupancy, which is explained in the following sections. Another important concept is the management of system resources allocated for a particular task. How to manage this resource utilization is discussed in the final sections of this chapter.
4.1 Occupancy
Thread instructions are executed sequentially in CUDA, and, as a result, executing other warps when one warp is paused or stalled is the only way to hide latencies and keep the hardware busy. Some metric related to the number of active warps on a multiprocessor is therefore important in determining how effectively the hardware is kept busy. This metric is occupancy. Occupancy is the ratio of the number of active warps per multiprocessor to the maximum number of possible active warps. (To determine the latter number, see the oclDeviceQuery program in the NVIDIA GPU Computing SDK or refer to Appendix A in the NVIDIA OpenCL Programming Guide.) Another way to view occupancy is the percentage of the hardwares ability to process warps that are actively in use. Higher occupancy does not always equate to higher performancethere is a point above which additional occupancy does not improve performance. However, low occupancy always interferes with the ability to hide memory latency, resulting in performance degradation.
33
thread blocks that can be resident on a multiprocessor is reduced, thereby lowering the occupancy of the multiprocessor. For purposes of calculating occupancy, the following factors can be important. Devices with compute capability 1.1 or lower have 8,192 32-bit registers per multiprocessor. Devices with compute capability 1.2 or 1.3 have 16,384 32-bit registers per multiprocessor. Multiprocessors with compute capability 1.1 and lower can have a maximum of 768 simultaneous threads resident (24 warps x 32 threads per warp). This means that in a multiprocessor with 100 percent occupancy, every thread can use 10 registers before occupancy is reduced. For compute capability 1.2 and 1.3, the corresponding number is 16 registers per thread (16,384 / (32 warps x 32 threads per warp)). The preceding approach of determining how register count affects occupancy does not take into account allocation granularity because register allocation is performed per block. For example, on a device of compute capability 1.0, a kernel with 128thread blocks using 12 registers per thread results in an occupancy of 83 percent with 5 active 128-thread blocks per multiprocessor, whereas a kernel with 256thread blocks using the same 12 registers per thread results in an occupancy of 66 percent because only two 256-thread blocks can reside on a multiprocessor. Not only is register allocation performed per block, but it also is rounded to the nearest 256 registers per block on devices with compute capability 1.0 and 1.1, and its rounded to the nearest 512 registers on devices with compute capability 1.2 and 1.3. Because of these nuances in register allocation and the fact that a multiprocessors shared memory is also partitioned between resident thread blocks, the relation between register usage and occupancy can be difficult to determine. NVIDIA provides an occupancy calculator in the form of an Excel spreadsheet that enables developers to hone in on the optimal balance and to test different possible scenarios. This spreadsheet, shown in Figure 4.1, is CUDA_Occupancy_calculator.xls located in the tools directory of the SDK.
Figure 4.1 Use the CUDA GPU Occupancy Calculator to project occupancy
NDRange Optimizations
In addition to the calculator spreadsheet, occupancy can be determined using the OpenCL Visual profiler.
Register dependencies arise when an instruction uses a result stored in a register written by an instruction before it. The latency on current CUDA-enabled GPUs is approximately 24 cycles, so threads must wait 24 cycles before using an arithmetic result. However, this latency can be completely hidden by the execution of threads in other warps. To hide arithmetic latency completely, multiprocessors should be running at least 192 threads (6 warps). This equates to 25 percent occupancy on devices with compute capability 1.1 and lower, and 18.75 percent occupancy on devices with compute capability 1.2 and higher.
The dimension and size of blocks per grid and the dimension and size of threads per block are both important factors. The multidimensional aspect of these parameters allows easier mapping of multidimensional problems to OpenCL and does not play a role in performance. As a result, this section discusses size but not dimension. Latency hiding and occupancy depend on the number of active warps per multiprocessor, which is implicitly determined by the execution parameters along with resource (register and shared memory) constraints. Choosing execution parameters is a matter of striking a balance between latency hiding (occupancy) and resource utilization. Choosing the NDRange parameters should be done in tandem; however, there are certain heuristics that apply to each parameter individually. When choosing the number of blocks per grid or grid size (i.e. number of work groups in OpenCL terminology), the primary concern is keeping the entire GPU busy. The number of blocks in a grid should be larger than the number of multiprocessors so that all multiprocessors have at least one block to execute. Furthermore, there should be multiple active blocks per multiprocessor so that blocks that arent waiting for a
35
barrier()
can keep the hardware busy. This recommendation is subject to resource availability; therefore, it should be determined in the context of the blocksize execution parameter, as well as shared memory usage. To scale to future devices, the number of blocks per kernel launch should be in the hundreds, as kernels with thousands of blocks will scale across multiple future generations. When choosing the number of threads per block, or the blocksize, it is important to remember that multiple concurrent blocks can reside on a multiprocessor, so occupancy is not determined by blocksize alone. In particular, a larger blocksize does not imply a higher occupancy. For example, on a device of compute capability 1.1 or lower, a kernel with a maximum blocksize of 512 threads results in an occupancy of 66 percent because the maximum number of threads per multiprocessor on such a device is 768. Hence, only a single block can be active per multiprocessor. However, a kernel with 256 threads per block on such a device can result in 100 percent occupancy with three resident active blocks. As mentioned in section 4.1 et seq., higher occupancy does not always equate to better performance. For example, improving occupancy from 66 percent to 100 percent generally does not translate to a similar increase in performance. A lower occupancy kernel will have more registers available per thread than a higher occupancy kernel, which may result in less register spilling to local memory. In fact, once an occupancy of 50 percent has been reached, additional increases in occupancy do not translate into improved performance. There are many such factors involved in selecting blocksize, and inevitably some experimentation is required. However, a few rules of thumb should be followed: Threads per block should be a multiple of warp size to avoid wasting computation on underpopulated warps and to facilitate coalescing. A minimum of 64 threads per block should be used, but only if there are multiple concurrent blocks per multiprocessor. Between 128 and 256 threads per block is a better choice and a good initial range for experimentation with different block sizes.
Note that when a thread block allocates more than the available registers on a multiprocessor, the kernel invocation fails, as it will when too much shared memory or too many threads are requested.
NDRange Optimizations
memory array can be beneficial even if limits such as threads per block are not an issue. This is because some common operations can be performed by a thread once and the cost amortized over the number of shared memory elements processed by a thread. A useful technique to determine the sensitivity of performance to occupancy is through experimentation with the amount of dynamically allocated shared memory. In OpenCL, the size of any __local pointer argument is specified outside the kernel using clSetKernelArg(). By simply increasing this amount, it is possible to effectively reduce the occupancy of the kernel and measure its effect on performance. As mentioned in the previous section, once an occupancy of more than 50 percent has been reached, it generally does not pay to optimize parameters to obtain higher occupancy ratios. The previous technique can be used to determine whether such a plateau has been reached.
37
Awareness of how instructions are executed often permits low-level optimizations that can be useful, especially in code that is run frequently (the so-called hot spot in a program). Best practices suggest that this optimization be performed after all higher-level optimizations have been completed. In this chapter, throughputs are given in number of operations per clock cycle per multiprocessor. For a warp size of 32, an instruction consists of 32 operations. Therefore, if T is the number of operations per clock cycle, the instruction throughput is one instruction every 32/T clock cycles. All throughputs are for one multiprocessor. They must be multiplied by the number of multiprocessors in the device to get throughput for the whole device. The use of -cl-mad-enable build option, which lets the compiler group add and mul instructions into a single FMAD instruction whenever possible, is recommended as it can lead to large performance gains. FMAD truncates the intermediate result of the multiplication however. The use of -cl-fast-relaxed-math build option enables many aggressive compiler optimizations and should be considered as well. The CUDA architecture is a scalar architecture. Therefore, there is no performance benefit from using vector types and instructions. These should only be used for convenience. It is also in general better to have more work-items than fewer using large vectors.
The throughput of single-precision reciprocal, reciprocal square root, and 2 operations per clock cycle. (Refer to Appendix B of the NVIDIA OpenCL Programming Guide.)
The throughput of 32-bit integer multiplication is 2 operations per clock cycle, but (refer to Appendix B of the NVIDIA OpenCL Programming Guide) provides signed and unsigned 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures, however, mul24 will be slower than 32-bit integer multiplication, so you should provide two kernels, one using mul24 and the
mul24
39
other using generic 32-bit integer multiplication, to be called appropriately by the application.
Integer division and modulo operations are particularly costly and should be avoided or replaced with bitwise operations whenever possible: If n is a power of 2, (i/n) is equivalent to (i log2(n)) and (i % n) is equivalent to (i & (n-1)). The compiler will perform these conversions if n is literal. (For further information, refer to Chapter 3 of the NVIDIA OpenCL Programming Guide).
The compiler must on occasion insert conversion instructions, introducing additional execution cycles. This is the case for Functions operating on char or short whose operands generally need to be converted to an int Double-precision floating-point constants (defined without any type suffix) used as input to single-precision floating-point computations The latter case can be avoided by using single-precision floating-point constants, defined with an f suffix such as 3.141592653589793f, 1.0f, 0.5f. This specification has accuracy implications in addition to its ramifications on performance. The effects on accuracy are discussed in Chapter 7. For single-precision code, use of the float type and the single-precision math functions are highly recommended. When compiling for devices without native double-precision support, such as devices of compute capability 1.2 and earlier, each double variable is converted to single-precision floating-point format (but retains its size of 64 bits) and double-precision arithmetic is demoted to single-precision arithmetic. It should also be noted that the math library function for complementary error function, erfc(), is particularly fast with full single-precision accuracy.
Instruction Optimizations
Two types of runtime math operations are supported. They are and functionName(). Functions using map directly to the hardware level. They are faster but provide somewhat lower accuracy. (Examples: native_sin(x), native_exp(x), and so forth.) Functions using functionName() are slower but have higher accuracy. (Examples: sin(x), exp(x), and so forth.) The throughput of native_sin(x), native_cos(x), native_exp(x) is 1 operation per clock cycle, while sin(x), cos(x), tan(x) are much more expensive and become even more so (about an order of magnitude slower) if the absolute value of x needs to be reduced. Moreover, in such cases, the argument-reduction code uses local memory, which can affect performance even more because of the high latency of local memory. More details are available in the NVIDIA OpenCL Programming Guide.
native_functionName() native_functionName()
Memory instructions include any instruction that reads from or writes to shared, local, or global memory. The throughput of memory optimizations is 8 operations per clock cycle. When accessing local or global memory, there are, in addition, 400 to 600 clock cycles of memory latency. As an example, the throughput for the assignment operator in the following sample code
__local float shared[32]; __global float* device; shared[threadIdx.x] = device[threadIdx.x];
is 8 operations per clock cycle to issue a read from global memory, 8 operations per clock cycle to issue a write to shared memory, but, crucially, there is a latency of 400 to 600 clock cycles to read data from global memory. Much of this global memory latency can be hidden by the thread scheduler if there are sufficient independent arithmetic instructions that can be issued while waiting for the global memory access to complete. However, it is best to avoid accessing global memory whenever possible.
41
Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path. To obtain best performance in cases where the control flow depends on the thread ID, the controlling condition should be written so as to minimize the number of divergent warps. This is possible because the distribution of the warps across the block is deterministic as mentioned in section 2.1.1 of the NVIDIA OpenCL Programming Guide. A trivial example is when the controlling condition depends only on (threadIdx / WSIZE) where WSIZE is the warp size. In this case, no warp diverges because the controlling condition is perfectly aligned with the warps.
Sometimes, the compiler may unroll loops or optimize out if or switch statements by using branch predication instead. In these cases, no warp can ever diverge. The programmer can also control loop unrolling using
#pragma unroll
For more information on this pragma, refer to the NVIDIA OpenCL Programming Guide. When using branch predication, none of the instructions whose execution depends on the controlling condition is skipped. Instead, each such instruction is associated
43
with a per-thread condition code or predicate that is set to true or false according to the controlling condition. Although each of these instructions is scheduled for execution, only the instructions with a true predicate are actually executed. Instructions with a false predicate do not write results, and they also do not evaluate addresses or read operands. The compiler replaces a branch instruction with predicated instructions only if the number of instructions controlled by the branch condition is less than or equal to a certain threshold: If the compiler determines that the condition is likely to produce many divergent warps, this threshold is 7; otherwise it is 4.
This appendix contains a list of all the recommendations for optimization and the list of best practices that are explained in this document.
45
47
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, MATERIALS) ARE BEING PROVIDED AS IS. NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE. Information furnished is believed to be accurate and reliable. However, NVIDIA Corporation assumes no responsibility for the consequences of use of such information or for any infringement of patents or other rights of third parties that may result from its use. No license is granted by implication or otherwise under any patent or patent rights of NVIDIA Corporation. Specifications mentioned in this publication are subject to change without notice. This publication supersedes and replaces all information previously supplied. NVIDIA Corporation products are not authorized for use as critical components in life support devices or systems without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA, the NVIDIA logo, CUDA, GeForce, NVIDIA Quadro, and Tesla are trademarks or registered trademarks of NVIDIA Corporation. Other company and product names may be trademarks of the respective companies with which they are associated.
Copyright
2010 NVIDIA Corporation. All rights reserved.