NVIDIA CUDA Guide - Part 2
Note
The compute capability version of a particular GPU should not be confused with
the CUDA version (for example, CUDA 7.5, CUDA 8, CUDA 9), which is the version of
the CUDA software platform. The CUDA platform is used by application developers
to create applications that run on many generations of GPU architectures,
including future GPU architectures yet to be invented. While new versions of the
CUDA platform often add native support for a new GPU architecture by supporting
the compute capability version of that architecture, new versions of the CUDA
platform typically also include software features that are independent of hardware
generation.
The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and
CUDA 9.0, respectively.
3. Programming Interface
CUDA C++ provides a simple path for users familiar with the C++ programming
language to easily write programs for execution by the device.
It consists of a minimal set of extensions to the C++ language and a runtime library.
The core language extensions have been introduced in Programming Model. They
allow programmers to define a kernel as a C++ function and use some new syntax to
specify the grid and block dimension each time the function is called. A complete
description of all extensions can be found in C++ Language Extensions. Any source file
that contains some of these extensions must be compiled with nvcc as outlined in
Compilation with NVCC.
The runtime is introduced in CUDA Runtime. It provides C and C++ functions that
execute on the host to allocate and deallocate device memory, transfer data between
host memory and device memory, manage systems with multiple devices, etc. A
complete description of the runtime can be found in the CUDA reference manual.
The runtime is built on top of a lower-level C API, the CUDA driver API, which is also
accessible by the application. The driver API provides an additional level of control by
exposing lower-level concepts such as CUDA contexts - the analogue of host
processes for the device - and CUDA modules - the analogue of dynamically loaded
libraries for the device. Most applications do not use the driver API as they do not
need this additional level of control and when using the runtime, context and module
management are implicit, resulting in more concise code. As the runtime is
interoperable with the driver API, most applications that need some driver API
features can default to use the runtime API and only use the driver API where needed.
The driver API is introduced in Driver API and fully described in the reference manual.
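As a rough illustration only (not an example from this guide), the sketch below shows the extra steps the driver API makes explicit: context creation, module loading, and kernel launch. The file name VecAdd.ptx is hypothetical and error checking is omitted.
#include <cuda.h>   // CUDA driver API

void launchWithDriverApi(float* h_A, float* h_B, float* h_C, int N)
{
    size_t size = N * sizeof(float);

    cuInit(0);                                         // Initialize the driver API
    CUdevice dev;    cuDeviceGet(&dev, 0);             // Pick device 0
    CUcontext ctx;   cuCtxCreate(&ctx, 0, dev);        // Explicit context creation
    CUmodule mod;    cuModuleLoad(&mod, "VecAdd.ptx"); // Explicit module loading (hypothetical file)
    CUfunction fn;   cuModuleGetFunction(&fn, mod, "VecAdd");

    CUdeviceptr d_A, d_B, d_C;                         // Device allocations and host-to-device copies
    cuMemAlloc(&d_A, size);  cuMemcpyHtoD(d_A, h_A, size);
    cuMemAlloc(&d_B, size);  cuMemcpyHtoD(d_B, h_B, size);
    cuMemAlloc(&d_C, size);

    void* args[] = { &d_A, &d_B, &d_C, &N };           // Kernel arguments
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    cuLaunchKernel(fn, blocksPerGrid, 1, 1,            // Grid dimensions
                   threadsPerBlock, 1, 1,              // Block dimensions
                   0, 0, args, 0);                     // Shared memory, stream, arguments

    cuMemcpyDtoH(h_C, d_C, size);                      // Copy the result back to the host
    cuMemFree(d_A); cuMemFree(d_B); cuMemFree(d_C);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}
The runtime API performs the context and module management steps above implicitly, which is why runtime-based code is usually more concise.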
nvcc is a compiler driver that simplifies the process of compiling C++ or PTX code: It
provides simple and familiar command line options and executes them by invoking the
collection of tools that implement the different compilation stages. This section gives
an overview of nvcc workflow and command options. A complete description can be
found in the nvcc user manual.
nvcc's basic workflow consists in separating device code from host code and then:
compiling the device code into an assembly form (PTX code) and/or binary form
(cubin object),
and modifying the host code by replacing the <<<...>>> syntax introduced in
Kernels (and described in more detail in Execution Configuration) by the
necessary CUDA runtime function calls to load and launch each compiled kernel
from the PTX code and/or cubin object.
The modified host code is output either as C++ code that is left to be compiled using
another tool or as object code directly by letting nvcc invoke the host compiler during
the last compilation stage.
Applications can then:
Either link to the compiled host code (this is the most common case),
Or ignore the modified host code (if any) and use the CUDA driver API (see Driver
API) to load and execute the PTX code or cubin object.
When the device driver just-in-time compiles some PTX code for some application, it
automatically caches a copy of the generated binary code in order to avoid repeating
the compilation in subsequent invocations of the application. The cache - referred to
as compute cache - is automatically invalidated when the device driver is upgraded, so
that applications can benefit from the improvements in the new just-in-time compiler
built into the device driver.
As an alternative to using nvcc to compile CUDA C++ device code, NVRTC can be used
to compile CUDA C++ device code to PTX at runtime. NVRTC is a runtime compilation
library for CUDA C++; more information can be found in the NVRTC User guide.
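As an illustrative sketch (not taken from the guide; error checking omitted), compiling a small kernel string to PTX with NVRTC might look like this; the kernel name Scale and the helper compileToPtx() are hypothetical:
#include <nvrtc.h>
#include <cstdlib>

const char* kernelSource =
    "extern \"C\" __global__ void Scale(float* data, float factor) {"
    "    data[threadIdx.x] *= factor;"
    "}";

char* compileToPtx()
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kernelSource, "scale.cu", 0, NULL, NULL); // Create program from a source string
    const char* opts[] = { "--gpu-architecture=compute_70" };           // Target PTX for compute capability 7.0
    nvrtcCompileProgram(prog, 1, opts);                                 // Runtime compilation to PTX

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    char* ptx = (char*)malloc(ptxSize);
    nvrtcGetPTX(prog, ptx);                                             // Retrieve the generated PTX
    nvrtcDestroyProgram(&prog);
    return ptx;  // The PTX can then be loaded with cuModuleLoadData() and launched via the driver API
}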
Note
Binary compatibility is supported only for the desktop. It is not supported for
Tegra. Also, the binary compatibility between desktop and Tegra is not supported.
PTX code produced for some specific compute capability can always be compiled to
binary code of greater or equal compute capability. Note that a binary compiled from
an earlier PTX version may not make use of some hardware features. For example, a
binary targeting devices of compute capability 7.0 (Volta) compiled from PTX
generated for compute capability 6.0 (Pascal) will not make use of Tensor Core
instructions, since these were not available on Pascal. As a result, the final binary may
perform worse than would be possible if the binary were generated using the latest
version of PTX.
PTX code compiled to target architecture-conditional features runs only on the exact
same physical architecture and nowhere else. Architecture-conditional PTX code is
neither forward nor backward compatible. For example, code compiled with sm_90a or
compute_90a only runs on devices with compute capability 9.0 and is not backward or
forward compatible.
Which PTX and binary code gets embedded in a CUDA C++ application is controlled by
the -arch and -code compiler options or the -gencode compiler option as detailed in
the nvcc user manual. For example,
nvcc x.cu
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_60,code=sm_60
-gencode arch=compute_70,code="compute_70,sm_70"
embeds binary code compatible with compute capability 5.0 and 6.0 (first and second
-gencode options) and PTX and binary code compatible with compute capability 7.0
(third -gencode option).
Host code is generated to automatically select at runtime the most appropriate code
to load and execute, which, in the above example, will be:
5.0 binary code for devices with compute capability 5.0 and 5.2,
6.0 binary code for devices with compute capability 6.0 and 6.1,
7.0 binary code for devices with compute capability 7.0 and 7.5,
PTX code which is compiled to binary code at runtime for devices with compute
capability 8.0 and 8.6.
x.cu can have an optimized code path that uses warp reduction operations, for
example, which are only supported in devices of compute capability 8.0 and higher.
The __CUDA_ARCH__ macro can be used to differentiate various code paths based on
compute capability. It is only defined for device code. When compiling with
-arch=compute_80, for example, __CUDA_ARCH__ is equal to 800.
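As a hypothetical sketch of such a differentiated code path (the kernel name and reduction logic below are illustrative, not from the guide), __CUDA_ARCH__ can guard a warp-reduction intrinsic that requires compute capability 8.0 while keeping a fallback for older architectures:
// Assumes a launch with a single warp of 32 threads, for example SumWarp<<<1, 32>>>(in, out);
__global__ void SumWarp(const int* in, int* out)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    // Compute capability 8.0 and higher: hardware warp reduction instruction
    int v = __reduce_add_sync(0xffffffff, in[threadIdx.x]);
#else
    // Older architectures: shuffle-based warp reduction
    int v = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
#endif
    if (threadIdx.x == 0)
        *out = v;  // After either path, lane 0 holds the warp-wide sum
}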
If x.cu is compiled for architecture-conditional features, for example with sm_90a or
compute_90a, the code can only run on devices with compute capability 9.0.
Applications using the driver API must compile code to separate files and explicitly
load and execute the most appropriate file at runtime.
The Volta architecture introduces Independent Thread Scheduling which changes the
way threads are scheduled on the GPU. For code relying on specific behavior of SIMT
scheduling in previous architectures, Independent Thread Scheduling may alter the
set of participating threads, leading to incorrect results. To aid migration while
implementing the corrective actions detailed in Independent Thread Scheduling, Volta
developers can opt-in to Pascal’s thread scheduling with the compiler option
combination -arch=compute_60 -code=sm_70 .
The nvcc user manual lists various shorthands for the -arch , -code , and -gencode
compiler options. For example, -arch=sm_70 is a shorthand for
-arch=compute_70 -code=compute_70,sm_70 (which is the same as
-gencode arch=compute_70,code="compute_70,sm_70").
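For instance, restating the shorthand above, the two invocations below embed the same PTX and binary code for x.cu:
nvcc x.cu -arch=sm_70
nvcc x.cu -gencode arch=compute_70,code="compute_70,sm_70"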
Asynchronous Concurrent Execution describes the concepts and API used to enable
asynchronous concurrent execution at various levels in the system.
Multi-Device System shows how the programming model extends to a system with
multiple devices attached to the same host.
Error Checking describes how to properly check the errors generated by the runtime.
Call Stack mentions the runtime functions used to manage the CUDA C++ call stack.
Texture and Surface Memory presents the texture and surface memory spaces that
provide another way to access device memory; they also expose a subset of the GPU
texturing hardware.
3.2.1. Initialization
As of CUDA 12.0, the cudaInitDevice() and cudaSetDevice() calls initialize the runtime
and the primary context associated with the specified device. Absent these calls, the
runtime will implicitly use device 0 and self-initialize as needed to process other
runtime API requests. One needs to keep this in mind when timing runtime function
calls and when interpreting the error code from the first call into the runtime. Before
12.0, cudaSetDevice() would not initialize the runtime and applications would often use
the no-op runtime call cudaFree(0) to isolate the runtime initialization from other API
activity (both for the sake of timing and error handling).
The runtime creates a CUDA context for each device in the system (see Context for
more details on CUDA contexts). This context is the primary context for this device
and is initialized at the first runtime function which requires an active context on this
device. It is shared among all the host threads of the application. As part of this
context creation, the device code is just-in-time compiled if necessary (see Just-in-
Time Compilation) and loaded into device memory. This all happens transparently. If
needed, for example, for driver API interoperability, the primary context of a device can
be accessed from the driver API as described in Interoperability between Runtime and
Driver APIs.
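For instance, a minimal sketch (not from the guide; error checking omitted) of accessing the runtime's primary context for device 0 from the driver API; the helper name useRuntimePrimaryContext() is illustrative:
#include <cuda.h>           // Driver API
#include <cuda_runtime.h>   // Runtime API

void useRuntimePrimaryContext()
{
    cudaSetDevice(0);                          // Runtime initializes the primary context of device 0
    cuInit(0);                                 // Initialize the driver API

    CUcontext primaryCtx;
    cuDevicePrimaryCtxRetain(&primaryCtx, 0);  // Retain the primary context of device 0
    cuCtxSetCurrent(primaryCtx);               // Make it current for this host thread

    // ... driver API calls (for example, cuModuleLoad) now share the runtime's context ...

    cuDevicePrimaryCtxRelease(0);              // Release the reference when done
}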
When a host thread calls cudaDeviceReset() , this destroys the primary context of the
device the host thread currently operates on (that is, the current device as defined in
Device Selection). The next runtime function call made by any host thread that has
this device as current will create a new primary context for this device.
Note
The CUDA interfaces use global state that is initialized during host program
initiation and destroyed during host program termination. The CUDA runtime and
driver cannot detect if this state is invalid, so using any of these interfaces
(implicitly or explicitly) during program initiation or termination (after main) will
result in undefined behavior.
As of CUDA 12.0, cudaSetDevice() will now explicitly initialize the runtime after
changing the current device for the host thread. Previous versions of CUDA
delayed runtime initialization on the new device until the first runtime call was
made after cudaSetDevice() . This change means that it is now very important to
check the return value of cudaSetDevice() for initialization errors.
The runtime functions from the error handling and version management sections
of the reference manual do not initialize the runtime.
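A minimal sketch (illustrative, not from the guide) of checking cudaSetDevice() for initialization errors:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // As of CUDA 12.0, cudaSetDevice() initializes the runtime on the chosen device,
    // so initialization failures surface here rather than on a later runtime call.
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... other runtime calls ...
    return 0;
}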
CUDA arrays are opaque memory layouts optimized for texture fetching. They are
described in Texture and Surface Memory.
Linear memory is allocated in a single unified address space, which means that
separately allocated entities can reference one another via pointers, for example, in a
binary tree or linked list. The size of the address space depends on the host system
(CPU) and the compute capability of the GPU used.
Note
On devices of compute capability 5.3 (Maxwell) and earlier, the CUDA driver creates
an uncommitted 40-bit virtual address reservation to ensure that memory
allocations (pointers) fall into the supported range. This reservation appears as
reserved virtual memory, but does not occupy any physical memory until the
program actually allocates memory.
Linear memory is typically allocated using cudaMalloc() and freed using cudaFree()
and data transfers between host memory and device memory are typically done using
cudaMemcpy(). In the vector addition code sample of Kernels, the vectors need to be
copied from host memory to device memory:
// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);
    // Allocation and initialization of the host vectors h_A, h_B, h_C omitted
    // Allocate vectors in device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    // Copy input vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
        (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy result from device memory to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D().
These functions pad 2D and 3D allocations to meet the alignment requirements
described in Device Memory Accesses; the returned pitch (or stride) must be used to
access array elements. The following code sample allocates a width x height 2D array
of floating-point values and shows how to loop over the array elements in device code:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr,
size_t pitch, int width, int height)
{
for (int r = 0; r < height; ++r) {
float* row = (float*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
float element = row[c];
}
}
}
The following code sample allocates a width x height x depth 3D array of floating-
point values and shows how to loop over the array elements in device code:
// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);
// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
int width, int height, int depth)
{
char* devPtr = (char*)devPitchedPtr.ptr;
size_t pitch = devPitchedPtr.pitch;
size_t slicePitch = pitch * height;
for (int z = 0; z < depth; ++z) {
char* slice = devPtr + z * slicePitch;
for (int y = 0; y < height; ++y) {
float* row = (float*)(slice + y * pitch);
for (int x = 0; x < width; ++x) {
float element = row[x];
}
}
}
}
Note
The reference manual lists all the various functions used to copy memory between
linear memory allocated with cudaMalloc() , linear memory allocated with
cudaMallocPitch() or cudaMalloc3D() , CUDA arrays, and memory allocated for variables
declared in global or constant memory space.
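For instance (an illustrative sketch assuming a tightly packed host buffer h_data of width x height floats, together with the devPtr and pitch returned by the cudaMallocPitch() example above), a 2D host-to-device copy into the pitched allocation could be written as:
cudaMemcpy2D(devPtr, pitch,                  // Destination pointer and its pitch in bytes
             h_data, width * sizeof(float),  // Source pointer and its pitch (tightly packed rows)
             width * sizeof(float), height,  // Width of each row in bytes, number of rows
             cudaMemcpyHostToDevice);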
The following code sample illustrates various ways of accessing global variables via the
runtime API:
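The sketch below uses __constant__ and __device__ variables together with cudaMemcpyToSymbol() and cudaMemcpyFromSymbol(); the helper function name accessGlobalVariables() is illustrative:
__constant__ float constData[256];
__device__   float devData;
__device__   float* devPointer;

void accessGlobalVariables()
{
    float data[256];
    cudaMemcpyToSymbol(constData, data, sizeof(data));    // Host -> __constant__ variable
    cudaMemcpyFromSymbol(data, constData, sizeof(data));  // __constant__ variable -> host

    float value = 3.14f;
    cudaMemcpyToSymbol(devData, &value, sizeof(float));   // Host -> __device__ variable

    float* ptr;
    cudaMalloc(&ptr, 256 * sizeof(float));
    cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));    // Store a device pointer in a __device__ variable
}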
Starting with CUDA 11.0, devices of compute capability 8.0 and above have the
capability to influence persistence of data in the L2 cache, potentially providing higher
bandwidth and lower latency accesses to global memory.
The L2 cache set-aside size for persisting accesses may be adjusted, within limits:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device_id);
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size); /* set-aside 3/4 of L2 cache for persisting accesses or the max allowed */
When the GPU is configured in Multi-Instance GPU (MIG) mode, the L2 cache set-
aside functionality is disabled.
When using the Multi-Process Service (MPS), the L2 cache set-aside size cannot be
changed by cudaDeviceSetLimit. Instead, the set-aside size can only be specified at
startup of the MPS server through the environment variable
CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT.
The code example below shows how to set an L2 persisting access window using a
CUDA Stream.
cudaStreamAttrValue stream_attribute;                                         // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                              // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss

// Set the attributes to a CUDA stream of type cudaStream_t
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
L2 persistence can also be set for a CUDA Graph Kernel Node as shown in the example
below:
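A sketch of setting the same access policy window on a CUDA Graph kernel node (the node handle node and the pointer ptr are assumed to exist):
cudaKernelNodeAttrValue node_attribute;                                     // Kernel level attributes data structure
node_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
node_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access
node_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
node_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
node_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss

// Set the attributes to a CUDA Graph kernel node of type cudaGraphNode_t
cudaGraphKernelNodeSetAttribute(node, cudaKernelNodeAttributeAccessPolicyWindow, &node_attribute);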
The hitRatio parameter can be used to specify the fraction of accesses that receive
the hitProp property. In both of the examples above, 60% of the memory accesses in
the global memory region [ptr..ptr+num_bytes) have the persisting property and 40%
of the memory accesses have the streaming property. Which specific memory
accesses are classified as persisting (the hitProp ) is random with a probability of
approximately hitRatio ; the probability distribution depends upon the hardware
architecture and the memory extent.
For example, if the L2 set-aside cache size is 16KB and the num_bytes in the
accessPolicyWindow is 32KB:
With a hitRatio of 0.5, the hardware will select, at random, 16KB of the 32KB
window to be designated as persisting and cached in the set-aside L2 cache area.
With a hitRatio of 1.0, the hardware will attempt to cache the whole 32KB window
in the set-aside L2 cache area. Since the set-aside area is smaller than the window,
cache lines will be evicted to keep the most recently used 16KB of the 32KB data
in the set-aside portion of the L2 cache.
The hitRatio can therefore be used to avoid thrashing of cache lines and overall
reduce the amount of data moved into and out of the L2 cache.
A hitRatio value below 1.0 can be used to manually control the amount of data
different accessPolicyWindow s from concurrent CUDA streams can cache in L2. For
example, let the L2 set-aside cache size be 16KB; two concurrent kernels in two
different CUDA streams, each with a 16KB accessPolicyWindow , and both with
hitRatio value 1.0, might evict each others’ cache lines when competing for the
shared L2 resource. However, if both accessPolicyWindows have a hitRatio value of 0.5,
they will be less likely to evict their own or each others’ persisting cache lines.
The following example shows how to set aside the L2 cache for persisting accesses, use the set-aside cache from a CUDA stream, and then reset the L2 cache:
cudaDeviceProp prop;                                                                        // CUDA device properties variable
cudaGetDeviceProperties(&prop, device_id);                                                  // Query GPU properties
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size);                                   // set-aside 3/4 of L2 cache for persisting accesses or the max allowed

cudaStreamAttrValue stream_attribute;                                                       // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data1);             // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = window_size;                                // Number of bytes for persistence access
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                                        // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;               // Persistence Property
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;                // Type of access property on cache miss

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Set the attributes to a CUDA Stream

stream_attribute.accessPolicyWindow.num_bytes = 0;                                          // Setting the window size to 0 disables it
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Overwrite the access policy attribute of the CUDA Stream
cudaCtxResetPersistingL2Cache();                                                            // Remove any persistent lines in L2

cuda_kernelC<<<grid_size, block_size, 0, stream>>>(data2);                                  // data2 can now benefit from full L2 in normal mode