
CUDA-Enabled GPUs lists all CUDA-enabled devices along with their compute
capability. Compute Capabilities gives the technical specifications of each compute
capability.

Note

The compute capability version of a particular GPU should not be confused with
the CUDA version (for example, CUDA 7.5, CUDA 8, CUDA 9), which is the version of
the CUDA software platform. The CUDA platform is used by application developers
to create applications that run on many generations of GPU architectures,
including future GPU architectures yet to be invented. While new versions of the
CUDA platform often add native support for a new GPU architecture by supporting
the compute capability version of that architecture, new versions of the CUDA
platform typically also include software features that are independent of hardware
generation.

The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and
CUDA 9.0, respectively.

3. Programming Interface
CUDA C++ provides a simple path for users familiar with the C++ programming
language to easily write programs for execution by the device.

It consists of a minimal set of extensions to the C++ language and a runtime library.

The core language extensions have been introduced in Programming Model. They
allow programmers to define a kernel as a C++ function and use some new syntax to
specify the grid and block dimension each time the function is called. A complete
description of all extensions can be found in C++ Language Extensions. Any source file
that contains some of these extensions must be compiled with nvcc as outlined in
Compilation with NVCC.

The runtime is introduced in CUDA Runtime. It provides C and C++ functions that
execute on the host to allocate and deallocate device memory, transfer data between
host memory and device memory, manage systems with multiple devices, etc. A
complete description of the runtime can be found in the CUDA reference manual.

The runtime is built on top of a lower-level C API, the CUDA driver API, which is also
accessible by the application. The driver API provides an additional level of control by
exposing lower-level concepts such as CUDA contexts - the analogue of host
processes for the device - and CUDA modules - the analogue of dynamically loaded
libraries for the device. Most applications do not use the driver API, as they do not
need this additional level of control; when using the runtime, context and module
management are implicit, resulting in more concise code. As the runtime is
interoperable with the driver API, most applications that need some driver API
features can default to the runtime API and only use the driver API where needed.
The driver API is introduced in Driver API and fully described in the reference manual.

3.1. Compilation with NVCC


Kernels can be written using the CUDA instruction set architecture, called PTX, which
is described in the PTX reference manual. It is however usually more effective to use a
high-level programming language such as C++. In both cases, kernels must be
compiled into binary code by nvcc to execute on the device.

nvcc is a compiler driver that simplifies the process of compiling C++ or PTX code: It
provides simple and familiar command line options and executes them by invoking the
collection of tools that implement the different compilation stages. This section gives
an overview of nvcc workflow and command options. A complete description can be
found in the nvcc user manual.

3.1.1. Compilation Workflow


3.1.1.1. Offline Compilation
Source files compiled with nvcc can include a mix of host code (i.e., code that
executes on the host) and device code (i.e., code that executes on the device). nvcc's
basic workflow consists of separating device code from host code and then:

- compiling the device code into an assembly form (PTX code) and/or binary form
  (cubin object),
- and modifying the host code by replacing the <<<...>>> syntax introduced in
  Kernels (and described in more detail in Execution Configuration) with the
  necessary CUDA runtime function calls to load and launch each compiled kernel
  from the PTX code and/or cubin object.

The modified host code is output either as C++ code that is left to be compiled using
another tool or as object code directly by letting nvcc invoke the host compiler during
the last compilation stage.

Applications can then:

- Either link to the compiled host code (this is the most common case),
- Or ignore the modified host code (if any) and use the CUDA driver API (see Driver
  API) to load and execute the PTX code or cubin object.
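For illustration, the following is a minimal command-line sketch (hypothetical source
file x.cu) of the offline outputs nvcc can produce for device code, next to the usual
whole-program build:

# Generate PTX only for the device code, assuming virtual architecture compute_70
nvcc -arch=compute_70 -ptx x.cu -o x.ptx

# Generate a cubin only, targeting compute capability 8.0
nvcc -arch=sm_80 -cubin x.cu -o x.cubin

# Usual case: nvcc also invokes the host compiler and links an executable
nvcc x.cu -o app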

3.1.1.2. Just-in-Time Compilation


Any PTX code loaded by an application at runtime is compiled further to binary code
by the device driver. This is called just-in-time compilation. Just-in-time compilation
increases application load time, but allows the application to benefit from any new
compiler improvements coming with each new device driver. It is also the only way for
applications to run on devices that did not exist at the time the application was
compiled, as detailed in Application Compatibility.

When the device driver just-in-time compiles some PTX code for some application, it
automatically caches a copy of the generated binary code in order to avoid repeating
the compilation in subsequent invocations of the application. The cache - referred to
as compute cache - is automatically invalidated when the device driver is upgraded, so
that applications can benefit from the improvements in the new just-in-time compiler
built into the device driver.

Environment variables are available to control just-in-time compilation as described in
CUDA Environment Variables.

As an alternative to using nvcc to compile CUDA C++ device code, NVRTC can be used
to compile CUDA C++ device code to PTX at runtime. NVRTC is a runtime compilation
library for CUDA C++; more information can be found in the NVRTC User guide.
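As a rough sketch (hypothetical kernel string, error checking omitted, architecture
option chosen for illustration), runtime compilation with NVRTC looks approximately
like this; the NVRTC User guide is authoritative:

#include <nvrtc.h>
#include <cstdio>
#include <vector>

int main()
{
    // Device code held as a string at runtime (hypothetical kernel)
    const char* src =
        "extern \"C\" __global__ void axpy(float a, float* x, float* y) {\n"
        "    int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "    y[i] = a * x[i] + y[i];\n"
        "}\n";

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src, "axpy.cu", 0, nullptr, nullptr);

    const char* opts[] = { "--gpu-architecture=compute_70" };
    nvrtcCompileProgram(prog, 1, opts);

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // The PTX string can now be loaded with the driver API (for example via
    // cuModuleLoadDataEx) and is just-in-time compiled by the device driver.
    std::printf("Generated %zu bytes of PTX\n", ptxSize);
    return 0;
}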

3.1.2. Binary Compatibility


Binary code is architecture-specific. A cubin object is generated using the compiler
option -code that specifies the targeted architecture: For example, compiling with
-code=sm_80 produces binary code for devices of compute capability 8.0. Binary
compatibility is guaranteed from one minor revision to the next one, but not from one
minor revision to the previous one or across major revisions. In other words, a cubin
object generated for compute capability X.y will only execute on devices of compute
capability X.z where z≥y.

Note

Binary compatibility is supported only for the desktop. It is not supported for
Tegra. Binary compatibility between desktop and Tegra is also not supported.
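Returning to the example above, a minimal command-line sketch (hypothetical source
file x.cu):

# Embeds only binary code for compute capability 8.0; the resulting cubin runs
# on devices of compute capability 8.0 and 8.6, but not on 7.x or 9.x devices.
nvcc -gencode arch=compute_80,code=sm_80 -c x.cu -o x.o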

3.1.3. PTX Compatibility


Some PTX instructions are only supported on devices of higher compute capabilities.
For example, Warp Shuffle Functions are only supported on devices of compute
capability 5.0 and above. The -arch compiler option specifies the compute capability
that is assumed when compiling C++ to PTX code. So, code that contains warp shuffle,
for example, must be compiled with -arch=compute_50 (or higher).
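For example, a minimal sketch (hypothetical source file shuffle.cu) of generating PTX
that assumes compute capability 5.0:

# Device code that uses warp shuffle must assume at least compute
# capability 5.0 when compiled to PTX.
nvcc -arch=compute_50 -ptx shuffle.cu -o shuffle.ptx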

PTX code produced for some specific compute capability can always be compiled to
binary code of greater or equal compute capability. Note that a binary compiled from
an earlier PTX version may not make use of some hardware features. For example, a
binary targeting devices of compute capability 7.0 (Volta) compiled from PTX
generated for compute capability 6.0 (Pascal) will not make use of Tensor Core
instructions, since these were not available on Pascal. As a result, the final binary may
perform worse than would be possible if the binary were generated using the latest
version of PTX.

PTX code compiled for architecture-conditional features runs only on the exact same
physical architecture and nowhere else. Architecture-conditional PTX code is neither
forward nor backward compatible. For example, code compiled with sm_90a or
compute_90a only runs on devices of compute capability 9.0 and is not backward or
forward compatible.

3.1.4. Application Compatibility


To execute code on devices of specific compute capability, an application must load
binary or PTX code that is compatible with this compute capability as described in
Binary Compatibility and PTX Compatibility. In particular, to be able to execute code on
future architectures with higher compute capability (for which no binary code can be
generated yet), an application must load PTX code that will be just-in-time compiled
for these devices (see Just-in-Time Compilation).

Which PTX and binary code gets embedded in a CUDA C++ application is controlled by
the -arch and -code compiler options or the -gencode compiler option as detailed in
the nvcc user manual. For example,

nvcc x.cu
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_60,code=sm_60
-gencode arch=compute_70,code=\"compute_70,sm_70\"

embeds binary code compatible with compute capability 5.0 and 6.0 (first and second
-gencode options) and PTX and binary code compatible with compute capability 7.0
(third -gencode option).

Host code is generated to automatically select at runtime the most appropriate code
to load and execute, which, in the above example, will be:

- 5.0 binary code for devices with compute capability 5.0 and 5.2,
- 6.0 binary code for devices with compute capability 6.0 and 6.1,
- 7.0 binary code for devices with compute capability 7.0 and 7.5,
- PTX code which is compiled to binary code at runtime for devices with compute
  capability 8.0 and 8.6.

x.cu can have an optimized code path that uses warp reduction operations, for
example, which are only supported in devices of compute capability 8.0 and higher.
The __CUDA_ARCH__ macro can be used to differentiate various code paths based on
compute capability. It is only defined for device code. When compiling with
-arch=compute_80 for example, __CUDA_ARCH__ is equal to 800.
If x.cu is compiled for architecture-conditional features, for example with sm_90a or
compute_90a, the code can only run on devices with compute capability 9.0.
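As an illustration, the sketch below (hypothetical kernel; it assumes a launch of
exactly one warp, that is 32 threads, per block) uses __CUDA_ARCH__ to select a warp
reduction intrinsic on compute capability 8.0 and higher and a shuffle-based fallback
elsewhere:

// Minimal sketch: per-architecture code paths selected at compile time.
// __CUDA_ARCH__ is only defined when compiling device code.
__global__ void warpSum(const int* in, int* out)
{
    int v = in[blockIdx.x * 32 + threadIdx.x];   // assumes 32 threads per block
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 800)
    // Compute capability 8.0+: hardware warp reduction
    int sum = __reduce_add_sync(0xffffffff, v);
#else
    // Older architectures: shuffle-based warp reduction
    int sum = v;
    for (int offset = 16; offset > 0; offset /= 2)
        sum += __shfl_down_sync(0xffffffff, sum, offset);
#endif
    if (threadIdx.x == 0)
        out[blockIdx.x] = sum;   // lane 0 holds the warp total in both paths
}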

Applications using the driver API must compile code to separate files and explicitly
load and execute the most appropriate file at runtime.

The Volta architecture introduces Independent Thread Scheduling which changes the
way threads are scheduled on the GPU. For code relying on specific behavior of SIMT
scheduling in previous architectures, Independent Thread Scheduling may alter the
set of participating threads, leading to incorrect results. To aid migration while
implementing the corrective actions detailed in Independent Thread Scheduling, Volta
developers can opt-in to Pascal’s thread scheduling with the compiler option
combination -arch=compute_60 -code=sm_70 .

The nvcc user manual lists various shorthands for the -arch , -code , and -gencode
compiler options. For example, -arch=sm_70 is a shorthand for -arch=compute_70 -
code=compute_70,sm_70 (which is the same as -gencode
arch=compute_70,code=\"compute_70,sm_70\" ).

3.1.5. C++ Compatibility


The front end of the compiler processes CUDA source files according to C++ syntax
rules. Full C++ is supported for the host code. However, only a subset of C++ is fully
supported for the device code as described in C++ Language Support.

3.1.6. 64-Bit Compatibility


The 64-bit version of nvcc compiles device code in 64-bit mode (i.e., pointers are 64-
bit). Device code compiled in 64-bit mode is only supported with host code compiled
in 64-bit mode.

3.2. CUDA Runtime


The runtime is implemented in the cudart library, which is linked to the application,
either statically via cudart.lib or libcudart.a , or dynamically via cudart.dll or
libcudart.so . Applications that require cudart.dll and/or cudart.so for dynamic
linking typically include them as part of the application installation package. It is only
safe to pass the address of CUDA runtime symbols between components that link to
the same instance of the CUDA runtime.
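As an illustrative sketch (hypothetical file names), nvcc can select either flavor of the
runtime library at link time:

# Static linking against the CUDA runtime (nvcc's default)
nvcc x.cu -o app --cudart=static

# Dynamic linking against cudart.dll / libcudart.so
nvcc x.cu -o app --cudart=shared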

All its entry points are prefixed with cuda .

As mentioned in Heterogeneous Programming, the CUDA programming model
assumes a system composed of a host and a device, each with their own separate
memory. Device Memory gives an overview of the runtime functions used to manage
device memory.

Shared Memory illustrates the use of shared memory, introduced in Thread Hierarchy,
to maximize performance.

Page-Locked Host Memory introduces page-locked host memory that is required to
overlap kernel execution with data transfers between host and device memory.

Asynchronous Concurrent Execution describes the concepts and API used to enable
asynchronous concurrent execution at various levels in the system.

Multi-Device System shows how the programming model extends to a system with
multiple devices attached to the same host.

Error Checking describes how to properly check the errors generated by the runtime.

Call Stack mentions the runtime functions used to manage the CUDA C++ call stack.

Texture and Surface Memory presents the texture and surface memory spaces that
provide another way to access device memory; they also expose a subset of the GPU
texturing hardware.

Graphics Interoperability introduces the various functions the runtime provides to
interoperate with the two main graphics APIs, OpenGL and Direct3D.

3.2.1. Initialization
As of CUDA 12.0, the cudaInitDevice() and cudaSetDevice() calls initialize the runtime
and the primary context associated with the specified device. Absent these calls, the
runtime will implicitly use device 0 and self-initialize as needed to process other
runtime API requests. One needs to keep this in mind when timing runtime function
calls and when interpreting the error code from the first call into the runtime. Before
12.0, cudaSetDevice() would not initialize the runtime and applications would often use
the no-op runtime call cudaFree(0) to isolate the runtime initialization from other API
activity (both for the sake of timing and error handling).

The runtime creates a CUDA context for each device in the system (see Context for
more details on CUDA contexts). This context is the primary context for this device
and is initialized at the first runtime function which requires an active context on this
device. It is shared among all the host threads of the application. As part of this
context creation, the device code is just-in-time compiled if necessary (see Just-in-
Time Compilation) and loaded into device memory. This all happens transparently. If
needed, for example, for driver API interoperability, the primary context of a device can
be accessed from the driver API as described in Interoperability between Runtime and
Driver APIs.
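For example, a minimal driver API sketch (error checking omitted; it assumes the
runtime has already initialized device 0, so the retained handle is the same primary
context the runtime uses):

#include <cuda.h>           // driver API
#include <cuda_runtime.h>   // runtime API

void usePrimaryContext()
{
    cudaSetDevice(0);       // make device 0 current (initializes the runtime as of CUDA 12.0)

    cuInit(0);
    CUdevice dev;
    cuDeviceGet(&dev, 0);
    CUcontext primary;
    cuDevicePrimaryCtxRetain(&primary, dev);   // primary context shared with the runtime

    // ... driver API calls using 'primary' ...

    cuDevicePrimaryCtxRelease(dev);
}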

When a host thread calls cudaDeviceReset() , this destroys the primary context of the
device the host thread currently operates on (that is, the current device as defined in
Device Selection). The next runtime function call made by any host thread that has
this device as current will create a new primary context for this device.
Note

The CUDA interfaces use global state that is initialized during host program
initiation and destroyed during host program termination. The CUDA runtime and
driver cannot detect if this state is invalid, so using any of these interfaces
(implicitly or explicitly) during program initiation or termination (after main) will
result in undefined behavior.

As of CUDA 12.0, cudaSetDevice() will now explicitly initialize the runtime after
changing the current device for the host thread. Previous versions of CUDA
delayed runtime initialization on the new device until the first runtime call was
made after cudaSetDevice() . This change means that it is now very important to
check the return value of cudaSetDevice() for initialization errors.

The runtime functions from the error handling and version management sections
of the reference manual do not initialize the runtime.
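A minimal sketch of the initialization check described in the note above (CUDA 12.0
behavior):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
    // As of CUDA 12.0, cudaSetDevice() initializes the runtime for the chosen
    // device, so initialization errors surface here rather than on a later call.
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... kernel launches, allocations, etc. ...
    return 0;
}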

3.2.2. Device Memory


As mentioned in Heterogeneous Programming, the CUDA programming model
assumes a system composed of a host and a device, each with their own separate
memory. Kernels operate out of device memory, so the runtime provides functions to
allocate, deallocate, and copy device memory, as well as transfer data between host
memory and device memory.

Device memory can be allocated either as linear memory or as CUDA arrays.

CUDA arrays are opaque memory layouts optimized for texture fetching. They are
described in Texture and Surface Memory.

Linear memory is allocated in a single unified address space, which means that
separately allocated entities can reference one another via pointers, for example, in a
binary tree or linked list. The size of the address space depends on the host system
(CPU) and the compute capability of the used GPU:

Table 1: Linear Memory Address Space

                                           x86_64 (AMD64)   POWER (ppc64le)   ARM64
up to compute capability 5.3 (Maxwell)     40bit            40bit             40bit
compute capability 6.0 (Pascal) or newer   up to 47bit      up to 49bit       up to 48bit

Note
On devices of compute capability 5.3 (Maxwell) and earlier, the CUDA driver creates
an uncommitted 40bit virtual address reservation to ensure that memory
allocations (pointers) fall into the supported range. This reservation appears as
reserved virtual memory, but does not occupy any physical memory until the
program actually allocates memory.

Linear memory is typically allocated using cudaMalloc() and freed using cudaFree(),
and data transfers between host memory and device memory are typically done using
cudaMemcpy(). In the vector addition code sample of Kernels, the vectors need to be
copied from host memory to device memory:
// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}

// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);

    // Allocate input vectors h_A and h_B in host memory
    float* h_A = (float*)malloc(size);
    float* h_B = (float*)malloc(size);
    float* h_C = (float*)malloc(size);

    // Initialize input vectors
    ...

    // Allocate vectors in device memory
    float* d_A;
    cudaMalloc(&d_A, size);
    float* d_B;
    cudaMalloc(&d_B, size);
    float* d_C;
    cudaMalloc(&d_C, size);

    // Copy vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
            (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);

    // Copy result from device memory to host memory
    // h_C contains the result in host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);

    // Free host memory
    ...
}

Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D().
These functions are recommended for allocations of 2D or 3D arrays as they make sure
that the allocation is appropriately padded to meet the alignment requirements
described in Device Memory Accesses, therefore ensuring best performance when
accessing the row addresses or performing copies between 2D arrays and other
regions of device memory (using the cudaMemcpy2D() and cudaMemcpy3D() functions).
The returned pitch (or stride) must be used to access array elements. The following
code sample allocates a width x height 2D array of floating-point values and shows
how to loop over the array elements in device code:

// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
                width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);

// Device code
__global__ void MyKernel(float* devPtr,
                         size_t pitch, int width, int height)
{
    for (int r = 0; r < height; ++r) {
        float* row = (float*)((char*)devPtr + r * pitch);
        for (int c = 0; c < width; ++c) {
            float element = row[c];
        }
    }
}

The following code sample allocates a width x height x depth 3D array of floating-
point values and shows how to loop over the array elements in device code:

// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
                                    height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);

// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
                         int width, int height, int depth)
{
    char* devPtr = (char*)devPitchedPtr.ptr;
    size_t pitch = devPitchedPtr.pitch;
    size_t slicePitch = pitch * height;
    for (int z = 0; z < depth; ++z) {
        char* slice = devPtr + z * slicePitch;
        for (int y = 0; y < height; ++y) {
            float* row = (float*)(slice + y * pitch);
            for (int x = 0; x < width; ++x) {
                float element = row[x];
            }
        }
    }
}
Note

To avoid allocating too much memory and thus impacting system-wide
performance, request the allocation parameters from the user based on the
problem size. If the allocation fails, you can fall back to other slower memory types
(cudaMallocHost(), cudaHostRegister(), etc.), or return an error telling the user how
much memory was needed but denied. If your application cannot request the
allocation parameters for some reason, we recommend using cudaMallocManaged()
for platforms that support it.

The reference manual lists all the various functions used to copy memory between
linear memory allocated with cudaMalloc() , linear memory allocated with
cudaMallocPitch() or cudaMalloc3D() , CUDA arrays, and memory allocated for variables
declared in global or constant memory space.

The following code sample illustrates various ways of accessing global variables via the
runtime API:

__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
cudaMemcpyFromSymbol(data, constData, sizeof(data));

__device__ float devData;
float value = 3.14f;
cudaMemcpyToSymbol(devData, &value, sizeof(float));

__device__ float* devPointer;
float* ptr;
cudaMalloc(&ptr, 256 * sizeof(float));
cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));

cudaGetSymbolAddress() is used to retrieve the address pointing to the memory
allocated for a variable declared in global memory space. The size of the allocated
memory is obtained through cudaGetSymbolSize().
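A minimal sketch, continuing the sample above with the __device__ variable devData:

void* symAddr = nullptr;
size_t symSize = 0;
cudaGetSymbolAddress(&symAddr, devData);   // device address backing devData
cudaGetSymbolSize(&symSize, devData);      // size of the allocation (sizeof(float) here)
// symAddr can be passed to kernels or used with cudaMemcpy() like any device pointer.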

3.2.3. Device Memory L2 Access Management


When a CUDA kernel accesses a data region in the global memory repeatedly, such
data accesses can be considered to be persisting. On the other hand, if the data is only
accessed once, such data accesses can be considered to be streaming.

Starting with CUDA 11.0, devices of compute capability 8.0 and above have the
capability to influence persistence of data in the L2 cache, potentially providing higher
bandwidth and lower latency accesses to global memory.

3.2.3.1. L2 Cache Set-Aside for Persisting Accesses


A portion of the L2 cache can be set aside to be used for persisting data accesses to
global memory. Persisting accesses have prioritized use of this set-aside portion of L2
cache, whereas normal or streaming accesses to global memory can only utilize this
portion of L2 when it is unused by persisting accesses.

The L2 cache set-aside size for persisting accesses may be adjusted, within limits:

cudaGetDeviceProperties(&prop, device_id);
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size); /* set-aside 3/4 of L2 cache for persisting accesses or the max allowed */

When the GPU is configured in Multi-Instance GPU (MIG) mode, the L2 cache set-
aside functionality is disabled.

When using the Multi-Process Service (MPS), the L2 cache set-aside size cannot be
changed by cudaDeviceSetLimit. Instead, the set-aside size can only be specified at
start-up of the MPS server through the environment variable
CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT.
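For example, a sketch of fixing the set-aside percentage at MPS server start-up (the
value is a percentage of the L2 cache; the daemon invocation is shown for illustration):

# Must be set in the environment of the MPS control daemon before it starts.
export CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT=75
nvidia-cuda-mps-control -d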

3.2.3.2. L2 Policy for Persisting Accesses


An access policy window specifies a contiguous region of global memory and a
persistence property in the L2 cache for accesses within that region.

The code example below shows how to set an L2 persisting access window using a
CUDA Stream.

CUDA Stream Example

cudaStreamAttrValue stream_attribute;                                         // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                              // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss.

// Set the attributes to a CUDA stream of type cudaStream_t
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
When a kernel subsequently executes in the CUDA stream, memory accesses within the
global memory extent [ptr..ptr+num_bytes) are more likely to persist in the L2 cache
than accesses to other global memory locations.

L2 persistence can also be set for a CUDA Graph Kernel Node as shown in the example
below:

CUDA GraphKernelNode Example

cudaKernelNodeAttrValue node_attribute;                                       // Kernel level attributes data structure
node_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr);   // Global Memory data pointer
node_attribute.accessPolicyWindow.num_bytes = num_bytes;                      // Number of bytes for persistence access.
                                                                              // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
node_attribute.accessPolicyWindow.hitRatio  = 0.6;                            // Hint for cache hit ratio
node_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;   // Type of access property on cache hit
node_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;    // Type of access property on cache miss.

// Set the attributes to a CUDA Graph Kernel node of type cudaGraphNode_t
cudaGraphKernelNodeSetAttribute(node, cudaKernelNodeAttributeAccessPolicyWindow, &node_attribute);

The hitRatio parameter can be used to specify the fraction of accesses that receive
the hitProp property. In both of the examples above, 60% of the memory accesses in
the global memory region [ptr..ptr+num_bytes) have the persisting property and 40%
of the memory accesses have the streaming property. Which specific memory
accesses are classified as persisting (the hitProp ) is random with a probability of
approximately hitRatio ; the probability distribution depends upon the hardware
architecture and the memory extent.

For example, if the L2 set-aside cache size is 16KB and the num_bytes in the
accessPolicyWindow is 32KB:

- With a hitRatio of 0.5, the hardware will select, at random, 16KB of the 32KB
  window to be designated as persisting and cached in the set-aside L2 cache area.
- With a hitRatio of 1.0, the hardware will attempt to cache the whole 32KB window
  in the set-aside L2 cache area. Since the set-aside area is smaller than the window,
  cache lines will be evicted to keep the most recently used 16KB of the 32KB data
  in the set-aside portion of the L2 cache.

The hitRatio can therefore be used to avoid thrashing of cache lines and overall
reduce the amount of data moved into and out of the L2 cache.
A hitRatio value below 1.0 can be used to manually control the amount of data
different accessPolicyWindow s from concurrent CUDA streams can cache in L2. For
example, let the L2 set-aside cache size be 16KB; two concurrent kernels in two
different CUDA streams, each with a 16KB accessPolicyWindow , and both with
hitRatio value 1.0, might evict each others’ cache lines when competing for the
shared L2 resource. However, if both accessPolicyWindows have a hitRatio value of 0.5,
they will be less likely to evict their own or each others’ persisting cache lines.

3.2.3.3. L2 Access Properties


Three types of access properties are defined for different global memory data
accesses:

1. cudaAccessPropertyStreaming: Memory accesses that occur with the streaming
   property are less likely to persist in the L2 cache because these accesses are
   preferentially evicted.
2. cudaAccessPropertyPersisting: Memory accesses that occur with the persisting
   property are more likely to persist in the L2 cache because these accesses are
   preferentially retained in the set-aside portion of L2 cache.
3. cudaAccessPropertyNormal: This access property forcibly resets a previously applied
   persisting access property to a normal status. Memory accesses with the
   persisting property from previous CUDA kernels may be retained in L2 cache long
   after their intended use. This persistence-after-use reduces the amount of L2
   cache available to subsequent kernels that do not use the persisting property.
   Resetting an access property window with the cudaAccessPropertyNormal property
   removes the persisting (preferential retention) status of the prior access, as if the
   prior access had been without an access property.

3.2.3.4. L2 Persistence Example


The following example shows how to set aside L2 cache for persisting accesses, use
the set-aside L2 cache in CUDA kernels via a CUDA stream, and then reset the L2 cache.
cudaStream_t stream;
cudaStreamCreate(&stream);                                    // Create CUDA stream

cudaDeviceProp prop;                                          // CUDA device properties variable
cudaGetDeviceProperties(&prop, device_id);                    // Query GPU properties
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size);     // set-aside 3/4 of L2 cache for persisting accesses or the max allowed

size_t window_size = min(prop.accessPolicyMaxWindowSize, num_bytes);   // Select minimum of user defined num_bytes and max window size.

cudaStreamAttrValue stream_attribute;                                            // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data1);  // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = window_size;                     // Number of bytes for persistence access
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                             // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;    // Persistence Property
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;     // Type of access property on cache miss

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);  // Set the attributes to a CUDA Stream

for (int i = 0; i < 10; i++) {
    cuda_kernelA<<<grid_size, block_size, 0, stream>>>(data1);   // This data1 is used by a kernel multiple times
}                                                                // [data1 + num_bytes) benefits from L2 persistence
cuda_kernelB<<<grid_size, block_size, 0, stream>>>(data1);       // A different kernel in the same stream can also benefit
                                                                 // from the persistence of data1

stream_attribute.accessPolicyWindow.num_bytes = 0;               // Setting the window size to 0 disables it
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);  // Overwrite the access policy attribute to a CUDA Stream
cudaCtxResetPersistingL2Cache();                                 // Remove any persistent lines in L2

cuda_kernelC<<<grid_size, block_size, 0, stream>>>(data2);       // data2 can now benefit from full L2 in normal mode

3.2.3.5. Reset L2 Access to Normal


A persisting L2 cache line from a previous CUDA kernel may persist in L2 long after it
has been used. Hence, a reset to normal for L2 cache is important for streaming or
normal memory accesses to utilize the L2 cache with normal priority. There are three
ways a persisting access can be reset to normal status.
