NVIDIA CUDA Guide - Part 2
Note
The compute capability version of a particular GPU should not be confused with
the CUDA version (for example, CUDA 7.5, CUDA 8, CUDA 9), which is the version of
the CUDA software platform. The CUDA platform is used by application developers
to create applications that run on many generations of GPU architectures,
including future GPU architectures yet to be invented. While new versions of the
CUDA platform often add native support for a new GPU architecture by supporting
the compute capability version of that architecture, new versions of the CUDA
platform typically also include software features that are independent of hardware
generation.
The Tesla and Fermi architectures are no longer supported starting with CUDA 7.0 and
CUDA 9.0, respectively.
3. Programming Interface
CUDA C++ provides a simple path for users familiar with the C++ programming
language to easily write programs for execution by the device.
It consists of a minimal set of extensions to the C++ language and a runtime library.
The core language extensions have been introduced in Programming Model. They
allow programmers to define a kernel as a C++ function and use some new syntax to
specify the grid and block dimension each time the function is called. A complete
description of all extensions can be found in C++ Language Extensions. Any source file
that contains some of these extensions must be compiled with nvcc as outlined in
Compilation with NVCC.
The runtime is introduced in CUDA Runtime. It provides C and C++ functions that
execute on the host to allocate and deallocate device memory, transfer data between
host memory and device memory, manage systems with multiple devices, etc. A
complete description of the runtime can be found in the CUDA reference manual.
The runtime is built on top of a lower-level C API, the CUDA driver API, which is also
accessible by the application. The driver API provides an additional level of control by
exposing lower-level concepts such as CUDA contexts - the analogue of host
processes for the device - and CUDA modules - the analogue of dynamically loaded
libraries for the device. Most applications do not use the driver API as they do not
need this additional level of control and when using the runtime, context and module
management are implicit, resulting in more concise code. As the runtime is
interoperable with the driver API, most applications that need some driver API
features can default to use the runtime API and only use the driver API where needed.
The driver API is introduced in Driver API and fully described in the reference manual.
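As a rough illustration only (not an example from this guide), the sketch below shows the extra steps the driver API makes explicit: context creation, module loading, and kernel launch. The file name VecAdd.ptx is hypothetical and error checking is omitted.
#include <cuda.h>   // CUDA driver API

void launchWithDriverApi(float* h_A, float* h_B, float* h_C, int N)
{
    size_t size = N * sizeof(float);

    cuInit(0);                                         // Initialize the driver API
    CUdevice dev;    cuDeviceGet(&dev, 0);             // Pick device 0
    CUcontext ctx;   cuCtxCreate(&ctx, 0, dev);        // Explicit context creation
    CUmodule mod;    cuModuleLoad(&mod, "VecAdd.ptx"); // Explicit module loading (hypothetical file)
    CUfunction fn;   cuModuleGetFunction(&fn, mod, "VecAdd");

    CUdeviceptr d_A, d_B, d_C;                         // Device allocations and host-to-device copies
    cuMemAlloc(&d_A, size);  cuMemcpyHtoD(d_A, h_A, size);
    cuMemAlloc(&d_B, size);  cuMemcpyHtoD(d_B, h_B, size);
    cuMemAlloc(&d_C, size);

    void* args[] = { &d_A, &d_B, &d_C, &N };           // Kernel arguments
    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    cuLaunchKernel(fn, blocksPerGrid, 1, 1,            // Grid dimensions
                   threadsPerBlock, 1, 1,              // Block dimensions
                   0, 0, args, 0);                     // Shared memory, stream, arguments

    cuMemcpyDtoH(h_C, d_C, size);                      // Copy the result back to the host
    cuMemFree(d_A); cuMemFree(d_B); cuMemFree(d_C);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
}
The runtime API performs the context and module management steps above implicitly, which is why runtime-based code is usually more concise.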
nvcc is a compiler driver that simplifies the process of compiling C++ or PTX code: It
provides simple and familiar command line options and executes them by invoking the
collection of tools that implement the different compilation stages. This section gives
an overview of nvcc workflow and command options. A complete description can be
found in the nvcc user manual.
nvcc's basic workflow consists in separating device code from host code and then:
compiling the device code into an assembly form (PTX code) and/or binary form
(cubin object),
and modifying the host code by replacing the <<<...>>> syntax introduced in
Kernels (and described in more detail in Execution Configuration) by the
necessary CUDA runtime function calls to load and launch each compiled kernel
from the PTX code and/or cubin object.
The modified host code is output either as C++ code that is left to be compiled using
another tool or as object code directly by letting nvcc invoke the host compiler during
the last compilation stage.
Applications can then:
Either link to the compiled host code (this is the most common case),
Or ignore the modified host code (if any) and use the CUDA driver API (see Driver
API) to load and execute the PTX code or cubin object.
When the device driver just-in-time compiles some PTX code for some application, it
automatically caches a copy of the generated binary code in order to avoid repeating
the compilation in subsequent invocations of the application. The cache - referred to
as compute cache - is automatically invalidated when the device driver is upgraded, so
that applications can benefit from the improvements in the new just-in-time compiler
built into the device driver.
As an alternative to using nvcc to compile CUDA C++ device code, NVRTC can be used
to compile CUDA C++ device code to PTX at runtime. NVRTC is a runtime compilation
library for CUDA C++; more information can be found in the NVRTC User guide.
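As an illustrative sketch (not taken from the guide; error checking omitted), compiling a small kernel string to PTX with NVRTC might look like this; the kernel name Scale and the helper compileToPtx() are hypothetical:
#include <nvrtc.h>
#include <cstdlib>

const char* kernelSource =
    "extern \"C\" __global__ void Scale(float* data, float factor) {"
    "    data[threadIdx.x] *= factor;"
    "}";

char* compileToPtx()
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kernelSource, "scale.cu", 0, NULL, NULL); // Create program from a source string
    const char* opts[] = { "--gpu-architecture=compute_70" };           // Target PTX for compute capability 7.0
    nvrtcCompileProgram(prog, 1, opts);                                 // Runtime compilation to PTX

    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    char* ptx = (char*)malloc(ptxSize);
    nvrtcGetPTX(prog, ptx);                                             // Retrieve the generated PTX
    nvrtcDestroyProgram(&prog);
    return ptx;  // The PTX can then be loaded with cuModuleLoadData() and launched via the driver API
}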
Note
Binary compatibility is supported only for the desktop. It is not supported for
Tegra. Also, the binary compatibility between desktop and Tegra is not supported.
PTX code produced for some specific compute capability can always be compiled to
binary code of greater or equal compute capability. Note that a binary compiled from
an earlier PTX version may not make use of some hardware features. For example, a
binary targeting devices of compute capability 7.0 (Volta) compiled from PTX
generated for compute capability 6.0 (Pascal) will not make use of Tensor Core
instructions, since these were not available on Pascal. As a result, the final binary may
perform worse than would be possible if the binary were generated using the latest
version of PTX.
PTX code compiled to target architecture-conditional features runs only on the exact
same physical architecture and nowhere else. Architecture-conditional PTX code is
neither forward nor backward compatible. For example, code compiled with sm_90a or
compute_90a only runs on devices with compute capability 9.0 and is not backward or
forward compatible.
Which PTX and binary code gets embedded in a CUDA C++ application is controlled by
the -arch and -code compiler options or the -gencode compiler option as detailed in
the nvcc user manual. For example,
nvcc x.cu
-gencode arch=compute_50,code=sm_50
-gencode arch=compute_60,code=sm_60
-gencode arch=compute_70,code="compute_70,sm_70"
embeds binary code compatible with compute capability 5.0 and 6.0 (first and second
-gencode options) and PTX and binary code compatible with compute capability 7.0
(third -gencode option).
Host code is generated to automatically select at runtime the most appropriate code
to load and execute, which, in the above example, will be:
5.0 binary code for devices with compute capability 5.0 and 5.2,
6.0 binary code for devices with compute capability 6.0 and 6.1,
7.0 binary code for devices with compute capability 7.0 and 7.5,
PTX code which is compiled to binary code at runtime for devices with compute
capability 8.0 and 8.6.
x.cu can have an optimized code path that uses warp reduction operations, for
example, which are only supported in devices of compute capability 8.0 and higher.
The __CUDA_ARCH__ macro can be used to differentiate various code paths based on
compute capability. It is only defined for device code. When compiling with
-arch=compute_80, for example, __CUDA_ARCH__ is equal to 800.
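As a hypothetical sketch of such a differentiated code path (the kernel name and reduction logic below are illustrative, not from the guide), __CUDA_ARCH__ can guard a warp-reduction intrinsic that requires compute capability 8.0 while keeping a fallback for older architectures:
// Assumes a launch with a single warp of 32 threads, for example SumWarp<<<1, 32>>>(in, out);
__global__ void SumWarp(const int* in, int* out)
{
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 800
    // Compute capability 8.0 and higher: hardware warp reduction instruction
    int v = __reduce_add_sync(0xffffffff, in[threadIdx.x]);
#else
    // Older architectures: shuffle-based warp reduction
    int v = in[threadIdx.x];
    for (int offset = 16; offset > 0; offset /= 2)
        v += __shfl_down_sync(0xffffffff, v, offset);
#endif
    if (threadIdx.x == 0)
        *out = v;  // After either path, lane 0 holds the warp-wide sum
}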
If x.cu is compiled for architecture-conditional features, for example with sm_90a or
compute_90a, the code can only run on devices with compute capability 9.0.
Applications using the driver API must compile code to separate files and explicitly
load and execute the most appropriate file at runtime.
The Volta architecture introduces Independent Thread Scheduling which changes the
way threads are scheduled on the GPU. For code relying on specific behavior of SIMT
scheduling in previous architectures, Independent Thread Scheduling may alter the
set of participating threads, leading to incorrect results. To aid migration while
implementing the corrective actions detailed in Independent Thread Scheduling, Volta
developers can opt-in to Pascal’s thread scheduling with the compiler option
combination -arch=compute_60 -code=sm_70 .
The nvcc user manual lists various shorthands for the -arch , -code , and -gencode
compiler options. For example, -arch=sm_70 is a shorthand for
-arch=compute_70 -code=compute_70,sm_70 (which is the same as
-gencode arch=compute_70,code="compute_70,sm_70").
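For instance, restating the shorthand above, the two invocations below embed the same PTX and binary code for x.cu:
nvcc x.cu -arch=sm_70
nvcc x.cu -gencode arch=compute_70,code="compute_70,sm_70"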
Asynchronous Concurrent Execution describes the concepts and API used to enable
asynchronous concurrent execution at various levels in the system.
Multi-Device System shows how the programming model extends to a system with
multiple devices attached to the same host.
Error Checking describes how to properly check the errors generated by the runtime.
Call Stack mentions the runtime functions used to manage the CUDA C++ call stack.
Texture and Surface Memory presents the texture and surface memory spaces that
provide another way to access device memory; they also expose a subset of the GPU
texturing hardware.
3.2.1. Initialization
As of CUDA 12.0, the cudaInitDevice() and cudaSetDevice() calls initialize the runtime
and the primary context associated with the specified device. Absent these calls, the
runtime will implicitly use device 0 and self-initialize as needed to process other
runtime API requests. One needs to keep this in mind when timing runtime function
calls and when interpreting the error code from the first call into the runtime. Before
12.0, cudaSetDevice() would not initialize the runtime and applications would often use
the no-op runtime call cudaFree(0) to isolate the runtime initialization from other API
activity (both for the sake of timing and error handling).
The runtime creates a CUDA context for each device in the system (see Context for
more details on CUDA contexts). This context is the primary context for this device
and is initialized at the first runtime function which requires an active context on this
device. It is shared among all the host threads of the application. As part of this
context creation, the device code is just-in-time compiled if necessary (see Just-in-
Time Compilation) and loaded into device memory. This all happens transparently. If
needed, for example, for driver API interoperability, the primary context of a device can
be accessed from the driver API as described in Interoperability between Runtime and
Driver APIs.
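For instance, a minimal sketch (not from the guide; error checking omitted) of accessing the runtime's primary context for device 0 from the driver API; the helper name useRuntimePrimaryContext() is illustrative:
#include <cuda.h>           // Driver API
#include <cuda_runtime.h>   // Runtime API

void useRuntimePrimaryContext()
{
    cudaSetDevice(0);                          // Runtime initializes the primary context of device 0
    cuInit(0);                                 // Initialize the driver API

    CUcontext primaryCtx;
    cuDevicePrimaryCtxRetain(&primaryCtx, 0);  // Retain the primary context of device 0
    cuCtxSetCurrent(primaryCtx);               // Make it current for this host thread

    // ... driver API calls (for example, cuModuleLoad) now share the runtime's context ...

    cuDevicePrimaryCtxRelease(0);              // Release the reference when done
}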
When a host thread calls cudaDeviceReset() , this destroys the primary context of the
device the host thread currently operates on (that is, the current device as defined in
Device Selection). The next runtime function call made by any host thread that has
this device as current will create a new primary context for this device.
Note
The CUDA interfaces use global state that is initialized during host program
initiation and destroyed during host program termination. The CUDA runtime and
driver cannot detect if this state is invalid, so using any of these interfaces
(implicitly or explicitly) during program initiation or termination (after main) will
result in undefined behavior.
As of CUDA 12.0, cudaSetDevice() will now explicitly initialize the runtime after
changing the current device for the host thread. Previous versions of CUDA
delayed runtime initialization on the new device until the first runtime call was
made after cudaSetDevice() . This change means that it is now very important to
check the return value of cudaSetDevice() for initialization errors.
The runtime functions from the error handling and version management sections
of the reference manual do not initialize the runtime.
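A minimal sketch (illustrative, not from the guide) of checking cudaSetDevice() for initialization errors:
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // As of CUDA 12.0, cudaSetDevice() initializes the runtime on the chosen device,
    // so initialization failures surface here rather than on a later runtime call.
    cudaError_t err = cudaSetDevice(0);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaSetDevice failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    // ... other runtime calls ...
    return 0;
}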
CUDA arrays are opaque memory layouts optimized for texture fetching. They are
described in Texture and Surface Memory.
Linear memory is allocated in a single unified address space, which means that
separately allocated entities can reference one another via pointers, for example, in a
binary tree or linked list. The size of the address space depends on the host system
(CPU) and the compute capability of the GPU used.
Note
On devices of compute capability 5.3 (Maxwell) and earlier, the CUDA driver creates
an uncommitted 40-bit virtual address reservation to ensure that memory
allocations (pointers) fall into the supported range. This reservation appears as
reserved virtual memory, but does not occupy any physical memory until the
program actually allocates memory.
Linear memory is typically allocated using cudaMalloc() and freed using cudaFree()
and data transfers between host memory and device memory are typically done using
cudaMemcpy(). In the vector addition code sample of Kernels, the vectors need to be
copied from host memory to device memory:
// Device code
__global__ void VecAdd(float* A, float* B, float* C, int N)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < N)
        C[i] = A[i] + B[i];
}
// Host code
int main()
{
    int N = ...;
    size_t size = N * sizeof(float);
    // Allocation and initialization of the host vectors h_A, h_B, h_C omitted
    // Allocate vectors in device memory
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, size);
    cudaMalloc(&d_B, size);
    cudaMalloc(&d_C, size);
    // Copy input vectors from host memory to device memory
    cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
    // Invoke kernel
    int threadsPerBlock = 256;
    int blocksPerGrid =
        (N + threadsPerBlock - 1) / threadsPerBlock;
    VecAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    // Copy result from device memory to host memory
    cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
    // Free device memory
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
}
Linear memory can also be allocated through cudaMallocPitch() and cudaMalloc3D().
These functions pad 2D and 3D allocations to meet the alignment requirements
described in Device Memory Accesses; the returned pitch (or stride) must be used to
access array elements. The following code sample allocates a width x height 2D array
of floating-point values and shows how to loop over the array elements in device code:
// Host code
int width = 64, height = 64;
float* devPtr;
size_t pitch;
cudaMallocPitch(&devPtr, &pitch,
width * sizeof(float), height);
MyKernel<<<100, 512>>>(devPtr, pitch, width, height);
// Device code
__global__ void MyKernel(float* devPtr,
size_t pitch, int width, int height)
{
for (int r = 0; r < height; ++r) {
float* row = (float*)((char*)devPtr + r * pitch);
for (int c = 0; c < width; ++c) {
float element = row[c];
}
}
}
The following code sample allocates a width x height x depth 3D array of floating-
point values and shows how to loop over the array elements in device code:
// Host code
int width = 64, height = 64, depth = 64;
cudaExtent extent = make_cudaExtent(width * sizeof(float),
height, depth);
cudaPitchedPtr devPitchedPtr;
cudaMalloc3D(&devPitchedPtr, extent);
MyKernel<<<100, 512>>>(devPitchedPtr, width, height, depth);
// Device code
__global__ void MyKernel(cudaPitchedPtr devPitchedPtr,
int width, int height, int depth)
{
char* devPtr = (char*)devPitchedPtr.ptr;
size_t pitch = devPitchedPtr.pitch;
size_t slicePitch = pitch * height;
for (int z = 0; z < depth; ++z) {
char* slice = devPtr + z * slicePitch;
for (int y = 0; y < height; ++y) {
float* row = (float*)(slice + y * pitch);
for (int x = 0; x < width; ++x) {
float element = row[x];
}
}
}
}
Note
The reference manual lists all the various functions used to copy memory between
linear memory allocated with cudaMalloc() , linear memory allocated with
cudaMallocPitch() or cudaMalloc3D() , CUDA arrays, and memory allocated for variables
declared in global or constant memory space.
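For instance (an illustrative sketch assuming a tightly packed host buffer h_data of width x height floats, together with the devPtr and pitch returned by the cudaMallocPitch() example above), a 2D host-to-device copy into the pitched allocation could be written as:
cudaMemcpy2D(devPtr, pitch,                  // Destination pointer and its pitch in bytes
             h_data, width * sizeof(float),  // Source pointer and its pitch (tightly packed rows)
             width * sizeof(float), height,  // Width of each row in bytes, number of rows
             cudaMemcpyHostToDevice);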
The following code sample illustrates various ways of accessing global variables via the
runtime API:
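The sketch below uses __constant__ and __device__ variables together with cudaMemcpyToSymbol() and cudaMemcpyFromSymbol(); the helper function name accessGlobalVariables() is illustrative:
__constant__ float constData[256];
__device__   float devData;
__device__   float* devPointer;

void accessGlobalVariables()
{
    float data[256];
    cudaMemcpyToSymbol(constData, data, sizeof(data));    // Host -> __constant__ variable
    cudaMemcpyFromSymbol(data, constData, sizeof(data));  // __constant__ variable -> host

    float value = 3.14f;
    cudaMemcpyToSymbol(devData, &value, sizeof(float));   // Host -> __device__ variable

    float* ptr;
    cudaMalloc(&ptr, 256 * sizeof(float));
    cudaMemcpyToSymbol(devPointer, &ptr, sizeof(ptr));    // Store a device pointer in a __device__ variable
}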
Starting with CUDA 11.0, devices of compute capability 8.0 and above have the
capability to influence persistence of data in the L2 cache, potentially providing higher
bandwidth and lower latency accesses to global memory.
The L2 cache set-aside size for persisting accesses may be adjusted, within limits:
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device_id);
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size); /* set-aside 3/4 of L2 cache for persisting accesses or the max allowed */
When the GPU is configured in Multi-Instance GPU (MIG) mode, the L2 cache set-
aside functionality is disabled.
When using the Multi-Process Service (MPS), the L2 cache set-aside size cannot be
changed by cudaDeviceSetLimit. Instead, the set-aside size can only be specified at
startup of the MPS server through the environment variable
CUDA_DEVICE_DEFAULT_PERSISTING_L2_CACHE_PERCENTAGE_LIMIT.
The code example below shows how to set an L2 persisting access window using a
CUDA Stream.
cudaStreamAttrValue stream_attribute;                                         // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access.
                                                                              // (Must be less than cudaDeviceProp::accessPolicyMaxWindowSize)
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss

// Set the attributes to a CUDA stream of type cudaStream_t
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);
L2 persistence can also be set for a CUDA Graph Kernel Node as shown in the example
below:
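A sketch of setting the same access policy window on a CUDA Graph kernel node (the node handle node and the pointer ptr are assumed to exist):
cudaKernelNodeAttrValue node_attribute;                                     // Kernel level attributes data structure
node_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(ptr); // Global Memory data pointer
node_attribute.accessPolicyWindow.num_bytes = num_bytes;                    // Number of bytes for persistence access
node_attribute.accessPolicyWindow.hitRatio  = 0.6;                          // Hint for cache hit ratio
node_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting; // Type of access property on cache hit
node_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;  // Type of access property on cache miss

// Set the attributes to a CUDA Graph kernel node of type cudaGraphNode_t
cudaGraphKernelNodeSetAttribute(node, cudaKernelNodeAttributeAccessPolicyWindow, &node_attribute);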
The hitRatio parameter can be used to specify the fraction of accesses that receive
the hitProp property. In both of the examples above, 60% of the memory accesses in
the global memory region [ptr..ptr+num_bytes) have the persisting property and 40%
of the memory accesses have the streaming property. Which specific memory
accesses are classified as persisting (the hitProp ) is random with a probability of
approximately hitRatio ; the probability distribution depends upon the hardware
architecture and the memory extent.
For example, if the L2 set-aside cache size is 16KB and the num_bytes in the
accessPolicyWindow is 32KB:
With a hitRatio of 0.5, the hardware will select, at random, 16KB of the 32KB
window to be designated as persisting and cached in the set-aside L2 cache area.
With a hitRatio of 1.0, the hardware will attempt to cache the whole 32KB window
in the set-aside L2 cache area. Since the set-aside area is smaller than the window,
cache lines will be evicted to keep the most recently used 16KB of the 32KB data
in the set-aside portion of the L2 cache.
The hitRatio can therefore be used to avoid thrashing of cache lines and overall
reduce the amount of data moved into and out of the L2 cache.
A hitRatio value below 1.0 can be used to manually control the amount of data
different accessPolicyWindow s from concurrent CUDA streams can cache in L2. For
example, let the L2 set-aside cache size be 16KB; two concurrent kernels in two
different CUDA streams, each with a 16KB accessPolicyWindow , and both with
hitRatio value 1.0, might evict each others’ cache lines when competing for the
shared L2 resource. However, if both accessPolicyWindows have a hitRatio value of 0.5,
they will be less likely to evict their own or each others’ persisting cache lines.
The following example shows how to set aside the L2 cache for persisting accesses, use the set-aside cache from a CUDA stream, and then reset the L2 cache:
cudaDeviceProp prop;                                                                        // CUDA device properties variable
cudaGetDeviceProperties(&prop, device_id);                                                  // Query GPU properties
size_t size = min(int(prop.l2CacheSize * 0.75), prop.persistingL2CacheMaxSize);
cudaDeviceSetLimit(cudaLimitPersistingL2CacheSize, size);                                   // set-aside 3/4 of L2 cache for persisting accesses or the max allowed

cudaStreamAttrValue stream_attribute;                                                       // Stream level attributes data structure
stream_attribute.accessPolicyWindow.base_ptr  = reinterpret_cast<void*>(data1);             // Global Memory data pointer
stream_attribute.accessPolicyWindow.num_bytes = window_size;                                // Number of bytes for persistence access
stream_attribute.accessPolicyWindow.hitRatio  = 0.6;                                        // Hint for cache hit ratio
stream_attribute.accessPolicyWindow.hitProp   = cudaAccessPropertyPersisting;               // Persistence Property
stream_attribute.accessPolicyWindow.missProp  = cudaAccessPropertyStreaming;                // Type of access property on cache miss

cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Set the attributes to a CUDA Stream

stream_attribute.accessPolicyWindow.num_bytes = 0;                                          // Setting the window size to 0 disables it
cudaStreamSetAttribute(stream, cudaStreamAttributeAccessPolicyWindow, &stream_attribute);   // Overwrite the access policy attribute of the CUDA Stream
cudaCtxResetPersistingL2Cache();                                                            // Remove any persistent lines in L2

cuda_kernelC<<<grid_size, block_size, 0, stream>>>(data2);                                  // data2 can now benefit from full L2 in normal mode