Kernel Profiling Guide
User Manual
TABLE OF CONTENTS
Chapter 1. Introduction
1.1. Profiling Applications
Chapter 2. Metric Collection
2.1. Sets and Sections
2.2. Sections and Rules
2.3. Replay
2.3.1. Kernel Replay
2.3.2. Application Replay
2.3.3. Range Replay
2.3.3.1. Defining Ranges
2.3.3.2. Supported APIs
2.3.4. Application Range Replay
2.3.5. Graph Profiling
2.4. Compatibility
2.5. Profile Series
2.6. Overhead
Chapter 3. Metrics Guide
3.1. Hardware Model
3.2. Metrics Structure
3.3. Metrics Decoder
3.4. Range and Precision
Chapter 4. Metrics Reference
Chapter 5. Sampling
5.1. Warp Scheduler States
Chapter 6. Reproducibility
6.1. Serialization
6.2. Clock Control
6.3. Cache Control
6.4. Persistence Mode
Chapter 7. Special Configurations
7.1. Multi Instance GPU
Chapter 8. Roofline Charts
8.1. Overview
8.2. Analysis
Chapter 9. Memory Chart
9.1. Overview
Chapter 10. Memory Tables
10.1. Shared Memory
10.2. L1/TEX Cache
10.3. L2 Cache
10.4. L2 Cache Eviction Policies
10.5. Device Memory
Chapter 11. FAQ
LIST OF TABLES
Table 2: Replay modes and metrics per GPU workload type
Chapter 1.
INTRODUCTION
This guide describes various profiling topics related to NVIDIA Nsight Compute and
NVIDIA Nsight Compute CLI. Most of these apply to both the UI and the CLI version of
the tool.
To use the tools effectively, it is recommended to read this guide, as well as at least the
following chapters of the CUDA Programming Guide:
‣ Programming Model
‣ Hardware Implementation
‣ Performance Guidelines
Afterwards, it should be enough to read the Quickstart chapter of the NVIDIA Nsight
Compute or NVIDIA Nsight Compute CLI documentation, respectively, to start using
the tools.
When profiling an application with NVIDIA Nsight Compute, the behavior differs from a
regular application run: the user launches the NVIDIA Nsight Compute frontend (either the UI or the CLI)
on the host system, which in turn starts the actual application as a new process on the
target system. While host and target are often the same machine, the target can also be a
remote system with a potentially different operating system.
The tool inserts its measurement libraries into the application process, which allow
the profiler to intercept communication with the CUDA user-mode driver. In addition,
when a kernel launch is detected, the libraries can collect the requested performance
metrics from the GPU. The results are then transferred back to the frontend.
Chapter 2.
METRIC COLLECTION
Collection of performance metrics is the key feature of NVIDIA Nsight Compute. Since
there is a huge list of metrics available, it is often easier to use some of the tool's pre-
defined sets or sections to collect a commonly used subset. Users are free to adjust which
metrics are collected for which kernels as needed, but it is important to keep in mind the
Overhead associated with data collection.
2.3. Replay
Depending on which metrics are to be collected, kernels might need to be replayed one
or more times, since not all metrics can be collected in a single pass. For example, the
number of metrics originating from hardware (HW) performance counters that the
GPU can collect at the same time is limited. In addition, patch-based software (SW)
performance counters can have a high impact on kernel runtime and would skew results
for HW counters.
2.3.1. Kernel Replay
In Kernel Replay mode, the memory accessed by the kernel is saved and restored across
passes so that each pass observes the same initial state. If an allocation originates from
device memory and there is still enough device memory available, it is stored there directly. If it runs out
of device memory, the data is transferred to the CPU host memory. Likewise, if an
allocation originates from CPU host memory, the tool first attempts to save it into the
same memory location, if possible.
As explained in Overhead, the time needed for this increases the more memory is
accessed, especially written, by a kernel. If NVIDIA Nsight Compute determines that
only a single replay pass is necessary to collect the requested metrics, no save-and-
restore is performed at all to reduce overhead.
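For illustration (the application name and the selected section are placeholders, not prescribed by this guide), kernel replay is the default mode, so a command such as
    ncu --replay-mode kernel --section MemoryWorkloadAnalysis ./myapp
replays each profiled kernel as many times as needed for the requested metrics, while a request that fits into a single pass skips the save-and-restore step as described above.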
2.3.2. Application Replay
Across application replay passes, NVIDIA Nsight Compute matches metric data for the
individual, selected kernel launches. The matching strategy can be selected using the
--app-replay-match option. For matching, only kernels within the same process and
running on the same device are considered. By default, the grid strategy is used, which
matches launches according to their kernel name and grid size. When multiple launches
have the same attributes (e.g. name and grid size), they are matched in execution order.
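As a sketch (the application name is a placeholder), application replay with the default matching strategy could be requested as
    ncu --replay-mode application --app-replay-match grid ./myapp
In this mode, the entire application is re-run for each pass instead of replaying individual kernels, and launches are matched across passes as described above.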
2.3.3.2. Supported APIs
Range replay supports only a subset of the CUDA API within captured ranges; the support status for each API group is listed below.
Error Handling
All supported.
Initialization
Not supported.
Version Management
All supported.
Device Management
All supported, except:
‣ cuDeviceSetMemPool
Context Management
All supported, except:
‣ cuCtxSetCacheConfig
‣ cuCtxSetSharedMemConfig
Module Management
‣ cuModuleGetFunction
‣ cuModuleGetGlobal
‣ cuModuleGetSurfRef
‣ cuModuleGetTexRef
‣ cuModuleLoad
‣ cuModuleLoadData
‣ cuModuleLoadDataEx
‣ cuModuleLoadFatBinary
‣ cuModuleUnload
Library Management
All supported, except:
‣ cuKernelSetAttribute
‣ cuKernelSetCacheConfig
Memory Management
‣ cuArray*
‣ cuDeviceGetByPCIBusId
‣ cuDeviceGetPCIBusId
‣ cuMemAlloc
‣ cuMemAllocHost
‣ cuMemAllocPitch
‣ cuMemcpy*
‣ cuMemFree
‣ cuMemFreeHost
‣ cuMemGetAddressRange
‣ cuMemGetInfo
‣ cuMemHostAlloc
‣ cuMemHostGetDevicePointer
‣ cuMemHostGetFlags
‣ cuMemHostRegister
‣ cuMemHostUnregister
‣ cuMemset*
‣ cuMipmapped*
Unified Addressing
Not supported.
Stream Management
‣ cuStreamCreate*
‣ cuStreamDestroy
‣ cuStreamGet*
‣ cuStreamQuery
‣ cuStreamSetAttribute
‣ cuStreamSynchronize
‣ cuStreamWaitEvent
Event Management
All supported.
Execution Control
‣ cuFuncGetAttribute
‣ cuFuncGetModule
‣ cuFuncSetAttribute
‣ cuFuncSetCacheConfig
‣ cuLaunchCooperativeKernel
‣ cuLaunchHostFunc
‣ cuLaunchKernel
Graph Management
Not supported.
Occupancy
All supported.
Graphics Interoperability
Not supported.
OpenGL Interoperability
Not supported.
VDPAU Interoperability
Not supported.
EGL Interoperability
Not supported.
2.3.5. Graph Profiling
Note that when graph profiling is enabled, certain metrics such as instruction-level
source metrics are not available. This then also applies to kernels profiled outside of
graphs.
2.4. Compatibility
The set of available replay modes and metrics depends on the type of GPU workload to
profile.
2.6. Overhead
As with most measurements, collecting performance data using NVIDIA Nsight
Compute CLI incurs some runtime overhead on the application. The overhead does
depend on a number of different factors:
‣ Number and type of collected metrics
Depending on the selected metric, data is collected either through a hardware
performance monitor on the GPU, through software patching of the kernel
instructions or via a launch or device attribute. The overhead between these
mechanisms varies greatly, with launch and device attributes being "statically"
available and requiring no kernel runtime overhead.
Furthermore, only a limited number of metrics can be collected in a single pass of
the kernel execution. If more metrics are requested, the kernel launch is replayed
multiple times, with its accessible memory being saved and restored between
subsequent passes to guarantee deterministic execution. Therefore, collecting more
metrics can significantly increase overhead by requiring more replay passes and
increasing the total amount of memory that needs to be restored during replay.
‣ The collected section set
Since each set specifies a group of sections to be collected, choosing a less
comprehensive set can reduce profiling overhead. See the --set option in the
NVIDIA Nsight Compute CLI documentation.
‣ Number of collected sections
Since each section specifies a set of metrics to be collected, selecting fewer sections can
reduce profiling overhead. See the --section option in the NVIDIA Nsight
Compute CLI documentation; a combined example follows this list.
‣ Number of profiled kernels
By default, all selected metrics are collected for all launched kernels. To reduce the
impact on the application, you can try to limit performance data collection to as few
kernel functions and instances as makes sense for your analysis. See the filtering
commands in the NVIDIA Nsight Compute CLI documentation.
There is a relatively high one-time overhead for the first profiled kernel in each
context to generate the metric configuration. This overhead does not occur for
subsequent kernels in the same context, if the list of collected metrics remains
unchanged.
‣ GPU Architecture
For some metrics, the overhead can vary depending on the exact chip they are
collected on, e.g. due to the varying number of units on the chip. Similarly, the overhead
for resetting the L2 cache in-between kernel replay passes depends on the size of
that cache.
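The following hedged example combines several of these options; the kernel and application names are placeholders, and the appropriate set and filters depend on your analysis goals:
    ncu --set basic --kernel-name myKernel --launch-count 1 ./myapp
This restricts collection to a smaller section set and to a single instance of one kernel, which typically reduces the number of replay passes and the overall profiling time.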
Chapter 3.
METRICS GUIDE
3.1. Hardware Model
The physical resources required by the kernel, such as registers and shared memory,
determine the CTA occupancy, and these physical resources limit this occupancy. Details on the
kernel's occupancy are collected by the Occupancy section.
Each CTA can be scheduled on any of the available SMs, with no guarantee in
the order of execution. As such, CTAs must be entirely independent, which means it is
not possible for one CTA to wait on the result of another CTA. As CTAs are independent,
the host (CPU) can launch a large Grid that will not fit on the hardware all at once;
however, any GPU will still be able to run it and produce the correct results.
CTAs are further divided into groups of 32 threads called Warps. If the number of
threads in a CTA is not divisible by 32, the last warp will contain the remaining number
of threads.
The total number of CTAs that can run concurrently on a given GPU is referred to as a
Wave. Consequently, the size of a Wave scales with the number of available SMs of a
GPU, but also with the occupancy of the kernel.
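As a hedged illustration (the kernel, block size, and device index below are placeholders, not part of this guide), the size of a full wave can be estimated with the CUDA occupancy API:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel() { }  // placeholder kernel for illustration

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // device 0 assumed

    int blockSize = 256;   // assumed launch configuration
    int blocksPerSM = 0;
    // Maximum number of CTAs of myKernel that can be resident on one SM
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, myKernel, blockSize, 0);

    // A full wave is the number of CTAs that can run concurrently on the whole GPU
    int waveSize = blocksPerSM * prop.multiProcessorCount;
    printf("CTAs per SM: %d, SMs: %d, full wave: %d CTAs\n",
           blocksPerSM, prop.multiProcessorCount, waveSize);
    return 0;
}

A grid with more CTAs than one wave executes in multiple waves, and a partial final wave can cause the tail effects mentioned for launch__waves_per_multiprocessor later in this guide.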
Streaming Multiprocessor
The Streaming Multiprocessor (SM) is the core processing unit in the GPU. The SM is
optimized for a wide diversity of workloads, including general-purpose computations,
deep learning, ray tracing, as well as lighting and shading. The SM is designed to
simultaneously execute multiple CTAs. CTAs can be from different grid launches.
The SM implements an execution model called Single Instruction Multiple Threads
(SIMT), which allows individual threads to have unique control flow while still
executing as part of a warp. The Turing SM inherits the Volta SM's independent thread
scheduling model. The SM maintains execution state per thread, including a program
counter (PC) and call stack. The independent thread scheduling allows the GPU to
yield execution of any thread, either to make better use of execution resources or to
allow a thread to wait for data produced by another thread possibly in the same warp.
Collecting the Source Counters section allows you to inspect instruction execution and
predication details on the Source Page, along with Sampling information.
Each SM is partitioned into four processing blocks, called SM sub partitions. The SM sub
partitions are the primary processing elements on the SM. Each sub partition contains
the following units:
‣ Warp Scheduler
‣ Register File
‣ Execution Units/Pipelines/Cores
A warp may be stalled waiting on
‣ an instruction fetch,
‣ a memory dependency (result of memory instruction),
‣ an execution dependency (result of previous instruction), or
‣ a synchronization barrier.
See Warp Scheduler States for the list of stall reasons that can be profiled and the Warp
State Statistics section for a summary of warp states found in the kernel execution.
The most important resource under the compiler's control is the number of registers
used by a kernel. Each sub partition has a set of 32-bit registers, which are allocated by
the HW in fixed-size chunks. The Launch Statistics section shows the kernel's register
usage.
Memory
Global memory is a 49-bit virtual address space that is mapped to physical memory
on the device, pinned system memory, or peer memory. Global memory is visible to all
threads in the GPU. Global memory is accessed through the SM L1 and GPU L2.
Local memory is private storage for an executing thread and is not visible outside of that
thread. It is intended for thread-local data like thread stacks and register spills. Local
memory addresses are translated to global virtual addresses by the AGU unit. Local
memory has the same latency as global memory. One difference between global and
local memory is that local memory is arranged such that consecutive 32-bit words are
accessed by consecutive thread IDs. Accesses are therefore fully coalesced as long as all
threads in a warp access the same relative address (e.g., same index in an array variable,
same member in a structure variable, etc.).
Shared memory is located on chip, so it has much higher bandwidth and much lower
latency than either local or global memory. Shared memory can be shared across a
compute CTA. Compute CTAs attempting to share data across threads via shared
memory must use synchronization operations (such as __syncthreads()) between stores
and loads to ensure data written by any one thread is visible to other threads in the CTA.
Caches
All GPU units communicate to main memory through the Level 2 cache, also known
as the L2. The L2 cache sits between on-chip memory clients and the framebuffer. L2
works in physical-address space. In addition to providing caching functionality, L2 also
includes hardware to perform compression and global atomics.
The Level 1 Data Cache, or L1, plays a key role in handling global, local, shared, texture,
and surface memory reads and writes, as well as reduction and atomic operations. On
Volta and Turing architectures there are , there are two L1 caches per TPC, one for each
SM. For more information on how L1 fits into the texturing pipeline, see the TEX unit
description. Also note that while this section often uses the name "L1", it should be
understood that the L1 data cache, shared data, and the Texture data cache are one and
the same.
L1 receives requests from two units: the SM and TEX. L1 receives global and local
memory requests from the SM and receives texture and surface requests from TEX.
These operations access memory in the global memory space, which L1 sends through a
secondary cache, the L2.
Cache hit and miss rates as well as data transfers are reported in the Memory Workload
Analysis section.
Texture/Surface
The TEX unit performs texture fetching and filtering. Beyond plain texture memory
access, TEX is responsible for the addressing, LOD, wrap, filter, and format conversion
operations necessary to convert a texture read request into a result.
TEX receives two general categories of requests from the SM via its input interface:
texture requests and surface load/store operations. Texture and surface memory space
resides in device memory and is cached in L1. Texture and surface memory are
allocated as block-linear surfaces (e.g. 2D, 2D Array, 3D). Such surfaces provide a cache-
friendly layout of data such that neighboring points on a 2D surface are also located
close to each other in memory, which improves access locality. Surface accesses are
bounds-checked by the TEX unit prior to accessing memory, which can be used for
implementing different texture wrapping modes.
The L1 cache is optimized for 2D spatial locality, so threads of the same warp that read
texture or surface addresses that are close together in 2D space will achieve optimal
performance. The L1 cache is also designed for streaming fetches with constant latency;
a cache hit reduces DRAM bandwidth demand but not fetch latency. Reading device
memory through texture or surface memory presents some benefits that can make it an
advantageous alternative to reading memory from global or constant memory.
Information on texture and surface memory can be found in the Memory Workload
Analysis section.
Metrics Entities
While in NVIDIA Nsight Compute, all performance counters are named metrics, they
can be split further into groups with specific properties. For metrics collected via the
PerfWorks measurement library, the following entities exist:
Counters may be either a raw counter from the GPU, or a calculated counter value.
Every counter has four sub-metrics under it, which are also called roll-ups: .sum, .avg,
.min, and .max.
Throughputs indicate how close a portion of the GPU came to its peak rate. Every
throughput has the following sub-metrics:
Metrics Examples
‣ Unit-Level Counter: unit__(subunit?)_(pipestage?)_quantity_(qualifiers?)
‣ Interface Counter: unit__(subunit?)_(pipestage?)_(interface)_quantity_(qualifiers?)
‣ Unit Metric: (counter_name).(rollup_metric)
‣ Sub-Metric: (counter_name).(rollup_metric).(submetric)
where unit is a logical or physical unit of the GPU, subunit is a subunit or memory space
within the unit, pipestage is a pipeline stage within the (sub)unit, quantity is what is
being measured, and qualifiers are any additional predicates or filters applied to the
counter.
Cycle Metrics
Counters using the term cycles in the name report the number of cycles in the unit's
clock domain. Unit-level cycle metrics include, for example, unit__cycles_elapsed and
unit__cycles_active.
Units
dram Device (main) memory, where the GPU's global and local memory resides.
fbpa The FrameBuffer Partition is a memory controller which sits between the level 2
cache (LTC) and the DRAM. The number of FBPAs varies across GPUs.
fe The Frontend unit is responsible for the overall flow of workloads sent by the driver.
gpc The General Processing Cluster contains SM, Texture and L1 in the form of TPC(s). It
is replicated several times across a chip.
gr Graphics Engine is responsible for all 2D and 3D graphics, compute work, and
synchronous graphics copying work.
idc The InDexed Constant Cache is a subunit of the SM responsible for caching
constants that are indexed with a register.
l1tex The Level 1 (L1)/Texture Cache is located within the GPC. It can be used as
directed-mapped shared memory and/or store global, local and texture data in its
cache portion. l1tex__t refers to its Tag stage. l1tex__m refers to its Miss stage.
ltcfabric The LTC fabric is the communication fabric for the L2 cache partitions.
lts A Level 2 (L2) Cache Slice is a sub-partition of the Level 2 cache. lts__t refers to its
Tag stage. lts__m refers to its Miss stage. lts__d refers to its Data stage.
sm The Streaming Multiprocessor handles execution of a kernel as groups of 32
threads, called warps. Warps are further grouped into cooperative thread arrays
(CTA), called blocks in CUDA. All warps of a CTA execute on the same SM. CTAs
share various resources across their threads, e.g. the shared memory.
smsp Each SM is partitioned into four processing blocks, called SM sub partitions. The
SM sub partitions are the primary processing elements on the SM. A sub partition
manages a fixed size pool of warps.
tpc Thread Processing Clusters are units in the GPC. They contain one or more SM,
Texture and L1 units, the Instruction Cache (ICC) and the Indexed Constant Cache
(IDC).
Subunits
aperture_device Memory interface to local device memory (dram)
global Global memory is a 49-bit virtual address space that is mapped to physical memory
on the device, pinned system memory, or peer memory. Global memory is visible to
all threads in the GPU. Global memory is accessed through the SM L1 and GPU L2.
lg Local/Global memory
local Local memory is private storage for an executing thread and is not visible outside
of that thread. It is intended for thread-local data like thread stacks and register spills.
shared Shared memory is located on chip, so it has much higher bandwidth and much
lower latency than either local or global memory. Shared memory can be shared
across a compute CTA.
texin TEXIN
xbar The Crossbar (XBAR) is responsible for carrying packets from a given source unit to
a specific destination unit.
Pipelines
adu Address Divergence Unit. The ADU is responsible for address divergence handling
for branches/jumps. It also provides support for constant loads and block-level
barrier instructions.
alu Arithmetic Logic Unit. The ALU is responsible for execution of most bit manipulation
and logic instructions. It also executes integer instructions, excluding IMAD and IMUL.
On NVIDIA Ampere architecture chips, the ALU pipeline performs fast FP32-to-FP16
conversion.
cbu Convergence Barrier Unit. The CBU is responsible for warp-level convergence,
barrier, and branch instructions.
fma Fused Multiply Add/Accumulate. The FMA pipeline processes most FP32 arithmetic
(FADD, FMUL, FMAD). It also performs integer multiplication operations (IMUL,
IMAD), as well as integer dot products. On GA10x, FMA is a logical pipeline that
indicates peak FP32 and FP16x2 performance. It is composed of the FMAHeavy and
FMALite physical pipelines.
fmaheavy Fused Multiply Add/Accumulate Heavy. FMAHeavy performs FP32 arithmetic (FADD,
FMUL, FMAD), FP16 arithmetic (HADD2, HMUL2, HFMA2), and integer dot products.
fmalite Fused Multiply Add/Accumulate Lite. FMALite performs FP32 arithmetic (FADD,
FMUL, FMAD) and FP16 arithmetic (HADD2, HMUL2, HFMA2).
fp16 Half-precision floating-point. On Volta, Turing and NVIDIA GA100, the FP16 pipeline
performs paired FP16 instructions (FP16x2). It also contains a fast FP32-to-FP16 and
FP16-to-FP32 converter. Starting with GA10x chips, this functionality is part of the
FMA pipeline.
lsu Load Store Unit. The LSU pipeline issues load, store, atomic, and reduction
instructions to the L1TEX unit for global, local, and shared memory. It also issues
special register reads (S2R), shuffles, and CTA-level arrive/wait barrier instructions
tex Texture Unit. The SM texture pipeline forwards texture and surface instructions
to the L1TEX unit's TEXIN stage. On GPUs where FP64 or Tensor pipelines are
tma Tensor Memory Access Unit. Provides efficient data transfer mechanisms
between global and shared memories with the ability to understand and traverse
multidimensional data layouts.
uniform Uniform Data Path. This scalar unit executes instructions where all threads use the
same input and generate the same result.
xu Transcendental and Data Type Conversion Unit. The XU pipeline is responsible for
special functions such as sin, cos, and reciprocal square root. It is also responsible
for integer-to-float and float-to-integer type conversions.
Quantities
instruction An assembly (SASS) instruction. Each executed instruction may generate zero or
more requests.
request A command into a HW unit to perform some action, e.g. load data from some
memory location. Each request accesses one or more sectors.
sector Aligned 32 byte chunk of memory in a cache line or device memory. An L1 or L2
cache line is four sectors, i.e. 128 bytes. Sector accesses are classified as hits if the
tag is present and the sector-data is present within the cache line. Tag-misses and
tag-hit-data-misses are all classified as misses.
tag Unique key to a cache line. A request may look up multiple tags, if the thread
addresses do not all fall within a single cache line-aligned region. The L1 and L2
both have 128 byte cache lines. Tag accesses may be classified as hits or misses.
wavefront Unique "work package" generated at the end of the processing stage for requests.
All work items of a wavefront are processed in parallel, while work items of
different wavefronts are serialized and processed on different cycles. At least one
wavefront is generated per request.
A simplified model for the processing in L1TEX for Volta and newer architectures can
be described as follows: When an SM executes a global or local memory instruction for
a warp, a single request is sent to L1TEX. This request communicates the information for
all participating threads of this warp (up to 32). For local and global memory, based on
the access pattern and the participating threads, the request needs to access a number
of cache lines, and sectors within these cache lines. The L1TEX unit has internally
multiple processing stages operating in a pipeline.
A wavefront is the maximum unit that can pass through that pipeline stage per cycle. If
not all cache lines or sectors can be accessed in a single wavefront, multiple wavefronts
are created and sent for processing one by one, i.e. in a serialized manner. Limitations
of the work within a wavefront may include the need for a consistent memory space, a
maximum number of cache lines that can be accessed, as well as various other reasons.
Each wavefront then flows through the L1TEX pipeline and fetches the sectors handled
in that wavefront. The given relationships of the three key values in this model are
requests:sectors is 1:N, wavefronts:sectors is 1:N, and requests:wavefronts is 1:N.
A wavefront is described as a (work) package that can be processed at once, i.e. there is a
notion of processing one wavefront per cycle in L1TEX. Wavefronts therefore represent
the number of cycles required to process the requests, while the number of sectors per
request is a property of the access pattern of the memory instruction for all participating
threads. For example, it is possible to have a memory instruction that requires 4 sectors
per request in 1 wavefront. However, you can also have a memory instruction having 4
sectors per request, but requiring 2 or more wavefronts.
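As a hedged sketch of how access patterns translate into sectors per request (the kernel and buffer names are placeholders, and exact counts can vary with architecture and data type):

__global__ void coalesced(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Consecutive threads read consecutive 4-byte words: the warp's 128 bytes
    // fall into 4 consecutive 32-byte sectors, so 1 request maps to about 4 sectors.
    out[i] = in[i];
}

__global__ void strided(const float* in, float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // A stride of 32 floats (128 bytes) places each thread's word in a different
    // sector, so 1 request maps to about 32 sectors and potentially needs
    // multiple wavefronts.
    out[i] = in[i * 32];
}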
To reduce the impact of such asynchronous units, consider profiling on a GPU without
active display and without other processes that can access the GPU at the time.
Tool issue
If you still observe metric issues after following the guidelines above, please reach out to
us and describe your issue.
Chapter 4.
METRICS REFERENCE
Overview
Most metrics in NVIDIA Nsight Compute can be queried using the ncu command line
interface's --query-metrics option.
The following metrics can be collected explicitly, but are not listed by
--query-metrics, and do not follow the naming scheme explained in Metrics Structure.
They should be used as-is instead.
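For example, the list of queryable metrics for the current GPU can be printed with ncu --query-metrics; the resulting names, as well as the metrics listed as-is in this chapter, can then be passed to the --metrics option.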
launch__* metrics are collected per kernel launch, and do not require an additional
replay pass. They are available as part of the kernel launch parameters (such as grid size,
block size, ...) or are computed using the CUDA Occupancy Calculator.
Launch Metrics
launch__block_dim_x Number of threads for the kernel launch in X dimension.
launch__block_size Total number of threads per block for the kernel launch.
launch__cluster_max_active Maximum number of clusters that can co-exist on the target device. The runtime
environment may affect how the hardware schedules the clusters, so the calculated
occupancy is not guaranteed to be achievable.
launch__cluster_max_potential_size Largest valid cluster size for the kernel function and launch configuration.
launch__func_cache_config On devices where the L1 cache and shared memory use the same hardware
resources, this is the preferred cache configuration for the CUDA function. The
runtime will use the requested configuration if possible, but it is free to choose a
different configuration if required to execute the function.
launch__graph_contains_device_launch Set to 1 if any node in the profiled graph can launch a CUDA device graph.
launch__occupancy_cluster_pct The ratio of active blocks to the max possible active blocks due to clusters.
launch__occupancy_limit_blocks Occupancy limit due to maximum number of blocks manageable per SM.
launch__shared_mem_config_size Shared memory size configured for the kernel launch. The size depends on the
static, dynamic, and driver shared memory requirements as well as the specified or
launch__shared_mem_per_block_driver Shared memory size per block, allocated for the CUDA driver.
launch__shared_mem_per_block_dynamic Dynamic shared memory size per block, allocated for the kernel.
launch__shared_mem_per_block_static Static shared memory size per block, allocated for the kernel.
launch__thread_count Total number of threads across all blocks for the kernel launch.
launch__uses_cdp Set to 1 if any function object in the launched workload can use CUDA dynamic
parallelism.
launch__waves_per_multiprocessor Number of waves per SM. Partial waves can lead to tail effects where some SMs
become idle while others still have work to complete.
NVLink Topology Metrics
These metrics report per-link NVLink information. Instance values map from physical
or logical NVLink/device IDs (uint64) to values such as raw counters (uint64),
comma-separated lists of port numbers (string), or endpoint types [1=GPU, 2=CPU]
(uint64).
Device Attributes
device__attribute_* metrics represent CUDA device attributes. Collecting them
does not require an additional kernel replay pass, as their value is available from the
CUDA driver for each CUDA device.
smsp__pcsamp_warps_issue_stalled_barrier Warp was stalled waiting for sibling warps at a CTA barrier. A high number of
warps waiting at a barrier is commonly caused by diverging code paths before a
barrier. This causes some warps to wait a long time until other warps reach the
synchronization point. Whenever possible, try to divide up the work into blocks of
uniform workloads. If the block size is 512 threads or greater, consider splitting it
into smaller groups. This can increase eligible warps without affecting occupancy,
unless shared memory becomes a new occupancy limiter. Also, try to identify which
barrier instruction causes the most stalls, and optimize the code executed before
that barrier.
smsp__pcsamp_warps_issue_stalled_branch_resolving Warp was stalled waiting for a branch target to be computed, and the warp
using fewer jump/branch operations and reduce control flow divergence, e.g.
Instructions state.
smsp__pcsamp_warps_issue_stalled_dispatch_stall Warp was stalled waiting on a dispatch stall. A warp stalled during dispatch has an
instruction ready to issue, but the dispatcher holds back issuing the warp due to
other conflicts or events.
smsp__pcsamp_warps_issue_stalled_drain Warp was stalled after EXIT waiting for all outstanding memory operations to
complete so that warp's resources can be freed. A high number of stalls due to
draining warps typically occurs when a lot of data is written to memory towards the
end of a kernel. Make sure the memory access patterns of these store operations
are optimal for the target architecture and consider parallelized data reduction, if
applicable.
smsp__pcsamp_warps_issue_stalled_imc_miss Warp was stalled waiting for an immediate constant cache (IMC) miss. A read
from constant memory costs one memory read from device memory only on a
cache miss; otherwise, it just costs one read from the constant cache. Accesses to
different addresses by threads within a warp are serialized, thus the cost scales
linearly with the number of unique addresses read by all threads within a warp. As
such, the constant cache is best when threads in the same warp access only a few
distinct locations. If all threads of a warp access the same location, then constant
memory is as fast as a register access.
smsp__pcsamp_warps_issue_stalled_lg_throttle Warp was stalled waiting for the L1 instruction queue for local and global (LG)
memory operations to be not full. Typically, this stall occurs only when executing
if dynamically indexed arrays are declared in local scope, or if the kernel has
multiple lower-width memory operations into fewer wider memory operations and
smsp__pcsamp_warps_issue_stalled_long_scoreboard Warp was stalled waiting for a scoreboard dependency on a L1TEX (local, global,
surface, texture) operation. Find the instruction producing the data being waited
upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data
accesses verify the memory access patterns are optimal for the target architecture,
changing the cache configuration. Consider moving frequently used data to shared
memory.
smsp__pcsamp_warps_issue_stalled_math_pipe_throttle Warp was stalled waiting for the execution pipe to be available. This stall occurs
when all active warps execute their next instruction on a specific, oversubscribed
math pipeline. Try to increase the number of active warps to hide the existent
latency or try changing the instruction mix to utilize all available pipelines in a more
balanced way.
smsp__pcsamp_warps_issue_stalled_membar Warp was stalled waiting on a memory barrier. Avoid executing any unnecessary
memory barriers and assure that any outstanding memory operations are fully
smsp__pcsamp_warps_issue_stalled_mio_throttle Warp was stalled waiting for the MIO (memory input/output) instruction queue
to be not full. This stall reason is high in cases of extreme utilization of the MIO
instruction cache miss. A high number of warps not having an instruction fetched
is typical for very short kernels with less than one full wave of work in the grid.
Excessively jumping across large blocks of assembly code can also lead to more
warps stalled for this reason, if this causes misses in the instruction cache. See also
smsp__pcsamp_warps_issue_stalled_not_selected Warp was stalled waiting for the micro scheduler to select the warp to issue. Not
selected warps are eligible warps that were not picked by the scheduler to issue
that cycle as another warp was selected. A high number of not selected warps
typically means you have sufficient warps to cover warp latencies and you may
consider reducing the number of active warps to possibly increase cache coherence
and data locality.
smsp__pcsamp_warps_issue_stalled_selected Warp was selected by the micro scheduler and issued an instruction.
smsp__pcsamp_warps_issue_stalled_short_scoreboard Warp was stalled waiting for a scoreboard dependency on a MIO (memory input/
output) operation (not to L1TEX). The primary reason for a high number of stalls
dynamic branching (e.g. BRX, JMX). Consult the Memory Workload Analysis section
to verify if there are shared memory operations and reduce bank conflicts, if
reported. Assigning frequently accessed values to variables can assist the compiler
smsp__pcsamp_warps_issue_stalled_sleeping Warp was stalled due to all threads in the warp being in the blocked, yielded, or
sleep state. Reduce the number of executed NANOSLEEP instructions, lower the
specified time delay, and attempt to group threads in a way that multiple threads
in a warp sleep at the same time.
smsp__pcsamp_warps_issue_stalled_tex_throttle Warp was stalled waiting for the L1 instruction queue for texture operations to
be not full. This stall reason is high in cases of extreme utilization of the L1TEX
pipeline. Try issuing fewer texture fetches, surface loads, surface stores, or
width memory operations into fewer wider memory operations and try interleaving
or surface loads into global memory lookups. Texture can accept four threads'
smsp__pcsamp_warps_issue_stalled_wait Warp was stalled waiting on a fixed latency execution dependency. Typically, this
stall reason should be very low and only shows up as a top contributor in already
highly optimized kernels. Try to hide the corresponding instruction latencies by
increasing the number of active warps, restructuring the code or unrolling loops.
smsp__pcsamp_warps_issue_stalled_barrier_not_issued Warp was stalled waiting for sibling warps at a CTA barrier. A high number of
warps waiting at a barrier is commonly caused by diverging code paths before a
barrier. This causes some warps to wait a long time until other warps reach the
synchronization point. Whenever possible, try to divide up the work into blocks of
uniform workloads. If the block size is 512 threads or greater, consider splitting it
into smaller groups. This can increase eligible warps without affecting occupancy,
unless shared memory becomes a new occupancy limiter. Also, try to identify which
barrier instruction causes the most stalls, and optimize the code executed before
that barrier.
smsp__pcsamp_warps_issue_stalled_branch_resolving_not_issued Warp was stalled waiting for a branch target to be computed, and the warp
using fewer jump/branch operations and reduce control flow divergence, e.g.
Instructions state.
smsp__pcsamp_warps_issue_stalled_dispatch_stall_not_issued Warp was stalled waiting on a dispatch stall. A warp stalled during dispatch has an
instruction ready to issue, but the dispatcher holds back issuing the warp due to
other conflicts or events.
smsp__pcsamp_warps_issue_stalled_drain_not_issued Warp was stalled after EXIT waiting for all memory operations to complete so
that warp resources can be freed. A high number of stalls due to draining warps
typically occurs when a lot of data is written to memory towards the end of a
kernel. Make sure the memory access patterns of these store operations are
optimal for the target architecture and consider parallelized data reduction, if
applicable.
smsp__pcsamp_warps_issue_stalled_imc_miss_not_issued Warp was stalled waiting for an immediate constant cache (IMC) miss. A read
from constant memory costs one memory read from device memory only on a
cache miss; otherwise, it just costs one read from the constant cache. Accesses to
different addresses by threads within a warp are serialized, thus the cost scales
linearly with the number of unique addresses read by all threads within a warp. As
such, the constant cache is best when threads in the same warp access only a few
distinct locations. If all threads of a warp access the same location, then constant
memory is as fast as a register access.
smsp__pcsamp_warps_issue_stalled_lg_throttle_not_issued Warp was stalled waiting for the L1 instruction queue for local and global (LG)
memory operations to be not full. Typically, this stall occurs only when executing
if dynamically indexed arrays are declared in local scope, or if the kernel has
multiple lower-width memory operations into fewer wider memory operations and
smsp__pcsamp_warps_issue_stalled_long_scoreboard_not_issued Warp was stalled waiting for a scoreboard dependency on a L1TEX (local, global,
surface, texture) operation. Find the instruction producing the data being waited
upon to identify the culprit. To reduce the number of cycles waiting on L1TEX data
accesses verify the memory access patterns are optimal for the target architecture,
changing the cache configuration. Consider moving frequently used data to shared
memory.
smsp__pcsamp_warps_issue_stalled_math_pipe_throttle_not_issued Warp was stalled waiting for the execution pipe to be available. This stall occurs
when all active warps execute their next instruction on a specific, oversubscribed
math pipeline. Try to increase the number of active warps to hide the existent
latency or try changing the instruction mix to utilize all available pipelines in a more
balanced way.
smsp__pcsamp_warps_issue_stalled_membar_not_issued Warp was stalled waiting on a memory barrier. Avoid executing any unnecessary
memory barriers and assure that any outstanding memory operations are fully
smsp__pcsamp_warps_issue_stalled_mio_throttle_not_issued Warp was stalled waiting for the MIO (memory input/output) instruction queue
to be not full. This stall reason is high in cases of extreme utilization of the MIO
instruction cache miss. A high number of warps not having an instruction fetched
is typical for very short kernels with less than one full wave of work in the grid.
Excessively jumping across large blocks of assembly code can also lead to more
warps stalled for this reason, if this causes misses in the instruction cache. See also
smsp__pcsamp_warps_issue_stalled_not_selected_not_issued Warp was stalled waiting for the micro scheduler to select the warp to issue. Not
selected warps are eligible warps that were not picked by the scheduler to issue
that cycle as another warp was selected. A high number of not selected warps
typically means you have sufficient warps to cover warp latencies and you may
consider reducing the number of active warps to possibly increase cache coherence
and data locality.
smsp__pcsamp_warps_issue_stalled_selected_not_issued Warp was selected by the micro scheduler and issued an instruction.
smsp__pcsamp_warps_issue_stalled_short_scoreboard_not_issued Warp was stalled waiting for a scoreboard dependency on a MIO (memory input/
output) operation (not to L1TEX). The primary reason for a high number of stalls
dynamic branching (e.g. BRX, JMX). Consult the Memory Workload Analysis section
to verify if there are shared memory operations and reduce bank conflicts, if
reported. Assigning frequently accessed values to variables can assist the compiler
smsp__pcsamp_warps_issue_stalled_sleeping_not_issued Warp was stalled due to all threads in the warp being in the blocked, yielded, or
sleep state. Reduce the number of executed NANOSLEEP instructions, lower the
specified time delay, and attempt to group threads in a way that multiple threads
in a warp sleep at the same time.
smsp__pcsamp_warps_issue_stalled_tex_throttle_not_issued Warp was stalled waiting for the L1 instruction queue for texture operations to
be not full. This stall reason is high in cases of extreme utilization of the L1TEX
pipeline. Try issuing fewer texture fetches, surface loads, surface stores, or
width memory operations into fewer wider memory operations and try interleaving
or surface loads into global memory lookups. Texture can accept four threads'
smsp__pcsamp_warps_issue_stalled_wait_not_issued Warp was stalled waiting on a fixed latency execution dependency. Typically, this
stall reason should be very low and only shows up as a top contributor in already
highly optimized kernels. Try to hide the corresponding instruction latencies by
increasing the number of active warps, restructuring the code or unrolling loops.
Source Metrics
Collected using SASS-patching. These metrics have instance values mapping
from function address (uint64) to associated values (uint64). Metrics
memory_[access_]type map to string values.
branch_inst_executed Number of unique branch targets assigned to the instruction, including both
derived__avg_thread_executed Average number of thread-level executed instructions per warp (regardless of their
memory_l1_wavefronts_shared / inst_executed
instructions, because not all not predicated-off threads performed the operation.
Warp-level means the values increased by one per individual warp executing the
memory_l1_wavefronts_shared_ideal Ideal number of wavefronts in L1 from shared memory instructions, assuming each
memory_l2_theoretical_sectors_global_ideal Ideal number of sectors requested in L2 from global memory instructions, assuming
smsp__branch_targets_threads_divergent Number of divergent branch targets, including fallthrough. Incremented only when
smsp__branch_targets_threads_uniform Number of uniform branch execution, including fallthrough, where all active
smsp__pcsamp_sample_count Number of collected samples per program counter from the periodic sampler.
smsp__sass_inst_executed_memdesc_explicit_hitprop_evict_first Number of warp-level executed instructions with L2 cache eviction hit property
'first'.
smsp__sass_inst_executed_memdesc_explicit_hitprop_evict_last Number of warp-level executed instructions with L2 cache eviction hit property
'last'.
smsp__sass_inst_executed_memdesc_explicit_hitprop_evict_normal Number of warp-level executed instructions with L2 cache eviction hit property
'normal'.
smsp__sass_inst_executed_memdesc_explicit_hitprop_evict_normal_demote
Number of warp-level executed instructions with L2 cache eviction hit property
'normal demote'.
smsp__sass_inst_executed_memdesc_explicit_missprop_evict_first Number of warp-level executed instructions with L2 cache eviction miss property
'first'.
smsp__sass_inst_executed_memdesc_explicit_missprop_evict_normal Number of warp-level executed instructions with L2 cache eviction miss property
'normal'.
Metric Groups
group:memory__chart Group of metrics for the workload analysis chart.
group:memory__dram_table Group of metrics for the device memory workload analysis table.
group:memory__first_level_cache_table Group of metrics for the L1/TEX cache workload analysis table.
group:memory__shared_table Group of metrics for the shared memory workload analysis table.
group:smsp__pcsamp_warp_stall_reasons Group of metrics for the number of samples from the statistical sampler per
program location.
group:smsp__pcsamp_warp_stall_reasons_not_issued Group of metrics for the number of samples from the statistical sampler per
program location, restricted to samples for which the warp was not issued.
Chapter 5.
SAMPLING
NVIDIA Nsight Compute supports periodic sampling of the warp program counter and
warp scheduler state on desktop devices of compute capability 6.1 and above.
At a fixed interval of cycles, the sampler in each streaming multiprocessor selects an
active warp and outputs the program counter and the warp scheduler state. The tool
selects the minimum interval for the device. On small devices, this can be every 32
cycles. On larger chips with more multiprocessors, this may be 2048 cycles. The sampler
selects a random active warp. On the same cycle the scheduler may select a different
warp to issue.
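For example, the sampled warp stall reasons listed in the Metrics Reference can be collected explicitly through their metric group, e.g. with ncu --metrics group:smsp__pcsamp_warp_stall_reasons ./myapp, where ./myapp stands for the profiled application.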
Chapter 6.
REPRODUCIBILITY
In order to provide actionable and deterministic results across application runs, NVIDIA
Nsight Compute applies various methods to adjust how metrics are collected. This
includes serializing kernel launches, purging GPU caches before each kernel replay or
adjusting GPU clocks.
6.1. Serialization
NVIDIA Nsight Compute serializes kernel launches within the profiled application,
potentially across multiple processes profiled by one or more instances of the tool at the
same time.
Serialization across processes is necessary since for the collection of HW performance
metrics, some GPU and driver objects can only be acquired by a single process at a time.
To achieve this, the lock file TMPDIR/nsight-compute-lock is used. On Windows,
TMPDIR is the path returned by the Windows GetTempPath API function. On other
platforms, it is the path supplied by the first environment variable in the list TMPDIR,
TMP, TEMP, TEMPDIR. If none of these is found, it's /var/nvidia on QNX and /tmp
otherwise.
Serialization within the process is required for most metrics to be mapped to the proper
kernel. In addition, without serialization, performance metric values might vary widely
if kernels execute concurrently on the same device.
It is currently not possible to disable this tool behavior. Refer to the FAQ entry on
possible workarounds.
6.2. Clock Control
For many metrics, their value is directly influenced by the current GPU SM and memory
clock frequencies. For example, a kernel profiled early in the application, before the GPU
clocks have reached their steady state, may report values that would regularly be lower.
In addition, due to kernel replay, the metric value might depend on
which replay pass it is collected in, as later passes would result in higher clock states.
To mitigate this non-determinism, NVIDIA Nsight Compute attempts to limit GPU
clock frequencies to their base value. As a result, metric values are less impacted by the
location of the kernel in the application, or by the number of the specific replay pass.
However, this behavior might be undesirable for analysis of the kernel, e.g. in cases
where an external tool is used to fix clock frequencies, or where the behavior of the
kernel within the application is analyzed. To solve this, users can adjust the --clock-
control option to specify if any clock frequencies should be fixed by the tool.
Factors affecting Clock Control:
‣ Note that thermal throttling directed by the driver cannot be controlled by the tool
and always overrides any selected options.
‣ On mobile targets, e.g. L4T or QNX, there may be variations in profiling results due
to the inability of the tool to lock clocks. Using Nsight Compute’s --clock-control
to set the GPU clocks will fail or will be silently ignored when profiling on a GPU
partition.
‣ On L4T, you can use the jetson_clocks script to lock the clocks at their
maximums during profiling.
‣ See the Special Configurations section for MIG and vGPU clock control.
Chapter 7.
SPECIAL CONFIGURATIONS
7.1. Multi Instance GPU
Locking Clocks
NVIDIA Nsight Compute is not able to set the clock frequency on any Compute Instance
for profiling. You can continue analyzing kernels without fixed clock frequencies (using
--clock-control none; see here for more details). If you have sufficient permissions,
nvidia-smi can be used to configure a fixed frequency for the whole GPU by calling
nvidia-smi --lock-gpu-clocks=tdp,tdp. This sets the GPU clocks to the base TDP
frequency until you reset the clocks by calling nvidia-smi --reset-gpu-clocks.
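A possible sequence, assuming sufficient permissions and using ./myapp as a placeholder for the profiled application, is:
    nvidia-smi --lock-gpu-clocks=tdp,tdp
    ncu --clock-control none -o report ./myapp
    nvidia-smi --reset-gpu-clocks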
Chapter 8.
ROOFLINE CHARTS
8.1. Overview
Kernel performance is not only dependent on the operational speed of the GPU. Since a
kernel requires data to work on, performance is also dependent on the rate at which the
GPU can feed data to the kernel. A typical roofline chart combines the peak performance
and memory bandwidth of the GPU, with a metric called Arithmetic Intensity (a ratio
between Work and Memory Traffic), into a single chart, to more realistically represent the
achieved performance of the profiled kernel. A simple roofline chart might look like the
following:
This chart actually shows two different rooflines. However, the following components
can be identified for each:
‣ Vertical Axis - The vertical axis represents Floating Point Operations per Second
(FLOPS). For GPUs this number can get quite large and so the numbers on this axis
can be scaled for easier reading (as shown here). In order to better accommodate the
range, this axis is rendered using a logarithmic scale.
‣ Horizontal Axis - The horizontal axis represents Arithmetic Intensity, which is
the ratio between Work (expressed in floating point operations per second), and
Memory Traffic (expressed in bytes per second). The resulting unit is in floating point
operations per byte. This axis is also shown using a logarithmic scale.
‣ Memory Bandwidth Boundary - The memory bandwidth boundary is the sloped part
of the roofline. By default, this slope is determined entirely by the memory transfer
rate of the GPU but can be customized inside the SpeedOfLight_RooflineChart.section
file if desired.
‣ Peak Performance Boundary - The peak performance boundary is the flat part of
the roofline. By default, this value is determined entirely by the peak performance of
the GPU but can be customized inside the SpeedOfLight_RooflineChart.section file if
desired.
‣ Ridge Point - The ridge point is the point at which the memory bandwidth
boundary meets the peak performance boundary. This point is a useful reference
when analyzing kernel performance.
‣ Achieved Value - The achieved value represents the performance of the profiled
kernel. If baselines are being used, the roofline chart will also contain an achieved
value for each baseline. The outline color of the plotted achieved value point can be
used to determine from which baseline the point came.
8.2. Analysis
The roofline chart can be very helpful in guiding performance optimization efforts for a
particular kernel.
As shown here, the ridge point partitions the roofline chart into two regions. The area
shaded in blue under the sloped Memory Bandwidth Boundary is the Memory Bound
region, while the area shaded in green under the Peak Performance Boundary is the
Compute Bound region. The region in which the achieved value falls, determines the
current limiting factor of kernel performance.
The distance from the achieved value to the respective roofline boundary (shown
in this figure as a dotted white line) represents the opportunity for performance
improvement. The closer the achieved value is to the roofline boundary, the more optimal
its performance. An achieved value that lies on the Memory Bandwidth Boundary but is
not yet at the height of the ridge point indicates that any further improvement in
overall FLOP/s is only possible if the Arithmetic Intensity is increased at the same time.
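For instance, continuing the illustrative numbers from the worked example above: if the
ridge point lies at 10 FLOP/byte and the kernel's achieved value sits at 2.5 FLOP/byte,
the kernel falls into the Memory Bound region. Its FLOP rate can only move toward the
peak performance boundary if its Arithmetic Intensity is raised as well, for example by
reusing data already loaded for additional computation.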
Using the baseline feature in combination with roofline charts is a good way to track
optimization progress over a number of kernel executions.
Chapter 9.
MEMORY CHART
The Memory Chart shows a graphical, logical representation of performance data for
memory subunits on and off the GPU. Performance data includes transfer sizes, hit rates,
number of instructions or requests, etc.
9.1. Overview
The chart shows both logical memory units (green) and physical memory units (blue),
including:
‣ Load Global Store Shared: Instructions loading directly from global into shared
memory without intermediate register file access
‣ L1/TEX Cache: The L1/Texture cache. The underlying physical memory is split
between this cache and the user-managed Shared Memory.
‣ Shared Memory: CUDA's user-managed shared memory. The underlying physical
memory is split between this and the L1/TEX Cache.
‣ L2 Cache: The L2 cache
‣ L2 Compression: The memory compression unit of the L2 Cache
‣ System Memory: Off-chip system (CPU) memory
‣ Device Memory: On-chip device (GPU) memory of the CUDA device that executes
the kernel
‣ Peer Memory: On-chip device (GPU) memory of other CUDA devices
Depending on the GPU architecture, the exact set of units shown can vary, as not
all GPUs have all units.
Links
Links between Kernel and other logical units represent the number of executed
instructions (Inst) targeting the respective unit. For example, the link between Kernel and
Global represents the instructions loading from or storing to the global memory space.
Instructions using the NVIDIA A100's Load Global Store Shared paradigm are shown
separately, as their register or cache access behavior can be different from regular global
loads or shared stores.
Links between logical units and physical (blue) units represent the number of requests
(Req) issued as a result of their respective instructions. For example, the link going from
L1/TEX Cache to Global shows the number of requests generated due to global load
instructions.
The color of each link represents the percentage of peak utilization of the corresponding
communication path. The color legend to the right of the chart shows the applied color
gradient from unused (0%) to operating at peak performance (100%). Triangle markers
to the left of the legend correspond to the links in the chart. The markers offer a more
accurate value estimate for the achieved peak performances than the color gradient
alone.
A unit often shares a common data port for incoming and outgoing traffic. While the
links sharing a port might operate well below their individual peak performances, the
unit's data port may have already reached its peak. Port utilization is shown in the chart
by colored rectangles inside the units located at the incoming and outgoing links. Ports
use the same color gradient as the data links and also have a corresponding marker to
the left of the legend.
An example of the correlation between the peak values reported in the memory tables
and the ports in the memory chart is shown below.
Metrics
Metrics from this chart can be collected on the command line using --set full,
--section MemoryWorkloadAnalysis_Chart, or --metrics group:memory__chart.
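As a concrete sketch (the application name my_app is a placeholder), the chart's metrics
could be collected for a hypothetical application with:
ncu --section MemoryWorkloadAnalysis_Chart ./my_app
The --metrics group:memory__chart option collects the same metrics directly, while
--set full collects them as part of the full metric set.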
Chapter 10.
MEMORY TABLES
The Memory Tables show detailed metrics for the various memory HW units, such as
shared memory, the caches, and device memory. For most table entries, you can hover
over them to see the underlying metric name and description. Some entries are derived
from other cells and do not show a metric name of their own, but rather the
respective calculation. If a certain metric does not contribute to the generic derivative
calculation, it is shown as UNUSED in the tooltip. You can hover over row or column
headers to see a description of that part of the table.
10.1. Shared Memory
Columns
Rows
Metrics
Metrics from this table can be collected on the command line using --set full,
--section MemoryWorkloadAnalysis_Tables, or --metrics group:memory__shared_table.
10.2. L1/TEX Cache
Columns
Rows
Metrics
Metrics from this table can be collected on the command line using --set full,
--section MemoryWorkloadAnalysis_Tables, or --metrics
group:memory__first_level_cache_table.
10.3. L2 Cache
Columns
Rows
Metrics
Metrics from this table can be collected on the command line using --set full,
--section MemoryWorkloadAnalysis_Tables, or --metrics group:memory__l2_cache_table.
10.4. L2 Cache Eviction Policies
Columns
Rows
Metrics
Metrics from this table can be collected on the command line using --set full,
--section MemoryWorkloadAnalysis_Tables, or --metrics
group:memory__l2_cache_evict_policy_table. Note that this table is only available on
GPUs with GA100 or newer.
10.5. Device Memory
Columns
Rows
Metrics
Metrics from this table can be collected on the command line using --set full,
--section MemoryWorkloadAnalysis_Tables, or --metrics group:memory__dram_table.
Chapter 11.
FAQ
accept the intermediate host's key. A simple way to pinpoint the cause of failures in
this case is to open a terminal and use the OpenSSH client to connect to the remote
target. Once that connection succeeds, NVIDIA Nsight Compute should be able to
connect to the target, too.
‣ SSH connection fails without trying to connect
If the connection fails without trying to connect, there may be a problem with the
settings you entered into the connection dialog. Please make sure that the IP/Host
Name, User Name and Port fields are correctly set.
‣ SSH connections are still not working
The problem might come from NVIDIA Nsight Compute's SSH client not finding a
suitable host key algorithm that is supported by the remote server. You can
force NVIDIA Nsight Compute to use a specific set of host key algorithms by setting
the HostKeyAlgorithms option for the problematic host in your SSH configuration
file, as sketched below. To list the host key algorithms supported by a remote target,
you can use the ssh-keyscan utility which comes with the OpenSSH client.
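As a sketch (the host alias and the algorithm list are assumptions for illustration only),
an entry in your SSH configuration file could look like:
Host my-remote-target
    HostKeyAlgorithms ssh-ed25519,rsa-sha2-512
Running ssh-keyscan my-remote-target shows which host key types the target actually
offers, so you can choose algorithms that both sides support.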
‣ Removing host keys from known hosts files
When connecting to a target machine, NVIDIA Nsight Compute tries to verify the
target's host key against the same local database as the OpenSSH client. If NVIDIA
Nsight Compute finds that the host key is incorrect, it will inform you through a failure
dialog. If you trust the key hash shown in the dialog, you can remove the previously
saved key for that host by manually editing your known hosts database or by using the
ssh-keygen -R <host> command.
‣ Qt initialization failed
Failed to load Qt platform plugin
See System Requirements for Linux.
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY,
"MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE
MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA
Corporation assumes no responsibility for the consequences of use of such
information or for any infringement of patents or other rights of third parties
that may result from its use. No license is granted by implication or otherwise
under any patent rights of NVIDIA Corporation. Specifications mentioned in this
publication are subject to change without notice. This publication supersedes and
replaces all other information previously supplied. NVIDIA Corporation products
are not authorized as critical components in life support devices or systems
without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA
Corporation in the U.S. and other countries. Other company and product names
may be trademarks of the respective companies with which they are associated.
Copyright
© 2018-2023 NVIDIA Corporation and affiliates. All rights reserved.
This product includes software developed by the Syncro Soft SRL
(http://www.sync.ro/).