User Guide
TABLE OF CONTENTS
2.3. Profiling QNX Targets from the GUI.................................................................. 95
Chapter 3. Export Formats................................................................................... 96
3.1. SQLite Schema Reference..............................................................................96
3.2. SQLite Schema Event Values.......................................................................... 98
3.3. Common SQLite Examples............................................................................ 104
3.4. Arrow Format Description............................................................................ 118
3.5. JSON and Text Format Description................................................................. 119
Chapter 4. Report Scripts................................................................................... 120
Report Scripts Shipped With Nsight Systems............................................................ 120
apigpusum[:base] -- CUDA API & GPU Summary (CUDA API + kernels + memory ops)..........120
cudaapisum -- CUDA API Summary.....................................................................121
cudaapitrace -- CUDA API Trace........................................................................121
gpukernsum[:base] -- CUDA GPU Kernel Summary.................................................. 121
gpumemsizesum -- GPU Memory Operations Summary (by Size)..................................122
gpumemtimesum -- GPU Memory Operations Summary (by Time)................................122
gpusum[:base] -- GPU Summary (kernels + memory operations)................................. 123
gputrace -- CUDA GPU Trace........................................................................... 123
nvtxppsum -- NVTX Push/Pop Range Summary...................................................... 124
openmpevtsum -- OpenMP Event Summary...........................................................124
osrtsum -- OS Runtime Summary.......................................................................124
vulkanmarkerssum -- Vulkan Range Summary........................................................ 125
pixsum -- PIX Range Summary..........................................................................125
khrdebugsum -- OpenGL KHR_debug Range Summary.............................................. 126
Report Formatters Shipped With Nsight Systems....................................................... 126
Column...................................................................................................... 126
Table........................................................................................................ 127
CSV.......................................................................................................... 127
TSV.......................................................................................................... 128
JSON......................................................................................................... 128
HDoc.........................................................................................................128
HTable.......................................................................................................129
Chapter 5. Migrating from NVIDIA nvprof................................................................ 130
Using the Nsight Systems CLI nvprof Command........................................................ 130
CLI nvprof Command Switch Options.....................................................................130
Next Steps.....................................................................................................133
Chapter 6. Profiling in a Docker on Linux Devices.................................................... 134
Chapter 7. Direct3D Trace.................................................................................. 136
7.1. D3D11 API trace........................................................................................136
7.2. D3D12 API Trace....................................................................................... 136
Chapter 8. WDDM Queues................................................................................... 141
Chapter 9. Vulkan API Trace................................................................................ 143
9.1. Vulkan Overview....................................................................................... 143
9.2. Pipeline Creation Feedback.......................................................................... 144
9.3. Vulkan GPU Trace Notes.............................................................................. 145
Chapter 10. Stutter Analysis................................................................................146
10.1. FPS Overview..........................................................................................146
10.2. Frame Health..........................................................................................149
10.3. GPU Memory Utilization............................................................................. 150
10.4. Vertical Synchronization.............................................................................150
Chapter 11. OpenMP Trace..................................................................................151
Chapter 12. OS Runtime Libraries Trace................................................................. 153
12.1. Locking a Resource...................................................................................154
12.2. Limitations............................................................................................. 154
12.3. OS Runtime Libraries Trace Filters................................................................ 155
12.4. OS Runtime Default Function List................................................................. 156
Chapter 13. NVTX Trace..................................................................................... 159
Chapter 14. CUDA Trace..................................................................................... 163
14.1. CUDA GPU Memory Allocation Graph............................................................. 166
14.2. Unified Memory Transfer Trace.................................................................... 166
Unified Memory CPU Page Faults...................................................................... 168
Unified Memory GPU Page Faults...................................................................... 169
14.3. CUDA Default Function List for CLI............................................................... 171
14.4. cuDNN Function List for X86 CLI...................................................................173
Chapter 15. OpenACC Trace................................................................................ 175
Chapter 16. OpenGL Trace.................................................................................. 177
16.1. OpenGL Trace Using Command Line...............................................................179
Chapter 17. Custom ETW Trace............................................................................181
Chapter 18. GPU Metric Sampling......................................................................... 183
Overview.......................................................................................................183
Launching GPU Metric Sampling from the CLI.......................................................... 184
Launching GPU Metric Sampling from the GUI..........................................................185
Sampling frequency..........................................................................................185
Available Metrics............................................................................................. 186
Exporting and Querying Data.............................................................................. 189
Limitations.................................................................................................... 190
Chapter 19. NVIDIA Video Codec SDK Trace.............................................................191
19.1. NV Encoder API Functions Traced by Default.................................................... 192
19.2. NV Decoder API Functions Traced by Default....................................................193
19.3. NV JPEG API Functions Traced by Default........................................................194
Chapter 20. Network Communication Profiling.........................................................195
20.1. MPI API Trace......................................................................................... 196
20.2. OpenSHMEM Library Trace.......................................................................... 199
20.3. UCX Library Trace.................................................................................... 199
20.4. NVIDIA NVSHMEM and NCCL Trace................................................................. 200
20.5. NIC Metric Sampling................................................................................. 201
Chapter 21. Reading Your Report in GUI.................................................................203
21.1. Generating a New Report........................................................................... 203
21.2. Opening an Existing Report......................................................................... 203
21.3. Sharing a Report File................................................................................ 203
21.4. Report Tab............................................................................................. 203
21.5. Analysis Summary View..............................................................................204
21.6. Timeline View......................................................................................... 204
21.6.1. Timeline...........................................................................................204
Row Height.............................................................................................. 205
21.6.2. Events View...................................................................................... 205
21.6.3. Function Table Modes.......................................................................... 207
21.6.4. Filter Dialog...................................................................................... 210
21.7. Diagnostics Summary View..........................................................................210
21.8. Symbol Resolution Logs View....................................................................... 211
Chapter 22. Adding Report to the Timeline.............................................................212
22.1. Time Synchronization................................................................................ 212
22.2. Timeline Hierarchy................................................................................... 214
22.3. Example: MPI.......................................................................................... 215
22.4. Limitations............................................................................................. 216
Chapter 23. Using Nsight Systems Expert System......................................................217
Using Expert System from the CLI........................................................................ 217
Using Expert System from the GUI....................................................................... 217
Expert System Rules.........................................................................................218
CUDA Synchronous Operation Rules....................................................................218
GPU Low Utilization Rules.............................................................................. 219
Chapter 24. Import NVTXT..................................................................................221
Commands.....................................................................................................222
Chapter 25. Visual Studio Integration.................................................................... 224
Chapter 26. Troubleshooting............................................................................... 226
26.1. General Troubleshooting.............................................................................226
26.2. CLI Troubleshooting.................................................................................. 227
26.3. Launch Processes in Stopped State................................................................227
LD_PRELOAD............................................................................................... 228
Launcher.................................................................................................... 228
26.4. GUI Troubleshooting..................................................................................229
Ubuntu 18.04/20.04 and CentOS 7/8 with root privileges......................................... 229
Ubuntu 18.04/20.04 and CentOS 7/8 without root privileges..................................... 230
Other platforms, or if the previous steps did not help............................................ 230
26.5. Symbol Resolution.................................................................................... 230
Broken Backtraces on Tegra............................................................................ 232
Debug Versions of ELF Files.............................................................................233
26.6. Logging................................................................................................. 234
Verbose Logging on Linux Targets......................................................................234
Verbose Logging on Windows Targets..................................................................234
Chapter 27. Other Resources...............................................................................236
Feature Videos............................................................................................... 236
Blog Posts..................................................................................................... 236
Training Seminars............................................................................................ 236
Conference Presentations.................................................................................. 237
For More Support............................................................................................ 237
Chapter 1.
PROFILING FROM THE CLI
nsys [command_switch][optional command_switch_options][application] [optional application_options]
All command line options are case sensitive. For command switch options, when
short options are used, the parameters should follow the switch after a space; e.g. -s
process-tree. When long options are used, the switch should be followed by an equal
sign and then the parameter(s); e.g. --sample=process-tree.
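For example, the following two invocations are equivalent (the application name is illustrative):
nsys profile -s process-tree ./myapp
nsys profile --sample=process-tree ./myapp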
For this version of Nsight Systems, if you launch a process from the command line to
begin analysis, the launched process will be terminated when collection is complete,
including runs with --duration set, unless the user specifies the --kill none option (details
below). The exception is that if the user uses NVTX, cudaProfilerStart/Stop, or hotkeys to
control the duration, the application will continue unless --kill is set.
The Nsight Systems CLI supports concurrent analysis by using sessions. Each Nsight
Systems session is defined by a sequence of CLI commands that define one or more
collections (e.g. when and what data is collected). A session begins with either a start,
launch, or profile command. A session ends with a shutdown command, when a profile
command terminates, or, if requested, when all the process tree(s) launched in the
session exit. Multiple sessions can run concurrently on the same system.
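As a sketch of one such session (the application and trace selection are illustrative), an interactive sequence could look like this:
nsys launch --trace=cuda,nvtx ./myapp
nsys start
nsys stop
nsys shutdown
Here the launch command begins the session and runs the application with tracing enabled but collects nothing until start is issued; stop ends the collection and writes a report, and shutdown ends the session.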
Command       Description

profile       A fully formed profiling description requiring and accepting no further input. The command switch options used determine when the collection starts and stops, what collectors are used (e.g. API trace, IP sampling, etc.), what processes are monitored, etc.

start         Start a collection in interactive mode. The start command can be executed before or after a launch command.

stop          Stop a collection that was started in interactive mode. When executed, all active collections stop, the CLI process terminates, but the application continues running.

cancel        Cancels an existing collection started in interactive mode. All data already collected in the current collection is discarded.

launch        In interactive mode, launches an application in an environment that supports the requested options. The launch command can be executed before or after a start command.

shutdown      Disconnects the CLI process from the launched application and forces the CLI process to exit. If a collection is pending or active, it is cancelled.

export        Generates an export file from an existing .nsys-rep file. For more information about the exported formats, see the /documentation/nsys-exporter directory in your Nsight Systems installation directory.

stats         Post-processes an existing Nsight Systems result, either in .nsys-rep or SQLite format, to generate statistical information.

analyze       Post-processes an existing Nsight Systems result, either in .nsys-rep or SQLite format, to generate an expert systems report.

status        Reports on the status of a CLI-based collection or the suitability of the profiling environment.

sessions      Gives information about all sessions running on the system.

nvprof        Special option to help with transition from the legacy NVIDIA nvprof tool. Calling nsys nvprof [options] will provide the best available translation of nvprof [options]. See the Migrating from NVIDIA nvprof topic for details. No additional functionality of nsys will be available when using this option. Note: Not available on IBM Power targets.
After choosing the stats command switch, the following options are available. Usage:
nsys [global-options] stats [options] [input-file]
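For example (the report script and file name are illustrative):
nsys stats --report cudaapisum --format table report1.nsys-rep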
The sessions command supports the following subcommand:

Subcommand    Description

list          List all active sessions, including ID, name, and state information.
Effect: Nsight Systems CLI (and target application) will run with elevated privilege.
This is necessary for some features, such as FTrace or system-wide CPU sampling. If you
don't want the target application to be elevated, use the `--run-as` option.
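A sketch of such an elevated run (the application and user name are illustrative assumptions):
sudo nsys profile --run-as=joe ./myapp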
Default analysis run
nsys profile <application>
[application-arguments]
Effect: Launch the application using the given arguments. Start collecting immediately
and end collection when the application stops. Trace CUDA, OpenGL, NVTX, and
OS runtime libraries APIs. Collect CPU sampling information and thread scheduling
information. With Nsight Systems Embedded Platforms Edition this will only analyze
the single process. With Nsight Systems Workstation Edition this will trace the process
tree. Generate the report#.nsys-rep file in the default location, incrementing the report
number if needed to avoid overwriting any existing output files.
Limited trace only run
nsys profile --trace=cuda,nvtx -d 20
--sample=none --cpuctxsw=none -o my_test <application>
[application-arguments]
Effect: Launch the application using the given arguments. Start collecting immediately
and end collection after 20 seconds or when the application ends. Trace CUDA and
NVTX APIs. Do not collect CPU sampling information or thread scheduling information.
Profile any child processes. Generate the output file as my_test.nsys-rep in the current
working directory.
Delayed start run
nsys profile -e TEST_ONLY=0 -y 20
<application> [application-arguments]
Effect: Set environment variable TEST_ONLY=0. Launch the application using the given
arguments. Start collecting after 20 seconds and end collection at application exit. Trace
CUDA, OpenGL, NVTX, and OS runtime libraries APIs. Collect CPU sampling and
thread schedule information. Profile any child processes. Generate the report#.nsys-rep
file in the default location, incrementing if needed to avoid overwriting any existing
output files.
Collect ftrace events
nsys profile --ftrace=drm/drm_vblank_event -d 20
Effect: Collect drm/drm_vblank_event FTrace events for 20 seconds. Generate the report#.nsys-rep
file in the default location, incrementing if needed to avoid overwriting any existing output files.
Run GPU metric sampling on one GPU at the default frequency
Effect: Launch application. Collect default options and GPU metrics for the first GPU
(a TU10x), using the tu10x-gfxt metric set at the default frequency (10 kHz). Profile any
child processes. Generate the report#.nsys-rep file in the default location, incrementing if
needed to avoid overwriting any existing output files.
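A sketch of a command line that would produce that effect, assuming --gpu-metrics-device also accepts a single GPU index:
nsys profile --gpu-metrics-device=0 <application> [application-arguments]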
Run GPU metric sampling on all GPUs at a set frequency
nsys profile --gpu-metrics-device=all
--gpu-metrics-frequency=20000 <application>
Effect: Launch application. Collect default options and GPU metrics for all available
GPUs using the first suitable metric set for each and sampling at 20 kHz. Profile any
child processes. Generate the report#.nsys-rep file in the default location, incrementing if
needed to avoid overwriting any existing output files.
Collect custom ETW trace using configuration file
nsys profile --etw-provider=file.JSON
Effect: Configure custom ETW collectors using the contents of file.JSON. Collect data for
20 seconds. Generate the report#.nsys-rep file in the current working directory.
A template JSON configuration file is located in the Nsight Systems installation
directory at \target-windows-x64\etw_providers_template.json. This path will show up
automatically if you call
nsys profile --help
The level attribute can only be set to one of the following:
‣ TRACE_LEVEL_ERROR
‣ TRACE_LEVEL_WARNING
‣ TRACE_LEVEL_INFORMATION
‣ TRACE_LEVEL_VERBOSE
The flags attribute can only be set to one or more of the following:
‣ EVENT_TRACE_FLAG_ALPC
‣ EVENT_TRACE_FLAG_CSWITCH
‣ EVENT_TRACE_FLAG_DBGPRINT
‣ EVENT_TRACE_FLAG_DISK_FILE_IO
‣ EVENT_TRACE_FLAG_DISK_IO
‣ EVENT_TRACE_FLAG_DISK_IO_INIT
‣ EVENT_TRACE_FLAG_DISPATCHER
‣ EVENT_TRACE_FLAG_DPC
‣ EVENT_TRACE_FLAG_DRIVER
‣ EVENT_TRACE_FLAG_FILE_IO
‣ EVENT_TRACE_FLAG_FILE_IO_INIT
‣ EVENT_TRACE_FLAG_IMAGE_LOAD
‣ EVENT_TRACE_FLAG_INTERRUPT
‣ EVENT_TRACE_FLAG_JOB
‣ EVENT_TRACE_FLAG_MEMORY_HARD_FAULTS
‣ EVENT_TRACE_FLAG_MEMORY_PAGE_FAULTS
‣ EVENT_TRACE_FLAG_NETWORK_TCPIP
‣ EVENT_TRACE_FLAG_NO_SYSCONFIG
‣ EVENT_TRACE_FLAG_PROCESS
‣ EVENT_TRACE_FLAG_PROCESS_COUNTERS
‣ EVENT_TRACE_FLAG_PROFILE
‣ EVENT_TRACE_FLAG_REGISTRY
‣ EVENT_TRACE_FLAG_SPLIT_IO
‣ EVENT_TRACE_FLAG_SYSTEMCALL
‣ EVENT_TRACE_FLAG_THREAD
‣ EVENT_TRACE_FLAG_VAMAP
‣ EVENT_TRACE_FLAG_VIRTUAL_ALLOC
Typical case: profile a Python script that uses CUDA
nsys profile --trace=cuda,cudnn,cublas,osrt,nvtx
--delay=60 python my_dnn_script.py
Effect: Launch a Python script and start profiling it 60 seconds after the launch, tracing
CUDA, cuDNN, cuBLAS, OS runtime APIs, and NVTX as well as collecting thread
schedule information.
Typical case: profile an app that uses Vulkan
nsys profile --trace=vulkan,osrt,nvtx
--delay=60 ./myapp
Effect: Launch an app and start profiling it 60 seconds after the launch, tracing Vulkan,
OS runtime APIs, and NVTX as well as collecting CPU sampling and thread schedule
information.
Effect: Create interactive CLI process and set it up to begin collecting as soon as an
application is launched. Launch the application, set up to allow tracing of CUDA and
NVTX as well as collection of thread schedule information. Stop only when explicitly
requested. Generate the report#.nsys-rep in the default location.
Note: If you start a collection and fail to stop the collection (or if you are allowing it to
stop on exit, and the application runs for too long) your system's storage space may be
filled with collected data causing significant issues for the system. Nsight Systems will
collect a different amount of data/sec depending on options, but in general Nsight Systems
does not support runs of more than 5 minutes duration.
Effect: Create interactive CLI and launch an application set up for default analysis.
Send application output to the terminal. No data is collected until you manually
start collection at area of interest. Profile until the application ends. Generate the
report#.nsys-rep in the default location.
Note: If you launch an application and that application and any descendants exit before
start is called, Nsight Systems will create a fully formed .nsys-rep file containing no data.
Effect: Create interactive CLI process and set it up to begin collecting as soon as
a cudaProfilerStart() is detected. Launch application for default analysis, sending
application output to the terminal. Stop collection at the next call to cudaProfilerStop(),
when the user calls nsys stop, or when the root process terminates. Generate the
report#.nsys-rep in the default location.
Note: If you call nsys launch before nsys start -c cudaProfilerApi and the code contains
a large number of short duration cudaProfilerStart/Stop pairs, Nsight Systems may be
unable to process them correctly, causing a fault. This will be corrected in a future version.

Note: The Nsight Systems CLI does not support multiple calls to the cudaProfilerStart/Stop
API at this time.
Effect: Create interactive CLI process and set it up to begin collecting as soon as an
NVTX range with given message in given domain (capture range) is opened. Launch
application for default analysis, sending application output to the terminal. Stop
collection when all capture ranges are closed, when the user calls nsys stop, or when
the root process terminates. Generate the report#.nsys-rep in the default location.
Note: The Nsight Systems CLI only triggers the profiling session for the first capture range.
‣ Message@Domain: All ranges with given message in given domain are capture ranges. For
example:
nsys launch -w true -p profiler@service ./app
This would make the profiling start when the first range with message "profiler" is
opened in domain "service".
‣ Message@*: All ranges with given message in all domains are capture ranges. For
example:
nsys launch -w true -p profiler@* ./app
This would make the profiling start when the first range with message "profiler" is
opened in any domain.
‣ Message: All ranges with given message in default domain are capture ranges. For
example:
nsys launch -w true -p profiler ./app
This would make the profiling start when the first range with message "profiler" is
opened in the default domain.
‣ By default, only messages provided by NVTX registered strings are considered, to
avoid additional overhead. To enable the check for non-registered strings, please launch
your application with the NSYS_NVTX_PROFILER_REGISTER_ONLY=0 environment variable:
nsys launch -w true -p profiler@service -e
NSYS_NVTX_PROFILER_REGISTER_ONLY=0 ./app
Effect: Create interactive CLI and launch an application set up for default analysis.
Send application output to the terminal. No data is collected until the start command
is executed. Collect data from start until stop is requested, generating report#.qdstrm in the
current working directory. Collect data from the second start until the second stop request,
generating report#.nsys-rep (incremented by one) in the current working directory.
Shut down the interactive CLI and send SIGKILL to the target application's process group.
Note: Calling nsys cancel after nsys start will cancel the collection without generating a report.
Effect: Open test.sqlite and run the cudaapisum script on that file. Generate table data
and feed that into the command grep -E (-|Name|cudaFree). The grep command
will filter out everything but the header, formatting, and the cudaFree data, and display
the results to the console.
Note: When the output name starts with @, it is defined as a command. The command
is run, and the output of the report is piped to the command's stdin (standard-input).
The command's stdout and stderr remain attached to the console, so any output will be
displayed directly to the console.
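A sketch of an invocation that uses this mechanism (the report script, database file, and filter are illustrative; the surrounding quotes are consumed by the shell, not by nsys):
nsys stats --report cudaapisum --format table --output "@grep -E (-|Name|cudaFree)" test.sqlite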
Be aware there are some limitations in how the command string is parsed. No shell
expansions (including *, ?, [], and ~) are supported. The command cannot be piped
to another command, nor redirected to a file using shell syntax. The command and
command arguments are split on whitespace, and no quotes (within the command
syntax) are supported. For commands that require complex command line syntax, it is
suggested that the command be put into a shell script file, and the script designated as
the output command.
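For instance, a helper script can carry the complex part of the pipeline (all names are illustrative):
cat > filter.sh << 'EOF'
#!/bin/bash
# read report rows on stdin and keep only the header, formatting, and cudaFree lines
exec grep -E '(-|Name|cudaFree)'
EOF
chmod +x filter.sh
nsys stats --report cudaapisum --format table --output @./filter.sh test.sqlite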
If your run traces graphics debug markers (these include DX11 debug markers, DX12
debug markers, Vulkan debug markers, or KHR debug markers):
Recipes for these statistics as well as documentation on how to create your own metrics
will be available in a future version of the tool.
The import of really large (multi-gigabyte) .qdstrm files may take up all of the memory
on the host computer and lock up the system. This will be fixed in a later version.
Importing Windows ETL files
For Windows targets, ETL files captured with Xperf or the log.cmd command supplied
with GPUView in the Windows Performance Toolkit can be imported to create reports
as if they were captured with the Nsight Systems "WDDM trace" and "Custom ETW trace"
features. Simply choose the .etl file from the Import dialog to convert it to a .nsys-rep
file.
Create .nsys-rep Using QdstrmImporter
The CLI and QdstrmImporter versions must match to convert a .qdstrm file into a .nsys-
rep file. This .nsys-rep file can then be opened in the same version or more recent
versions of the GUI.
To run QdstrmImporter on the host system, find the QdstrmImporter binary in the Host-
x86_64 directory in your installation. QdstrmImporter is available for all host platforms.
See options below.
To run QdstrmImporter on the target system, copy the Linux Host-x86_64 directory to
the target Linux system or install Nsight Systems for Linux host directly on the target.
The Windows or macOS host QdstrmImporter will not work on a Linux Target. See
options below.
To profile everything, putting the data from each rank into a separate file:
mpirun [mpi options] nsys profile [nsys options]
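For example, with Open MPI (the rank count and application are illustrative; %q{OMPI_COMM_WORLD_RANK} expands the rank environment variable in the output name):
mpirun -np 4 nsys profile -o "report_rank%q{OMPI_COMM_WORLD_RANK}" ./myapp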
To profile a single MPI process, use a wrapper script. The following script (called
"wrap.sh") runs nsys on rank 0 only:
#!/bin/bash
if [[ $OMPI_COMM_WORLD_RANK == 0 ]]; then
~/nsys/nsys profile ./myapp "$@" --mydummyargument
else
./myapp "$@"
fi
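The MPI job is then launched through the wrapper, for example (rank count illustrative):
mpirun -np 4 ./wrap.sh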
Note: Currently you will need a dummy argument to the process, so that Nsight Systems can
decide which process to profile. This means that your process must accept dummy arguments
to take advantage of this workaround. This script as written is for Open MPI, but should be
easily adaptable to other MPI implementations.
Chapter 2.
PROFILING FROM THE GUI
On Tegra:
The dialog has simple controls that allow adding, removing, and modifying connections:
Security notice: SSH is only used to establish the initial connection to a target device,
perform checks, and upload necessary files. The actual profiling commands and data
are transferred through a raw, unencrypted socket. Nsight Systems should not be used
in a network setup where an attacker-in-the-middle attack is possible, or where untrusted
parties may have network access to the target device.
While connecting to the target device, you will be prompted to input the user's
password. Please note that if you choose to remember the password, it will be stored in
plain text in the configuration file on the host. Stored passwords are bound to the public
key fingerprint of the remote device.
The No authentication option is useful for devices configured for passwordless
login using the root username. To enable such a configuration, edit the file
/etc/ssh/sshd_config on the target and specify the following option:
PermitRootLogin yes
Then set an empty password using passwd and restart the SSH service with service ssh
restart.
Open ports: The Nsight Systems daemon requires port 22 and port 45555 to be open for
listening. You can confirm that these ports are open with the following command:
sudo firewall-cmd --list-ports --permanent
sudo firewall-cmd --reload
To open a port, use the following command (skip the --permanent option to open it only
for this session):
sudo firewall-cmd --permanent --add-port 45555/tcp
sudo firewall-cmd --reload
Likewise, if you are running on a cloud system, you must open port 22 and port 45555
for ingress.
Kernel Version Number - To check for the version number of the kernel support of
Nsight Systems on a target device, run the following command on the remote device:
cat /proc/quadd/version
2.1.2.1. Linux x86_64
System-wide profiling is available on x86 for Linux targets only when run with root
privileges.
Ftrace Events Collection
Select Ftrace events
Trace all processes – On compatible devices (with kernel module support version 1.107
or higher), this enables trace of all processes and threads in the system. Scheduler events
from all tasks will be recorded.
Collect PMU counters – This allows you to choose which PMU (Performance
Monitoring Unit) counters Nsight Systems will sample. Enable specific counters when
interested in correlating cache misses to functions in your application.
Three different backtrace collections options are available when sampling CPU
instruction pointers. Backtraces can be generated using Intel (c) Last Branch Record
(LBR) registers. LBR backtraces generate minimal overhead but the backtraces have
limited depth. Backtraces can also be generated using DWARF debug data. DWARF
backtraces incur more overhead than LBR backtraces but have much better depth.
Finally, backtraces can be generated using frame pointers. Frame pointer backtraces
incur medium overhead and have good depth but only resolve frames in the portions
of the application and its libraries (including 3rd party libraries) that were compiled
with frame pointers enabled. Normally, frame pointers are disabled by default during
compilation.
By default, Nsight Systems will use Intel(c) LBRs if available and fall back to using DWARF
unwind if they are not. Choose modes... will allow you to override the default.
The Include child processes switch controls whether API tracing is only for the
launched process, or for all existing and new child processes of the launched process. If
you are running your application through a script, for example a bash script, you need
to set this checkbox.
The Include child processes switch does not control sampling in this version of Nsight
Systems. The full process tree will be sampled regardless of this setting. This will be
fixed in a future version of the product.
Nsight Systems can sample one process tree. Sampling here means interrupting each
processor after a certain number of events and collecting an instruction pointer (IP)/
backtrace sample if the processor is executing the profilee.
When sampling the CPU on a workstation target, Nsight Systems traces thread
context switches and infers thread state as either Running or Blocked. Note that
Blocked in the timeline indicates the thread may be Blocked (Interruptible) or Blocked
(Uninterruptible). Blocked (Uninterruptible) often occurs when a thread has transitioned
into the kernel and cannot be interrupted by a signal. Sampling can be enhanced with
OS runtime libraries tracing; see OS Runtime Libraries Trace for more information.
Currently Nsight Systems can only sample one process. Sampling here means that the
profilee will be stopped periodically, and backtraces of active threads will be recorded.
Most applications use stripped libraries. In this case, many symbols may stay
unresolved. If unstripped libraries exist, paths to them can be specified using the
Symbol locations... button. Symbol resolution happens on host, and therefore does not
affect performance of profiling on the target.
Additionally, debug versions of ELF files may be picked up from the target system. Refer
to Debug Versions of ELF Files for more information.
In Attach or launch mode, Nsight Systems first searches for the process as in Attach only
mode; if it is not found, the process is launched using the same path and command line
arguments. If NVTX, CUDA, or other trace settings are selected, the process will be
automatically launched with appropriate environment variables.
Note that in some cases, the capabilities of Nsight Systems are not sufficient to correctly
launch the application; for example, if certain environment variables have to be
corrected. In this case, the application has to be started manually and Nsight Systems
should be used in Attach only mode.
The Edit arguments... link will open an editor window, where every command line
argument is edited on a separate line. This is convenient when arguments contain spaces
or quotes.
To properly populate the Search criteria field based on a currently running process on
the target system, use the Select a process button on the right, which has ellipsis as the
caption. The list of processes is automatically refreshed upon opening.
window. This is useful when tracing games and graphic applications that use fullscreen
display. In these scenarios switching to Nsight Systems' UI would unnecessarily
introduce the window manager's footprint into the trace. To enable the use of Hotkey,
check the Hotkey checkbox in the project settings page:
Nsight Systems can sample one process tree. Sampling here means interrupting each
processor periodically. The sampling rate is defined in the project settings and is either
100 Hz, 1 kHz (default value), 2 kHz, 4 kHz, or 8 kHz.
On Windows, Nsight Systems can collect thread activity of one process tree. Collecting
thread activity means that each thread context switch event is logged and (optionally) a
backtrace is collected at the point that the thread is scheduled back for execution. Thread
states are displayed on the timeline.
If it was collected, the thread backtrace is displayed when hovering over a region where
the thread execution is blocked.
Symbol Locations
Symbol resolution happens on host, and therefore does not affect performance of
profiling on the target.
Press the Symbol locations... button to open the Configure debug symbols location
dialog.
Chapter 3.
EXPORT FORMATS
CUDA event class values
0 - TRACE_PROCESS_EVENT_CUDA_RUNTIME
1 - TRACE_PROCESS_EVENT_CUDA_DRIVER
13 - TRACE_PROCESS_EVENT_CUDA_EGL_DRIVER
28 - TRACE_PROCESS_EVENT_CUDNN
29 - TRACE_PROCESS_EVENT_CUBLAS
33 - TRACE_PROCESS_EVENT_CUDNN_START
34 - TRACE_PROCESS_EVENT_CUDNN_FINISH
35 - TRACE_PROCESS_EVENT_CUBLAS_START
36 - TRACE_PROCESS_EVENT_CUBLAS_FINISH
67 - TRACE_PROCESS_EVENT_CUDABACKTRACE
77 - TRACE_PROCESS_EVENT_CUDA_GRAPH_NODE_CREATION
See CUPTI documentation for detailed information on collected event and data types.
NVTX Event Type Values
33 - NvtxCategory
34 - NvtxMark
39 - NvtxThread
59 - NvtxPushPopRange
60 - NvtxStartEndRange
75 - NvtxDomainCreate
76 - NvtxDomainDestroy
The difference between the text and textId columns is that if an NVTX event message was
passed via a call to the nvtxDomainRegisterString function, then the message will be available
through the textId field; otherwise, the text field will contain the message, if it was provided.
OpenGL Events
KHR event class values
62 - KhrDebugPushPopRange
63 - KhrDebugGpuPushPopRange
0x8249 - GL_DEBUG_SOURCE_THIRD_PARTY
0x824A - GL_DEBUG_SOURCE_APPLICATION
0x824C - GL_DEBUG_TYPE_ERROR
0x824D - GL_DEBUG_TYPE_DEPRECATED_BEHAVIOR
0x824E - GL_DEBUG_TYPE_UNDEFINED_BEHAVIOR
0x824F - GL_DEBUG_TYPE_PORTABILITY
0x8250 - GL_DEBUG_TYPE_PERFORMANCE
0x8251 - GL_DEBUG_TYPE_OTHER
0x8268 - GL_DEBUG_TYPE_MARKER
0x8269 - GL_DEBUG_TYPE_PUSH_GROUP
0x826A - GL_DEBUG_TYPE_POP_GROUP
0x826B - GL_DEBUG_SEVERITY_NOTIFICATION
0x9146 - GL_DEBUG_SEVERITY_HIGH
0x9147 - GL_DEBUG_SEVERITY_MEDIUM
0x9148 - GL_DEBUG_SEVERITY_LOW
27 - TRACE_PROCESS_EVENT_OS_RUNTIME
31 - TRACE_PROCESS_EVENT_OS_RUNTIME_START
32 - TRACE_PROCESS_EVENT_OS_RUNTIME_FINISH
41 - TRACE_PROCESS_EVENT_DX12_API
42 - TRACE_PROCESS_EVENT_DX12_WORKLOAD
43 - TRACE_PROCESS_EVENT_DX12_START
44 - TRACE_PROCESS_EVENT_DX12_FINISH
52 - TRACE_PROCESS_EVENT_DX12_DISPLAY
59 - TRACE_PROCESS_EVENT_DX12_CREATE_OBJECT
65 - TRACE_PROCESS_EVENT_DX12_DEBUG_API
75 - TRACE_PROCESS_EVENT_DX11_DEBUG_API
53 - TRACE_PROCESS_EVENT_VULKAN_API
54 - TRACE_PROCESS_EVENT_VULKAN_WORKLOAD
55 - TRACE_PROCESS_EVENT_VULKAN_START
56 - TRACE_PROCESS_EVENT_VULKAN_FINISH
60 - TRACE_PROCESS_EVENT_VULKAN_CREATE_OBJECT
66 - TRACE_PROCESS_EVENT_VULKAN_DEBUG_API
Vulkan Flags
VALID_BIT = 0x00000001
CACHE_HIT_BIT = 0x00000002
BASE_PIPELINE_ACCELERATION_BIT = 0x00000004
62 - TRACE_PROCESS_EVENT_SLI
63 - TRACE_PROCESS_EVENT_SLI_START
64 - TRACE_PROCESS_EVENT_SLI_FINISH
0 - P2P_SKIPPED
1 - P2P_EARLY_PUSH
2 - P2P_PUSH_FAILED
3 - P2P_2WAY_OR_PULL
4 - P2P_PRESENT
5 - P2P_DX12_INIT_PUSH_ON_WRITE
0 - None
101 - RestoreSegments
102 - PurgeSegments
103 - CleanupPrimary
104 - AllocatePagingBufferResources
105 - FreePagingBufferResources
106 - ReportVidMmState
107 - RunApertureCoherencyTest
108 - RunUnmapToDummyPageTest
109 - DeferredCommand
110 - SuspendMemorySegmentAccess
111 - ResumeMemorySegmentAccess
112 - EvictAndFlush
113 - CommitVirtualAddressRange
114 - UncommitVirtualAddressRange
115 - DestroyVirtualAddressAllocator
116 - PageInDevice
117 - MapContextAllocation
118 - InitPagingProcessVaSpace
200 - CloseAllocation
202 - ComplexLock
203 - PinAllocation
204 - FlushPendingGpuAccess
205 - UnpinAllocation
206 - MakeResident
207 - Evict
208 - LockInAperture
209 - InitContextAllocation
210 - ReclaimAllocation
211 - DiscardAllocation
212 - SetAllocationPriority
1000 - EvictSystemMemoryOfferList
0 - VIDMM_PAGING_QUEUE_TYPE_UMD
1 - VIDMM_PAGING_QUEUE_TYPE_Default
2 - VIDMM_PAGING_QUEUE_TYPE_Evict
3 - VIDMM_PAGING_QUEUE_TYPE_Reclaim
0 - DXGKETW_RENDER_COMMAND_BUFFER
1 - DXGKETW_DEFERRED_COMMAND_BUFFER
2 - DXGKETW_SYSTEM_COMMAND_BUFFER
3 - DXGKETW_MMIOFLIP_COMMAND_BUFFER
4 - DXGKETW_WAIT_COMMAND_BUFFER
5 - DXGKETW_SIGNAL_COMMAND_BUFFER
6 - DXGKETW_DEVICE_COMMAND_BUFFER
7 - DXGKETW_SOFTWARE_COMMAND_BUFFER
0 - DXGK_ENGINE_TYPE_OTHER
1 - DXGK_ENGINE_TYPE_3D
2 - DXGK_ENGINE_TYPE_VIDEO_DECODE
3 - DXGK_ENGINE_TYPE_VIDEO_ENCODE
4 - DXGK_ENGINE_TYPE_VIDEO_PROCESSING
5 - DXGK_ENGINE_TYPE_SCENE_ASSEMBLY
6 - DXGK_ENGINE_TYPE_COPY
7 - DXGK_ENGINE_TYPE_OVERLAY
8 - DXGK_ENGINE_TYPE_CRYPTO
1 = DXGK_INTERRUPT_DMA_COMPLETED
2 = DXGK_INTERRUPT_DMA_PREEMPTED
4 = DXGK_INTERRUPT_DMA_FAULTED
9 = DXGK_INTERRUPT_DMA_PAGE_FAULTED
0 = Queue_Packet
1 = Dma_Packet
2 = Paging_Queue_Packet
Driver Events
Load balance event type values
1 - LoadBalanceEvent_GPU
8 - LoadBalanceEvent_CPU
21 - LoadBalanceMasterEvent_GPU
22 - LoadBalanceMasterEvent_CPU
OpenMP Events
OpenMP event class values
78 - TRACE_PROCESS_EVENT_OPENMP
79 - TRACE_PROCESS_EVENT_OPENMP_START
80 - TRACE_PROCESS_EVENT_OPENMP_FINISH
15 - OPENMP_EVENT_KIND_TASK_CREATE
16 - OPENMP_EVENT_KIND_TASK_SCHEDULE
17 - OPENMP_EVENT_KIND_CANCEL
20 - OPENMP_EVENT_KIND_MUTEX_RELEASED
21 - OPENMP_EVENT_KIND_LOCK_INIT
22 - OPENMP_EVENT_KIND_LOCK_DESTROY
25 - OPENMP_EVENT_KIND_DISPATCH
26 - OPENMP_EVENT_KIND_FLUSH
27 - OPENMP_EVENT_KIND_THREAD
28 - OPENMP_EVENT_KIND_PARALLEL
29 - OPENMP_EVENT_KIND_SYNC_REGION_WAIT
30 - OPENMP_EVENT_KIND_SYNC_REGION
31 - OPENMP_EVENT_KIND_TASK
32 - OPENMP_EVENT_KIND_MASTER
33 - OPENMP_EVENT_KIND_REDUCTION
34 - OPENMP_EVENT_KIND_MUTEX_WAIT
35 - OPENMP_EVENT_KIND_CRITICAL_SECTION
36 - OPENMP_EVENT_KIND_WORKSHARE
1 - Barrier
2 - Implicit barrier
3 - Explicit barrier
4 - Implementation-dependent barrier
5 - Taskwait
6 - Taskgroup
1 - Initial task
2 - Implicit task
3 - Explicit task
1 - Task completed
2 - Task yielded to another task
3 - Task was cancelled
7 - Task was switched out for other reasons
1 - Loop region
2 - Sections region
3 - Single region (executor)
4 - Single region (waiting)
5 - Workshare region
6 - Distribute region
7 - Taskloop region
1 - Iteration
2 - Section
.mode column
.headers on
Default column width is determined by the data in the first row of results. If this doesn’t
work out well, you can specify widths manually.
.width 10 20 50
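A sketch of applying these settings from the shell (the exported database name is illustrative):
sqlite3 report1.sqlite << 'EOF'
.mode column
.headers on
.width 10 20 50
SELECT * FROM StringIds LIMIT 5;
EOF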
Note: The globalTid field includes both TID and PID values, while globalPid only contains
the PID value.
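For example, the PID and TID can be unpacked with the same arithmetic used in the queries below (the database name is illustrative):
sqlite3 report1.sqlite "SELECT DISTINCT globalTid / 0x1000000 % 0x1000000 AS PID, globalTid % 0x1000000 AS TID FROM ThreadNames LIMIT 10;"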
Correlate CUDA Kernel Launches With CUDA API Kernel Launches
Results:
Results:
COUNT(*)
----------
1095
Find CUDA API Calls That Resulted in Original Graph Node Creation.
Results:
Results:
Results:
---------- -------------------------------------------------------
---------------------------------------------------------------------------------------------
19163 /tmp/nvidia/nsight_systems/streams/pid_19163_stderr.log
Thread Summary
Please note that Nsight Systems applies additional logic during sampling event
processing to work around lost events. This means that the results of the query below
might differ slightly from the ones shown in the "Analysis summary" tab.
Thread summary calculated using CPU cycles (when available).
SELECT
globalTid / 0x1000000 % 0x1000000 AS PID,
globalTid % 0x1000000 AS TID,
ROUND(100.0 * SUM(cpuCycles) /
(
SELECT SUM(cpuCycles) FROM COMPOSITE_EVENTS
GROUP BY globalTid / 0x1000000000000 % 0x100
),
2
) as CPU_utilization,
(SELECT value FROM StringIds WHERE id =
(
SELECT nameId FROM ThreadNames
WHERE ThreadNames.globalTid = COMPOSITE_EVENTS.globalTid
)
) as thread_name
FROM COMPOSITE_EVENTS
GROUP BY globalTid
ORDER BY CPU_utilization DESC
LIMIT 10;
Results:
Thread running time may be calculated using scheduling data, when PMU counter data
was not collected.
SELECT
globalTid / 0x1000000 % 0x1000000 AS PID,
globalTid % 0x1000000 AS TID,
(SELECT value FROM StringIds where nameId == id) as thread_name,
ROUND(100.0 * total_duration / (SELECT SUM(total_duration) FROM CPU_USAGE),
2) as CPU_utilization
FROM CPU_USAGE
ORDER BY CPU_utilization DESC;
Results:
Function Table
These examples demonstrate how to calculate Flat and BottomUp (for top level only)
view statistics.
To set up:
Results:
The example demonstrates how to calculate DX12 CPU frame durations and construct a
histogram out of them.
SELECT
CAST((end - start) / 1000000.0 AS INT) AS duration_ms,
count(*)
FROM DX12_API_FPS
WHERE end IS NOT NULL
GROUP BY duration_ms
ORDER BY duration_ms;
Results:
duration_ms count(*)
----------- ----------
3 1
4 2
5 7
6 153
7 19
8 116
9 16
10 8
11 2
12 2
13 1
14 4
16 3
17 2
18 1
SELECT (CASE tag WHEN 8 THEN "BEGIN" WHEN 7 THEN "END" END) AS tag,
globalPid / 0x1000000 % 0x1000000 AS PID,
vmId, seqNo, contextId, timestamp, gpuId FROM FECS_EVENTS
WHERE tag in (7, 8) ORDER BY seqNo LIMIT 10;
Results:
WITH
event AS (
SELECT *
FROM NVTX_EVENTS
WHERE eventType IN (34, 59, 60) -- mark, push/pop, start/end
),
category AS (
SELECT
category,
domainId,
text AS categoryName
FROM NVTX_EVENTS
WHERE eventType == 33 -- new category
)
SELECT
start,
end,
globalTid,
eventType,
domainId,
category,
categoryName,
text
FROM event JOIN category USING (category, domainId)
ORDER BY start;
Results:
Results:
Results:
SELECT *
FROM SLI_P2P
WHERE resourceSize < 98304 AND start > 1568063100 AND end < 1579468901
ORDER BY resourceSize DESC;
Results:
Generic Events
Syscall usage histogram by PID:
Results:
PID total
---------- ----------
5551 32811
9680 3988
4328 1477
9564 1246
4376 1204
4377 1167
4357 656
4355 655
4356 640
4354 633
SELECT json_insert('{}',
'$.sourceId', sourceId,
'$.data', json(data)
)
FROM GENERIC_EVENT_SOURCES LIMIT 2;
SELECT json_insert('{}',
'$.typeId', typeId,
'$.sourceId', sourceId,
'$.data', json(data)
)
FROM GENERIC_EVENT_TYPES LIMIT 2;
SELECT json_insert('{}',
'$.rawTimestamp', rawTimestamp,
'$.timestamp', timestamp,
'$.typeId', typeId,
'$.data', json(data)
)
FROM GENERIC_EVENTS LIMIT 2;
Results:
json_insert('{}',
'$.sourceId', sourceId,
'$.data', json(data)
)
----------------------------------------------------------------------------------------------
{"sourceId":72057602627862528,"data":
{"Name":"FTrace","TimeSource":"ClockMonotonicRaw","SourceGroup":"FTrace"}}
json_insert('{}',
'$.typeId', typeId,
'$.sourceId', sourceId,
'$.data', json(data)
)
----------------------------------------------------------------------------------------------
{"typeId":72057602627862547,"sourceId":72057602627862528,"data":
{"Name":"raw_syscalls:sys_enter","Format":"\"NR %ld (%lx,
%lx, %lx, %lx, %lx, %lx)\", REC->id, REC->args[0], REC-
>args[1], REC->args[2], REC->args[3], REC->args[4], REC-
>args[5]","Fields":[{"Name":"common_pid","Prefix":"int","Suffix":""},
{"Name":"id","Prefix":"long","S
{"typeId":72057602627862670,"sourceId":72057602627862528,"data":
{"Name":"irq:irq_handler_entry","Format":"\"irq=%d name=%s\", REC->irq,
__get_str(name)","Fields":[{"Name":"common_pid","Prefix":"int","Suffix":""},
{"Name":"irq","Prefix":"int","Suffix":""},{"Name":"name","Prefix":"__data_loc
char[]","Suffix":""},{"Name":"common_type",
json_insert('{}',
'$.rawTimestamp', rawTimestamp,
'$.timestamp', timestamp,
'$.typeId', typeId,
'$.data', json(data)
)
----------------------------------------------------------------------------------------------
{"rawTimestamp":1183694330725221,"timestamp":6236683,"typeId":72057602627862670,"data":
{"common_pid":"0","irq":"66","name":"327696","common_type":"142","common_flags":"9","common_pr
{"rawTimestamp":1183694333695687,"timestamp":9207149,"typeId":72057602627862670,"data":
{"common_pid":"0","irq":"66","name":"327696","common_type":"142","common_flags":"9","common_pr
field has the key table_name. The titles of all the available tables can be found in
section SQLite Schema Reference.
{Event #1}
{Event #2}
...
{Event #N}
{Strings}
{Streams}
{Threads}
For easier grepping of JSON output, the --separate-strings switch may be used to
force manual splitting of strings, streams and thread names data.
Example line split: nsys export --export-json --separate-strings
sample.nsys-rep -- -
Note that only the last few lines are shown here for clarity, and that carriage returns and
indents were added to avoid wrapping.
Chapter 4.
REPORT SCRIPTS
This report combines data from the cudaapisum, gpukernsum, and gpumemsizesum
reports. It is very similar to the profile section of nvprof --dependency-analysis.
‣ Total Time : The total time used by all executions of this kernel
‣ Instances : The number of calls to this kernel
‣ Average : The average execution time of this kernel
‣ Minimum : The smallest execution time of this kernel
‣ Maximum : The largest execution time of this kernel
‣ Name : The name of the kernel
This report provides a summary of CUDA kernels and their execution times. Note that
the Time(%) column is calculated using a summation of the Total Time column, and
represents that kernel's percent of the execution time of the kernels listed, and not a
percentage of the application wall or CPU execution time.
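For example, this summary can be produced from the command line with the stats command; the invocation below is illustrative and the report file name is a placeholder.
$ nsys stats --report gpukernsum --format table report.nsys-rep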
‣ Strm : Stream ID
‣ Name : Trace event name
This report displays a trace of CUDA kernels and memory operations. Items are sorted
by start time.
‣ Total Time : The total time used by all executions of this function
‣ Num Calls : The number of calls to this function
‣ Average : The average execution time of this function
‣ Minimum : The smallest execution time of this function
‣ Maximum : The largest execution time of this function
‣ Name : The name of the function
This report provides a summary of operating system functions and their execution
times. Note that the Time(%) column is calculated using a summation of the Total Time
column, and represents that function's percent of the execution time of the functions
listed, and not a percentage of the application wall or CPU execution time.
column, and represents that function's percent of the execution time of the functions
listed, and not a percentage of the application wall or CPU execution time.
Column
Usage:
column[:nohdr][:nolimit][:nofmt][:<width>[:<width>]...]
Arguments
‣ nohdr : Do not display the header
‣ nolimit : Remove the 100-character limit from auto-width columns. Note: this can result
in extremely wide columns.
‣ nofmt : Do not reformat numbers.
‣ <width>... : Define the explicit width of one or more columns. If the value "." is
given, the column will auto-adjust. If a width of 0 is given, the column will not be
displayed.
The column formatter presents data in vertical text columns. It is primarily designed to
be a human-readable format for displaying data on a console display.
Text data will be left-justified, while numeric data will be right-justified. If the data
overflows the available column width, it will be marked with a "…" character, to indicate
the data values were clipped. Clipping always occurs on the right-hand side, even for
numeric data.
Numbers will be reformatted to make them easier to visually scan and understand.
This includes adding thousands-separators. This process requires that the string
representation of the number is converted into its native representation (integer or
floating point) and then converted back into a string representation to print. This
conversion process attempts to preserve elements of number presentation, such as the
number of decimal places, or the use of scientific notation, but the conversion is not
always perfect (the number should always be the same, but the presentation may not
be). To disable the reformatting process, use the argument nofmt.
If no explicit width is given, the columns auto-adjust their width based on the header
size and the first 100 lines of data. This auto-adjustment is limited to a maximum
width of 100 characters. To allow larger auto-width columns, pass the initial argument
nolimit. If the first 100 lines do not calculate the correct column width, it is suggested
that explicit column widths be provided.
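For example, a possible invocation that disables number reformatting, sets the first column to 12 characters, auto-adjusts the second, and hides the third might look like the following; the report and file names are placeholders.
$ nsys stats --report gputrace --format column:nofmt:12:.:0 report.nsys-rep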
Table
Usage:
table[:nohdr][:nolimit][:nofmt][:<width>[:<width>]...]
Arguments
‣ nohdr : Do not display the header
‣ nolimit : Remove the 100-character limit from auto-width columns. Note: this can result
in extremely wide columns.
‣ nofmt : Do not reformat numbers.
‣ <width>... : Define the explicit width of one or more columns. If the value "." is
given, the column will auto-adjust. If a width of 0 is given, the column will not be
displayed.
The table formatter presents data in vertical text columns inside text boxes. Other than
the lines between columns, it is identical to the column formatter.
CSV
Usage:
csv[:nohdr]
Arguments
‣ nohdr : Do not display the header
The csv formatter outputs data as comma-separated values. This format is commonly
used for import into other data applications, such as spreadsheets and databases.
There are many different standards for CSV files. Most differences are in how escapes
are handled, that is, how to treat data values that contain a comma or space.
This CSV formatter will escape commas by surrounding the whole value in double-
quotes.
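For example, a sketch of exporting a report summary to a CSV file from the command line; the report and output file names are placeholders.
$ nsys stats --report cudaapisum --format csv report.nsys-rep > cudaapisum.csv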
TSV
Usage:
tsv[:nohdr][:esc]
Arguments
‣ nohdr : Do not display the header
‣ esc : escape tab characters, rather than removing them
The tsv formatter outputs data as tab-separated values. This format is sometimes used
for import into other data applications, such as spreadsheets and databases.
Most TSV import/export systems disallow the tab character in data values. The formatter
will normally replace any tab characters with a single space. If the esc argument has
been provided, any tab characters will be replaced with the literal characters "\t".
JSON
Usage:
json
Arguments: no arguments
The json formatter outputs data as an array of JSON objects. Each object represents one
line of data, and uses the column names as field labels. All objects have the same fields.
The formatter attempts to recognize numeric values, as well as JSON keywords, and
converts them. Empty values are passed as an empty string (and not nil, or as a missing
field).
At this time the formatter does not escape quotes, so if a data value includes double-
quotation marks, it will corrupt the JSON file.
HDoc
Usage:
hdoc[:title=<title>][:css=<URL>]
Arguments:
‣ title : string for HTML document title
‣ css : URL of CSS document to include
The hdoc formatter generates a complete, verifiable (mostly), standalone HTML
document. It is designed to be opened in a web browser, or included in a larger
document via an <iframe>.
HTable
Usage:
htable
Arguments: no arguments
The htable formatter outputs a raw HTML <table> without any of the surrounding
HTML document. It is designed to be included into a larger HTML document. Although
most web browsers will open and display the document, it is better to use the hdoc
format for this type of use.
Chapter 5.
MIGRATING FROM NVIDIA NVPROF
Next Steps
NVIDIA Visual Profiler (NVVP) and NVIDIA nvprof are deprecated. New GPUs and
features will not be supported by those tools. We encourage you to make the move to
Nsight Systems now. For additional information, suggestions, and rationale, see the blog
series in Other Resources.
Chapter 6.
PROFILING IN A DOCKER ON LINUX
DEVICES
Download the default seccomp profile file, default.json, relevant to your Docker version.
If perf_event_open is already listed in the file as guarded by CAP_SYS_ADMIN, then
remove the perf_event_open line. Add the following lines under "syscalls" and save
the resulting file as default_with_perf.json.
{
"name": "perf_event_open",
"action": "SCMP_ACT_ALLOW",
"args": []
},
Then you will be able to use the following switch when starting the Docker to apply the
new seccomp profile.
--security-opt seccomp=default_with_perf.json
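For example, a container could then be started as follows; the image name my_image and any additional options you normally pass to docker run are placeholders.
$ docker run -it --security-opt seccomp=default_with_perf.json my_image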
There is a known issue where Docker collections terminate prematurely with older
versions of the driver and the CUDA Toolkit. If collection is ending unexpectedly, please
update to the latest versions.
After the Docker has been started, use the Nsight Systems CLI to launch a collection
within the Docker. The resulting .qdstrm file can be imported into the Nsight Systems
host like any other CLI result.
Chapter 7.
DIRECT3D TRACE
Nsight Systems has the ability to trace both the Direct3D 11 API and the Direct3D 12 API
on Windows targets.
SLI Trace
Trace SLI queries and peer-to-peer transfers of D3D11 applications. Requires SLI
hardware and an active SLI profile definition in the NVIDIA console.
The Command List Creation row displays time periods when command lists
were being created. This enables developers to improve their application’s multi-
threaded command list creation. Command list creation time period is measured
between the call to ID3D12GraphicsCommandList::Reset and the call to
ID3D12GraphicsCommandList::Close.
The GPU row shows an aggregated view of D3D12 API calls and GPU workloads. Note
that not all D3D12 API calls are logged.
A Command Queue row is displayed for each D3D12 command queue created by the
profiled application. The row’s header displays the queue's running index and its type
(Direct, Compute, Copy).
The DX12 API Memory Ops row displays all API memory operations and non-persistent
resource mappings. Event ranges in the row are color-coded by the heap type they
belong to (Default, Readback, Upload, Custom, or CPU-Visible VRAM), with usage
warnings highlighted in yellow. A breakdown of the operations can be found by
expanding the row to show rows for each individual heap type.
The following operations and warnings are shown:
‣ Calls to ID3D12Device::CreateCommittedResource,
ID3D12Device4::CreateCommittedResource1, and
ID3D12Device8::CreateCommittedResource2
‣ A warning will be reported if D3D12_HEAP_FLAG_CREATE_NOT_ZEROED is not
set in the method's HeapFlags parameter
‣ Calls to ID3D12Device::CreateHeap and ID3D12Device4::CreateHeap1
‣ A warning will be reported if D3D12_HEAP_FLAG_CREATE_NOT_ZEROED is not
set in the Flags field of the method's pDesc parameter
‣ Calls to ID3D12Resource::ReadFromSubResource
‣ A warning will be reported if the read is to a
D3D12_CPU_PAGE_PROPERTY_WRITE_COMBINE CPU page or from a
D3D12_HEAP_TYPE_UPLOAD resource
‣ Calls to ID3D12Resource::WriteToSubResource
‣ A warning will be reported if the write is from a
D3D12_CPU_PAGE_PROPERTY_WRITE_BACK CPU page or to a
D3D12_HEAP_TYPE_READBACK resource
‣ Calls to ID3D12Resource::Map and ID3D12Resource::Unmap will be matched
into [Map, Unmap] ranges for non-persistent mappings. If a mapping range is
nested, only the most external range (reference count = 1) will be shown.
In addition, you can see the PIX command queue CPU-side performance markers, GPU-
side performance markers and the GPU Command List performance markers, each in
their row.
Detecting which CPU thread was blocked by a fence can be difficult in complex apps
that run tens of CPU threads. The timeline view displays the 3 operations involved:
‣ The CPU thread pushing a signal command and fence value into the command
queue. This is displayed on the DX12 Synchronization sub-row of the calling thread.
‣ The GPU executing that command, setting the fence value and signaling the fence.
This is displayed on the GPU Queue Synchronization sub-row.
‣ The CPU thread calling a Win32 wait API to block-wait until the fence is signaled.
This is displayed on the Thread's OS runtime libraries row.
Clicking one of these will highlight it and the corresponding other two calls.
Chapter 8.
WDDM QUEUES
The Windows Display Driver Model (WDDM) architecture uses queues to send work
packets from the CPU to the GPU. Each D3D device in each process is associated
with one or more contexts. Graphics, compute, and copy commands that the profiled
application uses are associated with a context, batched in a command buffer, and pushed
into the relevant queue associated with that context.
Nsight Systems can capture the state of these queues during the trace session.
Enabling the "Collect additional range of ETW events" option will also capture extended
DxgKrnl events such as context status, allocations, sync wait, signal events, etc.
A command buffer in a WDDM queue may have one of the following types:
‣ Render
‣ Deferred
‣ System
‣ MMIOFlip
‣ Wait
‣ Signal
‣ Device
‣ Software
It may also be marked as a Present buffer, indicating that the application has finished
rendering and requests to display the source surface.
See the Microsoft documentation for the WDDM architecture and the
DXGKETW_QUEUE_PACKET_TYPE enumeration.
To retain the .etl trace files captured, so that they can be viewed in other tools (e.g.
GPUView), change the "Save ETW log files in project folder" option under "Profile
Behavior" in Nsight Systems's global Options dialog. The .etl files will appear in the
same folder as the .nsys-rep file, accessible by right-clicking the report in the Project
Explorer and choosing "Show in Folder...". Data collected from each ETW provider will
appear in its own .etl file, and an additional .etl file named "Report XX-Merged-*.etl",
containing the events from all captured sources, will be created as well.
Chapter 9.
VULKAN API TRACE
9.1. Vulkan Overview
Vulkan is a low-overhead, cross-platform 3D graphics and compute API, targeting
a wide variety of devices from PCs to mobile phones and embedded platforms. The
Vulkan API is defined by the Khronos Group. Information about Vulkan and the
Khronos Group can be found at the Khronos Vulkan Site.
Nsight Systems can capture information about Vulkan usage by the profiled process.
This includes capturing the execution time of Vulkan API functions, corresponding GPU
workloads, debug util labels, and frame durations. Vulkan profiling is supported on
both Windows and x86 Linux operating systems.
The Command Buffer Creation row displays time periods when command buffers were
being created. This enables developers to improve their application’s multi-threaded
command buffer creation. Command buffer creation time period is measured between
the call to vkBeginCommandBuffer and the call to vkEndCommandBuffer.
The Swap chains row displays the available swap chains and the time periods where
vkQueuePresentKHR was executed on each swap chain.
A Queue row is displayed for each Vulkan queue created by the profiled application.
The API sub-row displays time periods where vkQueueSubmit was called. The GPU
Workload sub-row displays time periods where workloads were executed by the GPU.
In addition, you can see Vulkan debug util labels on both the CPU and the GPU.
Chapter 10.
STUTTER ANALYSIS
10.1. FPS Overview
The Frame Duration section displays frame durations on both the CPU and the GPU.
The frame duration row displays live FPS statistics for the current timeline viewport.
Values shown are:
1. Number of CPU frames shown of the total number captured
2. Average, minimal, and maximal CPU frame time of the currently displayed time
range
3. Average FPS value for the currently displayed frames
4. The 99th percentile value of the frame lengths (such that only 1% of the frames in the
range are longer than this value).
The values will update automatically when scrolling, zooming or filtering the timeline
view.
The stutter row highlights frames that are significantly longer than the other frames in
their immediate vicinity.
The stutter row uses an algorithm that compares the duration of each frame to the
median duration of the surrounding 19 frames. Duration difference under 4 milliseconds
is never considered a stutter, to avoid cluttering the display with frames whose absolute
stutter is small and not noticeable to the user.
For example, if the stutter threshold is set at 20%:
1. Median duration is 10 ms. Frame with 13 ms time will not be reported (relative
difference > 20%, absolute difference < 4 ms)
2. Median duration is 60 ms. Frame with 71 ms time will not be reported (relative
difference < 20%, absolute difference > 4 ms)
3. Median duration is 60 ms. Frame with 80 ms is a stutter (relative difference > 20%,
absolute difference > 4 ms, both conditions met)
OSC detection
The "19 frame window median" algorithm by itself may not work well with some cases
of "oscillation" (consecutive fast and slow frames), resulting in some false positives. The
median duration is not meaningful in cases of oscillation and can be misleading.
To address the issue and identify oscillating frames, the following method is applied:
1. For every frame, calculate the median duration, 1st and 3rd quartiles of the 19-frame
window.
2. Calculate the delta and ratio between 1st and 3rd quartiles.
3. If the 90th percentile of the (3rd – 1st quartile) delta array is > 4 ms AND the 90th percentile
of the 3rd/1st quartile ratio array is > 1.2 (120%), then mark the results with "OSC" text.
Right-clicking the Frame Duration row caption lets you choose the target frame rate (30,
60, 90 or custom frames per second).
By clicking the Customize FPS Display option, a customization dialog pops up. In the
dialog, you can now define the frame duration threshold to customize the view of the
potentially problematic frames. In addition, you can define the threshold for the stutter
analysis frames.
The CPU Frame Duration row displays the CPU frame duration measured between the
ends of consecutive frame boundary calls:
‣ The OpenGL frame boundaries are eglSwapBuffers/glXSwapBuffers/
SwapBuffers calls.
‣ The D3D11 and D3D12 frame boundaries are IDXGISwapChainX::Present calls.
‣ The Vulkan frame boundaries are vkQueuePresentKHR calls.
The GPU Frame Duration row displays the time measured between
‣ The start time of the first GPU workload execution of this frame.
‣ The start time of the first GPU workload execution of the next frame.
Reflex SDK
NVIDIA Reflex SDK is a series of NVAPI calls that allow applications to integrate the
Ultra Low Latency driver feature more directly into their game to further optimize
synchronization between simulation and rendering stages and lower the latency
between user input and final image rendering. For more details about Reflex SDK, see
Reflex SDK Site.
Nsight Systems will automatically capture NVAPI functions when either Direct3D 11,
Direct3D 12, or Vulkan API trace are enabled.
The Reflex SDK row displays timeline ranges for the following types of latency markers:
‣ RenderSubmit.
‣ Simulation.
‣ Present.
‣ Driver.
‣ OS Render Queue.
‣ GPU Render.
10.2. Frame Health
The Frame Health row displays actions that took a significantly longer time during
the current frame, compared to the median time of the same actions executed during
the surrounding 19 frames. This is a great tool for detecting the reason for frame time
stuttering. Such actions may be: shader compilation, present, memory mapping, and
more. Nsight Systems measures the accumulated time of such actions in each frame.
For example: calculating the accumulated time of shader compilations in each frame
and comparing it to the accumulated time of shader compilations in the surrounding 19
frames.
Example of a Vulkan frame health row:
Note that this is not the same as the CUDA kernel memory allocation graph, see CUDA
GPU Memory Graph for that functionality.
10.4. Vertical Synchronization
The VSYNC rows display when the monitor's vertical synchronizations occur.
Chapter 11.
OPENMP TRACE
Nsight Systems for Linux x86_64 and Power targets is capable of capturing information
about OpenMP events. This functionality is built on the OpenMP Tools Interface
(OMPT); full support is available only for runtime libraries supporting the tools interface
defined in OpenMP 5.0 or greater.
As an example, the LLVM OpenMP runtime library partially implements the tools interface.
If you use a PGI compiler <= 20.4 to build your OpenMP applications, add the -mp=libomp
switch to use the LLVM OpenMP runtime and enable OMPT based tracing. If you use
Clang, make sure the LLVM OpenMP runtime library you link to was compiled with the
tools interface enabled.
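As a sketch, an OpenMP application could be built against the LLVM OpenMP runtime and profiled as follows; the compiler driver, source file, and application names are placeholders.
$ pgcc -mp=libomp -o my_omp_app my_omp_app.c
$ nsys profile --trace=openmp,osrt ./my_omp_app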
Note: The raw OMPT events are used to generate ranges indicating the runtime of OpenMP
operations and constructs.
Example screenshot:
Chapter 12.
OS RUNTIME LIBRARIES TRACE
You can also use Skip if shorter than. This will skip calls shorter than the given
threshold. Enabling this option will improve performance as well as reduce noise on
the timeline. We strongly encourage you to skip OS runtime library calls shorter than 1
μs.
12.1. Locking a Resource
The functions listed below receive special treatment. If the tool detects that the
resource is already acquired by another thread and the call will block, the call is
always traced. Otherwise, it is never traced.
pthread_mutex_lock
pthread_rwlock_rdlock
pthread_rwlock_wrlock
pthread_spin_lock
sem_wait
Note that even if a call is determined as potentially blocking, there is a chance that it
may not actually block after a few cycles have elapsed. The call will still be traced in this
scenario.
12.2. Limitations
‣ Nsight Systems only traces syscall wrappers exposed by the C runtime. It is not able
to trace syscalls invoked through assembly code.
‣ Additional thread states, as well as backtrace collection on long calls, are only
enabled if sampling is turned on.
‣ It is not possible to configure the depth and duration threshold when collecting
backtraces. Currently, only OS runtime libraries calls longer than 80 μs will generate
a backtrace with a maximum of 24 frames. This limitation will be removed in a
future version of the product.
‣ It is required to compile your application and libraries with the -funwind-tables
compiler flag in order for Nsight Systems to unwind the backtraces correctly.
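For example, a minimal build command that keeps unwind tables might look like the following; the compiler and file names are placeholders.
$ gcc -funwind-tables -O2 -o my_app my_app.c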
POSIX Threads
pthread_barrier_wait
pthread_cancel
pthread_cond_broadcast
pthread_cond_signal
pthread_cond_timedwait
pthread_cond_wait
pthread_create
pthread_join
pthread_kill
pthread_mutex_lock
pthread_mutex_timedlock
pthread_mutex_trylock
pthread_rwlock_rdlock
pthread_rwlock_timedrdlock
pthread_rwlock_timedwrlock
pthread_rwlock_tryrdlock
pthread_rwlock_trywrlock
pthread_rwlock_wrlock
pthread_spin_lock
pthread_spin_trylock
pthread_timedjoin_np
pthread_tryjoin_np
pthread_yield
sem_timedwait
sem_trywait
sem_wait
I/O
aio_fsync
aio_fsync64
aio_suspend
aio_suspend64
fclose
fcloseall
fflush
fflush_unlocked
fgetc
fgetc_unlocked
fgets
fgets_unlocked
fgetwc
fgetwc_unlocked
fgetws
fgetws_unlocked
flockfile
fopen
fopen64
fputc
fputc_unlocked
fputs
fputs_unlocked
fputwc
fputwc_unlocked
fputws
fputws_unlocked
fread
fread_unlocked
freopen
freopen64
ftrylockfile
fwrite
fwrite_unlocked
getc
getc_unlocked
getdelim
getline
getw
getwc
getwc_unlocked
lockf
lockf64
mkfifo
mkfifoat
posix_fallocate
posix_fallocate64
putc
putc_unlocked
putwc
putwc_unlocked
Miscellaneous
forkpty
popen
posix_spawn
posix_spawnp
sigwait
sigwaitinfo
sleep
system
usleep
Chapter 13.
NVTX TRACE
The NVIDIA Tools Extension Library (NVTX) is a powerful mechanism that allows
users to manually instrument their application. Nsight Systems can then collect the
information and present it on the timeline.
Nsight Systems supports version 3.0 of the NVTX specification.
The following features are supported:
‣ Domains
nvtxDomainCreate(), nvtxDomainDestroy()
nvtxDomainRegisterString()
‣ Push-pop ranges (nested ranges that start and end in the same thread).
nvtxRangePush(), nvtxRangePushEx()
nvtxRangePop()
nvtxDomainRangePushEx()
nvtxDomainRangePop()
‣ Start-end ranges (ranges that are global to the process and are not restricted to a
single thread)
nvtxRangeStart(), nvtxRangeStartEx()
nvtxRangeEnd()
nvtxDomainRangeStartEx()
nvtxDomainRangeEnd()
‣ Marks
nvtxMark(), nvtxMarkEx()
nvtxDomainMarkEx()
‣ Thread names
nvtxNameOsThread()
‣ Categories
nvtxNameCategory()
nvtxDomainNameCategory()
To learn more about specific features of NVTX, please refer to the NVTX header file:
nvToolsExt.h or the NVTX documentation.
In addition, by enabling the "Insert NVTX Marker hotkey" option it is possible to add
NVTX markers to a running non-console application by pressing the F11 key. These will
appear in the report under the NVTX Domain named "HotKey markers".
Typically calls to NVTX functions can be left in the source code even if the application is
not being built for profiling purposes, since the overhead is very low when the profiler is
not attached.
NVTX is not intended to annotate very small pieces of code that are being called very
frequently. A good rule of thumb to use: if code being annotated usually takes less than
1 microsecond to execute, adding an NVTX range around this code should be done
carefully.
Note: Range annotations should be matched carefully. If many ranges are opened but not
closed, Nsight Systems has no meaningful way to visualize it. A rule of thumb is to not have
more than a couple dozen ranges open at any point in time. Nsight Systems does not support
reports with many unclosed ranges.
NVTX domains enable scoping of annotations. Unless specified differently, all events
and annotations are in the default domain. Additionally, categories can be used to group
events.
Nsight Systems gives the user the ability to include or exclude NVTX events from a
particular domain. This can be especially useful if you are profiling across multiple
libraries and are only interested in nvtx events from some of them.
This functionality is also available from the CLI. See the CLI documentation for --nvtx-
domain-include and --nvtx-domain-exclude for more details.
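For example, a possible CLI invocation that collects NVTX events from a single domain only; the domain and application names are placeholders.
$ nsys profile --trace=nvtx --nvtx-domain-include=my_domain ./my_app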
Categories that are set by the user will be recognized and displayed in the GUI.
Chapter 14.
CUDA TRACE
Near the bottom of the timeline row tree, the GPU node will appear and contain a
CUDA node. Within the CUDA node, each CUDA context used within the process will
be shown along with its corresponding CUDA streams. Streams will contain memory
operations and kernel launches on the GPU. Kernel launches are represented by blue,
while memory transfers are displayed in red.
The easiest way to capture CUDA information is to launch the process from Nsight
Systems, and it will set up the environment for you. To do so, simply set up a normal
launch and select the Collect CUDA trace checkbox.
For Nsight Systems Workstation Edition this looks like:
cudaDeviceReset(), and then let the application gracefully exit (as opposed to
crashing).
This option allows flushing CUDA trace data even before the device is finalized.
However, it might introduce additional overhead to a random CUDA Driver or
CUDA Runtime API call.
‣ Skip some API calls — avoids tracing insignificant CUDA Runtime
API calls (namely, cudaConfigureCall(), cudaSetupArgument(),
cudaHostGetDevicePointers()). Not tracing these functions allows Nsight
Systems to significantly reduce the profiling overhead, without losing any
interesting data. (See CUDA Trace Filters, below)
‣ Collect GPU Memory Usage - collects information used to generate a graph of
CUDA allocated memory across time. Note that this will increase overhead. See
section on CUDA GPU Memory Allocation Graph below.
‣ Collect Unified Memory CPU page faults - collects information on page faults that
occur when CPU code tries to access a memory page that resides on the device. See
section on Unified Memory CPU Page Faults in the Unified Memory Transfer
Trace documentation below.
‣ Collect Unified Memory GPU page faults - collects information on page faults that
occur when GPU code tries to access a memory page that resides on the CPU. See
section on Unified Memory GPU Page Faults in the Unified Memory Transfer
Trace documentation below.
‣ For Nsight Systems Workstation Edition, Collect cuDNN trace, Collect cuBLAS
trace, Collect OpenACC trace - selects which (if any) extra libraries that depend on
CUDA to trace.
OpenACC versions 2.0, 2.5, and 2.6 are supported when using PGI runtime version
15.7 or greater and not compiling statically. In order to differentiate constructs, a PGI
runtime of 16.1 or later is required. Note that Nsight Systems Workstation Edition
does not support the GCC implementation of OpenACC at this time.
‣ For Nsight Systems Embedded Platforms Edition if desired, the target application
can be manually set up to collect CUDA trace. To capture information about CUDA
execution, the following requirements should be satisfied:
‣ The profiled process should be started with the specified environment variable,
depending on the architecture of the process:
‣ For ARMv7 (32-bit) processes: CUDA_INJECTION32_PATH, which should
point to the injection library:
/opt/nvidia/nsight_systems/libToolsInjection32.so
‣ For ARMv8 (64-bit) processes: CUDA_INJECTION64_PATH, which should
point to the injection library:
/opt/nvidia/nsight_systems/libToolsInjection64.so
‣ If the application is started by Nsight Systems, all required environment
variables will be set automatically.
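For example, setting the variable when launching a 64-bit application manually might look like the following; my_cuda_app is a placeholder for your application.
$ CUDA_INJECTION64_PATH=/opt/nvidia/nsight_systems/libToolsInjection64.so ./my_cuda_app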
Please note that if your application crashes before all collected CUDA trace data has
been copied out, some or all data might be lost and not present in the report.
HtoD transfer indicates the CUDA kernel accessed managed memory that was residing
on the host, so the kernel execution paused and transferred the data to the device. Heavy
traffic here will incur performance penalties in CUDA kernels, so consider using manual
cudaMemcpy operations from pinned host memory instead.
PtoP transfer indicates the CUDA kernel accessed managed memory that was residing
on a different device, so the kernel execution paused and transferred the data to this
device. Heavy traffic here will incur performance penalties, so consider using manual
cudaMemcpyPeer operations to transfer from other devices' memory instead. The row
showing these events is for the destination device -- the source device is shown in the
tooltip for each transfer event.
DtoH transfer indicates the CPU accessed managed memory that was residing on a
CUDA device, so the CPU execution paused and transferred the data to system memory.
Heavy traffic here will incur performance penalties in CPU code, so consider using
manual cudaMemcpy operations from pinned host memory instead.
Some Unified Memory transfers are highlighted with red to indicate potential
performance issues:
Note: Collecting Unified Memory CPU page faults can cause overhead of up to 70% in
testing. Please use this functionality only when needed.
Note: Collecting Unified Memory GPU page faults can cause overhead of up to 70% in
testing. Please use this functionality only when needed.
Chapter 15.
OPENACC TRACE
Nsight Systems for Linux x86_64 and Power targets is capable of capturing information
about OpenACC execution in the profiled process.
OpenACC versions 2.0, 2.5, and 2.6 are supported when using PGI runtime version 15.7
or later. In order to differentiate constructs (see tooltip below), a PGI runtime of 16.0 or
later is required. Note that Nsight Systems does not support the GCC implementation of
OpenACC at this time.
Under the CPU rows in the timeline tree, each thread that uses OpenACC will show
OpenACC trace information. You can click on an OpenACC API call to see correlation
with the underlying CUDA API calls (highlighted in teal):
If the OpenACC API results in GPU work, that will also be highlighted:
Hovering over a particular OpenACC construct will bring up a tooltip with details about
that construct:
To capture OpenACC information from the Nsight Systems GUI, select the Collect
OpenACC trace checkbox under Collect CUDA trace configurations. Note that turning
on OpenACC tracing will also turn on CUDA tracing.
Please note that if your application crashes before all collected OpenACC trace data has
been copied out, some or all data might be lost and not present in the report.
Chapter 16.
OPENGL TRACE
OpenGL and OpenGL ES APIs can be traced to assist in the analysis of CPU and GPU
interactions.
A few usage examples are:
1. Visualize how long eglSwapBuffers (or similar) is taking.
2. API trace can easily show correlations between thread state and graphics driver's
behavior, uncovering where the CPU may be waiting on the GPU.
3. Spot bubbles of opportunity on the GPU, where more GPU workload could be
created.
4. Use KHR_debug extension to trace GL events on both the CPU and GPU.
The OpenGL trace feature in Nsight Systems consists of two different activities, which will be
shown in the CPU rows for those threads:
‣ CPU trace: interception of API calls that an application does to APIs (such as
OpenGL, OpenGL ES, EGL, GLX, WGL, etc.).
‣ GPU trace (or workload trace): trace of GPU workload (activity) triggered by use
of OpenGL or OpenGL ES. Since draw calls are executed back-to-back, the GPU
workload trace ranges include many OpenGL draw calls and operations in order to
optimize performance overhead, rather than tracing each individual operation.
To collect GPU trace, the glQueryCounter() function is used to measure how much
time batches of GPU workload take to complete.
Ranges defined by the KHR_debug calls are represented similarly to OpenGL API and
OpenGL GPU workload trace. GPU ranges in this case represent incremental draw cost.
They cannot fully account for GPUs that can execute multiple draw calls in parallel. In
this case, Nsight Systems will not show overlapping GPU ranges.
Chapter 17.
CUSTOM ETW TRACE
Use the custom ETW trace feature to enable and collect any manifest-based ETW log.
The collected events are displayed on the timeline on dedicated rows for each event
type.
Custom ETW is available on Windows target machines.
To retain the .etl trace files captured, so that they can be viewed in other tools (e.g.
GPUView), change the "Save ETW log files in project folder" option under "Profile
Behavior" in Nsight Systems's global Options dialog. The .etl files will appear in the
same folder as the .nsys-rep file, accessible by right-clicking the report in the Project
Explorer and choosing "Show in Folder...". Data collected from each ETW provider will
appear in its own .etl file, and an additional .etl file named "Report XX-Merged-*.etl",
containing the events from all captured sources, will be created as well.
Chapter 18.
GPU METRIC SAMPLING
Overview
GPU performance metrics sampling is intended to identify performance limiters in
applications using GPU for computations and graphics. It uses periodic sampling to
gather performance metrics and detailed timing statistics associated with different GPU
hardware units taking advantage of specialized hardware to capture this data in a single
pass with minimal overhead.
Note: GPU metrics sampling will give you precise device level information, but it does
not know which process or context is involved. GPU context switch trace provides less
precise information, but will give you process and context information.
These metrics provide an overview of GPU efficiency over time within compute,
graphics, and input/output (IO) activities such as:
By default, the first metric set that supports the selected GPU is used, but you can
manually select another metric set from the list. To see the available metric sets, use:
$ nsys profile --gpu-metrics-set=help
Possible --gpu-metrics-set values are:
[0] [tu10x] General Metrics for NVIDIA TU10x (any frequency)
[1] [tu11x] General Metrics for NVIDIA TU11x (any frequency)
[2] [ga100] General Metrics for NVIDIA GA100 (any frequency)
[3] [ga10x] General Metrics for NVIDIA GA10x (any frequency)
[4] [tu10x-gfxt] Graphics Throughput Metrics for NVIDIA TU10x (frequency
>= 10kHz)
[5] [ga10x-gfxt] Graphics Throughput Metrics for NVIDIA GA10x (frequency
>= 10kHz)
[6] [ga10x-gfxact] Graphics Async Compute Triage Metrics for NVIDIA GA10x
(frequency >= 10kHz)
By default, the metrics sampling frequency is set to 10 kHz, but you can manually set it
from 10 Hz to 200 kHz using
--gpu-metrics-frequency=<value>
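For example, a possible CLI invocation that samples all GPUs at 20 kHz; the application name is a placeholder.
$ nsys profile --gpu-metrics-device=all --gpu-metrics-frequency=20000 ./my_app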
Select the GPUs dropdown to pick which GPUs you wish to sample.
Select the Metric set: dropdown to choose which available metric set you would like to
sample.
Note that metric sets for GPUs that are not being sampled will be greyed out.
Sampling frequency
Sampling frequency can be selected from the range of 10 Hz - 200 kHz. The default value
is 10 kHz.
The maximum sampling frequency without buffer overflow events depends on GPU
(SM count), GPU load intensity, and overall system load. The bigger the chip and the
higher the load, the lower the maximum sampling frequency without buffer overflow
errors. If you need a higher frequency, you can increase it until you get a "Buffer overflow"
message in the Diagnostics Summary report page. If you observe buffer overflow ranges
on the timeline, lower the sampling frequency.
Each metric set has a recommended sampling frequency range in its description. These
ranges are approximate. If you observe Inconsistent Data ranges on the timeline, please
try a frequency closer to the recommended one.
Available Metrics
‣ GPC Clock Frequency - gpc__cycles_elapsed.avg.per_second
The average GPC clock frequency in hertz. In public documentation the GPC clock
may be called the "Application" clock, "Graphic" clock, "Base" clock, or "Boost" clock.
Note: The collection mechanism for GPC can result in a small fluctuation between
samples.
‣ SYS Clock Frequency - sys__cycles_elapsed.avg.per_second
The average SYS clock frequency in hertz. The GPU front end (command processor),
copy engines, and the performance monitor run at the SYS clock. On Turing and
NVIDIA GA100 GPUs the GPU metrics sampling frequency is based upon a period
of SYS clocks (not time) so samples per second will vary with SYS clock. On NVIDIA
GA10x GPUs the GPU metrics sampling rate is based upon a fixed frequency clock.
The maximum sampling rate scales linearly with the SYS clock.
‣ GR Active - gr__cycles_active.sum.pct_of_peak_sustained_elapsed
The percentage of cycles the graphics/compute engine is active. The graphics/
compute engine is active if there is any work in the graphics pipe or if the compute
pipe is processing work.
GA100 MIG - MIG is not yet supported. This counter will report the activity of the
primary GR engine.
‣ Sync Compute In Flight -
gr__dispatch_cycles_active_queue_sync.avg.pct_of_peak_sustained_elapsed
The percentage of cycles with synchronous compute in flight.
CUDA: CUDA will only report a synchronous queue in the case of MPS configured
with 64 sub-contexts. Synchronous refers to work submitted in VEID=0.
Graphics: This will be true if any compute work submitted from the direct queue is
in flight.
‣ Async Compute in Flight -
gr__dispatch_cycles_active_queue_async.avg.pct_of_peak_sustained_elapsed
The percentage of cycles with asynchronous compute in flight.
CUDA: CUDA will report all compute work as asynchronous. The one
exception is if MPS is configured and all 64 sub-contexts are in use. 1 sub-context
(VEID=0) will report as synchronous.
Graphics: This will be true if any compute work submitted from a compute queue is
in flight.
‣ Draw Started - fe__draw_count.avg.pct_of_peak_sustained_elapsed
The ratio of draw calls issued to the graphics pipe to the maximum sustained rate of
the graphics pipe.
Note: The percentage will always be very low as the front end can issue draw calls
significantly faster than the pipe can execute the draw call. The rendering of this row
will be changed to help indicate when draw calls are being issued.
‣ Dispatch Started -
gr__dispatch_count.avg.pct_of_peak_sustained_elapsed
The ratio of compute grid launches (dispatches) to the compute pipe to the
maximum sustained rate of the compute pipe.
Note: The percentage will always be very low as the front end can issue grid
launches significantly faster than the pipe can execute them. The rendering
of this row will be changed to help indicate when grid launches are being issued.
‣ Vertex/Tess/Geometry Warps in Flight -
tpc__warps_active_shader_vtg_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of active vertex, geometry, tessellation, and meshlet shader warps resident
on the SMs to the maximum number of warps per SM as a percentage.
‣ Pixel Warps in Flight -
tpc__warps_active_shader_ps_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of active pixel/fragment shader warps resident on the SMs to the
maximum number of warps per SM as a percentage.
‣ Compute Warps in Flight -
tpc__warps_active_shader_cs_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of active compute shader warps resident on the SMs to the maximum
number of warps per SM as a percentage.
‣ Active SM Unused Warp Slots -
tpc__warps_inactive_sm_active_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of inactive warp slots on the SMs to the maximum number of warps per
SM as a percentage. This is an indication of how many more warps may fit on the
SMs if occupancy is not limited by a resource such as max warps of a shader type,
shared memory, registers per thread, or thread blocks per SM.
‣ Idle SM Unused Warp Slots -
tpc__warps_inactive_sm_idle_realtime.avg.pct_of_peak_sustained_elapsed
The ratio of inactive warp slots due to idle SMs to the maximum number of
warps per SM as a percentage.
This is an indicator that the current workload on the SM is not sufficient to put work
on all SMs. This can be due to:
based upon the PCIe generation and number of lanes. This value includes protocol
overhead.
‣ PCIe Write Throughput -
pcie__write_bytes.avg.pct_of_peak_sustained_elapsed
The ratio of bytes transmitted on the PCIe interface to the maximum number of bytes
transmittable in the sample period as a percentage. The theoretical value is calculated
based upon the PCIe generation and number of lanes. This value includes protocol
overhead.
Example output (rawTimestamp|value pairs) from querying the exported GPU metrics data:
309277039|80
309301295|99
309325583|99
309349776|99
309373872|60
309397872|19
309421840|100
309446000|100
309470096|100
309494161|99
Limitations
‣ If metrics sets with NVLink are used but the links are not active, they may appear as
fully utilized.
‣ Only one tool that subscribes to these counters can be used at a time, therefore,
Nsight Systems GPU metric sampling cannot be used at the same time as the
following tools:
‣ Nsight Graphics
‣ Nsight Compute
‣ DCGM (Data Center GPU Manager)
Use the following command:
‣ dcgmi profile --pause
‣ dcgmi profile --resume
Or API:
‣ dcgmProfPause
‣ dcgmProfResume
‣ Non-NVIDIA products which use:
‣ CUPTI sampling used directly in the application. CUPTI trace is okay
(although it will block Nsight Systems CUDA trace)
‣ DCGM library
‣ Nsight Systems limits the amount of memory that can be used to store GPU metrics
sampling data. Analysis with higher sampling rates or on GPUs with more SMs has
a risk of filling these buffers. This will lead to gaps with long samples on the timeline.
If you select that area on the timeline you will see that the counters will pause and
remain at a steady state for a while. Future releases will reduce the frequency of this
happening and better present these periods.
Chapter 19.
NVIDIA VIDEO CODEC SDK TRACE
Nsight Systems for x86 Linux and Windows targets can trace calls from the NV Video
Codec SDK. This software trace can be launched from the GUI or by using the --trace
nvvideo switch from the CLI.
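For example, a possible CLI invocation; the application name is a placeholder.
$ nsys profile --trace=nvvideo,cuda ./my_video_app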
On the timeline, calls on the CPU to the NV Encoder API and NV Decoder API will be
shown.
Chapter 20.
NETWORK COMMUNICATION PROFILING
If you require more control over the list of traced APIs or if you are using a different
MPI implementation, see github nvtx pmpi wrappers. You can use this documentation
to generate a shared object to wrap a list of synchronous MPI APIs with NVTX using
the MPI profiling interface (PMPI). If you set your LD_PRELOAD environment variable
to the path of that object, Nsight Systems will capture and report the MPI API trace
information when NVTX tracing is enabled.
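A possible invocation might look like the following; the wrapper library path, rank count, and application name are placeholders, and the wrapper itself is generated as described in the project linked above.
$ LD_PRELOAD=/path/to/libnvtx_pmpi.so nsys profile --trace=nvtx,mpi \
      mpirun -np 4 ./my_mpi_app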
When only part of an MPI communication is loaded into Nsight Systems, the following
information is available.
‣ Right-hand screenshot shows a reused communicator handle (last number
increased).
‣ Encoding: MPI_COMM[*team size*]*global-group-root-rank*.*group-ID*
shmem_my_pe
shmem_n_pes
shmem_global_exit
shmem_pe_accessible
shmem_addr_accessible
shmem_ctx_{create,destroy,get_team}
shmem_global_exit
shmem_info_get_{version,name}
shmem_{my_pe,n_pes,pe_accessible,ptr}
shmem_query_thread
shmem_team_{create_ctx,destroy}
shmem_team_get_config
shmem_team_{my_pe,n_pes,translate_pe}
shmem_team_split_{2d,strided}
shmem_test*
ucp_am_send_nb[x]
ucp_am_recv_data_nbx
ucp_am_data_release
ucp_atomic_{add{32,64},cswap{32,64},fadd{32,64},swap{32,64}}
ucp_atomic_{post,fetch_nb,op_nbx}
ucp_cleanup
ucp_config_{modify,read,release}
ucp_disconnect_nb
ucp_dt_{create_generic,destroy}
ucp_ep_{create,destroy,modify_nb,close_nbx}
ucp_ep_flush[{_nb,_nbx}]
ucp_listener_{create,destroy,query,reject}
ucp_mem_{advise,map,unmap,query}
ucp_{put,get}[_nbi]
ucp_{put,get}_nb[x]
ucp_request_{alloc,cancel,check_status,is_completed}
ucp_rkey_{buffer_release,destroy,pack,ptr}
ucp_stream_data_release
ucp_stream_recv_{data_nb,request_test}
ucp_stream_{send,recv}_nb[x]
ucp_stream_worker_poll
ucp_tag_msg_recv_nb[x]
ucp_tag_probe_nb
ucp_tag_{send,recv}_nbr
ucp_tag_{send,recv}_nb[x]
ucp_tag_recv_request_test
ucp_tag_send_sync_nb[x]
ucp_worker_{create,destroy,get_address,get_efd,arm,fence,wait,signal,wait_mem}
ucp_worker_flush[{_nb,_nbx}]
ucp_worker_set_am_{handler,recv_handler}
ucp_config_print
ucp_conn_request_query
ucp_context_{query,print_info}
ucp_get_version[_string]
ucp_ep_{close_nb,print_info,rkey_unpack}
ucp_mem_print_info
ucp_request_{test,free,release}
ucp_worker_{progress,query,release_address,print_info}
Additional API functions from other UCX layers may be added in a future version of the
product.
Available Metrics
‣ Bytes sent - Number of bytes sent through all NIC ports.
‣ Packets sent - Number of packets sent through all NIC ports.
‣ Bytes received - Number of bytes received by all NIC ports.
‣ Packets received - Number of packets received by all NIC ports.
‣ CNPs sent - Number of congestion notification packets sent by the NIC.
‣ CNPs received - Number of congestion notification packets received and handled
by the NIC.
Collecting NIC Metrics Using the Command Line
To collect NIC performance metrics using the Nsight Systems CLI, add the --nic-metrics
command line switch:
nsys profile --nic-metrics my_app
System Requirements
NIC metrics collection is supported on:
‣ NVIDIA ConnectX 3 boards or newer
‣ Linux x86_64 machines with a minimum Linux kernel of 4.12 and minimum
MLNX_OFED of 4.1.
Limitations
‣ Nsight Systems 2021.5.1 only supports Infiniband metrics.
‣ Launch the following command, which will install all the required libraries in
system directories:
Chapter 21.
READING YOUR REPORT IN GUI
21.4. Report Tab
While generating a new report or loading an existing one, a new tab will be created. The
most important parts of the report tab are:
‣ View selector — Allows switching between Analysis Summary, Timeline View,
Diagnostics Summary, and Symbol Resolution Logs views.
21.6. Timeline View
The timeline view consists of two main controls: the timeline at the top, and a bottom
pane that contains the events view and the function table. In some cases, when sampling
of a process has not been enabled, the function table might be empty and hidden.
The bottom view selector sets the view that is displayed in the bottom pane.
21.6.1. Timeline
Timeline is a versatile control that contains a tree-like hierarchy on the left, and
corresponding charts on the right.
Contents of the hierarchy depend on the project settings used to collect the report. For
example, if a certain feature has not been enabled, corresponding rows will not be shown
on the timeline.
To generate a timeline screenshot without opening the full GUI, use the command
nsys-ui.exe --screenshot filename.nsys-rep
To display trace events in the Events View right-click a timeline row and select the
“Show in Events View” command. The events of the selected row and all of its sub-rows
will be displayed in the Events View.
If a timeline row has been selected for display in the Events View then double-clicking
a timeline item on that row will automatically scroll the content of the Events View to
make the corresponding Events View item visible and select it.
Row Height
Several of the rows in the timeline use height as a way to model the percent utilization
of resources. This gives the user insight into what is going on even when the timeline is
zoomed all the way out.
In this picture you see that for kernel occupation there is a colored bar of variable height.
Nsight Systems calculates the average occupancy for the period of time represented by
particular pixel width of screen. It then uses that average to set the top of the colored
section. So, for instance, if 25% of that timeslice the kernel is active, the bar goes 25% of
the distance to the top of the row.
In order to make the difference clear, if the percentage of the row height is non-zero, but
would be represented by less than one vertical pixel, Nsight Systems displays it as one
pixel high. The gray height represents the maximum usage in that time range.
This row height coding is used in the CPU utilization, thread and process occupancy,
kernel occupancy, and memory transfer activity rows.
21.6.2. Events View
The Events View provides a tabular display of the trace events. The view contents can be
searched and sorted.
Double-clicking an item in the Events View automatically focuses the Timeline View on
the corresponding timeline item.
API calls, GPU executions, and debug markers that occurred within the boundaries of a
debug marker are displayed nested to that debug marker. Multiple levels of nesting are
supported.
Events view recognizes these types of debug markers:
‣ NVTX
‣ Vulkan VK_EXT_debug_marker markers, VK_EXT_debug_utils labels
‣ PIX events and markers
‣ OpenGL KHR_debug markers
You can copy and paste from the events view by highlighting rows, using Shift or Ctrl
to enable multi-select. Right clicking on the selection will give you a copy option.
Each of the views helps understand particular performance issues of the application
being profiled. For example:
‣ When trying to find specific bottleneck functions that can be optimized, the Bottom-
Up view should be used. Typically, the top few functions should be examined.
Expand them to understand in which contexts they are being used.
‣ To navigate the call tree of the application and while generally searching for
algorithms and parts of the code that consume unexpectedly large amount of CPU
time, the Top-Down view should be used.
‣ To quickly assess which parts of the application, or high level parts of an algorithm,
consume significant amount of CPU time, use the Flat view.
The Top-Down and Bottom-Up views have Self and Total columns, while the Flat view
has a Flat column. It is important to understand the meaning of each of the columns:
‣ Top-Down view
‣ Self column denotes the relative amount of time spent executing instructions of
this particular function.
‣ Total column shows how much time has been spent executing this function,
including all other functions called from this one. Total values of sibling rows
sum up to the Total value of the parent row, or 100% for the top-level rows.
‣ Bottom-Up view
‣ Self column for top-level rows, as in the Top-Down view, shows how much time
has been spent directly in this function. Self times of all top-level rows add up to
100%.
‣ Self column for children rows breaks down the value of the parent row based on
the various call chains leading to that function. Self times of sibling rows add up
to the value of the parent row.
‣ Flat view
‣ Flat column shows how much time this function has been anywhere on the
call stack. Values in this column do not add up or have other significant
relationships.
Note: If low-impact functions have been filtered out, values may not add up correctly to
100%, or to the value of the parent row. This filtering can be disabled.
Contents of the symbols table is tightly related to the timeline. Users can apply and
modify filters on the timeline, and they will affect which information is displayed in
the symbols table:
‣ Per-thread filtering — Each thread that has sampling information associated with it
has a checkbox next to it on the timeline. Only threads with selected checkboxes are
represented in the symbols table.
‣ Time filtering — A time filter can be set up on the timeline by pressing the left
mouse button, dragging over a region of interest on the timeline, and then choosing
Filter by selection in the dropdown menu. In this case, only sampling information
collected during the selected time range will be used to build the symbols table.
Note: If too little sampling data is being used to build the symbols table (for example,
when the sampling rate is configured to be low, and a short period of time is used for
time-based filtering), the numbers in the symbols table might not be representative or
accurate in some cases.
21.6.4. Filter Dialog
‣ Collapse unresolved lines is useful if some of the binary code does not have
symbols. In this case, subtrees that consist of only unresolved symbols get collapsed
in the Top-Down view, since they provide very little useful information.
‣ Hide functions with CPU usage below X% is useful for large applications, where
the sampling profiler hits lots of functions only a few times each. To filter out the
"long tail," which is typically not important for CPU performance bottleneck analysis,
this checkbox should be selected.
‣ Warnings
‣ Errors
To draw attention to important diagnostic messages, a summary line is displayed on
the timeline view in the top right corner:
Information from this view can be selected and copied using the mouse cursor.
Chapter 22.
ADDING REPORT TO THE TIMELINE
Starting with 2021.3, Nsight Systems can load multiple report files into a single timeline.
This is a BETA feature and will be improved in future releases. Please let us know
about your experience on the forums or through Help > Send Feedback... in the main
menu.
To load multiple report files into a single timeline, first open a report as usual, using
File > Open... from the main menu or by double-clicking a report in the Project
Explorer window. Then additional report files can be loaded into the same timeline
using one of the following methods:
‣ File > Add Report (beta)... in the main menu; then select another report file that you
want to open.
‣ Right-click on the report in the Project Explorer window, and click Add Report
(beta).
22.1. Time Synchronization
When multiple reports are loaded into a single timeline, timestamps between them need
to be adjusted, such that events that happened at the same time appear to be aligned.
Nsight Systems can automatically adjust timestamps based on UTC time recorded
around the collection start time. This method is used by default when other more
precise methods are not available. This time can be seen as UTC time at t=0 in the
Analysis Summary page of the report file. Refer to your OS documentation to learn how
to sync the software clock using the Network Time Protocol (NTP). NTP-based time
synchronization is not very precise, with typical errors on the scale of one to tens of
milliseconds.
Reports collected on the same physical machine can use synchronization based on
Timestamp Counter (TSC) values. These are platform-specific counters, typically
accessed in user space applications using the RDTSC instruction on x86_64 architecture,
or by reading the CNTVCT register on Arm64. Their values converted to nanoseconds
can be seen as TSC value at t=0 in the Analysis Summary page of the report file.
Reports synchronized using TSC values can be aligned with nanoseconds-level
precision.
TSC-based time synchronization is activated automatically when Nsight Systems detects
that the same TSC value corresponds to very close UTC times. UTC time difference
must be below 1 second in this case. This method is expected to only work for reports
collected on the same physical machine.
To find out which synchronization method was used, navigate to the Analysis Summary
tab of an added report and check the Report alignment source property of a target.
Note that the first report won't have this parameter.
When loading multiple reports into a single timeline, it is always advisable to first
check that time synchronization looks correct, by zooming into synchronization or
communication events that are expected to be aligned.
22.2. Timeline Hierarchy
When reports are added to the same timeline, Nsight Systems will automatically
line them up by timestamps as described above. If you want Nsight Systems to also
recognize matching process or hardware information, you will need to set the
environment variables NSYS_SYSTEM_ID and NSYS_HW_ID, as shown below, at the time of
report collection (for example, when using the "nsys profile ..." command).
When loading a given pair of report files into the same timeline, they will be merged in
one of the following configurations:
‣ Different hardware (default) — is used when reports are coming from different
physical machines, and no hardware resources are shared in these reports. This
mode is used by default, and can be additionally signalled by specifying different
NSYS_HW_ID values.
‣ Different systems, same hardware — is used when reports are collected on different
virtual machines (VMs) or containers on the same physical machine. To activate this
mode, specify the same value of NSYS_HW_ID when collecting the reports.
‣ Same system — is used when reports are collected within the same operating system
(or container) environment. In this mode, a given process identifier (PID), such as
100, refers to the same process in both reports. To activate this mode, specify the
same value of NSYS_SYSTEM_ID when collecting the reports.
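For example, to have reports from two VMs on the same physical machine merge in the
different systems, same hardware configuration, run a command of the following form in
each VM (a representative sketch: the value hostA is an arbitrary placeholder, and each
VM should use a distinct output name):
NSYS_HW_ID=hostA nsys profile -o report-vm1 <nsys-options> ./myApp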
The following diagrams demonstrate typical cases:
22.3. Example: MPI
A typical scenario is when a computing job is run using one of the MPI
implementations. Each instance of the app can be profiled separately, resulting in
multiple report files. For example:
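mpirun <MPI-options> nsys profile -o report-%p <nsys-options> ./myApp
(This command is a representative sketch; the MPI options and the application are
placeholders.)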
When each MPI rank runs on a different node, the command above works fine, since the
default pairing mode (different hardware) will be used.
When all MPI ranks run on the localhost only, use this command (the value "A" was
chosen arbitrarily; it can be any non-empty string):
NSYS_SYSTEM_ID=A mpirun <MPI-options> nsys profile -o report-%p
<nsys-options> ./myApp
For convenience, the MPI rank can be encoded into the report filename. The specifics
depend on the MPI implementation. For Open MPI, use the following command to create
report files based on the global rank value:
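mpirun <MPI-options> nsys profile -o report-%q{OMPI_COMM_WORLD_RANK} <nsys-options> ./myApp
(One possible form: %q{OMPI_COMM_WORLD_RANK} expands Open MPI's rank environment
variable in the output file name; adjust the variable name for other MPI
implementations.)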
22.4. Limitations
‣ Only report files collected with Nsight Systems version 2021.3 and newer are fully
supported.
‣ Sequential reports collected in a single CLI profiling session cannot be loaded into a
single timeline yet.
Chapter 23.
USING NSIGHT SYSTEMS EXPERT SYSTEM
If a .nsys-rep file is given as the input file and there is no .sqlite file with the same name
in the same directory, it will be generated.
Note: The Expert System view in the GUI will give you the equivalent command line.
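For example, the rules can also be run from the command line with a command of the
following form (assuming the nsys analyze command; available options may vary between
versions):
$ nsys analyze report1.nsys-rep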
A context menu is available to correlate the table entry with the timeline. The options
are the same as in the Events View:
‣ Highlight selected on timeline (double-click)
‣ Show current on timeline (ctrl+double-click)
Highlighting is not supported for rules that do not return an event but rather an
arbitrary time range (e.g., GPU utilization rules).
The CLI and GUI share the same rule scripts and messages. There might be some
formatting differences between the output table in GUI and CLI.
Synchronous Memcpy
This rule identifies synchronous memory transfers that block the host.
Suggestion: Use cudaMemcpy*Async APIs instead.
Synchronous Memset
This rule identifies synchronous memset operations that block the host.
Suggestion: Use cudaMemset*Async APIs instead.
Synchronization APIs
This rule identifies synchronization APIs that block the host until all issued CUDA calls
are complete.
Suggestions: Avoid excessive use of synchronization. Use asynchronous CUDA event
calls, such as cudaStreamWaitEvent and cudaEventSynchronize, to prevent host
synchronization.
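A minimal sketch of the suggested pattern (illustrative only, not taken from the rules
themselves; the kernel, buffer names, and sizes are placeholders, and h_in is assumed to
be pinned host memory allocated with cudaMallocHost):

#include <cuda_runtime.h>

__global__ void myKernel(const float *in, float *out) { /* ... device work ... */ }

void processAsync(float *d_in, float *d_out, const float *h_in, size_t bytes)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Stream-ordered alternatives to the blocking cudaMemset/cudaMemcpy calls:
    cudaMemsetAsync(d_out, 0, bytes, stream);
    cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, stream);
    myKernel<<<256, 256, 0, stream>>>(d_in, d_out);

    // Record an event and synchronize only when the results are needed,
    // rather than blocking the host after every operation.
    cudaEvent_t done;
    cudaEventCreate(&done);
    cudaEventRecord(done, stream);
    /* ... unrelated host work can overlap with the GPU here ... */
    cudaEventSynchronize(done);

    cudaEventDestroy(done);
    cudaStreamDestroy(stream);
}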
‣ For each process, each GPU is examined, and gaps are found within the time range
that starts with the beginning of the first GPU operation on that device and ends
with the end of the last GPU operation on that device.
‣ GPU gaps that cannot be addressed by the user are excluded. This includes:
‣ Profiling overhead in the middle of a GPU gap.
‣ The initial gap in the report that is seen before the first GPU operation.
‣ The final gap that is seen after the last GPU operation.
GPU Low Utilization
This rule identifies time regions with low utilization.
Suggestions: Use CPU sampling data, OS Runtime blocked state backtraces, and/or OS
Runtime APIs related to thread synchronization to understand if a sluggish or blocked
CPU is causing the gaps. Add NVTX annotations to CPU code to understand the reason
behind the gaps.
Notes:
‣ For each process, each GPU is examined, and gaps are found within the time range
that starts with the beginning of the first GPU operation on that device and ends
with the end of the last GPU operation on that device. This time range is then
divided into equal chunks, and the GPU utilization is calculated for each chunk. The
utilization includes all GPU operations as well as profiling overheads that the user
cannot address.
‣ The utilization refers to the "time" utilization and not the "resource" utilization.
This rule attempts to find time gaps when the GPU is or isn't being used, but does
not take into account how many GPU resources are being used. Therefore, a single
running memcpy is considered the same amount of "utilization" as a huge kernel
that takes over all the cores. If multiple operations run concurrently in the same
chunk, their utilization will be added up and may exceed 100%.
‣ Chunks with an in-use percentage less than the threshold value are displayed.
If consecutive chunks have a low in-use percentage, the individual chunks are
coalesced into a single display record, keeping the weighted average of percentages.
This is why returned chunks may have different durations.
Chapter 24.
IMPORT NVTXT
Mode descriptions:
‣ lerp - Insert with linear interpolation
--mode lerp --ns_a arg --ns_b arg [--nvtxt_a arg --nvtxt_b arg]
‣ lin - Insert with linear equation
--mode lin --ns_a arg --freq arg [--nvtxt_a arg]
Mode parameters:
Commands
Info
To find out a report's start and end time, use the info command.
Usage:
ImportNvtxt --cmd info -i [--input] arg
Example:
ImportNvtxt info Report.nsys-rep
Analysis start (ns) 83501026500000
Analysis end (ns) 83506375000000
Create
You can create a report file from an existing NVTXT file with the create command.
Usage:
ImportNvtxt --cmd create -n [--nvtxt] arg -o [--output] arg [-m [--mode]
mode_name mode_args]
with:
‣ ns_a — a nanoseconds value.
‣ ns_b — a nanoseconds value (greater than ns_a).
‣ nvtxt_a — an nvtxt file's time unit value corresponding to ns_a nanoseconds.
‣ nvtxt_b — an nvtxt file's time unit value corresponding to ns_b nanoseconds.
If nvtxt_a and nvtxt_b are not specified, they are respectively set to the nvtxt file's
minimum and maximum time values.
Usage for lin mode is:
--mode lin --ns_a arg --freq arg [--nvtxt_a arg]
with:
‣ ns_a — a nanoseconds value.
‣ freq — the nvtxt file's timer frequency.
‣ nvtxt_a — an nvtxt file's time unit value corresponding to ns_a nanoseconds.
If nvtxt_a is not specified, it is set to the nvtxt file's minimum time value.
The output will be a newly generated report file which can be opened and viewed by
Nsight Systems.
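Example (file names are illustrative):
ImportNvtxt --cmd create -n Sample.nvtxt -o Report.nsys-rep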
Merge
To merge an NVTXT file with an existing report file, use the merge command.
Usage:
ImportNvtxt --cmd merge -i [--input] arg -n [--nvtxt] arg -o [--output] arg [-m
[--mode] mode_name mode_args]
with:
‣ ns_a — a nanoseconds value.
‣ ns_b — a nanoseconds value (greater than ns_a).
‣ nvtxt_a — an nvtxt file's time unit value corresponding to ns_a nanoseconds.
‣ nvtxt_b — an nvtxt file's time unit value corresponding to ns_b nanoseconds.
If nvtxt_a and nvtxt_b are not specified, they are respectively set to the nvtxt file's
minimum and maximum time values.
Usage for lin mode is:
--mode lin --ns_a arg --freq arg [--nvtxt_a arg]
with:
‣ ns_a — a nanoseconds value.
‣ freq — the nvtxt file's timer frequency.
‣ nvtxt_a — an nvtxt file's time unit value corresponding to ns_a nanoseconds.
If nvtxt_a is not specified, it is set to the nvtxt file's minimum time value.
Time values in <filename.nvtxt> are assumed to be nanoseconds if no mode is
specified.
Example
ImportNvtxt --cmd merge -i Report.nsys-rep -n Sample.nvtxt -o NewReport.nsys-rep
Chapter 25.
VISUAL STUDIO INTEGRATION
NVIDIA Nsight Integration is a Visual Studio extension that allows you to access the
power of Nsight Systems from within Visual Studio.
When Nsight Systems is installed along with NVIDIA Nsight Integration, Nsight
Systems activities will appear under the NVIDIA Nsight menu in the Visual Studio
menu bar. These activities launch Nsight Systems with the current project settings and
executable.
Selecting the "Trace" command will launch Nsight Systems, create a new Nsight Systems
project and apply settings from the current Visual Studio project:
‣ Target application path
‣ Command line parameters
‣ Working folder
If the "Trace" command has already been used with this Visual Studio project then
Nsight Systems will load the respective Nsight Systems project and any previously
captured trace sessions will be available for review using the Nsight Systems project
explorer tree.
For more information about using Nsight Systems from within Visual Studio, please
visit
‣ NVIDIA Nsight Integration Overview
‣ NVIDIA Nsight Integration User Guide
Chapter 26.
TROUBLESHOOTING
26.1. General Troubleshooting
Profiling
If the profiler behaves unexpectedly during the profiling session, or the profiling session
fails to start, try the following steps:
‣ Close the host application.
‣ Restart the target device.
‣ Start the host application and connect to the target device.
Nsight Systems uses a settings file (NVIDIA Nsight Systems.ini) on the host to
store information about loaded projects, report files, window layout configuration,
etc. The location of the settings file is shown in the Help → About dialog. Deleting the
settings file will restore Nsight Systems to a fresh state, but all projects and reports
will disappear from the Project Explorer.
Environment Variables
By default, Nsight Systems writes temporary files to the /tmp directory. If you are using
a system that does not allow writing to /tmp, or where the /tmp directory has limited
storage, you can use the TMPDIR environment variable to set a different location. An
example:
TMPDIR=/testdata ./bin/nsys profile -t cuda matrixMul
Controlling environment variables for a Windows target trace is not supported directly,
but there is a quick workaround:
‣ Create a batch file that sets the environment variables and launches your application.
‣ Set Nsight Systems to launch the batch file as its target, i.e. set the project settings
target path to the path of the batch file.
‣ Start the trace. Nsight Systems will launch the batch file in a new cmd instance and
trace any child process it launches. In fact, it will trace the whole process tree whose
root is the cmd instance running your batch file.
WebGL Testing
Nsight Systems cannot profile using the default Chrome launch command. To profile
WebGL, use the following command structure:
"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe"
--inprocess-gpu --no-sandbox --disable-gpu-watchdog --use-angle=gl
https://webglsamples.org/aquarium/aquarium.html
26.2. CLI Troubleshooting
If you have collected a report file using the CLI and the report will not open in the GUI,
check to see that your GUI version is the same or greater than the CLI version you used.
If it is not, download a new version of the Nsight Systems GUI and you will be able to
load and visualize your report.
This situation occurs most frequently when you update Nsight Systems using a CLI only
package, such as the package available from the NVIDIA HPC SDK.
LD_PRELOAD
The first mechanism uses the LD_PRELOAD environment variable. It only works with
dynamically linked binaries, since static binaries do not invoke the runtime linker and
therefore are not affected by the LD_PRELOAD environment variable.
‣ For ARMv7 binaries, preload
/opt/nvidia/nsight_systems/libLauncher32.so
‣ Otherwise if running from host, preload
/opt/nvidia/nsight_systems/libLauncher64.so
‣ Otherwise if running from CLI, preload
[installation_directory]/libLauncher64.so
The most common way to do that is to specify the environment variable as part of the
process launch command, for example:
$ LD_PRELOAD=/opt/nvidia/nsight_systems/libLauncher64.so ./my-aarch64-binary --
arguments
When loaded, this library will send itself a SIGSTOP signal, which is equivalent to typing
Ctrl+Z in the terminal. The process is now a background job, and you can use standard
commands like jobs, fg and bg to control it. Use jobs -l to see the PID of the
launched process.
When attaching to a stopped process, Nsight Systems will send it a SIGCONT signal, which
is equivalent to using the bg command.
Launcher
The second mechanism can be used with any binary. Use
[installation_directory]/launcher to launch your application, for example:
$ /opt/nvidia/nsight_systems/launcher ./my-binary --arguments
The process will be launched, daemonized, and will wait for the SIGUSR1 signal. After
attaching to the process with Nsight Systems, the user needs to manually resume
execution of the process from the command line:
$ pkill -USR1 launcher
Note: pkill will send the signal to any process with the matching name. If that is not
desirable, use kill to send it to a specific process. The standard output and error
streams are redirected to /tmp/stdout_<PID>.txt and /tmp/stderr_<PID>.txt.
The launcher mechanism is more complex and less automated than the LD_PRELOAD
option, but gives more control to the user.
26.4. GUI Troubleshooting
If opening the Nsight Systems Linux GUI fails with one of the following errors, you may
be missing some required libraries:
This application failed to start because it could not find or load the Qt
platform plugin "xcb" in "". Available platform plugins are: xcb. Reinstalling
the application may fix this problem.
or
error while loading shared libraries: [library_name]: cannot open shared object
file: No such file or directory
If the workload does not run when launched via Nsight Systems, or the timeline is
empty, check the stderr.log and stdout.log files (click on the drop-down menu showing
Timeline View and click on Files) to see the errors encountered by the app.
26.5. Symbol Resolution
If stack trace information is missing symbols and you have a symbol file, you can
manually re-resolve using the ResolveSymbols utility. This can be done by right-clicking
the report file in the Project Explorer window and selecting "Resolve Symbols...".
Alternatively, you can find the utility as a separate executable in the
[installation_path]\Host directory. This utility works with ELF format files, with
Windows PDB directories and symbol servers, or with files where each line is in the
format <start><length><name>.
The following ELF sections should be considered empty if they have a size of 4 bytes:
.debug_frame, .eh_frame, .ARM.exidx. In this case, these sections only contain
termination records and no useful information.
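One way to inspect these section sizes is with the readelf tool from binutils (not part
of Nsight Systems), for example:
$ readelf -SW ./myApp | grep -E "debug_frame|eh_frame|exidx"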
For GCC, use the following compiler invocation to see which compiler flags are enabled
in your toolchain by default (for example, to check if -funwind-tables is enabled by
default):
$ gcc -Q --help=common
For GCC and Clang, add -### to the compiler invocation command to see which
compiler flags are actually being used.
Since EHABI and DWARF information is compiled on a per-unit basis (every .cpp or
.c file, as well as every static library, can be built with or without this information),
the presence of the ELF sections does not guarantee that every function has the necessary
unwind information.
Frame pointers are required by the AArch64 Procedure Call Standard. Adding frame
pointers slows down execution time, but in most cases the difference is negligible.
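For example, a representative GCC or Clang invocation that preserves frame pointers
(these flags may already be enabled by default in your toolchain) is:
$ gcc -fno-omit-frame-pointer -funwind-tables -o myApp myApp.c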
26.6. Logging
To enable logging on the host, refer to this config file:
host-linux-x64/nvlog.config.template
When reporting any bugs please include the build version number as described in the
Help → About dialog. If possible, attach log files and report (.nsys-rep) files, as they
already contain necessary version information.
To enable verbose logging on the target device, when launched from the host, follow
these steps:
1. Close the host application.
2. Restart the target device.
3. Place nvlog.config from the host directory into the /opt/nvidia/nsight_systems
directory on the target.
4. From SSH console, launch the following command:
sudo /opt/nvidia/nsight_systems/nsys --daemon --debug
5. Start the host application and connect to the target device.
Logs on the target devices are collected into this file (if enabled):
nsys.log
To enable verbose logging on the target device, when launched from the host, follow
these steps:
1. Close the host application.
2. Terminate the nsys process.
3. Place nvlog.config from the host directory next to the Nsight Systems Windows agent
on the target device.
Chapter 27.
OTHER RESOURCES
Looking for information to help you use Nsight Systems most effectively? Here are
some more resources you might want to review:
Feature Videos
Short videos, only a minute or two, to introduce new features.
‣ OpenMP Trace Feature Spotlight
‣ Command Line Sessions Video Spotlight
‣ Direct3D11 Feature Spotlight
‣ Vulkan Trace
‣ Statistics Driven Profiling
‣ Analyzing NCCL Usage with NVIDIA Nsight Systems
Blog Posts
NVIDIA developer blogs: these are longer-form, technical pieces written by tool and
domain experts.
‣ 2019 - Migrating to NVIDIA Nsight Tools from NVVP and nvprof
‣ 2019 - Transitioning to Nsight Systems from NVIDIA Visual Profiler / nvprof
‣ 2019 - NVIDIA Nsight Systems Add Vulkan Support
‣ 2019 - TensorFlow Performance Logging Plugin nvtx-plugins-tf Goes Public
‣ 2020 - NVIDIA Nsight Systems in Containers and the Cloud
‣ 2020 - Understanding the Visualization of Overhead and Latency in Nsight Systems
‣ 2021 - Optimizing DX12 Resource Uploads to the GPU Using CPU-Visible VRAM
Training Seminars
2018 NCSA Blue Waters Webinar - Introduction to NVIDIA Nsight Systems
Conference Presentations
‣ GTC 2020 - Rebalancing the Load: Profile-Guided Optimization of the NAMD
Molecular Dynamics Program for Modern GPUs using Nsight Systems
‣ GTC 2020 - Scaling the Transformer Model Implementation in PyTorch Across
Multiple Nodes
‣ GTC 2019 - Using Nsight Tools to Optimize the NAMD Molecular Dynamics
Simulation Program
‣ GTC 2019 - Optimizing Facebook AI Workloads for NVIDIA GPUs
‣ GTC 2018 - Optimizing HPC Simulation and Visualization Codes Using NVIDIA
Nsight Systems
‣ GTC 2018 - Israel - Boost DNN Training Performance using NVIDIA Tools
‣ Siggraph 2018 - Taming the Beast; Using NVIDIA Tools to Unlock Hidden GPU
Performance