S62256 - Demystify CUDA Debugging and Performance with Powerful Developer Tools

Demystify CUDA Debugging and Performance with Powerful Developer Tools


Jackson Marusarz
Agenda

• High-level tools ecosystem overview
• For each tool:
• Brief description and feature overview
• New features in the latest releases and the problems they help solve
• Current and Future Areas of Focus
• Additional Resources / Q&A

https://developer.nvidia.com/tools-overview
Developer Tools Ecosystem
Debuggers: cuda-gdb, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
Correctness Checker: Compute Sanitizer
IDE integrations: Nsight Visual Studio Code Edition, Nsight Visual Studio Edition, Nsight Eclipse Edition
Compute Debuggers and IDEs
Compute Debuggers
Debug GPU Kernels Running on Device

• CUDA GDB
• CPU + GPU CUDA kernel debugger
• Supports stepping, breakpoints, in-line functions, variable inspection etc…
• Built on GDB and uses many of the same CLI commands
• Local/Remote connection support
• Nsight Visual Studio Edition
• IDE integration for Visual Studio
• Build and Debug CPU+GPU code from Visual Studio
• Nsight Visual Studio Code Edition
• New IDE integration for VS Code
• Build and Debug CPU+GPU code from Visual Studio Code
• Remotely target Linux targets from Windows or Linux
• Nsight Eclipse Edition
• IDE integration for Eclipse
• Build and Debug CPU+GPU code from Eclipse
Compute Sanitizer
Automatically Scan for Bugs and Memory Issues

• Compute Sanitizer checks correctness issues via sub-tools:
• Memcheck – Memory access error and leak detection tool.
• Racecheck – Shared memory data access hazard detection tool.
• Initcheck – Uninitialized device global memory access detection tool.
• Synccheck – Thread synchronization hazard detection tool.

https://github.com/NVIDIA/compute-sanitizer-samples
Compute Sanitizer
Reading a Memcheck Example Report

Labeled fields: address space (__global__), type of access (write), access size (4 bytes), access location, faulty thread, faulty address, nearest allocation, device and host backtraces.

========= Invalid __global__ write of size 4 bytes
=========     at 0xb0 in /home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo.cu:39:out_of_bounds_function()
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x87654320 is out of bounds
=========     and is 140,276,689,190,112 bytes before the nearest allocation at 0x7f953da00000 of size 1,024 bytes
=========     Device Frame:/home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo.cu:44:out_of_bounds_kernel() [0x30]
=========     Saved host backtrace up to driver entry point at kernel launch time
=========     Host Frame: [0x2774ec]
=========                in /lib/x86_64-linux-gnu/libcuda.so.1
=========     Host Frame:__cudart803 [0xfccb]
=========                in /home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo
=========     Host Frame:cudaLaunchKernel [0x6a578]
=========                in /home/cuda/github/compute-sanitizer-samples/Memcheck/memcheck_demo
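The "bytes before the nearest allocation" figure in this report is the gap between the two addresses it prints: the faulty address and the base of the nearest allocation. The C sketch below reproduces only that arithmetic. The helper name bytes_before_allocation is hypothetical and is not part of Compute Sanitizer.

```c
#include <assert.h>
#include <stdint.h>

/* Distance in bytes from an out-of-bounds address to the base of an
   allocation that lies above it in the address space.
   Hypothetical helper for reading a report, not a Compute Sanitizer API. */
uint64_t bytes_before_allocation(uint64_t fault_addr, uint64_t alloc_base)
{
    assert(fault_addr < alloc_base);
    return alloc_base - fault_addr;
}
```

Passing 0x87654320 and 0x7f953da00000 returns 140276689190112, the value in the report. A gap this large usually points to a stray pointer value rather than an off-by-one index into the 1,024-byte allocation.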
Compute Debuggers and IDEs
New Features
Debuggers/IDE

IDEs
• VS Code Autostart tasks for remote debugging
• VS Code remote debug (QNX/L4T)
• VS Code Docker support

Compute Sanitizer
• Racecheck support for device-launched graphs
• Memcheck support for Address Translation Service (ATS)
• Memcheck support for Heterogeneous Memory Management (HMM)
NVTX Tools Extension API
NVIDIA Tools eXtension (NVTX)
• Decorate application source code with annotations (markers, ranges, nested ranges, …) to help visualize execution with debugging, tracing and profiling tools

• Header-only library: https://github.com/NVIDIA/NVTX/tree/release-v3/c


#include <nvtx3/nvToolsExt.h>

• Marker:
nvtxMark("This is a marker");

• Push-Pop range
nvtxRangePush("This is a push/pop range");
// Do something interesting in the range
nvtxRangePop(); // Pop must be on same thread as corresponding Push

• Start-End range
nvtxRangeId_t handle = nvtxRangeStart("This is a start/end range");
// Somewhere else in the code, not necessarily same thread as Start call:
nvtxRangeEnd(handle);

API references: https://nvidia.github.io/NVTX/doxygen/index.html and https://nvidia.github.io/NVTX/doxygen-cpp/index.html


NVIDIA SDKs and NVTX
A Complete Ecosystem

[Diagram: NVIDIA SDKs and libraries in the NVTX ecosystem]
SDKs: DeepStream SDK (accelerated GStreamer plugins, GXF), Holoscan SDK (GXF)
Math Libraries: cuSPARSE, cuSOLVER, cuBLAS
Comm. Libraries: NVSHMEM, NCCL
Deep Learning Libraries: TensorRT, cuDNN, cuDLA
Others (…): cuDF, cuFile, cuML
Python and NVTX

• Annotate Python code with NVTX
• pip install nvtx - https://pypi.org/project/nvtx/

• Profile and Visualize with Nsight Systems


Python and NVTX
Trace Python Functions of Interest

• No Python source changes required
• Annotations are configured in a JSON file (e.g. <target-platform-folder>/PythonNvtx/annotations.json)
Nsight Systems
Nsight Systems
System Profiler

Key Features:
• System-wide application algorithm tuning
• Multi-process tree support
• Locate optimization opportunities
• Visualize millions of events on a very fast GUI timeline
• Identify gaps of unused CPU and GPU time
• Balance your workload across multiple CPUs and GPUs
• CPU algorithms, utilization and thread state
• GPU streams, kernels, memory transfers, etc.
• Command Line, Standalone, IDE Integration
• OS: Linux (x86, ARM Server, Tegra), Windows, macOS X (host)
• GPUs: Pascal+
• Docs/product: https://developer.nvidia.com/nsight-systems
[Timeline screenshot callouts: processes and threads; thread state; cuDNN and cuBLAS trace; kernel and memory transfer activities; multi-GPU]
Zoom/Filter to Exact Areas of Interest
Nsight Systems
New Features
Grace Host Profiling
Hardware Counters and Metrics

• CPU Core and Uncore Events


• Sampled for each CPU
• Visualize parallelism effects
• Cache hit/miss, instructions retired, etc…
• L3 Coherency Fabric
• Socket to socket traffic
• Variable sampling frequencies supported
• Timeline correlated with all other data
• GPU vs. CPU idle times and metrics
• Data movement
• Zoom and filter
Grace Host Profiling
Cache Access Pattern Example

Single threaded CPU matrix multiplication with poor memory access patterns

Improving access pattern and implementing cache blocking
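The slide shows only the two captions, so the C sketch below is an illustrative guess at the kind of change being profiled, not the code from the talk. matmul_naive walks the second matrix with a stride of n in its inner loop. matmul_blocked reorders the loops so the inner loop is unit-stride and works on bs × bs tiles that can stay in cache. The function names and the block-size parameter bs are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Naive i-j-k order: the inner loop reads b down a column (stride n),
   so consecutive iterations touch different cache lines. */
void matmul_naive(size_t n, const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < n; ++i) {
        for (size_t j = 0; j < n; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < n; ++k)
                sum += a[i * n + k] * b[k * n + j];
            c[i * n + j] = sum;
        }
    }
}

static size_t min_size(size_t x, size_t y) { return x < y ? x : y; }

/* Cache-blocked version: processes bs x bs tiles and uses i-k-j order
   inside each tile so the inner loop reads b and writes c with unit stride. */
void matmul_blocked(size_t n, size_t bs,
                    const double *a, const double *b, double *c)
{
    assert(bs > 0);
    for (size_t i = 0; i < n * n; ++i)
        c[i] = 0.0;

    for (size_t ii = 0; ii < n; ii += bs) {
        for (size_t kk = 0; kk < n; kk += bs) {
            for (size_t jj = 0; jj < n; jj += bs) {
                size_t i_end = min_size(ii + bs, n);
                size_t k_end = min_size(kk + bs, n);
                size_t j_end = min_size(jj + bs, n);
                for (size_t i = ii; i < i_end; ++i) {
                    for (size_t k = kk; k < k_end; ++k) {
                        double aik = a[i * n + k];
                        for (size_t j = jj; j < j_end; ++j)
                            c[i * n + j] += aik * b[k * n + j];
                    }
                }
            }
        }
    }
}
```

Both functions compute the same product. A suitable bs depends on the cache sizes of the target CPU, so it is worth sweeping a few values while watching the cache hit/miss counters on the timeline.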


JupyterLab Integration Updates

• Extension to JupyterLab
• Profile individual Jupyter cells
• Text-based results can be viewed directly in Jupyter
• Launch new remote GUI streaming container directly in JupyterLab
• Servers without X, Windowing Manager, …
• Container with X, WM, & WebRTC server
• Dockerfile inside Nsight Systems Installer

• See it in action:
• DLIT61667: Profilers, Python, and Performance: Nsight Tools for Optimizing Modern CUDA Workloads
Python Profiling Updates

• Python call stack samples and CUDA API backtraces
• Identify where you are and how you got there
• Global Interpreter Lock (GIL) trace
• Common performance limiter in Python
• See annotated code ranges built into popular frameworks and libraries such as RAPIDS, Spark, CV-CUDA, and more
Cluster and Recipe Framework Improvements

• Nsight Systems enhanced support for Kubernetes


• Nsight Systems analysis framework:
• User programmable and predefined recipes to:
• Process and analyze complex and large reports or collections of reports
• Understand how compute cold-spots relate to communications
• Generate multi-node heatmaps to show:
• InfiniBand congestion
• InfiniBand, Ethernet, and NVLink throughputs
• Overlapped compute and networking

• NVIDIA Switch per-port support


• Remotely stream GUI inside container
• No need to copy/export out to local PCs
Recipe Framework Example

• Multi-process workload with NCCL


• Utilization heatmaps for NCCL/Compute/All
• Visualize usage over time to identify:
• Phases and behavior patterns
• Load imbalance
• Idle GPU compute cycles
• Inefficient scheduling
• Overlapping communication and compute
• Ensure resources are used efficiently
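A utilization heatmap of this kind can be thought of as busy time binned over the timeline, with one row per GPU or rank. The C sketch below shows that binning for a single row. It is a simplified illustration under the assumption that the busy intervals do not overlap, and it is not the Nsight Systems recipe implementation. The interval_t type and bin_utilization function are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One busy interval [start, end) on a device timeline, in any
   consistent time unit. Hypothetical type for illustration. */
typedef struct {
    double start;
    double end;
} interval_t;

/* Fill util[0..num_bins) with the fraction of each time bin covered by
   the busy intervals. Bin b spans [t0 + b*bin_width, t0 + (b+1)*bin_width).
   Assumes the intervals do not overlap one another. */
void bin_utilization(const interval_t *iv, size_t count,
                     double t0, double bin_width,
                     double *util, size_t num_bins)
{
    assert(bin_width > 0.0);
    for (size_t b = 0; b < num_bins; ++b) {
        double bin_start = t0 + (double)b * bin_width;
        double bin_end = bin_start + bin_width;
        double busy = 0.0;
        for (size_t i = 0; i < count; ++i) {
            /* Clip the interval to the bin and keep any positive overlap. */
            double lo = iv[i].start > bin_start ? iv[i].start : bin_start;
            double hi = iv[i].end < bin_end ? iv[i].end : bin_end;
            if (hi > lo)
                busy += hi - lo;
        }
        util[b] = busy / bin_width;
    }
}
```

Running this once per rank and stacking the rows gives a grid in which idle phases and load imbalance show up as low-valued cells.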
Nsight Compute
Nsight Compute
Kernel Profiler

Key Features:
• Interactive CUDA API debugging and kernel profiling
• Built-in rules expertise
• Fully customizable data collection and display
• Command Line, Standalone, IDE Integration, Remote Targets

• OS: Linux (x86, Power, Tegra, Arm SBSA), Windows, macOS X (host only)
• GPUs: Volta+

• Docs/product: https://developer.nvidia.com/nsight-compute
Nsight Compute GUI Interface

Targeted metric sections
Customizable data collection and presentation
Built-in expertise for Guided Analysis and optimization
Visual memory analysis chart
Metrics for peak performance ratios
Source/PTX/SASS analysis and correlation
Metric heatmap to quickly identify hotspots
Source metrics per instruction
Nsight Compute
New Features
Nsight Compute
Periodic Metric Sampling

• Reveals behaviors hidden by aggregates


• Inter-kernel phases
• Workload imbalance (tail effects, etc…)
• Warp Stall Reasons
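The tail effect mentioned above can be estimated with a simple wave model: if a GPU can hold a fixed number of thread blocks at once, a grid that is not a multiple of that capacity leaves part of the last wave empty. The C sketch below assumes every block runs for the same time and that blocks_per_sm is known. Both are simplifying assumptions, and wave_stats is a hypothetical helper, not something Nsight Compute exposes.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t waves;           /* scheduling waves, the last one possibly partial */
    double tail_fill;       /* fraction of block slots occupied in the last wave */
    double slot_efficiency; /* grid_blocks / (waves * slots) */
} wave_stats_t;

/* Simplified wave model: all blocks take equal time and each SM holds
   exactly blocks_per_sm resident blocks. */
wave_stats_t wave_stats(size_t grid_blocks, size_t num_sms, size_t blocks_per_sm)
{
    assert(grid_blocks > 0 && num_sms > 0 && blocks_per_sm > 0);
    size_t slots = num_sms * blocks_per_sm;
    size_t waves = (grid_blocks + slots - 1) / slots; /* ceiling division */
    size_t last = grid_blocks - (waves - 1) * slots;  /* blocks in final wave */

    wave_stats_t s;
    s.waves = waves;
    s.tail_fill = (double)last / (double)slots;
    s.slot_efficiency = (double)grid_blocks / (double)(waves * slots);
    return s;
}
```

For 100 blocks on 80 slots the model gives two waves with the second only 25% full. An aggregate metric averages this away, while a periodic sample series shows the drop at the end of the kernel.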
Nsight Compute
Source Code Comparison

• Source Code Comparison


• Determine how modifications impact performance
• No need for multiple open reports/GUIs
• Automatic diff’ing to locate and navigate to changes
• Per-source heatmaps provide additional visual information
Nsight Compute
Workload Distribution Section and Load Imbalance Rules

• New GPU and Memory Workload Distribution section


• Helps users understand the balance of work across SMs and memory.
• New rules identify load imbalances where uneven work distribution could be impacting performance.
• Use this new section and the built-in rules to detect uneven workload distributions that may keep you from achieving peak performance.
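One common way to put a number on uneven work distribution is the ratio of the busiest unit to the mean. The C sketch below applies it to a hypothetical array of per-SM work counts such as active cycles. The talk does not state which formula the Nsight Compute rules use, so treat this as an illustration of the idea rather than the tool's rule.

```c
#include <assert.h>
#include <stddef.h>

/* Ratio of the busiest unit's work to the mean work per unit.
   1.0 means perfectly balanced; larger values mean more imbalance.
   The input is a hypothetical per-SM work count. */
double imbalance_ratio(const double *work, size_t count)
{
    assert(count > 0);
    double total = 0.0;
    double max = work[0];
    for (size_t i = 0; i < count; ++i) {
        total += work[i];
        if (work[i] > max)
            max = work[i];
    }
    assert(total > 0.0);
    return max / (total / (double)count);
}
```

A ratio of 2.0 means the busiest SM did twice the average work, so the kernel's runtime is set by that SM while the others sit partly idle.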
Coming Soon…

• Source Page Statistics including multi-select

• Python Callstacks and Syntax Highlighting

• Range Replay Kernel Timestamps


CUPTI
CUDA Profiling Tools Interface
CUPTI Updates

• New APIs for instruction level SASS metrics


• Gives CUPTI users/tool developers the ability to collect SASS metrics through code instrumentation
• Previously only available through Nsight Compute
• Graph-level tracing for device-launched graphs
• Start and stop trace for the graph execution
• Lower overhead than per-node tracing
• Push Buffer full events
• CUDA API queue pressure can cause performance degradation
• Overhead reporting for lazy loading of CUDA modules and functions
• Performance improvements
• Tracing overhead reductions to ensure accurate performance data
Reviewing Areas of Focus
Focus Area: DevTools ♡ Python

• Python CPU Call Stacks


• Python GIL trace in Nsight Systems
• JupyterLab support
• Nsight Systems can profile individual Jupyter cells
• Text-based results can be viewed directly in Jupyter
• Timeline reports can launch the remote GUI streaming container with a single click directly in JupyterLab
• Increasing Python collateral/samples/labs
• DLIT61667: Profilers, Python, and Performance: Nsight Tools for Optimizing Modern CUDA Workloads
Focus Area: Cloud & Cluster

• Nsight Systems enhanced support for Kubernetes


• Nsight Systems analysis framework recipes to:
• Understand how compute cold-spots relate to communications
• Generate multi-node heatmaps to show:
• InfiniBand congestion
• InfiniBand, Ethernet, and NVLink throughputs
• Overlapped compute and networking
• NVIDIA Switch per-port support
• Remotely stream GUI inside container
• No need to copy/export out to local PCs
• JupyterLab integration including multi-node recipes
• More Details and Examples:
• S62388: Achieving Higher Performance From Your Data Center and Cloud Application
Additional Resources
New Developer Tools Video Series
YouTube Playlist
DEVELOPER TOOLS ACROSS GTC
Sessions
S62256: Demystify CUDA debugging and performance with powerful developer tools
S62388: Achieving Higher Performance From Your Data Center and Cloud Application
SE62128: Exploring AI-Assisted Developer Tools for Accelerated Computing
S62398: Advances in Ray Tracing Developer Tools
Labs
DLIT61667: Profilers, Python, and Performance: Nsight Tools for Optimizing Modern CUDA Workloads
Connect with the Experts
CWE61532: What's in Your CUDA Toolbox? CUDA Profiling, Optimization, and Debugging Tools
CWE61581: Using Nsight Graphics Tools to Transform Your Graphics Application to a Next-Gen Powerhouse
CWE61231: Connect With the Experts: GPU Compute Performance Analysis and Optimizations
SE63279: Ask the Experts: Connect with Jetson, Metropolis, and Isaac Platform Experts and Engineers
Live demos
Come and visit the Developer Tools pod during show floor hours!

Developer Tools are free, get started here:
https://developer.nvidia.com/tools-overview
Training and Tutorials:
https://developer.nvidia.com/tools-tutorials

Interested in working on Developer Tools? We are hiring! Scan the QR code
