Fabric Manager User Guide
Fabric Manager for NVIDIA NVSwitch Systems
Release r560
NVIDIA Corporation
This document describes the NVIDIA® Fabric Manager for NVSwitch™ systems.
Chapter 1. Introduction
As deep learning neural networks become more sophisticated, their size and complexity continue to expand. The result is an exponential increase in the computing capacity required to train these networks in a reasonable period. To meet this challenge, applications have turned to multi-GPU implementations.
NVIDIA NVLink®, which was introduced to connect multiple GPUs, is a direct GPU-to-GPU interconnect that scales multi-GPU input/output (IO) in the server. To further scale performance and connect more GPUs, NVIDIA introduced NVIDIA NVSwitch, which connects multiple NVLinks to provide all-to-all GPU communication at the full NVLink speed.
Chapter 2. NVSwitch-based Systems
Over the years, NVIDIA has introduced three generations of NVSwitch and the associated DGX and NVIDIA HGX™ server systems.
NVIDIA DGX-2™ and NVIDIA HGX-2 systems consist of two identical GPU baseboards with eight NVIDIA V100 GPUs and six first-generation NVSwitches on each baseboard. Each V100 GPU has one NVLink connection to each NVSwitch on the same GPU baseboard. The two GPU baseboards are connected to build a 16-GPU system. Between the two GPU baseboards, the only NVLink connections are between NVSwitches: each NVSwitch from one GPU baseboard is connected to one NVSwitch on the second GPU baseboard, for a total of eight NVLink connections.
The NVIDIA DGX™ A100 and NVIDIA HGX A100 8-GPU systems consist of one GPU baseboard with eight NVIDIA A100 GPUs and six second-generation NVSwitches. The GPU baseboard NVLink topology is similar to the first-generation version, except that each A100 GPU has two NVLink connections to each NVSwitch on the same GPU baseboard. This generation supports connecting two GPU baseboards, for a total of sixteen NVLink connections between the baseboards.
Third-generation NVSwitches are used in DGX H100 and NVIDIA HGX H100 8-GPU server systems. This server variant consists of one GPU baseboard with eight NVIDIA H100 GPUs and four NVSwitches. The corresponding NVLink topology differs from the previous generation: every GPU has four NVLinks that connect to two of the NVSwitches and five NVLinks that connect to the remaining two NVSwitches. This generation has deprecated support for connecting two GPU baseboards using NVLink.
Chapter 3. Terminology

Term            Definition
FM              Fabric Manager
MMIO            Memory Mapped IO
VM              Virtual Machine
GPU register    A location in the GPU MMIO space
SBR             Secondary Bus Reset
DCGM            NVIDIA Data Center GPU Manager
NVML            NVIDIA Management Library
Service VM      A privileged VM where the NVIDIA NVSwitch software stack runs
Access NVLink   NVLink between a GPU and an NVSwitch
Trunk NVLink    NVLink between two GPU baseboards
SMBPBI          NVIDIA SMBus Post-Box Interface
vGPU            NVIDIA GRID Virtual GPU
MIG             Multi-Instance GPU
SR-IOV          Single-Root IO Virtualization
PF              Physical Function
VF              Virtual Function
GFID            GPU Function Identification
Partition       A collection of GPUs that are allowed to perform NVLink peer-to-peer communication among themselves
ALI             Autonomous Link Initialization
Chapter 4. NVSwitch Core Software Stack

The core software stack for NVSwitch management consists of an NVSwitch kernel driver and a privileged process called NVIDIA Fabric Manager (FM). The kernel driver performs low-level hardware management in response to FM requests. The software stack also provides in-band and out-of-band monitoring solutions to report NVSwitch and GPU errors and status information.
FM configures the NVSwitch memory fabrics to form one memory fabric among all participating GPUs
and monitors the NVLinks that support the fabric. At a high level, FM has the following responsibilities:
▶ Configures routing among NVSwitch ports.
▶ Coordinates with the GPU driver to initialize GPUs.
▶ Monitors the fabric for NVLink and NVSwitch errors.
On systems that are not capable of Autonomous Link Initialization (ALI)-based NVLink training (the first
and second generation NVSwitch-based systems), FM also has the following additional responsibilities:
▶ Coordinate with the NVSwitch driver to train NVSwitch to NVSwitch NVLink interconnects.
▶ Coordinate with the GPU driver to initialize and train NVSwitch to GPU NVLink interconnects.
This document provides an overview of various FM features and is intended for system administrators
and individual users of NVSwitch-based server systems.
Each GPU must register with the NVLink fabric. If a GPU fails to register with the fabric, it will lose its NVLink peer-to-peer capability and be available only for non-peer-to-peer use cases. The CUDA initialization process will start after the GPUs complete their registration with the NVLink fabric.
GPU fabric registration status is exposed through the NVML APIs and as part of the nvidia-smi -q command output. Refer to the following nvidia-smi command output for more information.
▶ Here is the Fabric state output when the GPU is being registered:
nvidia-smi -q -i 0 | grep -i -A 2 Fabric
Fabric
State : In Progress
Status : N/A
▶ Here is the Fabric state output after the GPU has been successfully registered:
nvidia-smi -q -i 0 | grep -i -A 2 Fabric
Fabric
State : Completed
Status : Success
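Automation that needs to wait for fabric registration can poll this state. The following is a minimal sketch (not taken from this guide) that blocks until no GPU reports an In Progress fabric state; adjust the polling interval and add a timeout as needed:

#!/bin/bash
# Poll nvidia-smi until no GPU reports Fabric State "In Progress" (sketch only).
while nvidia-smi -q | grep -i -A 2 Fabric | grep -q "In Progress"; do
    sleep 5
done
echo "All GPUs have completed NVLink fabric registration."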
Fabric Manager plays a critical role in the functionality of NVSwitch-based systems and is typically started during system boot or workload activation. Restarting the service intermittently is unnecessary, but if a restart is required because of workflow requirements or as part of a GPU reset operation, complete the following procedure on DGX H100 and NVIDIA HGX H100 systems to ensure that the system returns to a coherent state.
1. Stop all CUDA applications and GPU-related services.
▶ Halt all running CUDA applications and services (for example, DCGM) that are actively using GPUs.
▶ You can leave the nvidia-persistenced service running.
2. Stop the Fabric Manager service.
3. Perform a GPU reset by issuing the nvidia-smi -r command.
4. Start the Fabric Manager service again to restore its functionality.
5. Resume the services that were halted in step 1, such as DCGM or other GPU-related services.
6. Launch your CUDA applications as needed.
Note: System administrators can set their GPU application launcher services, such as SSHD, Docker, and so on, to start after the FM service is started. Refer to your Linux distribution's manual for more information about setting up service dependencies and the service start order.
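For example, on a systemd-based distribution, a drop-in unit such as the following delays a Docker-launched workload until FM is up (the drop-in path and the docker.service unit name are illustrative, not mandated by this guide):

# /etc/systemd/system/docker.service.d/10-after-fabricmanager.conf (example path)
[Unit]
After=nvidia-fabricmanager.service
Wants=nvidia-fabricmanager.service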
Note: Unless specified otherwise, the steps for NVIDIA HGX A800 and NVIDIA HGX H800 are the same as the steps for NVIDIA HGX A100 and NVIDIA HGX H100. The only difference is that the number of GPU NVLinks will differ depending on the actual platform.
6.3.3. OS Environment
FM is supported on the following major Linux OS distributions:
▶ RHEL/CentOS 7.x and RHEL/CentOS 8.x
▶ Ubuntu 18.04.x, Ubuntu 20.04.x, and Ubuntu 22.04.x
Note: During initialization, the FM service checks the currently loaded kernel driver stack version for
compatibility, and if the loaded driver stack version is not compatible, aborts the process.
6.6. Installation
Note: In the following commands, <driver-branch> should be substituted with the required NVIDIA
driver branch number for qualified data center drivers (for example, 560).
Note: On NVSwitch-based NVIDIA HGX systems, before you install FM, install the compatible Driver for NVIDIA Data Center GPUs. As part of the installation, the FM service unit file (nvidia-fabricmanager.service) will be copied to systemd. However, the system administrator must manually enable and start the FM service.
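For example, on systemd-based distributions the service can typically be enabled and started as follows (a sketch; the unit name matches the file mentioned above):

sudo systemctl enable nvidia-fabricmanager.service
sudo systemctl start nvidia-fabricmanager.service
sudo systemctl status nvidia-fabricmanager.service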
[-r | --restart]: Restart Fabric Manager after exit. Applicable to Shared NVSwitch and vGPU multitenancy modes.
Most of the FM configurable parameters and options are specified through a text config file. FM installation will copy a default config file to a predefined location, and the file will be used by default. To use a different config file location, specify it using the [-c | --config] command line argument.
Note: On Linux-based installations, the default FM config file is /usr/share/nvidia/nvswitch/fabricmanager.cfg. If the default config file on the system has been modified, subsequent FM package updates will provide options such as merge/keep/overwrite to manage the existing config file.
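For reference, a minimal excerpt of fabricmanager.cfg using the default values documented in the following sections looks similar to this (illustrative excerpt, not the complete file):

# Logging
LOG_LEVEL=4
LOG_FILE_NAME=/var/log/fabricmanager.log
LOG_APPEND_TO_LOG=1
# Operating mode (0 = bare metal or full passthrough virtualization)
FABRIC_MODE=0
# Run as a Unix daemon
DAEMONIZE=1
# Internal communication
BIND_INTERFACE_IP=127.0.0.1
STARTING_TCP_PORT=16000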
[Service]
User=root
PrivateTmp=false
Type=forking
ExecStart=/usr/bin/nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg

[Install]
WantedBy=multi-user.target
When FM is configured to run from a specific user/user group as specified above, the nvswitch-audit
command line utility should be started from the same user/user group account.
Note: System administrators can set up necessary udev rules to automate the process of changing
these proc entry permissions.
Note: The FM config file is read as part of FM service startup. If you change any config file options, restart the FM service for the new settings to take effect.
▶ Config Item
LOG_FILE_NAME=<value>
▶ Supported/Possible Values
The complete path/filename string (max length of 256) for the log.
▶ Default Value
LOG_FILE_NAME=/var/log/fabricmanager.log
▶ Config Item
LOG_LEVEL=<value>
▶ Supported/Possible Values
▶ 0 - All the logging is disabled
▶ 1 - Set log level to CRITICAL and above
▶ 2 - Set log level to ERROR and above
▶ 3 - Set log level to WARNING and above
▶ 4 - Set log level to INFO and above
▶ Default Value
LOG_LEVEL=4
▶ Config Item
LOG_APPEND_TO_LOG=<value>
▶ Supported/Possible Values
▶ 0 - No, don’t append to the existing log file, instead overwrite the existing log file.
▶ 1 - Yes, append to the existing log file every time Fabric Manager service is started.
▶ Default Value
LOG_APPEND_TO_LOG=1
▶ Config Item
LOG_FILE_MAX_SIZE=<value>
▶ Supported/Possible Values
▶ The desired max log file size in MBs.
After the specified size is reached, FM will skip additional logging to the specified log file.
▶ Default Value
LOG_FILE_MAX_SIZE=1024
▶ Config Item
LOG_USE_SYSLOG=<value>
▶ Supported/Possible Values
▶ 0 - Use the specified log file for storing all the Fabric Manager logs
▶ 1 - Redirect all the Fabric Manager logs to syslog instead of file-based logging.
▶ Default Value
LOG_USE_SYSLOG=0
▶ Config Item
LOG_MAX_ROTATE_COUNT=<value>
▶ Supported/Possible Values
▶ 0 - The log is not rotated.
Logging is stopped once the log file reaches the size specified in the LOG_FILE_MAX_SIZE option above.
▶ Non-zero - Rotate the current log file once it reaches the individual log file size.
The combined Fabric Manager log size is LOG_FILE_MAX_SIZE multiplied by (LOG_MAX_ROTATE_COUNT + 1). After this threshold is reached, the oldest log file will be purged.
▶ Default Value
LOG_MAX_ROTATE_COUNT=3
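For example, with the default values (LOG_FILE_MAX_SIZE=1024 and LOG_MAX_ROTATE_COUNT=3), FM keeps up to 4 x 1024 MB = 4096 MB of log data before the oldest log file is purged.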
Note: The FM log is in a clear-text format, and NVIDIA recommends that you run the FM service with
logging enabled at the INFO level for troubleshooting field issues.
Note: This section of config items is applicable only to Shared NVSwitch and vGPU Multitenancy
deployment.
▶ Config Item
FABRIC_MODE=<value>
▶ Supported/Possible Values
▶ 0 - Start FM in bare metal or full passthrough virtualization mode.
▶ 1 - Start FM in Shared NVSwitch multitenancy mode.
For more information, refer to Shared NVSwitch Virtualization Configurations.
▶ 2 - Start FM in vGPU multitenancy mode.
For more information, refer to vGPU Virtualization Model.
▶ Default Value
FABRIC_MODE=0
Note: The older SHARED_FABRIC_MODE configuration item is still supported, but we recommend
that you use the FABRIC_MODE configuration item.
▶ Config Item
FABRIC_MODE_RESTART=<value>
▶ Supported/Possible Values
▶ 0 - Start FM and complete the initialization sequence.
▶ 1 - Start FM and follow the Shared NVSwitch or vGPU multitenancy mode resiliency/restart
sequence.
This option is equivalent to the --restart command line argument to the FM process and is provided to enable the Shared NVSwitch or vGPU multitenancy mode resiliency without modifying the command-line arguments to the FM process. Refer to Resiliency for more information about the FM resiliency flow.
▶ Default Value
FABRIC_MODE_RESTART=0
Note: The older SHARED_FABRIC_MODE_RESTART configuration item is still supported, but we recommend that you use the FABRIC_MODE_RESTART configuration item.
▶ Config Item
FM_CMD_BIND_INTERFACE=<value>
▶ Supported/Possible Values
The network interface on which the FM SDK/API listens, and which the hypervisor uses to communicate with the running FM instance for Shared NVSwitch and vGPU multitenancy operations.
▶ Default Value
FM_CMD_BIND_INTERFACE=127.0.0.1
▶ Config Item
FM_CMD_PORT_NUMBER=<value>
▶ Supported/Possible Values
The TCP port number for the FM SDK/API for Hypervisor to communicate with the running FM
instance for Shared NVSwitch and vGPU multitenancy operations.
▶ Default Value
FM_CMD_PORT_NUMBER=6666
▶ Config Item
FM_CMD_UNIX_SOCKET_PATH=<value>
▶ Supported/Possible Values
The Unix domain socket path (used instead of the TCP/IP socket) on which the FM SDK/API listens and communicates with the running FM instance for Shared NVSwitch and vGPU multitenancy operations.
▶ Default Value
FM_CMD_UNIX_SOCKET_PATH=<empty value>
▶ Config Item
STATE_FILE_NAME=<value>
▶ Supported/Possible Values
Specify the filename used to save FM state so that FM can be restarted after a crash or a successful exit. This is only valid when the Shared NVSwitch or vGPU multitenancy mode is enabled.
▶ Default Value
STATE_FILE_NAME=/tmp/fabricmanager.state
▶ Config Item
DAEMONIZE=<value>
▶ Supported/Possible Values
▶ 0 - Do not daemonize and run FM as a normal process.
▶ 1 - Run the FM process as a Unix daemon.
▶ Default Value
DAEMONIZE=1
▶ Config Item
BIND_INTERFACE_IP=<value>
▶ Supported/Possible Values
Network interface to listen for the FM internal communication/IPC, and this value should be a
valid IPv4 address.
▶ Default Value
BIND_INTERFACE_IP=127.0.0.1
▶ Config Item
STARTING_TCP_PORT=<value>
▶ Supported/Possible Values
Starting TCP port number for the FM internal communication/IPC, and this value should be be-
tween 0 and 65535.
▶ Default Value
STARTING_TCP_PORT=16000
▶ Config Item
UNIX_SOCKET_PATH=<value>
▶ Supported/Possible Values
Use Unix Domain socket instead of TCP/IP socket for FM internal communication/IPC. An empty
value means that the Unix domain socket is not used.
▶ Default Value
UNIX_SOCKET_PATH=<empty value>
▶ Config Item
TOPOLOGY_FILE_PATH=<value>
▶ Supported/Possible Values
Configuration option to specify the FM topology files directory path information.
▶ Default Value
TOPOLOGY_FILE_PATH=/usr/share/nvidia/nvswitch
▶ Config Item
FM_STAY_RESIDENT_ON_FAILURES=<value>
▶ Supported/Possible Values
▶ 0 - The FM service will terminate on errors such as NVSwitch and GPU config failures, typical software errors, and so on.
▶ 1 - The FM service will stay running on errors such as NVSwitch and GPU config failures, typical software errors, and so on.
However, the system will be uninitialized, and CUDA application launches will fail.
▶ Default Value
FM_STAY_RESIDENT_ON_FAILURES=0
▶ Config Item
ACCESS_LINK_FAILURE_MODE=<value>
▶ Supported/Possible Values
The available high-availability options when there is an Access NVLink Failure (GPU to NVSwitch
NVLink). Refer to Supported High Availability Modes for more information about supported values
and behavior.
▶ Default Value
ACCESS_LINK_FAILURE_MODE=0
▶ Config Item
TRUNK_LINK_FAILURE_MODE=<value>
▶ Supported/Possible Values
The available high-availability options when there is a Trunk Link failure (NVSwitch to NVSwitch connection between GPU baseboards). Refer to Supported High Availability Modes for more information about supported values and behavior.
▶ Default Value
TRUNK_LINK_FAILURE_MODE=0
▶ Config Item
NVSWITCH_FAILURE_MODE=<value>
▶ Supported/Possible Values
The available high-availability options when there is an NVSwitch failure. Refer to Supported High Availability Modes for more information about supported values and behavior.
▶ Default Value
NVSWITCH_FAILURE_MODE=0
6.11.4.5 CUDA Jobs Behavior When the Fabric Manager Service is Stopped or Terminated
▶ Config Item
ABORT_CUDA_JOBS_ON_FM_EXIT=<value>
▶ Supported/Possible Values
▶ 0 - Do not abort running CUDA jobs when the FM service is stopped or exits.
However, a new CUDA job launch will fail with cudaErrorSystemNotReady error.
▶ 1 - Abort all running CUDA jobs when the FM service is stopped or exits.
Also, a new CUDA job launch will fail with cudaErrorSystemNotReady error.
Note: This is not effective on DGX H100 and NVIDIA HGX H100 NVSwitch-based systems. Also, this config option is applicable only to bare metal and full passthrough virtualization models.
▶ Default Value
ABORT_CUDA_JOBS_ON_FM_EXIT=1
7.1. Introduction
The NVSwitch-based DGX and NVIDIA HGX server systems’ default software configuration is to run the
systems as bare-metal machines for workloads such as AI, machine learning, and so on. This chapter
provides information about the FM installation requirements to support a bare metal configuration.
Depending on the severity (fatal vs. non-fatal) and the impacted port, SXid and Xid errors can abort existing CUDA jobs and prevent new CUDA job launches. The next section provides information about the potential impact of SXid and Xid errors and the corresponding recovery procedure.
NVSwitch non-fatal SXids are for informational purposes only, and FM will not terminate running CUDA jobs or prevent new CUDA job launches. The existing CUDA jobs should resume, but depending on the exact error, CUDA jobs might experience issues such as a performance drop, no forward progress for a brief time, and so on.
When a fatal SXid error is reported on an NVSwitch port that is connected between a GPU and an NVSwitch, the corresponding error will be propagated to the GPU. The CUDA jobs that are running on that GPU will be aborted, and the GPU might report Xid 74 and Xid 45 errors. The FM service will log the corresponding GPU index and PCI bus information in its log file and syslog. The system administrator must use the following recovery procedure to clear the error state before using the GPU for additional CUDA workloads:
1. Reset the specified GPU (and all the participating GPUs in the affected workload) via the NVIDIA System Management Interface (nvidia-smi) command line utility.
Refer to the -r or the --gpu-reset options in nvidia-smi for more information about the individual GPU reset operation. If the problem persists, reboot or power cycle the system.
When a fatal SXid error is reported on an NVSwitch port that connects two GPU baseboards, FM will abort all the running CUDA jobs and prevent any new CUDA job launches. The GPU will also report an Xid 45 error as part of aborting CUDA jobs. The FM service will log the corresponding error information in its log file and syslog.
The system administrator must use the following recovery procedure to clear the error state and allow subsequent CUDA job launches to succeed:
1. Stop the FM service.
2. Stop all the applications that are using the GPUs.
3. Reset all the GPUs and NVSwitches using the nvidia-smi command line utility with the -r or the --gpu-reset option. Do not use the -i or the -id options.
4. After the reset operation is complete, start the FM service again.
If the problem persists, reboot or power cycle the system.
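For example, on a bare-metal system the recovery sequence above maps to commands similar to the following (a sketch; the systemd unit name is the one installed by the FM package):

sudo systemctl stop nvidia-fabricmanager.service
# ... stop all applications that are using the GPUs ...
sudo nvidia-smi -r          # reset all GPUs and NVSwitches; do not pass -i or -id
sudo systemctl start nvidia-fabricmanager.service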
8.1. Introduction
NVSwitch-based systems support multiple models to isolate NVLink interconnects in a multi-tenant
environment. In virtualized environments, VM workloads often cannot be trusted and must be isolated
from each other and from the host or hypervisor. The switches used to maintain this isolation cannot
be directly controlled by untrusted VMs and must instead be controlled by the trusted software.
This chapter provides a high-level overview of supported virtualization models.
FM provides a shared library, a set of C/C++ APIs (SDK), and the corresponding header files. The library and APIs are used to interface with FM, when it runs in the Shared NVSwitch and vGPU multi-tenant modes, to query/activate/deactivate GPU partitions. All FM interface API definitions, libraries, sample code, and associated data structure definitions are delivered as a separate development package (RPM/Debian). To compile the sample code provided in this user guide, this package must be installed.
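For example, assuming the development package installed the header files and the libnvfm library under the default /usr/include and /usr/lib paths (as in the sample Makefile later in this guide), a sample program can be compiled with a command similar to:

g++ -I /usr/include -o sampleCode sampleCode.cpp -L /usr/lib -lnvfm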
typedef struct
{
char uuid[FM_UUID_BUFFER_SIZE];
char pciBusId[FM_DEVICE_PCI_BUS_ID_BUFFER_SIZE];
unsigned int numPorts;
unsigned int portNum[FM_MAX_NUM_NVLINK_PORTS];
} fmNvlinkFailedDeviceInfo_t;
/**
 * Structure to store information about an unsupported fabric partition
 */
typedef struct
{
    fmFabricPartitionId_t partitionId;            //!< a unique id assigned to reference this partition
    unsigned int numGpus;                         //!< number of GPUs in this partition
    unsigned int gpuPhysicalIds[FM_MAX_NUM_GPUS]; //!< physicalId of each GPU assigned to this partition
} fmUnsupportedFabricPartitionInfo_t;

/**
 * Structure to store information about all the unsupported fabric partitions
 */
typedef struct
{
    unsigned int version;       //!< version number. Use fmFabricPartitionList_version
    unsigned int numPartitions; //!< total number of unsupported partitions
    fmUnsupportedFabricPartitionInfo_t partitionInfo[FM_MAX_FABRIC_PARTITIONS]; /*!< detailed information of each unsupported partition */
} fmUnsupportedFabricPartitionList_v1;
Note: On DGX H100 and NVIDIA HGX H100 systems, the GPU physical ID information has the same value as the GPU Module ID information that is returned by the nvidia-smi -q output. On these systems, when reporting partition information, GPU information such as UUID and PCI Device (BDF) will be empty. The hypervisor stack should use the GPU physical ID information to correlate the GPUs in the partition with the actual GPUs that need to be assigned to the corresponding partition's guest VM.
Parameters
None
Return Values
Parameters
None
Return Values
FM_ST_SUCCESS - if FM API interface library has been properly shut down
FM_ST_UNINITIALIZED - interface library was not in initialized state.
Parameters
connectParams
Valid IP address for the remote host engine to connect to. If ipAddress
is specified as x.x.x.x, it will attempt to connect to the default port
specified by FM_CMD_PORT_NUMBER. If ipAddress is specified as x.x.x.x:yyyy,
it will attempt to connect to the port specified by yyyy. To connect to
an FM instance that was started with a Unix domain socket, fill the socket
path in the addressInfo member and set the addressIsUnixSocket flag.
pfmHandle
Fabric Manager API interface abstracted handle for subsequent API calls
Return Values
FM_ST_SUCCESS - successfully connected to the FM instance
FM_ST_CONNECTION_NOT_VALID - if the FM instance could not be reached
FM_ST_UNINITIALIZED - FM interface library has not been initialized
FM_ST_BADPARAM - pFmHandle is NULL or IP Address∕format is invalid
FM_ST_VERSION_MISMATCH - provided versions of params do not match
Parameters
pfmHandle
Handle that came from fmConnect
Return Values
FM_ST_SUCCESS - successfully disconnected from the FM instance
FM_ST_UNINITIALIZED - FM interface library has not been initialized
FM_ST_BADPARAM - if pFmHandle is not a valid handle
FM_ST_GENERIC_ERROR - an unspecified internal error occurred
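The following is a minimal sketch of the library initialization, connection, and teardown sequence implied by the APIs described above. The header file name, the fmConnectParams_version constant, and any fmConnectParams_t field other than addressInfo and addressIsUnixSocket (which are named in the fmConnect description) are assumptions that should be checked against the headers shipped in the FM development package.

/*
 * Minimal connect/disconnect sketch. Assumptions: the header file name,
 * the fmConnectParams_version constant, and the exact fmConnectParams_t layout.
 */
#include <cstring>
#include <iostream>
#include <nv_fm_agent.h>   // assumed header name from the FM development package

int main()
{
    fmHandle_t fmHandle = NULL;
    fmConnectParams_t connectParams;

    if (fmLibInit() != FM_ST_SUCCESS) {
        std::cout << "Failed to initialize the FM API interface library" << std::endl;
        return 1;
    }

    memset(&connectParams, 0, sizeof(connectParams));
    connectParams.version = fmConnectParams_version;   // assumed version constant
    connectParams.addressIsUnixSocket = 0;             // use the TCP/IP loopback interface
    strncpy(connectParams.addressInfo, "127.0.0.1", sizeof(connectParams.addressInfo) - 1);

    fmReturn_t fmReturn = fmConnect(&connectParams, &fmHandle);
    if (fmReturn != FM_ST_SUCCESS) {
        std::cout << "fmConnect failed. fmReturn: " << fmReturn << std::endl;
        fmLibShutdown();
        return 1;
    }

    /* ... query, activate, or deactivate partitions here ... */

    fmDisconnect(fmHandle);
    fmLibShutdown();
    return 0;
}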
Parameters
pFmHandle
Handle returned by fmConnect()
pFmFabricPartition
Pointer to fmFabricPartitionList_t structure. On success, the list of
supported (static) partition information will be populated in this structure.
Return Values
FM_ST_SUCCESS – successfully queried the list of supported partitions
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM – Invalid input parameters
FM_ST_GENERIC_ERROR – an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data
FM_ST_VERSION_MISMATCH - provided versions of params do not match
Parameters
pFmHandle
Handle returned by fmConnect()
partitionId
The partition id to be activated.
Return Values
FM_ST_SUCCESS – successfully activated the requested fabric partition
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM – Invalid input parameters or unsupported partition id
FM_ST_GENERIC_ERROR – an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data
FM_ST_IN_USE - specified partition is already active or the GPUs are in
use by other partitions.
Parameters:
pFmHandle
Handle returned by fmConnect()
partitionId
The partition id to be activated.
*vfList
List of VFs associated with physical GPUs in the partition. The
ordering of VFs passed to this call is significant, especially for
migration∕suspend∕resume compatibility, the same ordering should be used each
time the partition is activated.
numVfs
Number of VFs
Return Values:
FM_ST_SUCCESS – successfully activated the requested fabric partition
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM – Invalid input parameters or unsupported partition id
FM_ST_GENERIC_ERROR – an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data
FM_ST_IN_USE - specified partition is already active or the GPUs are in
use by other partitions.
Note: Before you start a vGPU VM, this API must be called, even if there is only one vGPU partition.
A multi-vGPU partition activation will fail if MIG mode is enabled on the corresponding GPUs.
Parameters
pFmHandle
Handle returned by fmConnect()
partitionId
The partition id to be deactivated.
Return Values
FM_ST_SUCCESS – successfully deactivated the requested fabric partition
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM – Invalid input parameters or unsupported partition id
FM_ST_GENERIC_ERROR – an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data
FM_ST_UNINITIALIZED - specified partition is not activated
Note: If there are no active partitions when FM is restarted, this call must be made with the number
of partitions as zero.
Parameters
pFmHandle
Handle returned by fmConnect()
pFmActivatedPartitionList
List of currently activated fabric partitions.
Return Values
FM_ST_SUCCESS – FM state is updated with active partition information
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM – Invalid input parameters
FM_ST_GENERIC_ERROR – an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
Note: This API is not supported when FM is running in Shared NVSwitch or vGPU multi-tenancy
resiliency restart (--restart) modes.
Parameters
pFmHandle
Handle returned by fmConnect()
pFmNvlinkFailedDevices
List of GPU or NVSwitch devices that have failed NVLinks.
Return Values
FM_ST_SUCCESS – successfully queried list of devices with failed NVLinks
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM – Invalid input parameters
FM_ST_GENERIC_ERROR – an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
FM_ST_NOT_CONFIGURED - Fabric Manager is initializing and no data
FM_ST_VERSION_MISMATCH - provided versions of params do not match
Note: On DGX H100 and NVIDIA HGX H100 systems, NVLinks are trained at the GPU and NVSwitch hardware level using the ALI feature, without FM coordination. On these systems, FM will always return FM_ST_SUCCESS with an empty list for this API.
Parameters
pFmHandle
Handle returned by fmConnect()
pFmUnupportedFabricPartition
List of unsupported fabric partitions on the system.
Return Values
FM_ST_SUCCESS – successfully queried the list of unsupported fabric partitions
FM_ST_UNINITIALIZED - FM interface library has not been initialized.
FM_ST_BADPARAM – Invalid input parameters
FM_ST_GENERIC_ERROR – an unspecified internal error occurred
FM_ST_NOT_SUPPORTED - requested feature is not supported or enabled
Note: On DGX H100 and NVIDIA HGX H100 systems, this API will always return FM_ST_SUCCESS with an empty list of unsupported partitions.
The first supported virtualization model for NVSwitch-based systems is passthrough device assignment for GPUs and NVSwitch memory fabrics (switches). VMs with 16, eight, four, two, and one GPUs are supported, with predefined subsets of GPUs and NVSwitches used for each VM size.
A subset of GPUs and NVSwitches is referred to as a system partition. Non-overlapping partitions can be mixed and matched, which makes it possible to simultaneously support, for example, an 8-GPU VM, a 4-GPU VM, and two 2-GPU VMs on an NVSwitch-based system with two GPU baseboards. VMs with 16 and eight GPUs have no loss in bandwidth, while smaller VMs trade some bandwidth for isolation by using dedicated switches.
Table 2: Two DGX A100 and NVIDIA HGX A100 Systems Device Assignment

GPUs per VM | NVSwitches per VM | Enabled NVLinks per GPU | Enabled NVLinks per NVSwitch | Constraints
16          | 12                | 12 out of 12            | 32 out of 36                 | None
8           | 6                 | 12 out of 12            | 16 out of 36                 | One set of eight GPUs from each GPU baseboard
4           | 3                 | 6 out of 12             | 6 out of 36                  | Two sets of four GPUs from each GPU baseboard
2           | 1                 | 2 out of 12             | 4 out of 36                  | Four sets of two GPUs from each GPU baseboard
1           | 0                 | 0 out of 12             | 0 out of 36                  | None
Table 3: DGX H100 and NVIDIA HGX H100 Systems Device Assignment

GPUs per VM | NVSwitches per VM | Enabled NVLinks per GPU | Enabled NVLinks per NVSwitch                                     | Constraints
8           | 4                 | 18 out of 18            | 32 out of 64 for two NVSwitches; 40 out of 64 for the other two | None
1           | 0                 | 0 out of 18             | 0 out of 64                                                      | Need to disable GPU NVLinks
10.10. Limitations
▶ NVSwitch errors are visible to the guest VMs.
▶ Windows is only supported for single GPU VMs.
The Shared NVSwitch virtualization model extends the GPU Passthrough model by managing the switches from a single Service VM that runs permanently. The GPUs are made accessible to the Service VM for link training and are then reassigned to the guest VMs. Sharing switches among the guest VMs allows FM to enable more NVLink connections for 2-GPU and 4-GPU VMs, which see reduced bandwidth in the GPU Passthrough model.
NVSwitch units are always assigned as a PCIe passthrough device to the Service VM. GPUs are hot-plugged and hot-unplugged on demand (as PCI passthrough) to the Service VM.
At a high level, the Service VM has the following features:
▶ Provides an interface to query the available GPU VM partitions (groupings) and corresponding
GPU information.
▶ Provides an interface to activate GPU VM partitions, which involves the following:
▶ Training NVSwitch to NVSwitch NVLink interconnects (if required).
▶ Training the corresponding GPU NVLink interfaces (if applicable).
▶ Programming NVSwitch to deny access to GPUs not assigned to the partition.
▶ Provides an interface to deactivate GPU VM partitions, which involves the following:
▶ Untrain (power down) the NVSwitch to NVSwitch NVLink interconnects.
▶ Untrain (power down) the corresponding GPU NVLink interfaces.
▶ Disable the corresponding NVSwitch routing and GPU access.
▶ Report NVSwitch errors through in-band and out-of-band mechanisms.
Note: The resource requirements for the Service VM might vary if it is used for additional functionalities, such as conducting a GPU health check. The specific memory and vCPU demands might also fluctuate depending on the Linux distribution you selected and the OS features you enabled. We recommend that you make the necessary adjustments to the allocated memory and vCPU resources accordingly.
Note: NVSwitches and GPUs on NVIDIA HGX-2 and NVIDIA HGX A100 systems must bind to nvidia.ko
before FM service starts. If the GPUs and NVSwitches are not plugged into the Service VM as part of
OS boot, start the FM service manually or the process directly by running the appropriate command
line options after the NVSwitches and GPUs are bound to nvidia.ko.
In Shared NVSwitch mode, FM supports a resiliency feature, which allows the non-stop forwarding of NVLink traffic between GPUs on active guest VMs after FM gracefully or non-gracefully exits in the Service VM. To support this feature, FM uses /tmp/fabricmanager.state to save certain metadata information. To use a different location/file to store this metadata information, modify the STATE_FILE_NAME FM config file item with the path and file name. By default, FM uses a TCP/IP loopback (127.0.0.1)-based socket interface for communication. To use Unix domain sockets instead, modify the FM_CMD_UNIX_SOCKET_PATH and UNIX_SOCKET_PATH config file options with the Unix domain socket names.
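For example, to keep the resiliency state file and the IPC sockets in a non-default location, the relevant fabricmanager.cfg entries can be overridden as follows (the paths shown are illustrative, not defaults):

STATE_FILE_NAME=/var/run/nvidia-fabricmanager/fabricmanager.state
FM_CMD_UNIX_SOCKET_PATH=/var/run/nvidia-fabricmanager/fm_cmd.socket
UNIX_SOCKET_PATH=/var/run/nvidia-fabricmanager/fm_internal.socket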
if ( operation > 0 ) {
std::cout << "Input Shared Fabric Partition ID: \n";
std::cin >> partitionId;
std::string buffer;
std::cin >> buffer;
if (buffer.length() > sizeof(hostIpAddress) - 1){
std::cout << "Invalid IP address.\n" << std::endl;
return FM_ST_BADPARAM;
} else {
buffer += '\0';
strncpy(hostIpAddress, buffer.c_str(), 15);
}
if ( operation == 0 ) {
    /* List supported partitions */
    partitionList.version = fmFabricPartitionList_version;
    fmReturn = fmGetSupportedFabricPartitions(fmHandle, &partitionList);
    if (fmReturn != FM_ST_SUCCESS) {
        std::cout << "Failed to get partition list. fmReturn: " << fmReturn << std::endl;
    } else {
        /* Only printing number of partitions for brevity */
        std::cout << "Total number of partitions supported: " << partitionList.numPartitions << std::endl;
    }
} else if ( operation == 1 ) {
    /* Activate a partition */
    fmReturn = fmActivateFabricPartition(fmHandle, partitionId);
    if (fmReturn != FM_ST_SUCCESS) {
        std::cout << "Failed to activate partition. fmReturn: " << fmReturn << std::endl;
    }
} else if ( operation == 2 ) {
    /* Deactivate a partition (partition deactivation API described earlier in this guide) */
    fmReturn = fmDeactivateFabricPartition(fmHandle, partitionId);
    if (fmReturn != FM_ST_SUCCESS) {
        std::cout << "Failed to deactivate partition. fmReturn: " << fmReturn << std::endl;
    }
} else {
    std::cout << "Unknown operation." << std::endl;
}

/* Clean up */
fmDisconnect(fmHandle);
fmLibShutdown();
return fmReturn;
}
# Makefile for the above sample, assuming the source is saved into sampleCode.cpp
# Note: Change the default include paths (/usr/include & /usr/lib) based on the FM
# API header files location.
IDIR := /usr/include
CXXFLAGS = -I $(IDIR)
LDIR := /usr/lib
LDFLAGS = -L$(LDIR) -lnvfm

sampleCode: sampleCode.o
	$(CXX) -o $@ $< $(CXXFLAGS) $(LDFLAGS)

clean:
	-@rm -f sampleCode.o
	-@rm -f sampleCode
Note: Do not leave the guest VMs in this state for a longer period.
Note: The sequences will be different depending on the NVSwitch generation used in the system.
The key difference is whether the GPU needs to be attached to Service VM and bound to nvidia.ko.
Note: If the GPUs get PCIe reset as part of guest VM launch, the GPU NVLinks will be in an InActive
state on the guest VM. Also, starting the guest VM without a GPU reset might require a modification
in your hypervisor VM launch sequence path.
Note: The sequences will be different depending on the NVSwitch generation used in the system.
Options include:
[-h | --help]: Displays help information
[-v | --verbose]: Verbose output including all Request and Response table entries
[-f | --full-matrix]: Display all possible GPUs including those with no connecting paths
The following example output shows the maximum GPU NVLink connectivity when an 8-GPU VM par-
tition on an NVIDIA HGX A100 is activated.
root@host1-servicevm:~# ./nvswitch-audit
GPU Reachability Matrix
GPU 1 2 3 4 5 6 7 8
1 X 12 12 12 12 12 12 12
2 12 X 12 12 12 12 12 12
3 12 12 X 12 12 12 12 12
4 12 12 12 X 12 12 12 12
5 12 12 12 12 X 12 12 12
6 12 12 12 12 12 X 12 12
7 12 12 12 12 12 12 X 12
8 12 12 12 12 12 12 12 X
The vGPU virtualization model supports VF passthrough by enabling SR-IOV functionality in all the
supported GPUs and assigning a specific VF, or set of VFs, to the VM.
▶ GPU NVLinks are assigned to only one VF at a time.
▶ NVLink P2P between GPUs that belong to different VMs or partitions is not supported.
Refer to the vGPU Software User Guide for more information about the supported vGPU functionality,
features, and configurations.
Note: The vGPU-based deployment model is not supported on first-generation NVSwitch-based systems such as DGX-2 and NVIDIA HGX-2.
Note: The vGPU-based deployment model is not supported on the current release of DGX H100 and
NVIDIA HGX H100 systems. NVIDIA plans to add this support in a future software release.
12.2.1. OS Image
Refer to the vGPU Software User Guide for the list of supported OSs and hypervisors, and for information about installing and configuring the vGPU host driver software.
Note: Both packages must be the same version as the Driver package.
Note: NVSwitches must bind to nvidia.ko before the FM service starts. On DGX A100 and NVIDIA
HGX A100 systems, all the GPUs must also be bound to nvidia.ko before the FM service starts.
In the vGPU virtualization mode, FM supports a resiliency feature that allows the continuous forwarding of NVLink traffic between GPUs on active guest VMs after FM exits (gracefully or non-gracefully) on the vGPU host. To support this feature, FM uses /tmp/fabricmanager.state to save certain metadata information. To use a different location/file to store this metadata information, modify the STATE_FILE_NAME FM config file item with the new path and file name.
By default, FM uses a TCP/IP loopback (127.0.0.1)-based socket interface for communication. To use Unix domain sockets instead, modify the FM_CMD_UNIX_SOCKET_PATH and UNIX_SOCKET_PATH FM config file options with the new Unix domain socket names.
Note: Partition activation is always required before starting a vGPU VM, even for VMs that use only
one vGPU.
The ordering of VFs used during partition activation and VM assignment must remain consistent to
ensure the correct suspend, resume, and migration operations.
Refer to Installing and Configuring the NVIDIA Virtual GPU Manager for Red Hat Enterprise Linux KVM (https://docs.nvidia.com/vgpu/latest/grid-vgpu-user-guide/index.html#red-hat-el-kvm-install-configure-vgpu) for more information about SR-IOV VF enablement and assigning VFs to VMs.
FM provides several High Availability Mode (Degraded Mode) configurations that allow system administrators to set appropriate policies when there are hardware failures, such as GPU failures, NVSwitch failures, NVLink connection failures, and so on, on NVSwitch-based systems. With this feature, system administrators can keep a subset of available GPUs in use while waiting to replace failed GPUs, baseboards, and so on.
DGX A100, NVIDIA HGX A100, DGX H100, and NVIDIA HGX H100 systems have different behaviors. Refer to Error Handling for more information.
Note: These high availability modes and their corresponding dynamic reconfiguration of the NVSwitch-based system are applied in response to errors that are detected during FM initialization. Runtime
errors that occur after the system is initialized, or when a GPU job is running, will not trigger these
high availability mode policies.
Figure 5: Shared NVSwitch and vGPU Partitions When a GPU Access NVLink Fails
Note: This option applies only to systems with two GPU baseboards.
In this mode, if an NVSwitch has one or more trunk NVLink failures, the NVSwitch will be disabled along with its peer NVSwitch. This reduces the available bandwidth to 5/6 throughout the fabric. If multiple NVSwitches have trunk NVLink failures, FM will fall back to the TRUNK_LINK_FAILURE_MODE=0 behavior described above.
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
Note: Currently, the TRUNK_LINK_FAILURE_MODE=1 configuration is not supported in the vGPU Multitenancy Mode.
▶ NVSWITCH_FAILURE_MODE=1
In this mode, when there is an NVSwitch failure, the NVSwitch will be disabled along with its peer NVSwitch. This will reduce the available bandwidth to 5/6 throughout the fabric. If multiple NVSwitch failures happen, FM will fall back to the NVSWITCH_FAILURE_MODE=0 behavior described above.
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
Figure 6: Shared NVSwitch and vGPU Partitions When an NVSwitch has Failed
▶ NVSWITCH_FAILURE_MODE=1
In the Shared NVSwitch mode, all the GPU partitions will be available, but the partitions will have their available bandwidth reduced to 5/6 throughout the fabric. If multiple NVSwitch failures happen, FM will fall back to the NVSWITCH_FAILURE_MODE=0 behavior described above.
This mode is effective only on DGX A100 and NVIDIA HGX A100 NVSwitch-based systems.
Note: Currently, the NVSWITCH_FAILURE_MODE=1 configuration is not supported in the vGPU Multitenancy Mode.
Figure 7: Shared NVSwitch and vGPU Partitions When a GPU is Missing/Has Failed
The GPU exclusion flow can be broken down into the following phases:
1. Running application error handling.
2. Diagnosing GPU failures.
3. Remediating the error.
The steps for each of these phases can vary based on whether the system is running in bare metal or
in virtualized mode. The following sections describe the flow for bare metal and virtualized platforms.
Errors faced by the GPU during active execution, such as GPU ECC errors, GPU falling off the bus, and
so on, are reported through the following means:
▶ /var/log/syslog as an XID message
▶ DCGM
▶ NVIDIA Management Library (NVML)
▶ GPU SMBPBI-based OOB commands
▶ the FM log file
System administrators can create their own GPU monitoring/health check scripts to look for the error traces. This process requires monitoring at least one of the above-mentioned sources (syslog, NVML APIs, and so on) to collect the necessary data.
DCGM includes an exclusion recommendation script that can be invoked by a system administrator
to collect the GPU error information. This script queries information from the passive monitoring
performed by DCGM to determine whether any conditions that might require a GPU to be excluded
have occurred since the previous time the DCGM daemon was started. As part of the execution, the
script invokes a validation test that determines whether unexpected XIDs are being generated by the execution of a known good application. Users can prevent the validation test from being run and choose to only monitor the passive information.
The DCGM exclusion recommendation script code is provided as a reference for system administrators
to extend as appropriate or build their own monitoring/health check scripts.
Note: Refer to the NVIDIA DCGM Documentation for more information about the exclusion recom-
mendation script such as its location and supported options.
The GPU kernel driver on NVSwitch-based systems can be configured to ignore a set of GPUs, even
if the GPUs were enumerated on the PCIe bus. The GPUs to be excluded are identified by the GPU’s
unique identifier (GPU UUID) via a kernel module parameter. After identifying whether the GPU exclude
candidates are in the system, the GPU kernel module driver will exclude the GPU from being used by
applications. If a GPU UUID is in the exclude candidate list, but the UUID was not detected at runtime
because the UUID belonged to a GPU that is not on the system or because the PCIe enumeration of
the GPU board failed, the GPU is not considered to have been excluded.
The list of exclude candidate GPUs can be persisted across reboots by specifying the module parameters in a .conf file in the filesystem. The exclude mechanism is specific to a GPU, rather than a physical location on the baseboard. As a result, if a GPU on the exclude candidate list is later replaced by a new GPU, the new GPU will become visible to the system without updating the exclude candidates. Conversely, if a GPU has been excluded on a system, placing it in a different PCIe slot will still prevent the GPU from being visible to applications, unless the exclude candidate list is updated. Updating the GPU exclude candidates requires manual intervention by the system administrator.
The set of candidate GPU UUIDs to be excluded is specified by using a kernel module parameter that consists of a set of comma-separated GPU UUIDs.
▶ The kernel parameter can be specified when the kernel module loads nvidia.ko.
insmod nvidia.ko NVreg_ExcludedGpus=uuid1,uuid2…
▶ To make the GPU UUID persistent, the set of exclude candidate GPU UUIDs can also be specified by using an nvidia.conf file in /etc/modprobe.d.
options nvidia NVreg_ExcludedGpus=uuid1,uuid2…
Adding GPUs into the exclude candidate list is a manual step that must be completed by a system
administrator.
Note: The previously supported NVreg_GpuBlacklist module parameter option has been deprecated and will be removed in a future release.
To add a GPU to the exclude candidate list or to remove it from the list, the system administrator must complete the following steps:
1. If a conf file does not exist, create a conf file for the nvidia kernel module parameters.
2. Complete one of the following tasks:
▶ Add the UUID of the excluded GPU into the .conf file.
▶ Remove the UUID of the GPU from the list.
3. Restart the system to load the kernel driver with updated module parameters.
An excluded GPU is not visible in CUDA applications or in basic queries by using nvidia-smi -q or
through NVML. This section provides information about the options to identify when a GPU has been
excluded, for example, the GPU’s UUID was in the exclude candidate list, and the GPU was detected in
the system.
13.6.1.8 nvidia-smi
13.6.1.9 Procfs
Refer to the NVIDIA GPU SMBus Post-Box Interface (SMBPBI) documentation for more information.
The following section provides information about the recommended flow that a system administrator
should follow to run GPU monitoring health checks or the DCGM exclusion recommendation script on
various system configurations.
In bare metal and vGPU virtualization configurations, the system administrator runs the health checks in the same OS instance as the application programs. Here is the general flow that a system administrator will follow:
1. Periodically run the health check script or the DCGM exclusion recommendation script for all the
GPUs and NVSwitches on the system.
2. (Optional) Monitor the system logs to trigger a run of the health check script or DCGM exclusion
recommendation script.
3. Based on the output of the health check or exclusion recommendation script, add the GPU UUID
to the exclude candidate list.
4. Also, if you are using the DCGM exclusion recommendation script, update the periodic run of the
exclude recommendation script with the newly expected GPU count.
5. Reboot the system to load the kernel driver with updated module parameters.
The primary difference in virtualized configurations is that the GPU kernel driver is left to the guest
VMs. As a result, the execution of the GPU diagnosis and remediation phases must be performed by
the hypervisor with the VM provisioning mechanism.
Here is the general flow that a hypervisor will follow:
1. The guest VM finishes and returns control of a set of GPUs and switches to the hypervisor.
2. The hypervisor invokes a special test VM that it trusts. In the test VM, there should be a complete instance of the NVIDIA NVSwitch core software stack, including GPU drivers and FM.
3. On this test VM, run the health check script or DCGM exclusion recommendation script.
4. Based on the output of the health check or exclusion recommendation script, add the GPU UUID to a hypervisor-readable database.
5. The hypervisor shuts down the test VM.
6. The hypervisor reads the database to identify the candidates for exclusion and updates its resource allocation mechanisms to prevent those GPUs from being assigned to future VM requests.
7. After the GPU board has been replaced, the hypervisor updates the database to make the GPU available again.
In a shared NVSwitch virtualization configuration, system administrators can run their GPU health
check script or DCGM exclusion recommendation script in a dedicated test VM, or on DGX A100 and
NVIDIA HGX A100 systems, in the Service VM immediately after the GPU partition is activated.
To run GPU health on a special test VM:
1. The guest VM completes and returns control of the GPUs in the partition to the hypervisor.
2. After the shared NVSwitch guest VM shutdown procedure is complete, activate the same GPU
partition again.
3. The hypervisor schedules a special test VM, which is trusted on those GPUs.
4. On this test VM, run the health check script or DCGM exclusion recommendation script.
5. Based on the output of the health check or exclusion recommendation script, add the GPU UUID
into a hypervisor readable database.
6. If the partition activation/deactivation cycle is consistently failing, the hypervisor can consider adding all the GPU UUIDs of a partition to the database.
7. After the health check is complete, shut down the test VM.
8. The hypervisor reads the database to identify the candidates for exclusion and removes the corresponding GPU partitions from its currently supported partitions.
9. The hypervisor resource allocation mechanisms ensure that the affected GPU partitions will not be activated.
10. When the Service VM is rebooted, the hypervisor can choose not to bind the excluded GPUs to
the Service VM.
This way, FM will adjust its currently supported GPU partitions.
11. When the GPU board has been replaced, the hypervisor updates the database to make the GPU available and restarts the Service VM with all the GPUs to enable the previously disabled GPU partitions again.
To run GPU health checks in the Service VM on DGX A100 and NVIDIA HGX A100 systems:
1. The fmActivateFabricPartition() call returned successfully in a Shared NVSwitch partition
activation flow.
2. Before the hypervisor detaches/unbinds the GPUs in the partition, run the required health check
script or DCGM exclusion recommendation script on those GPUs in the Service VM.
3. Based on the output of the health check or exclusion recommendation script, add the GPU UUID
into a hypervisor readable database.
4. When the health check fails, the hypervisor executes the partition deactivation flow by using
fmDeactivateFabricPartition(), and the corresponding guest VM launch is deferred.
5. If the partition activation/deactivation cycle is consistently failing, the hypervisor can consider
adding the GPU UUIDs of the partition to the database.
6. The hypervisor reads the database to identify the candidates for exclusion and removes the cor-
responding GPU partitions from its currently supported partitions.
7. The hypervisor resource allocation mechanisms ensure that the affected GPU partitions will not
be activated.
8. After the Service VM is rebooted, the hypervisor can choose not to bind the excluded GPUs to
the Service VM.
This way, FM will adjust its currently supported GPU partitions.
9. After the GPU board has been replaced, the hypervisor updates the database to make the GPU
available and restarts the Service VM with the GPUs to enable previously disabled GPU partitions
again.
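A minimal sketch of this sequence from the Service VM's perspective follows; the database path and
diagnostic level are assumptions, and the fmActivateFabricPartition() and
fmDeactivateFabricPartition() calls are made by the hypervisor's management software, not by this
script, so they appear only as comments.

#!/usr/bin/env bash
# Hypothetical Service VM health check run between partition activation and
# GPU detach. The hypervisor has already called fmActivateFabricPartition()
# for this partition (step 1).
EXCLUDE_DB=/var/lib/hypervisor/gpu-exclude.db   # placeholder path

# Step 2: run the health check on the GPUs of the activated partition while
# they are still visible to the Service VM.
if dcgmi diag -r 3; then
    echo "Health check passed; the hypervisor may detach the GPUs and launch the guest VM."
else
    # Step 3: record the failing GPU UUIDs (listed by nvidia-smi -L) in the
    # hypervisor-readable database.
    nvidia-smi -L
    echo "Health check failed; record the failing GPU UUIDs in ${EXCLUDE_DB}."
    # Step 4: the hypervisor then calls fmDeactivateFabricPartition() and
    # defers the corresponding guest VM launch.
    exit 1
fi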
Similar to the GPU exclusion feature, the NVSwitch kernel driver on NVSwitch-based systems can be
configured to ignore an NVSwitch even when it has been enumerated on the PCIe bus. If NVSwitch
exclusion candidates are present in the system, the NVSwitch kernel module driver will exclude them
from being used by applications. If an NVSwitch UUID is in the exclusion candidate list, but the UUID
is not detected at runtime because the UUID belongs to an NVSwitch that is not in the system, or
because PCIe enumeration of the NVSwitch fails, the NVSwitch is not considered to have been excluded.
Also, in NVIDIA HGX A100 systems with two GPU baseboards, if an NVSwitch is explicitly excluded, FM
will manually exclude its peer NVSwitch across the Trunk NVLinks. This behavior can be configured
with the NVSWITCH_FAILURE_MODE high availability configuration file item.
▶ To specify a candidate NVSwitch UUID as a kernel module parameter, run the following command:
insmod nvidia.ko NvSwitchExcludelist=<NVSwitch_uuid>
▶ To make the NVSwitch UUID persistent, specify the UUID using an nvidia.conf file in /etc/modprobe.d:
options nvidia NvSwitchExcludelist=<NVSwitch_uuid>
The system administrator can get the NVSwitch UUID from the FM log file and add the UUID into the
excluded candidate list.
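For example, assuming the default FM log location (the path is configurable through the FM log
settings, so it may differ on your system), the NVSwitch UUIDs reported by FM can be located with a
simple search:
grep -i uuid /var/log/fabricmanager.log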
Note: The previously supported NvSwitchBlacklist module parameter option has been deprecated
and will be removed in a future release.
Refer to SMBus Post Box Interface (SMBPBI) for more information about NVSwitch.
The following section lists the link IDs used by each GPU to connect to each NVSwitch on different
versions of NVIDIA HGX baseboards.
This chapter provides information about the default Shared NVSwitch and vGPU partitions for various
GPU baseboards.
In this generation of NVSwitch, NVLink port resets must be issued in even-odd pairs of links.
As a result, NVIDIA HGX-2 and DGX-2 systems support only a fixed mapping of Shared NVSwitch partitions.
Due to this limitation, the four-GPU and two-GPU VMs can enable only five of the six NVLinks per GPU.
Note: The GPU Physical IDs are based on how the GPU baseboard NVSwitch GPIOs are strapped. If
only one baseboard is present and the GPIOs are strapped as for the bottom tray, the GPU Physical
IDs range from 1 to 8. If the baseboard is strapped as for the top tray, the GPU Physical IDs range
from 9 to 16.
Note: NVIDIA will evaluate any custom partition definition requests and variations of the above-
mentioned policy on a case-by-case basis and will provide necessary information to configure/override
the default GPU partitions.
The FM resiliency feature in the Shared NVSwitch and vGPU models allows system administrators to
resume normal operation after FM exits, gracefully or non-gracefully, in the Service VM. With this
feature, currently activated guest VMs will continue to forward NVLink traffic even when FM is not
running. After FM is successfully restarted, it will again support the typical guest VM
activation/deactivation workflow.
NVSwitch and GPU NVLink errors that are detected while FM is not running will be cached in the
NVSwitch driver and reported after FM has successfully restarted. Also, changing the FM version
while FM is not running is not supported.
Until the hypervisor provides the list of currently activated guest VM partitions, partition
operations such as querying the list of supported guest VM partitions, activating and deactivating
guest VM partitions, and so on, will be rejected. After the list of active guest VM partition
information is received from the hypervisor, FM will ensure that routing is enabled only for those
partitions. After these steps have been completed, FM will enable the typical guest VM partition
activation and deactivation workflow.
If FM cannot resume from the current state or the hypervisor does not provide the list of currently
activated guest VM partitions before the timeout period, the restart operation will be aborted, and FM
will exit.
Figure 8 shows the high-level flow when FM is started with typical command-line options and the
--restart option.
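As an illustration, assuming the default Fabric Manager configuration file path on the Service VM
(the path and any service wrapper may differ on your installation), restarting FM in this mode
could look like the following:
nv-fabricmanager -c /usr/share/nvidia/nvswitch/fabricmanager.cfg --restart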
18.1. Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a
certain functionality, condition, or quality of a product. NVIDIA Corporation (“NVIDIA”) makes no repre-
sentations or warranties, expressed or implied, as to the accuracy or completeness of the information
contained in this document and assumes no responsibility for any errors contained herein. NVIDIA shall
have no liability for the consequences or use of such information or for any infringement of patents
or other rights of third parties that may result from its use. This document is not a commitment to
develop, release, or deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any
other changes to this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that
such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the
time of order acknowledgement, unless otherwise agreed in an individual sales agreement signed by
authorized representatives of NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects
to applying any customer general terms and conditions with regards to the purchase of the NVIDIA
product referenced in this document. No contractual obligations are formed either directly or indirectly
by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military,
aircraft, space, or life support equipment, nor in applications where failure or malfunction of the NVIDIA
product can reasonably be expected to result in personal injury, death, or property or environmental
damage. NVIDIA accepts no liability for inclusion and/or use of NVIDIA products in such equipment or
applications and therefore such inclusion and/or use is at customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for
any specified use. Testing of all parameters of each product is not necessarily performed by NVIDIA.
It is customer’s sole responsibility to evaluate and determine the applicability of any information con-
tained in this document, ensure the product is suitable and fit for the application planned by customer,
and perform the necessary testing for the application in order to avoid a default of the application or
the product. Weaknesses in customer’s product designs may affect the quality and reliability of the
NVIDIA product and may result in additional or different conditions and/or requirements beyond those
contained in this document. NVIDIA accepts no liability related to any default, damage, costs, or prob-
lem which may be based on or attributable to: (i) the use of the NVIDIA product in any manner that is
contrary to this document or (ii) customer product designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other
NVIDIA intellectual property right under this document. Information published by NVIDIA regarding
third-party products or services does not constitute a license from NVIDIA to use such products or
services or a warranty or endorsement thereof. Use of such information may require a license from a
third party under the patents or other intellectual property rights of the third party, or a license from
NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA
in writing, reproduced without alteration and in full compliance with all applicable export laws and
regulations, and accompanied by all associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE
BEING PROVIDED “AS IS.” NVIDIA MAKES NO WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR
OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WAR-
RANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES,
INCLUDING WITHOUT LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CON-
SEQUENTIAL DAMAGES, HOWEVER CAUSED AND REGARDLESS OF THE THEORY OF LIABILITY, ARIS-
ING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN ADVISED OF THE POSSIBILITY
OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason whatso-
ever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein
shall be limited in accordance with the Terms of Sale for the product.
18.2. OpenCL
OpenCL is a trademark of Apple Inc. used under license to the Khronos Group Inc.
18.3. Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA Corporation in the
U.S. and other countries. Other company and product names may be trademarks of the respective
companies with which they are associated.
Copyright
©2020-2024, NVIDIA Corporation & affiliates. All rights reserved.