Bitfusion Perf Best Practices
The recommendations in this guide are based on our performance testing using NVIDIA V100 (Volta
architecture) GPUs with 16GB of graphics memory and NVIDIA T4 (Turing architecture) GPUs with 16GB of
graphics memory. Our testbed configurations are in table 1, below.
Table 1. Testbed configuration
• Server: Dell EMC DSS 8440
• GPUs: 10x V100 Volta 16GB or 32GB
• Network: Mellanox ConnectX-5, Intel Ethernet Controller 10G X550
The V100 Volta GPU has 5120 CUDA cores, while the T4 Turing GPU has 2560 CUDA cores. Both GPUs are
capable of machine learning (ML) training and inferencing. Training time scales roughly inversely with the
number of CUDA cores, which is why the V100 Volta delivers shorter training times. Both GPUs are capable of
mixed-precision arithmetic.
The T4 Turing GPU is capable of INT4 operations, which significantly boost inferencing throughput. Another
important consideration in choosing a GPU is power consumption: the T4 Turing GPU consumes 70 watts of
power, while the V100 Volta consumes 300 watts.
3. Software Requirements
Table 2 lists the minimum software versions required to use Bitfusion.
Guest operating systems: Ubuntu 18.04, Ubuntu 16.04, CentOS 7.0+, RHEL 7.4+
For the best performance, the network adapters should support the following features:
• Checksum offload
• TCP segmentation offload (TSO)
• Ability to handle high-memory DMA (that is, 64-bit DMA addresses)
• Ability to handle multiple scatter/gather elements per Tx frame
• Jumbo frames (JF)
• Receive side scaling (RSS)
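As an illustration, the commands below show one way to check and enable several of these features from within a Linux guest. This is a minimal sketch: the interface name ens160 is an assumption, and your NIC, driver, vSwitch, and physical switches must all support the corresponding features (jumbo frames in particular must be enabled end to end).

ethtool -k ens160 | egrep 'checksum|segmentation|scatter-gather'   # show current offload settings
ethtool -K ens160 tso on tx on rx on                               # enable TSO and Tx/Rx checksum offload
ip link set dev ens160 mtu 9000                                    # enable jumbo frames on the interface
ip link show dev ens160                                            # verify the new MTU took effect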
Our recommendations are based on performance studies on the hardware in our labs. You might need to tune
the above parameters to get the best performance for your application on your hardware.
In addition to the wire speed of your network hardware and the features listed above, the device
configuration can also affect performance. Many of these configuration options are addressed in the following
sections.
Table 4, below, shows the CPU, throughput, and latency for the three network device configurations. For CPU
and latency, lower is better. For throughput, higher is better.
Network Device Configuration    CPU    Throughput    Latency    Recommended Workload
* With PVRDMA, network traffic between VMs on the same host doesn’t go through the physical NIC, so it might perform
better than passthrough.
Passthrough (using Direct I/O) achieved the best performance across all workloads, but it doesn’t support
vSphere virtualization features. For high performance computing workloads, use passthrough.
Our studies used QLogic FastLinQ QL41xxx 1/10/25 gigabit Ethernet adapters on the client side. Table 5 shows
the settings that provided the best performance in our test environment, but note that different hardware and
workloads might perform best with other settings.
Our studies used Intel® Ethernet Controller 10G X550 Ethernet adapters. Table 6 shows the settings that
provided the best performance in our test environment, although different hardware and workloads might
perform best with other settings.
We recommend the default settings for PVRDMA using virtual hardware version 17.
For more information on how to set up PVRDMA, refer to the following documentation:
For passthrough networking, refer to the guest operating system instructions for tuning network performance
for features such as LRO. Examples follow.
• Disabling LRO
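For example, on a Linux guest you might check and disable LRO with ethtool. This is a sketch only; the interface name ens160 is an assumption, and some drivers expose LRO through module options instead.

ethtool -k ens160 | egrep 'large-receive-offload|generic-receive-offload'   # check current LRO/GRO state
ethtool -K ens160 lro off                                                   # disable LRO on this interface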
Tunable           Setting
C1E               Disabled
C States          Disabled
Monitor/Mwait     Enabled
In most environments, ESXi allows significant levels of CPU overcommitment (that is, running more vCPUs on
a host than the total number of physical processor cores in that host) without impacting virtual machine
performance. If an ESXi host becomes CPU saturated (that is, the virtual machines and other loads on the host
demand all the CPU resources the host has), latency-sensitive workloads might not perform well. In this case,
you might want to reduce the CPU load—for example, by powering off some virtual machines or migrating
them to a different host (or allowing DRS to migrate them automatically). Because Bitfusion servers use GPUs
in passthrough mode, they can’t be vMotioned. However, other non-GPU VMs, including Bitfusion client VMs,
can be migrated to other hosts.
Using esxtop or resxtop, you should monitor the CPU load of the hosts running Bitfusion client and server
VMs. A load average equal to or greater than 1 on the first line of the esxtop CPU panel indicates that the CPU
is overloaded. In general, 80% utilization of the physical CPU is a reasonable upper bound, leaving some
headroom for periodic spikes in CPU load.
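One way to capture this data over time is to run esxtop in batch mode and review the resulting CSV offline; the sampling interval and iteration count below are arbitrary examples.

esxtop -b -d 5 -n 12 > /tmp/esxtop-cpu.csv   # batch mode: sample every 5 seconds for 12 iterations, write output to a CSV file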
Bitfusion Server: 1.5 * (aggregate of all GPU memory on all GPU cards) + minimum memory for your application
Bitfusion Client: 1.5 * (graphics memory on the requested GPUs) + minimum memory for your application
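For example, assuming a Bitfusion server backing ten 16GB GPUs and an application that needs at least 16GB of memory (both figures are illustrative), the guideline gives 1.5 * (10 * 16GB) + 16GB = 256GB for the server VM; a client requesting two of those GPUs would need 1.5 * (2 * 16GB) + 16GB = 64GB.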
Configuring a virtual machine with more vCPUs than its workload can use might cause slightly increased
resource usage, potentially impacting performance on very heavily loaded systems. Common examples of this
include a single-threaded workload running in a multiple-vCPU virtual machine or a multi-threaded workload
in a virtual machine with more vCPUs than the workload can effectively use. Even if the guest operating
system doesn’t use some of its vCPUs, configuring virtual machines with those vCPUs still imposes some small
resource requirements on ESXi that translate to real CPU consumption on the host.
Based on our performance studies, Bitfusion client and server VMs require a minimum of 4 vCPUs. You might
need to increase the number of vCPUs depending on the number of GPUs.
6.3.3.1. GPU PCIe Socket, Network Adapter PCIe Socket, and NUMA Affinity Settings in Bitfusion Server VM
GPU cards and NICs are placed in PCIe physical slots. Each physical slot is associated with a particular NUMA
node on the host. Refer to figure 2. The GPU with PCI address 0000:3b:0000 is associated with NUMA node 0,
so a VM using that GPU should have its vCPUs associated with NUMA node 0; the GPU with PCI address
0000:d8:0000 is associated with NUMA node 1, so a VM using it should have its vCPUs on NUMA node 1. The
CPU transfers data first from storage to its main memory and then to the high-speed memory on the GPU
card, so keeping the vCPUs, memory, and GPU on the same NUMA node shortens that path.
You can use the VMkernel system information shell (vsish) utility on the ESXi host running the Bitfusion server
VM to find out which NUMA node the GPU card is associated with. The vsish command listed below requires
the GPU card's PCI address (bus, device, and function) in decimal format. Use the lspci command on an ESXi
host to get these values.
vsish> cat /hardware/pci/seg/0/bus/<pci_bdf_address_of_gpucard_in_decimal>/slot/0/func/0/pciConfigHeader
(The output lists the numa_node.)
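For example, the following commands show one way to locate the GPU on the ESXi host and convert its hexadecimal bus number into the decimal value used in the vsish path. This is a sketch; the bus number 0x3b is a placeholder taken from the earlier example.

lspci | grep -i nvidia   # find the GPU's PCI address, shown in hex (for example, 0000:3b:00.0)
echo $((0x3b))           # convert the hex bus number to decimal (prints 59)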
We recommend setting the Bitfusion server VM's numa.nodeAffinity to the NUMA node associated with the
GPU card, which you can change in Advanced VM options.
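For example, if the GPU is attached to NUMA node 0, the corresponding entry (added as an advanced configuration parameter in the vSphere Client or directly in the VM's .vmx file) might look like the following; the node number here is only an illustration and depends on your hardware.

numa.nodeAffinity = "0"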
The NVIDIA system management interface (nvidia-smi) utility provides monitoring and management
capabilities for each of the installed GPUs. The command nvidia-smi -q displays detailed information about the
GPUs. The command nvidia-smi topo -mp reports the CPU affinity of each GPU.
Legend:
• X = Self
• SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
• NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a
NUMA node
• PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
• PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
• PIX = Connection traversing at most a single PCIe bridge
The default Bitfusion server and client VMs are configured with 12 vCPUs (numvcpus). The setting
cpuid.coresPerSocket = 1 (one core per socket) defines the virtual NUMA topology of the virtual machine. All
12 virtual sockets are grouped into a single physical domain, which means the vCPUs are scheduled within a
single physical CPU package, which typically corresponds to a single physical NUMA node.
The NUMA scheduler in ESXi autosizes the vNUMA client. For details, refer
to https://frankdenneman.nl/2016/08/22/numa-deep-dive-part-5-esxi-vmkernel-numa-constructs/.
During the initial boot, the VMkernel adds two advanced settings to the virtual machine:
numa.autosize.vcpu.maxPerVirtualNode = X
numa.autosize.cookie = "XXXXXX"
The autosize setting reflects the number of vCPUs inside the NUMA node. Do not change this setting unless
the number of vCPUs in the VM changes. This is of particular interest for clusters that contain heterogeneous
host configurations. If your cluster contains hosts with different core counts, you could end up with a NUMA
misalignment. In this scenario, the following advanced settings can be used:
numa.autosize.once = FALSE
numa.autosize = TRUE
This forces the NUMA scheduler to reconfigure the NUMA clients at every power cycle.
The guidelines for scaling the number of concurrent Bitfusion clients sharing a GPU are empirical:
Performance checks:
=======================
[PASS ] Check Interface/Subnet Compatibility: network interfaces and subnets configured correctly
[PASS ] Check ulimit -n >= 4096: 4096
[MARGINAL] Check MTU Size: 10000Mbps interface ens160 has low MTU: 1500 < 4K
• Check Interface/Subnet Compatibility – This check verifies that your network interfaces and subnets are
configured consistently:
– Issue a FATAL condition if Ethernet and InfiniBand interfaces share the same subnet
– Issue a MARGINAL condition if interfaces with different speeds share the same subnet
• Check ulimit – Bitfusion may require many open file descriptors to perform well. This check looks at your
Linux user limit for open descriptors and warns you if it is less than 4096.
• MTU Size Check – Bitfusion performance relies heavily on a healthy, low-latency, high-speed network.
Because you pay a latency penalty with every packet sent over the network, you should send a few large
packets instead of many small packets. This check determines if you have a large (≥4K) MTU (maximum
transfer unit) setting for all high-speed (≥10 Gbps) interfaces. You can ignore this check for interfaces that
you won’t use with Bitfusion.
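To address these last two checks on a Linux client, you could raise the open-file limit and confirm the MTU of the high-speed interface, along the lines of the sketch below. The interface name ens160 and the limit value are assumptions; persistent changes belong in /etc/security/limits.conf and your distribution's network configuration.

ulimit -n                             # show the current open-file limit for this session
ulimit -n 8192                        # raise the limit for the current session
ip link show dev ens160 | grep mtu    # confirm the interface MTU is at least 4K (for example, 9000)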