MLNX_EN Documentation Rev 4.9-5.1.0.0 LTS - 10/23/2022
Uninstalling MLNX_EN ............................................................................. 84
Uninstalling MLNX_EN Using the YUM Tool ................................................. 84
Uninstalling MLNX_EN Using the apt-get Tool ............................................. 84
Updating Firmware After Installation ........................................................... 84
Updating the Device Online .................................................................. 85
Updating Firmware and FPGA Image on Innova IPsec Cards............................. 85
Updating the Device Manually ............................................................... 86
Ethernet Driver Usage and Configuration ...................................................... 86
Performance Tuning ............................................................................... 89
Features Overview and Configuration .....................................................90
Ethernet Network .................................................................................. 90
ethtool Counters ............................................................................... 90
Interrupt Request (IRQ) Naming ............................................................. 91
Quality of Service (QoS) ...................................................................... 92
Quantized Congestion Notification (QCN)................................................. 101
Ethtool.......................................................................................... 103
Checksum Offload ............................................................................ 108
Ignore Frame Check Sequence (FCS) Errors............................................... 109
RDMA over Converged Ethernet (RoCE).................................................... 109
Flow Control ................................................................................... 110
Explicit Congestion Notification (ECN) .................................................... 115
RSS Support .................................................................................... 117
Time-Stamping ................................................................................ 118
Flow Steering .................................................................................. 122
Wake-on-LAN (WoL)........................................................................... 127
Hardware Accelerated 802.1ad VLAN (Q-in-Q Tunneling)............................... 127
Local Loopback Disable ...................................................................... 128
NVME-oF - NVM Express over Fabrics ...................................................... 129
Debuggability .................................................................................. 129
RX Page Cache Size Limit .................................................................... 129
Virtualization ......................................................................................130
Single Root IO Virtualization (SR-IOV) ..................................................... 131
Enabling Paravirtualization.................................................................. 158
VXLAN Hardware Stateless Offloads ....................................................... 159
Q-in-Q Encapsulation per VF in Linux (VST) .............................................. 161
802.1Q Double-Tagging....................................................................... 164
Resiliency ..........................................................................................165
Reset Flow ..................................................................................... 165
Docker Containers ................................................................................168
Docker Using SR-IOV .......................................................................... 168
Kubernetes Using SR-IOV..................................................................... 169
Kubernetes with Shared HCA................................................................ 169
Mediated Devices ............................................................................. 169
OVS Offload Using ASAP2 Direct ................................................................170
Overview ....................................................................................... 170
Installing ASAP2 Packages.................................................................... 171
Setting Up SR-IOV ............................................................................. 171
Configuring Open-vSwitch (OVS) Offload.................................................. 172
Appendix: Mellanox Firmware Tools ....................................................... 180
Fast Driver Unload ................................................................................181
Troubleshooting ............................................................................. 182
General Issues .....................................................................................182
Ethernet Related Issues ..........................................................................183
Performance Related Issues .....................................................................184
SR-IOV Related Issues ............................................................................185
Common Abbreviations and Related Documents ....................................... 186
User Manual Revision History.............................................................. 190
Release Notes Change Log History ....................................................... 191
Overview
NVIDIA offers a robust and full set of protocol software and drivers for Linux with the ConnectX® EN
family cards, designed to provide high-performance support for Enhanced Ethernet with fabric
consolidation over TCP/IP-based LAN applications. The driver and software, in conjunction with the
industry's leading ConnectX family of cards, achieve full line rate, full duplex of up to 100Gb/s
performance per port.
Further information on this product can be found in the following MLNX_EN documents:
• Release Notes
• User Manual
Software Download
For the list of changes made to this document, refer to User Manual Revision History.
Release Notes
These are the release notes for the MLNX_EN LTS (long-term support) release for customers who wish
to utilize the following:
• ConnectX-3
• ConnectX-3 Pro
1. 56 GbE is an NVIDIA proprietary link speed and can be achieved while connecting an NVIDIA
adapter card to an NVIDIA SX10XX switch series, or while connecting an NVIDIA adapter card to
another NVIDIA adapter card.
2. Supports both NRZ and PAM4 modes.
Package Contents
Package   Revision   Licenses
• General Support in MLNX_EN
• Changes and New Features
• Known Issues
• Bug Fixes
Operating System          Platform   Default Kernel Version
ALIOS7.2                  AArch64    4.19.48-006.ali4000.alios7.aarch64
BCLINUX7.3                x86_64     3.10.0-514.el7.x86_64
BCLINUX7.4                x86_64     3.10.0-693.el7.x86_64
BCLINUX7.5                x86_64     3.10.0-862.el7.x86_64
BCLINUX7.6                x86_64     3.10.0-957.el7.x86_64
BCLINUX7.7                x86_64     3.10.0-1062.el7.bclinux.x86_64
BCLINUX8.1                x86_64     4.19.0-193.1.3.el8.bclinux.x86_64
Debian10.0                x86_64     4.19.0-5-amd64
                          AArch64    4.19.0-5-arm64
Debian8.11                x86_64     3.16.0-6-amd64
Debian8.9                 x86_64     3.16.0-4-amd64
Debian9.11                x86_64     4.9.0-11-amd64
Debian9.6                 x86_64     4.9.0-8-amd64
Debian9.9                 x86_64     4.9.0-9-amd64
EulerOS2.0sp9             AArch64    4.19.90-vhulk2006.2.0.h171.eulerosv2r9.aarch64
                          x86_64     4.18.0-147.5.1.0.h269.eulerosv2r9.x86_64
Fedora30                  x86_64     5.0.9-301.fc30.x86_64
RHEL/CentOS7.5alternate   AArch64    4.14.0-49.el7a.aarch64
RHEL/CentOS7.6            ppc64      3.10.0-957.el7.ppc64
                          ppc64le    3.10.0-957.el7.ppc64le
                          x86_64     3.10.0-957.el7.x86_64
RHEL/CentOS7.6alternate   AArch64    4.14.0-115.el7a.aarch64
                          ppc64le    4.14.0-115.el7a.ppc64le
RHEL/CentOS7.7            ppc64      3.10.0-1062.el7.ppc64
                          ppc64le    3.10.0-1062.el7.ppc64le
                          x86_64     3.10.0-1062.el7.x86_64
RHEL/CentOS7.8            ppc64      3.10.0-1127.el7.ppc64
                          ppc64le    3.10.0-1127.el7.ppc64le
                          x86_64     3.10.0-1127.el7.x86_64
RHEL/CentOS7.9            ppc64      3.10.0-1160.el7.ppc64
                          ppc64le    3.10.0-1160.el7.ppc64le
                          x86_64     3.10.0-1160.el7.x86_64
RHEL/CentOS8.0            AArch64    4.18.0-80.el8.aarch64
                          ppc64le    4.18.0-80.el8.ppc64le
                          x86_64     4.18.0-80.el8.x86_64
RHEL/CentOS8.1            AArch64    4.18.0-147.el8.aarch64
                          ppc64le    4.18.0-147.el8.ppc64le
                          x86_64     4.18.0-147.el8.x86_64
RHEL/CentOS8.2            AArch64    4.18.0-193.el8.aarch64
                          ppc64le    4.18.0-193.el8.ppc64le
                          x86_64     4.18.0-193.el8.x86_64
RHEL/CentOS8.3            AArch64    4.18.0-240.el8.aarch64
                          ppc64le    4.18.0-240.el8.ppc64le
                          x86_64     4.18.0-240.el8.x86_64
RHEL/CentOS8.4            AArch64    4.18.0-305.el8.aarch64
                          ppc64le    4.18.0-305.el8.ppc64le
                          x86_64     4.18.0-305.el8.x86_64
RHEL/CentOS8.5            AArch64    4.18.0-348.el8.aarch64
                          ppc64le    4.18.0-348.el8.ppc64le
                          x86_64     4.18.0-348.el8.x86_64
RHEL/CentOS8.6            AArch64    4.18.0-372.9.1.el8.aarch64
                          ppc64le    4.18.0-372.9.1.el8.ppc64le
                          x86_64     4.18.0-372.9.1.el8.x86_64
SLES11SP3                 x86_64     3.0.76-0.11-default
SLES11SP4                 ppc64      3.0.101-63-ppc64
                          x86_64     3.0.101-63-default
SLES12SP2                 x86_64     4.4.21-69-default
SLES12SP3                 x86_64     4.4.73-5-default
                          ppc64le    4.4.73-5-default
SLES12SP4                 x86_64     4.12.14-94.41-default
                          ppc64le    4.12.14-94.41-default
                          AArch64    4.12.14-94.41-default
SLES12SP5                 x86_64     4.12.14-120-default
                          ppc64le    4.12.14-120-default
                          AArch64    4.12.14-120-default
SLES15SP0                 x86_64     4.12.14-23-default
SLES15SP1                 x86_64     4.12.14-195-default
                          ppc64le    4.12.14-195-default
                          AArch64    4.12.14-195-default
SLES15SP2                 x86_64     5.3.18-22-default
                          ppc64le    5.3.18-22-default
                          AArch64    5.3.18-22-default
SLES15SP3                 x86_64     5.3.18-57-default
                          ppc64le    5.3.18-57-default
                          AArch64    5.3.18-57-default
Ubuntu14.04               x86_64     3.13.0-27-generic
Ubuntu16.04               ppc64le    4.4.0-21-generic
                          x86_64     4.4.0-21-generic
Ubuntu18.04               x86_64     4.15.0-20-generic
                          ppc64le    4.15.0-20-generic
                          AArch64    4.15.0-20-generic
Ubuntu19.04               x86_64     5.0.0-13-generic
Ubuntu19.10               x86_64     5.3.0-19-generic
Ubuntu20.04               x86_64     5.4.0-26-generic
                          ppc64le    5.4.0-26-generic
                          AArch64    5.4.0-26-generic
32-bit platforms are no longer supported in MLNX_EN.
All OSs listed above are fully supported in Paravirtualized and SR-IOV Environments with
Linux KVM Hypervisor.
This MLNX_EN version provides long-term support (LTS) for customers who wish to utilize
ConnectX-3, ConnectX-3 Pro and Connect-IB, as well as the RDMA experimental verbs library
(mlnx_lib). MLNX_OFED versions 5.1 and above do not support any of the adapter cards
mentioned above.
This MLNX_EN version supports the following firmware versions of Mellanox network adapter cards:
NIC   Recommended Firmware Rev.   Additional Firmware Rev. Supported
New Features
The following are the new features that have been added to this version of MLNX_EN.
Feature Description
For additional information on the new features, please refer to MLNX_EN User Manual.
Security Hardening   This release contains important reliability improvements and security hardening
enhancements. NVIDIA recommends upgrading your devices to this release to improve security and
reliability. (Internal Ref. 2900508)
Unsupported Functionalities/Features/HCAs
• ConnectX®-2 Adapter Card
• Soft-RoCE
Known Issues
The following is a list of general limitations and known issues of the various components of this
Mellanox EN for Linux release.
For the list of old known issues, please refer to the Mellanox EN Archived Known Issues file at:
http://www.mellanox.com/pdf/prod_software/MLNX_EN_Archived_Known_Issues.pdf
Internal Reference Number   Issue
2894838 Description: Running 'ip link show' command over RHEL8.5 using ConnectX-3 with VFs will
print "Truncated VFs" to the screen.
Workaround: Before the upgrade, remove ibutils manually (and the metapackage with it)
using the following command: yum remove ibutils
Keywords: Installation, ibutils
Workaround: Add modprobe.d rules to force the ib_cm driver to load before the mlx4_ib
and mlx5_ib drivers:
install mlx4_ib { /sbin/modprobe ib_cm; /sbin/modprobe --ignore-install mlx4_ib
$CMDLINE_OPTS; }
install mlx5_ib { /sbin/modprobe ib_cm; /sbin/modprobe --ignore-install mlx5_ib
$CMDLINE_OPTS; }
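These install rules are shell snippets evaluated by modprobe; to take effect on boot they should be
saved in a file under /etc/modprobe.d/, for example (the file name is illustrative):
# vim /etc/modprobe.d/mlnx-ib-cm.conf   # add the two install lines above, then reload the drivers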
1550266
Description: XDP is not supported over ConnectX-3 and ConnectX-3 Pro adapter cards.
Workaround: N/A
Keywords: XDP, ConnectX-3
Workaround: N/A
Keywords: ConnectX-3
Workaround: Link the python module to a directory in the python modules search path by
running:
2105447 Description: hns_roce warning messages will appear in the dmesg after reboot on Euler2
SP3 OSs.
Workaround: N/A
Keywords: hns_roce, dmesg, Euler
Workaround: N/A
Keywords: VF LAG, binding, firmware, FW, PF, SR-IOV
Workaround: Set the interface down, change the trust mode, then bring the interface back
up.
1975293 Description: Installing MLNX_EN with --with-openvswitch flag requires manual removal
of the existing Open vSwitch.
Workaround: N/A
Keywords: OVS, Open vSwitch, openvswitch
Discovered in Release: 4.7-3.2.9.0
2001966
Description: When a bond is created over VF netdevices in SwitchDev mode, the VF
netdevice will be treated as a representor netdevice. This will cause the mlx5_core driver to
crash if it receives netdevice events related to the bond device.
Workaround: Do not create bond over VF netdevices in SwitchDev mode.
Keywords: PF, VF, SwitchDev, netdevice, bonding
Discovered in Release: 4.7-3.2.9.0
1979834 Description: When running MLNX_EN on Kernel 4.10 with ConnectX-3/ConnectX-3 Pro NICs,
deleting VxLAN may result in a crash.
Workaround: Upgrade the Kernel version to v4.14 to avoid the crash.
Keywords: Kernel, OS, ConnectX-3, VxLAN
Discovered in Release: 4.7-3.2.9.0
1997230 Description: Running mlxfwreset or unloading the mlx5_core module while conntrack flows
are offloaded may cause a call trace in the kernel.
Workaround: Stop the OVS service before calling mlxfwreset or unloading the mlx5_core module.
Keywords: Conntrack, ASAP, OVS, mlxfwreset, unload
Discovered in Release: 4.7-3.2.9.0
1955352 Description: Moving 2 ports to SwitchDev mode in parallel is not supported.
Workaround: N/A
Keywords: ASAP, SwitchDev
Discovered in Release: 4.7-3.2.9.0
1973238
Description: ib_core unload may fail on Ubuntu 18.04.2 OS with the following error
message:
Workaround: N/A
Keywords: Tunnel, VXLAN, ASAP, IPv6
Discovered in Release: 4.7-3.2.9.0
1980884 Description: Setting VF VLAN, state and spoofchk using ip link tool is not supported in
SwitchDev mode.
Workaround: N/A
Keywords: ASAP, ip tool, VF, SwitchDev
Discovered in Release: 4.7-3.2.9.0
1991710 Description: PRIO_TAG_REQUIRED_EN configuration is not supported and may cause a call
trace.
Workaround: N/A
Keywords: ASAP, PRIO_TAG, mstconfig
Discovered in Release: 4.7-3.2.9.0
1970429
Description: With HW offloading in SR-IOV SwitchDev mode, the fragmented ICMP echo
request/reply packets (with length larger than MTU) do not function properly. The correct
behavior is for the fragments to miss the offloading flow and go to the slow path.
However, the current behavior is as follows.
• Ingress (to the VM): All echo request fragments miss the corresponding offloading
flow, but all echo reply fragments hit the corresponding offloading flow
• Egress (from the VM): The first fragment still hits the corresponding offloading
flow, and the subsequent fragments miss the corresponding offloading flow
Workaround: N/A
Keywords: HW offloading, SR-IOV, SwitchDev, ICMP, VM, virtualization
Discovered in Release: 4.7-3.2.9.0
1939719 Description: Running openibd restart after the installation of MLNX_EN on SLES12 SP5
and SLES15 SP1 OSs with the latest kernel (v4.12.14) will result in an error stating that the
modules do not belong to that kernel. This is due to the fact that the module installed by
MLNX_EN is incompatible with the new kernel's module.
Workaround: Rebuild the package against the running kernel using the installation flags:
--add-kernel-support --skip-repo
1919335 Description: On SLES 11 SP4, RedHat 6.9 and 6.10 OSs, on hosts where OpenSM is running,
the low-level driver's internal error reset flow will cause a kernel crash when OpenSM is
killed (after the reset occurs). This is due to a bug in these kernels where opening the
umad device (by OpenSM) does not take a reference count on the underlying device.
Workaround: Run OpenSM on a host with a more recent Kernel.
Keywords: SLES, RedHat, CR-Dump, OpenSM
Discovered in Release: 4.7-3.2.9.0
1916029 Description: When the firmware response time to commands becomes very long, some
commands fail upon timeout. The driver may then trigger a timeout completion on the
wrong entry, leading to a NULL pointer call trace.
Workaround: N/A
Keywords: Firmware, timeout, NULL
Discovered in Release: 4.7-3.2.9.0
1967866 Description: Enabling ECMP offload requires the VFs to be unbound and VMs to be shut
down.
Workaround: N/A
Workaround: N/A
Keywords: SLES, installation
Workaround: N/A
Keywords: SwitchDev, ASAP, Kernel , SR-IOV, RedHat
Discovered in Release: 4.7-1.0.0.1
1892663 Description: mlnx_tune script does not support python3 interpreter.
Workaround: Run mlnx_tune with python2 interpreter only.
Keywords: mlnx_tune, python3, python2
Discovered in Release: 4.7-1.0.0.1
1759593 Description: MLNX_EN installation on XenServer OSs requires using the -u flag.
Workaround: N/A
Keywords: Installation, XenServer, OS, operating system
Discovered in Release: 4.6-1.0.1.1
1753629
Description: A bonding bug found in Kernels 4.12 and 4.13 may cause a slave to become
permanently stuck in BOND_LINK_FAIL state. As a result, the following message may
appear in dmesg:
Workaround: N/A
Keywords: Bonding, slave
Discovered in Release: 4.6-1.0.1.1
1758983
Description: Installing RHEL 7.6 OSs platform x86_64 and RHEL 7.6 ALT OSs platform PPCLE
using YUM is not supported.
Workaround: Install these OSs using the install script.
Keywords: RHEL, RedHat, YUM, OS, operating system
Discovered in Release: 4.6-1.0.1.1
1734102 Description: Ubuntu v16.04.04 and v16.04.05 OSs can only be used with kernels of version
4.4.0-143 or below.
Workaround: N/A
Keywords: Ubuntu, Kernel, OS
Discovered in Release: 4.6-1.0.1.1
1712068 Description: Uninstalling MLNX_EN automatically results in the uninstallation of several
libraries that are included in the MLNX_EN package, such as InfiniBand-related libraries.
Workaround: If these libraries are required, reinstall them using the local package
manager (yum/dnf).
Keywords: MLNX_EN libraries
Discovered in Release: 4.6-1.0.1.1
- Description: Due to changes in libraries, MFT v4.11.0 and below are not forward
compatible with MLNX_EN v4.6-1.0.0.0 and above.
Therefore, with MLNX_EN v4.6-1.0.0.0 and above, it is recommended to use MFT v4.12.0
and above.
Workaround: N/A
Keywords: MFT compatible
Discovered in Release: 4.6-1.0.1.1
1730840 Description: On ConnectX-4 HCAs, GID index for RoCE v2 is inconsistent when toggling
between enabled and disabled interface modes.
Workaround: N/A
Keywords: RoCE v2, GID
Discovered in Release: 4.6-1.0.1.1
1731005 Description: MLNX_EN v4.6 YUM and Zypper installations fail on RHEL8.0, SLES15.0 and
PPCLE OSs.
Workaround: N/A
Keywords: YUM, Zypper, installation, RHEL, RedHat, SLES, PPCLE
Discovered in Release: 4.6-1.0.1.1
1717428 Description: On kernels 4.10-4.14, MTUs larger than 1500 cannot be set for a GRE interface
with any driver (IPv4 or IPv6).
Workaround: Upgrade your kernel to any version higher than v4.14.
Keywords: Fedora 27, gretap, ip_gre, ip_tunnel, ip6_gre, ip6_tunnel
Discovered in Release: 4.6-1.0.1.1
1748343 Description: Driver reload takes several minutes when a large number of VFs exists.
Workaround: N/A
Keywords: VF, SR-IOV
Discovered in Release: 4.6-1.0.1.1
1748537 Description: Cannot set max Tx rate for VFs from the ARM.
Workaround: N/A
Keywords: Host control, max Tx rate
Discovered in Release: 4.6-1.0.1.1
1732940 Description: Software counters do not work for representor net devices.
Workaround: N/A
Keywords: mlx5, counters, representors
Discovered in Release: 4.6-1.0.1.1
1733974 Description: Running heavy traffic (such as 'ping flood') while bringing other mlx5
interfaces up and down may result in "INFO: rcu_preempt detected stalls on CPUs/tasks:"
call traces.
Workaround: N/A
Keywords: mlx5
Discovered in Release: 4.6-1.0.1.1
1731939 Description: Getting/setting the Forward Error Correction (FEC) configuration is not
supported on ConnectX-6 HCAs with a 200Gb/s speed rate.
Workaround: N/A
Keywords: Forward Error Correction, FEC, 200Gbps
Discovered in Release: 4.6-1.0.1.1
1715789 Description: Mellanox Firmware Tools (MFT) package is missing from Ubuntu v18.04.2 OS.
Workaround: Manually install MFT.
Keywords: MFT, Ubuntu, operating system
Discovered in Release: 4.6-1.0.1.1
1652864 Description: On ConnectX-3 and ConnectX-3 Pro HCAs, CR-Dump poll is not supported using
sysfs commands.
Workaround: If supported in your Kernel, use the devlink tool as an alternative to sysfs to
achieve CR-Dump support.
Keywords: mlx4, devlink, CR-Dump
Discovered in Release: 4.6-1.0.1.1
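A hedged sketch of the devlink alternative (the PCI address is illustrative, and region names
depend on kernel and driver support):
# devlink region show                              # list regions exposed by the driver
# devlink region dump pci/0000:82:00.0/cr-space    # dump the CR space region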
- Description: On ConnectX-6 HCAs and above, an attempt to configure advertisement (any
bitmap) will result in advertising the whole capability set.
Workaround: N/A
Keywords: 200Gbps, advertisement, Ethtool
Discovered in Release: 4.6-1.0.1.1
1581631 Description: GID entries referenced by a certain user application cannot be deleted
while that user application is running.
Workaround: N/A
1521877 Description: On SLES 12 SP1 OSs, a kernel tracepoint issue may cause undefined behavior
when inserting a kernel module with a wrong parameter.
Workaround: N/A
1504073 Description: When using ConnectX-5 with LRO over PPC systems, the HCA might experience
back pressure due to delayed PCI Write operations. In this case, bandwidth might drop
from line-rate to ~35Gb/s. Packet loss or pause frames might also be observed.
1424233 Description: On RHEL v7.3, 7.4 and 7.5 OSs, setting IPv4-IP-forwarding will turn off LRO on
existing interfaces. Turning LRO back on manually using ethtool and adding a VLAN
interface may cause a warning call trace.
Workaround: Make sure IPv4-IP-forwarding and LRO are not turned on at the same time.
1431282 Description: Software reset may result in an order inversion of interface names.
1442507 Description: Retpoline support in GCC causes an increase in CPU utilization, which results
in a 15% performance drop in IP forwarding.
Workaround: N/A
1425129 Description: MLNX_EN cannot be installed on SLES 15 OSs using Zypper repository.
Workaround: Install MLNX_EN using the standard installation script instead of Zypper
repository.
1241056 Description: When working with ConnectX-4/ConnectX-5 HCAs on PPC systems with
Hardware LRO and Adaptive Rx support, bandwidth drops from full wire speed (FWS) to
~60Gb/s.
1090612 Description: NVMEoF protocol does not support LBA format with non-zero metadata size.
Therefore, NVMe namespace configured to LBA format with metadata size bigger than 0
will cause Enhanced Error Handling (EEH) in PowerPC systems.
Workaround: Configure the NVMe namespace to use LBA format with zero sized metadata.
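A hedged sketch using nvme-cli (device name and LBA-format index are examples; choose an LBAF
whose metadata size is 0, as listed by nvme id-ns):
# nvme id-ns /dev/nvme0n1            # inspect the supported LBA formats (look for "ms:0")
# nvme format /dev/nvme0n1 --lbaf=0  # format the namespace with a zero-metadata LBA format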
1275082 Description: When setting a non-default IPv6 link local address or an address that is not
based on the device MAC, connection establishments over RoCEv2 might fail.
Workaround: N/A
1307336 Description: In RoCE LAG mode, when running ibdev2netdev -v , the port state of the
second port of the mlx4_0 IB device will read “NA” since this IB device does not have a
second port.
Workaround: N/A
1296355 Description: Number of MSI-X that can be allocated for VFs and PFs in total is limited to
2300 on Power9 platforms.
Workaround: N/A
1294934 Description: Firmware reset might cause Enhanced Error Handling (EEH) on Power7
platforms.
Workaround: N/A
1259293 Description: On Fedora 20 operating systems, driver load fails with an error message such
as: "[185.262460] kmem_cache_sanity_check (fs_ftes_0000:00:06.0): Cache name already exists."
This is caused by SLUB allocators grouping multiple slab kmem_cache_create calls into one slab
cache alias to save memory and increase cache hotness. This results in the slab name being
considered stale.
Note that after rebooting to the new kernel, you will need to rebuild MLNX_EN against the new
kernel version.
1264359 Description: When running perftest (ib_send_bw, ib_write_bw, etc.) in rdma-cm mode, the
resp_cqe_error counter under /sys/class/infiniband/mlx5_0/ports/1/hw_counters/
resp_cqe_error might increase. This behavior is expected and it is a result of receive WQEs
that were not consumed.
Workaround: N/A
1264956 Description: Configuring SR-IOV after disabling RoCE LAG using sysfs (/sys/bus/pci/drivers/
mlx5_core/<bdf>/roce_lag_enable) might result in RoCE LAG being enabled again in case
SR-IOV configuration fails.
- Description: Packet Size (Actual Packet MTU) limitation for IPsec offload on Innova IPsec
adapter cards: The current offload implementation does not support IP fragmentation. The
original packet size should be such that it does not exceed the interface's MTU size after
the ESP transformation (encryption of the original IP packet which increases its length)
and the headers (outer IP header) are added:
• Inner IP packet size <= I/F MTU - ESP additions (20) - outer_IP (20) - fragmentation
issue reserved length (56)
• Inner IP packet size <= I/F MTU - 96
This mostly affects forwarded traffic into smaller MTU, as well as UDP traffic. TCP does
PMTU discovery by default and clamps the MSS accordingly.
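For example, assuming a standard 1500-byte interface MTU, the inner IP packet must not exceed
1500 - 96 = 1404 bytes; a larger inner packet would no longer fit after the ESP transformation
and would hit the fragmentation limitation described above.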
Workaround: N/A
Workaround: N/A
- Description: FEC is not supported on Innova IPsec adapter cards. When using switches, there
may be a need to change their configuration.
Workaround: N/A
955929 Description: Heavy traffic may cause SYN flooding when using Innova IPsec adapter cards.
Workaround: N/A
- Description: Priority Based Flow Control is not supported on Innova IPsec adapter cards.
Workaround: N/A
- Description: Pause configuration is not supported when using Innova IPsec adapter cards.
Default pause is global pause (enabled).
Workaround: N/A
1045097 Description: Connecting and disconnecting a cable several times may cause a link up
failure when using Innova IPsec adapter cards.
Workaround: N/A
- Description: On Innova IPsec adapter cards, supported MTU is between 512 and 2012
bytes. Setting MTU values outside this range might fail or might cause traffic loss.
1125184 Description: In old kernel versions, such as Ubuntu 14.04 and RedHat 7.1, VXLAN interface
does not reply to ARP requests for a MAC address that exists in its own ARP table. This
issue was fixed in the following newer kernel versions: Ubuntu 16.04 and RedHat 7.3.
Workaround: N/A
1134323 Description: When using kernel versions older than v4.7 with IOMMU enabled,
performance degradation and logical issues (such as soft lockup) might occur under a high
load of traffic. This is because IOMMU IOVA allocations are centralized, requiring many
synchronization operations and high locking overhead amongst CPUs.
Workaround: Use kernel v4.7 or above, or a backported kernel that includes the following
patches:
• 2aac630429d9 iommu/vt-d: change intel-iommu to use IOVA frame numbers
• 9257b4a206fc iommu/iova: introduce per-cpu caching to iova allocation
• 22e2f9fa63b0 iommu/vt-d: Use per-cpu IOVA caching
1135738 Description: On 64k page size setups, DMA memory might run out when trying to increase
the ring size/number of channels.
1159650 Description: When configuring VF VST, VLAN-tagged outgoing packets will be dropped in
case of ConnectX-4 HCAs. In case of ConnectX-5 HCAs, VLAN-tagged outgoing packets will
have another VLAN tag inserted.
Workaround: N/A
Keywords: VST
1157770 Description: On Passthrough/VM machines with relatively old QEMU and libvirtd,
After timeout, no other commands will be completed and all driver operations will be
stuck.
Keywords: QEMU
1147703 Description: Using dm-multipath for High Availability on top of NVMEoF block devices must
be done with “directio” path checker.
Workaround: N/A
Keywords: NVMEoF
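A minimal sketch of the corresponding multipath.conf fragment (shown in the defaults section
for brevity; the option can also be set per device):
defaults {
    path_checker directio
}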
1152408 Description: RedHat v7.3 PPCLE and v7.4 PPCLE operating systems do not support KVM
qemu out of the box. The following error message will appear when attempting to
run virt-install to create new VMs:
Workaround: Acquire the following rpms from the beta version of 7.4ALT to 7.3/7.4 PPCLE
(in the same order):
• qemu-img-.el7a.ppc64le.rpm
• qemu-kvm-common-.el7a.ppc64le.rpm
• qemu-kvm-.el7a.ppc64le.rpm
1012719 Description: A soft lockup in the CQ polling flow might occur when running very high stress
on the GSI QP (RDMA-CM applications). This is a transient situation from which the driver
will later recover.
Workaround: N/A
1078630 Description: When working in RoCE LAG over kernel v3.10, a kernel crash might occur
when unloading the driver while the Network Manager is running.
Workaround: Stop the Network Manager before unloading the driver and start it back once
the driver unload is complete.
1149557 Description: When setting VGT+, the maximal number of allowed VLAN IDs presented in
the sysfs is 813 (up to the first 813).
Workaround: N/A
Keywords: VGT+
Workaround: N/A
Keywords: Lustre
995665/1165919 Description: In kernels below v4.13, connection between NVMEoF host and target cannot
be established in a hyper-threaded system with more than 1 socket.
Keywords: NVMEoF
1039346 Description: Enabling multiple namespaces per subsystem while using NVMEoF target
offload is not supported.
Workaround: To enable more than one namespace, create a subsystem for each one.
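A hedged sketch of creating one target subsystem per namespace through the kernel nvmet
configfs interface (the NQN and backing device are illustrative):
# cd /sys/kernel/config/nvmet/subsystems
# mkdir testnqn1 && mkdir testnqn1/namespaces/1
# echo -n /dev/nvme0n1 > testnqn1/namespaces/1/device_path
# echo 1 > testnqn1/namespaces/1/enable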
1030301 Description: Creating virtual functions on a device that is in LAG mode will destroy the
LAG configuration. The bonding device over the Ethernet NICs will continue to work as
expected.
Workaround: N/A
1047616 Description: When node GUID of a device is set to zero (0000:0000:0000:0000), RDMA_CM
user space application may crash.
Keywords: RDMA_CM
1051701 Description: New versions of iproute which support new kernel features may misbehave on
old kernels that do not support these new features.
Workaround: N/A
Keywords: iproute
1007830 Description: When working on Xenserver hypervisor with SR-IOV enabled on it, make sure
the following instructions are applied:
1. Right after enabling SR-IOV, unbind all driver instances of the virtual functions from
their PCI slots.
2. It is not allowed to unbind PF driver instance while having active VFs.
Workaround: N/A
Keywords: SR-IOV
1005786 Description: When using ConnectX-5 adapter cards, the following error might be printed to
dmesg, indicating a temporary lack of DMA pages:
"mlx5_core ... give_pages:289:(pid x): Y pages alloc time exceeded the max permitted duration"
Example: This might happen when trying to open more than 64 VFs per port.
Workaround: N/A
1008066/1009004 Description: Performing some operations on the user end during reboot might
cause a call trace/panic, due to bugs found in the Linux kernel.
Workaround: N/A
1009488 Description: Mounting MLNX_EN to a path that contains special characters, such as
parentheses or spaces, is not supported. For example, when mounting MLNX_EN to
"/media/CDROM(vcd)/", installation will fail and the following error message will be
displayed:
# cd /media/CDROM\(vcd\)/
# ./install
Workaround: N/A
Keywords: Installation
982144 Description: When the offload traffic sniffer is on, bandwidth could decrease by up to 50%.
Workaround: N/A
982534 Description: In ConnectX-3, when using a server with page size of 64K, the UAR BAR will
become too small. This may cause one of the following issues:
1. mlx4_core driver does not load.
2. The mlx4_core driver does load, but calls to ibv_open_device may return ENOMEM
errors.
Workaround:
1. Add the following parameter in the firmware's ini file under [HCA] section:
log2_uar_bar_megabytes = 7
2. Re-burn the firmware with the new ini file.
Keywords: PPC
981362 Description: On several OSs, setting the number of TCs is not supported via the tc tool.
Keywords: Ethernet, TC
979457 Description: When setting IOMMU=ON, a severe performance degradation may occur due to
a bug in IOMMU.
Workaround: Make sure the following patches are found in your kernel:
• iommu/vt-d: Fix PASID table allocation
• iommu/vt-d: Fix IOMMU lookup for SR-IOV Virtual Functions
Note: These patches are already available in Ubuntu 16.04.02 and 17.04 OSs.
Bug Fixes
This table lists the bugs fixed in this release.
For the list of old bug fixes, please refer to the MLNX_EN Archived Bug Fixes file at:
http://www.mellanox.com/pdf/prod_software/MLNX_EN_Archived_Bug_Fixes.pdf
Internal Reference Number   Description
Fixed in Release: 4.9-4.0.8.0
2635628 Description: openibd does not load automatically after reboot on Euler2sp9 OS.
Keywords: openibd, Euler2sp9
Discovered in Release: 4.9-3.1.5.0
Fixed in Release: 4.9-4.0.8.0
2748862 Description: The --add-kernel-support installation flag was not supported on Oracle
Linux 7.9, causing an installation issue.
Keywords: Installation, Oracle Linux 7.9
Discovered in Release: 4.9-0.1.7.0
Fixed in Release: 4.9-4.0.8.0
2396956 Description: Fixed an issue where a device under massive load may hit IOMMU allocation
failures. For more information, see the "RX Page Cache Size Limit" section in the user manual.
Keywords: Legacy libibverbs
Discovered in Release: 4.9-2.2.4.0
Fixed in Release: 4.9-3.1.5.0
2434638 Description: Fixed an issue where
"ibv_devinfo -v" command did not print some
of the MEM_WINDOW capabilities, even
though they were supported.
Keywords: Legacy libibverbs
Discovered in Release: 4.9-2.2.4.0
Fixed in Release: 4.9-3.1.5.0
2192791 Description: Fixed the issue where packages
neohost-backend and neohost-sdk were not
properly removed by the uninstallation
procedure and may have required manual
removal before re-installing or upgrading the
MLNX_OFED driver.
Keywords: NEO-Host, SDK
Discovered in Release: 4.9-0.1.7.0
Fixed in Release: 4.9-2.2.4.0
Discovered in Release: 4.9-0.1.7.0
2047221
Description: Reference count (refcount) for
RDMA connection ID (cm_id) was not
incremented in rdma_resolve_addr() function,
resulting in a cm_id use-after-free access.
A fix was applied to increment the cm_id
refcount.
Keywords: rdma_resolve_addr(), cm_id
Discovered in Release: 4.6-1.0.1.1
Introduction
This manual is intended for system administrators responsible for the installation, configuration,
management and maintenance of the software and hardware of Ethernet adapter cards. It is also
intended for application developers.
This document provides instructions on how to install the driver on NVIDIA ConnectX® network
adapter solutions supporting the following uplinks to servers.
Uplink/NICs   Driver Name   Uplink Speed
1. 56 GbE is an NVIDIA proprietary link speed and can be achieved while connecting an NVIDIA
adapter card to an NVIDIA SX10XX switch series, or while connecting an NVIDIA adapter card to
another NVIDIA adapter card.
2. Supports both NRZ and PAM4 modes.
• Net device statistics
• SR-IOV support
• Flow steering
• Ethernet Time Stamping
Package Images
MLNX_EN is provided as an ISO image or as a tarball per Linux distribution and CPU architecture that
includes source code and binary RPMs, firmware and utilities. The ISO image contains an installation
script (called install) that performs the necessary steps to accomplish the following:
• Discover the currently installed kernel
• Uninstall any previously installed MLNX_OFED/MLNX_EN packages
• Install the MLNX_EN binary RPMs (if they are available for the current kernel)
• Identify the currently installed HCAs and perform the required firmware updates
Software Components
MLNX_EN contains the following software components:
Components Description
mlx5 driver   mlx5 is the low-level driver implementation for the ConnectX®-4 adapters
designed by Mellanox Technologies. ConnectX®-4 operates as a VPI adapter.
mlx5_core     Acts as a library of common functions (e.g. initializing the device after
reset) required by the ConnectX®-4 adapter cards.
mlx4 driver   mlx4 is the low-level driver implementation for the ConnectX adapters
designed by Mellanox Technologies. The ConnectX can operate as an InfiniBand adapter and as
an Ethernet NIC.
mlx4_en       Handles Ethernet-specific functions and plugs into the netdev mid-layer.
Software modules Source code for all software modules (for use under conditions
mentioned in the modules' LICENSE files)
Firmware
The image includes the following firmware item:
• Firmware images (.bin format wrapped in the mlxfwmanager tool) for ConnectX®-2/
ConnectX®-3/ConnectX®-3 Pro/ConnectX®-4 and ConnectX®-4 Lx network adapters
Directory Structure
The tarball image of MLNX_EN contains the following files and directories:
• install - the MLNX_EN installation script
• uninstall.sh - the MLNX_EN un-installation script
• RPMS/ - directory of binary RPMs for a specific CPU architecture
• src/ - directory of the OFED source tarball
• mlnx_add_kernel_support.sh - a script required to rebuild MLNX_EN for a customized kernel
version on a supported Linux distribution
mlx4_core
Handles low-level functions like device initialization and firmware commands processing. Also
controls resource allocation so that the Ethernet functions can share the device without interfering
with each other.
mlx4_en
mlx5 Driver
mlx5 is the low-level driver implementation for the ConnectX®-4 adapters designed by Mellanox
Technologies. ConnectX®-4 operates as a VPI adapter. The mlx5 driver is comprised of the following
kernel module:
mlx5_core
Acts as a library of common functions (e.g. initializing the device after reset) required by
ConnectX®-4 adapter cards. The mlx5_core driver also implements the Ethernet interfaces for
ConnectX®-4. Unlike mlx4_en/core, mlx5 drivers do not require the mlx5_en module, as the
Ethernet functionalities are built into the mlx5_core module.
Module Parameters
mlx4_core Parameters
msi_x 0 - don't use MSI-X,
1 - use MSI-X,
num_vfs Either a single value (e.g. '5') to define uniform num_vfs value for all
devices functions or a string to map device function numbers to their
num_vfs values (e.g. '0000:04:00.0-5,002b:1c:0b.a-15').
probe_vf Either a single value (e.g. '3') to indicate that the Hypervisor driver
itself should activate this number of VFs for each HCA on the host, or
a string to map device function numbers to their probe_vf values (e.g.
'0000:04:00.0-3,002b:1c:0b.a-13').
log_num_mgm_entry_size log mgm size, that defines the num of qp per mcg, for example: 10
gives 248. Range: 7 <= log_num_mgm_entry_size <= 12. To activate
device managed flow steering when available, set to -1 (int)
high_rate_steer Enable steering mode for higher packet rate (obsolete, set "Enable
optimized steering" option in log_num_mgm_entry_size to use this
mode). (int)
fast_drop Enable fast packet drop when no receive WQEs are posted (int)
log_num_mac Log2 max number of MACs per ETH port (1-7) (int)
log_num_vlan (Obsolete) Log2 max number of VLANs per ETH port (0-7) (int)
log_mtts_per_seg Log2 number of MTT entries per segment (0-7) (default: 0) (int)
port_type_array Either pair of values (e.g. '1,2') to define uniform port1/port2 types
configuration for all devices functions or a string to map device
function numbers to their pair of port types values (e.g.
'0000:04:00.0-1;2,002b:1c:0b.a-1;1').
If only a single port is available, use the N/A port type for port2 (e.g
'1,4').
log_num_qp log maximum number of QPs per HCA (default: 19) (int)
log_num_srq log maximum number of SRQs per HCA (default: 16) (int)
log_num_cq log maximum number of CQs per HCA (default: 16) (int)
log_num_mcg log maximum number of multicast groups per HCA (default: 13) (int)
log_num_mpt log maximum number of memory protection table entries per HCA
(default: 19) (int)
log_num_mtt log maximum number of memory translation table segments per HCA
(default: max(20, 2*MTTs for register all of the host memory limited to
30)) (int)
enable_qos Enable Quality of Service support in the HCA (default: off) (bool)
log_num_mgm_entry_size log mgm size, that defines the num of qp per mcg, for example: 10
gives 248. Range: 7 <= log_num_mgm_entry_size <= 12 (default = -10).
enable_4k_uar Enable using 4K UAR. Should not be enabled if have VFs which do not
support 4K UARs (default: true) (bool)
mlx4_en_only_mode Load in Ethernet only mode (int)
rr_proto IP next protocol for RoCEv1.5 or destination port for RoCEv2. Setting 0
means using driver default values (deprecated) (int)
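As an illustrative sketch (the values are examples, not recommendations), mlx4_core parameters
can be set persistently via a modprobe configuration file:
# /etc/modprobe.d/mlx4.conf (file name is illustrative)
options mlx4_core num_vfs=5 port_type_array=2,2 log_num_mgm_entry_size=-1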
mlx4_en Parameters
inline_thold   Default and max value is 104 bytes. Saves a PCI read operation transaction;
a packet smaller than the threshold size will be copied to the hardware buffer directly.
(range: 17-104)
udp_rss        On by default. Once disabled, no RSS for incoming UDP traffic will be done.
pfctx          Priority-based Flow Control policy on TX[7:0]. Per-priority bit mask (uint)
pfcrx          Priority-based Flow Control policy on RX[7:0]. Per-priority bit mask (uint)
udev_dev_port_dev_id   Work with dev_id or dev_port when supported by the kernel. Range: 0 <=
udev_dev_port_dev_id <= 2 (default = 0).
• 0: Work with dev_port if supported by the kernel, otherwise work with dev_id.
• 2: Work with both dev_id and dev_port (if dev_port is supported by the kernel). (int)
The mlx5_core module supports a single parameter used to select the profile which defines the
number of resources supported.
prof_sel The parameter name for selecting the profile. The supported values
for profiles are:
• 0 - for medium resources, medium performance
• 1 - for low resources
• 2 - for high performance (int) (default)
guids charp
debug_mask debug_mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both.
Default=0 (uint)
Default=4; Range=1-1024
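A minimal sketch of selecting the high-performance profile persistently (file name is illustrative):
# /etc/modprobe.d/mlx5.conf
options mlx5_core prof_sel=2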
Devlink Parameters
The following parameters, supported in mlx4 driver only, can be changed using the Devlink user
interface:
Installation
This chapter describes how to install and test the MLNX_EN for Linux package on a single host
machine with Mellanox Ethernet adapter hardware installed.
• Software Dependencies
• Downloading MLNX_EN
• Installing MLNX_EN
• Uninstalling MLNX_EN
• Updating Firmware After Installation
• Ethernet Driver Usage and Configuration
• Performance Tuning
Software Dependencies
MLNX_EN driver cannot coexist with OFED software on the same machine. Hence when installing
MLNX_EN, all OFED packages should be removed (run the install script).
Downloading MLNX_EN
To download MLNX_EN, perform the following:
1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.
The following example shows a system with an installed Mellanox HCA:
Note: For ConnectX-5 Socket Direct adapters, use ibdev2netdev to display the installed card and
the mapping of logical ports to physical ports. Example:
Notes:
• Each PCI card of ConnectX-5 Socket Direct has a different PCI address. In the output example
above, the first two rows indicate that one card is installed in a PCI slot with PCI Bus address
84 (hexadecimal), and PCI Device Number 00, and PCI Function Number 0 and 1. RoCE
assigned mlx5_10 as the logical port, which is the same as netdevice p2p1, and both are
mapped to physical port of PCI function 0000:84:00.0.
• RoCE logical port mlx5_2 of the second PCI card (PCI Bus address 05) and netdevice p5p1 are
mapped to physical port of PCI function 0000:05:00.0, which is the same physical port of PCI
function 0000:84:00.0.
MT4119 is the PCI Device ID of the Mellanox ConnectX-5 adapters family.
For more details, please refer to ConnectX-5 Socket Direct Hardware User Manual, available
at www.mellanox.com.
2. Download the ISO image to your host. The image's name has the format
mlnx-en-<ver>-<OS label>-<CPU arch>.iso. You can download it from
http://www.mellanox.com > Products > Software > Ethernet Drivers.
3. Use the md5sum utility to confirm the file integrity of your ISO/tarball image.
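For example (the file name is illustrative):
# md5sum mlnx-en-<ver>-<OS label>-<CPU arch>.iso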
Installing MLNX_EN
Installation Script
The installation script, install, performs the following:
Installation Modes
mlnx_en installer supports 2 modes of installation. The install script selects the mode of driver
installation depending on the running OS/kernel version.
• Kernel Module Packaging (KMP) mode, where the source rpm is rebuilt for each installed
flavor of the kernel. This mode is used for RedHat and SUSE distributions.
• Non KMP installation mode, where the sources are rebuilt with the running kernel. This mode
is used for vanilla kernels.
• By default, the package will install drivers supporting Ethernet only. In addition, the package
will include the following new installation options:
• Full VMA support, which can be installed using the installation option "--vma".
• Infrastructure to run DPDK using the installation option "--dpdk".
Notes:
• DPDK itself is not included in the package. Users would still need to install DPDK
separately after the MLNX_EN installation is completed.
• Installing VMA or DPDK infrastructure will allow users to run RoCE.
Installation Results
• For Ethernet only installation mode:
• The kernel modules are installed under:
• /lib/modules/`uname -r`/updates on SLES and Fedora Distributions
• /lib/modules/`uname -r`/extra/mlnx-en on RHEL and other RedHat like
Distributions
• /lib/modules/`uname -r`/updates/dkms/ on Ubuntu
• The kernel module sources are placed under:
/usr/src/mlnx-en-<ver>/
• For VPI installation mode:
• The kernel modules are installed under:
• /lib/modules/`uname -r`/updates on SLES and Fedora Distributions
• /lib/modules/`uname -r`/extra/mlnx-ofa_kernel on RHEL and other RedHat like
Distributions
• /lib/modules/`uname -r`/updates/dkms/ on Ubuntu
• The kernel module sources are placed under:
/usr/src/ofa_kernel-<ver>/
Installation Procedure
This section describes the installation procedure of MLNX_EN on Mellanox adapter cards. Additional
installation procedures are provided for Mellanox Innova SmartNIC for other environment
customizations, and for extra libraries and packages in Installing MLNX_EN on Innova™ IPsec Adapter
Cards section.
1. Log into the installation machine as root.
2. Mount the ISO image on your machine.
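A typical mount command looks as follows (mount point and file name are illustrative):
# mount -o ro,loop mlnx-en-<ver>-<OS label>-<CPU arch>.iso /mnt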
3. Run the installation script:
/mnt/install
4. Case A: If the installation script has performed a firmware update on your network adapter,
you need to either restart the driver or reboot your system before the firmware update can
take effect. Refer to the table below to find the appropriate action for your specific card.
Adapter                                       Driver Restart (Soft Reset)   Standard Reboot   Cold Reboot (Hard Reset)
ConnectX-3/ConnectX-3 Pro                     +                             -                 -
Standard ConnectX-4/ConnectX-4 Lx or higher   -                             +                 -
Case B: If the installation script has not performed a firmware upgrade on your network
adapter, restart the driver by running:
# /etc/init.d/openibd restart
The "/etc/init.d/openibd" service script will load the mlx4 and/or mlx5 and ULP drivers, as
set in the "/etc/infiniband/openib.conf" configuration file.
This type of installation is applicable to RedHat 7.1, 7.2, 7.3 and 7.4 operating systems and Kernel
4.13.
As of version 4.2, MLNX_EN supports Mellanox Innova IPsec EN adapter card that provides security
acceleration for IPsec-enabled networks.
For information on the usage of Innova IPsec, please refer to Mellanox Innova IPsec EN Adapter Card
documentation on Mellanox official web (mellanox.com --> Products --> Adapters --> Smart Adapters
--> Innova IPsec).
Prerequisites
In order to obtain Innova IPsec offload capabilities once MLNX_EN is installed, make sure Kernel
v4.13 or newer is installed with the following configuration flags enabled:
• CONFIG_XFRM_OFFLOAD
• CONFIG_INET_ESP_OFFLOAD
• CONFIG_INET6_ESP_OFFLOAD
You can download the image from www.mellanox.com --> Products --> Software --> Ethernet
Drivers.
3. Download and install Mellanox Technologies GPG-KEY:
The key can be downloaded via the following link:
http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
Example:
# wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
--2014-04-20 13:52:30-- http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
Resolving www.mellanox.com... 72.3.194.0
Connecting to www.mellanox.com|72.3.194.0|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1354 (1.3K) [text/plain]
Saving to: 'RPM-GPG-KEY-Mellanox'
100%[=================================================>] 1,354 --.-K/s in 0s
2014-04-20 13:52:30 (247 MB/s) - 'RPM-GPG-KEY-Mellanox' saved [1354/1354]
[mlnx_en]
name=MLNX_EN Repository
baseurl=file:///<path to extracted MLNX_EN package>/<RPMS FOLDER NAME>
enabled=1
gpgkey=file:///<path to the downloaded key RPM-GPG-KEY-Mellanox>
gpgcheck=1
Replace <RPMS FOLDER NAME> with “RPMS_ETH” or “RPMS” depending on the desired
installation mode (Ethernet only or RDMA).
# yum repolist
Loaded plugins: product-id, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to
register.
repo id repo name
status
mlnx_en MLNX_EN Repository
After setting up the YUM repository for MLNX_EN package, install one of the following metadata
packages:
• In case you set up the “RPMS_ETH” folder as the repository (for Ethernet only mode), install:
• In case you set up the “RPMS” folder as the repository (for RDMA mode), install either:
Or
MLNX_EN provides kernel module RPM packages with KMP support for RHEL and SLES. For other
operating systems, kernel module RPM packages are provided only for the operating system's
default kernel. In this case, the group RPM packages have the supported kernel version in their
package names.
If you have an operating systems different than RHEL or SLES, or you have installed a kernel that is
not supported by default in MLNX_EN, you can use the mlnx_add_kernel_support.sh script to build
MLNX_EN for your kernel.
The script will automatically build the matching group RPM packages for your kernel so that you can
still install MLNX_EN via YUM.
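A hedged sketch of a typical invocation (verify the exact flags against the script's --help output):
# ./mlnx_add_kernel_support.sh -m /<path to MLNX_EN package> --make-tgz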
Please note that the resulting MLNX_EN repository will contain unsigned RPMs. Therefore, you
should set 'gpgcheck=0' in the repository configuration file.
Installing MLNX_EN using the YUM tool does not automatically update the firmware.
To update the firmware to the version included in MLNX_EN package, you can either:
1. Run:
# yum install mlnx-fw-updater
OR
2. Update the firmware to the latest version available on Mellanox website as described
in Updating Firmware After Installation section.
Replace <DEBS FOLDER NAME> with “DEBS_ETH” or “DEBS” depending on the desired
installation mode (Ethernet only or RDMA).
# apt-key list
pub 1024D/A9E4B643 2013-08-11
uid Mellanox Technologies <support@mellanox.com>
sub 1024g/09FCC269 2013-08-11
Installing MLNX_EN Using the apt-get Tool
After setting up the apt-get repository for MLNX_EN package, install one of the following metadata
packages:
• In case you set up the “DEBS_ETH” folder as the repository (for Ethernet only mode), install:
• In case you set up the “DEBS” folder as the repository (for RDMA mode), install either:
OR
Installing MLNX_EN using the apt-get tool does not automatically update the firmware. To update
the firmware to the version included in MLNX_EN package, you can either:
1. Run:
Or
2. Update the firmware to the latest version available on Mellanox website as described
in Updating Firmware After Installation section.
Uninstalling MLNX_EN
Use the script /usr/sbin/mlnx_en_uninstall.sh to uninstall MLNX_EN package.
Updating the Device Online
To update the device online on the machine from Mellanox site, use the following command line:
Example:
The current update package available on mellanox.com does not support the script below.
An update package that supports this script will become available in a future release.
You can run the following update script using one of the modes below:
/opt/mellanox/mlnx-fw-updater/mlnx_fpga_updater.sh
• With the -u flag to provide a URL to the software bundle. Example:
./mlnx_fpga_updater.sh -u http://www.mellanox.com/downloads/fpga/ipsec/Innova_IPsec_<version>.tgz
• With the -t flag to provide the path to the downloaded bundle tarball. Example:
./mlnx_fpga_updater.sh -t <Innova_IPsec_bundle_file.tgz>
• With the -p flag to provide the path to the downloaded and extracted tarball. Example:
./mlnx_fpga_updater.sh -p <Innova_IPsec_extracted_bundle_directory>
For more information on the script usage, you can run mlnx_fpga_updater.sh -h.
It is recommended to perform firmware and FPGA upgrade on Innova IPsec cards using this
script only.
2. Download the firmware BIN file from the Mellanox website or the OEM website.
3. Burn the firmware.
mlxfwmanager_pci -i <fw_file.bin>
Example:
To query stateless offload status:
#> ethtool -k eth<x>
To set stateless offload status:
#> ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off]
By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the
moderation time to the traffic pattern.
To set the values for packet rate limits and for moderation time high and low:
Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest
value. Below a lower limit of packet rate, the moderation time will be set to its lowest value.
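A hedged sketch of the command form, using standard ethtool coalescing options (exact option
support depends on the driver and ethtool version):
#> ethtool -C eth<x> adaptive-rx on pkt-rate-low <N> pkt-rate-high <N> rx-usecs-low <N> rx-usecs-high <N>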
usec settings correspond to the time to wait after the *last* packet is sent/received before
triggering an interrupt.
#> ethtool -A eth<x> [rx on|off] [tx on|off]
Some of these values can be changed using module parameters, which can be displayed by running:
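#> modinfo mlx4_en
(The module name above is illustrative for the mlx4 Ethernet driver; use modinfo mlx5_core for
ConnectX-4 and later adapters.)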
Performance Tuning
Depending on the application of the user's system, it may be necessary to modify the default
configuration of network adapters based on the ConnectX® adapters. In case tuning is required,
please refer to the Performance Tuning for Mellanox Adapters Community post.
Features Overview and Configuration
This chapter contains the following sections:
• Ethernet Network
• Virtualization
• Resiliency
• Docker Containers
• OVS Offload Using ASAP2 Direct
• Fast Driver Unload
Ethernet Network
• ethtool Counters
• Interrupt Request (IRQ) Naming
• Quality of Service (QoS)
• Quantized Congestion Notification (QCN)
• Ethtool
• Checksum Offload
• Ignore Frame Check Sequence (FCS) Errors
• RDMA over Converged Ethernet (RoCE)
• Flow Control
• Explicit Congestion Notification (ECN)
• RSS Support
• Time-Stamping
• Flow Steering
• Wake-on-LAN (WoL)
• Hardware Accelerated 802.1ad VLAN (Q-in-Q Tunneling)
• Local Loopback Disable
• NVME-oF - NVM Express over Fabrics
• Debuggability
• RX Page Cache Size Limit
ethtool Counters
The ethtool counters are counted in different places, according to which they are divided into
groups. Each counter group may also have different counter types.
For the full list of supported ethtool counters, refer to the Understanding mlx5 ethtool Counters
Community post.
Interrupt Request (IRQ) Naming
The following example demonstrates how reducing the number of channels affects the IRQ names.
$ ethtool -l ens1
Channel parameters for ens1:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 12
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 12
$ cat /proc/interrupts
98: 0 0 0 0 0 0 7935 0 0 0
0 0 IR-PCI-MSI-edge mlx5_async@pci:0000:81:00.0
99: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-0
100: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-1
101: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-2
102: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-3
103: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-4
104: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-5
105: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-6
106: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-7
107: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-8
108: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-9
109: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-10
110: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-11
$ ethtool -L ens1 combined 4
$ ethtool -l ens1
Channel parameters for ens1:
…
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 4
$ cat /proc/interrupts
98: 0 0 0 0 0 0 8455 0 0 0
0 0 IR-PCI-MSI-edge mlx5_async@pci:0000:81:00.0
99: 0 0 0 0 0 0 1 2 0 0
0 0 IR-PCI-MSI-edge ens1-0
100: 0 0 0 0 0 0 1 0 2 0
0 0 IR-PCI-MSI-edge ens1-1
101: 0 0 0 0 0 0 1 0 0 2
0 0 IR-PCI-MSI-edge ens1-2
102: 0 0 0 0 0 0 1 0 0 0
2 0 IR-PCI-MSI-edge ens1-3
103: 0 0 0 0 0 0 1 0 0 0
0 1 IR-PCI-MSI-edge mlx5_comp4@pci:0000:81:00.0
104: 0 0 0 0 0 0 2 0 0 0
0 0 IR-PCI-MSI-edge mlx5_comp5@pci:0000:81:00.0
105: 0 0 0 0 0 0 1 1 0 0
0 0 IR-PCI-MSI-edge mlx5_comp6@pci:0000:81:00.0
106: 0 0 0 0 0 0 1 0 1 0
0 0 IR-PCI-MSI-edge mlx5_comp7@pci:0000:81:00.0
107: 0 0 0 0 0 0 1 0 0 1
0 0 IR-PCI-MSI-edge mlx5_comp8@pci:0000:81:00.0
108: 0 0 0 0 0 0 1 0 0 0
1 0 IR-PCI-MSI-edge mlx5_comp9@pci:0000:81:00.0
109: 0 0 0 0 0 0 1 0 0 0
0 1 IR-PCI-MSI-edge mlx5_comp10@pci:0000:81:00.0
110: 0 0 0 0 0 0 2 0 0 0
0 0 IR-PCI-MSI-edge mlx5_comp11@pci:0000:81:00.0
Quality of Service (QoS)
Quality of Service (QoS) is a mechanism for assigning a priority to a network flow (socket, rdma_cm
connection) and managing its guarantees, limitations, and priority over other flows. This is
accomplished by mapping the user's priority to a hardware TC (traffic class) through a two/three-stage
process. The TC is assigned the QoS attributes, and the different flows behave accordingly.
Mapping traffic to TCs consists of several actions which are user controllable, some controlled by
the application itself and others by the system/network administrators.
The following is the general mapping traffic to Traffic Classes flow:
1. The application sets the required Type of Service (ToS).
2. The ToS is translated into a Socket Priority (sk_prio).
3. The sk_prio is mapped to a User Priority (UP) by the system administrator (some applications
set sk_prio directly).
4. The UP is mapped to a TC by the network/system administrator.
5. TCs hold the actual QoS parameters.
QoS can be applied on the following types of traffic. However, the general QoS flow may vary among
them:
• Plain Ethernet - Applications use regular inet sockets and the traffic passes via the kernel
Ethernet driver
• RoCE - Applications use the RDMA API to transmit using Queue Pairs (QPs)
• Raw Ethernet QP - Applications use the Verbs API to transmit using a Raw Ethernet QP
Plain Ethernet Quality of Service Mapping
Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver. The
following is the Plain Ethernet QoS mapping flow:
1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).
2. ToS is translated into the sk_prio using a fixed translation:
4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if
DCBX is used.
Socket applications can use setsockopt(SO_PRIORITY, value) to directly set the sk_prio of the
socket. In this case, the ToS to sk_prio fixed mapping is not needed. This allows the
application and the administrator to utilize more than the 4 values possible via ToS.
In the case of a VLAN interface, the UP obtained according to the above mapping is also
used in the VLAN tag of the traffic.
For RoCE, on old kernels that do not support set_egress_map, use the tc_wrap script to map between
sk_prio and UP. Use tc_wrap with option -u. For example:
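The example appears to be missing; a hedged sketch using the -u option documented later in this section (the mapping shown is illustrative):
tc_wrap.py -i <interface> -u 0,1,2,3,4,5,6,7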
Quality of Service Properties
The different QoS properties that can be assigned to a TC are:
• Strict Priority
• Enhanced Transmission Selection (ETS)
• Rate Limit
• Trust State
• Receive Buffer
• DCBX Control Mode
Strict Priority
When setting a TC's transmission algorithm to 'strict', this TC has absolute (strict) priority
over other strict-priority TCs that come before it (as determined by the TC number: TC 7 is the highest
priority, TC 0 is the lowest). It also has absolute priority over non-strict TCs (ETS).
This property needs to be used with care, as it may easily cause starvation of other TCs.
A higher strict priority TC is always given the first chance to transmit. Only if the highest strict
priority TC has nothing more to transmit, will the next highest TC be considered.
Non-strict priority TCs are considered last for transmission.
This property is extremely useful for low-latency, low-bandwidth traffic that needs immediate
service when it exists, but is not of sufficient volume to starve other transmitters in the system.
Enhanced Transmission Selection (ETS)
The Enhanced Transmission Selection (ETS) standard exploits the time periods in which the offered load
of a particular Traffic Class (TC) is less than its minimum allocated bandwidth by allowing the
difference to be available to other traffic classes.
After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split
among other TCs according to a minimal guarantee policy.
If, for instance, TC0 is set to 80% guarantee and TC1 to 20% (the TCs sum must be 100), then the BW
left after servicing all strict priority TCs will be split according to this ratio.
Since this is a minimum guarantee, there is no maximum enforcement. This means, in the same
example, that if TC1 did not use its share of 20%, the remainder will be used by TC0.
ETS is configured using the mlnx_qos tool, which allows you to:
• Assign a transmission algorithm to each TC (strict or ETS)
• Set minimal BW guarantee to ETS TCs
Usage:
mlnx_qos -i <interface> [options]
Rate Limit
Rate limit defines a maximum bandwidth allowed for a TC. Please note that 10% deviation from the
requested values is considered acceptable.
Trust State
Receive Buffer
By default, the receive buffer configuration is controlled automatically. Users can override the
receive buffer size and the receive buffer's xon and xoff thresholds using the mlnx_qos tool.
For further information, please refer to HowTo Tune the Receive buffers on Mellanox Adapters
Community post.
DCBX Control Mode
DCBX settings, such as "ETS" and "strict priority", can be controlled by firmware or software. When
DCBX is controlled by firmware, changes of QoS settings cannot be done by the software. The DCBX
control mode is configured using the mlnx_qos -d os/fw command.
For further information on how to configure the DCBX control mode, please refer to mlnx_qos
Community post.
mlnx_qos
mlnx_qos is a centralized tool used to configure QoS features of the local host. It communicates
directly with the driver, and thus does not require setting up a DCBX daemon on the system.
The mlnx_qos tool enables the administrator of the system to:
• Inspect the current QoS mappings and configuration
The tool will also display maps configured by TC and vconfig set_egress_map tools, in order
to give a centralized view of all QoS mappings.
• Set UP to TC mapping
• Assign a transmission algorithm to each TC (strict or ETS)
• Set minimal BW guarantee to ETS TCs
• Set rate limit to TCs
• Set DCBX control mode
• Set cable length
• Set trust state
For an unlimited ratelimit, set the ratelimit to 0.
Usage
Options
-h, --help Show this help message and exit
-f LIST, --pfc=LIST Set priority flow control for each priority. LIST is
a comma separated value for each priority, starting from
0 to 7. Example: 0,0,0,0,1,1,1,1 enables PFC on priorities 4-7.
-s LIST, --tsa=LIST Transmission algorithm for each TC. LIST is comma separated
algorithm names for each TC. Possible algorithms: strict, ets and
vendor. Example: vendor,strict,ets,ets,ets,ets,ets,ets sets TC0 to
vendor, TC1 to strict, TC2-7 to ets.
-t LIST, --tcbw=LIST Set the minimally guaranteed %BW for ETS TCs. LIST is comma-
separated percents for each TC. Values set to TCs that are not
configured to the ETS algorithm are ignored but must be present.
Example: if TC0 and TC2 are set to ETS, then 10,0,90,0,0,0,0,0 will set
TC0 to 10% and TC2 to 90%. Percents must sum to 100.
-r LIST, --ratelimit=LIST Rate limit for TCs (in Gbps). LIST is a comma-separated Gbps limit for
each TC. Example: 1,8,8 will limit TC0 to 1Gbps, and TC1,TC2 to 8
Gbps each.
Get the current configuration:
ofed_scripts/utils/mlnx_qos -i ens1f0
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
prio:0 dscp:07,06,05,04,03,02,01,00,
prio:1 dscp:15,14,13,12,11,10,09,08,
prio:2 dscp:23,22,21,20,19,18,17,16,
prio:3 dscp:31,30,29,28,27,26,25,24,
prio:4 dscp:39,38,37,36,35,34,33,32,
prio:5 dscp:47,46,45,44,43,42,41,40,
prio:6 dscp:55,54,53,52,51,50,49,48,
prio:7 dscp:63,62,61,60,59,58,57,56,
Cable len: 7
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 0 0 0 0 0
tc: 0 ratelimit: unlimited, tsa: vendor
priority: 1
tc: 1 ratelimit: unlimited, tsa: vendor
priority: 0
tc: 2 ratelimit: unlimited, tsa: vendor
priority: 2
tc: 3 ratelimit: unlimited, tsa: vendor
priority: 3
tc: 4 ratelimit: unlimited, tsa: vendor
priority: 4
tc: 5 ratelimit: unlimited, tsa: vendor
priority: 5
tc: 6 ratelimit: unlimited, tsa: vendor
priority: 6
tc: 7 ratelimit: unlimited, tsa: vendor
priority: 7
Set ratelimit: 3 Gbps for tc0, 4 Gbps for tc1, and 2 Gbps for tc2:
Configure QoS: map UP 0,7 to tc0, UP 1,2,3 to tc1, and UP 4,5,6 to tc2; set tc0 and tc1 to ets and tc2 to strict; divide ets 30% for tc0 and 70% for tc1:
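The commands for these two examples appear to have been lost; a plausible sketch using the mlnx_qos options (-r, -s and -t are documented above; the -p UP-to-TC mapping option is an assumption):
mlnx_qos -i <interface> -r 3,4,2
mlnx_qos -i <interface> -p 0,1,1,1,2,2,2,0 -s ets,ets,strict -t 30,70,0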
tc and tc_wrap.py
Usage
Options
-u SKPRIO_UP, --skprio_up=SKPRIO_UP Maps sk_prio to priority for RoCE. LIST is up to 16 comma-
separated priorities; the index of each element is the sk_prio.
Example
Run:
tc_wrap.py -i enp139s0
Output:
Additional Tools
The tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. It is
part of the iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is
available.
• mlnx_qos tool (package: ofed-scripts) requires Python version 2.5 or higher
• tc_wrap.py (package: ofed-scripts) requires Python version 2.5 or higher
Packet Pacing
ConnectX-4 and above devices allow packet pacing (traffic shaping) per flow. This capability is
achieved by mapping a flow to a dedicated send queue and setting a rate limit on that send queue.
Note the following:
• Up to 512 send queues are supported
• 16 different rates are supported
• The rates can vary from 1 Mbps to line rate in 1 Mbps resolution
• Multiple queues can be mapped to the same rate (each queue is paced independently)
• It is possible to configure rate limit per CPU and per flow in parallel
System Requirements
• MLNX_EN, version 3.3
• Linux kernel 4.1 or higher
• ConnectX-4 or ConnectX-4 Lx adapter cards with a formal firmware version
This configuration is non-persistent and does not survive driver restart.
1. Firmware Activation:
To activate Packet Pacing in the firmware:
First, make sure Mellanox Firmware Tools service (mst) is started:
# mst start
Then run:
# echo "MLNX_RAW_TLV_FILE" > /tmp/mlxconfig_raw.txt
# echo "0x00000004 0x0000010c 0x00000000 0x00000000" >> /tmp/mlxconfig_raw.txt
# yes | mlxconfig -d <mst_dev> -f /tmp/mlxconfig_raw.txt set_raw > /dev/null
# reboot (or use mlxfwreset)
2. Driver Activation:
There are two operation modes for Packet Pacing. A flow is matched with the following precedence:
• IP + L4 Port
• IP only
• L4 Port only
• No match (the flow would be mapped to default queues)
To create a flow mapping:
Configure the destination IP by writing the IP address in hexadecimal
representation to the relevant sysfs entry. For example, to map IP
address 192.168.1.1 (0xc0a80101) to send queue 310, run the following
command:
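The sysfs command appears to be missing; a hedged sketch, assuming a per-send-queue flow-mapping entry (the flow_map path is an assumption):
echo 0xc0a80101 > /sys/class/net/<interface>/queues/tx-310/flow_map/dst_ip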
From this point on, all traffic destined to the given IP address and L4 port
will be sent using send queue 310. All other traffic will be sent using the
original send queue.
iii. Limit the rate of this flow using the following command:
echo 100 > /sys/class/net/ens2f1/queues/tx-310/tx_maxrate
Quantized Congestion Notification (QCN)
mlnx_qcn is a tool used to configure QCN attributes of the local host. It communicates directly with
the driver, and thus does not require setting up a DCBX daemon on the system.
The mlnx_qcn tool enables the user to:
• Inspect the current QCN configurations for a certain port sorted by priority
• Inspect the current QCN statistics and counters for a certain port sorted by priority
• Set values of chosen QCN parameters
Usage
Options
--version Show program's version number and exit
-h, --help Show this help message and exit
-i INTF, --interface=INTF Interface name
-g TYPE, --get_type=TYPE Type of information to get statistics/parameters
--rpg_enable=RPG_ENABLE_LIST Set value of rpg_enable according to priority, use spaces
between values and -1 for unknown values.
--rppp_max_rps=RPPP_MAX_RPS_LIST Set value of rppp_max_rps according to priority, use spaces
between values and -1 for unknown values.
--rpg_time_reset=RPG_TIME_RESET_LIST Set value of rpg_time_reset according to priority, use
spaces between values and -1 for unknown values.
--rpg_byte_reset=RPG_BYTE_RESET_LIST Set value of rpg_byte_reset according to priority, use
spaces between values and -1 for unknown values.
--rpg_threshold=RPG_THRESHOLD_LIST Set value of rpg_threshold according to priority, use spaces
between values and -1 for unknown values.
--rpg_max_rate=RPG_MAX_RATE_LIST Set value of rpg_max_rate according to priority, use spaces
between values and -1 for unknown values.
--rpg_ai_rate=RPG_AI_RATE_LIST Set value of rpg_ai_rate according to priority, use spaces
between values and -1 for unknown values.
--rpg_hai_rate=RPG_HAI_RATE_LIST Set value of rpg_hai_rate according to priority, use spaces
between values and -1 for unknown values.
--rpg_gd=RPG_GD_LIST Set value of rpg_gd according to priority, use spaces
between values and -1 for unknown values.
--rpg_min_dec_fac=RPG_MIN_DEC_FAC_LIST Set value of rpg_min_dec_fac according to priority, use
spaces between values and -1 for unknown values.
--rpg_min_rate=RPG_MIN_RATE_LIST Set value of rpg_min_rate according to priority, use spaces
between values and -1 for unknown values.
--cndd_state_machine=CNDD_STATE_MACHINE_LIST Set value of cndd_state_machine according to priority, use
spaces between values and -1 for unknown values.
Output example:
priority 0:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
rpg_ai_rate: 10
rpg_hai_rate: 50
rpg_gd: 8
rpg_min_dec_fac: 2
rpg_min_rate: 10
cndd_state_machine: 0
priority 1:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
rpg_ai_rate: 10
rpg_hai_rate: 50
rpg_gd: 8
rpg_min_dec_fac: 2
rpg_min_rate: 10
cndd_state_machine: 0
.............................
.............................
priority 7:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
rpg_ai_rate: 10
rpg_hai_rate: 50
rpg_gd: 8
rpg_min_dec_fac: 2
rpg_min_rate: 10
cndd_state_machine: 0
Setting QCN parameters requires updating the value for each priority; '-1' indicates no change to the
current value.
Example for setting 'rpg_enable' in order to enable QCN for priorities 3, 5, 6:
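The command appears to be missing; a hedged sketch using the --rpg_enable option above (quotes added so the space-separated list survives the shell):
mlnx_qcn -i <interface> --rpg_enable="-1 -1 -1 1 -1 1 1 -1"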
Ethtool
Ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for
wired Ethernet devices. It can be used to:
• Get identification and diagnostic information
• Get extended device statistics
• Control speed, duplex, auto-negotiation and flow control for Ethernet devices
• Control checksum offload and other hardware offload features
• Control DMA ring sizes and interrupt moderation
The following are the ethtool supported options:
Options    Description
ethtool --set-priv-flags eth<x> <priv flag> <on/off>    Enables/disables the driver feature matching the given private flag.
ethtool --show-priv-flags eth<x> Shows driver private flags and their states (ON/OFF)
ethtool -A eth<x> [rx on|off] [tx on|off]    Sets the pause frame settings.
ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]    Sets the values for packet rate limits and for moderation time high and low values. Note: Supported in ConnectX®-3/ConnectX®-3 Pro cards only.
ethtool -G eth<x> [rx <N>] [tx <N>]    Modifies the ring size.
ethtool -i eth<x>    Checks driver and device information. For example:
#> ethtool -i eth2
driver: mlx4_en (MT_0DD0120009_CX3)
version: 2.1.6 (Aug 2013)
firmware-version: 2.30.3000
bus-info: 0000:1a:00.0
ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off] [gro on|off] [gso on|off] [rxvlan on|off] [txvlan on|off] [ntuple on/off] [rxhash on/off] [rx-all on/off] [rx-fcs on/off]    Sets the stateless offload status.
TCP Segmentation Offload (TSO), Generic Segmentation Offload (GSO): increase outbound throughput by reducing CPU overhead. It works by queuing up large buffers and letting the network interface card split them into separate packets.
ethtool -L eth<x> [rx <N>] [tx <N>] Sets the number of channels.
Notes:
• This also resets the RSS table to its default
distribution, which is uniform across the cores on
the NUMA (non-uniform memory access) node that
is closer to the NIC.
• For ConnectX®-4 cards, use ethtool -L eth<x>
combined <N> to set both RX and TX channels.
ethtool -m|--dump-module-eeprom eth<x> [raw on|off] [hex on|off] [offset N] [length N]    Queries/decodes the cable module eeprom information.
ethtool -p|--identify DEVNAME Enables visual identification of the port by LED blinking
[TIME-IN-SECONDS].
ethtool -p|--identify eth<x> <LED duration>    Allows users to identify the interface's physical port by turning the port's LED on for a number of seconds.
ethtool -s eth<x> advertise <N> autoneg on    Changes the advertised link modes to requested link modes <N>.
Notes:
• Both previous and new link modes configurations
are supported, however, they must be run
separately.
• Any link mode configuration on Kernels below v5.1
and ConnectX-6 HCAs will result in the
advertisement of the full capabilities.
• <autoneg on> only sends a hint to the driver that
the user wants to modify advertised link modes
and not speed.
ethtool -s eth<x> msglvl [N] Changes the current driver message level.
ethtool -s eth<x> speed <SPEED> autoneg off Changes the link speed to requested <SPEED>. To check
the supported speeds, run ethtool eth<x>.
ethtool -X eth<x> equal a b c... Sets the receive flow hash indirection table.
Checksum Offload
MLNX_EN supports the following Receive IP/L4 Checksum Offload modes:
• CHECKSUM_UNNECESSARY: By setting this mode the driver indicates to the Linux Networking
Stack that the hardware successfully validated the IP and L4 checksum so the Linux
Networking Stack does not need to deal with IP/L4 Checksum validation.
Checksum Unnecessary is passed to the OS when all of the following are true:
• Ethtool -k <DEV> shows rx-checksumming: on
• Received TCP/UDP packet and both IP checksum and L4 protocol checksum are correct.
• CHECKSUM_COMPLETE: When the checksum validation cannot be done or fails, the driver still
reports to the OS the checksum value calculated by the hardware. This accelerates
checksum validation in the Linux Networking Stack, since it does not have to calculate the whole
checksum, including the payload, by itself.
Checksum Complete is passed to OS when all of the following are true:
• Ethtool -k <DEV> shows rx-checksumming: on
• Using ConnectX®-3, firmware version 2.31.7000 and up
• Received IPv4/IPv6 non-TCP/UDP packet
The ingress parser of the ConnectX®-3-Pro card comes by default without checksum
offload support for non-TCP/UDP packets.
To change that, please set the value of the module parameter ingress_parser_mode
in mlx4_core to 1.
In this mode, IPv4/IPv6 non-TCP/UDP packets will be passed up to the protocol stack with
CHECKSUM_COMPLETE tag.
In this mode of the ingress parser, the following features are unavailable:
• NVGRE stateless offloads
• VXLAN stateless offloads
• RoCE v2 (RoCE over UDP)
• CHECKSUM_NONE: By setting this mode the driver indicates to the Linux Networking Stack
that the hardware failed to validate the IP or L4 checksum so the Linux Networking Stack
must calculate and validate the IP/L4 Checksum.
Checksum None is passed to OS for all other cases.
Ignore Frame Check Sequence (FCS) Errors
This feature is supported in ConnectX-3 Pro and ConnectX-4 cards only.
Upon receiving packets, the packets go through a checksum validation process for the FCS field. If
the validation fails, the received packets are dropped.
When FCS is enabled (disabled by default), the device does not validate the FCS field even if the
field is invalid.
It is not recommended to enable FCS.
For further information on how to enable/disable FCS, please refer to ethtool option rx-fcs on/off.
RDMA over Converged Ethernet (RoCE)
RoCE is a standard protocol that provides RDMA (InfiniBand) services over Ethernet to deliver
ultra-low latency for performance-critical and transaction-intensive applications such as financial,
database, storage, and content delivery networks.
When working with RDMA applications over the Ethernet link layer, the following points should be noted:
• The presence of a Subnet Manager (SM) is not required in the fabric. Thus, operations that
require communication with the SM are managed in a different way in RoCE. This does not
affect the API, but only the actions, such as joining a multicast group, that need to be taken
when using the API
• Since LID is a layer 2 attribute of the InfiniBand protocol stack, it is not set for a port and is
displayed as zero when querying the port
• With RoCE, the alternate path is not set for RC QP. Therefore, APM (another type of High
Availability and part of the InfiniBand protocol) is not supported
• Since the SM is not present, querying a path is impossible. Therefore, the path record
structure must be filled with relevant values before establishing a connection. Hence, it is
recommended working with RDMA-CM to establish a connection as it takes care of filling the
path record structure
• VLAN tagged Ethernet frames carry a 3-bit priority field. The value of this field is derived
from the IB SL field by taking the 3 least significant bits of the SL field
• RoCE traffic is not shown in the associated Ethernet device's counters since it is offloaded by
the hardware and does not go through Ethernet network driver. RoCE traffic is counted in the
same place where InfiniBand traffic is counted; /sys/class/infiniband/<device>/ports/<port
number>/counters/
For further information on RoCE usage, please refer to MLNX_OFED User Manual.
Flow Control
Priority Flow Control (PFC) IEEE 802.1Qbb applies pause functionality to specific classes of traffic on
the Ethernet link. For example, PFC can provide lossless service for the RoCE traffic and best-effort
service for the standard Ethernet traffic. PFC can provide different levels of service to specific
classes of Ethernet traffic (using IEEE 802.1p traffic classes).
Set the pfctx and pfcrx mlx4_en module parameters in the file /etc/modprobe.d/mlx4_en.conf:
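The options line itself appears to be missing; a plausible sketch enabling priority 3 in both directions (parameter names taken from the text above):
options mlx4_en pfctx=0x08 pfcrx=0x08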
Note: These parameters are a bitmap of 8 bits. In the example above, only priority 3 is enabled (0x8
is 00001000b). 0x10 will enable priority 4, and so on.
Configuring PFC on ConnectX-4
1. Enable PFC on the desired priority. Example (Priority=4):
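The command appears to be missing; a plausible sketch using the mlnx_qos --pfc option described earlier:
mlnx_qos -i <interface> -f 0,0,0,0,1,0,0,0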
2. Create a VLAN interface. Example (VLAN_ID=5):
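The command appears to be missing; a hedged sketch using standard iproute2 (older setups may use vconfig instead):
ip link add link <interface> name <interface>.5 type vlan id 5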
tc_wrap.py -i <interface>
PFC Auto-Configuration Using LLDP Tool in the OS
1. Start lldpad daemon on host:
lldpad -d
Or
service lldpad start
lldptool -T -i <ethX> -V mngAddr enableTx=yes
lldptool -T -i <ethX> -V PFC enableTx=yes
lldptool -T -i <ethX> -V CEE-DCBX enableTx=yes
Example:
where:
[pfcup:xxxxxxxx]    Enables/disables priority flow control. From left to right (priorities 0-7), x can be equal to either 0 or 1. 1 indicates that the priority is configured to transmit priority pause.
lldptool -T -i <ethX> -V PFC enabled=x,x,x,x,x,x,x,x
Example:
lldptool -T -i eth2 -V PFC enabled=1,2,4
where:
enabled    Displays or sets the priorities with PFC enabled. The set attribute takes a comma-separated list of priorities to enable, or the string none to disable all priorities.
There are two ways to configure PFC and ETS on the server:
1. Local Configuration - Configuring each server manually.
2. Remote Configuration - Configuring PFC and ETS on the switch, after which the switch will
pass the configuration to the server using LLDP DCBX TLVs.
There are two ways to implement the remote configuration using ConnectX-4 adapters:
a. Configuring the adapter firmware to enable DCBX.
b. Configuring the host to enable DCBX.
For further information on how to auto-configure PFC using LLDP in the firmware, refer to the
HowTo Auto-Config PFC and ETS on ConnectX-4 via LLDP DCBX Community post.
2. Add DCBX to the list of supported TLVs per required interface.
For IEEE DCBX:
4. Make sure PFC is enabled on the host (for enabling PFC on the host, refer to PFC
Configuration on Hosts section above). Once it is enabled, it will be passed in the LLDP TLVs.
5. Enable PFC with the desired priority on the Ethernet port.
Priority Counters
MLNX_EN driver supports several ingress and egress counters per priority. Run ethtool -S to get the
full list of port counters.
ConnectX-3 Example
# ethtool -S eth1 | grep prio_3
rx_prio_3_packets: 5152
rx_prio_3_bytes: 424080
tx_prio_3_packets: 328209
tx_prio_3_bytes: 361752914
rx_pause_prio_3: 14812
rx_pause_duration_prio_3: 0
rx_pause_transition_prio_3: 0
tx_pause_prio_3: 0
tx_pause_duration_prio_3: 47848
tx_pause_transition_prio_3: 7406
ConnectX-4 Example
Note: The Pause counters in ConnectX-4 are visible via ethtool only for priorities on which PFC is
enabled.
PFC Storm Prevention
This feature is applicable to ConnectX-4/ConnectX-5 adapter cards family only.
PFC storm prevention enables toggling between default and auto modes.
The stall prevention timeout is configured to 8 seconds by default. Auto mode sets the stall
prevention timeout to be 100 msec.
The feature can be controlled using sysfs in the following directory: /sys/class/net/eth*/settings/
pfc_stall_prevention
• To query the PFC stall prevention mode:
cat /sys/class/net/eth*/settings/pfc_stall_prevention
Example
$ cat /sys/class/net/ens6/settings/pfc_stall_prevention
default
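The configuration command appears to be missing; a plausible sketch mirroring the query path above (mode values per the text):
echo auto > /sys/class/net/<interface>/settings/pfc_stall_prevention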
• tx_pause_storm_error_events - when the device is stalled for a period longer than a pre-configured
timeout, pause transmission is disabled and the counter increases.
Dropless RQ
This feature is applicable to the ConnectX-4/ConnectX-5 adapter cards family only.
Dropless RQ feature enables the driver to notify the FW when SW receive queues are overloaded.
This scenario takes place when the handling of SW receive queue is slower than the handling of the
HW receive queues.
When this feature is enabled, a packet that is received while the receive queue is full will not be
immediately dropped. The FW will accumulate these packets assuming posting of new WQEs will
resume shortly. If received WQEs are not posted after a certain period of time, out_of_buffer
counter will increase, indicating that the packet has been dropped.
This feature is disabled by default. To activate it, ensure that the Flow Control feature is also
enabled.
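A hedged sketch of enabling the feature, assuming it is exposed as the dropless_rq private flag (the flag name is an assumption):
ethtool --set-priv-flags <interface> dropless_rq on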
Output example:
Explicit Congestion Notification (ECN)
This feature is supported on the ConnectX-4 adapter cards family and above only.
ECN is an extension to the IP protocol. It allows reliable communication by notifying all ends of
communication when congestion occurs. This is done without dropping packets.
Please note that this feature requires all nodes in the path (nodes, routers etc) between the
communicating nodes to support ECN to ensure reliable communication. ECN is marked as 2 bits in
the traffic control IP header. This ECN implementation refers to RoCE v2.
/sys/class/net/<interface>/<protocol>/ecn_<protocol>_enable =1
Each algorithm has a set of relevant parameters and statistics, which are defined per device, per
port, per priority.
cat /sys/class/net/<interface>/ecn/<protocol>/enable/X
where:
• X: priority {0..7}
• protocol: roce_rp / roce_np
• requested attributes: the relevant attributes for each protocol.
RSS Support
The device has the ability to use XOR as the RSS distribution function, instead of the default Toeplitz
function.
For a small number of streams, the XOR function can achieve a better distribution among the driver's
receive queues, as it distributes each TCP/UDP stream to a different queue.
MLNX_EN provides one of the following options to change the working RSS hash function from Toeplitz
to XOR, and vice versa:
• Through ethtool priv-flags, in case mlx4_rss_xor_hash_function is part of the priv-flags
list.
Output:
MLNX_EN also provides the following sysfs option to change the working RSS hash function from
Toeplitz to XOR, and vice versa:
cat /sys/class/net/eth*/settings/hfunc
Example:
cat /sys/class/net/eth2/settings/hfunc
Operational hfunc: toeplitz
Supported hfuncs: xor toeplitz
RSS Support for IP Fragments
Supported in ConnectX-3 and ConnectX-3 Pro only.
As of MLNX_EN v2.4-1.0.0, RSS will distribute incoming IP fragmented datagrams according to its
hash function, considering the L3 IP header values. Different IP fragmented datagram flows will be
directed to different rings.
When the first packet in an IP fragment chain contains the upper-layer transport header (e.g., UDP
packets larger than the MTU), it will be directed to the same target as the subsequent IP
fragments that follow it, to prevent out-of-order processing.
Time-Stamping
Time-Stamping Service
Time-stamping is the process of keeping track of the creation of a packet. A time-stamping service
supports assertions of proof that a datum existed before a particular time. Incoming packets are
time-stamped before they are distributed on the PCI depending on the congestion in the PCI buffers.
Outgoing packets are time-stamped very close to placing them on the wire.
Enabling Time-Stamping
SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time-stamp in hardware
SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time-stamp transformed into the system time base
An admin-privileged user can enable/disable time stamping by calling ioctl(sock, SIOCSHWTSTAMP,
&ifreq) with the following values:
• Send side time sampling, enabled by ifreq.hwtstamp_config.tx_type when:
enum hwtstamp_rx_filters {
/* time stamp no incoming packet at all */
HWTSTAMP_FILTER_NONE,
/* time stamp any incoming packet */
HWTSTAMP_FILTER_ALL,
/* return value: time stamp all packets requested plus some others */
HWTSTAMP_FILTER_SOME,
/* PTP v1, UDP, any kind of event packet */
HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
/* PTP v1, UDP, Sync packet */
HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
/* PTP v1, UDP, Delay_req packet */
HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
/* PTP v2, UDP, any kind of event packet */
HWTSTAMP_FILTER_PTP_V2_L4_EVENT,
/* PTP v2, UDP, Sync packet */
HWTSTAMP_FILTER_PTP_V2_L4_SYNC,
/* PTP v2, UDP, Delay_req packet */
HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ,
Getting Time-Stamping
Once time stamping is enabled time stamp is placed in the socket Ancillary data. recvmsg() can be
used to get this control message for regular incoming packets. For send time stamps the outgoing
packet is looped back to the socket's error queue with the send time-stamp(s) attached. It can
be received with recvmsg (flags=MSG_ERRQUEUE). The call returns the original outgoing packet data
including all headers prepended down to and including the link layer, the scm_timestamping
control message and a sock_extended_err control message with ee_errno==ENOMSG and
ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending bounced packet is ready for
reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the
first fragment is time stamped and returned to the sending socket.
When time-stamping is enabled, VLAN stripping is disabled. For more info please refer to
Documentation/networking/timestamping.txt in kernel.org
To display the Time-Stamping capabilities of a network interface, run:
ethtool -T eth<x>
Example:
ethtool -T p2p1
Time stamping parameters for p2p1:
Capabilities:
hardware-transmit (SOF_TIMESTAMPING_TX_HARDWARE)
software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE)
hardware-receive (SOF_TIMESTAMPING_RX_HARDWARE)
software-receive (SOF_TIMESTAMPING_RX_SOFTWARE)
software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
hardware-raw-clock (SOF_TIMESTAMPING_RAW_HARDWARE)
PTP Hardware Clock: 1
Hardware Transmit Timestamp Modes:
off (HWTSTAMP_TX_OFF)
on (HWTSTAMP_TX_ON)
Hardware Receive Filter Modes:
none (HWTSTAMP_FILTER_NONE)
all (HWTSTAMP_FILTER_ALL)
As a result of Receive Side Steering (RSS), PTP traffic arriving at UDP ports 319 and 320 may reach
the user-space application out of order. To prevent this, PTP traffic needs to be steered to a
single RX ring using ethtool.
Example:
# ethtool -u ens7
8 RX rings available
Total 0 rules
# ethtool -U ens7 flow-type udp4 dst-port 319 action 0 loc 1
# ethtool -U ens7 flow-type udp4 dst-port 320 action 0 loc 0
# ethtool -u ens7
8 RX rings available
Total 2 rules
Filter: 0
Rule Type: UDP over IPv4
Src IP addr: 0.0.0.0 mask: 255.255.255.255
Dest IP addr: 0.0.0.0 mask: 255.255.255.255
TOS: 0x0 mask: 0xff
Src port: 0 mask: 0xffff
Dest port: 320 mask: 0x0
Action: Direct to queue 0
Filter: 1
Rule Type: UDP over IPv4
Src IP addr: 0.0.0.0 mask: 255.255.255.255
Dest IP addr: 0.0.0.0 mask: 255.255.255.255
TOS: 0x0 mask: 0xff
Src port: 0 mask: 0xffff
Dest port: 319 mask: 0x0
Action: Direct to queue 0
This feature is supported on ConnectX-4 adapter cards family and above only.
1PPS is a time-synchronization feature that allows the adapter to send or receive 1 pulse
per second on a dedicated pin on the adapter card, using an SMA (SubMiniature version A) connector.
Only one pin is supported, and it can be configured as either 1PPS in or 1PPS out.
For further information, refer to HowTo Test 1PPS on Mellanox Adapters Community post.
Flow Steering
Flow steering is a model that steers network flows, based on flow specifications, to specific
QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility,
domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a
combination of L2-L4 flow specifications, a destination QP and a priority. Flow steering rules may be
inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses different
terminology from the flow attribute (ibv_exp_flow_attr), defined by a combination of specifications
(struct ibv_exp_flow_spec_*).
Applicable to ConnectX®-3 and ConnectX®-3 Pro adapter cards only.
Flow steering is generally enabled when the log_num_mgm_entry_size module parameter is
non-positive (e.g., -log_num_mgm_entry_size). In this case, the absolute value of the parameter is a bit
field in which every bit indicates a condition or an option regarding the flow steering mechanism:
Bit    Operation    Description
b0    Force device managed Flow Steering    When set to 1, it forces HCA to be enabled regardless of whether NC-SI Flow Steering is supported or not.
b2    Enable A0 static DMFS steering (see "A0 Static Device Managed Flow Steering")    When set to 1, A0 static DMFS steering is enabled. This bit should be set to 0 when "b1 - Disable IPoIB Flow Steering" is 0.
b3    Enable DMFS only if the HCA supports more than 64 QPs per MCG entry    When set to 1, DMFS is enabled only if the HCA supports more than 64 QPs attached to the same rule. For example, attaching 64 VFs to the same multicast address causes 64 QPs to be attached to the same MCG. If the HCA supports less than 64 QPs per MCG, B0 is used.
b5    Optimize the steering table for non-source IP rules when possible    When set to 1, the steering table will be optimized to support rules ignoring the source IP check. This optimization is possible only when DMFS mode is set.
The default value of log_num_mgm_entry_size is -10, meaning Ethernet Flow Steering (i.e., IPoIB
DMFS is disabled by default) is enabled by default if NC-SI DMFS is supported and the HCA supports
at least 64 QPs per MCG entry. Otherwise, L2 steering (B0) is used.
When using SR-IOV, flow steering is enabled if there is an adequate amount of space to store the
flow steering table for the guest/master.
Flow Steering is supported in ConnectX®-3, ConnectX®-3 Pro, ConnectX®-4 and
ConnectX®-4 Lx adapter cards.
[For ConnectX®-3 and ConnectX®-3 Pro only] To determine which Flow Steering features are
supported:
ethtool --show-priv-flags eth4
For ConnectX-4 and ConnectX-4 Lx adapter cards, all supported features are enabled.
Flow Steering support in InfiniBand is determined according to the EXP_MANAGED_FLOW_STEERING flag.
A0 Static Device Managed Flow Steering
This mode enables fast steering, however it might impact flexibility. Using it increases the packet
rate performance by ~30%, with the following limitations for Ethernet link-layer unicast QPs:
• Limits the number of opened RSS Kernel QPs to 96. MACs should be unique (1 MAC per 1 QP).
The number of VFs is limited
• When creating Flow Steering rules for user QPs, only MAC--> QP rules are allowed. Both MACs
and QPs should be unique between rules. Only 62 such rules could be created
• When creating rules with Ethtool, MAC--> QP rules could be used, where the QP must be the
indirection (RSS) QP. Creating rules that indirect traffic to other rings is not allowed. Ethtool
MAC rules to drop packets (action -1) are supported
• RFS is not supported in this mode
• VLAN is not supported in this mode
ConnectX®-4 and ConnectX®-4 Lx adapter cards support only User Verbs domain with struct
ibv_exp_flow_spec_eth flow specification using 4 priorities.
Flow Domains and Priorities
Flow steering defines the concept of domain and priority. Each domain represents a user agent that
can attach a flow. The domains are prioritized. A higher priority domain will always supersede a
lower priority domain when their flow specifications overlap. Setting a lower priority value will
result in a higher priority.
In addition to the domain, there is a priority within each of the domains. Each domain can have at
most 2^12 priorities in accordance with its needs.
The following are the domains at a descending order of priority:
• Ethtool
Ethtool domain is used to attach an RX ring, specifically its QP to a specified flow. Please
refer to the most recent ethtool manpage for all the ways to specify a flow.
Examples:
When configuring two rules with the same priority, the second rule will overwrite the first one, so
this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel requires
support from both the ethtool in the user space and in kernel (v2.6.28).
Optional fields per flow type: ether - vlan; tcp4/udp4 - src-ip, dst-ip, src-port, dst-port, vlan; ip4 - src-ip, dst-ip, vlan
To configure RFS, configure the RFS flow table entries (globally and per core).
Note: The functionality remains disabled until explicitly configured (by default it is 0).
• The number of entries in the global flow table is set as follows:
/proc/sys/net/core/rps_sock_flow_entries
• The number of entries in the per-queue flow table are set as follows:
/sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt
Example:
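The example appears to be missing; a plausible sketch (entry count and interface name are illustrative):
echo 32768 > /proc/sys/net/core/rps_sock_flow_entries
echo 32768 > /sys/class/net/eth0/queues/rx-0/rps_flow_cnt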
To configure aRFS:
The aRFS feature requires explicit configuration in order to enable it. Enabling aRFS requires
enabling the 'ntuple' flag via ethtool.
For example, to enable ntuple for eth0, run:
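A minimal sketch:
ethtool -K eth0 ntuple on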
aRFS requires the kernel to be compiled with the CONFIG_RFS_ACCEL option. This option is available
in kernels 2.6.39 and above. Furthermore, aRFS requires Device Managed Flow Steering support.
RFS cannot function if LRO is enabled. LRO can be disabled via ethtool.
Flow Steering Dump Tool
This tool is only supported for ConnectX-4 and above adapter cards.
The mlx_fs_dump is a Python tool that prints the steering rules in a readable manner. Python v2.7 or
above, as well as the pip, anytree and termcolor libraries, are required to be installed on the host.
Running example:
./ofed_scripts/utils/mlx_fs_dump -d /dev/mst/mt4115_pciconf0
FT: 9 (level: 0x18, type: NIC_RX)
+-- FG: 0x15 (MISC)
|-- FTE: 0x0 (FWD) to (TIR:0x7e) out.ethtype:IPv4 out.ip_prot:UDP out.udp_dport:0x140
+-- FTE: 0x1 (FWD) to (TIR:0x7e) out.ethtype:IPv4 out.ip_prot:UDP out.udp_dport:0x13f
...
For further information on the mlx_fs_dump tool, please refer to mlx_fs_dump Community post.
Wake-on-LAN (WoL)
Wake-on-LAN (WoL) is a technology that allows a network professional to remotely power on a
computer or to wake it up from sleep mode.
• To enable WoL:
• To get WoL:
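The commands appear to be missing; standard ethtool usage (interface name illustrative):
ethtool -s <interface> wol g
ethtool <interface> | grep Wake-on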
Where:
"g" is the magic packet activity.
Hardware Accelerated 802.1ad VLAN (Q-in-Q Tunneling)
Requirements
• ConnectX-3/ConnectX-3 Pro without Rx offloads
• Firmware version 2.34.8240 and up
• Kernel version 3.10 and up
• iproute-3.10.0-13.el7.x86_64 and up
Once the private flag and the ethtool device feature are set, the device will be ready for
802.1ad VLAN acceleration.
The "phv-bit" private flag setting is available for the Physical Function (PF) only.
The Virtual Function (VF) can use the VLAN acceleration by setting the "tx-vlan-stag-
hw-insert" parameter only if the private flag "phv-bit" is enabled by the PF. If the PF
enables/disables the "phv-bit" flag after the VF driver is up, the configuration will
take place only after the VF driver is restarted.
Local Loopback Disable
When turned off, the driver configures the loopback mode according to its own logic.
NVME-oF - NVM Express over Fabrics
NVME-oF
NVME-oF enables NVMe message-based commands to transfer data between a host computer and a
target solid-state storage device or system over a network such as Ethernet, Fibre Channel, and
InfiniBand. Tunneling NVMe commands through an RDMA fabric provides a high throughput and a low
latency.
For information on how to configure NVME-oF, please refer to the HowTo Configure NVMe over
Fabrics Community post.
NVME-oF Target Offload
This feature is only supported for the ConnectX-5 adapter cards family and above.
NVME-oF Target Offload is an implementation of the new NVME-oF standard Target (server) side in
hardware. Starting from ConnectX-5 family cards, all regular IO requests can be processed by the
HCA, with the HCA sending IO requests directly to a real NVMe PCI device, using peer-to-peer PCI
communications. This means that excluding connection management and error flows, no CPU
utilization will be observed during NVME-oF traffic.
• For instructions on how to configure NVME-oF target offload, refer to HowTo Configure NVME-
oF Target Offload Community post.
• For instructions on how to verify that NVME-oF target offload is working properly, refer
to Simple NVMe-oF Target Offload Benchmark Community post.
Debuggability
If the debugfs feature is enabled in the kernel, the mlx5 driver maintains a subdirectory containing
useful debug information for each open eth port inside /sys/kernel/debug/.
For the default network namespace, the subdirectory name is the name of the interface, like "eth8".
When the network interface is moved to the non-default network namespaces, the interface name is
followed by "@" and the port's PCI ID. For example, the subdirectory name would be
"eth8@0000:af:00.3".
RX Page Cache Size Limit
By default, the RX page cache size can extend up to 16 times the original size of the WQ (work queue), at most.
When the RX default limit is too high, the table may extend too much causing iommu allocation
(iommu_alloc) problems.
To prevent this, the RX page cache size limit can be set to a lower value using sysfs.
The value is the log of the multiplier of the basic cache size. In other words, when the value is
set to 4, the RX cache table can extend up to 16 times its original size, and when the value is 2, the
table can extend up to 4 times its original size.
Because the size of tables changes fast, changing the RX page cache size limit has a quick impact,
even if the table size is larger than the new limit.
In case of iommu_alloc problems, try reducing the RX page cache size limit; in some setups,
reducing the limit does not impair performance.
Originally, the log limit of the RX page cache size was always 4; therefore, 4 is the default value set
by the feature. The value has to be an integer between 0 and 4, inclusive, where 0 means that
the table will not be able to extend at all.
Below is an example of how to check the current log limit of the RX page cache size:
cat /sys/class/net/<ifs-name>/rx_page_cache/log_mult_limit
Output: log rx page cache mult limit is 4
Below is an example of setting a new log limit for the RX page cache size, where <val> is the log of
the multiplier of the basic cache size.
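The command follows from the sysfs path shown above:
echo <val> > /sys/class/net/<ifs-name>/rx_page_cache/log_mult_limit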
Virtualization
This chapter contains the following sections:
Single Root IO Virtualization (SR-IOV)
Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present
itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the
device with separate resources. Mellanox adapters are capable of exposing up to 126 virtual
instances, called Virtual Functions (VFs), on ConnectX®-3 adapter cards, and up to 62 virtual
instances on ConnectX-4/Connect-IB adapter cards. These virtual functions can then be provisioned
separately. Each VF
can be seen as an additional device connected to the Physical Function. It shares the same resources
with the Physical Function, and its number of ports equals those of the Physical Function.
SR-IOV is commonly used in conjunction with an SR-IOV enabled hypervisor to provide virtual
machines direct hardware access to network resources, hence increasing performance.
In this chapter we will demonstrate setup and configuration of SR-IOV in a Red Hat Linux
environment using Mellanox ConnectX® VPI adapter cards family.
System Requirements
Setting Up SR-IOV
Depending on your system, perform the steps below to set up your BIOS. The figures used in this
section are for illustration purposes only. For further information, please refer to the appropriate
BIOS User Manual:
1. Enable "SR-IOV" in the system BIOS.
default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.32-36.x86-64)
root (hd0,0)
kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on
initrd /initrd-2.6.32-36.x86-64.img
Note: Please make sure the parameter "intel_iommu=on" exists when updating the /boot/
grub/grub.conf file, otherwise SR-IOV cannot be loaded.
Some OSs use /boot/grub2/grub.cfg file. If your server uses such file, please edit this file
instead (add “intel_iommu=on” for the relevant menu entry at the end of the line that starts
with "linux16").
[HCA]
num_pfs = 1
total_vfs = <0-126>
sriov_en = true
where:
Parameter    Recommended Value
num_pfs 1
Note: This field is optional and might not always appear.
total_vfs • When using firmware version 2.31.5000 and above, the recommended
value is 126.
• When using firmware version 2.30.8000 and below, the recommended
value is 63
Note: Before setting number of VFs in SR-IOV, please make sure your system can
support that amount of VFs. Setting number of VFs larger than what your Hardware
and Software can support may cause your system to cease working.
sriov_en true
Notes:
- If SR-IOV is supported, to enable SR-IOV (if it is not enabled), it is sufficient to set
“sriov_en = true” in the INI.
- If the HCA does not support SR-IOV, please contact Mellanox Support:
support@mellanox.com
b. Add the above fields to the INI if they are missing.
c. Set the total_vfs parameter to the desired number if you need to change the number
of total VFs.
d. Reburn the firmware using the mlxburn tool if the fields above were added to the INI,
or the total_vfs parameter was modified.
If mlxburn is not installed, please download it from the Mellanox website
at http://www.mellanox.com → Products → Firmware Tools.
where:
Parameter Recommended Value
num_vfs    Notes:
• Triplets and single port VFs are only valid when all ports are configured as Ethernet. When an InfiniBand port exists, only the num_vfs=a syntax is valid, where "a" is a single value that represents the number of VFs.
• The second parameter in a triplet is valid only when there are more than one physical port.
• In a triplet, x+z<=63 and y+z<=63; the maximum number of VFs on each physical port must be 63.
port_type_array Specifies the protocol type of the ports. It is either one array
of 2 port types 't1,t2' for all devices or list of BDF to
port_type_array 'bb:dd.f-t1;t2,...'. (string)
Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A
If only a single port is available, use the N/A port type for
port2 (e.g '1,4').
Note that this parameter is valid only when num_vfs is not
zero (i.e., SRIOV is enabled). Otherwise, it is ignored.
• The second parameter in a triplet is valid only when there are more than one physical port.
• Every value (either a value in a triplet or a single value) should be less than or equal to the respective value of the num_vfs parameter.
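The example referred to below appears to have been lost; a plausible reconstruction matching the description that follows (5 VFs, Port1 IB, Port2 Ethernet):
options mlx4_core num_vfs=5 port_type_array=1,2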
The example above loads the driver with 5 VFs (num_vfs). The standard use of a VF is a
single VF per a single VM. However, the number of VFs varies according to the working mode
requirements.
The protocol types are:
- Port 1 = IB
- Port 2 = Ethernet
- port_type_array=2,2 (Ethernet, Ethernet)
- port_type_array=1,1 (IB, IB)
- port_type_array=1,2 (VPI: IB, Ethernet)
- NO port_type_array module parameter: ports are IB
For single port HCAs the possible values are (1,1) or (2,2).
5. Reboot the server.
If the SR-IOV is not supported by the server, the machine might not come out of
boot/load.
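The lspci listing this refers to appears to be missing; a typical check (the 03:00 bus address below is illustrative):
lspci | grep Mellanox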
where:
- “03:00” represents the Physical Function
- “03:00.X” represents the Virtual Function connected to the Physical Function
Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server
1. Run the virt-manager.
2. Double click on the virtual machine and open its Properties.
3. Go to Details → Add hardware → PCI host device.
4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1)
5. If the Virtual Machine is up, reboot it; otherwise, start it.
6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run:
Example:
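A minimal check inside the guest:
lspci | grep Mellanox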
SR-IOV Virtual Function configuration can be done through the Hypervisor iproute2/netlink tool, if
present. Otherwise, it can be done via sysfs.
/sys/class/net/enp8s0f0/device/sriov/[VF]
+-- [VF]
| +-- config
| +-- link_state
| +-- mac
| +-- mac_list
| +-- max_tx_rate
| +-- min_tx_rate
| +-- spoofcheck
| +-- stats
| +-- trunk
| +-- trust
| +-- vlan
When running ETH ports on VGT, the ports may be configured to simply pass through packets as-is
from VFs (VLAN Guest Tagging), or the administrator may configure the Hypervisor to silently force
packets to be associated with a VLAN/QoS (VLAN Switch Tagging).
In the latter case, untagged or priority-tagged outgoing packets from the guest will have the VLAN
tag inserted, and incoming packets will have the VLAN tag removed.
The default behavior is VGT.
ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>]
where:
• NUM = 0..max-vf-num
• vlan_id = 0..4095
• qos = 0..7
For example:
• ip link set dev eth2 vf 2 vlan 10 qos 3 - sets VST mode for VF #2 belonging to PF eth2, with
vlan_id = 10 and qos = 3
• ip link set dev eth2 vf 2 vlan 0 - sets mode for VF 2 back to VGT
Note: In ConnectX-3 adapter cards family, switching to VGT mode can also be done by setting
vlan_id to 4095.
Additional Ethernet VF Configuration Options
• Guest MAC configuration - By default, guest MAC addresses are configured to be all zeroes. If
the administrator wishes the guest to always start up with the same MAC, he/she should
configure guest MACs before the guest driver comes up. The guest MAC may be configured
using:
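A sketch following the ip link VF syntax used elsewhere in this section:
ip link set dev <PF device> vf <NUM> mac <MAC address>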
For legacy and ConnectX-4 guests, which do not generate random MACs, the administrator
should always configure their MAC addresses via IP link, as above.
• Spoof checking - Spoof checking is currently available only on upstream kernels newer than
3.1.
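The command appears to be missing; the standard iproute2 form is:
ip link set dev <PF device> vf <NUM> spoofchk [on|off]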
• Link state configuration - The link state of a VF may be set using:
ip link set dev <PF device> vf <NUM> state [enable|disable|auto]
ip link
Output:
61: p1p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:02:c9:f1:72:e0 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 37 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 38 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable
vf 39 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable
When a MAC is ff:ff:ff:ff:ff:ff, the VF is not assigned to the port of the net device it is listed under.
In the example above, vf38 is not assigned to the same port as p1p1, in contrast to vf0.
However, even VFs that are not assigned to the net device, could be used to set and change its
settings. For example, the following is a valid command to change the spoof check:
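Following the ip link syntax above, for instance:
ip link set dev p1p1 vf 38 spoofchk on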
This command will affect only vf 38. The changes can be seen in the ip link output on the net device
that this VF is assigned to.
Mapping VFs to Ports using the mlnx_get_vfs.pl tool
mlnx_get_vfs.pl
Output:
BDF 0000:04:00.0
Port 1: 2
vf0 0000:04:00.1
vf1 0000:04:00.2
Port 2: 2
vf2 0000:04:00.3
vf3 0000:04:00.4
Both: 1
vf4 0000:04:00.5
Port Type management is static when enabling SR-IOV (the connectx_port_config script will not
work). The port type is set on the Host via a module parameter, port_type_array, in mlx4_core.
This parameter may be used to set the port type uniformly for all installed ConnectX® HCAs, or it
may specify an individual configuration for each HCA.
This parameter should be specified as an options line in the file /etc/modprobe.d/mlx4_core.conf.
For example, to configure all HCAs to have Port1 as IB and Port2 as ETH, insert the following line:
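The options line follows from the port type values given earlier in this chapter (1=IB, 2=Ethernet):
options mlx4_core port_type_array=1,2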
Only the PFs are set via this mechanism. The VFs inherit their port types from their
associated PF.
Virtual Function InfiniBand Ports
Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the
network, which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem,
ULPs, or applications to support SR-IOV, and vHCAs are interoperable with any existing (non-virtualized)
IB deployments.
Sharing the same physical port(s) among multiple vHCAs is achieved as follows:
• Each vHCA port presents its own virtual GID table
For further details, please refer to Configuring an Alias GUID (under ports/<n>/admin_guids)
• Each vHCA port presents its own virtual PKey table
The virtual PKey table (presented to a VF) is a mapping of selected indexes of the physical
PKey table. The host admin can control which PKey indexes are mapped to which virtual
indexes using a sysfs interface. The physical PKey table may contain both full and partial
memberships of the same PKey to allow different membership types in different virtual
tables.
• Each vHCA port has its own virtual port state
A vHCA port is up if the following conditions apply:
• The physical port is up
• The virtual GID table contains the GIDs requested by the host admin
• The SM has acknowledged the requested GIDs since the last time that the physical port
went up
• Other port attributes are shared, such as: GID prefix, LID, SM LID, LMC mask
To allow the host admin to control the virtual GID and PKey tables of vHCAs, a new sysfs 'iov'
sub-tree has been added under the PF InfiniBand device.
If the vHCA comes up without a GUID, make sure you are running the latest version of the SM/OpenSM.
The SM on QDR switches does not support SR-IOV.
Administration of GUIDs and PKeys is done via the sysfs interface in the Hypervisor (Dom0). This
interface is under:
/sys/class/infiniband/<infiniband device>/iov
• <pci id> directories - one for Dom0 and one per guest. Here, you may see the mapping
between virtual and physical pkey indices, and the virtual to physical gid 0.
Currently, the GID mapping cannot be modified, but the pkey virtual to physical mapping can.
These directories have the structure:
• <pci_id>/port/<m>/gid_idx/0 where m = 1..2 (this is read-only)
and
• <pci_id>/port/<m>/pkey_idx/<n>, where m = 1..2 and n = 0..126
1. Determine which GUID index needs to be modified, by reading the guest's gid_idx mapping:
cat /sys/class/infiniband/mlx4_0/iov/0000:02:00.3/port/<port_num>/gid_idx/0
The value returned indicates which GUID index to modify on Dom0.
2. Modify the physical GUID table via the admin_guids sysfs interface.
To configure the GUID at index <n> on port <port_num>:
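The command form follows from the admin_guids sysfs path used below:
echo <GUID value> > /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids/<n>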
Example:
Note:
/sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids/0 is read only and
cannot be changed.
3. Read the administrative status of the GUID index.
To read the administrative status of GUID index <guid_index> on port number <port_num>:
cat /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids/<guid_index>
The values indicate what gids are actually configured on the firmware/hardware, and all the
entries are R/O.
5. Compare the value you read under the "admin_guids" directory at that index with the value under the "gids" directory, to verify that the change requested in Step 2 has been accepted by the SM and programmed into the hardware port GID table.
If the value under admin_guids/<m> is different from the value under gids/<m>, the request is still in progress.
Alias GUID Support in InfiniBand
Admin VF GUIDs
As of MLNX_EN v3.0, the query_gid verb (e.g. ib_query_gid()) returns the admin desired value instead of the value that was approved by the SM. This prevents cases where the SM is unreachable, a response is delayed, or the VF is probed into a VM before its GUID is registered with the SM. In any of these scenarios, the VF would otherwise see an incorrect GID (i.e., not the GID that was intended by the admin).
Despite the new behavior, if the SM does not approve the GID, the VF sees its link as down.
On Demand GUIDs
GIDs are requested from the SM on demand, when needed by the VF (e.g. become active), and are
released when the GIDs are no longer in use.
Since a GID is assigned to a VF on the destination HCA, while the VF on the source HCA is shut down
(but not administratively released), using GIDs on demand eases the GID migrations.
For compatibility reasons, an explicit admin request to set/change a GUID entry is performed immediately, regardless of whether the VF is active or not. This allows administrators to change the GUID without the need to unbind/bind the VF.
Due to the change in the Alias GUID support in InfiniBand behavior, its default mode is now set as
HOST assigned instead of SM assigned. To enable out-of-the-box experience, the PF generates
random GUIDs as the initial admin values instead of asking the SM.
Initial GUIDs' values depend on the mlx4_ib module parameter 'sm_guid_assign' as follows:
Mode Type Description
• admin assigned - Each admin_guid entry has a randomly generated GUID value.
• sm assigned - Each admin_guid entry for non-active VFs has a value of 0, meaning a GUID is requested from the SM upon VF activation. Once a VF is active, the value returned by the SM becomes the admin value to be requested again later.
When a VF becomes active, and its admin value is approved, the operational GID entry is changed
accordingly. In both modes, the administrator can set/delete the value by using the sysfs
Administration Interfaces on the Hypervisor as described above.
Each VF has a single GUID entry in the table based on the VF number. (e.g. VF 1 expects to use GID
entry 1). To determine the GUID index of the PCI Virtual Function to pass to a guest, use the sysfs
mechanism <gid_idx> directory as described above.
Persistency Support
Once an admin request is rejected by the SM, a retry mechanism is set. The retry time starts at 1 second and is doubled on each retry until reaching the maximum value of 60 seconds.
Additionally, when looking for the next record to be updated, the record having the lowest time to
be executed is chosen.
Any value reset via the admin_guid interface is immediately executed and it resets the entry's timer.
PKeys are used to partition IPoIB communication between the Virtual Machines and the Dom0 by
mapping a non-default full-membership PKey to virtual index 0, and mapping the default PKey to a
virtual pkey index other than zero.
The following describes how to set up two hosts, each with 2 Virtual Machines. Host1/vm1 will be able to communicate via IPoIB only with Host2/vm1, and Host1/vm2 only with Host2/vm2.
In addition, Host1/Dom0 will be able to communicate only with Host2/Dom0 over ib0. vm1 and vm2 will not be able to communicate with each other, nor with Dom0.
This is done by configuring the virtual-to-physical PKey mappings for all the VMs, such that at virtual PKey index 0, both vm1s will have the same PKey and both vm2s will have the same PKey (different from the vm1s'), and the Dom0s will have the default PKey (different from the VMs' PKeys at index 0).
OpenSM must be used to configure the physical PKey tables on both hosts.
• The virtual-to-physical PKey mappings on both hosts (Dom0) will be configured as follows:
For vm1:
pkey_idx 0 = 1
pkey_idx 1 = 0
For vm2:
pkey_idx 0 = 2
pkey_idx 1 = 0
so that the default PKey will reside on the VMs at index 1 instead of at index 0.
The IPoIB QPs are created to use the PKey at index 0. As a result, the Dom0, vm1 and vm2 IPoIB QPs
will all use different PKeys.
1. Define the partitions in the OpenSM partitions configuration file (e.g. /etc/opensm/partitions.conf):
Default=0x7fff,ipoib : ALL=full ;
Pkey1=0x3000,ipoib : ALL=full;
Pkey3=0x3030,ipoib : ALL=full;
This will cause OpenSM to configure the physical Port Pkey tables on all physical ports on the
network as follows:
pkey_idx | PKey value
---------|-----------
0 | 0xFFFF
1 | 0xB000
2 | 0xB030
The ",ipoib" causes OpenSM to pre-create the IPoIB broadcast group for the indicated PKeys.
2. Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.
a. Check the PCI ID for the Physical Function and the Virtual Functions.
b. Assume that on Host1 the physical function displayed by lspci is "0000:02:00.0", and that on Host2 it is "0000:03:00.0".
c. On Host1, do the following.
cd /sys/class/infiniband/mlx4_0/iov
0000:02:00.0 0000:02:00.1 0000:02:00.2 ...
vm1 pkey index 0 will be mapped to physical pkey-index 1, and vm2 pkey index 0 will
be mapped to physical pkey index 2. Both vm1 and vm2 will have their pkey index 1
mapped to the default pkey.
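Based on this mapping (and mirroring the Host2 commands in the next step), the Host1 commands would be:

echo 0 > 0000:02:00.1/ports/1/pkey_idx/1
echo 1 > 0000:02:00.1/ports/1/pkey_idx/0
echo 0 > 0000:02:00.2/ports/1/pkey_idx/1
echo 2 > 0000:02:00.2/ports/1/pkey_idx/0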
d. On Host2 do the following.
cd /sys/class/infiniband/mlx4_0/iov
echo 0 > 0000:03:00.1/ports/1/pkey_idx/1
echo 1 > 0000:03:00.1/ports/1/pkey_idx/0
echo 0 > 0000:03:00.2/ports/1/pkey_idx/1
echo 2 > 0000:03:00.2/ports/1/pkey_idx/0
e. Once the VMs are running, you can check the VM's virtualized PKey table by running the following command on the VM:
cat /sys/class/infiniband/mlx4_0/ports/[1,2]/pkeys/[0,1]
Running Network Diagnostic Tools on a Virtual Function in ConnectX-3/
ConnectX-3 Pro
Until now, in MLNX_EN, administrators were unable to run network diagnostics from a VF since
sending and receiving Subnet Management Packets (SMPs) from a VF was not allowed, for security
reasons: SMPs are not restricted by network partitioning and may affect the physical network
topology. Moreover, even the SM may be denied access from portions of the network by setting
management keys unknown to the SM.
However, it is desirable to grant SMP capability to certain privileged VFs, so certain network
management activities may be conducted within virtual machines rather than only on the
hypervisor.
To enable SMP capability for a VF, one must enable the Subnet Management Interface (SMI) for that VF. By default, the SMI interface is disabled for VFs. To enable SMI MADs for VFs, there are two new sysfs entries per VF per port on the Hypervisor (under /sys/class/infiniband/mlx4_X/iov/<b.d.f>/ports/<1 or 2>). These entries are displayed only for VFs (not for the PF), and only for IB ports (not ETH ports).
The first entry, enable_smi_admin, is used to enable SMI on a VF. By default, the value of this entry
is zero (disabled). When set to "1", the SMI will be enabled for the VF on the next rebind or openibd
restart on the VM that the VF is bound to. If the VF is currently bound, it must be unbound and then
re-bound.
The second sysfs entry, smi_enabled, indicates the current enablement state of the SMI. 0 indicates
disabled, and 1 indicates enabled. This entry is read-only.
When a VF is initialized (bound), during the initialization sequence the driver copies the requested SMI state (enable_smi_admin) for that VF/port to the operational SMI state (smi_enabled) for that VF/port, and operates according to the operational state.
Thus, the sequence of operations on the hypervisor is:
1. Enable SMI for any VF/port that you wish.
2. Restart the VM that the VF is bound to (or just run /etc/init.d/openibd restart on that VM)
The SMI will be enabled for the VF/port combinations that you set in step 1 above. You will then be able to run network diagnostics from that VF.
To install MLNX_EN on a VF which will be enabled to run the tools, run the following on the
VM:
mlnx_en_install
MAC Forwarding DataBase (FDB) Management in ConnectX-3/
ConnectX-3 Pro
FDB, also known as Forwarding Information Base (FIB) or the forwarding table, is most commonly used in network bridging, routing, and similar functions to find the proper interface to which the input interface should forward a packet.
In the SR-IOV environment, the Ethernet driver can share the existing 128 MACs (for each port) among the Virtual Functions (VF) and Physical Functions (PF) that share the same table, as follows:
• Each VF gets 2 granted MACs (which are taken from the general pool of the 128 MACs)
• Each VF/PF can ask for up to 128 MACs on a first-asked, first-served basis (meaning, apart from the 2 granted MACs, the other MACs in the pool are free to be requested)
To check if there are free MACs for an interface (PF or VF), read /sys/class/net/<ethX>/fdb_det.
Example:
cat /sys/class/net/eth2/fdb_det
device eth2: max: 112, used: 2, free macs: 110
When the command above is run, the interface (VF/PF) verifies whether a free MAC exists. If there is a free MAC, the VF/PF takes it from the global pool and allocates it. If there is no free MAC, an error is returned notifying the user of the lack of MACs in the pool.
If /sys/class/net/eth<X>/fdb does not exist, use the bridge tool from the iproute2 package, which includes the tool to manage FDB tables, as the kernel supports FDB callbacks:
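For example (the MAC value is illustrative):

bridge fdb add 00:01:02:03:04:05 dev eth2
bridge fdb del 00:01:02:03:04:05 dev eth2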
If adding a new MAC from the kernel's NDO function fails due to insufficient MACs in
the pool, the following error flow will occur:
• If the interface is a PF, it will automatically enter the promiscuous mode
• If the interface is a VF, it will try to enter the promiscuous mode and since it does
not support it, the action will fail and an error will be printed in the kernel's log
Virtual Guest Tagging (VGT+)
VGT+ is an advanced mode of Virtual Guest Tagging (VGT), in which a VF is allowed to tag its own
packets as in VGT, but is still subject to an administrative VLAN trunk policy. The policy determines
which VLAN IDs are allowed to be transmitted or received. The policy does not determine the user
priority, which is left unchanged.
Packets can be sent in one of two modes: one where the VF is allowed to send/receive untagged and priority-tagged traffic, and one where it is not. No default VLAN is defined for a VGT+ port. Transmitted packets are passed to the eSwitch only if they match the set, and received packets are forwarded to the VF only if they match the set.
In some old OSs, such as SLES11 SP4, any VLAN can be created in the VM, regardless of the
VGT+ configuration, but traffic will only pass for the allowed VLANs.
cat /sys/class/net/eth5/vf0/vlan_set
oper:
admin:
If you set the vlan_set parameter with more than 10 VLAN IDs, the driver chooses the first 10 VLAN IDs provided and ignores all the rest.
echo 0 1 2 3 4 5 6 7 8 9 > /sys/class/net/eth5/vf0/vlan_set
2. Reboot the relevant VM for changes to take effect, or run: /etc/init.d/openibd restart
To add a VLAN:
cat /sys/class/net/eth5/vf0/vlan_set
oper: 0 1 2 3
admin: 0 1 2 3
cat /sys/class/net/eth5/vf0/vlan_set
oper: 0 1 2 3 4 5 6
admin: 2 3 4 5 6
Configuring VGT+ for ConnectX-4/ConnectX-5
When working in SR-IOV, the default operating mode is VGT.
Set the corresponding port/VF (in the example below port eth5, VF0) range of allowed VLANs.
Examples:
• Adding VLAN ID range (4-15) to trunk:
echo add 4 15 > /sys/class/net/eth5/device/sriov/0/trunk
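To remove the range, the corresponding form would be (assuming the trunk sysfs accepts a symmetric 'rem' keyword):

echo rem 4 15 > /sys/class/net/eth5/device/sriov/0/trunk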
Note: When VLAN ID = 0, it indicates that untagged and priority-tagged traffic is allowed.
Virtualized QoS per VF (supported on ConnectX®-3/ConnectX®-3 Pro adapter cards only, with firmware v2.33.5100 and above) limits the chosen VFs' throughput rate (maximum throughput). The granularity of the rate limitation is 1Mbit/s.
The feature is disabled by default. To enable it, set the "enable_vfs_qos" module parameter to "1" and add it to the mlx4_core options line. When set, and when the feature is supported, it will be shown upon PF driver load time (in DEV_CAP in the kernel log: Granular QoS Rate limit per VF support) when the mlx4_core module parameter debug_level is set to 1. For further information, please refer to "mlx4_core Parameters" - debug_level parameter.
When set and supported by the firmware, when running as SR-IOV Master with an Ethernet link, the driver also provides information on the total number of available vPort Priority Pairs (VPPs) and how many VPPs are allocated per priority. All the available VPPs will be allocated on priority 0.
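For example, to enable the feature in /etc/modprobe.d/mlx4_core.conf (both parameter names are taken from the description above):

options mlx4_core enable_vfs_qos=1 debug_level=1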
Configuring Rate Limit for VFs
The rate limit configuration will take effect only when the VF is in VST mode configured
with priority 0.
ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>] rate <TXRATE>
where:
• NUM = 0...<Num of VF>
• <TXRATE> in units of 1Mbit/s
To disable rate limit configured for a VF set the VF with rate 0. Once the rate limit is set, you
cannot switch to VGT or change VST priority.
To view current rate limit configurations for VFs, use the iproute2 tool.
Example:
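(Assuming the PF device is eth1, matching the output that follows:)

ip link show dev eth1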
89: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
link/ether f4:52:14:5e:be:20 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, vlan 2, tx rate 1500 (Mbps), spoof checking off, link-state auto
vf 1 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 2 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 3 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
On some OSs, the ip tool may not display the configured rate, or any of the VF information, although both the VST and the rate limit were set through the netlink command. In order to view the configured rate limit, use the sysfs entry provided by the driver. It can be found at:
/sys/class/net/<eth-x>/<vf-i>/tx_rate
SR-IOV Advanced Security Features
Normally, MAC addresses are unique identifiers assigned to network interfaces, and they are fixed
addresses that cannot be changed. MAC address spoofing is a technique for altering the MAC address
to serve different purposes. Some of the cases in which a MAC address is altered can be legal, while others can be illegal and abuse security mechanisms or disguise a possible attacker.
The SR-IOV MAC address anti-spoofing feature, also known as MAC Spoof Check, provides protection against malicious VM MAC address forging. If the network administrator assigns a MAC address to a VF (through the hypervisor) and enables spoof check on it, this limits the end user to sending traffic only from the assigned MAC address of that VF.
MAC anti-spoofing is disabled by default.
In the configuration example below, the VM is located on VF-0 and has the following MAC address:
11:22:33:44:55:66.
There are two ways to enable or disable MAC anti-spoofing:
1. Use the standard IP link commands - available from Kernel 3.10 and above.
a. To enable MAC anti-spoofing, run:
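A sketch of the command (generic device and VF index; spoofchk is standard iproute2 syntax):

ip link set dev <ETH_IF_NAME> vf <VF index> spoofchk on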
2. Specify echo "ON" or "OFF" to the file located under /sys/class/net/<ETH_IF_NAME>/device/sriov/<VF index>/spoofchk.
a. To enable MAC anti-spoofing, run:
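For example, following the sysfs path above:

echo "ON" > /sys/class/net/<ETH_IF_NAME>/device/sriov/<VF index>/spoofchk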
This configuration is non-persistent and does not survive driver restart.
In order for spoof-check enabling/disabling to take effect while the VF is up and running on
ConnectX-3 Pro adapter cards, it is required to perform a driver restart on the guest OS.
Limit and Bandwidth Share Per VF
This feature is at beta level.
This feature enables rate limiting traffic per VF in SR-IOV mode for ConnectX-4/ConnectX-4 Lx/
ConnectX-5 adapter cards. For details on how to configure rate limit per VF for ConnectX-4/
ConnectX-5, refer to HowTo Configure Rate Limit per VF for ConnectX-4/ConnectX-5 Community
post.
VFs Rate Limit for vSwitch (OVS) feature allows users to join available VFs into groups and set a rate
limitation on each group. Rate limitation on a VF group ensures that the total Tx bandwidth that the
VFs in this group get (altogether combined) will not exceed the given value.
With this feature, a VF can still be configured with an individual rate limit as in the past (under /
sys/class/net/<ifname>/device/sriov/<vf_num>/max_tx_rate). However, the actual bandwidth limit
on the VF will eventually be determined considering the VF group limitation and how many VFs are
in the same group.
For example: 2 VFs (0 and 1) are attached to group 3.
Case 1: The rate limitation on the group is set to 20G. Rate limit of each VF is 15G
Result: Each VF will have a rate limit of 10G
Case 2: Group’s max rate limitation is still set to 20G. VF 0 is configured to 30G limit, while VF 1 is
configured to 5G rate limit
Result: VF 0 will have 15G de-facto. VF 1 will have 5G
The rule of thumb is that the group’s bandwidth is distributed evenly between the number of VFs in
the group. If there are leftovers, they will be assigned to VFs whose individual rate limit has not
been met yet.
1. When the VF rate group feature is supported by firmware, the driver creates a new hierarchy in the SR-IOV sysfs named "groups" (/sys/class/net/<ifname>/device/sriov/groups/). It contains all the info and the configurations allowed for VF groups.
2. All VFs are placed in group 0 by default since it is the only existing group following the initial
driver start. It would be the only group available under /sys/class/net/<ifname>/device/
sriov/groups/
3. The VF can be moved to a different group by writing to the group file -> echo $GROUP_ID > /
sys/class/net/<ifname>/device/sriov/<vf_id>/group
4. The group IDs allowed are 0-255
5. Only when there is at least 1 VF in a group, there will be a group configuration available
under /sys/class/net/<ifname>/device/sriov/groups/ (Except for group 0, which is always
available even when it’s empty).
6. Once the group is created (by moving at least 1 VF to that group), users can configure the
group’s rate limit. For example:
a. echo 10000 > /sys/class/net/<ifname>/device/sriov/5/max_tx_rate – setting individual
rate limitation of VF 5 to 10G (Optional)
b. echo 7 > /sys/class/net/<ifname>/device/sriov/5/group – moving VF 5 to group 7
c. echo 5000 > /sys/class/net/<ifname>/device/sriov/groups/7/max_tx_rate – setting
group 7 with rate limitation of 5G
d. When running traffic via VF 5 now, it will be limited to 5G because of the group rate
limit even though the VF itself is limited to 10G
e. echo 3 > /sys/class/net/<ifname>/device/sriov/5/group – moving VF 5 to group 3
f. Group 7 will now disappear from /sys/class/net/<ifname>/device/sriov/groups since
there are 0 VFs in it. Group 3 will now appear. Since there’s no rate limit on group 3,
VF 5 can transmit at 10G (thanks to its individual configuration)
Notes:
• You can see which group a VF belongs to in the 'stats' sysfs (cat /sys/class/net/<ifname>/device/sriov/<vf_num>/stats)
• You can see the current rate limit and number of attached VFs to a group in the group’s
‘config’ sysfs (cat /sys/class/net/<ifname>/device/sriov/groups/<group_id>/config)
Privileged VFs
If a malicious driver is running over one of the VFs and that VF's permissions are not restricted, security holes may be opened. However, VFs can be marked as trusted and can thus receive an exclusive subset of physical function privileges or permissions. For example, allowing all VFs, rather than specific VFs, to enter promiscuous mode as a privilege would enable malicious users to sniff and monitor the entire physical port for incoming traffic, including traffic targeting other VFs, which is considered a severe security hole.
In the configuration example below, the VM is located on VF-0 and has the following MAC address:
11:22:33:44:55:66.
There are two ways to enable or disable trust:
1. Use the standard IP link commands - available from Kernel 4.5 and above.
a. To enable trust for a specific VF, run:
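A sketch of the command (generic device and VF index; trust is standard iproute2 syntax from kernel 4.5):

ip link set dev <ETH_IF_NAME> vf <VF index> trust on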
2. Specify echo "ON" or "OFF" to the file located under /sys/class/net/<ETH_IF_NAME>/device/sriov/<VF index>/trust.
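For example, to enable trust on VF 0 of ens785f1 (mirroring the disable example below):

echo "ON" > /sys/class/net/ens785f1/device/sriov/0/trust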
echo "OFF" > /sys/class/net/ens785f1/device/sriov/0/trust
Probed VFs
Probing Virtual Functions (VFs) after SR-IOV is enabled might consume the adapter cards' resources.
Therefore, it is recommended not to enable probing of VFs when no monitoring of the VM is needed.
VF probing can be disabled in two ways, depending on the kernel version installed on your server:
1. If the kernel version installed is v4.12 or above, it is recommended to use the PCI sysfs
interface sriov_drivers_autoprobe. For more information, see linux-next branch.
2. If the kernel version installed is older than v4.12, it is recommended to use the mlx5_core
module parameter probe_vf with MLNX_EN v4.1 or above.
Example:
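(Assuming probe_vf=0 disables probing of all VFs; add the following to /etc/modprobe.d/mlnx.conf:)

options mlx5_core probe_vf=0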
For more information on how to probe VFs, see HowTo Configure and Probe VFs on mlx5 Drivers
Community post.
VF Promiscuous Rx Modes
VF Promiscuous Mode
VFs can enter a promiscuous mode that enables receiving unmatched traffic, as well as all the multicast traffic that reaches the physical port, in addition to the traffic originally targeted to the VF. Unmatched traffic is any traffic whose DMAC does not match any of the VFs' or PFs' MAC addresses.
Note: Only privileged/trusted VFs can enter the VF promiscuous mode.
To enter the VF promiscuous mode, run:
ifconfig eth2 promisc
To exit the VF promiscuous mode, run:
ifconfig eth2 -promisc
VF All-Multi Mode
VFs can enter an all-multi mode that enables receiving all the multicast traffic sent from/to the
other functions on the same physical port in addition to the traffic originally targeted to the VF.
Note: Only privileged/trusted VFs can enter the all-multi RX mode.
To enter the VF all-multi mode, run:
ifconfig eth2 allmulti
To exit the VF all-multi mode, run:
ifconfig eth2 -allmulti
Enabling Paravirtualization
To enable Paravirtualization:
The example below works on RHEL6.* or RHEL7.* without a Network Manager.
1. Create a bridge.
vim /etc/sysconfig/network-scripts/ifcfg-bridge0
DEVICE=bridge0
TYPE=Bridge
IPADDR=12.195.15.1
NETMASK=255.255.0.0
BOOTPROTO=static
ONBOOT=yes
NM_CONTROLLED=no
DELAY=0
2. Change the related interface (in the example below bridge0 is created over eth5).
DEVICE=eth5
BOOTPROTO=none
STARTMODE=on
HWADDR=00:02:c9:2e:66:52
TYPE=Ethernet
NM_CONTROLLED=no
ONBOOT=yes
BRIDGE=bridge0
ifconfig -a
…
eth6 Link encap:Ethernet HWaddr 52:54:00:E7:77:99
inet addr:13.195.15.5 Bcast:13.195.255.255 Mask:255.255.0.0
inet6 addr: fe80::5054:ff:fee7:7799/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:481 errors:0 dropped:0 overruns:0 frame:0
TX packets:450 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:22440 (21.9 KiB) TX bytes:19232 (18.7 KiB)
Interrupt:10 Base address:0xa000
…
To enable VXLAN offload support, load the mlx4_core driver with Device-Managed Flow Steering (DMFS) enabled. DMFS is the default steering mode.
2. Set the parameter debug_level to "1".
The net-device will advertise the tx-udp-tnl-segmentation flag, shown when running "ethtool -k $DEV | grep udp", only when VXLAN is configured in the OpenvSwitch (OVS) with the configured UDP port.
Example:
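(Assuming the PF is ens1f0; ethtool reports the feature as tx-udp_tnl-segmentation:)

ethtool -k ens1f0 | grep udp
tx-udp_tnl-segmentation: on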
As of firmware version 2.31.5050, VXLAN tunnel can be set on any desired UDP port. If using
previous firmware versions, set the VXLAN tunnel over UDP port 4789.
VXLAN offload is enabled by default for ConnectX-4 family devices running the minimum required
firmware version and a kernel version that includes VXLAN support.
Example:
ConnectX-4 family devices support configuring multiple UDP ports for VXLAN offload. Ports can be
added to the device by configuring a VXLAN device from the OS command line using the "ip"
command.
Note: If you configure multiple UDP ports for offload and exceed the total number of ports
supported by hardware, then those additional ports will still function properly, but will not benefit
from any of the stateless offloads.
Example:
ip link add vxlan0 type vxlan id 10 group 239.0.0.10 ttl 10 dev ens1f0 dstport 4789
ip addr add 192.168.4.7/24 dev vxlan0
ip link set up vxlan0
Example:
To verify that the VXLAN ports are offloaded, use debugfs (if supported):
1. Mount debugfs.
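For example:

mount -t debugfs none /sys/kernel/debug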
ls /sys/kernel/debug/mlx5/$PCIDEV/VXLAN
Where $PCIDEV is the PCI device number of the relevant ConnectX-4 family device.
Example:
ls /sys/kernel/debug/mlx5/0000:81:00.0/VXLAN
4789
Important Notes
• VXLAN tunneling adds 50 bytes (14-eth + 20-ip + 8-udp + 8-vxlan) to the VM Ethernet frame. Please verify that either the MTU of the NIC that sends the packets (e.g. the VM virtio-net NIC or the host-side veth device) or the uplink takes the tunneling overhead into account. That is, the MTU of the sending NIC has to be decremented by 50 bytes (e.g. 1450 instead of 1500), or the uplink NIC MTU has to be incremented by 50 bytes (e.g. 1550 instead of 1500)
• From upstream 3.15-rc1 and onward, it is possible to use an arbitrary UDP port for VXLAN. Note that this requires firmware version 2.31.2800 or higher. Additionally, you need to enable the kernel configuration option CONFIG_MLX4_EN_VXLAN=y (ConnectX-3 Pro only).
• On upstream kernels 3.12/3.13, GRO with VXLAN is not supported
This feature is supported on ConnectX-3 Pro and ConnectX-5 adapter cards only.
ConnectX-4 and ConnectX-4 Lx adapter cards support 802.1Q double-tagging (C-tag stacking on C-tag) - refer to the "802.1Q Double-Tagging" section.
This section describes the configuration of IEEE 802.1ad QinQ VLAN tags (S-VLAN) on the hypervisor per Virtual Function (VF). The Virtual Machine (VM) attached to the VF (via SR-IOV) can send traffic with or without C-VLAN. Once a VF is configured for VST QinQ encapsulation (VST QinQ), the adapter's hardware will insert the S-VLAN into any packet from the VF to the physical port. On the receive side, the adapter hardware will strip the S-VLAN from any packet coming from the wire to that VF.
Setup
The setup assumes there are two servers equipped with ConnectX-3 Pro/ConnectX-5 adapter cards.
Prerequisites
• Kernel must be of v3.10 or higher, or custom/inbox kernel must support vlan-stag
• Firmware version 2.36.5150 or higher must be installed for ConnectX-3 Pro HCAs
• Firmware version 16.21.0458 or higher must be installed for ConnectX-5 HCAs
• SR-IOV should be enabled on the server, and the VF should be attached to a VM on the hypervisor.
• In order to configure SR-IOV in Ethernet mode for ConnectX-3 Pro adapter cards,
please refer to "Configuring SR-IOV for ConnectX-3/ConnectX-3 Pro" section.
• In order to configure SR-IOV in Ethernet mode for ConnectX-5 adapter cards, please
refer to "Configuring SR-IOV for ConnectX-4/ConnectX-5 (Ethernet)" section. In the
following configuration example, the VM is attached to VF0.
• Network Considerations - the network switches may require increasing the MTU (to support
1522 MTU size) on the relevant switch ports.
2. Add the required S-VLAN (QinQ) tag (on the hypervisor) per port per VF. There are two ways
to add the S-VLAN:
a. By using sysfs only if the Kernel version used is v4.9 or older:
echo 'vlan 100 proto 802.1ad' > /sys/class/net/ens2/vf0/vlan_info
b. By using the ip link command (available only when using the latest Kernel version):
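The command takes the same form as in the ConnectX-5 flow further below; assuming the same PF device ens2 as in the sysfs example:

ip link set dev ens2 vf 0 vlan 100 proto 802.1ad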
3. Optional: Add S-VLAN priority. Use the qos parameter in the ip link command (or sysfs):
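For example (the qos value is illustrative; qos precedes proto in the ip link grammar):

ip link set dev ens2 vf 0 vlan 100 qos 3 proto 802.1ad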
6. To verify the setup, run ping between the two VMs and open Wireshark or tcpdump to capture
the packet.
For further examples, refer to HowTo Configure QinQ Encapsulation per VF in Linux (VST) for
ConnectX-3 Pro Community post.
b. By using the ip link command (available only when using the latest Kernel version):
ip link set dev ens1f0 vf 0 vlan 100 proto 802.1ad
2. Optional: Add S-VLAN priority. Use the qos parameter in the ip link command (or sysfs):
4. To verify the setup, run ping between the two VMs and open Wireshark or tcpdump to capture
the packet.
802.1Q Double-Tagging
This section describes the configuration of 802.1Q double-tagging support to the hypervisor per
Virtual Function (VF). The Virtual Machine (VM) attached to the VF (via SR-IOV) can send traffic with
or without C-VLAN. Once a VF is configured to VST encapsulation, the adapter's hardware will insert
C-VLAN to any packet from the VF to the physical port. On the receive side, the adapter hardware
will strip the C-VLAN from any packet coming from the wire to that VF.
b. By using the ip link command (available only when using the latest Kernel version):
Check the configuration using the ip link show command:
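For example (PF device name assumed):

ip link show dev ens1f0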
3. To verify the setup, run ping between the two VMs and open Wireshark or tcpdump to capture
the packet.
Resiliency
Reset Flow
Reset Flow is activated by default once a "fatal device" error is recognized. Both the HCA and the software are reset, the ULPs and user applications are notified about it, and a recovery process is performed once the event is raised.
• In mlx4 devices, "Reset Flow" is activated by default. It can be disabled using the mlx4_core module parameter internal_err_reset (default value is 1).
• In mlx5 devices, "Reset Flow" is activated by default. Currently, it can be triggered by a
firmware assert with Recover Flow Request (RFR) only. Firmware RFR support should be
enabled explicitly using mlxconfig commands.
Notes:
• For mlx4 devices, a “fatal device” error can be a timeout from a firmware command, an
error on a firmware closing command, communication channel not being responsive in a VF,
etc.
• For mlx5 devices, a “fatal device” is a firmware assert combined with Recover Flow Request
bit.
Kernel ULPs
Once a "fatal device" error is recognized, an IB_EVENT_DEVICE_FATAL event is created, ULPs are notified about the incident, and outstanding WQEs are simulated to be returned with a "flush in error" message so that each ULP can close its resources and not get stuck. Each ULP's "remove_one" callback is then called as part of the "Reset Flow".
Once the unload part is terminated, each ULP is called with its "add_one" callback, its resources are
re-initialized and it is re-activated.
SR-IOV
If the Physical Function recognizes the error, it notifies all the VFs about it by marking their
communication channel with that information, consequently, all the VFs and the PF are reset.
If the VF encounters an error, only that VF is reset, whereas the PF and other VFs continue to work
unaffected.
If an outside "reset" is forced by using the PCI sysfs entry for a VF, a reset is executed on that VF
once it runs any command over its communication channel.
For example, the below command can be used on a hypervisor to reset a VF defined by
0000:04:00.1:
echo 1 >/sys/bus/pci/devices/0000:04:00.1/reset
AER, a mechanism used by the driver to get notifications upon PCI errors, is supported only in native mode. ULPs are called with remove_one/add_one and are expected to continue working properly after that flow. User space applications will work in the same mode as defined in the "Reset Flow" above.
Extended Error Handling (EEH) is a PowerPC mechanism that encapsulates AER, thus exposing AER
events to the operating system as EEH events.
The behavior of ULPs and user space applications is identical to the behavior of AER.
CRDUMP
The CRDUMP feature allows for taking an automatic snapshot of the device CR-Space in case the device's FW/HW fails to function properly.
Snapshot triggers:
• ConnectX-3 adapters family - the snapshot is triggered in case the driver detects any of the
following issues:
This snapshot can later be investigated and analyzed to track the root cause of the failure.
Currently, only the first snapshot is stored, and is exposed using a temporary virtual file. The virtual
file is cleared upon driver reset.
When a critical event is detected, a message indicating CRDUMP collection will be printed to the
Linux log. User should then back up the file pointed to in the printed message. The file location
format is:
• For mlx4 driver: /proc/driver/mlx4_core/crdump/<pci address>
• For mlx5 driver: /proc/driver/mlx5_core/crdump/<pci address>
In mlx4 driver, CRDUMP will not be collected if internal_err_reset module parameter is set
to 0.
Firmware Tracer
This mechanism allows for the device's FW/HW to log important events into the event tracing
system (/sys/kernel/debug/tracing) without requiring any Mellanox tool.
To be able to use this feature, trace points must be enabled in the kernel.
This feature is enabled by default, and can be controlled using sysfs commands.
To view the tracer output:
vim /sys/kernel/debug/tracing/trace
Docker Containers
This feature is supported at beta level on ConnectX-4 adapter cards family and above only.
Currently, RDMA/RoCE devices are supported in the modes listed in the following table:
• Docker Engine 17.03 or higher - Docker SR-IOV using sriov-plugin along with the docker run wrapper tool - InfiniBand and Ethernet - SR-IOV
• Kubernetes 1.10.3 or higher - Kubernetes SR-IOV using a device plugin and the SR-IOV CNI plugin - InfiniBand and Ethernet - SR-IOV
• Kubernetes 1.10.3 or higher - VXLAN using an IPoIB bridge - InfiniBand - Shared HCA
For instructions on how to use Docker with SR-IOV, refer to the following Community post: https://support.mellanox.com/docs/DOC-3139
When used in SR-IOV mode, this plugin enables SR-IOV and performs necessary configuration
including setting GUID, MAC, privilege mode, and Trust mode.
The plugin also allocates the VF devices when Pods are scheduled and requested by Kubernetes
framework.
For instructions on how to use Kubernetes with SR-IOV, refer to the following Community posts:
• https://support.mellanox.com/docs/DOC-3151
• https://support.mellanox.com/docs/DOC-3138
Mediated Devices
The Mellanox mediated devices deliver the flexibility of creating accelerated devices without SR-IOV on the BlueField® system. These mediated devices support NIC and RDMA, and offer the same level of ASAP2 offloads as SR-IOV VFs. Mediated devices are supported using mlx5 sub-function acceleration technology.
Cold reboot the BlueField host system so that the above settings can be applied on
subsequent reboot sequences.
2. By default, the firmware allows for a large number of maximum mdev devices. You must set
the maximum number of mediated devices to 2 or 4 devices. Run:
3. Mediated devices are uniquely identified using UUID. To create one, run:
$ uuidgen
$ echo 49d0e9ac-61b8-4c91-957e-6f6dbc42557d > /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types/
mlx5_core-local/create
4. By default, if the driver vfio_mdev is loaded, newly created mdev devices are bound to it. To
make use of this newly created mdev device in order to create a netdevice and RDMA device,
you must first unbind it from that driver. Run:
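A sketch of the unbind/bind sequence (the generic sysfs driver unbind path is assumed; binding to mlx5_core follows from the note below):

$ echo 49d0e9ac-61b8-4c91-957e-6f6dbc42557d > /sys/bus/mdev/devices/49d0e9ac-61b8-4c91-957e-6f6dbc42557d/driver/unbind
$ echo 49d0e9ac-61b8-4c91-957e-6f6dbc42557d > /sys/bus/mdev/drivers/mlx5_core/bind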
$ cat /sys/bus/mdev/devices/49d0e9ac-61b8-4c91-957e-6f6dbc42557d/devlink-compat-config/netdev
When an mdev device is bound to the mlx5_core driver, its respective netdevice and/or RDMA
device is also created.
8. To inspect the netdevice and RDMA device for the mdev, run:
$ ls /sys/bus/mdev/devices/49d0e9ac-61b8-4c91-957e-6f6dbc42557d/net/
$ ls /sys/bus/mdev/devices/49d0e9ac-61b8-4c91-957e-6f6dbc42557d/infiniband/
Overview
Open vSwitch (OVS) allows Virtual Machines (VM) to communicate with each other and with the
outside world. OVS traditionally resides in the hypervisor and switching is based on twelve tuple
matching on flows. The OVS software-based solution is CPU intensive, affecting system performance and preventing full utilization of the available bandwidth. The currently supported OVS is OVS running in the Linux kernel.
Mellanox Accelerated Switching And Packet Processing (ASAP2) technology allows OVS offloading by handling the OVS data-plane in Mellanox ConnectX-5 (and later) NIC hardware (Mellanox Embedded Switch or eSwitch) while keeping the OVS control-plane unmodified. As a result, significantly higher OVS performance is observed without the associated CPU load.
The actions currently supported by ASAP2 include packet parsing and matching, forwarding, and dropping, along with VLAN push/pop or VXLAN encapsulation/decapsulation.
Installing ASAP2 Packages
Install the required packages. For the complete solution, you need to install supporting
MLNX_EN (v4.4 and above), iproute2, and openvswitch packages.
Setting Up SR-IOV
To set up SR-IOV:
1. Choose the desired card.
The example below shows a dual-ported ConnectX-5 card (device ID 0x1017) and a single SR-
IOV VF (Virtual Function, device ID 0x1018).
In SR-IOV terms, the card itself is referred to as the PF (Physical Function).
Enabling SR-IOV and creating VFs is done by the firmware upon admin directive as
explained in Step 5 below.
2. Identify the Mellanox NICs and locate net-devices which are on the NIC PCI BDF.
The PF NIC for port #1 is enp4s0f0, and the rest of the commands will be issued on it.
3. Check the firmware version.
Make sure the firmware versions installed are as stated in the Release Notes document.
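For example:

ethtool -i enp4s0f0 | grep firmware-version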
# cat /sys/class/net/enp4s0f0/device/sriov_totalvfs
4
# echo 2 > /sys/class/net/enp4s0f0/device/sriov_numvfs
7. Verify the VF MAC addresses were provisioned correctly and SR-IOV was turned ON.
# cat /sys/class/net/enp4s0f0/device/sriov_numvfs
2
# ip link show dev enp4s0f0
256: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT
group default qlen 1000
link/ether e4:1d:2d:60:95:a0 brd ff:ff:ff:ff:ff:ff
vf 0 MAC e4:11:22:33:44:50, spoof checking off, link-state auto
vf 1 MAC e4:11:22:33:44:51, spoof checking off, link-state auto
In the example above, the maximum number of possible VFs supported by the firmware is 4
and only 2 are enabled.
8. Provision the PCI VF devices to VMs using PCI Pass-Through or any other preferred virt tool of
choice, e.g virt-manager.
VMs with attached VFs must be powered off to be able to unbind the VFs.
2. Change the e-switch mode from legacy to switchdev on the PF device.
This will also create the VF representor netdevices in the host OS.
Before changing the mode, make sure that all VFs are unbound.
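A sketch using the upstream devlink syntax (assuming the PF PCI address 0000:04:00.0 from Step 1):

devlink dev eswitch set pci/0000:04:00.0 mode switchdev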
SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="e41d2d60971d", \
ATTR{phys_port_name}!="", NAME="enp4s0f1_$attr{phys_port_name}"
Replace the phys_switch_id value ("e41d2d60971d" above) with the value matching your
switch, as obtained from:
ls -l /sys/class/net/enp4s0f0*
lrwxrwxrwx 1 root root 0 Mar 27 17:14 enp4s0f0 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/
enp4s0f0
lrwxrwxrwx 1 root root 0 Mar 27 17:15 enp4s0f0_0 -> ../../devices/virtual/net/enp4s0f0_0
lrwxrwxrwx 1 root root 0 Mar 27 17:15 enp4s0f0_1 -> ../../devices/virtual/net/enp4s0f0_1
7. Restart the openvswitch service. This step is required for HW offload changes to take effect.
HW offload policy can also be changed by setting the tc-policy using one of the following values:
* none - adds a TC rule to both the software and the hardware (default)
* skip_sw - adds a TC rule only to the hardware
* skip_hw - adds a TC rule only to the software
The above change is used for debug purposes.
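For reference, a typical enable-and-restart sequence (assuming a systemd host; the service name may be openvswitch or openvswitch-switch depending on the distribution) is:

# ovs-vsctl set Open_vSwitch . other_config:hw-offload=true
# systemctl restart openvswitch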
# ip link set dev enp4s0f0_1 up
# ovs-dpctl show
system@ovs-system:
lookups: hit:0 missed:192 lost:1
flows: 2
masks: hit:384 total:2 hit/pkt:2.00
port 0: ovs-system (internal)
port 1: ovs-sriov (internal)
port 2: enp4s0f0
port 3: enp4s0f0_0
port 4: enp4s0f0_1
9. Run traffic from the VFs and observe the rules added to the OVS data-path.
# ovs-dpctl dump-flows
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=e4:1d:2d:a5:f3:9d),
eth_type(0x0800),ipv4(frag=no), packets:33, bytes:3234, used:1.196s, actions:2
recirc_id(0),in_port(2),eth(src=e4:1d:2d:a5:f3:9d,dst=e4:11:22:33:44:50),
eth_type(0x0800),ipv4(frag=no), packets:34, bytes:3332, used:1.196s, actions:3
In the example above, the ping was initiated from VF0 (OVS port 3) to the outer node (OVS
port 2), where the VF MAC is e4:11:22:33:44:50 and the outer node MAC is e4:1d:2d:a5:f3:9d
As shown above, two OVS rules were added, one in each direction.
Note that you can also verify offloaded packets by adding type=offloaded to the command. For example:
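(Assuming an OVS version whose dpctl accepts the type= option:)

# ovs-appctl dpctl/dump-flows type=offloaded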
The aging timeout of OVS is given in ms and can be controlled with this command:
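For example, to set a 30-second timeout (other_config:max-idle is the standard OVS knob and is assumed to be the command referred to here):

# ovs-vsctl set Open_vSwitch . other_config:max-idle=30000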
Offloading VLANs
It is common to require the VM traffic to be tagged by OVS: OVS adds tags (vlan push) to the packets sent by the VMs and strips (vlan pop) the tags from packets received for a VM from other nodes/VMs.
To do so, add a tag=$TAG section to the OVS command line that adds the representor ports; in the example here, VLAN ID 52 is used.
The PF port should not have a VLAN attached. This will cause OVS to add VLAN push/pop actions
when managing traffic for these VFs.
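For example (bridge name ovs-sriov taken from the ovs-dpctl output above):

# ovs-vsctl add-port ovs-sriov enp4s0f0_0 tag=52
# ovs-vsctl add-port ovs-sriov enp4s0f0_1 tag=52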
To see how the OVS rules look with vlans, here we initiated a ping from VF0 (OVS port 3) to an outer
node (OVS port 2), where the VF MAC is e4:11:22:33:44:50 and the outer node MAC
is 00:02:c9:e9:bb:b2.
At this stage, we can see that two OVS rules were added, one in each direction.
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=00:02:c9:e9:bb:b2),eth_type(0x0800),ipv4(frag=no), \
packets:0, bytes:0, used:never, actions:push_vlan(vid=52,pcp=0),2
recirc_id(0),in_port(2),eth(src=00:02:c9:e9:bb:b2,dst=e4:11:22:33:44:50),eth_type(0x8100), \
vlan(vid=52,pcp=0),encap(eth_type(0x0800),ipv4(frag=no)), packets:0, bytes:0, used:never, actions:pop_vlan,3
• For outgoing traffic (in port = 3), the actions are push vlan (52) and forward to port 2
• For incoming traffic (in port = 2), matching is done also on vlan, and the actions are pop vlan
and forward to port 3
VXLAN encapsulation / decapsulation offloading of OVS actions is supported only in
ConnectX-5 adapter cards.
In case of offloading VXLAN, the PF should not be added as a port in the OVS data-path but rather
be assigned with the IP address to be used for encapsulation.
The example below shows two hosts (PFs) with IPs 1.1.1.177 and 1.1.1.75, where the PF device
on both hosts is enp4s0f0 and the VXLAN tunnel is set with VNID 98:
• On the first host:
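A sketch of the commands (standard ovs-vsctl syntax; bridge name assumed):

# ip addr add 1.1.1.177/24 dev enp4s0f0
# ovs-vsctl add-port ovs-sriov vxlan0 -- set interface vxlan0 type=vxlan \
    options:local_ip=1.1.1.177 options:remote_ip=1.1.1.75 options:key=98 options:dst_port=4789

• On the second host, swap the local_ip and remote_ip values (1.1.1.75 local, 1.1.1.177 remote).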
When encapsulating guest traffic, the VF's device MTU must be reduced to allow the host/HW to add the encapsulation headers without fragmenting the resulting packet. As such, the VF's MTU must be lowered to 1450 for IPv4 and 1430 for IPv6.
To see how the OVS rules look with vxlan encap/decap actions, here we initiated a ping from a VM
on the 1st host whose MAC is e4:11:22:33:44:50 to a VM on the 2nd host whose MAC
is 46:ac:d1:f1:4c:af
At this stage we see that two OVS rules were added to the first host; one in each direction.
# ovs-dpctl show
system@ovs-system:
lookups: hit:7869 missed:241 lost:2
flows: 2
masks: hit:13726 total:10 hit/pkt:1.69
port 0: ovs-system (internal)
port 1: ovs-sriov (internal)
port 2: vxlan_sys_4789 (vxlan)
port 3: enp4s0f1_0
port 4: enp4s0f1_1
# ovs-dpctl dump-flows
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=46:ac:d1:f1:4c:af),eth_type(0x0800),ipv4(tos=0/0x3,frag=no),
packets:4, bytes:392, used:0.664s, actions:set(tunnel(tun_id=0x62,dst=1.1.1.75,ttl=64,flags(df,key))),2
recirc_id(0),tunnel(tun_id=0x62,src=1.1.1.75,dst=1.1.1.177,ttl=64,flags(-df-csum+key)),
in_port(2),skb_mark(0),eth(src=46:ac:d1:f1:4c:af,dst=e4:11:22:33:44:50),eth_type(0x0800),ipv4(frag=no), packets:5,
bytes:490, used:0.664s, actions:3
• For outgoing traffic (in port = 3), the actions are set vxlan tunnel to host 1.1.1.75 (encap)
and forward to port 2
• For incoming traffic (in port = 2), matching is done also on vxlan tunnel info which was
decapsulated, and the action is forward to port 3
Offloading rules can also be added directly, and not just through OVS, using the tc utility.
To do so, enable TC ingress on both the PF and the VF:
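For example (device names assumed from the earlier steps):

# tc qdisc add dev enp4s0f0 ingress
# tc qdisc add dev enp4s0f0_0 ingress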
Examples
L2 Rule
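A plausible L2 rule, sketched in standard tc flower syntax (device and MAC values assumed, matching the VXLAN fragment below):

tc filter add dev ens4f0_0 protocol ip ingress prio 1 flower \
  skip_sw \
  dst_mac e4:11:22:11:4a:51 \
  action mirred egress redirect dev ens4f0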
VLAN Rule
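A plausible VLAN rule, sketched in standard tc flower syntax (values assumed):

tc filter add dev ens4f0_0 protocol 802.1q ingress prio 1 flower \
  skip_sw \
  vlan_id 100 \
  vlan_prio 3 \
  action vlan pop pipe \
  action mirred egress redirect dev ens4f0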
VXLAN Rule
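The opening lines of the VXLAN rule below were cut off; a plausible reconstruction (device name assumed) is:

tc filter add dev vxlan_sys_4789 protocol ip ingress prio 1 flower \
  skip_sw \
  dst_mac e4:11:22:11:4a:51 \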
src_mac e4:11:22:11:4a:50 \
enc_src_ip 20.1.11.1 \
enc_dst_ip 20.1.12.1 \
enc_key_id 100 \
enc_dst_port 4789 \
action tunnel_key unset \
action mirred egress redirect dev ens4f0_0
Bond Rule
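A plausible bond rule, sketched as a tc flower filter on the bond device (names assumed):

tc filter add dev bond0 protocol ip ingress prio 1 flower \
  skip_sw \
  dst_mac e4:11:22:11:4a:51 \
  action mirred egress redirect dev ens4f0_0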
VLAN Modify
tc filter add dev $REP_DEV1 protocol 802.1q ingress prio 1 flower \
vlan_id 10 \
action vlan modify id 11 pipe \
action mirred egress redirect dev $REP_DEV2
SR-IOV VF LAG
SR-IOV VF LAG allows the NIC’s physical functions to get the rules that the OVS will try to offload to
the bond net-device, and to offload them to the hardware e-switch. Bond modes supported are:
• Active-Backup
• Active-Active
• LACP
To enable SR-IOV LAG, both physical functions of the NIC should first be configured to SR-IOV
switchdev mode, and only afterwards bond the up-link representors.
The example below shows the creation of bond interface on two PFs:
1. Load bonding device and enslave the up-link representor (currently PF) net-device devices.
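A sketch (assuming LACP, i.e. bonding mode 802.3ad, and the PF names used above):

# modprobe bonding mode=802.3ad
# ip link set dev enp4s0f0 down
# ip link set dev enp4s0f1 down
# ip link set dev enp4s0f0 master bond0
# ip link set dev enp4s0f1 master bond0
# ip link set dev enp4s0f0 up
# ip link set dev enp4s0f1 up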
2. Add the VF representor net-devices as OVS ports. If tunneling is not used, add the bond
device as well.
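For example (representor and bridge names assumed):

# ovs-vsctl add-port ovs-sriov enp4s0f0_0
# ovs-vsctl add-port ovs-sriov enp4s0f0_1
# ovs-vsctl add-port ovs-sriov bond0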
Port Mirroring is currently supported in ConnectX-5 adapter cards only.
Unlike para-virtual configurations, when VM traffic is offloaded to the hardware via an SR-IOV VF, the host-side admin cannot snoop the traffic (e.g. for monitoring).
ASAP² uses the existing mirroring support in OVS and TC along with the enhancement to the
offloading logic in the driver to allow mirroring the VF traffic to another VF.
The mirrored VF can be used to run a traffic analyzer (tcpdump, wireshark, etc.) and observe the traffic of the VF being mirrored.
The example below shows the creation of port mirror on the following configuration:
# ovs-vsctl show
09d8a574-9c39-465c-9f16-47d81c12f88a
Bridge br-vxlan
Port "enp4s0f0_1"
Interface "enp4s0f0_1"
Port "vxlan0"
Interface "vxlan0"
type: vxlan
options: {key="100", remote_ip="192.168.1.14"}
Port "enp4s0f0_0"
Interface "enp4s0f0_0"
Port "enp4s0f0_2"
Interface "enp4s0f0_2"
Port br-vxlan
Interface br-vxlan
type: internal
ovs_version: "2.8.90"
• To set enp4s0f0_0 as the mirror port and mirror all of the traffic, set it as follows (see the sketch after this list):
• To set enp4s0f0_0 as the mirror port and only mirror the traffic whose destination is enp4s0f0_1, set it as follows:
• To set enp4s0f0_0 as the mirror port and only mirror the traffic whose source is enp4s0f0_1, set it as follows:
• To set enp4s0f0_0 as the mirror port and mirror all the traffic on enp4s0f0_1, set it as follows:
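For the first case (mirror all traffic), a sketch using standard ovs-vsctl mirror syntax is:

# ovs-vsctl -- --id=@p get port enp4s0f0_0 \
  -- --id=@m create mirror name=m0 select-all=true output-port=@p \
  -- set bridge br-vxlan mirrors=@m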
Offloaded flows (including connection tracking) are added to virtual switch FDB flow tables. FDB
tables have a set of flow groups. Each flow group saves the same traffic pattern flows. For example,
for connection tracking offloaded flow, TCP and UDP are different traffic patterns which end up in
two different flow groups.
A flow group has a limited size to save flow entries. By default, the driver has 4 big FDB flow groups.
Each of these big flow groups can save at most 4000000/(4+1)=800k different 5-tuple flow entries.
For scenarios with more than 4 traffic patterns, the driver provides a module parameter (num_of_groups) to allow customization and performance tuning.
The size of each big flow group can be calculated according to the following formula.
size = 4000000/(num_of_groups+1)
The change takes effect immediately if there is no flow inside the FDB table (no traffic running and
all offloaded flows are aged out), and it can be dynamically changed without reloading the driver.
The module parameter can be set statically in /etc/modprobe.d/mlnx.conf file. This way the
administrator will not be required to set it via sysfs each time the driver is reloaded.
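For example (the group count of 10 is illustrative):

echo "options mlx5_core num_of_groups=10" >> /etc/modprobe.d/mlnx.conf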
If there are residual offloaded flows when changing this parameter, then the new configuration only
takes effect after all flows age out.
# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices
MST modules:
------------
MST PCI module loaded
MST PCI configuration module loaded
PCI devices:
------------
DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX4lx(rev:0) /dev/mst/mt4117_pciconf0.1 04:00.1 net-enp4s0f1 NA
ConnectX4lx(rev:0) /dev/mst/mt4117_pciconf0 04:00.0 net-enp4s0f0 NA
# mlxconfig -d /dev/mst/mt4117_pciconf0 q | head -16
Device #1:
----------
Device type: ConnectX4lx
PCI device: /dev/mst/mt4117_pciconf0
Configurations: Current
SRIOV_EN True(1)
NUM_OF_VFS 8
PF_LOG_BAR_SIZE 5
VF_LOG_BAR_SIZE 5
NUM_PF_MSIX 63
NUM_VF_MSIX 11
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)
# mlxconfig -d /dev/mst/mt4115_pciconf0 q
This feature enables optimizing the mlx5 driver teardown time in shutdown and kexec flows.
Fast driver unload is disabled by default. To enable it, set the prof_sel module parameter of the mlx5_core module to 3.
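For example:

echo "options mlx5_core prof_sel=3" >> /etc/modprobe.d/mlnx.conf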
Troubleshooting
You may be able to easily resolve the issues described in this section. If a problem persists and you
are unable to resolve it yourself, please contact your Mellanox representative or Mellanox Support
at support@mellanox.com.
General Issues
Issue: Mellanox adapters are not installed in the system.
Cause: Misidentification of the Mellanox adapter installed.
Solution: Identify the installed Mellanox adapter, for example by running "lspci | grep Mellanox", and check Mellanox's MAC to identify the Mellanox adapter installed.

Issue: Insufficient memory to be used by udev upon OS boot.
Cause: udev is designed to fork() a new process for each event it receives, so it can handle many events in parallel, and each udev instance consumes some RAM memory.
Solution: Limit the udev instances running simultaneously per boot by adding udev.children-max=<number> to the kernel command line in grub.

Issue: Operating system running from a root file system located on remote storage (over Mellanox devices) hangs during reboot/shutdown (errors such as "No such file or directory" will appear).
Cause: The mlnx-en.d service script is called using the 'stop' option by the operating system. This option unloads the driver stack. Therefore, the OS root file system disappears before the reboot/shutdown procedure is completed, leaving the OS in a hung state.
Solution: Disable the openibd 'stop' option by setting 'ALLOW_STOP=no' in the /etc/mlnx-en.conf configuration file.
Ethernet Related Issues
Issue: Ethernet interface renaming fails, leaving interfaces with names such as renameXY.
Cause: Invalid udev rules.
Solution: Review the udev rules inside the "/etc/udev/rules.d/70-persistent-net.rules" file. Modify the rules such that every rule is unique to the target interface, by adding correct unique attribute values to each interface, such as dev_id, dev_port and KERNELS or address. For example:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", ATTR{dev_port}=="0", KERNELS=="0000:08:00.0", NAME="eth4"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", ATTR{dev_port}=="1", KERNELS=="0000:08:00.0", NAME="eth5"

Issue: Degraded performance is measured in a mixed rate environment (10GbE, 40GbE and 56GbE).
Cause: Sending traffic from a node with a higher rate to a node with a lower rate.
Solution: Enable Flow Control on both switch ports and nodes. On the server side run:
ethtool -A <interface> rx on tx on
Issue: Physical link fails to negotiate to maximum supported rate.
Cause: The adapter is running an outdated firmware.
Solution: Install the latest firmware on the adapter.

Issue: Physical link fails to come up while port physical state is Polling.
Cause: The cable is not connected to the port, or the port on the other end of the cable is disabled.
Solution:
• Ensure that the cable is connected on both ends, or use a known working cable
• Check the status of the connected port using the ibportstate command and enable it if necessary

Issue: Physical link fails to come up while port physical state is Disabled.
Cause: The port was manually disabled.
Solution: Restart the driver:
/etc/init.d/openibd restart
Issue: Out-of-the-box throughput performance in Ubuntu 14.04 is not optimal and may achieve results below the line rate at 40GbE link speed.
Cause: IRQ affinity is not set properly by the irq_balancer.
Solution: For additional performance tuning, please refer to the Performance Tuning Guide.

Issue: UDP receiver throughput may be lower than expected when running over the mlx4_en Ethernet driver.
Cause: This is caused by the adaptive interrupt moderation routine, which sets high values of interrupt coalescing, causing the driver to process a large number of packets in the same interrupt, leading UDP to drop packets due to overflow in its buffers.
Solution: Disable adaptive interrupt moderation and set lower values for the interrupt coalescing manually:
ethtool -C <ethX> adaptive-rx off rx-usecs 64 rx-frames 24

Issue: Failed to enable SR-IOV; an error message is reported in dmesg.
Cause: SR-IOV is disabled in the BIOS.
Solution: Check that SR-IOV is enabled in the BIOS (see the "Setting Up SR-IOV" section).
Common Abbreviations and Related Documents
Common Abbreviations and Acronyms
Abbreviation/Acronym Description
b - (Small) 'b' is used to indicate size in bits or multiples of bits (e.g., 1Kb = 1024 bits)
FW Firmware
HW Hardware
IB InfiniBand
SW Software
PR Path Record
SL Service Level
VL Virtual Lane
Glossary
The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers
in particular. It is included here for ease of reference, but the main reference remains
the InfiniBand Architecture Specification.
Term - Description
Channel Adapter (CA), Host Channel Adapter (HCA) - An IB device that terminates an IB link and executes transport functions. This may be an HCA (Host CA) or a TCA (Target CA).
HCA Card - A network adapter card based on an InfiniBand channel adapter device.
Local Identifier (LID) - An address assigned to a port (data sink or source point) by the Subnet Manager, unique within the subnet, used for directing packets within the subnet.
Local Device/Node/System - The IB Host Channel Adapter (HCA) card installed on the machine running IBDIAG tools.
Local Port - The IB port of the HCA through which IBDIAG tools connect to the IB fabric.
Master Subnet Manager - The Subnet Manager that is authoritative, that has the reference configuration information for the subnet.
Multicast Forwarding Tables - A table that exists in every switch providing the list of ports to forward received multicast packets. The table is organized by MLID.
Network Interface Card (NIC) - A network adapter card that plugs into the PCI Express slot and provides one or more ports to an Ethernet network.
Standby Subnet Manager - A Subnet Manager that is currently quiescent, and not in the role of a Master Subnet Manager, by agency of the master SM.
Subnet Administrator (SA) - An application (normally part of the Subnet Manager) that implements the interface for querying and manipulating subnet management data.
Subnet Manager (SM) - One of several entities involved in the configuration and control of the IB fabric.
Unicast Linear Forwarding Tables (LFT) - A table that exists in every switch providing the port through which packets should be sent to each LID.
Virtual Protocol Interconnect (VPI) - A Mellanox Technologies technology that allows Mellanox channel adapter devices (ConnectX®) to simultaneously connect to an InfiniBand subnet and a 10GigE subnet (each subnet connects to one of the adapter ports).
Related Documentation
Document Name - Description
InfiniBand Architecture Specification, Vol. 1, Release 1.2.1 - The InfiniBand Architecture Specification that is provided by IBTA.
IEEE Std 802.3ae™-2002 (Amendment to IEEE Std 802.3-2002), Document # PDF: SS94996 - Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications. Amendment: Media Access Control (MAC) Parameters, Physical Layers, and Management Parameters for 10 Gb/s Operation.
Firmware Release Notes for Mellanox adapter devices - See the Release Notes PDF file relevant to your adapter device on mellanox.com.
MFT User Manual and Release Notes - Mellanox Firmware Tools (MFT) User Manual and Release Notes documents.
WinOF User Manual - Mellanox WinOF User Manual; describes the installation, configuration, and operation of the Mellanox Windows driver.
VMA User Manual - Mellanox VMA User Manual; describes the installation, configuration, and operation of the Mellanox VMA driver.
User Manual Revision History
Release Date Description
4.9 LTS May 28, 2020 • Added Interrupt Request (IRQ) Naming section.
• Added Debuggability section.
Release Notes Change Log History
Category Description
Rev 4.9-0.1.7.0
Devlink Health CR-Space Dump Added the option to dump configuration space via the devlink
tool in order to improve debug capabilities.
Multi-packet TX WQE Support for XDP The conventional TX descriptor (WQE or Work Queue Element)
Transmit Flows describes a single packet for transmission. Added driver support
for the HW feature of multi-packet TX WQEs in XDP transmit
flows. With this, the HW becomes capable of working with a new
and improved WQE layout that describes several packets. In
effect, this feature saves PCI bandwidth and transactions, and
improves transmit packet rate.
GENEVE Encap/Decap Rules Offload Added support for GENEVE encapsulation/decapsulation rules
offload.
Kernel Software Steering for Connection Tracking (CT): [Beta] Added support for updating CT rules using the software steering mechanism.
Kernel Software Steering Remote Mirroring: [Beta] Added support for updating remote mirroring rules using the software steering mechanism.
RoCE Accelerator Counters: Added the following RoCE accelerator counters:
• roce_adp_retrans - counts the number of adaptive retransmissions for RoCE traffic
• roce_adp_retrans_to - counts the number of times RoCE traffic reached timeout due to adaptive retransmission
• roce_slow_restart - counts the number of times RoCE slow restart was used
• roce_slow_restart_cnps - counts the number of times RoCE slow restart generated CNP packets
• roce_slow_restart_trans - counts the number of times RoCE slow restart changed state to slow restart
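As a hedged illustration of reading these counters (the device name mlx5_0 and port 1 are placeholders, and counter availability depends on adapter and firmware), they are exposed per port under sysfs:

# Print all hardware counters for the port, one "file:value" pair per line
grep . /sys/class/infiniband/mlx5_0/ports/1/hw_counters/*

# Or read a single counter
cat /sys/class/infiniband/mlx5_0/ports/1/hw_counters/roce_adp_retrans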
Memory Region: Added support for users to register memory regions with a relaxed ordering access flag through experimental verbs. This can enhance performance, depending on the architecture and scenario.
Adapters: All
ibdev2netdev Tool Output: The ibdev2netdev tool output was changed such that a bonding device now points at the bond instead of the slave interface.
Devlink Health Reporters: Added support for monitoring and recovering from errors that occur on the RX queue, such as cookie errors and timeouts.
GSO Optimization: Improved GSO (Generic Segmentation Offload) workload performance by decreasing doorbell usage to the minimum required.
TX CQE Compression: Added support for TX CQE (Completion Queue Element) compression, which saves outgoing PCIe bandwidth by compressing CQEs together. Disabled by default. Configurable via ethtool private flags.
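A hedged example of toggling the feature via ethtool private flags; the interface name eth1 is a placeholder, and the exact flag name (assumed here to be tx_cqe_compress) should be confirmed against the list that --show-priv-flags prints:

# List the driver's private flags and their current values
ethtool --show-priv-flags eth1

# Enable TX CQE compression (flag name assumed; verify it in the output above)
ethtool --set-priv-flags eth1 tx_cqe_compress on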
Firmware Versions Query via Devlink: Added the option to query for running and stored firmware versions using the devlink tool.
Firmware Flash Update via Devlink: Added the option to update the firmware image in the flash using the devlink tool.
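For illustration, hedged invocations of the two devlink options above; the PCI device address and image file name are placeholders, and the image path is interpreted relative to /lib/firmware:

# Query the running and stored firmware versions
devlink dev info pci/0000:03:00.0

# Update the firmware image in the flash
devlink dev flash pci/0000:03:00.0 file mellanox/firmware.bin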
Bug Fixes: See the Bug Fixes section.
Rev 4.7-1.0.0.1
Counters Monitoring: Added support for monitoring selected counters and generating a notification event (Monitor_Counter_Change event) upon changes made to these counters. The counters to be monitored are selected using the SET_MONITOR_COUNTER command.
EEPROM Device Thresholds via Ethtool: Added support for reading additional EEPROM information from high pages of modules such as SFF-8436 and SFF-8636, for example the Application Select table, user-writable EEPROM, and thresholds and alarms. An ethtool dump works on active cables only (e.g. optics), but thresholds and alarms can be read on any cable with the "offset" and "length" parameters by running: ethtool -m <DEVNAME> offset X length Y
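A concrete, hedged instance of the command template above; the interface name, offset, and length are placeholders, since the offsets that hold thresholds and alarms depend on the module type:

# Dump 16 bytes of module EEPROM starting at offset 0x200 (example values only)
ethtool -m eth1 offset 0x200 length 16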
RDMA_RX RoCE Steering Support: Added the ability to create rules to steer RDMA traffic, with two destinations supported: DevX object and QP. Multiple priorities are also supported.
HCAs: All
Rev 4.6-1.0.1.1
Devlink Configuration Parameters Tool: Added support for a set of configuration parameters that can be changed by the user through the devlink user interface.
DevX Interoperability APIs: Added support for modifying and/or querying verb objects (including CQ, QP, SRQ, WQ, and IND_TBL APIs) via the DevX interface. For example, take the cqn from a created ibv_cq and use it in a devx_create(QP) call.
Indirect Mkey ODP: Added the ability to create indirect Mkeys with ODP support over the DevX interface.
XDP Redirect: Added support for the XDP_REDIRECT feature on both the ingress and egress sides. Using this feature, incoming packets on one interface can be redirected very quickly into the transmission queue of another capable interface. Typically used for load balancing.
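As a hedged sketch of how such a program is attached (the interface name and object file are placeholders, and the XDP program itself, which would call bpf_redirect() or bpf_redirect_map(), is not shown):

# Attach a compiled XDP object in native (driver) mode
ip link set dev eth0 xdpdrv obj prog.o sec xdp

# Detach it again
ip link set dev eth0 xdpdrv off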
RoCE Disablement: Added the option to disable RoCE traffic handling. This enables forwarding of traffic over UDP port 4791, which is handled as RoCE traffic when RoCE is enabled.
VF LAG: Added support for high availability and load balancing for virtual functions of different physical ports in SwitchDev SR-IOV mode.
PCI Atomic Operations: Added the ability to run atomic operations on local memory without involving the verbs API or compromising the operation's atomicity.
HCAs: ConnectX-5
VFs Rate Limit: Added support for setting a rate limit on groups of virtual functions rather than on an individual virtual function.
HCAs: ConnectX-6
ConnectX-6 Support: [Beta] Added support for ConnectX-6 (VPI only) adapter cards.
Message Signaled Interrupts-X (MSI-X) Vectors: Added support for using a single MSI-X vector for all control event queues instead of one MSI-X vector per queue in a virtual function driver. This frees extra MSI-X vectors to be used for completion event queues, allowing additional traffic channels in the network device.
Send APIs: Introduced a new set of QP Send operations (APIs) which allow extensibility for new Send opcodes.
HCAs: BlueField
BlueField Support: BlueField is now fully supported as part of the Mellanox OFED mainstream version, sharing the same code baseline with the entire adapter product line.
Representor Name Change: In SwitchDev mode:
• Uplink representors are now called p0/p1
• Host PF representors are now called pf0hpf/pf1hpf
• VF representors are now called pf0vfN/pf1vfN
ECPF Net Devices: In SwitchDev mode, net devices enp3s0f0 and enp3s0f1 are no longer created.
Setting Host MAC and Tx Rate Limit from ECPF: Expanded to support VFs as well as the host PFs.
HCAs: All
RDMA-CM Application Managed QP: Added support for RDMA applications to manage their own QPs and use RDMA-CM only for exchanging address information.
RDMA-CM QP Timeout Control: Added a new option to rdma_set_option that allows applications to override the RDMA-CM QP ACK timeout value.
MLNX_OFED Verbs API: As of the MLNX_OFED v5.0 release (Q1 2020) onwards, the MLNX_OFED verbs API will be migrated from the legacy version of the user-space verbs libraries (libibverbs, libmlx5, etc.) to the upstream version, rdma-core. More details are available in the MLNX_OFED User Manual under Installing Upstream rdma-core Libraries.
Rev 4.5-1.0.1.0
HCAs: ConnectX-5
VFs per PF: Increased the maximum number of virtual functions (VFs) that can be allocated to a physical function (PF) to 127 VFs.
SW-Defined UDP Source Port for RoCE v2: The UDP source port for RoCE v2 packets is now calculated by the driver rather than the firmware, achieving better distribution and less congestion. This mechanism works for RDMA-CM QPs only, and ensures that RDMA connection messages and data messages have the same UDP source port value.
Local Loopback Disable: Added the ability to manually disable local loopback regardless of the number of open user-space transport domains.
HCAs: ConnectX-6
Adapter Cards: Added support for ConnectX-6 Ready. For further information, please contact Mellanox Support.
HCAs: All
Rev 4.4-2.0.7.0
HCAs: All
Rev 4.4-1.0.1.0
Adaptive Interrupt Moderation: Added support for adaptive Tx, which optimizes the moderation values of the Tx CQs at runtime for maximum throughput with minimum CPU overhead.
Docker Containers [Beta]: Added support for Docker containers to run over virtual RoCE and InfiniBand devices using SR-IOV mode.
Firmware Tracer: Added a new mechanism for the device's FW/HW to log important events into the event tracing system (/sys/kernel/debug/tracing) without requiring any Mellanox-specific tool.
HCAs: ConnectX-4/ConnectX-4 Lx
VST Q-in-Q: Added support for C-tag (0x8100) VLAN insertion to tagged packets in VST mode.
HCAs: ConnectX-4 Lx/ConnectX-5
OVS Offload using ASAP2: Added support for Mellanox Accelerated Switching And Packet Processing (ASAP2) technology, which allows OVS offloading by handling the OVS data plane while leaving the OVS control plane unmodified. OVS offload using ASAP2 technology provides significantly higher OVS performance without the associated CPU load.
Rev 4.3-1.0.1.0
Adaptive Interrupt Moderation: Added support for adaptive Tx, which optimizes the moderation values of the Tx CQs at runtime for maximum throughput with minimum CPU overhead.
Docker Containers [Beta]: Added support for Docker containers to run over virtual RoCE and InfiniBand devices using SR-IOV mode.
Firmware Tracer: Added a new mechanism for the device's FW/HW to log important events into the event tracing system (/sys/kernel/debug/tracing) without requiring any Mellanox-specific tool.
HCAs: ConnectX-4/ConnectX-4 Lx
VST Q-in-Q: Added support for C-tag (0x8100) VLAN insertion to tagged packets in VST mode.
HCAs: ConnectX-4
HCAs: ConnectX-5/ConnectX-4 Lx
OVS Offload using ASAP2: Added support for Mellanox Accelerated Switching And Packet Processing (ASAP2) technology, which allows OVS offloading by handling the OVS data plane while leaving the OVS control plane unmodified. OVS offload using ASAP2 technology provides significantly higher OVS performance without the associated CPU load.
HCAs: All
HCAs: ConnectX-5
Erasure Coding Offload Verbs: Added support for erasure coding offload software verbs (encode/decode/update API) supporting a number of redundancy blocks (m) greater than 4.
RoCE LAG: Added out-of-box RoCE LAG support for RHEL 7.2 and RHEL 6.9.
Reset Flow: Added support for triggering a software reset for firmware/driver recovery. When fatal errors occur, the firmware can be reset and the driver reloaded.
Striding RQ with HW Time-Stamping: Added the option to retrieve the HW timestamp when polling for completions from a completion queue that is attached to a multi-packet RQ (Striding RQ).
Rev 4.2-1.0.1.0
HCAs: mlx5 Driver
Physical Address Memory Allocation: Added support to register a specific physical address range.
Innova IPsec Adapter Cards: Added support for the Mellanox Innova IPsec EN adapter card, which provides security acceleration for IPsec-enabled networks.
Precision Time Protocol (PTP): Added support for the PTP feature over PKEY interfaces.
Virtual MAC: Added support for the Virtual MAC feature, which allows users to add up to 4 virtual MACs (VMACs) per VF. All traffic destined to a VMAC is forwarded to the relevant VF instead of the PF. All traffic going out from the VF with a source MAC equal to a VMAC goes to the wire, also when Spoof Check is enabled.
Receive Buffer: Added the option to change the receive buffer size and cable length. Changing the cable length adjusts the receive buffer's xon and xoff thresholds.
GRE Tunnel Offloads: Added support for the following GRE tunnel offloads:
• TSO over GRE tunnels
• Checksum offloads over GRE tunnels
• RSS spread for GRE packets
NVMEoF: Added support for the host side (RDMA initiator) in RedHat 7.2 and above.
Dropless Receive Queue (RQ): Added support for the driver to notify the FW when SW receive queues are overloaded.
PFC Storm Prevention: Added support for configuring PFC stall prevention in cases where the device unexpectedly becomes unresponsive for a long period of time. PFC stall prevention disables flow control mechanisms when the device is stalled for a period longer than the default pre-configured timeout. Users can now change the default timeout by moving to auto mode.
HCAs: ConnectX-5
Virtual Guest Tagging (VGT+): Added support for VGT+ in ConnectX-4/ConnectX-5 HCAs. This feature is an advanced mode of Virtual Guest Tagging (VGT), in which a VF is allowed to tag its own packets as in VGT, but is still subject to an administrative VLAN trunk policy. The policy determines which VLAN IDs are allowed to be transmitted or received. The policy does not determine the user priority, which is left unchanged.
Tag Matching Offload: Added support for hardware Tag Matching offload with Dynamically Connected Transport (DCT).
HCAs: All
Rev 4.1-1.0.2.0
RoCE Diagnostics and ECN Counters: Added support for additional RoCE diagnostics and ECN congestion counters under the /sys/class/infiniband/mlx5_0/ports/1/hw_counters/ directory.
rx-fcs Offload (ethtool): Added support for the rx-fcs ethtool offload configuration. Normally, the FCS of the packet is truncated by the ASIC hardware before the packet is passed to the application socket buffer (skb). ethtool now allows rx-fcs to be set so that the FCS is not truncated but is passed to the application for analysis.
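A hedged example; the interface name eth1 is a placeholder, and rx-fcs is a standard ethtool feature flag:

# Check whether the device reports the rx-fcs feature
ethtool -k eth1 | grep rx-fcs

# Deliver the FCS to the application instead of truncating it
ethtool -K eth1 rx-fcs on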
DSCP Trust Mode: Added the option to enable PFC based on the DSCP value. With this solution, VLAN headers are no longer mandatory.
RoCE ECN Parameters: ECN parameters have been moved to the following directory: /sys/kernel/debug/mlx5/<PCI BUS>/cc_params/
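For illustration (the PCI bus address is a placeholder and the individual parameter file names are assumptions that vary by firmware), the relocated parameters can be inspected through debugfs as root:

# List the congestion control parameter files
ls /sys/kernel/debug/mlx5/0000:03:00.0/cc_params/

# Read one parameter (file name is an assumption)
cat /sys/kernel/debug/mlx5/0000:03:00.0/cc_params/rp_dce_tcp_g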
Flow Steering Dump Tool: Added support for mlx_fs_dump, a Python tool that prints the steering rules in a readable manner.
Secure Firmware Updates: Firmware binaries embedded in the MLNX_EN package now support Secure Firmware Updates. This feature provides devices with the ability to verify digital signatures of new firmware binaries, ensuring that only officially approved versions are installed on the devices.
PeerDirect: Added the ability to open a device and create a context while providing PCI peer attributes such as name and ID.
Probed VFs: Added the ability to disable probed VFs on the hypervisor. For further information, see the HowTo Configure and Probe VFs on mlx5 Drivers community post.
1PPS Time Synchronization (at alpha level): Added support for One Pulse Per Second (1PPS), a time synchronization feature that allows the adapter to send or receive one pulse per second on a dedicated pin on the adapter card.
Fast Driver Unload: Added support for fast driver teardown in shutdown and kexec flows.
HCAs: ConnectX-5/ConnectX-5 Ex
NVMEoF Target Offload: Added support for NVMe over Fabrics (NVMEoF) offload, an implementation of the new NVMEoF standard target (server) side in hardware.
HCAs: All
Rev 4.0-2.0.0.1
PCIe Error Counting: [ConnectX-4/ConnectX-4 Lx] Added the ability to expose physical layer statistical counters to ethtool.
Standard ethtool: [ConnectX-4/ConnectX-4 Lx] Added support for flow steering and rx-all mode.
SR-IOV Bandwidth Share for Ethernet/RoCE (beta): [ConnectX-4/ConnectX-4 Lx] Added the ability to guarantee the minimum rate of a certain VF in SR-IOV mode.
NFS over RDMA (NFSoRDMA): Removed support for NFSoRDMA drivers. These drivers are no longer provided along with the MLNX_EN package.
Uplink Representor Modes: Added support for the following uplink representor modes:
1. new_netdev: the default mode. In this mode, the uplink representor is created as a new netdevice.
2. nic_netdev: in this mode, the NIC netdevice acts as an uplink representor device.
Example:
Notes:
• The mode can only be changed when in Legacy mode.
• The mode is not saved when reloading mlx5_core.
Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a certain
functionality, condition, or quality of a product. Neither NVIDIA Corporation nor any of its direct or indirect subsidiaries
and affiliates (collectively: “NVIDIA”) make any representations or warranties, expressed or implied, as to the accuracy
or completeness of the information contained in this document and assume no responsibility for any errors contained
herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents
or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or
deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to
this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is
current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order
acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of
NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and
conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations
are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or
life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be
expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for
inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at
customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use.
Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to
evaluate and determine the applicability of any information contained in this document, ensure the product is suitable
and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a
default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability
of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in
this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or
attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product
designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual
property right under this document. Information published by NVIDIA regarding third-party products or services does not
constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such
information may require a license from a third party under the patents or other intellectual property rights of the third
party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced
without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all
associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS,
AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO
WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY
DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT
LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND
REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason
whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be
limited in accordance with the Terms of Sale for the product.
Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/
or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks
of the respective companies with which they are associated.
Copyright
© 2022 NVIDIA Corporation & affiliates. All Rights Reserved.