MLNX_EN Documentation Rev 4.9-5.1.0.0 LTS

Exported on Oct/23/2022 05:04 PM


Table of Contents
Release Notes................................................................................... 6
Supported NICs Speeds ..............................................................................6
Package Contents ....................................................................................7
General Support in MLNX_EN..................................................................... 10
Supported Operating Systems................................................................ 10
Supported Non-Linux Virtual Machines ..................................................... 15
Supported HCAs Firmware Versions ......................................................... 15
Changes and New Features ....................................................................... 16
New Features ................................................................................... 16
Customer Affecting Changes ................................................................. 16
Known Issues........................................................................................ 17
Bug Fixes ............................................................................................ 41
Introduction....................................................................................69
MLNX_EN Package Contents ...................................................................... 70
Package Images ................................................................................ 70
Software Components ......................................................................... 70
Firmware ........................................................................................ 71
Directory Structure ............................................................................ 71
mlx4 VPI Driver................................................................................. 71
mlx5 Driver ..................................................................................... 72
Unsupported Features in MLNX_EN.......................................................... 72
Module Parameters ................................................................................ 72
mlx4 Module Parameters...................................................................... 72
Devlink Parameters ................................................................................ 76
Installation .....................................................................................77
Software Dependencies ........................................................................... 77
Downloading MLNX_EN ............................................................................ 77
Installing MLNX_EN ................................................................................ 78
Installation Script .............................................................................. 78
Installation Modes ............................................................................. 78
Installation Procedure......................................................................... 79
Additional Installation Procedures .......................................................... 80

Uninstalling MLNX_EN ............................................................................. 84
Uninstalling MLNX_EN Using the YUM Tool ................................................. 84
Uninstalling MLNX_EN Using the apt-get Tool ............................................. 84
Updating Firmware After Installation ........................................................... 84
Updating the Device Online .................................................................. 85
Updating Firmware and FPGA Image on Innova IPsec Cards............................. 85
Updating the Device Manually ............................................................... 86
Ethernet Driver Usage and Configuration ...................................................... 86
Performance Tuning ............................................................................... 89
Features Overview and Configuration .....................................................90
Ethernet Network .................................................................................. 90
ethtool Counters ............................................................................... 90
Interrupt Request (IRQ) Naming ............................................................. 91
Quality of Service (QoS) ...................................................................... 92
Quantized Congestion Notification (QCN)................................................. 101
Ethtool.......................................................................................... 103
Checksum Offload ............................................................................ 108
Ignore Frame Check Sequence (FCS) Errors............................................... 109
RDMA over Converged Ethernet (RoCE).................................................... 109
Flow Control ................................................................................... 110
Explicit Congestion Notification (ECN) .................................................... 115
RSS Support .................................................................................... 117
Time-Stamping ................................................................................ 118
Flow Steering .................................................................................. 122
Wake-on-LAN (WoL)........................................................................... 127
Hardware Accelerated 802.1ad VLAN (Q-in-Q Tunneling)............................... 127
Local Loopback Disable ...................................................................... 128
NVME-oF - NVM Express over Fabrics ...................................................... 129
Debuggability .................................................................................. 129
RX Page Cache Size Limit .................................................................... 129
Virtualization ......................................................................................130
Single Root IO Virtualization (SR-IOV) ..................................................... 131
Enabling Paravirtualization.................................................................. 158
VXLAN Hardware Stateless Offloads ....................................................... 159

Q-in-Q Encapsulation per VF in Linux (VST) .............................................. 161
802.1Q Double-Tagging....................................................................... 164
Resiliency ..........................................................................................165
Reset Flow ..................................................................................... 165
Docker Containers ................................................................................168
Docker Using SR-IOV .......................................................................... 168
Kubernetes Using SR-IOV..................................................................... 169
Kubernetes with Shared HCA................................................................ 169
Mediated Devices ............................................................................. 169
OVS Offload Using ASAP2 Direct ................................................................170
Overview ....................................................................................... 170
Installing ASAP2 Packages.................................................................... 171
Setting Up SR-IOV ............................................................................. 171
Configuring Open-vSwitch (OVS) Offload.................................................. 172
Appendix: Mellanox Firmware Tools ....................................................... 180
Fast Driver Unload ................................................................................181
Troubleshooting ............................................................................. 182
General Issues .....................................................................................182
Ethernet Related Issues ..........................................................................183
Performance Related Issues .....................................................................184
SR-IOV Related Issues ............................................................................185
Common Abbreviations and Related Documents ....................................... 186
User Manual Revision History.............................................................. 190
Release Notes Change Log History ....................................................... 191

Overview

NVIDIA offers a robust, full set of protocol software and drivers for Linux for use with the
ConnectX® EN family of cards. It is designed to provide high-performance support for Enhanced
Ethernet with fabric consolidation over TCP/IP-based LAN applications. The driver and software,
in conjunction with the industry-leading ConnectX family of cards, achieve full line rate,
full duplex performance of up to 100Gb/s per port.

Further information on this product can be found in the following MLNX_EN documents:
• Release Notes
• User Manual

Software Download

Please visit nvidia.com/en-us/networking → Products → Software → Ethernet Drivers → NVIDIA EN
for Linux

Document Revision History

For the list of changes made to this document, refer to User Manual Revision History.

Release Notes
These are the release notes for the MLNX_EN LTS (long term support) release for customers who
wish to utilize the following:
• ConnectX-3
• ConnectX-3 Pro

Release Notes Update History


Revision      Date           Description

4.9-5.1.0.0   June 30, 2022  Initial release of this document version.

Supported NICs Speeds


This document provides instructions on how to install the driver on NVIDIA ConnectX® network
adapter solutions supporting the following uplinks to servers.
Uplink/NICs                 Driver Name  Uplink Speed

ConnectX-3/ConnectX-3 Pro   mlx4         10GbE, 40GbE and 56GbE(1)
ConnectX-4                  mlx5         1GbE, 10GbE, 25GbE, 40GbE, 50GbE, 56GbE(1), and 100GbE
ConnectX-4 Lx               mlx5         1GbE, 10GbE, 25GbE, 40GbE, and 50GbE
ConnectX-5/ConnectX-5 Ex    mlx5         1GbE, 10GbE, 25GbE, 40GbE, 50GbE, and 100GbE
ConnectX-6                  mlx5         10GbE, 25GbE, 40GbE, 50GbE(2), 100GbE(2), and 200GbE(2)
ConnectX-6 Dx               mlx5         1GbE, 10GbE, 25GbE, 40GbE, 50GbE(1), 100GbE(1), and 200GbE(2)
Innova™ IPsec EN            mlx5         10GbE, 40GbE

1. 56GbE is an NVIDIA proprietary link speed and can be achieved when connecting an NVIDIA
adapter card to an NVIDIA SX10XX series switch, or when connecting an NVIDIA adapter card to
another NVIDIA adapter card.
2. Supports both NRZ and PAM4 modes.

Package Contents 
Package Revision Licenses

ar_mgr 1.0-0.2.MLNX20201014.g8577618.49510 Mellanox Confidential and Proprietary

dapl 2.1.10.1.mlnx-OFED.4.9.0.1.4.49510 Dual GPL/BSD/CPL

dump_pr 1.0-0.2.MLNX20201014.g8577618.49510 GPLv2 or BSD

fabric-collector 1.1.0.MLNX20170103.89bb2aa-0.1.49510 GPLv2 or BSD

gpio-mlxbf 1.0-0.g6d44a8a GPLv2

hcoll 4.4.2969-1.49510 Proprietary

i2c-mlx 1.0-0.g422740c GPLv2

ibacm 41mlnx1-OFED.4.3.3.0.0.49510 GPLv2 or BSD

ibdump 6.0.0-1.49510 BSD2+GPL2

ibsim 0.10-1.49510 GPLv2 or BSD

ibutils 1.5.7.1-0.12.gdcaeae2.49510 GPL/BSD

ibutils2 2.1.1-0.121.MLNX20200324.g061a520.49510 OpenIB.org BSD license

infiniband-diags 5.6.0.MLNX20200211.354e4b7-0.1.49510 GPLv2 or BSD

iser 4.9-OFED.4.9.5.0.7.1 GPLv2

isert 4.9-OFED.4.9.5.0.7.1 GPLv2

kernel-mft 4.15.1-9 Dual BSD/GPL

knem 1.1.4.90mlnx1-OFED.5.6.0.1.6.1 BSD and GPLv2

libibcm 41mlnx1-OFED.4.1.0.1.0.49510 GPL/BSD

libibmad 5.4.0.MLNX20190423.1d917ae-0.1.49510 GPLv2 or BSD


libibumad 43.1.1.MLNX20200211.078947f-0.1.49510 GPLv2 or BSD

libibverbs 41mlnx1-OFED.4.9.3.0.0.49510 GPLv2 or BSD

libmlx4 41mlnx1-OFED.4.7.3.0.3.49510 GPLv2 or BSD

libmlx5 41mlnx1-OFED.4.9.0.1.2.49510 GPLv2 or BSD

libpka 1.0-1.g6cc68a2.49510 BSD

librdmacm 41mlnx1-OFED.4.7.3.0.6.49510 GPLv2 or BSD

libvma 9.0.2-1 GPLv2

mlnx-en 4.9-5.1.0.0.g4ae6a22 GPLv2

mlnx-ethtool 5.4-1.49510 GPL

mlnx-iproute2 5.4.0-1.49510 GPL

mlnx-nfsrdma 4.9-OFED.4.9.5.0.7.1 GPLv2

mlnx-nvme 4.9-OFED.4.9.5.0.7.1 GPLv2

mlnx-ofa_kernel 4.9-OFED.4.9.5.1.0.1 GPLv2

mlxbf-livefish 1.0-0.gec08328 GPLv2

mlx-bootctl 1.3-0.g2aa74b7 GPLv2

mlx-l3cache 0.1-1.gebb0728 GPLv2

mlx-pmc 1.1-0.g1141c2e GPLv2

mlx-trio 0.1-1.g9d13513 GPLv2

mpi-selector 1.0.3-1.49510 BSD

mpitests 3.2.20-e1a0676.49510 BSD

mstflint 4.14.0-3.49510 GPL/BSD

multiperf 3.0-0.14.g5f0fd0e.49510 BSD 3-Clause, GPL v2 or later


multiperf 3.0.0.mlnxlibs-0.13.gcdaa426.49017.49417.49510 BSD 3-Clause, GPL v2 or later

mxm 3.7.3112-1.49510 Proprietary

nvme-snap 2.1.0-126.mlnx Proprietary

ofed-docs 4.9-OFED.4.9.5.1.0 GPL/BSD

ofed-scripts 4.9-OFED.4.9.5.1.0 GPL/BSD

openmpi 4.0.3rc4-1.49510 BSD

opensm 5.7.2.MLNX20201014.9378048-0.1.49510 GPLv2 or BSD

openvswitch 2.12.1-1.49510 ASL 2.0 and LGPLv2+ and SISSL

perftest 4.5-0.1.g23b8f9c.49510 BSD 3-Clause, GPL v2 or later

perftest 4.5.0.mlnxlibs-0.3.g1121951.49417.49510 BSD 3-Clause, GPL v2 or later

pka-mlxbf 1.0-0.g963f663 GPLv2

qperf 0.4.11-1.49510 BSD 3-Clause, GPL v2

rdma-core 50mlnx1-1.49510 GPLv2 or BSD

rshim 1.18-0.gb99e894 GPLv2

sharp 2.1.2.MLNX20200428.ddda184-1.49510 Proprietary

sockperf 3.7-0.gita1e8e835a689.49510 BSD

srp 4.9-OFED.4.9.5.0.7.1 GPLv2

srptools 41mlnx1-5.49510 GPL/BSD

tmfifo 1.5-0.g31e8a6e GPLv2

ucx 1.8.0-1.49510 BSD

Release Notes contain the following sections:

• General Support in MLNX_EN
• Changes and New Features
• Known Issues
• Bug Fixes

General Support in MLNX_EN

Supported Operating Systems


Operating System Platform Default Kernel Version

ALIOS7.2 AArch64
4.19.48-006.ali4000.alios7.aarch64
BCLINUX7.3 x86_64
3.10.0-514.el7.x86_64
BCLINUX7.4 x86_64
3.10.0-693.el7.x86_64
BCLINUX7.5 x86_64
3.10.0-862.el7.x86_64
BCLINUX7.6 x86_64
3.10.0-957.el7.x86_64
BCLINUX7.7 x86_64
3.10.0-1062.el7.bclinux.x86_64
BCLINUX8.1 x86_64
4.19.0-193.1.3.el8.bclinux.x86_64
Debian10.0 x86_64
4.19.0-5-amd64
AArch64
4.19.0-5-arm64
Debian8.11 x86_64
3.16.0-6-amd64
Debian8.9 x86_64
3.16.0-4-amd64
Debian9.11 x86_64
4.9.0-11-amd64
Debian9.6 x86_64
4.9.0-8-amd64
Debian9.9 x86_64
4.9.0-9-amd64
EulerOS2.0sp9 AArch64
4.19.90-vhulk2006.2.0.h171.eulerosv2r9.aarch64
x86_64
4.18.0-147.5.1.0.h269.eulerosv2r9.x86_64
Fedora30 x86_64
5.0.9-301.fc30.x86_64


Oracle Linux 6.10 x86_64


4.1.12-124.16.4.el6uek.x86_64
Oracle Linux 7.4 x86_64
4.1.12-94.3.9.el7uek.x86_64
Oracle Linux 7.7 x86_64
4.14.35-1902.3.2.el7uek.x86_64
Oracle Linux 7.8 x86_64
4.14.35-1902.300.11.el7uek.x86_64
Oracle Linux 7.9 x86_64
5.4.17-2011.6.2.el7uek.x86_64
Oracle Linux 8.0 x86_64
4.18.0-80.7.2.el8_0.x86_64
Oracle Linux 8.1 x86_64
4.18.0-147.el8.x86_64
Oracle Linux 8.2 x86_64
5.4.17-2011.1.2.el8uek.x86_64
Oracle Linux 8.3 x86_64
5.4.17-2011.7.4.el8uek.x86_64
RHEL/CentOS6.10 x86_64
2.6.32-754.el6.x86_64
RHEL/CentOS6.3 x86_64
2.6.32-279.el6.x86_64
RHEL/CentOS7.2 ppc64
3.10.0-327.el7.ppc64
ppc64le
3.10.0-327.el7.ppc64le
x86_64
3.10.0-327.el7.x86_64
RHEL/CentOS7.3 x86_64
3.10.0-514.el7.x86_64
RHEL/CentOS7.4 ppc64
3.10.0-693.el7.ppc64
ppc64le
3.10.0-693.el7.ppc64le
x86_64
3.10.0-693.el7.x86_64
RHEL/ AArch64
CentOS7.4alternate 4.11.0-44.el7a.aarch64
RHEL/CentOS7.5 ppc64
3.10.0-862.el7.ppc64
ppc64le
3.10.0-862.el7.ppc64le
x86_64
3.10.0-862.el7.x86_64


RHEL/ AArch64
CentOS7.5alternate 4.14.0-49.el7a.aarch64
RHEL/CentOS7.6 ppc64
3.10.0-957.el7.ppc64
ppc64le
3.10.0-957.el7.ppc64le
x86_64
3.10.0-957.el7.x86_64
RHEL/ AArch64
CentOS7.6alternate 4.14.0-115.el7a.aarch64
ppc64le
4.14.0-115.el7a.ppc64le
RHEL/CentOS7.7 ppc64
3.10.0-1062.el7.ppc64
ppc64le
3.10.0-1062.el7.ppc64le
x86_64
3.10.0-1062.el7.x86_64
RHEL/CentOS7.8 ppc64
3.10.0-1127.el7.ppc64
ppc64le
3.10.0-1127.el7.ppc64le
x86_64
3.10.0-1127.el7.x86_64
RHEL/CentOS7.9 ppc64
3.10.0-1160.el7.ppc64
ppc64le
3.10.0-1160.el7.ppc64le
x86_64
3.10.0-1160.el7.x86_64
RHEL/CentOS8.0 AArch64
4.18.0-80.el8.aarch64
ppc64le
4.18.0-80.el8.ppc64le
x86_64
4.18.0-80.el8.x86_64
RHEL/CentOS8.1 AArch64
4.18.0-147.el8.aarch64
ppc64le
4.18.0-147.el8.ppc64le
x86_64
4.18.0-147.el8.x86_64
RHEL/CentOS8.2 AArch64
4.18.0-193.el8.aarch64


ppc64le
4.18.0-193.el8.ppc64le
x86_64
4.18.0-193.el8.x86_64
RHEL/CentOS8.3 AArch64
4.18.0-240.el8.aarch64
ppc64le
4.18.0-240.el8.ppc64le
x86_64
4.18.0-240.el8.x86_64
RHEL/CentOS8.4 AArch64
4.18.0-305.el8.aarch64
ppc64le
4.18.0-305.el8.ppc64le
x86_64
4.18.0-305.el8.x86_64
RHEL/CentOS8.5 AArch64
4.18.0-348.el8.aarch64
ppc64le
4.18.0-348.el8.ppc64le
x86_64
4.18.0-348.el8.x86_64
RHEL/CentOS8.6 AArch64
4.18.0-372.9.1.el8.aarch64
ppc64le
4.18.0-372.9.1.el8.ppc64le
x86_64
4.18.0-372.9.1.el8.x86_64
SLES11SP3 x86_64
3.0.76-0.11-default
SLES11SP4 ppc64
3.0.101-63-ppc64
x86_64
3.0.101-63-default
SLES12SP2 x86_64
4.4.21-69-default
SLES12SP3 x86_64
4.4.73-5-default
ppc64le
4.4.73-5-default
SLES12SP4 x86_64
4.12.14-94.41-default
ppc64le
4.12.14-94.41-default


AArch64
4.12.14-94.41-default
SLES12SP5 x86_64
4.12.14-120-default
ppc64le
4.12.14-120-default
AArch64
4.12.14-120-default
SLES15SP0 x86_64
4.12.14-23-default
SLES15SP1 x86_64
4.12.14-195-default
ppc64le
4.12.14-195-default
AArch64
4.12.14-195-default
SLES15SP2 x86_64
5.3.18-22-default
ppc64le
5.3.18-22-default
AArch64
5.3.18-22-default
SLES15SP3 x86_64
5.3.18-57-default
ppc64le
5.3.18-57-default
AArch64
5.3.18-57-default
Ubuntu14.04 x86_64
3.13.0-27-generic
Ubuntu16.04 ppc64le
4.4.0-21-generic
x86_64
4.4.0-21-generic
Ubuntu18.04 x86_64
4.15.0-20-generic
ppc64le
4.15.0-20-generic
AArch64
4.15.0-20-generic
Ubuntu19.04 x86_64
5.0.0-13-generic
Ubuntu19.10 x86_64
5.3.0-19-generic


Ubuntu20.04 x86_64
5.4.0-26-generic
ppc64le
5.4.0-26-generic
AArch64
5.4.0-26-generic

Kernel 5.5 x86_64 5.5


32 bit platforms are no longer supported in MLNX_EN.


All OSs listed above are fully supported in Paravirtualized and SR-IOV Environments with
Linux KVM Hypervisor.
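The support matrix above can be checked against a running host. A minimal shell sketch (it only queries standard files and commands; no MLNX_EN-specific tooling is assumed):

```shell
# Print the OS name, version, architecture and kernel release so they can be
# compared against the Supported Operating Systems table above.
. /etc/os-release 2>/dev/null || true
echo "OS: ${NAME:-unknown} ${VERSION_ID:-unknown}"
echo "Arch: $(uname -m)"
echo "Kernel: $(uname -r)"
```

Compare the "Kernel:" line against the Default Kernel Version column for your platform.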

Supported Non-Linux Virtual Machines


The following are the supported non-Linux Virtual Machines in this current MLNX_EN version:
HCA Type         Windows Virtual Machine    WinOF Version      Protocol

ConnectX-3 Windows 2012 R2 DC MLNX_VPI 5.50 ETH

ConnectX-3 Pro Windows 2016 DC MLNX_VPI 5.50 ETH

ConnectX-4 Windows 2012 R2 DC MLNX_WinOF2 2.40 ETH

ConnectX-4 Lx Windows 2016 DC MLNX_WinOF2 2.40 ETH

Supported HCAs Firmware Versions


This MLNX_EN version provides long term support (LTS) for customers who wish to utilize
ConnectX-3, ConnectX-3 Pro and Connect-IB, as well as the RDMA experimental verbs library
(mlnx_lib). MLNX_OFED versions 5.1 and above do not support any of these adapter cards.

This MLNX_EN version supports the following Mellanox network adapter cards firmware versions:

NIC                           Recommended Firmware Rev.  Additional Firmware Rev. Supported

ConnectX®-3/ConnectX®-3 Pro   2.42.5000                  2.40.7000
ConnectX®-4                   12.28.2006                 12.27.4000
ConnectX®-4 Lx                14.28.2006                 14.27.1016
ConnectX®-5/ConnectX®-5 Ex    16.28.2006                 16.27.2008
ConnectX®-6                   20.28.2006                 20.27.2008
ConnectX®-6 Dx                22.28.2006                 N/A
Innova IPsec EN               16.28.2006                 16.27.2008

For the official firmware versions, please see:
http://www.mellanox.com/content/pages.php?pg=firmware_download

Changes and New Features

New Features
The following are the new features that have been added to this version of MLNX_EN.
Feature      Description

Bug Fixes    See the Bug Fixes section.

For additional information on the new features, please refer to MLNX_EN User Manual.

Customer Affecting Changes


Change                Description

Security Hardening    This release contains important reliability improvements and security
                      hardening enhancements. NVIDIA recommends upgrading your devices to this
                      release to improve security and reliability. (Reference number: 2900508)

Unsupported Functionalities/Features/HCAs

The following are the unsupported functionalities/features/HCAs in MLNX_EN:

• ConnectX®-2 Adapter Card
• Soft-RoCE

Known Issues
The following is a list of general limitations and known issues of the various components of this
Mellanox EN for Linux release.

For the list of old known issues, please refer to the Mellanox EN Archived Known Issues file at:
http://www.mellanox.com/pdf/prod_software/MLNX_EN_Archived_Known_Issues.pdf
Internal Issue
Reference
Number

2894838 Description: Running the 'ip link show' command over RHEL8.5 using ConnectX-3 with VFs
will print "Truncated VFs" to the screen.

Workaround: Use the following OFED ip link command: /opt/mellanox/iproute2/sbin/ip link show

Keywords: IP Link, Virtual Functions,  ConnectX-3

Discovered in Release: 4.9-4.1.7.0


2793596 Description: On SLES15 SP3, MFT restart does not work.
Workaround: Install MFT manually from https://www.mellanox.com/products/adapter-
software/firmware-tools.
Keywords: MFT

Discovered in Release: 4.9-4.0.8.0


2794326 Description: When upgrading MLNX_OFED from 4.9-4 to 5.4-2 GA using Yum installation,
the installation fails due to ibutils.

Workaround: Before the upgrade, remove ibutils manually (and the metapackage with it)
using the following command: yum remove ibutils
Keywords: Installation, ibutils

Discovered in Release: 4.9-4.0.8.0


2753944 Description: On rare occasions, when registering a device (ib_register_device()) and
loading modules in parallel (in this case ib_cm), a race condition may occur which stops
ib_cm from loading properly.

Workaround: Add modprobe.d rules to force the ib_cm driver to load before the mlx4_ib
and mlx5_ib drivers:
install mlx4_ib /sbin/modprobe ib_cm; /sbin/modprobe --ignore-install mlx4_ib $CMDLINE_OPTS
install mlx5_ib /sbin/modprobe ib_cm; /sbin/modprobe --ignore-install mlx5_ib $CMDLINE_OPTS
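A sketch of applying this workaround by generating a modprobe.d configuration file. The file name ib-cm-order.conf is illustrative, and the copy step to /etc/modprobe.d/ (as root) is left to the reader; modprobe.d `install` lines use the plain `install <module> <command>` form:

```shell
# Generate modprobe.d 'install' rules that load ib_cm before mlx4_ib/mlx5_ib.
# Written to the current directory here; copy to /etc/modprobe.d/ on the target.
cat > ib-cm-order.conf <<'EOF'
install mlx4_ib /sbin/modprobe ib_cm; /sbin/modprobe --ignore-install mlx4_ib $CMDLINE_OPTS
install mlx5_ib /sbin/modprobe ib_cm; /sbin/modprobe --ignore-install mlx5_ib $CMDLINE_OPTS
EOF
# Sanity check: both rules carry --ignore-install (prints 2)
grep -c -- '--ignore-install' ib-cm-order.conf
```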


Keywords: ib_core, Racing Condition

Discovered in Release: 4.9-4.0.8.0


2175959 Description: In rare cases where driver unload fails, the following message may appear:
"rmmod: ERROR: Module mlx_compat is in use by: mdev vfio_mdev ib_core ib_uverbs".

Workaround: Remove the mdev, vfio_mdev, ib_core and ib_uverbs modules before restarting the
driver.


Keywords: Installation

Discovered in Release: 4.9-0.1.7.0


2636998 Description: When using Debian or Ubuntu operating systems, installing MLNX_OFED with
mlnxofedinstall and then proceeding to upgrade with a package manager (apt) leaves the mlnx-
rdma-core-dkms package installed, and it fails to rebuild.

Workaround: Before the upgrade, remove mlnx-rdma-rxe-dkms: dpkg --purge mlnx-rdma-rxe-dkms

Keywords: Upgrade, Debian, Ubuntu, mlnx-rdma-core-dkms

Discovered in Release: 4.9-3.1.5.0


2635628 Description: openibd should be started manually after reboot on EulerOS2.0sp9 OS.

Workaround: Start openibd using "/etc/init.d/openibd start"


Keywords: openibd, EulerOS2.0sp9

Discovered in Release: 4.9-3.1.5.0


2338121 Description: UCX will not work while running with upstream-libs if librdmacm is not
installed.

Workaround: Install rdmacm or disable VMC (-x HCOLL_MCAST=^vmc).


Keywords: RDMA

Discovered in Release: 4.9-2.2.4.0


2440042 Description: Using ODP in this version may cause failures.

Workaround: To use ODP with ConnectX-4 and above, it is recommended to use MLNX_OFED version
5.2 and above.
Keywords: ar_mgr; dump_pr; upgrade; installation

Discovered in Release: 4.9-2.2.4.0


1550266
Description: XDP is not supported over ConnectX-3 and ConnectX-3 Pro adapter cards.
Workaround: N/A
Keywords: XDP, ConnectX-3

Discovered in Release: 4.9-0.1.7.0


2117822 Description: On ConnectX-3 and ConnectX-3 Pro adapter cards, no traffic runs between
VLANs of any type over VLAN of type ctag (protocol 802.1Q).
Workaround: N/A
Keywords: ConnectX-3, VLAN

Discovered in Release: 4.9-0.1.7.0


2142218
Description: On ConnectX-3 and ConnectX-3 Pro adapter cards, driver might hang when
found under the following conditions, collectively:
• OS kernel is older than 4.10
• Interface is down
• CONFIG_NET_RX_BUSY_POLL parameter is set
• netdev_ops.ndo_busy_poll is defined

Workaround: N/A
Keywords: ConnectX-3

Discovered in Release: 4.9-0.1.7.0


2156645
Description: MLNX_LIBS provider packages, such as libmlx5, cannot be installed
simultaneously with ibverbs-providers distribution package when working with Ubuntu and
Debian OSs.
Workaround: Before installing MLNX_OFED of type MLNX_LIBS, make sure that ibverbs-
providers package is not installed.
Keywords: MLNX_LIBS, libmlx5, ibverbs-providers, Debian, Ubuntu

Discovered in Release: 4.9-0.1.7.0


2162639 Description: MLNX_OFED includes several Python tools, such as mlnx_qos, which rely on
Python modules included in the same package. On Ubuntu 20.04 OS, these are installed
into a directory that is not in the Python module search path.

Workaround: Link the python module to a directory in the python modules search path by
running:

ln -s `dpkg -L mlnx-ofed-kernel-utils | grep 'site-.*py$'` /usr/lib/python3/dist-packages/

Keywords: mlnx_qos, Ubuntu

Discovered in Release: 4.9-0.1.7.0


2105447 Description: hns_roce warning messages will appear in the dmesg after reboot on Euler2
SP3 OSs.
Workaround: N/A
Keywords: hns_roce, dmesg, Euler

Discovered in Release: 4.9-0.1.7.0


2112251
Description: On kernels 4.10-4.14, when Geneve tunnel's remote endpoint is defined using
IPv6, packets larger than MTU are not fragmented, resulting in no traffic sent.
Workaround: Define geneve tunnel's remote endpoint using IPv4. 
Keywords: Kernel, Geneve, IPv4, IPv6, MTU, fragmentation

Discovered in Release: 4.9-0.1.7.0


2119210 Description: Multiple driver restarts may cause stress and result in an mlx5
commands-check error message in the log.
Workaround: N/A
Keywords: Driver restart, syndrome, error message

Discovered in Release: 4.9-0.1.7.0


2111349 Description: Ethtool --show-fec/--get-fec are not supported over ConnectX-6 and
ConnectX-6 Dx adapter cards.
Workaround: N/A
Keywords: Ethtool, ConnectX-6 Dx

Discovered in Release: 4.9-0.1.7.0


2119984 Description: IPsec crypto offloads do not work when ESN is enabled.
Workaround: N/A
Keywords: IPsec, ESN

Discovered in Release: 4.9-0.1.7.0


2102902 Description: A kernel panic may occur over RH8.0-4.18.0-80.el8.x86_64 OS when opening
kTLS offload connection due to a bug in kernel TLS stack.
Workaround: N/A
Keywords: TLS offload, mlx5e

Discovered in Release: 4.9-0.1.7.0


2111534 Description: A Kernel panic may occur over Ubuntu19.04-5.0.0-38-generic OS when opening
kTLS offload connection due to a bug in the Kernel TLS stack.
Workaround: N/A
Keywords: TLS offload, mlx5e


Discovered in Release: 4.9-0.1.7.0


2117845 Description: Relaxed ordering memory regions are not supported when working with CAPI.
Registering a memory region with relaxed ordering while CAPI is enabled will result in a
registration failure.
Workaround: N/A
Keywords: Relaxed ordering, memory region, MR, CAPI

Discovered in Release: 4.9-0.1.7.0


2083942 Description: The content of file /sys/class/net/<NETIF>/statistics/multicast may be out of
date and may display values lower than the real values.
Workaround: Run ethtool -S <NETIF> to show the actual multicast counters and to
update the content of file /sys/class/net/<NETIF>/statistics/multicast.
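The refresh behavior can be observed with a short sketch. NETIF defaults to lo here purely so the sketch runs anywhere; substitute your ConnectX interface, and note that ethtool may not expose multicast counters for every device:

```shell
# Read the sysfs multicast counter, then run ethtool -S, whose statistics pass
# also refreshes the (possibly stale) sysfs value.
NETIF=${NETIF:-lo}
sysfs_mcast=$(cat "/sys/class/net/$NETIF/statistics/multicast" 2>/dev/null || echo 0)
echo "sysfs multicast counter before refresh: $sysfs_mcast"
ethtool -S "$NETIF" 2>/dev/null | grep -i multicast || true
echo "sysfs multicast counter after refresh: $(cat "/sys/class/net/$NETIF/statistics/multicast" 2>/dev/null || echo 0)"
```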
Keywords: Multicast counters

Discovered in Release: 4.9-0.1.7.0


2035950
Description: An internal error might take place in the firmware when performing any of
the following in VF LAG mode, when at least one VF of either PF is still bound/attached to
a VM.
1. Removing PF from the bond (using ifdown, ip link or any other function)
2. Attempting to disable SR-IOV

Workaround: N/A
Keywords: VF LAG, binding, firmware, FW, PF, SR-IOV

Discovered in Release: 4.9-0.1.7.0


2094176 Description: When running in a large scale in VF-LAG mode, bandwidth may be unstable.
Workaround: N/A
Keywords: VF LAG

Discovered in Release: 4.9-0.1.7.0


2044544 Description: When working with OSs with Kernel v4.10, bonding module does not allow
setting MTUs larger than 1500 on a bonding interface.
Workaround: Upgrade your Kernel version to v4.11 or above.
Keywords: Bonding, MTU, Kernel

Discovered in Release: 4.9-0.1.7.0


1882932 Description: Libibverbs dependencies are removed during OFED installation, requiring
manual installation of libraries that OFED does not reinstall.
Workaround: Manually install missing packages.
Keywords: libibverbs, installation


Discovered in Release: 4.9-0.1.7.0


2093746 Description: Devlink health dumps are not supported on kernels lower than v5.3.
Workaround: N/A
Keywords: Devlink, health report, dump

Discovered in Release: 4.9-0.1.7.0


2020260 Description: When changing the Trust mode to DSCP, there is an interval between the
change taking effect in the hardware and updating the inline mode of the SQ in the driver.
If any traffic is transmitted during this interval, the driver will not inline enough headers,
resulting in a CQE error in the NIC. 

Workaround: Set the interface down, change the trust mode, then bring the interface back
up.

ip link set eth0 down
mlnx_qos -i eth0 --trust dscp
ip link set eth0 up

Keywords: DSCP, inline, SQ, CQE

Discovered in Release: 4.9-0.1.7.0


2083427 Description: For kernels with connection tracking support, neigh update events are not
supported, requiring users to have static ARPs to work with OVS and VxLAN.
Workaround: N/A
Keywords: VxLAN, VF LAG, neigh, ARP

Discovered in Release: 4.9-0.1.7.0


2043739
Description: Userspace RoCE UD QPs are not supported over distributions such as SLES11
SP4 and RedHat 6.10 for which the netlink 3 libraries (libnl-3 and libnl-route3) are not
available.
Workaround: N/A
Keywords: RoCE UD, QP, SLES, RedHat, RHEL, netlink

Discovered in Release: 4.9-0.1.7.0


- Description: The argparse module is installed by default in Python versions >=2.7 and
>=3.2. If an older Python version is used, the argparse module is not installed by
default.
Workaround: Install the argparse module manually.
Keywords: Python, MFT, argparse, installation
Discovered in Release: 4.7-3.2.9.0


1975293 Description: Installing MLNX_EN with --with-openvswitch flag requires manual removal
of the existing Open vSwitch.
Workaround: N/A
Keywords: OVS, Open vSwitch, openvswitch
Discovered in Release: 4.7-3.2.9.0
2001966
Description: When a bond is created over VF netdevices in SwitchDev mode, the VF
netdevice will be treated as a representor netdevice. This will cause the mlx5_core driver to
crash if it receives netdevice events related to the bond device.
Workaround: Do not create bond over VF netdevices in SwitchDev mode.
Keywords: PF, VF, SwitchDev, netdevice, bonding
Discovered in Release: 4.7-3.2.9.0
1979834 Description: When running MLNX_EN on Kernel 4.10 with ConnectX-3/ConnectX-3 Pro NICs,
deleting VxLAN may result in a crash.
Workaround: Upgrade the Kernel version to v4.14 to avoid the crash.
Keywords: Kernel, OS, ConnectX-3, VxLAN
Discovered in Release: 4.7-3.2.9.0
1997230 Description: Running mlxfwreset or unloading the mlx5_core module while conntrack
flows are offloaded may cause a call trace in the kernel.
Workaround: Stop the OVS service before calling mlxfwreset or unloading the mlx5_core module.
Keywords: Conntrack, ASAP, OVS, mlxfwreset, unload
Discovered in Release: 4.7-3.2.9.0
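The workaround above might be scripted as follows (a sketch; the OVS service name and the mst device path are examples that vary per system):

```shell
# Stop OVS before resetting firmware, then bring it back up.
systemctl stop openvswitch-switch
mlxfwreset -d /dev/mst/mt4119_pciconf0 reset -y
systemctl start openvswitch-switch
```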
1955352 Description: Moving 2 ports to SwitchDev mode in parallel is not supported.
Workaround: N/A
Keywords: ASAP, SwitchDev
Discovered in Release: 4.7-3.2.9.0
1973238
Description: ib_core unload may fail on Ubuntu 18.04.2 OS with the following error
message:

"Module ib_core is in use"

Workaround: Stop ibacm.socket using the following commands:

systemctl stop ibacm.socket

systemctl disable ibacm.socket

Keywords: ib_core, Ubuntu, ibacm


Discovered in Release: 4.7-3.2.9.0
1979958 Description: VxLAN IPv6 offload is not supported over CentOS/RHEL v7.2 OSs.

Workaround: N/A
Keywords: Tunnel, VXLAN, ASAP, IPv6
Discovered in Release: 4.7-3.2.9.0
1980884 Description: Setting VF VLAN, state and spoofchk using ip link tool is not supported in
SwitchDev mode.
Workaround: N/A
Keywords: ASAP, ip tool, VF, SwitchDev
Discovered in Release: 4.7-3.2.9.0
1991710 Description: PRIO_TAG_REQUIRED_EN configuration is not supported and may cause call
trace.
Workaround: N/A
Keywords: ASAP, PRIO_TAG, mstconfig
Discovered in Release: 4.7-3.2.9.0
1970429
Description: With HW offloading in SR-IOV SwitchDev mode, the fragmented ICMP echo
request/reply packets (with length larger than MTU) do not function properly. The correct
behavior is for the fragments to miss the offloading flow and go to the slow path.
However, the current behavior is as follows.
• Ingress (to the VM): All echo request fragments miss the corresponding offloading
flow, but all echo reply fragments hit the corresponding offloading flow
• Egress (from the VM): The first fragment still hits the corresponding offloading
flow, and the subsequent fragments miss the corresponding offloading flow

Workaround: N/A
Keywords: HW offloading, SR-IOV, SwitchDev, ICMP, VM, virtualization
Discovered in Release: 4.7-3.2.9.0
1939719 Description: Running openibd restart after the installation of MLNX_EN on SLES12 SP5
and SLES15 SP1 OSs with the latest Kernel (v4.12.14) will result in an error that the
modules do not belong to that Kernel. This is because the modules installed by
MLNX_EN are incompatible with the new Kernel's modules.

Workaround: Rebuild the Kernel modules by rerunning the ./install installation


command using the following additional options:

--add-kernel-support --skip-repo

Keywords: SLES, operating system, OS, installation, Kernel, module


Discovered in Release: 4.7-3.2.9.0
1969580 Description: RHEL 6.10 OS is not supported in SR-IOV mode.
Workaround: N/A
Keywords: RHEL, RedHat, OS, operating system, SR-IOV, virtualization
Discovered in Release: 4.7-3.2.9.0

1919335 Description: On SLES 11 SP4, RedHat 6.9 and 6.10 OSs, on hosts where OpenSM is running,
the low-level driver's internal error reset flow will cause a kernel crash when OpenSM is
killed (after the reset occurs). This is due to a bug in these kernels where opening the
umad device (by OpenSM) does not take a reference count on the underlying device.
Workaround: Run OpenSM on a host with a more recent Kernel.
Keywords: SLES, RedHat, CR-Dump, OpenSM
Discovered in Release: 4.7-3.2.9.0
1916029 Description: When the firmware response time to commands becomes very long, some
commands fail upon timeout. The driver may then trigger a timeout completion on the
wrong entry, leading to a NULL pointer call trace.
Workaround: N/A
Keywords: Firmware, timeout, NULL
Discovered in Release: 4.7-3.2.9.0
1967866 Description: Enabling ECMP offload requires the VFs to be unbound and VMs to be shut
down.
Workaround: N/A

Keywords: ECMP, Multipath, ASAP2


Discovered in Release: 4.7-3.2.9.0
1821235 Description: When using the mlx5dv_dr API for flow creation, for flows which execute the
"encapsulation" action or "push vlan" action, metadata C registers will be reset to zero.
Workaround: Use both actions at the end of the flow process.
Keywords: Flow steering
Discovered in Release: 4.7-1.0.0.1
1916029 Description: When the firmware response time to commands becomes very long, some
commands fail upon timeout. The driver may then trigger a timeout completion on the
wrong entry, leading to a NULL pointer call trace.
Workaround: N/A
Keywords: Firmware, timeout, NULL
Discovered in Release: 4.7-1.0.0.1
1921981 Description: On Ubuntu, Debian and RedHat 8 and above OSs, parsing the mfa2 file using
mstarchive might result in a segmentation fault.
Workaround: Use mlxarchive to parse the mfa2 file instead. 
Keywords: MFT, mfa2, mstarchive, mlxarchive, Ubuntu, Debian, RedHat, operating system
Discovered in Release: 4.7-1.0.0.1
1921799
Description: MLNX_EN installation over SLES15 SP1 ARM OSs fails unless the
--add-kernel-support flag is added to the installation command.

Workaround: N/A
Keywords: SLES, installation

Discovered in Release: 4.7-1.0.0.1


1840288
Description: MLNX_EN does not support XDP features on RedHat 7 OS, despite the declared
support by RedHat.
Workaround: N/A
Keywords: XDP, RedHat
Discovered in Release: 4.7-1.0.0.1
1888574
Description: Kernel support limitations in the current MLNX_EN version:
• SwitchDev is only supported on Kernel 4.14 and above, and on RedHat/CentOS 7.4,
7.5 and 7.6.
• SR-IOV is only supported on Kernel 4.3 and above, and on RedHat/CentOS 7.4, 7.5,
7.6 and 7.7.

Workaround: N/A
Keywords: SwitchDev, ASAP, Kernel , SR-IOV, RedHat
Discovered in Release: 4.7-1.0.0.1
1892663 Description: mlnx_tune script does not support python3 interpreter.
Workaround: Run mlnx_tune with python2 interpreter only.
Keywords: mlnx_tune, python3, python2
Discovered in Release: 4.7-1.0.0.1
1759593 Description: MLNX_EN installation on XenServer OSs requires using the -u flag.
Workaround: N/A
Keywords: Installation, XenServer, OS, operating system
Discovered in Release: 4.6-1.0.1.1
1753629
Description: A bonding bug found in Kernels 4.12 and 4.13 may cause a slave to become
permanently stuck in BOND_LINK_FAIL state. As a result, the following message may
appear in dmesg:

bond: link status down for interface eth1, disabling it in 100 ms

Workaround: N/A
Keywords: Bonding, slave
Discovered in Release: 4.6-1.0.1.1
1758983
Description: Installing MLNX_EN on RHEL 7.6 (x86_64 platform) and RHEL 7.6 ALT (PPCLE
platform) OSs using YUM is not supported.
Workaround: Install these OSs using the install script.
Keywords: RHEL, RedHat, YUM, OS, operating system
Discovered in Release: 4.6-1.0.1.1

1734102 Description: Ubuntu v16.04.05 OS can only be used with Kernels of version
4.4.0-143 or below.
Workaround: N/A
Keywords: Ubuntu, Kernel, OS
Discovered in Release: 4.6-1.0.1.1
1712068 Description: Uninstalling MLNX_EN automatically results in the uninstallation of several
libraries that are included in the MLNX_EN package, such as InfiniBand-related libraries.
Workaround: If these libraries are required, reinstall them using the local package
manager (yum/dnf).
Keywords: MLNX_EN libraries
Discovered in Release: 4.6-1.0.1.1
- Description: Due to changes in libraries, MFT v4.11.0 and below are not forward
compatible with MLNX_EN v4.6-1.0.0.0 and above.

Therefore, with MLNX_EN v4.6-1.0.0.0 and above, it is recommended to use MFT v4.12.0
and above.
Workaround: N/A
Keywords: MFT compatible
Discovered in Release: 4.6-1.0.1.1
1730840 Description: On ConnectX-4 HCAs, GID index for RoCE v2 is inconsistent when toggling
between enabled and disabled interface modes.
Workaround: N/A
Keywords: RoCE v2, GID
Discovered in Release: 4.6-1.0.1.1
1731005 Description: MLNX_EN v4.6 YUM and Zypper installations fail on RHEL8.0, SLES15.0 and
PPCLE OSs.
Workaround: N/A
Keywords: YUM, Zypper, installation, RHEL, RedHat, SLES, PPCLE
Discovered in Release: 4.6-1.0.1.1
1717428 Description: On kernels 4.10-4.14, MTUs larger than 1500 cannot be set for a GRE interface
with any driver (IPv4 or IPv6).
Workaround: Upgrade your kernel to any version higher than v4.14.
Keywords: Fedora 27, gretap, ip_gre, ip_tunnel, ip6_gre, ip6_tunnel
Discovered in Release: 4.6-1.0.1.1
1748343 Description: Driver reload takes several minutes when a large number of VFs exists.
Workaround: N/A
Keywords: VF, SR-IOV
Discovered in Release: 4.6-1.0.1.1

1748537 Description: Cannot set max Tx rate for VFs from the ARM.
Workaround: N/A
Keywords: Host control, max Tx rate
Discovered in Release: 4.6-1.0.1.1
1732940 Description: Software counters do not work for representor net devices.
Workaround: N/A
Keywords: mlx5, counters, representors
Discovered in Release: 4.6-1.0.1.1
1733974 Description: Running heavy traffic (such as 'ping flood') while bringing up and down other
mlx5 interfaces may result in “INFO: rcu_preempt detected stalls on CPUs/tasks:”
call traces.
Workaround: N/A
Keywords: mlx5
Discovered in Release: 4.6-1.0.1.1
1731939 Description: Getting/setting the Forward Error Correction (FEC) configuration is not
supported on ConnectX-6 HCAs with a 200Gbps speed rate.
Workaround: N/A
Keywords: Forward Error Correction, FEC, 200Gbps
Discovered in Release: 4.6-1.0.1.1
1715789 Description: Mellanox Firmware Tools (MFT) package is missing from Ubuntu v18.04.2 OS.
Workaround: Manually install MFT.
Keywords: MFT, Ubuntu, operating system
Discovered in Release: 4.6-1.0.1.1
1652864 Description: On ConnectX-3 and ConnectX-3 Pro HCAs, CR-Dump poll is not supported using
sysfs commands.
Workaround: If supported in your Kernel, use the devlink tool as an alternative to sysfs to
achieve CR-Dump support.
Keywords: mlx4, devlink, CR-Dump
Discovered in Release: 4.6-1.0.1.1
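A possible shape for the devlink-based alternative mentioned above (a sketch; the BDF, region name, and snapshot ID are examples, and availability depends on kernel and driver support):

```shell
# List the regions the device exposes, take a snapshot, then dump it.
devlink region show
devlink region new pci/0000:82:00.0/cr-space snapshot 1
devlink region dump pci/0000:82:00.0/cr-space snapshot 1
```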
- Description: On ConnectX-6 HCAs and above, an attempt to configure advertisement (any
bitmap) will result in advertising all capabilities.
Workaround: N/A
Keywords: 200Gbps, advertisement, Ethtool
Discovered in Release: 4.6-1.0.1.1

1581631 Description: GID entries referenced to by a certain user application cannot be deleted
while that user application is running.

Workaround: N/A

Keywords: RoCE, GID

Discovered in Release: 4.5-1.0.1.0

1403313 Description: Attempting to allocate an excessive number of VFs per PF in operating
systems with kernel versions below v4.15 might fail due to a known issue in the Kernel.

Workaround: Make sure to update the Kernel version to v4.15 or above.

Keywords: VF, PF, IOMMU, Kernel, OS

Discovered in Release: 4.5-1.0.1.0

1521877 Description: On SLES 12 SP1 OSs, a kernel tracepoint issue may cause undefined behavior
when inserting a kernel module with a wrong parameter.

Workaround: N/A

Keywords: mlx5 driver, SLES 12 SP1

Discovered in Release: 4.5-1.0.1.0

1504073 Description: When using ConnectX-5 with LRO over PPC systems, the HCA might experience
back pressure due to delayed PCI Write operations. In this case, bandwidth might drop
from line-rate to ~35Gb/s. Packet loss or pause frames might also be observed.

Workaround: Look for an indication of PCI back pressure (the “outbound_pci_stalled_wr”
counter in ethtool advancing). Disabling LRO helps reduce the back pressure and its
effects.

Keywords: Flow Control, LRO

Discovered in Release: 4.4-1.0.0.0
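The check and mitigation above might look as follows (a sketch; the interface name is an example):

```shell
# See whether the PCI back-pressure counter is advancing.
ethtool -S ens1f0 | grep outbound_pci_stalled_wr
# If it advances, disable LRO to relieve the back pressure.
ethtool -K ens1f0 lro off
```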

1424233 Description: On RHEL v7.3, 7.4 and 7.5 OSs, setting IPv4-IP-forwarding will turn off LRO on
existing interfaces. Turning LRO back on manually using ethtool and adding a VLAN
interface may cause a warning call trace.

Workaround: Make sure IPv4-IP-forwarding and LRO are not turned on at the same time.

Keywords: IPv4 forwarding, LRO

Discovered in Release: 4.4-1.0.1.0

1431282 Description: Software reset may result in an order inversion of interface names.

Workaround: Restart the driver to re-set the order.

Keywords: Software reset

Discovered in Release: 4.4-1.0.1.0

1442507 Description: Retpoline support in GCC causes an increase in CPU utilization, which
results in a 15% performance drop in IP forwarding.

Workaround: N/A

Keywords: Retpoline, GCC, CPU, IP forwarding, Spectre attack

Discovered in Release: 4.4-1.0.1.0

1425129 Description: MLNX_EN cannot be installed on SLES 15 OSs using Zypper repository.

Workaround: Install MLNX_EN using the standard installation script instead of Zypper
repository.

Keywords: Installation, SLES, Zypper

Discovered in Release: 4.4-1.0.1.0

1241056 Description: When working with ConnectX-4/ConnectX-5 HCAs on PPC systems with
Hardware LRO and Adaptive Rx support, bandwidth drops from full wire speed (FWS) to
~60Gb/s.

Workaround: Make sure to disable Adaptive Rx when enabling Hardware LRO: ethtool -C


<interface> adaptive-rx off

ethtool -C <interface> rx-usecs 8 rx-frames 128

Keywords: Hardware LRO, Adaptive Rx, PPC

Discovered in Release: 4.3-1.0.1.0

1090612 Description: NVMEoF protocol does not support LBA format with non-zero metadata size.
Therefore, NVMe namespace configured to LBA format with metadata size bigger than 0
will cause Enhanced Error Handling (EEH) in PowerPC systems.

Workaround: Configure the NVMe namespace to use LBA format with zero sized metadata.

Keywords: NVMEoF, PowerPC, EEH

Discovered in Release: 4.3-1.0.1.0

1309621 Description: In switchdev mode default configuration, stateless offloads/steering based on
inner headers is not supported.

Workaround: To enable stateless offloads/steering based on inner headers, disable encap by
running:

devlink dev eswitch set pci/0000:83:00.1 encap disable

Or, in case devlink is not supported by the kernel, run:

echo none > /sys/kernel/debug/mlx5/<BDF>/compat/encap

Note: This is a hardware-related limitation.

Keywords: switchdev, stateless offload, steering

Discovered in Release: 4.3-1.0.1.0

1275082 Description: When setting a non-default IPv6 link local address or an address that is not
based on the device MAC, connection establishments over RoCEv2 might fail.

Workaround: N/A

Keywords: IPV6, RoCE, link local address

Discovered in Release: 4.3-1.0.1.0

1307336 Description: In RoCE LAG mode, when running ibdev2netdev -v , the port state of the
second port of the mlx4_0 IB device will read “NA” since this IB device does not have a
second port.

Workaround: N/A

Keywords: mlx4, RoCE LAG, ibdev2netdev, bonding

Discovered in Release: 4.3-1.0.1.0

1296355 Description: The total number of MSI-X vectors that can be allocated for VFs and PFs is
limited to 2300 on Power9 platforms.

Workaround: N/A

Keywords: MSI-X, VF, PF, PPC, SR-IOV

Discovered in Release: 4.3-1.0.1.0

1294934 Description: Firmware reset might cause Enhanced Error Handling (EEH) on Power7
platforms.

Workaround: N/A

Keywords: EEH, PPC

Discovered in Release: 4.3-1.0.1.0

1259293 Description: On Fedora 20 operating systems, driver load fails with an error message such
as: “ [185.262460] kmem_cache_sanity_check (fs_ftes_0000:00:06.0): Cache name already
exists. ”

This is caused by SLUB allocators grouping multiple slab kmem_cache_create calls into one
slab cache alias to save memory and increase cache hotness. As a result, the slab name is
considered stale.

Workaround: Upgrade the kernel version to kernel-3.19.8-100.fc20.x86_64.

Note that after rebooting to the new kernel, you will need to rebuild 
MLNX_EN against the new kernel version.

Keywords: Fedora, driver load

Discovered in Release: 4.3-1.0.1.0

1264359 Description: When running perftest (ib_send_bw, ib_write_bw, etc.) in rdma-cm mode, the
resp_cqe_error counter under /sys/class/infiniband/mlx5_0/ports/1/hw_counters/
resp_cqe_error might increase. This behavior is expected and it is a result of receive WQEs
that were not consumed.

Workaround: N/A

Keywords: perftest, RDMA CM, mlx5

Discovered in Release: 4.3-1.0.1.0

1264956 Description: Configuring SR-IOV after disabling RoCE LAG using sysfs (/sys/bus/pci/drivers/
mlx5_core/<bdf>/roce_lag_enable) might result in RoCE LAG being enabled again in case
SR-IOV configuration fails.

Workaround: Make sure to disable RoCE LAG once again.

Keywords: RoCE LAG, SR-IOV

Discovered in Release: 4.3-1.0.1.0

1263043 Description: On RHEL7.4, due to an OS issue introduced in kmod package version
20-15.el7_4.6, parsing the depmod configuration files will fail, resulting in either of the
following issues:
• Driver restart failure prompting an error message, such as: “ ERROR: Module
mlx5_core belong to kernel which is not a part of MLNX_EN, skipping... ”
• nvmet_rdma kernel module dysfunction, despite installing MLNX_EN using the
"--with-nvmf" option. An error message, such as: “nvmet_rdma: unknown parameter
'offload_mem_start' ignored” will be seen in dmesg output

Workaround: Go to RedHat webpage to upgrade the kmod package version.

Keywords: driver restart, kmod, kmp, nvmf, nvmet_rdma

Discovered in Release: 4.2-1.2.0.0

- Description: Packet Size (Actual Packet MTU) limitation for IPsec offload on Innova IPsec
adapter cards: The current offload implementation does not support IP fragmentation. The
original packet size should be such that it does not exceed the interface's MTU size after
the ESP transformation (encryption of the original IP packet which increases its length)
and the headers (outer IP header) are added:
• Inner IP packet size <= I/F MTU - ESP additions (20) - outer_IP (20) - fragmentation
issue reserved length (56)
• Inner IP packet size <= I/F MTU - 96

This mostly affects forwarded traffic into smaller MTU, as well as UDP traffic. TCP does
PMTU discovery by default and clamps the MSS accordingly.

Workaround: N/A

Keywords: Innova IPsec, MTU

Discovered in Release: 4.2-1.0.1.0
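The sizing rule above can be computed directly. A small sketch, using the overheads listed in the entry (the 1500-byte MTU is an example value):

```shell
# Largest inner IP packet that avoids fragmentation after the ESP
# transformation: inner <= I/F MTU - 20 (ESP) - 20 (outer IP) - 56 (reserved).
IF_MTU=1500
ESP_ADD=20; OUTER_IP=20; FRAG_RESERVED=56
MAX_INNER=$((IF_MTU - ESP_ADD - OUTER_IP - FRAG_RESERVED))
echo "$MAX_INNER"   # 1404 for a 1500-byte MTU
```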

- Description: No LLC/SNAP support on Innova IPsec adapter cards.

Workaround: N/A

Keywords: Innova IPsec, LLC/SNAP

Discovered in Release: 4.2-1.0.1.0

- Description: No support for FEC on Innova IPsec adapter cards. When using switches, there
may be a need to change its configuration.

Workaround: N/A

Keywords: Innova IPsec, FEC

Discovered in Release: 4.2-1.0.1.0

955929 Description: Heavy traffic may cause SYN flooding when using Innova IPsec adapter cards.

Workaround: N/A

Keywords: Innova IPsec, SYN flooding

Discovered in Release: 4.2-1.0.1.0

- Description: Priority Based Flow Control is not supported on Innova IPsec adapter cards.

Workaround: N/A

Keywords: Innova IPsec, Priority Based Flow Control

Discovered in Release: 4.2-1.0.1.0

- Description: Pause configuration is not supported when using Innova IPsec adapter cards.
Default pause is global pause (enabled).

Workaround: N/A

Keywords: Innova IPsec, Global pause

Discovered in Release: 4.2-1.0.1.0

1045097 Description: Connecting and disconnecting a cable several times may cause a link up
failure when using Innova IPsec adapter cards.

Workaround: N/A

Keywords: Innova IPsec, Cable, link up

Discovered in Release: 4.2-1.0.1.0

- Description: On Innova IPsec adapter cards, supported MTU is between 512 and 2012
bytes. Setting MTU values outside this range might fail or might cause traffic loss.

Workaround: Set MTU between 512 and 2012 bytes.

Keywords: Innova IPsec, MTU

Discovered in Release: 4.2-1.0.1.0

1118530 Description: On kernel versions 4.10-4.13, when resetting sriov_numvfs to 0 on PowerPC
systems, the following dmesg warning will appear:

mlx5_core <BDF>: can't update enabled VF BAR0

Workaround: Reboot the system to reset sriov_numvfs value.

Keywords: SR-IOV, numvfs

Discovered in Release: 4.2-1.0.1.0

1125184 Description: In old kernel versions, such as Ubuntu 14.04 and RedHat 7.1, VXLAN interface
does not reply to ARP requests for a MAC address that exists in its own ARP table. This
issue was fixed in the following newer kernel versions: Ubuntu 16.04 and RedHat 7.3.

Workaround: N/A

Keywords: ARP, VXLAN

Discovered in Release: 4.2-1.0.1.0

1134323 Description: When using kernel versions older than version 4.7 with IOMMU enabled,
performance degradations and logical issues (such as soft lockup) might occur upon high
load of traffic. This is caused due to the fact that IOMMU IOVA allocations are centralized,
requiring many synchronization operations and high locking overhead amongst CPUs.

Workaround: Use kernel v4.7 or above, or a backported kernel that includes the following
patches:
• 2aac630429d9 iommu/vt-d: change intel-iommu to use IOVA frame numbers
• 9257b4a206fc iommu/iova: introduce per-cpu caching to iova allocation
• 22e2f9fa63b0 iommu/vt-d: Use per-cpu IOVA caching

Keywords: IOMMU, soft lockup

Discovered in Release: 4.2-1.0.1.0

1135738 Description: On 64k page size setups, DMA memory might run out when trying to increase
the ring size/number of channels.

Workaround: Reduce the ring size/number of channels.

Keywords: DMA, 64K page

Discovered in Release: 4.2-1.0.1.0

1159650 Description: When configuring VF VST, VLAN-tagged outgoing packets will be dropped in
case of ConnectX-4 HCAs. In case of ConnectX-5 HCAs, VLAN-tagged outgoing packets will
have another VLAN tag inserted.

Workaround: N/A

Keywords: VST

Discovered in Release: 4.2-1.0.1.0

1157770 Description: On Passthrough/VM machines with relatively old QEMU and libvirtd,

CMD timeout might occur upon driver load.

After timeout, no other commands will be completed and all driver operations will be
stuck.

Workaround: Upgrade the QEMU and libvirtd on the KVM server.

The following versions were tested (on Ubuntu 16.10):

• libvirt 2.1.0
• QEMU 2.6.1

Keywords: QEMU

Discovered in Release: 4.2-1.0.1.0

1147703 Description: Using dm-multipath for High Availability on top of NVMEoF block devices must
be done with “directio” path checker.

Workaround: N/A

Keywords: NVMEoF

Discovered in Release: 4.2-1.0.1.0

1152408 Description: RedHat v7.3 PPCLE and v7.4 PPCLE operating systems do not support KVM
qemu out of the box. The following error message will appear when attempting to
run virt-install to create new VMs:

Cant find qemu-kvm packge to install

Workaround: Acquire the following rpms from the beta version of 7.4ALT to 7.3/7.4 PPCLE
(in the same order):
• qemu-img-.el7a.ppc64le.rpm
• qemu-kvm-common-.el7a.ppc64le.rpm
• qemu-kvm-.el7a.ppc64le.rpm

Keywords: Virtualization, PPC, Power8, KVM, RedHat, PPC64LE

Discovered in Release: 4.2-1.0.1.0

1012719 Description: A soft lockup in the CQ polling flow might occur when running very high stress
on the GSI QP (RDMA-CM applications). This is a transient situation from which the driver
will later recover.

Workaround: N/A

Keywords: RDMA-CM, GSI QP, CQ

Discovered in Release: 4.2-1.0.1.0

1078630 Description: When working in RoCE LAG over kernel v3.10, a kernel crash might occur
when unloading the driver as the Network Manager is running.

Workaround: Stop the Network Manager before unloading the driver and start it back once
the driver unload is complete.

Keywords: RoCE LAG, network manager

Discovered in Release: 4.2-1.0.1.0

1149557 Description: When setting VGT+, the maximal number of allowed VLAN IDs presented in
the sysfs is 813 (up to the first 813).

Workaround: N/A

Keywords: VGT+

Discovered in Release: 4.2-1.0.1.0

965591 Description: Lustre support is limited to versions 2.9 and 2.10.

Workaround: N/A

Keywords: Lustre

Discovered in Release: 4.1-1.0.2.0

995665/1165919 Description: In kernels below v4.13, connection between NVMEoF host and target cannot
be established in a hyper-threaded system with more than 1 socket.

Workaround: On the host side, connect to the NVMEoF subsystem using the --nr-io-queues
<num_queues> flag.

Note that num_queues must be lower than or equal to num_sockets multiplied by
num_cores_per_socket.

Keywords: NVMEoF
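The workaround above might be applied as follows (a sketch; the target address, port, and subsystem NQN are examples):

```shell
# Derive an upper bound for --nr-io-queues from the CPU topology,
# then connect with that queue count.
SOCKETS=$(lscpu | awk -F: '/^Socket\(s\)/{print $2+0}')
CORES=$(lscpu | awk -F: '/^Core\(s\) per socket/{print $2+0}')
MAX_QUEUES=$((SOCKETS * CORES))
nvme connect -t rdma -a 192.168.1.10 -s 4420 -n testsubsystem \
    --nr-io-queues "$MAX_QUEUES"
```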

1039346 Description: Enabling multiple namespaces per subsystem while using NVMEoF target
offload is not supported.

Workaround: To enable more than one namespace, create a subsystem for each one.

Keywords: NVMEoF Target Offload, namespace
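One subsystem per namespace can be created through the nvmet configfs tree. A sketch (subsystem names and backing devices are examples):

```shell
# Create one nvmet subsystem per namespace instead of multiple
# namespaces under a single subsystem.
for i in 1 2; do
  mkdir -p /sys/kernel/config/nvmet/subsystems/testsubsys$i/namespaces/1
  echo -n /dev/nvme0n$i > \
    /sys/kernel/config/nvmet/subsystems/testsubsys$i/namespaces/1/device_path
  echo 1 > /sys/kernel/config/nvmet/subsystems/testsubsys$i/namespaces/1/enable
done
```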

1030301 Description: Creating virtual functions on a device that is in LAG mode will destroy the
LAG configuration. The bonding device over the Ethernet NICs will continue to work as
expected.

Workaround: N/A

Keywords: LAG, SR-IOV

1047616 Description: When node GUID of a device is set to zero (0000:0000:0000:0000), RDMA_CM
user space application may crash.

Workaround: Set node GUID to a nonzero value.

Keywords: RDMA_CM
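The workaround above might be applied with the ip tool (a sketch; the PF name, VF index, and GUID value are examples):

```shell
# Assign a nonzero node GUID to VF 0 before starting RDMA_CM applications.
ip link set dev ens1f0 vf 0 node_guid 00:11:22:33:44:55:66:77
```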

1051701 Description: New versions of iproute which support new kernel features may misbehave on
old kernels that do not support these new features.

Workaround: N/A

Keywords: iproute

1007830 Description: When working on Xenserver hypervisor with SR-IOV enabled on it, make sure
the following instructions are applied:
1. Right after enabling SR-IOV, unbind all driver instances of the virtual functions from
their PCI slots.
2. It is not allowed to unbind PF driver instance while having active VFs.

Workaround: N/A

Keywords: SR-IOV

1005786 Description: When using ConnectX-5 adapter cards, the following error might be printed to
dmesg, indicating temporary lack of DMA pages:

“mlx5_core ... give_pages:289:(pid x): Y pages alloc time exceeded the max permitted
duration

mlx5_core ... page_notify_fail:263:(pid x): Page allocation failure notification on
func_id(z) sent to fw

mlx5_core ... pages_work_handler:471:(pid x): give fail -12”

Example: This might happen when trying to open more than 64 VFs per port.

Workaround: N/A

Keywords: mlx5_core, DMA

1008066/1009004 Description: Performing some operations on the user end during reboot might
cause call trace/panic, due to bugs found in the Linux kernel.

For example: Running get_vf_stats (via iptool) during reboot.

Workaround: N/A

Keywords: mlx5_core, reboot

1009488 Description: Mounting MLNX_EN to a path that contains special characters, such as
parenthesis or spaces is not supported. For example, when mounting MLNX_EN to “/
media/CDROM(vcd)/”, installation will fail and the following error message will be
displayed:

# cd /media/CDROM\(vcd\)/

# ./install

sh: 1: Syntax error: "(" unexpected

Workaround: N/A

Keywords: Installation

982144 Description: When the offload traffic sniffer is on, the bandwidth could decrease by up to 50%.

Workaround: N/A

Keywords: Offload Traffic Sniffer

982534 Description: In ConnectX-3, when using a server with page size of 64K, the UAR BAR will
become too small. This may cause one of the following issues:
1. mlx4_core driver does not load.
2. The mlx4_core driver does load, but calls to ibv_open_device may return ENOMEM
errors.

Workaround:
1. Add the following parameter in the firmware's ini file under [HCA] section:
log2_uar_bar_megabytes = 7
2. Re-burn the firmware with the new ini file.

Keywords: PPC

981362 Description: On several OSs, setting the number of TCs is not supported via the tc tool.

Workaround: Set the number of TCs via the /sys/class/net/<interface>/qos/tc_num sysfs
file.

Keywords: Ethernet, TC
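The sysfs workaround above might look as follows (a sketch; the interface name and TC count are examples):

```shell
# Set the number of traffic classes through sysfs instead of tc, then verify.
echo 8 > /sys/class/net/ens1f0/qos/tc_num
cat /sys/class/net/ens1f0/qos/tc_num
```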

979457 Description: When setting IOMMU=ON, a severe performance degradation may occur due to
a bug in IOMMU.

Workaround: Make sure the following patches are found in your kernel:
• iommu/vt-d: Fix PASID table allocation
• iommu/vt-d: Fix IOMMU lookup for SR-IOV Virtual Functions

Note: These patches are already available in Ubuntu 16.04.02 and 17.04 OSs.

Keywords: Performance, IOMMU


Bug Fixes
This table lists the bugs fixed in this release.
For the list of old bug fixes, please refer to MLNX_EN Archived Bug Fixes file at: http://
www.mellanox.com/pdf/prod_software/MLNX_EN_Archived_Bug_Fixes.pdf
Internal Reference Number    Description

2944030 Description: An issue with the Udev script caused non-NVIDIA devices to be renamed.
Keywords: ASAP2, Udev, Naming
Fixed in Release: 4.9-5.1.0.0
3037901 Description: RDMA traffic may fail due to incorrect tracking of outstanding work
requests.
Keywords: RDMA

Discovered in Release: 4.9-0.1.7.0


Fixed in Release: 4.9-5.1.0.0
2976200 Description: On the passive side, when the RDMACM DisconnectReq event arrives, if
the current state is MRA_REP_RCVD, the MAD needs to be canceled before entering the
DREQ_RCVD and TIMEWAIT states; otherwise destroy_id may block the request until
this MAD reaches its timeout.


Keywords: RDMACM, MRA, destroy_id

Discovered in Release: 4.9-0.1.7.0


Fixed in Release: 4.9-5.1.0.0
2753944 Description: On rare occasions, registering a device (ib_register_device()) while
loading modules in parallel (in this case, ib_cm) may cause a race condition which
would stop ib_cm from loading properly.
Keywords: RDMA, ib_core, Race Condition
Fixed in Release: 4.9-5.1.0.0
2802401 Description: Removing the nvmet port from configfs caused a use-after-free
condition.
Keywords: nvmet-rdma Module

Discovered in Release: 4.9-0.1.7.0

Fixed in Release: 4.9-4.1.7.0


2162639 Description: MLNX_EN includes several Python tools, such as mlnx_qos, which rely
on Python modules included in the same package. On Ubuntu 20.04 OS, those modules
are installed into a directory that is not in the Python module search path.
Keywords: mlnx_qos, Ubuntu

Discovered in Release: 4.9-0.1.7.0

Fixed in Release: 4.9-4.0.8.0
2635628 Description: openibd does not load
automatically after reboot on EulerOS 2.0 SP9 OS.
Keywords: openibd, EulerOS 2.0 SP9

Discovered in Release: 4.9-3.1.5.0

Fixed in Release: 4.9-4.0.8.0
2748862 Description: add-kernel-support flag was not
supported on Oracle Linux 7.9 causing an
installation issue.
Keywords: Installation, Oracle Linux 7.9

Discovered in Release: 4.9-0.1.7.0

Fixed in Release: 4.9-4.0.8.0
2396956 Description: Fixed an issue where a device under massive load may hit IOMMU
allocation failures. For more information, see the "RX Page Cache Size Limit" section in
the User Manual.
Keywords: Legacy libibverbs

Discovered in Release: 4.9-2.2.4.0

Fixed in Release: 4.9-3.1.5.0
2434638 Description: Fixed an issue where
"ibv_devinfo -v" command did not print some
of the MEM_WINDOW capabilities, even
though they were supported.
Keywords: Legacy libibverbs

Discovered in Release: 4.9-2.2.4.0

Fixed in Release: 4.9-3.1.5.0
2192791 Description: Fixed the issue where packages
neohost-backend and neohost-sdk were not
properly removed by the uninstallation
procedure and may have required manual
removal before re-installing or upgrading the
MLNX_OFED driver.
Keywords: NEO-Host, SDK

Discovered in Release: 4.9-0.1.7.0

Fixed in Release: 4.9-2.2.4.0


2226715 Description: Fixed an issue where bringing up


PF interface failed when using SR-IOV and
configuring RoCE mode for v2 only.
Keywords: PF, SR-IOV, RoCE v2

Discovered in Release: 4.9-0.1.7.0

Fixed in Release: 4.9-2.2.4.0


2119017 Description: Fixed the issue where injecting
EEH may cause extra Kernel prints, such as:
“EEH: Might be infinite loop in mlx5_core
driver”.
Keywords: EEH, kernel
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2076546 Description: Fixed the issue where in RPM-
based OSs with non-default kernels, using
repositories after re-creating the installer
(using --add-kernel-support) would result in
improper installation of the drivers.
Keywords: Installation, OS
Discovered in Release: 4.7-1.0.0.1

Fixed in Release: 4.9-0.1.7.0


2114957 Description: Fixed the issue where
MLNX_OFED installation may have depended
on python2 package even when attempting to
install it on OSs whose default package is
python3.
Keywords: Installation, python
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2143258 Description: Fixed a typo in the perftest package where help messages wrongly displayed the conversion result between Gb/s and MB/s (20^2 instead of 2^20).
Keywords: perftest
Discovered in Release: 4.7-3.2.9.0


Fixed in Release: 4.9-0.1.7.0


2094216 Description: Fixed the issue where, when one of the LAG slaves went down, LAG deactivation failed, ultimately causing bandwidth degradation.
Keywords: RoCE LAG
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2133778 Description: The mlx5 driver maintains a subdirectory for every open Ethernet port in /sys/kernel/debug/. For the default network namespace, the subdirectory name is the name of the interface, e.g. "eth8". The new convention for network interfaces moved to non-default network namespaces is the interface's name followed by "@" and the port's PCI ID. For example: "eth8@0000:af:00.3".
Keywords: Namespace
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2122684 Description: Fixed the issue where OFED
uninstallation resulted in the removal of
dependency packages, such as qemu-system-*
(qemu-system-x86).
Keywords: Uninstallation, dependency, qemu-
system-x86
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2135476 Description: Added KMP ability to install
MLNX_OFED Kernel modules on SLES12 SP5
and SLES15 kernel maintenance updates.
Keywords: KMP, SLES, kernel
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2149577 Description: Fixed the issue where the openibd script load used to fail when the esp6_offload module did not load successfully.
Keywords: openibd, esp6_offload
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2163879 Description: Added a dependency of the mpi-selectors package on the perl-Getopt-Long system package. On minimal installs of RPM-based OSs, installing mpi-selectors will now also install the required perl-Getopt-Long system package.
Keywords: Dependency, perl-Getopt-Long
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2107532 Description: Fixed the issue where in certain
rare scenarios, due to Rx page not being
replenished, the same page fragment
mistakenly became assigned to two different
Rx descriptors.
Keywords: Memory corruption, Rx page
recycle
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2116233 Description: Fixed an issue where ucx-kmem
was missing after OFED installation.
Keywords: ucx-kmem, installation
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2107776 Description: Fixed a driver load issue with
Errata-kernel on SLES15 SP1.
Keywords: Load, SLES, Errata
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2105536 Description: Fixed an issue in the Hairpin feature which prevented adding hairpin flows using the TC tool.
Keywords: Hairpin, TC
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2090321 Description: Fixed the issue where WQ queue
flushing was not handled properly in the
event of EEH.
Keywords: WQ, EEH
Discovered in Release: 4.7-1.0.0.1

Fixed in Release: 4.9-0.1.7.0


2076311 Description: Fixed a rare kernel crash
scenario when exiting an application that uses
RMPP mads intensively.
Keywords: MAD RMPP
Discovered in Release: 4.0-1.0.1.0

Fixed in Release: 4.9-0.1.7.0


2096998 Description: Fixed the issue where NEO-Host
could not be installed from the MLNX_OFED
package when working on Ubuntu and Debian
OSs. 
Keywords: NEO-Host, Ubuntu, Debian
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2057076 Description: Fixed the issue where installing
MLNX_OFED using --add-kernel-support
option did not work over RHEL 8 OSs.
Keywords: --add-kernel-support, installation,
RHEL
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2090186 Description: Fixed a possible kernel crash scenario when an AER/slot reset is done in parallel to user-space command execution.
Keywords: mlx5_core, AER, slot reset
Discovered in Release: 4.3-1.0.1.0

Fixed in Release: 4.9-0.1.7.0


2093410 Description: Added missing ECN configuration
under sysfs for PFs in SwitchDev mode.
Keywords: sysfs, ASAP, SwitchDev, ECN
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


1916029 Description: Fixed the issue where, when firmware response time to commands became very long, some commands failed upon timeout. The driver may then have triggered a timeout completion on the wrong entry, leading to a NULL pointer call trace.
Keywords: Firmware, timeout, NULL
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2036394 Description: Added driver support for kernels with the old XDP_REDIRECT infrastructure that uses the .ndo_xdp_flush and .ndo_xdp_xmit netdev operations.
Keywords: XDP_REDIRECT, Soft lockup
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2072871 Description: Fixed an issue where the usage
of --excludedocs Open MPI RPM option
resulted in the removal of non-documentation
related files. 
Keywords: --excludedocs, Open MPI, RPM
Discovered in Release: 4.5-1.0.1.0

Fixed in Release: 4.9-0.1.7.0


2072884 Description: Removed all cases of automated loading of MLNX_OFED kernel modules outside of openibd, to preserve the startup process of previous MLNX_OFED versions. Such loads conflict with openibd, which has its own logic to overcome issues such as the inbox driver loading instead of MLNX_OFED, or a module loading with a wrong parameter value. They might also load modules while openibd is trying to unload the driver stack.
Keywords: Installation, openibd
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2052037 Description: Disabled automated loading of
some modules through udev triggers to
preserve the startup process of previous
MLNX_OFED versions.
Keywords: Installation, udev
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2022634 Description: Fixed a typo in the packages
build command line which could cause the
installation of MLNX_OFED on SLES OSs to fail
when using the option --without-depcheck.
Keywords: Installation, SLES
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2022619 Description: Fixed the issue where
uninstallation of MLNX_OFED would hang due
to a bug in the package dependency check.
Keywords: Uninstallation, dependency
Discovered in Release: 4.7-1.0.0.1

Fixed in Release: 4.9-0.1.7.0


2047221 Description: The reference count (refcount) for the RDMA connection ID (cm_id) was not incremented in the rdma_resolve_addr() function, resulting in a cm_id use-after-free access. A fix was applied to increment the cm_id refcount.
Keywords: rdma_resolve_addr(), cm_id
Discovered in Release: 4.6-1.0.1.1

Fixed in Release: 4.9-0.1.7.0


2045181 Description: Fixed a race condition which
caused kernel panic when moving two ports
to SwitchDev mode at the same time.
Keywords: ASAP, SwitchDev, race
Discovered in Release: 4.7-1.0.0.1

Fixed in Release: 4.9-0.1.7.0


2004488 Description: Allowed accessing sysfs hardware
counters in SwitchDev mode.
Keywords: ASAP, hardware counters, sysfs,
SwitchDev
Discovered in Release: 4.7-1.0.0.1

Fixed in Release: 4.9-0.1.7.0


2030943 Description: Function smp_processor_id() is
called in the RX page recycle flow to
determine the core to run on. This is intended
to run in NAPI context. However, due to a bug
in backporting, the RX page recycle was
mistakenly called also in the RQ close flow
when not needed.
Keywords: Rx page recycle, smp_processor_id
Discovered in Release: 4.6-1.0.1.1

Fixed in Release: 4.9-0.1.7.0


2074487 Description: Fixed an issue where port link
state was automatically changed (without
admin state involvement) to "UP" after
reboot.
Keywords: Link state, UP
Discovered in Release: 4.7-1.0.0.1


Fixed in Release: 4.9-0.1.7.0


2022618 Description: Fixed a hang with ConnectX-3
adapter cards when running over SLES 11 OS.
Keywords: OS, SLES, ConnectX-3
Discovered in Release: 4.6-1.0.1.1

Fixed in Release: 4.9-0.1.7.0


2064711 Description: Fixed an issue where RDMA CM
connection failed when port space was small.
Keywords: RDMA CM
Discovered in Release: 4.7-1.0.0.1

Fixed in Release: 4.9-0.1.7.0


2076424 Description: Traffic mirroring with OVS offload and non-offload over a VxLAN interface is now supported.
Note: For kernel 4.9, make sure to use a dedicated OVS version.
Keywords: VxLAN, OVS
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


1828321 Description: Fixed the issue where, when working with VF LAG while the bond device is in active-active mode, running fwreset would result in unequal traffic on both PFs, and the PFs would not reach line rate.
Keywords: VF LAG, bonding, PF
Discovered in Release: 4.6-1.0.1.1

Fixed in Release: 4.9-0.1.7.0


1975293 Description: Installing OFED with --with-
openvswitch flag no longer requires manual
removal of the existing Open vSwitch.
Keywords: OVS, Open vSwitch, openvswitch
Discovered in Release: 4.7-3.2.9.0


Fixed in Release: 4.9-0.1.7.0


1939719 Description: Fixed an issue where running openibd restart after installing MLNX_OFED on SLES12 SP5 and SLES15 SP1 OSs with the latest kernel (v4.12.14) resulted in an error that the modules did not belong to that kernel. This was because the modules installed by MLNX_OFED were incompatible with the new kernel's modules.
Keywords: SLES, operating system, OS,
installation, Kernel, module
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


2001966 Description: Fixed an issue where, when a bond was created over VF netdevices in SwitchDev mode, the VF netdevice would be treated as a representor netdevice. This caused the mlx5_core driver to crash when it received netdevice events related to the bond device.
Keywords: PF, VF, SwitchDev, netdevice,
bonding
Discovered in Release: 4.7-3.2.9.0

Fixed in Release: 4.9-0.1.7.0


1816629 Description: Fixed an issue where following a
bad affinity occurrence in VF LAG mode,
traffic was sent after the port went up/down
in the switch.
Keywords: Traffic, VF LAG
Discovered in Release: 4.6-1.0.1.1

Fixed in Release: 4.9-0.1.7.0


1718531 Description: Added support for VLAN header
rewrite on CentOS 7.2 OS.
Keywords: VLAN, ASAP, switchdev, CentOS 7.2
Discovered in Release: 4.6-1.0.1.1

Fixed in Release: 4.9-0.1.7.0


1556337 Description: Fixed the issue where adding a VxLAN decapsulation rule with enc_tos and enc_ttl failed.
Keywords: VxLAN, decapsulation
Discovered in Release: 4.7-1.0.0.1

Fixed in Release: 4.9-0.1.7.0


1973828 Description: Fixed the wrong EEPROM length for small form factor (SFF) 8472 cables, changing it from 256 to 512 bytes.
Keywords: EEPROM, SFF
Discovered in Release: 4.7-1.0.0.1
Fixed in Release: 4.7-3.2.9.0
1915553 Description: Fixed the issue where the errno field was not set in all error flows of the ibv_reg_mr API.
Keywords: ibv_reg_mr
Discovered in Release: 4.6-1.0.1.1
Fixed in Release: 4.7-3.2.9.0
1970901 Description: Fixed the issue where mlx5 IRQ
name did not change to express the state of
the interface.
Keywords: Ethernet, PCIe, IRQ
Discovered in Release: 4.7-1.0.0.1
Fixed in Release: 4.7-3.2.9.0
1915587 Description: Udaddy application is now
functional in Legacy mode.
Keywords: Udaddy, MLNX_EN legacy, RDMA-CM
Discovered in Release: 4.7-1.0.0.1
Fixed in Release: 4.7-3.2.9.0
1931421 Description: Added support for E-Switch (SR-
IOV Legacy) mode in RHEL 7.7 OSs.
Keywords: E-Switch, SR-IOV, RHEL, RedHat
Discovered in Release: 4.7-1.0.0.1
Fixed in Release: 4.7-3.2.9.0


1945411/1839353 Description: Fixed the issue where, when XDP_REDIRECT failed, pages got double-freed due to a bug in the refcnt_bias feature.
Keywords: XDP, XDP_REDIRECT, refcnt_bias
Discovered in Release: 4.6-1.0.1.1
Fixed in Release: 4.7-3.2.9.0
1976482 Description: Added support for enabling
SwitchDev mode in MLNX_EN.
Keywords: SwitchDev
Discovered in Release: 4.7-1.0.0.1
Fixed in Release: 4.7-3.2.9.0
1758983 Description: Installing MLNX_EN on RHEL 7.6 OSs (x86_64 platform) and RHEL 7.6 ALT OSs (PPC64LE platform) using YUM is now supported.
Keywords: RHEL, RedHat, YUM, OS, operating
system
Discovered in Release: 4.6-1.0.1.1
Fixed in Release: 4.7-1.0.0.1
1800525 Description: When configuring the Time-stamping feature, CQE compression is disabled. This fix entails the removal of a warning message that appeared upon attempting to disable CQE compression when it had already been disabled.
Keywords: Time-stamping, CQE compression
Discovered in Release: 4.6-1.0.1.1
Fixed in Release: 4.7-1.0.0.1
1817636 Description: Fixed the issue of when disabling
one port on the Server side, VF-LAG Tx
Affinity would not work on the Client side.
Keywords: VF-LAG, Tx Affinity
Discovered in Release: 4.6-1.0.1.1
Fixed in Release: 4.7-1.0.0.1
1843020 Description: Fixed an issue where a server reboot may have resulted in a system crash.
Keywords: reboot, crash
Discovered in Release: 4.2-1.2.0.0
Fixed in Release: 4.7-1.0.0.1


1811973 Description: VF mirroring offload is now supported.
Keywords: ASAP2, VF mirroring


Discovered in Release: 4.6-1.0.1.1
Fixed in Release: 4.7-1.0.0.1
1841634 Description: The number of guaranteed
counters per VF is now calculated based on
the number of ports mapped to that VF. This
allows more VFs to have counters allocated.
Keywords: Counters, VF
Discovered in Release: 4.4-1.0.0.0
Fixed in Release: 4.7-1.0.0.1
1523548 Description: Fixed the issue where RDMA
connection persisted even after dropping the
network interface.
Keywords: Network interface, RDMA
Discovered in Release: 4.4-1.0.0.0
Fixed in Release: 4.6-1.0.1.1
1712870 Description: Fixed the issue where small packets with non-zero padding were wrongly reported as "checksum complete", even though the padding was not covered by the csum calculation. Such packets are now reported as "checksum unnecessary".
In addition, an ethtool private flag has been introduced to control the "checksum complete" feature: ethtool --set-priv-flags eth1 rx_no_csum_complete on/off

Keywords: csum error, checksum, mlx5_core


Discovered in Release: 4.5-1.0.1.0
Fixed in Release: 4.6-1.0.1.1
1648597 Description: Fixed the wrong wording in the FW Tracer ownership startup message (from "FW Tracer Owner" to "FWTracer: Ownership granted and active").
Keywords: FW Tracer
Discovered in Release: 4.5-1.0.1.0
Fixed in Release: 4.6-1.0.1.1


1581631 Description: Fixed the issue where GID entries referenced by a certain user application could not be deleted while that user application was running.
Keywords: RoCE, GID
Discovered in Release: 4.5-1.0.1.0
Fixed in Release: 4.6-1.0.1.1
1403313 Description: Fixed the issue where attempting to allocate an excessive number of VFs per PF failed in operating systems with kernel versions below v4.15, due to a known issue in the kernel.
Keywords: VF, PF, IOMMU, Kernel, OS
Discovered in Release: 4.5-1.0.1.0
Fixed in Release: 4.6-1.0.1.1

1368390 Description: Fixed the issue where MLNX_EN could not be installed on RHEL 7.x ALT OSs using the YUM repository.
Keywords: Installation, YUM, RHEL
Discovered in Release: 4.3-3.0.2.1
Fixed in Release: 4.6-1.0.1.1

1531817 Description: Fixed an issue where, when the number of channels configured was less than the number of CPUs available, some of the CPUs would not be used by Tx queues.

Keywords: Performance, Tx, CPU

Discovered in Release: 4.4-1.0.1.0

Fixed in Release: 4.5-1.0.1.0

1400381 Description: Fixed the issue where, on SLES 11 SP3 PPC64 OSs, a memory allocation issue might prevent the interface from loading after reboot, resulting in a call trace in the message log.

Keywords: SLES11 SP3


Discovered in Release: 4.4-1.0.1.0

Fixed in Release: 4.5-1.0.1.0

1498931 Description: Fixed the issue where establishing a TCP connection took too long due to a failure of the SA PathRecord query callback handler.

Keywords: TCP, SA PathRecord

Discovered in Release: 4.4-1.0.1.0

Fixed in Release: 4.5-1.0.1.0

1514096 Description: Fixed the issue where lack of high-order allocations caused driver load failure. All high-order allocations are now changed to order-0 allocations.

Keywords: mlx5, high order allocation

Discovered in Release: 4.4-2.0.0.1

Fixed in Release: 4.5-1.0.1.0

1524932 Description: Fixed a backport issue on some OSs, such as RHEL 7.x, where the mlx5 driver would support the old command "ip link set DEVICE vf NUM rate TXRATE" instead of the new command "ip link set DEVICE vf NUM max_tx_rate TXRATE min_tx_rate TXRATE".

Keywords: mlx5 driver

Discovered in Release: 4.4-2.0.0.1

Fixed in Release: 4.5-1.0.1.0

1498585 Description: Fixed the issue where, when performing configuration changes, mlx5e counter values were reset.


Keywords: Ethernet counters

Discovered in Release: 4.4-2.0.0.1

Fixed in Release: 4.5-1.0.1.0

1484603 Description: Fixed the issue where, when using the ibv_exp_cqe_ts_to_ns verb to convert a packet's hardware timestamp to UTC time in nanoseconds, the result may appear backwards compared to the converted time of a previous packet.

Keywords: libibverbs

Discovered in Release: 4.4-1.0.1.0

Fixed in Release: 4.5-1.0.1.0

1425027 Description: Fixed the issue where attempting to establish a RoCE connection on the default GID or on an IPv6 link-local address might have failed when two or more netdevices that belong to HCA ports were slaves under a bonding master.
This might also have resulted in the following error message in the kernel log: "__ib_cache_gid_add: unable to add gid fe80:0000:0000:0000:f652:14ff:fe46:7391 error=-28".

Keywords: RoCE, bonding

Discovered in Release: 4.4-1.0.1.0

Fixed in Release: 4.5-1.0.1.0

1412468 Description: Added support for multi-host connection on mstflint's mstfwreset.
Keywords: mstfwreset, mstflint, MFT, multi-host


Discovered in Release: 4.3-1.0.1.0

Fixed in Release: 4.4-1.0.1.0

1423319 Description: Removed the following prints on server shutdown:
mlx5_core 0005:81:00.1: mlx5_enter_error_state:96:(pid1): start
mlx5_core 0005:81:00.1: mlx5_enter_error_state:109:(pid1): end

Keywords: mlx5, fast shutdown

Discovered in Release: 4.3-1.0.1.0

Fixed in Release: 4.4-1.0.1.0

1318251 Description: Fixed the issue where, when bringing mlx4/mlx5 devices up or down, a call trace in nvme_rdma_remove_one or nvmet_rdma_remove_one may occur.

Keywords: NVMEoF, mlx4, mlx5, call trace

Discovered in Release: 4.3-1.0.1.0

Fixed in Release: 4.4-1.0.1.0

1181815 Description: Fixed an issue where 4K UD packets were dropped when working with 4K MTU on mlx4 devices.

Keywords: mlx4, 4K MTU, UD

Discovered in Release: 4.2-1.2.0.0

Fixed in Release: 4.3-1.0.1.0

1247458 Description: Added support for VLAN Tag (VST) creation on RedHat 7.4 with new iproute2 packages (ip tool).


Keywords: SR-IOV, VST, RedHat

Discovered in Release: 4.2-1.2.0.0

Fixed in Release: 4.3-1.0.1.0

1229554 Description: Enabled RDMA CM to honor incoming requests coming from ports of different devices.

Keywords: RDMA CM

Discovered in Release: 4.2-1.0.0.0

Fixed in Release: 4.3-1.0.1.0

1262257 Description: Fixed an issue where sending Work Requests (WRs) with multiple entries, where the first entry is less than 18 bytes, used to fail.

Keywords: ConnectX-5; libibverbs; Raw QP

Discovered in Release: 4.2-1.2.0.0

Fixed in Release: 4.3-1.0.1.0

1249358/1261023 Description: Fixed the issue where, when the interface was down, ethtool counters ceased to increase. As a result, RoCE traffic counters were not always counted.

Keywords: Ethtool counters, mlx5

Discovered in Release: 4.2-1.2.0.0

Fixed in Release: 4.3-1.0.1.0

1244509 Description: Fixed compilation errors of MLNX_EN over kernels where the CONFIG_PTP_1588_CLOCK parameter was not set.


Keywords: PTP, mlx5e

Discovered in Release: 4.2-1.2.0.0

Fixed in Release: 4.3-1.0.1.0

1266802 Description: Fixed an issue where the system used to hang when trying to allocate multiple device memory buffers from different processes simultaneously.

Keywords: Device memory programming

Discovered in Release: 4.2-1.0.0.0

Fixed in Release: 4.3-1.0.1.0

1078887 Description: Fixed an issue where the post_list and CQ_mod features in perftest did not function when running with the --run_infinitely flag.

Keywords: perftest, --run_infinitely

Discovered in Release: 4.2-1.0.1.0

Fixed in Release: 4.2-1.2.0.0

1186260 Description: Fixed the issue where CNP counters exposed under /sys/class/infiniband/mlx5_bond_0/ports/1/hw_counters/ did not aggregate both physical functions when working in RoCE LAG mode.
Keywords: RoCE, LAG, ECN, Congestion Counters

Discovered in Release: 4.2-1.0.1.0

Fixed in Release: 4.2-1.2.0.0


1192374 Description: Fixed a wrong calculation of the max_device_ctx capability in ConnectX-4 and ConnectX-5 HCAs.
Keywords: ibv_exp_query_device, max_device_ctx, mlx5

Discovered in Release: 4.2-1.0.1.0

Fixed in Release: 4.2-1.2.0.0

1084791 Description: Fixed the issue where occasionally, after reboot, rpm commands used to fail and create a core file, with messages such as "Bus error (core dumped)", causing the openibd service to fail to start.

Keywords: rpm, openibd

Discovered in Release: 3.4-2.0.0.0

Fixed in Release: 4.2-1.0.1.0

960642/960653 Description: Added support for min_tx_rate and max_tx_rate limits per virtual function on ConnectX-5 and ConnectX-5 Ex adapter cards.

Keywords: SR-IOV, mlx5

Discovered in Release: 4.0-1.0.1.0

Fixed in Release: 4.2-1.0.1.0

866072/869183 Description: Fixed the issue where RoCE v2 multicast traffic using RDMA-CM with an IPv4 address was not received.

Keywords: RoCE

Discovered in Release: 3.4-1.0.0.0


Fixed in Release: 4.2-1.0.1.0

1163835 Description: Fixed an issue where the ethtool -P output was 00:00:00:00:00:00 when using old kernels.
Keywords: ethtool, permanent MAC address, mlx4, mlx5

Discovered in Release: 4.0-2.0.0.1

Fixed in Release: 4.2-1.0.1.0

1067158 Description: Replaced a few "GPL only" legacy libibverbs functions with upstream implementations that conform with the libibverbs GPL/BSD dual license model.

Keywords: libibverbs, license

Discovered in Release: 4.1-1.0.2.0

Fixed in Release: 4.2-1.0.1.0

1119377 Description: Fixed an issue where an ACCESS_REG command failure used to appear in dmesg upon RoCE multi-host driver restart. The error message looked as follows:
mlx5_core 0000:01:00.0: mlx5_cmd_check:705:(pid 20037): ACCESS_REG(0x805) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x15c356)

Keywords: RoCE, multihost, mlx5

Discovered in Release: 4.1-1.0.2.0

Fixed in Release: 4.2-1.0.1.0


1122937 Description: Fixed an issue where concurrent client requests got corrupted when working in persistent server mode, due to a race condition on the server side.

Keywords: librdmacm, rping

Discovered in Release: 4.1-1.0.2.0

Fixed in Release: 4.2-1.0.1.0

1102158 Description: Fixed an issue where the client side did not exit gracefully in RTT mode when the server side was not reachable.

Keywords: librdmacm, rping

Discovered in Release: 4.1-1.0.2.0

Fixed in Release: 4.2-1.0.1.0

1038933 Description: Fixed a backport issue where IPv6 procedures were called while they were not supported in the underlying kernel.

Keywords: iw_cm

Discovered in Release: 4.0-2.0.0.1

Fixed in Release: 4.1-1.0.2.0

1064722 Description: Added log debug prints when changing HW configuration via DCB. To enable log debug prints, run: ethtool -s <devname> msglvl hw on/off

Keywords: DCB, msglvl

Discovered in Release: 4.0-2.0.0.1

Fixed in Release: 4.1-1.0.2.0


1022251 Description: Fixed an SKB memory leak issue that was introduced in kernel 4.11, and added warning messages to the Soft RoCE driver for easy detection of future SKB leaks.

Keywords: Soft RoCE

Discovered in Release: 4.0-2.0.0.1

Fixed in Release: 4.1-1.0.2.0

1044546 Description: Fixed the issue where a kernel crash used to occur when an RXe device was coupled with a virtual (dummy) device.

Keywords: Soft RoCE

Discovered in Release: 4.0-2.0.0.1

Fixed in Release: 4.1-1.0.2.0

1047617 Description: Fixed the issue where a race condition in the RoCE GID cache used to cause the loss of IP-based GIDs.

Keywords: RoCE, GID

Discovered in Release: 4.0-2.0.0.1

Fixed in Release: 4.1-1.0.2.0

1006768 Description: Fixed the issue where an rdma_cm connection between a client and a server residing on the same host was not possible when working over VLAN interfaces.

Keywords: RDMACM

Discovered in Release: 4.0-2.0.0.1

Fixed in Release: 4.1-1.0.2.0


801807 Description: Fixed an issue where an RDMACM connection used to fail upon a high connection rate, accompanied by the error message RDMA_CM_EVENT_UNREACHABLE.

Keywords: RDMACM

Discovered in Release: 3.0-2.0.1

Fixed in Release: 4.1-1.0.2.0

869768 Description: Fixed the issue where SR-IOV was not supported in systems with a page size greater than 16KB.

Keywords: SR-IOV, mlx5, PPC

Discovered in Release: 4.0-2.0.0.1

Fixed in Release: 4.1-1.0.2.0

919545 Description: Fixed the issue where the kernel running out of memory upon driver start could cause a crash on SLES 12 SP2.
Keywords: mlx5 Eth driver

Discovered in Release: 3.4-2.0.0.0

Fixed in Release: 4.0-2.0.0.1

864063 Description: Fixed the issue where spoof-check may have been turned on for MAC address 00:00:00:00:00:00.

Keywords: mlx4

Discovered in Release: 3.4-1.0.0.0

Fixed in Release: 4.0-2.0.0.1


869209 Description: Fixed an issue that caused TCP packets to be received out of order when Large Receive Offload (LRO) is on.

Keywords: mlx5_en

Discovered in Release: 3.3-1.0.0.0

Fixed in Release: 4.0-2.0.0.1

890285 Description: Fixed the issue where memory allocation for CQ buffers used to fail when increasing the RX ring size.

Keywords: mlx5_core

Discovered in Release: 3.4-1.0.0.0

Fixed in Release: 4.0-1.0.1.0

867094 Description: Fixed the issue where MLNX_EN used to fail to load on 4K-page Arm architectures.

Keywords: Arm

Discovered in Release: 3.4-1.0.0.0

Fixed in Release: 4.0-1.0.1.0

Introduction
This manual is intended for system administrators responsible for the installation, configuration,
management and maintenance of the software and hardware of Ethernet adapter cards. It is also
intended for application developers.

This document provides instructions on how to install the driver on NVIDIA ConnectX® network
adapter solutions supporting the following uplinks to servers.
Uplink/NICs                  Driver Name   Uplink Speed
ConnectX-3/ConnectX-3 Pro    mlx4          10GbE, 40GbE and 56GbE(1)
ConnectX-4                   mlx5          Ethernet: 1GbE, 10GbE, 25GbE, 40GbE, 50GbE, 56GbE(1), and 100GbE
ConnectX-4 Lx                mlx5          Ethernet: 1GbE, 10GbE, 25GbE, 40GbE, and 50GbE
ConnectX-5/ConnectX-5 Ex     mlx5          Ethernet: 1GbE, 10GbE, 25GbE, 40GbE, 50GbE, and 100GbE
ConnectX-6                   mlx5          Ethernet: 10GbE, 25GbE, 40GbE, 50GbE(2), 100GbE(2), 200GbE(2)
ConnectX-6 Dx                mlx5          Ethernet: 1GbE, 10GbE, 25GbE, 40GbE, 50GbE(2), 100GbE(2), 200GbE(2)
Innova™ IPsec EN             mlx5          Ethernet: 10GbE, 40GbE

(1) 56GbE is an NVIDIA proprietary link speed and can be achieved while connecting an NVIDIA adapter card to an NVIDIA SX10XX series switch, or connecting an NVIDIA adapter card to another NVIDIA adapter card.
(2) Supports both NRZ and PAM4 modes.

MLNX_EN driver release exposes the following capabilities:


• Single/Dual port
• Multiple Rx and Tx queues 
• Rx steering mode: Receive Core Affinity (RCA)
• MSI-X or INTx
• Adaptive interrupt moderation
• HW Tx/Rx checksum calculation
• Large Send Offload (i.e., TCP Segmentation Offload)
• Large Receive Offload
• Multi-core NAPI support
• VLAN Tx/Rx acceleration (HW VLAN stripping/insertion)
• Ethtool support

• Net device statistics
• SR-IOV support
• Flow steering
• Ethernet Time Stamping

MLNX_EN Package Contents

Package Images
MLNX_EN is provided as an ISO image or as a tarball per Linux distribution and CPU architecture that
includes source code and binary RPMs, firmware and utilities. The ISO image contains an installation
script (called install) that performs the necessary steps to accomplish the following:
• Discover the currently installed kernel
• Uninstall any previously installed MLNX_OFED/MLNX_EN packages
• Install the MLNX_EN binary RPMs (if they are available for the current kernel)
• Identify the currently installed HCAs and perform the required firmware updates
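As an illustration of the steps above, the sketch below only composes the installer invocation as a string so it can be inspected; the mount point (/mnt) and the plain "./install" form are assumptions, and the real installer accepts additional options (see ./install --help).

```shell
# Compose an example install command for a mounted MLNX_EN ISO.
# "/mnt" is a hypothetical mount point; in practice you would run
# the command directly after mounting the ISO.
INSTALL_CMD="cd /mnt && ./install"
echo "$INSTALL_CMD"
```

In a real session, the ISO would first be loop-mounted (e.g. with mount -o ro,loop) before running the script from the mount point.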

Software Components
MLNX_EN contains the following software components:
Components Description

mlx5 driver mlx5 is the low level driver implementation for the ConnectX®-4
adapters designed by Mellanox Technologies. ConnectX®-4 operates as a
VPI adapter.

mlx5_core Acts as a library of common functions (e.g. initializing the device after
reset) required by the ConnectX®-4 adapter cards.

mlx4 driver mlx4 is the low level driver implementation for the ConnectX adapters
designed by Mellanox Technologies. The ConnectX can operate as an
InfiniBand adapter and as an Ethernet NIC.

To accommodate the two flavors, the driver is split into the following modules: mlx4_core, mlx4_en, and mlx4_ib.
Note: mlx4_ib is not part of this package.

mlx4_core Handles low-level functions like device initialization and firmware command processing. It also controls resource allocation so that the InfiniBand, Ethernet and FC functions can share a device without interfering with each other.

mlx4_en Handles Ethernet specific functions and plugs into the netdev mid-layer.

mstflint An application to burn a firmware binary image.


Software modules Source code for all software modules (for use under conditions
mentioned in the modules' LICENSE files)

Firmware
The image includes the following firmware item:
• Firmware images (.bin format wrapped in the mlxfwmanager tool) for ConnectX®-2/
ConnectX®-3/ConnectX®-3 Pro/ConnectX®-4 and ConnectX®-4 Lx network adapters

Directory Structure
The tarball image of MLNX_EN contains the following files and directories:
• install - the MLNX_EN installation script
• uninstall.sh - the MLNX_EN un-installation script
• RPMS/ - directory of binary RPMs for a specific CPU architecture
• src/ - directory of the OFED source tarball
• mlnx_add_kernel_support.sh - a script required to rebuild MLNX_EN for a customized kernel version on a supported Linux distribution
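As a sketch of using mlnx_add_kernel_support.sh for the currently running kernel, the snippet below only builds the command line; the -m, -k, and --make-tgz flag names are assumptions based on the script's common usage and should be verified with ./mlnx_add_kernel_support.sh --help, and /tmp/mlnx-en-src is a hypothetical path to the extracted package.

```shell
# Compose a rebuild command targeting the running kernel.
# Flag names and the source path are assumed — verify them against
# the script's --help output for your package version.
KVER="$(uname -r)"
REBUILD_CMD="./mlnx_add_kernel_support.sh -m /tmp/mlnx-en-src -k $KVER --make-tgz"
echo "$REBUILD_CMD"
```

The resulting tarball can then be installed with the regular install script for the rebuilt kernel.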

mlx4 VPI Driver

mlx4 is the low-level driver implementation for the ConnectX® family adapters designed by Mellanox Technologies. ConnectX®-3 adapters can operate as Ethernet NICs. To accommodate the supported configurations, the driver is split into the following modules:

mlx4_core

Handles low-level functions like device initialization and firmware commands processing. Also
controls resource allocation so that the Ethernet functions can share the device without interfering
with each other.

mlx4_en

A 10/40GigE driver under drivers/net/ethernet/mellanox/mlx4 that handles Ethernet-specific functions and plugs into the netdev mid-layer.

mlx5 Driver
mlx5 is the low-level driver implementation for the ConnectX®-4 adapters designed by Mellanox
Technologies. ConnectX®-4 operates as a VPI adapter. The mlx5 driver is comprised of the following
kernel module:

mlx5_core

Acts as a library of common functions (e.g. initializing the device after reset) required by ConnectX®-4 adapter cards. The mlx5_core driver also implements the Ethernet interfaces for ConnectX®-4. Unlike mlx4_en/mlx4_core, mlx5 does not require a separate mlx5_en module, as the Ethernet functionality is built into the mlx5_core module.

Unsupported Features in MLNX_EN


• InfiniBand protocol
• Remote Direct Memory Access (RDMA)
• Storage protocols that use RDMA, such as:
• iSCSI Extensions for RDMA (iSER)
• SCSI RDMA Protocol (SRP)
• Sockets Direct Protocol (SDP)

Module Parameters

mlx4 Module Parameters


In order to set mlx4 parameters, add the following line(s) to /etc/modprobe.d/mlx4.conf:

options mlx4_core parameter=<value>

and/or

options mlx4_en parameter=<value>

The following sections list the available mlx4 parameters.
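As a concrete illustration, the sketch below composes such a configuration fragment and checks it. The parameter values are hypothetical examples only, and the fragment is written to a scratch file rather than /etc/modprobe.d/mlx4.conf so the sketch can run anywhere:

```shell
# Hypothetical example values; on a real system the target file is
# /etc/modprobe.d/mlx4.conf (changes take effect on the next driver load).
conf=$(mktemp)
cat > "$conf" <<'EOF'
options mlx4_core log_num_mgm_entry_size=-1
options mlx4_en inline_thold=104
EOF
grep '^options mlx4_' "$conf"
```

After writing the real file, reload the driver for the parameters to take effect.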

mlx4_core Parameters

debug_level Enable debug tracing if > 0 (int)

msi_x 0 - don't use MSI-X,

1 - use MSI-X,

>1 - limit number of MSI-X irqs to msi_x (non-SRIOV only) (int)

enable_sys_tune Tune the CPUs for better performance (default 0) (int)

block_loopback Block multicast loopback packets if > 0 (default: 1) (int)

num_vfs Either a single value (e.g. '5') to define uniform num_vfs value for all
devices functions or a string to map device function numbers to their
num_vfs values (e.g. '0000:04:00.0-5,002b:1c:0b.a-15').

Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and


decimal for num_vfs value (e.g. 15). (string)

probe_vf Either a single value (e.g. '3') to indicate that the Hypervisor driver
itself should activate this number of VFs for each HCA on the host, or
a string to map device function numbers to their probe_vf values (e.g.
'0000:04:00.0-3,002b:1c:0b.a-13').

Hexadecimal digits for the device function (e.g. 002b:1c:0b.a) and


decimal for probe_vf value (e.g. 13). (string)

log_num_mgm_entry_size Log mgm size, which defines the number of QPs per MCG; for example, 10
gives 248. Range: 7 <= log_num_mgm_entry_size <= 12 (default: -10). To
activate device managed flow steering when available, set to -1 (int)

high_rate_steer Enable steering mode for higher packet rate (obsolete, set "Enable
optimized steering" option in log_num_mgm_entry_size to use this
mode). (int)

fast_drop Enable fast packet drop when no receive WQEs are posted (int)

enable_64b_cqe_eqe Enable 64 byte CQEs/EQEs when the FW supports this if non-zero


(default: 1) (int)

log_num_mac Log2 max number of MACs per ETH port (1-7) (int)

log_num_vlan (Obsolete) Log2 max number of VLANs per ETH port (0-7) (int)

log_mtts_per_seg Log2 number of MTT entries per segment (0-7) (default: 0) (int)

port_type_array Either pair of values (e.g. '1,2') to define uniform port1/port2 types
configuration for all devices functions or a string to map device
function numbers to their pair of port types values (e.g.
'0000:04:00.0-1;2,002b:1c:0b.a-1;1').

Valid port types: 1-ib, 2-eth, 3-auto, 4-N/A

If only a single port is available, use the N/A port type for port2 (e.g.
'1,4').

log_num_qp log maximum number of QPs per HCA (default: 19) (int)

log_num_srq log maximum number of SRQs per HCA (default: 16) (int)

log_rdmarc_per_qp log number of RDMARC buffers per QP (default: 4) (int)

log_num_cq log maximum number of CQs per HCA (default: 16) (int)

log_num_mcg log maximum number of multicast groups per HCA (default: 13) (int)

log_num_mpt log maximum number of memory protection table entries per HCA
(default: 19) (int)

log_num_mtt log maximum number of memory translation table segments per HCA
(default: max(20, 2*MTTs for register all of the host memory limited to
30)) (int)

enable_qos Enable Quality of Service support in the HCA (default: off) (bool)

internal_err_reset Reset device on internal errors if non-zero (default is 1) (int)

ingress_parser_mode Mode of ingress parser for ConnectX3-Pro. 0 - standard. 1 - checksum


for non TCP/UDP. (default: standard) (int)

roce_mode Set RoCE modes supported by the port

ud_gid_type Set gid type for UD QPs

use_prio Enable steering by VLAN priority on ETH ports (deprecated) (bool)

enable_vfs_qos Enable Virtual VFs QoS (default: off) (bool)

mlx4_en_only_mode Load in Ethernet only mode (int)

enable_4k_uar Enable using 4K UAR. Should not be enabled if have VFs which do not
support 4K UARs (default: true) (bool)

rr_proto IP next protocol for RoCEv1.5 or destination port for RoCEv2. Setting 0
means using driver default values (deprecated) (int)

mlx4_en Parameters

inline_thold The threshold for using inline data (int)

Default and maximum value is 104 bytes. This saves a PCI read transaction, since packets smaller
than the threshold are copied directly to the hardware buffer. (range: 17-104)

udp_rss Enable RSS for incoming UDP traffic (uint)

On by default. When disabled, no RSS is performed on incoming UDP traffic.

pfctx Priority-based Flow Control policy on TX[7:0]. Per priority bit mask (uint)

pfcrx Priority-based Flow Control policy on RX[7:0]. Per priority bit mask (uint)

udev_dev_port_dev_id Work with dev_id or dev_port when supported by the kernel. Range: 0 <=
udev_dev_port_dev_id <= 2 (default = 0).

• 0: Work with dev_port if supported by the kernel, otherwise work with dev_id.

• 1: Work only with dev_id regardless of dev_port support.

• 2: Work with both dev_id and dev_port (if dev_port is supported by the kernel). (int)

mlx5_core Module Parameters

The mlx5_core module supports a single parameter used to select the profile which defines the
number of resources supported.

prof_sel The parameter name for selecting the profile. The supported values
for profiles are:
• 0 - for medium resources, medium performance
• 1 - for low resources
• 2 - for high performance (default) (int)

guids node_guid configuration (charp). This module parameter will be obsolete!

debug_mask debug_mask: 1 = dump cmd data, 2 = dump cmd exec time, 3 = both.
Default=0 (uint)

probe_vf probe VFs or not, 0 = not probe, 1 = probe. Default = 1 (bool)


num_of_groups
Controls the number of large groups in the FDB flow table.

Default=4; Range=1-1024
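A minimal sketch of selecting a profile via a modprobe fragment; value 2 (the default high-performance profile) is used purely as an illustration, and the fragment is written to a scratch file rather than /etc/modprobe.d/mlx5.conf:

```shell
# Illustrative only; on a real system the target is /etc/modprobe.d/mlx5.conf
# and the mlx5_core module must be reloaded for the change to take effect.
conf=$(mktemp)
echo "options mlx5_core prof_sel=2" > "$conf"
cat "$conf"
```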

Devlink Parameters
The following parameters, supported in mlx4 driver only, can be changed using the Devlink user
interface:

Parameter                 Description                                        Parameter Type

internal_error_reset      Enables resetting the device on internal errors    Generic
max_macs                  Max number of MACs per ETH port                    Generic
region_snapshot_enable    Enables capturing region snapshots                 Generic
enable_64b_cqe_eqe        Enables 64 byte CQEs/EQEs when supported by FW     Driver-specific
enable_4k_uar             Enables using 4K UAR                               Driver-specific
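For example, these parameters can be set with the devlink CLI. The sketch below only composes and prints the command; the PCI address is a hypothetical placeholder, and on a real system you would run the printed command directly (as root) against your adapter's devlink handle:

```shell
# Assumption: replace with your adapter's PCI address (see `devlink dev show`).
DEV="pci/0000:04:00.0"
CMD="devlink dev param set $DEV name internal_error_reset value true cmode driverinit"
echo "$CMD"
# To inspect current values: devlink dev param show $DEV name internal_error_reset
```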

Installation
This chapter describes how to install and test the MLNX_EN for Linux package on a single host
machine with Mellanox Ethernet adapter hardware installed.

The chapter contains the following sections:

• Software Dependencies
• Downloading MLNX_EN
• Installing MLNX_EN
• Uninstalling MLNX_EN
• Updating Firmware After Installation
• Ethernet Driver Usage and Configuration
• Performance Tuning

Software Dependencies
The MLNX_EN driver cannot coexist with OFED software on the same machine. Hence, when installing
MLNX_EN, all OFED packages must be removed (the install script performs this removal).

Downloading MLNX_EN
 To download MLNX_EN, perform the following:

1. Verify that the system has a Mellanox network adapter (HCA/NIC) installed.
The following example shows a system with an installed Mellanox HCA:

# lspci -v | grep Mellanox


86:00.0 Network controller [0207]: Mellanox Technologies MT27620 Family
Subsystem: Mellanox Technologies Device 0014
86:00.1 Network controller [0207]: Mellanox Technologies MT27620 Family
Subsystem: Mellanox Technologies Device 0014

Note: For ConnectX-5 Socket Direct adapters, use ibdev2netdev to display the installed card and
the mapping of logical ports to physical ports. Example:

[root@gen-l-vrt-203 ~]# ibdev2netdev -v | grep -i MCX556M-ECAT-S25


0000:84:00.0 mlx5_10 (MT4119 - MCX556M-ECAT-S25SN) CX556M - ConnectX-5 QSFP28 fw 16.22.0228 port 1 (DOWN ) ==>
p2p1 (Down)
0000:84:00.1 mlx5_11 (MT4119 - MCX556M-ECAT-S25SN) CX556M - ConnectX-5 QSFP28 fw 16.22.0228 port 1 (DOWN ) ==>
p2p2 (Down)
0000:05:00.0 mlx5_2 (MT4119 - MCX556M-ECAT-S25SN) CX556M - ConnectX-5 QSFP28 fw 16.22.0228 port 1 (DOWN ) ==> p5p1
(Down)
0000:05:00.1 mlx5_3 (MT4119 - MCX556M-ECAT-S25SN) CX556M - ConnectX-5 QSFP28 fw 16.22.0228 port 1 (DOWN ) ==> p5p2
(Down)

Notes:
• Each PCI card of ConnectX-5 Socket Direct has a different PCI address. In the output example
above, the first two rows indicate that one card is installed in a PCI slot with PCI Bus address
84 (hexadecimal), and PCI Device Number 00, and PCI Function Number 0 and 1. RoCE
assigned mlx5_10 as the logical port, which is the same as netdevice p2p1, and both are
mapped to physical port of PCI function 0000:84:00.0. 
• RoCE logical port mlx5_2 of the second PCI card (PCI Bus address 05) and netdevice p5p1 are
mapped to physical port of PCI function 0000:05:00.0, which is the same physical port of PCI
function 0000:84:00.0. 
MT4119 is the PCI Device ID of the Mellanox ConnectX-5 adapters family.

For more details, please refer to ConnectX-5 Socket Direct Hardware User Manual, available
at www.mellanox.com.

2. Download the ISO image to your host.

The image's name has the format mlnx-en-<ver>-<OS label>-<CPU arch>.iso. You can download it
from http://www.mellanox.com > Products > Software > Ethernet Drivers.

3.  Use the md5sum utility to confirm the file integrity of your ISO/tarball image. 
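A short sketch of the verification step; the file below is a stand-in for the real ISO, and the .md5 file mimics the checksum published alongside the image:

```shell
# Stand-in for the downloaded image; in practice, point md5sum at the real ISO
# and compare against the checksum published on the download page.
printf 'example image contents\n' > /tmp/mlnx-en-example.iso
md5sum /tmp/mlnx-en-example.iso > /tmp/mlnx-en-example.iso.md5
# Verify: prints "<file>: OK" when the checksum matches.
md5sum -c /tmp/mlnx-en-example.iso.md5
```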

Installing MLNX_EN

Installation Script
The install script performs the following:

• Discovers the currently installed kernel


• Uninstalls any previously installed MLNX_EN package
• Installs the MLNX_EN binaries (if available for the current kernel)
• Identifies the currently installed Ethernet network adapters and automatically upgrades the
firmware

Installation Modes

The mlnx_en installer supports two modes of installation. The install script selects the mode of
driver installation depending on the running OS/kernel version.
• Kernel Module Packaging (KMP) mode, where the source rpm is rebuilt for each installed
flavor of the kernel. This mode is used for RedHat and SUSE distributions.
• Non KMP installation mode, where the sources are rebuilt with the running kernel. This mode
is used for vanilla kernels.
By default, the package will install drivers supporting Ethernet only. In addition, the package
will include the following installation options:
• Full VMA support, which can be installed using the installation option "--vma".
• Infrastructure to run DPDK, using the installation option "--dpdk".
Notes:

• DPDK itself is not included in the package. Users would still need to install DPDK
separately after the MLNX_EN installation is completed.
• Installing VMA or DPDK infrastructure will allow users to run RoCE.

Installation Results
• For Ethernet only installation mode:
• The kernel modules are installed under:
• /lib/modules/`uname -r`/updates on SLES and Fedora Distributions
• /lib/modules/`uname -r`/extra/mlnx-en on RHEL and other RedHat like
Distributions
• /lib/modules/`uname -r`/updates/dkms/ on Ubuntu
• The kernel module sources are placed under:
/usr/src/mlnx-en-<ver>/
• For VPI installation mode:
• The kernel modules are installed under:
• /lib/modules/`uname -r`/updates on SLES and Fedora Distributions
• /lib/modules/`uname -r`/extra/mlnx-ofa_kernel on RHEL and other RedHat like
Distributions
• /lib/modules/`uname -r`/updates/dkms/ on Ubuntu 
• The kernel module sources are placed under:
/usr/src/ofa_kernel-<ver>/

Installation Procedure
This section describes the installation procedure of MLNX_EN on Mellanox adapter cards. Additional
installation procedures for Mellanox Innova SmartNICs, for other environment customizations, and
for extra libraries and packages are provided in the Installing MLNX_EN on Innova™ IPsec Adapter
Cards section.
1. Log into the installation machine as root.
2. Mount the ISO image on your machine. 

host1# mount -o ro,loop mlnx-en-<ver>-<OS label>-<CPU arch>.iso /mnt

3. Run the installation script.

/mnt/install

4. Case A: If the installation script has performed a firmware update on your network adapter,
you need to either restart the driver or reboot your system before the firmware update can
take effect. Refer to the table below to find the appropriate action for your specific card.
Adapter                                        Driver Restart   Standard Reboot   Cold Reboot
                                               (Soft Reset)                       (Hard Reset)

ConnectX-3/ConnectX-3 Pro                      +                -                 -

Standard ConnectX-4/ConnectX-4 Lx or higher    -                +                 -

Adapters with Multi-Host Support               -                -                 +

Socket Direct Cards                            -                -                 +

Case B: If the installation script has not performed a firmware upgrade on your network
adapter, restart the driver by running: 

• # /etc/init.d/mlnx-en.d restart - after Ethernet only installation mode


• # /etc/init.d/openibd restart - after VPI installation mode


The “/etc/init.d/openibd” service script will load the mlx4 and/or mlx5 and ULP drivers as
set in the “/etc/infiniband/openib.conf” configuration file.

The result is a new net-device appearing in the 'ifconfig -a' output.

Additional Installation Procedures

Installing MLNX_EN on Innova™ IPsec Adapter Cards

This type of installation is applicable to RedHat 7.1, 7.2, 7.3 and 7.4 operating systems and Kernel
4.13.
As of version 4.2, MLNX_EN supports Mellanox Innova IPsec EN adapter card that provides security
acceleration for IPsec-enabled networks. 
For information on the usage of Innova IPsec, please refer to Mellanox Innova IPsec EN Adapter Card
documentation on Mellanox official web (mellanox.com --> Products --> Adapters --> Smart Adapters
--> Innova IPsec).

Prerequisites

In order to obtain Innova IPsec offload capabilities once MLNX_EN is installed, make sure Kernel
v4.13 or newer is installed with the following configuration flags enabled:
• CONFIG_XFRM_OFFLOAD
• CONFIG_INET_ESP_OFFLOAD

80
• CONFIG_INET6_ESP_OFFLOAD

Installing MLNX_EN Using YUM

This type of installation is applicable to RedHat/OL, Fedora, XenServer operating systems.

Setting up MLNX_EN YUM Repository

The package consists of two folders that can be set up as a repository:


• “RPMS_ETH” - provides the Ethernet only installation mode
• “RPMS” - provides the RDMA support installation mode

1. Log into the installation machine as root.


2. Mount the ISO image on your machine and copy its content to a shared location in your
network.

# mount -o ro,loop mlnx-en-<ver>-<OS label>-<CPU arch>.iso /mnt

You can download the image from www.mellanox.com --> Products --> Software --> Ethernet
Drivers.
3. Download and install Mellanox Technologies GPG-KEY:
The key can be downloaded via the following link: 
http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
Example: 

# wget http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
--2014-04-20 13:52:30-- http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox
Resolving www.mellanox.com... 72.3.194.0
Connecting to www.mellanox.com|72.3.194.0|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1354 (1.3K) [text/plain]
Saving to: 'RPM-GPG-KEY-Mellanox'
100%[=================================================>] 1,354 --.-K/s in 0s
2014-04-20 13:52:30 (247 MB/s) - 'RPM-GPG-KEY-Mellanox' saved [1354/1354]

4. Install the key. 


Example:

# sudo rpm --import RPM-GPG-KEY-Mellanox


warning: rpmts_HdrFromFdno: Header V3 DSA/SHA1 Signature, key ID 6224c050: NOKEY
Importing GPG key 0x6224C050:
Userid: "Mellanox Technologies (Mellanox Technologies - Signing Key v2) <support@mellanox.com>"
From : /repos/MLNX_EN/RPM-GPG-KEY-Mellanox
Is this ok [y/N]:

5. Check that the key was successfully imported. 


Example:

# rpm -q gpg-pubkey --qf '%{NAME}-%{VERSION}-%{RELEASE}\t%{SUMMARY}\n' | grep Mellanox


gpg-pubkey-a9e4b643-520791ba gpg(Mellanox Technologies <support@mellanox.com>)

6. Create a YUM repository configuration file called “/etc/yum.repos.d/mlnx_en.repo” with the


following content: 

[mlnx_en]
name=MLNX_EN Repository
baseurl=file:///<path to extracted MLNX_EN package>/<RPMS FOLDER NAME>
enabled=1
gpgkey=file:///<path to the downloaded key RPM-GPG-KEY-Mellanox>
gpgcheck=1

 Replace <RPMS FOLDER NAME> with “RPMS_ETH” or “RPMS” depending on the desired
installation mode (Ethernet only or RDMA).

7. Check that the repository was successfully added. 

# yum repolist
Loaded plugins: product-id, security, subscription-manager
This system is not registered to Red Hat Subscription Management. You can use subscription-manager to
register.
repo id repo name
status
mlnx_en MLNX_EN Repository

Installing MLNX_EN Using the YUM Tool

After setting up the YUM repository for MLNX_EN package, install one of the following metadata
packages:

• In case you set up the “RPMS_ETH” folder as the repository (for Ethernet only mode), install: 

# yum install mlnx-en-eth-only

• In case you set up the “RPMS” folder as the repository (for RDMA mode), install either: 

# yum install mlnx-en-vma

Or 

# yum install mlnx-en-dpdk

Please note the following:

MLNX_EN provides kernel module RPM packages with KMP support for RHEL and SLES. For other
operating systems, kernel module RPM packages are provided only for the operating system's
default kernel. In this case, the group RPM packages have the supported kernel version in their
package names.
If you have an operating system other than RHEL or SLES, or you have installed a kernel that is
not supported by default in MLNX_EN, you can use the mlnx_add_kernel_support.sh script to build
MLNX_EN for your kernel.
The script will automatically build the matching group RPM packages for your kernel so that you can
still install MLNX_EN via YUM.
Please note that the resulting MLNX_EN repository will contain unsigned RPMs. Therefore, you
should set 'gpgcheck=0' in the repository configuration file.
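A hypothetical invocation of the script is sketched below; the paths are placeholders, and you should check `mlnx_add_kernel_support.sh -h` for the options supported by your package version:

```shell
# Rebuild the MLNX_EN package against the running kernel and produce a
# tarball with the rebuilt RPMs (placeholder paths; run as root).
./mlnx_add_kernel_support.sh -m /path/to/mlnx-en-<ver>-<OS label>-<CPU arch> --make-tgz
```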
Installing MLNX_EN using the YUM tool does not automatically update the firmware.
To update the firmware to the version included in MLNX_EN package, you can either:
1. Run: 

# yum install mlnx-fw-updater

OR 
2. Update the firmware to the latest version available on Mellanox website as described
in Updating Firmware After Installation section.

Installing MLNX_EN Using apt-get

This type of installation is applicable to Debian and Ubuntu operating systems.

Setting up MLNX_EN apt-get Repository

The package consists of two folders that can be set up as a repository:


• “DEBS_ETH” - provides the Ethernet only installation mode.
• “DEBS” - provides the RDMA support installation mode.

1. Log into the installation machine as root.


2. Extract the MLNX_EN package on a shared location in your network.
You can download it from http://www.mellanox.com > Products > Software > Ethernet
Drivers.
3. Create an apt-get repository configuration file called "/etc/apt/sources.list.d/mlnx_en.list"
with the following content: 

deb file:/<path to extracted MLNX_EN package>/<DEBS FOLDER NAME> ./

 Replace <DEBS FOLDER NAME> with “DEBS_ETH” or “DEBS” depending on the desired
installation mode (Ethernet only or RDMA).

4. Download and install Mellanox Technologies GPG-KEY. 


Example:

# wget -qO - http://www.mellanox.com/downloads/ofed/RPM-GPG-KEY-Mellanox | sudo apt-key add -

5. Verify that the key was successfully imported. 


Example:

# apt-key list
pub 1024D/A9E4B643 2013-08-11
uid Mellanox Technologies <support@mellanox.com>
sub 1024g/09FCC269 2013-08-11

6. Update the apt-get cache. 

# sudo apt-get update

Installing MLNX_EN Using the apt-get Tool

After setting up the apt-get repository for MLNX_EN package, install one of the following metadata
packages: 
• In case you set up the “DEBS_ETH” folder as the repository (for Ethernet only mode), install: 

# apt-get install mlnx-en-eth-only

• In case you set up the “DEBS” folder as the repository (for RDMA mode), install either: 

# apt-get install mlnx-en-vma

OR 

# apt-get install mlnx-en-dpdk

Installing MLNX_EN using the apt-get tool does not automatically update the firmware. To update
the firmware to the version included in MLNX_EN package, you can either: 
1. Run: 

# apt-get install mlnx-fw-updater

Or 
2. Update the firmware to the latest version available on Mellanox website as described
in Updating Firmware After Installation section.

Uninstalling MLNX_EN
Use the script /usr/sbin/mlnx_en_uninstall.sh to uninstall MLNX_EN package. 

Uninstalling MLNX_EN Using the YUM Tool


Use the script /usr/sbin/mlnx_en_uninstall.sh to uninstall MLNX_EN package.

Uninstalling MLNX_EN Using the apt-get Tool


Use the script /usr/sbin/mlnx_en_uninstall.sh to uninstall MLNX_EN package. 

Updating Firmware After Installation


The firmware can be updated using one of the following methods:

Updating the Device Online
To update the device online on the machine from Mellanox site, use the following command line: 

mlxfwmanager --online -u -d <device>

Example: 

mlxfwmanager --online -u -d 0000:09:00.0


Querying Mellanox devices firmware ...
 
Device #1:
----------
 
Device Type: ConnectX3
Part Number: MCX354A-FCA_A2-A4
Description: ConnectX-3 VPI adapter card; dual-port QSFP; FDR IB (56Gb/s) and 40GigE; PCIe3.0 x8 8GT/
s; RoHS R6
PSID: MT_1020120019
PCI Device Name: 0000:09:00.0
Port1 GUID: 0002c9000100d051
Port2 MAC: 0002c9000002
Versions: Current Available
FW 2.32.5000 2.33.5000
 
Status: Update required
---------
Found 1 device(s) requiring firmware update. Please use -u flag to perform the update.

Updating Firmware and FPGA Image on Innova IPsec Cards


The firmware and FPGA update package (mlnx-fw-updater) is installed under the "/opt/mellanox/
mlnx-fw-updater" folder.
The latest FW and FPGA update package can be downloaded from mellanox.com, under 
Products --> Adapters --> Smart Adapters --> Innova IPsec --> Download tab. 


The current update package available on mellanox.com does not support the script below.
An update package that supports this script will become available in a future release.

You can run the following update script using one of the modes below:  

/opt/mellanox/mlnx-fw-updater/mlnx_fpga_updater.sh

• With -u flag to provide URL to the software package (tarball). Example: 

./mlnx_fpga_updater.sh -u http://www.mellanox.com/downloads/fpga/ipsec/Innova_IPsec_<version>.tgz

• With -t flag to provide the path to the downloaded tarball. Example: 

./mlnx_fpga_updater.sh -t <Innova_IPsec_bundle_file.tgz>

• With -p flag to provide the path to the downloaded and extracted tarball. Example: 

./mlnx_fpga_updater.sh -p <Innova_IPsec_extracted_bundle_directory>

For more information on the script usage, you can run mlnx_fpga_updater.sh -h. 


It is recommended to perform firmware and FPGA upgrade on Innova IPsec cards using this
script only.

Updating the Device Manually


In case you ran the install script with the ‘--without-fw-update’ option or you are using an OEM
card and now you wish to (manually) update firmware on your adapter card(s), you need to perform
the steps below. The following steps are also appropriate in case you wish to burn newer firmware
that you have downloaded from Mellanox Technologies' Web site (http://www.mellanox.com >
Support > Firmware Download).

1. Get the device’s PSID. 

mlxfwmanager_pci | grep PSID


PSID: MT_1210110019

2. Download the firmware BIN file from the Mellanox website or the OEM website.
3. Burn the firmware. 

mlxfwmanager_pci -i <fw_file.bin>

4. Reboot your machine after the firmware burning is completed.

Ethernet Driver Usage and Configuration


To assign an IP address to the interface, run: 

#> ifconfig eth<x> <ip>

Note: 'x' is the OS assigned interface number.

To check driver and device information:  

#> ethtool -i eth<x>

Example: 

#> ethtool -i eth2


driver: mlx4_en
version: 2.1.8 (Oct 06 2013)
firmware-version: 2.30.3110
bus-info: 0000:1a:00.0

To query stateless offload status:

#> ethtool -k eth<x>

To set stateless offload status:

#> ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off]

To query interrupt coalescing settings: 

#> ethtool -c eth<x>

To enable/disable adaptive interrupt moderation: 

#>ethtool -C eth<x> adaptive-rx on|off

By default, the driver uses adaptive interrupt moderation for the receive path, which adjusts the
moderation time to the traffic pattern. 

To set the values for packet rate limits and for moderation time high and low: 

#> ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]

Above an upper limit of packet rate, adaptive moderation will set the moderation time to its highest
value. Below a lower limit of packet rate, the moderation time will be set to its lowest value.

To set interrupt coalescing settings when adaptive moderation is disabled: 

#> ethtool -C eth<x> [rx-usecs N] [rx-frames N]


usec settings correspond to the time to wait after the *last* packet is sent/received before
triggering an interrupt.

[ConnectX-3/ConnectX-3 Pro] To query pause frame settings: 

#> ethtool -a eth<x>

[ConnectX-3/ConnectX-3 Pro] To set pause frame settings: 

#> ethtool -A eth<x> [rx on|off] [tx on|off]

To query ring size values: 

#> ethtool -g eth<x>

To modify rings size: 

#> ethtool -G eth<x> [rx <N>] [tx <N>]

To obtain additional device statistics: 

#> ethtool -S eth<x>

[ConnectX-3/ConnectX-3 Pro] To perform a self diagnostics test: 

#> ethtool -t eth<x>

The driver defaults to the following parameters:


• Both ports are activated (i.e., a net device is created for each port)
• The number of Rx rings for each port is the nearest power of 2 to the number of CPU cores,
limited to 16.
• LRO is enabled with 32 concurrent sessions per Rx ring

Some of these values can be changed using module parameters, which can be displayed by running: 

#> modinfo mlx4_en

To set non-default values to module parameters, add to the /etc/modprobe.conf file: 

"options mlx4_en <param_name>=<value> <param_name>=<value> ..."

Values of all parameters can be observed in /sys/module/mlx4_en/parameters/.

Performance Tuning
Depending on the application running on the user's system, it may be necessary to modify the
default configuration of ConnectX®-based network adapters. In case tuning is required, please
refer to the Performance Tuning for Mellanox Adapters Community post.

Features Overview and Configuration
This chapter contains the following sections: 

• Ethernet Network
• Virtualization
• Resiliency
• Docker Containers
• OVS Offload Using ASAP2 Direct
• Fast Driver Unload

Ethernet Network
• ethtool Counters
• Interrupt Request (IRQ) Naming
• Quality of Service (QoS)
• Quantized Congestion Notification (QCN)
• Ethtool
• Checksum Offload
• Ignore Frame Check Sequence (FCS) Errors
• RDMA over Converged Ethernet (RoCE)
• Flow Control
• Explicit Congestion Notification (ECN)
• RSS Support
• Time-Stamping
• Flow Steering
• Wake-on-LAN (WoL)
• Hardware Accelerated 802.1ad VLAN (Q-in-Q Tunneling)
• Local Loopback Disable
• NVME-oF - NVM Express over Fabrics
• Debuggability
• RX Page Cache Size Limit

ethtool Counters
The ethtool counters are counted at different points in the data path and are accordingly divided
into groups. Each counter group may also contain different counter types.

For the full list of supported ethtool counters, refer to the Understanding mlx5 ethtool Counters
Community post.

Interrupt Request (IRQ) Naming


Once IRQs are allocated by the driver, they are named mlx5_comp<x>@pci:<pci_addr>. The IRQs
corresponding to the channels in use are renamed to <interface>-<x>, while the rest maintain
their default name.
The mlx5_core driver allocates all IRQs during loading time to support the maximum possible
number of channels. Once the driver is up, no further IRQs are freed or allocated. Changing the
number of working channels does not re-allocate or free the IRQs.

The following example demonstrates how reducing the number of channels affects the IRQs names.

$ ethtool -l ens1
Channel parameters for ens1:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 12
 
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 12
 
$ cat /proc/interrupts
 
98: 0 0 0 0 0 0 7935 0 0 0
0 0 IR-PCI-MSI-edge mlx5_async@pci:0000:81:00.0
99: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-0
100: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-1
101: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-2
102: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-3
103: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-4
104: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-5
105: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-6
106: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-7
107: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-8
108: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-9
109: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-10
110: 0 0 0 0 0 0 1 0 0 0
0 0 IR-PCI-MSI-edge ens1-11

 
 
 
$ ethtool -L ens1 combined 4
$ ethtool -l ens1
Channel parameters for ens1:

Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 4
 
$ cat /proc/interrupts
98: 0 0 0 0 0 0 8455 0 0 0
0 0 IR-PCI-MSI-edge mlx5_async@pci:0000:81:00.0
99: 0 0 0 0 0 0 1 2 0 0
0 0 IR-PCI-MSI-edge ens1-0
100: 0 0 0 0 0 0 1 0 2 0
0 0 IR-PCI-MSI-edge ens1-1
101: 0 0 0 0 0 0 1 0 0 2
0 0 IR-PCI-MSI-edge ens1-2
102: 0 0 0 0 0 0 1 0 0 0
2 0 IR-PCI-MSI-edge ens1-3
103: 0 0 0 0 0 0 1 0 0 0
0 1 IR-PCI-MSI-edge mlx5_comp4@pci:0000:81:00.0
104: 0 0 0 0 0 0 2 0 0 0
0 0 IR-PCI-MSI-edge mlx5_comp5@pci:0000:81:00.0
105: 0 0 0 0 0 0 1 1 0 0
0 0 IR-PCI-MSI-edge mlx5_comp6@pci:0000:81:00.0
106: 0 0 0 0 0 0 1 0 1 0
0 0 IR-PCI-MSI-edge mlx5_comp7@pci:0000:81:00.0
107: 0 0 0 0 0 0 1 0 0 1
0 0 IR-PCI-MSI-edge mlx5_comp8@pci:0000:81:00.0
108: 0 0 0 0 0 0 1 0 0 0
1 0 IR-PCI-MSI-edge mlx5_comp9@pci:0000:81:00.0
109: 0 0 0 0 0 0 1 0 0 0
0 1 IR-PCI-MSI-edge mlx5_comp10@pci:0000:81:00.0
110: 0 0 0 0 0 0 2 0 0 0
0 0 IR-PCI-MSI-edge mlx5_comp11@pci:0000:81:00.0

Quality of Service (QoS)

Quality of Service (QoS) is a mechanism for assigning a priority to a network flow (socket, rdma_cm
connection) and managing its guarantees, limitations and priority over other flows. This is
accomplished by mapping the user's priority to a hardware TC (traffic class) through a 2/3-stage
process. The TC is assigned the QoS attributes, and the different flows behave accordingly.

Mapping Traffic to Traffic Classes

Mapping traffic to TCs consists of several actions which are user controllable, some controlled by
the application itself and others by the system/network administrators.
The following is the general mapping traffic to Traffic Classes flow:
1. The application sets the required Type of Service (ToS).
2. The ToS is translated into a Socket Priority (sk_prio).
3. The sk_prio is mapped to a User Priority (UP) by the system administrator (some
applications set sk_prio directly).
4. The UP is mapped to TC by the network/system administrator.
5. TCs hold the actual QoS parameters

QoS can be applied on the following types of traffic. However, the general QoS flow may vary among
them:
• Plain Ethernet - Applications use regular inet sockets and the traffic passes via the kernel
Ethernet driver
• RoCE - Applications use the RDMA API to transmit using Queue Pairs (QPs)
• Raw Ethernet QP - Applications use the Verbs API to transmit using a Raw Ethernet QP

Plain Ethernet Quality of Service Mapping

Applications use regular inet sockets and the traffic passes via the kernel Ethernet driver. The
following is the Plain Ethernet QoS mapping flow:
1. The application sets the ToS of the socket using setsockopt (IP_TOS, value).
2. ToS is translated into the sk_prio using a fixed translation: 

TOS 0 <=> sk_prio 0


TOS 8 <=> sk_prio 2
TOS 24 <=> sk_prio 4
TOS 16 <=> sk_prio 6

3. The Socket Priority is mapped to the UP:

• If the underlying device is a VLAN device, the egress_map is used, controlled by the
vconfig command. This mapping is per VLAN.
• If the underlying device is not a VLAN device, on ConnectX-3/ConnectX-3 Pro RoCE old
kernels, mapping the sk_prio to the UP is done by using tc_wrap.py -i <dev name> -u
0,1,2,3,4,5,6,7. Otherwise, the mapping is done in the driver.

4. The UP is mapped to the TC as configured by the mlnx_qos tool or by the lldpad daemon if
DCBX is used. 


Socket applications can use setsockopt (SK_PRIO, value) to directly set the sk_prio of the
socket. In this case, the ToS to sk_prio fixed mapping is not needed. This allows the
application and the administrator to utilize more than the 4 values possible via ToS. 


In the case of a VLAN interface, the UP obtained according to the above mapping is also
used in the VLAN tag of the traffic.
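On a Linux host, both paths — setting the ToS and setting the sk_prio directly — can be exercised from Python. This is a minimal sketch; note that setting SO_PRIORITY above 6 requires CAP_NET_ADMIN:

```python
import socket

# Create a UDP socket and demonstrate both prioritization paths.
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Path 1: set the ToS; the kernel translates it to an sk_prio internally.
s.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, 16)

# Path 2: set the socket priority (sk_prio) directly, bypassing the
# fixed ToS translation. SO_PRIORITY is the constant behind SK_PRIO.
s.setsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY, 6)

print(s.getsockopt(socket.SOL_SOCKET, socket.SO_PRIORITY))  # prints 6
s.close()
```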

Map Priorities with set_egress_map

For RoCE old kernels that do not support set_egress_map, use the tc_wrap script to map between
sk_prio and UP. Use tc_wrap with option -u. For example: 

tc_wrap -i <ethX> -u <skprio2up mapping>

Quality of Service Properties

The different QoS properties that can be assigned to a TC are:


• Strict Priority
• Enhanced Transmission Selection (ETS)
• Rate Limit

• Trust State
• Receive Buffer
• DCBX Control Mode

Strict Priority

When a TC's transmission algorithm is set to 'strict', that TC has absolute (strict) priority
over the strict TCs below it (as determined by the TC number: TC 7 is the highest
priority, TC 0 is the lowest). It also has absolute priority over non-strict TCs (ETS).
This property needs to be used with care, as it may easily cause starvation of other TCs.
A higher strict priority TC is always given the first chance to transmit. Only if the highest strict
priority TC has nothing more to transmit will the next highest TC be considered.
Non-strict priority TCs will be considered last to transmit.
This property is extremely useful for low-latency, low-bandwidth traffic that needs immediate
service when it exists, but is not of high enough volume to starve other transmitters in the system.

Enhanced Transmission Selection (ETS)

Enhanced Transmission Selection standard (ETS) exploits the time periods in which the offered load
of a particular Traffic Class (TC) is less than its minimum allocated bandwidth by allowing the
difference to be available to other traffic classes.
After servicing the strict priority TCs, the amount of bandwidth (BW) left on the wire may be split
among other TCs according to a minimal guarantee policy.
If, for instance, TC0 is set to an 80% guarantee and TC1 to 20% (the TC sum must be 100), then the BW
left after servicing all strict priority TCs will be split according to this ratio.
Since this is a minimum guarantee, there is no maximum enforcement. This means, in the same
example, that if TC1 did not use its share of 20%, the remainder will be used by TC0.
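The minimal-guarantee arithmetic can be illustrated with a simplified, single-round Python model. The ets_allocate helper is hypothetical; real ETS arbitration happens in the hardware scheduler, not in software:

```python
def ets_allocate(guarantee, demand):
    """Simplified single-round ETS split.
    guarantee: guaranteed %BW per TC (values sum to 100).
    demand: %BW each TC actually wants to transmit.
    Unused guarantee is redistributed to TCs with unmet demand."""
    # Each TC first receives the smaller of its guarantee and its demand.
    grant = {tc: min(guarantee[tc], demand[tc]) for tc in guarantee}
    leftover = 100 - sum(grant.values())
    # Redistribute the leftover to still-hungry TCs, proportionally to
    # their guarantees (one round only, for illustration).
    hungry = {tc: demand[tc] - grant[tc] for tc in grant if demand[tc] > grant[tc]}
    total_g = sum(guarantee[tc] for tc in hungry) or 1
    for tc in hungry:
        grant[tc] += min(hungry[tc], leftover * guarantee[tc] / total_g)
    return grant

# TC0 guaranteed 80%, TC1 guaranteed 20%; TC1 only offers 5% of load,
# so TC0 absorbs the unused 15%.
print(ets_allocate({0: 80, 1: 20}, {0: 100, 1: 5}))  # {0: 95.0, 1: 5}
```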
ETS is configured using the mlnx_qos tool (mlnx_qos) which allows you to:
• Assign a transmission algorithm to each TC (strict or ETS)
• Set minimal BW guarantee to ETS TCs
Usage: 

mlnx_qos -i <interface> [options]

Rate Limit

Rate limit defines a maximum bandwidth allowed for a TC. Please note that 10% deviation from the
requested values is considered acceptable.

Trust State

Trust state enables prioritizing sent/received packets based on packet fields.


The default trust state is PCP. Ethernet packets are prioritized based on the value of the field (PCP/
DSCP).
For further information on how to configure Trust mode, please refer to HowTo Configure Trust State
on Mellanox Adapters Community post.

Receive Buffer

By default, the receive buffer configuration is controlled automatically. Users can override the
receive buffer size and receive buffer's xon and xoff thresholds using mlnx_qos tool.
For further information, please refer to HowTo Tune the Receive buffers on Mellanox Adapters
Community post.

DCBX Control Mode

DCBX settings, such as "ETS" and "strict priority" can be controlled by firmware or software. When
DCBX is controlled by firmware, changes of QoS settings cannot be done by the software. The DCBX
control mode is configured using the mlnx_qos -d os/fw command.
For further information on how to configure the DCBX control mode, please refer to mlnx_qos
Community post.

Quality of Service Tools

mlnx_qos

mlnx_qos is a centralized tool used to configure QoS features of the local host. It communicates
directly with the driver, and thus does not require setting up a DCBX daemon on the system.
The mlnx_qos tool enables the administrator of the system to:
• Inspect the current QoS mappings and configuration
The tool will also display maps configured by TC and vconfig set_egress_map tools, in order
to give a centralized view of all QoS mappings.
• Set UP to TC mapping
• Assign a transmission algorithm to each TC (strict or ETS)
• Set minimal BW guarantee to ETS TCs
• Set rate limit to TCs
• Set DCBX control mode
• Set cable length
• Set trust state


For an unlimited ratelimit, set the ratelimit to 0.

Usage

mlnx_qos -i <interface> \[options\]

Options

--version Show the program's version number and exit

-h, --help Show this help message and exit

-f LIST, --pfc=LIST Set priority flow control for each priority. LIST is
a comma-separated value for each priority, starting from
0 to 7. Example: 0,0,0,0,1,1,1,1 enables PFC on priorities 4-7

-p LIST, --prio_tc=LIST Maps UPs to TCs. LIST is 8 comma-separated TC numbers. Example:


0,0,0,0,1,1,1,1 maps UPs 0-3 to TC0, and UPs 4-7 to TC1

-s LIST, --tsa=LIST Transmission algorithm for each TC. LIST is comma separated
algorithm names for each TC. Possible algorithms: strict, ets and
vendor. Example: vendor,strict,ets,ets,ets,ets,ets,ets sets TC0 to
vendor, TC1 to strict, TC2-7 to ets.

-t LIST, --tcbw=LIST Set the minimally guaranteed %BW for ETS TCs. LIST is comma-
separated percents for each TC. Values set to TCs that are not
configured to the ETS algorithm are ignored, but must be present.
Example: if TC0 and TC2 are set to ETS, then 10,0,90,0,0,0,0,0 will set
TC0 to 10% and TC2 to 90%. Percents must sum to 100.

-r LIST, --ratelimit=LIST Rate limit for TCs (in Gbps). LIST is a comma-separated Gbps limit for
each TC. Example: 1,8,8 will limit TC0 to 1Gbps, and TC1,TC2 to 8
Gbps each.

-d DCBX, --dcbx=DCBX Set dcbx mode to firmware controlled (fw) or OS controlled (os). Note:
when in OS mode, mlnx_qos should not be used in parallel with other
dcbx tools, such as lldptool

--trust=TRUST set priority trust state to pcp or dscp

--dscp2prio=DSCP2PRIO Set/del a (dscp,prio) mapping. Example: 'set,30,2' maps dscp 30 to
priority 2. 'del,30,2' resets the dscp 30 mapping back to the default
setting, priority 0.

--cable_len=CABLE_LEN Set cable_len for buffer's xoff and xon thresholds

-i INTF, --interface=INTF Interface name

-a Show all interface's TCs

Get Current Configuration

ofed_scripts/utils/mlnx_qos -i ens1f0
DCBX mode: OS controlled
Priority trust state: dscp
dscp2prio mapping:
prio:0 dscp:07,06,05,04,03,02,01,00,
prio:1 dscp:15,14,13,12,11,10,09,08,
prio:2 dscp:23,22,21,20,19,18,17,16,
prio:3 dscp:31,30,29,28,27,26,25,24,
prio:4 dscp:39,38,37,36,35,34,33,32,
prio:5 dscp:47,46,45,44,43,42,41,40,
prio:6 dscp:55,54,53,52,51,50,49,48,
prio:7 dscp:63,62,61,60,59,58,57,56,

Cable len: 7
PFC configuration:
priority 0 1 2 3 4 5 6 7
enabled 0 0 0 0 0 0 0 0
tc: 0 ratelimit: unlimited, tsa: vendor
priority: 1
tc: 1 ratelimit: unlimited, tsa: vendor
priority: 0
tc: 2 ratelimit: unlimited, tsa: vendor
priority: 2
tc: 3 ratelimit: unlimited, tsa: vendor
priority: 3
tc: 4 ratelimit: unlimited, tsa: vendor
priority: 4
tc: 5 ratelimit: unlimited, tsa: vendor
priority: 5
tc: 6 ratelimit: unlimited, tsa: vendor
priority: 6
tc: 7 ratelimit: unlimited, tsa: vendor
priority: 7
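The default dscp2prio table shown above follows a simple rule: each priority owns eight consecutive DSCP values, i.e. prio = dscp >> 3. A short sketch reproducing the table:

```python
# Reproduce the default dscp2prio table: each priority owns eight
# consecutive DSCP values, so prio = dscp >> 3.
dscp2prio = {}
for dscp in range(64):
    dscp2prio.setdefault(dscp >> 3, []).append(dscp)

print(dscp2prio[2])  # [16, 17, 18, 19, 20, 21, 22, 23]
```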

Set ratelimit: 3Gbps for tc0, 4Gbps for tc1 and 2Gbps for tc2

# mlnx_qos -i <interface> -p 0,1,2 -r 3,4,2


tc: 0 ratelimit: 3 Gbps, tsa: strict
up: 0
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
up: 3
up: 4
up: 5
up: 6
up: 7
tc: 1 ratelimit: 4 Gbps, tsa: strict
up: 1
tc: 2 ratelimit: 2 Gbps, tsa: strict
up: 2

Configure QoS: map UP 0,7 to tc0, UP 1,2,3 to tc1 and UP 4,5,6 to tc2. Set tc0 and tc1 to ets and tc2 to
strict. Divide the ETS bandwidth 30% for tc0 and 70% for tc1.

# mlnx_qos -i <interface> -s ets,ets,strict -p 0,1,1,1,2,2,2 -t 30,70


tc: 0 ratelimit: 3 Gbps, tsa: ets, bw: 30%
up: 0
skprio: 0
skprio: 1
skprio: 2 (tos: 8)
skprio: 3
skprio: 4 (tos: 24)
skprio: 5
skprio: 6 (tos: 16)
skprio: 7
skprio: 8
skprio: 9
skprio: 10
skprio: 11
skprio: 12
skprio: 13
skprio: 14
skprio: 15
up: 7
tc: 1 ratelimit: 4 Gbps, tsa: ets, bw: 70%
up: 1
up: 2
up: 3
tc: 2 ratelimit: 2 Gbps, tsa: strict
up: 4
up: 5
up: 6

tc and tc_wrap.py

The tc tool is used to create 8 Traffic Classes (TCs).


The tool will either use sysfs (/sys/class/net/<ethX>/qos/tc_num) or the tc tool to create the
TCs.
In the case of RoCE on ConnectX-3/ConnectX-3 Pro old kernels, tc_wrap will enable mapping between
sk_prio and UP using sysfs (/sys/class/infiniband/mlx4_0/ports/<port_num>/skprio2up).

Usage 

tc_wrap.py -i <interface> [options]

Options

--version show program's version number and exit

-h, --help show this help message and exit

-u SKPRIO_UP, --skprio_up=SKPRIO_UP Maps sk_prio to priority for RoCE. LIST is up to 16 comma-
separated priorities; the index of each element is the sk_prio.

-i INTF, --interface=INTF Interface name

Example
Run: 

tc_wrap.py -i enp139s0

Output:

Traffic classes are set to 8


 
UP 0
skprio: 0 (vlan 5)
UP 1
skprio: 1 (vlan 5)
UP 2
skprio: 2 (vlan 5 tos: 8)
UP 3
skprio: 3 (vlan 5)
UP 4
skprio: 4 (vlan 5 tos: 24)
UP 5
skprio: 5 (vlan 5)
UP 6
skprio: 6 (vlan 5 tos: 16)
UP 7
skprio: 7 (vlan 5)

Additional Tools

A tc tool compiled with the sch_mqprio module is required to support kernel v2.6.32 or higher. This is
part of the iproute2 package v2.6.32-19 or higher. Otherwise, an alternative custom sysfs interface is
available.
• mlnx_qos tool (package: ofed-scripts) requires Python version 2.5 or later
• tc_wrap.py (package: ofed-scripts) requires Python version 2.5 or later

Packet Pacing

ConnectX-4 and above devices allow packet pacing (traffic shaping) per flow. This capability is
achieved by mapping a flow to a dedicated send queue and setting a rate limit on that send queue.
Note the following:
• Up to 512 send queues are supported
• 16 different rates are supported
• The rates can vary from 1 Mbps to line rate in 1 Mbps resolution
• Multiple queues can be mapped to the same rate (each queue is paced independently)
• It is possible to configure rate limit per CPU and per flow in parallel

System Requirements
• MLNX_EN, version 3.3
• Linux kernel 4.1 or higher
• ConnectX-4 or ConnectX-4 Lx adapter cards with a formal firmware version

Packet Pacing Configuration 


This configuration is non-persistent and does not survive driver restart.

1. Firmware Activation:
 To activate Packet Pacing in the firmware:
First, make sure Mellanox Firmware Tools service (mst) is started: 

# mst start

Then run: 

#echo "MLNX_RAW_TLV_FILE" > /tmp/mlxconfig_raw.txt


#echo “0x00000004 0x0000010c 0x00000000 0x00000001" >> /tmp/mlxconfig_raw.txt
#yes | mlxconfig -d <mst_dev> -f /tmp/mlxconfig_raw.txt set_raw > /dev/null
#reboot /mlxfwreset

 To deactivate Packet Pacing in the firmware, run:

# echo "MLNX_RAW_TLV_FILE" > /tmp/mlxconfig_raw.txt
# echo "0x00000004 0x0000010c 0x00000000 0x00000000" >> /tmp/mlxconfig_raw.txt
# yes | mlxconfig -d <mst_dev> -f /tmp/mlxconfig_raw.txt set_raw > /dev/null
# reboot (or run mlxfwreset)

2. Driver Activation:
There are two operation modes for Packet Pacing:

a. Rate limit per CPU core:


When XPS is enabled, traffic from a CPU core will be sent using the corresponding send
queue. By limiting the rate on that queue, the transmit rate on that CPU core will be
limited. For example: 

echo 300 > /sys/class/net/ens2f1/queues/tx-0/tx_maxrate

In this case, the rate on Core 0 (tx-0) is limited to 300Mbit/sec.


b. Rate limit per flow:
i.  The driver allows opening up to 512 additional send queues using the following
command: 

ethtool -L ens2f1 other 1200

In this case, 1200 additional queues are opened


ii. Create flow mapping.
Users can map a certain destination IP and/or destination layer 4 Port to a
specific send queue. The match precedence is as follows:

• IP + L4 Port
• IP only
• L4 Port only
• No match (the flow would be mapped to default queues)
To create flow mapping:
Configure the destination IP. Write the IP address in hexadecimal
representation to the relevant sysfs entry. For example, to map IP
address 192.168.1.1 (0xc0a80101) to send queue 310, run the following
command: 

echo 0xc0a80101 > /sys/class/net/ens2f1/queues/tx-310/flow_map/dst_ip

To map destination L4 port 3333 (either TCP or UDP) to the same queue, run:

echo 3333 > /sys/class/net/ens2f1/queues/tx-310/flow_map/dst_port

From this point on, all traffic destined to the given IP address and L4 port
will be sent using send queue 310. All other traffic will be sent using the
original send queue.

iii. Limit the rate of this flow using the following command: 

echo 100 > /sys/class/net/ens2f1/queues/tx-310/tx_maxrate

Note: Each queue supports only a single IP+Port combination.
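The hexadecimal value written to the dst_ip sysfs entry is simply the big-endian integer form of the dotted-quad address. A small helper for the conversion (ip_to_sysfs_hex is an illustrative name, not part of the driver):

```python
import socket
import struct

def ip_to_sysfs_hex(ip):
    """Hexadecimal form of a dotted-quad IP address, as written to
    /sys/class/net/<dev>/queues/tx-<N>/flow_map/dst_ip."""
    return hex(struct.unpack("!I", socket.inet_aton(ip))[0])

print(ip_to_sysfs_hex("192.168.1.1"))  # 0xc0a80101
```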

Quantized Congestion Notification (QCN)


Congestion control is used to reduce packet drops in lossy environments and mitigate congestion
spreading and resulting victim flows in lossless environments.
The Quantized Congestion Notification (QCN) IEEE standard (802.1Qau) provides congestion control
for long-lived flows in limited bandwidth-delay product Ethernet networks. It is part of the IEEE
Data Center Bridging (DCB) protocol suite, which also includes ETS, PFC, and DCBX. QCN is
conducted at L2 and is targeted for hardware implementations. QCN applies to all Ethernet packets
and all transports, and both the host and switch behavior is detailed in the standard.
QCN user interface allows the user to configure QCN activity. QCN configuration and retrieval of
information is done by the mlnx_qcn tool. The command interface provides the user with a set of
changeable attributes, and with information regarding QCN's counters and statistics. All parameters
and statistics are defined per port and priority. The QCN command interface is available if and only if the
hardware supports it.

QCN Tool - mlnx_qcn

mlnx_qcn is a tool used to configure QCN attributes of the local host. It communicates directly with
the driver, and thus does not require setting up a DCBX daemon on the system.
The mlnx_qcn enables the user to:
• Inspect the current QCN configurations for a certain port sorted by priority
• Inspect the current QCN statistics and counters for a certain port sorted by priority
• Set values of chosen QCN parameters

Usage

mlnx_qcn -i <interface> [options]

Options
--version Show program's version number and exit
-h, --help Show this help message and exit
-i INTF, --interface=INTF Interface name
-g TYPE, --get_type=TYPE Type of information to get statistics/parameters
--rpg_enable=RPG_ENABLE_LIST Set value of rpg_enable according to priority, use spaces
between values and -1 for unknown values.
--rppp_max_rps=RPPP_MAX_RPS_LIST Set value of rppp_max_rps according to priority, use spaces
between values and -1 for unknown values.
--rpg_time_reset=RPG_TIME_RESET_LIST Set value of rpg_time_reset according to priority, use
spaces between values and -1 for unknown values.
--rpg_byte_reset=RPG_BYTE_RESET_LIST Set value of rpg_byte_reset according to priority, use
spaces between values and -1 for unknown values.

--rpg_threshold=RPG_THRESHOLD_LIST Set value of rpg_threshold according to priority, use spaces
between values and -1 for unknown values.
--rpg_max_rate=RPG_MAX_RATE_LIST Set value of rpg_max_rate according to priority, use spaces
between values and -1 for unknown values.
--rpg_ai_rate=RPG_AI_RATE_LIST Set value of rpg_ai_rate according to priority, use spaces
between values and -1 for unknown values.
--rpg_hai_rate=RPG_HAI_RATE_LIST Set value of rpg_hai_rate according to priority, use spaces
between values and -1 for unknown values.
--rpg_gd=RPG_GD_LIST Set value of rpg_gd according to priority, use spaces
between values and -1 for unknown values.
--rpg_min_dec_fac=RPG_MIN_DEC_FAC_LIST Set value of rpg_min_dec_fac according to priority, use
spaces between values and -1 for unknown values.
--rpg_min_rate=RPG_MIN_RATE_LIST Set value of rpg_min_rate according to priority, use spaces
between values and -1 for unknown values.
--cndd_state_machine=CNDD_STATE_MACHINE_LIST Set value of cndd_state_machine according to
priority, use spaces between values and -1 for unknown values.

To get QCN current configuration sorted by priority, run: 

mlnx_qcn -i eth2 -g parameters

To show QCN's statistics sorted by priority, run: 

mlnx_qcn -i eth2 -g statistics

Output example of running mlnx_qcn -i eth2 -g parameters: 

priority 0:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
rpg_ai_rate: 10
rpg_hai_rate: 50
rpg_gd: 8
rpg_min_dec_fac: 2
rpg_min_rate: 10
cndd_state_machine: 0
priority 1:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
rpg_ai_rate: 10
rpg_hai_rate: 50
rpg_gd: 8
rpg_min_dec_fac: 2
rpg_min_rate: 10
cndd_state_machine: 0
.............................
.............................
priority 7:
rpg_enable: 0
rppp_max_rps: 1000
rpg_time_reset: 1464
rpg_byte_reset: 150000
rpg_threshold: 5
rpg_max_rate: 40000
rpg_ai_rate: 10
rpg_hai_rate: 50
rpg_gd: 8
rpg_min_dec_fac: 2

rpg_min_rate: 10
cndd_state_machine: 0

Setting QCN Configuration

Setting QCN parameters requires updating its value for each priority. '-1' indicates no change in the
current value.

Example for setting 'rpg_enable' in order to enable QCN for priorities 3, 5, 6: 

mlnx_qcn -i eth2 --rpg_enable=-1 -1 -1 1 -1 1 1 -1

Example for setting 'rpg_hai_rate' for priorities 1, 6, 7: 

mlnx_qcn -i eth2 --rpg_hai_rate=60 -1 -1 -1 -1 -1 60 60
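Composing these per-priority lists by hand is error-prone; a small hypothetical helper (qcn_list is not part of the mlnx_qcn package) can build them:

```python
def qcn_list(values_by_prio, num_prios=8):
    """Build the space-separated per-priority list expected by mlnx_qcn.
    values_by_prio maps priority -> value; -1 means "leave unchanged"."""
    return " ".join(str(values_by_prio.get(p, -1)) for p in range(num_prios))

# Enable QCN (rpg_enable=1) on priorities 3, 5 and 6 only:
print(qcn_list({3: 1, 5: 1, 6: 1}))  # -1 -1 -1 1 -1 1 1 -1
```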

Ethtool
Ethtool is a standard Linux utility for controlling network drivers and hardware, particularly for
wired Ethernet devices. It can be used to:
• Get identification and diagnostic information
• Get extended device statistics
• Control speed, duplex, auto-negotiation and flow control for Ethernet devices
• Control checksum offload and other hardware offload features
• Control DMA ring sizes and interrupt moderation
The following are the ethtool supported options:

Ethtool Supported Options


Options Description

ethtool --set-priv-flags eth<x> <priv flag> <on/off>   Enables/disables the driver feature matching the
given private flag.


ethtool --show-priv-flags eth<x> Shows driver private flags and their states (ON/OFF)

The private flags are:


• qcn_disable_32_14_4_e
• disable_mc_loopback - when this flag is on,
multicast traffic is not redirected to the device by
loopback.
• mlx4_flow_steering_ethernet_l2
• mlx4_flow_steering_ipv4
• mlx4_flow_steering_tcp

For further information regarding the last three flags,


refer to Flow Steering section.

The flags below are related to Ignore Frame Check


Sequence, and they are active when ethtool -k does not
support them:
• orx-fcs
• orx-all

The flags below are relevant for ConnectX-4 family cards only:
• rx_cqe_compress - used to control CQE
compression. It is initialized with the automatic
driver decision.
• per_channel_stats - used to control whether to
expose per channel statistics counters in ethtool –
S.

ethtool -a eth<x>   Note: Supported in ConnectX®-3/ConnectX®-3 Pro cards only.

Queries the pause frame settings.

ethtool -A eth<x> [rx on|off] [tx on|off] Note: Supported in ConnectX®-3/ConnectX®-3 Pro cards
only.

Sets the pause frame settings.

ethtool -c eth<x> Queries interrupt coalescing settings.

ethtool -C eth<x> [pkt-rate-low N] [pkt-rate-high N] [rx-usecs-low N] [rx-usecs-high N]   Note:
Supported in ConnectX®-3/ConnectX®-3 Pro cards only.

Sets the values for packet rate limits and for moderation time high and low values.


ethtool -C eth<x> [rx-usecs N] [rx-frames N] Sets the interrupt coalescing setting.

rx-frames will be enforced immediately, rx-usecs will be


enforced only when adaptive moderation is disabled.

Note: usec settings correspond to the time to wait after


the *last* packet is sent/received before triggering an
interrupt.

ethtool -C eth<x> adaptive-rx on|off   Note: Supported in ConnectX®-3/ConnectX®-3 Pro cards only.

Enables/disables adaptive interrupt moderation.

By default, the driver uses adaptive interrupt moderation


for the receive path, which adjusts the moderation time
to the traffic pattern.

ethtool -C eth<x> adaptive-tx on|off   Note: Supported in ConnectX-4/ConnectX-4 Lx/ConnectX-5 cards only.

Enables/disables adaptive interrupt moderation.

By default, the driver uses adaptive interrupt moderation


for the transmit path, which adjusts the moderation
parameters (time/frames) to the traffic pattern.

ethtool -g eth<x> Queries the ring size values.

ethtool -G eth<x> [rx <N>] [tx <N>] Modifies the ring size.

ethtool -i eth<x> Checks driver and device information.

 For example:
 #> ethtool -i eth2
 driver: mlx4_en (MT_0DD0120009_CX3)
 version: 2.1.6 (Aug 2013)
 firmware-version: 2.30.3000
 bus-info: 0000:1a:00.0

ethtool -k eth<x> Queries the stateless offload status.


ethtool -K eth<x> [rx on|off] [tx on|off] [sg on|off] [tso on|off] [lro on|off] [gro on|off]
[gso on|off] [rxvlan on|off] [txvlan on|off] [ntuple on/off] [rxhash on/off] [rx-all on/off]
[rx-fcs on/off]   Sets the stateless offload status.

TCP Segmentation Offload (TSO), Generic Segmentation Offload (GSO): increase outbound
throughput by reducing CPU overhead. They work by queuing up large buffers and
letting the network interface card split them into
separate packets.

Large Receive Offload (LRO): increases inbound


throughput of high-bandwidth network connections by
reducing CPU overhead. It works by aggregating multiple
incoming packets from a single stream into a larger buffer
before they are passed higher up the networking stack,
thus reducing the number of packets that have to be
processed. LRO is available in kernel versions < 3.1 for
untagged traffic.

Hardware VLAN insertion Offload (txvlan): When enabled,


the sent VLAN tag will be inserted into the packet by the
hardware.

Note: LRO will be done whenever possible. Otherwise


GRO will be done. Generic Receive Offload (GRO) is
available throughout all kernels.

Hardware VLAN Striping Offload (rxvlan): When enabled


received VLAN traffic will be stripped from the VLAN tag
by the hardware.

RX FCS (rx-fcs): Keeps FCS field in the received


packets.Sets the stateless offload status.

RX FCS validation (rx-all): Ignores FCS validation on the


received packets.

ethtool -l eth<x> Shows the number of channels.

ethtool -L eth<x> [rx <N>] [tx <N>] Sets the number of channels.

Notes:
• This also resets the RSS table to its default
distribution, which is uniform across the cores on
the NUMA (non-uniform memory access) node that
is closer to the NIC.
• For ConnectX®-4 cards, use ethtool -L eth<x>
combined <N> to set both RX and TX channels.

ethtool -m|--dump-module-eeprom eth<x> [ raw on|off ] [ hex on|off ] [ offset N ] [ length N ]
Queries/decodes the cable module eeprom information.

ethtool -p|--identify DEVNAME Enables visual identification of the port by LED blinking
[TIME-IN-SECONDS].


ethtool -p|--identify eth<x> <LED duration> Allows users to identify interface's physical port by
turning the ports LED on for a number of seconds.

Note: The limit for the LED duration is 65535 seconds.

ethtool -S eth<x> Obtains additional device statistics.

ethtool -s eth<x> advertise <N> autoneg on   Changes the advertised link modes to the requested
link modes <N>.

To check the link modes' hex values, run man ethtool; to check the supported link modes,
run ethtool eth<x>.

For advertising new link modes, make sure to configure


the entire bitmap as follows:
200GAUI-4 / 200GBASE-CR4/ 0x7c000000000000000
KR4
100GAUI-2 / 100GBASE-CR2 / 0x3E00000000000000
KR2
CAUI-4 / 100GBASE-CR4 / 0xF000000000
KR4
50GAUI-1 / LAUI-1/ 0x1F0000000000000
50GBASE-CR / KR
50GAUI-2 / LAUI-2/ 0x10C00000000
50GBASE-CR2/KR2
XLAUI-4/XLPPI-4 // 40G 0x7800000
25GAUI-1/ 25GBASE-CR / KR 0x380000000
XFI / XAUI-1 // 10G 0x7C0000181000
5GBASE-R 0x1000000000000
2.5GBASE-X / 2.5GMII 0x820000000000
1000BASE-X / SGMII 0x20000020020

Notes:
• Both previous and new link modes configurations
are supported, however, they must be run
separately.
• Any link mode configuration on Kernels below v5.1
and ConnectX-6 HCAs will result in the
advertisement of the full capabilities.
• <autoneg on> only sends a hint to the driver that
the user wants to modify advertised link modes
and not speed.
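The advertise bitmap for multiple link modes is the bitwise OR of the per-mode masks from the table above. For example, combining the 40G and 25G masks:

```python
# Per-mode masks taken from the table above.
MODE_40G = 0x7800000      # XLAUI-4/XLPPI-4 // 40G
MODE_25G = 0x380000000    # 25GAUI-1 / 25GBASE-CR / KR

# Advertise both modes by OR-ing their masks.
advertise = MODE_40G | MODE_25G
print(hex(advertise))  # 0x387800000
# Pass this value to: ethtool -s eth<x> advertise 0x387800000 autoneg on
```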

ethtool -s eth<x> msglvl [N] Changes the current driver message level.


ethtool -s eth<x> speed <SPEED> autoneg off Changes the link speed to requested <SPEED>. To check
the supported speeds, run ethtool eth<x>.

Note: <autoneg off> does not set autoneg OFF, it only


hints the driver to set a specific speed.

ethtool -t eth<x> Performs a self-diagnostics test.

ethtool -T eth<x>   Note: Supported in ConnectX®-3/ConnectX®-3 Pro cards only.

Shows time stamping capabilities.

ethtool -x eth<x> Retrieves the receive flow hash indirection table.

ethtool -X eth<x> equal a b c... Sets the receive flow hash indirection table.

Note: The RSS table configuration is reset whenever the


number of channels is modified (using ethtool -L
command).
ethtool --show-fec eth<x>
Queries current Forward Error Correction (FEC) encoding
in case FEC is supported.

Note: An output of "baser" implies Firecode encoding.


ethtool --set-fec eth<x> encoding auto|off|rs|baser   Configures Forward Error Correction (FEC).

Note: 'baser' encoding applies to the Firecode encoding, and 'auto' regards the HCA's default.

Checksum Offload
MLNX_EN supports the following Receive IP/L4 Checksum Offload modes:
• CHECKSUM_UNNECESSARY: By setting this mode the driver indicates to the Linux Networking
Stack that the hardware successfully validated the IP and L4 checksum so the Linux
Networking Stack does not need to deal with IP/L4 Checksum validation.
Checksum Unnecessary is passed to the OS when all of the following are true:
• Ethtool -k <DEV> shows rx-checksumming: on
• Received TCP/UDP packet and both IP checksum and L4 protocol checksum are correct.

• CHECKSUM_COMPLETE: When the checksum validation cannot be done or fails, the driver still
reports to the OS the checksum value calculated by the hardware. This accelerates
checksum validation in the Linux Networking Stack, since it does not have to calculate the whole
checksum, including the payload, by itself.
Checksum Complete is passed to OS when all of the following are true:

• Ethtool -k <DEV> shows rx-checksumming: on
• Using ConnectX®-3, firmware version 2.31.7000 and up
• Received IpV4/IpV6 non-TCP/UDP packet


The ingress parser of the ConnectX®-3-Pro card comes by default without checksum
offload support for non-TCP/UDP packets.
To change that, please set the value of the module parameter ingress_parser_mode
in mlx4_core to 1.
In this mode, IPv4/IPv6 non-TCP/UDP packets will be passed up to the protocol stack with
CHECKSUM_COMPLETE tag.
In this mode of the ingress parser, the following features are unavailable:
• NVGRE stateless offloads
• VXLAN stateless offloads
• RoCE v2 (RoCE over UDP)

Change the default behavior only if non-TCP/UDP traffic is very common.

• CHECKSUM_NONE: By setting this mode the driver indicates to the Linux Networking Stack
that the hardware failed to validate the IP or L4 checksum so the Linux Networking Stack
must calculate and validate the IP/L4 Checksum.
Checksum None is passed to OS for all other cases. 
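The IP/L4 checksum that these modes refer to is the standard 16-bit one's-complement Internet checksum (RFC 1071). A minimal reference implementation of that arithmetic, for illustration:

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement Internet checksum (RFC 1071)."""
    if len(data) % 2:          # pad odd-length data with a zero byte
        data += b"\x00"
    # Sum all 16-bit big-endian words.
    total = sum(int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2))
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

# IPv4 header with the checksum field zeroed; the computed value goes
# into that field. A packet whose checksum field already holds the
# correct value sums to 0xFFFF, so the function returns 0 over it.
hdr = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
print(hex(internet_checksum(hdr)))  # 0xb861
```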

Ignore Frame Check Sequence (FCS) Errors


This feature is supported in ConnectX-3 Pro and ConnectX-4 cards only.

Upon receiving packets, the packets go through a checksum validation process for the FCS field. If
the validation fails, the received packets are dropped.
When the Ignore FCS feature is enabled (it is disabled by default), the device does not drop packets
even if the FCS field is invalid.
Enabling this feature is not recommended.
For further information on how to enable/disable it, please refer to the ethtool option rx-fcs on/off.

RDMA over Converged Ethernet (RoCE)


Remote Direct Memory Access (RDMA) is the remote memory management capability that allows
server-to-server data movement directly between application memory without any CPU
involvement. RDMA over Converged Ethernet (RoCE) is a mechanism to provide this efficient data
transfer with very low latencies on lossless Ethernet networks. With advances in data center
convergence over reliable Ethernet, ConnectX® Ethernet adapter cards family with RoCE uses the
proven and efficient RDMA transport to provide the platform for deploying RDMA technology in
mainstream data center application at 10GigE and 40GigE link-speed. ConnectX® Ethernet adapter
cards family with its hardware offload support takes advantage of this efficient RDMA transport

(InfiniBand) services over Ethernet to deliver ultra-low latency for performance-critical and
transaction-intensive applications such as financial, database, storage, and content delivery
networks.
When working with RDMA applications over Ethernet link layer the following points should be noted:
• The presence of a Subnet Manager (SM) is not required in the fabric. Thus, operations that
require communication with the SM are managed in a different way in RoCE. This does not
affect the API but only the actions such as joining the multicast group, that need to be taken
when using the API
• Since LID is a layer 2 attribute of the InfiniBand protocol stack, it is not set for a port and is
displayed as zero when querying the port
• With RoCE, the alternate path is not set for RC QP. Therefore, APM (another type of High
Availability and part of the InfiniBand protocol) is not supported
• Since the SM is not present, querying a path is impossible. Therefore, the path record
structure must be filled with relevant values before establishing a connection. Hence, it is
recommended working with RDMA-CM to establish a connection as it takes care of filling the
path record structure
• VLAN tagged Ethernet frames carry a 3-bit priority field. The value of this field is derived
from the IB SL field by taking the 3 least significant bits of the SL field
• RoCE traffic is not shown in the associated Ethernet device's counters since it is offloaded by
the hardware and does not go through Ethernet network driver. RoCE traffic is counted in the
same place where InfiniBand traffic is counted; /sys/class/infiniband/<device>/ports/<port
number>/counters/ 
For further information on RoCE usage, please refer to MLNX_OFED User Manual.

Flow Control

Priority Flow Control (PFC)

Priority Flow Control (PFC) IEEE 802.1Qbb applies pause functionality to specific classes of traffic on
the Ethernet link. For example, PFC can provide lossless service for the RoCE traffic and best-effort
service for the standard Ethernet traffic. PFC can provide different levels of service to specific
classes of Ethernet traffic (using IEEE 802.1p traffic classes).

PFC Local Configuration

Configuring PFC on ConnectX-3

Set the pfctx and pfcrx mlx4_en module parameters in the file /etc/modprobe.d/mlx4_en.conf: 

options mlx4_en pfctx=0x08 pfcrx=0x08

Note: These parameters are 8-bit bitmaps. In the example above, only priority 3 is enabled (0x08
is 00001000b). 0x10 would enable priority 4, and so on. 
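As a quick sanity check, the bitmap arithmetic above can be sketched in Python (pfc_bitmap is a hypothetical helper for illustration, not part of the driver):

```python
def pfc_bitmap(priorities):
    """Build the 8-bit pfctx/pfcrx bitmap from a list of priorities (0-7)."""
    bitmap = 0
    for p in priorities:
        if not 0 <= p <= 7:
            raise ValueError("PFC priority must be 0-7, got %d" % p)
        bitmap |= 1 << p  # bit N corresponds to priority N
    return bitmap

# Priority 3 -> 0x08 (00001000b), matching the module-parameter example above
print(hex(pfc_bitmap([3])))     # 0x8
print(hex(pfc_bitmap([4])))     # 0x10
print(hex(pfc_bitmap([3, 4])))  # 0x18
```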

Configuring PFC on ConnectX-4
1. Enable PFC on the desired priority: 

mlnx_qos -i <ethX> --pfc <0/1>,<0/1>,<0/1>,<0/1>,<0/1>,<0/1>,<0/1>,<0/1>

Example (Priority=4): 

mlnx_qos -i eth1 --pfc 0,0,0,0,1,0,0,0

2. Create a VLAN interface: 

vconfig add <ethX> <VLAN_ID>

Example (VLAN_ID=5): 

vconfig add eth1 5

3. Set egress mapping:


a. For Ethernet traffic: 

vconfig set_egress_map <vlan_interface> <skprio> <up>

Example (skprio=3, up=5): 

vconfig set_egress_map eth1.5 3 5

4. Create 8 Traffic Classes (TCs): 

tc_wrap.py -i <interface>

5. Enable PFC on the switch.


For information on how to enable PFC on your respective switch, please refer to the Switch FC/PFC Configuration sections in the following Mellanox Community page: https://support.mellanox.com/docs/DOC-2283.

PFC Configuration Using LLDP DCBX

PFC Configuration on Hosts

PFC Auto-Configuration Using LLDP Tool in the OS
1. Start lldpad daemon on host. 

lldpad -d

or:

service lldpad start

2. Send lldpad packets to the switch. 

lldptool set-lldp -i <ethX> adminStatus=rxtx ;


lldptool -T -i <ethX> -V sysName enableTx=yes;
lldptool -T -i <ethX> -V portDesc enableTx=yes ;
lldptool -T -i <ethX> -V sysDesc enableTx=yes
lldptool -T -i <ethX> -V sysCap enableTx=yes

lldptool -T -i <ethX> -V mngAddr enableTx=yes
lldptool -T -i <ethX> -V PFC enableTx=yes;
lldptool -T -i <ethX> -V CEE-DCBX enableTx=yes;

3. Set the PFC parameters.

• For the CEE protocol, use dcbtool: 

dcbtool sc <ethX> pfc pfcup:<xxxxxxxx>

Example: 

dcbtool sc eth6 pfc pfcup:01110001

where:
pfcup:xxxxxxxx - Enables/disables priority flow control. From left to right (priorities 0-7), each x can be either 0 or 1; 1 indicates that the priority is configured to transmit priority pause.

• For IEEE protocol, use lldptool: 

lldptool -T -i <ethX> -V PFC enabled=x,x,x,x,x,x,x,x 

Example: 

lldptool -T -i eth2 -V PFC enabled=1,2,4 

where:
enabled - Displays or sets the priorities with PFC enabled. The set attribute takes a comma-separated list of priorities to enable, or the string none to disable all priorities.
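The dcbtool pfcup string and the lldptool priority list encode the same information; the following Python sketch (with hypothetical helper names) converts between the two representations:

```python
def priorities_to_pfcup(priorities):
    """Render a dcbtool pfcup string: positions left-to-right are priorities 0-7."""
    enabled = set(priorities)
    return "".join("1" if p in enabled else "0" for p in range(8))

def pfcup_to_priorities(pfcup):
    """Parse a pfcup string back into the list of PFC-enabled priorities."""
    if len(pfcup) != 8 or set(pfcup) - {"0", "1"}:
        raise ValueError("pfcup must be 8 characters of 0/1")
    return [p for p, bit in enumerate(pfcup) if bit == "1"]

# The dcbtool example above: pfcup:01110001 enables priorities 1, 2, 3 and 7
print(pfcup_to_priorities("01110001"))    # [1, 2, 3, 7]
print(priorities_to_pfcup([1, 2, 3, 7]))  # 01110001
```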

PFC Auto-Configuration Using LLDP in the Firmware (for mlx5 driver)

There are two ways to configure PFC and ETS on the server:
1. Local Configuration - Configuring each server manually.
2. Remote Configuration - Configuring PFC and ETS on the switch, after which the switch will
pass the configuration to the server using LLDP DCBX TLVs.
There are two ways to implement the remote configuration using ConnectX-4 adapters:
a. Configuring the adapter firmware to enable DCBX.
b. Configuring the host to enable DCBX.

For further information on how to auto-configure PFC using LLDP in the firmware, refer to the
HowTo Auto-Config PFC and ETS on ConnectX-4 via LLDP DCBX Community post.

PFC Configuration on Switches


1. In order to enable DCBX, LLDP should first be enabled: 

switch (config) # lldp


show lldp interfaces ethernet remote

2. Add DCBX to the list of supported TLVs per required interface.
For IEEE DCBX: 

switch (config) # interface 1/1


switch (config interface ethernet 1/1) # lldp tlv-select dcbx

For CEE DCBX: 

switch (config) # interface 1/1


switch (config interface ethernet 1/1) # lldp tlv-select dcbx-cee

3. [Optional] Application Priority can be configured on the switch, with the required ethertype and priority. For example, IP packet, priority 1:  

switch (config) # dcb application-priority 0x8100 1

4. Make sure PFC is enabled on the host (for enabling PFC on the host, refer to PFC
Configuration on Hosts section above). Once it is enabled, it will be passed in the LLDP TLVs.
5. Enable PFC with the desired priority on the Ethernet port. 

dcb priority-flow-control enable force


dcb priority-flow-control priority <priority> enable
interface ethernet <port> dcb priority-flow-control mode on force

Example - Enabling PFC with priority 3 on port 1/1: 

dcb priority-flow-control enable force


dcb priority-flow-control priority 3 enable
interface ethernet 1/1 dcb priority-flow-control mode on force

Priority Counters

The MLNX_EN driver supports several ingress and egress counters per priority. Run ethtool -S to get the
full list of port counters.

ConnectX-3 and ConnectX-4 Counters


• Rx and Tx Counters:
• Packets
• Bytes
• Octets
• Frames
• Pause
• Pause frames
• Pause Duration
• Pause Transition

ConnectX-3 Example

# ethtool -S eth1 | grep prio_3

rx_prio_3_packets: 5152
rx_prio_3_bytes: 424080
tx_prio_3_packets: 328209
tx_prio_3_bytes: 361752914
rx_pause_prio_3: 14812
rx_pause_duration_prio_3: 0
rx_pause_transition_prio_3: 0
tx_pause_prio_3: 0
tx_pause_duration_prio_3: 47848
tx_pause_transition_prio_3: 7406

ConnectX-4 Example

# ethtool -S eth35 | grep prio4


prio4_rx_octets: 62147780800
prio4_rx_frames: 14885696
prio4_tx_octets: 0
prio4_tx_frames: 0
prio4_rx_pause: 0
prio4_rx_pause_duration: 0
prio4_tx_pause: 26832
prio4_tx_pause_duration: 14508
prio4_rx_pause_transition: 0

Note: The Pause counters in ConnectX-4 are visible via ethtool only for priorities on which PFC is
enabled.
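The per-priority counters can also be collected programmatically by parsing ethtool -S output. A minimal Python sketch, where parse_priority_counters is a hypothetical helper and the sample text mimics the ConnectX-4 output above:

```python
def parse_priority_counters(ethtool_output, priority):
    """Collect per-priority counters from `ethtool -S` text output.

    Matches both the ConnectX-3 style (..._prio_3...) and the
    ConnectX-4 style (prio4_...) counter names.
    """
    tags = ("prio_%d" % priority, "prio%d" % priority)
    counters = {}
    for line in ethtool_output.splitlines():
        line = line.strip()
        if ":" not in line:
            continue
        name, _, value = line.partition(":")
        name = name.strip()
        if any(t in name for t in tags):
            counters[name] = int(value)
    return counters

# Sample lines modeled on the ConnectX-4 example above
sample = """
prio4_rx_octets: 62147780800
prio4_rx_frames: 14885696
prio4_tx_pause: 26832
rx_packets: 123456
"""
print(parse_priority_counters(sample, 4))
```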

PFC Storm Prevention


This feature is applicable to ConnectX-4/ConnectX-5 adapter cards family only.

PFC storm prevention enables toggling between default and auto modes.
The stall prevention timeout is configured to 8 seconds by default. Auto mode sets the stall
prevention timeout to be 100 msec.
The feature can be controlled using sysfs in the following directory: /sys/class/net/eth*/settings/
pfc_stall_prevention
• To query the PFC stall prevention mode: 

cat /sys/class/net/eth*/settings/pfc_stall_prevention

Example 

$ cat /sys/class/net/ens6/settings/pfc_stall_prevention
default

• To configure the PFC stall prevention mode: 

echo "auto"/"default" > /sys/class/net/eth*/settings/pfc_stall_prevention
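Reading and writing this sysfs node can be wrapped as follows. This is a minimal Python sketch: the helper names and the default interface name ens6 are illustrative only, and writing the real node requires root on a system with a supported adapter:

```python
DEFAULT_NODE = "/sys/class/net/ens6/settings/pfc_stall_prevention"  # illustrative path

def set_pfc_stall_prevention(mode, path=DEFAULT_NODE):
    """Write the stall-prevention mode ("auto" or "default") to the sysfs node."""
    if mode not in ("auto", "default"):
        raise ValueError("mode must be 'auto' or 'default'")
    with open(path, "w") as f:
        f.write(mode)

def get_pfc_stall_prevention(path=DEFAULT_NODE):
    """Read back the currently configured stall-prevention mode."""
    with open(path) as f:
        return f.read().strip()
```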

The following two counters were added to ethtool -S:

• tx_pause_storm_warning_events - when the device is stalled for a period longer than a pre-configured watermark, the counter increases, giving the debug utility insight into the current device status.

• tx_pause_storm_error_events - when the device is stalled for a period longer than a pre-configured timeout, pause transmission is disabled and the counter increases.

Dropless Receive Queue (RQ)


This feature is applicable to ConnectX-4/ConnectX-5 adapter cards family only.

The Dropless RQ feature enables the driver to notify the FW when SW receive queues are overloaded.
This scenario takes place when the handling of SW receive queues is slower than the handling of the
HW receive queues.
When this feature is enabled, a packet received while the receive queue is full is not immediately
dropped. The FW accumulates these packets, assuming posting of new WQEs will resume shortly. If
receive WQEs are not posted after a certain period of time, the out_of_buffer counter increases,
indicating that the packet has been dropped.
This feature is disabled by default. In order to activate it, ensure that Flow Control feature is also
enabled.

To enable the feature, run: 

ethtool --set-priv-flags ens6 dropless_rq on

To get the feature state, run: 

ethtool --show-priv-flags DEVNAME

Output example: 

Private flags for DEVNAME:


rx_cqe_moder : on
rx_cqe_compress: off
sniffer : off
dropless_rq : off
hw_lro : off

To disable the feature, run: 

ethtool --set-priv-flags ens6 dropless_rq off

Explicit Congestion Notification (ECN)


This feature is supported on ConnectX-4 adapter cards family and above only.

ECN is an extension to the IP protocol. It allows reliable communication by notifying all ends of
communication when congestion occurs. This is done without dropping packets.

Please note that this feature requires all nodes in the path (hosts, routers, etc.) between the
communicating nodes to support ECN to ensure reliable communication. ECN is marked as 2 bits in
the Traffic Class field of the IP header. This ECN implementation refers to RoCE v2.

Enabling ECN on ConnectX-4/ConnectX-4 Lx/ConnectX-5

To enable ECN on the hosts:


1. Enable ECN in sysfs. 

echo 1 > /sys/class/net/<interface>/ecn/<protocol>/enable/<priority>

2. Query the attribute. 

cat /sys/class/net/<interface>/ecn/<protocol>/params/<requested attribute>

3. Modify the attribute. 

echo <value> /sys/class/net/<interface>/ecn/<protocol>/params/<requested attribute>

ECN supports the following algorithms:


• r_roce_ecn_rp - Reaction point
• r_roce_ecn_np - Notification point

Each algorithm has a set of relevant parameters and statistics, which are defined per device, per
port, per priority.

To query whether ECN is enabled per Priority X: 

cat /sys/class/net/<interface>/ecn/<protocol>/enable/X

To read ECN configurable parameters: 

cat /sys/class/net/<interface>/ecn/<protocol>/params/<requested attribute>

To enable ECN for each priority per protocol: 

echo 1 > /sys/class/net/<interface>/ecn/<protocol>/enable/X

To modify ECN configurable parameters: 

echo <value> > /sys/class/net/<interface>/ecn/<protocol>/params/<requested attribute>

where:

• X: priority {0..7}
• protocol: roce_rp / roce_np
• requested attributes: the set of configurable parameters defined for each protocol
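The sysfs paths above can be built and written programmatically. A minimal Python sketch, where the helper names are hypothetical and writing the node requires root and an mlx5 interface:

```python
def ecn_enable_path(interface, protocol, priority):
    """Build the sysfs path that enables ECN for one priority of one protocol."""
    if protocol not in ("roce_rp", "roce_np"):
        raise ValueError("protocol must be roce_rp or roce_np")
    if not 0 <= priority <= 7:
        raise ValueError("priority must be 0-7")
    return "/sys/class/net/%s/ecn/%s/enable/%d" % (interface, protocol, priority)

def enable_ecn(interface, protocol, priority):
    """Equivalent of `echo 1 > <path>`; requires root and a real mlx5 interface."""
    with open(ecn_enable_path(interface, protocol, priority), "w") as f:
        f.write("1")

print(ecn_enable_path("eth2", "roce_np", 3))
# /sys/class/net/eth2/ecn/roce_np/enable/3
```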

RSS Support

RSS Hash Function

The device has the ability to use XOR as the RSS distribution function, instead of the default Toeplitz
function.
The XOR function achieves better distribution among the driver's receive queues when there is a
small number of streams, directing each TCP/UDP stream to a different queue.
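To illustrate why an XOR-based hash separates a small number of streams well, the following toy Python sketch folds the flow 4-tuple with XOR and picks a queue. It is a conceptual illustration only, not the device's actual hash algorithm:

```python
def xor_rss_queue(src_ip, dst_ip, src_port, dst_port, num_queues):
    """Toy XOR distribution: fold the flow tuple with XOR, then take modulo.

    A change in one tuple field (e.g. the source port) tends to move the
    flow to a different fold value, and hence a different receive queue.
    """
    def ip_words(ip):
        a, b, c, d = (int(x) for x in ip.split("."))
        return ((a << 8) | b) ^ ((c << 8) | d)

    fold = ip_words(src_ip) ^ ip_words(dst_ip) ^ src_port ^ dst_port
    return fold % num_queues

# Two streams differing only in source port land on different queues
q1 = xor_rss_queue("1.2.3.4", "5.6.7.8", 1000, 80, 8)
q2 = xor_rss_queue("1.2.3.4", "5.6.7.8", 1001, 80, 8)
print(q1, q2)
```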

mlx4 RSS Hash Function

MLNX_EN provides one of the following options to change the working RSS hash function from Toeplitz
to XOR, and vice versa:
• Through ethtool priv-flags, in case mlx4_rss_xor_hash_function is part of the priv-flags
list. 

ethtool --set-priv-flags eth<x> mlx4_rss_xor_hash_function on/off

• Through ethtool, provided as part of the MLNX_EN package, in case mlx4_rss_xor_hash_function
is not part of the priv-flags list: 

/opt/mellanox/ethtool# ./sbin/ethtool -X ens4 hfunc xor


/opt/mellanox/ethtool# ./sbin/ethtool --show-rxfh ens4

Output: 

RX flow hash indirection table for ens4 with 8 RX ring(s):


0: 0 1 2 3 4 5 6 7
RSS hash key:
7c:9c:37:de:18:dc:43:86:d9:27:0f:6f:26:03:74:b8:bf:d0:40:4b:78:72:e2:24:dc:1b:91:bb:01:1b:a7:a6:37
:6c:c8:7e:d6:e3:14:17
RSS hash function:
toeplitz: off
xor : on

For further information, please refer to Ethtool Supported Options table.

mlx5 RSS Hash Function

MLNX_EN provides the following option to change the working RSS hash function from Toeplitz to
XOR, and vice versa:

Through sysfs, located at: /sys/class/net/eth*/settings/hfunc.

To query the operational and supported hash functions: 

cat /sys/class/net/eth*/settings/hfunc

Example: 

cat /sys/class/net/eth2/settings/hfunc
Operational hfunc: toeplitz
Supported hfuncs: xor toeplitz

 To change the operational hash function: 

echo xor > /sys/class/net/eth*/settings/hfunc

RSS Support for IP Fragments


Supported in ConnectX-3 and ConnectX-3 Pro only.

As of MLNX_EN v2.4-1.0.0, RSS distributes incoming IP fragmented datagrams according to its
hash function, considering the L3 IP header values. Different IP fragmented datagram flows are
directed to different rings.


When the first packet in an IP fragment chain contains the upper-layer transport header (e.g. UDP
packets larger than the MTU), it is directed to the same target as the subsequent IP
fragments that follow it, to prevent out-of-order processing.

Time-Stamping

Time-Stamping Service

Time-stamping is the process of keeping track of the creation of a packet. A time-stamping service
supports assertions of proof that a datum existed before a particular time. Incoming packets are
time-stamped before they are distributed on the PCI depending on the congestion in the PCI buffers.
Outgoing packets are time-stamped very close to placing them on the wire.

Enabling Time-Stamping

Time-stamping is off by default and should be enabled before use.

To enable time-stamping for a socket:

Call setsockopt() with SO_TIMESTAMPING and with the following flags:

SOF_TIMESTAMPING_TX_HARDWARE: try to obtain send time-stamp in hardware

SOF_TIMESTAMPING_TX_SOFTWARE: if SOF_TIMESTAMPING_TX_HARDWARE is off or fails, then do it in software

SOF_TIMESTAMPING_RX_HARDWARE: return the original, unmodified time-stamp as generated by the hardware

SOF_TIMESTAMPING_RX_SOFTWARE: if SOF_TIMESTAMPING_RX_HARDWARE is off or fails, then do it in software

SOF_TIMESTAMPING_RAW_HARDWARE: return original raw hardware time-stamp

SOF_TIMESTAMPING_SYS_HARDWARE: return hardware time-stamp transformed into the system time base

SOF_TIMESTAMPING_SOFTWARE: return system time-stamp generated in software

SOF_TIMESTAMPING_TX/RX determine how time-stamps are generated.
SOF_TIMESTAMPING_RAW/SYS determine how they are reported.
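A minimal Python sketch of enabling time-stamping on a socket with setsockopt(). It uses only the software flags, since the hardware flags require a supporting NIC; the numeric constants are taken from the Linux UAPI headers and assume a Linux host:

```python
import socket

# Linux UAPI values (linux/net_tstamp.h). SO_TIMESTAMPING is 37 on common
# architectures; prefer the socket-module constant when it is exposed.
SO_TIMESTAMPING = getattr(socket, "SO_TIMESTAMPING", 37)
SOF_TIMESTAMPING_TX_SOFTWARE = 1 << 1
SOF_TIMESTAMPING_RX_SOFTWARE = 1 << 3
SOF_TIMESTAMPING_SOFTWARE = 1 << 4

def enable_sw_timestamps(sock):
    """Request software TX/RX time-stamp generation and software reporting."""
    flags = (SOF_TIMESTAMPING_TX_SOFTWARE
             | SOF_TIMESTAMPING_RX_SOFTWARE
             | SOF_TIMESTAMPING_SOFTWARE)
    sock.setsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING, flags)
    # Read back the flags the kernel accepted
    return sock.getsockopt(socket.SOL_SOCKET, SO_TIMESTAMPING)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
print(enable_sw_timestamps(s))
s.close()
```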

To enable time-stamping for a net device:

An admin-privileged user can enable/disable time stamping by calling ioctl(sock, SIOCSHWTSTAMP, &ifreq) with the following values:
• Send side time sampling, enabled by ifreq.hwtstamp_config.tx_type when:

/* possible values for hwtstamp_config->tx_type */


enum hwtstamp_tx_types {
/*
* No outgoing packet will need hardware time stamping;
* should a packet arrive which asks for it, no hardware
* time stamping will be done.
*/
HWTSTAMP_TX_OFF,
 
/*
* Enables hardware time stamping for outgoing packets;
* the sender of the packet decides which are to be
* time stamped by setting %SOF_TIMESTAMPING_TX_SOFTWARE
* before sending the packet.
*/
HWTSTAMP_TX_ON,
/*
* Enables time stamping for outgoing packets just as
* HWTSTAMP_TX_ON does, but also enables time stamp insertion
* directly into Sync packets. In this case, transmitted Sync
* packets will not receive a time stamp via the socket error
* queue.
*/
HWTSTAMP_TX_ONESTEP_SYNC,
};
Note: for send side time stamping currently only HWTSTAMP_TX_OFF and
HWTSTAMP_TX_ON are supported.

• Receive side time sampling, enabled by ifreq.hwtstamp_config.rx_filter when:

/* possible values for hwtstamp_config->rx_filter */

enum hwtstamp_rx_filters {
/* time stamp no incoming packet at all */
HWTSTAMP_FILTER_NONE,
 
/* time stamp any incoming packet */
HWTSTAMP_FILTER_ALL,
/* return value: time stamp all packets requested plus some others */
HWTSTAMP_FILTER_SOME,
 
/* PTP v1, UDP, any kind of event packet */
HWTSTAMP_FILTER_PTP_V1_L4_EVENT,
/* PTP v1, UDP, Sync packet */
HWTSTAMP_FILTER_PTP_V1_L4_SYNC,
/* PTP v1, UDP, Delay_req packet */
HWTSTAMP_FILTER_PTP_V1_L4_DELAY_REQ,
/* PTP v2, UDP, any kind of event packet */
HWTSTAMP_FILTER_PTP_V2_L4_EVENT,
/* PTP v2, UDP, Sync packet */
HWTSTAMP_FILTER_PTP_V2_L4_SYNC,
/* PTP v2, UDP, Delay_req packet */
HWTSTAMP_FILTER_PTP_V2_L4_DELAY_REQ,

/* 802.AS1, Ethernet, any kind of event packet */


HWTSTAMP_FILTER_PTP_V2_L2_EVENT,
/* 802.AS1, Ethernet, Sync packet */
HWTSTAMP_FILTER_PTP_V2_L2_SYNC,
/* 802.AS1, Ethernet, Delay_req packet */
HWTSTAMP_FILTER_PTP_V2_L2_DELAY_REQ,
 
/* PTP v2/802.AS1, any layer, any kind of event packet */
HWTSTAMP_FILTER_PTP_V2_EVENT,
/* PTP v2/802.AS1, any layer, Sync packet */
HWTSTAMP_FILTER_PTP_V2_SYNC,
/* PTP v2/802.AS1, any layer, Delay_req packet */
HWTSTAMP_FILTER_PTP_V2_DELAY_REQ,
};
Note: for receive side time stamping currently only HWTSTAMP_FILTER_NONE and
HWTSTAMP_FILTER_ALL are supported.

Getting Time-Stamping

Once time stamping is enabled time stamp is placed in the socket Ancillary data. recvmsg() can be
used to get this control message for regular incoming packets. For send time stamps the outgoing
packet is looped back to the socket's error queue with the send time-stamp(s) attached. It can
be received with recvmsg (flags=MSG_ERRQUEUE). The call returns the original outgoing packet data
including all headers prepended down to and including the link layer, the scm_time-stamping
control message and a sock_extended_err control message with ee_errno==ENOMSG and
ee_origin==SO_EE_ORIGIN_TIMESTAMPING. A socket with such a pending bounced packet is ready for
reading as far as select() is concerned. If the outgoing packet has to be fragmented, then only the
first fragment is time stamped and returned to the sending socket.


When time-stamping is enabled, VLAN stripping is disabled. For more info please refer to
Documentation/networking/timestamping.txt in kernel.org

Time Stamping Capabilities via ethtool

To display Time Stamping capabilities via ethtool:

Show Time Stamping capabilities: 

ethtool -T eth<x>

Example:  

ethtool -T eth0
Time stamping parameters for p2p1:
Capabilities:
hardware-transmit (SOF_TIMESTAMPING_TX_HARDWARE)

software-transmit (SOF_TIMESTAMPING_TX_SOFTWARE)
hardware-receive (SOF_TIMESTAMPING_RX_HARDWARE)
software-receive (SOF_TIMESTAMPING_RX_SOFTWARE)
software-system-clock (SOF_TIMESTAMPING_SOFTWARE)
hardware-raw-clock (SOF_TIMESTAMPING_RAW_HARDWARE)
PTP Hardware Clock: 1
Hardware Transmit Timestamp Modes:
off (HWTSTAMP_TX_OFF)
on (HWTSTAMP_TX_ON)
 
Hardware Receive Filter Modes:
none (HWTSTAMP_FILTER_NONE)
all (HWTSTAMP_FILTER_ALL)

For more details on PTP Hardware Clock, please refer to: https://www.kernel.org/doc/Documentation/ptp/ptp.txt

Steering PTP Traffic to Single RX Ring

As a result of Receive Side Scaling (RSS), PTP traffic coming to UDP ports 319 and 320 may reach
the user space application in an out-of-order manner. In order to prevent this, PTP traffic needs to
be steered to a single RX ring using ethtool.

Example: 

# ethtool -u ens7
8 RX rings available
Total 0 rules
# ethtool -U ens7 flow-type udp4 dst-port 319 action 0 loc 1
# ethtool -U ens7 flow-type udp4 dst-port 320 action 0 loc 0
# ethtool -u ens7
8 RX rings available
Total 2 rules
Filter: 0
Rule Type: UDP over IPv4
Src IP addr: 0.0.0.0 mask: 255.255.255.255
Dest IP addr: 0.0.0.0 mask: 255.255.255.255
TOS: 0x0 mask: 0xff
Src port: 0 mask: 0xffff
Dest port: 320 mask: 0x0
Action: Direct to queue 0
Filter: 1
Rule Type: UDP over IPv4
Src IP addr: 0.0.0.0 mask: 255.255.255.255
Dest IP addr: 0.0.0.0 mask: 255.255.255.255
TOS: 0x0 mask: 0xff
Src port: 0 mask: 0xffff
Dest port: 319 mask: 0x0
Action: Direct to queue 0

One Pulse Per Second (1PPS)


This feature is supported on ConnectX-4 adapter cards family and above only.

1PPS is a time synchronization feature that allows the adapter to send or receive one pulse
per second on a dedicated pin on the adapter card, using an SMA (SubMiniature version A) connector.
Only one pin is supported, and it can be configured as 1PPS in or 1PPS out.
For further information, refer to HowTo Test 1PPS on Mellanox Adapters Community post.

Flow Steering
Flow steering is a new model which steers network flows based on flow specifications to specific
QPs. Those flows can be either unicast or multicast network flows. In order to maintain flexibility,
domains and priorities are used. Flow steering uses a methodology of flow attribute, which is a
combination of L2-L4 flow specifications, a destination QP and a priority. Flow steering rules may be
inserted either by using ethtool or by using InfiniBand verbs. The verbs abstraction uses different
terminology from the flow attribute (ibv_exp_flow_attr), defined by a combination of specifications
(struct ibv_exp_flow_spec_*).

Flow Steering Configuration


Applicable to ConnectX®-3 and ConnectX®-3 Pro adapter cards only.

In ConnectX®-4 and ConnectX®-4 Lx adapter cards, Flow Steering is automatically enabled


as of MLNX_EN v3.1-x.0.0.

Flow steering is generally enabled when the log_num_mgm_entry_size module parameter is non-positive
(e.g., log_num_mgm_entry_size=-7). The absolute value of the parameter is a bit field, where every
bit indicates a condition or an option regarding the flow steering mechanism:

b0 - Force device managed Flow Steering: When set to 1, it forces the HCA to be enabled regardless of whether NC-SI Flow Steering is supported or not.

b2 - Enable A0 static DMFS steering (see "A0 Static Device Managed Flow Steering"): When set to 1, A0 static DMFS steering is enabled. This bit should be set to 0 when "b1 - Disable IPoIB Flow Steering" is 0.

b3 - Enable DMFS only if the HCA supports more than 64 QPs per MCG entry: When set to 1, DMFS is enabled only if the HCA supports more than 64 QPs attached to the same rule. For example, attaching 64 VFs to the same multicast address causes 64 QPs to be attached to the same MCG. If the HCA supports less than 64 QPs per MCG, B0 is used.

b5 - Optimize the steering table for non-source IP rules when possible: When set to 1, the steering table will be optimized to support rules ignoring the source IP check. This optimization is possible only when DMFS mode is set.

b6 - Enable/disable VXLAN offloads: When set to 1, VXLAN offloads will be disabled. When VXLAN offloads are disabled, ethtool -k will display:

tx-udp_tnl-segmentation: off [fixed]
tx-udp_tnl-csum-segmentation: off [fixed]
tx-gso-partial: off [fixed]

For example, a value of (-7) means:


• Forcing Flow Steering regardless of NC-SI Flow Steering support
• Disabling IPoIB Flow Steering support
• Enabling A0 static DMFS steering
• Steering table is not optimized for rules ignoring source IP check

The default value of log_num_mgm_entry_size is -10, meaning Ethernet Flow Steering (with IPoIB
DMFS disabled) is enabled by default if NC-SI DMFS is supported and the HCA supports at least 64 QPs
per MCG entry. Otherwise, L2 steering (B0) is used.
When using SR-IOV, flow steering is enabled if there is an adequate amount of space to store the
flow steering table for the guest/master.
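The bit-field decoding described above can be sketched in Python. decode_log_num_mgm_entry_size is a hypothetical helper; the bit meanings follow the table above, with b1 (disable IPoIB Flow Steering) implied by the -7 example:

```python
def decode_log_num_mgm_entry_size(value):
    """Decode a non-positive mlx4_core log_num_mgm_entry_size value.

    The absolute value is treated as a bit field; each set bit enables
    one of the flow-steering options listed in the table above.
    """
    if value > 0:
        raise ValueError("flow steering is only enabled for non-positive values")
    bits = abs(value)
    names = {
        0: "force device managed Flow Steering",
        1: "disable IPoIB Flow Steering",
        2: "enable A0 static DMFS steering",
        3: "enable DMFS only if HCA supports >64 QPs per MCG entry",
        5: "optimize steering table for non-source-IP rules",
        6: "disable VXLAN offloads",
    }
    return [names[b] for b in sorted(names) if bits & (1 << b)]

# |-7| = 0b0111 -> b0, b1 and b2 set, matching the example above
print(decode_log_num_mgm_entry_size(-7))
```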

To enable Flow Steering:


1. Open the /etc/modprobe.d/mlnx.conf file.
2. Set the parameter log_num_mgm_entry_size to a non-positive value by writing the option
mlx4_core log_num_mgm_entry_size=<value>.
3. Restart the driver.

 To disable Flow Steering:


1. Open the /etc/modprobe.d/mlnx.conf file.
2. Remove the options mlx4_core log_num_mgm_entry_size= <value>.
3. Restart the driver.

Flow Steering Support


Flow Steering is supported in ConnectX®-3, ConnectX®-3 Pro, ConnectX®-4 and
ConnectX®-4 Lx adapter cards.

[For ConnectX®-3 and ConnectX®-3 Pro only] To determine which Flow Steering features are
supported: 

ethtool --show-priv-flags eth4

The following output will be received: 

mlx4_flow_steering_ethernet_l2: on Creating Ethernet L2 (MAC) rules is supported


mlx4_flow_steering_ipv4: on Creating IPv4 rules is supported
mlx4_flow_steering_tcp: on Creating TCP/UDP rules is supported


For ConnectX-4 and ConnectX-4 Lx adapter cards, all supported features are enabled. 


Flow Steering support in InfiniBand is determined according to the EXP_MANAGED_FLOW_STEERING flag.

A0 Static Device Managed Flow Steering

This mode enables fast steering, however it might impact flexibility. Using it increases the packet
rate performance by ~30%, with the following limitations for Ethernet link-layer unicast QPs:
• Limits the number of opened RSS Kernel QPs to 96. MACs should be unique (1 MAC per 1 QP).
The number of VFs is limited
• When creating Flow Steering rules for user QPs, only MAC--> QP rules are allowed. Both MACs
and QPs should be unique between rules. Only 62 such rules could be created
• When creating rules with Ethtool, MAC--> QP rules could be used, where the QP must be the
indirection (RSS) QP. Creating rules that indirect traffic to other rings is not allowed. Ethtool
MAC rules to drop packets (action -1) are supported
• RFS is not supported in this mode
• VLAN is not supported in this mode

Flow Domains and Priorities


ConnectX®-4 and ConnectX®-4 Lx adapter cards support only User Verbs domain with struct
ibv_exp_flow_spec_eth flow specification using 4 priorities.

Flow steering defines the concept of domain and priority. Each domain represents a user agent that
can attach a flow. The domains are prioritized. A higher priority domain will always supersede a
lower priority domain when their flow specifications overlap. Setting a lower priority value will
result in a higher priority.
In addition to the domain, there is a priority within each of the domains. Each domain can have at
most 2^12 priorities in accordance with its needs.
The following are the domains at a descending order of priority:
• Ethtool
Ethtool domain is used to attach an RX ring, specifically its QP to a specified flow. Please
refer to the most recent ethtool manpage for all the ways to specify a flow.

Examples:

• ethtool –U eth5 flow-type ether dst 00:11:22:33:44:55 loc 5 action 2


All packets that contain the above destination MAC address are to be steered into rx-
ring 2 (its underlying QP), with priority 5 (within the ethtool domain)
• ethtool –U eth5 flow-type tcp4 src-ip 1.2.3.4 dst-port 8888 loc 5 action 2
All packets that contain the above destination IP address and source port are to be
steered into rx- ring 2. When destination MAC is not given, the user's destination MAC
is filled automatically.
• ethtool -U eth5 flow-type ether dst 00:11:22:33:44:55 vlan 45 m 0xf000 loc 5 action 2
All packets that contain the above destination MAC address and specific VLAN are
steered into ring 2. Please pay attention to the VLAN's mask 0xf000. It is required in
order to add such a rule.
• ethtool –u eth5
Shows all of ethtool's steering rules

When configuring two rules with the same priority, the second rule will overwrite the first one, so
this ethtool interface is effectively a table. Inserting Flow Steering rules in the kernel requires
support from both ethtool in user space and the kernel (v2.6.28 or later).

MLX4 Driver Support


The mlx4 driver supports only a subset of the flow specification the ethtool API defines. Asking for
an unsupported flow specification will result in an "invalid value" failure.
The following are the flow specific parameters:
• ether - Mandatory: dst; Optional: vlan
• tcp4/udp4 - Optional: src-ip, dst-ip, src-port, dst-port, vlan
• ip4 - Mandatory: src-ip/dst-ip; Optional: src-ip, dst-ip, vlan

• Accelerated Receive Flow Steering (aRFS) 

 aRFS is supported in both ConnectX®-3 and ConnectX®-4 adapter cards.


Receive Flow Steering (RFS) and Accelerated Receive Flow Steering (aRFS) are kernel features
currently available in most distributions. For RFS, packets are forwarded based on the
location of the application consuming the packet. aRFS boosts the speed of RFS by adding
hardware support. By using aRFS (unlike RFS), the packets are directed to a CPU that is
local to the thread running the application. 
aRFS is in-kernel logic responsible for load balancing between CPUs by attaching flows to
CPUs that are used by the flow's owner applications. This domain allows the aRFS mechanism to
use the flow steering infrastructure to support the aRFS logic by implementing
ndo_rx_flow_steer, which, in turn, calls the underlying flow steering mechanism with the aRFS
domain.
To configure RFS:

Configure the RFS flow table entries (globally and per core).
Note: The functionality remains disabled until explicitly configured (by default it is 0).
• The number of entries in the global flow table is set as follows: 

 /proc/sys/net/core/rps_sock_flow_entries

• The number of entries in the per-queue flow table are set as follows: 

 /sys/class/net/<dev>/queues/rx-<n>/rps_flow_cnt

Example: 

# echo 32768 > /proc/sys/net/core/rps_sock_flow_entries


# for f in /sys/class/net/ens6/queues/rx-*/rps_flow_cnt; do echo 32768 > $f; done

To Configure aRFS:

The aRFS feature requires explicit configuration in order to enable it. Enabling the aRFS requires
enabling the 'ntuple' flag via the ethtool.
For example, to enable ntuple for eth0, run: 

ethtool -K eth0 ntuple on

aRFS requires the kernel to be compiled with the CONFIG_RFS_ACCEL option. This option is available
in kernels 2.6.39 and above. Furthermore, aRFS requires Device Managed Flow Steering support.


RFS cannot function if LRO is enabled. LRO can be disabled via ethtool.

• All of the rest


The lowest priority domain serves the following users:

• The mlx4 Ethernet driver attaches its unicast and multicast MAC addresses to its QP
using L2 flow specifications
• The mlx4 ipoib driver, when it attaches its QP to its configured GIDs 

 Fragmented UDP traffic cannot be steered. It is treated as 'other' protocol by
hardware (from the first packet) and is not considered UDP traffic. 

 We recommend using libibverbs v2.0-3.0.0 and libmlx4 v2.0-3.0.0 and higher
as of MLNX_EN v2.0-3.0.0 due to API changes. 

Flow Steering Dump Tool


This tool is only supported for ConnectX-4 and above adapter cards.

mlx_fs_dump is a Python tool that prints the steering rules in a readable manner. Python v2.7 or
above, as well as the pip, anytree and termcolor libraries, are required to be installed on the host.

Running example:

./ofed_scripts/utils/mlx_fs_dump -d /dev/mst/mt4115_pciconf0
FT: 9 (level: 0x18, type: NIC_RX)
+-- FG: 0x15 (MISC)
|-- FTE: 0x0 (FWD) to (TIR:0x7e) out.ethtype:IPv4 out.ip_prot:UDP out.udp_dport:0x140
+-- FTE: 0x1 (FWD) to (TIR:0x7e) out.ethtype:IPv4 out.ip_prot:UDP out.udp_dport:0x13f
...

For further information on the mlx_fs_dump tool, please refer to mlx_fs_dump Community post.

Wake-on-LAN (WoL)
Wake-on-LAN (WoL) is a technology that allows a network professional to remotely power on a
computer or to wake it up from sleep mode.
• To enable WoL: 

ethtool -s <interface> wol g

• To get WoL: 

ethtool <interface> | grep Wake-on
Wake-on: g

Where:
"g" is the magic packet activity. 

Hardware Accelerated 802.1ad VLAN (Q-in-Q Tunneling)


Q-in-Q tunneling allows the user to create a Layer 2 Ethernet connection between two servers. The
user can segregate different VLAN traffic on a link or bundle different VLANs into a single VLAN.
Q-in-Q tunneling adds a service VLAN tag before the user's 802.1Q VLAN tags.
For Q-in-Q support in virtualized environments (SR-IOV), please refer to  "Q-in-Q Encapsulation per
VF in Linux (VST)".

Requirements
• ConnectX-3/ConnectX-3 Pro without Rx offloads
• Firmware version 2.34.8240 and up
• Kernel version 3.10 and up

• iproute-3.10.0-13.el7.x86_64 and up

To enable device support for accelerated 802.1ad VLAN:


1. Turn on the new ethtool private flag "phv-bit" (disabled by default). 

$ ethtool --set-priv-flags eth1 phv-bit on

Enabling this flag sets the phv_en port capability.


2. Change the interface device features by turning on the ethtool device feature
"tx-vlan-stag-hw-insert" (disabled by default). 

$ ethtool -K eth1 tx-vlan-stag-hw-insert on

Once the private flag and the ethtool device feature are set, the device will be ready for
802.1ad VLAN acceleration. 

 The "phv-bit" private flag setting is available for the Physical Function (PF) only.
The Virtual Function (VF) can use the VLAN acceleration by setting the "tx-vlan-stag-
hw-insert" parameter only if the private flag "phv-bit" is enabled by the PF. If the PF
enables/disables the "phv-bit" flag after the VF driver is up, the configuration will
take place only after the VF driver is restarted. 

Local Loopback Disable


The Local Loopback Disable feature allows users to force the disablement of local loopback on the
virtual port (vport). This disables both unicast and multicast loopback in the hardware.

To enable Local Loopback Disable, run the following command: 

echo 1 > /sys/class/net/<ifname>/settings/force_local_lb_disable

To disable Local Loopback Disable, run the following command: 

echo 0 > /sys/class/net/<ifname>/settings/force_local_lb_disable


When turned off, the driver configures the loopback mode according to its own logic. 

NVME-oF - NVM Express over Fabrics

NVME-oF

NVME-oF enables NVMe message-based commands to transfer data between a host computer and a
target solid-state storage device or system over a network such as Ethernet, Fibre Channel, and
InfiniBand. Tunneling NVMe commands through an RDMA fabric provides a high throughput and a low
latency.
For information on how to configure NVME-oF, please refer to the HowTo Configure NVMe over
Fabrics Community post.

NVME-oF Target Offload


This feature is only supported for ConnectX-5 adapter cards family and above.

NVME-oF Target Offload is an implementation of the new NVME-oF standard Target (server) side in
hardware. Starting from ConnectX-5 family cards, all regular IO requests can be processed by the
HCA, with the HCA sending IO requests directly to a real NVMe PCI device, using peer-to-peer PCI
communications. This means that excluding connection management and error flows, no CPU
utilization will be observed during NVME-oF traffic.
• For instructions on how to configure NVME-oF target offload, refer to HowTo Configure NVME-
oF Target Offload Community post.
• For instructions on how to verify that NVME-oF target offload is working properly, refer
to Simple NVMe-oF Target Offload Benchmark Community post.

Debuggability

Directory in debugfs per Open Interface

If the debugfs feature is enabled in the kernel, the mlx5 driver maintains a subdirectory containing
useful debug information for each open eth port inside /sys/kernel/debug/.

For the default network namespace, the subdirectory name is the name of the interface, like "eth8".
When the network interface is moved to the non-default network namespaces, the interface name is
followed by "@" and the port's PCI ID. For example, the subdirectory name would be
"eth8@0000:af:00.3".
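The naming rule can be sketched as a tiny helper; the interface name and PCI ID below are the examples from the text.

```shell
# Build the mlx5 debugfs subdirectory name: the bare interface name in
# the default network namespace, "<ifname>@<PCI ID>" otherwise.
mlx5_dbgfs_dirname() {
    local ifname="$1" pci_id="$2" in_default_ns="$3"   # in_default_ns: yes/no
    if [ "$in_default_ns" = "yes" ]; then
        echo "$ifname"
    else
        echo "${ifname}@${pci_id}"
    fi
}

mlx5_dbgfs_dirname eth8 0000:af:00.3 yes   # -> eth8
mlx5_dbgfs_dirname eth8 0000:af:00.3 no    # -> eth8@0000:af:00.3
```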

RX Page Cache Size Limit


The RX page cache size changes dynamically (extends/reduces) in response to the RX load. The RX
page cache tables of different RQs are independent and the sizes of these tables change
independently per RQ.

By default, the RX page cache size can extend up to 16 times the original work queue (WQ) size.
When the RX default limit is too high, the table may extend too much, causing IOMMU allocation
(iommu_alloc) problems.

To prevent this, the RX page cache size limit can be set to a lower value using sysfs.

The value is the base-2 logarithm of the maximum growth multiplier of the basic cache size. In
other words, when the value is set to 4, the RX cache table can extend up to 16 times its original
size, and when the value is 2, the table can extend up to 4 times its original size.

Because the size of tables changes fast, changing the RX page cache size limit has a quick impact,
even if the table size is larger than the new limit.

If iommu_alloc problems occur, try reducing the limit of the RX page cache size; in some setups,
reducing the RX page cache size limit does not impair performance.

Originally, the log limit of the RX page cache size was always 4; therefore, 4 is the default value
set by the feature. The value must be an integer between 0 and 4, inclusive, where 0 means that
the table will not be able to extend at all.

Below is an example of how to check the current log limit of the RX page cache size (example):

cat /sys/class/net/<ifs-name>/rx_page_cache/log_mult_limit
 
log rx page cache mult limit is 4

Below is an example of setting a new log limit for the RX page cache size, where <val> is the log
of the multiplier for the basic cache size.

echo <val> > /sys/class/net/eth3/rx_page_cache/log_mult_limit
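Since the limit is a log2 value, the maximum growth factor is simply two raised to the limit. A one-line helper illustrates the mapping described above:

```shell
# Map a log_mult_limit value to the maximum cache growth factor (2^limit).
cache_growth_factor() {
    echo $((1 << $1))
}

cache_growth_factor 4   # -> 16 (the default)
cache_growth_factor 2   # -> 4
cache_growth_factor 0   # -> 1 (table cannot extend)
```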

Virtualization
This chapter contains the following sections: 

• Single Root IO Virtualization (SR-IOV)


• Enabling Paravirtualization
• VXLAN Hardware Stateless Offloads
• Q-in-Q Encapsulation per VF in Linux (VST)
• 802.1Q Double-Tagging

Single Root IO Virtualization (SR-IOV)
Single Root IO Virtualization (SR-IOV) is a technology that allows a physical PCIe device to present
itself multiple times through the PCIe bus. This technology enables multiple virtual instances of the
device with separate resources. Mellanox ConnectX®-3 adapter cards can expose up to 126 virtual
instances, called Virtual Functions (VFs), and ConnectX-4/Connect-IB adapter cards up to 62 virtual
instances. These virtual functions can then be provisioned separately. Each VF can be seen as an
additional device connected to the Physical Function. It shares the same resources with the
Physical Function, and its number of ports equals that of the Physical Function.
SR-IOV is commonly used in conjunction with an SR-IOV-enabled hypervisor to provide virtual
machines direct hardware access to network resources, thereby increasing their performance.
In this chapter we will demonstrate setup and configuration of SR-IOV in a Red Hat Linux
environment using Mellanox ConnectX® VPI adapter cards family.

System Requirements

To set up an SR-IOV environment, the following is required:


• MLNX_EN Driver
• A server/blade with an SR-IOV-capable motherboard BIOS
• Hypervisor that supports SR-IOV such as: Red Hat Enterprise Linux Server Version 6.
• Mellanox ConnectX® VPI Adapter Card family with SR-IOV capability

Setting Up SR-IOV

Depending on your system, perform the steps below to set up your BIOS. The figures used in this
section are for illustration purposes only. For further information, please refer to the appropriate
BIOS User Manual:

1. Enable "SR-IOV" in the system BIOS.

2. Enable "Intel Virtualization Technology".

3. Install a hypervisor that supports SR-IOV.


4. Depending on your system, update the /boot/grub/grub.conf file to include a similar
command line load parameter for the Linux kernel.
For example, on Intel systems, add: 

default=0
timeout=5
splashimage=(hd0,0)/grub/splash.xpm.gz
hiddenmenu
title Red Hat Enterprise Linux Server (2.6.32-36.x86-64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-36.x86-64 ro root=/dev/VolGroup00/LogVol00 rhgb quiet intel_iommu=on
        initrd /initrd-2.6.32-36.x86-64.img

Note: Please make sure the parameter "intel_iommu=on" exists when updating the /boot/
grub/grub.conf file, otherwise SR-IOV cannot be loaded.

Some OSs use /boot/grub2/grub.cfg file. If your server uses such file, please edit this file
instead (add “intel_iommu=on” for the relevant menu entry at the end of the line that starts
with "linux16").

Configuring SR-IOV for ConnectX-3/ConnectX-3 Pro


1. Install the MLNX_EN driver for Linux that supports SR-IOV.
SR-IOV can be enabled and managed by running the mlxconfig tool and setting the SRIOV_EN
parameter to "1" without re-burning the firmware.
To find the mst device, run "mst start" and then "mst status". 

mlxconfig -d <mst_device> s SRIOV_EN=1

For further information, please refer to section


"mlxconfig - Changing Device Configuration Tool" in the MFT User Manual (www.mellanox.com >
Products > Software > Firmware Tools).
2. Verify the HCA is configured to support SR-IOV. 

# mstflint -dev <PCI Device> dc

a. Verify in the [HCA] section the following fields appear: 

[HCA]
num_pfs = 1
total_vfs = <0-126>
sriov_en = true

where:
Parameter   Recommended Value

num_pfs     1
            Note: This field is optional and might not always appear.

total_vfs   • When using firmware version 2.31.5000 and above, the recommended value is 126.
            • When using firmware version 2.30.8000 and below, the recommended value is 63.
            Note: Before setting the number of VFs in SR-IOV, please make sure your system can
            support that number of VFs. Setting a number of VFs larger than what your hardware
            and software can support may cause your system to cease working.

sriov_en    true

Notes:
- If SR-IOV is supported, to enable SR-IOV (if it is not enabled), it is sufficient to set
“sriov_en = true” in the INI.
- If the HCA does not support SR-IOV, please contact Mellanox Support:
support@mellanox.com
b. Add the above fields to the INI if they are missing.
c. Set the total_vfs parameter to the desired number if you need to change the number
of total VFs.

d. Reburn the firmware using the mlxburn tool if the fields above were added to the INI,
or the total_vfs parameter was modified.
If mlxburn is not installed, please download it from the Mellanox website
at http://www.mellanox.com → Products → Firmware Tools. 

mlxburn -fw ./fw-ConnectX3-rel.mlx -dev /dev/mst/mt4099_pci_cr0 -conf ./MCX341A-XCG_Ax.ini

3. Create the text file /etc/modprobe.d/mlx4_core.conf if it does not exist.


4. Insert an "options" line in the /etc/modprobe.d/mlx4_core.conf file to set the number of VFs,
the protocol type per port, and the number of virtual functions to be probed by the
physical function driver (probe_vf).
• For example: 

options mlx4_core num_vfs=5 port_type_array=1,2 probe_vf=1

where: 

Parameter Recommended Value

num_vfs • If absent, or zero: no VFs will be available


• If its value is a single number in the range of
0-63: The driver will enable the num_vfs VFs on
the HCA and this will be applied to all
ConnectX® HCAs on the host.
• If its value is a triplet x,y,z (applies only if all ports
are configured as Ethernet) the driver creates:
• x single port VFs on physical port 1
• y single port VFs on physical port 2
(applies only if such a port exists)
• z n-port VFs (where n is the number of
physical ports on the device). This applies to
all ConnectX® HCAs on the host
• If its value is a string: The string specifies
the num_vfs parameter separately per installed
HCA.
The string format is: "bb:dd.f-v,bb:dd.f-v,…"
• bb:dd.f = bus:device.function of
the PF of the HCA
• v = number of VFs to enable for
that HCA which is either a single
value or a triplet, as described
above.
For example:
• num_vfs=5 - The driver will enable 5 VFs
on the HCA and this will be applied to
all ConnectX® HCAs on the host
• num_vfs=00:04.0-5,00:07.0-8 - The
driver will enable 5 VFs on the HCA
positioned in BDF 00:04.0 and 8 on the
one in 00:07.0
• num_vfs=1,2,3 - The driver will enable 1
VF on physical port 1, 2 VFs on physical
port 2 and 3 dual port VFs (applies only
to dual port HCA when all ports are
Ethernet ports).
• num_vfs=00:04.0-5;6;7,00:07.0-8;9;10 -
the driver will enable:
• HCA positioned in BDF 00:04.0
• 5 single VFs on port 1
• 6 single VFs on port 2
• 7 dual port VFs
• HCA positioned in BDF 00:07.0
• 8 single VFs on port 1
• 9 single VFs on port 2
• 10 dual port VFs
Applies when all ports are configured as Ethernet in dual-port
HCAs
Notes:
• PFs not included in the above list will
not have SR-IOV enabled.

• Triplets and single port VFs are only
valid when all ports are configured as
Ethernet. When an InfiniBand port
exists, only num_vfs=a syntax is valid
where "a" is a single value that
represents the number of VFs.
• The second parameter in a triplet is
valid only when there is more than one
physical port.
• In a triplet, x+z<=63 and y+z<=63, i.e.,
the maximum number of VFs on each
physical port is 63.

port_type_array Specifies the protocol type of the ports. It is either a single
array of 2 port types 't1,t2' applied to all devices, or a list of
BDF-to-port_type_array mappings 'bb:dd.f-t1;t2,...' (string).
Valid port types: 1=ib, 2=eth, 3=auto, 4=N/A.
If only a single port is available, use the N/A port type for
port 2 (e.g. '1,4').
Note that this parameter is valid only when num_vfs is not
zero (i.e., SR-IOV is enabled). Otherwise, it is ignored.


probe_vf • If absent or zero: no VF interfaces will be


loaded in the Hypervisor/host
• If its value is a number in the range of 1-63, the
driver running on the Hypervisor will itself
activate that number of VFs. All these VFs will
run on the Hypervisor. This number will apply to
all ConnectX® HCAs on that host.
• If its value is a triplet x,y,z (applies only if all ports
are configured as Ethernet), the driver probes:
• x single port VFs on physical port 1
• y single port VFs on physical port 2
(applies only if such a port exists)
• z n-port VFs (where n is the number of
physical ports on device). Those VFs are
attached to the hypervisor.
• If its format is a string: the string specifies the
probe_vf parameter separately per installed
HCA.
The string format is: "bb:dd.f-v,bb:dd.f-v,…
• bb:dd.f = bus:device.function of the PF
of the HCA
• v = number of VFs to use in the PF driver
for that HCA which is either a single
value or a triplet, as described above
For example:
• probe_vfs=5 - The PF driver will activate 5 VFs
on the HCA and this will be applied to all
ConnectX® HCAs on the host
• probe_vfs=00:04.0-5,00:07.0-8 - The PF driver
will activate 5 VFs on the HCA positioned in
BDF 00:04.0 and 8 on the one in 00:07.0
• probe_vf=1,2,3 - The PF driver will activate 1
VF on physical port 1, 2 VFs on physical port 2
and 3 dual port VFs (applies only to dual port
HCA when all ports are Ethernet ports).
This applies to all ConnectX® HCAs in the host.
• probe_vf=00:04.0-5;6;7,00:07.0-8;9;10 - The PF
driver will activate:
• HCA positioned in BDF 00:04.0
• 5 single VFs on port 1
• 6 single VFs on port 2
• 7 dual port VFs
• HCA positioned in BDF 00:07.0
• 8 single VFs on port 1
• 9 single VFs on port 2
• 10 dual port VFs
Applies when all ports are configured as Ethernet in dual-port
HCAs.
Notes:
• PFs not included in the above list will not
activate any of their VFs in the PF driver
• Triplets and single port VFs are only valid when
all ports are configured as Ethernet. When an
InfiniBand port exists, only probe_vf=a syntax is
valid where "a" is a single value that represents
the number of VFs

• The second parameter in a triplet is valid only
when there is more than one physical port
• Every value (either a value in a triplet or a
single value) should be less than or equal to the
respective value of the num_vfs parameter

The example above loads the driver with 5 VFs (num_vfs). The standard use of a VF is a
single VF per VM. However, the number of VFs varies according to the working mode
requirements.
The protocol types are:
   - port_type_array=2,2 (Ethernet, Ethernet)
   - port_type_array=1,1 (IB, IB)
   - port_type_array=1,2 (VPI: Port 1 = IB, Port 2 = Ethernet)
   - No port_type_array module parameter: ports are IB 

 For single port HCAs the possible values are (1,1) or (2,2).
5. Reboot the server. 

 If the SR-IOV is not supported by the server, the machine might not come out of
boot/load.

6. Load the driver and verify the SR-IOV is supported. Run: 

lspci | grep Mellanox


03:00.0 InfiniBand: Mellanox Technologies MT26428 [ConnectX VPI PCIe 2.0 5GT/s - IB QDR / 10GigE] (rev b0)
03:00.1 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
03:00.2 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
03:00.3 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
03:00.4 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)
03:00.5 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)

where:
- “03:00” represents the Physical Function
- “03:00.X” represents the Virtual Function connected to the Physical Function
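To illustrate the per-HCA string format of num_vfs described earlier ("bb:dd.f-v,..."), here is a hypothetical parser sketch that prints one line per PF entry. It is illustrative only and does not validate triplet ranges.

```shell
# Split a num_vfs per-HCA string such as "00:04.0-5,00:07.0-8;9;10"
# into one "PF <bdf>: <vfs>" line per entry. The value part may be a
# single number or a triplet x;y;z, as described in the table above.
parse_num_vfs() {
    echo "$1" | tr ',' '\n' | while read -r entry; do
        # ${entry%-*} is the bus:device.function, ${entry#*-} the VF count
        echo "PF ${entry%-*}: ${entry#*-}"
    done
}

parse_num_vfs "00:04.0-5,00:07.0-8;9;10"
```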

Additional SR-IOV Configurations

Assigning a Virtual Function to a Virtual Machine

This section describes a mechanism for adding a SR-IOV VF to a Virtual Machine.

Assigning the SR-IOV Virtual Function to the Red Hat KVM VM Server
1. Run the virt-manager.
2. Double click on the virtual machine and open its Properties.

3. Go to Details → Add hardware → PCI host device.

4. Choose a Mellanox virtual function according to its PCI device (e.g., 00:03.1)
5. If the Virtual Machine is up, reboot it; otherwise, start it.
6. Log into the virtual machine and verify that it recognizes the Mellanox card. Run: 

lspci | grep Mellanox

Example: 

lspci | grep Mellanox


00:03.0 InfiniBand: Mellanox Technologies MT27500 Family [ConnectX-3 Virtual Function] (rev b0)

7. [ConnectX-3/ConnectX-3 Pro] Add the device to the /etc/sysconfig/network-scripts/
ifcfg-ethX configuration file. The MAC address for every virtual function is configured
randomly, therefore it is not necessary to add it.

Ethernet Virtual Function Configuration when Running SR-IOV

SR-IOV Virtual Function configuration can be done through the Hypervisor iproute2/netlink tool, if
present. Otherwise, it can be done via sysfs. 

ip link set { dev DEVICE | group DEVGROUP } [ { up | down } ]


...
[ vf NUM [ mac LLADDR ] [ vlan VLANID [ qos VLAN-QOS ] ]
...
[ spoofchk { on | off} ] ]
...
 
sysfs configuration (ConnectX-4):

 
/sys/class/net/enp8s0f0/device/sriov/[VF]
 
+-- [VF]
| +-- config
| +-- link_state
| +-- mac
| +-- mac_list
| +-- max_tx_rate
| +-- min_tx_rate
| +-- spoofcheck
| +-- stats
| +-- trunk
| +-- trust
| +-- vlan

VLAN Guest Tagging (VGT) and VLAN Switch Tagging (VST)

When running ETH ports on VGT, the ports may be configured to simply pass through packets as is
from VFs (VLAN Guest Tagging), or the administrator may configure the Hypervisor to silently force
packets to be associated with a VLAN/Qos (VLAN Switch Tagging).
In the latter case, untagged or priority-tagged outgoing packets from the guest will have the VLAN
tag inserted, and incoming packets will have the VLAN tag removed.
The default behavior is VGT.

To configure VF VST mode, run: 

ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>]

where:
• NUM = 0..max-vf-num
• vlan_id = 0..4095
• qos = 0..7

For example:
• ip link set dev eth2 vf 2 vlan 10 qos 3 - sets VST mode for VF #2 belonging to PF eth2, with
vlan_id = 10 and qos = 3
• ip link set dev eth2 vf 2 vlan 0 - sets mode for VF 2 back to VGT
Note: In ConnectX-3 adapter cards family, switching to VGT mode can also be done by setting
vlan_id to 4095.
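A small dry-run wrapper shows how the ip link command line is assembled for VST mode and for reverting to VGT. The device and VF numbers are the examples from the text; pipe the output to a shell to actually apply it.

```shell
# Print the ip link command that sets VST mode for a VF. A vlan of 0
# reverts the VF to VGT; qos is optional.
vst_cmd() {
    local pf="$1" vf="$2" vlan="$3" qos="$4"
    if [ -n "$qos" ]; then
        echo "ip link set dev ${pf} vf ${vf} vlan ${vlan} qos ${qos}"
    else
        echo "ip link set dev ${pf} vf ${vf} vlan ${vlan}"
    fi
}

vst_cmd eth2 2 10 3   # VST with vlan_id 10, qos 3
vst_cmd eth2 2 0      # back to VGT
```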

Additional Ethernet VF Configuration Options

• Guest MAC configuration - by default, guest MAC addresses are configured to be all zeroes. If
the administrator wishes the guest to always start up with the same MAC, he/she should
configure guest MACs before the guest driver comes up. The guest MAC may be configured by
using: 

ip link set dev <PF device> vf <NUM> mac <LLADDR>

For legacy and ConnectX-4 guests, which do not generate random MACs, the administrator
should always configure their MAC addresses via IP link, as above.

• Spoof checking - Spoof checking is currently available only on upstream kernels newer than
3.1. 

ip link set dev <PF device> vf <NUM> spoofchk [on | off]

•  Guest Link State 

ip link set dev <PF device> vf <NUM> state [enable | disable | auto]

Virtual Function Statistics

Virtual function statistics can be queried via sysfs:

cat /sys/class/infiniband/mlx5_2/device/sriov/2/stats

tx_packets : 5011
tx_bytes : 4450870
tx_dropped : 0
rx_packets : 5003
rx_bytes : 4450222
rx_broadcast : 0
rx_multicast : 0
tx_broadcast : 0
tx_multicast : 8
rx_dropped : 0
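Because the counters are plain "name : value" lines, they are easy to post-process. As an illustration, the following sums the packet counters from a saved stats dump (the sample input reuses the values shown above):

```shell
# Sum tx_packets and rx_packets from a VF stats dump read on stdin.
total_packets() {
    awk -F':' '/_packets/ { gsub(/ /, "", $2); sum += $2 } END { print sum }'
}

total_packets <<'EOF'
tx_packets : 5011
rx_packets : 5003
rx_dropped : 0
EOF
```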

Mapping VFs to Ports

To view the VFs mapping to ports:

Use the ip link tool, version 2.6.34-3 or above. 

ip link

Output: 

61: p1p1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 00:02:c9:f1:72:e0 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 37 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 38 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable
vf 39 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable

When a MAC is ff:ff:ff:ff:ff:ff, the VF is not assigned to the port of the net device it is listed under.
In the example above, vf38 is not assigned to the same port as p1p1, in contrast to vf0.
However, even VFs that are not assigned to this net device can be used to set and change its
settings. For example, the following is a valid command to change the spoof check: 

ip link set dev p1p1 vf 38 spoofchk on

This command will affect only VF 38. The change can be seen in the ip link output of the net
device that this VF is assigned to.
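The all-ff MAC convention above lends itself to a quick filter. This sketch lists the VF numbers that are not assigned to the device's port, given saved ip link output on stdin (the sample lines mirror the example above):

```shell
# Print the numbers of VFs whose MAC is ff:ff:ff:ff:ff:ff, i.e. VFs not
# assigned to the port of the net device the listing belongs to.
unassigned_vfs() {
    awk '$1 == "vf" && $4 ~ /^ff:ff:ff:ff:ff:ff/ { print $2 }'
}

unassigned_vfs <<'EOF'
vf 0 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 38 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable
vf 39 MAC ff:ff:ff:ff:ff:ff, vlan 65535, spoof checking off, link-state disable
EOF
```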

Mapping VFs to Ports using the mlnx_get_vfs.pl tool

To map the PCI representation in BDF to the respective ports, run: 

mlnx_get_vfs.pl

Output: 

BDF 0000:04:00.0
Port 1: 2
vf0 0000:04:00.1
vf1 0000:04:00.2
Port 2: 2
vf2 0000:04:00.3
vf3 0000:04:00.4
Both: 1
vf4 0000:04:00.5

Configuring Pkeys and GUIDs under SR-IOV in ConnectX-3/ConnectX-3 Pro

Port Type Management

Port Type management is static when enabling SR-IOV (the connectx_port_config script will not
work). The port type is set on the Host via a module parameter, port_type_array, in mlx4_core.
This parameter may be used to set the port type uniformly for all installed ConnectX® HCAs, or it
may specify an individual configuration for each HCA.
This parameter should be specified as an options line in the file /etc/modprobe.d/mlx4_core.conf.
For example, to configure all HCAs to have Port1 as IB and Port2 as ETH, insert the following line: 

options mlx4_core port_type_array=1,2

To set HCAs individually, you may use a string of Domain:bus:device.function-x;y


For example, if you have a pair of HCAs, whose PFs are 0000:04:00.0 and 0000:05:00.0, you may
specify that the first will have both ports as IB, and the second will have both ports as ETH as
follows: 

options mlx4_core port_type_array='0000:04:00.0-1;1,0000:05:00.0-2;2'


Only the PFs are set via this mechanism. The VFs inherit their port types from their
associated PF.

Virtual Function InfiniBand Ports

Each VF presents itself as an independent vHCA to the host, while a single HCA is observable by the
network, which is unaware of the vHCAs. No changes are required by the InfiniBand subsystem,
ULPs, or applications to support SR-IOV, and vHCAs are interoperable with any existing (non-
virtualized) IB deployments.
Sharing the same physical port(s) among multiple vHCAs is achieved as follows:
• Each vHCA port presents its own virtual GID table
For further details, please refer to Configuring an Alias GUID (under ports/<n>/admin_guids)
• Each vHCA port presents its own virtual PKey table
The virtual PKey table (presented to a VF) is a mapping of selected indexes of the physical
PKey table. The host admin can control which PKey indexes are mapped to which virtual
indexes using a sysfs interface. The physical PKey table may contain both full and partial
memberships of the same PKey to allow different membership types in different virtual
tables.
• Each vHCA port has its own virtual port state
A vHCA port is up if the following conditions apply:
• The physical port is up
• The virtual GID table contains the GIDs requested by the host admin
• The SM has acknowledged the requested GIDs since the last time that the physical port
went up
• Other port attributes are shared, such as: GID prefix, LID, SM LID, LMC mask

To allow the host admin to control the virtual GID and PKey tables of vHCAs, a new sysfs 'iov'
sub-tree has been added under the PF InfiniBand device.


If the vHCA comes up without a GUID, make sure you are running the latest version of SM/
OpenSM. The SM on QDR switches does not support SR-IOV.

SR-IOV sysfs Administration Interfaces on the Hypervisor

Administration of GUIDs and PKeys is done via the sysfs interface in the Hypervisor (Dom0). This
interface is under: 

/sys/class/infiniband/<infiniband device>/iov

Under this directory, the following subdirectories can be found:


• ports - The actual (physical) port resource tables
Port GID tables:
• ports/<n>/gids/<n> where 0 <= n <= 127 (the physical port gids)
• ports/<n>/admin_guids/<n> where 0 <= n <= 127 (allows examining or changing the
administrative state of a given GUID)
• ports/<n>/pkeys/<n> where 0 <= n <= 126 (displays the contents of the physical pkey
table)

• <pci id> directories - one for Dom0 and one per guest. Here, you may see the mapping
between virtual and physical pkey indices, and the virtual to physical gid 0.
Currently, the GID mapping cannot be modified, but the pkey virtual to physical mapping can.
These directories have the structure:
• <pci_id>/port/<m>/gid_idx/0 where m = 1..2 (this is read-only)
and
• <pci_id>/port/<m>/pkey_idx/<n>, where m = 1..2 and n = 0..126

For instructions on configuring pkey_idx, please see below.

Configuring an Alias GUID (under ports/<n>/admin_guids)


1. Determine the GUID index of the PCI Virtual Function that you want to pass through to a
guest.
For example, if you want to pass through PCI function 02:00.3 to a certain guest, you initially
need to see which GUID index is used for this function.
To do so: 

cat /sys/class/infiniband/mlx4_0/iov/0000:02:00.3/port/<port_num>/gid_idx/0

The value returned will present which guid index to modify on Dom0.
2. Modify the physical GUID table via the admin_guids sysfs interface.
To configure the GUID at index <n> on port <port_num>: 

echo NEWGUID > /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids/<guid_index>

Example: 

echo "0x002fffff8118" > /sys/class/infiniband/mlx4_0/iov/ports/1/admin_guids/3

Note:
/sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids/0 is read only and
cannot be changed.
3. Read the administrative status of the GUID index.
To read the administrative status of GUID index <guid_index> on port number <port_num>: 

cat /sys/class/infiniband/mlx4_0/iov/ports/<port_num>/admin_guids/<guid_index>

4. Check the operational state of a GUID. 

/sys/class/infiniband/mlx4_0/iov/ports/<port_num>/gids (where port_num = 1 or 2)

The values indicate what gids are actually configured on the firmware/hardware, and all the
entries are R/O.
5. Compare the value you read under the "admin_guids" directory at that index with the value
under the "gids" directory, to verify that the change requested in Step 2 has been accepted by
the SM and programmed into the hardware port GID table.
If the value under admin_guids/<m> is different from the value under gids/<m>, the request
is still in progress.
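Step 5 can be scripted. The helper below takes the device's iov path as a parameter, which is an assumption made here so it can also be run against a scratch tree; in real use the path would be /sys/class/infiniband/mlx4_0/iov.

```shell
# Compare admin_guids/<idx> with gids/<idx> under <iov_path>/ports/<port>
# and report "programmed" when they match, "in progress" otherwise.
guid_state() {
    local iov="$1" port="$2" idx="$3"
    local admin oper
    admin=$(cat "${iov}/ports/${port}/admin_guids/${idx}")
    oper=$(cat "${iov}/ports/${port}/gids/${idx}")
    if [ "$admin" = "$oper" ]; then
        echo "programmed"
    else
        echo "in progress"
    fi
}

# Real usage would be: guid_state /sys/class/infiniband/mlx4_0/iov 1 3
```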

Alias GUID Support in InfiniBand

Admin VF GUIDs

As of MLNX_EN v3.0, the query_gid verb (e.g. ib_query_gid()) returns the admin desired value
instead of the value approved by the SM. This prevents cases where the SM is unreachable, a
response is delayed, or the VF is probed into a VM before its GUID is registered with the SM; in
those scenarios, the VF would otherwise see an incorrect GID (i.e., not the GID intended by the
admin).
Despite the new behavior, if the SM does not approve the GID, the VF sees its link as down.

On Demand GUIDs

GIDs are requested from the SM on demand, when needed by the VF (e.g. become active), and are
released when the GIDs are no longer in use.
Since a GID is assigned to a VF on the destination HCA, while the VF on the source HCA is shut down
(but not administratively released), using GIDs on demand eases the GID migrations.
For compatibility reasons, an explicit admin request to set/change a GUID entry is done
immediately, regardless of whether the VF is active or not to allow administrators to change the
GUID without the need to unbind/bind the VF.

Alias GUIDs Default Mode

Due to the change in the Alias GUID support in InfiniBand behavior, its default mode is now set as
HOST assigned instead of SM assigned. To enable out-of-the-box experience, the PF generates
random GUIDs as the initial admin values instead of asking the SM.

Initial GUIDs' Values

Initial GUIDs' values depend on the mlx4_ib module parameter 'sm_guid_assign' as follows:
Mode Type       Description

admin assigned  Each admin_guid entry has a randomly generated GUID value.

sm assigned     Each admin_guid entry for non-active VFs has a value of 0, meaning a GUID
                is requested from the SM upon VF activation. When a VF is active, the value
                returned by the SM becomes the admin value to be requested again later.

When a VF becomes active, and its admin value is approved, the operational GID entry is changed
accordingly. In both modes, the administrator can set/delete the value by using the sysfs
Administration Interfaces on the Hypervisor as described above.

Single GUID per VF

Each VF has a single GUID entry in the table, based on the VF number (e.g., VF 1 expects to use GID
entry 1). To determine the GUID index of the PCI Virtual Function to pass to a guest, use the sysfs
mechanism <gid_idx> directory as described above. 

Persistency Support

Once an admin request is rejected by the SM, a retry mechanism is invoked. The retry time starts
at 1 second and is multiplied by 2 on each retry, up to a maximum value of 60 seconds.
Additionally, when looking for the next record to be updated, the record having the lowest time to
be executed is chosen.
Any value reset via the admin_guid interface is immediately executed and it resets the entry's timer.
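The retry schedule described above (starting at 1 second, doubling each time, capped at 60 seconds) can be sketched as:

```shell
# Print the first N retry delays: start at 1 second, double each time,
# never exceed 60 seconds.
retry_delays() {
    local n="$1" t=1 i=0
    while [ "$i" -lt "$n" ]; do
        echo "$t"
        t=$((t * 2))
        if [ "$t" -gt 60 ]; then t=60; fi
        i=$((i + 1))
    done
}

retry_delays 8   # 1, 2, 4, 8, 16, 32, then 60 from there on
```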

Partitioning IPoIB Communication using PKeys

PKeys are used to partition IPoIB communication between the Virtual Machines and the Dom0 by
mapping a non-default full-membership PKey to virtual index 0, and mapping the default PKey to a
virtual pkey index other than zero.
The following describes how to set up two hosts, each with two Virtual Machines. Host1/vm1 will
be able to communicate via IPoIB only with Host2/vm1, and Host1/vm2 only with Host2/vm2.
In addition, Host1/Dom0 will be able to communicate only with Host2/Dom0 over ib0. vm1 and vm2
will not be able to communicate with each other, nor with Dom0.
This is done by configuring the virtual-to-physical PKey mappings for all the VMs, such that at virtual
PKey index 0, both vm-1s will have the same pkey and both vm-2s will have the same
PKey (different from the vm-1's), and the Dom0's will have the default pkey (different from the vm's
pkeys at index 0).
OpenSM must be used to configure the physical Pkey tables on both hosts.
• The physical Pkey table on both hosts (Dom0) will be configured by OpenSM to be: 

index 0 = 0xffff
index 1 = 0xb000
index 2 = 0xb030

• The vm1's virt-to-physical PKey mapping will be: 

pkey_idx 0 = 1
pkey_idx 1 = 0

• The vm2's virt-to-phys pkey mapping will be: 

pkey_idx 0 = 2
pkey_idx 1 = 0

Thus, the default pkey will reside on the VMs at index 1 instead of index 0.

The IPoIB QPs are created to use the PKey at index 0. As a result, the Dom0, vm1 and vm2 IPoIB QPs
will all use different PKeys.

To partition IPoIB communication using PKeys:


1. Create a file "/etc/opensm/partitions.conf" on the host on which OpenSM runs, containing
the following lines: 

Default=0x7fff,ipoib : ALL=full ;
Pkey1=0x3000,ipoib : ALL=full;
Pkey3=0x3030,ipoib : ALL=full;

This will cause OpenSM to configure the physical Port Pkey tables on all physical ports on the
network as follows: 

pkey idx | pkey value
---------|-----------
0        | 0xFFFF
1        | 0xB000
2        | 0xB030

The most significant bit indicates if a PKey is a full PKey. 

 The ",ipoib" causes OpenSM to pre-create the IPoIB broadcast group for the indicated
PKeys.

2. Configure (on Dom0) the virtual-to-physical PKey mappings for the VMs.
a. Check the PCI ID for the Physical Function and the Virtual Functions. 

lspci | grep Mel

b. Assuming that on Host1 the physical function displayed by lspci is "0000:02:00.0", and
that on Host2 it is "0000:03:00.0", do the following on Host1: 

cd /sys/class/infiniband/mlx4_0/iov
0000:02:00.0 0000:02:00.1 0000:02:00.2 ...

Note: 0000:02:00.0 contains the virtual-to-physical mapping tables for the physical
function, and 0000:02:00.X contain the virt-to-phys mapping tables for the virtual
functions. Do not touch the Dom0 mapping table (under <nnnn>:<nn>:00.0). Modify only
tables under 0000:02:00.1 and/or 0000:02:00.2. We assume that vm1 uses VF 0000:02:00.1
and vm2 uses VF 0000:02:00.2.
c. Configure the virtual-to-physical PKey mapping for the VMs. 

echo 0 > 0000:02:00.1/ports/1/pkey_idx/1


echo 1 > 0000:02:00.1/ports/1/pkey_idx/0
echo 0 > 0000:02:00.2/ports/1/pkey_idx/1
echo 2 > 0000:02:00.2/ports/1/pkey_idx/0

vm1 pkey index 0 will be mapped to physical pkey-index 1, and vm2 pkey index 0 will
be mapped to physical pkey index 2. Both vm1 and vm2 will have their pkey index 1
mapped to the default pkey. 
d. On Host2 do the following. 

cd /sys/class/infiniband/mlx4_0/iov
echo 0 > 0000:03:00.1/ports/1/pkey_idx/1
echo 1 > 0000:03:00.1/ports/1/pkey_idx/0
echo 0 > 0000:03:00.2/ports/1/pkey_idx/1
echo 2 > 0000:03:00.2/ports/1/pkey_idx/0

e. Once the VMs are running, you can check the VM's virtualized PKey table by running
the following (on the VM): 

cat /sys/class/infiniband/mlx4_0/ports/[1,2]/pkeys/[0,1]

3. Start up the VMs (and bind VFs to them).


4. Configure IP addresses for ib0 on the host and on the guests.
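The per-VF mappings written in step 2c above can be wrapped in a small helper. This is a hypothetical sketch, not part of the driver: the IOV path and the VF BDFs (0000:02:00.1, 0000:02:00.2) are taken from the Host1 example, and on a machine without the hardware the sysfs write is skipped and the intended action is printed instead.

```shell
#!/bin/sh
# Sketch: apply the virtual-to-physical PKey mappings from step 2c.
# Adjust IOV_DIR and the VF BDFs for your own setup.
IOV_DIR=/sys/class/infiniband/mlx4_0/iov

map_pkey() {
    # $1 = VF BDF, $2 = virtual pkey index, $3 = physical pkey index
    entry="$IOV_DIR/$1/ports/1/pkey_idx/$2"
    if [ -w "$entry" ]; then
        echo "$3" > "$entry"
    else
        # Dry run when the sysfs entry is absent (no hardware)
        echo "would write $3 -> $entry"
    fi
}

# vm1 (VF 0000:02:00.1): virtual index 0 -> physical index 1, index 1 -> 0
map_pkey 0000:02:00.1 0 1
map_pkey 0000:02:00.1 1 0
# vm2 (VF 0000:02:00.2): virtual index 0 -> physical index 2, index 1 -> 0
map_pkey 0000:02:00.2 0 2
map_pkey 0000:02:00.2 1 0
```

The same helper applies unchanged on Host2 after switching IOV_DIR's VF BDFs to the 0000:03:00.X functions.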

Running Network Diagnostic Tools on a Virtual Function in ConnectX-3/
ConnectX-3 Pro

Until now, in MLNX_EN, administrators were unable to run network diagnostics from a VF since
sending and receiving Subnet Management Packets (SMPs) from a VF was not allowed, for security
reasons: SMPs are not restricted by network partitioning and may affect the physical network
topology. Moreover, even the SM may be denied access from portions of the network by setting
management keys unknown to the SM.
However, it is desirable to grant SMP capability to certain privileged VFs, so certain network
management activities may be conducted within virtual machines rather than only on the
hypervisor.

Granting SMP Capability to a Virtual Function

To enable SMP capability for a VF, one must enable the Subnet Management Interface (SMI) for that
VF. By default, the SMI interface is disabled for VFs. To enable SMI MADs for VFs, there are two new
sysfs entries per VF per port on the hypervisor (under /sys/class/infiniband/mlx4_X/iov/<b.d.f>/ports/
<1 or 2>). These entries are displayed only for VFs (not for the PF), and only for IB ports (not ETH
ports).
The first entry, enable_smi_admin, is used to enable SMI on a VF. By default, the value of this entry
is zero (disabled). When set to "1", the SMI will be enabled for the VF on the next rebind or openibd
restart on the VM that the VF is bound to. If the VF is currently bound, it must be unbound and then
re-bound.
The second sysfs entry, smi_enabled, indicates the current enablement state of the SMI. 0 indicates
disabled, and 1 indicates enabled. This entry is read-only.
When a VF is initialized (bound), during the initialization sequence, the driver copies the requested
SMI state (enable_smi_admin) for that VF/port to the operational SMI state (smi_enabled) for that
VF/port, and operates according to the operational state.
Thus, the sequence of operations on the hypervisor is:
1. Enable SMI for any VF/port that you wish.
2. Restart the VM that the VF is bound to (or just run /etc/init.d/openibd restart on that VM)

The SMI will be enabled for the VF/port combinations that you set in step 1 above. You will then be
able to run network diagnostics from that VF.

Installing MLNX_EN with Network Diagnostics on a VM

To install MLNX_EN on a VF which will be enabled to run the tools, run the following on the
VM: 

mlnx_en_install

MAC Forwarding DataBase (FDB) Management in ConnectX-3/
ConnectX-3 Pro

FDB Status Reporting

FDB, also known as Forwarding Information Base (FIB) or the forwarding table, is most commonly used
in network bridging, routing, and similar functions to find the proper interface to which the input
interface should forward a packet.
In the SR-IOV environment, the Ethernet driver can share the existing 128 MACs (for each port)
among the Virtual Functions (VFs) and Physical Functions (PFs) that share the same table as follows:
• Each VF gets 2 granted MACs (which are taken from the general pool of the 128 MACs)
• Each VF/PF can request up to 128 MACs on a first-come, first-served basis (meaning,
except for the 2 granted MACs, the other MACs in the pool are free to be requested)

To check if there are free MACs for its interface (PF or VF), run: 

cat /sys/class/net/<ethX>/fdb_det

Example:

cat /sys/class/net/eth2/fdb_det
device eth2: max: 112, used: 2, free macs: 110

To add a new MAC to the interface: 

echo +<MAC> > /sys/class/net/eth<X>/fdb

Once running the command above, the interface (VF/PF) verifies if a free MAC exists. If there is a
free MAC, the VF/PF takes it from the global pool and allocates it. If there is no free MAC, an error
is returned notifying the user of lack of MACs in the pool.
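The free-MAC check above can be scripted. The following sketch (an illustration, with the sample fdb_det line copied from the example and the interface name eth2 assumed) parses the free-MAC count before attempting an add:

```shell
#!/bin/sh
# Sketch: parse an fdb_det line and only attempt a MAC add when the
# pool still has free entries.
free_macs() {
    # Extract the trailing "free macs: N" count from an fdb_det line
    echo "$1" | sed 's/.*free macs: *//'
}

line="device eth2: max: 112, used: 2, free macs: 110"
if [ "$(free_macs "$line")" -gt 0 ]; then
    echo "pool has $(free_macs "$line") free MACs"
    # Real add (commented out here; requires the hardware):
    # echo +00:11:22:33:44:55 > /sys/class/net/eth2/fdb
fi
```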

To delete a MAC from the interface: 

echo -<MAC> > /sys/class/net/eth<X>/fdb

If /sys/class/net/eth<X>/fdb does not exist, use the bridge tool from the iproute2 package,
which includes the tools to manage FDB tables, as the kernel supports FDB callbacks:

bridge fdb add 00:01:02:03:04:05 permanent self dev p3p1
bridge fdb del 00:01:02:03:04:05 permanent self dev p3p1
bridge fdb show dev p3p1


If adding a new MAC from the kernel's NDO function fails due to insufficient MACs in
the pool, the following error flow will occur:

• If the interface is a PF, it will automatically enter the promiscuous mode
• If the interface is a VF, it will try to enter the promiscuous mode and since it does
not support it, the action will fail and an error will be printed in the kernel's log

Virtual Guest Tagging (VGT+)

VGT+ is an advanced mode of Virtual Guest Tagging (VGT), in which a VF is allowed to tag its own
packets as in VGT, but is still subject to an administrative VLAN trunk policy. The policy determines
which VLAN IDs are allowed to be transmitted or received. The policy does not determine the user
priority, which is left unchanged.
Packets can be sent in one of the following modes: when the VF is allowed to send/receive untagged
and priority tagged traffic and when it is not. No default VLAN is defined for VGT+ port. The send
packets are passed to the eSwitch only if they match the set, and the received packets are
forwarded to the VF only if they match the set.


In some old OSs, such as SLES11 SP4, any VLAN can be created in the VM, regardless of the
VGT+ configuration, but traffic will only pass for the allowed VLANs.

Configuring VGT+ for ConnectX-3/ConnectX-3 Pro

The following are the current VGT+ limitations:


• The size of the VLAN set is defined to be up to 10 VLANs including the VLAN 0 that is added
for untagged/priority-tagged traffic
• This behavior applies to all VF traffic: plain Ethernet, and all RoCE transports
• VGT+ allowed VLAN sets may be only extended when the VF is online
• An operational VLAN set becomes identical to the administrative VLAN set only after a VF
reset
• VGT+ is available in DMFS mode only 

The default operating mode is VGT: 

cat /sys/class/net/eth5/vf0/vlan_set
oper:
admin:

Both states (operational and administrative) are empty.


If you set the vlan_set parameter with more than 10 VLAN IDs, the driver chooses the first 10
VLAN IDs provided and ignores all the rest.

To enable VGT+ mode:


1. Set the corresponding port/VF (in the example below port eth5 VF0) list of allowed VLANs. 

echo 0 1 2 3 4 5 6 7 8 9 > /sys/class/net/eth5/vf0/vlan_set

Where 0 specifies that untagged/priority-tagged traffic is allowed.


Meaning, if the command below is run (without VLAN 0), you will not be able to send/receive untagged traffic. 

echo 1 2 3 4 5 6 7 8 9 10 > /sys/class/net/eth5/vf0/vlan_set

2. Reboot the relevant VM for changes to take effect, or run: /etc/init.d/openibd restart

To disable VGT+ mode:


1. Set the VLAN. 

echo > /sys/class/net/eth5/vf0/vlan_set

2. Reboot the relevant VM for changes to take effect, or run: /etc/init.d/openibd restart

To add a VLAN:

In the example below, the following state exists: 

cat /sys/class/net/eth5/vf0/vlan_set
oper: 0 1 2 3
admin: 0 1 2 3

1. Set the new administrative VLAN set. 

echo 2 3 4 5 6 > /sys/class/net/eth5/vf0/vlan_set

The delta will be added to the operational state immediately (4 5 6): 

cat /sys/class/net/eth5/vf0/vlan_set
oper: 0 1 2 3 4 5 6
admin: 2 3 4 5 6

2. Reset the VF for changes to take effect.
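The oper/admin split above can also be checked from a script. This is a small illustration (the sample strings are copied from the example; on a real system read them with cat /sys/class/net/eth5/vf0/vlan_set):

```shell
#!/bin/sh
# Sketch: compare the oper and admin lines of vlan_set to detect whether
# a VF reset is still pending (i.e. removals not yet applied).
vlan_sync_state() {
    # $1 = oper line, $2 = admin line; strip the prefixes and compare
    if [ "${1#oper:}" = "${2#admin:}" ]; then
        echo "in sync"
    else
        echo "VF reset pending"
    fi
}

# Sample values from the example above
oper="oper: 0 1 2 3 4 5 6"
admin="admin: 2 3 4 5 6"
vlan_sync_state "$oper" "$admin"
```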

Configuring VGT+ for ConnectX-4/ConnectX-5 


When working in SR-IOV, the default operating mode is VGT.

To enable VGT+ mode:

Set the corresponding port/VF (in the example below port eth5, VF0) range of allowed VLANs. 

echo "<add> <start_vid> <end_vid>" > /sys/class/net/eth5/device/sriov/0/trunk

Examples:
• Adding VLAN ID range (4-15) to trunk: 

echo add 4 15 > /sys/class/net/eth5/device/sriov/0/trunk

• Adding a single VLAN ID to trunk: 

echo add 17 17 > /sys/class/net/eth5/device/sriov/0/trunk

Note: When VLAN ID = 0, it indicates that untagged and priority-tagged traffic is allowed

To disable VGT+ mode, make sure to remove all VLANs. 

echo rem 0 4095 > /sys/class/net/eth5/device/sriov/0/trunk

To remove selected VLANs:


• Remove VLAN ID range (4-15) from trunk: 

echo rem 4 15 > /sys/class/net/eth5/device/sriov/0/trunk

• Remove a single VLAN ID from trunk: 

echo rem 17 17 > /sys/class/net/eth5/device/sriov/0/trunk
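The trunk operations above can be wrapped in a small guarded helper. This is a sketch only: the interface name (eth5) and VF index (0) come from the example, and on a machine without the NIC the write is printed rather than executed.

```shell
#!/bin/sh
# Sketch wrapper for the VGT+ trunk sysfs interface described above.
TRUNK=/sys/class/net/eth5/device/sriov/0/trunk

trunk_op() {
    # $1 = add|rem, $2 = start VID, $3 = end VID
    if [ -w "$TRUNK" ]; then
        echo "$1 $2 $3" > "$TRUNK"
    else
        # Dry run when the sysfs entry is absent
        echo "would run: echo $1 $2 $3 > $TRUNK"
    fi
}

trunk_op add 0 0     # allow untagged/priority-tagged traffic
trunk_op add 4 15    # allow VLAN ID range 4-15
trunk_op add 17 17   # allow single VLAN ID 17
```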

Virtualized QoS per VF (Rate Limit per VF) in ConnectX-3/ConnectX-3 Pro

Virtualized QoS per VF (supported in ConnectX®-3/ConnectX®-3 Pro adapter cards only, with
firmware v2.33.5100 and above) limits the chosen VFs' transmit throughput (maximum
throughput). The granularity of the rate limitation is 1 Mbit/s.
The feature is disabled by default. To enable it, set the "enable_vfs_qos" module parameter to "1"
in the "options mlx4_core" line. When set, and when the feature is supported, it is reported at PF
driver load time (in DEV_CAP in the kernel log: Granular QoS Rate limit per VF support), provided
the mlx4_core module parameter debug_level is set to 1. For further information, please refer to
the debug_level parameter in "mlx4_core Parameters".
When set, and supported by the firmware, running as SR-IOV Master and Ethernet link, the driver
also provides information on the number of total available vPort Priority Pair (VPPs) and how many
VPPs are allocated per priority. All the available VPPs will be allocated on priority 0.

mlx4_core 0000:1b:00.0: Port 1 Available VPPs 63


mlx4_core 0000:1b:00.0: Port 1 UP 0 Allocated 63 VPPs
mlx4_core 0000:1b:00.0: Port 1 UP 1 Allocated 0 VPPs
mlx4_core 0000:1b:00.0: Port 1 UP 2 Allocated 0 VPPs
mlx4_core 0000:1b:00.0: Port 1 UP 3 Allocated 0 VPPs
mlx4_core 0000:1b:00.0: Port 1 UP 4 Allocated 0 VPPs
mlx4_core 0000:1b:00.0: Port 1 UP 5 Allocated 0 VPPs
mlx4_core 0000:1b:00.0: Port 1 UP 6 Allocated 0 VPPs
mlx4_core 0000:1b:00.0: Port 1 UP 7 Allocated 0 VPPs

Configuring Rate Limit for VFs


The rate limit configuration will take effect only when the VF is in VST mode configured
with priority 0.

Rate limit can be configured using the iproute2/netlink tool. 

ip link set dev <PF device> vf <NUM> rate <TXRATE>

where:
• NUM = 0...<Num of VF>
• <TXRATE> in units of 1Mbit/s

The rate limit for VF can be configured:


• While setting it to the VST mode.

ip link set dev <PF device> vf <NUM> vlan <vlan_id> [qos <qos>] rate <TXRATE>

• Before the VF enters the VST mode with a supported priority.


In this case, the rate limit value is saved and the rate limit configuration is applied when VF
state is changed to VST mode.

To disable the rate limit configured for a VF, set the VF rate to 0. Once the rate limit is set, you
cannot switch to VGT or change the VST priority.
To view current rate limit configurations for VFs, use the iproute2 tool. 

ip link show dev <PF device>

Example:

89: eth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN mode DEFAULT qlen 1000
link/ether f4:52:14:5e:be:20 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, vlan 2, tx rate 1500 (Mbps), spoof checking off, link-state auto
vf 1 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 2 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 3 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto

On some OSs, the ip tool may not display the configured rate, or any of the VF information, although
both the VST and the rate limit were set through the netlink command. In order to view the
configured rate limit, use the sysfs provided by the driver. It can be found at: 

/sys/class/net/<eth-x>/<vf-i>/tx_rate
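The sysfs fallback above can be iterated over all VFs of a PF. A minimal sketch, assuming the per-VF vf<i> directory layout shown and an example PF name of eth1; missing entries are reported rather than treated as fatal:

```shell
#!/bin/sh
# Sketch: print the tx_rate of every VF of a PF via the sysfs fallback.
show_vf_rates() {
    for d in /sys/class/net/"$1"/vf*; do
        if [ -r "$d/tx_rate" ]; then
            echo "$(basename "$d"): $(cat "$d/tx_rate") Mbit/s"
        else
            # Unmatched glob or no hardware: report and continue
            echo "no tx_rate under $d"
        fi
    done
}

show_vf_rates eth1
```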

SR-IOV Advanced Security Features

SR-IOV MAC Anti-Spoofing

Normally, MAC addresses are unique identifiers assigned to network interfaces, and they are fixed
addresses that cannot be changed. MAC address spoofing is a technique for altering the MAC address
to serve different purposes. Some of the cases in which a MAC address is altered can be legal, while
others can be illegal and abuse security mechanisms or disguise a possible attacker.
The SR-IOV MAC address anti-spoofing feature, also known as MAC Spoof Check, provides protection
against malicious VM MAC address forging. If the network administrator assigns a MAC address to a
VF (through the hypervisor) and enables spoof check on it, this will limit the end user to sending
traffic only from the assigned MAC address of that VF.

MAC Anti-Spoofing Configuration


MAC anti-spoofing is disabled by default.

In the configuration example below, the VM is located on VF-0 and has the following MAC address:
11:22:33:44:55:66.
There are two ways to enable or disable MAC anti-spoofing:
1. Use the standard IP link commands - available from Kernel 3.10 and above.
a. To enable MAC anti-spoofing, run: 

ip link set ens785f1 vf 0 spoofchk on

b. To disable MAC anti-spoofing, run: 

ip link set ens785f1 vf 0 spoofchk off

2. Echo "ON" or "OFF" to the file located under /sys/class/net/<ETH_IF_NAME>/device/sriov/
<VF index>/spoofchk.
a. To enable MAC anti-spoofing, run: 

echo "ON" > /sys/class/net/ens785f1/device/sriov/0/spoofchk

b. To disable MAC anti-spoofing, run: 

echo "OFF" > /sys/class/net/ens785f1/device/sriov/0/spoofchk


This configuration is non-persistent and does not survive driver restart. 


In order for spoof-check enabling/disabling to take effect while the VF is up and running on
ConnectX-3 Pro adapter cards, it is required to perform a driver restart on the guest OS. 
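When spoof check should be enforced fleet-wide, the per-VF sysfs writes above can be looped. This sketch assumes the device/sriov sysfs layout described in this section; the interface name is the example's, and absent entries are skipped with a note so the script is safe to run anywhere:

```shell
#!/bin/sh
# Sketch: enable spoof checking on every VF of an interface in one pass.
enable_spoofchk() {
    for f in /sys/class/net/"$1"/device/sriov/*/spoofchk; do
        if [ -w "$f" ]; then
            echo ON > "$f"
        else
            # Unmatched glob or no hardware: skip with a note
            echo "skip: $f"
        fi
    done
}

enable_spoofchk ens785f1
```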

Limit and Bandwidth Share Per VF


This feature is at beta level.

This feature enables rate limiting traffic per VF in SR-IOV mode for ConnectX-4/ConnectX-4 Lx/
ConnectX-5 adapter cards. For details on how to configure rate limit per VF for ConnectX-4/
ConnectX-5, refer to HowTo Configure Rate Limit per VF for ConnectX-4/ConnectX-5 Community
post.

Limit Bandwidth per Group of VFs

VFs Rate Limit for vSwitch (OVS) feature allows users to join available VFs into groups and set a rate
limitation on each group. Rate limitation on a VF group ensures that the total Tx bandwidth that the
VFs in this group get (altogether combined) will not exceed the given value.
With this feature, a VF can still be configured with an individual rate limit as in the past (under /
sys/class/net/<ifname>/device/sriov/<vf_num>/max_tx_rate). However, the actual bandwidth limit
on the VF will eventually be determined considering the VF group limitation and how many VFs are
in the same group.
For example: 2 VFs (0 and 1) are attached to group 3.

Case 1: The rate limitation on the group is set to 20G. Rate limit of each VF is 15G
Result: Each VF will have a rate limit of 10G

Case 2: Group’s max rate limitation is still set to 20G. VF 0 is configured to 30G limit, while VF 1 is
configured to 5G rate limit
Result: VF 0 will have 15G de-facto. VF 1 will have 5G

The rule of thumb is that the group’s bandwidth is distributed evenly between the number of VFs in
the group. If there are leftovers, they will be assigned to VFs whose individual rate limit has not
been met yet.
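The distribution rule can be illustrated with toy arithmetic for a two-VF group. This is a model of the rule of thumb above, not driver code: each VF first gets min(individual limit, even share), and any leftover goes to a VF that has not met its individual limit yet. It reproduces both cases above.

```shell
#!/bin/sh
# Toy model of group rate distribution for a two-VF group.
# All rates are in Gbit/s.
group_share() {
    # $1 = group limit, $2 = VF0 individual limit, $3 = VF1 individual limit
    g=$1 l0=$2 l1=$3
    even=$(( g / 2 ))
    # Each VF starts with min(individual limit, even share)
    r0=$l0; [ "$r0" -gt "$even" ] && r0=$even
    r1=$l1; [ "$r1" -gt "$even" ] && r1=$even
    left=$(( g - r0 - r1 ))
    # Hand leftovers to VFs still below their individual limit
    if [ "$left" -gt 0 ] && [ "$r0" -lt "$l0" ]; then
        add=$(( l0 - r0 )); [ "$add" -gt "$left" ] && add=$left
        r0=$(( r0 + add )); left=$(( left - add ))
    fi
    if [ "$left" -gt 0 ] && [ "$r1" -lt "$l1" ]; then
        add=$(( l1 - r1 )); [ "$add" -gt "$left" ] && add=$left
        r1=$(( r1 + add ))
    fi
    echo "$r0 $r1"
}

group_share 20 15 15   # case 1: prints "10 10"
group_share 20 30 5    # case 2: prints "15 5"
```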

VFs Rate Limit Feature Configuration

1. When VF rate groups are supported by the firmware, the driver creates a new hierarchy in the
SR-IOV sysfs named “groups” (/sys/class/net/<ifname>/device/sriov/groups/). It contains all
the info and the configurations allowed for VF groups.
2. All VFs are placed in group 0 by default since it is the only existing group following the initial
driver start. It would be the only group available under /sys/class/net/<ifname>/device/
sriov/groups/
3. The VF can be moved to a different group by writing to the group file -> echo $GROUP_ID > /
sys/class/net/<ifname>/device/sriov/<vf_id>/group
4. The group IDs allowed are 0-255
5. A group configuration is available under /sys/class/net/<ifname>/device/sriov/groups/ only
when there is at least 1 VF in the group (except for group 0, which is always available even
when it’s empty).
6. Once the group is created (by moving at least 1 VF to that group), users can configure the
group’s rate limit. For example:

a. echo 10000 > /sys/class/net/<ifname>/device/sriov/5/max_tx_rate – setting individual
rate limitation of VF 5 to 10G (Optional)
b. echo 7 > /sys/class/net/<ifname>/device/sriov/5/group – moving VF 5 to group 7
c. echo 5000 > /sys/class/net/<ifname>/device/sriov/groups/7/max_tx_rate – setting
group 7 with rate limitation of 5G
d. When running traffic via VF 5 now, it will be limited to 5G because of the group rate
limit even though the VF itself is limited to 10G
e. echo 3 > /sys/class/net/<ifname>/device/sriov/5/group – moving VF 5 to group 3
f. Group 7 will now disappear from /sys/class/net/<ifname>/device/sriov/groups since
there are 0 VFs in it. Group 3 will now appear. Since there’s no rate limit on group 3,
VF 5 can transmit at 10G (thanks to its individual configuration)

Notes:

• You can see which group a VF belongs to in the ‘stats’ sysfs (cat /sys/class/net/
<ifname>/device/sriov/<vf_num>/stats)
• You can see the current rate limit and number of attached VFs to a group in the group’s
‘config’ sysfs (cat /sys/class/net/<ifname>/device/sriov/groups/<group_id>/config)
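Steps a-c above can be expressed as guarded sysfs writes. This is a sketch only: the PF name is an assumption, and on machines without the hardware the intended write is printed instead of executed.

```shell
#!/bin/sh
# Sketch of the VF-group configuration sequence (steps a-c above).
IFNAME=ens1f0   # assumed PF netdev name
SRIOV=/sys/class/net/$IFNAME/device/sriov

sysfs_set() {
    # $1 = value, $2 = sysfs file
    if [ -w "$2" ]; then
        echo "$1" > "$2"
    else
        echo "would run: echo $1 > $2"
    fi
}

sysfs_set 10000 "$SRIOV/5/max_tx_rate"         # VF 5 individual limit: 10G
sysfs_set 7     "$SRIOV/5/group"               # move VF 5 into group 7
sysfs_set 5000  "$SRIOV/groups/7/max_tx_rate"  # group 7 limit: 5G
```

With this configuration, traffic on VF 5 is capped at 5G by the group limit even though the VF's own limit is 10G, matching step d above.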

Privileged VFs

In case a malicious driver is running over one of the VFs, and that VF's permissions are not
restricted, security holes may be opened. However, VFs can be marked as trusted and can thus
receive an exclusive subset of physical function privileges or permissions. For example, if all VFs,
rather than specific VFs, are allowed to enter a promiscuous mode as a privilege, malicious users
will be able to sniff and monitor the entire physical port for incoming traffic, including traffic
targeting other VFs, which is considered a severe security hole.

Privileged VFs Configuration

In the configuration example below, the VM is located on VF-0 and has the following MAC address:
11:22:33:44:55:66.
There are two ways to enable or disable trust:
1. Use the standard IP link commands - available from Kernel 4.5 and above.
a. To enable trust for a specific VF, run: 

ip link set ens785f1 vf 0 trust on

b. To disable trust for a specific VF, run: 

ip link set ens785f1 vf 0 trust off

2. Echo "ON" or "OFF" to the file located under /sys/class/net/<ETH_IF_NAME>/device/sriov/
<VF index>/trust.

a. To enable trust for a specific VF, run: 

echo "ON" > /sys/class/net/ens785f1/device/sriov/0/trust

b. To disable trust for a specific VF, run: 

echo "OFF" > /sys/class/net/ens785f1/device/sriov/0/trust

Probed VFs

Probing Virtual Functions (VFs) after SR-IOV is enabled might consume the adapter cards' resources.
Therefore, it is recommended not to enable probing of VFs when no monitoring of the VM is needed.
VF probing can be disabled in two ways, depending on the kernel version installed on your server:
1. If the kernel version installed is v4.12 or above, it is recommended to use the PCI sysfs
interface sriov_drivers_autoprobe. For more information, see linux-next branch.
2. If the kernel version installed is older than v4.12, it is recommended to use the mlx5_core
module parameter probe_vf with MLNX_EN v4.1 or above.

Example: 

echo 0 > /sys/module/mlx5_core/parameters/probe_vf

For more information on how to probe VFs, see HowTo Configure and Probe VFs on mlx5 Drivers
Community post.

VF Promiscuous Rx Modes

VF Promiscuous Mode

VFs can enter a promiscuous mode that enables receiving the unmatched traffic and all the
multicast traffic that reaches the physical port, in addition to the traffic originally targeted to the
VF. Unmatched traffic is any traffic whose DMAC does not match any of the VFs' or PFs' MAC
addresses.
Note: Only privileged/trusted VFs can enter the VF promiscuous mode.

To set the promiscuous mode on for a VF, run: 

ifconfig eth2 promisc

To exit the promiscuous mode, run: 

ifconfig eth2 -promisc

VF All-Multi Mode

VFs can enter an all-multi mode that enables receiving all the multicast traffic sent from/to the
other functions on the same physical port in addition to the traffic originally targeted to the VF.
Note: Only privileged/trusted VFs can enter the all-multi RX mode.

To set the all-multi mode on for a VF, run: 

ifconfig eth2 allmulti

To exit the all-multi mode, run: 

ifconfig eth2 -allmulti

Uninstalling the SR-IOV Driver

To uninstall SR-IOV driver, perform the following:


1. For Hypervisors, detach all the Virtual Functions (VFs) from all the Virtual Machines (VMs), or
stop the Virtual Machines that use the Virtual Functions.
Please be aware that stopping the driver while there are VMs that use the VFs will cause the
machine to hang.
2. Run the script below. Please be aware that uninstalling the driver deletes all the driver's files
but does not unload the driver. 

[root@swl022 ~]# /usr/sbin/mlnx_en_uninstall.sh


This program will uninstall all MLNX_EN packages on your machine.
Do you want to continue?[y/N]:y
Running /usr/sbin/vendor_pre_uninstall.sh
Removing MLNX_EN Software installations
Running /bin/rpm -e --allmatches kernel-ib kernel-ib-devel libibverbs libibverbs-devel libibverbs-
devel-static libibverbs-utils libmlx4 libmlx4-devel libibcm libibcm-devel libibumad libibumad-devel
libibumad-static libibmad libibmad-devel libibmad-static librdmacm librdmacm-utils librdmacm-devel ibacm
opensm-libs opensm-devel perftest compat-dapl compat-dapl-devel dapl dapl-devel dapl-devel-static dapl-
utils srptools infiniband-diags-guest ofed-scripts opensm-devel
warning: /etc/infiniband/openib.conf saved as /etc/infiniband/openib.conf.rpmsave
Running /tmp/2818-ofed_vendor_post_uninstall.sh

3. Restart the server.

Enabling Paravirtualization

To enable Paravirtualization:


The example below works on RHEL6.* or RHEL7.* without a Network Manager.

1. Create a bridge. 

vim /etc/sysconfig/network-scripts/ifcfg-bridge0
DEVICE=bridge0
TYPE=Bridge
IPADDR=12.195.15.1
NETMASK=255.255.0.0
BOOTPROTO=static
ONBOOT=yes
NM_CONTROLLED=no
DELAY=0

2. Change the related interface (in the example below bridge0 is created over eth5). 

DEVICE=eth5
BOOTPROTO=none
STARTMODE=on
HWADDR=00:02:c9:2e:66:52

TYPE=Ethernet
NM_CONTROLLED=no
ONBOOT=yes
BRIDGE=bridge0

3. Restart the service network.


4. Attach a bridge to VM. 

ifconfig -a

eth6 Link encap:Ethernet HWaddr 52:54:00:E7:77:99
inet addr:13.195.15.5 Bcast:13.195.255.255 Mask:255.255.0.0
inet6 addr: fe80::5054:ff:fee7:7799/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:481 errors:0 dropped:0 overruns:0 frame:0
TX packets:450 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:22440 (21.9 KiB) TX bytes:19232 (18.7 KiB)
Interrupt:10 Base address:0xa000

VXLAN Hardware Stateless Offloads


VXLAN technology provides solutions to scalability and security challenges. It requires an extension
of the traditional stateless offloads to avoid a performance drop. ConnectX-3 Pro and ConnectX-4
family adapter cards offer the following stateless offloads for a VXLAN packet, similar to the ones
offered to non-encapsulated packets. The VXLAN protocol encapsulates its packets using an outer
UDP header.

Available hardware stateless offloads:


• Checksum generation (Inner IP and Inner TCP/UDP)
• Checksum validation (Inner IP and Inner TCP/UDP). This will allow the use of GRO (in
ConnectX-3 Pro card only) for inner TCP packets.
• TSO support for inner TCP packets
• RSS distribution according to inner packets attributes
• Receive queue selection - inner frames may be steered to specific QPs 

VXLAN Hardware Stateless Offloads requires the following prerequisites:


• HCA and their minimum firmware required:
• ConnectX-3 Pro - Firmware v2.32.5100
• ConnectX-4 - Firmware v12.14.xxxx
• ConnectX-4 Lx - Firmware v14.14.xxxx
• Operating Systems:
• RHEL7, Ubuntu 14.04 or upstream kernel 3.12.10 (or higher)
• ConnectX-3 Pro Supported Features:
• DMFS enabled
• A0 static mode disabled

Enabling VXLAN Hardware Stateless Offloads for ConnectX-3 Pro

To enable VXLAN offloads support, load the mlx4_core driver with Device-Managed Flow Steering
(DMFS) enabled. DMFS is the default steering mode.

To verify it is enabled by the adapter card:


1. Open the /etc/modprobe.d/mlnx.conf file. 

2. Set the parameter debug_level to "1". 

options mlx4_core debug_level=1

3. Restart the driver.


4. Verify in the dmesg that the tunneling mode is: vxlan.

The net-device will advertise the tx-udp_tnl-segmentation flag, shown when running "ethtool -k
$DEV | grep udp", only when VXLAN is configured in the OpenvSwitch (OVS) with the configured UDP
port.
Example:

$ ethtool -k eth0 | grep udp_tnl


tx-udp_tnl-segmentation: on

As of firmware version 2.31.5050, VXLAN tunnel can be set on any desired UDP port. If using
previous firmware versions, set the VXLAN tunnel over UDP port 4789.

To add the UDP port to /etc/modprobe.d/vxlan.conf: 

options vxlan udp_port=<number decided above>

Enabling VXLAN Hardware Stateless Offloads for ConnectX®-4 Family Devices

VXLAN offload is enabled by default for ConnectX-4 family devices running the minimum required
firmware version and a kernel version that includes VXLAN support.

To confirm if the current setup supports VXLAN, run: 

ethtool -k $DEV | grep udp_tnl

Example:

ethtool -k ens1f0 | grep udp_tnl


tx-udp_tnl-segmentation: on

ConnectX-4 family devices support configuring multiple UDP ports for VXLAN offload. Ports can be
added to the device by configuring a VXLAN device from the OS command line using the "ip"
command.

Note: If you configure multiple UDP ports for offload and exceed the total number of ports
supported by hardware, then those additional ports will still function properly, but will not benefit
from any of the stateless offloads.

Example:

ip link add vxlan0 type vxlan id 10 group 239.0.0.10 ttl 10 dev ens1f0 dstport 4789
ip addr add 192.168.4.7/24 dev vxlan0
ip link set up vxlan0

Note: The 'dstport' parameter is not supported in Ubuntu 14.04.

The VXLAN ports can be removed by deleting the VXLAN interfaces.

Example: 

ip link delete vxlan0

To verify that the VXLAN ports are offloaded, use debugfs (if supported):
1. Mount debugfs. 

mount -t debugfs nodev /sys/kernel/debug

2. List the offloaded ports. 

ls /sys/kernel/debug/mlx5/$PCIDEV/VXLAN

Where $PCIDEV is the PCI device number of the relevant ConnectX-4 family device.
Example: 

ls /sys/kernel/debug/mlx5/0000:81:00.0/VXLAN
4789

Important Notes
• VXLAN tunneling adds 50 bytes (14-eth + 20-ip + 8-udp + 8-vxlan) to the VM Ethernet frame.
Please verify that either the MTU of the NIC that sends the packets (e.g. the VM virtio-net NIC
or the host-side veth device) or the uplink takes the tunneling overhead into account.
Meaning, either the MTU of the sending NIC has to be decremented by 50 bytes (e.g. 1450 instead of
1500), or the uplink NIC MTU has to be incremented by 50 bytes (e.g. 1550 instead of 1500)
• From upstream 3.15-rc1 and onward, it is possible to use arbitrary UDP port for VXLAN. Note
that this requires firmware version 2.31.2800 or higher. Additionally, you need to enable this
kernel configuration option CONFIG_MLX4_EN_VXLAN=y (ConnectX-3 Pro only).
• On upstream kernels 3.12/3.13, GRO with VXLAN is not supported
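The 50-byte overhead arithmetic from the first note above can be sketched as follows. DEV is an assumed name for the sending NIC, and the ip command is only printed here, not executed:

```shell
#!/bin/sh
# Compute the VXLAN tunneling overhead and the resulting sending-NIC MTU.
DEV=vxlan_sender0   # assumed name for the sending NIC
ETH=14; IP=20; UDP=8; VXLAN=8
OVERHEAD=$(( ETH + IP + UDP + VXLAN ))   # 14+20+8+8 = 50 bytes
MTU=$(( 1500 - OVERHEAD ))               # 1450
echo "ip link set dev $DEV mtu $MTU"
```

Equivalently, instead of lowering the sender's MTU, the uplink MTU may be raised to 1500 + 50 = 1550.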

Q-in-Q Encapsulation per VF in Linux (VST)


This feature is supported on ConnectX-3 Pro and ConnectX-5 adapter cards only. 


ConnectX-4 and ConnectX-4 Lx adapter cards support 802.1Q double-tagging (C-tag stacking
on C-tag) - refer to "802.1Q Double-Tagging" section.

This section describes the configuration of IEEE 802.1ad QinQ VLAN tag (S-VLAN) to the hypervisor
per Virtual Function (VF). The Virtual Machine (VM) attached to the VF (via SR-IOV) can send traffic
with or without C-VLAN. Once a VF is configured to VST QinQ encapsulation (VST QinQ), the
adapter's hardware will insert S-VLAN to any packet from the VF to the physical port. On the receive
side, the adapter hardware will strip the S-VLAN from any packet coming from the wire to that VF. 

Setup

The setup assumes there are two servers equipped with ConnectX-3 Pro/ConnectX-5 adapter cards.

Prerequisites
• Kernel must be of v3.10 or higher, or custom/inbox kernel must support vlan-stag
• Firmware version 2.36.5150 or higher must be installed for ConnectX-3 Pro HCAs
• Firmware version 16.21.0458 or higher must be installed for ConnectX-5 HCAs
• The server should be enabled in SR-IOV and the VF should be attached to a VM on the
hypervisor.
• In order to configure SR-IOV in Ethernet mode for ConnectX-3 Pro adapter cards,
please refer to "Configuring SR-IOV for ConnectX-3/ConnectX-3 Pro" section.
• In order to configure SR-IOV in Ethernet mode for ConnectX-5 adapter cards, please
refer to "Configuring SR-IOV for ConnectX-4/ConnectX-5 (Ethernet)" section. In the
following configuration example, the VM is attached to VF0.
• Network Considerations - the network switches may require increasing the MTU (to support
1522 MTU size) on the relevant switch ports. 

Configuring Q-in-Q Encapsulation per Virtual Function for ConnectX-3 Pro

1. Enable QinQ support in the hardware. Set the phv-bit flag using ethtool (on the hypervisor). 

ethtool --set-priv-flags ens2 phv-bit on

2. Add the required S-VLAN (QinQ) tag (on the hypervisor) per port per VF. There are two ways
to add the S-VLAN: 
a. By using sysfs only if the Kernel version used is v4.9 or older: 

echo 'vlan 100 proto 802.1ad' > /sys/class/net/ens2/vf0/vlan_info

b. By using the ip link command (available only when using the latest Kernel version): 

# ip link set dev ens2 vf 0 vlan 100 proto 802.1ad

Check the configuration using the ip link show command: 

# ip link show ens2


2: ens2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default
qlen 1000
link/ether 7c:fe:90:19:9e:21 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, vlan 100, vlan protocol 802.1ad , spoof checking off, link-state
auto
vf 1 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 2 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 3 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 4 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto

3. Optional: Add S-VLAN priority. Use the qos parameter in the ip link command (or sysfs): 

# ip link set dev ens2 vf 0 vlan 100 qos 3 proto 802.1ad

Check the configuration using the ip link show command:

# ip link show ens2


2: ens2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT group default qlen
1000
link/ether 7c:fe:90:19:9e:21 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, vlan 100, qos 3, vlan protocol 802.1ad , spoof checking off, link-state
auto
vf 1 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 2 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 3 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto
vf 4 MAC 00:00:00:00:00:00, vlan 4095, spoof checking off, link-state auto

4. Restart the driver in the VM attached to that VF. 

(VM1)# /etc/init.d/openibd restart

5. Create a VLAN interface on the VM and add an IP address. 

ip link add link ens5 ens5.40 type vlan protocol 802.1q id 40


ip addr add 42.134.135.7/16 brd 42.134.255.255 dev ens5.40
ip link set dev ens5.40 up

6. To verify the setup, run ping between the two VMs and open Wireshark or tcpdump to capture
the packet.

For further examples, refer to HowTo Configure QinQ Encapsulation per VF in Linux (VST) for
ConnectX-3 Pro Community post.

Configuring Q-in-Q Encapsulation per Virtual Function for ConnectX-5


1. Add the required S-VLAN (QinQ) tag (on the hypervisor) per port per VF. There are two ways
to add the S-VLAN:
a. By using sysfs: 

echo '100:0:802.1ad' > /sys/class/net/ens1f0/device/sriov/0/vlan

b. By using the ip link command (available only when using the latest Kernel version): 

ip link set dev ens1f0 vf 0 vlan 100 proto 802.1ad

Check the configuration using the ip link show command: 

# ip link show ens1f0


ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
link/ether ec:0d:9a:44:37:84 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, vlan 100, vlan protocol 802.1ad, spoof checking off, link-state
auto, trust off
vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off
vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off
vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off
vf 4 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off

2. Optional: Add S-VLAN priority. Use the qos parameter in the ip link command (or sysfs): 

ip link set dev ens1f0 vf 0 vlan 100 qos 3 proto 802.1ad

Check the configuration using the ip link show command: 

# ip link show ens1f0


ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
link/ether ec:0d:9a:44:37:84 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, vlan 100, qos 3, vlan protocol 802.1ad, spoof checking off, link-state
auto, trust off
vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off
vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off
vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off
vf 4 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off

3. Create a VLAN interface on the VM and add an IP address. 

ip link add link ens5 ens5.40 type vlan protocol 802.1q id 40


ip addr add 42.134.135.7/16 brd 42.134.255.255 dev ens5.40
ip link set dev ens5.40 up

4. To verify the setup, run ping between the two VMs and open Wireshark or tcpdump to capture
the packet.
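For scripting several VFs, the `<vlan>:<qos>:<proto>` string written to the per-VF vlan sysfs file in step 1a can be built with a small helper. This is a sketch; `svlan_cfg` is an illustrative name, not part of the driver:

```shell
# Build the "<vlan>:<qos>:<proto>" string accepted by the per-VF vlan
# sysfs file (sketch; svlan_cfg is an illustrative helper).
svlan_cfg() {
  local vlan=$1
  local qos=${2:-0}          # default priority 0, as in step 1a
  local proto=${3:-802.1ad}  # S-VLAN (QinQ) by default
  echo "${vlan}:${qos}:${proto}"
}

svlan_cfg 100      # -> 100:0:802.1ad
svlan_cfg 100 3    # -> 100:3:802.1ad
# Usage on the hypervisor (requires the sriov sysfs tree):
# echo "$(svlan_cfg 100 3)" > /sys/class/net/ens1f0/device/sriov/0/vlan
```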

802.1Q Double-Tagging
This section describes the configuration of 802.1Q double-tagging support to the hypervisor per
Virtual Function (VF). The Virtual Machine (VM) attached to the VF (via SR-IOV) can send traffic with
or without C-VLAN. Once a VF is configured to VST encapsulation, the adapter's hardware will insert
C-VLAN to any packet from the VF to the physical port. On the receive side, the adapter hardware
will strip the C-VLAN from any packet coming from the wire to that VF.

Configuring 802.1Q Double-Tagging per Virtual Function for ConnectX-4/ConnectX-4 Lx and ConnectX-5
1. Add the required C-VLAN tag (on the hypervisor) per port per VF. There are two ways to add
the C-VLAN:
a. By using sysfs: 

echo '100:0:802.1q' > /sys/class/net/ens1f0/device/sriov/0/vlan

b. By using the ip link command (available only when using the latest Kernel version): 

ip link set dev ens1f0 vf 0 vlan 100

Check the configuration using the ip link show command: 

# ip link show ens1f0


ens1f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq state UP mode DEFAULT qlen 1000
link/ether ec:0d:9a:44:37:84 brd ff:ff:ff:ff:ff:ff
vf 0 MAC 00:00:00:00:00:00, vlan 100, spoof checking off, link-state auto, trust off
vf 1 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off
vf 2 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off
vf 3 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off
vf 4 MAC 00:00:00:00:00:00, spoof checking off, link-state auto, trust off

2. Create a VLAN interface on the VM and add an IP address. 

# ip link add link ens5 ens5.40 type vlan protocol 802.1q id 40


# ip addr add 42.134.135.7/16 brd 42.134.255.255 dev ens5.40
# ip link set dev ens5.40 up

3. To verify the setup, run ping between the two VMs and open Wireshark or tcpdump to capture
the packet.

Resiliency

Reset Flow
Reset Flow is activated by default. Once a "fatal device" error is recognized, both the HCA and the
software are reset, the ULPs and user applications are notified about it, and a recovery process is
performed.
• In mlx4 devices, "Reset Flow" is activated by default. It can be disabled using the mlx4_core
module parameter internal_err_reset (default value is 1).
• In mlx5 devices, "Reset Flow" is activated by default. Currently, it can be triggered by a
firmware assert with Recover Flow Request (RFR) only. Firmware RFR support should be
enabled explicitly using mlxconfig commands.

Notes:
• For mlx4 devices, a “fatal device” error can be a timeout from a firmware command, an
error on a firmware closing command, communication channel not being responsive in a VF,
etc.
• For mlx5 devices, a “fatal device” is a firmware assert combined with Recover Flow Request
bit.

To query the current value, run: 

mlxconfig -d /dev/mst/mt4115_pciconf0 query | grep SW_RECOVERY_ON_ERRORS

To enable RFR bit support, run: 

mlxconfig -d /dev/mst/mt4115_pciconf0 set SW_RECOVERY_ON_ERRORS=true

Kernel ULPs

Once a "fatal device" error is recognized, an IB_EVENT_DEVICE_FATAL event is created, ULPs are
notified about the incident, and outstanding WQEs are simulated as returned with a "flush in error"
status. This enables each ULP, via its "remove_one" callback, to close its resources as part of the
"Reset Flow" without getting stuck.
Once the unload part is complete, each ULP is called with its "add_one" callback, its resources are
re-initialized, and it is re-activated.

SR-IOV

If the Physical Function recognizes the error, it notifies all the VFs about it by marking their
communication channel with that information, consequently, all the VFs and the PF are reset.
If the VF encounters an error, only that VF is reset, whereas the PF and other VFs continue to work
unaffected.

Forcing the VF to Reset

If an outside "reset" is forced by using the PCI sysfs entry for a VF, a reset is executed on that VF
once it runs any command over its communication channel.
For example, the below command can be used on a hypervisor to reset a VF defined by
0000:04:00.1: 

echo 1 >/sys/bus/pci/devices/0000:04:00.1/reset
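A small guard around the sysfs write avoids silently failing on a mistyped BDF. This is a sketch; `reset_vf` is an illustrative helper name:

```shell
# Reset a VF by PCI address, first checking that the device and its
# reset entry exist (sketch; reset_vf is an illustrative helper).
reset_vf() {
  local bdf=$1
  if [ -w "/sys/bus/pci/devices/$bdf/reset" ]; then
    echo 1 > "/sys/bus/pci/devices/$bdf/reset"
  else
    echo "no resettable PCI device at $bdf" >&2
    return 1
  fi
}
# reset_vf 0000:04:00.1
```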

Advanced Error Reporting (AER) in ConnectX-3 and ConnectX-3 Pro

AER is a mechanism used by the driver to get notifications upon PCI errors. It is supported only in
native mode; ULPs are called with remove_one/add_one and are expected to continue working properly
after that flow. User space applications work in the same mode as defined in the "Reset Flow" above.

Extended Error Handling (EEH)

Extended Error Handling (EEH) is a PowerPC mechanism that encapsulates AER, thus exposing AER
events to the operating system as EEH events.
The behavior of ULPs and user space applications is identical to the behavior of AER.

CRDUMP

The CRDUMP feature allows taking an automatic snapshot of the device CR-Space in case the device's
FW/HW fails to function properly.

Snapshot Triggers:

• ConnectX-3 adapters family - the snapshot is triggered in case the driver detects any of the
following issues:
1. Critical event, such as a command timeout
2. Critical FW command failure
3. PCI errors
4. Internal FW error
• ConnectX-4/ConnectX-5 adapters family - the snapshot is triggered after the firmware detects
a critical issue requiring a recovery flow (see Reset Flow).

This snapshot can later be investigated and analyzed to track the root cause of the failure.
Currently, only the first snapshot is stored, and is exposed using a temporary virtual file. The virtual
file is cleared upon driver reset. 
When a critical event is detected, a message indicating CRDUMP collection will be printed to the
Linux log. User should then back up the file pointed to in the printed message. The file location
format is:
• For mlx4 driver: /proc/driver/mlx4_core/crdump/<pci address>
• For mlx5 driver: /proc/driver/mlx5_core/crdump/<pci address>

Example - the following message is printed to the log: 

[257480.719070] mlx4_core 0000:00:05.0: Internal error detected:


[257480.726019] mlx4_core 0000:00:05.0: buf[00]: 0fffffff
[257480.732082] mlx4_core 0000:00:05.0: buf[01]: 00000000
....
[257480.806531] mlx4_core 0000:00:05.0: buf[0f]: 00000000
[257480.811534] mlx4_core 0000:00:05.0: device is going to be reset
[257482.781154] mlx4_core 0000:00:05.0: crdump: Crash snapshot collected to /proc/driver/mlx4_core/crdump/0000:00:0
5.0
[257483.789230] mlx4_core 0000:00:05.0: device was reset successfully

The snapshot should be copied using standard Linux tools for future investigation. 


In mlx4 driver, CRDUMP will not be collected if internal_err_reset module parameter is set
to 0.
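Since the virtual file is cleared upon driver reset, it is worth copying the snapshot away as soon as the log message appears. A minimal sketch (`backup_crdump` is an illustrative helper; the crdump path is taken from the printed log message):

```shell
# Copy a CRDUMP snapshot to a timestamped file before a driver reset
# clears it (sketch; backup_crdump is an illustrative helper).
backup_crdump() {
  local src=$1 dstdir=$2
  local name
  name="crdump-$(basename "$src")-$(date +%Y%m%d-%H%M%S)"
  cp "$src" "$dstdir/$name" && echo "$dstdir/$name"
}
# backup_crdump /proc/driver/mlx4_core/crdump/0000:00:05.0 /var/tmp
```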

Firmware Tracer

This mechanism allows the device's FW/HW to log important events into the event tracing
system (/sys/kernel/debug/tracing) without requiring any Mellanox tool.


To be able to use this feature, trace points must be enabled in the kernel.

This feature is enabled by default, and can be controlled using sysfs commands.

To disable the feature: 

echo 0 > /sys/kernel/debug/tracing/events/mlx5/fw_tracer/enable

To enable the feature: 

echo 1 > /sys/kernel/debug/tracing/events/mlx5/fw_tracer/enable

To view FW traces using vim text editor: 

vim /sys/kernel/debug/tracing/trace

Docker Containers

This feature is supported at beta level on the ConnectX-4 adapter card family and above only.

Docker (containerization) performs operating-system-level virtualization. On Linux, Docker uses
resource isolation features of the Linux kernel to allow independent "containers" to run within a
single Linux kernel instance.
Docker containers are supported on MLNX_OFED using Docker runtime. Virtual RoCE and InfiniBand
devices are supported using SR-IOV mode.

Currently, RDMA/RoCE devices are supported in the modes listed in the following table: 

Linux Containers Networking Modes

• Docker Engine, version 17.03 or higher - SR-IOV using sriov-plugin along with the docker run
wrapper tool; link layer: InfiniBand and Ethernet; virtualization mode: SR-IOV
• Kubernetes, version 1.10.3 or higher - SR-IOV using device plugin and SR-IOV CNI plugin; link
layer: InfiniBand and Ethernet; virtualization mode: SR-IOV
• Kubernetes, version 1.10.3 or higher - VXLAN using IPoIB bridge; link layer: InfiniBand;
virtualization mode: Shared HCA

Docker Using SR-IOV


In this mode, the Docker engine is used to run containers along with the SR-IOV networking plugin.
To isolate the virtual devices, the docker_rdma_sriov tool should be used. This mode is applicable
to both InfiniBand and Ethernet link layers.
To obtain the plugin, visit: https://hub.docker.com/r/mellanox/sriov-plugin/
To install the docker_rdma_sriov tool, use the container tools installer available via
https://hub.docker.com/r/mellanox/container_tools_installer/

For instructions on how to use Docker with SR-IOV, refer to the following Community post:
https://support.mellanox.com/docs/DOC-3139

Kubernetes Using SR-IOV


In order to use RDMA in a Kubernetes environment with SR-IOV networking mode, two main
components are required:
1. RDMA device plugin - this plugin allows for exposing RDMA devices in a Pod
2. SR-IOV CNI plugin - this plugin provisions VF net device in a Pod

When used in SR-IOV mode, this plugin enables SR-IOV and performs necessary configuration
including setting GUID, MAC, privilege mode, and Trust mode.
The plugin also allocates the VF devices when Pods are scheduled and requested by Kubernetes
framework.
For instructions on how to use Kubernetes with SR-IOV, refer to the following Community posts:
• https://support.mellanox.com/docs/DOC-3151
• https://support.mellanox.com/docs/DOC-3138

Kubernetes with Shared HCA


One RDMA device (HCA) can be shared among multiple Pods running on Kubernetes worker nodes.
User-defined networks are created using VXLAN or VETH networking devices.
For instructions on how to use Kubernetes with Shared HCA, refer to the following Community post:
https://support.mellanox.com/docs/DOC-3153

Mediated Devices
The Mellanox mediated devices provide the flexibility to create accelerated devices without
SR-IOV on the BlueField® system. These mediated devices support NIC and RDMA, and offer the
same level of ASAP2 offloads as SR-IOV VFs. Mediated devices are supported using mlx5 sub-function
acceleration technology.

Configuring Mediated Device


1. To support sub-functions, PCIe BAR2 must be enabled. Run: 

$ mlxconfig -d /dev/mst/mt41682_pciconf0 set PF_BAR2_SIZE=4 PF_BAR2_ENABLE=True

Cold reboot the BlueField host system so that the above settings can be applied on
subsequent reboot sequences.
2. By default, the firmware allows a large maximum number of mdev devices. Set the maximum
number of mediated devices to a practical value, such as 2 or 4. Run: 

$ echo 4 > /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types/mlx5_core-local/max_mdevs

3. Mediated devices are uniquely identified using UUID. To create one, run:

$ uuidgen
$ echo 49d0e9ac-61b8-4c91-957e-6f6dbc42557d > /sys/bus/pci/devices/0000:05:00.0/mdev_supported_types/
mlx5_core-local/create

4. By default, if the driver vfio_mdev is loaded, newly created mdev devices are bound to it. To
make use of this newly created mdev device in order to create a netdevice and RDMA device,
you must first unbind it from that driver. Run:

$ echo 49d0e9ac-61b8-4c91-957e-6f6dbc42557d > /sys/bus/mdev/drivers/vfio_mdev/unbind

5. Configure a MAC address for the mdev device. Run:

$ echo 00:11:22:33:44:55 > /sys/bus/mdev/devices/49d0e9ac-61b8-4c91-957e-6f6dbc42557d/devlink-compat-


config/mac_addr

6. Query the representor netdevice of the mdev device. Run:

$ cat /sys/bus/mdev/devices/49d0e9ac-61b8-4c91-957e-6f6dbc42557d/devlink-compat-config/netdev

7. Bind the mediated device to mlx5_core driver. Run:

$ echo 49d0e9ac-61b8-4c91-957e-6f6dbc42557d > /sys/bus/mdev/drivers/mlx5_core/bind

When an mdev device is bound to the mlx5_core driver, its respective netdevice and/or RDMA
device is also created.
8. To inspect the netdevice and RDMA device for the mdev, run: 

$ ls /sys/bus/mdev/devices/49d0e9ac-61b8-4c91-957e-6f6dbc42557d/net/
$ ls /sys/bus/mdev/devices/49d0e9ac-61b8-4c91-957e-6f6dbc42557d/infiniband/
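The create, unbind, and bind steps above all take the same UUID, so a quick format check before writing it can catch copy-paste errors. This is a sketch; `is_mdev_uuid` is an illustrative helper name:

```shell
# Validate the canonical 8-4-4-4-12 lowercase-hex UUID format used to
# name an mdev (sketch; is_mdev_uuid is an illustrative helper).
is_mdev_uuid() {
  echo "$1" | grep -Eq \
    '^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$'
}

is_mdev_uuid 49d0e9ac-61b8-4c91-957e-6f6dbc42557d && echo "valid"
# Guard before echoing "$u" into the .../create sysfs file:
# is_mdev_uuid "$u" || exit 1
```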

OVS Offload Using ASAP2 Direct

Overview
Open vSwitch (OVS) allows Virtual Machines (VM) to communicate with each other and with the
outside world. OVS traditionally resides in the hypervisor and switching is based on twelve tuple
matching on flows. The OVS software based solution is CPU intensive, affecting system performance
and preventing full utilization of the available bandwidth. The OVS version currently supported is
OVS running in the Linux kernel.
Mellanox Accelerated Switching And Packet Processing (ASAP2) technology allows OVS offloading by
handling OVS data-plane in Mellanox ConnectX-5 onwards NIC hardware (Mellanox Embedded Switch
or eSwitch) while maintaining OVS control-plane unmodified. As a result, we observe significantly
higher OVS performance without the associated CPU load.
The current actions supported by ASAP2 include packet parsing and matching, forward, drop along
with VLAN push/pop or VXLAN encapsulation/decapsulation.

Installing ASAP2 Packages
Install the required packages. For the complete solution, you need to install supporting
MLNX_EN (v4.4 and above), iproute2, and openvswitch packages.

Setting Up SR-IOV

To set up SR-IOV:
1. Choose the desired card.
The example below shows a dual-ported ConnectX-5 card (device ID 0x1017) and a single SR-
IOV VF (Virtual Function, device ID 0x1018).
In SR-IOV terms, the card itself is referred to as the PF (Physical Function). 

# lspci -nn | grep Mellanox


 
0a:00.0 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
0a:00.1 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5] [15b3:1017]
 
0a:00.2 Ethernet controller [0200]: Mellanox Technologies MT27800 Family [ConnectX-5 Virtual Function]
[15b3:1018]

 Enabling SR-IOV and creating VFs is done by the firmware upon admin directive as
explained in Step 5 below.

2. Identify the Mellanox NICs and locate net-devices which are on the NIC PCI BDF. 

# ls -l /sys/class/net/ | grep 04:00


 
lrwxrwxrwx 1 root root 0 Mar 27 16:58 enp4s0f0 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/
enp4s0f0
lrwxrwxrwx 1 root root 0 Mar 27 16:58 enp4s0f1 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.1/net/
enp4s0f1
lrwxrwxrwx 1 root root 0 Mar 27 16:58 eth0 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.2/net/eth0
lrwxrwxrwx 1 root root 0 Mar 27 16:58 eth1 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.3/net/eth1

The PF NIC for port #1 is enp4s0f0, and the rest of the commands will be issued on it.
3. Check the firmware version.
Make sure the firmware versions installed are as stated in the Release Notes document. 

# ethtool -i enp4s0f0 | head -5


driver: mlx5_core
version: 5.0-5
firmware-version: 16.21.0338
expansion-rom-version:
bus-info: 0000:04:00.0

4. Make sure SR-IOV is enabled on the system (server, card).


Make sure SR-IOV is enabled by the server BIOS, and by the firmware with up to N VFs, where
N is the number of VFs required for your environment. Refer to "Mellanox Firmware Tools"
below for more details. 

# cat /sys/class/net/enp4s0f0/device/sriov_totalvfs
4

5. Turn ON SR-IOV on the PF device. 

# echo 2 > /sys/class/net/enp4s0f0/device/sriov_numvfs

6. Provision the VF MAC addresses using the IP tool. 

# ip link set enp4s0f0 vf 0 mac e4:11:22:33:44:50


# ip link set enp4s0f0 vf 1 mac e4:11:22:33:44:51

7. Verify the VF MAC addresses were provisioned correctly and SR-IOV was turned ON. 

# cat /sys/class/net/enp4s0f0/device/sriov_numvfs
2
 
# ip link show dev enp4s0f0
256: enp4s0f0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc mq master ovs-system state UP mode DEFAULT
group default qlen 1000
link/ether e4:1d:2d:60:95:a0 brd ff:ff:ff:ff:ff:ff
vf 0 MAC e4:11:22:33:44:50, spoof checking off, link-state auto
vf 1 MAC e4:11:22:33:44:51, spoof checking off, link-state auto

In the example above, the maximum number of possible VFs supported by the firmware is 4
and only 2 are enabled. 
8. Provision the PCI VF devices to VMs using PCI pass-through or any other preferred
virtualization tool of choice, e.g. virt-manager.

For further information on SR-IOV, refer to https://support.mellanox.com/docs/DOC-2386.
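Step 6 above assigns consecutive MAC addresses; for many VFs this can be scripted. A sketch, where `vf_mac` and the base values are illustrative:

```shell
# Generate per-VF MAC addresses from a base prefix and starting octet,
# matching the e4:11:22:33:44:50 / :51 pattern used in step 6 (sketch;
# vf_mac is an illustrative helper).
vf_mac() {
  local prefix=$1 base=$2 vf=$3
  printf '%s%02x\n' "$prefix" $(( base + vf ))
}

vf_mac e4:11:22:33:44: 0x50 0   # -> e4:11:22:33:44:50
vf_mac e4:11:22:33:44: 0x50 1   # -> e4:11:22:33:44:51
# for i in 0 1; do
#   ip link set enp4s0f0 vf $i mac "$(vf_mac e4:11:22:33:44: 0x50 $i)"
# done
```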

Configuring Open-vSwitch (OVS) Offload


1. Unbind the VFs. 

echo 0000:04:00.2 > /sys/bus/pci/drivers/mlx5_core/unbind


echo 0000:04:00.3 > /sys/bus/pci/drivers/mlx5_core/unbind

 VMs with attached VFs must be powered off to be able to unbind the VFs. 
2. Change the e-switch mode from legacy to switchdev on the PF device.
This will also create the VF representor netdevices in the host OS. 

# echo switchdev > /sys/class/net/enp4s0f0/compat/devlink/mode

 Before changing the mode, make sure that all VFs are unbound. 

 To go back to SR-IOV legacy mode:


# echo legacy > /sys/class/net/enp4s0f0/compat/devlink/mode
Running this command, will also remove the VF representor netdevices.

3. Set the network VF representor device names to be in the form of $PF_$VFID, where $PF is
the PF netdev name and $VFID is the VF ID=0,1,[..], using either of the following methods:
* Using this rule in /etc/udev/rules.d/82-net-setup-link.rules:

SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="e41d2d60971d", \
ATTR{phys_port_name}!="", NAME="enp4s0f1_$attr{phys_port_name}"

Replace the phys_switch_id value ("e41d2d60971d" above) with the value matching your
switch, as obtained from: 

ip -d link show enp4s0f1

Example output of device names when using the udev rule: 

ls -l /sys/class/net/enp4s0f0*
lrwxrwxrwx 1 root root 0 Mar 27 17:14 enp4s0f0 -> ../../devices/pci0000:00/0000:00:03.0/0000:04:00.0/net/
enp4s0f0
lrwxrwxrwx 1 root root 0 Mar 27 17:15 enp4s0f0_0 -> ../../devices/virtual/net/enp4s0f0_0
lrwxrwxrwx 1 root root 0 Mar 27 17:15 enp4s0f0_1 -> ../../devices/virtual/net/enp4s0f0_1

* Using the supplied 82-net-setup-link.rules and vf-net-link-name.sh script to set the
VF representor device names. 
From the scripts directory, copy vf-net-link-name.sh to /etc/udev/ and 82-net-setup-
link.rules to /etc/udev/rules.d/. 
Make sure vf-net-link-name.sh is executable.
4. Run the openvswitch service. 

# systemctl start openvswitch

5. Create an OVS bridge (here it's named ovs-sriov). 

# ovs-vsctl add-br ovs-sriov

6. Enable hardware offload (disabled by default). 

# ovs-vsctl set Open_vSwitch . other_config:hw-offload=true

7. Restart the openvswitch service. This step is required for HW offload changes to take effect. 

# systemctl restart openvswitch

 HW offload policy can also be changed by setting the tc-policy using one of the
following values:
* none - adds a TC rule to both the software and the hardware (default)
* skip_sw - adds a TC rule only to the hardware
* skip_hw - adds a TC rule only to the software
Changing the tc-policy is intended for debug purposes.

8. Add the PF and the VF representor netdevices as OVS ports. 

# ovs-vsctl add-port ovs-sriov enp4s0f0


# ovs-vsctl add-port ovs-sriov enp4s0f0_0
# ovs-vsctl add-port ovs-sriov enp4s0f0_1

Make sure to bring up the PF and representor netdevices. 

# ip link set dev enp4s0f0 up


# ip link set dev enp4s0f0_0 up

# ip link set dev enp4s0f0_1 up

The PF represents the uplink (wire). 

# ovs-dpctl show
system@ovs-system:
lookups: hit:0 missed:192 lost:1
flows: 2
masks: hit:384 total:2 hit/pkt:2.00
port 0: ovs-system (internal)
port 1: ovs-sriov (internal)
port 2: enp4s0f0
port 3: enp4s0f0_0
port 4: enp4s0f0_1

9. Run traffic from the VFs and observe the rules added to the OVS data-path.

# ovs-dpctl dump-flows
 
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=e4:1d:2d:a5:f3:9d),
eth_type(0x0800),ipv4(frag=no), packets:33, bytes:3234, used:1.196s, actions:2
 
recirc_id(0),in_port(2),eth(src=e4:1d:2d:a5:f3:9d,dst=e4:11:22:33:44:50),
eth_type(0x0800),ipv4(frag=no), packets:34, bytes:3332, used:1.196s, actions:3

In the example above, the ping was initiated from VF0 (OVS port 3) to the outer node (OVS
port 2), where the VF MAC is e4:11:22:33:44:50 and the outer node MAC is e4:1d:2d:a5:f3:9d
As shown above, two OVS rules were added, one in each direction.
Note that you can also verify offloaded packets by adding type=offloaded to the
command. For example: 

# ovs-dpctl dump-flows type=offloaded

Flow Statistics and Aging

The aging timeout of OVS is given in ms and can be controlled using this command: 

# ovs-vsctl set Open_vSwitch . other_config:max-idle=30000

Offloading VLANs

It is common to require the VM traffic to be tagged by the OVS, such that the OVS adds tags (VLAN
push) to the packets sent by the VMs and strips the tags (VLAN pop) from the packets received for
this VM from other nodes/VMs.
To do so, add a tag=$TAG section to the OVS command line that adds the representor ports. In the
example below, VLAN ID 52 is used. 

# ovs-vsctl add-port ovs-sriov enp4s0f0


# ovs-vsctl add-port ovs-sriov enp4s0f0_0 tag=52
# ovs-vsctl add-port ovs-sriov enp4s0f0_1 tag=52

The PF port should not have a VLAN attached. This will cause OVS to add VLAN push/pop actions
when managing traffic for these VFs.
To see how the OVS rules look with vlans, here we initiated a ping from VF0 (OVS port 3) to an outer
node (OVS port 2), where the VF MAC is e4:11:22:33:44:50 and the outer node MAC
is 00:02:c9:e9:bb:b2.

At this stage, we can see that two OVS rules were added, one in each direction. 

recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=00:02:c9:e9:bb:b2),eth_type(0x0800),ipv4(frag=no), \
packets:0, bytes:0, used:never, actions:push_vlan(vid=52,pcp=0),2
 
recirc_id(0),in_port(2),eth(src=00:02:c9:e9:bb:b2,dst=e4:11:22:33:44:50),eth_type(0x8100), \
vlan(vid=52,pcp=0),encap(eth_type(0x0800),ipv4(frag=no)), packets:0, bytes:0, used:never, actions:pop_vlan,3

• For outgoing traffic (in port = 3), the actions are push vlan (52) and forward to port 2
• For incoming traffic (in port = 2), matching is done also on vlan, and the actions are pop vlan
and forward to port 3

Offloading VXLAN Encapsulation/Decapsulation Actions


VXLAN encapsulation / decapsulation offloading of OVS actions is supported only in
ConnectX-5 adapter cards.

In case of offloading VXLAN, the PF should not be added as a port in the OVS data-path but rather
be assigned with the IP address to be used for encapsulation. 

The example below shows two hosts (PFs) with IPs 1.1.1.177 and 1.1.1.75, where the PF device
on both hosts is enp4s0f1 and the VXLAN tunnel is set with VNID 98:
• On the first host: 

# ip addr add 1.1.1.177/24 dev enp4s0f1


 
# ovs-vsctl add-port ovs-sriov vxlan0 -- set interface vxlan0 type=vxlan
options:local_ip=1.1.1.177 options:remote_ip=1.1.1.75 options:key=98

• On the second host: 

# ip addr add 1.1.1.75/24 dev enp4s0f1


 
# ovs-vsctl add-port ovs-sriov vxlan0 -- set interface vxlan0 type=vxlan
options:local_ip=1.1.1.75 options:remote_ip=1.1.1.177 options:key=98

When encapsulating guest traffic, the VF's device MTU must be reduced to allow the host/hardware
to add the encapsulation headers without fragmenting the resulting packet. As such, the VF's MTU
must be lowered to 1450 for IPv4 and 1430 for IPv6. 
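The 1450/1430 figures follow from the VXLAN overhead on a 1500-byte uplink: outer Ethernet (14) + IPv4 (20) or IPv6 (40) + UDP (8) + VXLAN (8), i.e. 50 or 70 bytes. A sketch of the arithmetic (`vf_mtu` is an illustrative helper):

```shell
# Derive the VF MTU from the uplink MTU and the VXLAN encapsulation
# overhead: outer Ethernet(14) + IPv4(20)/IPv6(40) + UDP(8) + VXLAN(8).
vf_mtu() {
  local uplink=$1 ipver=$2
  case "$ipver" in
    ipv6) echo $(( uplink - 70 )) ;;
    *)    echo $(( uplink - 50 )) ;;
  esac
}

vf_mtu 1500 ipv4   # -> 1450
vf_mtu 1500 ipv6   # -> 1430
# ip link set dev <vf_netdev> mtu "$(vf_mtu 1500 ipv4)"
```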
To see how the OVS rules look with vxlan encap/decap actions, here we initiated a ping from a VM
on the 1st host whose MAC is e4:11:22:33:44:50 to a VM on the 2nd host whose MAC
is 46:ac:d1:f1:4c:af
At this stage we see that two OVS rules were added to the first host; one in each direction. 

# ovs-dpctl show
system@ovs-system:
lookups: hit:7869 missed:241 lost:2
flows: 2
masks: hit:13726 total:10 hit/pkt:1.69
port 0: ovs-system (internal)
port 1: ovs-sriov (internal)
port 2: vxlan_sys_4789 (vxlan)
port 3: enp4s0f1_0
port 4: enp4s0f1_1
 
 
# ovs-dpctl dump-flows
 
recirc_id(0),in_port(3),eth(src=e4:11:22:33:44:50,dst=46:ac:d1:f1:4c:af),eth_type(0x0800),ipv4(tos=0/0x3,frag=no),
packets:4, bytes:392, used:0.664s, actions:set(tunnel(tun_id=0x62,dst=1.1.1.75,ttl=64,flags(df,key))),2
 
recirc_id(0),tunnel(tun_id=0x62,src=1.1.1.75,dst=1.1.1.177,ttl=64,flags(-df-csum+key)),
in_port(2),skb_mark(0),eth(src=46:ac:d1:f1:4c:af,dst=e4:11:22:33:44:50),eth_type(0x0800),ipv4(frag=no), packets:5,
bytes:490, used:0.664s, actions:3

• For outgoing traffic (in port = 3), the actions are set vxlan tunnel to host 1.1.1.75 (encap)
and forward to port 2
• For incoming traffic (in port = 2), matching is done also on vxlan tunnel info which was
decapsulated, and the action is forward to port 3

Manually Adding TC Rules

Offloading rules can also be added directly, and not just through OVS, using the tc utility.
To enable TC ingress on both the PF and the VF representors: 

# tc qdisc add dev enp4s0f0 ingress


# tc qdisc add dev enp4s0f0_0 ingress
# tc qdisc add dev enp4s0f0_1 ingress

Examples

L2 Rule

# tc filter add dev ens4f0_0 protocol ip parent ffff: \


flower \
skip_sw \
dst_mac e4:11:22:11:4a:51 \
src_mac e4:11:22:11:4a:50 \
action drop

VLAN Rule 

# tc filter add dev ens4f0_0 protocol 802.1Q parent ffff: \


flower \
skip_sw \
dst_mac e4:11:22:11:4a:51 \
src_mac e4:11:22:11:4a:50 \
action vlan push id 100 \
action mirred egress redirect dev ens4f0
 
# tc filter add dev ens4f0 protocol 802.1Q parent ffff: \
flower \
skip_sw \
dst_mac e4:11:22:11:4a:51 \
src_mac e4:11:22:11:4a:50 \
vlan_ethtype 0x800 \
vlan_id 100 \
vlan_prio 0 \
action vlan pop \
action mirred egress redirect dev ens4f0_0

VXLAN Rule

# tc filter add dev ens4f0_0 protocol 0x806 parent ffff: \


flower \
skip_sw \
dst_mac e4:11:22:11:4a:51 \
src_mac e4:11:22:11:4a:50 \
action tunnel_key set \
src_ip 20.1.12.1 \
dst_ip 20.1.11.1 \
id 100 \
action mirred egress redirect dev vxlan100
 
# tc filter add dev vxlan100 protocol 0x806 parent ffff: \
flower \
skip_sw \
dst_mac e4:11:22:11:4a:51 \
src_mac e4:11:22:11:4a:50 \
enc_src_ip 20.1.11.1 \
enc_dst_ip 20.1.12.1 \
enc_key_id 100 \
enc_dst_port 4789 \
action tunnel_key unset \
action mirred egress redirect dev ens4f0_0

Bond Rule

Bond rules can be added in one of the following methods:


• Using shared block (requires kernel support): 

# tc qdisc add dev bond0 ingress_block 22 ingress


# tc qdisc add dev ens4p0 ingress_block 22 ingress
# tc qdisc add dev ens4p1 ingress_block 22 ingress

• Add drop rule: 

# tc filter add block 22 protocol arp parent ffff: prio 3 \


flower \
dst_mac e4:11:22:11:4a:51 \
action drop

• Add redirect rule from bond to representor: 

# tc filter add block 22 protocol arp parent ffff: prio 3 \


flower \
dst_mac e4:11:22:11:4a:50 \
action mirred egress redirect dev ens4f0_0

• Add redirect rule from representor to bond: 

# tc filter add dev ens4f0_0 protocol arp parent ffff: prio 3 \


flower \
dst_mac ec:0d:9a:8a:28:42 \
action mirred egress redirect dev bond0

• Without using shared block:


• Add redirect rule from bond to representor: 

# tc filter add dev bond0 protocol arp parent ffff: prio 1 \


flower \
dst_mac e4:11:22:11:4a:50 \
action mirred egress redirect dev ens4f0_0

• Add redirect rule from representor to bond: 

# tc filter add dev ens4f0_0 protocol arp parent ffff: prio 3 \


flower \
dst_mac ec:0d:9a:8a:28:42 \
action mirred egress redirect dev bond0

VLAN Modify

VLAN Modify rules can be added in one of the following methods: 

tc filter add dev $REP_DEV1 protocol 802.1q ingress prio 1 flower \
vlan_id 10 \
action vlan modify id 11 pipe \
action mirred egress redirect dev $REP_DEV2

tc filter add dev $DEV_REP1 protocol 802.1q ingress prio 1 flower \


vlan_id 10 \
action vlan pop pipe action vlan push id 11 pipe \
action mirred egress redirect dev $REP_DEV2

SR-IOV VF LAG

SR-IOV VF LAG allows the NIC's physical functions to receive the rules that OVS attempts to offload
to the bond net-device, and to offload them to the hardware e-switch. Bond modes supported are:
• Active-Backup
• Active-Active
• LACP

To enable SR-IOV LAG, both physical functions of the NIC should first be configured to SR-IOV
switchdev mode, and only afterwards bond the up-link representors.
The example below shows the creation of bond interface on two PFs:
1. Load the bonding device and enslave the up-link representor (currently the PF) net-devices. 

modprobe bonding mode=802.3ad


ifup bond0 (make sure the ifcfg file is present with the desired bond configuration)
ip link set enp4s0f0 master bond0
ip link set enp4s0f1 master bond0

2. Add the VF representor net-devices as OVS ports. If tunneling is not used, add the bond
device as well. 

ovs-vsctl add-port ovs-sriov bond0


ovs-vsctl add-port ovs-sriov enp4s0f0_0
ovs-vsctl add-port ovs-sriov enp4s0f1_0

3. Make sure to bring up the PF and the representor netdevices. 

ip link set dev bond0 up


ip link set dev enp4s0f0_0 up
ip link set dev enp4s0f1_0 up

Port Mirroring (Flow Based VF Traffic Mirroring for ASAP²)


Port Mirroring is currently supported in ConnectX-5 adapter cards only.

Unlike para-virtual configurations, when the VM traffic is offloaded to the hardware via an SR-IOV
VF, the host-side admin cannot snoop the traffic (e.g. for monitoring).
ASAP² uses the existing mirroring support in OVS and TC along with the enhancement to the
offloading logic in the driver to allow mirroring the VF traffic to another VF.
The mirrored VF can be used to run a traffic analyzer (tcpdump, Wireshark, etc.) and observe the
traffic of the VF being mirrored.
The example below shows the creation of port mirror on the following configuration: 

# ovs-vsctl show
09d8a574-9c39-465c-9f16-47d81c12f88a
Bridge br-vxlan
Port "enp4s0f0_1"
Interface "enp4s0f0_1"
Port "vxlan0"
Interface "vxlan0"
type: vxlan
options: {key="100", remote_ip="192.168.1.14"}
Port "enp4s0f0_0"
Interface "enp4s0f0_0"
Port "enp4s0f0_2"
Interface "enp4s0f0_2"
Port br-vxlan
Interface br-vxlan
type: internal
ovs_version: "2.8.90"

• If we want to set enp4s0f0_0 as the mirror port and mirror all of the traffic, set it as follows:

# ovs-vsctl -- --id=@p get port enp4s0f0_0 \


-- --id=@m create mirror name=m0 select-all=true output-port=@p \
-- set bridge br-vxlan mirrors=@m

• If we want to set enp4s0f0_0 as the mirror port and only mirror traffic whose destination is
enp4s0f0_1, set it as follows: 

# ovs-vsctl -- --id=@p1 get port enp4s0f0_0 \


-- --id=@p2 get port enp4s0f0_1 \
-- --id=@m create mirror name=m0 select-dst-port=@p2 output-port=@p1 \
-- set bridge br-vxlan mirrors=@m

• If we want to set enp4s0f0_0 as the mirror port and only mirror traffic whose source is
enp4s0f0_1, set it as follows: 

# ovs-vsctl -- --id=@p1 get port enp4s0f0_0 \


-- --id=@p2 get port enp4s0f0_1 \
-- --id=@m create mirror name=m0 select-src-port=@p2 output-port=@p1 \
-- set bridge br-vxlan mirrors=@m

• If we want to set enp4s0f0_0 as the mirror port, and mirror all the traffic on enp4s0f0_1, set
it as follows: 

# ovs-vsctl -- --id=@p1 get port enp4s0f0_0 \


-- --id=@p2 get port enp4s0f0_1 \
-- --id=@m create mirror name=m0 select-dst-port=@p2 select-src-port=@p2 output-port=@p1 \
-- set bridge br-vxlan mirrors=@m

To clear the mirror port: 

# ovs-vsctl clear bridge br-vxlan mirrors
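To confirm what is mirrored before clearing, the OVSDB Mirror table can be listed directly. A sketch, assuming the br-vxlan bridge from the example above; the DRY_RUN guard is an addition here so the commands can be previewed on a host without OVS installed:

```shell
# Inspect, then clear, the mirrors configured on the example bridge.
# DRY_RUN=1 (default here) prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}
run() { if [ "$DRY_RUN" = "1" ]; then echo "$@"; else "$@"; fi; }

run ovs-vsctl list Mirror                   # dump all mirror records
run ovs-vsctl get bridge br-vxlan mirrors   # show mirrors attached to the bridge
run ovs-vsctl clear bridge br-vxlan mirrors # detach them all
```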

Performance Tuning Based on Traffic Patterns

Offloaded flows (including connection tracking) are added to virtual switch FDB flow tables. FDB
tables have a set of flow groups. Each flow group saves the same traffic pattern flows. For example,
for connection tracking offloaded flow, TCP and UDP are different traffic patterns which end up in
two different flow groups.

A flow group has a limited size to save flow entries. By default, the driver has 4 big FDB flow groups.
Each of these big flow groups can save at most 4000000/(4+1)=800k different 5-tuple flow entries.
For scenarios with more than 4 traffic patterns, the driver provides a module parameter
(num_of_groups) to allow customization and performance tuning.


The size of each big flow group can be calculated according to the following formula.

size = 4000000/(num_of_groups+1)

To change the number of big FDB flow groups, run: 

$ echo <num_of_groups> > /sys/module/mlx5_core/parameters/num_of_groups

The change takes effect immediately if there is no flow inside the FDB table (no traffic running and
all offloaded flows are aged out), and it can be dynamically changed without reloading the driver. 

The module parameter can be set statically in /etc/modprobe.d/mlnx.conf file. This way the
administrator will not be required to set it via sysfs each time the driver is reloaded. 

If there are residual offloaded flows when changing this parameter, then the new configuration only
takes effect after all flows age out.
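The trade-off behind num_of_groups can be tabulated with the formula above: more groups accommodate more distinct traffic patterns, but each group holds fewer entries. A quick sketch of the arithmetic (4,000,000 is the total quoted in this section; the modprobe.d line in the comment assumes standard modprobe syntax):

```shell
# size = 4000000 / (num_of_groups + 1), per the formula above.
for n in 4 7 15; do
    echo "num_of_groups=$n -> $((4000000 / (n + 1))) flow entries per big group"
done
# To persist the choice across driver reloads, add a line such as
#   options mlx5_core num_of_groups=15
# to /etc/modprobe.d/mlnx.conf.
```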

Appendix: Mellanox Firmware Tools


1. Download and install the MFT package corresponding to your computer’s operating system.
The kernel-devel or kernel-headers RPM must be installed before the tools can be built and
installed.
The package is available at https://www.mellanox.com => Products => Software => Firmware
Tools.
2. Start the mst driver. 

# mst start
Starting MST (Mellanox Software Tools) driver set
Loading MST PCI module - Success
Loading MST PCI configuration module - Success
Create devices

3. Show the device status. 

MST modules:
------------
MST PCI module loaded
MST PCI configuration module loaded
 
PCI devices:
------------
DEVICE_TYPE MST PCI RDMA NET NUMA
ConnectX4lx(rev:0) /dev/mst/mt4117_pciconf0.1 04:00.1 net-enp4s0f1 NA
ConnectX4lx(rev:0) /dev/mst/mt4117_pciconf0 04:00.0 net-enp4s0f0 NA
 
 
# mlxconfig -d /dev/mst/mt4117_pciconf0 q | head -16
 
Device #1:
----------
 

Device type: ConnectX4lx
PCI device: /dev/mst/mt4117_pciconf0
 
Configurations: Current
SRIOV_EN True(1)
NUM_OF_VFS 8
PF_LOG_BAR_SIZE 5
VF_LOG_BAR_SIZE 5
NUM_PF_MSIX 63
NUM_VF_MSIX 11
LINK_TYPE_P1 ETH(2)
LINK_TYPE_P2 ETH(2)

4. Make sure your configuration is as follows:


* SR-IOV is enabled (SRIOV_EN=1)
* The number of enabled VFs is enough for your environment (NUM_OF_VFS=N)
* The port’s link type is Ethernet (LINK_TYPE_P1/2=2) when applicable
If this is not the case, use mlxconfig to enable that, as follows:
a. Enable SR-IOV. 

# mlxconfig -d /dev/mst/mt4115_pciconf0 s SRIOV_EN=1

b. Set the number of required VFs. 

# mlxconfig -d /dev/mst/mt4115_pciconf0 s NUM_OF_VFS=8

c. Set the link type to Ethernet. 

# mlxconfig -d /dev/mst/mt4115_pciconf0 s LINK_TYPE_P1=2


# mlxconfig -d /dev/mst/mt4115_pciconf0 s LINK_TYPE_P2=2

5. Reset the firmware. 

# mlxfwreset -d /dev/mst/mt4115_pciconf0 reset

6. Query the firmware to make sure everything is set correctly. 

# mlxconfig -d /dev/mst/mt4115_pciconf0 q

Fast Driver Unload



This feature is supported in the ConnectX-4 adapter card family and above only.

This feature enables optimizing mlx5 driver teardown time in shutdown and kexec flows.
Fast driver unload is disabled by default. To enable it, set the prof_sel module parameter of the
mlx5_core module to 3.
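To make the setting persistent across reboots, a modprobe.d options line can be used. A sketch, assuming standard modprobe.d syntax; verify the parameter exists with `modinfo mlx5_core`:

```
# /etc/modprobe.d/mlnx.conf
# Enable fast driver unload (prof_sel=3) for mlx5_core at module load time.
options mlx5_core prof_sel=3
```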

Troubleshooting
You may be able to easily resolve the issues described in this section. If a problem persists and you
are unable to resolve it yourself, please contact your Mellanox representative or Mellanox Support
at support@mellanox.com.

General Issues

Issue: The system panics when it is booted with a failed adapter installed.
Cause: Malfunctioning hardware component.
Solution:
1. Remove the failed adapter.
2. Reboot the system.

Issue: Mellanox adapter is not identified as a PCI device.
Cause: PCI slot or adapter PCI connector dysfunctionality.
Solution:
1. Run lspci.
2. Reseat the adapter in its PCI slot, or insert the adapter into a different PCI slot.
If the PCI slot is confirmed to be functional, the adapter should be replaced.

Issue: Mellanox adapters are not installed in the system.
Cause: Misidentification of the Mellanox adapter installed.
Solution: Run one of the commands below and check the MAC addresses to identify the installed
Mellanox adapter:

lspci | grep Mellanox
lspci -d 15b3:

Note: Mellanox MACs start with 00:02:C9:xx:xx:xx, 00:25:8B:xx:xx:xx or F4:52:14:xx:xx:xx.

Issue: Insufficient memory to be used by udev upon OS boot.
Cause: udev is designed to fork() a new process for each event it receives, so it can handle many
events in parallel, and each udev instance consumes some RAM.
Solution: Limit the number of udev instances running simultaneously per boot by adding
udev.children-max=<number> to the kernel command line in GRUB.
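The udev.children-max parameter above goes on the kernel command line. A sketch, assuming a GRUB 2 layout; the value 512 is an arbitrary example to tune per system:

```
# /etc/default/grub -- append the parameter to the existing kernel command line:
GRUB_CMDLINE_LINUX_DEFAULT="quiet udev.children-max=512"
# Then regenerate the GRUB configuration (path varies by distribution), e.g.:
#   grub2-mkconfig -o /boot/grub2/grub.cfg
```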

Issue: Operating system running from a root file system located on remote storage (over Mellanox
devices) hangs during reboot/shutdown, with errors such as “No such file or directory”.
Cause: The mlnx-en.d service script is called with the ‘stop’ option by the operating system. This
option unloads the driver stack; the OS root file system therefore disappears before the reboot/
shutdown procedure is completed, leaving the OS in a hung state.
Solution: Disable the ‘stop’ option by setting ALLOW_STOP=no in the /etc/mlnx-en.conf
configuration file.

Ethernet Related Issues

Issue: Ethernet interfaces renaming fails, leaving them with names such as renameXY.
Cause: Invalid udev rules.
Solution: Review the udev rules inside the /etc/udev/rules.d/70-persistent-net.rules file. Modify
the rules such that every rule is unique to the target interface, by adding correct unique attribute
values to each interface, such as dev_id, dev_port, KERNELS or address.

Example of valid udev rules:

SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", ATTR{dev_port}=="0", KERNELS=="0000:08:00.0", NAME="eth4"
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", ATTR{dev_id}=="0x0", ATTR{type}=="1", KERNEL=="eth*", ATTR{dev_port}=="1", KERNELS=="0000:08:00.0", NAME="eth5"

Issue: No link.
Cause: Misconfiguration of the switch port, or use of a cable that does not support the link rate.
Solution:
• Ensure the switch port is not down.
• Ensure the switch port rate is configured to the same rate as the adapter's port.

Issue: Degraded performance is measured in a mixed rate environment (10GbE, 40GbE and 56GbE).
Cause: Sending traffic from a node with a higher rate to a node with a lower rate.
Solution: Enable Flow Control on both switch ports and nodes:
• On the server side, run: ethtool -A <interface> rx on tx on
• On the switch side, run the following commands on the relevant interface: send on force and
receive on force

Issue: No link with break-out cable.
Cause: Misuse of the break-out cable, or misconfiguration of the switch's split ports.
Solution:
• Use supported ports on the switch with the proper configuration. For further information,
please refer to the MLNX_OS User Manual.
• Make sure the QSFP break-out cable side is connected to the SwitchX.

Issue: Physical link fails to negotiate to the maximum supported rate.
Cause: The adapter is running an outdated firmware.
Solution: Install the latest firmware on the adapter.

Issue: Physical link fails to come up while the port physical state is Polling.
Cause: The cable is not connected to the port, or the port on the other end of the cable is disabled.
Solution:
• Ensure that the cable is connected on both ends, or use a known working cable.
• Check the status of the connected port using the ibportstate command and enable it if
necessary.

Issue: Physical link fails to come up while the port physical state is Disabled.
Cause: The port was manually disabled.
Solution: Restart the driver:

/etc/init.d/openibd restart

Performance Related Issues

Issue: The driver works, but the transmit and/or receive data rates are not optimal.
Cause: -
Solution: These recommendations may assist with gaining immediate improvement:
1. Confirm that the PCI link negotiates at its maximum capability.
2. Stop the IRQ Balancer service: /etc/init.d/irq_balancer stop
3. Start the mlnx_affinity service: mlnx_affinity start
For best performance practices, please refer to the "Performance Tuning Guide for Mellanox
Network Adapters".

Issue: Out-of-the-box throughput performance in Ubuntu 14.04 is not optimal and may achieve
results below the line rate at 40GbE link speed.
Cause: IRQ affinity is not set properly by the irq_balancer.
Solution: For additional performance tuning, please refer to the Performance Tuning Guide.

Issue: UDP receiver throughput may be lower than expected when running over the mlx4_en
Ethernet driver.
Cause: The adaptive interrupt moderation routine sets high values of interrupt coalescing, causing
the driver to process a large number of packets in the same interrupt, which leads UDP to drop
packets due to overflow in its buffers.
Solution: Disable adaptive interrupt moderation and set lower values for the interrupt coalescing
manually:

ethtool -C <ethX> adaptive-rx off rx-usecs 64 rx-frames 24

The values above may need tuning, depending on the system, configuration and link speed.

SR-IOV Related Issues

Issue: Failed to enable SR-IOV. The following message is reported in dmesg:

mlx4_core 0000:xx:xx.0: Failed to enable SR-IOV, continuing without SR-IOV (err = -22)

Cause: The number of VFs configured in the driver is higher than configured in the firmware.
Solution:
1. Check the firmware SR-IOV configuration by running the mlxconfig tool.
2. Set the same number of VFs for the driver.

Issue: Failed to enable SR-IOV. The following message is reported in dmesg:

mlx4_core 0000:xx:xx.0: Failed to enable SR-IOV, continuing without SR-IOV (err = -12)

Cause: SR-IOV is disabled in the BIOS.
Solution: Check that SR-IOV is enabled in the BIOS (see the “Setting Up SR-IOV” section).

Issue: When assigning a VF to a VM, the following message is reported on the screen:

PCI-assgine: error: requires KVM support

Cause: SR-IOV and virtualization are not enabled in the BIOS.
Solution:
1. Verify that both SR-IOV and virtualization are enabled in the BIOS.
2. Add the following kernel parameter to the GRUB configuration file: intel_iommu=on (see
the “Setting Up SR-IOV” section).

Common Abbreviations and Related Documents
Common Abbreviations and Acronyms
Abbreviation/Acronym Description

B (Capital) ‘B’ is used to indicate size in bytes or multiples of bytes


(e.g., 1KB = 1024 bytes, and 1MB = 1048576 bytes)

b (Small) ‘b’ is used to indicate size in bits or multiples of bits (e.g., 1Kb
= 1024 bits)

FW Firmware

HCA Host Channel Adapter

HW Hardware

IB InfiniBand

iSER iSCSI RDMA Protocol

LSB Least significant byte

lsb Least significant bit

MSB Most significant byte

msb Most significant bit

NIC Network Interface Card

SW Software

VPI Virtual Protocol Interconnect

IPoIB IP over InfiniBand

PFC Priority Flow Control

PR Path Record

RoCE RDMA over Converged Ethernet

SL Service Level


SRP SCSI RDMA Protocol

MPI Message Passing Interface

QoS Quality of Service

ULP Upper Layer Protocol

VL Virtual Lane

vHBA Virtual SCSI Host Bus Adapter

uDAPL User Direct Access Programming Library

Glossary

The following is a list of concepts and terms related to InfiniBand in general and to Subnet Managers
in particular. It is included here for ease of reference, but the main reference remains
the InfiniBand Architecture Specification.
Term Description

Channel Adapter (CA), Host An IB device that terminates an IB link and executes transport functions. This may
Channel Adapter (HCA) be an HCA (Host CA) or a TCA (Target CA)

HCA Card A network adapter card based on an InfiniBand channel adapter device

IB Devices An integrated circuit implementing InfiniBand compliant communication

IB Cluster/Fabric/Subnet A set of IB devices connected by IB cables

In-Band A term assigned to administration activities traversing the IB connectivity only

Local Identifier (LID) An address assigned to a port (data sink or source point) by the Subnet Manager,
unique within the subnet, used for directing packets within the subnet

Local Device/Node/System The IB Host Channel Adapter (HCA) Card installed on the machine running IBDIAG
tools

Local Port The IB port of the HCA through which IBDIAG tools connect to the IB fabric


Master Subnet Manager The Subnet Manager that is authoritative, that has the reference configuration
information for the subnet

Multicast Forwarding A table that exists in every switch providing the list of ports to which to forward a
Tables received multicast packet. The table is organized by MLID

Network Interface Card A network adapter card that plugs into the PCI Express slot and provides one or
(NIC) more ports to an Ethernet network

Standby Subnet Manager A Subnet Manager that is currently quiescent, and not in the role of a Master
Subnet Manager, by the agency of the master SM

Subnet Administrator An application (normally part of the Subnet Manager) that implements the
(SA) interface for querying and manipulating subnet management data

Subnet Manager (SM) One of several entities involved in the configuration and control of the IB fabric

Unicast Linear A table that exists in every switch providing the port through which packets should
Forwarding Tables (LFT) be sent to each LID

Virtual Protocol A Mellanox Technologies technology that allows Mellanox channel adapter devices
Interconnect (VPI) (ConnectX®) to simultaneously connect to an InfiniBand subnet and a 10GigE
subnet (each subnet connects to one of the adapter ports)

Related Documentation
Document Name Description

InfiniBand Architecture Specification, Vol. 1, The InfiniBand Architecture Specification that is provided
Release 1.2.1 by IBTA

IEEE Std 802.3ae™-2002 Part 3: Carrier Sense Multiple Access with Collision
Detection (CSMA/CD) Access Method and Physical Layer
(Amendment to IEEE Std 802.3-2002) Document # Specifications
PDF: SS94996
Amendment: Media Access Control (MAC) Parameters,
Physical Layers, and Management Parameters for 10 Gb/s
Operation

Firmware Release Notes for Mellanox adapter See the Release Notes PDF file relevant to your adapter
devices device on mellanox.com

MFT User Manual and Release Notes Mellanox Firmware Tools (MFT) User Manual and Release
Notes documents


WinOF User Manual Mellanox WinOF User Manual describes the installation,
configuration, and operation of Mellanox Windows driver

VMA User Manual Mellanox VMA User Manual describes the installation,
configuration, and operation of Mellanox VMA driver

User Manual Revision History
Release Date Description

4.9 LTS May 28, 2020 • Added Interrupt Request (IRQ) Naming section.
• Added Debuggability section.

4.7 December 29, 2019 • Added section Mediated Devices.


• Added "num_of_groups" entry to table mlx5_core Module
Parameters.
• Added Performance Tuning Based on Traffic Patterns section.

4.5 December 19, 2018 • Reorganized Chapter 2, “Installation”: Consolidated the
separate installation procedures under Installing
MLNX_EN and Additional Installation Procedures

Release Notes Change Log History
Category Description

Rev 4.9-0.1.7.0

Adapters: ConnectX-5 and above

Devlink Health CR-Space Dump Added the option to dump configuration space via the devlink
tool in order to improve debug capabilities.
Multi-packet TX WQE Support for XDP The conventional TX descriptor (WQE or Work Queue Element)
Transmit Flows describes a single packet for transmission. Added driver support
for the HW feature of multi-packet TX WQEs in XDP transmit
flows. With this, the HW becomes capable of working with a new
and improved WQE layout that describes several packets. In
effect, this feature saves PCI bandwidth and transactions, and
improves transmit packet rate.
GENEVE Encap/Decap Rules Offload Added support for GENEVE encapsulation/decapsulation rules
offload.
Multi Packet Tx WQE Support for XDP
Transmit Flows Added driver support for the hardware feature of multi-packet Tx
to work with a new and improved WQE layout that describes
several packets instead of a single packet for XDP transmission
flows. This saves PCI bandwidth and transactions, and improves
transmit packet rate.
Kernel Software Steering for Connection
Tracking (CT) [Beta] Added support for updating CT rules using the software
steering mechanism.
Kernel Software Steering Remote Mirroring [Beta] Added support for updating remote mirroring rules using
the software steering mechanism.

Adapters: ConnectX-4 and above

Discard Counters Exposed rx_prio[p]_discards discard counters per priority that


count the number of received packets dropped due to lack of
buffers on the physical port.
MPLS Traffic Added support for reporting TSO and CSUM offload capabilities
for MPLS tagged traffic and, allowed the kernel stack to use
these offloads.
mlx5e Max Combined Channels
Increased the driver’s maximal combined channels value from 64
to 128 (however, note that OOB value will not cross 64).
128 is the upper bound. Lower maximal value can be seen on the
host, depending on the number of cores and MSIX's configured by
the firmware.

RoCE Accelerator Counters
Added the following RoCE accelerator counters:
• roce_adp_retrans - counts the number of adaptive
retransmissions for RoCE traffic
• roce_adp_retrans_to - counts the number of times RoCE
traffic reached timeout due to adaptive retransmission
• roce_slow_restart - counts the number of times RoCE slow
restart was used
• roce_slow_restart_cnps - counts the number of times
RoCE slow restart generated CNP packets
• roce_slow_restart_trans - counts the number of times
RoCE slow restart changed state to slow restart

Memory Region
Added support for the user to register memory regions with a
relaxed ordering access flag through experimental verbs. This can
enhance performance, depending on architecture and scenario.

Adapters: All

ibdev2netdev Tool Output ibdev2netdev tool output was changed such that the bonding
device now points at the bond instead of the slave interface.
Devlink Health Reporters Added support for monitoring and recovering from errors that
occur on the RX queue, such as cookie errors and timeout.
GSO Optimization Improved GSO (Generic Segmentation Offload) workload
performance by decreasing doorbells usage to the minimum
required.
TX CQE Compression Added support for TX CQE (Completion Queue Element)
compression. Saves on outgoing PCIe bandwidth by compressing
CQEs together. Disabled by default. Configurable via private flags
of ethtool.
Firmware Versions Query via Devlink Added the option to query for running and stored firmware
versions using the devlink tool.
Firmware Flash Update via Devlink
Added the option to update the firmware image in the flash using
the devlink tool.

Usage: devlink dev flash <dev> file <file_name>.mfa2

For further information on how to perform this update, see


"Updating Firmware Using ethtool/devlink and .mfa2 File" section
in MFT User Manual.
Devlink Health WQE Dump
Added support for WQE (Work Queue Element) dump, triggered
by an error on Rx/Tx reporters. In addition, some dumps (not
triggered by an error) can be retrieved by the user via devlink
health reporters.
GENEVE Tunnel Stateless Offload Added support for GENEVE tunneled hardware offloads of TSO,
CSUM and RSS.
TCP Segmentation and Checksum Offload
Added TCP segmentation and checksum offload support for MPLS-
tagged traffic.
NEO-Host SDK Added support for NEO-Host SDK installation on MLNX_OFED.

Bug Fixes See Bug Fixes section.

Rev 4.7-1.0.0.1

HCAs: ConnectX-4 and above

Counters Monitoring Added support for monitoring selected counters and generating a
notification event (Monitor_Counter_Change event) upon changes
made to these counters.
The counters to be monitored are selected using the
SET_MONITOR_COUNTER command.
EEPROM Device Thresholds via Ethtool Added support to read additional EEPROM information from high
pages of modules such as SFF-8436 and SFF-8636. Such
information can be:
1. Application Select table
2. User writable EEPROM
3. Thresholds and alarms
Ethtool dump works on active cables only (e.g. optic), but
thresholds and alarms can be read with “offset” and “length”
parameters on any cable by running:
ethtool -m <DEVNAME> offset X length Y

RDMA_RX RoCE Steering Support Added the ability to create rules to steer RDMA traffic, with two
destinations supported: DevX object and QP. Multiple priorities
are also supported.

HCAs: ConnectX-5 and above

ASAP2 Incorporated the documentation of Accelerated Switching And


Packet Processing (ASAP2): Hardware Offloading for vSwitches
into MLNX_OFED Release Notes and User Manual.

HCAs: All

MLNX_OFED Installation via Repository


The repository providing legacy verbs has been moved from RPMS
or DEBS folders to RPMS/MLNX_LIBS and DEBS/MLNX_LIBS.

In addition, a new repository providing RDMA-Core based


userspace has been added to RPMS/UPSTREAM_LIBS and DEBS/
UPSTREAM_LIBS.

Rev 4.6-1.0.1.1

HCAs: ConnectX-3/ConnectX-3 Pro

Devlink Configuration Parameters Tool Added support for a set of configuration parameters that can be
changed by the user through the Devlink user interface.

HCAs: ConnectX-4 and above

ODP Pre-fetch Added support for pre-fetching a range of an on-demand paging


(ODP) memory region (MR), this way reducing latency by making
pages present with RO/RW permissions before the actual IO is
conducted.
DevX Privilege Enforcement Enforced DevX privilege by firmware. This enables future device
functionality without the need to make driver changes unless a
new privilege type is introduced.

DevX Interoperability APIs
Added support for modifying and/or querying for a verb object
(including CQ, QP, SRQ, WQ, and IND_TBL APIs) via the DevX
interface.

This enables interoperability between verbs and DevX.


DevX Asynchronous Query Commands Added support for running QUERY commands over the DevX
interface in an asynchronous mode. This enables applications to
issue many commands in parallel while firmware processes the
commands.
DevX User-space PRM  Handles Exposure
Exposed all PRM handles to user-space so DevX user application
can mix verbs objects with DevX objects.

For example: take the cqn from the created ibv_cq and use it in
a devx_create(QP).
Indirect Mkey ODP Added the ability to create indirect Mkeys with ODP support over
DevX interface.
XDP Redirect Added support for XDP_REDIRECT feature for both ingress and
egress sides. Using this feature, incoming packets on one
interface can be redirected very quickly into the transmission
queue of another capable interface. Typically used for load
balancing.
RoCE Disablement
Added the option to disable RoCE traffic handling. This enables
forwarding of traffic over UDP port 4791 that is handled as RoCE
traffic when RoCE is enabled.

When RoCE is disabled, there is no GID table, only Raw Ethernet


QP type is supported and RoCE traffic is handled as regular
Ethernet traffic.
Forward Error Correction (FEC) Encoding Added the ability to query and modify Forward Error Correction
(FEC) encoding, as well as disabling it via Ethtool.
RAW Per-Lane Counters Exposure
Exposed RAW error counters per cable-module lane via ethtool
stats. The counters show the number of errors before FEC
correction (if enabled).

For further information, please see phy_raw_errors_lane[i]


under Physical Port Counters section in Understanding mlx5
ethtool Counters Community post.

HCAs: ConnectX-4 Lx and above

VF LAG Added support for High Availability and load balancing for Virtual
Functions of different physical ports in SwitchDev SR-IOV mode.

HCAs: ConnectX-5 and above

ASAP2 Offloading VXLAN Decapsulation Added support for performing hardware Large Receive Offload
with HW LRO (HW LRO) on VFs with HW-decapsulated VXLAN.
For further information on the VXLAN decapsulation feature,
please refer to ASAP2 User Manual under www.mellanox.com ->
Products -> Software -> ASAP2.

PCI Atomic Operations Added the ability to run atomic operations on local memory
without involving verbs API or compromising the operation's
atomicity.

HCAs: ConnectX-5

Virtual Ethernet Port Aggregator (VEPA)


Added support for activating/deactivating Virtual Ethernet Port
Aggregator (VEPA) mode on a single virtual function (VF). To turn
on VEPA on the second VF, run:
echo ON > /sys/class/net/enp59s0/device/sriov/1/vepa

VFs Rate Limit Added support for setting a rate limit on groups of Virtual
Functions rather on an individual Virtual Function.

HCAs: ConnectX-6

ConnectX-6 Support
[Beta] Added support for ConnectX-6 (VPI only) adapter cards. 

NOTE: In HDR installations that are built with remotely managed


Quantum-based switches, the switch’s firmware must be
upgraded to version 27.2000.1142 prior to upgrading the HCA’s
(ConnectX-6) firmware to version 20.25.1500. When using
ConnectX-6 HCAs with firmware v20.25.1500 and connecting
them to Quantum-based switches, make sure the Quantum
firmware version is 27.2000.1142 in order to avoid any critical
link issues. 
Ethtool 200Gbps
ConnectX-6 hardware introduces support for 200Gbps and
50Gbps-per-lane link mode. MLNX_OFED supports full backward
compatibility with previous configurations.

Note that in order to advertise newly added link-modes, the full


bitmap related to the link modes must be advertised from
ethtool man page. For the full bitmap list per link mode, please
refer to MLNX_OFED User Manual.

NOTE: This feature is firmware-dependent. Currently, ConnectX-6


Ethernet firmware supports up to 100Gbps only. Thus, this
capability may not function properly using the current driver and
firmware versions.
PCIe Power State
Added support for the following PCIe power state indications to
be printed to dmesg:
1. Info message #1: PCIe slot power capability was not
advertised.
2. Warning message: Detected insufficient power on the
PCIe slot (xxxW).
3. Info message #2: PCIe slot advertised sufficient power
(xxxW).
When indication #1 or #2 appear in dmesg, user should
make sure to use a PCIe slot that is capable of supplying
the required power.

HCAs: mlx5 Driver

Message Signaled Added support for using a single MSI-X vector for all control event
Interrupts-X (MSI-X) queues instead of one MSI-X vector per queue in a virtual
Vectors function driver. This frees extra MSI-X vectors to be used for
completion event queue, allowing for additional traffic channels
in the network device.
Send APIs Introduced a new set of QP Send operations (APIs) which allows
extensibility for new Send opcodes.

HCAs: BlueField

BlueField Support BlueField is now fully supported as part of the Mellanox OFED
mainstream version sharing the same code baseline with all the
adapters product line.
Representor Name Change
In SwitchDev mode:
• Uplink representors are now called p0/p1
• Host PF representors are now called pf0hpf/pf1hpf
• VF representors are now called pf0vfN/pf1vfN

ECPF Net Devices  In SwitchDev mode, net devices enp3s0f0 and enp3s0f1 are no
longer created. 
Setting Host MAC and Tx Rate Limit from ECPF Expanded to support VFs as well as the host PFs.

HCAs: All

RDMA-CM Application Managed QP Added support for the RDMA application to manage its own QPs
and use RDMA-CM only for exchanging Address information.
RDMA-CM QP Timeout Control Added a new option to rdma_set_option that allows applications
to override the RDMA-CM's QP ACK timeout value.
MLNX_OFED Verbs API
As of MLNX_OFED v5.0 release (Q1 2020) onwards, MLNX_OFED
Verbs API will be migrated from the legacy version of the user
space verbs libraries (libibverbs, libmlx5, ...) to the upstream
version rdma-core.
More details are available in MLNX_OFED user manual under
Installing Upstream rdma-core Libraries.

4.5-1.0.1.0

HCAs: ConnectX-5

VFs per PF Increased the maximum number of virtual functions (VFs) that
can be allocated to a physical function (PF) to 127 VFs.

HCAs: ConnectX-4/ConnectX-4 Lx/ConnectX-5

SW-Defined UDP Source Port for RoCE v2 UDP source port for RoCE v2 packets is now calculated by the
driver rather than the firmware, achieving better distribution
and less congestion. This mechanism works for RDMA- CM QPs
only, and ensures that RDMA connection messages and data
messages have the same UDP source port value.

HCAs: mlx5 Driver

Local Loopback Disable Added the ability to manually disable Local Loopback regardless
of the number of open user-space transport domains.

HCAs: ConnectX-6

Adapter Cards Added support for ConnectX-6 Ready. For further information,
please contact Mellanox Support.

HCAs: All

Bug Fixes See “Bug Fixes" section.

4.4-2.0.7.0

HCAs: All

Operating Systems Added support for additional OSs. See "General Support in


MLNX_EN" section.

4.4-1.0.1.0

HCAs: ConnectX-4/ConnectX-4 Lx/ConnectX-5

Adaptive Interrupt Moderation Added support for adaptive Tx, which optimizes the moderation
values of the Tx CQs on runtime for maximum throughput with
minimum CPU overhead.

This mode is enabled by default.

Updated Adaptive Rx to ignore ACK packets so that queues that


only handle ACK packets remain with the default moderation.

Docker Containers [Beta] Added support for Docker containers to run over Virtual RoCE and
InfiniBand devices using SR-IOV mode.

Firmware Tracer Added a new mechanism for the device’s FW/HW to log
important events into the event tracing system (/sys/kernel/
debug/tracing) without requiring any Mellanox-specific tool.

Note: This feature is enabled by default.

CR-Dump Accelerated the original cr-dump by optimizing the reading


process of the device’s CR-Space snapshot.

HCAs: ConnectX-4/ConnectX-4 Lx

VST Q-in-Q Added support for C-tag (0x8100) VLAN insertion to tagged
packets in VST mode.

HCAs: ConnectX-4 Lx/ConnectX-5

OVS Offload using ASAP2 Added support for Mellanox Accelerated Switching And Packet
Processing (ASAP2) technology, which allows OVS offloading by
handling OVS data-plane, while maintaining OVS control-plane
unmodified. OVS Offload using ASAP2 technology provides
significantly higher OVS performance without the associated CPU
load.

For further information, refer to ASAP2 Release Notes under


www.mellanox.com -> Products -> Software -> ASAP2

4.3-1.0.1.0

HCAs: ConnectX-4/ConnectX-4 Lx/ConnectX-5

Adaptive Interrupt Moderation Added support for adaptive Tx, which optimizes the moderation
values of the Tx CQs on runtime for maximum throughput with
minimum CPU overhead.

This mode is enabled by default.

Updated Adaptive Rx to ignore ACK packets so that queues that


only handle ACK packets remain with the default moderation.


HCAs: ConnectX-5

Erasure Coding Offload Verbs Added support for erasure coding offload software verbs (encode/decode/update API) supporting a number of redundancy blocks (m) greater than 4.

HCAs: ConnectX-4/ConnectX-4 Lx/ConnectX-5

Virtual MAC Removed support for Virtual MAC feature.

RoCE LAG Added out of box RoCE LAG support for RHEL 7.2 and RHEL 6.9.

Dropped Counters Added a new counter, rx_steer_missed_packets, which provides the number of packets that were received by the NIC yet were discarded/dropped since they did not match any flow in the NIC steering flow table.

Added the ability for the SR-IOV counter rx_dropped to count the number of packets that were dropped while the vport was down.
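
A small helper along these lines can pull either counter out of `ethtool -S` output; the helper and its parsing are illustrative, not part of the driver:

```shell
# Sketch: pulling a named counter out of `ethtool -S` output. The helper and
# its parsing are illustrative, not part of the driver.
get_counter() {
    # $1 = counter name; expects "name: value" lines on stdin
    awk -v name="$1" -F': *' '$1 ~ name { print $2 }'
}
# Usage on a live system (eth0 is an example):
#   ethtool -S eth0 | get_counter rx_steer_missed_packets
```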

HCAs: mlx5 Driver

Reset Flow Added support for triggering software reset for firmware/driver
recovery. When fatal errors occur, firmware can be reset and
driver reloaded.

HCAs: ConnectX-4 Lx/ConnectX-5

Striding RQ with HW Time-Stamping Added the option to retrieve the HW timestamp when polling for
completions from a completion queue that is attached to a multi-
packet RQ (Striding RQ).

4.2-1.0.1.0

HCAs: mlx5 Driver

Physical Address Memory Allocation Added support for registering a specific physical address range.

HCAs: Innova IPsec EN

Innova IPsec Adapter Cards Added support for the Mellanox Innova IPsec EN adapter card, which provides security acceleration for IPsec-enabled networks.

HCAs: ConnectX-4/ConnectX-4 Lx/ConnectX-5

Precision Time Protocol (PTP) Added support for the PTP feature over PKEY interfaces.

This feature allows for accurate synchronization between the distributed entities over the network. The synchronization is based on symmetric Round Trip Time (RTT) between the master and slave devices, and is enabled by default.

Virtual MAC Added support for the Virtual MAC feature, which allows users to add up to 4 virtual MACs (VMACs) per VF. All traffic destined to a VMAC is forwarded to the relevant VF instead of the PF. All traffic going out from the VF with a source MAC equal to a VMAC goes to the wire, even when Spoof Check is enabled.

For further information, please refer to the “Virtual MAC” section in the MLNX_EN User Manual.

Receive Buffer Added the option to change the receive buffer size and cable length. Changing the cable length will adjust the receive buffer's xon and xoff thresholds.

For further information, please refer to the “Receive Buffer” section in the MLNX_EN User Manual.

GRE Tunnel Offloads Added support for the following GRE tunnel offloads:
• TSO over GRE tunnels
• Checksum offloads over GRE tunnels
• RSS spread for GRE packets

NVMEoF Added support for the host side (RDMA initiator) in RHEL 7.2 and above.

Dropless Receive Queue (RQ) Added support for the driver to notify the FW when SW receive
queues are overloaded.

PFC Storm Prevention Added support for configuring PFC stall prevention in cases where
the device unexpectedly becomes unresponsive for a long period
of time. PFC stall prevention disables flow control mechanisms
when the device is stalled for a period longer than the default
pre-configured timeout. Users now have the ability to change the
default timeout by moving to auto mode.

For further information, please refer to the “PFC Stall Prevention” section in the MLNX_EN User Manual.

HCAs: ConnectX-5

Q-in-Q Added support for the Q-in-Q VST feature in the ConnectX-5 adapter cards family.

Virtual Guest Tagging (VGT+) Added support for VGT+ in ConnectX-4/ConnectX-5 HCAs. This feature is an advanced mode of Virtual Guest Tagging (VGT), in which a VF is allowed to tag its own packets as in VGT, but is still subject to an administrative VLAN trunk policy. The policy determines which VLAN IDs are allowed to be transmitted or received. The policy does not determine the user priority, which is left unchanged.

For further information, please refer to the “Virtual Guest Tagging (VGT+)” section in the MLNX_EN User Manual.

Tag Matching Offload Added support for hardware Tag Matching offload with
Dynamically Connected Transport (DCT).

HCAs: All

CR-DUMP Added support for the driver to take an automatic snapshot of the device’s CR-Space in cases of critical failures.

For further information, please refer to the “CRDUMP” section in the MLNX_EN User Manual.

4.1-1.0.2.0

HCAs: mlx5 Driver

RoCE Diagnostics and ECN Counters Added support for additional RoCE diagnostics and ECN congestion counters under the /sys/class/infiniband/mlx5_0/ports/1/hw_counters/ directory.

For further information, refer to the Understanding mlx5 Linux Counters and Status Parameters Community post.
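
A quick way to inspect these counters is to read the sysfs files directly; the device name mlx5_0 and port number 1 below are examples:

```shell
# Sketch: reading the counters straight from sysfs. The device name mlx5_0 and
# port number 1 are examples; adjust to your system.
HW_COUNTERS=/sys/class/infiniband/mlx5_0/ports/1/hw_counters

dump_hw_counters() {
    # Prints "path:value" for every counter file in the directory.
    grep . "$HW_COUNTERS"/* 2>/dev/null
}
# Usage: dump_hw_counters | grep ecn
```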

rx-fcs Offload (ethtool) Added support for rx-fcs ethtool offload configuration. Normally, the FCS of the packet is truncated by the ASIC hardware before the packet is passed to the application socket buffer (skb). Ethtool now allows configuring rx-fcs so that the FCS is not truncated but passed up to the application for analysis.

For more information and usage, refer to the Understanding ethtool rx-fcs for mlx5 Drivers Community post.
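
A minimal sketch using standard ethtool follows; the helper is hypothetical and the interface name in the usage lines is an example:

```shell
# Sketch: keeping the FCS on received packets (hypothetical helper; eth0 in the
# usage lines is an example interface).
enable_rx_fcs() {
    ethtool -K "$1" rx-fcs on
}
# Usage: enable_rx_fcs eth0
#        ethtool -k eth0 | grep rx-fcs    # verify the offload state
```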

DSCP Trust Mode Added the option to enable PFC based on the DSCP value. With this solution, VLAN headers are no longer mandatory.

For further information, refer to the HowTo Configure Trust Mode on Mellanox Adapters Community post.
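
Assuming the mlnx_qos utility from the Mellanox userspace tools is installed, the trust state can be switched roughly as follows (the helper is hypothetical and eth0 is an example interface):

```shell
# Sketch: switching the port trust state from PCP (VLAN priority) to DSCP.
# Assumes the mlnx_qos utility is installed; eth0 is an example interface.
set_trust_dscp() {
    mlnx_qos -i "$1" --trust dscp
}
# Usage: set_trust_dscp eth0
```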

RoCE ECN Parameters ECN parameters have been moved to the following directory: /sys/kernel/debug/mlx5/<PCI BUS>/cc_params/

For more information, refer to the HowTo Configure DCQCN (RoCE CC) for ConnectX-4 (Linux) Community post.
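
The relocated parameters can be listed directly from debugfs; in this sketch the PCI address 0000:03:00.0 is an example and reading debugfs requires root:

```shell
# Sketch: listing the relocated ECN parameters. The PCI address 0000:03:00.0 is
# an example; reading debugfs requires root.
CC_PARAMS=/sys/kernel/debug/mlx5/0000:03:00.0/cc_params

list_cc_params() {
    for f in "$CC_PARAMS"/*; do
        printf '%s = %s\n' "${f##*/}" "$(cat "$f")"
    done
}
# Usage: list_cc_params
```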

Flow Steering Dump Tool Added support for mlx_fs_dump, a Python tool that prints the steering rules in a readable manner.

Secure Firmware Updates Firmware binaries embedded in MLNX_EN package now support
Secure Firmware Updates. This feature provides devices with the
ability to verify digital signatures of new firmware binaries, in
order to ensure that only officially approved versions are
installed on the devices.

For further information on this feature, refer to the Mellanox Firmware Tools (MFT) User Manual.

PeerDirect Added the ability to open a device and create a context while
giving PCI peer attributes such as name and ID.

For further details, refer to the PeerDirect Programming Community post.

Probed VFs Added the ability to disable probed VFs on the hypervisor. For
further information, see HowTo Configure and Probe VFs on mlx5
Drivers Community post.

Local Loopback Improved performance by having the mlx5 driver disable local loopback (unicast and multicast) by default while it is not in use. The mlx5 driver keeps track of the number of transport domains opened by user-space applications; if more than one user-space transport domain is open, local loopback is automatically enabled.

1PPS Time Synchronization (at alpha level) Added support for One Pulse Per Second (1PPS), which is a time
synchronization feature that allows the adapter to send or
receive 1 pulse per second on a dedicated pin on the adapter
card.

For further information on this feature, refer to the HowTo Test 1PPS on Mellanox Adapters Community post.

Fast Driver Unload Added support for fast driver teardown in shutdown and kexec
flows.

HCAs: ConnectX-5/ConnectX-5 Ex

NVMEoF Target Offload Added support for NVMe over fabrics (NVMEoF) offload, an
implementation of the new NVMEoF standard target (server) side
in hardware.

For further information on NVMEoF Target Offload, refer to the HowTo Configure NVMEoF Target Offload Community post.

HCAs: All

RDMA CM Changed the default RoCE mode on which RDMA CM runs to RoCEv2 instead of RoCEv1.

An RDMA_CM session requires both the client and server sides to support the same RoCE mode; otherwise, the client will fail to connect to the server.

For further information, refer to the RDMA CM and RoCE Version Defaults Community post.
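
The per-device default can be inspected, and overridden, through the rdma_cm configfs interface, roughly as follows; the configfs mount point is the usual one, and mlx5_0 and port 1 in the usage line are examples:

```shell
# Sketch: querying the RDMA_CM default RoCE version through configfs. Assumes
# configfs is mounted at /sys/kernel/config and the rdma_cm module is loaded;
# mlx5_0 and port 1 in the usage line are examples.
CM_CONFIGFS=/sys/kernel/config/rdma_cm

show_default_roce_mode() {
    dev="$1"
    mkdir -p "$CM_CONFIGFS/$dev"    # creating the directory instantiates the port entries
    cat "$CM_CONFIGFS/$dev/ports/1/default_roce_mode"
}
# Usage: show_default_roce_mode mlx5_0
```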

Lustre Added support for the Lustre file system open-source project.

4.0-2.0.0.1

PCIe Error Counting [ConnectX-4/ConnectX-4 Lx] Added the ability to expose physical
layer statistical counters to ethtool.

Standard ethtool [ConnectX-4/ConnectX-4 Lx] Added support for flow steering and
rx-all mode.

SR-IOV Bandwidth Share for Ethernet/RoCE (beta) [ConnectX-4/ConnectX-4 Lx] Added the ability to guarantee the minimum rate of a certain VF in SR-IOV mode.

Adapter Cards Added support for ConnectX-5 and ConnectX-5 Ex HCAs.

NFS over RDMA (NFSoRDMA) Removed support for NFSoRDMA drivers. These drivers are no
longer provided along with the MLNX_EN package.

Uplink Representor Modes Added support for the following Uplink Representor modes:
1. new_netdev: default mode - the uplink representor is created as a new netdevice
2. nic_netdev: the NIC netdevice acts as an uplink representor device

Example:

echo nic_netdev > /sys/class/net/ens1f0/compat/devlink/uplink_rep_mode

Notes:
• The mode can only be changed when the device is in Legacy mode
• The mode is not saved when reloading mlx5_core

Notice

This document is provided for information purposes only and shall not be regarded as a warranty of a certain
functionality, condition, or quality of a product. Neither NVIDIA Corporation nor any of its direct or indirect subsidiaries
and affiliates (collectively: “NVIDIA”) make any representations or warranties, expressed or implied, as to the accuracy
or completeness of the information contained in this document and assumes no responsibility for any errors contained
herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents
or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or
deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to
this document, at any time without notice.
Customer should obtain the latest relevant information before placing orders and should verify that such information is
current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order
acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of
NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and
conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations
are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or
life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be
expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for
inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at
customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use.
Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to
evaluate and determine the applicability of any information contained in this document, ensure the product is suitable
and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a
default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability
of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in
this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or
attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product
designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual
property right under this document. Information published by NVIDIA regarding third-party products or services does not
constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such
information may require a license from a third party under the patents or other intellectual property rights of the third
party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced
without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all
associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS,
AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO
WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY
DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT
LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND
REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason
whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be
limited in accordance with the Terms of Sale for the product.

Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/
or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks
of the respective companies with which they are associated.

Copyright
© 2022 NVIDIA Corporation & affiliates. All Rights Reserved.
