RDMA Aware Networks Programming User Manual 1.7
Table of Contents
1 Glossary ................................................................................. 11
2 RDMA Architecture Overview......................................................... 15
2.1 InfiniBand .................................................................................... 15
2.2 Virtual Protocol Interconnect® (VPI) .................................................... 15
2.3 RDMA over Converged Ethernet (RoCE).................................................. 15
2.4 Comparison of RDMA Technologies ....................................................... 16
2.5 Key Components ............................................................................ 18
2.5.1 Host Channel Adapter .................................................................. 18
2.5.2 Range Extenders ........................................................................ 18
2.5.3 Subnet Manager ......................................................................... 18
2.5.4 Switches .................................................................................. 18
2.6 Support for Existing Applications and ULPs ............................................. 18
2.7 References ................................................................................... 19
3 RDMA-Aware Programming Overview ............................................... 20
3.1 Available Communication Operations.................................................... 20
3.1.1 Send/Send With Immediate ........................................................... 20
3.1.2 Receive ................................................................................... 20
3.1.3 RDMA Read ............................................................................... 20
3.1.4 RDMA Write / RDMA Write With Immediate ......................................... 21
3.1.5 Atomic Fetch and Add / Atomic Compare and Swap .............................. 21
3.2 Transport Modes............................................................................. 21
3.2.1 Reliable Connection (RC) .............................................................. 21
3.2.2 Unreliable Connection (UC) ........................................................... 21
3.2.3 Unreliable Datagram (UD) ............................................................. 21
3.3 Key Concepts ................................................................................ 22
3.3.1 Send Request (SR)....................................................................... 22
3.3.2 Receive Request (RR)................................................................... 22
3.3.3 Completion Queue ...................................................................... 22
3.3.4 Memory Registration.................................................................... 22
3.3.5 Memory Window......................................................................... 22
3.3.6 Address Vector .......................................................................... 23
3.3.7 Global Routing Header (GRH) ......................................................... 23
3.3.8 Protection Domain ...................................................................... 23
3.3.9 Asynchronous Events ................................................................... 23
3.3.10 Scatter Gather........................................................................... 23
3.3.11 Polling .................................................................................... 24
3.4 Typical Application.......................................................................... 24
4 VPI Verbs API............................................................................ 26
4.1 Initialization ................................................................................. 26
4.1.1 ibv_fork_init ............................................................................. 26
4.2 Device Operations .......................................................................... 26
4.2.1 ibv_get_device_list ..................................................................... 26
4.2.2 ibv_free_device_list .................................................................... 27
4.2.3 ibv_get_device_name .................................................................. 27
4.2.4 ibv_get_device_guid.................................................................... 27
4.2.5 ibv_open_device ........................................................................ 28
4.2.6 ibv_close_device ........................................................................ 28
4.2.7 ibv_node_type_str ...................................................................... 28
4.2.8 ibv_port_state_str ...................................................................... 29
4.3 Verb Context Operations................................................................... 29
4.3.1 ibv_query_device ....................................................................... 29
4.3.2 ibv_query_port .......................................................................... 31
4.3.3 ibv_query_gid ........................................................................... 32
4.3.4 ibv_query_pkey ......................................................................... 33
4.3.5 ibv_alloc_pd ............................................................................. 33
4.3.6 ibv_dealloc_pd .......................................................................... 34
4.3.7 ibv_create_cq ........................................................................... 34
4.3.8 ibv_resize_cq ............................................................................ 35
4.3.9 ibv_destroy_cq .......................................................................... 35
4.3.10 ibv_create_comp_channel............................................................. 35
4.3.11 ibv_destroy_comp_channel ........................................................... 35
4.4 Protection Domain Operations ............................................................ 36
4.4.1 ibv_reg_mr ............................................................................... 36
4.4.2 ibv_dereg_mr ............................................................................ 37
4.4.3 ibv_create_qp ........................................................................... 37
4.4.4 ibv_destroy_qp .......................................................................... 38
4.4.5 ibv_create_srq .......................................................................... 38
4.4.6 ibv_modify_srq .......................................................................... 39
4.4.7 ibv_destroy_srq ......................................................................... 39
4.4.8 ibv_open_xrc_domain .................................................................. 40
4.4.9 ibv_create_xrc_srq ..................................................................... 40
4.4.10 ibv_close_xrc_domain.................................................................. 41
4.4.11 ibv_create_xrc_rcv_qp................................................................. 41
4.4.12 ibv_modify_xrc_rcv_qp ................................................................ 42
4.4.13 ibv_reg_xrc_rcv_qp..................................................................... 42
4.4.14 ibv_unreg_xrc_rcv_qp.................................................................. 43
4.4.15 ibv_create_ah ........................................................................... 43
4.4.16 ibv_destroy_ah .......................................................................... 44
4.5 Queue Pair Bringup (ibv_modify_qp) .................................................... 44
4.5.1 ibv_modify_qp........................................................................... 44
4.5.2 RESET to INIT ............................................................................ 45
4.5.3 INIT to RTR ............................................................................... 46
4.5.4 RTR to RTS ............................................................................... 46
4.6 Active Queue Pair Operations............................................................. 47
4.6.1 ibv_query_qp ............................................................................ 47
4.6.2 ibv_query_srq ........................................................................... 47
4.6.3 ibv_query_xrc_rcv_qp.................................................................. 48
4.6.4 ibv_post_recv............................................................................ 48
4.6.5 ibv_post_send ........................................................................... 49
4.6.6 ibv_post_srq_recv....................................................................... 50
4.6.7 ibv_req_notify_cq....................................................................... 51
4.6.8 ibv_get_cq_event ....................................................................... 51
4.6.9 ibv_ack_cq_events...................................................................... 52
4.6.10 ibv_poll_cq............................................................................... 52
4.6.11 ibv_init_ah_from_wc ................................................................... 53
4.6.12 ibv_create_ah_from_wc ............................................................... 54
4.6.13 ibv_attach_mcast ....................................................................... 54
4.6.14 ibv_detach_mcast ...................................................................... 55
4.7 Event Handling Operations ................................................................ 55
4.7.1 ibv_get_async_event ................................................................... 55
4.7.2 ib_ack_async_event .................................................................... 56
4.7.3 ibv_event_type_str ..................................................................... 57
4.8 Experimental APIs........................................................................... 57
4.8.1 ibv_exp_query_device ................................................................. 57
4.8.2 ibv_exp_create_qp ..................................................................... 58
4.8.3 ibv_exp_post_send ..................................................................... 59
5 RDMA_CM API ........................................................................... 61
5.1 Event Channel Operations ................................................................. 61
5.1.1 rdma_create_event_channel.......................................................... 61
5.1.2 rdma_destroy_event_channel......................................................... 61
5.2 Connection Manager (CM) ID Operations ................................................ 62
5.2.1 rdma_create_id ......................................................................... 62
5.2.2 rdma_destroy_id ........................................................................ 63
5.2.3 rdma_migrate_id........................................................................ 63
5.2.4 rdma_set_option ........................................................................ 63
5.2.5 rdma_create_ep ........................................................................ 64
5.2.6 rdma_destroy_ep ....................................................................... 65
5.2.7 rdma_resolve_addr ..................................................................... 65
5.2.8 rdma_bind_addr......................................................................... 66
5.2.9 rdma_resolve_route .................................................................... 67
5.2.10 rdma_listen .............................................................................. 67
5.2.11 rdma_connect ........................................................................... 67
5.2.12 rdma_get_request ...................................................................... 69
5.2.13 rdma_accept............................................................................. 69
5.2.14 rdma_reject ............................................................................. 70
5.2.15 rdma_notify.............................................................................. 70
5.2.16 rdma_disconnect........................................................................ 70
5.2.17 rdma_get_src_port ..................................................................... 71
5.2.18 rdma_get_dst_port ..................................................................... 71
5.2.19 rdma_get_local_addr................................................................... 71
5.2.20 rdma_get_peer_addr ................................................................... 72
5.2.21 rdma_get_devices ...................................................................... 72
5.2.22 rdma_free_devices ..................................................................... 72
5.2.23 rdma_getaddrinfo....................................................................... 73
5.2.24 rdma_freeaddrinfo...................................................................... 73
5.2.25 rdma_create_qp ........................................................................ 73
5.2.26 rdma_destroy_qp ....................................................................... 74
5.2.27 rdma_join_multicast ................................................................... 74
5.2.28 rdma_leave_multicast ................................................................. 75
5.3 RDMA_CM Event Handling Operations.................................................... 75
5.3.1 rdma_get_cm_event.................................................................... 75
5.3.2 rdma_ack_cm_event ................................................................... 79
5.3.3 rdma_event_str ......................................................................... 79
6 RDMA Verbs API ......................................................................... 80
6.1 RDMA Protection Domain Operations .................................................... 80
6.1.1 rdma_reg_msgs.......................................................................... 80
6.1.2 rdma_reg_read .......................................................................... 80
6.1.3 rdma_reg_write ......................................................................... 81
6.1.4 rdma_dereg_mr ......................................................................... 81
6.1.5 rdma_create_srq ........................................................................ 82
6.1.6 rdma_destroy_srq....................................................................... 82
6.2 RDMA Active Queue Pair Operations ..................................................... 83
6.2.1 rdma_post_recvv........................................................................ 83
6.2.2 rdma_post_sendv ....................................................................... 83
6.2.3 rdma_post_readv ....................................................................... 84
6.2.4 rdma_post_writev ...................................................................... 85
6.2.5 rdma_post_recv ......................................................................... 85
6.2.6 rdma_post_send......................................................................... 86
6.2.7 rdma_post_read......................................................................... 87
6.2.8 rdma_post_write ........................................................................ 88
6.2.9 rdma_post_ud_send .................................................................... 88
6.2.10 rdma_get_send_comp .................................................................. 89
6.2.11 rdma_get_recv_comp .................................................................. 89
7 Events.................................................................................... 91
7.1 IBV Events .................................................................................... 91
7.1.1 IBV_EVENT_CQ_ERR .................................................................... 91
7.1.2 IBV_EVENT_QP_FATAL .................................................................. 91
7.1.3 IBV_EVENT_QP_REQ_ERR .............................................................. 91
7.1.4 IBV_EVENT_QP_ACCESS_ERR .......................................................... 91
7.1.5 IBV_EVENT_COMM_EST ................................................................. 92
7.1.6 IBV_EVENT_SQ_DRAINED ............................................................... 92
7.1.7 IBV_EVENT_PATH_MIG .................................................................. 92
7.1.8 IBV_EVENT_PATH_MIG_ERR ............................................................ 92
7.1.9 IBV_EVENT_DEVICE_FATAL ............................................................. 92
7.1.10 IBV_EVENT_PORT_ACTIVE ............................................................. 93
7.1.11 IBV_EVENT_PORT_ERR ................................................................. 93
7.1.12 IBV_EVENT_LID_CHANGE............................................................... 93
7.1.13 IBV_EVENT_PKEY_CHANGE ............................................................ 93
7.1.14 IBV_EVENT_SM_CHANGE ............................................................... 94
7.1.15 IBV_EVENT_SRQ_ERR ................................................................... 94
7.1.16 IBV_EVENT_SRQ_LIMIT_REACHED..................................................... 94
7.1.17 IBV_EVENT_QP_LAST_WQE_REACHED................................................ 94
7.1.18 IBV_EVENT_CLIENT_REREGISTER ..................................................... 94
7.1.19 IBV_EVENT_GID_CHANGE .............................................................. 95
7.2 IBV WC Events ............................................................................... 95
7.2.1 IBV_WC_SUCCESS ....................................................................... 95
7.2.2 IBV_WC_LOC_LEN_ERR ................................................................. 95
7.2.3 IBV_WC_LOC_QP_OP_ERR ............................................................. 95
7.2.4 IBV_WC_LOC_EEC_OP_ERR ............................................................ 95
7.2.5 IBV_WC_LOC_PROT_ERR ............................................................... 95
7.2.6 IBV_WC_WR_FLUSH_ERR............................................................... 95
7.2.7 IBV_WC_MW_BIND_ERR ................................................................ 95
7.2.8 IBV_WC_BAD_RESP_ERR................................................................ 96
7.2.9 IBV_WC_LOC_ACCESS_ERR............................................................. 96
7.2.10 IBV_WC_REM_INV_REQ_ERR ........................................................... 96
7.2.11 IBV_WC_REM_ACCESS_ERR ............................................................ 96
7.2.12 IBV_WC_REM_OP_ERR .................................................................. 96
7.2.13 IBV_WC_RETRY_EXC_ERR .............................................................. 96
7.2.14 IBV_WC_RNR_RETRY_EXC_ERR ........................................................ 96
7.2.15 IBV_WC_LOC_RDD_VIOL_ERR.......................................................... 96
7.2.16 IBV_WC_REM_INV_RD_REQ_ERR ...................................................... 97
7.2.17 IBV_WC_REM_ABORT_ERR ............................................................. 97
7.2.18 IBV_WC_INV_EECN_ERR ................................................................ 97
7.2.19 IBV_WC_INV_EEC_STATE_ERR ......................................................... 97
7.2.20 IBV_WC_FATAL_ERR ..................................................................... 97
7.2.21 IBV_WC_RESP_TIMEOUT_ERR.......................................................... 97
7.2.22 IBV_WC_GENERAL_ERR................................................................. 97
7.3 RDMA_CM Events ............................................................................ 97
7.3.1 RDMA_CM_EVENT_ADDR_RESOLVED .................................................. 97
7.3.2 RDMA_CM_EVENT_ADDR_ERROR ...................................................... 98
7.3.3 RDMA_CM_EVENT_ROUTE_RESOLVED ................................................ 98
7.3.4 RDMA_CM_EVENT_ROUTE_ERROR .................................................... 98
7.3.5 RDMA_CM_EVENT_CONNECT_REQUEST .............................................. 98
7.3.6 RDMA_CM_EVENT_CONNECT_RESPONSE ............................................. 98
7.3.7 RDMA_CM_EVENT_CONNECT_ERROR ................................................. 98
7.3.8 RDMA_CM_EVENT_UNREACHABLE .................................................... 98
7.3.9 RDMA_CM_EVENT_REJECTED .......................................................... 98
7.3.10 RDMA_CM_EVENT_ESTABLISHED ...................................................... 99
7.3.11 RDMA_CM_EVENT_DISCONNECTED.................................................... 99
7.3.12 RDMA_CM_EVENT_DEVICE_REMOVAL ................................................. 99
7.3.13 RDMA_CM_EVENT_MULTICAST_JOIN.................................................. 99
7.3.14 RDMA_CM_EVENT_MULTICAST_ERROR ............................................... 99
7.3.15 RDMA_CM_EVENT_ADDR_CHANGE .................................................... 99
7.3.16 RDMA_CM_EVENT_TIMEWAIT_EXIT.................................................... 99
8 Programming Examples Using IBV Verbs ...........................................100
8.1 Synopsis for RDMA_RC Example Using IBV Verbs ..................................... 100
8.1.1 Main ......................................................................................100
8.1.2 print_config.............................................................................101
8.1.3 resources_init ..........................................................................101
8.1.4 resources_create.......................................................................101
8.1.5 sock_connect ...........................................................................101
8.1.6 connect_qp .............................................................................101
8.1.7 modify_qp_to_init .....................................................................101
8.1.8 post_receive ............................................................................102
8.1.9 sock_sync_data.........................................................................102
8.1.10 modify_qp_to_rtr ......................................................................102
8.1.11 modify_qp_to_rts ......................................................................102
8.1.12 post_send ...............................................................................102
8.1.13 poll_completion ........................................................................102
8.1.14 resources_destroy .....................................................................102
8.2 Code for Send, Receive, RDMA Read, RDMA Write ................................... 102
8.3 Synopsis for Multicast Example Using RDMA_CM and IBV Verbs.................... 130
8.3.1 Main ......................................................................................131
8.3.2 Run .......................................................................................131
8.4 Code for Multicast Using RDMA_CM and IBV Verbs ................................... 132
8.5 Programming Examples Using RDMA Verbs ............................................ 137
8.5.1 Automatic Path Migration (APM).....................................................137
8.5.2 Multicast Code Example Using RDMA CM ...........................................145
8.5.3 Shared Receive Queue (SRQ) ........................................................151
8.6 Experimental APIs......................................................................... 157
8.6.1 Dynamically Connected Transport...................................................157
8.7 Verbs API for Extended Atomics Support .............................................. 160
8.7.1 Supported Hardware ..................................................................161
8.7.2 Verbs Interface Changes ..............................................................161
8.8 User-Mode Memory Registration (UMR) ................................................ 163
8.8.1 Interfaces ...............................................................................165
8.9 Cross-Channel Communications Support .............................................. 167
8.9.1 Usage Model ............................................................................168
8.9.2 Resource Initialization ................................................................168
8.9.3 Posting Request List ...................................................................170
9 Document Revision History ..........................................................172
Remote Direct Memory Access (RDMA) provides direct memory access from the memory of one host
(storage or compute) to the memory of another host without involving the remote Operating System
and CPU, boosting network and host performance with lower latency, lower CPU load and higher
bandwidth. In contrast, TCP/IP communications typically require copy operations, which add
latency and consume significant CPU and memory resources.
1 Glossary
Term Description
Access Layer Low-level operating system infrastructure (plumbing) used for
accessing the interconnect fabric (VPI™, InfiniBand®, Ethernet, FCoE). It
includes all basic transport services needed to support upper-level network
protocols, middleware, and management agents.
AH (Address Handle) An object which describes the path to the remote side,
used with UD QPs
CA (Channel Adapter) A device which terminates an InfiniBand link, and executes
transport level functions
CI (Channel Interface) Presentation of the channel to the Verbs Consumer as
implemented through the combination of the network adapter,
associated firmware, and device driver software
CM (Communication Manager) An entity responsible for establishing, maintaining, and
releasing communication for the RC and UC QP service types.
The Service ID Resolution Protocol enables users of UD service
to locate QPs supporting their desired service.
There is a CM in every IB port of the end nodes.
Compare & Swap Instructs the remote QP to read a 64-bit value, compare it with
the compare data provided, and, if equal, replace it with the
swap data provided in the SR.
CQ (Completion Queue) A queue (FIFO) which contains CQEs
CQE (Completion Queue Entry) An entry in the CQ that describes information about the
completed WR (status, size, etc.)
DMA (Direct Memory Access) Allows hardware to move data blocks directly to and from
memory, bypassing the CPU
Fetch & Add Instructs the remote QP to read a 64-bit value and replace it
with the sum of that 64-bit value and the added data value
provided in the SR.
GUID (Globally Unique IDentifier) A 64-bit number that uniquely identifies a device or component
in a subnet
GID (Global IDentifier) A 128-bit identifier used to identify a port on a network
adapter, a port on a router, or a multicast group.
A GID is a valid 128-bit IPv6 address (per RFC 2373) with
additional properties/restrictions defined within IBA to
facilitate efficient discovery, communication, and routing.
GRH (Global Routing Header) A packet header used to deliver packets across a subnet
boundary, and also used to deliver multicast messages.
This packet header is based on the IPv6 protocol.
Network Adapter A hardware device that allows for communication between
computers in a network.
Host A computer platform executing an Operating System which may
control one or more network adapters
IB InfiniBand
Join operation An IB port must explicitly join a multicast group by sending a
request to the SA to receive multicast packets.
lkey A number received upon registration of an MR; used locally
by the WR to identify the memory region and its associated
permissions.
LID (Local IDentifier) A 16-bit address assigned to end nodes by the subnet manager.
Each LID is unique within its subnet.
LLE (Low Latency Ethernet) RDMA service over CEE (Converged Enhanced Ethernet)
allowing IB transport over Ethernet.
NA (Network Adapter) A device which terminates a link and executes transport level
functions.
MGID (Multicast Group ID) IB multicast groups, identified by MGIDs, are managed by the
SM. The SM associates an MLID with each MGID and explicitly
programs the IB switches in the fabric to ensure that the
packets are received by all the ports that joined the multicast
group.
MR (Memory Region) A contiguous set of memory buffers which have already been
registered with access permissions. These buffers need to be
registered for the network adapter to make use of them. During
registration, an L_Key and an R_Key are created and associated
with the created memory region.
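The relationship between a registered region and its two keys can be sketched with a small conceptual model in plain C. The names below (`mr_model`, `register_region`) are hypothetical, chosen for illustration; real registration is done with ibv_reg_mr(), which pins the buffer and returns a struct ibv_mr carrying the lkey and rkey.

```c
#include <stdint.h>
#include <stddef.h>

/* Conceptual model only -- hypothetical type, not part of the verbs API. */
struct mr_model {
    void    *addr;   /* start of the registered buffer */
    size_t   length; /* length of the buffer in bytes */
    uint32_t lkey;   /* local key: placed in each SGE that references the region */
    uint32_t rkey;   /* remote key: handed to the peer for RDMA Read/Write/Atomic */
};

static uint32_t next_key = 1;

/* "Register" a buffer: associate it with a fresh lkey/rkey pair. */
struct mr_model register_region(void *buf, size_t len)
{
    struct mr_model mr;
    mr.addr   = buf;
    mr.length = len;
    mr.lkey   = next_key++;
    mr.rkey   = next_key++;
    return mr;
}
```

The split between the two keys mirrors the trust model: the lkey never leaves the local host, while the rkey is the only capability the remote side holds for accessing the region.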
MTU (Maximum Transfer Unit) The maximum size of a packet payload (not including headers)
that can be sent/received from a port
MW (Memory Window) An allocated resource that enables remote access after being
bound to a specified area within an existing Memory
Registration. Each Memory Window has an associated Window
Handle, set of access privileges, and current R_Key.
Outstanding Work Request A WR which was posted to a work queue and whose completion
has not yet been polled
pkey (Partition key) The pkey identifies a partition that the port belongs to. A pkey
is roughly analogous to a VLAN ID in Ethernet networking. It is
used to point to an entry within the port’s partition key (pkey)
table. Each port is assigned at least one pkey by the subnet
manager (SM).
PD (Protection Domain) Object whose components can interact with only each other.
AHs interact with QPs, and MRs interact with WQs.
QP (Queue Pair) The pair (send queue and receive queue) of independent WQs
packed together in one object for the purpose of transferring
data between nodes of a network.
Posts are used to initiate the sending or receiving of data.
There are three types of QP: Unreliable Datagram (UD),
Unreliable Connection (UC), and Reliable Connection (RC).
RC (Reliable Connection) A QP transport service type based on a connection-oriented
protocol.
A QP (queue pair) is associated with another single QP. The
messages are sent in a reliable way (in terms of the correctness
and order of the information).
RDMA (Remote Direct Memory Access) Accessing memory in a remote side without involvement of the
remote CPU
RDMA_CM (Remote Direct Memory Access Communication Manager) API used to set up
reliable, connected, and unreliable datagram data transfers. It
provides an RDMA transport-neutral interface for establishing
connections. The API is based on sockets, but adapted for queue
pair (QP) based semantics: communication must be over a specific
RDMA device, and data transfers are message based.
Requestor The side of the connection that initiates a data transfer (by
posting a send request)
Responder The side of the connection that responds to commands from
the requestor, which may include a request to write to or read
from the responder’s memory, and finally a command requesting
the responder to receive a message.
rkey A number received upon registration of an MR; used to
enforce permissions on incoming RDMA operations
RNR (Receiver Not Ready) The flow in an RC QP where there is a connection between the
sides but an RR is not present on the receive side
RQ (Receive Queue) A Work Queue which holds RRs posted by the user
RR (Receive Request) A WR which was posted to an RQ and describes where
incoming data from a send opcode is going to be written. Note
that an RDMA Write with Immediate will also consume an RR.
RTR (Ready To Receive) A QP state in which an RR can be posted and processed
RTS (Ready To Send) A QP state in which an SR can be posted and processed
SA (Subnet Administrator) The interface for querying and manipulating subnet
management data.
SGE (Scatter/Gather Element) An entry pointing to all or part of a local registered
memory block.
The element holds the start address of the block, its size, and
its lkey (with associated permissions).
S/G Array An array of S/G elements within a WR that, according to the
opcode used, either collects data from multiple buffers and
sends it as a single stream or takes a single stream and breaks
it up into numerous buffers.
SM (Subnet Manager) An entity that configures and manages the subnet:
Discovers the network topology
Assigns LIDs
Determines the routing schemes and sets the routing tables
One master SM and possibly several slaves (in Standby mode)
Administers switch routing tables, thereby establishing paths
through the fabric
SQ (Send Queue) A Work Queue which holds SRs posted by the user
SR (Send Request) A WR posted to an SQ that describes how much data will be
transferred, its direction, and the way it will be transferred
(specified by the opcode).
SRQ (Shared Receive Queue) A queue which holds WQEs for incoming messages from any RC/
UC/UD QP associated with it.
More than one QP can be associated with a single SRQ.
TCA (Target Channel Adapter) A Channel Adapter that is not required to support verbs, usually
used in I/O devices
UC (Unreliable Connection) A QP transport service type based on a connection-oriented
protocol, where a QP (Queue Pair) is associated with exactly
one other QP. The QPs do not execute a reliable protocol, and
messages can be lost.
UD (Unreliable Datagram) A QP transport service type in which messages can be one
packet long and every UD QP can send/receive messages to/from
any other UD QP in the subnet.
Messages can be lost, and ordering is not guaranteed. UD is the
only QP type that supports multicast messages. The message
size of a UD packet is limited to the Path MTU.
Verbs An abstract description of the functionality of a network
adapter.
Using the verbs, any application can create and manage the
objects needed in order to use RDMA for data transfer.
VPI (Virtual Protocol Interface) Allows the user to change the layer 2 protocol of the port.
WQ (Work Queue) One of Send Queue or Receive Queue.
WQE (Work Queue Element) A WQE, pronounced “wookie”, is an element in a work queue.
WR (Work Request) A request which was posted by a user to a work queue.
2 RDMA Architecture Overview
The chapter contains the following sections:
• InfiniBand
• Virtual Protocol Interconnect® (VPI)
• RDMA over Converged Ethernet (RoCE)
• Comparison of RDMA Technologies
• Key Components
• Support for Existing Applications and ULPs
• References
2.1 InfiniBand
InfiniBand (IB) is a high-speed, low latency, low CPU overhead, highly efficient and scalable server
and storage interconnect technology. One of the key capabilities of InfiniBand is its support for
native Remote Direct Memory Access (RDMA). InfiniBand enables data transfer between servers and
between server and storage without the involvement of the host CPU in the data path. InfiniBand
uses I/O channels for data communication (up to 16 million per host), where each channel provides
the semantics of a virtualized NIC or HCA (security, isolations etc). InfiniBand provides various
technology or solution speeds ranging from 10Gb/s (SDR) up to 56Gb/s (FDR) per port, using copper
and optical fiber connections. InfiniBand efficiency and scalability have made it the optimal
performance and cost/performance interconnect solution for the world's leading high-performance
computing, cloud, Web 2.0, storage, database and financial data centers and applications.
InfiniBand is a standard technology, defined and specified by the IBTA organization.
Similarly, a VPI switch can have InfiniBand-only ports, Ethernet-only ports, or a mix of both
InfiniBand and Ethernet ports working at the same time.
Mellanox-based VPI adapters and switches support both the InfiniBand RDMA and the Ethernet RoCE
solutions.
and takes advantage of Priority Flow Control in Data Center Bridging Ethernet for lossless
connectivity. RoCE has been fully supported by the Open Fabrics Software since the release of OFED
1.5.1.
The most important point is that every application has direct access to the virtual memory of
devices in the fabric. This means that applications do not need to make requests to an operating
system to transfer messages. Contrast this with the traditional network environment where the
shared network resources are owned by the operating system and cannot be accessed by a user
application. Thus, an application must rely on the involvement of the operating system to move
data from the application's virtual buffer space, through the network stack and out onto the wire.
Similarly, at the other end, an application must rely on the operating system to retrieve the data on
the wire on its behalf and place it in its virtual buffer space.
once the entire message has been received. Thus neither the sending nor the receiving application
is involved until the entire message is delivered into the receiving application's buffer.
2.5.4 Switches
IB switches are conceptually similar to standard networking switches but are designed to meet IB
performance requirements. They implement IB Link Layer flow control to prevent packet drops, and
support congestion avoidance, adaptive routing capabilities, and advanced Quality of Service. Many
switches include a Subnet Manager. At least one Subnet Manager is required to configure an IB
fabric.
2.7 References
• IBTA Intro to IB for End Users: http://members.infinibandta.org/kwspub/Intro_to_IB_for_End_Users.pdf
• Mellanox InfiniBandFAQ_FQ_100.pdf: http://www.mellanox.com/pdf/whitepapers/InfiniBandFAQ_FQ_100.pdf
• Mellanox WP_2007_IB_Software_and_Protocols.pdf: http://www.mellanox.com/pdf/whitepapers/WP_2007_IB_Software_and_Protocols.pdf
• Mellanox driver software stacks and firmware are available for download from Mellanox
Technologies' Web pages: http://www.mellanox.com
3 RDMA-Aware Programming Overview
The VPI architecture permits direct user mode access to the hardware. Mellanox provides a
dynamically loaded library, creating access to the hardware via the verbs API. This document
contains verbs and their related inputs, outputs, descriptions, and functionality as exposed through
the operating system programming interface.
This programming manual and its verbs are valid only for user space. See header files for
the kernel space verbs.
Programming with verbs allows you to customize and optimize the RDMA-aware network. Such
customization and optimization should be done only by programmers with advanced knowledge and
experience of VPI systems.
In order to perform RDMA operations, a connection to the remote host, as well as appropriate
permissions, must first be established. The mechanism for accomplishing this is the Queue Pair
(QP). For those familiar with a standard IP stack, a QP is roughly equivalent to a socket. The QP
needs to be initialized on both sides of the connection. A Communication Manager (CM) can be used
to exchange information about the QP prior to actual QP setup.
Once a QP is established, the verbs API can be used to perform RDMA reads, RDMA writes, and
atomic operations. Serialized send/receive operations, which are similar to socket reads/writes, can
be performed as well.
3.1.2 Receive
This is the corresponding operation to a send operation. The receiving host is notified that a data
buffer has been received, possibly with an inline immediate value. The receiving application is
responsible for receive buffer maintenance and posting.
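As an illustration of posting a receive buffer, the sketch below uses ibv_post_recv from libibverbs. It assumes an already created QP (qp), a registered MR (mr), and a buffer lying inside that MR; the helper name post_one_receive is illustrative, and real code requires RDMA hardware.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post a single receive buffer; buf must lie within the registered MR. */
static int post_one_receive(struct ibv_qp *qp, struct ibv_mr *mr,
                            void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,        /* local key from ibv_reg_mr */
    };
    struct ibv_recv_wr wr = {
        .wr_id   = (uintptr_t)buf, /* user cookie returned in the CQE */
        .sg_list = &sge,
        .num_sge = 1,
    };
    struct ibv_recv_wr *bad_wr = NULL;
    return ibv_post_recv(qp, &wr, &bad_wr); /* 0 on success */
}
```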
3.1.4 RDMA Write / RDMA Write With Immediate
Similar to RDMA read, but the data is written to the remote host. RDMA write operations are
performed with no notification to the remote host. RDMA write with immediate operations,
however, do notify the remote host of the immediate value.
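An RDMA write with immediate can be posted as sketched below, assuming the remote address and rkey were exchanged out of band (for example over a TCP socket or via RDMA_CM). The helper name and parameter names are illustrative only.

```c
#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>

/* Post an RDMA Write with immediate; remote_addr/rkey come from the peer. */
static int post_rdma_write_imm(struct ibv_qp *qp, struct ibv_mr *mr,
                               void *local_buf, uint32_t len,
                               uint64_t remote_addr, uint32_t rkey,
                               uint32_t imm)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)local_buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_RDMA_WRITE_WITH_IMM;
    wr.send_flags = IBV_SEND_SIGNALED;     /* request a CQE */
    wr.imm_data   = htonl(imm);            /* delivered to the peer's RR */
    wr.wr.rdma.remote_addr = remote_addr;  /* peer buffer address */
    wr.wr.rdma.rkey        = rkey;         /* peer's rkey from ibv_reg_mr */
    return ibv_post_send(qp, &wr, &bad_wr);
}
```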
Operation                              UD    UC    RC    RD
Send (with immediate)                  +     +     +     +
Receive                                +     +     +     +
RDMA Write (with immediate)                  +     +     +
RDMA Read                                          +     +
Atomic: Fetch and Add / Cmp and Swap               +     +
Max message size                       MTU   1GB   1GB   1GB
3.3 Key Concepts
3.3.1 Send Request (SR)
An SR defines how much data will be sent, from where, how, and (with RDMA) to where. struct
ibv_send_wr is used to implement SRs.
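A plain send can be built by filling a struct ibv_send_wr as sketched below. This is a minimal illustration, assuming a QP and a registered MR already exist; the helper name post_send is hypothetical.

```c
#include <stdint.h>
#include <string.h>
#include <infiniband/verbs.h>

/* Build and post a simple Send Request: one SGE, signaled. */
static int post_send(struct ibv_qp *qp, struct ibv_mr *mr,
                     void *buf, uint32_t len)
{
    struct ibv_sge sge = {
        .addr = (uintptr_t)buf, .length = len, .lkey = mr->lkey,
    };
    struct ibv_send_wr wr, *bad_wr = NULL;
    memset(&wr, 0, sizeof(wr));
    wr.wr_id      = (uintptr_t)buf;   /* returned in the completion */
    wr.sg_list    = &sge;
    wr.num_sge    = 1;
    wr.opcode     = IBV_WR_SEND;      /* consumed by a peer's RR */
    wr.send_flags = IBV_SEND_SIGNALED;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```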
• wants to grant different remote access rights to different remote agents and/or grant those
rights over different ranges within a registered Region.
The operation of associating an MW with an MR is called binding. Different MWs can overlap the
same MR (even with different access permissions).
• L_key: the local key of the MR that was registered to this buffer. struct ibv_sge implements
scatter gather elements.
3.3.11 Polling
Polling the CQ for a completion retrieves the details about a WR (send or receive) that was posted.
If a WR completes with bad status, the rest of the completions will all be bad (and the Work Queue
will be moved to the error state). Every WR for which a completion has not been polled is still
outstanding. Only after a WR has a completion may its send/receive buffer be used, reused, or
freed. The completion status should always be checked. When a CQE is polled, it is removed from
the CQ. Polling is accomplished with the ibv_poll_cq operation.
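The polling rules above can be sketched as a small busy-poll loop. The helper name wait_one_completion is illustrative; a real application would usually poll in batches or use a completion channel instead of spinning.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Busy-poll the CQ until one completion arrives; 0 if it succeeded. */
static int wait_one_completion(struct ibv_cq *cq)
{
    struct ibv_wc wc;
    int n;
    do {
        n = ibv_poll_cq(cq, 1, &wc);   /* non-blocking; 0 = CQ empty */
    } while (n == 0);
    if (n < 0) {
        fprintf(stderr, "ibv_poll_cq failed\n");
        return -1;
    }
    if (wc.status != IBV_WC_SUCCESS) { /* always check the status */
        fprintf(stderr, "bad completion: %s (wr_id %llu)\n",
                ibv_wc_status_str(wc.status),
                (unsigned long long)wc.wr_id);
        return -1;
    }
    return 0;                          /* buffer may now be reused/freed */
}
```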
The structure of a typical application is as follows. The functions in the programming example that
implement each step are indicated in bold.
1. Get the device list;
First you must retrieve the list of available IB devices on the local host. Every device in this
list contains both a name and a GUID. For example the device names can be: mthca0,
mlx4_1.
Implemented in programming example by resources_create
2. Open the requested device;
Iterate over the device list, choose a device according to its GUID or name and open it.
Implemented in programming example by resources_create
3. Query the device capabilities;
The device capabilities allow the user to understand the supported features (APM, SRQ) and
capabilities of the opened device.
Implemented in programming example by resources_create
4. Allocate a Protection Domain to contain your resources;
A Protection Domain (PD) allows the user to restrict which components can interact with
each other. These components can be AH, QP, MR, MW, and SRQ.
Implemented in programming example by resources_create
5. Register a memory region;
VPI only works with registered memory. Any memory buffer which is valid in the process's
virtual space can be registered. During the registration process the user sets memory
permissions and receives local and remote keys (lkey/rkey) which will later be used to refer
to this memory buffer.
Implemented in programming example by resources_create
6. Create a Completion Queue (CQ);
A CQ contains completed work requests (WR). Each WR will generate a completion queue
entry (CQE) that is placed on the CQ. The CQE will specify if the WR was completed
successfully or not.
Implemented in programming example by resources_create
7. Create a Queue Pair (QP);
Creating a QP will also create an associated send queue and receive queue. Implemented in
programming example by resources_create
8. Bring up a QP;
A created QP cannot be used until it is transitioned through several states, eventually
reaching Ready To Send (RTS). This transition provides the information needed by the QP to
be able to send and receive data.
Implemented in programming example by connect_qp, modify_qp_to_init, post_receive,
modify_qp_to_rtr, and modify_qp_to_rts.
9. Post work requests and poll for completion;
Use the created QP for communication operations.
Implemented in programming example by post_send and poll_completion.
10. Cleanup;
Destroy objects in the reverse order you created them:
Delete QP
Delete CQ
Deregister MR
Deallocate PD
Close device
Implemented in programming example by resources_destroy.
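The steps above can be condensed into a sketch of resource setup and teardown. This is an illustrative minimal program, not the manual's programming example: QP bring-up (step 8) and data transfer (step 9) are omitted, error handling is minimal, and it requires libibverbs and an RDMA device to run.

```c
#include <stdio.h>
#include <stdlib.h>
#include <infiniband/verbs.h>

#define BUF_SIZE 4096

int main(void)
{
    /* Steps 1-2: get the device list and open the first device. */
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list || num == 0) { fprintf(stderr, "no RDMA devices\n"); return 1; }
    struct ibv_context *ctx = ibv_open_device(list[0]);
    ibv_free_device_list(list);
    if (!ctx) return 1;

    /* Step 3: query the device capabilities. */
    struct ibv_device_attr dev_attr;
    if (ibv_query_device(ctx, &dev_attr)) return 1;

    /* Step 4: allocate a Protection Domain. */
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* Step 5: register a memory region. */
    void *buf = calloc(1, BUF_SIZE);
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, BUF_SIZE,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);

    /* Step 6: create a Completion Queue. */
    struct ibv_cq *cq = ibv_create_cq(ctx, 16, NULL, NULL, 0);

    /* Step 7: create a Queue Pair (starts in the RESET state). */
    struct ibv_qp_init_attr qp_attr = {
        .send_cq = cq, .recv_cq = cq,
        .cap = { .max_send_wr = 16, .max_recv_wr = 16,
                 .max_send_sge = 1, .max_recv_sge = 1 },
        .qp_type = IBV_QPT_RC,
    };
    struct ibv_qp *qp = ibv_create_qp(pd, &qp_attr);
    if (!pd || !mr || !cq || !qp)
        fprintf(stderr, "resource creation failed\n");

    /* Steps 8-9 (ibv_modify_qp bring-up, posting WRs, polling) omitted. */

    /* Step 10: clean up in reverse order of creation. */
    if (qp) ibv_destroy_qp(qp);
    if (cq) ibv_destroy_cq(cq);
    if (mr) ibv_dereg_mr(mr);
    free(buf);
    if (pd) ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    return 0;
}
```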
4 VPI Verbs API
This chapter describes the details of the VPI verbs API.
4.1 Initialization
4.1.1 ibv_fork_init
Template: int ibv_fork_init(void)
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_fork_init initializes libibverbs' data structures to handle the fork() function safely
and avoid data corruption, whether fork() is called explicitly or implicitly such as in system() calls.
It is not necessary to call ibv_fork_init if all parent process threads are always blocked until all child
processes end or change address space via an exec() operation.
This function works on Linux kernels supporting the MADV_DONTFORK flag for madvise() (2.6.17 and
higher).
Setting the environment variable RDMAV_FORK_SAFE or IBV_FORK_SAFE to any value has the same
effect as calling ibv_fork_init().
Setting the environment variable RDMAV_HUGEPAGES_SAFE to any value tells the library to check
the underlying page size used by the kernel for memory regions. This is required if an application
uses huge pages either directly or indirectly via a library such as libhugetlbfs.
Calling ibv_fork_init() will reduce performance due to an extra system call for every memory
registration, and the additional memory allocated to track memory regions. The precise
performance impact depends on the workload and usually will not be significant.
Setting RDMAV_HUGEPAGES_SAFE adds further overhead to all memory registrations.
4.2.1 ibv_get_device_list
Template: struct ibv_device **ibv_get_device_list(int *num_devices)
Output Parameters: num_devices (optional) If non-null, the number of devices returned in the array
will be stored here
Description: ibv_get_device_list returns a list of VPI devices available on the system. Each entry on
the list is a pointer to a struct ibv_device.
struct ibv_device is defined as:
struct ibv_device
{
struct ibv_device_ops ops;
enum ibv_node_type node_type;
enum ibv_transport_type transport_type;
char name[IBV_SYSFS_NAME_MAX];
char dev_name[IBV_SYSFS_NAME_MAX];
char dev_path[IBV_SYSFS_PATH_MAX];
char ibdev_path[IBV_SYSFS_PATH_MAX];
};
ops pointers to alloc and free functions
node_type IBV_NODE_UNKNOWN
IBV_NODE_CA
IBV_NODE_SWITCH
IBV_NODE_ROUTER
IBV_NODE_RNIC
transport_type IBV_TRANSPORT_UNKNOWN
IBV_TRANSPORT_IB
IBV_TRANSPORT_IWARP
name kernel device name, e.g. “mthca0”
dev_name uverbs device name, e.g. “uverbs0”
dev_path path to infiniband_verbs class device in sysfs
ibdev_path path to infiniband class device in sysfs
The list of ibv_device structs shall remain valid until the list is freed. After calling
ibv_get_device_list, the user should open any desired devices and promptly free the list via the
ibv_free_device_list command.
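A typical use of these calls, sketched below, enumerates the devices and frees the list promptly. This is an illustrative standalone program; it requires libibverbs and at least one RDMA device to print anything.

```c
#include <stdio.h>
#include <endian.h>              /* be64toh (glibc) */
#include <infiniband/verbs.h>

int main(void)
{
    int num = 0;
    struct ibv_device **list = ibv_get_device_list(&num);
    if (!list) {
        perror("ibv_get_device_list");
        return 1;
    }
    for (int i = 0; i < num; ++i)
        /* GUID is returned in network byte order; convert for display. */
        printf("%-16s GUID 0x%016llx\n",
               ibv_get_device_name(list[i]),
               (unsigned long long)be64toh(ibv_get_device_guid(list[i])));
    ibv_free_device_list(list);  /* free promptly after opening devices */
    return 0;
}
```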
4.2.2 ibv_free_device_list
Template: void ibv_free_device_list(struct ibv_device **list)
4.2.3 ibv_get_device_name
Template: const char *ibv_get_device_name(struct ibv_device *device)
Description: ibv_get_device_name returns a pointer to the device name contained within the
ibv_device struct.
4.2.4 ibv_get_device_guid
Template: uint64_t ibv_get_device_guid(struct ibv_device *device)
Output Parameters: none
Description: ibv_get_device_guid returns the device's 64-bit Global Unique Identifier (GUID) in
network byte order.
4.2.5 ibv_open_device
Template: struct ibv_context *ibv_open_device(struct ibv_device *device)
Return Value: A verbs context that can be used for future operations on the device or NULL on
failure.
Description: ibv_open_device provides the user with a verbs context which is the object that will be
used for all other verb operations.
4.2.6 ibv_close_device
Template: int ibv_close_device(struct ibv_context *context)
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_close_device closes the verbs context previously opened with ibv_open_device. This
operation does not free any other objects associated with the context. To avoid memory leaks, all
other objects must be independently freed prior to calling this command.
4.2.7 ibv_node_type_str
Template: const char *ibv_node_type_str(enum ibv_node_type node_type)
Return Value: A constant string which describes the enum value node_type
Description: ibv_node_type_str returns a string describing the node type enum value, node_type.
This value can be an InfiniBand HCA, switch, router, an RDMA-enabled NIC, or unknown.
enum ibv_node_type {
IBV_NODE_UNKNOWN = -1,
IBV_NODE_CA = 1,
IBV_NODE_SWITCH,
IBV_NODE_ROUTER,
IBV_NODE_RNIC
};
4.2.8 ibv_port_state_str
Template: const char *ibv_port_state_str(enum ibv_port_state port_state)
Return Value: A constant string which describes the enum value port_state
Description: ibv_port_state_str returns a string describing the port state enum value, port_state.
enum ibv_port_state {
IBV_PORT_NOP = 0,
IBV_PORT_DOWN = 1,
IBV_PORT_INIT = 2,
IBV_PORT_ARMED = 3,
IBV_PORT_ACTIVE = 4,
IBV_PORT_ACTIVE_DEFER = 5
};
4.3.1 ibv_query_device
Template: int ibv_query_device(struct ibv_context *context, struct ibv_device_attr *device_attr)
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_query_device retrieves the various attributes associated with a device. The user
should malloc a struct ibv_device_attr, pass it to the command, and it will be filled in upon
successful return. The user is responsible for freeing this struct.
struct ibv_device_attr
{
char fw_ver[64];
uint64_t node_guid;
uint64_t sys_image_guid;
uint64_t max_mr_size;
uint64_t page_size_cap;
uint32_t vendor_id;
uint32_t vendor_part_id;
uint32_t hw_ver;
int max_qp;
int max_qp_wr;
int device_cap_flags;
int max_sge;
int max_sge_rd;
int max_cq;
int max_cqe;
int max_mr;
int max_pd;
int max_qp_rd_atom;
int max_ee_rd_atom;
int max_res_rd_atom;
int max_qp_init_rd_atom;
int max_ee_init_rd_atom;
int max_rdd;
int max_mw;
int max_raw_ipv6_qp;
int max_raw_ethy_qp;
int max_mcast_grp;
int max_mcast_qp_attach;
int max_total_mcast_qp_attach;
int max_ah;
int max_fmr;
int max_map_per_fmr;
int max_srq;
int max_srq_wr;
int max_srq_sge;
uint16_t max_pkeys;
uint8_t local_ca_ack_delay;
uint8_t phys_port_cnt;
};
device_cap_flags IBV_DEVICE_RESIZE_MAX_WR
IBV_DEVICE_BAD_PKEY_CNTR
IBV_DEVICE_BAD_QKEY_CNTR
IBV_DEVICE_RAW_MULTI
IBV_DEVICE_AUTO_PATH_MIG
IBV_DEVICE_CHANGE_PHY_PORT
IBV_DEVICE_UD_AV_PORT_ENFORCE
IBV_DEVICE_CURR_QP_STATE_MOD
IBV_DEVICE_SHUTDOWN_PORT
IBV_DEVICE_INIT_TYPE
IBV_DEVICE_PORT_ACTIVE_EVENT
IBV_DEVICE_SYS_IMAGE_GUID
IBV_DEVICE_RC_RNR_NAK_GEN
IBV_DEVICE_SRQ_RESIZE
IBV_DEVICE_N_NOTIFY_CQ
IBV_DEVICE_XRC
max_ee_rd_atom Maximum outstanding RDMA read and atomic operations per End to End (EE) context (RD
connections)
max_res_rd_atom Maximum resources used for incoming RDMA read and atomic operations
max_qp_init_rd_atom Maximum RDMA read and atomic operations that may be initiated per QP
max_ee_init_rd_atom Maximum RDMA read and atomic operations that may be initiated per EE
max_total_mcast_qp_attach Maximum total number of QPs that can be attached to multicast groups
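A small helper, sketched below, shows how the queried attributes are typically consumed; the fields printed are from the struct above, and the helper name print_device_caps is illustrative.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Print a few capability limits of an opened device. */
static void print_device_caps(struct ibv_context *ctx)
{
    struct ibv_device_attr attr;
    if (ibv_query_device(ctx, &attr)) {
        perror("ibv_query_device");
        return;
    }
    printf("fw %s, max_qp %d, max_cqe %d, max_mr_size %llu\n",
           attr.fw_ver, attr.max_qp, attr.max_cqe,
           (unsigned long long)attr.max_mr_size);
    if (attr.device_cap_flags & IBV_DEVICE_SRQ_RESIZE)
        printf("device supports resizing SRQs\n");
}
```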
4.3.2 ibv_query_port
Template: int ibv_query_port(struct ibv_context *context, uint8_t port_num, struct
ibv_port_attr *port_attr)
Input Parameters:
context struct ibv_context from ibv_open_device
port_num physical port number (1 is first port)
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_query_port retrieves the various attributes associated with a port. The user should
allocate a struct ibv_port_attr, pass it to the command, and it will be filled in upon successful
return. The user is responsible for freeing this struct.
struct ibv_port_attr
{
enum ibv_port_state state;
enum ibv_mtu max_mtu;
enum ibv_mtu active_mtu;
int gid_tbl_len;
uint32_t port_cap_flags;
uint32_t max_msg_sz;
uint32_t bad_pkey_cntr;
uint32_t qkey_viol_cntr;
uint16_t pkey_tbl_len;
uint16_t lid;
uint16_t sm_lid;
uint8_t lmc;
uint8_t max_vl_num;
uint8_t sm_sl;
uint8_t subnet_timeout;
uint8_t init_type_reply;
uint8_t active_width;
uint8_t active_speed;
uint8_t phys_state;
};
state IBV_PORT_NOP
IBV_PORT_DOWN
IBV_PORT_INIT
IBV_PORT_ARMED
IBV_PORT_ACTIVE
IBV_PORT_ACTIVE_DEFER
max_mtu Maximum Transmission Unit (MTU) supported by port. Can be:
IBV_MTU_256
IBV_MTU_512
IBV_MTU_1024
IBV_MTU_2048
IBV_MTU_4096
active_mtu Actual MTU in use
gid_tbl_len Length of source global ID (GID) table
port_cap_flags Supported capabilities of this port. There are currently no enumerations/defines declared in
verbs.h
max_msg_sz Maximum message size
bad_pkey_cntr Bad P_Key counter
qkey_viol_cntr Q_Key violation counter
pkey_tbl_len Length of partition table
lid First local identifier (LID) assigned to this port
sm_lid LID of subnet manager (SM)
lmc LID Mask control (used when multiple LIDs are assigned to port)
max_vl_num Maximum virtual lanes (VL)
sm_sl SM service level (SL)
subnet_timeout Subnet propagation delay
init_type_reply Type of initialization performed by SM
active_width Currently active link width
active_speed Currently active link speed
phys_state Physical port state
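As an illustration, a common pattern is to verify that a port is active before building QPs on it. The sketch below assumes port 1 and an already opened context; the helper name is hypothetical.

```c
#include <stdio.h>
#include <infiniband/verbs.h>

/* Check that port 1 is up before building QPs on it. */
static int port_is_active(struct ibv_context *ctx)
{
    struct ibv_port_attr pa;
    if (ibv_query_port(ctx, 1, &pa))   /* port numbering starts at 1 */
        return 0;
    printf("state %s, LID 0x%x, active_mtu enum %d\n",
           ibv_port_state_str(pa.state), pa.lid, pa.active_mtu);
    return pa.state == IBV_PORT_ACTIVE;
}
```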
4.3.3 ibv_query_gid
Template: int ibv_query_gid(struct ibv_context *context, uint8_t port_num, int index, union ibv_gid
*gid)
Input Parameters:
Output Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_query_gid retrieves an entry in the port’s global identifier (GID) table. Each port is
assigned at least one GID by the subnet manager (SM). The GID is a valid IPv6 address composed of
the globally unique identifier (GUID) and a prefix assigned by the SM. GID[0] is unique and contains
the port's GUID.
The user should allocate a union ibv_gid, pass it to the command, and it will be filled in upon
successful return. The user is responsible for freeing this union.
union ibv_gid
{
uint8_t raw[16];
struct
{
uint64_t subnet_prefix;
uint64_t interface_id;
} global;
};
4.3.4 ibv_query_pkey
Template: int ibv_query_pkey(struct ibv_context *context, uint8_t port_num, int index, uint16_t
*pkey)
Input Parameters:
Output Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_query_pkey retrieves an entry in the port's partition key (pkey) table. Each port is
assigned at least one pkey by the subnet manager (SM). The pkey identifies a partition that the port
belongs to. A pkey is roughly analogous to a VLAN ID in Ethernet networking.
The user passes in a pointer to a uint16_t that will be filled in with the requested pkey. The user is
responsible for freeing this uint16_t.
4.3.5 ibv_alloc_pd
Template: struct ibv_pd *ibv_alloc_pd(struct ibv_context *context)
Return Value: Pointer to created protection domain or NULL on failure.
Description: ibv_alloc_pd creates a protection domain (PD). PDs limit which memory regions can be
accessed by which queue pairs (QP) providing a degree of protection from unauthorized access. The
user must create at least one PD to use VPI verbs.
4.3.6 ibv_dealloc_pd
Template: int ibv_dealloc_pd(struct ibv_pd *pd)
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_dealloc_pd frees a protection domain (PD). This command will fail if any other
objects are currently associated with the indicated PD.
4.3.7 ibv_create_cq
Template: struct ibv_cq *ibv_create_cq(struct ibv_context *context, int cqe, void *cq_context,
struct ibv_comp_channel *channel, int comp_vector)
Input Parameters:
context struct ibv_context from ibv_open_device
Description: ibv_create_cq creates a completion queue (CQ). A completion queue holds completion
queue entries (CQE). Each Queue Pair (QP) has an associated send and receive CQ. A single CQ can
be shared for sending and receiving as well as be shared across multiple QPs.
The parameter cqe defines the minimum size of the queue. The actual size of the queue may be
larger than the specified value.
The parameter cq_context is a user defined value. If specified during CQ creation, this value will be
returned as a parameter in ibv_get_cq_event when using a completion channel (CC).
The parameter channel is used to specify a CC. A CQ is merely a queue that does not have a built in
notification mechanism. When using a polling paradigm for CQ processing, a CC is unnecessary. The
user simply polls the CQ at regular intervals. If, however, you wish to use a pend paradigm,
a CC is required. The CC is the mechanism that allows the user to be notified that a new CQE
is on the CQ.
The parameter comp_vector is used to specify the completion vector used to signal completion
events. It must be >=0 and < context->num_comp_vectors.
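Tying these pieces together, the sketch below creates a CQ bound to a completion channel and arms it with ibv_req_notify_cq so the application can sleep in ibv_get_cq_event instead of busy-polling. The helper name and the depth parameter are illustrative.

```c
#include <infiniband/verbs.h>

/* Create a CQ tied to a completion channel and arm it for events. */
static struct ibv_cq *make_notified_cq(struct ibv_context *ctx, int depth)
{
    struct ibv_comp_channel *ch = ibv_create_comp_channel(ctx);
    if (!ch)
        return NULL;
    /* comp_vector 0 is always valid (must be < num_comp_vectors). */
    struct ibv_cq *cq = ibv_create_cq(ctx, depth, NULL, ch, 0);
    if (!cq) {
        ibv_destroy_comp_channel(ch);
        return NULL;
    }
    /* Arm the CQ: request an event when the next CQE arrives. */
    if (ibv_req_notify_cq(cq, 0)) {
        ibv_destroy_cq(cq);
        ibv_destroy_comp_channel(ch);
        return NULL;
    }
    return cq;
}
```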
4.3.8 ibv_resize_cq
Template: int ibv_resize_cq(struct ibv_cq *cq, int cqe)
Input Parameters:
cq CQ to resize
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_resize_cq resizes a completion queue (CQ). The parameter cqe must be at least the
number of outstanding entries on the queue. The actual size of the queue may be larger than the
specified value. The CQ may (or may not) contain completions while it is being resized; thus, it can
be resized during work with the CQ.
4.3.9 ibv_destroy_cq
Template: int ibv_destroy_cq(struct ibv_cq *cq)
Input Parameters:
cq CQ to destroy
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_destroy_cq frees a completion queue (CQ). This command will fail if there is any
queue pair (QP) that still has the specified CQ associated with it.
4.3.10 ibv_create_comp_channel
Template: struct ibv_comp_channel *ibv_create_comp_channel(struct ibv_context *context)
4.3.11 ibv_destroy_comp_channel
Template: int ibv_destroy_comp_channel(struct ibv_comp_channel *channel)
Output Parameters: none
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_destroy_comp_channel frees a completion channel. This command will fail if there
are any completion queues (CQ) still associated with this completion channel.
4.4.1 ibv_reg_mr
Template: struct ibv_mr *ibv_reg_mr(struct ibv_pd *pd, void *addr, size_t length, enum
ibv_access_flags access)
Input Parameters:
Description:
ibv_reg_mr registers a memory region (MR), associates it with a protection domain (PD), and assigns
it local and remote keys (lkey, rkey). All VPI commands that use memory require the memory to be
registered via this command. The same physical memory may be mapped to different MRs, even
allowing different permissions or PDs to be assigned to the same memory, depending on user
requirements.
Access flags may be the bitwise OR of one or more of the following enumerations:
If IBV_ACCESS_REMOTE_WRITE or IBV_ACCESS_REMOTE_ATOMIC is set, then
IBV_ACCESS_LOCAL_WRITE must be set as well.
struct ibv_mr
{
struct ibv_context *context;
struct ibv_pd *pd;
void *addr;
size_t length;
uint32_t handle;
uint32_t lkey;
uint32_t rkey;
};
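A registration sketch is shown below, including the rule that remote write access requires local write access. The helper name register_buffer is illustrative; the returned mr->addr must be freed by the caller after ibv_dereg_mr.

```c
#include <stdlib.h>
#include <infiniband/verbs.h>

/* Register a buffer for local use plus incoming RDMA writes.
 * IBV_ACCESS_REMOTE_WRITE requires IBV_ACCESS_LOCAL_WRITE as well. */
static struct ibv_mr *register_buffer(struct ibv_pd *pd, size_t len)
{
    void *buf = calloc(1, len);
    if (!buf)
        return NULL;
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE);
    if (!mr) {
        free(buf);
        return NULL;
    }
    /* mr->lkey goes into local SGEs; mr->rkey is sent to the peer. */
    return mr;
}
```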
4.4.2 ibv_dereg_mr
Template: int ibv_dereg_mr(struct ibv_mr *mr)
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_dereg_mr frees a memory region (MR). The operation will fail if any memory
windows (MW) are still bound to the MR.
4.4.3 ibv_create_qp
Template: struct ibv_qp *ibv_create_qp(struct ibv_pd *pd, struct ibv_qp_init_attr *qp_init_attr)
Input Parameters:
pd struct ibv_pd from ibv_alloc_pd
qp_init_attr initial attributes of queue pair
Description: ibv_create_qp creates a QP. When a QP is created, it is put into the RESET state.
struct ibv_qp_init_attr
{
void *qp_context;
struct ibv_cq *send_cq;
struct ibv_cq *recv_cq;
struct ibv_srq *srq;
struct ibv_qp_cap cap;
enum ibv_qp_type qp_type;
int sq_sig_all;
struct ibv_xrc_domain *xrc_domain;
};
struct ibv_qp_cap
{
uint32_t max_send_wr;
uint32_t max_recv_wr;
uint32_t max_send_sge;
uint32_t max_recv_sge;
uint32_t max_inline_data;
};
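Filling these structs for an RC QP can be sketched as follows. The helper name and the queue depths are illustrative; the new QP starts in the RESET state and must still be moved through INIT/RTR/RTS with ibv_modify_qp before use.

```c
#include <string.h>
#include <infiniband/verbs.h>

/* Create an RC QP whose send and receive queues share one CQ. */
static struct ibv_qp *make_rc_qp(struct ibv_pd *pd, struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.send_cq = cq;          /* send and receive may share a CQ */
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_RC;  /* reliable connection */
    attr.sq_sig_all = 1;        /* every SR generates a CQE */
    attr.cap.max_send_wr  = 16;
    attr.cap.max_recv_wr  = 16;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;
    return ibv_create_qp(pd, &attr); /* NULL on failure */
}
```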
4.4.4 ibv_destroy_qp
Template: int ibv_destroy_qp(struct ibv_qp *qp)
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
4.4.5 ibv_create_srq
Template: struct ibv_srq *ibv_create_srq(struct ibv_pd *pd, struct ibv_srq_init_attr *srq_init_attr)
Input Parameters:
pd The protection domain associated with the shared receive queue (SRQ)
Description: ibv_create_srq creates a shared receive queue (SRQ). srq_attr->max_wr and srq_attr-
>max_sge are read to determine the requested size of the SRQ, and set to the actual values
allocated on return. If ibv_create_srq succeeds, then max_wr and max_sge will be at least as large
as the requested values.
struct ibv_srq {
struct ibv_context *context; struct ibv_context from ibv_open_device
void *srq_context;
struct ibv_pd *pd; Protection domain
uint32_t handle;
pthread_mutex_t mutex;
pthread_cond_t cond;
uint32_t events_completed;
};
struct ibv_srq_init_attr is defined as follows:
struct ibv_srq_init_attr
{
void *srq_context;
struct ibv_srq_attr attr;
};
struct ibv_srq_attr
{
uint32_t max_wr;
uint32_t max_sge;
uint32_t srq_limit;
};
4.4.6 ibv_modify_srq
Template: int ibv_modify_srq (struct ibv_srq *srq, struct ibv_srq_attr *srq_attr, int srq_attr_mask)
Input Parameters:
srq_attr Specifies the SRQ attribute values to modify (input); the current values of the
selected SRQ attributes are returned (output)
srq_attr_mask A bit-mask used to specify which SRQ attributes are being modified
Output Parameters: srq_attr The struct ibv_srq_attr is returned with the updated values
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_modify_srq modifies the attributes of the SRQ srq using the attribute values in
srq_attr based on the mask srq_attr_mask. srq_attr is an ibv_srq_attr struct as defined above under
the verb ibv_create_srq. The argument srq_attr_mask specifies the SRQ attributes to be modified. It
is either 0 or the bitwise OR of one or more of the flags:
If any of the attributes to be modified is invalid, none of the attributes will be modified. Also, not
all devices support resizing SRQs. To check if a device supports resizing, check if the
IBV_DEVICE_SRQ_RESIZE bit is set in the device capabilities flags.
Modifying the SRQ limit arms the SRQ to produce an IBV_EVENT_SRQ_LIMIT_REACHED 'low
watermark' async event once the number of WRs in the SRQ drops below the SRQ limit.
4.4.7 ibv_destroy_srq
Template: int ibv_destroy_srq(struct ibv_srq *srq)
Input Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_destroy_srq destroys the specified SRQ. It will fail if any queue pair is still
associated with this SRQ.
4.4.8 ibv_open_xrc_domain
Template: struct ibv_xrc_domain *ibv_open_xrc_domain(struct ibv_context *context, int fd, int
oflag)
Input Parameters:
Output Parameters: A file descriptor associated with the opened XRC domain
Description: ibv_open_xrc_domain opens an eXtended Reliable Connection (XRC) domain for the
RDMA device context. The desired file creation attributes oflag can either be 0 or the bitwise OR of
O_CREAT and O_EXCL. If a domain belonging to the device named by the context is already
associated with the inode, then the O_CREAT flag has no effect. If both O_CREAT and O_EXCL are set,
open will fail if a domain associated with the inode already exists. Otherwise a new XRC domain will
be created and associated with the inode specified by fd.
Please note that the check for the existence of the domain and creation of the domain if it does not
exist is atomic with respect to other processes executing open with fd naming the same inode.
If fd equals -1, then no inode is associated with the domain, and the only valid value for oflag is
O_CREAT.
Since each ibv_open_xrc_domain call increments the xrc_domain object's reference count, each
such call must have a corresponding ibv_close_xrc_domain call to decrement the xrc_domain
object's reference count.
4.4.9 ibv_create_xrc_srq
Template: struct ibv_srq *ibv_create_xrc_srq(struct ibv_pd *pd, struct ibv_xrc_domain *xrc_domain,
struct ibv_cq *xrc_cq, struct ibv_srq_init_attr *srq_init_attr)
Input Parameters:
Return Value: A pointer to the created SRQ or NULL on failure
Description: ibv_create_xrc_srq creates an XRC shared receive queue (SRQ) associated with the
protection domain pd, the XRC domain xrc_domain, and the CQ xrc_cq which will hold the
completions.
struct ibv_xrc_domain
{
struct ibv_context *context;	struct ibv_context from ibv_open_device
uint64_t handle;
};
4.4.10 ibv_close_xrc_domain
Template: int ibv_close_xrc_domain(struct ibv_xrc_domain *d)
Input Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_close_xrc_domain closes the XRC domain, d. If this happens to be the last
reference, then the XRC domain will be destroyed. This function decrements a reference count and
may fail if any QP or SRQ are still associated with the XRC domain being closed.
4.4.11 ibv_create_xrc_rcv_qp
Template: int ibv_create_xrc_rcv_qp(struct ibv_qp_init_attr *init_attr, uint32_t *xrc_rcv_qpn)
Input Parameters:
Output Parameters:
init_attr Populated with the XRC domain information the QP will be associated with
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_create_xrc_rcv_qp creates an XRC queue pair (QP) to serve as a receive side only
QP and returns the QP number through xrc_rcv_qpn. This number must be passed to the remote
(sender) node. The remote node will use xrc_rcv_qpn in ibv_post_send when it sends messages to an
XRC SRQ on this host in the same xrc domain as the XRC receive QP.
The QP with number xrc_rcv_qpn is created in kernel space and persists until the last process
registered for the QP calls ibv_unreg_xrc_rcv_qp, at which point the QP is destroyed. The process
which creates this QP is automatically registered for it and should also call ibv_unreg_xrc_rcv_qp
at some point to unregister.
Any process which wishes to receive on an XRC SRQ via this QP must call ibv_reg_xrc_rcv_qp for this
QP to ensure that the QP will not be destroyed while they are still using it.
Please note that because the QP xrc_rcv_qpn is a receive only QP, the send queue in the init_attr
struct is ignored.
4.4.12 ibv_modify_xrc_rcv_qp
Template: int ibv_modify_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num,
struct ibv_qp_attr *attr, int attr_mask)
Input Parameters:
xrc_domain	The XRC domain associated with this QP
xrc_qp_num	The queue pair number to identify this QP
attr	The attributes to use to modify the XRC receive QP
attr_mask	The mask to use for modifying the QP attributes
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_modify_xrc_rcv_qp modifies the attributes of an XRC receive QP with the number
xrc_qp_num which is associated with the attributes in the struct attr according to the mask
attr_mask. It then moves the QP through the following transitions: Reset->Init->RTR
At least the following masks must be set (the user may add optional attributes as needed).
Please note that if any attribute to modify is invalid, or if the mask has invalid values, then
none of the attributes will be modified, including the QP state.
4.4.13 ibv_reg_xrc_rcv_qp
Template: int ibv_reg_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num)
Input Parameters:
xrc_qp_num	The number associated with the created QP to which the user process is to be registered
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_reg_xrc_rcv_qp registers a user process with the XRC receive QP whose number is
xrc_qp_num associated with the XRC domain xrc_domain.
This function may fail if the number xrc_qp_num is not the number of a valid XRC receive QP (for
example if the QP is not allocated or it is the number of a non-XRC QP), or the XRC receive QP was
created with an XRC domain other than xrc_domain.
4.4.14 ibv_unreg_xrc_rcv_qp
Template: int ibv_unreg_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num)
Input Parameters:
xrc_domain The XRC domain associated with the XRC receive QP from which the user
wishes to unregister
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_unreg_xrc_rcv_qp unregisters a user process from the XRC receive QP number
xrc_qp_num which is associated with the XRC domain xrc_domain. When the number of user
processes registered with this XRC receive QP drops to zero, the QP is destroyed.
4.4.15 ibv_create_ah
Template: struct ibv_ah *ibv_create_ah(struct ibv_pd *pd, struct ibv_ah_attr *attr)
Input Parameters:
Description: ibv_create_ah creates an AH. An AH contains all of the necessary data to reach a
remote destination. In connected transport modes (RC, UC) the AH is associated with a queue pair
(QP). In the datagram transport modes (UD), the AH is associated with a work request (WR).
struct ibv_ah_attr
{
struct ibv_global_route grh;
uint16_t dlid;
uint8_t sl;
uint8_t src_path_bits;
uint8_t static_rate;
uint8_t is_global;
uint8_t port_num;
};
dlid destination lid
sl service level
src_path_bits source path bits
static_rate static rate
is_global this is a global address, use grh.
port_num physical port number to use to reach this destination
struct ibv_global_route is defined as follows:
struct ibv_global_route
{
union ibv_gid dgid;
uint32_t flow_label;
uint8_t sgid_index;
uint8_t hop_limit;
uint8_t traffic_class;
};
4.4.16 ibv_destroy_ah
Template: int ibv_destroy_ah(struct ibv_ah *ah)
Input Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_destroy_ah frees an address handle (AH). Once an AH is destroyed, it can no longer
be used in UD QPs.
4.5 QP States
4.5.1 ibv_modify_qp
Template: int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask
attr_mask)
Input Parameters:
attr QP attributes
attr_mask bit mask that defines which attributes within attr have been set for this
call
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_modify_qp changes the attributes of a QP, one of which may be the QP state. Its
name is a bit of a misnomer, since the verb cannot be used to modify QP attributes at will: there is
a strict set of attributes that may be modified during each transition, and transitions must occur
in the proper order. The following subsections describe each transition in more detail.
struct ibv_qp_attr
{
enum ibv_qp_state qp_state;
enum ibv_qp_state cur_qp_state;
enum ibv_mtu path_mtu;
enum ibv_mig_state path_mig_state;
uint32_t qkey;
uint32_t rq_psn;
uint32_t sq_psn;
uint32_t dest_qp_num;
int qp_access_flags;
struct ibv_qp_cap cap;
struct ibv_ah_attr ah_attr;
struct ibv_ah_attr alt_ah_attr;
uint16_t pkey_index;
uint16_t alt_pkey_index;
uint8_t en_sqd_async_notify;
uint8_t sq_draining;
uint8_t max_rd_atomic;
uint8_t max_dest_rd_atomic;
uint8_t min_rnr_timer;
uint8_t port_num;
uint8_t timeout;
uint8_t retry_cnt;
uint8_t rnr_retry;
uint8_t alt_port_num;
uint8_t alt_timeout;
};
The following values select one of the above attributes and should be OR’d into the attr_mask field:
IBV_QP_STATE
IBV_QP_CUR_STATE
IBV_QP_EN_SQD_ASYNC_NOTIFY
IBV_QP_ACCESS_FLAGS
IBV_QP_PKEY_INDEX
IBV_QP_PORT
IBV_QP_QKEY
IBV_QP_AV
IBV_QP_PATH_MTU
IBV_QP_TIMEOUT
IBV_QP_RETRY_CNT
IBV_QP_RNR_RETRY
IBV_QP_RQ_PSN
IBV_QP_MAX_QP_RD_ATOMIC
IBV_QP_ALT_PATH
IBV_QP_MIN_RNR_TIMER
IBV_QP_SQ_PSN
IBV_QP_MAX_DEST_RD_ATOMIC
IBV_QP_PATH_MIG_STATE
IBV_QP_CAP
IBV_QP_DEST_QPN
Required Attributes:
*** All QPs ***
qp_state / IBV_QP_STATE IBV_QPS_INIT
pkey_index / IBV_QP_PKEY_INDEX pkey index, normally 0
port_num / IBV_QP_PORT physical port number (1...n)
qp_access_flags /
IBV_QP_ACCESS_FLAGS access flags (see ibv_reg_mr)
Effect of transition: Once the QP is transitioned into the INIT state, the user may begin to post
receive buffers to the receive queue via the ibv_post_recv command. At least one receive buffer
should be posted before the QP can be transitioned to the RTR state.
Required Attributes:
Optional Attributes:
Effect of transition: Once the QP is transitioned into the RTR state, the QP begins receive
processing.
Required Attributes:
sq_psn / IBV_QP_SQ_PSN send queue starting packet sequence number (should match remote QP’s rq_psn)
max_rd_atomic / IBV_QP_MAX_QP_RD_ATOMIC number of outstanding RDMA reads and atomic operations allowed
Optional Attributes:
Effect of transition: Once the QP is transitioned into the RTS state, the QP begins send processing
and is fully operational. The user may now post send requests with the ibv_post_send command.
4.6.1 ibv_query_qp
Template: int ibv_query_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr, enum ibv_qp_attr_mask
attr_mask, struct ibv_qp_init_attr *init_attr)
Input Parameters:
Output Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_query_qp retrieves the various attributes of a queue pair (QP) as previously set
through ibv_create_qp and ibv_modify_qp. The user should allocate a struct ibv_qp_attr and a
struct ibv_qp_init_attr and pass them to the command. These structs will be filled in upon
successful return. The user is responsible to free these structs.
4.6.2 ibv_query_srq
Template: int ibv_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr)
Input Parameters:
srq The SRQ to query
Output Parameters:
srq_attr	The struct ibv_srq_attr is returned with the attributes of the specified SRQ
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_query_srq returns the attribute list and current values of the specified SRQ. It
returns the attributes through the pointer srq_attr, which is an ibv_srq_attr struct described above
under ibv_create_srq. If the value of srq_limit in srq_attr is 0, then the SRQ limit reached ('low
watermark') event is not, or is no longer, armed. No asynchronous events will be generated until the
event is re-armed.
4.6.3 ibv_query_xrc_rcv_qp
Template: int ibv_query_xrc_rcv_qp(struct ibv_xrc_domain *xrc_domain, uint32_t xrc_qp_num,
struct ibv_qp_attr *attr, int attr_mask, struct ibv_qp_init_attr *init_attr)
Input Parameters:
xrc_domain	The XRC domain associated with this QP
xrc_qp_num	The queue pair number to identify this QP
attr	The ibv_qp_attr struct in which to return the attributes
attr_mask	A mask specifying the minimum list of attributes to retrieve
init_attr	The ibv_qp_init_attr struct in which to return the initial attributes
Output Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_query_xrc_rcv_qp retrieves the attributes specified in attr_mask for the XRC
receive QP with the number xrc_qp_num and domain xrc_domain. It returns them through the
pointers attr and init_attr.
The attr_mask specifies a minimal list to retrieve. Some RDMA devices may return extra attributes
not requested. Attributes are valid if they have been set using ibv_modify_xrc_rcv_qp. The
exact list of valid attributes depends on the QP state. Multiple ibv_query_xrc_rcv_qp calls may yield
different returned values for these attributes: qp_state, path_mig_state, sq_draining, and ah_attr (if
automatic path migration (APM) is enabled).
4.6.4 ibv_post_recv
Template: int ibv_post_recv(struct ibv_qp *qp, struct ibv_recv_wr *wr, struct ibv_recv_wr **bad_wr)
Input Parameters:
Output Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_post_recv posts a linked list of WRs to a queue pair's (QP) receive queue. At least
one receive buffer should be posted to the receive queue to transition the QP to RTR. Receive
buffers are consumed as the remote peer executes Send, Send with Immediate and RDMA Write with
Immediate operations. Receive buffers are NOT used for other RDMA operations. Processing of the
WR list is stopped on the first error and a pointer to the offending WR is returned in bad_wr.
struct ibv_recv_wr
{
uint64_t wr_id;
struct ibv_recv_wr *next;
struct ibv_sge *sg_list;
int num_sge;
};
wr_id user assigned work request ID
next pointer to next WR, NULL if last one.
sg_list scatter array for this WR
num_sge number of entries in sg_list
struct ibv_sge is defined as follows:
struct ibv_sge
{
uint64_t addr;
uint32_t length;
uint32_t lkey;
};
4.6.5 ibv_post_send
Template: int ibv_post_send(struct ibv_qp *qp, struct ibv_send_wr *wr, struct ibv_send_wr
**bad_wr)
Input Parameters:
Output Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_post_send posts a linked list of WRs to a queue pair's (QP) send queue. This
operation is used to initiate all communication, including RDMA operations. Processing of the WR list
is stopped on the first error and a pointer to the offending WR is returned in bad_wr.
The user should not alter or destroy AHs associated with WRs until the request has been fully
executed and a completion queue entry (CQE) has been retrieved from the corresponding
completion queue (CQ) to avoid unexpected behaviour.
The buffers used by a WR can only be safely reused after the WR has been fully executed and a work
completion has been retrieved from the corresponding CQ. However, if the IBV_SEND_INLINE flag was
set, the buffer can be reused immediately after the call returns.
struct ibv_send_wr
{
uint64_t wr_id;
struct ibv_send_wr *next;
struct ibv_sge *sg_list;
int num_sge;
enum ibv_wr_opcode opcode;
enum ibv_send_flags send_flags;
uint32_t imm_data;/* network byte order */
union
{
struct
{
uint64_t remote_addr;
uint32_t rkey;
} rdma;
struct
{
uint64_t remote_addr;
uint64_t compare_add;
uint64_t swap;
uint32_t rkey;
} atomic;
struct
{
struct ibv_ah *ah;
uint32_t remote_qpn;
uint32_t remote_qkey;
} ud;
} wr;
uint32_t xrc_remote_srq_num;
};
wr_id user assigned work request ID
next pointer to next WR, NULL if last one.
sg_list scatter/gather array for this WR
num_sge number of entries in sg_list
opcode IBV_WR_RDMA_WRITE
IBV_WR_RDMA_WRITE_WITH_IMM
IBV_WR_SEND
IBV_WR_SEND_WITH_IMM
IBV_WR_RDMA_READ
IBV_WR_ATOMIC_CMP_AND_SWP
IBV_WR_ATOMIC_FETCH_AND_ADD
send_flags (optional) - this is a bitwise OR of the flags. See the details below.
imm_data immediate data to send in network byte order
remote_addr remote virtual address for RDMA/atomic operations
rkey remote key (from ibv_reg_mr on remote) for RDMA/atomic operations
compare_add compare value for compare and swap operation
swap swap value
ah address handle (AH) for datagram operations
remote_qpn remote QP number for datagram operations
remote_qkey Qkey for datagram operations
xrc_remote_srq_num shared receive queue (SRQ) number for the destination extended reliable connection (XRC). Only
used for XRC operations.
send flags:
IBV_SEND_FENCE set fence indicator
IBV_SEND_SIGNALED send completion event for this WR. Only meaningful for QPs that had the sq_sig_all set to 0
IBV_SEND_SOLICITED set solicited event indicator
IBV_SEND_INLINE send data in sge_list as inline data.
struct ibv_sge is defined in ibv_post_recv.
4.6.6 ibv_post_srq_recv
Template: int ibv_post_srq_recv(struct ibv_srq *srq, struct ibv_recv_wr *recv_wr, struct ibv_recv_wr
**bad_recv_wr)
Input Parameters:
Output Parameters:
bad_recv_wr	pointer to first rejected WR
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_post_srq_recv posts a list of work requests to the specified SRQ. It stops processing
the WRs from this list at the first failure (which can be detected immediately while requests are
being posted), and returns this failing WR through the bad_recv_wr parameter.
The buffers used by a WR can only be safely reused after the WR has been fully executed and a
work completion has been retrieved from the corresponding completion queue (CQ).
If a WR is being posted to a UD QP, the Global Routing Header (GRH) of the incoming message will be
placed in the first 40 bytes of the buffer(s) in the scatter list. If no GRH is present in the incoming
message, then the first 40 bytes will be undefined. This means that in all cases for UD QPs, the
actual data of the incoming message will start at an offset of 40 bytes into the buffer(s) in the
scatter list.
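The fixed 40-byte GRH offset means a UD receive buffer must be sized for it, and the payload located past it. A small self-contained sketch (for UD receives, byte_len in the completion counts the 40 GRH bytes as well):

```c
#include <stddef.h>
#include <stdint.h>

#define GRH_LEN 40  /* first 40 bytes of a UD receive buffer */

/* Sketch: locate the actual payload in a UD receive buffer,
 * given byte_len from the work completion. Returns NULL if the
 * completion is too short to contain a GRH area. */
static void *ud_payload(void *buf, uint32_t byte_len, uint32_t *payload_len)
{
    if (byte_len < GRH_LEN)
        return NULL;                    /* malformed completion */
    *payload_len = byte_len - GRH_LEN;  /* byte_len includes the GRH area */
    return (uint8_t *)buf + GRH_LEN;    /* data starts after the GRH */
}
```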
4.6.7 ibv_req_notify_cq
Template: int ibv_req_notify_cq(struct ibv_cq *cq, int solicited_only)
Input Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_req_notify_cq arms the notification mechanism for the indicated completion queue
(CQ). When a completion queue entry (CQE) is placed on the CQ, a completion event will be sent to
the completion channel (CC) associated with the CQ. If a CQE is already on the CQ when the
notification is armed, no event will be generated for that existing CQE. If the solicited_only flag is
set, then only CQEs for WRs that had the solicited flag set will trigger the notification.
The user should use the ibv_get_cq_event operation to receive the notification.
The notification mechanism will only be armed for one notification. Once a notification is sent, the
mechanism must be re-armed with a new call to ibv_req_notify_cq.
4.6.8 ibv_get_cq_event
Template: int ibv_get_cq_event(struct ibv_comp_channel *channel, struct ibv_cq **cq, void
**cq_context)
Input Parameters:
Output Parameters:
cq	pointer to completion queue (CQ) associated with event
cq_context	user supplied context set in ibv_create_cq
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description:
ibv_get_cq_event waits for a notification to be sent on the indicated completion channel (CC). Note
that this is a blocking operation. The user should allocate a pointer to a struct ibv_cq and a void
pointer to be passed into the function; they will be filled in with the appropriate values upon
return. It is the user's responsibility to free these pointers.
Each notification sent MUST be acknowledged with the ibv_ack_cq_events operation. Since the
ibv_destroy_cq operation waits for all events to be acknowledged, it will hang if any events are not
properly acknowledged.
Once a notification for a completion queue (CQ) is sent on a CC, that CQ is now "disarmed" and will
not send any more notifications to the CC until it is rearmed again with a new call to the
ibv_req_notify_cq operation.
This operation only informs the user that a CQ has completion queue entries (CQE) to be processed,
it does not actually process the CQEs. The user should use the ibv_poll_cq operation to process the
CQEs.
4.6.9 ibv_ack_cq_events
Template: void ibv_ack_cq_events(struct ibv_cq *cq, unsigned int nevents)
Input Parameters:
4.6.10 ibv_poll_cq
Template: int ibv_poll_cq(struct ibv_cq *cq, int num_entries, struct ibv_wc *wc)
Input Parameters:
Output Parameters:
wc CQE array
Description: ibv_poll_cq retrieves CQEs from a completion queue (CQ). The user should allocate an
array of struct ibv_wc and pass it to the call in wc. The number of entries available in wc should be
passed in num_entries. It is the user's responsibility to free this memory.
The number of CQEs actually retrieved is given as the return value. CQs must be polled regularly to
prevent an overrun. In the event of an overrun, the CQ will be shut down and an async event
IBV_EVENT_CQ_ERR will be sent.
struct ibv_wc
{
uint64_t wr_id;
enum ibv_wc_status status;
enum ibv_wc_opcode opcode;
uint32_t vendor_err;
uint32_t byte_len;
uint32_t imm_data;/* network byte order */
uint32_t qp_num;
uint32_t src_qp;
enum ibv_wc_flags wc_flags;
uint16_t pkey_index;
uint16_t slid;
uint8_t sl;
uint8_t dlid_path_bits;
};
wr_id user specified work request id as given in ibv_post_send or ibv_post_recv
status IBV_WC_SUCCESS
IBV_WC_LOC_LEN_ERR
IBV_WC_LOC_QP_OP_ERR
IBV_WC_LOC_EEC_OP_ERR
IBV_WC_LOC_PROT_ERR
IBV_WC_WR_FLUSH_ERR
IBV_WC_MW_BIND_ERR
IBV_WC_BAD_RESP_ERR
IBV_WC_LOC_ACCESS_ERR
IBV_WC_REM_INV_REQ_ERR
IBV_WC_REM_ACCESS_ERR
IBV_WC_REM_OP_ERR
IBV_WC_RETRY_EXC_ERR
IBV_WC_RNR_RETRY_EXC_ERR
IBV_WC_LOC_RDD_VIOL_ERR
IBV_WC_REM_INV_RD_REQ_ERR
IBV_WC_REM_ABORT_ERR
IBV_WC_INV_EECN_ERR
IBV_WC_INV_EEC_STATE_ERR
IBV_WC_FATAL_ERR
IBV_WC_RESP_TIMEOUT_ERR
IBV_WC_GENERAL_ERR
opcode IBV_WC_SEND,
IBV_WC_RDMA_WRITE,
IBV_WC_RDMA_READ,
IBV_WC_COMP_SWAP,
IBV_WC_FETCH_ADD,
IBV_WC_BIND_MW,
IBV_WC_RECV = 1 << 7,
IBV_WC_RECV_RDMA_WITH_IMM
vendor_err vendor specific error
byte_len number of bytes transferred
imm_data immediate data
qp_num local queue pair (QP) number
src_qp remote QP number
wc_flags see below
pkey_index index of pkey (valid only for GSI QPs)
slid source local identifier (LID)
sl service level (SL)
dlid_path_bits destination LID path bits
flags:
IBV_WC_GRH global route header (GRH) is present in UD packet
IBV_WC_WITH_IMM immediate data value is valid
4.6.11 ibv_init_ah_from_wc
Template: int ibv_init_ah_from_wc(struct ibv_context *context, uint8_t port_num, struct ibv_wc
*wc, struct ibv_grh *grh, struct ibv_ah_attr *ah_attr)
Input Parameters:
context	struct ibv_context from ibv_open_device. This should be the device the completion queue entry (CQE) was received on.
port_num	physical port number (1..n) that the CQE was received on
wc	received CQE from ibv_poll_cq
grh	global route header (GRH) from packet (see description)
Output Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
4.6.12 ibv_create_ah_from_wc
Template: struct ibv_ah *ibv_create_ah_from_wc(struct ibv_pd *pd, struct ibv_wc *wc, struct
ibv_grh *grh, uint8_t port_num)
Input Parameters:
4.6.13 ibv_attach_mcast
Template: int ibv_attach_mcast(struct ibv_qp *qp, const union ibv_gid *gid, uint16_t lid)
Input Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_attach_mcast attaches the specified QP, qp, to the multicast group whose multicast
group GID is gid, and multicast LID is lid.
Only QPs of Transport Service Type IBV_QPT_UD may be attached to multicast groups.
In order to receive multicast messages, a join request for the multicast group must be sent to the
subnet administrator (SA), so that the fabric's multicast routing is configured to deliver messages to
the local port.
If a QP is attached to the same multicast group multiple times, the QP will still receive a single copy
of a multicast message.
4.6.14 ibv_detach_mcast
Template: int ibv_detach_mcast(struct ibv_qp *qp, const union ibv_gid *gid, uint16_t lid)
Input Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_detach_mcast detaches the specified QP, qp, from the multicast group whose
multicast group GID is gid, and multicast LID is lid.
4.7.1 ibv_get_async_event
Template: int ibv_get_async_event(struct ibv_context *context, struct ibv_async_event *event)
Input Parameters:
Output Parameters:
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: ibv_get_async_event gets the next asynchronous event of the RDMA device context
'context' and returns it through the pointer 'event', which is an ibv_async_event struct. All async
events returned by ibv_get_async_event must eventually be acknowledged with
ibv_ack_async_event. ibv_get_async_event() is a blocking function. If multiple threads call this
function simultaneously, then when an async event occurs, only one thread will receive it, and it is
not possible to predict which thread will receive it.
struct ibv_async_event {
union {
struct ibv_cq *cq; The CQ that got the event
struct ibv_qp *qp; The QP that got the event
struct ibv_srq *srq; The SRQ that got the event
int port_num; The port number that got the event
} element;
enum ibv_event_type event_type; Type of event
};
One member of the element union will be valid, depending on the event_type member of the
structure. event_type will be one of the following events:
QP events:
IBV_EVENT_QP_FATAL Error occurred on a QP and it transitioned to error state
IBV_EVENT_QP_REQ_ERR Invalid Request Local Work Queue Error
IBV_EVENT_QP_ACCESS_ERR Local access violation error
IBV_EVENT_COMM_EST Communication was established on a QP
IBV_EVENT_SQ_DRAINED Send Queue was drained of outstanding messages in
progress
IBV_EVENT_PATH_MIG A connection has migrated to the alternate path
IBV_EVENT_PATH_MIG_ERR A connection failed to migrate to the alternate path
IBV_EVENT_QP_LAST_WQE_REACHED Last WQE Reached on a QP associated with an SRQ
CQ events:
IBV_EVENT_CQ_ERR CQ is in error (CQ overrun)
SRQ events:
IBV_EVENT_SRQ_ERR Error occurred on an SRQ
IBV_EVENT_SRQ_LIMIT_REACHED SRQ limit was reached
Port events:
IBV_EVENT_PORT_ACTIVE Link became active on a port
IBV_EVENT_PORT_ERR Link became unavailable on a port
IBV_EVENT_LID_CHANGE LID was changed on a port
IBV_EVENT_PKEY_CHANGE P_Key table was changed on a port
IBV_EVENT_SM_CHANGE SM was changed on a port
IBV_EVENT_CLIENT_REREGISTER SM sent a CLIENT_REREGISTER request to a port
IBV_EVENT_GID_CHANGE GID table was changed on a port
CA events:
IBV_EVENT_DEVICE_FATAL CA is in FATAL state
4.7.2 ibv_ack_async_event
Template: void ibv_ack_async_event(struct ibv_async_event *event)
Input Parameters:
Output Parameters: None
Description: All async events that ibv_get_async_event() returns must be acknowledged using
ibv_ack_async_event(). To avoid races, destroying an object (CQ, SRQ or QP) will wait for all
affiliated events for the object to be acknowledged; this avoids an application retrieving an
affiliated event after the corresponding object has already been destroyed.
4.7.3 ibv_event_type_str
Template: const char *ibv_event_type_str(enum ibv_event_type event_type)
Input Parameters:
Return Value: A constant string which describes the enum value event_type
Description: ibv_event_type_str returns a string describing the event type enum value, event_type.
event_type may be any one of the 19 different enum values describing different IB events.
enum ibv_event_type {
IBV_EVENT_CQ_ERR,
IBV_EVENT_QP_FATAL,
IBV_EVENT_QP_REQ_ERR,
IBV_EVENT_QP_ACCESS_ERR,
IBV_EVENT_COMM_EST,
IBV_EVENT_SQ_DRAINED,
IBV_EVENT_PATH_MIG,
IBV_EVENT_PATH_MIG_ERR,
IBV_EVENT_DEVICE_FATAL,
IBV_EVENT_PORT_ACTIVE,
IBV_EVENT_PORT_ERR,
IBV_EVENT_LID_CHANGE,
IBV_EVENT_PKEY_CHANGE,
IBV_EVENT_SM_CHANGE,
IBV_EVENT_SRQ_ERR,
IBV_EVENT_SRQ_LIMIT_REACHED,
IBV_EVENT_QP_LAST_WQE_REACHED,
IBV_EVENT_CLIENT_REREGISTER,
IBV_EVENT_GID_CHANGE,
};
4.8.1 ibv_exp_query_device
Template: int ibv_exp_query_device(struct ibv_context *context, struct ibv_exp_device_attr *attr)
Input Parameters:
context
Output Parameters:
attr
Return Value: returns 0 on success, or the value of errno on failure (which indicates the failure
reason).
Description: ibv_exp_query_device returns the attributes of the device with context context. The
argument attr is a pointer to an ibv_exp_device_attr struct, as defined in <infiniband/verbs_exp.h>.
struct ibv_exp_device_attr {
char fw_ver[64];
uint64_t node_guid;
uint64_t sys_image_guid;
uint64_t max_mr_size;
uint64_t page_size_cap;
uint32_t vendor_id;
uint32_t vendor_part_id;
uint32_t hw_ver;
int max_qp;
int max_qp_wr;
int reserved; /* place holder to align with ibv_device_attr */
int max_sge;
int max_sge_rd;
int max_cq;
int max_cqe;
int max_mr;
int max_pd;
int max_qp_rd_atom;
int max_ee_rd_atom;
int max_res_rd_atom;
int max_qp_init_rd_atom;
int max_ee_init_rd_atom;
enum ibv_exp_atomic_cap exp_atomic_cap;
int max_ee;
int max_rdd;
int max_mw;
int max_raw_ipv6_qp;
int max_raw_ethy_qp;
int max_mcast_grp;
int max_mcast_qp_attach;
int max_total_mcast_qp_attach;
int max_ah;
int max_fmr;
int max_map_per_fmr;
int max_srq;
int max_srq_wr;
int max_srq_sge;
uint16_t max_pkeys;
uint8_t local_ca_ack_delay;
uint8_t phys_port_cnt;
uint32_t comp_mask;
struct ibv_exp_device_calc_cap calc_cap;
uint64_t timestamp_mask;
uint64_t hca_core_clock;
uint64_t exp_device_cap_flags; /* use ibv_exp_device_cap_flags */
int max_dc_req_rd_atom;
int max_dc_res_rd_atom;
int inline_recv_sz;
uint32_t max_rss_tbl_sz;
struct ibv_exp_ext_atomics_params ext_atom;
uint32_t max_mkey_klm_list_size;
uint32_t max_send_wqe_inline_klms;
uint32_t max_umr_recursion_depth;
uint32_t max_umr_stride_dimension;
};
4.8.2 ibv_exp_create_qp
Template: struct ibv_qp *ibv_exp_create_qp(struct ibv_context *context, struct ibv_exp_qp_init_attr *qp_init_attr)
Input Parameters:
Output Parameters:
Return Value: Returns a pointer to the created QP, or NULL if the request fails. Check the QP number
(qp_num) in the returned QP.
Description: ibv_exp_create_qp creates a queue pair (QP) associated with the protection domain
pd. The argument qp_init_attr is an ibv_exp_qp_init_attr struct, as defined in <infiniband/
verbs_exp.h>.
struct ibv_exp_qp_init_attr {
void *qp_context;
struct ibv_cq *send_cq;
struct ibv_cq *recv_cq;
struct ibv_srq *srq;
struct ibv_qp_cap cap;
enum ibv_qp_type qp_type;
int sq_sig_all;
uint32_t max_inl_recv;
struct ibv_exp_qpg qpg;
uint32_t max_atomic_arg;
uint32_t max_inl_send_klms;
};
4.8.3 ibv_exp_post_send
Template: static inline int ibv_exp_post_send(struct ibv_qp *qp, struct ibv_exp_send_wr *wr, struct
ibv_exp_send_wr **bad_wr)
Input Parameters:
Output Parameters:
Return Value: returns 0 on success, or the value of errno on failure (which indicates the failure
reason).
Description: ibv_exp_post_send posts the linked list of work requests (WRs) starting with wr to the
send queue of the queue pair qp. It stops processing WRs from this list at the first failure (that can
be detected immediately while requests are being posted), and returns this failing WR through
bad_wr.
struct ibv_exp_send_wr {
uint64_t wr_id;
struct ibv_exp_send_wr *next;
struct ibv_sge *sg_list;
int num_sge;
enum ibv_exp_wr_opcode exp_opcode; /* use ibv_exp_wr_opcode */
int reserved; /* place holder to align with ibv_send_wr */
union {
uint32_t imm_data; /* in network byte order */
uint32_t invalidate_rkey;
} ex;
union {
struct {
uint64_t remote_addr;
uint32_t rkey;
} rdma;
struct {
uint64_t remote_addr;
uint64_t compare_add;
uint64_t swap;
uint32_t rkey;
} atomic;
struct {
struct ibv_ah *ah;
uint32_t remote_qpn;
uint32_t remote_qkey;
} ud;
} wr;
union {
union {
struct {
uint32_t remote_srqn;
} xrc;
} qp_type;
uint32_t xrc_remote_srq_num;
};
union {
struct {
uint64_t remote_addr;
uint32_t rkey;
} rdma;
struct {
uint64_t remote_addr;
uint64_t compare_add;
uint64_t swap;
uint32_t rkey;
} atomic;
struct {
struct ibv_cq *cq;
int32_t cq_count;
} cqe_wait;
struct {
struct ibv_qp *qp;
int32_t wqe_count;
} wqe_enable;
} task;
union {
struct {
enum ibv_exp_calc_op calc_op;
enum ibv_exp_calc_data_type data_type;
enum ibv_exp_calc_data_size data_size;
} calc;
} op;
struct {
struct ibv_ah *ah;
uint64_t dct_access_key;
uint32_t dct_number;
} dc;
struct {
struct ibv_mw *mw;
uint32_t rkey;
struct ibv_exp_mw_bind_info bind_info;
} bind_mw;
uint64_t exp_send_flags; /* use ibv_exp_send_flags */
uint32_t comp_mask; /* reserved for future growth (must be 0) */
union {
struct {
struct {
enum mem_layout_type mkey_type;
union {
struct ibv_exp_mem_region *mem_reg_list; /* array, size corresponds to wr->num_sge */
struct {
struct ibv_exp_mem_repeat_block *mem_repeat_block_list; /* array, size corresponds to wr->num_sge */
size_t *repeat_count; /* array size corresponds to ndim */
uint32_t ndim;
} rb;
} mem_list;
struct non_inline_data *memory_objects; /* used when IBV_EXP_SEND_INLINE is not set */
int access;
struct ibv_mr *modified_mr;
void *region_base_addr;
} memory_key;
} umr;
struct {
uint32_t log_arg_sz;
uint64_t remote_addr;
uint32_t rkey;
union {
struct {
/* For the next four fields:
* If operand_size <= 8, the inline data is taken directly
* from the corresponding field; for small operands, the
* least significant bits are used.
* Otherwise the fields are pointers in the process's address space
* where the arguments are stored.
*/
union {
struct ibv_exp_cmp_swap cmp_swap;
struct ibv_exp_fetch_add fetch_add;
} op;
} inline_data; /* IBV_EXP_SEND_EXT_ATOMIC_INLINE is set */
/* in the future add support for non-inline argument provisioning */
} wr_data;
} masked_atomics;
} ext_op;
};
For atomic operations, the way to support atomic responses in big-endian format (the only way to
use atomics on Connect-IB® on little-endian machines) is:
1. Use experimental verbs.
2. Check that the atomic capabilities flag IBV_EXP_ATOMIC_HCA_REPLY_BE is set in the
exp_atomic_cap field of the struct ibv_exp_device_attr returned by ibv_exp_query_device().
3. Set the flag IBV_EXP_QP_CREATE_ATOMIC_BE_REPLY when opening the QP. This is what
enables the use of atomic ops on Connect-IB.
4. Use the experimental post send verb.
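The four steps above can be outlined in code. The following is an illustrative sketch only, not taken from the manual: the experimental verbs are specific to MLNX_OFED, comp_mask handling is elided, and the exact attribute layout may differ between OFED versions.

```c
#include <string.h>
#include <infiniband/verbs_exp.h>

/* Sketch of the four steps above using the experimental verbs from
 * MLNX_OFED. Treat this as an outline, not a complete implementation. */
static struct ibv_qp *create_be_atomic_qp(struct ibv_context *ctx,
                                          struct ibv_exp_qp_init_attr *attr)
{
    struct ibv_exp_device_attr dev_attr;

    memset(&dev_attr, 0, sizeof dev_attr);
    if (ibv_exp_query_device(ctx, &dev_attr))              /* step 1 */
        return NULL;
    if (dev_attr.exp_atomic_cap != IBV_EXP_ATOMIC_HCA_REPLY_BE)
        return NULL;                                       /* step 2 */

    /* step 3: request big-endian atomic replies at QP creation */
    attr->exp_create_flags |= IBV_EXP_QP_CREATE_ATOMIC_BE_REPLY;
    return ibv_exp_create_qp(ctx, attr);
    /* step 4: post atomics on this QP with ibv_exp_post_send() */
}
```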
5 RDMA_CM API
This chapter describes the details of the RDMA_CM verbs API.
5.1.1 rdma_create_event_channel
Template:
struct rdma_event_channel * rdma_create_event_channel (void)
Input Parameters:
void no arguments
Output Parameters:
none
Return Value:
A pointer to the created event channel, or NULL if the request fails. On failure, errno will be set to
indicate the failure reason.
Description:
Opens an event channel used to report communication events. Asynchronous events are reported to
users through event channels.
Event channels are used to direct all events on an rdma_cm_id. For many clients, a single
event channel may be sufficient, however, when managing a large number of connections or
cm_ids, users may find it useful to direct events for different cm_ids to different channels
for processing.
All created event channels must be destroyed by calling rdma_destroy_event_channel.
Users should call rdma_get_cm_event to retrieve events on an event channel.
Each event channel is mapped to a file descriptor. The associated file descriptor can be
used and manipulated like any other fd to change its behavior. Users may make the fd non-
blocking, poll or select the fd, etc.
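A minimal sketch of the channel workflow described above: open a channel, make its file descriptor non-blocking, and clean up. This is an illustrative example, not from the manual, and assumes a Linux host with librdmacm installed (link with -lrdmacm).

```c
#include <stdio.h>
#include <fcntl.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_event_channel *ec = rdma_create_event_channel();
    if (!ec) {
        perror("rdma_create_event_channel");
        return 1;
    }

    /* The channel maps to a file descriptor; switch it to non-blocking
     * so rdma_get_cm_event() can be multiplexed with poll()/select(). */
    int flags = fcntl(ec->fd, F_GETFL);
    if (fcntl(ec->fd, F_SETFL, flags | O_NONBLOCK) < 0)
        perror("fcntl");

    /* ... poll ec->fd and call rdma_get_cm_event() when readable ... */

    rdma_destroy_event_channel(ec);
    return 0;
}
```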
5.1.2 rdma_destroy_event_channel
Template: void rdma_destroy_event_channel (struct rdma_event_channel *channel)
Input Parameters:
Output Parameters:
none
Return Value:
none
Description: Closes an event communication channel, releases all resources associated with the
event channel, and closes the associated file descriptor.
All rdma_cm_id's associated with the event channel must be destroyed, and all returned
events must be acked before calling this function.
5.2.1 rdma_create_id
Template:
int rdma_create_id(struct rdma_event_channel *channel, struct rdma_cm_id **id, void *context,
enum rdma_port_space ps)
Input Parameters:
channel The communication channel that events associated with the allocated rdma_cm_id will be
reported on.
id A reference where the allocated communication identifier will be returned.
context User specified context associated with the rdma_cm_id.
ps RDMA port space.
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
Creates an identifier that is used to track communication information.
Notes:
rdma_cm_ids are conceptually equivalent to a socket for RDMA communication. The difference is
that RDMA communication requires explicitly binding to a specified RDMA device before
communication can occur, and most operations are asynchronous in nature. Communication events
on an rdma_cm_id are reported through the associated event channel. Users must release the
rdma_cm_id by calling rdma_destroy_id.
PORT SPACES. Details of the services provided by the different port spaces are outlined below.
RDMA_PS_TCP Provides reliable, connection-oriented QP communication. Unlike TCP, the RDMA port
space provides message-based, not stream-based, communication.
RDMA_PS_UDP Provides unreliable, connectionless QP communication. Supports both datagram and
multicast communication.
See Also:
rdma_cm, rdma_create_event_channel, rdma_destroy_id, rdma_get_devices, rdma_bind_addr,
rdma_resolve_addr, rdma_connect, rdma_listen, rdma_set_option
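The create/destroy lifecycle of an rdma_cm_id can be sketched as follows; this is an illustrative example (not from the manual) and assumes librdmacm is installed.

```c
#include <stdio.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_event_channel *ec = rdma_create_event_channel();
    struct rdma_cm_id *id = NULL;

    if (!ec)
        return 1;

    /* The context argument is an opaque user pointer that is returned
     * with each communication event reported for this id. */
    if (rdma_create_id(ec, &id, NULL, RDMA_PS_TCP)) {
        perror("rdma_create_id");
        rdma_destroy_event_channel(ec);
        return 1;
    }

    /* ... bind, resolve, connect, etc. ... */

    rdma_destroy_id(id);
    rdma_destroy_event_channel(ec);
    return 0;
}
```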
5.2.2 rdma_destroy_id
Template:
int rdma_destroy_id (struct rdma_cm_id *id)
Input Parameters:
id The communication identifier to destroy.
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
Destroys the specified rdma_cm_id and cancels any outstanding asynchronous operation.
Notes:
Users must free any QP associated with the rdma_cm_id before calling this routine and ack all
related events.
See Also:
rdma_create_id, rdma_destroy_qp, rdma_ack_cm_event
5.2.3 rdma_migrate_id
Template:
int rdma_migrate_id(struct rdma_cm_id *id, struct rdma_event_channel *channel)
Input Parameters:
id An existing RDMA communication identifier to migrate
channel The new event channel for rdma_cm_id events
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_migrate_id migrates a communication identifier to a different event channel and moves any
pending events associated with the rdma_cm_id to the new channel.
No polling for events on the rdma_cm_id's current channel nor running of any routines on the
rdma_cm_id should be done while migrating between channels. rdma_migrate_id will block while
there are any unacknowledged events on the current event channel.
If the channel parameter is NULL, then the specified rdma_cm_id will be placed into synchronous
operation mode. All calls on the id will block until the operation completes.
5.2.4 rdma_set_option
Template:
int rdma_set_option(struct rdma_cm_id *id, int level, int optname, void *optval, size_t optlen)
Input Parameters:
id RDMA communication identifier
level Protocol level of the option to set
optname Name of the option to set
optval Reference to the option data
optlen The size of the option data (optval) buffer
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_set_option sets communication options for an rdma_cm_id. Option levels and details may be
found in the enums in the relevant header files.
5.2.5 rdma_create_ep
Template:
int rdma_create_ep(struct rdma_cm_id **id, struct rdma_addrinfo *res, struct ibv_pd *pd, struct
ibv_qp_init_attr *qp_init_attr)
Input Parameters:
id A reference where the allocated communication identifier will be returned
res Address information associated with the rdma_cm_id returned from rdma_getaddrinfo
pd Optional protection domain if a QP is associated with the rdma_cm_id
qp_init_attr Optional initial QP attributes
Output Parameters:
id The communication identifier is returned through this reference
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure
Description:
rdma_create_ep creates an identifier and optional QP used to track communication information.
If qp_init_attr is not NULL, then a QP will be allocated and associated with the rdma_cm_id, id. If a
protection domain (PD) is provided, then the QP will be created on that PD. Otherwise the QP will
be allocated on a default PD.
The rdma_cm_id will be set to use synchronous operations (connect, listen and get_request). To use
asynchronous operations, rdma_cm_id must be migrated to a user allocated event channel using
rdma_migrate_id.
The rdma_cm_id must be released after use, using rdma_destroy_ep. struct rdma_addrinfo is defined as
follows:
struct rdma_addrinfo {
int ai_flags;
int ai_family;
int ai_qp_type;
int ai_port_space;
socklen_t ai_src_len;
socklen_t ai_dst_len;
struct sockaddr *ai_src_addr;
struct sockaddr *ai_dst_addr;
char *ai_src_canonname;
char *ai_dst_canonname;
size_t ai_route_len;
void *ai_route;
size_t ai_connect_len;
void *ai_connect;
struct rdma_addrinfo *ai_next;
};
ai_flags Hint flags which control the operation. Supported flags are:
RAI_PASSIVE, RAI_NUMERICHOST and RAI_NOROUTE
ai_family Address family for the source and destination address (AF_INET, AF_INET6, AF_IB)
ai_qp_type The type of RDMA QP used
ai_port_space RDMA port space used (RDMA_PS_UDP or RDMA_PS_TCP)
ai_src_len Length of the source address referenced by ai_src_addr
ai_dst_len Length of the destination address referenced by ai_dst_addr
*ai_src_addr Address of local RDMA device, if provided
*ai_dst_addr Address of destination RDMA device, if provided
*ai_src_canonname The canonical name for the source
*ai_dst_canonname The canonical name for the destination
ai_route_len Size of the routing information buffer referenced by ai_route
*ai_route Routing information for RDMA transports that require routing data as part of connection
establishment
ai_connect_len Size of connection information referenced by ai_connect
*ai_connect Data exchanged as part of the connection establishment process
*ai_next Pointer to the next rdma_addrinfo structure in the list
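The synchronous endpoint flow built on rdma_getaddrinfo and rdma_create_ep can be sketched as follows; this is illustrative only, and the port number "7471" is an arbitrary example value.

```c
#include <stdio.h>
#include <string.h>
#include <rdma/rdma_cma.h>

int main(void)
{
    struct rdma_addrinfo hints, *res = NULL;
    struct rdma_cm_id *id = NULL;
    struct ibv_qp_init_attr qp_attr;

    memset(&hints, 0, sizeof hints);
    hints.ai_flags = RAI_PASSIVE;       /* resolve for the listening side */
    hints.ai_port_space = RDMA_PS_TCP;

    if (rdma_getaddrinfo(NULL, "7471", &hints, &res)) {
        perror("rdma_getaddrinfo");
        return 1;
    }

    memset(&qp_attr, 0, sizeof qp_attr);
    qp_attr.cap.max_send_wr = qp_attr.cap.max_recv_wr = 1;
    qp_attr.cap.max_send_sge = qp_attr.cap.max_recv_sge = 1;
    qp_attr.qp_type = IBV_QPT_RC;

    /* NULL pd: the QP is allocated on a default protection domain */
    if (rdma_create_ep(&id, res, NULL, &qp_attr)) {
        perror("rdma_create_ep");
        rdma_freeaddrinfo(res);
        return 1;
    }

    /* ... rdma_listen / rdma_get_request / rdma_accept ... */

    rdma_destroy_ep(id);
    rdma_freeaddrinfo(res);
    return 0;
}
```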
5.2.6 rdma_destroy_ep
Template:
int rdma_destroy_ep (struct rdma_cm_id *id)
Input Parameters:
id The communication identifier to destroy
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure
Description:
rdma_destroy_ep destroys the specified rdma_cm_id and all associated resources, including QPs
associated with the id.
5.2.7 rdma_resolve_addr
Template:
int rdma_resolve_addr (struct rdma_cm_id *id, struct sockaddr *src_addr, struct sockaddr
*dst_addr, int timeout_ms)
Input Parameters:
id RDMA identifier.
src_addr Source address information. This parameter may be NULL.
dst_addr Destination address information.
timeout_ms Time to wait for resolution to complete.
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_resolve_addr resolves destination and optional source addresses from IP addresses to an RDMA
address. If successful, the specified rdma_cm_id will be bound to a local device.
Notes:
This call is used to map a given destination IP address to a usable RDMA address. The IP to RDMA
address mapping is done using the local routing tables, or via ARP. If a source address is given, the
rdma_cm_id is bound to that address, the same as if rdma_bind_addr were called. If no source
address is given, and the rdma_cm_id has not yet been bound to a device, then the rdma_cm_id will
be bound to a source address based on the local routing tables. After this call, the rdma_cm_id will
be bound to an RDMA device. This call is typically made from the active side of a connection before
calling rdma_resolve_route and rdma_connect.
InfiniBand Specific
This call maps the destination and, if given, source IP addresses to GIDs. In order to perform the
mapping, IPoIB must be running on both the local and remote nodes.
See Also:
rdma_create_id, rdma_resolve_route, rdma_connect, rdma_create_qp, rdma_get_cm_event,
rdma_bind_addr, rdma_get_src_port, rdma_get_dst_port, rdma_get_local_addr,
rdma_get_peer_addr
5.2.8 rdma_bind_addr
Template:
int rdma_bind_addr (struct rdma_cm_id *id, struct sockaddr *addr)
Input Parameters:
id RDMA identifier.
addr Local address information. Wildcard values are permitted.
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_bind_addr associates a source address with an rdma_cm_id. The address may be wild carded.
If binding to a specific local address, the rdma_cm_id will also be bound to a local RDMA device.
Notes:
Typically, this routine is called before calling rdma_listen to bind to a specific port number, but it
may also be called on the active side of a connection before calling rdma_resolve_addr to bind to a
specific address. If used to bind to port 0, the rdma_cm will select an available port, which can be
retrieved with rdma_get_src_port.
See Also:
rdma_create_id, rdma_listen, rdma_resolve_addr, rdma_create_qp, rdma_get_local_addr,
rdma_get_src_port
5.2.9 rdma_resolve_route
Template:
int rdma_resolve_route (struct rdma_cm_id *id, int timeout_ms)
Input Parameters:
id RDMA identifier.
timeout_ms Time to wait for resolution to complete.
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_resolve_route resolves an RDMA route to the destination address in order to establish a
connection. The destination must already have been resolved by calling rdma_resolve_addr.
Thus this function is called on the client side after rdma_resolve_addr but before calling
rdma_connect. For InfiniBand connections, the call obtains a path record which is used by the
connection.
5.2.10 rdma_listen
Template:
int rdma_listen(struct rdma_cm_id *id, int backlog)
Input Parameters:
id RDMA communication identifier
backlog The backlog of incoming connection requests
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_listen initiates a listen for incoming connection requests or datagram service lookup. The
listen is restricted to the locally bound source address.
Please note that the rdma_cm_id must already have been bound to a local address by calling
rdma_bind_addr before calling rdma_listen. If the rdma_cm_id is bound to a specific IP address, the
listen will be restricted to that address and the associated RDMA device. If the rdma_cm_id is bound
to an RDMA port number only, the listen will occur across all RDMA devices.
5.2.11 rdma_connect
Template:
int rdma_connect(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
Input Parameters:
id RDMA communication identifier
conn_param Optional connection parameters
Output Parameters:
none
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_connect initiates an active connection request. For a connected rdma_cm_id, id, the call
initiates a connection request to a remote destination. For an unconnected rdma_cm_id, it initiates
a lookup of the remote QP providing the datagram service. The user must already have resolved a
route to the destination address by having called rdma_resolve_route or rdma_create_ep before
calling this method.
For InfiniBand specific connections, the QPs are configured with minimum RNR NAK timer and local
ACK values. The minimum RNR NAK timer value is set to 0, for a delay of 655 ms. The local ACK
timeout is calculated based on the packet lifetime and local HCA ACK delay. The packet lifetime is
determined by the InfiniBand Subnet Administrator and is part of the resolved route (path record)
information. The HCA ACK delay is a property of the locally used HCA. Retry count and RNR retry
count values are 3-bit values.
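As a small illustration of the 3-bit constraint, a hypothetical helper (not part of the librdmacm API) can clamp a requested retry value into the representable range before it is stored in the connection parameters:

```c
#include <assert.h>
#include <stdint.h>

/* retry_count and rnr_retry_count occupy 3 bits on the wire, so only
 * values 0..7 can be expressed; for rnr_retry_count, 7 conventionally
 * means "retry indefinitely". This hypothetical helper clamps a
 * requested value before it is placed in struct rdma_conn_param. */
static uint8_t clamp_retry(unsigned requested)
{
    return requested > 7 ? 7 : (uint8_t)requested;
}
```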
Connections established over iWarp RDMA devices currently require that the active side of the
connection send the first message.
struct rdma_conn_param is defined as follows:
struct rdma_conn_param {
const void *private_data;
uint8_t private_data_len;
uint8_t responder_resources;
uint8_t initiator_depth;
uint8_t flow_control;
uint8_t retry_count; /* ignored when accepting */
uint8_t rnr_retry_count;
uint8_t srq; /* ignored if QP created on the rdma_cm_id */
uint32_t qp_num; /* ignored if QP created on the rdma_cm_id */
};
Here is a more detailed description of the rdma_conn_param structure members:
private_data References a user-controlled data buffer. The contents of the buffer are copied and
transparently passed to the remote side as part of the communication request. May be NULL if
private_data is not required.
private_data_len Specifies the size of the user-controlled data buffer. Note that the actual amount
of data transferred to the remote side is transport dependent and may be larger than that
requested.
responder_resources The maximum number of outstanding RDMA read and atomic operations that
the local side will accept from the remote side. Applies only to RDMA_PS_TCP. This value must be
less than or equal to the local RDMA device attribute max_qp_rd_atom and remote RDMA device
attribute max_qp_init_rd_atom. The remote endpoint can adjust this value when accepting the
connection.
initiator_depth The maximum number of outstanding RDMA read and atomic operations that the
local side will have to the remote side. Applies only to RDMA_PS_TCP. This value must be less than
or equal to the local RDMA device attribute max_qp_init_rd_atom and remote RDMA device attribute
max_qp_rd_atom. The remote endpoint can adjust this value when accepting the connection.
flow_control Specifies if hardware flow control is available. This value is exchanged with the remote
peer and is not used to configure the QP. Applies only to RDMA_PS_TCP.
retry_count The maximum number of times that a data transfer operation should be retried on the
connection when an error occurs. This setting controls the number of times to retry send, RDMA,
and atomic operations when timeouts occur. Applies only to RDMA_PS_TCP.
rnr_retry_count The maximum number of times that a send operation from the remote peer should
be retried on a connection after receiving a receiver not ready (RNR) error. RNR errors are
generated when a send request arrives before a buffer has been posted to receive the incoming
data. Applies only to RDMA_PS_TCP.
srq Specifies if the QP associated with the connection is using a shared receive queue. This field is
ignored by the library if a QP has been created on the rdma_cm_id. Applies only to RDMA_PS_TCP.
qp_num Specifies the QP number associated with the connection. This field is ignored by the library
if a QP has been created on the rdma_cm_id. Applies only to RDMA_PS_TCP.
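Putting the parameters together, an active-side connection setup might look like the following sketch. This is illustrative only: event-channel handling between steps and QP creation are elided, and the timeout and resource values are arbitrary examples.

```c
#include <string.h>
#include <rdma/rdma_cma.h>

/* Sketch of the active-side sequence: resolve the destination address
 * and route, then connect with explicit rdma_conn_param values. */
static int client_connect(struct rdma_cm_id *id, struct sockaddr *dst)
{
    struct rdma_conn_param param;

    if (rdma_resolve_addr(id, NULL, dst, 2000))   /* 2000 ms timeout */
        return -1;
    /* ...wait for RDMA_CM_EVENT_ADDR_RESOLVED on the event channel... */
    if (rdma_resolve_route(id, 2000))
        return -1;
    /* ...wait for RDMA_CM_EVENT_ROUTE_RESOLVED, then create the QP... */

    memset(&param, 0, sizeof param);
    param.responder_resources = 1;   /* inbound RDMA reads we accept   */
    param.initiator_depth = 1;       /* outbound RDMA reads we issue   */
    param.retry_count = 7;           /* 3-bit field                    */
    param.rnr_retry_count = 7;       /* 7 = retry indefinitely         */

    return rdma_connect(id, &param);
}
```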
5.2.12 rdma_get_request
Template:
int rdma_get_request (struct rdma_cm_id *listen, struct rdma_cm_id **id)
Input Parameters:
listen Listening rdma_cm_id
Output Parameters:
id A pointer to the rdma_cm_id associated with the new connection request
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_get_request retrieves the next pending connection request event. The call may only be used
on listening rdma_cm_ids operating synchronously. If the call is successful, a new rdma_cm_id (id)
representing the connection request will be returned to the user. The new rdma_cm_id will
reference event information associated with the request until the user calls rdma_reject,
rdma_accept, or rdma_destroy_id on the newly created identifier. For a description of the event
data, see rdma_get_cm_event.
If QP attributes are associated with the listening endpoint, the returned rdma_cm_id will also
reference an allocated QP.
5.2.13 rdma_accept
Template:
int rdma_accept(struct rdma_cm_id *id, struct rdma_conn_param *conn_param)
Input Parameters:
id RDMA communication identifier
conn_param Optional connection parameters (described under rdma_connect)
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_accept is called from the listening side to accept a connection or datagram service lookup
request.
Unlike the socket accept routine, rdma_accept is not called on a listening rdma_cm_id. Instead,
after calling rdma_listen, the user waits for an RDMA_CM_EVENT_CONNECT_REQUEST event to
occur. Connection request events give the user a newly created rdma_cm_id, similar to a new
socket, but the rdma_cm_id is bound to a specific RDMA device. rdma_accept is called on the new
rdma_cm_id.
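The passive-side flow described above, using the synchronous rdma_create_ep style, can be sketched as follows (illustrative only; the backlog value is an arbitrary example):

```c
#include <rdma/rdma_cma.h>

/* Sketch of the passive side: listen, pull the next request with
 * rdma_get_request, then accept on the newly created rdma_cm_id. */
static int server_accept_one(struct rdma_cm_id *listen_id,
                             struct rdma_cm_id **conn_id)
{
    if (rdma_listen(listen_id, 4))        /* backlog of 4 */
        return -1;
    if (rdma_get_request(listen_id, conn_id))
        return -1;
    /* *conn_id is a new rdma_cm_id bound to a specific RDMA device;
     * post receive buffers here before accepting, then: */
    return rdma_accept(*conn_id, NULL);   /* NULL: default parameters */
}
```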
5.2.14 rdma_reject
Template:
int rdma_reject(struct rdma_cm_id *id, const void *private_data, uint8_t private_data_len)
Input Parameters:
id RDMA communication identifier
private_data Optional private data to send with the reject message
private_data_len Size (in bytes) of the private data being sent
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_reject is called from the listening side to reject a connection or datagram service lookup
request.
After receiving a connection request event, a user may call rdma_reject to reject the request. The
optional private data will be passed to the remote side if the underlying RDMA transport supports
private data in the reject message.
5.2.15 rdma_notify
Template:
int rdma_notify(struct rdma_cm_id *id, enum ibv_event_type event)
Input Parameters:
id RDMA communication identifier
event Asynchronous event
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_notify is used to notify the librdmacm of asynchronous events which have occurred on a QP
associated with the rdma_cm_id, id.
Asynchronous events that occur on a QP are reported through the user's device event handler. This
routine is used to notify the librdmacm of communication events. In most cases, use of this routine
is not necessary, however if connection establishment is done out of band (such as done through
InfiniBand), it is possible to receive data on a QP that is not yet considered connected. This routine
forces the connection into an established state in this case in order to handle the rare situation
where the connection never forms on its own. Calling this routine ensures the delivery of the
RDMA_CM_EVENT_ESTABLISHED event to the application. Events that should be reported to the CM
are: IB_EVENT_COMM_EST.
5.2.16 rdma_disconnect
Template:
int rdma_disconnect(struct rdma_cm_id *id)
Input Parameters:
id RDMA communication identifier
Output Parameters:
None
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_disconnect disconnects a connection and transitions any associated QP to the error state. This
action will result in any posted work requests being flushed to the completion queue.
rdma_disconnect may be called by both the client and server side of the connection. After
successfully disconnecting, an RDMA_CM_EVENT_DISCONNECTED event will be generated on both
sides of the connection.
5.2.17 rdma_get_src_port
Template:
uint16_t rdma_get_src_port(struct rdma_cm_id *id)
Input Parameters:
id RDMA communication identifier
Output Parameters:
None
Return Value:
Returns the 16-bit port number associated with the local endpoint, or 0 if the rdma_cm_id, id, is
not bound to a port
Description:
rdma_get_src_port retrieves the local port number for an rdma_cm_id (id) which has been bound to
a local address. If the id is not bound to a port, the routine will return 0.
5.2.18 rdma_get_dst_port
Template:
uint16_t rdma_get_dst_port(struct rdma_cm_id *id)
Input Parameters:
id RDMA communication identifier
Output Parameters:
None
Return Value:
Returns the 16-bit port number associated with the peer endpoint, or 0 if the rdma_cm_id, id, is
not connected
Description:
rdma_get_dst_port retrieves the port associated with the peer endpoint. If the rdma_cm_id, id, is
not connected, then the routine will return 0.
5.2.19 rdma_get_local_addr
Template:
struct sockaddr * rdma_get_local_addr (struct rdma_cm_id *id)
Input Parameters:
id RDMA communication identifier
Output Parameters:
None
Return Value:
Returns a pointer to the local sockaddr address of the rdma_cm_id, id. If the id is not bound to an
address, then the contents of the sockaddr structure will be set to all zeros
Description:
rdma_get_local_addr retrieves the local IP address for the rdma_cm_id which has been bound to a
local device.
5.2.20 rdma_get_peer_addr
Template:
struct sockaddr * rdma_get_peer_addr (struct rdma_cm_id *id)
Input Parameters:
id RDMA communication identifier
Output Parameters:
None
Return Value:
A pointer to the sockaddr address of the connected peer. If the rdma_cm_id is not connected, then
the contents of the sockaddr structure will be set to all zeros
Description:
rdma_get_peer_addr retrieves the remote IP address of a bound rdma_cm_id.
5.2.21 rdma_get_devices
Template:
struct ibv_context ** rdma_get_devices (int *num_devices)
Input Parameters:
num_devices If non-NULL, set to the number of devices returned
Output Parameters:
num_devices Number of RDMA devices currently available
Return Value:
Array of available RDMA devices on success or NULL if the request fails
Description:
rdma_get_devices retrieves an array of RDMA devices currently available. Devices remain opened
while librdmacm is loaded and the array must be released by calling rdma_free_devices.
5.2.22 rdma_free_devices
Template:
void rdma_free_devices (struct ibv_context **list)
Input Parameters:
list List of devices returned from rdma_get_devices
Output Parameters:
None
Return Value:
None
Description:
rdma_free_devices frees the device array returned by the rdma_get_devices routine.
5.2.23 rdma_getaddrinfo
Template:
int rdma_getaddrinfo(char *node, char *service, struct rdma_addrinfo *hints, struct rdma_addrinfo
**res)
Input Parameters:
node Optional: name, dotted-decimal IPv4 or IPv6 hex address to resolve
service The service name or port number of the address
hints Reference to an rdma_addrinfo structure containing hints about the type of service the caller
supports
res A pointer to a linked list of rdma_addrinfo structures containing response information
Output Parameters:
res An rdma_addrinfo structure which returns information needed to establish communication
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_getaddrinfo provides transport independent address translation. It resolves the destination
node and service address and returns information required to establish device communication. It is
the functional equivalent of getaddrinfo.
Please note that either node or service must be provided. If hints are provided, the operation will
be controlled by hints.ai_flags. If RAI_PASSIVE is specified, the call will resolve address information
for use on the passive side of a connection.
The rdma_addrinfo structure is described under the rdma_create_ep routine.
5.2.24 rdma_freeaddrinfo
Template:
void rdma_freeaddrinfo(struct rdma_addrinfo *res)
Input Parameters:
res The rdma_addrinfo structure to free
Output Parameters:
None
Return Value:
None
Description:
rdma_freeaddrinfo releases the rdma_addrinfo (res) structure returned by the rdma_getaddrinfo
routine. Note that if ai_next is not NULL, rdma_freeaddrinfo will free the entire list of addrinfo
structures.
5.2.25 rdma_create_qp
Template:
int rdma_create_qp (struct rdma_cm_id *id, struct ibv_pd *pd, struct ibv_qp_init_attr
*qp_init_attr)
Input Parameters:
id RDMA identifier.
pd Protection domain for the QP.
qp_init_attr Initial QP attributes.
Output Parameters:
qp_init_attr The actual capabilities and properties of the created QP are returned through this
structure
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_create_qp allocates a QP associated with the specified rdma_cm_id and transitions it for
sending and receiving. The actual capabilities and properties of the created QP will be returned to
the user through the qp_init_attr parameter.
Notes:
The rdma_cm_id must be bound to a local RDMA device before calling this function, and the
protection domain must be for that same device. QPs allocated to an rdma_cm_id are automatically
transitioned by the librdmacm through their states. After being allocated, the QP will be ready
to handle posting of receives. If the QP is unconnected, it will be ready to post sends.
See Also:
rdma_bind_addr, rdma_resolve_addr, rdma_destroy_qp, ibv_create_qp, ibv_modify_qp
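A sketch of QP allocation on a bound rdma_cm_id follows; it is illustrative only, and the capacity values are arbitrary examples.

```c
#include <string.h>
#include <rdma/rdma_cma.h>

/* Sketch: allocate a QP on an rdma_cm_id that has already been bound
 * to a device (e.g. after rdma_resolve_addr or rdma_bind_addr).
 * pd may be NULL, in which case a default protection domain is used. */
static int setup_qp(struct rdma_cm_id *id, struct ibv_pd *pd,
                    struct ibv_cq *cq)
{
    struct ibv_qp_init_attr attr;

    memset(&attr, 0, sizeof attr);
    attr.send_cq = cq;
    attr.recv_cq = cq;
    attr.qp_type = IBV_QPT_RC;
    attr.cap.max_send_wr = 16;
    attr.cap.max_recv_wr = 16;
    attr.cap.max_send_sge = 1;
    attr.cap.max_recv_sge = 1;

    /* On success, attr is updated with the actual QP capabilities. */
    return rdma_create_qp(id, pd, &attr);
}
```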
5.2.26 rdma_destroy_qp
Template:
void rdma_destroy_qp (struct rdma_cm_id *id)
Input Parameters:
id RDMA identifier.
Output Parameters:
none
Return Value:
none
Description:
rdma_destroy_qp destroys a QP allocated on the rdma_cm_id.
Notes:
Users must destroy any QP associated with an rdma_cm_id before destroying the ID.
See Also:
rdma_create_qp, rdma_destroy_id, ibv_destroy_qp
5.2.27 rdma_join_multicast
Template:
int rdma_join_multicast (struct rdma_cm_id *id, struct sockaddr *addr, void *context)
Input Parameters:
id Communication identifier associated with the request.
addr Multicast address identifying the group to join.
context User-defined context associated with the join request.
Output Parameters:
none
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_join_multicast joins a multicast group and attaches an associated QP to the group.
Notes:
Before joining a multicast group, the rdma_cm_id must be bound to an RDMA device by calling
rdma_bind_addr or rdma_resolve_addr. Use of rdma_resolve_addr requires the local routing tables
to resolve the multicast address to an RDMA device, unless a specific source address is provided. The
user must call rdma_leave_multicast to leave the multicast group and release any multicast
resources. After the join operation completes, any associated QP is automatically attached to the
multicast group, and the join context is returned to the user through the private_data field in the
rdma_cm_event.
See Also:
rdma_leave_multicast, rdma_bind_addr, rdma_resolve_addr, rdma_create_qp, rdma_get_cm_event
5.2.28 rdma_leave_multicast
Template:
int rdma_leave_multicast (struct rdma_cm_id *id, struct sockaddr *addr)
Input Parameters:
id Communication identifier associated with the request.
addr Multicast address identifying the group to leave.
Output Parameters:
none
Return Value:
0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for the failure.
Description:
rdma_leave_multicast leaves a multicast group and detaches an associated QP from the group.
Notes:
Calling this function before a group has been fully joined results in canceling the join operation.
Users should be aware that messages received from the multicast group may still be queued for
completion processing immediately after leaving a multicast group. Destroying an rdma_cm_id will
automatically leave all multicast groups.
See Also:
rdma_join_multicast, rdma_destroy_qp
5.3.1 rdma_get_cm_event
Template:
int rdma_get_cm_event (struct rdma_event_channel *channel, struct rdma_cm_event **event)
Input Parameters:
channel Event channel to check for events
event Allocated information about the next communication event.
Output Parameters:
none
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: Retrieves a communication event. If no events are pending, by default, the call will
block until an event is received.
The default synchronous behavior of this routine can be changed by modifying the file
descriptor associated with the given channel. All events that are reported must be
acknowledged by calling rdma_ack_cm_event. Destruction of an rdma_cm_id will block
until related events have been acknowledged.
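A minimal event loop built on rdma_get_cm_event might look like the following sketch (illustrative, not from the manual):

```c
#include <rdma/rdma_cma.h>

/* Sketch of a minimal event loop: block for the next event, dispatch
 * on its type, then acknowledge it so ids can later be destroyed. */
static int process_one_event(struct rdma_event_channel *ec)
{
    struct rdma_cm_event *event;

    if (rdma_get_cm_event(ec, &event))
        return -1;

    switch (event->event) {
    case RDMA_CM_EVENT_CONNECT_REQUEST:
        /* event->id is a new rdma_cm_id for the incoming connection */
        break;
    case RDMA_CM_EVENT_ESTABLISHED:
        break;
    case RDMA_CM_EVENT_DISCONNECTED:
        break;
    default:
        break;
    }

    /* Every reported event must be acked, or destruction will block. */
    return rdma_ack_cm_event(event);
}
```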
private_data References any private data associated with the event. The private data is supplied by
the remote side when calling rdma_connect or rdma_accept. This field is NULL if the event does not
include private data. The buffer referenced by this pointer is deallocated when calling
rdma_ack_cm_event.
private_data_len The size of the private data buffer. Users should note that the
size of the private data buffer may be larger than the amount of private data sent
by the remote side. Any additional space in the buffer will be zeroed out.
rnr_retry_count The number of times that the recipient should retry receiver
not ready (RNR) NACK errors.
5.3.1.4 Event Types
The following types of communication events may be reported.
RDMA_CM_EVENT_ADDR_RESOLVED
Address resolution (rdma_resolve_addr) completed successfully.
RDMA_CM_EVENT_ADDR_ERROR
Address resolution (rdma_resolve_addr) failed.
RDMA_CM_EVENT_ROUTE_RESOLVED
Route resolution (rdma_resolve_route) completed successfully.
RDMA_CM_EVENT_ROUTE_ERROR
Route resolution (rdma_resolve_route) failed.
RDMA_CM_EVENT_CONNECT_REQUEST
Generated on the passive side to notify the user of a new connection request.
RDMA_CM_EVENT_CONNECT_RESPONSE
Generated on the active side to notify the user of a successful response to a connection request. It
is only generated on rdma_cm_id's that do not have a QP associated with them.
RDMA_CM_EVENT_CONNECT_ERROR
Indicates that an error has occurred while trying to establish a connection. May be generated on the
active or passive side of a connection.
RDMA_CM_EVENT_UNREACHABLE
Generated on the active side to notify the user that the remote server is not reachable or unable to
respond to a connection request.
RDMA_CM_EVENT_REJECTED
Indicates that a connection request or response was rejected by the remote end point.
RDMA_CM_EVENT_ESTABLISHED
Indicates that a connection has been established with the remote end point.
RDMA_CM_EVENT_DISCONNECTED
The connection has been disconnected.
RDMA_CM_EVENT_DEVICE_REMOVAL
The local RDMA device associated with the rdma_cm_id has been removed. Upon receiving this
event, the user must destroy the related rdma_cm_id.
RDMA_CM_EVENT_MULTICAST_JOIN
The multicast join operation (rdma_join_multicast) completed successfully.
RDMA_CM_EVENT_MULTICAST_ERROR
An error occurred either while joining a multicast group or, after the group had already been joined,
on the existing group. The specified multicast group is no longer accessible and should be rejoined,
if desired.
RDMA_CM_EVENT_ADDR_CHANGE
The network device associated with this ID through address resolution changed its HW address, e.g.
following a bonding failover. This event can serve as a hint for applications that want the links used
for their RDMA sessions to align with the network stack.
RDMA_CM_EVENT_TIMEWAIT_EXIT
The QP associated with a connection has exited its timewait state and is now ready to be re-used.
After a QP has been disconnected, it is maintained in a timewait state to allow any in flight packets
to exit the network. After the timewait state has completed, the rdma_cm will report this event.
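The retrieve-then-acknowledge pattern implied by rdma_get_cm_event and rdma_ack_cm_event can be sketched as follows. This is an illustration of the control flow only: the enum values and mock_* helpers below are simplified local stand-ins, not the librdmacm API, so the loop can run without RDMA hardware. A real program would include <rdma/rdma_cma.h> and operate on a real event channel.

```c
#include <stddef.h>
#include <assert.h>

enum cm_event_type {            /* local subset standing in for rdma_cm_event_type */
    CM_ADDR_RESOLVED,
    CM_ROUTE_RESOLVED,
    CM_ESTABLISHED,
    CM_DISCONNECTED
};

struct cm_event { enum cm_event_type event; };

/* Mock: returns the next scripted event; the real rdma_get_cm_event()
 * blocks on the event channel until an event arrives. */
static int mock_get_cm_event(const struct cm_event *script, size_t n,
                             size_t *pos, struct cm_event *out)
{
    if (*pos >= n) return -1;
    *out = script[(*pos)++];
    return 0;
}

/* Mock: in a real program every retrieved event must be released with
 * rdma_ack_cm_event(), or a later rdma_destroy_id() will block. */
static void mock_ack_cm_event(struct cm_event *ev) { (void)ev; }

/* Drive the connection state machine until DISCONNECTED; returns the
 * number of events that were retrieved and acknowledged. */
int process_cm_events(const struct cm_event *script, size_t n)
{
    size_t pos = 0;
    int handled = 0;
    struct cm_event ev;

    while (mock_get_cm_event(script, n, &pos, &ev) == 0) {
        switch (ev.event) {
        case CM_ADDR_RESOLVED:   /* next step: rdma_resolve_route() */
        case CM_ROUTE_RESOLVED:  /* next step: rdma_connect()       */
        case CM_ESTABLISHED:     /* connection ready for posting WRs */
            break;
        case CM_DISCONNECTED:    /* tear down QP and rdma_cm_id      */
            mock_ack_cm_event(&ev);
            return ++handled;
        }
        mock_ack_cm_event(&ev);  /* every get must be matched by an ack */
        ++handled;
    }
    return handled;
}
```

Note that the acknowledgement is issued on every path out of the switch; forgetting it is a common source of hangs when the rdma_cm_id is later destroyed.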
5.3.2 rdma_ack_cm_event
Template: int rdma_ack_cm_event (struct rdma_cm_event *event)
Input Parameters:
event Event to be released
Output Parameters:
none
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_ack_cm_event frees a communication event. All events which are allocated by
rdma_get_cm_event must be released; there should be a one-to-one correspondence between
successful gets and acks. This call frees the event structure and any memory that it references.
5.3.3 rdma_event_str
Template: char *rdma_event_str(enum rdma_cm_event_type event)
Input Parameters:
event Asynchronous event to be converted to a string
Output Parameters:
none
Return Value: A pointer to a static character string that describes the event
Description: rdma_event_str returns a string representation of the given asynchronous event.
6 RDMA Verbs API
This chapter describes the details of the RDMA verbs API.
6.1.1 rdma_reg_msgs
Template: struct ibv_mr * rdma_reg_msgs(struct rdma_cm_id *id, void *addr, size_t length)
Input Parameters:
id A reference to the communication identifier where the message buffer(s) will be
used
Output Parameters:
Return Value: A reference to the registered memory region on success or NULL on failure
Description: rdma_reg_msgs registers an array of memory buffers for sending or receiving messages
or for RDMA operations. The registered memory buffers may then be posted to an rdma_cm_id using
rdma_post_send or rdma_post_recv. They may also be specified as the target of an RDMA read
operation or the source of an RDMA write request.
The memory buffers are registered with the protection domain associated with the rdma_cm_id.
The start of the data buffer array is specified through the addr parameter and the total size of the
array is given by the length.
All data buffers must be registered before being posted as a work request. They must be
deregistered by calling rdma_dereg_mr.
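The single-registration-per-array model of rdma_reg_msgs can be illustrated with the offset arithmetic below. struct mock_mr and both helpers are hypothetical simplifications (the real call returns a struct ibv_mr registered with the id's protection domain); the sketch only shows how individual message buffers are addressed inside one registered region.

```c
#include <stddef.h>
#include <assert.h>

/* Simplified stand-in for struct ibv_mr: just the registered range. */
struct mock_mr { void *addr; size_t length; };

/* Mock of rdma_reg_msgs: one registration covering the whole array. */
static struct mock_mr mock_reg_msgs(void *addr, size_t length)
{
    struct mock_mr mr = { addr, length };
    return mr;
}

/* Address of the i-th fixed-size message inside the registered array,
 * or NULL if the message would fall outside the registered region. */
static void *msg_at(const struct mock_mr *mr, size_t msg_size, size_t i)
{
    if ((i + 1) * msg_size > mr->length)
        return NULL;                       /* outside the registration */
    return (char *)mr->addr + i * msg_size;
}
```

Each address returned by msg_at could then be posted individually with rdma_post_send or rdma_post_recv while sharing the single registration.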
6.1.2 rdma_reg_read
Template: struct ibv_mr * rdma_reg_read(struct rdma_cm_id *id, void *addr, size_t length)
Input Parameters:
Output Parameters:
Return Value: A reference to the registered memory region on success or NULL on failure. If an error
occurs, errno will be set to indicate the failure reason.
Description: rdma_reg_read registers a memory buffer that will be accessed by a remote RDMA read
operation. Memory buffers registered using rdma_reg_read may be targeted in an RDMA read
request, allowing the buffer to be specified on the remote side of an RDMA connection as the
remote_addr of rdma_post_read, or similar call.
rdma_reg_read is used to register a data buffer that will be the target of an RDMA read operation on
a queue pair associated with an rdma_cm_id. The memory buffer is registered with the protection
domain associated with the identifier. The start of the data buffer is specified through the addr
parameter, and the total size of the buffer is given by length.
All data buffers must be registered before being posted as work requests. Users must deregister all
registered memory by calling rdma_dereg_mr.
6.1.3 rdma_reg_write
Template: struct ibv_mr * rdma_reg_write(struct rdma_cm_id *id, void *addr, size_t length)
Input Parameters:
Output Parameters:
Return Value: A reference to the registered memory region on success or NULL on failure. If an error
occurs, errno will be set to indicate the failure reason.
Description: rdma_reg_write registers a memory buffer which will be accessed by a remote RDMA
write operation. Memory buffers registered using this routine may be targeted in an RDMA write
request, allowing the buffer to be specified on the remote side of an RDMA connection as the
remote_addr of an rdma_post_write or similar call.
The memory buffer is registered with the protection domain associated with the rdma_cm_id. The
start of the data buffer is specified through the addr parameter, and the total size of the buffer is
given by length.
All data buffers must be registered before being posted as work requests. Users must deregister all
registered memory by calling rdma_dereg_mr.
6.1.4 rdma_dereg_mr
Template: int rdma_dereg_mr(struct ibv_mr *mr)
Input Parameters:
Output Parameters:
None
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_dereg_mr deregisters a memory buffer which has been registered for RDMA or
message operations. This routine must be called for all registered memory associated with a given
rdma_cm_id before destroying the rdma_cm_id.
6.1.5 rdma_create_srq
Template: int rdma_create_srq(struct rdma_cm_id *id, struct ibv_pd *pd, struct ibv_srq_init_attr
*attr)
Input Parameters:
Output Parameters:
attr The actual capabilities and properties of the created SRQ are returned through
this structure
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_create_srq allocates a shared receive queue (SRQ) associated with the rdma_cm_id,
id. The id must be bound to a local RDMA device before calling this routine. If the protection
domain, pd, is provided, it must be for that same device. After being allocated, the SRQ will be
ready to handle posting of receives. If pd is NULL, the rdma_cm_id will be created using a default
protection domain. One default protection domain is allocated per RDMA device. The initial SRQ
attributes are specified by the attr parameter.
If a completion queue, CQ, is not specified for the XRC SRQ, then a CQ will be allocated by the
rdma_cm for the SRQ, along with corresponding completion channels. Completion channels and CQ
data created by the rdma_cm are exposed to the user through the rdma_cm_id structure. The
actual capabilities and properties of the created SRQ will be returned to the user through the attr
parameter.
An rdma_cm_id may only be associated with a single SRQ.
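The attr round trip described above can be sketched as follows. The structure and device limit are hypothetical stand-ins for struct ibv_srq_init_attr and the device capabilities; the point is that, as with rdma_create_srq, the requested values go in and the actual capabilities come back through the same structure.

```c
#include <assert.h>

/* Simplified stand-in for struct ibv_srq_init_attr's attr fields. */
struct mock_srq_init_attr {
    unsigned max_wr;    /* requested in / actual out: max outstanding WRs */
    unsigned max_sge;   /* requested in / actual out: max SGEs per WR     */
    unsigned srq_limit; /* low watermark for the SRQ limit event          */
};

#define MOCK_DEV_MAX_SRQ_WR 4096u  /* hypothetical device capability */

/* Mock of rdma_create_srq: rejects an invalid request, clamps the WR
 * count to the device limit, and writes the actual value back through
 * attr, mirroring the documented behavior. */
static int mock_create_srq(struct mock_srq_init_attr *attr)
{
    if (attr->srq_limit > attr->max_wr)
        return -1;                      /* limit cannot exceed queue size */
    if (attr->max_wr > MOCK_DEV_MAX_SRQ_WR)
        attr->max_wr = MOCK_DEV_MAX_SRQ_WR;
    return 0;
}
```

After a real call, the application should read the returned attr rather than assume its requested values were granted.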
6.1.6 rdma_destroy_srq
Template: void rdma_destroy_srq(struct rdma_cm_id *id)
Input Parameters:
Output Parameters:
None
Description: rdma_destroy_srq destroys an SRQ allocated on the rdma_cm_id, id. Any SRQ
associated with an rdma_cm_id must be destroyed before destroying the rdma_cm_id.
6.2.1 rdma_post_recvv
Template: int rdma_post_recvv(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, int nsge)
Input Parameters:
Output Parameters:
None
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_post_recvv posts a single work request to the receive queue of the queue pair
associated with the rdma_cm_id, id. The posted buffers will be queued to receive an incoming
message sent by the remote peer.
Please note that this routine supports multiple scatter-gather entries. The user is responsible for
ensuring that the receive is posted, and the total buffer space is large enough to contain all sent
data before the peer posts the corresponding send message. The message buffers must have been
registered before being posted, and the buffers must remain registered until the receive completes.
Messages may be posted to an rdma_cm_id only after a queue pair has been associated with it. A
queue pair is bound to an rdma_cm_id after calling rdma_create_ep or rdma_create_qp, if the
rdma_cm_id is allocated using rdma_create_id.
The user-defined context associated with the receive request will be returned to the user through
the work completion work request identifier (wr_id) field.
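The sizing rule for scatter-gather receives can be checked with a few lines of bookkeeping. struct mock_sge mirrors the addr/length/lkey layout of struct ibv_sge but is a local stand-in; the helpers verify that the total space described by an sgl of nsge entries covers an incoming message, which is the user's responsibility as described above.

```c
#include <stddef.h>
#include <stdint.h>
#include <assert.h>

/* Local mirror of struct ibv_sge's layout, for illustration only. */
struct mock_sge {
    uint64_t addr;    /* start of one registered buffer segment */
    uint32_t length;  /* length of the segment in bytes         */
    uint32_t lkey;    /* local key from the segment's MR        */
};

/* Total receive space described by a scatter-gather list. */
static size_t sgl_total_length(const struct mock_sge *sgl, int nsge)
{
    size_t total = 0;
    int i;
    for (i = 0; i < nsge; ++i)
        total += sgl[i].length;
    return total;
}

/* The posted receive must be at least as large as the incoming send. */
static int recv_fits(const struct mock_sge *sgl, int nsge, size_t msg_len)
{
    return sgl_total_length(sgl, nsge) >= msg_len;
}
```

A real rdma_post_recvv call would be handed the same sgl/nsge pair after each segment's buffer has been registered.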
6.2.2 rdma_post_sendv
Template: int rdma_post_sendv(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, int nsge,
int flags)
Input Parameters:
context A user-defined context associated with the request
Output Parameters:
None
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_post_sendv posts a work request to the send queue of the queue pair associated
with the rdma_cm_id, id. The contents of the posted buffers will be sent to the remote peer of the
connection.
The user is responsible for ensuring that the remote peer has queued a receive request before
issuing the send operations. Also, unless the send request is using inline data, the message buffers
must already have been registered before being posted. The buffers must remain registered until
the send completes.
This routine supports multiple scatter-gather entries.
Send operations may not be posted to an rdma_cm_id or the corresponding queue pair until a
connection has been established.
The user-defined context associated with the send request will be returned to the user through the
work completion work request identifier (wr_id) field.
6.2.3 rdma_post_readv
Template: int rdma_post_readv(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, int nsge,
int flags, uint64_t remote_addr, uint32_t rkey)
Input Parameters:
remote_addr The address of the remote registered memory to read from
rkey The registered memory key associated with the remote address
Output Parameters:
None
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_post_readv posts a work request to the send queue of the queue pair associated
with the rdma_cm_id, id. The contents of the remote memory region at remote_addr will be read
into the local data buffers given in the sgl array.
The user must ensure that both the remote and local data buffers have been registered before the
read is issued. The buffers must remain registered until the read completes.
Read operations may not be posted to an rdma_cm_id or the corresponding queue pair until a
connection has been established.
The user-defined context associated with the read request will be returned to the user through the
work completion work request identifier (wr_id) field.
6.2.4 rdma_post_writev
Template: int rdma_post_writev(struct rdma_cm_id *id, void *context, struct ibv_sge *sgl, int nsge,
int flags, uint64_t remote_addr, uint32_t rkey)
Input Parameters:
remote_addr The address of the remote registered memory to write into
rkey The registered memory key associated with the remote address
Output Parameters:
None
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_post_writev posts a work request to the send queue of the queue pair associated
with the rdma_cm_id, id. The contents of the local data buffers in the sgl array will be written to
the remote memory region at remote_addr.
Unless inline data is specified, the local data buffers must have been registered before the write is
issued, and the buffers must remain registered until the write completes. The remote buffers must
always be registered.
Write operations may not be posted to an rdma_cm_id or the corresponding queue pair until a
connection has been established.
The user-defined context associated with the write request will be returned to the user through the
work completion work request identifier (wr_id) field.
6.2.5 rdma_post_recv
Template: int rdma_post_recv(struct rdma_cm_id *id, void *context, void *addr, size_t length, struct
ibv_mr *mr)
Input Parameters:
id A reference to the communication identifier where the message buffer will be
posted
context A user-defined context associated with the request
addr The address of the memory buffer to post
Output Parameters:
None
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_post_recv posts a work request to the receive queue of the queue pair associated
with the rdma_cm_id, id. The posted buffer will be queued to receive an incoming message sent by
the remote peer.
The user is responsible for ensuring that the receive buffer is posted and is large enough to contain all
sent data before the peer posts the corresponding send message. The buffer must have already been
registered before being posted, with the mr parameter referencing the registration. The buffer
must remain registered until the receive completes.
Messages may be posted to an rdma_cm_id only after a queue pair has been associated with it. A
queue pair is bound to an rdma_cm_id after calling rdma_create_ep or rdma_create_qp, if the
rdma_cm_id is allocated using rdma_create_id.
The user-defined context associated with the receive request will be returned to the user through
the work completion work request identifier (wr_id) field.
Please note that this is a simple receive call. There are no scatter-gather lists involved here.
6.2.6 rdma_post_send
Template: int rdma_post_send(struct rdma_cm_id *id, void *context, void *addr, size_t length,
struct ibv_mr *mr, int flags)
Input Parameters:
context A user-defined context associated with the request
addr The address of the memory buffer to post
Output Parameters:
None
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_post_send posts a work request to the send queue of the queue pair associated
with the rdma_cm_id, id. The contents of the posted buffer will be sent to the remote peer of the
connection.
The user is responsible for ensuring that the remote peer has queued a receive request before
issuing the send operations. Also, unless the send request is using inline data, the message buffer
must already have been registered before being posted, with the mr parameter referencing the
registration. The buffer must remain registered until the send completes.
Send operations may not be posted to an rdma_cm_id or the corresponding queue pair until a
connection has been established.
The user-defined context associated with the send request will be returned to the user through the
work completion work request identifier (wr_id) field.
6.2.7 rdma_post_read
Template: int rdma_post_read(struct rdma_cm_id *id, void *context, void *addr, size_t length, struct
ibv_mr *mr, int flags, uint64_t remote_addr, uint32_t rkey)
Input Parameters:
remote_addr The address of the remote registered memory to read from
rkey The registered memory key associated with the remote address
Output Parameters:
None
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_post_read posts a work request to the send queue of the queue pair associated
with the rdma_cm_id. The contents of the remote memory region will be read into the local data
buffer.
For a list of supported flags, see ibv_post_send. The user must ensure that both the remote and
local data buffers have been registered before the read is issued, and the buffers must remain
registered until the read completes.
Read operations may not be posted to an rdma_cm_id or the corresponding queue pair until it has
been connected.
The user-defined context associated with the read request will be returned to the user through the
work completion work request identifier (wr_id) field.
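The addressing model of rdma_post_read can be sketched as below. The structures and mock_* helpers are hypothetical: the responder registers a buffer and advertises its address and rkey (for example, in connection private data), and the requester then names that pair in its read request. The memcpy only stands in for the device-driven transfer to show its direction; a real rdma_post_read posts a work request and the transfer completes asynchronously.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Pair advertised by the responder so the peer can issue RDMA reads. */
struct remote_buf_info {
    uint64_t remote_addr;  /* start of the remote registered region */
    uint32_t rkey;         /* remote key from that registration     */
};

/* Mock of the responder side: register with rdma_reg_read in the real
 * API, then advertise (addr, rkey) to the peer. */
static struct remote_buf_info mock_advertise(void *addr)
{
    struct remote_buf_info info;
    info.remote_addr = (uint64_t)(uintptr_t)addr;
    info.rkey = 0x1234u;   /* hypothetical key from registration */
    return info;
}

/* Mock of rdma_post_read: memcpy illustrates the copy direction,
 * remote region -> local buffer. */
static int mock_post_read(void *local, size_t len,
                          const struct remote_buf_info *ri)
{
    memcpy(local, (const void *)(uintptr_t)ri->remote_addr, len);
    return 0;
}
```

The same (remote_addr, rkey) pair, with the copy direction reversed, is what rdma_post_write consumes.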
6.2.8 rdma_post_write
Template: int rdma_post_write(struct rdma_cm_id *id, void *context, void *addr, size_t length,
struct ibv_mr *mr, int flags, uint64_t remote_addr, uint32_t rkey)
Input Parameters:
context A user-defined context associated with the request
addr The local address of the source of the write request
remote_addr The address of the remote registered memory to write into
rkey The registered memory key associated with the remote address
Output Parameters:
None
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_post_write posts a work request to the send queue of the queue pair associated
with the rdma_cm_id, id. The contents of the local data buffer will be written into the remote
memory region.
Unless inline data is specified, the local data buffer must have been registered before the write is
issued, and the buffer must remain registered until the write completes. The remote buffer must
always be registered.
Write operations may not be posted to an rdma_cm_id or the corresponding queue pair until a
connection has been established.
The user-defined context associated with the write request will be returned to the user through the
work completion work request identifier (wr_id) field.
6.2.9 rdma_post_ud_send
Template: int rdma_post_ud_send(struct rdma_cm_id *id, void *context, void *addr, size_t length,
struct ibv_mr *mr, int flags, struct ibv_ah *ah, uint32_t remote_qpn)
Input Parameters:
context A user-defined context associated with the request
addr The address of the memory buffer to post
flags Optional flags used to control the send operation
Output Parameters:
None
Return Value: 0 on success, -1 on error. If the call fails, errno will be set to indicate the reason for
the failure.
Description: rdma_post_ud_send posts a work request to the send queue of the queue pair
associated with the rdma_cm_id, id. The contents of the posted buffer will be sent to the specified
destination queue pair, remote_qpn.
The user is responsible for ensuring that the destination queue pair has queued a receive request
before issuing the send operations. Unless the send request is using inline data, the message buffer
must have been registered before being posted, with the mr parameter referencing the registration.
The buffer must remain registered until the send completes.
The user-defined context associated with the send request will be returned to the user through the
work completion work request identifier (wr_id) field.
6.2.10 rdma_get_send_comp
Template: int rdma_get_send_comp(struct rdma_cm_id *id, struct ibv_wc *wc)
Input Parameters:
Output Parameters:
Return Value: A non-negative value (0 or 1) equal to the number of completions found on success, or
-1 on failure. If the call fails, errno will be set to indicate the reason for the failure.
Description: rdma_get_send_comp retrieves a completed work request for a send, RDMA read or
RDMA write operation. Information about the completed request is returned through the ibv_wc, wc
parameter, with the wr_id set to the context of the request. Please see ibv_poll_cq for details on
the work completion structure, ibv_wc.
Please note that this call polls the send completion queue associated with the rdma_cm_id, id. If a
completion is not found, the call blocks until a request completes. Therefore, the call should only
be used on rdma_cm_ids which do not share CQs with other rdma_cm_ids and which maintain
separate CQs for send and receive completions.
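The context round trip through wr_id can be illustrated with a mock completion queue. struct mock_wc is a stand-in for struct ibv_wc, and mock_get_send_comp returns -1 instead of blocking when the queue is empty, unlike the real rdma_get_send_comp, which blocks until a request completes.

```c
#include <stdint.h>
#include <assert.h>

/* Stand-in for struct ibv_wc: just the fields used here. */
struct mock_wc {
    uint64_t wr_id;   /* the user context posted with the request */
    int      status;  /* 0 stands in for IBV_WC_SUCCESS           */
};

/* Tiny fixed-size ring standing in for the send CQ. */
struct mock_cq {
    struct mock_wc q[8];
    int head, tail;
};

/* The device completes a request: the context posted with the work
 * request comes back verbatim as wr_id. */
static void mock_complete(struct mock_cq *cq, uint64_t ctx)
{
    cq->q[cq->tail % 8].wr_id = ctx;
    cq->q[cq->tail % 8].status = 0;
    cq->tail++;
}

/* Mock of rdma_get_send_comp: returns 1 with a completion, or -1 if
 * none is available (the real call would block instead). */
static int mock_get_send_comp(struct mock_cq *cq, struct mock_wc *wc)
{
    if (cq->head == cq->tail)
        return -1;
    *wc = cq->q[cq->head % 8];
    cq->head++;
    return 1;
}
```

The wr_id round trip is what lets an application map each completion back to the buffer or request state it posted.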
6.2.11 rdma_get_recv_comp
Template: int rdma_get_recv_comp(struct rdma_cm_id *id, struct ibv_wc *wc)
Input Parameters:
id A reference to the communication identifier to check for completions
Output Parameters:
Return Value: A non-negative value (0 or 1) equal to the number of completions found on success, or
-1 on failure. If the call fails, errno will be set to indicate the reason for the failure.
Description: rdma_get_recv_comp retrieves a completed work request for a receive operation.
Information about the completed request is returned through the ibv_wc, wc parameter, with the
wr_id set to the context of the request. This call polls the receive completion queue associated with
the rdma_cm_id, id, and blocks until a request completes if no completion is found.
7 Events
This chapter describes the details of the events that occur when using the VPI API.
7.1.2 IBV_EVENT_QP_FATAL
This event is generated when an error occurs on a Queue Pair (QP) which prevents the generation of
completions while accessing or processing the Work Queue on either the Send or Receive Queues.
The user must modify the QP state to Reset for recovery. It is the responsibility of the software to
ensure that all error processing is completed prior to calling the modify QP verb to change the QP
state to Reset.
If the problem that caused this event is in the CQ of that Work Queue, the appropriate CQ will also
receive the IBV_EVENT_CQ_ERR event. In the event of a CQ error, it is best to destroy and recreate
the resources.
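The Reset-based recovery that this and the following QP error events require can be sketched as a small state machine. The enum and mock_modify_qp below are simplified stand-ins for the verbs QP states and ibv_modify_qp; the sketch enforces that an errored QP must pass through Reset before it can be brought back through Init and RTR to RTS.

```c
#include <assert.h>

/* Local stand-in for the relevant ibv_qp_state values. */
enum mock_qps { QPS_RESET, QPS_INIT, QPS_RTR, QPS_RTS, QPS_ERR };

/* Mock of ibv_modify_qp restricted to the transitions used here:
 * any state -> Reset is allowed, otherwise only the normal
 * Reset -> Init -> RTR -> RTS bring-up order. */
static int mock_modify_qp(enum mock_qps *cur, enum mock_qps next)
{
    if (next == QPS_RESET) { *cur = QPS_RESET; return 0; }
    if ((*cur == QPS_RESET && next == QPS_INIT) ||
        (*cur == QPS_INIT  && next == QPS_RTR)  ||
        (*cur == QPS_RTR   && next == QPS_RTS)) {
        *cur = next;
        return 0;
    }
    return -1;  /* illegal transition, e.g. Error -> RTS directly */
}

/* Recovery sequence after a fatal QP event: finish all error
 * processing first (not shown), then cycle the QP back to RTS. */
static int recover_qp(enum mock_qps *cur)
{
    if (mock_modify_qp(cur, QPS_RESET)) return -1;
    if (mock_modify_qp(cur, QPS_INIT))  return -1;
    if (mock_modify_qp(cur, QPS_RTR))   return -1;
    return mock_modify_qp(cur, QPS_RTS);
}
```

In a real program each transition also carries the attribute mask and values that ibv_modify_qp requires for that step.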
7.1.3 IBV_EVENT_QP_REQ_ERR
This event is generated when the transport layer of the RDMA device detects a transport error
violation on the responder side. The error may be caused by the use of an unsupported or reserved
opcode, or the use of an out of sequence opcode.
These errors are rare but may occur when there are problems in the subnet or when an RDMA device
sends illegal packets.
When this happens, the QP is automatically transitioned to the IBV_QPS_ERR state by the RDMA
device. The user must modify the states of any such QPs from the error state to the Reset state for
recovery.
This event applies only to RC QPs.
7.1.4 IBV_EVENT_QP_ACCESS_ERR
This event is generated when the transport layer of the RDMA device detects a request error
violation on the responder side. The error may be caused by
• Misaligned atomic request
• Too many RDMA Read or Atomic requests
• R_Key violation
• Length errors without immediate data
When this happens, the QP is automatically transitioned to the IBV_QPS_ERR state by the RDMA
device. The user must modify the QP state to Reset for recovery.
This event is relevant only to RC QPs.
7.1.5 IBV_EVENT_COMM_EST
This event is generated when communication is established on a given QP. This event implies that a
QP whose state is IBV_QPS_RTR has received the first packet in its Receive Queue and the packet
was processed without error.
This event is relevant only to connection oriented QPs (RC and UC QPs). It may be generated for UD
QPs as well but that is driver implementation specific.
7.1.6 IBV_EVENT_SQ_DRAINED
This event is generated when all outstanding messages have been drained from the Send Queue (SQ)
of a QP whose state has now changed from IBV_QPS_RTS to IBV_QPS_SQD. For RC QPs, this means
that all messages have received acknowledgements, as appropriate.
Generally, this event will be generated when the internal QP state changes from SQD.draining to
SQD.drained. The event may also be generated if the transition to the state IBV_QPS_SQD is aborted
because of a transition, either by the RDMA device or by the user, into the IBV_QPS_SQE,
IBV_QPS_ERR or IBV_QPS_RESET QP states.
After this event, and after ensuring that the QP is in the IBV_QPS_SQD state, it is safe for the user
to start modifying the Send Queue attributes, since there are no longer any send messages in
progress. Thus it is now safe to modify the operational characteristics of the QP and transition it
back to the fully operational RTS state.
7.1.7 IBV_EVENT_PATH_MIG
This event is generated when a connection successfully migrates to an alternate path. The event is
relevant only for connection oriented QPs, that is, it is relevant only for RC and UC QPs.
When this event is generated, it means that the alternate path attributes are now in use as the
primary path attributes. If it is necessary to load attributes for another alternate path, the user may
do that after this event is generated.
7.1.8 IBV_EVENT_PATH_MIG_ERR
This event is generated when an error occurs which prevents a QP which has alternate path
attributes loaded from performing a path migration change. The attempt to effect the path
migration may have been attempted automatically by the RDMA device or explicitly by the user.
This error usually occurs if the alternate path attributes are not consistent on the two ends of the
connection. It could be, for example, that the DLID is not set correctly or that the source port is
invalid. The event may also occur if a cable to the alternate port is unplugged.
7.1.9 IBV_EVENT_DEVICE_FATAL
This event is generated when a catastrophic error is encountered on the channel adapter. The port
and possibly the channel adapter becomes unusable.
When this event occurs, the behavior of the RDMA device is undetermined and it is highly
recommended to close the process immediately. Trying to destroy the RDMA resources may fail and
thus the device may be left in an unstable state.
7.1.10 IBV_EVENT_PORT_ACTIVE
This event is generated when the link on a given port transitions to the active state. The link is now
available for send/receive packets.
This event means that the port_attr.state has moved from one of the following states
• IBV_PORT_DOWN
• IBV_PORT_INIT
• IBV_PORT_ARMED
to either
• IBV_PORT_ACTIVE
• IBV_PORT_ACTIVE_DEFER
This might happen for example when the SM configures the port.
The event is generated by the device only if the IBV_DEVICE_PORT_ACTIVE_EVENT attribute is set in
the dev_cap.device_cap_flags.
7.1.11 IBV_EVENT_PORT_ERR
This event is generated when the link on a given port becomes inactive and is thus unavailable to
send/receive packets.
The port_attr.state must have been in either the IBV_PORT_ACTIVE or IBV_PORT_ACTIVE_DEFER
state and transitions to one of the following states:
• IBV_PORT_DOWN
• IBV_PORT_INIT
• IBV_PORT_ARMED
This can happen when there are connectivity problems within the IB fabric, for example when a
cable is accidentally pulled.
This will not affect the QPs associated with this port, although if this is a reliable connection, the
retry count may be exceeded if the link takes a long time to come back up.
7.1.12 IBV_EVENT_LID_CHANGE
The event is generated when the LID on a given port changes. This is done by the SM. If this is not
the first time that the SM configures the port LID, it may indicate that there is a new SM on the
subnet or that the SM has reconfigured the subnet. If the user cached the structure returned from
ibv_query_port(), then these values must be flushed when this event occurs.
7.1.13 IBV_EVENT_PKEY_CHANGE
This event is generated when the P_Key table changes on a given port. The PKEY table is configured
by the SM and this also means that the SM can change it. When that happens, an
IBV_EVENT_PKEY_CHANGE event is generated.
Since QPs use P_Key table indexes rather than absolute P_Key values, it is suggested that clients
check that the P_Key indexes used by their QPs still reference the intended values as a result of this
event.
If a user caches the values of the P_Key table, then these must be flushed when the
IBV_EVENT_PKEY_CHANGE event is received.
7.1.14 IBV_EVENT_SM_CHANGE
This event is generated when the SM being used at a given port changes. The user application must
re-register with the new SM. This means that all subscriptions previously registered from the given
port, such as one to join a multicast group, must be re-registered.
7.1.15 IBV_EVENT_SRQ_ERR
This event is generated when an error occurs on a Shared Receive Queue (SRQ) which prevents the
RDMA device from dequeuing WRs from the SRQ and reporting of receive completions.
When an SRQ experiences this error, all the QPs associated with this SRQ will be transitioned to the
IBV_QPS_ERR state and the IBV_EVENT_QP_FATAL asynchronous event will be generated for them.
Any QPs which have transitioned to the error state must have their state modified to Reset for
recovery.
7.1.16 IBV_EVENT_SRQ_LIMIT_REACHED
This event is generated when the SRQ limit is reached, that is, when the number of outstanding
SRQ Work Requests (WRs) drops below the SRQ limit. This event may be used by the user as an
indicator that more WRs need to be posted to the SRQ and that the limit should be rearmed.
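The limit/rearm cycle can be sketched with simple bookkeeping. struct mock_srq and the helpers below are hypothetical stand-ins; in the real API the limit is rearmed with ibv_modify_srq (IBV_SRQ_LIMIT) and the event arrives asynchronously, but the one-shot watermark behavior is the same.

```c
#include <assert.h>

/* Minimal SRQ accounting: posted WRs, the armed watermark. */
struct mock_srq {
    int posted;  /* receive WRs currently posted on the SRQ  */
    int limit;   /* low watermark that triggers the event    */
    int armed;   /* the limit event is one-shot until rearmed */
};

/* A receive consumes one WR from the SRQ; returns 1 when the one-shot
 * limit event (IBV_EVENT_SRQ_LIMIT_REACHED) would fire. */
static int mock_consume_wr(struct mock_srq *srq)
{
    if (srq->posted > 0)
        srq->posted--;
    if (srq->armed && srq->posted < srq->limit) {
        srq->armed = 0;   /* event fires once, then must be rearmed */
        return 1;
    }
    return 0;
}

/* Event handler: top the SRQ back up and rearm the watermark
 * (ibv_modify_srq with IBV_SRQ_LIMIT in the real API). */
static void mock_replenish(struct mock_srq *srq, int n)
{
    srq->posted += n;
    srq->armed = 1;
}
```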
7.1.17 IBV_EVENT_QP_LAST_WQE_REACHED
This event is generated when a QP which is associated with an SRQ is transitioned into the
IBV_QPS_ERR state either automatically by the RDMA device or explicitly by the user. This may have
happened either because a completion with error was generated for the last WQE, or the QP
transitioned into the IBV_QPS_ERR state and there are no more WQEs on the Receive Queue of the
QP.
This event actually means that no more WQEs will be consumed from the SRQ by this QP.
If an error occurs to a QP and this event is not generated, the user must destroy all of the QPs
associated with this SRQ as well as the SRQ itself in order to reclaim all of the WQEs associated with
the offending QP. At the minimum, the QP which is in the error state must have its state changed to
Reset for recovery.
7.1.18 IBV_EVENT_CLIENT_REREGISTER
This event is generated when the SM sends a request to a given port for client re-registration for all
subscriptions previously requested for the port. This could happen if the SM suffers a failure and as
a result, loses its own records of the subscriptions. It may also happen if a new SM becomes
operational on the subnet.
The event will be generated by the device only if the client reregistration support bit is set in
port_attr.port_cap_flags.
7.1.19 IBV_EVENT_GID_CHANGE
This event is generated when a GID changes on a given port. The GID table is configured by the SM
and this also means that the SM can change it. When that happens, an IBV_EVENT_GID_CHANGE
event is generated. If a user caches the values of the GID table, then these must be flushed when
the IBV_EVENT_GID_CHANGE event is received.
7.2.2 IBV_WC_LOC_LEN_ERR
This event is generated when the receive buffer is smaller than the incoming send. It is generated
on the receiver side of the connection.
7.2.3 IBV_WC_LOC_QP_OP_ERR
This event is generated when a QP error occurs. For example, it may be generated if a user neglects
to specify responder_resources and initiator_depth values in struct rdma_conn_param before calling
rdma_connect() on the client side and rdma_accept() on the server side.
7.2.4 IBV_WC_LOC_EEC_OP_ERR
This event is generated when there is an error related to the local EEC's receive logic while
executing the request packet. The responder is unable to complete the request. This error is not
caused by the sender.
7.2.5 IBV_WC_LOC_PROT_ERR
This event is generated when a user attempts to access an address outside of the registered memory
region. For example, this may happen if the Lkey does not match the address in the WR.
7.2.6 IBV_WC_WR_FLUSH_ERR
This event is generated when a work request is flushed with an error because it was posted or still
outstanding when the QP transitioned into the error state. Such requests are completed without
being processed.
7.2.7 IBV_WC_MW_BIND_ERR
This event is generated when a memory management operation error occurs. The error may be due
to the fact that the memory window and the QP belong to different protection domains. It may also
be that the memory window is not allowed to be bound to the specified MR or the access
permissions may be wrong.
7.2.8 IBV_WC_BAD_RESP_ERR
This event is generated when an unexpected transport layer opcode is returned by the responder.
7.2.9 IBV_WC_LOC_ACCESS_ERR
This event is generated when a local protection error occurs on a local data buffer during the
process of an RDMA Write with Immediate Data operation sent from the remote node.
7.2.10 IBV_WC_REM_INV_REQ_ERR
This event is generated when the receive buffer is smaller than the incoming send. It is generated
on the sender side of the connection. It may also be generated if the QP attributes are not set
correctly, particularly those governing MR access.
7.2.11 IBV_WC_REM_ACCESS_ERR
This event is generated when a protection error occurs on a remote data buffer to be read by an
RDMA Read, written by an RDMA Write or accessed by an atomic operation. The error is reported
only on RDMA operations or atomic operations.
7.2.12 IBV_WC_REM_OP_ERR
This event is generated when an operation cannot be completed successfully by the responder. The
failure to complete the operation may be due to QP related errors which prevent the responder
from completing the request or a malformed WQE on the Receive Queue.
7.2.13 IBV_WC_RETRY_EXC_ERR
This event is generated when a sender is unable to receive feedback from the receiver. This means
that either the receiver never ACKs the sender's messages within the specified time period, or it has
been disconnected, or it is in a bad state which prevents it from responding.
7.2.14 IBV_WC_RNR_RETRY_EXC_ERR
This event is generated when the RNR NAK retry count is exceeded. This may be caused by lack of
receive buffers on the responder side.
7.2.15 IBV_WC_LOC_RDD_VIOL_ERR
This event is generated when the RDD associated with the QP does not match the RDD associated
with the EEC.
7.2.16 IBV_WC_REM_INV_RD_REQ_ERR
This event is generated when the responder detects an invalid incoming RD message. The message
may be invalid because it has an invalid Q_Key or there may be a Reliable Datagram Domain (RDD)
violation.
7.2.17 IBV_WC_REM_ABORT_ERR
This event is generated when an error occurs on the responder side which causes it to abort the
operation.
7.2.18 IBV_WC_INV_EECN_ERR
This event is generated when an invalid End to End Context Number (EECN) is detected.
7.2.19 IBV_WC_INV_EEC_STATE_ERR
This event is generated when an illegal operation is detected in a request for the specified EEC
state.
7.2.20 IBV_WC_FATAL_ERR
This event is generated when a fatal transport error occurs. The user may have to restart the RDMA
device driver or reboot the server to recover from the error.
7.2.21 IBV_WC_RESP_TIMEOUT_ERR
This event is generated when the responder is unable to respond to a request within the timeout
period. It generally indicates that the receiver is not ready to process requests.
7.2.22 IBV_WC_GENERAL_ERR
This event is generated when there is a transport error which cannot be described by the other
specific events discussed here.
7.3.2 RDMA_CM_EVENT_ADDR_ERROR
This event is generated on the client (active) side. It is generated in response to
rdma_resolve_addr() in the case where an error occurs. This may happen, for example, if the
device cannot be found, such as when a user supplies an incorrect device. Specifically, if the
remote device has both Ethernet and IB interfaces, and the client side supplies the Ethernet device
name instead of the IB device name of the server side, an RDMA_CM_EVENT_ADDR_ERROR will be generated.
7.3.3 RDMA_CM_EVENT_ROUTE_RESOLVED
This event is generated on the client (active) side in response to rdma_resolve_route(). It is
generated when the system is able to resolve a route to the server address supplied by the client.
7.3.4 RDMA_CM_EVENT_ROUTE_ERROR
This event is generated when rdma_resolve_route() fails.
7.3.5 RDMA_CM_EVENT_CONNECT_REQUEST
This event is generated on the passive side of the connection to notify the user that a new
connection request has been received.
7.3.6 RDMA_CM_EVENT_CONNECT_RESPONSE
This event may be generated on the active side of the connection to notify the user that the con-
nection request has been successful. The event is only generated on rdma_cm_ids which do not have
a QP associated with them.
7.3.7 RDMA_CM_EVENT_CONNECT_ERROR
This event may be generated on the active or passive side of the connection. It is generated when
an error occurs while attempting to establish a connection.
7.3.8 RDMA_CM_EVENT_UNREACHABLE
This event is generated on the active side of a connection. It indicates that the (remote) server is
unreachable or unable to respond to a connection request.
7.3.9 RDMA_CM_EVENT_REJECTED
This event may be generated on the client (active) side and indicates that a connection request or
response has been rejected by the remote device. This may happen, for example, if an attempt is
made to connect to the remote end point on the wrong port.
7.3.10 RDMA_CM_EVENT_ESTABLISHED
This event is generated on both sides of a connection. It indicates that a connection has been
established with the remote end point.
7.3.11 RDMA_CM_EVENT_DISCONNECTED
This event is generated on both sides of the connection in response to rdma_disconnect(). The event
will be generated to indicate that the connection between the local and remote devices has been
disconnected. Any associated QP will transition to the error state. All posted work requests are
flushed. The user must change any such QP's state to Reset for recovery.
7.3.12 RDMA_CM_EVENT_DEVICE_REMOVAL
This event is generated when the RDMA CM indicates that the device associated with the
rdma_cm_id has been removed. Upon receipt of this event, the user must destroy the related
rdma_cm_id.
7.3.13 RDMA_CM_EVENT_MULTICAST_JOIN
This event is generated in response to rdma_join_multicast(). It indicates that the multicast join
operation has completed successfully.
7.3.14 RDMA_CM_EVENT_MULTICAST_ERROR
This event is generated when an error occurs while attempting to join a multicast group or on an
existing multicast group if the group had already been joined. When this happens, the multicast
group will no longer be accessible and must be rejoined if necessary.
7.3.15 RDMA_CM_EVENT_ADDR_CHANGE
This event is generated when the network device associated with this ID through address resolution
changes its hardware address. For example, this may happen following bonding fail over. This event
may serve to aid applications which want the links used for their RDMA sessions to align with the
network stack.
7.3.16 RDMA_CM_EVENT_TIMEWAIT_EXIT
This event is generated when the QP associated with the connection has exited its timewait state
and is now ready to be re-used. After a QP has been disconnected, it is maintained in a timewait
state to allow any in flight packets to exit the network. After the timewait state has completed, the
rdma_cm will report this event.
8 Programming Examples Using IBV Verbs
This chapter provides code examples using the IBV Verbs API.
8.1.1 Main
Parse command line. The user may set the TCP port, device name, and device port for the test. If
set, these values will override default values in config. The last parameter is the server name. If the
server name is set, this designates a server to connect to and therefore puts the program into client
mode. Otherwise the program is in server mode.
Call print_config.
Call resources_init.
Call resources_create.
Call connect_qp.
Call poll_completion. Note that the server side expects a completion from the SEND request and the
client side expects a RECEIVE completion.
If in client mode, show the message we received via the RECEIVE operation, otherwise, if we are in
server mode, load the buffer with a new message.
Sync client<->server.
At this point the server goes directly to the next sync. All RDMA operations are done strictly by the
client.
If in client mode, call post_send with IBV_WR_RDMA_READ to read the server's buffer, then call poll_completion.
If in client mode, call post_send with IBV_WR_RDMA_WRITE to write the server's buffer, then call poll_completion.
Sync client<->server.
Call resources_destroy.
Free device name string.
Done.
8.1.2 print_config
Print out configuration information.
8.1.3 resources_init
Clears resources struct.
8.1.4 resources_create
Call sock_connect to connect a TCP socket to the peer.
Get the list of devices, locate the one we want, and open it.
Create a PD.
Create a CQ.
Create a QP.
8.1.5 sock_connect
If client, resolve DNS address of server and initiate a connection to it.
8.1.6 connect_qp
Call modify_qp_to_init.
Call post_receive.
Call modify_qp_to_rtr.
Call modify_qp_to_rts.
8.1.7 modify_qp_to_init
Transition QP to INIT state.
8.1.8 post_receive
Prepare a scatter/gather entry for the receive buffer.
Prepare an RR.
8.1.9 sock_sync_data
Using the TCP socket created with sock_connect, synchronize the given set of data between client
and the server. Since this function is blocking, it is also called with dummy data to synchronize the
timing of the client and server.
8.1.10 modify_qp_to_rtr
Transition QP to RTR state.
8.1.11 modify_qp_to_rts
Transition QP to RTS state.
8.1.12 post_send
Prepare a scatter/gather entry for data to be sent (or received in RDMA read case).
8.1.13 poll_completion
Poll CQ until an entry is found or MAX_POLL_CQ_TIMEOUT milliseconds are reached.
8.1.14 resources_destroy
Release/free/deallocate all items in resource struct.
*/
/******************************************************************************
*
* This code demonstrates how to perform the following operations using the
* VPI Verbs API:
*
* Send
* Receive
* RDMA Read
* RDMA Write
*
*****************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <stdint.h>
#include <inttypes.h>
#include <endian.h>
#include <byteswap.h>
#include <getopt.h>
#include <sys/time.h>
#include <arpa/inet.h>
#include <infiniband/verbs.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netdb.h>
#if __BYTE_ORDER == __LITTLE_ENDIAN
static inline uint64_t htonll(uint64_t x) { return bswap_64(x); }
static inline uint64_t ntohll(uint64_t x) { return bswap_64(x); }
#else
static inline uint64_t htonll(uint64_t x) { return x; }
static inline uint64_t ntohll(uint64_t x) { return x; }
#endif
struct config_t
{
const char *dev_name; /* IB device name */
char *server_name; /* server host name */
u_int32_t tcp_port; /* server TCP port */
int ib_port; /* local IB port to work with */
int gid_idx; /* gid index to use */
};
struct cm_con_data_t
{
uint64_t addr; /* Buffer address */
uint32_t rkey; /* Remote key */
uint32_t qp_num; /* QP number */
uint16_t lid; /* LID of the IB port */
uint8_t gid[16]; /* gid */
} __attribute__ ((packed));
struct resources
{
struct ibv_port_attr port_attr; /* IB port attributes */
struct cm_con_data_t remote_props; /* values to connect to remote side */
struct ibv_context *ib_ctx; /* device handle */
struct ibv_pd *pd; /* PD handle */
struct ibv_cq *cq; /* CQ handle */
struct ibv_qp *qp; /* QP handle */
struct ibv_mr *mr; /* MR handle for buf */
char *buf; /* memory buffer pointer, used for RDMA and send ops */
int sock; /* TCP socket file descriptor */
};
struct config_t config =
{
NULL, /* dev_name */
NULL, /* server_name */
19875, /* tcp_port */
1, /* ib_port */
-1 /* gid_idx */
};
/******************************************************************************
Socket operations
For simplicity, the example program uses TCP sockets to exchange control
information. If a TCP/IP connection is not available, a connection manager
(CM) may be used to pass this information. Use of CM is beyond the scope of
this example
******************************************************************************/
/******************************************************************************
* Function: sock_connect
*
* Input
* Output
* none
*
* Returns
*
* Description
******************************************************************************/
struct addrinfo *resolved_addr = NULL;
struct addrinfo *iterator;
char service[6];
int sockfd = -1;
int listenfd = 0;
int tmp;
struct addrinfo hints =
{
.ai_flags = AI_PASSIVE,
.ai_family = AF_INET,
.ai_socktype = SOCK_STREAM
};
if (sockfd < 0)
{
if (sockfd >= 0)
if (servername)
/* Client mode. Initiate connection to remote */
{
fprintf(stdout, "failed connect \n");
close(sockfd);
sockfd = -1;
else
{
listenfd = sockfd;
sockfd = -1;
goto sock_connect_exit;
listen(listenfd, 1);
sockfd = accept(listenfd, NULL, 0);
sock_connect_exit:
if(listenfd)
close(listenfd);
if(resolved_addr)
freeaddrinfo(resolved_addr);
if (sockfd < 0)
if(servername)
else
{
perror("server accept");
}
}
return sockfd;
/******************************************************************************
* Function: sock_sync_data
* Input
* Output
* Returns
* Description
* Sync data across a socket. The indicated local data will be sent to the
* remote. It will then wait for the remote to send its data back. It is
* assumed that the two sides are in sync and call this function in the proper
* order. Chaos will ensue if they are not. :)
* Also note this is a blocking function and will wait for the full data to be
* received from the remote.
******************************************************************************/
int rc;
int read_bytes = 0;
int total_read_bytes = 0;
else
rc = 0;
total_read_bytes += read_bytes;
else
rc = read_bytes;
return rc;
/******************************************************************************
******************************************************************************/
/* poll_completion */
/******************************************************************************
* Function: poll_completion
*
* Input
* Output
* none
* Returns
* 0 on success, 1 on failure
*
* Description
* Poll the completion queue for a single event. This function will continue to
******************************************************************************/
static int poll_completion(struct resources *res)
{
struct ibv_wc wc;
unsigned long start_time_msec;
unsigned long cur_time_msec;
struct timeval cur_time;
int poll_result;
int rc = 0;
/* poll the completion for a while before giving up on getting it */
gettimeofday(&cur_time, NULL);
start_time_msec = (cur_time.tv_sec * 1000) + (cur_time.tv_usec / 1000);
do
{
poll_result = ibv_poll_cq(res->cq, 1, &wc);
gettimeofday(&cur_time, NULL);
cur_time_msec = (cur_time.tv_sec * 1000) + (cur_time.tv_usec / 1000);
} while ((poll_result == 0) && ((cur_time_msec - start_time_msec) < MAX_POLL_CQ_TIMEOUT));
if (poll_result < 0)
{
/* poll CQ failed */
fprintf(stderr, "poll CQ failed\n");
rc = 1;
}
else if (poll_result == 0)
{
/* the CQ is empty */
fprintf(stderr, "completion wasn't found in the CQ after timeout\n");
rc = 1;
}
else
{
/* CQE found */
/* check the completion status (here we don't care about the completion opcode) */
if (wc.status != IBV_WC_SUCCESS)
{
fprintf(stderr, "got bad completion with status: 0x%x, vendor syndrome: 0x%x\n", wc.status, wc.vendor_err);
rc = 1;
}
return rc;
/******************************************************************************
* Function: post_send
*
* Input
*
* Output
* none
* Returns
* Description
******************************************************************************/
static int post_send(struct resources *res, int opcode)
{
struct ibv_send_wr sr;
struct ibv_sge sge;
struct ibv_send_wr *bad_wr = NULL;
int rc;
/* prepare the scatter/gather entry */
memset(&sge, 0, sizeof(sge));
sge.addr = (uintptr_t)res->buf;
sge.length = MSG_SIZE;
sge.lkey = res->mr->lkey;
memset(&sr, 0, sizeof(sr));
sr.next = NULL;
sr.wr_id = 0;
sr.sg_list = &sge;
sr.num_sge = 1;
sr.opcode = opcode;
sr.send_flags = IBV_SEND_SIGNALED;
if(opcode != IBV_WR_SEND)
sr.wr.rdma.remote_addr = res->remote_props.addr;
sr.wr.rdma.rkey = res->remote_props.rkey;
/* there is a Receive Request in the responder side, so we won't get any into RNR flow */
rc = ibv_post_send(res->qp, &sr, &bad_wr);
if (rc)
switch(opcode)
case IBV_WR_SEND:
fprintf(stdout, "Send Request was posted\n");
break;
case IBV_WR_RDMA_READ:
case IBV_WR_RDMA_WRITE:
default:
break;
}
return rc;
/******************************************************************************
* Function: post_receive
*
* Input
*
* Output
* none
*
* Returns
* Description
*
******************************************************************************/
static int post_receive(struct resources *res)
{
struct ibv_recv_wr rr;
struct ibv_sge sge;
struct ibv_recv_wr *bad_wr;
int rc;
memset(&sge, 0, sizeof(sge));
sge.addr = (uintptr_t)res->buf;
sge.length = MSG_SIZE;
sge.lkey = res->mr->lkey;
memset(&rr, 0, sizeof(rr));
rr.next = NULL;
rr.wr_id = 0;
rr.sg_list = &sge;
rr.num_sge = 1;
rc = ibv_post_recv(res->qp, &rr, &bad_wr);
if (rc)
fprintf(stderr, "failed to post RR\n");
return rc;
/******************************************************************************
* Function: resources_init
* Input
* Output
* res is initialized
*
* Returns
* none
* Description
* res is initialized to default values
******************************************************************************/
static void resources_init(struct resources *res)
{
memset(res, 0, sizeof *res);
res->sock = -1;
}
/******************************************************************************
* Function: resources_create
* Input
*
* Output
* Returns
* 0 on success, 1 on failure
*
* Description
* This function creates and allocates all necessary system resources. These
*****************************************************************************/
static int resources_create(struct resources *res)
{
struct ibv_qp_init_attr qp_init_attr;
struct ibv_device *ib_dev = NULL;
struct ibv_device **dev_list = NULL;
size_t size;
int i;
int mr_flags = 0;
int cq_size = 0;
int num_devices;
int rc = 0;
/* if client side */
if (config.server_name)
config.server_name, config.tcp_port);
rc = -1;
goto resources_create_exit;
}
else
if (res->sock < 0)
rc = -1;
goto resources_create_exit;
if (!dev_list)
{
fprintf(stderr, "failed to get IB devices list\n");
rc = 1;
goto resources_create_exit;
if (!num_devices)
{
rc = 1;
goto resources_create_exit;
if(!config.dev_name)
{
config.dev_name = strdup(ibv_get_device_name(dev_list[i]));
fprintf(stdout, "device not specified, using first one found: %s\n", config.dev_name);
if (!strcmp(ibv_get_device_name(dev_list[i]), config.dev_name))
{
ib_dev = dev_list[i];
break;
/* if the device wasn't found in host */
if (!ib_dev)
goto resources_create_exit;
if (!res->ib_ctx)
rc = 1;
goto resources_create_exit;
ibv_free_device_list(dev_list);
dev_list = NULL;
ib_dev = NULL;
goto resources_create_exit;
}
if (!res->pd)
{
goto resources_create_exit;
/* each side will send only one WR, so Completion Queue with 1 entry is enough */
cq_size = 1;
if (!res->cq)
goto resources_create_exit;
size = MSG_SIZE;
if (!res->buf )
{
rc = 1;
goto resources_create_exit;
memset(res->buf, 0 , size);
/* only in the server side put the message in the memory buffer */
if (!config.server_name)
{
strcpy(res->buf, MSG);
else
memset(res->buf, 0, size);
goto resources_create_exit;
}
qp_init_attr.qp_type = IBV_QPT_RC;
qp_init_attr.sq_sig_all = 1;
qp_init_attr.send_cq = res->cq;
qp_init_attr.recv_cq = res->cq;
qp_init_attr.cap.max_send_wr = 1;
qp_init_attr.cap.max_recv_wr = 1;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;
if (!res->qp)
rc = 1;
goto resources_create_exit;
resources_create_exit:
if(rc)
if(res->qp)
ibv_destroy_qp(res->qp);
res->qp = NULL;
}
if(res->mr)
ibv_dereg_mr(res->mr);
res->mr = NULL;
}
if(res->buf)
{
free(res->buf);
res->buf = NULL;
if(res->cq)
{
ibv_destroy_cq(res->cq);
res->cq = NULL;
if(res->pd)
{
ibv_dealloc_pd(res->pd);
res->pd = NULL;
}
if(res->ib_ctx)
ibv_close_device(res->ib_ctx);
res->ib_ctx = NULL;
if(dev_list)
ibv_free_device_list(dev_list);
dev_list = NULL;
}
if (res->sock >= 0)
{
if (close(res->sock))
return rc;
}
/******************************************************************************
* Function: modify_qp_to_init
*
* Input
* qp QP to transition
* Output
* none
*
* Returns
* 0 on success, ibv_modify_qp failure code on failure
*
* Description
* Transition a QP from the RESET to INIT state
******************************************************************************/
static int modify_qp_to_init(struct ibv_qp *qp)
{
struct ibv_qp_attr attr;
int flags;
int rc;
memset(&attr, 0, sizeof(attr));
attr.qp_state = IBV_QPS_INIT;
attr.port_num = config.ib_port;
attr.pkey_index = 0;
attr.qp_access_flags = IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ | IBV_ACCESS_REMOTE_WRITE;
flags = IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS;
rc = ibv_modify_qp(qp, &attr, flags);
if (rc)
fprintf(stderr, "failed to modify QP state to INIT\n");
return rc;
}
/******************************************************************************
* Function: modify_qp_to_rtr
* Input
* qp QP to transition
* remote_qpn remote QP number
* dlid destination LID
* dgid destination GID (mandatory for RoCEE)
* Output
* none
*
* Returns
* Description
* Transition a QP from the INIT to RTR state, using the specified QP number
******************************************************************************/
static int modify_qp_to_rtr(struct ibv_qp *qp, uint32_t remote_qpn, uint16_t dlid, uint8_t *dgid)
{
struct ibv_qp_attr attr;
int flags;
int rc;
memset(&attr, 0, sizeof(attr));
attr.qp_state = IBV_QPS_RTR;
attr.path_mtu = IBV_MTU_256;
attr.dest_qp_num = remote_qpn;
attr.rq_psn = 0;
attr.max_dest_rd_atomic = 1;
attr.min_rnr_timer = 0x12;
attr.ah_attr.is_global = 0;
attr.ah_attr.dlid = dlid;
attr.ah_attr.sl = 0;
attr.ah_attr.src_path_bits = 0;
attr.ah_attr.port_num = config.ib_port;
if (config.gid_idx >= 0)
{
attr.ah_attr.is_global = 1;
attr.ah_attr.port_num = 1;
memcpy(&attr.ah_attr.grh.dgid, dgid, 16);
attr.ah_attr.grh.hop_limit = 1;
attr.ah_attr.grh.sgid_index = config.gid_idx;
attr.ah_attr.grh.traffic_class = 0;
}
flags = IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER;
rc = ibv_modify_qp(qp, &attr, flags);
if (rc)
fprintf(stderr, "failed to modify QP state to RTR\n");
return rc;
}
/******************************************************************************
* Function: modify_qp_to_rts
*
* Input
* qp QP to transition
*
* Output
* none
*
* Returns
* 0 on success, ibv_modify_qp failure code on failure
*
* Description
******************************************************************************/
memset(&attr, 0, sizeof(attr));
attr.qp_state = IBV_QPS_RTS;
attr.timeout = 0x12;
attr.retry_cnt = 6;
attr.rnr_retry = 0;
attr.sq_psn = 0;
attr.max_rd_atomic = 1;
flags = IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC;
rc = ibv_modify_qp(qp, &attr, flags);
if (rc)
fprintf(stderr, "failed to modify QP state to RTS\n");
return rc;
/******************************************************************************
* Function: connect_qp
*
* Input
* Output
* none
*
* Returns
* 0 on success, error code on failure
* Description
* Connect the QP. Transition the server side to RTR, sender side to RTS
******************************************************************************/
static int connect_qp(struct resources *res)
{
struct cm_con_data_t local_con_data;
struct cm_con_data_t remote_con_data;
struct cm_con_data_t tmp_con_data;
int rc = 0;
char temp_char;
union ibv_gid my_gid;
if (config.gid_idx >= 0)
{
if (rc)
fprintf(stderr, "could not get gid for port %d, index %d\n", config.ib_port, config.gid_idx);
return rc;
}
} else
local_con_data.rkey = htonl(res->mr->rkey);
local_con_data.qp_num = htonl(res->qp->qp_num);
local_con_data.lid = htons(res->port_attr.lid);
{
fprintf(stderr, "failed to exchange connection data between sides\n");
rc = 1;
goto connect_qp_exit;
remote_con_data.addr = ntohll(tmp_con_data.addr);
remote_con_data.rkey = ntohl(tmp_con_data.rkey);
remote_con_data.qp_num = ntohl(tmp_con_data.qp_num);
remote_con_data.lid = ntohs(tmp_con_data.lid);
memcpy(remote_con_data.gid, tmp_con_data.gid, 16);
/* save the remote side attributes, we will need it for the post SR */
res->remote_props = remote_con_data;
{
uint8_t *p = remote_con_data.gid;
rc = modify_qp_to_init(res->qp);
if (rc)
{
goto connect_qp_exit;
if (config.server_name)
{
rc = post_receive(res);
if (rc)
{
goto connect_qp_exit;
if (rc)
{
goto connect_qp_exit;
rc = modify_qp_to_rts(res->qp);
if (rc)
/* sync to make sure that both sides are in states that they can connect to prevent packet loss */
if (sock_sync_data(res->sock, 1, "Q", &temp_char)) /* just send a dummy char back and forth */
connect_qp_exit:
return rc;
/******************************************************************************
* Function: resources_destroy
* Input
*
* Output
* none
* Returns
* 0 on success, 1 on failure
*
* Description
* Cleanup and deallocate all resources used
******************************************************************************/
static int resources_destroy(struct resources *res)
{
int rc = 0;
if (res->qp)
if (ibv_destroy_qp(res->qp))
rc = 1;
if (res->mr)
if (ibv_dereg_mr(res->mr))
if (res->buf)
free(res->buf);
if (res->cq)
if (ibv_destroy_cq(res->cq))
{
fprintf(stderr, "failed to destroy CQ\n");
rc = 1;
}
if (res->pd)
if (ibv_dealloc_pd(res->pd))
{
fprintf(stderr, "failed to deallocate PD\n");
rc = 1;
}
if (res->ib_ctx)
if (ibv_close_device(res->ib_ctx))
{
fprintf(stderr, "failed to close device context\n");
rc = 1;
if (res->sock >= 0)
if (close(res->sock))
rc = 1;
return rc;
}
/******************************************************************************
* Function: print_config
* Input
* none
*
* Output
* none
* Returns
* none
*
* Description
******************************************************************************/
if (config.server_name)
if (config.gid_idx >= 0)
/******************************************************************************
* Function: usage
* Input
*
* Output
* none
*
* Returns
* none
* Description
* print a description of command line syntax
******************************************************************************/
{
fprintf(stdout, "Usage:\n");
fprintf(stdout, "\n");
fprintf(stdout, "Options:\n");
fprintf(stdout, " -p, --port <port> listen on/connect to port <port> (default 18515)\n");
fprintf(stdout, " -d, --ib-dev <dev> use IB device <dev> (default first device found)\n");
fprintf(stdout, " -i, --ib-port <port> use port <port> of IB device (default 1)\n");
fprintf(stdout, " -g, --gid_idx <git index> gid index to be used in GRH (default not used)\n");
}
/******************************************************************************
* Function: main
*
* Input
* Output
* none
*
* Returns
* 0 on success, 1 on failure
*
* Description
char temp_char;
while (1)
int c;
};
if (c == -1)
break;
switch (c)
case 'p':
config.tcp_port = strtoul(optarg, NULL, 0);
break;
case 'd':
config.dev_name = strdup(optarg);
break;
case 'i':
usage(argv[0]);
return 1;
}
break;
case 'g':
if (config.gid_idx < 0)
{
usage(argv[0]);
return 1;
break;
default:
usage(argv[0]);
return 1;
config.server_name = argv[optind];
usage(argv[0]);
return 1;
resources_init(&res);
if (resources_create(&res))
{
goto main_exit;
}
if (connect_qp(&res))
{
goto main_exit;
}
if (!config.server_name)
if (post_send(&res, IBV_WR_SEND))
if (poll_completion(&res))
{
goto main_exit;
/* after polling the completion we have the message in the client buffer too */
if (config.server_name)
else
{
/* setup server buffer with read message */
strcpy(res.buf, RDMAMSGR);
/* Sync so we are sure server side has data ready before client tries to read it */
if (sock_sync_data(res.sock, 1, "R", &temp_char)) /* just send a dummy char back and forth */
rc = 1;
goto main_exit;
}
/* Now the client performs an RDMA read and then write on server.
Note that the server has no idea these events have occurred */
if (config.server_name)
if (post_send(&res, IBV_WR_RDMA_READ))
{
fprintf(stderr, "failed to post SR 2\n");
rc = 1;
goto main_exit;
if (poll_completion(&res))
{
fprintf(stderr, "poll completion failed 2\n");
rc = 1;
goto main_exit;
strcpy(res.buf, RDMAMSGW);
if (post_send(&res, IBV_WR_RDMA_WRITE))
goto main_exit;
}
if (poll_completion(&res))
{
rc = 1;
goto main_exit;
}
}
/* Sync so server will know that client is done mucking with its memory */
if (sock_sync_data(res.sock, 1, "W", &temp_char)) /* just send a dummy char back and forth */
rc = 1;
goto main_exit;
if(!config.server_name)
rc = 0;
main_exit:
if (resources_destroy(&res))
{
fprintf(stderr, "failed to destroy resources\n");
rc = 1;
if(config.dev_name)
free((char *) config.dev_name);
return rc;
}
Notes:
1. In order to run the multicast example on either IB or LLE, no change is needed to the test's
code. However, if RDMA_CM is used, the network interface must be configured and up
(whether it is used over RoCE or over IB).
2. For the IB case, a join operation is involved, yet it is performed by the rdma_cm kernel code.
3. For the LLE case, no join is required. All MGIDs are resolved into MACs at the host.
4. To inform the multicast example which port to use, you need to specify "-b <IP address>" to
bind to the desired device port.
8.3.1 Main
1. Get command line parameters.
m - MC address, destination port
M - unmapped MC address, requires also bind address (parameter “b”)
s - sender flag.
b - bind address.
c - connections amount.
C - message count.
S - message size.
p - port space (UDP default; IPoIB)
2. Create event channel to receive asynchronous events.
3. Allocate nodes and create an identifier that is used to track communication information
4. Start the “run” main function.
5. On ending - release and free resources.
8.3.2 Run
1. Get source (if provided for binding) and destination addresses - convert the input addresses
to socket presentation.
2. Joining:
a. For all connections:
if source address is specifically provided, then bind the rdma_cm object to the
corresponding network interface. (Associates a source address with an rdma_cm
identifier).
if unmapped MC address with bind address provided, check the remote address and
then bind.
b. Poll on all the connection events and wait that all rdma_cm objects joined the MC
group.
3. Send & receive:
a. If sender: send the messages to all connection nodes (function “post_sends”).
b. If receiver: poll the completion queue (function “poll_cqs”) till messages arrival.
On ending, release network resources (for each connection: leave the multicast group and detach
its associated QP from the group)
8.4 Code for Multicast Using RDMA_CM and IBV Verbs
Multicast Code Example
/*
* BUILD COMMAND:
* gcc -g -Wall -D_GNU_SOURCE -g -O2 -o examples/mckey examples/mckey.c -libverbs -lrdmacm
*
* $Id$
*/
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <sys/types.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <netdb.h>
#include <byteswap.h>
#include <unistd.h>
#include <getopt.h>
#include <rdma/rdma_cma.h>
struct cmatest_node
{
int id;
struct rdma_cm_id *cma_id;
int connected;
struct ibv_pd *pd;
struct ibv_cq *cq;
struct ibv_mr *mr;
struct ibv_ah *ah;
uint32_t remote_qpn;
uint32_t remote_qkey;
void *mem;
};
struct cmatest
{
struct rdma_event_channel *channel;
struct cmatest_node *nodes;
int conn_index;
int connects_left;
if (message_count && message_size > (1 << (port_attr.active_mtu + 7)))
{
printf("mckey: message_size %d is larger than active mtu %d\n", message_size, 1 << (port_attr.active_mtu + 7));
return -EINVAL;
}
return 0;
}
static int init_node(struct cmatest_node *node)
{
struct ibv_qp_init_attr init_qp_attr;
int cqe, ret;
node->pd = ibv_alloc_pd(node->cma_id->verbs);
if (!node->pd)
{
ret = -ENOMEM;
printf("mckey: unable to allocate PD\n");
goto out;
}
ret = create_message(node);
if (ret)
{
printf("mckey: failed to create messages: %d\n", ret);
goto out;
}
out:
return ret;
}
if (!message_count)
return 0;
recv_wr.next = NULL;
recv_wr.sg_list = &sge;
recv_wr.num_sge = 1;
recv_wr.wr_id = (uintptr_t) node;
if (!node->connected || !message_count)
return 0;
send_wr.next = NULL;
send_wr.sg_list = &sge;
send_wr.num_sge = 1;
send_wr.opcode = IBV_WR_SEND_WITH_IMM;
send_wr.send_flags = signal_flag;
send_wr.wr_id = (unsigned long)node;
send_wr.imm_data = htonl(node->cma_id->qp->qp_num);
send_wr.wr.ud.ah = node->ah;
send_wr.wr.ud.remote_qpn = node->remote_qpn;
send_wr.wr.ud.remote_qkey = node->remote_qkey;
sge.length = message_size;
sge.lkey = node->mr->lkey;
sge.addr = (uintptr_t) node->mem;
for (i = 0; i < message_count && !ret; i++)
{
ret = ibv_post_send(node->cma_id->qp, &send_wr, &bad_send_wr);
if (ret)
printf("failed to post sends: %d\n", ret);
}
return ret;
}
ret = verify_test_params(node);
if (ret)
goto err;
ret = init_node(node);
if (ret)
goto err;
if (!is_sender)
{
ret = post_recvs(node);
if (ret)
goto err;
}
ret = rdma_join_multicast(node->cma_id, test.dst_addr, node);
if (ret)
{
printf("mckey: failure joining: %d\n", ret);
goto err;
}
return 0;
err:
connect_error();
return ret;
}
switch (event->event)
{
case RDMA_CM_EVENT_ADDR_RESOLVED:
ret = addr_handler(cma_id->context);
break;
case RDMA_CM_EVENT_MULTICAST_JOIN:
ret = join_handler(cma_id->context, &event->param.ud);
break;
case RDMA_CM_EVENT_ADDR_ERROR:
case RDMA_CM_EVENT_ROUTE_ERROR:
case RDMA_CM_EVENT_MULTICAST_ERROR:
printf("mckey: event: %s, error: %d\n", rdma_event_str(event->event), event->status);
connect_error();
ret = event->status;
break;
case RDMA_CM_EVENT_DEVICE_REMOVAL:
/* Cleanup will occur after test completes. */
break;
default:
break;
}
return ret;
}
if (node->ah)
ibv_destroy_ah(node->ah);
if (node->cma_id->qp)
rdma_destroy_qp(node->cma_id);
if (node->cq)
ibv_destroy_cq(node->cq);
if (node->mem)
{
ibv_dereg_mr(node->mr);
free(node->mem);
}
if (node->pd)
ibv_dealloc_pd(node->pd);
return ret;
}
if (unmapped_addr)
ret = addr_handler(&test.nodes[i]);
else
ret = rdma_resolve_addr(test.nodes[i].cma_id, test.src_addr, test.dst_addr,
2000);
if (ret)
{
printf("mckey: resolve addr failure: %d\n", ret);
connect_error();
return ret;
}
}
ret = connect_events();
if (ret)
goto out;
/*
* Pause to give SM chance to configure switches. We don't want to
* handle reliability issue in this simple test program.
*/
sleep(3);
if (message_count)
{
if (is_sender)
{
printf("initiating data transfers\n");
for (i = 0; i < connections; i++)
{
ret = post_sends(&test.nodes[i], 0);
if (ret)
goto out;
}
}
else
{
printf("receiving data transfers\n");
ret = poll_cqs();
if (ret)
goto out;
}
printf("data transfers complete\n");
}
out:
for (i = 0; i < connections; i++)
{
ret = rdma_leave_multicast(test.nodes[i].cma_id, test.dst_addr);
if (ret)
printf("mckey: failure leaving: %d\n", ret);
}
return ret;
}
int main(int argc, char **argv)
{
int op, ret;
if (alloc_nodes())
exit(1);
ret = run();
printf("test complete\n");
destroy_nodes();
rdma_destroy_event_channel(test.channel);
printf("return status %d\n", ret);
return ret;
}
* There are two ways to cause the path to be migrated.
* 1. Use the ibv_modify_qp verb to set path_mig_state = IBV_MIG_MIGRATED
* 2. Assuming there are two ports on at least one side of the connection, and
* each port has a path to the other host, pull out the cable of the original
* port and watch it migrate to the other port.
*
* Running the Example:
* This example requires a specific IB network configuration to properly
* demonstrate APM. Two hosts are required, one for the client and one for the
* server. At least one of these two hosts must have an IB card with two ports.
* Both of these ports should be connected to the same subnet and each have a
* route to the other host through an IB switch.
* The executable can operate as either the client or server application. Start
* the server side first on one host then start the client on the other host. With default parameters, the
* client and server will exchange 100 sends over 100 seconds. During that time,
* manually unplug the cable connected to the original port of the two-port
* host, and watch the path get migrated to the other port. It may take up to
* a minute for the path to be migrated. To see the path get migrated by software,
* use the -m option on the client side.
*
* Server:
* ./apm -s
*
* Client (-a is IP of remote interface):
* ./apm -a 192.168.1.12
*
*/
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <getopt.h>
#include <rdma/rdma_verbs.h>
while (1) {
ret = ibv_get_async_event(ctx->id->verbs, &event);
if (ret) {
VERB_ERR("ibv_get_async_event", ret);
break;
}
switch (event.event_type) {
case IBV_EVENT_PATH_MIG:
printf("QP path migrated\n");
break;
case IBV_EVENT_PATH_MIG_ERR:
printf("QP path migration error\n");
break;
default:
printf("Async Event %d\n", event.event_type);
break;
}
ibv_ack_async_event(&event);
}
return NULL;
}
/*
* Function: get_alt_dlid_from_private_data
*
* Input:
* event The RDMA event containing private data
*
* Output:
* dlid The DLID that was sent in the private data
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Takes the private data sent from the remote side and returns the
* destination LID that was contained in the private data
*/
int get_alt_dlid_from_private_data(struct rdma_cm_event *event, uint16_t *dlid)
{
if (event->param.conn.private_data_len < 4) {
printf("unexpected private data len: %d\n",
event->param.conn.private_data_len);
return -1;
}
*dlid = ntohs(*((uint16_t *) event->param.conn.private_data));
return 0;
}
/*
* Function: get_alt_port_details
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* First, query the device to determine if path migration is supported.
* Next, queries all the ports on the device to determine if there is
* different port than the current one to use as an alternate port. If so,
* copy the port number and dlid to the context so they can be used when
* the alternate path is loaded.
*
* Note:
* This function assumes that if another port is found in the active state,
* that the port is connected to the same subnet as the initial port and
* that there is a route to the other host's alternate port.
*/
int get_alt_port_details(struct context *ctx)
{
int ret, i;
struct ibv_qp_attr qp_attr;
struct ibv_qp_init_attr qp_init_attr;
struct ibv_device_attr dev_attr;
/* This example assumes the alternate port we want to use is on the same
* HCA. Ports from other HCAs can be used as alternate paths as well. Get
* a list of devices using ibv_get_device_list or rdma_get_devices.*/
ret = ibv_query_device(ctx->id->verbs, &dev_attr);
if (ret) {
VERB_ERR("ibv_query_device", ret);
return ret;
}
/* Verify that APM is supported by the HCA */
if (!(dev_attr.device_cap_flags & IBV_DEVICE_AUTO_PATH_MIG)) {
printf("device does not support auto path migration!\n");
return -1;
}
}
if (port_attr.state == IBV_PORT_ACTIVE) {
ctx->my_alt_dlid = port_attr.lid;
ctx->alt_srcport = i;
if (qp_attr.port_num != i)
break;
}
}
return 0;
}
/*
* Function: load_alt_path
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Uses ibv_modify_qp to load the alternate path information and set the
* path migration state to rearm.
*/
int load_alt_path(struct context *ctx)
{
int ret;
struct ibv_qp_attr qp_attr;
struct ibv_qp_init_attr qp_init_attr;
/* if an alternate path was supplied, set the source port and the dlid */
if (ctx->alt_srcport)
qp_attr.alt_port_num = ctx->alt_srcport;
else
qp_attr.alt_port_num = qp_attr.port_num;
if (ctx->alt_dlid)
qp_attr.alt_ah_attr.dlid = ctx->alt_dlid;
/*
* Function: reg_mem
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Registers memory regions to use for our data transfer
*/
int reg_mem(struct context *ctx)
{
ctx->send_buf = (char *) malloc(ctx->msg_length);
memset(ctx->send_buf, 0x12, ctx->msg_length);
}
return 0;
}
/*
* Function: getaddrinfo_and_create_ep
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Gets the address information and creates our endpoint
*/
int getaddrinfo_and_create_ep(struct context *ctx)
{
int ret;
struct rdma_addrinfo *rai, hints;
struct ibv_qp_init_attr qp_init_attr;
memset(&hints, 0, sizeof (hints));
hints.ai_port_space = RDMA_PS_TCP;
if (ctx->server == 1)
hints.ai_flags = RAI_PASSIVE; /* this makes it a server */
printf("rdma_getaddrinfo\n");
ret = rdma_getaddrinfo(ctx->server_name, ctx->server_port, &hints, &rai);
if (ret) {
VERB_ERR("rdma_getaddrinfo", ret);
return ret;
}
memset(&qp_init_attr, 0, sizeof (qp_init_attr));
qp_init_attr.cap.max_send_wr = 1;
qp_init_attr.cap.max_recv_wr = 1;
qp_init_attr.cap.max_send_sge = 1;
qp_init_attr.cap.max_recv_sge = 1;
printf("rdma_create_ep\n");
ret = rdma_create_ep(&ctx->id, rai, NULL, &qp_init_attr);
if (ret) {
VERB_ERR("rdma_create_ep", ret);
return ret;
}
rdma_freeaddrinfo(rai);
return 0;
}
/*
* Function: get_connect_request
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Wait for a connect request from the client
*/
int get_connect_request(struct context *ctx)
{
int ret;
printf("rdma_listen\n");
ret = rdma_listen(ctx->id, 4);
if (ret) {
VERB_ERR("rdma_listen", ret);
return ret;
}
ctx->listen_id = ctx->id;
printf("rdma_get_request\n");
ret = rdma_get_request(ctx->listen_id, &ctx->id);
if (ret) {
VERB_ERR("rdma_get_request", ret);
return ret;
}
if (ctx->id->event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
printf("unexpected event: %s\n",
rdma_event_str(ctx->id->event->event));
return -1;
}
/* If the alternate path info was not set on the command line, get
* it from the private data */
if (ctx->alt_dlid == 0 && ctx->alt_srcport == 0) {
ret = get_alt_dlid_from_private_data(ctx->id->event, &ctx->alt_dlid);
if (ret) {
return ret;
}
}
return 0;
}
/*
* Function: establish_connection
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Create the connection. For the client, call rdma_connect. For the
* server, the connect request was already received, so just do
* rdma_accept to complete the connection.
*/
int establish_connection(struct context *ctx)
{
int ret;
uint16_t private_data;
struct rdma_conn_param conn_param;
/* send the dlid for the alternate port in the private data */
private_data = htons(ctx->my_alt_dlid);
/* If the alternate path info was not set on the command line, get
* it from the private data */
if (ctx->alt_dlid == 0 && ctx->alt_srcport == 0) {
ret = get_alt_dlid_from_private_data(ctx->id->event,
&ctx->alt_dlid);
if (ret)
return ret;
}
}
return 0;
}
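The alternate DLID crosses the wire in the CM private data in network byte order (htons() on send, ntohs() on receipt, as in the functions above). A minimal, hardware-free sketch of that round trip; the helper names and buffer are illustrative, not part of the example:

```c
#include <arpa/inet.h>
#include <stdint.h>
#include <string.h>

/* Pack a DLID into a private-data buffer in network byte order. */
static void pack_dlid(uint8_t *private_data, uint16_t dlid)
{
    uint16_t wire = htons(dlid);
    memcpy(private_data, &wire, sizeof(wire)); /* memcpy avoids alignment issues */
}

/* Unpack a DLID from a received private-data buffer. */
static uint16_t unpack_dlid(const uint8_t *private_data)
{
    uint16_t wire;
    memcpy(&wire, private_data, sizeof(wire));
    return ntohs(wire);
}
```

Either side can use the same pair of helpers, since the wire format is endian-neutral.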
/*
* Function: send_msg
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Performs a send and gets the completion
*
*/
int send_msg(struct context *ctx)
{
int ret;
struct ibv_wc wc;
ret = rdma_post_send(ctx->id, NULL, ctx->send_buf, ctx->msg_length,
ctx->send_mr, IBV_SEND_SIGNALED);
if (ret) {
VERB_ERR("rdma_post_send", ret);
return ret;
}
return 0;
}
/*
* Function: recv_msg
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Waits for a receive completion and posts a new receive buffer
*/
int recv_msg(struct context *ctx)
{
int ret;
struct ibv_wc wc;
ret = rdma_get_recv_comp(ctx->id, &wc);
if (ret < 0) {
VERB_ERR("rdma_get_recv_comp", ret);
return ret;
}
/*
* Function: main
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
*
*/
int main(int argc, char** argv)
{
int ret, op, i, send_cnt, recv_cnt;
struct context ctx;
struct ibv_qp_attr qp_attr;
memset(&ctx, 0, sizeof (ctx));
memset(&qp_attr, 0, sizeof (qp_attr));
ctx.server = 0;
ctx.server_port = DEFAULT_PORT;
ctx.msg_count = DEFAULT_MSG_COUNT;
ctx.msg_length = DEFAULT_MSG_LENGTH;
ctx.msec_delay = DEFAULT_MSEC_DELAY;
ctx.alt_dlid = 0;
ctx.alt_srcport = 0;
ctx.migrate_after = -1;
break;
case 'r':
ctx.alt_srcport = atoi(optarg);
break;
case 'm':
ctx.migrate_after = atoi(optarg);
break;
case 'w':
ctx.msec_delay = atoi(optarg);
break;
default:
printf("usage: %s [-s or -a required]\n", argv[0]);
printf("\t[-s server mode]\n");
printf("\t[-a ip_address]\n");
printf("\t[-p port_number]\n");
printf("\t[-c msg_count]\n");
printf("\t[-l msg_length]\n");
printf("\t[-d alt_dlid] (requires -r)\n");
printf("\t[-r alt_srcport] (requires -d)\n");
printf("\t[-m num_iterations_then_migrate] (client only)\n");
printf("\t[-w msec_wait_between_sends]\n");
exit(1);
}
}
if (ctx.server) {
ret = get_connect_request(&ctx);
if (ret)
goto out;
}
/* only query for alternate port if information was not specified on the
* command line */
if (ctx.alt_dlid == 0 && ctx.alt_srcport == 0) {
ret = get_alt_port_details(&ctx);
if (ret)
goto out;
}
ret = reg_mem(&ctx);
if (ret)
goto out;
ret = establish_connection(&ctx);
/* load the alternate path after the connection was created. This can be
* done at connection time, but the connection must be created and
* established using all ib verbs */
ret = load_alt_path(&ctx);
if (ret)
goto out;
send_cnt = recv_cnt = 0;
if (send_msg(&ctx))
break;
if (!ctx.server) {
if (recv_msg(&ctx))
break;
rdma_disconnect(ctx.id);
out:
if (ctx.send_mr)
rdma_dereg_mr(ctx.send_mr);
if (ctx.recv_mr)
rdma_dereg_mr(ctx.recv_mr);
if (ctx.id)
rdma_destroy_ep(ctx.id);
if (ctx.listen_id)
rdma_destroy_ep(ctx.listen_id);
if (ctx.send_buf)
free(ctx.send_buf);
if (ctx.recv_buf)
free(ctx.recv_buf);
return ret;
}
char *server_port;
int msg_count;
int msg_length;
/* Resources */
struct sockaddr mcast_sockaddr;
struct rdma_cm_id *id;
struct rdma_event_channel *channel;
struct ibv_pd *pd;
struct ibv_cq *cq;
struct ibv_mr *mr;
char *buf;
struct ibv_ah *ah;
uint32_t remote_qpn;
uint32_t remote_qkey;
pthread_t cm_thread;
};
/*
* Function: cm_thread
*
* Input:
* arg The context object
*
* Output:
* none
*
* Returns:
* NULL
*
* Description:
* Reads any CM events that occur during the sending of data
* and prints out the details of the event
*/
static void *cm_thread(void *arg)
{
struct rdma_cm_event *event;
int ret;
struct context *ctx = (struct context *) arg;
while (1) {
ret = rdma_get_cm_event(ctx->channel, &event);
if (ret) {
VERB_ERR("rdma_get_cm_event", ret);
break;
}
printf("event %s, status %d\n",
rdma_event_str(event->event), event->status);
rdma_ack_cm_event(event);
}
return NULL;
}
/*
* Function: get_cm_event
*
* Input:
* channel The event channel
* type The event type that is expected
*
* Output:
* out_ev The event will be passed back to the caller, if desired
* Set this to NULL and the event will be acked automatically
* Otherwise the caller must ack the event using rdma_ack_cm_event
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Waits for the next CM event and checks that it matches the expected
* type.
*/
int get_cm_event(struct rdma_event_channel *channel,
enum rdma_cm_event_type type,
struct rdma_cm_event **out_ev)
{
int ret = 0;
struct rdma_cm_event *event = NULL;
/*
* Function: resolve_addr
*
* Input:
* ctx The context structure
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Resolves the multicast address and also binds to the source address
* if one was provided in the context
*/
int resolve_addr(struct context *ctx)
{
int ret;
struct rdma_addrinfo *bind_rai = NULL;
struct rdma_addrinfo *mcast_rai = NULL;
struct rdma_addrinfo hints;
memset(&hints, 0, sizeof (hints));
if (ctx->bind_addr)
hints.ai_flags = RAI_PASSIVE;
else
hints.ai_flags = 0;
if (ctx->bind_addr) {
/* bind to a specific adapter if requested to do so */
ret = rdma_bind_addr(ctx->id, bind_rai->ai_src_addr);
if (ret) {
VERB_ERR("rdma_bind_addr", ret);
return ret;
}
/* A PD is created when we bind. Copy it to the context so it can
* be used later on */
ctx->pd = ctx->id->pd;
}
ret = rdma_resolve_addr(ctx->id, (bind_rai) ? bind_rai->ai_src_addr : NULL,
mcast_rai->ai_dst_addr, 2000);
if (ret) {
VERB_ERR("rdma_resolve_addr", ret);
return ret;
}
memcpy(&ctx->mcast_sockaddr,
mcast_rai->ai_dst_addr,
sizeof (struct sockaddr));
return 0;
}
/*
* Function: create_resources
*
* Input:
* ctx The context structure
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Creates the PD, CQ, QP and MR
*/
int create_resources(struct context *ctx)
{
int ret, buf_size;
struct ibv_qp_init_attr attr;
memset(&attr, 0, sizeof (attr));
VERB_ERR("ibv_alloc_pd", -1);
return ret;
}
}
/* The receiver must allow enough space in the receive buffer for
* the GRH */
buf_size = ctx->msg_length + (ctx->sender ? 0 : sizeof (struct ibv_grh));
return 0;
}
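The Global Routing Header that may precede a UD payload is a fixed 40 bytes (sizeof (struct ibv_grh)). A hardware-free sketch of the buffer-size rule used above; the 40-byte constant is stated directly here so the sketch does not depend on verbs.h:

```c
#include <stddef.h>

#define GRH_BYTES 40 /* sizeof (struct ibv_grh): fixed by the IBA spec */

/* Receive buffers on a UD QP must leave room for the GRH in front of
 * the payload; send buffers do not. */
static size_t ud_buf_size(size_t msg_length, int is_sender)
{
    return msg_length + (is_sender ? 0 : GRH_BYTES);
}
```

This is why the receiver later posts receives of ctx.msg_length + sizeof (struct ibv_grh) while the sender posts only ctx.msg_length.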
/*
* Function: destroy_resources
*
* Input:
* ctx The context structure
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Destroys AH, QP, CQ, MR, PD and ID
*/
void destroy_resources(struct context *ctx)
{
if (ctx->ah)
ibv_destroy_ah(ctx->ah);
if (ctx->id->qp)
rdma_destroy_qp(ctx->id);
if (ctx->cq)
ibv_destroy_cq(ctx->cq);
if (ctx->mr)
rdma_dereg_mr(ctx->mr);
if (ctx->buf)
free(ctx->buf);
/*
* Function: post_send
*
* Input:
* ctx The context structure
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Posts a UD send to the multicast address
*/
int post_send(struct context *ctx)
{
int ret;
struct ibv_send_wr wr, *bad_wr;
struct ibv_sge sge;
sge.length = ctx->msg_length;
sge.lkey = ctx->mr->lkey;
sge.addr = (uint64_t) ctx->buf;
return 0;
}
/*
* Function: get_completion
*
* Input:
* ctx The context structure
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Waits for a completion and verifies that the operation was successful
*/
int get_completion(struct context *ctx)
{
int ret;
struct ibv_wc wc;
do {
ret = ibv_poll_cq(ctx->cq, 1, &wc);
if (ret < 0) {
VERB_ERR("ibv_poll_cq", ret);
return -1;
}
}
while (ret == 0);
if (wc.status != IBV_WC_SUCCESS) {
printf("work completion status %s\n",
ibv_wc_status_str(wc.status));
return -1;
}
return 0;
}
/*
* Function: main
*
* Input:
* argc The number of arguments
* argv Command line arguments
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Main program to demonstrate multicast functionality.
* Both the sender and receiver create a UD Queue Pair and join the
* specified multicast group (ctx.mcast_addr). If the join is successful,
* the sender must create an Address Handle (ctx.ah). The sender then posts
* the specified number of sends (ctx.msg_count) to the multicast group.
* The receiver waits to receive each one of the sends, and then both sides
* leave the multicast group and clean up resources.
*/
int main(int argc, char** argv)
{
int ret, op, i;
struct context ctx;
struct ibv_port_attr port_attr;
struct rdma_cm_event *event;
char buf[40];
ctx.sender = 0;
ctx.msg_count = DEFAULT_MSG_COUNT;
ctx.msg_length = DEFAULT_MSG_LENGTH;
ctx.server_port = DEFAULT_PORT;
ctx.sender = 1;
break;
case 'b':
ctx.bind_addr = optarg;
break;
case 'm':
ctx.mcast_addr = optarg;
break;
case 'p':
ctx.server_port = optarg;
break;
case 'c':
ctx.msg_count = atoi(optarg);
break;
case 'l':
ctx.msg_length = atoi(optarg);
break;
default:
printf("usage: %s -m mc_address\n", argv[0]);
printf("\t[-s sender mode]\n");
printf("\t[-b bind_address]\n");
printf("\t[-p port_number]\n");
printf("\t[-c msg_count]\n");
printf("\t[-l msg_length]\n");
exit(1);
}
}
if(ctx.mcast_addr == NULL) {
printf("multicast address must be specified with -m\n");
exit(1);
}
ctx.channel = rdma_create_event_channel();
if (!ctx.channel) {
VERB_ERR("rdma_create_event_channel", -1);
exit(1);
}
ret = resolve_addr(&ctx);
if (ret)
goto out;
/* Verify that the buffer length is not larger than the MTU */
ret = ibv_query_port(ctx.id->verbs, ctx.id->port_num, &port_attr);
if (ret) {
VERB_ERR("ibv_query_port", ret);
goto out;
}
ret = create_resources(&ctx);
if (ret)
goto out;
if (!ctx.sender) {
for (i = 0; i < ctx.msg_count; i++) {
ret = rdma_post_recv(ctx.id, NULL, ctx.buf,
ctx.msg_length + sizeof (struct ibv_grh),
ctx.mr);
if (ret) {
VERB_ERR("rdma_post_recv", ret);
goto out;
}
}
}
/* Join the multicast group */
ret = rdma_join_multicast(ctx.id, &ctx.mcast_sockaddr, NULL);
if (ret) {
VERB_ERR("rdma_join_multicast", ret);
goto out;
}
ctx.remote_qpn = event->param.ud.qp_num;
ctx.remote_qkey = event->param.ud.qkey;
if (ctx.sender) {
/* Create an address handle for the sender */
ctx.ah = ibv_create_ah(ctx.pd, &event->param.ud.ah_attr);
if (!ctx.ah) {
VERB_ERR("ibv_create_ah", -1);
goto out;
}
}
rdma_ack_cm_event(event);
if (!ctx.sender)
printf("waiting for messages...\n");
ret = get_completion(&ctx);
if (ret)
goto out;
if (ctx.sender)
printf("sent message %d\n", i + 1);
else
printf("received message %d\n", i + 1);
}
out:
ret = rdma_leave_multicast(ctx.id, &ctx.mcast_sockaddr);
if (ret)
VERB_ERR("rdma_leave_multicast", ret);
destroy_resources(&ctx);
return ret;
}
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <getopt.h>
#include <rdma/rdma_verbs.h>
int max_wr;
/* Resources */
struct rdma_cm_id *srq_id;
struct rdma_cm_id *listen_id;
struct rdma_cm_id **conn_id;
struct ibv_mr *send_mr;
struct ibv_mr *recv_mr;
struct ibv_srq *srq;
struct ibv_cq *srq_cq;
struct ibv_comp_channel *srq_cq_channel;
char *send_buf;
char *recv_buf;
};
/*
* Function: init_resources
*
* Input:
* ctx The context object
* rai The RDMA address info for the connection
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* This function initializes resources that are common to both the client
* and server functionality.
* It creates our SRQ, registers memory regions, posts receive buffers
* and creates a single completion queue that will be used for the receive
* queue on each queue pair.
*/
int init_resources(struct context *ctx, struct rdma_addrinfo *rai)
{
int ret, i;
struct rdma_cm_id *id;
/* Save the SRQ in our context so we can assign it to other QPs later */
ctx->srq = ctx->srq_id->srq;
/* Post our receive buffers on the SRQ */
for (i = 0; i < ctx->max_wr; i++) {
ret = rdma_post_recv(ctx->srq_id, NULL, ctx->recv_buf, ctx->msg_length,
ctx->recv_mr);
if (ret) {
VERB_ERR("rdma_post_recv", ret);
return ret;
}
}
/* Create a completion channel to use with the SRQ CQ */
ctx->srq_cq_channel = ibv_create_comp_channel(ctx->srq_id->verbs);
if (!ctx->srq_cq_channel) {
VERB_ERR("ibv_create_comp_channel", -1);
return -1;
}
/* Create a CQ to use for all connections (QPs) that use the SRQ */
ctx->srq_cq = ibv_create_cq(ctx->srq_id->verbs, ctx->max_wr, NULL,
ctx->srq_cq_channel, 0);
if (!ctx->srq_cq) {
VERB_ERR("ibv_create_cq", -1);
return -1;
}
return 0;
}
/*
* Function: destroy_resources
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* This function cleans up resources used by the application
*/
void destroy_resources(struct context *ctx)
{
int i;
if (ctx->conn_id) {
for (i = 0; i < ctx->qp_count; i++) {
if (ctx->conn_id[i]) {
if (ctx->conn_id[i]->qp &&
ctx->conn_id[i]->qp->state == IBV_QPS_RTS) {
rdma_disconnect(ctx->conn_id[i]);
}
rdma_destroy_qp(ctx->conn_id[i]);
rdma_destroy_id(ctx->conn_id[i]);
}
}
free(ctx->conn_id);
}
if (ctx->recv_mr)
rdma_dereg_mr(ctx->recv_mr);
if (ctx->send_mr)
rdma_dereg_mr(ctx->send_mr);
if (ctx->recv_buf)
free(ctx->recv_buf);
if (ctx->send_buf)
free(ctx->send_buf);
if (ctx->srq_cq)
ibv_destroy_cq(ctx->srq_cq);
if (ctx->srq_cq_channel)
ibv_destroy_comp_channel(ctx->srq_cq_channel);
if (ctx->srq_id) {
rdma_destroy_srq(ctx->srq_id);
rdma_destroy_id(ctx->srq_id);
}
}
/*
* Function: await_completion
*
* Input:
* ctx The context object
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Waits for a completion on the SRQ CQ
*
*/
int await_completion(struct context *ctx)
{
int ret;
struct ibv_cq *ev_cq;
void *ev_ctx;
ret = ibv_get_cq_event(ctx->srq_cq_channel, &ev_cq, &ev_ctx);
if (ret) {
VERB_ERR("ibv_get_cq_event", ret);
return ret;
}
ibv_ack_cq_events(ev_cq, 1);
return 0;
}
/*
* Function: run_server
*
* Input:
* ctx The context object
* rai The RDMA address info for the connection
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Executes the server side of the example
*/
int run_server(struct context *ctx, struct rdma_addrinfo *rai)
{
int ret, i, ne;
uint64_t send_count = 0;
uint64_t recv_count = 0;
struct ibv_wc wc;
struct ibv_qp_init_attr qp_attr;
ret = await_completion(ctx);
if (ret) {
printf("await_completion %d\n", ret);
return ret;
}
do {
ne = ibv_poll_cq(ctx->srq_cq, 1, &wc);
if (ne < 0) {
VERB_ERR("ibv_poll_cq", ne);
return ne;
}
else if (ne == 0)
break;
if (wc.status != IBV_WC_SUCCESS) {
printf("work completion status %s\n",
ibv_wc_status_str(wc.status));
return -1;
}
recv_count++;
printf("recv count: %llu, qp_num: %d\n", (unsigned long long) recv_count, wc.qp_num);
i++;
}
while (ne);
}
send_count++;
printf("send count: %llu\n", (unsigned long long) send_count);
}
return 0;
}
/*
* Function: run_client
*
* Input:
* ctx The context object
* rai The RDMA address info for the connection
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Executes the client side of the example
*/
int run_client(struct context *ctx, struct rdma_addrinfo *rai)
{
int ret, i, ne;
uint64_t send_count = 0;
uint64_t recv_count = 0;
struct ibv_wc wc;
struct ibv_qp_init_attr attr;
ret = init_resources(ctx, rai);
if (ret) {
printf("init_resources returned %d\n", ret);
return ret;
}
memset(&attr, 0, sizeof (attr));
attr.qp_context = ctx;
attr.cap.max_send_wr = ctx->max_wr;
attr.cap.max_recv_wr = ctx->max_wr;
attr.cap.max_send_sge = 1;
attr.cap.max_recv_sge = 1;
attr.cap.max_inline_data = 0;
attr.recv_cq = ctx->srq_cq;
attr.srq = ctx->srq;
attr.sq_sig_all = 0;
VERB_ERR("rdma_connect", ret);
return ret;
}
}
send_count++;
printf("send count: %llu, qp_num: %d\n", (unsigned long long) send_count, wc.qp_num);
}
/* wait for a recv indicating that all buffers were processed */
ret = await_completion(ctx);
if (ret) {
VERB_ERR("await_completion", ret);
return ret;
}
do {
ne = ibv_poll_cq(ctx->srq_cq, 1, &wc);
if (ne < 0) {
VERB_ERR("ibv_poll_cq", ne);
return ne;
}
else if (ne == 0)
break;
if (wc.status != IBV_WC_SUCCESS) {
printf("work completion status %s\n",
ibv_wc_status_str(wc.status));
return -1;
}
recv_count++;
printf("recv count: %llu\n", (unsigned long long) recv_count);
return ret;
}
/*
* Function: main
*
* Input:
* argc The number of arguments
* argv Command line arguments
*
* Output:
* none
*
* Returns:
* 0 on success, non-zero on failure
*
* Description:
* Main program to demonstrate SRQ functionality.
* Both the client and server use an SRQ. ctx.qp_count number of QPs are
* created and each one of them uses the SRQ. After the connection, the
* client starts blasting sends to the server, up to ctx.max_wr. When the
* server has received all the sends, it performs a send to the client to
* tell it that it can continue. The process repeats until ctx.msg_count
* sends have been performed.
*/
int main(int argc, char** argv)
{
int ret, op;
struct context ctx;
struct rdma_addrinfo *rai, hints;
memset(&ctx, 0, sizeof (ctx));
memset(&hints, 0, sizeof (hints));
ctx.server = 0;
ctx.server_port = DEFAULT_PORT;
ctx.msg_count = DEFAULT_MSG_COUNT;
ctx.msg_length = DEFAULT_MSG_LENGTH;
ctx.qp_count = DEFAULT_QP_COUNT;
ctx.max_wr = DEFAULT_MAX_WR;
/* Read options from command line */
while ((op = getopt(argc, argv, "sa:p:c:l:q:w:")) != -1) {
switch (op) {
case 's':
ctx.server = 1;
break;
case 'a':
ctx.server_name = optarg;
break;
case 'p':
ctx.server_port = optarg;
break;
case 'c':
ctx.msg_count = atoi(optarg);
break;
case 'l':
ctx.msg_length = atoi(optarg);
break;
case 'q':
ctx.qp_count = atoi(optarg);
break;
case 'w':
ctx.max_wr = atoi(optarg);
break;
default:
printf("usage: %s -a server_address\n", argv[0]);
printf("\t[-s server mode]\n");
printf("\t[-p port_number]\n");
printf("\t[-c msg_count]\n");
printf("\t[-l msg_length]\n");
printf("\t[-q qp_count]\n");
printf("\t[-w max_wr]\n");
exit(1);
}
}
if (ctx.server_name == NULL) {
printf("server address required (use -a)!\n");
exit(1);
}
hints.ai_port_space = RDMA_PS_TCP;
if (ctx.server == 1)
hints.ai_flags = RAI_PASSIVE; /* this makes it a server */
ret = rdma_getaddrinfo(ctx.server_name, ctx.server_port, &hints, &rai);
if (ret) {
VERB_ERR("rdma_getaddrinfo", ret);
exit(1);
}
/* allocate memory for our QPs and send/recv buffers */
ctx.conn_id = (struct rdma_cm_id **) calloc(ctx.qp_count,
sizeof (struct rdma_cm_id *));
ctx.send_buf = (char *) malloc(ctx.msg_length);
memset(ctx.send_buf, 0, ctx.msg_length);
ctx.recv_buf = (char *) malloc(ctx.msg_length);
memset(ctx.recv_buf, 0, ctx.msg_length);
if (ctx.server)
ret = run_server(&ctx, rai);
else
ret = run_client(&ctx, rai);
destroy_resources(&ctx);
rdma_freeaddrinfo(rai);
return ret;
}
• User creates DCIs. The number of DCIs depends on the user's strategy for handling concurrent
data transmissions.
• User defines a DC Access Key and initializes a DCT using this access key.
• User can query the DCI with the routine ibv_exp_query_qp(), and can query the DCT with the
ibv_exp_query_dct() routine.
• User can arm the DCT, so that an event is generated when a DC Access Key violation occurs.
• Send work requests are posted to the DCIs. Data can be sent to a different DCT only after all
previous sends complete, so send CQEs can be used to detect such completions.
• The CQ associated with the DCT is used to detect data arrival.
• Destroy resources when done.
The routine ibv_exp_query_device() is used to query for device capabilities. The flag IBV_EXP_DEVICE_DC_TRANSPORT in the field
exp_atomic_cap of the struct ibv_exp_device_attr indicates whether the DC transport is supported.
The fields,
int max_dc_req_rd_atom;
int max_dc_res_rd_atom;
• tclass - Used by packets sent by the DCT in case GRH is used
• flow_label - Used by packets sent by the DCT in case GRH is used
• mtu - MTU
• pkey_index - pkey index used by the DC target
• gid_index - GID index associated with the DCT. Used to verify incoming packets
if GRH is used. This field is mandatory
• hop_limit - Used by packets sent by the DCT in case GRH is used
• Create flags
Destroy a DC target. This call may take some time until all DCRs are disconnected.
struct ibv_exp_arm_attr {
uint32_t comp_mask;
};
The following are the valid state transitions for DCI with required and optional params
In the figure above, the total operand size is N bits, with the length of each data field being four
bits. The 1's in the mask indicate the termination of a data field. With the ConnectX® family of HCAs
and Connect-IB®, there is always an implicit 1 in the mask.
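A mask of this form can be built mechanically: for fields w bits wide, set a 1 at the most-significant bit of each field to mark its termination. A small sketch in plain C (pure arithmetic, no verbs calls; the helper name and the field widths in the usage note are illustrative):

```c
#include <stdint.h>

/* Build a multi-field fetch-and-add boundary mask: one terminating 1
 * at the most-significant bit of every field_bits-wide field. */
static uint64_t fa_field_mask(unsigned operand_bits, unsigned field_bits)
{
    uint64_t mask = 0;
    unsigned i;
    for (i = field_bits - 1; i < operand_bits; i += field_bits)
        mask |= (uint64_t) 1 << i;
    return mask;
}
```

For a 32-bit operand split into 4-bit fields this yields 0x88888888; the implicit top 1 mentioned above means the highest bit would be assumed by the hardware even if not set.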
+ IBV_EXP_DEVICE_EXT_ATOMICS ,
and the device attribute comp mask enumeration ibv_exp_device_attr_comp_mask is updated with:
+ IBV_EXP_DEVICE_ATTR_EXT_ATOMIC_ARGS,
struct ibv_exp_ext_atomics_params {
uint32_t max_fa_bit_boundary;
uint32_t log_max_atomic_inline;
};
Atomic fetch-and-add operations on subsections of the operands are also supported, with
max_fa_bit_boundary being the log-base-2 of the largest such subfield, in bytes.
log_max_atomic_inline is the log of the largest amount of atomic data, in bytes, that can be put in
the work request and includes the space for all required fields. For ConnectX and Connect-IB the
largest subsection supported is eight bytes.
The returned data is formatted in units that correspond to the host's natural word size. For
example, if extended atomics are used for a 16 byte field, and returned in big-endian format, each
eight byte portion is arranged in big-endian format, regardless of the size of the fields used in
a multi-field fetch-and-add operation.
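The per-word layout described above can be reproduced in plain C: each eight-byte word of the returned field is converted from big-endian independently. A hardware-free sketch (manual byte swap, so it does not depend on Linux's endian.h; the function names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* Convert one big-endian 8-byte word to a host integer. */
static uint64_t be64_to_host(const uint8_t *p)
{
    uint64_t v = 0;
    size_t i;
    for (i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Decode a returned extended-atomic field of n_words 8-byte units,
 * each independently big-endian as described above. */
static void decode_atomic_result(const uint8_t *raw, uint64_t *out, size_t n_words)
{
    size_t w;
    for (w = 0; w < n_words; w++)
        out[w] = be64_to_host(raw + 8 * w);
}
```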
8.7.2.3 QP Initialization
QP initialization needs additional information with respect to the sizes of atomic operations that
will be supported inline. This is needed to ensure the QP is provisioned with sufficient send
resources to support the requested WQEs.
+ IBV_EXP_QP_INIT_ATTR_ATOMICS_ARG ,
} ext_op;
Three examples of non-contiguous regions of memory that are used to form new contiguous regions
of memory are described below. Figure 2 shows an example where portions of three separate
contiguous regions of memory are combined to create a single logically contiguous region of
memory. The base address of the new memory region is defined by the user when the new memory
key is defined.
The figure below shows a non-contiguous memory region with a regular structure. This region is
defined by a base address, the stride between adjacent elements, the extent of each element, and a
repeat count.
The figure below shows an example where two non-contiguous memory regions are interleaved,
using the repeat structure UMR.
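The regular (repeat) structure can be pictured as a software gather: for each of repeat_count elements, take byte_count bytes starting at base + i * stride. A one-dimensional sketch of that addressing rule in plain C; this only emulates the layout the memory key would expose, no verbs are involved:

```c
#include <stddef.h>
#include <string.h>
#include <stdint.h>

/* Gather a strided region (base, stride, byte_count, repeat_count)
 * into one contiguous buffer, mimicking a 1-D repeat-structure UMR. */
static void gather_strided(const uint8_t *base, size_t stride,
                           size_t byte_count, size_t repeat_count,
                           uint8_t *dst)
{
    size_t i;
    for (i = 0; i < repeat_count; i++)
        memcpy(dst + i * byte_count, base + i * stride, byte_count);
}
```

Interleaving two regions, as in the second figure, amounts to running this gather on each region with destinations offset so the elements alternate.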
8.8.1 Interfaces
The usage model for the UMR includes:
• Ability to check with ibv_exp_query_device() whether UMR is supported.
• If UMR is supported, checking struct ibv_exp_device_attr for its characteristics
• Using ibv_exp_create_mr() to create an uninitialized memory key for future UMR use
• Using ibv_exp_post_send() to define the new memory key. This can be posted to the same
send queue that will use the memory key in future operations.
• Using the UMR, once defined, as one would use any other memory key
• Using ibv_exp_post_send() to invalidate the UMR memory key
• Releasing the memory key with ibv_dereg_mr()
The struct ibv_exp_umr_caps umr_caps field describes the UMR capabilities. This structure is defined as:
struct ibv_exp_umr_caps {
uint32_t max_klm_list_size;
uint32_t max_send_wqe_inline_klms;
uint32_t max_umr_recursion_depth;
uint32_t max_umr_stride_dimension;
};
The fields added to the struct ibv_exp_device_attr to support UMR include:
• exp_device_cap_flags - UMR support available if the flag IBV_EXP_DEVICE_ATTR_UMR is set.
• max_mkey_klm_list_size - maximum number of memory keys that may be input to UMR
• max_send_wqe_inline_klms - the largest number of KLMs that can be provided inline in the
work request. When the list is larger than this, a buffer allocated via the struct ibv_mr
*ibv_exp_reg_mr(struct ibv_exp_reg_mr_in *in) function, and provided to the driver as part of
the memory key creation
• max_umr_recursion_depth - memory keys created by UMR operations may be input to UMR
memory key creation. This specifies the limit on how deep this recursion can be.
• max_umr_stride_dimension - The maximum number of independent dimensions that may be
used with the regular structure UMR operations. The current limit is one.
8.8.1.2 QP Creation
To configure QP UMR support the routine
To query the resources associated with the memory key, the routine
uint32_t comp_mask; /*use ibv_exp_alloc_mkey_list_comp_mask */
};
This memory is freed with
int ibv_exp_dealloc_mkey_list_memory(struct ibv_exp_mkey_list_container *mem)
where
struct ibv_exp_mkey_list_container {
uint32_t max_klm_list_size;
struct ibv_context *context;
};
The struct ibv_exp_send_wr contains the following fields to support the UMR capabilities:
union {
struct {
uint32_t umr_type; /* use ibv_exp_umr_wr_type */
struct ibv_exp_mkey_list_container *memory_objects; /* used when IBV_EXP_SEND_INLINE is not set */
uint64_t exp_access; /* use ibv_exp_access_flags */
struct ibv_mr *modified_mr;
uint64_t base_addr;
uint32_t num_mrs; /* array size of mem_repeat_block_list or mem_reg_list */
union {
struct ibv_exp_mem_region *mem_reg_list; /* array, size corresponds to num_mrs */
struct {
struct ibv_exp_mem_repeat_block *mem_repeat_block_list; /* array, size corresponds to num_mr */
size_t *repeat_count; /* array size corresponds to stride_dim */
uint32_t stride_dim;
} rb;
} mem_list;
} umr;
};
where
enum ibv_exp_umr_wr_type {
IBV_EXP_UMR_MR_LIST,
IBV_EXP_UMR_REPEAT
};
and
struct ibv_exp_mkey_list_container {
uint32_t max_klm_list_size;
struct ibv_context *context;
};
struct ibv_exp_mem_region {
uint64_t base_addr;
struct ibv_mr *mr;
size_t length;
};
and
struct ibv_exp_mem_repeat_block {
uint64_t base_addr; /* array, size corresponds to ndim */
struct ibv_mr *mr;
size_t *byte_count; /* array, size corresponds to ndim */
size_t *stride; /* array, size corresponds to ndim */
};
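For the one-dimensional case, the arithmetic these fields imply can be sketched as follows. The helper names are hypothetical, and the interpretation (repeat_count blocks of byte_count bytes, with a start-to-start distance of stride bytes) is an assumption based on the field comments:

```c
#include <assert.h>
#include <stddef.h>

/* Total payload described by a 1-D repeat pattern:
 * repeat_count blocks, each byte_count bytes long. */
static size_t repeat_payload_bytes(size_t byte_count, size_t repeat_count)
{
    return byte_count * repeat_count;
}

/* Extent of the underlying buffer the pattern touches: the last block
 * starts at (repeat_count - 1) * stride and is byte_count bytes long. */
static size_t repeat_extent_bytes(size_t byte_count, size_t stride,
                                  size_t repeat_count)
{
    if (repeat_count == 0)
        return 0;
    return (repeat_count - 1) * stride + byte_count;
}
```

For example, gathering 8 blocks of 64 bytes spaced 256 bytes apart yields a 512-byte payload drawn from a 1856-byte extent.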
Terms relevant to Cross-Channel Synchronization are defined in the following table:
• Cross Channel supported QP - QP that allows send_enable, recv_enable, wait, and reduction tasks.
• Managed send QP - work requests in the corresponding send queue must be explicitly enabled before they can be executed.
• Managed receive QP - work requests in the corresponding receive queue must be explicitly enabled before they can be executed.
• Master Queue - queue that uses send_enable and/or recv_enable work requests to enable tasks in a managed QP. A QP can be both a master and a managed QP.
• Wait task (n) - task that completes when n completions appear in the specified completion queue.
• Send Enable task (n) - enables the next n send tasks in the specified send queue to become executable.
• Receive Enable task (n) - enables the next n receive tasks in the specified receive queue to become executable.
• Reduction operation - data reduction operation to be executed by the HCA on specified data.
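The enable/wait counting rules in the table can be illustrated with a toy model. This is a simulation of the semantics, not the verbs API; the types and helpers are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Toy model of a managed queue: the application posts work requests,
 * but only the prefix a master queue has enabled is executable. */
struct managed_queue {
    uint32_t posted;   /* work requests posted by the application */
    uint32_t enabled;  /* prefix made executable by send_enable tasks */
};

/* send_enable(n): enable the next n tasks; the enabled count can
 * never exceed what was actually posted. Returns the new count. */
static uint32_t send_enable(struct managed_queue *q, uint32_t n)
{
    q->enabled += n;
    if (q->enabled > q->posted)
        q->enabled = q->posted;
    return q->enabled;
}

/* wait(n): the wait task completes once n completions have appeared
 * in the watched completion queue. */
static int wait_task_done(uint32_t cq_completions, uint32_t n)
{
    return cq_completions >= n;
}
```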
In addition, the calc_cap field of struct ibv_exp_device_attr defines which reduction capabilities are supported:
struct ibv_exp_device_attr {
…
struct ibv_exp_device_calc_cap calc_cap;
…
};
where,
struct ibv_exp_device_calc_cap {
uint64_t data_types;
uint64_t data_sizes;
uint64_t int_ops;
uint64_t uint_ops;
uint64_t fp_ops;
};
Where the operation types are given by:
IBV_EXP_CALC_OP_ADD, /* addition */
IBV_EXP_CALC_OP_BAND, /* bit-wise and */
IBV_EXP_CALC_OP_BXOR, /* bit-wise xor */
IBV_EXP_CALC_OP_BOR, /* bit-wise or */
and the supported data sizes are described by
IBV_EXP_CALC_DATA_SIZE_64_BIT
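Since the calc_cap fields are bitmasks of supported values, checking for a particular capability reduces to a bit test. A minimal sketch, using illustrative bit positions rather than the real header constants:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions for the reduction operations; real code
 * uses the values from the experimental verbs header. */
enum { CALC_OP_ADD  = 1u << 0, CALC_OP_BAND = 1u << 1,
       CALC_OP_BXOR = 1u << 2, CALC_OP_BOR  = 1u << 3 };

/* Local mirror of struct ibv_exp_device_calc_cap. */
struct calc_cap {
    uint64_t data_types;
    uint64_t data_sizes;
    uint64_t int_ops;
    uint64_t uint_ops;
    uint64_t fp_ops;
};

/* An integer reduction is usable only if the device advertises the
 * operation bit in int_ops. */
static int int_op_supported(const struct calc_cap *cap, uint64_t op_bit)
{
    return (cap->int_ops & op_bit) != 0;
}
```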
8.9.2.3 QP Creation
To configure the QP for Cross-Channel use, the following QP initialization attribute structure is used:
struct ibv_exp_qp_init_attr {
void *qp_context;
struct ibv_cq *send_cq;
struct ibv_cq *recv_cq;
struct ibv_srq *srq;
struct ibv_qp_cap cap;
enum ibv_qp_type qp_type;
int sq_sig_all;
uint32_t max_inl_recv;
struct ibv_exp_qpg qpg;
uint32_t max_atomic_arg;
uint32_t max_inl_send_klms;
};
• IBV_EXP_QP_CREATE_CROSS_CHANNEL - must be set for any QP to which cross-channel-synchronization work requests will be posted.
• IBV_EXP_QP_CREATE_MANAGED_SEND - set for a managed send QP, i.e. one for which send-enable operations are used to activate the posted send requests.
• IBV_EXP_QP_CREATE_MANAGED_RECV - set for a managed receive QP, i.e. one for which receive-enable operations are used to activate the posted receive requests.
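Composing the creation flags for the common cases can be sketched as follows; the numeric flag values and helper name are illustrative, and real code sets the constants from the experimental verbs header in the QP creation flags:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative flag values; the real ones come from the experimental
 * verbs header. */
#define QP_CREATE_CROSS_CHANNEL (1u << 0)
#define QP_CREATE_MANAGED_SEND  (1u << 1)
#define QP_CREATE_MANAGED_RECV  (1u << 2)

/* CROSS_CHANNEL is required for any QP taking cross-channel work
 * requests; the MANAGED_* flags are added when the queue's own sends
 * or receives must be enabled by a master queue. A master QP can
 * itself be managed, so the flags combine freely. */
static uint32_t cross_channel_flags(int managed_send, int managed_recv)
{
    uint32_t f = QP_CREATE_CROSS_CHANNEL;
    if (managed_send)
        f |= QP_CREATE_MANAGED_SEND;
    if (managed_recv)
        f |= QP_CREATE_MANAGED_RECV;
    return f;
}
```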
The routine
int ibv_exp_post_task(struct ibv_context *context, struct ibv_exp_task *task, struct ibv_exp_task **bad_task)
is used to post the list of work requests, with
struct ibv_exp_task {
enum ibv_exp_task_type task_type;
struct {
struct ibv_qp *qp;
union {
struct ibv_exp_send_wr *send_wr;
struct ibv_recv_wr *recv_wr;
};
} item;
struct ibv_exp_task *next;
uint32_t comp_mask; /* reserved for future growth (must be 0) */
};
and task_type is one of
IBV_EXP_TASK_SEND
IBV_EXP_TASK_RECV
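The chained task list passed to ibv_exp_post_task() can be sketched with locally mirrored types; the struct below stands in for struct ibv_exp_task, with the QP and work-request pointers reduced to void * so the fragment compiles standalone:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum task_type { TASK_SEND, TASK_RECV };

/* Local stand-in for struct ibv_exp_task: each node names a QP, a
 * send or receive WR list, and links to the next task via .next. */
struct task {
    enum task_type task_type;
    void *qp;            /* stands in for struct ibv_qp * */
    void *wr;            /* stands in for the send/recv WR list */
    struct task *next;
    uint32_t comp_mask;  /* reserved for future growth (must be 0) */
};

/* Walk the chain as a driver-side loop might, counting the tasks. */
static int task_chain_len(const struct task *t)
{
    int n = 0;
    for (; t != NULL; t = t->next)
        n++;
    return n;
}
```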
To support the new work requests, struct ibv_exp_send_wr is expanded with the following task and operation unions:
union {
struct {
uint64_t remote_addr;
uint32_t rkey;
} rdma;
struct {
uint64_t remote_addr;
uint64_t compare_add;
uint64_t swap;
uint32_t rkey;
} atomic;
struct {
struct ibv_cq *cq;
int32_t cq_count;
} cqe_wait;
struct {
struct ibv_qp *qp;
int32_t wqe_count;
} wqe_enable;
} task;
union {
struct {
enum ibv_exp_calc_op calc_op;
enum ibv_exp_calc_data_type data_type;
enum ibv_exp_calc_data_size data_size;
} calc;
} op;
New send opcodes are defined for the new work requests. These include:
• IBV_EXP_WR_SEND_ENABLE
• IBV_EXP_WR_RECV_ENABLE
• IBV_EXP_WR_CQE_WAIT
9 Document Revision History
• Rev 1.7 - August 2020 - Converted to HTML
• May 2015 - Added "Verbs API for Extended Atomics Support", "User-Mode Memory Registration (UMR)", and "Cross-Channel Communications Support"
Notice
This document is provided for information purposes only and shall not be regarded as a warranty of a certain
functionality, condition, or quality of a product. Neither NVIDIA Corporation nor any of its direct or indirect subsidiaries
and affiliates (collectively: “NVIDIA”) make any representations or warranties, expressed or implied, as to the accuracy
or completeness of the information contained in this document and assumes no responsibility for any errors contained
herein. NVIDIA shall have no liability for the consequences or use of such information or for any infringement of patents
or other rights of third parties that may result from its use. This document is not a commitment to develop, release, or
deliver any Material (defined below), code, or functionality.
NVIDIA reserves the right to make corrections, modifications, enhancements, improvements, and any other changes to
this document, at any time without notice. Customer should obtain the latest relevant information before placing orders
and should verify that such information is current and complete.
NVIDIA products are sold subject to the NVIDIA standard terms and conditions of sale supplied at the time of order
acknowledgement, unless otherwise agreed in an individual sales agreement signed by authorized representatives of
NVIDIA and customer (“Terms of Sale”). NVIDIA hereby expressly objects to applying any customer general terms and
conditions with regards to the purchase of the NVIDIA product referenced in this document. No contractual obligations
are formed either directly or indirectly by this document.
NVIDIA products are not designed, authorized, or warranted to be suitable for use in medical, military, aircraft, space, or
life support equipment, nor in applications where failure or malfunction of the NVIDIA product can reasonably be
expected to result in personal injury, death, or property or environmental damage. NVIDIA accepts no liability for
inclusion and/or use of NVIDIA products in such equipment or applications and therefore such inclusion and/or use is at
customer’s own risk.
NVIDIA makes no representation or warranty that products based on this document will be suitable for any specified use.
Testing of all parameters of each product is not necessarily performed by NVIDIA. It is customer’s sole responsibility to
evaluate and determine the applicability of any information contained in this document, ensure the product is suitable
and fit for the application planned by customer, and perform the necessary testing for the application in order to avoid a
default of the application or the product. Weaknesses in customer’s product designs may affect the quality and reliability
of the NVIDIA product and may result in additional or different conditions and/or requirements beyond those contained in
this document. NVIDIA accepts no liability related to any default, damage, costs, or problem which may be based on or
attributable to: (i) the use of the NVIDIA product in any manner that is contrary to this document or (ii) customer product
designs.
No license, either expressed or implied, is granted under any NVIDIA patent right, copyright, or other NVIDIA intellectual
property right under this document. Information published by NVIDIA regarding third-party products or services does not
constitute a license from NVIDIA to use such products or services or a warranty or endorsement thereof. Use of such
information may require a license from a third party under the patents or other intellectual property rights of the third
party, or a license from NVIDIA under the patents or other intellectual property rights of NVIDIA.
Reproduction of information in this document is permissible only if approved in advance by NVIDIA in writing, reproduced
without alteration and in full compliance with all applicable export laws and regulations, and accompanied by all
associated conditions, limitations, and notices.
THIS DOCUMENT AND ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS, DIAGNOSTICS, LISTS,
AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY, “MATERIALS”) ARE BEING PROVIDED “AS IS.” NVIDIA MAKES NO
WARRANTIES, EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE MATERIALS, AND EXPRESSLY
DISCLAIMS ALL IMPLIED WARRANTIES OF NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR PURPOSE.
TO THE EXTENT NOT PROHIBITED BY LAW, IN NO EVENT WILL NVIDIA BE LIABLE FOR ANY DAMAGES, INCLUDING WITHOUT
LIMITATION ANY DIRECT, INDIRECT, SPECIAL, INCIDENTAL, PUNITIVE, OR CONSEQUENTIAL DAMAGES, HOWEVER CAUSED AND
REGARDLESS OF THE THEORY OF LIABILITY, ARISING OUT OF ANY USE OF THIS DOCUMENT, EVEN IF NVIDIA HAS BEEN
ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. Notwithstanding any damages that customer might incur for any reason
whatsoever, NVIDIA’s aggregate and cumulative liability towards customer for the products described herein shall be
limited in accordance with the Terms of Sale for the product.
Trademarks
NVIDIA, the NVIDIA logo, and Mellanox are trademarks and/or registered trademarks of NVIDIA Corporation and/
or Mellanox Technologies Ltd. in the U.S. and in other countries. Other company and product names may be trademarks
of the respective companies with which they are associated.
Copyright
© 2024 NVIDIA Corporation & affiliates. All Rights Reserved.