Verbs Programming Tutorial-Final
Dotan Barak
OpenSHMEM, 2014
Features:
• Remote Direct Memory Access (RDMA) – zero copy
• Kernel bypass
• Highly scalable (10K’s of nodes)
• Message based transactions
• Credit based flow control
Performance
• High BW (56 Gb/sec)
• Low latency (700 nsec)
• High message rate (137 Mpps)
• The lower 4 OSI layers implemented in HW
Other
• Industry Standard
• Developed as Open Source (in *NIX)
• Inbox in several OS’s
xCAs (Channel Adapters) are located in hosts/systems; each has a DMA engine that can
initiate local and remote DMA operations.
Routers allow routing of packets between different subnets according to the packet’s
destination Global address
Repeaters are transparent physical entities that extend the range of a link
Every node in InfiniBand has a Global Unique Identifier (GUID) – Node GUID
• Persistent
• World-wide unique 64-bit value (node GUID)
Every port in a node has a port GUID
• Ports can be identified using the port GUID
• Every port can be configured with multiple additional GID (Global IDentifier) addresses in the
port’s GID table
A system, which contains several nodes, may have a System GUID configured in each
node
[Figure: InfiniBand protocol layers. End nodes: Application, Network Layer (GRH), Link Layer (LRH), PHY. Intermediate nodes: packet relay at the Network and Link layers, PHY.]
Points :
• Network software stack involved
- Usually part of the OS’s kernel
- Consumes CPU
- If needed, provides reliability
- Caches the data
• Order of operations does matter
- Receive buffer(s) should be available before data is received
Send
• Just like the classic model
• Data is read on the local side
- Can be gathered from multiple buffers
• Sent over the wire as a message
• The remote side specifies where the message will be stored
- Can be scattered to multiple buffers
RDMA
• Local side can write data directly to remote side memory
- Can be gathered locally from multiple buffers
• Local side can read data directly from remote side memory
- Can be scattered locally to multiple buffers
• The remote side is not aware of any activity
- No CPU involvement at remote side
Requester
• The active side
- It initiates the data transfer
• Post SR(s)
Responder
• The passive side
- It sends or receives data, depending on the opcode used
• May post RR(s)
• Sends ACK/NACK for reliable transport types
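[Figure: requester/responder message flows. Send: the responder posts an RR and the two sides synchronize; the requester posts an SR; data travels to the responder, which returns an ACK; both sides poll their CQs. The remaining three diagrams (titles lost in extraction) show the requester polling its CQ after receiving an ACK or data from the responder.]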
Software
• Linux distribution
• RDMA stack installed
- MLNX-OFED
- Community's OFED
- Native RDMA stack within the Linux distribution
- Manually download the needed libraries and compile the Linux kernel
Hardware
• RDMA device
- InfiniBand/RoCE is preferred
- Connected in loopback or back-to-back/switch
Verbs is an abstract description of the functionality that is provided to applications
for using RDMA.
• Verbs is not an API
• There are several implementations for it
Verbs can be divided into two major groups
• Control path – manages the resources and usually requires a context switch
- Create
- Destroy
- Modify
- Query
- Work with events
• Data path – uses the resources to send/receive data and doesn’t require a context switch
- Post Send
- Post Receive
- Poll CQ
- Request for completion event
libibverbs, developed and maintained by Roland Dreier since 2006, is the de facto
verbs API standard in *nix
• Developed as open source
• The kernel part of the verbs has been integrated in the Linux kernel since 2005 – Kernel 2.6.11
• Inbox in several *nix distributions
• There are low-level libraries from several HW vendors
Same API for all RDMA-enabled transport protocols
• InfiniBand – Networking architecture which supports RDMA
- requires both NICs and switches that support it
• RDMA over Converged Ethernet (RoCE) – encapsulation of RDMA packets over Ethernet/IP
frames
- requires NICs which support it and standard Ethernet switches
• Internet Wide Area RDMA Protocol (iWARP) – provides RDMA over Stream Control
Transmission Protocol (SCTP) and Transmission Control Protocol (TCP)
- requires NICs which support it and standard Ethernet switches
Warning: libibverbs will not prevent you from getting into trouble
• Destroying a resource in one thread and using it in another thread will end up with a segmentation
fault
- This happens in non-threaded code as well
Warning: Not following those rules may lead to data corruption or segmentation fault!
[Figure: resource hierarchy, marking each object as mandatory or optional: struct ibv_device, struct ibv_context, struct ibv_pd, struct ibv_comp_channel, struct ibv_qp]
device_list = ibv_get_device_list(&num_devices);
if (!device_list) {
    fprintf(stderr, "Error, ibv_get_device_list() failed\n");
    exit(1);
}
ibv_free_device_list(device_list);
Write a program that goes over all the RDMA devices and prints the node type of every
device.
Tip: use ibv_node_type_str() to get a string from a node type enumerated value.
Write a program that goes over all the RDMA devices and prints, for every port in each device,
the GID in entry 0 (i.e. Port GUID)
Protection Domain is a mechanism for associating Queue Pairs with other RDMA
resources
• Such as Memory Regions and Address Handles
Not all resources have a PD
• For example: Completion Queues
A Protection Domain, as its name states, is a means of protection
• Mixing resources that were associated with different PDs will result in a Work Completion with error
• ibv_dealloc_pd() should be called only after destroying all the resources that are associated with the PD
pd = ibv_alloc_pd(context);
if (!pd) {
    fprintf(stderr, "Error, ibv_alloc_pd() failed\n");
    return -1;
}
if (ibv_dealloc_pd(pd)) {
    fprintf(stderr, "Error, ibv_dealloc_pd() failed\n");
    return -1;
}
Memory Region is a virtually contiguous memory block that was registered, i.e.
prepared for work with RDMA.
• Any memory buffer in the process’ virtual space can be registered
• Available permissions. One or more of the following permissions (Or’ed):
- Local operations (Local Read is always supported)
IBV_ACCESS_LOCAL_WRITE
IBV_ACCESS_MW_BIND
- Remote operations
IBV_ACCESS_REMOTE_WRITE
IBV_ACCESS_REMOTE_READ
IBV_ACCESS_REMOTE_ATOMIC
• If Remote Write or Remote Atomic is enabled, local Write should be enabled too
• The same memory buffer can be registered multiple times
- even with different permissions
• After a successful memory registration, two keys are generated:
- Local Key (lkey)
- Remote Key (rkey)
Those keys are used when referring to this MR in a Work Request
• ibv_dereg_mr() should be called only if there is no outstanding Send Request or Receive Request
that points to the MR
if (ibv_dereg_mr(mr)) {
    fprintf(stderr, "Error, ibv_dereg_mr() failed\n");
    return -1;
}
A Completion Event channel is a mechanism for delivering notifications about the creation
of Work Completions in the CQs that are attached to it.
• Useful to reduce the CPU consumption
This object will be used when creating new CQs
One Completion Event channel can be used with multiple CQs
• ibv_destroy_comp_channel() should be called only after destroying all the CQs that are associated with the channel
channel = ibv_create_comp_channel(context);
if (!channel) {
    fprintf(stderr, "Error, ibv_create_comp_channel() failed\n");
    return -1;
}
if (ibv_destroy_comp_channel(channel)) {
    fprintf(stderr, "Error, ibv_destroy_comp_channel() failed\n");
    return -1;
}
Completion Queue is a Queue that holds information about completed Work Requests
• Every Work Completion contains information about the corresponding completed Work Request
A Completion Queue size is limited
• If more Work Completions than its size are added, the CQ is overrun and all associated Work
Queues are moved to the Error state
- It is up to the user to make sure that the CQ size is enough
- It is up to the user to empty the CQ in order to prevent CQ overrun
One CQ can be shared with multiple queues
• Several Queue Pairs
• Only Send Queues
• Only Receive Queues
• Mix of the above
• ibv_destroy_cq() should be called only after destroying all the QPs that are associated with the CQ
int ibv_resize_cq(struct ibv_cq *cq, int cqe);
• Resize an existing Completion Queue
• The new size should be able to contain the Work Completions that currently populate the CQ
if (ibv_destroy_cq(cq)) {
    fprintf(stderr, "Error, ibv_destroy_cq() failed\n");
    return -1;
}
Write a program that opens a device, creates a Completion Event channel and creates 2
CQs:
1) One without any associated Completion Event channel
2) One with an associated Completion Event channel
Metric                        UD    UC    RC
Reliability                   -     -     ☺
Send (with immediate)         ☺     ☺     ☺
RDMA Write (with immediate)   -     ☺     ☺
RDMA Read                     -     -     ☺
Atomic operations             -     -     ☺
Multicast                     ☺     -     -
Max message size              MTU   2GB   2GB
CRC                           ☺     ☺     ☺
Solutions:
1. Exchange information Out Of Band
• For example: over sockets
2. Use the Communication Manager (CM); this is the right way to connect QPs
In each QP state transition, the relevant attributes to enable the state functionality
need to be configured
• There are different attributes for every transport type
- For RC QPs: retransmission count and timers
- For RC/UC QPs: Primary path and alternate path (optional)
• ibv_destroy_qp() should be called only after detaching the QP from all multicast groups
int ibv_modify_qp(struct ibv_qp *qp, struct ibv_qp_attr *attr,
enum ibv_qp_attr_mask attr_mask);
• Modify the QP attributes
struct ibv_qp_cap {
    uint32_t max_send_wr;      /* The number of Send Requests that can be outstanding in the QP */
    uint32_t max_recv_wr;      /* The number of Receive Requests that can be outstanding in the QP */
    uint32_t max_send_sge;     /* The number of S/G entries that each Send Request may hold */
    uint32_t max_recv_sge;     /* The number of S/G entries that each Receive Request may hold */
    uint32_t max_inline_data;  /* The requested inline data (in bytes) to be sent */
};
struct ibv_qp_init_attr {
    void *qp_context;          /* A private context that the QP will be associated with */
    struct ibv_cq *send_cq;    /* The CQ to be associated with the QP’s Send Queue */
    struct ibv_cq *recv_cq;    /* The CQ to be associated with the QP’s Receive Queue */
    struct ibv_srq *srq;       /* Optional: if not NULL, the SRQ to be associated with the QP */
    struct ibv_qp_cap cap;     /* The attributes of the QP to be created */
    enum ibv_qp_type qp_type;  /* The QP transport type */
    int sq_sig_all;            /* Indication if every completed Send Request will generate a
                                  Work Completion */
};
qp = ibv_create_qp(pd, &init_attr);
if (!qp) {
    fprintf(stderr, "Error, ibv_create_qp() failed\n");
    return -1;
}
if (ibv_destroy_qp(qp)) {
    fprintf(stderr, "Error, ibv_destroy_qp() failed\n");
    return -1;
}
struct ibv_sge {
    uint64_t addr;    /* Start address of the memory buffer (usually registered memory) */
    uint32_t length;  /* Size (in bytes) of the memory buffer */
    uint32_t lkey;    /* lkey of the Memory Region that is associated with this memory buffer */
};
Warning: The value zero in ‘length’ is special – it means 2 GB
• Warning: bad_wr is mandatory; it will be assigned the address of the Send Request whose posting
failed
struct ibv_send_wr {
    uint64_t wr_id;             /* Private context that will be available in the corresponding
                                   Work Completion */
    struct ibv_send_wr *next;   /* Address of the next Send Request. Should be NULL in the last
                                   Send Request */
    struct ibv_sge *sg_list;    /* Array of scatter/gather elements */
    int num_sge;                /* Number of elements in sg_list */
    enum ibv_wr_opcode opcode;  /* The opcode to be used */
    int send_flags;             /* Send flags. Or of the following flags:
                                   IBV_SEND_FENCE – Do not process this Send Request until the
                                     processing of previous RDMA Read and Atomic operations has
                                     completed
                                   IBV_SEND_SIGNALED – Generate a Work Completion after the
                                     processing of this Send Request ends
                                   IBV_SEND_SOLICITED – Generate a Solicited event for this
                                     message on the remote side
                                   IBV_SEND_INLINE – Allow the low-level driver to read the
                                     gather buffers */
    uint32_t imm_data;          /* Send message with immediate data (for supported opcodes);
                                   extra 32 bits, in network order, that will be available in
                                   the remote’s Work Completion */
    union {
        struct {                    /* Attributes for RDMA Read and Write opcodes */
            uint64_t remote_addr;   /* Remote start address (the message size is according
                                       to the S/G entries) */
            uint32_t rkey;          /* rkey of the Memory Region that is associated with the
                                       remote memory buffer */
        } rdma;
        struct {                    /* Attributes for Atomic opcodes */
            uint64_t remote_addr;   /* Remote start address (the message size is according
                                       to the S/G entries) */
            uint64_t compare_add;   /* Value to compare/add (depends on opcode) */
            uint64_t swap;          /* Value to swap if the comparison passed */
            uint32_t rkey;          /* rkey of the Memory Region that is associated with the
                                       remote memory buffer */
        } atomic;
        struct {                    /* Attributes for UD QP */
            struct ibv_ah *ah;      /* Address Handle to get to the remote side */
            uint32_t remote_qpn;    /* Remote QP number (or 0xffffff for a multicast message) */
            uint32_t remote_qkey;   /* Remote Q_Key value */
        } ud;
    } wr;
};
© 2014 Mellanox Technologies
Post Send Request: example (for RC/UC QPs)
• Warning: bad_wr is mandatory; it will be assigned the address of the Receive Request whose
posting failed
struct ibv_recv_wr {
    uint64_t wr_id;            /* Private context that will be available in the corresponding
                                  Work Completion */
    struct ibv_recv_wr *next;  /* Address of the next Receive Request. Should be NULL in the
                                  last Receive Request */
    struct ibv_sge *sg_list;   /* Array of scatter elements */
    int num_sge;               /* Number of elements in sg_list */
};
Polling for Work Completion checks if the processing of a Work Request has ended
A Work Completion holds information about a completed Work Request
• Every Work Completion contains information about the corresponding completed Work Request
Every Work Completion contains several attributes
• The following fields are always valid (even if the Work Completion ended with an error)
- wr_id
- status
- qp_num
- vendor_err
• The rest of the fields depend on the QP’s transport type, opcode and status
Work Completion of Send Requests:
• Mark that a Send Request was performed and its memory buffers can be (re)used
- For reliable transport QP: this means that the message was written in the buffers (if status is successful)
- For unreliable transport QP: this means that the message was sent from the local port
Work Completion of Receive Requests:
• Mark that an incoming message was completed and its memory buffers can be (re)used
• Contains some attributes about the incoming message, such as size, origin, etc.
• If the return value is non-negative – this is the number of polled Work Completions
• If the return value is negative – error occurred
struct ibv_wc {
    uint64_t wr_id;             /* Private context that was posted in the corresponding Work Request */
    enum ibv_wc_status status;  /* The status of the Work Completion */
    enum ibv_wc_opcode opcode;  /* The opcode of the Work Completion */
    uint32_t vendor_err;        /* Vendor specific error syndrome */
    uint32_t byte_len;          /* Number of bytes that were received */
    uint32_t imm_data;          /* Immediate data, in network order, if the flags indicate that
                                   such exists */
    uint32_t qp_num;            /* The local QP number that this Work Completion ended in */
    uint32_t src_qp;            /* The remote QP number */
    int wc_flags;               /* Work Completion flags. Or of the following flags:
                                   IBV_WC_GRH – Indicator that the first 40 bytes of the receive
                                     buffer(s) contain a valid GRH
                                   IBV_WC_WITH_IMM – Indicator that the received message
                                     contains immediate data */
    uint16_t pkey_index;
    uint16_t slid;              /* For UD QP: the source LID */
    uint8_t sl;                 /* For UD QP: the source Service Level */
    uint8_t dlid_path_bits;     /* For UD QP: the destination LID path bits */
};
/* Busy-poll until one Work Completion is available */
do {
    num_comp = ibv_poll_cq(cq, 1, &wc);
} while (num_comp == 0);
if (num_comp < 0) {
    fprintf(stderr, "ibv_poll_cq() failed\n");
    return -1;
}
if (wc.status != IBV_WC_SUCCESS) {
    /* wr_id is 64 bits wide: print it without truncating it to int */
    fprintf(stderr, "Failed status %s (%d) for wr_id %llu\n", ibv_wc_status_str(wc.status),
            wc.status, (unsigned long long)wc.wr_id);
    return -1;
}
Requester
• Open device
• Allocate PD
• Register MR
• Create CQ
• Create QP
• Connect QPs (CM)
• Post SR (Post Send Request)
• Poll CQ
• Check completion status
Responder
• Open device
• Allocate PD
• Register MR
• Create CQ
• Create QP
• Connect QPs (CM) + post RR
• Poll CQ
• Check completion status
Write a program that will open all the needed resources and transfer data for RC QP for
every Send opcode.
Optional:
1. Extend it to support RDMA Write operation.
2. Change the QP transport type to UC.
• Every asynchronous event must be acknowledged. Not doing this may cause destruction of RDMA
resources to be blocked forever.
struct ibv_async_event {
    union {                          /* The element that got the asynchronous event;
                                        depends on the event type */
        struct ibv_cq *cq;
        struct ibv_qp *qp;
        struct ibv_srq *srq;
        int port_num;
    } element;
    enum ibv_event_type event_type;  /* The asynchronous event type */
};
CQ events
• IBV_EVENT_CQ_ERR – Error occurred to the CQ
QP events
• IBV_EVENT_COMM_EST – Incoming message received while the QP is in the RTR state
• IBV_EVENT_SQ_DRAINED – The processing of all Send Requests was ended
• IBV_EVENT_PATH_MIG – The alternative path of the QP was loaded (for connected QPs)
• IBV_EVENT_QP_LAST_WQE_REACHED – Receive Request won’t be read from SRQ anymore
• IBV_EVENT_QP_FATAL – Error occurred to the QP
• IBV_EVENT_QP_REQ_ERR – Transport errors detected in responder side
• IBV_EVENT_QP_ACCESS_ERR – Violation detected in responder side
• IBV_EVENT_PATH_MIG_ERR – Failed to load the alternative path of the QP (for connected
QPs)
SRQ events
• IBV_EVENT_SRQ_LIMIT_REACHED – SRQ limit was reached
• IBV_EVENT_SRQ_ERR – Error occurred to the SRQ
Port events
• IBV_EVENT_PORT_ACTIVE – Port’s logical state became active
• IBV_EVENT_LID_CHANGE – Port’s LID changed
• IBV_EVENT_PKEY_CHANGE – Port’s P_Key table was changed
• IBV_EVENT_GID_CHANGE - Port’s GID table was changed
• IBV_EVENT_SM_CHANGE – New SM started to manage the subnet
• IBV_EVENT_CLIENT_REREGISTER – The SM sent a CLIENT_REREGISTER request to the port
• IBV_EVENT_PORT_ERR – Port’s logical state is not active anymore
Device events
• IBV_EVENT_DEVICE_FATAL – Something really bad happened to the device
if (ibv_get_async_event(context, &event)) {
    fprintf(stderr, "Error, ibv_get_async_event() failed\n");
    return -1;
}
ibv_ack_async_event(&event);
Write a program that opens a device, listens for asynchronous events in a loop and
prints them.
The following pseudo-code example demonstrates one possible way to work with
completion events. It performs the following steps:
Stage I: Preparation
1. Creates a CQ
2. Request for notification upon a new (first) completion event
Stage II: Completion Handling Routine
3. Wait for the completion event and ack it
4. Request for notification upon the next completion event
5. Empty the CQ
Note that an extra event may be triggered without having a corresponding completion
entry in the CQ. This occurs if a completion entry is added to the CQ between Step 4
and Step 5, and the CQ is then emptied (polled) in Step 5.
• ibv_req_notify_cq() requests a notification for any Work Completion, or only for Work Completions
of Receive Requests whose requester sent them with the solicited event indicator set
if (ibv_req_notify_cq(cq, 0)) {
    fprintf(stderr, "Error, ibv_req_notify_cq() failed\n");
    return -1;
}
...
if (ibv_get_cq_event(channel, &ev_cq, &ev_ctx)) {
    fprintf(stderr, "Error, ibv_get_cq_event() failed\n");
    return -1;
}
ibv_ack_cq_events(ev_cq, 1);
if (ibv_req_notify_cq(ev_cq, 0)) {
    fprintf(stderr, "Error, ibv_req_notify_cq() failed\n");
    return -1;
}
/* TODO - Need to empty the CQ here … */
• ibv_destroy_srq() should be called only after destroying all the QPs that are associated with the SRQ
int ibv_modify_srq(struct ibv_srq *srq,
struct ibv_srq_attr *srq_attr,
enum ibv_srq_attr_mask srq_attr_mask);
• Resize or modify the Shared Receive Queue attributes
int ibv_query_srq(struct ibv_srq *srq, struct ibv_srq_attr *srq_attr);
• Query the attributes of a Shared Receive Queue
struct ibv_srq_attr {
    uint32_t max_wr;            /* The number of Receive Requests that can be outstanding in the SRQ */
    uint32_t max_sge;           /* The number of scatter entries that each Receive Request may hold */
    uint32_t srq_limit;         /* The SRQ watermark value (only relevant in modify_srq) */
};
struct ibv_srq_init_attr {
    void *srq_context;          /* A private context that the SRQ will be associated with */
    struct ibv_srq_attr attr;   /* The SRQ attributes to be created */
};
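/* Destroying an SRQ */
if (ibv_destroy_srq(srq)) {
    fprintf(stderr, "Error, ibv_destroy_srq() failed\n");
    return -1;
}
/* Separate example: arming the SRQ limit (watermark) on a live SRQ */
memset(&srq_attr, 0, sizeof(srq_attr));
srq_attr.srq_limit = 10;
if (ibv_modify_srq(srq, &srq_attr, IBV_SRQ_LIMIT)) {
    fprintf(stderr, "Error, ibv_modify_srq() failed\n");
    return -1;
}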
Tips:
Add the SRQ handle when creating the QP
Post the RR to the SRQ and not to the QP
• ibv_destroy_ah() should be called only if there isn’t any outstanding Send Request that points to the AH
struct ibv_global_route {
    union ibv_gid dgid;           /* Destination GID address */
    uint32_t flow_label;          /* Flow label which is a hint for switches and routers which
                                     path to take */
    uint8_t sgid_index;           /* The index in the port’s GID table of the source GID */
    uint8_t hop_limit;            /* The number of hops to take before dropping the message */
    uint8_t traffic_class;        /* Traffic class of the message (priority) */
};
struct ibv_ah_attr {
    struct ibv_global_route grh;  /* Description of the Global Routing Header */
    uint16_t dlid;                /* The Destination LID (can be unicast or multicast) */
    uint8_t sl;                   /* The Service Level value of the message */
    uint8_t src_path_bits;        /* The source path bits used when the port has a range of LIDs */
    uint8_t static_rate;          /* The static rate between local and remote port speeds */
    uint8_t is_global;            /* Indication that the message will be sent with GRH */
    uint8_t port_num;             /* The local port number to send the message from */
};
ah = ibv_create_ah(pd, &ah_attr);
if (!ah) {
    fprintf(stderr, "Error, ibv_create_ah() failed\n");
    return -1;
}
…
if (ibv_destroy_ah(ah)) {
    fprintf(stderr, "Error, ibv_destroy_ah() failed\n");
    return -1;
}
Tips:
The AH should be set in the SR when posting it
Remote side attributes (QP number and Q_Key) should be added to the SR as well
The data in the receive buffer starts at offset 40, after the space reserved for the GRH
General tips
• Avoid using control operations in data path
- They will perform context switch
- They may allocate/free dynamic resources
• Set affinity for process/task
• Work with local NUMA node
• Use the MTU which provides the best performance
• Register physically contiguous memory
• UD is more scalable than RC
Posting
• Post multiple Work Requests in one call
• Avoid using many scatter/gather elements
• Atomic operations are performance killers
• Work with big messages
• Use selective signaling to reduce number of Work Completions
• Inline data will provide better latency
Polling
• Read multiple Work Completions in one call
• Use polling to get low latency and Completion events to get lower CPU usage
• When working with events: acknowledge multiple events at once
/sys/class/infiniband/<device name>/diag_counters
InfiniBand specifications
RDMAmojo (my blog)
The document “RDMA Aware Networks Programming User Manual”
The man pages
Code samples that come with libibverbs and librdmacm