Empowering Azure Storage with RDMA
Wei Bai, Shanim Sainul Abdeen, Ankit Agrawal, Krishan Kumar Attre, Paramvir Bahl, Ameya Bhagat, Gowri Bhaskara,
Tanya Brokhman, Lei Cao, Ahmad Cheema, Rebecca Chow, Jeff Cohen, Mahmoud Elhaddad, Vivek Ette, Igal Figlin,
Daniel Firestone, Mathew George, Ilya German, Lakhmeet Ghai, Eric Green, Albert Greenberg∗ , Manish Gupta,
Randy Haagens, Matthew Hendel, Ridwan Howlader, Neetha John, Julia Johnstone, Tom Jolly, Greg Kramer, David Kruse,
Ankit Kumar, Erica Lan, Ivan Lee, Avi Levy, Marina Lipshteyn, Xin Liu, Chen Liu∗ , Guohan Lu, Yuemin Lu, Xiakun Lu,
Vadim Makhervaks, Ulad Malashanka, David A. Maltz, Ilias Marinos, Rohan Mehta, Sharda Murthi, Anup Namdhari,
Aaron Ogus, Jitendra Padhye, Madhav Pandya, Douglas Phillips, Adrian Power, Suraj Puri, Shachar Raindel∗ , Jordan Rhee∗ ,
Anthony Russo, Maneesh Sah, Ali Sheriff, Chris Sparacino, Ashutosh Srivastava, Weixiang Sun∗ , Nick Swanson, Fuhou Tian,
Lukasz Tomczyk, Vamsi Vadlamuri, Alec Wolman, Ying Xie, Joyce Yom, Lihua Yuan, Yanzhao Zhang, Brian Zill
Microsoft
Abstract

Given the wide adoption of disaggregated storage in public clouds, networking is the key to enabling high performance and high reliability in a cloud storage service. In Azure, we choose Remote Direct Memory Access (RDMA) as our transport and aim to enable it for both storage frontend traffic (between compute virtual machines and storage clusters) and backend traffic (within a storage cluster) to fully realize its benefits. As compute and storage clusters may be located in different datacenters within an Azure region, we need to support RDMA at regional scale.

This work presents our experience in deploying intra-region RDMA to support storage workloads in Azure. The high complexity and heterogeneity of our infrastructure bring a series of new challenges, such as the problem of interoperability between different types of RDMA network interface cards. We have made several changes to our network infrastructure to address these challenges. Today, around 70% of traffic in Azure is RDMA and intra-region RDMA is supported in all Azure public regions. RDMA helps us achieve significant disk I/O performance improvements and CPU core savings.

Figure 1: Traffic statistics of all Azure public regions between January 18 and February 16, 2023. Traffic was measured by collecting switch counters of server-facing ports on all Top of Rack (ToR) switches. Around 70% of traffic was RDMA.

∗ Albert Greenberg is now with Uber. Chen Liu is now with Meta. Shachar Raindel and Jordan Rhee are now with Google. Weixiang Sun is now with a stealth startup. This work was performed when they were with Microsoft.

1 Introduction

High performance and highly reliable storage is one of the most fundamental services in public clouds. In recent years, we have witnessed significant improvements in storage media and technologies [73] and customers also desire similar performance in the cloud. Given the wide adoption of disaggregated storage in the cloud [35, 46], the network interconnecting compute and storage clusters becomes a key performance bottleneck for cloud storage. Despite the sufficient bandwidth capacity provided by Clos-based network fabrics [25, 48], the legacy TCP/IP stack suffers from high processing delay, low single-core throughput, and high CPU consumption, thus making it ill-suited for this scenario.

Given these limitations, Remote Direct Memory Access (RDMA) offers a promising solution. By offloading the network stack to the network interface card (NIC) hardware, RDMA achieves ultra-low processing latency and high throughput with near zero CPU overhead. In addition to performance improvements, RDMA also reduces the number of CPU cores reserved on each server for network stack processing. These saved CPU cores can then be sold as customer virtual machines (VMs) or used for application processing.

To fully utilize the benefits of RDMA, we aim to enable it for both storage frontend traffic (between compute VMs and storage clusters) and backend traffic (within a storage cluster). This is different from previous work [46] that targets RDMA only for the storage backend. In Azure, due to capacity issues, corresponding compute and storage clusters may be located in different datacenters within a region. This imposes a requirement that our storage workloads rely on support for RDMA at regional scale.

In this paper, we summarize our experience in deploying intra-region RDMA to support Azure storage workloads. Compared to previous RDMA deployments [46, 50], intra-region RDMA deployment introduces many new challenges due to high complexity and heterogeneity within Azure regions. As Azure infrastructure keeps evolving incrementally, different clusters may be deployed with different RDMA NICs. While all the NICs support DCQCN [112], their implementations are very different. This results in many undesirable behaviors when different NICs communicate with each other. Similarly, heterogeneous switch software and hardware from multiple vendors significantly increase our operational effort. In addition, long-haul cables interconnecting datacenters cause large propagation delays and large round-trip time (RTT) variations within a region. This brings new challenges to congestion control.

We have made several changes to our network infrastructure, from application layer protocols to link layer flow control, to safely enable intra-region RDMA for Azure storage traffic. We developed new RDMA-based storage protocols with many optimizations and failover support, and seamlessly integrated them into the legacy storage stack (§4). We built RDMA Estats to monitor the status of the host network stack (§5). We leveraged SONiC to enforce a unified software stack across different switch platforms (§6). We updated firmware of NICs to unify their DCQCN behaviors and used the combination of Priority-based Flow Control (PFC) and DCQCN to achieve high throughput, low latency and near zero packet losses (§7).

In 2018, we started to enable RDMA for storage backend traffic. In 2019, we started to enable RDMA to serve customer frontend traffic. Figure 1 gives traffic statistics of all Azure public regions between January 18 and February 16, 2023. As of February 2023, around 70% of traffic in Azure was RDMA and intra-region RDMA was supported in all Azure public regions. RDMA helps us achieve significant disk I/O performance improvements and CPU core savings.

2 Background

In this section, we first present background on Azure's network and storage architecture. Then, we introduce the motivation for and challenges to enabling intra-region RDMA.

Figure 2: The network architecture of an Azure region.

2.1 Network Architecture of an Azure Region

In cloud computing, a region [2, 5, 8] is a group of datacenters deployed within a latency-defined perimeter. Figure 2 shows the simplified topology of an Azure region. The servers within a region are connected through an Ethernet-based Clos network with four tiers of switches¹: tier 0 (T0), tier 1 (T1), tier 2 (T2) and regional hub (RH). We use external BGP (eBGP) for routing and equal-cost multi-path (ECMP) for load balancing. We deploy the following four types of units.

• Rack: a T0 switch and the servers connected to it.
• Cluster: a set of racks connected to the same set of T1 switches.
• Datacenter: a set of clusters connected to the same set of T2 switches.
• Region: datacenters connected to the same set of RH switches. In contrast with short links (several to hundreds of meters) in datacenters [50], T2 and RH switches are connected by long-haul links whose lengths can be as long as tens of kilometers.

There are two things to notice about this architecture. First, due to long-haul links between T2 and RH, the base round-trip time (RTT) varies from a few microseconds within a datacenter to as large as 2 milliseconds within a region. Second, we use two types of switches: pizza box switches for T0 and T1, and chassis switches for T2 and RH. The pizza box switch, which has been widely studied in the research community, typically has a single switch ASIC with shallow packet buffers [31]. In contrast, chassis switches are built using multiple switch ASICs with deep packet buffers based on the Virtual Output Queue (VoQ) architecture [3, 6].

¹ In this paper, we use switch to denote the layer 3 switch which can perform IP routing. We use the terms switch and router interchangeably.
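The latency spread above can be sanity-checked from propagation delay alone. The sketch below assumes roughly 5 µs of one-way delay per kilometer of fiber and uses purely illustrative distances; it is not a measurement from the paper.

```python
# Back-of-envelope base-RTT estimate from fiber length (illustrative only).
PROP_DELAY_US_PER_KM = 5.0  # ~5 us one-way propagation per km of optical fiber

def base_rtt_us(one_way_fiber_km: float) -> float:
    """Round-trip propagation delay (us) for a given one-way fiber distance."""
    return 2 * one_way_fiber_km * PROP_DELAY_US_PER_KM

if __name__ == "__main__":
    paths = [
        ("intra-datacenter path (~0.3 km of fiber)", 0.3),
        ("single long-haul T2-RH segment (~50 km)", 50.0),
        ("regional path chaining long-haul segments (~200 km)", 200.0),
    ]
    for label, km in paths:
        print(f"{label}: ~{base_rtt_us(km):.0f} us base RTT")
    # A few microseconds inside a datacenter versus roughly 2 ms across a region
    # comes almost entirely from the long-haul fiber.
```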
2.2 High Level Architecture of Azure Storage

Figure 3: High-level architecture of Azure storage.

In Azure, we disaggregate compute and storage resources for cost savings and auto-scaling. There are two main types of clusters in Azure: compute and storage. VMs are created in compute clusters but the actual storage of Virtual Hard Disks (VHDs) resides in storage clusters.

Figure 3 shows the high-level architecture of Azure storage [35]. Azure storage has three layers: the frontend layer, the partition layer, and the stream layer. The stream layer is an append-only distributed file system. It stores bits on the disk and replicates them for durability, but it does not understand higher level storage abstractions, e.g., Blobs, Tables and VHDs. The partition layer understands different storage abstractions, manages partitions of all the data objects in a storage cluster, and stores object data on top of the stream layer. The daemon processes of the partition layer and the stream layer are called the Partition Server (PS) and the Extent Node (EN), respectively. PS and EN are co-located on each storage server. The frontend (FE) layer consists of a set of servers that authenticate and forward incoming requests to corresponding PSs. In some cases, FE servers can also directly access the stream layer for efficiency.

When a VM wants to write to its disks, the disk driver running in the host domain of the compute server issues I/O requests to the corresponding storage cluster. The FE or PS parses and validates the request, and generates requests to corresponding ENs in the stream layer to write the data. At the stream layer, a file is essentially an ordered list of large storage chunks called "extents". To write a file, data is appended to the end of an active extent, which is replicated three times in the storage cluster for durability. Only after receiving successful responses from all the ENs does the FE or PS send the final response back to the disk driver. Disk reads are different: the FE or PS reads data from any EN replica and sends the response back to the disk driver.
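As an illustration of the write path just described (the FE or PS appends to an active extent on three replica ENs and responds only after all of them acknowledge), here is a minimal sketch. The class and method names are invented for illustration and are not the actual Azure storage interfaces.

```python
# Minimal sketch of the 3-way replicated append described in §2.2.
# All names (ExtentNode, PartitionServer, append, ...) are illustrative only.
class ExtentNode:
    def __init__(self, name: str):
        self.name = name
        self.extents: dict[str, bytearray] = {}

    def append(self, extent_id: str, data: bytes) -> bool:
        self.extents.setdefault(extent_id, bytearray()).extend(data)
        return True  # success response back to the FE/PS

class PartitionServer:
    def __init__(self, replicas: list[ExtentNode]):
        self.replicas = replicas  # an active extent is replicated three times

    def write(self, extent_id: str, data: bytes) -> bool:
        # Data is appended to the active extent on every replica; the final
        # response is sent only after ALL ENs acknowledge.
        return all(en.append(extent_id, data) for en in self.replicas)

    def read(self, extent_id: str, offset: int, length: int) -> bytes:
        # A read can be served from ANY replica.
        return bytes(self.replicas[0].extents[extent_id][offset:offset + length])

if __name__ == "__main__":
    ps = PartitionServer([ExtentNode(f"EN{i}") for i in range(3)])
    assert ps.write("extent-0", b"hello")
    print(ps.read("extent-0", 0, 5))
```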
In addition to user-facing workloads, there are also many background workloads in the storage clusters, e.g., garbage collection and erasure coding [57]. We classify our storage traffic into two categories: frontend (between compute and storage servers, e.g., VHD write and read requests) and backend (between storage servers, e.g., replication and disk reconstruction). Our storage traffic has incast-like characteristics. The most typical example is data reconstruction, which is implemented in the stream layer [57]. The stream layer erasure codes a sealed extent into several fragments, and then sends the encoded fragments to different servers to store. When the user wants to read a fragment which is unavailable due to a failure, the stream layer will read the other fragments from multiple storage servers to reconstruct the target fragment.

2.3 Motivation for Intra-Region RDMA

Storage technology has improved significantly in recent years. For example, Non-Volatile Memory Express (NVMe) Solid-State Drives (SSDs) can provide tens of Gbps of throughput with request latencies in the hundreds of microseconds [105]. Many customers demand similar performance in the cloud. High performance cloud storage solutions [1, 4] impose stringent performance requirements on the underlying network due to the disaggregated and distributed storage architecture (§2.2). While datacenter networks generally provide sufficient bandwidth capacity, the legacy TCP/IP stack in the OS kernel becomes a performance bottleneck due to its high processing latency and low single-core throughput. What is worse, the performance of the legacy TCP/IP stack also depends on OS scheduling. To provide predictable storage performance, we must reserve enough CPU cores on both compute and storage nodes for the TCP/IP stack to process peak storage workloads. Burning CPU cores takes away the processing power that could otherwise be sold as customer VMs, thus increasing the overall cost of providing cloud services.

Given these limitations, RDMA offers a promising solution. By offloading the network stack to the NIC hardware, RDMA achieves predictable low processing latency (a few microseconds) and high throughput (line rate for a single flow) with near zero CPU overhead. In addition to its performance benefits, RDMA also reduces the number of CPU cores reserved on each server for network stack processing. These saved CPU cores can then be sold as customer VMs or used for storage request processing.

To fully achieve the benefits of RDMA, we must enable RDMA for both storage frontend traffic and backend traffic. Enabling RDMA for backend traffic is relatively easy because almost all the backend traffic stays within a storage cluster. In contrast, frontend traffic crosses different clusters within a region. Even though we try to co-locate corresponding compute and storage clusters to minimize latency, sometimes they may still end up located in different datacenters within a region due to capacity issues. This imposes the requirement that our storage workloads rely on support for RDMA at regional scale.

2.4 Challenges

We faced many challenges when enabling intra-region RDMA because our design was limited by many practical constraints.

Practical considerations: We aimed to enable intra-region RDMA over the legacy infrastructure. While we had some flexibility to reconfigure and upgrade software stacks, e.g., the NIC driver, the switch OS, and the storage stack, it was operationally infeasible to replace the underlying hardware, e.g., the NICs and switches. Hence, we adopted RDMA over commodity Ethernet v2 (RoCEv2) [29] to keep compatibility with our IP-routed networks (§2.1). Before starting this project, we had deployed a significant number of our first generation RDMA NICs, which implement go-back-N retransmission in the NIC firmware with limited processing capacity. Our measurements showed that it took hundreds of microseconds to recover a lost packet, which was even worse than the TCP/IP software stack. Given such a large performance degradation, we made the decision to adopt Priority-based Flow Control
(PFC) [60] to eliminate packet losses due to congestion.

Challenges: Before this project, we had deployed RDMA in some clusters to support Bing services [50], and we learnt several lessons from this deployment. Compared to intra-cluster RDMA deployments [46, 50], intra-region RDMA deployments introduce many new challenges due to the high complexity and heterogeneity of the infrastructure.

• Heterogeneous NICs: Cloud infrastructure keeps evolving incrementally, often one cluster or one rack at a time with the latest generation of server hardware [91]. Different clusters within a region may have different NICs. We have deployed three generations of commodity RDMA NICs from a popular NIC vendor: Gen1, Gen2 and Gen3. Each NIC generation has a different implementation of DCQCN. This results in many undesired interactions when different NIC generations communicate with each other.
• Heterogeneous switches: Similar to server infrastructure, we keep deploying new switches to reduce costs and increase the bandwidth capacity. We have deployed many switch ASICs and multiple switch OSes from different vendors. However, this has increased our operational effort significantly because many aspects are vendor specific, for example, buffer architectures, sizes, allocation mechanisms, monitoring and configuration, etc.
• Heterogeneous latency: As shown in §2.1, there are large RTT variations from several microseconds to 2 milliseconds within a region, due to long-haul links between T2 and RH. Hence, RTT fairness re-emerges as a key challenge. In addition, the large propagation delay of long-haul links also imposes large pressure on PFC headroom [12].

Like other services in public clouds, availability, diagnosis, and serviceability are key aspects for our RDMA storage system. To achieve high availability, we always prepare for unexpected zero-day problems despite large investments in testing. Our system must detect performance anomalies and perform automatic failover if necessary. To understand and debug faults, we must build fine-grained telemetry systems to deliver crystal clear visibility into every component in the end-to-end path. Our system also must be serviceable: storage workloads should survive NIC driver updates and switch software updates.

3 Overview

We have made several changes to our network infrastructure, from application layer protocols to link layer flow control, to safely empower Azure storage with RDMA. We developed two RDMA-based protocols: sU-RDMA (§4.1) and sK-RDMA (§4.2), which we have seamlessly integrated into our legacy storage stack to support backend communication and frontend communication, respectively. Between the storage protocols and the NIC, we deployed a monitoring system, RDMA Estats (§5), giving us visibility into the host network stack by providing an accurate breakdown of cost for each RDMA operation.

In the network, we use the combination of PFC and DCQCN [112] to achieve high throughput, low latency, and near zero losses due to congestion. DCQCN and PFC were the state-of-the-art commercial solutions when we started the project. To optimize the customer experience, we use two priorities to isolate storage frontend traffic and backend traffic. To mitigate the switch heterogeneity problem, we developed and deployed SONiC [15] to provide a unified software stack across different switch platforms (§6). To mitigate the interoperability problem of heterogeneous NICs, we updated the firmware of NICs to unify their DCQCN behaviors (§7). We carefully tuned DCQCN and switch buffer parameters to optimize performance across different scenarios.

3.1 PFC Storm Mitigation Using Watchdogs

We use PFC to prevent congestion packet losses. However, malfunctioning NICs and switches can continually send PFC pause frames in the absence of congestion [50], thus completely blocking the peer device for a long time. Moreover, these endless PFC pause frames can eventually propagate into the whole network, thus causing collateral damage to innocent devices. Such endless PFC pause frames are called a PFC storm. In contrast, normal congestion-triggered PFC pause frames only slow down the data transmission of the peer device through intermittent pauses and resumes.

To detect and mitigate PFC storms, we designed and deployed a PFC watchdog [11, 50] on every switch and bump-in-the-wire FPGA card [42] between T0 switches and servers. When the PFC watchdog detects that a queue has been in the paused state for an abnormally long duration, e.g., hundreds of milliseconds, it disables PFC and drops all the packets on this queue, thereby preventing PFC storms from propagating into the whole network.

3.2 Security

We use RDMA to empower first-party storage traffic in a trusted environment, including storage servers, the host domain of compute servers, switches and links. Therefore we are secure against issues described in [69, 94, 104, 109].

4 Storage Protocols over RDMA

In this section, we introduce two storage protocols built on top of RDMA Reliable Connections (RC): sU-RDMA and sK-RDMA. Both protocols aim to optimize performance while keeping good compatibility with legacy software stacks.

Figure 4: Azure storage backend network stack.

4.1 sU-RDMA

sU-RDMA [87] is used for storage backend (storage to storage) communication. Figure 4 shows the architecture of our storage backend network stack with the sU-RDMA modules highlighted. The Azure Storage Network Protocol is an RPC protocol directly used by applications to send request and response objects. It leverages socket APIs to implement connection management, sending and receiving messages.

To simplify RDMA integration with the storage stack, we built sU-RDMALib, a user space library that exposes socket-like byte-stream APIs to upper layers. To map socket-like APIs to RDMA operations, sU-RDMALib needs to handle the following challenges:

• When the RDMA application cannot directly write into an existing memory region (MR), it must either register the application buffer as a new MR or copy its data into an existing MR. Both options can introduce large latency penalties and we should minimize these overheads.
• If we use RDMA Send and Receive, the receiver must pre-post enough Receive requests.
• The RDMA sender and receiver must be in agreement on the size of data being transferred.

To reduce memory registrations, which are especially expensive for small messages [44], sU-RDMALib maintains a common buffer pool of pre-registered memory shared across multiple connections. sU-RDMALib also provides APIs to allow applications to request and release registered buffers. To avoid Memory Translation Table (MTT) cache misses on the NIC [50], sU-RDMALib allocates large memory slabs from the kernel and registers memory over these slabs. This buffer pool can also autoscale based on runtime usage. To avoid overwhelming the receiver, sU-RDMALib implements a receiver-driven credit-based flow control where credits represent the resources (e.g., available buffers and posted Receive requests) allocated by the receiver. The receiver sends credit update messages back to the sender regularly. When we started designing sU-RDMALib, we did consider using RDMA Send and Receive with a fixed buffer size S for each Send/Receive request to transfer data. However, this design causes a dilemma. If we use a large S, we may waste much memory space because a Send request fully uses the receive buffer of the Receive request, regardless of its actual message size. In contrast, a small S causes large data fragmentation overhead. Hence, sU-RDMALib uses three transfer modes based on the message size [87].

• Small messages: Data is transferred using RDMA Send and Receive.
• Medium messages: The sender posts an RDMA Write request to transfer data, and a Send request with "Write Done" to notify the receiver.
• Large messages: The sender first posts an RDMA Send request carrying the description of the local data buffer to the receiver. Then the receiver posts a Read request to pull the data. Finally, the receiver posts a Send request with "Read Done" to notify the sender.

On top of sU-RDMALib, we built modules to enable dynamic transitions between TCP and RDMA, which is critical for failover and recovery. The transition process is gradual. We periodically close a small portion of all connections and establish new connections using the desired transport.

Unlike TCP, RDMA uses rate based congestion control [112] without tracking the number of in-flight packets (the window size). Hence, RDMA tends to inject excessive in-flight packets, thus triggering PFC. To mitigate this, we implemented a static flow control mechanism in the Azure Storage Network Protocol by dividing a message into fixed-sized chunks and only allowing a single in-flight chunk for each connection. Chunking can significantly improve performance under high-degree incast with negligible CPU overhead.
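The three size-based transfer modes of sU-RDMALib described above can be restated in code form. The size thresholds and names below are illustrative; the paper does not give the actual cutoffs.

```python
from enum import Enum

# Size thresholds are illustrative; the paper does not give the exact cutoffs.
SMALL_MAX = 4 * 1024
MEDIUM_MAX = 64 * 1024

class Mode(Enum):
    SEND_RECV = "small: RDMA Send/Receive"
    WRITE_PLUS_DONE = "medium: RDMA Write, then Send('Write Done')"
    READ_BY_RECEIVER = "large: Send(buffer descriptor), receiver RDMA Read, Send('Read Done')"

def pick_transfer_mode(msg_len: int) -> Mode:
    """Choose how sU-RDMALib moves a message, per the three modes in §4.1."""
    if msg_len <= SMALL_MAX:
        return Mode.SEND_RECV
    if msg_len <= MEDIUM_MAX:
        return Mode.WRITE_PLUS_DONE
    return Mode.READ_BY_RECEIVER

if __name__ == "__main__":
    for n in (512, 16 * 1024, 1 * 1024 * 1024):
        print(n, "->", pick_transfer_mode(n).value)
```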

4.2 sK-RDMA

Figure 5: sK-RDMA's data flow. We use blue arrows and red arrows to represent control messages and data messages, respectively. Arrow width represents data size.

sK-RDMA is used for storage frontend (compute to storage) communication. In contrast with sU-RDMA, which runs RDMA in user space, sK-RDMA runs RDMA in kernel space. This enables the disk driver, which runs in kernel space in the host domain of compute servers, to directly use sK-RDMA to issue network I/O requests. sK-RDMA leverages and extends Server Message Block (SMB) Direct [14], which provides socket-like kernel-mode RDMA interfaces. Similar to sU-RDMA, sK-RDMA also provides credit-based flow control and dynamic transition between RDMA and TCP.

Figure 5 shows sK-RDMA's data flow for reading and writing disks. The compute server first posts a Fast Memory Registration (FMR) request to register data buffers. Then it posts an RDMA Send request to transfer a request message to the storage server. The request carries a disk I/O command, and a description of FMR registered buffers available for RDMA access. According to the InfiniBand (IB) specification, the NIC should wait for the completion of the FMR request before processing any subsequently posted requests. Hence, the request message is actually pushed onto the wire after the memory registration. The data transfer is initiated by the storage server using RDMA Read or Write. After the data transfer, the storage server sends a response message to the compute server using RDMA Send With Invalidate.

To detect data corruptions, which can happen silently due to various software and hardware bugs along the path, both sK-RDMA and sU-RDMA implement a Cyclical Redundancy Check (CRC) on all application data. In sK-RDMA, the compute server calculates the CRC of the data for disk writes. These calculated CRCs are included in the request messages, and used by the storage server to validate the data. For disk reads, the storage server performs the CRC calculations and includes them in the response messages, and the compute server uses them to validate the data.
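To summarize the frontend data flow and end-to-end CRC protection just described, here is a schematic walk-through for a disk write. The steps follow Figure 5, but the function names and the choice of zlib.crc32 are illustrative stand-ins, not the actual sK-RDMA interfaces.

```python
import zlib

# Schematic sK-RDMA disk-write flow (§4.2). Each step is a print-out rather than
# a real RDMA verb; the step ordering and the CRC check are what the sketch shows.
def compute_server_disk_write(payload: bytes):
    crc = zlib.crc32(payload)                   # CRC computed by the compute server
    print("1. FMR: register payload buffer for remote access")
    print("2. RDMA Send: I/O command + FMR buffer description + CRC")
    return {"crc": crc, "buffer": payload}      # what the request message conveys

def storage_server_handle_write(request):
    data = request["buffer"]
    print("3. RDMA Read: storage server pulls the data from the registered buffer")
    if zlib.crc32(data) != request["crc"]:      # validate against the carried CRC
        raise ValueError("silent data corruption detected")
    print("4. RDMA Send With Invalidate: response (also invalidates the FMR buffer)")
    return "OK"

if __name__ == "__main__":
    req = compute_server_disk_write(b"8 KB of disk payload ...")
    print(storage_server_handle_write(req))
```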
5 RDMA Estats

To understand and debug faults, we need fine-grained telemetry tools to capture behaviors of every component in the end-to-end path. Despite many existing tools [51, 97, 114] to diagnose switch and link faults, none of these tools gives us good visibility into the RDMA network stack at end hosts.

Inspired by diagnostic tools for TCP [79], we developed RDMA Extended Statistics (Estats) to diagnose performance problems in both the network and the host. If an RDMA application is performing poorly, RDMA Estats enables us to tell if the bottleneck is in the sender, the receiver, or the network.

To this end, RDMA Estats provides a fine-grained breakdown of latency for each RDMA operation, in addition to collecting regular counters such as bytes sent/received and number of NACKs. The requester NIC records timestamps at one or more measurement points as the work queue element (WQE) traverses the transmission pipeline. When a response (ACK or read response) is received, the NIC records additional timestamps at measurement points along the receive pipeline (Figure 6). The following measurement points are required in any RDMA Estats implementation in Azure:

T1: WQE posting: Host processor timestamp when the WQE is posted to the submission queue.

T5: CQE generation: NIC timestamp when the completion queue element (CQE) is generated in the NIC.

T6: CQE polling: Host timestamp when the CQE is polled by software.

Figure 6: RDMA Estats measurement points. There are four NIC timestamps and two host timestamps. We use blue arrows and red arrows to represent PCIe transactions and network transfers, respectively. Arrow width represents data size.

In Azure, the NIC driver reports various latencies derived from the above timestamps. For example, T6 − T1 is the operation latency seen by the RDMA consumer, while T5 − T1 is the latency seen by the NIC. A user-mode agent groups the latency samples by connection, operation type, and (success/failure) status to create latency histograms for each group. By default, a histogram covers a one-minute interval. Each histogram's quantiles and summary statistics are fed into Azure's telemetry pipeline. As our diagnostics evolved, we added to our user-mode agent the ability to collect and upload NIC and QP state dumps during high latency events. Finally, we extended the scope of event-triggered data collection by the user-mode agent to include NIC statistics and state dumps in case of events not specific to RDMA (e.g., servicing operations that impact connectivity).

The collection of latency samples adds overhead to the WQE posting and completion processing code paths. This overhead is dominated by keeping the NIC and host timestamps synchronized. To reduce the overhead, we developed a clock synchronization procedure that attempts to minimize the frequency of reading the NIC clock registers, while maintaining low deviations.

RDMA Estats can significantly reduce the time to debug and mitigate storage performance incidents by quickly ruling out (or in) network latency. In §8.3, we share our experience in diagnosing the FMR hidden fence bug using RDMA Estats.
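A small sketch of the bookkeeping RDMA Estats performs: deriving per-operation latencies from the timestamps and bucketing them per (connection, operation type, status) group. The sample values and grouping are invented for illustration, and the NIC/host clocks are assumed to be already synchronized.

```python
from collections import defaultdict
import statistics

# Each sample carries the timestamps defined in §5 (host clock for T1/T6, NIC
# clock for T5; here we pretend they are already on a common timebase).
samples = [
    # (connection, op type, status, T1, T5, T6) -- microseconds, illustrative values
    ("qp-17", "WRITE", "success", 0.0, 11.0, 13.5),
    ("qp-17", "WRITE", "success", 0.0, 10.2, 12.0),
    ("qp-42", "READ",  "success", 0.0, 45.0, 47.1),
]

histograms = defaultdict(list)
for conn, op, status, t1, t5, t6 in samples:
    nic_latency = t5 - t1        # latency seen by the NIC
    consumer_latency = t6 - t1   # latency seen by the RDMA consumer
    histograms[(conn, op, status)].append((nic_latency, consumer_latency))

for group, values in histograms.items():
    nic, consumer = zip(*values)
    print(group,
          f"NIC p50={statistics.median(nic):.1f}us",
          f"consumer p50={statistics.median(consumer):.1f}us")
```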

6 Switch Management

6.1 Overcoming Heterogeneity with SONiC

Our RDMA deployment heavily relies on the support of switches. However, heterogeneous switch ASICs and OSes from multiple vendors have brought significant challenges to network management. For example, commercial switch OSes are designed to satisfy diverse requirements of all the customers, thus leading to complex software stacks and slow feature evolution [39]. In addition, different switch ASICs provide different buffer architectures and mechanisms, thus increasing the effort to qualify and test them for Azure's RDMA deployment.

Our solutions to the above challenges were two-fold. On one hand, we worked closely with our vendors to define concrete feature requirements and test plans, and to understand their low-level implementation details. On the other hand, in collaboration with many partners, we developed and deployed an in-house cross-platform switch OS called Software for Open Networking in the Cloud (SONiC) [15]. Based on a Switch Abstraction Interface (SAI) [20], SONiC manages heterogeneous switches from multiple vendors with a simplified and unified software stack. It breaks apart monolithic switch software into multiple containerized components. Containerization provides clean isolation, improves development agility, and enables choices on a per-component basis. Network operators can customize SONiC with only the features they require, thereby creating a "lean stack".

6.2 Buffer Model and Configuration Practices of SONiC on Pizza Box Switches

SONiC provides all the features required by RDMA deployments, such as ECN marking, PFC, a PFC watchdog (§3.1) and a shared buffer model. In the interest of space, we briefly introduce the buffer model and configuration practices of SONiC on pizza box switches, which are used at T0 and T1 (§2.1). We provide a buffer configuration example in §A.

We typically allocate three buffer pools on a pizza box switch: (1) the ingress_pool for ingress admission control of all packets, (2) the egress_lossy_pool for egress admission control of lossy packets, and (3) the egress_lossless_pool for egress admission control of lossless packets. Note that these buffer pools and queues are not backed by separate dedicated buffers, but instead are essentially counters applied to a single physical shared buffer and used for admission control purposes. Each counter is updated only by the packets mapped to it, and the same packet can be mapped to multiple queues and pools simultaneously. For example, a lossless (lossy) packet of priority p from source port s to destination port d updates ingress queue (s, p), egress queue (d, p), ingress_pool and egress_lossless_pool (egress_lossy_pool). A packet is accepted only if it passes both ingress and egress admission controls. Counters increment by the size of the admitted packet, and decrement by the size of the departing packet. We use both dynamic thresholds [40] and static thresholds to limit the queue lengths.

We apply ingress admission control only to lossless traffic, and we apply egress admission control only to lossy traffic. If the switch buffer size is B, then the ingress_pool size must be smaller than B, reserving enough space for the PFC headroom buffer (§7.1). When an ingress lossless queue hits the dynamic threshold, the queue enters the "paused" state, and the switch sends PFC pause frames to the upstream device. Future arriving packets on this ingress lossless queue use the PFC headroom buffer rather than ingress_pool. In contrast, for ingress lossy queues we configure a static threshold which equals the switch buffer size B. Since ingress lossy queue lengths cannot hit the switch buffer size, lossy packets can bypass ingress admission control.

At egress, lossy and lossless packets are mapped to the egress_lossy_pool and egress_lossless_pool, respectively. We configure both the size of the egress_lossless_pool and the static thresholds for egress lossless queues to B so that lossless packets bypass egress admission control. In contrast, the size of the egress_lossy_pool must be no larger than the size of the ingress_pool because lossy packets should not use any of the PFC headroom buffer at ingress. Egress lossy queues are configured to use dynamic thresholds [40] to drop packets.
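The admission-control rules above can be condensed into a short sketch: lossless packets are policed only at ingress (hitting a dynamic threshold asserts PFC), and lossy packets only at egress. The pool sizes, alpha value, and threshold formula are illustrative; the actual configuration is the one referenced in §A.

```python
# Illustrative shared-buffer admission sketch for a pizza box switch (§6.2).
# Pools are counters over one physical buffer; all numbers are made up.
B = 32 * 1024 * 1024                  # total switch buffer (bytes)
ingress_pool_size = 24 * 1024 * 1024  # < B, leaving room for PFC headroom
egress_lossy_pool_size = 24 * 1024 * 1024
ALPHA = 0.5                           # dynamic-threshold scaling factor

def dynamic_threshold(pool_size: int, pool_used: int) -> float:
    # Classic shared-buffer rule: a queue may grow up to alpha * remaining free space.
    return ALPHA * (pool_size - pool_used)

def admit_lossless(queue_len: int, ingress_pool_used: int) -> str:
    # Lossless traffic is only policed at ingress; hitting the threshold asserts PFC
    # and subsequent packets consume PFC headroom instead of the ingress pool.
    if queue_len > dynamic_threshold(ingress_pool_size, ingress_pool_used):
        return "pause upstream (PFC), use headroom"
    return "accept into ingress_pool"

def admit_lossy(queue_len: int, egress_lossy_pool_used: int) -> str:
    # Lossy traffic bypasses ingress admission and is dropped at egress if needed.
    if queue_len > dynamic_threshold(egress_lossy_pool_size, egress_lossy_pool_used):
        return "drop"
    return "accept into egress_lossy_pool"

if __name__ == "__main__":
    print(admit_lossless(queue_len=2_000_000, ingress_pool_used=22_000_000))
    print(admit_lossy(queue_len=2_000_000, egress_lossy_pool_used=4_000_000))
```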
6.3 Testing RDMA Features with SONiC

We use nightly tests to track the quality of SONiC switches. In this section, we briefly introduce our methods for testing RDMA features with SONiC switches.

Software-based Tests: We leveraged the Packet Testing Framework (PTF) [10] to develop test cases for SONiC in general. PTF is mostly used for testing packet forwarding behaviors, so testing RDMA features with it requires additional effort.

Our testing approach is inspired by breakpoints in software debugging. To set a "breakpoint" for the switch, we first block the transmission of a switch port using SAI APIs. We then generate a series of packets destined for the blocked port and capture one or several snapshots of the switch states (e.g., buffer watermark), analogous to dumping the values of variables in software debugging. Next, we release the port and dump the received packets. We determine if the test passes by analyzing both the captured switch snapshots and the received packets. We use this approach to test buffer management mechanisms, buffer related counters, and packet schedulers.
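A schematic version of the "breakpoint" test flow just described. The FakeSwitch and FakeTrafficGen stubs stand in for the real SAI/PTF plumbing, and none of their method names are actual APIs; only the block, send, snapshot, release, check structure mirrors the text.

```python
# Schematic "breakpoint" test flow (§6.3). All class and method names are
# illustrative stubs, not real SAI or PTF interfaces.
class FakeSwitch:
    def __init__(self):
        self.blocked = set()
        self.buffered = 0
    def block_port(self, port): self.blocked.add(port)
    def release_port(self, port): self.blocked.discard(port)
    def enqueue(self, port, nbytes):
        if port in self.blocked:
            self.buffered += nbytes          # packets pile up behind the "breakpoint"
    def read_counters(self, port): return {"buffer_watermark": self.buffered}

class FakeTrafficGen:
    def __init__(self, switch): self.switch, self.sent = switch, 0
    def send_burst(self, port, count, size):
        for _ in range(count): self.switch.enqueue(port, size)
        self.sent += count
        return count
    def capture(self, port): return [b"pkt"] * self.sent   # everything drains on release

def run_breakpoint_test(switch, tgen, dst_port=8, expected_watermark=512_000):
    switch.block_port(dst_port)                      # set the "breakpoint"
    sent = tgen.send_burst(dst_port, count=1000, size=1024)
    snapshot = switch.read_counters(dst_port)        # e.g., buffer watermark
    switch.release_port(dst_port)
    received = tgen.capture(dst_port)
    # Pass criteria combine the switch-state snapshot and the released packets.
    assert snapshot["buffer_watermark"] >= expected_watermark
    assert len(received) == sent, "unexpected loss for lossless test traffic"
    return True

if __name__ == "__main__":
    sw = FakeSwitch()
    print(run_breakpoint_test(sw, FakeTrafficGen(sw)))
```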

Hardware-based Tests: While the above approach gives us good visibility into switch states and packet micro-behaviors, it cannot meet the stringent performance requirements of some tests. For example, to test the PFC watchdog [50], we need to generate continuous PFC pause frames at high speed and accurately control their intervals due to the small pause duration enforced by each PFC frame. To conduct such performance-sensitive tests, we need to control traffic generation at µs or even ns timescales and have high-resolution measurement of data plane behaviors. This motivated us to build a hardware-based test system by leveraging hardware programmable traffic generators [9]. Our hardware-based system focuses on testing features like PFC, the PFC watchdog, and RED/ECN marking.

As of February 2023, we had built 32 software test cases and 50 hardware test cases for RDMA features. The documentation and implementation of our test cases are available at [18].

7 Congestion Control

We use the combination of PFC and DCQCN to mitigate congestion. In this section, we discuss how we scale both techniques at regional scale.

7.1 Scaling PFC over Long Links

Once an ingress queue pauses the upstream device, it requires a dedicated headroom buffer to absorb in-flight packets before the PFC pause frame takes effect on the upstream device [50, 112]. The ideal PFC headroom value depends on many factors, e.g., link capacity and propagation delay [12]. The total demand on the headroom buffer for a switch is also in proportion to the number of lossless priorities².

To extend RDMA from cluster scale [46, 50] to regional scale, we must deal with long links between T2 and RH (tens of kilometers), and between T1 and T2 (hundreds of meters), which demand much larger PFC headroom than that of intra-cluster links. At first glance, it may seem that a T1 switch in our production environment can reserve half of the total buffer for PFC headroom and other usages. At T2 and RH, given the high port density (100s) of chassis switches and long-haul links, we need to reserve several GB of PFC headroom buffer.

To scale PFC over long links, we leverage the fact that pathological cases, e.g., all the ports are congested simultaneously, and ingress lossless queues of a port pause peers sequentially, are likely to be rare. Our solution is two-fold. First, on chassis switches at T2 and RH, we use deep packet buffers of off-chip DRAM³ to store RDMA packets. Our analysis shows that our chassis switches in production can provide abundant DRAM buffers for PFC headroom. Second, instead of reserving PFC headroom per queue, we allocate a PFC headroom pool shared by all the ingress lossless queues on the switch. Each ingress lossless queue has a static threshold to limit its maximum usage in the headroom pool. We oversubscribe the headroom pool size with a reasonable ratio, thus leaving more shared buffer space to absorb bursts. Our production experience shows that the oversubscribed PFC headroom pool can effectively eliminate congestion losses and improve burst tolerance.

² For an ingress port, the worst case is that its lossless queues sequentially pause the peer queues, and none of its packets can be drained from the buffer.
³ Unlike on-chip SRAM, the bandwidth of off-chip DRAM is slightly smaller than the forwarding capacity of the switch ASIC. When all the ports send and receive traffic at line rate, DRAM will suffer from packet drops.
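The rough headroom arithmetic below shows why long-haul links dominate: per-queue headroom must cover the in-flight bytes of a round trip over the link, and the total scales with the number of lossless priorities and ports. The formula is the textbook approximation with illustrative constants, not Azure's exact provisioning.

```python
# Back-of-envelope PFC headroom per lossless queue (values are illustrative).
def pfc_headroom_bytes(link_gbps: float, cable_km: float,
                       mtu: int = 9216, prop_us_per_km: float = 5.0) -> float:
    rate_bytes_per_us = link_gbps * 1e9 / 8 / 1e6
    rtt_us = 2 * cable_km * prop_us_per_km       # pause frame one way, data the other
    # In-flight bytes during that round trip, plus roughly one MTU in transit at
    # each end when the pause takes effect.
    return rtt_us * rate_bytes_per_us + 2 * mtu

if __name__ == "__main__":
    for km in (0.1, 10, 40):                     # intra-cluster vs. long-haul distances
        per_queue = pfc_headroom_bytes(link_gbps=100, cable_km=km)
        # Total demand grows with lossless priorities and the port count of a
        # chassis switch, which is how the requirement reaches several GB.
        total = per_queue * 2 * 256              # 2 lossless priorities, 256 ports
        print(f"{km:5.1f} km: ~{per_queue/1e6:6.2f} MB/queue, ~{total/1e9:5.2f} GB/switch")
```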
7.2 DCQCN Interoperability Challenges

We use DCQCN [112] to control the sending rate of each queue pair (QP). DCQCN consists of three entities: the sender or reaction point (RP), the switch or congestion point (CP), and the receiver or notification point (NP). The CP performs ECN marking at the egress queue based on the RED algorithm [43]. The NP sends Congestion Notification Packets (CNPs) when it receives ECN-marked packets. The RP reduces its sending rate when it receives CNPs. Otherwise, it leverages a byte counter and a timer to increase the rate.

We deployed three generations of commodity NICs from a popular NIC vendor: Gen1, Gen2 and Gen3, for different types of clusters. While all of them support DCQCN, their implementation details differ significantly. This causes an interoperability problem when different generations of NICs communicate with each other.

DCQCN implementation differences: On Gen1, most of the DCQCN functionality, such as the NP and RP state machines, is implemented in firmware. Given the limited processing capacity of the firmware, Gen1 minimizes CNP generation through coalescing at the NP side. As described in [112], the NP generates at most one CNP in a time window for a flow, if any arriving packets within this window are ECN marked. Correspondingly, the RP reduces the sending rate upon receiving a CNP. In addition, Gen1 also has limited cache resources. Cache misses can significantly impact RDMA's performance [50, 63]. To mitigate cache misses, we increase the granularity of rate limiting on Gen1 from a single packet to a burst of packets. Burst transmissions can effectively reduce the number of active QPs in a fixed interval, thus lowering pressure on the very limited cache resources of Gen1 NICs.

In contrast, Gen2 and Gen3 have hardware-based DCQCN implementations and adopt an RP-based CNP coalescing mechanism, which is the exact opposite of the NP-based CNP coalescing used by Gen1. In Gen2 and Gen3, the NP sends a CNP for every arriving ECN-marked packet. However, the RP only cuts the sending rate for a flow at most once in a time window if it receives any CNPs within that window. It is worthwhile to note that RP-based and NP-based CNP coalescing mechanisms essentially provide the same congestion notification granularity. The rate limiting is on a per-packet granularity on Gen2 and Gen3.

Interoperability challenges: Storage frontend traffic, which crosses different clusters, may lead to communication between different generations of NICs. In this scenario, the DCQCN implementation differences cause undesirable behaviors. First, when a Gen2/Gen3 node sends traffic to a Gen1 node, its per-packet rate limiting tends to trigger many cache misses
on the Gen1 node, thus slowing down the receiver pipeline. Second, when a Gen1 node sends traffic to a Gen2/Gen3 node through a congested path, the Gen2/Gen3 NP tends to send excessive CNPs to the Gen1 RP, thus causing excessive rate reductions and throughput losses.

Our solution: Given the limited processing capacity and resources of Gen1, we cannot make it behave like Gen2 and Gen3. Instead, we try to make Gen2 and Gen3 behave like Gen1 as much as possible. Our solution is two-fold. First, we move the CNP coalescing on Gen2 and Gen3 from the RP side to the NP side. On the Gen2/Gen3 NP side, we add a per-QP CNP rate limiter and set the minimal interval between two consecutive CNPs to the value of the CNP coalescing timer of the Gen1 NP. On the Gen2/Gen3 RP side, we minimize the time window for rate reduction so that the RP almost always reduces the rate upon receiving a CNP. Second, we enable per-burst rate limiting on Gen2 and Gen3.
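A minimal sketch of the NP-side change: after the firmware update, a Gen2/Gen3 receiver emits at most one CNP per QP per coalescing window (matching the Gen1 NP timer) instead of one CNP per ECN-marked packet. The window value and data structure are illustrative.

```python
# NP-side per-QP CNP coalescing sketch (§7.2). The 50 us window is illustrative.
CNP_COALESCE_WINDOW_US = 50.0

class NotificationPoint:
    def __init__(self):
        self.last_cnp_us = {}                      # per-QP timestamp of the last CNP

    def on_ecn_marked_packet(self, qp: int, now_us: float) -> bool:
        """Return True if a CNP should be generated for this ECN-marked packet."""
        last = self.last_cnp_us.get(qp)
        if last is not None and now_us - last < CNP_COALESCE_WINDOW_US:
            return False                           # coalesced: still within the window
        self.last_cnp_us[qp] = now_us
        return True                                # rate-limited CNP goes out

if __name__ == "__main__":
    np_ = NotificationPoint()
    arrivals = [0, 10, 30, 60, 120]                # ECN-marked packets on one QP (us)
    print([np_.on_ecn_marked_packet(qp=7, now_us=t) for t in arrivals])
    # -> [True, False, False, True, True]
```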
7.3 Tuning DCQCN

There were certain practical limitations when we tuned DCQCN in Azure. First, our NICs only support global DCQCN parameter settings. Second, to optimize customer experience, we classify RDMA flows into two switch queues based on their application semantics, rather than RTTs. Hence, instead of using different DCQCN parameters for inter-datacenter and intra-datacenter traffic, we use global DCQCN parameter settings (on the NICs and switches) that work well given the large RTT variations within a region.

We took a three-step approach to tune DCQCN parameters. First, we leveraged the fluid model [113] to understand theoretical properties of DCQCN. Second, we ran experiments with synthetic traffic in our lab testbed to evaluate solutions to the interoperability problem and deliver reasonable parameter settings. Third, we finalized the parameter settings in test clusters, which use the same setup as production clusters carrying customer traffic. We ran stress tests with real storage applications and tuned DCQCN parameters based on the application performance.

To illustrate our findings, we use Kmin, Kmax, and Pmax to denote the minimum threshold, the maximum threshold, and the maximum marking probability of RED/ECN [43], respectively. We make the following three key observations (more experiment results appear in §B):

• DCQCN does not suffer from RTT unfairness as it is a rate-based protocol and its rate adjustment is independent of RTT.
• To provide high throughput for DCQCN flows with large RTTs, we use sparse ECN marking with large Kmax − Kmin and small Pmax.
• DCQCN and switch buffers should be jointly tuned [112]. For example, before increasing Kmin, we ensure that ingress thresholds for lossless traffic are large enough. Otherwise, PFC may be triggered before ECN marking.
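The RED/ECN marking curve these observations refer to can be written down directly: no marking below Kmin, full marking above Kmax, and a linear ramp up to Pmax in between. "Sparse" marking then simply means a wide Kmax − Kmin range with a small Pmax. The numbers used here are illustrative.

```python
# RED/ECN marking probability as a function of the egress queue length (§7.3).
def ecn_mark_probability(queue_kb: float, kmin: float, kmax: float, pmax: float) -> float:
    if queue_kb <= kmin:
        return 0.0
    if queue_kb >= kmax:
        return 1.0
    return pmax * (queue_kb - kmin) / (kmax - kmin)   # linear ramp between Kmin and Kmax

if __name__ == "__main__":
    # "Sparse" marking for long-RTT flows: large Kmax - Kmin, small Pmax (illustrative).
    sparse = dict(kmin=400, kmax=1600, pmax=0.05)
    aggressive = dict(kmin=100, kmax=400, pmax=0.5)
    for q in (200, 800, 1600):
        print(f"queue={q:4d} KB  sparse={ecn_mark_probability(q, **sparse):.3f}  "
              f"aggressive={ecn_mark_probability(q, **aggressive):.3f}")
```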
8 Experience

In 2018, we started to enable RDMA to serve customer backend traffic. In 2019, we started to enable RDMA to serve customer frontend traffic, with storage and compute clusters co-located in the same datacenter. In 2020, we enabled intra-region RDMA in the first Azure region. As of February 2023, around 70% of traffic in Azure public regions was RDMA (Figure 1) and intra-region RDMA was supported in all Azure public regions.

8.1 Deployment and Servicing

We took a three-step approach to gradually enable RDMA in production environments. First, we leveraged the lab testbed to develop and test each individual component. Second, we conducted end-to-end stress tests in test clusters with the same software and hardware setups as those of production counterparts. In addition to normal workloads, we also injected common errors, e.g., random packet drops, to evaluate the robustness of the system. Third, we cautiously increased the deployment scale of RDMA in production environments to carry more customer traffic. During our deployment, NIC driver/firmware and switch OS updates were common. Thus it was crucial to minimize the impact of such updates to customer traffic.

Servicing switches: Compared to switches in T1 or tiers above, T0 switches, especially in compute clusters, were more challenging to service as they could be a single point of failure (SPOF) for customer VMs. In this scenario, we leveraged fast reboot [17] and warm reboot [19] to reduce the data plane disruption time from a few minutes to less than a second.

Servicing NICs: In some cases, servicing the NIC driver or firmware required unloading the NIC driver. The driver could safely unload only after all the NIC resources had been released. To this end, we needed to signal consumers, e.g., disk driver, to close RDMA connections and shift traffic to TCP. Once RDMA and other NIC features with similar concerns had been disabled, we could reload the driver.

8.2 Performance

Storage backend: Currently almost all the storage backend traffic in Azure is RDMA. It is no longer feasible to run large-scale A/B tests with customer traffic because the CPU cores saved by RDMA have been used for other purposes, not to mention customer experience degradation. Hence we demonstrate results of an A/B test conducted in a test cluster in 2018. In this test, we ran storage workloads with high transactions per second (TPS) and switched transport between RDMA and TCP.

USENIX Association 20th USENIX Symposium on Networked Systems Design and Implementation 57
Figure 7: Average CPU usage of storage servers of a storage tenant. We normalize results to the maximum CPU usage. We switched traffic between RDMA and TCP twice.

Figure 8: Message completion times of storage backend traffic measured in a test cluster. We normalize results to the maximum message completion time.

Figure 9: Average CPU usage of the host domain. We normalize results to the maximum value.

Figure 10: Average access latencies of a type of SSDs across all Azure public regions between February 22, 2022, and February 22, 2023. We normalize RDMA results to corresponding TCP results.

Figure 7 plots normalized CPU utilization of storage servers during two transport switches. It is worthwhile to note that CPU utilization here includes all the types of processing overhead, e.g., storage application, Azure Storage Network Protocol, and TCP/IP stack. Figure 8 gives message completion times measured in the Azure Storage Network Protocol layer (Figure 4), which excludes the overhead of application processing. Compared to TCP, RDMA achieved obvious CPU saving and significantly accelerated network data transfer.

Storage frontend: Since we cannot perform large-scale A/B tests with customer traffic, we present results of an A/B test conducted in a test cluster in 2018. In this test, we used DiskSpd to generate read and write workloads at A IOPS and B IOPS (A < B). The I/O size was 8 KB. Figure 9 gives average CPU utilization of the host domain during the test period. Compared to TCP, RDMA could reduce the CPU utilization by up to 34.5%.

To understand the performance improvement introduced by RDMA, we leverage an always-on storage monitoring service. This service allocates some VMs in each region, uses them to periodically generate disk read and write workloads, and collects end-to-end performance results. The monitoring service covers different I/O sizes, types of disks, and transports for storage frontend traffic.

Figure 10 shows the overall average access latencies of a type of SSDs across all Azure public regions collected by the monitoring service for a year. Note that the RDMA and TCP in this figure only refer to the transport of frontend traffic generated by test VMs. We normalize RDMA results to corresponding TCP results. Compared to TCP, RDMA yielded better access latencies with every I/O size. In particular, 1 MB I/O requests benefited the most from RDMA with 23.8% and 15.6% latency reductions for read and write, respectively. This is due to the fact that large I/O requests are more sensitive to throughput than smaller I/O requests, and RDMA improves throughput drastically since it can run at line rate using a single connection without slow starts.

Congestion control: We ran stress tests in a test cluster to drive the DCQCN parameter setting that could achieve reasonable performance even under peak workloads. Figure 11 gives results of the 99th percentile message completion time, the key metric we used to guide our tuning.

58 20th USENIX Symposium on Networked Systems Design and Implementation USENIX Association
Figure 11: The 99th percentile message completion times of different schemes measured in a test cluster.

At the beginning, we disabled DCQCN and only tuned switch buffer parameters, e.g., the dynamic threshold of ingress lossless queues, to explore the best performance achieved by PFC only. After reaching the best performance of PFC only, we enabled DCQCN using the default parameter setting, which was derived on the lab testbed using synthetic traffic. While DCQCN reduced the number of PFC pause frames, it degraded the tail message completion time as the default setting reduced the sending rate too aggressively. Given this, we adjusted ECN marking parameters to improve DCQCN's throughput. With the optimized setting, DCQCN performs better than using PFC alone. Our key takeaway from this tuning experience was that DCQCN and switch buffers should be jointly tuned to optimize the application performance, rather than the PFC pause duration.

8.3 Problems Discovered and Fixed

During tests and deployments, we discovered and fixed a series of problems in NICs, switches and our RDMA applications.

FMR hidden fence: In sK-RDMA (§4.2), every I/O request from compute servers requires an FMR request followed by a Send request to the storage server, which contains the description of FMR registered memory and storage commands. Therefore, the send queue consists of many FMR/Send pairs. When we deployed sK-RDMA in compute and storage clusters located in different datacenters, we found that the frontend traffic showed extremely low throughput, even though we kept many outstanding FMR/Send pairs in the send queue. To debug this problem, we used RDMA Estats to collect the T5 − T1 latency for every Send request (§5). We found a strong correlation between T5 − T1 and the inter-datacenter RTT, and noticed that there was only a single outstanding Send request per RTT. After we shared these findings with the NIC vendor, they identified the root cause: to simplify the implementation, NICs processed the FMR request only after the completions of previously posted requests. In sK-RDMA, the FMR request created a hidden fence between two Send requests, thus only allowing a single Send request in the air, which could not fill the large network pipe between datacenters. We have worked with the NIC vendor to fix this problem in the new NIC driver.
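The impact of the hidden fence is easy to quantify: with a single Send in flight per round trip, throughput is capped at the message size divided by the RTT. The sizes below are illustrative.

```python
# Why one outstanding Send per RTT cannot fill an inter-datacenter pipe (§8.3).
def capped_throughput_gbps(message_bytes: int, rtt_ms: float) -> float:
    return message_bytes * 8 / (rtt_ms * 1e-3) / 1e9

if __name__ == "__main__":
    msg = 1 * 1024 * 1024                   # 1 MB request per Send (illustrative)
    for rtt_ms in (0.05, 0.5, 2.0):         # intra-datacenter vs. regional RTTs
        print(f"RTT {rtt_ms:4.2f} ms -> at most "
              f"{capped_throughput_gbps(msg, rtt_ms):6.2f} Gbps")
    # At a 2 ms regional RTT a single in-flight 1 MB Send yields only ~4 Gbps,
    # far below line rate, matching the extremely low throughput observed.
```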
PFC and MACsec: After we enabled PFC on long-haul links between T2 and RH, many long-haul links reported high packet corruption rates, thus triggering alerts. It turned out that the MACsec standard [21] did not specify whether PFC frames should be encrypted. As a result, different vendors had no agreement on whether PFC frames sent should be encrypted and what to do with arriving encrypted PFC frames. For example, switch A may send unencrypted PFC frames to switch B, while switch B was expecting encrypted PFC frames. As a result, switch B would treat those PFC frames as corrupted packets and report errors. We have worked with switch vendors to standardize how MACsec enabled switch ports treat PFC frames.

Congestion leaking: This problem was found in the testbed. When we enabled interoperability features (§7.2) on Gen2 NICs, we found that their throughput would be degraded. To dig into this problem, we used the water filling algorithm to calculate theoretical per-QP throughput results and compared them with actual throughput results measured from the testbed. We had two interesting observations when comparing the results. First, flows sent by a Gen2 NIC always had near identical sending rates regardless of their congestion degrees. Second, actual sending rates were very close to the theoretical sending rate of the slowest flow sent from the NIC. It seemed that all the flows from a Gen2 NIC were throttled by the slowest flow. We reported these observations to the NIC vendor, and they identified a head-of-line blocking issue in the NIC firmware. We have fixed this problem on all the NICs with interoperability features.
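The water-filling computation used for this comparison can be reproduced with a standard max-min fairness routine: repeatedly fill the most constrained link and freeze the flows that cross it. The implementation below is a generic textbook version, not the authors' tool, and the example topology is invented.

```python
# Generic max-min fair ("water filling") rate allocation, as used to derive the
# theoretical per-QP throughput in the congestion-leaking investigation (§8.3).
def max_min_fair(flows: dict[str, list[str]], capacity: dict[str, float]) -> dict[str, float]:
    """flows: flow -> list of links it traverses; capacity: link -> Gbps."""
    rate = {}                                   # frozen (bottlenecked) flows
    remaining = dict(capacity)
    active = set(flows)
    while active:
        # Fair share each link could still give to its active flows.
        share = {}
        for link, cap in remaining.items():
            users = [f for f in active if link in flows[f]]
            if users:
                share[link] = cap / len(users)
        bottleneck = min(share, key=share.get)  # the link that fills up first
        level = share[bottleneck]
        for f in list(active):
            if bottleneck in flows[f]:
                rate[f] = level                 # freeze flows crossing the bottleneck
                active.remove(f)
                for link in flows[f]:
                    remaining[link] -= level    # and charge their rate everywhere
    return rate

if __name__ == "__main__":
    flows = {"qp1": ["nic", "linkA"], "qp2": ["nic", "linkB"], "qp3": ["linkB"]}
    capacity = {"nic": 100.0, "linkA": 40.0, "linkB": 40.0}
    print(max_min_fair(flows, capacity))
    # Roughly {'qp2': 20.0, 'qp3': 20.0, 'qp1': 40.0}: without head-of-line
    # blocking, qp1 should not be throttled down to the slowest flow's rate.
```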
many outstanding FMR/Send pairs in the send queue. To debug slow receivers were caused by our applications. We found that
this problem, we used RDMA Estats to collect T5 − T1 latency each server actually ran multiple RDMA application instances.
for every Send request (§5). We found a strong correlation All the inter-instance traffic ran on RDMA, regardless of their
between T5 − T1 and inter-datacenter RTT, and noticed that locations. Therefore, loopback traffic and external traffic co-
there was only a single outstanding Send request per RTT. existed on every NIC, thus creating a 2:1 congestion on PCIe
After we shared these findings with the NIC vendor, they iden- lanes of the NIC. Since the NIC could not mark ECN, it could
tified the root cause: to simplify the implementation, NICs only throttle loopback traffic and external traffic through PCIe
processed the FMR request only after the completions of previ- back pressure and PFC pause frames. To validate the above
ously posted requests. In sK-RDMA, the FMR request created analysis, we disabled RDMA for loopback traffic on some
a hidden fence between two Send requests, thus only allowing servers, then these servers stopped sending PFC frames. We
a single Send request in the air, which could not fill the large notice that recent work [61, 70] also found this problem.

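The impact of the hidden fence can be captured with a one-line throughput model: when only a single Send is outstanding per RTT, goodput is bounded by the message size divided by the RTT. The sketch below is purely illustrative; the 64 KB message size, 2 ms RTT, and 160-deep send queue are assumed values, not measurements from this deployment.

# With the hidden fence, at most one Send is outstanding per RTT, so goodput
# collapses to roughly outstanding * message_size / RTT (capped by line rate).
def goodput_gbps(msg_bytes, rtt_s, outstanding):
    return outstanding * msg_bytes * 8 / rtt_s / 1e9

MSG_BYTES = 64 * 1024      # assumed 64 KB request
RTT_S = 2e-3               # assumed 2 ms inter-datacenter RTT

print(round(goodput_gbps(MSG_BYTES, RTT_S, outstanding=1), 2))
# ~0.26 Gbps: the fenced behavior, i.e., "extremely low throughput"
print(round(goodput_gbps(MSG_BYTES, RTT_S, outstanding=160), 1))
# ~41.9 Gbps offered: with a deep queue of FMR/Send pairs, the path rather
# than the fence becomes the limit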
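The theoretical per-QP rates used in the congestion-leaking investigation can be derived with a standard max-min ("water-filling") allocation, sketched below. This is a generic illustration rather than our internal tool; the flow-to-link mapping, link names, and capacities in the example are hypothetical.

def max_min_rates(flows, capacity, eps=1e-9):
    """Progressive filling: grow all active flows equally until some link
    saturates, then freeze the flows crossing that link.
    flows: flow id -> set of links on its path; capacity: link -> Gbps."""
    rate = {f: 0.0 for f in flows}
    remaining = dict(capacity)
    active = set(flows)
    while active:
        loads = {l: sum(1 for f in active if l in flows[f]) for l in remaining}
        # Largest equal increment every active flow can take before a link fills up.
        inc = min(remaining[l] / n for l, n in loads.items() if n > 0)
        for f in active:
            rate[f] += inc
        for l, n in loads.items():
            remaining[l] -= inc * n
        saturated = {l for l, r in remaining.items() if r <= eps}
        active = {f for f in active if not saturated.intersection(flows[f])}
    return rate

# Hypothetical example: two QPs on one 40 Gbps NIC, one of them also crossing
# a congested 10 Gbps downlink. Max-min predicts ~10 and ~30 Gbps; the buggy
# Gen2 firmware instead pinned both QPs near the slower flow's 10 Gbps.
print(max_min_rates({"qp1": {"nic", "downlink"}, "qp2": {"nic"}},
                    {"nic": 40.0, "downlink": 10.0}))

Comparing such theoretical rates against the measured per-QP throughput is what exposed that every flow from a Gen2 NIC tracked the slowest one.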
9 Lessons and Open Problems

In this section, we summarize the lessons learned from our experience and discuss open problems for future exploration.

Failovers are very expensive for RDMA. While we have implemented failover solutions in both sU-RDMA and sK-RDMA as the last resort, we find that failovers are particularly expensive for RDMA and should be avoided as much as possible. Cloud providers adopt RDMA to save CPU cores and then use the freed CPU cores for other purposes. To move traffic away from RDMA, we need to allocate extra CPU cores to carry this traffic. This increases CPU utilization and can even run out of CPU cores at high loads. Hence, it is risky to perform large-scale RDMA failovers, which we treat as serious incidents in Azure. Given the risk, we gradually increase the RDMA deployment scale only after all the tests have passed. During the rollout, we continuously monitor network performance and immediately stop the rollout once anomalies are detected. After unavoidable failovers, we should aggressively switch back to RDMA when possible.

Host network and physical network should be converged. In §8.3, we present a new type of slow receiver, which is essentially due to congestion inside the host. Recent work [24] also presents evidence and characterization of host congestion in production clusters. We believe this problem is just the tip of the iceberg, and many problematic behaviors between the host network and the physical network remain unexposed. In conventional wisdom, the host network and the physical network are separate entities, and the NIC is their border. If we look into the host, it is essentially a network connecting heterogeneous nodes (e.g., CPU, GPU, DPU) with proprietary high-speed links (e.g., PCIe and NVLink) and switches (e.g., PCIe switches and NVSwitch). Inter-host traffic can be treated as north-south traffic for the host. With the increase of datacenter link capacity and the wide adoption of hardware offloading and device direct access technologies (e.g., GPUDirect RDMA), inter-host traffic tends to consume larger and more varied resources inside the host, thus resulting in more complex interactions with intra-host traffic.

We believe that the host network and the physical network should be converged in the future, and we envision this converged network will be an important step towards the disaggregated cloud. We look forward to operating this converged network in similar ways as we manage the physical network today.

Switch buffer is increasingly important and needs more innovations. The conventional wisdom [26] suggests that low-latency datacenter congestion control [26, 71, 82, 112] can alleviate the need for large switch buffers, as it can preserve short queues. However, we find a strong correlation between switch buffers and RDMA performance problems in production. Clusters with smaller switch buffers tend to have more performance problems, and many performance problems can be mitigated by just tuning switch buffer parameters without touching DCQCN. This is why we always tune switch buffers before touching DCQCN (§8.2). The importance of the switch buffer lies in the prevalence of bursty traffic and short-lived congestion events in datacenters [108]. Conventional congestion control solutions are ill-suited for such scenarios given their reactive nature. Instead, the switch buffer serves as the first resort to absorb bursts and provide fast responses.

With the increase in datacenter link speed, we believe that the switch buffer is increasingly important and thus deserves more effort and innovation. First, the buffer size per port per Gbps on pizza box switches has kept decreasing in recent years [31]. Some switch ASICs even split the packet memory into multiple partitions, thus reducing the effective buffer resource. We encourage more effort to be put into the development of ASICs with deeper packet buffers and more unified architectures. Second, today's commodity switch ASICs only provide buffer management mechanisms [40] designed decades ago, thus limiting the scope of solutions to handle congestion. Following the trend of programmable data planes [32], we envision that future switch ASICs will provide more programmability on buffer models and interfaces, thus enabling the implementation of more effective buffer management solutions [22].

Cloud needs unified behavior models and interfaces for network devices. The diversity in software and hardware brings significant challenges to network operation at cloud scale. Different NICs from the same vendor can even have different behaviors that cause interoperability problems, not to mention devices from different vendors. In spite of all the effort we put into the unified switch software (§6) and NIC congestion control (§7.2), we still experienced problems due to diversity, e.g., unexpected interactions between PFC and MACsec (§8.3). We envision that more unified models and interfaces will emerge to simplify operations and accelerate innovations in the cloud. Some key areas include chassis switches, smart network appliances, and RDMA NICs. We notice that there have been some efforts on standardizing congestion control for different data paths [85] and APIs for heterogeneous smart appliances [16].

Testing new network devices is crucial and challenging. From day one of this project, we have been making large investments in building various testing tools and running rigorous tests in both testbeds and test clusters. Despite the significant number of problems discovered during tests, we still found some problems during deployments (§8.3), mostly due to micro-behaviors and corner cases that were overlooked. Some burning questions are as follows:

• How can we precisely capture the micro-behaviors of RDMA NIC implementations in various scenarios?

• Despite many endeavors to measure switches' micro-behaviors (§6.3), we still rely on domain knowledge to design test cases. How can we systematically test the correctness and performance of a switch?

These questions motivate us to rethink the challenges and
requirements of testing emerging network devices with more and more features. First, many features lack clear specifications, which is a prerequisite for systematic testing. Many seemingly simple features are actually entangled with complex interactions between software and hardware. We believe that the unified behavior models and interfaces discussed above can help with this. Second, the test system should be able to interact with network devices at high speed and precisely capture micro-behaviors. We believe programmable hardware can help here [33, 37]. We note that there has been some recent progress on testing RDMA NICs [69, 70] and programmable switches [37, 110].

10 Related Work

This paper focuses on RDMA for cloud storage. The literature on RDMA and storage systems is vast. Here we only discuss some closely related ideas.

Deployment experience of RDMA and storage networks: Before this project, we had deployed RDMA to support some Bing workloads and encountered many problems, such as PFC storms, PFC deadlocks, and slow receivers [50]. We learnt several lessons from this deployment. Gao et al. [46] summarized the experience of deploying intra-cluster RDMA to support storage backend traffic in Alibaba. Miao et al. [80] presented two generations of storage network stacks to carry Alibaba's storage frontend traffic: LUNA and SOLAR. LUNA is a high-performance user-space TCP stack, while SOLAR is a storage-oriented UDP stack implemented in a proprietary DPU. Scalable Reliable Datagram (SRD) [96] is a cloud-optimized transport protocol implemented in the AWS custom Nitro networking card, and is used by HPC, ML, and storage applications [7]. In contrast, we use commodity hardware to enable intra-region RDMA to support both storage frontend and backend traffic.

Congestion control in datacenters: There is a large body of work on datacenter congestion control, including ECN-based [26, 27, 99, 112], delay-based [71, 72, 76, 82], INT-based [23, 75, 101], and credit-based [34, 38, 45, 52, 55, 84, 86, 88] schemes, as well as packet scheduling [28, 30, 36, 49, 54]. Our work focuses on regional networks, which have large RTT variations. We notice that some efforts [95, 107] target similar scenarios.

Improve RDMA in datacenters: In addition to congestion control, there are many efforts to improve RDMA's reliability, security and performance in datacenters, such as deadlock mitigation [56, 92, 103], support for multi-path [77], resilience over lossy networks [78, 83, 102], security mechanisms [94, 98, 104], virtualization [53, 67, 89, 100], testing [69, 70], and performance isolation in multi-tenant environments [109]. Our work focuses on first-party traffic in a trusted environment. Given the limited retransmission performance of our NICs, we enable RDMA over lossless networks (§2.4).

Accelerate storage systems using RDMA and other techniques: Many proposals [41, 62–66, 74, 93, 106, 111] leverage RDMA to accelerate storage systems or networked systems in general. Similar to some solutions [13, 47, 74, 90], our RDMA protocols (§4) provide socket-like interfaces to keep compatibility with the legacy storage stack. In addition to RDMA, some recent proposals improve storage systems using new kernel designs [58, 59, 73] and SmartNICs [68, 81].

11 Conclusions and Future Work

In this paper, we summarize our experience in deploying intra-region RDMA to support storage workloads in Azure. The high complexity and heterogeneity of our infrastructure bring a series of new challenges. We have made several changes to our network infrastructure to address these challenges. Today, around 70% of traffic in Azure is RDMA, and intra-region RDMA is supported in all Azure public regions. RDMA helps us achieve significant disk I/O performance improvements and CPU core savings.

In the future, we plan to further improve our storage systems through innovations in system architecture, hardware acceleration, and congestion control. We also plan to bring RDMA to more scenarios.

Acknowledgements

We thank our shepherd Marco Canini and the anonymous reviewers for their valuable feedback that significantly improved the final paper. Yuanwei Lu, Liang Yang and Danushka Menikkumbura also provided important feedback. Yibo Zhu made contributions to DCQCN and PFC deadlock avoidance at the early stage of this project. Ranysha Ware contributed to DCQCN tuning. Zhuolong Yu helped us measure RDMA's retransmission performance. This project represents the work of many engineers, product managers, researchers, data scientists, and leaders across Microsoft over many years, more than we can list here. We thank them all. Finally, we thank our partners: Arista Networks, Broadcom, Cisco, Dell, Keysight and NVIDIA for their technical contributions and support.

References

[1] Amazon EBS volume types. https://aws.amazon.com/ebs/volume-types/.
[2] Amazon Web Services regions. https://aws.amazon.com/about-aws/global-infrastructure/regions_az/.
[3] Arista 7500R switch architecture ("a day in the life of a packet"). https://www.arista.com/assets/data/pdf/Whitepapers/Arista7500RSwitchArchitectureWP.pdf.

[4] Azure managed disk types. https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types.
[5] Azure regions. https://docs.microsoft.com/en-us/azure/availability-zones/az-overview.
[6] Cisco Silicon One product family. https://www.cisco.com/c/dam/en/us/solutions/collateral/silicon-one/white-paper-sp-product-family.pdf.
[7] A decade of ever-increasing provisioned IOPS for Amazon EBS. https://aws.amazon.com/blogs/aws/a-decade-of-ever-increasing-provisioned-iops-for-amazon-ebs/.
[8] Google Cloud regions. https://cloud.google.com/compute/docs/regions-zones.
[9] Keysight network test solutions. https://www.keysight.com/us/en/solutions/network-test.html.
[10] Packet testing framework (PTF). https://github.com/p4lang/ptf.
[11] PFC watchdog in SONiC. https://github.com/sonic-net/SONiC/wiki/PFC-Watchdog-Design.
[12] Priority flow control: Build reliable layer 2 infrastructure. https://e2e.ti.com/cfs-file/__key/communityserver-discussions-components-files/908/802.1q-Flow-Control-white_5F00_paper_5F00_c11_2D00_542809.pdf.
[13] rsocket(7) - Linux man page. https://linux.die.net/man/7/rsocket.
[14] SMB Direct. https://learn.microsoft.com/en-us/windows-server/storage/file-server/smb-direct.
[15] Software for Open Networking in the Cloud (SONiC). https://sonic-net.github.io/SONiC/.
[16] SONiC-DASH - disaggregated API for SONiC hosts. https://github.com/sonic-net/DASH.
[17] SONiC fast reboot. https://github.com/sonic-net/SONiC/blob/master/doc/fast-reboot/fastreboot.pdf.
[18] sonic-mgmt: Management and automation code used for SONiC testbed deployment, tests and reporting. https://github.com/sonic-net/sonic-mgmt.
[19] SONiC warm reboot. https://github.com/sonic-net/SONiC/blob/master/doc/warm-reboot/SONiC_Warmboot.md.
[20] Switch Abstraction Interface (SAI). https://github.com/opencomputeproject/SAI.
[21] IEEE standard for local and metropolitan area networks - media access control (MAC) security. IEEE Std 802.1AE-2018 (Revision of IEEE Std 802.1AE-2006), 2018.
[22] Vamsi Addanki, Maria Apostolaki, Manya Ghobadi, Stefan Schmid, and Laurent Vanbever. ABM: Active buffer management in datacenters. In SIGCOMM 2022.
[23] Vamsi Addanki, Oliver Michel, and Stefan Schmid. PowerTCP: Pushing the performance limits of datacenter networks. In NSDI 2022.
[24] Saksham Agarwal, Rachit Agarwal, Behnam Montazeri, Masoud Moshref, Khaled Elmeleegy, Luigi Rizzo, Marc Asher de Kruijf, Gautam Kumar, Sylvia Ratnasamy, David Culler, and Amin Vahdat. Understanding host interconnect congestion. In HotNets 2022.
[25] Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A scalable, commodity data center network architecture. In SIGCOMM 2008.
[26] Mohammad Alizadeh, Albert Greenberg, David A. Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data center TCP (DCTCP). In SIGCOMM 2010.
[27] Mohammad Alizadeh, Abdul Kabbani, Tom Edsall, Balaji Prabhakar, Amin Vahdat, and Masato Yasuda. Less is more: Trading a little bandwidth for ultra-low latency in the data center. In NSDI 2012.
[28] Mohammad Alizadeh, Shuang Yang, Milad Sharif, Sachin Katti, Nick McKeown, Balaji Prabhakar, and Scott Shenker. pFabric: Minimal near-optimal datacenter transport. In SIGCOMM 2013.
[29] InfiniBand Trade Association. Supplement to InfiniBand architecture specification volume 1 release 1.2.1 annex A17: RoCEv2, 2014.
[30] Wei Bai, Li Chen, Kai Chen, Dongsu Han, Chen Tian, and Hao Wang. Information-agnostic flow scheduling for commodity data centers. In NSDI 2015.
[31] Wei Bai, Shuihai Hu, Kai Chen, Kun Tan, and Yongqiang Xiong. One more config is enough: Saving (DC)TCP for high-speed extremely shallow-buffered datacenters. In INFOCOM 2020.
[32] Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and David Walker. P4: Programming protocol-independent packet processors. ACM SIGCOMM Computer Communication Review, 2014.

[33] Pietro Bressana, Noa Zilberman, and Robert Soulé. Finding hard-to-find data plane bugs with a PTA. In CoNEXT 2020.
[34] Qizhe Cai, Mina Tahmasbi Arashloo, and Rachit Agarwal. dcPIM: Near-optimal proactive datacenter transport. In SIGCOMM 2022.
[35] Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. Windows Azure Storage: A highly available cloud storage service with strong consistency. In SOSP 2011.
[36] Li Chen, Kai Chen, Wei Bai, and Mohammad Alizadeh. Scheduling mix-flows in commodity datacenters with Karuna. In SIGCOMM 2016.
[37] Yanqing Chen, Bingchuan Tian, Chen Tian, Li Dai, Yu Zhou, Mengjing Ma, Ming Tang, Hao Zheng, Zhewen Yang, Guihai Chen, Dennis Cai, and Ennan Zhai. Norma: Towards practical network load testing. In NSDI 2023.
[38] Inho Cho, Keon Jang, and Dongsu Han. Credit-scheduled delay-bounded congestion control for datacenters. In SIGCOMM 2017.
[39] Sean Choi, Boris Burkov, Alex Eckert, Tian Fang, Saman Kazemkhani, Rob Sherwood, Ying Zhang, and Hongyi Zeng. FBOSS: Building switch software at scale. In SIGCOMM 2018.
[40] Abhijit K. Choudhury and Ellen L. Hahne. Dynamic queue length thresholds for shared-memory packet switches. IEEE/ACM Transactions on Networking, 1998.
[41] Aleksandar Dragojević, Dushyanth Narayanan, Miguel Castro, and Orion Hodson. FaRM: Fast remote memory. In NSDI 2014.
[42] Daniel Firestone, Andrew Putnam, Sambhrama Mundkur, Derek Chiou, Alireza Dabagh, Mike Andrewartha, Hari Angepat, Vivek Bhanu, Adrian Caulfield, Eric Chung, Harish Kumar Chandrappa, Somesh Chaturmohta, Matt Humphrey, Jack Lavier, Norman Lam, Fengfen Liu, Kalin Ovtcharov, Jitu Padhye, Gautham Popuri, Shachar Raindel, Tejas Sapre, Mark Shaw, Gabriel Silva, Madhan Sivakumar, Nisheeth Srivastava, Anshuman Verma, Qasim Zuhair, Deepak Bansal, Doug Burger, Kushagra Vaid, David A. Maltz, and Albert Greenberg. Azure accelerated networking: SmartNICs in the public cloud. In NSDI 2018.
[43] Sally Floyd and Van Jacobson. Random early detection gateways for congestion avoidance. IEEE/ACM Transactions on Networking, 1993.
[44] Philip Werner Frey and Gustavo Alonso. Minimizing the hidden cost of RDMA. In ICDCS 2009.
[45] Peter X. Gao, Akshay Narayan, Gautam Kumar, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. pHost: Distributed near-optimal datacenter transport over commodity network fabric. In CoNEXT 2015.
[46] Yixiao Gao, Qiang Li, Lingbo Tang, Yongqing Xi, Pengcheng Zhang, Wenwen Peng, Bo Li, Yaohui Wu, Shaozong Liu, Lei Yan, Fei Feng, Yan Zhuang, Fan Liu, Pan Liu, Xingkui Liu, Zhongjie Wu, Junping Wu, Zheng Cao, Chen Tian, Jinbo Wu, Jiaji Zhu, Haiyong Wang, Dennis Cai, and Jiesheng Wu. When cloud storage meets RDMA. In NSDI 2021.
[47] Dror Goldenberg, Michael Kagan, Ran Ravid, and Michael S Tsirkin. Zero copy sockets direct protocol over InfiniBand - preliminary implementation and performance analysis. In HOTI 2005.
[48] Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta. VL2: A scalable and flexible data center network. In SIGCOMM 2009.
[49] Matthew P Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert NM Watson, Andrew W Moore, Steven Hand, and Jon Crowcroft. Queues don't matter when you can jump them! In NSDI 2015.
[50] Chuanxiong Guo, Haitao Wu, Zhong Deng, Gaurav Soni, Jianxi Ye, Jitu Padhye, and Marina Lipshteyn. RDMA over commodity Ethernet at scale. In SIGCOMM 2016.
[51] Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, Hua Chen, Zhi-Wei Lin, and Varugis Kurien. Pingmesh: A large-scale system for data center network latency measurement and analysis. In SIGCOMM 2015.
[52] Mark Handley, Costin Raiciu, Alexandru Agache, Andrei Voinescu, Andrew W Moore, Gianni Antichi, and Marcin Wójcik. Re-architecting datacenter networks and stacks for low latency and high performance. In SIGCOMM 2017.

[53] Zhiqiang He, Dongyang Wang, Binzhang Fu, Kun Tan, Bei Hua, Zhi-Li Zhang, and Kai Zheng. MasQ: RDMA for virtual private cloud. In SIGCOMM 2020.
[54] Chi-Yao Hong, Matthew Caesar, and P Godfrey. Finishing flows quickly with preemptive scheduling. In SIGCOMM 2012.
[55] Shuihai Hu, Wei Bai, Gaoxiong Zeng, Zilong Wang, Baochen Qiao, Kai Chen, Kun Tan, and Yi Wang. Aeolus: A building block for proactive transport in datacenters. In SIGCOMM 2020.
[56] Shuihai Hu, Yibo Zhu, Peng Cheng, Chuanxiong Guo, Kun Tan, Jitendra Padhye, and Kai Chen. Tagger: Practical PFC deadlock prevention in data center networks. In CoNEXT 2017.
[57] Cheng Huang, Huseyin Simitci, Yikang Xu, Aaron Ogus, Brad Calder, Parikshit Gopalan, Jin Li, and Sergey Yekhanin. Erasure coding in Windows Azure Storage. In ATC 2012.
[58] Jaehyun Hwang, Qizhe Cai, Ao Tang, and Rachit Agarwal. TCP ≈ RDMA: CPU-efficient remote storage access with i10. In NSDI 2020.
[59] Jaehyun Hwang, Midhul Vuppalapati, Simon Peter, and Rachit Agarwal. Rearchitecting Linux storage stack for µs latency and high throughput. In OSDI 2021.
[60] IEEE. 802.1Qbb. Priority based flow control. 2008.
[61] Yimin Jiang, Yibo Zhu, Chang Lan, Bairen Yi, Yong Cui, and Chuanxiong Guo. A unified architecture for accelerating distributed DNN training in heterogeneous GPU/CPU clusters. In OSDI 2020.
[62] Anuj Kalia, Michael Kaminsky, and David Andersen. Datacenter RPCs can be general and fast. In NSDI 2019.
[63] Anuj Kalia, Michael Kaminsky, and David G Andersen. Design guidelines for high performance RDMA systems. In ATC 2016.
[64] Anuj Kalia, Michael Kaminsky, and David G Andersen. FaSST: Fast, scalable and simple distributed transactions with two-sided (RDMA) datagram RPCs. In OSDI 2016.
[65] Anuj Kalia, Michael Kaminsky, and David G Andersen. Using RDMA efficiently for key-value services. In SIGCOMM 2014.
[66] Daehyeok Kim, Amirsaman Memaripour, Anirudh Badam, Yibo Zhu, Hongqiang Harry Liu, Jitu Padhye, Shachar Raindel, Steven Swanson, Vyas Sekar, and Srinivasan Seshan. Hyperloop: Group-based NIC-offloading to accelerate replicated transactions in multi-tenant storage systems. In SIGCOMM 2018.
[67] Daehyeok Kim, Tianlong Yu, Hongqiang Harry Liu, Yibo Zhu, Jitu Padhye, Shachar Raindel, Chuanxiong Guo, Vyas Sekar, and Srinivasan Seshan. FreeFlow: Software-based virtual RDMA networking for containerized clouds. In NSDI 2019.
[68] Jongyul Kim, Insu Jang, Waleed Reda, Jaeseong Im, Marco Canini, Dejan Kostić, Youngjin Kwon, Simon Peter, and Emmett Witchel. LineFS: Efficient SmartNIC offload of a distributed file system with pipeline parallelism. In SOSP 2021.
[69] Xinhao Kong, Jingrong Chen, Wei Bai, Yechen Xu, Mahmoud Elhaddad, Shachar Raindel, Jitendra Padhye, Alvin R Lebeck, and Danyang Zhuo. Understanding RDMA microarchitecture resources for performance isolation. In NSDI 2023.
[70] Xinhao Kong, Yibo Zhu, Huaping Zhou, Zhuo Jiang, Jianxi Ye, Chuanxiong Guo, and Danyang Zhuo. Collie: Finding performance anomalies in RDMA subsystems. In NSDI 2022.
[71] Gautam Kumar, Nandita Dukkipati, Keon Jang, Hassan M. G. Wassel, Xian Wu, Behnam Montazeri, Yaogong Wang, Kevin Springborn, Christopher Alfeld, Michael Ryan, David Wetherall, and Amin Vahdat. Swift: Delay is simple and effective for congestion control in the datacenter. In SIGCOMM 2020.
[72] Changhyun Lee, Chunjong Park, Keon Jang, Sue Moon, and Dongsu Han. Accurate latency-based congestion feedback for datacenters. In ATC 2015.
[73] Gyusun Lee, Seokha Shin, Wonsuk Song, Tae Jun Ham, Jae W. Lee, and Jinkyu Jeong. Asynchronous I/O stack: A low-latency kernel I/O stack for ultra-low latency SSDs. In ATC 2019.
[74] Bojie Li, Tianyi Cui, Zibo Wang, Wei Bai, and Lintao Zhang. SocksDirect: Datacenter sockets can be fast and compatible. In SIGCOMM 2019.
[75] Yuliang Li, Rui Miao, Hongqiang Harry Liu, Yan Zhuang, Fei Feng, Lingbo Tang, Zheng Cao, Ming Zhang, Frank Kelly, Mohammad Alizadeh, and Minlan Yu. HPCC: High precision congestion control. In SIGCOMM 2019.
[76] Shiyu Liu, Ahmad Ghalayini, Mohammad Alizadeh, Balaji Prabhakar, Mendel Rosenblum, and Anirudh Sivaraman. Breaking the transience-equilibrium nexus: A new approach to datacenter packet transport. In NSDI 2021.

[77] Yuanwei Lu, Guo Chen, Bojie Li, Kun Tan, Yongqiang Xiong, Peng Cheng, Jiansong Zhang, Enhong Chen, and Thomas Moscibroda. Multi-path transport for RDMA in datacenters. In NSDI 2018.
[78] Yuanwei Lu, Guo Chen, Zhenyuan Ruan, Wencong Xiao, Bojie Li, Jiansong Zhang, Yongqiang Xiong, Peng Cheng, and Enhong Chen. Memory efficient loss recovery for hardware-based transport in datacenter. In APNet 2017.
[79] Matt Mathis, John Heffner, and Rajiv Raghunarayan. TCP extended statistics MIB (RFC 4898). Technical report, 2007.
[80] Rui Miao, Lingjun Zhu, Shu Ma, Kun Qian, Shujun Zhuang, Bo Li, Shuguang Cheng, Jiaqi Gao, Yan Zhuang, Pengcheng Zhang, Rong Liu, Chao Shi, Binzhang Fu, Jiaji Zhu, Jiesheng Wu, Dennis Cai, and Hongqiang Harry Liu. From Luna to Solar: The evolutions of the compute-to-storage networks in Alibaba Cloud. In SIGCOMM 2022.
[81] Jaehong Min, Ming Liu, Tapan Chugh, Chenxingyu Zhao, Andrew Wei, In Hwan Doh, and Arvind Krishnamurthy. Gimbal: Enabling multi-tenant storage disaggregation on SmartNIC JBOFs. In SIGCOMM 2021.
[82] Radhika Mittal, Vinh The Lam, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, and David Zats. TIMELY: RTT-based congestion control for the datacenter. In SIGCOMM 2015.
[83] Radhika Mittal, Alexander Shpiner, Aurojit Panda, Eitan Zahavi, Arvind Krishnamurthy, Sylvia Ratnasamy, and Scott Shenker. Revisiting network support for RDMA. In SIGCOMM 2018.
[84] Behnam Montazeri, Yilong Li, Mohammad Alizadeh, and John Ousterhout. Homa: A receiver-driven low-latency transport protocol using network priorities. In SIGCOMM 2018.
[85] Akshay Narayan, Frank Cangialosi, Deepti Raghavan, Prateesh Goyal, Srinivas Narayana, Radhika Mittal, Mohammad Alizadeh, and Hari Balakrishnan. Restructuring endpoint congestion control. In SIGCOMM 2018.
[86] Vladimir Olteanu, Haggai Eran, Dragos Dumitrescu, Adrian Popa, Cristi Baciu, Mark Silberstein, Georgios Nikolaidis, Mark Handley, and Costin Raiciu. An edge-queued datagram service for all datacenter traffic. In NSDI 2022.
[87] Madhav Himanshubhai Pandya, Aaron William Ogus, Zhong Deng, and Weixiang Sun. Transport protocol and interface for efficient data transfer over RDMA fabric, August 2 2022. US Patent 11,403,253.
[88] Jonathan Perry, Amy Ousterhout, Hari Balakrishnan, Devavrat Shah, and Hans Fugal. Fastpass: A centralized "zero-queue" datacenter network. In SIGCOMM 2014.
[89] Jonas Pfefferle, Patrick Stuedi, Animesh Trivedi, Bernard Metzler, Ionnis Koltsidas, and Thomas R Gross. A hybrid I/O virtualization framework for RDMA-capable network interfaces. ACM SIGPLAN Notices, 2015.
[90] Jim Pinkerton. Sockets Direct Protocol v1.0. RDMA Consortium, 2003.
[91] Leon Poutievski, Omid Mashayekhi, Joon Ong, Arjun Singh, Mukarram Tariq, Rui Wang, Jianan Zhang, Virginia Beauregard, Patrick Conner, Steve Gribble, Rishi Kapoor, Stephen Kratzer, Nanfang Li, Hong Liu, Karthik Nagaraj, Jason Ornstein, Samir Sawhney, Ryohei Urata, Lorenzo Vicisano, Kevin Yasumura, Shidong Zhang, Junlan Zhou, and Amin Vahdat. Jupiter evolving: Transforming Google's datacenter network via optical circuit switches and software-defined networking. In SIGCOMM 2022.
[92] Kun Qian, Wenxue Cheng, Tong Zhang, and Fengyuan Ren. Gentle flow control: Avoiding deadlock in lossless networks. In SIGCOMM 2019.
[93] Waleed Reda, Marco Canini, Dejan Kostic, and Simon Peter. RDMA is Turing complete, we just did not know it yet! In NSDI 2022.
[94] Benjamin Rothenberger, Konstantin Taranov, Adrian Perrig, and Torsten Hoefler. ReDMArk: Bypassing RDMA security mechanisms. In USENIX Security 2021.
[95] Ahmed Saeed, Varun Gupta, Prateesh Goyal, Milad Sharif, Rong Pan, Mostafa Ammar, Ellen Zegura, Keon Jang, Mohammad Alizadeh, Abdul Kabbani, and Amin Vahdat. Annulus: A dual congestion control loop for datacenter and WAN traffic aggregates. In SIGCOMM 2020.
[96] Leah Shalev, Hani Ayoub, Nafea Bshara, and Erez Sabbag. A cloud-optimized transport protocol for elastic and scalable HPC. IEEE Micro, 2020.
[97] Cheng Tan, Ze Jin, Chuanxiong Guo, Tianrong Zhang, Haitao Wu, Karl Deng, Dongming Bi, and Dong Xiang. NetBouncer: Active device and link failure localization in data center networks. In NSDI 2019.

[98] Konstantin Taranov, Benjamin Rothenberger, Adrian Perrig, and Torsten Hoefler. sRDMA: Efficient NIC-based authentication and encryption for remote direct memory access. In ATC 2020.
[99] Balajee Vamanan, Jahangir Hasan, and TN Vijaykumar. Deadline-aware datacenter TCP (D2TCP). In SIGCOMM 2012.
[100] Dongyang Wang, Binzhang Fu, Gang Lu, Kun Tan, and Bei Hua. vSocket: Virtual socket interface for RDMA in public clouds. In VEE 2019.
[101] Weitao Wang, Masoud Moshref, Yuliang Li, Gautam Kumar, TS Eugene Ng, Neal Cardwell, and Nandita Dukkipati. Poseidon: Efficient, robust, and practical datacenter CC via deployable INT. In NSDI 2023.
[102] Zilong Wang, Layong Luo, Qingsong Ning, Chaoliang Zeng, Wenxue Li, Xinchen Wan, Peng Xie, Tao Feng, Ke Cheng, Xiongfei Geng, Tianhao Wang, Weicheng Ling, Kejia Huo, Pingbo An, Kui Ji, Shideng Zhang, Bin Xu, Ruiqing Feng, Tao Ding, Kai Chen, and Chuanxiong Guo. SRNIC: A scalable architecture for RDMA NICs. In NSDI 2023.
[103] Xinyu Crystal Wu and TS Eugene Ng. Detecting and resolving PFC deadlocks with ITSY entirely in the data plane. In INFOCOM 2022.
[104] Jiarong Xing, Kuo-Feng Hsu, Yiming Qiu, Ziyang Yang, Hongyi Liu, and Ang Chen. Bedrock: Programmable network support for secure RDMA systems. In USENIX Security 2022.
[105] Qiumin Xu, Huzefa Siyamwala, Mrinmoy Ghosh, Tameesh Suri, Manu Awasthi, Zvika Guz, Anahita Shayesteh, and Vijay Balakrishnan. Performance analysis of NVMe SSDs and their implication on real world databases. In SYSTOR 2015.
[106] Jian Yang, Joseph Izraelevitz, and Steven Swanson. Orion: A distributed file system for non-volatile main memory and RDMA-capable networks. In FAST 2019.
[107] Gaoxiong Zeng, Wei Bai, Ge Chen, Kai Chen, Dongsu Han, Yibo Zhu, and Lei Cui. Congestion control for cross-datacenter networks. In ICNP 2019.
[108] Qiao Zhang, Vincent Liu, Hongyi Zeng, and Arvind Krishnamurthy. High-resolution measurement of data center microbursts. In IMC 2017.
[109] Yiwen Zhang, Yue Tan, Brent Stephens, and Mosharaf Chowdhury. Justitia: Software multi-tenancy in hardware kernel-bypass networks. In NSDI 2022.
[110] Naiqian Zheng, Mengqi Liu, Ennan Zhai, Hongqiang Harry Liu, Yifan Li, Kaicheng Yang, Xuanzhe Liu, and Xin Jin. Meissa: Scalable network testing for programmable data planes. In SIGCOMM 2022.
[111] Bohong Zhu, Youmin Chen, Qing Wang, Youyou Lu, and Jiwu Shu. Octopus+: An RDMA-enabled distributed persistent memory file system. ACM Transactions on Storage, 2021.
[112] Yibo Zhu, Haggai Eran, Daniel Firestone, Chuanxiong Guo, Marina Lipshteyn, Yehonatan Liron, Jitendra Padhye, Shachar Raindel, Mohamad Haj Yahia, and Ming Zhang. Congestion control for large-scale RDMA deployments. In SIGCOMM 2015.
[113] Yibo Zhu, Monia Ghobadi, Vishal Misra, and Jitendra Padhye. ECN or delay: Lessons learnt from analysis of DCQCN and TIMELY. In CoNEXT 2016.
[114] Yibo Zhu, Nanxi Kang, Jiaxin Cao, Albert Greenberg, Guohan Lu, Ratul Mahajan, Dave Maltz, Lihua Yuan, Ming Zhang, Ben Y. Zhao, and Haitao Zheng. Packet-level telemetry in large datacenter networks. In SIGCOMM 2015.

A SONiC buffer analysis

"BUFFER_POOL": {
    "ingress_pool": {
        "size": "18000000",
        "type": "ingress",
        "mode": "dynamic",
        "xoff": "6000000"
    },
    "egress_lossy_pool": {
        "size": "14000000",
        "type": "egress",
        "mode": "dynamic"
    },
    "egress_lossless_pool": {
        "size": "24000000",
        "type": "egress",
        "mode": "static"
    }
}

"BUFFER_PROFILE": {
    "ingress_lossless_profile": {
        "pool": "[BUFFER_POOL|ingress_pool]",
        "size": "1248",
        "dynamic_th": "-3",
        "xoff": "96928",
        "xon": "1248",
        "xon_offset": "2496"
    },
    "ingress_lossy_profile": {
        "pool": "[BUFFER_POOL|ingress_pool]",
        "size": "0",
        "static_th": "24000000"
    },
    "egress_lossless_profile": {
        "pool": "[BUFFER_POOL|egress_lossless_pool]",
        "size": "0",
        "static_th": "24000000"
    },
    "egress_lossy_profile": {
        "pool": "[BUFFER_POOL|egress_lossy_pool]",
        "size": "1664",
        "dynamic_th": "-1"
    }
}

Listing 1: SONiC Buffer Configuration Example

Listing 1 gives a buffer configuration example of a SONiC pizza box switch with a 24 MB packet buffer. ingress_pool has 18 MB (size) of shared buffer for all the ingress queues, and 6 MB (xoff) of PFC headroom buffer exclusively for ingress lossless queues in the paused state. egress_lossy_pool and egress_lossless_pool have 14 MB and 24 MB of shared buffer, respectively. It is worth noting that the sum of the pool sizes can be larger than the physical buffer limit, as the pools are only virtual counters for admission control purposes.

Lossless packets are mapped to both ingress lossless queues (ingress_lossless_profile) and egress lossless queues (egress_lossless_profile). We use the Dynamic Threshold (DT) algorithm [40] to manage the buffer occupancy of the ingress lossless queue in the 18 MB shared buffer space of ingress_pool. The DT algorithm is controlled by a parameter called α, which is 1/8 (2^dynamic_th) in Listing 1. Once the ingress lossless queue hits the dynamic threshold (α × remaining buffer), it will enter the paused state (send PFC pause frames) and start to use PFC headroom. All the ingress lossless queues in the paused state share a 6 MB PFC headroom pool (xoff of ingress_pool). Each ingress lossless queue can use up to 96928 bytes of buffer (xoff of ingress_lossless_profile) in the PFC headroom pool. We bypass the egress admission control for lossless traffic by setting the static threshold of the egress lossless queue (static_th of egress_lossless_profile) to 24 MB, which equals the switch buffer size.

In contrast, we only want to apply egress admission control for lossy traffic. To bypass ingress admission control for lossy traffic, we configure a sky-high static threshold of 24 MB (static_th of ingress_lossy_profile) for each ingress lossy queue. Since lossy traffic can only use the 18 MB shared buffer space of ingress_pool, the size of egress_lossy_pool should be no larger than 18 MB (size of ingress_pool). In Listing 1, the size of egress_lossy_pool is 14 MB. This guarantees that ingress lossless queues can exclusively use 4 MB of shared buffer (size of ingress_pool - size of egress_lossy_pool) in ingress_pool before entering the paused state. We use the DT algorithm to manage the egress lossy queue length and set α to 1/2 (2^dynamic_th). Once the egress lossy queue hits the dynamic threshold, its arriving packets will be dropped.

B DCQCN experiment results

Figure 12: Goodput of two flows with different RTTs.

We conduct an experiment in our lab testbed to demonstrate the RTT fairness of DCQCN. Our lab testbed uses a four-tier Clos topology like Figure 2. We use 80 km cables to interconnect T2 switches to an RH switch to emulate a region. In this experiment, we use two hosts A and B as senders and a host C as the receiver. Each host is equipped with a Gen1 40 Gbps NIC. Hosts A and C are located within the same rack with ∼2 µs base RTT. In contrast, B is in another datacenter. The base RTT across the RH switch is ∼1.77 ms. On each sender, we use ndperf to create a QP with the receiver and keep posting 64 KB Write messages. Each QP can keep up to 160 in-flight Write messages, resulting in around 10 MB of in-flight data, which is enough to saturate the large inter-datacenter pipe (40 Gbps × 1.77 ms = 8.85 MB). We set the RED/ECN marking parameters Kmin, Kmax and Pmax to 1 MB, 2 MB and 5%, respectively.

As shown in Figure 12, the two DCQCN flows achieve similar goodput regardless of their RTTs. Each flow achieves around 17 Gbps of goodput, which is close to half of the line rate. We also keep polling the queue watermark counters at the congested switch and find that the queue watermark oscillates around 1.36 MB, which is smaller than Kmax. This experiment demonstrates that DCQCN does not suffer from RTT unfairness.
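The claim that 160 outstanding 64 KB Writes are enough to keep the long path busy can be checked directly; the arithmetic below only restates the numbers quoted in Appendix B.

# Bandwidth-delay product of the emulated inter-datacenter path vs. the
# per-QP in-flight data allowed by the NIC (numbers from Appendix B).
LINE_RATE_BPS = 40e9            # Gen1 40 Gbps NIC
BASE_RTT_S = 1.77e-3            # ~1.77 ms base RTT across the RH switch

bdp_bytes = LINE_RATE_BPS * BASE_RTT_S / 8
inflight_bytes = 160 * 64 * 1024        # 160 outstanding 64 KB Write messages

print(round(bdp_bytes / 1e6, 2))        # 8.85 MB pipe
print(round(inflight_bytes / 1e6, 2))   # 10.49 MB in flight > BDP, so the
                                        # sender is not window-limited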

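For completeness, the marking curve implied by the Kmin/Kmax/Pmax setting above can be sketched as follows, assuming the standard linear RED/ECN profile (no marking below Kmin, probability rising linearly to Pmax at Kmax, and every packet marked beyond Kmax); actual switch ASICs may differ in details.

KMIN = 1_000_000      # 1 MB
KMAX = 2_000_000      # 2 MB
PMAX = 0.05           # 5%

def ecn_mark_probability(queue_bytes):
    """Standard RED-style marking profile assumed here."""
    if queue_bytes <= KMIN:
        return 0.0
    if queue_bytes >= KMAX:
        return 1.0
    return PMAX * (queue_bytes - KMIN) / (KMAX - KMIN)

# At the ~1.36 MB watermark observed in the experiment, only ~1.8% of packets
# are marked, consistent with the queue staying well below Kmax.
print(round(ecn_mark_probability(1_360_000), 3))   # 0.018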

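To make the admission logic of Appendix A concrete, the following sketch re-implements the two checks described there with the numbers from Listing 1. The function and variable names are ours; real SONiC/ASIC behavior carries more state (XON/XOFF hysteresis, per-priority-group accounting) than shown here.

# Dynamic Threshold (DT) admission, as described in Appendix A:
# a queue's threshold is alpha * (remaining shared buffer), with alpha = 2**dynamic_th.
def dt_threshold(dynamic_th, free_shared_bytes):
    return (2 ** dynamic_th) * free_shared_bytes

# Listing 1 values: ingress lossless queues use dynamic_th = -3 (alpha = 1/8),
# egress lossy queues use dynamic_th = -1 (alpha = 1/2).
INGRESS_POOL = 18_000_000
EGRESS_LOSSY_POOL = 14_000_000

# Instantaneous pause threshold for one ingress lossless queue when the whole
# 18 MB shared pool is still free.
print(dt_threshold(-3, INGRESS_POOL))          # 2250000.0 bytes

# Lossy traffic is admitted against the 14 MB egress_lossy_pool, so lossless
# queues effectively keep 18 MB - 14 MB = 4 MB of the shared space to themselves.
print(INGRESS_POOL - EGRESS_LOSSY_POOL)        # 4000000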