
Challenging the Stateless Quo of Programmable Switches

Nadeen Gebara ([email protected]), Imperial College London
Alberto Lerner ([email protected]), University of Fribourg, Switzerland
Mingran Yang ([email protected]), MIT CSAIL
Minlan Yu ([email protected]), Harvard University
Paolo Costa ([email protected]), Microsoft Research
Manya Ghobadi ([email protected]), MIT CSAIL

ABSTRACT

Programmable switches based on the Protocol Independent Switch Architecture (PISA) have greatly enhanced the flexibility of today's networks by allowing new packet protocols to be deployed without any hardware changes. They have also been instrumental in enabling a new computing paradigm in which parts of an application's logic run within the network core (in-network computing).

The characteristics and requirements of in-network applications, however, are quite different from those of the packet protocols for which programmable switches were originally designed. Packet protocols are typically stateless, while in-network applications require frequent operations on shared state maintained in the switch. This mismatch increases the development complexity of in-network computing and hampers widespread adoption.

In this paper, we describe the key obstacles to developing in-network applications on PISA and propose rethinking the current switch architecture. Rather than changing the existing architecture, we propose augmenting it with a Stateful Data Plane (SDP). The SDP supports the requirements of stateful applications, while the conventional data plane (CDP) performs packet-protocol functions.

CCS CONCEPTS

• Networks → Intermediate nodes; Programmable networks; In-network processing; • Hardware → Networking hardware; Hardware accelerators.

KEYWORDS

In-network Computing, Stateful Applications, Programmable Switches, PISA

ACM Reference Format:
Nadeen Gebara, Alberto Lerner, Mingran Yang, Minlan Yu, Paolo Costa, and Manya Ghobadi. 2020. Challenging the Stateless Quo of Programmable Switches. In Proceedings of the 19th ACM Workshop on Hot Topics in Networks (HotNets '20), November 4–6, 2020, Virtual Event, USA. ACM, New York, NY, USA, 7 pages. https://doi.org/10.1145/3422604.3425928

1 INTRODUCTION

The ability of programmable switches based on the Protocol Independent Switch Architecture (PISA) [2, 7, 9, 35] to execute data-plane programs at line rate has opened up new opportunities for researchers and practitioners, spurring unprecedented innovation in network protocols and architectures, e.g., [4, 7, 13, 25, 35, 41, 43]. They have also paved the way for in-network computing [11, 38, 48], a new class of applications, ranging from caching [22] and database query processing [26, 47] to machine learning (ML) [33, 40, 52] and consensus [16, 17, 27], that take advantage of the ability to execute arbitrary code within the network core (as opposed to just at the edge), leveraging the switches' unique vantage point. However, while programmable switches have been crucial to enabling this new paradigm, we argue that their current architecture is a poor fit for the emerging applications, introducing unnecessary development complexity and impacting performance. This is limiting further growth and precluding widespread adoption of in-network computing, ultimately hurting innovation.

PISA and some of its recent extensions [13, 41] were primarily devised to handle traditional packet-processing operations, e.g., address rewriting or forwarding based on header fields, which typically require limited state. Their architectures, therefore, rely on a pipeline of independent match-action stages across which memory and compute resources are distributed (§2).

In contrast, in-network computing applications tend to be stateful and comprise a sequence of operations performed on complex data structures, such as hash tables or caches. Such complex operations are hard to accommodate in current PISA-based switches because the application's state is scattered across pipelines and match-action pipeline stages. For instance, inserting a new item into a cache data structure might require a packet to traverse all match-action stages in a pipeline to check whether that item is already present. If the item is not found, it should be inserted in one of the previous stages, but this is not possible because pipelines are feed-forward only. The common workaround is to reinsert (or recirculate) the packet into the pipeline, potentially at the cost of program correctness and performance.

We believe the time has come to rethink the hardware platform of programmable switches. We advocate the need for a new architecture designed to support both in-network computing and traditional packet protocols. This is analogous to a similar trend in ML hardware design, where general-purpose GPUs paved the way for new architectures designed specifically with ML applications in mind, e.g., Google TPUs [23] and NVIDIA Tensor Cores [34].
Figure 1: PISA-based switch, adapted from the RMT architecture [7]. We show the switch at different levels of detail: a Match-Action Unit (MAU) stage within a single switch pipeline (left), a switch pipeline of MAU stages (middle), and a switch with multiple pipelines (right).

In this paper, we first provide a comprehensive review of the limitations of existing programmable data planes in supporting stateful applications (§3). Next, we sketch the design of a possible architecture for extending the capabilities of PISA to support stateful applications. We propose to complement the conventional data plane (CDP) in PISA with a new stateful data plane (SDP) (§4). Unlike the CDP, which requires all of its stages to be traversed by packets, the SDP is designed to support stateful applications without causing performance degradation. The key observation driving our design is that today's CDP pipelines have an end-to-end latency of several hundreds of nanoseconds, creating a slack time that can be leveraged to perform more complex operations in the SDP while the packet headers traverse the CDP. As long as the stateful operations are completed within the slack time, and no packets are blocked, no performance loss occurs. This could enable supporting stateful operations without disrupting traditional traffic.

While our design is still preliminary and several questions remain to be investigated, we consider this paper a first step towards enabling stateful in-network computing at line rate. We hope to trigger new research and innovation at the boundaries of the computer architecture, programming languages, and networking disciplines.

2 PROGRAMMABLE SWITCHES TODAY

Although the micro-architectural details of today's programmable switches vary, they usually comprise four main elements: a parser, a deparser, a pipeline of Match-Action Unit (MAU) stages, and a bus connecting these elements. Fig. 1 shows the components of a generalized programmable switch architecture, commonly referred to as the Protocol Independent Switch Architecture (PISA) [2].

Packet parser, header vectors, and packet deparser. As shown in Fig. 1, the processing pipeline starts with a programmable parser that extracts the packet header fields required for program execution [18]. These fields are transferred to a large register file called the Packet Header Vector (PHV) [7]. Each MAU stage takes a PHV instance as its input, performs an operation using data in the PHV, and outputs data to the following PHV instance (the input of the next stage) through the bus interconnecting the MAU stages. The final PHV instance is connected to a deparser, which reassembles the packet with the newly obtained headers.

Match-Action Unit stages. The MAU stages are the processing elements of a programmable switch, as illustrated in Fig. 1. They can "execute" a match-action table; i.e., they select specific fields from the input PHV and compare them to values stored in the match table to determine the actions to be executed. Match tables can be stored in SRAM or TCAM, allowing more sophisticated matching conditions such as longest-prefix matching. If a match is found, the action associated with the matching condition is provided to a specific ALU in the form of a simple RISC-like instruction. The action unit has multiple ALUs that operate on PHV fields (header or metadata) and/or register data. The complexity of the operations performed by ALUs varies across switch micro-architectures. For example, MAU stages in Domino [41] support more complex actions (atoms) than those originally proposed in RMT, at the expense of increased area. However, the complexity of the operations is limited by the timing constraint imposed within a pipeline stage.

Registers. An action may need to change a value stored in the MAU beyond the lifetime of a packet traversing the switch. For this purpose, PISA switches provide registers in MAU stages, usually stored in SRAM. Registers are the only state holders that can be updated through the data plane. However, such operations usually have tight restrictions on the number of registers that can be concurrently accessed within a stage [27].

A pipeline of MAU stages. Multiple MAU stages form a pipeline between the parser and deparser. The MAU stages are independent of each other and share no resources, as shown in Fig. 1. This design has one important implication: information is propagated across the stages in a single direction, feed-forwarding data through the connecting PHVs. If a computation requires reading/writing values more than once, the packet needs to traverse the pipeline several times. This is referred to as recirculation.

Multi-pipeline switches. To scale the number of ports, switch manufacturers use multiple parallel pipelines with no state sharing, as depicted in Fig. 1. Ingress (egress) pipelines are statically bound to their respective input (output) ports. Packets can move from their ingress pipeline to any egress pipeline via the switching elements, but moving a packet from one pipeline to another is only possible by recirculating the packet.
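To make the feed-forward constraint concrete, the following Python sketch models a pipeline of MAU stages, each owning its own registers; the stage names, register layout, and "action" hook are illustrative assumptions, not the behavior of any particular switch. A packet can read or update state only in the stage it is currently visiting, so touching state that lives in an earlier stage requires sending the packet through the pipeline again (recirculation).

```python
class MauStage:
    """Toy model of one MAU stage: a match-action step plus local registers."""
    def __init__(self, name):
        self.name = name
        self.registers = {}          # state local to this stage only

    def process(self, phv):
        # A real stage would match on PHV fields and run a small action;
        # here we simply let the action read/write this stage's registers.
        action = phv.get("action")
        if action:
            action(self, phv)
        return phv                   # the PHV is handed to the *next* stage only

class FeedForwardPipeline:
    def __init__(self, num_stages):
        self.stages = [MauStage(f"stage{i}") for i in range(num_stages)]

    def traverse(self, phv):
        for stage in self.stages:    # one direction, no way back
            phv = stage.process(phv)
        return phv

    def recirculate(self, phv):
        # The only way to revisit earlier stages is to pay another full pass.
        return self.traverse(phv)

# Example: count packets in stage0's register. The action runs only while the
# packet visits stage0; later stages cannot write back into stage0's state.
def count_in_stage0(stage, phv):
    if stage.name == "stage0":
        stage.registers["pkt_count"] = stage.registers.get("pkt_count", 0) + 1

pipeline = FeedForwardPipeline(num_stages=4)
pipeline.traverse({"action": count_in_stage0})
```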
3 PISA AND THE BATTLE OF STATE

In this section, we first discuss the shortcomings of PISA in the context of in-network compute (§3.1). We then describe the challenges associated with developing a seemingly simple in-network application (top-𝑘 heavy-hitter detection) as a concrete example of the challenges faced by in-network compute programmers (§3.2). Finally, we conclude the section by summarizing the key data structures and building blocks used by in-network compute services; we discuss how they are currently implemented on PISA-based switches and note the corresponding limitations (§3.3).

3.1 Limitations of PISA

Link rates in data centers have increased by a factor of 100 over the last decade, jumping from 1 Gbps in 2009 to 100 Gbps today. A hypothetical switch with 64 × 100 Gbps ports would process an aggregate of 9.5 billion packets per second assuming 84-byte packets (including 20 bytes for preamble and inter-packet gap). If such a switch were to retire one packet per clock, it would require an unfeasible 9.5 GHz clock rate. This motivates the need for a pipelined architecture (§2) with a small cycle budget per stage to support high throughput even with today's lower clock rates [54].

A pipelined architecture is well-suited for networking protocols, which are typically characterized by relatively simple operations (e.g., rewrite the packet header or select an output port), but it is a poor fit for stateful in-network applications [38] because of the several implementation constraints that such an architecture introduces, as we detail next.

L1 Limited support for complex operations. The first consequence of the limited cycle budget is the difficulty of supporting complex arithmetic operations. Such operations range from "simple" multiplications to more complex ones involving floating-point arithmetic. This impacts applications such as gradient aggregation in ML training [40] and complex queries in database systems [47].

L2 Shared-nothing stage architecture. Stages offer a limited set of operations but compensate by executing several of them in parallel. However, they must be independent: an operation's output cannot be used as the input for another operation in the same stage. This effectively forces every sequential action to be unrolled across several stages. Furthermore, a limited number of registers can be accessed at once within a stage. Stateful register operations might need to be partitioned across stages if this limit is reached [27].

L3 Feed-forward transfer across stages. When the state of a program is partitioned across stages, if an operation requires access to the global state, the packet must travel across all the stages. If, after traversing the pipeline, the packet needs to update a variable in one of the previous stages, the only option is to recirculate the packet. However, this operation can affect the application's correctness: by the time the packet is readmitted into the pipeline, the state may have changed, thus possibly invalidating the result of the previous read. Recirculation may also reduce the rate at which new packets access the pipeline, decreasing overall throughput [3].

L4 Fixed memory access pattern semantics. In addition to the challenges described earlier, when the state is partitioned across stages, compilers must impose a strict access pattern on the program's variables. For example, if a programmer defines two variables, 𝑎 and 𝑏, in two different stages, 𝑠1 and 𝑠2, the program is restricted to always accessing them in this order. This forces developers to think in terms of physical rather than logical memory layout, which reduces programming flexibility.

L5 No state-sharing across pipelines. The above issues are exacerbated when the switch architecture hosts multiple parallel pipelines, a common scenario for modern switches. Given the static binding of switch ports and input pipelines and the lack of inter-pipeline channels, the only way a packet can access an application's state stored in another pipeline is by being recirculated to that pipeline. From an application's perspective, a switch is essentially a multi-core processor, with the caveat that one processor can only communicate with another through recirculation paths.

3.2 Example: Top-𝑘 Heavy-Hitter Detection

To illustrate the impact of the limitations described in the previous section on a real application, we consider HashPipe [42], a recently proposed P4 implementation for heavy-hitter detection. We chose this application because, despite the simplicity of the high-level protocol, implementing its required data structure on a programmable switch is particularly complex and requires a compromise between accuracy and performance. This is representative of the challenges faced by developers when they attempt to implement stateful applications on programmable switches.

Program overview. The goal of HashPipe is to identify the set of top-𝑘 flows that generate the largest number of packets. This is achieved by using an adaptation of the space-saving algorithm [32]. The algorithm maintains a fixed-size table with 𝑘 entries, each containing the flow ID (e.g., a hash of the TCP 5-tuple) and the corresponding packet counter. Every time a packet arrives, if its flow ID is already stored in the table, its counter is incremented. Otherwise, the flow ID with the smallest counter is evicted and replaced by the flow ID of the incoming packet, and the previous counter value is incremented by one.

Shared-nothing stage architecture. Determining the flow with the smallest count value requires register accesses and sequential comparison operations that exceed the capacity of a single stage. To circumvent this limitation, the table's registers must be partitioned across pipeline stages. However, since such a solution would restrict the number of table registers (keys) to the number of pipeline stages, HashPipe uses a probabilistic solution. Specifically, it applies 𝑑 hash functions (one per stage) to the flow ID to inspect 𝑑 randomly selected entries in different stages, where 𝑑 is the number of stages. It then determines the minimum value across these selected entries. This approach, however, incurs an accuracy error because the minimum value across just 𝑑 entries can be far from the minimum value of the entire table.

Feed-forward transfer across stages. Partitioning the table across 𝑑 stages presents another challenge. A naive way to find the minimum across the 𝑑 entries is the following. As the incoming packet traverses all stages, the program checks whether its flow ID (𝑓𝑛𝑒𝑤) is already stored in any of the stages, and it uses the metadata field to keep track of the flow ID with the smallest packet counter encountered thus far (𝑓𝑚𝑖𝑛). If 𝑓𝑛𝑒𝑤 is found, its counter is simply incremented and no further action is required. However, if the packet reaches the last stage and 𝑓𝑛𝑒𝑤 has not been found, the packet needs to be recirculated to replace the entry with 𝑓𝑚𝑖𝑛. In the worst-case scenario, the need to recirculate all packets can result in halving the overall throughput [42].
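To make the accuracy gap concrete, here is a minimal Python sketch (not the authors' code) contrasting the exact space-saving update, which needs the minimum over the whole table, with the per-stage lookup that only samples one entry per stage; the hash function and table sizes are illustrative assumptions.

```python
import hashlib

def stage_hash(stage_id, flow_id, slots):
    """Illustrative per-stage hash; real switches use hardware hash units."""
    digest = hashlib.sha256(f"{stage_id}:{flow_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % slots

def space_saving_update(table, flow_id, k):
    """Exact space-saving: needs the global minimum over the whole table."""
    if flow_id in table:
        table[flow_id] += 1
    elif len(table) < k:
        table[flow_id] = 1
    else:
        victim = min(table, key=table.get)   # full scan: impossible in one MAU stage
        count = table.pop(victim)
        table[flow_id] = count + 1           # new key inherits (and bumps) the evicted count

def sampled_minimum(stages, flow_id):
    """HashPipe-style lookup: inspect only one slot per stage (d slots total)."""
    candidates = []
    for sid, stage in enumerate(stages):     # stages = list of per-stage slot arrays
        slot = stage[stage_hash(sid, flow_id, len(stage))]
        candidates.append(slot)              # each slot is a (flow_id, count) pair or None
    # The minimum over these d sampled slots may be far from the true table
    # minimum, which is the source of the accuracy error discussed above.
    occupied = [c for c in candidates if c is not None]
    return min(occupied, key=lambda c: c[1]) if occupied else None
```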
Class | Data structure | Application(s) | Limitation(s) | Implementation trade-off
C1 | Bloom filter | Distributed storage systems, network telemetry, load balancing, databases [30] | L2 | Limited number of hash functions [22, 33, 41, 47, 53]
C1 | Count-min sketch | Distributed storage systems, network telemetry [53], databases | L2 | Limited number of hash functions [22, 41, 47]
C1 | Hash table | Distributed storage systems, network telemetry, databases, load balancing, distributed system co-ordination [14, 19–22, 27–29, 33] | L2 | Limited collision list & external overflow [26]; lazy eviction threshold tuning required to evict stale hash table entries [12]
C2 | Vector | ML training [5, 40, 48] | L1, L2 | Limited vector size reduces application goodput; lack of floating-point operations (fp) makes accuracy dependent on fp conversion efficiency [40]
C2 | Priority queue | Network telemetry, databases [47] | L2, L3, L4 | Depth of queue bounded by pipeline stages requires approximate solutions with reduced accuracy [3, 42]
C2 | Cache | Distributed storage systems, network telemetry, databases | L2, L3, L4 | Cache policy managed through the control plane & cache update operations violate line rate [22]
C3 | Cuckoo hash table | Distributed storage systems, network telemetry, databases, load balancing, distributed system co-ordination [14, 19–22, 26–29, 33] | L2, L3, L4 | No known implementation is available
C3 | Matrix | ML inference [43] | L1, L2 | No known implementation is available

Table 1: Mapping representative data structures to PISA-based switch architectures. (C1) implementation is practical; (C2) implementation violates line rate or uses an approximate solution; (C3) no complete data-plane implementation. Since L5 is not application-dependent, and not a property of the data structure itself, it is not included in the table.

To avoid recirculation and the corresponding performance hit, HashPipe modifies the space-saving algorithm by inserting 𝑓𝑛𝑒𝑤 in the first stage. The flow evicted from the first stage is carried forward as part of the packet metadata field. If a flow ID with a smaller counter is found in the next stage, the two are swapped and the latter is carried forward. Although this approach circumvents the need for recirculation, it can result in duplicate entries if 𝑓𝑛𝑒𝑤 is already present in one of the later stages. These duplicate entries reduce the usable table memory and can result in the accidental eviction of heavy flows whose counts are spread across multiple registers. Such evictions cause HashPipe to have a higher false-positive rate than the space-saving algorithm when keys are over-reported [42].

Multiple pipelines. Depending on the definition of a flow, e.g., the TCP 5-tuple vs. "all packets generated by a tenant," the same flow ID might appear on different pipelines, further complicating a HashPipe-like implementation. One solution could be to force packets to be steered through the same pipeline (ingress or egress) via recirculation. Alternatively, independent tables could be maintained (one per ingress pipeline), but this would accentuate the risk of duplicate keys and, hence, of lower accuracy.

3.3 Stateful data structures in PISA

A top-𝑘 list is not the only data structure that is difficult to implement in PISA switches. We surveyed the in-network computing literature and identified the most common data structures used. We classify them in Table 1 according to their implementation difficulty and the impact of the limitations outlined in §3.1.

Bloom filter [6]. A membership test using a Bloom filter with 𝐻 hash functions requires a read-and-combine operation on the values held in 𝐻 registers. Such an operation cannot be completed within a single stage because of L2. Therefore, a Bloom filter must span 𝐻 + 1 stages to read the register values and combine their outcome. This bounds the number of hash functions by the number of pipeline stages.

Count-min sketch [15]. A count-min sketch with 𝐻 hash functions uses the minimum value accessed by a hash function as an estimator of a key. The estimator requires a read-and-combine operation on 𝐻 registers; hence, it exhibits constraints similar to those of Bloom filters because of L2. Universal sketches with more complex estimators are implemented by performing sketch updates in the data plane and using the control plane for estimator computation [31].
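As a concrete illustration of this read-and-combine pattern, the sketch below lays out a count-min estimator the way a PISA program must: one register array per stage, one hash function per stage, with the running minimum carried forward in metadata. The array sizes and hash function are illustrative assumptions.

```python
import hashlib

def row_hash(row_id, key, width):
    """Illustrative stand-in for a per-stage hardware hash unit."""
    digest = hashlib.sha256(f"{row_id}:{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % width

class CountMinAcrossStages:
    """Count-min sketch with H rows, one row (register array) per MAU stage."""
    def __init__(self, num_rows=4, width=1024):
        self.rows = [[0] * width for _ in range(num_rows)]

    def update(self, key):
        # Each increment touches one register in each of the H stages.
        for row_id, row in enumerate(self.rows):
            row[row_hash(row_id, key, len(row))] += 1

    def estimate(self, key):
        # The estimator is the minimum over H registers. On PISA this
        # "combine" runs stage by stage, carrying the running minimum in
        # packet metadata, so H is bounded by the number of stages (L2).
        running_min = None
        for row_id, row in enumerate(self.rows):
            value = row[row_hash(row_id, key, len(row))]
            running_min = value if running_min is None else min(running_min, value)
        return running_min
```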
Hash table. Many hash-table implementations maintain a sequential list to handle collisions. However, because of L2, the list needs to extend across multiple stages. This caps the maximum length of the collision list at the number of stages left after the stage in which the hash computation is performed. A possible workaround is to rely on the control plane to handle overflows, but, again, this causes significantly higher latency [26]. An alternative approach is to rely on multiple hash tables with different hash functions used in different stages [12]. Although this is a better solution, it is less memory efficient than using a Cuckoo hash table [36]. Multi-table hashes such as Cuckoo hashing may seem viable at first. Yet, a pure data-plane implementation requires recirculation when an element is evicted from the last table. While the recirculation is ongoing, new elements can be inserted and evicted, creating a possible race condition. Today, implementing Cuckoo hash tables in PISA needs control-plane support.
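The following sketch illustrates the multi-table layout described above under assumed sizes and hash functions: one small table per stage with its own hash function, a bounded number of stages to probe, and a recirculation or control-plane fallback once every candidate slot is taken.

```python
import hashlib

def stage_hash(stage_id, key, slots):
    """Illustrative per-stage hash function."""
    digest = hashlib.sha256(f"{stage_id}:{key}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % slots

class MultiStageHashTable:
    """One fixed-size table per stage; a key can only live in its hashed slot."""
    def __init__(self, num_stages=4, slots_per_stage=1024):
        self.tables = [dict() for _ in range(num_stages)]   # slot index -> (key, value)
        self.slots = slots_per_stage

    def insert(self, key, value):
        # Probe each stage in pipeline order; the packet cannot go back (L3),
        # so the first empty (or matching) slot along the way must be used.
        for sid, table in enumerate(self.tables):
            idx = stage_hash(sid, key, self.slots)
            occupant = table.get(idx)
            if occupant is None or occupant[0] == key:
                table[idx] = (key, value)
                return True
        # All candidate slots are occupied: a real switch would either
        # recirculate the packet or defer the insertion to the control plane.
        return False

    def lookup(self, key):
        for sid, table in enumerate(self.tables):
            idx = stage_hash(sid, key, self.slots)
            occupant = table.get(idx)
            if occupant is not None and occupant[0] == key:
                return occupant[1]
        return None
```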
Vector and Matrix. Vector operations can be mapped onto a PISA switch in a variety of ways. When the size of a vector exceeds the total number of stateful operations that can be performed on packet headers (L2), applications must partition the vector across multiple packets. For example, since ML tensors are large vectors, gradient aggregation requires applications to send multiple packets to complete a single vector aggregation operation. Further, since L1 does not allow floating-point operations, the provided gradient values must be converted to fixed-point numbers. Careful parameter tuning is needed to avoid accuracy loss.
Moreover, matrix operations required for in-network ML inference [43] are even more complex than vector ones. A typical case may comprise several multiply-accumulate operations [52]. Here, the operation needs to be spread across multiple stages because of L1 and L2.
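Because the data plane only offers integer arithmetic, gradient values have to be quantized before aggregation, as noted above. The snippet below is a hedged illustration of one simple scheme with a single global scaling factor; systems such as the in-network aggregation work cited above [40] use more careful, adaptive scaling, which is exactly the parameter tuning the text refers to.

```python
SCALE_BITS = 16                      # illustrative fixed-point precision

def to_fixed(gradients, scale_bits=SCALE_BITS):
    """Quantize float gradients to integers so the switch can add them."""
    scale = 1 << scale_bits
    return [int(round(g * scale)) for g in gradients]

def from_fixed(values, num_workers, scale_bits=SCALE_BITS):
    """Convert the aggregated integers back to the mean float gradient."""
    scale = 1 << scale_bits
    return [v / (scale * num_workers) for v in values]

# Each worker sends to_fixed(chunk) in a packet; the switch adds the integer
# vectors element-wise across workers; workers then call from_fixed() on the
# result. The choice of scale_bits trades range against precision, which is
# the source of the accuracy concern mentioned in the text.
worker_a = to_fixed([0.25, -1.5, 3.0])
worker_b = to_fixed([0.75, 0.5, -1.0])
aggregated = [a + b for a, b in zip(worker_a, worker_b)]
mean_gradient = from_fixed(aggregated, num_workers=2)   # [0.5, -0.5, 1.0]
```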
Cache. Implementing associative caches in the data plane presents challenges similar to the HashPipe example (§3.2). Cache lookups and cache updates of existing keys can be realized by distributing the keys across multiple stages because of L2. However, because of L3, replacing an existing key with a new one can only be done through recirculation, which might lead to correctness violations. An alternative approach explored by recent work [22] is to manage cache insertions and evictions via the control plane. This shows higher memory efficiency than the data-plane implementation and can support variably-sized values, but it incurs a performance penalty, as control-plane operations are up to two orders of magnitude slower than data-plane operations [49].

Figure 2: Our vision for the next generation of programmable switches is a Stateful Data Plane (SDP) that computes the state while packets traverse the Conventional Data Plane (CDP). Individual stateful computations can be implemented through specialized blocks (DSIBs). The detail shows how a DSIB calculates the top-𝑘 flows using a pipelined hardware algorithm [45].

4 BRIDGING THE GAP

This section presents our vision for a stateful switch architecture. We first provide an overview of current research efforts and then describe our solution. We do not claim to have all the answers yet. Rather, we hope to start a discussion on new ways to design programmable switches.

PISA variations. There have been attempts to extend PISA switches by improving them along at least three dimensions: i) adding more sophisticated atoms (actions), e.g., in Domino [41]; ii) allowing more flexible memory access patterns, e.g., in dRMT [13]; and iii) mixing MAU stages and "configurable logic-based" stages within a pipeline, e.g., in FlowBlaze and Taurus [37, 43]. All these proposals support more expressive programs than a "pure" PISA switch because they relax some, but not all, of the limitations discussed in §3.1. While in principle we could combine all these approaches to address the shortcomings of PISA in supporting stateful applications, such an approach would result in longer and more complex pipelines, and would therefore increase the cut-through latency of packets that do not require any stateful operations and potentially result in resource wastage [54].

Instead of extending the PISA pipeline, we explore an alternative approach that uses pipeline specialization and works alongside a PISA pipeline. In contrast to other designs that complement the PISA pipeline with external memory resources on servers [24], our design combines the PISA pipeline with an additional, tightly integrated pipeline specialized for stateful computations.

Stateful Data Plane (SDP). We introduce a separate data plane that operates in conjunction with the conventional data plane (CDP), as depicted in Fig. 2. The goal of the SDP is to support a more powerful set of stateful operations as first-class citizens.

The two planes are synchronized at the shared parser and deparser units. The packets are parsed normally, as they would be in a PISA switch, but we propose extending the parser graph's semantics to allow the extraction of fields used by the SDP as well. Fields requiring traditional header processing are sent to the CDP through the PHV, whereas fields requiring stateful computations are streamed to the SDP through a separate Stateful Queue (SQ).

A series of units called Data Structure Instance Blocks (DSIBs) are at the heart of the SDP. These blocks house structures that maintain the state of applications and are specialized for performing the required operations. We expect DSIBs to leverage pipelining to sustain high throughput and, as we show shortly, they can avoid the limitations faced by PISA (§3.1) because they do not need to adhere to the PISA aspects that cause such limitations. Since some data structures may be common, DSIBs may be shared across applications. An Application Bus directs incoming stateful data to their associated DSIBs. The bus is equipped with sufficient buffering to match the rate at which inputs are provided by the parser unit and ensures that no packets are blocked. The latency of DSIBs might vary depending on the operations they support. Therefore, the Packet Reorder unit after the DSIBs ensures that the order in which fields leave the CDP and the SDP is consistent.

The main constraint imposed on the SDP and its DSIBs is that all operations must be carried out within a slack time. The slack time is the time budget during which a packet traverses the CDP, and is usually determined at compile time [7]. By imposing this constraint, we ensure that the latency of stateful computations matches that of the original pipeline.

Slack calculation. The key element of our solution is mapping a stateful computation onto the SDP and guaranteeing that its worst-case execution time (WCET) is within the slack budget. There are various methods to calculate the WCET given a program and a platform, including static program analysis [50]. We intend to program the SDP using feed-forward languages such as P4, extending them to support a richer set of data structures. We suspect that P4 naturally produces temporally predictable code due to the structure of the language [39]. Also, commercial P4 compilers are already capable of determining the slack for the PISA processing pipeline at compile time [2]. We expect our compiler extension to be able to do the same for the stateful computations. We note that the same extension could be applied to other languages that share these characteristics, such as NPL [10] or Xilinx PX [8].
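As a sketch of the compile-time check described above, a mapping tool could compare a DSIB's worst-case latency against the CDP slack. The overhead figures and the 800 ns slack used here are illustrative placeholders (the 800 ns value is the figure quoted below for commercial pipelines [1]); this is not a model of any specific compiler.

```python
def sdp_worst_case_ns(dsib_latency_ns, bus_ns=50.0, reorder_ns=50.0):
    """Worst-case SDP traversal time: DSIB latency plus bus/reorder overheads.
    The overhead figures are illustrative placeholders."""
    return dsib_latency_ns + bus_ns + reorder_ns

def fits_in_slack(dsib_latency_ns, slack_ns=800.0):
    """Compile-time admission check: the stateful computation must finish
    within the slack created while the packet traverses the CDP."""
    return sdp_worst_case_ns(dsib_latency_ns) <= slack_ns

# Example: a DSIB whose worst-case latency is 500 ns fits an 800 ns slack
# (500 + 50 + 50 = 600 ns), while a 900 ns DSIB would have to be rejected
# or redesigned.
assert fits_in_slack(500.0)
assert not fits_in_slack(900.0)
```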
Some hardware characteristics are also essential for timing predictability [46]. Consider PISA, where the MAU stages have static instruction scheduling, lack a memory hierarchy, and perform no speculative execution. These mechanisms foster predictability. We have the same design goal for the SDP.

An FPGA-based realization. To explore the design space, we suggest an initial realization of the SDP path through FPGAs. They support pipelined computation implementations and the tight hardware control necessary for timing predictability. Recent FPGAs may have dozens of 100-Gbps hard MAC units and as much as 0.5 GB of on-chip RAM [51]. Also, new techniques allow FPGA designs to be clocked from 400 to 850 MHz [44], only slightly slower than today's switching silicon at around 1 GHz. These factors may allow us to achieve an SDP traversal time in the hundreds of cycles per packet. Such latency is in line with current commercial programmable switches' CDPs, which take at least 800 ns [1].

The question arises as to whether general-purpose CPUs would be a better option, with their higher clock rates and easier programmability. The problem with CPUs in our scenario is that they use PCIe lanes to interact with network cards and do not have the bandwidth to support more than a few NICs. Moreover, caching and speculative mechanisms in modern CPUs prevent writing applications with deterministic completion times.

The proposed architecture's exact trade-offs in terms of chip area, power utilization, and additional costs are still open questions to be addressed. We will research how to minimize the SDP's overhead while matching the throughput and latency of the conventional pipeline.

An FPGA-based top-𝑘 DSIB. Referring back to the top-𝑘 example, the parser extracts the packet's flow ID and sends it to the top-𝑘 DSIB via the Stateful Queue and Application Bus, as shown in Fig. 2. This DSIB implements a variation of the space-saving algorithm [32]. It maintains a series of 𝑘 bins, each storing a flow ID (𝑥𝑖) and a counter, that keep track of the top-𝑘 flows [45].

Each bin in the pipeline performs the same operations on a different flow ID entry. We numbered these operations from (1) to (4) in Fig. 2 for convenience. A bin compares 𝑥𝑖's value with the flow ID it holds (1) and updates its counter if they match. Next, the counter value is compared with the next bin's and swapped if the first counter is lower than the second (2). These two operations are then repeated for this next bin (3 and 4), before sending the flow ID entry down the pipeline. Repeating the described sequence of operations as the flow ID traverses the pipeline mimics a bubble sort and sorts the flow IDs in descending order. As a result, when the flow ID gets to the last register without matching any bin value, the last bin's value is automatically updated with the new flow ID and its counter is incremented.
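A small behavioral model of this bin pipeline is sketched below. It is a software analogue, not the FPGA implementation from [45], and it captures the net effect of the compare/update/swap steps described above rather than their per-cycle behavior.

```python
class TopKBinPipeline:
    """Software analogue of the top-k DSIB: k bins holding (flow_id, count)."""
    def __init__(self, k):
        self.bins = [(None, 0) for _ in range(k)]    # kept sorted by count, descending

    def insert(self, flow_id):
        for i, (bin_flow, bin_count) in enumerate(self.bins):
            if bin_flow == flow_id:                  # (1) match: bump this bin's counter
                self.bins[i] = (bin_flow, bin_count + 1)
                self._swap_up(i)                     # (2)-(4) restore descending order
                return
        # The flow ID reached the end without matching any bin: the last bin
        # (holding the current minimum) adopts it and its counter is bumped,
        # mirroring the space-saving eviction step.
        _, last_count = self.bins[-1]
        self.bins[-1] = (flow_id, last_count + 1)
        self._swap_up(len(self.bins) - 1)

    def _swap_up(self, i):
        # Bubble-sort step: swap with the predecessor while our count is larger.
        while i > 0 and self.bins[i][1] > self.bins[i - 1][1]:
            self.bins[i], self.bins[i - 1] = self.bins[i - 1], self.bins[i]
            i -= 1

    def top(self):
        return [(f, c) for f, c in self.bins if f is not None]

# Example: flow "a" dominates the stream and ends up at the head of the bins.
pipeline = TopKBinPipeline(k=4)
for fid in ["a", "b", "a", "c", "a", "d", "e", "a"]:
    pipeline.insert(fid)
```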
ID entry down the pipeline. Repeating the described sequence of complex stateful computations. We propose an initial design for
operations as the flow ID traverses the pipeline mimics a bubble our Stateful Data Plane but we hope other researchers join us in
sort and sorts the flow IDs in descending order. As a result, when this effort by proposing and evaluating alternative designs.
the flow ID gets to the last register without matching any bin value,
the last bin’s value is automatically updated with the new flow ID
and its counter is incremented. ACKNOWLEDGMENTS
It would not have been possible to access more than one bin We thank our anonymous HotNets reviewers for their feedback.
in the original CDP due to L1 and L2 . Furthermore, even with This work was partly supported by the Microsoft Research PhD
more powerful atoms that allow two register accesses and a swap Scholarship Program, by the United Kingdom EPSRC (grant num-
operation within a single pipeline stage, the approach is still not bers EP/L016796/1, EP/I012036/1, EP/L00058X/1, EP/N031768/1, and
possible because register swaps cannot be performed across pipeline EP/K034448/1), by the European Research Council (ERC) under the
stages due to L3 . European Union Horizon 2020 Research and Innovation Program
The original design implementation reports a throughput of 110 (grant agreements 683253/Graphint and 671653 ), by NSF grants
million items per second while clocked at a maximum rate of ≈ 115 CNS-2008624 and CNS-1834263, as well as by SystemsThatLearn@CSAIL
MHz when implemented on a Virtex-6 FPGA [45]. Considering Ignite Grant.
REFERENCES

[1] Arista. 7170 Series Programmable Data Center Switches. https://www.arista.com/assets/data/pdf/Datasheets/7170-Datasheet.pdf.
[2] Barefoot. Tofino. https://www.barefootnetworks.com/products/brief-tofino-2/.
[3] R. Ben Basat, X. Chen, G. Einziger, and O. Rottenstreich. 2018. Efficient Measurement on Programmable Switches Using Probabilistic Recirculation. In IEEE 26th International Conference on Network Protocols (ICNP '18).
[4] R. Ben Basat, S. Ramanathan, Y. Li, G. Antichi, M. Yu, and M. Mitzenmacher. 2020. PINT: Probabilistic In-band Network Telemetry. In Proceedings of the 2020 ACM SIGCOMM Conference (SIGCOMM '20).
[5] G. Bloch. 2019. Accelerating Distributed Deep Learning with In-Network Computing Technology. In APNET. https://conferences.sigcomm.org/events/apnet2019/slides/Industrial_1_3.pdf
[6] B. H. Bloom. 1970. Space/Time Trade-Offs in Hash Coding with Allowable Errors. Commun. ACM 13, 7 (July 1970), 422–426.
[7] P. Bosshart et al. 2013. Forwarding Metamorphosis: Fast Programmable Match-Action Processing in Hardware for SDN. SIGCOMM Comput. Commun. Rev. 43, 4 (Aug. 2013), 99–110.
[8] G. Brebner and W. Jiang. 2014. High-Speed Packet Processing using Reconfigurable Computing. IEEE Micro 34, 1 (Jan. 2014), 8–18.
[9] Broadcom. BCM56870 Series. https://www.broadcom.com/products/ethernet-connectivity/switching/strataxgs/bcm56870-series.
[10] Broadcom. NPL - Network Programming Language. https://nplang.org/.
[11] A. Caulfield, P. Costa, and M. Ghobadi. 2018. Beyond SmartNICs: Towards a Fully Programmable Cloud: Invited Paper. In 2018 IEEE 19th International Conference on High Performance Switching and Routing (HPSR '18).
[12] X. Chen, H. Kim, J. M. Aman, W. Chang, M. Lee, and J. Rexford. 2020. Measuring TCP Round-Trip Time in the Data Plane. In Proceedings of the Workshop on Secure Programmable Network Infrastructure (SPIN '20).
[13] S. Chole et al. 2017. dRMT: Disaggregated Programmable Switching. In Proceedings of the 2017 ACM SIGCOMM Conference (SIGCOMM '17).
[14] E. Cidon, S. Choi, S. Katti, and N. McKeown. 2017. AppSwitch: Application-Layer Load Balancing within a Software Switch. In Proceedings of the First Asia-Pacific Workshop on Networking (APNet '17).
[15] G. Cormode and S. Muthukrishnan. 2005. An Improved Data Stream Summary: The Count-Min Sketch and Its Applications. Journal of Algorithms 55, 1 (2005), 58–75.
[16] H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman, H. Weatherspoon, M. Canini, F. Pedone, and R. Soulé. 2020. P4xos: Consensus as a Network Service. IEEE/ACM Transactions on Networking 28, 4 (2020).
[17] H. T. Dang, M. Canini, F. Pedone, and R. Soulé. 2016. Paxos Made Switch-y. SIGCOMM Comput. Commun. Rev. 46, 2 (May 2016), 18–24.
[18] G. Gibb, G. Varghese, M. Horowitz, and N. McKeown. 2013. Design Principles for Packet Parsers. In Architectures for Networking and Communications Systems (ANCS '13).
[19] R. Harrison, S. L. Feibish, A. Gupta, R. Teixeira, S. Muthukrishnan, and J. Rexford. 2020. Carpe Elephants: Seize the Global Heavy Hitters. In Proceedings of the Workshop on Secure Programmable Network Infrastructure (SPIN '20).
[20] T. Jepsen, M. Moshref, A. Carzaniga, N. Foster, and R. Soulé. 2018. Life in the Fast Lane: A Line-Rate Linear Road. In Proceedings of the Symposium on SDN Research (SOSR '18).
[21] X. Jin, X. Li, H. Zhang, N. Foster, J. Lee, R. Soulé, C. Kim, and I. Stoica. 2018. NetChain: Scale-Free Sub-RTT Coordination. In 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI '18).
[22] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, and I. Stoica. 2017. NetCache: Balancing Key-Value Stores with Fast In-Network Caching. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17).
[23] N. Jouppi et al. 2017. In-Datacenter Performance Analysis of a Tensor Processing Unit. SIGARCH Comput. Archit. News 45, 2 (June 2017), 1–12.
[24] D. Kim, Z. Liu, Y. Zhu, C. Kim, J. Lee, V. Sekar, and S. Seshan. 2020. TEA: Enabling State-Intensive Network Functions on Programmable Switches. In Proceedings of the 2020 ACM SIGCOMM Conference (SIGCOMM '20).
[25] D. Kim, Y. Zhu, C. Kim, J. Lee, and S. Seshan. 2018. Generic External Memory for Switch Data Planes. In Proceedings of the 17th ACM Workshop on Hot Topics in Networks (HotNets '18).
[26] A. Lerner, R. Hussein, and P. Cudré-Mauroux. 2019. The Case for Network Accelerated Query Processing. In Proceedings of the Innovative Data Systems Research Conference (CIDR '19).
[27] J. Li, E. Michael, and D. R. K. Ports. 2017. Eris: Coordination-Free Consistent Transactions Using In-Network Concurrency Control. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP '17).
[28] J. Li, E. Michael, N. K. Sharma, A. Szekeres, and D. R. K. Ports. 2016. Just Say No to Paxos Overhead: Replacing Consensus with Network Ordering. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16).
[29] X. Li, R. Sethi, M. Kaminsky, D. G. Andersen, and M. J. Freedman. 2016. Be Fast, Cheap and in Control with SwitchKV. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI '16).
[30] M. Liu, L. Luo, J. Nelson, L. Ceze, A. Krishnamurthy, and K. Atreya. 2017. IncBricks: Toward In-Network Computation with an In-Network Cache. In Proceedings of the 22nd International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '17).
[31] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman. 2016. One Sketch to Rule Them All: Rethinking Network Flow Monitoring with UnivMon. In Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM '16).
[32] A. Metwally, D. Agrawal, and A. El Abbadi. 2005. Efficient Computation of Frequent and Top-k Elements in Data Streams. In International Conference on Database Theory (ICDT '05).
[33] R. Miao, H. Zeng, C. Kim, J. Lee, and M. Yu. 2017. SilkRoad: Making Stateful Layer-4 Load Balancing Fast and Cheap Using Switching ASICs. In Proceedings of the 2017 ACM SIGCOMM Conference (SIGCOMM '17).
[34] NVIDIA. Tensor Cores. https://www.nvidia.com/en-us/data-center/tensor-cores/.
[35] P4.org Architecture Working Group. P4_16 Portable Switch Architecture (PSA). https://p4.org/p4-spec/docs/PSA.html.
[36] R. Pagh and F. F. Rodler. 2004. Cuckoo Hashing. Journal of Algorithms 51, 2 (2004), 122–144.
[37] S. Pontarelli et al. 2019. FlowBlaze: Stateful Packet Processing in Hardware. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI '19).
[38] D. R. K. Ports and J. Nelson. 2019. When Should The Network Be The Computer? In Proceedings of the Workshop on Hot Topics in Operating Systems (HotOS '19).
[39] P. Puschner and A. Burns. 2002. Writing Temporally Predictable Code. In Proceedings of the Seventh IEEE International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS '02).
[40] A. Sapio et al. 2019. Scaling Distributed Machine Learning with In-Network Aggregation. CoRR abs/1903.06701 (2019). arXiv:1903.06701 http://arxiv.org/abs/1903.06701
[41] A. Sivaraman et al. 2016. Packet Transactions: High-Level Programming for Line-Rate Switches. In Proceedings of the 2016 ACM SIGCOMM Conference (SIGCOMM '16).
[42] V. Sivaraman, S. Narayana, O. Rottenstreich, S. Muthukrishnan, and J. Rexford. 2017. Heavy-Hitter Detection Entirely in the Data Plane. In Proceedings of the Symposium on SDN Research (SOSR '17).
[43] T. Swamy, A. Rucker, M. Shahbaz, and K. Olukotun. 2020. Taurus: An Intelligent Data Plane. CoRR abs/2002.08987 (2020). arXiv:2002.08987 https://arxiv.org/abs/2002.08987
[44] T. Tan, E. Nurvitadhi, D. Shih, and D. Chiou. 2018. Evaluating the Highly-Pipelined Intel Stratix 10 FPGA Architecture Using Open-Source Benchmarks. In 2018 International Conference on Field-Programmable Technology (FPT '18).
[45] J. Teubner, R. Muller, and G. Alonso. 2011. Frequent Item Computation on a Chip. IEEE Trans. on Knowl. and Data Eng. 23, 8 (Aug. 2011), 1169–1181.
[46] L. Thiele and R. Wilhelm. 2004. Design for Timing Predictability. Real-Time Systems 28, 2–3 (Nov. 2004), 157–177.
[47] M. Tirmazi, R. Ben Basat, J. Gao, and M. Yu. 2020. Cheetah: Accelerating Database Queries with Switch Pruning. In Proceedings of the 2020 ACM SIGMOD Conference (SIGMOD '20).
[48] Y. Tokusashi, H. T. Dang, F. Pedone, R. Soulé, and N. Zilberman. 2019. The Case For In-Network Computing On Demand. In Proceedings of the Fourteenth EuroSys Conference 2019 (EuroSys '19).
[49] S. Wang, C. Sun, Z. Meng, M. Wang, J. Cao, M. Xu, J. Bi, Q. Huang, M. Moshref, T. Yang, H. Hu, and G. Zhang. 2020. Martini: Bridging the Gap between Network Measurement and Control Using Switching ASICs. In IEEE 28th International Conference on Network Protocols (ICNP '20).
[50] R. Wilhelm et al. 2008. The Worst-Case Execution-Time Problem—Overview of Methods and Survey of Tools. ACM Trans. Embed. Comput. Syst. 7, 3 (May 2008), 36–53.
[51] Xilinx. UltraRAM: Breakthrough Embedded Memory Integration on UltraScale+ Devices. https://www.xilinx.com/support/documentation/white_papers/wp477-ultraram.pdf
[52] Z. Xiong and N. Zilberman. 2019. Do Switches Dream of Machine Learning? Toward In-Network Classification. In Proceedings of the 18th ACM Workshop on Hot Topics in Networks (HotNets '19).
[53] M. Yu, L. Jose, and R. Miao. 2013. Software Defined Traffic Measurement with OpenSketch. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI '13).
[54] N. Zilberman, G. Bracha, and G. Schzukin. 2019. Stardust: Divide and Conquer in the Data Center Network. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI '19).
