An Exhaustive Survey On P4 Programmable Data Plane Switches: Taxonomy, Applications, Challenges, and Future Trends

Elie F. Kfoury∗, Jorge Crichigno∗, Elias Bou-Harb†
∗ College of Engineering and Computing, University of South Carolina, Columbia, USA
† The Cyber Center For Security and Analytics, University of Texas at San Antonio, USA

Abstract—Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols. This closed-design paradigm has limited the capability of the switches to proprietary implementations which are hard-coded by vendors, inducing a lengthy, costly, and inflexible process. Recently, data plane programmability has attracted significant attention from both the research community and the industry, permitting operators and programmers in general to run customized packet processing functions. This open-design paradigm is paving the way for an unprecedented wave of innovation and experimentation by reducing the time of designing, testing, and adopting new protocols; enabling a customized, top-down approach to develop network applications; providing granular visibility of packet events defined by the programmer; reducing complexity and enhancing resource utilization of the programmable switches; and drastically improving the performance of applications that are offloaded to the data plane. Despite the impressive advantages of programmable data plane switches and their importance in modern networks, the literature has been missing a comprehensive survey. To this end, this paper provides a background encompassing an overview of the evolution of networks from legacy to programmable, describing the essentials of programmable switches, and summarizing their advantages over Software-defined Networking (SDN) and legacy devices. The paper then presents a unique, comprehensive taxonomy of applications developed with P4 language; surveying, classifying, and analyzing more than 150 articles; discussing challenges and considerations; and presenting future perspectives and open research issues.

Index Terms—Programmable switches, P4 language, Software-defined Networking, data plane, custom packet processing, taxonomy.

I. INTRODUCTION

Since the emergence of the world wide web and the explosive growth of the Internet in the 1990s, the networking industry has been dominated by closed and proprietary hardware and software. Consider the observations made by McKeown [1] and the illustration in Fig. 1, which shows the cumulative number of Request For Comments (RFCs) [2]. While at first an increase in RFCs may appear encouraging, it has actually represented an entry barrier to the network market. The progressive reduction in the flexibility of protocol design caused by standardized requirements, which cannot be easily removed to enable protocol changes, has perpetuated the status quo. This protocol ossification [3, 4] has been characterized by a slow innovation pace at the hands of a few network vendors. As an example, after being initially conceived by Cisco and VMware [5], the Application Specific Integrated Circuit (ASIC) implementation of the Virtual Extensible LAN (VXLAN) [6], a simple frame encapsulation protocol, took several years, a process that could have been reduced to weeks by software implementations¹.

¹ The RFC and VXLAN observations are extracted from Dr. McKeown's presentation in [1].

Fig. 1. Cumulative number of RFCs.

Protocol ossification has been challenged first by Software-defined Networking (SDN) [7, 8] and then by the recent advent of programmable switches. SDN fostered major advances by explicitly separating the control and data planes, and by implementing the control plane intelligence as software outside of the switches. While SDN reduced network complexity and spurred control plane innovation at the speed of software development, it did not wrest control of the actual packet processing functions away from network vendors. Traditionally, the data plane has been designed with fixed functions to forward packets using a small set of protocols (e.g., IP, Ethernet). The design cycle of switch ASICs has been characterized by a lengthy, closed, and proprietary process that usually takes years. Such a process contrasts with the agility of the software industry.

Programmable forwarding can be viewed as a natural evolution of SDN, where the software that describes how packets are processed can be conceived, tested, and deployed in a much shorter time span by operators, engineers, researchers, and practitioners in general. The de facto standard for defining the forwarding behavior is the P4 language [9], which stands for Programming Protocol-independent Packet Processors. Essentially, P4 programmable switches have removed the entry barrier to network design, previously reserved to network vendors.

The momentum of programmable switches is reflected in the global ecosystem around P4. Operators such as ATT [10], Comcast [11], NTT [12], KPN [13], Turk Telekom [14], Deutsche Telekom [15], and China Unicom [14] are now using P4-based platforms and applications to optimize their networks. Companies with large data centers such as Facebook [16], Alibaba [17], and Google [18] operate on programmable platforms running customized software, in contrast to the fully proprietary implementations of just a few years ago [19]. Switch manufacturers such as Edgecore [20], Stordis [21], Cisco [22], Arista [23], Juniper [24], and Interface Masters [25] are now manufacturing P4 programmable switches with multiple deployment models, from fully programmable or white boxes to hybrid schemes. Chip manufacturers such as Barefoot Networks (Intel) [26], Xilinx [27], Pensando [28], Mellanox [29], and Innovium [30] have embraced programmable data planes without compromising performance.

The availability of tools and the agility of software development have opened an unprecedented possibility of experimentation and innovation by enabling network owners to build custom protocols and process them using protocol-independent primitives, reprogram the data plane in the field, and run P4 code on diverse platforms. The main agencies supporting engineering research and education worldwide are investing in programmable networks as well [31–34].

A. Contribution

Despite the increasing interest in P4 switches, previous work has only partially covered this technology. As shown in Table I, there is currently no updated and comprehensive material. Thus, this paper addresses this gap by providing an overview of the evolution of networks from legacy to programmable; describing the essentials of programmable switches and P4; and summarizing the advantages of programmable switches over SDN and legacy devices. The paper continues by presenting a taxonomy of applications developed with P4; surveying, classifying, analyzing, and comparing more than 150 articles; discussing challenges and considerations; and putting forward future perspectives and open research issues.

B. Paper Organization

The roadmap of this survey is illustrated in Fig. 2. Section II studies and compares existing surveys on various P4-related topics and demonstrates the added value of the offered work. Section III describes the traditional and SDN devices, and the evolution toward programmable data planes. Section IV introduces programmable switches and their features and explains the Protocol Independent Switch Architecture (PISA), a pipeline forwarding model. Section V describes the survey methodology and the proposed taxonomy. Subsequent sections (from Section VI to Section XII) explore the works pertaining to various categories proposed in the taxonomy, and compare the P4 approaches in each category, as well as with the legacy-enabled solutions. Section XIII outlines challenges and considerations extracted and induced from the literature, and pinpoints directions that can be explored in the future to ameliorate the state-of-the-art solutions. Finally, Section XIV concludes the survey. The abbreviations used in this article are summarized in Table XIV, at the end of the article.

Fig. 2. Paper roadmap.
• Section I, Introduction: protocol ossification; evolution of SDN; rise of P4 and programmable data planes; paper contributions.
• Section II, Related Surveys: comparison of aspects covered in previous surveys; analysis and limitations of existing surveys.
• Section III, Traditional Control Plane and SDN: comparison between traditional, SDN, and programmable devices; analogy with other domain specific processors.
• Section IV, Programmable Switches: PISA-based data plane; programmable switch features; P4 language.
• Section V, Methodology and Taxonomy: survey methodology; proposed taxonomy; year-based distribution of the surveyed work; implementation platform distribution.
• Sections VI-XII, Surveyed Work: background and literature review; intra-category comparison and discussions; comparison with legacy.
• Section XIII, Challenges and Future Trends: general challenges and future trends; memory availability; arithmetic computations; network-wide cooperation, etc.

II. RELATED SURVEYS

The advantages of programmable switches attracted considerable attention from the research community. They were described in previous surveys.

Stubbe et al. [35] discussed various P4 compilers and interpreters in a short survey. This work provided a background on the P4 language and demonstrated the main building blocks that describe packet processing in a programmable switch. It outlined reference hardware and software programmable switch implementations. The survey lacks discussions on existing application schemes, challenges, and potential future work.

Dargahi et al. [36] focused on stateful data planes and the security implications. There are two main objectives of this survey. First, it introduces the reader to recent trends and technologies pertaining to stateful data planes. Second, it discusses relevant security issues by analyzing selected use cases. The scope of the survey is not limited to P4 for programming the data plane. Instead, it describes other schemes such as OpenState [44], Flow-level State Transitions (FAST) [45], etc. When reviewing the security properties of stateful data planes, the authors described a mapping between potential attacks and corresponding vulnerabilities.

Cordeiro et al. [37] discussed the evolution of SDN from OpenFlow to data plane programmability. The survey briefly explained the layout of a P4 program and how it is mapped to
the abstract forwarding model. It then listed various compilers, tools, simulators, and frameworks for P4 development. The authors categorized the literature into two categories: 1) programmable security and dependability management; 2) enhanced accounting and performance management. In the first category, the authors listed works pertaining to policy modeling, analysis, and verification, as well as intrusion detection and prevention, and network survivability. In the second category, the authors focused on network monitoring, traffic engineering, and load balancing. The survey only lists a limited set of papers, without providing much detail or explaining how the papers differ from each other. Moreover, the survey was published in 2017 and therefore misses a significant share of the P4-related works published since then.

Satapathy et al. [38] presented a short description of the pitfalls of traditional networks and the evolution of SDN. The report briefly described elements of the P4 language. The authors then discussed the control plane and P4Runtime [46], and enumerated three use cases of P4 applications. The report concludes with potential future work.

The short survey presented by Bifulco et al. [39] reviews the trends and issues of abstractions and architectures that realize programmable networks. The authors discussed the motivation of packet processing devices in the networking field and described the anatomy of a programmable switch. The proposed taxonomy categorizes the literature as state-based, abstraction-based, implementation-based, and layer-based. The layer-based category consists of the control/intent layer and the data plane layer; the implementation-based category encompasses software and hardware switches; the abstraction-based category includes data flow graph and match-action pipelines; and the state-based category differentiates between stateful and stateless data planes.

Kaljic et al. [40] presented a survey on data plane flexibility and programmability in SDN networks. The authors evaluated data plane architectures through several definitions of flexibility and programmability. In general, flexibility in SDN refers to the ability of the network to adapt its resources (e.g., changes in the topology or the network requirements). Afterwards, the authors identified key factors that influence the deviation from the original data plane given with OpenFlow. The survey concludes with future research directions.

Kannan et al. [41] presented a short survey related to the evolution of programmable networks. This work described the pre-SDN model and the evolution to SDN and programmable data plane. The authors highlighted some features of programmable switches such as stateful processing, accurate timing information, and flexible packet cloning and recirculation. The survey categorized data plane applications into two categories, namely, network monitoring and in-network computing. While this survey listed a considerable number of papers belonging to these categories, it barely explained the operation and main ideas of each paper.

Tan et al. [42] presented a survey describing In-band Network Telemetry (INT). The survey explained the development stages and classifications of network measurement (traditional, SDN-based, and P4-based). It also outlined some existing applications that leverage INT such as congestion control, troubleshooting, etc. The survey concludes with discussions and potential future work related to INT.

Zhang et al. [43] presented a survey that focuses on stateful data planes. The survey starts with an overview of stateless and stateful data planes, then overviews and compares some stateful platforms (e.g., OpenState, FAST, FlowBlaze, etc.). The paper reviews a handful of stateful data plane applications and discusses challenges and future perspectives.

Table I summarizes the topics and the features described in the related surveys. It also highlights how this paper differs from the existing surveys. All previous surveys lack a microscopic comparison between the intra-category works. Also, none of them compare switch-based schemes against legacy server-based schemes. To the best of the authors' knowledge, this work is the first to exhaustively explore the whole programmable data plane ecosystem. Specifically, the paper describes P4 switches and provides a detailed taxonomy of applications using P4 switches. It categorizes and compares the applications within each category as well as with legacy approaches, and provides challenges and future perspectives.

TABLE I
COMPARISON WITH RELATED SURVEYS

Column groups: Programmable switches and P4 language (Evolution, Description, Features); Taxonomy (Background, Literature, Intra-category comparison, Comparison with legacy); Discussions (Challenges, Future directions).
● Covered in this survey   ○ Not covered in this survey   ◐ Partially covered in this survey

Paper      | Evolution | Description | Features | Background | Literature | Intra-category comparison | Comparison with legacy | Challenges | Future directions
[35]       | ◐ | ◐ | ◐ | ○ | ○ | ○ | ○ | ○ | ○
[36]       | ● | ◐ | ◐ | ◐ | ◐ | ○ | ○ | ○ | ◐
[37]       | ● | ◐ | ◐ | ◐ | ● | ○ | ○ | ◐ | ◐
[38]       | ◐ | ◐ | ◐ | ○ | ○ | ○ | ○ | ○ | ○
[39]       | ● | ○ | ○ | ◐ | ◐ | ○ | ○ | ◐ | ◐
[40]       | ● | ○ | ◐ | ○ | ◐ | ○ | ○ | ○ | ○
[41]       | ● | ○ | ○ | ● | ◐ | ○ | ○ | ○ | ○
[42]       | ◐ | ◐ | ○ | ◐ | ◐ | ◐ | ○ | ◐ | ◐
[43]       | ● | ◐ | ◐ | ◐ | ◐ | ○ | ○ | ◐ | ◐
This paper | ● | ● | ● | ● | ● | ● | ● | ● | ●

III. TRADITIONAL CONTROL PLANE AND SDN

A. Traditional and SDN Devices

With traditional devices, networks are connected using protocols such as Open Shortest Path First (OSPF) and Border Gateway Protocol (BGP) [47] running in the control plane
at each device. Both control and data planes are under full control of vendors. On the other hand, SDN delineates a clear separation between the control plane and the data plane, and consolidates the control plane so that a single centralized controller can control multiple remote data planes. The controller is implemented in software, under the control of the network owner. The controller computes the tables used by each switch and distributes them via a well-defined Application Programming Interface (API), such as OpenFlow [48]. While SDN allows for the customization of the control plane, it is limited to the OpenFlow specifications and the fixed-function data plane.

B. Comparison of Traditional, SDN, and Programmable Data Plane Devices

Table II contrasts the main characteristics of traditional, SDN, and P4 programmable devices. In the latter, the forwarding behavior is defined by the user's code. Other advantages include the program-dependent APIs, where the same P4 program running on different targets requires no modifications in the runtime applications (i.e., the control plane and the interface between control and data planes are target agnostic); the protocol-independent primitives used to process packets; the more powerful computation model where the match-action stages can not only be in series but also in parallel; and the infield reprogrammability at runtime. On the other hand, the technology maturity and support for P4 devices can still be considered low in contrast to traditional and SDN devices.

TABLE II
FEATURES, TRADITIONAL, SDN, AND P4 PROGRAMMABLE DEVICES

Feature | Traditional | SDN | P4 programmable
Control - data plane separation | No clear separation | Well-defined separation | Well-defined separation
Control and data plane interface | Proprietary | Standardized APIs (e.g., OpenFlow) | Standardized (e.g., OpenFlow, P4Runtime) and program-dependent APIs
Control and data plane program-dependent APIs | NA/Proprietary | NA/Proprietary | Target independent
Functionality separation at control plane | No modular separation of functions | Modular separation: (1) functions to build topology view (state) and (2) algorithms to operate on network state | Same as SDN networks
Customization of control plane | No | Yes | Yes
Visibility of events at data plane | Low | Low | High
Flexibility to define and parse new fields and protocols | Not flexible, fixed | Subject to OpenFlow extensions | Easy, programmable by user
Customization of data plane | No | No | Yes
ASIC packet processing complexity | High, hard-coded | High, hard-coded | Low, defined by user's source code
Data plane match-action stages | Proprietary | OpenFlow assumes in series match-action stages | In series and/or in parallel
Data plane actions | Protocol-dependent primitives | Protocol-dependent primitives | Protocol-independent primitives
Infield runtime reprogrammability | No | No | Yes
Customer support | High | Medium | Low
Technology maturity | High | Medium | Low

C. Network Evolution and Analogy with other Domain Specific Processors

The introduction of general-purpose computers in the early 1970s enabled programmers to develop applications running on CPUs. The use of high-level languages accelerated innovation by hiding the target hardware (e.g., x86). In signal processing, Digital Signal Processors (DSPs) were developed in the late 1970s and early 1980s with instruction sets optimized for digital signal processing. Matlab is used for developing DSP applications. In graphics, Graphics Processing Units (GPUs) were developed in the late 1990s and early 2000s with instruction sets for graphics. Open Computing Language (OpenCL) is one of the main languages for developing graphic applications. In machine learning, Tensor Processing Units (TPUs) and TensorFlow were developed in the mid-2010s with instruction sets optimized for machine learning.

Programmable forwarding is part of the larger information technology evolution observed above. Specifically, over the last few years, a group of researchers developed a machine model for networking, namely the Protocol Independent Switch Architecture (PISA) [49]. PISA was designed with instruction sets optimized for network operations. The high-level language for programming PISA devices is P4.

IV. PROGRAMMABLE SWITCHES

A. PISA Architecture

PISA is a packet processing model that includes the following elements: programmable parser, programmable match-action pipeline, and programmable deparser, see Fig. 3. The programmable parser permits the programmer to define the headers (according to custom or standard protocols) and to parse them. The parser can be represented as a state machine. The programmable match-action pipeline executes the operations over the packet headers and intermediate results. A single match-action stage has multiple memory blocks (tables, registers) and Arithmetic Logic Units (ALUs), which allow for simultaneous lookups and actions. Since some action results may be needed for further processing (e.g., data dependencies),
stages are arranged sequentially. The programmable deparser assembles the packet headers back and serializes them for transmission. A PISA device is protocol-independent.

In Fig. 3, the P4 program defines the format of the keys used for lookup operations. Keys can be formed using packet header information. The control plane populates table entries with keys and action data. Keys are used for matching packet information (e.g., destination IP address) and action data is used for operations (e.g., output port).

[Figure: control plane with applications App-1 to App-n, a software-based centralized controller, the P4 program, and the compiler; PD-API and P4Runtime as the interface to the data plane; a program-defined local table with columns Key (header fields, tuples, etc.), Action (Forward(), Mark(), Drop()), and Action data (e.g., Dst IP=IP1, Dst port=p2); switch ASIC with programmable parser, programmable match-action pipeline (Stage 1 to Stage n, each with state, memory, and ALU), and programmable deparser.]
Fig. 3. A PISA-based data plane and its interaction with the control plane.

B. Programmable Switch Features

The main features of programmable switches are [50]:
• Agility: the programmer can design, test, and adopt new protocols and features in significantly shorter times (i.e., weeks or months rather than years).
• Top-down design: for decades the networking industry operated in a bottom-up approach. Fixed-function ASICs are at the bottom and impose the available protocols and features on the programmer at the top. With programmable switches, the programmer describes protocols and features in the ASICs. Note that the physical layer and parts of the MAC layer may not be programmable.
• Visibility: programmable switches provide greater visibility into the behavior of the network. INT is an example of a framework to collect and retrieve information from the data plane, without intervention of the control plane.
• Reduced complexity: fixed-function switches incorporate a large superset of protocols. These protocols consume resources and add complexity to the processing logic, which is hard-coded in silicon. With programmable switches, the programmer has the option to implement only those protocols that are needed.
• Differentiation: the customized protocol or feature implemented by the programmer need not be shared with the chip manufacturer.
• Enhanced performance: programmable switches do not introduce a performance penalty. On the contrary, they may produce better performance than fixed-function switches. Table III shows a comparison between a programmable switch and a fixed-function switch, reproduced from [51]. Note the enhanced performance of the former (e.g., maximum forwarding rate, latency, power draw).

TABLE III
COMPARISON BETWEEN A P4 PROGRAMMABLE SWITCH AND A FIXED-FUNCTION SWITCH [51]

Characteristic | Programmable | Fixed-function
Throughput | 6.4 Tb/s | 6.4 Tb/s
Number of 100G ports | 64 | 64
Max forwarding rate | 4.8B pps | 4.2B pps
Max 25G/10G ports | 256/258 | 128/130
Programmable | Yes (P4) | No
Power draw | 4.2 W per port | 4.9 W per port
Large scale NAT | Yes (100k) | No
Large scale stateful ACL | Yes (100k) | No
Large scale tunnels | Yes (192k) | No
Packet buffers | Unified | Segmented
LAG/ECMP | Full entropy, programmable | Hash seed, reduced entropy
ECMP | 256-way | 128-way
Telemetry | Line-rate per flow stats | SFlow (sampled)
Latency | Under 400 ns | Under 450 ns

[Figure: (a) bar chart of the number of papers (axis 0 to 50) per year, 2016 to 2020; (b) pie chart with Software 48.5%, ASIC 38.6%, NetFPGA 7.9%, SmartNICs 5%.]
Fig. 4. (a) Distribution of surveyed data plane research works per year. (b) Implementation platform distribution. The shares are calculated based on the studied papers in this survey.

C. P4 Language

P4 has a reduced instruction set and has the following goals:
• Reconfigurability: the parser and the processing logic can be redefined in the field.
• Protocol independence: the switch is protocol-agnostic. The programmer defines the protocols, the parser, and the operations to process the headers.
• Target independence: the underlying ASIC is hidden from the programmer. The compiler takes the switch's capabilities into account when turning a target-independent P4 program into a target-dependent binary.
The original specification of the P4 language was released in 2014, and is referred to as P4_14. In 2016, a new version of the language was drafted, which is referred to as P4_16. P4_16 is a more mature language which extended the P4 language to broader underlying targets: ASICs, Field-Programmable Gate Arrays (FPGAs), Network Interface Cards (NICs), etc.

V. METHODOLOGY AND TAXONOMY

This section describes the systematic methodology that was adopted to generate the proposed taxonomy. The results of this literature survey represent findings derived by thoroughly exploring more than 150 data plane-related research works from 2016 up to late 2020. Their distribution per year is summarized in Fig. 4 (a).

Fig. 4 (b) depicts the share of each implementation platform used in the surveyed papers, grouped by software (e.g., BMv2, PISCES), ASIC (e.g., Tofino, Cavium), NetFPGA (e.g., NetFPGA SUME), and SmartNICs (e.g., Netronome NFP). The graph shows that the largest share of the works was implemented on software switches. Note that behavioral software switches (e.g., BMv2 [203]) are not suitable indicators of whether the program could run on a hardware target; they are typically used for prototyping ideas and to foster innovation. On the other hand, non-behavioral software switches (e.g., PISCES [204], derived from Open vSwitch (OVS) [205]) are production-grade and can be deployed in data centers.

Hardware targets constitute a smaller share of the platform distribution than software switches. A possible reason is that the technology is still recent and targets are still not widely available for sale to the public. For example, to acquire a switch equipped with a Tofino chip (e.g., Edgecore Wedge100BF-32 [20]), and to get the development environment and the customer support, a Non-Disclosure Agreement (NDA) with Barefoot Networks should be signed. Additionally, the client should attend a training course (e.g., [206]) to understand the architecture and the specifics of the platform. This process is considered lengthy and costly, and not every institution is capable of affording it.

The proposed taxonomy is shown in Fig. 5. The taxonomy was meticulously designed to cover the most significant works related to data plane programmability and P4. The aim is to categorize the surveyed works based on various high-level disciplines. The taxonomy provides a clear separation of categories so that a reader interested in a specific discipline can read only the works pertaining to that discipline. The correctness of the taxonomy was verified by carefully examining the related work of each paper to correlate them into high-level categories. Each high-level category is further divided into sub-categories. For instance, various measurement works belong to the sub-category “Measurements” under the high-level category “Network Performance”.

Further, the survey compares the results and the features offered by programmable data plane approaches (intra-category), as well as with those of the contemporary and legacy ones. This detailed comparison is elaborated upon for each sub-category, giving the interested reader a comprehensive view of the state-of-the-art findings of that sub-category. Additionally, the survey presents various challenges and considerations, as well as some current and future trends that could be explored as future work.

Programmable Switches Literature
• In-Band Network Telemetry (INT): Variations [52–57]; Collectors and Solutions [58–62].
• Network Performance: Congestion Control [63–68]; Measurements [69–90]; AQM [91–95]; QoS and TM [96–99]; Multicast [100–102].
• Middlebox Functions: Load Balancing [103–109]; Caching [110–117]; Telecom Services [118–124]; Pub/Sub [125–128].
• Accelerated Computations: Consensus [129–136]; Machine Learning [137–142]; Miscellaneous [143–151].
• Internet of Things (IoT): Aggregation [152–155]; Service Automation [156, 157].
• Security: Heavy Hitter [158–164]; Cryptography [165–168]; Anonymity [169–172]; Access Control [173–176]; Attacks and Defenses [177–188].
• Testing: Troubleshoot [189–193]; Verification [194–202].
Fig. 5. Taxonomy of programmable switches literature based upon relevant, explored research areas.

[Figure: a packet crossing an INT source, an INT transit hop, and an INT sink. The source adds telemetry instructions and its metadata after the original packet headers, the transit hop adds its metadata, and the sink extracts the metadata and sends it to the INT collector.]
Fig. 6. In-band Network Telemetry (INT).

[Figure: a sender, switches S1 to S4, a receiver, and an INT collector. The stack of labels after the INT header grows from [S1] to [S1], [S2], [S3], [S4] as the packet crosses the switches.]
Fig. 7. Example of how INT can be used to provide the path traversed by a packet in the network. The INT source inserts its label (S1) as well as the INT headers to instruct subsequent switches about the required operations (i.e., push their labels). Finally, switch S4 strips the INT headers from the packet and forwards them to a collector, while forwarding the original packet to the receiver.
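The path-tracing behavior described in the Fig. 7 caption can be sketched in a few lines of Python. This is only an illustrative model and not the INT specification's header format: packets are plain dictionaries, and the function names (`int_source`, `int_transit`, `int_sink`) and the `"switch_id"` instruction name are hypothetical. The sketch also assumes that the sink (S4) pushes its own label before stripping the INT header, which is how the label stacks in the figure read.

```python
def int_source(packet, switch_id):
    """INT source: add the INT header (instructions) and the first label."""
    tagged = dict(packet)  # shallow copy, the sender's packet is left untouched
    tagged["int"] = {"instructions": ["switch_id"], "metadata": [switch_id]}
    return tagged


def int_transit(packet, switch_id):
    """INT transit hop: read the instructions and push the local label."""
    int_header = packet.get("int")
    if int_header is not None and "switch_id" in int_header["instructions"]:
        int_header["metadata"].append(switch_id)
    return packet


def int_sink(packet, switch_id):
    """INT sink: push the local label, strip the INT header, and return
    the original packet (for the receiver) and a report (for the collector)."""
    packet = int_transit(packet, switch_id)
    report = packet.pop("int")
    return packet, report


# Usage: a packet crosses S1 (source), S2 and S3 (transit), and S4 (sink).
original = {"headers": "eth/ipv4/tcp", "data": b"hello"}
in_flight = int_source(original, "S1")
for hop in ("S2", "S3"):
    in_flight = int_transit(in_flight, hop)
delivered, report = int_sink(in_flight, "S4")
```

After the sink, `delivered` no longer carries the INT header, so the telemetry stays transparent to the receiver, while `report["metadata"]` holds the ordered list of switch labels that the collector can use to reconstruct the path.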
VI. IN-BAND NETWORK TELEMETRY (INT)

Conventional monitoring and collecting tools and protocols (e.g., ping, traceroute, Simple Network Management Protocol (SNMP), NetFlow, sFlow) are not accurate enough to troubleshoot the network, especially in the presence of congestion. These methods provide millisecond accuracy at best and cannot capture events that happen on a microsecond scale. Moreover, they cannot provide per-packet visibility across the network.

In-band Network Telemetry (INT) [207] is one of the earliest key applications of programmable data plane switches. It enables querying the internal state of the switch and provides fine-grained and precise telemetry measurements (e.g., queue occupancy, link utilization, queuing latency, etc.). INT handles events that occur on a microsecond scale, also known as microbursts. Collecting and reporting the network state is performed entirely by the data plane, without any intervention from the control plane. Due to the increased visibility achieved with INT, network operators are able to troubleshoot problems more efficiently. Additionally, it is possible to perform instant processing in the data plane after measuring telemetry data (e.g., reroute flows when a link is congested), without having to interact with the control plane. Fig. 6 shows an INT-enabled network. INT enables network administrators to determine the following:
• The path a packet took when traversing the network (see Fig. 7). Such information is difficult to learn using existing technologies when multi-path routing strategies (e.g., Equal-cost Multi-Path Routing (ECMP) [208], flowlet switching [209]) are used.
• The matched rules that forwarded the packets (e.g., ACL entry, routing lookup).

was the main concern to monitor, the programmer inserts queue metadata and transit latency. An INT-enabled network has the following entities: 1) INT source: a trusted entity that specifies, with the initial instruction set, what metadata should be added into the packet by other INT-capable devices; 2) INT transit hop: a device adding its own metadata to an INT packet after examining the INT instructions inserted by the INT source; 3) INT sink: a trusted entity that extracts the INT headers in order to keep the INT operation transparent for upper-layer applications; and 4) INT collector: a device that receives and processes INT packets.

The location of an INT header in the packet is intentionally not enforced in the specifications document. For example, it can be inserted as a payload on top of TCP, UDP, and NSH, as a Geneve option on top of Geneve, and as a VXLAN payload on top of VXLAN.

A. Postcard-based Telemetry (PBT)

INT provides the exact forwarding path, the timestamp and latency at each network node, and other information. Such detailed information is derived by augmenting user packets with data collected by each switch. Postcard-based Telemetry (PBT) is an alternative to INT which does not modify user packets. Fig. 8 shows an example of PBT. As a user packet traverses the network, each switch generates a postcard and sends it to the monitor. The event that triggers the generation of the postcard is defined by the programmer, according to the application's need. Examples include start and/or end of a

• The time a packet spent in the queue of each switch.


• The flows that shared the queue with a certain packet.
Host 1 Host 2
The P4 Applications Working Group developed the INT Event Event
detected detected
telemetry specifications [210] with contributions from key
enablers of the P4 language such as Barefoot Networks,
VMware, Alibaba, and others.
INT allows instrumenting the metadata to be monitored Original Packet Original headers with switch telemetry info INT Collector
without modifying the application layer. The metadata to be
inserted depends on the use case; for example, if congestion Fig. 8. Postcard-based telemetry (PBT).
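The INT source, transit hop, and sink roles described in Section VI, together with the path-tracing example of Fig. 7, can be sketched in a few lines of Python. This is only an illustration of the idea, not P4 code and not the header layout of the INT specification; the `Packet` class, the function names, and the `queue_depth` field are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Packet:
    payload: bytes
    # Instruction set written by the INT source (None means "not an INT packet").
    instructions: Optional[List[str]] = None
    # One metadata record per INT-capable hop.
    int_stack: List[Dict[str, object]] = field(default_factory=list)


def collect(switch_id: str, state: Dict[str, object],
            instructions: List[str]) -> Dict[str, object]:
    """Return only the metadata fields requested by the instruction set."""
    record: Dict[str, object] = {"switch_id": switch_id}
    record.update({name: state[name] for name in instructions})
    return record


def int_source(pkt: Packet, switch_id: str, state: Dict[str, object],
               instructions: List[str]) -> Packet:
    """INT source: write the instructions and push the first record."""
    pkt.instructions = list(instructions)
    pkt.int_stack.append(collect(switch_id, state, instructions))
    return pkt


def int_transit(pkt: Packet, switch_id: str, state: Dict[str, object]) -> Packet:
    """INT transit hop: push metadata as instructed by the source."""
    if pkt.instructions is not None:
        pkt.int_stack.append(collect(switch_id, state, pkt.instructions))
    return pkt


def int_sink(pkt: Packet, switch_id: str,
             state: Dict[str, object]) -> Tuple[List[Dict[str, object]], Packet]:
    """INT sink: add its own record, then strip INT data from the user packet."""
    int_transit(pkt, switch_id, state)
    report = pkt.int_stack                 # goes to the INT collector
    clean = Packet(payload=pkt.payload)    # goes to the receiver
    return report, clean


pkt = Packet(payload=b"user data")
int_source(pkt, "S1", {"queue_depth": 3}, ["queue_depth"])
int_transit(pkt, "S2", {"queue_depth": 17})
int_transit(pkt, "S3", {"queue_depth": 0})
report, delivered = int_sink(pkt, "S4", {"queue_depth": 5})
path = [hop["switch_id"] for hop in report]
```

Here `path` holds the switch labels in traversal order, while `delivered` carries no INT data, which is what keeps the operation transparent to upper-layer applications.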

TABLE IV
INT VARIATIONS COMPARISON

Variation | Name | Overhead reduction strategy | Metadata collection | Operator intervention | Implementation
[52] | NetVision | On-demand probing | Active (segment routing) | High; telemetry through queries | Mininet
[53] | N/A | Flow subset selection by the knowledge plane | Passive | Low; closed-loop network | Software (BMv2) w/ ONOS controller
[54] | sINT | Monitoring ratio adjustment based on network changes | Passive | Low; telemetry based on network behavior | Software (BMv2)
[55] | INTO | Telemetry orchestration based on heuristics | Passive | High; telemetry specified by operators | N/A
[56] | ML-INT | Per-flow packet subset selection through sampling | Passive | High; telemetry specified by operators | ASIC (Tofino) and SmartNIC (NFP-4000)
[57] | PINT | Telemetry encoding on multiple packets | Passive | High; telemetry through queries | ASIC (Tofino)
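The event-triggered reporting discussed in Section VI-B.1 (send a report only when, for example, queue utilization exceeds a threshold) can be illustrated with the short Python sketch below. It is not P4 and is not taken from any of the surveyed systems; the function name `should_report`, the 80% threshold, and the sample values are assumptions chosen for illustration.

```python
def should_report(queue_depth: int, capacity: int, threshold: float = 0.8) -> bool:
    """Report only when queue utilization exceeds the (hypothetical) threshold."""
    return queue_depth / capacity > threshold


# Queue depth observed for five consecutive packets (queue capacity: 100 cells).
samples = [10, 20, 90, 95, 30]
reports = [(i, depth) for i, depth in enumerate(samples)
           if should_report(depth, capacity=100)]

# Fraction of packets that generate a telemetry report.
report_fraction = len(reports) / len(samples)
```

With these sample values only two of the five packets trigger a report. The threshold still has to be chosen by the operator, which is the manual tuning burden that the INT variations aim to remove.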

flow, sampling (e.g., one report per second), packet dropped by the switch, queue congestion, etc.

B. INT Variations
B.1. Background
Despite the improvements that INT brings compared to legacy monitoring schemes, it introduces bandwidth overhead when enabled unconditionally by network operators. In such scenarios, INT headers are added to every packet traversing the switch, and the added bytes reduce the overall network throughput. To mitigate this limitation, conditional statements are included in the P4 program to send reports only when certain events occur (e.g., queue utilization exceeds a threshold). This solution requires network operators to adjust thresholds and parameters manually based on the usual network traffic patterns. Consequently, several variations of INT have been developed, aiming at customizing its functionalities and addressing its limitations. Mainly, recent works focus on minimizing the bandwidth overhead of INT by adjusting thresholds and parameters automatically, based on measured traffic patterns and the desired application type.

B.2. Literature Review
Liu et al. [52] proposed NetVision, a telemetry system that aims at minimizing the traffic overhead of INT by using probing. NetVision actively sends the right amount and format of probe packets depending on the telemetry application (e.g., traffic engineering, network visualization). Hyun et al. [53] proposed an architecture for self-driving networks that uses INT to collect packet-level network telemetry, and Knowledge-Defined Networking (KDN) to add intelligence to network management, considering the collected telemetry data. KDN accepts the network information as input and generates policies to improve the network performance. Kim et al. [54] proposed selective INT (sINT), a scheme that dynamically adjusts the insertion frequency of INT headers. A monitoring engine observes changes in consecutive INT metadata and applies a heuristic algorithm to compute the insertion ratio. Marques et al. [55] described the orchestration problem in INT, which is associated with the optimal use of network resources for collecting the state and behavior of forwarding devices through INT. Niu et al. [56] proposed multilayer INT (ML-INT), a system that visualizes IP-over-optical networks in real time. The proposed system encodes INT headers in a subset of packets pertaining to an IP flow. The encoded headers contain metadata that describes statistics of electrical and optical network elements on the flow’s routing path. Ben Basat et al. [57] proposed Probabilistic INT (PINT), an approach that probabilistically adds telemetry information into a collection of packets to minimize the per-packet overhead associated with regular INT.

B.3. INT Variations, Comparison, and Discussions
Table IV compares the aforementioned INT variations. The main motivation behind these solutions is that the majority of applications that leverage INT (e.g., congestion control, fast reroute) only require approximations of the telemetry data and therefore do not need to gather per-packet per-hop INT information. NetVision uses probing to reduce the overhead of INT. The main limitation of this approach is that probing might result in poor accuracy and timeliness, as the probes might experience different network conditions than actual packets. All other works collect INT information passively. [53] and sINT select flows based on current network conditions, ML-INT uses a fixed sampling scheme to select a small portion of packets in a flow, and PINT uses a probabilistic approach to encode telemetry on multiple packets. Sampling and anomaly-based monitoring might lead to information loss since not all packets are being reported. Some solutions require manual intervention from the operators to configure the telemetry process. The simplicity of the configuration interface is vital to make the solution attractive to network operators. Finally, some solutions were implemented on software switches, while others were implemented on hardware. It is important to note that not all software implementations can fit into the pipeline of the hardware.

B.4. INT, PBT, and Traditional Telemetry Comparison
Table V compares INT, PBT, and traditional telemetry. INT has higher potential vulnerabilities than PBT, such as eavesdropping and tampering. Adding extra protective measures (e.g., encryption) is difficult on the fast data path. On the other hand, PBT packets tolerate additional processing to enhance security. The flow tracking process is simpler with INT than with PBT. The latter requires the server receiving INT reports (i.e., INT collector, explained in Section VI-C)

TABLE V
IN-BAND, POSTCARD-BASED, AND TRADITIONAL NETWORK TELEMETRY

Feature | INT | PBT | Traditional
User packet modification | Yes | No | No
User packet overhead | Yes | No | No
Potential vulnerabilities | Higher | Lower | Lower
Flow tracking process | Simpler | More complex | More complex
Delay in reporting, tracking | Lowest | Low | High
Microbursts detection | Yes | Yes | No
Accuracy | Higher | Higher | Lower; especially with congested links
Reporting type | Push-based, initiated by the data plane | Push-based | Polling (e.g., SNMP), initiated by the control plane; sampling (e.g., NetFlow), initiated by the data plane
Troubleshoot problems | Easier and cheaper | Easier and cheaper | Harder and more expensive
Granularity | Higher; microseconds scale | Higher | Lower; milliseconds scale at best
Event-based monitoring | Customizable based on conditions and thresholds | Customizable | Not possible
Reactive processing | Faster; reactive processing is executed in the data plane | Faster | Slower; reactive processing is executed in the control plane
Bandwidth overhead | High when all packets are reported, low when reported based on events | Higher than INT | Lowest
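The extra flow-tracking work that PBT places on the collector (correlating the postcards of one packet into a per-packet history, as discussed in Section VI-B.4) can be sketched as follows. This is a minimal Python illustration under assumed field names (`packet_id`, `switch_id`, `timestamp_us`); real postcards would carry whatever identifiers the programmer defines.

```python
from collections import defaultdict
from typing import Dict, List


def correlate(postcards: List[dict]) -> Dict[int, List[str]]:
    """Group postcards by packet and order each group by timestamp."""
    by_packet: Dict[int, List[dict]] = defaultdict(list)
    for card in postcards:
        by_packet[card["packet_id"]].append(card)
    return {
        packet_id: [c["switch_id"]
                    for c in sorted(cards, key=lambda c: c["timestamp_us"])]
        for packet_id, cards in by_packet.items()
    }


# Postcards may reach the monitor out of order and interleaved across packets.
postcards = [
    {"packet_id": 7, "switch_id": "S2", "timestamp_us": 120},
    {"packet_id": 7, "switch_id": "S1", "timestamp_us": 100},
    {"packet_id": 9, "switch_id": "S1", "timestamp_us": 105},
    {"packet_id": 7, "switch_id": "S3", "timestamp_us": 150},
]
histories = correlate(postcards)
```

With INT, the same history arrives already assembled in a single report, which is why the table lists its flow tracking process as simpler.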

to correlate multiple postcards of a single flow packet passing through the network, to form the packet history at the monitor. This process also adds delay in reporting and tracking. Legacy schemes that rely on sampling and polling suffer from accuracy issues, especially when links are congested. INT on the other hand is push-based, has better accuracy, and is more granular (microseconds scale). Reports sent by an INT-capable device contain rich information (e.g., the path a packet took) that can aid in troubleshooting the network. Such visibility is minimal in legacy monitoring schemes. Programmable switches permit reporting telemetry after the occurrence of specific events (e.g., congestion). Moreover, they provide flexibility in programming reactive logic that executes promptly in the data plane. One drawback of INT is that it imposes bandwidth overhead if configured to report for every packet; however, when event-based reports are considered, the bandwidth overhead significantly decreases.

C. INT Collectors
C.1. Background
An INT collector is a component in the network that processes telemetry reports produced by INT devices. It parses and filters metrics from the collected reports, then optionally stores the results persistently into a database. Since a large number of reports is typically produced in INT, having a high-performance collector is essential to avoid missing important network events. To this end, a number of research works focus on developing and enhancing the performance of INT collectors running on commodity servers.

C.2. Literature Review
IntMon [58] is an ONOS-based collector application for INT reports. It includes a web-based interface that allows controlling which flows to monitor and the specific metadata to collect. Another INT collector is the Prometheus INT exporter [59], which extracts information from every INT packet and pushes it to a gateway. A database server then periodically pulls information from the gateway. INTCollector [60] is a collector that extracts events, which are important network information, from INT raw data. It uses in-kernel processing to further improve the performance. INTCollector has two processing flows: the fast path, which processes INT reports and needs to execute quickly, and the normal path, which processes events sent from the fast path and stores information in the database. Deep Insight [61] is a proprietary solution provided by Barefoot Networks that leverages INT capabilities to provide services such as real-time anomaly detection, congestion analysis, packet-drop analysis, etc. Another proprietary solution is BroadView Analytics used on Broadcom Trident 3 devices by Broadcom [62].

C.3. INT Collectors Comparison, Discussions, and Limitations
Fig. 9 and Table VI compare the aforementioned INT collectors. IntMon and Prometheus INT exporter were among the earliest collectors. Both have low processing rates since they are implemented without kernel or hardware accelera-

Fig. 9. CPU efficiency with the three INT collectors. Source: INTCollector paper [60].

TABLE VI
INT COLLECTORS COMPARISON

Collector | Name | Rate | Event detection | Processing acceleration | Open source | Historical data availability | Analytics | Implementation notes
[58] | IntMon | 0.1Kpps | × | × | ✓ | × | Low | ONOS-BMv2 subsystem (ONOS 1.6)
[59] | Prometheus INT exporter | 6.4Kpps | × | × | ✓ | × | Low | ONOS P4 Brigade project
[60] | INTCollector | 154.8Kpps | ✓ | Yes; fast path with XDP | ✓ | ✓ | Medium | C language, XDP for in-kernel processing
[61] | DeepInsight | N/A | ✓ | N/A | × | ✓ | High | SPRINT data plane telemetry (INT.p4)
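The idea of a collector that publishes only significant changes, rather than every INT report, can be illustrated with the sketch below. It is a simplified Python model of change-based event detection and is not INTCollector's actual implementation; the function name `detect_events`, the absolute-difference rule, and the threshold are assumptions.

```python
from typing import List, Tuple


def detect_events(values: List[int], threshold: int) -> List[Tuple[int, int]]:
    """Keep a sample only if it differs from the last kept sample by more
    than `threshold` (the first sample is always kept)."""
    events: List[Tuple[int, int]] = []
    last_kept = None
    for index, value in enumerate(values):
        if last_kept is None or abs(value - last_kept) > threshold:
            events.append((index, value))
            last_kept = value
    return events


# Hop latency (in microseconds) carried by six consecutive INT reports.
latencies = [100, 102, 101, 150, 152, 90]
events = detect_events(latencies, threshold=10)
```

In this example three of the six reports are stored. A filter of this kind reduces the write load on the database while still recording each latency shift.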

tion. Also, they are very limited with respect to the features they provide (e.g., lack of event detection, limited analytics, historical data unavailability, etc.). Prometheus INT exporter also suffers from increased overhead of sending the data for every INT packet to the gateway, and the potential loss of network events as the database only stores the latest data pulled from the gateway. INTCollector on the other hand has a higher rate and uses the eXpress Data Path (XDP) [211] to accelerate the packet processing in the kernel space. It filters the data to be published based on significant changes in the network through its event detection mechanism. DeepInsight Analytics has a modular architecture and runs on commodity servers. It executes the Barefoot SPRINT data plane telemetry which consists of a P4 program (INT.p4) encompassing intelligent triggers. It also provides open northbound RESTful APIs that allow customers to integrate their third-party network management solutions. DeepInsight Analytics is advanced with respect to the features it provides (real-time anomaly detection, congestion analysis, packet-drop analysis, etc.). However, it is a closed-source solution and lacks reports of performance benchmarks.
Fig. 9 demonstrates the CPU efficiency of three INT collectors (IntMon, Prometheus INT exporter, and INTCollector) [60]. IntMon has the lowest throughput, and is 57 times slower than Prometheus INT exporter. INTCollector on the other hand has the highest throughput and is 27 times faster than Prometheus INT exporter.

C.4. Collectors in INT and Legacy Monitoring Schemes Comparison
Generally, collectors used with both INT and legacy monitoring schemes run on general purpose CPUs, and hence, have comparable performance. INT produces far more reports than legacy monitoring schemes (e.g., NetFlow), and therefore requires a collector with high processing capability. INT-based collectors are typically accelerated with in-kernel fast packet processing technologies (e.g., XDP) and kernel-bypass frameworks (e.g., Data Plane Development Kit (DPDK)).

D. Summary and Lessons Learned
Legacy telemetry tools and protocols are not capable of capturing microbursts or providing fine-grained telemetry measurements. INT was developed to address these challenges; it enables the data plane developer to query with high precision the internal state of switches. Telemetry data are then embedded into packets and forwarded to a high-performance collector. The collector typically performs analysis and applies actions accordingly (e.g., informs the control plane to update table entries). Current research efforts mainly focus on developing variations of INT to decrease its telemetry traffic overhead, considering the overhead-accuracy trade-off. Other works aim at accelerating INT collectors to handle large volumes of traffic (in the scale of Kpps). Future work could possibly investigate further improvements for INT such as compressing packets’ headers, broadening coverage and visibility, enriching the telemetry information, and simplifying the deployment.

VII. NETWORK PERFORMANCE
Measuring and improving network performance is critical in today’s infrastructures. Low latency and high bandwidth are key requirements to operate modern applications that continuously generate enormous amounts of data [212]. Congestion control (CC), which aims at avoiding network overload, is critical to meet these requirements. Another important concept for expediting these applications is managing the queues that form in routers and switches through Active Queue Management (AQM) algorithms. This section explores the literature related to measuring and improving the performance of programmable networks.

A. Congestion Control (CC)
A.1. Background
One of the most challenging tasks in the Internet today is congestion control and collapse avoidance [213]. The difficulty in controlling the congestion is increasing due to factors such as high-speed links, traffic diversity and burstiness, and buffer sizes [63]. Today’s CC algorithms aim at shortening delays, maximizing throughput, and improving the fairness and utilization of network resources.
A tremendous amount of research work has been done on congestion control, including end-host algorithms such as loss-based CC algorithms (e.g., CUBIC [214], Hamilton TCP (HTCP) [215], etc.), model-based algorithms (e.g., Bottleneck Bandwidth and Round-trip Time (BBR) [216]), congestion-signalling mechanisms (e.g., Explicit Congestion Notification (ECN) [217]), data-center specific schemes (e.g., TIMELY [218], Data Center Quantized Congestion Notification (DCQCN) [219], Data Center TCP (DCTCP) [220], pFabric [221],

Fig. 10. HPCC: INT-based high precision congestion control.

Performance-oriented Congestion Control (PCC) [222], etc.), and application-specific schemes (e.g., QUIC [223]).
With the advent of programmable data plane switches, researchers are investigating new methods to provide network-assisted congestion feedback for end-hosts.

A.2. Literature Review
Handley et al. [63] proposed NDP, a novel protocol architecture for datacenters that aims at achieving low completion latency for short flows and high throughput for longer flows. NDP avoids core network congestion by applying per-packet multipath load balancing, which comes at the cost of reordering. It also trims the payloads of packets, similar to what is done in Cut Payload (CP) [224], whenever the queues of the switches become saturated. Once the payload is trimmed, the headers are forwarded using high-priority queues. Consequently, a Negative ACK (NACK) is generated and sent through high-priority queues so that a retransmission is sent before draining the low-priority queue. Similarly, Feldmann et al. [66] proposed a method that uses network-assisted congestion feedback (NCF) in the form of NACKs generated entirely in the data plane. NACKs are sent to throttle elephant-flow senders in case of congestion. The method maintains three separate queues for mice flows, elephant flows, and control packets to ensure fair sharing of resources.
Li et al. [65] proposed High Precision Congestion Control (HPCC), a new CC mechanism that leverages INT-based data added by P4 switches to obtain precise link load information. HPCC computes an accurate flow rate by using only one rate update, as opposed to legacy approaches that require a large number of iterations to determine the rate. HPCC provides near-zero queueing, while being almost parameterless. Fig. 10 shows the mechanism of HPCC. The switches add INT headers to every packet, and then the INT information is piggybacked into the TCP/RDMA Acknowledgement (ACK) packet. The end-hosts then use this information to adjust the sending rate through their smart Network Interface Controllers (NICs).
Kfoury et al. [67] proposed a P4-based method to automate end-hosts’ TCP pacing. It supplies the bottleneck bandwidth and the number of elephant flows to senders so that they can pace their rates to safe targets, avoiding filling routers’ buffers.
Turkovic et al. [64] proposed a P4-based method that reroutes flows to backup paths during congestion. The system detects congestion by continuously monitoring the queueing delays of latency-critical flows. The same authors [68] proposed a method that separates the senders based on their congestion control algorithm. Each congestion control algorithm uses a separate queue in order to enforce fairness among its competing flows.

A.3. CC Schemes Comparison, Discussions, and Limitations
Table VII compares the aforementioned CC schemes. NDP and NCF are similar in the sense that both use NACKs as congestion feedback. NDP avoids congestion by applying per-packet multipath load balancing. This approach works adequately with symmetric topologies, but fails when topologies are asymmetric (e.g., BCube, Jellyfish), especially during heavy network load. Another limitation of NDP is the excessive retransmissions produced by the server. NCF adopted the idea of packet trimming from NDP, but generates the NACK from the trimmed packet and sends it directly to the sender. Such an approach removes the receiver from the feedback loop, improving the sender’s reaction time. One limitation of NCF is that it requires operators to manually tune some of the predefined parameters (e.g., threshold, queue size, etc.). Additionally, NCF might disclose network congestion information, making it less attractive to operators. Finally, the authors of NCF claim that the approach works in both datacenter and Internet-wide scenarios. However, no implementation results were presented to evaluate the effectiveness of the solution.
HPCC leverages INT data to control network congestion. It enhances the convergence time by using a Multiplicative-Increase Multiplicative-Decrease (MIMD) scheme. Note that previous TCP variants use Additive-Increase Multiplicative-Decrease (AIMD), which is conservative when increasing the rate, and hence has a slow convergence time. The reason AIMD schemes are slow is that they use a single-

TABLE VII
CONGESTION CONTROL SCHEMES COMPARISON

Scheme | Name | Strategy | Congestion feedback | Feedback information | Rerouting | Traffic separation | End-device modification | Implementation
[63] | NDP | Trim packets to headers and priority forward | ✓ | NACKs | ✓ | ✓ | ✓ | NetFPGA SUME
[64] | N/A | Monitor queue latency to reroute traffic on congestion | × | N/A | ✓ | × | × | BMv2
[65] | HPCC | Use INT data to compute sending rate | ✓ | INT | × | × | ✓ | Tofino
[66] | NCF | Throttle elephant flows with NACKs | ✓ | NACKs | × | ✓ | × | N/A
[67] | N/A | Pace TCP traffic of elephant flows to safe targets | ✓ | Flow count and BW | × | × | ✓ | BMv2
[68] | P4Air | Separate flows according to their congestion control group | × | N/A | × | ✓ | × | Tofino
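The packet-trimming strategy listed for NDP (trim the payload when the queue is saturated and forward the header through a high-priority queue) can be sketched as follows. This is a minimal Python model of the idea only; it is not NDP's switch implementation, and the class name `TrimmingSwitch` and the capacity of two packets are hypothetical.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Entry = Tuple[str, bytes]  # (header, payload)


class TrimmingSwitch:
    def __init__(self, data_capacity: int) -> None:
        self.data_capacity = data_capacity
        self.low_priority: Deque[Entry] = deque()    # full data packets
        self.high_priority: Deque[Entry] = deque()   # trimmed headers

    def enqueue(self, header: str, payload: bytes) -> str:
        """Queue the full packet if there is room; otherwise keep only the header."""
        if len(self.low_priority) < self.data_capacity:
            self.low_priority.append((header, payload))
            return "queued"
        self.high_priority.append((header, b""))
        return "trimmed"

    def dequeue(self) -> Optional[Entry]:
        """Serve trimmed headers first so that loss feedback is not delayed."""
        if self.high_priority:
            return self.high_priority.popleft()
        if self.low_priority:
            return self.low_priority.popleft()
        return None


switch = TrimmingSwitch(data_capacity=2)
outcomes = [switch.enqueue(f"pkt{i}", b"x" * 1000) for i in range(3)]
first_out = switch.dequeue()
```

The third packet finds the data queue full, so only its header survives, and that header leaves the switch before the two queued data packets. The receiver (in NDP) or the switch itself (in NCF) can then issue a NACK without waiting for the data queue to drain.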

TABLE VIII
CONGESTION CONTROL SCHEMES. 1) PROGRAMMABLE SWITCHES (HPCC); 2) END-HOSTS; AND 3) LEGACY NETWORK-ASSISTED (ECN)

Characteristic | Programmable switch | End-hosts | Legacy network-assisted (ECN)
Accuracy | Higher, INT-based, microbursts are detected and reported | Low, packet loss (e.g., CUBIC); Medium, estimated RTT and btlbw (e.g., BBR) | Lower with classic ECN; High with L4S
Required modifications | Switches, end-hosts | None; distributed nature of AIMD does not require storing state of flows | Minimal if ECN is used (most equipment have classic ECN implemented); High if L4S is used
Convergence | Faster (MIMD) | Slower (AIMD) | Adequate with ECN; Fast with L4S ECN
Queue utilization | Near-zero | High; possibility of Bufferbloat (e.g., CUBIC) | Low
Parameterization | Few | None | Few (e.g., thresholds)
Congestion information | Several fields (e.g., queue occupancy, link utilization, flow share, etc.) | Packets drop | 1-bit ECN mark
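The convergence row of Table VIII (MIMD faster, AIMD slower) follows from simple arithmetic on the increase phase, which the Python sketch below illustrates. It counts only how many update steps a sender needs to reach a target rate under additive versus multiplicative increase; it is not HPCC's rate-update rule, and the start rate, target, increment, and factor are arbitrary assumptions.

```python
def steps_additive(rate: int, target: int, increment: int) -> int:
    """Number of updates to reach `target` when adding a constant each step."""
    steps = 0
    while rate < target:
        rate += increment
        steps += 1
    return steps


def steps_multiplicative(rate: int, target: int, factor: int) -> int:
    """Number of updates to reach `target` when multiplying by a constant each step."""
    steps = 0
    while rate < target:
        rate *= factor
        steps += 1
    return steps


# Start at 1 rate unit and climb to 100 rate units.
additive = steps_additive(rate=1, target=100, increment=1)
multiplicative = steps_multiplicative(rate=1, target=100, factor=2)
```

The additive sender needs a number of steps linear in the gap to the target, while the multiplicative sender needs a number logarithmic in it. A multiplicative increase is only safe when the sender has precise load information, which is what INT metadata provides in HPCC.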

bit congestion information (packet loss, ECN). With HPCC, end-hosts can perform aggressive increase as INT metadata encompasses precise link utilization and timely queue statistics. HPCC demonstrated promising results with respect to latency, bandwidth, and convergence time. The authors however did not evaluate the performance of HPCC with conventional congestion control algorithms in the Internet (e.g., CUBIC, BBR). Note that achieving inter-protocol fairness is essential so that the solution is adopted by operators.
The method in [67] uses TCP pacing. Pacing decreases throughput variations and traffic burstiness, and hence, minimizes queuing delays. However, this method works well only in networks where the number of large-flow senders is small (e.g., in science Demilitarized Zone (DMZ) [212]).
P4Air, which applies traffic separation, demonstrated significant improvements in fairness compared to contemporary solutions. However, it requires allocating a queue for each congestion control algorithm group (e.g., loss-based (CUBIC), delay-based (TCP Vegas), etc.). Note that the number of queues is limited in switches, and production networks often reserve them for other applications’ QoS [65].
Note that some schemes require modifying the end-hosts (e.g., HPCC) while others are fully in-network (e.g., P4Air).

A.4. End-hosts, Programmable Switches, and Legacy Devices’ CC Schemes
Table VIII compares the CC schemes assisted by programmable switches (e.g., HPCC) with end-host CC algorithms (e.g., CUBIC) and legacy congestion signalling schemes (e.g., ECN). End-host CC algorithms infer congestion through packet drops and estimations (e.g., btlbw and Round-trip Time (RTT) estimation with BBR), which is not always sufficient to infer the existence of congestion. Legacy devices use classic ECN to signal congestion so that end-hosts slow down their transmission rates. Classic ECN is limited as it only marks a single bit to signal congestion, and is neither aggressive nor immediate. Programmable switches on the other hand use fine-grained prompt measurements to signal congestion (e.g., INT metadata), which results in higher detection accuracy, near-zero queueing delays, and faster convergence time. The distributed nature of end-host CC schemes allows them to operate without modifying the network infrastructure and without tweaking parameters. ECN-enabled devices and programmable switches on the other hand require a few parameters (e.g., marking threshold) to adapt to different network conditions.

B. Measurements
B.1. Background
Gaining an overall understanding of the network behavior is an increasingly complex task, especially when the size of the network is large and the bandwidth is high. Legacy measurement schemes have accuracy limitations since they rely on polling and sampling-based methods to gather traffic statistics. Typically, sampling methods have low sampling rates (e.g., one in every 30,000 packets) and polling methods have large polling intervals. The literature [225] has shown that such methods are only suitable for coarse-grained visibility. The accuracy limitation of sampling and polling techniques hampers the development of measurement applications. For instance, it is not possible to accurately measure frequently changing TCP-specific fields such as congestion window, receive window, and sending rate.
Data streaming or sketching algorithms [226–230] were proposed to address the limitations of sampling and polling. They address the following problem: an algorithm is allowed to perform a constant number of passes over a data stream (input sequence of items) while using sub-linear space compared to the dataset and the dictionary sizes; desired statistical properties (e.g., median) on the data stream are then estimated by the algorithm. The main problem with such algorithms is that they are tightly coupled to the metrics of interest. This means that switch vendors must build specialized algorithms, data structures, and hardware for specific monitoring tasks. With the constraints of CPU and memory in networking devices, it is challenging to support a wide spectrum of monitoring tasks that satisfy all customers. Legacy devices also lack the capability of customizing the processing behavior so that switches co-operate in the measurement process.
With the emergence of programmable switches, it is now possible to perform fine-grained measurements in the data plane at line rate. Moreover, data structures such as sketches and bloom filters can be easily implemented and customized for specific metrics of interest. Programmable switches pave the way for new areas of research in measurements since not only they provide flexibility in inspecting with high accuracy

the traffic statistics, but also allow programmers to express counts the attributes across related packets identified by keys,
reactive processing in real time (e.g., dropping a packet when and flags packets that surpass a defined threshold.
a threshold is bypassed as done in Random Early Detection Other approaches such as Elastic sketch [73] performs mea-
(RED) [231]). surement that are adaptive to changes in network conditions
(e.g., bandwidth, packet rate and flow size distribution). *Flow
B.2. Literature Review [77] supports concurrent measurements and dynamic queries.
INT provides path-level metrics, with data similar to that of Such approach aims at minimizing the concurrency problems
polling-based techniques. Note that the metrics themselves are and the network disruption resulting from compiling excessive
fixed; for instance, it is possible to determine the flow-level queries into the data plane. TurboFlow [78] aims at achieving
latency, but not the latency variation (jitter) [71]. The fixed high coverage without sacrificing information richness. Bai
metrics of INT also prevent performing network-wide mea- et al. [86] proposed FastFE, a system that performs traffic
surements; note that the INT standard specification document features extraction by leveraging programmable data planes.
does not mention methods to aggregate metadata and perform Features are then used by traffic analysis and behavior detector
complex analytics in the data plane. ML techniques.
This section focuses on techniques that provide measure- Performance Diagnosis Systems. Recent works are leverag-
ments that go beyond the fixed metrics extracted from the internal state of the switch.

Generic Query-based Monitoring. Operators constantly change their monitoring specifications. Adding new monitoring requirements on the fixed-function switching ASIC is expensive. Recent work explored the idea of providing a query-driven interface that allows operators to express their monitoring requirements. The queries can then be converted into switch programs (e.g., P4) to be deployed in the network. Alternatively, the queries can be executed on the control plane considering the measured information extracted from the data plane.

A simplistic attempt is FlowRadar [69], a system that stores counters for all flows in the data plane with low memory footprint, then exports them periodically (every 10ms) to a remote collector. Liu et al. [70] proposed Universal Monitoring (UnivMon), an application-agnostic monitoring framework that provides accuracy and generality across a wide range of monitoring tasks. UnivMon benefits from the granularity of the data plane to improve accuracy and runs different estimation algorithms on the control plane. Narayana et al. [71] presented Marple, a query language based on common query constructs (i.e., map, filter, group by). Marple allows performing advanced aggregation (e.g., moving average of latencies) at line rate in the data plane. Similarly, Sonata [79] provides a unified query interface that uses common dataflow operators, and partitions each query across the stream processor and the data plane. PacketScope [85] also uses dataflow constructs but allows querying the internal switch processing, both in the ingress and the egress pipelines.

Many of the previous works use the sketch data structure. The work in [88] extended the sketching approach used in previous works to support the notion of time. The motivation of this work is that recently captured traffic trends are the most relevant in network monitoring. Huang et al. [89] proposed OmniMon, an architectural design that coordinates flow-level network telemetry operations between programmable switches, end-hosts, and controllers. Such coordination aims at achieving high accuracy while maintaining low resource overhead. Chen et al. [90] proposed BeauCoup, a P4-based measurement system that handles multiple heterogeneous queries in the data plane. It offers a general query abstraction that

ing programmable data planes to diagnose network performance. The main motivation here is that fine-grained information can be monitored at line rate, mitigating the slow reaction to “gray failures” experienced by diagnosing end-hosts in legacy approaches.

Ghasemi et al. [72] proposed Dapper, an in-network TCP performance diagnosis system. Dapper analyzes packets in real time, and identifies and pinpoints the root cause of the bottleneck (sender, network, or receiver). Blink [82] also diagnoses TCP-related issues. In particular, it detects failures in the data plane based on retransmissions, and consequently, reroutes traffic. Other approaches attempt to diagnose performance degradation manifested by an increase of latency. Wang et al. [84] proposed SpiderMon, a system that performs network-wide performance degradation diagnosis. The key idea is to have every switch maintain fine-grained telemetry data for a short period of time, and upon detecting performance degradation (e.g., increased delay), the information is offloaded to a collector. Liu et al. [81] proposed a memory-efficient approach for network performance monitoring. This solution only monitors the top-k problematic flows.

Queue and Other Metrics Measurement. Programmable data planes allow querying the internal state of the queue with fine-grained visibility. Recent works leveraged this feature to provide better queueing information which can be used by various applications (e.g., AQMs, congestion control, etc.). Chen et al. [80] proposed ConQuest, a P4-based queue measurement solution that determines the size of flows occupying the queue in real time, and identifies flows that are grabbing a significant portion of the queue. Joshi et al. [75] proposed BurstRadar, a system that uses programmable switches to monitor microbursts in the data plane. Microbursts are events of sporadic congestion that last for tens or hundreds of microseconds. Microbursts increase latency, jitter, and packet loss, especially when links’ speeds are high and switch buffers are small.

Other works enabled measuring further metrics. For instance, Ding et al. [83] proposed P4Entropy, an algorithm to estimate network traffic entropy (Shannon entropy) in the data plane. Tracking entropy is useful for calculating traffic distribution in order to understand the network behavior. Another example is the system proposed by Chen et al. [87] which passively

TABLE IX
MEASUREMENTS SCHEMES COMPARISON

External Data Network Platform


Ref Name Core idea Approx.
computation structure wide HW SW
Coordinates flow-level Slots (bloom
[89] OmniMon ×   
telemetry among devices filter)
Uses scalable stream
[79] Sonata   Sketch × 
processor
Generic query-based monitoring

Stores flow counters and


[69] FlowRadar   Bloom filter  
periodically exports results
Elastic Adapts to network changing
[73]   Sketch  
Sketch conditions
Aggregates based on “map,
[71] Marple ×  Key-value store  
filter, group by” constructs
Enables simultaneous multiple Coupon collect
[90] BeauCoup  × × 
distinct counting queries (bloom filter)
Provides application-agnostic Universal
[70] UnivMon    
monitoring sketches
Groups traffic in the switch and GPV (register
[77] *Flow ×  × 
computes statistics in servers arrays)
Produces fine-grained and
[78] TurboFlow ×  Hash table × 
unsampled flow records
Enables time-aware Time-aware
[88] N/A   × 
monitoring sketch
Monitors packets’ lifecycle Key-value store
[85] PacketScope   × 
inside the switch (hash table)
Extracts traffic features for ML
[86] FastFE ×  key-value store × 
models
Reactive Measured Network Platform
Performance diagnosis systems

Ref Name Core idea Scope


processing information wide HW SW
Flight size, MSS,
Diagnoses TCP performance Identifies TCP sender’s reaction
[72] Dapper N/A × 
issues in the data plane bottleneck time, loss, RTT,
CWND, RWND
Diagnoses latency with small Identifies flows
[84] SpiderMon Limits rate Queue latency  
memory footprint affecting latency
Detects failures based on the Identifies Reroutes RTO-induced
[82] Blink × 
predictable behavior of TCP retransmitters traffic retransmissions
Retransmissions,
Improves monitoring scalability Identifies top-k
[81] N/A N/A latency, packet × 
by measuring subset of flows influential flows
loss, out-of-order
Passive Measured Data Platform
Queue/other measurement

Ref Name Core idea Analysis


measurement information structure HW SW
Identifies flows contributing Count-min
[80] ConQuest  Data plane Queue occupancy 
heavily to the queue sketch
Measures the RTT of TCP RTT from an ISP
[87] N/A  Data plane Hash table 
traffic in ISP networks vantage point
Monitors microbursts and
[75] BurstRadar captures telemetry for the × Control plane Queue occupancy Ring buffer 
contributing packets
Estimates network traffic Count-min
[83] P4Entropy × Data plane Shannon entropy 
entropy sketch
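Several schemes in Table IX list a sketch as their data structure. To illustrate why such structures suit the limited memory of a switch, the following is a minimal count-min sketch in Python. It is a generic textbook version and not the implementation of any surveyed system; the class name, the width and depth values, and the use of salted BLAKE2b hashing are assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency estimator (illustrative; not from any surveyed system)."""

    def __init__(self, width=64, depth=4):
        self.width = width          # counters per row (assumed value)
        self.depth = depth          # number of independent hash rows (assumed value)
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One salted hash per row stands in for the switch's hash units.
        digest = hashlib.blake2b(
            key.encode(), salt=bytes([row]), digest_size=8
        ).digest()
        return int.from_bytes(digest, "big") % self.width

    def update(self, key, count=1):
        # Each packet touches exactly one counter per row: no loops over the table.
        for row in range(self.depth):
            self.rows[row][self._index(row, key)] += count

    def estimate(self, key):
        # Collisions can only inflate a counter, so the minimum is the best guess.
        return min(self.rows[row][self._index(row, key)] for row in range(self.depth))


if __name__ == "__main__":
    cms = CountMinSketch()
    for flow, packets in {"10.0.0.1": 7, "10.0.0.2": 3}.items():
        cms.update(flow, packets)
    print(cms.estimate("10.0.0.1"), cms.estimate("10.0.0.2"))
```

The memory used is width times depth counters regardless of how many flows are observed, and the estimate never falls below the true count; the price is that hash collisions may overestimate some flows.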

measures the RTT of TCP traffic in ISP networks. RTT measurement is important for detecting spoofing and routing attacks, ensuring Service Level Agreements (SLAs) compliance, measuring the Quality of Experience (QoE), improving congestion control, and many others.

B.3. Measurements Schemes Comparison, Discussions, and Limitations

Table IX compares the aforementioned measurements schemes.

Generic Query-based Monitoring. Some schemes (e.g., Sonata, FlowRadar, UnivMon) performed approximations of the metrics by using probabilistic data structures (e.g., sketch, bloom filter, etc.), sampling methods, and top-k counting. In addition, some focused on a subset of traffic by leveraging event matching techniques. Such techniques are primarily used to achieve high resource efficiency (i.e., low memory footprint), but cannot achieve full accuracy. On the other hand, systems like OmniMon carefully coordinate the collaboration among different types of entities in the network. Such coordination will result in efficient resource utilization and full accuracy. OmniMon follows a split-merge strategy where the split operation decomposes telemetry operations into partial operations and schedules them among the entities (switches, end-hosts, and controller), and the merge operation coordinates the collaboration among these entities. The idea is to leverage the strength of the data plane in the switches and end-hosts (i.e., per-flow measurements with high accuracy) and the control plane (i.e., network-wide collaboration). OmniMon also ensures consistency through a synchronization mechanism and

accountability through a system of linear equations considering packet loss and other data center characteristics. Results show that OmniMon reduces the memory by 33%-96% and the number of actions by 66%-90% when compared to state-of-the-art solutions.

Another criterion that differentiates the measurements schemes is whether there are computations being performed outside the data plane. Most of the systems use the control plane or external servers to perform complex computations since the data plane has limited support for complex arithmetic functions. While some systems (e.g., BeauCoup) do not require an external computation device, they often support fewer measurement operations.

The selection of the data structure to be used in the data plane strongly affects the measurement features supported by a certain scheme. For instance, the goal of BeauCoup is to enable simultaneous distinct counting queries; for such a task, the authors based their design on the coupon-collection problem [232], which computes the number of random draws from n coupons such that all coupons are drawn at least once. For example, if the threshold of distinct destination IPs for detecting superspreaders is 130, instead of recording all distinct destination IPs, 32 coupons are defined. Consequently, the destination IPs of incoming packets are mapped to those 32 coupons. While this data structure uses less memory than the other state-of-the-art measurement sketches, it is limited to specific objectives (distinct counting). Other works (e.g., UnivMon) focused on generalizing the measurement scenarios, and hence, used universal sketches as data structures.

Qiu et al. [88] focused on capturing traffic trends that are the most relevant in network monitoring and attack detection. The notion of time is not supported by native streaming algorithms. For instance, count-min sketch, which is a data structure that uses a constant amount of memory to record data, is oblivious to the passage of time. Existing solutions that consider recency are easily implemented on software, but not on programmable ASICs. For example, resetting a sketch after a timer expires requires iterating over the elements in the sketch, an operation that cannot be implemented in the data plane due to the lack of loops. Likewise, creating multiple sketches requires additional stages, which are limited in the hardware. Time-adaptive sketches utilize the idea of Dolby noise reduction [233, 234]; a pre-emphasis function inflates the update when a new key is inserted and a de-emphasis function restores the original value. This mechanism ages the old events over time, and therefore, improves the accuracy of recent events. The authors implemented the pre-emphasis function in the data plane using simple bit shifts, and the de-emphasis function in the control plane.

Finally, some systems considered network-wide monitoring, while others only restricted their capabilities to local per-switch measurements. Network-wide measurement is essential and can significantly improve the visibility of traffic, as discussed in Section XIII-D.

Performance Diagnosis Systems. Some performance diagnosis schemes restricted their scope to troubleshooting TCP. For instance, Dapper infers sending rate, Maximum Segment Size (MSS), sender’s reaction time (time between received ACK and new transmission), loss rate, latency, congestion window (CWND), receiver window (RWND), and delayed ACKs. Based on the inferred variables, Dapper can identify the root cause of the bottleneck. Similarly, the authors in [81] monitored conditions such as retransmissions, packet loss, round-trip time, and out-of-order packets to identify the top-k problematic flows. Furthermore, Blink detects failures based on the predictable behavior of TCP, which retransmits packets at epochs exponentially spaced in time, in the presence of failure. Other schemes (e.g., SpiderMon) identify failures based on the increase of latency.

Some schemes use reactive processing to mitigate the network performance issue. For instance, Blink promptly reroutes traffic whenever failure signals are generated by the data plane, while SpiderMon limits the sending rate of the root cause hosts.

Finally, it is worth mentioning that some systems (e.g., Blink, Dapper) considered traces from real-world captures such as the ones provided by CAIDA for evaluation. Using real-world traces gives more credibility to the proposed solution.

Queue and other Metrics Measurement. Understanding the occupancy of the queue is useful for use cases such as mitigating congestion-based attacks, avoiding conflicting workloads, implementing new AQMs, optimizing switch configurations, debugging switch implementation, off-path monitoring of queues in legacy devices, etc. ConQuest performs queue measurements and identifies flows depending on the purpose (e.g., detecting bursty connections). It maintains compact snapshots of the queue, updated on each incoming packet. The snapshots are then aggregated in a round-robin fashion to approximate the queue occupancy. Afterwards, it cleans the previous snapshots to reuse them for further packets. Similarly, BurstRadar detects microbursts, which can increase latency, jitter, and packet loss, especially when links’ speeds are high and switch buffers are small. It is almost impossible to detect microbursts in legacy switches which use sampling and polling-based techniques. BurstRadar detects microbursts, and captures a snapshot of the telemetry information of all the involved packets. Afterwards, an analysis is conducted on the snapshot to identify the microburst-contributing flow and the burst characteristics. Note that BurstRadar does not support measuring the queues of legacy devices passively, but ConQuest does. In addition, BurstRadar performs the analysis on the control plane, while ConQuest uses the data plane for analysis.

B.4. In-Network versus Legacy Measurements

Fig. 11 compares the legacy measurements to those conducted on programmable switches. There are two main classes of legacy measurements techniques. First, there are techniques that rely on polling and sampling (e.g., NetFlow). The differences between in-network measurements and polling/sampling-based schemes are closely related to the differences between legacy measurements and INT (see Table V). For instance, the granularity of the measurements conducted in

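[Figure 11 diagram. Panel (a): a control plane above a data plane that uses sampling/polling to produce flow reports from traffic. Panel (b): applications App1 ... AppN perform application-specific computation on top of a control plane that configures, and receives reports from, a data plane holding data structures (e.g., sketch).]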

Fig. 11. (a) Traditional measurements with sampling/polling. The switch uses sampling and polling protocols (e.g., NetFlow, SNMP) to generate fixed network flow records. Instead of collecting every packet, sampling collects only one out of every N packets. Records are then exported to an external server for further analysis. (b) Measurements with programmable switches (e.g., UnivMon [70]). The switch runs a universal algorithm over a universal data structure (e.g., universal sketch). The control plane then estimates a wide range of metrics for various applications. Note that this is not the only design possible for measurement tasks with programmable switches. The programmer has the flexibility to use customized algorithms that run at line rate in the data plane. Such algorithms can leverage various data structures in the P4 program (e.g., sketch, bloom filter) to store flow statistics. The switch then pushes statistics reports to the control plane for further analysis and reactive processing.
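The caption of Fig. 11 notes that sampling collects only one out of every N packets. As a small illustration of what this means for granularity, the Python sketch below applies deterministic 1-in-N sampling to a synthetic packet trace and compares the scaled-up estimate with exact per-packet counts. The trace, the flow labels, and the function names are hypothetical and chosen only for the example; real sampling protocols may select packets differently.

```python
from collections import Counter


def sample_one_in_n(packets, n):
    """Keep every n-th packet, starting with the first (deterministic 1-in-N sampling)."""
    return [pkt for i, pkt in enumerate(packets) if i % n == 0]


def estimate_flow_sizes(sampled, n):
    """Scale sampled counts by n to estimate the original per-flow packet counts."""
    return {flow: count * n for flow, count in Counter(sampled).items()}


# Hypothetical trace: one large flow "A" and a short 3-packet flow "B".
packets = ["A"] * 501 + ["B"] * 3 + ["A"] * 499

exact = Counter(packets)                                   # per-packet view
estimated = estimate_flow_sizes(sample_one_in_n(packets, 100), 100)

if __name__ == "__main__":
    print("exact:    ", dict(exact))
    print("estimated:", estimated)
```

In this trace the short flow falls entirely between two sampling points, so it never appears in the sampled records, while the large flow is only approximated. Per-packet processing in the data plane does not have this blind spot, although it must still fit its state into limited memory.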

the data plane is much higher than those collected in traditional measurements (e.g., NetFlow). Further, it is not possible to conduct event-based monitoring in legacy approaches, whereas with in-network measurements, the programmer has the flexibility of customizing the monitoring based on conditions and thresholds. Second, there are techniques that rely on sketching or streaming algorithms to estimate the metric of interest. Such methods are tightly coupled with the metric, which forces hardware vendors to invest time and effort in building customized algorithms and data structures that might not be used by various customers. Moreover, with the constraints of routers and switches, it is not possible to implement a variety of monitoring tasks while still supporting the standard routing/switching functionalities. Therefore, such approaches are not scalable for the long run.

With programmable switches, it is possible to customize the monitoring tasks by implementing customized sketching/streaming algorithms as P4 programs. This advantage improves scalability as the operator can always modify the algorithms whenever needed.

C. Active Queue Management (AQM)

C.1. Background

A fundamental component in network devices is the queue which temporarily buffers packets. As data traffic is inherently bursty, routers have been provisioned with large queues to absorb this burstiness and to maintain high link utilization. The majority of delays encountered in a communication session is a result of large backlogs formed in queues. Previous legacy devices are limited in the visibility of the queue as they provide little or no insight about which flows are occupying or sharing the queue [80]. Consequently, researchers have been investigating queue management algorithms to shorten the delay and mitigate packet losses, while providing fairness among flows.

AQM is a set of algorithms designed to shorten the queueing delay by prohibiting buffers on devices from becoming full. The undesirable latency that results from a device buffering too much data is known as "Bufferbloat". Bufferbloat not only increases the end-to-end delay, but also decreases the throughput and increases the jitter of a communication session. Modern AQMs help in mitigating the bufferbloat problem [235–238]. Unfortunately, modern AQMs are typically not available in state-of-the-art network equipment; for instance, Controlled Delay (CoDel) AQM, which was proposed in 2013, and was proven in the literature to be effective in mitigating Bufferbloat [239], is still not available in most network equipment. With programmable switches, it is now possible to implement AQMs as P4 programs, which not only accelerates support for new AQMs, but also provides means to customize their parameters programmatically in response to network traffic. Moreover, programmable switches foster innovation on newer AQMs that can be easily implemented and rapidly tested.

C.2. Literature Review

Kundel et al. [91] implemented the CoDel queueing discipline on a programmable switch. CoDel eliminates Bufferbloat, even in the presence of large buffers [240]. Sharma et al. [92] proposed Approximate Fair Queueing (AFQ), a mechanism built on top of programmable switches that approximates fair queueing at line rate. Fair Queueing (FQ) aims at fairly dividing the bandwidth allocation among active flows. Laki et al. [93] described an AQM evaluation testbed with P4 in a demo paper. The authors tested the framework with two AQMs: Proportional Integral Controller Enhanced (PIE) and RED. Mushtaq et al. [241] approximated Shortest Remaining Processing Time (SRPT). Papagianni et al. [94] implemented the Proportional Integral PI2 AQM on a programmable switch. PI2 is an extension of PIE AQM to support coexistence between classic and scalable congestion controls in the public Internet. Kumazoe et al. [95] implemented the MTQ/QTL scheme on P4.

C.3. AQM Schemes Comparison, Discussions, and Limitations

Table X compares the aforementioned AQM schemes. Some schemes require tuning a number of parameters and thresholds

TABLE X
AQM SCHEMES COMPARISON

Scheme | Name | Idea | Params & thresholds | Multiple queues | Data structure | Implementation
[91] | P4-CoDel | Implementation of CoDel on P4 | 2 | × | Registers | BMv2
[92] | AFQ | Approximate fair queueing in the switch | 4 | ✓ | Count-min sketch | Cavium OCTEON
[93] | N/A | Evaluation testbed for PIE and RED | RED 1, PIE 5 | × | Registers | BMv2
[94] | PI2 for P4 | Implementation of PI2 on P4 | 3 | × | Registers | BMv2
[95] | MTQ/QTL | Implementation of MTQ/QTL on P4 | 3 | × | Registers | BMv2
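Table X lists two parameters for the CoDel implementation. Assuming these correspond to CoDel's usual target delay and interval, the following Python sketch shows a simplified version of the per-packet drop decision, built only from comparisons, a counter, and basic arithmetic. It is an illustration and not the P4 code of [91]; the class and method names are hypothetical, the default values of 5 ms and 100 ms are the commonly cited CoDel settings rather than figures from this survey, and details of the full algorithm (such as how the drop count is reused between dropping episodes) are omitted.

```python
import math


class SimplifiedCoDel:
    """Simplified CoDel drop decision (illustrative; omits parts of the full algorithm)."""

    def __init__(self, target=0.005, interval=0.100):
        self.target = target            # acceptable standing queue delay, seconds (assumed)
        self.interval = interval        # observation window, seconds (assumed)
        self.first_above_time = None    # deadline after which persistent delay triggers drops
        self.dropping = False           # True while in the dropping state
        self.count = 0                  # drops made in the current dropping state
        self.drop_next = 0.0            # time of the next scheduled drop

    def should_drop(self, now, sojourn_time):
        """Return True if the packet dequeued at time `now` should be dropped."""
        if sojourn_time < self.target:
            # Delay is acceptable again: leave the dropping state.
            self.first_above_time = None
            self.dropping = False
            return False
        if self.first_above_time is None:
            # First packet above target: start the observation window.
            self.first_above_time = now + self.interval
            return False
        if not self.dropping:
            if now >= self.first_above_time:
                # Delay stayed above target for a full interval: start dropping.
                self.dropping = True
                self.count = 1
                self.drop_next = now + self.interval / math.sqrt(self.count)
                return True
            return False
        if now >= self.drop_next:
            # Drop again, with the gap between drops shrinking as count grows.
            self.count += 1
            self.drop_next += self.interval / math.sqrt(self.count)
            return True
        return False
```

The square root is the only operation here that is not a comparison, addition, or counter update; on a hardware target it would typically have to be approximated, for example with a lookup table.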

so that they operate well in certain network conditions. It is worth mentioning that a scheme becomes hard to manage and less autonomous when the number of parameters and thresholds is high.

Some schemes are simple to implement in the data plane. CoDel’s algorithm can be easily expressed in the data plane as it consists of comparisons, counting, basic arithmetic, and dropping packets. Similarly, PI2 is simple to implement as it is mostly based on basic bit manipulations. FQ algorithms on the other hand are difficult to implement on hardware as they require complex flow classification, per-packet scheduling, and buffer allocation. Such requirements make FQ algorithms expensive to implement on high-speed devices. AFQ aims at approximating fair queueing by using programmable switches’ features such as mutating switch state, performing basic calculations, and selecting the egress queue of a packet. AFQ’s operations can be summarized as follows: 1) per-flow state, which includes the number and timing information of the previous packet pertaining to that flow, is approximated; 2) the position of each packet in the output schedule is determined; 3) the egress queue to use is selected; and 4) the packet is dequeued based on the approximate sorted order. Note that AFQ uses a probabilistic data structure (count-min sketch) since it only approximates the states, and uses multiple queues in its implementation.

C.4. AQMs on Programmable Switches and Fixed-function Devices

Inventing novel AQMs that control queueing delay, mitigate bufferbloat, and achieve fairness with different network conditions (e.g., short/long RTTs, lossy networks, WANs) is an active research area. Typically, new AQMs are implemented and tested in software (e.g., as a Linux queueing discipline (qdisc) used with traffic control (tc)), which is limited when the objective is to deploy the AQMs on production networks. With programmable switches, AQMs are implemented in P4 programs, which fosters innovation and enhances testing with production networks. Additionally, operators can create their own customized AQMs that perform efficiently with their typical network traffic. Historically, deploying AQMs on network devices is a lengthy and costly process; once an effective AQM is published and thoroughly tested, equipment vendors start investigating whether it is feasible to implement it on future devices. Such a process might take years to finish, and by then, new network conditions evolve, requiring new AQMs. With programmable switches, this process is cost-efficient and relatively fast (can be completed in weeks). Table XI compares the features of AQMs on programmable switches versus fixed-function devices.

TABLE XI
AQMS ON PROGRAMMABLE AND FIXED-FUNCTION SWITCHES

Feature | Programmable switches | Fixed-function devices
Innovation | Higher; new AQMs are expressed in P4 programs | Lower; only developed by equipment vendors
Exclusivity | Higher; operators can implement their own custom AQMs without disclosing technical information | Lower; most supported AQMs are standards
Readiness | Faster (weeks to months); once an AQM is expressed in P4, it can be immediately available | Slower (years)
Cost | Lower | Higher
Tweakable | Higher; even standard AQMs can be customized and tweaked based on network traffic | Lower; only through parameters

D. Quality of Service and Traffic Management

D.1. Background

Meeting diverse Quality of Service (QoS) requirements is a fundamental challenge in today’s networks. Traffic Management (TM) provides access control that guarantees that the traffic admitted to the network conforms to the defined QoS specifications. TM often regulates the rate of a flow by applying traffic policing. The new generation of programmable switches facilitates traffic policing and differentiation by allowing network operators to express their logic in a programming language (P4). This section explores the works on programmable switches that involve QoS and TM.

D.2. Literature Review

Bhat et al. [96] described a system where programmable switches route traffic intelligently by inspecting application headers (layer-5) to improve users’ QoE. Lee et al. [97] implemented a traffic meter based on Multi-Color Markers (MCM) on programmable switches to support multi-tenancy environments. Tokmakov et al. [98] proposed RL-SP-DRR, a traffic management system that combines Rate-limited Strict Priority (RL-SP) and Deficit round-robin (DRR) to achieve low latency and fair scheduling while improving link utilisation, prioritization and scalability. Chen et al. [99] proposed a bandwidth manager for end-to-end QoS provisioning using programmable switches. The system classifies packets into

different categories based on their QoS demands and usages, and uses a two-level queue when prioritizing.

D.3. QoS/TM Schemes Comparison, Discussions, and Limitations

Table XII compares the QoS/TM schemes. The main idea in [96] is to translate application-layer header information into link-layer headers (Q-in-Q 802.1ad) for the core network in order to perform QoS routing and provisioning. The authors adopted the Adaptive Bit Rate (ABR) video streaming as a use case to showcase the QoS improvements and the flexibility of traffic management. Such an approach is interesting since switches are inspecting higher layers in the protocol stack. This capability is not available in non-programmable devices. Note however that the solution was only implemented on a software switch (BMv2). When it comes to hardware switches, the solution might face challenges to run at line rate when processing L5 headers. Therefore, the authors left the hardware implementation as future work.

TABLE XII
QOS/TM SCHEMES COMPARISON

Ref | Idea | Input | Multiple queues | HW | SW
[96] | Application-layer headers inspection | Layer-5 headers | × | | ✓
[97] | MCM-based traffic meter | Traffic rate, VN ID | × | | ✓
[98] | Traffic mgmt. (RL-SP and DRR) | Traffic rate | ✓ | | ✓
[99] | BW manager for e2e QoS | Flow ID, min/maxRate | ✓ | ✓ |

The other approaches considered traffic rates as inputs rather than inspecting application-layer headers. [97] focused on isolating virtual networks (VN). A VN has to have its own dedicated bandwidth (i.e., other networks’ traffic should not impact the bandwidth) and should be able to differentiate priorities in order to provide QoS for its flows. While the solution was not implemented on hardware (the authors left the hardware implementation as future work), it is worth noting that this system relies on metering primitives which are available in today’s hardware targets (e.g., meters in Tofino). Similarly, [98] was only implemented on a software switch (BMv2) and was evaluated by comparison against standard priority-based and best-effort scheduling. This system uses multiple priority queues, a feature supported in hardware targets. Therefore, the system could be implemented on hardware switches. The approach in [99] aims at limiting the maximum allowed rate and at maximizing bandwidth utilization. This is the only work that was implemented on a hardware switch (Tofino), and its design was compared against approaches based on OpenFlow.

D.4. Comparison of QoS/TM between Legacy and Programmable Networks

The ability to perform QoS-based traffic management in legacy networks is restricted to algorithms that consider standard header fields (e.g., differentiated services [242]). On the other hand, programmable switches can parse, modify, and process customized protocols. Hence, operators now have the ability to perform TM by inspecting custom header fields. Moreover, it is possible to extract with high-granularity metadata pertaining to the state of the switch (e.g., queue occupancy, packet sojourn time, etc.) at line rate. Such information can significantly help switches make better decisions while performing traffic management.

E. Multicast

E.1. Background

Multicast routing enables a source node to send a copy of a packet to a group of nodes. Multicast uses in-network traffic replication to ensure that at most a single copy of a packet traverses each link of the multicast tree. Perhaps the most widely deployed multicast routing protocol in traditional networks is the Protocol-Independent Multicast (PIM) protocol [243]. PIM and other multicast routing protocols require a signaling protocol such as the Internet Group Management Protocol (IGMP) [244] to create, change, and tear down the multicast tree. Traditional multicast presents some challenges. For example, it is not suitable for environments where multicast group members constantly move (e.g., virtual machine migration and allocation). In such cases, the multicast tree must be updated dynamically, which may require substantial time and overhead. Also, some routers support a limited number of group-table entries, which does not scale in environments such as datacenters. Additionally, the signaling protocol and multicast algorithm are hard coded in the router, which reduces flexibility in building and managing the tree. Finally, it is not possible to implement multicast based on non-standard header fields.

E.2. Literature Review

Shahbaz et al. [100] presented ELMO, a multicast scheme based on programmable P4 switches for datacenter applications. ELMO encodes the multicast tree in the packet header, as opposed to maintaining group-table entries inside routers. Kadosh et al. [101] implemented ELMO using a hybrid dataplane with programmable and non-programmable elements. ELMO is intended for multi-tenant datacenter applications requiring high scalability. Braun et al. [102] presented an implementation of the Bit Index Explicit Replication (BIER) architecture [245] with extensions for traffic engineering. Similar to ELMO, BIER removes the per-multicast group state information from switches by adding a BIER header, which is used to forward packets. BIER does not require a signaling protocol for building, managing, and tearing down trees.

E.3. Multicast Schemes Comparison, Discussions, and Limitations

Table XIII compares the aforementioned multicast schemes. Both ELMO and BIER are source-routed multicast schemes. In BIER, group members are encoded as bit strings and are then inspected by switches to identify the output port. Such a scheme requires heavy processing on the switch, hampering the execution at line rate. Consequently, the authors only implemented BIER on a software switch (BMv2). ELMO on the other hand has no restrictions on the group and network

sizes, and was implemented on a hardware switch, running at line rate.

TABLE XIII
MULTICAST SCHEMES COMPARISON (SOURCE: [100])

Scheme | Name | Group size | Network size | Heavy processing | HW | SW
[100] | ELMO | None | None | × | ✓ |
[102] | BIER | 2.6K | 2.6K | ✓ | | ✓

E.4. Comparison of P4-based and Traditional Multicast

Table XIV compares P4-based multicast and traditional multicast. The main advantages of implementing multicast routing with programmable P4 switches are: i) the group membership is encoded in the packet itself, which permits the creation of arbitrary multicast trees based on the application. For example, a multicast tree to update software devices may prioritize bandwidth over latency, while one for media traffic may prioritize latency; ii) switches do not need to store per-group state information, although tables can be customized and used in conjunction with the tree encoded in the packet header; iii) groups can be reconfigured easily by changing the information in the header of the packet; and iv) the elimination of the signaling protocol to build, manage, and tear down the tree results in considerable simplification and flexibility for the operator.

F. Summary and Lessons Learned

Performing network-wide monitoring and measurements is of utmost importance for network operators to diagnose performance degradation. A wide range of research efforts harness streaming methods that utilize various data structures (e.g., sketches, bloom filters, etc.) and approximation algorithms. Further, the majority of measurement works provide a query-based language to specify the monitoring tasks. Future measurement works should consider generalizing the monitoring jobs, reducing storage requirements, managing accuracy-memory trade-off, extending monitoring primitives, minimizing controller intervention, and optimizing the placement of switches in a legacy network. Another line of research aims at combating congestion and reducing packet losses by analyzing measurements collected in the data plane and by applying queue management policies. Congestion control is enhanced by adopting techniques such as throttling senders, cutting payloads, enforcing sending rates by leveraging telemetry data, and separating traffic into different queues. Furthermore, a handful of works are investigating methods to improve QoS by applying traffic policing and management. Techniques adopted include application-layer inspection, traffic metering, traffic separation, and bandwidth management. Finally, the scalability concerns of multicast in legacy networks are being mitigated with programmable switches. Recent efforts proposed encoding multicast trees into the headers of packets, and using programmable switches to parse these headers and to determine the multicast groups. Future endeavours should investigate incremental deployment (i.e., interworking with legacy multicast schemes), and reliability enhancement (e.g., by adopting layering protocols such as Pragmatic General Multicast (PGM) and Scalable Reliable Multicast (SRM)).

VIII. MIDDLEBOX FUNCTIONS

RFC 3234 [246] defines a middlebox as a device that performs functions other than the standard functions of an IP router between a source and a destination host. In legacy devices, middlebox functions are designed and implemented by manufacturers. Hence, they are limited in the functionalities they provide, and typically include standard well-known functions (e.g., NAT, protocol converters (6to4/4to6), etc.). To overcome this limitation, the trend moved towards implementing middleboxes in x86-based servers and in data centers as Network Function Virtualization (NFV). While this shift accelerated innovation and introduced a wide range of new applications, there were some performance implications resulting from operating systems’ scheduling delays, interrupt processing latency, pre-emptions, and other low-level OS functions. Since programmable switches offer the flexibility of inspecting and modifying packets’ headers based on custom logic, they are excellent candidates for enabling middlebox functions, while
TABLE XIV operating at line rate without performance implications.
COMPARISON BETWEEN P4-BASED AND TRADITIONAL MULTICAST

Feature | P4-based multicast | Traditional multicast
Scalability | High; no state information required in switches | Low; state information required in switches per-group
Tree management | Flexible; custom multicast algorithm and features can be implemented | Inflexible; signaling protocol required and hard coded in the switch
Packet overhead | High; multicast tree is encoded in packet header | No packet overhead
Dynamic tree updates | Easy; packet header carries update information | Complex; topology challenges may trigger time-consuming tree changes
IP address constraint | Flexible; switch can multicast packets independently of the type of IP address | Fixed; switch is hard-coded to only multicast packets with destination IP address in the range 224.0.0.0 - 239.255.255.255

A. Load Balancing

A.1. Background

A cloud data center, such as a Google or Facebook data center, provides many
applications concurrently, such as email and video applications. To support
requests from external clients, each application is associated with a publicly
visible IP address to which clients send their requests and from which they
receive responses. This IP address is referred to as the Virtual IP (VIP) address.
The external requests are then directed to a software load balancer whose task is
to distribute requests to the servers, balancing the load across them. The load
balancer is also referred to as a layer-4 load balancer because it makes decisions
based on the 5-tuple: source IP address and port, destination IP address and port,
and transport-layer protocol. This state information is stored in a connection
table containing the 5-tuple and the Direct IP (DIP) address of the
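TABLE XV
LOAD BALANCING SCHEMES COMPARISON

Scheme | Name | Stateful | Centralized | Active probing | MP-TCP support | Failure handling | Platform (Hardware/Software)
[103] | HULA | ✓ | × | ✓ | × | ✓ | ✓
[104] | SilkRoad | ✓ | × | × | × | ✓ | ✓
[105] | MP-HULA | ✓ | × | ✓ | ✓ | ✓ | ✓
[106] | Beamer | × | ✓ | × | ✓ | ✓ | ✓
[108] | Dash | ✓ | × | ✓ | ✓ | × | ✓
[109] | Contra | ✓ | × | ✓ | × | ✓ | ✓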
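
As a rough illustration of what the "Stateful" column in Table XV implies, the following Python sketch models a connection table that maps a 5-tuple to a DIP. The names ConnTable and pick_dip, the addresses, and the hash-based choice of a DIP for new connections are assumptions made for this example; they are not taken from any of the surveyed systems, and server failures are not modeled.

```python
import hashlib
from typing import Dict, List, Tuple

# src IP, src port, dst IP (the VIP), dst port, transport protocol
FiveTuple = Tuple[str, int, str, int, str]


def pick_dip(flow: FiveTuple, pool: List[str]) -> str:
    """Hash the 5-tuple onto the current DIP pool (used for new connections only)."""
    digest = hashlib.sha256(repr(flow).encode()).digest()
    return pool[int.from_bytes(digest[:4], "big") % len(pool)]


class ConnTable:
    """Hypothetical model of a switch-resident connection table."""

    def __init__(self, pool: List[str]) -> None:
        self.pool = list(pool)
        self.table: Dict[FiveTuple, str] = {}

    def forward(self, flow: FiveTuple) -> str:
        # A known connection keeps its stored DIP even if the pool has changed.
        if flow not in self.table:
            self.table[flow] = pick_dip(flow, self.pool)
        return self.table[flow]

    def close(self, flow: FiveTuple) -> None:
        # Entry removed when the flow ends.
        self.table.pop(flow, None)


lb = ConnTable(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
flow = ("198.51.100.7", 40000, "203.0.113.10", 443, "tcp")
first = lb.forward(flow)
lb.pool.append("10.0.0.4")          # the DIP pool changes mid-connection
assert lb.forward(flow) == first    # the connection stays on the same server
```

Without the stored entry, re-hashing the same 5-tuple over the enlarged pool could select a different server, which is the disruption that per-connection state is meant to avoid.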

Fig. 12. (a) Traditional software-based load balancing. (b) Load balancing system
implemented by a programmable switch.

server serving that connection. State information is needed to avoid disruptions
caused by changes in the DIP pool (e.g., server failures, addition of new
servers). The load balancer also provides a translation functionality, translating
the VIP to the internal DIP, and then translating back for packets traveling in
the reverse direction back to the clients. The traditional software-based load
balancer is illustrated in Fig. 12(a).

A.2. Literature Review

Recent works presented schemes where load balancing functionality is implemented
in programmable P4 switches. The main idea consists of storing state information
directly in the switch’s data plane. The connection table is managed by the
software load balancer, which can be implemented either in the switch’s control
plane or as an external device, as shown in Fig. 12(b). The software load balancer
adds new entries to the switch’s table as new connections arrive, and removes old
entries as flows end.

Katta et al. [103] proposed HULA, a load balancer scheme where switches store the
best path to the destination via their neighboring switches. This strategy avoids
storing the congestion status of all paths in leaf switches. Bennet et al. [105]
extended this approach to support multi-path transport protocols (e.g., Multi-path
TCP (MPTCP)). Another significant work is SilkRoad [104], a load balancer that
provides a direct path between application traffic and servers. Other mechanisms
such as DistCache [107] enable load balancing for storage systems through a
distributed caching method. DASH [108] proposed a data structure that leverages
multiple pipeline stages and per-stage SALUs to dynamically balance data across
multiple paths. The aforementioned approaches work under specific assumptions
about the network topology, routing constraints, and performance. Contra [109]
generalized load balancing to work with various topologies and under multiple
constraints by using a performance-aware routing mechanism.

Beamer [106] takes a different, stateless approach to load balancing. Instead of
storing the state in the switch, Beamer leverages the connection state already
stored in backend servers to perform the forwarding.

A.3. Load Balancing Schemes Comparison, Discussions, and Limitations

Table XV compares the aforementioned load balancing schemes. The key idea of
switch-based load balancing is to eliminate the need for a software layer while
mapping a connection to the same server, ensuring the Per-Connection Consistency
(PCC) property. The majority of the proposed approaches are stateful, meaning that
the switches store information locally to perform load balancing. The exception
here is Beamer, which relies on the connection state already stored in backend
servers to ensure that connections are never dropped under churn. Another
significant shift from the previous solutions is the decentralized nature of
Beamer. Some approaches (e.g., HULA, MP-HULA, Contra, Dash) use active probing to
collect network performance metrics. Such metrics are then analyzed by the
switches to make load balancing decisions.

In the presence of multi-path transport protocols (e.g., MPTCP), systems such as
HULA provide sub-optimal forwarding decisions when several subflows pertaining to
a single connection are pinned on the same bottleneck link. As a result, schemes
such as MP-HULA, Contra, and Dash were proposed to support multi-path transport
protocols. For instance, MP-HULA is a transport layer multi-path aware
load-balancing scheme that uses the best-k paths to the destination through the
neighbor switches.

Finally, it is important for a load balancing scheme to handle network failures.
Most of the discussed systems considered mitigating failures, with the exception
of DASH.

A.4. Comparison between Switch-based and Server-based Load Balancer

Table XVI shows a comparison between switch-based and server-based load balancers.
There is a significant improvement in the throughput when load balancing is
offloaded to the switches; for instance, SilkRoad [104], which is a load balancing
scheme in the data plane, achieves 10 billion packets per second (pps) while
operating at line rate. Software load balancers on the other hand achieve a much
lower throughput, nine million pps on average. Software-based load balancers also
incur additional latency overhead when processing new requests. It is relatively
easy to install additional software load
TABLE XVI
SWITCH-BASED AND SERVER-BASED LOAD BALANCERS

Feature | Switch-based | Server-based
Throughput | Higher (e.g., SilkRoad with 6.4Tbps ASIC can achieve about 10Gpps) | Lower (e.g., 9Mpps per core [247])
Latency | Lower; sub-microseconds from ingress to egress | Higher; additional latency when processing new requests∗
Scalability | Lower; connection is stored in limited SRAM | Higher
Policy flexibility | Limited; hash-based flow assignments may lead to imbalance | Flexible policies can be written in software
System complexity | Simpler; it requires a customized parser, match-action tables | More complex; it requires coordination with routers, tunneling (e.g., GRE encapsulation)
∗ After the first packet is processed, no additional latency is observed [247].

balancers, which makes it more scalable than switch-based load balancing schemes.
Moreover, software load balancers are more flexible in assigning flow
identification policies. Finally, switch-based schemes are simpler as the whole
logic is expressed in a program (customized parser and match-action tables),
whereas server-based balancers might require additional coordination with routers
(e.g., tunneling).

B. Caching

B.1. Background

Modern applications (e.g., online banking, social networks) rely on key-value
stores. For example, retrieving a single web page may require thousands of storage
accesses. As the number of users increases to millions or billions, the need for
higher throughput and lower latency grows. A challenge of key-value stores is the
non-uniform access of items: popular items, referred to as “hot items”, receive
more queries than others. Furthermore, popular items may change rapidly due to
popular posts, limited-time offers, and trending events [110]. Fig. 13(a) shows a
typical skewed key-value store system which presents load imbalance among servers
storing key-value objects. The performance of such systems may present reduced
throughput and long latencies. For example, server 2 may add substantial latency
as a result of storing a hot item and being over-utilized, while server 1 is
under-utilized.

Fig. 13. (a) Traditional software-based caching. (b) Switch-based caching.

B.2. Literature Review

Fig. 13(b) illustrates a system where a programmable switch receives a query
before forwarding it to the server storing the key. The switch is used as an
“in-network cache”, where the hottest items are stored. When a read request for a
hot key is received, the switch consults its local table and returns the value
corresponding to that key. If the key is missed (i.e., the case for non-hot keys)
then the switch forwards the request to the appropriate server. When a write
request is received, the switch checks its local table and evicts the entry if the
key is stored there. It then forwards the request to the appropriate backend
server. A controller periodically collects statistics to update the cache with the
current hot items.

A noteworthy approach is NetCache [110], an in-network architecture that uses
programmable switches to store hot items and balance the load across storage
nodes. Similarly, Liu et al. [112] proposed IncBricks, a caching fabric for
key-value pairs with basic computing primitives in the data plane. Cidon et al.
[111] proposed AppSwitch, a packet switch that performs load balancing for
key-value storage systems. Signorello et al. [113] developed a preliminary
implementation of a Named Data Networking (NDN) instance using P4. Grigoryan et
al. [114] proposed a system that caches Forwarding Information Base (FIB) entries
(the most popular entries) in fast memory in order to minimize the TCAM
consumption and to avoid the TCAM overflow problem. Zhang et al. [115] proposed
B-Cache, a framework that bypasses the original processing pipeline to improve the
performance of caching. Vestin et al. [116] proposed FastReact, a system that
enables caching for industrial control networks. Finally, Woodruff et al. [117]
proposed P4DNS, an in-network cache for Domain Name System (DNS) entries.

B.3. Caching Schemes Comparison, Discussions, and Limitations

Table XVII compares the aforementioned caching schemes. Schemes can be separated
based on the type of data they aim to cache. For instance, NetCache, AppSwitch,
and IncBricks cache arbitrary key-value pairs, while NDN.p4 caches only NDN names.
Further, some schemes (e.g., NetCache, P4DNS, etc.) automatically index entries to
be cached based on their access frequencies, while others require the operators to
manually specify the entries. Another important distinction is whether the scheme
uses a custom protocol or not. For instance, switches in NetCache parse a custom
protocol that carries key-value pairs, while switches in P4DNS parse standard DNS
headers.

The main motivation of switch-based caching schemes is to address the performance
issues of server-based schemes. For instance, NetCache, which efficiently detects
hot key-value items and serves them in the data plane, was capable of handling two
billion queries per second for 64,000 items with 16-byte keys and 128-byte values.
Compared to commodity servers, NetCache improves the throughput by 3-10 times and
reduces the latency of 40% of queries by 50%. In addition to the throughput, the
latency of the queries is also a major metric to improve. In IncBricks, the
latency of requests is reduced by over 30% compared to client-side caching
systems.

Similarly, B-Cache aims at improving the performance by caching into a single
cache match-action table. The motivation behind B-Cache is that the performance of
the data plane
TABLE XVII
CACHING SCHEMES COMPARISON

Network accelerator Automatic Platform
Scheme Name Cached data Custom protocol Multi-level cache
needed entry indexing HW SW
[110] NetCache Key-value × ✓ ✓ × ✓
[111] AppSwitch Key-value × × ✓ × ✓
[112] IncBricks Key-value ✓ × ✓ × ✓
[113] NDN.p4 NDN names × ✓ ✓ × ✓
[114] PFCA Routes (FIB entries) × ✓ × ✓ ✓
[115] B-Cache FIB entries × ✓ × × ✓
[116] FastReact Sensor readings × × ✓ × ✓
[117] P4DNS DNS entries × ✓ × × ✓
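
The read, write, and controller-update behaviour that the caching literature review attributes to an in-network cache can be pictured with the short Python sketch below. It is only a model under stated assumptions: the class name SwitchCache, the dictionary standing in for the storage servers, the capacity parameter, and the "most queried keys" refresh policy are hypothetical and do not reproduce NetCache or any other scheme in Table XVII.

```python
from collections import Counter
from typing import Dict, Optional


class SwitchCache:
    """Hypothetical model of a switch used as an in-network cache."""

    def __init__(self, backend: Dict[str, str], capacity: int) -> None:
        self.backend = backend            # stands in for the storage servers
        self.capacity = capacity          # entries that fit in the switch table
        self.table: Dict[str, str] = {}   # cached hot items
        self.stats: Counter = Counter()   # per-key query counts

    def read(self, key: str) -> Optional[str]:
        self.stats[key] += 1
        if key in self.table:             # hot key: answered by the switch
            return self.table[key]
        return self.backend.get(key)      # miss: forwarded to the server

    def write(self, key: str, value: str) -> None:
        self.table.pop(key, None)         # evict the cached entry, if present
        self.backend[key] = value         # then forward the write to the server

    def controller_update(self) -> None:
        """Periodic step: reload the table with the currently hottest keys."""
        ranked = self.stats.most_common(self.capacity)
        self.table = {k: self.backend[k] for k, _ in ranked if k in self.backend}
        self.stats.clear()


store = {"a": "1", "b": "2", "c": "3"}
cache = SwitchCache(store, capacity=1)
for _ in range(5):
    cache.read("a")
cache.read("b")
cache.controller_update()
assert list(cache.table) == ["a"]         # "a" was the hottest key
cache.write("a", "9")
assert "a" not in cache.table and cache.read("a") == "9"
```

Evicting on write keeps the sketch simple: a stale value is never served, at the cost of a miss until the next controller update.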

decreases significantly as the complexity of the P4 program and the packet
processing pipeline grows. When a match occurs, the packet bypasses the original
pipeline, making the performance of caching independent of the pipeline length.
Note however that this system was evaluated on a software switch (BMv2), and it is
not certain whether this design is always feasible on hardware targets.

Other caching schemes are more targeted for specific applications. As examples,
FastReact enables caching for industrial control networks, while P4DNS caches DNS
entries. Note that some schemes require a custom protocol to operate (e.g.,
NetCache), while others (e.g., P4DNS) work with standard protocols (e.g., DNS).
Finally, some schemes offer multi-level caching (e.g., level-1 and level-2
caches).

B.4. Comparison between Switch-based and Server-based Caching

Table XVIII compares the switch-based versus server-based caching schemes. The
throughput when data is cached on the switch is an order of magnitude larger than
that of general purpose servers. The latency is also reduced by 50%, and most of
it is induced by the client. Switch-based caching solves the load imbalance
problem and is simpler as the whole logic is expressed in a program. Server-based
caching on the other hand is more flexible regarding cache policies, as well as
key, value, and table sizes.

TABLE XVIII
SWITCH-BASED AND SERVER-BASED CACHING

Feature | Switch-based | Server-based
Throughput | Higher (e.g., NetCache, 2BQPS¹) | Lower; 0.2BQPS
Latency | Lower (e.g., NetCache, 7 μs, mostly caused by the client) | Higher; 15 μs
Key size | Not flexible (limited by packet header length) | Arbitrary
Value size | Not flexible (limited by the amount of state accessed when processing a packet) | Arbitrary
Load imbalance | No | Yes
System complexity | Simpler; it requires a customized parser, match-action tables | More complex; it requires coordination with routers, tunneling (e.g., GRE encapsulation)
Table size | Limited by RAM | Arbitrary
Cache policies | Limited by table size | Arbitrary
¹ BQPS: Billion Queries Per Second.

C. Telecommunication Services

C.1. Background

The evolution of the current mobile network to the emerging Fifth-Generation (5G)
technology implies significant improvements of the network infrastructure. Such
improvements are necessary in order to meet the Key Performance Indicators (KPIs)
and requirements of 5G [248]. 5G requires ultra-reliable low latency and jitter
(microseconds-scale). As programmable switches fulfill these requirements,
researchers are investigating the idea of offloading telecom-oriented VNFs running
on x86 servers to programmable hardware.

C.2. Literature Review

Ricart-Sanchez et al. [118] proposed a system that uses a programmable data plane
to enhance the performance of the data path from the edge to the core network,
also known as the backhaul, in a 5G multi-tenant network. The same authors [119]
proposed a 5G firewall that detects, differentiates and selectively blocks 5G
network traffic in the backhaul network.

In parallel, attempts such as TurboEPC [120] proposed offloading a subset of user
state in the mobile packet core to programmable switches in order to perform
signaling in the data plane. Similarly, Singh et al. [121] designed a P4-based
element of the 5G Mobile Packet Core (MPC) that merges the functions of both the
Serving Gateway (SGW) and the Packet Data Network Gateway (PGW). Additionally,
Voros et al. [122] proposed a hybrid next-generation NodeB (gNB) that combines the
capabilities of P4 switches and the external services built on top of NIC
accelerators (DPDK).

Another important function required in 5G is handover. Palagummi et al. [123]
proposed SMARTHO, a system that uses programmable switches to perform handover
efficiently in a wireless network.

Finally, Kfoury et al. [124] proposed a system for offloading conversational media
traffic (e.g., Voice over IP (VoIP), Voice over LTE (VoLTE), WebRTC, media
conferencing, etc.) from x86-based relay servers to programmable switches. While
this system is not tailored for 5G networks specifically, it provides significant
performance improvements for Over-The-Top (OTT) VoIP systems.
TABLE XIX
TELECOM SCHEMES COMPARISON

Scheme | Core idea | Deployment | 5G-centric | Reported latency scale | Concurrent users evaluated | Implementation (HW/SW)
[118] | Enhances the data path in 5G multi-tenants | Backhaul | ✓ | Microseconds | N/A | ✓
[119] | Implements a 5G firewall in the switch | Backhaul | ✓ | Microseconds | 1K | ✓
[123] | Provides smart handover for mobile UE | Between CU and DU | ✓ | N/A | N/A | ✓
[121] | Offloads MPC user plane functions to switch | Core network | ✓ | Microseconds | 65K-1M | ✓
[124] | Offloads media traffic relay to switch | Edge | × | Nanoseconds | 65K-1M | ✓
[120] | Performs signaling in the data plane | Core | ✓ | Milliseconds | 65K | ✓

Fig. 14. CDF of delay and packet loss rate of 900 offloaded VoIP calls [124].

C.3. Telecom Schemes Comparison, Discussions, and Limitations

Table XIX compares the aforementioned telecom schemes on P4. In general, all
schemes aim at offloading various functionalities originally executed on x86-based
servers to the data plane. Such a strategy improves the network performance (e.g.,
latency, throughput) significantly and aims at achieving the KPIs of 5G. For
instance, the experiments conducted in [118] show that the attained QoS metrics
meet the latency requirements of 5G. Similarly, the results reported in [119]
demonstrate that the system meets the reliability KPI of 5G, which states that the
network should be secured with zero downtime. Furthermore, the results reported in
[123] show that there are 18% and 25% reductions in handover time with respect to
legacy approaches, for two- and three-handover sequences, respectively. The system
in [124] emulates the behavior of the relay server which is primarily used to
solve the NAT problem. Results show that ultra-low latency and jitter
(nanoseconds-scale) are achieved with programmable switches as opposed to
x86-based relay servers where the latency and the jitter are in the
milliseconds-scale (see Fig. 14). The solution also improves the packet loss rate,
CPU usage of the server, Mean Opinion Score (MOS), and can scale to more than one
million concurrent sessions, with additional resources to spare in the switch.

Other systems allow offloading the signaling part to the data plane. For instance,
TurboEPC offloads messages that constitute a significant portion of the total
signaling traffic in the packet core, aiming at improving throughput and latency
of the control plane’s processing.

C.4. Switch-based and Server-based Media Relay

Offloading media traffic from general purpose servers to programmable switches
greatly improves the quality of service. Table XX shows the metrics achieved when
media is relayed by a relay server versus when it is relayed by the switch, based
on [124]. The results show that the latency, jitter and packet loss rates are
significantly lower when media is being relayed by the switch. Not only are the
QoS metrics improved, but also the maximum number of concurrent sessions. With
Tofino 3.2Tbps, more than one million sessions were accommodated in the switch’s
SRAM, with additional resources to spare for other functionalities. On the other
hand, only one thousand sessions per CPU core were handled in the server-based
relay, before QoS starts to degrade. The drawback of offloading media traffic to
the switch is that some functionalities are complex to implement in the data plane
(e.g., media mixing for conference calls).

TABLE XX
SWITCH-BASED AND SERVER-BASED MEDIA RELAYING

Metric | Switch-based relay [124] | Server-based relay
Relay server CPU | Lower; negligible with 900 active sessions | Higher; averages at 50% for 900 active sessions
Latency | Lower; almost constant at 440ns with 900 sessions | Higher; from 0.2ms to 17ms with 900 sessions
Jitter | Lower; negligible with 900 active sessions | Higher; ranges from 100us to 3ms
Packet loss | None contributed by the switch | High; increases as the number of sessions increases
Maximum number of sessions | Higher; more than one million with additional resources to spare | Lower; thousand sessions per core before QoS degrades
Mean opinion score (MOS) | Higher; maximum MOS (4.4) with 1800 concurrent sessions | Lower; for 1800 sessions, 50% of sessions have a MOS score below 3.7
Table size | Limited by SRAM | Arbitrary
Additional functions | Limited to relaying | Arbitrary; e.g., media mix, lawful interception

D. Publish/Subscribe

D.1. Background

Emerging network architectures (e.g., [249]) promote content-centric networking, a
model where the addressing scheme is based on named data rather than named hosts.
In other words, users specify the data they are interested in instead of
specifying where to get the data from. A branch of content-centric networking is
the publish/subscribe (pub/sub) model. The goal of the model is to provide a
scalable and robust communication channel between producers and consumers of
information. A large fraction of today’s Internet
applications follow the publish/subscribe paradigm. With the IoT, this paradigm
proliferated as sensors/actuators are often deployed in dynamic environments.
Other applications that use the pub/sub model include instant messaging, Really
Simple Syndication (RSS) feeds, presence servers, telemetry, and others. Current
approaches to content-centric networking use software-based middleboxes, which
limits the performance in terms of throughput and latency. Recent works are
leveraging programmable switches to overcome the performance limitations of
software-based pub/sub middleboxes.

D.2. Literature Review

Jepsen et al. [125] presented “packet subscription”, a new abstraction that
generalizes the forwarding rules by evaluating stateful predicates on input
packets. Wernecke et al. [126, 127] presented distribution strategies for
content-based publish/subscribe systems using programmable switches. The authors
described a system where the notification distribution tree (i.e., the subscribers
that should receive the notification) is encoded in the packet headers, similar to
multicast source routing. Similarly, Kundel et al. [128] implemented a
publish/subscribe system on programmable switches. The system is flexible in
encoding attributes/values in packet headers.

D.3. Publish/Subscribe Schemes Comparison, Discussions, and Limitations

Table XXI compares the aforementioned pub/sub schemes. In [125], the authors
described a compiler that generates P4 tables from logical predicates. It utilizes
a novel algorithm based on Binary Decision Diagrams (BDD) to preserve switch
resources (TCAM and SRAM). This feature simplifies the configuration as operators
do not need to manually install table entries in switches, which is a cumbersome
process when the topology is large. The prototype was evaluated on a hardware
switch (Tofino), and the authors considered the Nasdaq’s ITCH protocol as the
pub/sub use case. Results show that the system was able to process messages at
line rate while using the full switch capacity (6.5 Tbps). The other systems
considered different encoding strategies. For example, in [126, 127], the authors
described a system where the notification distribution tree (i.e., the subscribers
that should receive the notification) is encoded in the packet headers, similar to
multicast source routing. The advantage of storing the distribution tree in the
packet header instead of storing it in the switch is that rules in the switches do
not need to be updated when subscriptions change. Another distinction between the
pub/sub systems is whether they require a dedicated language to describe the
subscriptions, and the configuration complexity.

TABLE XXI
PUBLISH/SUBSCRIBE SCHEMES COMPARISON

Scheme | Dedicated language | Config complexity | Encoding structure | Platform (HW/SW)
[125] | ✓ | Medium | Hierarchical (BDD) | ✓
[126], [127] | × | High | Distribution tree | ✓
[128] | × | High | Attribute-value pair | ✓

D.4. Comparison between Switch-based and Server-based Pub/Sub Systems

Fig. 15 illustrates the operations of traditional software-based pub/sub systems
(a) and switch-based pub/sub systems (b). Latency and its variations are
significantly reduced when the switch acts as a pub/sub broker. However, the size
of memory in the switch limits the amount of data to be distributed. Moreover,
implementing features provided by software-based pub/sub systems such as QoS
levels, session persistence, message retaining, last will and testament (notify
users after a device disconnects) in hardware is challenging.

E. Summary and Lessons Learned

Programmable switches offer the flexibility of customizing the data plane to
enable middlebox functions. A middlebox can be defined as a device that performs
functions that are beyond the standard capabilities of routers and switches. A
number of works demonstrated the implementation of middlebox functions such as
caching, load balancing, offloading services, and others on programmable switches.
The majority of load balancing schemes took advantage of the stateful nature of
the data plane to store the load balancing connection table. Future work should
consider minimizing the storage requirement to
Fig. 15. (a) Traditional software-based pub/sub architecture. (b) Pub/sub implemented on a programmable switch.
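
The broker role that Fig. 15(b) assigns to the switch can be pictured, very loosely, as matching the attribute/value pairs carried by a packet against stored subscriptions and replicating the packet to every matching port. The Python sketch below is illustrative only: SwitchBroker, the attribute names, and the use of Python callables as predicates are assumptions of this example, whereas the surveyed systems compile subscriptions into match-action tables or encode the distribution tree in the packet header.

```python
from typing import Any, Callable, Dict, List, Set, Tuple

Packet = Dict[str, Any]                 # attribute/value pairs carried in the header
Predicate = Callable[[Packet], bool]    # a subscription over those attributes


class SwitchBroker:
    """Hypothetical model of a switch acting as the pub/sub broker."""

    def __init__(self) -> None:
        self.rules: List[Tuple[Predicate, int]] = []   # (subscription, output port)

    def subscribe(self, port: int, predicate: Predicate) -> None:
        self.rules.append((predicate, port))

    def publish(self, packet: Packet) -> Set[int]:
        # Every port whose subscription matches receives a copy of the packet.
        return {port for predicate, port in self.rules if predicate(packet)}


broker = SwitchBroker()
broker.subscribe(1, lambda p: p.get("symbol") == "ABC")
broker.subscribe(2, lambda p: p.get("symbol") == "ABC" and p.get("price", 0) > 100)
broker.subscribe(3, lambda p: p.get("symbol") == "XYZ")

assert broker.publish({"symbol": "ABC", "price": 120}) == {1, 2}
assert broker.publish({"symbol": "ABC", "price": 90}) == {1}
```

Publishers never name a subscriber here; the set of receivers is derived from the packet contents alone, which is the property the pub/sub model relies on.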
improve the scalability, supporting flow priority, and develop- Consensus protocol (e.g., Paxos)
ing further variations for novel multipath transport protocols running the network

such as multipath QUIC.


The switch can also act as an “in-network cache” that serves
hot items at line rate. Some schemes indexes entries auto- Proposer Acceptor Learner
matically, while others require operator’s intervention. Future
endeavours could investigate items compression, communi-
cation minimization, priority-based caching, and aggregated Coordinator Acceptor
computations caching (e.g., cache the average of hot items).
An additional middlebox application is offloading telecom
functions. The switch is capable of relaying media traffic and Proposer Learner
Acceptor
user plane functions. Future work could investigate scalability
improvement (i.e., to accommodate more concurrent sessions),
offloading signalling traffic, and in-network media mixing. Fig. 16. Consensus protocol in the data plane model [130]. An application
sends a request to the proposer which resides on a commodity server. The
Finally, the switch can also act as a broker to distribute proposer then creates a Paxos message and sends it to the coordinator, running
packets in a publish/subscribe system. Future work could in- in the data plane. The role of the coordinator is be the broker of requests on
vestigate reliability insurance (e.g., packet deliver guarantee), behalf of proposers. Afterwards, the acceptor, which also runs on the data
plane, receives the messages from the coordinator, and ensures consistency
message retaining, and QoS differentiation (e.g., QoS features through the system by deciding whether to accept/reject proposals. Finally,
of MQTT). learners provide replication by learning the result of consensus.

A.2. Literature Review


IX. N ETWORK -ACCELERATED C OMPUTATIONS

Programmable switches offer the flexibility of offloading some upper-layer logic to the ASIC, also referred to as in-network computation. Since switch ASICs are designed to process packets at terabits per second rates, in-network computation can result in an order of magnitude or more of improvement in throughput when compared to applications implemented in software. The potential performance improvement has motivated programmers to build in-network computation for different purposes, including consensus, machine learning acceleration, stream processing, and others.

The idea of delegating computations to networking devices was first conceived with Active Networks [250], where packets are replaced with small programs (“capsules”) that are executed in each traversed device along the path. However, traditional network devices were not capable of performing computations. With the recent advancements in programmable switches, performing computations is now a possibility.

A. Consensus

A.1. Background
Consensus algorithms are common in distributed systems where machines collectively achieve agreement on a single data value, or on the current state of a distributed system. Reliability is achieved with consensus algorithms, even in the presence of some malicious or faulty processes. Consensus algorithms are used in applications such as blockchain [251], load balancing, clock synchronization, and others [252]. Latency has always been a bottleneck with consensus algorithms as protocols require expensive coordination on every request. Lately, researchers have started investigating how programmable switches can be leveraged to operate consensus protocols in order to increase throughput and decrease latency. Fig. 16 shows a consensus model in the data plane.

A.2. Literature Review
Li et al. [129] proposed Network-Ordered Paxos (NOPaxos), a P4-based Paxos [253] system that applies replication in the data center to reduce the latency imposed by communication overhead. Similarly, Dang et al. [130] presented an implementation of Paxos using P4 on the data plane. Dang et al. [134] also proposed Partitioned Paxos, a P4-based system that separates the two aspects of Paxos, namely, agreement and execution, and optimizes them separately. Furthermore, the same authors also proposed P4xos [136], a P4-based solution that executes Paxos logic directly in switch ASICs, without strengthening assumptions about the network (e.g., ordered delivery, packet loss, etc.). Jin et al. [133] proposed NetChain, a variant of the Paxos protocol that provides scale-free sub-RTT coordination in data centers. It is strongly consistent, fault-tolerant, and presents an in-network key-value store.

Another line of research focused on consensus algorithms other than Paxos. Li et al. [131] proposed Eris, a P4-based solution that avoids replication and transaction coordination overhead. It processes a large class of distributed transactions in a single round trip, without any additional coordination between shards and replicas. Sakic et al. [135] proposed P4 Byzantine Fault Tolerance (P4BFT), a system that is based on BFT-enabled SDN, where controllers act as replicated state machines. The system offloads the comparison of controllers’ outputs required for correct BFT operations to programmable switches. Finally, Han et al. [132] offloaded part of the Raft consensus algorithm [254] to programmable switches in order to improve its performance. The authors selected Raft because it has been formally proven to be safer than Paxos, and it has been implemented on popular SDN controllers.
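As a rough illustration of what executing Paxos logic in the data plane involves, the sketch below models a single Paxos acceptor in Python, with dictionaries standing in for per-instance switch registers. The class, method, and field names are hypothetical and are not taken from P4xos, NOPaxos, or any other surveyed system; a real P4 implementation would use fixed-size register arrays and packet header fields instead.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Acceptor:
    """Toy Paxos acceptor; dicts stand in for per-instance switch registers."""
    # instance number -> highest round promised so far
    promised: Dict[int, int] = field(default_factory=dict)
    # instance number -> (round, value) last accepted
    accepted: Dict[int, Tuple[int, bytes]] = field(default_factory=dict)

    def on_prepare(self, inst: int, rnd: int) -> Optional[tuple]:
        """Phase 1: promise only if rnd is higher than any earlier promise."""
        if rnd > self.promised.get(inst, -1):
            self.promised[inst] = rnd
            # Reply with the round and any previously accepted (round, value).
            return rnd, self.accepted.get(inst)
        return None  # a switch would simply drop the packet

    def on_accept(self, inst: int, rnd: int, value: bytes) -> bool:
        """Phase 2: accept unless a higher round has already been promised."""
        if rnd >= self.promised.get(inst, -1):
            self.promised[inst] = rnd
            self.accepted[inst] = (rnd, value)
            return True
        return False


acceptor = Acceptor()
acceptor.on_prepare(inst=0, rnd=1)
acceptor.on_accept(inst=0, rnd=1, value=b"x")
```

Under these assumptions, the per-message work reduces to one comparison and one or two register updates, which is the kind of operation a match-action pipeline can plausibly perform per packet.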

TABLE XXII
CONSENSUS SCHEMES COMPARISON

Scheme | Name | Algo. | Weak assumpt. | Full proto. | Platform (HW, SW)
[129] NOPaxos Paxos × ×
[130] N/A Paxos ×
[131] Eris Novel
[132] N/A Raft ×
[133] NetChain Novel ×
[134] Partitioned Paxos Paxos
[135] P4BFT BFT
[136] P4xos Paxos

A.3. Consensus Schemes Comparison, Discussions, and Limitations
Table XXII compares the aforementioned consensus schemes. In general, consensus algorithms such as Paxos are complex and cannot be easily implemented with the constraints of the data plane. For instance, [130] only implemented phase-2 logic of Paxos leaders and acceptors. Similarly, NetChain uses a variant of the Paxos protocol that divides it into two parts: steady state and reconfiguration. This variant is known as Vertical Paxos, and is relatively simple to implement in the network as its two parts can be mapped to the control plane and the data plane.

Unordered and completely asynchronous networks require the full implementation and complexity of Paxos. NOPaxos suggests that the communication layer should provide a new Ordered Unreliable Multicast (OUM) primitive; that is, there is a guarantee that receivers will process the multicast messages in the same order, though messages can be lost. NOPaxos relies on the network to deliver ordered messages in order to avoid coordination entirely. Dropped packets, on the other hand, are handled through coordination with the application.

Other systems like Eris avoid replication and transaction coordination overhead. The main contribution of Eris compared to NOPaxos is that it establishes a consistent ordering across messages delivered to many destination shards. Eris also allows receivers to detect dropped messages.

Partitioned Paxos [134] improved the existing systems. The motivation behind Partitioned Paxos is that existing network-accelerated approaches do not address the problem of how a replicated application can cope with the high rate of consensus messages; NOPaxos only processes 13,000 transactions per second since it presents a new bottleneck at the host side. Other systems (e.g., NetChain) are specialized replication services and cannot be used by any off-the-shelf application.

Finally, P4xos improves both the latency and the tail latency. The throughput is also improved compared to hardware servers, which require additional memory management and safety features (e.g., user and kernel separation). P4xos was implemented on a hardware switch (Tofino), and results show that it reduces the latency by three times compared to traditional approaches, and it can process over 2.5 billion consensus messages per second (four orders of magnitude improvement).

A.4. Network-Assisted and Legacy Consensus Comparison
Consensus algorithms have been traditionally implemented as applications on general-purpose CPUs. Such an architecture inherently induces latency overhead (e.g., a Paxos coordinator has a minimum latency of 96 µs [255]). There are numerous performance benefits gained when consensus algorithms are implemented in programmable devices. When consensus messages are processed on the wire, the latency significantly decreases (a Paxos coordinator had a minimum latency of 340 ns [255]). Moreover, when compared to legacy consensus deployments, network-assisted consensus requires fewer hop traversals.

B. Machine Learning

B.1. Background
The remarkable success of Machine Learning (ML) today has been enabled by a synergy between development in hardware and advancements in machine learning techniques. Increasingly complex ML models are being developed to handle the large size of datasets and to accelerate the training process. Hardware accelerators (e.g., GPU, TPU) were introduced to speed up the training. These accelerators are installed in large clusters and collaborate through distributed training to exploit parallelism. Nevertheless, training ML models is time consuming and can last for weeks depending on the complexity and the size of the datasets. Researchers have traditionally investigated methods to accelerate the computation process, but not the communication in distributed learning. With the advancements in programmable switches, it is now possible to accelerate the ML training process through the network.

B.2. Literature Review
The literature can be divided into two main categories: accelerating training and accelerating inference. Sapio et al. [137] proposed DAIET, a system that performs in-network data aggregation to accelerate applications that follow a partition/aggregate workload pattern. Similarly, Yang et al. [140] proposed SwitchAgg, a system that performs similar functions as DAIET, but with a higher data reduction rate. Perhaps the most significant work in the training acceleration literature is SwitchML [141], a system that performs in-network aggregation for ML model updates sent from workers on external servers.

Other schemes aim to speed up the inference process by leveraging programmable switches. Siracusano et al. [138] proposed N2Net, a system that runs simplified neural networks (NN) on programmable switches. Sanvito et al. [139] proposed BaNaNa Split, a solution that evaluates the conditions under which programmable switches can act as CPUs’ co-processors for the processing of Neural Networks (e.g., CNN). Finally, Xiong et al. [142] proposed IIsy, a system that enables programmable switches to perform in-network classification. The system maps trained ML classification models to match-action pipelines.
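One way to picture mapping a trained classifier to a match-action pipeline is to flatten a decision tree into range-match entries. The following Python sketch does this under simplifying assumptions: features are 16-bit header fields, the tree encoding and the names `compile_tree`, `classify`, and `TREE` are hypothetical, and the encoding is not claimed to be the one used by IIsy.

```python
MAX_FIELD = 0xFFFF  # assumption: every feature is a 16-bit header field

# Hypothetical trained tree: ("split", feature, threshold, left, right) or ("leaf", label).
# The left branch is taken when feature <= threshold.
TREE = ("split", "pkt_len", 128,
        ("split", "dst_port", 1023, ("leaf", "control"), ("leaf", "iot")),
        ("leaf", "bulk"))


def compile_tree(node, ranges=None):
    """Flatten a decision tree into a list of (range-match entry, label) pairs."""
    ranges = dict(ranges or {})
    if node[0] == "leaf":
        return [(ranges, node[1])]
    _, feat, thr, left, right = node
    lo, hi = ranges.get(feat, (0, MAX_FIELD))
    entries = []
    if lo <= min(hi, thr):            # non-empty range for feat <= thr
        entries += compile_tree(left, {**ranges, feat: (lo, min(hi, thr))})
    if max(lo, thr + 1) <= hi:        # non-empty range for feat > thr
        entries += compile_tree(right, {**ranges, feat: (max(lo, thr + 1), hi)})
    return entries


def classify(table, pkt):
    """Emulate a range-match lookup: return the label of the first matching entry."""
    for ranges, label in table:
        if all(lo <= pkt[f] <= hi for f, (lo, hi) in ranges.items()):
            return label
    return None


table = compile_tree(TREE)
label = classify(table, {"pkt_len": 64, "dst_port": 5683})
```

Each leaf becomes one table entry, so in this sketch the number of entries grows with the number of leaves, which hints at why switch memory limits the size of the models that can be mapped.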

TABLE XXIII
MACHINE LEARNING SCHEMES COMPARISON

Objective Evaluated Platform


Scheme Name Core idea Quantization
Inference Training model/algorithm HW SW
In-network computation for
[137] DAIET ×  SGD, Adam N/A 
partition/aggregate work pattern
In-network classification using
[138] N2Net  × Binary neural networks  × ×
BNN
NN processing division between
[139] BaNaNa Split  × Binary neural networks  ×
switches and CPUs
In-network aggregation without
[140] SwitchAgg ×  MapReduce-like system N/A ×
modifying the network
Accelerates distributed parallel
[141] SwitchML ×  Synchronous SGD  ×
training in ML
Maps trained ML classification Decision tree, SVM,
[142] IIsy  × × ×
models to match-action pipeline naïve bayes, k-means
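The training-oriented schemes in Table XXIII rely on the switch summing the updates sent by workers, and the Quantization column relates to the fact that ML updates are floating point values while the switch adds integers. A minimal Python sketch of this idea follows, assuming a fixed scaling factor and a single chunk of the update vector; the names `SwitchAggregator`, `quantize`, `dequantize_mean`, and `SCALE` are illustrative and do not reproduce the protocol of SwitchML or any other surveyed system.

```python
from typing import List, Optional

SCALE = 1 << 16  # assumed fixed-point scaling factor agreed on by all workers


def quantize(grad: List[float]) -> List[int]:
    """Worker side: convert a float gradient chunk to fixed-point integers."""
    return [round(g * SCALE) for g in grad]


class SwitchAggregator:
    """Integer-only accumulator standing in for a pool of switch registers."""

    def __init__(self, num_workers: int, chunk_size: int):
        self.num_workers = num_workers
        self.slots = [0] * chunk_size
        self.seen = 0

    def on_packet(self, chunk: List[int]) -> Optional[List[int]]:
        """Add one worker's chunk; return the sum once all workers contributed."""
        for i, v in enumerate(chunk):
            self.slots[i] += v  # integer addition only
        self.seen += 1
        if self.seen == self.num_workers:
            result = self.slots
            # Free the slots so the next chunk can reuse them (streaming).
            self.slots, self.seen = [0] * len(result), 0
            return result
        return None


def dequantize_mean(agg: List[int], num_workers: int) -> List[float]:
    """Worker side: convert the aggregated integers back to a mean gradient."""
    return [v / SCALE / num_workers for v in agg]


switch = SwitchAggregator(num_workers=2, chunk_size=2)
switch.on_packet(quantize([0.5, -1.0]))
aggregated = switch.on_packet(quantize([1.5, 3.0]))
mean_update = dequantize_mean(aggregated, 2)
```

In this sketch each worker sends one chunk and receives one aggregated chunk, and the slots are released as soon as the sum is complete, so the accumulator never holds a whole model update at once.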

B.3. ML Schemes Comparison, Discussions, and Limitations
Table XXIII compares the aforementioned ML schemes. While the goal of DAIET is to discuss what computations the network can perform, the authors did not design a complete system, nor did they address the major challenges of supporting ML applications. Moreover, their proof-of-concept presented a simple MapReduce application on a software switch, and it is not certain whether the system can be implemented on a hardware switch. Compared to DAIET, SwitchAgg does not require modifying the network architecture, and offers better processing abilities with a significant data reduction rate. Moreover, SwitchAgg was implemented on an FPGA, and the results show that the job completion time can be reduced by as much as 50%.

SwitchML extended the literature on accelerating ML model training by providing a complete implementation and evaluation on a hardware switch. A commonly used training technique for deep neural networks is synchronous stochastic gradient descent [257]. In this technique, each worker has a copy of the model that is being trained. The training is an iterative process where each iteration consists of: 1) reading a sample of the dataset and locally performing some computation-intensive learning using the worker’s accelerators, which yields a gradient vector; and 2) updating the model by computing the mean of all gradient vectors. The main motivation of this idea is that the aggregation is computationally cheap (takes 100ms), but is communication-intensive (transfers hundreds of megabytes each iteration). SwitchML uses computation on the switch to aggregate model updates in the network as the workers are sending them (see Fig. 17). An advantage is that there is minimal communication; each worker sends its update vector and receives back the aggregated updates. The design challenges of this system include: 1) the limitation of storage available on the switch, addressed by using a streaming approach; 2) switches cannot perform many computations per packet, addressed by partitioning the work between the switch and the workers; 3) ML systems use floating point numbers, addressed by quantization approaches; and 4) failure recovery is needed to ensure correctness. The system is implemented on a hardware switch (Tofino), and results show that the system speeds up training by up to 300% compared to existing distributed learning approaches.

With respect to in-network inference, it is challenging to implement full-fledged models as they require extensive computations (e.g., multiplications and activation functions). A simple variation such as the Binary Neural Network (BNN)

[Figure 17 panels: (a) Legacy switch, all-to-all communication; fast GPUs make the network the bottleneck. (b) Programmable switch, in-network aggregation; each worker sends its update vector and receives the aggregated updates.]

Fig. 17. (a) ML model updates in legacy networks. The aggregation process is communication-intensive and follows an all-to-all communication pattern. This means that the workers should receive all the other workers’ updates. Since accelerators on end-hosts are becoming faster, the network should speed up so that it does not become the bottleneck. Therefore, it is expensive to deploy additional accelerators since it requires re-architecting the network. The red arrow in (a) shows that the bottleneck source is the network. (b) ML model updates accelerated by the network. Aggregation is performed in the network by the programmable switches while the workers are sending them. The workers do not need to obtain the updates of all other workers, hence there is minimal communication. They only obtain the aggregated model from the switch. The red arrow in (b) shows that the bottleneck source is the worker rather than the network [141, 256].

TABLE XXIV
SWITCH-BASED AND SERVER-BASED ML APPROACHES

Feature | Inference, switch-based | Inference, server-based | Training, switch-based | Training, server-based
Speed | Faster, inference at line rate | Slower | Faster, aggregations at line rate | Slower; aggregations on an x86 server
Complex computations support | Lower, basic arithmetic and bitwise logic function | Higher | Lower | Higher
Communication overhead | Low | Low | Lower, switch is the centralized aggregator | Higher, updates are exchanged with a remote aggregator
Storage | Lower | Higher | Lower, update is not stored entirely at once | Higher
Encrypted traffic | Difficult | Easy | Difficult | Easy

only requires bitwise logic functions (e.g., XNOR, POPCNT, SIGN). N2Net provides a compiler that translates a given BNN model to a switching chip’s configuration (P4 program). The authors did not mention on which platform N2Net was evaluated; however, based on their evaluations, they concluded that a BNN can be implemented on most current switching chips, and with small additions to the chip design, more complex models can be implemented. IIsy studied other ML models. The authors of IIsy acknowledged that the work is limited in scope as it does not address popular ML algorithms such as neural networks. Furthermore, it is bound to the type of features it can extract (i.e., packet headers), and has accuracy limitations. IIsy tries to find a balance between the limited resources on the switch and the classification accuracy. Finally, BaNaNa Split took a different approach by partitioning the processing of NN to offload a subset of layers from the CPU to a different processor. Note that the solution is far from complete, and the authors evaluated a single binary fully connected layer with 4096 neurons using a network processor-based SmartNIC.

C. Comparison between Switch-based and Server-based ML
Table XXIV shows a comparison between switch-based and server-based ML approaches. ML works that were extracted from the literature can be divided into two main categories: 1) expedited inference in the data plane, and 2) accelerated training in the network. The main advantage of switch-based over server-based inference is the ability to execute at line rate, and hence to provide faster results to the clients. Performing complex computations in the switch is achieved through estimations, and hence is limited. Moreover, the SRAM capacity of the switch is small, impeding the storage of large models. Such limitations are not problematic with server-based inference approaches.

Distributed training can be significantly faster when aggregations are offloaded to a centralized switch. However, due to the small capacity of the switch memory, it is not possible to store the whole model update at once. Additionally, encrypted traffic remains a challenge when inference or training is handled by the switch.

D. Summary and Lessons Learned
Accelerating computations by leveraging programmable switches is becoming a trend in data centers and backbone networks. Although switches only support basic and limited operations, it was shown in the literature that the performance of various tasks (e.g., consensus, training models in machine learning) could significantly improve if computations are delegated to the network.

The majority of the in-network consensus works aim at implementing common consensus protocols such as Paxos and Raft in the data plane. Due to the hardware constraints, current schemes implement only simplified variations of the protocols. Future work could investigate implementing novel consensus algorithms that diverge from the existing complex ones. Further, such schemes should encompass failure recovery mechanisms.

Another interesting in-network application is ML training/inference acceleration. The literature has shown that significant performance improvements are attained when the switch aggregates model updates or classifies new samples. Future systems could explore developing further ML models for various tasks such as classification, regression, clustering, etc.

In addition to the aforementioned categories, data plane programming is being used for stream processing [143, 144], parallel processing [145], string searching [146], erasure coding [147], in-network lock managers [148], database query acceleration [149], in-network compression [150], and computer vision offloading [151].

X. INTERNET OF THINGS (IOT)

The Internet of Things (IoT) is a novel paradigm in which pervasive devices equipped with sensors and actuators collect physical environment information and control the outside world. IoT applications include smart water utilities, smart grid, smart manufacturing, smart gas, smart metering, and many others. Typical IoT scenarios entail a large number of devices periodically transmitting their sensors’ readings to remote servers. Data received on those collectors is then processed and analyzed to assist organizations in making data-driven, intelligent decisions.

A. Aggregation

A.1. Background
Since IoT devices are constrained in size and processing capabilities, they typically generate packets that carry small payloads (e.g., temperature sensor readings). While such packets are small in size, their headers occupy a significant

TABLE XXV
IOT AGGREGATION SCHEMES COMPARISON

Evaluation Constraints Line rate Platform


Scheme
Same Payload Number
Theoretical Implementation Aggregation Disaggregation HW SW
payload size <= 16 bytes of packets
[152]    8  × 
[153]  × × Up to MTU   
[154]    8 × × 

portion of the total packet size. For instance, Sigfox Low-Power Wide Area Network (LPWAN) [258] can support a maximum payload size of 12 bytes per packet. The header overhead is 42 bytes (Ethernet 14 bytes + IP 20 bytes + UDP 8 bytes), which represents approximately 78% of the total packet size. When numerous devices continuously transmit IoT packets, a significant percentage of network bandwidth is wasted on transmitting these headers. Packet aggregation is a mechanism in which the payloads of small packets are aggregated into a single larger packet in order to mitigate the bandwidth overhead caused by transmitting multiple headers. Legacy packet aggregation mechanisms operate on the CPUs of servers or on the control plane of switches [259–264]. While legacy mechanisms reduce the overhead of packet headers, they unquestionably increase the end-to-end latency and decrease the throughput. As a result, some studies have suggested aggregating only packets that are not real-time.

A.2. Literature Review
Wang et al. [152] presented an approach where small IoT packets are aggregated into a larger packet in the switch data plane (see Fig. 18). The goal of performing this aggregation is to minimize the bandwidth overhead of packets’ headers. The same authors [153] extended this work to solve some constraints related to the payload size and the number of aggregated packets. Similarly, Madureira et al. [155] proposed IoTP, a layer-2 communication protocol that enables the aggregation of IoT data in programmable switches. The solution gathers network information that includes the Maximum Transmission Unit (MTU), link bandwidths, underlying protocol, and delays. These properties are used to empower the aggregation algorithm.

A.3. Aggregation Schemes Comparison, Discussions, and Limitations
Table XXV compares the aforementioned IoT aggregation schemes. [152] and [153] operate in the same way. Upon receiving a packet, the P4 switch parses its headers and identifies whether the packet is an IoT packet. If the packet is identified as an IoT packet, the switch parses and extracts the payload. Afterwards, the payload is stored in switch registers along with some other metadata, and the packet is dropped. Once packets are aggregated, the resulting packet is sent across the WAN to reach the remote server. Before the packet reaches the server, it is disaggregated by another P4 switch situated close to the server, and several packets identical to the original ones are generated. An important observation is that the aggregation/disaggregation processes are transparent to both the IoT devices and the servers; hence, no modifications are required on either end. The main advantages of [153] over [152] are: 1) packets can have different payload sizes; 2) the payload size is no longer limited to 16 bytes; 3) the number of packets is dynamic and only limited by the packet MTU; and 4) both the disaggregation and the aggregation run at line rate.

A.4. Comparison between Server-based and Switch-based Aggregation
Table XXVI shows a comparison between switch-based and server-based packet aggregation. When aggregation is performed on the switch (ASIC), the throughput is higher while the latency and jitter are lower than those of the server-based approaches (e.g., switch CPU or x86-based server). On the other hand, the server-based aggregation has more flexibility in defining the number of packets and the amount of data that can be aggregated.

B. Service Automation

B.1. Background
Low-power low-range IoT communication technologies (e.g., Bluetooth Low Energy (BLE) [265], Zigbee [266], Z-wave [267]) typically follow a peer-to-peer model. IoT devices

[Figure 18 labels: IoT devices, IoT packet, P4 switch (aggregation), WAN, aggregated packet, P4 switch (disaggregation), server.]

Fig. 18. IoT packets aggregation [152]. Frequent small IoT packets are aggregated by a P4 switch and encapsulated in a larger packet. Another switch across the WAN disaggregates the large packet to restore the original IoT packets. Such mechanism prevents the frequent transmissions of headers, and thus, minimizes the bandwidth overhead.

TABLE XXVI
SWITCH-BASED AND SERVER-BASED PACKET AGGREGATION

Feature | Switch-based (ASIC) | Server-based (CPU)
Throughput | Higher (e.g., [152], 100Gbps, i.e., max capacity) | Lower (e.g., [152], 2.58Gbps)
Latency and jitter | Lower | Higher
Count of packets to be aggregated | Not flexible (limited by the switch SRAM) | Arbitrary
Amount of data to be aggregated | Not flexible (limited by the switch SRAM, parsing capacity) | Arbitrary

in such technologies can be divided into two distinct types, peripheral and central. Peripheral devices, which consist of sensors and actuators, receive commands and execute subsequent actions. Central devices, on the other hand, run applications that analyze information collected from peripheral devices and subsequently issue commands.

The interconnection of devices and services can follow a Peer-to-Peer (P2P) model or a cloud-centric approach. In the P2P model, the automation service runs on the central device, which processes and analyzes sensor data published by peripheral devices in order to issue commands. The main advantages of the P2P model include the low end-to-end latency and the low power consumption, as devices are physically close to each other. The drawbacks of the P2P model include poor scalability, short reachability, and inflexibility of policy enforcement. The cloud-centric model addresses the limitations of the P2P model by adding a gateway node that connects peripheral devices to a middleware hosted on the cloud (Internet). While this approach solves the poor scalability and the policy enforcement flexibility issues, it incurs additional delays and jitter in collecting and reacting to data. Moreover, the middleware represents a single point of failure which can shut down the whole service in the event of an outage. With programmable switches, researchers are investigating in-network approaches to manage transactional relationships between low-power, low-range IoT devices.

B.2. Literature Review
Uddin et al. [156] proposed Bluetooth Low Energy Service Switch (BLESS), a programmable switch that automates IoT applications services by encoding their transactions in the data plane. It maintains link-layer connections to the devices to support P2P connectivity. The same authors proposed Muppet [157], an extension to BLESS to support multiple non-IP protocols.

B.3. Service Automation Comparison, Discussions, and Limitations
In BLESS, the data plane operations are performed at the Attribute Protocol (ATT) service layer, which consists of three operations: read attributes, write attributes, and attributes’ notification. BLESS parses ATT packets, then processes and forwards them to the devices. The control plane, on the other hand, is responsible for address assignment, device and service discovery, policy enforcement, and subscription management. The switch was implemented on a software switch (PISCES), and the results show that BLESS combines the advantages of the P2P and the cloud-centric approaches. Specifically, it achieves small communication latency, low device power consumption, high scalability, and flexible policy enforcement. Muppet extended this approach to support multiple IoT protocols. The system studied two popular IoT protocols, namely BLE and Zigbee. Being in the middle, the Muppet switch is responsible for translating actions (e.g., on/off switch of a light bulb) between Zigbee and BLE protocols, as well as logging important events to a database which resides on the Internet via the Hypertext Transfer Protocol (HTTP). Note that parsers and action policies have to be implemented for each supported protocol. Another difference from BLESS is that the implementation of Muppet’s control plane leverages the ONOS controller with the Protocol Independent (PI) framework.

B.4. Comparison between Server-based and Switch-based Service Automation
Table XXVII shows a comparison between switch-based, P2P, and cloud-based service automation. Generally, the switch-based approach overcomes the limitations of both approaches. It achieves the low energy and latency characteristics of P2P while increasing scalability and reachability.

TABLE XXVII
SWITCH-BASED, P2P, AND CLOUD SERVICE AUTOMATION

Feature | Switch-based | Peer-to-peer | Cloud-based
Latency | Low | Low | High
IoT energy | Low | Low | High
Scalability | High | Low | High
Reachability | High | Low | High

C. Summary and Lessons Learned
In the context of IoT, there exist broadly two categories, namely, packet aggregation and service automation. The goal of packet aggregation is to minimize the overhead of IoT packets’ headers. Typically, headers in IoT packets represent a significant portion of the whole packet size. By aggregating several packets into a single packet, the bandwidth overhead is reduced. Future work should study the performance side effects (e.g., delay, jitter, loss rate, retransmission) that aggregation causes to packets. Furthermore, timers should be implemented to avoid excessive delays resulting from waiting for enough packets to be aggregated.

With respect to service automation, the goal is to automate IoT applications services by encoding their transactions in the data plane while improving scalability, reachability, energy consumption, and latency. Future work should design and develop translators for non-IP IoT protocols so that applications on various devices that run different protocols can exchange data. Additionally, production-grade software switches should be leveraged to support non-Ethernet IoT protocols.

Other works that involve IoT include flowlet-based stateful multipath forwarding [268] and SDN/NFV-based architecture for IoT networks [269].

XI. CYBERSECURITY

Extensive research efforts have been devoted to deploying programmable switches to perform various security-related functions in the data plane. Such functions include heavy hitter detection, traffic engineering, DDoS attack detection and mitigation, anonymity, and cryptography. Fig. 19 demonstrates the difference between contemporary security appliances and programmable switches with respect to layers inspection in the OSI model. Although programmable switches are limited in computation power, they are capable of inspecting upper layers (e.g., application layer) at line rate. Such functionality is not available in any of the existing solutions.

[Figure 19 labels: (a) software inspection by ACL/packet filter, traditional firewall/flow-based IDS, and next-generation firewall/IDS/IPS; (b) hardware inspection by a programmable switch; OSI layers Physical through Application.]

Fig. 19. Layers inspection in the OSI model. (a) Contemporary security appliances. (b) Programmable switch.

A. Heavy Hitter

A.1. Background
Heavy hitters are a small number of flows that constitute most of the network traffic over a certain amount of time. They are identified based on the port speed, network RTT, traffic distribution, application policy, and others. Heavy hitters increase the flow completion time for delay-sensitive mice flows, and represent the major source of congestion. It is important to promptly detect heavy hitters in order to react to them; for instance, redirect them to a low priority queue, perform rate control and traffic engineering, block volumetric DDoS attacks, and diagnose congestion. Traditionally, packet sampling techniques (e.g., NetFlow) were used to detect heavy hitters. The main problem with such techniques is the limited accuracy due to the CPU and bandwidth overheads of processing samples in software. Advancements in programmable switches paved the way to detect heavy hitters in the data plane, which is not only orders of magnitude faster than sampling, but also enables additional applications (e.g., flow-size aware routing).

A.2. Literature Review
Sivaraman et al. [158] proposed HashPipe, a heavy hitter detection algorithm that operates entirely in the data plane. It detects the top-k heavy hitter flows within the constraints of programmable switches while achieving high accuracy. A subsequent work proposed by Harrison et al. [159] considers a network-wide distributed heavy-hitter detection. Furthermore, Kučera et al. [160] proposed Elastic Trie, a solution that detects hierarchical heavy hitters, in-network traffic changes, and superspreaders in the data plane. Hierarchical heavy hitters include the total activity of all traffic matching relevant IP prefixes. Basat et al. [161] proposed PRECISION, a heavy hitter detection algorithm that probabilistically recirculates a fraction of packets for a second pipeline traversal. The recirculation idea greatly simplifies the access pattern of memory without significantly degrading throughput. Ding et al. [162] proposed an approach for incrementally deploying programmable switches in a network consisting of legacy devices with the goal of monitoring as many distinct network flows as possible. Tang et al. [163] proposed MV-Sketch, a solution that exploits the idea of majority voting to track the candidate heavy flows inside the sketch data structure. Finally, Silva et al. [164] proposed a solution that identifies elephant flows in Internet eXchange Points (IXP) networks.

A.3. Heavy Hitter Detection Comparison, Limitations, and Discussions
Table XXVIII compares the aforementioned heavy hitter schemes. The main criterion that differentiates the solutions is the selection and the implementation of the data structure. Hash tables and sketches are frequently used to store counters for heavy flows. Note that several variations of such data structures are being used in the literature, mainly to tackle the memory-accuracy tradeoff; the choice of data structure affects the accuracy of the performed measurements. For example, with probabilistic data structures, only approximations are performed.

In HashPipe, the programmable switch stores the flow identifiers and their byte counts in a pipeline of hash tables. HashPipe adapts the space saving algorithm described in [270]. The system was evaluated using an ISP trace provided by CAIDA (400,000 flows), and the results show that HashPipe needed only 80KB of memory to identify the 300 heaviest flows, with an accuracy of 95%. Another hashtable-based solution is Elastic Trie, which consists of a prefix tree that expands or collapses to focus only on the prefixes that grab a

TABLE XXVIII
HEAVY HITTER SCHEMES COMPARISON

Data Adaptive Platform


Scheme Name Core idea Network-wide Approximations
structure thresholds HW SW
Maintains counts of heavy flows
[158] HashPipe Hash tables × × × 
in a pipeline of hash tables.
Switch store locally the counts a
[159] N/A Hash tables    
coordinator aggregates the results
Detects hierarchical heavy hitters
[160] Elastic Trie Prefix tree ×   
using hashtable prefix tree
Recirculates a small fraction of
[161] PRECISION Hash tables × ×  
packets to simplify memory access
Monitors distinct flows using
[162] N/A HyperLogLog    
HyperLogLog algorithm
Supports the queries of recovering Invertible
[163] MV-Sketch  ×  
all heavy flows in a sketch sketches
Identifies elephant flows using
[164] N/A Hash tables ×  × 
dynamic thresholds in IXPs
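
To make the first row of Table XXVIII concrete, the following Python sketch mimics a pipeline of hash tables in the style of HashPipe [158]. It is a simplified illustration, not the authors' P4 code: the eviction rule (always admit the new flow at the first stage, then keep the larger counter at later stages), the salted CRC32 slot index, and the names `HashPipeSketch`, `update`, and `estimate` are assumptions made for this example.

```python
import zlib


class HashPipeSketch:
    """Toy pipeline of hash tables holding (flow_id, byte_count) pairs."""

    def __init__(self, stages=4, slots=64):
        self.slots = slots
        self.tables = [[None] * slots for _ in range(stages)]

    def _index(self, stage, flow_id):
        # One hash per stage, derived here by salting CRC32 (assumption).
        return zlib.crc32(f"{stage}:{flow_id}".encode()) % self.slots

    def update(self, flow_id, nbytes):
        carried = (flow_id, nbytes)
        for stage, table in enumerate(self.tables):
            i = self._index(stage, carried[0])
            resident = table[i]
            if resident is None:
                table[i] = carried
                return
            if resident[0] == carried[0]:
                table[i] = (resident[0], resident[1] + carried[1])
                return
            # Stage 0 always admits the new flow; later stages keep the
            # larger counter and carry the smaller one forward.
            if stage == 0 or carried[1] > resident[1]:
                table[i], carried = carried, resident
        # Whatever is still carried after the last stage is dropped.

    def estimate(self, flow_id):
        # A flow may be split across stages, so sum its entries.
        total = 0
        for stage, table in enumerate(self.tables):
            entry = table[self._index(stage, flow_id)]
            if entry is not None and entry[0] == flow_id:
                total += entry[1]
        return total
```

Because counters are only moved or dropped, never duplicated, an estimate in this sketch cannot exceed a flow's true byte count; the error comes from entries that fall off the last stage.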

TABLE XXIX
CRYPTOGRAPHY SCHEMES COMPARISON

Scheme | Name | Core idea | Conf. | Integ. | Auth. | Computations (ASIC/CPU) | Algorithms | Platform (HW/SW)
[165] | N/A | Implementations of cryptographic hash functions | × | × | ✓ | ✓ | SipHash-2-4, Poly1305-AES, BLAKE2b, HMAC-SHA256-512 | ✓
[166] | P4-IPsec | Implementation of host-to-site IPsec in P4 switches | ✓ | ✓ | ✓ | ✓ | AES-CTR, HMAC-MD5 | ✓
[167] | P4-MACsec | Implementation of MACsec on P4 switches | ✓ | ✓ | × | ✓ | AES-GCM | ✓
[168] | N/A | AES implementation using scrambled lookup table | ✓ | × | × | ✓ | AES-128, AES-192, AES-256 | ✓
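
The AES row of Table XXIX ([168]) rests on simple arithmetic that the survey reports for that system: one pass through the switch pipeline completes two AES rounds, and the packet is recirculated until all rounds are done. The sketch below works this out for the three key sizes listed in the table. The round counts for AES-192 and AES-256 (12 and 14) come from the AES standard rather than from the survey, and the function name and the default of two rounds per pass are assumptions for illustration.

```python
import math

# Number of rounds per AES key size (AES standard, FIPS 197).
AES_ROUNDS = {128: 10, 192: 12, 256: 14}


def pipeline_passes(key_bits, rounds_per_pass=2):
    """Pipeline passes needed when each pass completes rounds_per_pass rounds."""
    return math.ceil(AES_ROUNDS[key_bits] / rounds_per_pass)


def recirculations(key_bits, rounds_per_pass=2):
    """The first pass is the normal traversal; every further pass is a recirculation."""
    return pipeline_passes(key_bits, rounds_per_pass) - 1
```

With two rounds per pass, AES-128 needs five passes (four recirculations), which matches the figure quoted for [168]; every extra pass consumes loopback-port bandwidth, so larger key sizes would lower the attainable throughput.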

large share of the network. The data plane informs the control plane about high-volume traffic clusters in an event-based push approach only when some conditions are met. Other systems explored different data structures for the task. For instance, in [162] the authors used the HyperLogLog algorithm [271], which approximates the number of distinct elements in a multi-set. The solution is capable of detecting heavy hitters by only using partial input from the data plane.
Another important criterion is whether the scheme tracks heavy hitters across the whole network. For example, unlike HashPipe, which considers a single switch, [159] tracks network-wide heavy hitters. Tracking network-wide heavy hitters is important as some applications (e.g., port scanners, superspreaders, etc.) can go undetected when observed from a single location. Moreover, aggregating the results of switches separately for detecting heavy hitters is not sufficient; flows might not exceed a threshold locally, but when the total volume is considered, the threshold might be crossed.

A.4. Comparison between P4-based and Traditional Heavy Hitter Detection
The main advantage of heavy hitter detection schemes in the data plane over sampling-based approaches is the ability to operate at line rate. This means that every packet is considered in the detection algorithm, which improves accuracy and the speed of detection. Moreover, additional applications that exploit reactive processing can be implemented. For instance, switches can perform a flow-size aware routing method to redirect traffic upon detecting a heavy hitter.

B. Cryptography

B.1. Background
Performing cryptographic functions in the data plane is useful for a variety of applications (e.g., protecting layer 2 with cryptographic integrity checks and encryption, mitigating hash collisions, etc.). Computations in cryptographic operations (e.g., hashing, encryption, decryption) are known to be complex and resource-intensive. The supported operations in switch targets and in the P4 language are limited to basic arithmetic (e.g., additions, subtractions, bit concatenation, etc.). Recently, a handful of works have started studying the possibility of performing cryptographic functions in the data plane.

B.2. Literature Review
The authors in [165] argue for the need to implement cryptographic hash functions in the data plane to mitigate potential attacks targeting hash collisions. Consequently, they presented prototype implementations of cryptographic hash functions in three different P4 target platforms (CPU, SmartNIC, NetFPGA SUME). Another work by Hauser et al. [166] attempted to implement host-to-site IPsec in P4 switches. For simplification, only Encapsulating Security Payload (ESP) in tunnel mode with different cipher suites is implemented. The same authors also proposed P4-MACsec, an implementation of MACsec on P4 switches. MACsec is an IEEE standard for securing Layer 2 infrastructure by encrypting, decrypting, and performing integrity checks on packets.
The previous works delegated the complex computations to the control plane. Chen et al. [168] implemented the Advanced Encryption Standard (AES) in the data plane using scrambled lookup tables. AES is one of the most widely used symmetric cryptography algorithms; it applies several encryption rounds on 128-bit input data blocks.

B.3. Cryptography Schemes Comparison, Discussions and Limitations
Table XXIX compares the aforementioned cryptography schemes. With respect to hashing, P4 currently implements hash functions that do not have the characteristics of cryptographic hashing. For example, Cyclic Redundancy Check (CRC), which is commonly used in P4 targets, was originally developed for error detection. CRC can be easily implemented in embedded hardware, and is computationally much less complex than cryptographic hash functions (e.g., Secure Hash Algorithm (SHA)-256); however, it is not secure and has a high collision rate. Evaluation results in [165] show that 1) implementing cryptographic hash functions on CPU is easy, but has high latency (several milliseconds); 2) SmartNICs have the highest throughput, but can only process packets up to 900 bytes; and 3) NetFPGA has the lowest latency, but cannot be integrated using native P4 features. The authors found that the performance of hashing is highly dependent on the application, the input type, and the hashing algorithm, and therefore there is no single solution that fits all requirements. However, P4 targets should benefit from the characteristics of each solution (CPU, SmartNICs, FPGA, and ASICs) to implement cryptographic hashing.
As for more complex protocol suites (e.g., IPsec), Hauser

et al. [166] only implemented Encapsulating Security Payload (ESP) in tunnel mode for simplification. The Security Policy Database (SPD) and the Security Association Database (SAD) are represented as match-action tables in the P4 switch. To avoid complex key exchange protocols such as the Internet Key Exchange (IKE), this work delegates runtime management operations to the control plane. Moreover, since encryption and decryption are not supported by P4, the authors relied on user-defined P4 externs to perform complex computations. Note that implementing user-defined externs is not applicable for ASICs (e.g., Tofino), and consequently, the main CPU module of the switch is used for performing encryption/decryption computations, at the cost of increased latency and degraded throughput. The same ideas are applied to P4-MACsec by the same authors.
The system proposed by Chen et al. [168] has significant performance advantages as it is fully implemented in the data plane. The idea of the proposed system is to apply permuted lookup tables by using an encryption key. The authors found that a single switch pipeline is capable of performing two AES rounds. Consequently, the system leverages packet recirculation, which re-injects the packet into the pipeline. By doing so, it is possible to complete the 10 rounds of encryption required by the AES-128 algorithm by using five pipeline passes. Note that recirculation uses loopback ports and hence is limited by their bandwidth. The implementation on a Tofino chip shows that ≈ 10 Gbps throughput was attained. The authors argued that this throughput is sufficient to support various in-network security applications. Nevertheless, it is possible to enhance the throughput by configuring additional physical ports as loopback ports.

B.4. Comparison between In-network and Contemporary Cryptography
Cryptographic primitives often require performing complex arithmetic operations on data. Implementing such computations on general purpose servers is simple; memory and processing units are not constrained. The literature has shown that there is a need to implement cryptographic functions in the data plane. For instance, cryptographic hash functions can significantly improve existing data plane applications with respect to collisions; encryption can protect confidential information from being exposed to the public. However, switches have limitations when it comes to computing. Supported hash functions in P4 are non-cryptographic (e.g., CRC), and therefore, produce collisions when the table is not large. Consequently, researchers are continuously investigating techniques to perform such operations in the data plane.

C. Privacy and Anonymity

C.1. Background
Packets in a network carry information that can potentially identify users and their online behavior. Therefore, user privacy and anonymity have been extensively studied in the past (e.g., Tor and onion routing [272]). However, existing solutions have several limitations: 1) poor performance since overlay proxy servers are maintained by volunteers and have no performance guarantees; 2) deployability challenges; some solutions require modifying the whole Internet architecture, which is highly unlikely; 3) no clear partial deployment pathway; and 4) most solutions are software-based. Consequently, recent works started investigating methods that exploit programmable switches to develop partially-deployable, low-latency, and light-weight anonymity systems.
With respect to anonymity and privacy in the network, a new class of attacks that target the topology requires the attacker to know the topology and understand its forwarding behavior. Such attacks can be mitigated by obfuscating (hiding) the topology from external users. P4-based schemes are also being developed to achieve this goal.

C.2. Literature Review
Meier et al. [169] proposed NetHide, a P4-based solution that obfuscates network topologies to mitigate topology-centric attacks such as Link-Flooding Attacks (LFAs). On the other hand, Kim et al. [171] proposed Online Network Traffic Anonymization System (ONTAS), a system that anonymizes traffic online using P4 switches.
Another line of research focused on protecting the identity of Internet users. Moghaddam et al. [170] proposed Practical Anonymity at the NEtwork Level (PANEL), a lightweight and low overhead in-network solution that provides anonymity in the Internet forwarding infrastructure. Likewise, Datta et al. [172] proposed Surveillance Protection in the Network Elements (SPINE), a system that anonymizes traffic by concealing IP addresses and relevant TCP fields (e.g., sequence number) from adversarial Autonomous Systems (ASes) on the data plane.

C.3. Privacy and Anonymity Schemes Discussions
Table XXX compares the privacy and anonymity schemes. NetHide aims at mitigating the attacks targeting the network topology. The solution formulates network obfuscation as a multi-objective optimization problem, and uses accuracy (hard constraints) and utility (soft constraints) as metrics. The system then uses an ILP solver and heuristics. The P4 switches in this system capture and modify tracing traffic at line rate. The specifics of the implementation were not disclosed, but the authors claim that the system was evaluated on realistic topologies (more than 150 nodes), and more than 90% of link failures were detected by operators, despite obfuscation.

TABLE XXX
PRIVACY AND ANONYMITY SCHEMES COMPARISON

Name/Scheme | Goal | Strategy | Platform (HW/SW)
NetHide [169] | Mitigate topology attacks | Topology obfuscation | × / ×
PANEL [170] | Protect Internet users' identities | Source info rewriting |
ONTAS [171] | Protect PII in packet traces | Header fields hashing | ✓
SPINE [172] | Protect Internet users' identities | Header fields concealing | ✓

ONTAS had a slightly different goal; it aims at protecting the personally identifiable information (PII) from online traces. The system overcomes the limitations of existing systems

Fig. 20. SPINE architecture [172]. Labels recovered from the figure: unmodified device, trusted entity 1, untrusted entity, trusted entity 2, unmodified device; original traffic, SPINE traffic, SPINE traffic, original traffic; {keys, version number}.
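
The address concealment that SPINE [172] performs between the two trusted entities of Fig. 20 can be sketched as follows. The survey describes the scheme as XORing the IPv4 address with a hash of a pre-shared key and a nonce, and carrying the result in the last 32 bits of an IPv6 address. In this Python illustration the keyed hash is BLAKE2s from the standard library, swapped in for SipHash purely for convenience; the function names and the IPv6 prefix passed by the caller are assumptions, not part of SPINE.

```python
import hashlib
import ipaddress


def keystream32(key: bytes, nonce: bytes) -> int:
    # Keyed hash of the nonce, truncated to 32 bits.
    # Stand-in for SipHash: BLAKE2s in keyed mode (assumption).
    digest = hashlib.blake2s(nonce, key=key, digest_size=4).digest()
    return int.from_bytes(digest, "big")


def encrypt_ipv4(addr: str, key: bytes, nonce: bytes) -> int:
    # XOR the 32-bit address with the keyed hash.
    return int(ipaddress.IPv4Address(addr)) ^ keystream32(key, nonce)


def decrypt_ipv4(cipher: int, key: bytes, nonce: bytes) -> str:
    # XOR is its own inverse, so decryption repeats the same operation.
    return str(ipaddress.IPv4Address(cipher ^ keystream32(key, nonce)))


def embed_in_ipv6(prefix: str, cipher: int) -> str:
    # Place the 32-bit ciphertext in the low 32 bits of an IPv6 address
    # taken from a /96 prefix, so intermediary networks can still route it.
    network = ipaddress.IPv6Network(prefix)
    return str(ipaddress.IPv6Address(int(network.network_address) | cipher))
```

Only the two trusted entities hold the key, so an intermediary that observes the IPv6 packet sees the masked value; reusing a nonce with the same key would produce the same mask, which is why the nonce has to change over time.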
which either require network operators to anonymize packet traces before sharing them with other researchers and analysts, or anonymize traffic online but with significant overhead. ONTAS provides a policy language used by operators for expressing anonymization tasks, which makes the system flexible and scalable. The system was implemented and tested on a hardware switch, and results show that ONTAS entails 0% packet processing overhead and requires half the storage compared to existing offline tools. A limitation of this system is that it does not anonymize TCP/UDP field values. Another limitation is that it does not support applying multiple privacy policies concurrently.
Another line of research (i.e., PANEL, SPINE) focused on protecting the identities of Internet users. PANEL overcomes the performance limitations of popular anonymity systems (e.g., Tor), and does not require entirely modifying the Internet routing and forwarding protocols as proposed in [273] and [274]. Partial deployment is possible as PANEL can co-exist with legacy devices. The solution involves: 1) source address rewriting to hide the origin of the packet; 2) source information normalization (IP identification and TCP sequence randomization) to mitigate fingerprinting attacks; and 3) path information hiding (TTL randomization) to hide the distance to the original sender at any given vantage point.
As for SPINE, it does not require cooperation between switches and end-hosts, but assumes that at least two entities (typically two ASes or two ISPs) are trusted. Fig. 20 shows the SPINE architecture. The solution encrypts the IP addresses before the packets enter the intermediary ASes. Therefore, adversarial devices only see the encrypted addresses in the headers. It also encrypts the TCP sequence and ACK numbers to prevent packets from being attributed to flows. SPINE transforms IPv4 headers into IPv6 headers when packets leave the trusted entity and restores the IPv4 headers upon entering the trusted entity. These operations enable routing to be performed in intermediary networks. The encrypted IPv4 address is inserted in the last 32 bits of the IPv6 destination address. The encryption works by XORing the IP address with the hash of a pre-shared key and a nonce. The system uses SipHash since it is easily implemented in the data plane.

C.4. Privacy and Anonymity in Switch-based and Legacy Systems
Contemporary approaches that provide privacy and anonymity in the Internet use special routing overlay networks to hide the physical location of each node from other participants (e.g., Tor). Such approaches have performance limitations as proxy servers (overlays) are maintained by volunteers and have no performance guarantees. Moreover, they often require performing advanced encryption routines to obfuscate where the packet originated (e.g., the onion routing technique used by Tor involves encapsulating messages in several layers of encryption). On the other hand, approaches that are based on programmable switches often rely on header modification and simplified encryption and hashing to conceal information (e.g., SPINE [172]).

D. Access Control

D.1. Background
The selective restriction of access to digital resources is known as access control in cybersecurity. Typically, access control begins with “authentication” in order to verify the identity of a party. Afterwards, “authorization” is enforced through policies to specify access rights to resources. To authenticate parties, methods such as passwords, biometric analysis, cryptographic keys, and others are used. With respect to authorization, methods such as ACL are used to describe what operations are allowed on given objects.
With the advent of programmable switches, it is now possible to delegate authentication and authorization to the data plane. As a result, access can be promptly granted or denied at line rate, before reaching the target server. A clear advantage of this approach is that servers are no longer busy processing access verification routines, which increases their service throughput.

D.2. Literature Review
Datta et al. [173] presented P4Guard, a P4-based configurable firewall that acts based on predefined policies set by the controller. Kang et al. [175] presented a scheme that implements context-aware security policies (see Fig. 21). The policies are applicable to enterprise and campus networks with diverse devices, i.e., Bring Your Own Device (BYOD) (e.g., laptops, mobile devices, tablets, etc.).

Fig. 21. Overview of Poise [175]. A compiler translates high-level policies into P4 programs and device configurations. Context packets are continuously sent from the clients to the network, where the switches enforce the policies.

Almain et al. [174] proposed delegating the authentication of end hosts to the data plane. The method is based on port knocking, in which hosts deliver a sequence of packets addressed to an ordered list of closed ports. If the ports match the ones configured by the network administrators, then end

TABLE XXXI
ACCESS CONTROL SCHEMES COMPARISON

Scheme | Goal | Strategy | Scope | Limitations | Platform (HW/SW)
[173] | Simple firewall-based access control | Translates from high-level security policies to table entries | Header-based firewall (layer-4) | Lacks NGFW capabilities | ✓
[174] | User-authentication in the data plane | Uses port knocking technique for authentication | Unencrypted sequence-based authentication | Unencrypted sequence vulnerable to packet sniffing | ✓
[175] | Context-aware policies enforcement | Translates from high-level security policies to P4 programs | CAS dynamic policies based on runtime contexts | External encryptions are slow; lack of authentication | ✓
[176] | OS fingerprinting and policy enforcement | Compares TCP/IP headers to a fingerprint database file | Uses p0f to filter connections | Lack of advanced built-in actions (e.g., rate-limiting) | ✓
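
The port-knocking strategy listed for [174] in Table XXXI amounts to a small per-source state machine. The Python sketch below is a software illustration only; the class name, the reset-on-wrong-port behaviour, and the knock sequence chosen by the caller are assumptions, not details taken from [174].

```python
class PortKnockGate:
    """Tracks, per source address, progress through a secret port sequence."""

    def __init__(self, sequence):
        self.sequence = tuple(sequence)
        self.progress = {}          # source -> index of the next expected knock
        self.authenticated = set()  # sources that completed the sequence

    def on_packet(self, src_ip, dst_port):
        if src_ip in self.authenticated:
            return "allow"
        step = self.progress.get(src_ip, 0)
        if dst_port == self.sequence[step]:
            step += 1
            if step == len(self.sequence):
                self.authenticated.add(src_ip)
                self.progress.pop(src_ip, None)
            else:
                self.progress[src_ip] = step
        else:
            # Wrong port: start over, counting this packet as a first knock
            # if it happens to hit the first port of the sequence.
            self.progress[src_ip] = 1 if dst_port == self.sequence[0] else 0
        # Knock packets target closed ports, so they are dropped themselves.
        return "drop"
```

The state is keyed on the source address alone and the knocks travel in the clear, which makes the spoofing and sniffing weaknesses noted in the Limitations column easy to see.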

host is authenticated, and subsequent packets are allowed. Finally, Bai et al. [176] presented P40f, a tool that performs OS fingerprinting on programmable switches, and consequently, applies security policies (e.g., allow, drop, redirect) at line rate.

D.3. Access Control Comparison, Discussions, and Limitations
Table XXXI compares the aforementioned access control schemes. P4Guard provides access control based on high-level security policies that are translated to table entries. Note that P4Guard only operates up to the transport layer (e.g., source/destination IP addresses, source/destination ports, protocol, etc.), similar to a traditional firewall. While programmable switches provide increased flexibility in the parser (e.g., parse beyond the transport layer) and the packet processing logic, P4Guard did not leverage such capabilities. It would be interesting to investigate additional capabilities such as those enabled by next-generation firewalls (NGFW).
The solution in [174] controls access by performing authentication in the data plane. The solution has several limitations since it relies on port knocking, a technique that has several security implications. For instance, programmable switches do not use cryptographic hashes, making the solution vulnerable to IP address spoofing attacks. Additionally, unencrypted port knocking is vulnerable to packet sniffing. Furthermore, port knocking relies on security through obscurity.
In [175], the scheme dynamically enforces access control to users based on contexts (e.g., if the user’s device uses Secure Shell (SSH) 2.0 or higher, then the switch forwards the packets of this flow; otherwise, the switch drops the packets). The scheme requires user devices to run an application which communicates with the switch using a custom protocol (context packets). The context packets are generated on a per-flow basis. The switch tracks flows using a match-action table and registers at the data plane. Actions over a packet are dropping, allowing, and forwarding to other appliances for deep packet inspection. Data packets are not modified. Evaluations show that the proposed approach can operate (install new flows and update rules) with minimal latency, even under heavy DoS attacks. On the other hand, such attacks can decimate similar SDN-based systems. One of the main drawbacks of the proposed system is the lack of authentication, integrity, and confidentiality of the context packets. Thus, the system can be subject to attacks such as snooping (i.e., eavesdropping) on communication between user devices and the switch, impersonation, and others.
Finally, [176] proposes fingerprinting OSs in the data plane. The main motivation behind this work is that software-based passive fingerprinting tools (e.g., p0f [275]) are neither practical nor sufficient with large amounts of traffic on high-speed links. Furthermore, out-of-band monitoring systems cannot promptly take actions (e.g., drop, forward, rate-limit) on traffic at line rate. The main drawback of the solution is that it lacks sophisticated policies that involve rate-limiting traffic.

D.4. Comparison between Switch-based and Server-based Access Control
Controlling access to resources often starts with authentication. While server-based approaches are more flexible in the methods of authentication they can provide, they typically require client connections to reach the server before the communication starts. In switch-based approaches, the authentication can be done in-network at the edge, eliminating unnecessary latency incurred from traversing the network and from software processing.
Access to resources can be controlled after fingerprinting the OSs of end hosts. Software-based passive fingerprinting tools cannot keep up with the high load (gigabits/s links). The literature has shown that such tools lead to a 38% degradation in throughput [276]. Additionally, such tools are out-of-band, meaning that it is not possible to apply policies on traffic (e.g., after fingerprinting an OS). On the other hand, switch hardware is able to perform OS fingerprinting and apply security policies at line rate.
Context-aware policies applied on nodes (clients/servers) have local visibility. A newer approach is to use a centralized SDN controller (e.g., [277]), but such a scheme is vulnerable to control plane saturation attacks and is subject to increased delay. Switch-based schemes, on the other hand, are able to provide access control at line rate.

E. Defenses

E.1. Background
DDoS attacks remain among the top security concerns despite the continuous efforts towards the development of their detection and mitigation schemes. This concern is exacerbated not only by the frequency of said attacks, but also by their high volumes and rates. Recent attacks (e.g., [278, 279]) reached the order of terabits per second, a rate that existing defense mechanisms cannot keep up with.

TABLE XXXII
DEFENSES SCHEMES COMPARISON

Name & scheme | Mitigated attacks | Attack coverage (specific/generic) | External computations | Network-wide | Platform (HW/SW) | Limitations
NETHCF [177] | IP-spoofing | ✓ | ✓ | × | ✓ | Hop-counts incorrectness with the presence of NAT
FastFlex [178] | Availability attacks | ✓ | × | ✓ | ✓ | Cross-domain federation complexity and security
[179] | Sensitivity attacks | ✓ | × | × | ✓ | Limited evaluation on complex data plane systems
[180] | SIP DDoS | ✓ | ✓ | × | ✓ | No support for encrypted packets (e.g., SIP/TLS)
[181] | DDoS anomalies | ✓ | × | × | ✓ | Not adaptable to traffic patterns (fixed thresholds)
ML-Pushback [182] | DDoS anomalies | ✓ | ✓ | × | × / × | Depends heavily on external computations
[183] | SYN floods | ✓ | ✓ | × | ✓ | Lack of cryptographic hash functions
Poseidon [184] | Volumetric DDoS | ✓ | ✓ | × | ✓ | Human intervention for writing the defense policies
[185] | Volumetric and stealthy DDoS | ✓ | ✓ | × | ✓ | Only synthetic evaluations; no extensive experimentation
NetWarden [186] | Network covert channels | ✓ | ✓ | × | ✓ | Slowpath/fastpath communication latency
[187] | ECN protocol abuse | ✓ | × | × | ✓ | Small subset of attack space
Ripple [188] | Link-flooding | ✓ | × | ✓ | ✓ | Lack of comparison with other P4 approaches
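
One of the detection approaches compared in Table XXXII, the entropy-based scheme of [181], tracks the entropy of source and destination IP addresses over consecutive observation windows. A minimal Python sketch of that measurement follows. The window length, the function names, and the use of floating-point logarithms are assumptions for illustration; a switch pipeline would have to approximate the logarithm, and the sketch does not attempt to reproduce how [181] derives its thresholds.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of values."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def window_entropies(addresses, window):
    """Entropy of each consecutive, non-overlapping window of addresses."""
    return [
        shannon_entropy(addresses[i:i + window])
        for i in range(0, len(addresses) - window + 1, window)
    ]
```

When many sources converge on a single victim, the destination-address entropy of a window falls sharply while the source-address entropy tends to rise; comparing each window with recent history is what turns these values into an anomaly signal.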

There are two main concerns with existing defense methods handled by end-hosts or deployed as middlebox functions on x86-based servers. First, they dramatically degrade the throughput and increase latency and jitter, impacting the performance of the network. Second, they present severe consequences on the network operation when they are installed at the last mile (i.e., far from the edge).
The escalation of volumetric DDoS attacks and the lack of robust and efficient defense mechanisms motivated the idea of architecting defenses into the network. Up until recently, in-network security methods were restricted to simple access control lists encoded into the switching and routing devices. The main reason is that the data plane was fixed in function, impeding the development of customized and dynamic algorithms that can assist in detecting attacks. With the advent of programmable data planes, it is possible to develop systems that detect and mitigate various types of attacks without imposing significant overhead on the network.

E.2. Literature Review
Li et al. [177] presented NETHCF, a Hop-Count Filtering (HCF) defense mechanism that mitigates spoofed IP traffic. HCF schemes filter spoofed traffic with an IP-to-hop-count mapping table. Another attack-specific scheme proposed by Febro et al. [180] mitigates distributed SIP DDoS in the data plane. Furthermore, Scholz et al. [183, 280] presented a scheme that defends against SYN flood attacks.
Alternatively, some schemes are generic and aim at addressing multiple attacks concurrently. For instance, Xing et al. [178] proposed FastFlex, an abstraction that architects defenses into the network paths based on changing attacks. Kang et al. [179] presented an automated approach for discovering sensitivity attacks targeting the data plane programs. Sensitivity attacks in this context are intelligently crafted traffic patterns that exploit the behavior of the P4 program. Lapolli et al. [181] implemented a mechanism to perform real-time DDoS attack detection based on entropy changes. Such changes are used to compute anomaly detection thresholds. Mi et al. [182] proposed ML-Pushback, a P4-based implementation of the Pushback method [281].
Zhang et al. [184] proposed Poseidon, a system that mitigates volumetric DDoS attacks through programmable switches. It provides a language where operators can express a range of security policies. Friday et al. [185] proposed a unified in-network DDoS detection and mitigation strategy that considers both volumetric and slow/stealthy DDoS attacks. Xing et al. [186] proposed NetWarden, a broad-spectrum defense against network covert channels in a performance-preserving manner. The method in [187] models a stateful security monitoring function as an Extended Finite State Machine (EFSM) and expresses the EFSM using P4 abstractions. Finally, Ripple [188] provides decentralized link-flooding defense against dynamic adversaries.

E.3. Defense Schemes Comparison, Discussions, and Limitations
Table XXXII compares the aforementioned defense schemes. Broadly, defense schemes can be grouped into two main categories: attack-specific and generic. The attack-specific category consists of the works that address a specific attack (e.g., NETHCF for IP spoofing, [180] for SIP DDoS, etc.), while the generic category aims at addressing various types of attacks (e.g., FastFlex for various availability attacks, Ripple for link flooding attacks, etc.).
The significant advantage of architecting defenses in the data plane is the performance improvement of the application. For instance, NETHCF is motivated by the fact that traditional HCF-based schemes are implemented on end-hosts,

which delays the filtering of spoofed packets and increases execute cryptographic primitives in the data plane to enable
the bandwidth overhead. Moreover, since traditional schemes further applications; 3) protect the identity and the behavior
are implemented in server-based middleboxes, low latency of end-hosts, as well as obfuscate the network topology; 4)
and minimal jitter are hard to achieve. Similarly, FastFlex enforce access control policies in the network while consid-
advocates on the need to offload the defenses to the data ering network dynamics; and 5) architect defenses in the data
plane. Specifically, it tackles the following key challenges that plane to accelerate the detection and mitigation processes.
are faced when programming defenses in the data plane: 1) Identifying heavy hitters at line rate has several advan-
resource multiplexing; 2) optimal placement; 3) distributed tages. Recent works considered various data structures and
control; and 4) dynamic scaling. streaming algorithms to detect heavy hitters. Future systems
When deploying defenses in the data plane, operators must could explore more complex data structures that reduce the
be aware of the capabilities of the constrained targets. Many amount of state storage required on the switches. Furthermore,
operations that require extensive computations cannot be easily novel systems must minimize the false positives and the
implemented on the data plane. The existing work either false negatives compared to both P4-based and legacy heavy
approximate the computations in the data plane (considering hitter detection systems. Finally, new schemes should explore
the computation complexity and the measurements accuracy strategies for incremental deployment while maximizing flow
trade-off), or delegate the computations to external processors visibility across the network.
(e.g., CPU on the switch, external server, SDN controller, There is an absolute necessity to implement cryptographic
etc.). For instance, NETHCF decouples the HCF defense into functions (e.g., hash, encrypt, decrypt) in the data plane.
a cache running in the data plane and a mirror in the control Such functions can be used by various applications that
plane. The cache serves the legitimate packets at line rate, require low hashing collisions (e.g., load balancing) and strong
while the mirror processes the missed packets, maintains the data protection. Most existing efforts delegate the complex
IP-to-hop-count mapping table, and adjust the state of the computations to the control plane. However, recent systems
system based on network dynamics. In Poseidon, the defense have demonstrated that AES, a well-known symmetric key
primitives are partitioned to be executed on switches and on encryption algorithm, can be implemented in the data plane.
servers, based on their properties. On the other hand, in [181], Another interesting line of work provided privacy and
the authors estimated the entropies of source and destination anonymity to the network. Recent efforts obfuscated the net-
IP addresses of incoming packets for consecutive partitions work topology in order to mitigate topology-centric attacks
(observation windows) in the data plane, without consulting (e.g., LFA). Such systems must preserve the practicality of
external devices. path tracing tools, while being robust against obfuscation
Network-wide defenses are those that are not restricted to a inversion. Additionally, link failures in the physical topology
single switch, and require multiple switches to co-operate in should remain visible after obfuscation. Furthermore, when
the attacks detection and mitigation phases. Such co-operation randomizing identifiers to achieve session unlinkability, the
significantly improves the accuracy and the promptness of the identifiers must fit into the small fixed header space so
detection. More details on network-wide data plane systems that compatibility with legacy networks is preserved. Other
is explained in Section XIII-D. efforts considered rewriting source information and headers
Finally, Table XXXII lists some limitations of the existing concealing to protect the identity of Internet users.
schemes, which can be explored in future work to advance the Finally, access control methods and in-network defenses
state-of-the-art. were proposed. Future access control schemes should explore
further in-network methods to authenticate the users. Addi-
E.4. Comparison between P4-based and Traditional Defense
tionally, since switches are capable of inspecting upper-layer
Schemes
headers, it is worth exploring offloading some next generation
Network attacks such as large-scale DDoS and link flooding firewall functionalities to the data plane. For instance, in
may have substantial impact on the network operation. For [146], the authors proposed a system that allows searching
such attacks, server-based defenses deployed at the last mile for keywords in the payload of the packet. Similar techniques
are problematic and inherently insufficient, especially when could be leveraged to achieve URL filtering at line rate.
attacks target the network core. Moreover, it is not feasible to Additionally, schemes should mitigate stealthy DDoS
detect and mitigate large volumes of attack traffic (e.g., SYN attacks.
flood) on end-hosts without impacting the throughput of the
network. When defenses are architected into the network (i.e.,
XII. NETWORK TESTING
detection and mitigation are programmed into the forwarding
devices), it is easy to detect, throttle, or drop suspicious traffic Although programmable switches provide flexibility in
at any vantage point, at line rate. defining the packet processing logic, they introduce potential
risks of having erroneous and buggy programs. Such bugs
may cause fatal damages, especially when they are unexpect-
F. Summary and Lessons Learned edly triggered in production networks. In such scenarios, the
In the context of cybersecurity, a wide range of works network starts experiencing a degradation in performance as
leveraged programmable switches to achieve the following well as disruption in its operation. Bugs can occur in various
goals: 1) detect heavy hitters and apply countermeasures; 2) phases in the P4 program development workflow (e.g., in

TABLE XXXIII
TROUBLESHOOTING SCHEMES COMPARISON

Name & scheme | Core idea | Fault detection | Memory requirements | Platform
P4DB [189] | On-the-fly runtime debugging using watch, break, and next primitives | Passive | High | SW
P4Tester [190] | Probing-based troubleshooting using BDD | Proactive | Low | HW
[191] | Targets' behavior examination when undesired actions are triggered | N/A | N/A | HW, SW
[192] | Execution paths profiling using Ball-Larus encoding | Passive | Low |
KeySight [193] | Probing-based troubleshooting using PEC | Proactive | Low |

the P4 program itself, in the controller updating data plane diagnose faults by injecting probes (e.g., [190, 193]). The
table entries, in the target compiler, etc.). Bugs are usually main limitation of passive detection is that schemes can only
manifested after processing a sequence of packets with certain detect rule faults that have been triggered by existing packets,
combinations not envisioned by the designer of the code. and cannot check the correctness of all table rules. On the
This section gives an overview of the troubleshooting and other hand, probing-based schemes may incur large control
verification schemes for P4 programmable networks. and probe overheads.
Examples of probing-based schemes include P4Tester and
A. Troubleshooting KeySight. P4Tester generates an intermediate representation of
P4 programs and table rules based on BDD data structure.
A.1. Background Afterwards, it performs an automated analysis to generate
Intensive research interest has been drawn to troubleshooting probes. Probes are sent using source routing to achieve high
the network. Previous efforts are mainly based on passive rule coverage while maintaining low overheads. The system
packet behavior tracking through the usage of monitoring was prototyped on a hardware switch (Tofino), and results
technologies (e.g., NetSight [282], EverFlow [283]). Other show that it can check all rules efficiently and that the probes
techniques (e.g., Automatic Test Packet Generation (ATPG) count is smaller than that of server-based probe injection
[284]) send probing packets to proactively detect network systems (i.e., ATPG and Pronto).
bugs. Such techniques have two main problems. First, the Other schemes that use passive fault detection (e.g., P4DB)
number of probe packets increases exponentially as the size assume that packets consistently trigger the runtime bugs.
of the network increases. Second, the coverage is limited by P4DB debugs P4 programs in three levels of visibility by
the number of probe-generating servers. Despite the flexibility provisioning operator-friendly primitives: watch, break, and
that programmable switches offer, writing data plane programs next. P4DB does not require modifying the implementation of
increases the chance of introducing bugs into the network. Pro- the data plane. It was implemented and evaluated on a software
grams are inevitably prone to faults which could significantly switch (BMv2), and the results show that it is capable of
compromise the performance of the network and incur high troubleshooting runtime bugs with a small throughput penalty
penalty costs. and little latency increase.
A.2. Literature Review
Another important criterion that differentiates the troubleshooting schemes is the memory footprint they require.
Zhang et al. [189] proposed P4DB, an on-the-fly runtime Some schemes (e.g., P4DB) require more memory than others
debugging platform. The system debugs P4 programs in three (e.g., KeySight) which bound the memory usage.
levels of visibility by provisioning operator-friendly primi- Finally, the work in [191] is different than the others.
tives: watch, break, and next. Zhou et al. [190] proposed The authors examined how three different targets, BMv2,
P4Tester, a troubleshooting system for data plane runtime P4-NetFPGA, and Barefoot’s Tofino, behave when undesired
faults. It generates an intermediate representation of P4 programs behaviors are triggered. The authors first developed buggy
and table rules based on BDD data structure. Dumitru et programs in order to observe the actual behavior of targets.
al. [191] examined how three different targets, BMv2, P4- Then, they examined the most complex P4 program publicly
NetFPGA, and Barefoot’s Tofino, behave when undesired be- available, switch.p4, and found that it can be exploited when
haviours are triggered. Kodeswaran et al. [192] proposed a data attackers know the specifics of the implementation. In sum-
plane primitive for detecting and localizing bugs as they occur mary, the paper suggests that BMv2 leaks information from
in real time. Finally, Zhou et al. [193] proposed KeySight, a previous packets. This behavior is not observed with the other
platform that troubleshoots programmable switches with high two targets. Furthermore, the authors were able to perform
scalability and high coverage. It uses Packet Equivalence Class privilege escalation on switch.p4 due to a header destined
(PEC) abstraction when generating probes. to ensure communication between the CPU and the P4 data
plane.
A.3. Troubleshooting Schemes Comparison, Discussions, and Limitations
A.4. Comparison Legacy vs. P4-based Debugging
Table XXXIII compares the aforementioned troubleshooting In legacy networks, network devices are equipped with
schemes. Essentially, the schemes either passively track how fixed-function services that operate on standard protocols.
packets are processed inside switches (e.g., [189, 192]) or Troubleshooting these networks often involves testing protocols
and typical data plane functions (e.g., layer-3 routing) paths. Similarly, Lukács et al. [199] described a framework
through rigid probing. On the other hand, with programmable for verifying functional and non-functional requirements of
networks, since operators have the flexibility of defining protocols in P4. The system translates a P4 program into a
custom data plane functions and protocols, testing is more versatile symbolic formula to analyze various performance
complex and is program-dependent. Probing-based approaches costs. The proposed approach estimates the performance cost
should craft patterns depending on the deployed P4 program. of a P4 program prior to its execution.
Other approaches proposed primitives that increase the levels Stoenescu et al. [200] proposed Vera, a symbolic execution-
of visibility when debugging P4 programs. Research work based verification tool for P4 programs. The authors argue
extracted from the literature shows that it is essential to develop in this paper that a data plane program should be verified
flexible mechanisms that operate dynamically on diverse P4 before deployment to ensure safe operations. Vera accepts as
programs and targets. input a P4 program, and translates it to a network verification
language, SEFL. It then relies on SymNet [287], a network
B. Verification static analysis tool based on symbolic execution to analyze the
behavior of the resulting program. Essentially, Vera generates
B.1. Background all possible packet layouts after inspecting the program’s
Program verification consists of tools and methods that parser and assumes that the header fields can accept any value.
ensure correctness of programs with respect to specifications Afterwards, it tracks the paths when processing these packets
and properties. Verification of P4 programs is an active area in the program following all branches to completion. For
as bugs can cause faults that have drastic impacts on the scalability improvements, Vera utilizes a novel match-forest
performance and the security of networking systems. Static data structure to optimize updates and verification time. Pars-
P4 verification handles programs before deployment to the ing/deparsing errors, invalid memory accesses, loops, among
network, and hence, cannot detect faults that occur at runtime. others, can be detected by Vera.
On the other hand, runtime verification uses passive measurements A different approach that uses reinforcement learning is P4RL
and proactive network testing. This section describes [201], a fuzz testing system that automatically verifies P4
the major verification work pertaining to P4 programs. switches at runtime. The authors described a query language
p4q in which operators express their intended switch behavior.
B.2. Literature Review
A prototype that executes verification on a layer-3 switch was
Lopes et al. [194] proposed P4NOD, a tool that compiles implemented, and results show that P4RL detects various bugs
P4 specifications to Datalog rules. The main motivation be- and outperforms the baseline approach.
hind this work is that existing static checking tools (e.g., Finally, Dumitrescu et al. [202] proposed bf4, an end-to-
Header Space Analysis (HSA) [285], VeriFlow [286]) are end P4 program verification tool. It aims at guaranteeing that
not capable of handling changes to forwarding behaviors deployed P4 programs are bug-free. First, bf4 finds potential
without reprogramming tool internals. The authors introduced bugs at compile-time. Second, it automatically generates pred-
the “well formedness” bugs, a class of bugs arising due to the icates that must be followed by the controller whenever a rule
capabilities of modifying and adding headers. is to be inserted. Third, it proposes code changes if additional
Another interesting work is ASSERT-P4 [195, 196], a bugs remain reachable. bf4 executes a monitor at runtime
network verification technique that checks at compile-time that inspects the rules inserted by the controller and raises an
the correctness and the security properties of P4 programs. exception whenever a predicate is not satisfied. The authors
ASSERT-P4 offers a language with which programmers ex- executed bf4 on various data plane programs and interesting
press their intended properties with assertions. After annotat- bugs that were not detected in state-of-the-art approaches were
ing the program, a symbolic execution takes place with all the discovered.
assertions being checked while the paths are tested.
Further, Liu et al. [197] proposed p4v, a practical verification tool for P4. It allows the programmer to annotate the program with Hoare logic clauses in order to perform static verification. To improve scalability, the system suggests adding assumptions about the control plane and domain-specific optimizations. The control plane interface is manually written by the programmer and is not verified, which makes it error-prone and cumbersome. The authors evaluated p4v on both open source and proprietary P4 programs (e.g., switch.p4) that have different sizes and complexities.
Nötzli et al. [198] proposed p4pktgen, a tool that automatically generates test cases for P4 programs using symbolic execution and concrete paths. The tool accepts as input a JSON representation of the P4 program (output of the p4c compiler for BMv2), and generates test cases. These test cases consist of packets, table configurations, and expected

TABLE XXXIV
VERIFICATION SCHEMES COMPARISON

Scheme | Name | Engine, language | Evaluated programs | Inconsistency detection
[194] | P4NOD | NOD | 2 | ×
[195] | ASSERT-P4 | KLEE | 5 | ×
[197] | p4v | Z3 | 23 | ×
[198] | p4pktgen | SMT | 4 | ×
[199] | N/A | Pure | 0 | ×
[200] | Vera | SEFL | 11 | ×
[201] | P4RL | DDQN | 1 | ✓
[202] | bf4 | Z3 | 21 | ×

B.3. Verification Schemes Discussions
Table XXXIV compares the aforementioned verification schemes. Essentially, some schemes translate P4 programs to verification languages and engines. For instance, in [194], P4

programs are translated to Datalog to verify the reachability and well-formedness. Similarly, in [197], P4 programs are converted into Guarded Command Language (GCL) models, and then a theorem prover Z3 is used to verify that several safety, architectural and program-specific properties hold. Other schemes (e.g., p4pktgen, Vera) use symbolic execution to generate test cases for P4 programs.
The verification schemes were evaluated on different P4 programs from the literature. A program that was evaluated by most schemes is switch.p4 which implements various networking features needed for typical cloud data centers, including Layer 2/3 functionalities, ACL, QoS, etc. It is recommended for future schemes to evaluate switch.p4 as well as other programs from the literature. Finally, P4RL detects path-related inconsistencies between the data and control planes.

Fig. 22. Challenges and future trends. The references represent examples of existing works that tackle the corresponding future trends.

B.4. P4-based and Traditional Network Verification
Traditional verification techniques that address the security
properties in computer networks are mainly related to host
reachability, isolation, blackholes, and loop-freedom. Tech- A. Memory Capacity (SRAM and TCAM)
niques that check for the aforementioned properties include
Anteater [288], which models the data plane as boolean Stateful processing is a key enabler for programmable
functions to be used in a Boolean Satisfiability Problem (SAT) data planes as it allows applications to store and retrieve
solver, NetPlumber [289] which uses header space algebra data across different packets. This advantage enabled a wide
[285], and others (e.g., VeriFlow [286], DeltaNet [290], Flover range of novel applications (e.g., in-network caching, fine
[291], and VMN [292]). grained measurements, stateful load balancing, etc.) that were
Since P4 programs incorporate customized protocols and not possible in non-programmable networks. The amount
processing logic to be used in the data plane, traditional tools of data stored in the switch is limited by the size of the
are not capable of handling changes to forwarding behaviors on-chip memory which ranges from tens to hundreds of
without reprogramming their internals. Therefore, verification megabytes at most. Consequently, the majority of stateful-
techniques in programmable networks rely on analyzing the based applications face trade-offs between performance
P4 programs themselves since they define the behavior of the and memory usage. For instance, the efficiency of caching
data plane. which is determined by the hit rate is directly affected by the
memory size. Furthermore, the vast majority of measurement
applications require storing statistics in the data plane (e.g.,
C. Summary and Lessons Learned byte/packet counters). The number of flows to be measured
and the richness of measurement information is bound by the
Network testing can generally be divided into debug-
size of the memory in the switch.
ging/troubleshooting network problems and verifying the be-
havior of forwarding devices. While traditional tools and Current and future initiatives. A notable work by Kim et
techniques were adequate for non-programmable networks, al. [295, 296] suggests accessing remote Dynamic Random
they are insufficient for programmable ones due to their Access Memory (DRAM) installed on data center servers
inability to handle changes to forwarding behaviors without purely from the data plane to expand the available memory on the
reprogramming and restructuring their internals. A variety of switch. The bandwidth of the chip is traded for the bandwidth
works were proposed to analyze and model P4 programs in needed to access the external DRAM. The approach is cheap
order to troubleshoot and verify the correctness of networks’ and flexible since it reuses existing resources in commodity
operations. hardware without adding additional infrastructure costs. The
system is realized by allowing the data plane to access remote
memory through an access channel (RDMA over Converged
XIII. CHALLENGES AND FUTURE TRENDS
Ethernet (RoCE)) as shown in Fig. 23. The implementation
shows that the proposal achieves throughput close to the line
In this section, a number of research and operational
rate, and only incurs 1-2 extra microseconds of latency (Fig.
challenges that correspond to the proposed taxonomy are
24). There are some limitations in this approach that can be
outlined. The challenges are extracted after comprehensively
explored in the future.
reviewing and diving into each work in the described literature.
Further, the section discusses and pinpoints several initiatives • The current implementation only supports address-based
for future work which could be worthy of being pursued in this memory access, and hence, complicated data layouts and
imperative field of programmable switches. The challenges ternary matching in remote memory should be explored.
and the future trends are illustrated in Fig. 22. • Frequent updates in the remote memory require several

Fig. 23. Expanding switch memory by leveraging remote DRAM on commodity servers [295]. (Figure labels: ASIC pipeline; RDMA, RoCE; commodity servers acting as remote buffer servers, remote table servers, and remote state stores; general-purpose DRAM pool.)

stage no longer has local memory. Additionally, this work solves the sequential execution limitation by creating a cluster of processors used to execute operations in any order. The main limitation of this approach is the lack of adoption by any hardware vendors. Most of the switch vendors (e.g., Cavium’s XPliant and Barefoot’s Tofino) do not implement the disaggregation model and follow the regular Reconfigurable Match-action Tables (RMT) model. The implementation and analysis of the disaggregation model on hardware targets should be explored in the future.

C. Arithmetic Computations
There are several challenges that must be handled when dealing with arithmetic computations in the data plane. First, programmable switches support a small set of simple arithmetic computations that operate on non-floating point values.
packets for fetching and adding. This is common in mea-
Second, only a few operations are supported per packet to
surement applications where counters are continuously in-
guarantee the execution at line rate. Typically, a packet should
cremented. A possible solution to the bandwidth overhead is
only spend tens of nanoseconds in the processing pipeline.
aggregating updates into a single operation. This comes with
Third, computations in the data plane consume significant
the cost of having delays in the updates.
hardware resources, hampering the possibility of other pro-
• Packet loss between the switch and the remote memory
grams to execute concurrently. A wide range of applications
should be handled, otherwise, the performance of the ap-
suffer from the lack of complex computations in the data
plication and the freshness of the remote values might be
plane. For instance, some operations required by AQMs (e.g.,
affected.
square root function in the CoDel algorithm) are complex
• The interaction between general data plane applications and
to be implemented with P4. Additionally, the majority of
the remote memory is challenging. A potential improvement
machine learning frameworks and models operate on floating
is designing well-defined APIs to facilitate the interaction.
point values while the supported arithmetic operations on the
switch operate on integer values. In-network model updates
B. Resources Accessibility aggregation requires calculating the average over a set of
floating-point vectors.
Besides the size limitation of the on-chip memory, there are
other restrictions that data plane developers should take into Current and Future Initiatives. Existing methods to over-
account [297, 304]. First, since the table memory is local come the computation limitations include approximation and
to each stage in the pipeline, other stages cannot reclaim pre-computations. In the approximation method, the applica-
non-utilized memory in other stages. As a result, memory tion designer relies on the small set of supported operations
and match/action processing are fused, making the placement to approximate the desired value, at the cost of sacrificing
of tables challenging. Second, the sequential execution of precision. For example, approximating the square root function
operations in the pipeline leads to poor utilization of resources can be achieved by counting the number of leading zeros
especially when the matches and the actions are imbalanced through longest prefix match [91]. It would be beneficial
(i.e., the presence of default actions that do not need a match). for P4 developers to have access to a community-maintained
library which encompasses P4 code that approximates various
Current and Future Initiatives. An interesting work by complex functions. In the pre-computations method, values are
Chole et al. [297] explored the idea of disaggregating the computed by the control plane (e.g., switch CPU) and stored
memory and compute resources of a programmable switch. in match-action tables or registers. Future work can explore
The main notion of this work is to centralize the memory methods that automatically identify the complex computations
as a pool that is accessed by a crossbar. By doing so, each that can be pre-evaluated in the control plane. After identifica-
tion, the data plane code and its corresponding control plane
APIs can be automatically generated.
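The approximation and pre-computation methods described above can be sketched as follows. This is a minimal Python emulation under stated assumptions, not the implementation from [91] and not P4 code: the function names, the 32-bit width, and the choice of the bucket midpoint are hypothetical. The position of the most significant bit plays the role of the leading-zeros count that a longest prefix match would return, and the dictionary stands in for a match-action table filled by the control plane.

```python
import math


def approx_sqrt_msb(x: int) -> int:
    """Approximation: use only the index of the most significant bit.

    For x in [2^k, 2^(k+1)), return 2^(k // 2). The result r satisfies
    r <= sqrt(x) < 2 * r, so precision is traded for simplicity.
    """
    if x <= 0:
        return 0
    msb = x.bit_length() - 1  # floor(log2(x)), i.e. width - 1 - leading zeros
    return 1 << (msb // 2)


def build_sqrt_table(width: int = 32) -> dict:
    """Pre-computation: the control plane (which has floating point)
    computes one value per MSB position, here the square root of the
    bucket midpoint 1.5 * 2^k, rounded to an integer."""
    return {msb: round(math.sqrt(1.5 * (1 << msb))) for msb in range(width)}


def lookup_sqrt(table: dict, x: int) -> int:
    """Data plane side: a single table lookup keyed by the MSB position."""
    if x <= 0:
        return 0
    return table[x.bit_length() - 1]


SQRT_TABLE = build_sqrt_table()

if __name__ == "__main__":
    for value in (10, 1000, 123456):
        print(value, approx_sqrt_msb(value),
              lookup_sqrt(SQRT_TABLE, value), math.sqrt(value))
```

The table-based variant is usually closer to the true value because the control plane can choose any representative per bucket, at the cost of one table entry per bit position.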

D. Network-wide Cooperation
The SDN architecture suggests using a centralized controller
for network-wide switches management. Through centraliza-
tion, the state of each programmable switch can be shared with
other switches. Consequently, applications will have the ability
Fig. 24. Accessing remote DRAM latency overhead. Achieved throughput to make better decisions as network-wide data is available
close to the line rate (≈ 37.5 Gbps) [295]. locally on the switch. The problem with such architecture is

Fig. 25. (a) Local detection of DDoS attacks: switches S1 and S2 each keep a table (ID, Count) for the sender IPA and compare only their own count against the threshold T (C1 < T, C2 < T). (b) Network-wide detection of DDoS attacks: the switches exchange C1 and C2, keep an additional CountTotal column, and compare the sum against the threshold (C1 + C2 > T).
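The difference between the two cases in Fig. 25 can be illustrated with a short sketch. It is a simplified Python model, assuming per-sender byte counters and a single threshold T; the function names, the threshold value, and the sample counts are hypothetical, and the state exchange between switches is reduced to summing dictionaries.

```python
THRESHOLD = 1000  # hypothetical byte threshold T


def local_detect(per_switch_counts, threshold=THRESHOLD):
    """Case (a): each switch compares only its own counters against T."""
    return {
        switch: {src for src, count in counts.items() if count > threshold}
        for switch, counts in per_switch_counts.items()
    }


def network_wide_detect(per_switch_counts, threshold=THRESHOLD):
    """Case (b): counters are shared, and the total per sender is compared against T."""
    totals = {}
    for counts in per_switch_counts.values():
        for src, count in counts.items():
            totals[src] = totals.get(src, 0) + count
    return {src for src, total in totals.items() if total > threshold}


if __name__ == "__main__":
    # Load balancing splits the sender's traffic so that C1 < T and C2 < T,
    # while C1 + C2 > T.
    counts = {"S1": {"IPA": 600}, "S2": {"IPA": 700}}
    print(local_detect(counts))
    print(network_wide_detect(counts))
```

With these sample counts, neither switch flags the sender on its own, while the shared view does, which is the behavior the figure is meant to convey.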

the requirement of having a continuous exchange of packets exchanged data. P4Sync addresses the limitations of existing
with a software-based system. As an alternative, switches can approaches. It guarantees the completeness of the migration,
exchange messages to synchronize their states in a decentral- ensuring that the snapshot transfer is completed. Moreover, it
ized manner. solves the overhead of the repeatedly retransmitted updates.
An interesting aspect of P4Sync is its ability to control the
Consider Fig. 25 which shows an in-network DDoS defense
migration traffic rate depending on the changing network
solution. Each switch maintains a list of senders and their
conditions. Zeno et al. [303] presented a design of SwiSh-
corresponding numbers of bytes. A switch compares the
mem, a management layer that facilitates the deployment of
number of bytes transmitted from a given flow to a threshold.
network functions (NFs) on multiple switches by managing
When the threshold is crossed, the flow is blocked and the
the distributed shared states.
device is identified as a malicious DDoS sender. Assume
that the network implements a load balancing mechanism that The future work in this area should consider handling
distributes traffic across the switches. In the scenario where frequent state migrations. Some systems require migration
switches do not consider the byte counts of other switches packets to be generated each RTT, causing increased traffic
(Fig. 25 (a)), the traffic of a DDoS device might remain under overhead and additional expensive authentication operations.
the threshold. On the other hand, when switches synchronize For instance, P4Sync uses public key cryptography in the
their states by sharing the byte counts (Fig. 25 (b)), the control plane to sign and verify the end of the migration
total number of bytes is compared against the threshold. sequence chain (2.15ms for signing and 0.07ms to verify using
Consequently, the total load of a DDoS device is considered. RSA-2048 signature). Frequent migrations would cause this
This example demonstrates an application that heavily depends signature to be involved repeatedly. Another major concern
on network-wide cooperation and hence motivates the need for that should be handled in future work is denial of service.
state synchronization. Even with migration updates authentication, changes in the
packets cause the receiver to reject updates, leading to state
Current and Future Initiatives. Arashloo et al. [298] pro- inconsistency among switches.
posed SNAP, a centralized stateful programming model that
aims at solving the synchronization problem. SNAP introduced
the idea of writing programs for “one big switch” instead of E. Control Plane Intervention
many. Essentially, developers write stateful applications with-
Delegating tasks to the control plane incurs latency and
out caring about the distribution, placement, and optimization
affects the application’s performance. For instance, in conges-
of access to resources. SNAP is limited to one replica of
tion control, rerouting-based schemes often use tables to store
each state in the network. Sviridov et al. [299, 300] proposed
alternative routes. Since the data plane cannot directly modify
LODGE and LOADER to extend SNAP and enable multiple
table entries, intervention from the control plane is required.
replicas. Luo et al. [301] proposed Swing State, a framework
The interaction with the control plane in this application
for runtime state migration and management. This approach
hampers the promptness of rerouting. Another example is
leverages existing traffic to piggyback state updates between
methods that use collision-free hashing. For example, cuckoo
cooperating switches. Swing State overcomes the challenges
hash [305], which rearranges items to solve collisions, uses a
of the SDN-based architecture by synchronizing the states
complex search algorithm that cannot run on the switch ASIC,
entirely in the data plane, at line rate, and without intervention
and is often executed on the switch CPU. Ideally, the control
from the control plane. There are several limitations with this
plane intervention should be minimized when possible. For
approach. First, there are no message delivery guarantees (i.e.,
example, to synchronize the state among switches, in-network
packets dropped/reordered are not retransmitted), leading to
cooperation should be considered.
inconsistency in the states among the switches. Second, it does
not merge the states if two switches share common states. Current and Future Initiatives. The design of the interaction
Third, the overhead can significantly increase if a single state between the control plane and the data plane is fully decided
is mirrored several times. Finally, there is no authentication by the developer. Experienced developers might have enough
of data or senders. Xing et al. [302] proposed P4Sync, a background to immediately minimize such interaction. Future
system that migrates states between switches in the data plane work should devise algorithms and tools that automatically
while guaranteeing the authenticity of the senders and the determine the excessive interaction between the control/data

planes, and suggest alternative workflows (ideally, as generated deploy programmable switches in an incremental fashion. That
codes) to minimize such interaction. is, P4 switches will be added to the network alongside the
existing legacy devices. While this solution seems simplistic
F. Security at first, studies have shown that partial deployment leads
to reduced effectiveness [162]. For instance, the accuracy of
When designing a system for the data plane, the developer
heavy hitter detection schemes is strongly affected by the flow
must envision the kind of traffic a malicious user can initiate
visibility. The work in [162] devised a greedy algorithm that
to corrupt the operation of the system. This class of attacks is
attempts to strategically position P4 switches in the network,
referred to as sensitivity attacks as coined in [179]. Essentially,
with the goal of monitoring as many distinct network flows
an attacker can intelligently craft traffic patterns to trigger
as possible. The F1 score is used to quantify correctness of
unexpected behaviors of a system in the data plane. For
switches placement. Future work in this area should consider
instance, a load balancer that balances traffic through packet
generalizing and enhancing this approach to work with any P4
headers hashing without cryptographic support (e.g., modulo generalizing and enhancing this approach to work with any P4
application, and not only heavy hitter detection. For instance,
operator on the number of available paths) can be tricked by an
a future work could suggest the positioning of P4 switches in
attacker that crafts skewed traffic patterns. This results in traffic a future work could suggest the positioning of P4 switches in
applications such as in-network caching, accelerated consen-
being forwarded to a single path, leading to congestion, link
sus, and in-network defenses, while taking into account the
saturation, and denial of service. Another example is attacks
current topology consisting of legacy devices.
against in-network caching. Caching in data plane performs
well when requests are mostly reads rather than writes. If an
attacker continuously generates highly skewed write requests, H. Programming Simplicity and Modularity
the load on the storage servers would be imbalanced. If the Writing in-network applications using P4 language is not
system is designed to handle write queries on hot items in the an easy task. Recent studies have shown that many existing
switch, a random failure in the switch causes data to be lost. P4 programs have several bugs that might lead to network
Further, an attacker can also exploit the memory limitation disruption [191]. For several decades, the networking indus-
of switch and request diverse values, causing the pre-cached try operated in a bottom-up approach, where switches are
values to be evicted. equipped with fixed-function ASICs. Consequently, little to
Current and Future Initiatives. To mitigate against sensi- no programming skills were needed by network operators.
tivity attacks, a developer attempts to discover various un- With the advent of programmable switches, operators are now
predicted traffic patterns, and accordingly, develops defense expected to have experience in programming the ASIC2 .
strategies. Such solution is highly unreliable, time consuming, Current and Future Initiatives. Since programming the
and error-prone. Recent efforts [179] aimed at automatically ASIC is not a straightforward task, future research endeavours
discovering sensitivity attacks in the data plane. Essentially, should consider simplifying the programming workflow for
the proposed system aims at deriving traffic patterns that would the operators and generating code (e.g., [293]). For instance,
drive the program away from common case behavior as much graphical tools can be developed to translate workflows (e.g.,
as possible. Other efforts focused on architecting defenses in flowcharts) to P4 programs that can fit into the hardware.
the data plane that perform distributed mode changes upon Further, future work should develop tools that allow operators
attack discovery [178]. Future work in this direction should to enable features (i.e., program modules) that will translate to
consider achieving high assurance by formally verifying the P4 programs. As an analogy, consider the mobile application
codes. Additionally, the stability of the data plane should be stores (e.g., Play store, Apple store). The user simply down-
carefully handled with fast mode changes; future work could loads and installs application on the device, without having to
consider integrating self-stabilizing systems for such purpose. understand anything about programming. An interesting work
Finally, future work should provide security interfaces for could investigate the idea of creating a store for P4 applications
collaborating switches that belong to different domains. It is where operators select the “apps” they want to activate, and
also worth exposing sensitivity attack patterns for different the result is a generated P4 program optimized to fit in the
application types so that data plane developers can avoid the hardware, considering the different targets available in the
vulnerabilities that trigger those attacks in their codes. market today (e.g., Tofino). Recent efforts attempted to merge
and test modular programs in P4 [294].
G. Interoperability
Programmable switches pave the way for a wide range of XIV. C ONCLUSIONS
innovative in-network applications. The literature has shown This article presents an exhaustive survey on programmable
that significant performance improvements are brought when data planes. The survey describes the evolution of networking
applications offload their processing logic to the network. by discussing the traditional control plane and the transition to
Despite such facts, it is very unlikely that mobile operators
2 Note that most vendors (e.g., Barefoot Networks) provide a program
will replace their current infrastructure with programmable
(switch.p4) that expresses the forwarding plane of a switch, with the typical
switches in one shot. This unlikelihood comes from the fact features of an advanced layer-2 and layer-3 switch. If the goal is to simply
that major operational and budgeting costs will incur. deploy a switch with no in-network applications, then the operators are not
required to program the chip. They just need to learn the interaction between
Current and Future Initiatives. Network operators might the control plane and the data plane (e.g., to populate table entries).
SDN. Afterwards, the survey motivates the need for programming the data plane and delves into the general architecture of a programmable switch (PISA). A brief description of P4, the de facto language for programming the data plane, is presented. Motivated by the increasing trend in programming the data plane, the survey provides a taxonomy that sheds light on numerous significant works and compares schemes within each category in the taxonomy and with those in legacy approaches. The survey concludes by discussing challenges and considerations as well as various future trends and initiatives.

ACKNOWLEDGEMENT

This material is based upon work supported by the National Science Foundation under grant numbers 1925484 and 1829698, funded by the Office of Advanced Cyberinfrastructure (OAC).

REFERENCES

[1] N. McKeown, “How we might get humans out of the way.” Open Networking Foundation (ONF) Connect 19, Sep. 2019. [Online]. Available: https://tinyurl.com/y4dnxacz.
[2] RFC Editor, “Number of RFCs published per year.” [Online]. Available: https://www.rfc-editor.org/rfcs-per-year/.
[3] B. Trammell and M. Kuehlewind, “Report from the IAB workshop on stack evolution in a middlebox Internet (SEMI),” RFC7663. [Online]. Available: https://tools.ietf.org/html/rfc7663.

TABLE XXXV
ABBREVIATIONS USED IN THIS ARTICLE

Abbreviation Term
ABR Adaptive Bit Rate
ACK Acknowledgement
ACL Access Control List
AFQ Approximate Fair Queueing
AIMD Additive Increase Multiplicative Decrease
ALU Arithmetic Logical Unit
API Application Programming Interface
AQM Active Queue Management
AS Autonomous System
ASIC Application-specific Integrated Circuit
ATPG Automatic Test Packet Generation
ATT Attribute Protocol
BBR Bottleneck Bandwidth and Round-trip Time
BDD Binary Decision Diagram
BFT Byzantine Fault Tolerance
BGP Border Gateway Protocol
BIER Bit Index Explicit Replication
BLE Bluetooth Low Energy
BLESS Bluetooth Low Energy Service Switch
BMv2 Behavioral Model Version 2
BNN Binary Neural Network
BQPS Billion Queries Per Second
BYOD Bring Your Own Device
CAIDA Center of Applied Internet Data Analysis
CC Congestion Control
CNN Convolutional Neural Network
CoDel Controlled Delay
CPU Central Processing Unit
CRC Cyclic Redundancy Check
CWND Congestion Window
DCQCN Data Center Quantized Congestion Notification
DCTCP Data Center Transmission Control Protocol
DDoS Distributed Denial-of-Service
DIP Direct Internet Protocol
DMA Direct Memory Access
DMZ Demilitarized Zone
DNS Domain Name Server
DPDK Data Plane Development Kit
DRAM Dynamic Random Access Memory
DSP Digital Signal Processors
ECMP Equal-Cost Multi-Path Routing
ECN Explicit Congestion Notification
ESP Encapsulating Security Payload
FAST Flow-level State Transitions
FCT Flow Completion Time
FIB Forwarding Information Base
FPGA Field-programmable Gate Array
FQ Fair Queueing
GPU Graphics Processing Unit
GRE Generic Routing Encapsulation
HCF Hop-Count Filtering
HSA Header Space Analysis
HTCP Hamilton Transmission Control Protocol
HTTP Hypertext Transfer Protocol
IDS Intrusion Detection System
IGMP Internet Group Management Protocol
IKE Internet Key Exchange
ILP Integer Linear Programming
INT In-band Network Telemetry
IoT Internet of Things
IP Internet Protocol
ISP Internet Service Provider
JSON JavaScript Object Notation
KDN Knowledge-defined Networking
KPI Key Performance Indicator
LAN Local Area Network
LFA Link Flooding Attack
LPM Longest Prefix Match
LPWAN Low Power Wide Area Network
LTE Long Term Evolution
MAC Medium Access Control
MAU Match-Action Unit
MCM Multicolor Markers
MIMD Multiplicative Increase Multiplicative Decrease
ML Machine Learning
MOS Mean Opinion Score
MPC Mobile Packet Core
MQTT Message Queueing Telemetry Transport
MSS Maximum Segment Size
MPTCP Multipath Transmission Control Protocol
MTU Maximum Transmission Unit
NACK Negative Acknowledgement
NAT Network Address Translation
NDA Non-disclosure Agreement
NDN Named Data Networking
NFV Network Functions Virtualization
NIC Network Interface Controller
NN Neural Networks
NSH Network Service Header
ONOS Open Network Operating System
OSPF Open Shortest Path First
OUM Ordered Unreliable Multicast
OVS Open Virtual Switch
P2P Peer-to-peer
PBT Postcard-Based Telemetry
PCC Performance-oriented Congestion Control
PCC Per-Connection Consistency
PD Program Dependent
PGW Packet Data Network Gateway
PI Protocol Independent
PIE Proportional Integral Controller Enhanced
PISA Protocol Independent Switch Architecture
QoE Quality of Experience
QoS Quality of Service
RAM Random-Access Memory
RDMA Remote Direct Memory Access
RED Random Early Detection
REST Representational State Transfer
RFC Request for Comments
RMT Reconfigurable Match-action Tables
RSA Rivest-Shamir-Adleman
RSS Really Simple Syndication
RTT Round-trip Time
RWND Receiver Window
SAD Security Association Database
SAT Boolean Satisfiability Problem
SDN Software Defined Networking
SHA Secure Hash Algorithm
SIP Session Initiation Protocol
SLA Service Level Agreement
SNMP Simple Network Management Protocol
SPD Security Policy Database
SRAM Static Random-Access Memory
SSH Secure Shell
TCAM Ternary Content-Addressable Memory
TCP Transmission Control Protocol
TM Traffic Management
ToR The Onion Router
TPU Tensor Processing Unit
TTL Time-to-Live
UDP User Datagram Protocol
UE User Equipment
VIP Virtual Internet Protocol
VMN Verifying Mutable Networks
VN Virtual Network
VoLTE Voice over Long-term Evolution
VXLAN Virtual eXtensible Local Area Network
WAN Wide Area Network
XDP eXpress Data Path

[4] G. Papastergiou, G. Fairhurst, D. Ros, A. Brunstrom, K.-J. Grinnemo, P. Hurtig, N. Khademi, M. Tüxen, M. Welzl, D. Damjanovic, and S. Mangiante, “De-ossifying the Internet transport layer: a survey and future perspectives,” IEEE Communications Surveys & Tutorials, vol. 19, no. 1, pp. 619–639, 2016.
[5] “VMware, Cisco stretch virtual LANs across the heavens.” in The Register, Aug. 2011. [Online]. Available: https://tinyurl.com/y6mxhqzn.
[6] M. Mahalingam, D. Dutt, K. Duda, P. Agarwal, L. Kreeger, T. Sridhar, M. Bursell, and C. Wright, “Virtual eXtensible Local Area Network (VXLAN): a framework for overlaying virtualized layer 2 networks over layer 3 networks,” RFC7348. [Online]. Available: http://www.rfc-editor.org/rfc/rfc7348.txt.
[7] M. Casado, M. J. Freedman, J. Pettit, J. Luo, N. McKeown, and S. Shenker, “Ethane: taking control of the enterprise,” ACM SIGCOMM Computer Communication Review, vol. 37, no. 4, pp. 1–12, 2007.
[8] D. Kreutz, F. M. Ramos, P. E. Verissimo, C. E. Rothenberg, S. Azodolmolky, and S. Uhlig, “Software-defined networking: a comprehensive survey,” Proceedings of the IEEE, vol. 103, no. 1, pp. 14–76, 2014.
[9] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, and G. Varghese, “P4: programming protocol-independent packet processors,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 3, pp. 87–95, 2014.
[10] Barefoot Networks, “Use cases.” [Online]. Available: https://www.barefootnetworks.com/use-cases/.
[11] A. Weissberger, “Comcast: ONF Trellis software is in production together with L2/L3 white box switches.” [Online]. Available: https://tinyurl.com/y69jc7sv.
[12] N. Akiyama, M. Nishiki, “P4 and Stratum use case for new edge cloud.” [Online]. Available: https://tinyurl.com/yxuoo9qv.
[13] Stordis GmbH, “New STORDIS advanced programmable switches (APS) first to unlock the full potential of P4 and next generation software defined networking (NG-SDN).” [Online]. Available: https://tinyurl.com/y3kjnypl.
[14] Open Networking Foundation (ONF), “Stratum – ONF launches major new open source SDN switching platform with support from Google.” [Online]. Available: https://tinyurl.com/yy3ykw7g.
[15] P4.org Community, “P4 gains broad adoption, joins Open Networking Foundation (ONF) and Linux Foundation (LF) to accelerate next phase of growth and innovation.” [Online]. Available: https://p4.org/p4/p4-joins-onf-and-lf.html.
[16] Facebook engineering, “Disaggregate: networking recap.” [Online]. Available: https://tinyurl.com/yxoaj7kw.
[17] Open Compute Project (OCP), “Alibaba DC network evolution with open SONiC and programmable HW.” [Online]. Available: https://www.opencompute.org/files/OCP2018.alibaba.pdf.
[18] S. Heule, “Using P4 and P4Runtime for optimal L3 routing.” [Online]. Available: https://tinyurl.com/y365gnqy.
[19] N. McKeown, “SDN phase 3: getting the humans out of the way. ONF Connect 19.” [Online]. Available: https://tinyurl.com/tp9bxw4.
[20] Edgecore, “Wedge 100BF-32X, 100GbE data center switch,” 2020. [Online]. Available: https://tinyurl.com/sy2jkqe.
[21] STORDIS, “The new advanced programmable switches are available.” [Online]. Available: https://www.stordis.com/products/.
[22] Cisco, “Cisco Nexus 34180YC and 3464C programmable switches data sheet.” [Online]. Available: https://tinyurl.com/y92cbdxe.
[23] Arista, “Arista 7170 series.” [Online]. Available: https://www.arista.com/en/products/7170-series.
[24] Juniper Networks, “Juniper advancing disaggregation through P4Runtime integration.” [Online]. Available: https://tinyurl.com/yygz547t.
[25] Interface Masters, “Tahoe 2624.” [Online]. Available: https://interfacemasters.com/products/switches/10g-40g/tahoe-2624/.
[26] Barefoot Networks, “Tofino ASIC.” [Online]. Available: https://www.barefootnetworks.com/products/brief-tofino/.
[27] Xilinx, “Xilinx solutions.” [Online]. Available: https://www.xilinx.com/products/silicon-devices.html.
[28] Pensando, “The Pensando distributed services platform.” [Online]. Available: https://pensando.io/our-platform/.
[29] Mellanox, “Empowering the next generation of secure cloud SmartNICs.” [Online]. Available: https://www.mellanox.com/products/smartnic.
[30] Innovium, “Teralynx switch silicon.” [Online]. Available: https://www.innovium.com/teralynx/.
[31] I. Baldin, J. Griffioen, K. Wang, I. Monga, A. Nikolich, “Mid-Scale RI-1 (M1:IP): FABRIC: adaptive programmable research infrastructure for computer science and science applications.” [Online]. Available: https://tinyurl.com/y463v9z9.
[32] FABRIC, “About FABRIC.” [Online]. Available: https://fabric-testbed.net/about/.
[33] J. Mambretti, J. Chen, F. Yeh, and S. Y. Yu, “International P4 networking testbed,” in SC19 Network Research Exhibition, 2019.
[34] 2STiC, “A national programmable infrastructure to experiment with next-generation networks.” [Online]. Available: https://www.2stic.nl/national-programmable-infrastructure.html.
[35] H. Stubbe, “P4 compiler & interpreter: a survey,” Future Internet (FI) and Innovative Internet Technologies and Mobile Communication (IITM), vol. 47, 2017.
[36] T. Dargahi, A. Caponi, M. Ambrosin, G. Bianchi, and M. Conti, “A survey on the security of stateful SDN data planes,” IEEE Communications Surveys & Tutorials, vol. 19, no. 3, pp. 1701–1725, 2017.
[37] W. L. da Costa Cordeiro, J. A. Marques, and L. P. Gaspary, “Data plane programmability beyond OpenFlow: opportunities and challenges for network and service operations and management,” Journal of Network and Systems Management, vol. 25, no. 4, pp. 784–818, 2017.
[38] A. Satapathy, “Comprehensive study of P4 programming language and software-defined networks,” 2018. [Online]. Available: https://tinyurl.com/y4d4zma9.
[39] R. Bifulco and G. Rétvári, “A survey on the programmable data plane: abstractions, architectures, and open problems,” in 2018 IEEE 19th International Conference on High Performance Switching and Routing (HPSR), pp. 1–7, IEEE, 2018.
[40] E. Kaljic, A. Maric, P. Njemcevic, and M. Hadzialic, “A survey on data plane flexibility and programmability in software-defined networking,” IEEE Access, vol. 7, pp. 47804–47840, 2019.
[41] P. G. Kannan and M. C. Chan, “On programmable networking evolution,” CSI Transactions on ICT, vol. 8, no. 1, pp. 69–76, 2020.
[42] L. Tan, W. Su, W. Zhang, J. Lv, Z. Zhang, J. Miao, X. Liu, and N. Li, “In-band network telemetry: A survey,” Computer Networks, p. 107763, 2020.
[43] X. Zhang, L. Cui, K. Wei, F. P. Tso, Y. Ji, and W. Jia, “A survey on stateful data plane in software defined networks,” Computer Networks, p. 107597, 2020.
[44] G. Bianchi, M. Bonola, A. Capone, and C. Cascone, “OpenState: programming platform-independent stateful OpenFlow applications inside the switch,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 2, pp. 44–51, 2014.
[45] M. Moshref, A. Bhargava, A. Gupta, M. Yu, and R. Govindan, “Flow-level state transition as a new switch primitive for SDN,” in Proceedings of the third workshop on Hot topics in software defined networking, pp. 61–66, 2014.
[46] P4 Language Consortium, “P4Runtime.” [Online]. Available: https://github.com/p4lang/PI/.
[47] Y. Rekhter, T. Li, and S. Hares, “A border gateway protocol 4 (bgp-4),” RFC4271. [Online]. Available: http://www.rfc-editor.org/rfc/rfc4271.txt.
[48] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “Openflow: enabling innovation in campus networks,” ACM SIGCOMM Computer Communication Review, vol. 38, no. 2, pp. 69–74, 2008.
[49] N. McKeown, “Why does the Internet need a programmable forwarding plane.” [Online]. Available: https://tinyurl.com/y6x7qqpm.
[50] A. Shapiro, “P4-programming data plane use-cases.” in P4 Expert Roundtable Series, April 28-29, 2020. [Online]. Available: https://tinyurl.com/y5n4k83h.
[51] C. Kim, “Evolution of networking, Networking Field Day 21, 2:01,” 2019. [Online]. Available: https://tinyurl.com/y9fkj7qx.
[52] Z. Liu, J. Bi, Y. Zhou, Y. Wang, and Y. Lin, “Netvision: towards network telemetry as a service,” in 2018 IEEE 26th International Conference on Network Protocols (ICNP), pp. 247–248, IEEE, 2018.
[53] J. Hyun, N. Van Tu, and J. W.-K. Hong, “Towards knowledge-defined networking using in-band network telemetry,” in NOMS 2018-2018 IEEE/IFIP Network Operations and Management Symposium, pp. 1–7, IEEE, 2018.
[54] Y. Kim, D. Suh, and S. Pack, “Selective in-band network telemetry for overhead reduction,” in 2018 IEEE 7th International Conference on Cloud Networking (CloudNet), pp. 1–3, IEEE, 2018.
[55] J. A. Marques, M. C. Luizelli, R. I. T. da Costa Filho, and L. P. Gaspary, “An optimization-based approach for efficient network monitoring using in-band network telemetry,” Journal of Internet Services and Applications, vol. 10, no. 1, p. 12, 2019.
[56] B. Niu, J. Kong, S. Tang, Y. Li, and Z. Zhu, “Visualize your IP-over-optical network in realtime: a P4-based flexible multilayer in-band network telemetry (ML-INT) system,” IEEE Access, vol. 7, pp. 82413–82423, 2019.
[57] R. Ben Basat, S. Ramanathan, Y. Li, G. Antichi, M. Yu, and M. Mitzenmacher, “PINT: probabilistic in-band network telemetry,” in Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pp. 662–680, 2020.
[58] N. Van Tu, J. Hyun, and J. W.-K. Hong, “Towards ONOS-based SDN monitoring using in-band network telemetry,” in 2017 19th Asia-Pacific Network Operations and Management Symposium (APNOMS), pp. 76–81, IEEE, 2017.
[59] Serkant, “Prometheus INT exporter.” [Online]. Available: https://github.com/serkantul/prometheus_int_exporter/.
[60] N. Van Tu, J. Hyun, G. Y. Kim, J.-H. Yoo, and J. W.-K. Hong, “IntCollector: a high-performance collector for in-band network telemetry,” in 2018 14th International Conference on Network and Service Management (CNSM), pp. 10–18, IEEE, 2018.
[61] Barefoot Networks, “Barefoot Deep Insight - product brief.” [Online]. Available: https://tinyurl.com/u2ncvry.
[62] Broadcom, “BroadView Analytics, Trident 3 in-band telemetry.” [Online]. Available: https://tinyurl.com/yxr2qydb.
[63] M. Handley, C. Raiciu, A. Agache, A. Voinescu, A. W. Moore, G. Antichi, and M. Wójcik, “Re-architecting datacenter networks and stacks for low latency and high performance,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 29–42, 2017.
[64] B. Turkovic, F. Kuipers, N. van Adrichem, and K. Langendoen, “Fast network congestion detection and avoidance using P4,” in Proceedings of the 2018 Workshop on Networking for Emerging Applications and Technologies, pp. 45–51, 2018.
[65] Y. Li, R. Miao, H. H. Liu, Y. Zhuang, F. Feng, L. Tang, Z. Cao, M. Zhang, F. Kelly, M. Alizadeh, and M. Yu, “HPCC: high precision congestion control,” in Proceedings of the ACM Special Interest Group on Data Communication, pp. 44–58, 2019.
[66] A. Feldmann, B. Chandrasekaran, S. Fathalli, and E. N. Weyulu, “P4-enabled network-assisted congestion feedback: a case for NACKs,” 2019.
[67] E. F. Kfoury, J. Crichigno, E. Bou-Harb, D. Khoury, and G. Srivastava, “Enabling TCP pacing using programmable data plane switches,” in 2019 42nd International Conference on Telecommunications and Signal Processing (TSP), pp. 273–277, IEEE, 2019.
[68] B. Turkovic and F. Kuipers, “P4air: Increasing fairness among competing congestion control algorithms,” 2020.
[69] Y. Li, R. Miao, C. Kim, and M. Yu, “Flowradar: A better NetFlow for data centers,” in 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 311–324, 2016.
[70] Z. Liu, A. Manousis, G. Vorsanger, V. Sekar, and V. Braverman, “One sketch to rule them all: rethinking network flow monitoring with UnivMon,” in Proceedings of the 2016 ACM SIGCOMM Conference, pp. 101–114, 2016.
[71] S. Narayana, A. Sivaraman, V. Nathan, P. Goyal, V. Arun, M. Alizadeh, V. Jeyakumar, and C. Kim, “Language-directed hardware design for network performance monitoring,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 85–98, 2017.
[72] M. Ghasemi, T. Benson, and J. Rexford, “Dapper: data plane performance diagnosis of TCP,” in Proceedings of the Symposium on SDN Research, pp. 61–74, 2017.
[73] T. Yang, J. Jiang, P. Liu, Q. Huang, J. Gong, Y. Zhou, R. Miao, X. Li, and S. Uhlig, “Elastic sketch: adaptive and fast network-wide measurements,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pp. 561–575, 2018.
[74] N. Yaseen, J. Sonchack, and V. Liu, “Synchronized network snapshots,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pp. 402–416, 2018.
[75] R. Joshi, T. Qu, M. C. Chan, B. Leong, and B. T. Loo, “Burstradar: practical real-time microburst monitoring for datacenter networks,” in Proceedings of the 9th Asia-Pacific Workshop on Systems, pp. 1–8, 2018.
[76] M. Lee and J. Rexford, “Detecting violations of service-level agreements in programmable switches,” 2018. [Online]. Available: https://p4campus.cs.princeton.edu/pubs/mackl_thesis_paper.pdf.
[77] J. Sonchack, O. Michel, A. J. Aviv, E. Keller, and J. M. Smith, “Scaling hardware accelerated network monitoring to concurrent and dynamic queries with *Flow,” in 2018 USENIX Annual Technical Conference (USENIX ATC 18), pp. 823–835, 2018.
[78] J. Sonchack, A. J. Aviv, E. Keller, and J. M. Smith, “Turboflow: Information rich flow record generation on commodity switches,” in Proceedings of the Thirteenth EuroSys Conference, pp. 1–16, 2018.
[79] A. Gupta, R. Harrison, M. Canini, N. Feamster, J. Rexford, and W. Willinger, “Sonata: query-driven streaming network telemetry,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pp. 357–371, 2018.
[80] X. Chen, S. L. Feibish, Y. Koral, J. Rexford, O. Rottenstreich, S. A. Monetti, and T.-Y. Wang, “Fine-grained queue measurement in the data plane,” in Proceedings of the 15th International Conference on Emerging Networking Experiments And Technologies, pp. 15–29, 2019.
[81] Z. Liu, S. Zhou, O. Rottenstreich, V. Braverman, and J. Rexford, “Memory-efficient performance monitoring on programmable switches with lean algorithms,” in Symposium on Algorithmic Principles of Computer Systems (APoCS), 2020.
[82] T. Holterbach, E. C. Molero, M. Apostolaki, A. Dainotti, S. Vissicchio, and L. Vanbever, “Blink: fast connectivity recovery entirely in the data plane,” in 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pp. 161–176, 2019.
[83] D. Ding, M. Savi, and D. Siracusa, “Estimating logarithmic and exponential functions to track network traffic entropy in P4,” in IEEE/IFIP Network Operations and Management Symposium (NOMS), 2019.
[84] W. Wang, P. Tammana, A. Chen, and T. E. Ng, “Grasp the root causes in the data plane: diagnosing latency problems with SpiderMon,” in Proceedings of the Symposium on SDN Research, pp. 55–61, 2020.
[85] R. Teixeira, R. Harrison, A. Gupta, and J. Rexford, “PacketScope: monitoring the packet lifecycle inside a switch,” in Proceedings of the Symposium on SDN Research, pp. 76–82, 2020.
[86] J. Bai, M. Zhang, G. Li, C. Liu, M. Xu, and H. Hu, “FastFE: accelerating ML-based traffic analysis with programmable switches,” in Proceedings of the Workshop on Secure Programmable Network Infrastructure, SPIN ’20, pp. 1–7, Association for Computing Machinery, 2020.
[87] X. Chen, H. Kim, J. M. Aman, W. Chang, M. Lee, and J. Rexford, “Measuring TCP round-trip time in the data plane,” in Proceedings of the Workshop on Secure Programmable Network Infrastructure, pp. 35–41, 2020.
[88] Y. Qiu, K.-F. Hsu, J. Xing, and A. Chen, “A feasibility study on time-aware monitoring with commodity switches,” in Proceedings of the Workshop on Secure Programmable Network Infrastructure, pp. 22–
47

27, 2020. systems with distributed caching,” in 17th {USENIX} Conference on


[89] Q. Huang, H. Sun, P. P. Lee, W. Bai, F. Zhu, and Y. Bao, “OmniMon: File and Storage Technologies ({FAST} 19), pp. 143–157, 2019.
re-architecting Network telemetry with resource efficiency and full [108] K.-F. Hsu, P. Tammana, R. Beckett, A. Chen, J. Rexford, and D. Walker,
accuracy,” in Proceedings of the Annual conference of the ACM “Adaptive weighted traffic splitting in programmable data planes,” in
Special Interest Group on Data Communication on the applications, Proceedings of the Symposium on SDN Research, pp. 103–109, 2020.
technologies, architectures, and protocols for computer communication, [109] K.-F. Hsu, R. Beckett, A. Chen, J. Rexford, and D. Walker, “Contra:
pp. 404–421, 2020. A programmable system for performance-aware routing,” in 17th
[90] X. Chen, S. Landau-Feibish, M. Braverman, and J. Rexford, “Beau- {USENIX} Symposium on Networked Systems Design and Implemen-
Coup: answering many network traffic queries, one memory update tation ({NSDI} 20), pp. 701–721, 2020.
at a time,” in Proceedings of the Annual conference of the ACM [110] X. Jin, X. Li, H. Zhang, R. Soulé, J. Lee, N. Foster, C. Kim, and
Special Interest Group on Data Communication on the applications, I. Stoica, “Netcache: balancing key-value stores with fast in-network
technologies, architectures, and protocols for computer communication, caching,” in Proceedings of the 26th Symposium on Operating Systems
pp. 226–239, 2020. Principles, pp. 121–136, 2017.
[91] R. Kundel, J. Blendin, T. Viernickel, B. Koldehofe, and R. Steinmetz, [111] E. Cidon, S. Choi, S. Katti, and N. McKeown, “AppSwitch: application-
“P4-CoDel: active queue management in programmable data planes,” layer load balancing within a software switch,” in Proceedings of the
in 2018 IEEE Conference on Network Function Virtualization and First Asia-Pacific Workshop on Networking, pp. 64–70, 2017.
Software Defined Networks (NFV-SDN), pp. 1–4, IEEE, 2018. [112] M. Liu, L. Luo, J. Nelson, L. Ceze, A. Krishnamurthy, and K. Atreya,
[92] N. K. Sharma, M. Liu, K. Atreya, and A. Krishnamurthy, “Approxi- “Incbricks: toward in-network computation with an in-network cache,”
mating fair queueing on reconfigurable switches,” in 15th {USENIX} in Proceedings of the Twenty-Second International Conference on
Symposium on Networked Systems Design and Implementation (NSDI), Architectural Support for Programming Languages and Operating
pp. 1–16, 2018. Systems, pp. 795–809, 2017.
[93] S. Laki, P. Vörös, and F. Fejes, “Towards an AQM evaluation testbed [113] S. Signorello, R. State, J. François, and O. Festor, “NDN.p4: pro-
with P4 and DPDK,” in Proceedings of the ACM SIGCOMM 2019 gramming information-centric data-planes,” in 2016 IEEE NetSoft
Conference Posters and Demos, pp. 148–150, 2019. Conference and Workshops (NetSoft), pp. 384–389, IEEE, 2016.
[94] C. Papagianni and K. De Schepper, “PI2 for P4: an active queue man- [114] G. Grigoryan and Y. Liu, “PFCA: a programmable FIB caching
agement scheme for programmable data planes,” in Proceedings of the architecture,” in Proceedings of the 2018 Symposium on Architectures
15th International Conference on emerging Networking EXperiments for Networking and Communications Systems, pp. 97–103, 2018.
and Technologies, pp. 84–86, 2019. [115] C. Zhang, J. Bi, Y. Zhou, K. Zhang, and Z. Ma, “B-cache: a
[95] K. Kumazoe and M. Tsuru, “P4-based implementation and evaluation behavior-level caching framework for the programmable data plane,”
of adaptive early packet discarding scheme,” in International Confer- in 2018 IEEE Symposium on Computers and Communications (ISCC),
ence on Intelligent Networking and Collaborative Systems, pp. 460– pp. 00084–00090, IEEE, 2018.
469, Springer, 2020. [116] J. Vestin, A. Kassler, and J. Åkerberg, “FastReact: in-network control
[96] D. Bhat, J. Anderson, P. Ruth, M. Zink, and K. Keahey, “Application- and caching for industrial control networks using programmable data
based QoE support with P4 and OpenFlow,” in IEEE INFOCOM 2019- planes,” in 2018 IEEE 23rd International Conference on Emerging
IEEE Conference on Computer Communications Workshops (INFO- Technologies and Factory Automation (ETFA), vol. 1, pp. 219–226,
COM WKSHPS), pp. 817–823, IEEE, 2019. IEEE, 2018.
[97] S. S. Lee and K.-Y. Chan, “A traffic meter based on a multicolor marker [117] J. Woodruff, M. Ramanujam, and N. Zilberman, “P4DNS: in-network
for bandwidth guarantee and priority differentiation in sdn virtual DNS,” in 2019 ACM/IEEE Symposium on Architectures for Networking
networks,” IEEE Transactions on Network and Service Management, and Communications Systems (ANCS), pp. 1–6, IEEE, 2019.
vol. 16, no. 3, pp. 1046–1058, 2019. [118] R. Ricart-Sanchez, P. Malagon, P. Salva-Garcia, E. C. Perez, Q. Wang,
[98] K. Tokmakov, M. Sarker, J. Domaschka, and S. Wesner, “A case for and J. M. A. Calero, “Towards an FPGA-accelerated programmable
data centre traffic management on software programmable ethernet data path for edge-to-core communications in 5G networks,” Journal
switches,” in 2019 IEEE 8th International Conference on Cloud of Network and Computer Applications, vol. 124, pp. 80–93, 2018.
Networking (CloudNet), pp. 1–6, IEEE, 2019. [119] R. Ricart-Sanchez, P. Malagon, J. M. Alcaraz-Calero, and Q. Wang,
[99] Y.-W. Chen, L.-H. Yen, W.-C. Wang, C.-A. Chuang, Y.-S. Liu, and C.- “Hardware-accelerated firewall for 5G mobile networks,” in 2018 IEEE
C. Tseng, “P4-Enabled bandwidth management,” in 2019 20th Asia- 26th International Conference on Network Protocols (ICNP), pp. 446–
Pacific Network Operations and Management Symposium (APNOMS), 447, IEEE, 2018.
pp. 1–5, IEEE, 2019. [120] R. Shah, V. Kumar, M. Vutukuru, and P. Kulkarni, “TurboEPC:
[100] M. Shahbaz, L. Suresh, J. Rexford, N. Feamster, O. Rottenstreich, and leveraging dataplane programmability to accelerate the mobile packet
M. Hira, “Elmo: Source routed multicast for public clouds,” in Pro- core,” in Proceedings of the Symposium on SDN Research, pp. 83–95,
ceedings of the ACM Special Interest Group on Data Communication, 2020.
pp. 458–471, 2019. [121] S. K. Singh, C. E. Rothenberg, G. Patra, and G. Pongracz, “Offloading
[101] M. Kadosh, Y. Piasetzky, B. Gafni, L. Suresh, M. Shahbaz, S. Banerjee, virtual evolved packet gateway user plane functions to a programmable
“Realizing source routed multicast using Mellanox’s programmable ASIC,” in Proceedings of the 1st ACM CoNEXT Workshop on Emerging
hardware switches, P4 Expert Roundtable Series, Apr. 2020.” [Online]. in-Network Computing Paradigms, pp. 9–14, 2019.
Available: https://tinyurl.com/y8dfcsum. [122] P. Vörös, G. Pongrácz, and S. Laki, “Towards a hybrid next generation
[102] W. Braun, J. Hartmann, and M. Menth, “Scalable and reliable software- nodeb,” in Proceedings of the 3rd P4 Workshop in Europe, pp. 56–58,
defined multicast with BIER and P4,” in 2017 IFIP/IEEE Symposium 2020.
on Integrated Network and Service Management (IM), pp. 905–906, [123] P. Palagummi and K. M. Sivalingam, “SMARTHO: a network initiated
IEEE, 2017. handover in NG-RAN using P4-based switches,” in 2018 14th Inter-
[103] N. Katta, M. Hira, C. Kim, A. Sivaraman, and J. Rexford, “Hula: scal- national Conference on Network and Service Management (CNSM),
able load balancing using programmable data planes,” in Proceedings pp. 338–342, IEEE, 2018.
of the Symposium on SDN Research, pp. 1–12, 2016. [124] E. Kfoury, J. Crichigno, and E. Bou-Harb, “Offloading media traffic to
[104] R. Miao, H. Zeng, C. Kim, J. Lee, and M. Yu, “SilkRoad: making programmable data plane switches,” in ICC 2020 IEEE International
stateful layer-4 load balancing fast and cheap using switching ASICs,” Conference on Communications (ICC), IEEE, 2020.
in Proceedings of the Conference of the ACM Special Interest Group [125] T. Jepsen, M. Moshref, A. Carzaniga, N. Foster, and R. Soulé, “Packet
on Data Communication, pp. 15–28, 2017. subscriptions for programmable ASICs,” in Proceedings of the 17th
[105] C. H. Benet, A. J. Kassler, T. Benson, and G. Pongracz, “MP-HULA: ACM Workshop on Hot Topics in Networks, pp. 176–183, 2018.
multipath transport aware load balancing using programmable data [126] C. Wernecke, H. Parzyjegla, G. Mühl, P. Danielis, and D. Timmermann,
planes,” in Proceedings of the 2018 Morning Workshop on In-Network “Realizing content-based publish/subscribe with P4,” in 2018 IEEE
Computing, pp. 7–13, 2018. Conference on Network Function Virtualization and Software Defined
[106] V. Olteanu, A. Agache, A. Voinescu, and C. Raiciu, “Stateless data- Networks (NFV-SDN), pp. 1–7, IEEE, 2018.
center load-balancing with beamer,” in 15th {USENIX} Symposium on [127] C. Wernecke, H. Parzyjegla, G. Mühl, E. Schweissguth, and D. Tim-
Networked Systems Design and Implementation (NSDI), pp. 125–139, mermann, “Flexible notification forwarding for content-based pub-
2018. lish/subscribe using P4,” in 2019 IEEE Conference on Network Func-
[107] Z. Liu, Z. Bai, Z. Liu, X. Li, C. Kim, V. Braverman, X. Jin, and tion Virtualization and Software Defined Networks (NFV-SDN), pp. 1–
I. Stoica, “Distcache: provable load balancing for large-scale storage 5, IEEE, 2019.
[128] R. Kundel, C. Gärtner, M. Luthra, S. Bhowmik, and B. Koldehofe, “Flexible content-based publish/subscribe over programmable data planes,” in NOMS 2020-2020 IEEE/IFIP Network Operations and Management Symposium, pp. 1–5, IEEE, 2020.
[129] J. Li, E. Michael, N. K. Sharma, A. Szekeres, and D. R. Ports, “Just say NO to paxos overhead: replacing consensus with network ordering,” in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pp. 467–483, 2016.
[130] H. T. Dang, M. Canini, F. Pedone, and R. Soulé, “Paxos made switch-y,” ACM SIGCOMM Computer Communication Review, vol. 46, no. 2, pp. 18–24, 2016.
[131] J. Li, E. Michael, and D. R. Ports, “Eris: coordination-free consistent transactions using in-network concurrency control,” in Proceedings of the 26th Symposium on Operating Systems Principles, pp. 104–120, 2017.
[132] B. Han, V. Gopalakrishnan, M. Platania, Z.-L. Zhang, and Y. Zhang, “Network-assisted raft consensus protocol,” Feb. 13 2020. US Patent App. 16/101,751.
[133] X. Jin, X. Li, H. Zhang, N. Foster, J. Lee, R. Soulé, C. Kim, and I. Stoica, “Netchain: scale-free sub-rtt coordination,” in 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI 18), pp. 35–49, 2018.
[134] H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman, H. Weatherspoon, M. Canini, F. Pedone, and R. Soulé, “Partitioned Paxos via the network data plane,” arXiv preprint arXiv:1901.08806, 2019.
[135] E. Sakic, N. Deric, E. Goshi, and W. Kellerer, “P4BFT: hardware-accelerated byzantine-resilient network control plane,” arXiv preprint arXiv:1905.04064, 2019.
[136] H. T. Dang, P. Bressana, H. Wang, K. S. Lee, N. Zilberman, H. Weatherspoon, M. Canini, F. Pedone, and R. Soulé, “P4xos: Consensus as a network service,” IEEE/ACM Transactions on Networking, 2020.
[137] A. Sapio, I. Abdelaziz, A. Aldilaijan, M. Canini, and P. Kalnis, “In-network computation is a dumb idea whose time has come,” in Proceedings of the 16th ACM Workshop on Hot Topics in Networks, pp. 150–156, 2017.
[138] G. Siracusano and R. Bifulco, “In-network neural networks,” arXiv preprint arXiv:1801.05731, 2018.
[139] D. Sanvito, G. Siracusano, and R. Bifulco, “Can the network be the AI accelerator?,” in Proceedings of the 2018 Morning Workshop on In-Network Computing, pp. 20–25, 2018.
[140] F. Yang, Z. Wang, X. Ma, G. Yuan, and X. An, “SwitchAgg: a further step towards in-network computation,” arXiv preprint arXiv:1904.04024, 2019.
[141] A. Sapio, M. Canini, C.-Y. Ho, J. Nelson, P. Kalnis, C. Kim, A. Krishnamurthy, M. Moshref, D. R. Ports, and P. Richtárik, “Scaling distributed machine learning with in-network aggregation,” arXiv preprint arXiv:1903.06701, 2019.
[142] Z. Xiong and N. Zilberman, “Do switches dream of machine learning? toward in-network classification,” in Proceedings of the 18th ACM Workshop on Hot Topics in Networks, pp. 25–33, 2019.
[143] T. Jepsen, M. Moshref, A. Carzaniga, N. Foster, and R. Soulé, “Life in the fast lane: a line-rate linear road,” in Proceedings of the Symposium on SDN Research, pp. 1–7, 2018.
[144] T. Kohler, R. Mayer, F. Dürr, M. Maaß, S. Bhowmik, and K. Rothermel, “P4CEP: towards in-network complex event processing,” in Proceedings of the 2018 Morning Workshop on In-Network Computing, pp. 33–38, 2018.
[145] L. Chen, G. Chen, J. Lingys, and K. Chen, “Programmable switch as a parallel computing device,” arXiv preprint arXiv:1803.01491, 2018.
[146] T. Jepsen, D. Alvarez, N. Foster, C. Kim, J. Lee, M. Moshref, and R. Soulé, “Fast string searching on PISA,” in Proceedings of the 2019 ACM Symposium on SDN Research, pp. 21–28, 2019.
[147] Y. Qiao, X. Kong, M. Zhang, Y. Zhou, M. Xu, and J. Bi, “Towards in-network acceleration of erasure coding,” in Proceedings of the Symposium on SDN Research, pp. 41–47, 2020.
[148] Z. Yu, Y. Zhang, V. Braverman, M. Chowdhury, and X. Jin, “NetLock: fast, centralized lock management using programmable switches,” in Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pp. 126–138, 2020.
[149] M. Tirmazi, R. Ben Basat, J. Gao, and M. Yu, “Cheetah: Accelerating database queries with switch pruning,” in Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, pp. 2407–2422, 2020.
[150] S. Vaucher, N. Yazdani, P. Felber, D. E. Lucani, and V. Schiavoni, “Zipline: in-network compression at line speed,” in Proceedings of the 16th International Conference on emerging Networking EXperiments and Technologies, pp. 399–405, 2020.
[151] R. Glebke, J. Krude, I. Kunze, J. Rüth, F. Senger, and K. Wehrle, “Towards executing computer vision functionality on programmable network devices,” in Proceedings of the 1st ACM CoNEXT Workshop on Emerging in-Network Computing Paradigms, pp. 15–20, 2019.
[152] S.-Y. Wang, C.-M. Wu, Y.-B. Lin, and C.-C. Huang, “High-speed data-plane packet aggregation and disaggregation by P4 switches,” Journal of Network and Computer Applications, vol. 142, pp. 98–110, 2019.
[153] S.-Y. Wang, J.-Y. Li, and Y.-B. Lin, “Aggregating and disaggregating packets with various sizes of payload in P4 switches at 100 Gbps line rate,” Journal of Network and Computer Applications, p. 102676, 2020.
[154] Y.-B. Lin, S.-Y. Wang, C.-C. Huang, and C.-M. Wu, “The SDN approach for the aggregation/disaggregation of sensor data,” Sensors, vol. 18, no. 7, p. 2025, 2018.
[155] A. L. R. Madureira, F. R. C. Araújo, and L. N. Sampaio, “On supporting IoT data aggregation through programmable data planes,” Computer Networks, p. 107330, 2020.
[156] M. Uddin, S. Mukherjee, H. Chang, and T. Lakshman, “SDN-based service automation for IoT,” in 2017 IEEE 25th International Conference on Network Protocols (ICNP), pp. 1–10, IEEE, 2017.
[157] M. Uddin, S. Mukherjee, H. Chang, and T. Lakshman, “SDN-based multi-protocol edge switching for IoT service automation,” IEEE Journal on Selected Areas in Communications, vol. 36, no. 12, pp. 2775–2786, 2018.
[158] V. Sivaraman, S. Narayana, O. Rottenstreich, S. Muthukrishnan, and J. Rexford, “Heavy-hitter detection entirely in the data plane,” in Proceedings of the Symposium on SDN Research, pp. 164–176, 2017.
[159] R. Harrison, Q. Cai, A. Gupta, and J. Rexford, “Network-wide heavy hitter detection with commodity switches,” in Proceedings of the Symposium on SDN Research, pp. 1–7, 2018.
[160] J. Kučera, D. A. Popescu, G. Antichi, J. Kořenek, and A. W. Moore, “Seek and push: detecting large traffic aggregates in the dataplane,” arXiv preprint arXiv:1805.05993, 2018.
[161] R. Ben-Basat, X. Chen, G. Einziger, and O. Rottenstreich, “Efficient measurement on programmable switches using probabilistic recirculation,” in 2018 IEEE 26th International Conference on Network Protocols (ICNP), pp. 313–323, IEEE, 2018.
[162] D. Ding, M. Savi, G. Antichi, and D. Siracusa, “An incrementally-deployable P4-enabled architecture for network-wide heavy-hitter detection,” IEEE Transactions on Network and Service Management, vol. 17, no. 1, pp. 75–88, 2020.
[163] L. Tang, Q. Huang, and P. P. Lee, “A fast and compact invertible sketch for network-wide heavy flow detection,” IEEE/ACM Transactions on Networking, vol. 28, no. 5, pp. 2350–2363, 2020.
[164] M. V. B. da Silva, J. A. Marques, L. P. Gaspary, and L. Z. Granville, “Identifying elephant flows using dynamic thresholds in programmable ixp networks,” Journal of Internet Services and Applications, vol. 11, no. 1, pp. 1–12, 2020.
[165] D. Scholz, A. Oeldemann, F. Geyer, S. Gallenmüller, H. Stubbe, T. Wild, A. Herkersdorf, and G. Carle, “Cryptographic hashing in P4 data planes,” in 2019 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS), pp. 1–6, IEEE, 2019.
[166] F. Hauser, M. Häberle, M. Schmidt, and M. Menth, “P4-IPsec: implementation of IPsec gateways in P4 with SDN control for host-to-site scenarios,” arXiv preprint arXiv:1907.03593, 2019.
[167] F. Hauser, M. Schmidt, M. Häberle, and M. Menth, “P4-MACsec: dynamic topology monitoring and data layer protection with MACsec in P4-based SDN,” IEEE Access, 2020.
[168] X. Chen, “Implementing AES encryption on programmable switches via scrambled lookup tables,” in Proceedings of the Workshop on Secure Programmable Network Infrastructure, SPIN ’20, pp. 8–14, Association for Computing Machinery, 2020.
[169] R. Meier, P. Tsankov, V. Lenders, L. Vanbever, and M. Vechev, “NetHide: secure and practical network topology obfuscation,” in 27th USENIX Security Symposium (USENIX Security 18), pp. 693–709, 2018.
[170] H. M. Moghaddam and A. Mosenia, “Anonymizing masses: practical light-weight anonymity at the network level,” arXiv preprint arXiv:1911.09642, 2019.
[171] H. Kim and A. Gupta, “ONTAS: flexible and scalable online network traffic anonymization system,” in Proceedings of the 2019 Workshop on Network Meets AI & ML, pp. 15–21, 2019.
[172] T. Datta, N. Feamster, J. Rexford, and L. Wang, “SPINE: surveillance protection in the network elements,” in 9th USENIX Workshop on Free and Open Communications on the Internet (FOCI), 2019.
[173] R. Datta, S. Choi, A. Chowdhary, and Y. Park, “P4Guard: designing P4 based firewall,” in MILCOM 2018-2018 IEEE Military Communications Conference (MILCOM), pp. 1–6, IEEE, 2018.
[174] A. Almaini, A. Al-Dubai, I. Romdhani, and M. Schramm, “Delegation of authentication to the data plane in software-defined networks,” in 2019 IEEE International Conferences on Ubiquitous Computing & Communications (IUCC) and Data Science and Computational Intelligence (DSCI) and Smart Computing, Networking and Services (SmartCNS), pp. 58–65, IEEE, 2019.
[175] Q. Kang, L. Xue, A. Morrison, Y. Tang, A. Chen, and X. Luo, “Programmable in-network security for context-aware BYOD policies,” arXiv preprint arXiv:1908.01405, 2019.
[176] S. Bai, H. Kim, and J. Rexford, “Passive OS fingerprinting on commodity switches,”
[177] G. Li, M. Zhang, C. Liu, X. Kong, A. Chen, G. Gu, and H. Duan, “NetHCF: enabling line-rate and adaptive spoofed IP traffic filtering,” in 2019 IEEE 27th International Conference on Network Protocols (ICNP), pp. 1–12, IEEE, 2019.
[178] J. Xing, W. Wu, and A. Chen, “Architecting programmable data plane defenses into the network with FastFlex,” in Proceedings of the 18th ACM Workshop on Hot Topics in Networks, pp. 161–169, 2019.
[179] Q. Kang, J. Xing, and A. Chen, “Automated attack discovery in data plane systems,” in 12th USENIX Workshop on Cyber Security Experimentation and Test (CSET), 2019.
[180] A. Febro, H. Xiao, and J. Spring, “Distributed SIP DDoS defense with P4,” in 2019 IEEE Wireless Communications and Networking Conference (WCNC), pp. 1–8, IEEE, 2019.
[181] Â. C. Lapolli, J. A. Marques, and L. P. Gaspary, “Offloading real-time DDoS attack detection to programmable data planes,” in 2019 IFIP/IEEE Symposium on Integrated Network and Service Management (IM), pp. 19–27, IEEE, 2019.
[182] Y. Mi and A. Wang, “ML-pushback: machine learning based pushback defense against DDoS,” in Proceedings of the 15th International Conference on emerging Networking EXperiments and Technologies, pp. 80–81, 2019.
[183] D. Scholz, S. Gallenmüller, H. Stubbe, B. Jaber, M. Rouhi, and G. Carle, “Me love (SYN-) cookies: SYN flood mitigation in programmable data planes,” arXiv preprint arXiv:2003.03221, 2020.
[184] M. Zhang, G. Li, S. Wang, C. Liu, A. Chen, H. Hu, G. Gu, Q. Li, M. Xu, and J. Wu, “Poseidon: mitigating volumetric DDoS attacks with programmable switches,” in Proceedings of NDSS, 2020.
[185] K. Friday, E. Kfoury, E. Bou-Harb, and J. Crichigno, “Towards a unified in-network DDoS detection and mitigation strategy,” in 2020 6th IEEE Conference on Network Softwarization (NetSoft), pp. 218–226, 2020.
[186] J. Xing, Q. Kang, and A. Chen, “NetWarden: mitigating network covert channels while preserving performance,” in 29th USENIX Security Symposium (USENIX Security 20), 2020.
[187] A. Laraba, J. François, I. Chrisment, S. R. Chowdhury, and R. Boutaba, “Defeating protocol abuse with p4: Application to explicit congestion notification,” in 2020 IFIP Networking Conference (Networking), pp. 431–439, IEEE, 2020.
[188] “Ripple: A programmable, decentralized link-flooding defense against adaptive adversaries,” in 30th USENIX Security Symposium (USENIX Security 21), (Vancouver, B.C.), USENIX Association, 2021.
[189] C. Zhang, J. Bi, Y. Zhou, J. Wu, B. Liu, Z. Li, A. B. Dogar, and Y. Wang, “P4DB: on-the-fly debugging of the programmable data plane,” in 2017 IEEE 25th International Conference on Network Protocols (ICNP), pp. 1–10, IEEE, 2017.
[190] Y. Zhou, J. Bi, Y. Lin, Y. Wang, D. Zhang, Z. Xi, J. Cao, and C. Sun, “P4tester: efficient runtime rule fault detection for programmable data planes,” in Proceedings of the International Symposium on Quality of Service, pp. 1–10, 2019.
[191] M. V. Dumitru, D. Dumitrescu, and C. Raiciu, “Can we exploit buggy P4 programs?,” in Proceedings of the Symposium on SDN Research, pp. 62–68, 2020.
[192] S. Kodeswaran, M. T. Arashloo, P. Tammana, and J. Rexford, “Tracking P4 program execution in the data plane,” in Proceedings of the Symposium on SDN Research, pp. 117–122, 2020.
[193] Y. Zhou, J. Bi, T. Yang, K. Gao, C. Zhang, J. Cao, and Y. Wang, “Keysight: Troubleshooting programmable switches via scalable high-coverage behavior tracking,” in 2018 IEEE 26th International Conference on Network Protocols (ICNP), pp. 291–301, IEEE, 2018.
[194] N. Lopes, N. Bjørner, N. McKeown, A. Rybalchenko, D. Talayco, and G. Varghese, “Automatically verifying reachability and well-formedness in P4 networks,” Technical Report, Tech. Rep, 2016.
[195] L. Freire, M. Neves, L. Leal, K. Levchenko, A. Schaeffer-Filho, and M. Barcellos, “Uncovering bugs in P4 programs with assertion-based verification,” in Proceedings of the Symposium on SDN Research, pp. 1–7, 2018.
[196] M. Neves, L. Freire, A. Schaeffer-Filho, and M. Barcellos, “Verification of P4 programs in feasible time using assertions,” in Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies, pp. 73–85, 2018.
[197] J. Liu, W. Hallahan, C. Schlesinger, M. Sharif, J. Lee, R. Soulé, H. Wang, C. Caşcaval, N. McKeown, and N. Foster, “P4v: practical verification for programmable data planes,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pp. 490–503, 2018.
[198] A. Nötzli, J. Khan, A. Fingerhut, C. Barrett, and P. Athanas, “P4pktgen: automated test case generation for P4 programs,” in Proceedings of the Symposium on SDN Research, pp. 1–7, 2018.
[199] D. Lukács, M. Tejfel, and G. Pongrácz, “Keeping P4 switches fast and fault-free through automatic verification,” Acta Cybernetica, vol. 24, no. 1, pp. 61–81, 2019.
[200] R. Stoenescu, D. Dumitrescu, M. Popovici, L. Negreanu, and C. Raiciu, “Debugging P4 programs with Vera,” in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, pp. 518–532, 2018.
[201] A. Shukla, K. N. Hudemann, A. Hecker, and S. Schmid, “Runtime verification of P4 switches with reinforcement learning,” in Proceedings of the 2019 Workshop on Network Meets AI & ML, pp. 1–7, 2019.
[202] D. Dumitrescu, R. Stoenescu, L. Negreanu, and C. Raiciu, “bf4: towards bug-free P4 programs,” in Proceedings of the Annual conference of the ACM Special Interest Group on Data Communication on the applications, technologies, architectures, and protocols for computer communication, pp. 571–585, 2020.
[203] A. Bas and A. Fingerhut, “P4 tutorial, slide 22.” [Online]. Available: https://tinyurl.com/tb4m749.
[204] M. Shahbaz, S. Choi, B. Pfaff, C. Kim, N. Feamster, N. McKeown, and J. Rexford, “PISCES: A programmable, protocol-independent software switch,” in Proceedings of the 2016 ACM SIGCOMM Conference, pp. 525–538, 2016.
[205] B. Pfaff, J. Pettit, T. Koponen, E. Jackson, A. Zhou, J. Rajahalme, J. Gross, A. Wang, J. Stringer, P. Shelar, et al., “The design and implementation of open vswitch,” in 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 117–130, 2015.
[206] Barefoot Networks, “Barefoot Academy,” 2020. [Online]. Available: https://www.barefootnetworks.com/barefoot-academy/.
[207] C. Kim, A. Sivaraman, N. Katta, A. Bas, A. Dixit, and L. J. Wobker, “In-band network telemetry via programmable dataplanes,” in ACM SIGCOMM, 2015.
[208] C. Hopps et al., “Analysis of an equal-cost multi-path algorithm,” tech. rep., RFC 2992, November, 2000.
[209] S. Sinha, S. Kandula, and D. Katabi, “Harnessing TCP’s burstiness with flowlet switching,” in Proc. 3rd ACM Workshop on Hot Topics in Networks (Hotnets-III), Citeseer, 2004.
[210] C. Kim, P. Bhide, E. Doe, H. Holbrook, A. Ghanwani, D. Daly, M. Hira, and B. Davie, “In-band network telemetry (INT),” technical specification, 2016.
[211] M. A. Vieira, M. S. Castanho, R. D. Pacífico, E. R. Santos, E. P. C. Júnior, and L. F. Vieira, “Fast packet processing with eBPF and XDP: concepts, code, challenges, and applications,” ACM Computing Surveys (CSUR), vol. 53, no. 1, pp. 1–36, 2020.
[212] J. Crichigno, E. Bou-Harb, and N. Ghani, “A comprehensive tutorial on science DMZ,” IEEE Communications Surveys & Tutorials, vol. 21, no. 2, pp. 2041–2078, 2018.
[213] J. F. Kurose and K. W. Ross, “Computer networking a top down approach featuring the Internet,” 2016.
[214] S. Ha, I. Rhee, and L. Xu, “CUBIC: a new TCP-friendly high-speed TCP variant,” ACM SIGOPS operating systems review, vol. 42, no. 5, pp. 64–74, 2008.
[215] D. Leith and R. Shorten, “H-TCP: TCP congestion control for high bandwidth-delay product paths,” draft-leith-tcp-htcp-06 (work in progress), 2008.
[216] N. Cardwell, Y. Cheng, C. S. Gunn, S. H. Yeganeh, and V. Jacobson, “BBR: congestion-based congestion control,” Communications of the ACM, vol. 60, no. 2, pp. 58–66, 2017.
[217] S. Floyd, “TCP and explicit congestion notification,” ACM SIGCOMM Computer Communication Review, vol. 24, no. 5, pp. 8–23, 1994.
[218] R. Mittal, V. T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, and D. Zats, “TIMELY: RTT-based congestion control for the data center,” ACM SIGCOMM Computer
Communication Review, vol. 45, no. 4, pp. 537–550, 2015.
[219] Y. Zhu, H. Eran, D. Firestone, C. Guo, M. Lipshteyn, Y. Liron, J. Padhye, S. Raindel, M. H. Yahia, and M. Zhang, “Congestion control for large-scale RDMA deployments,” ACM SIGCOMM Computer Communication Review, vol. 45, no. 4, pp. 523–536, 2015.
[220] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, “Data Center TCP (DCTCP),” in Proceedings of the ACM SIGCOMM 2010 conference, pp. 63–74, 2010.
[221] M. Alizadeh, S. Yang, M. Sharif, S. Katti, N. McKeown, B. Prabhakar, and S. Shenker, “pFabric: minimal near-optimal datacenter transport,” ACM SIGCOMM Computer Communication Review, vol. 43, no. 4, pp. 435–446, 2013.
[222] M. Dong, Q. Li, D. Zarchy, P. B. Godfrey, and M. Schapira, “PCC: Re-architecting congestion control for consistent high performance,” in 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 395–408, 2015.
[223] A. Langley, A. Riddoch, A. Wilk, A. Vicente, C. Krasic, D. Zhang, F. Yang, F. Kouranov, I. Swett, J. Iyengar, et al., “The QUIC transport protocol: design and Internet-scale deployment,” in Proceedings of the Conference of the ACM Special Interest Group on Data Communication, pp. 183–196, 2017.
[224] P. Cheng, F. Ren, R. Shu, and C. Lin, “Catch the whole lot in an action: rapid precise packet loss notification in data center,” in 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI), pp. 17–28, 2014.
[225] A. Ramachandran, S. Seetharaman, N. Feamster, and V. Vazirani, “Fast monitoring of traffic subpopulations,” in Proceedings of the 8th ACM SIGCOMM conference on Internet measurement, pp. 257–270, 2008.
[226] N. Alon, Y. Matias, and M. Szegedy, “The space complexity of approximating the frequency moments,” Journal of Computer and system sciences, vol. 58, no. 1, pp. 137–147, 1999.
[227] V. Braverman and R. Ostrovsky, “Zero-one frequency laws,” in Proceedings of the forty-second ACM symposium on Theory of computing, pp. 281–290, 2010.
[228] M. Charikar, K. Chen, and M. Farach-Colton, “Finding frequent items in data streams,” in International Colloquium on Automata, Languages, and Programming, pp. 693–703, Springer, 2002.
[229] G. Cormode and S. Muthukrishnan, “An improved data stream summary: the count-min sketch and its applications,” Journal of Algorithms, vol. 55, no. 1, pp. 58–75, 2005.
[230] M. Datar, A. Gionis, P. Indyk, and R. Motwani, “Maintaining stream statistics over sliding windows,” SIAM journal on computing, vol. 31, no. 6, pp. 1794–1813, 2002.
[231] S. Floyd and V. Jacobson, “Random early detection gateways for congestion avoidance,” IEEE/ACM Transactions on networking, vol. 1, no. 4, pp. 397–413, 1993.
[232] P. Flajolet, D. Gardy, and L. Thimonier, “Birthday paradox, coupon collectors, caching algorithms and self-organizing search,” Discrete Applied Mathematics, vol. 39, no. 3, pp. 207–229, 1992.
[233] R. Dolby, “Noise reduction systems,” Nov. 5 1974. US Patent 3,846,719.
[234] S. V. Vaseghi, Advanced digital signal processing and noise reduction. John Wiley & Sons, 2008.
[235] J. Gettys, “Bufferbloat: dark buffers in the Internet,” IEEE Internet Computing, no. 3, p. 96, 2011.
[236] M. Allman, “Comments on bufferbloat,” ACM SIGCOMM Computer Communication Review, vol. 43, no. 1, pp. 30–37, 2013.
[237] Y. Gong, D. Rossi, C. Testa, S. Valenti, and M. D. Täht, “Fighting the bufferbloat: on the coexistence of AQM and low priority congestion control,” Computer Networks, vol. 65, pp. 255–267, 2014.
[238] C. Staff, “Bufferbloat: what’s wrong with the Internet?,” Communications of the ACM, vol. 55, no. 2, pp. 40–47, 2012.
[239] V. G. Cerf, “Bufferbloat and other internet challenges,” IEEE Internet Computing, vol. 18, no. 5, pp. 80–80, 2014.
[240] F. Schwarzkopf, S. Veith, and M. Menth, “Performance analysis of CoDel and PIE for saturated TCP sources,” in 2016 28th International Teletraffic Congress (ITC 28), vol. 1, pp. 175–183, IEEE, 2016.
[241] A. Mushtaq, R. Mittal, J. McCauley, M. Alizadeh, S. Ratnasamy, and S. Shenker, “Datacenter congestion control: identifying what is essential and making it practical,” ACM SIGCOMM Computer Communication Review, vol. 49, no. 3, pp. 32–38, 2019.
[242] K. Nichols, S. Blake, F. Baker, and D. Black, “Definition of the differentiated services field (DS field) in the IPv4 and IPv6 headers,” RFC 2474. [Online]. Available: https://tools.ietf.org/html/rfc2474.
[243] B. Fenner, M. Handley, H. Holbrook, I. Kouvelas, R. Parekh, Z. Zhang, and L. Zheng, “Protocol independent multicast-sparse mode (PIM-SM): protocol specification (revised),” [Online]. Available: https://tools.ietf.org/html/rfc7761.
[244] H. Holbrook, B. Cain, and B. Haberman, “Using Internet group management protocol version 3 (IGMPv3) and multicast listener discovery protocol version 2 (MLDv2) for source-specific multicast,” RFC 4604 (Proposed Standard), Internet Engineering Task Force, 2006.
[245] I. Wijnands, E. C. Rosen, A. Dolganow, T. Przygienda, and S. Aldrin, “Multicast using bit index explicit replication (BIER),” in RFC Editor, 2017.
[246] B. Carpenter and S. Brim, “Middleboxes: taxonomy and issues,” 2002. [Online]. Available: https://tools.ietf.org/html/rfc3234.
[247] J. McCauley, A. Panda, A. Krishnamurthy, and S. Shenker, “Thoughts on load distribution and the role of programmable switches,” ACM SIGCOMM Computer Communication Review, vol. 49, no. 1, pp. 18–23, 2019.
[248] T. Norp, “5G Requirements and key performance indicators,” Journal of ICT Standardization, vol. 6, no. 1, pp. 15–30, 2018.
[249] G. Xylomenos, C. N. Ververidis, V. A. Siris, N. Fotiou, C. Tsilopoulos, X. Vasilakos, K. V. Katsaros, and G. C. Polyzos, “A survey of information-centric networking research,” IEEE communications surveys & tutorials, vol. 16, no. 2, pp. 1024–1049, 2013.
[250] D. L. Tennenhouse and D. J. Wetherall, “Towards an active network architecture,” in Proceedings DARPA Active Networks Conference and Exposition, pp. 2–15, IEEE, 2002.
[251] E. F. Kfoury, J. Gomez, J. Crichigno, E. Bou-Harb, and D. Khoury, “Decentralized distribution of PCP mappings over blockchain for end-to-end secure direct communications,” IEEE Access, vol. 7, pp. 110159–110173, 2019.
[252] S. A. Weil, S. A. Brandt, E. L. Miller, D. D. Long, and C. Maltzahn, “Ceph: A scalable, high-performance distributed file system,” in Proceedings of the 7th symposium on Operating systems design and implementation, pp. 307–320, 2006.
[253] L. Lamport et al., “Paxos made simple,” ACM Sigact News, vol. 32, no. 4, pp. 18–25, 2001.
[254] D. Ongaro and J. Ousterhout, “In search of an understandable consensus algorithm,” in 2014 USENIX Annual Technical Conference (USENIX ATC 14), pp. 305–319, 2014.
[255] Huynh Tu Dang, “Consensus as a network service.” [Online]. Available: https://tinyurl.com/y2t9plsu.
[256] J. Nelson, “SwitchML scaling distributed machine learning with in network aggregation.” [Online]. Available: https://tinyurl.com/y53upm7k.
[257] D. Das, S. Avancha, D. Mudigere, K. Vaidynathan, S. Sridharan, D. Kalamkar, B. Kaul, and P. Dubey, “Distributed deep learning using synchronous stochastic gradient descent,” arXiv preprint arXiv:1602.06709, 2016.
[258] S. Farrell, “Low-power wide area network (LPWAN) overview,” RFC8376. [Online]. Available: https://tools.ietf.org/html/rfc8376.
[259] A. Koike, T. Ohba, and R. Ishibashi, “IoT network architecture using packet aggregation and disaggregation,” in 2016 5th IIAI International Congress on Advanced Applied Informatics (IIAI-AAI), pp. 1140–1145, IEEE, 2016.
[260] J. Deng and M. Davis, “An adaptive packet aggregation algorithm for wireless networks,” in 2013 International Conference on Wireless Communications and Signal Processing, pp. 1–6, IEEE, 2013.
[261] Y. Yasuda, R. Nakamura, and H. Ohsaki, “A probabilistic interest packet aggregation for content-centric networking,” in 2018 IEEE 42nd Annual Computer Software and Applications Conference (COMPSAC), vol. 2, pp. 783–788, IEEE, 2018.
[262] A. S. Akyurek and T. S. Rosing, “Optimal packet aggregation scheduling in wireless networks,” IEEE Transactions on Mobile Computing, vol. 17, no. 12, pp. 2835–2852, 2018.
[263] K. Zhou and N. Nikaein, “Packet aggregation for machine type communications in LTE with random access channel,” in 2013 IEEE Wireless Communications and Networking Conference (WCNC), pp. 262–267, IEEE, 2013.
[264] A. Majeed and N. B. Abu-Ghazaleh, “Packet aggregation in multi-rate wireless LANs,” in 2012 9th Annual IEEE Communications Society Conference on Sensor, Mesh and Ad Hoc Communications and Networks (SECON), pp. 452–460, IEEE, 2012.
[265] D. SIG, “Bluetooth core specification version 4.2,” Specification of the Bluetooth System, 2014.
[266] S. Farahani, ZigBee wireless networks and transceivers. Newnes, 2011.
[267] O. Hersent, D. Boswarthick, and O. Elloumi, The Internet of things: key applications and protocols. John Wiley & Sons, 2011.
[268] J. Shi, W. Quan, D. Gao, M. Liu, G. Liu, C. Yu, and W. Su, “Flowlet-based stateful multipath forwarding in heterogeneous Internet of things,” IEEE Access, vol. 8, pp. 74875–74886, 2020.
[269] S. Do, L.-V. Le, B.-S. P. Lin, and L.-P. Tung, “SDN/NFV-based network checking invariant security properties in OpenFlow,” in 2013 IEEE
infrastructure for enhancing IoT gateways,” in 2019 International Con- international conference on communications (ICC), pp. 1974–1979,
ference on Internet of Things (iThings) and IEEE Green Computing and IEEE, 2013.
Communications (GreenCom) and IEEE Cyber, Physical and Social [292] A. Panda, O. Lahav, K. Argyraki, M. Sagiv, and S. Shenker, “Verifying
Computing (CPSCom) and IEEE Smart Data (SmartData), pp. 1135– reachability in networks with mutable datapaths,” in 14th {USENIX}
1142, IEEE, 2019. Symposium on Networked Systems Design and Implementation (NSDI),
[270] A. Metwally, D. Agrawal, and A. El Abbadi, “Efficient computation pp. 699–718, 2017.
of frequent and top-k elements in data streams,” in International [293] X. Gao, T. Kim, M. D. Wong, D. Raghunathan, A. K. Varma, P. G.
Conference on Database Theory, pp. 398–412, Springer, 2005. Kannan, A. Sivaraman, S. Narayana, and A. Gupta, “Switch code
[271] S. Heule, M. Nunkesser, and A. Hall, “HyperLogLog in practice: generation using program synthesis,” in Proceedings of the Annual
algorithmic engineering of a state of the art cardinality estimation conference of the ACM Special Interest Group on Data Communication
algorithm,” in Proceedings of the 16th International Conference on on the applications, technologies, architectures, and protocols for
Extending Database Technology, pp. 683–692, 2013. computer communication, pp. 44–61, 2020.
[272] M. G. Reed, P. F. Syverson, and D. M. Goldschlag, “Anonymous [294] P. Zheng, T. A. Benson, and C. Hu, “Building and testing modular
connections and onion routing,” IEEE Journal on Selected areas in programs for programmable data planes,” IEEE Journal on Selected
Communications, vol. 16, no. 4, pp. 482–494, 1998. Areas in Communications, vol. 38, no. 7, pp. 1432–1447, 2020.
[273] V. Liu, S. Han, A. Krishnamurthy, and T. Anderson, “Tor instead of IP,” [295] D. Kim, Y. Zhu, C. Kim, J. Lee, and S. Seshan, “Generic external
in Proceedings of the 10th ACM Workshop on Hot Topics in Networks, memory for switch data planes,” in Proceedings of the 17th ACM
pp. 1–6, 2011. Workshop on Hot Topics in Networks, pp. 1–7, 2018.
[274] C. Chen, D. E. Asoni, D. Barrera, G. Danezis, and A. Perrig, “HOR- [296] D. Kim, Z. Liu, Y. Zhu, C. Kim, J. Lee, V. Sekar, and S. Seshan, “TEA:
NET: high-speed onion routing at the network layer,” in Proceedings of enabling state-intensive network functions on programmable switches,”
the 22nd ACM SIGSAC Conference on Computer and Communications in Proceedings of the 2020 ACM SIGCOMM Conference, 2020.
Security, pp. 1441–1454, 2015. [297] S. Chole, A. Fingerhut, S. Ma, A. Sivaraman, S. Vargaftik, A. Berger,
[275] M. Zalewski and W. Stearns, “p0f,” see http://lcamtuf. coredump. G. Mendelson, M. Alizadeh, S.-T. Chuang, I. Keslassy, et al., “dRMT:
cx/p0f3, 2006. disaggregated programmable switching,” in Proceedings of the Con-
[276] J. Barnes and P. Crowley, “k-p0f: A high-throughput kernel passive OS ference of the ACM Special Interest Group on Data Communication,
fingerprinter,” in Architectures for Networking and Communications pp. 1–14, 2017.
Systems, pp. 113–114, IEEE, 2013. [298] M. T. Arashloo, Y. Koral, M. Greenberg, J. Rexford, and D. Walker,
[277] S. Hong, R. Baykov, L. Xu, S. Nadimpalli, and G. Gu, “Towards SDN- “SNAP: stateful network-wide abstractions for packet processing,” in
defined programmable BYOD (bring your own device) security,” in Proceedings of the 2016 ACM SIGCOMM Conference, pp. 29–43,
NDSS, 2016. 2016.
[278] S. Hilton, “Dyn analysis summary of Friday October 21 [299] G. Sviridov, M. Bonola, A. Tulumello, P. Giaccone, A. Bianco,
Attack, 2016..” [Online]. Available: https://dyn.com/blog/ and G. Bianchi, “LODGE: Local decisions on global states in pro-
dyn-analysis-summary-of-friday-october-21-attack/. grammable data planes,” in 2018 4th IEEE Conference on Network
[279] S. Kottler, “February 28th DDoS incident report, March, 2018.” [On- Softwarization and Workshops (NetSoft), pp. 257–261, IEEE, 2018.
line]. Available: https://githubengineering.com/ddos-incident-report/. [300] G. Sviridov, M. Bonola, A. Tulumello, P. Giaccone, A. Bianco,
[280] D. Scholz, S. Gallenmüller, H. Stubbe, and G. Carle, “Syn flood defense and G. Bianchi, “Local decisions on replicated states (LOADER) in
in programmable data planes,” in Proceedings of the 3rd P4 Workshop programmable data planes: programming abstraction and experimental
in Europe, pp. 13–20, 2020. evaluation,” arXiv preprint arXiv:2001.07670, 2020.
[281] J. Ioannidis and S. M. Bellovin, “Implementing pushback: router-based [301] S. Luo, H. Yu, and L. Vanbever, “Swing state: consistent updates
defense against DDoS attacks,” in NDSS, 2016. for stateful and programmable data planes,” in Proceedings of the
[282] N. Handigol, B. Heller, V. Jeyakumar, D. Mazières, and N. McKeown, Symposium on SDN Research, pp. 115–121, 2017.
“I know what your packet did last hop: using packet histories to [302] J. Xing, A. Chen, and T. E. Ng, “Secure state migration in the data
troubleshoot networks,” in 11th {USENIX} Symposium on Networked plane,” in Proceedings of the Workshop on Secure Programmable
Systems Design and Implementation ({NSDI} 14), pp. 71–85, 2014. Network Infrastructure, pp. 28–34, 2020.
[283] Y. Zhu, N. Kang, J. Cao, A. Greenberg, G. Lu, R. Mahajan, D. Maltz, [303] L. Zeno, D. R. Ports, J. Nelson, and M. Silberstein, “Swishmem:
L. Yuan, M. Zhang, B. Y. Zhao, and H. Zheng, “Packet-level telemetry Distributed shared state abstractions for programmable switches,” in
in large datacenter networks,” in Proceedings of the 2015 ACM Confer- Proceedings of the 19th ACM Workshop on Hot Topics in Networks,
ence on Special Interest Group on Data Communication, pp. 479–491, pp. 160–167, 2020.
2015. [304] P. Bosshart, G. Gibb, H.-S. Kim, G. Varghese, N. McKeown, M. Iz-
[284] H. Zeng, P. Kazemian, G. Varghese, and N. McKeown, “Automatic test zard, F. Mujica, and M. Horowitz, “Forwarding metamorphosis: fast
packet generation,” in Proceedings of the 8th international conference programmable match-action processing in hardware for SDN,” ACM
on Emerging networking experiments and technologies, pp. 241–252, SIGCOMM Computer Communication Review, vol. 43, no. 4, pp. 99–
2012. 110, 2013.
[285] P. Kazemian, G. Varghese, and N. McKeown, “Header space anal- [305] R. Pagh and F. F. Rodler, “Cuckoo hashing,” J. Algorithms, vol. 51,
ysis: static checking for networks,” in Presented as part of the 9th p. 122–144, May 2004.
{USENIX} Symposium on Networked Systems Design and Implemen-
tation ({NSDI} 12), pp. 113–126, 2012.
[286] A. Khurshid, X. Zou, W. Zhou, M. Caesar, and P. B. Godfrey,
“Veriflow: verifying network-wide invariants in real time,” in Presented
as part of the 10th {USENIX} Symposium on Networked Systems
Design and Implementation (NSDI), pp. 15–27, 2013.
[287] R. Stoenescu, M. Popovici, L. Negreanu, and C. Raiciu, “Symnet:
scalable symbolic execution for modern networks,” in Proceedings of
the 2016 ACM SIGCOMM Conference, pp. 314–327, 2016.
[288] H. Mai, A. Khurshid, R. Agarwal, M. Caesar, P. B. Godfrey, and S. T.
King, “Debugging the data plane with Anteater,” ACM SIGCOMM
Computer Communication Review, vol. 41, no. 4, pp. 290–301, 2011.
[289] P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, and
S. Whyte, “Real time network policy checking using header space
analysis,” in Presented as part of the 10th {USENIX} Symposium on
Networked Systems Design and Implementation (NSDI), pp. 99–111,
2013.
[290] A. Horn, A. Kheradmand, and M. Prasad, “Delta-net: real-time network
verification using atoms,” in 14th {USENIX} Symposium on Networked
Systems Design and Implementation (NSDI), pp. 735–749, 2017.
[291] S. Son, S. Shin, V. Yegneswaran, P. Porras, and G. Gu, “Model
