P4 Packet Routing

Abstract—The soaring complexity of networks has led to more complex methods to efficiently manage and orchestrate the multitude of network environments. Recent advances in machine learning (ML) have opened new opportunities for network management automation, exploiting existing advances in software-defined infrastructures. Advanced routing strategies have been proposed to accommodate the traffic demand of interactive systems, where the common architecture is composed of a data-driven network management schema collecting network data that feeds a reinforcement learning (RL) algorithm. However, the overhead introduced by the SDN controller and its operations can be mitigated if the networking architecture is redesigned. In this paper, we propose ROAR, a novel architectural solution that implements Deep Reinforcement Learning (DRL) inside P4 programmable switches to perform adaptive routing policies based on network conditions and traffic patterns. The network devices act independently in a multi-agent reinforcement learning (MARL) framework but are able to learn cooperative behaviors to reduce the queuing time of transmitted packets. Experimental results show that, for an increasing amount of traffic in the network, there is both a throughput and a delay improvement in the transmission compared to traditional approaches.

Index Terms—deep reinforcement learning, P4, routing

I. INTRODUCTION

In recent years, there has been a rapid increase in the number of brand-new applications, which not only places more and more demands on communication technologies, e.g., 5G and 6G, but also poses significant difficulties for the Internet. Each application, in particular, has distinct yet strict requirements for latency, jitter, throughput, and packet loss rate. We can observe that as networks continue to evolve and become more complex, the need for efficient routing mechanisms becomes increasingly crucial.

One dictating trend is applying Machine Learning (ML) and Deep Learning (DL) to routing with the aim of leveraging information about past traffic conditions to learn good routing configurations for future conditions [1]. The flexibility provided by SDN enables fast reactive and proactive network management, in which the SDN controller can easily observe the network and react to changes in traffic demand and evolution [2], [3]. One class of ML is Reinforcement Learning (RL), which fits routing problems well given its automatic learning improvements toward the optimal policy [4]. In recent years, many research works have integrated RL and Deep Reinforcement Learning (DRL) into SDN networks for routing optimization [5], [6].

However, these approaches based on centralized controllers are inherently too slow to respond to fine-grained traffic changes, as in short traffic bursts. Moreover, even when the software control plane is local on the switches, the ability to select new routes is often limited and not fast enough [7].

To overcome these limitations, programmable data planes have recently gained popularity, and different works have developed mechanisms that, operating entirely in the data plane, enable real-time adaptation [8], [9]. These solutions can deliver considerable performance benefits over more static mechanisms and centralized approaches by using fine-grained performance information on hardware timescales. However, these techniques are limited to trivial performance-aware policies that are unable to learn during execution and, consequently, to adapt to multiple scenarios. Moreover, as the complexity of the network grows, determining the optimal routing policies to avoid congestion and improve performance has become increasingly essential but also challenging.

To automate routing decisions directly in the network device, we designed Reinforcement learning for Autonomous Routers (ROAR). Our solution uses network programmability in general, and P4 [10] programmable switches in particular, to perform distributed routing decisions via Deep Reinforcement Learning (DRL). As such, every switch of the network is an agent of the DRL system, which uses the algorithm to decide the forwarding port for the incoming packet according to two main factors: the next hop to the destination, known from the topology of interest, and the port's outgoing queue, to be minimized. Since the switches constantly learn from the environment, the model is periodically trained to consider the impact of different routing decisions on network performance and select the best route based on learned policies. Designing a DRL model with P4 is known to be challenging since the architecture does not support loops, complex arithmetical operations, or if-else conditions in action blocks, which are essential for the DRL algorithm. To overcome this limitation, we modify the P4-16 compiler to interface with a C++ external module, which is the fundamental intermediary between the P4 application and the DRL algorithm. We evaluated our solution on an emulated network over Mininet, showing that when the network starts being congested, the benefits of ROAR can be observed in increased throughput and reduced delay.

The rest of the paper is structured as follows. Section II presents literature about RL-based routing. In Section III, we describe ROAR's components. Section IV presents the experimental results, and finally, Section V concludes the paper.
[Fig. 2 graphic: SDN Controller; switch-internal blocks: Metrics, P4 Application, DRL, IPC Module; Data Traffic.]

Fig. 2: Overview of ROAR and its differences compared to a centralized solution. The solution is based on the control of packet forwarding planes directly in networking devices, also showing the blocks composing every network switch.

Our time-varying MARL process is defined as a tuple ⟨{S^i}_{i∈M}, {A^i}_{i∈M}, P, {R^i}_{i∈M}, {M_t}_{t≥0}⟩, where S^i denotes the local state space of agent i in M_t, and A^i is the action set that agent i can execute. Besides, A = ∏_{i=1}^{M} A^i is the joint action space of all agents, also referred to as the global action profile. We then proceed by defining the local reward function of agent i, denoted as R^i : S × A → ℝ, and the state transition probability function P : S × A × S → [0, 1]. In this setup, we assume that the states and actions have a global impact but are locally observable, as well as the rewards, which are only observed locally. At each time step t, given the state s_t ∈ S and the joint actions of the agents a_t = (a_t^1, ..., a_t^M) ∈ A, each agent receives an individual reward r_{t+1}^i. This reward is given by an equation that captures the incentive that the learning model wants to model and is determined by R^i(s_t, a_t). Additionally, the system transitions to a new state s_{t+1} ∈ S with a probability of P(s_{t+1} | s_t, a_t).

Our model is considered fully decentralized and individual, as each agent receives rewards locally and performs actions independently. As opposed to an SDN centralized scenario, where the controller has a global view of the network, we design a fully distributed solution. The leading idea is to train agents independently of each other to simplify the coordination process among routers and reduce the overhead caused by continuous updates. On the contrary, routers only exchange information about the known topology to consolidate it into a virtual global network view. In this scenario, agents are independent of each other, considering that they do not share network state information or model parameters.

In each ROAR agent i, the action A^i is a discrete number ranging from 1 to N, where N is the number of ports the switch uses. The state S^i, instead, is composed of three elements: (i) current destination, which is the current packet's destination IP address, (ii) future destinations, which is a list of the L next packets' destination IP addresses that follow the same route as the current one, and (iii) action history, which is a list of the last k actions adopted for the current packet's destination. Every time a given action for a certain state has been performed, e.g., a packet has been forwarded towards a specific port, the reward function is evaluated to update the expected cumulative reward (Q-value) for that state-action couple. Thus, it takes into account two main factors: (i) the queuing time, which is the time every packet has spent in the output queue, and (ii) the distance of the chosen next hop from the final destination. While we want to minimize the time spent by the packets in the outgoing queues, the distance of the next hop from the final destination is the only information the agent knows about the global topology. The reward function R^i also considers two indicators: delivered, σ_1, set to 1 if the packet is correctly routed and 0 otherwise, and dropped, σ_2, set to 1 if the packet has been dropped and 0 otherwise. Their value is always set to 0 in the case of spine switches, as they are not directly connected to any destination host. Summarizing, the reward function for each agent i is:

R^i = λ_1·σ_1 − λ_2·q − λ_3·σ_2 − λ_4·σ_3·d    (1)

where: (i) the λ values are the model's hyper-parameters, set during training to tune the performance of the algorithm, (ii) q is the time that the packet has spent in the queue before being sent, and (iii) σ_3 is a parameter indicating whether the switch is a spine one and is multiplied by d, i.e., the distance to the destination switch.
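To make the state encoding and the reward of Eq. (1) concrete, the following C++ sketch mirrors the description above; the struct fields, function names, and default λ weights are illustrative assumptions, not ROAR's actual implementation.

#include <deque>
#include <string>

// Per-agent (per-switch) state as described above: current destination,
// the next L destinations, and the last k output-port decisions.
struct AgentState {
    std::string current_destination;              // destination IP of the packet being routed
    std::deque<std::string> future_destinations;  // next L packet destinations on the same route
    std::deque<int> action_history;               // last k output ports chosen for this destination
};

// Reward of Eq. (1): R^i = l1*s1 - l2*q - l3*s2 - l4*s3*d.
// The lambda weights are placeholders; the paper tunes them during training.
struct RewardWeights {
    double l1 = 1.0, l2 = 1.0, l3 = 1.0, l4 = 1.0;
};

double agent_reward(bool delivered,        // sigma_1: 1 if the packet was correctly routed
                    bool dropped,          // sigma_2: 1 if the packet was dropped
                    bool is_spine,         // sigma_3: 1 if this is a spine switch
                    double queue_time_ms,  // q: time spent in the output queue
                    double distance_hops,  // d: distance of the next hop to the destination
                    const RewardWeights& w) {
    double s1 = delivered ? 1.0 : 0.0;
    double s2 = dropped ? 1.0 : 0.0;
    double s3 = is_spine ? 1.0 : 0.0;
    // Spine switches are not attached to hosts, so delivered/dropped stay 0 for them.
    if (is_spine) { s1 = 0.0; s2 = 0.0; }
    return w.l1 * s1 - w.l2 * queue_time_ms - w.l3 * s2 - w.l4 * s3 * distance_hops;
}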
Our Neural Network. In ROAR, we determined the number of input features for our NN, i.e., the state space, by computing their mean reward and selecting the list length for which they achieve the highest value. After different settings and evaluations, we set as input the last two taken decisions, i.e., a length of two for both the "future destinations" and "action history" state elements. Since these are categorical features, we need to convert them into numerical ones using encoding techniques. For this work, we used the One-Hot-Encoding method, which uses dummy variables to perform categorical encoding and performs better than other techniques according to the precision-recall curve (PR-AUC) metric [21].
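As an illustration of this encoding step, a minimal one-hot encoder for a categorical feature such as an output port in the action history could look as follows; the vocabulary handling is an assumption, since the paper only states that One-Hot-Encoding is used.

#include <vector>

// Minimal one-hot encoder: each distinct category (e.g., an output port or a
// destination) is mapped to an index, and encode() returns a 0/1 vector with a
// single 1 at that index ("dummy variables").
class OneHotEncoder {
public:
    explicit OneHotEncoder(int num_categories) : size_(num_categories) {}

    std::vector<float> encode(int category_id) const {
        std::vector<float> v(size_, 0.0f);
        if (category_id >= 0 && category_id < size_) v[category_id] = 1.0f;
        return v;
    }

private:
    int size_;
};

// Example: a switch with 4 ports encodes port 2 of the action history as the
// input slice {0, 0, 1, 0} for the neural network.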
These encoded features are now ready to be the inputs of our NN model. We evaluated different NN structures on leaf and backbone switches, finding that an architecture composed of 3 hidden layers of 128, 64, and 32 neurons can achieve the best performance. It is important to notice that the DRL algorithm is not applied every time a switch receives a packet: the overhead would be too high, and it would take tens of milliseconds to make a routing decision, which would be intolerable at line rate for modern switches. The resulting trade-off is to adopt a static routing guided by the DRL algorithm by means of periodical updates. It has been tested that updating the single route towards a specific destination every 10,000 packets maximizes the performance of our network.
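The 128-64-32 architecture mentioned above can be sketched as a plain feed-forward pass; the activation functions, weight initialization, and output dimension (one value per candidate port) are assumptions, as the paper does not detail them.

#include <algorithm>
#include <random>
#include <vector>

// One fully connected layer: y = act(W x + b).
struct Dense {
    int in, out;
    std::vector<float> W, b;                      // W is out x in, row-major
    Dense(int i, int o) : in(i), out(o), W(i * o), b(o) {
        std::mt19937 rng(42);
        std::uniform_real_distribution<float> u(-0.05f, 0.05f);
        for (auto& w : W) w = u(rng);             // small random init (assumption)
    }
    std::vector<float> forward(const std::vector<float>& x, bool relu) const {
        std::vector<float> y(out);
        for (int o = 0; o < out; ++o) {
            float s = b[o];
            for (int i = 0; i < in; ++i) s += W[o * in + i] * x[i];
            y[o] = relu ? std::max(0.0f, s) : s;
        }
        return y;
    }
};

// Q-network with the 128-64-32 hidden layers reported above.
// Input: the one-hot encoded state; output: one Q-value per candidate port.
struct QNetwork {
    Dense h1, h2, h3, out;
    QNetwork(int state_dim, int num_ports)
        : h1(state_dim, 128), h2(128, 64), h3(64, 32), out(32, num_ports) {}
    std::vector<float> q_values(const std::vector<float>& state) const {
        return out.forward(h3.forward(h2.forward(h1.forward(state, true), true), true), false);
    }
};

// The forwarding port would then be the argmax over q_values(state); in ROAR this
// lookup refreshes a route only periodically (e.g., every 10,000 packets), not per packet.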
B. P4-based Actions

P4 is a networking programming language that allows defining the data plane processing of a switch in a high-level structure and generates efficient code that can run on different hardware targets, including ASICs, FPGAs, and CPUs. It provides a way to define how packets should be processed through a network device, including how they are parsed,
matched against rules, and modified. To do so, a P4 program is composed of three main blocks: a parser, a match-action pipeline, and a deparser. The parser is designed as a finite-state machine that analyzes and extracts headers. For example, a packet may begin with an Ethernet header, followed by an IPv4 or IPv6 header; the parser extracts all of these headers and passes them to the next step, the match-action pipeline. This stage comprises checksum verification algorithms and the ingress and egress pipelines, which rely on structures (e.g., tables, registers, counters) that P4 uses to customize switch behavior and implement routing strategies based on policy. The deparser is the final stage, defining how outgoing packets are constructed from a set of header fields. In the following, we mostly focus on the match-action pipeline's ingress and egress while considering that the parser recognizes and extracts the Ethernet, IP, UDP, TCP, and ICMP headers.
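P4 expresses this parser as a state machine; purely as an illustration, and not the paper's P4 code, the same Ethernet-then-IPv4/IPv6 dispatch can be sketched in C++ over a raw byte buffer as follows.

#include <cstddef>
#include <cstdint>

// Illustrative parse result: which headers were recognized in the frame.
struct ParsedHeaders {
    bool has_ethernet = false;
    bool has_ipv4 = false;
    bool has_ipv6 = false;
    uint8_t ip_protocol = 0;   // e.g., 6 = TCP, 17 = UDP, 1 = ICMP
};

// Mimics the parser FSM: start -> ethernet -> (ipv4 | ipv6) -> accept.
ParsedHeaders parse_frame(const uint8_t* pkt, size_t len) {
    ParsedHeaders h;
    if (len < 14) return h;                       // truncated Ethernet header
    h.has_ethernet = true;
    uint16_t ethertype = (pkt[12] << 8) | pkt[13];
    if (ethertype == 0x0800 && len >= 14 + 20) {          // IPv4
        h.has_ipv4 = true;
        h.ip_protocol = pkt[14 + 9];              // protocol field of the IPv4 header
    } else if (ethertype == 0x86DD && len >= 14 + 40) {   // IPv6
        h.has_ipv6 = true;
        h.ip_protocol = pkt[14 + 6];              // next-header field of the IPv6 header
    }
    return h;
}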
When a ROAR router receives a packet, it performs two main operations: (i) it inserts the current packet's IP destination address into the "future destinations" data structure, accessed by the IPC module, and (ii) it chooses the output port given by the DRL module. The egress then performs two other main operations: (i) it forwards the packet according to a FIFO criterion and removes its destination IP address from the "future destinations" data structure, and (ii) it interacts with the IPC module to determine the reward for that specific forwarding action according to the time spent by the packet in the outgoing queue. Due to P4 limitations that prevent implementing any machine-learning method inside the default P4 compiler, we had to modify it and enable "extern" instances, which allow implementing external methods outside the P4 program [22]. The P4 program can reference the extern object and pass inputs to it, but the implementation details of how the object works are hidden from the P4 program. This makes the separation between the network's control and data planes easier, allowing the P4 program to focus on packet-processing logic while leaving the low-level details of interacting with hardware to the extern objects.
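A minimal sketch of the bookkeeping the extern side could perform for these ingress and egress operations is shown below; the function names and data structures are illustrative assumptions about how the "future destinations" list and the queue-time signal might be maintained, not the authors' extern code.

#include <chrono>
#include <deque>
#include <mutex>
#include <string>

using Clock = std::chrono::steady_clock;

// Shared between the data-plane hooks and the DRL side via the IPC module.
struct ExternState {
    std::mutex mtx;
    std::deque<std::string> future_destinations;  // destinations of in-flight packets
};

// Ingress hook: record the destination and return the port chosen by the DRL module.
int on_ingress(ExternState& st, const std::string& dst_ip, int drl_selected_port) {
    std::lock_guard<std::mutex> lock(st.mtx);
    st.future_destinations.push_back(dst_ip);
    return drl_selected_port;                      // forwarding decision applied by the P4 pipeline
}

// Egress hook: remove the destination (FIFO order) and report the queuing time,
// which the IPC module turns into the reward of Eq. (1).
double on_egress(ExternState& st, Clock::time_point enqueued_at) {
    std::lock_guard<std::mutex> lock(st.mtx);
    if (!st.future_destinations.empty()) st.future_destinations.pop_front();
    auto waited = Clock::now() - enqueued_at;
    return std::chrono::duration<double, std::milli>(waited).count();  // queue time in ms
}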
C. IPC module

In ROAR, we use an IPC module deployed inside each switch to act as an intermediary between our P4-based network application and the DRL algorithm. The IPC block serves to bridge the two modules by providing a means for the P4 application to communicate the packet counters to the DRL module, and for the DRL module to communicate the chosen actions, i.e., the next hop, to the P4 forwarding plane. As mentioned previously, our IPC module includes functionality for preprocessing and transforming data to make it suitable for the DRL algorithm and, in particular, for the NN method, as well as for monitoring and collecting feedback on the performance of the system for the reward function. More in detail, our IPC module uses a socket abstraction written in C++ to establish a communication channel and exchange data with the DRL module.
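As a rough sketch of such a channel (the actual message format and socket path are not given in the paper, so the ones below are assumptions), a Unix-domain socket client on the extern side could exchange counters and actions with the DRL process as follows.

#include <cstring>
#include <string>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

// Connects to the DRL process over a Unix-domain socket, sends a text request
// (e.g., "PORT? 10.0.0.3") and returns the reply (e.g., "2"), i.e., the chosen port.
// The path and wire format are illustrative.
std::string query_drl(const std::string& request,
                      const char* path = "/tmp/roar_drl.sock") {
    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return "";

    sockaddr_un addr{};
    addr.sun_family = AF_UNIX;
    std::strncpy(addr.sun_path, path, sizeof(addr.sun_path) - 1);

    std::string reply;
    if (connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) == 0 &&
        send(fd, request.data(), request.size(), 0) == static_cast<ssize_t>(request.size())) {
        char buf[128];
        ssize_t n = recv(fd, buf, sizeof(buf), 0);
        if (n > 0) reply.assign(buf, static_cast<size_t>(n));
    }
    close(fd);
    return reply;
}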
IV. EVALUATION

This section describes our experimental settings and the results obtained over a virtual testbed like Mininet, focusing on a comparison between ROAR, a traditional routing protocol, and a centralized SDN solution.

A. Evaluation settings

To validate ROAR's benefits, we used Mininet, a network emulator that allows reproducing virtual networks and using them as a testbed for simulation purposes. Being specifically designed for software-defined networking (SDN), it supports P4-compatible switches via the Behavioral Model Version 2 (BMV2), which allows compiling a P4 program into the packet-processing actions of C++11 software switches. The topology used for testing is composed of 10 servers connected to their leaf switches, which are in turn connected to 4 other switches in a leaf-spine fashion, with all the links of the topology at 100 Mbps bandwidth. In all the performed tests, we use iperf3, a tool to perform measurements on the network according to different bandwidths, protocols, and buffer setups. Each server of the network sends packets to any other server, varying the number of receiving servers (from 1 to 9) and replicating the workload described in [23]. For each network load, we then computed the average of the obtained results (i.e., RTT, FPS, throughput, and packet loss) and drew the two-tailed confidence interval at 95%. For the context of this work, we compare our results against two alternatives: a traditional routing protocol implementation, namely OSPF, which uses longest-prefix-match tables to perform routing across the network and does not depend on the current network load; and a centralized SDN solution, QR-SDN [11], which routes packets according to the output of a tabular RL algorithm.

B. Random Traffic Generation

To study the performance of our solution, we started by generating traffic using the iperf3 tool, which helped us control the level of congestion of our network and compare the results with QR-SDN and the traditional routing implementation, i.e., OSPF (Fig. 3). We can see from Fig. 3a that when the load is low (10% to 20%), the network is not congested, and QR-SDN shows a lower Round Trip Time (RTT), while ROAR and OSPF perform similarly. This might be caused by the fact that ROAR uses DRL to choose the optimal route and, every 10,000 packets, the IPC module interacts with the DRL one to retrieve information about the best port to forward the packet. This interaction can indeed impact the network performance, decreasing the overall throughput. However, when the network load starts to increase (20% to 40%), a significant difference is visible with OSPF, while QR-SDN still achieves a lower RTT than ROAR. When our network is highly congested (from 50% to 90% of network load), we can clearly identify the benefits brought by ROAR, where the RL method accurately chooses the best forwarding port to avoid congestion and decrease the RTT, while for QR-SDN the interaction with an SDN controller impacts the overall performance of the network. A similar behavior is visible when evaluating the throughput, as in Fig. 3b. The figure shows that when the network is not congested and our load is low (10% to 20%), ROAR performs as the traditional routing
Fig. 3: Comparison of ROAR, OSPF and QR-SDN, measuring the (a) RTT evolution, (b) throughput, and (c) packet loss at
increasing network load.
Fig. 4: Comparison of ROAR and QR-SDN with realistic traffic workloads, measuring the (a) FPS evolution and (b) throughput
at increasing network load. (c) Throughput in 200 seconds when the network load is at 60%.
implementation, while QR-SDN achieves higher throughput. However, a different behavior can be seen when our network load increases. While QR-SDN achieves better throughput than OSPF, ROAR allows our network to handle congestion

(Fig. 5 shows the RAM (%) and CPU (%) usage of ROAR at varying network load.)
fixed network load of 60% and reported the result in Fig. 4c. It is visible from the figure that for the entire evaluation period, ROAR achieves better throughput than QR-SDN, with results coherent with the ones reported before. These results are extremely important to assess the validity of local performance-aware routing not only with synthetic traffic, but also with more realistic traffic patterns.

D. Can ROAR run over real switches?

Aware of the impact on resource consumption that a DRL model might cause when implemented on physical switches (e.g., FPGA, Tofino), we computed the RAM and CPU usage of ROAR at a varying network load. To do so, we took as reference the X308P-48Y-T programmable switch [25], combined with its embedded Data Processing Unit (DPU), proportioning the results to its computing power. We reported the results in Fig. 5. In Fig. 5a, we can see how our solution consumes a low amount of RAM both when the network load is low and when the network is congested; this amount increases only up to 2% at the highest loads (70%-90%). The same behavior is visible in Fig. 5b, where the CPU consumption only increases at high network loads. These figures prove that, despite the DRL module, the IPC module interaction, and the NN algorithm, ROAR requires small hardware resources, making it suitable for deployment on real programmable switches.

V. CONCLUSION

In this paper, we proposed ROAR, a distributed ML-based solution that uses Deep Reinforcement Learning (DRL) to optimize the routing process of the network while doing all the computation directly inside the P4 programmable switches. This approach allows taking into consideration the current link load and re-routing packets over less congested paths. In the experimental results, we compared our solution to a traditional routing implementation and a centralized SDN solution, showing that the absence of interaction with a centralized SDN controller reduces the RTT even when the network is congested, while also achieving higher throughput, especially at higher network loads.

ACKNOWLEDGMENT

This work has been partially supported by NSF awards 2133407 and 2201536.

REFERENCES

[1] A. Sacco, F. Esposito, and G. Marchetto, “Resource Inference for Sustainable and Responsive Task Offloading in Challenged Edge Networks,” IEEE Transactions on Green Communications and Networking, vol. 5, no. 3, pp. 1114–1127, 2021.
[2] Y.-J. Wu, P.-C. Hwang, W.-S. Hwang, and M.-H. Cheng, “Artificial intelligence enabled routing in software defined networking,” Applied Sciences, vol. 10, no. 18, p. 6564, 2020.
[3] A. Sacco, F. Esposito, and G. Marchetto, “RoPE: An Architecture for Adaptive Data-Driven Routing Prediction at the Edge,” IEEE Transactions on Network and Service Management, vol. 17, no. 2, pp. 986–999, 2020.
[4] N. C. Luong, D. T. Hoang, S. Gong, D. Niyato, P. Wang, Y.-C. Liang, and D. I. Kim, “Applications of deep reinforcement learning in communications and networking: A survey,” IEEE Communications Surveys & Tutorials, vol. 21, no. 4, pp. 3133–3174, 2019.
[5] T. Fu, C. Wang, and N. Cheng, “Deep-learning-based joint optimization of renewable energy storage and routing in vehicular energy network,” IEEE Internet of Things Journal, vol. 7, no. 7, pp. 6229–6241, 2020.
[6] C. Liu, M. Xu, Y. Yang, and N. Geng, “DRL-OR: Deep reinforcement learning-based online routing for multi-type service requirements,” in IEEE INFOCOM - IEEE Conference on Computer Communications. IEEE, 2021, pp. 1–10.
[7] R. Amin, E. Rojas, A. Aqdus, S. Ramzan, D. Casillas-Perez, and J. M. Arco, “A survey on machine learning techniques for routing optimization in SDN,” IEEE Access, vol. 9, pp. 104582–104611, 2021.
[8] K.-F. Hsu, R. Beckett, A. Chen, J. Rexford, and D. Walker, “Contra: A programmable system for performance-aware routing,” in 17th USENIX Symposium on Networked Systems Design and Implementation (NSDI 20). USENIX Association, 2020, pp. 701–721.
[9] L. Yu, J. Sonchack, and V. Liu, “Mantis: Reactive programmable switches,” in Proceedings of the Annual Conference of the ACM Special Interest Group on Data Communication (SIGCOMM ’20). ACM, 2020, pp. 296–309.
[10] P. Bosshart, D. Daly, G. Gibb, M. Izzard, N. McKeown, J. Rexford, C. Schlesinger, D. Talayco, A. Vahdat, G. Varghese et al., “P4: Programming protocol-independent packet processors,” ACM SIGCOMM Computer Communication Review, vol. 44, no. 3, pp. 87–95, 2014.
[11] J. Rischke, P. Sossalla, H. Salah, F. H. Fitzek, and M. Reisslein, “QR-SDN: Towards Reinforcement Learning States, Actions, and Rewards for Direct Flow Routing in Software-Defined Networks,” IEEE Access, vol. 8, pp. 174773–174791, 2020.
[12] D. M. Casas-Velasco, O. M. C. Rendon, and N. L. da Fonseca, “Intelligent Routing Based on Reinforcement Learning for Software-Defined Networking,” IEEE Transactions on Network and Service Management, vol. 18, no. 1, pp. 870–881, 2020.
[13] C. Yu, J. Lan, Z. Guo, and Y. Hu, “DROM: Optimizing the routing in software-defined networks with deep reinforcement learning,” IEEE Access, vol. 6, pp. 64533–64539, 2018.
[14] W. Li, H. Zhang, S. Gao, C. Xue, X. Wang, and S. Lu, “SmartCC: A reinforcement learning approach for multipath TCP congestion control in heterogeneous networks,” IEEE Journal on Selected Areas in Communications, vol. 37, no. 11, pp. 2621–2633, 2019.
[15] S. S. Bhavanasi, L. Pappone, and F. Esposito, “Dealing with changes: Resilient routing via graph neural networks and multi-agent deep reinforcement learning,” IEEE Transactions on Network and Service Management, vol. 20, no. 3, pp. 2283–2294, 2023.
[16] L. Zhao, J. Wang, J. Liu, and N. Kato, “Routing for Crowd Management in Smart Cities: A Deep Reinforcement Learning Perspective,” IEEE Communications Magazine, vol. 57, no. 4, pp. 88–93, 2019.
[17] B. Dai, Y. Cao, Z. Wu, and Y. Xu, “IQoR-LSE: An Intelligent QoS On-Demand Routing Algorithm With Link State Estimation,” IEEE Systems Journal, vol. 16, no. 4, pp. 5821–5830, 2022.
[18] C. Yu, W. Quan, D. Gao, Y. Zhang, K. Liu, W. Wu, H. Zhang, and X. Shen, “Reliable cybertwin-driven concurrent multipath transfer with deep reinforcement learning,” IEEE Internet of Things Journal, vol. 8, no. 22, pp. 16207–16218, 2021.
[19] A. Sapio, M. Canini et al., “Scaling Distributed Machine Learning with In-Network Aggregation,” in 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’21), 2021, pp. 785–808.
[20] L. Buşoniu, R. Babuška, and B. De Schutter, “Multi-agent reinforcement learning: An overview,” Innovations in Multi-Agent Systems and Applications-1, pp. 183–221, 2010.
[21] C. Seger, “An investigation of categorical variable encoding techniques in machine learning: binary versus one-hot and feature hashing,” 2018.
[22] J. S. da Silva, F.-R. Boyer, L.-O. Chiquette, and J. P. Langlois, “Extern objects in P4: an ROHC header compression scheme case study,” in 2018 4th IEEE Conference on Network Softwarization and Workshops (NetSoft). IEEE, 2018, pp. 517–522.
[23] M. Alizadeh, A. Greenberg, D. A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, and M. Sridharan, “Data center TCP (DCTCP),” in Proceedings of the ACM SIGCOMM 2010 Conference, 2010, pp. 63–74.
[24] T. Benson, A. Akella, and D. A. Maltz, “Network traffic characteristics of data centers in the wild,” in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, 2010, pp. 267–280.
[25] “48x25Gb+8x100Gb, Intel Tofino P4 programmable bare metal switch: Asterfusion,” July 2022. [Online]. Available: https://cloudswit.ch/product/48x25gb8x100gb-intel-tofino-p4-programmable-bare-metal-switch-asterfusion/