Abstract—Deep Reinforcement Learning (DRL) has shown a dramatic improvement in decision-making and automated control problems. Consequently, DRL represents a promising technique to efficiently solve many relevant optimization problems (e.g., routing) in self-driving networks. However, existing DRL-based solutions applied to networking fail to generalize, which means that they are not able to operate properly when applied to network topologies not observed during training. This lack of generalization capability significantly hinders the deployment of DRL technologies in production networks. This is because state-of-the-art DRL-based networking solutions use standard neural networks (e.g., fully connected, convolutional), which are not suited to learn from information structured as graphs.

In this paper, we integrate Graph Neural Networks (GNN) into DRL agents and we design a problem-specific action space to enable generalization. GNNs are Deep Learning models inherently designed to generalize over graphs of different sizes and structures. This allows the proposed GNN-based DRL agent to learn and generalize over arbitrary network topologies. We test our DRL+GNN agent in a routing optimization use case in optical networks and evaluate it on 180 and 232 unseen synthetic and real-world network topologies, respectively. The results show that the DRL+GNN agent is able to outperform state-of-the-art solutions in topologies never seen during training.

Index Terms—Graph Neural Networks, Deep Reinforcement Learning, Routing, Optimization

Paul Almasan, José Suárez-Varela, Pere Barlet-Ros and Albert Cabellos-Aparicio are with the Barcelona Neural Networking Center, Universitat Politècnica de Catalunya, Barcelona, Spain. E-mail: {felician.paul.almasan, jose.suarez-varela, pere.barlet, alberto.cabellos}@upc.edu
Krzysztof Rusek is with the Institute of Telecommunications, AGH University of Science and Technology, Krakow, Poland, and with the Barcelona Neural Networking Center, Universitat Politècnica de Catalunya, Barcelona, Spain. E-mail: [email protected]

NOTE: This work has been accepted for publication in the Computer Communications journal. Please use the following reference to cite this work: Paul Almasan, José Suárez-Varela, Krzysztof Rusek, Pere Barlet-Ros and Albert Cabellos-Aparicio, "Deep reinforcement learning meets graph neural networks: Exploring a routing optimization use case", Computer Communications, 2022, doi: https://doi.org/10.1016/j.comcom.2022.09.029.

I. INTRODUCTION

In the last years, industrial advances (e.g., Industry 4.0, IoT) and changes in social behavior created a proliferation of modern network applications (e.g., Vehicular Networks, AR/VR, Real-Time Communications), imposing new requirements on backbone networks (e.g., high throughput and low latency). Consequently, network operators need to efficiently manage the network resources, ensuring the customer's Quality of Service and fulfilling the Service Level Agreements. This is typically done using expert knowledge or solvers leveraging Integer Linear Programming (ILP) or Constraint Programming (CP). However, real-world production networks have in the order of hundreds of nodes, and solvers based on ILP or CP would take a large amount of time to solve network optimization problems [1], [2]. In addition, heuristic-based solutions are far from being optimal.

Deep Reinforcement Learning (DRL) has shown significant improvements in sequential decision-making and automated control problems [3], [4]. As a result, the network community is already investigating DRL as a key technology for network optimization (e.g., routing) with the goal of enabling self-driving networks [5]–[8]. However, existing DRL-based solutions still fail to generalize when applied to different network scenarios [9], [10]. In this context, generalization refers to the ability of the DRL agent to adapt to new network scenarios not seen during training (e.g., network topologies, configurations).

We argue that generalization is an essential property for the successful adoption of DRL technologies in production networks. Without generalization, DRL solutions would have to be trained in the same network where they are deployed, which is not possible or affordable in general. Training a DRL agent is a very costly and lengthy process. It often requires significant computing power and instrumentation of the network to observe its performance (e.g., delay, jitter). Additionally, decisions made by a DRL agent during training can lead to degraded performance or even to service disruption. Thus, training a DRL agent in the customer's network may be unfeasible.

With generalization, a DRL agent can be trained with multiple, representative network topologies and configurations. Afterwards, it can be applied to other topologies and configurations, as long as they share some common properties. Such a "universal" model can be trained in a laboratory and later on be incorporated in a product or a network device (e.g., router, load balancer). The resulting solution would be ready to be deployed to a production network without requiring any further training or instrumentation in the customer network¹.

¹ Note that solutions based on transfer learning do not offer this property, as DRL agents need to be re-trained on the network where they finally operate.
Unfortunately, existing DRL proposals for networking were designed to operate in the same network topology seen during training [9], [11], [12], thereby limiting their potential deployment on production networks. The main reason behind this strong limitation is that computer networks are fundamentally represented as graphs. For instance, the network topology and routing policy are typically represented as such. However, state-of-the-art proposals [11], [13]–[15] use traditional neural network (NN) architectures (e.g., fully connected, convolutional) that are not well suited to model graph-structured information [16].

In this paper, we integrate Graph Neural Networks (GNN) [17] into DRL agents to solve network optimization problems. Particularly, our architecture is intended to solve routing optimization in optical networks and to generalize over never-seen arbitrary topologies. The GNN integrated in our DRL agent is inspired by Message-Passing Neural Networks (MPNN), which were successfully applied to solve a relevant chemistry-related problem [18]. In our case, the GNN was specifically designed to capture meaningful information about the relations between the links and the traffic flowing through the network topologies.

The evaluation results show that our agent achieves a strong generalization capability compared to state-of-the-art DRL (SoA DRL) algorithms [15]. Additionally, to further test the generalization capability of the proposed DRL-based architecture, we evaluated it on a set of 232 different real-world network topologies. The results show that the proposed DRL+GNN architecture is able to achieve outstanding performance over networks never seen during training. Finally, we explore the generalization limitations of our architecture and discuss its scalability properties.

Overall, our DRL+GNN architecture for network optimization has the following features:
• Generality: It can work effectively in network topologies and scenarios never seen during training.
• Deployability: It can be deployed to production networks without requiring training nor instrumentation in the customer network.
• Low overhead: Once trained, the DRL agent can make routing decisions in only one step (≈ ms), while its cost scales linearly with the network size.
• Commercialization: Network vendors can easily embed it in network devices or products, and successfully operate "arbitrary" networks.

We believe the combination of these features can enable the development of a new generation of networking solutions based on DRL that are more cost-effective than current approaches based on heuristics or linear optimization. All the topologies and scripts used in the experiments, as well as the source code of our DRL+GNN agent, are publicly available [19].

II. BACKGROUND

The solution proposed in this paper combines two machine learning mechanisms. First, we use a GNN to model computer network scenarios. GNNs are neural network architectures specifically designed to generalize over graph-structured data [16]. In addition, they offer near real-time operation in the scale of milliseconds (see Section VI-B). Second, we use Deep Reinforcement Learning to build an agent that learns how to efficiently operate networks following a particular optimization goal. DRL applies the knowledge obtained in past optimizations to later decisions, without the necessity to run computationally intensive algorithms.

A. Graph Neural Networks

Graph Neural Networks are a novel family of neural networks designed to operate over graphs. They were introduced in [17] and numerous variants have been developed since [20], [21]. In their basic form, they consist of associating some initial states to the different elements of an input graph, and combining them considering how these elements are connected in the graph. An iterative algorithm updates the elements' state and uses the resulting states to produce an output. The particularities of the problem to solve will determine which GNN variant is more suitable, depending on, for instance, the nature of the graph elements (i.e., nodes and edges) involved.

Message Passing Neural Networks (MPNN) [18] are a well-known type of GNNs that apply an iterative message-passing algorithm to propagate information between the nodes of the graph. In a message-passing step, each node k receives messages from all the nodes in its neighborhood, denoted by N(k). Messages are generated by applying a message function m(·) to the hidden states of node pairs in the graph. Then, they are combined by an aggregation function, for instance, a sum (Equation 1). Finally, an update function u(·) is used to compute a new hidden state for every node (Equation 2).

M_k^{t+1} = Σ_{i ∈ N(k)} m(h_k^t, h_i^t)    (1)

h_k^{t+1} = u(h_k^t, M_k^{t+1})    (2)

where the functions m(·) and u(·) can be learned by neural networks. After a certain number of iterations, the final node states are used by a readout function r(·) to produce an output for the given task. This function can also be implemented by a neural network and is typically tasked to predict properties of individual nodes (e.g., the node's class) or global properties of the graph.

GNNs have been able to achieve relevant performance results in multiple domains where data is typically structured as a graph [18], [22]. Since computer networks are fundamentally represented as graphs, GNNs are inherently well suited to model them and offer unique advantages compared to traditional neural network architectures (e.g., fully connected NN, Convolutional NN, etc.).
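As a minimal illustration of Equations 1 and 2, the following sketch runs T message-passing iterations over a toy graph. The message and update functions m(·) and u(·) are simple numeric stand-ins (in an MPNN they would be learned neural networks), and all names are illustrative rather than taken from the paper's implementation.

import numpy as np

def message_passing(adj, h, m_fn, u_fn, T):
    # adj: dict mapping each node k to its neighborhood N(k)
    # h:   dict mapping each node k to its hidden state (1-D numpy array)
    for _ in range(T):
        new_h = {}
        for k, neighbors in adj.items():
            # Eq. (1): aggregate the messages from all neighbors with a sum
            M = sum(m_fn(h[k], h[i]) for i in neighbors)
            # Eq. (2): update the hidden state of node k
            new_h[k] = u_fn(h[k], M)
        h = new_h
    return h

# Toy stand-ins for the learnable functions m(.) and u(.)
m_fn = lambda hk, hi: 0.5 * (hk + hi)
u_fn = lambda hk, M: np.tanh(hk + M)

adj = {0: [1, 2], 1: [0], 2: [0]}
h0 = {k: np.random.rand(4) for k in adj}
final_states = message_passing(adj, h0, m_fn, u_fn, T=3)
# A readout r(.) would combine final_states into a graph-level output.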
B. Deep Reinforcement Learning

DRL algorithms aim at learning a long-term strategy that maximizes an objective function in an optimization problem. DRL agents start from a tabula rasa state and learn the optimal strategy by an iterative process that explores the state and action spaces. These are denoted by a set of states (S) and a set of actions (A). Given a state s ∈ S, the agent will perform an action a ∈ A that produces a transition to a new state s' ∈ S, and will provide the agent with a reward r. Then, the objective is to find a strategy that maximizes the cumulative reward by the end of an episode. The definition of the end of an episode depends on the optimization problem to address.

Q-learning [23] is an RL algorithm whose goal is to make an agent learn a policy π : S → A. The algorithm creates a table (a.k.a., q-table) with all the possible combinations of states and actions. At the beginning of the training, the table is initialized (e.g., with zeros or random values) and, during training, the agent updates these values according to the rewards obtained after selecting an action. These values, called q-values, represent the expected cumulative reward after applying action a from state s, assuming that the agent follows the current policy π during the rest of the episode. During training, q-values are updated using the Bellman equation (see Equation 3), where Q(s_t, a_t) is the q-value function at time-step t, α is the learning rate, r(s_t, a_t) is the reward obtained from selecting action a_t from state s_t, and γ ∈ [0, 1] is the discount factor.

Q(s_t, a_t) = Q(s_t, a_t) + α ( r(s_t, a_t) + γ max_{a'} Q(s'_t, a') − Q(s_t, a_t) )    (3)

Deep Q-Network (DQN) [24] is a more advanced algorithm based on Q-learning that uses a Deep Neural Network (DNN) to approximate the q-value function. As the q-table size becomes larger, Q-learning faces difficulties to learn a policy from high-dimensional state and action spaces. To overcome this problem, the authors proposed to use a DNN as a q-value function estimator, relying on the generalization capabilities of DNNs to estimate the q-values of states and actions not seen in advance. For this reason, a NN well suited to understand and generalize over the input data of the DRL agent is crucial for the agent to perform well when facing states (or environments) never seen before. Additionally, DQN uses an experience replay buffer to store past sequential experiences (i.e., it stores tuples of {s, a, r, s'}).
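The tabular update of Equation 3 translates directly into code. The sketch below is a generic Q-learning step with ε-greedy action selection; the state encoding, action set and hyperparameter values are placeholders and are not tied to the OTN scenario introduced later.

import random
from collections import defaultdict

def q_learning_step(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    # Bellman update of Equation 3
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(Q, s, actions, eps):
    if random.random() < eps:
        return random.choice(actions)                 # explore
    return max(actions, key=lambda a: Q[(s, a)])      # exploit

Q = defaultdict(float)       # q-table, implicitly initialized to zeros
actions = [0, 1, 2, 3]       # e.g., indices of candidate actions

# One hypothetical transition (s, a, r, s') observed from an environment
s, s_next = "state-0", "state-1"
a = epsilon_greedy(Q, s, actions, eps=1.0)
r = 1.0
q_learning_step(Q, s, a, r, s_next, actions)

DQN replaces the table with a neural network that approximates Q and trains it on mini-batches of {s, a, r, s'} tuples sampled from the experience replay buffer.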
III. NETWORK OPTIMIZATION SCENARIO

In this paper, we explore the potential of a GNN-based DRL agent to address the routing problem in Optical Transport Networks (OTN). Particularly, we consider a network scenario based on Software-Defined Networking, where the DRL agent (located in the control plane) has a global view of the current network state, and has to make routing decisions on every traffic demand as it arrives. We consider a traffic demand as the volume of traffic sent from a source to a destination node. This is a relevant optimization scenario that has been studied in the last decades in the optical networking community, where many solutions have been proposed [11], [15], [25].

In our OTN scenario, the DRL agent makes routing decisions at the electrical domain, over a logical topology where nodes represent Reconfigurable Optical Add-Drop Multiplexers (ROADM) and edges are predefined lightpaths connecting them (see Figure 1). The DRL agent receives traffic demands with different bandwidth requirements defined by the tuple {src, dst, bandwidth}, and it has to select an end-to-end path for every demand. Particularly, end-to-end paths are defined as sequences of lightpaths connecting the source and destination of a demand. Since the agent operates at the electrical domain, traffic demands are defined as requests of Optical Data Units (ODUk), whose bandwidth requirements are defined in the ITU-T Recommendation G.709 [26]. The ODUk signals are then multiplexed into Optical Transport Units (OTUk), which are data frames including Forward Error Correction. Eventually, OTUk frames are mapped to different optical channels within the lightpaths of the topology.

Fig. 1: Schematic representation of the DRL agent in the OTN routing scenario.

In this scenario, the routing problem is defined as finding the optimal routing policy for each incoming source-destination traffic demand. The learning process is guided by an objective function that aims to maximize the traffic volume allocated in the network in the long term. We consider that a demand is properly allocated if there is enough available capacity in all the lightpaths forming the end-to-end path selected. Note that lightpaths are the edges in the logical topology where the agent operates. The demands do not expire, occupying the lightpaths until the end of a DRL episode. This implies a challenging task for the agent, since it has not only to identify critical resources in the network (e.g., potential bottlenecks), but also to deal with the uncertainty in the generation of future traffic demands. The following constraints summarize the traffic demand routing problem in the OTN scenario:
• The agent must make sequential routing decisions for every incoming traffic demand.
• Traffic demands cannot be split over multiple paths.
• Previous traffic demands cannot be rerouted, and they occupy the links' capacities until the end of the episode.

The optimal solution to the OTN optimization problem can be found by solving its Markov Decision Process (MDP) [27]. To do this, we can use techniques such as Dynamic Programming, which consist of an iterative process over all the MDP's states until convergence. The MDP for the traffic demand allocation problem consists of all the possible network topology states and the transition probabilities between states. Notice that in our scenario we have uniform transition probabilities from one state to the next. One limitation of solving MDPs optimally is that it becomes infeasible for large and complex optimization problems. As the problem size grows, so does the MDP's state space, where the space complexity (in number of states) is S ≈ O(N^E), with N the number of different capacities a link can have and E the number of links. Therefore, as the problem grows, solving the MDP requires iterating over an exponentially increasing number of states.
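As an illustration of the allocation rule above (a demand fits only if every lightpath of the selected end-to-end path has enough spare capacity, and allocations are never released within an episode), the following sketch mimics the environment dynamics. It is not the simulator released in [19]; the data structures and names are hypothetical.

def allocate_demand(capacity, path_links, bw):
    """Try to place a demand of size bw on every lightpath of the path.

    capacity:   dict {link: available capacity in ODU0 units}
    path_links: list of links (lightpaths) forming the end-to-end path
    Returns (reward, done): reward is bw if the demand is allocated, 0
    otherwise; done=True ends the episode (the demand did not fit).
    """
    if all(capacity[l] >= bw for l in path_links):
        for l in path_links:
            capacity[l] -= bw   # demands never expire within the episode
        return bw, False
    return 0, True

capacity = {("A", "B"): 200, ("B", "C"): 200}   # toy logical topology
reward, done = allocate_demand(capacity, [("A", "B"), ("B", "C")], bw=8)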
TABLE I: Input features of the link hidden states. N corresponds to the size of the hidden state vector.

Notation   Description
x1         Link available capacity
x2         Link betweenness
x3         Action vector (bandwidth allocated)
x4 - xN    Zero padding

[...] represent both the network state and the action, which is the input needed to model the q-value function Q(s, a).

The size of the hidden states is typically larger than the number of features they contain. This is to enable each link to store information about itself (i.e., its own initial features) plus the aggregated information coming from all the link's neighbors (see Section IV-C). If the hidden state size is equal to the number of link features, the links will not have space to store information about the neighboring links without losing information. This results in a poor graph embedding after the readout function. On the contrary, if the state size is very large, it can lead to a large GNN model, which can overfit to the data. A common approach is to set the state size larger than the number of features and to fill the rest of the vector with zeros.
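Following Table I, the initial hidden state of a link can be built by placing the link features at the beginning of a fixed-size vector and zero-padding the rest (the evaluation in Section V uses 27-element vectors). The helper below is a minimal sketch; for simplicity, the action is summarized here by a single allocated-bandwidth value.

import numpy as np

def initial_link_state(available_capacity, betweenness, allocated_bw,
                       hidden_size=27):
    # x1-x3 from Table I, followed by zero padding up to hidden_size (x4-xN)
    features = np.array([available_capacity, betweenness, allocated_bw],
                        dtype=np.float32)
    state = np.zeros(hidden_size, dtype=np.float32)
    state[:features.size] = features
    return state

h0 = initial_link_state(available_capacity=200, betweenness=0.12,
                        allocated_bw=0)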
C. GNN Architecture

The GNN model is based on the Message Passing Neural Network [18] model. In our case, we consider the link entity and perform the message passing process between all links. We choose link entities, instead of node entities, because the link features are what define the OTN routing optimization problem. Node entities could be added when addressing an optimization problem that needs to incorporate node-level features (e.g., I/O buffer size, scheduling algorithm). Algorithm 1 shows a formal description of the message passing process, where the algorithm receives as input the links' features (x_l) and outputs a q-value (q).

The algorithm performs T message passing steps. A graphical representation can be seen in Figure 3, where the algorithm iterates over all links of the network topology. For each link, its features are combined with those of the neighboring links using a fully-connected neural network, corresponding to M in Figure 3. The outputs of these operations are called messages, according to the GNN notation. Then, the messages computed for each link with its neighbors are aggregated using an element-wise sum (line 5 in Algorithm 1). Afterwards, a Recurrent NN (RNN) is used to update the link hidden states h_l with the newly aggregated information (line 6 in Algorithm 1). At the end of the message passing phase, the resulting link states are aggregated using an element-wise sum (line 7 in Algorithm 1).

[...] store information from links that are farther and farther apart. Therefore, the concept of time appears. RNNs are a NN architecture tailored to capture sequential behavior (e.g., text, video, time-series). In addition, some RNN architectures (e.g., GRU) are designed to process large sequences (e.g., long text sentences in NLP). Specifically, they internally contain gates that are designed to mitigate the vanishing gradients, a common problem with large sequences [28]. This makes RNNs suitable to learn how the links' states evolve during the message passing phase, even for large T.

Fig. 3: Message passing and readout in the proposed GNN (executed T times): messages M between neighboring links, element-wise aggregation, RNN update of the link states, and a readout that produces the q-value.

Algorithm 1 Message Passing
Input: x_l
Output: h_l^T, q
1: for each l ∈ L do
2:     h_l^0 ← [x_l, 0, . . . , 0]
3: for t = 1 to T do
4:     for each l ∈ L do
5:         M_l^{t+1} = Σ_{i∈N(l)} m(h_l^t, h_i^t)
6:         h_l^{t+1} = u(h_l^t, M_l^{t+1})
7: rdt ← Σ_{l∈L} h_l^T
8: q ← R(rdt)
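Algorithm 1 translates almost line by line into code. The sketch below keeps the link-level message passing, the element-wise sum aggregation and the final readout over the summed link states; the message function m, the update u (an RNN in the paper) and the readout R are reduced to simple stand-ins rather than the trainable TensorFlow models of the released implementation [19], and the neighborhood N(l) is assumed to be the set of links sharing an endpoint with l.

import numpy as np

def link_neighbors(links):
    # N(l): links that share an endpoint with l
    return {l: [o for o in links if o != l and set(o) & set(l)] for l in links}

def gnn_q_value(link_features, links, m, u, R, T=7, hidden_size=27):
    # Lines 1-2: initial hidden states = link features plus zero padding
    h = {}
    for l in links:
        h[l] = np.zeros(hidden_size)
        h[l][:len(link_features[l])] = link_features[l]
    N = link_neighbors(links)
    # Lines 3-6: T message-passing steps over the link entities
    for _ in range(T):
        M = {l: sum(m(h[l], h[i]) for i in N[l]) for l in links}
        h = {l: u(h[l], M[l]) for l in links}
    # Lines 7-8: element-wise sum of link states, then readout
    rdt = sum(h[l] for l in links)
    return R(rdt)

# Toy stand-ins for the learnable functions
m = lambda hl, hi: 0.5 * (hl + hi)
u = lambda hl, Ml: np.tanh(hl + Ml)        # the paper uses an RNN (e.g., GRU)
R = lambda rdt: float(rdt.mean())          # readout to a scalar q-value

links = [("A", "B"), ("B", "C"), ("A", "C")]
feats = {l: [200.0, 0.1, 0.0] for l in links}   # Table I features (x1-x3)
q = gnn_q_value(feats, links, m, u, R)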
D. DRL Agent Operation

The DRL agent operates by interacting with the environment. In Algorithm 2 we show a pseudocode describing the DRL agent operation. At the beginning, we initialize the environment env by initializing all the link features. At the same time, the environment generates a traffic demand to be allocated, given by the tuple {src, dst, bw}, and an environment state s. We also initialize the cumulative reward to zero, define the action set size and create the experience replay buffer (agt.mem). Afterwards, we execute a while loop (lines 3-16) that finishes when there is some demand that cannot be allocated in the network topology. For each of the k=4 shortest paths, we allocate the demand along all the links forming the path and compute a q-value (lines 7-9). Once we have the q-value for each state-action pair, the next action a to apply is selected using an ε-greedy exploration strategy (line 10) [24]. The action is then applied to the environment, leading to a new state s', a reward r and a flag Done indicating if there is some link without enough capacity to support the demand. Additionally, the environment returns a new traffic demand tuple {src', dst', bw'}. The information about the state transition is stored in the experience replay buffer (line 13). This information will be used later on to train the GNN in the agt.replay() call (line 15), which is executed every M training iterations.

Algorithm 2 DRL Agent Operation
1: s, src, dst, bw ← env.init_env()
2: reward ← 0, k ← 4, agt.mem ← {}, Done ← False
3: while not Done do
4:     k_q_values ← {}
5:     k_shortest_paths ← compute_k_paths(k, src, dst)
6:     for i in 0, . . . , k do
7:         p' ← get_path(i, k_shortest_paths)
8:         s' ← env.alloc_demand(s, p', src, dst, bw)
9:         k_q_values[i] ← compute_q_value(s', p')
10:    q_value ← epsilon_greedy(k_q_values, ε)
11:    a ← get_action(q_value, k_shortest_paths, s)
12:    r, Done, s', src', dst', bw' ← env.step(s, a)
13:    agt.rmb(s, src, dst, bw, a, r, s', src', dst', bw')
14:    reward ← reward + r
15:    if training_steps % M == 0: agt.replay()
16:    src ← src'; dst ← dst'; bw ← bw'; s ← s'
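One iteration of the loop in Algorithm 2 (lines 4-11) evaluates the GNN once per candidate path and then applies ε-greedy selection over the k resulting q-values. The sketch below reproduces that decision step; compute_q_value stands for the GNN forward pass of Algorithm 1 on the tentative state, and the helper names are illustrative rather than the API of the released code [19].

import random

def allocate_on_path(capacity, path_links, bw):
    # Tentatively consume capacity on every lightpath of the candidate path
    for l in path_links:
        capacity[l] = capacity[l] - bw

def drl_decision_step(capacity, k_shortest_paths, bw, compute_q_value, epsilon):
    # Lines 4-9 of Algorithm 2: one q-value per candidate path
    k_q_values = []
    for path in k_shortest_paths:
        tentative = dict(capacity)
        allocate_on_path(tentative, path, bw)
        k_q_values.append(compute_q_value(tentative, path))
    # Lines 10-11: epsilon-greedy selection over the k candidate actions
    if random.random() < epsilon:
        i = random.randrange(len(k_shortest_paths))
    else:
        i = max(range(len(k_shortest_paths)), key=lambda j: k_q_values[j])
    return k_shortest_paths[i], k_q_values

# Toy usage with a constant stand-in for the GNN of Algorithm 1
capacity = {("A", "B"): 200, ("B", "C"): 200, ("A", "C"): 200}
paths = [[("A", "C")], [("A", "B"), ("B", "C")]]
best_path, qs = drl_decision_step(capacity, paths, bw=8,
                                  compute_q_value=lambda s, p: -sum(s.values()),
                                  epsilon=0.1)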
V. EXPERIMENTAL RESULTS

In this section we evaluate our GNN-based DRL agent to optimize the routing configuration in the OTN scenario described in Section III. Particularly, the experiments in this section are focused on evaluating the performance and generalization capabilities of the proposed DRL+GNN agent. Afterwards, in Section VI, we analyze the scalability properties of our solution and discuss other relevant aspects related to the deployment on production networks.

A. Evaluation Setup

We implemented the DRL+GNN solution described in Section IV with TensorFlow [29] and evaluated it on an OTN network simulator implemented using the OpenAI Gym framework [30]. The source code, together with all the training and evaluation results, is publicly available [19].

In the OTN simulator, we consider three traffic demand types (ODU2, ODU3, and ODU4), whose bandwidth requirements are expressed in terms of multiples of ODU0 signals (i.e., 8, 32, and 64 ODU0 bandwidth units respectively) [26]. When the DRL agent correctly allocates a demand, it receives an immediate reward equal to the bandwidth of the current traffic demand; otherwise the reward is 0. We consider that a demand is successfully allocated if all the links in the path selected by the DRL agent have enough available capacity to carry such demand. Likewise, episodes end when a traffic demand is not correctly allocated. Traffic demands are generated by uniformly selecting a source-destination node pair and a traffic demand type (ODUk). This makes the problem even more difficult for the DRL agent, since the uniform traffic distribution hinders the exploitation of prediction systems to anticipate possible demands difficult to allocate. In other words, all traffic demands are equally probable to appear in the future, making it more difficult for the DRL agent to estimate the expected future rewards.

Initial experiments were carried out to choose an appropriate gradient-based optimization algorithm and to find the hyperparameter values for the DRL+GNN agent. For the GNN model, we defined the links' hidden states h_l as 27-element vectors (filled with the features described in Table I). Note that the size of the hidden states is related to the amount of information they may potentially encode. Larger network topologies and complex network optimization scenarios might need larger sizes for the hidden state vectors. In every forward propagation of the GNN we execute T=7 message passing steps using batches of 32 samples. The optimizer used is Stochastic Gradient Descent [31] with a learning rate of 10^-4 and a momentum of 0.9. We start the ε-greedy exploration strategy with ε=1.0 and maintain this value during 70 training iterations. Afterwards, ε decays exponentially every episode. The experience buffer stores 4,000 samples and is implemented as a FIFO queue (first in, first out). We applied L2 regularization and dropout to the readout function with a coefficient of 0.1 in both cases. The discount factor γ was set to 0.95.

B. Methodology

We divided the evaluation of our DRL+GNN agent in two sets of experiments. In the first set, we focused on reasoning about the performance and generalization capabilities of our solution. For illustration purposes, we chose two particular network scenarios and analyzed them extensively. As a baseline, we implemented the DRL-based system proposed in [15], a state-of-the-art solution for routing optimization in OTNs. Later on, in Section VI, we evaluated our solution on real-world network topologies and analyzed its scalability in terms of computation time and generalization capabilities.

Finding the optimal MDP solution to the OTN optimization problem is infeasible due to its complexity. Take as an example a small network topology with 6 nodes and 8 edges, where the links have capacities of 3 ODU0 units, there is only one bandwidth type available (1 ODU0), and there are 4 possible actions. The resulting number of states of the MDP is 5^8 · 6 · 5 · 1 = 11,718,750 ≈ 1.17·10^7. To find a solution to the MDP we can use Dynamic Programming algorithms such as value iteration. However, this algorithm has a time complexity of O(S^2·A) to solve the MDP, where S and A are the number of states and actions respectively and S ≈ O(N^E), with N the number of different capacities a link can have and E the number of links.

As an alternative, we compare the DRL+GNN agent performance with a theoretical fluid model (labeled as Theoretical Fluid). This model is a theoretical approach which considers that traffic demands may be split into the k=4 candidate paths proportionally to the available capacity they have. This routing policy is aimed at avoiding congestion on links. For instance, [...]

[...] scenario on the 14-node Nsfnet topology [32], where we considered that the links represent lightpaths with capacity for 200 ODU0 signals. Note that the capacity is shared on both directions of the links and that the bandwidth of different traffic demands is expressed in multiples of ODU0 signals (i.e., 8, 32 or 64 ODU0 bandwidth units). We ran 1,000 training iterations where the agent received traffic demands and allocated them on one of the k=4 shortest paths available in the action set. The model with the highest performance was selected to be benchmarked against traditional routing optimization strategies and state-of-the-art DRL-based solutions.

Fig. 4: Performance evaluation against state-of-the-art DRL ((c) evaluation on Nsfnet, (d) evaluation on Geant2; x-axis of (c) and (d): relative performance to the Theoretical Fluid). Notice that the vertical lines in 4c and 4d indicate the same performance as the theoretical fluid model.

C. Performance evaluation against state-of-the-art DRL-based solutions

In this evaluation experiment, we compare our DRL+GNN agent against state-of-the-art DRL-based solutions. Particularly, we adapted the solution proposed in [15] to operate in scenarios where links share their capacity in both directions. We trained two different instances of the state-of-the-art DRL agent in two network scenarios: the 14-node Nsfnet and the 24-node Geant2 topologies. [...] state-of-the-art DRL solution, were evaluated over the same list of generated demands.

Fig. 5: Evaluation on Geant2 of DRL-based solutions trained on Nsfnet ((a) bandwidth allocated, (b) CDF).
We ran two experiments to compare the performance of our DRL+GNN with the results obtained by the state-of-the-art DRL (SoA DRL). In the first experiment, we evaluated the DRL+GNN agent against the SoA DRL agent trained on Nsfnet, the LB routing policy, and the theoretical fluid model. We evaluated the four routing strategies on the Nsfnet topology and compared their performance. In Figure 4a, we can observe a boxplot with the evaluation results of 1,000 evaluation experiments. The y-axis indicates the agent score, which corresponds to the bandwidth allocated by the agent. Figure 4c shows the Cumulative Distribution Function (CDF) of the relative score obtained with respect to the fluid model. In this experiment we could also observe that the proposed DRL+GNN agent slightly outperforms the SoA DRL-based solution, allocating 6.6% more bandwidth. In the second experiment, we evaluated the same models (DRL+GNN, SoA DRL, LB, and Theoretical Fluid) on the Geant2 topology, but in this case the SoA DRL agent was trained on Geant2. The resulting boxplot can be seen in Figure 4b and the CDF of the evaluation samples in Figure 4d. Similarly, in this case our agent performs slightly better than the SoA DRL approach (3% more bandwidth).

We ran another experiment to compare the generalization capabilities of our DRL+GNN agent. In this experiment, we evaluated the DRL+GNN agent (trained on Nsfnet) against the SoA DRL agent trained on Nsfnet, and evaluated both agents on the Geant2 topology. The resulting boxplot can be seen in Figure 5a and the corresponding CDF in Figure 5b. The results indicate that in this scenario the DRL+GNN agent also outperforms the SoA DRL agent. In this case, in 80% of the experiments our DRL+GNN agent achieved more than 45% performance improvement with respect to the SoA DRL proposal. These results show that, while the proposed DRL+GNN agent is able to generalize and achieve outstanding performance in the unseen Geant2 topology (Figure 5a and Figure 5b), the SoA DRL agent performs poorly when applied to topologies not seen during training. This reveals the lack of generalization capability of the latter DRL-based solution compared to the agent proposed in this paper.
Fig. 6: DRL+GNN evaluation on a use case with link failures (x-axis: number of links removed; the plots compare the DRL+GNN agent trained on NSFNet against the Theoretical Fluid model).

[...] we generated 20 topologies and we evaluated the agent on 1,000 episodes. To do this, we used the NetworkX Python library [36] to generate random network topologies between 20 and 100 nodes with a similar average node degree to Nsfnet. This allows us to analyze how the network size affects the performance.

Figure 8a shows how the performance decreases as the topology size grows. For benchmark purposes, we computed the relative score with respect to the theoretical fluid model. The agent shows a remarkable performance in unseen topologies. As an example, the agent has a similar performance to the the- [...] seen during training differs from the evaluation samples (see Section VI-C).
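The text above only states that the random topologies are generated with NetworkX [36], with 20 to 100 nodes and an average node degree similar to Nsfnet (about 3, see Table III). The sketch below is one plausible way to do this; the gnm_random_graph generator and the resampling-until-connected loop are assumptions for illustration, not necessarily the exact procedure used in the paper.

import random
import networkx as nx

def random_topology(n_nodes, target_avg_degree=3.0, seed=0):
    # For an undirected graph, average degree = 2*E/N, so choose E accordingly
    n_edges = max(n_nodes - 1, round(n_nodes * target_avg_degree / 2))
    rng = random.Random(seed)
    while True:
        g = nx.gnm_random_graph(n_nodes, n_edges, seed=rng.randint(0, 2**31 - 1))
        if nx.is_connected(g):
            return g

# Topologies between 20 and 100 nodes with a node degree similar to Nsfnet
topologies = [random_topology(n, seed=n) for n in range(20, 101, 10)]
for g in topologies:
    avg_deg = 2 * g.number_of_edges() / g.number_of_nodes()
    print(g.number_of_nodes(), g.number_of_edges(), round(avg_deg, 2))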
2) Real-world network topologies: In this section we evaluate the generalization capabilities of our DRL+GNN agent, trained in Nsfnet, on 232 real-world topologies obtained from the Topology Zoo [2] dataset. Specifically, we take all the topologies that have up to 100 nodes. In Table II we can see the features extracted from the resulting topologies. The diameter feature corresponds to the maximum eccentricity (i.e., the maximum distance from one node to another node). The ranges of the different topology features indicate that our topology dataset contains different topology distributions.

We executed 1,000 evaluation episodes and computed the average reward achieved by the DRL+GNN agent, the LB, and the theoretical fluid routing strategies for each topology. Then, we computed the relative performance (in %) of our agent and the LB policy with respect to the theoretical fluid model. Figure 8b shows the results where, for readability, we sort the topologies according to the difference of score between the DRL+GNN agent and the LB policy. In the left side of the figure we observe some topology samples where the scores of all three routing strategies coincide. This kind of behavior is normal in topologies where, for each input traffic demand, there are not many paths to route the traffic demand (e.g., in ring or star topologies). As the number of paths increases, routing optimization becomes necessary to maximize the number of traffic demands allocated.

We also trained a DRL+GNN agent only in the Geant2 topology. The mean relative score (with respect to the theoretical fluid) of evaluating the model on all real-world topologies was +4.78%. In the interest of space, we omit this figure. These results indicate that our DRL+GNN architecture generalizes well to topologies never seen during training, independently of the topology used during training.

These experiments show the robustness of our architecture to operate in real-world topologies that largely differ from the scenarios seen during training. Even when trained in a single 14-node topology, the agent achieves good performance in topologies of up to 100 nodes.

TABLE II: Real-world topology features (minimum and maximum values).

Feature            Minimum    Maximum
Num. Nodes         6          92
Num. Edges         5          101
Avg. node degree   1.667      8
Var. node degree   0.001      41.415
Diameter           1          31

TABLE III: Features for the synthetic network topologies. The values correspond to the mean of all topologies from each topology size. As a reference, the first row corresponds to the Nsfnet topology used during training.

Topology Size      Mean Node Degree   Var. Node Degree   Node Betwee.   Edge Betwee.   DRL+GNN Perf. w.r.t. Fluid (%)
Nsfnet (training)  3                  0.2857             0.0952         0.1020         -
20 Nodes           2.90               0.1050             0.1036         0.0988         4.305
30 Nodes           2.93               0.0956             0.0844         0.0764         -0.649
40 Nodes           2.95               0.1025             0.0704         0.0623         -3.945
50 Nodes           2.96               0.1104             0.0620         0.0538         -6.422
60 Nodes           2.97               0.1056             0.0559         0.0476         -8.103
70 Nodes           2.97               0.0920             0.0522         0.0437         -10.064
80 Nodes           2.98               0.0956             0.0474         0.0395         -11.380
90 Nodes           2.98               0.1062             0.0436         0.0361         -13.610

B. Computation Time

In this section we analyze the computation time of an already trained DRL+GNN agent when deployed in a realistic scenario. For this purpose, we used the synthetic topologies generated before in Section VI-A1, executed 1,000 episodes for each one, and measured the computation time. This is the time the agent takes to select the best path to allocate all the incoming traffic requests. For this experiment we used off-the-shelf hardware without any specific hardware accelerator (64-bit Ubuntu 16.04 LTS with an Intel Core i5-8400 processor at 2.80 GHz × 6 cores and 8 GB of RAM). Results should be understood only as a reference to analyze the scalability properties of our solution. Real implementations in a network device would be highly optimized.

Figure 9 shows the computation time for all episodes. The dots correspond to the average agent operation time over all the episodes and the confidence interval corresponds to the 5/95 percentiles. The execution time is in the order of a few milliseconds and grows linearly with the size of the topology. This is expected due to the way the message passing in the GNN has been designed. The results indicate that, in terms of deployment, the proposed DRL+GNN agent has interesting features. It is capable of optimizing unseen networks achieving good performance, as optimization algorithms do, but in one single step and in tens of milliseconds, as heuristics do.

Fig. 9: DRL+GNN average computation time (in milliseconds) over different topology sizes.

C. Discussion

In this paper we propose a data-driven solution to solve a routing problem in OTN. This means that our DRL agent [...]
[...] agents perform poorly when they are evaluated in different topologies that were not seen during the training.

There have been several attempts to use GNN in the communication networks field. In [46] they use GNN to learn shortest-path routing and max-min routing in a supervised learning approach. In [47] they combine GNN with DRL to solve a network planning problem. Another relevant work is the one from [48], where they use a distributed setup of DRL agents to solve a Traffic Engineering problem in a decentralized way. The work from [10] proposes to use GNN to predict network metrics and a traditional optimizer to find the routing that minimizes some network metrics (e.g., average delay). Finally, GNNs have been proposed to learn job scheduling policies in a data-center scenario without human intervention [49].

VIII. CONCLUSION

In this paper, we presented a DRL architecture based on GNNs that is able to generalize to unseen network topologies. The use of GNNs to model the network environment allows the DRL agent to operate in networks different from those used for training. We believe that the lack of generalization was the main obstacle preventing the use and deployment of DRL in production networks. The proposed architecture represents a first step towards the development of a new generation of DRL-based products for networking.

In order to show the generalization capabilities of our DRL+GNN solution, we selected a classic problem in the field of optical networks. This served as a baseline benchmark to validate the generalization performance of our architecture. Our results show that the proposed DRL+GNN agent is able to effectively operate in networks never seen during training. Previous DRL solutions based on traditional neural network architectures were not able to generalize to other topologies.

A fundamental challenge that remains to be addressed towards the deployment of DRL techniques for self-driving networks is their black-box nature. DRL does not provide guaranteed performance for all network scenarios and its operation cannot be understood easily by humans. As a result, DRL-based solutions are inherently complex to troubleshoot and debug by network operators. In contrast, computer networks have been built around well-understood analytical and heuristic techniques, and such mechanisms are based on well-known assumptions that perform reasonably well across different scenarios. Such issues are not unique to self-driving networks, but rather common to the application of machine learning to many critical use-cases, such as self-driving cars.

ACKNOWLEDGMENT

This publication is part of the Spanish I+D+i project TRAINER-A (ref. PID2020-118011GB-C21), funded by MCIN/AEI/10.13039/501100011033. This work is also partially funded by the Catalan Institution for Research and Advanced Studies (ICREA) and the Secretariat for Universities and Research of the Ministry of Business and Knowledge of the Government of Catalonia and the European Social Fund. This work was also supported by the Polish Ministry of Science and Higher Education with the subvention funds of the Faculty of Computer Science, Electronics and Telecommunications of AGH University and by the PL-Grid Infrastructure.

REFERENCES

[1] R. Hartert, S. Vissicchio, P. Schaus, O. Bonaventure, C. Filsfils, T. Telkamp, and P. Francois, "A declarative and expressive approach to control forwarding paths in carrier-grade networks," ACM SIGCOMM Computer Communication Review, vol. 45, no. 4, pp. 15–28, 2015.
[2] S. Knight, H. X. Nguyen, N. Falkner, R. Bowden, and M. Roughan, "The internet topology zoo," IEEE Journal on Selected Areas in Communications, vol. 29, no. 9, pp. 1765–1775, 2011.
[3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, pp. 529–533, 2015.
[4] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel et al., "Mastering chess and shogi by self-play with a general reinforcement learning algorithm," arXiv preprint arXiv:1712.01815, 2017.
[5] N. Feamster and J. Rexford, "Why (and how) networks should run themselves," arXiv preprint arXiv:1710.11583, 2017.
[6] M. Wang, Y. Cui, X. Wang, S. Xiao, and J. Jiang, "Machine learning for networking: Workflow, advances and opportunities," IEEE Network, vol. 32, no. 2, pp. 92–99, 2017.
[7] A. Mestres, A. Rodriguez-Natal, J. Carner, P. Barlet-Ros, E. Alarcón, M. Solé, V. Muntés-Mulero, D. Meyer, S. Barkai, M. J. Hibbett et al., "Knowledge-defined networking," ACM SIGCOMM Computer Communication Review, vol. 47, no. 3, pp. 2–10, 2017.
[8] P. Kalmbach, J. Zerwas, P. Babarczi, A. Blenk, W. Kellerer, and S. Schmid, "Empowering self-driving networks," in Proceedings of the Afternoon Workshop on Self-Driving Networks, 2018, pp. 8–14.
[9] A. Valadarsky, M. Schapira, D. Shahaf, and A. Tamar, "Learning to route," in Proceedings of the ACM Workshop on Hot Topics in Networks (HotNets), 2017, pp. 185–191.
[10] K. Rusek, J. Suárez-Varela, P. Almasan, P. Barlet-Ros, and A. Cabellos-Aparicio, "RouteNet: Leveraging graph neural networks for network modeling and optimization in SDN," IEEE Journal on Selected Areas in Communications, vol. 38, no. 10, pp. 2260–2270, 2020.
[11] X. Chen, J. Guo, Z. Zhu, R. Proietti, A. Castro, and S. J. B. Yoo, "DeepRMSA: A deep-reinforcement-learning routing, modulation and spectrum assignment agent for elastic optical networks," in Proceedings of the Optical Fiber Communications Conference (OFC), 2018.
[12] Z. Xu, J. Tang, J. Meng, W. Zhang, Y. Wang, C. H. Liu, and D. Yang, "Experience-driven networking: A deep reinforcement learning based approach," in IEEE Conference on Computer Communications (INFOCOM), 2018, pp. 1871–1879.
[13] L. Chen, J. Lingys, K. Chen, and F. Liu, "AuTO: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization," in Proceedings of the 2018 Conference of the ACM Special Interest Group on Data Communication, 2018, pp. 191–205.
[14] A. Mestres, E. Alarcón, Y. Ji, and A. Cabellos-Aparicio, "Understanding the modeling of computer network delays using neural networks," in Proceedings of the ACM SIGCOMM Workshop on Big Data Analytics and Machine Learning for Data Communication Networks (Big-DAMA), 2018, pp. 46–52.
[15] J. Suárez-Varela, A. Mestres, J. Yu, L. Kuang, H. Feng, A. Cabellos-Aparicio, and P. Barlet-Ros, "Routing in optical transport networks with deep reinforcement learning," IEEE/OSA Journal of Optical Communications and Networking, vol. 11, no. 11, pp. 547–558, 2019.
[16] P. W. Battaglia, J. B. Hamrick, V. Bapst, A. Sanchez-Gonzalez, V. Zambaldi, M. Malinowski, A. Tacchetti, D. Raposo, A. Santoro, R. Faulkner et al., "Relational inductive biases, deep learning, and graph networks," arXiv preprint arXiv:1806.01261, 2018.
[17] F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, "The graph neural network model," IEEE Transactions on Neural Networks, vol. 20, no. 1, pp. 61–80, 2008.
[18] J. Gilmer, S. S. Schoenholz, P. F. Riley, O. Vinyals, and G. E. Dahl, "Neural message passing for quantum chemistry," in Proceedings of the International Conference on Machine Learning (ICML) - Volume 70, 2017, pp. 1263–1272.
[19] https://github.com/knowledgedefinednetworking/DRL-GNN.
[20] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, "Gated graph sequence neural networks," arXiv preprint arXiv:1511.05493, 2015.
[21] P. Veličković, G. Cucurull, A. Casanova, A. Romero, P. Lio, and Y. Bengio, "Graph attention networks," arXiv preprint arXiv:1710.10903, 2017.
[22] P. W. Battaglia, R. Pascanu, M. Lai, D. J. Rezende et al., "Interaction networks for learning about objects, relations and physics," in Proceedings of Advances in Neural Information Processing Systems (NIPS), 2016, pp. 4502–4510.
[23] C. J. C. H. Watkins and P. Dayan, "Q-learning," Machine Learning, vol. 8, no. 3-4, pp. 279–292, 1992.
[24] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[25] J. Kuri, N. Puech, and M. Gagnaire, "Diverse routing of scheduled lightpath demands in an optical transport network," in Proceedings of the IEEE International Workshop on Design of Reliable Communication Networks (DRCN), 2003, pp. 69–76.
[26] "ITU-T Recommendation G.709/Y.1331: Interface for the optical transport network," 2019, https://www.itu.int/rec/T-REC-G.709/.
[27] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[28] K. Cho, B. Van Merriënboer, D. Bahdanau, and Y. Bengio, "On the properties of neural machine translation: Encoder-decoder approaches," arXiv preprint arXiv:1409.1259, 2014.
[29] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016, pp. 265–283.
[30] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, "OpenAI Gym," arXiv preprint arXiv:1606.01540, 2016.
[31] L. Bottou, "Large-scale machine learning with stochastic gradient descent," in Proceedings of the International Conference on Computational Statistics (COMPSTAT), 2010, pp. 177–186.
[32] X. Hei, J. Zhang, B. Bensaou, and C.-C. Cheung, "Wavelength converter placement in least-load-routing-based optical networks using genetic algorithms," Journal of Optical Networking, vol. 3, no. 5, pp. 363–378, 2004.
[33] F. Barreto, E. C. Wille, and L. Nacamura Jr., "Fast emergency paths schema to overcome transient link failures in OSPF routing," arXiv preprint arXiv:1204.2465, 2012.
[34] P. Francois, C. Filsfils, J. Evans, and O. Bonaventure, "Achieving sub-second IGP convergence in large IP networks," ACM SIGCOMM Computer Communication Review, vol. 35, no. 3, pp. 35–44, 2005.
[35] S. Jain, A. Kumar, S. Mandal, J. Ong, L. Poutievski, A. Singh, S. Venkata, J. Wanderer, J. Zhou, M. Zhu et al., "B4: Experience with a globally-deployed software defined WAN," ACM SIGCOMM Computer Communication Review, vol. 43, no. 4, pp. 3–14, 2013.
[36] A. Hagberg, P. Swart, and D. S. Chult, "Exploring network structure, dynamics, and function using NetworkX," Los Alamos National Lab. (LANL), Los Alamos, NM (United States), Tech. Rep., 2008.
[37] K. Xu, W. Hu, J. Leskovec, and S. Jegelka, "How powerful are graph neural networks?" arXiv preprint arXiv:1810.00826, 2018.
[38] L. Gong, X. Zhou, X. Liu, W. Zhao, W. Lu, and Z. Zhu, "Efficient resource allocation for all-optical multicasting over spectrum-sliced elastic optical networks," IEEE/OSA Journal of Optical Communications and Networking, vol. 5, no. 8, pp. 836–847, 2013.
[39] L. Gong, X. Zhou, W. Lu, and Z. Zhu, "A two-population based evolutionary approach for optimizing routing, modulation and spectrum assignments (RMSA) in O-OFDM networks," IEEE Communications Letters, vol. 16, no. 9, pp. 1520–1523, 2012.
[40] M. Klinkowski, M. Ruiz, L. Velasco, D. Careglio, V. Lopez, and J. Comellas, "Elastic spectrum allocation for time-varying traffic in flexgrid optical networks," IEEE Journal on Selected Areas in Communications, vol. 31, no. 1, pp. 26–38, 2012.
[41] Y. Wang, X. Cao, and Y. Pan, "A study of the routing and spectrum allocation in spectrum-sliced elastic optical path networks," in IEEE International Conference on Computer Communications (INFOCOM), 2011, pp. 1503–1511.
[42] K. Christodoulopoulos, I. Tomkos, and E. A. Varvarigos, "Elastic bandwidth allocation in flexible OFDM-based optical networks," Journal of Lightwave Technology, vol. 29, no. 9, pp. 1354–1366, 2011.
[43] P. Sun, Z. Guo, J. Lan, J. Li, Y. Hu, and T. Baker, "ScaleDRL: A scalable deep reinforcement learning approach for traffic engineering in SDN with pinning control," Computer Networks, vol. 190, p. 107891, 2021.
[44] J. Zhang, M. Ye, Z. Guo, C.-Y. Yen, and H. J. Chao, "CFR-RL: Traffic engineering with reinforcement learning in SDN," IEEE Journal on Selected Areas in Communications, vol. 38, no. 10, pp. 2249–2259, 2020.
[45] S. Troia, F. Sapienza, L. Varé, and G. Maier, "On deep reinforcement learning for traffic engineering in SD-WAN," IEEE Journal on Selected Areas in Communications, vol. 39, no. 7, pp. 2198–2212, 2020.
[46] F. Geyer and G. Carle, "Learning and generating distributed routing protocols using graph-based deep learning," in Proceedings of the ACM SIGCOMM Workshop on Big Data Analytics and Machine Learning for Data Communication Networks (Big-DAMA), 2018, pp. 40–45.
[47] H. Zhu, V. Gupta, S. S. Ahuja, Y. Tian, Y. Zhang, and X. Jin, "Network planning with deep reinforcement learning," in Proceedings of the 2021 ACM SIGCOMM Conference, 2021, pp. 258–271.
[48] G. Bernárdez, J. Suárez-Varela, A. López, B. Wu, S. Xiao, X. Cheng, P. Barlet-Ros, and A. Cabellos-Aparicio, "Is machine learning ready for traffic engineering optimization?" in 2021 IEEE 29th International Conference on Network Protocols (ICNP). IEEE, 2021, pp. 1–11.
[49] H. Mao, M. Schwarzkopf, S. B. Venkatakrishnan, Z. Meng, and M. Alizadeh, "Learning scheduling algorithms for data processing clusters," in Proceedings of ACM SIGCOMM, 2019, pp. 270–288.