(2020 - IEEE JSAC) CFR-RL: Traffic Engineering With Reinforcement Learning in SDN
IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 38, NO. 10, OCTOBER 2020

Abstract: Traditional Traffic Engineering (TE) solutions can achieve the optimal or near-optimal performance by rerouting as many flows as possible. However, they do not usually consider the negative impact, such as out-of-order packets, of frequently rerouting flows in the network. To mitigate the impact of network disturbance, one promising TE solution is forwarding the majority of traffic flows using Equal-Cost Multi-Path (ECMP) and selectively rerouting a few critical flows using Software-Defined Networking (SDN) to balance link utilization of the network. However, critical flow rerouting is not trivial because the solution space for critical flow selection is enormous. Moreover, it is impossible to design a heuristic algorithm for this problem based on fixed and simple rules, since rule-based heuristics are unable to adapt to the changes of the traffic matrix and network dynamics. In this paper, we propose CFR-RL (Critical Flow Rerouting-Reinforcement Learning), a Reinforcement Learning-based scheme that learns a policy to select critical flows for each given traffic matrix automatically. CFR-RL then reroutes these selected critical flows to balance link utilization of the network by formulating and solving a simple Linear Programming (LP) problem. Extensive evaluations show that CFR-RL achieves near-optimal performance by rerouting only 10%-21.3% of total traffic.

Index Terms: Reinforcement learning, software-defined networking, traffic engineering, load balancing, network disturbance mitigation.

Manuscript received October 1, 2019; revised February 15, 2020; accepted March 31, 2020. Date of publication June 5, 2020; date of current version September 16, 2020. The work of Zehua Guo was supported in part by the National Key Research and Development Program of China under Grant 2018YFB1003700 and in part by the Beijing Institute of Technology Research Fund Program for Young Scholars. (Corresponding author: Zehua Guo.)
Junjie Zhang is with Fortinet Inc., Sunnyvale, CA 94086 USA (e-mail: [email protected]).
Minghao Ye, Chen-Yu Yen, and H. Jonathan Chao are with the Department of Electrical and Computer Engineering, New York University, New York City, NY 11201 USA (e-mail: [email protected]; [email protected]; [email protected]).
Zehua Guo is with the Beijing Institute of Technology, Beijing 100081, China (e-mail: [email protected]).
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSAC.2020.3000371
0733-8716 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: CHONNAM NATIONAL UNIVERSITY. Downloaded on February 10,2021 at 03:03:28 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

The emerging Software-Defined Networking (SDN) provides new opportunities to improve network performance [1]. In SDN, the control plane can generate routing policies based on its global view of the network and deploy these policies in the network by installing and updating flow entries at the SDN switches.

Traffic Engineering (TE) is one of the important network features for SDN [2]–[4], and is usually implemented in the control plane of SDN. The goal of TE is to help Internet Service Providers (ISPs) optimize network performance and resource utilization by configuring the routing across their backbone networks to control traffic distribution [5], [6]. Due to dynamic load fluctuation among the nodes, traditional TE [7]–[12] reroutes many flows periodically to balance the load on each link and thereby minimize network congestion probability, where a flow is defined as a source-destination pair. One usually formulates the flow routing problem with a particular performance metric as a specific objective function for optimization. For a given traffic matrix, one often wants to route all the flows in such a way that the maximum link utilization in the network is minimized.

Although traditional TE solutions can achieve the optimal or near-optimal performance by rerouting as many flows as possible, they do not consider the negative impact, such as out-of-order packets, of rerouting the flows in the network. To reach the optimal performance, TE solutions might reroute many traffic flows to just slightly reduce the link utilization on the most congested link, leading to significant network disturbance and service disruption. For example, a flow between two nodes in a backbone network is an aggregate of many micro-flows (e.g., five-tuple-based TCP flows) of different applications. Changing the path of a flow could temporarily affect the normal operation of many TCP flows. Packet loss or reordering may cause duplicated ACK transmissions, triggering the sender to react and reduce its congestion window size and hence decrease its sending rate, eventually increasing the flow's completion time and degrading the flow's Quality of Service (QoS). In addition, rerouting all flows in the network could incur a high burden on the SDN controller to calculate and deploy new flow paths [4]. Because rerouting flows to reduce congestion in backbone networks could adversely affect the quality of users' experience, network operators have no desire to deploy these traditional TE solutions in their networks unless reducing network disturbance is taken into consideration in designing the TE solutions.

To mitigate the impact of network disturbance, one promising TE solution is forwarding the majority of traffic flows using Equal-Cost Multi-Path (ECMP) and selectively rerouting a few critical flows using SDN to balance link utilization of the network, where a critical flow is defined as a flow with a dominant impact on network performance (e.g., a flow on the most congested link) [4], [13]. Existing works show that critical flows exist in a given traffic matrix [4]. ECMP reduces the congestion probability by equally splitting traffic on equal-cost paths, while critical flow rerouting aims to achieve further performance improvement with low network disturbance.

The critical flow rerouting problem can be decoupled into two sub-problems: (1) identifying critical flows and (2) rerouting them to achieve good performance. Although sub-problem (2) is relatively easy to solve by formulating it as a Linear Programming (LP) optimization problem, solving sub-problem (1) is not trivial because the solution space is huge. For example, if we want to find 10 critical flows among 100 flows, the solution space has $C_{100}^{10} \approx 17$ trillion combinations. Considering that the traffic matrix varies on the order of minutes, an efficient solution should be able to quickly and effectively identify the critical flows for each traffic matrix. Unfortunately, it is impossible to design a heuristic algorithm for the above algorithmically-hard problem based on fixed and simple rules. This is because rule-based heuristics are unable to adapt to the changes of the traffic matrix and network dynamics and thus unable to guarantee their performance when their design assumptions are violated, as later shown in Section VI-B.

In this paper, we propose CFR-RL (Critical Flow Rerouting-Reinforcement Learning), a Reinforcement Learning-based scheme that performs critical flow selection followed by rerouting with linear programming. CFR-RL learns a policy to select critical flows purely through observations, without any domain-specific rule-based heuristic. It starts from scratch without any prior knowledge, and gradually learns to make better selections through reinforcement, in the form of reward signals that reflect network performance for past selections. By continuing to observe the actual performance of past selections, CFR-RL optimizes its selection policy for various traffic matrices over time. Once training is done, CFR-RL efficiently and effectively selects a small set of critical flows for each given traffic matrix, and reroutes them to balance link utilization of the network by formulating and solving a simple linear programming optimization problem.

The main contributions of this paper are summarized as follows:
1) We consider the impact of flow rerouting on network disturbance in our TE design and propose an effective scheme that not only minimizes the maximum link utilization but also reroutes only a small number of flows to reduce network disturbance.
2) We customize an RL approach to learn the critical flow selection policy, and utilize LP as a reward function to generate reward signals. This RL+LP combined approach turns out to be surprisingly powerful.
3) We evaluate and compare CFR-RL with other rule-based heuristic schemes by conducting extensive experiments on different topologies with both real and synthesized traffic. CFR-RL not only outperforms rule-based heuristic schemes by up to 12.2%, but also reroutes 11.4%-14.7% less traffic on average. Overall, CFR-RL is able to achieve near-optimal performance by rerouting only 10%-21.3% of total traffic. In addition, the evaluation results show that CFR-RL is able to generalize to unseen traffic matrices.

The remainder of this paper is organized as follows. Section II describes the related works. Section III presents the system design. Section IV discusses how to train the critical flow selection policy using an RL-based approach. Section V describes how to reroute the critical flows. Section VI evaluates the effectiveness of our scheme. Section VII concludes the paper and discusses future work.

II. RELATED WORKS

A. Traditional TE Solutions

In Multiprotocol Label Switching (MPLS) networks, a routing problem has been formulated as an optimization problem where explicit routes are obtained for each source-destination pair to distribute traffic flows [7], [8]. Using Open Shortest Path First (OSPF) and ECMP protocols, [9]–[11] attempt to balance link utilization as evenly as possible by carefully tuning the link costs to adjust path selection in ECMP. OSPF-OMP (OMP, Optimized Multipath) [14], a variation of OSPF, attempts to dynamically determine the optimal allocation of traffic among multiple equal-cost paths based on the exchange of special traffic-load control messages. Weighted ECMP [12] extends ECMP to allow weighted traffic splitting at each node and achieves significant performance improvement over ECMP. Two-phase routing optimizes routing performance by selecting a set of intermediate nodes and tuning the traffic split ratios to the nodes [15], [16]. In the first phase, each source sends traffic to the intermediate nodes based on predetermined split ratios, and in the second phase, the intermediate nodes then deliver the traffic to the final destinations. This approach requires IP tunnels, optical-layer circuits, or label switched paths in each phase.

B. SDN-Based TE Solutions

Thanks to the flexible routing policy from the emerging SDN, dynamic hybrid routing [4] achieves load balancing for a wide range of traffic scenarios by dynamically rebalancing traffic to react to traffic fluctuations with a preconfigured routing policy. Agarwal et al. [2] consider a network with partially deployed SDN switches. They improve network utilization and reduce packet loss by strategically placing the controller and SDN switches. Guo et al. [3] propose a novel algorithm named SOTE to minimize the maximum link utilization in an SDN/OSPF hybrid network.

C. Machine Learning-Based TE Solutions

Machine learning has been used to improve the performance of backbone networks and data center networks. For backbone networks, Geyer and Carle [17] design an automatic network protocol using semi-supervised deep learning. Sun et al. [18] selectively control a set of nodes and use an RL-based policy to dynamically change the routing decision of flows traversing the selected nodes. To minimize signaling delay in large SDNs, Lin et al. [19] employ a distributed three-level control plane architecture coupled with an RL-based solution named QoS-aware Adaptive Routing. Xu et al. [20] use RL to optimize the throughput and delay in TE. AuTO [21] is
TABLE II
COMPARISON OF AVERAGE REROUTING DISTURBANCE

Fig. 5. Comparison of load balancing performance in the four networks on each test traffic matrix.

Fig. 7. Comparison of end-to-end delay performance in the four networks on each test traffic matrix.
As shown in Figs. 5(b)-5(d), Top-K critical performs well with the exponential traffic model. However, its performance degrades with the uniform traffic model. One possible reason for this degradation is that all links in the network are relatively saturated under the uniform traffic model, so alternative underutilized paths are not available for the critical flows selected by Top-K critical. In other words, there is not much room for rerouting performance improvement by only considering the elephant flows traversing the most congested links. Thus, fixed-rule heuristics are unable to guarantee their performance, since their design assumptions do not always hold. In contrast, CFR-RL performs consistently well under various traffic models.

3) Generalization: In this series of experiments, we trained CFR-RL on the traffic matrices from the first week (starting from Mar. 1st 2004) and evaluated it for each day of the following week (starting from Mar. 8th 2004) for the Abilene network. We only present the results for day 2, day 3, day 5 and day 6, since the results for other days are similar. Figures 8 and 9 show the full CDFs of two types of performance ratio for these 4 days. Figures 10 and 11 show the load balancing and end-to-end delay performance ratios on each traffic matrix of these 4 days, respectively. The results show that CFR-RL still achieves above 95% of optimal load balancing performance and an average of 88.13% in end-to-end delay performance, and thus outperforms other schemes on almost all traffic matrices. The load balancing performance of CFR-RL degrades on several outlier traffic matrices in day 2. There are two possible reasons for the degradation: (1) The traffic patterns of these traffic matrices are different from what CFR-RL learned from the previous week. (2) Selecting $K = 10\% \times N \times (N - 1)$ critical flows is not enough for CFR-RL to achieve near-optimal performance on these outlier traffic matrices. However, CFR-RL still performs better than other schemes. Overall, the results indicate that real traffic patterns are relatively stable and that CFR-RL generalizes well to unseen traffic matrices on which it was not explicitly trained.

Fig. 11. Comparison of end-to-end delay performance ratio with the traffic matrices from Tuesday, Wednesday, Friday and Saturday in week 2.

4) Training and Inference Time: Training a policy for the Abilene network took approximately 10,000 iterations, and each iteration took approximately 1 second. As a result, the total training time for the Abilene network is approximately 3 hours. Since the EBONE network is relatively larger, it took approximately 60,000 iterations to train a policy, for a total training time of approximately 16 hours. For larger networks like Sprintlink and Tiscali, the solution space is even larger. Thus, more iterations (e.g., approximately 90,000 and 100,000 iterations)
should be taken to train a good policy, and each iteration takes approximately 2 seconds. Note that this cost is incurred offline, and training can be repeated infrequently depending on environment stability. The policy neural network as described in Section VI-A.1 is relatively small. Thus, the inference time for the Abilene and EBONE networks is less than 1 second, and it is less than 2 seconds for the Sprintlink and Tiscali networks.

TABLE III
COMPARISON OF AVERAGE LOAD BALANCING PERFORMANCE RATIO WITH DIFFERENT SETS OF HYPERPARAMETERS

5) Hyperparameters: Table III shows how hyperparameters affect the load balancing performance of CFR-RL in the Abilene network. For each set of hyperparameters, we trained a policy for the Abilene network for 10,000 iterations, and then evaluated the average load balancing performance ratio over the whole test set. We only present the results for the Abilene network, since the results for other network topologies are similar. In Tab. III(a), the number of filters in the convolutional layer and neurons in the fully connected layer is fixed to 128 and the entropy factor β is fixed to 0.1. We compare the performance with different learning rates α. The results show that training might become unstable if the initial learning rate is too large (e.g., 0.01), and thus it cannot converge to a good policy. In contrast, training with a smaller learning rate is more stable but might require longer training time to further improve the performance. As a result, we chose α = 0.001 to encourage exploration in the early stage of training. We compared the performance with different numbers of filters and neurons in Tab. III(b). The results show that too few filters/neurons might restrict the representation that the neural network can learn and thus cause under-fitting. Meanwhile, too many neurons might cause over-fitting, and thus the corresponding policy cannot generalize well to the test set. In addition, more training time is required for a larger neural network. In Tab. III(c), the results show that a larger entropy factor encourages exploration and leads to better performance. Overall, the set of hyperparameters we have chosen is a good trade-off between performance and computational complexity of the model.

VII. CONCLUSION AND FUTURE WORK

With the objective of minimizing the maximum link utilization in a network and reducing the network disturbance that causes service disruption, we proposed CFR-RL, a scheme that learns a critical flow selection policy automatically using reinforcement learning, without any domain-specific rule-based heuristic. CFR-RL selects critical flows for each given traffic matrix and reroutes them to balance link utilization of the network by solving a simple rerouting optimization problem. Extensive evaluations show that CFR-RL achieves near-optimal performance by rerouting only a limited portion of total traffic. In addition, CFR-RL generalizes well to traffic matrices on which it was not explicitly trained.

Several aspects may help improve the proposed solution. Below, we discuss how CFR-RL can be updated and improved.

A. Objectives

CFR-RL could be formulated to achieve other objectives. For example, to minimize the overall end-to-end delay in the network (i.e., $\Omega = \sum_{i,j \in E} \frac{l_{i,j}}{c_{i,j} - l_{i,j}}$) described in Section VI-A.4(2), we can define reward r as 1/Ω and reformulate the rerouting optimization problem (4a) to minimize Ω.

Table II shows an interesting finding. Although CFR-RL does not explicitly minimize rerouting traffic, it ends up rerouting much less traffic (i.e., 10.0%-21.3%) and performs better than rule-based heuristic schemes by 1.3%-12.2%. This reveals that CFR-RL effectively searches the whole set of candidate flows to find the best critical flows for various traffic matrices, rather than simply considering the elephant flows on the most congested links or in the whole network as rule-based heuristic schemes do. We will consider minimizing rerouting traffic as one of our objectives and investigate the trade-off between maximizing performance and minimizing rerouting traffic.

B. Scalability

Scaling CFR-RL to larger networks is an important direction of our future work. CFR-RL relies on LP to produce reward signals r. The LP problem would become complex as the number of critical flows K and the size of a network increase. This would slow down the policy training for larger networks (e.g., the Tiscali network in Section VI-B.4), since the time consumed for each iteration would increase. Moreover, the solution space would become enormous for larger networks, and RL has to take more iterations to converge to a good policy. To further speed up training, we can either spawn even more actor agents (e.g., 30) in parallel to allow the system to consume more data at each time step and thus improve exploration [28], or apply GA3C [36] to offload the training to a GPU. GA3C is an alternative architecture of A3C that emphasizes efficient GPU utilization to increase the amount of training data generated and processed per second. Another possible design to mitigate the scalability issue is adopting SDN multi-controller architectures. Each controller takes care of a subset of routers in a large network, and one CFR-RL agent runs on each SDN controller. The corresponding problem naturally falls into the realm of Multi-Agent Reinforcement Learning. We will evaluate whether a multi-SDN controller architecture can help provide additional improvement in our approach.
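The delay-oriented objective mentioned under Objectives can be illustrated with a small sketch. It assumes that l_{i,j} is the traffic load on link (i, j) and c_{i,j} is that link's capacity, which this section does not restate, and it computes Ω as the sum of l/(c − l) over all links with reward r = 1/Ω. The function names, the dictionary-based link representation, and the handling of saturated or idle networks are hypothetical choices made for illustration, not the authors' implementation.

```python
import math
from typing import Dict, Tuple

# Hypothetical link key: (source node, destination node).
Link = Tuple[str, str]


def delay_objective(load: Dict[Link, float], capacity: Dict[Link, float]) -> float:
    """Return Omega = sum over links of l_ij / (c_ij - l_ij).

    Assumes l_ij is the load on a link and c_ij its capacity.
    """
    omega = 0.0
    for link, l_ij in load.items():
        c_ij = capacity[link]
        if l_ij >= c_ij:
            # The expression is undefined (or negative) on a saturated link.
            raise ValueError(f"link {link} is saturated: load {l_ij} >= capacity {c_ij}")
        omega += l_ij / (c_ij - l_ij)
    return omega


def delay_reward(load: Dict[Link, float], capacity: Dict[Link, float]) -> float:
    """Return the reward r = 1 / Omega; an idle network maps to +inf (an assumption)."""
    omega = delay_objective(load, capacity)
    return math.inf if omega == 0.0 else 1.0 / omega


if __name__ == "__main__":
    capacity = {("a", "b"): 10.0, ("b", "c"): 10.0}
    before = {("a", "b"): 8.0, ("b", "c"): 2.0}  # one hot link
    after = {("a", "b"): 5.0, ("b", "c"): 5.0}   # same total load, balanced
    print(delay_objective(before, capacity), delay_reward(before, capacity))
    print(delay_objective(after, capacity), delay_reward(after, capacity))
```

In this toy case, balancing the same total load across the two links lowers Ω from 4.25 to 2.0, so the reward 1/Ω increases. This is the kind of signal that would favor selections relieving heavily loaded links.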
Minghao Ye received the B.E. degree in microelectronic science and engineering from Sun Yat-sen University, Guangzhou, China, and the second B.E. degree (Hons.) in electronic engineering from The Hong Kong Polytechnic University, Hong Kong, in 2017, the M.S. degree in electrical engineering from New York University, New York City, NY, USA, in 2019, where he is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering. His research interests include traffic engineering, software-defined networks, mobile edge computing, and reinforcement learning.

Zehua Guo (Senior Member, IEEE) received the B.S. degree from Northwestern Polytechnical University, the M.S. degree from Xidian University, and the Ph.D. degree from Northwestern Polytechnical University. He was a Research Fellow with the Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, a Post-Doctoral Research Associate with the Department of Computer Science and Engineering, University of Minnesota Twin Cities, and a Visiting Associate Professor with the Singapore University of Technology and Design. He is currently an Associate Professor at the Beijing Institute of Technology. His research interests include software-defined networking, network function virtualization, data center networks, cloud computing, content delivery networks, network security, machine learning, and Internet exchange. He was the Session Chair of the IEEE International Conference on Communications 2018 and the Technical Program Committee Member of Computer Communications (Elsevier). He is an Associate Editor of the IEEE ACCESS and EURASIP Journal on Wireless Communications and Networking (Springer), and an Editor of KSII Transactions on Internet and Information Systems.

Chen-Yu Yen received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2014, and the M.S. degree in electrical engineering from Columbia University in 2018. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, New York University, New York City, NY, USA. His research interests include reinforcement learning, congestion control, and practical machine learning for networking.

H. Jonathan Chao (Fellow, IEEE) received the B.S. and M.S. degrees in electrical engineering from National Chiao Tung University, Taiwan, in 1977 and 1980, respectively, and the Ph.D. degree in electrical engineering from The Ohio State University, Columbus, OH, USA, in 1985. He was the Head of the Electrical and Computer Engineering (ECE) Department, New York University (NYU), New York City, NY, USA, from 2004 to 2014. From 2000 to 2001, he was the Co-Founder and the CTO of Coree Networks, Tinton Falls, NJ, USA. From 1985 to 1992, he was a member of Technical Staff at Bellcore, Piscataway, NJ, USA, where he was involved in transport and switching system architecture designs and application-specified integrated circuit implementations, such as the world's first SONET-like framer chip, ATM layer chip, sequencer chip (the first chip handling packet scheduling), and ATM switch chip. He is currently a Professor of ECE at NYU, New York City. He is also the Director of the High-Speed Networking Laboratory. He has been doing research in the areas of software-defined networking, network function virtualization, datacenter networks, high-speed packet processing/switching/routing, network security, quality-of-service control, network on chip, and machine learning for networking. He has coauthored three networking books, Broadband Packet Switching Technologies: A Practical Guide to ATM Switches and IP Routers (Wiley, 2001), Quality of Service Control in High-Speed Networks (Wiley, 2001), and High Performance Switches and Routers (Wiley, 2007). He has published more than 260 journal and conference papers and holds 63 patents. He is a Fellow of the National Academy of Inventors. He was a recipient of the Bellcore Excellence Award in 1987. He was a co-recipient of the 2001 Best Paper Award from the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY.