
IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, VOL. 38, NO. 10, OCTOBER 2020

CFR-RL: Traffic Engineering With Reinforcement Learning in SDN

Junjie Zhang, Member, IEEE, Minghao Ye, Zehua Guo, Senior Member, IEEE, Chen-Yu Yen, and H. Jonathan Chao, Fellow, IEEE

Abstract—Traditional Traffic Engineering (TE) solutions can achieve the optimal or near-optimal performance by rerouting as many flows as possible. However, they do not usually consider the negative impact, such as packet out of order, of frequently rerouting flows in the network. To mitigate the impact of network disturbance, one promising TE solution is forwarding the majority of traffic flows using Equal-Cost Multi-Path (ECMP) and selectively rerouting a few critical flows using Software-Defined Networking (SDN) to balance link utilization of the network. However, critical flow rerouting is not trivial because the solution space for critical flow selection is enormous. Moreover, it is impossible to design a heuristic algorithm for this problem based on fixed and simple rules, since rule-based heuristics are unable to adapt to changes in the traffic matrix and network dynamics. In this paper, we propose CFR-RL (Critical Flow Rerouting-Reinforcement Learning), a Reinforcement Learning-based scheme that learns a policy to select critical flows for each given traffic matrix automatically. CFR-RL then reroutes these selected critical flows to balance link utilization of the network by formulating and solving a simple Linear Programming (LP) problem. Extensive evaluations show that CFR-RL achieves near-optimal performance by rerouting only 10%-21.3% of total traffic.

Index Terms—Reinforcement learning, software-defined networking, traffic engineering, load balancing, network disturbance mitigation.

Manuscript received October 1, 2019; revised February 15, 2020; accepted March 31, 2020. Date of publication June 5, 2020; date of current version September 16, 2020. The work of Zehua Guo was supported in part by the National Key Research and Development Program of China under Grant 2018YFB1003700 and in part by the Beijing Institute of Technology Research Fund Program for Young Scholars. (Corresponding author: Zehua Guo.)
Junjie Zhang is with Fortinet Inc., Sunnyvale, CA 94086 USA.
Minghao Ye, Chen-Yu Yen, and H. Jonathan Chao are with the Department of Electrical and Computer Engineering, New York University, New York City, NY 11201 USA.
Zehua Guo is with the Beijing Institute of Technology, Beijing 100081, China.
Color versions of one or more of the figures in this article are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSAC.2020.3000371

I. INTRODUCTION

THE emerging Software-Defined Networking (SDN) provides new opportunities to improve network performance [1]. In SDN, the control plane can generate routing policies based on its global view of the network and deploy these policies in the network by installing and updating flow entries at the SDN switches.

Traffic Engineering (TE) is one of the important network features for SDN [2]–[4], and is usually implemented in the control plane of SDN. The goal of TE is to help Internet Service Providers (ISPs) optimize network performance and resource utilization by configuring the routing across their backbone networks to control traffic distribution [5], [6]. Due to dynamic load fluctuation among the nodes, traditional TE [7]–[12] reroutes many flows periodically to balance the load on each link and minimize network congestion probability, where a flow is defined as a source-destination pair. One usually formulates the flow routing problem with a particular performance metric as a specific objective function for optimization. For a given traffic matrix, one often wants to route all the flows in such a way that the maximum link utilization in the network is minimized.

Although traditional TE solutions can achieve the optimal or near-optimal performance by rerouting as many flows as possible, they do not consider the negative impact, such as packet out of order, of rerouting the flows in the network. To reach the optimal performance, TE solutions might reroute many traffic flows just to slightly reduce the link utilization on the most congested link, leading to significant network disturbance and service disruption. For example, a flow between two nodes in a backbone network is an aggregate of many micro-flows (e.g., five-tuple-based TCP flows) of different applications. Changing the path of a flow could temporarily affect many TCP flows' normal operation. Packet loss or reordering may cause duplicated ACK transmissions, triggering the sender to react and reduce its congestion window size and hence decrease its sending rate, eventually increasing the flow's completion time and degrading the flow's Quality of Service (QoS). In addition, rerouting all flows in the network could incur a high burden on the SDN controller to calculate and deploy new flow paths [4]. Because rerouting flows to reduce congestion in backbone networks could adversely affect the quality of users' experience, network operators have no desire to deploy these traditional TE solutions in their networks unless reducing network disturbance is taken into consideration in designing the TE solutions.

To mitigate the impact of network disturbance, one promising TE solution is forwarding the majority of traffic flows using Equal-Cost Multi-Path (ECMP) and selectively rerouting a few critical flows using SDN to balance link utilization of the network, where a critical flow is defined as a flow with a dominant impact on network performance (e.g., a flow on the most congested link) [4], [13]. Existing works show that critical flows exist in a given traffic matrix [4].


ECMP reduces the congestion probability by equally splitting traffic over equal-cost paths, while critical flow rerouting aims to achieve further performance improvement with low network disturbance.

The critical flow rerouting problem can be decoupled into two sub-problems: (1) identifying critical flows and (2) rerouting them to achieve good performance. Although sub-problem (2) is relatively easy to solve by formulating it as a Linear Programming (LP) optimization problem, solving sub-problem (1) is not trivial because the solution space is huge. For example, if we want to find 10 critical flows among 100 flows, the solution space has $\binom{100}{10} \approx 1.7 \times 10^{13}$ (about 17 trillion) combinations. Considering that the traffic matrix varies on the level of minutes, an efficient solution should be able to quickly and effectively identify the critical flows for each traffic matrix. Unfortunately, it is impossible to design a heuristic algorithm for the above algorithmically-hard problem based on fixed and simple rules. This is because rule-based heuristics are unable to adapt to changes in the traffic matrix and network dynamics, and are thus unable to guarantee their performance when their design assumptions are violated, as later shown in Section VI-B.
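As a quick sanity check on that count (an illustrative aside of ours, not part of the original paper), the binomial coefficient can be computed directly:

```python
import math

# Number of ways to choose 10 critical flows out of 100 candidate flows.
n_combinations = math.comb(100, 10)
print(f"{n_combinations:,}")  # 17,310,309,456,440 -- about 17 trillion
```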
In this paper, we propose CFR-RL (Critical Flow Rerouting-Reinforcement Learning), a Reinforcement Learning-based scheme that performs critical flow selection followed by rerouting with linear programming. CFR-RL learns a policy to select critical flows purely through observations, without any domain-specific rule-based heuristic. It starts from scratch without any prior knowledge, and gradually learns to make better selections through reinforcement, in the form of reward signals that reflect network performance for past selections. By continuing to observe the actual performance of past selections, CFR-RL optimizes its selection policy for various traffic matrices over time. Once training is done, CFR-RL efficiently and effectively selects a small set of critical flows for each given traffic matrix, and reroutes them to balance link utilization of the network by formulating and solving a simple linear programming optimization problem.

The main contributions of this paper are summarized as follows:

1) We consider the impact of flow rerouting on network disturbance in our TE design and propose an effective scheme that not only minimizes the maximum link utilization but also reroutes only a small number of flows to reduce network disturbance.

2) We customize a RL approach to learn the critical flow selection policy, and utilize LP as a reward function to generate reward signals. This RL+LP combined approach turns out to be surprisingly powerful.

3) We evaluate and compare CFR-RL with other rule-based heuristic schemes by conducting extensive experiments on different topologies with both real and synthesized traffic. CFR-RL not only outperforms rule-based heuristic schemes by up to 12.2%, but also reroutes 11.4%-14.7% less traffic on average. Overall, CFR-RL is able to achieve near-optimal performance by rerouting only 10%-21.3% of total traffic. In addition, the evaluation results show that CFR-RL is able to generalize to unseen traffic matrices.

The remainder of this paper is organized as follows. Section II describes the related works. Section III presents the system design. Section IV discusses how to train the critical flow selection policy using a RL-based approach. Section V describes how to reroute the critical flows. Section VI evaluates the effectiveness of our scheme. Section VII concludes the paper and discusses future work.

II. RELATED WORKS

A. Traditional TE Solutions

In Multiprotocol Label Switching (MPLS) networks, the routing problem has been formulated as an optimization problem where explicit routes are obtained for each source-destination pair to distribute traffic flows [7], [8]. Using Open Shortest Path First (OSPF) and ECMP protocols, [9]–[11] attempt to balance link utilization as evenly as possible by carefully tuning the link costs to adjust path selection in ECMP. OSPF-OMP (Optimized Multipath) [14], a variation of OSPF, attempts to dynamically determine the optimal allocation of traffic among multiple equal-cost paths based on the exchange of special traffic-load control messages. Weighted ECMP [12] extends ECMP to allow weighted traffic splitting at each node and achieves significant performance improvement over ECMP. Two-phase routing optimizes routing performance by selecting a set of intermediate nodes and tuning the traffic split ratios to the nodes [15], [16]. In the first phase, each source sends traffic to the intermediate nodes based on predetermined split ratios, and in the second phase, the intermediate nodes then deliver the traffic to the final destinations. This approach requires IP tunnels, optical-layer circuits, or label switched paths in each phase.

B. SDN-Based TE Solutions

Thanks to the flexible routing policies enabled by the emerging SDN, dynamic hybrid routing [4] achieves load balancing for a wide range of traffic scenarios by dynamically rebalancing traffic to react to traffic fluctuations with a preconfigured routing policy. Agarwal et al. [2] consider a network with partially deployed SDN switches. They improve network utilization and reduce packet loss by strategically placing the controller and SDN switches. Guo et al. [3] propose a novel algorithm named SOTE to minimize the maximum link utilization in an SDN/OSPF hybrid network.

C. Machine Learning-Based TE Solutions

Machine learning has been used to improve the performance of backbone networks and data center networks. For backbone networks, Geyer and Carle [17] design an automatic network protocol using semi-supervised deep learning. Sun et al. [18] selectively control a set of nodes and use a RL-based policy to dynamically change the routing decision of flows traversing the selected nodes. To minimize signaling delay in large SDNs, Lin et al. [19] employ a distributed three-level control plane architecture coupled with a RL-based solution named QoS-aware Adaptive Routing. Xu et al. [20] use RL to optimize the throughput and delay in TE. AuTO [21] is developed to optimize routing traffic in data center networks with a two-layer RL. One layer is called the Peripheral System, for deploying hosts and routing small flows, and the other is called the Central System, for collecting global traffic information and routing large flows.


However, none of the above works consider mitigating the impact of network disturbance and service disruption caused by rerouting.

III. SYSTEM DESIGN

In this section, we describe the design of CFR-RL, a RL-based scheme that learns a critical flow selection policy and reroutes the corresponding critical flows to balance link utilization of the network.

We train CFR-RL to learn a selection policy over a rich variety of historical traffic matrices, where traffic matrices can be measured by SDN switches and collected by an SDN central controller periodically [22]. CFR-RL represents the selection policy as a neural network that maps a "raw" observation (e.g., a traffic matrix) to a combination of critical flows. The neural network provides a scalable and expressive way to incorporate various traffic matrices into the selection policy. CFR-RL trains this neural network based on the REINFORCE algorithm [23] with some customizations, as detailed in Section IV. Once training is done, CFR-RL applies the critical flow selection policy to each real-time traffic matrix provided by the SDN controller periodically, where a small number of critical flows (e.g., K) are selected. The evaluation results in Section VI-B.1 show that selecting 10% of total flows as critical flows (roughly 11%-21% of total traffic) is sufficient for CFR-RL to achieve near-optimal performance, while network disturbance (i.e., the percentage of total rerouted traffic) is reduced by at least 78.7% compared to rerouting all flows by traditional TE. Then the SDN controller reroutes the selected critical flows by installing and updating corresponding flow entries at the switches using the flow rerouting optimization method described in Section V. The remaining flows continue to be routed by the default ECMP routing. Note that the flow entries at the switches for the critical flows selected in the previous period will time out, and those flows would be routed by either the default ECMP routing or newly installed flow entries in the current period. Figure 1 shows an illustrative example: CFR-RL reroutes the flow from S0 to S4 to balance link load by installing forwarding entries at the corresponding switches along the SDN path.

[Fig. 1. An illustrative example of the CFR-RL rerouting procedure. Each link capacity is equal to 1. Best viewed in color.]

There are two reasons we do not adopt RL for the flow rerouting problem itself. Firstly, since the set of critical flows is small, LP is an efficient and optimal method to solve the rerouting problem. Secondly, a routing solution consists of a split ratio (i.e., a traffic demand percentage) for each flow on each link. Given a network with E links, there would be a total of E ∗ K split ratios in the routing solution, where K is the number of critical flows. Since split ratios are continuous numbers, we would have to adopt RL methods for continuous action domains [24], [25]. However, due to the high-dimensional, continuous action spaces, it has been shown that this type of RL method leads to slow and ineffective learning when the number of output parameters (i.e., E ∗ K) is large [20], [26].

IV. LEARNING A CRITICAL FLOW SELECTION POLICY

In this section, we describe how to learn a critical flow selection policy using a customized RL approach.

A. Reinforcement Learning Formulation

1) Input / State Space: An agent takes a state $s_t = TM_t$ as an input, where $TM_t$ is the traffic matrix at time step t, which contains the traffic demand of each flow. Typically, the network topology remains unchanged; thus, we do not include the topology information as a part of the input. The results in Section VI-B show that CFR-RL is able to learn a good policy π without prior knowledge of the network. It is worth noting that including additional information like link states as a part of the input might be beneficial for training the critical flow selection policy. We will investigate this in our future work.

2) Action Space: For each state $s_t$, CFR-RL selects K critical flows. Given that there are a total of N ∗ (N − 1) flows in a network with N nodes, this RL problem would require a large action space of size $\binom{N(N-1)}{K}$. Inspired by [27], we instead define the action space as {0, 1, . . . , N ∗ (N − 1) − 1} and allow the agent to sample K different actions in each time step t (i.e., $a_t^1, a_t^2, \ldots, a_t^K$).

3) Reward: After sampling K different critical flows (i.e., $f_K$) for a given state $s_t$, CFR-RL reroutes these critical flows and obtains the maximum link utilization U by solving the rerouting optimization problem (4a) (described in the following section). Reward r is defined as 1/U, which reflects the network performance after rerouting the critical flows to balance link utilization. The smaller U (i.e., the greater the reward r), the better the performance. In other words, CFR-RL adopts LP as a reward function to produce reward signals r for RL.
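To make the formulation concrete, here is a small sketch (ours, not the authors' code) of one interaction step: sample K distinct flows from the policy's output distribution and convert the resulting maximum link utilization into the reward. `solve_rerouting_lp` is a hypothetical stand-in for the LP of Section V.

```python
import numpy as np

def select_and_reward(policy_probs, traffic_matrix, K, solve_rerouting_lp):
    """Sample K distinct critical flows and compute the RL reward.

    policy_probs: length N*(N-1) probability vector from the policy network.
    solve_rerouting_lp: hypothetical solver returning the maximum link
    utilization U after rerouting the chosen flows (Section V).
    """
    n_flows = policy_probs.shape[0]
    # Sampling without replacement yields K distinct flow indices,
    # matching footnote 1 of the paper.
    critical_flows = np.random.choice(n_flows, size=K, replace=False,
                                      p=policy_probs)
    U = solve_rerouting_lp(traffic_matrix, critical_flows)
    reward = 1.0 / U  # smaller max utilization -> larger reward
    return critical_flows, reward
```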


B. Training Algorithm

The critical flow selection policy is represented by a neural network. This policy network takes a state $s_t = TM_t$ as an input, as described above, and outputs a probability distribution $\pi(a_t \mid s_t)$ over all available actions. Figure 2 shows the architecture of the policy network (details in Section VI-A.1).

[Fig. 2. Policy network architecture.]

Since K different actions are sampled for each state $s_t$ and their order does not matter, we define a solution $a_t^K = (a_t^1, a_t^2, \ldots, a_t^K)$ as a combination of K sampled actions. For a solution $a_t^K$ selected in a given state $s_t$, a stochastic policy $\pi(a_t^K \mid s_t)$ parameterized by θ can be approximated as follows¹:

$$\pi(a_t^K \mid s_t; \theta) \approx \prod_{i=1}^{K} \pi(a_t^i \mid s_t; \theta). \qquad (1)$$

¹To select K distinct actions, we do the action sampling without replacement. The right side of Eq. (1) is the solution probability when sampling with replacement; we use Eq. (1) to approximate the probability of the solution $a_t^K$ given a state $s_t$ for simplicity.

The goal of training is to maximize the network performance over various traffic matrices, i.e., to maximize the expected reward $E[r_t]$. Thus, we optimize $E[r_t]$ by gradient ascent, using the REINFORCE algorithm with a baseline $b(s_t)$. The policy parameter θ is updated according to the following equation:

$$\theta \leftarrow \theta + \alpha \sum_{t} \nabla_\theta \log \pi(a_t^K \mid s_t; \theta)\,(r_t - b(s_t)), \qquad (2)$$

where α is the learning rate for the policy network. A good baseline $b(s_t)$ reduces gradient variance and thus increases the speed of learning. In this paper, we use the average reward for each state $s_t$ as the baseline. $(r_t - b(s_t))$ indicates how much better a specific solution is compared to the "average solution" for a given state $s_t$ according to the policy. Intuitively, Eq. (2) can be explained as follows. If $(r_t - b(s_t))$ is positive, $\pi(a_t^K \mid s_t; \theta)$ (i.e., the probability of the solution $a_t^K$) is increased by updating the policy parameters θ in the direction $\nabla_\theta \log \pi(a_t^K \mid s_t; \theta)$ with a step size of $\alpha(r_t - b(s_t))$. Otherwise, the solution probability is decreased. The net effect of Eq. (2) is to reinforce actions that empirically lead to better rewards.

To ensure that the RL agent explores the action space adequately during training to discover good policies, the entropy of the policy π is added to Eq. (2). This technique improves exploration by discouraging premature convergence to suboptimal deterministic policies [28]. Eq. (2) is then modified to the following equation:

$$\theta \leftarrow \theta + \alpha \sum_{t} \Big(\nabla_\theta \log \pi(a_t^K \mid s_t; \theta)\,(r_t - b(s_t)) + \beta \nabla_\theta H\big(\pi(\cdot \mid s_t; \theta)\big)\Big), \qquad (3)$$

where H is the entropy of the policy (the probability distribution over actions). The hyperparameter β controls the strength of the entropy regularization term. Algorithm 1 shows the pseudo-code for the training algorithm.

Algorithm 1 Training Algorithm
  Initialize θ; v = {} (tracks the sum of rewards for each state); n = {} (tracks the visit count of each state)
  for each iteration do
    Δθ ← 0
    {s_t} ← Sample a batch of states with size B
    for t = 1, . . . , B do
      Sample a solution a_t^K according to policy π(a_t^K | s_t)
      Receive reward r_t
      if s_t ∈ v and s_t ∈ n then
        b(s_t) = v[s_t] / n[s_t]   (average reward for state s_t)
      else
        b(s_t) = 0; v[s_t] = 0; n[s_t] = 0
      end if
    end for
    for t = 1, . . . , B do
      Δθ ← Δθ + α(∇_θ log π(a_t^K | s_t; θ)(r_t − b(s_t)) + β∇_θ H(π(· | s_t; θ)))
      v[s_t] = v[s_t] + r_t
      n[s_t] = n[s_t] + 1
    end for
    θ ← θ + Δθ
  end for
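For concreteness, one gradient update of Algorithm 1 might look as follows under TensorFlow 2 conventions (our sketch; the paper states only that TensorFlow [32] is used, and its original implementation may differ):

```python
import tensorflow as tf

def reinforce_update(policy_net, optimizer, states, actions, advantages, beta=0.1):
    """One policy-gradient step following Eq. (3) / Algorithm 1.

    states:      [B, N, N, 1] batch of traffic matrices (one channel).
    actions:     [B, K] indices of the K flows sampled per state.
    advantages:  [B] values of (r_t - b(s_t)).
    """
    with tf.GradientTape() as tape:
        probs = policy_net(states)                           # [B, N*(N-1)], softmax
        log_probs = tf.math.log(probs + 1e-12)
        # log pi(a_t^K|s_t) ~ sum of the K sampled actions' log-probs, per Eq. (1)
        picked = tf.gather(log_probs, actions, batch_dims=1)  # [B, K]
        log_pi = tf.reduce_sum(picked, axis=1)                # [B]
        entropy = -tf.reduce_sum(probs * log_probs, axis=1)   # H(pi(.|s_t))
        # Gradient *ascent* on Eq. (3) == descent on the negated objective.
        loss = -tf.reduce_mean(log_pi * advantages + beta * entropy)
    grads = tape.gradient(loss, policy_net.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy_net.trainable_variables))
```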
V. REROUTING CRITICAL FLOWS

In this section, we describe how to reroute the selected critical flows to balance link utilization of the network.

A. Notations

G(V, E): network with nodes V and directed edges E (|V| = N, |E| = M).
$c_{i,j}$: the capacity of link (i, j), (i, j) ∈ E.
$l_{i,j}$: the traffic load on link (i, j), (i, j) ∈ E.
$D_{s,d}$: the traffic demand from source s to destination d (s, d ∈ V, s ≠ d).
$\sigma_{i,j}^{s,d}$: the percentage of the traffic demand from source s to destination d routed on link (i, j), where (i, j) ∈ E and (s, d) ∈ $f_K$.

B. Explicit Routing for Critical Flows

By default, traffic is distributed according to ECMP routing. We reroute the small set of critical flows (i.e., $f_K$) by conducting explicit routing optimization for these critical flows (s, d) ∈ $f_K$.

The critical flow rerouting problem can be described as follows. Given a network G(V, E) with the set of traffic demands $D_{s,d}$ for the selected critical flows (∀(s, d) ∈ $f_K$) and the background link load $\{\bar{l}_{i,j}\}$ contributed by the remaining flows using the default ECMP routing, our objective is to obtain the optimal explicit routing ratios $\{\sigma_{i,j}^{s,d}\}$ for each critical flow, so that the maximum link utilization U is minimized.


To search all possible under-utilized paths for the selected critical flows, we formulate the rerouting problem as the following optimization:

$$\text{minimize} \quad U + \epsilon \cdot \sum_{(i,j) \in E} \sum_{(s,d) \in f_K} \sigma_{i,j}^{s,d} \qquad (4a)$$

subject to

$$l_{i,j} = \sum_{(s,d) \in f_K} \sigma_{i,j}^{s,d} \cdot D_{s,d} + \bar{l}_{i,j}, \quad \forall (i,j) \in E \qquad (4b)$$

$$l_{i,j} \le c_{i,j} \cdot U, \quad \forall (i,j) \in E \qquad (4c)$$

$$\sum_{k:(k,i) \in E} \sigma_{k,i}^{s,d} - \sum_{k:(i,k) \in E} \sigma_{i,k}^{s,d} = \begin{cases} -1 & \text{if } i = s \\ 1 & \text{if } i = d \\ 0 & \text{otherwise} \end{cases}, \quad \forall i \in V, \; \forall (s,d) \in f_K \qquad (4d)$$

$$0 \le \sigma_{i,j}^{s,d} \le 1, \quad \forall (s,d) \in f_K, \; \forall (i,j) \in E \qquad (4e)$$

The term $\epsilon \cdot \sum_{(i,j) \in E} \sum_{(s,d) \in f_K} \sigma_{i,j}^{s,d}$ in (4a) is needed because otherwise the optimal solution may include unnecessarily long paths as long as they avoid the most congested link; $\epsilon$ ($\epsilon > 0$) is a sufficiently small constant ensuring that the minimization of U takes higher priority [29]. (4b) gives the traffic load on link (i, j), contributed by the traffic demands routed by the explicit routing and the traffic demands routed by the default ECMP routing. (4c) is the link capacity utilization constraint. (4d) is the flow conservation constraint for the selected critical flows.

By solving the above LP problem using an LP solver (such as Gurobi [30]), we can obtain the optimal explicit routing solution $\{\sigma_{i,j}^{s,d}\}$ (∀(s, d) ∈ $f_K$) for the selected critical flows. Then, the SDN controller installs and updates flow entries at the switches accordingly.
VI. EVALUATION

In this section, a series of simulation experiments are conducted using real-world network topologies to evaluate the performance of CFR-RL and show its effectiveness by comparing it with rule-based heuristic schemes.

A. Evaluation Setup

1) Implementation: The policy neural network consists of three layers. The first layer is a convolutional layer with 128 filters; the corresponding kernel size is 3 × 3 and the stride is set to 1. The second layer is a fully connected layer with 128 neurons. The activation function used for the first two layers is Leaky ReLU [31]. The final layer is a fully connected linear layer (without an activation function) with N ∗ (N − 1) neurons corresponding to all possible critical flows. The softmax function is applied to the output of the final layer to generate the probabilities of all available actions. The learning rate α is initially configured to 0.001 and decays every 500 iterations with a base of 0.96 until it reaches the minimum value 0.0001. Additionally, the entropy factor β is configured to 0.1. We found that this set of hyperparameters is a good trade-off between performance and computational complexity of the model (details in Section VI-B.5); thus, we fixed them throughout our experiments. The results in the following experiments show that CFR-RL works well on different network topologies with a single set of fixed hyperparameters. This architecture is implemented using TensorFlow [32].
value 0.0001. Additionally, the entropy factor β is configured U optimal
to 0.1. We found that the set of above hyperparameters is P RU = , (5)
U CFR-RL
where $U^{\text{optimal}}$ is the maximum link utilization achieved by an optimal explicit routing for all flows². $PR_U = 1$ means that the proposed CFR-RL achieves load balancing as good as the optimal routing. A lower ratio indicates that the load balancing performance of CFR-RL is farther away from that of the optimal routing.

²The corresponding LP formulation is similar to (4a), except that the objective becomes obtaining the optimal explicit ratios $\{\sigma_{i,j}^{s,d}\}$ for all flows. Note that the background link load $\{\bar{l}_{i,j}\}$ would be 0 for this problem.

b) End-to-end delay performance ratio: To model and measure end-to-end delay in the network, we define the overall end-to-end delay in the network as $\Omega = \sum_{(i,j) \in E} \frac{l_{i,j}}{c_{i,j} - l_{i,j}}$, as described in [12]. Then, an end-to-end delay performance ratio is defined as follows:

$$PR_\Omega = \frac{\Omega^{\text{optimal}}}{\Omega^{\text{CFR-RL}}}, \qquad (6)$$

where $\Omega^{\text{optimal}}$ is the minimum end-to-end delay achieved by an optimal explicit routing for all flows with an objective³ to minimize the end-to-end delay Ω. Note that the rerouting solution for the selected critical flows is still obtained by solving (4a). The higher $PR_\Omega$, the better the end-to-end delay performance achieved by CFR-RL. $PR_\Omega = 1$ means that the proposed CFR-RL achieves the same minimum end-to-end delay as the optimal routing.

³The objective of this LP problem is to obtain the optimal explicit routing ratios $\{\sigma_{i,j}^{s,d}\}$ for all flows, such that Ω is minimized.

c) Rerouting disturbance: To measure the disturbance caused by rerouting, we define the rerouting disturbance as the percentage of total rerouted traffic⁴ for a given traffic matrix, i.e.,

$$RD = \frac{\sum_{(s,d) \in f_K} D_{s,d}}{\sum_{s,d \in V,\, s \neq d} D_{s,d}}, \qquad (7)$$

where $\sum_{(s,d) \in f_K} D_{s,d}$ is the total traffic of the selected critical flows that need to be rerouted and $\sum_{s,d \in V,\, s \neq d} D_{s,d}$ is the total traffic of all flows. The smaller RD, the less disturbance caused by rerouting.

⁴Although part of the traffic flows might still be routed along the original ECMP paths, updating routing at the switches might cause packet drops or reordering. Thus, we still count this amount of traffic as rerouted traffic.
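Computing the three metrics is mechanical once the relevant utilizations, loads, and demands are available; a small sketch (ours, with assumed dictionary-based inputs) follows.

```python
def load_balancing_ratio(u_optimal, u_cfr_rl):
    """Eq. (5): 1.0 means CFR-RL matches the optimal max link utilization."""
    return u_optimal / u_cfr_rl

def overall_delay(load, capacity):
    """Omega = sum over links of l / (c - l), as defined above."""
    return sum(load[e] / (capacity[e] - load[e]) for e in load)

def delay_ratio(omega_optimal, omega_cfr_rl):
    """Eq. (6): higher is better; 1.0 matches the delay-optimal routing."""
    return omega_optimal / omega_cfr_rl

def rerouting_disturbance(tm, critical_flows):
    """Eq. (7): fraction of total demand carried by the rerouted flows.

    tm: dict (s, d) -> demand; critical_flows: iterable of (s, d) pairs.
    """
    return sum(tm[f] for f in critical_flows) / sum(tm.values())
```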
5) Rule-Based Heuristics: For comparison, we also evaluate two rule-based heuristics:

1) Top-K: selects the K largest flows from a given traffic matrix in terms of demand volume. This approach is based on the assumption that flows with larger traffic volumes have a dominant impact on network performance.

2) Top-K Critical: similar to the Top-K approach, but selects the K largest flows from the most congested links. This approach is based on the assumption that flows traversing the most congested links have a dominant impact on network performance.
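Both baselines reduce to sorting candidate flows; a sketch of each rule is shown below (our rendering; `flows_on_link` is an assumed helper that maps a link to the flows ECMP routes across it).

```python
def top_k(tm, K):
    """Top-K: the K largest flows in the traffic matrix by demand."""
    return sorted(tm, key=tm.get, reverse=True)[:K]

def top_k_critical(tm, link_util, flows_on_link, K):
    """Top-K Critical: largest flows drawn from the most congested links.

    link_util:     dict link -> utilization under default ECMP routing.
    flows_on_link: callable link -> iterable of (s, d) flows traversing it.
    """
    selected = []
    for link in sorted(link_util, key=link_util.get, reverse=True):
        # Flows on this congested link, largest demand first.
        for flow in sorted(flows_on_link(link), key=lambda f: tm[f], reverse=True):
            if flow not in selected:
                selected.append(flow)
                if len(selected) == K:
                    return selected
    return selected
```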
B. Evaluation

1) Critical Flows Number: We conduct a series of experiments with different numbers of critical flows selected, while fixing the other parameters throughout the experiments.

[Fig. 3. Average load balancing performance ratio of CFR-RL with an increasing number of critical flows K on the four networks.]

Figure 3 shows the average load balancing performance ratio achieved by CFR-RL with an increasing number of critical flows K. The initial value with K = 0 represents the default ECMP routing. The results indicate that there is considerable room for further improvement when flows are routed by ECMP. The sharp increases in the average load balancing performance ratio for all four networks shown in Fig. 3 indicate that CFR-RL is able to achieve near-optimal load balancing performance by rerouting only 10% of the flows. As a result, network disturbance is much reduced compared to rerouting all flows as traditional TE does. For the subsequent experiments, we set K = 10% ∗ N ∗ (N − 1) for each network.

2) Performance Comparison: For comparison, we also calculate the performance ratios and rerouting disturbances for Top-K, Top-K Critical, and ECMP according to Eqs. (5), (6) and (7).

[Fig. 4. Comparison of the average load balancing performance ratio, where error bars span ± one standard deviation from the average, on the entire test set of the four networks.]

Figure 4 shows the average load balancing performance ratio that each scheme achieves on the entire test set of the four networks. Figure 5 shows the load balancing performance ratio on each individual traffic matrix for the four networks. Note that the first 15 traffic matrices in Figs. 5(b)-5(d) are generated by an exponential model and the remaining 15 traffic matrices are generated by a uniform model. CFR-RL performs significantly well in all networks.


[TABLE II. Comparison of average rerouting disturbance (table contents not recoverable from this copy).]

[Fig. 5. Comparison of load balancing performance in the four networks on each test traffic matrix.]

[Fig. 6. Comparison of the average end-to-end delay performance ratio, where error bars span ± one standard deviation from the average, on the entire test set of the four networks.]

[Fig. 7. Comparison of end-to-end delay performance in the four networks on each test traffic matrix.]

For example, for the Abilene network, CFR-RL improves load balancing performance by about 32.8% compared to ECMP, and by roughly 7.4% compared to Top-K Critical. For the EBONE network, CFR-RL outperforms Top-K Critical with an average 12.2% load balancing performance improvement. For the Sprintlink and Tiscali networks, CFR-RL performs slightly better than Top-K Critical, by 1.3% and 3.5% on average, respectively. Moreover, Figure 6 shows the average end-to-end delay performance ratio that each scheme achieves on the entire test set of the four networks. Figure 7 shows the end-to-end delay performance ratio on each test traffic matrix for the four networks. It is worth noting that the rerouting solution for the selected critical flows is still obtained by solving (4a) (i.e., minimizing the maximum link utilization), though the end-to-end delay performance is evaluated⁵ for each scheme.

⁵For the Abilene network, the real traffic demands in the measured traffic matrices collected in [34] are relatively small, and thus the corresponding end-to-end delay would be very small. To effectively compare the end-to-end delay performance of each scheme, we multiply each demand $D_{s,d}$ in a real traffic matrix $TM_t$ by $0.9/U_t^{\text{ECMP}}$, where $U_t^{\text{ECMP}}$ is the maximum link utilization achieved by ECMP routing on the traffic matrix $TM_t$.

By effectively selecting and rerouting critical flows to balance link utilization of the network, CFR-RL outperforms the heuristic schemes and ECMP in terms of end-to-end delay in all networks except the EBONE network. In the EBONE network, the heuristic schemes perform better with the exponential traffic model. It is possible that rerouting the elephant flows selected by the heuristic schemes further balances load on non-congested links and results in smaller end-to-end delay. In addition, Tab. II shows the average rerouting disturbance, i.e., the average percentage of total traffic rerouted by each scheme (except ECMP), for the four networks. CFR-RL greatly reduces network disturbance by rerouting at most 21.3%, 11.2%, 11.3%, and 11.2% of total traffic on average for the four networks, respectively. In contrast, Top-K Critical reroutes 11.4% more traffic for the Abilene network, and 14.7%, 12.3%, and 13.3% more traffic for the EBONE, Sprintlink, and Tiscali networks (for exponential traffic matrices). Top-K performs even worse, rerouting more than 42% of total traffic on average for the Abilene network and 32%, 33%, and 32% of total traffic on average for the other three networks (for exponential traffic matrices). It is worth noting that there are no elephant flows in the uniform traffic matrices shown in Figs. 5(b)-5(d); thus, all three schemes reroute similar amounts of traffic for uniform traffic matrices. However, CFR-RL still performs slightly better than the two rule-based heuristics. Overall, the above results indicate that CFR-RL is able to achieve near-optimal load balancing performance and greatly reduce end-to-end delay and network disturbance by smartly selecting a small number of critical flows for each given traffic matrix and effectively rerouting the corresponding small amount of traffic.


[Fig. 8. Comparison of load balancing performance ratio in CDF with the traffic matrices from Tuesday, Wednesday, Friday and Saturday in week 2.]

[Fig. 9. Comparison of end-to-end delay performance ratio in CDF with the traffic matrices from Tuesday, Wednesday, Friday and Saturday in week 2.]

[Fig. 10. Comparison of load balancing performance ratio with the traffic matrices from Tuesday, Wednesday, Friday and Saturday in week 2.]

[Fig. 11. Comparison of end-to-end delay performance ratio with the traffic matrices from Tuesday, Wednesday, Friday and Saturday in week 2.]

As shown in Figs. 5(b)-5(d), Top-K Critical performs well with the exponential traffic model. However, its performance degrades with the uniform traffic model. One possible reason for the performance degradation of Top-K Critical is that all links in the network are relatively saturated under the uniform traffic model, so alternative underutilized paths are not available for the critical flows selected by Top-K Critical. In other words, there is not much room for rerouting performance improvement when only considering the elephant flows traversing the most congested links. Thus, fixed-rule heuristics are unable to guarantee their performance once their design assumptions become invalid. In contrast, CFR-RL performs consistently well under various traffic models.

3) Generalization: In this series of experiments, we trained CFR-RL on the traffic matrices from the first week (starting from Mar. 1st, 2004) and evaluated it on each day of the following week (starting from Mar. 8th, 2004) for the Abilene network. We only present the results for day 2, day 3, day 5 and day 6, since the results for the other days are similar. Figures 8 and 9 show the full CDFs of the two types of performance ratio for these 4 days. Figures 10 and 11 show the load balancing and end-to-end delay performance ratios on each traffic matrix of these 4 days, respectively. The results show that CFR-RL still achieves above 95% of the optimal load balancing performance and an average 88.13% of the optimal end-to-end delay performance, and thus outperforms the other schemes on almost all traffic matrices. The load balancing performance of CFR-RL degrades on several outlier traffic matrices in day 2. There are two possible reasons for the degradation: (1) the traffic patterns of these traffic matrices are different from what CFR-RL learned from the previous week; (2) selecting K = 10% ∗ N ∗ (N − 1) critical flows is not enough for CFR-RL to achieve near-optimal performance on these outlier traffic matrices. However, CFR-RL still performs better than the other schemes. Overall, the results indicate that real traffic patterns are relatively stable and that CFR-RL generalizes well to unseen traffic matrices which it was not explicitly trained on.

4) Training and Inference Time: Training a policy for the Abilene network took approximately 10,000 iterations, and the time consumed by each iteration is approximately 1 second. As a result, the total training time for the Abilene network is approximately 3 hours. Since the EBONE network is relatively larger, it took approximately 60,000 iterations to train a policy; the total training time for the EBONE network is thus approximately 16 hours. For larger networks like Sprintlink and Tiscali, the solution space is even larger.
Thus, more iterations (e.g., approximately 90,000 and 100,000 iterations, respectively) should be taken to train a good policy, and each iteration takes approximately 2 seconds. Note that this cost is incurred offline, and training can be performed infrequently depending on environment stability. The policy neural network described in Section VI-A.1 is relatively small. Thus, the inference times for the Abilene and EBONE networks are less than 1 second, and they are less than 2 seconds for the Sprintlink and Tiscali networks.

[TABLE III. Comparison of average load balancing performance ratio with different sets of hyperparameters (table contents not recoverable from this copy).]

5) Hyperparameters: Table III shows how the hyperparameters affect the load balancing performance of CFR-RL in the Abilene network. For each set of hyperparameters, we trained a policy for the Abilene network for 10,000 iterations, and then evaluated the average load balancing performance ratio over the whole test set. We only present the results for the Abilene network, since the results for the other network topologies are similar. In Tab. III(a), the number of filters in the convolutional layer and neurons in the fully connected layer is fixed to 128, and the entropy factor β is fixed to 0.1. We compare the performance with different learning rates α. The results show that training might become unstable if the initial learning rate is too large (e.g., 0.01), and thus it cannot converge to a good policy. In contrast, training with a smaller learning rate is more stable but might require a longer training time to further improve the performance. As a result, we chose α = 0.001 to encourage exploration in the early stage of training. We compared the performance with different numbers of filters and neurons in Tab. III(b). The results show that too few filters/neurons might restrict the representations that the neural network can learn and thus cause under-fitting. Meanwhile, too many neurons might cause over-fitting, so that the corresponding policy cannot generalize well to the test set. In addition, more training time is required for a larger neural network. In Tab. III(c), the results show that a larger entropy factor encourages exploration and leads to better performance. Overall, the set of hyperparameters we have chosen is a good trade-off between performance and computational complexity of the model.

VII. CONCLUSION AND FUTURE WORK

With the objectives of minimizing the maximum link utilization in a network and reducing disturbance to the network that causes service disruption, we proposed CFR-RL, a scheme that learns a critical flow selection policy automatically using reinforcement learning, without any domain-specific rule-based heuristic. CFR-RL selects critical flows for each given traffic matrix and reroutes them to balance link utilization of the network by solving a simple rerouting optimization problem. Extensive evaluations show that CFR-RL achieves near-optimal performance by rerouting only a limited portion of the total traffic. In addition, CFR-RL generalizes well to traffic matrices on which it was not explicitly trained.

Yet, there are several aspects that may help improve the solution proposed in this contribution. Among them, we discuss how CFR-RL can be updated and improved.

A. Objectives

CFR-RL could be formulated to achieve other objectives. For example, to minimize the overall end-to-end delay in the network (i.e., $\Omega = \sum_{(i,j) \in E} \frac{l_{i,j}}{c_{i,j} - l_{i,j}}$, described in Section VI-A.4), we can define the reward r as 1/Ω and reformulate the rerouting optimization problem (4a) to minimize Ω.

Table II shows an interesting finding. Although CFR-RL does not explicitly minimize rerouting traffic, it ends up rerouting much less traffic (i.e., 10.0%-21.3%) and performs better than the rule-based heuristic schemes by 1.3%-12.2%. This reveals that CFR-RL effectively searches the whole set of candidate flows to find the best critical flows for various traffic matrices, rather than simply considering the elephant flows on the most congested links or in the whole network as the rule-based heuristic schemes do. We will consider minimizing rerouting traffic as one of our objectives and investigate the trade-off between maximizing performance and minimizing rerouting traffic.

B. Scalability

Scaling CFR-RL to larger networks is an important direction of our future work. CFR-RL relies on LP to produce the reward signals r. The LP problem becomes more complex as the number of critical flows K and the size of the network increase. This would slow down policy training for larger networks (e.g., the Tiscali network in Section VI-B.4), since the time consumed by each iteration would increase. Moreover, the solution space becomes enormous for larger networks, and RL has to take more iterations to converge to a good policy. To further speed up training, we can either spawn even more actor agents (e.g., 30) in parallel to allow the system to consume more data at each time step and thus improve exploration [28], or apply GA3C [36] to offload the training to a GPU; GA3C is an alternative architecture to A3C that emphasizes efficient GPU utilization to increase the amount of training data generated and processed per second. Another possible design to mitigate the scalability issue is adopting an SDN multi-controller architecture. Each controller takes care of a subset of routers in a large network, and one CFR-RL agent runs on each SDN controller. The corresponding problem naturally falls into the realm of Multi-Agent Reinforcement Learning. We will evaluate whether a multi-SDN-controller architecture can provide additional improvement to our approach.

C. Retraining

In this paper, we mainly described the RL-based critical flow selection policy training process as an offline task. In other words, once training is done, CFR-RL remains unmodified after being deployed in the network. However, CFR-RL can naturally accommodate future unseen traffic matrices by periodically updating the selection policy. This self-learning technique will enable CFR-RL to further adapt itself to the dynamic conditions in the network after being deployed in real networks. CFR-RL can be retrained by including new traffic matrices. For example, the outlier traffic matrices (e.g., the 235th-240th traffic matrices in Day 2) presented in Fig. 10 should be included for retraining, while the generalization results shown in Section VI-B.3 suggest that retraining frequently might not be necessary. Techniques to determine when to retrain and which new/old traffic matrices should be included in or excluded from the training dataset should be further investigated.

The above examples are some key issues that are left for future work.

ACKNOWLEDGMENT

The authors would like to thank the editors and reviewers for providing many valuable comments and suggestions.

REFERENCES

[1] N. McKeown et al., "OpenFlow: Enabling innovation in campus networks," ACM SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69–74, Apr. 2008.
[2] S. Agarwal, M. Kodialam, and T. Lakshman, "Traffic engineering in software defined networks," in Proc. IEEE Int. Conf. Comput. Commun., Apr. 2013, pp. 2211–2219.
[3] Y. Guo, Z. Wang, X. Yin, X. Shi, and J. Wu, "Traffic engineering in SDN/OSPF hybrid network," in Proc. IEEE 22nd Int. Conf. Netw. Protocols, Oct. 2014, pp. 563–568.
[4] J. Zhang, K. Xi, M. Luo, and H. J. Chao, "Dynamic hybrid routing: Achieve load balancing for changing traffic demands," in Proc. IEEE 22nd Int. Symp. Qual. Service (IWQoS), May 2014, pp. 105–110.
[5] J. Zhang, K. Xi, and H. J. Chao, "Load balancing in IP networks using generalized destination-based multipath routing," IEEE/ACM Trans. Netw., vol. 23, no. 6, pp. 1959–1969, Dec. 2015.
[6] Z. Guo, W. Chen, Y.-F. Liu, Y. Xu, and Z.-L. Zhang, "Joint switch upgrade and controller deployment in hybrid software-defined networks," IEEE J. Sel. Areas Commun., vol. 37, no. 5, pp. 1012–1028, May 2019.
[7] Y. Wang and Z. Wang, "Explicit routing algorithms for Internet traffic engineering," in Proc. IEEE Int. Conf. Comput. Commun. Netw., Oct. 1999, pp. 582–588.
[8] E. D. Osborne and A. Simha, Traffic Engineering With MPLS. Indianapolis, IN, USA: Cisco Press, 2002.
[9] B. Fortz and M. Thorup, "Optimizing OSPF/IS-IS weights in a changing world," IEEE J. Sel. Areas Commun., vol. 20, no. 4, pp. 756–767, May 2002.
[10] K. Holmberg and D. Yuan, "Optimization of Internet protocol network design and routing," Networks, vol. 43, no. 1, pp. 39–53, Jan. 2004.
[11] J. Chu and C.-T. Lea, "Optimal link weights for IP-based networks supporting hose-model VPNs," IEEE/ACM Trans. Netw., vol. 17, no. 3, pp. 778–788, Jun. 2009.
[12] J. Zhang, K. Xi, L. Zhang, and H. J. Chao, "Optimizing network performance using weighted multipath routing," in Proc. 21st Int. Conf. Comput. Commun. Netw. (ICCCN), Jul. 2012, pp. 1–7.
[13] J. Zhang, K. Xi, M. Luo, and H. J. Chao, "Load balancing for multiple traffic matrices using SDN hybrid routing," in Proc. IEEE 15th Int. Conf. High Perform. Switching Routing (HPSR), Jul. 2014, pp. 44–49.
[14] C. Villamizar, OSPF Optimized Multipath (OSPF-OMP), draft-ietf-ospf-omp-02, IETF Internet-Draft, 1999.
[15] M. Kodialam, T. V. Lakshman, J. B. Orlin, and S. Sengupta, "Oblivious routing of highly variable traffic in service overlays and IP backbones," IEEE/ACM Trans. Netw., vol. 17, no. 2, pp. 459–472, Apr. 2009.
[16] M. Antic, N. Maksic, P. Knezevic, and A. Smiljanic, "Two phase load balanced routing using OSPF," IEEE J. Sel. Areas Commun., vol. 28, no. 1, pp. 51–59, Jan. 2010.
[17] F. Geyer and G. Carle, "Learning and generating distributed routing protocols using graph-based deep learning," in Proc. Workshop Big Data Anal. Mach. Learn. Data Commun. Netw. (Big-DAMA), 2018, pp. 40–45.
[18] P. Sun, J. Li, Z. Guo, Y. Xu, J. Lan, and Y. Hu, "SINET: Enabling scalable network routing with deep reinforcement learning on partial nodes," in Proc. ACM SIGCOMM Conf. Posters Demos, 2019, pp. 88–89.
[19] S.-C. Lin, I. F. Akyildiz, P. Wang, and M. Luo, "QoS-aware adaptive routing in multi-layer hierarchical software defined networks: A reinforcement learning approach," in Proc. IEEE Int. Conf. Services Comput. (SCC), Jun. 2016, pp. 25–33.
[20] Z. Xu et al., "Experience-driven networking: A deep reinforcement learning based approach," in Proc. IEEE INFOCOM Conf. Comput. Commun., Apr. 2018, pp. 1871–1879.
[21] L. Chen, J. Lingys, K. Chen, and F. Liu, "AuTO: Scaling deep reinforcement learning for datacenter-scale automatic traffic optimization," in Proc. ACM SIGCOMM, Aug. 2018, pp. 191–205.
[22] H. Xu, Z. Yu, C. Qian, X.-Y. Li, and Z. Liu, "Minimizing flow statistics collection cost of SDN using wildcard requests," in Proc. IEEE Conf. Comput. Commun., May 2017, pp. 1–9.
[23] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Mach. Learn., vol. 8, nos. 3–4, pp. 229–256, May 1992.
[24] T. P. Lillicrap et al., "Continuous control with deep reinforcement learning," in Proc. Int. Conf. Learn. Represent. (ICLR), 2016.
[25] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn. (ICML), vol. 37, Jul. 2015, pp. 1889–1897. [Online]. Available: http://proceedings.mlr.press/v37/schulman15.html
[26] A. Valadarsky, M. Schapira, D. Shahaf, and A. Tamar, "Learning to route," in Proc. ACM Workshop Hot Topics Netw., 2017, pp. 185–191.
[27] H. Mao, M. Alizadeh, I. Menache, and S. Kandula, "Resource management with deep reinforcement learning," in Proc. ACM Workshop Hot Topics Netw., 2016, pp. 50–56.
[28] V. Mnih et al., "Asynchronous methods for deep reinforcement learning," in Proc. Int. Conf. Mach. Learn. (ICML), vol. 48, Jun. 2016, pp. 1928–1937. [Online]. Available: http://proceedings.mlr.press/v48/mniha16.html
[29] Y. Wang, Z. Wang, and L. Zhang, "Internet traffic engineering without full mesh overlaying," in Proc. IEEE Int. Conf. Comput. Commun., Apr. 2001, pp. 565–571.
[30] Gurobi Optimization, Gurobi Optimizer Reference Manual, 2019. [Online]. Available: http://www.gurobi.com
[31] A. L. Maas, A. Y. Hannun, and A. Y. Ng, "Rectifier nonlinearities improve neural network acoustic models," in Proc. Workshop Deep Learn. Audio, Speech Lang. Process. (ICML), 2013, pp. 1–6.
[32] M. Abadi et al., "TensorFlow: A system for large-scale machine learning," in Proc. USENIX Conf. Oper. Syst. Design Implement., 2016, pp. 265–283.
[33] N. Spring, R. Mahajan, and D. Wetherall, "Measuring ISP topologies with Rocketfuel," ACM SIGCOMM Comput. Commun. Rev., vol. 32, no. 4, pp. 133–145, 2002.
[34] Yin Zhang's Abilene traffic matrices. Accessed: Apr. 22, 2019. [Online]. Available: http://www.cs.utexas.edu/~yzhang/research/AbileneTM/
[35] TMgen: Traffic Matrix Generation Tool. [Online]. Available: https://tmgen.readthedocs.io/en/latest/
[36] M. Babaeizadeh, I. Frosio, S. Tyree, J. Clemons, and J. Kautz, "GA3C: GPU-based A3C for deep reinforcement learning," CoRR, vol. abs/1611.06256, 2016. [Online]. Available: http://arxiv.org/abs/1611.06256

Junjie Zhang (Member, IEEE) received the B.S. degree in computer science from the Nanjing University of Posts and Telecommunications, China, in 2006, and the M.S. degree in computer science and the Ph.D. degree in electrical engineering from New York University, New York City, NY, USA, in 2010 and 2015, respectively. He has been with Fortinet Inc., Sunnyvale, CA, USA, since 2015. He holds two U.S. patents in the area of computer networking. His research interests include network optimization, traffic engineering, machine learning, and network security.


Minghao Ye received the B.E. degree in microelectronic science and engineering from Sun Yat-sen University, Guangzhou, China, and the second B.E. degree (Hons.) in electronic engineering from The Hong Kong Polytechnic University, Hong Kong, in 2017, and the M.S. degree in electrical engineering from New York University, New York City, NY, USA, in 2019, where he is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering. His research interests include traffic engineering, software-defined networks, mobile edge computing, and reinforcement learning.

Chen-Yu Yen received the B.S. degree in electrical engineering from National Taiwan University, Taipei, Taiwan, in 2014, and the M.S. degree in electrical engineering from Columbia University in 2018. He is currently pursuing the Ph.D. degree with the Department of Electrical and Computer Engineering, New York University, New York City, NY, USA. His research interests include reinforcement learning, congestion control, and practical machine learning for networking.

Zehua Guo (Senior Member, IEEE) received the B.S. degree from Northwestern Polytechnical University, the M.S. degree from Xidian University, and the Ph.D. degree from Northwestern Polytechnical University. He was a Research Fellow with the Department of Electrical and Computer Engineering, Tandon School of Engineering, New York University, a Post-Doctoral Research Associate with the Department of Computer Science and Engineering, University of Minnesota Twin Cities, and a Visiting Associate Professor with the Singapore University of Technology and Design. He is currently an Associate Professor at the Beijing Institute of Technology. His research interests include software-defined networking, network function virtualization, data center networks, cloud computing, content delivery networks, network security, machine learning, and Internet exchange. He was the Session Chair of the IEEE International Conference on Communications 2018 and a Technical Program Committee Member of Computer Communications (Elsevier). He is an Associate Editor of IEEE ACCESS and the EURASIP Journal on Wireless Communications and Networking (Springer), and an Editor of KSII Transactions on Internet and Information Systems.

H. Jonathan Chao (Fellow, IEEE) received the B.S. and M.S. degrees in electrical engineering from National Chiao Tung University, Taiwan, in 1977 and 1980, respectively, and the Ph.D. degree in electrical engineering from The Ohio State University, Columbus, OH, USA, in 1985. He was the Head of the Electrical and Computer Engineering (ECE) Department, New York University (NYU), New York City, NY, USA, from 2004 to 2014. From 2000 to 2001, he was the Co-Founder and the CTO of Coree Networks, Tinton Falls, NJ, USA. From 1985 to 1992, he was a member of Technical Staff at Bellcore, Piscataway, NJ, USA, where he was involved in transport and switching system architecture designs and application-specified integrated circuit implementations, such as the world's first SONET-like framer chip, ATM layer chip, sequencer chip (the first chip handling packet scheduling), and ATM switch chip. He is currently a Professor of ECE at NYU, New York City, where he is also the Director of the High-Speed Networking Laboratory. He has been doing research in the areas of software-defined networking, network function virtualization, datacenter networks, high-speed packet processing/switching/routing, network security, quality-of-service control, network on chip, and machine learning for networking. He has coauthored three networking books: Broadband Packet Switching Technologies: A Practical Guide to ATM Switches and IP Routers (Wiley, 2001), Quality of Service Control in High-Speed Networks (Wiley, 2001), and High Performance Switches and Routers (Wiley, 2007). He has published more than 260 journal and conference papers and holds 63 patents. He is a Fellow of the National Academy of Inventors. He was a recipient of the Bellcore Excellence Award in 1987 and a co-recipient of the 2001 Best Paper Award from the IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY.
