
2018 IEEE 4th International Conference on Computer and Communications

RNN Deep Reinforcement Learning for Routing Optimization

Penghao Sun, Junfei Li, Julong Lan, Yuxiang Hu
China National Digital Switching System Engineering & Technological R&D Center (NDSC)
Zhengzhou, China
e-mail: [email protected]

Xin Lu
China Electronics Technology Group Corporation 32th Research Institute
Shanghai, China
e-mail: [email protected]

Abstract—Routing optimization has been discussed in network design for a long time. In recent years, new routing strategies based on Reinforcement Learning have been considered. In this paper, we propose a reinforcement-learning-based smart agent that can optimize the routing strategy without human experience. Our proposed scheme is based on the collection of traffic intensity in switches and the use of a Recurrent Neural Network based deep reinforcement learning model to train the agent. Simulation results show that the proposed scheme can adjust the routing strategy dynamically according to the network condition and outperforms traditional shortest path routing after training.

Keywords-routing; neural network; reinforcement learning; SDN

I. INTRODUCTION

In recent years, AI (Artificial Intelligence) and especially ML (Machine Learning) techniques have been gaining momentum. ML is thought to have a greater ability than traditional algorithms in analyzing and processing large amounts of data, and thus has the potential to discover new data patterns. Therefore, many researchers in the networking field are now paying attention to the use of ML in network design. In 2003, David D. Clark et al. proposed the "knowledge plane" [1], which depicted a primitive view of the use of AI in network design. References [2]-[5] also discussed network architectures with AI applications. However, many problems make the employment of AI in network design difficult, as discussed in [6]. In recent years, such problems have been eased to some extent. For example, the development of SDN (Software-Defined Networking) has brought some convenience in obtaining a global view of the network and in heterogeneous function deployment. Under this circumstance, many traffic engineering and routing optimization technologies have been proposed in recent years, most of which are based on analytical optimization or heuristic methods, as presented in [7]. Learning-based approaches fall mainly into two categories: supervised learning and reinforcement learning. Melinda Barabas et al. [8] proposed a simple routing management scheme based on QoS information to improve the overall network traffic capacity, which still needs further consideration for practical use. Wenhao Huang et al. [9] employed a deep neural network to characterize node traffic, but the acquisition of correctly labeled data was hard. There are also other supervised learning methods based on the characterization ability of deep neural networks, such as [10].

On the other hand, the advantages of reinforcement learning (RL) are being considered. Reference [11] started the research of RL in the context of routing optimization, and reference [12] applied RL to achieve QoS routing. However, a real network is a complicated continuous-time system with an almost countless number of states, while the studies mentioned above all use a state-action table to select a routing strategy, which cannot handle too many states. Reference [13] used deep reinforcement learning to deal with this problem, but the proposed algorithm did not perform well compared to traditional routing algorithms.

In this paper, we use Deep Deterministic Policy Gradient (DDPG [14]) combined with a Recurrent Neural Network (RNN) for the automatic generation of traffic engineering policies. Our main work includes the exploration of RNN-based DDPG and its application to the generation of routing strategies, as well as a discussion on how to better abstract network attributes for machine learning. The main architecture of our model is shown in Fig. 1. Our research shows that the proposed scheme can reduce the average transmission delay of network traffic compared with shortest path routing.

Figure 1. Basic model structure: the environment exchanges state, reward and action with the RNN-DDPG agent, which contains an actor, a critic and target networks.

II. MODEL DESIGN

Typically, a reinforcement learning model regards the interaction process between the environment and the agent as a Markov Decision Process (MDP).
The basic element tuple of such an MDP is $(S, A, R, P, \gamma)$, where $S$ is the space of states $s$, $A$ the space of actions $a$, $R$ the space of rewards $r$, $P$ the transition probability function ($p(s_{t+1}, r \mid s_t, a_t) \in P$), and $\gamma \in [0,1]$ the discount factor used in the model. An agent chooses action $a$ under state $s$ according to a policy, which is denoted as $\pi(a \mid s)$ for stochastic policies and $a = \mu(s)$ for deterministic policies [15].

To evaluate whether a policy is good or not, value functions are introduced. One popular value function used in reinforcement learning is the Q value. The policy value $Q$ under state $s$ when choosing action $a$ is defined as in function (1):

$Q(s_t, a_t) = E\left[\sum_{k=0}^{\infty} \gamma^k R(s_{t+k}, a_{t+k})\right]$   (1)

The expanded form above can properly represent the value, but it requires access to all future rewards, which is only suitable for offline learning and is impractical in network usage. The Q value can also be expressed in an iterative (bootstrap) form as in function (2):

$Q(s_t, a_t) = E\left[R_t + \gamma Q(s_{t+1}, a_{t+1})\right]$   (2)
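As a quick illustration of the practical difference between functions (1) and (2), the following minimal Python sketch (ours, not part of the paper; the reward sequence and the value 0.7 are arbitrary examples) computes the full-trajectory estimate and the one-step bootstrapped estimate.

# Illustrative sketch: Monte-Carlo estimate of function (1) vs. bootstrapped estimate of function (2).
GAMMA = 0.9  # discount factor, chosen arbitrarily for illustration

def monte_carlo_q(rewards, gamma=GAMMA):
    """Function (1): discounted sum over an entire (finite) reward trajectory.
    Needs the whole future reward sequence, hence offline."""
    return sum(gamma ** k * r for k, r in enumerate(rewards))

def bootstrapped_q(reward, next_q, gamma=GAMMA):
    """Function (2): one-step target using the current estimate of Q(s_{t+1}, a_{t+1}).
    Needs only a single transition, hence usable online."""
    return reward + gamma * next_q

trajectory_rewards = [1.0, 0.5, 0.2, 0.1]           # hypothetical rewards
print(monte_carlo_q(trajectory_rewards))            # full-trajectory estimate
print(bootstrapped_q(trajectory_rewards[0], 0.7))   # 0.7 = assumed Q(s_1, a_1)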
Function (2) is more convenient for introducing the policy gradient, as illustrated in the following. For policy gradients in a continuous environment, we need to represent the Q value with a function approximator instead of the simple case where Q values are stored in and retrieved from a table. In this paper, we use a neural network as the function approximator. Thus, $Q(s, a)$ is more precisely written as $Q(s, a \mid \theta)$, where $\theta$ is the set of parameters of the neural network. To train the neural network, we need a loss function for back propagation. As defined in DDPG, the loss function is given by function (3):

$L(\theta^Q) = E\left[\left(Q(s_t, a_t \mid \theta^Q) - y_t\right)^2\right]$   (3)

where $y_t$ is defined as follows:

$y_t = R(s_t, a_t) + \gamma Q(s_{t+1}, \mu(s_{t+1}) \mid \theta^Q)$   (4)

Therefore, the main policy gradient of DDPG is given by function (5):

$\nabla_{\theta^\mu} J \approx E\left[\nabla_{\theta^\mu} Q(s, \mu(s \mid \theta^\mu) \mid \theta^Q)\right] = E\left[\nabla_a Q(s, a \mid \theta^Q)\, \nabla_{\theta^\mu} \mu(s \mid \theta^\mu)\right]$   (5)

Based on the definitions and functions above, the core idea of the DDPG algorithm is clear. The detailed DDPG algorithm is listed in Table I.

TABLE I. DDPG ALGORITHM

Randomly initialize $\theta^Q$ and $\theta^\mu$
Create target networks $Q'$ and $\mu'$ with $\theta^{Q'} \leftarrow \theta^Q$ and $\theta^{\mu'} \leftarrow \theta^\mu$
Initialize replay buffer $R$
for episode = 1 to M:
    Initialize a random process $N$ for action exploration
    Receive the initial state $s_1$
    for t = 1 to T:
        Execute $a_t = \mu(s_t \mid \theta^\mu) + N_t$, observe $r_t$ and $s_{t+1}$
        Store $(s_t, a_t, r_t, s_{t+1})$ in $R$
        Sample a minibatch of $(s_i, a_i, r_i, s_{i+1})$ from $R$
        Set $y_i = r_i + \gamma Q'(s_{i+1}, \mu'(s_{i+1} \mid \theta^{\mu'}) \mid \theta^{Q'})$
        Update $\theta^Q$ by minimizing $L = \frac{1}{N}\sum_i (y_i - Q(s_i, a_i \mid \theta^Q))^2$
        Update $\theta^\mu$ with $\nabla_{\theta^\mu} J \approx \frac{1}{N}\sum_i \nabla_a Q(s_i, a_i \mid \theta^Q)\, \nabla_{\theta^\mu} \mu(s_i \mid \theta^\mu)$
        Update the target networks:
            $\theta^{Q'} \leftarrow \tau \theta^Q + (1-\tau)\theta^{Q'}$
            $\theta^{\mu'} \leftarrow \tau \theta^\mu + (1-\tau)\theta^{\mu'}$
    end for
end for
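To make the inner-loop updates of Table I concrete, the following minimal PyTorch sketch implements one critic update, one actor update and the soft target update. The module and buffer names (actor, critic, target_actor, target_critic, and a batch of tensors) are our assumptions for illustration; this is not the authors' implementation.

# Minimal PyTorch sketch of one DDPG training step (Table I, inner loop).
import torch
import torch.nn.functional as F

GAMMA = 0.99   # discount factor gamma (assumed value)
TAU = 0.005    # soft-update rate tau (assumed value)

def ddpg_update(batch, actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, gamma=GAMMA, tau=TAU):
    s, a, r, s_next = batch  # tensors: state sequence, link weights, reward, next state sequence

    # y_i = r_i + gamma * Q'(s_{i+1}, mu'(s_{i+1}))  -- target from Table I
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # Critic update: minimize (1/N) * sum_i (y_i - Q(s_i, a_i))^2
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: follow grad_a Q(s, a) * grad_theta mu(s), i.e. maximize Q(s, mu(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target update: theta' <- tau * theta + (1 - tau) * theta'
    for net, target in ((critic, target_critic), (actor, target_actor)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

The exploration noise $N_t$ of Table I is added to the actor output when interacting with the environment, and is therefore not part of this update step.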
In our design, the state of the network is represented by a sequence of traffic data processed by each router/switch, which is denoted by the state matrix $s^{t \times n}$, where $t$ is the number of sampling times and $n$ is the number of switches in the network. The action $a^{1 \times l}$ is the weight matrix that defines the link weight of each duplex link between any two routers, where $l$ is the number of links. Lakhina et al. [16] analyzed network traffic and pointed out that Origin-Destination (OD) flows in network traffic have periodic flows as their main component; for example, in the data set Sprint-1 (one of the traffic flow data sets in [16]) more than 90% of the traffic content has a periodic feature. Therefore, time relevance is one of the most indispensable traffic features. In this paper, we assume that the network traffic is composed of two components: a periodical flow PF and a random flow RF. We use

$RP = \dfrac{\mathrm{average}(RF)}{\mathrm{average}(PF)}$   (6)

to denote the ratio between the two components. In our opinion, the reason that DDPG can find an action for the current state is that the agent can predict the network traffic distribution of the next time step based on its experience, and thus take a proper action to optimize its reward. By selecting a traffic sequence instead of a snapshot of the traffic distribution in the network, the agent can get a more accurate view of the state the network is currently in.

Since the traffic shows an apparent periodic feature, the time relevance in it should not be neglected. In this paper, we use an RNN as the input neural network to exploit this feature of the traffic. The basic structure of the RNN is shown in Fig. 2. After each time step, the output also serves as one of the inputs of the next time step. With such a neural network structure, the function approximator can analyze the trend of traffic in a network and reach a more accurate conclusion about the state a transmitting network is currently in (to some degree, this process is similar to traffic prediction, which predicts the traffic condition of the next time step from past traffic patterns). Therefore, the RNN performs well in processing time-sequence data.
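To illustrate how the $t \times n$ state sequence and the $1 \times l$ link-weight action can be connected by an RNN, the sketch below defines a GRU-based actor and a matching critic. The layer sizes, the choice of a GRU cell and the sigmoid output range are assumptions made for illustration; the paper does not specify the exact network configuration.

# Illustrative RNN-based actor/critic for RNN-DDPG (not the authors' code).
import torch
import torch.nn as nn

class RNNActor(nn.Module):
    """Maps a t x n traffic sequence to a 1 x l vector of link weights."""
    def __init__(self, n_switches, n_links, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_switches, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_links)

    def forward(self, state_seq):                # state_seq: (batch, t, n)
        _, h = self.rnn(state_seq)               # h: (1, batch, hidden), last hidden state
        return torch.sigmoid(self.head(h[-1]))   # link weights in (0, 1), shape (batch, l)

class RNNCritic(nn.Module):
    """Estimates Q(s, a) from the same traffic sequence plus the action."""
    def __init__(self, n_switches, n_links, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(input_size=n_switches, hidden_size=hidden, batch_first=True)
        self.head = nn.Sequential(nn.Linear(hidden + n_links, hidden),
                                  nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, state_seq, action):
        _, h = self.rnn(state_seq)
        return self.head(torch.cat([h[-1], action], dim=1))

# Example instantiation with the topology of Section III: 14 switches, 21 duplex links.
actor = RNNActor(n_switches=14, n_links=21)
weights = actor(torch.rand(1, 10, 14))           # 10 traffic samples -> 21 link weights

Such modules could serve as $\mu(s \mid \theta^\mu)$ and $Q(s, a \mid \theta^Q)$ in the update sketch given after Table I.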

Figure 2. RNN network logic: the hidden states S_{t-1}, S_t, S_{t+1} are connected across time steps through shared weights W, linking the input sequence to the output.

III. EXPERIMENT IMPLEMENTATION

Figure 3. Experiment topology: the routing nodes are managed by a POX controller over OpenFlow.
Our scheme is simulated on the OMNeT++ platform, running on a Windows 7 PC. In our experiment, we implement the network with one POX controller, 14 routing nodes and 21 full-duplex links among them, as shown in Fig. 3. Each router has a host connected to it, which is omitted in the figure.

The link capacities are all equally set to a bandwidth of 1, and the network traffic intensity is normalized to the link capacity. We create a traffic pattern for each traffic originator, whose flow is composed of the two components discussed above: the periodical flow PF and the random flow RF. In the experiment, the RP of each host (as defined in function (6)) is adjusted to generate different traffic for the network. We use the shortest path (SP) routing strategy (based on the Dijkstra algorithm) for comparison, which is a fundamental routing strategy in current networks.
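For illustration, the sketch below shows one way such a traffic originator could be synthesized, with a sinusoidal periodic flow PF and Gaussian random flow RF scaled to a target RP as defined in function (6). The waveform and noise model are our assumptions; the paper does not specify them.

# Illustrative traffic generator: periodic flow PF plus random flow RF with a given RP.
import numpy as np

def generate_traffic(steps, period=100, pf_level=0.5, rp=0.1, seed=0):
    rng = np.random.default_rng(seed)
    t = np.arange(steps)
    pf = pf_level * (1.0 + 0.5 * np.sin(2 * np.pi * t / period))  # periodic component
    rf = np.abs(rng.normal(0.0, 1.0, steps))                      # random component
    if rp > 0:
        rf *= rp * pf.mean() / rf.mean()   # scale so average(RF)/average(PF) == RP, eq. (6)
    else:
        rf[:] = 0.0
    # Clip to the unit link capacity; clipping may slightly change the realized ratio.
    return np.clip(pf + rf, 0.0, 1.0)

demand = generate_traffic(steps=1000, pf_level=0.5, rp=0.08)
print(demand.mean(), demand.max())

The values pf_level = 0.5 and rp in {0, 0.08, 0.10, 0.12} correspond to the settings swept in Fig. 4.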
In our experiment, each episode of RNN-DDPG training contains 1000 steps, and the average transmission delay of each step is added to the total delay of the episode to measure performance. First, we set the PF intensity level to 0.5 and test the performance of our scheme under different RP values. The simulation results are shown in Fig. 4. As the figure shows, in the first few dozen episodes the proposed scheme performs worse than SP. As training proceeds, the total traffic delay of the proposed scheme trends downward, as a normal RL agent should.

Figure 4. Performance under different RP within 150 episodes of training. The RP values of (a)-(d) are 0, 0.08, 0.10 and 0.12, respectively. The X-axis shows the episode number of training and the Y-axis shows the average transmission delay within one episode.

Finally, after enough episodes, the proposed scheme outperforms SP. On the other hand, comparing the four diagrams in Fig. 4, we can find that the greater RP is, the better RNN-DDPG performs compared to SP. This is because the less variance the network traffic has, the more easily the agent can make precise decisions based on its learned experience.

Fig. 5 shows the performance comparison between the two schemes under different traffic intensities. Compared with the other three diagrams, in Fig. 5(a) RNN-DDPG shows no obvious advantage over SP. This mainly results from the fact that when the traffic intensity is low enough, the main transmission delay in the simulation platform does not come from packet queueing in the switches, so in this condition the routing strategy cannot have a significant effect.

Figure 5. Performance under different traffic intensity within 150 episodes of training. The traffic intensities of (a)-(d) are 0.3, 0.4, 0.5 and 0.6, respectively. The X-axis shows the episode number of training and the Y-axis shows the average transmission delay within one episode.

IV. CONCLUSION

In this paper, we demonstrate the advantage of reinforcement learning in routing optimization. As a model-free scheme, the proposed scheme can adapt easily to similar circumstances and generate a near-optimal dynamic routing strategy for the network once trained. Compared to traditional routing algorithms, the RL-based method can update routing rules closely following changes in the traffic distribution of the network, while costing only a small amount of computation and storage resources once deployed online. Therefore, such methods have great potential to replace traditional routing strategies. In the future, we plan to optimize the configuration of our model and deploy it in a real-life SDN network.

V. FUTURE WORK

Though the results reached in this paper are inspiring and demonstrate the advantage of artificial intelligence in traffic engineering over fixed traditional algorithms, there are still many improvements to be made.

First, as Fig. 4 and Fig. 5 show, the smart agent cannot reach a stable performance. Based on our current speculation, this is partly due to state noise in the network that misleads the RNN into outputting different decisions. For example, in a transmitting network there are many factors that may cause jitter in the traffic, which is not a stable feature of network traffic under a given routing algorithm. However, the RNN may absorb such input noise into the training process, which may lead to overfitting or wrong feature extraction. Therefore, a fine-tuned set of parameters for the RNN should be researched to reach a more noise-resistant neural network.

Second, the exploration speed of DDPG should be enhanced. As in the algorithm, in a continuous environment an exploration with just one noise sample at each parameter update step may lead to low efficiency, since a single noise draw can hardly land in a better region of the large action space of the continuous environment. What is more, such an exploration method also leads to a slimmer chance of jumping out of a sub-optimal region into a better one. Therefore, we are targeting the

research of a multi-step exploration scheme in our next stage of work, which should help the training converge more quickly.

Third, once the agent has been well trained, it has learned the experience of a certain network topology. Thus, it is currently hard to transplant the agent to another network topology, and when the current topology changes the agent cannot work nearly as well. In this case, the incremental deployment of a trained agent still needs to be researched.

REFERENCES

[1] D. D. Clark, C. Partridge, J. C. Ramming, et al. 'A knowledge plane for the internet'. Proceedings of SIGCOMM'03, Karlsruhe, Germany, 2003.
[2] R. W. Thomas, L. A. Dasilva, and A. B. Mackenzie. 'Cognitive networks'. First IEEE International Symposium on New Frontiers in Dynamic Spectrum Access Networks, Baltimore, MD, USA, 2005.
[3] H. Derbel, N. Agoulmine, and M. Salaün. 'ANEMA: Autonomic network management architecture to support self-configuration and self-optimization in IP networks'. Computer Networks, 2009, 53(3), pp. 418-430.
[4] M. Zorzi, A. Zanella, A. Testolin, et al. 'COBANETS: A new paradigm for cognitive communications systems'. International Conference on Computing, Networking and Communications, IEEE, 2016, pp. 1-7.
[5] A. Mestres, A. Rodriguez-Natal, J. Carner, et al. 'Knowledge-Defined Networking'. ACM SIGCOMM Computer Communication Review, 2016, 47(3), pp. 2-10.
[6] N. Agoulmine, S. Balasubramaniam, D. Botvitch, et al. 'Challenges for Autonomic Network Management'. 2006.
[7] N. Wang, K. Ho, G. Pavlou, and M. Howarth. 'An overview of routing optimization for internet traffic engineering'. IEEE Communications Surveys & Tutorials, 2008, 10(1).
[8] M. Barabas, G. Boanea, R. A. Bogdan, and V. Dobrota. 'Congestion Control Based on Distributed Statistical QoS-Aware Routing Management'. Przeglad Elektrotechniczny, 2013, 89(2b), pp. 251-256.
[9] W. Huang, G. Song, H. Hong, and K. Xie. 'Deep Architecture for Traffic Flow Prediction: Deep Belief Networks with Multitask Learning'. IEEE Transactions on Intelligent Transportation Systems, 2014, 15(5), pp. 2191-2201.
[10] N. Kato, Z. M. Fadlullah, B. Mao, F. Tang, O. Akashi, T. Inoue, and K. Mizutani. 'The deep learning vision for heterogeneous network traffic control: proposal, challenges, and future perspective'. IEEE Wireless Communications, 2017, 24(3), pp. 146-153.
[11] J. A. Boyan and M. L. Littman. 'Packet routing in dynamically changing networks: A reinforcement learning approach'. Advances in Neural Information Processing Systems, 1994, pp. 671-678.
[12] S.-C. Lin, I. F. Akyildiz, P. Wang, and M. Luo. 'QoS-Aware Adaptive Routing in Multi-layer Hierarchical Software Defined Networks: A Reinforcement Learning Approach'. 2016 IEEE International Conference on Services Computing (SCC), June 2016, pp. 25-33.
[13] G. Stampa, M. Arias, D. Sánchez-Charles, V. Muntés-Mulero, and A. Cabellos. 'A Deep-Reinforcement Learning Approach for Software-Defined Networking Routing Optimization'. arXiv preprint arXiv:1709.07080, 2017.
[14] T. P. Lillicrap, J. J. Hunt, A. Pritzel, et al. 'Continuous control with deep reinforcement learning'. arXiv preprint, 2015.
[15] D. Silver, G. Lever, N. Heess, et al. 'Deterministic policy gradient algorithms'. International Conference on Machine Learning, JMLR.org, 2014, pp. 387-395.
[16] A. Lakhina, K. Papagiannaki, M. Crovella, et al. 'Structural analysis of network traffic flows'. Joint International Conference on Measurement and Modeling of Computer Systems, ACM, 2004, pp. 61-72.

