
2020 2nd International Conference on Computer Communication and the Internet

Research on Topology Planning for Wireless Mesh Networks Based on Deep Reinforcement Learning

Changsheng Yin
College of Information and Communication, National University of Defense Technology, Wuhan, China
e-mail: [email protected]

Ruopeng Yang
College of Information and Communication, National University of Defense Technology, Wuhan, China
e-mail: [email protected]

Xiaofei Zou
College of Information and Communication, National University of Defense Technology, Wuhan, China
e-mail: [email protected]

Wei Zhu
College of Information and Communication, National University of Defense Technology, Wuhan, China
e-mail: [email protected]

Abstract—Focusing on the access point deployment and topology control problem in Wireless Mesh Networks (WMNs), a topology planning method based on deep reinforcement learning is proposed. A sample-data generation method using Monte Carlo tree search and self-play is developed, and a policy and value network based on a residual network is established. A TensorFlow-based model is built to solve the training problem. Finally, simulation results show that the proposed method can provide efficient network planning solutions with good timeliness and validity.

Keywords-wireless mesh networks; topology planning; reinforcement learning; intelligence

I. INTRODUCTION

As a typical wireless multi-hop network, the wireless Mesh network can effectively solve the "last mile" broadband access problem. It combines the advantages of wireless LAN and Ad Hoc networks, such as self-organization, self-repair and high bandwidth, and can integrate seamlessly with wireless technologies such as WiMAX and WiFi, making full use of the strengths of the respective networks to provide better service for users [1]. In the network planning phase, the effective deployment of access points plays an important role in controlling the cost of network construction. At the same time, the quality of the wireless network topology plays a decisive role in the throughput performance of the network.

In recent years, many scholars have carried out extensive research and experiments on wireless Mesh network technologies, and access point deployment and topology control in wireless mesh network planning have been a particular focus. The authors in [2] consider the impact of communication load balancing on wireless networks and translate the AP deployment problem into an integer linear programming problem. Guillaume [3] aims to deploy the fewest APs for service coverage and proposes a genetic algorithm to solve the problem. Xiang Ling [4] combines AP deployment and channel configuration to optimize wireless network performance and fairness, proposing a heuristic algorithm that searches the candidate AP positions and channel combinations for the one that maximizes the objective function. All in all, although there have been many studies on communication network planning problems, most of them are based on traditional heuristic algorithms, which place certain requirements on prior experience and historical data and suffer from low effectiveness and local optimality. Moreover, methods based purely on quantitative analysis lack operability and versatility and are difficult to apply effectively.

Aiming at the problems existing in access point deployment for WMNs, this paper jointly considers the two factors of network coverage and connectivity and introduces a deep reinforcement learning algorithm [4][5] focused on solving the topology planning problem in WMNs. Firstly, according to the characteristics of WMNs, a board-game analogy is used to establish an abstract model of network planning: the WMN and its nodes are mapped to a chessboard and chess pieces, completing the migration of the network planning problem. Then, following the reinforcement learning idea, Monte Carlo Tree Search (MCTS) [6] combined with self-play is used to generate training sample data, and a planning policy network and a value evaluation network based on a Residual Network (ResNet) [7] are built, establishing a deep neural network model for WMNs planning. On this basis, the evaluation function of reinforcement learning is established according to the actual application requirements of WMNs, and the neural network model is trained with TensorFlow. Finally, the trained model is used for real-time network planning. The results show that the algorithm is highly feasible and more efficient.

II. WMNS PLANNING PROBLEM

A. Problem Description

This work aims at satisfying network performance requirements while reducing network cost, and mainly covers three aspects, i.e., backbone node placement, access node placement and the topology of the WMNs. That is, given the locations of the known user nodes and the available network node equipment, and under the premise of satisfying the reliability and service distribution constraints, the goal is to find the economically optimal locations and topology relationships of the backbone nodes and access nodes. The WMNs is modeled as a graph: the vertices represent the user nodes, the access nodes and the backbone nodes, and the edges represent transmission links. The undirected graph G = <V, E> is used, where the vertex set V = {B, A, U} includes three types of nodes: N backbone nodes B = {B_1, B_2, ..., B_N}, M access nodes A = {A_1, A_2, ..., A_M}, and L user nodes U = {U_1, U_2, ..., U_L}, as shown in Figure 1.

Figure 1. WMNs topology diagram (B: backbone node, A: access node, U: user node; edges denote transmission links).
B. Mathematical Model

a) Link establishment constraints: if the distance between two nodes is within the communication range, a communication link can be established. A user node can only access the network through an access node; an access node can establish links with user nodes and backbone nodes; and backbone nodes can establish links with each other to form the transport backbone network. The constraint expression is as follows:

E(V_i, V_j) = \begin{cases} 1, & d_{V_i V_j} \le D \\ 0, & d_{V_i V_j} > D \end{cases}    (1)

where D \in \{D_{UA}, D_{AB}, D_{BB}\} denotes, respectively, the maximum communication distance between a user node U and an access node A, between an access node A and a backbone node B, and between two backbone nodes.

b) User node access constraints: every user node must have an access node within its communication distance through which it accesses the network, as shown in Equation (2):

C_{U-A} = \sum_{i=1}^{L} \sum_{j=1}^{M} E(U_i, A_j) > 0    (2)

where E(U_i, A_j) indicates whether there is a link between the user node U_i and the access node A_j; it is 1 if there is a link and 0 otherwise.

c) Access node access constraints: every access node must have a backbone node within its communication distance through which it accesses the backbone network, as shown in Equation (3):

C_{A-B} = \sum_{i=1}^{M} \sum_{j=1}^{N} E(A_i, B_j) > 0    (3)

where E(A_i, B_j) indicates whether there is a link between the access node A_i and the backbone node B_j; it is 1 if there is a link and 0 otherwise.

d) Backbone network connectivity constraints: all backbone nodes must form a connected network, that is, between any two backbone nodes there is either a direct communication link or a path relayed through other backbone nodes. Equivalently, the undirected graph G_B = <V, E> composed of all the backbone nodes is a connected graph, i.e., for any two backbone nodes B_i and B_j there exists an alternating sequence of vertices and edges \Gamma = (B_i = v_0 - e_1 - v_1 - e_2 - \cdots - v_k - e_{k+1} = B_j).
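Constraints (1)-(3) and the connectivity requirement translate directly into a feasibility check for a candidate plan. The following Python sketch is illustrative only: the coordinate representation, thresholds and function names are assumptions rather than details from the paper. It tests per-node coverage as described in the text and backbone connectivity with a breadth-first search.

```python
# Hypothetical feasibility check for one candidate plan (not from the paper).
import numpy as np
from collections import deque

def link(p, q, d_max):
    """Eq. (1): a link exists iff the Euclidean distance is within d_max."""
    return np.linalg.norm(np.asarray(p) - np.asarray(q)) <= d_max

def plan_is_feasible(users, aps, backbones, d_ua, d_ab, d_bb):
    # Eq. (2): every user node must reach at least one access node.
    if not all(any(link(u, a, d_ua) for a in aps) for u in users):
        return False
    # Eq. (3): every access node must reach at least one backbone node.
    if not all(any(link(a, b, d_ab) for b in backbones) for a in aps):
        return False
    # Connectivity constraint: the backbone graph G_B must be connected (BFS).
    if backbones:
        seen, queue = {0}, deque([0])
        while queue:
            i = queue.popleft()
            for j in range(len(backbones)):
                if j not in seen and link(backbones[i], backbones[j], d_bb):
                    seen.add(j)
                    queue.append(j)
        if len(seen) != len(backbones):
            return False
    return True
```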
III. NETWORK PLANNING METHOD BASED ON DEEP REINFORCEMENT LEARNING

Figure 2. Schematic diagram of the network planning method based on deep reinforcement learning (element abstraction; self-play based on MCTS to generate training samples; model training with a deep residual neural network; comparison of the current model with the historical optimal model and update of the MCTS tree; output of the historical optimal model, which maps user node locations to a planning scheme).

In order to solve the problem of obtaining deep learning training samples, this paper adopts an intelligent WMNs planning method based on deep reinforcement learning and game playing. It mainly includes WMNs planning element abstraction, MCTS-based sample data generation, deep-learning-based model training, and model-based network planning. The specific steps are shown in Figure 2.

A. Elements Abstraction

Firstly, the WMNs planning problem is abstracted, including gridding the planning area into a chessboard-like board. The user nodes, the access nodes and the backbone nodes are regarded as chess pieces, where the user node positions form the initial state of the board. Unlike general board games, there is only one player, and the player places the access nodes and the backbone nodes. The win condition is that, after the agreed numbers of access nodes and backbone nodes have been placed, the constraints (2)-(4) of the mathematical model in Section II are satisfied simultaneously; in that case the player wins, otherwise the player loses.

B. Training Sample Generation Based on MCTS

Reinforcement learning is used to continuously generate game data through self-play. First, in order to ensure the diversity of the sample data, the initial board of each game, that is, the layout of the user nodes, is arranged randomly. Second, the policy model used in each game is always the latest model produced by the training process, to avoid falling into a local region of the search space. The model's policy output and value output are then used as the search direction, the Monte Carlo tree search algorithm is used to explore the possible action space, and the game is continued according to the returned move probabilities until it ends; if the game is won, its data is entered into the training sample set. The specific steps are as follows.

a) Select: assume that the current state of the search tree is s, the selected action is a, and the edge between nodes is e(s, a); each edge stores a quadruple, namely the visit count N(s, a), the cumulative action value W(s, a), the mean action value Q(s, a) and the prior probability P(s, a). At each step t on the path from the root node s_0 to the leaf node s_l, the action a_t is determined from the quadruples stored at the current search time, that is, the action with the largest combined action value in the current state s_t is selected. The calculation is as follows:

a_t = \arg\max_a \big( Q(s_t, a) + U(s_t, a) \big)    (4)

U(s_t, a) = c_{puct} \, P(s_t, a) \, \frac{\sqrt{\sum_b N(s_t, b)}}{1 + N(s_t, a)}    (5)

P(s_t, a) = (1 - \epsilon) \, p(s_t, a) + \epsilon \, \eta    (6)

Here c_{puct} is a hyper-parameter that balances exploration and exploitation, \sum_b N(s_t, b) is the total visit count of state s_t, p(s_t, a) is the probability of action a in the model's policy output, \eta is noise added to enhance robustness, and \epsilon is the inertia (mixing) factor.
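As an illustration of Eqs. (4)-(6), the selection step can be written in a few lines of NumPy. The array-based edge statistics and the default value of c_puct are assumptions made for this sketch, not values given in the paper.

```python
# Minimal sketch of the PUCT-style selection step of Eqs. (4)-(6).
import numpy as np

def select_action(N, W, Q, P, c_puct=5.0, eps=0.25, noise=None):
    """N, W, Q, P: 1-D arrays of visit counts, cumulative values,
    mean values and prior probabilities over the legal actions of s_t."""
    if noise is not None:                              # Eq. (6): mix noise into the prior
        P = (1.0 - eps) * P + eps * noise
    U = c_puct * P * np.sqrt(N.sum()) / (1.0 + N)      # Eq. (5)
    return int(np.argmax(Q + U))                       # Eq. (4)
```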
b) Expand and evaluate: if the current node of the search tree is a non-leaf node, the expansion continues; if the leaf node s_l is reached, the evaluation is performed. From the current trained neural network model f_\theta, the policy output p_l and the value output v_l are obtained, and the quadruple of each edge e(s_l, a) is initialized, that is, N(s_l, a), W(s_l, a) and Q(s_l, a) are set to 0. The current board data is then fed into the neural network model to evaluate the current position.

c) Backup: after the expansion and evaluation are completed, starting from the leaf node and using the edge information of each node in the search tree, the result is propagated back to the root node edge by edge, and the quadruple information on each edge is updated. The updates of the visit count N(s_t, a_t), the cumulative action value W(s_t, a_t) and the mean action value Q(s_t, a_t) are as follows:

N(s_t, a_t) = N(s_t, a_t) + 1    (7)

W(s_t, a_t) = W(s_t, a_t) + v_t    (8)

Q(s_t, a_t) = \frac{W(s_t, a_t)}{N(s_t, a_t)}    (9)

where v_t is the value estimate of the neural network f_\theta(s_t). As the number of simulations increases, the action value Q gradually becomes stable and is no longer directly tied to the policy output p_t of the neural network.
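A minimal sketch of the backup update of Eqs. (7)-(9), assuming the edges visited on the path from the root to the evaluated leaf are collected in a list; the dictionary-based edge representation is an illustrative assumption.

```python
# Sketch of the backup step: the leaf evaluation v is propagated back
# along the visited edges of the search tree.
def backup(path, v):
    """path: list of edge-statistics dicts for the edges (s_t, a_t) visited
    from the root to the evaluated leaf; v: value estimate f_theta(s_l)."""
    for edge in path:
        edge["N"] += 1                       # Eq. (7)
        edge["W"] += v                       # Eq. (8)
        edge["Q"] = edge["W"] / edge["N"]    # Eq. (9)
```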
d) Play: following steps a) to c) above, after 400 Monte Carlo tree searches, the history information stored on the edges of the search tree, such as the visit count of each child of the root, is used with a simulated-annealing-style temperature to obtain the move probability distribution \pi(a \mid s_0), i.e.:

\pi(a \mid s_0) = \frac{N(s_0, a)^{1/\tau}}{\sum_b N(s_0, b)^{1/\tau}}    (10)

where \tau is the simulated annealing (temperature) parameter; its function is to avoid identical openings in every game and to effectively expand the diversity and effectiveness of the search samples.

The above procedure is used to place nodes move by move. After each move, the expanded child nodes and subtree information of the current search tree are retained until the game ends. Only when the final result of the game is a win for the player are all the corresponding board states and the move probability distributions under those states taken as the training sample set and the evaluation set.
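Equation (10) amounts to a temperature-scaled normalisation of the root visit counts. The sketch below is a hypothetical illustration; the sampling step shown in the comments is one common way to use \pi(a \mid s_0) and is not prescribed by the paper.

```python
# Sketch of Eq. (10): turning root visit counts into a move distribution
# with temperature tau, after the 400 simulations mentioned in the text.
import numpy as np

def move_probabilities(root_visit_counts, tau=1.0):
    counts = np.asarray(root_visit_counts, dtype=np.float64)
    scaled = counts ** (1.0 / tau)       # N(s0, a)^(1/tau)
    return scaled / scaled.sum()         # normalise over candidate moves

# Example: choose the next node placement by sampling from pi.
# pi = move_probabilities(counts_at_root, tau=1.0)
# action = np.random.choice(len(pi), p=pi)
```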
e) Sample data expansion: taking advantage of the equivalence of the board under rotation and mirror flipping, the sample data is expanded by rotating and mirroring the board data. The original sample is rotated by n×45°, where n = {0, 1, ..., 7}, generating in total eight times the original data, and all of these samples are used together as the training input of the neural network.
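A hedged sketch of the eight-fold expansion: on a square grid the eight equivalent positions are commonly generated from the four 90° rotations and their mirror images, which is one way to realize the expansion described above. The (C, H, W) plane layout for the board state and the (H, W) layout for the MCTS move distribution are illustrative assumptions.

```python
# Hypothetical eight-fold data augmentation via board symmetries.
import numpy as np

def expand_sample(planes, pi_board):
    """planes: (C, H, W) feature planes; pi_board: (H, W) move probabilities
    reshaped to the board. Returns a list of eight (planes, pi) pairs."""
    out = []
    for k in range(4):
        p = np.rot90(planes, k, axes=(1, 2))            # rotate spatial axes
        q = np.rot90(pi_board, k)
        out.append((p, q))
        out.append((np.flip(p, axis=2), np.flip(q, axis=1)))  # mirror image
    return out
```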

C. Model Training Using a Deep Residual Neural Network

A ResNet-based deep neural network is trained on the generated samples to obtain the policy-value network for network planning. After each fixed training batch, the policy-value network model is updated and compared with the historical optimal model, and both the optimal model and the latest model are saved. In order to ensure the diversity of the exploration space, the latest trained policy-value network is applied to the self-play stage of the MCTS to generate better self-play sample data, thereby embedding the self-play sample generation and the neural network training process in a loop and accelerating the convergence of the training model.

a) Training data description: the sample data is described by binary feature planes. A total of five Width×Height planes are used. The first three planes represent the positions of the user nodes, the access nodes and the backbone nodes, respectively: a cell is 1 if the corresponding node occupies that position and 0 otherwise. The fourth plane represents the position of the player's last move; only that position is 1 and the rest of the plane is 0. The fifth plane represents the type of the last move: if it was an access node the entire plane is 0, and if it was a backbone node the entire plane is 1.
b) Neural network structure: the structure of the neural network is shown in Figure 3. First, a four-layer plain convolutional network is used, consisting of 32, 64, 128 and 256 filters of size 3×3 with the ReLU activation function. The network then splits into two branches, a planning policy network (Policy Network) and a value network (Value Network). The policy branch uses four 1×1 dimensionality-reduction filters and a fully connected layer, and uses the softmax function to output the selection probability P of each node position in the planning space. The value branch uses two 1×1 dimensionality-reduction filters and fully connected layers, and uses the tanh function to output a score C, i.e., f_\theta(s) = (P, C).

Figure 3. Neural network structure (input: W×H×5 feature planes; convolutional trunk: 3×3 convolutions with 32, 64, 128 and 256 filters; policy head: 1×1 convolution with 4 filters, fully connected layer, softmax output P; value head: 1×1 convolution with 2 filters, fully connected layers, tanh output V).

c) Training objectives: as described above, the input of the policy network and the value network is a description of the current position, and the outputs are the probability of each feasible action in the current position and the score of the current position. The training of the policy-value network is based on the sample data generated by self-play. The training goal is therefore to make the action probabilities output by the policy network approach the probabilities output by the MCTS, and to make the score output by the value network predict the true final result more accurately; that is, the following loss function is minimized during training:

Loss = (C' - C)^2 - \pi^{T} \log P + g \lVert \theta \rVert^2    (11)

where C' is the final game result, \pi is the move probability distribution output by the MCTS, and the last term is a regularization term on the network parameters \theta.
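The structure in Figure 3 and the loss of Eq. (11) can be sketched with the Keras API of TensorFlow, which the paper states was used for the implementation. The sketch follows the layer sizes described in the text; the optimizer, the L2 coefficient and other hyper-parameters are illustrative assumptions, and the residual connections of a full ResNet trunk are omitted for brevity.

```python
# Hedged sketch of the policy-value network and the loss of Eq. (11):
# squared value error + policy cross-entropy + L2 weight regularisation.
import tensorflow as tf
from tensorflow.keras import layers, regularizers, Model

def build_policy_value_net(width, height, l2=1e-4):
    reg = regularizers.l2(l2)
    inputs = tf.keras.Input(shape=(height, width, 5))      # five feature planes
    x = inputs
    for filters in (32, 64, 128, 256):                      # common conv trunk
        x = layers.Conv2D(filters, 3, padding="same",
                          activation="relu", kernel_regularizer=reg)(x)
    # Policy head: 1x1 dimensionality reduction, softmax over the board.
    p = layers.Conv2D(4, 1, activation="relu", kernel_regularizer=reg)(x)
    p = layers.Flatten()(p)
    policy = layers.Dense(width * height, activation="softmax",
                          kernel_regularizer=reg, name="policy")(p)
    # Value head: 1x1 dimensionality reduction, dense layer, scalar score.
    v = layers.Conv2D(2, 1, activation="relu", kernel_regularizer=reg)(x)
    v = layers.Flatten()(v)
    v = layers.Dense(64, activation="relu", kernel_regularizer=reg)(v)
    value = layers.Dense(1, activation="tanh",
                         kernel_regularizer=reg, name="value")(v)
    model = Model(inputs, [policy, value])
    # Eq. (11): value MSE + policy cross-entropy; the g*||theta||^2 term is
    # contributed by the kernel regularizers attached above.
    model.compile(optimizer="adam",
                  loss={"policy": "categorical_crossentropy", "value": "mse"})
    return model
```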

D. Model-Based Network Topology Planning

Network topology planning is carried out offline with the neural network model produced by training. The model saved during the training process contains all the parameters of the neural network. In the planning stage, binary feature planes containing the initial position information, such as the locations of the user nodes, are constructed and used as the input of the neural network, and the output of the neural network is the network planning result.
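A minimal sketch of this offline planning step, reusing the hypothetical encode_state and build_policy_value_net helpers from the earlier sketches; greedily taking the arg-max of the policy output at each step is one simple decoding strategy and is an assumption of this example, not a procedure stated in the paper.

```python
# Illustrative use of the trained model: encode the current board, query the
# network, and return the most probable placement together with its value.
import numpy as np

def plan_step(model, planes):
    x = planes.transpose(1, 2, 0)[np.newaxis, ...]   # (1, H, W, 5) for Keras
    policy, value = model.predict(x, verbose=0)
    return int(np.argmax(policy[0])), float(value[0, 0])
```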

IV. SIMULATION

The experimental platform has an Intel Core i5-8210 CPU, 8 GB of memory, a 1 TB hard disk and the Ubuntu 18 operating system. The program is written in Python with open-source packages, and the neural network is built with the TensorFlow library. Considering the computing power of the computer, the planning area is rasterized at a scale of 0.5 km per grid cell. With a flat 3 km × 3 km terrain as the background, the communication area grid is 6×6, the coverage radius of an access node is set to 0.5 km, the communication distance between a backbone node and an access node is 0.5 km, the communication distance between backbone nodes is 2 km, and the number of user nodes is 4. Network planning models are trained for 6×6, 8×8 and 10×10 grids, and the termination condition of the training is 250 training rounds. Model checks show that all three models can be used effectively to plan WMNs after completing the 250 training sessions.

Figure 4. Relationship between loss, entropy and training batch.

As shown in Figure 4, as the number of training batches increases, the loss and entropy values at all three scales decrease continuously. Because of its small search space, the 6×6 scale has the smallest initial loss and entropy values and is the first to produce a new strategy; although its entropy and loss values then decrease slowly, its strategy is the first to become stable, that is, it converges fastest. For the larger 8×8 and 10×10 cases, the initial loss and entropy values are larger; as the number of training batches increases, the loss and entropy values decrease continuously and better strategies are continuously generated. However, even after the loss and entropy values have dropped below 2, and despite their rapid decrease, the strategy is still being updated; even after 250 training batches it has not stabilized, and its convergence speed is relatively slow.

Figure 5. Relationship between loss, entropy and training time.

As shown in Figure 5, with the number of training batches fixed at 250, the loss and entropy values of the three scales decrease continuously as the training time increases. Owing to its small search space, the 6×6 scale trains fastest: it takes only 1 hour and 4 minutes to complete the 250 training sessions, its strategy stops updating after 5 minutes, and practical application of the model verifies that the resulting strategy meets the network planning requirements. For the larger 8×8 and 10×10 cases, the initial loss and entropy values are larger; as the training time increases, the loss and entropy values decrease continuously and better strategies are continuously generated, but even after the training time reaches 1 hour the loss and entropy values are still decreasing, the resulting strategy model is not yet ideal in practical applications, and its convergence time is relatively longer.

In order to ensure the flexibility and effectiveness of the planning, multiple sets of feasible planning schemes are generated in the planning implementation phase for the commander to select from. The planning results for the 10×10 scale are shown in Figure 6, and the generated plans are in line with the planning requirements. The planning times for different planning scales and numbers of schemes are shown in Table 1. Even when 3 feasible schemes are planned at the 10×10 scale, the time is within 1 minute, which can meet the requirements of real WMNs planning.

Figure 6. Four planning options.

TABLE I. PLANNING TIME FOR DIFFERENT SCALES

                     Planning time (s)
Network scale   1 set of plans   2 sets of plans   3 sets of plans
6×6             12.4             16.9              21.3
8×8             19.0             24.1              29.0
10×10           22.6             28.9              35.2

V. CONCLUSION

In order to carry out WMNs planning more efficiently and intelligently, this paper proposes a WMNs topology planning method based on deep reinforcement learning. A mathematical model of WMNs planning is constructed; MCTS and deep reinforcement learning are combined to design a convolution-based residual network module; and the intelligent planning model is obtained by training on planning data generated by self-play. The experimental results show that the proposed planning method can effectively solve the communication network topology planning problem, does not depend on historical planning data or human intervention, and has high autonomy and flexibility. The next step will be to study the efficiency of intelligent network planning algorithms in larger planning spaces and at higher precision, and to further improve the effectiveness and applicability of the algorithm.

REFERENCES

[1] Vallone S. An energy efficient channel assignment and routing algorithm for multi-radio wireless mesh networks[J]. Ad Hoc Networks, 2012, 10(6): 1043-1057.
[2] Youngseok Lee, Kyoungae Kim, Choi Y. Optimization of AP placement and channel assignment in wireless LANs[C]. Local Computer Networks, 2002. Proceedings. LCN 2002. 27th Annual IEEE Conference on, 2002: 831-836.
[3] Guillaume de la Roche, Raphael Rebeyrotte, Katia Jaffres-Runser, et al. A QoS-based FAP criterion for indoor 802.11 wireless LAN optimization[C]. Proceedings of the IEEE International Conference on Communications (ICC '06), 2006: 5676-5681.
[4] Xiang Ling, Yeung K L. Joint access point placement and channel assignment for 802.11 wireless LANs[J]. IEEE Transactions on Wireless Communications, 2006, 5(10): 2705-2711.
[5] Lecun Y, Bengio Y, Hinton G. Deep learning[J]. Nature, 2015, 521(7553): 436-444.
[6] David Silver, Thomas Hubert, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play[J]. Science, 2018, 362: 1140-1144.
[7] Max Jaderberg, Wojciech M. Czarnecki, et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning[J]. Science, 2019, 364: 859-865.
[8] Siqi Liu, Guy Lever, Josh Merel, Saran Tunyasuvunakool, et al. Emergent coordination through competition[C]. ICLR 2019, arXiv:1902.07151v2 [cs.AI], 2019-2-21.
[9] Bin Wu, Qiang Fu, Jing Liang, Peng Qu, et al. Hierarchical macro strategy model for MOBA game AI[J]. arXiv:1812.07887v1 [cs.MA], 2018-12-19.
[10] Mnih V, Kavukcuoglu K, Silver D, Graves A, Antonoglou I, Wierstra D, Riedmiller M. Playing Atari with deep reinforcement learning[C]. NIPS 2013.
[11] Sutton R S, Barto A G. Reinforcement Learning: An Introduction[M]. Cambridge, USA: MIT Press, 1998.
[12] M. Hausknecht and P. Stone. Deep recurrent Q-learning for partially observable MDPs[EB/OL]. arXiv preprint, 2017-1-11 [2017-11-16], https://arxiv.org/abs/1507.06527.
[13] Van Hasselt H, Guez A, Silver D. Deep reinforcement learning with double Q-learning[C]. Proceedings of the AAAI Conference on Artificial Intelligence, Phoenix, USA, 2016: 2094-2100.
[14] Silver D, Schrittwieser J, Simonyan K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676): 354-391.
[15] Kun Shao, Yuanheng Zhu, Dongbin Zhao. StarCraft micromanagement with reinforcement learning and curriculum transfer learning[J]. IEEE Transactions on Emerging Topics in Computational Intelligence, 2018(99): 1-12.
[16] David Silver, Aja Huang, et al. Mastering the game of Go with deep neural networks and tree search[J]. Nature, 2016, 529: 484-489.
