
DEEP REINFORCEMENT LEARNING BASED TRAFFIC ROUTING ON MOBILE EDGE NETWORKS

Bhavesh Ajwani1, Rajarshi Mahapatra2

1 Department of Data Science and Artificial Intelligence, IIIT-Naya Raipur, Chhattisgarh, India
2 Department of Electronics and Communication Engineering, IIIT-Naya Raipur, Chhattisgarh, India

ABSTRACT

A potential paradigm that makes compute and storage resources available at the network's edge is Mobile Edge Computing (MEC). In this paper, we propose a traffic routing algorithm for MEC networks leveraging Deep Reinforcement Learning (DRL). Our approach leverages a dataset of MEC network topologies and specifications to model the network and trains a DRL agent to select the optimal routing path on the basis of the current state of the network. We adopt a Deep Q-Learning algorithm as the learning algorithm for the agent and design the state and action spaces accordingly. To evaluate the effectiveness of our approach, we perform simulations on various MEC network topologies and compare our results with existing routing algorithms. Our results show that our approach outperforms existing methods in terms of end-to-end delay and overall bandwidth utilization, and achieves a more balanced trade-off between these metrics. Overall, the effectiveness and performance of MEC networks may be enhanced by our suggested DRL-based traffic routing method.

Index Terms— MEC, Traffic Routing, DRL, Network Topologies, Deep Q-Learning Algorithm

1. INTRODUCTION

A technology called mobile edge computing (MEC) makes it possible for wireless networks to provide high-bandwidth and low-latency services. With the advent of MEC, mobile devices can now offload compute-intensive tasks to nearby edge servers, which lowers network latency and boosts service quality. In this regard, efficient traffic routing is a key challenge that needs to be addressed in MEC networks.

Traditional routing approaches based on static and dynamic routing protocols are not capable of handling the dynamic traffic patterns and complex network topologies in MEC networks. Therefore, there is a need for developing a novel routing algorithm that can efficiently manage the network traffic and adapt to changing network conditions.

In the past few years, deep reinforcement learning (DRL) has emerged as a promising technique for developing traffic engineering strategies by combining reinforcement learning with deep neural networks [1]. In a model-free and experience-driven way, DRL-based routing schemes can learn from and adjust to complicated networks, enhancing the performance of routing policies. Using the DRL approach in software-defined networks has been shown to yield substantial improvements in routing optimization performance in previous studies [2]-[5].

However, because reinforcement learning relies on exploration to determine the best policy, there may be performance deterioration throughout the learning process, especially in the early stages, which might result in an incorrect routing strategy that increases total delay and packet loss, putting the system's dependability at risk. Moreover, when the topology of a network changes, the agent must learn the routing optimization again, which can degrade network performance in the long term, especially when the network traffic is complex. The adoption of DRL-based route optimization that involves exploration in networks carrying QoS-sensitive traffic might be devastating.

1.1. Contribution

This study makes two main contributions. First, we propose a novel approach for finding the optimal path in MEC networks to minimize bandwidth utilization. The approach leverages DRL, which enables the DRL agent to learn a policy that maps the observed state of the network to an action that optimizes network performance. Second, we demonstrate that the proposed approach outperforms the traditional hop-count-based routing method in terms of mean total delay.

1. Overall link bandwidth utilization: We develop an approach for finding the optimal path in Mobile Edge Computing (MEC) networks to minimize bandwidth utilization, using deep reinforcement learning (DRL) and the network topology data from a publicly available dataset.
2. Average delay: We demonstrate that the proposed approach reduces the average total end-to-end delay compared to the traditional hop-count-based routing method.

Fig. 1. An inefficient routing choice brought about by failing to take future flows' requirements into account
Table 1. Literature Survey of Related Works

Author(s) | Routing Scenario | Optimization Technique | Datasets | Performance Metrics
Altaweel et al. (2022) | Fog/edge networks | Resilient Socket (RSock) protocol | FEN dataset | Packet delivery rate, packet delivery delay
Wang et al. (2019) | Wireless networks | Deep Reinforcement Learning (DRL) | - | Network throughput, interference
Yu et al. (2018) | Software-defined networks | DDPG, DRL-based network control | - | Network performance
Li et al. (2022) | Communication networks | DRL-based routing approach | - | End-to-end delay, packet loss
Kang et al. (2021) | Wireless networks | DDPG-based congestion-aware routing | NS3 simulator | Packet loss, network throughput, end-to-end delay
Bhattacharyya et al. (2022) | Edge computing networks | Reinforcement Learning (RL)-based routing algorithm | - | End-to-end delay, energy efficiency
Zhao et al. (2022) | Wireless sensor networks | Q-learning-based routing protocol | NS2 simulator | Energy consumption, network lifetime

2. LITERATURE WORK

A recent study [4] by Altaweel et al. (2022) presents a routing protocol called Resilient Socket (RSock) for disaster-response settings, in which first responders' on-body gadgets create fog/edge networks (FENs). RSock is intended to be simple to deploy and evolve, and it can adapt its routing and replication decisions to deal with any dynamic FEN environment. It is essentially an identity-based routing system that uses all wireless interfaces of FEN devices and leverages EdgeKeeper, a novel coordination and naming service for FENs. Real-world experiments demonstrate that RSock performs well in disaster-response scenarios. However, RSock has several limitations, such as its inability to provide seamless mobility communication and its reliance on the EdgeKeeper coordination and naming service.

The paper [5] provides a comprehensive review of various reinforcement learning-based optimal path routing protocols proposed since the mid-1990s. The paper proposes classification criteria to compare existing reinforcement learning-based routing protocols and highlights the suitability of RL for solving optimization problems related to optimal path routing in networks. While RL methods have been shown to have reasonable overhead compared to other optimization techniques, the paper acknowledges the challenge of balancing exploration and exploitation in the learning process to prevent network performance degradation. The review offered in the study could be a useful resource for researchers investigating the use of RL in network routing optimization.

L. Wang et al. (2019) [6] published "Deep Reinforcement Learning for Dynamic Multichannel Access in Wireless Networks." The authors suggest a DRL-based method for dynamic multichannel access in wireless networks. They use a DRL algorithm to learn a channel selection policy that maximizes network throughput while minimizing interference.

H. Yu et al. (2018) [7] present a DRL-based network control framework for software-defined networking (SDN). The authors propose a DRL algorithm that learns an optimal routing policy in a data-driven manner. Their approach outperforms traditional SDN routing methods in terms of network performance.

L. Li et al. (2021) [8] propose a DRL-based routing approach for communication networks. The authors use a DRL algorithm to learn a routing policy that minimizes end-to-end average delay and packet losses. Their approach outperforms traditional routing methods in terms of network performance. Recent research has demonstrated the potential of DRL for routing optimization in communication networks. These studies show that DRL can learn effective routing policies in a data-driven manner, leading to better network performance compared to traditional routing methods.

3. NETWORK ROUTING SYSTEM MODEL USING MDP

3.1. Data Description

The dataset [9] used in this research project is a crucial component in evaluating the performance of routing algorithms and network protocols in MEC environments. It is a comprehensive collection of 50 network topologies, each consisting of a different number of nodes ranging from 10 to 60. These topologies were generated using a random graph model based on the Barabási–Albert algorithm, which is well known for creating networks with scale-free properties, and they contain information on various network parameters such as link bandwidth, computation capacity, storage capacity, and traffic rate. For each topology, the dataset provides the adjacency matrix, node attributes, and edge attributes. The link bandwidth, computation capacity, storage capacity, and traffic rate values were generated based on reference values and ranges specified in the paper. It is a helpful resource for scholars and practitioners working in the mobile edge computing field. It can be used to assess the performance of optimal path routing algorithms and networking protocols in MEC contexts, as well as to compare the performance of other algorithms in similar network situations. Moreover, this dataset can be used to generate new test scenarios for MEC environments and to benchmark the performance of newly developed routing algorithms and network protocols against existing ones.
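To make the data description concrete, the sketch below shows one way such a topology could be assembled in Python with networkx from an adjacency matrix and per-node and per-link attributes. The attribute names and the toy values are illustrative assumptions; the exact file format of the dataset in [9] is not reproduced here.

```python
import networkx as nx
import numpy as np

def build_topology(adjacency, link_bandwidth, node_compute, node_storage):
    """Assemble a MEC topology graph from an adjacency matrix plus
    per-link and per-node attributes (attribute names are illustrative)."""
    G = nx.from_numpy_array(np.asarray(adjacency))
    for u, v in G.edges():
        G[u][v]["bandwidth"] = link_bandwidth[(min(u, v), max(u, v))]
    for n in G.nodes():
        G.nodes[n]["compute"] = node_compute[n]
        G.nodes[n]["storage"] = node_storage[n]
    return G

# Toy 4-node example with made-up attribute values.
A = [[0, 1, 1, 0],
     [1, 0, 1, 1],
     [1, 1, 0, 1],
     [0, 1, 1, 0]]
bw = {e: 100.0 for e in [(0, 1), (0, 2), (1, 2), (1, 3), (2, 3)]}
G = build_topology(A, bw, node_compute=[8, 16, 16, 8], node_storage=[64] * 4)
print(G.number_of_nodes(), "nodes,", G.number_of_edges(), "edges")
```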

Fig. 2. Network topologies from Milan city center dataset

3.2. Overall model Architecture

We present the general model architecture of our network optimization effort in this section, which aims to minimize delay and congestion in the network using a Deep Q-Network (DQN) algorithm. The main components of the system are the network environment, the DQN agent, and the training and evaluation processes.

The network environment is represented by a graph data structure, where nodes correspond to network devices (e.g., routers, switches) and edges represent the communication links between these devices. The graph is initialized with the given network topology, link capacities, and traffic rates between node pairs. The environment is responsible for maintaining the network state, which consists of the current node and its neighboring nodes, as well as updating the link utilization based on the chosen actions. The environment also provides the agent with reward signals, which are calculated based on the delay and congestion status of the network.

The DQN agent is in charge of determining the best policy for routing traffic flows across the network. It comprises the Q-network, a deep neural network used to estimate the action-value function (Q-function). The agent also keeps a target network, which is a replica of the Q-network and is updated on a regular basis to increase training stability. The agent uses an epsilon-greedy strategy for exploration and exploitation, where it selects actions either randomly or based on the current Q-function estimates. The agent learns from its experiences by updating the Q-network using a technique known as Q-learning, which minimizes the gap between expected and target Q-values.

During the training phase, the agent interacts with the environment and modifies its Q-network depending on the experiences it has gathered. In each episode, the agent starts at a randomly selected initial node and proceeds to make routing decisions until the episode terminates. The agent's performance is evaluated periodically during training by calculating the average reward obtained over a sliding window of episodes. The training process continues until the agent's performance converges or a predefined stopping criterion is met.

To evaluate the performance of the DQN agent, we can plot various metrics such as the reward versus the number of episodes, the delay versus the number of flows, and the link utilization distribution. These metrics help in understanding the effectiveness of the DQN agent in optimizing the network performance and provide insights into the agent's learning process. In addition, the trained agent can be further tested in various network scenarios to assess its adaptability and robustness in handling different network conditions.
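As a rough illustration of the environment just described, the following sketch shows one way its interface could look in Python. The class and method names are our own, and the utilization and delay bookkeeping is deliberately simplified; the custom-built environment used in the paper is not publicly specified.

```python
import random

class RoutingEnv:
    """Minimal sketch: states are nodes, actions are neighbouring nodes,
    and the reward penalizes link utilization and delay (simplified)."""

    def __init__(self, graph, destination):
        self.G = graph          # networkx graph with per-link attributes
        self.dst = destination
        self.current = None

    def reset(self):
        # Each episode starts at a randomly selected node.
        self.current = random.choice(list(self.G.nodes()))
        return self.current

    def step(self, next_node):
        if next_node not in self.G[self.current]:
            # Infeasible action: large penalty, stay in place.
            return self.current, -1e6, False
        link = self.G[self.current][next_node]
        utilization = link.get("load", 0.0) / link.get("bandwidth", 1.0)
        delay = link.get("delay", 1.0)
        reward = -(utilization + delay)   # toy stand-in for the real reward
        self.current = next_node
        done = next_node == self.dst      # episode ends at the destination
        return self.current, reward, done
```

Under these assumptions, an episode terminates when the packet reaches its destination, which matches the episodic training procedure described above.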
Fig. 3. Overview of the network routing algorithm

3.3. State and Action Space

The current node and its nearby nodes in the network reflect the state space of our problem. Formally, a state s_c ∈ S is defined as a tuple (c, N(c)), where c ∈ V represents the current node in the network graph, V is the set of all vertices (nodes) of the graph, and N(c) is the set of neighboring nodes directly connected to node c through an edge. The state space size, |S|, is given by the number of nodes in the network, |V|. To represent any particular state as a feature vector, we can use an adjacency matrix A ∈ R^{|V|×|V|}, where A_{i,j} = 1 if nodes i and j are connected, and 0 otherwise. The state vector s ∈ R^{|V|} for node c can then be obtained as the c-th row of the adjacency matrix A. This representation allows the DQN agent to learn from the connectivity information of the network and make routing decisions on the basis of the current state.

The action space corresponds to the routing decisions made by the DQN agent. In our problem, an action a ∈ A is the selection of a neighboring node to route the traffic flow from the current node. Formally, the action space A(c) for node c consists of all neighboring nodes in N(c), i.e., A(c) = {n | n ∈ N(c)}. The action space size, |A(c)|, is determined by the number of neighboring nodes connected to node c. Because the agent has a finite number of neighboring nodes to select from at each step, the action space is discrete.

We use a deep neural network to estimate the action-value function Q(s_c, a_c; θ), where θ are the network parameters, to implement the DQN algorithm. The state vector s is fed into the Q-network, and the output is a vector of Q-values for all possible actions, i.e., Q(s_c, ·; θ) ∈ R^{|A(c)|}. The agent chooses the action a_c that maximizes the Q-value, i.e., a_c = arg max_a Q(s_c, a; θ). By learning the optimal Q-function, the DQN agent is able to make routing decisions that minimize delay and congestion in the network.
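The adjacency-row state encoding and the neighbour-restricted action set translate directly into code. The helper names below are hypothetical; only the definitions themselves come from this section.

```python
import networkx as nx

def state_vector(G, c):
    """State of node c: the c-th row of the adjacency matrix A (Section 3.3)."""
    A = nx.to_numpy_array(G, nodelist=sorted(G.nodes()))
    return A[c]

def action_space(G, c):
    """A(c): the neighbours of c, i.e. the feasible next hops."""
    return sorted(G.neighbors(c))

G = nx.barabasi_albert_graph(10, 2, seed=0)   # toy scale-free topology
print(state_vector(G, 3))                     # length-|V| 0/1 vector
print(action_space(G, 3))                     # discrete action set for node 3
```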
3.4. Reward Function

We aim to minimize the overall bandwidth utilization and reduce the average delay in the network. To achieve this, we have designed a reward function that captures the essential aspects of our goal. The reward function, R(s_c, a_c, s'_c), computes the total reward for performing action a_c in state s_c and transitioning to state s'_c.

Our reward function is a combination of three components: the change in link utilization, the change in the average delay, and a penalty term for selecting infeasible actions. We describe each component and its mathematical representation below.

1. Change in link utilization: The link utilization is a measure of how much of the available bandwidth is being used in the network. It is calculated as the ratio of the current traffic rate to the link capacity. Our objective is to minimize the overall link utilization in the network, so we compute the change in link utilization before and after taking action a_c: ΔU(s_c, a_c, s'_c) = U(s'_c, a_c) − U(s_c, a_c), where U(s_c, a_c) denotes the link utilization in state s_c for action a_c, and U(s'_c, a_c) is the link utilization in the next state s'_c for action a_c. A negative value of ΔU(s_c, a_c, s'_c) indicates that the link utilization has decreased, which is favorable.
2. Change in average end-to-end delay: The average delay in the network is affected by the routing decisions made by the DQN agent. To encourage the agent to minimize the average delay, we compute the change in the average delay before and after taking action a_c: ΔD(s_c, a_c, s'_c) = D(s'_c, a_c) − D(s_c, a_c), where D(s_c, a_c) is the average delay in state s_c for action a_c, and D(s'_c, a_c) is the average delay in the next state s'_c for action a_c. A negative value of ΔD(s_c, a_c, s'_c) indicates that the average delay has decreased, which is desirable.
3. Penalty term for infeasible actions: In some cases, the agent may select an action that is infeasible, such as routing traffic to a node that is not directly connected to the current node. To discourage the agent from selecting infeasible actions, we introduce a penalty term in the reward function as
P(s_c, a_c) = −∞ if action a_c is infeasible in state s_c, and 0 otherwise.

Now, we can combine the three components to form the overall reward function:

R(s_c, a_c, s'_c) = α ΔU(s_c, a_c, s'_c) + β ΔD(s_c, a_c, s'_c) + P(s_c, a_c),

where α and β are scaling factors that allow us to control the relative importance of the change in link utilization and the change in average delay. By adjusting the values of α and β, we can fine-tune the reward function to align with our specific goals.

Through this reward function, the DQN agent is encouraged to make routing decisions that minimize both the overall bandwidth utilization and the average delay in the network. The use of the penalty term assures that the agent does not choose infeasible actions, further boosting the routing algorithm's performance. By determining the best action-value function Q(s_c, a_c; θ) for maximizing cumulative reward, the DQN agent can effectively solve the routing problem in our network scenario.
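Written as code, the composite reward is a one-liner plus a penalty. The sketch below assumes the environment already measures link utilization and average delay before and after the action, and it uses a large finite negative constant in place of −∞ for numerical convenience.

```python
INFEASIBLE_PENALTY = -1e6   # finite stand-in for the -infinity penalty

def reward(delta_util, delta_delay, feasible, alpha=1.0, beta=1.0):
    """R(s_c, a_c, s'_c) = alpha*dU + beta*dD + P, as defined above.

    delta_util  : U(s'_c, a_c) - U(s_c, a_c)
    delta_delay : D(s'_c, a_c) - D(s_c, a_c)
    feasible    : False when the chosen next hop is not a neighbour
    alpha, beta : scaling factors weighting the two objectives
    """
    penalty = 0.0 if feasible else INFEASIBLE_PENALTY
    return alpha * delta_util + beta * delta_delay + penalty
```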
3.5. Deep Q-Network Architecture as Q-value approximator

The Deep Q-Network (DQN) architecture is a key component of this study on traffic routing, where it serves as an approximator for the action-value function (Q-function). In the context of our mobile edge network topology, the DQN agent is responsible for learning an optimal routing policy that minimizes both the overall bandwidth utilization and the average delay. The state-action pair (s_c, a_c) is mapped to a scalar value via the Q-function, Q(s_c, a_c; θ), which represents the expected cumulative reward of executing action a_c in state s_c and then adhering to an optimal policy. Here, θ denotes the parameters of the neural network that approximates the Q-function. The primary aim of the agent is to learn the optimal policy that can guide its routing decisions.

In the DQN architecture, we employ a deep neural network to approximate the Q-function. The network takes the state vector s ∈ R^{|V|} as input, where |V| is the number of nodes in the network, and the output of the network is a vector of Q-values, Q(s, ·; θ) ∈ R^{|A(c)|}, for all possible actions a ∈ A(c), where |A(c)| is the number of neighboring nodes connected to the current node c. The agent selects the action with the highest Q-value, a* = arg max_a Q(s, a; θ), to minimize delay and congestion in the network.

To train the DQN, we utilize the concepts of experience replay and a target network. Experience replay involves storing a buffer of past experiences (s, a, r, s'), where s, a, r, and s' represent the current state, the action taken, the reward obtained, and the next state, respectively. During training, we sample random mini-batches of past experiences to update the network parameters θ. This technique reduces the correlation between consecutive updates, leading to a more stable learning process.

In order to calculate the target Q-values for the update step, we additionally use a target network with parameters θ⁻. A soft update rule is used to regularly update the parameters of the target network, which is a replica of the main network: θ⁻ ← (1 − τ)θ⁻ + τθ, where τ is a small constant. This approach helps in stabilizing the learning process by providing a more consistent set of target values. A variation of the Q-learning method is used to train the DQN agent, and it involves minimizing the loss function L(θ) = E[(Y^Q − Q(s, a; θ))²], where Y^Q = r + γ max_{a'} Q(s', a'; θ⁻) is the target Q-value and γ is the discount factor. Expectations are calculated over randomly sampled mini-batches from the experience replay buffer, introducing stochasticity for improved model robustness and generalization.
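A minimal PyTorch rendering of this architecture is sketched below; the paper states only that a custom-built framework was used, so the library choice is ours. The hidden-layer sizes follow Section 4.1, the value of τ is an assumption, and fixing the output width to a maximum node degree (with masking of invalid actions) is a common workaround for the per-node action count rather than something stated in the paper.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Q-value approximator: state vector in, one Q-value per candidate next hop."""
    def __init__(self, num_nodes, max_degree):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_nodes, 64), nn.ReLU(),   # hidden sizes from Sec. 4.1
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, max_degree),             # invalid next hops would be masked
        )

    def forward(self, state):
        return self.net(state)

def soft_update(target, online, tau=0.005):
    """theta_minus <- (1 - tau) * theta_minus + tau * theta."""
    for tp, p in zip(target.parameters(), online.parameters()):
        tp.data.copy_((1.0 - tau) * tp.data + tau * p.data)

q_net = QNetwork(num_nodes=25, max_degree=8)
target_net = QNetwork(num_nodes=25, max_degree=8)
target_net.load_state_dict(q_net.state_dict())   # target starts as an exact replica
```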
Algorithm 1 Deep Q-Network Algorithm
1. Initialization/Input: Establish initial values of θ for the Deep Q-Network Q(s_c, a_c; θ) and set θ⁻ = θ for the target network Q'(s_c, a_c; θ⁻). Set the maximum capacity of the replay memory B at initialization.
2. Output: Trained DQN Q(s_c, a_c; θ) for optimal routing decisions.
3. Steps:
4. For episode in range (1, N+1):
   (a) Set the initial state s_c.
   (b) For t in range (1, T+1):
       i. Choose a random action a_c with probability δ; otherwise, choose a_c = arg max_{a'_c} Q(s_c, a'_c; θ).
       ii. Execute action a_c, observe the next state s'_c and the reward r.
       iii. Save the transition (s_c, a_c, r, s'_c) in replay memory B.
       iv. Sample a random minibatch of transitions (s_ci, a_ci, r_i, s'_ci) from B.
       v. Compute the target value y_i = r_i + γ · max_{a'_c} Q'(s'_ci, a'_c; θ⁻).
       vi. Update the network parameters θ using gradient descent to minimize the loss L(θ) = Σ_i (y_i − Q(s_ci, a_ci; θ))².
       vii. Update the target network parameters: θ⁻ = τθ + (1 − τ)θ⁻, where τ is a small positive constant.
       viii. Set s_c = s'_c.
5. Prediction:
   (a) For a new state s_n, feed it to the trained Deep Q-Network.
   (b) Obtain the Q-values Q(s_n, a; θ) for all possible actions a.
   (c) Select the action a* = arg max_a Q(s_n, a; θ) for the optimal routing decision.
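The loop in Algorithm 1 can be rendered compactly as follows. This is a sketch under several assumptions: a Gym-style environment whose step() returns (state, reward, done), states already encoded as the feature vectors of Section 3.3, integer actions, a Q-network such as the one sketched above, and a fixed exploration rate in place of the decaying schedule of Section 3.6.

```python
import random
from collections import deque

import numpy as np
import torch
import torch.nn.functional as F

def train(env, q_net, target_net, num_actions, episodes=1000, max_steps=50,
          gamma=0.99, eps=0.1, tau=0.005, batch_size=64, lr=1e-4):
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay = deque(maxlen=10_000)                      # replay memory B
    for _ in range(episodes):
        s = env.reset()
        for _ in range(max_steps):
            # i. epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(num_actions)
            else:
                a = int(q_net(torch.as_tensor(s, dtype=torch.float32)).argmax())
            # ii.-iii. act, observe, and store the transition
            s_next, r, done = env.step(a)
            replay.append((s, a, r, s_next, done))
            s = s_next
            if len(replay) >= batch_size:
                # iv.-vi. sample a minibatch, take one gradient step on the TD loss
                batch = random.sample(replay, batch_size)
                st, ac, rw, st2, dn = zip(*batch)
                st = torch.as_tensor(np.array(st), dtype=torch.float32)
                st2 = torch.as_tensor(np.array(st2), dtype=torch.float32)
                ac = torch.as_tensor(ac)
                rw = torch.as_tensor(rw, dtype=torch.float32)
                dn = torch.as_tensor(dn, dtype=torch.float32)
                with torch.no_grad():
                    y = rw + gamma * (1 - dn) * target_net(st2).max(1).values
                q = q_net(st).gather(1, ac.unsqueeze(1)).squeeze(1)
                loss = F.mse_loss(q, y)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
                # vii. soft update of the target network
                for tp, p in zip(target_net.parameters(), q_net.parameters()):
                    tp.data.mul_(1 - tau).add_(tau * p.data)
            if done:
                break
    return q_net
```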
3.6. DQN-based routing Algorithm for discrete action space

This section introduces the DQN-based discrete action space routing algorithm, which learns the best routing policy for traffic routing in mobile edge networks by utilizing the deep Q-network architecture, the state and action spaces, and the created environment. The key goal of our routing algorithm is to minimize overall bandwidth utilization and the average delay while considering the constraints imposed by the network topologies and traffic demands.

The routing algorithm can be summarized as follows. Initialize the mobile edge network environment, which consists of the network topology, traffic rates, and other network parameters. The state space is a representation of the current node and its neighboring nodes, while the action space consists of the available routing decisions for each node. We initialize the DQN agent with a deep neural network architecture to approximate the Q-function, Q(s_c, a_c; θ), where θ are the network parameters. The agent selects the action a_c that maximizes the Q-value, i.e., a_c = arg max_a Q(s_c, a; θ). By learning an optimal Q-function, the DQN agent is able to make routing decisions that minimize delay and congestion in the network. The DQN-based routing algorithm iteratively updates the Q-function by minimizing the mean squared error (MSE) loss function L(θ) between the predicted Q-values and the target Q-values:

L(θ) = E_{(s_c, a_c, r, s'_c) ∼ D}[G]   (1)

G = (r + γ max_{a'_c} Q(s'_c, a'_c; θ⁻) − Q(s_c, a_c; θ))²   (2)

where D is the experience replay buffer, (s_c, a_c, r, s'_c) is a transition sampled from the buffer, γ denotes the discount factor, and θ⁻ represents the parameters of the target Q-network. The parameters are updated using gradient descent:

θ ← θ − α ∇_θ L(θ)   (3)

where α is the learning rate.

As the agent learns, it explores the network environment by choosing actions based on an ϵ-greedy strategy. Initially, the agent selects random actions with probability ϵ, which is then decayed over time to encourage exploitation of the learned Q-function. The exploration-exploitation trade-off is controlled by the parameters ϵ_start, ϵ_end, and ϵ_decay. Through interaction with the environment, experience gathering, and Q-function updating, the agent learns from its experiences.

The DQN-based routing algorithm converges when the changes in the Q-function become negligible or when a predefined number of episodes have been completed. After training, the agent can make optimal routing decisions based on the learned Q-function, which aims to minimize the overall bandwidth utilization and reduce the average delay in the network.

By integrating the developed environment, the state and action spaces, and the deep Q-network architecture, our DQN-based routing algorithm is capable of learning effective routing policies for mobile edge networks with discrete action spaces. The routing algorithm's mathematical foundation ensures that it is both theoretically sound and practically effective in addressing the complex challenges of traffic routing in mobile edge networks.
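The decaying exploration rate controlled by ϵ_start, ϵ_end, and ϵ_decay can be implemented, for instance, as an exponential schedule; the paper does not specify the exact functional form, so the one below is an assumption.

```python
import math

def epsilon(step, eps_start=1.0, eps_end=0.05, eps_decay=500.0):
    """Anneal exploration from eps_start towards eps_end as training progresses."""
    return eps_end + (eps_start - eps_end) * math.exp(-step / eps_decay)

# epsilon(0) is about 1.0 (mostly random actions); epsilon(5000) is about 0.05 (mostly greedy).
```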
4. PERFORMANCE EVALUATION

In this section, the experimental results are discussed based on different benchmark settings.

4.1. Simulation Setup

The simulation setup is designed to evaluate the performance of the proposed DQN-based routing algorithm in a mobile edge network topology. The Python networkX module was used to generate the network topology during the simulations, and the DQN algorithm was implemented in a custom-built framework. All the simulations were executed on the Kaggle platform, which provides a cloud-based environment with shared CPU and GPU resources.

We considered a network topology consisting of a varying number of nodes (up to 50 nodes) with connectivity described in [4], representing the mobile edge network. The link capacities and the service rates for the topology were set to 10^4 packets and 3 × 10^3 packets/second, respectively. A Poisson process with an average rate λ_k, where λ_k is selected from a uniform random distribution in the interval between 10 and 300 packets/second, controls the arrival rate of the k-th flow entering the network. We randomly chose source and destination node pairings and adjusted λ_k for each iteration to account for network flow unpredictability.

The architecture of the DQN agent consists of two fully connected hidden layers with 64 and 32 units, respectively. Because of its efficiency and dependability, we employed the rectified linear unit (ReLU) activation function. The neural networks were trained using the Adam optimizer. The mini-batch size was set to 64 and the discount factor, γ, was set to 0.99. Before the learning process begins, 100 transition steps are first gathered. The learning rates for the Adam optimizer and the DQN agent were set to 10^−4. The agent used an epsilon-greedy exploration strategy to explore the environment thoroughly, with the epsilon value decaying over time.

We evaluated the effectiveness of our DQN-based routing method in the simulation and compared it with traditional shortest-path routing algorithms such as Dijkstra's and Bellman-Ford algorithms. We also conducted simulations with varying network parameters to evaluate the adaptability and robustness of our proposed approach in different scenarios.
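To illustrate the setup, the snippet below generates a scale-free topology with networkx (the Barabási–Albert model named in Section 3.1; the attachment parameter m is our choice), attaches the link capacity and service rate from Section 4.1, and draws per-flow Poisson rates λ_k uniformly from 10-300 packets/second. The number of flows is an illustrative assumption.

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(seed=0)

# Topology: up to 50 nodes; m = 2 is an illustrative choice.
G = nx.barabasi_albert_graph(n=50, m=2, seed=0)
for u, v in G.edges():
    G[u][v]["capacity"] = 10_000       # packets (Section 4.1)
    G[u][v]["service_rate"] = 3_000    # packets/second (Section 4.1)

# Traffic: K flows with Poisson arrivals, lambda_k ~ Uniform(10, 300) pkt/s.
K = 20
flows = []
for _ in range(K):
    src, dst = rng.choice(G.number_of_nodes(), size=2, replace=False)
    lam = rng.uniform(10, 300)
    packets_in_one_second = rng.poisson(lam)   # one sample of the arrival process
    flows.append((int(src), int(dst), float(lam), int(packets_in_one_second)))

print(flows[0])
```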

4.2. Simulation Results

We evaluated the performance of our proposed deep reinforcement learning (DRL) based traffic routing method in mobile edge networks using a dataset of four network topologies from the paper "A dataset for mobile edge computing network topologies" [9]. We compared our approach with the naive approach, which uses the fewest possible hops to forward packets, and a DQN-based routing method that uses as its state the concatenation of node features, link bandwidth, link latency, and the traffic rate vector. Plotting the agent's rewards against the number of iterations while using our suggested strategy in each of the four network topologies allowed us to assess performance. We found that the rewards increased over the iterations in all topologies, suggesting that our DRL algorithm becomes more adept at selecting actions as it goes through more iterations. We also found that the convergence time increased with the network size and the complexity of the topology.

The primary objective of our simulations is to investigate the effectiveness of our model in minimizing overall bandwidth utilization and reducing average end-to-end delay. To achieve this objective, we trained our model on a network topology of 25 nodes and 50 edges for 1000 episodes, and evaluated its performance based on the rewards obtained at each episode.

The reward function used in our simulation considers both bandwidth utilization and end-to-end delay, making it a comprehensive metric for evaluating the performance of our model. As shown in Figure 6, the model converges at a reward value of around 0.65, demonstrating the effectiveness of the suggested DQN-based routing method in lowering average end-to-end delay and total bandwidth utilization. The graph clearly shows that the rewards obtained by the model gradually increase with each episode, eventually converging at a reward value of around 0.65. This indicates that the proposed model is effective in learning optimal routing decisions that minimize overall bandwidth utilization and reduce average end-to-end delay. Our simulation results demonstrate that the suggested DQN-based routing algorithm is effective in achieving the objectives of minimizing overall bandwidth utilization and reducing average end-to-end delay. The reward function used in our simulation provides a comprehensive metric for evaluating the performance of our model and demonstrates the significance of considering both bandwidth utilization and end-to-end delay in designing routing algorithms for mobile edge networks.

Fig. 4. Proposed DQN Algorithm trained on 50 episodes for 25N50E network topology

Fig. 5. Proposed DQN Algorithm trained over 1000 episodes for 13N50E network topology

Fig. 6. Proposed DQN Algorithm trained over 1000 episodes for 25N50E network topology

5. CONCLUSION

In this paper, we suggested a traffic routing strategy for mobile edge networks based on deep reinforcement learning. Using a DQN algorithm, our method learns the best policy for network traffic routing according to the state of the network at that moment. We evaluated our approach on four different network topologies using the proposed action space, state space, and reward function, and compared it with a baseline naive routing method. Results showed that our strategy performed better than the baseline method in terms of average packet losses, network throughput, and end-to-end delays, demonstrating the effectiveness of the proposed approach. Additionally, the suggested method functioned effectively even with greater network sizes and could adjust to changes in network conditions. Subsequent research endeavors could concentrate on expanding this methodology to more intricate situations, like fluctuating traffic demands and dynamic network topologies.

6. REFERENCES

[1] Goyal, Mukul, et al. "Improving convergence speed and scalability in OSPF: A survey." IEEE Communications Surveys & Tutorials 14.2 (2011): 443-463.
[2] Xiao, Yang, et al. "Leveraging deep reinforcement learning for traffic engineering: A survey." IEEE Communications Surveys & Tutorials 23.4 (2021): 2064-2097.
[3] Sun, Penghao, et al. "A scalable deep reinforcement learning approach for traffic engineering based on link control." IEEE Communications Letters 25.1 (2020): 171-175.
[4] Liu, Wai-xi, et al. "DRL-R: Deep reinforcement learning approach for intelligent routing in software-defined data-center networks." Journal of Network and Computer Applications 177 (2021): 102865.
[5] Guo, Shaoyong, et al. "Trusted cloud-edge network resource management: DRL-driven service function chain orchestration for IoT." IEEE Internet of Things Journal 7.7 (2019): 6010-6022.
[6] Altaweel, Ala, et al. "RSock: A resilient routing protocol for mobile fog/edge networks." Ad Hoc Networks 134 (2022): 102926.
[7] Kim, Gyungmin, Yohan Kim, and Hyuk Lim. "Deep reinforcement learning-based routing on software-defined networks." IEEE Access 10 (2022): 18121-18133.
[8] Han, S., Liu, M., Wu, Z., and Wang, X. "A hybrid method for network traffic prediction based on improved singular spectrum analysis and convolutional neural network." Wireless Personal Communications 118 (2021): 1837-1854.
[9] Xiang, Bin, et al. "A dataset for mobile edge computing network topologies." Data in Brief 39 (2021): 107557.
[10] Chen, Bo, et al. "An approach to combine the power of deep reinforcement learning with a graph neural network for routing optimization." Electronics 11.3 (2022): 368.
[11] Liu, Yitong, Junjie Yan, and Xiaohui Zhao. "Deep-Reinforcement-Learning-Based Optimal Transmission Policies for Opportunistic UAV-Aided Wireless Sensor Network." IEEE Internet of Things Journal 9.15 (2022): 13823-13836.