Fig. 2. Network topologies from the Milan city center dataset.

3.2. Overall Model Architecture

We present the overall model architecture of our network optimization approach in this section, which aims to minimize delay and congestion in the network using a Deep Q-Network (DQN) algorithm. The main components of the system are the network environment, the DQN agent, and the training and evaluation processes.

The network environment is represented by a graph data structure, where nodes correspond to network devices (e.g., routers, switches) and edges represent the communication links between these devices. The graph is initialized with the given network topology, link capacities, and traffic rates between node pairs. The environment is responsible for maintaining the network state, which consists of the current node and its neighboring nodes, and for updating the link utilization based on the chosen actions. The environment also provides the agent with reward signals, which are calculated from the delay and congestion status of the network.

The DQN agent is in charge of determining the best policy for routing traffic flows across the network. It comprises the Q-network, a deep neural network used to estimate the action-value function (Q-function). The agent also keeps a target network, a replica of the Q-network that is updated regularly to increase training stability. The agent uses an epsilon-greedy strategy to balance exploration and exploitation, selecting actions either randomly or according to the current Q-function estimates. The agent learns from its experiences by updating the Q-network with Q-learning, which minimizes the gap between expected and target Q-values.

During the training phase, the agent interacts with the environment and updates its Q-network based on the experiences it has gathered. In each episode, the agent starts at a randomly selected initial node and proceeds to make routing decisions until the episode terminates. The agent's performance is evaluated periodically during training by calculating the average reward obtained over a sliding window of episodes. The training process continues until the agent's performance converges or a predefined stopping criterion is met.

To evaluate the performance of the DQN agent, we can plot various metrics such as the reward versus the number of episodes, the delay versus the number of flows, and the link utilization distribution. These metrics help in understanding the effectiveness of the DQN agent in optimizing network performance and provide insights into the agent's learning process. In addition, the trained agent can be further tested in various network scenarios to assess its adaptability and robustness in handling different network conditions.
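To make the environment-agent interaction concrete, the sketch below outlines a minimal graph-based environment with the reset/step interface described above. It is an illustrative skeleton only, assuming networkx; the class name, the edge attribute names (capacity, delay, load), and the reward weights are assumptions, not the paper's implementation.

```python
import random
import networkx as nx

class RoutingEnv:
    """Minimal sketch of the graph-based routing environment of Sec. 3.2.

    Assumed edge attributes: 'capacity' (packets), 'delay' (seconds), 'load' (packets/s).
    """

    def __init__(self, graph: nx.Graph):
        self.graph = graph
        self.current = None

    def reset(self):
        # Each episode starts at a randomly selected initial node.
        self.current = random.choice(list(self.graph.nodes))
        return self._state(self.current)

    def _state(self, node):
        # State = (current node, its neighbours), cf. the state space definition below.
        return node, list(self.graph.neighbors(node))

    def step(self, next_node, traffic_rate):
        # Infeasible action: next_node is not a neighbour of the current node.
        if not self.graph.has_edge(self.current, next_node):
            return self._state(self.current), float("-inf"), True

        edge = self.graph.edges[self.current, next_node]
        edge["load"] = edge.get("load", 0.0) + traffic_rate
        utilization = edge["load"] / edge["capacity"]

        # Placeholder reward combining utilization and link delay (weights are illustrative).
        reward = -(0.5 * utilization + 0.5 * edge["delay"])
        self.current = next_node
        done = False
        return self._state(self.current), reward, done
```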
The current node and its nearby nodes in the network reflect the state space of our problem. Formally, a state s_c ∈ S is defined as a tuple (c, N(c)), where c ∈ V represents the current node in the network graph, V is the set of all vertices (nodes) of the graph, and N(c) is the set of neighboring nodes directly connected to node c through an edge. The state space size |S_c| is given by the number of nodes in the network, |V|. To represent any particular state as a feature vector, we can use an adjacency matrix A ∈ R^{|V|×|V|}, where A_{i,j} = 1 if nodes i and j are connected, and 0 otherwise. The state vector s ∈ R^{|V|} for node c can then be obtained as the c-th row of the adjacency matrix A. This representation allows the DQN agent to learn from the connectivity information of the network and to make routing decisions based on the current state.

The action space corresponds to the routing decisions made by the DQN agent. In our problem, an action a ∈ A is the selection of a neighboring node to which the traffic flow is forwarded from the current node. Formally, the action space A(c) for node c consists of all neighboring nodes in N(c), i.e., A(c) = {n | n ∈ N(c)}. The action space size |A(c)| is determined by the number of neighboring nodes connected to node c. Because the agent has a finite number of neighboring nodes to select from at each step, the action space is discrete.

To implement the DQN algorithm, we use a deep neural network to estimate the action-value function Q(s_c, a_c; θ), where θ are the network parameters. The state vector s is fed into the Q-network, and the output is a vector of Q-values for all possible actions, i.e., Q(s_c, a_c; θ) ∈ R^{|A(c)|}. The agent chooses the action a_c that maximizes the Q-value, i.e., a_c = arg max_a Q(s_c, a; θ). By learning the optimal Q-function, the DQN agent is able to make routing decisions that minimize delay and congestion in the network.
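As a concrete illustration of this encoding, the snippet below builds the state vector as the c-th row of the adjacency matrix and the action space as the neighbor set of the current node, using networkx. The small example graph is hypothetical; in the paper the topologies come from the mobile edge computing dataset.

```python
import networkx as nx
import numpy as np

# Hypothetical 5-node topology for illustration only.
G = nx.Graph([(0, 1), (0, 2), (1, 3), (2, 3), (3, 4)])
A = nx.to_numpy_array(G, nodelist=sorted(G.nodes))  # adjacency matrix, shape |V| x |V|

def state_vector(c: int) -> np.ndarray:
    """State of node c = c-th row of the adjacency matrix."""
    return A[c]

def action_space(c: int) -> list:
    """Discrete action space A(c) = neighbours of the current node c."""
    return sorted(G.neighbors(c))

print(state_vector(3))   # [0. 1. 1. 0. 1.]
print(action_space(3))   # [1, 2, 4]
```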
3.4. Reward Function

We aim to minimize the overall bandwidth utilization and reduce the average delay in the network. To achieve this, we have designed a reward function that captures the essential aspects of this goal. The reward function R(s_c, a_c, s'_c) computes the total reward for performing action a_c in state s_c and transitioning to state s'_c.

Our reward function is a combination of three components: the change in link utilization, the change in the average delay, and a penalty term for selecting infeasible actions. We describe each component and its mathematical representation below.

1. Change in link utilization: The link utilization is a measure of how much of the available bandwidth is being used in the network. It is calculated as the ratio of the current traffic rate to the link capacity. Our objective is to minimize the overall link utilization in the network, so we compute the change in link utilization before and after taking action a_c: ∆U(s_c, a_c, s'_c) = U(s'_c, a_c) − U(s_c, a_c), where U(s_c, a_c) denotes the link utilization in state s_c for action a_c, and U(s'_c, a_c) is the link utilization in the next state s'_c for action a_c. A negative value of ∆U(s_c, a_c, s'_c) indicates that the link utilization has decreased, which is favorable.

2. Change in average end-to-end delay: The average delay in the network is affected by the routing decisions made by the DQN agent. To encourage the agent to minimize the average delay, we compute the change in the average delay before and after taking action a_c: ∆D(s_c, a_c, s'_c) = D(s'_c, a_c) − D(s_c, a_c), where D(s_c, a_c) is the average delay in state s_c for action a_c, and D(s'_c, a_c) is the average delay in the next state s'_c for action a_c. A negative value of ∆D(s_c, a_c, s'_c) indicates that the average delay has decreased, which is desirable.

3. Penalty term for infeasible actions: In some cases, the agent may select an action that is infeasible, such as routing traffic to a node that is not directly connected to the current node. To discourage the agent from selecting infeasible actions, we introduce a penalty term in the reward function:

P(s_c, a_c) = −∞ if action a_c is infeasible in state s_c, and 0 otherwise.

Now, we can combine the three components to form the overall reward function:

R(s_c, a_c, s'_c) = α ∆U(s_c, a_c, s'_c) + β ∆D(s_c, a_c, s'_c) + P(s_c, a_c)

where α and β are scaling factors that allow us to control the relative importance of the change in link utilization and the change in average delay. By adjusting the values of α and β, we can fine-tune the reward function to align with our specific goals.

Through this reward function, the DQN agent is encouraged to make routing decisions that minimize both the overall bandwidth utilization and the average delay in the network. The penalty term ensures that the agent does not choose infeasible actions, further improving the routing algorithm's performance. By learning the action-value function Q(s_c, a_c; θ) that maximizes the cumulative reward, the DQN agent can effectively solve the routing problem in our network scenario.
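A minimal sketch of this reward computation is given below. The helper methods link_utilization, average_delay, and is_feasible are assumed placeholders for the environment's own measurements, and a large negative constant stands in for −∞ so that the value stays finite during training; neither choice is prescribed by the paper.

```python
INFEASIBLE_PENALTY = -1e6  # stands in for the -inf penalty term P(s, a)

def reward(env, s, a, s_next, alpha=1.0, beta=1.0):
    """R(s, a, s') = alpha * dU + beta * dD + P(s, a), cf. Sec. 3.4.

    `env` is assumed to expose link_utilization(state, action),
    average_delay(state, action) and is_feasible(state, action).
    """
    if not env.is_feasible(s, a):
        return INFEASIBLE_PENALTY

    # Change in link utilization: negative means less bandwidth is used.
    d_util = env.link_utilization(s_next, a) - env.link_utilization(s, a)
    # Change in average end-to-end delay: negative means faster paths.
    d_delay = env.average_delay(s_next, a) - env.average_delay(s, a)

    # alpha and beta are the scaling factors weighting the two terms.
    return alpha * d_util + beta * d_delay
```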
3.5. Deep Q-Network Architecture as Q-value Approximator

The Deep Q-Network (DQN) architecture is a key component of this study on traffic routing, where it serves as an approximator for the action-value function (Q-function). In the context of our mobile edge network topology, the DQN agent is responsible for learning an optimal routing policy that minimizes both the overall bandwidth utilization and the average delay. The state-action pair (s_c, a_c) is mapped to a scalar value by the Q-function Q(s_c, a_c; θ), which represents the expected cumulative reward of executing action a_c in state s_c and then following an optimal policy. Here, θ denotes the parameters of the neural network that approximates the Q-function. The primary aim of the agent is to learn the optimal policy that can guide its routing decisions.

In the DQN architecture, we employ a deep neural network to approximate the Q-function. The network takes the state vector s ∈ R^{|V|} as input, where |V| is the number of nodes in the network, and outputs a vector of Q-values Q(s, a; θ) ∈ R^{|A(c)|} for all possible actions a ∈ A(c), where |A(c)| is the number of neighboring nodes connected to the current node c. The agent selects the action with the highest Q-value, a* = arg max_a Q(s, a; θ), to minimize delay and congestion in the network.

To train the DQN, we use experience replay and a target network. Experience replay stores a buffer of past experiences (s, a, r, s'), where s, a, r, and s' denote the current state, the action taken, the reward obtained, and the next state, respectively. During training, we sample random mini-batches of past experiences to update the network parameters θ. This technique reduces the correlation between consecutive updates, leading to a more stable learning process.

To calculate the target Q-values for the update step, we additionally use a target network with parameters θ^-. The target network is a replica of the main network whose parameters are updated regularly with a soft update rule, θ^- ← (1 − τ) θ^- + τ θ, where τ is a small constant. This approach helps stabilize the learning process by providing a more consistent set of target values. The DQN agent is trained with a variant of Q-learning that minimizes the loss function

L(θ) = E[(Y^Q − Q(s, a; θ))^2],

where Y^Q = r + γ max_{a'} Q(s', a'; θ^-) is the target Q-value and γ is the discount factor. The expectation is taken over mini-batches sampled at random from the experience replay buffer, which introduces stochasticity that improves model robustness and generalization.
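The sketch below shows one way to realize this approximator and the soft target update, assuming PyTorch (the paper does not name its deep learning framework). The layer sizes follow the 64- and 32-unit hidden layers reported in Sec. 4.1; the class and function names are illustrative.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Maps a state vector of size |V| to one Q-value per action."""

    def __init__(self, num_nodes: int, num_actions: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(num_nodes, 64), nn.ReLU(),   # hidden layer 1 (64 units, ReLU)
            nn.Linear(64, 32), nn.ReLU(),          # hidden layer 2 (32 units, ReLU)
            nn.Linear(32, num_actions),            # Q(s, a; theta) for each action a
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.layers(state)

@torch.no_grad()
def soft_update(target: nn.Module, online: nn.Module, tau: float = 0.005):
    """theta_minus <- (1 - tau) * theta_minus + tau * theta (Sec. 3.5)."""
    for tp, op in zip(target.parameters(), online.parameters()):
        tp.mul_(1.0 - tau).add_(tau * op)
```

In practice the output dimension has to be fixed, for example to the maximum node degree, with infeasible actions handled at decision time; the paper itself discourages infeasible choices through the −∞ penalty of Sec. 3.4.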
Algorithm 1 Deep Q-Network Algorithm
1. Initialization/Input: Establish initial values of θ for the Deep Q-Network Q(s_c, a_c; θ) and set θ^- = θ for the target network Q'(s_c, a_c; θ^-). Set the maximum capacity of the replay memory B at initialization.
2. Output: Trained DQN Q(s_c, a_c; θ) for optimal routing decisions.
3. Steps:
4. For episode in range (1, N+1):
   (a) Set the initial state s_c.
   (b) For t in range (1, T+1):
       i. With probability δ choose a random action a_c; otherwise choose a_c = arg max_{a'_c} Q(s_c, a'_c; θ).
       ii. Execute action a_c, observe the next state s'_c and the reward r.
       iii. Save the transition (s_c, a_c, r, s'_c) in the replay memory B.
       iv. Sample a random mini-batch of transitions (s_{c_i}, a_{c_i}, r_i, s'_{c_i}) from B.
       v. Compute the target value y_i = r_i + γ max_{a'_c} Q'(s'_{c_i}, a'_c; θ^-).
       vi. Update the network parameters θ by gradient descent to minimize the loss L(θ) = Σ_i (y_i − Q(s_{c_i}, a_{c_i}; θ))^2.
       vii. Update the target network parameters: θ^- = τ θ + (1 − τ) θ^-, where τ is a small positive constant.
       viii. Set s_c = s'_c.
5. Prediction:
   (a) For a new state s_n, feed it to the trained Deep Q-Network.
   (b) Obtain the Q-values Q(s_n, a; θ) for all possible actions a.
   (c) Select the action a* = arg max_a Q(s_n, a; θ) for the optimal routing decision.
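The following Python sketch mirrors the main loop of Algorithm 1, reusing the QNetwork and soft_update helpers from the previous sketch. The hyperparameter defaults, the replay-buffer implementation, and the env interface (reset/step returning a state vector, a reward, and a done flag) are assumptions for illustration, not the authors' code.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

# QNetwork and soft_update are defined in the sketch of Sec. 3.5.

def train_dqn(env, num_nodes, num_actions, episodes=1000, steps=50,
              gamma=0.99, lr=1e-4, eps=1.0, eps_min=0.05, eps_decay=0.995,
              batch_size=64, tau=0.005, buffer_size=10_000):
    q_net = QNetwork(num_nodes, num_actions)
    target_net = QNetwork(num_nodes, num_actions)
    target_net.load_state_dict(q_net.state_dict())        # theta_minus = theta
    optimizer = torch.optim.Adam(q_net.parameters(), lr=lr)
    replay = deque(maxlen=buffer_size)                     # replay memory B

    for _ in range(episodes):
        s = torch.as_tensor(env.reset(), dtype=torch.float32)
        for _ in range(steps):
            # i. epsilon-greedy action selection
            if random.random() < eps:
                a = random.randrange(num_actions)
            else:
                a = int(q_net(s).argmax())
            # ii.-iii. act, observe, store the transition
            s_next, r, done = env.step(a)
            s_next = torch.as_tensor(s_next, dtype=torch.float32)
            replay.append((s, a, r, s_next, done))
            s = s_next
            if len(replay) < batch_size:
                continue
            # iv.-v. sample a mini-batch and compute the targets y_i
            batch = random.sample(replay, batch_size)
            ss, aa, rr, sn, dd = zip(*batch)
            ss, sn = torch.stack(ss), torch.stack(sn)
            aa = torch.tensor(aa)
            rr = torch.tensor(rr, dtype=torch.float32)
            dd = torch.tensor(dd, dtype=torch.float32)
            with torch.no_grad():
                y = rr + gamma * (1.0 - dd) * target_net(sn).max(dim=1).values
            # vi. gradient step on the squared error
            q = q_net(ss).gather(1, aa.unsqueeze(1)).squeeze(1)
            loss = F.mse_loss(q, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # vii. soft update of the target network
            soft_update(target_net, q_net, tau)
            if done:
                break
        eps = max(eps_min, eps * eps_decay)   # decay exploration over time
    return q_net
```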
3.6. DQN-based Routing Algorithm for Discrete Action Space

This section introduces the DQN-based routing algorithm for a discrete action space, which learns the best routing policy for traffic routing in mobile edge networks by combining the deep Q-network architecture, the state and action spaces, and the environment described above. The key goal of our routing algorithm is to minimize the overall bandwidth utilization and the average delay while respecting the constraints imposed by the network topologies and the traffic demands.

The routing algorithm can be summarized as follows. We initialize the mobile edge network environment, which consists of the network topology, traffic rates, and other network parameters. The state space is a representation of the current node and its neighboring nodes, while the action space consists of the available routing decisions at each node. We initialize the DQN agent with a deep neural network that approximates the Q-function Q(s_c, a_c; θ), where θ are the network parameters. The agent selects the action a_c that maximizes the Q-value, i.e., a_c = arg max_a Q(s_c, a; θ). By learning the optimal Q-function, the DQN agent is able to make routing decisions that minimize delay and congestion in the network. The DQN-based routing algorithm iteratively updates the Q-function by minimizing the mean squared error (MSE) loss L(θ) between the predicted Q-values and the target Q-values:

L(θ) = E_{(s_c, a_c, r, s'_c) ∼ D}[G]    (1)

G = (r + γ max_{a'_c} Q(s'_c, a'_c; θ^-) − Q(s_c, a_c; θ))^2    (2)

where D is the experience replay buffer, (s_c, a_c, r, s'_c) is a transition sampled from the buffer, γ denotes the discount factor, and θ^- represents the parameters of the target Q-network. The Q-network parameters θ are updated using gradient descent:

θ ← θ − α ∇_θ L(θ)    (3)

where α is the learning rate.

As the agent learns, it explores the network environment by choosing actions according to an ε-greedy strategy. Initially, the agent selects random actions with probability ε, which is then decayed over time to encourage exploitation of the learned Q-function. The exploration-exploitation trade-off is controlled by the parameters ε_start, ε_end, and ε_decay. Through interaction with the environment, experience gathering, and Q-function updates, the agent learns from its experiences.
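One common way to realize this schedule is an exponential interpolation between ε_start and ε_end, sketched below; the paper names the three parameters but not the exact decay formula, so this form is an assumption.

```python
import math

def epsilon(step: int, eps_start: float = 1.0, eps_end: float = 0.05,
            eps_decay: float = 1000.0) -> float:
    """Exploration rate after `step` environment interactions.

    Decays exponentially from eps_start towards eps_end; eps_decay sets
    the time scale (in steps) of the decay.
    """
    return eps_end + (eps_start - eps_end) * math.exp(-step / eps_decay)

# Example: epsilon(0) == 1.0, epsilon(1000) ~= 0.40, epsilon(5000) ~= 0.056
```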
The DQN-based routing algorithm converges when the changes in the Q-function become negligible or when a predefined number of episodes has been completed. After training, the agent can make optimal routing decisions based on the learned Q-function, which aims to minimize the overall bandwidth utilization and reduce the average delay in the network.

By integrating the developed environment, the state and action spaces, and the deep Q-network architecture, our DQN-based routing algorithm is capable of learning effective routing policies for mobile edge networks with discrete action spaces. The algorithm's mathematical foundation ensures that it is both theoretically sound and practically effective in addressing the complex challenges of traffic routing in mobile edge networks.

4. PERFORMANCE EVALUATION

In this section, the experimental results are discussed based on different benchmark settings.

4.1. Simulation Setup

The simulation setup is designed to evaluate the performance of the proposed DQN-based routing algorithm in a mobile edge network topology. The Python networkX module was used to generate the network topologies during the simulations, and the DQN algorithm was implemented in a custom-built framework. All simulations were executed on the Kaggle platform, which provides a cloud-based environment with shared CPU and GPU resources.

We considered network topologies with a varying number of nodes (up to 50 nodes) and the connectivity described in [4], representing the mobile edge network. The link capacities and the service rates for the topology were set to 10^4 packets and 3 × 10^3 packets/second, respectively. A Poisson process with mean rate λ_k, where λ_k is drawn from a uniform distribution on the interval between 10 and 300 packets/second, controls the arrival rate of the k-th flow entering the network. We randomly chose source and destination node pairs and adjusted λ_k in each iteration to account for the unpredictability of network flows.

The architecture of the DQN agent consists of two fully connected hidden layers with 64 and 32 units, respectively. We employed the rectified linear unit (ReLU) activation function because of its efficiency and reliability. The neural networks were trained using the Adam optimizer. The mini-batch size was set to 64 and the discount factor γ was set to 0.99. Before the learning process begins, 100 transition steps are collected. The learning rates for the Adam optimizer and the DQN agent were set to 10^{-4}. The agent used an epsilon-greedy exploration strategy to explore the environment thoroughly, with the epsilon value decaying over time.

We evaluated the effectiveness of our DQN-based routing method in simulation and compared it with traditional shortest-path routing algorithms such as Dijkstra's and the Bellman-Ford algorithm. We also conducted simulations with varying network parameters to evaluate the adaptability and robustness of the proposed approach in different scenarios.
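The snippet below illustrates how such a setup can be generated with networkx and numpy: a random topology, per-link capacity and service-rate attributes, and Poisson flow arrivals with λ_k drawn uniformly from [10, 300] packets/second. The generator choice (gnm_random_graph) and the attribute names are assumptions; the paper takes its connectivity from the dataset in [4].

```python
import networkx as nx
import numpy as np

rng = np.random.default_rng(0)

def make_topology(num_nodes=25, num_edges=50):
    """Random topology with the capacity/service-rate settings of Sec. 4.1."""
    G = nx.gnm_random_graph(num_nodes, num_edges, seed=0)
    nx.set_edge_attributes(G, 10_000, "capacity")      # 10^4 packets
    nx.set_edge_attributes(G, 3_000, "service_rate")   # 3 x 10^3 packets/second
    return G

def sample_flows(G, num_flows=10, horizon_s=1.0):
    """Each flow k: a random (source, destination) pair and Poisson arrivals
    with rate lambda_k ~ Uniform(10, 300) packets/second."""
    nodes = list(G.nodes)
    flows = []
    for _ in range(num_flows):
        src, dst = rng.choice(nodes, size=2, replace=False)
        lam = rng.uniform(10, 300)                      # packets/second
        arrivals = rng.poisson(lam * horizon_s)         # packets within the horizon
        flows.append({"src": int(src), "dst": int(dst),
                      "rate": float(lam), "packets": int(arrivals)})
    return flows

flows = sample_flows(make_topology())
```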
4.2. Simulation Results

We evaluated the performance of our proposed deep reinforcement learning (DRL) based traffic routing method in mobile edge networks using a dataset of four network topologies from the paper "A dataset for mobile edge computing network topologies" [9]. We compared our approach with the naive approach, which forwards packets over the fewest possible hops, and with a DQN-based routing method that uses as its state the concatenation of the node features, link bandwidth, link latency, and traffic rate vector. We assessed performance by plotting the agent's rewards against the number of iterations for our proposed strategy on each of the four network topologies. We found that the rewards increased over the iterations in all topologies, suggesting that our DRL algorithm becomes more adept at selecting actions as it goes through more iterations. We also found that the convergence time increased with the network size and the complexity of the topology.

The primary objective of our simulations is to investigate the effectiveness of our model in minimizing overall bandwidth utilization and reducing the average end-to-end delay. To achieve this objective, we trained our model on a network topology of 25 nodes and 50 edges for 1000 episodes and evaluated its performance based on the rewards obtained at each episode.

The reward function used in our simulation considers both bandwidth utilization and end-to-end delay, making it a comprehensive metric for evaluating the performance of our model. As shown in Figure 6, the model converges at a reward value of around 0.65, demonstrating the effectiveness of the suggested DQN-based routing method in lowering the average end-to-end delay and the total bandwidth utilization. The graph clearly shows that the rewards obtained by the model gradually increase with each episode, eventually converging at around 0.65. This indicates that the proposed model is effective in learning optimal routing decisions that minimize overall bandwidth utilization and reduce the average end-to-end delay. Our simulation results demonstrate that the suggested DQN-based routing algorithm in this study is effective.

Fig. 4. Proposed DQN Algorithm trained on 50 episodes for the 25N50E network topology.

Fig. 6. Proposed DQN Algorithm trained over 1000 episodes for the 25N50E network topology.
6. REFERENCES