Distributed Channel Allocation For Mobile 6G Subnetworks Via Multi-Agent Deep Q-Learning
AI for Communication Research Group, Department of Electronic Systems, Aalborg University, Denmark
E-mail:{ra, gb}@es.aau.dk
Abstract—Sixth generation (6G) in-X subnetworks have recently been proposed as short-range low-power radio cells for supporting localized extreme wireless connectivity inside entities such as industrial robots, vehicles, and the human body. The deployment of in-X subnetworks in these entities may lead to fast changes in the interference level and hence, varying risks of communication failure. In this paper, we investigate fully distributed resource allocation for interference mitigation in dense deployments of 6G in-X subnetworks. Resource allocation is cast as a multi-agent reinforcement learning problem, and agents are trained in a simulated environment to perform channel selection with the goal of maximizing the per-subnetwork rate subject to a target rate constraint for each device. To overcome the slow convergence and performance degradation issues associated with fully distributed learning, we adopt a centralized training procedure involving local training of a deep Q-network (DQN) at a central location with measurements obtained at all subnetworks. The policy is implemented using a Double Deep Q-Network (DDQN) due to its ability to enhance training stability and convergence. Performance evaluation results in an in-factory environment indicate that the proposed method can achieve up to a 19% rate increase relative to random allocation and is only marginally worse than complex centralized benchmarks.

Index Terms—Machine learning, reinforcement learning, interference management, beyond 5G networks, resource allocation

I. INTRODUCTION

The proliferation of more demanding applications clearly indicates that wireless networks beyond 5G must be designed to cope with more stringent performance requirements in denser environments than current systems. Recent publications on sixth generation (6G) networks [1]–[4] have identified short-range wireless communication for replacing wired connectivity in applications such as industrial control at the sensor-actuator level, augmented or virtual reality, and intra-vehicle control. Replacing wired connectivity with wireless offers the inherent benefits of higher scalability, lower equipment weight, enhanced flexibility, and lower maintenance cost, among others. Clearly, some of these examples are life-critical use cases requiring performance guarantees at all times. Such use cases can also lead to dense scenarios (e.g., in-body subnetworks in a crowded environment) with a potentially high and dynamic interference footprint. In order to achieve the above requirements, mechanisms for mitigating the adverse effects of interference are important.

Radio resource allocation has been an important component of wireless research for several years as a key framework for interference mitigation. The goal of resource allocation is to optimize specified performance metric(s) (subject to practical constraints on resource availability) by adjusting the utilization of the limited radio resources such as transmit power, frequency channel, and time. Resource allocation typically involves non-convex objective functions and is known to be NP-hard with no universal optimal solution [5]. To overcome this limitation, algorithms for resource allocation have traditionally been based on hard-coded heuristics [6] or on optimization techniques such as game theory [7], genetic algorithms [8], and geometric programming [9]. Over the last few years, the focus appears to have shifted towards machine learning-based algorithms [5], resulting in a large number of published works applying supervised [10], unsupervised [11], and reinforcement learning techniques [12] for resource allocation in different types of wireless systems.

While several solutions have been proposed for resource allocation in different wireless systems over the years, works targeting the peculiar nature of short-range low-power 6G in-X subnetworks are still rather limited. In our previous works, we have proposed distributed rule-based heuristics [6], [13] and a supervised learning method [14] in which a deep neural network (DNN) is trained with data generated using centralized graph coloring for channel allocation in scenarios with dense deployment of 6G in-X subnetworks. In a recent work [15], a Q-learning method for joint power and channel allocation using quantized state information was proposed. While the results in that paper highlight the potential of Q-learning for resource allocation, the method suffers from poor scalability to large problem dimensions as well as from the effect of state quantization on the performance of Q-learning algorithms. The authors of [16] presented a complex architecture referred to as GA-Net, which combines graph attention networks (GAT), graph neural networks (GNN), and multi-agent reinforcement learning (MARL) for channel allocation in 6G subnetworks. The introduction of multi-head attention for feature extraction allows only centralized training, which requires the transmission of sensing measurements from all subnetworks to a central location, translating to high communication overhead and potential security threats. The lack of support for distributed training limits the usability of GA-Net in practical applications where connection to a central network may be impossible. Moreover, relying solely on centralized training is not feasible for in-X subnetwork applications (such as in-vehicle or in-body) where privacy constraints may hinder the transmission of raw sensing data to a central server for training. In such cases, methods that are amenable to both centralized and distributed training are needed.
In this paper, we propose a multi-agent double deep Q-network (MADDQN) method for distributed channel allocation with or without the exchange of measurements between subnetworks, which is amenable to centralized, distributed, or federated training. We perform extensive simulations to evaluate the performance of the proposed method using parameters defined for the in-factory environment. The performance and complexity analysis results show that the MADDQN method can achieve significant performance improvement relative to random allocation and has low computation complexity. The proposed method is also scalable and generalizes well to scenarios with parameters different from those used for training.

The remaining part of this paper is organized as follows. The system model, the distributed channel allocation problem, and a short overview of DQN are presented in Section II. In Section III, we present the proposed method. Performance evaluation and complexity analysis results are presented in Section IV. Finally, we draw conclusions in Section V.

II. PROBLEM FORMULATION

System Model: We consider a network with N mobile subnetworks, each serving M devices. Each subnetwork has a single access point (AP) that coordinates transmission for its associated devices. We index the subnetworks (and hence, APs) with n ∈ N = {1, 2, · · · , N} and the devices in each subnetwork with m ∈ M = {1, 2, · · · , M}. We assume that a total bandwidth, B, which is partitioned into K equal-sized channels, is available in the system and that each subnetwork operates on a single channel at each time slot. We index the channels with k ∈ {1, 2, · · · , K}. Denoting the transmit power as p_tx, the power received on the link between the nth AP and the mth device in the zth subnetwork is defined as

g^k_{n,z,m}[t] = p_tx |h^k_{n,z,m}[t]|^2 Γ^k_{n,z,m} ψ_{n,z,m},   (1)

where h^k_{n,z,m}[t], Γ^k_{n,z,m}, and ψ_{n,z,m} are the Rayleigh distributed complex small-scale gain, the path-loss, and the log-normal shadowing, respectively. By considering the Jakes model, the small-scale gain, h^k_{n,z,m}[t], is defined as

h^k_{n,z,m}[t] = ρ h^k_{n,z,m}[t − 1] + √(1 − ρ^2) ε^k_{n,z,m},   (2)

where ε^k_{n,z,m} is an i.i.d. complex Gaussian variable and ρ is the lag-1 temporal autocorrelation coefficient. The temporal autocorrelation coefficient is modeled as ρ = J_0(2π f_d T_s), where J_0(·), f_d, and T_s are the zeroth-order Bessel function of the first kind, the maximum Doppler frequency, and the slot duration, respectively.
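As a worked example of the channel model above, the following minimal NumPy sketch performs one update of the autoregressive fading process in (2), with ρ computed from the Jakes autocorrelation ρ = J_0(2π f_d T_s). The function name, tensor layout, and the 1 ms slot duration are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np
from scipy.special import j0  # zeroth-order Bessel function of the first kind

def update_small_scale_gain(h_prev, fd, Ts, rng):
    """One step of the temporally correlated Rayleigh fading in (2).

    h_prev : complex ndarray of gains at slot t-1 (any shape, e.g. N x N x M x K)
    fd     : maximum Doppler frequency [Hz]
    Ts     : slot duration [s] (assumed value for illustration)
    """
    rho = j0(2.0 * np.pi * fd * Ts)  # lag-1 autocorrelation, rho = J0(2*pi*fd*Ts)
    # i.i.d. unit-variance complex Gaussian innovation, CN(0, 1)
    eps = (rng.standard_normal(h_prev.shape)
           + 1j * rng.standard_normal(h_prev.shape)) / np.sqrt(2.0)
    return rho * h_prev + np.sqrt(1.0 - rho ** 2) * eps

# Example: 25 subnetworks, 1 device each, 4 channels, v = 2 m/s at 6 GHz
rng = np.random.default_rng(0)
fd = 2.0 * 6e9 / 3e8  # Doppler frequency v * fc / c, about 40 Hz
h = (rng.standard_normal((25, 25, 1, 4))
     + 1j * rng.standard_normal((25, 25, 1, 4))) / np.sqrt(2.0)
h = update_small_scale_gain(h, fd=fd, Ts=1e-3, rng=rng)
```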
Denoting the corresponding distance as d_{n,z,m}, the path-loss component Γ^k_{n,z,m} is expressed as Γ^k_{n,z,m} = c^2 d^{−β}_{n,z,m} / (16π^2 f_k^2), where c ≈ 3 × 10^8 m/s is the speed of light, and f_k and β are the center frequency of channel k and the path-loss exponent, respectively.

[...] where S_x is the value of a two-dimensional Gaussian random process with exponential covariance at the location of the device or AP, and d_c denotes the de-correlation distance.

At slot t, the signal-to-interference-plus-noise ratio (SINR) on the link between the AP in subnetwork n and its mth device can be expressed as

γ^k_{nm}[t] = g^k_{n,n,m}[t] / ( Σ_{n′ ∈ I_{nn′}} g^k_{n,n′,m′}[t] + σ^2 ),   (4)

where I_{nn′} denotes the set of all other subnetworks that are operating on the same channel as the nth subnetwork, and σ^2 = 10^{(−174 + nf + 10 log_10(BW))/10} is the noise power, with nf and BW denoting the noise figure and channel bandwidth, respectively. Assuming a single antenna at both the APs and the devices and considering the Shannon approximation, the achieved rate at slot t can then be written as

ζ_{nm}[t] ≈ log_2(1 + γ_{nm}[t]).   (5)
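A minimal sketch of the SINR and rate computation in (4)-(5) is given below, assuming a gain tensor g[n, z, m] that holds the power received at device m of subnetwork n from the AP of subnetwork z on the channel in use; the exact indexing convention of the paper (with the interferer's device index m′) is abstracted away, so this is an illustration rather than the authors' implementation.

```python
import numpy as np

def sinr_and_rate(g, channels, noise_power):
    """Per-device SINR (4) and Shannon-approximation rate (5) for a channel allocation.

    g           : ndarray [N, N, M], g[n, z, m] = power received at device m of
                  subnetwork n from the AP of subnetwork z (assumed layout)
    channels    : ndarray [N] of channel indices selected by each subnetwork
    noise_power : sigma^2 in linear scale
    """
    N, _, M = g.shape
    sinr = np.empty((N, M))
    for n in range(N):
        co_channel = (channels == channels[n])  # subnetworks sharing channel c_n
        co_channel[n] = False                   # exclude the serving AP itself
        desired = g[n, n, :]                    # g^k_{n,n,m}
        interference = g[n, co_channel, :].sum(axis=0)
        sinr[n] = desired / (interference + noise_power)
    rate = np.log2(1.0 + sinr)                  # zeta_{nm}[t] in bps/Hz
    return sinr, rate

# Noise power for nf = 10 dB and BW = 10 MHz (result in mW, since -174 dBm/Hz)
sigma2 = 10 ** ((-174 + 10 + 10 * np.log10(10e6)) / 10)
```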
Fig. 1. Illustration of the MADDQN-based channel allocation.
Distributed Resource Allocation Problem: We consider a resource allocation problem involving fully distributed selection of frequency channels. We consider in-X subnetworks supporting applications that require high data rates with or without minimum rate constraints. The resource optimization problem can then be defined as a constrained multi-objective task involving the maximization of N objective functions, one for each subnetwork. To support the requirement, we take the objective function as the per-subnetwork sum rate subject to a minimum rate per device constraint. The problem can be formally expressed as:

P : max_{c^t} { Σ_{m=1}^{M} ζ_{nm}(c^t) }_{n=1}^{N}   s.t. ζ_{nm} ≥ ζ_target ∀ n, m,   (6)

where c^t = [c^t_1, · · · , c^t_N] with c^t_n ∈ {1, 2, · · · , K} ∀ n denotes the vector of indices of the channels selected by all subnetworks at time t, and ζ_target is the target minimum rate, which is assumed equal for all subnetworks. The problem in (6) involves the joint optimization of N conflicting non-convex objective functions and is known to be difficult to solve. A multi-agent reinforcement learning method for solving the problem is proposed in this paper.

Deep Q-Learning Fundamentals: In deep Q-learning, a deep neural network, often called a Deep Q-Network (DQN), is used to approximate the Q-function. The DQN circumvents the limitations associated with its table-based counterpart and has been shown to provide better performance. The DQN can be expressed as

Q̂(s, a) = f(s, a, θ),   (7)

where f is a function determined by the DQN architecture and θ is a vector of the DQN parameters. The Q-value estimation is thereby reduced to the optimization of θ. This optimization is typically performed using standard gradient descent algorithms with the Huber loss defined as [18]

L(θ) = { (Γ(θ))^2,              if |Γ(θ)| ≤ δ
       { δ|Γ(θ)| − (1/2) δ^2,   otherwise,   (8)

where Γ(θ) = r(s_t, a) + γ max_{a′} Q′(s_{t+1}, a′; θ) − Q(s_t, a; θ) is the difference between the expected and predicted Q-values, and δ is the discriminating parameter of the loss function.
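The piecewise loss in (8) translates directly into code. The sketch below follows (8) as printed (quadratic branch without the usual 1/2 factor) and is meant only to illustrate its shape, with delta as the discriminating parameter; in practice a framework routine such as PyTorch's Huber/smooth-L1 loss plays the same role, up to the scaling of the quadratic branch.

```python
import numpy as np

def huber_loss(td_error, delta=1.0):
    """Huber-style loss of (8) applied to the TD error Gamma(theta).

    Quadratic for |Gamma| <= delta, linear beyond, which limits the impact
    of large TD errors on the gradient.
    """
    abs_err = np.abs(td_error)
    quadratic = td_error ** 2
    linear = delta * abs_err - 0.5 * delta ** 2
    return np.where(abs_err <= delta, quadratic, linear)
```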
III. MULTI-AGENT DDQN FOR CHANNEL ALLOCATION

We cast the resource selection described above in a MARL framework in which each subnetwork has an agent at the AP whose goal is to learn a policy for selecting a frequency channel such that its communication requirements are met, via interaction with the wireless environment as shown in Fig. 1a. As with other RL techniques, MARL requires the definition of the environment, state (or feature) space, action space, and reward signal, as well as an appropriate model for the policy. As described in Section II, a wireless environment with N mobile subnetworks, each serving M devices, is considered. The other components are described below.

State space: We consider two cases, viz. fully independent resource selection and resource selection with limited cooperation. In the former, no communication is possible among subnetworks. Each subnetwork, therefore, makes resource selection decisions based solely on its local sensing information. The latter allows communication of only sensing measurements between a subnetwork and the others in its neighbour set, denoted as D_n for the nth subnetwork. The feature set of subnetwork n is represented as

S_n = {I_{z,1}, I_{z,2}, · · · , I_{z,K}}   ∀ z ∈ {n, D_n},   (9)

where I_{z,k} is the measured aggregate interference power on channel k at the zth subnetwork. Note that the dimension of the neighbour set, |D_n|, can be varied between 0 and N − 1 to control the number of neighbours from which each subnetwork receives state information. If |D_n| = 0, we have the fully independent learning case. With |D_n| < N − 1, the strongest interfering subnetworks are included in D_n.
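A possible way to assemble the feature vector in (9) is sketched below; the flattening order, any normalisation, and the selection of the |D_n| strongest interferers are not specified in the paper and are assumptions here.

```python
import numpy as np

def build_state(interference, n, neighbours):
    """Feature vector S_n of (9): aggregate interference power per channel,
    measured locally and (optionally) reported by the |D_n| neighbours.

    interference : ndarray [N, K] with interference[z, k] = I_{z,k}
    n            : index of the subnetwork building its state
    neighbours   : list of neighbour indices D_n (empty for fully independent learning)
    """
    rows = [n] + list(neighbours)
    # A normalisation step (e.g., conversion to dB) could be applied here; not specified.
    return interference[rows, :].reshape(-1)  # length K * (1 + |D_n|)
```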
Action space: The action space is the set of all possible actions that the agent can choose from at each time. While the method presented here can be applied to the selection of any wireless resource, we consider the allocation of frequency channels. The action space for each subnetwork is therefore the set of all available frequency channels, defined as

A = {c_1, c_2, · · · , c_K},   (10)

where c_k denotes the kth channel. At each time, the nth subnetwork's action is denoted a^t_n, with a^t_n ∈ A.

Reward signal: As stated in Section II, the goal of each agent is to maximize the achieved rate while also ensuring that the target rate, ζ_target, is achieved. To guide the agent towards achieving this goal, we define the reward function considering the optimization problem defined in (6). The reward for the nth subnetwork at time t is defined as

r_n = { ζ_n,            if ζ_{nm} ≥ ζ_target ∀ m
      { ζ_n − λ Δζ_n,   otherwise,   (11)

where ζ_n = Σ_{m=1}^{M} ζ_{nm} is the sum rate achieved by all devices in subnetwork n, Δζ_n = Σ_{m=1}^{M} (ζ_target − ζ_{nm}), and λ is a control parameter which is set to ensure a balance between maximizing the achieved rate and guaranteeing that the minimum rate is at least equal to ζ_target.
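The reward in (11) maps directly to a few lines of code; the sketch below follows the definition of Δζ_n as a sum over all M devices, as written above, with λ left as a tunable argument.

```python
import numpy as np

def reward(rates_n, zeta_target, lam):
    """Reward r_n of (11) for one subnetwork.

    rates_n     : ndarray [M] of per-device rates zeta_{nm} [bps/Hz]
    zeta_target : target minimum rate per device [bps/Hz]
    lam         : penalty weight lambda
    """
    zeta_n = rates_n.sum()                   # sum rate of subnetwork n
    if np.all(rates_n >= zeta_target):
        return zeta_n
    deficit = np.sum(zeta_target - rates_n)  # Delta zeta_n, summed over all devices
    return zeta_n - lam * deficit
```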
Policy Representation: Motivated by the work in [19], where it was shown that a DQN variant referred to as Double DQN (DDQN) offered up to a 2-fold performance improvement and better training stability than the classic DQN, we adapt the DDQN with experience replay [20] in a multi-agent version for channel selection. The considered DDQN architecture is shown in Fig. 1. The DDQN comprises two networks, viz:
• Main Network: The main network acts as the action-value function approximator which maps the features to actions. This mapping for the nth subnetwork is denoted as Q(s_t, a_k; θ_t) : s_t → {q(a|s_t, θ_t) | a ∈ A}, where q(a|s_t, θ_t) denotes the expected cumulative reward for taking action a at state s_t.
• Target Network: [...]
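A minimal sketch of how the two networks cooperate in the double Q-learning update of [19] is given below: the main network selects the greedy next action and the target network evaluates it, which reduces the over-estimation bias of plain DQN. Batching, terminal-state masking, and the periodic copy of the main parameters into the target network (every T_up steps in Algorithm 1) are omitted; q_main and q_target are assumed to be callables returning a vector of Q-values.

```python
import numpy as np

def ddqn_target(reward, next_state, q_main, q_target, gamma):
    """Double-DQN bootstrap target: action selection by the main network,
    action evaluation by the target network."""
    a_star = int(np.argmax(q_main(next_state)))            # greedy action under the main network
    return reward + gamma * q_target(next_state)[a_star]   # value taken from the target network
```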
Algorithm 1 Training of MADDQN-based channel allocation
1: Input: learning rate α, discount factor γ, number of episodes T, number of episode steps N_e, batch size N_b, target network update interval T_up, switching delay τ_delay
2: Compute initial states, {s^1_n}_{n=1}^{N}
3: Initialize replay memory, {D_n}_{n=1}^{N}, main network parameters [...]

TABLE I
DEFAULT SIMULATION PARAMETERS

Parameter                            | Value   | Parameter                               | Value
Deployment area [m^2]                | 40 × 40 | Subnetwork radius [m]                   | 3.0
Number of subnetworks, N             | 25      | Velocity, v [m/s]                       | 2.0
Number of frequency channels, |A|    | 4       | Shadowing standard deviation, σs [dB]   | 5
Path-loss exponent                   | 2.7     | Carrier frequency [GHz]                 | 6
Transmit power [dBm]                 | 0       | Noise figure [dB]                       | 10
Channel bandwidth [MHz]              | 10      | Network structure                       | |S| − 24 − 24 − |A|
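To make the "Network structure" row of Table I concrete, a PyTorch sketch of an |S|-24-24-|A| multilayer perceptron is given below. The ReLU activations, the initialisation, and the example dimensions are assumptions, since the paper only specifies the layer widths.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Fully connected Q-network matching the |S|-24-24-|A| structure in Table I."""

    def __init__(self, state_dim: int, num_actions: int):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Linear(state_dim, 24),
            nn.ReLU(),
            nn.Linear(24, 24),
            nn.ReLU(),
            nn.Linear(24, num_actions),  # one Q-value per frequency channel
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.layers(state)

# Example: K = 4 channels and |D_n| = 3 neighbours -> 16 input features, 4 actions
q_main = QNetwork(state_dim=16, num_actions=4)
q_target = QNetwork(state_dim=16, num_actions=4)
q_target.load_state_dict(q_main.state_dict())  # target network starts as a copy of the main network
```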
[Fig. 2: (a) Averaged reward [bps/Hz] versus training episode with N = 25 and |D| = 0, 3, 7; (b) CDF of the per-device rate [bps/Hz] with N = 25; (c) CDF of the average rate [bps/Hz]; (d) sensitivity to the number of subnetworks; (e) sensitivity to the shadowing standard deviation; (f) running time estimates. Curves compare the Surrogate Optimizer, Centralized Coloring, MADDQN with |D| = 0, 3, 7, and Random allocation.]
Fig. 2. Plots of the learning curves (a), performance (b-c), sensitivity evaluation (d-e) results, and running time estimates (f).
[...] same channel. The delay is generated for all subnetworks at the beginning of each snapshot as a random integer factor of the transmission interval, with a maximum value of 10. A subnetwork is then allowed to perform channel switching at time instants determined by its assigned delay value.

Simulation Results: Fig. 2a shows the averaged reward over successive episodes with no target rate constraint, i.e., ζ_target = 0 bps/Hz, and with the size of the neighbour set for each subnetwork |D| ∈ {0, 3, 7}. The averaging is performed over all steps within each episode and over all subnetworks. The figure shows that convergence is achieved at approximately 1000 episodes with fully independent learning, i.e., |D| = 0, and at about 1600 episodes with |D| = 3 and |D| = 7. This indicates that an agent requires longer training to learn the feature-to-action mapping function using sensing measurements from multiple subnetworks than using only local measurements. At convergence, an averaged reward of about 4.60 bps/Hz, 4.75 bps/Hz, and 4.70 bps/Hz is achieved with |D| = 0, |D| = 3, and |D| = 7, respectively, indicating a marginal improvement of 3.3% with |D| = 3 and 2.2% with |D| = 7 compared to the fully independent case, i.e., |D| = 0.

The trained DDQN agents are deployed for distributed channel allocation and their performance is compared with three benchmark algorithms, viz:
1) Random: assign frequency channels randomly to all subnetworks at the start of a snapshot.
2) Mixed Integer Surrogate Optimizer: the surrogate optimization method [21] is applied in a centralized manner to the mixed-integer problem involving maximization of the network sum rate. This method is implemented using the surrogateopt function in MATLAB with default parameters, except for the number of iterations, which is set to 400.
3) Centralized coloring: greedy graph coloring is applied to the interference graph G, created from the matrix of mutual interference power between subnetworks with a K − 1 strongest-interfering-neighbours edge constraint. To guarantee the colorability of G, successive graph sparsification, involving removal of the weakest edges until no more than K colors are required [13], is used in the simulations (a simplified sketch of this benchmark is given after this list).
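For reference, a simplified Python stand-in for benchmark 3) is sketched below using NetworkX greedy colouring. It mirrors the described steps (edges to the K − 1 strongest interferers of each node, then sparsification of the weakest edges until at most K colours suffice), but it is an assumption-laden illustration, not the authors' implementation from [13].

```python
import numpy as np
import networkx as nx

def centralized_coloring(interference_matrix, K):
    """Greedy-coloring channel assignment on a sparsified interference graph.

    interference_matrix : ndarray [N, N] of mutual interference powers
    K                   : number of available channels (colours)
    Returns a dict mapping subnetwork index -> channel index.
    """
    N = interference_matrix.shape[0]
    G = nx.Graph()
    G.add_nodes_from(range(N))
    for n in range(N):
        strongest = np.argsort(interference_matrix[n])[::-1][:K - 1]  # K-1 strongest interferers
        for z in strongest:
            if int(z) != n:
                w = max(interference_matrix[n, z], interference_matrix[z, n])
                G.add_edge(n, int(z), weight=float(w))

    while True:
        colors = nx.greedy_color(G, strategy="largest_first")
        if max(colors.values()) < K or G.number_of_edges() == 0:
            return colors
        # More than K colours needed: drop the weakest remaining edge and retry
        weakest = min(G.edges(data="weight"), key=lambda e: e[2])
        G.remove_edge(weakest[0], weakest[1])
```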
Fig. 2b shows the empirical cumulative distribution function (CDF) of the per-device rate for the different methods. The proposed MADDQN scheme performs better than random channel allocation, similarly to centralized coloring, and only marginally worse than the iterative surrogate optimization technique. The averaged rate (or, equivalently, sum rate) performance of the different channel allocation methods is shown in Fig. 2c, where we plot the CDF of the rate averaged over all subnetworks. Compared to random allocation, the proposed MADDQN method offers between ∼15% (with |D| = 0) and ∼19% (with |D| = 3) improvement at the median of the average rate distribution and is only about ∼6% below the median average rate achieved by the centralized benchmark schemes, i.e., centralized coloring and the surrogate optimizer. We remark here that the proposed method offers the advantage of much lower signaling overhead, since only a very limited exchange of information is required.
Sensitivity evaluation: We study the robustness of the proposed method to changes in the wireless environment relative to the conditions used during training. Due to its high computation complexity, the iterative surrogate optimizer is not included in the sensitivity evaluation. The MADDQN model trained with N = 25 subnetworks and a shadowing standard deviation of σs = 5 dB is evaluated with values of N between 5 and 45 in the same 40 m × 40 m area and with σs between 1 dB and 9 dB. We plot the mean and standard deviation of the average rate as a function of the number of subnetworks in Fig. 2d and of the shadowing standard deviation in Fig. 2e. In both cases, the MADDQN method shows a similar trend as well as similar relative performance to the centralized coloring and random allocation benchmarks, indicating that all schemes are equally affected by the changes in the number of subnetworks and the shadowing standard deviation. It is therefore reasonable to conclude that the proposed scheme is robust to changes in the considered wireless parameters.

Complexity Analysis: We compare the computational complexity of the proposed MADDQN method with the benchmark algorithms by estimating the total time required to perform channel allocation for all subnetworks at each transmission instant. In Fig. 2f, we plot the average total running time per step as a function of the number of subnetworks. The figure shows that the proposed MADDQN and our implementation of greedy coloring can provide up to a factor of 2000 reduction in time complexity relative to the iterative surrogate optimizer. While the running time for centralized coloring is marginally lower than that of MADDQN for values of N between 5 and 35, the linear growth achieved by the latter makes it more attractive for deployments with a higher number of subnetworks, i.e., N ≥ 40. Note that the distributed MADDQN method has minimal signaling overhead compared to the centralized benchmarks. Assuming a constant time cost for exchanging a sensing measurement between any pair of subnetworks or from a subnetwork to the central resource manager, the signaling complexity for MADDQN and for the centralized benchmarks (i.e., centralized coloring and the surrogate optimizer) is upper bounded by O(N|Dn|) and O(N^2), respectively. As observed from the training curves in Fig. 2a and the mean rate performance in Fig. 2c, no performance improvement is achieved with values of |Dn| > K − 1. In practical interference-limited scenarios, the number of available channels, K, is much smaller than the number of subnetworks, i.e., K << N, and hence the signaling complexity for MADDQN reduces to O(N).

V. CONCLUSION

A simple multi-agent DDQN (MADDQN) approach is proposed for fully distributed dynamic channel allocation in dense deployments of 6G in-X subnetworks. The access point in each subnetwork acts as the DDQN agent, which dynamically makes channel selection decisions based on aggregate interference power per channel measurements obtained via sensing. The presented performance results indicated that DDQN agents for channel allocation can be trained with reasonably fast convergence. The MADDQN approach yields a median average rate that is up to 19% higher than baseline random allocation and only about 6% lower than the computationally intensive surrogate optimizer as well as the centralized graph coloring with high signaling overhead. Our results further indicated that the proposed method is robust to changes in the deployment density as well as the propagation parameters.

REFERENCES

[1] V. Ziegler, H. Viswanathan, H. Flinck, M. Hoffmann, V. Räisänen, and K. Hätönen, "6G architecture to connect the worlds," IEEE Access, vol. 8, pp. 173508–173520, 2020.
[2] H. Viswanathan and P. E. Mogensen, "Communications in the 6G era," IEEE Access, vol. 8, pp. 57063–57074, 2020.
[3] G. Berardinelli, P. Baracca, R. Adeogun, S. Khosravirad, F. Schaich, K. Upadhya, D. Li, T. B. Tao, H. Viswanathan, and P. E. Mogensen, "Extreme communication in 6G: Vision and challenges for 'in-X' subnetworks," IEEE Open Journal of the Communications Society, 2021.
[4] G. Berardinelli, P. Mogensen, and R. O. Adeogun, "6G subnetworks for life-critical communication," in 2nd 6G Wireless Summit, 2020.
[5] F. Hussain, S. A. Hassan, R. Hussain, and E. Hossain, "Machine learning for resource management in cellular and IoT networks: Potentials, current solutions, and open challenges," IEEE Commun. Surveys Tuts., vol. 22, no. 2, pp. 1251–1275, 2020.
[6] R. Adeogun, G. Berardinelli, I. Rodriguez, and P. E. Mogensen, "Distributed dynamic channel allocation in 6G in-X subnetworks for industrial automation," in IEEE Globecom Workshops, 2020.
[7] R. O. Adeogun, "A novel game theoretic method for efficient downlink resource allocation in dual band 5G heterogeneous network," Wireless Personal Communications, vol. 101, no. 1, pp. 119–141, Jul. 2018.
[8] U. Mehboob, J. Qadir, S. Ali, and A. Vasilakos, "Genetic algorithms in wireless networking: Techniques, applications, and issues," Soft Computing, vol. 20, no. 6, pp. 2467–2501, 2016.
[9] K. T. Phan, T. Le-Ngoc, S. A. Vorobyov, and C. Tellambura, "Power allocation in wireless relay networks: A geometric programming-based approach," in IEEE GLOBECOM, 2008, pp. 1–5.
[10] H. Sun, X. Chen, Q. Shi, M. Hong, X. Fu, and N. D. Sidiropoulos, "Learning to optimize: Training deep neural networks for interference management," IEEE Transactions on Signal Processing, vol. 66, no. 20, pp. 5438–5453, 2018.
[11] C. Sun and C. Yang, "Learning to optimize with unsupervised learning: Training deep neural networks for URLLC," in IEEE PIMRC, 2019, pp. 1–7.
[12] J. Burgueno, R. Adeogun, R. L. Bruun, C. S. M. García, I. de-la Bandera, and R. Barco, "Distributed deep reinforcement learning resource allocation scheme for Industry 4.0 device-to-device scenarios," in IEEE VTC-Fall, 2021, pp. 1–7.
[13] R. Adeogun, G. Berardinelli, and P. E. Mogensen, "Enhanced interference management for 6G in-X subnetworks," IEEE Access, vol. 10, pp. 45784–45798, 2022.
[14] R. O. Adeogun, G. Berardinelli, and P. E. Mogensen, "Learning to dynamically allocate radio resources in mobile 6G in-X subnetworks," in IEEE PIMRC, 2021.
[15] R. Adeogun and G. Berardinelli, "Multi-agent dynamic resource allocation in 6G in-X subnetworks with limited sensing information," Sensors, vol. 22, no. 13, p. 5062, 2022.
[16] X. Du, T. Wang, Q. Feng, C. Ye, T. Tao, L. Wang, Y. Shi, and M. Chen, "Multi-agent reinforcement learning for dynamic resource management in 6G in-X subnetworks," IEEE Transactions on Wireless Communications, 2022.
[17] S. Lu, J. May, and R. J. Haines, "Effects of correlated shadowing modeling on performance evaluation of wireless sensor networks," in IEEE Vehicular Technology Conference, 2015, pp. 1–5.
[18] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, 2nd ed. The MIT Press, 2018.
[19] H. van Hasselt, A. Guez, and D. Silver, "Deep reinforcement learning with double Q-learning," CoRR, vol. abs/1509.06461, 2015. [Online]. Available: http://arxiv.org/abs/1509.06461
[20] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing Atari with deep reinforcement learning," 2013.
[21] H.-M. Gutmann, "A radial basis function method for global optimization," Journal of Global Optimization, vol. 19, no. 3, pp. 201–227, 2001.