Deep Reinforcement Learning Based Rate Adaptation For Wi-Fi Networks
∗ School of Electronics and Communication Engineering, Sun Yat-sen University, Guangzhou, China
† Wireless Technology Lab, 2012 Laboratories, Huawei Technologies Co., Ltd, China
Abstract—The rate adaptation (RA) algorithm, which adaptively selects the rate according to the quality of the wireless environment, is one of the cornerstones of wireless systems. In Wi-Fi networks, dynamic wireless environments are mainly due to fading channels and collisions caused by random access protocols. However, existing RA solutions mainly focus on adapting to fading channels, resulting in conservative RA policies and poor overall performance in highly congested networks. To address this problem, we propose a model-free deep reinforcement learning (DRL) based RA algorithm, named drlRA, which incorporates the impact of collisions into the reward function design. Numerical results show that the proposed algorithm improves throughput by 16.5% and 39.5% while reducing latency by 25% and 19.3% compared to state-of-the-art baselines.

Index Terms—Deep reinforcement learning, rate adaptation, Wi-Fi, CSMA/CA, MCS

I. INTRODUCTION

Wireless channel conditions are unstable due to the impact of path loss, noise, shadowing, fading, interference and radio-frequency chain impairments in wireless communication systems. To better utilize the wireless resources, rate adaptation has become one of the mandatory functionalities in IEEE 802.11 wireless local area networks (WLANs); it adaptively selects the modulation and coding scheme (MCS) based on the quality of the wireless channel. Each MCS is associated with a coding rate and constellation size, which together determine a bit rate. To increase single-link throughput, more antennas, wider channel bandwidths and higher-order modulation are adopted in current IEEE 802.11 networks (i.e., IEEE 802.11ax or Wi-Fi 6), and the number of available MCSs has increased significantly. Therefore, the rate adaptation algorithm is of great importance.

Conventional rate adaptation (RA) schemes for IEEE 802.11 networks are rule-based and can be roughly categorized as SNR-based and sampling-based [1]. In the SNR-based approach, the transmitter estimates the instantaneous signal-to-noise ratio (SNR) from the physical layer and translates it to a bit rate that can be supported by the current channel conditions. The estimated SNR has been shown to be an unreliable measure, since it can be easily corrupted by severe interference. As a result, most mainstream Wi-Fi vendors employ sampling-based RA algorithms such as Minstrel HT [2], which is used in the ath9k Wi-Fi driver, and Iwl-Mvm-Rs in the Intel IwlWifi Linux driver [3]. Sampling-based RA schemes usually select the MCS whose historical behavior performs best. The major drawback of sampling-based RAs is that an MCS can only be evaluated with enough samples; in other words, each MCS has to be probed a number of times. Such mechanisms may not respond promptly to highly dynamic wireless environments.

To overcome the shortcomings of conventional RA algorithms, recent studies have embraced artificial intelligence (AI) by exploiting its capability of prediction, i.e., learning the intrinsic relationship between wireless environmental observations and the MCS selection¹ scheme. In [1], [4], supervised learning (SL) based RA algorithms were investigated and shown to achieve significant potential gains. Nonetheless, SL may generalize poorly across different wireless environments due to the lack of online learning [5].

With the rapid development of reinforcement learning (RL), RA schemes based on RL have been proposed. In [6]–[8], the RA problem was formulated as a multi-armed bandit (MAB) problem, where each candidate MCS is encoded as a discrete arm of the MAB, and Thompson sampling was utilized for its faster convergence. However, the scenarios of interest in these works are cellular communication systems operating on licensed spectrum, which solely consider channel fading due to mobility and multi-path effects. In IEEE 802.11 networks operating on unlicensed spectrum, the dynamics of the radio environment are strongly affected by interference from hidden nodes or other co-existing transmission technologies, as well as by collisions caused by the random channel access mechanism in the Medium Access Control (MAC) layer, i.e., carrier sense multiple access with collision avoidance (CSMA/CA).

In [5], a deep reinforcement learning (DRL) based RA scheme was proposed, and its performance was verified on a commodity 802.11ac prototype. This work mainly focused on the throughput of nodes embedded with the DRL-based RA. As shown in the simulation results in Section V-C1, it would

The work of W. Lin was done during an internship in the Wireless Technology Lab, 2012 Laboratories, Huawei Technologies Co., Ltd. The work of X. Sun was supported in part by the National Key Research and Development Program of China (2019YFE0114000), and in part by the Guangdong Engineering Technology Research Center for Integrated Space-Terrestrial Wireless Optical Communication.

¹ In this paper, MCS selection and rate adaptation are interchangeable.
C. Reward

The reward function is the core design of a DRL algorithm. There are only two outcomes for each packet at MCS m, i.e., success or failure. With respect to MCS m, denote the PER as PER[m], the reward for success as r_succ[m], and the reward for failure as r_fail[m]. A direct design is to use throughput as the reward function, as in other existing work. However, this design will cause overly conservative policies in a highly congested wireless network, since nodes tend to choose a low-level MCS in the face of collisions.

To avoid this conservative policy, the agent should distinguish between two types of packet errors: packet errors caused by MCS selection and packet errors caused by collisions. This work uses PER[m] and PER_m to distinguish the two types. PER[m] contains the PER due to both a wrong MCS policy (PER_m) and collisions, and hence PER_m ≤ PER[m]. A natural idea for designing the reward function is to set the expectation of the reward equal to the objective function in (1). Hence, (1 − PER[m]) × r_succ[m] + PER[m] × r_fail[m] = (1 − PER_m) R_m. By fixing r_fail[m] = 0, r_succ[m] is derived as

    r_succ[m] = (1 − PER_m) / (1 − PER[m]) × R_m,  ∀m ∈ M.    (5)

In a highly congested wireless network, collisions become the main cause of failed packet receptions. In this case, we have PER_m ≪ PER[m]. As a result, a policy that chooses a high-level MCS will be encouraged according to (5).

In (5), R_m is estimated as

    R_m = B / (D_m + D_W),    (6)

where D_W is calculated as the expectation of the contention window (CW), i.e., D_W = 0.5 × W_i. PER_m can be estimated by means of a look-up table that stores the relationship between SNR and PER_m [12]. The SNR can be derived as the ratio of the average RSS statistics from ACKs to the energy level detected on the idle channel.
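To make the reward design concrete, the following Python sketch shows one way r_succ[m] in (5) could be computed from the observed PER[m], a look-up-table estimate of PER_m, and the rate estimate R_m in (6). It is a minimal sketch only: the payload size B, slot time, contention window W_i, and the SNR-to-PER thresholds are illustrative assumptions, not values taken from the paper.

# Sketch of the per-MCS reward in (5) and the rate estimate in (6).
# Constants below are assumed for illustration only.
MCS_RATES_MBPS = [8.6, 17.2, 25.8, 34.4, 51.6, 68.8, 77.4, 86.0, 103.2, 114.7]  # Table II
PAYLOAD_BITS = 12_000      # assumed packet size B, in bits
SLOT_US = 9                # assumed slot duration, in microseconds
CW = 15                    # assumed contention window W_i, in slots

def per_from_snr(mcs: int, snr_db: float) -> float:
    """Stand-in for the SNR -> PER_m look-up table of [12]; a real table would
    come from link-level measurements or simulations."""
    thresholds_db = [5, 8, 11, 14, 18, 21, 24, 27, 30, 33]  # assumed per-MCS thresholds
    return 0.01 if snr_db >= thresholds_db[mcs] else 0.9

def estimate_rate(mcs: int) -> float:
    """R_m = B / (D_m + D_W) from (6), with D_W = 0.5 * W_i slots of backoff."""
    d_m = PAYLOAD_BITS / MCS_RATES_MBPS[mcs]   # airtime of the payload, in microseconds
    d_w = 0.5 * CW * SLOT_US                   # expected backoff delay, in microseconds
    return PAYLOAD_BITS / (d_m + d_w)          # effective rate, in Mbps

def success_reward(mcs: int, per_observed: float, snr_db: float) -> float:
    """r_succ[m] in (5); per_observed is PER[m] measured from ACK feedback (< 1)."""
    per_m = per_from_snr(mcs, snr_db)          # channel-only error rate PER_m
    return (1.0 - per_m) / (1.0 - per_observed) * estimate_rate(mcs)

# With r_fail[m] fixed to 0, the per-packet reward is success_reward(...) for an
# ACKed packet and 0 otherwise, which encourages high-level MCSs when most
# losses are collision-induced (PER_m much smaller than PER[m]).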
IV. ALGORITHM

With the definitions of the MDP tuples w.r.t. the RA problem at hand, we can use DQN to solve it. The pseudo-code of drlRA is summarized in Algorithm 1. The neural network (NN) architecture is illustrated in Fig. 1; it is a residual network containing seven fully-connected (FC) layers. The NN takes the current state s_t as input and outputs Q = [Q(s_t, 1), · · · , Q(s_t, M)]. Action a_t is selected using an ε-greedy policy, where the agent chooses actions greedily with probability 1 − ε and chooses actions randomly with probability ε.

Fig. 1: The architecture of the neural network. The state s_t is fed through seven FC layers to produce the Q-values Q(s_t, 1), . . . , Q(s_t, M).

Algorithm 1 drlRA algorithm
Initialization: ε, γ, N, s_0, a_0, t = 0, cnt = 0, θ⁻ = θ
for the transmission instance t = 1, 2, . . . do
    Compute s_t from s_{t−1}, a_{t−1} using (3) and (4)
    Store (s_{t−1}, a_{t−1}, r_{t−1}, s_t) to the experience memory (EM)
    Input s_t to the NN in Fig. 1 with θ and output Q
    Generate action a_t from Q using the ε-greedy policy
    Calculate the reward r_t according to (5)
    for each sample e = (s, a, r, s′) in the EM do
        Compute y = r + γ max_{a′} Q(s′, a′; θ⁻)
        Compute L(θ) = (y − Q(s, a; θ))²
        Update θ by performing mini-batch gradient descent
    end for
    if (cnt mod N) == 0 then
        θ⁻ ← θ
    end if
    cnt ← cnt + 1
    t ← t + 1
end for
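The following PyTorch sketch mirrors the structure of Algorithm 1 and Fig. 1: a seven-layer fully-connected network with residual connections, ε-greedy action selection, an experience memory, and a target network θ⁻ refreshed every N steps. The state dimension, layer widths, learning rate and other hyper-parameters are placeholders, since they are not specified in this excerpt; this is a sketch under those assumptions, not the authors' implementation.

# Compact sketch of the drlRA training-loop ingredients (Algorithm 1).
import random
from collections import deque

import torch
import torch.nn as nn

STATE_DIM, M, HIDDEN = 8, 10, 32                  # assumed dimensions
GAMMA, EPS, N_TARGET, BATCH = 0.9, 0.1, 100, 32   # assumed hyper-parameters

class ResidualQNet(nn.Module):
    """Seven fully-connected layers with skip connections; outputs Q(s_t, 1..M)."""
    def __init__(self):
        super().__init__()
        self.inp = nn.Linear(STATE_DIM, HIDDEN)
        self.hidden = nn.ModuleList(nn.Linear(HIDDEN, HIDDEN) for _ in range(5))
        self.out = nn.Linear(HIDDEN, M)

    def forward(self, s):
        x = torch.relu(self.inp(s))
        for layer in self.hidden:
            x = x + torch.relu(layer(x))          # residual connection
        return self.out(x)

q_net, target_net = ResidualQNet(), ResidualQNet()
target_net.load_state_dict(q_net.state_dict())    # theta_minus <- theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
memory = deque(maxlen=10_000)                     # experience memory (EM)

def select_action(state: torch.Tensor) -> int:
    """epsilon-greedy: random MCS with probability EPS, otherwise argmax of Q."""
    if random.random() < EPS:
        return random.randrange(M)
    with torch.no_grad():
        return int(q_net(state).argmax())

def train_step(step: int) -> None:
    """One mini-batch update of theta, plus the periodic target-network sync."""
    if len(memory) < BATCH:
        return
    s, a, r, s_next = map(torch.stack, zip(*random.sample(memory, BATCH)))
    with torch.no_grad():
        y = r + GAMMA * target_net(s_next).max(dim=1).values   # TD target with theta_minus
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)   # Q(s, a; theta)
    loss = nn.functional.mse_loss(q, y)                        # (y - Q(s, a; theta))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % N_TARGET == 0:
        target_net.load_state_dict(q_net.state_dict())         # theta_minus <- theta

Each stored experience is assumed to be a tuple of tensors (s_{t−1}, a_{t−1}, r_{t−1}, s_t), e.g., memory.append((s_prev, torch.tensor(a_prev), torch.tensor(r_prev, dtype=torch.float32), s_cur)).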
V. PERFORMANCE EVALUATION

A. Simulation Setup

Simulation parameters are summarized in Table I, where D is the distance between the transmitter and receiver. The rates of the available MCSs ({C_m}) are listed in Table II. We introduce the following algorithms as baselines: the commonly used Minstrel HT [2], the experience driven rate adaptation (EDRA) [5], and the newly proposed DRL-based RA algorithm mentioned above. The parameters for drlRA are shown in Table III. The parameters for Minstrel are the exponentially weighted moving average (EWMA), the sampling window and the proportion of probing, which are set to 0.75, 100 ms and 10%, respectively (a minimal sketch of this EWMA update appears after Table II below). The EDRA contains two periods: a probing period and a transmission period. In our

[Figure: simulation topology with access points spaced 15 m apart]
TABLE I: Simulation Parameters

TABLE III: Parameters of drlRA

TABLE II: Rates of available MCSs ({C_m})
MCS m        1     2     3     4     5     6     7     8     9     10
Rate (Mbps)  8.6   17.2  25.8  34.4  51.6  68.8  77.4  86.0  103.2 114.7
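The Minstrel HT baseline referenced in the simulation setup smooths per-MCS delivery statistics with an EWMA. The sketch below illustrates how such an update could look, assuming the 0.75 parameter is the weight placed on the historical estimate; the function and variable names are illustrative.

EWMA_WEIGHT = 0.75   # assumed to be the weight on the historical estimate

def update_success_prob(prev_prob: float, attempts: int, successes: int) -> float:
    """Blend the success ratio sampled in the last window into the running
    per-MCS estimate, as sampling-based RAs such as Minstrel HT do."""
    if attempts == 0:          # nothing sampled in this window, keep the old estimate
        return prev_prob
    sampled = successes / attempts
    return EWMA_WEIGHT * prev_prob + (1.0 - EWMA_WEIGHT) * sampled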
Fig. 4: Performance comparison under Topo B: (a) throughput, (b) delay. For Bx, x indicates the number of BSSs.

Fig. 5: Convergence performance of the proposed drlRA algorithm in A3 at Topo A.

The computational complexity of inference is 47880 floating-point operations (FLOPs). The seven-layer NN may not be necessary, and there is still room to reduce FLOPs by optimizing the NN architecture.
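As a rough illustration of how such a count is obtained, the sketch below tallies the multiply-add cost of a stack of fully-connected layers; the layer widths are hypothetical placeholders and are not intended to reproduce the 47880-FLOP figure.

def dense_flops(n_in: int, n_out: int) -> int:
    # one multiply and one add per weight; bias cost ignored for simplicity
    return 2 * n_in * n_out

layer_widths = [8, 32, 32, 32, 32, 32, 32, 10]   # assumed widths of a seven-layer FC stack
total_flops = sum(dense_flops(a, b) for a, b in zip(layer_widths, layer_widths[1:]))
print(f"approximate inference cost: {total_flops} FLOPs")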
based algorithm, EDRA. Experimental results show that drlRA achieves higher overall throughput and lower latency.

Moving forward, we are interested in designing reward functions with negative values in the case of failure. This would help agents learn from their failures. Another direction for further improving the performance is to jointly design the intelligent RA algorithm and intelligent channel access, such as in [13].

REFERENCES

[1] S. Khastoo, T. Brecht, and A. Abedi, "Neura: Using neural networks to improve WiFi rate adaptation," in Proceedings of the 23rd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems. New York, NY, USA: Association for Computing Machinery, 2020, pp. 161–170.
[2] R. Albar, T. Y. Arif, and R. Munadi, "Modified rate control for collision-aware in Minstrel-HT rate adaptation algorithm," in 2018 International Conference on Electrical Engineering and Informatics (ICELTICs), 2018, pp. 7–12.
[3] R. Grünblatt, I. Guérin-Lassous, and O. Simonin, "Simulation and performance evaluation of the Intel rate adaptation algorithm," in Proceedings of the 22nd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, ser. MSWiM '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 27–34. [Online]. Available: https://doi.org/10.1145/3345768.3355921
[4] C.-Y. Li, S.-C. Chen, C.-T. Kuo, and C.-H. Chiu, "Practical machine learning-based rate adaptation solution for Wi-Fi NICs: IEEE 802.11ac as a case study," IEEE Transactions on Vehicular Technology, vol. 69, no. 9, pp. 10264–10277, 2020.
[5] S.-C. Chen, C.-Y. Li, and C.-H. Chiu, "An experience driven design for IEEE 802.11ac rate adaptation based on reinforcement learning," in IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021, pp. 1–10.
[6] V. Saxena, H. Tullberg, and J. Jalden, "Reinforcement learning for efficient and tuning-free link adaptation," IEEE Transactions on Wireless Communications, vol. 21, no. 2, pp. 768–780, 2022.
[7] J. Park and S. Baek, "Two-stage Thompson sampling for outer-loop link adaptation," IEEE Wireless Communications Letters, vol. 10, no. 9, pp. 2004–2008, 2021.
[8] V. Saxena, H. Tullberg, and J. Jalden, "Model-based adaptive modulation and coding with latent Thompson sampling," in 2021 IEEE 32nd Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), 2021, pp. 610–616.
[9] G. F. Riley and T. R. Henderson, The ns-3 Network Simulator, 2010.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[12] P. H. Tan, Y. Wu, and S. Sun, "Link adaptation based on adaptive modulation and coding for multiple-antenna OFDM system," IEEE Journal on Selected Areas in Communications, vol. 26, no. 8, pp. 1599–1606, 2008.
[13] Z. Guo, Z. Chen, P. Liu, J. Luo, X. Yang, and X. Sun, "Multi-agent reinforcement learning-based distributed channel access for next generation wireless networks," IEEE Journal on Selected Areas in Communications, vol. 40, no. 5, pp. 1587–1599, 2022.