
Deep Reinforcement Learning based Rate Adaptation for Wi-Fi Networks

Wenhai Lin∗, Ziyang Guo†, Peng Liu†, Mingjun Du∗, Xinghua Sun∗ and Xun Yang†

∗ School of Electronics and Communication Engineering, Sun Yat-sen University, Guangzhou, China
† Wireless Technology Lab, 2012 Laboratories, Huawei Technologies Co., Ltd, China

Email: [email protected], [email protected], [email protected]

The work of W. Lin was done during an internship at the Wireless Technology Lab, 2012 Laboratories, Huawei Technologies Co., Ltd. The work of X. Sun was supported in part by the National Key Research and Development Program of China (2019YFE0114000), and in part by the Guangdong Engineering Technology Research Center for Integrated Space-Terrestrial Wireless Optical Communication.

Abstract—The rate adaptation (RA) algorithm, which adaptively selects the rate according to the quality of the wireless environment, is one of the cornerstones of wireless systems. In Wi-Fi networks, dynamic wireless environments are mainly due to fading channels and to collisions caused by random access protocols. However, existing RA solutions mainly focus on adapting to fading channels, resulting in conservative RA policies and poor overall performance in highly congested networks. To address this problem, we propose a model-free deep reinforcement learning (DRL) based RA algorithm, named drlRA, which incorporates the impact of collisions into the reward function design. Numerical results show that the proposed algorithm improves throughput by 16.5% and 39.5% while reducing latency by 25% and 19.3% compared to state-of-the-art baselines.

Index Terms—Deep reinforcement learning, rate adaptation, Wi-Fi, CSMA/CA, MCS

I. INTRODUCTION

Wireless channel conditions are unstable due to the impact of path loss, noise, shadowing, fading, interference and radio-frequency chain impairments in wireless communication systems. To better utilize wireless resources, rate adaptation has become one of the mandatory functionalities in IEEE 802.11 wireless local area networks (WLANs): it adaptively selects modulation and coding schemes (MCS) based on the quality of the wireless channels. Each MCS is associated with a coding rate and a constellation size, and hence with a given bit rate. To increase single-link throughput, more antennas, wider channel bandwidths and higher-order modulation are adopted in current IEEE 802.11 networks (i.e., IEEE 802.11ax or Wi-Fi 6), and the number of available MCSs has increased significantly. Therefore, the rate adaptation algorithm is of great importance.

Conventional rate adaptation (RA) schemes for IEEE 802.11 networks are rule-based and can be roughly categorized as SNR-based and sampling-based [1]. In the SNR-based approach, the transmitter estimates the instantaneous signal-to-noise ratio (SNR) from the physical layer and translates it into a bit rate that can be supported by the current channel conditions. The estimated SNR has been shown to be an unreliable measure, since it can be easily affected by severe interference. As a result, most mainstream Wi-Fi vendors employ sampling-based RA algorithms, such as Minstrel HT [2], which is used in the ath9k Wi-Fi driver, and Iwl-Mvm-Rs in the Intel IwlWifi Linux driver [3]. Sampling-based RA schemes usually select the MCS whose historical behavior performs best. Their major drawback is that an MCS can only be evaluated once enough samples have been collected; in other words, each MCS has to be probed many times. Such mechanisms may not respond promptly to highly dynamic wireless environments.

To overcome the shortcomings of conventional RA algorithms, recent studies have embraced artificial intelligence (AI), exploiting its predictive capability to learn the intrinsic relationship between wireless environmental observations and the MCS selection scheme (in this paper, MCS selection and rate adaptation are used interchangeably). In [1], [4], supervised learning (SL) based RA algorithms were investigated and shown to achieve significant potential gains. Nonetheless, SL may generalize poorly to different wireless environments due to the lack of online learning [5].

With the rapid development of reinforcement learning (RL), RA schemes based on RL have been proposed. In [6]–[8], the RA problem was formulated as a multi-armed bandit (MAB) problem, where each candidate MCS is encoded as a discrete arm of the MAB, and Thompson sampling was utilized for its fast convergence. However, the scenarios of interest in these works are cellular communication systems operating on licensed spectrum, which solely consider channel fading due to mobility and multi-path effects. In IEEE 802.11 networks operating on unlicensed spectrum, the dynamics of the radio environment are also highly affected by interference from hidden nodes or other coexisting transmission technologies, as well as by collisions caused by the random channel access mechanism in the Medium Access Control (MAC) layer, i.e., carrier sense multiple access with collision avoidance (CSMA/CA).

In [5], a deep reinforcement learning (DRL) based RA scheme was proposed, and its performance was verified on a commodity 802.11ac prototype. That work mainly focused on the throughput of the nodes running the DRL-based RA.

As shown by the simulation results in Section V-C1, it can deteriorate the overall network throughput. This is because the policy of those nodes becomes conservative as the network becomes congested, so that most nodes tend to choose a low MCS, resulting in longer channel airtime and inefficient spectrum utilization.

In this work, we formulate the RA problem in CSMA/CA networks as a Markov Decision Process (MDP) and develop a DRL-based RA algorithm, named drlRA for conciseness. To avoid overly conservative policies, we design a reward function that does not penalize DRL actions for collision-induced failures, so that the agent does not fall back to inefficient MCSs in congested wireless environments; the details are elaborated in Section III-C. Extensive simulation results show that drlRA achieves throughput enhancements of up to 39.5% and 16.5%, and latency reductions of up to 19.3% and 25.8%, compared to Minstrel HT [2] and the state-of-the-art RA algorithm [5], respectively.

The remainder of this paper is organized as follows. Section II introduces the system model and problem description. Section III formulates the considered RA problem in the MDP framework. The drlRA algorithm is elaborated in Section IV. Simulation results are presented in Section V, followed by the conclusion in Section VI.

II. SYSTEM MODEL AND PROBLEM DESCRIPTION

We focus on downlink Wi-Fi networks where access points (APs) transmit packets (of B bits each) to their associated stations (STAs). Each AP contends for one shared channel through the CSMA/CA protocol. Before each packet transmission, a backoff counter (BOF) is selected randomly from {0, ..., W_i − 1}, where W_i is the contention window (CW) and i ∈ {0, ..., K}. The BOF is decreased by one whenever the channel is sensed idle for a time slot, and the AP transmits its packet once the BOF counts down to zero. The RA algorithm is responsible for selecting an MCS m ∈ M ≜ {1, 2, ..., M} to encode each packet. Correspondingly, the packet is transmitted at a rate of C_m, where C_1 < ... < C_M, and occupies D_m = B/C_m of air time. If the packet is successfully decoded by the target STA, the AP receives an acknowledgement (ACK) and resets the CW to W_0; otherwise, due to collisions or an erroneous MCS policy, the CW is doubled until it reaches W_K.

The wireless signal experiences both large-scale and small-scale fading. A log-distance path loss model and a Nakagami-m fast fading model are assumed, respectively, consistent with the widely used NS-3 system-level simulator [9].

The goal of the RA problem is to maximize the following objective function,

    \underset{m \in \mathcal{M}}{\text{maximize}} \quad (1 - PER_m) R_m,    (1)

where PER_m denotes the packet error rate (PER) caused by choosing MCS m, and R_m represents the average rate rather than the nominal rate C_m, because in CSMA/CA networks the throughput is affected not only by the packet transmission time D_m but also by the waiting time D_W due to backoff. The optimum MCS can be determined by computing

    m^* = \arg\max_{m \in \mathcal{M}} (1 - PER_m) R_m.    (2)

If both the channel distribution and the collision distribution were known, PER_m and D_W could be expressed explicitly and m^* could be calculated. However, it is complicated to characterize the distribution of D_W, which makes the RA problem challenging in CSMA/CA networks. In the following, we therefore leverage a model-free DRL-based solution.
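As a concrete illustration of (1) and (2), the following Python sketch (not from the paper) picks the MCS that maximizes the expected goodput, using the average-rate estimate R_m = B/(D_m + D_W) that the paper introduces later in (6). The per-MCS PER values and the mean waiting time passed to the function are placeholder inputs.

# Illustrative sketch: selecting m* of Eq. (2) when PER_m and the mean
# backoff waiting time D_W are assumed to be known. Rates follow Table II.
B = 27_000                      # packet size in bits (Table I)
RATES_MBPS = [8.6, 17.2, 25.8, 34.4, 51.6, 68.8, 77.4, 86.0, 103.2, 114.7]

def best_mcs(per, d_w_us):
    """Return (m*, goodput) with m* = argmax_m (1 - PER_m) * R_m.
    `per` is a hypothetical list of PER_m values, one per MCS;
    `d_w_us` is the assumed expected waiting time in microseconds."""
    best_m, best_goodput = None, -1.0
    for m, (rate, p) in enumerate(zip(RATES_MBPS, per), start=1):
        d_m_us = B / rate                 # airtime D_m = B / C_m (bits / Mbps = us)
        r_m = B / (d_m_us + d_w_us)       # average rate R_m of Eq. (6), in Mbps
        goodput = (1.0 - p) * r_m         # objective (1 - PER_m) * R_m of Eq. (1)
        if goodput > best_goodput:
            best_m, best_goodput = m, goodput
    return best_m, best_goodput

The 1-based MCS indexing matches Table II.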
III. MDP FORMULATION

A. MDP Basics

The MDP is a classical formalization of sequential decision making [10]. In an MDP, an agent in a state s_t ∈ S chooses an action a_t ∈ A. The environment responds with a new state s_{t+1} ∈ S and feeds back a reward r_t ∈ R. The goal of the MDP agent is to find the optimal policy π^*(s) that maximizes the expected cumulative discounted return E_\pi[\sum_{t=0}^{\infty} \gamma^t r_t], where γ is a discount factor.

To find the optimal policy, the action-value function, also known as the Q-value, is defined as the expected cumulative discounted return from undertaking action a at state s, i.e.,

    Q(s, a) \triangleq E_\pi\Big[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\Big|\, s_t = s, a_t = a\Big].

According to the Bellman optimality equation, with Q^*(s, a) \triangleq \max_\pi Q_\pi(s, a),

    Q^*(s, a) = E\Big[r_t + \gamma \max_{a'} Q^*(s_{t+1}, a') \,\Big|\, s_t = s, a_t = a\Big].

The optimal policy π^* can be derived from the optimal action-value function by taking the greedy action, π^*(s) = \arg\max_a Q^*(s, a). Q^*(s, a) can be approximated by the Q-learning algorithm or the Deep Q-Network (DQN) algorithm [11].

The MDP model can be described by the 4-tuple <S, A, r, γ>, whose definitions with respect to the RA problem are elaborated hereafter.

B. Action and States

The action of the agent at transmission instance t is defined as a_t ∈ M.

The observation of the agent at t, o_t ∈ O, consists of information on the last transmission, which can be denoted as

    o_t \triangleq [a_{t-1}, ACK_{t-1}, RSS_{t-1}, d_{t-1}],    (3)

where a_{t-1} is the previous action, ACK_{t-1} ∈ {0, 1} indicates whether an ACK was received, and RSS_{t-1} is the received signal strength (RSS) measured from the ACK signal. d_{t-1} represents the number of time slots between the (t − 1)-th and the t-th transmission instance. The RSS is used to reflect channel conditions.

The state of the agent s_t ∈ S at t is the observation history of the agent, which is given by

    s_t \triangleq [o_{t-T+1}, \cdots, o_{t-2}, o_{t-1}],    (4)

where T is the length of the observation history.
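For concreteness, the observation of (3) and the flattened state of (4) could be maintained as in the following Python sketch. This is an illustration rather than the authors' code; the history length T = 10 follows the drlRA parameters listed later in Table III, and the flattening of the history into a single vector for the neural network input is an assumption.

from collections import deque

T = 10  # observation history length (Table III)

class StateBuilder:
    def __init__(self, history_len=T, obs_dim=4):
        # start from an all-zero history so s_t always contains T observations
        self.history = deque([[0.0] * obs_dim for _ in range(history_len)],
                             maxlen=history_len)

    def update(self, prev_action, ack, rss, slots_since_last_tx):
        """prev_action: a_{t-1} (MCS index), ack: ACK_{t-1} in {0, 1},
        rss: RSS_{t-1} measured from the ACK, slots_since_last_tx: d_{t-1}."""
        o_t = [float(prev_action), float(ack), float(rss), float(slots_since_last_tx)]
        self.history.append(o_t)          # the oldest observation is dropped
        # s_t of Eq. (4), flattened into one vector of length T * 4 for the NN
        return [x for obs in self.history for x in obs]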

C. Reward

The reward function is the core design element in a DRL algorithm. There are only two outcomes for each packet at MCS m, i.e., success or failure. With respect to MCS m, denote the PER as PER[m], the reward for success as r_succ[m], and the reward for failure as r_fail[m]. A direct design is to use the throughput as the reward function, as in other existing work. However, this design causes overly conservative policies in a highly congested wireless network, since nodes tend to choose a low-level MCS in the face of collisions.

To avoid this conservative policy, the agent should distinguish between two types of packet errors: packet errors caused by MCS selection and packet errors caused by collisions. This work uses PER[m] and PER_m to distinguish the two types. PER[m] contains the PER due to both a wrong MCS policy (PER_m) and collisions, and hence PER_m ≤ PER[m]. A natural way to design the reward function is to set the expectation of the reward equal to the objective function in (1), i.e., (1 − PER[m]) × r_succ[m] + PER[m] × r_fail[m] = (1 − PER_m) R_m. By fixing r_fail[m] = 0, r_succ[m] is derived as

    r_{succ}[m] = \frac{1 - PER_m}{1 - PER[m]} \times R_m, \quad \forall m \in \mathcal{M}.    (5)

(If r_fail[m] were instead set to a negative value, the agent could learn more from failures; we leave this for future work.)

In a highly congested wireless network, collisions become the main cause of packet reception failures. In this case, we have PER_m ≪ PER[m]. As a result, a policy that chooses a high-level MCS is encouraged according to (5).

In (5), R_m is estimated as

    R_m = \frac{B}{D_m + D_W},    (6)

where D_W is calculated from the expectation of the CW, i.e., D_W = 0.5 W_i. PER_m can be estimated by means of a look-up table that stores the relationship between the SNR and PER_m [12]. The SNR can be derived as the ratio of the average RSS statistics from ACKs to the energy level detected on the idle channel.
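The reward computation of (5) can be sketched in Python as follows. This is illustrative only: per_from_snr stands in for the SNR-to-PER_m look-up table of [12], and PER[m] is estimated here from a sliding window of the last 100 transmission outcomes per MCS (the window size listed later in Table III); how exactly the paper estimates PER[m] is not spelled out, so this windowed estimate is an assumption.

from collections import defaultdict, deque

WINDOW = 100  # sliding window size to calculate PER (Table III)

class RewardCalculator:
    def __init__(self, per_from_snr):
        # per_from_snr(snr_db, m) is a placeholder for the SNR-to-PER_m lookup of [12]
        self.per_from_snr = per_from_snr
        self.outcomes = defaultdict(lambda: deque(maxlen=WINDOW))  # per-MCS history, 1 = failure

    def reward(self, m, success, snr_db, r_m):
        """Reward for MCS m; r_m is the average rate R_m = B / (D_m + D_W) of Eq. (6)."""
        self.outcomes[m].append(0 if success else 1)
        if not success:
            return 0.0                                        # r_fail[m] is fixed to 0
        window = self.outcomes[m]
        per_all = sum(window) / len(window)                   # PER[m]: wrong-MCS errors plus collisions
        per_mcs = min(self.per_from_snr(snr_db, m), per_all)  # enforce PER_m <= PER[m] on noisy estimates
        return (1.0 - per_mcs) / (1.0 - per_all) * r_m        # Eq. (5); ratio >= 1, so collisions do not depress it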
IV. ALGORITHM

With the definitions of the MDP tuple with respect to the RA problem at hand, we can use DQN to solve it. The pseudo-code of drlRA is summarized in Algorithm 1.

The neural network (NN) architecture is illustrated in Fig. 1; it is a residual network containing seven fully-connected (FC) layers. The NN takes the current state s_t as input and outputs Q ≜ [Q(s_t, 1), ..., Q(s_t, M)]. Action a_t is selected using an ε-greedy policy, where the agent chooses actions greedily with probability 1 − ε and chooses actions randomly with probability ε to ensure convergence to an optimal policy.

Fig. 1: The architecture of the neural network.

Algorithm 1 drlRA algorithm
  Initialization: ε, γ, N, s_0, a_0, t = 0, cnt = 0, θ⁻ = θ
  for the transmission instance t = 1, 2, ... do
      Compute s_t from s_{t−1}, a_{t−1} using (3) and (4)
      Store (s_{t−1}, a_{t−1}, r_{t−1}, s_t) in the experience memory (EM)
      Input s_t to the NN in Fig. 1 with θ and output Q
      Generate action a_t from Q using the ε-greedy policy
      Calculate the reward r_t according to (5)
      for each sample e = (s, a, r, s′) in the EM do
          Compute y = r + γ max_{a′} Q(s′, a′; θ⁻)
          Compute L(θ) = (y − Q(s, a; θ))²
          Update θ by performing mini-batch gradient descent
      end for
      if (cnt mod N) == 0 then
          θ⁻ ← θ
      end if
      cnt ← cnt + 1
      t ← t + 1
  end for
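A possible realization of Algorithm 1 is sketched below in PyTorch. It is not the authors' implementation: Fig. 1 only states that the network has seven FC layers with a residual connection, so the hidden width (64), the ReLU activations, the placement of the skip connection, and the use of uniformly sampled mini-batches from the experience memory are assumptions; the hyperparameters follow Table III.

import random
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Seven FC layers with a residual connection, as suggested by Fig. 1."""
    def __init__(self, state_dim, num_mcs, hidden=64):
        super().__init__()
        self.inp = nn.Linear(state_dim, hidden)
        self.mid = nn.ModuleList([nn.Linear(hidden, hidden) for _ in range(5)])
        self.out = nn.Linear(hidden, num_mcs)      # outputs Q(s_t, 1), ..., Q(s_t, M)

    def forward(self, s):
        h = torch.relu(self.inp(s))
        skip = h                                   # skip-connection placement is an assumption
        for layer in self.mid:
            h = torch.relu(layer(h))
        return self.out(h + skip)

class DrlRaAgent:
    def __init__(self, state_dim, num_mcs, gamma=0.9, lr=5e-4,
                 batch_size=32, memory_size=500, target_every=100):
        self.q = QNetwork(state_dim, num_mcs)
        self.q_target = QNetwork(state_dim, num_mcs)
        self.q_target.load_state_dict(self.q.state_dict())      # theta^- = theta
        self.opt = torch.optim.Adam(self.q.parameters(), lr=lr)
        self.memory, self.memory_size = [], memory_size
        self.gamma, self.batch_size, self.target_every = gamma, batch_size, target_every
        self.eps, self.eps_min, self.eps_decay = 0.5, 0.01, 0.995   # Table III
        self.num_mcs, self.cnt = num_mcs, 0

    def act(self, state):
        # epsilon-greedy selection over the M candidate MCSs
        if random.random() < self.eps:
            return random.randrange(self.num_mcs)
        with torch.no_grad():
            q = self.q(torch.tensor(state, dtype=torch.float32))
        return int(q.argmax().item())

    def step(self, transition):
        # transition = (s_{t-1}, a_{t-1}, r_{t-1}, s_t), as stored in Algorithm 1
        self.memory.append(transition)
        if len(self.memory) > self.memory_size:
            self.memory.pop(0)
        if len(self.memory) >= self.batch_size:
            batch = random.sample(self.memory, self.batch_size)
            s, a, r, s2 = (torch.tensor(x, dtype=torch.float32) for x in zip(*batch))
            a = a.long()
            with torch.no_grad():
                y = r + self.gamma * self.q_target(s2).max(dim=1).values   # TD target with theta^-
            q_sa = self.q(s).gather(1, a.unsqueeze(1)).squeeze(1)
            loss = nn.functional.mse_loss(q_sa, y)                         # (y - Q(s, a; theta))^2
            self.opt.zero_grad()
            loss.backward()
            self.opt.step()
        self.cnt += 1
        if self.cnt % self.target_every == 0:
            self.q_target.load_state_dict(self.q.state_dict())             # theta^- <- theta
        self.eps = max(self.eps_min, self.eps * self.eps_decay)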


V. PERFORMANCE EVALUATION

To evaluate the effectiveness of the proposed drlRA algorithm, simulation results under a grid topology and random topologies are presented in this section. The grid topology is pictorially illustrated in Fig. 2.

Fig. 2: Grid topology: four basic service sets (BSSs) are placed in a grid, where the stars represent APs equipped with drlRA, the circles represent APs with a fixed MCS m = 6, and the squares represent STAs.

A. Simulation Setup

Simulation parameters are summarized in Table I, where D is the distance between the transmitter and the receiver. The rates of the available MCSs ({C_m}) are listed in Table II. We introduce the following algorithms as baselines: the commonly used Minstrel HT [2] and the experience-driven rate adaptation (EDRA) [5]; they are compared with the newly proposed DRL-based RA algorithm, drlRA, described above. The parameters of drlRA are shown in Table III. The parameters of Minstrel are the exponentially weighted moving average (EWMA) weight, the sampling window and the proportion of probing, which are set to 0.75, 100 ms and 10%, respectively. EDRA alternates between two periods, a probing period and a transmission period; in our simulation, the probing period and the transmission period last for the transmission of 5 and 18 packets, respectively.

TABLE I: Simulation Parameters
  Time slot: 9 μs
  Size of each packet (B): 27000 bits
  Simulation time: 10 s
  CSMA/CA (W_0, W_K): (32, 1024)
  Path loss model: −46.67 − 30 log10(D)
  Transmit power: 10 dBm
  Traffic type: saturated Poisson traffic

TABLE II: Available MCSs and Rates
  MCS:          1     2     3     4     5     6     7     8     9     10
  Rate (Mbps):  8.6   17.2  25.8  34.4  51.6  68.8  77.4  86.0  103.2 114.7

TABLE III: Parameters of drlRA
  Agent action-observation history length T: 10
  Experience memory size: 500
  Discount factor γ: 0.9
  Batch size b_s: 32
  Learning rate: 5 × 10⁻⁴
  Replace target iteration N: 100
  Range of ε: 0.5 to 0.01
  Decay rate of ε: 0.995
  Sliding window size to calculate PER: 100

B. Performance Metrics

The following metrics are used to evaluate the performance of the algorithms of interest.
  • Total Throughput: the ratio of the total number of successfully transmitted bits in the network to the evaluation interval, which is set to the last five seconds of the simulation.
  • Mean Delay: the average latency of successfully transmitted packets during the last five seconds of the simulation.
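For clarity, the two metrics could be computed from a per-packet trace as in the following sketch. The log format (tx_time_s, rx_time_s, bits, success) is hypothetical and not part of the paper.

SIM_TIME, EVAL_WINDOW = 10.0, 5.0   # 10 s simulation, last 5 s evaluated

def total_throughput_mbps(packet_log):
    start = SIM_TIME - EVAL_WINDOW
    bits = sum(p[2] for p in packet_log if p[3] and p[1] >= start)
    return bits / EVAL_WINDOW / 1e6          # successfully delivered bits per second, in Mbps

def mean_delay_ms(packet_log):
    start = SIM_TIME - EVAL_WINDOW
    delays = [p[1] - p[0] for p in packet_log if p[3] and p[1] >= start]
    return 1e3 * sum(delays) / len(delays) if delays else float("nan")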
C. Simulation Results

1) Topo A: Topo A is a simple grid topology, where different settings (A1–A4) are deployed to examine the coexistence performance with fixed-MCS setups.

As shown in Fig. 3, EDRA performs the worst in terms of throughput. The reason is that when EDRA encounters a high PER due to interference or collisions, it reduces the rate to achieve a lower PER. However, such a conservative rate selection policy causes the channel to be occupied by low-rate transmissions for a long time, which decreases the network throughput. This deterioration becomes more pronounced as the number of EDRA nodes increases.

The proposed drlRA algorithm outperforms EDRA and Minstrel by up to 39.5% and 16.5%, respectively, in terms of throughput, and reduces the delay by up to 19.2% and 25.9%. This enhancement comes primarily from the reward design in (5), which distinguishes the PER due to a wrong MCS policy from that due to collisions. A large gap between PER[m] and PER_m indicates that the main cause of packet errors is interference. Therefore, drlRA maintains a high rate to avoid mutual interference. If all transmitters choose a high rate with short transmission airtime, collisions are alleviated. It can be seen that the network maintains high throughput as the number of drlRA nodes increases. As a result, drlRA avoids overly conservative policies.

Fig. 3: Performance comparison under Topo A: (a) throughput, (b) delay. For Ax, x indicates the number of APs equipped with the RA algorithms of interest; the other APs all use a fixed MCS.

2) Topo B: In this case, we evaluate the performance of the proposed drlRA algorithm under random topologies with more BSSs. APs and STAs are randomly dropped within a 60 m × 60 m square, with the coordinates of APs and STAs following a Poisson point distribution. Each topology with a different number of APs is generated ten times in total. As shown in Fig. 4, both DRL-based schemes (EDRA and drlRA) outperform the conventional rule-based Minstrel algorithm. Although there is no significant performance gain over EDRA in terms of average latency, drlRA delivers significant throughput gains of approximately 23.6% to 47% in the different settings.
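A random drop of the kind used in Topo B could be generated as in the sketch below. This is an illustration only: since the number of BSSs is fixed in each run, the Poisson point distribution is approximated here by independent uniform drops, and the number of STAs per BSS as well as the BSS counts in the example are placeholders rather than values taken from the paper.

import numpy as np

AREA = 60.0  # side length of the square deployment area, in metres

def drop_topology(num_bss, stas_per_bss=1, rng=None):
    rng = rng or np.random.default_rng()
    aps = rng.uniform(0.0, AREA, size=(num_bss, 2))                  # AP coordinates
    stas = rng.uniform(0.0, AREA, size=(num_bss * stas_per_bss, 2))  # STA coordinates
    return aps, stas

# Example: ten independent drops per BSS count, as in Section V-C2 (counts are placeholders).
topologies = {n: [drop_topology(n) for _ in range(10)] for n in (4, 6, 8)}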
D. Convergence and Complexity

The convergence behavior of the proposed drlRA can be observed from Fig. 5, where the convergence time is less than 1 second.

Fig. 4: Performance comparison under Topo B: (a) throughput, (b) delay. For Bx, x indicates the number of BSSs.

Fig. 5: Convergence performance of the proposed drlRA algorithm in setting A3 of Topo A.

For the NN architecture adopted in our work, i.e., the one shown in Fig. 1, counting all floating-point operations in inference gives a computational complexity of 47880 floating-point operations (FLOPs). The seven-layer NN may not be necessary, and there is still room to reduce the FLOPs by optimizing the NN architecture.
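An inference FLOP count for a stack of FC layers can be estimated with a short helper like the one below. The layer widths are placeholders, since the paper does not report them; only the input dimension T × 4 = 40 (Section III-B with T = 10) and the output dimension M = 10 (Table II) follow from the text, and the counting convention (multiply plus add per weight, plus the bias adds) is an assumption, so the result is not claimed to reproduce the 47880 figure exactly.

def fc_flops(layer_widths):
    """layer_widths: [in_dim, h1, ..., out_dim]. Each FC layer is counted as
    roughly 2 * fan_in * fan_out FLOPs plus fan_out for the bias additions."""
    total = 0
    for fan_in, fan_out in zip(layer_widths[:-1], layer_widths[1:]):
        total += 2 * fan_in * fan_out + fan_out
    return total

print(fc_flops([40, 64, 64, 64, 64, 64, 64, 10]))  # placeholder hidden widths, not the paper's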
VI. CONCLUSION AND FUTURE WORK

In this work, we propose a new DRL-based RA algorithm, named drlRA, for Wi-Fi networks. Unlike existing DRL-based solutions, drlRA takes into consideration the impact of collisions caused by random channel access on the MCS selection policy. In particular, in the reward function, drlRA incorporates the waiting time into the average throughput expression and distinguishes packet errors caused by collisions from those caused by erroneous MCS decisions. We compare the performance of drlRA with the classical RA algorithm Minstrel and the state-of-the-art DRL-based algorithm EDRA. Experimental results show that drlRA achieves higher overall throughput and lower latency.

Moving forward, we are interested in designing reward functions with negative values in the case of failure, which would help the agent learn from its failures. Another direction to further improve performance is to jointly design intelligent RA and intelligent channel access algorithms, such as [13].

REFERENCES

[1] S. Khastoo, T. Brecht, and A. Abedi, "NeuRA: Using neural networks to improve WiFi rate adaptation," in Proceedings of the 23rd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems. New York, NY, USA: Association for Computing Machinery, 2020, pp. 161–170.
[2] R. Albar, T. Y. Arif, and R. Munadi, "Modified rate control for collision-aware in Minstrel-HT rate adaptation algorithm," in 2018 International Conference on Electrical Engineering and Informatics (ICELTICs), 2018, pp. 7–12.
[3] R. Grünblatt, I. Guérin-Lassous, and O. Simonin, "Simulation and performance evaluation of the Intel rate adaptation algorithm," in Proceedings of the 22nd International ACM Conference on Modeling, Analysis and Simulation of Wireless and Mobile Systems, ser. MSWiM '19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 27–34. [Online]. Available: https://doi.org/10.1145/3345768.3355921
[4] C.-Y. Li, S.-C. Chen, C.-T. Kuo, and C.-H. Chiu, "Practical machine learning-based rate adaptation solution for Wi-Fi NICs: IEEE 802.11ac as a case study," IEEE Transactions on Vehicular Technology, vol. 69, no. 9, pp. 10264–10277, 2020.
[5] S.-C. Chen, C.-Y. Li, and C.-H. Chiu, "An experience driven design for IEEE 802.11ac rate adaptation based on reinforcement learning," in IEEE INFOCOM 2021 - IEEE Conference on Computer Communications, 2021, pp. 1–10.
[6] V. Saxena, H. Tullberg, and J. Jalden, "Reinforcement learning for efficient and tuning-free link adaptation," IEEE Transactions on Wireless Communications, vol. 21, no. 2, pp. 768–780, 2022.
[7] J. Park and S. Baek, "Two-stage Thompson sampling for outer-loop link adaptation," IEEE Wireless Communications Letters, vol. 10, no. 9, pp. 2004–2008, 2021.
[8] V. Saxena, H. Tullberg, and J. Jalden, "Model-based adaptive modulation and coding with latent Thompson sampling," in 2021 IEEE 32nd Annual International Symposium on Personal, Indoor and Mobile Radio Communications (PIMRC), 2021, pp. 610–616.
[9] G. F. Riley and T. R. Henderson, The ns-3 Network Simulator, 2010.
[10] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. MIT Press, 2018.
[11] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[12] P. H. Tan, Y. Wu, and S. Sun, "Link adaptation based on adaptive modulation and coding for multiple-antenna OFDM system," IEEE Journal on Selected Areas in Communications, vol. 26, no. 8, pp. 1599–1606, 2008.
[13] Z. Guo, Z. Chen, P. Liu, J. Luo, X. Yang, and X. Sun, "Multi-agent reinforcement learning-based distributed channel access for next generation wireless networks," IEEE Journal on Selected Areas in Communications, vol. 40, no. 5, pp. 1587–1599, 2022.

