TABLE I
COMPARISON OF RELATED WORKS USING LEARNING-BASED METHODS

Route request (RREQ) and route reply (RREP) messages are
used without any modifications. In the case of channel switching due to the unexpected arrival of a PUD, the routing table needs to be updated with the help of a MAC-layer sensing operation, which limits the data rate as per the application requirement. In another work, Wang et al. [20] used routing metrics,
such as channel switching delay and the probability of chan-
nel availability on the basis of exponential distribution of
ON/OFF duration to increase data rate and minimize end-to-
end delay (EED), assuming static PUD activities. Hence,
the probability of channel availability is calculated without
user interference which can increase packet loss due to the
unexpected arrival of PUD. Iqbal et al. [21] suggested a resource allocation mechanism for critical links in IoT communication; however, the issue of routing, especially for users with different parameters, is not considered.
Majeed et al. [22] presented an energy-aware deployment
for IoT-enabled cellular networks. Qureshi and Aldeen [23]
highlighted in their survey some new challenges for various
applications based on IoT communications for future genera-
tion networks. The authors identified the proper QoS solutions
for varying routing parameters especially for environments
evolving w.r.t. channel and network topology.
Learning-based approaches have drawn significant attention from researchers to address a variety of issues in communication networks [24], [25], [26], [27], [28]. Mao et al. [24] presented a solution for routing in software-defined networks on the basis of convolutional neural networks (CNNs) for the periodic learning of network experiences. The proposed approach incurs high costs in terms of computation and storage and is hence not feasible for IoT devices. Another routing
solution employing multiagent deep reinforcement learning
(MADRL) and real-time Markov decision process (RTMDP)
is proposed in [29] to control the network congestion and
network resources for mobile networks. This proposed routing
protocol is designed for mobile networks in which the par-
ticipating nodes do not communicate with each other. It does
not apply to environments with frequently evolving topologies, such as CRN-based environments, due to the retraining requirement of MADRL. In an earlier work, Mao et al. [30] discussed the concerns of unsupervised deep-learning-based solutions for software-defined networks, such as network resource allocation, centralized routing, and traffic control on different
layers.

Yang et al. [31] proposed a routing protocol employing global optimization to address QoS issues in CR-enabled advanced metering infrastructure networks. Specifically, Yang et al. employed the ant colony optimization algorithm for optimal routing. The proposed algorithm supports PUD protection while satisfying the utility needs of the SUDs. Du et al. [25] addressed the EED and power efficiency problems jointly in CRNs by proposing a cross-layer routing protocol using quasi-cooperative multiagent learning.

In CR-IoT routing, the time-varying availability of the channel due to the unexpected arrival of a PUD decreases the average data rate because of the increased packet collisions and hence results in a decrease in overall throughput during routing. The unexpected arrival of the PUD causes contention on its channels between SUDs seeking to utilize the available channel. This problem becomes worse when one route between two SUDs can be affected by different primary users (PUDs). There are some solutions presented to manage the various resources of CR-IoT communication, as shown in Table I, and there are still some issues for which various machine learning techniques can be used for further improvement, especially in routing and security. Hence, the channel selection decisions need to be included in the routing parameters to decide the best end-to-end route during the whole transmission of IoT communication. These decisions are directly related to quantitative QoS parameters, such as packet collisions, which must be minimized during the formation of the end-to-end route for CRN-based IoT. Therefore, in CRN-based IoT, the routing needs to search the available
Algorithm 1: Route and Channel Selection in RREQ

Function Route_Channel_Selection_RREQ()
    Received RREQ by CRN-IoT node N from CRN-IoT node M through Channel = [i]
    if Channel i is free from PUD then
        Channel selection using exploitation and exploration learning
        if first RREQ for node N then
            create a route using channel i and broadcast RREQ to SUDs for free channels from PUD
        else
            if extra RREQ but on a different channel then
                create a route from that channel
            else
                if new RREQ then
                    update the route through channel i
    else
        # channel i is not free from PUD
        Route selection through the best available channel using reinforcement learning (LAC)
        if node N receives multiple routes from channel i then
            if N == destination and first hop node of RREQ != stored first hop node in routing table and Y != next hop node in routing table and HOP_RREQ <= min HOP then
                create a route from channel i
            else
                discard the RREQ
    if N has a route for the destination then
        send RREQ to M
    else
        discard the RREQ

Algorithm 2: Route and Channel Selection in RREP

Function Route_Channel_Selection_RREP()
    CRN-IoT node N receives RREP from node M through channel i
    if Channel i is free from PUD then
        Add the channel to the available channel list through exploitation and exploration learning
        if first RREP for N then
            create a route using channel i and send RREP to the SUDs that exist in that route
        else
            if additional RREP from M but on a different channel then
                create a route and send RREP from that channel
            else
                if it is a new RREP then
                    update the route from channel i
    else
        # channel i is not free from PUD
        Multiple routes are selected from the list of available channels
        if N receives RREP from multiple routes then
            if N == source node and first hop node of RREP != stored first hop node in routing table and M != next hop node in routing table and HOP_RREP <= min HOP then
                create a route from channel i
            N discards the RREP

A. Preliminaries and Mathematical Notation

This section presents the formal modeling used to observe the effectiveness of RL algorithms for channel selection during routing. We present the mathematical model of a noncooperative game, which avoids centralized channel management, as adopted in [33]. The noncooperative game is defined as τ = {T, {S_i}_{i∈T}}, where T is the set of SUDs and S_i = {s_1, s_2, ..., s_C} is the set of strategies for user i ∈ T with C vacant channels. The SUDs coexist with PUDs over the same network and can access only one single vacant channel from a PUD at a time. This work focuses on the channel transmission rate (R_tr) parameter for the channel selection purpose in the SUD. According to the noncooperative game rule, every user i selects the strategy for transmission on the basis of a utility function U_i : S_i → T against its opponents S_{-i}. Every user participating in this game can select and update its strategy profile at any point of time as S = [s_1, s_2, ..., s_T], but must follow the rule of the game. The strategy profile is created for every SUD in the selection of its channel for the transmission so that it can be saved in the learning block. The successful strategy of any user is selected by searching the Nash equilibrium point (NEP) for its transmission. This happens only if the following (1) is satisfied [33]:

    U_i(S) \ge U_i(s_a, s_{-a}) \quad \forall i \in T,\ s_a \in S_i    (1)

where s_a is the strategy of user i for action a and s_{-a} is the strategy of user i according to the action of its opponent. Once the strategy (S) of every user (SUD) is chosen based on (1), no user can benefit by changing its strategy while the other players keep theirs unchanged. As previously mentioned, three RL algorithms are applied for selecting the best available channel for transmission such that the packet collision can be minimized in case of an unexpected arrival of PUDs. These RL algorithms include No-External Regret learning, Q-learning, and Learning Automata. The following sections describe these learning algorithms with respect to the channel selection decision.
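To make the game formulation and the NEP condition in (1) concrete, a minimal Python sketch is given below. The two-SUD utility table and the helper is_nash_equilibrium are illustrative assumptions for a toy two-channel example, not the authors' implementation.

    from itertools import product

    # Hypothetical utilities U_i(s_i, s_-i) for two SUDs choosing between two vacant channels;
    # sharing a channel (a collision) is assumed to pay less than using distinct channels.
    utilities = [
        {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0},  # SUD 0
        {(0, 0): 1.0, (0, 1): 3.0, (1, 0): 3.0, (1, 1): 1.0},  # SUD 1
    ]

    def is_nash_equilibrium(profile):
        """Check condition (1): no SUD can gain by unilaterally switching channels."""
        for i, own in enumerate(profile):
            other = profile[1 - i]
            current = utilities[i][(own, other)]
            if any(utilities[i][(alt, other)] > current for alt in (0, 1)):
                return False
        return True

    # Enumerate all pure strategy profiles and report the NEPs.
    neps = [p for p in product((0, 1), repeat=2) if is_nash_equilibrium(p)]
    print(neps)  # [(0, 1), (1, 0)]: the SUDs settle on different channels

Under these assumed payoffs, the equilibria are exactly the profiles in which the SUDs occupy different channels, which is the behavior the channel selection game is meant to encourage.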
B. No-External Regret Learning-Based Channel Selection

No-External Regret Learning means that the channel selection decision for routing is updated by the exploitation of previous channel selection decisions. In this learning technique, every SUD can reserve/save the channel availability information for a specific time period and calculate the future selection of a channel using (1), based on past channel utilization [34]. Regrets are observed in this learning from past channel utilization experiences of bad channel selections for routing. Therefore, this technique is used to minimize the channel switching regrets and to reduce the routing table size for any channel selection during routing [34]

    p_i^{t+1}(s_a) = \frac{(1+\alpha)^{U_i^t(s_a)}}{\sum_{\acute{s}_a \in S_i} (1+\alpha)^{U_i^t(\acute{s}_a)}}    (2)

where in (2), U_i^t(s_a) = \sum_{j=1}^{t} U_i^j(s_a) and U_i^t(\acute{s}_a) = \sum_{j=1}^{t} U_i^j(\acute{s}_a) are used for the complete time span of t; p_i^{(t+1)}(s_a) is the probability allocated to strategy s_a at time period t + 1, while α > 0. Here, α indicates the learning rate and determines to what extent the newly acquired information will override the old information. In practice, mostly a constant learning rate is used, such as α = 0.1.

In the case where the selected channel ID is greater than the available vacant channels, the probability of selecting that channel is calculated using (2) and the routing table is updated. Therefore, this is categorized as exploitation learning through No-External Regret Learning. In the case of unavailability of channels from the existing channel list, the exploration learning through Q-learning is used, which is discussed in the next section.
in the next section. where A shows the rewards for the first agent and B shows
the rewards for the second agent. For this multiagent model,
the Q-learning update rule can be simplified as follows [36]:
C. Q-Learning-Based Channel Selection
Q-learning is a popular exploration learning algorithm Qai = Qai + α rai − Qai (5)
which is based on the value-iteration model-free technique
with a computational requirement to empower the SUDs to where Qai represents the Q-value of agent a for action i for
learn the mapping of environment states into actions for the the reward rai that agent a is receiving for executing action i
maximum numerical reward. It is mathematically assembled and α is the learning rate.
as (S, A, T, R) where S denotes a discrete set of environment The Q-value for the user si for an action a is initialized as
states; A denotes a set of actions; T denotes a state transition 0 so that the exploration of finding a channel with maximum
function of ON and OFF as S → [0, 1]; and R is a reward reward is searched. The average reward value is calculated for
function S → R. The user get a reward through the learning every channel and the channel reward of every user is com-
agent from the environment which indicates its state s, and pared against its opponents as sbj for an action b at time t
selects an action a, for channel selection in case of routing and t − 1. The channel reward is calculated through RL algo-
decisions. It changes the state of the environment and gener- rithms and compared against its opponents through -greedy
ating a reinforcement signal once the action is performed r. exploration so that the selected channel does not belong to a
The quality of the decision is dependent on this signal to main- same spectrum. The reward value for the action is assigned and
tain the corresponding Q(s, a) rewards. The Q-rewards/values updated through the Learning Automata which is discussed in
are updated as follows [35]: the next section.
Q(s, a) = Q(s, a) + α r(s, a) + γ maxQ ś, á − Q(s, a) (3)
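As a minimal illustration of the ε-greedy channel choice in (4) combined with the stateless update in (5), a hedged Python sketch follows. The reward values, the number of channels, and the function names are hypothetical and only mirror the equations above; they are not the authors' simulator code.

    import random

    def epsilon_greedy_channel(q_values, epsilon=0.1):
        """Pick a channel index: the best-Q channel with probability 1 - epsilon,
        otherwise a uniformly random channel (exploration), as in (4)."""
        if random.random() < epsilon:
            return random.randrange(len(q_values))
        return max(range(len(q_values)), key=lambda c: q_values[c])

    def update_q(q_values, channel, reward, alpha=0.1):
        """Stateless multiagent-style update from (5): Q <- Q + alpha * (r - Q)."""
        q_values[channel] += alpha * (reward - q_values[channel])

    # Hypothetical run: three vacant channels, Q-values start at 0 so exploration dominates early.
    q = [0.0, 0.0, 0.0]
    for step in range(100):
        c = epsilon_greedy_channel(q, epsilon=0.1)
        reward = 1.0 if c == 1 else 0.2  # assume channel 1 is usually free of PUD activity
        update_q(q, c, reward)
    print(q)  # the Q-value of channel 1 should be the largest after the run

Starting all Q-values at 0 matches the initialization described above, and the ε term keeps every channel occasionally probed so that a PUD returning to the currently preferred channel can still be detected.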
D. Learning Automata-Based Channel Selection

In this algorithm, users select a channel a ∈ C based on the reward value of an action saved in the learning block. The action probability table is updated as in [37]

    q_{t+1}(s, a) = \begin{cases} q_t(s, a) + \alpha \bar{U}(s_a, s_b) \left[ 1 - q_t(s, a) \right], & \text{where } a = b \\ q_t(s, a) - \alpha \bar{U}(s_a, s_b)\, q_t(s, a), & \text{where } a \ne b \end{cases}    (6)

where q_{t+1}(s, a) and q_t(s, a) represent the action probability of the user for state s at time (t + 1) and at time t, respectively, for choosing an action a from the available state s using the normalized utility function as follows [37]:

    \bar{U}(s_a, s_b) = \frac{U(s_a, s_b)}{\max_{a \in C} E(U(s_a, s_b))}    (7)

where s_a and s_b represent the available states of actions a and b for two users, and \max_{a \in C} E(U(s_a, s_b)) indicates the maximum average reward of agent a depending on the probability of agent b achieving the action using the learning mechanism.
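The following hedged Python sketch mirrors the linear action-probability update in the spirit of (6) and (7). The reward sample, the number of channels, and the function name automaton_update are assumptions for illustration only and do not reproduce the authors' implementation.

    import numpy as np

    def automaton_update(probs, chosen, norm_reward, alpha=0.1):
        """Learning-automata style update following the form of (6):
        the chosen action's probability is reinforced by the normalized reward,
        while all other actions are proportionally weakened."""
        probs = np.asarray(probs, dtype=float)
        for a in range(len(probs)):
            if a == chosen:
                probs[a] += alpha * norm_reward * (1.0 - probs[a])
            else:
                probs[a] -= alpha * norm_reward * probs[a]
        return probs / probs.sum()  # guard against rounding drift

    # Hypothetical example: three channels, and channel 2 just produced a good normalized reward.
    p = automaton_update([1/3, 1/3, 1/3], chosen=2, norm_reward=0.8)
    print(p)  # probability mass shifts toward channel 2

Because the reward is first normalized as in (7), the size of each probability step is bounded, which keeps the action probability table well behaved even when raw channel rewards vary widely.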
E. Convergence Point

In this section, the convergence point is proved to achieve the balance between all of the learning techniques. The NEP can be defined as the function shown in the following [38]:

    P = \sum_{i=1}^{C} \sum_{j=1}^{\theta\{C_i\}} U_j(\theta\{C_i\})    (8)

where \theta\{C_i\} denotes the cardinality of channel C_i, which shows the number of SUDs in a channel i ∈ C. The SUD can make a channel switch if (9) holds, as defined by [37]

    P = U_i[\theta\{C_i\}] = U_i[\theta\{C_{j+1}\}] - U[\theta\{C_k\}], \quad \text{if } U_i[\theta\{C_{j+1}\}] > U[\theta\{C_k\}]    (9)

    P = \sum_{i=1}^{\theta\{C_{j+1}\}} U_i(\theta\{C_{j+1}\}) + \sum_{i=1}^{\theta\{C_{k+1}\}} U_i(\theta\{C_{k+1}\}) - \left[ \sum_{i=j}^{\theta\{C_j\}} U_i(\theta\{C_j\}) + \sum_{i=1}^{\theta\{C_k\}} U_i(\theta\{C_k\}) \right]
      = U_i(\theta\{C_{j+1}\}) - U(\theta\{C_k\})
      = U_i(\theta\{C_i\}).    (10)

Equation (10) shows that P is an exact potential function. Now, for No-External Regret learning, U = 0 happens at the NEP and can be written as follows [37]:

    P = \begin{cases} U_i^t(s_a) = U_i^{(t+1)}(s_a) \\ U_i^{(t+1)}(\acute{s}_a) = U_i^t(\acute{s}_a) \end{cases}    (11)

Equations (10) and (11) show that, once the network achieves the NEP, only then will users have link stability, and further channel switching events will reduce, in the case of No-External Regret learning, as follows [38]:

    p_i^{(t+1)}(s_a) = p_i^{(t+2)}(s_a).    (12)

Also, for Learning Automata, (6) and (11) indicate that after the NEP is achieved, no further channel switching happens, as [37]

    q_{t+1}(s, a) = q_t(s, a).    (13)

Now, for Q-learning from (5), the differential equation of the Q-values from the Q-table is [37]

    \frac{dQ_t(s, a)}{dt} = Q_{t+1}(s, a) - Q_t(s, a) = \alpha \left[ E(U(s_a, s_b)) - Q_t(s, a) \right]    (14)

The integration of the Q-values gives the solution of this differential equation at time t as [37]

    Q_t = K e^{-\alpha t} + E[U(s_a, s_b)]    (15)

where K is the integration constant, and the Q-values at time t → ∞ can be written as follows [37]:

    \lim_{t \to \infty} Q_t = E[U(s_a, s_b)].    (16)

Equations (14)–(16) show that all three learning algorithms converge at time t for any user i ∈ T and can be written as follows [37]:

    U_i^t = \sum_{b=1,\, b \ne a}^{C} \left[ U_i^t(s_a, s_b)\, p_t(s_a) + U_i^t(s_b, s_a)\, p_t(s_b) \right].    (17)

Hence, all SUDs can converge to a pure NEP after the convergence of the learning event. In spectrum mobility, multiple SUD pairs can make agreements simultaneously on different channels.

IV. RESULTS AND DISCUSSION

The proposed routing is executed with the help of the CRCN simulator, an add-on of the network simulator (NS-2) [39]. We have compared the network performance achieved by our proposed RL-based routing for CRN-based IoT communications with the recent AODV-IoT [9], ELD-CRN [10], and SpEED-IoT [11] routing protocols. SpEED-IoT routing essentially selects a route that ensures the connectivity and reachability of IoT devices with data rate optimization of the assigned routes in a mesh-network-based IoT network. This means that the route encounters only one type of user, without any effect of the PUD's unexpected arrival, and, therefore, the data rate is optimized on the basis of multichannel routing for device-to-device communication in an IoT mesh network. The results are also compared with the AODV-IoT and ELD-CRN routing protocols, which are designed for CR-IoT environments. ELD-CRN is a recent RL-based routing protocol which also addresses energy efficiency during the routing process in CRN; hence, the proposed routing protocol can also be compared against the latest energy constraints in the future. Moreover, ELD-CRN routing conserves the limited battery resources of IoT devices based on CRN and supports reliable packet delivery while incurring lower packet transfer latency and being energy efficient. This routing mechanism is limited to location-based operation and operates over a single wireless channel using a channel access mechanism that follows the IEEE 802.11 distributed coordination function.

A. Average Data Rate Maximization

The average data rate achieved by our RL-based routing approach for CRN-IoT communications (RL-IoT routing) is compared for the different network scenarios in Figs. 4–6. It is observed that when the number of SUDs is low, the average data rate for all the routing protocols is also low, while for higher values, performance increases, reaching almost 90% of delivered packets for the increasing standard deviation (sd) of
MAR in the RL-IoT routing. The MAR for the activity of the
PU indicates the availability of PU on the channel for channel
utilization, channel availability information, channel transmis-
sion rate, and channel transmission time. The real working of
PU on spectrum is not observed due to licensing restrictions.
Therefore, the PUs are distributed with the fixed allocation
on a spectrum in a stochastic environment of the CRN within
the mean arrival rate of [0, 1]. The Poisson process is a very
important model in queuing theory which can be used when
the packets originate from a large population of independent
users. It can be seen in Fig. 4 that initially the RL-IoT routing has a better average data rate compared to the other three routing protocols, which have almost the same performance at the beginning for 0.0 sd of PUD's MAR. This is due to the lesser number of PUDs at the start of the simulations: as soon as the number of PUDs increases, the data rate changes for the AODV-IoT, SpEED-IoT, and ELD-CRN routing protocols. At the lowest MAR of PUDs, each SUD is affected by the activity of the PUDs and, hence, the user is often isolated due to the unavailability of free channels. Therefore, the packets delivered are mainly those sent when most PUDs are inactive and those that are directed to destinations very close to the sources. On the other hand, when the number of PUDs increases, the routing choices also increase and, thus, the RL-IoT routing is able to build routes unaffected by the activity of PUDs for most of the flows, as shown in Fig. 5. The RL-IoT and ELD-CRN routing mechanisms have increased data rates as time elapses. However, the RL-IoT routing protocol has a much faster convergence time compared to the ELD-CRN routing protocol due to exploitation learning combined with exploration learning. Hence, the average data rate increases for the 0.4 and 0.8 sd of PUD's MAR, due to the larger number of channel choices for routing, as shown in Figs. 5 and 6. On the contrary, the other routing protocols are not capable of this technique and, due to the increase of PUD interference, their average data rate declines. This enables RL-IoT to improve the average data rate by 69% in comparison with the AODV-IoT routing protocol and by nearly 39% and 43% in comparison with the ELD-CRN and SpEED-IoT routing protocols, respectively. These results are averaged over all three scenarios of sd of MAR (low, medium, and high) using an AWK script applied to process the trace file of NS-2 [40].

Fig. 4. Average data rate of SUD at PUD's sd of MAR = 0.0 (user/ms).

Fig. 5. Average data rate of SUD at PUD's sd of MAR = 0.4 (user/ms).
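Before moving to packet collisions, the following hedged Python sketch shows the kind of post-processing summarized above: per-scenario data rates, assumed to have been extracted beforehand from the NS-2 trace files (e.g., with the AWK script of [40]), are averaged over the low/medium/high sd scenarios and turned into a relative improvement. All numbers and names here are placeholders, not the paper's measured results.

    # Placeholder per-protocol average data rates, one value per sd scenario of MAR.
    rates = {
        "RL-IoT":   [0.92, 0.88, 0.85],
        "AODV-IoT": [0.55, 0.52, 0.50],
    }

    def scenario_average(values):
        """Average a metric over the low/medium/high sd scenarios."""
        return sum(values) / len(values)

    def improvement_percent(ours, baseline):
        """Relative improvement of the proposed routing over a baseline, in percent."""
        return 100.0 * (ours - baseline) / baseline

    ours = scenario_average(rates["RL-IoT"])
    base = scenario_average(rates["AODV-IoT"])
    print(round(improvement_percent(ours, base), 1), "% improvement (placeholder numbers)")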
B. Packet Collision

Every channel c ∈ C is configured according to the Poisson distribution for the mean arrival rate to monitor the activity of the PUD. The sd value is calculated on the basis of the Box–Muller transform from an interval of mean arrival rate (MAR = [0, 1]) for the stochastic environment of the CR-IoT network. The activity of the PUD is predicted using the sd distributed as low, medium, and high for {0.0, 0.4, 0.8}, respectively.

Fig. 6. Average data rate of SUD at PUD's sd of MAR = 0.8 (user/ms).
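To make the channel configuration described above concrete, here is a hedged Python sketch that spreads per-channel mean arrival rates around a base MAR using a Box–Muller normal sample for the three sd levels. The base MAR of 0.5, the clipping to [0, 1], and the function names are illustrative assumptions, not the authors' simulator settings.

    import math
    import random

    def box_muller():
        """One standard-normal sample via the Box-Muller transform."""
        u1 = 1.0 - random.random()  # avoid log(0)
        u2 = random.random()
        return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2)

    def channel_mars(num_channels, base_mar=0.5, sd=0.4):
        """Per-channel mean arrival rates: the base MAR spread by the chosen sd,
        clipped to the stochastic-environment interval [0, 1]."""
        return [min(1.0, max(0.0, base_mar + sd * box_muller()))
                for _ in range(num_channels)]

    # Hypothetical scenarios matching the low/medium/high sd levels used in the evaluation.
    for sd in (0.0, 0.4, 0.8):
        print(sd, channel_mars(num_channels=5, base_mar=0.5, sd=sd))

With sd = 0.0 every channel sees the same PUD arrival rate, while the larger sd values produce a mix of lightly and heavily occupied channels, which is the setting in which channel-aware routing has the most room to help.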
Initially, all the routing protocols achieve a similar probability of PUD–SUD packet collisions across a CRN-based IoT communication (see Fig. 7). There is no activity detected on the channel in the case where the MAR of the PUD is 0.0; hence, due to this unavailability of PUD activity, the graph shows similar trends for all protocols. The effect of the PUD activity can be observed for MAR = 0.4 (user/ms), in which the user activity plays a major role in the channel selection and there is a need to minimize the packet collisions between the different types of devices. This can be seen from the results presented in Fig. 8, which shows the reduction of packet collisions. This reduction in the number of packet collisions as compared to AODV-IoT routing is up to 30%. On the other hand, 19% as compared to the SpEED-IoT routing
[33] B. Pourpeighambar, M. Dehghan, and M. Sabaei, "Non-cooperative reinforcement learning based routing in cognitive radio networks," Comput. Commun., vol. 106, pp. 11–23, Jul. 2017.
[34] W. Krichene, B. Drighes, and A. Bayen, "On the convergence of no-regret learning in selfish routing," in Proc. Int. Conf. Mach. Learn., 2014, pp. 163–171.
[35] K.-L. A. Yau, P. Komisarczuk, and P. D. Teal, "Reinforcement learning for context awareness and intelligence in wireless networks: Review, new features and open issues," J. Netw. Comput. Appl., vol. 35, no. 1, pp. 253–267, 2012.
[36] A. Popescu, "Cognitive radio networks: Elements and architectures," Ph.D. dissertation, Dept. Commun. Syst., Blekinge Inst. Technol., Karlskrona, Sweden, 2014.
[37] Y. Xu, J. Wang, Q. Wu, A. Anpalagan, and Y.-D. Yao, "Opportunistic spectrum access in cognitive radio networks: Global optimization using local interaction games," IEEE J. Sel. Topics Signal Process., vol. 6, no. 2, pp. 180–194, Apr. 2012.
[38] M. Felegyhazi, J.-P. Hubaux, and L. Buttyan, "Nash equilibria of packet forwarding strategies in wireless ad hoc networks," IEEE Trans. Mobile Comput., vol. 5, no. 5, pp. 463–476, May 2006.
[39] T. Issariyakul and E. Hossain, "Introduction to network simulator 2 (NS2)," in Introduction to Network Simulator NS2. Boston, MA, USA: Springer, 2009, pp. 1–18. [Online]. Available: https://doi.org/10.1007/978-0-387-71760-9_2
[40] L. Wood. "Awk script to get end-to-end delays from NS2 trace files." 2011. Accessed: Oct. 6, 2022. [Online]. Available: http://personal.ee.surrey.ac.uk/Personal/L.Wood/ns/packet-delay-script/lloyd-wood-ns-packet-delay-awk-script.pdf

Tauqeer Safdar Malik received the B.S. degree in computer science from Bahauddin Zakariya University, Multan, Pakistan, in 2004, the M.S. degree in computer science from COMSATS University Islamabad (Wah Campus), Rawalpindi, Pakistan, in 2007, and the Ph.D. degree from Universiti Teknologi PETRONAS, Seri Iskandar, Malaysia, in 2018.
He is currently an Assistant Professor with the Department of Computer Science, Air University (Multan Campus), Islamabad, Pakistan. His research interests include wireless networks, cognitive radio ad hoc networks, Internet of Things, 5G and 6G, IPv6, artificial intelligence, machine learning, routing, and security issues in wireless networks.

Ayesha Afzal received the B.S. degree in computer science from Bahauddin Zakariya University, Multan, Pakistan, in 2004, and the M.S. and Ph.D. degrees in computer science from Lahore University of Management Sciences, Lahore, Pakistan, in 2007 and 2018, respectively.
She is currently serving as an Assistant Professor with Air University, Multan. Her research interests are in the areas of services computing, business process management, distributed workflow management, and cloud computing.
Dr. Afzal is on the editorial board of the International Journal on Digital Libraries.

Muhammad Ibrar received the B.S. degree in telecommunication and networking from COMSATS University Islamabad (Abbottabad Campus), Abbottabad, Pakistan, in 2010, the M.S. degree in telecommunication and networking from Bahria University, Islamabad, Pakistan, in 2014, and the Ph.D. degree from the School of Software, Dalian University of Technology, Dalian, China, in March 2021.
He is a Postdoctoral Researcher with the School of Software, Dalian University of Technology. His research interests include software-defined networking, fog computing, wireless ad hoc, and sensor networks.
Nadir Shah received the B.Sc. and M.Sc. degrees in computer science from Peshawar University, Peshawar, Pakistan, in 2002 and 2005, respectively, the M.S. degree in computer science from International Islamic University, Islamabad, Pakistan, in 2007, and the Ph.D. degree from the Sino-German Joint Software Institute, Beihang University, Beijing, China, in 2011.
He is currently an Associate Professor with COMSATS University Islamabad (Wah Campus), Rawalpindi, Pakistan. His current research interests include computer networks, distributed systems, and network security.
Dr. Shah is serving on the editorial board of the International Journal of Communication Systems (Wiley), IEEE Softwarization, AHWSN, and the Malaysian Journal of Computer Science. He has been serving as a Reviewer for several journals/conferences, including the ICC, the INFOCOM, the WCNC, Computer Networks (Elsevier), the IEEE COMMUNICATIONS LETTERS, the IEEE Communications Magazine, the IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, and The Computer Journal.

Houbing Song (Senior Member, IEEE) received the Ph.D. degree in electrical engineering from the University of Virginia, Charlottesville, VA, USA, in August 2012.
He is currently a Tenured Associate Professor of AI and the Director of the Security and Optimization for Networked Globe Laboratory (SONG Lab, www.SONGLab.us), University of Maryland, Baltimore County, Baltimore, MD, USA. He was a Tenured Associate Professor of Electrical Engineering and Computer Science with Embry–Riddle Aeronautical University, Daytona Beach, FL, USA. SONG Lab graduates work in a variety of companies and universities. Those seeking academic positions have been hired as Tenure-Track Assistant Professors at U.S. universities, such as Auburn University, Auburn, AL, USA; Bowling Green State University, Bowling Green, OH, USA; and the University of Tennessee, Knoxville, TN, USA. His research has been featured by popular news media outlets, including IEEE GlobalSpec's Engineering360, Association for Uncrewed Vehicle Systems International,
Security Magazine, CXOTech Magazine, Fox News, U.S. News & World
Report, The Washington Times, New Atlas, Battle Space, and Defense Daily.
His research has been sponsored by federal agencies (including National
Science Foundation, U.S. Department of Transportation, Federal Aviation
Administration, Air Force Office of Scientific Research, U.S. Department of
Defense, and Air Force Research Laboratory) and industry. He has edited
eight books, including Aviation Cybersecurity: Foundations, principles, and
applications (Scitech Publishing, 2022), Smart Transportation: AI Enabled
Mobility and Autonomous Driving (CRC Press, 2021), Big Data Analytics
for Cyber-Physical Systems: Machine Learning for the Internet of Things
(Elsevier, 2019), Smart Cities: Foundations, Principles, and Applications
(Hoboken, NJ, USA: Wiley, 2017), Security and Privacy in Cyber-Physical
Systems: Foundations, Principles, and Applications, (Chichester, U.K.: Wiley-
IEEE Press, 2017), Cyber-Physical Systems: Foundations, Principles and
Applications (Boston, MA: Academic Press, 2016), and Industrial Internet of
Things: Cybermanufacturing Systems (Cham, Switzerland: Springer, 2016).
He has authored more than 100 articles and is the inventor of two patents
(U.S. and WO). His research interests include cyber–physical systems/Internet
of Things, cybersecurity and privacy, AI/machine learning/big data analytics,
edge computing, unmanned aircraft systems, connected vehicle, smart and
connected health, and wireless communications and networking.
Dr. Song was a recipient of the Best Paper Award from the 12th IEEE
International Conference on Cyber, Physical, and Social Computing in 2019,
the Best Paper Award from the 2nd IEEE International Conference on
Industrial Internet 2019, the Best Paper Award from the 19th Integrated
Communication, Navigation and Surveillance technologies Conference 2019,
the Best Paper Award from the 6th IEEE International Conference on
Cloud and Big Data Computing 2020, the Best Paper Award from
the 15th International Conference on Wireless Algorithms, Systems, and
Applications 2020, the Best Paper Award from the 40th Digital Avionics
Systems Conference 2021, the Best Paper Award from 2021 IEEE Global
Communications Conference, and the Best Paper Award from 2022 IEEE
International Conference on Computer Communications. He is a Highly
Cited Researcher identified by Clarivate in 2021 and a Top 1000 Computer
Scientist identified by Research.com. He has been serving as an Associate
Technical Editor for IEEE Communications Magazine since 2017, an
Associate Editor for IEEE INTERNET OF THINGS JOURNAL since 2020, IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS since 2021, and IEEE JOURNAL ON MINIATURIZATION FOR AIR AND SPACE SYSTEMS since 2020, and a Guest Editor for IEEE JOURNAL ON SELECTED AREAS IN COMMUNICATIONS, IEEE INTERNET OF THINGS JOURNAL, IEEE NETWORK, IEEE TRANSACTIONS ON INDUSTRIAL INFORMATICS, IEEE SENSORS JOURNAL, IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, and IEEE JOURNAL OF BIOMEDICAL AND HEALTH INFORMATICS. He is a Senior Member of ACM and an ACM
Distinguished Speaker.