Comprehensive Survey of Reinforcement Learning: From Algorithms to Practical Challenges
Abstract—Reinforcement Learning (RL) has emerged as a powerful paradigm in Artificial Intelligence (AI), enabling agents to learn optimal behaviors through interactions with their environments. Drawing from the foundations of trial and error, RL equips agents to make informed decisions through feedback in the form of rewards or penalties. This paper presents a comprehensive survey of RL, meticulously analyzing a wide range of algorithms, from foundational tabular methods to advanced Deep Reinforcement Learning (DRL) techniques. We categorize and evaluate these algorithms based on key criteria such as scalability, sample efficiency, and suitability. We compare the methods in terms of their strengths and weaknesses in diverse settings. Additionally, we offer practical insights into the selection and implementation of RL algorithms, addressing common challenges like convergence, stability, and the exploration-exploitation dilemma. This paper serves as a comprehensive reference for researchers and practitioners aiming to harness the full potential of RL in solving complex, real-world problems.

Index Terms—Reinforcement Learning, Deep Reinforcement Learning, Model-free, Model-based, Actor-Critic, Q-learning, DQN, TD3, PPO, TRPO

†Corresponding author. 1 Department of Computer Science, Wilfrid Laurier University, Waterloo, Canada. Emails: {mghasemi, debrahimi}@wlu.ca & [email protected]

I. INTRODUCTION

Reinforcement Learning (RL) is a subfield of Artificial Intelligence (AI) in which an agent learns to make decisions by interacting with an environment, aiming to maximize cumulative reward over time [1]. RL has rapidly evolved since the 1950s, when Richard Bellman's work on Dynamic Programming (DP) established foundational concepts that underpin current approaches [2], [3]. The field gradually became more widespread through more advanced approaches, such as Temporal-Difference (TD) Learning [4], [5], and through proposed solutions to the exploration-exploitation dilemma [6], [7]. RL has continued to evolve rapidly due to its integration with Deep Learning (DL), giving rise to Deep Reinforcement Learning (DRL). This advancement enables researchers to tackle more sophisticated and complex problems [8], [9]. It has proven highly effective in solving sequential decision-making problems in a variety of fields, such as game playing ([10], [11], [12], [13]), robotics ([14], [15], [16], [17]), and autonomous systems, particularly in Intelligent Transportation Systems (ITS) ([18], [19], [20], [21], [22]).

This survey examines the practical application of different RL approaches across various domains, including but not limited to: robotics [14], [23], [24], optimization [25], [26], energy efficiency and power management [27], [28], [29], networks [30], [31], [32], dynamic and partially observable environments [33], [34], [35], [36], video games [37], real-time systems and hardware implementations [38], [39], financial portfolios [40], ITS [18], [41], [42], signal processing [43], benchmark tasks [44], data management and processing [45], [46], and multi-agent and cloud-based systems [47], [48], [49], [50]. Moreover, our survey gives detailed explanations of RL algorithms, ranging from traditional tabular methods to state-of-the-art methods.

Related Surveys: Several notable survey papers have examined different aspects of RL or attempted to teach various concepts to readers. Foundational work, such as [6], offers an in-depth computer science perspective on RL, encompassing both its historical roots and modern developments. In the realm of DRL, [51], [52], [53], [54] offer comprehensive analyses of DRL's integration with DL, highlighting its applications and theoretical advancements. Model-based RL is identified as a promising area in RL, with the authors in [55], [56], [57] emphasizing its potential to enhance sample efficiency by simulating trial-and-error learning in predictive models, reducing the need for costly real-world interactions.

In [58], the authors provided a survey of Model-free RL specifically within the financial portfolio management domain, exploring its applications and effectiveness. Meanwhile, [59] analyzes the combination of Model-free RL and Imitation Learning (IL), a hybrid approach known as RL from Expert Demonstrations (RLED), which leverages expert data to enhance RL performance. These survey papers help to understand the various aspects of RL and its integration with other areas.

To our knowledge, no papers have analyzed the
strengths and weaknesses of the algorithms used across papers and provided a comprehensive analysis of entire papers. The first motivation of this survey is to address this gap.

RL involves various challenges in choosing appropriate algorithms due to diverse factors, such as problem characteristics and environmental dynamics. In general, the choice depends on numerous characteristics of the problem, including whether the state and action spaces are large, whether their values are discrete or continuous, and whether the dynamics of the environment are stochastic or deterministic. Data availability and sample efficiency are other factors to consider. It is also important to consider the degree to which direct implementation can be achieved and how easily it can be debugged. In these respects, simpler algorithms tend to be more user-friendly but cannot be applied to complex problems. Convergence and stability are important considerations, as certain algorithms provide better guarantees in specific circumstances. In conclusion, the decision-making process is influenced by exploration style, domain-specific requirements, and past research results. To determine which algorithm to use by comparing the problem they are solving with similar existing research, researchers should have a
comprehensive paper that examines many studies across different domains thoroughly and accurately. Besides saving time and resources, this also prevents the excess cost of a trial-and-error process to determine which solution to choose, which is the second motivation behind conducting this survey.

The rest of the paper is organized as follows. In Section II, we provide a general overview of RL before diving into the algorithms. We then undertake an examination of various algorithms within the domain of RL, inclusive of the associated papers, as well as an analysis of their respective merits and drawbacks. It is imperative to acknowledge that these algorithms fall into three overarching categories: Value-based Methods, Policy-based Methods, and Actor-Critic Methods. Section III initiates the discussion by focusing on Value-based Methods, delineated by their four core components: Dynamic Programming, Tabular Model-free, Approximation Model-free, and Tabular Model-based Methods. Section IV subsequently discusses Policy-based Methods. Furthermore, Section V offers detailed insights into Actor-Critic Methods. In Section VI, we give a summary of the paper and discuss its scope. Finally, Section VII provides a synthesis of the paper through a review of key points and an exposition of future research directions.

a state (or state-action pair) and continues to follow its current policy. As a final point, the model of the environment simulates the behavior of the environment or, more generally, can be used to infer how the environment will behave [1], [60], [61].

There are two broad categories of RL methodology: Model-free and Model-based. Model-free methods do not assume knowledge of the environment's dynamics and learn directly from interactions with the environment. On the other hand, Model-based methods involve building a model of the environment's dynamics and using this model to develop and improve policies [6], [3]. Each of these categories has its own advantages and disadvantages, which will be discussed later in the paper.

Following this, we will briefly examine two of the most crucial components of RL. First, we will explore the Markov Decision Process (MDP), the foundational framework that structures the learning environment and guides the agent's decision-making process. Then, we will discuss the exploration-exploitation dilemma, one of the most imperative characteristics of RL, which balances the need to gather new information with the goal of maximizing rewards.
Central to solving an MDP are value functions, which estimate the expected return. The state-value function Vπ(s) under policy π is the expected return starting from state s and following policy π thereafter [3], [65], [62]:

V_\pi(s) = \mathbb{E}_\pi[G_t \mid s_t = s]    (2)

Similarly, the action-value function Qπ(s, a) is the expected return starting from state s, taking action a, and thereafter following policy π:

Q_\pi(s, a) = \mathbb{E}_\pi[G_t \mid s_t = s, a_t = a]    (3)

Equations 2 and 3 are referred to as the Bellman Equations. To find the optimal policy π*, RL algorithms iteratively update value functions based on experience [5], [66]. In Q-learning (explained later in Alg. 10), the update rule for the action-value function is:

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R(s_t, a_t) + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \right]    (4)

and exploitation and can be applied with on-policy and off-policy RL algorithms. By decoupling policies, DeRL improves robustness and sample efficiency in sparse-reward environments. More advanced methodologies have been used in [76], where the study investigated the trade-off between exploration and exploitation in continuous-time RL using an entropy-regularized reward function. As a result, the optimal exploration-exploitation balance was achieved through a Gaussian distribution for the control policy, where exploitation was captured by the mean and exploration by the variance. Moreover, various strategies such as ε-greedy, Upper Confidence Bound (UCB), and Thompson Sampling are employed to balance this trade-off in simpler environments, like Bandits [7], [1], [8].

In the subsequent sections, we will undertake an examination of various algorithms within the domain of RL, inclusive of the associated papers, as well as an analysis of their respective merits and drawbacks. It is imperative to acknowledge that these algorithms fall into three overarching categories: Value-based Methods, Policy-based Methods, and Actor-Critic Methods.
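As a concrete rendering of the update in Eq. 4 together with an ε-greedy exploration rule, the following minimal Python sketch performs tabular Q-learning. The env object with reset(), step(), and actions() methods is a hypothetical interface introduced only for illustration; it is not taken from any of the surveyed papers.

import random
from collections import defaultdict

def q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration (cf. Eq. 4)."""
    Q = defaultdict(float)  # Q[(state, action)], initialized to 0

    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy: explore with probability epsilon, otherwise exploit
            if random.random() < epsilon:
                action = random.choice(env.actions(state))
            else:
                action = max(env.actions(state), key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Eq. 4: move Q(s, a) toward r + gamma * max_a' Q(s', a')
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions(next_state))
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q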
Model-based methods rely on a model of the environment's dynamics. The model predicts how the environment will respond to an agent's actions (state transitions and rewards). On the other hand, Model-free methods do not rely on a model of the environment. Instead, they directly learn a policy or value function based on interactions with the environment.

We will begin by introducing the main part of Tabular Model-based methods, Dynamic Programming (DP). Next, we will explore both Tabular and Approximate Model-free algorithms. Finally, we will cover the advanced part of Tabular Model-based methods, Model-based Planning.

A. Tabular Model-based Algorithms

In this subsection, we examine the first part of the Tabular Model-based methods in RL, Dynamic Programming methods. It must be noted that the advanced Tabular Model-based algorithms will be analyzed in section III-D along with various studies.

A Tabular Model-based algorithm is an RL technique used for solving problems with a finite and discrete state and action space. These algorithms are explicitly based on the maintenance and updating of a table or matrix, which represents the dynamics of the environment and its rewards. 'Model-based' algorithms are characterized by the fact that they involve the construction of an environmental model for use in decision-making. Key characteristics of Tabular Model-based techniques include Model Representation, Planning and Policy Evaluation, and Value Iteration [1].

In Model Representation, the model depicts the dynamics in tabular form, with transition probabilities represented as P(s'|s, a) and the reward function as R(s, a). These elements define the probability of transitioning to a new state s' and the expected reward when taking action a in state s. Planning and Policy Evaluation involves iteratively updating the value function V(s) using the Bellman equation until convergence. To calculate the optimal policy, the algorithm examines all possible future states and their associated actions. Value Iteration also iteratively updates the value function V(s) using the Bellman equation until convergence. Similar to Planning and Policy Evaluation, it analyzes all potential future states and their linked actions to determine the optimal policy.

Tabular Model-based algorithms can be divided into two general categories: DP and Model-based Planning, where the latter will be discussed in section III-D1.

1) Dynamic Programming (DP): DP methods are fundamental techniques used to solve MDPs when a complete model of the environment is known. These methods are iterative and make use of the Bellman equations to compute the optimal policies. Two primary DP-based methods are Policy Iteration and Value Iteration. We first examine Policy Iteration and then discuss Value Iteration.

a) Policy Iteration: Policy Iteration is a method that iteratively improves the policy until it converges to the optimal policy. It consists of two main steps, Policy Evaluation and Policy Improvement [1]. Policy Evaluation calculates the value function Vπ(s) for a given policy π. This involves solving the Bellman expectation equation for the current policy:

V_\pi(s) = \sum_{a \in A} \pi(a|s) \sum_{s' \in S} P(s'|s, a) \left[ R(s, a, s') + \gamma V_\pi(s') \right]    (5)

This step iteratively updates the value of each state under the current policy until the values converge [3]. Policy Improvement, on the other hand, improves the policy by making it greedy with respect to the current value function:

\pi'(s) = \arg\max_{a} \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma V_\pi(s') \right]    (6)

This step updates the policy by selecting actions that maximize the expected value based on the current value function [55]. Alg. 1 gives an overview of how policy iteration can be implemented.

Algorithm 1 Policy Iteration
1: Initialize policy π arbitrarily
2: repeat
3:    Perform policy evaluation to update the value function Vπ
4:    Perform policy improvement to update the policy π
5: until policy π converges

Value Iteration is another approach in DP, which will be discussed in the next subsection.

b) Value Iteration: Value Iteration is another DP method that directly computes the optimal value function by iteratively updating the value of each state. It combines the steps of policy evaluation and policy improvement into a single step. Value Iteration consists of two different steps, Value Update and Policy Extraction. Value Update updates the value function for each state based on the Bellman optimality equation:

V(s) = \max_{a} \sum_{s'} P(s'|s, a) \left[ R(s, a, s') + \gamma V(s') \right]    (7)
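As a minimal illustration of Eq. 7 (a sketch written for this survey, not code from the cited works), the snippet below runs Value Update sweeps on a small tabular MDP and then extracts the greedy policy. The transition model P[s][a] is assumed to be a dictionary mapping each action to a list of (probability, next_state, reward) triples.

def value_iteration(P, gamma=0.99, theta=1e-8):
    """Value Iteration: apply the Bellman optimality update (Eq. 7) until
    convergence, then extract the greedy policy from the converged values."""
    states = list(P.keys())
    V = {s: 0.0 for s in states}

    def q_value(s, a):
        # One-step look-ahead: sum_s' P(s'|s,a) [R(s,a,s') + gamma * V(s')]
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    while True:
        delta = 0.0
        for s in states:
            best = max(q_value(s, a) for a in P[s])   # Value Update (Eq. 7)
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:                             # values have converged
            break

    # Policy Extraction: act greedily with respect to the final V
    policy = {s: max(P[s], key=lambda a: q_value(s, a)) for s in states}
    return V, policy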
2) The return at the goal state is defined as the sum of rewards from the starting state to the goal state.
3) Two approaches are considered for state-value estimation:
   • First-visit MC: Only the first visit to each state in an episode is used for value estimation.
   • Every-visit MC: All visits to each state in an episode are considered.

Agent's Behavior Across Episodes:

Episode 1:
   • Actions: Right, Right, Down, Down (Reaches G).
   • Returns: G is visited once with a return of 5.

Episode 2:
   • Actions: Up, Right, Right, Down, Down (Reaches G).
   • Returns: G is visited twice, with returns of 8 (first visit) and 8 (second visit).

Episode 3:
   • Actions: Right, Up, Right, Down, Down (Reaches G).
   • Returns: G is visited once with a return of 6.

State-Value Estimation for G:

1) First-visit MC:
   • Considers only the first visit to G in each episode.
   • Returns for G:
      – Episode 1: 5
      – Episode 2: 8 (first visit only)
      – Episode 3: 6
   • Average Return for G: (5 + 8 + 6) / 3 = 6.33.

2) Every-visit MC:
   • Considers all visits to G in each episode.
   • Returns for G:
      – Episode 1: 5
      – Episode 2: 8, 8
      – Episode 3: 6
   • Average Return for G: (5 + 8 + 8 + 6) / 4 = 6.75.

General Insights:
   • Using First-visit MC, each state value is updated using the average return from its first visits across episodes (e.g., G in Episode 2 considers only the first return, 8).
   • Using Every-visit MC, the state value reflects the average of all returns across all visits to that state.

In MC, both the First-visit MC and the Every-visit MC methods converge toward vπ(s) as the number of visits approaches infinity. Alg. 3 examines the First-visit MC method for estimating Vπ. The only difference between Every-visit and First-visit, as stated above, lies in line 8: for Every-visit MC, we use the return following every occurrence of state s.

Algorithm 3 First-visit MC
1: Initialize:
2:    π ← policy to be evaluated
3:    V ← an arbitrary state-value function
4:    Returns(s) ← an empty list, for all s ∈ S
5: repeat
6:    Generate an episode using π
7:    for each state s appearing in the episode do
8:        G ← return following the first occurrence of s
9:        Append G to Returns(s)
10:       V(s) ← average(Returns(s))
11:   end for
12: until forever

b) MC Estimation of Action Values (with Exploration Starts): It becomes particularly advantageous to estimate action values (values associated with state-action pairs) rather than state values in the absence of an environment model. State values alone are adequate for determining a policy by examining one step ahead and selecting the action that leads to the optimal combination of reward and next state [1]. It is, however, insufficient to rely solely on state values without a model. An explicit estimate of each action's value is essential in order to provide meaningful guidance in the formulation of policy. MC methods are intended to accomplish this objective. As a first step, we address the issue of evaluating action values from a policy perspective in order to achieve this goal.

During policy evaluation for action values, we estimate qπ(s, a), which represents the anticipated return when initiating in state s, taking action a, and following the policy π [78]. In this task, MC methods are similar, except that visits to state-action pairs are used instead
of states alone. When state s is visited and action a is taken during an episode, the state-action pair (s, a) is considered visited. The Every-visit MC method estimates the value of a state-action pair based on the average of the returns following all visits to it, whereas the First-visit MC method averages the returns following the first time in each episode that the state was visited and the action was selected. It is evident that both methods exhibit quadratic convergence as the number of visits to each state-action pair approaches infinity [1].

A deterministic policy π presents a significant challenge in that numerous state-action pairs may never be visited. When following π, only one action is observed for each state. As a consequence, MC estimates for the remaining actions do not improve with experience, which presents a significant problem. Learning action values makes it easier to choose among all available actions in each state. It is crucial to estimate the value of all actions from each state, and not only the one that is currently favored, for the purpose of making informed comparisons.

It may be beneficial to use exploration starts in some situations, but they cannot be relied upon in all cases, especially when learning directly from the environment, where the initial conditions are less likely to be favorable. Alternatively, stochastic policies that select all actions in each state with a non-zero probability may be considered in order to ensure that all state-action pairs are encountered. As a first step, we will continue to assume exploration starts and conclude with a comprehensive MC simulation approach.

To begin, we consider an MC adaptation of classical Policy Iteration. We use this approach to iteratively evaluate and improve the policy, starting with an arbitrary policy π0 and ending with an optimal policy and optimal action-value function:

\pi_0 \xrightarrow{\text{Evaluation}} q_{\pi_0} \xrightarrow{\text{Improvement}} \pi_1 \xrightarrow{\text{Evaluation}} q_{\pi_1} \xrightarrow{\text{Improvement}} \pi_2 \xrightarrow{\text{Evaluation}} \cdots \xrightarrow{\text{Improvement}} \pi_* \xrightarrow{\text{Evaluation}} q_{\pi_*}    (9)

As the approximate action-value function approaches the true function asymptotically, many episodes occur. For now, we will assume that we observe an infinite number of episodes and that these episodes are generated with exploration starts. Exploration starts refers to an assumption we make: all states have a non-zero probability of being the starting state. This approach encourages exploration, though it is not practical, as the assumption does not hold true in many real-world applications. For any arbitrary policy πk under these assumptions, MC methods will compute qπk accurately.

In order to improve the policy, it is necessary to make the policy greedy with respect to the current value function. As a result, an action-value function is used to construct the greedy policy without requiring a model. Whenever an action-value function q is available, a greedy policy is defined as one that selects the action with the maximal action-value for each state s ∈ S:

\pi(s) \equiv \arg\max_{a} q(s, a)    (10)

Policy Improvement can be executed by formulating πk+1 as the greedy policy with respect to qπk. The Policy Improvement theorem, as discussed earlier, is then applicable to πk and πk+1 because, for all s ∈ S:

q_{\pi_k}(s, \pi_{k+1}(s)) = q_{\pi_k}\left(s, \arg\max_{a} q_{\pi_k}(s, a)\right) = \max_{a} q_{\pi_k}(s, a) \ge q_{\pi_k}(s, \pi_k(s)) = v_{\pi_k}(s)    (11)

Based on the General Policy Theorem [1], each πk+1 is uniformly superior to πk, or equally optimal, in which case both policies are optimal. As a result, the overall process converges to an optimal value function and policy. Using MC methods, optimal policies can be determined solely based on sample episodes, without the need for additional information about the environment's dynamics.

Despite the convergence guarantee for MC simulations, two key assumptions must be addressed to create a practical algorithm: the presence of exploration starts in episodes and the need for an infinite number of episodes for policy evaluation. Our focus here is on removing the assumption of an infinite number of episodes for policy evaluation. This can be achieved by approximating qπk during each evaluation, using measurements and assumptions to minimize error. Although this method could theoretically ensure correct convergence, it often requires an impractically large number of episodes, especially for complex problems. Alternatively, we can avoid relying on infinite episodes by not fully completing the evaluation before improving the policy. Instead, the value function is adjusted towards qπk incrementally across multiple steps, as seen in Generalized Policy Iteration (GPI). In Value Iteration, this approach is evident, where only one policy evaluation occurs between policy improvements. MC policy iteration, by its nature, alternates between evaluation and improvement after each episode. The returns observed in an episode are used for evaluation, followed by policy improvement for all visited states. In the next subsection, we analyze the integration of Importance Sampling [79], a well-known concept in Statistics, with MC methods.
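Before turning to Importance Sampling, the episode-by-episode evaluate-improve cycle just described can be sketched in a few lines of Python. This is a simplified illustration under the exploration-starts assumption; env.states, env.actions(s), env.random_start(), and env.step(s, a) form a hypothetical interface, and returns are averaged in an every-visit style.

from collections import defaultdict

def mc_control_exploration_starts(env, num_episodes=10_000, gamma=1.0):
    """Monte Carlo control with exploration starts: per-episode evaluation of
    q(s, a) followed by greedy policy improvement (Eq. 10)."""
    Q = defaultdict(float)
    counts = defaultdict(int)
    policy = {s: env.actions(s)[0] for s in env.states}

    for _ in range(num_episodes):
        # Exploration start: every (state, action) pair may begin an episode
        state, action = env.random_start()
        trajectory, done = [], False
        while not done:
            next_state, reward, done = env.step(state, action)
            trajectory.append((state, action, reward))
            state = next_state
            if not done:
                action = policy[state]        # follow the current policy afterwards

        # Evaluation: incremental averaging of the observed returns
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = gamma * G + r
            counts[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / counts[(s, a)]

        # Improvement: make the policy greedy with respect to q at visited states
        for s, a, _ in trajectory:
            policy[s] = max(env.actions(s), key=lambda b: Q[(s, b)])
    return Q, policy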
c) Off-Policy Prediction via Importance Sampling: The use of RL as a control method is confronted with a fundamental dilemma, namely the need to learn the value of actions under the assumption of subsequent optimal behavior, contrasted with the need for non-optimal behavior in order to explore all possible actions and find the optimal one. This conundrum leaves us with the question of how we can learn about optimal policies while operating under exploratory policies. An on-policy approach serves as a compromise in this situation: we seek to learn the action values for a policy that, while not optimal, is close to it and incorporates mechanisms for exploration. Due to its exploratory nature, it does not directly address the issue of learning the action values of the optimal policy.

Off-policy learning is an effective method of addressing this challenge by utilizing two distinct policies: a target policy (π) whose objective is to become the optimal policy, and a behavior policy (b) whose purpose is to generate behavior. This dual-policy framework allows exploration to occur independently of learning about the optimal policy, with learning taking place on data generated by the behavior policy rather than the target policy. Off-policy methods are more versatile and powerful than on-policy methods. On-policy methods can also be incorporated as a special case in which the target and behavior policies are the same.

On the other hand, off-policy methods introduce additional complexity, which requires the use of more sophisticated concepts and notations. Off-policy learning involves using data from a different policy, resulting in higher variance and slower convergence than on-policy learning. On-policy methods provide simplicity and direct learning from the agent's exploratory actions, while off-policy methods provide a robust framework for learning optimal policies indirectly through exploration guided by a separate behavior policy. Essentially, the dichotomy between on-policy and off-policy learning reflects the exploration-exploitation trade-off that underlies RL, which enables agents to learn and adapt to complex environments in a variety of ways [3]. Alg. 4 provides a general overview of this algorithm.

Algorithm 4 Off-Policy Prediction (via Importance Sampling)
1: Initialize, for all s ∈ S, a ∈ A(s):
2:    Q(s, a) ∈ R (arbitrarily)
3:    C(s, a) ← 0
4:    π(s) ← arg max_a Q(s, a) (ties broken randomly)
5: repeat
6:    b ← any soft policy
7:    Generate an episode using b: (S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T)
8:    G ← 0
9:    W ← 1
10:   for each step of episode, t = T − 1, ..., 0 do
11:       G ← γG + R_{t+1}
12:       C(S_t, A_t) ← C(S_t, A_t) + W
13:       Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
14:       π(S_t) ← arg max_a Q(S_t, a) (with ties broken consistently)
15:       if A_t ≠ π(S_t) then
16:           break ▷ Proceed to next episode
17:       end if
18:       W ← W · 1 / b(A_t | S_t)
19:   end for
20: until forever

Throughout the next few paragraphs, we will analyze selected research studies that have employed MC methods and the variations mentioned above, examining their rationale and addressing their specific challenges. In analyzing these papers in depth, we intend to demonstrate that MC methods are versatile and effective for solving a wide range of problems, while emphasizing the decision-making processes leading to their selection.

Based on MC with historical data, [27] proposed an intelligent train control approach enhancing energy efficiency and punctuality. It offered a Model-free approach, achieving 6.31% energy savings and improving punctuality. However, the method's success depended heavily on the availability and quality of historical data, and its scalability to larger, more complex networks remained a consideration.

In [80], a Renewal MC (RMC) algorithm was developed to reduce variance, avoid delays in updating, and achieve quicker convergence to locally optimal policies. It worked well with continuous state and action spaces and was applicable across various fields, such as robotics and game theory. The method also introduced an approximate version for faster convergence with bounded errors. However, the performance of RMC was dependent on the chosen renewal set and its size.

In [81], authors introduced an MC off-policy strategy augmented by rough set theory, providing a novel approach. This integration offered new insights and methodologies in the field. However, the approach's complexity might challenge broader applicability, and the study's focus on theoretical formulations necessitated further empirical research for validation.

Authors in [23] introduced a Bayesian Model-free Markov Chain Monte Carlo (MCMC) algorithm for policy search, specifically applied to a 2-DoF robotic manipulator. The algorithm demonstrated practicality and effectiveness in real implementations, adopting a gradient-free strategy that simplified the process and excelled in mastering complex trajectory control tasks
within a limited number of iterations. However, its applicability might be confined to specific scenarios, and the high variance in the estimator presented challenges.

The research on MC Bayesian RL (MCBRL) by [82] introduced an innovative method that streamlined Bayesian RL (BRL). By sampling a limited set of hypotheses, it constructed a discrete, Partially Observable Markov Decision Process (POMDP), eliminating the need for conjugate distributions and facilitating the application of point-based approximation algorithms. This method was adaptable to fully or partially observable environments, showing reliable performance across diverse domains. However, the efficacy of the sampling process was contingent on the choice of prior distribution, and insufficient sample sizes might have affected performance.

In [83], authors introduced a Factored MC Bayesian RL (FMCBRL) approach to solve the BRL problem online. It leveraged factored representations to reduce the size of learning parameters and applied partially observable Monte-Carlo planning as an online solver. This approach managed the complexity of BRL, enhancing scalability and efficiency in large-scale domains.

Researchers in [84] discussed developing self-learning agents for the Batak card game using MC methods for state-value estimation and artificial neural networks for function approximation. The approach handled the large state space effectively, enabling agents to improve gameplay over time. However, the study's focus on a specific card game might have limited the direct applicability of its findings to other domains.

Next, we will discuss another variant of MC, MC with Importance Sampling, which is an off-policy algorithm.

d) MC with Importance Sampling: MC Importance Sampling is a method used to improve the efficiency of MC simulations when estimating expected values. It samples from a proposal distribution rather than directly from the original distribution. As demonstrated in Alg. 5, the samples are re-weighted based on the ratio of the original distribution to the proposal distribution. It is through this re-weighting that the bias introduced by sampling from the alternative distribution is corrected, allowing for a more accurate and efficient estimate [85].

Algorithm 5 MC with Importance Sampling
1: Initialize, for all s ∈ S, a ∈ A(s):
2:    Q(s, a) ← arbitrary
3:    C(s, a) ← 0
4:    µ(a|s) ← an arbitrary soft behavior policy
5:    π(a|s) ← an arbitrary target policy
6: repeat ▷ Forever
7:    Generate an episode using µ: (S_0, A_0, R_1, ..., S_{T−1}, A_{T−1}, R_T, S_T)
8:    G ← 0
9:    W ← 1
10:   for t = T − 1, T − 2, ..., 0 do
11:       G ← γG + R_{t+1}
12:       C(S_t, A_t) ← C(S_t, A_t) + W
13:       Q(S_t, A_t) ← Q(S_t, A_t) + (W / C(S_t, A_t)) [G − Q(S_t, A_t)]
14:       W ← W · π(A_t | S_t) / µ(A_t | S_t)
15:       if W = 0 then
16:           break ▷ Exit inner loop
17:       end if
18:   end for
19: until convergence or a stopping criterion is met

Let us analyze several papers regarding MC Importance Sampling. In [43], authors presented a non-iterative approach for estimating the parameters of superimposed chirp signals in noise using an MC Importance Sampling method. The primary method utilized maximum likelihood to optimize the estimation process efficiently. This approach, which differed from traditional grid-search methods, focused on estimating chirp rates and frequencies, providing a practical solution to multidimensional problems. The technique was non-iterative, reducing computational complexity, and it achieved precise parameter estimation even under challenging conditions. Additionally, the method remained scalable for multiple signal scenarios without significant increases in computational load. However, performance might have been affected below certain signal-to-noise ratios, and the selection of MC samples and other parameters required problem-specific tuning.

In [30], a method to estimate blocking probabilities in multi-cast loss systems through simulation was introduced, improving upon static MC methods with Importance Sampling. The technique divided the complex problem into simpler sub-problems focused on blocking probability contributions from individual links, using a distribution tailored to the blocking states of each link. An inverse convolution method for sample generation, coupled with a dynamic control algorithm, achieved significant variance reduction. Although this method efficiently allocated samples and reduced variance, its complexity and setup requirements might have made it less accessible compared to simpler approaches.

Authors in [25] introduced a method for improving simulation efficiency in rare-event phenomena through Importance Sampling. Initially developed for Markovian random walks, this method was expanded to include non-Markovian scenarios and molecular dynamics simulations. It increased simulation efficiency by optimizing the sampling of successful transition paths and introduced a method for identifying an optimal importance function to
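The per-episode weighted importance sampling update shared by Algs. 4 and 5 can be rendered compactly in Python. In this sketch, which is an illustration rather than code from the cited studies, the episode is assumed to be a recorded list of (state, action, reward) triples generated by the behavior policy, and pi_prob/mu_prob are placeholder callables returning action probabilities under the target and behavior policies.

def weighted_is_episode_update(Q, C, episode, pi_prob, mu_prob, gamma=0.99):
    """One episode of the weighted importance sampling update (cf. Alg. 5).

    Q and C are dicts keyed by (state, action); pi_prob(a, s) and mu_prob(a, s)
    give the target and behavior policy probabilities (placeholder interfaces).
    """
    G, W = 0.0, 1.0
    for state, action, reward in reversed(episode):
        G = gamma * G + reward
        key = (state, action)
        C[key] = C.get(key, 0.0) + W
        Q[key] = Q.get(key, 0.0) + (W / C[key]) * (G - Q.get(key, 0.0))
        # Re-weight by the likelihood ratio of the target vs. the behavior policy
        W *= pi_prob(action, state) / mu_prob(action, state)
        if W == 0.0:
            break  # the target policy would never generate the remaining prefix
    return Q, C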
The practical benefits of TD networks were evident in scenarios requiring specific sequences of actions. While the paper focused on small-scale experiments, the results suggested that TD networks had significant potential for broader applicability and effectiveness in various domains.

In [33], an advanced multi-step TD learning technique was introduced, emphasizing the use of per-decision control variates to reduce variance in updates. This method significantly improved the performance of multi-step TD algorithms, particularly in off-policy learning contexts, by enhancing stability and convergence speed. Empirical results from tasks like the 5x5 Grid World and Mountain Car showed that the n-step SARSA method outperformed standard n-step Expected SARSA in reducing root-mean-square error and improving learning efficiency. The approach was compatible with function approximation, underscoring its practical applicability in complex environments.

Researchers in [95] developed a unified framework to study finite-sample convergence guarantees of various Value-based asynchronous RL algorithms. By reformulating these RL algorithms as Markovian Stochastic Approximation algorithms and employing a Lyapunov analysis, the authors derived mean-square error bounds on the convergence of these algorithms. This framework provided a systematic approach to analyzing the convergence properties of algorithms like Q-learning, n-step TD, TD(λ), and off-policy TD algorithms such as V-trace. The paper effectively addressed the challenges of handling asynchronous updates and offered robust convergence guarantees through detailed finite-sample mean-square convergence bounds.

In [96], the "deadly triad" in RL—off-policy learning, bootstrapping, and function approximation—was addressed through an in-depth theoretical analysis of multi-step TD learning. The paper provided a comprehensive theoretical foundation for understanding the behavior of these algorithms in off-policy settings with linear function approximation. A notable contribution was the introduction of Model-based deterministic counterparts to multi-step TD learning algorithms, enhancing the robustness of the findings. The results demonstrated that multi-step TD learning algorithms could converge to meaningful solutions when the sampling horizon n was sufficiently large, addressing a critical issue of divergence in certain conditions.

In [97], researchers delved into various facets of TD learning, particularly focusing on multi-step methods. The thesis introduced the innovative use of control variates in multi-step TD learning to reduce variance in return estimates, thereby enhancing both learning speed and accuracy. The work also presented a unified framework for multi-step TD methods, extending the n-step Q(σ) algorithm and proposing the n-step CV Q(σ) and Q(σ, λ) algorithms. This unification clarified the relationships between different multi-step TD algorithms and their variants. Additionally, the thesis extended predictive knowledge representation into the frequency domain, allowing TD learning agents to detect periodic structures in the return, providing a more comprehensive representation of the environment.

"Undelayed N-step TD prediction" (TD-P), developed in [36], integrated techniques like eligibility traces, value function approximators, and environmental models. This method employed Neural Networks to predict future steps in a learning episode, combining RL techniques to enhance learning efficiency and performance. By using a forward-looking mechanism, the TD-P method sought to gather additional information that traditional backward-looking eligibility traces might miss, leading to more accurate value function updates and better decision-making. The TD-P method was particularly designed for partially observable environments, utilizing Neural Networks to handle complex and continuous state-action spaces effectively. To further solidify our knowledge of TD methods, we will analyze the off-policy version of N-step Learning over the following paragraphs.

e) N-step Off-policy Learning: N-step off-policy learning is an advanced method that combines the concepts of N-step returns with off-policy updates. This approach leverages the benefits of multi-step returns to improve the stability and performance of learning algorithms while allowing the use of data generated by a different policy (the behavior policy b) than the one being improved (the target policy π) [1]. Multi-step returns differ from one-step methods by considering cumulative rewards over multiple steps, providing a more comprehensive view of future rewards and leading to more accurate value estimates. In off-policy learning, the policy used to generate behavior (the behavior policy) is different from the policy being optimized (the target policy), allowing for the reuse of past experiences and improving sample efficiency by enabling learning from demonstrations or historical data. Importance Sampling is used to correct the discrepancy between the behavior policy and the target policy in N-step off-policy learning, where Importance Sampling ratios adjust the updates to account for differences in action probabilities under the two policies. The N-step off-policy return, G_t^{(n)}, can be calculated in the following way:

G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})    (17)

To ensure the update is off-policy, Importance Sampling ratios are incorporated as follows:
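The ratio-corrected expression itself falls outside this excerpt. Purely as an illustration, the sketch below computes the n-step return of Eq. 17 and applies one common form of the cumulative importance sampling ratio (the product of π/b over the sampled actions, here using the state-value indexing convention); pi_prob, b_prob, and the container names are placeholders, and the exact indexing varies across formulations.

def n_step_off_policy_update(rewards, states, actions, V, pi_prob, b_prob,
                             t, n, alpha=0.1, gamma=0.99):
    """Assumed form of an n-step off-policy state-value update.

    rewards[k] holds R_{k+1}; V maps states to value estimates; pi_prob(a, s)
    and b_prob(a, s) give target/behavior action probabilities.
    """
    # n-step return of Eq. 17:
    # R_{t+1} + gamma R_{t+2} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n})
    G = sum(gamma ** k * rewards[t + k] for k in range(n))
    G += gamma ** n * V[states[t + n]]

    # Cumulative importance sampling ratio over the actions actually taken
    rho = 1.0
    for k in range(t, t + n):
        rho *= pi_prob(actions[k], states[k]) / b_prob(actions[k], states[k])

    # Move V(S_t) toward the ratio-corrected n-step target
    V[states[t]] += alpha * rho * (G - V[states[t]])
    return G, rho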
Hadoop, improving its efficiency by iteratively adjusting configurations based on feedback from performance metrics introduced in [107]. This approach reduced the manual effort and expertise required for parameter tuning and demonstrated marked improvements in processing speeds. However, the model's simplification of the parameter space might have overlooked interactions that could achieve further optimization, and the scalability and adaptability in varying operational environments remained somewhat uncertain.

Researchers in [45] presented a novel approach to enhancing the accuracy of nuclei segmentation in pathological images using Q-learning and Deep Q-Network (DQN) (discussed in III-C1) algorithms. The study reported improvements in Intersection over Union for segmentation, critical in cancer diagnosis. While the approach adapted the segmentation threshold dynamically, leading to better outcomes, the reliance on RL introduced computational complexity and required significant training data.

The sampling efficiency of Q-learning was discussed in [108], exploring the feasibility of making Model-free algorithms as sample-efficient as Model-based counterparts. The research introduced a variant of Q-learning that incorporated UCB exploration, achieving a regret bound approaching the best achievable by Model-based methods. This paper marked a significant theoretical advance by providing rigorous proof that Q-learning could achieve sublinear regret in episodic MDPs without requiring a simulator. However, the analysis relied on assumptions such as an episodic structure and precise model knowledge, which might not have translated well to more uncertain environments.

Researchers in [109] examined the deployment of Q-learning on Field Programmable Gate Arrays (FPGAs) to enhance processing speed by leveraging parallel computing capabilities. The study demonstrated substantial acceleration of the Q-learning process, making it suitable for real-time scenarios. However, the focus on a particular FPGA model might have restricted the broader applicability of the findings to different hardware platforms.

The challenge of managing frequent handovers in high-speed railway systems using a Q-learning-based approach was addressed in [110]. The authors proposed a scheme that minimized unnecessary handovers and enhanced network performance by dynamically adjusting to changes in the network environment. While the approach showed promise, its complexity, computational demands, and need for real-time processing presented challenges for practical implementation.

In [111], authors explored optimizing taxi dispatching and routing using Q-learning and MDPs to enhance taxi service efficiency in urban environments. The method enabled dynamic adjustments based on real-time conditions, potentially alleviating urban traffic congestion. However, the model's effectiveness heavily depended on the accuracy and comprehensiveness of input data, and the computational complexity might have posed challenges for real-time deployment.

The study [112] addressed the optimization of network routing within Optical Transport Networks using a Q-learning-based algorithm under the Software-Defined Networking framework. The approach improved network capacity and efficiency by managing routes based on learned network conditions. While the method outperformed traditional routing strategies, its scalability and computational complexity in real-world scenarios remained areas for further exploration.

In [113], authors proposed a Q-learning algorithm motivated by internal stimuli, specifically visual novelty, to enhance autonomous learning in robots. The method blended Q-learning with a cognitive developmental approach, making the robot's learning process more dynamic and responsive to new stimuli. While the approach reduced computational costs and enhanced adaptability, its scalability in complex environments was not thoroughly explored.

The use of Q-learning and Double Q-learning (discussed in the next subsection) to manage the duty cycles of IoT devices in environmental monitoring was analyzed in [114]. The study demonstrated significant advancements in optimizing IoT device operations, enhancing energy efficiency and operational effectiveness. However, the performance heavily depended on environmental parameters, and the increased memory requirements for Double Q-learning might have posed challenges for memory-constrained IoT devices.

The study [115] presented a robust implementation of RL to address resource allocation challenges in Fog Radio Access Networks (Fog RAN) for IoT. The use of Q-learning effectively adapted to changes in network conditions, improving network performance and reducing latency. While the approach showed promise, the reliance on Q-learning might not have fully captured the complexities of larger-scale IoT networks, and the exploration-exploitation trade-off could have led to suboptimal performance in highly dynamic environments.

An overview of the reviewed papers categorized by their respective domains is presented in Table V. Q-learning has an overestimation bias primarily because of the max operator used in the Q-value update rule, which tends to favor overestimated action values. To address this, researchers introduced a new architecture to tackle this issue, which we will discuss in the next subsection.

g) Double Q-learning: Double Q-learning was introduced in [118] as an enhancement to traditional Q-learning to reduce the overestimation of action values,
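The overestimation effect noted above is easy to reproduce numerically. The toy snippet below is illustrative only and is not drawn from the cited studies: it compares the single-estimator max used in the Q-learning target with a double-estimator selection when every true action value is zero and the estimates are pure noise.

import random

def max_bias_demo(num_actions=10, noise=1.0, trials=20_000, seed=0):
    """Average value reported for a state whose true action values are all zero."""
    rng = random.Random(seed)
    single_total, double_total = 0.0, 0.0
    for _ in range(trials):
        qa = [rng.gauss(0.0, noise) for _ in range(num_actions)]  # estimator A
        qb = [rng.gauss(0.0, noise) for _ in range(num_actions)]  # estimator B
        single_total += max(qa)                       # max over noisy estimates: biased upward
        best = max(range(num_actions), key=qa.__getitem__)
        double_total += qb[best]                      # select with A, evaluate with B: unbiased
    print("single-estimator max:", single_total / trials)   # clearly above 0
    print("double estimator:   ", double_total / trials)    # close to 0

if __name__ == "__main__":
    max_bias_demo()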
a detailed discussion of how the implementation could be adapted to different RL scenarios or environments, which may have limited its applicability without additional modifications or tuning.

The integration of A* pathfinding with Double Q-learning to optimize route planning in autonomous driving systems was investigated in [120]. This innovative approach aimed to enhance route efficiency and safety by minimizing the common problem of action-value overestimation found in standard Q-learning. The paper introduced a novel combination of A* pathfinding with RL, providing a dual strategy that leveraged the deterministic benefits of A* for initial path planning and the adaptive strengths of the proposed method for real-time adjustments to dynamic conditions, such as unexpected obstacles. This synergy allowed for a balanced approach to navigating real-world driving scenarios efficiently. By employing Double Q-learning, the system addressed and mitigated the issue of action-value overestimation, which was prevalent in RL. This enhancement was crucial for autonomous driving applications where decisions had to be both accurate and dependable to ensure safety and operational reliability. On the other hand, this combination, while robust, introduced a significant computational demand that might have impacted the system's performance in real-time scenarios. Quick decision-making was essential in dynamic driving environments, and the increased computational load could have hindered the system's ability to respond promptly. Additionally, the proposed method lacked a detailed exploration of how this hybrid model performed across different environmental conditions or traffic scenarios. This limitation might have affected the model's effectiveness in diverse settings without further adaptation or refinement.

Researchers in [121] delved into optimizing IoT devices' power management by employing a Double Q-learning based controller. This Double Data-Driven Self-Learning (DDDSL) controller dynamically adjusted operational duty cycles, leveraging predictive data analytics to enhance power efficiency significantly. A notable strength of the paper was the improved operational efficiency introduced by the Double Q-learning, which effectively handled the overestimation issues found in standard Q-learning within stochastic environments. This led to more precise power management decisions, crucial for prolonging battery life and minimizing energy usage in IoT devices. Furthermore, the DDDSL controller showed a marked performance enhancement, outperforming traditional fixed duty cycle controllers by 42–50%, and the previous Data-Driven Self-Learning (DDSL) model by 2–12%. However, while the performance improvements were compelling, they were obtained under specific conditions, which might have limited the broader applicability of the findings without further adaptations or validations for different operational environments or IoT device configurations.

A Double Q-learning based routing protocol for optimizing maritime network routing, crucial for effective communication in maritime search and rescue operations, was developed in [122]. The use of Double Q-learning in this context aimed to tackle the overestimation problems inherent in Q-learning protocols, which was a significant improvement in maintaining stability in the model's predictions and actions. This approach not only enhanced the routing efficiency but also incorporated a trust management system to ensure the reliability of data transfers and safeguard against packet-dropping attacks. The protocol demonstrated robust performance in various simulated attack scenarios with efficient energy consumption and a minimal resource footprint, crucial for the resource-constrained environments in which maritime operations occurred. While the proposed method showed promising results in simulations, the complexity of real-world application scenarios could have posed challenges. Maritime environments were highly dynamic with numerous unpredictable elements, which might have affected the consistency of the performance gains observed in controlled simulations. Additionally, the scalability of this approach when applied to very large-scale networks or under extreme conditions typical of maritime emergencies could have required further validation. The integration of such sophisticated systems also raised concerns about the computational overhead and the practical deployment in existing maritime communication infrastructures.

Authors in [123] explored the application of Double Q-learning to manage adaptive wavelet compression in environmental sensor networks. This method aimed to optimize data transmission efficiency by dynamically adjusting compression levels based on real-time communication bandwidth availability. A key advantage of this approach was its high adaptability, which ensured efficient bandwidth utilization and minimized data loss even under fluctuating network conditions, critical for remote environmental monitoring stations where connectivity might have been inconsistent. Furthermore, this integration helped significantly reduce the risk of overestimating action values, a common problem in Q-learning, which could have led to suboptimal compression settings. However, the complexity of this implementation posed a notable challenge, particularly in environments where computational resources were limited. The necessity of managing two separate Q-values for each action increased the computational demand, potentially impacting the system's real-time response capabilities. Additionally, the performance of the algorithm heavily depended on the accuracy of the network condition
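The studies above all build on the same core mechanism: two action-value tables updated in alternation, with one table selecting the greedy action and the other evaluating it. A minimal sketch of that update is given below; it is one common formulation for illustration, not the exact procedure of [118] or of any single application paper.

import random
from collections import defaultdict

def double_q_update(QA, QB, state, action, reward, next_state, actions,
                    alpha=0.1, gamma=0.99, done=False):
    """One Double Q-learning step on two tables QA and QB (dicts keyed by
    (state, action)); `actions` lists the actions available in next_state."""
    # Randomly choose which table to update; the other provides the evaluation
    if random.random() < 0.5:
        select, evaluate = QA, QB
    else:
        select, evaluate = QB, QA
    if done:
        target = reward
    else:
        # Select the argmax with one table ...
        best = max(actions, key=lambda a: select[(next_state, a)])
        # ... but evaluate it with the other, reducing the max-operator bias
        target = reward + gamma * evaluate[(next_state, best)]
    select[(state, action)] += alpha * (target - select[(state, action)])

# Example containers; behavior can act on QA[(s, a)] + QB[(s, a)]
QA, QB = defaultdict(float), defaultdict(float)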
movement and the complexity of the shepherding task led to a significant learning time, with notable success only after many episodes, indicating potential efficiency issues in more demanding or time-sensitive applications.

Authors in [116] primarily focused on the implementation of Q-learning and SARSA algorithms, incorporating eligibility traces to improve handling delayed rewards. The research demonstrated the effectiveness of RL algorithms in navigating dynamic tasks like micro-managing combat units in real-time strategy games. Adding eligibility traces significantly boosted the algorithms' learning from sequences of interdependent actions, essential in fast-paced, chaotic environments. Using a commercial game like StarCraft as a testbed introduced real-world complexity often absent in simulated settings. This method confirmed the practicality of RL algorithms in real scenarios and highlighted their potential to adapt to commercial applications. The algorithms showed promise in small-scale combat, but scaling to larger, more complex battles in StarCraft or similar games remained uncertain. Concerns arose about computational demands and the efficiency of learning optimal strategies without extensive prior training. While effective within StarCraft, their broader applicability to other real-time strategy games or applications remained untested. The specialized design of state and action spaces for StarCraft could have hindered transferring these methods to different domains without significant modifications.

In [28], authors studied the enhancement of energy efficiency in IoT networks through strategic power and channel resource allocation using a SARSA-based algorithm. This approach addressed the unpredictable nature of energy from renewable sources and the variability of wireless channels in real-time settings. A major contribution of the study lay in its innovative use of a Model-free, on-policy RL method to manage energy distribution across IoT nodes. This method efficiently handled the stochastic nature of energy harvesting and channel conditions, optimizing network performance in terms of energy efficiency and longevity. Integrating SARSA with linear function approximation helped refine solutions to continuous state and action spaces, enhancing the practicality in real-world IoT applications. However, relying on linear function approximation introduced limitations in capturing the full complexity of interactions in dynamic, multi-dimensional state spaces typical of IoT environments. While the proposed SARSA algorithm demonstrated network efficiency improvements, implementing such an RL system in real IoT networks posed challenges. These included the need for continual learning and adaptation to changing conditions, impacting the deployment's feasibility and scalability.

Authors in [125] explored applying SARSA to optimize peer-to-peer (P2P) electricity transactions among small-scale users in smart energy systems. This research aimed to enhance economic efficiency and reliability in decentralized energy markets. A significant strength lies in its innovative application of SARSA to a complex, dynamic energy trading system. Researchers modeled P2P electricity transactions as an MDP, allowing the system to handle uncertainties in small-scale energy trading, like fluctuating prices and varying demands. This modeling enabled the system to learn optimal transaction strategies over time, potentially enhancing both efficiency and profitability in energy trading. However, the approach had limitations. SARSA, while effective in learning optimal policies through trial and error, required extensive interaction with the environment to achieve satisfactory performance. This data requirement could have been a drawback in real-world applications where immediate decisions were necessary, and historical transaction data was limited. Moreover, implementing such a system in a live environment, where real-time decision-making was crucial, posed additional challenges, including the need for robust computational resources to handle continuous state and action space calculations.

In [47], they examined integrating SARSA(0) with Fully Homomorphic Encryption (FHE) for cloud-based control systems to ensure data confidentiality while performing RL computations on encrypted data. A significant strength was preserving privacy in cloud environments where sensitive control data could have been vulnerable to breaches. Using FHE allowed the RL algorithm to execute without decrypting the data, providing a robust method for maintaining confidentiality. The paper successfully demonstrated this method on a classical pole-balancing problem, showing it was theoretically sound and practically feasible. However, implementing SARSA(0) over FHE introduced challenges related to computational overhead and latency due to encryption operations. These factors could have impacted the efficiency and scalability of the RL system, especially in environments requiring real-time decision-making. Additionally, encryption-induced delays and managing encrypted computations might have limited this method's application to scenarios where control tasks could tolerate such delays.

Authors in [117] presented a novel application of SARSA within a swarm RL framework to solve optimization problems more efficiently, particularly those involving large negative rewards. The authors incorporated individual learning and cooperative information exchange among multiple agents, aiming to speed up learning and enhance decision-making efficiency. This approach significantly leveraged swarm intelligence and RL, particularly SARSA's ability to handle tasks with substantial negative rewards. By enabling agents to learn
from individual experiences and shared insights, the system could converge to optimal policies more swiftly than traditional single-agent or non-cooperative multi-agent systems. Implementing the shortest path problem demonstrated the method's practicality and effectiveness, showcasing improved learning speeds and robustness against pitfalls with significant penalties. Nonetheless, managing multiple agents and their interactions increased the computational overhead and complexity of the learning process. Moreover, the approach heavily relied on designing the information-sharing protocol among agents, which, if not optimized, could have led to inefficiencies or suboptimal learning outcomes. The generalized application of this method across different environments or RL tasks remained to be thoroughly tested, suggesting potential limitations in adaptability.

Algorithm 13 Expected SARSA
1: Input: policy π, positive integer num_episodes, small positive fraction α, GLIE {εi}
2: Output: value function Q (qπ if num_episodes is large enough)
3: Initialize Q arbitrarily (e.g., Q(s, a) = 0 for all s ∈ S and a ∈ A(s), and Q(terminal-state, ·) = 0)
4: for i ← 1 to num_episodes do
5:   ε ← εi
6:   Observe S0
7:   t ← 0
8:   repeat
9:     Choose action At using policy derived from Q (e.g., ε-greedy)
10:    Take action At and observe Rt+1, St+1
11:    Q(St, At) ← Q(St, At) + α (Rt+1 + γ Σa π(a|St+1) Q(St+1, a) − Q(St, At))
12:    t ← t + 1
13:  until St is terminal
14: end for
15: return Q
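To ground Alg. 13 in runnable form, the following is a minimal Python sketch of tabular Expected SARSA with an ε-greedy behaviour policy. The environment interface (env.nS, env.nA, env.reset(), env.step()) and the use of a fixed ε instead of the GLIE schedule {εi} are simplifying assumptions for illustration, not details taken from any surveyed paper.

import numpy as np

def expected_sarsa(env, num_episodes, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Expected SARSA (cf. Alg. 13). Assumes env.reset() -> s and
    env.step(a) -> (s_next, r, done) over discrete states/actions."""
    Q = np.zeros((env.nS, env.nA))
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy derived from Q
            if np.random.rand() < epsilon:
                a = np.random.randint(env.nA)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done = env.step(a)
            # expectation over the epsilon-greedy policy pi(.|s_next), step 11 of Alg. 13
            pi = np.full(env.nA, epsilon / env.nA)
            pi[int(np.argmax(Q[s_next]))] += 1.0 - epsilon
            expected_q = float(np.dot(pi, Q[s_next]))
            target = r if done else r + gamma * expected_q
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q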
Authors in [128] designed a comprehensive exploration of combining Model Predictive Control (MPC) with the Expected SARSA algorithm to tune MPC models' parameters. This integration aimed to enhance the robustness and efficiency of control systems, particularly in applications where the system model's parameters were not fully known or subject to change. A key advantage was the innovative approach to speeding up learning by directly integrating RL with MPC, reducing the episodes needed for effective training. This efficiency was achieved using the Expected SARSA algorithm, which offered smoother convergence and better performance due to its average-based update rule compared to the more common Q-learning method, which focused on maximum expected rewards and might lead to higher variance in updates. However, the approach's complexity and the need for precise tuning of MPC model parameters represented significant challenges. The computational demands increased due to the dual needs of continuous adaptation by the RL algorithm and the rigorous constraints enforced by MPC. While the framework showed potential in simulations, its real-world applicability, especially in highly dynamic and unpredictable environments, required further validation.

Authors in [129] focused on improving Deterministic and Synchronous Multichannel Extension (DSME) networks' resilience against WiFi interference using the Expected SARSA algorithm. It evaluated channel adaptation and hopping strategies to mitigate interference in industrial environments, providing a detailed analysis with Expected SARSA. The study stood out for its rigorous simulation-based evaluation of interference mitigation strategies in a controlled DSME network. By using Expected SARSA, the research effectively reduced uncertainty in channel quality assessment, leading to more reliable network performance under interference conditions, crucial for industrial applications where reliable data transmission was vital for operational efficiency and safety. However, the complexity of RL implementation and its dependency on accurate real-time data posed challenges. The computational demands of running Expected SARSA in real-time environments could limit the practical deployment of this strategy in resource-constrained settings.

In [130], researchers investigated using the Expected SARSA learning algorithm to manage Load Frequency Control in multi-area power systems. The study focused on enhancing the stability of power systems integrated with Distributed Feed-in Generation based wind power, using a Model-free RL approach to adjust to variable power supply conditions without a predefined system model. The primary strength lies in its innovative approach to addressing challenges in integrating renewable energy sources into the power grid. Expected SARSA, an on-policy RL algorithm, demonstrated how to manage and stabilize frequency variations due to unpredictable renewable energy outputs and fluctuating demand adaptively. This was crucial for maintaining reliability and efficiency in power grids increasingly incorporating variable renewable energy sources. However, the approach's downside was related to the RL algorithm's computational demands and the need for extensive simulation and testing to fine-tune system parameters. Implementing such a system in a real-world setting could be constrained by these factors, particularly in terms of real-time computation capabilities and the scalability of the solution to larger, more complex grid systems.

Authors in [31] utilized the Expected SARSA algorithm to optimize Energy Storage Systems (ESS) operation in managing uncertainties in wind power generation forecasts. The study modeled the problem as an MDP where the state and action spaces were defined by the ESS's operational constraints. Expected SARSA's primary result in this context was its superior performance compared to conventional Q-learning-based methods. The algorithm effectively handled the wide variance in wind power forecasts, crucial for optimizing the ESS's charging and discharging actions to reduce forecast errors. The strategy's effectiveness was underscored by its near-optimal performance, closely approximating the optimal solution with complete future information. Simulation results demonstrated that the Expected SARSA-based strategy could manage wind power forecast uncertainty more effectively by adapting to varying conditions. This adaptability was enhanced by including frequency-domain data clustering, which refined the learning process and reduced input data variability, further improving the RL model's performance. The last variant of SARSA is N-step SARSA. We will cover it in the next subsection before delving deeper into Approximation Model-free algorithms.

j) N-step SARSA: With foundational concepts of n-step TD and bootstrapping established, we can expand on these ideas and explore research papers utilizing them. Let's begin with N-step SARSA, an enhancement of the conventional one-step SARSA within the on-policy learning framework. In N-step SARSA, as shown in Alg. 14, the update to the value (or action-value) is not limited to just the next state and action, as seen in one-step SARSA, but instead incorporates a sequence of n actions and rewards. The value function in the equation is replaced with Q(St+n, At+n):
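In its standard form, with the truncated n-step return G_{t:t+n} bootstrapped by Q(S_{t+n}, A_{t+n}), the update can be written as:

G_{t:t+n} = R_{t+1} + γ R_{t+2} + … + γ^{n−1} R_{t+n} + γ^n Q(S_{t+n}, A_{t+n})

Q(S_t, A_t) ← Q(S_t, A_t) + α [ G_{t:t+n} − Q(S_t, A_t) ]

For n = 1 this reduces to the one-step SARSA update, while larger n propagates reward information further back along the trajectory at the cost of a longer delay before each update.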
By combining on-policy and off-policy updates, the algorithm capitalized on the strengths of both SARSA and Q-learning, potentially decreasing the overestimation bias seen in Q-learning while still targeting optimal policies. The approach's ability to adjust the N parameter based on learning phases or specific conditions suggested customization for a broad array of applications, from simple gaming scenarios to intricate decision-making tasks in real-world contexts. However, the algorithm's requirement to finely adjust various hyperparameters, like the maximum and minimum values of N and the breakpoint for decrementing N, added complexity to its configuration and optimization. This could pose a challenge, particularly for users with limited RL experience. Additionally, maintaining and updating a combination of policies over multiple steps before reaching a decision resulted in higher computational demands, potentially restricting the algorithm's applicability in settings with limited resources or where rapid response times were essential. The categorization of the examined papers by their domain is given in Table VII.

TABLE VII: SARSA Papers Review
Application Domain | References
Multi-agent Systems and Autonomous Behaviors (Shepherding, Virtual Agents) | [48]
Games and Simulations (Real-Time Strategy Games) | [116]
Energy and Power Management (IoT Networks, Smart Energy Systems) | [28], [125]
Cloud-based Control and Encryption Systems | [47]
Swarm Intelligence and Optimization Problems | [117]
Transportation and Routing Optimization (EVs) | [126]
MPC Tuning | [128]
Network Resilience and Optimization | [129], [31]
Power Systems and Energy Management | [130]
Network Optimization (Optical Transport Networks, Fog RAN) | [115]
Intelligent Traffic Signal Control | [131]
Hybrid RL Algorithms | [132]

3) Summary of Tabular Model-free Algorithms: In this section, we delved into Tabular Model-free algorithms as one of the categories in RL. After providing a brief overview of each algorithm and studies that used those methods, it is time to give a complete summary in Table VIII.

It is necessary to explain what scalability and sample efficiency mean. Scalability refers to the algorithm's ability to handle varying sizes of environments or state spaces. High scalability indicates that the algorithm remains efficient even in larger environments or state spaces. Moderate scalability means that the algorithm can handle medium-sized problems effectively. Low scalability suggests that the algorithm struggles with large state spaces, often due to computational complexity or memory requirements. Sample efficiency reflects how effectively the algorithm learns from the available data. High sample efficiency denotes that the algorithm can learn effectively from fewer samples. Moderate sample efficiency indicates a requirement for a moderate number of samples. Low sample efficiency suggests that the algorithm needs a large number of samples to learn effectively. Bootstrapping involves using estimates to update values, accelerating the learning process. Eligibility Traces track state-action pairs to improve learning efficiency by integrating information over multiple steps. Experience Replay stores past experiences, which can be reused during learning to improve performance. Exploration Starts ensure that all states are visited by starting from random states, promoting thorough exploration of the environment.

In the next section, we start analyzing another paradigm of Model-free algorithms, approximation-based algorithms, before analyzing part II of Tabular Model-based ones.

C. Approximation Model-free Algorithms
In this section, we analyze Approximation Model-free algorithm variations and their applications in various domains. It must be noted that we assume readers have knowledge of DL before reading this section; readers are referred to [133], [134], [135], [136] to understand and learn DL. Approximation Model-free algorithms in RL comprise methods oriented to learning policies and value functions solely from interaction with an environment, without an explicitly stated model of the environment's dynamics. These algorithms typically use function approximators, such as neural networks, and generalize from observed state–action pairs to unknown ones. Consequently, they are quite effective at handling large or continuous state and action spaces [59], [137], [138].
Key features of the Approximation Model-free algorithms are:
• These algorithms learn the policy directly by optimizing expected reward without an explicit construction of the model of the environment.
• In value-function estimation, the value function is estimated using function approximators predicting expected rewards for states or state–action pairs (see the sketch after this list).
• The solutions are scalable, thus fitting for problems with large or continuous state and action spaces since they can generalize from limited data.
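As a concrete illustration of the second point, below is a minimal sketch of action-value estimation with a linear function approximator and a semi-gradient SARSA-style update; the feature function phi and the calling code are placeholders assumed for illustration, not taken from any of the surveyed papers.

import numpy as np

def semi_gradient_sarsa_update(w, phi, s, a, r, s_next, a_next,
                               alpha=0.1, gamma=0.99, done=False):
    """One semi-gradient update for a linear approximator Q(s, a) = w . phi(s, a).
    phi is a user-supplied feature function returning a NumPy vector; w is the weight vector."""
    q_sa = w @ phi(s, a)
    target = r if done else r + gamma * (w @ phi(s_next, a_next))
    td_error = target - q_sa
    # For a linear approximator, the gradient of Q with respect to w is simply phi(s, a).
    w += alpha * td_error * phi(s, a)
    return w

With a neural-network approximator, the same TD error is used, but the weight update is replaced by a gradient step through the network, which is the setting of the DQN family discussed next.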
The gradient descent step is given by:

∇_{θ_i} L_i(θ_i) = E_{(s,a,r,s′)∼D} [ ( r + γ max_{a′} Q(s′, a′; θ⁻) − Q(s, a; θ_i) ) ∇_{θ_i} Q(s, a; θ_i) ]   (27)

Alg. 15 details the complete Deep Q-learning algorithm, regardless of whether CNNs are used, as per the first paper. The algorithm is stated in a general form to give an overall view of the method.
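To make the update in Eq. (27) concrete, the following is a minimal PyTorch-style sketch of one DQN gradient step on a sampled minibatch. The network class, replay buffer, and hyperparameter values are illustrative assumptions rather than details from the surveyed papers; minimizing the squared TD error below corresponds, up to a constant factor, to the gradient in Eq. (27).

import torch
import torch.nn.functional as F

def dqn_gradient_step(q_net, target_net, optimizer, batch, gamma=0.99):
    # batch: (s, a, r, s_next, done) tensors sampled from a replay buffer D;
    # a is a LongTensor of action indices, done is a float mask (1.0 at terminal states).
    s, a, r, s_next, done = batch
    # Q(s, a; theta_i) for the actions actually taken
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # r + gamma * max_a' Q(s', a'; theta^-), with bootstrapping disabled at terminal states
        target = r + gamma * (1.0 - done) * target_net(s_next).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()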
The authors applied DQN to a reference office building, simulating its performance with an EnergyPlus model, aiming to minimize energy use while maintaining indoor CO2 concentrations below 1,000 ppm. One of the key strengths of this study was its demonstration of how DQN could improve HVAC control by learning from previous actions, states, and rewards, thus offering a Model-free optimization approach. The paper reported a significant reduction in total energy usage (15.7%) compared to baseline operation, highlighting the potential of DQN to enhance energy efficiency in buildings. The research also addressed the complexity and interconnectivity of HVAC systems, providing a practical solution to a traditionally challenging control problem. However, the paper also had some limitations. The reliance on simulation data for training and testing the DQN might not have fully captured the intricacies and variations of real-world scenarios. Additionally, the study focused on a specific type of building and HVAC setup, which might have limited the generalizability of the results to other building types or climates. Furthermore, the study did not delve deeply into the potential challenges of implementing such a system in practice, such as the need for extensive data collection and processing capabilities.

An innovative application of DQN to localize brain tumors in Magnetic Resonance Imaging (MRI) images was introduced in [141]. The key strength of this study lay in its ability to generalize tumor localization with limited training data, addressing significant limitations of supervised DL, which often required large annotated datasets and struggled with generalization. The authors demonstrated that the DQN approach achieved 70% accuracy on a testing set with just 30 training images, significantly outperforming the supervised DL method that showed only 11% accuracy due to over-fitting. This showcased the robustness of RL in handling small datasets and its potential for broader application in medical imaging. The use of a grid-world environment to define state-action spaces and a well-designed reward system further strengthened the methodology. A notable weakness, however, was the limitation to two-dimensional image slices, which might not have fully captured the complexities of three-dimensional medical imaging. Future work should have addressed this by extending the approach to 3D volumes and exploring more sophisticated techniques to improve stability and accuracy.

Authors in [142] presented a routing algorithm for enhancing the sustainability of Rechargeable Wireless Sensor Networks. The authors proposed an adaptive dual-mode routing approach that integrated multi-hop routing with direct upload routing, optimized through DQN. The strength of this paper lies in its innovative use of RL to dynamically adjust the routing mode based on the life expectancy of nodes, significantly improving the network's energy efficiency and lifespan. The simulation results demonstrated that the proposed algorithm achieved a correct routing mode selection rate of 95% with limited network state information, showcasing its practical applicability and robustness. However, the paper did not fully address potential real-world challenges such as the computational overhead of implementing DQN in resource-constrained sensor nodes and the impact of network topology changes over time. Additionally, while the simulations showed promising results, practical deployment and testing in diverse environments would have been necessary to validate the generalizability of the approach.

Researchers in [143] explored the application of DQN optimized using Particle Swarm Optimization (PSO) for resource allocation in fog computing environments, specifically tailored for healthcare applications. A notable strength of this study was its innovative combination of DQN and PSO, which effectively balanced resource allocation by reducing makespan and improving both average resource utilization and load balancing levels. The methodology leveraged real-time data to dynamically adjust resources, showcasing a practical application of DQN in a critical domain. However, the paper could have benefited from a more detailed discussion on the scalability of the proposed system and its performance under varying network conditions. Additionally, while the results were promising, they were primarily based on simulations, which might not have fully captured the complexities and unpredictability of real-world fog environments. Further validation in practical deployments would have been necessary to fully ascertain the efficacy of the proposed approach.

An adaptive power management strategy for Parallel Plug-in Hybrid EVs (PHEVs) using DQN was investigated in [144]. The strength of this paper lies in its practical application of DQN for real-time power distribution in PHEVs. The approach considered continuous state variables such as battery State of Charge (SOC), required power, vehicle speed, and remaining distance ratio, which enhanced the model's adaptability and precision in dynamic driving conditions. The DQN model successfully minimized fuel consumption while maintaining battery SOC within desired limits, showing a 6% increase in fuel consumption compared to DP, which was globally optimal but computationally impractical for real-time applications. However, the research was primarily based on simulations using the Federal Test Procedure (FTP)-72 driving cycle, which might not have fully captured the variability of real-world driving conditions. Additionally, the study focused on a specific type of PHEV and driving scenario, which might have limited the generalizability of the results. Further real-world testing and validation across different vehicle models and driving conditions were necessary to establish the robustness and practical applicability of the proposed strategy.

Authors in [145] explored the application of DQN to create an AI for a visual fighting game. A significant strength of this study was its innovative reduction of the action space from 41 to 11 actions, which simplified the training process and enhanced the model's performance. The DQN architecture included convolutional and fully
connected layers optimized for handling sequential frame inputs, effectively learning to perform complex combinations in a dynamic, competitive environment. However, a notable limitation was the reliance on a static opponent (None agent) during training, which might not have fully captured the complexities and adaptive behaviors of actual gameplay against diverse opponents. Additionally, the experiments were conducted in a controlled environment with specific hardware, potentially limiting the generalizability of the results to other setups or real-world gaming scenarios. Further work should have focused on testing against more dynamic and varied opponents to evaluate the robustness and adaptability of the AI.

Authors in [146] presented a novel path-planning approach using an enhanced DQN combined with dense network structures. The strength of this work was its innovative policy of leveraging both depth and breadth of experience during different learning stages, which significantly accelerated the learning process. The introduction of a value evaluation network helped the model quickly grasp environmental rules, while the parallel exploration structure improved the accuracy by expanding the experience pool. The use of dense connections further enhanced feature propagation and reuse, contributing to improved learning efficiency and path planning success. However, the primary limitation was that the experiments were conducted in a controlled grid environment with specific sizes (5x5 and 8x8), which might not have fully represented the complexities of real-world scenarios. Additionally, the reliance on a fixed maximum number of steps could potentially have led to suboptimal policy evaluations in dynamic and larger environments. Future work should have focused on validating this approach in more diverse and scalable settings to assess its generalizability and robustness.

The use of DQN for the real-time control of ESS co-located with renewable energy generators was discussed in [147]. A key strength of this work was its Model-free approach, which did not rely on distributional assumptions for renewable energy generation or real-time prices. This flexibility allowed the DQN to learn optimal policies directly from interaction with the environment. The simulation results demonstrated that the DQN-based control policy achieved near-optimal performance, effectively balancing energy storage management tasks like charging and discharging without violating operational constraints. However, the primary limitation was the reliance on simulated data for both training and evaluation, which might not have captured all the complexities of real-world energy systems. Additionally, the approach assumed the availability of significant computational resources, which might not have been feasible for all consumers. Future work should have focused on real-world implementations and considered the computational constraints of practical deployment environments.

Study [148] presented a novel approach to countering intelligent Unmanned Aerial Vehicle (UAV) jamming attacks using a Stackelberg dynamic game framework. The UAV jammer, acting as the leader, used Deep Recurrent Q-Networks (DRQN) to optimize its jamming trajectory, while ground users, as followers, employed DQN to find optimal communication trajectories to evade the jamming. The strength of this paper was its comprehensive modeling of the UAV jamming problem using DRQN and DQN, which effectively handled the dynamic and partially observable nature of the environment. The approach proved effective in simulations, showing that both the UAV jammer and ground users could achieve optimal trajectories that maximized their respective long-term cumulative rewards. The Stackelberg equilibrium ensured that the proposed strategies were stable and effective in a competitive environment. However, the primary limitation was the complexity and computational demands of implementing DRQN and DQN in real-time scenarios, which might have been challenging in practical deployments. Additionally, the simulations were based on specific scenarios and parameters, which might not have fully captured the variability of real-world environments. Further validation through real-world experiments and a more extensive range of scenarios would have been necessary to confirm the robustness and scalability of the proposed approach.

Table IX provides an overview of the examined papers in DQN and their applications across different domains.

TABLE IX: DQN Papers Review
Application Domain | References
General RL (Policy learning, raw experience) | [141]
Network Optimization | [142]
Swarm Intelligence and Optimization Problems | [143]
Network Optimization (Optical Transport Networks, Fog) | [144]
Games and Simulations | [145], [146]
Security Games and Strategy Optimization | [148]

Over the next paragraphs, we will cover the Double Deep Q-Networks (DDQN), an extension of the DQN designed to address the overestimation bias observed in Q-learning.

2) Double Deep Q-Networks (DDQN): The DDQN algorithm is an extension of the DQN designed to address the overestimation bias observed in Q-learning. It achieves this by decoupling the selection of the action from the evaluation of the Q-value, thus providing a more accurate estimation of action values [149], [150].
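In its standard form, this decoupling amounts to selecting the greedy next action with the online network θ and evaluating it with the target network θ⁻:

y^{DDQN} = r + γ Q( s′, argmax_{a′} Q(s′, a′; θ); θ⁻ )

This contrasts with the target inside Eq. (27), where the same target network both selects and evaluates the maximizing action, which is the source of the overestimation bias.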
conditions and with different types of robotic vehicles to fully establish the robustness and adaptability of the proposed method. Additionally, the computational demands of implementing DDQN in real-time applications might have posed practical challenges that needed addressing.

A novel method for improving tactical decision-making in autonomous driving using DDQN enhanced with spatial and channel attention mechanisms was introduced in [154]. The strength of this paper lies in its innovative use of a hierarchical control structure that integrated spatial and channel attention modules to better encode the relative importance of different surrounding vehicles and their features. This approach allowed for more accurate and efficient decision-making, as evidenced by the significant improvement in safety rates (54%) and average exploration distance (30%) in simulated environments. The combination of the algorithm with double attention enhanced the agent's ability to make intelligent and safe tactical decisions, outperforming baseline models in terms of both reward and capability metrics. However, the reliance on simulation-based testing limited the assessment of the model's performance in real-world driving scenarios, which might have presented additional complexities and unpredictabilities. The computational overhead associated with the attention modules and the utilized algorithm could also have posed challenges for real-time implementation in autonomous vehicles. Further validation in diverse and realistic environments would have been essential to confirm the robustness and scalability of this approach.

Study [26] explored a novel approach to solving the distributed heterogeneous hybrid flow-shop scheduling problem with multiple priorities of jobs using a DDQN-based co-evolutionary algorithm. A key strength of this work was its comprehensive framework that integrated global and local searches to balance computational resources effectively. The proposed DDQN-based co-evolutionary algorithm showed significant improvements in minimizing total weighted tardiness and total energy consumption by efficiently selecting operators and accelerating convergence. The numerical experiments and comparisons with state-of-the-art algorithms demonstrated the superior performance of the proposed method, particularly in handling real-world scenarios and large-scale instances. However, the study primarily relied on simulations and controlled experiments, which might not have fully captured the complexities and variability of actual manufacturing environments. Additionally, the computational overhead associated with training the DDQN model could have posed challenges for real-time applications. Further validation through real-world implementations and exploring dynamic events such as new job inserts and due date changes would have been necessary to fully establish the robustness and practical applicability of the proposed method.

A method for autonomous mobile robot navigation and collision avoidance using DDQN was developed in [155]. A significant strength of this work was its innovative use of DDQN to reduce reaction delay and improve training efficiency. The proposed method demonstrated superior performance in navigating robots to target positions without collisions, even in multi-obstacle scenarios. By employing a Kinect2 depth camera for obstacle detection and leveraging a well-designed reward function, the method achieved quick convergence and effective path planning, as evidenced by both simulation and real-world experiments on the Qbot2 robot platform. However, the reliance on a controlled laboratory environment and specific hardware (e.g., Optitrack system for positioning) might have limited the generalizability of the results. Real-world applications could have introduced additional complexities and variabilities that the study did not address. Further validation in diverse and uncontrolled environments, along with a discussion on the computational requirements and scalability of the approach, would have been necessary to fully establish its practical applicability.

Authors in [156] implemented an RL-based method for autonomous vehicle decision-making in overtaking scenarios with oncoming traffic. The study leveraged DDQN to manage both longitudinal speed and lane-changing decisions. A notable strength of this research was the use of DDQN with Prioritized Experience Replay, which accelerated policy convergence and enhanced decision-making precision. The simulation results in SUMO showed that the proposed method improved average speed, reduced time spent in the opposite lane, and lowered the overall overtaking duration compared to traditional methods, such as SUMO's default lane-change model. The RL-based approach demonstrated a high collision-free rate (98.5%) and effectively mimicked human decision-making behavior, showcasing significant improvements in safety and efficiency. However, the reliance on simulated environments and specific scenarios might have limited the generalizability of the results to real-world applications. The study assumed complete observability of the state space, which might not have been realistic in actual driving conditions where sensor imperfections and uncertainties were common. Future work should have focused on addressing these limitations by exploring POMDPs and validating the approach in diverse, real-world scenarios.

A DDQN-based control method for obstacle avoidance in agricultural robots was introduced in [24]. A key strength of this work was its innovative use of DDQN to handle the complexities of dynamic obstacle avoidance in a structured farmland environment. The
proposed method effectively integrated real-time data from sensors and used a neural network to decide the optimal actions, leading to significant improvements in space utilization and time efficiency compared to traditional risk index-based methods. The study reported high success rates in obstacle avoidance (98-99%) and demonstrated the model's robustness through extensive simulations and field experiments. However, the reliance on predefined paths and the assumption of only dynamic obstacles appearing on these paths might have limited the approach's flexibility in more complex or unstructured environments. Additionally, the computational demands of the DDQN framework might have posed challenges for deployment in resource-constrained settings typical of agricultural machinery. Further validation in diverse real-world scenarios and exploration of more scalable solutions could have enhanced the practical applicability of this method.

TABLE X: DDQN Papers Review
Application Domain | References
Energy and Power Management (IoT Networks, Smart Energy Systems) | [151], [160]
Multi-agent Systems and Autonomous Behaviors | [154], [156], [157]
Optimization | [26]
Real-time Systems and Hardware Implementations | [158]
Vehicle Speed Control System | [159]
Robotics | [153], [24], [155]

A DDQN-based algorithm for global path planning of amphibious Unmanned Surface Vehicles (USVs) was studied in [157]. A major strength of this paper was its innovative use of DDQN for handling the complex path-planning requirements of amphibious USVs, which must navigate both water and air environments. The integration of electronic nautical charts and elevation maps to build a detailed 3D simulation environment enhanced the realism and accuracy of the path planning. The proposed method effectively balanced multiple objectives such as minimizing travel time and energy consumption, making it suitable for diverse scenarios including emergency rescue and long-distance cruising. However, the study primarily relied on simulated environments and predefined scenarios, which might not have fully captured the complexities of real-world applications. The computational demands of the DDQN framework could also have posed challenges for real-time implementation, especially in dynamic environments where quick decision-making was crucial. Further validation through practical deployments and testing in various real-world conditions would have been necessary to fully assess the robustness and scalability of the proposed approach.

Authors in [158] presented a Hand-over (HO) strategy for 5G networks. The proposed Intelligent Dual Active Protocol Stack (I-DAPS) HO used DDQN to enhance the reliability and throughput of handovers by predicting and avoiding radio link failures. A key strength of this paper was its innovative approach to leveraging DDQN for dynamic and proactive handover decisions in highly variable mmWave environments. The use of a learning-based framework allowed the system to make informed decisions based on historical signal data, thereby significantly reducing Hand-over Failure (HOF) rates and achieving zero-millisecond Mobility Interruption Time (MIT). The simulation results demonstrated substantial improvements in throughput and reliability over conventional handover schemes like Conditional Handover (CHO) and baseline Dual Active Protocol Stack (DAPS) HO. However, the study's primary reliance on simulations and specific urban scenarios might have limited the generalizability of the findings. Real-world implementations might have faced additional challenges such as varying environmental conditions and hardware limitations that were not fully addressed in the simulations. Further validation in diverse real-world environments would have been necessary to confirm the practical applicability and robustness of the proposed I-DAPS HO scheme.

Authors in [159] implemented an advanced vehicle speed control system using DDQN. The approach integrated high-dimensional video data and low-dimensional sensor data to construct a comprehensive driving environment, allowing the system to mimic human-like driving behaviors. A key strength of this work was its effective use of DDQN to address the instability issues found in Q-learning, resulting in more accurate value estimates and higher policy quality. The use of naturalistic driving data from the Shanghai Naturalistic Driving Study enhanced the model's realism and applicability. The system demonstrated substantial improvements in both value accuracy and policy performance, achieving a score that was 271.73% higher than that of DQN. However, the reliance on pre-recorded driving data and controlled environments might have limited the generalizability of the results. Real-world driving conditions could have been significantly more variable, and the system's performance in such dynamic environments needed further validation. Additionally, the computational demands of DDQN might have posed challenges for real-time implementation in autonomous vehicles, necessitating further optimization for practical deployment.

Authors in [160] addressed the dynamic multiple-channel access problem in IoT networks with energy harvesting devices using the DDQN approach. The key strength of this study lies in its innovative application of
DDQN to manage scheduling policies in POMDP. By converting the partial observations of scheduled nodes into belief states for all nodes, the proposed method effectively reduced energy costs and extended the network lifetime. The simulation results demonstrated that DDQN outperformed other RL algorithms, including Q-learning and DQN, in terms of average reward per time slot. However, the paper's reliance on simulations and specific parameters might have limited its applicability to real-world IoT environments. The assumptions regarding the Poisson process for energy arrival and the fixed transmission energy threshold might not have fully captured the complexities of practical deployments. Further validation through real-world experiments and consideration of diverse environmental conditions would have been necessary to fully establish the robustness and scalability of the proposed scheduling approach.

Table X offers a comprehensive summary of the papers reviewed in this section, categorized by their respective domains within the DDQN. In the next subsection, another variant of DQN that uses the Dueling architecture is introduced.

3) Dueling Deep Q-Networks: The Dueling Deep Q-Networks (Dueling DQN) algorithm introduces a new neural network architecture that separates the representation of state values and advantages to improve learning efficiency and performance. This architecture allows the network to estimate the value of each state more robustly, leading to better policy evaluation and improved stability. The dueling architecture splits the Q-network into two streams, one for estimating the state value function and the other for the advantage function, which are then combined to produce the Q-values. By separately estimating the state value and advantage, the dueling network provides more informative gradients, leading to more efficient learning. This architecture demonstrates improved performance in various RL tasks, particularly in environments with many similar-valued actions [161]. The dueling network update rule is given as:

Q(s, a; θ, α, β) = V(s; θ, β) + ( A(s, a; θ, α) − (1/|A|) Σ_{a′} A(s, a′; θ, α) )   (29)

where θ are the parameters of the shared network, α are the parameters of the advantage stream, and β are the parameters of the value stream.
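A minimal PyTorch-style sketch of such a dueling head is shown below; the layer sizes and overall network structure are illustrative assumptions rather than the exact architecture of [161].

import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')), as in Eq. (29)."""
    def __init__(self, obs_dim, num_actions, hidden=128):
        super().__init__()
        # Shared feature extractor (parameters theta)
        self.features = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # Value stream (parameters beta) and advantage stream (parameters alpha)
        self.value = nn.Linear(hidden, 1)
        self.advantage = nn.Linear(hidden, num_actions)

    def forward(self, obs):
        h = self.features(obs)
        v = self.value(h)                       # shape [B, 1]
        a = self.advantage(h)                   # shape [B, num_actions]
        # Subtracting the mean advantage keeps V and A identifiable, per Eq. (29)
        return v + a - a.mean(dim=1, keepdim=True)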
By separately estimating the state value and advantage, the dueling network provides more robust estimates of state values. The decoupling mechanism enhances stability and overall performance in various environments, especially where actions have similar values. Alg. 17 gives a comprehensive overview of the Dueling DQN algorithm.

Over the next few paragraphs, several research studies are examined in detail to understand the Dueling DQN algorithm better. A DRL-based framework utilizing Dueling Double Deep Q-Networks with Prioritized Replay (DDDQNPR) to solve adaptive job shop scheduling problems was implemented in [162]. The main strength of this study was its innovative combination of DDDQNPR with a disjunctive graph model, transforming scheduling into a sequential decision-making process. This approach allowed the model to adapt to dynamic environments and achieve optimal scheduling policies through offline training. The proposed method outperformed traditional heuristic rules and genetic algorithms in both static and dynamic scheduling scenarios, showcasing a significant improvement in scheduling performance and generalization ability. However, the reliance on predefined benchmarks and controlled experiments might have limited the applicability of the results to real-world manufacturing environments. The complexity and computational demands of the DDDQNPR framework might have posed challenges for real-time implementation in large-scale production settings. Future work should have focused on validating the approach in diverse real-world scenarios and exploring more efficient reward functions and neural network structures to enhance the method's scalability and robustness.

Managing Device-to-Device (D2D) communication using the Dueling DQN architecture was proposed by authors in [163]. A key strength of this paper was its effective utilization of the Dueling DQN architecture to autonomously determine transmission decisions in D2D networks, without relying on centralized infrastructure. The approach leveraged easily obtainable Channel State Information (CSI), allowing each D2D transmitter to train its neural network independently. The proposed method successfully mitigated co-channel interference, demonstrating near-optimal sum rates in low Signal-to-Noise Ratio (SNR) environments and outperforming traditional schemes like No Control, Opportunistic, and Suboptimal in terms of efficiency and complexity reduction. However, the study's reliance on simulation data and specific channel models might have limited the generalizability of the results to real-world D2D networks. The approach assumed ideal conditions such as perfect CSI and zero delay in TDD-based full duplex communications, which might not have held in practical scenarios. Future work should have focused on validating the method in diverse real-world environments and addressing potential implementation challenges such as real-time computational requirements and varying network conditions.

Authors in [29] explored an advanced power allocation algorithm in edge computing environments using Dueling DQN. A significant strength of this study was
analytical methods. However, the reliance on simulations to validate the approach might have limited the generalizability of the findings to real-world applications. The assumptions made regarding the processing capabilities and network conditions might not have fully captured the complexities of practical implementations. Further research involving real-world testing and validation in diverse environments was necessary to fully establish the robustness and scalability of the proposed method.

The authors in [167] presented an enhanced Dueling DQN method for improving the autonomous capabilities of UAVs in terms of obstacle avoidance and target tracking. One of the key strengths of this study was the effective integration of improved Dueling DQN with a target tracking mechanism, which allowed the UAV to make more precise and timely decisions. The improved algorithm showed superior performance in simulation tests, with significant improvements in obstacle avoidance accuracy and target tracking efficiency. The authors provided a comprehensive analysis of the algorithm's ability to maintain stability and adaptability in dynamic environments, showcasing its practical potential for real-world UAV applications. However, the study's dependence on simulation environments and controlled scenarios might not have fully captured the complexities and unpredictability of real-world applications. The effectiveness of the improved Dueling DQN algorithm in diverse and unstructured environments needed further validation through real-world testing. Additionally, the computational demands of the improved algorithm could have posed challenges for real-time processing and decision-making on resource-constrained UAV platforms. Future work should have addressed these aspects to establish the practicality and scalability of the proposed approach fully.

An advanced path planning approach for Unmanned Surface Vessels (USVs) using Dueling DQN enhanced with a tree sampling mechanism was implemented in [168]. A key strength of this work was the innovative combination of Dueling DQN with a tree-based priority sampling mechanism, which significantly improved the convergence speed and efficiency of the learning process. The approach leveraged the decomposition of the value function into state-value and advantage functions, enhancing the policy evaluation and decision-making accuracy. The simulation results demonstrated that the proposed method achieved faster convergence and more effective obstacle avoidance and path planning compared to DQN algorithms. Specifically, the Dueling DQN algorithm showed improved performance in terms of the number of steps and time required to reach the target across various test environments. However, the study's reliance on simulated static environments might have limited the generalizability of the results to real-world scenarios where the USVs would have encountered dynamic and unpredictable conditions. The paper also did not address the computational complexity introduced by the tree sampling mechanism, which could have posed challenges for real-time implementation on resource-constrained USVs. Future work should have focused on validating the approach in dynamic real-world environments and exploring ways to optimize the computational efficiency for practical applications. A summary of the analyzed papers is given in Table XI. In the next subsection, another part of Tabular Model-based algorithms, Model-based Planning, is introduced.

TABLE XI: Dueling DQN Papers Review
Application Domain | References
Energy and Power Management (IoT Networks, Smart Energy Systems) | [29], [165]
Transportation and Routing Optimization | [168]
Swarm Intelligence and Optimization Problems | [162]
Network Optimization | [163], [166]
Geo-statistics and Environmental Engineering | [164]
Autonomous UAVs | [155], [167]
Path planning approach for USVs | [168]

D. Advanced Tabular Model-based Methods
In this subsection, we dive into the second part of Tabular Model-based methods. Key characteristics of Model-based algorithms include Model Representation, which uses transition probabilities P(s′|s, a) and reward functions R(s, a) to define state transitions and rewards; Planning and Policy Evaluation, which iteratively approximates the value function V(s) with the Bellman equation until convergence; and Value Iteration, which also iteratively refines V(s) using the Bellman equation to determine the optimal policy by evaluating future states and actions [1].
After analyzing the first part of Tabular Model-based algorithms, DP approaches, we now shed light on the second part, Model-based Planning methods.
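For reference, the Bellman backup underlying these characteristics can be written, in its standard value-iteration form, as:

V_{k+1}(s) = max_a [ R(s, a) + γ Σ_{s′} P(s′|s, a) V_k(s′) ]

with a greedy policy extracted afterwards as π(s) = argmax_a [ R(s, a) + γ Σ_{s′} P(s′|s, a) V(s′) ].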
1) Model-based Planning: Model-based Planning in RL refers to the approach where the agent builds and utilizes a model of the environment to plan its actions and make decisions. This model can be either learned from interactions with the environment or predefined if the environment's dynamics are known. Model-based methods can be more sample-efficient as they leverage learned models of the environment to simulate many scenarios, allowing the agent to gain more insights without needing to interact directly with the environment each time. Model-based Planning enables the agent to
handle multi-dimensional rewards and discover several Pareto-optimal policies within a single tree. The MO-MCTS algorithm stood out for its ability to effectively manage multi-objective optimization by integrating the hyper-volume indicator, which provided a comprehensive measure of the solution set's quality. This approach allowed the algorithm to balance exploration and exploitation in multi-dimensional spaces, making it more efficient in discovering Pareto-optimal solutions compared to traditional scalarized RL methods. The use of the hyper-volume indicator as an action selection criterion ensured that the algorithm could capture a diverse set of optimal policies, addressing the limitations of linear-scalarization methods that failed in non-convex regions of the Pareto front. The experimental validation on the Deep Sea Treasure (DST) problem and grid scheduling tasks demonstrated that MO-MCTS achieved superior performance and scalability, matching or surpassing state-of-the-art non-RL-based methods despite higher computational costs. However, the reliance on accurate computation of the hyper-volume indicator introduced significant computational overhead, particularly in high-dimensional objective spaces. The complexity of maintaining and updating the Pareto archive could also pose challenges, especially as the number of objectives increased. While the algorithm showed robust performance in deterministic settings, its scalability and efficiency in highly stochastic or dynamic environments remained to be fully explored. The need for domain-specific knowledge to define the reference point and other hyper-volume-related parameters could also limit the algorithm's generalizability. Additionally, the method's computational intensity might hinder its real-time applicability, requiring further optimization or approximation techniques to reduce processing time.

A novel approach to Neural Architecture Search (NAS) using MCTS was introduced in [174]. The authors proposed a method that captured the dependencies among layers by modeling the search space as a Monte-Carlo Tree, enhancing the exploration-exploitation balance and efficiently storing intermediate results for better future decisions. The method was validated through experiments on the NAS-Bench-Macro benchmark and ImageNet dataset. The primary strength of this approach lies in its ability to incorporate dependencies between different layers during architecture search, which many existing NAS methods overlooked. By utilizing MCTS, the proposed method effectively balanced exploration and exploitation, leading to more efficient architecture sampling. The use of a Monte-Carlo Tree allowed for the storage of intermediate results, significantly improving the search efficiency and reducing the need for redundant computations. The introduction of node communication and hierarchical node selection techniques further refined the evaluation of numerous nodes, ensuring more accurate and efficient search processes. Experimental results demonstrated that the proposed method achieved higher performance and search efficiency compared to state-of-the-art NAS methods, particularly on the NAS-Bench-Macro and ImageNet datasets. On the other hand, the reliance on the accurate modeling of the search space as a Monte-Carlo Tree introduced additional complexity, particularly in terms of computational overhead and memory requirements. The method's performance heavily depended on the proper tuning of hyperparameters, such as the temperature term and reduction ratio, which could be challenging to optimize for different datasets and tasks. While the experiments showed promising results, further validation in more diverse and complex real-world scenarios was necessary to fully assess the scalability and robustness of the approach. The additional computational cost associated with maintaining and updating the Monte-Carlo Tree and performing hierarchical node selection could impact the method's applicability in real-time applications or resource-constrained environments.

Study [49] introduced Multi-agent MCTS (MAMCTS), a novel extension of MCTS tailored for cooperative multi-agent systems. The primary innovation was the integration of difference evaluations, which significantly enhanced coordination strategies among agents. The performance of MAMCTS was demonstrated in a multi-agent path-planning domain called Multi-agent Grid-world (MAG), showcasing substantial improvements over traditional reward evaluation methods. The MAMCTS approach effectively leveraged difference evaluations to prioritize actions that contributed positively to the overall system, leading to significant improvements in learning efficiency and coordination among agents. By combining MCTS with difference rewards, the algorithm balanced exploration and exploitation, ensuring that agents could efficiently navigate the search space to find optimal policies. The experimental results in the 100x100 MAG environment demonstrated that MAMCTS outperformed both local and global reward methods, achieving up to 31.4% and 88.9% better performance, respectively. This superior performance was consistent across various agent and goal configurations, highlighting the scalability and robustness of the approach. The use of a structured search process and prioritized updates ensured that the algorithm could handle large-scale multi-agent environments effectively. The reliance on accurate computation and maintenance of difference evaluations introduced additional computational overhead, particularly in environments with a large number of agents and goals. The method's performance was sensitive to the accuracy of these evaluations, which might be challenging to maintain in highly dynamic
or unpredictable environments. While the simulations provided strong evidence of the method's efficacy, further validation in more diverse real-world scenarios was necessary to fully assess its scalability and practical utility. The complexity of managing multiple agents and ensuring synchronized updates could also pose challenges, particularly in real-time applications where computational resources are limited. Additionally, the paper focused primarily on cooperative settings, and the applicability of MAMCTS to competitive or adversarial multi-agent environments remained an area for future exploration.

A MO-MCTS algorithm tailored for real-time games was developed in [175]. It focused on balancing multiple objectives simultaneously, leveraging the hyper-volume indicator to replace the traditional UCB criterion. The algorithm was tested against single-objective MCTS and Non-dominated Sorting Genetic Algorithm (NSGA)-II, showcasing superior performance in benchmarks like DST and the Multi-objective Physical Traveling Salesman Problem (MO-PTSP). The MO-MCTS algorithm excelled in handling multi-dimensional reward structures, efficiently balancing exploration and exploitation. By incorporating the hyper-volume indicator, the algorithm could discover a diverse set of Pareto-optimal solutions, effectively addressing the limitations of linear-scalarization methods in non-convex regions of the Pareto front. The empirical results on DST and MO-PTSP benchmarks demonstrated the algorithm's ability to converge to optimal solutions quickly, outperforming both single-objective MCTS and NSGA-II in terms of exploration efficiency and solution quality. The use of weighted sum and Euclidean distance mechanisms for action selection further enhanced the adaptability of MO-MCTS to various game scenarios, providing a robust framework for real-time decision-making. Nonetheless, the reliance on the hyper-volume indicator introduced significant computational overhead, which could limit the algorithm's scalability in high-dimensional objective spaces. The need for maintaining a Pareto archive and computing the hyper-volume for action selection added to the complexity, potentially impacting real-time performance. While the algorithm showed strong results in the tested benchmarks, its applicability to more complex and dynamic real-time games required further validation. The computational intensity of the approach might hinder its practicality in resource-constrained environments, necessitating the exploration of optimization techniques to reduce processing time without compromising performance. Additionally, the paper primarily focused on deterministic settings, leaving the performance in stochastic or highly variable environments less explored.

Researchers in [176] explored hybrid algorithms that integrated MCTS with mini-max search to leverage the strategic strengths of MCTS and the tactical precision of mini-max. The authors proposed three hybrid approaches: employing mini-max during the selection/expansion phase, the rollout phase, and the backpropagation phase of MCTS. These hybrids aimed to address the weaknesses of MCTS in tactical situations by incorporating shallow mini-max searches within the MCTS framework. The hybrid algorithms presented in the paper offered a promising combination of MCTS's ability to handle large search spaces with mini-max's tactical accuracy. By integrating shallow mini-max searches, the hybrids could better navigate shallow traps that MCTS might overlook, leading to more robust decision-making in games with high tactical demands. The experimental results in games like Connect-4 and Breakthrough demonstrated that these hybrid approaches could outperform standard MCTS, particularly in environments where tactical precision was crucial. The use of mini-max in the selection/expansion phase and the backpropagation phase significantly improved the ability to avoid blunders and recognize winning strategies early, enhancing the overall efficiency and effectiveness of the search process. However, the inclusion of mini-max searches introduced additional computational overhead, which could slow down the overall search process, especially for deeper mini-max searches. The performance improvements were heavily dependent on the correct tuning of parameters, such as the depth of the mini-max search and the criteria for triggering these searches. While the hybrids showed improved performance in the tested games, their scalability and effectiveness in more complex and dynamic real-world scenarios remained to be fully validated. The reliance on game-specific characteristics, such as the presence of shallow traps, might limit the generalizability of the results. Further exploration was needed to assess the impact of these hybrids in a broader range of domains and under different conditions.

Authors in [177] introduced the Option-MCTS (O-MCTS) algorithm, which extended MCTS by incorporating high-level action sequences, or "options," aimed at achieving specific subgoals. The proposed algorithm aimed to enhance general video game playing by utilizing higher-level planning, enabling it to perform well across a diverse set of games from the General Video Game AI competition. Additionally, the paper introduced Option Learning MCTS (OL-MCTS), which applied a progressive widening technique to focus exploration on the most promising options. The integration of options into MCTS was a significant advancement, allowing the algorithm to plan more efficiently by considering higher-level strategies. This higher abstraction level helped the algorithm deal with complex games that required achieving multiple subgoals. The use of options reduced
The use of options reduced the branching factor in the search tree, enabling deeper exploration within the same computational budget. The empirical results demonstrated that O-MCTS outperformed traditional MCTS in games requiring sequential subgoal achievements, such as collecting keys to open doors, showcasing its strength in strategic planning. The introduction of OL-MCTS further improved performance by learning which options were most effective, thus focusing the search on more promising parts of the game tree and improving efficiency. On the other hand, the reliance on predefined options and their proper tuning could be a limitation, as the performance of O-MCTS heavily depended on the quality and relevance of these options to the specific games being played. The initial computational overhead associated with constructing and managing a large set of options might impact the algorithm's performance, particularly in games with numerous sprites and complex dynamics. The progressive widening technique in OL-MCTS, while beneficial for focusing exploration, introduced additional complexity and overhead, potentially reducing real-time applicability. Further validation was needed to assess the scalability and robustness of these algorithms in a wider range of real-world game scenarios, where the diversity and unpredictability of game mechanics might present new challenges.

An extension of MCTS, incorporating heuristic evaluations through implicit mini-max backups, was investigated in [178]. The approach aimed to combine the strengths of MCTS and mini-max search to improve decision-making in strategic games by maintaining separate estimations of win rates and heuristic evaluations and using these to guide simulations. The integration of implicit mini-max backups within MCTS significantly enhanced the quality of simulations by leveraging heuristic evaluations to inform decision-making. This hybrid approach addressed the limitations of pure MCTS in domains where tactical precision was crucial, effectively balancing strategic exploration with tactical accuracy. By maintaining separate values for win rates and heuristic evaluations, the algorithm could better navigate complex game states, leading to stronger play performance. The empirical results in games like Kalah, Breakthrough, and Lines of Action demonstrated substantial improvements over standard MCTS, validating the effectiveness of implicit mini-max backups in diverse strategic environments. The method also showed robust performance across different parameter settings, highlighting its adaptability and potential for broader application in game-playing AI. However, the reliance on accurate heuristic evaluations introduced complexity in environments where such evaluations were not readily available or were difficult to compute. The additional computational overhead associated with maintaining and updating separate value estimates might impact the algorithm's efficiency, particularly in real-time applications. While the approach showed significant improvements in specific games, further validation was necessary to assess its scalability and robustness in more complex and dynamic scenarios. The dependence on domain-specific knowledge for heuristic evaluations might limit the generalizability of the method to a wider range of applications. Additionally, the complexity of tuning the parameter that weighted the influence of heuristic evaluations could pose challenges in optimizing the algorithm for different environments.

Authors in [179] explored the application of MCTS to the game of Lines of Action (LoA), a two-person zero-sum game known for its tactical depth and moderate branching factor. The authors proposed several enhancements to standard MCTS to handle the tactical complexities of LoA, including game-theoretical value proving, domain-specific simulation strategies, and effective use of progressive bias. The key strength of this paper lies in its innovative enhancements to MCTS, enabling it to handle the tactical and progression properties of LoA effectively. By incorporating game-theoretical value proving, the algorithm could identify and propagate winning and losing positions more efficiently, reducing the computational burden of extensive simulations. The use of domain-specific simulation strategies significantly improved the quality of the simulations, leading to more accurate evaluations and better overall performance. The empirical results demonstrated that the enhanced MCTS variant outperformed the world's best αβ-based LoA program, marking a significant milestone for MCTS in handling highly tactical games. The detailed analysis and systematic approach to integrating domain knowledge into MCTS provided a robust framework for applying MCTS to other complex games. Despite its advantages, the approach introduced additional computational overhead due to the need for maintaining and updating multiple enhancements, such as the progressive bias and game-theoretical value proving. The reliance on domain-specific knowledge for simulation strategies and the progressive bias limited the generalizability of the method to other games without similar properties. While the algorithm performed well in the controlled environment of LoA, its scalability and robustness in more dynamic and less structured environments remained to be fully explored. The complexity of tuning various parameters, such as the progressive bias coefficient and the simulation cutoff threshold, could also pose challenges, particularly for practitioners without deep domain knowledge. Further validation in diverse real-world scenarios was necessary to assess the practical applicability and long-term benefits of the proposed enhancements.
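Returning to the implicit mini-max backups of [178], a minimal sketch of the hybrid node value they motivate is given below, assuming the common convention of blending the Monte Carlo estimate with a normalized heuristic estimate through a weighting parameter α; the code is illustrative and not taken from the paper.

    import math

    def selection_value(parent, child, alpha=0.3, c=1.0):
        """Blend the Monte Carlo win-rate estimate with a backed-up heuristic
        (mini-max) estimate, then add a UCT-style exploration bonus."""
        q_mc = child.total_reward / child.visits      # win-rate estimate from simulations
        q_heur = child.minimax_value                  # implicit mini-max heuristic, assumed in [-1, 1]
        blended = (1.0 - alpha) * q_mc + alpha * q_heur
        exploration = c * math.sqrt(math.log(parent.visits) / child.visits)
        return blended + exploration

Tuning α trades off tactical accuracy (heuristic) against sampled long-term value (Monte Carlo), which is exactly the weighting sensitivity noted above.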
Authors in [180] explored the application of MCTS to the popular collectible card game "Hearthstone: Heroes of Warcraft." Given the game's complexity, uncertainty, and hidden information, the authors proposed enriching MCTS with heuristic guidance and a database of decks to enhance performance. The approach was empirically validated against vanilla MCTS and state-of-the-art AI, showing significant performance gains. The integration of expert knowledge through heuristics and a deck database addressed two major challenges in Hearthstone: hidden information and a large search space. By incorporating a heuristic function into the selection and simulation steps, the algorithm could more effectively navigate the game's complex state space, leading to improved decision-making and performance. The use of a deck database allowed the MCTS algorithm to predict the opponent's cards more accurately, enhancing the quality of simulations and overall strategy. The empirical results demonstrated that the enhanced MCTS approach significantly outperformed vanilla MCTS and was competitive with existing AI players, showcasing its potential for complex, strategic games like Hearthstone. However, the reliance on pre-constructed heuristics and deck databases introduced additional complexity and potential limitations. The effectiveness of the heuristic function was highly dependent on its design and tuning, which might vary across different game scenarios and decks. Similarly, the deck database's accuracy and comprehensiveness were crucial for predicting opponent strategies, which might be challenging to maintain as new cards and strategies emerged. The additional computational overhead associated with managing and updating these enhancements could impact real-time performance, particularly in time-constrained gameplay environments. While the approach showed strong performance in controlled experiments, further validation in diverse and dynamic real-world scenarios was necessary to fully assess its robustness and adaptability.

Information Set MCTS (ISMCTS), an extension of MCTS designed for games with hidden information and uncertainty, was introduced in [181]. ISMCTS algorithms searched trees of information sets rather than game states, providing a more accurate representation of games with imperfect information. The approach was tested across three domains with different characteristics. The ISMCTS algorithms excelled in efficiently managing hidden information by constructing trees where nodes represented information sets instead of individual states. This method mitigated the strategy fusion problem seen in determinization approaches, where different states in an information set were treated independently. By unifying statistics about moves within a single tree, ISMCTS made better use of the computational budget, leading to more informed decision-making. The empirical results demonstrated that ISMCTS outperformed traditional determinization-based methods in games with significant hidden information and partially observable moves. For instance, in Lord of the Rings: The Confrontation and the Phantom, ISMCTS achieved superior performance by effectively leveraging the structure of the game and reducing the impact of strategy fusion. However, the ISMCTS approach introduced additional complexity in maintaining and updating information sets, which could lead to increased computational overhead. The scalability of the method in environments with extensive hidden information and large state spaces remained a challenge, as the branching factor in information-set trees could become substantial. While ISMCTS showed promising results in the tested domains, further validation in more diverse and dynamic scenarios was necessary to fully assess its robustness and general applicability. The reliance on accurate modeling of information sets and the necessity for domain-specific adaptations could limit the ease of implementation and the algorithm's flexibility across different types of games.

Authors in [182] investigated the application of MCTS to Kriegspiel, a variant of chess characterized by hidden information and dynamic uncertainty. The authors explored three MCTS-based methods, incrementally refining the approach to handle the complexities of Kriegspiel. They compared these methods to a strong minimax-based Kriegspiel program, demonstrating the effectiveness of MCTS in this challenging environment. The authors' incremental refinement of MCTS methods for Kriegspiel effectively addressed the game's dynamic uncertainty and large state space. By leveraging a probabilistic model of the opponent's pieces and incorporating domain-specific heuristics, the refined MCTS algorithms significantly improved performance compared to the initial naive implementation. The experimental results showed that the final MCTS approach outperformed the minimax-based program, achieving better strategic planning and decision-making. The innovative use of a three-tiered game tree representation and opponent modeling techniques demonstrated the adaptability and robustness of MCTS in handling partial-information games. This study provided valuable insights into the application of MCTS in environments with incomplete information, highlighting its potential for broader applications in similar domains. Having said that, the reliance on extensive probabilistic modeling and heuristic adjustments introduced additional complexity, which could be computationally intensive and challenging to maintain. The performance improvements were heavily dependent on the accuracy and relevance of the probabilistic models, which might vary across different game scenarios and opponents. While the approach showed strong performance in the controlled environment of Kriegspiel, its scalability and robustness in more diverse …
Algorithm 19 Prioritized Sweeping
1: Initialize Q(s, a), Model(s, a) for all s, a, and PQueue to empty
2: while true do ⊲ Do forever
3:    S ← current (non-terminal) state
4:    A ← policy(S, Q)
5:    Execute action A; observe resultant reward R and state S′
6:    Model(S, A) ← (R, S′)
7:    P ← |R + γ max_a Q(S′, a) − Q(S, A)|
8:    if P > θ then
9:       Insert (S, A) into PQueue with priority P
10:   end if
11:   for i = 1 to n do
12:      if PQueue is not empty then
13:         S, A ← first(PQueue)
14:         R, S′ ← Model(S, A)
15:         Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S′, a) − Q(S, A)]
16:         for each (S̃, Ã) predicted to lead to S do
17:            R̃ ← predicted reward for (S̃, Ã, S)
18:            P ← |R̃ + γ max_a Q(S, a) − Q(S̃, Ã)|
19:            if P > θ then
20:               Insert (S̃, Ã) into PQueue with priority P
21:            end if
22:         end for
23:      end if
24:   end for
25: end while

…complexity of setting up and tuning the priority mechanism posed challenges, particularly in dynamic environments with unpredictable state transitions.

An enhancement to the Deep Deterministic Policy Gradient (DDPG) algorithm (discussed in subsection V-B1), incorporating Prioritized Sweeping for morphing UAVs, is introduced in [185]. The objective is to improve the efficiency and effectiveness of morphing strategy decisions by avoiding the random selection of state-action pairs and instead focusing on those with a significant impact on learning outcomes. The integration of Prioritized Sweeping with the DDPG framework notably enhances learning efficiency by prioritizing state-action pairs that are most influential in updating the policy. This targeted approach accelerates convergence and improves decision-making accuracy, which is crucial for the dynamic and complex task of UAV morphing. The method effectively combines the strengths of Value-based and Policy Gradient-based RL, leveraging deep neural networks for state evaluation and policy improvement. The simulation results demonstrate that the improved algorithm achieves faster learning and higher total rewards compared to DDPG, validating its superiority in handling complex morphing scenarios. Despite its advantages, the reliance on accurate prioritization mechanisms introduces additional complexity in maintaining and updating the priority queue, which could impact performance in highly dynamic environments where state changes are rapid and unpredictable. The assumption that changes in sweep angle do not significantly affect flight status within short time frames may oversimplify real-world conditions, potentially limiting the model's accuracy. While the simulation results are promising, further validation in real-world applications is necessary to fully assess the method's robustness and scalability. The initial phase of training and data generation still requires significant computational resources, which could be a bottleneck in practical deployments.

Authors in [186] introduced modifications to Dyna-learning and Prioritized Sweeping algorithms, incorporating an epoch-incremental approach using Breadth-first Search (BFS). The combination of incremental and epoch-based updates improved learning efficiency, leading to faster convergence in dynamic environments like grid worlds. The use of BFS after episodes provided a more comprehensive understanding of the state space. However, managing dual modes of policy updates and accurate BFS calculations introduced complexity, potentially increasing computational overhead. The method showed strong simulation results, but real-world validation was necessary to understand its scalability and practical implications.

Authors in [187] introduced an innovative extension to the Prioritized Sweeping algorithm by employing small backups instead of full backups. The primary aim is to enhance the computational efficiency of Model-based RL by reducing the complexity of value updates, making it more suitable for environments with a large number of successor states. The most notable advantage of this approach is its ability to perform value updates using only the current value of a single successor state, significantly reducing the computation time. By utilizing small backups, the algorithm allows for finer control over the planning process, leading to more efficient update strategies. The empirical results demonstrate that the small backup implementation of Prioritized Sweeping achieves substantial performance improvements over traditional methods, particularly in environments where full backups are computationally prohibitive. The theoretical foundation provided in the paper supports the robustness of the small backup approach, ensuring that it maintains the accuracy of value updates while enhancing computational efficiency. Additionally, the parameter-free nature of small backups eliminates the need for tuning step-size parameters, which is a common challenge in traditional sample-based methods.
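For reference, a compact Python rendering of the prioritized sweeping loop in Algorithm 19 is sketched below; the environment interface (reset, step, actions) and the hyperparameter values are assumptions made for illustration.

    import heapq
    from collections import defaultdict

    def max_q(Q, s, actions):
        return max(Q[(s, a)] for a in actions)

    def prioritized_sweeping(env, policy, alpha=0.1, gamma=0.95, theta=1e-4,
                             n_planning=10, n_steps=10_000):
        Q = defaultdict(float)                 # Q[(s, a)]
        model = {}                             # model[(s, a)] = (r, s_next)
        predecessors = defaultdict(set)        # (s, a) pairs predicted to lead to a state
        pqueue = []                            # max-priority queue via negated priorities

        s = env.reset()
        for _ in range(n_steps):
            a = policy(s, Q)
            r, s_next = env.step(a)            # assumed interface: returns (reward, next state)
            model[(s, a)] = (r, s_next)
            predecessors[s_next].add((s, a))
            p = abs(r + gamma * max_q(Q, s_next, env.actions) - Q[(s, a)])
            if p > theta:
                heapq.heappush(pqueue, (-p, (s, a)))
            for _ in range(n_planning):        # planning sweeps in priority order
                if not pqueue:
                    break
                _, (ps, pa) = heapq.heappop(pqueue)
                pr, ps_next = model[(ps, pa)]
                Q[(ps, pa)] += alpha * (pr + gamma * max_q(Q, ps_next, env.actions) - Q[(ps, pa)])
                for (qs, qa) in predecessors[ps]:
                    qr, _ = model[(qs, qa)]
                    qp = abs(qr + gamma * max_q(Q, ps, env.actions) - Q[(qs, qa)])
                    if qp > theta:
                        heapq.heappush(pqueue, (-qp, (qs, qa)))
            s = s_next
        return Q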
On the other hand, the approach's reliance on maintaining and updating a priority queue introduces additional memory and computational overhead, particularly in environments with many state-action pairs. While the method improves computational efficiency, it requires careful management of memory resources to store the component values associated with small backups. The simulations primarily focus on deterministic environments, leaving the performance of the small backup approach in highly stochastic or dynamic settings less explored. Further validation in real-world applications is necessary to fully assess the scalability and practicality of the method. Additionally, the initial phase of model construction and priority queue management could introduce latency, impacting the algorithm's real-time performance in complex environments.

Authors in [188] introduced Cooperative Prioritized Sweeping (CPS) to efficiently handle Multi-Agent Markov Decision Processes (MMDPs) using factored Q-functions and Dynamic Decision Networks (DDNs). CPS managed large state-action spaces effectively, leading to faster convergence in multi-agent environments. However, the reliance on accurately specified DDN structures and the batch update mechanism introduced complexity, increasing computational overhead in dynamic environments. While simulations showed Cooperative Prioritized Sweeping's potential, further real-world validation was needed to assess its scalability and robustness.

A prioritized sweeping approach combined with Confidence-based Dual RL (CB-DuRL) for routing in Mobile Ad-Hoc Networks (MANETs) is investigated by researchers in [41]. The proposed method dynamically selects routes based on real-time traffic conditions, aiming to minimize delivery time and congestion. The key strength of this approach is its dynamic adaptability to real-time traffic conditions, addressing the limitations of traditional shortest-path routing methods. By leveraging prioritized sweeping, the algorithm prioritizes updates to the most critical state-action pairs, enhancing learning efficiency and ensuring optimal path selection under varying network loads. The inclusion of CB-DuRL refines routing decisions by considering the reliability of Q-values, thus improving the robustness and reliability of the routing protocol. Empirical results from simulations on a 50-node MANET demonstrate that the proposed method significantly outperforms traditional routing protocols in terms of packet delivery ratio, dropping ratio, and delay, showcasing its effectiveness in handling high traffic conditions and reducing network congestion. However, the approach's dependence on accurate estimation of traffic conditions and Q-values may limit its adaptability in highly dynamic or unpredictable environments where traffic patterns change rapidly. The initial phase of gathering sufficient data to populate the Q-tables and confidence values can introduce latency and computational overhead, potentially impacting the algorithm's performance in real-time applications. While the simulations provide strong evidence of the method's efficacy, further validation in larger and more diverse network topologies is necessary to fully assess its scalability and robustness. Managing priority queues and updating Q-values in a distributed manner across multiple nodes can also pose challenges in maintaining synchronization and consistency in real-world deployments.

A structured version of Prioritized Sweeping to enhance RL efficiency in large state spaces by leveraging Dynamic Bayesian Networks (DBNs) was introduced in [189]. The method accelerated learning by grouping states with similar values, reducing the updates needed for convergence. DBNs provided a compact environment representation, further improving computational efficiency. However, the reliance on predefined DBN structures limited applicability in dynamic environments, and maintaining these structures added computational overhead. While promising in simulations, the method required further validation in diverse real-world scenarios to fully assess its scalability and robustness.

c) Dyna-Q: Two variants of Dyna-Q are commonly used in RL: Dyna-Q with a tabular representation and Dyna-Q with function approximation. Acknowledging this, we analyze both variations in this section.

The Dyna-Q algorithm, introduced in [73], combines traditional RL with planning in a novel way to improve the efficiency and adaptability of learning agents. This section provides an overview of the essential concepts and mechanisms underlying the Dyna-Q algorithm. The algorithm integrates learning from real experiences with planning using a learned model of the environment. This integration allows the agent to improve its policy more rapidly and efficiently by leveraging both actual and simulated experiences. Dyna-Q uses Q-learning [101] to estimate action values; each estimate combines the immediate reward and the discounted value of the next state, guiding the agent towards actions that maximize long-term rewards. On top of that, its architecture includes a model of the environment that predicts the next state and reward given a current state and action. This model allows the agent to simulate hypothetical experiences, effectively planning future actions without needing to interact with the real environment continuously. The agent uses the learned model to generate simulated experiences, and these hypothetical experiences are treated similarly to real experiences, updating the Q(s, a) values and refining the agent's policy. This process accelerates learning by allowing the agent to practice and plan multiple scenarios internally. Dyna-Q's planning process is incremental, meaning it can be interrupted and resumed at any time. This feature makes it highly adaptable to dynamic environments.
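A minimal tabular Dyna-Q sketch consistent with this description is given below; the environment interface (reset, step, actions) and the hyperparameter values are illustrative assumptions.

    import random
    from collections import defaultdict

    def dyna_q(env, episodes=500, alpha=0.1, gamma=0.95, epsilon=0.1, n_planning=20):
        Q = defaultdict(float)          # Q[(s, a)]
        model = {}                      # model[(s, a)] = (r, s_next)

        def eps_greedy(s):
            if random.random() < epsilon:
                return random.choice(env.actions)
            return max(env.actions, key=lambda a: Q[(s, a)])

        for _ in range(episodes):
            s, done = env.reset(), False
            while not done:
                a = eps_greedy(s)
                s_next, r, done = env.step(a)               # direct RL from real experience
                best_next = max(Q[(s_next, b)] for b in env.actions)
                Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
                model[(s, a)] = (r, s_next)                 # learn the model
                for _ in range(n_planning):                 # planning with simulated experience
                    ps, pa = random.choice(list(model))
                    pr, ps_next = model[(ps, pa)]
                    best = max(Q[(ps_next, b)] for b in env.actions)
                    Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
                s = s_next
        return Q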
…algorithm, which combines Model-based and Model-free RL to enhance learning efficiency and control accuracy. By leveraging past CGM and insulin data, the system could predict future glucose levels and optimize insulin dosage in real time, resulting in improved glycemic control. The algorithm's ability to operate effectively without explicit carbohydrate information significantly reduced the cognitive load on patients, aligning well with the goals of developing a fully automated artificial pancreas. Furthermore, the incorporation of a precision medicine approach tailored the model to individual patients, enhancing the adaptability and accuracy of the glycemic predictions. On the other hand, the reliance on precise CGM data and insulin records for model training might have posed challenges in real-world scenarios where data quality could vary. While the algorithm showed promise in simulations and preliminary tests with real patients, its generalizability across diverse patient populations and long-term robustness remained areas of concern. The study's limited real-world testing meant that further extensive clinical trials were necessary to fully validate the system's effectiveness and safety in everyday use.

Dyna-T, which integrated Dyna-Q with Upper Confidence Bounds applied to Trees (UCT) for enhanced planning efficiency, was developed in [193]. The method aimed to improve action selection by using UCT to explore simulated experiences generated by the Dyna-Q model, demonstrating its effectiveness in several OpenAI Gym environments. The primary strength of Dyna-T was its ability to combine the Model-based learning of Dyna-Q with the robust exploration capabilities of UCT. This combination allowed for a more directed exploration strategy, which enhanced the agent's ability to find optimal actions in complex and stochastic environments. The algorithm's use of UCT ensured that the most promising action paths were prioritized, significantly improving learning efficiency and convergence speed. The empirical results indicated that Dyna-T outperformed traditional RL methods, especially in environments with high variability and sparse rewards, showcasing its potential for broader applications in RL tasks. Despite its advantages, Dyna-T exhibited some limitations, particularly in deterministic environments where the added computational complexity of UCT might not have yielded significant benefits. The initial overhead of constructing and maintaining the UCT structure could be costly, especially in simpler tasks where traditional Dyna-Q or Q-learning might have sufficed. Moreover, while Dyna-T showed promise in simulated environments, its performance in real-world scenarios with continuous state and action spaces remained to be fully validated. The reliance on the exploration parameter 'c' and its impact on performance also warranted further investigation to ensure robust adaptability across different problem domains.

A novel multi-path load-balancing routing algorithm for Wireless Sensor Networks (WSNs) leveraging the Dyna-Q algorithm was proposed in [194]. The proposed method, termed ELMRRL (Energy-efficient Load-balancing Multi-path Routing with RL), aimed to minimize energy consumption and enhance network lifetime by selecting optimal routing paths based on the residual energy, hop count, and energy consumption of nodes. The main advantage of the ELMRRL algorithm was its effective integration of Dyna-Q RL, which combined real-time learning with planning to adaptively select optimal routing paths. This dynamic adjustment ensured that the algorithm could respond to changes in network conditions, thus extending the network lifetime. The use of RL enabled each sensor node to act as an agent, making independent decisions based on local information, which was crucial for distributed environments like WSNs. Furthermore, the algorithm's focus on both immediate and long-term cumulative rewards led to more balanced energy consumption across the network, preventing premature node failures and improving overall network resilience. On the other hand, the reliance on local information for decision-making might have limited the algorithm's effectiveness in scenarios where global network knowledge could provide more optimal solutions. The initial learning phase, where nodes gathered and processed data to update their Q-values, could have introduced latency and computational overhead, potentially affecting the algorithm's performance in highly dynamic environments. Additionally, the approach's dependence on the correct parameterization of the reward function and learning rates could have been challenging, as these parameters significantly impacted the algorithm's efficiency and convergence speed. The paper primarily validated the algorithm through simulations, so its real-world applicability in diverse WSN deployments remained to be fully explored.

Researchers in [34] introduced a new motion control method for path planning in unfamiliar environments using the Dyna-Q algorithm. The goal of the proposed method was to improve the efficiency of motion control by combining direct RL with model learning, enabling agents to effectively navigate both dynamic and static obstacles. The main benefit of this method was its use of the Dyna-Q algorithm, which merged Model-based and Model-free RL techniques, facilitating concurrent learning and planning. This integration notably enhanced convergence speed and adaptability to dynamic environments, as demonstrated by faster path optimization compared to traditional Q-learning. The algorithm's capacity to simulate experiences allowed the agent to plan more efficiently, resulting in improved decision-making in complex situations.
Furthermore, employing an ε-greedy policy for action selection ensured a balanced exploration-exploitation trade-off, which was essential for finding optimal paths in unknown environments. Nonetheless, the method's dependence on accurate environment models for effective planning might have restricted its performance in highly unpredictable settings where models might not accurately reflect real-world dynamics. The computational burden of maintaining and updating the Q-table and environment model could also have been significant, especially in large state-action spaces. Additionally, while the method demonstrated promising results in simulation environments, its scalability and robustness in real-world applications with continuous state spaces and real-time constraints needed further validation. The reliance on predefined parameters, such as learning rates and discount factors, could also have impacted the algorithm's efficiency and effectiveness, requiring careful tuning for different scenarios.

Researchers in [195] introduced an improved Dyna-Q algorithm for mobile robot path planning in unknown, dynamic environments. The proposed method integrated heuristic search strategies, Simulated Annealing (SA), and a new action-selection strategy to enhance learning efficiency and path optimization. The enhanced Dyna-Q algorithm effectively merged heuristic search and SA, significantly improving the exploration-exploitation balance. By incorporating heuristic rewards and actions, the algorithm ensured efficient navigation through complex environments, avoiding local minima and achieving faster convergence. The novel SA-ε-greedy policy dynamically adjusted exploration rates, optimizing the learning process. Empirical results from simulations and practical experiments showed that the improved algorithm outperformed Q-learning and Dyna-Q methods, demonstrating superior global search capabilities, enhanced learning efficiency, and robust convergence properties in both static and dynamic obstacle scenarios. Despite the performance improvements from integrating heuristic search and SA, the approach's reliance on predefined heuristic functions might have limited its adaptability to diverse environments. The initial training phase required extensive exploration, potentially leading to higher computational overhead and longer training times. Additionally, the dependence on a grid-based environment representation might have restricted the algorithm's scalability to continuous state spaces. The focus on simulated and controlled real-world environments raised concerns about the algorithm's robustness and generalizability in more complex and unpredictable real-world applications. Further studies were necessary to validate the method's effectiveness in larger, unstructured, and more dynamic environments.

An innovative approach for scheduling the charging of Plug-in Electric Vehicles (PEVs) using the Dyna-Q algorithm was developed in [196]. The primary goal was to minimize long-term charging costs while considering the stochastic nature of driving behavior, traffic conditions, energy usage, and fluctuating energy prices. The method formulated the problem as an MDP and employed DRL techniques for solution optimization. The key strength of this approach lies in its effective combination of Model-based and Model-free RL, which enhances learning speed and efficiency. By continuously updating the model with real experiences and generating synthetic experiences, the Dyna-Q algorithm ensured rapid convergence to an optimal charging policy. This dual approach allowed for robust decision-making even in the face of uncertain and dynamic real-world conditions, such as varying energy prices and unpredictable driving patterns. Additionally, the integration of a deep Q-network for value approximation facilitated handling the vast state space inherent in PEV charging scenarios, ensuring the method's scalability and applicability to real-world settings. However, the reliance on accurate initial parameter values and predefined models for the user's driving behavior might have limited the algorithm's adaptability during the initial learning phase. The system's performance heavily depended on the quality and accuracy of these models, which might have varied across different users and environments. Moreover, while the approach demonstrated significant improvements in simulations, its effectiveness in diverse and more complex real-world scenarios required further validation. The potential computational overhead associated with maintaining and updating the deep Q-network and model could have posed challenges for real-time applications, especially in large-scale deployments with numerous PEVs.

Authors in [197] presented an enhanced Dyna-Q algorithm tailored for Automated Guided Vehicles (AGVs) navigating complex, dynamic environments. The key improvements included a global path guidance mechanism based on heuristic graphs and a dynamic reward function designed to address issues of sparse rewards and slow convergence in large state spaces. The improved Dyna-Q algorithm stood out by integrating heuristic graphs that provided a global perspective on path planning, significantly reducing the search space and enhancing efficiency. This method enabled AGVs to quickly orient towards their goals by leveraging precomputed shortest-path information, thus mitigating the problem of sparse rewards commonly encountered in extensive environments. Additionally, the dynamic reward function intensified feedback, guiding AGVs more effectively through complex terrains and around obstacles. The experimental results in various scenarios with static and dynamic obstacles demonstrated superior convergence speed and learning efficiency compared to traditional
Q-learning and standard Dyna-Q algorithms, highlighting its robustness and effectiveness in dynamic settings. However, the dependency on heuristic graphs, which required prior computation, might have limited the algorithm's adaptability in environments where real-time updates were necessary or in scenarios with unpredictable changes. The initial setup phase, involving the creation of the heuristic graph, could have introduced overheads that might not have been feasible for all applications. Furthermore, while the dynamic reward function enhanced learning efficiency, its design relied heavily on accurate modeling of the environment, which could have been challenging in highly variable or noisy conditions. The paper's focus on simulated environments left room for further validation in real-world applications, where additional factors such as sensor noise and real-time constraints could have impacted performance.

A Dyna-Q based anti-jamming algorithm designed to enhance the efficiency of path selection in wireless communication networks subject to malicious jamming was introduced in [198]. By leveraging both Model-based and Model-free RL techniques, the algorithm aimed to optimize multi-hop path selection, reducing packet loss and improving transmission reliability in hostile environments. The application of the Dyna-Q algorithm in this context was innovative, combining direct learning with simulated experiences to accelerate the convergence of the Q-values. This dual approach allowed the system to adapt quickly to dynamic jamming conditions, ensuring more reliable path selection and communication efficiency. The inclusion of a reward function that considered various modulation modes based on the Signal-to-Jamming Noise Ratio (SJNR) enhanced the robustness of the algorithm. Simulation results demonstrated that the Dyna-Q algorithm significantly outperformed traditional Q-learning and multi-armed bandit models, achieving faster convergence to optimal paths and better handling of interference, thus showcasing its potential for real-time applications in complex electromagnetic environments. Nevertheless, the method's reliance on pre-established environmental models might have limited its effectiveness in highly unpredictable or rapidly changing conditions, where the initial models might not accurately capture real-time dynamics. The need for accurate initial state representations and model updates introduced additional computational overhead, which could have impacted performance in larger or more complex networks. Furthermore, while the algorithm showed promise in simulated environments, its scalability and adaptability in real-world applications with varying node densities and jamming strategies required further validation. The focus on hop count minimization might also have overlooked other critical factors such as energy consumption and latency, which were essential for comprehensive network performance assessment.

Researchers in [199] introduced a Model-based RL method using Dyna-Q tailored for multi-agent systems. It emphasized efficient environmental modeling through a tree structure and proposed methods for model sharing among agents to enhance learning speed and reduce computational costs. The approach leveraged the concept of knowledge sharing, where agents with more experience assisted others by disseminating valuable information. The integration of a tree-based model for environmental learning within the Dyna-Q framework significantly enhanced the efficiency of model construction and memory usage. This method allowed agents to generate virtual experiences, thus accelerating the learning process. The innovative model sharing techniques proposed in the paper, such as grafting partial branches and resampling, enabled agents to build more accurate models collaboratively. By reducing redundant data transfer and focusing on useful experiences, these sharing methods improved sample efficiency and learning speed in complex environments. The simulation results demonstrated the effectiveness of these techniques in multi-agent cooperation scenarios, highlighting their potential to optimize learning in large, continuous state spaces. Despite its advantages, the reliance on accurate decision tree models for sharing experiences might have limited the approach's flexibility in highly dynamic or heterogeneous environments. The effectiveness of the sharing methods depended on the quality and relevance of the shared models, which might have varied across different agents and scenarios. Additionally, the initial phase of building accurate tree models could have been computationally intensive, particularly in environments with high variability. While the proposed methods showed promising results in simulations, further validation in diverse real-world applications was needed to fully assess their scalability and robustness. The paper also assumed a certain level of homogeneity among agents, which might not always have been applicable in more varied multi-agent systems.

The application of the Dyna-Q algorithm for path planning and obstacle avoidance in Unmanned Ground Vehicles (UGVs) and UAVs within complex urban environments was explored in [200]. The study focused on utilizing a vector field-based approach for effective navigation and air-ground collaboration tasks. The integration of the Dyna-Q algorithm with a vector field method significantly enhanced the efficiency and accuracy of path planning in dynamic urban settings. The approach leveraged both real and simulated experiences to adaptively update the agent's policy, ensuring rapid convergence to optimal paths. By simplifying the urban environment into a grid world, the method allowed for precise waypoint calculation, facilitating smooth
navigation and effective obstacle avoidance. The use of PID controllers for UAV and UGV coordination further improved the stability and responsiveness of the system, enabling robust air-ground collaboration. Simulation results demonstrated that the proposed method effectively handled dynamic obstacles and complex path scenarios, showcasing its potential for real-world applications in urban environments. On the other hand, the reliance on grid-based environment representation might have limited the algorithm's scalability and adaptability to continuous state spaces found in more diverse and unstructured urban areas. The initial phase of creating the vector field and the grid map could have introduced computational overheads, which might have impacted real-time performance. While the paper focused on simulated environments, further validation in real-world scenarios was necessary to assess the approach's robustness and effectiveness under varying conditions. Additionally, the method's dependency on accurate dynamic models for both UGV and UAV could have posed challenges, as any discrepancies between the model and the real environment might have affected the overall performance.

This paper [201] evaluated the performance of the Dyna-Q algorithm in robot navigation within partially known environments containing static obstacles. The study extended the Dyna-Q algorithm to multi-robot systems and conducted extensive simulations to assess its efficiency and effectiveness. The primary strength of this study lies in its thorough analysis of the Dyna-Q algorithm in both single and multi-agent contexts. By integrating planning and Model-based learning, the Dyna-Q algorithm sped up the learning process, enabling robots to navigate efficiently even with limited prior knowledge of the environment. The use of simulations with the Robot Motion Toolbox allowed for a comprehensive evaluation of the algorithm's performance across various parameters, providing valuable insights into the optimal settings for different scenarios. Extending the Dyna-Q algorithm to multi-robot systems showcased its adaptability and potential for complex task coordination, where agents could share knowledge to enhance overall system performance. However, the paper's focus on static obstacles and deterministic environments might have limited the applicability of the findings to more dynamic and stochastic settings. The initial need for environment discretization and model construction introduced additional computational overhead, which could have been a bottleneck in real-time applications. While the simulations offered a robust analysis, the lack of real-world validation left some uncertainty about the algorithm's practical effectiveness in unpredictable and continuously changing environments. Furthermore, the performance degradation observed in multi-agent scenarios indicated that further refinement was needed to improve coordination and reduce inter-agent interference.

The primary advantage of Dyna-Q is its ability to accelerate learning by combining real and simulated experiences. This dual approach reduces the dependency on extensive real-world interactions, making it suitable for applications where such interactions are costly or limited. Moreover, Dyna-Q is inherently adaptable to changing environments. The integration of continuous learning and planning allows the agent to update its policy dynamically in response to new information or changes in the environment [73], [1]. Table XII provides a summary of the papers that utilized Model-based Planning algorithms.

TABLE XII: Model-based Planning Papers Review
Application Domain | References
Theoretical Research (Convergence, stability) | [184], [187]
Multi-agent Systems and Autonomous Behaviors | [49], [188], [199], [196], [197], [200]
Games and Simulations | [170], [173], [175], [177], [179], [180], [181], [182], [37], [190], [195]
Energy and Power Management (IoT Networks, Smart Energy Systems) | [194], [192]
Transportation and Routing Optimization | [41]
Network Resilience and Optimization | [198], [194]
Hybrid RL Algorithms | [174], [176], [178], [185], [193]
Dynamic Bayesian Networks | [189]
Dynamic Environments | [191], [34], [195], [197]
Robotics | [201]

Over the next section, a complete introduction to another paradigm of RL, Policy-based Methods, is given, along with an analysis of the various algorithms that fall under its umbrella.

IV. POLICY-BASED METHODS

Policy-based methods are another fundamental family of RL methods, one that places stronger emphasis on direct policy optimization in the process of choosing actions for an agent. In contrast to Value-based methods, which search for the value function implicit in the task and then derive an optimal policy, Policy-based methods directly parameterize and optimize the policy. This approach offers several advantages, particularly in very challenging environments that have high-dimensional action spaces or where policies are inherently stochastic. At their core, Policy-based methods operate on a parameterization of the policy, usually denoted as π(a|s; θ). Here, θ denotes the parameters of the policy, while s denotes the state and a denotes the action. In other words, it finds the optimal …
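As one concrete (and purely illustrative) choice of π(a|s; θ) for a discrete action space, the sketch below uses a softmax over linear scores of state features; the feature function, parameter shapes, and names are assumptions rather than something prescribed by the survey.

    import numpy as np

    def softmax_policy(theta, features, state):
        """pi(a|s; theta): action probabilities for state s under parameters theta.
        theta has shape (n_actions, n_features); features(state) returns a feature vector."""
        phi = features(state)
        scores = theta @ phi              # one score per action
        scores -= scores.max()            # subtract the max for numerical stability
        probs = np.exp(scores)
        return probs / probs.sum()

    def sample_action(theta, features, state, rng):
        probs = softmax_policy(theta, features, state)
        return rng.choice(len(probs), p=probs)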
… was sample efficient, with polynomial complexity, which was crucial for practical applications due to the high cost of obtaining samples. However, the complexity of the analysis might have challenged practitioners less familiar with the mathematical concepts. The reliance on assumptions such as the log-barrier regularization term and soft-max policy parametrization might have limited the generality of the results. The paper focused on the stationary infinite-horizon discounted setting, and a more detailed discussion on applying the results to other settings would have enhanced its relevance. The absence of empirical validation of the proposed convergence bounds and sample efficiency was another limitation. Including experimental results would have provided additional evidence of the practical utility of the findings.

A novel approach to Energy Management Strategies (EMS) in Fuel Cell Hybrid EVs (FCHEV) using the fuzzy REINFORCE algorithm was introduced in [204]. This method integrated a fuzzy inference system (FIS) with Policy Gradient RL (PGRL) to optimize energy management, achieve hydrogen savings, and maintain battery operation. One of the key strengths of the paper was the innovative combination of fuzzy logic with the REINFORCE algorithm. By employing a fuzzy inference system to approximate the policy function, the authors effectively leveraged the generalization capabilities of fuzzy logic to handle the complexity and uncertainty inherent in energy management tasks. This integration helped to address the limitations of traditional EMS methods that relied heavily on expert knowledge and static rules, thus providing a more adaptive and robust solution. The use of a fuzzy baseline function to stabilize the training process and reduce the variance in policy gradient updates was another notable advantage. This approach enhanced the convergence rate and stability of the learning process, which was particularly beneficial in real-time applications where computational efficiency and robustness were critical. The paper's demonstration of the algorithm's adaptability to changing driving conditions and system states further underscored its practical relevance and effectiveness. However, the complexity of the proposed method might have posed implementation challenges, particularly for practitioners who were less familiar with fuzzy logic. The integration of FIS and PGRL required careful tuning of parameters and membership functions, which could have been time-consuming and computationally intensive. Additionally, while the fuzzy REINFORCE algorithm showed promise in reducing the computational burden and improving convergence, the reliance on fuzzy logic introduced an additional layer of complexity that might not have been necessary for all applications. The paper also provided a comprehensive analysis of the simulation and hardware-in-the-loop experiments, validating the effectiveness of the proposed method in real-world scenarios. The results indicated that the fuzzy REINFORCE algorithm could achieve near-optimal performance without requiring accurate system models or extensive prior knowledge, making it a versatile and practical solution for EMS in FCHEVs.

Authors in [205] presented a study protocol for a trial aimed at improving medication adherence among patients with type 2 diabetes using an RL-based text messaging program. One of the key strengths of this study was its innovative use of RL to personalize text message interventions. By tailoring messages based on individual responses to previous messages, the approach had the potential to optimize engagement and improve adherence more effectively than generic messaging strategies. This personalized communication could have led to more significant behavior changes and better health outcomes for patients with diabetes. The study's design also enhanced its practical relevance. Conducted in a real-world setting at Brigham and Women's Hospital, it involved patients with suboptimal diabetes control, which reflected a common clinical scenario. The use of electronic pill bottles to monitor adherence provided accurate and objective data, supporting the reliability of the study outcomes. Additionally, the trial's primary outcome of average medication adherence over six months was a meaningful measure that directly related to the study's objective. However, there were some weaknesses and challenges associated with the study. The requirement for patients to use electronic pill bottles and smartphones with a data plan or WiFi might have limited the generalizability of the findings to populations without access to such technology. Furthermore, the study's reliance on self-reported adherence as a secondary outcome introduced the potential for reporting bias. The study also faced potential limitations related to the length of the follow-up period and the evaluation of the long-term sustainability of the intervention. While a six-month follow-up period was sufficient to assess initial adherence improvements, longer-term studies would have been necessary to determine whether the benefits of the intervention were sustained over time.

In [206], the authors presented a novel method for rate adaptation in 802.11 wireless networks leveraging the REINFORCE algorithm. The proposed approach, named ReinRate, integrated a comprehensive set of observations, including received signal strength, contention window size, current modulation and coding scheme, and throughput, to adapt dynamically to varying network conditions and optimize network throughput. One of the key strengths of this paper was its innovative application of the REINFORCE algorithm to WiFi rate adaptation.
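To make the variance-reduction role of a baseline concrete, the sketch below shows a generic REINFORCE update with a learned state-value baseline; it is a textbook-style illustration under assumed function signatures, not a reproduction of the fuzzy baseline used in [204].

    def reinforce_update(theta, w, episode, policy_grad_logp, value_fn, value_grad,
                         alpha_theta=1e-3, alpha_w=1e-2, gamma=0.99):
        """One REINFORCE update with a learned baseline.
        episode is a list of (state, action, reward); policy_grad_logp, value_fn and
        value_grad are user-supplied functions for the chosen parameterizations."""
        G, returns = 0.0, []
        for (_, _, r) in reversed(episode):       # discounted returns, computed backwards
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for (s, a, _), G_t in zip(episode, returns):
            baseline = value_fn(w, s)
            advantage = G_t - baseline            # subtracting the baseline reduces gradient variance
            w = w + alpha_w * advantage * value_grad(w, s)
            theta = theta + alpha_theta * advantage * policy_grad_logp(theta, s, a)
        return theta, w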
Traditional rate adaptation algorithms like Minstrel and Ideal relied on limited observations such as packet loss rate or signal-to-noise ratio, which could have been insufficient in dynamic wireless environments. In contrast, ReinRate's broader set of observations allowed for a more nuanced response to varying conditions, leading to significant improvements in network performance. The authors demonstrated that ReinRate outperformed the Minstrel and Ideal algorithms by up to 102.5% and 30.6% in network scenarios without interference, and by up to 35.1% and 66.6% in scenarios with interference. Another strength was the comprehensive evaluation of ReinRate using the ns-3 network simulator and ns3-ai OpenAI Gym. The authors conducted extensive simulations under various network scenarios, both static and dynamic, with and without interference. This thorough evaluation provided strong evidence of the algorithm's effectiveness and adaptability in real-world conditions. The results indicated that ReinRate consistently achieved higher throughput compared to traditional algorithms, showcasing its ability to handle the challenges of dynamically changing wireless environments. However, the complexity of the proposed method might have posed challenges for practical implementation. The integration of multiple observations and the application of the REINFORCE algorithm required careful tuning of parameters and computational resources.

A new approach to enhance DRL for outdoor robot navigation was investigated in [207]. The key innovation was the use of a heavy-tailed policy parameterization, which induced exploration in sparse reward settings, a common challenge in outdoor navigation tasks. A significant strength of the paper lies in addressing the sparse reward issue, which was prevalent in many real-world navigation scenarios. Traditional DRL methods often relied on carefully designed dense reward functions, which could have been impractical to implement. The authors proposed HTRON, an algorithm that leveraged heavy-tailed policy parameterizations, such as the Cauchy distribution, to enhance exploration without needing complex reward shaping. This approach allowed the algorithm to learn efficient behaviors even with sparse rewards, making it more applicable to real-world scenarios. The paper's thorough experimental evaluation was another strong point. The authors tested HTRON against established algorithms like REINFORCE, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO) (explained later in the upcoming subsections) across three different outdoor scenarios: goal-reaching, obstacle avoidance, and uneven terrain navigation. HTRON outperformed these algorithms in terms of success rate, average time steps to reach the goal, and elevation cost, demonstrating its effectiveness and efficiency. The use of a realistic Unity-based simulator and the deployment of the algorithm on a Clearpath Husky robot further validated the practical applicability of the proposed method. However, the complexity of the proposed algorithm and the specific choice of heavy-tailed distributions might have posed challenges. The implementation of heavy-tailed policy gradients could have introduced instability, especially in the initial learning phases. While the authors mitigated this with adaptive moment estimation and gradient clipping, these techniques required careful tuning and expertise, potentially limiting accessibility for practitioners.

Table XIII gives an overview of the papers reviewed in this section. The next Policy-based algorithm we need to cover is TRPO, which we turn to in the next subsection.

TABLE XIII: REINFORCE Papers Review
Application Domain | References
Energy and Power Management | [204]
Theoretical Research (Convergence, stability) | [203]
Network Optimization | [206]
Robotics | [207]

B. Trust Region Policy Optimization (TRPO)

TRPO, introduced by [208], is an advancement in RL, specifically within policy optimization methods. The primary objective of TRPO is to optimize control policies with guaranteed monotonic improvement, addressing the shortcomings of previous methods [209], [210], [211] that often resulted in unstable policy updates and poor performance on complex tasks.

TRPO is designed to handle large, nonlinear policies such as those represented by neural networks. The algorithm ensures that each policy update results in a performance improvement by maintaining the updated policy within a "trust region" around the current policy. This trust region is defined using a constraint on the KL divergence between the new and old policies, effectively preventing large, destabilizing updates [212], [208]. TRPO operates within the stochastic policy framework, where the policy πθ is parameterized by θ and defines a probability distribution over actions given the states. The expected discounted reward for a policy π is given by:

J(π) = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ],    (34)

where γ is the discount factor, r(s_t, a_t) is the reward at time step t, and the expectation is taken over the state and action trajectories induced by the policy. To ensure that the policy update remains within a safe boundary, TRPO constrains the KL divergence between the new policy πθ′ and the old policy πθ:
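In its standard form, this trust-region condition is a bound δ on the expected KL divergence, optimized together with a surrogate objective; a common way to write it (following the usual TRPO notation, as a sketch rather than a quotation of the survey's own equation) is:

    max_{θ′}  E_{s,a ~ πθ} [ (πθ′(a|s) / πθ(a|s)) A^{πθ}(s, a) ]
    subject to  E_{s ~ πθ} [ D_KL( πθ(·|s) || πθ′(·|s) ) ] ≤ δ,

where A^{πθ} denotes the advantage function under the old policy and δ is the trust-region radius.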
effectively encouraged exploration, which was crucial to avoid premature convergence to suboptimal policies. This approach addressed a common limitation of TRPO, which could sometimes restrict exploration due to its strict KL divergence constraints between consecutive policies. The entropy regularization helped maintain a balance between exploration and exploitation, leading to more robust learning outcomes. The empirical evaluation provided in the paper was another significant strength. The authors conducted thorough experiments using the Cart-Pole system, a well-known benchmark in the field. The results showed that EnTRPO converged faster and more reliably than TRPO, particularly when the discount factor was set to 0.85. This indicated that the proposed method not only improved exploration but also enhanced the overall convergence speed and stability of the learning process. The use of a well-defined experimental setup, including details on neural network architectures and hyperparameters, added credibility to the findings. A potential limitation was the reliance on a single benchmark task for evaluation. While the Cart-Pole system was a standard benchmark, it was relatively simple compared to many real-world applications. The paper would have benefited from additional experiments on more complex tasks and environments to demonstrate the generalizability and robustness of EnTRPO. This would have provided stronger evidence of the method's effectiveness across a wider range of scenarios.

The challenge of applying trust region methods to Multi-agent RL (MARL) was investigated in [215]. The authors introduced Heterogeneous-Agent TRPO (HATRPO) and Heterogeneous-Agent Proximal Policy Optimization (HAPPO) algorithms. These methods were designed to guarantee monotonic policy improvement without requiring agents to share parameters or relying on restrictive assumptions about the decomposability of the joint value function. A strength of the paper was its theoretical foundation. The authors extended the theory of trust region learning to cooperative MARL by developing a multi-agent advantage decomposition lemma and a sequential policy update scheme. This theoretical advancement allowed HATRPO and HAPPO to ensure monotonic improvement in joint policy performance, a key advantage over existing MARL algorithms that did not guarantee such improvement. This theoretical guarantee was essential for stable and reliable learning in multi-agent settings, where individual policy updates could often lead to non-stationary environments and suboptimal outcomes. The empirical validation of HATRPO and HAPPO on benchmarks such as Multi-Agent MuJoCo and StarCraft II demonstrated the effectiveness of these algorithms. The results showed that HATRPO and HAPPO significantly outperformed strong baselines, including Independent Proximal Policy Optimization (IPPO), MAPPO, and MADDPG, in various tasks. This performance improvement highlighted the practical applicability of the proposed methods in complex, high-dimensional environments. The thorough experimental evaluation across multiple scenarios provided strong evidence of the robustness and generalizability of HATRPO and HAPPO. However, the complexity of implementing HATRPO and HAPPO could have been a potential limitation. The algorithms required the computation of multi-agent advantage functions and sequential updates, which could have been computationally intensive and challenging to implement efficiently. This complexity might have limited the accessibility of these methods to practitioners who might not have had advanced computational resources or expertise in implementing sophisticated algorithms.

Authors in [216] presented a method to actively recognize objects by choosing a sequence of actions for an active camera. This method utilized TRPO combined with Extreme Learning Machines (ELMs) to enhance the efficiency of the optimization algorithm. One of the significant strengths of this paper was its innovative application of TRPO in conjunction with ELMs. ELMs provided a simple yet effective way to approximate policies, reducing the computational complexity compared to traditional deep neural networks. This resulted in an efficient optimization process, crucial for real-time applications like active object recognition. The use of ELMs allowed for faster convergence and more straightforward implementation, making the proposed method accessible for practical applications. However, the complexity of integrating TRPO with ELMs could have posed challenges for some practitioners. Although ELMs simplified the optimization process, they still required careful tuning of parameters, such as the number of hidden nodes and the distribution of random weights. This additional layer of complexity might have limited the method's accessibility for users without extensive experience in RL and neural networks.

In [42], authors explored the application of the TRPO algorithm in MARL environments, specifically focusing on hide-and-seek games. The authors compared the performance of TRPO with the Vanilla Policy Gradient (VPG) algorithm to determine the most effective method for this type of game. One of the primary strengths of this paper was its focus on a well-defined, complex multi-agent environment. Hide-and-seek games inherently involved dynamic interactions between agents, making them an excellent testbed for evaluating algorithms. By using TRPO, which was designed to ensure monotonic policy improvement, the authors addressed a significant challenge in MARL: maintaining stable and consistent learning despite the presence of multiple interacting agents. The empirical results presented in the
paper highlighted the strengths of TRPO, especially in scenarios where the testing environment differed from the training environment. TRPO's ability to adapt to new environments and maintain high performance was a notable advantage over the VPG algorithm, which performed better in environments identical to the training conditions but struggled when faced with variability. This adaptability was crucial for practical applications of MARL, where agents often encountered unpredictable changes in their environment. Another strength was the comprehensive experimental setup, which included various configurations and scenarios. The authors meticulously compared the performance of TRPO and VPG across different numbers of agents and types of environments (quadrant and random walls scenarios). This thorough approach provided robust evidence supporting the efficacy of TRPO in MARL settings. However, the paper also had some limitations. The complexity of implementing TRPO in a multi-agent context could have been a barrier for practitioners. TRPO required careful tuning and substantial computational resources, which might not have been readily available in all settings. Additionally, the reliance on simulation results raised questions about the real-world applicability of the findings. While the hide-and-seek game was a useful simulation environment, real-world deployments could have presented additional challenges not captured in the simulations.

The application of TRPO to improve Cross-Site Scripting (XSS) detection systems was analyzed by authors in [217]. The authors aimed to enhance the resilience of XSS filters against adversarial attacks by using RL techniques to identify and counter malicious inputs. One of the main strengths of this paper was its innovative approach to applying TRPO in Cybersecurity, specifically for XSS detection. Traditional XSS detection methods often relied on static rules and signatures, which could be easily bypassed by sophisticated attackers. By leveraging TRPO, the authors introduced a dynamic and adaptive mechanism that could learn to detect and counteract adversarial attempts to exploit XSS vulnerabilities. This use of TRPO enhanced the robustness of the detection system, making it more resilient to evolving threats. A limitation of this study was the reliance on specific hyperparameters, such as the learning rate and discount factor, which could have significantly impacted the model's performance. The paper would have benefited from a more detailed discussion on how these parameters were selected and their influence on the detection model. Providing guidelines or heuristics for parameter tuning would have helped practitioners replicate and extend the study's findings.

Authors in [218] aimed to create a universal policy for a locomotion task that could adapt to various robot morphologies, using TRPO. The study investigated the use of surrogate models, specifically Polynomial Chaos Expansion (PCE) and model ensembles, to model the dynamics of the robots. One of the primary strengths of this thesis was its innovative approach to developing a universal policy. The use of TRPO ensured stability and reliable policy updates even in complex environments. The focus on creating a policy that could generalize across different robot configurations was particularly noteworthy, as it addressed the challenge of designing controllers that were not limited to a single robot morphology. The integration of surrogate models, especially the PCE, was another strong point. PCE allowed for efficient sampling and modeling of the stochastic environment, potentially reducing the number of interactions required with the real environment. This was crucial for practical applications where real-world interactions could have been costly or risky. The theoretical foundation laid for using PCE in this context was robust and showed promise for future research. However, the thesis also highlighted several challenges and limitations. The complexity of accurately modeling the dynamics with PCE was a significant hurdle. The results indicated that while PCE showed potential, it currently could not model the dynamics accurately enough to be used in combination with TRPO effectively. The computational time required for PCE was also a practical concern, limiting its immediate applicability. The model ensemble surrogate showed some promise but ultimately failed to train a successful policy. This pointed to the difficulty of creating surrogate models that could capture the complexities of robot dynamics sufficiently. The thesis suggested that using the original environment from the RoboGrammar library yielded better results, emphasizing the need for more advanced or alternative surrogate modeling techniques.

A novel approach for optimizing Home Energy Management Systems (HEMS) using Multi-agent TRPO (MA-TRPO) was investigated in [219]. This approach aimed to improve energy efficiency, cost savings, and consumer satisfaction by leveraging TRPO techniques in a multi-agent setup. One of the primary strengths of this paper was its consumer-centric approach. Traditional HEMS solutions often prioritized energy efficiency and cost savings without adequately considering consumer preferences and comfort. By incorporating a preference factor for Interruptible-Deferrable Appliances (IDAs), the proposed MA-TRPO algorithm ensured that consumer satisfaction was taken into account, leading to a more holistic and practical solution. This consumer-centric focus was crucial for the widespread adoption of HEMS in real-world settings. Another strength was the comprehensive use of real-world data for training and validation. The authors utilized five-minute retail
electricity prices derived from wholesale market prices and real-world Photovoltaic (PV) generation profiles. This approach enhanced the practical relevance and robustness of the proposed method, as it demonstrated the algorithm's effectiveness under realistic conditions. Additionally, the paper provided a detailed explanation of the various components of the smart home environment, including the non-controllable base load, IDA, Battery Energy Storage System (BESS), and PV system, which added clarity and depth to the study. The use of MA-TRPO in a multi-agent setup was also a significant contribution. The proposed method modeled and trained separate agents for different components of the HEMS, such as the IDA and BESS, allowing for more specialized and effective control strategies. This multi-agent approach addressed the complexities and interdependencies within the home energy environment, leading to more efficient and coordinated energy management. The paper's reliance on simulation results, while comprehensive, still left questions about the real-world applicability of the proposed method. Although the use of real-world data enhanced relevance, further validation in actual home environments would have strengthened the case for practical deployment. Real-world testing was essential to ensure the robustness and effectiveness of the MA-TRPO algorithm in diverse and dynamic home energy scenarios. Additionally, the reliance on discrete action spaces simplified the problem but might not have fully captured the nuances of continuous control in real-world applications. Future work could explore extending the algorithm to handle continuous action spaces for more precise control.

Authors in [220] investigated the application of TRPO to address the joint spectrum and power allocation problem in the Internet of Vehicles (IoV). The objective was to minimize AoI and power consumption, which were crucial for maintaining real-time communication and energy efficiency in vehicular networks. One of the key strengths of this paper was its focus on AoI, a vital metric for ensuring timely and accurate information exchange in vehicular communications. By incorporating AoI into the optimization framework, the authors addressed a significant challenge in IoV networks, where the freshness of information directly impacted road safety and traffic efficiency. The proposed TRPO-based approach effectively balanced the trade-off between minimizing AoI and reducing power consumption, showcasing its practical relevance. The paper's reliance on certain assumptions, such as the availability of CSI and the periodic reporting of CSI to the base station, might have limited its generalizability. In real-world scenarios, obtaining accurate and timely CSI could have been challenging due to various factors like signal interference and mobility. Future work could explore more practical approaches to CSI estimation and reporting to enhance the applicability of the proposed solution.

Table XIV categorizes the papers reviewed in this section by their domain, offering a summary of the research landscape in TRPO. An advancement over TRPO was proposed as a new algorithm, PPO.

TABLE XIV: TRPO Papers Review
Application Domain | References
Object Recognition | [216]
Theoretical Research (Convergence, stability) | [213]
Hybrid RL Algorithms | [214]
Multi-agent Systems and Autonomous Behaviors | [215], [42]
Cybersecurity | [217]
Robotics | [218]
Energy and Power Management | [219], [151], [160]

C. Proximal Policy Optimization (PPO)

PPO, proposed by [221], represents a significant advancement within policy gradient methods. PPO aims to achieve reliable performance and sample efficiency, addressing the limitations of previous policy optimization algorithms such as VPG methods and TRPO.

Using policy gradient methods, the policy parameters are optimized through stochastic gradient ascent by estimating the gradient of the policy. One of the most commonly used policy gradient estimators is:

ĝ = Êt[ ∇θ log πθ(at|st) Ât ],   (39)

where πθ represents the policy parameterized by θ, and Ât is an estimator of the advantage function at time step t. This estimator helps construct an objective function whose gradient corresponds to the policy gradient estimator:

L^PG(θ) = Êt[ log πθ(at|st) Ât ].   (40)

PPO simplifies TRPO by using a surrogate objective with a clipped probability ratio, allowing for multiple epochs of mini-batch updates. In order to preserve learning, large policy updates should be avoided. As a result, the PPO objective is as follows:

L^CLIP(θ) = Êt[ min( rt(θ) Ât, clip(rt(θ), 1 − ε, 1 + ε) Ât ) ],   (41)

where rt(θ) = πθ(at|st) / πθ_old(at|st) is the probability ratio, and ε is a hyperparameter. This objective clips the probability ratio to ensure it stays within a reasonable range,
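As a concrete reading of Eq. (41), the short Python sketch below computes the clipped surrogate objective for a mini-batch. It assumes the old and new log-probabilities and the advantage estimates have already been produced elsewhere (by an actor network and an advantage estimator that are not shown); the values used here are synthetic placeholders.

```python
# Minimal sketch of the clipped PPO surrogate objective, Eq. (41).
import numpy as np

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Average clipped surrogate L^CLIP over a mini-batch."""
    ratio = np.exp(logp_new - logp_old)                     # r_t(theta)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return np.mean(np.minimum(unclipped, clipped))

# Toy mini-batch: wherever the ratio has already moved past [1 - eps, 1 + eps]
# in the direction favored by the advantage, the clipped term dominates and
# the objective stops rewarding further movement, discouraging oversized updates.
rng = np.random.default_rng(1)
logp_old = rng.normal(-1.0, 0.3, size=64)
logp_new = logp_old + rng.normal(0.0, 0.1, size=64)
adv = rng.normal(size=64)
print("L_CLIP:", ppo_clip_objective(logp_new, logp_old, adv))
```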
and practicality of the proposed method in diverse and dynamic power grid environments.

A framework for optimizing metro service schedules and train compositions using PPO within a DRL framework was proposed in [224]. This method was applied to handle the dynamic and complex problem of metro train operations, focusing on minimizing operational costs and improving service regularity. A significant strength of the paper was its innovative application of PPO to metro service scheduling and train composition. PPO, known for its stability and efficiency in handling large-scale optimization problems, was effectively utilized to address the dynamic nature of metro operations. The integration of PPO with Artificial Neural Networks (ANNs) for approximating value functions and policies demonstrated a robust approach to tackling the high-dimensional decision space inherent in metro scheduling. This combination enhanced the framework's ability to adapt to varying passenger demands and operational constraints. The paper's use of a real-world scenario, specifically the Victoria Line of the London Underground, for testing and validation was another strong point. The authors provided a comprehensive evaluation, comparing their PPO-based method with established meta-heuristic algorithms like the Genetic Algorithm and Differential Evolution. The results indicated that the PPO-based approach outperformed these traditional methods in terms of both solution quality and computational efficiency. This practical validation underscored the method's applicability and effectiveness in real-world settings. The reliance on a specific set of operational constraints, such as fixed headways and trailer limits, might have also limited the method's generalizability. While these constraints were necessary for practical implementations, exploring more flexible constraint formulations could have further enhanced the method's adaptability to different metro systems and operational conditions.

Authors in [225] proposed an advanced scheduling algorithm for multitask environments using a combination of PPO and an optimal policy characterization. The main focus was to optimize the scheduling of multiple tasks across multiple servers, taking into account random task arrivals and renewable energy generation. One of the primary strengths of this paper was its innovative combination of PPO with a priority rule, the Earlier Deadline and Less Demand First (ED-LDF). This rule prioritized tasks with earlier deadlines and lower demands, which was shown to be optimal under heavy traffic conditions. The integration of ED-LDF with PPO effectively reduced the dimensionality of the action space, making the algorithm scalable and efficient even in large-scale settings. This reduction in complexity was crucial for practical applications where the number of tasks and servers could have been substantial. However, the complexity of implementing the proposed method might have posed challenges. The integration of the ED-LDF rule with PPO required careful tuning and a deep understanding of both RL and optimal scheduling principles. This complexity could have limited the accessibility of the method for practitioners who might not have had advanced expertise in these areas. Additionally, the reliance on heavy-traffic assumptions for the optimality of the ED-LDF rule might have limited the generalizability of the approach to all scheduling environments. Real-world scenarios could have presented varying traffic conditions that did not always align with these assumptions. The paper's focus on renewable energy as a factor in the scheduling decision was another strength, as it aligned with the growing importance of sustainability in computing operations. However, the paper could have benefited from further exploration of how variations in renewable energy availability impacted the scheduling performance. This aspect was crucial for real-world applications where renewable energy sources could have been highly variable.

An innovative approach to enhancing image captioning models using PPO was designed by authors in [46]. The authors aimed to improve the quality of generated captions by incorporating PPO into the training phase, specifically targeting the optimization of scores. The study explored various modifications to the PPO algorithm to adapt it effectively for the image captioning task. A significant strength of the paper was its integration of PPO with image captioning models, which traditionally relied on VPG methods. The authors argued that PPO could provide better performance due to its ability to enforce trust-region constraints, thereby improving sample complexity and ensuring stable policy updates. This was particularly important for image captioning, where maintaining high-quality training trajectories was crucial. The authors' experimentation with different regularization techniques and baselines was another strong point. They found that combining PPO with dropout decreased performance, which they attributed to increased KL-divergence of RL policies. This empirical observation was critical as it guided future implementations of PPO in similar contexts. Furthermore, the adoption of a word-level baseline via MC estimation, as opposed to the traditional sentence-level baseline, was a noteworthy innovation. This approach was expected to reduce the variance of policy gradient estimators more effectively, contributing to improved model performance. While the results were promising, they were primarily validated on the MSCOCO dataset. Further validation on other datasets and in real-world applications would have been beneficial to assess the generalizability and robustness of the approach. The paper could have also benefited from a more detailed discussion on the impact of different
hyperparameter settings and the specific configurations used in the experiments. Providing this information would have enhanced the reproducibility of the study and allowed other researchers to build on the authors' work more effectively.

In [226], a centralized coordination scheme for CAVs at intersections without traffic signals was developed. The authors introduced the Model Accelerated PPO (MA-PPO) algorithm, which incorporated a prior model into the PPO algorithm to enhance sample efficiency and reduce computational overhead. One of the significant strengths of this paper was its focus on improving computational efficiency, a major challenge in centralized coordination methods. Traditional methods, such as MPC, were computationally demanding, making them impractical for real-time applications with a large number of vehicles. By using MA-PPO, the authors significantly reduced the computation time required for coordination, achieving an impressive reduction to 1/400 of the time needed by MPC. This efficiency gain was crucial for real-time deployment in busy intersections. A limitation was the focus on a specific intersection scenario. While the four-way single-lane intersection was a common setup, real-world intersections could have varied significantly in complexity and traffic patterns. Future work could have explored the applicability of MA-PPO to more complex intersection scenarios and different traffic conditions to ensure broader generalizability.

The application of PPO in developing a DRL controller for the nonlinear attitude control of fixed-wing UAVs was explored in [227]. The study presented a proof-of-concept controller capable of stabilizing a fixed-wing UAV from various initial conditions to desired roll, pitch, and airspeed values. One of the primary strengths of the paper was its innovative use of PPO for UAV attitude control. PPO was known for its stability and efficient policy updates, making it well-suited for complex control tasks like UAV attitude stabilization. The choice of PPO over other RL algorithms was justified by its robust performance and low computational complexity, which were crucial for real-time control applications. The authors also highlighted the practical advantages of using PPO, such as its hyperparameter robustness across different tasks. One of the limitations was the reliance on a single UAV model and specific aerodynamic coefficients for validation. While the Skywalker-X8 was a popular fixed-wing UAV, the generalizability of the proposed approach to other UAV models with different aerodynamic characteristics remained to be explored. Future work could have benefited from testing the PPO-based controller on a wider range of UAV models to ensure broader applicability.

The use of PPO to control the position of a quadrotor was investigated in [39]. The primary goal was to achieve stable flight control without relying on a predefined mathematical model of the quadrotor's dynamics. One of the major strengths of this paper was its application of PPO in the context of quadrotor control. By using PPO, the authors ensured that the control policy updates remained stable and efficient, which was crucial for real-time applications like quadrotor control. The choice of PPO over other RL algorithms was well justified due to its robustness and low computational complexity. The authors' approach of utilizing a stochastic policy gradient method during training, which was then converted to a deterministic policy for control, was another notable strength. This strategy ensured efficient exploration during training, allowing the quadrotor to learn a robust control policy. The use of a simple reward function that focused on minimizing the position error between the quadrotor and the target further added to the efficiency of the training process. One limitation was the reliance on specific initial conditions and a fixed simulation environment. While the authors showed that the PPO-based controller could recover from harsh initial conditions, the generalizability of the method to different quadrotor models and varying environmental conditions remained to be explored. Future work could have benefited from testing the controller on a wider range of scenarios and incorporating additional environmental factors to ensure broader applicability.

Authors in [228] addressed the development of an intelligent lane change strategy for autonomous vehicles using PPO. This approach used PPO to manage lane change maneuvers in dynamic and complex traffic environments, focusing on enhancing safety, efficiency, and comfort. The authors' design of a comprehensive reward function that considered multiple aspects of lane change maneuvers was one of the strengths. The reward function incorporated components for safety (avoiding collisions and near-collisions), efficiency (minimizing travel time and aligning with desired speed and position), and comfort (reducing lateral and longitudinal jerks). This multi-faceted approach ensured that the learned policy optimized for a holistic driving experience, balancing the often competing demands of these different aspects. The inclusion of a safety intervention module to prevent catastrophic actions was a particularly noteworthy feature. This module labeled actions as "catastrophic" or "safe" and could replace potentially dangerous actions with safer alternatives, enhancing the robustness of the learning process. This safety-centric approach addressed a critical concern in applying DRL to real-world autonomous driving tasks, where safety was paramount. However, the complexity of implementing PPO for lane change maneuvers posed challenges. The need for continuous training and fine-tuning of parameters could have been resource-intensive and might not have been feasible
for all developers or organizations, acknowledging that the same is true for some other algorithms. Table XV shows a summary of analyzed papers. Moreover, Table XVI represents all papers reviewed in this section as a package.

TABLE XV: PPO Papers Review
Application Domain | References
Distributed DRL Control for Mixed-Autonomy Traffic Optimization | [222]
Power Systems and Energy Management | [223]
Transportation and Routing Optimization (EVs) | [224]
Real-time Systems and Hardware | [225]
Image Captioning Models | [46]
Hybrid RL Algorithms | [46]
Intelligent Traffic Signal Control | [226]
Real-time Systems and Hardware | [227], [39]

TABLE XVI: Policy-based Papers Review
Application Domain | References
Distributed DRL Control for Mixed-Autonomy Traffic Optimization | [222]
Power Systems and Energy Management | [223]
Transportation and Routing Optimization (EVs) | [224]
Real-time Systems and Hardware | [225], [227], [39]
Image Captioning Models | [46]
Hybrid RL Algorithms | [46], [214]
Intelligent Traffic Signal Control | [226]
Energy and Power Management | [204], [219], [151], [160]
Theoretical Research (Convergence, stability) | [203], [213]
Network Optimization | [206]
Object Recognition | [216]
Multi-agent Systems and Autonomous Behaviors | [215], [42]
Cybersecurity | [217]
Robotics | [218], [207]

We now shall analyze the last group of methods, the Actor-Critic methods. We start by introducing the general Actor-Critics, which combine Value-based and Policy-based approaches.

V. ACTOR-CRITIC METHODS

Actor-critic methods combine Value-based and Policy-based approaches. Essentially, these methods consist of two components: the Actor, who selects actions based on a policy, and the Critic, who evaluates the actions based on their value function. By providing feedback on the quality of the actions taken, the critic guides the actor in updating the policy directly. As a result of this synergy, learning can be more stable and efficient, addressing some limitations of pure policy or Value-based approaches [229], [52].

In the next subsection, we first introduce two main versions of Actor-Critic methods, Asynchronous Advantage Actor-Critic (A3C) & Advantage Actor-Critic (A2C), and then, we will analyze various applications of each.

A. A3C & A2C

The A2C algorithm is a synchronous variant of the A3C algorithm, which was introduced by [230]. A2C maintains the key principles of A3C but simplifies the training process by synchronizing the updates of multiple agents, thereby leveraging the strengths of both Actor-Critic methods and advantage estimation. The Actor-Critic architecture combines two primary components in both algorithms: the actor, which is responsible for selecting actions, and the critic, which evaluates the actions by estimating the value function. The actor updates the policy parameters in a direction that is expected to increase the expected reward, while the critic provides feedback by computing the TD error. This integration allows for more stable and efficient learning compared to using Actor-only or critic-only methods [231]. Advantage estimation is a technique used to reduce the variance of the policy gradient updates. The advantage function A(s, a) represents the difference between the action-value function Q(s, a) and the value function V(s):

A(s, a) = Q(s, a) − V(s).   (42)

By using the advantage function, A2C focuses on actions that yield higher returns than the average, which helps in making more informed updates to the policy [1]. Unlike A3C, where multiple agents update the global model asynchronously, A2C synchronizes these updates. Multiple agents run in parallel environments, collecting experiences and calculating gradients, which are then aggregated and used to update the global model synchronously. This synchronization reduces the complexity of implementation and avoids issues related to asynchronous updates, such as non-deterministic behavior and potential overwriting of gradients.

The A3C algorithm operates as follows: Based on Alg. 24, first, the parameters of the policy network (actor) θ and the value network (critic) φ are initialized (line 1). Then, multiple agents are run in parallel, each interacting with its own copy of the environment (lines 2-14). Each agent independently collects a batch of experiences (st, at, rt, st+1) (lines 5-9). For each agent, the advantage is computed using the collected experiences. Subsequently, the gradients of the policy and value networks are calculated using the advantage estimates. Finally, each agent independently updates
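The advantage-weighted update described above can be sketched compactly in code. The following PyTorch snippet is a minimal illustration of one synchronous A2C step, not the pseudocode of Alg. 24: the shared two-headed network, the one-step bootstrapped target, and the loss coefficients are illustrative assumptions, and the random batch stands in for experiences gathered from parallel workers.

```python
# Minimal sketch of one synchronous advantage actor-critic (A2C) update.
import torch
import torch.nn as nn

obs_dim, n_actions, gamma = 8, 4, 0.99

class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        self.pi = nn.Linear(64, n_actions)   # actor head: action logits
        self.v = nn.Linear(64, 1)            # critic head: state value

    def forward(self, s):
        h = self.body(s)
        return self.pi(h), self.v(h).squeeze(-1)

net = ActorCritic()
opt = torch.optim.Adam(net.parameters(), lr=7e-4)

# Batch collected synchronously from parallel environments (random stand-ins here).
s = torch.randn(128, obs_dim)
a = torch.randint(0, n_actions, (128,))
r = torch.randn(128)
s_next = torch.randn(128, obs_dim)
done = torch.zeros(128)

logits, v = net(s)
with torch.no_grad():
    _, v_next = net(s_next)
    target = r + gamma * (1 - done) * v_next     # one-step bootstrap target
advantage = target - v                            # A(s, a) estimate, cf. Eq. (42)

dist = torch.distributions.Categorical(logits=logits)
policy_loss = -(dist.log_prob(a) * advantage.detach()).mean()  # actor: advantage-weighted log-prob
value_loss = advantage.pow(2).mean()                           # critic: TD regression
entropy_bonus = dist.entropy().mean()                          # encourages exploration

loss = policy_loss + 0.5 * value_loss - 0.01 * entropy_bonus
opt.zero_grad()
loss.backward()
opt.step()
```

In A3C, each worker would compute such gradients against its own copy of the environment and push them to the shared parameters asynchronously; in A2C, the workers' batches are concatenated and a single synchronous step like the one above is applied.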
lighting the effectiveness of the proposed strategy in optimizing caching performance. Another limitation was the reliance on accurate and timely content popularity predictions. The effectiveness of the caching strategy depended heavily on the accuracy of these predictions. In real-world applications, user preferences and content popularity could change unpredictably, and any deviations from the predicted values could negatively impact the performance of the caching algorithm. Additionally, the assumption that users did not request the same content repeatedly might not have held true in all scenarios, potentially affecting the reliability of the popularity prediction model.

Authors in [233] presented a comprehensive approach to optimizing resource allocation and pricing in Mobile Edge Computing (MEC)-enabled blockchain systems using the A3C algorithm. The study's strengths lay in its innovative integration of blockchain with MEC to enhance resource management. The A3C algorithm's capability to handle both continuous and high-dimensional discrete action spaces made it well-suited for the dynamic nature of MEC environments. The use of prospect theory to balance risks and rewards based on miner preferences added a nuanced understanding of real-world scenarios. The results demonstrated that the proposed A3C-based algorithm outperformed baseline algorithms in terms of total reward and convergence speed, indicating its effectiveness in optimizing long-term performance. However, the paper had several limitations. The reliance on a specific MEC server configuration and a fixed number of mobile devices might have limited the generalizability of the findings to other settings with different configurations. The assumption that all validators were honest simplified the model but might not have reflected real-world blockchain environments where malicious actors could have existed. The additional complexity introduced by the collaborative local MEC task processing mode might have increased the computational overhead, potentially affecting the scalability of the proposed solution. Moreover, the paper did not address the potential impact of network latency and varying network conditions on the performance of the A3C algorithm, which could have been significant in practical deployments.

In [234], authors explored the application of A3C to create a cognitive agent designed to help Alzheimer's patients play chess. The primary goal was to enhance cognitive skills and boost brain activity through chess, a game known for its cognitive benefits. One of the key strengths of the paper was its innovative approach to leveraging A3C to assist individuals with cognitive disabilities in playing chess. The cognitive agent provided real-time assistance by suggesting offensive and defensive moves, thereby helping players improve their strategies and cognitive abilities. This approach addressed the gap in traditional AI chess opponents, which did not educate players on strategies and tactics. The cognitive agent relied on accurate feedback for consistency. The agent's effectiveness was heavily dependent on its ability to provide relevant and timely suggestions. The agent needed to accurately interpret the player's intentions and provide appropriate feedback in real-world applications. Users who had difficulty navigating digital interfaces might also have experienced accessibility issues due to the agent's reliance on digital interfaces. Another concern was scalability. A controlled environment showed promising results, but the system might not have been able to handle a broader range of cognitive impairments. Additional users and interactions could have introduced additional complexity, making it difficult to maintain performance.

Researchers in [235] introduced a novel approach to autonomous valet parking using a combination of PPO and A3C. This method aimed to address the control errors due to the non-linear dynamics of vehicles to optimize parking maneuvers. An important strength of this study was the use of the A3C-PPO algorithm. Combining the advantages of both PPO and A3C, this hybrid approach resulted in a more stable and efficient learning process. A3C's asynchronous nature allowed parallel training, which sped up convergence and improved state-action exploration. In addition, PPO prevented drastic changes from destabilizing the learning process by limiting the magnitude of policy updates. Incorporating manual hyperparameter tuning further optimized the training process, resulting in better rewards. One of the limitations was the reliance on specific assumptions about the environment and sensor accuracy. The effectiveness of the proposed method depended on the accurate detection and interpretation of the surroundings by sensors such as cameras and LiDAR. Any inaccuracies or deviations in sensor data could have impacted the performance and robustness of the algorithm. In real-world applications, environmental conditions such as lighting, weather, and obstacles could have varied significantly, and ensuring reliable sensor performance under these conditions was crucial. Additionally, the study did not address the potential impact of network latency and communication issues between the vehicle and the central control system, which could have affected the real-time decision-making process. Ensuring robust communication and minimizing latency was critical for the practical implementation of autonomous valet parking systems.

An advanced approach for optimizing content caching in 5G networks using A3C was proposed in [236]. This method aimed to minimize the total transmission cost by learning optimal caching and sharing policies among cooperative Base Stations (BSs) without prior knowledge of content popularity distribution. One of the
main strengths of this paper was the use of the A3C algorithm, which leveraged the asynchronous nature of multiple agents to achieve faster convergence and reduce time correlation in learning samples. The algorithm's ability to operate with multiple environment instances in parallel enhanced computational efficiency and significantly improved the learning process. By considering cooperative BSs that could have fetched content from neighboring BSs or the backbone network, the proposed method effectively reduced data traffic and transmission costs in 5G networks. The empirical results demonstrated the superiority of the A3C-based algorithm over classical caching policies such as Least Recently Used, Least Frequently Used, and Adaptive Replacement Cache, showcasing lower transmission costs and faster convergence rates. One limitation of this study was the reliance on accurate and timely updates of content popularity distributions. While the paper assumed that content popularity followed a Zipf distribution and varied over time, the accuracy of these assumptions could have significantly impacted the performance of the caching algorithm. In real-world applications, user preferences and content popularity could have changed unpredictably, and any deviations from the assumed distribution could have affected the effectiveness of the caching policy. Ensuring robust performance under varying content popularity distributions was crucial for the practical implementation of the proposed method.

Table XVII organizes the papers discussed in this section, offering a domain-specific breakdown of the research conducted in the A3C area.

TABLE XVII: A3C Papers Review
Application Domain | References
Cloud-based Control and Encryption Systems | [50], [233]
Games | [234]
Multi-agent Systems and Autonomous Behaviors | [235]
Network Optimization | [236]

b) Overview of A2C applications in the literature: In [237], authors introduced an innovative approach to low-latency task scheduling in edge computing environments, addressing several significant challenges inherent in such settings. The primary focus of the paper was on integrating a hard attention mechanism with the A2C algorithm to enhance task scheduling efficiency and reduce latency. The strengths of the HA-A2C method lay in its ability to significantly reduce task latency by approximately 40% compared to the DQN method. The hard attention mechanism employed by HA-A2C was particularly effective in reducing computational complexity and increasing efficiency, allowing the model to process tasks more quickly. Additionally, the method showcased improved scalability, maintaining low task latency even as the number of tasks increased. The use of the A2C algorithm, which combined policy gradient and value function estimates, enhanced the stability and effectiveness of the policy network, further contributing to the overall performance of the model. However, there were some limitations to the HA-A2C approach. One notable weakness was the potential complexity of implementing the hard attention mechanism in real-world scenarios, where the dynamic and heterogeneous nature of edge environments might have posed additional challenges. Furthermore, while the HA-A2C method outperformed other DRL methods in terms of latency, it might have still faced difficulties in scenarios with extremely high-dimensional states and action spaces, where further optimization might have been necessary. Another consideration was the reliance on accurate and timely data for effective attention allocation, which might not have always been feasible in practical applications.

A task scheduling mechanism in cloud-fog environments that leveraged the A2C algorithm was presented in [238]. This approach aimed to optimize the scheduling process for scalability, reliability, trust, and makespan efficiency. One of the significant strengths of the paper was the holistic approach it took toward task scheduling in heterogeneous cloud-fog environments. The use of the A2C algorithm was particularly effective in handling the dynamic nature of task scheduling, as it allowed the scheduler to make real-time decisions based on the current state of the system. By dynamically adjusting the number of virtual machines according to workload demands, the proposed scheduler ensured efficient resource utilization, which was crucial for maintaining system performance under varying conditions. One limitation was the reliance on specific system parameters and assumptions about the environment. The proposed method assumed accurate estimation of factors such as task priorities and VM capacities, which might not have always been feasible in practical applications. Deviations from these assumptions could have impacted the performance and robustness of the scheduling algorithm. Further research could have explored the robustness of the proposed approach under more relaxed assumptions and in diverse real-world scenarios.

Authors in [239] introduced an A2C-learning-based framework for optimizing beam selection and transmission power in mmWave networks. This approach aimed to improve energy efficiency while maintaining coverage in dynamic and complex network environments. A notable strength of the paper was its innovative application of the A2C algorithm for joint optimization of beam selection and transmission power. This dual optimization was particularly effective in addressing the significant challenge of energy consumption in mmWave networks.
By leveraging A2C, the proposed method dynamically adjusted beam selection and power levels based on the current state of the network, which was represented by the Signal-to-Interference-plus-Noise Ratio (SINR) values. The use of A2C ensured stable and efficient learning through policy gradients and value function approximations, making it suitable for real-time applications. One of the limitations was the assumption of specific system parameters and environmental conditions. The method assumed accurate estimation of SINR values and predefined beam angles, which might not have always been feasible in practical applications. Deviations from these assumptions could have impacted the performance and robustness of the optimization algorithm. Future research could have explored the robustness of the proposed method under more relaxed assumptions and in diverse real-world scenarios.

Authors explored the use of the A2C algorithm to estimate the power delay profile (PDP) in 5G New Radio environments in [32]. This approach aimed to enhance channel estimation performance by leveraging DRL techniques. A notable strength of this paper was its innovative application of the A2C algorithm to the problem of PDP estimation. By framing the estimation problem within an RL context, the proposed method directly targeted the minimization of Mean Square Error in channel estimation, rather than aiming to approximate an ideal PDP. This pragmatic approach allowed the algorithm to adapt to the inherent approximations and imperfections in practical channel estimation processes, leading to improved performance. However, the complexity of implementing the A2C algorithm in real-world scenarios posed challenges. The need for extensive training and parameter tuning required significant computational resources and expertise in RL, which might not have been readily available in all settings. Additionally, while the simulation results were promising, further validation in real-world deployments was necessary to fully assess the robustness and practicality of the proposed approach. Real-world environments could have introduced additional challenges, such as varying traffic patterns and hardware constraints, which were not fully captured in simulations.

The application of various Actor-Critic algorithms to develop a trading agent for the Indian stock market was investigated in [240]. The study evaluated the performance of PPO, DDPG, A2C, and Twin Delayed DDPG (TD3) algorithms in making trading decisions. One of the primary strengths of this paper was its comprehensive approach to evaluating multiple Actor-Critic algorithms in a real-world financial trading context. By considering different algorithms, the authors provided a broad perspective on their effectiveness in stock trading. The use of historical stock data from the Yahoo Finance API, covering a substantial period (2006-2021), ensured that the models were tested against diverse market conditions, enhancing the robustness and reliability of the results. The study's focus on a single market (the Indian stock market) might have limited the generalizability of the findings. Future research could have explored the application of these algorithms in different financial markets to ensure broader applicability.

The authors presented an innovative framework for optimizing task segmentation and parallel scheduling in edge computing networks using the A2C algorithm in [241]. The approach focused on minimizing total task execution delay by splitting multiple computing-intensive tasks into sub-tasks and scheduling them efficiently across different edge servers. A key strength of this paper was its holistic approach to task segmentation and scheduling. By jointly optimizing both processes, the proposed method ensured that tasks were not only divided efficiently but also assigned to the most suitable edge servers for processing. This joint optimization was crucial in dynamic edge computing environments, where both computation capacity and task requirements could have varied significantly over time. The use of A2C allowed the system to adapt to these changes in real-time, enhancing overall system performance. The authors' method of decoupling the complex mixed-integer non-convex problem into more manageable sub-problems was another strength. By first addressing the task segmentation problem and then tackling the parallel scheduling of the sub-tasks, the paper presented a structured and logical approach to solving the optimization challenge. The introduction of the optimal task split ratio function and its integration into the A2C algorithm further enhanced the efficiency and effectiveness of the proposed solution.

Researchers in [242] presented an innovative approach to enhancing multi-UAV obstacle avoidance using A2C combined with an experience-sharing mechanism. This method aimed to optimize obstacle avoidance strategies in complex, dynamic environments by sharing positive experiences among UAVs to expedite the training process. One of the key strengths of this paper was the introduction of the experience-sharing mechanism to the A2C algorithm. This mechanism significantly enhanced the efficiency and robustness of the training process by allowing UAVs to share positive experiences. This collective learning approach accelerated the convergence of the algorithm, enabling UAVs to quickly learn effective obstacle avoidance strategies. The experience-sharing mechanism was particularly valuable in multi-agent systems, where individual agents could have benefited from the knowledge gained by others, leading to faster and more robust learning. However, the experience-sharing mechanism relied on consistent
and reliable communication between UAVs. In practical applications, communication constraints and network reliability issues could have significantly impacted the effectiveness of this mechanism. Inter-UAV communication latency and packet loss could have led to outdated or incomplete information being shared, thereby reducing the overall efficiency of the learning process. Also, the method assumed a certain level of accuracy in modeling the environment and the dynamic obstacles within it. Any deviations from these assumptions, such as unexpected changes in obstacle behavior or environmental conditions, could have affected the performance and robustness of the algorithm. Real-world environments were inherently unpredictable, and the algorithm must have been tested extensively in diverse scenarios to ensure its reliability.

A robust approach to visual navigation using an Actor-Critic method enhanced with Generalized Advantage Estimation (GAE) was developed by authors in [243]. This method demonstrated significant strengths in terms of learning efficiency and stability, as well as effective navigation in complex environments like ViZDoom. One major strength of this approach was its ability to rapidly converge and achieve high performance in both basic and complex visual navigation tasks. By employing the A2C method with GAE, the algorithm reduced variance in policy gradient estimates, leading to more stable learning. This was particularly evident in the ViZDoom health gathering scenarios, where the A2C with GAE agent achieved the highest scores with lower variance compared to other methods. Additionally, the use of multiple processes in the A2C method significantly reduced training time, making it more efficient than traditional DQN approaches. However, the method also had notable limitations. One significant drawback was the high computational cost associated with using multiple processes for training, which might not have been feasible in resource-constrained environments. Furthermore, while the approach performed well in the tested ViZDoom scenarios, its generalizability to other, more diverse environments remained uncertain without further validation. The reliance on visual inputs also presented challenges in environments with varying lighting conditions or visual obstructions, which were not extensively tested in this study. Another limitation was the potential for over-fitting to specific task environments. The training setup in controlled ViZDoom scenarios might not have fully captured the complexities of real-world navigation tasks, where environmental dynamics were less predictable. Thus, while the A2C with GAE approach showed promise, its applicability to a broader range of visual navigation tasks would have benefited from additional research and testing in more varied and less controlled environments.

Table XVIII summarizes the discussed papers in this section. In the next subsection, the Deterministic Policy Gradient (DPG) algorithm, which addresses the challenges associated with continuous action spaces, is discussed.

TABLE XVIII: A2C Papers Review
Application Domain | References
Edge computing environments | [237], [241]
Network Optimization | [238], [32]
Cloud-based Control and Encryption Systems | [238]
Energy and Power Management (IoT Networks, Smart Energy Systems) | [239]
Financial Applications | [240]
Autonomous UAVs | [242]
Visual Navigation | [243]

B. Deterministic Policy Gradient (DPG)

DPG addresses the challenges associated with continuous action spaces and offers significant improvements in sample efficiency over stochastic policy gradient methods. Traditional policy gradient methods in RL use stochastic policies, where the policy πθ(a|s) is a probability distribution over actions given a state, parameterized by θ. These methods rely on sampling actions from this distribution to compute the policy gradient, which can be computationally expensive and sample inefficient, especially in high-dimensional action spaces [1], [244]. In contrast, the DPG algorithm uses a deterministic policy, denoted by µθ(s), which directly maps states to actions without involving any randomness. The policy gradient theorem for deterministic policies shows that the gradient of the expected return with respect to the policy parameters can be computed as [244]:

∇θ J(µθ) = E_{s∼ρ^µ}[ ∇θ µθ(s) ∇a Q^µ(s, a)|_{a=µθ(s)} ],   (43)

where Q^µ(s, a) is the action-value function under the deterministic policy µθ, and ρ^µ is the discounted state visitation distribution under µθ.

By employing an off-policy learning approach, DPG ensures adequate exploration while learning a deterministic target policy. To generate exploratory actions and gather experiences, a behavior policy, often a stochastic policy, is used. Gradients derived from these experiences are then used to update the deterministic policy. As the same experiences can be reused to improve the policy in this off-policy setting, the collected data can be utilized more efficiently than in an on-policy setting [245].
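A minimal sketch of the deterministic policy gradient in Eq. (43) is shown below. In an automatic-differentiation framework, maximizing the critic's value at the actor's own action implements the chain rule ∇θ µθ(s) ∇a Q^µ(s, a)|_{a=µθ(s)} without writing it out explicitly; the network sizes and the random off-policy batch are assumptions for illustration only.

```python
# Minimal sketch of a deterministic policy-gradient actor update, cf. Eq. (43).
import torch
import torch.nn as nn

obs_dim, act_dim = 6, 2
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

states = torch.randn(256, obs_dim)              # off-policy batch (e.g., from a behavior policy)
actions = actor(states)                          # a = mu_theta(s), deterministic
q_values = critic(torch.cat([states, actions], dim=-1))

actor_loss = -q_values.mean()                    # ascend Q(s, mu_theta(s)) w.r.t. actor parameters
actor_opt.zero_grad()
actor_loss.backward()                            # backprop through the critic into the actor;
actor_opt.step()                                 # only the actor's parameters are updated here
```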
The DPG algorithm is typically implemented within an Actor-Critic framework, where the actor represents the deterministic policy µθ and the critic estimates the action-value function Q^µ(s, a). The critic is trained using TD learning to minimize the Bellman error:

δt = rt + γ Q^µ(st+1, µθ(st+1)) − Q^µ(st, at),   (44)

where δt is the TD error, rt is the reward, γ is the discount factor, and st, at are the state and action at time step t. The actor updates the policy parameters in the direction suggested by the critic:

θ ← θ + α ∇θ µθ(st) ∇a Q^µ(st, at)|_{a=µθ(st)},   (45)

where α is the learning rate for the actor. A variant of DPG, which is designed to handle continuous action spaces with the help of DL, is analyzed in the next subsection in detail.

1) Deep Deterministic Policy Gradient (DDPG): The DDPG algorithm, introduced by [246], is an extension of the DPG method designed to handle continuous action spaces effectively. DDPG leverages the power of DL to address the challenges associated with high-dimensional continuous control tasks [246]. The foundation of DDPG lies in the DPG algorithm. This approach contrasts with stochastic policy gradients, which sample actions from a probability distribution. The deterministic nature of DPG reduces the variance of gradient estimates and improves sample efficiency, making it suitable for continuous action spaces. DDPG employs an Actor-Critic architecture, where the actor network represents the policy µ(s|θ^µ) and the critic network estimates the action-value function Q(s, a|θ^Q). The actor network outputs a specific action for a given state, while the critic network evaluates the action by estimating the expected return. The policy gradient is computed using the chain rule:

∇_{θ^µ} J ≈ E_{s∼ρ^β}[ ∇a Q(s, a|θ^Q)|_{a=µ(s|θ^µ)} ∇_{θ^µ} µ(s|θ^µ) ],   (46)

where ρ^β denotes the state distribution under a behavior policy β. To stabilize learning and address the challenges of training with large, non-linear function approximators, DDPG incorporates two key techniques from the DQN algorithm [8]:

1) Replay Buffer: A replay buffer stores transitions (st, at, rt, st+1) observed during training. By sampling mini-batches of transitions from this buffer, DDPG minimizes the correlations between consecutive samples, which stabilizes training and improves efficiency.

2) Target Networks: DDPG uses target networks for both the actor and critic, which are periodically updated with a soft update mechanism. The target networks provide stable targets for the Q-learning updates, reducing the likelihood of divergence:

θ^Q′ ← τ θ^Q + (1 − τ) θ^Q′,   (47)
θ^µ′ ← τ θ^µ + (1 − τ) θ^µ′,   (48)

where τ < 1 is the target update rate.

Exploration in continuous action spaces is crucial for effective learning. DDPG employs an exploration policy by adding noise to the actor's deterministic policy. An Ornstein-Uhlenbeck process [247] is typically used to generate temporally correlated noise, promoting exploration in environments with inertia.

The DDPG algorithm operates as follows: As shown in Alg. 26, first, the parameters of the actor network θ^µ and the critic network θ^Q are initialized [248] (lines 1-2). Target networks for both the actor and critic are also initialized. Then, multiple agents interact with their respective environments, collecting transitions (st, at, rt, st+1) which are stored in a replay buffer (lines 3-10). For each agent, the actor selects actions based on the current policy with added exploration noise. The critic network is updated using the Bellman equation to minimize the TD error (lines 11-13). The actor network is updated using the policy gradient derived from the critic (line 14). Periodically, the target networks are updated to slowly track the learned networks (line 15). Over the next paragraphs, we will analyze some of the papers in the literature that used DDPG.

Authors in [249] presented an innovative method for developing a missile lateral acceleration control system using the DDPG algorithm. This study reframed the autopilot control problem within the RL context, utilizing a 2-degrees-of-freedom nonlinear model of the missile's longitudinal dynamics for training. One strength was the incorporation of performance metrics such as settling time, undershoot, and steady-state error into the reward function. By integrating these key performance indicators, the authors ensured that the trained agent not only learned to control the missile effectively but also adhered to desirable performance standards. This approach enhanced the practical applicability of the method, ensuring that the control system met operational requirements. The method's scalability to more complex scenarios and larger-scale implementations was another concern. The increased number of states and potential interactions in a real-world missile control system could have introduced additional complexities, making it challenging to maintain the same level of performance. Further research was needed to explore the scalability of the approach and develop mechanisms to manage the increased computational load.
could effectively process and integrate dynamic environmental information, enhancing the robot's adaptability to unpredictable scenarios. The reliance on accurate environmental sensing and real-time data processing was a potential limitation. The performance of the LSTM-based encoder and the overall DDPG framework heavily depended on the quality and accuracy of the sensor data. In real-world applications, factors such as sensor noise, varying environmental conditions, and communication delays could have affected the reliability and robustness of the system. Ensuring robust performance under diverse and unpredictable conditions remained a critical challenge.

DDPG combined with prioritized sampling to optimize power control in wireless communication systems, specifically targeting Multiple Sweep Interference (MSI) scenarios, was designed in [253]. Prioritized sampling was another innovative aspect of the paper. By focusing on more valuable experiences during training, the algorithm accelerated the learning process and improved convergence speed. The empirical results showed that the DDPG scheme with prioritized sampling (DDPG-PS) outperformed both the traditional DDPG scheme with uniform sampling and the DQN scheme. This was evident in various MSI scenarios, where the DDPG-PS scheme achieved better reward performance and stability. The scalability of the proposed method to more complex scenarios and larger-scale implementations was another concern. While the results were promising in simulated environments, the ability to handle a broader range of interference patterns and larger numbers of channels remained uncertain. The increased number of states and potential interactions could have introduced additional complexities, making it challenging to maintain the same level of performance. Further research was needed to explore the scalability of the approach and develop mechanisms to manage the increased computational load.

Authors in [254] employed DDPG to model the bidding strategies of generation companies in electricity markets. This approach was aimed at overcoming the limitations of traditional game-theoretic methods and conventional RL algorithms, particularly in environments characterized by incomplete information and high-dimensional continuous state/action spaces. One significant strength was the ability of the proposed method to converge to the Nash Equilibrium even in an incomplete information environment. Traditional game-theoretic methods often required complete information and were limited to static games. In contrast, the DDPG-based approach could dynamically simulate repeated games and achieve stable convergence, demonstrating its robustness in modeling real-world market conditions. One limitation was the reliance on accurate modeling of market conditions and real-time data processing. The effectiveness of the proposed method depended heavily on the precision of the input data, such as nodal prices and load demands. In real-world applications, factors such as data inaccuracies, communication delays, and varying environmental conditions could have impacted the reliability and robustness of the system. Ensuring robust performance under diverse and unpredictable conditions remained a critical challenge that needed to be addressed. Additionally, the study assumed a specific structure for the neural networks used in the actor and critic models. The performance of the algorithm could have been sensitive to the choice of network architecture and hyperparameters. A more systematic exploration of different architectures and their impact on performance could have provided deeper insights into optimizing the DDPG algorithm for electricity market modeling.

An advanced approach for resource allocation in vehicular communications using a multi-agent DDPG algorithm was studied in [255]. This method was designed to handle the dynamic and high-mobility nature of vehicular environments, specifically targeting the optimization of the sum rate of Vehicle-to-Infrastructure (V2I) communications while ensuring the latency and reliability of Vehicle-to-Vehicle (V2V) communications. One of the significant strengths was the formulation of the resource allocation problem as a decentralized Discrete-time and Finite-state MDP. This approach allowed each V2V communication to act as an independent agent, making decisions based on local observations without requiring global network information. This decentralization was crucial for scalability and real-time adaptability in high-mobility vehicular environments. One potential limitation was the reliance on accurate and timely acquisition of CSI. In high-mobility vehicular environments, obtaining precise CSI could have been challenging due to fast-varying channel conditions. Any inaccuracies in CSI could have impacted the performance and robustness of the proposed resource allocation scheme. Ensuring robust performance under diverse and unpredictable conditions remained a critical challenge. The last algorithm to analyze in the category of Actor-Critic methods is TD3, an enhancement of the DDPG algorithm designed to address the issue of overestimation bias. This algorithm is analyzed in detail in the next subsection.

Table XIX provides a summary of the analyzed papers with respect to their domain.

2) Twin Delayed Deep Deterministic Policy Gradient (TD3): TD3 is an enhancement of the DDPG algorithm, designed to address the issue of overestimation bias in function approximation within Actor-Critic methods. Introduced by [256], TD3 incorporates several innovative techniques to improve the stability and performance of continuous control tasks in RL.
scenarios. However, there were notable limitations. The implementation of TD3, while improving stability, introduced complexity in the training process, requiring careful tuning of hyperparameters to achieve optimal performance. The algorithm's reliance on extensive computational resources for training might have limited its practical applicability in scenarios with constrained resources. Additionally, the paper focused primarily on simulation results without providing sufficient real-world testing to validate the algorithm's performance under actual driving conditions. This gap raised questions about the robustness of the proposed method when deployed in a real-world environment.

The application of the TD3 algorithm to the target tracking of UAVs was proposed in [259]. The authors integrated several enhancements into the TD3 framework to improve its performance in handling the high nonlinearity and dynamics of UAV control. A significant strength was the novel reward formulation that incorporated exponential functions to limit the effects of velocity and acceleration on the policy function approximation. This approach prevented deformation in the policy function, leading to more stable and robust learning outcomes. Additionally, the concept of multi-stage training, where the training process was divided into stages focusing on position, velocity, and acceleration sequentially, enhanced the learning efficiency and performance of the UAV in tracking tasks. However, the proposed method also had several limitations. The integration of a PD controller and the novel reward formulation added to the complexity of the training process. The scalability of the proposed method to more complex environments with a higher number of dynamic obstacles or more sophisticated UAV maneuvers was another concern. While the results were promising in the tested scenarios, the ability to handle a broader range of operational conditions and larger numbers of UAVs remained uncertain. The increased number of states and potential interactions could have introduced additional complexities, making it challenging to maintain the same level of performance.

A novel dynamic MsgA channel allocation strategy using TD3 to mitigate the issue of MsgA channel collisions in Low Earth Orbit (LEO) satellite communication systems was investigated in [260]. The paper's approach of dynamically pre-configuring the mapping relationship between PRACH occasions and PUSCH occasions based on historical access information was a strong point. This method allowed the system to adapt to changing access demands effectively, ensuring efficient use of available resources and reducing collision rates. The empirical results were impressive, demonstrating a 39.12% increase in access success probability, which validated the effectiveness of the proposed strategy. The scalability of the proposed method to larger and more complex satellite networks was one of the concerns. While the results were promising in the tested scenarios, the ability to handle a broader range of interference patterns and a larger number of users remained uncertain. The increased number of states and potential interactions could have introduced additional complexities, making it challenging to maintain the same level of performance. Further research was needed to explore the scalability of the approach and develop mechanisms to manage the increased computational load.

A TD3-based method for optimizing Voltage and Reactive (VAR) power in distribution networks with high penetration of Distributed Energy Resources (DERs), such as battery energy storage and solar photovoltaic units, was investigated in [261]. The authors' approach of coordinating the reactive power outputs of fast-responding smart inverters and the active power of battery ESS enhanced the overall efficiency of the network. By carefully designing the reward function to ensure a proper voltage profile and effective scheduling of reactive power outputs, the method optimized both voltage regulation and power loss minimization. The results demonstrated that the TD3-based method outperformed traditional methods such as local droop control and DDPG-based approaches, showing significant improvements in reducing voltage fluctuations and minimizing power loss in the IEEE 34- and 123-bus test systems. The scalability of the proposed method to larger and more complex distribution networks was another concern. While the results were promising in the tested IEEE 34- and 123-bus systems, the ability to handle a broader range of network configurations and a larger number of DERs remained uncertain. The increased number of states and potential interactions could have introduced additional complexities, making it challenging to maintain the same level of performance. Further research was needed to explore the scalability of the approach and develop mechanisms to manage the increased computational load.

Authors in [262] presented an innovative approach to quadrotor control, leveraging the TD3 algorithm to address stabilization and position tracking tasks. This study was notable for its application to handle the complex, non-linear dynamics of quadrotor systems. The authors' method of integrating target policy smoothing, twin critic networks, and delayed updates of value networks enhanced the learning efficiency and reduced variance in the policy updates. This comprehensive approach ensured that the quadrotor could achieve precise control in both stabilization and position tracking tasks. The empirical results demonstrated the effectiveness of the TD3-based controllers, showcasing significant improvements in achieving and maintaining target positions
under various initial conditions. The scalability of the proposed method to more complex environments with dynamic obstacles and more sophisticated maneuvers was another concern. While the results were promising in the tested scenarios, the ability to handle a broader range of operational conditions and larger-scale implementations remained uncertain. The increased number of states and potential interactions could have introduced additional complexities, making it challenging to maintain the same level of performance. Further research was needed to explore the scalability of the approach and develop mechanisms to manage the increased computational load.

A real-time charging navigation method for multi-autonomous underwater vehicle (AUV) systems using the TD3 algorithm was designed in [263]. The method aimed to improve the efficiency of navigating AUVs to their respective charging stations by training a trajectory planning model in advance, eliminating the need for recalculating navigation paths for different initial positions and avoiding dependence on sensor feedback or pre-arranged landmarks. The primary strength of this paper lay in its application of the TD3 algorithm to the multi-AUV charging navigation problem. By training the trajectory planning model in advance, the method significantly improved the real-time performance of multi-AUV navigation. However, the paper also had some limitations. One major limitation was the reliance on the accuracy of the AUV motion model and the assumptions made during its formulation. For instance, the model assumed constant velocity and neglected factors like water resistance and system delays, which could have affected the real-world applicability of the results. Moreover, the simulation environment used for training and testing might not have fully captured the complexities and variabilities of real underwater environments. Another potential limitation was the need for extensive computational resources for training the TD3 model, especially given the high number of training rounds (up to 6000) and the large experience replay buffer size.

Authors in [264] presented an advanced approach for Adaptive Cruise Control (ACC) using the TD3 algorithm. This method addressed the complexities of real-time decision-making and control in automotive applications. The authors carefully designed the reward function to consider the velocity error, control input, and additional terms to ensure stability and smooth driving behavior. This reward structure allowed the algorithm to learn an optimal policy that maintained safe distances between vehicles while adapting to changing traffic conditions. The empirical results demonstrated the effectiveness of the TD3-based ACC system in both normal and disturbance scenarios, highlighting its robustness and adaptability. The scalability of the proposed method to more complex driving scenarios and larger-scale implementations was another concern. While the results were promising in the tested scenarios, the ability to handle a broader range of operational conditions and larger numbers of vehicles remained uncertain. The increased number of states and potential interactions could have introduced additional complexities, making it challenging to maintain the same level of performance. Further research was needed to explore the scalability of the approach and develop mechanisms to manage the increased computational load. Additionally, the study assumed a specific structure for the neural networks used in the actor and critic models. The performance of the algorithm could have been sensitive to the choice of network architecture and hyperparameters. A more systematic exploration of different architectures and their impact on performance could have provided deeper insights into optimizing the TD3 algorithm for adaptive cruise control. Table XX gives a detailed summary of the discussed papers and the domain of each paper.

TABLE XX: TD3 Papers Review

Application Domain                                 References
Energy and Power Management                        [258], [261]
Multi-agent Systems and Autonomous UAVs            [259], [263]
Network Resilience and Optimization                [260]
Real-time Systems and Hardware Implementations     [262], [264]

VI. DISCUSSION

Throughout this survey, we examined various algorithms in RL and their applications in a variety of domains, including but not limited to Robotics, ITS, Games, Wireless Networks, and many more. There is, however, more to discover, both in terms of the number of analyzed papers and in terms of the different algorithms. There are several algorithms and methods that were not analyzed in this survey for a variety of reasons. To begin with, the considered algorithms are those that have been applied to a variety of domains and are more widely used by researchers. In addition, time, resources, and page limitations render it impossible to analyze all the algorithms and methods in one paper. Thirdly, understanding these algorithms enables one to understand the different variations being introduced by the community on a regular basis. The purpose of this survey is not to identify which algorithm is better than the others; as we know, there is no one-size-fits-all solution in RL, so one cannot state "for problem X, algorithm Y performs better than other algorithms", as this requires the implementation of new
algorithms for the same problem and the reproduction of the original work. Also, results achieved with RL, and specifically DRL, may vary as different extrinsic and intrinsic factors change, as stated in [265], making it tough to compare and analyze.

Lastly, we strongly recommend reading the chapters listed in this paper one by one, reading the introductions to the algorithms, and, if necessary, consolidating the knowledge of each algorithm by reviewing the reference papers. As a result, you will be able to read the analysis of the papers that have used that particular algorithm. The provided tables are valuable to readers who are not interested in reading the entire article. By providing various tables at the end of the survey, we summarized helpful information gathered throughout the survey.

We tried our best to shed light on RL, in terms of theory and applications, to give a thorough understanding of various broad categories of algorithms. This survey is a helpful resource for readers who would like to expand their knowledge of RL (theory), as well as readers who desire to take a look at the applications of these algorithms in the literature.

In the final part of our survey, we present a comprehensive table that highlights the application of RL algorithms across various domains in Table XXI. Given the wide-ranging impact of RL in numerous fields, we have categorized these domains into broader categories to provide a more organized and concise overview. This categorization allows us to succinctly illustrate the relevance and utilization of specific RL algorithms in different research areas while effectively managing the limited space available.

The Energy Efficiency and Power Management category encompasses research areas such as train control, IoTs, WBAN, PID Controllers, and Smart Energy Systems, all of which focus on optimizing energy usage and improving power management. Cloud-based Systems includes works focused on cloud-based control and encryption systems, as well as edge computing environments, reflecting the growing importance of RL in managing and optimizing cloud resources.

The Optimization category captures studies that leverage RL for solving complex optimization problems across various applications. Multi-agent Systems is a category that emphasizes RL's role in enabling autonomous behaviors, covering research involving Shepherding, Virtual Agents, and other Multi-agent Systems.

Algorithmic RL covers advanced RL methodologies and hybrid approaches, including Renewal Theory, Rough Set Theory, Bayesian RL, and MPC tuning.
The General RL category encompasses broad RL applications, including policy learning, cybersecurity, and learning from raw experience. Robotics research, focusing on the application of RL in robotics, includes trajectory control, learning, routing, and more.

In the Financial Applications category, studies on portfolio re-balancing and other financial strategies using RL are included. The Games category features research on game strategies in Chess, StarCraft, Video Games, and Card Games, illustrating RL's success in complex strategic environments. Signal Processing research, which uses RL for signal processing and parameter estimation, is grouped under its own category.

The Networks category covers studies focused on network reliability, blocking probabilities, Optical Transport Networks, Fog RAN, and Network Resilience and Optimization. ITS includes RL applications in Railway Systems, EVs, Intelligent Traffic Signal Control, UAVs, and other transportation-related technologies.

The Theoretical Research category includes studies focused on the theoretical aspects of RL, such as convergence and stability. Dynamic Environments research involves RL in environments like mazes, the Mountain Car problem, and Atari games. Partially Observable Environments includes studies on predictions, POMDPs, and Swarm Intelligence in optimization problems.

Research on applying RL in FPGA, Real-time Systems, and other hardware implementations is grouped under Real-time Systems and Hardware Implementations. Studies using benchmark tasks like Mountain Car and Acrobot to test RL algorithms are included in the Benchmark Tasks category. Data Management and Processing involves research applying RL in data management and processing environments, such as Hadoop and Pathological Image Analysis. Finally, the Object Recognition category encompasses studies focusing on using RL for object recognition tasks.

Table XXI serves as a quick reference for researchers to identify relevant work in their specific area of interest, showcasing the diversity and adaptability of RL algorithms across various domains. The categorization helps streamline the information, making it easier to navigate and understand the various applications of RL.

VII. CONCLUSION

In this survey, we presented a comprehensive analysis of Reinforcement Learning (RL) algorithms, categorizing them into Value-based, Policy-based, and Actor-Critic methods. By reviewing numerous research papers, we highlighted the strengths, weaknesses, and applications of each algorithm, offering valuable insights into various domains. From classical approaches such as Q-learning to advanced Deep RL (DRL), along with algorithmic variations tailored to specific domains, the paper provided a comprehensive overview. Besides classifying RL algorithms according to Model-free/Model-based approaches, scalability, and sample efficiency, it also provided a practical guide for researchers and practitioners about the type(s) of algorithms used in various domains. Furthermore, this survey examined the practical implementation and performance of RL algorithms across several fields, including Games, Robotics, Autonomous systems, and many more. It also provided a balanced assessment of their usefulness.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 2018.
[2] R. Bellman, "The theory of dynamic programming," Bulletin of the American Mathematical Society, vol. 60, no. 6, pp. 503–515, 1954.
[3] M. Ghasemi, A. H. Moosavi, I. Sorkhoh, A. Agrawal, F. Alzhouri, and D. Ebrahimi, "An introduction to reinforcement learning: Fundamental concepts and practical applications," arXiv preprint arXiv:2408.07712, 2024.
[4] R. S. Sutton, "Learning to predict by the methods of temporal differences," Machine Learning, vol. 3, pp. 9–44, 1988.
[5] C. J. C. H. Watkins, "Learning from delayed rewards," PhD thesis, King's College, Cambridge, 1989.
[6] L. P. Kaelbling, M. L. Littman, and A. W. Moore, "Reinforcement learning: A survey," Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[7] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, no. 2, pp. 235–256, 2002.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of go with deep neural networks and tree search," Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[10] K. Souchleris, G. K. Sidiropoulos, and G. A. Papakostas, "Reinforcement learning in game industry—review, prospects and challenges," Applied Sciences, vol. 13, no. 4, p. 2443, 2023.
[11] S. Koyamada, S. Okano, S. Nishimori, Y. Murata, K. Habara, H. Kita, and S. Ishii, "Pgx: Hardware-accelerated parallel game simulators for reinforcement learning," Advances in Neural Information Processing Systems, vol. 36, 2024.
[12] Z. Xu, C. Yu, F. Fang, Y. Wang, and Y. Wu, "Language agents with reinforcement learning for strategic play in the werewolf game," arXiv preprint arXiv:2310.18940, 2023.
[13] X. Qu, W. Gan, D. Song, and L. Zhou, "Pursuit-evasion game strategy of usv based on deep reinforcement learning in complex multi-obstacle environment," Ocean Engineering, vol. 273, p. 114016, 2023.
[14] K. Rana, M. Xu, B. Tidd, M. Milford, and N. Sünderhauf, "Residual skill policies: Learning an adaptable skill-based action space for reinforcement learning for robotics," in Conference on Robot Learning. PMLR, 2023, pp. 2095–2104.
[15] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[16] S. Balasubramanian, "Intrinsically motivated multi-goal reinforcement learning using robotics environment integrated with openai gym," Journal of Science & Technology, vol. 4, no. 5, pp. 46–60, 2023.
[17] S. W. Abeyruwan, L. Graesser, D. B. D’Ambrosio, A. Singh, Technology Conference:(VTC2022-Spring). IEEE, 2022, pp. 1–
A. Shankar, A. Bewley, D. Jain, K. M. Choromanski, and P. R. 6.
Sanketi, “i-sim2real: Reinforcement learning of robotic policies [33] K. De Asis and R. S. Sutton, “Per-decision multi-step tem-
in tight human-robot interaction loops,” in Conference on Robot poral difference learning with control variates,” arXiv preprint
Learning. PMLR, 2023, pp. 212–224. arXiv:1807.01830, 2018.
[18] R. Zhu, L. Li, S. Wu, P. Lv, Y. Li, and M. Xu, “Multi-agent [34] X. Li, C. Yang, J. Song, S. Feng, W. Li, and H. He, “A motion
broad reinforcement learning for intelligent traffic light control,” control method for agent based on dyna-q algorithm,” in 2023
Information Sciences, vol. 619, pp. 509–525, 2023. 4th International Conference on Computer Engineering and
[19] M. Yazdani, M. Sarvi, S. A. Bagloee, N. Nassir, J. Price, Application (ICCEA). IEEE, 2023, pp. 274–278.
and H. Parineh, “Intelligent vehicle pedestrian light (ivpl): A [35] R. S. Sutton and B. Tanner, “Temporal-difference networks,”
deep reinforcement learning approach for traffic signal control,” Advances in neural information processing systems, vol. 17,
Transportation research part C: emerging technologies, vol. 2004.
149, p. 103991, 2023. [36] J. Zuters, “Realizing undelayed n-step td prediction with neural
[20] Y. Liu, L. Huo, J. Wu, and A. K. Bashir, “Swarm learning- networks,” in Melecon 2010-2010 15th IEEE Mediterranean
based dynamic optimal management for traffic congestion in Electrotechnical Conference. IEEE, 2010, pp. 102–106.
6g-driven intelligent transportation system,” IEEE Transactions [37] T. M. Moerland, J. Broekens, A. Plaat, and C. M. Jonker,
on Intelligent Transportation Systems, vol. 24, no. 7, pp. 7831– “Monte carlo tree search for asymmetric trees,” arXiv preprint
7846, 2023. arXiv:1805.09218, 2018.
[21] D. Chen, M. R. Hajidavalloo, Z. Li, K. Chen, Y. Wang, L. Jiang, [38] M. A. Nasreen and S. Ravindran, “An overview of q-learning
and Y. Wang, “Deep multi-agent reinforcement learning for based energy efficient power allocation in wban (q-eepa),” in
highway on-ramp merging in mixed traffic,” IEEE Transactions 2022 2nd International Conference on Innovative Sustainable
on Intelligent Transportation Systems, vol. 24, no. 11, pp. Computational Technologies (CISCT). IEEE, 2022, pp. 1–5.
11 623–11 638, 2023. [39] G. C. Lopes, M. Ferreira, A. da Silva Simões, and E. L. Colom-
[22] P. Ghosh, T. E. A. de Oliveira, F. Alzhouri, and D. Ebrahimi, bini, “Intelligent control of a quadrotor with proximal policy
“Maximizing group-based vehicle communications and fairness: optimization reinforcement learning,” in 2018 Latin American
A reinforcement learning approach,” in 2024 IEEE Wireless Robotic Symposium, 2018 Brazilian Symposium on Robotics
Communications and Networking Conference (WCNC). IEEE, (SBR) and 2018 Workshop on Robotics in Education (WRE).
2024, pp. 1–7. IEEE, 2018, pp. 503–508.
[23] V. T. Aghaei, A. Ağababaoğlu, S. Yıldırım, and A. Onat, “A [40] N. Darapaneni, A. Basu, S. Savla, R. Gururajan, N. Saquib,
real-world application of markov chain monte carlo method S. Singhavi, A. Kale, P. Bid, and A. R. Paduri, “Automated port-
for bayesian trajectory control of a robotic manipulator,” ISA folio rebalancing using q-learning,” in 2020 11th IEEE Annual
transactions, vol. 125, pp. 580–590, 2022. Ubiquitous Computing, Electronics & Mobile Communication
[24] Y. Yu, Y. Liu, J. Wang, N. Noguchi, and Y. He, “Obstacle Conference (UEMCON). IEEE, 2020, pp. 0596–0602.
avoidance method based on double dqn for agricultural robots,” [41] R. M. Desai and B. Patil, “Prioritized sweeping reinforcement
Computers and Electronics in Agriculture, vol. 204, p. 107546, learning based routing for manets,” Indonesian Journal of
2023. Electrical Engineering and Computer Science, vol. 5, no. 2,
[25] M. de Koning, W. Cai, B. Sadigh, T. Oppelstrup, M. H. Kalos, pp. 383–390, 2017.
and V. V. Bulatov, “Adaptive importance sampling monte carlo [42] J. Santoso et al., “Multiagent simulation on hide and seek
simulation of rare transition events,” The Journal of chemical games using policy gradient trust region policy optimization,”
physics, vol. 122, no. 7, 2005. in 2020 7th International Conference on Advance Informatics:
[26] R. Li, W. Gong, L. Wang, C. Lu, Z. Pan, and X. Zhuang, “Dou- Concepts, Theory and Applications (ICAICTA). IEEE, 2020,
ble dqn-based coevolution for green distributed heterogeneous pp. 1–5.
hybrid flowshop scheduling with multiple priorities of jobs,” [43] S. Saha and S. M. Kay, “Maximum likelihood parameter esti-
IEEE Transactions on Automation Science and Engineering, mation of superimposed chirps using monte carlo importance
2023. sampling,” IEEE Transactions on Signal Processing, vol. 50,
[27] W. Liu, T. Tang, S. Su, Y. Cao, F. Bao, and J. Gao, “An intelli- no. 2, pp. 224–230, 2002.
gent train control approach based on the monte carlo reinforce- [44] L. Wu and K. Chen, “Bias resilient multi-step off-
ment learning algorithm,” in 2018 21st International Conference policy goal-conditioned reinforcement learning,” arXiv preprint
on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. arXiv:2311.17565, 2023.
1944–1949. [45] M. Yao, X. Gao, J. Wang, and M. Wang, “Improving nuclei
[28] A. Jaiswal, S. Kumar, and U. Dohare, “Green computing in segmentation in pathological image via reinforcement learning,”
heterogeneous internet of things: Optimizing energy allocation in 2022 International Conference on Machine Learning, Cloud
using sarsa-based reinforcement learning,” in 2020 IEEE 17th Computing and Intelligent Mining (MLCCIM). IEEE, 2022,
India Council International Conference (INDICON). IEEE, pp. 290–295.
2020, pp. 1–6. [46] L. Zhang, Y. Zhang, X. Zhao, and Z. Zou, “Image captioning
[29] Z. Xuan, G. Wei, and Z. Ni, “Power allocation in multi-agent via proximal policy optimization,” Image and Vision Computing,
networks via dueling dqn approach,” in 2021 IEEE 6th Inter- vol. 108, p. 104126, 2021.
national Conference on Signal and Image Processing (ICSIP). [47] J. Suh and T. Tanaka, “Sarsa (0) reinforcement learning over
IEEE, 2021, pp. 959–963. fully homomorphic encryption,” in 2021 SICE International
[30] P. Lassila, J. Karvo, and J. Virtamo, “Efficient importance Symposium on Control Systems (SICE ISCS). IEEE, 2021,
sampling for monte carlo simulation of multicast networks,” pp. 1–7.
in Proceedings IEEE INFOCOM 2001. Conference on Com- [48] C. K. Go, B. Lao, J. Yoshimoto, and K. Ikeda, “A reinforcement
puter Communications. Twentieth Annual Joint Conference of learning approach to the shepherding task using sarsa,” in 2016
the IEEE Computer and Communications Society (Cat. No. International Joint Conference on Neural Networks (IJCNN).
01CH37213), vol. 1. IEEE, 2001, pp. 432–439. IEEE, 2016, pp. 3833–3836.
[31] E. Oh and H. Wang, “Reinforcement-learning-based energy [49] N. Zerbel and L. Yliniemi, “Multiagent monte carlo tree search,”
storage system operation strategies to manage wind power in Proceedings of the 18th International Conference on Au-
forecast uncertainty,” IEEE Access, vol. 8, pp. 20 965–20 976, tonomous Agents and MultiAgent Systems, 2019, pp. 2309–
2020. 2311.
[32] H. Kwon, “Learning-based power delay profile estimation for 5g [50] S. Mangalampalli, G. R. Karri, S. N. Mohanty, S. Ali, M. I.
nr via advantage actor-critic (a2c),” in 2022 IEEE 95th Vehicular Khan, S. Abdullaev, and S. A. AlQahtani, “Multi-objective pri-
oritized task scheduler using improved asynchronous advantage [73] R. S. Sutton, “Integrated architectures for learning, planning,
actor critic (a3c) algorithm in multi cloud environment,” IEEE and reacting based on approximating dynamic programming,”
Access, 2024. in Machine learning proceedings 1990. Elsevier, 1990, pp.
[51] Y. Li, “Deep reinforcement learning: An overview,” arXiv 216–224.
preprint arXiv:1701.07274, 2017. [74] S. Ishii, W. Yoshida, and J. Yoshimoto, “Control of exploitation–
[52] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. exploration meta-parameter in reinforcement learning,” Neural
Bharath, “Deep reinforcement learning: A brief survey,” IEEE networks, vol. 15, no. 4-6, pp. 665–687, 2002.
Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017. [75] L. Schäfer, F. Christianos, J. Hanna, and S. V. Albrecht, “Decou-
[53] X. Wang, S. Wang, X. Liang, D. Zhao, J. Huang, X. Xu, pling exploration and exploitation in reinforcement learning,” in
B. Dai, and Q. Miao, “Deep reinforcement learning: A survey,” ICML 2021 Workshop on Unsupervised Reinforcement Learn-
IEEE Transactions on Neural Networks and Learning Systems, ing, 2021.
vol. 35, no. 4, pp. 5064–5078, 2022. [76] H. Wang, T. Zariphopoulou, and X. Zhou, “Exploration versus
[54] H.-n. Wang, N. Liu, Y.-y. Zhang, D.-w. Feng, F. Huang, D.-s. exploitation in reinforcement learning: A stochastic control
Li, and Y.-m. Zhang, “Deep reinforcement learning: a survey,” approach,” arXiv preprint arXiv:1812.01552, 2018.
Frontiers of Information Technology & Electronic Engineering, [77] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, “Value
vol. 21, no. 12, pp. 1726–1744, 2020. iteration networks,” Advances in neural information processing
[55] T. M. Moerland, J. Broekens, A. Plaat, C. M. Jonker et al., systems, vol. 29, 2016.
“Model-based reinforcement learning: A survey,” Foundations [78] R. Fonteneau, S. Murphy, L. Wehenkel, and D. Ernst, “Model-
and Trends® in Machine Learning, vol. 16, no. 1, pp. 1–118, free monte carlo-like policy evaluation,” in Proceedings of the
2023. Thirteenth International Conference on Artificial Intelligence
[56] A. S. Polydoros and L. Nalpantidis, “Survey of model-based and Statistics. JMLR Workshop and Conference Proceedings,
reinforcement learning: Applications on robotics,” Journal of 2010, pp. 217–224.
Intelligent & Robotic Systems, vol. 86, no. 2, pp. 153–173, 2017. [79] S. T. Tokdar and R. E. Kass, “Importance sampling: a re-
[57] F.-M. Luo, T. Xu, H. Lai, X.-H. Chen, W. Zhang, and Y. Yu, “A view,” Wiley Interdisciplinary Reviews: Computational Statis-
survey on model-based reinforcement learning,” Science China tics, vol. 2, no. 1, pp. 54–60, 2010.
Information Sciences, vol. 67, no. 2, p. 121101, 2024. [80] J. Subramanian and A. Mahajan, “Renewal monte carlo: Re-
[58] Y. Sato, “Model-free reinforcement learning for financial port- newal theory-based reinforcement learning,” IEEE Transactions
folios: a brief survey,” arXiv preprint arXiv:1904.04973, 2019. on Automatic Control, vol. 65, no. 8, pp. 3663–3670, 2019.
[59] J. Ramı́rez, W. Yu, and A. Perrusquı́a, “Model-free reinforce- [81] J. F. Peters, D. Lockery, and S. Ramanna, “Monte carlo
ment learning from expert demonstrations: a survey,” Artificial off-policy reinforcement learning: A rough set approach,” in
Intelligence Review, vol. 55, no. 4, pp. 3213–3241, 2022. Fifth International Conference on Hybrid Intelligent Systems
[60] J. Eschmann, “Reward function design in reinforcement learn- (HIS’05). IEEE, 2005, pp. 6–pp.
ing,” Reinforcement Learning Algorithms: Analysis and Appli- [82] Y. Wang, K. S. Won, D. Hsu, and W. S. Lee, “Monte
cations, pp. 25–33, 2021. carlo bayesian reinforcement learning,” arXiv preprint
[61] R. S. Sutton, A. G. Barto et al., “Reinforcement learning,” arXiv:1206.6449, 2012.
Journal of Cognitive Neuroscience, vol. 11, no. 1, pp. 126–134, [83] B. Wu and Y. Feng, “Monte-carlo bayesian reinforcement
1999. learning using a compact factored representation,” in 2017 4th
[62] M. L. Puterman, “Chapter 8 markov decision International Conference on Information Science and Control
processes,” in Stochastic Models, ser. Handbooks Engineering (ICISCE). IEEE, 2017, pp. 466–469.
in Operations Research and Management Science. [84] O. Baykal and F. N. Alpaslan, “Reinforcement learning in card
Elsevier, 1990, vol. 2, pp. 331–434. [Online]. Available: game environments using monte carlo methods and artificial
https://www.sciencedirect.com/science/article/pii/S0927050705801720 neural networks,” in 2019 4th International Conference on
[63] E. Zanini, “Markov decision processes,” 2014. Computer Science and Engineering (UBMK). IEEE, 2019,
[64] Z. Wei, J. Xu, Y. Lan, J. Guo, and X. Cheng, “Reinforcement pp. 1–6.
learning to rank with markov decision process,” in Proceedings [85] D. Siegmund, “Importance sampling in the monte carlo study of
of the 40th international ACM SIGIR conference on research sequential tests,” The Annals of Statistics, pp. 673–684, 1976.
and development in information retrieval, 2017, pp. 945–948. [86] S. Bulteau and M. El Khadiri, “A new importance sampling
[65] D. J. Foster and A. Rakhlin, “Foundations of reinforce- monte carlo method for a flow network reliability problem,”
ment learning and interactive decision making,” arXiv preprint Naval Research Logistics (NRL), vol. 49, no. 2, pp. 204–228,
arXiv:2312.16730, 2023. 2002.
[66] G. A. Rummery and M. Niranjan, “On-line q-learning using [87] C. Wang, S. Yuan, K. Shao, and K. Ross, “On the convergence
connectionist systems,” University of Cambridge, Department of the monte carlo exploring starts algorithm for reinforcement
of Engineering Cambridge, Tech. Rep., 1994. learning,” arXiv preprint arXiv:2002.03585, 2020.
[67] M. A. Wiering and M. Van Otterlo, “Reinforcement learning,” [88] J. F. Peters and C. Henry, “Approximation spaces in off-policy
Adaptation, learning, and optimization, vol. 12, no. 3, p. 729, monte carlo learning,” Engineering applications of artificial
2012. intelligence, vol. 20, no. 5, pp. 667–675, 2007.
[68] D. Ernst and A. Louette, “Introduction to reinforcement learn- [89] A. Altahhan, “Td (0)-replay: An efficient model-free planning
ing,” 2024. with full replay,” in 2018 International Joint Conference on
[69] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, Neural Networks (IJCNN). IEEE, 2018, pp. 1–7.
R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learn- [90] J. Baxter, A. Tridgell, and L. Weaver, “Knightcap: a chess
ing to reinforcement learn,” arXiv preprint arXiv:1611.05763, program that learns by combining td (lambda) with game-tree
2016. search,” arXiv preprint cs/9901002, 1999.
[70] Z. Ding, Y. Huang, H. Yuan, and H. Dong, “Introduction to [91] P. Dayan, “The convergence of td (λ) for general λ,” Machine
reinforcement learning,” Deep reinforcement learning: funda- learning, vol. 8, pp. 341–362, 1992.
mentals, research and applications, pp. 47–123, 2020. [92] P. Dayan and T. J. Sejnowski, “Td (λ) converges with proba-
[71] D. A. White and D. A. Sofge, “The role of exploration in bility 1,” Machine Learning, vol. 14, pp. 295–301, 1994.
learning control,” Handbook of Intelligent Control: Neural, [93] M. A. Wiering and H. Van Hasselt, “Two novel on-policy
Fuzzy and Adaptive Approaches, pp. 1–27, 1992. reinforcement learning algorithms based on td (λ)-methods,” in
[72] M. Kearns and S. Singh, “Near-optimal performance for re- 2007 IEEE International Symposium on Approximate Dynamic
inforcement learning in polynomial time,” URL: http://www. Programming and Reinforcement Learning. IEEE, 2007, pp.
research. att. com/˜ mkearns, 1998. 280–287.
[94] K. De Asis, J. Hernandez-Garcia, G. Holland, and R. Sutton, [114] T. Paterova, M. Prauzek, and J. Konecny, “Robustness analysis
“Multi-step reinforcement learning: A unifying algorithm,” in of data-driven self-learning controllers for iot environmental
Proceedings of the AAAI conference on artificial intelligence, monitoring nodes based on q-learning approaches,” in 2022
vol. 32, no. 1, 2018. IEEE Symposium Series on Computational Intelligence (SSCI).
[95] Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam, IEEE, 2022, pp. 721–727.
“A lyapunov theory for finite-sample guarantees of asyn- [115] A. Nassar and Y. Yilmaz, “Reinforcement learning for adaptive
chronous q-learning and td-learning variants,” arXiv preprint resource allocation in fog ran for iot with heterogeneous latency
arXiv:2102.01567, 2021. requirements,” IEEE Access, vol. 7, pp. 128 014–128 025, 2019.
[96] D. Lee, “Analysis of off-policy multi-step td-learning with lin- [116] S. Wender and I. Watson, “Applying reinforcement learning to
ear function approximation,” arXiv preprint arXiv:2402.15781, small scale combat in the real-time strategy game starcraft:
2024. Broodwar,” in 2012 ieee conference on computational intelli-
[97] K. De Asis, “A unified view of multi-step temporal difference gence and games (cig). IEEE, 2012, pp. 402–408.
learning,” 2018. [117] H. Iima and Y. Kuroe, “Swarm reinforcement learning algo-
[98] C. Szepesvári, Algorithms for reinforcement learning. Springer rithms based on sarsa method,” in 2008 SICE Annual Confer-
nature, 2022. ence. IEEE, 2008, pp. 2045–2049.
[99] Y. Wang and X. Tan, “Greedy multi-step off-policy reinforce- [118] H. Hasselt, “Double q-learning,” Advances in neural informa-
ment learning,” 2020. tion processing systems, vol. 23, 2010.
[100] A. R. Mahmood, H. Yu, and R. S. Sutton, “Multi-step off-policy [119] M. Ben-Akka, C. Tanougast, C. Diou, and A. Chaddad, “An
learning without importance sampling ratios,” arXiv preprint efficient hardware implementation of the double q-learning
arXiv:1702.03006, 2017. algorithm,” in 2023 3rd International Conference on Electri-
[101] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, cal, Computer, Communications and Mechatronics Engineering
vol. 8, pp. 279–292, 1992. (ICECCME). IEEE, 2023, pp. 1–6.
[102] Y. Bi, A. Thomas-Mitchell, W. Zhai, and N. Khan, “A com- [120] F. Jamshidi, L. Zhang, and F. Nezhadalinaei, “Autonomous
parative study of deterministic and stochastic policies for q- driving systems: Developing an approach based on a* and
learning,” in 2023 4th International Conference on Artificial double q-learning,” in 2021 7th International Conference on
Intelligence, Robotics and Control (AIRC). IEEE, 2023, pp. Web Research (ICWR). IEEE, 2021, pp. 82–85.
1–5.
[121] T. Paterova, M. Prauzek, and J. Konecny, “Data-driven self-
[103] S. Spano, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Gi- learning controller design approach for power-aware iot devices
ardino, M. Matta, A. Nannarelli, and M. Re, “An efficient based on double q-learning strategy,” in 2021 IEEE Symposium
hardware implementation of reinforcement learning: The q- Series on Computational Intelligence (SSCI). IEEE, 2021, pp.
learning algorithm,” Ieee Access, vol. 7, pp. 186 340–186 351, 01–07.
2019.
[122] J. Fan, T. Yang, J. Zhao, Z. Cui, J. Ning, and P. Wang, “Double
[104] D. Huang, H. Zhu, X. Lin, and L. Wang, “Application of mas-
q learning multi-agent routing method for maritime search
sive parallel computation based q-learning in system control,”
and rescue,” in 2023 International Conference on Ubiquitous
in 2022 5th International Conference on Pattern Recognition
Communication (Ucom). IEEE, 2023, pp. 367–372.
and Artificial Intelligence (PRAI). IEEE, 2022, pp. 1–5.
[123] J. Konecny, M. Prauzek, and T. Paterova, “Double q-learning
[105] M. Daswani, P. Sunehag, and M. Hutter, “Q-learning for history-
adaptive wavelet compression method for data transmission at
based reinforcement learning,” in Asian Conference on Machine
Learning. PMLR, 2013, pp. 213–228. environmental monitoring stations,” in 2022 IEEE Symposium
Series on Computational Intelligence (SSCI). IEEE, 2022, pp.
[106] B. Shou, H. Zhang, Z. Long, Y. Xie, K. Zhang, and Q. Gu,
567–572.
“Design and applications of q-learning adaptive pid algorithm
for maglev train levitation control system,” in 2023 35th Chinese [124] H. Huang, M. Lin, and Q. Zhang, “Double-q learning-based dvfs
Control and Decision Conference (CCDC). IEEE, 2023, pp. for multi-core real-time systems,” in 2017 IEEE International
1947–1953. Conference on Internet of Things (iThings) and IEEE Green
[107] G. Akshay, N. S. Naik, and J. Vardhan, “Enhancing hadoop Computing and Communications (GreenCom) and IEEE Cyber,
performance with q-learning for optimal parameter tuning,” in Physical and Social Computing (CPSCom) and IEEE Smart
TENCON 2023-2023 IEEE Region 10 Conference (TENCON). Data (SmartData). IEEE, 2017, pp. 522–529.
IEEE, 2023, pp. 617–622. [125] D. Wang, B. Liu, H. Jia, Z. Zhang, J. Chen, and D. Huang,
[108] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan, “Is q-learning “Peer-to-peer electricity transaction decisions of the user-side
provably efficient?” Advances in neural information processing smart energy system based on the sarsa reinforcement learning,”
systems, vol. 31, 2018. CSEE Journal of Power and Energy Systems, vol. 8, no. 3, pp.
[109] L. M. Da Silva, M. F. Torquato, and M. A. Fernandes, “Parallel 826–837, 2020.
implementation of reinforcement learning q-learning technique [126] T. M. Aljohani and O. Mohammed, “A real-time energy con-
for fpga,” IEEE Access, vol. 7, pp. 2782–2798, 2018. sumption minimization framework for electric vehicles routing
[110] S. Wang and L. Zhang, “Q-learning based handover algorithm optimization based on sarsa reinforcement learning,” Vehicles,
for high-speed rail wireless communications,” in 2023 IEEE vol. 4, no. 4, pp. 1176–1194, 2022.
Wireless Communications and Networking Conference (WCNC). [127] H. Van Seijen, H. Van Hasselt, S. Whiteson, and M. Wiering,
IEEE, 2023, pp. 1–6. “A theoretical and empirical analysis of expected sarsa,” in
[111] S. Wang, “Taxi scheduling research based on q-learning,” in 2009 ieee symposium on adaptive dynamic programming and
2021 3rd International Conference on Machine Learning, Big reinforcement learning. IEEE, 2009, pp. 177–184.
Data and Business Intelligence (MLBDBI). IEEE, 2021, pp. [128] H. Moradimaryamnegari, M. Frego, and A. Peer, “Model predic-
700–703. tive control-based reinforcement learning using expected sarsa,”
[112] W. Xiao, J. Chen, X. Li, M. Wang, D. Huang, and D. Zhang, IEEE Access, vol. 10, pp. 81 177–81 191, 2022.
“Random walk routing algorithm based on q-learning in optical [129] I. A. M. Gonzalez and V. Turau, “Comparison of wifi in-
transport network,” in 2022 18th International Conference on terference mitigation strategies in dsme networks: Leveraging
Computational Intelligence and Security (CIS). IEEE, 2022, reinforcement learning with expected sarsa,” in 2023 IEEE
pp. 88–92. International Mediterranean Conference on Communications
[113] X. Qu and M. Yao, “Visual novelty based internally motivated and Networking (MeditCom). IEEE, 2023, pp. 270–275.
q-learning for mobile robot scene learning and recognition,” in [130] R. Muduli, D. Jena, and T. Moger, “Application of expected
2011 4th International Congress on Image and Signal Process- sarsa-learning for load frequency control of multi-area power
ing, vol. 3. IEEE, 2011, pp. 1461–1466. system,” in 2023 5th International Conference on Energy, Power
and Environment: Towards Flexible Green Energy Technologies [152] G. Zuo, T. Du, and J. Lu, “Double dqn method for object de-
(ICEPE). IEEE, 2023, pp. 1–6. tection,” in 2017 Chinese Automation Congress (CAC). IEEE,
[131] A. Kekuda, R. Anirudh, and M. Krishnan, “Reinforcement 2017, pp. 6727–6732.
learning based intelligent traffic signal control using n-step [153] W. Zhang, J. Gai, Z. Zhang, L. Tang, Q. Liao, and Y. Ding,
sarsa,” in 2021 International Conference on Artificial Intelli- “Double-dqn based path smoothing and tracking control method
gence and Smart Systems (ICAIS). IEEE, 2021, pp. 379–384. for robotic vehicle navigation,” Computers and Electronics in
[132] V. Kuchibhotla, P. Harshitha, and S. Goyal, “An n-step look Agriculture, vol. 166, p. 104985, 2019.
ahead algorithm using mixed (on and off) policy reinforcement [154] S. Zhang, Y. Wu, H. Ogai, H. Inujima, and S. Tateno, “Tactical
learning,” in 2020 3rd International Conference on Intelligent decision-making for autonomous driving using dueling double
Sustainable Systems (ICISS). IEEE, 2020, pp. 677–681. deep q network with double attention,” IEEE Access, vol. 9, pp.
[133] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” nature, 151 983–151 992, 2021.
vol. 521, no. 7553, pp. 436–444, 2015. [155] X. Xue, Z. Li, D. Zhang, and Y. Yan, “A deep reinforcement
[134] I. Goodfellow, “Deep learning,” 2016. learning method for mobile robot collision avoidance based on
[135] J. D. Kelleher, Deep learning. MIT press, 2019. double dqn,” in 2019 IEEE 28th International Symposium on
[136] N. Rusk, “Deep learning,” Nature Methods, vol. 13, no. 1, pp. Industrial Electronics (ISIE). IEEE, 2019, pp. 2131–2136.
35–35, 2016. [156] S. Mo, X. Pei, and Z. Chen, “Decision-making for oncoming
[137] T. Degris, P. M. Pilarski, and R. S. Sutton, “Model-free rein- traffic overtaking scenario using double dqn,” in 2019 3rd
forcement learning with continuous action in practice,” in 2012 Conference on Vehicle Control and Intelligence (CVCI). IEEE,
2019, pp. 1–4.
American control conference (ACC). IEEE, 2012, pp. 2177–
2182. [157] Y. Xiaofei, S. Yilun, L. Wei, Y. Hui, Z. Weibo, and X. Zhen-
grong, “Global path planning algorithm based on double dqn
[138] Y. Liu, A. Halev, and X. Liu, “Policy learning with constraints
for multi-tasks amphibious unmanned surface vehicle,” Ocean
in model-free reinforcement learning: A survey,” in The 30th
Engineering, vol. 266, p. 112809, 2022.
international joint conference on artificial intelligence (ijcai),
2021. [158] C. Lee, J. Jung, and J.-M. Chung, “Intelligent dual active
protocol stack handover based on double dqn deep reinforce-
[139] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou,
ment learning for 5g mmwave networks,” IEEE Transactions
D. Wierstra, and M. Riedmiller, “Playing atari with deep rein-
on Vehicular Technology, vol. 71, no. 7, pp. 7572–7584, 2022.
forcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[159] Y. Zhang, P. Sun, Y. Yin, L. Lin, and X. Wang, “Human-
[140] K. U. Ahn and C. S. Park, “Application of deep q-networks like autonomous vehicle speed control by deep reinforcement
for model-free optimal control balancing between different hvac systems,” Science and Technology for the Built Environment, vol. 26, no. 1, pp. 61–74, 2020.
[141] J. N. Stember and H. Shalu, “Reinforcement learning using deep q networks and q learning accurately localizes brain tumors on mri with very small training sets,” BMC Medical Imaging, vol. 22, no. 1, p. 224, 2022.
[142] H. Guo, R. Wu, B. Qi, and C. Xu, “Deep-q-networks-based adaptive dual-mode energy-efficient routing in rechargeable wireless sensor networks,” IEEE Sensors Journal, vol. 22, no. 10, pp. 9956–9966, 2022.
[143] F. M. Talaat, “Effective deep q-networks (edqn) strategy for resource allocation based on optimized reinforcement learning algorithm,” Multimedia Tools and Applications, vol. 81, no. 28, pp. 39945–39961, 2022.
[144] C. Song, H. Lee, K. Kim, and S. W. Cha, “A power management strategy for parallel phev using deep q-networks,” in 2018 IEEE Vehicle Power and Propulsion Conference (VPPC). IEEE, 2018, pp. 1–5.
[145] S. Yoon and K.-J. Kim, “Deep q networks for visual fighting game ai,” in 2017 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, 2017, pp. 306–308.
[146] L. Lv, S. Zhang, D. Ding, and Y. Wang, “Path planning via an improved dqn-based learning policy,” IEEE Access, vol. 7, pp. 67319–67330, 2019.
[147] A. S. Zamzam, B. Yang, and N. D. Sidiropoulos, “Energy storage management via deep q-networks,” in 2019 IEEE Power & Energy Society General Meeting (PESGM). IEEE, 2019, pp. 1–5.
[148] N. Gao, Z. Qin, X. Jing, Q. Ni, and S. Jin, “Anti-intelligent uav jamming strategy via deep q-networks,” IEEE Transactions on Communications, vol. 68, no. 1, pp. 569–581, 2019.
[149] H. Van Hasselt, A. Guez, and D. Silver, “Deep reinforcement learning with double q-learning,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
[150] B. Peng, Q. Sun, S. E. Li, D. Kum, Y. Yin, J. Wei, and T. Gu, “End-to-end autonomous driving through dueling double deep q-network,” Automotive Innovation, vol. 4, pp. 328–337, 2021.
[151] A. Iqbal, M.-L. Tham, and Y. C. Chang, “Double deep q-network-based energy-efficient resource allocation in cloud radio access network,” IEEE Access, vol. 9, pp. 20440–20449, 2021.
learning with double q-learning,” in 2018 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2018, pp. 1251–1256.
[160] D. Li, S. Xu, and J. Zhao, “Partially observable double dqn based iot scheduling for energy harvesting,” in 2019 IEEE International Conference on Communications Workshops (ICC Workshops). IEEE, 2019, pp. 1–6.
[161] Z. Wang, T. Schaul, M. Hessel, H. Hasselt, M. Lanctot, and N. Freitas, “Dueling network architectures for deep reinforcement learning,” in International Conference on Machine Learning. PMLR, 2016, pp. 1995–2003.
[162] B.-A. Han and J.-J. Yang, “Research on adaptive job shop scheduling problems based on dueling double dqn,” IEEE Access, vol. 8, pp. 186474–186495, 2020.
[163] T.-W. Ban, “An autonomous transmission scheme using dueling dqn for d2d communication networks,” IEEE Transactions on Vehicular Technology, vol. 69, no. 12, pp. 16348–16352, 2020.
[164] Y. Liu and C. Zhang, “Application of dueling dqn and decga for parameter estimation in variogram models,” IEEE Access, vol. 8, pp. 38112–38122, 2020.
[165] W. Liu, P. Si, E. Sun, M. Li, C. Fang, and Y. Zhang, “Green mobility management in uav-assisted iot based on dueling dqn,” in ICC 2019-2019 IEEE International Conference on Communications (ICC). IEEE, 2019, pp. 1–6.
[166] S. B. Tadele, B. Kar, F. G. Wakgra, and A. U. Khan, “Optimization of end-to-end aoi in edge-enabled vehicular fog systems: A dueling-dqn approach,” arXiv preprint arXiv:2407.02815, 2024.
[167] W. Jiang, C. Bao, G. Xu, and Y. Wang, “Research on autonomous obstacle avoidance and target tracking of uav based on improved dueling dqn algorithm,” in 2021 China Automation Congress (CAC). IEEE, 2021, pp. 5110–5115.
[168] Z. Huang, S. Liu, and G. Zhang, “The usv path planning of dueling dqn algorithm based on tree sampling mechanism,” in 2022 IEEE Asia-Pacific Conference on Image Processing, Electronics and Computers (IPEC). IEEE, 2022, pp. 971–976.
[169] D. Silver and J. Veness, “Monte-carlo planning in large pomdps,” Advances in Neural Information Processing Systems, vol. 23, 2010.
[170] R. Coulom, “Efficient selectivity and backup operators in monte-carlo tree search,” in International Conference on Computers and Games. Springer, 2006, pp. 72–83.
[171] L. Rossi, M. H. Winands, and C. Butenweg, “Monte carlo tree search as an intelligent search tool in structural design problems,” Engineering with Computers, vol. 38, no. 4, pp. 3219–3236, 2022.
[172] M. C. Fu, “A tutorial introduction to monte carlo tree search,” in 2020 Winter Simulation Conference (WSC). IEEE, 2020, pp. 1178–1193.
[173] W. Wang and M. Sebag, “Multi-objective monte-carlo tree search,” in Asian Conference on Machine Learning. PMLR, 2012, pp. 507–522.
[174] X. Su, T. Huang, Y. Li, S. You, F. Wang, C. Qian, C. Zhang, and C. Xu, “Prioritized architecture sampling with monto-carlo tree search,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 10968–10977.
[175] D. Perez, S. Mostaghim, S. Samothrakis, and S. M. Lucas, “Multiobjective monte carlo tree search for real-time games,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 7, no. 4, pp. 347–360, 2014.
[176] H. Baier and M. H. Winands, “Monte-carlo tree search and minimax hybrids,” in 2013 IEEE Conference on Computational Intelligence in Games (CIG). IEEE, 2013, pp. 1–8.
[177] M. De Waard, D. M. Roijers, and S. C. Bakkes, “Monte carlo tree search with options for general video game playing,” in 2016 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, 2016, pp. 1–8.
[178] M. Lanctot, M. H. Winands, T. Pepels, and N. R. Sturtevant, “Monte carlo tree search with heuristic evaluations using implicit minimax backups,” in 2014 IEEE Conference on Computational Intelligence and Games. IEEE, 2014, pp. 1–8.
[179] M. H. Winands, Y. Bjornsson, and J.-T. Saito, “Monte carlo tree search in lines of action,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 2, no. 4, pp. 239–250, 2010.
[180] A. Santos, P. A. Santos, and F. S. Melo, “Monte carlo tree search experiments in hearthstone,” in 2017 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, 2017, pp. 272–279.
[181] P. I. Cowling, E. J. Powley, and D. Whitehouse, “Information set monte carlo tree search,” IEEE Transactions on Computational Intelligence and AI in Games, vol. 4, no. 2, pp. 120–143, 2012.
[182] P. Ciancarini and G. P. Favini, “Monte carlo tree search in kriegspiel,” Artificial Intelligence, vol. 174, no. 11, pp. 670–684, 2010.
[183] A. W. Moore and C. G. Atkeson, “Prioritized sweeping: Reinforcement learning with less data and less time,” Machine Learning, vol. 13, pp. 103–130, 1993.
[184] J. Peng and R. J. Williams, “Efficient learning and planning within the dyna framework,” Adaptive Behavior, vol. 1, no. 4, pp. 437–454, 1993.
[185] R. Li, Q. Wang, C. Dong et al., “Morphing strategy design for uav based on prioritized sweeping reinforcement learning,” in IECON 2020 The 46th Annual Conference of the IEEE Industrial Electronics Society. IEEE, 2020, pp. 2786–2791.
[186] R. Zajdel, “Epoch-incremental dyna-learning and prioritized sweeping algorithms,” Neurocomputing, vol. 319, pp. 13–20, 2018.
[187] H. Van Seijen and R. Sutton, “Planning by prioritized sweeping with small backups,” in International Conference on Machine Learning. PMLR, 2013, pp. 361–369.
[188] E. Bargiacchi, T. Verstraeten, D. M. Roijers, and A. Nowé, “Model-based multi-agent reinforcement learning with cooperative prioritized sweeping,” arXiv preprint arXiv:2001.07527, 2020.
[189] R. Dearden, “Structured prioritized sweeping,” in ICML. Citeseer, 2001, pp. 82–89.
[190] M. Santos, V. López, G. Botella et al., “Dyna-h: A heuristic planning reinforcement learning algorithm applied to role-playing game strategy decision systems,” Knowledge-Based Systems, vol. 32, pp. 28–36, 2012.
[191] Y. Chai and X.-J. Zeng, “A multi-objective dyna-q based routing in wireless mesh network,” Applied Soft Computing, vol. 108, p. 107486, 2021.
[192] S. Del Giorno, F. D’Antoni, V. Piemonte, and M. Merone, “A new glycemic closed-loop control based on dyna-q for type-1-diabetes,” Biomedical Signal Processing and Control, vol. 81, p. 104492, 2023.
[193] T. Faycal and C. Zito, “Dyna-t: Dyna-q and upper confidence bounds applied to trees,” arXiv preprint arXiv:2201.04502, 2022.
[194] Z. Xu, F. Zhu, Y. Fu, Q. Liu, and S. You, “A dyna-q based multi-path load-balancing routing algorithm in wireless sensor networks,” in 2015 IEEE Trustcom/BigDataSE/ISPA, vol. 3. IEEE, 2015, pp. 1–4.
[195] M. Pei, H. An, B. Liu, and C. Wang, “An improved dyna-q algorithm for mobile robot path planning in unknown dynamic environment,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 52, no. 7, pp. 4415–4425, 2021.
[196] F. Wang, J. Gao, M. Li, and L. Zhao, “Autonomous pev charging scheduling using dyna-q reinforcement learning,” IEEE Transactions on Vehicular Technology, vol. 69, no. 11, pp. 12609–12620, 2020.
[197] Y. Liu, S. Yan, Y. Zhao, C. Song, and F. Li, “Improved dyna-q: a reinforcement learning method focused via heuristic graph for agv path planning in dynamic environments,” Drones, vol. 6, no. 11, p. 365, 2022.
[198] G. Zhang, Y. Li, Y. Niu, and Q. Zhou, “Anti-jamming path selection method in a wireless communication network based on dyna-q,” Electronics, vol. 11, no. 15, p. 2397, 2022.
[199] K.-S. Hwang, W.-C. Jiang, and Y.-J. Chen, “Model learning and knowledge sharing for a multiagent system with dyna-q learning,” IEEE Transactions on Cybernetics, vol. 45, no. 5, pp. 978–990, 2014.
[200] J. Huang, Q. Tan, J. Ma, and L. Han, “Path planning method using dyna-q algorithm under complex urban environment,” in 2022 China Automation Congress (CAC). IEEE, 2022, pp. 6776–6781.
[201] E. Vitolo, A. San Miguel, J. Civera, and C. Mahulea, “Performance evaluation of the dyna-q algorithm for robot navigation,” in 2018 IEEE 14th International Conference on Automation Science and Engineering (CASE). IEEE, 2018, pp. 322–327.
[202] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine Learning, vol. 8, pp. 229–256, 1992.
[203] J. Zhang, J. Kim, B. O’Donoghue, and S. Boyd, “Sample efficient reinforcement learning with reinforce,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 35, no. 12, 2021, pp. 10887–10895.
[204] L. Guo, Z. Li, R. Outbib, and F. Gao, “Function approximation reinforcement learning of energy management with the fuzzy reinforce for fuel cell hybrid electric vehicles,” Energy and AI, vol. 13, p. 100246, 2023.
[205] J. C. Lauffenburger, E. Yom-Tov, P. A. Keller, M. E. McDonnell, L. G. Bessette, C. P. Fontanet, E. S. Sears, E. Kim, K. Hanken, J. J. Buckley et al., “Reinforcement learning to improve non-adherence for diabetes treatments by optimising response and customising engagement (reinforce): study protocol of a pragmatic randomised trial,” BMJ Open, vol. 11, no. 12, p. e052091, 2021.
[206] Y. Tao and W. L. Tan, “A reinforcement learning approach to wi-fi rate adaptation using the reinforce algorithm,” in 2024 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2024, pp. 1–6.
[207] K. Weerakoon, S. Chakraborty, N. Karapetyan, A. J. Sathyamoorthy, A. S. Bedi, and D. Manocha, “Htron: Efficient outdoor navigation with sparse rewards via heavy tailed adaptive reinforce algorithm,” arXiv preprint arXiv:2207.03694, 2022.
[208] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, “Trust region policy optimization,” in International Conference on Machine Learning. PMLR, 2015, pp. 1889–1897.
[209] D. Bertsekas, Dynamic Programming and Optimal Control: Volume I. Athena Scientific, 2012, vol. 4.
[210] J. Peters and S. Schaal, “Reinforcement learning of motor skills with policy gradients,” Neural Networks, vol. 21, no. 4, pp. 682–697, 2008.
[211] I. Szita and A. Lörincz, “Learning tetris using the noisy cross-entropy method,” Neural Computation, vol. 18, no. 12, pp. 2936–2941, 2006.
[212] B. Liu, Q. Cai, Z. Yang, and Z. Wang, “Neural trust region/proximal policy optimization attains globally optimal policy,” Advances in Neural Information Processing Systems, vol. 32, 2019.
[213] Q. Yuan and N. Xiao, “A monotonic policy optimization algorithm for high-dimensional continuous control problem in 3d mujoco,” Multimedia Tools and Applications, vol. 78, pp. 28665–28680, 2019.
[214] S. Roostaie and M. M. Ebadzadeh, “Entrpo: Trust region policy optimization method with entropy regularization,” arXiv preprint arXiv:2110.13373, 2021.
[215] J. G. Kuba, R. Chen, M. Wen, Y. Wen, F. Sun, J. Wang, and Y. Yang, “Trust region policy optimisation in multi-agent reinforcement learning,” arXiv preprint arXiv:2109.11251, 2021.
[216] H. Liu, Y. Wu, and F. Sun, “Extreme trust region policy optimization for active object recognition,” IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 6, pp. 2253–2258, 2018.
[217] B. Mondal, A. Banerjee, and S. Gupta, “Xss filter detection using trust region policy optimization,” in 2023 1st International Conference on Advanced Innovations in Smart Cities (ICAISC). IEEE, 2023, pp. 1–4.
[218] J. Erens, “Universal robot policy: Using a surrogate model in combination with trpo,” Master’s thesis, University of Twente, 2024.
[219] K. Thattai, J. Ravishankar, and C. Li, “Consumer-centric home energy management system using trust region policy optimization-based multi-agent deep reinforcement learning,” in 2023 IEEE Belgrade PowerTech. IEEE, 2023, pp. 1–6.
[220] N. Peng, Y. Lin, Y. Zhang, and J. Li, “Aoi-aware joint spectrum and power allocation for internet of vehicles: A trust region policy optimization-based approach,” IEEE Internet of Things Journal, vol. 9, no. 20, pp. 19916–19927, 2022.
[221] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
[222] H. Wei, X. Liu, L. Mashayekhy, and K. Decker, “Mixed-autonomy traffic control with proximal policy optimization,” in 2019 IEEE Vehicular Networking Conference (VNC). IEEE, 2019, pp. 1–8.
[223] B. Zhang, X. Lu, R. Diao, H. Li, T. Lan, D. Shi, and Z. Wang, “Real-time autonomous line flow control using proximal policy optimization,” in 2020 IEEE Power & Energy Society General Meeting (PESGM). IEEE, 2020, pp. 1–5.
[224] C.-S. Ying, A. H. Chow, Y.-H. Wang, and K.-S. Chin, “Adaptive metro service schedule and train composition with a proximal policy optimization approach based on deep reinforcement learning,” IEEE Transactions on Intelligent Transportation Systems, vol. 23, no. 7, pp. 6895–6906, 2021.
[225] J. Jin and Y. Xu, “Optimal policy characterization enhanced proximal policy optimization for multitask scheduling in cloud computing,” IEEE Internet of Things Journal, vol. 9, no. 9, pp. 6418–6433, 2021.
[226] Y. Guan, Y. Ren, S. E. Li, Q. Sun, L. Luo, and K. Li, “Centralized cooperation for connected and automated vehicles at intersections by proximal policy optimization,” IEEE Transactions on Vehicular Technology, vol. 69, no. 11, pp. 12597–12608, 2020.
[227] E. Bøhn, E. M. Coates, S. Moe, and T. A. Johansen, “Deep reinforcement learning attitude control of fixed-wing uavs using proximal policy optimization,” in 2019 International Conference on Unmanned Aircraft Systems (ICUAS). IEEE, 2019, pp. 523–533.
[228] F. Ye, X. Cheng, P. Wang, C.-Y. Chan, and J. Zhang, “Automated lane change strategy using proximal policy optimization-based deep reinforcement learning,” in 2020 IEEE Intelligent Vehicles Symposium (IV). IEEE, 2020, pp. 1746–1752.
[229] I. Grondman, L. Busoniu, G. A. Lopes, and R. Babuska, “A survey of actor-critic reinforcement learning: Standard and natural policy gradients,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 42, no. 6, pp. 1291–1307, 2012.
[230] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in International Conference on Machine Learning. PMLR, 2016, pp. 1928–1937.
[231] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” Advances in Neural Information Processing Systems, vol. 12, 1999.
[232] F. Jiang, S. Han, and C. Sun, “Asynchronous advantage actor-critic algorithm based cooperative caching strategy for fog radio access networks,” in 2023 IEEE Wireless Communications and Networking Conference (WCNC). IEEE, 2023, pp. 1–6.
[233] J. Du, W. Cheng, G. Lu, H. Cao, X. Chu, Z. Zhang, and J. Wang, “Resource pricing and allocation in mec enabled blockchain systems: An a3c deep reinforcement learning approach,” IEEE Transactions on Network Science and Engineering, vol. 9, no. 1, pp. 33–44, 2021.
[234] M. Joypriyanka and R. Surendran, “Chess game to improve the mental ability of alzheimer’s patients using a3c,” in 2023 Fifth International Conference on Electrical, Computer and Communication Technologies (ICECCT). IEEE, 2023, pp. 1–6.
[235] T. Tiong, I. Saad, K. T. K. Teo, and H. B. Lago, “Autonomous valet parking with asynchronous advantage actor-critic proximal policy optimization,” in 2022 IEEE 12th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, 2022, pp. 0334–0340.
[236] Z. Shi, L. Li, Y. Xu, X. Li, W. Chen, and Z. Han, “Content caching policy for 5g network based on asynchronous advantage actor-critic method,” in 2019 IEEE Global Communications Conference (GLOBECOM). IEEE, 2019, pp. 1–6.
[237] J. Yang, J. Lu, X. Zhou, S. Li, C. Xiong, and J. Hu, “Ha-a2c: Hard attention and advantage actor-critic for addressing latency optimization in edge computing,” IEEE Transactions on Green Communications and Networking, 2024.
[238] P. Choppara and S. Mangalampalli, “Reliability and trust aware task scheduler for cloud-fog computing using advantage actor critic (a2c) algorithm,” IEEE Access, 2024.
[239] Y. Dantas, P. E. Iturria-Rivera, H. Zhou, M. Bavand, M. Elsayed, R. Gaigalas, and M. Erol-Kantarci, “Beam selection for energy-efficient mmwave network using advantage actor critic learning,” in ICC 2023-IEEE International Conference on Communications. IEEE, 2023, pp. 5285–5290.
[240] M. Vishal, Y. Satija, and B. S. Babu, “Trading agent for the indian stock market scenario using actor-critic based reinforcement learning,” in 2021 IEEE International Conference on Computation System and Information Technology for Sustainable Solutions (CSITSS). IEEE, 2021, pp. 1–5.
[241] Y. Sun and X. Zhang, “A2c learning for tasks segmentation with cooperative computing in edge computing networks,” in GLOBECOM 2022-2022 IEEE Global Communications Conference. IEEE, 2022, pp. 2236–2241.
[242] X. Han, J. Wang, Q. Zhang, X. Qin, and M. Sun, “Multi-uav automatic dynamic obstacle avoidance with experience-shared a2c,” in 2019 International Conference on Wireless and Mobile Computing, Networking and Communications (WiMob). IEEE, 2019, pp. 330–335.
[243] K. Shao, D. Zhao, Y. Zhu, and Q. Zhang, “Visual navigation with actor-critic deep reinforcement learning,” in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–6.
[244] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, “Deterministic policy gradient algorithms,” in International Conference on Machine Learning. PMLR, 2014, pp. 387–395.
[245] T. Degris, M. White, and R. S. Sutton, “Off-policy actor-critic,” arXiv preprint arXiv:1205.4839, 2012.
[246] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.
[247] G. E. Uhlenbeck and L. S. Ornstein, “On the theory of the brownian motion,” Physical Review, vol. 36, no. 5, p. 823, 1930.
[248] Y. Kong, Y. Li, J. Wang, and S. Yin, “Edge computing task unloading decision optimization algorithm based on deep reinforcement learning,” in China Conference on Wireless Sensor Networks. Springer, 2023, pp. 189–201.
[249] A. Candeli, G. De Tommasi, D. G. Lui, A. Mele, S. Santini, and G. Tartaglione, “A deep deterministic policy gradient learning approach to missile autopilot design,” IEEE Access, vol. 10, pp. 19685–19696, 2022.
[250] Z. Wei, Z. Quan, J. Wu, Y. Li, J. Pou, and H. Zhong, “Deep deterministic policy gradient-drl enabled multiphysics-constrained fast charging of lithium-ion battery,” IEEE Transactions on Industrial Electronics, vol. 69, no. 3, pp. 2588–2598, 2021.
[251] S. Wen, J. Chen, S. Wang, H. Zhang, and X. Hu, “Path planning of humanoid arm based on deep deterministic policy gradient,” in 2018 IEEE International Conference on Robotics and Biomimetics (ROBIO). IEEE, 2018, pp. 1755–1760.
[252] X. Gao, L. Yan, Z. Li, G. Wang, and I.-M. Chen, “Improved deep deterministic policy gradient for dynamic obstacle avoidance of mobile robot,” IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 53, no. 6, pp. 3675–3682, 2023.
[253] S. Zhou, Y. Cheng, X. Lei, and H. Duan, “Deep deterministic policy gradient with prioritized sampling for power control,” IEEE Access, vol. 8, pp. 194240–194250, 2020.
[254] Y. Liang, C. Guo, Z. Ding, and H. Hua, “Agent-based modeling in electricity market using deep deterministic policy gradient algorithm,” IEEE Transactions on Power Systems, vol. 35, no. 6, pp. 4180–4192, 2020.
[255] Y.-H. Xu, C.-C. Yang, M. Hua, and W. Zhou, “Deep deterministic policy gradient (ddpg)-based resource allocation scheme for noma vehicular communications,” IEEE Access, vol. 8, pp. 18797–18807, 2020.
[256] S. Fujimoto, H. Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” in International Conference on Machine Learning. PMLR, 2018, pp. 1587–1596.
[257] S. B. Thrun, Efficient Exploration in Reinforcement Learning. Carnegie Mellon University, 1992.
[258] O. Yazar, S. Coskun, L. Li, F. Zhang, and C. Huang, “Actor-critic td3-based deep reinforcement learning for energy management strategy of hev,” in 2023 5th International Congress on Human-Computer Interaction, Optimization and Robotic Applications (HORA). IEEE, 2023, pp. 1–6.
[259] N. A. Mosali, S. S. Shamsudin, O. Alfandi, R. Omar, and N. Al-Fadhali, “Twin delayed deep deterministic policy gradient-based target tracking for unmanned aerial vehicle with achievement rewarding and multistage training,” IEEE Access, vol. 10, pp. 23545–23559, 2022.
[260] X. Han, Z. Li, and Z. Xie, “Two-step random access optimization for 5g-and-beyond leo satellite communication system: a td3-based msga channel allocation strategy,” IEEE Communications Letters, vol. 27, no. 6, pp. 1570–1574, 2023.
[261] R. Hossain, M. Gautam, M. M. Lakouraj, H. Livani, and M. Benidris, “Volt-var optimization in distribution networks using twin delayed deep reinforcement learning,” in 2022 IEEE Power & Energy Society Innovative Smart Grid Technologies Conference (ISGT). IEEE, 2022, pp. 1–5.
[262] M. Shehab, A. Zaghloul, and A. El-Badawy, “Low-level control of a quadrotor using twin delayed deep deterministic policy gradient (td3),” in 2021 18th International Conference on Electrical Engineering, Computing Science and Automatic Control (CCE). IEEE, 2021, pp. 1–6.
[263] J. Yu, H. Sun, and Q. Sun, “Multi-auv charging navigation trajectory planning based on twin delayed deep deterministic policy gradient,” in 2023 42nd Chinese Control Conference (CCC). IEEE, 2023, pp. 8521–8526.
[264] H. K. Bishen, K. Shihabudheen, and P. M. Shanir, “Adaptive cruise control using twin delayed deep deterministic policy gradient,” in 2023 5th International Conference on Energy, Power and Environment: Towards Flexible Green Energy Technologies (ICEPE). IEEE, 2023, pp. 1–6.
[265] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger, “Deep reinforcement learning that matters,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018.