0% found this document useful (0 votes)
0 views

Comprehensive Survey of Reinforcement Learning From Algorithms to Practical Challenges

This document is a comprehensive survey of Reinforcement Learning (RL), detailing its algorithms, applications, and practical challenges across various domains such as robotics, optimization, and autonomous systems. It categorizes RL methods into Value-based, Policy-based, and Actor-Critic approaches, evaluating their strengths and weaknesses while providing insights on algorithm selection and implementation. The survey aims to serve as a reference for researchers and practitioners looking to leverage RL for solving complex real-world problems.

Uploaded by

zwz.chao
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
0 views

Comprehensive Survey of Reinforcement Learning From Algorithms to Practical Challenges

This document is a comprehensive survey of Reinforcement Learning (RL), detailing its algorithms, applications, and practical challenges across various domains such as robotics, optimization, and autonomous systems. It categorizes RL methods into Value-based, Policy-based, and Actor-Critic approaches, evaluating their strengths and weaknesses while providing insights on algorithm selection and implementation. The survey aims to serve as a reference for researchers and practitioners looking to leverage RL for solving complex real-world problems.

Uploaded by

zwz.chao
Copyright
© © All Rights Reserved
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 79

1

Comprehensive Survey of Reinforcement


Learning: From Algorithms to Practical
Challenges
Majid Ghasemi†,1 , Amir Hossein Moosavi1 , and Dariush Ebrahimi1
arXiv:2411.18892v2 [cs.AI] 1 Feb 2025

Abstract—Reinforcement Learning (RL) has emerged as solving sequential decision-making problems in a variety
a powerful paradigm in Artificial Intelligence (AI), enabling of fields, such as game playing ([10], [11], [12], [13]),
agents to learn optimal behaviors through interactions with robotics ([14], [15], [16], [17]), and autonomous sys-
their environments. Drawing from the foundations of trial
and error, RL equips agents to make informed decisions tems, particularly in Intelligent Transportation Systems
through feedback in the form of rewards or penalties. This (ITS) ([18], [19], [20], [21], [22]).
paper presents a comprehensive survey of RL, meticulously This survey examines the practical application of dif-
analyzing a wide range of algorithms, from foundational ferent RL approaches through various domains including
tabular methods to advanced Deep Reinforcement Learning
but not limited to: robotics [14], [23], [24], optimization
(DRL) techniques. We categorize and evaluate these algo-
rithms based on key criteria such as scalability, sample [25], [26], energy efficiency and power management
efficiency, and suitability. We compare the methods in [27], [28], [29], networks [30], [31], [32], dynamic
the form of their strengths and weaknesses in diverse and partially observable environments [33], [34], [35],
settings. Additionally, we offer practical insights into the [36], video games [37], real-time systems and hardware
selection and implementation of RL algorithms, addressing
implementations [38], [39], financial portfolios [40], ITS
common challenges like convergence, stability, and the
exploration-exploitation dilemma. This paper serves as a [18], [41], [42], signal processing [43], benchmark tasks
comprehensive reference for researchers and practitioners [44], data management and processing [45], [46], multi-
aiming to harness the full potential of RL in solving agent and cloud-based systems [47], [48], [49], [50].
complex, real-world problems. Moreover, our survey gives detailed explanations of RL
Index Terms—Reinforcement Learning, Deep Reinforce- algorithms, ranging from traditional tabular methods to
ment Learning, Model-free, Model-based, Actor-Critic, Q- state-of-the-art methods.
learning, DQN, TD3, PPO, TRPO Related Surveys: Several notable survey papers have
examined different aspects of RL or attempted to teach
I. I NTRODUCTION various concepts to readers. Foundational work, such as
Reinforcement Learning (RL) is a subfield of Artificial [6], offers an in-depth computer science perspective on
Intelligence (AI) in which an agent learns to make RL, encompassing both its historical roots and modern
decisions by interacting with an environment, aiming developments. In the realm of DRL, [51], [52], [53],
to maximize cumulative reward over time [1]. RL has [54] offer comprehensive analyses of DRL’s integration
rapidly evolved since the 1950s, when Richard Bell- with DL, highlighting its applications and theoretical ad-
man’s work on Dynamic Programming (DP) established vancements. Model-based RL is identified as a promising
foundational concepts that underpin current approaches area in RL, with authors in [55], [56], [57] emphasizing
[2], [3]. The field gradually became more widespread by its potential to enhance sample efficiency by simulating
proposing more advanced approaches, such as Temporal- trial-and-error learning in predictive models, reducing
Difference (TD) Learning [4], [5] and suggesting solu- the need for costly real-world interactions.
tions to exploration-exploitation dilemma [6], [7]. RL In [58], authors provided a survey of Model-free RL
has further been evolving rapidly due to its integration specifically within the financial portfolio management
with Deep Learning (DL), giving rise to Deep Rein- domain, exploring its applications and effectiveness.
forcement Learning (DRL). This advancement enables Meanwhile, [59] analyzes the combination of Model-
researchers to tackle more sophisticated and complex free RL and Imitation Learning (IL), a hybrid approach
problems [8], [9]. It has proven to be highly effective in known as RL from Expert Demonstrations (RLED),
which leverages expert data to enhance RL performance.
†Corresponding author These survey papers help to understand the various
1 Department of Computer Science, Wilfrid Laurier University, Wa-
terloo, Canada. Emails: {mghasemi, debrahimi}@wlu.ca & aspects of RL and its integration with other areas.
[email protected] To our knowledge, no papers have analyzed the
2

TABLE I: A list of notations & symbols used in this survey


Symbols & Notations Definition
S A finite set of states
s State
s′ Next State
A A finite set of actions
a Action
P State transition probability function
P (s′ |s, a) Probability of transitioning to s′ by taking action a
R Reward function
R(s, a) The immediate reward received after taking action a in state s
γ Discount Factor
π (Target) Policy
β Behaviour policy
π∗ Optimal Policy
πθ (a|s) The probability of taking action a in the state s under policy π parameterized by θ
Gt Expected cumulative reward
Vπ (s) State-value function
Qπ (s, a) Action-value function
α Learning Rate
ǫ Exploration rate
δ TD error
e(s) Eligibility trace
q∗ Optimal action-value function
θ∗ Optimal parameters
θt Weight vector
θtT Transpose of the weight vector
φt Feature vector obtained through current state St
Qπθ (s, a) Action-value function under the current policy
J(π) The expected discounted reward for a policy π
DKL (πθ |πθ′ ) Kullback-Leibler (KL) divergence between the new policy πθ′ and the old policy πθ
L(θ) Surrogate objective function
Âθold (s, a) An estimate of the advantage function
A(s, a) Advantage function
µθ Deterministic policy
Qµ (s, a) Action-value function under µθ
ρµ Discounted state visitation distribution under µθ
τ <1 Target update rate
Qθ Critic network
Qθ ′ Target Critic network
i
µθ Actor network
µθ′ Target actor network

strengths and weaknesses of algorithms used in papers also important to consider the degree to which direct
and provide a comprehensive analysis of an entire paper. implementation can be achieved and how easily it can
The first motivation of this survey is to address this gap. be debugged. In these respects, simpler algorithms may
tend to be more user-friendly but cannot be applied to
RL involves various challenges in choosing appropri- complex problems. Convergence and stability are impor-
ate algorithms due to diverse factors, such as problem tant considerations, as certain algorithms provide better
characteristics and environmental dynamics. In general, guarantees in specific circumstances. In conclusion, the
it depends on numerous factors based on the character- decision-making process is influenced by exploration
istics of the problem, including whether the state and style, domain-specific requirements, and past research
action spaces are large, whether their values are discrete results. To determine which algorithm to use based on
or continuous, or if the dynamics of the environment comparing the problem they are solving with similar
are stochastic or deterministic. Data availability and existing research, researchers should have a compre-
sample efficiency are other factors to consider. It is
3

hensive paper examining several papers in different a state (or state-action pair) and continues to follow its
domains thoroughly and accurately. Besides saving time current policy. As a final point, the model of the envi-
and resources, this will also prevent the excess cost ronment simulates the behavior of the environment or,
of going through a trial-and-error process to determine more generally, can be used to infer how the environment
which solution to choose, which is the second motivation will behave [1], [60], [61].
behind conducting this survey. There are two broad categories of RL methodology:
The rest of the paper is organized as follows: In II, Model-free and Model-based. Model-free methods do
we provide a general overview of RL before diving into not assume knowledge of the environment’s dynamics
algorithms. Consequently, we undertake an examination and learn directly from interactions with the environ-
of various algorithms within the domain of RL, inclusive ment. On the other hand, Model-based methods involve
of the associated papers, as well as an analysis of their building a model of the environment’s dynamics and
respective merits and drawbacks. It is imperative to using this model to develop and improve policies [6], [3].
acknowledge that these algorithms fall into three over- Each of the mentioned categories has its own advantages
arching categories: Value-based Methods, Policy-based and disadvantages, that will be discussed later in the
Methods, and Actor-Critic Methods. Section III initiates paper.
the discussion by focusing on Value-based Methods, Following this, we will briefly examine two of the
delineated by its four core components: Dynamic Pro- most crucial components of RL. First, we will explore
gramming, Tabular Model-free, Approximation Model- the Markov Decision Process (MDP), the foundational
free, and Tabular Model-based Methods. Section IV framework that structures the learning environment and
subsequently talks about the Policy-based Methods. Fur- guides the agent’s decision-making process. Then, we
thermore, section V offers detailed insights into Actor- will discuss the exploration-exploitation dilemma, one
Critic Methods. In section VI, we give a summary of of the most imperative characteristics of RL, which
the paper and discuss the scope of it. Finally, section VII balances the need to gather new information with the
provides a synthesis of the paper through a review of key goal of maximizing rewards.
points and an exposition of future research directions.

II. G ENERAL OVERVIEW OF RL A. Markov Decision Process (MDP)


In this section, we give a general overview of RL, The MDP is a sequential decision-making process
assuming readers have a basic knowledge of RL. In in which the costs and transition functions are directly
the framework of RL, the learning process is defined related to the current state and actions of the system.
by several key components. The fundamental concept of MDPs aim to provide the decision-maker with an optimal
RL is to capture the crucial elements of a real problem policy π : S → A. The models have been applied to a
faced by an agent that interacts with its environment wide range of subjects, including queueing, inventory
to achieve a goal. It is evident that such an agent must control, and recommender systems [62], [63], [64].
be capable of sensing the state of the environment to MDP is defined by a tuple (S, A, P, R, γ), where:
some extent and must have the ability to take actions • S is a finite set of states,
that influence the state [1]. In general, an action refers • A is a finite set of actions,
to any decision an agent will have to make, while a state • P : S × A × S → [0, 1] is the state transition
refers to any factor that the agent will have to consider probability function, where P (s′ |s, a) denotes the
when making that decision. probability of transitioning to state s′ from state s
Beyond the agent and the environment, an RL system by taking action a,
has four main sub-elements: a policy, a reward signal, • R : S × A → R is the reward function, where
a value function, and, optionally, a model of the en- R(s, a) denotes the immediate reward received after
vironment. The reward signal determines the agent’s taking action a in state s,
behavior in the given environment. During each time • γ ∈ [0, 1] is the discount factor that determines the
step, the environment sends a single number, a reward, importance of future rewards.
to the agent. Ultimately, the agent’s sole objective is to The agent’s behavior is defined by a policy π : S →
maximize the total reward it receives. A policy defines A, which maps states to actions. The goal of the agent is
how a learning agent should behave at a particular to find an optimal policy π ∗ that maximizes the expected
point in time. In simple terms, the agent’s policy is cumulative reward, often termed the return (Gt ), which
the mapping from a possible state to a potential action. is defined as:
The value function corresponds to the agent’s current ∞
mapping from the set of possible states to its estimates X
Gt = γ k R(st+k , at+k ) (1)
of the net long-term reward it can expect once it visits
k=0
4

Central to solving an MDP are value functions, which and exploitation and can be applied with on-policy and
estimate the expected return. The state-value function off-policy RL algorithms. Using decoupling policies,
Vπ (s) under policy π is the expected return starting from DeRL improves robustness and sample efficiency in
state s and following policy π thereafter [3], [65], [62]: sparse reward environments. More advanced methodolo-
gies have been used in [76], where the study investigated
Vπ (s) = Eπ [Gt |st = s] (2) the trade-off between exploration and exploitation in
Similarly, the action-value function Qπ (s, a) is the ex- continuous-time RL using an entropy-regularized re-
pected return starting from state s, taking action a, and ward function. As a result, the optimal exploration-
thereafter following policy π: exploitation balance was achieved through a Gaussian
distribution for the control policy, where exploitation was
Qπ (s, a) = Eπ [Gt |st = s, at = a] (3) captured by the mean and exploration by the variance.
Equations 2 and 3 are referred to Bellman Equations. Moreover, various strategies such as ǫ-c, Upper Confi-
To find the optimal policy π ∗ , RL algorithms iteratively dence Bound (UCB), and Thompson Sampling are em-
update value functions based on experience [5], [66]. ployed to balance this trade-off in simpler environments,
In Q-learning (will be explained later in Alg. 10), the like Bandits [7], [1], [8].
update rule for the action-value function is: In the subsequent sections, we will undertake an
 examination of various algorithms within the domain
Q(st , at ) ← Q(st , at ) + α R(st , at )+ of RL, inclusive of the associated papers, as well as
an analysis of their respective merits and drawbacks.
 (4)
′ It is imperative to acknowledge that these algorithms
γ max Q(s t+1 , a ) − Q(s t , a t ) fall into three overarching categories: Value-based Meth-
a′

ods, Policy-based Methods, and Actor-Critic Methods.


where α is the learning rate. For detailed explanations
Section III will initiate the discussion by focusing on
of MDPs and RL in general, readers are referred to [1],
Value-based Methods, delineated by its three core com-
[6], [67], [68], [69], [70], [3]
ponents: Tabular Model-free, Tabular Model-based, and
Approximation Model-free Methods. Section IV will
B. Exploration vs Exploitation subsequently address the sole method within the Policy-
It is important to note that exploration and exploitation based Methods umbrella, namely Approximation Model-
represent a fundamental trade-off in RL. The objective free. Furthermore, section V will offer detailed insights
of exploration is to discover the effects of new actions, into Actor-Critic-based Methods. In section VI, we give
whereas the objective of exploitation is to select actions a detailed explanation of how this survey is meant to be
that have been shown to yield high rewards, to the best read to get the most out of it. Finally, section VII will
of the knowledge of the agent at that timestep. provide a synthesis of the paper through a review of key
Exploration techniques can be classified into two: points.
undirected and directed exploration methods. Undirected
exploration methods such as Semi-uniform (ǫ-greedy) III. VALUE - BASED M ETHODS
exploration and the Boltzmann exploration try to explore Value-based Methods in RL are techniques that focus
the whole state-action space by assigning positive prob- on estimating the value of states or state-action pairs
abilities to all possible actions [1], [71]. On the other to guide decision-making. An essential component of
hand, directed exploration methods like the E 3 algorithm the methodology is the learning of a value function,
and exploration bonus use the statistics obtained through which quantifies the expected long-term reward for a
past experiences to execute efficient exploration [72], given state under a given policy. Value-based Methods
[73]. are ones that iteratively update their value estimates
Balancing exploration and exploitation is a critical based on the observed rewards and transitions. Examples
aspect of RL, and various strategies have been developed of Value-based methods algorithms include Q-learning
to manage this trade-off effectively. For instance, in and State-Action-Reward-State-Action (SARSA) (will
[74], authors introduced a Model-based RL method that be discussed briefly later in this chapter). These methods
dynamically balances exploitation and exploration, par- aim to derive an optimal policy by maximizing the value
ticularly in changing environments. By using Bayesian function, enabling the agent to choose actions that lead
inference with a forgetting effect, it estimates state- to the highest cumulative rewards.
transition probabilities and adjusts the balance parameter Value-based Methods are divided into two broad cat-
based on action-outcome variations and environmental egories: Tabular Model-based and Tabular Model-free
changes. Furthermore, [75] proposed Decoupled RL methods. Model-based methods refer to a group of
(DeRL) which trains separate policies for exploration methods that rely on an explicit or learned model of
5

the environment’s dynamics. The model predicts how Algorithm 1 Policy Iteration
the environment will respond to an agent’s actions (state 1: Initialize policy π arbitrarily
transitions and rewards). On the other hand, Model-free 2: repeat
methods do not rely on a model of the environment. 3: Perform policy evaluation to update the value
Instead, they directly learn a policy or value function function Vπ
based on interactions with the environment. 4: Perform policy improvement to update the policy
We will begin by introducing the main part of Tabu- π
lar Model-based methods, Dynamic Programming (DP). 5: until policy π converges
Next, we will explore both Tabular and Approximate
Model-free algorithms. Finally, we will cover the ad-
vanced part of Tabular Model-based methods, Model- based methods are Policy Iteration and Value Iteration.
based Planning. We first start by examining Policy Iteration, then, we
discuss Value Iteration.
A. Tabular Model-based Algorithms a) Policy Iteration: Policy Iteration is a method
In this subsection, we examine the first part of the that iteratively improves the policy until it converges to
Tabular Model-based methods in RL, Dynamic Pro- the optimal policy. It consists of two main steps: policy
gramming methods. It must be noted that the advanced evaluation and policy improvement [1]. Policy Iteration
Tabular Model-based algorithms will be analyzed in consists of two different steps, Policy Evaluation and
section III-D along with various studies. Improvement. Policy Evaluation calculates the value
A Tabular Model-based algorithm is an example of function Vπ (s) for a given policy π. This involves
an RL technique used for solving problems with a finite solving the Bellman expectation equation for the current
and discrete state and action space. These algorithms policy:
are explicitly based on the maintenance and updating
of a table or matrix, which represents the dynamic X X
nature of the environment and its reward. ’Model-based’ Vπ (s) = π(a|s) P (s′ |s, a)[R(s, a, s′ )+γVπ (s′ )]
algorithms are characterized by the fact that they in- a∈A s′ ∈S
(5)
volve the construction of an environmental model for
use in decision-making. Key characteristics of Tabular This step iteratively updates the value of each state
Model-based techniques include Model Representation, under the current policy until the values converge [3].
Planning and Policy Evaluation, and Value Iteration [1]. Policy Improvement, on the other hand, improves the
In Model Representation, the model depicts the dy- policy by making it greedy with respect to the current
namics in a tabular form with transition probabilities value function:
represented as P (s′ |s, a) and the reward function as
R(s, a). These elements define the probability of tran- X
π ′ (s) = arg max P (s′ |s, a)[R(s, a, s′ ) + γVπ (s′ )]
sitioning to a new state s′ and the expected reward a
s′
when taking action a in state s. Planning and Policy (6)
Evaluation involves approximating the updating value This step updates the policy by selecting actions that
function V (s) iteratively using the Bellman equation maximize the expected value based on the current value
until convergence. To calculate the optimal policy, the function [55]. Alg. 1 gives an overview of how policy
algorithm examines all possible future states and their iteration can be implemented.
associated actions. Value Iteration also approximates Value Iteration is another approach in DP, which will
the updating value function V (s) iteratively using the be discussed in the next subsection.
Bellman equation until convergence. Similar to Planning
b) Value Iteration: Value Iteration is another DP
and Policy Evaluation, it analyzes all potential future
method that directly computes the optimal value func-
states and their linked actions to determine the optimal
tion by iteratively updating the value of each state. It
policy.
combines the steps of policy evaluation and policy im-
Tabular Model-based algorithms can be divided into
provement into a single step. Value Iteration consists of
two general categories: DP and Model-based Planning,
two different steps, Value Update and Policy Extraction.
where this variant will be discussed in section III-D1.
Value Update updates the value function for each state
1) Dynamic Programming (DP): DP methods are
based on the Bellman optimality equation:
fundamental techniques used to solve MDPs when a
complete model of the environment is known. These
methods are iterative and make use of the Bellman equa- X
V (s) = max P (s′ |s, a)[R(s, a, s′ ) + γV (s′ )] (7)
tions to compute the optimal policies. Two primary DP- a
s′
6

Algorithm 2 Value Iteration rely on experience, using sample sequences of states,


1: Initialize the value function V (s) arbitrarily actions, and rewards obtained through real or simulated
2: repeat interactions with an environment [1]. Learning from
3: for each state s do real experience is notable because MC does not require
4: Update V (s) using the Bellman optimality prior knowledge of the environment’s dynamics, yet it
equation can still achieve optimal behavior. Similarly, learning
5: end for from simulated experience is powerful. While a model is
6: until the value function V (s) converges necessary, it only needs to generate sample transitions,
7: Extract the optimal policy π ∗ from the converged not the complete probability distributions of all possible
value function transitions as required in DP. To ensure well-defined
returns, MC methods are used specifically for episodic
tasks, where experiences are divided into episodes, and
This step involves iterating through all states and all episodes eventually terminate regardless of the se-
updating their values based on the maximum expected lected actions. Changes to value estimates and policies
return of the possible actions. occur only at the end of an episode (sample returns).
Policy Extraction, happens once the value function As a result, MC methods show incremental behavior
has converged, the optimal policy can be extracted by on an episode-by-episode basis instead of a step-by-
selecting actions that maximize the expected value: step (online) fashion. While the term ”Monte Carlo”
is commonly used broadly for any estimation method
X involving a significant random component, here it specif-
π ∗ (s) = arg max P (s′ |s, a)[R(s, a, s′ ) + γV (s′ )] ically refers to methods based on averaging complete
a
s′
(8) returns [1].
Alg. 2 illustrates the implementation of Value Iteration. In the following, we will explain how MC methods are
Value Iteration is simpler to implement than Policy used to learn the state-value function for a given policy
Iteration since it combines evaluation and improvement π,where there are several ways of doing so.
into a single process. Additionally, it is often faster a) Monte Carlo Estimation of State Values: A
in practice for many problems, as it does not require straightforward method of estimating the value of a state
separate policy evaluation steps [77]. On the downside, based on experience is to average its observed returns
Value Iteration may require a large number of iterations after visits to the state. In terms of value, the state is
to converge, especially for problems with large state defined as the anticipated cumulative discounted reward
spaces. Furthermore, it can be less stable than Policy in the future. The average tends to converge to the
Iteration in some cases due to the combined update step expected value as more returns are observed (the law
[1]. A detailed comparison between Policy and Value of large numbers), which is the basis for MC.
Iterations is given in Table II. The value vπ (s) of a state s under policy π can be
Throughout the next subsection, we start exploring estimated by considering a set of episodes obtained by
Tabular Model-free algorithms by introducing Monte following π and passing through s. Each occurrence of
Carlo (MC) methods first, then, Temporal Difference state s in an episode is known as a visit. When s is visited
(TD) Learning methods later. multiple times within an episode, the first occurrence is
referred to as the first visit to s. According to the First-
B. Tabular Model-free Algorithms visit MC method, vπ (s) is calculated as the average of
first visits to the visited states, while under the Every-
Tabular Model-free algorithms are techniques suit- visit MC method, the average is calculated based on all
able for problems with discrete state and action spaces visits to s.
that are typically small enough to be tabulated. Unlike
Consider a simple 3x3 grid world where an agent
Model-based algorithms, which require a model of the
starts at the top-left corner (State S0 ) and aims to reach
environment’s dynamics, Model-free algorithms learn
the bottom-right corner (Goal State G). The agent can
directly from interactions with the environment without
take one of four actions in each state: up, down, left, or
understanding its underlying mechanics. In this context,
right. Each action transitions the agent to an adjacent
we will explore various methods and algorithms that are
state unless it moves outside the grid boundaries, in
categorized as Tabular Model-free algorithms, starting
which case the agent remains in its current state. The
with MC Methods.
episode ends once the agent reaches the goal state, G.
1) Monte Carlo Methods: In situations where com-
Key Rules:
plete knowledge of the environment is unavailable or
undesirable, MC methods can be employed. MC methods 1) States can be revisited during an episode.
7

TABLE II: Comparison of Policy Iteration and Value Iteration


Aspect Policy Iteration Value Iteration
Convergence Typically fewer iterations needed May require more iterations
Complexity per Iteration More complex (requires policy evaluation) Simpler (single update step)
Stability More stable due to separate steps Can be less stable
Ease of Implementation More complex Simpler
Computational Cost Higher per iteration Lower per iteration

2) The return at the goal state is defined as the sum Algorithm 3 First-visit MC
of rewards from the starting state to the goal state. 1: Initialize:
3) Two approaches are considered for state-value 2: π ← policy to be evaluated
estimation: 3: V ← an arbitrary state-value function
• First-visit MC: Only the first visit to each 4: Returns(s) ← an empty list, for all s ∈ S
state in an episode is used for value estimation. 5: repeat
• Every-visit MC: All visits to each state in an 6: Generate an episode using π
episode are considered. 7: for each state s appearing in the episode do
Agent’s Behavior Across Episodes: 8: G ← return following the first occurrence of
Episode 1: s
9: Append G to Returns(s)
• Actions: Right, Right, Down, Down (Reaches G).
10: V (s) ← average(Returns(s))
• Returns: G is visited once with a return of 5.
11: end for
Episode 2: 12: until forever
• Actions: Up, Right, Right, Down, Down (Reaches
G).
• Returns: G is visited twice, with returns of 8 (first • Using Every-visit MC, the state value reflects the
visit) and 8 (second visit). average of all returns across all visits to that state.
Episode 3: In MC, both the First-visit MC and the Every-visit MC
• Actions: Right, Up, Right, Down, Down (Reaches methods converge toward vπ (s) as the number of visits
G). approaches infinity. Alg. 3 examines the First-visit MC
• Returns: G is visited once with a return of 6. method for estimating Vπ . The only difference between
State-Value Estimation for G: Every-visit and First-visit, as stated above, lies in line
8. For Every-visit MC, we should return following the
1) First-visit MC:
every occurrence of state s.
• Considers only the first visit to G in each
b) MC Estimation of Action Values (with Explo-
episode.
ration Starts): It becomes particularly advantageous to
• Returns for G:
estimate action values (values associated with state-
– Episode 1: 5 action pairs) rather than state values in the absence of an
– Episode 2: 8 (first visit only) environment model. State values alone are adequate for
– Episode 3: 6 determining a policy by examining one step ahead and
5+8+6
• Average Return for G: 3 = 6.33. selecting the action that leads to the optimal combination
2) Every-visit MC: of reward and next state [1]. It is, however, insufficient
• Considers all visits to G in each episode. to rely solely on state values without a model. An
• Returns for G: explicit estimate of each action’s value is essential in
order to provide meaningful guidance in the formulation
– Episode 1: 5
of policy. MC methods are intended to accomplish
– Episode 2: 8, 8
this objective. As a first step, we address the issue of
– Episode 3: 6
5+8+8+6
evaluating action values from a policy perspective in
• Average Return for G: 4 = 6.75. order to achieve this goal.
General Insights: During policy evaluation for action values, we esti-
• Using First-visit MC, each state value is updated mate qπ (s, a), which represents the anticipated return
using the average return from its first visits across when initiating in state s, taking action a, and following
episodes (e.g., G in Episode 2 considers only the the policy π [78]. In this task, MC methods are similar,
first return, 8). except that visits to state-action pairs are used instead
8

of states alone. When state s is visited and action a is In order to improve the policy, it is necessary to make
taken during an episode, then the state-action pair (s, a) the policy greedy regarding the current value function.
is considered visited. Based on the average of returns As a result, an action-value function is used to construct
following all visits to a state-action pair, the Every-visit the greedy policy without requiring a model. Whenever
MC method estimates its value. However, the First-visit an action-value function q is there, a greedy policy is
MC method averages the returns after each state visit and defined as one that selects the action with the maximal
action selection occurs. It is evident that both methods action-value for each state s ∈ S.
exhibit quadratic convergence as the number of visits to
each state-action pair approaches infinity [1].
π(s) ≡ arg max q(s, a) (10)
A deterministic policy π presents a significant chal- a
lenge in that numerous state-action pairs may never be
visited. When following π, only one action is observed Policy Improvement can be executed by formulating
for each state. As a consequence, MC estimates for each πk+1 as the greedy policy with respect to qπk . The
the remaining actions do not improve with experience, Policy Improvement theorem, as discussed earlier, is then
presenting a significant problem. As a result of learning applicable to πk and πk+1 because, for all s ∈ S, there
action values, one can choose among all available actions is:
in each state more easily. It is crucial to estimate the
value of all actions from each state and not only the qπk (s, πk+1 (s)) = qπk (s, arg max qπk (s, a)) =
one that is at present favored for the purpose of making a
(11)
informed comparisons. max qπk (s, a) ≥ qπk (s, πk (s)) = vπk (s)
a
It may be beneficial to explore starts in some sit-
uations, but they cannot be relied upon in all cases, Based on the General Policy Theorem [1], each πk+1
especially when learning directly from the environment. is uniformly superior to πk , or equally optimal, which
Therefore, the initial conditions are less likely to be makes both policies optimal. As a result, the overall
favorable in such situations. Alternatively, stochastic process converges to an optimal value function and
policies that select all actions in each state with a non- policy. Using MC methods, optimal policies can be
zero probability may be considered in order to ensure determined solely based on sample episodes, without the
that all state-action pairs are encountered. As a first step, need for additional information about the environment’s
we will continue to assume that starts will be explored dynamics.
and conclude with a comprehensive MC simulation Despite the convergence guarantee for MC simula-
approach. tions, two key assumptions must be addressed to create a
To begin, we consider an MC adaptation of classical practical algorithm: the presence of exploration starts in
Policy Iteration. We use this approach to iteratively episodes and the need for an infinite number of episodes
evaluate and improve policy starting with an arbitrary for policy evaluation. Our focus is on removing the
policy π0 and ending with an optimal policy and optimal assumption of an infinite number of episodes for pol-
action-value function: icy evaluation. This can be achieved by approximating
qπk during each evaluation, using measurements and
Evaluation Improvement Evaluation assumptions to minimize error. Although this method
π0 −−−−−→ qπ0 −−−−−−−→ π1 −−−−−→
could theoretically ensure correct convergence, it often
Improvement Evaluation Improvement
qπ1 −−−−−−−→ π2 −−−−−→ . . . −−−−−−−→ (9) requires an impractically large number of episodes, es-
Improvement pecially for complex problems. Alternatively, we can
π∗ −−−−−−−→ qπ∗
avoid relying on infinite episodes by not fully completing
As the approximate action-value function approaches the evaluation before improving the policy. Instead, the
the true function asymptotically, many episodes occur. value function is adjusted towards qπk incrementally
We will assume that there are an infinite number of across multiple steps, as seen in Generalized Policy
episodes we observe and that these episodes are gener- Iteration (GPI). In Value Iteration, this approach is evi-
ated with explorations starts for now. Exploration starts dent, where only one policy evaluation occurs between
referring to an assumption we make. We assume that policy improvements. MC policy iteration, by its nature,
all the states have a non-zero probability of starting. alternates between evaluation and improvement after
This approach encourages exploration as well though each episode. The returns observed in an episode are
not practical as this assumption does not hold true in used for evaluation, followed by policy improvement for
many real-world applications. For any arbitrary policy all visited states. Over the next subsection, we analyze
πk under these assumptions, MC methods will compute the integration of Importance Sampling [79], a well-
qπk accurately. known concept in Statistics, with MC methods.
9

c) Off-Policy Prediction via Importance Sampling: Algorithm 4 Off-Policy Prediction (via Importance
The use of RL as a control method is confronted with Sampling)
a fundamental dilemma, namely the need to learn the 1: Initialize, for all s ∈ S, a ∈ A(s):
value of actions based on the assumption of subsequent 2: Q(s, a) ∈ R (arbitrarily)
optimal behavior, contrasting with the need for non- 3: C(s, a) ← 0
optimal behavior to explore all possible actions in order 4: π(s) ← arg maxa Q(s, a) (ties broken randomly)
to find the optimal action. This conundrum leaves us 5: repeat
with the question of how we can learn about optimal 6: b ← any soft policy
policies while operating under exploratory policies. A 7: Generate an episode using b:
Policy-based approach serves as a compromise in this (S0 , A0 , R1 , . . . , ST −1 , AT −1 , RT )
situation. As part of this strategy, we seek to learn the 8: G←0
action values for a policy that, while it is not optimal, is 9: W ←1
close to it and incorporates mechanisms for exploration. 10: for each step of episode, t = T − 1, . . . , 0: do
Due to its exploratory nature, it does not directly address 11: G ← γG + Rt+1
the issue of learning the optimal policy action values. 12: C(St , At ) ← C(St , At ) + W
Off-policy learning is an effective method of address- 13: Q(St , At ) ← Q(St , At ) + C(SW t ,At )
 
ing this challenge by utilizing two distinct policies: a G − Q(St , At )
target policy (π) whose objective is to become the opti- 14: π(St ) ← arg maxa Q(St , a)
mal policy, and a behavior policy (b) whose purpose is (with ties broken consistently)
to generate behavior. The dual-policy framework allows 15: if At 6= π(St ) then
exploration to occur independently of learning about 16: break ⊲ Proceed to next episode
the optimal policy, with learning occurring from data 17: end if
generated outside the target policy by behavior policy. 18: W ← W · b(At1|St )
Off-policy methods are more versatile and powerful 19: end for
than on-policy methods. On-policy methods can also 20: until forever
be incorporated as a special case when both target and
behavior policies are the same.
On the other hand, off-policy methods introduce ad- punctuality. However, the method’s success depended
ditional complexity, which requires the use of more heavily on the availability and quality of historical data,
sophisticated concepts and notations. Off-policy learning and its scalability to larger, more complex networks
involves using data from a different policy, resulting remained a consideration.
in higher variance and slower convergence than on- In [80], a Renewal MC (RMC) algorithm was devel-
policy learning. On-policy methods provide simplicity oped to reduce variance, avoid delays in updating, and
and direct learning from the agent’s exploratory actions, achieve quicker convergence to locally optimal policies.
while off-policy methods provide a robust framework for It worked well with continuous state and action spaces
learning optimal policies indirectly through exploration and was applicable across various fields, such as robotics
guided by a separate behavior policy. Essentially, the and game theory. The method also introduced an ap-
dichotomy between on-policy and off-policy learning proximate version for faster convergence with bounded
represents the exploration-exploitation trade-off that un- errors. However, the performance of RMC was depen-
derlies RL, which enables agents to learn and adapt to dent on the chosen renewal set and its size.
complex environments in a variety of ways [3]. Alg. 4 In [81] authors introduced an MC off-policy strat-
provides a general overview of this algorithm. egy augmented by rough set theory, providing a novel
Throughout the next few paragraphs, we will ana- approach. This integration offered new insights and
lyze selected research studies that have employed MC methodologies in the field. However, the approach’s
methods and its mentioned variations, analyzing their complexity might challenge broader applicability, and
rationale and addressing their specific challenges. In an- the study’s focus on theoretical formulations necessitated
alyzing these papers in depth, we intend to demonstrate further empirical research for validation.
that MC methods are versatile and effective for solving a Authors in [23] introduced a Bayesian Model-free
wide range of problems while emphasizing the decision- Markov Chain Monte Carlo (MCMC) algorithm for
making processes leading to their selection. policy search, specifically applied to a 2-DoF robotic
Based on MC with historical data, [27] proposed manipulator. The algorithm demonstrated practicality
an intelligent train control approach enhancing energy and effectiveness in real implementations, adopting a
efficiency and punctuality. It offered a Model-free ap- gradient-free strategy that simplified the process and
proach, achieving 6.31% energy savings and improving excelled in mastering complex trajectory control tasks
10

within a limited number of iterations. However, its Algorithm 5 MC with Importance Sampling
applicability might be confined to specific scenarios, and 1: Initialize, for all s ∈ S, a ∈ A(s):
the high variance in the estimator presented challenges. 2: Q(s, a) ← arbitrary
The research on MC Bayesian RL (MCBRL) by 3: C(s, a) ← 0
[82] introduced an innovative method that streamlined 4: µ(a|s) ← an arbitrary soft behavior policy
Bayesian RL (BRL). By sampling a limited set of hy- 5: π(a|s) ← an arbitrary target policy
potheses, it constructed a discrete, Partially Observable 6: repeat ⊲ For ever
Markov Decision Process (POMDP), eliminating the 7: Generate an episode using µ:
need for conjugate distributions and facilitating the ap- (S0 , A0 , R1 , . . . , ST , AT , RT , ST )
plication of point-based approximation algorithms. This 8: G←0
method was adaptable to fully or partially observable 9: W ←1
environments, showing reliable performance across di- 10: for t = T − 1, T − 2, . . . , 0 down to 0 do
verse domains. However, the efficacy of the sampling 11: G ← γG + Rt+1
process was contingent on the choice of prior distribu- 12: C(St , At ) ← C(St , At ) + W
tion, and insufficient sample sizes might have affected 13: Q(St , At ) ← Q(St , At ) + C(SW t ,At )
 
performance. G − Q(St , At )
t |St )
In [83], authors introduced a Factored MC Bayesian 14: W ← W · π(A µ(At |St )
RL (FMCBRL) approach to solve the BRL problem 15: if W = 0 then
online. It leveraged factored representations to reduce 16: break ⊲ Exit inner loop
the size of learning parameters and applied partially ob- 17: end if
servable Monte-Carlo planning as an online solver. This 18: end for
approach managed the complexity of BRL, enhancing 19: until convergence or a stopping criterion is met
scalability and efficiency in large-scale domains.
Researchers in [84] discussed developing self-learning
agents for the Batak card game using MC methods for
state-value estimation and artificial neural networks for ducing computational complexity, and it achieved precise
function approximation. The approach handled the large parameter estimation even under challenging conditions.
state space effectively, enabling agents to improve game- Additionally, the method remained scalable for multiple
play over time. However, the study’s focus on a specific signal scenarios without significant increases in compu-
card game might have limited the direct applicability of tational load. However, performance might have been
its findings to other domains. affected below certain signal-to-noise ratios, and the
Next, we will discuss another variant of MC, MC with selection of MC samples and other parameters required
Importance Sampling, which is an off-policy algorithm. problem-specific tuning.
d) MC with Importance Sampling: MC Importance In [30], a method to estimate blocking probabilities
Sampling is a method used to improve the efficiency in multi-cast loss systems through simulation was in-
of MC simulations when estimating expected values. troduced, improving upon static MC methods with Im-
It samples from the proposal distribution rather than portance Sampling. The technique divided the complex
directly from the original distribution. As demonstrated problem into simpler sub-problems focused on blocking
in Alg. 5, re-weighting the samples is based on the ratio probability contributions from individual links, using a
of the original distribution to the proposal distribution. It distribution tailored to the blocking states of each link.
is through this re-weighting that the bias introduced by An inverse convolution method for sample generation,
sampling from the alternative distribution is corrected, coupled with a dynamic control algorithm, achieved
allowing for a more accurate and efficient estimation of significant variance reduction. Although this method
the data [85]. efficiently allocated samples and reduced variance, its
Let us analyze several papers regarding MC Impor- complexity and setup requirements might have made it
tance Sampling. In [43], authors presented a non-iterative less accessible compared to simpler approaches.
approach for estimating the parameters of superimposed Authors in [25] introduced a method for improving
chirp signals in noise using an MC Importance Sam- simulation efficiency in rare-event phenomena through
pling method. The primary method utilized maximum Importance Sampling. Initially developed for Markovian
likelihood to optimize the estimation process efficiently. random walks, this method was expanded to include non-
This approach, which differed from traditional grid- Markovian scenarios and molecular dynamics simula-
search methods, focused on estimating chirp rates and tions. It increased simulation efficiency by optimizing the
frequencies, providing a practical solution to multidi- sampling of successful transition paths and introduced a
mensional problems. The technique was non-iterative, re- method for identifying an optimal importance function to
11

maximize computational efficiency. However, optimizing TABLE III: MC Papers Review


the importance function could be challenging, particu- Application Domain References
larly in systems with many states or those extending Train Control and Energy [27]
beyond Markovian frameworks. Efficiency
[86] presented an advanced MC Importance Sampling Algorithmic RL (Renewal Theory, [80], [81]
Rough Set Theory)
approach for assessing the reliability of stochastic flow
Robotics and Trajectory Control [23]
networks. This method leveraged a recursive state-space Bayesian RL [82], [83]
decomposition to efficiently target critical network seg- Game Strategy and Card Games [84]
ments for sampling, reducing computational demands Signal Processing and Parameter [43]
and enhancing the precision of reliability estimates. Estimation
However, its complexity and dependence on specific Network Reliability and Blocking [30]
Probabilities
network attributes might have limited its applicability in
Simulation Efficiency in [25]
very large or intricate networks. Lastly, to wrap up MC Rare-Event Phenomena
methods, we shall examine another variation, On-Policy Rough Set Theory and [86]
Monte Carlo (without Exploration Starts). Approximation Spaces
e) On-Policy Monte Carlo (MC without Explo-
ration Starts): MC with On-Policy Starts, also known
as MC without Exploration Starts, is a method in spaces to evaluate state values with MC without Ex-
which the agent learns based on episodes generated ploration Starts. Tested on simulated zebra danio fish
by following its current policy. In contrast to methods behavior, this approach integrated observed patterns for
that use exploratory starts, where the agent can begin a refined estimation of action values, showing promise
from any state-action pair, On-Policy MC uses only the for modeling complex systems. While the technique was
actual experiences generated by following the policy innovative, it also introduced computational complexity,
in the environment. As a result of this approach, the highlighting a balance between innovation and practical
agent estimates the value of the policy by averaging the application. The research underscored the potential of
cumulative rewards from all episodes beginning with the pattern-based evaluation, with considerations for scala-
start of the policy and ending with the termination of the bility and broader implementation.
policy. It is important to ensure that sufficient exploration Table III gives a summary of articles that utilized MC
is conducted in order to learn accurate estimates of methods Over the next subsection, we start analyzing
value since the agent’s experiences are limited to what another main category of Tabular Model-free methods,
is dictated by the current policy. The agent may explore TD Learning, as one of the most fundamental concepts
different state-action pairs over time by making policies within RL.
stochastic or incorporating soft exploration strategies. 2) Temporal Difference (TD) Learning: TD learning
The method is particularly useful in scenarios where is undoubtedly the most fundamental and innovative
exploration starts are not feasible or where policy must concept. A combination of MC methods and DP is used
be discovered through direct interaction with the envi- in this method. On one hand, similar to MC approaches,
ronment [1], [87]. TD learning can be used to acquire knowledge from
We would like to mention two studies in this section, unprocessed experience without the need for a model
acknowledging the fact that there are more papers in that describes the dynamics of the environment. On the
the literature. In [88], researchers introduced a novel other hand, TD algorithms are also similar to DP in that
off-policy MC learning method that integrated approx- they refine predictions using previously learned estimates
imation spaces and rough set theory. This approach instead of requiring a definitive outcome in order to
refined the RL process in dynamic environments by proceed (known as bootstrapping).
using weighted sampling based on observed behavior It is important to recognize that the expression within
patterns, enabling a nuanced estimation of action values brackets in the TD(0) update represents a type of er-
and improving the adaptability and efficiency of learning ror. This error measures the discrepancy between the
strategies. The method’s effectiveness relied on accu- estimated value of St and a more refined estimate,
rately identifying and applying behavior patterns, partic- Rt+1 + γV (St+1 ). This discrepancy is known as the
ularly in complex environments. Further exploration and TD error. TD(0) is an essential concept for understand-
comparison with other advanced RL techniques could ing other TD learning algorithms, such as Q-learning,
have provided a more comprehensive understanding of SARSA, and Double Q-learning, among others, which
its efficacy and scalability across various applications. we will explore in subsequent subsections. The following
In [81], researchers presented a method that advanced sections introduce and analyze various TD-based papers
RL by applying rough set theory and approximation in order to gain a general understanding of the mentioned
12

Algorithm 6 Tabular TD(0) Algorithm 7 TD(0)-Replay


1: Initialize the value function V (s) arbitrarily (e.g., 1: Input: α, γ, θinit
V (s) = 0 for all s ∈ S + ) 2: θ ← θinit
2: repeat ⊲ For each episode 3: loop ⊲ Over episodes
3: Initialize state S 4: Obtain initial state S, features φ
4: repeat ⊲ For each step of the episode 5: ê ← In×n , e ← 0n×1
5: A ← action given by policy π for S 6: while (terminal state has not been reached) do
6: Take action A; observe reward R and next 7: Act according to the policy
state S ′ 8: Observe next reward R = Rt+1 , next state
7: V (S) ← V (S) + α[R + γV (S ′ ) − V (S)] Ŝ = St+1 ,
8: S ← S′ and its features φ̂ = φt+1
9: until S is terminal 9: α ← ℓ(α)  
⊤
until convergence or for a specified number of

10:
10: e ← e + αφ γ φ̂ − φ e + R
episodes  ⊤ 
11: ê ← ê + αφ γ φ̂ − φ ê

algorithms before delving into their details. It will be 12: θ ← θ + ê + e


started with TD(0)-Replay and will be finished with N- 13: φ ← φ̂
step SARSA. 14: end while
Alg. 6 describes tabular TD(0) for estimating Vπ . 15: end loop
While MC methods require waiting until the conclusion
of an episode to calculate the update to V (St ) as Gt , TD
methods only need to wait until the subsequent timestep. historical data, ensuring effectiveness as environmental
At time t + 1, TD methods swiftly establish a target and conditions evolve. Its broad applicability across various
perform an effective update using the observed reward domains underscores its versatility, making it suitable for
Rt+1 and the estimated value V (St+1 ). The most basic both theoretical research and practical applications.
form of the TD method executes this update as follows δt refers to the the TD error, Rt+1 is the reward signal,
(line 7 in Alg. 6): γ is a discount factor, φTt is the transpose of feature
vector φt obtained through current state St , θtT is the
V (St ) ← V (St )+α [Rt+1 + γV (St+1 ) − V (St )] (12) transpose of the weight vector θt , and αt is a learning
step; all varies according to time step t. Over the next
Upon transitioning to St+1 and receiving Rt+1 , TD paragraphs, we analyze another variation of TD methods,
methods implement the update immediately. In contrast, TD(λ).
the target for a MC update is Gt , while the target for b) TD(λ): TD(λ) is a powerful and flexible algo-
the TD update is Rt+1 + γV (St+1 ). This TD method is rithm introduced in [4]. TD(λ) generalizes the simpler
called TD(0), or one-step TD. TD(0) is a bootstrapping TD(0) method by introducing a parameter λ (lambda),
method, like DP, as its update is in part on an existing which controls the weighting of n-step returns, blending
estimate [1]. one-step updates and MC methods. This allows the
a) TD(0)-Replay: The TD(0)-Replay algorithm (Alg. 7), introduced in [89], enhances policy learning efficiency by leveraging full experience replay. Utilizing the agent's entire history of interactions allows for comprehensive updates to the value function and policy at each step. This approach accelerates convergence to optimal policies, particularly in complex environments where acquiring new experiences is costly or challenging. TD(0)-Replay efficiently uses full replay of past experiences, crucial in environments where re-experiencing events is costly or impossible. It promotes rapid optimization of the agent's value function, leading to quicker learning and adaptation in dynamic environments, offering significant advantages over methods without replay mechanisms. The algorithm's adaptability is highlighted through dynamic updates of weight parameters based on historical data, ensuring effectiveness as environmental conditions evolve. Its broad applicability across various domains underscores its versatility, making it suitable for both theoretical research and practical applications.
In the notation of Alg. 7, δt refers to the TD error, Rt+1 is the reward signal, γ is a discount factor, φ⊤t is the transpose of the feature vector φt obtained from the current state St, θ⊤t is the transpose of the weight vector θt, and αt is a learning step; all vary with the time step t. Over the next paragraphs, we analyze another variation of TD methods, TD(λ).
b) TD(λ): TD(λ) is a powerful and flexible algorithm introduced in [4]. TD(λ) generalizes the simpler TD(0) method by introducing a parameter λ (lambda), which controls the weighting of n-step returns, blending one-step updates and MC methods. This allows the algorithm to consider the entire trajectory of experiences rather than just the immediate next step. It introduces the concept of eligibility traces, which keep a record of states and how eligible they are for learning updates. This can be seen as a credit assignment problem, answering the question of how much the steps before or after the state being updated contribute. Alg. 8 comprehensively represents the TD(λ) algorithm. An eligibility trace (lines 9–12) is a temporary record of the occurrence of an event, such as the visiting of a state, and these traces decay over time, controlled by the parameter λ, allowing the algorithm to attribute credit for rewards to prior states in a temporally distributed manner.
The value update in TD(λ) is given by (lines 13–15):

V (s) ← V (s) + αδe(s)   (13)
Algorithm 8 TD(λ)
 1: Initialize V (s) arbitrarily (but set to 0 if s is terminal)
 2: repeat ⊲ (for each episode)
 3:     Initialize E(s) = 0, for all s ∈ S
 4:     Initialize S
 5:     repeat ⊲ (for each step of episode)
 6:         A ← action given by π for S
 7:         Take action A, observe reward R, and next state S′
 8:         δ ← R + γV (S′) − V (S)
 9:         E(S) ← E(S) + 1 ⊲ (accumulating traces)
10:         or E(S) ← (1 − α)E(S) + 1
11:                      ⊲ (Dutch traces)
12:         or E(S) ← 1 ⊲ (replacing traces)
13:         for all s ∈ S do
14:             V (s) ← V (s) + αδE(s)
15:             E(s) ← γλE(s)
16:         end for
17:         S ← S′
18:     until S is terminal
19: until convergence or a stopping criterion is met

where α is the learning rate, δ is the TD error (δ = R + γV (s′) − V (s)), and e(s) is the eligibility trace for state s. TD(λ) provides a continuum of methods from TD(0) (pure TD) to MC (full return) methods, often learning more efficiently than either extreme by balancing the bias-variance trade-off. It can quickly propagate information about value estimates through the state space, improving learning efficiency [4], [1].
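As a concrete illustration of Alg. 8, the sketch below implements tabular TD(λ) prediction with accumulating eligibility traces; the env and policy objects follow the same assumed interface as the earlier TD(0) sketch and are not tied to any specific benchmark.

```python
from collections import defaultdict

def td_lambda_prediction(env, policy, episodes=500,
                         alpha=0.1, gamma=0.99, lam=0.9):
    """Tabular TD(lambda) with accumulating eligibility traces (cf. Alg. 8)."""
    V = defaultdict(float)
    for _ in range(episodes):
        E = defaultdict(float)              # eligibility traces, reset each episode
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD error: delta = R + gamma*V(S') - V(S)
            delta = reward + gamma * (0.0 if done else V[next_state]) - V[state]
            E[state] += 1.0                 # accumulating trace for the visited state
            # Credit the TD error to all recently visited states (Eq. 13)
            for s in list(E.keys()):
                V[s] += alpha * delta * E[s]
                E[s] *= gamma * lam         # traces decay by gamma*lambda
            state = next_state
    return V
```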
Now, we need to cover various papers that employed a form of TD(λ) to analyze its practicality. In [90], authors presented KnightCap, a chess program that learned its evaluation function using TDLeaf(λ), a variation of the TD(λ) algorithm integrated with game-tree search. KnightCap improved its rating from 1650 to 2150 in just 308 games over three days by playing on the Free Internet Chess Server. The use of TDLeaf(λ) in conjunction with game-tree search marked a significant advancement in integrating RL with traditional AI methods in chess. The approach leveraged the strengths of both deep evaluation through game-tree search and the learning capabilities of TD methods. KnightCap's rapid improvement demonstrated the effectiveness of online learning against diverse opponents, emphasizing the importance of playing against varied strategies. The adaptation of TD(λ) to TDLeaf(λ) for deep minimax search allowed dynamic improvement of the evaluation function based on real-game outcomes. The success of the algorithm highlighted its potential in environments that required learning from interactions, although starting with intelligent initial parameters was beneficial. While further validation in varied game scenarios and against stronger AI opponents would provide a more comprehensive assessment, the results were promising.
A convergence theorem for TD learning was established in [91] by generalizing earlier convergence results to handle arbitrary time steps, which is crucial for predicting future outcomes based on past states. This work advanced the understanding of TD(λ), ensuring convergence even with linearly dependent state representations, and bridged the gap between TD learning and DP. The extension of the Q-learning convergence proof to TD(0) reinforced the strong convergence properties of TD methods, making them applicable in various RL scenarios. While the paper was primarily theoretical, the results provided strong guarantees about the stability and reliability of TD learning. Further empirical validation in diverse, dynamic environments would offer a complete assessment of the proposed methods.
Authors in [92] strengthened the theoretical foundations of TD learning by proving that TD(λ) algorithms converge with probability 1 under certain conditions. Building on earlier work by Sutton and Watkins, the authors provided a rigorous proof of convergence for TD(λ) in the general case, ensuring reliable and consistent predictions. This theoretical advancement was crucial for the broader adoption of TD learning, providing strong assurances about the stability and reliability of the learning process. The detailed analysis and proof techniques contributed to a deeper understanding of TD learning dynamics, informing future research and development in the field. Although the paper was theoretical, it provided a significant contribution by proving the convergence of TD(λ) with probability 1.
In [93], researchers introduced two algorithms, QV(λ)-learning and the Actor-Critic Learning Automaton (ACLA), built upon TD(λ) methods. QV(λ)-learning combined value function learning with a form of Q-learning, while ACLA used a learning automaton-like update rule for the actor component. These algorithms were tested across various environments, demonstrating their robustness and generalizability. QV(λ)-learning combined the strengths of traditional Q-learning and Actor-Critic methods, leading to more stable and efficient learning. ACLA introduced a flexible mechanism for updating action preferences and adapting to various environments. The empirical results showed that these algorithms outperformed conventional methods such as Q(λ)-learning, SARSA(λ), and Actor-Critic, particularly in learning speed and final performance. While further validation in diverse and dynamic real-world scenarios was needed, the results highlighted the potential for broader applications in complex and partially observable domains. In the next subsection, we start analyzing another form of the TD method, N-step Bootstrapping.
c) N-step Bootstrapping: Having covered TD(0)-Replay and TD(λ), we now have a solid base to explore bootstrapping methods. N-step bootstrapping is a key technique that extends the concept of updating estimated value functions across multiple steps rather than just based on immediate rewards or the value of the next state. This method provides a middle ground between MC methods, which delay updating value estimates until the end of an episode, and one-step TD methods that update values immediately based on the subsequent state. In N-step bootstrapping, the update of the value function is based on n subsequent rewards and the estimated value of the state that follows these n steps. The primary advantage of this approach is that it can lead to faster learning and reduced variance in the updates compared to one-step methods [33], [1], [4]. The generic update rule for the N-step method can be expressed as:

V (St) ← V (St) + α [ Σ_{k=0}^{n−1} γ^k Rt+k+1 + γ^n V (St+n) − V (St) ]   (14)

where Rt+k+1 are the rewards received after taking action At from state St, γ is the discount factor, and α is the learning rate.
d) N-step TD Prediction: As previously explored, N-step bootstrapping forms the basis for various adaptations like N-step TD and N-step SARSA, among others. N-step methods (where n ≠ 1) distinguish themselves by looking ahead multiple steps, which can be adjusted according to specific requirements. This approach contrasts with MC-based methods, which update the value of each state based on the complete sequence of observed rewards from that state until the episode concludes. It also differs from one-step TD methods, which focus solely on the next reward and use the value of the state one step later as a surrogate for the subsequent rewards [94]. This gap can be bridged by performing updates based on an intermediate number of rewards: more than one, but fewer than the total number observed until the end of the episode. A two-step update, for example, would utilize the first two rewards and the estimated value of the state two steps ahead. Additionally, this can be extended to three-step updates, four-step updates, and beyond, each incorporating an increasing number of rewards and subsequent state values.

Algorithm 9 N-step TD for Estimating V ≈ vπ
 1: Input: a policy π
 2: Algorithm parameters: step size α ∈ (0, 1], a positive integer n
 3: Initialize V (s) arbitrarily for all s ∈ S
 4: All store and access operations (for St and Rt) use index mod n + 1
 5: repeat ⊲ Loop for each episode
 6:     Initialize and store S0 such that S0 ≠ terminal
 7:     T ← ∞
 8:     for t = 0, 1, 2, . . . do
 9:         if t < T then
10:             Take an action according to π(·|St)
11:             Observe and store the next reward as Rt+1 and the next state as St+1
12:             if St+1 is terminal then
13:                 T ← t + 1
14:             end if
15:         end if
16:         τ ← t − n + 1 ⊲ τ is the time whose state's estimate is being updated
17:         if τ ≥ 0 then
18:             G ← Σ_{i=τ+1}^{min(τ+n,T)} γ^{i−τ−1} Ri
19:             if τ + n < T then
20:                 G ← G + γ^n V (Sτ+n)
21:             end if
22:             V (Sτ) ← V (Sτ) + α [G − V (Sτ)]
23:         end if
24:     end for
25: until τ = T − 1

Alg. 9 gives a general overview of the N-step TD Prediction algorithm. For a given state St at time t, the update rule for the value function V in N-step TD prediction is (line 22):

V (St) ← V (St) + α(Gt^(n) − V (St))   (15)

where α is the learning rate, and Gt^(n) is derived as follows (line 18):

Gt^(n) = Rt+1 + γRt+2 + · · · + γ^{n−1} Rt+n + γ^n V (St+n)   (16)
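For clarity, the short Python sketch below computes the n-step return of Eq. (16) from stored rewards and then applies the update of Eq. (15); the rewards list, the V dictionary, and the terminal flag are illustrative assumptions rather than elements of a specific benchmark.

```python
def n_step_return(rewards, V, next_state, gamma, n, terminal=False):
    """G_t^(n) = R_{t+1} + gamma*R_{t+2} + ... + gamma^{n-1}*R_{t+n} + gamma^n * V(S_{t+n})."""
    G = 0.0
    for k, r in enumerate(rewards[:n]):     # the n observed rewards
        G += (gamma ** k) * r
    if not terminal:                        # bootstrap only if S_{t+n} is not terminal
        G += (gamma ** n) * V[next_state]
    return G

def n_step_td_update(V, state, G, alpha):
    """V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t))  (Eq. 15)."""
    V[state] += alpha * (G - V[state])
```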
Let us now examine research papers that have implemented these methods and assess their advantages and disadvantages. Authors in [35] introduced an advanced generalization of TD learning by incorporating networks of interrelated predictions. This method expanded traditional TD techniques by connecting multiple predictions over time, allowing for a more detailed approach to learning and representing predictions. TD networks facilitated the learning of interconnected predictions, offering a broader range of representable and learnable predictions. This advancement was particularly useful in solving problems that traditional TD methods could not, as demonstrated by experiments on the random-walk problem and predictive state representations. Experimental results showed that TD networks could learn complex prediction tasks more efficiently than MC methods, particularly in terms of data efficiency and learning speed.
The practical benefits of TD networks were evident in scenarios requiring specific sequences of actions. While the paper focused on small-scale experiments, the results suggested that TD networks had significant potential for broader applicability and effectiveness in various domains.
In [33], an advanced multi-step TD learning technique was introduced, emphasizing the use of per-decision control variates to reduce variance in updates. This method significantly improved the performance of multi-step TD algorithms, particularly in off-policy learning contexts, by enhancing stability and convergence speed. Empirical results from tasks like the 5x5 Grid World and Mountain Car showed that the n-step SARSA method outperformed standard n-step Expected SARSA in reducing root-mean-square error and improving learning efficiency. The approach was compatible with function approximation, underscoring its practical applicability in complex environments.
Researchers in [95] developed a unified framework to study finite-sample convergence guarantees of various Value-based asynchronous RL algorithms. By reformulating these RL algorithms as Markovian Stochastic Approximation algorithms and employing a Lyapunov analysis, the authors derived mean-square error bounds on the convergence of these algorithms. This framework provided a systematic approach to analyzing the convergence properties of algorithms like Q-learning, n-step TD, TD(λ), and off-policy TD algorithms such as V-trace. The paper effectively addressed the challenges of handling asynchronous updates and offered robust convergence guarantees through detailed finite-sample mean-square convergence bounds.
In [96], the "deadly triad" in RL (off-policy learning, bootstrapping, and function approximation) was addressed through an in-depth theoretical analysis of multi-step TD learning. The paper provided a comprehensive theoretical foundation for understanding the behavior of these algorithms in off-policy settings with linear function approximation. A notable contribution was the introduction of Model-based deterministic counterparts to multi-step TD learning algorithms, enhancing the robustness of the findings. The results demonstrated that multi-step TD learning algorithms could converge to meaningful solutions when the sampling horizon n was sufficiently large, addressing a critical issue of divergence in certain conditions.
In [97], researchers delved into various facets of TD learning, particularly focusing on multi-step methods. The thesis introduced the innovative use of control variates in multi-step TD learning to reduce variance in return estimates, thereby enhancing both learning speed and accuracy. The work also presented a unified framework for multi-step TD methods, extending the n-step Q(σ) algorithm and proposing the n-step CV Q(σ) and Q(σ, λ) algorithms. This unification clarified the relationships between different multi-step TD algorithms and their variants. Additionally, the thesis extended predictive knowledge representation into the frequency domain, allowing TD learning agents to detect periodic structures in return, providing a more comprehensive representation of the environment.
"Undelayed N-step TD prediction" (TD-P), developed in [36], integrated techniques like eligibility traces, value function approximators, and environmental models. This method employed Neural Networks to predict future steps in a learning episode, combining RL techniques to enhance learning efficiency and performance. By using a forward-looking mechanism, the TD-P method sought to gather additional information that traditional backward-looking eligibility traces might miss, leading to more accurate value function updates and better decision-making. The TD-P method was particularly designed for partially observable environments, utilizing Neural Networks to handle complex and continuous state-action spaces effectively. To further solidify our knowledge of TD methods, we will analyze the off-policy version of N-step Learning over the following paragraphs.
e) N-step Off-policy Learning: N-step off-policy learning is an advanced method that combines the concepts of N-step returns with off-policy updates. This approach leverages the benefits of multi-step returns to improve the stability and performance of learning algorithms while allowing the use of data generated by a different policy (the behavior policy b) than the one being improved (the target policy π) [1]. Multi-step returns differ from one-step methods by considering cumulative rewards over multiple steps, providing a more comprehensive view of future rewards and leading to more accurate value estimates. In off-policy learning, the policy used to generate behavior (behavior policy) is different from the policy being optimized (target policy), allowing for the reuse of past experiences and improving sample efficiency by enabling learning from demonstrations or historical data. Importance sampling is used to correct the discrepancy between the behavior policy and the target policy in N-step off-policy learning, where Importance Sampling ratios adjust the updates to account for differences in action probabilities under the two policies. The N-step off-policy return, Gt^(n), can be calculated in the following way:

Gt^(n) = Rt+1 + γRt+2 + · · · + γ^{n−1} Rt+n + γ^n V (St+n)   (17)

To ensure the update is off-policy, Importance Sampling ratios are incorporated as follows:

Gt^(n) = Σ_{k=0}^{n−1} γ^k ( Π_{i=0}^{k−1} ρt+i ) Rt+k+1 + ( Π_{i=0}^{n−1} ρt+i ) γ^n V (St+n)   (18)

where ρt+i = π(At+i|St+i) / b(At+i|St+i) is the Importance Sampling ratio, with π as the target policy and b the behavior policy [98].
Now that we have acquired the fundamental knowledge, we can examine papers based on N-step Off-policy learning. [99] introduced the Greedy Multi-step Value Iteration (GM-VI) algorithm, which approximated the optimal value function using a novel multi-step bootstrapping technique. The method dynamically adjusted the step size along each trajectory based on a greedy principle, effectively balancing information propagation and estimation accuracy. GM-VI's adaptive step size mechanism is adjusted according to the quality of trajectory data, enhancing learning efficiency and robustness. The method could safely learn from arbitrary behavior policies without needing off-policy corrections, simplifying the algorithm and reducing variance. Theoretical analysis showed that GM-VI converged to the optimal value function faster than traditional one-step Bellman optimality operators. Empirical results demonstrated state-of-the-art performance on standard benchmarks, with GM-VI showing superior sample efficiency and reward performance compared to classical algorithms on benchmarks such as Mountain Car and Acrobot.
The core strength of [100] lies in its innovative approach to eliminating the use of Importance Sampling ratios in multi-step TD learning. By removing these ratios, the method reduced estimation variance, enhancing stability and efficiency in off-policy learning. The introduction of action-dependent bootstrapping parameters allowed the algorithm to adapt flexibly to different state-action pairs, further reducing variance in updates. Empirical validation on challenging off-policy tasks demonstrated the algorithm's stability and superior performance compared to state-of-the-art counterparts, highlighting its practical applicability.
[44] introduced innovative bias management techniques in multi-step Goal-Conditioned RL (GCRL) by categorizing and addressing shooting and shifting biases. This approach allowed for larger step sizes, enhancing learning efficiency and performance. The proposed methods were validated across various tasks, showing superior performance compared to baseline multi-step GCRL benchmarks. The use of quantile regression to manage biases effectively demonstrated the practical applicability of the approach. The introduction of resilient strategies for bias management ensured robust improvement in learning efficiency, with empirical results indicating effectiveness in diverse scenarios. Table IV gives an overview of the TD-based papers.

TABLE IV: TD (and its variations) Papers Review
Application Domain | References
General RL (Policy learning, raw experience) | [89], [4]
Games (Chess) | [90]
Theoretical Research (Convergence, stability) | [91], [92], [95], [96], [97], [99]
Dynamic Environments (Mazes, Mountain Car, Atari) | [33]
Partially Observable Environments (Predictions) | [93], [35], [36]
Benchmark Tasks (Mountain Car, Acrobot, GCRL) | [101], [44]

Now, it is time to study and analyze one of the most widely used algorithms in RL, Q-learning.
f) Q-learning: Moving on from TD(0), a significant breakthrough was made by [5] with the introduction of Q-learning, a Model-free algorithm considered off-policy TD control. Q-learning enables an agent to learn the value of an action in a particular state through experience, without requiring a model of the environment. It operates on the principle of learning an action-value function that gives the expected utility of taking a given action in each state and following a fixed policy thereafter.
A general overview of Q-learning is demonstrated in Alg. 10. The core of the Q-learning algorithm involves updating the Q-values (action-value pairs), where the learned action-value function, denoted as Q, approximates q∗, the optimal action-value function, regardless of the policy being followed. This significantly simplifies the algorithm's analysis and has facilitated early proofs of convergence. However, the policy still influences the process by determining which state-action pairs are visited and subsequently updated (lines 4-9).

Q(St, At) ← Q(St, At) + α [Rt+1 + γ maxa Q(St+1, a) − Q(St, At)]   (19)

Algorithm 10 Q-learning
 1: Initialize Q(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
 2: repeat ⊲ (for each episode)
 3:     Initialize S
 4:     repeat ⊲ (for each step of episode)
 5:         Choose A from S using policy derived from Q (e.g., ε-greedy)
 6:         Take action A, observe R, S′
 7:         Q(S, A) ← Q(S, A) + α [R + γ maxa Q(S′, a) − Q(S, A)]
 8:         S ← S′
 9:     until S is terminal
10: until convergence or a stopping criterion is met
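The following minimal Python sketch mirrors Alg. 10 and the update in Eq. (19). The ε-greedy helper and the env interface (reset, step, and an actions(state) listing of legal actions) are illustrative assumptions and not part of any specific surveyed implementation.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-learning (off-policy TD control), following Alg. 10."""
    Q = defaultdict(lambda: defaultdict(float))   # Q[s][a], zero-initialized

    def epsilon_greedy(state):
        if random.random() < epsilon or not Q[state]:
            return random.choice(env.actions(state))      # explore
        return max(Q[state], key=Q[state].get)            # exploit

    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = epsilon_greedy(state)
            next_state, reward, done = env.step(action)
            # Off-policy target uses max_a Q(S', a) regardless of the next action taken
            best_next = 0.0 if done else max(Q[next_state].values(), default=0.0)
            Q[state][action] += alpha * (reward + gamma * best_next - Q[state][action])
            state = next_state
    return Q
```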
Now that we have established a foundational understanding of Q-learning, it is appropriate to explore the specifics of research papers that have utilized this algorithm. Q-learning, being fundamental and relatively straightforward, has been extensively applied across numerous studies. Here, we will briefly touch upon a variety of notable studies.
Researchers in [102] investigated the performance differences between deterministic and stochastic policies within a grid-world problem using Q-learning. The authors developed a flexible agent capable of operating under both policy types to determine which parameters maximized cumulative reward. Their results indicated the superiority of deterministic policies in achieving higher rewards in a structured task environment. The study's strength lies in its clear methodological execution, systematically exploring the impact of policy variations on learning outcomes. However, the study was confined to a simulated grid world, which may not fully capture the complexities of real-world environments, potentially reducing the generalizability of the findings. Additionally, the paper focused on policy optimization without significant consideration of the computational costs associated with each policy type.
In [103], the authors explored the development of a hardware architecture optimized for Q-learning, focusing on real-time applications. Key innovations included low power usage, high throughput, and minimal use of hardware resources. The implementation, tested on an Evaluation Kit, demonstrated improved performance metrics such as speed, power consumption, and hardware resource use compared to existing Q-learning hardware accelerators. While the study presented a comprehensive approach to optimizing Q-learning for hardware implementation, making it suitable for real-time and Internet of Things (IoT) applications, it was limited by the specific hardware used for testing and the types of environments evaluated. The generalizability of the findings to other hardware or more complex real-world applications was not fully explored.
The application of Q-learning to improve power allocation in Wireless Body Area Networks (WBANs) was explored in [38]. The focus was on enhancing energy efficiency while maintaining effective communication within the network, particularly through controlling transmission power, reducing interference, and optimizing routing paths. The paper addressed critical aspects of WBANs that impact their practical deployment, especially in healthcare settings. While it presented a robust approach to managing interference and power consumption, the paper did not discuss the computational overhead introduced by the Q-learning algorithm and game theory applications, which was crucial for feasibility in devices with limited computational capabilities.
In [104], authors explored improvements to the Q-learning algorithm's efficiency using massively parallel computing techniques, such as multi-threading and Graphics Processing Unit (GPU) computing. The paper addressed the issue of slow convergence in traditional Q-learning, particularly in complex, real-world system control problems that demanded quick adaptation to dynamic environments. By leveraging GPUs and multi-threading, the authors significantly decreased convergence time, which was vital for real-time applications like robotics and industrial control. However, the paper did not thoroughly examine how parallel processing affected the Q-learning algorithm's integrity and reliability, particularly under different computational loads.
Researchers in [40] investigated the use of Q-learning to automate portfolio re-balancing. The research applied basic Q-learning agents to discern trading patterns across 15 Indian financial assets using technical indicators. The paper innovatively harnessed Q-learning to improve portfolio re-balancing, a domain where traditional rule-based systems might not perform optimally. However, the simplicity of the Q-learning models used might not have fully captured the intricate dependencies and dynamics of highly fluctuating markets.
Authors in [105] extended traditional Q-learning to handle POMDPs by incorporating L0 regularization to manage the complexity of the state representation derived from the agent's history. This innovative approach transformed non-Markov and partially observable environments into a history-based RL framework, allowing Q-learning to be applicable in more complex scenarios. While the method showed improved computational and memory efficiency, the reliance on L0 regularization could have made the learning process highly dependent on the initial choice of features, potentially limiting adaptability in dynamic environments.
An innovative approach to enhancing the performance of Proportional Integral Derivative (PID) controllers for magnetic levitation train systems through the integration of Q-learning was introduced in [106]. This method adapted PID parameters in real-time, maintaining optimal levitation control despite non-linearities and uncertainties. While it showed improved performance over traditional PID controllers, the requirement for continuous learning and adjustment could be computationally intensive.
A novel application of Deep Q-learning (discussed in subsection III-C1) to optimize parameter settings in
Hadoop was introduced in [107], improving efficiency by iteratively adjusting configurations based on feedback from performance metrics. This approach reduced the manual effort and expertise required for parameter tuning and demonstrated marked improvements in processing speeds. However, the model's simplification of the parameter space might have overlooked interactions that could achieve further optimization, and the scalability and adaptability in varying operational environments remained somewhat uncertain.
Researchers in [45] presented a novel approach to enhancing the accuracy of nuclei segmentation in pathological images using Q-learning and Deep Q-Network (DQN) (will be discussed in III-C1) algorithms. The study reported improvements in Intersection over Union for segmentation, critical in cancer diagnosis. While the approach adapted the segmentation threshold dynamically, leading to better outcomes, the reliance on RL introduced computational complexity and required significant training data.
The sampling efficiency of Q-learning was discussed in [108], exploring the feasibility of making Model-free algorithms as sample-efficient as Model-based counterparts. The research introduced a variant of Q-learning that incorporated UCB exploration, achieving a regret bound that approached the best possible by Model-based methods. This paper marked a significant theoretical advance by providing rigorous proof that Q-learning could achieve sublinear regret in episodic MDPs without requiring a simulator. However, the analysis relied on assumptions such as an episodic structure and precise model knowledge, which might not have translated well to more uncertain environments.
Researchers in [109] examined the deployment of Q-learning on Field Programmable Gate Arrays (FPGAs) to enhance processing speed by leveraging parallel computing capabilities. The study demonstrated substantial acceleration of the Q-learning process, making it suitable for real-time scenarios. However, the focus on a particular FPGA model might have restricted the broader applicability of the findings to different hardware platforms.
The challenge of managing frequent handovers in high-speed railway systems using a Q-learning-based approach was addressed in [110]. The authors proposed a scheme that minimized unnecessary handovers and enhanced network performance by dynamically adjusting to changes in the network environment. While the approach showed promise, its complexity, computational demands, and need for real-time processing presented challenges for practical implementation.
In [111], authors explored optimizing taxi dispatching and routing using Q-learning and MDPs to enhance taxi service efficiency in urban environments. The method enabled dynamic adjustments based on real-time conditions, potentially alleviating urban traffic congestion. However, the model's effectiveness heavily depended on the accuracy and comprehensiveness of input data, and the computational complexity might have posed challenges for real-time deployment.
The study [112] addressed the optimization of network routing within Optical Transport Networks using a Q-learning-based algorithm under the Software-Defined Networking framework. The approach improved network capacity and efficiency by managing routes based on learned network conditions. While the method outperformed traditional routing strategies, its scalability and computational complexity in real-world scenarios remained areas for further exploration.
In [113], authors proposed a Q-learning algorithm motivated by internal stimuli, specifically visual novelty, to enhance autonomous learning in robots. The method blended Q-learning with a cognitive developmental approach, making the robot's learning process more dynamic and responsive to new stimuli. While the approach reduced computational costs and enhanced adaptability, its scalability in complex environments was not thoroughly explored.
The use of Q-learning and Double Q-learning (discussed in the next subsection) to manage the duty cycles of IoT devices in environmental monitoring was analyzed in [114]. The study demonstrated significant advancements in optimizing IoT device operations, enhancing energy efficiency and operational effectiveness. However, the performance heavily depended on environmental parameters, and the increased memory requirements for Double Q-learning might have posed challenges for memory-constrained IoT devices.
The study [115] presented a robust implementation of RL to address resource allocation challenges in Fog Radio Access Networks (Fog RAN) for IoTs. The use of Q-learning effectively adapted to changes in network conditions, improving network performance and reducing latency. While the approach showed promise, the reliance on Q-learning might not have fully captured the complexities of larger-scale IoT networks, and the exploration-exploitation trade-off could have led to suboptimal performance in highly dynamic environments.
The overview of the reviewed papers categorized by their respective domains is presented in Table V. Q-learning has an overestimation bias primarily because of the max operator used in the Q-value update rule, which tends to favor overestimated action values. To address this, researchers introduced a new architecture to tackle this issue, which we will discuss in the next subsection.
g) Double Q-learning: Double Q-learning was introduced in [118] as an enhancement to traditional Q-learning to reduce the overestimation of action values,
which can be problematic in environments with stochastic rewards. Traditional Q-learning tends to overestimate because it uses the maximum action value as an approximation for the maximum expected action value. To address this, as shown in Alg. 11, Double Q-learning maintains two separate estimators (Q-tables), QA and QB (line 1). Each estimator is updated independently using the maximum value from the other estimator, reducing the overestimation bias typically seen in standard Q-learning [118] (lines 3-10).

TABLE V: Q-learning Papers Review
Application Domain | References
General RL (Policy learning, raw experience) | [102], [108]
Games and Simulations (Chess, StarCraft) | [105], [116]
Theoretical Research (Convergence, Stability) | [91], [5]
Dynamic and Complex Environments (Mazes, Mountain Car, Atari) | [47], [93]
Partially Observable Environments (Predictions, POMDPs) | [117]
Real-time Systems and Hardware Implementations (FPGA, Real-time applications) | [103], [38], [109]
Energy Efficiency and Power Management (IoT, WBAN, PID controllers) | [114]
Financial Applications (Portfolio Rebalancing) | [40]
Data Management and Processing (Hadoop, Pathological Images) | [107], [45]
Network Optimization (Optical Transport Networks, Fog RAN) | [112], [115]
Transportation Systems (Railway, Taxi, Electric Vehicles) | [106], [110]
Autonomous Systems and Robotics (Learning, Routing) | [111], [113]

Algorithm 11 Double Q-learning
 1: Initialize QA, QB, s
 2: repeat
 3:     Choose a based on QA(s, ·) and QB(s, ·), observe r, s′
 4:     Choose (e.g., random) either UPDATE(A) or UPDATE(B)
 5:     if UPDATE(A) then
 6:         Define a∗ = arg maxa QA(s′, a)
 7:         QA(s, a) ← QA(s, a) + α(s, a) [r + γQB(s′, a∗) − QA(s, a)]
 8:     else if UPDATE(B) then
 9:         Define b∗ = arg maxa QB(s′, a)
10:         QB(s, a) ← QB(s, a) + α(s, a) [r + γQA(s′, b∗) − QB(s, a)]
11:     end if
12:     s ← s′
13: until end

The core update rule for Double Q-learning is as follows: select an action a based on the average of QA and QB, then update one of the Q-functions at random with probability 0.5, for example QA, using (line 7):

QA(s, a) ← QA(s, a) + α [ r + γQB(s′, arg maxa QA(s′, a)) − QA(s, a) ]   (20)

where s′ is the next state in the environment. Alternatively, we update QB similarly using (line 10):

QB(s, a) ← QB(s, a) + α [ r + γQA(s′, arg maxa QB(s′, a)) − QB(s, a) ]   (21)

In essence, the approach alternates updates between QA and QB using the maximum action value from the other table, which is believed to provide a more unbiased estimate of the underlying value function. This method helps to mitigate the positive bias seen in traditional Q-learning by occasionally underestimating the maximum expected values, aiming to strike a balance closer to true expectations. Over the following paragraphs, we will examine a handful of research papers that utilized Double Q-learning.
The focus of [119] was on the FPGA-based implementation of the Double Q-learning algorithm, emphasizing its efficiency in stochastic environments and its superiority over standard Q-learning due to reduced overestimation biases. This FPGA implementation allowed for the parallel processing of actions, significantly speeding up the learning process. This was particularly beneficial for applications requiring real-time decision-making. Additionally, the use of asynchronous reading and synchronous writing memory architectures optimized data exchange and reduced the hardware footprint, which was a critical advancement in hardware implementations of RL algorithms. However, while the hardware implementation was efficient for the stated range of states (from 8 to 256 states), the paper did not extensively discuss scalability beyond this range, which might be a limitation for environments with a much larger state space. The paper focused on the hardware aspect without a detailed discussion of how the implementation could be adapted to different RL scenarios or environments, which may have limited its applicability without additional modifications or tuning.
The integration of A* pathfinding with Double Q-learning to optimize route planning in autonomous driving systems was investigated in [120]. This innovative approach aimed to enhance route efficiency and safety by minimizing the common problem of action value overestimations found in standard Q-learning. The paper introduced a novel combination of A* pathfinding with RL, providing a dual strategy that leveraged the deterministic benefits of A* for initial path planning and the adaptive strengths of the proposed method for real-time adjustments to dynamic conditions, such as unexpected obstacles. This synergy allowed for a balanced approach to navigating real-world driving scenarios efficiently. By employing Double Q-learning, the system addressed and mitigated the issue of action value overestimation, which was prevalent in RL. This enhancement was crucial for autonomous driving applications where decisions had to be both accurate and dependable to ensure safety and operational reliability. On the other hand, this combination, while robust, introduced a significant computational demand that might have impacted the system's performance in real-time scenarios. Quick decision-making was essential in dynamic driving environments, and the increased computational load could have hindered the system's ability to respond promptly. Additionally, the promising method introduced lacked a detailed exploration of how this hybrid model performed across different environmental conditions or traffic scenarios. This limitation might have affected the model's effectiveness in diverse settings without further adaptation or refinement.
Researchers in [121] delved into optimizing IoT devices' power management by employing a Double Q-learning based controller. This Double Data-Driven Self-Learning (DDDSL) controller dynamically adjusted operational duty cycles, leveraging predictive data analytics to enhance power efficiency significantly. A notable strength of the paper was the improved operational efficiency introduced by the Double Q-learning, which effectively handled the overestimation issues found in standard Q-learning within stochastic environments. This led to more precise power management decisions, crucial for prolonging battery life and minimizing energy usage in IoT devices. Furthermore, the DDDSL controller showed a marked performance enhancement, outperforming traditional fixed duty cycle controllers by 42–50%, and the previous Data-Driven Self-Learning (DDSL) model by 2–12%. However, while the performance improvements were compelling, they were obtained under specific conditions, which might have limited the broader applicability of the findings without further adaptations or validations for different operational environments or IoT device configurations.
A Double Q-learning based routing protocol for optimizing maritime network routing, crucial for effective communication in maritime search and rescue operations, was developed in [122]. The use of Double Q-learning in this context aimed to tackle the overestimation problems inherent in Q-learning protocols, which was a significant improvement in maintaining stability in the model's predictions and actions. This approach not only enhanced the routing efficiency but also incorporated a trust management system to ensure the reliability of data transfers and safeguard against packet-dropping attacks. The protocol demonstrated robust performance in various simulated attack scenarios with efficient energy consumption and minimal resource footprint, crucial for the resource-constrained environments in which maritime operations occurred. While the proposed method showed promising results in simulations, the complexity of real-world application scenarios could have posed challenges. Maritime environments were highly dynamic with numerous unpredictable elements, which might have affected the consistency of the performance gains observed in controlled simulations. Additionally, the scalability of this approach when applied to very large-scale networks or under extreme conditions typical of maritime emergencies could have required further validation. The integration of such sophisticated systems also raised concerns about the computational overhead and the practical deployment in existing maritime communication infrastructures.
Authors in [123] explored the application of Double Q-learning to manage adaptive wavelet compression in environmental sensor networks. This method aimed to optimize data transmission efficiency by dynamically adjusting compression levels based on real-time communication bandwidth availability. A key advantage of this approach was its high adaptability, which ensured efficient bandwidth utilization and minimized data loss even under fluctuating network conditions, critical for remote environmental monitoring stations where connectivity might have been inconsistent. Furthermore, this integration helped significantly reduce the risk of overestimating action values, a common problem in Q-learning, which could have led to suboptimal compression settings. However, the complexity of this implementation posed a notable challenge, particularly in environments where computational resources were limited. The necessity for managing two separate Q-values for each action increased the computational demand, potentially impacting the system's real-time response capabilities. Additionally, the performance of the algorithm heavily depended on the accuracy of the network condition
assessments, with inaccuracies potentially leading to inefficient compression and data transmission.
In [124], authors explored the application of Double Q-learning to enhance dynamic voltage and frequency scaling in multi-core real-time systems. Their approach addressed the inherent overestimation bias found in Q-learning, aiming to provide more accurate and reliable power management. A major strength of their work lies in the innovative use of Double Q-learning to mitigate the overestimation issue common in single-estimator Q-learning methods. By utilizing two estimators, the system could potentially make more informed decisions that better-balanced power consumption against system performance needs. Additionally, their thorough simulation-based evaluation indicated that this method could outperform traditional methods, offering substantial energy savings across various system conditions. Nevertheless, managing two Q-value estimators increased the computational load, which could have challenged the limited resources available in real-time systems where rapid processing was paramount. Furthermore, while the simulation results were decent, translating these outcomes to real-world scenarios required additional adjustments to accommodate the diverse nature and unpredictability of real system workloads. Table VI summarizes the examined papers by their domain.

TABLE VI: Double Q-learning Papers Review
Application Domain | References
Hardware Implementations (FPGA, Real-Time Systems) | [119], [124]
IoT and Power Management (IoT Devices, Maritime Network Routing) | [121], [122]
Data Transmission and Compression (Environmental Monitoring) | [123]

Transitioning from one off-policy TD algorithm, Q-learning, and its variation, Double Q-learning, which utilizes two Q functions, we now turn our attention to another TD-based algorithm, SARSA, in the next subsection.
h) State-Action-Reward-State-Action (SARSA): The SARSA algorithm is an on-policy method that updates the action-value function (Q-function) incrementally and in an online manner. Originating from the work in [66], SARSA is distinguished by its approach of learning the Q-values directly from the policy being executed. SARSA is characterized by its on-policy learning approach, where the policy used to make decisions is the same as the policy being evaluated and improved. This contrasts with off-policy methods, such as Q-learning, which may learn about the optimal policy independently from the agent's actions.

Algorithm 12 SARSA
 1: Initialize Q(s, a), ∀s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
 2: repeat ⊲ (for each episode)
 3:     Initialize S
 4:     Choose A from S using policy derived from Q (e.g., ε-greedy)
 5:     repeat ⊲ (for each step of the episode)
 6:         Take action A, observe R, S′
 7:         Choose A′ from S′ using policy derived from Q (e.g., ε-greedy)
 8:         Q(S, A) ← Q(S, A) + α [R + γQ(S′, A′) − Q(S, A)]
 9:         S ← S′; A ← A′
10:     until S is terminal
11: until convergence or a stopping criterion is met

The core of the SARSA algorithm lies in its method for updating the Q-values. The updates occur according to the following rule [1] (Alg. 12, line 8):

Q(s, a) ← Q(s, a) + α[r + γQ(s′, a′) − Q(s, a)]   (22)

The SARSA and Q-learning algorithms update their estimates of the action-value function based on the information gleaned from actions taken. In Q-learning, the update is based on the maximum estimated future reward, represented by the maximum Q-value of the next state, regardless of the action taken. In contrast, SARSA updates its Q-values using the action actually taken in the next state, not necessarily the best possible action. In other words, SARSA does not compute the maximum over possible next actions; it simply uses the next action the agent actually takes. A fundamental version of SARSA is presented in Alg. 12. Several research studies that used SARSA are examined below.
The application of the SARSA algorithm in a simulated shepherding scenario where a dog herds sheep towards a target was investigated in [48]. This RL model incorporated a discretized state and action space and designed a specific reward system to facilitate learning. The application of SARSA to a complex, dynamic task like shepherding demonstrated the algorithm's versatility and its potential in environments involving multiple agents and stochastic elements. The model successfully taught a dog to herd sheep by learning to reach sub-goals, which simplified the learning process and improved the manageability of the task. However, the discretization of the state and action spaces might have limited the dog's movement and decision-making capabilities, potentially leading to less optimal performance in more realistic or varied environments. The stochastic nature of sheep
movement and the complexity of the shepherding task led to a significant learning time, with notable success only after many episodes, indicating potential efficiency issues in more demanding or time-sensitive applications.
Authors in [116] primarily focused on the implementation of Q-learning and SARSA algorithms, incorporating eligibility traces to improve handling delayed rewards. The research demonstrated the effectiveness of RL algorithms in navigating dynamic tasks like micromanaging combat units in real-time strategy games. Adding eligibility traces significantly boosted the algorithms' learning from sequences of interdependent actions, essential in fast-paced, chaotic environments. Using a commercial game like StarCraft as a testbed introduced real-world complexity often absent in simulated settings. This method confirmed the practicality of RL algorithms in real scenarios and highlighted their potential to adapt to commercial applications. The algorithms showed promise in small-scale combat, but scaling to larger, more complex battles in StarCraft or similar games remained uncertain. Concerns arose about computational demands and the efficiency of learning optimal strategies without extensive prior training. While effective within StarCraft, their broader applicability to other real-time strategy games or applications remained untested. The specialized design of state and action spaces for StarCraft could have hindered transferring these methods to different domains without significant modifications.
In [28], authors studied the enhancement of energy efficiency in IoT networks through strategic power and channel resource allocation using a SARSA-based algorithm. This approach addressed the unpredictable nature of energy from renewable sources and the variability of wireless channels in real-time settings. A major contribution of the study lay in its innovative use of a Model-free, on-policy RL method to manage energy distribution across IoT nodes. This method efficiently handled the stochastic nature of energy harvesting and channel conditions, optimizing network performance in terms of energy efficiency and longevity. Integrating SARSA with linear function approximation helped refine solutions to continuous state and action spaces, enhancing the practicality in real-world IoT applications. However, relying on linear function approximation introduced limitations in capturing the full complexity of interactions in dynamic, multi-dimensional state spaces typical of IoT environments. While the proposed SARSA algorithm demonstrated network efficiency improvements, implementing such an RL system in real IoT networks posed challenges. These included the need for continual learning and adaptation to changing conditions, impacting the deployment's feasibility and scalability.
Authors in [125] explored applying SARSA to optimize peer-to-peer (P2P) electricity transactions among small-scale users in smart energy systems. This research aimed to enhance economic efficiency and reliability in decentralized energy markets. A significant strength lies in its innovative application of SARSA to a complex, dynamic energy trading system. Researchers modeled P2P electricity transactions as an MDP, allowing the system to handle uncertainties in small-scale energy trading, like fluctuating prices and varying demands. This modeling enabled the system to learn optimal transaction strategies over time, potentially enhancing both efficiency and profitability in energy trading. However, the approach had limitations. SARSA, while effective in learning optimal policies through trial and error, required extensive interaction with the environment to achieve satisfactory performance. This data requirement could have been a drawback in real-world applications where immediate decisions were necessary, and historical transaction data was limited. Moreover, implementing such a system in a live environment, where real-time decision-making was crucial, posed additional challenges, including the need for robust computational resources to handle continuous state and action space calculations.
In [47], the authors examined integrating SARSA(0) with Fully Homomorphic Encryption (FHE) for cloud-based control systems to ensure data confidentiality while performing RL computations on encrypted data. A significant strength was preserving privacy in cloud environments where sensitive control data could have been vulnerable to breaches. Using FHE allowed the RL algorithm to execute without decrypting the data, providing a robust method for maintaining confidentiality. The paper successfully demonstrated this method on a classical pole-balancing problem, showing it was theoretically sound and practically feasible. However, implementing SARSA(0) over FHE introduced challenges related to computational overhead and latency due to encryption operations. These factors could have impacted the efficiency and scalability of the RL system, especially in environments requiring real-time decision-making. Additionally, encryption-induced delays and managing encrypted computations might have limited this method's application to scenarios where control tasks could tolerate such delays.
Authors in [117] presented a novel application of SARSA within a swarm RL framework to solve optimization problems more efficiently, particularly those involving large negative rewards. The authors incorporated individual learning and cooperative information exchange among multiple agents, aiming to speed up learning and enhance decision-making efficiency. This approach significantly leveraged swarm intelligence and RL, particularly SARSA's ability to handle tasks with substantial negative rewards. By enabling agents to learn
from individual experiences and shared insights, the system could converge to optimal policies more swiftly than traditional single-agent or non-cooperative multi-agent systems. Implementing the shortest path problem demonstrated the method's practicality and effectiveness, showcasing improved learning speeds and robustness against pitfalls with significant penalties. Nonetheless, managing multiple agents and their interactions increased the computational overhead and complexity of the learning process. Moreover, the approach heavily relied on designing the information-sharing protocol among agents, which, if not optimized, could have led to inefficiencies or suboptimal learning outcomes. The generalized application of this method across different environments or RL tasks remained to be thoroughly tested, suggesting potential limitations in adaptability.
In [126], authors explored the application of SARSA to optimize routing for Electric Vehicles (EVs) to minimize energy consumption. This approach adapted to real-time driving conditions to reduce on-road energy needs by selecting routes with lower energy requirements. The research's strength lies in its real-time application and use of SARSA to learn and predict the most energy-efficient routes under various conditions, like traffic and road type. By utilizing Markov chain models to estimate energy requirements based on actual driving data, the framework aimed to extend the driving range of EVs, crucial for reducing range anxiety. However, the reliance on real-time data and the inherent variability of driving conditions presented challenges. The accuracy of the SARSA model's predictions heavily depended on the data quality and immediacy, which could have been compromised by limitations in data transmission or processing delays. Implementing this system in a real-world environment could have posed scalability and adaptability challenges, particularly concerning integration with existing vehicle navigation systems and the computational demands on the vehicle's hardware.
A variant of SARSA which incorporates expectations over all possible next actions instead of relying solely on the sampled next action is introduced in the next subsection.
i) Expected SARSA: The Expected SARSA algorithm, proposed in [127], extended the classic SARSA algorithm. Expected SARSA differed from standard SARSA by incorporating expectations over all possible next actions instead of relying solely on the sampled next action. This modification reduced the update rule's variance, potentially allowing for faster and more stable convergence in learning tasks. The algorithm operated under the premise that by averaging all possible actions from the next state (weighted by their probability under the current policy), it could achieve a more stable estimate of state-action values. Expected SARSA's convergence was guaranteed under conditions similar to those required by SARSA, such as all state-action pairs being visited infinitely often. [127] provided proof that Expected SARSA converged to the optimal action-value function under typical RL assumptions, like finite state and action spaces, and a policy that became greedy in the limit with infinite exploration. Empirically, Expected SARSA outperformed both SARSA and Q-learning in various domains, especially in tasks where certain actions could lead to significant negative consequences. By incorporating the expectation over all possible next actions, the algorithm effectively smoothed out learning updates, which helped in environments where certain decisions or state transitions led to high variability in rewards [127], [1]. In summary, Expected SARSA provides a robust approach to learning in stochastic environments by reducing the variance inherent in updates of state-action values. This leads to more reliable learning performance, particularly in complex environments where action outcomes are highly uncertain. The Expected SARSA update rule is:

Q(st, at) ← Q(st, at) + α [ rt+1 + γ Σa π(a|st+1) Q(st+1, a) − Q(st, at) ]   (23)

Before discussing the selected studies that have utilized Expected SARSA, a general overview of the algorithm can be found in Alg. 13.

Algorithm 13 Expected SARSA
 1: Input: policy π, positive integer num episodes, small positive fraction α, GLIE {εi}
 2: Output: value function Q (qπ if num episodes is large enough)
 3: Initialize Q arbitrarily (e.g., Q(s, a) = 0 for all s ∈ S and a ∈ A(s), and Q(terminal-state, ·) = 0)
 4: for i ← 1 to num episodes do
 5:     ε ← εi
 6:     Observe S0
 7:     t ← 0
 8:     repeat
 9:         Choose action At using policy derived from Q (e.g., ε-greedy)
10:         Take action At and observe Rt+1, St+1
11:         Q(St, At) ← Q(St, At) + α (Rt+1 + γ Σa π(a|St+1)Q(St+1, a) − Q(St, At))
12:         t ← t + 1
13:     until St is terminal
14: end for
15: return Q
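To summarize the distinction between the on-policy updates just discussed, the sketch below contrasts the SARSA update of Eq. (22) with the Expected SARSA update of Eq. (23) for a single transition. The ε-greedy action probabilities and the dictionary-of-dictionaries Q table are illustrative assumptions about the behavior policy and data structure, not a prescribed implementation.

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """SARSA (Eq. 22): bootstrap on the action actually taken in s_next."""
    td_target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (td_target - Q[s][a])

def expected_sarsa_update(Q, s, a, r, s_next, actions, alpha, gamma, epsilon):
    """Expected SARSA (Eq. 23): bootstrap on the expectation over next actions
    under an epsilon-greedy policy derived from Q (illustrative assumption)."""
    greedy = max(actions, key=lambda b: Q[s_next][b])
    expected_value = 0.0
    for b in actions:
        # epsilon-greedy probabilities: epsilon/|A| for all, plus (1-epsilon) for the greedy action
        prob = epsilon / len(actions) + ((1.0 - epsilon) if b == greedy else 0.0)
        expected_value += prob * Q[s_next][b]
    Q[s][a] += alpha * (r + gamma * expected_value - Q[s][a])
```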
Authors in [128] designed a comprehensive exploration of combining Model Predictive Control (MPC) with the Expected SARSA algorithm to tune MPC models' parameters. This integration aimed to enhance the robustness and efficiency of control systems, particularly in applications where the system model's parameters were not fully known or subject to change. A key advantage was the innovative approach to speeding up learning by directly integrating RL with MPC, reducing the episodes needed for effective training. This efficiency was achieved using the Expected SARSA algorithm, which offered smoother convergence and better performance due to its average-based update rule compared to the more common Q-learning method, which focused on maximum expected rewards and might lead to higher variance in updates. However, the approach's complexity and the need for precise tuning of MPC model parameters represented significant challenges. The computational demands increased due to the dual needs of continuous adaptation by the RL algorithm and the rigorous constraints enforced by MPC. While the framework showed potential in simulations, its real-world applicability, especially in highly dynamic and unpredictable environments, required further validation.

Authors in [129] focused on improving Deterministic and Synchronous Multichannel Extension (DSME) networks' resilience against WiFi interference using the Expected SARSA algorithm. It evaluated channel adaptation and hopping strategies to mitigate interference in industrial environments, providing a detailed analysis with Expected SARSA. The study stood out for its rigorous simulation-based evaluation of interference mitigation strategies in a controlled DSME network. By using Expected SARSA, the research effectively reduced uncertainty in channel quality assessment, leading to more reliable network performance under interference conditions, crucial for industrial applications where reliable data transmission was vital for operational efficiency and safety. However, the complexity of RL implementation and its dependency on accurate real-time data posed challenges. The computational demands of running Expected SARSA in real-time environments could limit the practical deployment of this strategy in resource-constrained settings.

In [130], researchers investigated using the Expected SARSA learning algorithm to manage Load Frequency Control in multi-area power systems. The study focused on enhancing power systems' stability integrated with Distributed Feed-in Generation based wind power, using a Model-free RL approach to adjust to variable power supply conditions without a predefined system model. The primary strength lies in its innovative approach to addressing challenges in integrating renewable energy sources into the power grid. Expected SARSA, an on-policy RL algorithm, demonstrated how to manage and stabilize frequency variations due to unpredictable renewable energy outputs and fluctuating demand adaptively. This was crucial for maintaining reliability and efficiency in power grids increasingly incorporating variable renewable energy sources. However, the approach's downside was related to the RL algorithm's computational demands and the need for extensive simulation and testing to fine-tune system parameters. Implementing such a system in a real-world setting could be constrained by these factors, particularly in terms of real-time computation capabilities and the scalability of the solution to larger, more complex grid systems.

Authors in [31] utilized the Expected SARSA algorithm to optimize Energy Storage Systems (ESS) operation in managing uncertainties in wind power generation forecasts. The study modeled the problem as MDP where the state and action spaces were defined by the ESS's operational constraints. Expected SARSA's primary result in this context was its superior performance compared to conventional Q-learning-based methods. The algorithm effectively handled the wide variance in wind power forecasts, crucial for optimizing ESS's charging and discharging actions to reduce forecast errors. The strategy's effectiveness was underscored by its near-optimal performance, closely approximating the optimal solution with complete future information. Simulation results demonstrated that the Expected SARSA-based strategy could manage wind power forecast uncertainty more effectively by adapting to varying conditions. This adaptability was enhanced by including frequency-domain data clustering, which refined the learning process and reduced input data variability, further improving the RL model's performance. The last variant of SARSA is N-step SARSA. We will cover it in the next subsection before delving deeper into Approximation Model-free algorithms.

j) N-step SARSA: With foundational concepts of n-step TD and bootstrapping established, we can expand on these ideas and explore research papers utilizing them. Let's begin with N-step SARSA, an enhancement of the conventional one-step SARSA within the on-policy learning framework. In N-step SARSA, as shown in Alg. 14, the update to the value (or action-value) is not limited to just the next state and action, as seen in one-step SARSA, but instead incorporates a sequence of n actions and rewards. The value function in the equation is replaced with Q(S_{t+n}, A_{t+n}):
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \Big[ \sum_{k=0}^{n-1} \gamma^k R_{t+k+1} + \gamma^n Q(S_{t+n}, A_{t+n}) - Q(S_t, A_t) \Big]    (24)

N-step SARSA maintained the on-policy characteristic of SARSA, meaning the policy generating the behavior was the same as the one being evaluated and improved. This allowed it to effectively integrate the benefits of TD learning and the broader horizon considered in MC methods, balancing the bias-variance trade-off by adjusting the number of steps n looked ahead. N-step methods, including N-step SARSA, enhance learning by providing more robust estimates that incorporate multiple future outcomes rather than relying solely on the immediate next state and action. This leads to faster learning and improves policy performance, especially in complex environments where future rewards are significantly affected by actions taken over multiple steps [1], [94].

Algorithm 14 N-step SARSA
1: Initialize Q(s, a) arbitrarily, for all s ∈ S, a ∈ A
2: Initialize π to be ε-greedy with respect to Q, or to a fixed given policy
3: Algorithm parameters: step size α ∈ (0, 1], small ε > 0, a positive integer n
4: All store and access operations (for S_t, A_t, R_t) can take their index mod n + 1
5: repeat (for each episode)
6:   Initialize and store S_0 ≠ terminal
7:   Select and store an action A_0 ∼ π(·|S_0)
8:   T ← ∞
9:   for t = 0, 1, 2, ... do
10:    if t < T then
11:      Take action A_t
12:      Observe and store the next reward as R_{t+1} and the next state as S_{t+1}
13:      if S_{t+1} is terminal then
14:        T ← t + 1
15:      else
16:        Select and store an action A_{t+1} ∼ π(·|S_{t+1})
17:      end if
18:    end if
19:    τ ← t − n + 1 (τ is the time whose estimate is being updated)
20:    if τ ≥ 0 then
21:      G ← \sum_{i=τ+1}^{\min(τ+n, T)} γ^{i−τ−1} R_i
22:      if τ + n < T then
23:        G ← G + γ^n Q(S_{τ+n}, A_{τ+n})   (G_{τ:τ+n})
24:      end if
25:      Q(S_τ, A_τ) ← Q(S_τ, A_τ) + α [G − Q(S_τ, A_τ)]
26:      if π is being learned then
27:        Ensure that π(·|S_τ) is ε-greedy with respect to Q
28:      end if
29:    end if
30:  end for
31: until τ = T − 1
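As a companion to Eq. (24) and Alg. 14, the snippet below sketches how the truncated n-step return G_{τ:τ+n} and the corresponding update can be computed once the next n rewards are available. It assumes a tabular, episodic setting, and the function name and buffer layout are illustrative assumptions, not taken from the papers reviewed here.

```python
def n_step_sarsa_update(Q, states, actions, rewards, tau, n, T, alpha, gamma):
    """Update Q(S_tau, A_tau) toward the n-step return of Eq. (24).

    states/actions/rewards are lists indexed by time step (rewards[i] is R_i);
    T is the terminal time, or float('inf') while the episode is still running.
    """
    # Discounted sum of the next n rewards, truncated at the end of the episode.
    G = sum(gamma ** (i - tau - 1) * rewards[i]
            for i in range(tau + 1, min(tau + n, T) + 1))
    # Bootstrap from Q(S_{tau+n}, A_{tau+n}) if the episode has not ended yet.
    if tau + n < T:
        G += gamma ** n * Q[states[tau + n], actions[tau + n]]
    Q[states[tau], actions[tau]] += alpha * (G - Q[states[tau], actions[tau]])
    return G
```

Setting n = 1 reduces the sketch to one-step SARSA, while letting n reach the episode length recovers a Monte Carlo style target, which is the bias-variance trade-off mentioned above.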
Starting the analysis of selected papers, [131] examined the application of the n-step SARSA RL algorithm to optimize traffic signal control. This method dynamically adjusted traffic signals based on real-time conditions, aiming to reduce congestion more effectively than traditional methods like Static Signaling (SS) and Longest Queue First. The paper innovatively applied the n-step SARSA, incorporating multiple future steps into decision-making. This enabled more strategic planning and could lead to better handling of complex traffic situations. The use of the Simulation of Urban MObility (SUMO) traffic simulator to model real-world traffic in Texas provided a solid base for testing and validating the algorithm. A detailed comparative analysis with existing methods showed potential improvements in managing traffic flow and reducing congestion. The research also tackled scalability by employing a centralized control agent, mitigating rapid growth in state-action space seen in decentralized systems, and enhancing feasibility for large urban areas. Despite benefits, real-world implementation could present challenges, including high computational demands and the need for real-time data processing and communication infrastructure. While simulations were crucial for preliminary tests, there was a risk that the model might be over-fitted to these conditions, potentially impairing the algorithm's effectiveness in real settings. Additionally, the study focused on a particular urban environment and did not extensively investigate the algorithm's performance across diverse traffic patterns, different urban layouts, or during unusual events like accidents or road closures.

In [132], authors introduced a hybrid method that melded on-policy traits of SARSA with off-policy features of Q-learning, integrating an N-step look-ahead capability to enhance foresight in decision-making. The algorithm's flexibility to adjust the number of look-ahead steps (N) dynamically allowed it to suit different learning stages, providing an effective balance between exploration and exploitation. This adaptability could facilitate more efficient learning than traditional RL methods.
By combining on-policy and off-policy updates, the algorithm capitalized on the strengths of both SARSA and Q-learning, potentially decreasing the overestimation bias seen in Q-learning while still targeting optimal policies. The approach's ability to adjust the N parameter based on learning phases or specific conditions suggested customization for a broad array of applications, from simple gaming scenarios to intricate decision-making tasks in real-world contexts. However, the algorithm's requirement to finely adjust various hyperparameters, like maximum and minimum values of N and the breakpoint for decrementing N, added complexity to its configuration and optimization. This could pose a challenge, particularly for users with limited RL experience. Additionally, maintaining and updating a combination of policies over multiple steps before reaching a decision resulted in higher computational demands, potentially restricting the algorithm's applicability in settings with limited resources or where rapid response times were essential. The categorization of the examined papers by their domain is given in Table VII.

TABLE VII: SARSA Papers Review (Application Domain: Number of Papers)
Multi-agent Systems and Autonomous Behaviors (Shepherding, Virtual Agents): [48]
Games and Simulations (Real-Time Strategy Games): [116]
Energy and Power Management (IoT Networks, Smart Energy Systems): [28], [125]
Cloud-based Control and Encryption Systems: [47]
Swarm Intelligence and Optimization Problems: [117]
Transportation and Routing Optimization (EVs): [126]
MPC Tuning: [128]
Network Resilience and Optimization: [129], [31]
Power Systems and Energy Management: [130]
Network Optimization (Optical Transport Networks, Fog RAN): [115]
Intelligent Traffic Signal Control: [131]
Hybrid RL Algorithms: [132]

3) Summary of Tabular Model-free Algorithms: In this section, we delved into Tabular Model-free algorithms as one of the categories in RL. After providing a brief overview of each algorithm and studies that used those methods, it is time to give a complete summary in Table VIII.

It is necessary to explain what scalability and sample efficiency mean. Scalability refers to the algorithm's ability to handle varying sizes of environments or state spaces. High scalability indicates that the algorithm remains efficient even in larger environments or state spaces. Moderate scalability means that the algorithm can handle medium-sized problems effectively. Low scalability suggests that the algorithm struggles with large state spaces, often due to computational complexity or memory requirements. Sample-efficiency reflects how effectively the algorithm learns from the available data. High sample-efficiency denotes that the algorithm can learn effectively from fewer samples. Moderate sample-efficiency indicates a requirement for a moderate number of samples. Low sample-efficiency suggests that the algorithm needs a large number of samples to learn effectively. Bootstrapping involves using estimates to update values, accelerating the learning process. Eligibility Traces track state-action pairs to improve learning efficiency by integrating information over multiple steps. Experience Replay stores past experiences, which can be reused during learning to improve performance. Exploration Starts ensure that all states are visited by starting from random states, promoting thorough exploration of the environment.

In the next section, we start analyzing another paradigm of Model-free algorithms, approximation-based algorithms, before analyzing part II of Tabular Model-based ones.

C. Approximation Model-free Algorithms
In this section, we analyze Approximation Model-free algorithm variations and their applications in various domains. It must be noted that we assume readers have knowledge of DL before reading this section; readers are referred to [133], [134], [135], [136] to understand and learn DL. Approximation Model-free algorithms in RL comprise methods oriented toward learning policies and value functions solely from interaction with an environment, without an explicitly stated model of the environment's dynamics. These algorithms typically use function approximators, such as neural networks, and generalize from observed state-action pairs to unknown ones. Consequently, they are quite effective at handling large or continuous state and action spaces [59], [137], [138].
Key features of the Approximation Model-free algorithms are:
• These algorithms learn the policy directly by optimizing expected reward without an explicit construction of the model of the environment.
• In value-function estimation, the value function is estimated using function approximators predicting expected rewards for states or state-action pairs (a minimal sketch of such an approximator is given below).
• The solutions are scalable, thus fitting for problems with large or continuous state and action spaces since they can generalize from limited data.
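For concreteness, the following is a minimal sketch of the second point: a linear approximator Q(s, a) ≈ w·φ(s, a) trained with a semi-gradient Q-learning update. The feature map phi and the function name are placeholders introduced here for illustration, not an interface used by the surveyed works.

```python
import numpy as np

def semi_gradient_q_update(w, phi, s, a, r, s_next, actions, alpha, gamma, done):
    """Semi-gradient Q-learning step for a linear approximator Q(s, a) = w . phi(s, a).

    phi(s, a) must return a NumPy feature vector with the same length as w.
    """
    q_sa = w @ phi(s, a)
    # Bootstrapped target; the future value is zero at terminal states.
    target = r if done else r + gamma * max(w @ phi(s_next, b) for b in actions)
    # In the linear case the gradient of Q w.r.t. w is simply phi(s, a).
    w += alpha * (target - q_sa) * phi(s, a)
    return w
```

DQN, introduced next, replaces the linear map with a deep network and adds experience replay and a target network to keep such bootstrapped updates stable.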
TABLE VIII: Comparison of Tabular Model-free Algorithms


Algorithm On-Policy/ Off-Policy Scalability Sample-Efficiency Additional Information
TD Learning On-Policy High Moderate Online learning, bootstrapping
TD(0)-Replay Algorithm Off-Policy High Moderate Uses experience replay
TD(λ) On-Policy High Moderate Trace decay parameter λ
N-step Bootstrapping On-Policy High Moderate Generalization of TD
N-step TD Prediction On-Policy High Moderate Predictive, bootstrapping
N-step Off-Policy Learning Off-Policy High Moderate Uses off-policy returns
Q-learning Off-Policy High Moderate Finds optimal policy, bootstrapping
Double Q-learning Off-Policy High Moderate Reduces overestimation bias
SARSA On-Policy High Moderate Learns the action-value function
Expected SARSA On-Policy High Moderate Uses expected value of next state
N-step SARSA On-Policy High Moderate Generalization of SARSA
MC Methods On-Policy Moderate Low Does not bootstrap, needs full episodes
On-Policy MC On-Policy Moderate Low Requires exploration starts
MC Importance Sampling Off-Policy Moderate Low Corrects for different policies

Popular examples include Q-learning algorithms with function approximation, called DQN. Approximation Model-free algorithms are thus essential enablers of practical applications in RL where an explicit environment model is infeasible or too computationally expensive to be created.

As the first Approximation Model-free algorithm, over the next subsection, we will analyze DQN, one of the most widely used algorithms in RL.

1) Deep Q-Networks (DQN): The DQN algorithm merges Q-learning [101] with Neural Networks to learn control policies directly from raw pixel inputs. It uses Convolutional Neural Networks (CNN) to process these inputs and an experience replay mechanism to stabilize learning by breaking correlations between consecutive experiences. The target network, updated less frequently, aids in stabilizing training. DQN achieved state-of-the-art performance on various Atari 2600 games, surpassing previous methods and, in some cases, human experts, using a consistent network architecture and hyperparameters across different games [139]. DQN combines the introduced Bellman Equation with DL approaches like Loss Function and Gradient Descent to find the optimal policy as below:
Loss Function:

L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim D} \big[ (y_i - Q(s, a; \theta_i))^2 \big]    (25)

where

y_i = r + \gamma \max_{a'} Q(s', a'; \theta^-)    (26)

Gradient Descent Step:

\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s,a,r,s') \sim D} \big[ \big( r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta_i) \big) \nabla_{\theta_i} Q(s, a; \theta_i) \big]    (27)
(27) DQN could improve HVAC control by learning from
Alg. 15 details the complete Deep Q-learning algo- previous actions, states, and rewards, thus offering a
rithm, regardless of using CNNs, as per the first paper. Model-free optimization approach. The paper reported a
This algorithm is a generalization to have a general significant reduction in total energy usage (15.7%) com-
overview of the algorithm. pared to baseline operations, highlighting the potential
of DQN to enhance energy efficiency in buildings. The Additionally, while the simulations showed promising
research also addressed the complexity and interconnec- results, practical deployment and testing in diverse en-
tivity of HVAC systems, providing a practical solution vironments would have been necessary to validate the
to a traditionally challenging control problem. However, generalizability of the approach.
the paper also had some limitations. The reliance on Researchers in [143] explored the application of DQN
simulation data for training and testing the DQN might optimized using Particle Swarm Optimization (PSO)
not have fully captured the intricacies and variations for resource allocation in fog computing environments,
of real-world scenarios. Additionally, the study focused specifically tailored for healthcare applications. A no-
on a specific type of building and HVAC setup, which table strength of this study was its innovative combi-
might have limited the generalizability of the results to nation of DQN and PSO, which effectively balanced
other building types or climates. Furthermore, the study resource allocation by reducing makespan and improving
did not delve deeply into the potential challenges of both average resource utilization and load balancing
implementing such a system in practice, such as the need levels. The methodology leveraged real-time data to
for extensive data collection and processing capabilities. dynamically adjust resources, showcasing a practical
An innovative application of DQN to localize brain application of DQN in a critical domain. However,
tumors in Magnetic Resonance Imaging (MRI) images the paper could have benefited from a more detailed
was introduced in [141]. The key strength of this study discussion on the scalability of the proposed system
lay in its ability to generalize tumor localization with and its performance under varying network conditions.
limited training data, addressing significant limitations Additionally, while the results were promising, they
of supervised DL which often required large annotated were primarily based on simulations, which might not
datasets and struggled with generalization. The authors have fully captured the complexities and unpredictability
demonstrated that the DQN approach achieved 70% of real-world fog environments. Further validation in
accuracy on a testing set with just 30 training images, practical deployments would have been necessary to
significantly outperforming the supervised DL method fully ascertain the efficacy of the proposed approach.
that showed only 11% accuracy due to over-fitting. An adaptive power management strategy for Parallel
This showcased the robustness of RL in handling small Plug-in Hybrid EVs (PHEVs) using DQN was inves-
datasets and its potential for broader application in tigated in [144]. The strength of this paper lies in
medical imaging. The use of a grid-world environment its practical application of DQN for real-time power
to define state-action spaces and a well-designed re- distribution in PHEVs. The approach considered con-
ward system further strengthened the methodology. A tinuous state variables such as battery State of Charge
notable weakness, however, was the limitation to two- (SOC), required power, vehicle speed, and remaining
dimensional image slices, which might not have fully distance ratio, which enhanced the model’s adaptabil-
captured the complexities of three-dimensional medical ity and precision in dynamic driving conditions. The
imaging. Future work should have addressed this by DQN model successfully minimized fuel consumption
extending the approach to 3D volumes and exploring while maintaining battery SOC within desired limits,
more sophisticated techniques to improve stability and showing a 6% increase in fuel consumption compared
accuracy. to DP which was globally optimal but computationally
Authors in [142] presented a routing algorithm for impractical for real-time applications. However, the re-
enhancing the sustainability of Rechargeable Wireless search was primarily based on simulations using the
Sensor Networks. The authors proposed an adaptive File Transfer Protocol (FTP)-72 driving cycle, which
dual-mode routing approach that integrated multi-hop might not have fully captured the variability of real-
routing with direct upload routing, optimized through world driving conditions. Additionally, the study focused
DQN. The strength of this paper lies in its innovative on a specific type of PHEV and driving scenario, which
use of RL to dynamically adjust the routing mode might have limited the generalizability of the results.
based on the life expectancy of nodes, significantly Further real-world testing and validation across different
improving the network’s energy efficiency and lifespan. vehicle models and driving conditions were necessary to
The simulation results demonstrated that the proposed establish the robustness and practical applicability of the
algorithm achieved a correct routing mode selection proposed strategy.
rate of 95% with limited network state information, Authors in [145] explored the application of DQN to
showcasing its practical applicability and robustness. create an AI for a visual fighting game. A significant
However, the paper did not fully address potential real- strength of this study was its innovative reduction of the
world challenges such as the computational overhead of action space from 41 to 11 actions, which simplified the
implementing DQN in resource-constrained sensor nodes training process and enhanced the model’s performance.
and the impact of network topology changes over time. The DQN architecture included convolutional and fully
connected layers optimized for handling sequential frame TABLE IX: DQN Papers Review
inputs, effectively learning to perform complex combina- Application Domain References
tions in a dynamic, competitive environment. However, General RL (Policy learning, raw [141]
a notable limitation was the reliance on a static opponent experience)
(None agent) during training, which might not have Network Optimization [142]
fully captured the complexities and adaptive behaviors Swarm Intelligence and [143],
Optimization Problems
of actual gameplay against diverse opponents. Addition- Network Optimization (Optical [144]
ally, the experiments were conducted in a controlled Transport Networks, Fog)
environment with specific hardware, potentially limiting Games and Simulations [145], [146]
the generalizability of the results to other setups or Security Games and Strategy [148]
real-world gaming scenarios. Further work should have Optimization
focused on testing against more dynamic and varied
opponents to evaluate the robustness and adaptability of
the AI. world implementations and considered the computational
Authors in [146] presented a novel path-planning constraints of practical deployment environments.
approach using an enhanced DQN combined with dense Study [148] presented a novel approach to countering
network structures. The strength of this work was its intelligent Unmanned Aerial Vehicle (UAV) jamming at-
innovative policy of leveraging both depth and breadth of tacks using a Stackelberg dynamic game framework. The
experience during different learning stages, which signif- UAV jammer, acting as the leader, used Deep Recurrent
icantly accelerated the learning process. The introduction Q-Networks (DRQN) to optimize its jamming trajectory,
of a value evaluation network helped the model quickly while ground users, as followers, employed DQN to
grasp environmental rules, while the parallel exploration find optimal communication trajectories to evade the
structure improved the accuracy by expanding the ex- jamming. The strength of this paper was its compre-
perience pool. The use of dense connections further hensive modeling of the UAV jamming problem using
enhanced feature propagation and reuse, contributing to DRQN and DQN, which effectively handled the dynamic
improved learning efficiency and path planning success. and partially observable nature of the environment. The
However, the primary limitation was that the experiments approach proved effective in simulations, showing that
were conducted in a controlled grid environment with both the UAV jammer and ground users could achieve
specific sizes (5x5 and 8x8), which might not have fully optimal trajectories that maximized their respective long-
represented the complexities of real-world scenarios. term cumulative rewards. The Stackelberg equilibrium
Additionally, the reliance on a fixed maximum number ensured that the proposed strategies were stable and
of steps could potentially have led to suboptimal policy effective in a competitive environment. However, the
evaluations in dynamic and larger environments. Future primary limitation was the complexity and computa-
work should have focused on validating this approach in tional demands of implementing DRQN and DQN in
more diverse and scalable settings to assess its general- real-time scenarios, which might have been challenging
izability and robustness. in practical deployments. Additionally, the simulations
The use of DQN for the real-time control of ESS co- were based on specific scenarios and parameters, which
located with renewable energy generators was discussed might not have fully captured the variability of real-
in [147]. A key strength of this work was its Model- world environments. Further validation through real-
free approach, which did not rely on distributional as- world experiments and a more extensive range of scenar-
sumptions for renewable energy generation or real-time ios would have been necessary to confirm the robustness
prices. This flexibility allowed the DQN to learn optimal and scalability of the proposed approach.
policies directly from interaction with the environment. Table IX provides an overview of the examined papers
The simulation results demonstrated that the DQN-based in DQN and their applications across different domains.
control policy achieved near-optimal performance, effec- Over the next paragraphs, we will cover the Double
tively balancing energy storage management tasks like Deep Q-Networks (DDQN), an extension of the DQN
charging and discharging without violating operational designed to address the overestimation bias observed in
constraints. However, the primary limitation was the re- Q-learning.
liance on simulated data for both training and evaluation, 2) Double Deep Q-Networks (DDQN): The DDQN
which might not have captured all the complexities of algorithm is an extension of the DQN designed to
real-world energy systems. Additionally, the approach address the overestimation bias observed in Q-learning.
assumed the availability of significant computational It achieves this by decoupling the selection of the action
resources, which might not have been feasible for all from the evaluation of the Q-value, thus providing a
consumers. Future work should have focused on real- more accurate estimation of action values [149], [150].
Key contributions of DDQN can be summarized as follows:
• Overestimation Reduction: By using two separate networks to select and evaluate actions, DDQN mitigates the overestimation bias that is prevalent in standard DQN.
• Improved Stability and Performance: The decoupling mechanism improves the stability and performance of the learning process, particularly in complex environments.
The Double Q-learning update rule is:

y = r + \gamma Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^-\big)    (28)

where θ are the parameters of the online network and θ^- are the parameters of the target network. Alg. 16 provides a general overview of the DDQN algorithm.

Algorithm 16 DDQN
1: Initialize: Online network parameters θ, Target network parameters θ^-, Replay buffer D, Exploration rate ε, and Discount factor γ
2: for each episode:
3:   while (not done) do
4:     Observe state s
5:     Select action a based on ε-greedy policy:
6:     a ← random action with probability ε, or arg max_a Q(s, a; θ) with probability 1 − ε
7:     Execute action a, observe reward r and new state s'
8:     Store transition (s, a, r, s') in replay buffer D
9:     Sample a mini-batch of transitions (s, a, r, s') from D
10:    Compute target: y = r + γ Q(s', arg max_{a'} Q(s', a'; θ); θ^-)
11:    Update the online network by minimizing the loss: L(θ) = E_{(s,a,r,s')∼D}[(y − Q(s, a; θ))^2]
12:    Every C steps, update the target network: θ^- ← θ
13:  end while

Let us analyze the papers that used DDQN in the literature. Authors in [151] explored the application of DDQN for optimizing energy efficiency in Cloud Radio Access Networks (CRAN). A key strength of this work was its innovative use of DDQN to address the Q-value overestimation issue inherent in traditional DQN methods. By separating the action selection and evaluation processes, the DDQN framework achieved more accurate and stable policy learning. The simulation results showed that DDQN significantly enhanced energy efficiency compared to baseline solutions and standard DQN approaches, demonstrating up to 22% power savings and a 20% improvement in EE. However, the study's reliance on simulated environments and predefined parameters might not have fully reflected the complexities of real-world CRAN scenarios. The scalability and adaptability of the proposed method to diverse network conditions and dynamic user demands needed further validation through practical implementations. Additionally, while the study provided a robust theoretical framework, the practical deployment challenges and computational overheads associated with DDQN in live networks required more in-depth exploration.

The application of DDQN to object detection tasks was investigated in [152]. One of the key strengths of this work was its innovative use of DDQN to enhance the accuracy and efficiency of object detection. By decoupling action selection and evaluation using two separate Q-networks, the DDQN approach addressed the overestimation problem associated with traditional DQN methods. This led to higher precision and recall rates, as demonstrated in their experiments. The method proved efficient, requiring fewer steps to detect objects and showing strong adaptability to different environments, including person detection scenarios. However, the study relied heavily on specific datasets and controlled environments, which might have limited the generalizability of the results. Real-world applications could have presented more complex scenarios that might have challenged the robustness of the proposed method. Further validation in diverse and uncontrolled settings would have helped ascertain the practical applicability and scalability of the DDQN-based object detection framework.

Authors in [153] presented an enhanced path-tracking approach for robotic vehicles using DDQN. A key strength of this work was its innovative application of DDQN for both path smoothing and tracking, which significantly reduced overshoot and settling time compared to traditional methods like Pure Pursuit Control. The proposed method demonstrated superior performance in navigating paths with sharp turns, making it particularly suitable for agricultural applications where precise path following was critical. The use of a simulation environment for training and subsequent testing on a real rover enhanced the robustness of the developed algorithm. However, the reliance on a specific rover model and controlled environments might have limited the generalizability of the results. The study could have benefited from further validation in diverse real-world
conditions and with different types of robotic vehicles to necessary to fully establish the robustness and practical
fully establish the robustness and adaptability of the pro- applicability of the proposed method.
posed method. Additionally, the computational demands A method for autonomous mobile robot navigation
of implementing DDQN in real-time applications might and collision avoidance using DDQN was developed in
have posed practical challenges that needed addressing. [155]. A significant strength of this work was its innova-
A novel method for improving tactical decision- tive use of DDQN to reduce reaction delay and improve
making in autonomous driving using DDQN enhanced training efficiency. The proposed method demonstrated
with spatial and channel attention mechanisms was in- superior performance in navigating robots to target posi-
troduced in [154]. The strength of this paper lies in tions without collisions, even in multi-obstacle scenarios.
its innovative use of a hierarchical control structure By employing a Kinect2 depth camera for obstacle detec-
that integrated spatial and channel attention modules tion and leveraging a well-designed reward function, the
to better encode the relative importance of different method achieved quick convergence and effective path
surrounding vehicles and their features. This approach planning, as evidenced by both simulation and real-world
allowed for more accurate and efficient decision-making, experiments on the Qbot2 robot platform. However, the
as evidenced by the significant improvement in safety reliance on a controlled laboratory environment and
rates (54%) and average exploration distance (30%) specific hardware (e.g., Optitrack system for positioning)
in simulated environments. The combination of the might have limited the generalizability of the results.
algorithm with double attention enhanced the agent’s Real-world applications could have introduced additional
ability to make intelligent and safe tactical decisions, complexities and variabilities that the study did not
outperforming baseline models in terms of both re- address. Further validation in diverse and uncontrolled
ward and capability metrics. However, the reliance on environments, along with a discussion on the compu-
simulation-based testing limited the assessment of the tational requirements and scalability of the approach,
model’s performance in real-world driving scenarios, would have been necessary to fully establish its practical
which might have presented additional complexities and applicability.
unpredictabilities. The computational overhead associ- Authors in [156] implemented an RL-based method
ated with the attention modules and the utilized algo- for autonomous vehicle decision-making in overtaking
rithm could also have posed challenges for real-time im- scenarios with oncoming traffic. The study leveraged
plementation in autonomous vehicles. Further validation DDQN to manage both longitudinal speed and lane-
in diverse and realistic environments would have been changing decisions. A notable strength of this research
essential to confirm the robustness and scalability of this was the use of DDQN with Prioritized Experience
approach. Replay, which accelerated policy convergence and en-
Study [26] explored a novel approach to solving the hanced decision-making precision. The simulation re-
distributed heterogeneous hybrid flow-shop scheduling sults in SUMO showed that the proposed method im-
problem with multiple priorities of jobs using a DDQN- proved average speed, reduced time spent in the opposite
based co-evolutionary algorithm. A key strength of this lane, and lowered the overall overtaking duration com-
work was its comprehensive framework that integrated pared to traditional methods, such as SUMO’s default
global and local searches to balance computational lane-change model. The RL-based approach demon-
resources effectively. The proposed DDQN-based co- strated a high collision-free rate (98.5%) and effectively
evolutionary algorithm showed significant improvements mimicked human decision-making behavior, showcasing
in minimizing total weighted tardiness and total en- significant improvements in safety and efficiency. How-
ergy consumption by efficiently selecting operators and ever, the reliance on simulated environments and specific
accelerating convergence. The numerical experiments scenarios might have limited the generalizability of the
and comparisons with state-of-the-art algorithms demon- results to real-world applications. The study assumed
strated the superior performance of the proposed method, complete observability of the state space, which might
particularly in handling real-world scenarios and large- not have been realistic in actual driving conditions where
scale instances. However, the study primarily relied on sensor imperfections and uncertainties were common.
simulations and controlled experiments, which might Future work should have focused on addressing these
not have fully captured the complexities and variabil- limitations by exploring POMDPs and validating the
ity of actual manufacturing environments. Additionally, approach in diverse, real-world scenarios.
the computational overhead associated with training the A DDQN-based control method for obstacle avoid-
DDQN model could have posed challenges for real- ance in agricultural robots was introduced in[24]. A
time applications. Further validation through real-world key strength of this work was its innovative use of
implementations and exploring dynamic events such as DDQN to handle the complexities of dynamic obstacle
new job inserts and due date changes would have been avoidance in a structured farmland environment. The
proposed method effectively integrated real-time data TABLE X: DDQN Papers Review
from sensors and used a neural network to decide the Application Domain References
optimal actions, leading to significant improvements in Energy and Power Management [151],[160]
space utilization and time efficiency compared to tra- (IoT Networks, Smart Energy
Systems)
ditional risk index-based methods. The study reported
Multi-agent Systems and [154], [156], [157]
high success rates in obstacle avoidance (98-99%) and Autonomous Behaviors
demonstrated the model’s robustness through extensive Optimization [26]
simulations and field experiments. However, the reliance Real-time Systems and Hardware [158]
on predefined paths and the assumption of only dynamic Implementations
obstacles appearing on these paths might have limited the Vehicle Speed Control System [159]
approach’s flexibility in more complex or unstructured Robotics [153], [24], [155]
environments. Additionally, the computational demands
of the DDQN framework might have posed challenges
for deployment in resource-constrained settings typical substantial improvements in throughput and reliability
of agricultural machinery. Further validation in diverse over conventional handover schemes like Conditional
real-world scenarios and exploration of more scalable Handover (CHO) and baseline Dual Active Protocol
solutions could have enhanced the practical applicability Stack (DAPS) HO. However, the study’s primary re-
of this method. liance on simulations and specific urban scenarios might
A DDQN-based algorithm for global path planning have limited the generalizability of the findings. Real-
of amphibious Unmanned Surface Vehicles (USVs) was world implementations might have faced additional chal-
studied in [157]. A major strength of this paper was lenges such as varying environmental conditions and
its innovative use of DDQN for handling the complex hardware limitations that were not fully addressed in
path-planning requirements of amphibious USVs, which the simulations. Further validation in diverse real-world
must navigate both water and air environments. The in- environments would have been necessary to confirm the
tegration of electronic nautical charts and elevation maps practical applicability and robustness of the proposed I-
to build a detailed 3D simulation environment enhanced DAPS HO scheme.
the realism and accuracy of the path planning. The pro- Authors in [159] implemented an advanced vehicle
posed method effectively balanced multiple objectives speed control system using DDQN. The approach inte-
such as minimizing travel time and energy consump- grated high-dimensional video data and low-dimensional
tion, making it suitable for diverse scenarios including sensor data to construct a comprehensive driving en-
emergency rescue and long-distance cruising. However, vironment, allowing the system to mimic human-like
the study primarily relied on simulated environments driving behaviors. A key strength of this work was
and predefined scenarios, which might not have fully its effective use of DDQN to address the instability
captured the complexities of real-world applications. The issues found in Q-learning, resulting in more accurate
computational demands of the DDQN framework could value estimates and higher policy quality. The use of
also have posed challenges for real-time implementa- naturalistic driving data from the Shanghai Naturalis-
tion, especially in dynamic environments where quick tic Driving Study enhanced the model’s realism and
decision-making was crucial. Further validation through applicability. The system demonstrated substantial im-
practical deployments and testing in various real-world provements in both value accuracy and policy perfor-
conditions would have been necessary to fully assess the mance, achieving a score that was 271.73% higher than
robustness and scalability of the proposed approach. that of DQN. However, the reliance on pre-recorded
Authors in [158] presented a Hand-over (HO) strat- driving data and controlled environments might have
egy for 5G networks. The proposed Intelligent Dual limited the generalizability of the results. Real-world
Active Protocol Stack (I-DAPS) HO used DDQN to driving conditions could have been significantly more
enhance the reliability and throughput of handovers variable, and the system’s performance in such dynamic
by predicting and avoiding radio link failures. A key environments needed further validation. Additionally, the
strength of this paper was its innovative approach to computational demands of DDQN might have posed
leveraging DDQN for dynamic and proactive handover challenges for real-time implementation in autonomous
decisions in highly variable mmWave environments. The vehicles, necessitating further optimization for practical
use of a learning-based framework allowed the system to deployment.
make informed decisions based on historical signal data, Authors in [160] addressed the dynamic multiple-
thereby significantly Reducing Hand-over Failure (HOF) channel access problem in IoT networks with energy
rates and achieving zero millisecond Mobility Interrup- harvesting devices using the DDQN approach. The key
tion Time (MIT). The simulation results demonstrated strength of this study lies in its innovative application of
DDQN to manage scheduling policies in POMDP. By converting the partial observations of scheduled nodes into belief states for all nodes, the proposed method effectively reduced energy costs and extended the network lifetime. The simulation results demonstrated that DDQN outperformed other RL algorithms, including Q-learning and DQN, in terms of average reward per time slot. However, the paper's reliance on simulations and specific parameters might have limited its applicability to real-world IoT environments. The assumptions regarding the Poisson process for energy arrival and the fixed transmission energy threshold might not have fully captured the complexities of practical deployments. Further validation through real-world experiments and consideration of diverse environmental conditions would have been necessary to fully establish the robustness and scalability of the proposed scheduling approach.

Table X offers a comprehensive summary of the papers reviewed in this section, categorized by their respective domains within the DDQN. In the next subsection, another variant of DQN that uses the Dueling architecture is introduced.

3) Dueling Deep Q-Networks: The Dueling Deep Q-Networks (Dueling DQN) algorithm introduces a new neural network architecture that separates the representation of state values and advantages to improve learning efficiency and performance. This architecture allows the network to estimate the value of each state more robustly, leading to better policy evaluation and improved stability. The dueling architecture splits the Q-network into two streams, one for estimating the state value function and the other for the advantage function, which are then combined to produce the Q-values. By separately estimating the state value and advantage, the dueling network provides more informative gradients, leading to more efficient learning. This architecture demonstrates improved performance in various RL tasks, particularly in environments with many similar-valued actions [161]. The dueling network update rule is given as:

Q(s, a; \theta, \alpha, \beta) = V(s; \theta, \beta) + \Big( A(s, a; \theta, \alpha) - \frac{1}{|\mathcal{A}|} \sum_{a'} A(s, a'; \theta, \alpha) \Big)    (29)

where θ are the parameters of the shared network, α are the parameters of the advantage stream, and β are the parameters of the value stream.
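A minimal PyTorch sketch of the aggregation in Eq. (29) is shown below; the layer sizes and class name are illustrative assumptions, and the mean-subtraction of the advantage stream follows the identifiability trick described in [161].

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + (A(s, a) - mean_a' A(s, a')), as in Eq. (29)."""

    def __init__(self, state_dim, n_actions, hidden=128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())  # theta
        self.value = nn.Linear(hidden, 1)               # beta: state-value stream V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # alpha: advantage stream A(s, a)

    def forward(self, state):
        h = self.shared(state)
        v = self.value(h)                       # shape: (batch, 1)
        a = self.advantage(h)                   # shape: (batch, n_actions)
        return v + a - a.mean(dim=1, keepdim=True)
```

Because both streams share the same torso, a single forward pass yields all Q-values, which is what lets state-value information propagate even to rarely selected actions.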
By separately estimating the state value and advantage, the dueling network provides more robust estimates of state values. The decoupling mechanism enhances stability and overall performance in various environments, especially where actions have similar values. Alg. 17 gives a comprehensive overview of the Dueling DQN algorithm.

Over the next few paragraphs, several research studies are examined in detail to understand the Dueling DQN algorithm better. A DRL-based framework utilizing Dueling Double Deep Q-Networks with Prioritized Replay (DDDQNPR) to solve adaptive job shop scheduling problems was implemented in [162]. The main strength of this study was its innovative combination of DDDQNPR with a disjunctive graph model, transforming scheduling into a sequential decision-making process. This approach allowed the model to adapt to dynamic environments and achieve optimal scheduling policies through offline training. The proposed method outperformed traditional heuristic rules and genetic algorithms in both static and dynamic scheduling scenarios, showcasing a significant improvement in scheduling performance and generalization ability. However, the reliance on predefined benchmarks and controlled experiments might have limited the applicability of the results to real-world manufacturing environments. The complexity and computational demands of the DDDQNPR framework might have posed challenges for real-time implementation in large-scale production settings. Future work should have focused on validating the approach in diverse real-world scenarios and exploring more efficient reward functions and neural network structures to enhance the method's scalability and robustness.

Managing Device-to-Device (D2D) communication using the Dueling DQN architecture was proposed by authors in [163]. A key strength of this paper was its effective utilization of the Dueling DQN architecture to autonomously determine transmission decisions in D2D networks, without relying on centralized infrastructure. The approach leveraged easily obtainable Channel State Information (CSI), allowing each D2D transmitter to train its neural network independently. The proposed method successfully mitigated co-channel interference, demonstrating near-optimal sum rates in low Signal-to-Noise Ratio (SNR) environments and outperforming traditional schemes like No Control, Opportunistic, and Suboptimal in terms of efficiency and complexity reduction. However, the study's reliance on simulation data and specific channel models might have limited the generalizability of the results to real-world D2D networks. The approach assumed ideal conditions such as perfect CSI and zero delay in TDD-based full duplex communications, which might not have held in practical scenarios. Future work should have focused on validating the method in diverse real-world environments and addressing potential implementation challenges such as real-time computational requirements and varying network conditions.

Authors explored an advanced power allocation algorithm in edge computing environments using Dueling DQN, in [29]. A significant strength of this study was
its innovative use of Dueling DQN, which separated the value and advantage functions within the Q-network. This architecture enhanced the accuracy and efficiency of power allocation by focusing on state-value estimation and advantage learning. The proposed algorithm outperformed several baseline methods, including traditional DQN, Fractional Programming (FP), and Weighted Minimum Mean Square Error, in both simulated scenarios and different user densities. The results demonstrated substantial improvements in average downlink rates and reduced computational overhead, highlighting the algorithm's robustness and efficiency in dynamic network conditions. However, the paper relied on simulations with specific parameters and controlled environments, which might have limited its applicability to real-world scenarios. The assumptions regarding perfect synchronization and the static nature of edge servers might not have fully captured the complexities of practical deployments. Further real-world testing and validation were needed to confirm the scalability and practical implementation of the proposed Dueling DQN approach in diverse and dynamic edge computing environments.

Algorithm 17 Dueling DQN
1: Input: Learning rate lr, batch size batch, discount factor γ, exploration rate ε, decay rate decay, target network update frequency freq, episodes
2: Output: Trained Dueling Q-Network
3: Initialize online network weights θ
4: Initialize target network θ^- = θ
5: Initialize replay memory D
6: for episode e = 1 to episodes do
7:   Initialize state s
8:   while state s is not terminal do
9:     Select action a_t using ε-greedy policy
10:    Execute a_t, observe r_t, s_{t+1}
11:    Store (s_t, a_t, r_t, s_{t+1}) in D
12:    Sample minibatch from D
13:    Compute targets y_j
14:    Update θ by minimizing loss L(θ)
15:    if update step mod freq = 0 then
16:      Update target network θ^- ← θ
17:    end if
18:    s ← s_{t+1}
19:    Decay ε
20:  end while
21: end for

A hybrid approach combining Dueling DQN and Double Elite Co-Evolutionary Genetic Algorithm (DECGA) for parameter estimation in variogram models used in geo-statistics and environmental engineering was implemented in [164]. A notable strength of this study was its innovative integration of RL with evolutionary algorithms. The Dueling DQN enhanced the parameter setting of the genetic algorithm, improving the convergence speed and avoiding premature convergence. The DECGA was effective in finding global optimal solutions by maintaining a diverse population. This combined approach significantly improved the accuracy of parameter estimation in both single and nested variogram models, as demonstrated by the experimental results on heavy metal concentration datasets, showing lower Residual Sum of Squares values compared to traditional methods. However, the reliance on controlled datasets and predefined parameters might have limited the method's generalizability to more complex, real-world scenarios. The computational complexity introduced by combining Dueling DQN with DECGA could also have posed challenges for practical implementation, especially in scenarios requiring real-time parameter estimation. Future work should have focused on validating the approach in diverse real-world environments and optimizing the computational efficiency for broader applicability.

Authors in [165] presented a method for optimizing UAV flight paths in IoT sensor networks using Dueling DQN. The goal was to balance energy consumption and data delay. A significant strength of the study was its effective use of Dueling DQN to manage UAV paths, enhancing both energy efficiency and network performance. The approach utilized a grid-based method to model network states, reducing computational complexity. Simulation results showed notable improvements in power consumption and data transmission delays compared to existing methods. However, the reliance on simulated environments might have limited the applicability of the results to real-world scenarios, where sensor data and network conditions could have been more variable. Further validation through practical implementations and testing in diverse environments would have been essential to confirm the robustness and scalability of the proposed method. Additionally, the computational demands of Dueling DQN might have posed challenges for real-time applications in resource-constrained IoT networks.

An innovative method for optimizing the Age of Information (AoI) in vehicular fog systems using Dueling DQN was presented in [166]. A major strength of this work was its holistic approach to AoI optimization, considering both the transmission time from the information source to the monitor and from the monitor to the destination. The Dueling DQN algorithm effectively minimized the end-to-end AoI by leveraging edge computing and vehicular fog nodes to handle real-time data more efficiently. The proposed method demonstrated significant improvements in system performance and information freshness compared to DQN and
analytical methods. However, the reliance on simulations to validate the approach might have limited the generalizability of the findings to real-world applications. The assumptions made regarding the processing capabilities and network conditions might not have fully captured the complexities of practical implementations. Further research involving real-world testing and validation in diverse environments was necessary to fully establish the robustness and scalability of the proposed method.

The authors in [167] presented an enhanced Dueling DQN method for improving the autonomous capabilities of UAVs in terms of obstacle avoidance and target tracking. One of the key strengths of this study was the effective integration of improved Dueling DQN with a target tracking mechanism, which allowed the UAV to make more precise and timely decisions. The improved algorithm showed superior performance in simulation tests, with significant improvements in obstacle avoidance accuracy and target tracking efficiency. The authors provided a comprehensive analysis of the algorithm's ability to maintain stability and adaptability in dynamic environments, showcasing its practical potential for real-world UAV applications. However, the study's dependence on simulation environments and controlled scenarios might not have fully captured the complexities and unpredictability of real-world applications. The effectiveness of the improved Dueling DQN algorithm in diverse and unstructured environments needed further validation through real-world testing. Additionally, the computational demands of the improved algorithm could have posed challenges for real-time processing and decision-making on resource-constrained UAV platforms. Future work should have addressed these aspects to establish the practicality and scalability of the proposed approach fully.

An advanced path planning approach for Unmanned Surface Vessels (USVs) using Dueling DQN enhanced with a tree sampling mechanism was implemented in [168]. A key strength of this work was the innovative combination of Dueling DQN with a tree-based priority sampling mechanism, which significantly improved the convergence speed and efficiency of the learning process. The approach leveraged the decomposition of the value function into state-value and advantage functions, enhancing the policy evaluation and decision-making accuracy. The simulation results demonstrated that the proposed method achieved faster convergence and more effective obstacle avoidance and path planning compared to DQN algorithms. Specifically, the Dueling DQN algorithm showed improved performance in terms of the number of steps and time required to reach the target across various test environments. However, the study's reliance on simulated static environments might have limited the generalizability of the results to real-world scenarios where the USVs would have encountered dynamic and unpredictable conditions. The paper also did not address the computational complexity introduced by the tree sampling mechanism, which could have posed challenges for real-time implementation on resource-constrained USVs. Future work should have focused on validating the approach in dynamic real-world environments and exploring ways to optimize the computational efficiency for practical applications. A summary of the analyzed papers is given in Table XI. In the next subsection, another part of Tabular Model-based algorithms, Model-based Planning, is introduced.

TABLE XI: Dueling DQN Papers Review
Application Domain | References
Energy and Power Management (IoT Networks, Smart Energy Systems) | [29], [165]
Transportation and Routing Optimization | [168]
Swarm Intelligence and Optimization Problems | [162]
Network Optimization | [163], [166]
Geo-statistics and Environmental Engineering | [164]
Autonomous UAVs | [155], [167]
Path planning approach for USVs | [168]

D. Advanced Tabular Model-based Methods

In this subsection, we dive into the second part of Tabular Model-based methods. Key characteristics of Model-based algorithms include Model Representation, which uses transition probabilities P(s'|s, a) and reward functions R(s, a) to define state transitions and rewards; Planning and Policy Evaluation, which iteratively approximates the value function V(s) with the Bellman equation until convergence; and Value Iteration, which also iteratively refines V(s) using the Bellman equation to determine the optimal policy by evaluating future states and actions [1].
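For illustration, the following minimal Python sketch encodes these components for a small hypothetical two-state, two-action MDP: the arrays stand in for a known or learned model P(s'|s, a) and R(s, a), and the loop performs the Bellman optimality backup until the value estimates converge. The transition probabilities, rewards, and tolerances are arbitrary placeholder values rather than settings taken from any of the works surveyed here.

import numpy as np

n_states, n_actions, gamma = 2, 2, 0.9

# Model representation: transition probabilities P[s, a, s'] and rewards R[s, a]
P = np.array([[[0.8, 0.2],
               [0.1, 0.9]],
              [[0.5, 0.5],
               [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

# Planning via value iteration: repeat the Bellman optimality backup to convergence
V = np.zeros(n_states)
for _ in range(1000):
    Q = R + gamma * (P @ V)      # Q[s, a] = R(s, a) + gamma * sum_s' P(s'|s, a) * V(s')
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        V = V_new
        break
    V = V_new

policy = Q.argmax(axis=1)        # greedy policy extracted from the converged values
print("V =", V, "policy =", policy)

Extracting the greedy policy from the converged values is exactly the planning step that the Model-based Planning methods discussed next either run on a learned model or approximate with sampled updates.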
After analyzing the first part of Tabular Model-based algorithms, DP approaches, we now shed light on the second part, Model-based Planning methods.

1) Model-based Planning: Model-based Planning in RL refers to the approach where the agent builds and utilizes a model of the environment to plan its actions and make decisions. This model can be either learned from interactions with the environment or predefined if the environment's dynamics are known. Model-based methods can be more sample-efficient as they leverage learned models of the environment to simulate many scenarios, allowing the agent to gain more insights without needing to interact directly with the environment each time. Model-based Planning enables the agent to
balance exploration (trying out new actions) and exploitation (using known good actions) more effectively. By simulating different actions and their outcomes, the agent can make more informed decisions, optimizing its behavior. The learned model enables the agent to adapt its behavior easily to environmental changes. By updating the model and re-planning, the agent can respond dynamically to new situations, maintaining its performance even in changing environments [1], [55], [169].

There are three main streams in Model-based Planning: Monte Carlo Tree Search (MCTS), Prioritized Sweeping, and Dyna-Q. We start by analyzing MCTS over the next paragraphs.

a) Monte Carlo Tree Search: MCTS is a powerful algorithm used to solve sequential decision-making problems under uncertainty. Its application spans various domains, most notably in game AI, where it has achieved significant success. MCTS combines the precision of tree search with the generality of random sampling (MC simulation). This technique has been instrumental in the development of AI systems like AlphaGo and AlphaZero, which have demonstrated superhuman performance in games like Go, chess, and shogi. The term "MCTS" was first coined by [170] when it was applied to the Go-playing program, Crazy Stone. Prior to this, game-playing algorithms primarily relied on deterministic search techniques, such as those used in chess. The significant challenge in games like Go, due to their vast state spaces, necessitated a more intelligent and efficient approach to search, leading to the adoption of MCTS.

MCTS operates by building a search tree incrementally and asymmetrically. Each iteration of MCTS involves four main steps [1]:

1) Selection: Starting from the root node, the algorithm recursively selects child nodes until it reaches a leaf node (lines 14-17). This selection process balances exploration and exploitation, typically using a UCB formula to choose moves with high estimated value and uncertainty.
2) Expansion: If the leaf node is not a terminal node, it is expanded by adding one or more child nodes (line 18). This represents exploring new moves or actions that have not yet been considered.
3) Simulation: From the expanded node, a simulation (or playout) is run to the end of the game using a default policy, often random (lines 19-21). This simulation provides an outcome that approximates the value of the moves from the expanded node.
4) Backpropagation: The result of the simulation is propagated back up the tree, updating the values of all nodes along the path (line 22). This helps to refine the estimated value of moves based on new information.

These steps are repeated until a computational budget (time or iterations) is exhausted. The move with the highest value at the root node is then chosen as the optimal move. Based on the overview presented in [171], Alg. 18 provides a comprehensive overview of MCTS.

Algorithm 18 MCTS
1: Input:
2:   PD: problem's dimension
3:   FG: fitness goal
4:   RL: roll-out length
5: Output:
6:   n*: the best solution found
7:   f: the maximum performance level reached
8:   t: the total number of roll-outs performed
9: Create root node nr and add it to the tree
10: Make nr the actual node n
11: n* ← null
12: f ← 0
13: t ← 0
14: while move_available(n) do
15:   while not is_leaf(n) do
16:     n ← best_child(n)
17:   end while
18:   n ← add_random_child_node(n)
19:   RL ← rand(PD, n)
20:   f, n' ← roll_out(n, RL)
21:   t ← t + 1
22:   back_propagate(n, f)
23:   if f > FG then
24:     n* ← n'
25:     break
26:   end if
27:   n ← nr
28: end while
29: return n*, f, t
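To complement the pseudocode in Alg. 18, the following self-contained Python sketch implements the four phases, including UCB1-based selection, for a toy single-player counting game (add 1 or 2 to a counter, with reward 1 only for landing exactly on 10). The game functions, the exploration constant c = 1.4, and the iteration budget are illustrative choices and are not drawn from any of the studies reviewed below.

import math, random

def legal_moves(state):          # available actions in a state
    return [1, 2]

def step(state, move):           # deterministic transition of the toy game
    return state + move

def is_terminal(state):
    return state >= 10

def reward(state):               # 1 for hitting 10 exactly, 0 otherwise
    return 1.0 if state == 10 else 0.0

class Node:
    def __init__(self, state, parent=None, move=None):
        self.state, self.parent, self.move = state, parent, move
        self.children, self.visits, self.value = [], 0, 0.0
        self.untried = [] if is_terminal(state) else list(legal_moves(state))

    def ucb1(self, c=1.4):       # UCB formula used during Selection
        return (self.value / self.visits
                + c * math.sqrt(math.log(self.parent.visits) / self.visits))

def mcts(root_state, n_iterations=2000):
    root = Node(root_state)
    for _ in range(n_iterations):
        node = root
        # 1) Selection: descend through fully expanded nodes via UCB1
        while not node.untried and node.children:
            node = max(node.children, key=lambda ch: ch.ucb1())
        # 2) Expansion: add one previously untried child
        if node.untried:
            move = node.untried.pop()
            child = Node(step(node.state, move), parent=node, move=move)
            node.children.append(child)
            node = child
        # 3) Simulation: random default-policy rollout to a terminal state
        state = node.state
        while not is_terminal(state):
            state = step(state, random.choice(legal_moves(state)))
        outcome = reward(state)
        # 4) Backpropagation: update visit counts and values up to the root
        while node is not None:
            node.visits += 1
            node.value += outcome
            node = node.parent
    # return the most visited root move, a common final-move criterion
    return max(root.children, key=lambda ch: ch.visits).move

print(mcts(0))   # best first move from state 0

In practice, the random rollout in the simulation phase is often replaced by a domain-informed default policy, which is precisely the kind of enhancement several of the studies below investigate.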
Key Advantages of MCTS are Domain Independence (MCTS does not require domain-specific knowledge, only the rules of the game or problem), Scalability (it can handle very large state spaces by focusing computational effort on the most promising parts of the search tree), and Flexibility (it can be adapted to various types of problems, including those with stochastic elements) [170], [1], [172].

MCTS is a fundamental and important algorithm in the realm of RL. Thus, let us examine some studies that utilized this approach to solve different problems. Study [173] presented the Multi-objective MCTS (MO-MCTS) algorithm, an extension of MCTS designed for multi-objective sequential decision-making. The key innovation lies in using the hyper-volume indicator to replace the UCB criterion, enabling the algorithm to
handle multi-dimensional rewards and discover several Pareto-optimal policies within a single tree. The MO-MCTS algorithm stood out for its ability to effectively manage multi-objective optimization by integrating the hyper-volume indicator, which provided a comprehensive measure of the solution set's quality. This approach allowed the algorithm to balance exploration and exploitation in multi-dimensional spaces, making it more efficient in discovering Pareto-optimal solutions compared to traditional scalarized RL methods. The use of the hyper-volume indicator as an action selection criterion ensured that the algorithm could capture a diverse set of optimal policies, addressing the limitations of linear-scalarization methods that failed in non-convex regions of the Pareto front. The experimental validation on the Deep Sea Treasure (DST) problem and grid scheduling tasks demonstrated that MO-MCTS achieved superior performance and scalability, matching or surpassing state-of-the-art non-RL-based methods despite higher computational costs. However, the reliance on accurate computation of the hyper-volume indicator introduced significant computational overhead, particularly in high-dimensional objective spaces. The complexity of maintaining and updating the Pareto archive could also pose challenges, especially as the number of objectives increased. While the algorithm showed robust performance in deterministic settings, its scalability and efficiency in highly stochastic or dynamic environments remained to be fully explored. The need for domain-specific knowledge to define the reference point and other hyper-volume-related parameters could also limit the algorithm's generalizability. Additionally, the method's computational intensity might hinder its real-time applicability, requiring further optimization or approximation techniques to reduce processing time.

A novel approach to Neural Architecture Search (NAS) using MCTS was introduced in [174]. The authors proposed a method that captured the dependencies among layers by modeling the search space as a Monte-Carlo Tree, enhancing the exploration-exploitation balance and efficiently storing intermediate results for better future decisions. The method was validated through experiments on the NAS-Bench-Macro benchmark and ImageNet dataset. The primary strength of this approach lies in its ability to incorporate dependencies between different layers during architecture search, which many existing NAS methods overlooked. By utilizing MCTS, the proposed method effectively balanced exploration and exploitation, leading to more efficient architecture sampling. The use of a Monte-Carlo Tree allowed for the storage of intermediate results, significantly improving the search efficiency and reducing the need for redundant computations. The introduction of node communication and hierarchical node selection techniques further refined the evaluation of numerous nodes, ensuring more accurate and efficient search processes. Experimental results demonstrated that the proposed method achieved higher performance and search efficiency compared to state-of-the-art NAS methods, particularly on the NAS-Bench-Macro and ImageNet datasets. On the other hand, the reliance on the accurate modeling of the search space as a Monte-Carlo Tree introduced additional complexity, particularly in terms of computational overhead and memory requirements. The method's performance heavily depended on the proper tuning of hyperparameters, such as the temperature term and reduction ratio, which could be challenging to optimize for different datasets and tasks. While the experiments showed promising results, further validation in more diverse and complex real-world scenarios was necessary to fully assess the scalability and robustness of the approach. The additional computational cost associated with maintaining and updating the Monte-Carlo Tree and performing hierarchical node selection could impact the method's applicability in real-time applications or resource-constrained environments.

Study [49] introduced Multi-agent MCTS (MAMCTS), a novel extension of MCTS tailored for cooperative multi-agent systems. The primary innovation was the integration of difference evaluations, which significantly enhanced coordination strategies among agents. The performance of MAMCTS was demonstrated in a multi-agent path-planning domain called Multi-agent Gridworld (MAG), showcasing substantial improvements over traditional reward evaluation methods. The MAMCTS approach effectively leveraged difference evaluations to prioritize actions that contributed positively to the overall system, leading to significant improvements in learning efficiency and coordination among agents. By combining MCTS with different rewards, the algorithm balanced exploration and exploitation, ensuring that agents could efficiently navigate the search space to find optimal policies. The experimental results in the 100x100 MAG environment demonstrated that MAMCTS outperformed both local and global reward methods, achieving up to 31.4% and 88.9% better performance, respectively. This superior performance was consistent across various agent and goal configurations, highlighting the scalability and robustness of the approach. The use of a structured search process and prioritized updates ensured that the algorithm could handle large-scale multi-agent environments effectively. The reliance on accurate computation and maintenance of difference evaluations introduced additional computational overhead, particularly in environments with a large number of agents and goals. The method's performance was sensitive to the accuracy of these evaluations, which might be challenging to maintain in highly dynamic
or unpredictable environments. While the simulations provided strong evidence of the method's efficacy, further validation in more diverse real-world scenarios was necessary to fully assess its scalability and practical utility. The complexity of managing multiple agents and ensuring synchronized updates could also pose challenges, particularly in real-time applications where computational resources are limited. Additionally, the paper focused primarily on cooperative settings, and the applicability of MAMCTS to competitive or adversarial multi-agent environments remained an area for future exploration.

A MO-MCTS algorithm tailored for real-time games was developed in [175]. It focused on balancing multiple objectives simultaneously, leveraging the hyper-volume indicator to replace the traditional UCB criterion. The algorithm was tested against single-objective MCTS and Non-dominated Sorting Genetic Algorithm (NSGA)-II, showcasing superior performance in benchmarks like DST and the Multi-objective Physical Traveling Salesman Problem (MO-PTSP). The MO-MCTS algorithm excelled in handling multi-dimensional reward structures, efficiently balancing exploration and exploitation. By incorporating the hyper-volume indicator, the algorithm could discover a diverse set of Pareto-optimal solutions, effectively addressing the limitations of linear-scalarization methods in non-convex regions of the Pareto front. The empirical results on DST and MO-PTSP benchmarks demonstrated the algorithm's ability to converge to optimal solutions quickly, outperforming both single-objective MCTS and NSGA-II in terms of exploration efficiency and solution quality. The use of a weighted sum and Euclidean distance mechanisms for action selection further enhanced the adaptability of MO-MCTS to various game scenarios, providing a robust framework for real-time decision-making. Nonetheless, the reliance on the hyper-volume indicator introduced significant computational overhead, which could limit the algorithm's scalability in high-dimensional objective spaces. The need for maintaining a Pareto archive and computing the hyper-volume for action selection added to the complexity, potentially impacting real-time performance. While the algorithm showed strong results in the tested benchmarks, its applicability to more complex and dynamic real-time games required further validation. The computational intensity of the approach might hinder its practicality in resource-constrained environments, necessitating the exploration of optimization techniques to reduce processing time without compromising performance. Additionally, the paper primarily focused on deterministic settings, leaving the performance in stochastic or highly variable environments less explored.

Researchers in [176] explored hybrid algorithms that integrated MCTS with minimax search to leverage the strategic strengths of MCTS and the tactical precision of minimax. The authors proposed three hybrid approaches: employing minimax during the selection/expansion phase, the rollout phase, and the backpropagation phase of MCTS. These hybrids aimed to address the weaknesses of MCTS in tactical situations by incorporating shallow minimax searches within the MCTS framework. The hybrid algorithms presented in the paper offered a promising combination of MCTS's ability to handle large search spaces with minimax's tactical accuracy. By integrating shallow minimax searches, the hybrids could better navigate shallow traps that MCTS might overlook, leading to more robust decision-making in games with high tactical demands. The experimental results in games like Connect-4 and Breakthrough demonstrated that these hybrid approaches could outperform standard MCTS, particularly in environments where tactical precision was crucial. The use of minimax in the selection/expansion phase and the backpropagation phase significantly improved the ability to avoid blunders and recognize winning strategies early, enhancing the overall efficiency and effectiveness of the search process. However, the inclusion of minimax searches introduced additional computational overhead, which could slow down the overall search process, especially for deeper minimax searches. The performance improvements were heavily dependent on the correct tuning of parameters, such as the depth of the minimax search and the criteria for triggering these searches. While the hybrids showed improved performance in the tested games, their scalability and effectiveness in more complex and dynamic real-world scenarios remained to be fully validated. The reliance on game-specific characteristics, such as the presence of shallow traps, might limit the generalizability of the results. Further exploration was needed to assess the impact of these hybrids in a broader range of domains and under different conditions.

Authors in [177] introduced the Option-MCTS (O-MCTS) algorithm, which extended MCTS by incorporating high-level action sequences, or "options," aimed at achieving specific subgoals. The proposed algorithm aimed to enhance general video game playing by utilizing higher-level planning, enabling it to perform well across a diverse set of games from the General Video Game AI competition. Additionally, the paper introduced Option Learning MCTS (OL-MCTS), which applied a progressive widening technique to focus exploration on the most promising options. The integration of options into MCTS was a significant advancement, allowing the algorithm to plan more efficiently by considering higher-level strategies. This higher abstraction level helped the algorithm deal with complex games that required achieving multiple subgoals. The use of options reduced
the branching factor in the search tree, enabling deeper exploration within the same computational budget. The empirical results demonstrated that O-MCTS outperformed traditional MCTS in games requiring sequential subgoal achievements, such as collecting keys to open doors, showcasing its strength in strategic planning. The introduction of OL-MCTS further improved performance by learning which options were most effective, thus focusing the search on more promising parts of the game tree and improving efficiency. On the other hand, the reliance on predefined options and their proper tuning could be a limitation, as the performance of O-MCTS heavily depended on the quality and relevance of these options to the specific games being played. The initial computational overhead associated with constructing and managing a large set of options might impact the algorithm's performance, particularly in games with numerous sprites and complex dynamics. The progressive widening technique in OL-MCTS, while beneficial for focusing exploration, introduced additional complexity and overhead, potentially reducing real-time applicability. Further validation was needed to assess the scalability and robustness of these algorithms in a wider range of real-world game scenarios, where the diversity and unpredictability of game mechanics might present new challenges.

An extension of MCTS, incorporating heuristic evaluations through implicit minimax backups, was investigated in [178]. The approach aimed to combine the strengths of MCTS and minimax search to improve decision-making in strategic games by maintaining separate estimations of win rates and heuristic evaluations and using these to guide simulations. The integration of implicit minimax backups within MCTS significantly enhanced the quality of simulations by leveraging heuristic evaluations to inform decision-making. This hybrid approach addressed the limitations of pure MCTS in domains where tactical precision was crucial, effectively balancing strategic exploration with tactical accuracy. By maintaining separate values for win rates and heuristic evaluations, the algorithm could better navigate complex game states, leading to stronger play performance. The empirical results in games like Kalah, Breakthrough, and Lines of Action demonstrated substantial improvements over standard MCTS, validating the effectiveness of implicit minimax backups in diverse strategic environments. The method also showed robust performance across different parameter settings, highlighting its adaptability and potential for broader application in game-playing AI. However, the reliance on accurate heuristic evaluations introduced complexity in environments where such evaluations were not readily available or were difficult to compute. The additional computational overhead associated with maintaining and updating separate value estimates might impact the algorithm's efficiency, particularly in real-time applications. While the approach showed significant improvements in specific games, further validation was necessary to assess its scalability and robustness in more complex and dynamic scenarios. The dependence on domain-specific knowledge for heuristic evaluations might limit the generalizability of the method to a wider range of applications. Additionally, the complexity of tuning the parameter that weighted the influence of heuristic evaluations could pose challenges in optimizing the algorithm for different environments.

Authors in [179] explored the application of MCTS to the game of Lines of Action (LoA), a two-person zero-sum game known for its tactical depth and moderate branching factor. The authors proposed several enhancements to standard MCTS to handle the tactical complexities of LoA, including game-theoretical value proving, domain-specific simulation strategies, and effective use of progressive bias. The key strength of this paper lies in its innovative enhancements to MCTS, enabling it to handle the tactical and progression properties of LoA effectively. By incorporating game-theoretical value proving, the algorithm could identify and propagate winning and losing positions more efficiently, reducing the computational burden of extensive simulations. The use of domain-specific simulation strategies significantly improved the quality of the simulations, leading to more accurate evaluations and better overall performance. The empirical results demonstrated that the enhanced MCTS variant outperformed the world's best αβ-based LoA program, marking a significant milestone for MCTS in handling highly tactical games. The detailed analysis and systematic approach to integrating domain knowledge into MCTS provided a robust framework for applying MCTS to other complex games. Despite its advantages, the approach introduced additional computational overhead due to the need for maintaining and updating multiple enhancements, such as the progressive bias and game-theoretical value proving. The reliance on domain-specific knowledge for simulation strategies and the progressive bias limited the generalizability of the method to other games without similar properties. While the algorithm performed well in the controlled environment of LoA, its scalability and robustness in more dynamic and less structured environments remained to be fully explored. The complexity of tuning various parameters, such as the progressive bias coefficient and the simulation cutoff threshold, could also pose challenges, particularly for practitioners without deep domain knowledge. Further validation in diverse real-world scenarios was necessary to assess the practical applicability and long-term benefits of the proposed enhancements.

Authors in [180] explored the application of MCTS to
the popular collectible card game "Hearthstone: Heroes of Warcraft." Given the game's complexity, uncertainty, and hidden information, the authors proposed enriching MCTS with heuristic guidance and a database of decks to enhance performance. The approach was empirically validated against vanilla MCTS and state-of-the-art AI, showing significant performance gains. The integration of expert knowledge through heuristics and a deck database addressed two major challenges in Hearthstone: hidden information and large search space. By incorporating a heuristic function into the selection and simulation steps, the algorithm could more effectively navigate the game's complex state space, leading to improved decision-making and performance. The use of a deck database allowed the MCTS algorithm to predict the opponent's cards more accurately, enhancing the quality of simulations and overall strategy. The empirical results demonstrated that the enhanced MCTS approach significantly outperformed vanilla MCTS and was competitive with existing AI players, showcasing its potential for complex, strategic games like Hearthstone. However, the reliance on pre-constructed heuristics and deck databases introduced additional complexity and potential limitations. The effectiveness of the heuristic function was highly dependent on its design and tuning, which might vary across different game scenarios and decks. Similarly, the deck database's accuracy and comprehensiveness were crucial for predicting opponent strategies, which might be challenging to maintain as new cards and strategies emerged. The additional computational overhead associated with managing and updating these enhancements could impact real-time performance, particularly in time-constrained gameplay environments. While the approach showed strong performance in controlled experiments, further validation in diverse and dynamic real-world scenarios was necessary to fully assess its robustness and adaptability.

Information Set MCTS (ISMCTS), an extension of MCTS designed for games with hidden information and uncertainty, was introduced in [181]. ISMCTS algorithms searched trees of information sets rather than game states, providing a more accurate representation of games with imperfect information. The approach was tested across three domains with different characteristics. The ISMCTS algorithms excelled in efficiently managing hidden information by constructing trees where nodes represented information sets instead of individual states. This method mitigated the strategy fusion problem seen in determinization approaches, where different states in an information set were treated independently. By unifying statistics about moves within a single tree, ISMCTS made better use of the computational budget, leading to more informed decision-making. The empirical results demonstrated that ISMCTS outperformed traditional determinization-based methods in games with significant hidden information and partially observable moves. For instance, in Lord of the Rings: The Confrontation and the Phantom, ISMCTS achieved superior performance by effectively leveraging the structure of the game and reducing the impact of strategy fusion. However, the ISMCTS approach introduced additional complexity in maintaining and updating information sets, which could lead to increased computational overhead. The scalability of the method in environments with extensive hidden information and large state spaces remained a challenge, as the branching factor in information set trees could become substantial. While ISMCTS showed promising results in the tested domains, further validation in more diverse and dynamic scenarios was necessary to fully assess its robustness and general applicability. The reliance on accurate modeling of information sets and the necessity for domain-specific adaptations could limit the ease of implementation and the algorithm's flexibility across different types of games.

Authors in [182] investigated the application of MCTS to Kriegspiel, a variant of chess characterized by hidden information and dynamic uncertainty. The authors explored three MCTS-based methods, incrementally refining the approach to handle the complexities of Kriegspiel. They compared these methods to a strong minimax-based Kriegspiel program, demonstrating the effectiveness of MCTS in this challenging environment. The authors' incremental refinement of MCTS methods for Kriegspiel effectively addressed the game's dynamic uncertainty and large state space. By leveraging a probabilistic model of the opponent's pieces and incorporating domain-specific heuristics, the refined MCTS algorithms significantly improved performance compared to the initial naive implementation. The experimental results showed that the final MCTS approach outperformed the minimax-based program, achieving better strategic planning and decision-making. The innovative use of a three-tiered game tree representation and opponent modeling techniques demonstrated the adaptability and robustness of MCTS in handling partial information games. This study provided valuable insights into the application of MCTS in environments with incomplete information, highlighting its potential for broader applications in similar domains. Having said that, the reliance on extensive probabilistic modeling and heuristic adjustments introduced additional complexity, which could be computationally intensive and challenging to maintain. The performance improvements were heavily dependent on the accuracy and relevance of the probabilistic models, which might vary across different game scenarios and opponents. While the approach showed strong performance in the controlled environment of Kriegspiel, its scalability and robustness in more diverse
and dynamic real-world scenarios remained to be fully validated. The necessity for domain-specific knowledge to construct effective heuristics and models limited the generalizability of the method to other games with different characteristics. Further research was needed to explore the method's applicability and efficiency in a wider range of partial information games and real-world decision-making tasks.

Two extensions of MCTS tailored for Asymmetric Trees (MCTS-T) and environments with loops were introduced in [37]. The first algorithm, MCTS-T, addressed the challenges posed by asymmetric tree structures by incorporating tree uncertainty into the UCB formula, thus improving exploration. The second algorithm, MCTS-T+, further extended this approach to handle loops by detecting and appropriately managing repeated states within a single trace. The effectiveness of these algorithms was demonstrated through benchmarks on OpenAI Gym and Atari 2600 games. The key strength of MCTS-T lies in its ability to efficiently manage asymmetric trees by backing up the uncertainty related to the tree structure. This approach ensured that the algorithm could prioritize exploration in less explored subtrees, significantly improving the learning efficiency in environments with uneven tree depths. MCTS-T+ further enhanced this capability by addressing loops, preventing redundant expansions of the same state, and thus saving computational resources. The empirical results on benchmark tasks demonstrated that both MCTS-T and MCTS-T+ consistently outperformed standard MCTS, particularly in environments with strong asymmetry and loops. These improvements highlighted the potential of the proposed methods to extend the applicability of MCTS to more complex domains, such as robotic control and single-player video games, where asymmetry and looping were common. However, the reliance on non-stochastic environments and full state observability limited the generalizability of MCTS-T and MCTS-T+. These assumptions might not hold in many real-world scenarios, such as partially observable or highly stochastic environments, reducing the algorithms' applicability. The computational overhead associated with maintaining and updating the tree uncertainty and detecting loops could be substantial, particularly in large and dynamic state spaces. While the proposed methods showed promising results in controlled experiments, further validation in diverse real-world applications was necessary to fully assess their scalability and robustness. The complexity of tuning additional parameters, such as the uncertainty thresholds and loop detection mechanisms, posed additional challenges for practical implementation.

In the next subsection, we focus on Prioritized Sweeping, another algorithm that falls under the umbrella of Model-based Planning.

b) Prioritized Sweeping: Prioritized Sweeping was introduced by [183], which enhances Model-based RL by focusing updates on the most critical state-action pairs. This prioritization accelerates learning and convergence. As shown in Alg. 19, Prioritized Sweeping's fundamental mechanisms include model learning, priority queue management, and backward sweeping. In model learning, the algorithm constructs a model of the environment, capturing state transition probabilities and rewards (lines 3-6). A priority queue is maintained, where state-action pairs are prioritized based on the magnitude of their potential update (error) (lines 7-8). Backward sweeping ensures that significant changes are promptly addressed by propagating their impact backward to predecessors (lines 9-17). The process begins with experience collection (lines 3-5), where the agent interacts with the environment to gather transitions (state, action, reward, next state). The model is then updated with these new transitions (line 6). For each affected state-action pair, the algorithm calculates its priority (lines 7-8), reflecting the expected magnitude of the value update. Priorities are managed in a queue, and the algorithm performs value updates in order of priority until the queue is exhausted or a convergence criterion is met (lines 9-17) [183], [184], [1].

Algorithm 19 Prioritized Sweeping
1: Initialize Q(s, a), Model(s, a), for all s, a, and PQueue to empty
2: while true do   ⊲ Do forever
3:   S ← current (non-terminal) state
4:   A ← policy(S, Q)
5:   Execute action A; observe resultant reward, R, and state, S'
6:   Model(S, A) ← (R, S')
7:   P ← |R + γ max_a Q(S', a) − Q(S, A)|
8:   if P > θ then
9:     Insert (S, A) into PQueue with priority P
10:   end if
11:   for i = 1 to n do
12:     if PQueue is not empty then
13:       S, A ← first(PQueue)
14:       R, S' ← Model(S, A)
15:       Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
16:       for each (S̃, Ã) predicted to lead to S do
17:         R̃ ← predicted reward for (S̃, Ã, S)
18:         P ← |R̃ + γ max_a Q(S, a) − Q(S̃, Ã)|
19:         if P > θ then
20:           Insert (S̃, Ã) into PQueue with priority P
21:         end if
22:       end for
23:     end if
24:   end for
25: end while

Prioritized Sweeping refines the Bellman Equation by updating states in a priority-driven manner as follows:

P(s, a) = |R(s, a) + γ max_{a'} Q(s', a') − Q(s, a)|    (30)

where P(s, a) is the priority of the state-action pair (s, a). As the papers analyzed over the following paragraphs show, the primary advantage of Prioritized Sweeping is its efficiency. By focusing updates on high-priority areas, it reduces unnecessary computations and accelerates learning. This prioritization also leads to faster convergence compared to traditional DP methods. However, the algorithm introduces complexity by maintaining a priority queue and performing backward sweeps, adding computational overhead. Additionally, the efficiency gains depend on the accuracy of the model, which might not be perfect in all scenarios [183], [1].
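The following minimal Python sketch illustrates this priority-driven update scheme for a tabular, deterministic setting in the spirit of Alg. 19. It assumes a hypothetical environment exposing the classic Gym-style reset()/step() interface with discrete, hashable states, and the hyperparameters (alpha, gamma, theta, and the planning budget) are placeholder values rather than recommendations from the literature.

import heapq
import itertools
import random
from collections import defaultdict

def prioritized_sweeping(env, n_episodes=50, n_planning=10,
                         alpha=0.1, gamma=0.95, theta=1e-4, eps=0.1):
    Q = defaultdict(float)               # tabular action values Q[(s, a)]
    model = {}                           # deterministic model: (s, a) -> (r, s')
    predecessors = defaultdict(set)      # s' -> set of (s, a) observed to lead to s'
    pqueue, tie = [], itertools.count()  # max-priority queue via negated priorities
    actions = range(env.action_space.n)

    def behaviour(s):                    # epsilon-greedy behaviour policy
        if random.random() < eps:
            return random.choice(list(actions))
        return max(actions, key=lambda a: Q[(s, a)])

    def priority(s, a, r, s2):           # |TD error|, as in Eq. (30)
        return abs(r + gamma * max(Q[(s2, b)] for b in actions) - Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            a = behaviour(s)
            s2, r, done, _ = env.step(a)
            model[(s, a)] = (r, s2)
            predecessors[s2].add((s, a))
            p = priority(s, a, r, s2)
            if p > theta:
                heapq.heappush(pqueue, (-p, next(tie), (s, a)))
            # planning: process the highest-priority pairs first
            for _ in range(n_planning):
                if not pqueue:
                    break
                _, _, (ps, pa) = heapq.heappop(pqueue)
                pr, ps2 = model[(ps, pa)]
                target = pr + gamma * max(Q[(ps2, b)] for b in actions)
                Q[(ps, pa)] += alpha * (target - Q[(ps, pa)])
                # backward sweep: re-check every known predecessor of the updated state
                for (qs, qa) in predecessors[ps]:
                    qr, _ = model[(qs, qa)]
                    qp = priority(qs, qa, qr, ps)
                    if qp > theta:
                        heapq.heappush(pqueue, (-qp, next(tie), (qs, qa)))
            s = s2
    return Q

Whether this bookkeeping pays off depends on how costly real environment interaction is relative to maintaining the queue, a trade-off that recurs throughout the studies reviewed below.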
The Prioritized Sweeping algorithm was introduced in [184] to improve learning efficiency in stochastic Markov systems. The method combined the rapid performance of incremental learning methods like TD and Q-learning with the accuracy of DP by prioritizing updates. By maintaining a priority queue, the algorithm processed the most impactful updates first, leading to faster convergence in large state-space environments. Although the algorithm showed superior performance in simulations, it involved significant computational overhead due to the priority queue maintenance and state tracking.
The complexity of setting up and tuning the priority mechanism posed challenges, particularly in dynamic environments with unpredictable state transitions.

An enhancement to the Deep Deterministic Policy Gradient (DDPG) (discussed in subsection V-B1) algorithm by incorporating Prioritized Sweeping for morphing UAVs is introduced in [185]. The objective is to improve the efficiency and effectiveness of morphing strategy decisions by avoiding the random selection of state-action pairs and instead focusing on those with a significant impact on learning outcomes. The integration of Prioritized Sweeping with the DDPG framework notably enhances learning efficiency by prioritizing state-action pairs that are most influential in updating the policy. This targeted approach accelerates convergence and improves decision-making accuracy, which is crucial for the dynamic and complex task of UAV morphing. The method effectively combines the strengths of Value-based and Policy Gradient-based RL, leveraging deep neural networks for state evaluation and policy improvement. The simulation results demonstrate that the improved algorithm achieves faster learning and higher total rewards compared to DDPG, validating its superiority in handling complex morphing scenarios. Despite its advantages, the reliance on accurate prioritization mechanisms introduces additional complexity in maintaining and updating the priority queue, which could impact performance in highly dynamic environments where state changes are rapid and unpredictable. The assumption that changes in sweep angle do not significantly affect flight status within short time frames may oversimplify real-world conditions, potentially limiting the model's accuracy. While the simulation results are promising, further validation in real-world applications is necessary to fully assess the method's robustness and scalability. The initial phase of training and data generation still requires significant computational resources, which could be a bottleneck in practical deployments.

Authors in [186] introduced modifications to Dyna-learning and Prioritized Sweeping algorithms, incorporating an epoch-incremental approach using Breadth-first Search (BFS). The combination of incremental and epoch-based updates improved learning efficiency, leading to faster convergence in dynamic environments like grid worlds. The use of BFS after episodes provided a more comprehensive understanding of the state space. However, managing dual modes of policy updates and accurate BFS calculations introduced complexity, potentially increasing computational overhead. The method showed strong simulation results, but real-world validation was necessary to understand its scalability and practical implications.

Authors in [187] introduced an innovative extension to the Prioritized Sweeping algorithm by employing small backups instead of full backups. The primary aim is to enhance the computational efficiency of Model-based RL by reducing the complexity of value updates, making it more suitable for environments with a large number of successor states. The most notable advantage of this approach is its ability to perform value updates using only the current value of a single successor state, significantly reducing the computation time. By utilizing small backups, the algorithm allows for finer control over the planning process, leading to more efficient update strategies. The empirical results demonstrate that the small backup implementation of Prioritized Sweeping achieves substantial performance improvements over traditional methods, particularly in environments where full backups are computationally prohibitive. The theoretical foundation provided in the paper supports the robustness of the small backup approach, ensuring that it maintains the accuracy of value updates while enhancing computational efficiency. Additionally, the parameter-free nature of small backups eliminates the need for tuning step-size parameters, which is a common challenge in traditional sample-based methods. On the other hand,
the approach's reliance on maintaining and updating a priority queue introduces additional memory and computational overhead, particularly in environments with many state-action pairs. While the method improves computational efficiency, it requires careful management of memory resources to store component values associated with small backups. The simulations primarily focus on deterministic environments, leaving the performance of the small backup approach in highly stochastic or dynamic settings less explored. Further validation in real-world applications is necessary to fully assess the scalability and practicality of the method. Additionally, the initial phase of model construction and priority queue management could introduce latency, impacting the algorithm's real-time performance in complex environments.

Authors in [188] introduced Cooperative Prioritized Sweeping to efficiently handle Multi-Agent Markov Decision Processes (MMDPs) using factored Q-functions and Dynamic Decision Networks (DDNs). CPS managed large state-action spaces effectively, leading to faster convergence in multi-agent environments. However, the reliance on accurately specified DDN structures and the batch update mechanism introduced complexity, increasing computational overhead in dynamic environments. While simulations showed Cooperative Prioritized Sweeping's potential, further real-world validation was needed to assess its scalability and robustness.

A prioritized sweeping approach combined with Confidence-based Dual RL (CB-DuRL) for routing in Mobile Ad-Hoc Networks (MANETs) is investigated by researchers in [41]. The proposed method dynamically selects routes based on real-time traffic conditions, aiming to minimize delivery time and congestion. The key strength of this approach is its dynamic adaptability to real-time traffic conditions, addressing the limitations of traditional shortest path routing methods. By leveraging prioritized sweeping, the algorithm prioritizes updates to the most critical state-action pairs, enhancing learning efficiency and ensuring optimal path selection under varying network loads. The inclusion of CB-DuRL refines routing decisions by considering the reliability of Q-values, thus improving the robustness and reliability of the routing protocol. Empirical results from simulations on a 50-node MANET demonstrate that the proposed method significantly outperforms traditional routing protocols in terms of packet delivery ratio, dropping ratio, and delay, showcasing its effectiveness in handling high traffic conditions and reducing network congestion. However, the approach's dependence on accurate estimation of traffic conditions and Q-values may limit its adaptability in highly dynamic or unpredictable environments where traffic patterns change rapidly. The initial phase of gathering sufficient data to populate the Q-tables and confidence values can introduce latency and computational overhead, potentially impacting the algorithm's performance in real-time applications. While the simulations provide strong evidence of the method's efficacy, further validation in larger and more diverse network topologies is necessary to fully assess its scalability and robustness. Managing priority queues and updating Q-values in a distributed manner across multiple nodes can also pose challenges in maintaining synchronization and consistency in real-world deployments.

A structured version of Prioritized Sweeping to enhance RL efficiency in large state spaces by leveraging Dynamic Bayesian Networks (DBNs) was introduced in [189]. The method accelerated learning by grouping states with similar values, reducing updates needed for convergence. DBNs provided a compact environment representation, further improving computational efficiency. However, the reliance on predefined DBN structures limited applicability in dynamic environments, and maintaining these structures added computational overhead. While promising in simulations, the method required further validation in diverse real-world scenarios to fully assess its scalability and robustness.

c) Dyna-Q: Two variations of Dyna-Q are considered in RL: Dyna-Q with a tabular representation and Dyna-Q with function approximation. Acknowledging this, we analyze both variations in this section.

The Dyna-Q algorithm, introduced in [73], combines traditional RL with planning in a novel way to improve the efficiency and adaptability of learning agents. This section provides an overview of the essential concepts and mechanisms underlying the Dyna-Q algorithm. The algorithm integrates learning from real experiences with planning using a learned model of the environment. This integration allows the agent to improve its policy more rapidly and efficiently by leveraging both actual and simulated experiences. Dyna-Q builds on the Q-learning update [101], whose target combines the immediate reward and the discounted value of the next state, guiding the agent towards actions that maximize long-term rewards. On top of that, its architecture includes a model of the environment that predicts the next state and reward given a current state and action. This model allows the agent to simulate hypothetical experiences, effectively planning future actions without needing to interact with the real environment continuously. Also, the agent uses the learned model to generate simulated experiences. These hypothetical experiences are treated similarly to real experiences, updating the Q(s, a) values and refining the agent's policy. This process accelerates learning by allowing the agent to practice and plan multiple scenarios internally. Dyna-Q's planning process is incremental, meaning it can be interrupted and resumed at any time. This feature makes it highly adaptable to dynamic
environments where the agent must continuously learn and update its policy based on new information. This algorithm operates through a series of steps illustrated in Alg. 20. First, the agent interacts with the environment, observes the current state, selects an action, and receives a reward and the subsequent state (lines 3-5). Based on this experience, the Q(s, a) value is updated (line 6). Next, the world model is revised using the observed transition (s, a, r, s') (line 7), which enhances its predictions for future planning. The agent then utilizes the world model to generate hypothetical experiences by selecting states and actions and predicting their outcomes (lines 8-14). These hypothetical experiences are used to further update the Q(s, a) values, enabling the agent to refine its policy without additional real-world interactions.

Algorithm 20 Dyna-Q
1: Initialize Q(s, a) and Model(s, a) for all s ∈ S and a ∈ A(s)
2: repeat
3:   S ← current (non-terminal) state
4:   A ← ε-greedy(S, Q)
5:   Execute action A; observe resultant reward, R, and state, S'
6:   Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
7:   Model(S, A) ← R, S' (assuming deterministic environment)
8:   for i = 1 to n do
9:     S ← random previously observed state
10:     A ← random action previously taken in S
11:     R, S' ← Model(S, A)
12:     Q(S, A) ← Q(S, A) + α[R + γ max_a Q(S', a) − Q(S, A)]
13:   end for
14: until convergence or a stopping criterion is met
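Mirroring Alg. 20, the following minimal Python sketch combines the direct RL update, model learning, and planning steps in a single loop. It assumes a deterministic environment with the classic Gym-style interface, and the names and hyperparameters (env, n_planning, alpha, gamma, eps) are illustrative placeholders rather than settings from any of the studies reviewed below.

import random
from collections import defaultdict

def dyna_q(env, n_episodes=200, n_planning=20, alpha=0.1, gamma=0.95, eps=0.1):
    Q = defaultdict(float)        # tabular action values Q[(s, a)]
    model = {}                    # deterministic model: (s, a) -> (r, s')

    def eps_greedy(s):
        if random.random() < eps:
            return env.action_space.sample()
        return max(range(env.action_space.n), key=lambda a: Q[(s, a)])

    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            # (a) direct RL: act, observe, and apply the Q-learning update
            a = eps_greedy(s)
            s2, r, done, _ = env.step(a)
            best_next = max(Q[(s2, b)] for b in range(env.action_space.n))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            # (b) model learning: record the observed transition
            model[(s, a)] = (r, s2)
            # (c) planning: replay n_planning hypothetical transitions from the model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                best = max(Q[(ps2, b)] for b in range(env.action_space.n))
                Q[(ps, pa)] += alpha * (pr + gamma * best - Q[(ps, pa)])
            s = s2
    return Q

Increasing n_planning shifts more of the learning onto simulated experience, which is the source of Dyna-Q's sample-efficiency advantage over plain Q-learning.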
Over the next few paragraphs, several studies are analyzed regarding the Dyna-Q algorithm.

This paper [190] explored the development and application of the Dyna-H algorithm, a strategy that incorporated heuristic planning for optimizing decision-making in Role-playing Games (RPGs). The Dyna-H algorithm aimed to enhance path-finding efficiency by blending heuristic search with the Model-free characteristics of RL, specifically within the context of the Dyna architecture. The primary advantage of the Dyna-H algorithm lay in its innovative combination of heuristic search and RL, effectively addressing the challenges posed by large and complex search spaces typical in RPG scenarios. This synergy allowed the algorithm to focus on promising solution branches, significantly improving learning speed and policy quality compared to traditional methods like Q-learning and Dyna-Q. Additionally, the use of heuristic planning enabled the Dyna-H algorithm to perform efficiently without requiring a complete model of the environment beforehand, which was particularly beneficial for real-time applications. However, one drawback of the approach was the reliance on predefined heuristic functions, which might have limited its adaptability to various game environments where such heuristics were not readily available or accurate. The paper's evaluation, while thorough, focused primarily on deterministic grid-like environments, potentially overlooking the complexities and stochastic nature of more dynamic and unpredictable game settings. This narrow evaluation scope might have raised questions about the generalizability of the algorithm's performance in broader contexts.

In [191], a Multi-objective Dyna-Q based Routing (MODQR) algorithm designed to optimize both delay and energy efficiency in wireless mesh networks was introduced. By leveraging the Dyna-Q technique, the algorithm aimed to address dynamic network conditions, ensuring that selected paths exhibited minimal delay and optimal energy usage. The key strength of the MODQR approach was its innovative integration of Dyna-Q with multi-objective optimization, effectively balancing delay and energy efficiency in real time. The use of Dyna-Q enhanced the convergence speed, making the algorithm well-suited for dynamic and complex network environments where conditions frequently change. Additionally, the consideration of both forward and reverse link conditions in determining path quality provided a more accurate assessment of route reliability. The inclusion of dynamic learning and exploration rates further refined the algorithm's adaptability, ensuring continuous improvement and optimization as the network evolved. However, the approach's reliance on predefined model parameters and assumptions, such as the gray physical interference model, might have limited its flexibility across different network topologies and conditions. While the Dyna-Q algorithm's fast convergence was advantageous, its performance in highly volatile environments with extreme fluctuations might not have been as robust. Moreover, the evaluation primarily focused on simulation results, which, while promising, might not have fully captured the complexities and unpredictable behaviors encountered in real-world deployments. The lack of comprehensive real-world testing left some uncertainty regarding the algorithm's practical effectiveness and scalability.

Authors in [192] presented a novel approach for managing Type-1 Diabetes Mellitus through a Glycemic control system using Dyna-Q. The system aimed to automate insulin infusion based on Continuous Glucose Monitoring (CGM) data, eliminating the need for manual carbohydrate input from patients. The primary strength of this approach lies in its innovative use of the Dyna-Q
algorithm, which combines Model-based and Model-free RL to enhance learning efficiency and control accuracy. By leveraging past CGM and insulin data, the system could predict future glucose levels and optimize insulin dosage in real time, resulting in improved Glycemic control. The algorithm's ability to operate effectively without explicit carbohydrate information significantly reduced the cognitive load on patients, aligning well with the goals of developing a fully automated artificial pancreas. Furthermore, the incorporation of a precision medicine approach tailored the model to individual patients, enhancing the adaptability and accuracy of the Glycemic predictions. On the other hand, the reliance on precise CGM data and insulin records for model training might have posed challenges in real-world scenarios where data quality could vary. While the algorithm showed promise in simulations and preliminary tests with real patients, its generalizability across diverse patient populations and long-term robustness remained areas of concern. The study's limited real-world testing meant that further extensive clinical trials were necessary to fully validate the system's effectiveness and safety in everyday use.

Dyna-T, which integrated Dyna-Q with Upper Confidence Bounds applied to Trees (UCT) for enhanced planning efficiency, was developed in [193]. The method aimed to improve action selection by using UCT to explore simulated experiences generated by the Dyna-Q model, demonstrating its effectiveness in several OpenAI Gym environments. The primary strength of Dyna-T was its ability to combine the Model-based learning of Dyna-Q with the robust exploration capabilities of UCT. This combination allowed for a more directed exploration strategy, which enhanced the agent's ability to find optimal actions in complex and stochastic environments. The algorithm's use of UCT ensured that the most promising action paths were prioritized, significantly improving learning efficiency and convergence speed. The empirical results indicated that Dyna-T outperformed traditional RL methods, especially in environments with high variability and sparse rewards, showcasing its potential for broader applications in RL tasks. Despite its advantages, Dyna-T exhibited some limitations, particularly in deterministic environments where the added computational complexity of UCT might not have yielded significant benefits. The initial overhead of constructing and maintaining the UCT structure could be costly, especially in simpler tasks where traditional Dyna-Q or Q-learning might have sufficed. Moreover, while Dyna-T showed promise in simulated environments, its performance in real-world scenarios with continuous state and action spaces remained to be fully validated. The reliance on the exploration parameter 'c' and its impact on performance also warranted further investigation to ensure robust adaptability across different problem domains.

A novel multi-path load-balancing routing algorithm for Wireless Sensor Networks (WSNs) leveraging the Dyna-Q algorithm was proposed in [194]. The proposed method, termed ELMRRL (Energy-efficient Load-balancing Multi-path Routing with RL), aimed to minimize energy consumption and enhance network lifetime by selecting optimal routing paths based on residual energy, hop count, and energy consumption of nodes. The main advantage of the ELMRRL algorithm was its effective integration of Dyna-Q RL, which combined real-time learning with planning to adaptively select optimal routing paths. This dynamic adjustment ensured that the algorithm could respond to changes in network conditions, thus extending the network lifetime. The use of RL enabled each sensor node to act as an agent, making independent decisions based on local information, which was crucial for distributed environments like WSNs. Furthermore, the algorithm's focus on both immediate and long-term cumulative rewards led to more balanced energy consumption across the network, preventing premature node failures and improving overall network resilience. On the other hand, the reliance on local information for decision-making might have limited the algorithm's effectiveness in scenarios where global network knowledge could provide more optimal solutions. The initial learning phase, where nodes gathered and processed data to update their Q-values, could have introduced latency and computational overhead, potentially affecting the algorithm's performance in highly dynamic environments. Additionally, the approach's dependence on the correct parameterization of the reward function and learning rates could have been challenging, as these parameters significantly impacted the algorithm's efficiency and convergence speed. The paper primarily validated the algorithm through simulations, so its real-world applicability in diverse WSN deployments remained to be fully explored.

Researchers in [34] introduced a new motion control method for path planning in unfamiliar environments using the Dyna-Q algorithm. The goal of the proposed method was to improve the efficiency of motion control by combining direct RL with model learning, enabling agents to effectively navigate both dynamic and static obstacles. The main benefit of this method was its use of the Dyna-Q algorithm, which merged Model-based and Model-free RL techniques, facilitating concurrent learning and planning. This integration notably enhanced convergence speed and adaptability to dynamic environments, as demonstrated by faster path optimization compared to traditional Q-learning. The algorithm's capacity to simulate experiences allowed the agent to plan more efficiently, resulting in improved decision-making in complex situations.
Furthermore, employing an ǫ-greedy policy for action selection ensured a balanced exploration-exploitation trade-off, which was essential for finding optimal paths in unknown environments. Nonetheless, the method's dependence on accurate environment models for effective planning might have restricted its performance in highly unpredictable settings where models might not accurately reflect real-world dynamics. The computational burden of maintaining and updating the Q-table and environment model could also have been significant, especially in large state-action spaces. Additionally, while the method demonstrated promising results in simulation environments, its scalability and robustness in real-world applications with continuous state spaces and real-time constraints needed further validation. The reliance on predefined parameters, such as learning rates and discount factors, could also have impacted the algorithm's efficiency and effectiveness, requiring careful tuning for different scenarios.

Researchers in [195] introduced an improved Dyna-Q algorithm for mobile robot path planning in unknown, dynamic environments. The proposed method integrated heuristic search strategies, Simulated Annealing (SA), and a new action-selection strategy to enhance learning efficiency and path optimization. The enhanced Dyna-Q algorithm effectively merged heuristic search and SA, significantly improving the exploration-exploitation balance. By incorporating heuristic rewards and actions, the algorithm ensured efficient navigation through complex environments, avoiding local minima and achieving faster convergence. The novel SA-ǫ-greedy policy dynamically adjusted exploration rates, optimizing the learning process. Empirical results from simulations and practical experiments showed that the improved algorithm outperformed Q-learning and Dyna-Q methods, demonstrating superior global search capabilities, enhanced learning efficiency, and robust convergence properties in both static and dynamic obstacle scenarios. Despite the performance improvements from integrating heuristic search and SA, the approach's reliance on predefined heuristic functions might have limited its adaptability to diverse environments. The initial training phase required extensive exploration, potentially leading to higher computational overhead and longer training times. Additionally, the dependence on grid-based environment representation might have restricted the algorithm's scalability to continuous state spaces. The focus on simulated and controlled real-world environments raised concerns about the algorithm's robustness and generalizability in more complex and unpredictable real-world applications. Further studies were necessary to validate the method's effectiveness in larger, unstructured, and more dynamic environments.

An innovative approach for scheduling the charging of Plug-in Electric Vehicles (PEVs) using the Dyna-Q algorithm was developed in [196]. The primary goal was to minimize long-term charging costs while considering the stochastic nature of driving behavior, traffic conditions, energy usage, and fluctuating energy prices. The method formulated the problem as an MDP and employed DRL techniques for solution optimization. The key strength of this approach lay in its effective combination of Model-based and Model-free RL, which enhanced learning speed and efficiency. By continuously updating the model with real experiences and generating synthetic experiences, the Dyna-Q algorithm ensured rapid convergence to an optimal charging policy. This dual approach allowed for robust decision-making even in the face of uncertain and dynamic real-world conditions, such as varying energy prices and unpredictable driving patterns. Additionally, the integration of a deep-Q network for value approximation facilitated handling the vast state space inherent in PEV charging scenarios, ensuring the method's scalability and applicability to real-world settings. However, the reliance on accurate initial parameter values and predefined models for the user's driving behavior might have limited the algorithm's adaptability during the initial learning phase. The system's performance heavily depended on the quality and accuracy of these models, which might have varied across different users and environments. Moreover, while the approach demonstrated significant improvements in simulations, its effectiveness in diverse and more complex real-world scenarios required further validation. The potential computational overhead associated with maintaining and updating the deep-Q network and model could have posed challenges for real-time applications, especially in large-scale deployments with numerous PEVs.

Authors in [197] presented an enhanced Dyna-Q algorithm tailored for Automated Guided Vehicles (AGVs) navigating complex, dynamic environments. The key improvements included a global path guidance mechanism based on heuristic graphs and a dynamic reward function designed to address issues of sparse rewards and slow convergence in large state spaces. The improved Dyna-Q algorithm stood out by integrating heuristic graphs that provided a global perspective on path planning, significantly reducing the search space and enhancing efficiency. This method enabled AGVs to quickly orient towards their goals by leveraging precomputed shortest path information, thus mitigating the problem of sparse rewards commonly encountered in extensive environments. Additionally, the dynamic reward function intensified feedback, guiding AGVs more effectively through complex terrains and around obstacles. The experimental results in various scenarios with static and dynamic obstacles demonstrated superior convergence speed and learning efficiency compared to traditional Q-learning and standard Dyna-Q algorithms, highlighting its robustness and effectiveness in dynamic settings.
However, the dependency on heuristic graphs, which required prior computation, might have limited the algorithm's adaptability in environments where real-time updates were necessary or in scenarios with unpredictable changes. The initial setup phase, involving the creation of the heuristic graph, could have introduced overheads that might not have been feasible for all applications. Furthermore, while the dynamic reward function enhanced learning efficiency, its design relied heavily on accurate modeling of the environment, which could have been challenging in highly variable or noisy conditions. The paper's focus on simulated environments left room for further validation in real-world applications, where additional factors such as sensor noise and real-time constraints could have impacted performance.

A Dyna-Q based anti-jamming algorithm designed to enhance the efficiency of path selection in wireless communication networks subject to malicious jamming was introduced in [198]. By leveraging both Model-based and Model-free RL techniques, the algorithm aimed to optimize multi-hop path selection, reducing packet loss and improving transmission reliability in hostile environments. The application of the Dyna-Q algorithm in this context was innovative, combining direct learning with simulated experiences to accelerate the convergence of the Q-values. This dual approach allowed the system to adapt quickly to dynamic jamming conditions, ensuring more reliable path selection and communication efficiency. The inclusion of a reward function that considered various modulation modes based on the Signal-to-Jamming Noise Ratio (SJNR) enhanced the robustness of the algorithm. Simulation results demonstrated that the Dyna-Q algorithm significantly outperformed traditional Q-learning and multi-armed bandit models, achieving faster convergence to optimal paths and better handling of interference, thus showcasing its potential for real-time applications in complex electromagnetic environments. Nevertheless, the method's reliance on pre-established environmental models might have limited its effectiveness in highly unpredictable or rapidly changing conditions, where the initial models might not accurately capture real-time dynamics. The need for accurate initial state representations and model updates introduced additional computational overhead, which could have impacted performance in larger or more complex networks. Furthermore, while the algorithm showed promise in simulated environments, its scalability and adaptability in real-world applications with varying node densities and jamming strategies required further validation. The focus on hop count minimization might also have overlooked other critical factors such as energy consumption and latency, which were essential for comprehensive network performance assessment.

Researchers in [199] introduced a Model-based RL method using Dyna-Q tailored for multi-agent systems. It emphasized efficient environmental modeling through a tree structure and proposed methods for model sharing among agents to enhance learning speed and reduce computational costs. The approach leveraged the concept of knowledge sharing, where agents with more experience assisted others by disseminating valuable information. The integration of a tree-based model for environmental learning within the Dyna-Q framework significantly enhanced the efficiency of model construction and memory usage. This method allowed agents to generate virtual experiences, thus accelerating the learning process. The innovative model sharing techniques proposed in the paper, such as grafting partial branches and resampling, enabled agents to build more accurate models collaboratively. By reducing redundant data transfer and focusing on useful experiences, these sharing methods improved sample efficiency and learning speed in complex environments. The simulation results demonstrated the effectiveness of these techniques in multi-agent cooperation scenarios, highlighting their potential to optimize learning in large, continuous state spaces. Despite its advantages, the reliance on accurate decision tree models for sharing experiences might have limited the approach's flexibility in highly dynamic or heterogeneous environments. The effectiveness of the sharing methods depended on the quality and relevance of the shared models, which might have varied across different agents and scenarios. Additionally, the initial phase of building accurate tree models could have been computationally intensive, particularly in environments with high variability. While the proposed methods showed promising results in simulations, further validation in diverse real-world applications was needed to fully assess their scalability and robustness. The paper also assumed a certain level of homogeneity among agents, which might not always have been applicable in more varied multi-agent systems.

The application of the Dyna-Q algorithm for path planning and obstacle avoidance in Unmanned Ground Vehicles (UGVs) and UAVs within complex urban environments was explored in [200]. The study focused on utilizing a vector field-based approach for effective navigation and air-ground collaboration tasks. The integration of the Dyna-Q algorithm with a vector field method significantly enhanced the efficiency and accuracy of path planning in dynamic urban settings. The approach leveraged both real and simulated experiences to adaptively update the agent's policy, ensuring rapid convergence to optimal paths. By simplifying the urban environment into a grid world, the method allowed for precise waypoint calculation, facilitating smooth navigation and effective obstacle avoidance.
The use of PID controllers for UAV and UGV coordination further improved the stability and responsiveness of the system, enabling robust air-ground collaboration. Simulation results demonstrated that the proposed method effectively handled dynamic obstacles and complex path scenarios, showcasing its potential for real-world applications in urban environments. On the other hand, the reliance on grid-based environment representation might have limited the algorithm's scalability and adaptability to continuous state spaces found in more diverse and unstructured urban areas. The initial phase of creating the vector field and the grid map could have introduced computational overheads, which might have impacted real-time performance. While the paper focused on simulated environments, further validation in real-world scenarios was necessary to assess the approach's robustness and effectiveness under varying conditions. Additionally, the method's dependency on accurate dynamic models for both UGV and UAV could have posed challenges, as any discrepancies between the model and the real environment might have affected the overall performance.

This paper [201] evaluated the performance of the Dyna-Q algorithm in robot navigation within partially known environments containing static obstacles. The study extended the Dyna-Q algorithm to multi-robot systems and conducted extensive simulations to assess its efficiency and effectiveness. The primary strength of this study lies in its thorough analysis of the Dyna-Q algorithm in both single and multi-agent contexts. By integrating planning and Model-based learning, the Dyna-Q algorithm sped up the learning process, enabling robots to navigate efficiently even with limited prior knowledge of the environment. The use of simulations with the Robot Motion Toolbox allowed for a comprehensive evaluation of the algorithm's performance across various parameters, providing valuable insights into the optimal settings for different scenarios. Extending the Dyna-Q algorithm to multi-robot systems showcased its adaptability and potential for complex task coordination, where agents could share knowledge to enhance overall system performance. However, the paper's focus on static obstacles and deterministic environments might have limited the applicability of the findings to more dynamic and stochastic settings. The initial need for environment discretization and model construction introduced additional computational overhead, which could have been a bottleneck in real-time applications. While the simulations offered a robust analysis, the lack of real-world validation left some uncertainty about the algorithm's practical effectiveness in unpredictable and continuously changing environments. Furthermore, the performance degradation observed in multi-agent scenarios indicated that further refinement was needed to improve coordination and reduce inter-agent interference.

The primary advantage of Dyna-Q is its ability to accelerate learning by combining real and simulated experiences. This dual approach reduces the dependency on extensive real-world interactions, making it suitable for applications where such interactions are costly or limited. Moreover, Dyna-Q is inherently adaptable to changing environments. The integration of continuous learning and planning allows the agent to update its policy dynamically in response to new information or changes in the environment [73], [1]. Table XII provides a summary of the papers that utilized Model-based Planning algorithms.

TABLE XII: Model-based Planning Papers Review
Application Domain | References
Theoretical Research (Convergence, stability) | [184], [187]
Multi-agent Systems and Autonomous Behaviors | [49], [188], [199], [196], [197], [200]
Games and Simulations | [170], [173], [175], [177], [179], [180], [181], [182], [37], [190], [195]
Energy and Power Management (IoT Networks, Smart Energy Systems) | [194], [192]
Transportation and Routing Optimization | [41]
Network Resilience and Optimization | [198], [194]
Hybrid RL Algorithms | [174], [176], [178], [185], [193]
Dynamic Bayesian networks | [189]
Dynamic Environments | [191], [34], [195], [197]
Robotics | [201]
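To make the interplay between real and simulated experience concrete, the following is a minimal Python sketch of a tabular Dyna-Q loop on a toy chain environment; the environment, the number of planning steps, and all hyperparameter values are illustrative assumptions for this sketch rather than settings taken from any of the surveyed papers.

import random
from collections import defaultdict

# Minimal sketch of tabular Dyna-Q, assuming a tiny deterministic chain-style
# environment exposed through reset()/step(); all hyperparameters are illustrative.

class ToyChainEnv:
    """5-state chain: move left/right, reward 1 only when the right end is reached."""
    def __init__(self, n=5):
        self.n, self.s = n, 0
    def reset(self):
        self.s = 0
        return self.s
    def step(self, a):                      # a: 0 = left, 1 = right
        self.s = max(0, self.s - 1) if a == 0 else min(self.n - 1, self.s + 1)
        done = self.s == self.n - 1
        return self.s, (1.0 if done else 0.0), done

def dyna_q(env, episodes=200, alpha=0.1, gamma=0.95, eps=0.1, n_planning=10):
    Q = defaultdict(float)                  # Q[(state, action)]
    model = {}                              # model[(state, action)] -> (reward, next_state)
    actions = [0, 1]
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            a = random.choice(actions) if random.random() < eps else max(actions, key=lambda x: Q[(s, x)])
            s2, r, done = env.step(a)
            # (1) direct RL update from the real transition
            target = r + (0.0 if done else gamma * max(Q[(s2, x)] for x in actions))
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            # (2) learn the model from the same transition
            model[(s, a)] = (r, s2)
            # (3) planning: replay simulated experience drawn from the learned model
            for _ in range(n_planning):
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                ptarget = pr + gamma * max(Q[(ps2, x)] for x in actions)
                Q[(ps, pa)] += alpha * (ptarget - Q[(ps, pa)])
            s = s2
    return Q

if __name__ == "__main__":
    Q = dyna_q(ToyChainEnv())
    print({k: round(v, 2) for k, v in Q.items()})

Raising n_planning shifts effort from real interaction toward model-based planning, which is the lever the works above exploit to reduce the number of costly environment interactions.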
In the next section, a complete introduction to another paradigm of RL, Policy-based Methods, is given, along with an analysis of the various algorithms that fall under its umbrella.

IV. POLICY-BASED METHODS

Policy-based methods are another fundamental family of RL methods, one that places a stronger emphasis on directly optimizing the policy used to choose actions for an agent. In contrast to Value-based methods, which search for the value function implicit in the task and then derive an optimal policy, Policy-based methods directly parameterize and optimize the policy. This approach offers several advantages, particularly in dealing with very challenging environments that have high-dimensional action spaces or where policies are inherently stochastic. At the core, Policy-based methods operate on a parameterization of the policy, usually denoted as π(a|s; θ). Here, θ denotes the parameters of the policy, while s denotes the state and a denotes the action. In other words, the goal is to find the optimal parameters θ∗ that maximize the expected cumulative reward. This is generally done by gradient ascent techniques, and more specifically by Policy Gradient methods, which explicitly compute the gradient of the expected reward with respect to the policy parameters and modify the parameters in the direction of reward increase [1], [6], [15].

The gradient of the expected return with respect to the policy parameters can be expressed in terms of Q-values as follows:

∇_θ J(θ) = E_{π_θ}[ ∇_θ log π_θ(a|s) Q^{π_θ}(s, a) ]    (31)

Here, Q^{π_θ}(s, a) represents the action-value function under the current policy. This gradient can be estimated using samples from the environment, allowing the application of gradient ascent to update the policy parameters iteratively. The advantages of Policy-based Methods can be simplified to:

• Direct Optimization of the Policy: These methods optimize the policy directly and, in contrast to Value-based methods, work more successfully with continuous and high-dimensional action spaces. They usually work very well for problems within robotics and control where actions are naturally continuous.
• Stochastic Policies: Policy-based methods naturally accommodate stochastic policies, which may be important in settings where exploration is necessary or the optimal policy is inherently stochastic. Stochastic policies also help the agent manage the exploration-exploitation trade-off more efficiently.
• Improved Convergence Properties: Policy-based methods may have much smoother convergence than Value-based methods, especially when the latter are prone to instabilities and divergence arising from value function approximation.

Now, it is time to talk about the first Policy Gradient method, REINFORCE, in the next subsection.

A. REINFORCE

The REINFORCE algorithm is a seminal contribution to RL, particularly within the context of policy gradient methods. The algorithm is designed to optimize the expected cumulative reward by adjusting the policy parameters in the direction of the gradient of the expected reward. As demonstrated in Alg. 21, the REINFORCE algorithm is rooted in the stochastic policy framework, where the policy, parameterized by θ, defines a probability distribution over actions given the current state. The key insight of the REINFORCE algorithm is to use the log-likelihood gradient estimator to update the policy parameters [202].

Algorithm 21 REINFORCE
1: Input: a differentiable policy parameterization π(a|s, θ)
2: Initialize policy parameter θ ∈ R^d
3: repeat   ⊲ forever
4:   Generate an episode S_0, A_0, R_1, . . . , S_{T−1}, A_{T−1}, R_T, following π(·|·; θ)
5:   for each step of the episode t = 0, . . . , T − 1 do
6:     G ← return from step t
7:     θ ← θ + α γ^t G ∇_θ ln π(A_t|S_t, θ)
8:   end for
9: until convergence or a stopping criterion is met

The gradient of the expected reward with respect to the policy parameters θ is given by:

∇_θ J(θ) = E_π[ ∇_θ log π_θ(a|s) G_t ],    (32)

where π_θ(a|s) is the probability of taking action a in the state s under policy π parameterized by θ, and G_t is the return (cumulative future reward) following time step t. This gradient estimation forms the basis for the parameter update rule in REINFORCE (line 7):

θ ← θ + α ∇_θ log π_θ(a|s) G_t,    (33)

where α is the learning rate.

One of the strengths of the REINFORCE algorithm is its simplicity and generality, allowing it to be applied across a wide range of problems. However, it also faces challenges such as high variance in the gradient estimates and slow convergence, particularly in environments with sparse or delayed rewards. By analyzing the applications of REINFORCE, we will have a deeper look at these advantages and disadvantages.
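As a concrete illustration of the update in Eq. (33) (line 7 of Alg. 21), the following sketch implements REINFORCE with a tabular softmax policy on a small made-up episodic task; the task definition, reward scheme, and hyperparameters are assumptions introduced only for this example.

import numpy as np

# Minimal sketch of the REINFORCE update (Eqs. (32)-(33)) with a softmax policy
# over a tiny illustrative episodic task; the task, features, and hyperparameters
# are assumptions made for the example, not part of the surveyed algorithm itself.

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 4, 2
GOOD_ACTION = np.array([0, 1, 0, 1])         # assumed "correct" action per state

def episode(theta, length=8):
    """Roll out one episode; reward 1 when the sampled action matches GOOD_ACTION."""
    states, actions, rewards = [], [], []
    for _ in range(length):
        s = rng.integers(N_STATES)
        logits = theta[s]                    # theta: (N_STATES, N_ACTIONS) table of action preferences
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()
        a = rng.choice(N_ACTIONS, p=probs)
        r = 1.0 if a == GOOD_ACTION[s] else 0.0
        states.append(s); actions.append(a); rewards.append(r)
    return states, actions, rewards

def reinforce(alpha=0.1, gamma=0.99, n_episodes=2000):
    theta = np.zeros((N_STATES, N_ACTIONS))
    for _ in range(n_episodes):
        states, actions, rewards = episode(theta)
        G, returns = 0.0, []
        for r in reversed(rewards):           # compute discounted returns G_t
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        for t, (s, a, G_t) in enumerate(zip(states, actions, returns)):
            logits = theta[s]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            grad_log_pi = -probs               # gradient of log pi(a|s) w.r.t. the preferences of state s
            grad_log_pi[a] += 1.0              # ... equals one_hot(a) - probs for a softmax policy
            theta[s] += alpha * (gamma ** t) * G_t * grad_log_pi   # update of Alg. 21, line 7
    return theta

if __name__ == "__main__":
    theta = reinforce()
    print("Greedy actions per state:", theta.argmax(axis=1))       # expected to match GOOD_ACTION

Because each update uses the full Monte Carlo return G_t, the gradient estimate is unbiased but high-variance, which is exactly the weakness noted above; subtracting a learned baseline from G_t is the standard remedy and leads toward Actor-Critic methods.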
Authors in [203] investigated the global convergence rates of the REINFORCE algorithm in episodic RL settings. The authors aimed to close the gap between theoretical and practical implementations of policy gradient methods by providing new convergence results for the REINFORCE algorithm. The paper's strengths lay in its comprehensive theoretical analysis and practical relevance. The authors derived performance bounds for the REINFORCE algorithm using a fixed mini-batch size, aligning more closely with practical implementations. They provided the first set of global convergence results for the REINFORCE algorithm, including sub-linear high probability regret bounds and almost sure global convergence of average regret. The focus on the widely-used REINFORCE gradient estimation procedure rather than state-action visitation measure-based estimators addressed a significant gap between theory and practice. The authors established that the REINFORCE algorithm was sample efficient, with polynomial complexity, which was crucial for practical applications due to the high cost of obtaining samples. However, the complexity of the analysis might have challenged practitioners less familiar with the mathematical concepts. The reliance on assumptions such as the log-barrier regularization term and soft-max policy parametrization might have limited the generality of the results. The paper focused on the stationary infinite-horizon discounted setting, and a more detailed discussion on applying the results to other settings would have enhanced its relevance. The absence of empirical validation of the proposed convergence bounds and sample efficiency was another limitation. Including experimental results would have provided additional evidence of the practical utility of the findings.

A novel approach to Energy Management Strategies (EMS) in Fuel Cell Hybrid EVs (FCHEV) using the fuzzy REINFORCE algorithm was introduced in [204]. This method integrated a fuzzy inference system (FIS) with Policy Gradient RL (PGRL) to optimize energy management, achieve hydrogen savings, and maintain battery operation. One of the key strengths of the paper was the innovative combination of fuzzy logic with the REINFORCE algorithm. By employing a fuzzy inference system to approximate the policy function, the authors effectively leveraged the generalization capabilities of fuzzy logic to handle the complexity and uncertainty inherent in energy management tasks. This integration helped to address the limitations of traditional EMS methods that relied heavily on expert knowledge and static rules, thus providing a more adaptive and robust solution. The use of a fuzzy baseline function to stabilize the training process and reduce the variance in policy gradient updates was another notable advantage. This approach enhanced the convergence rate and stability of the learning process, which was particularly beneficial in real-time applications where computational efficiency and robustness were critical. The paper's demonstration of the algorithm's adaptability to changing driving conditions and system states further underscored its practical relevance and effectiveness. However, the complexity of the proposed method might have posed implementation challenges, particularly for practitioners who were less familiar with fuzzy logic. The integration of FIS and PGRL required careful tuning of parameters and membership functions, which could have been time-consuming and computationally intensive. Additionally, while the fuzzy REINFORCE algorithm showed promise in reducing the computational burden and improving convergence, the reliance on fuzzy logic introduced an additional layer of complexity that might not have been necessary for all applications. The paper also provided a comprehensive analysis of the simulation and hardware-in-loop experiments, validating the effectiveness of the proposed method in real-world scenarios. The results indicated that the fuzzy REINFORCE algorithm could achieve near-optimal performance without requiring accurate system models or extensive prior knowledge, making it a versatile and practical solution for EMS in FCHEVs.

Authors in [205] presented a study protocol for a trial aimed at improving medication adherence among patients with type 2 diabetes using an RL-based text messaging program. One of the key strengths of this study was its innovative use of RL to personalize text message interventions. By tailoring messages based on individual responses to previous messages, the approach had the potential to optimize engagement and improve adherence more effectively than generic messaging strategies. This personalized communication could have led to more significant behavior changes and better health outcomes for patients with diabetes. The study's design also enhanced its practical relevance. Conducted in a real-world setting at Brigham and Women's Hospital, it involved patients with suboptimal diabetes control, which reflected a common clinical scenario. The use of electronic pill bottles to monitor adherence provided accurate and objective data, supporting the reliability of the study outcomes. Additionally, the trial's primary outcome of average medication adherence over six months was a meaningful measure that directly related to the study's objective. However, there were some weaknesses and challenges associated with the study. The requirement for patients to use electronic pill bottles and smartphones with a data plan or WiFi might have limited the generalizability of the findings to populations without access to such technology. Furthermore, the study's reliance on self-reported adherence as a secondary outcome introduced the potential for reporting bias. The study also faced potential limitations related to the length of the follow-up period and the evaluation of the long-term sustainability of the intervention. While a six-month follow-up period was sufficient to assess initial adherence improvements, longer-term studies would have been necessary to determine whether the benefits of the intervention were sustained over time.

In [206], the authors presented a novel method for rate adaptation in 802.11 wireless networks leveraging the REINFORCE algorithm. The proposed approach, named ReinRate, integrated a comprehensive set of observations, including received signal strength, contention window size, current modulation and coding scheme, and throughput, to adapt dynamically to varying network conditions and optimize network throughput. One of the key strengths of this paper was its innovative application of the REINFORCE algorithm to WiFi rate adaptation. Traditional rate adaptation algorithms like Minstrel and Ideal relied on limited observations such as packet loss rate or signal-to-noise ratio, which could have been insufficient in dynamic wireless environments.
In contrast, ReinRate's broader set of observations allowed for a more nuanced response to varying conditions, leading to significant improvements in network performance. The authors demonstrated that ReinRate outperformed the Minstrel and Ideal algorithms by up to 102.5% and 30.6% in network scenarios without interference, and by up to 35.1% and 66.6% in scenarios with interference. Another strength was the comprehensive evaluation of ReinRate using the ns-3 network simulator and ns3-ai OpenAI Gym. The authors conducted extensive simulations under various network scenarios, both static and dynamic, with and without interference. This thorough evaluation provided strong evidence of the algorithm's effectiveness and adaptability in real-world conditions. The results indicated that ReinRate consistently achieved higher throughput compared to traditional algorithms, showcasing its ability to handle the challenges of dynamically changing wireless environments. However, the complexity of the proposed method might have posed challenges for practical implementation. The integration of multiple observations and the application of the REINFORCE algorithm required careful tuning of parameters and computational resources.

A new approach to enhance DRL for outdoor robot navigation was investigated in [207]. The key innovation was the use of a heavy-tailed policy parameterization, which induced exploration in sparse reward settings, a common challenge in outdoor navigation tasks. A significant strength of the paper lies in addressing the sparse reward issue, which was prevalent in many real-world navigation scenarios. Traditional DRL methods often relied on carefully designed dense reward functions, which could have been impractical to implement. The authors proposed HTRON, an algorithm that leveraged heavy-tailed policy parameterizations, such as the Cauchy distribution, to enhance exploration without needing complex reward shaping. This approach allowed the algorithm to learn efficient behaviors even with sparse rewards, making it more applicable to real-world scenarios. The paper's thorough experimental evaluation was another strong point. The authors tested HTRON against established algorithms like REINFORCE, Proximal Policy Optimization (PPO), and Trust Region Policy Optimization (TRPO) (explained later in the upcoming subsections) across three different outdoor scenarios: goal-reaching, obstacle avoidance, and uneven terrain navigation. HTRON outperformed these algorithms in terms of success rate, average time steps to reach the goal, and elevation cost, demonstrating its effectiveness and efficiency. The use of a realistic Unity-based simulator and the deployment of the algorithm on a Clearpath Husky robot further validated the practical applicability of the proposed method. However, the complexity of the proposed algorithm and the specific choice of heavy-tailed distributions might have posed challenges. The implementation of heavy-tailed policy gradients could have introduced instability, especially in the initial learning phases. While the authors mitigated this with adaptive moment estimation and gradient clipping, these techniques required careful tuning and expertise, potentially limiting accessibility for practitioners.

Table XIII gives an overview of the papers reviewed in this section. The next Policy-based algorithm we need to cover is TRPO, which is discussed in the next subsection.

TABLE XIII: REINFORCE Papers Review
Application Domain | References
Energy and Power Management | [204]
Theoretical Research (Convergence, stability) | [203]
Network Optimization | [206]
Robotics | [207]

B. Trust Region Policy Optimization (TRPO)

TRPO, introduced by [208], is an advancement in RL, specifically within policy optimization methods. The primary objective of TRPO is to optimize control policies with guaranteed monotonic improvement, addressing the shortcomings of previous methods [209], [210], [211] that often resulted in unstable policy updates and poor performance on complex tasks.

TRPO is designed to handle large, nonlinear policies such as those represented by neural networks. The algorithm ensures that each policy update results in a performance improvement by maintaining the updated policy within a "trust region" around the current policy. This trust region is defined using a constraint on the KL divergence between the new and old policies, effectively preventing large, destabilizing updates [212], [208].
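To illustrate the trust-region idea in isolation, the short sketch below compares an old and a candidate categorical policy on a few states and accepts the candidate only if the average KL divergence stays below a threshold δ; the probability tables and the value of δ are invented for this example and do not correspond to any specific system from the literature reviewed here.

import numpy as np

# Illustrative check of a KL trust-region constraint: accept a candidate policy
# only if its average KL divergence from the old policy stays below delta.
# The probability tables and delta are assumptions made up for this sketch.

def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as probability vectors."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

old_policy = {             # pi_old(a|s) for three example states
    "s0": [0.70, 0.20, 0.10],
    "s1": [0.10, 0.80, 0.10],
    "s2": [0.34, 0.33, 0.33],
}
candidate_policy = {       # pi_new(a|s) proposed by an optimization step
    "s0": [0.60, 0.25, 0.15],
    "s1": [0.15, 0.75, 0.10],
    "s2": [0.40, 0.30, 0.30],
}

delta = 0.01               # trust-region radius (hyperparameter)
mean_kl = np.mean([kl_categorical(old_policy[s], candidate_policy[s]) for s in old_policy])
print(f"mean KL = {mean_kl:.4f}")
print("accept update" if mean_kl <= delta else "reject / shrink the step")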
TRPO operates within the stochastic policy framework, where the policy π_θ is parameterized by θ and defines a probability distribution over actions given the states. The expected discounted reward for a policy π is given by:

J(π) = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) ],    (34)

where γ is the discount factor, r(s_t, a_t) is the reward at time step t, and the expectation is taken over the state and action trajectories induced by the policy. To ensure that the policy update remains within a safe boundary, TRPO constrains the KL divergence between the new policy π_{θ′} and the old policy π_θ:

D_KL(π_θ ‖ π_{θ′}) ≤ ζ,    (35)

where ζ is a small positive constant. This constraint ensures that the new policy does not deviate too much from the old policy, thereby providing stability to the learning process. TRPO optimizes a surrogate objective function that approximates the true objective while respecting the trust region constraint. The surrogate objective L(θ) is defined as:

L(θ) = Ê[ (π_θ(a|s) / π_{θ_old}(a|s)) Â_{θ_old}(s, a) ],    (36)

where Â_{θ_old}(s, a) is an estimate of the advantage function, which measures the relative value of taking action a in state s under the old policy. As demonstrated in Alg. 22, the practical implementation of TRPO involves the following steps:

1) Sample Trajectories: Collect a set of trajectories using the current policy π_{θ_old} (line 3).
2) Estimate Advantages: Compute the advantage function Â_{θ_old}(s, a) using the collected trajectories (line 4).
3) Optimize Surrogate Objective: Solve the constrained optimization problem to find the new policy parameters θ′ (lines 5-7):

θ′ = arg max_θ Ê[ (π_θ(a|s) / π_{θ_old}(a|s)) Â_{θ_old}(s, a) ],    (37)

subject to

D_KL(π_{θ_old} ‖ π_θ) ≤ ζ.    (38)

4) Update Policy: Update the policy parameters to θ′ (lines 9-11).

Algorithm 22 TRPO
1: Input: initial policy parameters θ_0
2: for k = 0, 1, 2, . . . do
3:   Collect set of trajectories D_k on policy π_k = π(θ_k)
4:   Estimate advantages Â_t^{π_k} using any advantage estimation algorithm
5:   Form sample estimates for:
6:     policy gradient ĝ_k (using advantage estimates)
7:     and KL-divergence Hessian-vector product function f(v) = Ĥ_k v
8:   Use CG with n_cg iterations to obtain x_k ≈ Ĥ_k^{−1} ĝ_k
9:   Estimate proposed step ∆_k ≈ sqrt( 2δ / (x_k^T Ĥ_k x_k) ) x_k
10:  Perform backtracking line search with exponential decay to obtain final update
11:    θ_{k+1} = θ_k + α^j ∆_k
12: end for
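The following is a small numerical sketch of the step computation in lines 8-11 of Alg. 22: conjugate gradient to approximate H^{-1} g, the sqrt(2δ / (x^T H x)) step scaling, and a backtracking line search. The matrix H, the gradient g, and the quadratic surrogate are synthetic stand-ins; a real TRPO implementation would compute Hessian-vector products and surrogate values from sampled trajectories rather than forming H explicitly.

import numpy as np

# Numerical sketch of the TRPO step machinery (Alg. 22, lines 8-11).
# H, g, and the surrogate below are synthetic stand-ins for the KL Hessian,
# the policy gradient, and the sampled surrogate objective of Eq. (36).

rng = np.random.default_rng(0)
dim, delta = 5, 0.01

A = rng.normal(size=(dim, dim))
H = A @ A.T + np.eye(dim)                  # stand-in KL Hessian (symmetric positive definite)
g = rng.normal(size=dim)                   # stand-in policy gradient

def surrogate(step):
    """Toy surrogate improvement: linear gain minus a quadratic curvature penalty."""
    return g @ step - 0.5 * step @ H @ step

def conjugate_gradient(hvp, b, iters=10, tol=1e-10):
    x = np.zeros_like(b)
    r, p = b.copy(), b.copy()
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)
        alpha = rs_old / (p @ Hp)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

x = conjugate_gradient(lambda v: H @ v, g)             # x ~= H^{-1} g        (line 8)
full_step = np.sqrt(2.0 * delta / (x @ H @ x)) * x     # step scaling         (line 9)

step, accepted = np.zeros(dim), False
for j in range(10):                                    # backtracking line search (lines 10-11)
    candidate = (0.5 ** j) * full_step
    kl = 0.5 * candidate @ H @ candidate               # quadratic KL approximation
    if surrogate(candidate) > 0 and kl <= delta:
        step, accepted = candidate, True
        break

print("accepted:", accepted, "approx KL:", round(0.5 * step @ H @ step, 5))
# theta_new = theta_old + step would then be applied to the policy parameters.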
Let us analyze a handful of research studies that used TRPO to grasp a better understanding. A Monotonic Policy Optimization (MPO) algorithm was designed to address the challenges associated with high-dimensional continuous control tasks in [213]. The primary focus was on ensuring monotonic improvement in policy performance, which was crucial for stability and efficiency in RL. One of the significant strengths of this paper was its innovative approach to policy optimization. The authors derived a new lower bound on policy improvement that penalized average policy divergence on the state space, rather than the maximum divergence. This approach addressed a critical limitation of previous algorithms, such as TRPO, which could suffer from worst-case degradation in policy performance. By focusing on average divergence, the MPO algorithm ensured more consistent and reliable improvements, making it particularly suitable for high-dimensional continuous control tasks. The empirical evaluation of the MPO algorithm was another strong point. The authors conducted extensive simulations using the MuJoCo physics engine, testing the algorithm on various challenging robot locomotion tasks, including swimming, quadruped locomotion, bipedal locomotion, and more. The results demonstrated that the MPO algorithm consistently outperformed state-of-the-art methods like TRPO and Vanilla Policy Gradient in terms of average discounted rewards. The algorithm's performance in these high-dimensional tasks highlighted its robustness and practical applicability. However, the complexity of the proposed algorithm might have posed challenges for practical implementation. The need for computing the natural policy gradient and performing line searches to determine optimal step sizes required significant computational resources. This complexity could have limited the accessibility of the MPO algorithm for practitioners without advanced computational capabilities. Additionally, while the algorithm guaranteed monotonic improvement, this came at the cost of slower training speed due to conservative step sizes. The authors suggested combining MPO with other faster methods during the initial training phase to mitigate this issue, but this added another layer of complexity to the implementation.

[214] introduced an enhancement to the TRPO algorithm by incorporating entropy regularization. This modification, termed EnTRPO, aimed to improve exploration and generalization by encouraging more stochastic policy choices. The paper demonstrated the effectiveness of EnTRPO through experiments on the Cart-Pole system, showcasing better performance compared to the original TRPO. One of the main strengths of the paper was its innovative use of entropy regularization to enhance the TRPO algorithm. By adding an entropy term to the advantage function, the authors effectively encouraged exploration, which was crucial to avoid premature convergence to suboptimal policies. This approach addressed a common limitation of TRPO, which could sometimes restrict exploration due to its strict KL divergence constraints between consecutive policies. The entropy regularization helped maintain a balance between exploration and exploitation, leading to more robust learning outcomes. The empirical evaluation provided in the paper was another significant strength. The authors conducted thorough experiments using the Cart-Pole system, a well-known benchmark in the field. The results showed that EnTRPO converged faster and more reliably than TRPO, particularly when the discount factor was set to 0.85. This indicated that the proposed method not only improved exploration but also enhanced the overall convergence speed and stability of the learning process. The use of a well-defined experimental setup, including details on neural network architectures and hyperparameters, added credibility to the findings. A potential limitation was the reliance on a single benchmark task for evaluation. While the Cart-Pole system was a standard benchmark, it was relatively simple compared to many real-world applications. The paper would have benefited from additional experiments on more complex tasks and environments to demonstrate the generalizability and robustness of EnTRPO. This would have provided stronger evidence of the method's effectiveness across a wider range of scenarios.

The challenge of applying trust region methods to Multi-agent RL (MARL) was investigated in [215]. The authors introduced Heterogeneous-agent TRPO (HATRPO) and Heterogeneous-Agent Proximal Policy Optimization (HAPPO) algorithms. These methods were designed to guarantee monotonic policy improvement without requiring agents to share parameters or relying on restrictive assumptions about the decomposability of the joint value function. A strength of the paper was its theoretical foundation. The authors extended the theory of trust region learning to cooperative MARL by developing a multi-agent advantage decomposition lemma and a sequential policy update scheme. This theoretical advancement allowed HATRPO and HAPPO to ensure monotonic improvement in joint policy performance, a key advantage over existing MARL algorithms that did not guarantee such improvement. This theoretical guarantee was essential for stable and reliable learning in multi-agent settings, where individual policy updates could often lead to non-stationary environments and suboptimal outcomes. The empirical validation of HATRPO and HAPPO on benchmarks such as Multi-Agent MuJoCo and StarCraft II demonstrated the effectiveness of these algorithms. The results showed that HATRPO and HAPPO significantly outperformed strong baselines, including Independent Proximal Policy Optimization (IPPO), MAPPO, and MADDPG, in various tasks. This performance improvement highlighted the practical applicability of the proposed methods in complex, high-dimensional environments. The thorough experimental evaluation across multiple scenarios provided strong evidence of the robustness and generalizability of HATRPO and HAPPO. However, the complexity of implementing HATRPO and HAPPO could have been a potential limitation. The algorithms required the computation of multi-agent advantage functions and sequential updates, which could have been computationally intensive and challenging to implement efficiently. This complexity might have limited the accessibility of these methods to practitioners who might not have had advanced computational resources or expertise in implementing sophisticated algorithms.

Authors in [216] presented a method to actively recognize objects by choosing a sequence of actions for an active camera. This method utilized TRPO combined with Extreme Learning Machines (ELMs) to enhance the efficiency of the optimization algorithm. One of the significant strengths of this paper was its innovative application of TRPO in conjunction with ELMs. ELMs provided a simple yet effective way to approximate policies, reducing the computational complexity compared to traditional deep neural networks. This resulted in an efficient optimization process, crucial for real-time applications like active object recognition. The use of ELMs allowed for faster convergence and more straightforward implementation, making the proposed method accessible for practical applications. However, the complexity of integrating TRPO with ELMs could have posed challenges for some practitioners. Although ELMs simplified the optimization process, they still required careful tuning of parameters, such as the number of hidden nodes and the distribution of random weights. This additional layer of complexity might have limited the method's accessibility for users without extensive experience in RL and neural networks.

In [42], authors explored the application of the TRPO algorithm in MARL environments, specifically focusing on hide-and-seek games. The authors compared the performance of TRPO with the Vanilla Policy Gradient (VPG) algorithm to determine the most effective method for this type of game. One of the primary strengths of this paper was its focus on a well-defined, complex multi-agent environment. Hide and seek games inherently involved dynamic interactions between agents, making them an excellent testbed for evaluating algorithms. By using TRPO, which was designed to ensure monotonic policy improvement, the authors addressed a significant challenge in MARL: maintaining stable and consistent learning despite the presence of multiple interacting agents. The empirical results presented in the paper highlighted the strengths of TRPO, especially in scenarios where the testing environment differed from the training environment.
TRPO's ability to adapt to new environments and maintain high performance was a notable advantage over the VPG algorithm, which performed better in environments identical to the training conditions but struggled when faced with variability. This adaptability was crucial for practical applications of MARL, where agents often encountered unpredictable changes in their environment. Another strength was the comprehensive experimental setup, which included various configurations and scenarios. The authors meticulously compared the performance of TRPO and VPG across different numbers of agents and types of environments (quadrant and random walls scenarios). This thorough approach provided robust evidence supporting the efficacy of TRPO in MARL settings. However, the paper also had some limitations. The complexity of implementing TRPO in a multi-agent context could have been a barrier for practitioners. TRPO required careful tuning and substantial computational resources, which might not have been readily available in all settings. Additionally, the reliance on simulation results raised questions about the real-world applicability of the findings. While the hide-and-seek game was a useful simulation environment, real-world deployments could have presented additional challenges not captured in the simulations.

The application of TRPO to improve Cross-Site Scripting (XSS) detection systems was analyzed by authors in [217]. The authors aimed to enhance the resilience of XSS filters against adversarial attacks by using RL techniques to identify and counter malicious inputs. One of the main strengths of this paper was its innovative approach to applying TRPO in Cybersecurity, specifically for XSS detection. Traditional XSS detection methods often relied on static rules and signatures, which could be easily bypassed by sophisticated attackers. By leveraging TRPO, the authors introduced a dynamic and adaptive mechanism that could learn to detect and counteract adversarial attempts to exploit XSS vulnerabilities. This use of TRPO enhanced the robustness of the detection system, making it more resilient to evolving threats. A limitation of this study was the reliance on specific hyperparameters, such as the learning rate and discount factor, which could have significantly impacted the model's performance. The paper would have benefited from a more detailed discussion on how these parameters were selected and their influence on the detection model. Providing guidelines or heuristics for parameter tuning would have helped practitioners replicate and extend the study's findings.

Authors in [218] aimed to create a universal policy for a locomotion task that could adapt to various robot morphologies, using TRPO. The study investigated the use of surrogate models, specifically Polynomial Chaos Expansion (PCE) and model ensembles, to model the dynamics of the robots. One of the primary strengths of this thesis was its innovative approach to developing a universal policy. The use of TRPO ensured stability and reliable policy updates even in complex environments. The focus on creating a policy that could generalize across different robot configurations was particularly noteworthy, as it addressed the challenge of designing controllers that were not limited to a single robot morphology. The integration of surrogate models, especially the PCE, was another strong point. PCE allowed for efficient sampling and modeling of the stochastic environment, potentially reducing the number of interactions required with the real environment. This was crucial for practical applications where real-world interactions could have been costly or risky. The theoretical foundation laid for using PCE in this context was robust and showed promise for future research. However, the thesis also highlighted several challenges and limitations. The complexity of accurately modeling the dynamics with PCE was a significant hurdle. The results indicated that while PCE showed potential, it currently could not model the dynamics accurately enough to be used in combination with TRPO effectively. The computational time required for PCE was also a practical concern, limiting its immediate applicability. The model ensemble surrogate showed some promise but ultimately failed to train a successful policy. This pointed to the difficulty of creating surrogate models that could capture the complexities of robot dynamics sufficiently. The thesis suggested that using the original environment from the RoboGrammar library yielded better results, emphasizing the need for more advanced or alternative surrogate modeling techniques.

A novel approach for optimizing Home Energy Management Systems (HEMS) using Multi-agent TRPO (MA-TRPO) was investigated in [219]. This approach aimed to improve energy efficiency, cost savings, and consumer satisfaction by leveraging TRPO techniques in a multi-agent setup. One of the primary strengths of this paper was its consumer-centric approach. Traditional HEMS solutions often prioritized energy efficiency and cost savings without adequately considering consumer preferences and comfort. By incorporating a preference factor for Interruptible-Deferrable Appliances (IDAs), the proposed MA-TRPO algorithm ensured that consumer satisfaction was taken into account, leading to a more holistic and practical solution. This consumer-centric focus was crucial for the widespread adoption of HEMS in real-world settings. Another strength was the comprehensive use of real-world data for training and validation. The authors utilized five-minute retail electricity prices derived from wholesale market prices and real-world Photovoltaic (PV) generation profiles.
This approach enhanced the practical relevance and robustness of the proposed method, as it demonstrated the algorithm's effectiveness under realistic conditions. Additionally, the paper provided a detailed explanation of the various components of the smart home environment, including the non-controllable base load, IDA, Battery Energy Storage System (BESS), and PV system, which added clarity and depth to the study. The use of MA-TRPO in a multi-agent setup was also a significant contribution. The proposed method modeled and trained separate agents for different components of the HEMS, such as the IDA and BESS, allowing for more specialized and effective control strategies. This multi-agent approach addressed the complexities and inter-dependencies within the home energy environment, leading to more efficient and coordinated energy management. The paper's reliance on simulation results, while comprehensive, still left questions about the real-world applicability of the proposed method. Although the use of real-world data enhanced relevance, further validation in actual home environments would have strengthened the case for practical deployment. Real-world testing was essential to ensure the robustness and effectiveness of the MA-TRPO algorithm in diverse and dynamic home energy scenarios. Additionally, the reliance on discrete action spaces simplified the problem but might not have fully captured the nuances of continuous control in real-world applications. Future work could explore extending the algorithm to handle continuous action spaces for more precise control.

Authors in [220] investigated the application of TRPO to address the joint spectrum and power allocation problem in the Internet of Vehicles (IoV). The objective was to minimize AoI and power consumption, which were crucial for maintaining real-time communication and energy efficiency in vehicular networks. One of the key strengths of this paper was its focus on AoI, a vital metric for ensuring timely and accurate information exchange in vehicular communications. By incorporating AoI into the optimization framework, the authors addressed a significant challenge in IoV networks, where the freshness of information directly impacted road safety and traffic efficiency. The proposed TRPO-based approach effectively balanced the trade-off between minimizing AoI and reducing power consumption, showcasing its practical relevance. The paper's reliance on certain assumptions, such as the availability of CSI and the periodic reporting of CSI to the base station, might have limited its generalizability. In real-world scenarios, obtaining accurate and timely CSI could have been challenging due to various factors like signal interference and mobility. Future work could explore more practical approaches to CSI estimation and reporting to enhance the applicability of the proposed solution.

Table XIV categorizes the papers reviewed in this section by their domain, offering a summary of the research landscape in TRPO. PPO was subsequently proposed as an advancement over TRPO.

TABLE XIV: TRPO Papers Review
Application Domain | References
Object Recognition | [216]
Theoretical Research (Convergence, stability) | [213]
Hybrid RL Algorithms | [214]
Multi-agent Systems and Autonomous Behaviors | [215], [42]
Cybersecurity | [217]
Robotics | [218]
Energy and Power Management | [219], [151], [160]

C. Proximal Policy Optimization (PPO)

PPO, proposed by [221], represents a significant advancement within policy gradient methods. PPO aims to achieve reliable performance and sample efficiency, addressing the limitations of previous policy optimization algorithms such as VPG methods and TRPO.

Using policy gradient methods, the policy parameters are optimized through stochastic gradient ascent by estimating the gradient of the policy. One of the most commonly used policy gradient estimators is:

ĝ = Ê_t[ ∇_θ log π_θ(a_t|s_t) Â_t ],    (39)

where π_θ represents the policy parameterized by θ, and Â_t is an estimator of the advantage function at time step t. This estimator helps construct an objective function whose gradient corresponds to the policy gradient estimator:

L^{PG}(θ) = Ê_t[ log π_θ(a_t|s_t) Â_t ].    (40)

PPO simplifies TRPO by using a surrogate objective with a clipped probability ratio, allowing for multiple epochs of mini-batch updates. In order to preserve learning, large policy updates should be avoided. As a result, the PPO objective is as follows:

L^{CLIP}(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ǫ, 1 + ǫ) Â_t ) ],    (41)

where r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t) is the probability ratio, and ǫ is a hyperparameter. This objective clips the probability ratio to ensure it stays within a reasonable range, preventing excessively large updates. The practical implementation of PPO involves the following steps:

1) Sample Trajectories: Collect trajectories using the current policy π_{θ_old}.
2) Estimate Advantages: Compute the advantage function Â_t using the collected trajectories.
3) Optimize Surrogate Objective: Perform several epochs of optimization on the surrogate objective L^{CLIP}(θ) using minibatch stochastic gradient descent.
4) Update Policy: Update the policy parameters to the new parameters θ.

Before delving into related papers, to illustrate the algorithm's functionality, PPO's algorithm is provided in Alg. 23.

Algorithm 23 PPO
1: Initialize policy parameters θ, and value function parameters φ
2: Initialize old policy π_{θ_old} ← π_θ
3: for iteration i = 1 to M do
4:   Collect trajectories {(s_t, a_t, r_t, s_{t+1})} by running policy π_{θ_old}
5:   Compute advantage estimates Â_t for each trajectory using the value function
6:   for epoch j = 1 to K do
7:     Compute probability ratio r_t(θ) = π_θ(a_t|s_t) / π_{θ_old}(a_t|s_t)
8:     Define the surrogate loss: L^{CLIP}(θ) = Ê_t[ min( r_t(θ) Â_t, clip(r_t(θ), 1 − ǫ, 1 + ǫ) Â_t ) ]
9:     Perform mini-batch gradient ascent on the surrogate objective L^{CLIP}(θ)
10:    Update the value function parameters φ by minimizing: L^V(φ) = Ê_t[ (V_φ(s_t) − R_t)^2 ]
11:  end for
12:  Update old policy: π_{θ_old} ← π_θ
13: end for
Algorithm 23 PPO mixed-autonomy traffic optimization at a network level.


1: Initialize policy parameters θ, and value function The authors hypothesized that controlling distributed
parameters φ CAVs at a network level could outperform individually
2: Initialize old policy πθold ← πθ controlled CAVs, and their experimental results sup-
3: for iteration i = 1 to M do ported this hypothesis. The network-level RL policies
4: Collect trajectories {(st , at , rt , st+1 )} by run- for controlling CAVs significantly improved the total re-
ning policy πθold wards and average velocity compared to individual-level
5: Compute advantage estimates Ât for each trajec- RL policies. This finding was crucial for the development
tory using the value function of ITS which aimed to optimize traffic flow and reduce
6: for epoch j = 1 to K do congestion. The use of the Flow framework and SUMO
πθ (at |st )
7: Compute probability ratio rt (θ) = πθ (at |st ) environment for experiments was another strength of
old
8: Define the surrogate loss: the paper. These tools provided a realistic and flexible
h  simulation environment for testing the proposed control
LCLIP (θ) = Êt min rt (θ)Ât , policies. The comprehensive evaluation, including differ-
 ient traffic settings and CAV penetration rates (10%, 20%,
clip(rt (θ), 1 − ǫ, 1 + ǫ)Ât and 30%), added robustness to the findings. The authors’
use of three different learning strategies—single-agent
9: Perform mini-batch gradient ascent on the asynchronous learning, joint global cooperative learning,
surrogate objective LCLIP (θ) and joint local cooperative learning—allowed for a thor-
10: Update the value function parameters φ by ough comparison of the effectiveness of network-level
minimizing: control policies. The paper also highlighted the potential
h i high communication overhead associated with the global
LV (φ) = Êt (Vφ (st ) − Rt )2 joint cooperative policy, especially as the penetration rate
of CAVs increased. This overhead could have limited the
scalability of the solution in real-world scenarios where
11: end for
communication resources were constrained. The authors
12: Update old policy: πθold ← πθ
suggested that the local joint cooperative policy, which
13: end for
required less communication, could have been a more
practical choice in such situations, though it might not
have performed as well as the global policy.
A DRL approach for optimizing traffic flow in mixed-autonomy scenarios, where both Connected Autonomous Vehicles (CAVs) and human-driven vehicles coexisted, was introduced in [222]. The authors proposed three distributed learning control policies for CAVs using PPO, a policy gradient DRL method, and conducted experiments with varying traffic settings and CAV penetration rates on the Flow framework, a new open-source microscopic traffic simulator. One of the primary strengths of the paper was its innovative approach to mixed-autonomy traffic optimization at a network level. The authors hypothesized that controlling distributed CAVs at a network level could outperform individually controlled CAVs, and their experimental results supported this hypothesis. The network-level RL policies for controlling CAVs significantly improved the total rewards and average velocity compared to individual-level RL policies. This finding was crucial for the development of ITS, which aimed to optimize traffic flow and reduce congestion. The use of the Flow framework and SUMO environment for experiments was another strength of the paper. These tools provided a realistic and flexible simulation environment for testing the proposed control policies. The comprehensive evaluation, including different traffic settings and CAV penetration rates (10%, 20%, and 30%), added robustness to the findings. The authors' use of three different learning strategies (single-agent asynchronous learning, joint global cooperative learning, and joint local cooperative learning) allowed for a thorough comparison of the effectiveness of network-level control policies. The paper also highlighted the potentially high communication overhead associated with the global joint cooperative policy, especially as the penetration rate of CAVs increased. This overhead could have limited the scalability of the solution in real-world scenarios where communication resources were constrained. The authors suggested that the local joint cooperative policy, which required less communication, could have been a more practical choice in such situations, though it might not have performed as well as the global policy.

In [223], authors presented a novel method for managing power grid line flows in real-time using PPO. This approach addressed the challenges posed by the increasing penetration of renewable resources and the unpredictability of power system conditions, aiming to prevent line overloading and potential cascading outages. One of the key strengths of this paper was the application of PPO in a critical real-time grid operation context. PPO was well-suited for this task due to its ability to ensure stable updates within trust regions, thus maintaining the reliability of the control policies. This characteristic was crucial in power systems, where stability and safety were paramount. The authors' decision to use PPO over other algorithms like DDPG and Policy Gradient was justified by PPO's robustness in handling larger and more complex action spaces. A limitation was the reliance on simulation results for validation. While the use of high-fidelity power grid simulators was a strength, real-world deployments could present additional challenges not captured in simulations. Factors such as communication delays, unexpected equipment failures, and human operator interventions could have impacted the effectiveness of the control strategies. Further real-world testing and validation were necessary to fully assess the robustness

and practicality of the proposed method in diverse and the complexity of implementing the proposed method
dynamic power grid environments. might have posed challenges. The integration of the ED-
A framework for optimizing metro service schedules LDF rule with PPO required careful tuning and a deep
and train compositions using PPO within a DRL frame- understanding of both RL and optimal scheduling princi-
work was proposed in [224]. This method was applied to ples. This complexity could have limited the accessibility
handle the dynamic and complex problem of metro train of the method for practitioners who might not have had
operations, focusing on minimizing operational costs advanced expertise in these areas. Additionally, the re-
and improving service regularity. A significant strength liance on heavy-traffic assumptions for the optimality of
of the paper was its innovative application of PPO to the ED-LDF rule might have limited the generalizability
metro service scheduling and train composition. PPO, of the approach to all scheduling environments. Real-
known for its stability and efficiency in handling large- world scenarios could have presented varying traffic
scale optimization problems, was effectively utilized to conditions that did not always align with these assump-
address the dynamic nature of metro operations. The tions. The paper’s focus on renewable energy as a factor
integration of PPO with Artificial Neural Networks in the scheduling decision was another strength, as it
(ANNs) for approximating value functions and policies aligned with the growing importance of sustainability in
demonstrated a robust approach to tackling the high- computing operations. However, the paper could have
dimensional decision space inherent in metro scheduling. benefited from further exploration of how variations in
This combination enhanced the framework’s ability to renewable energy availability impacted the scheduling
adapt to varying passenger demands and operational con- performance. This aspect was crucial for real-world
straints. The paper’s use of a real-world scenario, specif- applications where renewable energy sources could have
ically the Victoria Line of the London Underground, been highly variable.
for testing and validation was another strong point. The An innovative approach to enhancing image caption-
authors provided a comprehensive evaluation, comparing ing models using PPO was designed by authors in [46].
their PPO-based method with established meta-heuristic The authors aimed to improve the quality of generated
algorithms like Genetic Algorithm and Differential Evo- captions by incorporating PPO into the phase of training,
lution. The results indicated that the PPO-based approach specifically targeting the optimization of scores. The
outperformed these traditional methods in terms of both study explored various modifications to the PPO algo-
solution quality and computational efficiency. This prac- rithm to adapt it effectively for the image captioning task.
tical validation underscored the method’s applicability A significant strength of the paper was its integration of
and effectiveness in real-world settings. The reliance on PPO with image captioning models, which traditionally
a specific set of operational constraints, such as fixed relied on VPG methods. The authors argued that PPO
headways and trailer limits, might have also limited could provide better performance due to its ability to
the method’s generalizability. While these constraints enforce trust-region constraints, thereby improving sam-
were necessary for practical implementations, exploring ple complexity and ensuring stable policy updates. This
more flexible constraint formulations could have further was particularly important for image captioning, where
enhanced the method’s adaptability to different metro maintaining high-quality training trajectories was crucial.
systems and operational conditions. The authors’ experimentation with different regulariza-
Authors in [225] proposed an advanced scheduling tion techniques and baselines was another strong point.
algorithm for multitask environments using a combi- They found that combining PPO with dropout decreased
nation of PPO and an optimal policy characterization. performance, which they attributed to increased KL-
The main focus was to optimize the scheduling of mul- divergence of RL policies. This empirical observation
tiple tasks across multiple servers, taking into account was critical as it guided future implementations of PPO
random task arrivals and renewable energy generation. in similar contexts. Furthermore, the adoption of a word-
One of the primary strengths of this paper was its level baseline via MC estimation, as opposed to the
innovative combination of PPO with a priority rule, traditional sentence-level baseline, was a noteworthy
the Earlier Deadline and Less Demand First (ED-LDF). innovation. This approach was expected to reduce the
This rule prioritized tasks with earlier deadlines and variance of policy gradient estimators more effectively,
lower demands, which was shown to be optimal under contributing to improved model performance. While the
heavy traffic conditions. The integration of ED-LDF with results were promising, they were primarily validated
PPO effectively reduced the dimensionality of the action on the MSCOCO dataset. Further validation on other
space, making the algorithm scalable and efficient even datasets and in real-world applications would have been
in large-scale settings. This reduction in complexity was beneficial to assess the generalizability and robustness of
crucial for practical applications where the number of the approach. The paper could have also benefited from
tasks and servers could have been substantial. However, a more detailed discussion on the impact of different

hyperparameter settings and the specific configurations stable flight control without relying on a predefined
used in the experiments. Providing this information mathematical model of the quadrotor’s dynamics. One
would have enhanced the reproducibility of the study of the major strengths of this paper was its application
and allowed other researchers to build on the authors’ of PPO in the context of quadrotor control. By using
work more effectively. PPO, the authors ensured that the control policy updates
In [226], a centralized coordination scheme for CAVs remained stable and efficient, which was crucial for real-
at intersections without traffic signals was developed. time applications like quadrotor control. The choice of
The authors introduced the Model Accelerated PPO PPO over other RL algorithms was well justified due
(MA-PPO) algorithm, which incorporated a prior model to its robustness and low computational complexity. The
into the PPO algorithm to enhance sample efficiency authors’ approach to utilizing a stochastic policy gradient
and reduce computational overhead. One of the sig- method during training, which was then converted to
nificant strengths of this paper was its focus on im- a deterministic policy for control, was another notable
proving computational efficiency, a major challenge in strength. This strategy ensured efficient exploration dur-
centralized coordination methods. Traditional methods, ing training, allowing the quadrotor to learn a robust
such as MPC, were computationally demanding, making control policy. The use of a simple reward function that
them impractical for real-time applications with a large focused on minimizing the position error between the
number of vehicles. By using MA-PPO, the authors quadrotor and the target further added to the efficiency
significantly reduced the computation time required for of the training process. One limitation was the reliance
coordination, achieving an impressive reduction to 1/400 on specific initial conditions and a fixed simulation envi-
of the time needed by MPC. This efficiency gain was ronment. While the authors showed that the PPO-based
crucial for real-time deployment in busy intersections. controller could recover from harsh initial conditions,
A limitation was the focus on a specific intersection the generalizability of the method to different quadrotor
scenario. While the four-way single-lane intersection was models and varying environmental conditions remained
a common setup, real-world intersections could have var- to be explored. Future work could have benefited from
ied significantly in complexity and traffic patterns. Future testing the controller on a wider range of scenarios and
work could have explored the applicability of MA-PPO incorporating additional environmental factors to ensure
to more complex intersection scenarios and different broader applicability.
traffic conditions to ensure broader generalizability. Authors in [228] addressed the development of an
The application of PPO in developing a DRL con- intelligent lane change strategy for autonomous vehicles
troller for the nonlinear attitude control of fixed-wing using PPO. This approach PPO to manage lane change
UAVs was explored in [227]. The study presented a maneuvers in dynamic and complex traffic environments,
proof-of-concept controller capable of stabilizing a fixed- focusing on enhancing safety, efficiency, and comfort.
wing UAV from various initial conditions to desired roll, The authors’ design of a comprehensive reward func-
pitch, and airspeed values. One of the primary strengths tion that considered multiple aspects of lane change
of the paper was its innovative use of PPO for UAV maneuvers was one of the strengths. The reward function
attitude control. PPO was known for its stability and ef- incorporated components for safety (avoiding collisions
ficient policy updates, making it well-suited for complex and near-collisions), efficiency (minimizing travel time
control tasks like UAV attitude stabilization. The choice and aligning with desired speed and position), and
of PPO over other RL algorithms was justified by its comfort (reducing lateral and longitudinal jerks). This
robust performance and low computational complexity, multi-faceted approach ensured that the learned policy
which were crucial for real-time control applications. optimized for a holistic driving experience, balancing
The authors also highlighted the practical advantages of the often competing demands of these different aspects.
using PPO, such as its hyperparameter robustness across The inclusion of a safety intervention module to prevent
different tasks. One of the limitations was the reliance catastrophic actions was a particularly noteworthy fea-
on a single UAV model and specific aerodynamic co- ture. This module labeled actions as ”catastrophic” or
efficients for validation. While the Skywalker-X8 was ”safe” and could replace potentially dangerous actions
a popular fixed-wing UAV, the generalizability of the with safer alternatives, enhancing the robustness of the
proposed approach to other UAV models with different learning process. This safety-centric approach addressed
aerodynamic characteristics remained to be explored. a critical concern in applying DRL to real-world au-
Future work could have benefited from testing the PPO- tonomous driving tasks, where safety was paramount.
based controller on a wider range of UAV models to However, the complexity of implementing PPO for lane
ensure broader applicability. change maneuvers posed challenges. The need for con-
The use of PPO to control the position of a quadrotor tinuous training and fine-tuning of parameters could have
was investigated in [39]. The primary goal was to achieve been resource-intensive and might not have been feasible

TABLE XV: PPO Papers Review addressing some limitations of pure policy or Value-
Application Domain References based approaches [229], [52].
Distributed DRL Control for [222] In the next subsection, we first introduce two main ver-
Mixed-Autonomy Traffic sions of Actor-Critic methods, Asynchronous Advantage
Optimization
Actor-Critic (A3C) & Advantage Actor-Critic (A2C),
Power Systems and Energy [223]
Management and then, we will analyze various applications of each.
Transportation and Routing [224]
Optimization (EVs)
A. A3C & A2C
Real-time Systems and Hardware [225]
Image Captioning Models [46] The A2C algorithm is a synchronous variant of the
Hybrid RL Algorithms [46] A3C algorithm, which was introduced by [230]. A2C
Intelligent Traffic Signal Control [226] maintains the key principles of A3C but simplifies the
Real-time Systems and Hardware [227], [39] training process by synchronizing the updates of multiple
agents, thereby leveraging the strengths of both Actor-
TABLE XVI: Policy-based Papers Review Critic methods and advantage estimation. The Actor-
Application Domain References
Critic architecture combines two primary components,
Distributed DRL Control for [222] in both algorithms: the actor, which is responsible for
Mixed-Autonomy Traffic selecting actions, and the critic, which evaluates the
Optimization actions by estimating the value function. The actor
Power Systems and Energy [223] updates the policy parameters in a direction that is
Management
expected to increase the expected reward, while the
Transportation and Routing [224]
Optimization (EVs) critic provides feedback by computing the TD error.
Real-time Systems and Hardware [225], [227], [39] This integration allows for more stable and efficient
Image Captioning Models [46] learning compared to using Actor-only or critic-only
Hybrid RL Algorithms [46], [214] methods [231]. Advantage estimation is a technique used
Intelligent Traffic Signal Control [226] to reduce the variance of the policy gradient updates.
Energy and Power Management [204], [219], [151] ,[160] The advantage function A(s, a) represents the difference
Theoretical Research [203], [213] between the action-value function Q(s, a) and the value
(Convergence, stability)
function V (s):
Network Optimization [206]
Object Recognition [216]
Multi-agent Systems and [215], [42] A(s, a) = Q(s, a) − V (s). (42)
Autonomous Behaviors
By using the advantage function, A2C focuses on
Cybersecurity [217]
Robotics [218], [207]
actions that yield higher returns than the average, which
helps in making more informed updates to the policy [1].
Unlike A3C, where multiple agents update the global
for all developers or organizations, acknowledging that model asynchronously, A2C synchronizes these updates.
it is true for some other algorithms. Table XV shows a Multiple agents run in parallel environments, collecting
summary of analyzed papers. Moreover, Table XVI rep- experiences and calculating gradients, which are then
resents all papers reviewed in this section as a package. aggregated and used to update the global model syn-
We now shall analyze the last group of methods, chronously. This synchronization reduces the complexity
the Actor-Critic methods. We start by introducing the of implementation and avoids issues related to asyn-
general Actor-Critics, which combine Value-based and chronous updates, such as non-deterministic behavior
Policy-based approaches. and potential overwriting of gradients.
The A3C algorithm operates as follows:
Based on Alg. 24, first, the parameters of the policy
V. ACTOR -C RITIC M ETHODS network (actor) θ and the value network (critic) φ are ini-
Actor-critic methods combine Value-based and Policy- tialized (line 1). Then, multiple agents are run in parallel,
based approaches. Essentially, these methods consist of each interacting with its own copy of the environment
two components: the Actor, who selects actions based (lines 2-14). Each agent independently collects a batch
on a policy, and the Critic, who evaluates the actions of experiences (st , at , rt , st+1 ) (lines 5-9). For each
based on their value function. By providing feedback agent, the advantage is computed using the collected
on the quality of the actions taken, the critic guides experiences. Subsequently, the gradients of the policy
the actor in updating the policy directly. As a result of and value networks are calculated using the advantage
this synergy, learning can be more stable and efficient, estimates. Finally, each agent independently updates

Algorithm 24 A3C
1: Initialize actor and critic networks with random weights
2: for each episode ∈ [1, n] do
3:    Download weights from the headquarters to each AC
4:    for each AC do
5:       Initialize the random state s0
6:       for each t ∈ [1, k] do
7:          Select action at from the actor network
8:          Execute action at and observe reward rt from the critic network and next state st
9:          Update the actor network parameters
10:      end for
11:      Update the critic network parameters
12:   end for
13:   Upload weights of each AC network to the headquarters
14: end for

Algorithm 25 A2C
1: Initialize policy network (actor) parameters θ and value network (critic) parameters φ
2: Set number of parallel agents N
3: repeat
4:    for each agent i ← 1 to N in parallel do
5:       Get initial state si
6:       Initialize local episode storage for agent i
7:       for each step t do
8:          Sample action ai from policy πθ (ai |si )
9:          Execute ai , observe reward ri and next state s′i
10:         Store (si , ai , ri , s′i ) in local storage for agent i
11:         si ← s′i
12:      end for
13:      Compute advantage estimates Âi for agent i
14:      and policy gradient ∇θ LPG (θ) for agent i
15:      and value loss ∇φ LV (φ) for agent i
16:   end for
17:   Aggregate gradients from all agents
18:   Update global actor parameters θ and critic parameters φ using aggregated gradients
19: until convergence or maximum steps reached

the global model parameters θ and φ asynchronously using the computed gradients (line 13).
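As a purely illustrative companion to Alg. 25 and the advantage definition in Eq. (42), the NumPy sketch below shows how per-agent advantages can be formed from sampled rewards and critic value estimates, and how per-agent gradients could then be averaged for one synchronous update. The array shapes, discount factor, and gradient placeholders are assumptions for illustration, not details taken from the surveyed implementations.

```python
import numpy as np

def discounted_returns(rewards, gamma=0.99):
    """Compute discounted returns R_t for one agent's episode."""
    returns = np.zeros_like(rewards, dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

def advantages(rewards, values, gamma=0.99):
    """Advantage estimate A_t = R_t - V(s_t), in the spirit of Eq. (42)."""
    return discounted_returns(rewards, gamma) - values

# Toy synchronous (A2C-style) aggregation over N parallel agents:
# each agent contributes a gradient vector; the global update averages them.
rng = np.random.default_rng(0)
num_agents, param_dim = 4, 6
per_agent_grads = []
for _ in range(num_agents):
    r = rng.uniform(0.0, 1.0, size=5)   # sampled rewards for a short rollout
    v = rng.uniform(0.0, 1.0, size=5)   # critic value estimates V(s_t)
    adv = advantages(r, v)
    # Placeholder for the policy-gradient term: a real implementation would use
    # grad log pi(a_t|s_t) weighted by adv; here a random direction is scaled instead.
    per_agent_grads.append(rng.normal(size=param_dim) * adv.mean())
global_grad = np.mean(per_agent_grads, axis=0)  # synchronized update direction
print(global_grad)
```

The key design point, mirroring lines 17-18 of Alg. 25, is that all agents' gradients are aggregated before a single global parameter update, in contrast to the asynchronous, per-agent updates of A3C.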
The A2C algorithm, on the other hand, operates as
follows: Based on Alg. 25, first, the parameters of the
policy network (actor) θ and the value network (critic) large-scale cloud environments. The scheduler’s ability
φ are initialized (line 1). Then, multiple agents are run to prioritize tasks and virtual machines based on multiple
in parallel, each interacting with its own copy of the factors led to more informed and effective scheduling
environment (line 4). Each agent collects a batch of ex- decisions. Cloud environments were inherently dynamic
periences (st , at , rt , st+1 ) (line 10). For each agent, the and could exhibit unpredictable changes in workload,
advantage is computed using the collected experiences. resource availability, and cost structures. The proposed
Subsequently, the gradients of the policy and value method might have needed further enhancements to
networks are calculated using the advantage estimates adapt to these real-time variations effectively and en-
(lines 13-15). Finally, the gradients from all agents sure robust performance under fluctuating conditions.
are aggregated and used to update the global model While the paper presented a significant advancement
parameters θ and φ (lines 17-18). in task scheduling for cloud computing, addressing the
a) Overview of A3C applications in the literature: computational complexity, scalability, and real-world im-
[50] introduced a scheduler named MOPTSA3C, which plementation challenges would have been crucial for
prioritized tasks and virtual machines based on various its practical adoption and effectiveness in diverse cloud
factors such as task length, runtime processing capaci- environments.
ties, and electricity unit costs. This approach aimed to A latency-oriented cooperative caching strategy us-
optimize makespan, resource utilization, and resource ing A3C was proposed in [232]. This approach aimed
cost using an enhanced A3C. One of the significant to minimize average download latency by predicting
strengths of this paper was its comprehensive multi- content popularity and optimizing caching decisions in
objective approach. By addressing multiple objectives si- real-time, thereby improving user experience in Fog
multaneously, including minimizing makespan, optimiz- Radio Access Networks (F-RANs). One strength was
ing resource utilization, and reducing resource costs, the the comprehensive evaluation of the proposed method
proposed scheduler ensured a balanced and efficient task through extensive simulations. The authors compared
scheduling process in multi-cloud environments. The use their A3C-based approach with several baseline algo-
of an improved A3C algorithm enhanced the robustness rithms, including greedy caching, random caching, and
and efficiency of the scheduling process. The asyn- a cooperative caching strategy that did not consider user
chronous nature of A3C allowed for parallel training and equipment caching capacity. The results demonstrated
faster convergence, which was crucial for dynamic and significant reductions in average download latency, high-

lighting the effectiveness of the proposed strategy in addressed the gap in traditional AI chess opponents,
optimizing caching performance. Another limitation was which did not educate players on strategies and tactics.
the reliance on accurate and timely content popularity The cognitive agent relied on accurate feedback for
predictions. The effectiveness of the caching strategy consistency. Agent effectiveness was heavily dependent
depended heavily on the accuracy of these predictions. on its ability to provide relevant and timely suggestions.
In real-world applications, user preferences and content The agent needed to accurately interpret the player’s
popularity could change unpredictably, and any devia- intentions and provide appropriate feedback in real-
tions from the predicted values could negatively impact world applications. Users who had difficulty navigating
the performance of the caching algorithm. Additionally, digital interfaces might also have experienced acces-
the assumption that users did not request the same sibility issues due to the agent’s reliance on digital
content repeatedly might not have held true in all scenar- interfaces. Another concern was scalability. A controlled
ios, potentially affecting the reliability of the popularity environment showed promising results, but the system
prediction model. might not have been able to handle a broader range of
Authors in [233] presented a comprehensive approach cognitive impairments. Additional users and interactions
to optimizing resource allocation and pricing in Mobile could have introduced additional complexity, making it
Edge Computing (MEC)-enabled blockchain systems us- difficult to maintain performance.
ing the A3C algorithm. The study’s strengths lay in its in- Researchers in [235] introduced a novel approach to
novative integration of blockchain with MEC to enhance autonomous valet parking using a combination of PPO
resource management. The A3C algorithm’s capability and A3C. This method aimed to address the control
to handle both continuous and high-dimensional discrete errors due to the non-linear dynamics of vehicles to
action spaces made it well-suited for the dynamic nature optimize parking maneuvers. An important strength of
of MEC environments. The use of prospect theory to this study was the use of the A3C-PPO algorithm. Com-
balance risks and rewards based on miner preferences bining the advantages of both PPO and A3C, this hybrid
added a nuanced understanding of real-world scenarios. approach resulted in a more stable and efficient learning
The results demonstrated that the proposed A3C-based process. A3C’s asynchronous nature allowed parallel
algorithm outperformed baseline algorithms in terms of training, which sped convergence and improved state-
total reward and convergence speed, indicating its effec- action exploration. In addition, PPO prevented drastic
tiveness in optimizing long-term performance. However, changes from destabilizing the learning process, by limit-
the paper had several limitations. The reliance on a ing the magnitude of policy updates. Incorporating man-
specific MEC server configuration and a fixed number ual hyperparameter tuning further optimized the training
of mobile devices might have limited the generalizability process, resulting in better rewards. One of the limita-
of the findings to other settings with different configu- tions was the reliance on specific assumptions about the
rations. The assumption that all validators were honest environment and sensor accuracy. The effectiveness of
simplified the model but might not have reflected real- the proposed method depended on the accurate detection
world blockchain environments where malicious actors and interpretation of the surroundings by sensors such as
could have existed. The additional complexity introduced cameras and LiDAR. Any inaccuracies or deviations in
by the collaborative local MEC task processing mode sensor data could have impacted the performance and
might have increased the computational overhead, poten- robustness of the algorithm. In real-world applications,
tially affecting the scalability of the proposed solution. environmental conditions such as lighting, weather, and
Moreover, the paper did not address the potential impact obstacles could have varied significantly, and ensuring
of network latency and varying network conditions on reliable sensor performance under these conditions was
the performance of the A3C algorithm, which could have crucial. Additionally, the study did not address the poten-
been significant in practical deployments. tial impact of network latency and communication issues
In [234], authors explored the application of A3C to between the vehicle and the central control system,
create a cognitive agent designed to help Alzheimer’s which could have affected the real-time decision-making
patients play chess. The primary goal was to enhance process. Ensuring robust communication and minimizing
cognitive skills and boost brain activity through chess, latency was critical for the practical implementation of
a game known for its cognitive benefits. One of the autonomous valet parking systems.
key strengths of the paper was its innovative approach An advanced approach for optimizing content caching
to leveraging A3C to assist individuals with cogni- in 5G networks using A3C was proposed in [236].
tive disabilities in playing chess. The cognitive agent This method aimed to minimize the total transmission
provided real-time assistance by suggesting offensive cost by learning optimal caching and sharing policies
and defensive moves, thereby helping players improve among cooperative Base Stations (BSs) without prior
their strategies and cognitive abilities. This approach knowledge of content popularity distribution. One of the

TABLE XVII: A3C Papers Review showcased improved scalability, maintaining low task
Application Domain References latency even as the number of tasks increased. The use
Cloud-based Control and [50], [233] of the A2C algorithm, which combined policy gradient
Encryption Systems and value function estimates, enhanced the stability and
Games [234] effectiveness of the policy network, further contributing
Multi-agent Systems and [235] to the overall performance of the model. However,
Autonomous Behaviors
Network Optimization [236]
there were some limitations to the HA-A2C approach.
One notable weakness was the potential complexity of
implementing the hard attention mechanism in real-
main strengths of this paper was the use of the A3C world scenarios, where the dynamic and heterogeneous
algorithm, which leveraged the asynchronous nature of nature of edge environments might have posed additional
multiple agents to achieve faster convergence and reduce challenges. Furthermore, while the HA-A2C method
time correlation in learning samples. The algorithm’s outperformed other DRL methods in terms of latency,
ability to operate with multiple environment instances in it might have still faced difficulties in scenarios with ex-
parallel enhanced computational efficiency and signif- tremely high-dimensional states and action spaces, where
icantly improved the learning process. By considering further optimization might have been necessary. Another
cooperative BSs that could have fetched content from consideration was the reliance on accurate and timely
neighboring BSs or the backbone network, the proposed data for effective attention allocation, which might not
method effectively reduced data traffic and transmission have always been feasible in practical applications.
costs in 5G networks. The empirical results demonstrated A task scheduling mechanism in cloud-fog environ-
the superiority of the A3C-based algorithm over classical ments that leveraged the A2C algorithm was presented in
caching policies such as Least Recently Used, Least Fre- [238]. This approach aimed to optimize the scheduling
quently Used, and Adaptive Replacement Cache, show- process for scalability, reliability, trust, and makespan
casing lower transmission costs and faster convergence efficiency. One of the significant strengths of the paper
rates. One limitation of this study was the reliance on was the holistic approach it took toward task scheduling
accurate and timely updates of content popularity distri- in heterogeneous cloud-fog environments. The use of
butions. While the paper assumed that content popularity the A2C algorithm was particularly effective in handling
followed a Zipf distribution and varied over time, the the dynamic nature of task scheduling, as it allowed
accuracy of these assumptions could have significantly the scheduler to make real-time decisions based on the
impacted the performance of the caching algorithm. In current state of the system. By dynamically adjusting
real-world applications, user preferences and content the number of virtual machines according to workload
popularity could have changed unpredictably, and any demands, the proposed scheduler ensured efficient re-
deviations from the assumed distribution could have source utilization, which was crucial for maintaining
affected the effectiveness of the caching policy. Ensuring system performance under varying conditions. One lim-
robust performance under varying content popularity itation was the reliance on specific system parameters
distributions was crucial for the practical implementation and assumptions about the environment. The proposed
of the proposed method. method assumed accurate estimation of factors such as
Table XVII organizes the papers discussed in this task priorities and VM capacities, which might not have
section, offering a domain-specific breakdown of the always been feasible in practical applications. Deviations
research conducted in the A3C area. from these assumptions could have impacted the per-
b) Overview of A2C applications in the literature: formance and robustness of the scheduling algorithm.
In [237], authors introduced an innovative approach to Further research could have explored the robustness of
low-latency task scheduling in edge computing environ- the proposed approach under more relaxed assumptions
ments, addressing several significant challenges inherent and in diverse real-world scenarios.
in such settings. The primary focus of the paper was on Authors in [239] introduced an A2C-learning-based
integrating a hard attention mechanism with the A2C framework for optimizing beam selection and transmis-
algorithm to enhance task scheduling efficiency and sion power in mmWave networks. This approach aimed
reduce latency. The strengths of the HA-A2C method to improve energy efficiency while maintaining coverage
lay in its ability to significantly reduce task latency by in dynamic and complex network environments. A no-
approximately 40% compared to the DQN method. The table strength of the paper was its innovative application
hard attention mechanism employed by HA-A2C was of the A2C algorithm for joint optimization of beam
particularly effective in reducing computational com- selection and transmission power. This dual optimization
plexity and increasing efficiency, allowing the model to was particularly effective in addressing the significant
process tasks more quickly. Additionally, the method challenge of energy consumption in mmWave networks.

By leveraging A2C, the proposed method dynamically API, covering a substantial period (2006-2021), ensured
adjusted beam selection and power levels based on the that the models were tested against diverse market con-
current state of the network, which was represented by ditions, enhancing the robustness and reliability of the
the Signal-to-Interference-plus-Noise Ratio (SINR) val- results. The study’s focus on a single market (Indian
ues. The use of A2C ensured stable and efficient learning stock market) might have limited the generalizability
through policy gradients and value function approxi- of the findings. Future research could have explored
mations, making it suitable for real-time applications. the application of these algorithms in different financial
One of the limitations was the assumption of specific markets to ensure broader applicability.
system parameters and environmental conditions. The The authors presented an innovative framework for
method assumed accurate estimation of SINR values and optimizing task segmentation and parallel scheduling
predefined beam angles, which might not have always in edge computing networks using the A2C algorithm
been feasible in practical applications. Deviations from in [241]. The approach focused on minimizing total
these assumptions could have impacted the performance task execution delay by splitting multiple computing-
and robustness of the optimization algorithm. Future intensive tasks into sub-tasks and scheduling them ef-
research could have explored the robustness of the pro- ficiently across different edge servers. A key strength of
posed method under more relaxed assumptions and in this paper was its holistic approach to task segmentation
diverse real-world scenarios. and scheduling. By jointly optimizing both processes,
Authors explored the use of the A2C algorithm to the proposed method ensured that tasks were not only
estimate the power delay profile (PDP) in 5G New divided efficiently but also assigned to the most suitable
Radio environments in [32]. This approach aimed to edge servers for processing. This joint optimization
enhance channel estimation performance by leveraging was crucial in dynamic edge computing environments,
DRL techniques. A notable strength of this paper was where both computation capacity and task requirements
its innovative application of the A2C algorithm to the could have varied significantly over time. The use of
problem of PDP estimation. By framing the estimation A2C, allowed the system to adapt to these changes in
problem within an RL context, the proposed method real-time, enhancing overall system performance. The
directly targeted the minimization of Mean Square Error authors’ method of decoupling the complex mixed-
in channel estimation, rather than aiming to approximate integer non-convex problem into more manageable sub-
an ideal PDP. This pragmatic approach allowed the problems was another strength. By first addressing the
algorithm to adapt to the inherent approximations and task segmentation problem and then tackling the sub-
imperfections in practical channel estimation processes, tasks parallel scheduling, the paper presented a struc-
leading to improved performance. However, the com- tured and logical approach to solving the optimization
plexity of implementing the A2C algorithm in real- challenge. The introduction of the optimal task split
world scenarios posed challenges. The need for extensive ratio function and its integration into the A2C algorithm
training and parameter tuning required significant com- further enhanced the efficiency and effectiveness of the
putational resources and expertise in RL, which might proposed solution.
not have been readily available in all settings. Addi- Researchers in [242] presented an innovative approach
tionally, while the simulation results were promising, to enhancing multi-UAV obstacle avoidance using A2C
further validation in real-world deployments was neces- combined with an experience-sharing mechanism. This
sary to fully assess the robustness and practicality of the method aimed to optimize obstacle avoidance strategies
proposed approach. Real-world environments could have in complex, dynamic environments by sharing posi-
introduced additional challenges, such as varying traffic tive experiences among UAVs to expedite the training
patterns and hardware constraints, which were not fully process. One of the key strengths of this paper was
captured in simulations. the introduction of the experience-sharing mechanism
The application of various Actor-Critic algorithms to to the A2C algorithm. This mechanism significantly
develop a trading agent for the Indian stock market was enhanced the efficiency and robustness of the train-
investigated in [240]. The study evaluated the perfor- ing process by allowing UAVs to share positive expe-
mance of PPO, DDPG, A2C, and Twin Delayed DDPG riences. This collective learning approach accelerated
(TD3) algorithms in making trading decisions. One of the convergence of the algorithm, enabling UAVs to
the primary strengths of this paper was its comprehensive quickly learn effective obstacle avoidance strategies. The
approach to evaluating multiple Actor-Critic algorithms experience-sharing mechanism was particularly valuable
in a real-world financial trading context. By consider- in multi-agent systems, where individual agents could
ing different algorithms, the authors provided a broad have benefited from the knowledge gained by others,
perspective on the effectiveness of it in stock trading. leading to faster and more robust learning. However,
The use of historical stock data from the Yahoo Finance the experience-sharing mechanism relied on consistent

and reliable communication between UAVs. In practical TABLE XVIII: A2C Papers Review
applications, communication constraints and network Application Domain References
reliability issues could have significantly impacted the Edge computing environments [237], [241]
effectiveness of this mechanism. Inter-UAV communica- Network Optimization [238],[32]
tion latency and packet loss could have led to outdated or Cloud-based Control and [238]
incomplete information being shared, thereby reducing Encryption Systems
Energy and Power Management [239]
the overall efficiency of the learning process. Also, the (IoT Networks, Smart Energy
method assumed a certain level of accuracy in modeling Systems)
the environment and the dynamic obstacles within it. Financial Applications [240]
Any deviations from these assumptions, such as unex- Autonomous UAVs [242]
pected changes in obstacle behavior or environmental
conditions, could have affected the performance and Visual Navigation [243]
robustness of the algorithm. Real-world environments
were inherently unpredictable, and the algorithm must
have been tested extensively in diverse scenarios to Table XVIII summarizes the discussed papers in
ensure its reliability. this section. Over the next subsection, Deterministic
Policy Gradient (DPG) algorithm, which addresses the
A robust approach to visual navigation using an Actor- challenges associated with continuous action spaces is
Critic method enhanced with Generalized Advantage discussed.
Estimation (GAE) was developed by authors in [243].
This method demonstrated significant strengths in terms
of learning efficiency and stability, as well as effective B. Deterministic Policy Gradient (DPG)
navigation in complex environments like ViZDoom. One DPG addresses the challenges associated with con-
major strength of this approach was its ability to rapidly tinuous action spaces and offers significant improve-
converge and achieve high performance in both basic ments in sample efficiency over stochastic policy gra-
and complex visual navigation tasks. By employing the dient methods. Traditional policy gradient methods in
A2C method with GAE, the algorithm reduced variance RL use stochastic policies, where the policy πθ (a|s)
in policy gradient estimates, leading to more stable is a probability distribution over actions given a state,
learning. This was particularly evident in the ViZDoom parameterized by θ. These methods rely on sampling
health gathering scenarios, where the A2C with GAE actions from this distribution to compute the policy
agent achieved the highest scores with lower variance gradient, which can be computationally expensive and
compared to other methods. Additionally, the use of sample inefficient, especially in high-dimensional action
multiple processes in the A2C method significantly spaces [1], [244]. In contrast, the DPG algorithm uses
reduced training time, making it more efficient than a deterministic policy, denoted by µθ (s), which directly
traditional DQN approaches. However, the method also maps states to actions without involving any randomness.
had notable limitations. One significant drawback was The policy gradient theorem for deterministic policies
the high computational cost associated with using mul- shows that the gradient of the expected return with
tiple processes for training, which might not have been respect to the policy parameters can be computed as
feasible in resource-constrained environments. Further- [244]:
more, while the approach performed well in the tested
ViZDoom scenarios, its generalizability to other, more h i
diverse environments remained uncertain without further ∇θ J(µθ ) = Es∼ρµ ∇θ µθ (s)∇a Qµ (s, a) a=µθ (s)
validation. The reliance on visual inputs also presented (43)
challenges in environments with varying lighting con- where Qµ (s, a) is the action-value function under the
ditions or visual obstructions, which were not exten- deterministic policy µθ , and ρµ is the discounted state
sively tested in this study. Another limitation was the visitation distribution under µθ .
potential for over-fitting to specific task environments. By employing an off-policy learning approach, DPG
The training setup in controlled ViZDoom scenarios ensures adequate exploration while learning a determin-
might not have fully captured the complexities of real- istic target policy. To generate exploratory actions and
world navigation tasks, where environmental dynamics gather experiences, a behavior policy, often a stochastic
were less predictable. Thus, while the A2C with GAE policy, is used. Gradients derived from these experiences
approach showed promise, its applicability to a broader are then used to update the deterministic policy. As the
range of visual navigation tasks would have benefited same experiences can be reused to improve policy in
from additional research and testing in more varied and this non-policy setting, the data collected can be utilized
less controlled environments. more efficiently than in a policy setting [245].

The DPG algorithm is typically implemented within updated with a soft update mechanism. The target
an Actor-Critic framework, where the actor represents networks provide stable targets for the Q-learning
the deterministic policy µθ and the critic estimates the updates, reducing the likelihood of divergence:
action-value function Qµ (s, a). The critic is trained using ′ ′
TD learning to minimize the Bellman error: θQ ← τ θQ + (1 − τ )θQ (47)
′ ′
θµ ← τ θµ + (1 − τ )θµ (48)
δt = rt + γQµ (st+1 , µθ (st+1 )) − Qµ (st , at ) (44)
where δt is the TD error, rt is the reward, γ is the where τ < 1 is the target update rate.
discount factor, and st , at are the state and action at Exploration in continuous action spaces is crucial for
time step t. The actor updates the policy parameters in effective learning. DDPG employs an exploration policy
the direction suggested by the critic: by adding noise to the actor’s deterministic policy. An
Ornstein-Uhlenbeck process [247] is typically used to
generate temporally correlated noise, promoting explo-
θ ← θ + α∇θ µθ (st )∇a Qµ (st , at ) a=µθ (st )
(45) ration in environments with inertia.
where α is the learning rate for the actor. A variant of The DDPG algorithm operates as follows: As shown
DPG, which is designed to handle continuous action in Alg. 26, first, the parameters of the actor network θµ
spaces with the help of DL, is analyzed in the next and the critic network θQ are initialized [248] (line 1-2).
subsection in detail. Target networks for both the actor and critic are also ini-
1) Deep Deterministic Policy Gradient (DDPG): The tialized. Then, multiple agents interact with their respec-
DDPG algorithm is an extension of the DPG method, tive environments, collecting transitions (st , at , rt , st+1 )
designed to handle continuous action spaces effectively, which are stored in a replay buffer (lines 3-10). For
introduced by [246]. DDPG leverages the power of DL to each agent, the actor selects actions based on the current
address the challenges associated with high-dimensional policy with added exploration noise. The critic network
continuous control tasks [246]. The foundation of DDPG is updated using the Bellman equation to minimize the
lies in the DPG algorithm. This approach contrasts with TD error (lines 11-13). The actor network is updated
stochastic policy gradients, which sample actions from a using the policy gradient derived from the critic (line
probability distribution. The deterministic nature of DPG 14). Periodically, the target networks are updated to
reduces the variance of gradient estimates and improves slowly track the learned networks (line 15). Over the
sample efficiency, making it suitable for continuous ac- next paragraphs, we will analyze some of the papers in
tion spaces. DDPG employs an Actor-Critic architecture, the literature that used DDPG.
where the actor network represents the policy µ(s|θµ ) Authors in [249] presented an innovative method for
and the critic network estimates the action-value function developing a missile lateral acceleration control system
Q(s, a|θQ ). The actor network outputs a specific action using the DDPG algorithm. This study reframed the au-
for a given state, while the critic network evaluates the topilot control problem within the RL context, utilizing
action by estimating the expected return. The policy a 2-degrees-of-freedom nonlinear model of the missile’s
gradient is computed using the chain rule: longitudinal dynamics for training. One strength was
the incorporation of performance metrics such as set-
h i tling time, undershoot, and steady-state error into the
∇θµ J ≈ Es∼ρβ ∇a Q(s, a|θQ ) a=µ(s|θ µ )
∇θµ µ(s|θµ ) reward function. By integrating these key performance
(46) indicators, the authors ensured that the trained agent
where ρβ denotes the state distribution under a behavior not only learned to control the missile effectively but
policy β. To stabilize learning and address the challenges also adhered to desirable performance standards. This
of training with large, non-linear function approximators, approach enhanced the practical applicability of the
DDPG incorporates two key techniques from the DQN method, ensuring that the control system met operational
algorithm [8]: requirements. The method’s scalability to more complex
1) Replay Buffer: A replay buffer stores transitions scenarios and larger-scale implementations was another
(st , at , rt , st+1 ) observed during training. By sam- concern. The increased number of states and poten-
pling mini-batches of transitions from this buffer, tial interactions in a real-world missile control system
DDPG minimizes the correlations between con- could have introduced additional complexities, making it
secutive samples, which stabilizes training and challenging to maintain the same level of performance.
improves efficiency. Further research was needed to explore the scalability
2) Target Networks: DDPG uses target networks for of the approach and develop mechanisms to manage the
both the actor and critic, which are periodically increased computational load.

Algorithm 26 DDPG different types of batteries and charging environments


1: Randomly initialize critic network Q(s, a|θQ ) and was a concern. While the results were promising for the
actor µ(s|θµ ) with weights θQ and θµ . specific battery model used in the study, the ability to
2: Initialize target network Q′ and µ′ with weights generalize the approach to other battery types and con-
′ ′
θQ ← θQ , θµ ← θµ . figurations remained uncertain. The increased number
3: Initialize replay buffer R. of variables and potential interactions in more complex
4: for episode = 1 to M do systems could have introduced additional complexities,
5: Initialize a random process N for action making it challenging to maintain the same level of
exploration. performance. Further research was needed to explore the
6: Receive initial observation state s1 . scalability of the approach and develop mechanisms to
7: for t = 1 to T do manage the increased computational load.
8: Select action at = µ(st |θµ ) + Nt according Authors explored the application of DDPG to the
to the current policy and exploration noise problem of obstacle avoidance in the trajectory plan-
9: Execute action at and observe reward rt ning of a robot arm in [251]. The authors proposed
and observe new state st+1 using DDPG to plan the trajectory of a robot arm,
10: Store transition (st , at , rt , st+1 ) in R ensuring smooth and continuous motion while avoiding
11: Sample a random minibatch of N obstacles. The rewards were specifically designed to
transitions (si , ai , ri , si+1 ) from R overcome the convergence difficulties posed by multiple
′ ′
12: Set yi = ri + γQ′ (si+1 , µ′ (si+1 |θµ )|θQ ). and potentially antagonistic rewards. One strength of this
13: Update P critic by minimizing the loss: paper was the careful design of the reward function.
L = N1 Q 2
i (yi − Q(si , ai |θ )) . By considering both the distance to the target and the
14: Update the actor proximity to obstacles, the authors ensured that the
policy using the sampled policy gradient: robot not only reached its goal efficiently but also
1 X avoided collisions. This multi-faceted reward structure
∇θµ J ≈ ∇a Q(s, a|θQ )|s=si ,a=µ(si )
N i addressed the convergence issues often encountered in
RL tasks with multiple objectives. The empirical val-
× ∇θµ µ(s|θµ )|si idation through simulations in the MuJoCo environ-
ment demonstrated the effectiveness of the proposed
15: Update the target networks: method, showing that the robot arm could successfully
′ ′ navigate to its target while avoiding obstacles. The
θQ ← τ θQ + (1 − τ )θQ
method’s scalability to more complex scenarios and
′ ′
θµ ← τ θµ + (1 − τ )θµ larger-scale implementations might have impeded the
achieved performance. The increased number of states
and potential interactions in a real-world robotic system
16: end for could have introduced additional complexities, making it
17: end for challenging to maintain the same level of performance.
Further research was needed to explore the scalability
of the approach and develop mechanisms to manage the
A novel approach to fast charging lithium-ion batteries increased computational load. Investigating the impact
using a combination of a Model-based state observer of various hyperparameters and network architectures
and a DRL optimizer, specifically the DDPG algorithm, on the performance of the DDPG algorithm could have
was proposed in [250]. This method aimed to bal- provided deeper insights into optimizing the method for
ance charging rapidity with the enforcement of thermal different robotic applications.
safety and degradation constraints. One of the notable An enhanced version of DDPG, integrating a Long
strengths of this paper was its innovative application of Short-Term Memory (LSTM) network-based encoder to
the DDPG algorithm to the complex problem of fast handle dynamic obstacle avoidance for mobile robots
charging lithium-ion batteries. By formulating a multi- in stochastic environments is introduced in [252]. One
objective optimization problem that included penalties of the notable strengths of this paper was the inno-
for over-temperature and degradation, the authors ef- vative combination of DDPG with LSTM. This hy-
fectively addressed the crucial aspects of battery safety brid approach allowed the robot to encode a variable
and longevity. This approach ensured that the charging number of obstacles into a fixed-length representation,
strategy not only accelerated the charging process but which addressed the limitation of traditional DDPG
also maintained the battery within safe operating lim- that required a fixed number of inputs. By utilizing
its, thus extending its life. The method’s scalability to the LSTM network-based encoder, the proposed method

An enhanced version of DDPG, integrating a Long Short-Term Memory (LSTM) network-based encoder to handle dynamic obstacle avoidance for mobile robots in stochastic environments, was introduced in [252]. One of the notable strengths of this paper was the innovative combination of DDPG with LSTM. This hybrid approach allowed the robot to encode a variable number of obstacles into a fixed-length representation, which addressed the limitation of traditional DDPG that required a fixed number of inputs. By utilizing the LSTM network-based encoder, the proposed method could effectively process and integrate dynamic environmental information, enhancing the robot's adaptability to unpredictable scenarios. The reliance on accurate environmental sensing and real-time data processing was a potential limitation. The performance of the LSTM-based encoder and the overall DDPG framework heavily depended on the quality and accuracy of the sensor data. In real-world applications, factors such as sensor noise, varying environmental conditions, and communication delays could have affected the reliability and robustness of the system. Ensuring robust performance under diverse and unpredictable conditions remained a critical challenge.

DDPG combined with prioritized sampling to optimize power control in wireless communication systems, specifically targeting Multiple Sweep Interference (MSI) scenarios, was designed in [253]. Prioritized sampling was an innovative aspect of the paper: by focusing on more valuable experiences during training, the algorithm accelerated the learning process and improved convergence speed. The empirical results showed that the DDPG scheme with prioritized sampling (DDPG-PS) outperformed the traditional DDPG scheme with uniform sampling and the DQN scheme. This was evident in various MSI scenarios, where the DDPG-PS scheme achieved better reward performance and stability. The scalability of the proposed method to more complex scenarios and larger-scale implementations was another concern. While the results were promising in simulated environments, the ability to handle a broader range of interference patterns and larger numbers of channels remained uncertain. The increased number of states and potential interactions could have introduced additional complexities, making it challenging to maintain the same level of performance. Further research was needed to explore the scalability of the approach and develop mechanisms to manage the increased computational load.
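The prioritized sampling scheme in [253] is described above only at a high level; as a rough illustration of how such a mechanism typically works, the following sketch samples transitions from a replay buffer with probability proportional to their absolute TD error and applies importance-sampling weights to correct the induced bias. The exponents and buffer contents below are assumed placeholders, not values taken from [253].

# Generic sketch of proportional prioritized sampling from a replay buffer.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta, eps = 0.6, 0.4, 1e-6   # assumed prioritization/correction exponents

# Placeholder buffer: each entry stores a transition and its last TD error.
buffer = [{"transition": t, "td_error": rng.random()} for t in range(1000)]

def sample(batch_size=32):
    priorities = np.array([abs(e["td_error"]) + eps for e in buffer]) ** alpha
    probs = priorities / priorities.sum()
    idx = rng.choice(len(buffer), size=batch_size, replace=False, p=probs)
    # Importance-sampling weights correct the bias from non-uniform sampling.
    weights = (len(buffer) * probs[idx]) ** (-beta)
    weights /= weights.max()
    return [buffer[i] for i in idx], weights

batch, w = sample()
# After computing new TD errors for the batch, the stored priorities would be
# refreshed so that "more valuable" (higher-error) experiences are revisited more often.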
Authors in [254] employed DDPG to model the bidding strategies of generation companies in electricity markets. This approach was aimed at overcoming the limitations of traditional game-theoretic methods and conventional RL algorithms, particularly in environments characterized by incomplete information and high-dimensional continuous state/action spaces. One significant strength was the ability of the proposed method to converge to the Nash Equilibrium even in an incomplete information environment. Traditional game-theoretic methods often required complete information and were limited to static games. In contrast, the DDPG-based approach could dynamically simulate repeated games and achieve stable convergence, demonstrating its robustness in modeling real-world market conditions. One limitation was the reliance on accurate modeling of market conditions and real-time data processing. The effectiveness of the proposed method depended heavily on the precision of the input data, such as nodal prices and load demands. In real-world applications, factors such as data inaccuracies, communication delays, and varying environmental conditions could have impacted the reliability and robustness of the system. Ensuring robust performance under diverse and unpredictable conditions remained a critical challenge that needed to be addressed. Additionally, the study assumed a specific structure for the neural networks used in the actor and critic models. The performance of the algorithm could have been sensitive to the choice of network architecture and hyperparameters. A more systematic exploration of different architectures and their impact on performance could have provided deeper insights into optimizing the DDPG algorithm for electricity market modeling.

An advanced approach for resource allocation in vehicular communications using a multi-agent DDPG algorithm was studied in [255]. This method was designed to handle the dynamic and high-mobility nature of vehicular environments, specifically targeting the optimization of the sum rate of Vehicle-to-Infrastructure (V2I) communications while ensuring the latency and reliability of Vehicle-to-Vehicle (V2V) communications. One of the significant strengths was the formulation of the resource allocation problem as a decentralized discrete-time and finite-state MDP. This approach allowed each V2V communication to act as an independent agent, making decisions based on local observations without requiring global network information. This decentralization was crucial for scalability and real-time adaptability in high-mobility vehicular environments. One potential limitation was the reliance on accurate and timely acquisition of Channel State Information (CSI). In high-mobility vehicular environments, obtaining precise CSI could have been challenging due to fast-varying channel conditions. Any inaccuracies in CSI could have impacted the performance and robustness of the proposed resource allocation scheme. Ensuring robust performance under diverse and unpredictable conditions remained a critical challenge.

The last algorithm in the category of Actor-Critic methods to analyze is TD3, which is an enhancement of the DDPG algorithm designed to address the issue of overestimation bias. This algorithm is analyzed in detail in the next subsection. Table XIX provides a summary of the analyzed papers with respect to their domain.

TABLE XIX: DDPG Papers Review
Application Domain | References
Theoretical Research (Convergence, stability) | [246]
Missile Control Systems | [249]
Battery Charging Optimization | [250]
Robotics | [251], [252]
Network Optimization | [253]
Financial Applications | [254]
Multi-agent Systems and Autonomous Behaviors | [255]

2) Twin Delayed Deep Deterministic Policy Gradient (TD3): TD3 is an enhancement of the DDPG algorithm, designed to address the issues of overestimation bias in function approximation within Actor-Critic methods. Introduced by [256], TD3 incorporates several innovative techniques to improve the stability and performance of continuous control tasks in RL.

Overestimation bias occurs when the value estimates for certain actions are consistently higher than their true values due to function approximation errors. This issue is well-documented in Value-based Methods like Q-learning [257]. In Actor-Critic methods, this bias can lead to suboptimal policy updates and divergent behavior [256]. TD3 builds on the Double Q-learning concept, which mitigates overestimation bias by maintaining two separate value estimators and using the minimum of the two estimates for the target update [127]. This approach is adapted to the Actor-Critic setting by employing two critic networks, Qθ1 and Qθ2, which are independently trained. The target value is computed as:

    y = r + γ min_{i=1,2} Qθi′(s′, µθ′(s′))    (49)

where Qθi′ are the target critic networks and µθ′ is the target actor network.

To further reduce the error propagation from the critic to the actor, TD3 delays the policy updates relative to the value updates. Specifically, the policy (actor) network is updated less frequently than the critic networks. This strategy ensures that the value estimates used to update the policy are more accurate and stable. The policy is updated every d iterations, where d is a hyperparameter typically set to 2 or more [256]. To prevent over-fitting to narrow peaks in the value estimate, TD3 introduces target policy smoothing. This technique adds noise to the target policy, encouraging smoother value estimates and more robust policy learning. The target value is computed with added noise:

    y = r + γ min_{i=1,2} Qθi′(s′, µθ′(s′) + ǫ)    (50)

where ǫ is clipped noise sampled from a Gaussian distribution.

Algorithm 27 TD3
1: Initialize critic networks Qθ1, Qθ2, and actor network πφ with random parameters θ1, θ2, φ
2: Initialize target networks θ1′ ← θ1, θ2′ ← θ2, φ′ ← φ
3: Initialize replay buffer B
4: for t = 1 to T do
5:     Select action with exploration noise a ∼ πφ(s) + ǫ, ǫ ∼ N(0, σ), and observe reward r and new state s′
6:     Store transition tuple (s, a, r, s′) in B
7:     Sample mini-batch of N transitions (s, a, r, s′) from B
8:     ã ← πφ′(s′) + ǫ, ǫ ∼ clip(N(0, σ), −c, c)
9:     y ← r + γ min_{i=1,2} Qθi′(s′, ã)
10:    Update critics: Qθi ← argmin_{θi} (1/N) Σ (y − Qθi(s, a))²
11:    if t mod d = 0 then
12:        Update φ by the deterministic policy gradient:
               ∇φ J(φ) = (1/N) Σ ∇a Qθ1(s, a)|a=πφ(s) ∇φ πφ(s)
13:        Update target networks:
               θi′ ← τ θi + (1 − τ) θi′
               φ′ ← τ φ + (1 − τ) φ′
14:    end if
15: end for

Based on Alg. 27, the TD3 algorithm operates as follows: First, the parameters of the actor network φ and the critic networks θ1 and θ2 are initialized (line 1). Target networks for both the actor and critics are also initialized (line 2). Multiple agents interact with their respective environments, collecting transitions (st, at, rt, st+1) which are stored in a replay buffer (line 6). The actor selects actions with added exploration noise (line 8). The critic networks are updated by minimizing the TD error using the clipped double Q-learning target (lines 9-10). The actor network is updated using the deterministic policy gradient, but only every d iterations (line 12). Periodically, the target networks are updated to slowly track the learned networks (line 13).

scenarios. However, there were notable limitations. The scalability of the proposed method to larger and more
implementation of TD3, while improving stability, in- complex satellite networks was one of the concerns.
troduced complexity in the training process, requiring While the results were promising in the tested scenarios,
careful tuning of hyperparameters to achieve optimal the ability to handle a broader range of interference
performance. The algorithm’s reliance on extensive com- patterns and a larger number of users remained uncertain.
putational resources for training might have limited its The increased number of states and potential interactions
practical applicability in scenarios with constrained re- could have introduced additional complexities, making it
sources. Additionally, the paper focused primarily on the challenging to maintain the same level of performance.
simulation results without providing sufficient real-world Further research was needed to explore the scalability
testing to validate the algorithm’s performance under of the approach and develop mechanisms to manage the
actual driving conditions. This gap raised questions about increased computational load.
the robustness of the proposed method when deployed A TD3-based method for optimizing Voltage and
in a real-world environment. Reactive (VAR) power in distribution networks with high
The application of the TD3 algorithm for the target penetration of Distributed Energy Resources (DERs)
tracking of UAVs was proposed in [259]. The authors such as battery energy storage and solar photovoltaic
integrated several enhancements into the TD3 frame- units was investigated in [261]. The authors’ approach
work to improve its performance in handling the high of coordinating the reactive power outputs of fast-
nonlinearity and dynamics of UAV control. A signif- responding smart inverters and the active power of
icant strength was the novel reward formulation that battery ESS enhanced the overall efficiency of the net-
incorporated exponential functions to limit the effects work. By carefully designing the reward function to
of velocity and acceleration on the policy function ap- ensure a proper voltage profile and effective schedul-
proximation. This approach prevented deformation in ing of reactive power outputs, the method optimized
the policy function, leading to more stable and robust both voltage regulation and power loss minimization.
learning outcomes. Additionally, the concept of multi- The results demonstrated that the TD3-based method
stage training, where the training process was divided outperformed traditional methods such as local droop
into stages focusing on position, velocity, and accelera- control and DDPG-based approaches, showing signifi-
tion sequentially, enhanced the learning efficiency and cant improvements in reducing voltage fluctuations and
performance of the UAV in tracking tasks. However, minimizing power loss in the IEEE 34- and 123-bus
the proposed method also had several limitations. The test systems. The scalability of the proposed method
integration of a PD controller and the novel reward to larger and more complex distribution networks was
formulation added to the complexity of the training another concern. While the results were promising in
process. The scalability of the proposed method to more the tested IEEE 34- and 123-bus systems, the ability
complex environments with a higher number of dynamic to handle a broader range of network configurations
obstacles or more sophisticated UAV maneuvers was and a larger number of DERs remained uncertain. The
another concern. While the results were promising in increased number of states and potential interactions
the tested scenarios, the ability to handle a broader range could have introduced additional complexities, making it
of operational conditions and larger numbers of UAVs challenging to maintain the same level of performance.
remained uncertain. The increased number of states and Further research was needed to explore the scalability
potential interactions could have introduced additional of the approach and develop mechanisms to manage the
complexities, making it challenging to maintain the same increased computational load.
level of performance. Authors in [262] presented an innovative approach to
A novel dynamic MsgA channel allocation strategy quadrotor control, leveraging the TD3 algorithm to ad-
using TD3 to mitigate the issue of MsgA channel colli- dress stabilization and position tracking tasks. This study
sions in Low Earth Orbit (LEO) satellite communication was notable for its application to handle the complex,
systems is investigated in [260]. The paper’s approach non-linear dynamics of quadrotor systems. The authors’
to dynamically pre-configuring the mapping relationship method of integrating target policy smoothing, twin
between PRACH occasions and PUSCH occasions based critic networks, and delayed updates of value networks
on historical access information was another strong enhanced the learning efficiency and reduced variance
point. This method allowed the system to adapt to in the policy updates. This comprehensive approach
changing access demands effectively, ensuring efficient ensured that the quadrotor could achieve precise control
use of available resources and reducing collision rates. in both stabilization and position tracking tasks. The
The empirical results were impressive, demonstrating a empirical results demonstrated the effectiveness of the
39.12% increase in access success probability, which TD3-based controllers, showcasing significant improve-
validated the effectiveness of the proposed strategy. The ments in achieving and maintaining target positions
70

under various initial conditions. The scalability of the TABLE XX: TD3 Papers Review
proposed method to more complex environments with Application Domain References
dynamic obstacles and more sophisticated maneuvers Energy and Power Management [258]
was another concern. While the results were promising Multi-agent Systems and [259], [263]
in the tested scenarios, the ability to handle a broader Autonomous UAVs
range of operational conditions and larger-scale imple- Network Resilience and [260]
Optimization
mentations remained uncertain. The increased number of Energy and Power Management [261]
states and potential interactions could have introduced Real-time Systems and Hardware [262], [264]
additional complexities, making it challenging to main- Implementations
tain the same level of performance. Further research was
needed to explore the scalability of the approach and
develop mechanisms to manage the increased computa- bustness and adaptability. The scalability of the proposed
tional load. method to more complex driving scenarios and larger-
A real-time charging navigation method for multi- scale implementations was another concern. While the
autonomous underwater vehicles (AUVs) systems using results were promising in the tested scenarios, the ability
the TD3 algorithm was designed in [263]. This method to handle a broader range of operational conditions
was designed to improve the efficiency of navigating and larger numbers of vehicles remained uncertain. The
AUVs to their respective charging stations by training increased number of states and potential interactions
a trajectory planning model in advance, eliminating could have introduced additional complexities, making it
the need for recalculating navigation paths for differ- challenging to maintain the same level of performance.
ent initial positions and avoiding dependence on sen- Further research was needed to explore the scalability
sor feedback or pre-arranged landmarks. The primary of the approach and develop mechanisms to manage
strength of this paper lies in its application of the the increased computational load. Additionally, the study
TD3 algorithm to the multi-AUV charging navigation assumed a specific structure for the neural networks
problem. By training the trajectory planning model in used in the actor and critic models. The performance
advance, the method significantly improved the real- of the algorithm could have been sensitive to the choice
time performance of multi-AUV navigation. However, of network architecture and hyperparameters. A more
the paper also had some limitations. One major limitation systematic exploration of different architectures and their
was the reliance on the accuracy of the AUV motion impact on performance could have provided deeper
model and the assumptions made during its formulation. insights into optimizing the TD3 algorithm for adaptive
For instance, the model assumed constant velocity and cruise control. Table XX gives a detailed summary of
neglected factors like water resistance and system delays, the discussed papers, and the domains of each paper.
which could have affected the real-world applicability of
the results. Moreover, the simulation environment used VI. D ISCUSSION
for training and testing might not have fully captured Throughout this survey, we examined various algo-
the complexities and variabilities of real underwater rithms in RL and their applications in a variety of
environments. Another potential limitation was the need domains, including but not limited to Robotics, ITS,
for extensive computational resources for training the Games, Wireless Networks, and many more. There is,
TD3 model, especially given the high number of training however, more to discover both in terms of the number of
rounds (up to 6000) and the large experience replay analyzed papers and in terms of the different algorithms.
buffer size. There are several algorithms and methods that were
Authors in [264] presented an advanced approach not analyzed in this survey for a variety of reasons.
for Adaptive Cruise Control (ACC) using the TD3 To begin with, the considered algorithms are those that
algorithm. This method addressed the complexities of have been applied to a variety of domains and are more
real-time decision-making and control in automotive widely used by researchers. In addition, time, resources,
applications. The authors carefully designed the reward and page limitations render it impossible to analyze all
function to consider the velocity error, control input, the algorithms and methods in one paper. Thirdly, un-
and additional terms to ensure stability and smooth derstanding these algorithms enables one to understand
driving behavior. This reward structure allowed the al- different variations being introduced by the community
gorithm to learn an optimal policy that maintained safe on a regular basis. The purpose of this survey is not to
distances between vehicles while adapting to changing identify which algorithm is better than the others, and as
traffic conditions. The empirical results demonstrated the we know, there is no one-fit-all solution to RL, so one
effectiveness of the TD3-based ACC system in both cannot state ”for problem X, algorithm Y performs better
normal and disturbance scenarios, highlighting its ro- than other algorithms” as it needs implementation of new
Table XX gives a detailed summary of the discussed papers and the domains of each paper.

TABLE XX: TD3 Papers Review
Application Domain | References
Energy and Power Management | [258]
Multi-agent Systems and Autonomous UAVs | [259], [263]
Network Resilience and Optimization | [260]
Energy and Power Management | [261]
Real-time Systems and Hardware Implementations | [262], [264]

VI. DISCUSSION

Throughout this survey, we examined various algorithms in RL and their applications in a variety of domains, including but not limited to Robotics, ITS, Games, Wireless Networks, and many more. There is, however, more to discover, both in terms of the number of analyzed papers and in terms of the different algorithms. There are several algorithms and methods that were not analyzed in this survey for a variety of reasons. To begin with, the considered algorithms are those that have been applied to a variety of domains and are more widely used by researchers. In addition, time, resources, and page limitations render it impossible to analyze all the algorithms and methods in one paper. Thirdly, understanding these algorithms enables one to understand the different variations being introduced by the community on a regular basis. The purpose of this survey is not to identify which algorithm is better than the others; as we know, there is no one-size-fits-all solution in RL, so one cannot state "for problem X, algorithm Y performs better than other algorithms," as doing so would require implementing new algorithms for the same problem and reproducing the original work. Also, results achieved with RL, and specifically DRL, may vary since different extrinsic and intrinsic factors change, as stated in [265], making it tough to compare and analyze.

Lastly, we strongly recommend reading the chapters listed in this paper one by one, reading the introductions to the algorithms, and, if necessary, consolidating the knowledge of each algorithm by reviewing the reference papers. As a result, you will be able to read the analysis of the papers that have used that particular algorithm. The provided tables are valuable to readers who are not interested in reading the entire article. By providing various tables at the end of the survey, we summarized helpful information gathered throughout the survey. We tried our best to shed light on RL, in terms of theory and applications, to give a thorough understanding of various broad categories of algorithms. This survey is a helpful resource for readers who would like to expand their knowledge of RL theory, as well as readers who desire to take a look at the applications of these algorithms in the literature.

In the final part of our survey, we present a comprehensive table that highlights the application of RL algorithms across various domains in Table XXI. Given the wide-ranging impact of RL in numerous fields, we have categorized these domains into broader categories to provide a more organized and concise overview. This categorization allows us to succinctly illustrate the relevance and utilization of specific RL algorithms in different research areas while effectively managing the limited space available.

TABLE XXI: Overview of Algorithms Across Different Application Domains
Application Domain | Algorithm(s) | References
Energy Efficiency and Power Management | MC, Q-learning, Double Q-learning, SARSA, Dyna-Q, AMF, DDQN, TRPO, Dueling DQN, PPO, A2C, TD3 | [27], [114], [121], [122], [28], [125], [130], [194], [192], [151], [160], [29], [165], [204], [219], [223], [239], [258], [261]
Cloud-based Systems | SARSA, A3C, A2C | [47], [50], [233], [237], [241], [238]
Optimization | TD-Learning, DQN, DDQN, Dueling DQN, DDPG | [25], [143], [26], [162], [250]
Multi-agent Systems | SARSA, MCTS, Prioritized Sweeping, Dyna-Q, DDQN, A3C, TD3 | [48], [49], [188], [199], [200], [157], [263]
Algorithmic RL | TD-Learning, MC, SARSA, Prioritized Sweeping | [80], [81], [82], [83], [86], [128], [132], [174], [176], [178], [185]
General RL | TD-Learning, Q-learning, DQN, Dueling DQN, TRPO, PPO | [89], [4], [102], [108], [141], [164], [217]
Robotics | MC, Q-learning, Dyna-Q, DDQN, Dueling DQN, REINFORCE, TRPO, DDPG | [14], [15], [16], [17], [23], [111], [113], [201], [153], [24], [155], [207], [218], [251], [252]
Financial Applications | Q-learning, A2C, DDPG | [40], [240], [254]
Games | TD-Learning, SARSA, MCTS, Dyna-Q, DDQN, DQN, A3C | [10], [11], [12], [13], [84], [90], [105], [116], [170], [173], [175], [177], [179], [180], [181], [182], [37], [190], [195], [145], [146], [148], [234]
Signal Processing | TD-Learning | [43]
Networks | TD-Learning, Q-learning, SARSA, Dyna-Q, Prioritized Sweeping, DQN, Dueling DQN, REINFORCE, A3C, A2C, DDPG, TD3 | [30], [112], [115], [129], [31], [198], [189], [142], [144], [163], [166], [206], [236], [238], [32], [253], [260]
Intelligent Transportation Systems (ITS) | Q-learning, SARSA, Prioritized Sweeping, Dyna-Q, DDQN, Dueling DQN, TRPO, PPO, A2C, DDPG | [18], [19], [20], [21], [22], [106], [110], [126], [131], [41], [159], [154], [156], [168], [196], [197], [167], [215], [42], [222], [224], [226], [235], [242], [243], [249], [255], [259]
Theoretical Research | Q-learning, TD-Learning, Prioritized Sweeping, REINFORCE, TRPO, DDPG | [91], [92], [95], [96], [97], [99], [5], [184], [187], [203], [213], [246]
Dynamic Environments | TD-Learning, Q-learning, Dyna-Q | [33], [93], [191], [34], [195]
Partially Observable Environments | TD-Learning, SARSA | [35], [36], [117]
Real-time Systems and Hardware Implementations | Q-learning, DQN, DDQN, PPO, TD3 | [103], [38], [109], [119], [124], [158], [225], [227], [39], [262], [264]
Benchmark Tasks | TD-Learning | [101], [44]
Data Management and Processing | Q-learning, DQN, PPO | [107], [45], [123], [46]

The Energy Efficiency and Power Management category encompasses research areas such as train control, IoTs, WBAN, PID controllers, and Smart Energy Systems, all of which focus on optimizing energy usage and improving power management. Cloud-based Systems includes works focused on cloud-based control and encryption systems, as well as edge computing environments, reflecting the growing importance of RL in managing and optimizing cloud resources.

The Optimization category captures studies that leverage RL for solving complex optimization problems across various applications. Multi-agent Systems is a category that emphasizes RL's role in enabling autonomous behaviors, covering research involving Shepherding, Virtual Agents, and other Multi-agent Systems.

Algorithmic RL covers advanced RL methodologies and hybrid approaches, including Renewal Theory, Rough Set Theory, Bayesian RL, and MPC tuning.
The General RL category encompasses broad RL applications, including policy learning, cybersecurity, and learning from raw experience. Robotics research, focusing on the application of RL in robotics, includes trajectory control, learning, routing, and more.

In the Financial Applications category, studies on portfolio re-balancing and other financial strategies using RL are included. The Games category features research on game strategies in Chess, StarCraft, Video Games, and Card Games, illustrating RL's success in complex strategic environments. Signal Processing research, which uses RL for signal processing and parameter estimation, is grouped under its own category.

The Networks category covers studies focused on network reliability, blocking probabilities, Optical Transport Networks, Fog RAN, and Network Resilience and Optimization. ITS includes RL applications in Railway Systems, EVs, Intelligent Traffic Signal Control, UAVs, and other transportation-related technologies.

The Theoretical Research category includes studies focused on the theoretical aspects of RL, such as convergence and stability. Dynamic Environments research involves RL in environments like mazes, the Mountain Car problem, and Atari games. Partially Observable Environments includes studies on predictions, POMDPs, and Swarm Intelligence in optimization problems.

Research on applying RL in FPGA, real-time systems, and other hardware implementations is grouped under Real-time Systems and Hardware Implementations. Studies using benchmark tasks like Mountain Car and Acrobot to test RL algorithms are included in the Benchmark Tasks category. Data Management and Processing involves research applying RL in data management and processing environments, such as Hadoop and Pathological Image Analysis. Finally, the Object Recognition category encompasses studies focusing on using RL for object recognition tasks.

Table XXI serves as a quick reference for researchers to identify relevant work in their specific area of interest, showcasing the diversity and adaptability of RL algorithms across various domains. The categorization helps streamline the information, making it easier to navigate and understand the various applications of RL.

VII. CONCLUSION

In this survey, we presented a comprehensive analysis of Reinforcement Learning (RL) algorithms, categorizing them into Value-based, Policy-based, and Actor-Critic methods. By reviewing numerous research papers, it highlighted the strengths, weaknesses, and applications of each algorithm, offering valuable insights into various domains. From classical approaches such as Q-learning to advanced Deep RL (DRL), along with algorithmic variations tailored to specific domains, the paper provided a comprehensive overview. Besides classifying RL algorithms according to Model-free/Model-based approaches, scalability, and sample efficiency, it also provided a practical guide for researchers and practitioners about the type(s) of algorithms used in various domains. Furthermore, this survey examined the practical implementation and performance of RL algorithms across several fields, including Games, Robotics, Autonomous systems, and many more. It also provided a balanced assessment of their usefulness.

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT Press, 2018.
[2] R. Bellman, “The theory of dynamic programming,” Bulletin of the American Mathematical Society, vol. 60, no. 6, pp. 503–515, 1954.
[3] M. Ghasemi, A. H. Moosavi, I. Sorkhoh, A. Agrawal, F. Alzhouri, and D. Ebrahimi, “An introduction to reinforcement learning: Fundamental concepts and practical applications,” arXiv preprint arXiv:2408.07712, 2024.
[4] R. S. Sutton, “Learning to predict by the methods of temporal differences,” Machine Learning, vol. 3, pp. 9–44, 1988.
[5] C. J. C. H. Watkins, “Learning from delayed rewards,” PhD thesis, King’s College, Cambridge, 1989.
[6] L. P. Kaelbling, M. L. Littman, and A. W. Moore, “Reinforcement learning: A survey,” Journal of Artificial Intelligence Research, vol. 4, pp. 237–285, 1996.
[7] P. Auer, N. Cesa-Bianchi, and P. Fischer, “Finite-time analysis of the multiarmed bandit problem,” Machine Learning, vol. 47, no. 2, pp. 235–256, 2002.
[8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[9] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., “Mastering the game of Go with deep neural networks and tree search,” Nature, vol. 529, no. 7587, pp. 484–489, 2016.
[10] K. Souchleris, G. K. Sidiropoulos, and G. A. Papakostas, “Reinforcement learning in game industry—review, prospects and challenges,” Applied Sciences, vol. 13, no. 4, p. 2443, 2023.
[11] S. Koyamada, S. Okano, S. Nishimori, Y. Murata, K. Habara, H. Kita, and S. Ishii, “Pgx: Hardware-accelerated parallel game simulators for reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
[12] Z. Xu, C. Yu, F. Fang, Y. Wang, and Y. Wu, “Language agents with reinforcement learning for strategic play in the werewolf game,” arXiv preprint arXiv:2310.18940, 2023.
[13] X. Qu, W. Gan, D. Song, and L. Zhou, “Pursuit-evasion game strategy of USV based on deep reinforcement learning in complex multi-obstacle environment,” Ocean Engineering, vol. 273, p. 114016, 2023.
[14] K. Rana, M. Xu, B. Tidd, M. Milford, and N. Sünderhauf, “Residual skill policies: Learning an adaptable skill-based action space for reinforcement learning for robotics,” in Conference on Robot Learning. PMLR, 2023, pp. 2095–2104.
[15] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, vol. 32, no. 11, pp. 1238–1274, 2013.
[16] S. Balasubramanian, “Intrinsically motivated multi-goal reinforcement learning using robotics environment integrated with OpenAI Gym,” Journal of Science & Technology, vol. 4, no. 5, pp. 46–60, 2023.
algorithmic variations tailored to specific domains, the pp. 46–60, 2023.

[17] S. W. Abeyruwan, L. Graesser, D. B. D’Ambrosio, A. Singh, Technology Conference:(VTC2022-Spring). IEEE, 2022, pp. 1–
A. Shankar, A. Bewley, D. Jain, K. M. Choromanski, and P. R. 6.
Sanketi, “i-sim2real: Reinforcement learning of robotic policies [33] K. De Asis and R. S. Sutton, “Per-decision multi-step tem-
in tight human-robot interaction loops,” in Conference on Robot poral difference learning with control variates,” arXiv preprint
Learning. PMLR, 2023, pp. 212–224. arXiv:1807.01830, 2018.
[18] R. Zhu, L. Li, S. Wu, P. Lv, Y. Li, and M. Xu, “Multi-agent [34] X. Li, C. Yang, J. Song, S. Feng, W. Li, and H. He, “A motion
broad reinforcement learning for intelligent traffic light control,” control method for agent based on dyna-q algorithm,” in 2023
Information Sciences, vol. 619, pp. 509–525, 2023. 4th International Conference on Computer Engineering and
[19] M. Yazdani, M. Sarvi, S. A. Bagloee, N. Nassir, J. Price, Application (ICCEA). IEEE, 2023, pp. 274–278.
and H. Parineh, “Intelligent vehicle pedestrian light (ivpl): A [35] R. S. Sutton and B. Tanner, “Temporal-difference networks,”
deep reinforcement learning approach for traffic signal control,” Advances in neural information processing systems, vol. 17,
Transportation research part C: emerging technologies, vol. 2004.
149, p. 103991, 2023. [36] J. Zuters, “Realizing undelayed n-step td prediction with neural
[20] Y. Liu, L. Huo, J. Wu, and A. K. Bashir, “Swarm learning- networks,” in Melecon 2010-2010 15th IEEE Mediterranean
based dynamic optimal management for traffic congestion in Electrotechnical Conference. IEEE, 2010, pp. 102–106.
6g-driven intelligent transportation system,” IEEE Transactions [37] T. M. Moerland, J. Broekens, A. Plaat, and C. M. Jonker,
on Intelligent Transportation Systems, vol. 24, no. 7, pp. 7831– “Monte carlo tree search for asymmetric trees,” arXiv preprint
7846, 2023. arXiv:1805.09218, 2018.
[21] D. Chen, M. R. Hajidavalloo, Z. Li, K. Chen, Y. Wang, L. Jiang, [38] M. A. Nasreen and S. Ravindran, “An overview of q-learning
and Y. Wang, “Deep multi-agent reinforcement learning for based energy efficient power allocation in wban (q-eepa),” in
highway on-ramp merging in mixed traffic,” IEEE Transactions 2022 2nd International Conference on Innovative Sustainable
on Intelligent Transportation Systems, vol. 24, no. 11, pp. Computational Technologies (CISCT). IEEE, 2022, pp. 1–5.
11 623–11 638, 2023. [39] G. C. Lopes, M. Ferreira, A. da Silva Simões, and E. L. Colom-
[22] P. Ghosh, T. E. A. de Oliveira, F. Alzhouri, and D. Ebrahimi, bini, “Intelligent control of a quadrotor with proximal policy
“Maximizing group-based vehicle communications and fairness: optimization reinforcement learning,” in 2018 Latin American
A reinforcement learning approach,” in 2024 IEEE Wireless Robotic Symposium, 2018 Brazilian Symposium on Robotics
Communications and Networking Conference (WCNC). IEEE, (SBR) and 2018 Workshop on Robotics in Education (WRE).
2024, pp. 1–7. IEEE, 2018, pp. 503–508.
[23] V. T. Aghaei, A. Ağababaoğlu, S. Yıldırım, and A. Onat, “A [40] N. Darapaneni, A. Basu, S. Savla, R. Gururajan, N. Saquib,
real-world application of markov chain monte carlo method S. Singhavi, A. Kale, P. Bid, and A. R. Paduri, “Automated port-
for bayesian trajectory control of a robotic manipulator,” ISA folio rebalancing using q-learning,” in 2020 11th IEEE Annual
transactions, vol. 125, pp. 580–590, 2022. Ubiquitous Computing, Electronics & Mobile Communication
[24] Y. Yu, Y. Liu, J. Wang, N. Noguchi, and Y. He, “Obstacle Conference (UEMCON). IEEE, 2020, pp. 0596–0602.
avoidance method based on double dqn for agricultural robots,” [41] R. M. Desai and B. Patil, “Prioritized sweeping reinforcement
Computers and Electronics in Agriculture, vol. 204, p. 107546, learning based routing for manets,” Indonesian Journal of
2023. Electrical Engineering and Computer Science, vol. 5, no. 2,
[25] M. de Koning, W. Cai, B. Sadigh, T. Oppelstrup, M. H. Kalos, pp. 383–390, 2017.
and V. V. Bulatov, “Adaptive importance sampling monte carlo [42] J. Santoso et al., “Multiagent simulation on hide and seek
simulation of rare transition events,” The Journal of chemical games using policy gradient trust region policy optimization,”
physics, vol. 122, no. 7, 2005. in 2020 7th International Conference on Advance Informatics:
[26] R. Li, W. Gong, L. Wang, C. Lu, Z. Pan, and X. Zhuang, “Dou- Concepts, Theory and Applications (ICAICTA). IEEE, 2020,
ble dqn-based coevolution for green distributed heterogeneous pp. 1–5.
hybrid flowshop scheduling with multiple priorities of jobs,” [43] S. Saha and S. M. Kay, “Maximum likelihood parameter esti-
IEEE Transactions on Automation Science and Engineering, mation of superimposed chirps using monte carlo importance
2023. sampling,” IEEE Transactions on Signal Processing, vol. 50,
[27] W. Liu, T. Tang, S. Su, Y. Cao, F. Bao, and J. Gao, “An intelli- no. 2, pp. 224–230, 2002.
gent train control approach based on the monte carlo reinforce- [44] L. Wu and K. Chen, “Bias resilient multi-step off-
ment learning algorithm,” in 2018 21st International Conference policy goal-conditioned reinforcement learning,” arXiv preprint
on Intelligent Transportation Systems (ITSC). IEEE, 2018, pp. arXiv:2311.17565, 2023.
1944–1949. [45] M. Yao, X. Gao, J. Wang, and M. Wang, “Improving nuclei
[28] A. Jaiswal, S. Kumar, and U. Dohare, “Green computing in segmentation in pathological image via reinforcement learning,”
heterogeneous internet of things: Optimizing energy allocation in 2022 International Conference on Machine Learning, Cloud
using sarsa-based reinforcement learning,” in 2020 IEEE 17th Computing and Intelligent Mining (MLCCIM). IEEE, 2022,
India Council International Conference (INDICON). IEEE, pp. 290–295.
2020, pp. 1–6. [46] L. Zhang, Y. Zhang, X. Zhao, and Z. Zou, “Image captioning
[29] Z. Xuan, G. Wei, and Z. Ni, “Power allocation in multi-agent via proximal policy optimization,” Image and Vision Computing,
networks via dueling dqn approach,” in 2021 IEEE 6th Inter- vol. 108, p. 104126, 2021.
national Conference on Signal and Image Processing (ICSIP). [47] J. Suh and T. Tanaka, “Sarsa (0) reinforcement learning over
IEEE, 2021, pp. 959–963. fully homomorphic encryption,” in 2021 SICE International
[30] P. Lassila, J. Karvo, and J. Virtamo, “Efficient importance Symposium on Control Systems (SICE ISCS). IEEE, 2021,
sampling for monte carlo simulation of multicast networks,” pp. 1–7.
in Proceedings IEEE INFOCOM 2001. Conference on Com- [48] C. K. Go, B. Lao, J. Yoshimoto, and K. Ikeda, “A reinforcement
puter Communications. Twentieth Annual Joint Conference of learning approach to the shepherding task using sarsa,” in 2016
the IEEE Computer and Communications Society (Cat. No. International Joint Conference on Neural Networks (IJCNN).
01CH37213), vol. 1. IEEE, 2001, pp. 432–439. IEEE, 2016, pp. 3833–3836.
[31] E. Oh and H. Wang, “Reinforcement-learning-based energy [49] N. Zerbel and L. Yliniemi, “Multiagent monte carlo tree search,”
storage system operation strategies to manage wind power in Proceedings of the 18th International Conference on Au-
forecast uncertainty,” IEEE Access, vol. 8, pp. 20 965–20 976, tonomous Agents and MultiAgent Systems, 2019, pp. 2309–
2020. 2311.
[32] H. Kwon, “Learning-based power delay profile estimation for 5g [50] S. Mangalampalli, G. R. Karri, S. N. Mohanty, S. Ali, M. I.
nr via advantage actor-critic (a2c),” in 2022 IEEE 95th Vehicular Khan, S. Abdullaev, and S. A. AlQahtani, “Multi-objective pri-
74

oritized task scheduler using improved asynchronous advantage [73] R. S. Sutton, “Integrated architectures for learning, planning,
actor critic (a3c) algorithm in multi cloud environment,” IEEE and reacting based on approximating dynamic programming,”
Access, 2024. in Machine learning proceedings 1990. Elsevier, 1990, pp.
[51] Y. Li, “Deep reinforcement learning: An overview,” arXiv 216–224.
preprint arXiv:1701.07274, 2017. [74] S. Ishii, W. Yoshida, and J. Yoshimoto, “Control of exploitation–
[52] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. exploration meta-parameter in reinforcement learning,” Neural
Bharath, “Deep reinforcement learning: A brief survey,” IEEE networks, vol. 15, no. 4-6, pp. 665–687, 2002.
Signal Processing Magazine, vol. 34, no. 6, pp. 26–38, 2017. [75] L. Schäfer, F. Christianos, J. Hanna, and S. V. Albrecht, “Decou-
[53] X. Wang, S. Wang, X. Liang, D. Zhao, J. Huang, X. Xu, pling exploration and exploitation in reinforcement learning,” in
B. Dai, and Q. Miao, “Deep reinforcement learning: A survey,” ICML 2021 Workshop on Unsupervised Reinforcement Learn-
IEEE Transactions on Neural Networks and Learning Systems, ing, 2021.
vol. 35, no. 4, pp. 5064–5078, 2022. [76] H. Wang, T. Zariphopoulou, and X. Zhou, “Exploration versus
[54] H.-n. Wang, N. Liu, Y.-y. Zhang, D.-w. Feng, F. Huang, D.-s. exploitation in reinforcement learning: A stochastic control
Li, and Y.-m. Zhang, “Deep reinforcement learning: a survey,” approach,” arXiv preprint arXiv:1812.01552, 2018.
Frontiers of Information Technology & Electronic Engineering, [77] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel, “Value
vol. 21, no. 12, pp. 1726–1744, 2020. iteration networks,” Advances in neural information processing
[55] T. M. Moerland, J. Broekens, A. Plaat, C. M. Jonker et al., systems, vol. 29, 2016.
“Model-based reinforcement learning: A survey,” Foundations [78] R. Fonteneau, S. Murphy, L. Wehenkel, and D. Ernst, “Model-
and Trends® in Machine Learning, vol. 16, no. 1, pp. 1–118, free monte carlo-like policy evaluation,” in Proceedings of the
2023. Thirteenth International Conference on Artificial Intelligence
[56] A. S. Polydoros and L. Nalpantidis, “Survey of model-based and Statistics. JMLR Workshop and Conference Proceedings,
reinforcement learning: Applications on robotics,” Journal of 2010, pp. 217–224.
Intelligent & Robotic Systems, vol. 86, no. 2, pp. 153–173, 2017. [79] S. T. Tokdar and R. E. Kass, “Importance sampling: a re-
[57] F.-M. Luo, T. Xu, H. Lai, X.-H. Chen, W. Zhang, and Y. Yu, “A view,” Wiley Interdisciplinary Reviews: Computational Statis-
survey on model-based reinforcement learning,” Science China tics, vol. 2, no. 1, pp. 54–60, 2010.
Information Sciences, vol. 67, no. 2, p. 121101, 2024. [80] J. Subramanian and A. Mahajan, “Renewal monte carlo: Re-
[58] Y. Sato, “Model-free reinforcement learning for financial port- newal theory-based reinforcement learning,” IEEE Transactions
folios: a brief survey,” arXiv preprint arXiv:1904.04973, 2019. on Automatic Control, vol. 65, no. 8, pp. 3663–3670, 2019.
[59] J. Ramı́rez, W. Yu, and A. Perrusquı́a, “Model-free reinforce- [81] J. F. Peters, D. Lockery, and S. Ramanna, “Monte carlo
ment learning from expert demonstrations: a survey,” Artificial off-policy reinforcement learning: A rough set approach,” in
Intelligence Review, vol. 55, no. 4, pp. 3213–3241, 2022. Fifth International Conference on Hybrid Intelligent Systems
[60] J. Eschmann, “Reward function design in reinforcement learn- (HIS’05). IEEE, 2005, pp. 6–pp.
ing,” Reinforcement Learning Algorithms: Analysis and Appli- [82] Y. Wang, K. S. Won, D. Hsu, and W. S. Lee, “Monte
cations, pp. 25–33, 2021. carlo bayesian reinforcement learning,” arXiv preprint
[61] R. S. Sutton, A. G. Barto et al., “Reinforcement learning,” arXiv:1206.6449, 2012.
Journal of Cognitive Neuroscience, vol. 11, no. 1, pp. 126–134, [83] B. Wu and Y. Feng, “Monte-carlo bayesian reinforcement
1999. learning using a compact factored representation,” in 2017 4th
[62] M. L. Puterman, “Chapter 8 markov decision International Conference on Information Science and Control
processes,” in Stochastic Models, ser. Handbooks Engineering (ICISCE). IEEE, 2017, pp. 466–469.
in Operations Research and Management Science. [84] O. Baykal and F. N. Alpaslan, “Reinforcement learning in card
Elsevier, 1990, vol. 2, pp. 331–434. [Online]. Available: game environments using monte carlo methods and artificial
https://www.sciencedirect.com/science/article/pii/S0927050705801720 neural networks,” in 2019 4th International Conference on
[63] E. Zanini, “Markov decision processes,” 2014. Computer Science and Engineering (UBMK). IEEE, 2019,
[64] Z. Wei, J. Xu, Y. Lan, J. Guo, and X. Cheng, “Reinforcement pp. 1–6.
learning to rank with markov decision process,” in Proceedings [85] D. Siegmund, “Importance sampling in the monte carlo study of
of the 40th international ACM SIGIR conference on research sequential tests,” The Annals of Statistics, pp. 673–684, 1976.
and development in information retrieval, 2017, pp. 945–948. [86] S. Bulteau and M. El Khadiri, “A new importance sampling
[65] D. J. Foster and A. Rakhlin, “Foundations of reinforce- monte carlo method for a flow network reliability problem,”
ment learning and interactive decision making,” arXiv preprint Naval Research Logistics (NRL), vol. 49, no. 2, pp. 204–228,
arXiv:2312.16730, 2023. 2002.
[66] G. A. Rummery and M. Niranjan, “On-line q-learning using [87] C. Wang, S. Yuan, K. Shao, and K. Ross, “On the convergence
connectionist systems,” University of Cambridge, Department of the monte carlo exploring starts algorithm for reinforcement
of Engineering Cambridge, Tech. Rep., 1994. learning,” arXiv preprint arXiv:2002.03585, 2020.
[67] M. A. Wiering and M. Van Otterlo, “Reinforcement learning,” [88] J. F. Peters and C. Henry, “Approximation spaces in off-policy
Adaptation, learning, and optimization, vol. 12, no. 3, p. 729, monte carlo learning,” Engineering applications of artificial
2012. intelligence, vol. 20, no. 5, pp. 667–675, 2007.
[68] D. Ernst and A. Louette, “Introduction to reinforcement learn- [89] A. Altahhan, “Td (0)-replay: An efficient model-free planning
ing,” 2024. with full replay,” in 2018 International Joint Conference on
[69] J. X. Wang, Z. Kurth-Nelson, D. Tirumala, H. Soyer, J. Z. Leibo, Neural Networks (IJCNN). IEEE, 2018, pp. 1–7.
R. Munos, C. Blundell, D. Kumaran, and M. Botvinick, “Learn- [90] J. Baxter, A. Tridgell, and L. Weaver, “Knightcap: a chess
ing to reinforcement learn,” arXiv preprint arXiv:1611.05763, program that learns by combining td (lambda) with game-tree
2016. search,” arXiv preprint cs/9901002, 1999.
[70] Z. Ding, Y. Huang, H. Yuan, and H. Dong, “Introduction to [91] P. Dayan, “The convergence of td (λ) for general λ,” Machine
reinforcement learning,” Deep reinforcement learning: funda- learning, vol. 8, pp. 341–362, 1992.
mentals, research and applications, pp. 47–123, 2020. [92] P. Dayan and T. J. Sejnowski, “Td (λ) converges with proba-
[71] D. A. White and D. A. Sofge, “The role of exploration in bility 1,” Machine Learning, vol. 14, pp. 295–301, 1994.
learning control,” Handbook of Intelligent Control: Neural, [93] M. A. Wiering and H. Van Hasselt, “Two novel on-policy
Fuzzy and Adaptive Approaches, pp. 1–27, 1992. reinforcement learning algorithms based on td (λ)-methods,” in
[72] M. Kearns and S. Singh, “Near-optimal performance for re- 2007 IEEE International Symposium on Approximate Dynamic
inforcement learning in polynomial time,” URL: http://www. Programming and Reinforcement Learning. IEEE, 2007, pp.
research. att. com/˜ mkearns, 1998. 280–287.
75

[94] K. De Asis, J. Hernandez-Garcia, G. Holland, and R. Sutton, [114] T. Paterova, M. Prauzek, and J. Konecny, “Robustness analysis
“Multi-step reinforcement learning: A unifying algorithm,” in of data-driven self-learning controllers for iot environmental
Proceedings of the AAAI conference on artificial intelligence, monitoring nodes based on q-learning approaches,” in 2022
vol. 32, no. 1, 2018. IEEE Symposium Series on Computational Intelligence (SSCI).
[95] Z. Chen, S. T. Maguluri, S. Shakkottai, and K. Shanmugam, IEEE, 2022, pp. 721–727.
“A lyapunov theory for finite-sample guarantees of asyn- [115] A. Nassar and Y. Yilmaz, “Reinforcement learning for adaptive
chronous q-learning and td-learning variants,” arXiv preprint resource allocation in fog ran for iot with heterogeneous latency
arXiv:2102.01567, 2021. requirements,” IEEE Access, vol. 7, pp. 128 014–128 025, 2019.
[96] D. Lee, “Analysis of off-policy multi-step td-learning with lin- [116] S. Wender and I. Watson, “Applying reinforcement learning to
ear function approximation,” arXiv preprint arXiv:2402.15781, small scale combat in the real-time strategy game starcraft:
2024. Broodwar,” in 2012 ieee conference on computational intelli-
[97] K. De Asis, “A unified view of multi-step temporal difference gence and games (cig). IEEE, 2012, pp. 402–408.
learning,” 2018. [117] H. Iima and Y. Kuroe, “Swarm reinforcement learning algo-
[98] C. Szepesvári, Algorithms for reinforcement learning. Springer rithms based on sarsa method,” in 2008 SICE Annual Confer-
nature, 2022. ence. IEEE, 2008, pp. 2045–2049.
[99] Y. Wang and X. Tan, “Greedy multi-step off-policy reinforce- [118] H. Hasselt, “Double q-learning,” Advances in neural informa-
ment learning,” 2020. tion processing systems, vol. 23, 2010.
[100] A. R. Mahmood, H. Yu, and R. S. Sutton, “Multi-step off-policy [119] M. Ben-Akka, C. Tanougast, C. Diou, and A. Chaddad, “An
learning without importance sampling ratios,” arXiv preprint efficient hardware implementation of the double q-learning
arXiv:1702.03006, 2017. algorithm,” in 2023 3rd International Conference on Electri-
[101] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning, cal, Computer, Communications and Mechatronics Engineering
vol. 8, pp. 279–292, 1992. (ICECCME). IEEE, 2023, pp. 1–6.
[102] Y. Bi, A. Thomas-Mitchell, W. Zhai, and N. Khan, “A com- [120] F. Jamshidi, L. Zhang, and F. Nezhadalinaei, “Autonomous
parative study of deterministic and stochastic policies for q- driving systems: Developing an approach based on a* and
learning,” in 2023 4th International Conference on Artificial double q-learning,” in 2021 7th International Conference on
Intelligence, Robotics and Control (AIRC). IEEE, 2023, pp. Web Research (ICWR). IEEE, 2021, pp. 82–85.
1–5.
[121] T. Paterova, M. Prauzek, and J. Konecny, “Data-driven self-
[103] S. Spano, G. C. Cardarilli, L. Di Nunzio, R. Fazzolari, D. Gi- learning controller design approach for power-aware iot devices
ardino, M. Matta, A. Nannarelli, and M. Re, “An efficient based on double q-learning strategy,” in 2021 IEEE Symposium
hardware implementation of reinforcement learning: The q- Series on Computational Intelligence (SSCI). IEEE, 2021, pp.
learning algorithm,” Ieee Access, vol. 7, pp. 186 340–186 351, 01–07.
2019.
[122] J. Fan, T. Yang, J. Zhao, Z. Cui, J. Ning, and P. Wang, “Double
[104] D. Huang, H. Zhu, X. Lin, and L. Wang, “Application of mas-
q learning multi-agent routing method for maritime search
sive parallel computation based q-learning in system control,”
and rescue,” in 2023 International Conference on Ubiquitous
in 2022 5th International Conference on Pattern Recognition
Communication (Ucom). IEEE, 2023, pp. 367–372.
and Artificial Intelligence (PRAI). IEEE, 2022, pp. 1–5.
[123] J. Konecny, M. Prauzek, and T. Paterova, “Double q-learning
[105] M. Daswani, P. Sunehag, and M. Hutter, “Q-learning for history-
adaptive wavelet compression method for data transmission at
based reinforcement learning,” in Asian Conference on Machine
Learning. PMLR, 2013, pp. 213–228. environmental monitoring stations,” in 2022 IEEE Symposium
Series on Computational Intelligence (SSCI). IEEE, 2022, pp.
[106] B. Shou, H. Zhang, Z. Long, Y. Xie, K. Zhang, and Q. Gu,
567–572.
“Design and applications of q-learning adaptive pid algorithm
for maglev train levitation control system,” in 2023 35th Chinese [124] H. Huang, M. Lin, and Q. Zhang, “Double-q learning-based dvfs
Control and Decision Conference (CCDC). IEEE, 2023, pp. for multi-core real-time systems,” in 2017 IEEE International
1947–1953. Conference on Internet of Things (iThings) and IEEE Green
[107] G. Akshay, N. S. Naik, and J. Vardhan, “Enhancing hadoop Computing and Communications (GreenCom) and IEEE Cyber,
performance with q-learning for optimal parameter tuning,” in Physical and Social Computing (CPSCom) and IEEE Smart
TENCON 2023-2023 IEEE Region 10 Conference (TENCON). Data (SmartData). IEEE, 2017, pp. 522–529.
IEEE, 2023, pp. 617–622. [125] D. Wang, B. Liu, H. Jia, Z. Zhang, J. Chen, and D. Huang,
[108] C. Jin, Z. Allen-Zhu, S. Bubeck, and M. I. Jordan, “Is q-learning “Peer-to-peer electricity transaction decisions of the user-side
provably efficient?” Advances in neural information processing smart energy system based on the sarsa reinforcement learning,”
systems, vol. 31, 2018. CSEE Journal of Power and Energy Systems, vol. 8, no. 3, pp.
[109] L. M. Da Silva, M. F. Torquato, and M. A. Fernandes, “Parallel 826–837, 2020.
implementation of reinforcement learning q-learning technique [126] T. M. Aljohani and O. Mohammed, “A real-time energy con-
for fpga,” IEEE Access, vol. 7, pp. 2782–2798, 2018. sumption minimization framework for electric vehicles routing
[110] S. Wang and L. Zhang, “Q-learning based handover algorithm optimization based on sarsa reinforcement learning,” Vehicles,
for high-speed rail wireless communications,” in 2023 IEEE vol. 4, no. 4, pp. 1176–1194, 2022.
Wireless Communications and Networking Conference (WCNC). [127] H. Van Seijen, H. Van Hasselt, S. Whiteson, and M. Wiering,
IEEE, 2023, pp. 1–6. “A theoretical and empirical analysis of expected sarsa,” in
[111] S. Wang, “Taxi scheduling research based on q-learning,” in 2009 ieee symposium on adaptive dynamic programming and
2021 3rd International Conference on Machine Learning, Big reinforcement learning. IEEE, 2009, pp. 177–184.
Data and Business Intelligence (MLBDBI). IEEE, 2021, pp. [128] H. Moradimaryamnegari, M. Frego, and A. Peer, “Model predic-
700–703. tive control-based reinforcement learning using expected sarsa,”
[112] W. Xiao, J. Chen, X. Li, M. Wang, D. Huang, and D. Zhang, IEEE Access, vol. 10, pp. 81 177–81 191, 2022.
“Random walk routing algorithm based on q-learning in optical [129] I. A. M. Gonzalez and V. Turau, “Comparison of wifi in-
transport network,” in 2022 18th International Conference on terference mitigation strategies in dsme networks: Leveraging
Computational Intelligence and Security (CIS). IEEE, 2022, reinforcement learning with expected sarsa,” in 2023 IEEE
pp. 88–92. International Mediterranean Conference on Communications
[113] X. Qu and M. Yao, “Visual novelty based internally motivated and Networking (MeditCom). IEEE, 2023, pp. 270–275.
q-learning for mobile robot scene learning and recognition,” in [130] R. Muduli, D. Jena, and T. Moger, “Application of expected
2011 4th International Congress on Image and Signal Process- sarsa-learning for load frequency control of multi-area power
ing, vol. 3. IEEE, 2011, pp. 1461–1466. system,” in 2023 5th International Conference on Energy, Power
76
