
Understanding Reinforcement Learning Algorithms: The Progress from Basic Q-learning to Proximal Policy Optimization


Mohamed-Amine Chadi a,*, Hajar Mousannif a
a Laboratory of Informatics Systems Engineering, Department of Computer Science, Faculty of Sciences Semlalia, University of Cadi Ayyad,
Marrakech 40000, Morocco

* Correspondence: [email protected]

Abstract

This paper presents a review of the field of reinforcement learning (RL), with a focus on providing a comprehensive
overview of the key concepts, techniques, and algorithms for beginners. RL has a unique setting, jargon, and mathematics
that can be intimidating for those new to the field or artificial intelligence more broadly. While many papers review RL in
the context of specific applications, such as games, healthcare, finance, or robotics, these papers can be difficult for
beginners to follow due to the inclusion of non-RL-related work and the use of algorithms customized to those specific
applications. To address these challenges, this paper provides a clear and concise overview of the fundamental principles of
RL and covers the different types of RL algorithms. For each algorithm/method, we outline the main motivation behind its
development, its inner workings, and its limitations. The presentation of the paper is aligned with the historical progress of
the field, from the late-1980s Q-learning algorithm to the current state-of-the-art algorithms such as TD3, PPO, and offline
RL. Overall, this paper aims to serve as a valuable resource for beginners looking to construct a solid understanding of the
fundamentals of RL and be aware of the historical progress of the field. It is intended to be a go-to reference for those
interested in learning about RL without being distracted by the details of specific applications.

Keywords: Reinforcement learning, Deep reinforcement learning, Brief review.

1. Introduction
Reinforcement learning (RL) is a branch of machine learning that focuses on training agents to make
decisions in an environment to maximize a reward signal [1]. Despite its growing popularity and success in a
variety of applications, such as games [2–4], healthcare [5–7], finance [8], and robotics [9–11], RL can be a
challenging field for beginners to understand due to its unique setting, jargon, and seemingly more complex
mathematics compared to other areas of artificial intelligence. Moreover, many of the existing review papers on
RL are application-focused, like the ones presented previously, as well as others [12–14]. This can be distracting
for readers who are primarily interested in understanding the algorithms themselves. The only works we know of
that focus solely on the RL algorithms are [15–17]; however, given when they were published, they do not
include many of the classic algorithms that appeared afterward.
To address these challenges, this paper presents a review of the field of RL with a focus on providing a
comprehensive overview of the key concepts, techniques, and algorithms for beginners. Unlike other review
papers that are application-focused, this paper aims to provide a clear and concise overview of the fundamental
principles of RL, without the distractions of specific applications. It also discusses the motivation, inner
workings, and limitations of each RL algorithm. In addition, the paper traces the historical progress of the field,
from the late-1980s Q-learning algorithm [18] to the current state-of-the-art algorithms, namely, deep Q-
learning (DQN) [19], REINFORCE [20], deep deterministic policy gradient (DDPG) [21], twin delayed DDPG
(TD3) [22], and proximal policy optimization (PPO) [23].
Overall, this paper aims to serve as a valuable resource for beginners looking to construct a solid
understanding of the fundamentals of RL and be aware of the historical progress of the field. By providing a
clear and concise introduction to the key concepts and techniques of RL, as well as a survey of the main
algorithms that have been developed over the years, this paper aims to help readers overcome the intimidation
and difficulties often encountered when learning about RL.
The rest of this paper is organized as follows. In the next section, we provide a brief overview of the general
pipeline of an RL setting as well as the key concepts and vocabulary. We then cover the different types of RL
algorithms, including value-based, and policy-based methods. In the following section, we trace the historical
progress of the field, starting with the early Q-learning algorithm until current state-of-the-art algorithms such
as TD3 and PPO.
2. Common background
1. The general pipeline

Figure 1: The general framework in RL

In the RL pipeline, as illustrated in Figure 1, the agent receives observations from the environment and takes
actions based on these observations. The environment responds to the actions by providing a reward signal and
transitioning to a new state. The agent uses this feedback to update its policy, which is a mapping from
observations to actions. The goal of the agent is to learn a policy that maximizes the long-term return, which is
the sum of the rewards received over time.
To learn a good policy, the agent must explore the environment and try different actions to learn which
actions lead to higher rewards. At the same time, the agent must also exploit its current knowledge by taking
actions that are expected to lead to the highest reward based on its current policy. This balance between
exploration and exploitation is known as the exploration-exploitation trade-off and is a key challenge in RL.
Overall, the general pipeline of RL involves an agent interacting with an environment, receiving feedback in
the form of a reward signal, and updating its policy based on this feedback to maximize the long-term reward.
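
To make this loop concrete, here is a minimal Python sketch of the interaction cycle described above; the one-dimensional corridor environment and the purely random placeholder policy are hypothetical, invented only for illustration.

```python
import random

class ToyEnvironment:
    """A hypothetical 1-D corridor: the agent starts at cell 0 and must reach cell 4."""
    def reset(self):
        self.position = 0
        return self.position  # initial state

    def step(self, action):
        # action: 0 = move left, 1 = move right
        self.position = max(0, min(4, self.position + (1 if action == 1 else -1)))
        reward = 1.0 if self.position == 4 else 0.0   # reward signal from the environment
        done = self.position == 4                     # episode terminates at the goal
        return self.position, reward, done

env = ToyEnvironment()
state = env.reset()
episode_return = 0.0
done = False
while not done:
    action = random.choice([0, 1])          # placeholder policy: act randomly (pure exploration)
    state, reward, done = env.step(action)  # environment returns the new state and the reward
    episode_return += reward                # accumulate the return
print("Return of the episode:", episode_return)
```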
2. Technical terms
Although, among the three types of learning (supervised, unsupervised, and reinforcement learning), RL is the
most natural and closest to how humans learn, it is particular in the sense that it has a great deal of unique jargon and
vocabulary. In this sub-section, we aim to elucidate the main components of the RL setting.
As illustrated in Figure 1, the RL setting is composed of an agent (a learning-based model) that is interacting
with an environment (e.g., a simulation such as a game, a system such as an autonomous vehicle or drone,
a process such as a scheduling program, or even life itself). The environment receives the agent’s action
and changes its state accordingly, and this state is generally associated with a specific reward. For example, in
an autonomous driving problem, when the car is in a safe and legal position, that is a state which the agent will
receive a positive reward for. If the car is in an illegal position or in an accident, on the other hand, the agent
will receive a negative reward, also called punishment. The sequence of actions executed by the agent given the
corresponding environment’s states is referred to as the policy.
All RL and deep RL algorithms must ensure an optimal balance between exploration and exploitation, which
is a concept that is particularly associated with RL. It states that agents should explore the environment more in
the first learning episodes, and progressively converge towards exploiting the learned knowledge that enables
making optimal decisions. This is important because, if the agent starts by exploiting a learned policy
from the beginning, it might miss other, better policies; thus, exploration is necessary.
Mathematically, these components are all gathered in one framework known as the Markov decision process
(MDP) [1], defined by the tuple {S, A, T, R, γ}, where S is the state space, A is the action space, T is the transition function
describing the dynamics of the environment, R is the reward function, and γ is a discount factor. The discount factor determines
how much the RL agent cares about rewards in the immediate future compared to those in the distant future.
This idea is borrowed from economics, where a certain amount of money is generally worth more now than in
the distant future; the discount factor also helps with mathematical convergence. Solving this MDP refers to
finding an optimal policy (π*), that is, the one yielding the maximum return over an entire learning episode.
Following the process depicted in Figure 1, the agent is first presented with an initial state of the environment
$s_t = s_0$; the reward associated with the initial state is usually considered null, i.e., $r_t = r_0 = 0$. The agent generates
an action $a_t$ given the state $s_t$. This action changes the environment’s state to $s_{t+1}$ and yields the new associated reward
$r_{t+1}$. The cumulative sum of rewards is known as the return $G_t$ and can be calculated as described in equation 1:

$G_t = R_{t+1} + R_{t+2} + R_{t+3} + \dots$   (we begin at $R_{t+1}$ since $R_t$ is considered 0)   (1)

Then, we introduce the discount factor γ:

$G_t = \gamma^{0} R_{t+1} + \gamma^{1} R_{t+2} + \gamma^{2} R_{t+3} + \dots$; since $\gamma^{0} = 1$, the equation becomes:

$G_t = R_{t+1} + \gamma^{1} R_{t+2} + \gamma^{2} R_{t+3} + \dots = \sum_{k=0}^{T} \gamma^{k} R_{t+k+1}$   (2)

Besides its role in modeling the importance of future rewards relative to immediate ones, the parameter γ
also ensures the mathematical convergence of the RL training. For instance, if γ = 0.99 (values close to this are
usually used in the literature, with 0 < γ < 1), then:

$G_t = R_{t+1} + 0.99^{1} R_{t+2} + 0.99^{2} R_{t+3} + \dots = \sum_{k=0}^{T} 0.99^{k} R_{t+k+1}$, while the powers of γ tend to 0
(i.e., $\gamma^{1} = 0.99$, ..., $\gamma^{3} \approx 0.97$, ..., $\gamma^{10} \approx 0.90$, ..., $\gamma^{T} \approx 0$).
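
As a quick numerical illustration of equation 2, the short Python sketch below computes the discounted return for a made-up list of rewards (the reward values and episode length are arbitrary assumptions); the second loop anticipates the recursive Bellman form introduced later in this section.

```python
# Minimal sketch: computing the discounted return G_t of equation 2 for a
# hypothetical list of rewards R_{t+1}, R_{t+2}, ... collected in one episode.
rewards = [0.0, 0.0, 1.0, 0.0, 5.0]   # assumed example rewards, for illustration only
gamma = 0.99                           # discount factor

g = 0.0
for k, r in enumerate(rewards):        # G_t = sum_k gamma^k * R_{t+k+1}
    g += (gamma ** k) * r
print(g)                               # 1.0*0.99^2 + 5.0*0.99^4 ~= 5.78

# Equivalent backward recursion, anticipating the Bellman form G_t = R_{t+1} + gamma * G_{t+1}
g_rec = 0.0
for r in reversed(rewards):
    g_rec = r + gamma * g_rec
print(g_rec)                           # same value as above
```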
In addition to the discount factor added to the equation of the return $G_t$, to make this setting even more
suitable for real-world scenarios, we shall consider a stochastic version of equation 2, that is, the expected
return, also called the value function $V_t$, described in equation 3:

$V_t = E[G_t \mid s_t = s] = E\left[\sum_{k=0}^{T} \gamma^{k} R_{t+k+1} \,\middle|\, s_t = s\right]$   (3)

Given the mathematical expectation in equation 4:

$E[X] = \sum_{i=0}^{n} x_i P(x_i)$   (4)

where X is a random variable and $P(x_i)$ is the probability of the outcome $x_i$.


Solving the MDP now consists of maximizing the value function (equation 3). Nevertheless, this is still a
difficult problem; indeed, it is known as the infinite horizon problem, since the sum of rewards can stretch
arbitrarily far into the future. For this, we introduce the well-known Bellman equation trick. In the 1950s,
Richard Bellman [1] proposed a recursive approximation of the return G and the value V as follows:
We know that $G_t = R_{t+1} + \gamma^{1} R_{t+2} + \gamma^{2} R_{t+3} + \dots$
This can be rewritten as the immediate reward $R_{t+1}$ plus the discounted future return: $G_t = R_{t+1} + \gamma^{1} G_{t+1}$,
and so on for later returns.
Similarly, the value function $V_t$ can be arranged in this way:

$V_t = R_{t+1} + \gamma V_{t+1}$   (5)

This recursive representation helps avoid the infinite horizon problem by dividing it into solvable sub-
problems. Later, this became known as the Bellman equation.
Based on the concepts and ideas elaborated above, many algorithms and techniques were proposed to solve
such reinforcement learning problems. In Table 1 we present a subset of the foundational RL and deep RL
algorithms with a taxonomy of characteristics.
Table 1: Taxonomy of the RL and DRL algorithms

| Algorithm  | Value-based (0) / Policy-based (1) | Model-based (0) / Model-free (1) | On-policy (0) / Off-policy (1) | Temporal-diff. (0) / Monte Carlo (1) | Tabular (0) / NN (1) |
|------------|------------------------------------|----------------------------------|--------------------------------|--------------------------------------|----------------------|
| Q-learning | 0                                  | 1                                | 1                              | 0                                    | 0                    |
| DQN        | 0                                  | 1                                | 1                              | 0                                    | 1                    |
| REINFORCE  | 1                                  | 1                                | 0                              | 1                                    | 1                    |
| DDPG       | 0 and 1                            | 1                                | 0 and 1                        | 0                                    | 1                    |
| TD3        | 0 and 1                            | 1                                | 0 and 1                        | 0                                    | 1                    |
| PPO        | 0 and 1                            | 1                                | 0 and 1                        | 0                                    | 1                    |

The following are definitions of the terms used in the taxonomy presented in Table 1:
• Value-based: updates the model’s policy based on the value function V.
• Policy-based: updates the model’s policy directly without referring to V.
• Model-based: has access to the environment’s dynamics (e.g., rules of chess).
• Model-free: does not have access to the environment’s dynamics.
• On-policy: uses a model for exploration, and the same for exploitation.
• Off-policy: uses a model for exploration, and a separate one for exploitation.
• Temporal difference: updates the model at each time step.
• Monte Carlo: updates the model using a statistical description (e.g., the mean value) of many time steps (often, the entire episode).
• Tabular: uses a table to store the computed values.
• Neural network (NN): uses a neural network as the function approximator.

In the following section, we shall present in detail the algorithms listed in Table 1, each with the
motivation behind its development, its inner workings, as well as some of its known limitations. These
algorithms are all model-free, leaving model-based algorithms for future work.

3. Algorithms: motivation, inner workings, limitation


1. Q-learning (value-based)
1. Motivation
Before the NN and DL revolution, tremendous research was conducted to propose efficient techniques for
solving the Bellman equation in the context of RL. For this purpose, there have been three principal classes of
methods: dynamic programming, Monte Carlo methods, and temporal-difference (TD) learning. Although
dynamic programming methods are well-developed mathematically, they must be provided a complete and
accurate model of the environment, which is often not available. Monte Carlo methods on the other hand do
not require a model and are conceptually simple but are not suited for step-by-step incremental computation
(i.e., only episodic training). Finally, TD methods require no model and are fully stepwise incremental and
episodic as well. Consequently, we will see that most algorithms are either TD-based (mostly) or Monte
Carlo-based (less often), but rarely based on dynamic programming. Thus, to illustrate the RL paradigm here,
we present the Q-learning algorithm, which is a TD-based model that belongs to the tabular RL model family.
Other RL algorithms (i.e., ones that do not rely on NNs), such as State-action-reward-state-action (SARSA),
which is an on-policy version of Q-learning, and Monte Carlo Tree Search (MCTS), which relies on heuristic trees
instead of a finite-size table, can be found in Sutton & Barto’s book [1].
2. Inner-workings
Q-learning (introduced in 1989 by Christopher Watkins [18]) is one of the basic, yet classic algorithms. Instead of
the value function V presented previously, Q-learning considers a variant of it, called the Q-function, where Q
designates quality. While V is only associated with states, Q is conditioned on states and actions as well:

$Q_t = E[G_t \mid s_t = s, a_t = a] = E\left[\sum_{k=0}^{T} \gamma^{k} R_{t+k+1} \,\middle|\, s_t = s, a_t = a\right]$   (6)

This makes the evaluation of the agent’s policy easier and more explicit, and since we can utilize the mathematical
expectation formula (equation 4), the relationship between V and Q is formulated as:

$V^{\pi}(s) = \sum_{a \in A} \pi(a|s)\, Q^{\pi}(s, a)$

As illustrated in Figure 2, Q-learning uses a Q-table to store the Q-value of taking a certain action in a given
state. It then updates the Q-value based on the observed reward and the estimated future value.

Figure 2: Q-learning

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$   (7)

After running many iterations of the algorithm described in Algorithm 1 below, the model (i.e., the Q-table)
should converge, meaning that the highest Q-values will be assigned to the optimal state-action pairs. If
interested, see the proof of convergence in [24].
Notes for Algorithm 1:
• s’ and a’ denote the next state and next action, respectively.
• ε-greedy is a method to balance exploration and exploitation. Here, ε is set to a small number
(< 0.5), and a number x is randomly generated; if x < ε, the algorithm explores, that is, it selects
an action randomly, whereas if x > ε, the algorithm exploits, that is, it uses the Q-table to select the
best action, i.e., the one with the highest Q-value.
Algorithm 1: Q-learning
Initialize Q(s, a) arbitrarily
Repeat for each episode:
Initialize s
Repeat each step of the episode:
Choose a given s using the policy derived from Q (e.g., ε-greedy)
Take action a, observe r, s’
Q(s, a) ← Q(s, a) + α [r + γ max_a’ Q(s’, a’) − Q(s, a)]
s ← s’
Until s is terminal
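
As a rough illustration of Algorithm 1 (not the original authors' code), the following Python sketch runs tabular Q-learning with ε-greedy exploration on a hypothetical five-cell corridor task invented here; all hyperparameter values are arbitrary assumptions.

```python
import random
from collections import defaultdict

# Minimal tabular Q-learning sketch following Algorithm 1, run on a hypothetical
# 5-cell corridor (reach cell 4 to obtain reward 1; the task is made up for illustration).
N_STATES, GOAL = 5, 4
ACTIONS = [0, 1]                      # 0 = move left, 1 = move right
alpha, gamma, epsilon = 0.1, 0.99, 0.1

def step(state, action):
    next_state = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

Q = defaultdict(float)                # Q-table: Q[(state, action)], initialized arbitrarily to 0

for episode in range(500):
    s, done = 0, False
    while not done:
        # epsilon-greedy: explore with probability epsilon, otherwise exploit the Q-table
        if random.random() < epsilon:
            a = random.choice(ACTIONS)
        else:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        s_next, r, done = step(s, a)
        # TD update of equation 7: Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        best_next = max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
        s = s_next

print([round(Q[(s, 1)], 2) for s in range(N_STATES)])  # learned Q-values for "move right"
```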

3. Limitation
Although it was standard in the past years before the NNs revolution. Q-learning algorithm can only be
applied in limited applications. This is because of its main drawback, which is its inability to (directly) learn
in environments with continuous states and actions since it uses finite-size tables as brains.
2. Deep Q-learning (value-based)
1. Motivation
Given that in the previous sub-section we laid out many concepts related to RL in general and the Q-learning
algorithm specifically, understanding the deep version of the latter, the deep Q-network (DQN),
should be relatively easy.
All concepts related to Q-learning (e.g., the Bellman equation, ε-greedy exploration, etc.) still apply to DQN,
except for one fundamental difference, which is the use of function approximators, specifically NNs, instead
of tabular storage for the Q-values. In 2013, Mnih et al. from DeepMind introduced the DQN architecture [19].
The work was labeled revolutionary as they used DQN to play famous Atari games at a super-human level.
Since then, DQN has been used in many applications including medicine [25], industrial control and engineering
[26], traffic management [27], resource allocation [28], as well as games [3,4]. Moreover, the DQN work has
opened the doors for many theoretical improvements and DRL algorithm developments, and to date, the
DQN paper has been cited more than 9500 times.
2. Inner-workings
DQN has seen two successive architectures, where the second brought significant improvements over
the first. The original DQN architecture involved a naïve, straightforward replacement of the Q-table in Q-
learning with a NN, enabling a gradient-based update of the TD loss instead of the traditional dynamic
programming approach. While this worked for a few easier tasks, it suffered from two main problems,
namely:
• The experience correlation: when the agent learns from consecutive experiences as they occur, the data
can be highly correlated. This is because a given state may provoke a certain action, which in turn
will provoke a specific state and action, and so on, given that those environments are governed by well-
defined transition functions.
• The moving target problem: unlike supervised learning (SL), where the data is fixed, in RL the agent
generates its own data, and the more the agent changes its policy, the more the data distribution changes,
which makes calculating the Q-values very unstable. Therefore, the model is said to be chasing a
moving target: each time the Q-network is updated, the Q-values it produces change as well.
Formally, if the update is computed for $r_t + \gamma \max_a Q(s_{t+1}, a; \theta) - Q(s_t, a_t; \theta)$,
then the learning is prone to becoming unstable, since the target,
$r_t + \gamma \max_a Q(s_{t+1}, a; \theta)$, and the prediction, $Q(s_t, a_t; \theta)$, are not independent, as they both rely
on θ.

For this, the second architecture was proposed and demonstrated significant mitigation of the two problems
discussed. This was achieved by adding two main blocks to the original architecture:
• The replay memory: also called experience replay or the replay buffer, it helps
break the correlation between successive experiences and ensures independent and identically
distributed (IID)-like data, which is assumed in most supervised convergence proofs. It achieves
this by storing transitions of {state, action, reward, new state} until batches of these are available, then
randomly sampling transitions for learning in a phase that is separate from the interaction with the
environment/experience-generation phase (a minimal sketch is given after this list).
• The target network: by setting up a separate network for the calculation of the target Q-value, the
model becomes more robust and stable. This is because the target can only change after batches of learning
episodes (as opposed to at each timestep in the former case). Thus,
the model can progressively approximate the target.
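
To make the replay memory concrete, below is a minimal, simplified Python sketch of such a buffer; the class name and capacity handling are our own illustrative choices, not the implementation used in the DQN paper.

```python
import random
from collections import deque

# Minimal replay-memory sketch (a simplified illustration, not the DQN authors' implementation).
class ReplayMemory:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)   # old transitions are dropped once capacity is reached

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Random sampling breaks the temporal correlation between successive experiences
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)

# Usage: memory.store(s, a, r, s_next, done) during interaction with the environment,
# then batch = memory.sample(32) during the separate learning phase.
```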

Illustrated in Figure 3 is the enhanced version of the DQN architecture; along with Algorithm 2, we shall
explain its inner workings. After initializing the two networks’ parameters (the Q-network with θ and the target
network with θ’, where θ = θ’), the learning episodes begin. We first initialize the initial state and run each
episode step by step following the general RL workflow. The agent outputs an action based on the state
provided, and the environment receives the action and outputs a new state along with the corresponding
reward. This (so-called) transition {state, action, reward, new state} is stored in a memory (i.e., the replay
memory). After a couple of episodes, the stored transitions are sampled randomly from the replay memory (to
avoid correlation of experiences) to compute their corresponding Q-values. The Q-value given by the Q-
network is the predicted one, whereas the Q-value given by the target network with the Bellman
approximation is considered the real one. Finally, the Q-network is updated via the mean-square error (MSE)
to minimize the loss between the predicted Q and the real one, whereas the target network’s parameters are
updated by periodically copying the Q-network’s parameters, which helps avoid the moving target problem.

Figure 3: Deep Q-Network (DQN)


Algorithm 2: DQN
Initialize replay memory D to capacity N
Initialize action-value function Q with random weights θ   #NN
Initialize target action-value function Q̂ with weights θ’ = θ   #NN
For episode = 1, M do
Initialize sequence s1
For t = 1, T do
With probability ε select a random action at
Otherwise select at = argmax_a Q(st, a; θ)
Execute action at and observe reward rt and next state st+1
Store transition (st, at, rt, st+1) in D
Sample a random minibatch of transitions (sj, aj, rj, sj+1) from D
Set yj = rj if the episode terminates at step j+1, otherwise yj = rj + γ max_a’ Q̂(sj+1, a’; θ’)
Perform a gradient descent step on MSE(yj, Q(sj, aj; θ)) with respect to θ
Every C steps, reset Q̂ = Q
End for
End for
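
The following PyTorch sketch illustrates one learning step of Algorithm 2, assuming a minibatch of transitions has already been sampled from the replay memory; the network sizes, the fake random minibatch, and the hyperparameters are arbitrary assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Minimal sketch of one DQN learning step; shapes and sizes are made up for illustration.
STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())          # theta' = theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# A fake sampled minibatch (batch size 32), standing in for transitions drawn from D
states      = torch.randn(32, STATE_DIM)
actions     = torch.randint(0, N_ACTIONS, (32, 1))
rewards     = torch.randn(32, 1)
next_states = torch.randn(32, STATE_DIM)
dones       = torch.zeros(32, 1)                         # 1.0 where the episode terminated

# Target y_j: r_j if terminal, otherwise r_j + gamma * max_a' Q_hat(s_{j+1}, a'; theta')
with torch.no_grad():
    max_next_q = target_net(next_states).max(dim=1, keepdim=True).values
    y = rewards + GAMMA * (1.0 - dones) * max_next_q

# Prediction Q(s_j, a_j; theta), MSE loss, and one gradient step on theta
q_pred = q_net(states).gather(1, actions)
loss = nn.functional.mse_loss(q_pred, y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# Every C steps (hard copy): target_net.load_state_dict(q_net.state_dict())
```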

3. Limitation
As mentioned, DQN has been driving a tremendous number of applications in many domains. Nevertheless,
it is still limited, mainly because of the categorical way in which actions are generated
($a_t = \arg\max_a Q(s_t, a; \theta)$). This makes it unable to be used in continuous action settings (e.g., controlling velocity).
On top of this, DQN has also shown significant training instability in high-dimensional environments where
function approximation of the Q-value can become prone to significant overestimation error [29].
3. REINFORCE (policy-gradient)
1. Motivation
Policy gradient (PG) algorithms are the most natural and direct way of doing RL with neural networks.
This family of RL and DRL techniques was popularized by Sutton et al. in [20]. It proposes a new set of
algorithms that can directly map states to actions (as opposed to computing Q-values and then actions) via a learnable,
differentiable function approximator such as a NN. It should be differentiable since the update is based on the
gradient.
2. Inner-workings
Unlike value-based methods such as the previously presented Q-learning and DQN, PG algorithms have a
much more solid mathematical basis. Indeed, instead of relying on the Bellman approximation of the Q-values via the
recursive decomposition, the gradient of the loss in PG algorithms is directly related to the policy in an
intuitive way. Below is the PG theorem derivation:
We know that the mathematical expectation E is equal to
$E[X] = \sum_{i=0}^{n} x_i P(x_i)$, where X is a random variable and P is the probability of X. Therefore, the gradient
of the objective function, $\nabla_\theta J(\pi_\theta) = \nabla_\theta E_{\tau \sim \pi_\theta}[R(\tau)]$ (where τ is a trajectory), can be formulated as:

$= \nabla_\theta \int_\tau P(\tau|\theta) R(\tau)$,   expanding the expectation formula

$= \int_\tau \nabla_\theta P(\tau|\theta) R(\tau)$,   moving the gradient inside the integral

$= \int_\tau P(\tau|\theta) \frac{\nabla_\theta P(\tau|\theta)}{P(\tau|\theta)} R(\tau)$,   multiplying and dividing by $P(\tau|\theta)$

$= \int_\tau P(\tau|\theta) \nabla_\theta \log\big(P(\tau|\theta)\big) R(\tau)$,   log-derivative trick ($\frac{du}{u} = d\log u$)

$= E_{\tau \sim \pi_\theta}\big[\nabla_\theta \log\big(P(\tau|\theta)\big) R(\tau)\big]$,   back to the expectation form

Among the terms in the last expression, only $P(\tau|\theta)$ needs to be defined (the reward is known). Since $P(\tau|\theta)$,
the probability of following a trajectory τ, is dictated by the policy function $\pi_\theta$, the final expression
becomes:

$\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log\big(\pi_\theta(a_t|s_t)\big) R(\tau)\right]$   (8)

By performing gradient ascent on a Monte Carlo estimate of this expectation, we can find the
optimal policy. While many tricks have been proposed to make PG algorithms better, all improved PG
algorithms are still based on this basic theorem.
This final expression allows an easy and direct gradient-based update of the agent’s parameters based solely
on its policy (i.e., its actions), instead of the indirect way, i.e., via the state-action value (Q-value). Moreover, it
permits the training of RL agents in continuous-action environments naturally. This is because the action
selection is not based on a categorical sampling method such as the argmax in DQN. Most PG-based
algorithms can be used for discrete-action environments, via any categorical sampling function, as well as for
continuous settings, where the agent tries to learn a distribution function, such as a Gaussian distribution
parameterized by its mean (μ) and variance (σ²).
The simplest PG algorithm is called REINFORCE. Here, the model follows a naïve strategy of using a NN
to map states to actions optimally (i.e., in a way that maximizes the objective function). In other words, the
REINFORCE algorithm presents a direct differentiation of the reinforcement learning objective, i.e., the
expected return described in equation 3: $E[\sum_{k=0}^{T} \gamma^{k} R_{t+k+1} \mid s_t = s]$. As illustrated in Figure 4 and presented
in Algorithm 3, there is only one network, the policy network (π). The agent learns to choose optimal
actions given the environment’s states to increase the objective function (equation 8) via gradient ascent.

Figure 4: REINFORCE
Algorithm 3: REINFORCE (policy function is a NN)
Initialize a differentiable policy function π(a|s, θ)   #NN
Repeat for each episode:
Record S0, A0, R1, ..., ST−1, AT−1, RT, following π
Loop for each step of the episode t = 0, 1, ..., T−1:
Gt ← Σ_{k=t+1}^{T} γ^(k−t−1) Rk
θ ← θ + α γ^t Gt ∇ ln π(At|St, θ)   #π(At|St, θ) returns the probability of At
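
Below is a minimal PyTorch sketch of one REINFORCE update in the spirit of Algorithm 3; the recorded episode is fake random data and the network architecture is an arbitrary assumption, so this is an illustration rather than a reference implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of one REINFORCE update for a discrete-action policy network;
# the "episode" below is invented data, purely for illustration.
STATE_DIM, N_ACTIONS, GAMMA, LR = 4, 2, 0.99, 1e-3

policy = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
optimizer = torch.optim.Adam(policy.parameters(), lr=LR)

# One recorded episode: states S_0..S_{T-1}, actions A_0..A_{T-1}, rewards R_1..R_T
states  = torch.randn(10, STATE_DIM)
actions = torch.randint(0, N_ACTIONS, (10,))
rewards = torch.randn(10)

# Returns-to-go: G_t = sum_{k=t+1}^{T} gamma^{k-t-1} R_k, computed backwards
returns, g = [], 0.0
for r in reversed(rewards.tolist()):
    g = r + GAMMA * g
    returns.append(g)
returns = torch.tensor(list(reversed(returns)))

# log pi(A_t | S_t, theta) for the actions actually taken
log_probs = torch.log_softmax(policy(states), dim=1)
log_probs = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)

# Gradient ascent on E[G_t * grad log pi] is done by descending the negative objective
loss = -(returns * log_probs).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```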

3. Limitation
Despite all the advantages provided by the PG theorem and its direct implementation, the REINFORCE
algorithm suffers from fundamental problems, namely noisy estimates of the gradient caused by unclear
credit assignment. That is, it is not clear which action resulted in which reward. This is because, as described in
Algorithm 3, the agent is only updated once a complete episode of data (i.e., all steps) or more has been collected.
This is necessary since PG is based on a statistical, Monte Carlo-like estimate of the cumulative reward samples
(returns) generated by the policy.
For this, a critical improvement was proposed as a new family of DRL algorithms known as actor-critic
architectures [30]. To date, actor-critic-based models are the SOTA in the DRL field, especially in
model-free DRL.
The key contribution of actor-critic models is the use of both the policy and the value for learning optimal
strategies. This is the reason why, in Table 1, all models after REINFORCE were dubbed value-based as well
as policy-based. In the actor-critic architecture, the actor is a NN that is responsible for learning the policy by
trying to maximize an objective function similar to that of the REINFORCE algorithm, whereas the critic (also
a NN) learns the value function V, the action-value function Q, or even a relationship between the two, called
the advantage A, which is simply the difference between the next estimated value (V or Q) and the previous one, as
described in equation 9 below:

$A_t = \begin{cases} r_t + \gamma V(s_{t+1}) - V(s_t), & \text{or} \\ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \end{cases}$   (9)
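
As a small worked example of the TD form of equation 9, the snippet below computes the advantage for made-up numbers (the reward and value estimates are arbitrary assumptions).

```python
# Minimal sketch of the TD-style advantage of equation 9, with made-up numbers for illustration.
def advantage(reward, value_s, value_s_next, gamma=0.99):
    """A_t = r_t + gamma * V(s_{t+1}) - V(s_t): how much better the transition was than expected."""
    return reward + gamma * value_s_next - value_s

print(advantage(reward=1.0, value_s=2.0, value_s_next=2.5))  # 1.0 + 0.99*2.5 - 2.0 = 1.475
```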

In the following subsections, we will overview three SOTA models that have been making headlines over the past
five years or so. These are all actor-critic-based models, so more details about this family of DRL algorithms
will be presented in each sub-section.

4. DDPG (actor-critic):
1. Motivation
Deep deterministic policy gradient, or DDPG for short, was developed in 2016 by Lillicrap et al. [21] to obtain a
DRL model that can operate in high-dimensional continuous state and action spaces. DDPG, as the name
suggests, uses a deterministic policy, as opposed to a stochastic one, mapping states to actions directly
instead of to a probability distribution over actions, via a deterministic policy function denoted $\mu(s|\theta^\mu)$.
2. Inner-workings
As shown in Figure 5, DDPG is an actor-critic model, meaning that it contains at least two networks: one,
called the actor, responsible for learning the policy, and a second network, named the critic, for learning the
value function of each state (V) or state-action pair (Q). The critic learns by minimizing the MSE
loss between the predicted Q-value and the actual one, akin to the DQN model discussed previously. As the
critic gets better at predicting the corresponding Q-values, the actor is presented with a convenient value that
it can maximize in a PG fashion. Further, DDPG makes use of other improvements made for the DQN
model, namely the replay memory and the target network.
Figure 5: Deep deterministic policy gradient (DDPG)

In Algorithm 4 we quote the line “Store transition (st, at, rt, st+1) in D”, which refers to the same replay
memory used in the DQN model. This is because it was shown that PG models, including those with actor-
critic architectures, are very sample-inefficient, since they use a given transition only once and then discard
it. The replay memory allows training on mini-batches of data as well as reusing the same data multiple times
during training. Another change is the way the target is updated: while DQN updates the target network by
a hard copy of the parameters, DDPG uses a soft copy based on Polyak averaging, as follows:
$\theta^{Q'} \leftarrow \tau\theta^{Q} + (1 - \tau)\theta^{Q'}$. The new target parameters are thus an interpolation: a small factor τ
(0.001 in the DDPG paper) times the current network parameters, plus (1 − τ) times the old
parameters of the target network. This constrains the target network to change slowly and was shown (in
the DDPG paper) to yield better performance than the hard copy of the DQN model.
Further, since the DDPG policy is purely deterministic, the model must ensure the exploration-exploitation
balance by other means. Indeed, action selection is executed according to the equation mentioned in the algorithm,
specifically $a_t = \mu(s_t|\theta^\mu) + N_t$, where $N_t$ is a noise signal added to the policy output to force
stochastic behavior. In the paper, the researchers used Ornstein-Uhlenbeck noise [31]; however, later
research has demonstrated that other, simpler noise signals, such as Gaussian noise, can be used and yield
competitive performance.
Algorithm 4: DDPG
Initialize the critic and the actor networks, Q(s, a|θ^Q) and μ(s|θ^μ)
Initialize the target networks Q’ and μ’: θ^Q’ ← θ^Q, θ^μ’ ← θ^μ
Initialize the replay memory D
For episode = 1, M do
Initialize a random process N for action exploration
Receive initial observation state s1
For t = 1, T do
Select action at = μ(st|θ^μ) + Nt   #policy + noise
Execute at, observe reward rt and new state st+1
Store transition (st, at, rt, st+1) in D
Sample a random minibatch of K transitions from D
Set yi = ri + γ Q’(si+1, μ’(si+1|θ^μ’) | θ^Q’)
Update the critic by minimizing the loss: MSE(yi, Q(si, ai|θ^Q))
Update the actor by maximizing the expected Q:
∇_{θ^μ} J ≈ (1/K) Σ_i ∇_a Q(s, a|θ^Q)|_{s=si, a=μ(si)} ∇_{θ^μ} μ(s|θ^μ)|_{si}
Update the target networks:
θ^Q’ ← τ θ^Q + (1 − τ) θ^Q’   #Polyak averaging
θ^μ’ ← τ θ^μ + (1 − τ) θ^μ’
End for
End for
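
The two DDPG-specific ingredients discussed above, noisy deterministic action selection and the Polyak (soft) target update, can be sketched in a few lines of PyTorch; the network sizes, the noise scale, and τ are illustrative assumptions, not the settings of the original paper's experiments.

```python
import torch
import torch.nn as nn

# Minimal sketch of noisy deterministic action selection and the soft target update in DDPG.
STATE_DIM, ACTION_DIM, TAU = 4, 2, 0.001

actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM), nn.Tanh())
target_actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM), nn.Tanh())
target_actor.load_state_dict(actor.state_dict())

# a_t = mu(s_t | theta_mu) + N_t  (Gaussian noise as a simple stand-in for Ornstein-Uhlenbeck noise)
state = torch.randn(1, STATE_DIM)
with torch.no_grad():
    action = actor(state) + 0.1 * torch.randn(1, ACTION_DIM)

# Soft update: theta' <- tau * theta + (1 - tau) * theta', applied parameter by parameter
with torch.no_grad():
    for param, target_param in zip(actor.parameters(), target_actor.parameters()):
        target_param.mul_(1.0 - TAU).add_(TAU * param)
```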

3. Limitation
While it has empowered several great achievements in the AI field, DDPG is often criticized for being
unstable. This manifests itself as high sensitivity to hyperparameter tuning and a propensity to converge
to very poor solutions, or even diverge, in tasks requiring relatively more exploration [32]. This left room for
improvements to further DRL decision-making optimality, especially in high-dimensional
spaces and noisy environments.
5. TD3 (actor-critic):
1. Motivation
The limited performance of the DDPG model was extensively explored. As a result, the cause was
identified to be the overestimation bias. This problem was already known as an issue in Q-learning. It
manifests itself when the estimated Q-values are often greater than the true ones; thus, the agent is said to be
overestimating the expected future rewards, making it select actions of misleadingly high value. This
phenomenon accumulates after each iteration and propagates through the TD error.
The overestimation bias is generally related to the statistics-based approximation via NNs or other
approximate sample-based methods such as the dynamic programming decomposition in Q-learning. As a
natural successor to DDPG, and given the main limitations discussed, the twin delayed deep deterministic
(TD3) PG model was developed in 2018 [22].
2. Inner-workings
TD3 solves the overestimation bias issue of the DDPG model by maintaining a pair of critics, Q1 and Q2, as
illustrated in Figure 6 (hence the name “twin”), along with a single actor, each network having its
corresponding target.

Figure 6: Twin delayed deep deterministic (TD3) policy gradient

For each time step, TD3 uses the smaller of the two Q-values and updates the policy less frequently than
the critic networks. It is worth mentioning that the use of two networks for a better approximation of the Q-
value was inspired by previous work concerning the overestimation bias in Q-learning and DQN, namely
double Q-learning and double DQN [29], respectively. Besides these changes, the TD3 model is akin to the
DDPG model in both selecting actions and updating the networks, as explained in Algorithm 5.

Algorithm 5: TD3
Initialize the critic networks Qθ1, Qθ2, and the actor network πΦ randomly
Initialize the target networks θ1’ ← θ1, θ2’ ← θ2, Φ’ ← Φ
Initialize the replay memory B
For t = 1 to T do
Select action with exploration noise a ~ πΦ(s) + ε, ε ~ N(0, σ), and observe reward r and new state s’
Store transition (s, a, r, s’) in B
Sample a mini-batch of N transitions from B
ã ← πΦ’(s’) + ε   #target policy + noise
y ← r + γ min_{i=1,2} Qθi’(s’, ã)   #clipped double Q-learning
Update the critics: θi ← argmin_{θi} MSE(y, Qθi(s, a))
If t mod d then   #delayed update of the policy and target networks
Update Φ by the deterministic policy gradient:
∇_Φ J(Φ) = N^{-1} Σ ∇_a Qθ1(s, a)|_{a=πΦ(s)} ∇_Φ πΦ(s)
Update the target networks:
θi’ ← τ θi + (1 − τ) θi’   #Polyak averaging
Φ’ ← τ Φ + (1 − τ) Φ’
End if
End for
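
A minimal PyTorch sketch of the TD3 target computation (the clipped double-Q minimum, including the clipped target-policy noise used in the original paper) is given below; the sizes, noise parameters, and the fake batch are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

# Minimal sketch of the TD3 target y = r + gamma * min(Q1', Q2') with target policy smoothing.
STATE_DIM, ACTION_DIM, GAMMA, NOISE_STD, NOISE_CLIP = 4, 2, 0.99, 0.2, 0.5

def make_critic():
    return nn.Sequential(nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

target_q1, target_q2 = make_critic(), make_critic()          # the "twin" target critics
target_actor = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, ACTION_DIM), nn.Tanh())

rewards     = torch.randn(32, 1)                              # fake sampled minibatch
next_states = torch.randn(32, STATE_DIM)
dones       = torch.zeros(32, 1)

with torch.no_grad():
    # a~ = pi_phi'(s') + clipped noise  (target policy smoothing)
    noise = (NOISE_STD * torch.randn(32, ACTION_DIM)).clamp(-NOISE_CLIP, NOISE_CLIP)
    next_actions = (target_actor(next_states) + noise).clamp(-1.0, 1.0)
    # y = r + gamma * min(Q1'(s', a~), Q2'(s', a~)): taking the smaller estimate curbs overestimation
    sa = torch.cat([next_states, next_actions], dim=1)
    y = rewards + GAMMA * (1.0 - dones) * torch.min(target_q1(sa), target_q2(sa))

# Both critics are then regressed toward y with an MSE loss, and the actor and target
# networks are updated only every d steps (the "delayed" part of TD3).
```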
3. Limitation
Exploration is handled manually with Gaussian noise, and the use of six networks is computationally
heavy. Further, TD3 can only be used naturally (if no feature engineering is considered) with continuous-space
environments.
6. PPO (actor-critic):
1. Motivation
In contrast to DDPG and TD3, where exploration is carried out by manually adding a noise
term to the policy, there exists another category of actor-critic PG algorithms that uses a stochastic
policy, as opposed to a deterministic one. This inherently makes the exploration phase handled automatically.
Among the main SOTA models that utilize this approach is the proximal policy optimization (PPO)
algorithm.
2. Inner-workings

Figure 7: Proximal policy optimization (PPO)

Similar to the REINFORCE model, PPO learns online, meaning that it collects mini-batches of experience
while interacting with the environment and uses them to update its policy, whereas DDPG and TD3 are
considered off-policy (or offline) models, since the policy used to collect experiences is different
from the one updated for learning (π).
Once the PPO policy is updated, a newer batch is collected with the newly updated policy. The key
idea introduced by PPO is the concept of a trust region, which refers to the requirement that a new update of the policy
should not change it too much from the previous policy. This results in significantly less variance and
smoother training, and makes sure the agent does not go down an unrecoverable path. Indeed, we quote the
original paper: “These methods have the stability and reliability of trust region policy optimization (TRPO)
[33] methods but are much simpler to implement, requiring only a few lines of code change to a vanilla policy
gradient implementation”.
The PPO model was developed with three main points in mind:
• ease of implementation
• sample efficiency
• ease of tuning

As one can tell from Figure 6, although the TD3 model has shown impressive improvements compared to
previous models, it involves many complexities regarding implementation and training, especially its use
of six separate networks, as opposed to the architecture of PPO illustrated in Figure 7.
Additionally, the previously presented on-policy model (REINFORCE) suffered from
sample inefficiency since it only uses each collected transition once. Here, PPO is an on-policy model, yet it
can also utilize the same transition multiple times by using a clipping function, known as the surrogate loss
(mentioned in Algorithm 6), to force the ratio of the new policy to the old one to stay within a manageable
interval. This key contribution of the PPO model is also helpful for hyperparameter tuning, which can be
very laborious for DDPG and TD3, given their complex architectures with several parts, as well as the risk of
going outside the trust region, especially if the learning rate is larger than it should be.
Formally, PPO has introduced small, yet very effective changes to the simple PG form used by the
REINFORCE model. That is, it replaces the $\log \pi_\theta(a_t|s_t)$ term in the PG objective function of equation 8,
recall:

$\nabla_\theta J(\pi_\theta) = E_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log\big(\pi_\theta(a_t|s_t)\big) R(\tau)\right]$

by the ratio $r_t(\theta) = \frac{\pi_{\theta_{new}}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$, clipping it between $1 - \epsilon$ and $1 + \epsilon$, and maximizing it using gradient ascent
just like a normal PG algorithm.

Algorithm 6: PPO
Initialize policy parameters θ0 and value function parameters Φ0
For k = 0, 1, 2, ... do
Collect a set of trajectories Dk = {τi} by running policy π(θk) in the environment
Compute the rewards-to-go R̂t
Compute the advantage estimates Ât = δt + (γλ)δt+1 + ... + (γλ)^(T−t+1) δT−1,
where δt = rt + γ VΦ(st+1) − VΦ(st)   #like the TD error in the Bellman equation
Update the policy by maximizing the PPO-clip objective:
#below is the clipped surrogate loss, where rt(θ) = πθ_new / πθ_old
L_{θk}(θ) = E[ Σ_{t=0}^{T} min( rt(θ) Ât, clip(rt(θ), 1 − ε, 1 + ε) Ât ) ]
Fit the value function VΦ(st) to R̂t via the MSE loss
End for
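
To make the clipped surrogate objective of Algorithm 6 concrete, the following PyTorch sketch computes it from made-up tensors; in a real implementation the log-probabilities would come from the policy network and the advantages from the GAE estimator above, and the value of ε is only an assumed, typical choice.

```python
import torch

# Minimal sketch of the PPO clipped surrogate objective, computed from fake data for illustration.
EPS = 0.2   # clipping parameter epsilon (an assumed, typical value)

log_probs_new = torch.randn(64)          # log pi_theta_new(a_t | s_t) for a sampled batch
log_probs_old = torch.randn(64)          # log pi_theta_old(a_t | s_t), kept fixed (no gradient)
advantages    = torch.randn(64)          # advantage estimates A_hat_t

# Probability ratio r_t(theta) = pi_new / pi_old, computed in log space for numerical stability
ratio = torch.exp(log_probs_new - log_probs_old)

# L = E[ min( r_t * A_t, clip(r_t, 1 - eps, 1 + eps) * A_t ) ]
unclipped = ratio * advantages
clipped   = torch.clamp(ratio, 1.0 - EPS, 1.0 + EPS) * advantages
ppo_objective = torch.min(unclipped, clipped).mean()

loss = -ppo_objective                    # maximize the objective by minimizing its negative
print(float(loss))
```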

3. Limitation
To date, PPO is still considered a SOTA algorithm, and its limitations are yet to be fully explored. It was created
by a team at OpenAI, and it is still the main algorithm used in the company; indeed, it was recently used to fine-
tune their ChatGPT model as well.
4. Summary:
In this review, we have provided a comprehensive overview of the field of reinforcement learning (RL),
covering the key concepts, techniques, and algorithms. We have traced the historical progress of the field,
specifically, the main model-free algorithms, starting with the early Q-learning algorithm and covering the
current state-of-the-art deep RL algorithms such as DQN, TD3, and PPO (all discussed algorithms are
listed in Figure 8 below).
We have also discussed the challenges and limitations of RL, including sample efficiency and the
exploration-exploitation trade-off, and how these challenges have been addressed in the development of
newer algorithms. One of the main advantages of RL algorithms is their ability to learn from experience and
adapt to changing environments. This allows RL agents to solve complex and dynamic problems that may be
difficult to model using other techniques. However, RL algorithms can also be computationally intensive and
may require a large amount of data and interactions with the environment to learn effectively.
Figure 8: Main RL-Algorithms discussed in the paper

Given the diverse range of RL algorithms that have been developed, it can be challenging to choose the
best algorithm for a particular problem. In general, model-based algorithms are more sample efficient but may
be less robust than value-based or policy-based algorithms. Value-based algorithms are widely used and can
be effective in many situations, but they may suffer from instability or divergence in certain cases. Policy-
based algorithms are generally more stable and can handle high-dimensional action spaces, but they may be
slower to converge compared to value-based algorithms.
In conclusion, the choice of the RL algorithm will depend on the specific characteristics of the problem at
hand and the trade-offs between sample efficiency, stability, and convergence speed. This review paper aims
to provide a comprehensive and accessible overview of the key concepts, techniques, and algorithms of RL,
as well as the challenges and limitations of the field and its historical progress. It is intended to serve as a
valuable resource for researchers and practitioners interested in learning more about RL and its applications.

Acknowledgement:
None.

References:
[1] R.S. Sutton, A.G. Barto, Reinforcement Learning: An Introduction, 2017.
[2] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. Van Den Driessche, T. Graepel, D. Hassabis, Mastering the game of Go without human knowledge, Nature. 550 (2017) 354–359. https://doi.org/10.1038/nature24270.
[3] J. Crespo, A. Wichert, Reinforcement learning applied to games, SN Appl. Sci. 2 (2020) 1–16. https://doi.org/10.1007/s42452-020-2560-3.
[4] K. Shao, Z. Tang, Y. Zhu, N. Li, D. Zhao, A Survey of Deep Reinforcement Learning in Video Games, (2019) 1–13. http://arxiv.org/abs/1912.10944.
[5] C. Yu, J. Liu, S. Nemati, Reinforcement Learning in Healthcare: A Survey, (2019). http://arxiv.org/abs/1908.08796.
[6] O. Gottesman, F. Johansson, M. Komorowski, A. Faisal, D. Sontag, F. Doshi-Velez, L.A. Celi, Guidelines for reinforcement learning in healthcare, Nat. Med. 25 (2019) 16–18. https://doi.org/10.1038/s41591-018-0310-5.
[7] A. Coronato, M. Naeem, G. De Pietro, G. Paragliola, Reinforcement learning for intelligent healthcare applications: A survey, Artif. Intell. Med. 109 (2020) 101964. https://doi.org/10.1016/j.artmed.2020.101964.
[8] B.M. Hambly, R. Xu, H. Yang, Recent Advances in Reinforcement Learning in Finance, SSRN Electron. J. (2021) 1–64. https://doi.org/10.2139/ssrn.3971071.
[9] B. Singh, R. Kumar, V.P. Singh, Reinforcement learning in robotic applications: a comprehensive survey, Springer Netherlands, 2022. https://doi.org/10.1007/s10462-021-09997-9.
[10] J. Kober, J.A. Bagnell, J. Peters, Reinforcement learning in robotics: A survey, Int. J. Rob. Res. 32 (2013) 1238–1274. https://doi.org/10.1177/0278364913495721.
[11] H. Nguyen, H. La, Review of Deep Reinforcement Learning for Robot Manipulation, Proc. - 3rd IEEE Int. Conf. Robot. Comput. IRC 2019. (2019) 590–595. https://doi.org/10.1109/IRC.2019.00120.
[12] J.R. Vázquez-Canteli, Z. Nagy, Reinforcement learning for demand response: A review of algorithms and modeling techniques, Appl. Energy. 235 (2019) 1072–1089. https://doi.org/10.1016/j.apenergy.2018.11.002.
[13] B. Rolf, I. Jackson, M. Müller, S. Lang, T. Reggelin, D. Ivanov, A review on reinforcement learning algorithms and applications in supply chain management, Int. J. Prod. Res. (2022). https://doi.org/10.1080/00207543.2022.2140221.
[14] P. Mannion, J. Duggan, E. Howley, An Experimental Review of Reinforcement Learning Algorithms for Adaptive Traffic Signal Control, Auton. Road Transp. Support Syst. (2016) 47–66. https://doi.org/10.1007/978-3-319-25808-9_4.
[15] A. Mosavi, Y. Faghan, P. Ghamisi, P. Duan, S.F. Ardabili, E. Salwana, S.S. Band, Comprehensive review of deep reinforcement learning methods and applications in economics, Mathematics. 8 (2020). https://doi.org/10.3390/MATH8101640.
[16] Y. Li, Deep Reinforcement Learning: An Overview, ArXiv. 16 (2018). https://arxiv.org/abs/1810.06339.
[17] K. Arulkumaran, M.P. Deisenroth, M. Brundage, A.A. Bharath, Deep reinforcement learning: A brief survey, IEEE Signal Process. Mag. 34 (2017) 26–38. https://doi.org/10.1109/MSP.2017.2743240.
[18] C.J.C.H. Watkins, P. Dayan, Q-Learning, Mach. Learn. (1992). https://link.springer.com/article/10.1023/A:1022676722315.
[19] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, M. Riedmiller, Playing Atari with Deep Reinforcement Learning, (2013) 1–9. http://arxiv.org/abs/1312.5602.
[20] R.S. Sutton, D. McAllester, S. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, Adv. Neural Inf. Process. Syst. (2000) 1057–1063.
[21] T.P. Lillicrap, J.J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, D. Wierstra, Continuous control with deep reinforcement learning, 4th Int. Conf. Learn. Represent. ICLR 2016 - Conf. Track Proc. (2016).
[22] S. Fujimoto, H. Van Hoof, D. Meger, Addressing Function Approximation Error in Actor-Critic Methods, 35th Int. Conf. Mach. Learn. ICML 2018. 4 (2018) 2587–2601.
[23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, O. Klimov, Proximal Policy Optimization Algorithms, (2017) 1–12. http://arxiv.org/abs/1707.06347.
[24] R. Luu, Convergence of Q-learning: a simple proof, (n.d.) 1–4.
[25] D.P. Tobón, M.S. Hossain, G. Muhammad, J. Bilbao, A. El Saddik, Deep learning in multimedia healthcare applications: a review, Multimed. Syst. 28 (2022) 1465–1479. https://doi.org/10.1007/s00530-022-00948-0.
[26] S. Chen, C. Jiang, J. Li, J. Xiang, W. Xiao, Improved deep q-network for user-side battery energy storage charging and discharging strategy in industrial parks, Entropy. 23 (2021) 1–18. https://doi.org/10.3390/e23101311.
[27] J.H. Park, K. Farkhodov, S.H. Lee, K.R. Kwon, Deep Reinforcement Learning-Based DQN Agent Algorithm for Visual Object Tracking in a Virtual Environmental Simulation, Appl. Sci. 12 (2022). https://doi.org/10.3390/app12073220.
[28] H. Li, H. Gao, T. Lv, Y. Lu, Deep Q-Learning based dynamic resource allocation for self-powered ultra-dense networks, 2018 IEEE Int. Conf. Commun. Work. ICC Work. 2018 - Proc. (2018) 1–6. https://doi.org/10.1109/ICCW.2018.8403505.
[29] H. Van Hasselt, A. Guez, D. Silver, Deep reinforcement learning with double Q-Learning, 30th AAAI Conf. Artif. Intell. AAAI 2016. (2016) 2094–2100. https://doi.org/10.1609/aaai.v30i1.10295.
[30] V.R. Konda, J.N. Tsitsiklis, Actor-critic algorithms, Adv. Neural Inf. Process. Syst. (2000) 1008–1014.
[31] Wikipedia, Ornstein–Uhlenbeck process, (2022). https://en.wikipedia.org/wiki/Ornstein–Uhlenbeck_process.
[32] G. Matheron, N. Perrin, O. Sigaud, The problem with DDPG: understanding failures in deterministic environments with sparse rewards, (2019) 1–19. http://arxiv.org/abs/1911.11679.
[33] J. Schulman, S. Levine, P. Moritz, M. Jordan, P. Abbeel, Trust region policy optimization, 32nd Int. Conf. Mach. Learn. ICML 2015. 3 (2015) 1889–1897.
