Modern Deep Reinforcement Learning Algorithms
Written by:
Sergey Ivanov
[email protected]
Scientific advisor:
Alexander D’yakonov
[email protected]
Moscow, 2019
Contents
1 Introduction
3 Value-based algorithms
  3.1 Temporal Difference learning
  3.2 Deep Q-learning (DQN)
  3.3 Double DQN
  3.4 Dueling DQN
  3.5 Noisy DQN
  3.6 Prioritized experience replay
  3.7 Multi-step DQN
6 Experiments
  6.1 Setup
  6.2 Cartpole
  6.3 Pong
  6.4 Interaction-training trade-off in value-based algorithms
  6.5 Results
7 Discussion
A Implementation details
B Hyperparameters
Abstract
Recent advances in Reinforcement Learning, grounded in combining classical theoretical results with the Deep Learning paradigm, have led to breakthroughs in many artificial intelligence tasks and gave birth to Deep Reinforcement Learning (DRL) as a field of research. In this work the latest DRL algorithms are reviewed with a focus on their theoretical justification, practical limitations and observed empirical properties.
1. Introduction
Over the last several years, Deep Reinforcement Learning has proved to be a fruitful approach to many artificial intelligence tasks in diverse domains. Breakthrough achievements include reaching human-level performance in such complex games as Go [20], multiplayer Dota [14] and the real-time strategy StarCraft II [24]. The generality of the DRL framework allows its application in both discrete and continuous domains to solve tasks in robotics and simulated environments [12].
Reinforcement Learning (RL) is usually viewed as a general formalization of the decision-making task and is deeply connected to dynamic programming, optimal control and game theory [21]. Yet its problem setting makes almost no assumptions about the world model or its structure and usually supposes that the environment is given to the agent in the form of a black box. This allows RL to be applied in practically all settings and forces the designed algorithms to be adaptive to many kinds of challenges. The latest RL algorithms are usually reported to be transferable from one task to another with no task-specific changes and little to no hyperparameter tuning.
As the object of desire is a strategy, i. e. a function mapping the agent's observations to possible actions, reinforcement learning is considered to be a subfield of machine learning. But instead of learning from data, as is established in classical supervised and unsupervised learning problems, the agent learns from the experience of interacting with the environment. Being a more «natural» model of learning, this setting causes new challenges, peculiar only to reinforcement learning, such as the necessity of integrating exploration and the problem of delayed and sparse rewards. The full setup and essential notation are introduced in section 2.
Classical Reinforcement Learning research in the last third of the previous century developed an extensive theoretical core for modern algorithms to be grounded on. Several algorithms have been known ever since and are able to solve small-scale problems when either the environment states can be enumerated (and stored in memory) or the optimal policy can be searched for in the space of linear or quadratic functions of state representation features. Although these restrictions are extremely limiting, the foundations of classical RL theory underlie modern approaches. These theoretical fundamentals are discussed in sections 3.1 and 5.1–5.2.
Combining this framework with Deep Learning [5] was popularized by the Deep Q-Learning algorithm, introduced in [13], which was able to play any of 57 Atari console games without tweaking the network architecture or algorithm hyperparameters. This novel approach was extensively researched and significantly improved in the following years. The principles of the value-based direction in deep reinforcement learning are presented in section 3.
One of the key ideas in recent value-based DRL research is the distributional approach, proposed in [1]. Further extending classical theoretical foundations and coming with practical DRL algorithms, it gave birth to the distributional reinforcement learning paradigm, whose potential is now being actively investigated. Its ideas are described in section 4.
The second main direction of DRL research is policy gradient methods, which attempt to directly optimize the objective function explicitly present in the problem setup. Their application to neural networks involves a series of particular obstacles, which required specialized optimization techniques. Today they represent a competitive and scalable approach in deep reinforcement learning due to their enormous parallelization potential and applicability to continuous domains. Policy gradient methods are discussed in section 5.
Despite the wide range of successes, current state-of-the-art DRL methods still face a number of significant drawbacks. As training of neural networks requires huge amounts of data, DRL demonstrates unsatisfying results in settings where data generation is expensive. Even in cases where interaction is nearly free (e. g. in simulated environments), DRL algorithms tend to require excessive numbers of iterations, which raises their computational and wall-clock time cost. Furthermore, DRL suffers from sensitivity to random initialization and hyperparameters, and its optimization process is known to be uncomfortably unstable [9]. An especially embarrassing consequence of these DRL features turned out to be the low reproducibility of empirical observations across different research groups [6]. In section 6, we attempt to launch state-of-the-art DRL algorithms on several standard testbed environments and discuss practical nuances of their application.
2. Reinforcement Learning problem setup
2.1. Assumptions of RL setting
Informally, the process of sequential decision-making proceeds as follows. The agent is provided with some initial observation of the environment and is required to choose some action from the given set of possibilities. The environment responds by transitioning to another state and generating a reward signal (a scalar number), which is considered to be a ground-truth estimation of the agent's performance. The process continues repeatedly, with the agent making choices of actions from observations and the environment responding with next states and reward signals. The only goal of the agent is to maximize the cumulative reward.
This description of the learning process model already introduces several key assumptions. Firstly, the time space is considered to be discrete, as the agent interacts with the environment sequentially. Secondly, it is assumed that the provided environment incorporates some reward function as a supervised indicator of success. This is an embodiment of the reward hypothesis, also referred to as the Reinforcement Learning hypothesis: any goal of intelligent behaviour can be formalized as maximization of the expected cumulative sum of a received scalar reward signal.
Exploitation of this hypothesis draws a line between reinforcement learning and the classical machine learning settings, supervised and unsupervised learning. Unlike unsupervised learning, RL assumes supervision, which, similar to the labels in data for supervised learning, has a stochastic nature and represents a key source of knowledge. At the same time, no data or «right answer» is provided to the training procedure, which distinguishes RL from standard supervised learning. Moreover, RL is the only machine learning task providing an explicit objective function (the cumulative reward signal) to maximize, while in the supervised and unsupervised settings the optimized loss function is usually constructed by an engineer and is not «included» in the data. The fact that the reward signal is incorporated in the environment is considered to be one of the weakest points of the RL paradigm, as for many real-life human goals the introduction of this scalar reward signal is at the very least unobvious.
For practical applications it is also natural to assume that the agent's observations can be represented by some feature vectors, i. e. elements of R^d. The set of possible actions in most practical applications is usually uncomplicated and is either discrete (the number of possible actions is finite) or can be represented as a subset of R^m (almost always [−1, 1]^m, or reducible to this case)¹. RL algorithms are usually restricted to these two cases, but a mix of the two (the agent is required to choose both discrete and continuous quantities) can also be considered.
The final assumption of the RL paradigm is the Markovian property: the distribution of the next state depends only on the current state and the chosen action, and not on the rest of the preceding history.
Although this assumption may seem overly strong, it actually formalizes the fact that the world modeled by the considered environment obeys some general laws. Given that the agent knows the current state of the world and the laws, it is assumed to be able to predict the consequences of its actions up to the internal stochasticity of these laws. In practice, both the laws and the complete state representation are unavailable to the agent, which limits its forecasting capability.
In the sequel we will work within the setting with one more assumption of full observability. This simplification supposes that the agent can observe the complete world state, while in many real-life tasks only a part of the observations is actually available. This restriction of RL theory can be removed by considering Partially Observable Markov Decision Processes (POMDP), which basically forces learning algorithms to have some kind of memory mechanism to store previously received observations. Further on we will stick to the fully observable case.
1 this set is considered to be the same for all states of the environment without any loss of generality: if the agent chooses an invalid action, the world may remain in the same state with zero or negative reward signal, or stochastically select some valid action for it.
2.2. Environment model
Though the definition of a Markov Decision Process (MDP) varies from source to source, its essential meaning remains the same. The definition below utilizes several simplifications without loss of generality.²
Definition 1. A Markov Decision Process (MDP) is a tuple (S, A, T, r, s_0), where:
• S ⊆ R^d — state space.
• A — action space.
• T(s′ | s, a) : S × A → P(S) — transition probability.
• r : S → R — reward function.
• s_0 ∈ S — starting state.
It is important to notice that in the most general case the only things available to the RL algorithm beforehand are d (the dimension of the state space) and the action space A. The only possible way of collecting more information for the agent is to interact with the provided environment and observe s_0. It is obvious that the first choice of action a_0 will probably be random. While the environment responds by sampling s_1 ∼ p(s_1 | s_0, a_0), this distribution, defined by T and considered to be a part of the MDP, may be unavailable to the agent's learning procedure. What the agent does observe is s_1 and the reward signal r_1 := r(s_1), and this is the key information gathered by the agent from interaction experience.
Definition 2. The tuple (s_t, a_t, r_{t+1}, s_{t+1}) is called a transition. Several sequential transitions are usually referred to as a roll-out. The full track of observed quantities
s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3, . . .
is called a trajectory.
In the general case, the trajectory is infinite, which means that the interaction process is never-ending. However, in most practical cases the episodic property holds, which basically means that the interaction will eventually come to some sort of an end³. Formally, it can be simulated by the environment getting stuck in the last state with zero probability of transitioning to any other state and zero reward signal. Then it is convenient to reset the environment back to s_0 to initiate a new interaction. One such interaction cycle from s_0 until reset, spawning one trajectory of some finite length T, is called an episode. Without loss of generality, it can be considered that there exists a set of terminal states S⁺, which mark the ends of interactions. By convention, transitions (s_t, a_t, r_{t+1}, s_{t+1}) are accompanied by a binary flag done_{t+1} ∈ {0, 1}, indicating whether s_{t+1} belongs to S⁺. As the timestep t at which the transition was gathered is usually of no importance, transitions are often denoted as (s, a, r′, s′, done) with primes marking the «next timestep».
Note that the length T of an episode may vary between different interactions, but the episodic property holds if the interaction is guaranteed to end after some finite time T^max. If this is not the case, the task is called continuing.
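To make this interaction protocol concrete, below is a minimal sketch of collecting the transitions and episodes described above, assuming a Gym-style environment with the classic step signature (obs, reward, done, info); the environment name and the helper function are illustrative, not part of the original text.

```python
import gym  # assuming the classic Gym API: env.step returns (obs, reward, done, info)

def collect_episode(env, policy, max_steps=10_000):
    """Run one episode and return the list of transitions (s, a, r', s', done)."""
    transitions = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)                      # agent's choice of action given the observation
        s_next, r, done, _ = env.step(a)   # environment responds with s', r' and the done flag
        transitions.append((s, a, r, s_next, done))
        if done:
            break
        s = s_next
    return transitions

# example: one episode under a random policy
env = gym.make("CartPole-v1")
episode = collect_episode(env, policy=lambda s: env.action_space.sample())
```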
2.3. Objective
In reinforcement learning, the agent's goal is to maximize the cumulative reward. In the episodic case, this reward can be expressed as the sum of all received reward signals during one episode and is called the return:
R := Σ_{t=1}^{T} r_t    (1)
2 the reward function is often introduced as stochastic and dependent on the action a, i. e. R(r | s, a) : S × A → P(R), while instead of a fixed s_0 a distribution over S is given. Both extensions can be taken into account in terms of the presented definition by extending the state space and incorporating all the uncertainty into the transition probability T.
3 natural examples include the end of a game or the agent's failure/success in completing some task.
Note that this quantity is formally a random variable, which depends on the agent's choices and the outcomes of environment transitions. As this stochasticity is an inevitable part of the interaction process, the underlying distribution from which r_t is sampled must be properly introduced to rigorously set the task of return maximization.
Definition 3. The agent's algorithm for choosing a given the current state s, which in general can be viewed as a distribution π(a | s) over the domain A, is called a policy (strategy).
Each policy, together with the transition probabilities of the MDP, induces a distribution over trajectories s_0, a_0, s_1, a_1, s_2, a_2, . . ., which is denoted by T_π and called the trajectory distribution.
It is always substantial to keep track of which policy was used to collect certain transitions (roll-outs and episodes) during the learning procedure, as they are essentially samples from the corresponding trajectory distribution. If the policy is modified in any way, the trajectory distribution changes as well.
Now that a policy induces a trajectory distribution, it is possible to formulate the task of expected reward maximization:
E_{T_π} Σ_{t=1}^{T} r_t → max_π
To ensure the finiteness of this expectation and avoid the case when the agent is allowed to gather infinite reward, a limit on the absolute value of r_t can be assumed:
|r_t| ≤ R^max
Together with the limit on the episode length T^max, this restriction guarantees the finiteness of the optimal (maximal) expected reward.
To extend this intuition to continuing tasks, the reward for each next interaction step is multiplied by some discount coefficient γ ∈ [0, 1), which is often introduced as a part of the MDP. This corresponds to the logic that with probability 1 − γ the agent «dies» and does not gain any additional reward, which models the paradigm «better now than later». In practice, this discount factor is set very close to 1.
Definition 5. For a given MDP and policy π, the discounted expected reward is defined as
J(π) := E_{T_π} Σ_{t=0} γ^t r_{t+1}
The reinforcement learning task is to find an optimal policy π∗, which maximizes the discounted expected reward:
J(π) → max_π    (2)
2.4. Value functions
Solving the reinforcement learning task (2) usually leads to a policy that maximizes the expected reward not only for the starting state s_0, but for any state s ∈ S. This follows from the Markov property: the reward which is yet to be collected from some step t does not depend on the previous history, and for an agent staying at state s the task of behaving optimally is equivalent to maximization of the expected reward with the current state s as a starting state. This is the particular reason why many reinforcement learning algorithms seek not only an optimal policy, but also additional information about the usefulness of each state.
Definition 6. For a given MDP and policy π, the value function under policy π is defined as
V^π(s) := E_{T_π | s_0 = s} Σ_{t=0} γ^t r_{t+1}
This value function estimates how good it is for an agent utilizing strategy π to visit state s and generalizes the notion of the discounted expected reward J(π), which corresponds to V^π(s_0).
As a value function can be induced by any policy, the value function V^{π∗}(s) under an optimal policy π∗ can also be considered. By convention⁴, it is denoted by V∗(s) and is called the optimal value function.
Obtaining the optimal value function V∗(s) doesn't provide enough information to reconstruct some optimal policy π∗ due to the unknown world dynamics, i. e. transition probabilities. In other words, being blind to which state s′ may be the environment's response to a certain action in a given state makes knowing the optimal value function unhelpful. This intuition suggests introducing a similar notion comprising more information:
Definition 7. For a given MDP and policy π, the quality function (Q-function) under policy π is defined as
Q^π(s, a) := E_{T_π | s_0 = s, a_0 = a} Σ_{t=0} γ^t r_{t+1}
It directly follows from the definitions that these two functions are deeply interconnected:
V^π(s) = E_{a∼π(a|s)} Q^π(s, a)    (3)
In particular, the deterministic policy π(s) = argmax_a Q∗(s, a) is an optimal policy.
This result implies that instead of searching for an optimal policy π∗, an agent can search for the optimal Q-function and derive the policy from it.
Proposition 4. For any MDP, the existence of an optimal policy implies the existence of a deterministic optimal policy.
4 though an optimal policy may not be unique, the value functions under any optimal policy that behaves optimally from any given state (not only s_0) coincide. Yet, an optimal policy may not know the optimal behaviour for some states if it knows how to avoid them with probability 1.
2.5. Classes of algorithms
Reinforcement learning algorithms are presented in the form of computational procedures specifying a strategy of collecting interaction experience and obtaining a policy with as high J(π) as possible. They rarely include a stopping criterion like in classic optimization methods, as the stochasticity of the given setting prevents any reasonable verification of optimality; usually the number of iterations to perform is determined by the amount of computational resources. All reinforcement learning algorithms can be roughly divided into four⁵ classes:
• meta-heuristics: this class of algorithms treats the task as black-box optimization with a zeroth-order oracle. They usually generate a set of policies π_1 . . . π_P and launch several episodes of interaction for each to determine the best and worst policies according to the average return. After that they try to construct better policies using evolutionary or advanced random search techniques [15].
• policy gradient: these algorithms directly optimize (2), trying to obtain π∗ and no additional information about the MDP, using approximate estimations of the gradient with respect to policy parameters. They consider the RL task as optimization with a stochastic first-order oracle and make use of the interaction structure to lower the variance of gradient estimations. They will be discussed in sec. 5.
5 in many sources evolutionary algorithms are bypassed in the discussion as they do not utilize the structure of the RL task in any way.
3. Value-based algorithms
3.1. Temporal Difference learning
In this section we consider the temporal difference learning algorithm [21, Chapter 6], which is a classical Reinforcement Learning method at the base of the modern value-based approach in DRL.
The first idea behind this algorithm is to search for the optimal Q-function Q∗(s, a) by solving a system of recursive equations, which can be derived by recalling the interconnection between the Q-function and the value function (3):
Q^π(s, a) = E_{s′∼p(s′|s,a)} [ r(s′) + γ E_{a′∼π(a′|s′)} Q^π(s′, a′) ]    (5)
This equation, named the Bellman equation, remains true for value functions under any policy, including the optimal policy π∗:
Q∗(s, a) = E_{s′∼p(s′|s,a)} [ r(s′) + γ max_{a′} Q∗(s′, a′) ]    (6)
The straightforward utilization of this result is as follows. Consider the tabular case, when both the state space S and the action space A are finite (and small enough to be listed in computer memory). Let us also assume for now that the transition probabilities are available to the training procedure. Then Q∗(s, a) : S × A → R can be represented as a finite table with |S||A| numbers. In this case (6) just gives a set of |S||A| equations for this table to satisfy.
Addressing the values of the table as unknown variables, this system of equations can be solved using the basic point iteration method: let Q∗_0(s, a) be arbitrary initial values of the table (with the only exception that for terminal states s ∈ S⁺, if any, Q∗_0(s, a) = 0 for all actions a). On each iteration t the table is updated by substituting the current values of the table into the right side of the equation until the process converges:
Q∗_{t+1}(s, a) = E_{s′∼p(s′|s,a)} [ r(s′) + γ max_{a′} Q∗_t(s′, a′) ]    (7)
This straightforward approach to learning the optimal Q-function, named Q-learning, has been extensively studied in classical Reinforcement Learning. One of the central results is presented in the following convergence theorem: the update operator on the right side of (7), known as the Bellman optimality operator, is a contraction mapping in the sup-norm with factor γ. Therefore, there is a unique fixed point of the system of equations (7), and the point iteration method converges to it.
The contraction mapping property is actually of high importance. It demonstrates that the point iteration algorithm converges with exponential speed and requires a small number of iterations. As the true Q∗ is a fixed point of (6), the algorithm is guaranteed to yield a correct answer. The catch is that each iteration demands a full pass across all state-action pairs and exact computation of expectations over transition probabilities.
In the general case, these expectations can't be explicitly computed. Instead, the agent is restricted to samples from the transition probabilities gained during some interaction experience. The Temporal Difference (TD)⁶ algorithm proposes to collect this data using π_t = argmax_a Q∗_t(s, a) ≈ π∗ and after each gathered transition (s_t, a_t, r_{t+1}, s_{t+1}) to update only one cell of the table:
Q∗_{t+1}(s, a) = { (1 − α_t) Q∗_t(s, a) + α_t [ r_{t+1} + γ max_{a′} Q∗_t(s_{t+1}, a′) ]   if s = s_t, a = a_t
                   Q∗_t(s, a)   otherwise }    (8)
where α_t ∈ (0, 1) plays the role of an exponential smoothing parameter for estimating the expectation E_{s′∼p(s′|s_t,a_t)}(·) from samples.
Two key ideas are introduced in the update formula (8): exponential smoothing instead of exact expectation computation, and cell-by-cell updates instead of updating the full table at once. Both are required to adapt the Q-learning algorithm for online application.
As the set S⁺ of terminal states is usually unknown beforehand in the online setting, a slight modification of update (8) is used. If the observed next state s′ turns out to be terminal (recall the convention to denote this by the flag done), its value function is known to be equal to zero:
V∗(s′) = max_{a′} Q∗(s′, a′) = 0
so the corresponding term is simply dropped, and the update is conveniently rewritten in the incremental form:
Q∗_{t+1}(s, a) = { Q∗_t(s, a) + α_t [ r_{t+1} + γ max_{a′} Q∗_t(s_{t+1}, a′) − Q∗_t(s, a) ]   if s = s_t, a = a_t
                   Q∗_t(s, a)   otherwise }    (9)
The expression in the brackets, referred to as the temporal difference, represents the difference between the Q-value Q∗_t(s, a) and its one-step approximation r_{t+1} + γ max_{a′} Q∗_t(s_{t+1}, a′), which must be zero in expectation for the true optimal Q-function.
The idea of exponential smoothing allows us to formulate the first practical algorithm, which can work in the tabular case with unknown world dynamics:
Hyperparameters: α_t ∈ (0, 1)
3. update the table:
Q∗(s, a) ← Q∗(s, a) + α_t [ r′ + (1 − done) γ max_{a′} Q∗(s′, a′) − Q∗(s, a) ]
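The update above translates almost verbatim into code. Below is a minimal sketch of the tabular algorithm with an ε-greedy behaviour policy, assuming a Gym-style environment with enumerable states; all names and default hyperparameters are illustrative.

```python
import numpy as np

def tabular_td_q_learning(env, n_states, n_actions, episodes=500,
                          alpha=0.1, gamma=0.99, eps=0.1):
    """Tabular TD (Q-learning):
    Q(s,a) <- Q(s,a) + alpha * (r' + (1 - done) * gamma * max_a' Q(s',a') - Q(s,a))."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # eps-greedy action selection with respect to the current table
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, done, _ = env.step(a)
            # temporal difference update for the single visited cell
            target = r + (1.0 - float(done)) * gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```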
It turns out that under several assumptions on state visitation during the interaction process, this procedure retains similar convergence guarantees, which are stated by the following theorem:
6 also known as TD(0) due to theoretical generalizations
11
Proposition 7. [26] Let's define
e_t(s, a) = { α_t   if (s, a) is updated on step t
              0     otherwise }
If for every state-action pair (s, a)
Σ_t e_t(s, a) = ∞   and   Σ_t e_t(s, a)² < ∞,
then the tabular TD algorithm converges to Q∗ with probability 1.
This theorem states that the basic point iteration method can actually be applied online in the way proposed by the TD algorithm, but it demands «enough exploration» from the strategy of interacting with the MDP during training. Satisfying this demand remains a distinctive and common problem of reinforcement learning.
The widespread kludge is the ε-greedy strategy, which basically suggests choosing a random action instead of a = argmax_a Q∗(s, a) with probability ε_t. The probability ε_t is usually set close to 1 during the first interaction iterations and scheduled to decrease to a constant close to 0. This heuristic makes the agent visit all states with non-zero probability, independent of what the current approximation Q∗(s, a) suggests.
The main practical issue with the Temporal Difference algorithm is that it requires the table Q∗(s, a) to be explicitly stored in memory, which is impossible for MDPs with high state space complexity. This limitation substantially restricted its applicability until its combination with deep neural networks was proposed.
where s′ is a sample from p(s′ | s, a) and s, a is the input data. In this notation, (9) is equivalent to
θ_{t+1} = θ_t + α_t [ y(s, a) − Q∗(s, a, θ_t) ] e^{s,a}
where we multiplied the scalar value α_t [ y(s, a) − Q∗(s, a, θ_t) ] by the following vector e^{s,a}:
e^{s,a}_{i,j} := { 1   if (i, j) = (s, a)
                   0   if (i, j) ≠ (s, a) }
Indeed:
θ_{t+1} = θ_t − α_t ∂Loss(y, Q∗(s, a, θ_t)) / ∂θ    (13)
It is important that the dependence of y on θ is ignored during gradient computation (otherwise the chain rule application with y depending on θ would be incorrect). On each step of the temporal difference algorithm a new target y is constructed using the current Q-function approximation, and a new regression task with this target is set. For this fixed target, one MSE optimization step is done according to (13), and on the next step a new regression task is defined. Though during each step the target is treated as some ground truth, like in supervised learning, here it only provides a direction of optimization and for this reason is sometimes called a guess.
Notice that representation (13) is equivalent to the standard TD update (9), with all theoretical results remaining valid, as long as the parametric family Q(s, a, θ) is the family of table functions. At the same time, (13) can be formally applied to any parametric function family, including neural networks. It must be taken into account that this transition is not rigorous, and all theoretical guarantees provided by theorem 7 are lost at this moment.
Further on we assume that the optimal Q-function is approximated with a neural network Q∗_θ(s, a) with parameters θ. Note that for the discrete action space case this network may take only s as input and output |A| numbers representing Q∗_θ(s, a_1) . . . Q∗_θ(s, a_{|A|}), which allows finding an optimal action in a given state s with a single forward pass through the net. Therefore the target y for a given transition (s, a, r′, s′, done) can be computed with one forward pass, and an optimization step can be performed in one more forward⁷ and one backward pass.
A small issue with this straightforward approach is that, of course, it is impractical to train neural networks with batches of size 1. In [13] it is proposed to use an experience replay, storing all collected transitions (s, a, r′, s′, done) as data samples, and on each iteration to sample a batch of a size standard for neural network training. As usual, the loss function is assumed to be an average of the losses for each transition from the batch. This utilization of previously experienced transitions is legitimate because the TD algorithm is known to be an off-policy algorithm, which means it can work with arbitrary transitions gathered by any agent's interaction experience. One more important benefit of experience replay is sample decorrelation, as consecutive transitions from interaction are often similar to each other, since the agent is usually located in a particular part of the MDP.
Though the empirical results of the described algorithm turned out to be promising, the behaviour of the Q∗_θ values indicated the instability of the learning process. Reconstruction of the target after each optimization step led to a so-called compound error, when approximation error propagated from the close-to-terminal states to the starting ones in an avalanche manner and could lead to the guess being 10⁶ and more times bigger than the true Q∗ value. To address this problem, [13] introduced a kludge known as the target network, whose basic idea is to solve a fixed regression problem for K > 1 steps, i. e. to recompute the target every K-th step instead of on each one.
7 in implementations it is possible to combine s and s′ in one batch and perform these two forward passes «at once».
To avoid target recomputation for the whole experience replay, a copy of the neural network Q∗_θ is stored, called the target network. Its architecture is the same, while its weights θ⁻ are a copy of θ from the moment of the last target recomputation⁸, and its main purpose is to generate the targets y for the given current batch.
Combining all these things together and adding the ε-greedy strategy to facilitate exploration, we obtain the classic DQN algorithm:
6. compute loss:
Loss = (1/B) Σ_T ( Q∗(s, a, θ) − y(T) )²
7. make a step of gradient descent using ∂Loss/∂θ
8. if t mod K = 0: θ⁻ ← θ
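A sketch of one optimization step of such an algorithm is given below, assuming a PyTorch Q-network q_net, its target copy target_net, and a batch of transitions already converted to tensors; all names are illustrative rather than taken from any reference implementation.

```python
import torch
import torch.nn.functional as F

def dqn_step(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN gradient step on a batch of transitions (s, a, r', s', done)."""
    s, a, r, s_next, done = batch                     # a: int64 [B]; r, done: float [B]

    # Q*(s, a, theta) for the actions actually taken
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)

    # target y = r' + (1 - done) * gamma * max_a' Q*(s', a', theta^-); no gradient flows through it
    with torch.no_grad():
        q_next = target_net(s_next).max(dim=1).values
        y = r + (1.0 - done) * gamma * q_next

    loss = F.mse_loss(q, y)                           # (1/B) * sum_T (Q*(s, a, theta) - y(T))^2
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```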
y = r(s′) + γ max_{a′} Q∗(s′, a′, θ⁻)
During this estimation, the max shifts the Q-value estimate towards either those actions that led to high reward due to luck, or the actions with overestimating approximation error.
The solution proposed in [23] is based on the idea of separating action selection and action evaluation, so as to carry out each of these operations using its own approximation of Q∗:
max_{a′} Q∗(s′, a′, θ⁻) = Q∗(s′, argmax_{a′} Q∗(s′, a′, θ⁻), θ⁻) ≈
The simplest, but expensive, implementation of this idea is to run two independent DQN («Twin DQN») algorithms and use the twin network to evaluate actions:
y_1 = r(s′) + γ Q∗_1(s′, argmax_{a′} Q∗_2(s′, a′, θ_2⁻), θ_1⁻)
y_2 = r(s′) + γ Q∗_2(s′, argmax_{a′} Q∗_1(s′, a′, θ_1⁻), θ_2⁻)
Intuitively, each Q-function here may prefer lucky or overestimated actions, but the other Q-function judges them according to its own luck and approximation error, which may be underestimating as well as overestimating. Ideally, these two DQNs should not share interaction experience to achieve that, which makes such an algorithm twice as expensive both in terms of computational cost and sample efficiency.
Double DQN [23] is a more compromise-friendly option, which suggests using the current weights of the network θ for action selection and the target network weights θ⁻ for action evaluation, assuming that when the target network update frequency K is big enough, these two networks are sufficiently different:
y = r(s′) + γ Q∗(s′, argmax_{a′} Q∗(s′, a′, θ), θ⁻)
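A minimal sketch of this target computation in PyTorch, under the same assumptions as the DQN step sketch above (q_net holds θ, target_net holds θ⁻); names are illustrative.

```python
import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    """Double DQN target: the current network selects the action, the target network evaluates it."""
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)         # selection with theta
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)   # evaluation with theta^-
        return r + (1.0 - done) * gamma * q_eval
```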
Definition 8. For a given MDP and policy π, the advantage function under policy π is defined as
A^π(s, a) := Q^π(s, a) − V^π(s)
The advantage function is evidently interconnected with the Q-function and the value function, and actually shows the relative advantage of selecting action a compared to the average performance of the policy. If for some state A^π(s, a) > 0, then modifying π to select a more often in this particular state will lead to a better policy, as its average return will become bigger than the initial V^π(s). This follows from the following property of an arbitrary advantage function:
E_{a∼π(a|s)} A^π(s, a) = 0    (15)
In particular, the optimal Q-function decomposes into the optimal value function and the optimal advantage:
Q∗(s, a) = V∗(s) + A∗(s, a)    (16)
The straightforward utilization of this decomposition is the following: after several feature-extracting layers the network is split into two heads, one outputting a single scalar V∗(s) and one outputting |A| numbers A∗(s, a), like it was done in DQN for the Q-function. After that, this scalar value estimate is added to all components of A∗(s, a) in order to obtain Q∗(s, a) according to (16). The problem with this naive approach is that due to (15) the advantage function cannot be arbitrary and must satisfy this property for Q∗(s, a) to be identifiable.
This restriction (15) on the advantage function can be simplified for the case when the optimal policy is induced by the optimal Q-function:
max_a A∗(s, a) = 0
This condition can be easily satisfied in the computational graph by subtracting max_a A∗(s, a) from the advantage head. This is equivalent to the following formula of dueling DQN:
Q∗(s, a, θ) = V∗(s, θ) + A∗(s, a, θ) − max_{a′} A∗(s, a′, θ)    (17)
The interesting nuance of this improvement is that after evaluation on Atari-57 the authors discovered that substituting the max operation in (17) with averaging across actions led to better results (while usage of the unidentifiable formula (16) led to poor performance). Although gradients can be backpropagated through both operations and formula (17) seems theoretically justified, in practical implementations averaging instead of maximum is widespread.
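A sketch of such a dueling head in PyTorch is given below, using the widespread mean-subtraction variant; feature_dim, the hidden size and the layer structure are illustrative assumptions.

```python
import torch.nn as nn

class DuelingHead(nn.Module):
    """Dueling architecture: separate V(s) and A(s, a) streams combined into Q(s, a)."""
    def __init__(self, feature_dim, n_actions, hidden=128):
        super().__init__()
        self.value = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, 1))
        self.advantage = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU(),
                                       nn.Linear(hidden, n_actions))

    def forward(self, features):
        v = self.value(features)                       # [B, 1]
        a = self.advantage(features)                   # [B, n_actions]
        # mean subtraction instead of max, as is common in practice; replace
        # a.mean(...) with a.max(dim=1, keepdim=True).values for formula (17)
        return v + a - a.mean(dim=1, keepdim=True)
```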
In Noisy DQN, a standard linear layer of the Q-network
y(x) = W x + b
is replaced with a noisy layer, in which the weight matrix and the bias are perturbed by parameterized Gaussian noise resampled on each forward pass¹⁰:
y(x) = (µ_W + σ_W ⊙ ε_W) x + (µ_b + σ_b ⊙ ε_b),    ε_W, ε_b ∼ N(0, I)
where µ_W, σ_W, µ_b, σ_b are learnable parameters.
10 using the standard reparametrization trick
As the output of the Q-network now becomes a random variable, the loss value becomes a random variable too. Like in similar models for supervised learning, on each step an expectation of the loss function over the noise is minimized:
E_ε Loss(θ, ε) → min_θ
It can be seen that the amount of noise actually affecting the output of the network may vary for different inputs, i. e. for different states. There are no guarantees that this amount will reduce as the interaction proceeds; the behaviour of the average magnitude of noise injected into the network over time is reported to be extremely sensitive to the initialization of σ_W, σ_b and varies from MDP to MDP.
One technical issue with noisy layers is that each forward pass requires an excessive amount of noise samples (one per network parameter). This may substantially reduce the computational efficiency of a forward pass through the network. For optimization purposes it is proposed to obtain the noise for weight matrices in the following way: sample just n + m noise samples ε¹_W ∼ N(0, I_m), ε²_W ∼ N(0, I_n) and acquire the matrix noise in a factorized form:
ε_W = f(ε¹_W) f(ε²_W)^T
where f is a scaling function, e. g. f(x) = sign(x)·√|x|. The benefit of this procedure is that it requires m + n samples instead of mn, but it sacrifices the element-wise independence of the noise within the layer.
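Below is a sketch of such a factorized noisy linear layer in PyTorch; the initialization scheme and the sigma scale are illustrative assumptions rather than the exact scheme of the original paper.

```python
import torch
import torch.nn as nn

class FactorizedNoisyLinear(nn.Module):
    """y(x) = (mu_W + sigma_W * eps_W) x + (mu_b + sigma_b * eps_b) with factorized noise
    eps_W = f(eps_out) f(eps_in)^T, resampled on every forward pass."""
    def __init__(self, in_features, out_features, sigma_init=0.5):
        super().__init__()
        bound = 1.0 / in_features ** 0.5
        self.mu_w = nn.Parameter(torch.empty(out_features, in_features).uniform_(-bound, bound))
        self.sigma_w = nn.Parameter(torch.full((out_features, in_features), sigma_init * bound))
        self.mu_b = nn.Parameter(torch.empty(out_features).uniform_(-bound, bound))
        self.sigma_b = nn.Parameter(torch.full((out_features,), sigma_init * bound))
        self.in_features, self.out_features = in_features, out_features

    @staticmethod
    def _f(x):
        # scaling function f(x) = sign(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        eps_in = self._f(torch.randn(self.in_features, device=x.device))
        eps_out = self._f(torch.randn(self.out_features, device=x.device))
        w = self.mu_w + self.sigma_w * torch.outer(eps_out, eps_in)   # factorized eps_W
        b = self.mu_b + self.sigma_b * eps_out
        return x @ w.t() + b
```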
Using these priorities as a proxy for transition importance, sampling from the experience replay proceeds using the following probabilities:
P(T) ∝ ρ(T)^α
where the hyperparameter α ∈ R⁺ controls the degree to which the sampling distribution is sharpened: the case α = 0 corresponds to the uniform sampling distribution, while α = +∞ is equivalent to greedy sampling of the transitions with the highest priority.
The problem with the claim (18) is that each transition's priority changes after each network update. As it is impractical to recalculate the loss for the whole data after each step, some simplifications must be put up with. The straightforward option is to update the priorities only for the transitions sampled in the current batch. New transitions can be added to the experience replay with the highest priority, i. e. max_T ρ(T)¹¹.
A second debatable issue of prioritized replay is that it actually substitutes the loss function of DQN updates, which assumed uniform sampling of visited states to ensure they come from the state visitation distribution:
E_{T∼Uniform} Loss(T) → min_θ
11 which can be computed online with O(1) complexity
While it is not clear which distribution is better to sample from to ensure the exploration restrictions of theorem 7, prioritized experience replay changes this distribution in an uncontrollable way. Despite its fruitfulness at the beginning and midway of the training process, this distribution shift may destabilize learning close to the end and make the algorithm stuck with a locally optimal policy. Since formally this issue is about estimating an expectation over one probability distribution while preferring to sample from another one, the standard technique called importance sampling can be used as a countermeasure:
E_{T∼Uniform} Loss(T) = Σ_{i=1}^{M} (1/M) Loss(T_i) =
= Σ_{i=1}^{M} P(T_i) · (1 / (M P(T_i))) Loss(T_i) =
= E_{T∼P(T)} (1 / (M P(T))) Loss(T)
where M is the number of transitions stored in the experience replay memory. Importance sampling implies that we can avoid the distribution shift, which introduces undesired bias, by making smaller gradient updates for significant transitions, which now appear in the batches with higher frequency. The price for bias elimination is that the importance sampling weights lower the prioritization effect by slowing down the learning of the highlighted new information.
This duality resembles the bias-variance trade-off, but the important moment here is that the distribution shift does not cause any visible issues at the beginning of training, when the agent behaves close to randomly and does not produce a valid state visitation distribution anyway. The idea proposed in [16], based on this intuition, is to anneal the importance sampling weights so that they correct the bias properly only towards the end of the training procedure:
Loss^{prioritizedER} = E_{T∼P(T)} ( 1 / (M P(T)) )^{β(t)} Loss(T)
where β(t) ∈ [0, 1] approaches 1¹² as more interaction steps are executed. If β(t) is set to 0, no bias correction is performed, while β(t) = 1 corresponds to the unbiased loss function, i. e. equivalent to sampling from the uniform distribution.
The most significant and obvious drawback of the prioritized experience replay approach is that it introduces additional hyperparameters. Although α is just one number, the algorithm's behaviour may turn out to be sensitive to its choice, and β(t) must be designed by the engineer as some scheduled motion from a value near 0 to 1, whose well-tuned selection may require inaccessible knowledge about how many steps it will take for the algorithm to «warm up».
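A sketch of the sampling and weighting machinery described above, written with NumPy over a flat array of priorities (real implementations use a sum-tree for efficiency); the max-normalization of the weights is a common implementation detail and an assumption here, not something prescribed by the text.

```python
import numpy as np

def per_sample(priorities, batch_size, alpha=0.6, beta=0.4):
    """Sample indices with P(T) ~ rho(T)^alpha and return importance weights (1/(M P(T)))^beta."""
    rho = np.asarray(priorities, dtype=np.float64)
    probs = rho ** alpha
    probs /= probs.sum()
    m = len(rho)
    idx = np.random.choice(m, size=batch_size, p=probs)
    weights = (1.0 / (m * probs[idx])) ** beta
    weights /= weights.max()          # normalize for stability (common implementation detail)
    return idx, weights
```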
Indeed, the definition of Q∗(s, a) consists of the average return and can be viewed as making T^max steps from state s_0 after selecting action a_0, while the vanilla Bellman optimality equation represents Q∗(s, a) via the reward from one next step in the environment and an estimation of the rest of the trajectory reward, computed recursively. The N-step Bellman equation (19) generalizes these two opposites.
All the same reasoning as for DQN can be applied to the N-step Bellman equation to obtain the N-step DQN algorithm, whose only modification appears in the target computation:
y(s_0, a_0) = Σ_{t=1}^{N} γ^{t−1} r(s_t) + γ^N max_{a_N} Q∗(s_N, a_N, θ)    (20)
12 often it is initialized by a constant close to 0 and is linearly increased until it reaches 1
To perform this computation, we are required to obtain for a given state s and action a not just one next step, but N steps. To do so, N-step roll-outs are stored instead of transitions, which can be done by precomputing the following tuples:
T = ( s, a, Σ_{n=1}^{N} γ^{n−1} r^{(n)}, s^{(N)}, done )
where r^{(n)} is the reward received n steps after the visitation of the considered state s, s^{(N)} is the state visited N steps later, and done is a flag indicating whether the episode ended during the N-step roll-out¹³. All other aspects of the algorithm remain the same in practical implementations, and the case N = 1 corresponds to standard DQN.
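A sketch of precomputing such tuples from one episode of one-step transitions, in plain Python; it also handles the truncated roll-outs mentioned in footnote 13. Names are illustrative.

```python
def make_nstep_transitions(episode, n, gamma):
    """Turn one-step transitions (s, a, r', s', done) from a single episode into
    N-step tuples (s, a, sum_{k=1..N} gamma^{k-1} r^(k), s^(N), done)."""
    nstep = []
    T = len(episode)
    for t in range(T):
        s, a = episode[t][0], episode[t][1]
        ret, done = 0.0, False
        last_state = episode[t][3]
        for k in range(n):
            if t + k >= T:            # roll-out truncated by the end of the stored episode
                break
            _, _, r, s_next, d = episode[t + k]
            ret += (gamma ** k) * r
            last_state, done = s_next, d
            if d:                     # roll-out truncated by episode termination
                break
        nstep.append((s, a, ret, last_state, done))
    return nstep
```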
The goal of using N > 1 is to accelerate the propagation of reward from terminal states backwards through the visited states to s_0, as fewer update steps will be required to take into account freshly observed reward and optimize behaviour at the beginning of episodes. The price is that formula (20) hides an important catch: to calculate such a target, the actions on the second (and following) steps must be sampled from π∗ for the Bellman equation (19) to remain true. In other words, the application of N-step Q-learning is theoretically improper when the behaviour policy differs from π∗. Note that we do not face this problem in the case N = 1, in which we are required to sample only from the transition probability p(s′ | s, a) for a given state-action pair s, a.
Even considering π∗ ≈ argmax_a Q∗(s, a, θ), where Q∗(s, a, θ) is our current approximation, makes N-step DQN an on-policy algorithm, for which it is preferable to sample the target for every state-action pair s, a using the closest available approximation of π∗. This questions the usage of experience replay, or at the very least encourages limiting its capacity to store only the M^max newest transitions, with M^max being relatively small.
To see the negative effect of N-step DQN, consider the following toy example. Suppose the agent makes a mistake on the second step after s and ends the episode with a huge negative reward. Then in the case N > 2, each time the roll-out starting with this s is sampled in the batch, the value of Q∗(s, a, θ) will be updated with this received negative reward, even if Q∗(s′, ·, θ) has already learned not to repeat this mistake again.
Yet empirical results in many domains demonstrate that raising N from 1 to 2–3 may result in substantial acceleration of training and positively affect the final performance. On the contrary, the theoretical groundlessness of this approach explains its negative effects when N is set too big.
13 all N-step roll-outs must be considered, including those terminated at the k-th step for k < N.
4. Distributional approach for value-based methods
4.1. Theoretical foundations
The setting of the RL task inherently carries internal stochasticity over which the agent has no substantial control. Sometimes intelligent behaviour implies taking risks with a severe chance of low episode return. All this information resides in the distribution of the return R (1) as a random variable.
While value-based methods aim at learning the expectation of this random variable, as it is the quantity we actually care about, in the distributional approach [1] it is proposed to learn the whole distribution of returns. It further extends the information gathered by the algorithm about the MDP towards the model-based case, in which the whole MDP is imitated by learning both the reward function r(s) and the transitions T, but it still restricts itself only to the reward and does not intend to learn a world model.
In this section we discuss some theoretical extensions of temporal difference ideas to the case when the expectations on both sides of the Bellman equation (5) and the Bellman optimality equation (6) are removed.
The central object of study in Q-learning was the Q-function, which for a given state and action returns the expectation of the return. To rewrite the Bellman equation not in terms of expectations, but in terms of whole distributions, we require corresponding notation.
Definition 9. For a given MDP and policy π, the value distribution of policy π is a random variable defined as
Z^π(s, a) := Σ_{t=0} γ^t r_{t+1} | s_0 = s, a_0 = a
Note that Z^π just represents the random variable whose expectation is taken in the definition of the Q-function:
Q^π(s, a) = E_{T_π} Z^π(s, a)
Using this definition of the value distribution, the Bellman equation can be rewritten to extend the recursive connection between adjacent states from expectations of returns to the whole distributions of returns:
Z^π(s, a) =_{c.d.f.} r(s′) + γ Z^π(s′, a′),    s′ ∼ p(s′ | s, a), a′ ∼ π(a′ | s′)    (21)
Here we used some auxiliary notation: by =_{c.d.f.} we mean that the cumulative distribution functions of the random variables on the left and right sides are equal almost everywhere. Such equations are called recursive distributional equations and are well known in theoretical probability theory¹⁴. By | we describe the sampling procedure for the random variable on the right side of the equation: for given s, a the next state s′ is sampled from the transition probability, then a′ is sampled from the given policy, and then the random variable Z^π(s′, a′) is sampled to calculate a resulting sample r(s′) + γ Z^π(s′, a′).
While the space of Q-functions Q^π(s, a) ∈ S × A → R is finite-dimensional, the space of value distributions is a space of mappings from state-action pairs to continuous distributions:
Z^π(s, a) ∈ S × A → P(R)
and it is important to notice that even in the tabular case, when state and action spaces are finite, the space of value distributions is essentially infinite-dimensional. A crucial moment for us will be that convergence properties now depend on the chosen metric¹⁵.
The choice of a metric in S × A → P(R) represents the same issue as in the space of continuous random variables P(R): if we choose a metric in the latter, we can construct one in the former:
Proposition 10. If d(X, Y) is a metric in the space P(R), then
d̄(Z_1, Z_2) := sup_{s∈S, a∈A} d(Z_1(s, a), Z_2(s, a))
is a metric in the space S × A → P(R).
The particularly interesting example of a metric in P(R) for us will be the Wasserstein metric, which concerns only random variables with bounded moments, so we will additionally assume that for all state-action pairs s, a the moments
E |Z^π(s, a)|^p
are finite for p ≥ 1.
Proposition 11. For 1 ≤ p ≤ +∞ and two random variables X, Y on a continuous domain with bounded p-th moments and cumulative distribution functions F_X and F_Y correspondingly, the Wasserstein distance
W_p(X, Y) := ( ∫_0^1 | F_X^{−1}(ω) − F_Y^{−1}(ω) |^p dω )^{1/p}
W_∞(X, Y) := sup_{ω∈[0,1]} | F_X^{−1}(ω) − F_Y^{−1}(ω) |
is a metric.
Thus we can conclude from proposition 10 that the maximal form of the Wasserstein metric
W̄_p(Z_1, Z_2) = sup_{s∈S, a∈A} W_p(Z_1(s, a), Z_2(s, a))    (22)
is a metric in the space of value distributions.
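The quantile-function form of proposition 11 gives a direct way to compute W_p between two empirical distributions with equal sample counts, since their empirical quantile functions are step functions over ω ∈ [0, 1]; a NumPy sketch under that assumption:

```python
import numpy as np

def wasserstein_p(samples_x, samples_y, p=1):
    """W_p between two empirical distributions given equally many samples of each:
    the integral over omega reduces to comparing sorted samples."""
    x = np.sort(np.asarray(samples_x, dtype=np.float64))
    y = np.sort(np.asarray(samples_y, dtype=np.float64))
    assert x.shape == y.shape, "this sketch assumes equal sample counts"
    if np.isinf(p):
        return float(np.max(np.abs(x - y)))
    return float(np.mean(np.abs(x - y) ** p) ** (1.0 / p))
```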
One more curious theoretical result is that the distributional Bellman operator B is in general not a contraction mapping for such distances as the Kullback-Leibler divergence, the Total Variation distance and the Kolmogorov distance¹⁷. It shows
16 here we consider value distributions from a theoretical point of view, assuming that we are able to explicitly store a table of distributions
l_2(X, Y) = ( ∫_R ( F_X(ω) − F_Y(ω) )² dω )^{1/2}
where F_X, F_Y are the c.d.f. of random variables X, Y correspondingly.
that metric selection indeed influences convergence rate.
Similar to traditional value functions, we can define the optimal value distribution Z∗(s, a). Substituting¹⁸ π∗(s) = argmax_a E_{T_{π∗}} Z∗(s, a) into (21), we obtain the distributional Bellman optimality equation:
Z∗(s, a) =_{c.d.f.} r(s′) + γ Z∗(s′, argmax_{a′} E_{T_{π∗}} Z∗(s′, a′)),    s′ ∼ p(s′ | s, a)    (24)
Now we are concerned with the same question: whether the point iteration method of solving (24) leads to the solution Z∗ and whether it is a contraction mapping for some metric. The answer turns out to be negative.
Proposition 14. [1] Point iteration for solving (24) may diverge.
The level of impact of this result is not completely clear. Point iteration for (24) preserves the means of the distributions, i. e. it will eventually converge to Q∗(s, a) with all theoretical guarantees from classical Q-learning. The reason behind the divergence theorems hides in the rest of the distribution, such as other moments, and in situations when equivalent (in terms of average return) actions may lead to different higher moments.
Note that the expectation of Z∗_θ(s′, a′) is computed explicitly using the form of the chosen parametric family of distributions and the outputted parameters ζ_θ(s′, a′), as is the distribution of the random variable r′ + (1 − done)γ Z∗_θ(s′, a′). In other words, in this setting the guess y(T) is also a continuous random variable, whose distribution can be constructed only approximately. As both the target and the model output are distributions, it is reasonable to design the loss function in the form of some divergence D between y(T) and Z∗_θ(s, a):
Loss(θ) = E_T D( y(T), Z∗_θ(s, a) )    (26)
θ_{t+1} = θ_t − α ∂Loss(θ_t)/∂θ
18 to perform this step validly, a clarification concerning the argmax operator definition must be given. The choice of the action a returned by this operator in the cases when several actions lead to the same maximal average return must not depend on Z, as this choice affects higher moments of the resulting distribution. To overcome this issue, for example, in the case of a finite action space all actions can be enumerated and the optimal action with the lowest index is returned by the operator.
The particular choice of this divergence must be made with the concern that y(T) is a «sample» from a full one-step approximation of Z∗_θ which includes the transition probabilities:
y^{full}(s, a) :=_{c.d.f.} Σ_{s′∈S} p(s′ | s, a) y(s, a, r(s′), s′, done(s′))    (27)
This form is precisely the right side of the distributional Bellman optimality equation, as we have just incorporated the intermediate sampling of s′ into the value of the random variable. In other words, if the transition probabilities T were known, the update could be made using the distribution of y^{full} as a target.
This motivates choosing KL(y(T) ‖ Z∗_θ(s, a)) (specifically with this order of arguments) as D, to exploit the following property (we denote by p_X the p.d.f. of a random variable X):
∇_θ E_T KL( y^{full}(s, a) ‖ Z∗_θ(s, a) ) = ∇_θ E_T [ −∫_R p_{y^{full}(s,a)}(ω) log p_{Z∗_θ(s,a)}(ω) dω + const(θ) ] =
{using (27)} = ∇_θ E_T [ −∫_R E_{s′∼p(s′|s,a)} p_{y(T)}(ω) log p_{Z∗_θ(s,a)}(ω) dω ] =
{taking the expectation out} = ∇_θ E_T E_{s′∼p(s′|s,a)} [ −∫_R p_{y(T)}(ω) log p_{Z∗_θ(s,a)}(ω) dω ] =
= ∇_θ E_T E_{s′∼p(s′|s,a)} KL( y(T) ‖ Z∗_θ(s, a) )
This property basically states that the gradient of the loss function (26) with KL as D is an unbiased (Monte-Carlo) estimation of the gradient of the KL-divergence to the «full» distribution (27), which resembles the employment of exponential smoothing in temporal difference learning. For many other divergences, including the Wasserstein metric, the same statement is not true, so their utilization in the described online setting will lead to biased gradients, and all theory-grounded intuition that the algorithm moves in the right direction becomes distinctively lost. Moreover, KL-divergence is known to be one of the easiest divergences to work with due to its nice smoothness properties and wide prevalence in many deep learning pipelines.
The motivation described above for choosing KL-divergence as the actual objective for minimization is contradictory. Theoretical analysis of distributional Q-learning, specifically theorem 12, though concerning policy evaluation rather than the search for the optimal Z∗, explicitly hints that the process converges exponentially fast in the Wasserstein metric, while even for precisely made updates in terms of KL-divergence we are not guaranteed to get any closer to the true solution.
A more «practical» defect of KL-divergence is that it demands that the two compared distributions share the same domain. This means that by choosing KL-divergence we pledge to guarantee that y(T) and Z∗_θ(s, a) in (26) have coinciding support. This emerging restriction seems limiting even beforehand, as for episodic MDPs the value distribution in terminal states is obviously degenerate (their support consists of one point r(s), which is given all the probability mass), which means that our value distribution approximation is basically ensured to never be precise.
In Categorical DQN, as follows from the name, the family of distributions is chosen to be categorical on the fixed support {z_0, z_1 . . . z_{A−1}}, where A is the number of atoms. As no prior information about the MDP is given, the basic choice of this support is a uniform grid from some V_min ∈ R to V_max ∈ R:
z_i = V_min + (i / (A − 1)) (V_max − V_min),    i ∈ 0, 1, . . . A − 1
These bounds, though, must be chosen carefully, as they implicitly assume
V_min ≤ Z∗(s, a) ≤ V_max
and if these inequalities are not tight, the approximation will obviously become poor.
Therefore the neural network outputs A numbers, summing to 1, to represent an arbitrary distribution on this support:
ζ_i(s, a, θ) := P( Z∗_θ(s, a) = z_i )
Within this family of distributions, computation of the expectation, greedy action selection and KL-divergence are trivial. One problem hides in the target formula (25): while we can compute the distribution y(T), its support may in general differ from {z_0 . . . z_{A−1}}. To avoid the issue of disjoint supports, a projection step must be done to find the closest distribution to the target within the chosen family¹⁹. Therefore the resulting target used in the loss function is
y(T) :=_{c.d.f.} Π_C [ r′ + (1 − done) γ Z∗_θ( s′, argmax_{a′} E Z∗_θ(s′, a′) ) ]
1. select a randomly with probability ε(t), else a = argmax_a Σ_i z_i ζ∗_i(s, a, θ)
7. compute loss:
Loss = (1/B) Σ_T KL( y(T) ‖ Z∗(s, a, θ) )
8. make a step of gradient descent using ∂Loss/∂θ
9. if t mod K = 0: θ⁻ ← θ
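A sketch of the projection step Π_C in NumPy: each atom of the shifted and scaled target support is clipped to [V_min, V_max] and its probability mass is split between the two nearest atoms of the fixed grid. The function signature is illustrative.

```python
import numpy as np

def project_categorical(target_support, target_probs, v_min, v_max, n_atoms):
    """Project a categorical distribution with arbitrary support onto the uniform grid
    z_i = v_min + i * (v_max - v_min) / (n_atoms - 1)."""
    delta = (v_max - v_min) / (n_atoms - 1)
    projected = np.zeros(n_atoms)
    for x, p in zip(target_support, target_probs):
        x = np.clip(x, v_min, v_max)
        b = (x - v_min) / delta                    # fractional index of x on the grid
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                               # x lands exactly on an atom
            projected[lo] += p
        else:                                      # split the mass between the two neighbours
            projected[lo] += p * (hi - b)
            projected[hi] += p * (b - lo)
    return projected

# usage sketch: y = project_categorical(r + (1 - done) * gamma * z, zeta_target, v_min, v_max, len(z))
```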
The latter can be done in the online setting using the quantile regression technique. This led to an alternative distributional Q-learning algorithm named Quantile Regression DQN (QR-DQN).
The basic idea is to «swap» the fixed support and the learned probabilities of Categorical DQN. We will now consider the family of A-atomed categorical distributions with fixed probabilities and arbitrary support {ζ∗_0(s, a, θ), ζ∗_1(s, a, θ), . . . , ζ∗_{A−1}(s, a, θ)}. Again, we will assume all probabilities to be equal, given the absence of any prior knowledge; namely, our distribution family is now
Z∗_θ(s, a) ∼ Uniform{ ζ∗_0(s, a, θ), . . . , ζ∗_{A−1}(s, a, θ) }
In this setting the neural network outputs A arbitrary real numbers that represent the support of a uniform categorical distribution²⁰, where A is the number of atoms and the only hyperparameter to select.
For the tabular setting, on each step of point iteration we desire to update the cell for a given state-action pair s, a with the full distribution of the random variable on the right side of (24). If we are limited to storing only A atoms of the support, the true distribution must be projected onto the space of A-atomed categorical distributions. Consider now this task of projecting some given random variable with c.d.f. F(ω) in terms of the Wasserstein distance. Specifically, we will be interested in minimizing the W_1-distance for p = 1, as theorem 12 states the contraction property for all 1 ≤ p ≤ +∞ and we are free to choose any:
∫_0^1 | F^{−1}(ω) − U^{−1}_{z_0, z_1 ... z_{A−1}}(ω) | dω → min_{z_0, z_1 ... z_{A−1}}    (28)
where U_{z_0, z_1 ... z_{A−1}} is the c.d.f. of the uniform categorical distribution on the given support. Its inverse, also known as the quantile function, has the following simple form:
U^{−1}_{z_0, z_1 ... z_{A−1}}(ω) = { z_0,      0 ≤ ω < 1/A
                                     z_1,      1/A ≤ ω < 2/A
                                     . . .
                                     z_{A−1},  (A−1)/A ≤ ω < 1 }
Substituting this piecewise-constant form into (28) splits the optimization of the Wasserstein distance into A independent tasks that can be solved separately:
∫_{i/A}^{(i+1)/A} | F^{−1}(ω) − z_i | dω → min_{z_i}    (29)
The result 15 states that we require only A specific quantiles of the random variable on the right side of the Bellman equation²¹. Hence the last thing to do to design a practical algorithm is to develop a procedure for unbiased estimation of the quantiles of the random variable on the right side of the distributional Bellman optimality equation (24).
20 Note that the target distribution is now guaranteed to remain within this distribution family, as multiplying by γ just shrinks the support and adding r′ just shifts it. We assume that if some atoms of the support coincide, the distribution is still an A-atomed categorical one; for example, for a degenerate distribution (like in the case of terminal states) ζ∗_0(s, a, θ) = ζ∗_1(s, a, θ) = · · · = ζ∗_{A−1}(s, a, θ). This shows that the projection step heuristic is not needed for this particular choice of distribution family.
21 It can be proved that the tabular policy evaluation algorithm which stores in each cell not expectations of reward (as in Q-learning) but A quantiles updated according to the distributional Bellman equation (21) using theorem 15 converges to the quantiles of Z∗(s, a) in the Wasserstein metric for 1 ≤ p ≤ +∞, and its update operator is a contraction mapping in W_∞.
Quantile regression is the standard technique to estimate the quantiles of an empirical distribution (i. e. a distribution represented by a finite amount of i. i. d. samples from it). Recall from machine learning that the constant solution optimizing the l1-loss is the median, i. e. the ½-th quantile. This fact can be generalized to arbitrary quantiles:
As usual in the case of neural networks, it is impractical to optimize (30) until convergence on each iteration for each of the A desired quantiles τ_i. Instead, just one step of gradient optimization is made, and the outputs of the neural network ζ∗_i(s, a, θ), which play the role of c in formula (30), are moved towards the quantile estimates via backpropagation. In other words, (30) sets a loss function for the network outputs; the losses for different quantiles are summed up. The resulting loss is
Loss^{QR}(s, a, θ) = Σ_{i=0}^{A−1} E_{s′∼p(s′|s,a)} E_{y∼y(T)} ( τ_i − I[ζ∗_i(s, a, θ) < y] ) ( ζ∗_i(s, a, θ) − y )    (31)
where I denotes the indicator function. The expectation over y ∼ y(T) for a given transition can be computed in closed form: indeed, y(T) is also an A-atomed categorical distribution with support {r′ + γζ∗_0(s′, a′), . . . , r′ + γζ∗_{A−1}(s′, a′)}, where
a′ = argmax_{a′} E Z∗(s′, a′, θ) = argmax_{a′} (1/A) Σ_i ζ∗_i(s′, a′, θ)
and the expectation over transition probabilities, as always, is estimated using Monte-Carlo by sampling transitions from the experience replay.
5. for each transition T from the batch, compute the support of the target distribution:
y(T)_j = r′ + γ ζ∗_j( s′, argmax_{a′} (1/A) Σ_i ζ∗_i(s′, a′, θ⁻), θ⁻ )
6. compute loss:
Loss = (1/(BA)) Σ_T Σ_i Σ_j ( τ_i − I[ζ∗_i(s, a, θ) < y(T)_j] ) ( ζ∗_i(s, a, θ) − y(T)_j )
7. make a step of gradient descent using ∂Loss/∂θ
8. if t mod K = 0: θ⁻ ← θ
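A PyTorch sketch of the loss from step 6, written in exactly the sign convention above and assuming the quantile levels τ_i are taken at the midpoints (i + 0.5)/A, which the surrounding text does not state explicitly; zeta and y are [B, A] tensors of predicted quantiles and target atoms.

```python
import torch

def quantile_regression_loss(zeta, y):
    """(1/(B*A)) * sum_T sum_i sum_j (tau_i - I[zeta_i < y_j]) * (zeta_i - y_j)."""
    B, A = zeta.shape
    tau = (torch.arange(A, dtype=zeta.dtype, device=zeta.device) + 0.5) / A
    diff = zeta.unsqueeze(2) - y.unsqueeze(1).detach()       # [B, A, A]: zeta_i - y_j
    indicator = (diff < 0).to(zeta.dtype)                    # I[zeta_i < y_j]
    loss = (tau.view(1, A, 1) - indicator) * diff
    return loss.sum() / (B * A)
```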
To combine noisy networks with the double DQN heuristic, it is proposed to resample the noise on each forward pass through the network and through its copy used for target computation. This decision implies that action selection, action evaluation and network utilization are independent and stochastic (for exploration cultivation) steps.
The one snagging combination here is categorical DQN and dueling DQN. To merge these ideas,
we need to model advantage A∗ (s, a, θ) in distributional setting. In Rainbow this is done straight-
forwardly: the network has two heads, value stream v(s, θ) outputting A real values and advantage
stream a(s, a, θ) outputting A × |A| real values. Then these streams are integrated using the same
formula (17) with the only exception being softmax applied across atoms dimension to guarantee
that output is categorical distribution:
$$\zeta_i^*(s, a, \theta) \propto \exp\left( v(s, \theta)_i + a(s, a, \theta)_i - \frac{1}{|\mathcal{A}|} \sum_{a} a(s, a, \theta)_i \right) \qquad (32)$$
Combining lack of intuition behind this integration formula with usage of mean instead of theo-
retically justified max makes this element of Rainbow the most questionable. During the ablation
studies it was discovered that dueling architecture is the only component that can be removed with-
out noticeable loss of performance. All other ingredients are believed to be crucial for resulting
algorithm as they address different problems.
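As an illustration of formula (32), a minimal PyTorch sketch of such a head (layer sizes and names are our own; in Rainbow the dense layers would additionally be noisy):

    import torch
    import torch.nn as nn

    class DistributionalDuelingHead(nn.Module):
        """Sketch of the dueling head for categorical DQN, formula (32)."""

        def __init__(self, feature_dim, n_actions, n_atoms):
            super().__init__()
            self.n_actions, self.n_atoms = n_actions, n_atoms
            self.value_stream = nn.Linear(feature_dim, n_atoms)                   # v(s): A values
            self.advantage_stream = nn.Linear(feature_dim, n_actions * n_atoms)   # a(s, a): A x |A| values

        def forward(self, features):
            v = self.value_stream(features).unsqueeze(1)                          # (B, 1, atoms)
            a = self.advantage_stream(features).view(-1, self.n_actions, self.n_atoms)
            logits = v + a - a.mean(dim=1, keepdim=True)                          # mean over actions, per atom
            return torch.softmax(logits, dim=2)                                   # softmax across the atoms dimension

    # usage sketch
    head = DistributionalDuelingHead(feature_dim=512, n_actions=6, n_atoms=51)
    probs = head(torch.randn(8, 512))   # (8, 6, 51); each action's row sums to 1 over atoms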
22 Quantile Regression can be considered instead
Algorithm 5: Rainbow DQN
4. sample batch of size B from experience replay using probabilities P(T ) ∝ ρ(T )α
5. compute weights for the batch (where M is the size of experience replay memory):

$$w(T) = \left( \frac{1}{M \, P(T)} \right)^{\beta(t)}$$
6. for each transition T = (s, a, r̄, s̄, done) from the batch compute the target (detached from the computational graph to prevent backpropagation):

$$\varepsilon_1, \varepsilon_2 \sim \mathcal{N}(0, I)$$

$$P\big(y(T) = \bar{r} + \gamma^N z_i\big) = \zeta_i^*\!\left( \bar{s}, \; \operatorname*{argmax}_{\bar{a}} \sum_i z_i \, \zeta_i^*(\bar{s}, \bar{a}, \theta, \varepsilon_1), \; \theta^-, \varepsilon_2 \right)$$
9. compute loss:
$$\mathrm{Loss} = \frac{1}{B} \sum_T w(T) \, \rho(T)$$
10. make a step of gradient descent using ∂Loss/∂θ
11. if t mod K = 0: θ − ← θ
5. Policy Gradient algorithms
5.1. Policy Gradient theorem
An alternative approach to solving the RL task is direct optimization of the objective

$$J(\theta) = \mathbb{E}_{T \sim \pi_\theta} \sum_{t=1} \gamma^{t-1} r_t \;\to\; \max_\theta \qquad (33)$$
as a function of θ. Policy gradient methods provide a framework for constructing an efficient optimization procedure based on stochastic first-order optimization within the RL setting.
We will assume that π_θ(a | s) is a stochastic policy parameterized by θ ∈ Θ. It turns out that if π is differentiable with respect to θ, then so is our goal (33). We now discuss the technique of derivative calculation, which is based on the log-derivative trick:

$$\nabla_\theta \log \pi_\theta(a) = \frac{\nabla_\theta \pi_\theta(a)}{\pi_\theta(a)} \qquad (34)$$

In its most general form, this trick allows us to derive the gradient of the expectation of an arbitrary function f(a, θ): A × Θ → R, differentiable with respect to θ, over some distribution π_θ(a), also parameterized by θ:
$$\begin{aligned}
\nabla_\theta \, \mathbb{E}_{a \sim \pi_\theta(a)} f(a, \theta) &= \nabla_\theta \int_\mathcal{A} \pi_\theta(a) f(a, \theta) \, da = \\
&= \int_\mathcal{A} \nabla_\theta \left[ \pi_\theta(a) f(a, \theta) \right] da = \\
\{\text{product rule}\} &= \int_\mathcal{A} \left[ \nabla_\theta \pi_\theta(a) f(a, \theta) + \pi_\theta(a) \nabla_\theta f(a, \theta) \right] da = \\
&= \int_\mathcal{A} \nabla_\theta \pi_\theta(a) f(a, \theta) \, da + \mathbb{E}_{\pi_\theta(a)} \nabla_\theta f(a, \theta) = \\
\{\text{log-derivative trick (34)}\} &= \int_\mathcal{A} \pi_\theta(a) \nabla_\theta \log \pi_\theta(a) f(a, \theta) \, da + \mathbb{E}_{\pi_\theta(a)} \nabla_\theta f(a, \theta) = \\
&= \mathbb{E}_{\pi_\theta(a)} \nabla_\theta \log \pi_\theta(a) f(a, \theta) + \mathbb{E}_{\pi_\theta(a)} \nabla_\theta f(a, \theta)
\end{aligned}$$
This technique can be applied sequentially (to expectations over πθ (a0 | s0 ), πθ (a1 | s1 ) and
so on) to obtain the gradient ∇θ J (πθ ).
Proposition 18. (Policy Gradient Theorem) [22] For any MDP and differentiable policy πθ the
gradient of objective (33) is
$$\nabla_\theta J(\theta) = \mathbb{E}_{T \sim \pi_\theta} \sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, Q^\pi(s_t, a_t) \qquad (35)$$
For future reference, we will need another form of formula (35), which provides a different point of view. For this purpose, let us define a state visitation frequency:

Definition 10. For a given MDP and a given policy π its state visitation frequency is defined by

$$d^\pi(s) := \sum_{t=0} P(s_t = s)$$
State visitation frequencies, if normalized, represent the marginalized probability for the agent to land in a given state s. This quantity is rarely learned explicitly, but it assists theoretical study, as it allows us to rewrite expectations over trajectories with the intrinsic and extrinsic randomness of the decision-making process separated:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^\pi(s)} \mathbb{E}_{a \sim \pi_\theta(a \mid s)} \nabla_\theta \log \pi_\theta(a \mid s) \, Q^\pi(s, a) \qquad (37)$$

A second important thing worth mentioning is that Q^π(s, a) is present in the gradient. Note that it is never available to the algorithm and must also be estimated somehow.
5.2. REINFORCE
REINFORCE [27] provides a straightforward approach to approximately calculate the gradient (35)
in episodic case using Monte-Carlo estimation: N games are played and Q-function under policy π
is approximated with corresponding return:
$$\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_T \sum_{t=0} \left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( \sum_{t' = t} \gamma^{t' - t} r_{t'+1} \right) \right] \qquad (38)$$
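A minimal PyTorch sketch of how this estimate is usually turned into a surrogate loss for one episode (the function name and interface are our own assumption; averaging over N episodes would average such losses):

    import torch

    def reinforce_loss(log_probs, rewards, gamma=0.99):
        """Surrogate loss whose gradient matches the REINFORCE estimate (38) for one episode.

        log_probs: list of log pi_theta(a_t | s_t) tensors collected while playing
        rewards:   list of rewards r_{t+1} observed after each action
        """
        returns, g = [], 0.0
        for r in reversed(rewards):              # reward-to-go: sum_{t' >= t} gamma^(t'-t) r_{t'+1}
            g = r + gamma * g
            returns.append(g)
        returns.reverse()
        returns = torch.tensor(returns)
        log_probs = torch.stack(log_probs)
        # minimizing this loss performs gradient ascent on J(theta)
        return -(log_probs * returns).sum()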
A standard way to reduce the variance of this estimate is to subtract a baseline from the return: since E_{a∼π_θ(a|s)} ∇_θ log π_θ(a | s) = 0, subtracting a constant does not bias the gradient. Notice that the constant here must be independent of a, but may depend on s. Application of this technique to our case provides the following result²³:
23 this result can be generalized by introducing different baselines for estimation of different components of ∇θ J(θ).
Proposition 19. For any arbitrary function b(s) : S → R, called baseline:
$$\nabla_\theta J(\theta) = \mathbb{E}_{T \sim \pi_\theta} \sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \left( Q^\pi(s_t, a_t) - b(s_t) \right)$$
Selection of the baseline is up to us as long as it does not depend on actions at . The intent is to
choose it in a way that minimizes the variance.
It is believed that the high variance of (38) originates from the multiplication by Q^π(s, a), which may have arbitrary scale (e.g. lie in the range [100, 200]), while ∇_θ log π_θ(a_t | s_t) naturally has varying signs²⁴. To reduce the variance, the baseline must be chosen so that the absolute values of the expression inside the expectation are shifted towards zero. The optimal baseline is provided by the following theorem:
Proposition 20. The optimal baseline, minimizing the variance of the gradient estimate, is given by

$$b(s) = \frac{\mathbb{E}_{a \sim \pi_\theta(a \mid s)} \| \nabla_\theta \log \pi_\theta(a \mid s) \|_2^2 \, Q^\pi(s, a)}{\mathbb{E}_{a \sim \pi_\theta(a \mid s)} \| \nabla_\theta \log \pi_\theta(a \mid s) \|_2^2} \qquad (39)$$
As can be seen, optimal baseline calculation involves expectations which again can only be com-
puted (in most cases) using Monte-Carlo (both for numerator and denominator). For that purpose,
for every visited state s estimations of Qπ (s, a) are needed for all (or some) actions a, as otherwise
estimation of baseline will coincide with estimation of Qπ (s, a) and collapse gradient to zero. Prac-
tical utilization of result (39) is to consider a constant baseline independent of s with similar optimal
form:
$$b = \frac{\mathbb{E}_{T \sim \pi_\theta} \sum_{t=0} \| \nabla_\theta \log \pi_\theta(a_t \mid s_t) \|_2^2 \, Q^\pi(s_t, a_t)}{\mathbb{E}_{T \sim \pi_\theta} \sum_{t=0} \| \nabla_\theta \log \pi_\theta(a_t \mid s_t) \|_2^2}$$
which can be profitably estimated via Monte-Carlo.
Utilization of some kind of baseline, not necessarily optimal, is known to significantly reduce the
variance of gradient estimation and is an essential part of any policy gradient method. The final step
to make this family of algorithms applicable when using deep neural networks is to reduce variance
of Qπ estimation by employing RL task structure like it was done in value-based methods.
Substituting this baseline into the gradient formula (37) and recalling the definition of the advantage function (14), the gradient can now be rewritten as follows:

$$\nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^\pi(s)} \mathbb{E}_{a \sim \pi_\theta(a \mid s)} \nabla_\theta \log \pi_\theta(a \mid s) \, A^\pi(s, a) \qquad (40)$$

This representation of the gradient is used as the basis for most policy gradient algorithms, as it
offers lower variance while selecting the baseline expressed in terms of value functions which can be
efficiently learned similar to how it was done in value-based methods. Such algorithms are usually
named Actor-Critic as they consist of two neural networks: πθ (a | s), representing a policy, called an
actor, and Vφπ (s) with parameters φ, approximately estimating actor’s performance, called a critic.
Note that the choice of value function to learn can be arbitrary; it is possible to learn Qπ or Aπ
instead, as all of them are deeply interconnected. Value function V π is chosen as the simplest one
since it depends only on state and thus is hoped to be easier to learn.
24 this follows, for example, from baseline derivation
Having a critic V_φ^π(s), the Q-function can be approximated in the following way:

$$Q^\pi(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r' + \gamma V^\pi(s') \right] \approx r' + \gamma V^\pi(s') \approx r' + \gamma V_\phi^\pi(s')$$

The first approximation is done using Monte-Carlo, while the second approximation inevitably introduces bias. An important thing to notice is that at this moment our gradient estimation stops being unbiased, and all theoretical guarantees of convergence are once again lost.
The advantage function can therefore be obtained according to its definition:

$$A^\pi(s, a) \approx r' + \gamma V_\phi^\pi(s') - V_\phi^\pi(s) \qquad (41)$$
Note that biased estimation of baseline doesn’t make gradient estimation biased by itself, as baseline
can be an arbitrary function of state. All bias introduction happens inside the approximation of Qπ .
It is possible to use critic only for baseline, which allows complete avoidance of bias, but then the
only way to estimate Qπ is via playing several games and using corresponding returns, which suffers
from higher variance and low sample efficiency.
The logic behind the training procedure for the critic is taken from value-based methods: for a given policy π its value function can be obtained by fixed-point iteration, computing the target

$$y = r' + \gamma V_\phi^\pi(s')$$

and then minimizing the MSE to move the values of V_φ^π(s) towards this guess.
Notice that to compute the target for critic we require samples from the policy π which is being
evaluated. Although actor evolves throughout optimization process, we assume that one update of
policy π does not lead to significant change of true V π and thus our critic, which approximates value
function for an older version of the policy, is close enough to construct the target. But if samples from, for example, an old policy are used to compute the guess, the critic update step will correspond to learning the value function for that old policy rather than the current one. Essentially, this means that both actor and
critic training procedures require samples from current policy π , making Actor-Critic algorithm on-
policy by design. Consequently, samples that were collected on previous update iterations become
useless and can be forgotten. This is the key reason why policy gradient algorithms are usually less
sample-efficient than value-based.
Now as we have an approximation of value function, advantage estimation can be done using
one-step transitions (41). As the procedure of training an actor, i. .e. gradient estimation (40), also
does not demand sampling the whole trajectory, each update now requires only a small roll-out to
be sampled. The amount of transitions in the roll-out corresponds to the size of mini-batch.
The problem with roll-outs is that the data is obviously not i. i. d., which is crucial for training
networks. In value-based methods, this problem was solved with experience replay, but in policy
gradient algorithms it is essential to collect samples from scratch after each update of the networks
parameters. The practical solution for simulated environments is to launch several instances of
environment (for example, on different cores of multiprocessor) in parallel threads and have several
parallel interactions. After several steps in each environment, the batch for update is collected by
uniting transitions from all instances and one synchronous25 update of networks parameters θ and
φ is performed.
One more optimization that can be done is to partially share weights of networks θ and φ. It is
justified as first layers of both networks correspond to basic features extraction and these features
are likely to be the same for optimal policy and value function. While it reduces the number of train-
ing parameters almost twice, it might destabilize learning process as the scales of gradient (40) and
gradient of critic’s MSE loss may be significantly different, so they should be balanced with additional
hyperparameter.
25 there is also an asynchronous modification of advantage actor critic algorithm (A3C) which accelerates the training process
by storing a copy of network for each thread and performing weights synchronization from time to time.
Algorithm 6: Advantage Actor-Critic (A2C)
Hyperparameters: B — batch size, Vφ∗ — critic neural network, πθ — actor neural network,
α — critic loss scaling, SGD optimizer.
$$\nabla_{\mathrm{actor}} = \frac{1}{B} \sum_T \nabla_\theta \log \pi_\theta(a \mid s) \, A^\pi(T)$$
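A rough sketch of how such an update is commonly assembled in PyTorch (the coefficient names and exact loss composition are our own illustration, not the exact implementation used in the experiments):

    import torch
    import torch.nn.functional as F

    def a2c_loss(policy_logits, values, actions, returns, critic_coef=0.5, entropy_coef=0.01):
        """One A2C update on a roll-out batch (sketch).

        policy_logits: (B, num_actions) actor outputs
        values:        (B,) critic outputs V_phi(s)
        actions:       (B,) long tensor of actions taken in the roll-out
        returns:       (B,) targets y = r' + gamma * V_phi(s'), detached
        """
        advantages = (returns - values).detach()              # advantage estimate, no gradient through critic here
        log_probs = F.log_softmax(policy_logits, dim=-1)
        chosen = log_probs.gather(1, actions.unsqueeze(1)).squeeze(1)
        actor_loss = -(chosen * advantages).mean()            # gradient matches the nabla_actor formula above
        critic_loss = F.mse_loss(values, returns)             # move V_phi(s) towards the target
        entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
        return actor_loss + critic_coef * critic_loss - entropy_coef * entropy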
A natural family of advantage estimators uses N-step targets:

$$A^\pi_{(N)}(s_t, a_t) := \sum_{n=0}^{N-1} \gamma^n r_{t+n+1} + \gamma^N V_\phi^\pi(s_{t+N}) - V_\phi^\pi(s_t)$$

For N = 1 this estimation corresponds to the Actor-Critic one-step estimation with high bias and low variance. For N = ∞ it yields the estimator with the critic used only for the baseline, with no bias and high variance. Intermediate values correspond to something in between. Note that to use the N-step advantage estimation we have to perform N steps of interaction after a given state-action pair.
Usually finding a good value for N as hyperparameter is difficult as its «optimal» value may float
throughout the learning process. In Generalized Advantage Estimation (GAE) [18] it is proposed to
construct an ensemble out of different N -step advantage estimators using exponential smoothing
with some hyperparameter λ:
$$A^\pi_{GAE}(s, a) := (1 - \lambda) \left( A^\pi_{(1)}(s, a) + \lambda A^\pi_{(2)}(s, a) + \lambda^2 A^\pi_{(3)}(s, a) + \dots \right) \qquad (42)$$
Here the parameter λ ∈ [0, 1] allows smooth control over bias-variance trade-off: λ = 0 corre-
sponds to Actor-Critic with higher bias and lower variance while λ → 1 corresponds to REINFORCE
with no bias and high variance. But unlike N as hyperparameter, it uses mix of different estimators
in intermediate case.
GAE proved to be a convenient way to extract more information from the collected roll-out in practice. Instead of waiting for episode termination to compute (42), we may use a «truncated» GAE which ensembles only those N-step advantage estimators that are available:
$$A^\pi_{\mathrm{trunc.GAE}}(s, a) := \frac{A^\pi_{(1)}(s, a) + \lambda A^\pi_{(2)}(s, a) + \lambda^2 A^\pi_{(3)}(s, a) + \dots + \lambda^{N-1} A^\pi_{(N)}(s, a)}{1 + \lambda + \lambda^2 + \dots + \lambda^{N-1}}$$
Note that the amount N of available estimators may be different for different transitions from roll-
out: if we performed K steps of interaction in some instance of environment starting from some
state-action pair s, a, we can use N = K step estimators; for next state-action pair s0 , a0 we have
only N = K −1 transitions and so on, while the last state-action pair sN −1 , aN −1 can be estimated
only using Aπ (1) as only N = 1 following transition is available. Although different transitions are
estimated with different precision (leading to different bias and variance), this approach allows to use
all available information for each transition and utilize multi-step approximations without dropping
last transitions of roll-outs used only for target computation.
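A minimal NumPy sketch of this truncated GAE computation (function names and the roll-out layout are our own; practical implementations usually compute an equivalent quantity recursively through TD residuals, which is much cheaper than this quadratic-time version):

    import numpy as np

    def n_step_advantage(rewards, values, last_value, t, n, gamma):
        """A^pi_(n)(s_t, a_t) = sum_{l=0}^{n-1} gamma^l r_{t+l+1} + gamma^n V(s_{t+n}) - V(s_t)."""
        T = len(rewards)
        ret = sum(gamma ** l * rewards[t + l] for l in range(n))
        bootstrap = last_value if t + n == T else values[t + n]
        return ret + gamma ** n * bootstrap - values[t]

    def truncated_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
        """For position t of a roll-out of length T, ensemble the N = T - t available
        n-step estimators with weights 1, lam, lam^2, ... normalized by their sum."""
        T = len(rewards)
        advantages = np.zeros(T)
        for t in range(T):
            N = T - t
            weights = np.array([lam ** (n - 1) for n in range(1, N + 1)])
            estimators = np.array([n_step_advantage(rewards, values, last_value, t, n, gamma)
                                   for n in range(1, N + 1)])
            advantages[t] = np.dot(weights, estimators) / weights.sum()
        return advantages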
5.5. Natural Policy Gradient

Policy gradient methods perform optimization over a family of distributions (policies). Consider the abstract problem of optimizing a function f of a distribution q:

$$f(q) \to \min_q$$

A classic example of such a problem is the maximum likelihood task, when we try to fit the parameters of our model to some observed data. The problem is that when using standard gradient descent, both the convergence rate and the overall performance of the optimization method substantially depend on the choice of parametrization q_θ. The problem holds even if we fix a specific distribution family, as many distribution families allow different parametrizations.
To see why gradient descent is parametrization-sensitive, consider the model which is used at
some current point θk to determine the direction of next optimization step:
$$\begin{cases} f(q_{\theta_k}) + \langle \nabla_\theta f(q_{\theta_k}), \delta\theta \rangle \to \min_{\delta\theta} \\ \|\delta\theta\|_2^2 < \alpha_k \end{cases}$$
where αk is learning rate at step k. Being first-order method, gradient descent constructs a «model»
which approximates F locally around θk using first-order Taylor expansion and employs standard
Euclidean metric to determine a region of trust for this model. Then this surrogate task is solved
analytically to obtain well-known update formula:
δθ ∝ −∇θ f (qθk )
The issue arises from the reliance on the Euclidean metric in the space of parameters. In most parametrizations, small changes in the parameter space do not guarantee small changes in the distribution space and vice versa: some small changes in the distribution may demand big steps in the parameter space²⁶.
Natural gradient proposes to use another metric, which achieves invariance to parametrization
of distribution q using the properties of Fisher matrix:
26 classic example is that N (0, 100) is similar to N (1, 100) while N (0, 0.1) is completely different from N (1, 0.1),
although Euclidean distance in parameter space is the same for both pairs.
Definition 11. For a distribution q_θ the Fisher matrix F_q(θ) is defined as

$$F_q(\theta) := \mathbb{E}_{x \sim q_\theta} \left[ \nabla_\theta \log q_\theta(x) \, \nabla_\theta \log q_\theta(x)^T \right]$$
Note that the Fisher matrix depends on the parametrization. Yet for any parametrization it is guaranteed to be positive semi-definite by definition. Moreover, it induces a so-called Riemannian metric²⁷ in the space of parameters, which can be used instead of the Euclidean one to define the region of trust:

$$\begin{cases} f(q_{\theta_k}) + \langle \nabla_\theta f(q_{\theta_k}), \delta\theta \rangle \to \min_{\delta\theta} \\ \delta\theta^T F_q(\theta_k) \, \delta\theta < \alpha_k \end{cases}$$

This surrogate task can be solved analytically to obtain the following optimization direction:

$$\delta\theta \propto -F_q(\theta_k)^{-1} \nabla_\theta f(q_{\theta_k}) \qquad (43)$$

The direction of gradient descent is corrected by the Fisher matrix, which accounts for the scale across different axes. This direction, specified by F_q(θ_k)^{-1} ∇_θ f(q_{θ_k}), is called the natural gradient.
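As a toy illustration (not how large-scale implementations proceed), a sketch of a natural gradient step for a categorical distribution parameterized by logits, with the Fisher matrix formed explicitly and damped before inversion; all names here are our own:

    import torch

    def fisher_matrix(theta):
        """F(theta) = E_{x ~ q_theta}[ grad log q(x) grad log q(x)^T ] computed exactly
        for a categorical distribution over len(theta) outcomes."""
        probs = torch.softmax(theta, dim=0)
        n = theta.numel()
        F = torch.zeros(n, n)
        for x in range(n):
            grad_log = torch.autograd.grad(torch.log_softmax(theta, dim=0)[x], theta)[0]
            F += probs[x].detach() * torch.outer(grad_log, grad_log)
        return F

    theta = torch.tensor([0.5, -1.0, 2.0], requires_grad=True)
    f = -torch.log_softmax(theta, dim=0)[0]          # example objective: NLL of outcome 0
    grad = torch.autograd.grad(f, theta)[0]

    F = fisher_matrix(theta)
    damping = 1e-3                                   # F may be singular, so regularize before solving
    natural_grad = torch.linalg.solve(F + damping * torch.eye(3), grad)
    theta_new = theta.detach() - 0.1 * natural_grad  # step along the natural gradient direction

In practice the Fisher matrix is never materialized for large networks; a Fisher-vector-product approach is sketched later in the TRPO section.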
Let’s discuss why this new metric really provides us invariance to distribution parametrization.
We already obtained natural gradient for q being parameterized by θ (43). Assume that we have
another parametrization qν . These new parameters ν are somehow related to θ ; we suppose there
is some functional dependency θ(ν), which we assume to be differentiable with jacobian J . In this
notation:
$$\delta\theta = J \, \delta\nu, \qquad J_{ij} := \frac{\partial \theta_i}{\partial \nu_j} \qquad (44)$$
The central property of Fisher matrix, which provides the desired invariance, is the following:
Proposition 21. If θ = θ(ν) with jacobian J, then the reparametrization formula for the Fisher matrix is

$$F_q(\nu) = J^T F_q(\theta) J \qquad (45)$$
Now it can be derived that natural gradient for parametrization with ν is the same as for θ . If we
want to calculate natural gradient in terms of ν , then our step is, according to (44):
$$\begin{aligned}
\delta\theta = J \delta\nu &= \\
\{\text{natural gradient in terms of } \nu\} &\propto J F_q(\nu_k)^{-1} \nabla_\nu f(q_{\nu_k}) = \\
\{\text{Fisher matrix reparametrization (45)}\} &= J \left( J^T F_q(\theta_k) J \right)^{-1} \nabla_\nu f(q_{\nu_k}) = \\
\{\text{chain rule}\} &= J \left( J^T F_q(\theta_k) J \right)^{-1} J^T \nabla_\theta f(q_{\theta_k}) = \\
\{\text{for invertible } J\} &= F_q(\theta_k)^{-1} \nabla_\theta f(q_{\theta_k})
\end{aligned}$$

which coincides with the natural gradient (43) in terms of θ.
The metric induced by this scalar product is correspondingly d(x, y)2 := (y − x)T G(y − x). The difference in Riemannian
space is that G, called metric tensor, depends on x, so the relative distance may vary for different points. It is used to
describe the distances between points on manifolds and holds important properties which Fisher matrix inherits as metric
tensor for distribution space.
5.6. Trust-Region Policy Optimization (TRPO)
The main drawback of Actor-Critic algorithm is believed to be the abandonment of experience
that was used for previous updates. As the number of updates required is usually huge, this is
considered to be a substantial loss of information. Yet, it is not clear how this information can be
effectively used for newer updates.
Suppose we want to make an update of π(θ), but using samples collected by some π_old. The straightforward approach is the importance sampling technique, whose naive application to the gradient formula (40) yields the following result:
$$\nabla_\theta J(\theta) = \mathbb{E}_{T \sim \pi_{old}} \frac{P(T \mid \pi(\theta))}{P(T \mid \pi_{old})} \sum_{t=0} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, A^\pi(s_t, a_t)$$
The emerged importance sampling weight is actually computable, as the transition probabilities cancel out:

$$\frac{P(T \mid \pi(\theta))}{P(T \mid \pi_{old})} = \frac{\prod_{t=1} \pi_\theta(a_t \mid s_t)}{\prod_{t=1} \pi_{old}(a_t \mid s_t)}$$
The problem with this coefficient is that it tends either to be exponentially small or to explode. Even
with some heuristic normalization of coefficients the batch gradient would become dominated by
one or several transitions and destabilize the training procedure by introducing even more variance.
Notice that application of importance sampling to another representation of gradient (37) yields
seemingly different result:
$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_{old}} \frac{d^{\pi(\theta)}(s)}{d^{\pi_{old}}(s)} \frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)} \nabla_\theta \log \pi_\theta(a \mid s) \, A^\pi(s, a) \qquad (46)$$
Here we avoided importance sampling weights common to whole trajectories by using the definition of state visitation frequencies. But this result is even less practical, as these frequencies are unknown to us.
The first key idea behind the theory concerning this problem is that these importance sampling coefficients may behave in a more stable manner if the policies π_old and π(θ) are in some sense «close». Intuitively, in this case the ratio d^{π(θ)}(s) / d^{π_old}(s) in formula (46) is close to 1, as the state visitation frequencies are similar, and the remaining importance sampling coefficient becomes acceptable in practice. And if two policies are similar, their values of our objective (2) are probably close too.
For any two policies, π and π old :
$$\begin{aligned}
J(\pi) - J(\pi_{old}) &= \mathbb{E}_{T \sim \pi} \sum_{t=0} \gamma^t r(s_t) - J(\pi_{old}) = \\
&= \mathbb{E}_{T \sim \pi} \sum_{t=0} \gamma^t r(s_t) - V^{\pi_{old}}(s_0) = \\
&= \mathbb{E}_{T \sim \pi} \left[ \sum_{t=0} \gamma^t r(s_t) - V^{\pi_{old}}(s_0) \right] = \\
\{\text{trick } \textstyle\sum_{t=0}^{\infty}(a_{t+1} - a_t) = -a_0\}^{28} &= \mathbb{E}_{T \sim \pi} \left[ \sum_{t=0} \gamma^t r(s_t) + \sum_{t=0} \left( \gamma^{t+1} V^{\pi_{old}}(s_{t+1}) - \gamma^t V^{\pi_{old}}(s_t) \right) \right] = \\
\{\text{regroup}\} &= \mathbb{E}_{T \sim \pi} \sum_{t=0} \gamma^t \left( r(s_t) + \gamma V^{\pi_{old}}(s_{t+1}) - V^{\pi_{old}}(s_t) \right) = \\
\{\text{by definition (3)}\} &= \mathbb{E}_{T \sim \pi} \sum_{t=0} \gamma^t \left( Q^{\pi_{old}}(s_t, a_t) - V^{\pi_{old}}(s_t) \right) = \\
\{\text{by definition (14)}\} &= \mathbb{E}_{T \sim \pi} \sum_{t=0} \gamma^t A^{\pi_{old}}(s_t, a_t)
\end{aligned}$$
The result obtained above is often referred to as relative policy performance identity and is
actually very interesting: it states that we can substitute reward with advantage function of arbitrary
policy and that will shift the objective by the constant.
We will require this identity rewritten in terms of state visitation frequencies. To do so, it is
convenient to define discounted version of state visitations distribution:
28 And if the MDP is episodic, for terminal states V^{π_old}(s_T) = 0 by definition.
Definition 12. For a given MDP and a given policy π its discounted state visitation frequency is defined by

$$d(s \mid \pi) := (1 - \gamma) \sum_{t=0} \gamma^t P(s_t = s)$$
Using this frequency as an unnormalized state visitation distribution, the relative policy performance identity can be rewritten as

$$J(\pi) - J(\pi_{old}) = \frac{1}{1 - \gamma} \mathbb{E}_{s \sim d(s \mid \pi)} \mathbb{E}_{a \sim \pi(a \mid s)} A^{\pi_{old}}(s, a)$$
Now assume we want to optimize the parameters θ of policy π while using data collected by π_old, applying importance sampling in the same manner:

$$J(\pi_\theta) - J(\pi_{old}) = \frac{1}{1 - \gamma} \mathbb{E}_{s \sim d(s \mid \pi_{old})} \mathbb{E}_{a \sim \pi_{old}(a \mid s)} \frac{d(s \mid \pi_\theta)}{d(s \mid \pi_{old})} \frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)} A^{\pi_{old}}(s, a)$$
As we have in mind the idea of π old being close to πθ , the question is how well this identity can
be approximated if we assume d(s | πθ ) = d(s | π old ). Under this assumption:
$$J(\pi_\theta) - J(\pi_{old}) \approx L_{\pi_{old}}(\theta) := \frac{1}{1 - \gamma} \mathbb{E}_{s \sim d(s \mid \pi_{old})} \mathbb{E}_{a \sim \pi_{old}(a \mid s)} \frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)} A^{\pi_{old}}(s, a)$$
The point is that interaction using π old corresponds to sampling from the expectations presented
in Lπold (θ):
$$L_{\pi_{old}}(\theta) = \mathbb{E}_{\pi_{old}} \frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)} A^{\pi_{old}}(s, a)$$
The approximation quality of L_{π_old}(θ) can be described by the following theorem:

Proposition 22.

$$J(\pi_\theta) - J(\pi_{old}) \ge L_{\pi_{old}}(\theta) - C \max_s KL(\pi_{old} \parallel \pi_\theta)[s]$$

where C is some constant and KL(π_old ∥ π_θ)[s] is a shortened notation for KL(π_old(a | s) ∥ π_θ(a | s)),
which not only states that the expression on the right side represents a lower bound, but also that the optimization procedure

$$\theta_{k+1} = \operatorname*{argmax}_\theta \left[ L_{\pi_{\theta_k}}(\theta) - C \max_s KL(\pi_{\theta_k} \parallel \pi_\theta)[s] \right] \qquad (47)$$

is guaranteed to monotonically improve the policy.
The second step of TRPO is to rewrite the unconstrained maximization task (47) in an equivalent constrained («trust-region») form³¹ to incorporate the unknown constant C into the learning rate:

$$\begin{cases} L_{\pi_{old}}(\theta) \to \max_\theta \\ \mathbb{E}_{s \sim d(s \mid \pi_{old})} KL(\pi_{old} \parallel \pi_\theta)[s] < C \end{cases} \qquad (48)$$
Note that this rewrites an update iteration in terms of optimization methods: while Lπold (θ) is an
approximation of true objective J (πθ ) − J (π old ), the constraint sets the region of trust to the
surrogate. Remark that constraint is actually a divergence in policy space, i. e. it is very similar to a
metric in the space of distributions while the surrogate is a function of the policy and depends on
parameters θ only through πθ .
To solve the constrained problem (48), the technique from convex optimization is used. Assume
that π old is a current policy and we want to update its parameters θk . Then the objective of (48)
is modeled using first-order Taylor expansion around θk while constraint is modeled using second-
order 32 Taylor approximation:
$$\begin{cases} L_{\pi_{old}}(\theta_k + \delta\theta) \approx \langle \nabla_\theta L_{\pi_{old}}(\theta)|_{\theta_k}, \delta\theta \rangle \to \max_{\delta\theta} \\ \mathbb{E}_{s \sim d(s \mid \pi_{old})} KL(\pi_{old} \parallel \pi_{\theta_k + \delta\theta}) \approx \frac{1}{2} \mathbb{E}_{s \sim d(s \mid \pi_{old})} \, \delta\theta^T \left. \nabla^2_\theta KL(\pi_{old} \parallel \pi_\theta) \right|_{\theta_k} \delta\theta < C \end{cases}$$
It turns out, that this model is equivalent to natural policy gradient, discussed in sec. 5.5:
Proposition 23.
$$\left. \nabla^2_\theta KL(\pi_\theta \parallel \pi_{old})[s] \right|_{\theta_k} = F_{\pi(a \mid s)}(\theta)$$
so KL-divergence constraint can be approximated with metric induced by Fisher matrix. Moreover,
the gradient of surrogate function is
$$\begin{aligned}
\nabla_\theta L_{\pi_{old}}(\theta)|_{\theta_k} &= \mathbb{E}_{\pi_{old}} \frac{\nabla_\theta \pi_\theta(a \mid s)|_{\theta_k}}{\pi_{old}(a \mid s)} A^{\pi_{old}}(s, a) = \\
\{\pi_{old} = \pi_{\theta_k}\} &= \mathbb{E}_{\pi_{old}} \nabla_\theta \log \pi_{\theta_k}(a \mid s) \, A^{\pi_{old}}(s, a)
\end{aligned}$$
which is exactly the Actor-Critic gradient. Therefore the update step takes the form of a natural gradient step,

$$\delta\theta \propto F_\pi(\theta_k)^{-1} \, \nabla_\theta L_{\pi_{old}}(\theta)|_{\theta_k},$$

where ∇_θ L_{π_old}(θ) coincides with the standard policy gradient, and F_π(θ) is the Hessian of the KL-divergence, i.e. the Fisher matrix F_{π(a|s)}(θ) averaged over the visited states.
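In practice the product F_π(θ_k)^{-1} ∇_θ L_{π_old}(θ) is not obtained by inverting the Fisher matrix; TRPO implementations typically solve F x = g approximately with conjugate gradients, using only Fisher-vector products. A minimal sketch under the assumption that such a product routine is available (all names are ours):

    import torch

    def conjugate_gradient(fisher_vector_product, g, n_iters=10, tol=1e-10):
        """Approximately solve F x = g given only matrix-vector products v -> F v.

        fisher_vector_product: callable returning F @ v for a flat vector v
        g:                     flat policy gradient
        """
        x = torch.zeros_like(g)
        r = g.clone()                      # residual g - F x (x = 0 initially)
        p = g.clone()                      # search direction
        r_dot_r = r @ r
        for _ in range(n_iters):
            Fp = fisher_vector_product(p)
            alpha = r_dot_r / (p @ Fp)
            x += alpha * p
            r -= alpha * Fp
            new_r_dot_r = r @ r
            if new_r_dot_r < tol:
                break
            p = r + (new_r_dot_r / r_dot_r) * p
            r_dot_r = new_r_dot_r
        return x                           # approximates F^{-1} g, the natural gradient direction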
5.7. Proximal Policy Optimization (PPO)
Proximal Policy Optimization [19] proposes alternative heuristic way of performing lower bound
(47) optimization which demonstrated encouraging empirical results.
PPO still substitutes the maximum of the KL over states with its average, but leaves the surrogate in unconstrained form, suggesting to treat the unknown constant C as a hyperparameter:

$$\mathbb{E}_{\pi_{old}} \left[ \frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)} A^{\pi_{old}}(s, a) - C \, KL(\pi_{old} \parallel \pi_\theta)[s] \right] \to \max_\theta \qquad (49)$$
The naive idea would be to straightforwardly optimize (49), as it is equivalent to solving the constrained trust-region task (48). To avoid Hessian-involved computations, one possible option is just to perform one step of first-order gradient optimization of (49). Such an algorithm was empirically discovered to perform poorly, as the importance sampling coefficients π_θ(a | s) / π_old(a | s) tend to grow unboundedly.
In PPO it is proposed to cope with this problem in a simple old-fashioned way: by clipping. Let us denote by

$$r(\theta) := \frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)}$$

an importance sampling weight and by

$$r^{clip}(\theta) := \operatorname{clip}\big(r(\theta), 1 - \epsilon, 1 + \epsilon\big)$$

its clipped version, where ε ∈ (0, 1) is a hyperparameter. Then the clipped version of the lower bound is:
$$\mathbb{E}_{\pi_{old}} \left[ \min\left( r(\theta) A^{\pi_{old}}(s, a), \; r^{clip}(\theta) A^{\pi_{old}}(s, a) \right) - C \, KL(\pi_{old} \parallel \pi_\theta)[s] \right] \to \max_\theta \qquad (50)$$
Here the minimum operation is introduced to guarantee that the surrogate objective remains a lower bound. Thus clipping at 1 + ε may occur only if the advantage is positive, while clipping at 1 − ε may occur if the advantage is negative. In both cases, clipping represents a penalty for the importance sampling weight r(θ) being too far from 1.
The overall procedure suggested by PPO to optimize the «stabilized» version of lower bound (50)
is the following. A roll-out is collected using current policy π old with some parameters θ . Then the
batches of typical size (as for Actor-Critic methods) are sampled from collected roll-out and several
steps of SGD optimization of (50) proceed with respect to policy parameters θ . During this process
the policy π old is considered to be fixed and new interaction steps are not performed, while in im-
plementations there is no need to store old weights θk since everything required from π old is to
collect transitions and remember the probabilities π old (a | s). The idea is that during these several
steps we may use transitions from the collected roll-out several times. Similar alternative is to per-
form several epochs of training by passing through roll-out several times, as it is often done in deep
learning.
An interesting fact discovered by the authors of PPO during ablation studies is that removing the KL-penalty term doesn't affect the overall empirical performance. That is why in many implementations PPO does not include the KL-term at all, making the final surrogate objective have the following form:

$$\mathbb{E}_{\pi_{old}} \min\left( r(\theta) A^{\pi_{old}}(s, a), \; r^{clip}(\theta) A^{\pi_{old}}(s, a) \right) \to \max_\theta \qquad (51)$$
Note that in this form the surrogate is not generally a lower bound and «improvement guarantees»
intuition is lost.
1. obtain a roll-out of size R using policy π(θ), storing action probabilities as π old (a | s).
2. for each transition T from the roll-out compute advantage estimation (detached from
computational graph to prevent backpropagation):
3. perform n_epochs passes through roll-out using batches of size B ; for each batch:
$$r_\theta(T) = \frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)}$$

$$r_\theta^{clip}(T) = \operatorname{clip}\big(r_\theta(T), 1 - \epsilon, 1 + \epsilon\big)$$
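A minimal PyTorch sketch of the clipped surrogate (51) computed for such a batch (the function name and signature are our own; in our implementation this «actor loss» is further combined with the critic MSE and entropy terms, as described in appendix A):

    import torch

    def ppo_surrogate_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
        """Clipped PPO surrogate (51) for one batch, written as a loss to minimize (sketch).

        new_log_probs: log pi_theta(a | s), recomputed on every pass through the roll-out
        old_log_probs: log pi_old(a | s) stored while collecting the roll-out, detached
        advantages:    advantage estimates A^{pi_old}, detached
        """
        ratio = torch.exp(new_log_probs - old_log_probs)            # r_theta(T)
        clipped_ratio = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon)
        surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
        return -surrogate.mean()                                    # ascent on (51) == descent on this loss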
6. Experiments
6.1. Setup
We performed our experiments using a custom implementation of the discussed algorithms, attempting to incorporate the best features from different official and unofficial sources and unifying all algorithms in a single library interface. The full code is available at our github.
While a custom implementation might not be the most efficient, it revealed several ambiguities in the algorithms which are resolved differently in different sources. We describe these nuances and the choices made for our experiments in appendix A.
For each environment we launch several algorithms to train the network with the same architec-
ture with the only exception being the head which is specified by the algorithm (see table 1).
For noisy networks all fully-connected layers in the feature extractor and in the head are substi-
tuted with noisy layers, doubling the number of their trained parameters. Both usage of noisy layers
and the choice of the head influences the total number of parameters trained by the algorithm.
As practical tuning of hyperparameters is computationally consuming activity, we set all hyperpa-
rameters to their recommended values while trying to share the values of common hyperparameters
among algorithms without affecting overall performance.
We choose to give each algorithm same amount of interaction steps to provide the fair compari-
son of their sample efficiency. Thus the wall-clock time, number of episodes played and the number
of network parameters updates varies for different algorithms.
6.2. Cartpole
Cartpole from OpenAI Gym [2] is considered to be one of the simplest environments for DRL
algorithms testing. The state is described with 4 real numbers while action space is two-dimensional
discrete.
The environment rewards agent with +1 each tick until the episode ends. Poor action choices
lead to early termination. The game is considered solved if agent holds for 200 ticks, therefore 200
is maximum reward in this environment.
In our first experiment we launch algorithms for 10 000 interaction steps to train a neural network
on the Cartpole environment. The network consists of two fully-connected hidden layers with 128
neurons and an algorithm-specific head. We used ReLU for activations. The results of a single launch
are provided33 in table 2.
33 we didn’t tune hyperparameters for each of the algorithms, so the configurations used might not be optimal.
Table 2: Cartpole results (columns: Reached 200, Average reward, Average FPS).
6.3. Pong
We used Atari Pong environment from OpenAI Gym [2] as our main testbed to study the be-
haviour of the following algorithms:
• DQN — Deep Q-learning (sec. 3.2)
• c51 — Categorical DQN (sec. 4.2)
• QR-DQN — Quantile Regression DQN (sec. 4.3)
• Rainbow (sec. 4.4)
• A2C — Advantage Actor Critic (sec. 5.3) extended with GAE (sec. 5.4)
• PPO — Proximal Policy Optimization (sec. 5.7) extended with GAE (sec. 5.4)
In Pong, each episode is split into rounds. Each round ends with the player either winning or losing. The episode ends when the player wins or loses 21 rounds. The reward is given after each round and is +1 for winning and -1 for losing. Therefore the maximum total reward is 21 and the minimum is -21. Note that the flag done indicating episode ending is not provided to the agent after each round but only at the end of the full game (consisting of 21-41 rounds).
The standard preprocessing for Atari games proposed in DQN [13] was applied to the environ-
ment (see table 3). Thus, state space is represented by (84, 84) grayscale pixels input (1 channel
with domain [0, 255]). Action space is discrete with |A| = 6 actions.
All algorithms were given 1 000 000 interaction steps to train the network with the same feature
extractor presented on fig. 1. The number of trained parameters is presented in table 4. All used
hyperparameters are listed in table 7 in appendix B.
NoopResetEnv — do nothing for the first 30 frames of the game to imitate the pause between game start and real player reaction.
MaxAndSkipEnv — each interaction step takes 4 frames of the game to allow less frequent switching of actions. The max is taken over the 4 passed frames to obtain an observation.
FireResetEnv — presses the «Fire» button at the first frame to launch the game, otherwise the screen remains frozen.
WarpFrame — turns the observation into a grayscale image of size 84x84.
Table 3: Atari Pong preprocessing
[Figure 1 architecture: input (1, 84, 84) → Convolution 8x8 with stride 4 → (32, 20, 20) → Convolution 4x4 with stride 2 → (64, 9, 9) → Convolution 3x3 with stride 1 → Fully-connected layer → Algorithm-specific head.]
Figure 1: Network used for Atari Pong. All activation functions are ReLU. For Rainbow the fully-connected layer
and all dense layers in the algorithm-specific head are substituted with noisy layers.
6.4. Interaction-training trade-off in value-based algorithms

For each value-based algorithm we additionally trained an «accelerated» version that performs a network update every 4 interaction steps instead of every step, raising the fraction «observations per training step» fourfold. To compensate for this change we raised the batch size fourfold.
As expected, the average speed of the algorithms increased approximately 3.5 times (see table 6). We provide training curves with respect to the 1M performed interaction steps on fig. 2 and with respect to wall-clock time on fig. 3. The only vanilla algorithm that achieved a better final score compared to its accelerated rival is QR-DQN, while the other three algorithms demonstrated both acceleration and performance improvement. The latter is probably caused by randomness, as a relaunch of an algorithm with the same setting and hyperparameters can be strongly influenced by the random seed.
It can be assumed that the fraction «observations per update» is an important hyperparameter of value-based algorithms, which can control the trade-off between wall-clock time and sample efficiency. From our results it follows that a low fraction leads to excessive network updates and may slow down learning several times. Yet this hyperparameter can barely be tuned universally for all kinds of tasks, as opposed to many other hyperparameters that usually have recommended default values.
In what follows we stick to the accelerated versions and use their results in the final comparisons.
6.5. Results
We compare the results of launching the six algorithms on Pong from two perspectives: sample efficiency (fig. 4) and wall-clock time (fig. 5). We do not compare the final performance of these algorithms, as all six are capable of reaching a near-maximum final score on Pong given more iterations, while the results after 1M iterations of a single launch significantly depend on chance.
All algorithms start with a warm-up session during which they try to explore the environment and
Algorithm | Interactions per update (vanilla / accelerated) | Average transitions per second (vanilla / accelerated)
DQN       | 1 / 4       | 55.74 / 168.43
c51       | 1 / 4       | 44.08 / 148.76
QR-DQN    | 1 / 4       | 47.46 / 155.97
Rainbow   | 1 / 4       | 19.30 / 70.22
A2C       | 40          | 656.25
PPO       | 10.33       | 327.13
Table 6: Computational efficiency of vanilla and accelerated versions.
Figure 2: Training curves of vanilla and accelerated version of value-based algorithms on 1M steps of Pong.
Although accelerated versions perform network updates four times less frequent, the performance degradation
is not observed.
learn the first dependencies that allow them to surpass the result of random behaviour. Epsilon-greedy exploration with tuned parameters provides a sufficient amount of exploration for DQN, c51 and QR-DQN without slowing down further learning, while the hyperparameter-free noisy networks are the main reason why Rainbow has a substantially longer warm-up.
Policy gradient algorithms incorporate exploration strategy in stochasticity of learned policy but
underutilization of observed samples leads to almost 1M-frames warm-up for A2C. It can be ob-
served that PPO successfully mitigates this problem by reusing samples thrice. Nevertheless, both
PPO and A2C solve Pong relatively quickly after the warm-up stage is over.
Value-based algorithm proved to be more computationally costly. QR-DQN and categorical DQN
introduce more complicated loss computation, yet their slowdown compared to standard DQN is
moderate. On the contrary, Rainbow is substantially slower mainly because of noise generation
involvement. Furthermore, combination of noisy networks and prioritized replay results in even less
stable training process.
We provide loss curves for all six algorithms and statistics for noise magnitude and prioritized re-
play for Rainbow in appendix C; some additional visualizations of trained algorithms playing episodes
of Pong are presented in appendix D.
[Figure 3 plot: «Acceleration effect on value-based algorithms» — average score for the last 20 episodes against wall-clock time in minutes.]
Figure 3: Training curves of vanilla and accelerated version of value-based algorithms on 1M steps of Pong from
wall-clock time.
[Figure 4 plot: training curves of all six algorithms on 1M interaction steps of Pong — average score for the last 20 episodes against interaction steps.]
Figure 5: Training curves of all algorithms on 1M steps of Pong from wall-clock time.
7. Discussion
We have considered two main directions of universal model-free RL algorithm design and attempted to recreate several state-of-the-art pipelines.
While the extensions of DQN are reasonable solutions to evident DQN problems, their effect is not clearly seen on simple tasks like Pong³⁴. The current state-of-the-art in the single-threaded value-based approach, Rainbow DQN, is full of «glue and tape» decisions that might not be the most effective way of stabilizing the training process.
The distributional value-based approach is one of the cheapest (in terms of resources) extensions of the vanilla DQN algorithm. Although it is reported to provide substantial performance improvement in empirical experiments, the reason behind this result remains unclear, as the expectation of return is the key quantity for the agent's decision making, while the rest of the learned distribution does not affect its choices. One hypothesis to explain this phenomenon is that attempting to capture a wider range of dependencies inside the given MDP may provide auxiliary helping tasks to the algorithm, leading to better learning of the expectation. Intuitively, a more reasonable switch of DQN to the distributional setting would be learning the Bayesian uncertainty of the expectation of return given observed data, but scalable practical algorithms within this orthogonal paradigm are yet to be created.
Policy gradient algorithms are aimed at direct optimization of the objective and currently beat the value-based approach in terms of computational costs. They tend to have fewer hyperparameters but are extremely sensitive to the choice of optimizer parameters and especially the learning rate. We have affirmed the effectiveness of the state-of-the-art algorithm PPO, which succeeded in solving Pong within an hour without hyperparameter tuning. Though this algorithm was derived from TRPO theory, it essentially deviates from it and substitutes trust-region updates with heuristic clipping.
It can be observed in our results that PPO provides better gradients to the same network than
DQN-based algorithms despite the absence of experience replay. While it is fair to assume that
forgetting experienced transitions leads to information loss, it is also true that most observations
stored in replay memory are already learned or contain no useful information. The latter makes
most transitions in the sampled mini-batches insignificant, and, while prioritized replay attacks this
issue, it might still be the case that current experience replay management techniques are imperfect.
There are still a lot of deviations of empirical results from theoretical perspectives. It is yet unclear
which techniques are of the highest potential and what explanation lies behind many heuristic ele-
ments composing current state-of-art results. Possibly essential elements of modeling human-like
reinforcement learning are yet to be unraveled as active research in this area promises substantial
acceleration, generalization and stabilization of DRL algorithms.
34 although it takes several hours to train, Pong is considered to be the easiest of 57 Atari games and one of the most basic
References
[1] M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learn-
ing. In Proceedings of the 34th International Conference on Machine Learning-Volume 70,
pages 449–458. JMLR. org, 2017.
[2] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Ope-
nai gym. arXiv preprint arXiv:1606.01540, 2016.
[4] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis,
O. Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.
[5] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT press Cam-
bridge, 2016.
[6] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement
learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[7] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot,
M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[8] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver. Dis-
tributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.
[9] A. Irpan. Deep reinforcement learning doesn't work yet. Online (Feb. 14): https://www.alexirpan.com/2018/02/14/rl-hard.html, 2018.
[11] R. Koenker and G. Bassett Jr. Regression quantiles. Econometrica: journal of the Econometric
Society, pages 33–50, 1978.
[12] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous
control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Play-
ing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
[15] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative
to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
[16] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint
arXiv:1511.05952, 2015.
[17] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In
Icml, volume 37, pages 1889–1897, 2015.
[18] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control
using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.
[19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization
algorithms. arXiv preprint arXiv:1707.06347, 2017.
[20] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Ku-
maran, T. Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement
learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
[21] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
[22] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for rein-
forcement learning with function approximation. In Advances in neural information processing
systems, pages 1057–1063, 2000.
[23] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. In
Thirtieth AAAI Conference on Artificial Intelligence, 2016.
Appendix A. Implementation details
Here we describe several technical details of our implementation which may potentially influence
the obtained results.
In most papers on value-based algorithms hyperparameters recommended for Atari games as-
sume raw input in the range [0, 255], while in various implementations of policy gradient algorithms
normalized input in the range [0, 1] is considered. Stepping aside from these agreements may dam-
age the convergence speed both for value-based and policy gradient algorithms as the change of
input domain requires hyperparameters retuning.
We use the MSE loss that emerges from the theoretical intuition for DQN, while many sources recommend using the Huber loss³⁵ instead to stabilize learning.
In all value-based algorithms except c51 we update target network each K -th frame instead of
exponential smoothing of its parameters as it is computationally cheaper. For c51 we remove target
network heuristic as apriori limited domain prevents unbounded growth of predictions.
We do not architecturally force quantiles outputted by the network in Quantile Regression DQN
to satisfy ζ0 ≤ ζ1 ≤ · · · ≤ ζA−1 . As in the original paper, we assume that all A outputs of network
are arbitrary real values and use a standard linear transformation as our last layer.
In dueling architectures we subtract mean of A(s, a) across actions instead of theoretically as-
sumed maximum as proposed by original paper authors.
We implement sampling from prioritized replay using SumTree data structure and in informal
experiments affirmed the acceleration it provides. The importance sampling weight annealing β(t)
is represented by initial value β(0) = β which is then linearly annealed to 1 during first Tβ frames;
both β and Tβ are hyperparameters.
We do not allow priorities P(T ) to be greater than 1 by clipping as suggested in the original
paper. This may mitigate the effect of prioritization replay but stabilizes the process.
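A minimal sketch of the SumTree mentioned above (class layout and names are our own illustration; the actual implementation in the repository may differ):

    import random

    class SumTree:
        """Array-based binary tree for sampling transitions with probability
        proportional to their priorities."""

        def __init__(self, capacity):
            self.capacity = capacity            # number of leaves (stored transitions)
            self.tree = [0.0] * (2 * capacity)  # internal nodes store sums of their children

        def update(self, index, priority):
            """Set the priority of the transition stored in leaf `index`."""
            i = index + self.capacity
            self.tree[i] = priority
            i //= 2
            while i >= 1:                       # propagate the change up to the root
                self.tree[i] = self.tree[2 * i] + self.tree[2 * i + 1]
                i //= 2

        def sample(self):
            """Sample a leaf index with probability proportional to its priority."""
            s = random.uniform(0, self.tree[1])  # tree[1] holds the total sum of priorities
            i = 1
            while i < self.capacity:             # descend: go left if s falls into the left subtree
                if s <= self.tree[2 * i]:
                    i = 2 * i
                else:
                    s -= self.tree[2 * i]
                    i = 2 * i + 1
            return i - self.capacity

    # usage sketch
    tree = SumTree(capacity=8)
    for idx, p in enumerate([0.1, 0.5, 0.2, 0.9, 0.3, 0.0, 0.7, 0.4]):
        tree.update(idx, p)
    batch = [tree.sample() for _ in range(32)]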
As the importance sampling weights w(T) = 1/(B P(T)) are potentially very close to zero, in the original article it was proposed to normalize them by max w(T). In some implementations the maximum is taken over the whole experience replay, while in others it is taken over the current batch, which is not theoretically justified but computationally much faster. We stick to the latter option.
For noisy layers we use factorized noise sampling: for a layer with m inputs and n outputs we sample ε₁ ∈ R^n, ε₂ ∈ R^m from standard normal distributions and scale both using f(ε) = sign(ε)·√|ε|. Thus we use f(ε₁)f(ε₂)^T as the noise sample for the weight matrix and f(ε₁) as the noise sample for the bias. All noise is shared across the mini-batch. Noise is resampled on each forward pass through the network and thus is independent between evaluation, selection and interaction. Despite all these simplifications, we found noisy layers to be the most computationally expensive modification of DQN, leading to substantial degradation of wall-clock time.
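A minimal PyTorch sketch of such a factorized noisy layer under the conventions above (class name and initialization details are our own assumptions):

    import math
    import torch
    import torch.nn as nn

    class NoisyLinear(nn.Module):
        """Sketch of a factorized noisy layer as described above."""

        def __init__(self, in_features, out_features, sigma_init=0.5):
            super().__init__()
            self.in_features, self.out_features = in_features, out_features
            self.weight_mu = nn.Parameter(torch.empty(out_features, in_features).uniform_(
                -1 / math.sqrt(in_features), 1 / math.sqrt(in_features)))
            self.bias_mu = nn.Parameter(torch.zeros(out_features))
            # initial sigmas are constant and equal to sigma_init / sqrt(m), m = number of inputs
            self.weight_sigma = nn.Parameter(torch.full((out_features, in_features),
                                                        sigma_init / math.sqrt(in_features)))
            self.bias_sigma = nn.Parameter(torch.full((out_features,),
                                                      sigma_init / math.sqrt(in_features)))

        @staticmethod
        def _f(eps):
            return eps.sign() * eps.abs().sqrt()    # f(eps) = sign(eps) * sqrt(|eps|)

        def forward(self, x):
            # factorized noise, resampled on each forward pass and shared across the mini-batch
            eps_out = self._f(torch.randn(self.out_features))
            eps_in = self._f(torch.randn(self.in_features))
            weight = self.weight_mu + self.weight_sigma * torch.outer(eps_out, eps_in)
            bias = self.bias_mu + self.bias_sigma * eps_out
            return torch.nn.functional.linear(x, weight, bias)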
For policy gradient algorithms we add additional policy entropy term to the loss to force ex-
ploration. We also define actor loss as a scalar function that yields the same gradients as in the
corresponding gradient estimation (40) for A2C to compute it using PyTorch mechanics. For PPO
objective (51) provides analogous «actor loss»; thus, in both policy gradient algorithms the full loss
is defined as summation of actor, critic and entropy losses, with the two latter being scaled using
scalar hyperparameters.
We use shared network architecture for policy gradient algorithms with one feature extractor and
two heads, one for policy and one for critic.
KL-penalty is not used in our PPO implementation. Also we do not normalize advantage esti-
mations across the roll-out to zero mean and unit standard deviation as additionally done in some
implementations.
We use the PyTorch default initialization for linear and convolutional layers, although orthogonal initialization of all layers is reported to be beneficial for policy gradient algorithms. Initial values of the sigmas for noisy layers are set to be constant and equal to σ_init/√m, where σ_init is a hyperparameter and m is the number of inputs, in accordance with the original paper.
We use Adam as our optimizer with default β1 = 0.9, β2 = 0.999, ε = 1e−8. No gradient
clipping is performed.
Appendix B. Hyperparameters
36 number of transitions to collect in replay memory before starting network optimization using mini-batch sampling.
Appendix C. Training statistics on Pong
Figure 6: DQN loss behaviour during training on Pong.
Figure 7: Loss behaviours of c51, QR-DQN and Rainbow during training on Pong.
Figure 8: Rainbow statistics during training. Left: smoothed with window 1000 median of importance sampling
weights from sampled mini-batches. Right: average noise magnitude logged at each 20-th step of training.
Figure 10: PPO loss behaviour during training.
Appendix D. Playing Pong behaviour
Figure 13: c51 value distribution prediction during one episode of Pong.
Quantile Regression DQN playing Pong
Quantile Regression DQN value distribution approximation during one played episode
Figure 15: Quantile Regression DQN value distribution prediction during one episode of Pong.
Figure 16: Rainbow playing one episode of Pong (exploration turned off, i.e. all noise samples are zero).
Figure 17: Rainbow value distribution prediction during one episode of Pong (exploration turned off, i.e. all
noise samples are zero).
A2C playing Pong
Figure 21: PPO policy distribution during one episode of Pong.