
Moscow State University

Faculty of Computational Mathematics and Cybernetics


Department of Mathematical Methods of Forecasting
arXiv:1906.10025v1 [cs.LG] 24 Jun 2019

Modern Deep Reinforcement Learning Algorithms

Written by:
Sergey Ivanov
[email protected]

Scientific advisor:
Alexander D’yakonov
[email protected]

Moscow, 2019
Contents

1 Introduction

2 Reinforcement Learning problem setup
  2.1 Assumptions of RL setting
  2.2 Environment model
  2.3 Objective
  2.4 Value functions
  2.5 Classes of algorithms
  2.6 Measurements of performance

3 Value-based algorithms
  3.1 Temporal Difference learning
  3.2 Deep Q-learning (DQN)
  3.3 Double DQN
  3.4 Dueling DQN
  3.5 Noisy DQN
  3.6 Prioritized experience replay
  3.7 Multi-step DQN

4 Distributional approach for value-based methods
  4.1 Theoretical foundations
  4.2 Categorical DQN
  4.3 Quantile Regression DQN (QR-DQN)
  4.4 Rainbow DQN

5 Policy Gradient algorithms
  5.1 Policy Gradient theorem
  5.2 REINFORCE
  5.3 Advantage Actor-Critic (A2C)
  5.4 Generalized Advantage Estimation (GAE)
  5.5 Natural Policy Gradient (NPG)
  5.6 Trust-Region Policy Optimization (TRPO)
  5.7 Proximal Policy Optimization (PPO)

6 Experiments
  6.1 Setup
  6.2 Cartpole
  6.3 Pong
  6.4 Interaction-training trade-off in value-based algorithms
  6.5 Results

7 Discussion

A Implementation details

B Hyperparameters

C Training statistics on Pong

D Playing Pong behaviour
Abstract
Recent advances in Reinforcement Learning, grounded in combining classical theoretical results with the Deep Learning paradigm, have led to breakthroughs in many artificial intelligence tasks and gave birth to Deep Reinforcement Learning (DRL) as a field of research. In this work the latest DRL algorithms are reviewed with a focus on their theoretical justification, practical limitations and observed empirical properties.

1. Introduction
During the last several years Deep Reinforcement Learning proved to be a fruitful approach to
many artificial intelligence tasks of diverse domains. Breakthrough achievements include reaching
human-level performance in such complex games as Go [20], multiplayer Dota [14] and real-time
strategy StarCraft II [24]. The generality of DRL framework allows its application in both discrete and
continuous domains to solve tasks in robotics and simulated environments [12].
Reinforcement Learning (RL) is usually viewed as a general formalization of the decision-making task and is deeply connected to dynamic programming, optimal control and game theory [21]. Yet its problem setting makes almost no assumptions about the world model or its structure and usually supposes that the environment is given to the agent in the form of a black box. This allows RL to be applied in practically all settings and forces the designed algorithms to be adaptive to many kinds of challenges. The latest RL algorithms are usually reported to be transferable from one task to another with no task-specific changes and little to no hyperparameter tuning.
As the object of desire is a strategy, i. e. a function mapping the agent's observations to possible actions, reinforcement learning is considered to be a subfield of machine learning. But instead of learning from data, as is established in classical supervised and unsupervised learning problems, the agent learns from the experience of interacting with the environment. Being a more «natural» model of learning, this setting causes new challenges, peculiar to reinforcement learning alone, such as the necessity of integrating exploration and the problem of delayed and sparse rewards. The full setup and essential notation are introduced in section 2.
Classical Reinforcement Learning research in the last third of the previous century developed an extensive theoretical core for modern algorithms to ground on. Several algorithms have been known ever since and are able to solve small-scale problems when either the environment states can be enumerated (and stored in memory) or the optimal policy can be searched for in the space of linear or quadratic functions of state representation features. Although these restrictions are extremely limiting, the foundations of classical RL theory underlie modern approaches. These theoretical fundamentals are discussed in sections 3.1 and 5.1–5.2.
Combining this framework with Deep Learning [5] was popularized by the Deep Q-Learning algorithm, introduced in [13], which was able to play any of 57 Atari console games without tweaking the network architecture or algorithm hyperparameters. This novel approach was extensively researched and significantly improved in the following years. The principles of the value-based direction in deep reinforcement learning are presented in section 3.
One of the key ideas in recent value-based DRL research is the distributional approach, proposed in [1]. Further extending classical theoretical foundations and coming with practical DRL algorithms, it gave birth to the distributional reinforcement learning paradigm, whose potential is now being actively investigated. Its ideas are described in section 4.
The second main direction of DRL research is policy gradient methods, which attempt to directly optimize the objective function explicitly present in the problem setup. Their application to neural networks involves a series of particular obstacles, which required specialized optimization techniques. Today they represent a competitive and scalable approach in deep reinforcement learning due to their enormous parallelization potential and applicability to continuous domains. Policy gradient methods are discussed in section 5.
Despite the wide range of successes, current state-of-the-art DRL methods still face a number of significant drawbacks. As training of neural networks requires huge amounts of data, DRL demonstrates unsatisfying results in settings where data generation is expensive. Even in cases where interaction is nearly free (e. g. in simulated environments), DRL algorithms tend to require excessive amounts of iterations, which raises their computational and wall-clock time cost. Furthermore, DRL suffers from random initialization and hyperparameter sensitivity, and its optimization process is known to be uncomfortably unstable [9]. An especially embarrassing consequence of these DRL features turned out to be the low reproducibility of empirical observations from different research groups [6]. In section 6, we attempt to launch state-of-the-art DRL algorithms on several standard testbed environments and discuss practical nuances of their application.

2. Reinforcement Learning problem setup
2.1. Assumptions of RL setting
Informally, the process of sequential decision-making proceeds as follows. The agent is provided with some initial observation of the environment and is required to choose some action from the given set of possibilities. The environment responds by transitioning to another state and generating a reward signal (a scalar number), which is considered to be a ground-truth estimation of the agent's performance. The process continues repeatedly, with the agent making choices of actions from observations and the environment responding with next states and reward signals. The only goal of the agent is to maximize the cumulative reward.
This description of the learning process model already introduces several key assumptions. Firstly, time is considered to be discrete, as the agent interacts with the environment sequentially. Secondly, it is assumed that the provided environment incorporates some reward function as a supervised indicator of success. This is an embodiment of the reward hypothesis, also referred to as the Reinforcement Learning hypothesis:

Proposition 1. (Reward Hypothesis) [21]


«All of what we mean by goals and purposes can be well thought of as maximization of the expected
value of the cumulative sum of a received scalar signal (reward).»

Exploitation of this hypothesis draws a line between reinforcement learning and the classical machine learning settings, supervised and unsupervised learning. Unlike unsupervised learning, RL assumes supervision, which, similarly to the labels in data for supervised learning, has a stochastic nature and represents a key source of knowledge. At the same time, no data or «right answer» is provided to the training procedure, which distinguishes RL from standard supervised learning. Moreover, RL is the only machine learning task providing an explicit objective function (the cumulative reward signal) to maximize, while in the supervised and unsupervised settings the optimized loss function is usually constructed by an engineer and is not «included» in the data. The fact that the reward signal is incorporated in the environment is considered to be one of the weakest points of the RL paradigm, as for many real-life human goals the introduction of this scalar reward signal is at the very least unobvious.
For practical applications it is also natural to assume that the agent's observations can be represented by feature vectors, i. e. elements of R^d. The set of possible actions in most practical applications is usually uncomplicated and is either discrete (the number of possible actions is finite) or can be represented as a subset of R^m (almost always [−1, 1]^m, or reducible to this case; see footnote 1). RL algorithms are usually restricted to these two cases, but a mix of the two (the agent is required to choose both discrete and continuous quantities) can also be considered.
The final assumption of RL paradigm is a Markovian property:

Proposition 2. (Markovian property)


Transitions depend solely on previous state and the last chosen action and are independent of all
previous interaction history.

Although this assumption may seem overly strong, it actually formalizes the fact that the world modeled by the considered environment obeys some general laws. Given that the agent knows the current state of the world and the laws, it is assumed to be able to predict the consequences of its actions up to the internal stochasticity of these laws. In practice, both the laws and the complete state representation are unavailable to the agent, which limits its forecasting capability.
In the sequel we will work within a setting with one more assumption, full observability. This simplification supposes that the agent can observe the complete world state, while in many real-life tasks only a part of the observations is actually available. This restriction of RL theory can be removed by considering Partially Observable Markov Decision Processes (POMDP), which basically forces learning algorithms to have some kind of memory mechanism to store previously received observations. Further on we will stick to the fully observable case.
1 this set is considered to be permanent for all states of the environment without any loss of generality: if the agent chooses an invalid action, the world may remain in the same state with a zero or negative reward signal, or some valid action may be stochastically selected for it.

2.2. Environment model
Though the definition of Markov Decision Process (MDP) varies from source to source, its essen-
tial meaning remains the same. The definition below utilizes several simplifications without loss of
generality.2

Definition 1. Markov Decision Process (MDP) is a tuple (S, A, T, r, s0 ), where:

• S ⊆ Rd — arbitrary set, called the state space.

• A — a set, called the action space, either


– discrete: |A| < +∞, or
– continuous domain: A = [−1, 1]m .

• T — transition probability p(s0 | s, a), where s, s0 ∈ S, a ∈ A.

• r : S → R — reward function.
• s0 ∈ S — starting state.

It is important to notice that in the most general case the only things available to an RL algorithm beforehand are d (the dimension of the state space) and the action space A. The only possible way for the agent to collect more information is to interact with the provided environment and observe s_0. It is obvious that the first choice of action a_0 will probably be random. While the environment responds by sampling s_1 ∼ p(s_1 | s_0, a_0), this distribution, defined in T and considered to be a part of the MDP, may be unavailable to the agent's learning procedure. What the agent does observe is s_1 and the reward signal r_1 := r(s_1), and this is the key information gathered by the agent from interaction experience.

Definition 2. The tuple (s_t, a_t, r_{t+1}, s_{t+1}) is called a transition. Several sequential transitions are usually referred to as a roll-out. The full track of observed quantities

s_0, a_0, r_1, s_1, a_1, r_2, s_2, a_2, r_3, s_3, a_3 . . .

is called a trajectory.

In the general case, the trajectory is infinite, which means that the interaction process is neverending. However, in most practical cases the episodic property holds, which basically means that the interaction will eventually come to some sort of an end (footnote 3). Formally, it can be simulated by the environment getting stuck in the last state with zero probability of transitioning to any other state and zero reward signal. Then it is convenient to reset the environment back to s_0 to initiate a new interaction. One such interaction cycle from s_0 till reset, spawning one trajectory of some finite length T, is called an episode. Without loss of generality, it can be considered that there exists a set of terminal states S^+, which mark the ends of interactions. By convention, transitions (s_t, a_t, r_{t+1}, s_{t+1}) are accompanied by a binary flag done_{t+1} ∈ {0, 1} indicating whether s_{t+1} belongs to S^+. As the timestep t at which the transition was gathered is usually of no importance, transitions are often denoted as (s, a, r', s', done) with primes marking the «next timestep».
Note that the length T of an episode may vary between different interactions, but the episodic property holds if interaction is guaranteed to end after some finite time T^max. If this is not the case, the task is called continuing.
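To make the notation concrete, the following minimal sketch rolls out one episode and stores it as a list of transitions (s, a, r', s', done). It assumes a hypothetical gym-style environment object env with reset() and step() methods returning (next state, reward, done), and uses a uniformly random placeholder policy; none of these names come from the paper.

import random

def rollout_episode(env, actions, max_steps=1000):
    # Collect one episode as a list of transitions (s, a, r', s', done).
    transitions = []
    s = env.reset()                      # initial state s_0
    for _ in range(max_steps):
        a = random.choice(actions)       # placeholder policy: uniform random action
        s_next, r, done = env.step(a)    # environment responds with s', r', done
        transitions.append((s, a, r, s_next, done))
        if done:                         # terminal state reached: episode ends
            break
        s = s_next
    return transitions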

2.3. Objective
In reinforcement learning, the agent's goal is to maximize a cumulative reward. In the episodic case, this reward can be expressed as the sum of all reward signals received during one episode and is called the return:

R := \sum_{t=1}^{T} r_t \qquad (1)

2 the reward function is often introduced as stochastic and dependent on the action a, i. e. R(r | s, a) : S × A → P(R), while instead of a fixed s_0 a distribution over S is given. Both extensions can be taken into account in terms of the presented definition by extending the state space and incorporating all the uncertainty into the transition probability T.
3 natural examples include the end of the game or the agent's failure/success in completing some task.

Note that this quantity is formally a random variable, which depends on agent’s choices and the
outcomes of environment transitions. As this stochasticity is an inevitable part of interaction process,
the underlying distribution from which rt is sampled must be properly introduced to set rigorously
the task of return maximization.

Definition 3. The agent's algorithm for choosing a given the current state s, which in general can be viewed as a distribution π(a | s) on the domain A, is called a policy (strategy).

A deterministic policy, represented by a deterministic function π : S → A, can be viewed as a particular case of a stochastic policy with a degenerate distribution π(a | s), when the agent's output is still a distribution but with zero probability of choosing an action other than π(s). In both cases it is considered that the agent sends to the environment a sample a ∼ π(a | s).
Note that given some policy π(a | s) and transition probabilities T, the complete interaction process becomes defined from the probabilistic point of view:

Definition 4. For a given MDP and policy π, the probability of observing

s_0, a_0, s_1, a_1, s_2, a_2 . . .

is called the trajectory distribution and is denoted by T_π:

\mathcal{T}_\pi := \prod_{t \ge 0} p(s_{t+1} \mid s_t, a_t)\, \pi(a_t \mid s_t)

It is always substantial to keep track of which policy was used to collect certain transitions (roll-outs and episodes) during the learning procedure, as they are essentially samples from the corresponding trajectory distribution. If the policy is modified in any way, the trajectory distribution changes as well.
Now that a policy induces a trajectory distribution, it is possible to formulate the task of expected reward maximization:

\mathbb{E}_{\mathcal{T}_\pi} \sum_{t=1}^{T} r_t \to \max_\pi

To ensure the finiteness of this expectation and to avoid the case when the agent is allowed to gather infinite reward, a limit on the absolute value of r_t can be assumed:

|r_t| \le R^{max}

Together with the limit on the episode length T^max, this restriction guarantees the finiteness of the optimal (maximal) expected reward.
To extend this intuition to continuing tasks, the reward for each next interaction step is multiplied by some discount coefficient γ ∈ [0, 1), which is often introduced as part of the MDP. This corresponds to the logic that with probability 1 − γ the agent «dies» and does not gain any additional reward, which models the paradigm «better now than later». In practice, this discount factor is set very close to 1.

Definition 5. For a given MDP and policy π the discounted expected reward is defined as

J(\pi) := \mathbb{E}_{\mathcal{T}_\pi} \sum_{t \ge 0} \gamma^t r_{t+1}

The reinforcement learning task is to find an optimal policy π∗, which maximizes the discounted expected reward:

J(\pi) \to \max_\pi \qquad (2)
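As a sanity check of definition 5, the discounted sum can be computed from a recorded list of rewards; J(π) is then estimated by averaging over many episodes sampled from the trajectory distribution induced by π. The sketch below is illustrative and follows the reward indexing r_1, r_2, ... used above.

def discounted_return(rewards, gamma=0.99):
    # rewards = [r_1, r_2, ..., r_T]; returns sum_t gamma^t * r_{t+1}
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g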

2.4. Value functions
Solving the reinforcement learning task (2) usually leads to a policy that maximizes the expected reward not only for the starting state s_0, but for any state s ∈ S. This follows from the Markov property: the reward which is yet to be collected from some step t does not depend on the previous history, and for an agent staying at state s the task of behaving optimally is equivalent to maximizing the expected reward with the current state s as a starting state. This is the particular reason why many reinforcement learning algorithms seek not only an optimal policy, but also additional information about the usefulness of each state.

Definition 6. For a given MDP and policy π the value function under policy π is defined as

V^\pi(s) := \mathbb{E}_{\mathcal{T}_\pi \mid s_0 = s} \sum_{t \ge 0} \gamma^t r_{t+1}

This value function estimates how good it is for an agent utilizing strategy π to visit state s and generalizes the notion of the discounted expected reward J(π), which corresponds to V^π(s_0).

As a value function can be induced by any policy, the value function V^π(s) under an optimal policy π∗ can also be considered. By convention (footnote 4), it is denoted as V∗(s) and is called the optimal value function.
Obtaining the optimal value function V∗(s) doesn't provide enough information to reconstruct some optimal policy π∗ due to the unknown world dynamics, i. e. transition probabilities. In other words, being blind to which state s' may be the environment's response to a certain action in a given state makes knowing the optimal value function unhelpful. This intuition suggests introducing a similar notion comprising more information:

Definition 7. For a given MDP and policy π the quality function (Q-function) under policy π is defined as

Q^\pi(s, a) := \mathbb{E}_{\mathcal{T}_\pi \mid s_0 = s, a_0 = a} \sum_{t \ge 0} \gamma^t r_{t+1}

It directly follows from the definitions that these two functions are deeply interconnected:

Q^\pi(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma V^\pi(s') \right] \qquad (3)

V^\pi(s) = \mathbb{E}_{a \sim \pi(a \mid s)} Q^\pi(s, a) \qquad (4)



The notion of the optimal Q-function Q∗(s, a) can be introduced analogously. But, unlike the value function, obtaining Q∗(s, a) actually means solving the reinforcement learning task: indeed,

Proposition 3. If Q∗(s, a) is a quality function under some optimal policy, then

\pi^*(s) = \operatorname{argmax}_a Q^*(s, a)

is an optimal policy.

This result implies that instead of searching for an optimal policy π∗, an agent can search for the optimal Q-function and derive the policy from it.

Proposition 4. For any MDP, the existence of an optimal policy implies the existence of a deterministic optimal policy.
4 though optimal policy may not be unique, the value functions under any optimal policy that behaves optimally from any

given state (not only s0 ) coincide. Yet, optimal policy may not know optimal behaviour for some states if it knows how to
avoid them with probability 1.

2.5. Classes of algorithms
Reinforcement learning algorithms are presented in the form of computational procedures specifying a strategy for collecting interaction experience and obtaining a policy with as high J(π) as possible. They rarely include a stopping criterion like classic optimization methods do, as the stochasticity of the given setting prevents any reasonable verification of optimality; usually the number of iterations to perform is determined by the amount of computational resources. All reinforcement learning algorithms can be roughly divided into four (footnote 5) classes:

• meta-heuristics: this class of algorithms treats the task as black-box optimization with a zeroth-order oracle. They usually generate a set of policies π_1 . . . π_P and launch several episodes of interaction for each to determine the best and worst policies according to average return. After that they try to construct better policies using evolutionary or advanced random search techniques [15].

• policy gradient: these algorithms directly optimize (2), trying to obtain π∗ and no additional information about the MDP, using approximate estimations of the gradient with respect to policy parameters. They consider the RL task as optimization with a stochastic first-order oracle and make use of the interaction structure to lower the variance of gradient estimations. They will be discussed in sec. 5.

• value-based algorithms construct an optimal policy implicitly by obtaining an approximation of the optimal Q-function Q∗(s, a) using dynamic programming. In DRL, the Q-function is represented with a neural network and approximate dynamic programming is performed using a reduction to supervised learning. This framework will be discussed in sec. 3 and 4.

• model-based algorithms exploit learned or given world dynamics, i. e. the distributions p(s' | s, a) from T. The class of algorithms used when the model is explicitly provided is represented by such algorithms as Monte-Carlo Tree Search; if the model is not given, it is possible to imitate the world dynamics by learning the outputs of the black box from interaction experience [10].

2.6. Measurements of performance


The achieved performance (score) in terms of average cumulative reward is not the only measure of RL algorithm quality. When speaking of real-life robots, the required number of simulated episodes is always the biggest concern. It is usually measured in terms of interaction steps (where a step is one transition performed by the environment) and is referred to as sample efficiency.
When simulation is more or less cheap, RL algorithms can be viewed as a special kind of optimization procedure. In this case, the final performance of the found policy is opposed to the required computational resources, measured by wall-clock time. In most cases RL algorithms can be expected to find a better policy after more iterations, but the amount of these iterations tends to be unjustified.
The ratio between the amount of interaction and the wall-clock time required for one policy update varies significantly between algorithms. It is well known that model-based algorithms tend to have the greatest sample efficiency at the cost of expensive update iterations, while evolutionary algorithms require excessive amounts of interaction while providing massive opportunities for parallelization and reduction of wall-clock time. Value-based and policy gradient algorithms, which will be the focus of our further discussion, are known to lie somewhere in between.

5 in many sources evolutionary algorithms are bypassed in discussion as they do not utilize the structure of RL task in any

way.

3. Value-based algorithms
3.1. Temporal Difference learning
In this section we consider the temporal difference learning algorithm [21, Chapter 6], a classical Reinforcement Learning method at the base of the modern value-based approach in DRL.
The first idea behind this algorithm is to search for the optimal Q-function Q∗(s, a) by solving a system of recursive equations, which can be derived by recalling the interconnection between the Q-function and the value function (3):

Q^\pi(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma V^\pi(s') \right] = \{\text{using } (4)\} = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma \mathbb{E}_{a' \sim \pi(a' \mid s')} Q^\pi(s', a') \right]

This equation, named the Bellman equation, remains true for the value functions under any policy, including an optimal policy π∗:

Q^*(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma \mathbb{E}_{a' \sim \pi^*(a' \mid s')} Q^*(s', a') \right] \qquad (5)

Recalling proposition 3, an optimal (deterministic) policy can be represented as \pi^*(s) = \operatorname{argmax}_a Q^*(s, a). Substituting this for π∗(s) in (5), we obtain the fundamental Bellman optimality equation:

Proposition 5. (Bellman optimality equation)

Q^*(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma \max_{a'} Q^*(s', a') \right] \qquad (6)

The straightforward utilization of this result is as follows. Consider the tabular case, when both the state space S and the action space A are finite (and small enough to be listed in computer memory). Let us also assume for now that the transition probabilities are available to the training procedure. Then Q∗(s, a) : S × A → R can be represented as a finite table with |S||A| numbers. In this case (6) just gives a set of |S||A| equations for this table to satisfy.
Addressing the values of the table as unknown variables, this system of equations can be solved using the basic point iteration method: let Q∗_0(s, a) be arbitrary initial values of the table (with the only exception that for terminal states s ∈ S^+, if any, Q∗_0(s, a) = 0 for all actions a). On each iteration t the table is updated by substituting the current values of the table into the right side of the equation until the process converges:

Q^*_{t+1}(s, a) = \mathbb{E}_{s' \sim p(s' \mid s, a)} \left[ r(s') + \gamma \max_{a'} Q^*_t(s', a') \right] \qquad (7)

This straightforward approach of learning the optimal Q-function, named Q-learning, has been
extensively studied in classical Reinforcement Learning. One of the central results is presented in
the following convergence theorem:

Proposition 6. Let B denote the operator (S × A → R) → (S × A → R) updating Q^*_t as in (7):

Q^*_{t+1} = B Q^*_t

for all state-action pairs s, a. Then B is a contraction mapping, i. e. for any two tables Q_1, Q_2 ∈ (S × A → R)

\| B Q_1 - B Q_2 \|_\infty \le \gamma \| Q_1 - Q_2 \|_\infty

Therefore, the system of equations (7) has a unique fixed point, and the point iteration method converges to it.

The contraction mapping property is actually of high importance. It demonstrates that the point iteration algorithm converges with exponential speed and requires a small number of iterations. As the true Q∗ is a fixed point of (6), the algorithm is guaranteed to yield the correct answer. The catch is that each iteration demands a full pass across all state-action pairs and exact computation of expectations over transition probabilities.
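A minimal sketch of the point iteration (7) for the tabular case with known dynamics; here P[s][a] is assumed to be a list of (probability, next state) pairs and r an array of rewards indexed by the next state. These names are illustrative assumptions, not notation from the paper.

import numpy as np

def q_iteration(P, r, n_states, n_actions, gamma=0.99, n_iters=1000):
    # Point iteration on the Q-table: Q_{t+1}(s,a) = E_{s'}[ r(s') + gamma * max_a' Q_t(s',a') ]
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)                      # max_a' Q_t(s', a') for every s'
        Q_new = np.zeros_like(Q)
        for s in range(n_states):
            for a in range(n_actions):
                Q_new[s, a] = sum(p * (r[s2] + gamma * V[s2]) for p, s2 in P[s][a])
        Q = Q_new
    return Q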
In the general case, these expectations cannot be explicitly computed. Instead, the agent is restricted to samples from the transition probabilities gained during some interaction experience. The Temporal Difference (TD) algorithm (footnote 6) proposes to collect this data using \pi_t = \operatorname{argmax}_a Q^*_t(s, a) \approx \pi^* and, after each gathered transition (s_t, a_t, r_{t+1}, s_{t+1}), to update only one cell of the table:

Q^*_{t+1}(s, a) = \begin{cases} (1 - \alpha_t) Q^*_t(s, a) + \alpha_t \left[ r_{t+1} + \gamma \max_{a'} Q^*_t(s_{t+1}, a') \right] & \text{if } s = s_t, \, a = a_t \\ Q^*_t(s, a) & \text{else} \end{cases} \qquad (8)

where \alpha_t \in (0, 1) plays the role of an exponential smoothing parameter for estimating the expectation \mathbb{E}_{s' \sim p(s' \mid s_t, a_t)}(\cdot) from samples.
Two key ideas are introduced in the update formula (8): exponential smoothing instead of exact expectation computation, and cell-by-cell updates instead of updating the full table at once. Both are required to adapt the Q-learning algorithm for online application.
As the set S^+ of terminal states is usually unknown beforehand in the online setting, a slight modification of update (8) is used. If the observed next state s' turns out to be terminal (recall the convention to denote this by the flag done), its value function is known to be equal to zero:

V^*(s') = \max_{a'} Q^*(s', a') = 0

This knowledge is embedded in the update rule (8) by multiplying \max_{a'} Q^*_t(s_{t+1}, a') by (1 - done_{t+1}). For the sake of brevity, this factor is often omitted but should always be present in implementations.
A second important note about formula (8) is that it can be rewritten in the following equivalent way:

Q^*_{t+1}(s, a) = \begin{cases} Q^*_t(s, a) + \alpha_t \left[ r_{t+1} + \gamma \max_{a'} Q^*_t(s_{t+1}, a') - Q^*_t(s, a) \right] & \text{if } s = s_t, \, a = a_t \\ Q^*_t(s, a) & \text{else} \end{cases} \qquad (9)

The expression in the brackets, referred to as the temporal difference, represents the difference between the Q-value Q^*_t(s, a) and its one-step approximation r_{t+1} + \gamma \max_{a'} Q^*_t(s_{t+1}, a'), which must be zero in expectation for the true optimal Q-function.
The idea of exponential smoothing allows us to formulate the first practical algorithm which can work in the tabular case with unknown world dynamics:

Algorithm 1: Temporal Difference algorithm

Hyperparameters: \alpha_t \in (0, 1)

Initialize Q^*(s, a) arbitrarily

On each interaction step:

1. select a = \operatorname{argmax}_a Q^*(s, a)

2. observe transition (s, a, r', s', done)

3. update the table:

Q^*(s, a) \leftarrow Q^*(s, a) + \alpha_t \left[ r' + (1 - done)\, \gamma \max_{a'} Q^*(s', a') - Q^*(s, a) \right]

It turns out that under several assumptions on state visitation during interaction process this
procedure holds similar properties in terms of convergence guarantees, which are stated by the
following theorem:
6 also known as TD(0) due to theoretical generalizations

Proposition 7. [26] Let us define

e_t(s, a) = \begin{cases} \alpha_t & (s, a) \text{ is updated on step } t \\ 0 & \text{otherwise} \end{cases}

Then if for every state-action pair (s, a)

\sum_{t}^{+\infty} e_t(s, a) = \infty, \qquad \sum_{t}^{+\infty} e_t(s, a)^2 < \infty,

algorithm 1 converges to the optimal Q∗ with probability 1.

This theorem states that the basic point iteration method can actually be applied online in the way proposed by the TD algorithm, but it demands «enough exploration» from the strategy of interacting with the MDP during training. Satisfying this demand remains a unique and common problem of reinforcement learning.
The widespread kludge is the ε-greedy strategy, which basically suggests choosing a random action instead of a = \operatorname{argmax}_a Q^*(s, a) with probability \varepsilon_t. The probability \varepsilon_t is usually set close to 1 during the first interaction iterations and scheduled to decrease to a constant close to 0. This heuristic makes the agent visit all states with non-zero probability independent of what the current approximation Q^*(s, a) suggests.
The main practical issue with the Temporal Difference algorithm is that it requires the table Q^*(s, a) to be explicitly stored in memory, which is impossible for MDPs with high state space complexity. This limitation substantially restricted its applicability until its combination with deep neural networks was proposed.
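A minimal sketch of algorithm 1 combined with the ε-greedy kludge, again under the assumption of a hypothetical env whose reset()/step() return integer-encoded states and a (next state, reward, done) tuple:

import numpy as np

def tabular_q_learning(env, n_states, n_actions, episodes=500,
                       alpha=0.1, gamma=0.99, eps=0.1):
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if np.random.rand() < eps:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # TD update; the (1 - done) factor zeroes the bootstrap term at terminal states
            target = r + (1.0 - done) * gamma * Q[s_next].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q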

3.2. Deep Q-learning (DQN)


Utilizing neural nets to model either a policy or a Q-function frees us from constructing task-specific features and opens the possibility of applying RL algorithms to complex tasks, e. g. tasks with images as input. Video games are a classical example of such tasks, where raw screen pixels are provided as the state representation and, correspondingly, as input to either the policy or the Q-function.
The main idea of Deep Q-learning [13] is to adapt the Temporal Difference algorithm so that update formula (9) becomes equivalent to a gradient descent step for training a neural network to solve a certain regression task. Indeed, it can be noticed that the exponential smoothing parameter \alpha_t resembles the learning rate of first-order gradient optimization procedures, while the exploration conditions from theorem 7 look identical to the restrictions on the learning rate of stochastic gradient descent.
The key hint is that (9) is actually a gradient descent step in the parameter space of the table functions family:

Q^*(s, a, \theta) = \theta_{s,a}

where all \theta_{s,a} form a vector of parameters \theta \in \mathbb{R}^{|S||A|}.
To unravel this fact, it is convenient to introduce some notation from regression tasks. First, let's denote by y the target of our regression task, i. e. the quantity that our model is trying to predict:

y(s, a) := r(s') + \gamma \max_{a'} Q^*(s', a', \theta) \qquad (10)

where s' is a sample from p(s' | s, a) and s, a is the input data. In this notation, (9) is equivalent to:

\theta_{t+1} = \theta_t + \alpha_t \left[ y(s, a) - Q^*(s, a, \theta_t) \right] e^{s,a}

where we multiplied the scalar value \alpha_t \left[ y(s, a) - Q^*(s, a, \theta_t) \right] by the following vector e^{s,a}

e^{s,a}_{i,j} := \begin{cases} 1 & (i, j) = (s, a) \\ 0 & (i, j) \neq (s, a) \end{cases}

to formulate an update of only one component of \theta in vector form. By this we transitioned to an update in parameter space using Q^*(s, a, \theta) = \theta_{s,a}. Remark that for the table functions family the derivative of Q^*(s, a, \theta) with respect to \theta for given input s, a is its one-hot encoding, i. e. exactly e^{s,a}:

\frac{\partial Q^*(s, a, \theta)}{\partial \theta} = e^{s,a} \qquad (11)
The statement now is that this formula is a gradient descent update for a regression with input s, a, target y(s, a) and MSE loss function:

\operatorname{Loss}(y(s, a), Q^*(s, a, \theta_t)) = (Q^*(s, a, \theta_t) - y(s, a))^2 \qquad (12)

Indeed:

\theta_{t+1} = \theta_t + \alpha_t \left[ y(s, a) - Q^*(s, a, \theta_t) \right] e^{s,a} =

\{(12)\} = \theta_t - \alpha_t \frac{\partial \operatorname{Loss}(y, Q^*(s, a, \theta_t))}{\partial Q^*} e^{s,a} =

\{(11)\} = \theta_t - \alpha_t \frac{\partial \operatorname{Loss}(y, Q^*(s, a, \theta_t))}{\partial Q^*} \frac{\partial Q^*(s, a, \theta_t)}{\partial \theta} =

\{\text{chain rule}\} = \theta_t - \alpha_t \frac{\partial \operatorname{Loss}(y, Q^*(s, a, \theta_t))}{\partial \theta}

The obtained result is evidently a gradient descent step formula for minimizing the MSE loss function with target (10):

\theta_{t+1} = \theta_t - \alpha_t \frac{\partial \operatorname{Loss}(y, Q^*(s, a, \theta_t))}{\partial \theta} \qquad (13)
It is important that the dependence of y on θ is ignored during gradient computation (otherwise the chain rule application with y being dependent on θ would be incorrect). On each step of the temporal difference algorithm, a new target y is constructed using the current Q-function approximation, and a new regression task with this target is set. For this fixed target, one MSE optimization step is done according to (13), and on the next step a new regression task is defined. Though during each step the target is considered to represent some ground truth, like in supervised learning, here it merely provides a direction of optimization and for this reason is sometimes called a guess.
Notice that representation (13) is equivalent to the standard TD update (9), with all theoretical results remaining valid while the parametric family Q(s, a, θ) is a table functions family. At the same time, (13) can be formally applied to any parametric function family, including neural networks. It must be taken into account that this transition is not rigorous and all theoretical guarantees provided by theorem 7 are lost at this moment.
Further on we assume that the optimal Q-function is approximated with a neural network Q∗_θ(s, a) with parameters θ. Note that for the discrete action space case this network may take only s as input and output |A| numbers representing Q∗_θ(s, a_1) . . . Q∗_θ(s, a_{|A|}), which allows finding an optimal action in a given state s with a single forward pass through the net. Therefore, the target y for a given transition (s, a, r', s', done) can be computed with one forward pass, and an optimization step can be performed in one more forward (footnote 7) and one backward pass.
A small issue with this straightforward approach is that, of course, it is impractical to train neural networks with batches of size 1. In [13] it is proposed to use an experience replay to store all collected transitions (s, a, r', s', done) as data samples and on each iteration sample a batch of a size standard for neural network training. As usual, the loss function is assumed to be an average of the losses for each transition from the batch. This utilization of previously experienced transitions is legitimate because the TD algorithm is known to be an off-policy algorithm, which means it can work with arbitrary transitions gathered by any agent's interaction experience. One more important benefit of experience replay is sample decorrelation, as consecutive transitions from interaction are often similar to each other, since the agent usually stays in a particular part of the MDP.
Though the empirical results of the described algorithm turned out to be promising, the behaviour of the Q∗_θ values indicated the instability of the learning process. Reconstruction of the target after each optimization step led to a so-called compound error, when approximation error propagated from the close-to-terminal states to the starting ones in an avalanche manner and could lead to the guess being 10^6 and more times bigger than the true Q∗ value. To address this problem, [13] introduced a kludge known as the target network, whose basic idea is to solve a fixed regression problem for K > 1 steps, i. e. recompute the target every K-th step instead of every step.
7 in implementations it is possible to combine s and s' in one batch and perform these two forward passes «at once».
To avoid target recomputation for the whole experience replay, a copy of the neural network Q∗_θ is stored, called the target network. Its architecture is the same, while its weights θ− are a copy of Q∗_θ from the moment of the last target recomputation (footnote 8), and its main purpose is to generate targets y for the given current batch.
Combining all things together and adding the ε-greedy strategy to facilitate exploration, we obtain the classic DQN algorithm:

Algorithm 2: Deep Q-learning (DQN)

Hyperparameters: B — batch size, K — target network update frequency, ε(t) ∈ (0, 1] — greedy exploration parameter, Q∗_θ — neural network, SGD optimizer.

Initialize weights θ arbitrarily
Initialize θ− ← θ

On each interaction step:

1. select a randomly with probability ε(t), else a = \operatorname{argmax}_a Q^*_\theta(s, a)

2. observe transition (s, a, r', s', done)

3. add the observed transition to the experience replay

4. sample a batch of size B from the experience replay

5. for each transition T from the batch compute the target:

y(T) = r(s') + \gamma \max_{a'} Q^*(s', a', \theta^-)

6. compute the loss:

\operatorname{Loss} = \frac{1}{B} \sum_T (Q^*(s, a, \theta) - y(T))^2

7. make a step of gradient descent using \frac{\partial \operatorname{Loss}}{\partial \theta}

8. if t mod K = 0: θ− ← θ
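A minimal PyTorch-style sketch of steps 5–7 of algorithm 2 for one sampled batch. It assumes q_net and target_net map a batch of states to |A| Q-values and that actions are stored as integer indices; the (1 − done) factor discussed in section 3.1 is kept explicit. This is an illustrative sketch, not the paper's reference implementation.

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # batch: tensors s [B, ...], a [B] (long), r [B] (float), s_next [B, ...], done [B] (float)
    s, a, r, s_next, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)     # Q(s, a, theta)
    with torch.no_grad():                                    # the target y is treated as a constant
        y = r + (1 - done) * gamma * target_net(s_next).max(dim=1).values
    return F.mse_loss(q_sa, y)

# training step: optimizer.zero_grad(); dqn_loss(...).backward(); optimizer.step()
# every K steps: target_net.load_state_dict(q_net.state_dict())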

3.3. Double DQN


Although the target network successfully prevented Q∗_θ from unbounded growth and empirically stabilized the learning process, the values of Q∗_θ in many domains evidently tended towards overestimation. The problem is presumed to reside in the max operation in the target construction formula (10):

y = r(s') + \gamma \max_{a'} Q^*(s', a', \theta^-)

During this estimation, the max shifts the Q-value estimation towards either those actions that led to high reward due to luck or those actions with overestimating approximation error.
The solution proposed in [23] is based on the idea of separating action selection and action evaluation, carrying out each of these operations using its own approximation of Q∗:

\max_{a'} Q^*(s', a', \theta^-) = Q^*(s', \operatorname{argmax}_{a'} Q^*(s', a', \theta^-), \theta^-) \approx Q^*(s', \operatorname{argmax}_{a'} Q^*(s', a', \theta_1^-), \theta_2^-)

The simplest, but expensive, implementation of this idea is to run two independent DQN («Twin DQN») algorithms and use the twin network to evaluate actions:

y_1 = r(s') + \gamma Q_1^*(s', \operatorname{argmax}_{a'} Q_2^*(s', a', \theta_2^-), \theta_1^-)

y_2 = r(s') + \gamma Q_2^*(s', \operatorname{argmax}_{a'} Q_1^*(s', a', \theta_1^-), \theta_2^-)

8 an alternative, but more computationally expensive, option is to update the target network weights on each step using exponential smoothing

Intuitively, each Q-function here may prefer lucky or overestimated actions, but the other Q-function judges them according to its own luck and approximation error, which may be underestimating as well as overestimating. Ideally these two DQNs should not share interaction experience to achieve that, which makes such an algorithm twice as expensive both in terms of computational cost and sample efficiency.
Double DQN [23] is a compromise option which suggests using the current network weights θ for action selection and the target network weights θ− for action evaluation, assuming that when the target network update frequency K is big enough these two networks are sufficiently different:

y = r(s') + \gamma Q^*(s', \operatorname{argmax}_{a'} Q^*(s', a', \theta), \theta^-)
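A minimal sketch of the Double DQN target, where the online network θ selects the action and the target network θ− evaluates it; the same assumptions as in the previous sketch apply.

import torch

def double_dqn_target(q_net, target_net, r, s_next, done, gamma=0.99):
    with torch.no_grad():
        a_star = q_net(s_next).argmax(dim=1, keepdim=True)        # selection: argmax_a' Q(s', a', theta)
        q_eval = target_net(s_next).gather(1, a_star).squeeze(1)  # evaluation: Q(s', a*, theta^-)
        return r + (1 - done) * gamma * q_eval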

3.4. Dueling DQN


Another issue with DQN (algorithm 2) emerges when a huge part of the considered MDP consists of states of low optimal value V∗(s), which is often the case. The problem is that when the agent visits an unpromising state, instead of lowering its value V∗(s) it remembers only the low pay-off for performing some action a in it by updating Q∗(s, a). This leads to regular returns to this state during future interactions, until all actions prove to be unpromising and all Q∗(s, a) are updated. The problem gets worse when the cardinality of the action space is high or there are many similar actions in the action space.
One benefit of deep reinforcement learning is that we are able to facilitate generalization across actions by specifying the architecture of the neural network. To do so, we need to encourage the learning of V∗(s) from updates of Q∗(s, a). The idea of the dueling architecture [25] is to incorporate an approximation of V∗(s) explicitly into the computational graph. For that purpose we need the definition of the advantage function:

Definition 8. For given MDP and policy π the advantage function under policy π is defined as

Aπ (s, a) := Qπ (s, a) − V π (s) (14)

The advantage function is evidently interconnected with the Q-function and the value function and actually shows the relative advantage of selecting action a compared to the average performance of the policy. If for some state A^π(s, a) > 0, then modifying π to select a more often in this particular state will lead to a better policy, as its average return will become bigger than the initial V^π(s). This follows from the following property of an arbitrary advantage function:

\mathbb{E}_{a \sim \pi(a \mid s)} A^\pi(s, a) = \mathbb{E}_{a \sim \pi(a \mid s)} \left[ Q^\pi(s, a) - V^\pi(s) \right] = \mathbb{E}_{a \sim \pi(a \mid s)} Q^\pi(s, a) - V^\pi(s) = \{\text{using } (4)\} = V^\pi(s) - V^\pi(s) = 0 \qquad (15)

The definition of the optimal advantage function A∗(s, a) is analogous and allows us to reformulate Q∗(s, a) in terms of V∗(s) and A∗(s, a):

Q^*(s, a) = V^*(s) + A^*(s, a) \qquad (16)

The straightforward utilization of this decomposition is the following: after several feature-extracting layers, the network is split into two heads, one outputting a single scalar V∗(s) and one outputting |A| numbers A∗(s, a), like it was done in DQN for the Q-function. After that, this scalar value estimation is added to all components of A∗(s, a) in order to obtain Q∗(s, a) according to (16). The problem with this naive approach is that due to (15) the advantage function cannot be arbitrary and must satisfy property (15) for Q∗(s, a) to be identifiable.
This restriction (15) on the advantage function can be simplified for the case when the optimal policy is induced by the optimal Q-function:

0 = \mathbb{E}_{a \sim \pi^*(a \mid s)} Q^*(s, a) - V^*(s) = Q^*(s, \operatorname{argmax}_a Q^*(s, a)) - V^*(s) = \max_a Q^*(s, a) - V^*(s) = \max_a \left[ Q^*(s, a) - V^*(s) \right] = \max_a A^*(s, a)

This condition can be easily satisfied in the computational graph by subtracting \max_a A^*(s, a) from the advantage head. This is equivalent to the following formula of dueling DQN:

Q^*(s, a) = V^*(s) + A^*(s, a) - \max_a A^*(s, a) \qquad (17)

An interesting nuance of this improvement is that after evaluation on Atari-57 the authors discovered that substituting the max operation in (17) with averaging across actions led to better results (while usage of the unidentifiable formula (16) led to poor performance). Although gradients can be backpropagated through both operations and formula (17) seems theoretically justified, in practical implementations averaging instead of maximum is widespread.
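A minimal PyTorch sketch of the dueling head (17); the layer sizes are illustrative assumptions, and the widespread mean-subtraction variant mentioned above is shown in a comment.

import torch.nn as nn

class DuelingHead(nn.Module):
    def __init__(self, n_features, n_actions):
        super().__init__()
        self.value = nn.Linear(n_features, 1)               # V*(s) head
        self.advantage = nn.Linear(n_features, n_actions)   # A*(s, a) head

    def forward(self, features):
        v = self.value(features)                            # [B, 1]
        adv = self.advantage(features)                      # [B, |A|]
        # formula (17): subtract max_a A to make the decomposition identifiable
        return v + adv - adv.max(dim=1, keepdim=True).values
        # practical variant: return v + adv - adv.mean(dim=1, keepdim=True)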

3.5. Noisy DQN


By default, the DQN algorithm does not address the exploration problem and is always augmented with the ε-greedy strategy to force the agent to discover new states. This baseline exploration strategy suffers from being extremely hyperparameter-sensitive, as an early decrease of ε(t) to close-to-zero values may lead to getting stuck in local optima, when the agent is unable to explore new options due to an imperfect Q∗, while high values of ε(t) force the agent to behave randomly for an excessive number of episodes, which slows down learning. In other words, the ε-greedy strategy transfers the responsibility for solving the exploration-exploitation trade-off to the engineer.
The key reason why the ε-greedy exploration strategy is relatively primitive is that the exploration priority does not depend on the current state. Intuitively, the choice whether to exploit knowledge by selecting an approximately optimal action or to explore the MDP by selecting some other one depends on how well explored the current state s is. Discovering a new part of the state space after any amount of interaction probably indicates that random actions are good to try there, while close-to-initial states will probably be sufficiently explored after the first several episodes.
In the ε-greedy strategy the agent selects an action using the deterministic Q∗(s, a, θ) and only afterwards injects state-independent noise in the form of an ε(t) probability of choosing a random action. Noisy networks [4] were proposed as a simple extension of DQN to provide state-dependent and parameter-free exploration by injecting noise of trainable magnitude into all (or most, see footnote 9) nodes of the computational graph.
Let a linear layer with m inputs and n outputs in the q-network perform the following computation:

y(x) = W x + b

where x ∈ R^m is the input, W ∈ R^{n×m} is the weight matrix, and b ∈ R^n is the bias. In noisy layers it is proposed to substitute the deterministic parameters with samples from N(µ, σ), where µ, σ are trained with gradient descent (footnote 10). On the forward pass through the noisy layer we sample ε_W ∼ N(0, I_{nm×nm}), ε_b ∼ N(0, I_{n×n}) and then compute

W = µ_W + σ_W ⊙ ε_W
b = µ_b + σ_b ⊙ ε_b
y(x) = W x + b

where ⊙ denotes element-wise multiplication and µ_W, σ_W ∈ R^{n×m}, µ_b, σ_b ∈ R^n are the trainable parameters of the layer. Note that the number of parameters for such layers is doubled compared to ordinary layers.
9 usually it is not injected in very first layers responsible for feature extraction like convolutional layers in networks for

images as input.
10 using standard reparametrization trick

As the output of the q-network now becomes a random variable, the loss value becomes a random variable too. As in similar models for supervised learning, on each step the expectation of the loss function over the noise is minimized:

\mathbb{E}_\varepsilon \operatorname{Loss}(\theta, \varepsilon) \to \min_\theta

The gradient in this setting can be estimated using Monte-Carlo:

\nabla_\theta \mathbb{E}_\varepsilon \operatorname{Loss}(\theta, \varepsilon) = \mathbb{E}_\varepsilon \nabla_\theta \operatorname{Loss}(\theta, \varepsilon) \approx \nabla_\theta \operatorname{Loss}(\theta, \varepsilon), \quad \varepsilon \sim \mathcal{N}(0, I)

It can be seen that the amount of noise actually affecting the output of the network may vary for different inputs, i. e. for different states. There are no guarantees that this amount will reduce as the interaction proceeds; the behaviour of the average magnitude of noise injected in the network over time is reported to be extremely sensitive to the initialization of σ_W, σ_b and to vary from MDP to MDP.
One technical issue with noisy layers is that on each pass an excessive amount (equal to the number of network parameters) of noise samples is required. This may substantially reduce the computational efficiency of a forward pass through the network. For optimization purposes it is proposed to obtain the noise for weight matrices in the following way: sample just n + m noise samples ε_W^1 ∼ N(0, I_{m×m}), ε_W^2 ∼ N(0, I_{n×n}) and acquire the matrix noise in a factorized form:

\varepsilon_W = f(\varepsilon_W^1) f(\varepsilon_W^2)^T

where f is a scaling function, e. g. f(x) = \operatorname{sign}(x)\sqrt{|x|}. The benefit of this procedure is that it requires m + n samples instead of mn, but sacrifices the interlayer independence of the noise.
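A minimal PyTorch sketch of a noisy linear layer with factorized noise. The initialization constants and the reuse of the output noise for the bias are illustrative assumptions rather than the exact choices of [4].

import torch
import torch.nn as nn

class NoisyLinear(nn.Module):
    def __init__(self, m, n, sigma_init=0.5):
        super().__init__()
        self.mu_w = nn.Parameter(torch.empty(n, m).uniform_(-1 / m ** 0.5, 1 / m ** 0.5))
        self.sigma_w = nn.Parameter(torch.full((n, m), sigma_init / m ** 0.5))
        self.mu_b = nn.Parameter(torch.zeros(n))
        self.sigma_b = nn.Parameter(torch.full((n,), sigma_init / m ** 0.5))

    @staticmethod
    def f(x):                               # scaling function f(x) = sign(x) * sqrt(|x|)
        return x.sign() * x.abs().sqrt()

    def forward(self, x):
        eps_in = self.f(torch.randn(self.mu_w.shape[1]))    # m samples
        eps_out = self.f(torch.randn(self.mu_w.shape[0]))   # n samples
        eps_w = torch.outer(eps_out, eps_in)                # factorized matrix noise
        w = self.mu_w + self.sigma_w * eps_w                # element-wise perturbation of weights
        b = self.mu_b + self.sigma_b * eps_out              # bias noise reuses the output samples
        return x @ w.t() + b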

3.6. Prioritized experience replay


In DQN, each batch of transitions is sampled from the experience replay using a uniform distribution, treating the collected data as equally prioritized. In such a scheme, the states for each update come from the same distribution as they come from interaction experience (except that they become decorrelated), which agrees with the TD algorithm underlying DQN.
Intuitively, observed transitions vary in their importance. At the beginning of training, most guesses tend to be more or less random, as they rely on the arbitrarily initialized Q∗_θ, and the only source of trusted information are transitions with non-zero received reward, especially near terminal states where V∗_θ(s') is known to be equal to 0. In the midway of training, most of the experience replay is filled with the memory of interaction within a well-learned part of the MDP, while the most crucial information is contained in transitions where the agent explored new promising areas and gained novel reward yet to be propagated through the Bellman equation. All these significant transitions are drowned in the collected data and rarely appear in sampled batches.
The central idea of prioritized experience replay [16] is that the priority of some transition T = (s, a, r', s', done) is proportional to its temporal difference:

\rho(T) := \left| y(T) - Q^*(s, a, \theta) \right| = \sqrt{\operatorname{Loss}(y(T), Q^*(s, a, \theta))} \qquad (18)

Using these priorities as a proxy of transition importance, sampling from the experience replay proceeds using the following probabilities:

\mathrm{P}(T) \propto \rho(T)^\alpha

where the hyperparameter α ∈ R^+ controls the degree to which the sampling weights are sparsified: the case α = 0 corresponds to the uniform sampling distribution, while α = +∞ is equivalent to greedy sampling of the transitions with the highest priority.
The problem with claim (18) is that each transition's priority changes after each network update. As it is impractical to recalculate the loss for all stored data after each step, some simplifications must be put up with. The straightforward option is to update the priority only for the sampled transitions in the current batch. New transitions can be added to the experience replay with the highest priority, i. e. \max_T \rho(T) (footnote 11).
A second debatable issue of prioritized replay is that it actually substitutes the loss function of DQN updates, which assumed uniform sampling of visited states to ensure they come from the state visitation distribution:

\mathbb{E}_{T \sim \text{Uniform}} \operatorname{Loss}(T) \to \min_\theta

11 which can be computed online with O(1) complexity

While it is not clear which distribution is better to sample from to ensure the exploration restrictions of theorem 7, prioritized experience replay changes this distribution in an uncontrollable way. Despite its fruitfulness at the beginning and midway of the training process, this distribution shift may destabilize learning close to the end and make the algorithm stuck with a locally optimal policy. Since formally this issue is about estimating an expectation over one probability distribution while preferring to sample from another one, the standard technique called importance sampling can be used as a countermeasure:

\mathbb{E}_{T \sim \text{Uniform}} \operatorname{Loss}(T) = \sum_{i=0}^{M} \frac{1}{M} \operatorname{Loss}(T_i) = \sum_{i=0}^{M} \mathrm{P}(T_i) \frac{1}{M \mathrm{P}(T_i)} \operatorname{Loss}(T_i) = \mathbb{E}_{T \sim \mathrm{P}(T)} \frac{1}{M \mathrm{P}(T)} \operatorname{Loss}(T)
where M is the number of transitions stored in the experience replay memory. Importance sampling implies that we can avoid the distribution shift that introduces undesired bias by making smaller gradient updates for significant transitions, which now appear in the batches with higher frequency. The price for bias elimination is that the importance sampling weights lower the prioritization effect by slowing down the learning of the highlighted new information.
This duality resembles the trade-off between bias and variance, but the important point here is that the distribution shift does not cause any visible issues at the beginning of training, when the agent behaves close to randomly and does not produce a valid state visitation distribution anyway. The idea proposed in [16], based on this intuition, is to anneal the importance sampling weights so that they correct the bias properly only towards the end of the training procedure.
\operatorname{Loss}^{\text{prioritizedER}} = \mathbb{E}_{T \sim \mathrm{P}(T)} \left( \frac{1}{B \mathrm{P}(T)} \right)^{\beta(t)} \operatorname{Loss}(T)

where β(t) ∈ [0, 1] approaches 1 (footnote 12) as more interaction steps are executed. If β(t) is set to 0, no bias correction is performed, while β(t) = 1 corresponds to the unbiased loss function, i. e. equivalent to sampling from the uniform distribution.
The most significant and obvious drawback of the prioritized experience replay approach is that it introduces additional hyperparameters. Although α is just one number, the algorithm's behaviour may turn out to be sensitive to its choice, and β(t) must be designed by the engineer as some scheduled motion from a value near 0 to 1, and its well-tuned selection may require inaccessible knowledge about how many steps it will take for the algorithm to «warm up».
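A minimal sketch of proportional prioritized sampling with importance sampling weights, under simplifying assumptions: a plain array of priorities instead of the sum-tree usually used for efficiency, and the common normalization of the weights by their maximum for stability.

import numpy as np

def sample_prioritized(priorities, batch_size, alpha=0.6, beta=0.4):
    # priorities: array of rho(T) for every stored transition
    p = priorities ** alpha
    probs = p / p.sum()                                   # P(T) proportional to rho(T)^alpha
    idx = np.random.choice(len(priorities), batch_size, p=probs)
    m = len(priorities)
    weights = (1.0 / (m * probs[idx])) ** beta            # importance sampling correction
    weights /= weights.max()                              # normalize for stability
    return idx, weights

# after the gradient step, priorities[idx] are refreshed with the new |TD errors|,
# and new transitions enter the buffer with priority max(priorities).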

3.7. Multi-step DQN


One more widespread modification of Q-learning in the RL community is substituting the one-step approximation present in the Bellman optimality equation (6) with an N-step one:

Proposition 8. (N-step Bellman optimality equation)

Q^*(s_0, a_0) = \mathbb{E}_{\mathcal{T}_{\pi^*} \mid s_0, a_0} \left[ \sum_{t=1}^{N} \gamma^{t-1} r(s_t) + \gamma^N \max_{a_N} Q^*(s_N, a_N) \right] \qquad (19)

Indeed, the definition of Q∗(s, a) consists of the average return and can be viewed as making T^max steps from state s_0 after selecting action a_0, while the vanilla Bellman optimality equation represents Q∗(s, a) as the reward from one next step in the environment plus a recursive estimation of the rest of the trajectory reward. The N-step Bellman equation (19) generalizes these two opposites.
All the same reasoning as for DQN can be applied to the N-step Bellman equation to obtain the N-step DQN algorithm, whose only modification appears in the target computation:

y(s_0, a_0) = \sum_{t=1}^{N} \gamma^{t-1} r(s_t) + \gamma^N \max_{a_N} Q^*(s_N, a_N, \theta) \qquad (20)

12 often it is initialized by a constant close to 0 and is linearly increased until it reaches 1

To perform this computation, we are required to obtain for a given state s and action a not only one next step, but N steps. To do so, N-step roll-outs are stored instead of transitions, which can be done by precomputing the following tuples:

T = \left( s, a, \sum_{n=1}^{N} \gamma^{n-1} r^{(n)}, s^{(N)}, done \right)

where r^{(n)} is the reward received n steps after the visitation of the considered state s, s^{(N)} is the state visited in N steps, and done is a flag indicating whether the episode ended during the N-step roll-out (footnote 13). All other aspects of the algorithm remain the same in practical implementations, and the case N = 1 corresponds to standard DQN.
The goal of using N > 1 is to accelerate the propagation of reward from terminal states backwards
through visited states to s_0, as fewer update steps will be required to take freshly observed reward
into account and optimize behaviour at the beginning of episodes. The price is that formula (20) hides
an important subtlety: for the Bellman equation (19) to remain true, the actions at the second and
following steps must be sampled from π*. In other words, application of N-step Q-learning is
theoretically improper when the behaviour policy differs from π*. Note that we do not face this problem
in the case N = 1, where we are required to sample only from the transition probability p(s' | s, a) for
the given state-action pair s, a.
Even considering π* ≈ argmax_a Q*(s, a, θ), where Q* is our current approximation of the optimal
Q-function, makes N-step DQN an on-policy algorithm: for every state-action pair s, a it is preferable
to sample the target using the closest available approximation of π*. This questions the usage of
experience replay, or at the very least encourages limiting its capacity to store only the M_max newest
transitions with M_max relatively small.
To see the negative effect of N-step DQN, consider the following toy example. Suppose the agent
makes a mistake on the second step after s and ends the episode with a huge negative reward. Then
in the case N ≥ 2, each time the roll-out starting with this s is sampled in the batch, the value of
Q*(s, a, θ) will be updated with this received negative reward, even if Q*(s', ·, θ) has already learned
not to repeat this mistake.
Yet empirical results in many domains demonstrate that raising N from 1 to 2-3 may substantially
accelerate training and positively affect the final performance. At the same time, the theoretical
groundlessness of this approach explains its negative effects when N is set too big.
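As an illustration, here is a minimal Python sketch (the author's implementation details may differ) of precomputing such N-step tuples from a recorded episode, where `transitions` is assumed to be a list of (s, a, r, s_next, done) in visitation order:

def nstep_transitions(transitions, N, gamma):
    out = []
    for t in range(len(transitions)):
        G, done = 0.0, False
        s_N = transitions[t][3]
        for n in range(N):
            if t + n >= len(transitions):
                break
            _, _, r, s_next, done_n = transitions[t + n]
            G += (gamma ** n) * r            # accumulates gamma^{n-1} r^{(n)} (0-indexed here)
            s_N, done = s_next, done_n
            if done_n:                       # roll-outs terminated before N steps are kept as well
                break
        s, a = transitions[t][0], transitions[t][1]
        out.append((s, a, G, s_N, done))
    return out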

13 all N -step roll-outs must be considered including those terminated at k-th step for k < N .

4. Distributional approach for value-based methods
4.1. Theoretical foundations
The RL setting inherently carries internal stochasticity over which the agent has no substantial
control. Sometimes intelligent behaviour implies taking risks with a severe chance of a low episode
return. All this information resides in the distribution of the return R (1) viewed as a random variable.
While value-based methods aim at learning expectation of this random variable as it is the quan-
tity we actually care about, in distributional approach [1] it is proposed to learn the whole distri-
bution of returns. It further extends the information gathered by algorithm about MDP towards
model-based case in which the whole MDP is imitated by learning both reward function r(s) and
transitions T, but still restricts itself only to reward and doesn’t intend to learn world model.
In this section we discuss some theoretical extensions of temporal difference ideas in the case
when expectations on both sides of Bellman equation (5) and Bellman optimality equation (6) are
taken away.
The central object of study in Q-learning was Q-function, which for given state and action returns
the expectation of reward. To rewrite Bellman equation not in terms of expectations, but in terms of
the whole distributions, we require a corresponding notation.

Definition 9. For a given MDP and policy π the value distribution of policy π is a random variable
defined as

Z^π(s, a) := Σ_{t=0}^{∞} γ^t r_{t+1} | s_0 = s, a_0 = a

Note that Z^π is simply the random variable whose expectation is taken in the definition of the
Q-function:

Q^π(s, a) = E_{T∼π} Z^π(s, a)
Using this definition of value distribution, Bellman equation can be rewritten to extend the recur-
sive connection between adjacent states from expectations of returns to the whole distributions of
returns:

Proposition 9. (Distributional Bellman Equation) [1]

Z^π(s, a) =^{c.d.f.} r(s') + γ Z^π(s', a'),     s' ∼ p(s' | s, a), a' ∼ π(a' | s')     (21)

Here we used some auxiliary notation: by =^{c.d.f.} we mean that the cumulative distribution functions
of the random variables on the left and right sides are equal almost everywhere. Such equations are
called recursive distributional equations and are well known in probability theory^14. By | we describe
a sampling procedure for the random variable on the right side of the equation: for given s, a the next
state s' is sampled from the transition probability, then a' is sampled from the given policy, then the
random variable Z^π(s', a') is sampled to produce a sample of r(s') + γ Z^π(s', a').
While the space of Q-functions Q^π(s, a) ∈ S × A → R is finite-dimensional, the space of value
distributions is a space of mappings from state-action pairs to continuous distributions:

Z^π(s, a) ∈ S × A → P(R)

and it is important to notice that even in the table case, when state and action spaces are finite, the
space of value distributions is essentially infinite-dimensional. A crucial point for us will be that the convergence
properties now depend on chosen metric15 .
The choice of metric in S × A → P(R) represents the same issue as in the space of continuous
random variables P(R): if we choose a metric in the latter, we can construct one in the former:

14 to get familiar with this notion, consider this basic example:

X_1 =^{c.d.f.} X_2 / √2 + X_3 / √2

where X_1, X_2, X_3 are random variables coming from N(0, σ²).
15 in finite-dimensional spaces it is true that convergence in one norm guarantees convergence to the same point in any other norm.

Proposition 10. If d(X, Y) is a metric in the space P(R), then

d̄(Z_1, Z_2) := sup_{s∈S, a∈A} d(Z_1(s, a), Z_2(s, a))

is a metric in the space S × A → P(R).

A particularly interesting example of a metric in P(R) for us will be the Wasserstein metric, which
concerns only random variables with bounded moments, so we will additionally assume that for all
state-action pairs s, a the moments

E |Z^π(s, a)|^p < +∞

are finite for p ≥ 1.

Proposition 11. For 1 ≤ p ≤ +∞ and two random variables X, Y on a continuous domain with
bounded p-th moments and cumulative distribution functions F_X and F_Y correspondingly, the
Wasserstein distance

W_p(X, Y) := ( ∫_0^1 | F_X^{-1}(ω) − F_Y^{-1}(ω) |^p dω )^{1/p}

W_∞(X, Y) := sup_{ω∈[0,1]} | F_X^{-1}(ω) − F_Y^{-1}(ω) |

is a metric in the space of random variables with bounded p-th moments.

Thus we can conclude from proposition 10 that the maximal form of the Wasserstein metric

W̄_p(Z_1, Z_2) = sup_{s∈S, a∈A} W_p(Z_1(s, a), Z_2(s, a))     (22)

is a metric in the space of value distributions.


We now turn to the convergence properties of the point iteration method for solving (21) in order to
obtain Z^π for a given policy π, i.e. to solve the task of policy evaluation. For that purpose we initialize
Z_0^π(s, a) arbitrarily^16 and perform the following updates for all state-action pairs s, a:

Z_{t+1}^π(s, a) :=^{c.d.f.} r(s') + γ Z_t^π(s', a')     (23)
Here we assume that we are able to compute the distribution of random variable on the right side
knowing π , all transition probabilities T, the distribution of Ztπ and reward function. The question
whether the sequence {Ztπ } converges to Z π can be given a detailed answer:

Proposition 12. [1] Denote by B the operator (S × A → P(R)) → (S × A → P(R)) updating Z_t^π as
in (23):

Z_{t+1}^π = B Z_t^π

for all state-action pairs s, a.
Then B is a contraction mapping in W̄_p (22) for 1 ≤ p ≤ +∞, i.e. for any two value distributions
Z_1, Z_2

W̄_p(B Z_1, B Z_2) ≤ γ W̄_p(Z_1, Z_2)

Hence there is a unique fixed point of the system of equations (21) and the point iteration method
converges to it.

One more curious theoretical result is that B is in general not a contraction mapping for such dis-
tances as Kullback-Leibler divergence, Total Variation distance and Kolmogorov distance17 . It shows
16 here we consider value distributions from theoretical point of view, assuming that we are able to explicitly store a table of

|S||A| continuous distributions without any approximations.


17 one more metric for which the contraction property was shown is the Cramer metric:

ℓ_2(X, Y) = ( ∫_R (F_X(ω) − F_Y(ω))² dω )^{1/2}

where F_X, F_Y are the c.d.f. of random variables X, Y correspondingly.

that metric selection indeed influences convergence rate.
Similar to traditional value functions, we can define the optimal value distribution Z*(s, a).
Substituting^18 π*(s) = argmax_a E_{T∼π*} Z*(s, a) into (21), we obtain the distributional Bellman
optimality equation:

Proposition 13. (Distributional Bellman optimality equation)

Z*(s, a) =^{c.d.f.} r(s') + γ Z*(s', argmax_{a'} E_{T∼π*} Z*(s', a')),     s' ∼ p(s' | s, a)     (24)

Now we ask the same question: whether the point iteration method for solving (24) leads to the
solution Z* and whether it is a contraction mapping in some metric. The answer turns out to be
negative.

Proposition 14. [1] Point iteration for solving (24) may diverge.

The practical impact of this result is not completely clear. Point iteration for (24) preserves the means
of the distributions, i.e. they will eventually converge to Q*(s, a) with all the theoretical guarantees of
classical Q-learning. The divergence results concern the rest of the distribution, i.e. the higher
moments, and arise in situations when actions that are equivalent in terms of average return lead to
different higher moments.

4.2. Categorical DQN


There are obvious obstacles to practical application of distributional Q-learning that follow from the
complications of working with arbitrary continuous distributions. Usually we are restricted to
approximations within some parametric family of distributions, so we have to perform a projection
step on each iteration.
The second matter in combining distributional Q-learning with deep neural networks is to take into
account that only samples from p(s' | s, a) are available for each update. To provide a distributional
analog of the temporal difference algorithm, some analog of exponential smoothing for the
distributional setting must be proposed.
Categorical DQN [1] (also referred as c51) provides straightforward design of practical distribu-
tional algorithm. While DQN was a resemblance of temporal difference algorithm, Categorical DQN
attempts to follow the logic of DQN.
The concept is as follows. The neural network with parameters θ takes as input s ∈ S and for each
action a outputs parameters ζ_θ(s, a) of the distribution of the random variable Z*_θ(s, a). As in DQN,
experience replay can be used to collect observed transitions and sample a batch for each update
step. For each transition T = (s, a, r', s', done) in the batch a guess is computed:

y(T) :=^{c.d.f.} r' + (1 − done) γ Z*_θ( s', argmax_{a'} E Z*_θ(s', a') )     (25)

Note that the expectation of Z*_θ(s', a') is computed explicitly using the form of the chosen parametric
family of distributions and the outputted parameters ζ_θ(s', a'), as is the distribution of the random
variable r' + (1 − done) γ Z*_θ(s', a'). In other words, in this setting the guess y(T) is also a continuous
random variable, whose distribution can be constructed only approximately. As both the target and
the model output are distributions, it is reasonable to design the loss function in the form of some
divergence D between y(T) and Z*_θ(s, a):

Loss(θ) = E_T D( y(T) ‖ Z*_θ(s, a) )     (26)

θ_{t+1} = θ_t − α ∂Loss(θ_t)/∂θ
18 to perform this step validly, a clarification concerning argmax operator definition must be given. The choice of action a
returned by this operator in the cases when several actions lead to the same maximal average returns must not depend on
Z , as this choice affects higher moments of resulted distribution. To overcome this issue, for example, in the case of finite
action space all actions can be enumerated and the optimal action with the lowest index is returned by operator.

The particular choice of this divergence must be made with the concern that y(T) is a «sample» from
a full one-step approximation of Z*_θ which includes transition probabilities:

y^full(s, a) :=^{c.d.f.} Σ_{s'∈S} p(s' | s, a) · y(s, a, r(s'), s', done(s'))     (27)

This form is precisely the right side of the distributional Bellman optimality equation, as we just
incorporated the intermediate sampling of s' into the value of the random variable. In other words, if
the transition probabilities T were known, the update could be made using the distribution of y^full
as a target:

Loss^full(θ) = E_{s,a} D( y^full(s, a) ‖ Z*_θ(s, a) )

This motivates choosing KL(y(T) ‖ Z*_θ(s, a)) (specifically with this order of arguments) as D to
exploit the following property (we denote by p_X the p.d.f. of a random variable X):

∇_θ E_T KL(y^full(s, a) ‖ Z*_θ(s, a)) = ∇_θ E_T [ −∫_R p_{y^full(s,a)}(ω) log p_{Z*_θ(s,a)}(ω) dω + const(θ) ] =
{using (27)} = ∇_θ E_T [ −∫_R E_{s'∼p(s'|s,a)} p_{y(T)}(ω) log p_{Z*_θ(s,a)}(ω) dω ] =
{taking the expectation out} = ∇_θ E_T E_{s'∼p(s'|s,a)} [ −∫_R p_{y(T)}(ω) log p_{Z*_θ(s,a)}(ω) dω ] =
= ∇_θ E_T E_{s'∼p(s'|s,a)} KL( y(T) ‖ Z*_θ(s, a) )

This property basically states that the gradient of the loss function (26) with KL as D is an unbiased
(Monte-Carlo) estimate of the gradient of the KL-divergence for the «full» distribution (27), which
resembles the role of exponential smoothing in temporal difference learning. For many other
divergences, including the Wasserstein metric, the same statement does not hold, so their utilization
in the described online setting leads to biased gradients, and the theory-grounded intuition that the
algorithm moves in the right direction is lost. Moreover, KL-divergence is known to be one of the
easiest divergences to work with due to its nice smoothness properties and wide prevalence in many
deep learning pipelines.
The motivation described above for choosing KL-divergence as the actual objective for minimization is
contradictory. The theoretical analysis of distributional Q-learning, specifically theorem 12, though
concerning policy evaluation rather than the search for the optimal Z*, explicitly hints that the process
converges exponentially fast in the Wasserstein metric, while even for exactly performed updates in
terms of KL-divergence we are not guaranteed to get any closer to the true solution.
A more «practical» defect of KL-divergence is that it demands that the two compared distributions
share the same domain. This means that by choosing KL-divergence we pledge to guarantee that
y(T) and Z*_θ(s, a) in (26) have coinciding supports. This restriction seems limiting even beforehand,
as for an episodic MDP the value distribution in terminal states is obviously degenerate (its support
consists of the single point r(s), which is given all probability mass), which means that our value
distribution approximation is basically guaranteed to never be exact.
In Categorical DQN, as follows from the name, the family of distributions is chosen to be categorical
on the fixed support {z_0, z_1, ..., z_{A−1}}, where A is the number of atoms. As no prior information
about the MDP is given, the basic choice of this support is a uniform grid from some V_min ∈ R to
V_max ∈ R:

z_i = V_min + (i / (A−1)) (V_max − V_min),     i ∈ {0, 1, ..., A−1}

These bounds, though, must be chosen carefully as they implicitly assume

V_min ≤ Z*(s, a) ≤ V_max

and if these inequalities are not tight, the approximation will obviously become poor.
Therefore the neural network outputs A numbers summing to 1 to represent an arbitrary distribution
on this support:

ζ_i(s, a, θ) := P(Z*_θ(s, a) = z_i)
Within this family of distributions, computation of the expectation, greedy action selection and
KL-divergence are trivial. One problem hides in the target formula (25): while we can compute the
distribution of y(T), its support may in general differ from {z_0, ..., z_{A−1}}. To avoid the issue of
disjoint supports, a projection step must be done to find the closest distribution to the target within
the chosen family^19. Therefore the resulting target used in the loss function is

y(T) :=^{c.d.f.} Π_C [ r' + (1 − done) γ Z*_θ( s', argmax_{a'} E Z*_θ(s', a') ) ]

where Π_C is the projection operator.


The resulting practical algorithm, named c51 after categorical distributions with A = 51 atoms,
inherits the ideas of experience replay, ε-greedy exploration and target network from DQN.
Empirically, though, the usefulness of the target network remains an open question, as the chosen
family of distributions already restricts the value approximation from unbounded growth by «clipping»
predictions at z_0 and z_{A−1}; yet it is still considered to slightly improve performance.

Algorithm 3: Categorical DQN (c51)

Hyperparameters: B — batch size, Vmax , Vmin , A — parameters of support, K — target


network update frequency, ε(t) ∈ (0, 1] — greedy exploration parameter, ζ ∗ — neural net-
work, SGD optimizer.

Initialize weights θ of neural net ζ ∗ arbitrary


Initialize θ − ← θ
Precompute support grid z_i = V_min + (i / (A−1)) · (V_max − V_min)
On each interaction step:

1. select a randomly with probability ε(t), else a = argmax_a Σ_i z_i ζ*_i(s, a, θ)

2. observe transition (s, a, r 0 , s0 , done)

3. add observed transition to experience replay

4. sample batch of size B from experience replay

5. for each transition T from the batch compute target:


P( y(T) = r' + γ z_i ) = ζ*_i( s', argmax_{a'} Σ_i z_i ζ*_i(s', a', θ^−), θ^− )

6. project y(T ) on support {z0 , z1 . . . zA−1 }

7. compute loss:
Loss = (1/B) Σ_T KL( y(T) ‖ Z*(s, a, θ) )

8. make a step of gradient descent using ∂Loss/∂θ

9. if t mod K = 0: θ − ← θ
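For concreteness, a minimal numpy sketch of the projection in step 6, following the heuristic of footnote 19 (one possible implementation, not necessarily the one used in [1]); `target_z` and `target_p` are assumed to hold the atoms r' + γ z_i and their probabilities, and `z` the fixed support:

import numpy as np

def project_on_support(target_z, target_p, z):
    v_min, v_max = z[0], z[-1]
    delta = z[1] - z[0]                          # the support is a uniform grid
    proj = np.zeros_like(z, dtype=float)
    for zj, pj in zip(target_z, target_p):
        zj = np.clip(zj, v_min, v_max)           # atoms outside [V_min, V_max] go to the border
        b = (zj - v_min) / delta                 # fractional index of the atom on the grid
        lo, hi = int(np.floor(b)), int(np.ceil(b))
        if lo == hi:                             # the atom coincides with a grid point
            proj[lo] += pj
        else:                                    # split the mass proportionally to closeness
            proj[lo] += pj * (hi - b)
            proj[hi] += pj * (b - lo)
    return proj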

4.3. Quantile Regression DQN (QR-DQN)


Categorical DQN discovered a gap between theory and practice as KL-divergence, used in prac-
tical algorithm, is theoretically unjustified. Theorem 12 hints that the true divergence we should care
about is actually Wasserstein metric, but it remained unclear how it could be optimized using only
samples from transition probabilities T.
In [3] it was discovered that selecting another family of distributions to approximate Z*_θ(s, a)
reduces the Wasserstein minimization task to the search for quantiles of specific distributions. The
latter can be done in the online setting using the quantile regression technique. This led to an
alternative distributional Q-learning algorithm named Quantile Regression DQN (QR-DQN).

19 to project a categorical distribution with support {v_0, v_1, ..., v_{A−1}} on categorical distributions with support
{z_0, z_1, ..., z_{A−1}} one can just find for each v_i the closest two atoms z_j ≤ v_i ≤ z_{j+1} and split all its probability mass
between z_j and z_{j+1} proportionally to closeness. If v_i < z_0, then all its probability mass is given to z_0, and similarly for
the upper bound.
The basic idea is to «swap» the fixed support and the learned probabilities of Categorical DQN. We
will now consider the family of A-atomed categorical distributions with fixed probabilities and arbitrary
support {ζ*_0(s, a, θ), ζ*_1(s, a, θ), ..., ζ*_{A−1}(s, a, θ)}. Again, we assume all probabilities to be
equal given the absence of any prior knowledge; namely, our distribution family is now

Z*_θ(s, a) ∼ Uniform( ζ*_0(s, a, θ), ..., ζ*_{A−1}(s, a, θ) )

In this setting neural network outputs A arbitrary real numbers that represent the support of uni-
form categorical distribution20 , where A is the number of atoms and the only hyperparameter to
select.
In the table-case setting, on each step of point iteration we want to update the cell for a given
state-action pair s, a with the full distribution of the random variable on the right side of (24). If we
are limited to storing only A atoms of the support, the true distribution must be projected onto the
space of A-atomed categorical distributions. Consider now the task of projecting some given random
variable with c.d.f. F(ω) in terms of the Wasserstein distance. Specifically, we will be interested in
minimizing the W_1-distance (p = 1), as theorem 12 states the contraction property for all
1 ≤ p ≤ +∞ and we are free to choose any:

∫_0^1 | F^{-1}(ω) − U^{-1}_{z_0, z_1 ... z_{A−1}}(ω) | dω → min_{z_0, z_1 ... z_{A−1}}     (28)

where U_{z_0, z_1 ... z_{A−1}} is the c.d.f. of the uniform categorical distribution on the given support.
Its inverse, also known as the quantile function, has the following simple form:

U^{-1}_{z_0, z_1 ... z_{A−1}}(ω) = z_0        if 0 ≤ ω < 1/A,
                                   z_1        if 1/A ≤ ω < 2/A,
                                   ...
                                   z_{A−1}    if (A−1)/A ≤ ω < 1

Substituting this into (28)

Σ_{i=0}^{A−1} ∫_{i/A}^{(i+1)/A} | F^{-1}(ω) − z_i | dω → min_{z_0, z_1 ... z_{A−1}}

splits the optimization of the Wasserstein distance into A independent tasks that can be solved
separately:

∫_{i/A}^{(i+1)/A} | F^{-1}(ω) − z_i | dω → min_{z_i}     (29)

Proposition 15. [3] Let us denote

τ_i := ( i/A + (i+1)/A ) / 2

Then every solution of (29) satisfies F(z_i) = τ_i, i.e. it is the τ_i-th quantile of the c.d.f. F.

Result 15 states that we require only A specific quantiles of the random variable on the right side of
the Bellman equation^21. Hence the last thing to do to design a practical algorithm is to develop a
procedure for unbiased estimation of quantiles of the random variable on the right side of the
distributional Bellman optimality equation (24).
20 Note that target distribution is now guaranteed to remain within this distribution family as multiplying on γ just shrinks

the support and adding r 0 just shifts it. We assume that if some atoms of the support coincide, the distribution is
still A-atomed categorical; for example, for a degenerate distribution (like in the case of terminal states)
ζ*_0(s, a, θ) = ζ*_1(s, a, θ) = · · · = ζ*_{A−1}(s, a, θ). This shows that the projection step heuristic is not needed for this particular choice of
distribution family.
21 It can be proved that for table-case policy evaluation algorithm which stores in each cell not expectations of reward (as

in Q-learning) but A quantiles updated according to distributional Bellman equation (21) using theorem 15 converges to
quantiles of Z ∗ (s, a) in Wasserstein metric for 1 ≤ p ≤ +∞ and its update operator is a contraction mapping in W∞ .

Quantile regression is the standard technique for estimating the quantiles of an empirical distribution
(i.e. a distribution represented by a finite amount of i.i.d. samples from it). Recall from machine
learning that the constant solution optimizing the l1-loss is the median, i.e. the 1/2-th quantile. This
fact can be generalized to arbitrary quantiles:

Proposition 16. (Quantile Regression) [11] Let us define the loss as

Loss(c, X) = (1 − τ)(c − X)   if c ≥ X
             τ (X − c)        if c < X

Then the solution of

E_X Loss(c, X) → min_{c∈R}     (30)

is the τ-th quantile of the distribution of X.

As usual in the case of neural networks, it is impractical to optimize (30) until convergence on each
iteration for each of the A desired quantiles τ_i. Instead, just one step of gradient optimization is
made, and the outputs of the neural network ζ*_i(s, a, θ), which play the role of c in formula (30), are
moved towards the quantile estimates via backpropagation. In other words, (30) sets a loss function
for the network outputs; the losses for different quantiles are summed up. The resulting loss is

Loss^QR(s, a, θ) = Σ_{i=0}^{A−1} E_{s'∼p(s'|s,a)} E_{y∼y(T)} ( τ_i − I[ y < ζ*_i(s, a, θ) ] ) ( y − ζ*_i(s, a, θ) )     (31)

where I denotes the indicator function. The expectation over y ∼ y(T) for a given transition can be
computed in closed form: indeed, y(T) is also an A-atomed categorical distribution with support
{r' + γ ζ*_0(s', a'), ..., r' + γ ζ*_{A−1}(s', a')}, where

a' = argmax_{a'} E Z*(s', a', θ) = argmax_{a'} (1/A) Σ_i ζ*_i(s', a', θ)

and the expectation over transition probabilities, as always, is estimated using Monte-Carlo by
sampling transitions from experience replay.

Algorithm 4: Quantile Regression DQN (QR-DQN)

Hyperparameters: B — batch size, A — number of atoms, K — target network update fre-


quency, ε(t) ∈ (0, 1] — greedy exploration parameter, ζ ∗ — neural network, SGD optimizer.

Initialize weights θ of neural net ζ ∗ arbitrary


Initialize θ − ← θ
Precompute mid-quantiles τ_i = ( i/A + (i+1)/A ) / 2
On each interaction step:
1. select a randomly with probability ε(t), else a = argmax_a (1/A) Σ_i ζ*_i(s, a, θ)

2. observe transition (s, a, r 0 , s0 , done)

3. add observed transition to experience replay

4. sample batch of size B from experience replay

5. for each transition T from the batch compute the support of target distribution:
y(T)_j = r' + γ ζ*_j( s', argmax_{a'} (1/A) Σ_i ζ*_i(s', a', θ^−), θ^− )

6. compute loss:

Loss = (1/(B·A)) Σ_T Σ_i Σ_j ( τ_i − I[ y(T)_j < ζ*_i(s, a, θ) ] ) ( y(T)_j − ζ*_i(s, a, θ) )

7. make a step of gradient descent using ∂Loss/∂θ

8. if t mod K = 0: θ − ← θ
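A minimal numpy sketch of the loss from step 6 for a single transition (illustrative, not the reference code of [3]); `zeta` holds the A outputs ζ*_i(s, a, θ) and `y` holds the A target atoms y(T)_j:

import numpy as np

def quantile_regression_loss(zeta, y):
    A = len(zeta)
    tau = (np.arange(A) / A + (np.arange(A) + 1) / A) / 2     # mid-quantiles tau_i
    diff = y[None, :] - zeta[:, None]                         # y(T)_j - zeta_i, shape (A, A)
    weight = tau[:, None] - (diff < 0).astype(float)          # tau_i - I[y(T)_j < zeta_i]
    return (weight * diff).sum() / A                          # sum over i, average over target atoms j

In practice the result is additionally averaged over the batch, matching the 1/(B·A) factor of the algorithm.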

4.4. Rainbow DQN


The success of deep Q-learning encouraged full-scale research of value-based deep reinforcement
learning, studying various drawbacks of DQN and developing auxiliary extensions. In many articles
some extensions from previous research were already considered and embedded in the compared
algorithms during empirical studies.
In Rainbow DQN [7], seven Q-learning-based ideas are united in one procedure, with ablation studies
examining whether all these incorporated extensions are essentially necessary for the resulting RL
algorithm:
• DQN (sec. 3.2)
• Double DQN (sec. 3.3)
• Dueling DQN (sec. 3.4)
• Noisy DQN (sec. 3.5)
• Prioritized Experience Replay (sec. 3.6)
• Multi-step DQN (sec. 3.7)
• Categorical22 DQN (sec. 4.2)
There is little ambiguity on how these ideas can be combined; we will discuss several non-
straightforward circumstances and provide the full algorithm description after.
To apply prioritized experience replay in distributional setting, the measure of transition impor-
tance must be provided. The main idea is inherited from ordinary DQN where priority is just loss for
this transition:

ρ(T ) := Loss(y(T ), Z ∗ (s, a, θ)) = KL(y(T ) k Z ∗ (s, a, θ))

To combine noisy networks with double DQN heuristic, it is proposed to resample noise on each
forward pass through the network and through its copy for target computation. This decision implies
that action selection, action evaluation and network utilization are independent and stochastic (for
exploration cultivation) steps.
The one snagging combination here is categorical DQN and dueling DQN. To merge these ideas,
we need to model advantage A∗ (s, a, θ) in distributional setting. In Rainbow this is done straight-
forwardly: the network has two heads, value stream v(s, θ) outputting A real values and advantage
stream a(s, a, θ) outputting A × |A| real values. Then these streams are integrated using the same
formula (17) with the only exception being softmax applied across atoms dimension to guarantee
that output is categorical distribution:
ζ*_i(s, a, θ) ∝ exp( v(s, θ)_i + a(s, a, θ)_i − (1/|A|) Σ_a a(s, a, θ)_i )     (32)

Combining lack of intuition behind this integration formula with usage of mean instead of theo-
retically justified max makes this element of Rainbow the most questionable. During the ablation
studies it was discovered that dueling architecture is the only component that can be removed with-
out noticeable loss of performance. All other ingredients are believed to be crucial for resulting
algorithm as they address different problems.
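For concreteness, a minimal numpy sketch of the per-atom combination (32); the array shapes are illustrative assumptions (v of shape (A,), adv of shape (n_actions, A)):

import numpy as np

def dueling_categorical(v, adv):
    logits = v[None, :] + adv - adv.mean(axis=0, keepdims=True)   # v(s)_i + a(s,a)_i - mean_a a(s,a)_i
    logits -= logits.max(axis=1, keepdims=True)                   # for numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)                       # softmax across the atoms dimension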
22 Quantile Regression can be considered instead

Algorithm 5: Rainbow DQN

Hyperparameters: B — batch size, Vmax , Vmin , A — parameters of support, K — target


network update frequency, N — multi-step size, α — degree of prioritized experience replay,
β(t) — importance sampling bias correction for prioritized experience replay, ζ ∗ — neural
network, SGD optimizer.

Initialize weights θ of neural net ζ ∗ arbitrary


Initialize θ − ← θ
i
Precompute support grid zi = Vmin + A−1 (Vmax − Vmin )
On each interaction step:

1. select a = argmax_a Σ_i z_i ζ*_i(s, a, θ, ε),     ε ∼ N(0, I)

2. observe transition (s, a, r 0 , s0 , done)


3. construct the N-step transition T = ( s, a, Σ_{n=1}^{N} γ^{n−1} r^{(n)}, s^{(N)}, done ) and add it to
experience replay with priority max_T ρ(T)

4. sample batch of size B from experience replay using probabilities P(T ) ∝ ρ(T )α

5. compute weights for the batch (where M is the size of experience replay memory)
w(T) = ( 1 / (M · P(T)) )^{β(t)}

6. for each transition T = (s, a, r̄, s̄, done) from the batch compute target (detached
from computational graph to prevent backpropagation):

ε_1, ε_2 ∼ N(0, I)

P( y(T) = r̄ + γ^N z_i ) = ζ*_i( s̄, argmax_{ā} Σ_i z_i ζ*_i(s̄, ā, θ, ε_1), θ^−, ε_2 )

7. project y(T ) on support {z0 , z1 . . . zA−1 }

8. update transition priorities

ρ(T ) ← KL(y(T ) k Z ∗ (s, a, θ, ε)), ε ∼ N (0, I)

9. compute loss:
Loss = (1/B) Σ_T w(T) ρ(T)

10. make a step of gradient descent using ∂Loss/∂θ

11. if t mod K = 0: θ − ← θ

5. Policy Gradient algorithms
5.1. Policy Gradient theorem
An alternative approach to solving the RL task is direct optimization of the objective

J(θ) = E_{T∼π_θ} Σ_{t=1} γ^{t−1} r_t → max_θ     (33)

as a function of θ. Policy gradient methods provide a framework for constructing an efficient
optimization procedure based on stochastic first-order optimization within the RL setting.
We will assume that π_θ(a | s) is a stochastic policy parameterized by θ ∈ Θ. It turns out that if π is
differentiable with respect to θ, then so is our goal (33). We now proceed to discuss the technique of
derivative calculation, which is based on the log-derivative trick:

Proposition 17. For arbitrary distribution π(a) parameterized by θ :

∇θ π(a) = π(a)∇θ log π(a) (34)

In the most general form, this trick allows us to derive the gradient of the expectation of an arbitrary
function f(a, θ) : A × Θ → R, differentiable by θ, with respect to some distribution π_θ(a), also
parameterized by θ:

∇_θ E_{a∼π_θ(a)} f(a, θ) = ∇_θ ∫_A π_θ(a) f(a, θ) da =
= ∫_A ∇_θ [ π_θ(a) f(a, θ) ] da =
{product rule} = ∫_A [ ∇_θ π_θ(a) f(a, θ) + π_θ(a) ∇_θ f(a, θ) ] da =
= ∫_A ∇_θ π_θ(a) f(a, θ) da + E_{π_θ(a)} ∇_θ f(a, θ) =
{log-derivative trick (34)} = ∫_A π_θ(a) ∇_θ log π_θ(a) f(a, θ) da + E_{π_θ(a)} ∇_θ f(a, θ) =
= E_{π_θ(a)} [ ∇_θ log π_θ(a) f(a, θ) ] + E_{π_θ(a)} ∇_θ f(a, θ)
This technique can be applied sequentially (to expectations over πθ (a0 | s0 ), πθ (a1 | s1 ) and
so on) to obtain the gradient ∇θ J (πθ ).
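As a quick sanity check of the log-derivative trick, consider a hypothetical toy setting with π_θ = N(θ, 1) and f(a) = a², where the true gradient of E f(a) = θ² + 1 equals 2θ; the Monte-Carlo estimate built from formula (34) should match it:

import numpy as np

theta, n = 1.5, 1_000_000
a = np.random.normal(theta, 1.0, size=n)
grad_log_pi = a - theta                       # d/dtheta log N(a; theta, 1) = (a - theta)
estimate = np.mean(grad_log_pi * a ** 2)      # f does not depend on theta, so the second term vanishes
print(estimate, 2 * theta)                    # both values are close to 3.0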

Proposition 18. (Policy Gradient Theorem) [22] For any MDP and differentiable policy πθ the
gradient of objective (33) is
∇_θ J(θ) = E_{T∼π_θ} Σ_{t=0} ∇_θ log π_θ(a_t | s_t) Q^π(s_t, a_t)     (35)

For future references, we require another form of formula (35), which provides another point of
view. For this purpose, let us define a state visitation frequency:

Definition 10. For a given MDP and given policy π its state visitation frequency is defined by

d^π(s) := Σ_{t=0} P(s_t = s)

where s_t are taken from trajectories T sampled using the given policy π.

State visitation frequencies, if normalized, represent the marginal probability of the agent landing in
a given state s. They are rarely learned explicitly, but they assist theoretical study, as they allow us to
rewrite expectations over trajectories with the intrinsic and extrinsic randomness of the decision
making process separated:

∇θ J (θ) = Es∼dπ (s) Ea∼π(a|s) ∇θ log πθ (a | s)Qπ (s, a) (36)


This form is equivalent to (35) as sampling a trajectory and going through all visited states induces
the same distribution as defined in dπ (s).
Now, although we acquired an explicit form of objective’s gradient, we are able to compute it only
approximately, using Monte-Carlo estimation for expectations via sampling one or several trajecto-
ries. Second form of gradient (36) reveals that it is possible to use roll-outs of trajectories without
waiting for episode ending, as the states for the roll-outs come from the same distribution as they
would for complete episode trajectories. The essential thing is that exactly the policy π(θ) must be
used for sampling to obtain unbiased Monte-Carlo estimation (otherwise state visitation frequency
dπ (s) is different). These features are commonly underlined by notation Eπ , which is a shorter form
of Es∼dπ (s) Ea∼π(a|s) . When convenient, we will use it to reduce the gradient to a shorter form:

∇θ J (θ) = Eπ(θ) ∇θ log πθ (a | s)Qπ (s, a) (37)

Second important thing worth mentioning is that Qπ (s, a) is essentially present in the gradient.
Remark that it is never available to the algorithm and must also be somehow estimated.

5.2. REINFORCE
REINFORCE [27] provides a straightforward approach to approximately calculate the gradient (35)
in episodic case using Monte-Carlo estimation: N games are played and Q-function under policy π
is approximated with corresponding return:

Qπ (s, a) = ET ∼πθ |s,a R(T ) ≈ R(T ), T ∼ πθ | s, a

The resulting formula is therefore the following:

∇_θ J(θ) ≈ (1/N) Σ_T Σ_{t=0} [ ∇_θ log π_θ(a_t | s_t) ( Σ_{t'=t} γ^{t'−t} r_{t'+1} ) ]     (38)

This estimation is unbiased as both approximation of Qπ and approximation of expectation over


trajectories are done using Monte-Carlo. Given that estimation of gradient is unbiased, stochastic
gradient ascent or more advanced stochastic optimization techniques are known to converge to local
optimum.
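A minimal PyTorch sketch of one REINFORCE update following formula (38); the interface is a simplifying assumption (`policy(states)` is assumed to return a torch.distributions object, and each episode is a tuple of state/action tensors and a list of rewards):

import torch

def reinforce_loss(policy, episodes, gamma):
    losses = []
    for states, actions, rewards in episodes:
        G, returns = 0.0, []
        for r in reversed(rewards):                    # reward-to-go: sum of gamma^{t'-t} r_{t'+1}
            G = r + gamma * G
            returns.append(G)
        returns = torch.tensor(list(reversed(returns)), dtype=torch.float32)
        log_probs = policy(states).log_prob(actions)   # gradient flows through log pi_theta(a_t | s_t)
        losses.append(-(log_probs * returns).sum())    # minus sign: we maximize J(theta)
    return torch.stack(losses).mean()                  # average over the N sampled episodes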
From a theoretical point of view REINFORCE can be applied straightforwardly to any parametric
family π_θ(a | s), including neural networks. Yet the enormous time required for convergence and the
problem of getting stuck in local optima make this naive approach completely impractical.
The main source of problems is believed to be the high variance of the gradient estimate (38), as the
convergence rate of stochastic gradient descent directly depends on the variance of the gradient
estimate.
The standard technique for variance reduction is the introduction of a baseline. The idea is to add
some term that will not affect the expectation, but may affect the variance. One such baseline can be
derived using the following reasoning: for any distribution it is true that ∫_A π_θ(a | s) da = 1. Taking
the gradient ∇_θ of both sides, we obtain:

0 = ∫_A ∇_θ π_θ(a | s) da =
{log-derivative trick (34)} = ∫_A π_θ(a | s) ∇_θ log π_θ(a | s) da =
= E_{π_θ(a|s)} ∇_θ log π_θ(a | s)

Multiplying this expression on some constant, we can scale this baseline:

Eπθ (a|s) const(a)∇θ log πθ (a | s) = 0

Notice that the constant here must be independent of a, but may depend on s. Application of this
technique to our case provides the following result23 :

23 this result can be generalized by introducing different baselines for estimation of different components of ∇θ J(θ).

Proposition 19. For any arbitrary function b(s) : S → R, called baseline:
∇_θ J(θ) = E_{T∼π_θ} Σ_{t=0} ∇_θ log π_θ(a_t | s_t) ( Q^π(s_t, a_t) − b(s_t) )

The choice of the baseline is up to us as long as it does not depend on the actions a_t. The intent is
to choose it in a way that minimizes the variance.
It is believed that the high variance of (38) originates from the multiplication by Q^π(s, a), which may
have arbitrary scale (e.g. lie in a range [100, 200]), while ∇_θ log π_θ(a_t | s_t) naturally has varying
signs^24. To reduce the variance, the baseline must be chosen so that the absolute values of the
expression inside the expectation are shifted towards zero. The optimal baseline is provided by the
following theorem:

Proposition 20. The solution of

V_{T∼π_θ} [ Σ_{t=0} ∇_θ log π_θ(a_t | s_t) ( Q^π(s_t, a_t) − b(s_t) ) ] → min_{b(s)}

is given by

b(s) = E_{a∼π_θ(a|s)} [ ‖∇_θ log π_θ(a | s)‖²₂ Q^π(s, a) ] / E_{a∼π_θ(a|s)} ‖∇_θ log π_θ(a | s)‖²₂     (39)

As can be seen, optimal baseline calculation involves expectations which again can (in most cases)
only be computed using Monte-Carlo (both for the numerator and the denominator). For that purpose,
for every visited state s estimates of Q^π(s, a) are needed for all (or several) actions a, as otherwise
the estimate of the baseline will coincide with the estimate of Q^π(s, a) and collapse the gradient to
zero. A practical utilization of result (39) is to consider a constant baseline independent of s with a
similar optimal form:

b = E_{T∼π_θ} Σ_{t=0} ‖∇_θ log π_θ(a_t | s_t)‖²₂ Q^π(s_t, a_t) / E_{T∼π_θ} Σ_{t=0} ‖∇_θ log π_θ(a_t | s_t)‖²₂

which can be profitably estimated via Monte-Carlo.
Utilization of some kind of baseline, not necessarily optimal, is known to significantly reduce the
variance of gradient estimation and is an essential part of any policy gradient method. The final step
to make this family of algorithms applicable when using deep neural networks is to reduce variance
of Qπ estimation by employing RL task structure like it was done in value-based methods.

5.3. Advantage Actor-Critic (A2C)


Suppose that in the optimal baseline formula (39) it happens that ‖∇_θ log π_θ(a | s)‖²₂ = const(a).
Though in reality this is not true, under this assumption the optimal baseline formula simplifies
significantly and yields a close-to-optimal but simple form of baseline:

b(s) = Ea∼πθ (a|s) Qπ (s, a) = V π (s)

Substituting this baseline into gradient formula (37) and recalling the definition of advantage
function (14), the gradient can now be rewritten as follows:

∇θ J (θ) = Eπ(θ) ∇θ log πθ (a | s)Aπ (s, a) (40)

This representation of the gradient is used as the basis for most policy gradient algorithms, as it
offers lower variance while the baseline is expressed in terms of value functions which can be
efficiently learned similarly to how it was done in value-based methods. Such algorithms are usually
named Actor-Critic as they consist of two neural networks: πθ (a | s), representing a policy, called an
actor, and Vφπ (s) with parameters φ, approximately estimating actor’s performance, called a critic.
Note that the choice of value function to learn can be arbitrary; it is possible to learn Qπ or Aπ
instead, as all of them are deeply interconnected. Value function V π is chosen as the simplest one
since it depends only on state and thus is hoped to be easier to learn.
24 this follows, for example, from baseline derivation

Having a critic Vφπ (s), Q-function can be approximated in a following way:

Qπ (s, a) ≈ r 0 + γV π (s0 ) ≈ r 0 + γVφπ (s0 )

First approximation is done using Monte-Carlo, while second approximation inevitably introduces
bias. Important thing to notice is that at this moment our gradient estimation stops being unbiased
and all theoretical guarantees of converging are once again lost.
Advantage function therefore can be obtained according to the definition:

Aπ (s, a) = Qπ (s, a) − V π (s) ≈ r 0 + γVφπ (s0 ) − Vφπ (s) (41)

Note that biased estimation of baseline doesn’t make gradient estimation biased by itself, as baseline
can be an arbitrary function of state. All bias introduction happens inside the approximation of Qπ .
It is possible to use critic only for baseline, which allows complete avoidance of bias, but then the
only way to estimate Qπ is via playing several games and using corresponding returns, which suffers
from higher variance and low sample efficiency.
The logic behind training procedure for the critic is taken from value-based methods: for given
policy π its value function can be obtained using point iteration for solving

V π (s) = Ea∼π(a|s) Es0 ∼p(s0 |s,a) [r 0 + γV π (s0 )]

Similar to DQN, on each update a target is computed using current approximation

y = r 0 + γVφπ (s0 )

and then MSE is minimized to move values of Vφπ (s) towards the guess.
Notice that to compute the target for critic we require samples from the policy π which is being
evaluated. Although actor evolves throughout optimization process, we assume that one update of
policy π does not lead to significant change of true V π and thus our critic, which approximates value
function for older version of policy, is close enough to construct the target. But if samples from, for
example, old policy are used to compute the guess, the step of critic update will correspond to learn-
ing the value function for old policy other than current. Essentially, this means that both actor and
critic training procedures require samples from current policy π , making Actor-Critic algorithm on-
policy by design. Consequently, samples that were collected on previous update iterations become
useless and can be forgotten. This is the key reason why policy gradient algorithms are usually less
sample-efficient than value-based.
Now as we have an approximation of value function, advantage estimation can be done using
one-step transitions (41). As the procedure of training an actor, i. .e. gradient estimation (40), also
does not demand sampling the whole trajectory, each update now requires only a small roll-out to
be sampled. The amount of transitions in the roll-out corresponds to the size of mini-batch.
The problem with roll-outs is that the data is obviously not i. i. d., which is crucial for training
networks. In value-based methods, this problem was solved with experience replay, but in policy
gradient algorithms it is essential to collect samples from scratch after each update of the networks
parameters. The practical solution for simulated environments is to launch several instances of
environment (for example, on different cores of multiprocessor) in parallel threads and have several
parallel interactions. After several steps in each environment, the batch for update is collected by
uniting transitions from all instances and one synchronous25 update of networks parameters θ and
φ is performed.
One more optimization that can be done is to partially share weights of networks θ and φ. It is
justified as first layers of both networks correspond to basic features extraction and these features
are likely to be the same for optimal policy and value function. While it reduces the number of train-
ing parameters almost twice, it might destabilize learning process as the scales of gradient (40) and
gradient of critic’s MSE loss may be significantly different, so they should be balanced with additional
hyperparameter.
25 there is also an asynchronous modification of advantage actor critic algorithm (A3C) which accelerates the training process

by storing a copy of network for each thread and performing weights synchronization from time to time.

Algorithm 6: Advantage Actor-Critic (A2C)

Hyperparameters: B — batch size, Vφ∗ — critic neural network, πθ — actor neural network,
α — critic loss scaling, SGD optimizer.

Initialize weights θ, φ arbitrary


On each step:

1. obtain a roll-out of size B using policy π(θ)

2. for each transition T from the roll-out compute advantage estimation:

A^π(T) = r' + γ V^π_φ(s') − V^π_φ(s)

3. compute target (detached from computational graph to prevent backpropagation):

y(T ) = r 0 + γVφπ (s0 )

4. compute critic loss:


Loss = (1/B) Σ_T ( y(T) − V^π_φ(s) )²

5. compute critic gradients:


∇_critic = ∂Loss / ∂φ

6. compute actor gradient:

∇_actor = (1/B) Σ_T ∇_θ log π_θ(a | s) A^π(T)

7. make a step of gradient descent using ∇actor + α∇critic
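A minimal PyTorch sketch of the actor and critic losses above (the interface is a simplifying assumption: `actor(s)` returns a torch.distributions object, `critic(s)` returns V(s), and `done` is a 0/1 float tensor):

import torch
import torch.nn.functional as F

def a2c_losses(actor, critic, s, a, r, s_next, done, gamma):
    v = critic(s).squeeze(-1)                                    # V_phi(s)
    with torch.no_grad():                                        # target is detached, as in step 3
        target = r + gamma * (1 - done) * critic(s_next).squeeze(-1)
    advantage = target - v.detach()                              # A(T) = r' + gamma V(s') - V(s)
    critic_loss = F.mse_loss(v, target)                          # step 4
    actor_loss = -(actor(s).log_prob(a) * advantage).mean()      # minus sign: gradient ascent on J
    return actor_loss, critic_loss

The two losses are then combined as actor_loss + α · critic_loss and optimized with one SGD step.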

5.4. Generalized Advantage Estimation (GAE)


There is a design dilemma in Advantage Actor Critic algorithm concerning the choice whether to
use the critic to estimate Qπ (s, a) and introduce bias into gradient estimation or to restrict critic em-
ployment only for baseline and cause higher variance with necessity of playing the whole episodes
for each update step.
Actually, the range of possibilities is wider. Since Actor-Critic is an on-policy algorithm by design,
we are free to use N -step approximations instead of one-step: using
Q^π(s, a) ≈ Σ_{n=0}^{N−1} γ^n r^{(n+1)} + γ^N V^π( s^{(N)} )

we can define the N-step advantage estimator as

A^π_{(N)}(s, a) := Σ_{n=0}^{N−1} γ^n r^{(n+1)} + γ^N V^π_φ( s^{(N)} ) − V^π_φ(s)

For N = 1 this estimation corresponds to Actor-Critic one-step estimation with high bias and low
variance. For N = ∞ it yields the estimator with critic used only for baseline with no bias and
high variance. Intermediate values correspond to something in between. Note that to use N -step
advantage estimation we have to perform N steps of interaction after given state-action pair.
Usually finding a good value for N as hyperparameter is difficult as its «optimal» value may float
throughout the learning process. In Generalized Advantage Estimation (GAE) [18] it is proposed to
construct an ensemble out of different N -step advantage estimators using exponential smoothing
with some hyperparameter λ:
 
A^π_GAE(s, a) := (1 − λ) ( A^π_{(1)}(s, a) + λ A^π_{(2)}(s, a) + λ² A^π_{(3)}(s, a) + . . . )     (42)

Here the parameter λ ∈ [0, 1] allows smooth control over bias-variance trade-off: λ = 0 corre-
sponds to Actor-Critic with higher bias and lower variance while λ → 1 corresponds to REINFORCE
with no bias and high variance. But unlike N as hyperparameter, it uses mix of different estimators
in intermediate case.
GAE proved to be a convenient way to extract more information from the collected roll-out in practice.
Instead of waiting for episode termination to compute (42), we may use a «truncated» GAE which
ensembles only those N-step advantage estimators that are available:

A^π_trunc.GAE(s, a) := ( A^π_{(1)}(s, a) + λ A^π_{(2)}(s, a) + λ² A^π_{(3)}(s, a) + · · · + λ^{N−1} A^π_{(N)}(s, a) ) / ( 1 + λ + λ² + · · · + λ^{N−1} )
Note that the amount N of available estimators may be different for different transitions from roll-
out: if we performed K steps of interaction in some instance of environment starting from some
state-action pair s, a, we can use N = K step estimators; for next state-action pair s0 , a0 we have
only N = K −1 transitions and so on, while the last state-action pair sN −1 , aN −1 can be estimated
only using Aπ (1) as only N = 1 following transition is available. Although different transitions are
estimated with different precision (leading to different bias and variance), this approach allows to use
all available information for each transition and utilize multi-step approximations without dropping
last transitions of roll-outs used only for target computation.
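In practice the ensemble (42) is usually computed with the equivalent backward recursion over the temporal-difference errors δ_t = r_t + γV(s_{t+1}) − V(s_t), namely A_t = δ_t + γλ A_{t+1}. A minimal numpy sketch under the assumption that `values` contains V_φ for the roll-out states plus one bootstrap state:

import numpy as np

def gae(rewards, values, dones, gamma, lam):
    T = len(rewards)
    advantages = np.zeros(T)
    gae_t = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * (1 - dones[t]) * values[t + 1] - values[t]
        gae_t = delta + gamma * lam * (1 - dones[t]) * gae_t     # zero out the tail at episode ends
        advantages[t] = gae_t
    return advantages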

5.5. Natural Policy Gradient (NPG)


In this section we discuss the motivation and basic principles behind the idea of natural gradient
descent, which we will require for future references.
The standard gradient descent optimization method is known to be extremely sensitive to the
choice of parametrization. Suppose we attempt to solve the following optimization task:

f (q) → min
q

where q is a distribution and F is arbitrary differentiable function. We often restrict q to some


parametric family and optimize similar objective, but with respect to some vector of parameters θ
as unknown variable:
f (qθ ) → min
θ

Classic example of such problem is maximum likelihood task when we try to fit the parameters of
our model to some observed data. The problem is that when using standard gradient descent both
the convergence rate and overall performance of optimization method substantially depend on the
choice of parametrization qθ . The problem holds even if we fix specific distribution family as many
distribution families allow different parametrizations.
To see why gradient descent is parametrization-sensitive, consider the model which is used at
some current point θk to determine the direction of next optimization step:
f(q_{θ_k}) + ⟨∇_θ f(q_{θ_k}), δθ⟩ → min_{δθ}
‖δθ‖²₂ < α_k

where αk is learning rate at step k. Being first-order method, gradient descent constructs a «model»
which approximates F locally around θk using first-order Taylor expansion and employs standard
Euclidean metric to determine a region of trust for this model. Then this surrogate task is solved
analytically to obtain well-known update formula:

δθ ∝ −∇θ f (qθk )

The issue arises from reliance on Eucliden metric in the space of parameters. In most parametriza-
tions, small changes in parameters space do not guarantee small change in distribution space and
vice versa: some small changes in distribution may demand big steps in parameters space26 .
Natural gradient proposes to use another metric, which achieves invariance to parametrization
of distribution q using the properties of Fisher matrix:

26 classic example is that N (0, 100) is similar to N (1, 100) while N (0, 0.1) is completely different from N (1, 0.1),

although Euclidean distance in parameter space is the same for both pairs.

Definition 11. For distribution qθ Fisher matrix Fq (θ) is defined as

Fq (θ) := Ex∼q ∇θ log qθ (x)(∇θ log qθ (x))T

Note that Fisher matrix depends on parametrization. Yet for any parametrization it is guaranteed
to be positive semi-definite by definition. Moreover, it induces a so-called Riemannian metric27 in
the space of parameters:

d(θ1 , θ2 )2 := (θ2 − θ1 )T Fq (θ1 )(θ2 − θ1 )

In natural gradient descent it is proposed to use this metric instead of Euclidean:


f(q_{θ_k}) + ⟨∇_θ f(q_{θ_k}), δθ⟩ → min_{δθ}
δθ^T F_q(θ_k) δθ < α_k

This surrogate task can be solved analytically to obtain the following optimization direction:

δθ ∝ −Fq (θk )−1 ∇θ f (qθk ) (43)

The direction of gradient descent is corrected by Fisher matrix which concerns the scale across dif-
ferent axes. This direction, specified by Fq (θk )−1 ∇θ f (qθk ), is called natural gradient.
Let’s discuss why this new metric really provides us invariance to distribution parametrization.
We already obtained natural gradient for q being parameterized by θ (43). Assume that we have
another parametrization qν . These new parameters ν are somehow related to θ ; we suppose there
is some functional dependency θ(ν), which we assume to be differentiable with jacobian J . In this
notation:
δθ = J δν,     J_ij := ∂θ_i / ∂ν_j     (44)
The central property of Fisher matrix, which provides the desired invariance, is the following:

Proposition 21. If θ = θ(ν) with jacobian J , then reparametrization formula for Fisher matrix is

Fq (ν) = J T Fq (θ)J (45)

Now it can be derived that natural gradient for parametrization with ν is the same as for θ . If we
want to calculate natural gradient in terms of ν , then our step is, according to (44):

δθ = J δν =
{natural gradient in terms of ν} ∝ J F_q(ν_k)^{−1} ∇_ν f(q_{ν_k}) =
{Fisher matrix reparametrization (45)} = J ( J^T F_q(θ_k) J )^{−1} ∇_ν f(q_{ν_k}) =
{chain rule} = J ( J^T F_q(θ_k) J )^{−1} ∇_ν θ(ν_k)^T ∇_θ f(q_{θ_k}) =
= J J^{−1} F_q(θ_k)^{−1} J^{−T} J^T ∇_θ f(q_{θ_k}) =
= F_q(θ_k)^{−1} ∇_θ f(q_{θ_k})

which can be seen to be the same as in (43).


Application of natural gradient descent in DRL setting is complicated in practice. Theoretically,
the only change that must be done is scaling of gradient using inverse Fisher matrix (43). Yet, Fisher
matrix requires n2 memory and O(n3 ) computational costs for inversion where n is the number of
parameters. For neural networks this causes the same complications as the application of second-
order optimization methods.
K-FAC optimization method provides a specific approximation form of Fisher matrix for neural
networks with linear layers which can be efficiently computed, stored and inverted. Usage of K-FAC
approximation allows to compute natural gradient directly using (43).
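As a toy illustration of formula (43), here is a minimal numpy sketch for a hypothetical one-dimensional Gaussian q_θ = N(µ, σ²) with θ = (µ, log σ): the Fisher matrix is estimated by Monte-Carlo from samples of ∇_θ log q_θ(x) and then used to rescale an ordinary gradient g:

import numpy as np

def grad_log_q(theta, x):
    mu, log_sigma = theta
    sigma = np.exp(log_sigma)
    d_mu = (x - mu) / sigma ** 2
    d_log_sigma = (x - mu) ** 2 / sigma ** 2 - 1.0     # chain rule through sigma = exp(log_sigma)
    return np.array([d_mu, d_log_sigma])

def natural_gradient(theta, g, n_samples=100_000, damping=1e-3):
    mu, log_sigma = theta
    xs = np.random.normal(mu, np.exp(log_sigma), size=n_samples)
    grads = np.stack([grad_log_q(theta, x) for x in xs])          # shape (n, 2)
    F = grads.T @ grads / n_samples                               # Monte-Carlo estimate of E[g g^T]
    return np.linalg.solve(F + damping * np.eye(2), g)            # F^{-1} g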
27 in Euclidean space the general form of scalar product is hx, yi := xT Gy , where G is fixed positive semi-definite matrix.

The metric induced by this scalar product is correspondingly d(x, y)2 := (y − x)T G(y − x). The difference in Riemannian
space is that G, called metric tensor, depends on x, so the relative distance may vary for different points. It is used to
describe the distances between points on manifolds and holds important properties which Fisher matrix inherits as metric
tensor for distribution space.

5.6. Trust-Region Policy Optimization (TRPO)
The main drawback of Actor-Critic algorithm is believed to be the abandonment of experience
that was used for previous updates. As the number of updates required is usually huge, this is
considered to be a substantial loss of information. Yet, it is not clear how this information can be
effectively used for newer updates.
Suppose we want to make an update of π(θ) using samples collected by some π_old. The
straightforward approach is the importance sampling technique, whose naive application to gradient
formula (40) yields the following result:

∇_θ J(θ) = E_{T∼π_old} [ P(T | π(θ)) / P(T | π_old) ] Σ_{t=0} ∇_θ log π_θ(a_t | s_t) A^π(s_t, a_t)

The emerging importance sampling weight is actually computable, as the transition probabilities
cancel out:

P(T | π(θ)) / P(T | π_old) = Π_{t=1} π_θ(a_t | s_t) / Π_{t=1} π_old(a_t | s_t)

The problem with this coefficient is that it tends either to be exponentially small or to explode. Even
with some heuristic normalization of coefficients the batch gradient would become dominated by
one or several transitions and destabilize the training procedure by introducing even more variance.
Notice that application of importance sampling to the other representation of the gradient (37) yields
a seemingly different result:

∇_θ J(θ) = E_{π_old} [ d^{π(θ)}(s) / d^{π_old}(s) ] · [ π_θ(a | s) / π_old(a | s) ] · ∇_θ log π_θ(a | s) A^π(s, a)     (46)
Here we avoided common for the whole trajectories importance sampling weights by using the def-
inition of state visitation frequencies. But this result is even less practical as these frequencies are
unknown to us.
The first key idea behind the theory concerning this problem is that these importance sampling
coefficients may behave more stably if the policies π_old and π(θ) are in some sense «close».
Intuitively, in this case d^{π(θ)}(s) / d^{π_old}(s) in formula (46) is close to 1, as the state visitation
frequencies are similar, and the remaining importance sampling coefficient becomes acceptable in
practice. And if two policies are similar, their values of our objective (2) are probably close too.
For any two policies π and π_old:

J(π) − J(π_old) = E_{T∼π} Σ_{t=0} γ^t r(s_t) − J(π_old) =
= E_{T∼π} Σ_{t=0} γ^t r(s_t) − V^{π_old}(s_0) =
= E_{T∼π} [ Σ_{t=0} γ^t r(s_t) − V^{π_old}(s_0) ] =
{trick^28 Σ_{t=0}^{∞} (a_{t+1} − a_t) = −a_0} = E_{T∼π} [ Σ_{t=0} γ^t r(s_t) + Σ_{t=0} ( γ^{t+1} V^{π_old}(s_{t+1}) − γ^t V^{π_old}(s_t) ) ] =
{regroup} = E_{T∼π} Σ_{t=0} γ^t ( r(s_t) + γ V^{π_old}(s_{t+1}) − V^{π_old}(s_t) ) =
{by definition (3)} = E_{T∼π} Σ_{t=0} γ^t ( Q^{π_old}(s_t, a_t) − V^{π_old}(s_t) ) =
{by definition (14)} = E_{T∼π} Σ_{t=0} γ^t A^{π_old}(s_t, a_t)

The result obtained above is often referred to as relative policy performance identity and is
actually very interesting: it states that we can substitute reward with advantage function of arbitrary
policy and that will shift the objective by the constant.
We will require this identity rewritten in terms of state visitation frequencies. To do so, it is
convenient to define discounted version of state visitations distribution:
28 and if the MDP is episodic, for terminal states V^{π_old}(s_T) = 0 by definition.

Definition 12. For a given MDP and given policy π its discounted state visitation frequency is
defined by

d(s | π) := (1 − γ) Σ_{t=0} γ^t P(s_t = s)

where st are taken from trajectories T sampled using given policy π .

Using the frequency as an unnormalized state visitation distribution, the relative policy performance
identity can be rewritten as

J(π) − J(π_old) = (1 / (1 − γ)) E_{s∼d(s|π)} E_{a∼π(a|s)} A^{π_old}(s, a)
Now assume we want to optimize the parameters θ of policy π while using data collected by π_old;
applying importance sampling in the same manner:

J(π_θ) − J(π_old) = (1 / (1 − γ)) E_{s∼d(s|π_old)} E_{a∼π_old(a|s)} [ d(s | π_θ) / d(s | π_old) ] [ π_θ(a | s) / π_old(a | s) ] A^{π_old}(s, a)
As we have in mind the idea of π old being close to πθ , the question is how well this identity can
be approximated if we assume d(s | πθ ) = d(s | π old ). Under this assumption:
J(π_θ) − J(π_old) ≈ L_{π_old}(θ) := (1 / (1 − γ)) E_{s∼d(s|π_old)} E_{a∼π_old(a|s)} [ π_θ(a | s) / π_old(a | s) ] A^{π_old}(s, a)
The point is that interaction using π old corresponds to sampling from the expectations presented
in Lπold (θ):
L_{π_old}(θ) = E_{π_old} [ π_θ(a | s) / π_old(a | s) ] A^{π_old}(s, a)
The approximation quality of Lπold (θ) can be described by the following theorem:

Proposition 22. [17]

| J(π_θ) − J(π_old) − L_{π_old}(θ) | ≤ C max_s KL(π_old ‖ π_θ)[s]

where C is some constant and KL(π_old ‖ π_θ)[s] is a shortened notation for
KL(π_old(a | s) ‖ π_θ(a | s)).

There is an important corollary of proposition 22:

J(π_θ) − J(π_old) ≥ L_{π_old}(θ) − C max_s KL(π_old ‖ π_θ)[s]

which not only states that the expression on the right side represents a lower bound, but also that the
optimization procedure

θ_{k+1} = argmax_θ [ L_{π_{θ_k}}(θ) − C max_s KL(π_{θ_k} ‖ π_θ)[s] ]     (47)

will yield a policy with guaranteed monotonic improvement29 .


In practice there are several obstacles which prevent us from obtaining such a procedure. First of all,
our advantage function estimation is never precise. Secondly, it is hard to estimate the precise value
of the constant C. One last obstacle is that it is not clear how to calculate the KL-divergence in its
maximal form (with max taken across all states).
In Trust-Region policy optimization [17] the idea of practical algorithm, approximating procedure
(47), is analyzed. To address the last issue, the naive approximation is proposed to substitute max
with averaging across states30 :

max_s KL(π_old ‖ π_θ)[s] ≈ E_{s∼d(s|π_old)} KL(π_old ‖ π_θ)[s]
29 the maximum of the lower bound is non-negative as its value for θ = θ_k equals zero, which causes J(π_{k+1}) − J(π_k) ≥ 0
30 the distribution from which the states come is set to be d(s | π_old) for convenience, as this is the distribution from which
they come in L_{π_old}(θ).

The second step of TRPO is to rewrite the unconstrained maximization task (47) in an equivalent
constrained («trust-region») form^31 to incorporate the unknown constant C into the learning rate:

L_{π_old}(θ) → max_θ
E_{s∼d(s|π_old)} KL(π_old ‖ π_θ)[s] < C     (48)

Note that this rewrites the update iteration in terms familiar from optimization methods: while $L_{\pi_{old}}(\theta)$ is an approximation of the true objective $J(\pi_\theta) - J(\pi_{old})$, the constraint sets the region of trust for this surrogate. Remark that the constraint is actually a divergence in policy space, i.e. it is very similar to a metric in the space of distributions, while the surrogate is a function of the policy and depends on the parameters θ only through $\pi_\theta$.
To solve the constrained problem (48), a technique from convex optimization is used. Assume that $\pi_{old}$ is the current policy and we want to update its parameters $\theta_k$. Then the objective of (48) is modeled using a first-order Taylor expansion around $\theta_k$, while the constraint is modeled using a second-order32 Taylor approximation:
$$\begin{cases} L_{\pi_{old}}(\theta_k + \delta\theta) \approx \langle \nabla_\theta L_{\pi_{old}}(\theta)|_{\theta_k}, \delta\theta \rangle \to \max\limits_{\delta\theta} \\ \mathbb{E}_{s\sim d(s\mid\pi_{old})} KL(\pi_{old} \parallel \pi_{\theta_k+\delta\theta}) \approx \frac{1}{2}\,\mathbb{E}_{s\sim d(s\mid\pi_{old})}\, \delta\theta^T\, \nabla^2_\theta KL(\pi_{old} \parallel \pi_\theta)\big|_{\theta_k}\, \delta\theta < C \end{cases}$$
It turns out that this model is equivalent to the natural policy gradient discussed in sec. 5.5:

Proposition 23.
$$\nabla^2_\theta KL(\pi_\theta \parallel \pi_{old})[s]\big|_{\theta_k} = F_{\pi(a \mid s)}(\theta)$$

so the KL-divergence constraint can be approximated with the metric induced by the Fisher matrix. Moreover, the gradient of the surrogate function is
$$\nabla_\theta L_{\pi_{old}}(\theta)\big|_{\theta_k} = \mathbb{E}_{\pi_{old}}\, \frac{\nabla_\theta \pi_\theta(a \mid s)|_{\theta_k}}{\pi_{old}(a \mid s)}\, A^{\pi_{old}}(s, a) = \{\pi_{old} = \pi_{\theta_k}\} = \mathbb{E}_{\pi_{old}}\, \nabla_\theta \log \pi_{\theta_k}(a \mid s)\, A^{\pi_{old}}(s, a)$$
which is exactly an Actor-Critic gradient. Therefore the update step is given by
$$\delta\theta \propto -F_\pi(\theta)^{-1} \nabla_\theta L_{\pi_{old}}(\theta)$$
where $\nabla_\theta L_{\pi_{old}}(\theta)$ coincides with the standard policy gradient, and $F_\pi(\theta)$ is the Hessian of the KL-divergence:
$$F_\pi(\theta) := \mathbb{E}_{s\sim d(s\mid\pi_{old})} \nabla^2_\theta KL(\pi_{old} \parallel \pi_\theta)\big|_{\theta_k}$$
In practical implementations the KL-divergence can be Monte-Carlo estimated using the collected roll-out. The size of the roll-out must be significantly bigger than in Actor-Critic to achieve sufficient precision of the Hessian estimation. Then, to obtain the direction of the optimization step, the following system of linear equations
$$F_\pi(\theta)\,\delta\theta = -\nabla_\theta L_{\pi_{old}}(\theta)$$
is solved using the conjugate gradient method, which is able to work with a Hessian-vector multiplication procedure instead of requiring $F_\pi(\theta)$ to be computed explicitly.
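For illustration, a minimal sketch of such a conjugate gradient solver is given below (Python/NumPy; hvp is a hypothetical routine returning the matrix-vector product $F_\pi(\theta)v$, e.g. obtained by differentiating the KL-divergence twice with automatic differentiation):

import numpy as np

def conjugate_gradient(hvp, b, n_iters=10, tol=1e-10):
    """Approximately solve F x = b using only matrix-vector products hvp(v) = F v."""
    b = np.asarray(b, dtype=float)
    x = np.zeros_like(b)
    r = b.copy()                    # residual b - F x (x starts at zero)
    p = r.copy()                    # current search direction
    r_dot = r @ r
    for _ in range(n_iters):
        Fp = hvp(p)
        alpha = r_dot / (p @ Fp)    # step size along p
        x += alpha * p
        r -= alpha * Fp
        new_r_dot = r @ r
        if new_r_dot < tol:
            break
        p = r + (new_r_dot / r_dot) * p
        r_dot = new_r_dot
    return x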
TRPO also accompanies the update step with a line-search procedure which dynamically adjusts the step length using a standard backtracking heuristic. As TRPO intuitively seeks policy improvement on each step, the idea is to check whether the lower bound (47) is positive after the biggest step allowed by the KL-constraint and to reduce the step size until it becomes positive.
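A sketch of this backtracking heuristic (with hypothetical callables estimating the surrogate improvement and the average KL-divergence after a candidate step):

import numpy as np

def backtracking_line_search(improvement, kl, full_step, max_kl, n_backtracks=10, decay=0.5):
    """Shrink the proposed step until the estimated improvement is positive
    and the KL-divergence stays within the trust region."""
    step = np.asarray(full_step, dtype=float)
    for _ in range(n_backtracks):
        if improvement(step) > 0 and kl(step) <= max_kl:
            return step              # accept the first step satisfying both conditions
        step = step * decay          # otherwise reduce the step length
    return np.zeros_like(step)       # reject the update if no acceptable step is found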
Unlike Actor-Critic, TRPO performs extremely expensive and complicated update steps but requires a relatively small number of iterations in return. Of course, due to the many approximations made, the overall procedure is only a resemblance of the theoretically justified iterations (47) that provide improvement guarantees.
31 the unconstrained objective is the Lagrangian of the constrained form
32 as the first-order term is zero
5.7. Proximal Policy Optimization (PPO)
Proximal Policy Optimization [19] proposes an alternative, heuristic way of optimizing the lower bound (47) which demonstrated encouraging empirical results.
PPO still substitutes $\max_s KL$ with an average, but leaves the surrogate in unconstrained form, suggesting to treat the unknown constant C as a hyperparameter:
$$\mathbb{E}_{\pi_{old}}\left[\frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)}\, A^{\pi_{old}}(s, a) - C\, KL(\pi_{old} \parallel \pi_\theta)[s]\right] \to \max_\theta \quad (49)$$
The naive idea would be to straightforwardly optimize (49), as it is equivalent to solving the constrained trust-region task (48). To avoid Hessian-involved computations, one possible option is just to perform one step of first-order gradient optimization of (49). Such an algorithm was empirically discovered to perform poorly, as the importance sampling coefficients $\frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)}$ tend to grow unboundedly.
In PPO it is proposed to cope with this problem in a simple old-fashioned way: by clipping. Let's denote by
$$r(\theta) := \frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)}$$
the importance sampling weight and by
$$r^{clip}(\theta) := \operatorname{clip}(r(\theta),\, 1 - \varepsilon,\, 1 + \varepsilon)$$
its clipped version, where $\varepsilon \in (0, 1)$ is a hyperparameter. Then the clipped version of the lower bound is:
$$\mathbb{E}_{\pi_{old}}\left[\min\left(r(\theta)\, A^{\pi_{old}}(s, a),\; r^{clip}(\theta)\, A^{\pi_{old}}(s, a)\right) - C\, KL(\pi_{old} \parallel \pi_\theta)[s]\right] \to \max_\theta \quad (50)$$
Here the minimum operation is introduced to guarantee that the surrogate objective remains a lower bound. Thus clipping at $1 + \varepsilon$ may take effect only when the advantage is positive, while clipping at $1 - \varepsilon$ may take effect only when the advantage is negative. In both cases, clipping represents a penalty for the importance sampling weight r(θ) moving too far from 1.
The overall procedure suggested by PPO to optimize the «stabilized» version of the lower bound (50) is the following. A roll-out is collected using the current policy $\pi_{old}$ with some parameters θ. Then batches of typical size (as for Actor-Critic methods) are sampled from the collected roll-out and several steps of SGD optimization of (50) are performed with respect to the policy parameters θ. During this process the policy $\pi_{old}$ is considered fixed and no new interaction steps are performed; in implementations there is no need to store the old weights $\theta_k$, since everything required from $\pi_{old}$ is to collect transitions and remember the probabilities $\pi_{old}(a \mid s)$. The idea is that during these several steps we may use transitions from the collected roll-out several times. A similar alternative is to perform several epochs of training by passing through the roll-out several times, as is often done in deep learning.
An interesting fact discovered by the authors of PPO during ablation studies is that removing the KL-penalty term does not affect the overall empirical performance. That is why many implementations of PPO do not include the KL-term at all, making the final surrogate objective take the following form:
$$\mathbb{E}_{\pi_{old}}\min\left(r(\theta)\, A^{\pi_{old}}(s, a),\; r^{clip}(\theta)\, A^{\pi_{old}}(s, a)\right) \to \max_\theta \quad (51)$$
Note that in this form the surrogate is not in general a lower bound and the «improvement guarantees» intuition is lost.
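For illustration, a minimal PyTorch-style sketch of the clipped surrogate (51), written as a loss to be minimized (the tensor arguments are hypothetical and assumed to be computed over a mini-batch of transitions):

import torch

def ppo_clipped_loss(new_logp, old_logp, advantages, eps=0.1):
    """Negated clipped surrogate (51): minimizing it maximizes the clipped lower bound."""
    ratio = torch.exp(new_logp - old_logp)                    # r(theta)
    clipped_ratio = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)  # r_clip(theta)
    surrogate = torch.min(ratio * advantages, clipped_ratio * advantages)
    return -surrogate.mean()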

Algorithm 7: Proximal Policy Optimization (PPO)

Hyperparameters: B — batch size, R — rollout size, n_epochs — number of epochs, ε — clipping parameter, Vφ — critic neural network, πθ — actor neural network, α — critic loss scaling, SGD optimizer.

Initialize weights θ, φ arbitrarily
On each step:
1. obtain a roll-out of size R using policy πθ, storing the action probabilities as π_old(a | s).
2. for each transition T from the roll-out compute the advantage estimation (detached from the computational graph to prevent backpropagation):
$$A^\pi(\mathcal{T}) = r' + \gamma V^\pi_\phi(s') - V^\pi_\phi(s)$$
3. perform n_epochs passes through the roll-out using batches of size B; for each batch:
• compute the critic target (detached from the computational graph to prevent backpropagation):
$$y(\mathcal{T}) = r' + \gamma V^\pi_\phi(s')$$
• compute the critic loss:
$$\text{Loss} = \frac{1}{B}\sum_{\mathcal{T}} \left(y(\mathcal{T}) - V^\pi_\phi(s)\right)^2$$
• compute the critic gradients:
$$\nabla_{critic} = \frac{\partial\, \text{Loss}}{\partial \phi}$$
• compute the importance sampling weights:
$$r_\theta(\mathcal{T}) = \frac{\pi_\theta(a \mid s)}{\pi_{old}(a \mid s)}$$
• compute the clipped importance sampling weights:
$$r^{clip}_\theta(\mathcal{T}) = \operatorname{clip}(r_\theta(\mathcal{T}),\, 1 - \varepsilon,\, 1 + \varepsilon)$$
• compute the actor gradient:
$$\nabla_{actor} = \frac{1}{B}\sum_{\mathcal{T}} \nabla_\theta \min\left(r_\theta(\mathcal{T})\, A^\pi(\mathcal{T}),\; r^{clip}_\theta(\mathcal{T})\, A^\pi(\mathcal{T})\right)$$
• make a step of gradient descent using $\nabla_{actor} + \alpha\nabla_{critic}$

6. Experiments
6.1. Setup
We performed our experiments using a custom implementation of the discussed algorithms, attempting to incorporate the best features from different official and unofficial sources and unifying all algorithms in a single library interface. The full code is available at our github.
While a custom implementation might not be the most efficient, it revealed several ambiguities in the algorithms which are resolved differently in different sources. We describe these nuances and the choices made for our experiments in appendix A.
For each environment we launch several algorithms to train a network with the same architecture, the only exception being the head, which is specified by the algorithm (see table 1).

DQN: linear transformation to |A| arbitrary real values.
Dueling: first head is a linear transformation to |A| arbitrary real values; second head is a linear transformation to an arbitrary scalar; aggregated using the dueling architecture formula (17).
Categorical: |A| linear transformations with softmax to A values.
Dueling Categorical: first head is a linear transformation to |A| arbitrary real values; second head is |A| linear transformations to A arbitrary real values; aggregated using the dueling architecture formula (32).
Quantile: |A| linear transformations to A arbitrary real values.
Dueling Quantile: first head is a linear transformation to |A| arbitrary real values; second head is |A| linear transformations to A arbitrary real values; aggregated using the dueling architecture formula (32) without softmax.
A2C / PPO: actor head is a linear transformation with softmax to |A| values; critic head is a linear transformation to a scalar value.
Table 1: Heads used for different algorithms. Here |A| is the number of actions and A is the chosen number of atoms.

For noisy networks all fully-connected layers in the feature extractor and in the head are substituted with noisy layers, doubling the number of their trained parameters. Both the usage of noisy layers and the choice of the head influence the total number of parameters trained by the algorithm.
As practical tuning of hyperparameters is a computationally expensive activity, we set all hyperparameters to their recommended values while trying to share the values of common hyperparameters among algorithms without affecting overall performance.
We choose to give each algorithm the same number of interaction steps to provide a fair comparison of their sample efficiency. Thus the wall-clock time, the number of episodes played and the number of network parameter updates vary across algorithms.

6.2. Cartpole
Cartpole from OpenAI Gym [2] is considered to be one of the simplest environments for testing DRL algorithms. The state is described by 4 real numbers, while the action space is discrete with two actions.
The environment rewards the agent with +1 on each tick until the episode ends. Poor action choices lead to early termination. The game is considered solved if the agent holds out for 200 ticks, therefore 200 is the maximum reward in this environment.
In our first experiment we launch the algorithms for 10 000 interaction steps to train a neural network on the Cartpole environment. The network consists of two fully-connected hidden layers with 128 neurons each and an algorithm-specific head. We used ReLU activations. The results of a single launch are provided33 in table 2.
33 we didn’t tune hyperparameters for each of the algorithms, so the configurations used might not be optimal.

Reached 200 Average reward Average FPS

Double DQN 23.0 126.17 95.78


Dueling Double DQN 27.0 121.78 62.65
DQN 33.0 116.27 101.53
Categorical DQN 28.0 110.87 74.95
Prioritized Double DQN 37.0 110.52 85.58
Categorical Prioritized Double DQN 46.0 104.86 66.00
Quantile Prioritized Double DQN 42.0 100.76 68.62
Categorical DQN with target network 44.0 96.08 73.92
Quantile Double DQN 54.0 93.14 75.40
Quantile DQN 70.0 88.12 77.93
Categorical Double DQN 42.0 81.25 70.90
Noisy Quantile Prioritized Dueling DQN 86.0 74.13 21.41
Twin DQN 57.0 71.14 52.51
Noisy Double DQN 67.0 71.06 31.81
Noisy Prioritized Double DQN 94.0 67.34 30.72
Quantile Regression Rainbow 106.0 67.11 21.54
Rainbow 91.0 64.01 20.35
Noisy Quantile Prioritized Double DQN 127.0 63.01 28.27
Noisy Categorical Prioritized Double DQN 63.0 62.04 27.81
PPO with GAE 144.0 53.06 390.53
Noisy Prioritized Dueling Double DQN 180.0 47.52 22.56
PPO 184.0 45.19 412.88
Noisy Categorical Prioritized Dueling Double DQN 428.0 22.09 20.63
A2C - 12.30 1048.64
A2C with GAE - 11.50 978.00
Table 2: Results on Cartpole for different algorithms: number of episode when the highest score of 200 was
reached, average reward across all played episodes and average number of frames processed in a second (FPS).

6.3. Pong
We used the Atari Pong environment from OpenAI Gym [2] as our main testbed to study the behaviour of the following algorithms:
• DQN — Deep Q-learning (sec. 3.2)
• c51 — Categorical DQN (sec. 4.2)
• QR-DQN — Quantile Regression DQN (sec. 4.3)
• Rainbow (sec. 4.4)
• A2C — Advantage Actor Critic (sec. 5.3) extended with GAE (sec. 5.4)
• PPO — Proximal Policy Optimization (sec. 5.7) extended with GAE (sec. 5.4)
In Pong, each episode is split into rounds. Each round ends with the player either winning or losing. The episode ends when the player wins or loses 21 rounds. The reward is given after each round and is +1 for winning and -1 for losing. Therefore the maximum total reward is 21 and the minimum is -21. Note that the flag done indicating the episode ending is not provided to the agent after each round but only at the end of the full game (consisting of 21-41 rounds).
The standard preprocessing for Atari games proposed in DQN [13] was applied to the environment (see table 3). Thus, the state space is represented by an (84, 84) grayscale pixel input (1 channel with domain [0, 255]). The action space is discrete with |A| = 6 actions.
All algorithms were given 1 000 000 interaction steps to train the network with the same feature extractor presented in fig. 1. The number of trained parameters is presented in table 4. All used hyperparameters are listed in table 7 in appendix B.
NoopResetEnv: does nothing for the first 30 frames of the game to imitate the pause between the game start and a real player's reaction.
MaxAndSkipEnv: each interaction step takes 4 frames of the game to allow a less frequent switch of action; the max is taken over the 4 passed frames to obtain an observation.
FireResetEnv: presses the «Fire» button at the first frame to launch the game, otherwise the screen remains frozen.
WarpFrame: turns the observation into a grayscale image of size 84x84.
Table 3: Atari Pong preprocessing

Algorithm Number of trained parameters


DQN 1 681 062
c51 1 834 962
QR-DQN 1 834 962
Rainbow 3 650 410
A2C 1 681 575
PPO 1 681 575
Table 4: Number of trained parameters in Pong experiment.

6.4. Interaction-training trade-off in value-based algorithms


There is a common belief that policy gradient algorithms are much faster in terms of computational costs, while value-based algorithms are preferable when simulation is expensive because of their sample efficiency. This follows from the nature of the algorithms, as the fraction «observations per network update» is extremely different for these two families: indeed, in DQN it is often assumed that one network update is performed after each new transition, while A2C collects about 32-40 observations for only one update. That makes the number of network updates performed during a 1M-step interaction process substantially different and is the main reason for the speed advantage of policy gradient methods.
Policy gradient algorithms also use several threads for parallel simulation (8 in our experiments), while value-based algorithms are formally single-threaded. Yet they can also enjoy multi-threaded interaction, in the simplest form by playing 1 step in all instances of the environment and then performing L steps of network optimization [8]. For consistency with the single-threaded case it is reasonable to set the value of L equal to the number of threads, maintaining the same fraction «observations per network update».
However, it has been reported that lowering the value of L by a factor of two or four can positively affect wall-clock time with some loss of sample efficiency, while raising the batch size may mitigate this downgrade. The overall impact of such acceleration of value-based algorithms on their performance properties is not well studied and may alter their behaviour.
In our experiments on Pong it became evident that value-based algorithms perform an extensive amount of redundant network optimization steps, absorbing knowledge faster than novel information from new transitions comes in. This reasoning in particular follows from the success of PPO on the Pong task, which performs more than 10 times fewer network updates.

Vanilla algorithm Accelerated version


Threads 1 8
Batch size 32 128
L 1 2
Interactions per update 1 4
Table 5: Setup for value-based acceleration experiment

We compared two versions of the value-based algorithms: the vanilla version, which is single-threaded with the standard batch size (32) and L = 1, meaning that each observed transition is followed by one network optimization step, and the accelerated version, where one interaction step is performed in 8 parallel instances of the environment and L is set to 2 instead of 8, which raises the fraction «observations per training step» four times. To compensate for this change we raised the batch size four times.
Figure 1: Network used for Atari Pong: an 8x8 convolution with stride 4, a 4x4 convolution with stride 2 and a 3x3 convolution with stride 1 map the (1, 84, 84) input through (32, 20, 20), (64, 9, 9) and (64, 7, 7) feature maps, followed by a fully-connected layer with 512 features and an algorithm-specific head. All activation functions are ReLU. For Rainbow the fully-connected layer and all dense layers in the algorithm-specific head are substituted with noisy layers.
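A PyTorch sketch of this feature extractor (shapes as in Figure 1; the algorithm-specific head is omitted):

import torch.nn as nn

def make_pong_features(in_channels=1, n_features=512):
    """Convolutional feature extractor for 84x84 Atari frames producing 512 features."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),  # -> (32, 20, 20)
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),           # -> (64, 9, 9)
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),           # -> (64, 7, 7)
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, n_features), nn.ReLU(),
    )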

As expected, the average speed of the algorithms increases by approximately 3.5 times (see table 6). We provide training curves with respect to the 1M performed interaction steps in fig. 2 and with respect to wall-clock time in fig. 3. The only vanilla algorithm that achieved a better final score compared to its accelerated rival is QR-DQN, while the other three algorithms demonstrated both acceleration and performance improvement. The latter is probably caused by randomness, as a relaunch of the algorithms with the same setting and hyperparameters can be strongly influenced by the random seed.
It can be assumed that the fraction «observations per update» is an important hyperparameter of value-based algorithms which controls the trade-off between wall-clock time and sample efficiency. From our results it follows that a low fraction leads to excessive network updates and may slow down learning by several times. Yet this hyperparameter can hardly be tuned universally for all kinds of tasks, as opposed to many other hyperparameters that usually have recommended default values.
In what follows we stick to the accelerated version and use its results in the final comparisons.

6.5. Results
We compare the results of launching six algorithms on Pong from two perspectives: sample efficiency (fig. 4) and wall-clock time (fig. 5). We do not compare the final performance of these algorithms, as all six algorithms are capable of reaching a near-maximum final score on Pong given more iterations, while results after 1M iterations of a single launch significantly depend on chance.
All algorithms start with a warm-up session during which they try to explore the environment and learn the first dependencies showing how the result of random behaviour can be surpassed.

Algorithm: interactions per update (vanilla / accelerated), average transitions per second (vanilla / accelerated)
DQN: 1 / 4, 55.74 / 168.43
c51: 1 / 4, 44.08 / 148.76
QR-DQN: 1 / 4, 47.46 / 155.97
Rainbow: 1 / 4, 19.30 / 70.22
A2C (single version): 40, 656.25
PPO (single version): 10.33, 327.13
Table 6: Computational efficiency of the vanilla and accelerated versions.

Figure 2: Training curves of the vanilla and accelerated versions of value-based algorithms on 1M steps of Pong (average score over the last 20 episodes vs. interaction step). Although the accelerated versions perform network updates four times less frequently, no performance degradation is observed.

Epsilon-greedy exploration with tuned parameters provides a sufficient amount of exploration for DQN, c51 and QR-DQN without slowing down further learning, while hyperparameter-free noisy networks are the main reason why Rainbow has a substantially longer warm-up.
Policy gradient algorithms incorporate the exploration strategy in the stochasticity of the learned policy, but underutilization of observed samples leads to an almost 1M-frame warm-up for A2C. It can be observed that PPO successfully mitigates this problem by reusing samples three times. Nevertheless, both PPO and A2C solve Pong relatively quickly after the warm-up stage is over.
Value-based algorithms proved to be more computationally costly. QR-DQN and categorical DQN introduce a more complicated loss computation, yet their slowdown compared to standard DQN is moderate. On the contrary, Rainbow is substantially slower, mainly because of the noise generation involved. Furthermore, the combination of noisy networks and prioritized replay results in an even less stable training process.
We provide loss curves for all six algorithms and statistics for noise magnitude and prioritized re-
play for Rainbow in appendix C; some additional visualizations of trained algorithms playing episodes
of Pong are presented in appendix D.

Figure 3: Training curves of the vanilla and accelerated versions of value-based algorithms on 1M steps of Pong with respect to wall-clock time (average score over the last 20 episodes vs. minutes).

Figure 4: Training curves of all algorithms on 1M steps of Pong (average score over the last 20 episodes vs. interaction step).

Figure 5: Training curves of all algorithms on 1M steps of Pong with respect to wall-clock time (average score over the last 20 episodes vs. minutes). Approximate total training times annotated in the figure: A2C 0h 25m, PPO 0h 50m, c51 1h 38m, QR-DQN 1h 46m, DQN 1h 52m, Rainbow 3h 57m.

7. Discussion
We have considered two main directions of universal model-free RL algorithm design and attempted to recreate several state-of-the-art pipelines.
While the extensions of DQN are reasonable solutions to evident DQN problems, their effect is not clearly seen on simple tasks like Pong34. The current state of the art in the single-threaded value-based approach, Rainbow DQN, is full of «glue and tape» decisions that might not be the most effective way of stabilizing the training process.
The distributional value-based approach is one of the cheapest, in terms of resources, extensions of the vanilla DQN algorithm. Although it is reported to provide substantial performance improvement in empirical experiments, the reason behind this result remains unclear, as the expectation of return is the key quantity for the agent's decision making while the rest of the learned distribution does not affect its choices. One hypothesis to explain this phenomenon is that attempting to capture a wider range of dependencies inside the given MDP may provide auxiliary helping tasks to the algorithm, leading to better learning of the expectation. Intuitively it seems that a more reasonable switch of DQN to a distributional setting would be to learn the Bayesian uncertainty of the expectation of return given the observed data, but scalable practical algorithms within this orthogonal paradigm are yet to be created.
Policy gradient algorithms are aimed at direct optimization of the objective and currently beat the value-based approach in terms of computational costs. They tend to have fewer hyperparameters but are extremely sensitive to the choice of optimizer parameters, especially the learning rate. We have confirmed the effectiveness of the state-of-the-art algorithm PPO, which succeeded in solving Pong within an hour without hyperparameter tuning. Though this algorithm was derived from TRPO theory, it essentially deviates from it and substitutes trust-region updates with heuristic clipping.
It can be observed in our results that PPO provides better gradients to the same network than DQN-based algorithms, despite the absence of experience replay. While it is fair to assume that forgetting experienced transitions leads to information loss, it is also true that most observations stored in replay memory are already learned or contain no useful information. The latter makes most transitions in the sampled mini-batches insignificant, and, while prioritized replay addresses this issue, it might still be the case that current experience replay management techniques are imperfect.
There are still many deviations of empirical results from theoretical expectations. It is yet unclear which techniques have the highest potential and what explanation lies behind the many heuristic elements composing current state-of-the-art results. Possibly, essential elements of modeling human-like reinforcement learning are yet to be unraveled, as active research in this area promises substantial acceleration, generalization and stabilization of DRL algorithms.

34 although it takes several hours to train, Pong is considered to be the easiest of the 57 Atari games and one of the most basic testbeds for RL algorithms.

References
[1] M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learn-
ing. In Proceedings of the 34th International Conference on Machine Learning-Volume 70,
pages 449–458. JMLR. org, 2017.

[2] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba. Ope-
nai gym. arXiv preprint arXiv:1606.01540, 2016.

[3] W. Dabney, M. Rowland, M. G. Bellemare, and R. Munos. Distributional reinforcement learning


with quantile regression. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[4] M. Fortunato, M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis,
O. Pietquin, et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.

[5] I. Goodfellow, Y. Bengio, A. Courville, and Y. Bengio. Deep learning, volume 1. MIT press Cam-
bridge, 2016.

[6] P. Henderson, R. Islam, P. Bachman, J. Pineau, D. Precup, and D. Meger. Deep reinforcement
learning that matters. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[7] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot,
M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. In
Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[8] D. Horgan, J. Quan, D. Budden, G. Barth-Maron, M. Hessel, H. Van Hasselt, and D. Silver. Dis-
tributed prioritized experience replay. arXiv preprint arXiv:1803.00933, 2018.

[9] A. Irpan. Deep reinforcement learning doesn't work yet. Online (Feb. 14): https://www.alexirpan.com/2018/02/14/rl-hard.html, 2018.

[10] L. Kaiser, M. Babaeizadeh, P. Milos, B. Osinski, R. H. Campbell, K. Czechowski, D. Erhan, C. Finn,


P. Kozakowski, S. Levine, et al. Model-based reinforcement learning for atari. arXiv preprint
arXiv:1903.00374, 2019.

[11] R. Koenker and G. Bassett Jr. Regression quantiles. Econometrica: journal of the Econometric
Society, pages 33–50, 1978.

[12] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous
control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.

[13] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Play-
ing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

[14] OpenAI. OpenAI Five. https://blog.openai.com/openai-five/, 2018.

[15] T. Salimans, J. Ho, X. Chen, S. Sidor, and I. Sutskever. Evolution strategies as a scalable alternative
to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.

[16] T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. arXiv preprint
arXiv:1511.05952, 2015.

[17] J. Schulman, S. Levine, P. Abbeel, M. I. Jordan, and P. Moritz. Trust region policy optimization. In
Icml, volume 37, pages 1889–1897, 2015.

[18] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel. High-dimensional continuous control
using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

[19] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization
algorithms. arXiv preprint arXiv:1707.06347, 2017.

[20] D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Ku-
maran, T. Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement
learning algorithm. arXiv preprint arXiv:1712.01815, 2017.

[21] R. S. Sutton and A. G. Barto. Reinforcement learning: An introduction. MIT press, 2018.
[22] R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for rein-
forcement learning with function approximation. In Advances in neural information processing
systems, pages 1057–1063, 2000.

[23] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double q-learning. In
Thirtieth AAAI Conference on Artificial Intelligence, 2016.

[24] O. Vinyals, I. Babuschkin, J. Chung, M. Mathieu, M. Jaderberg, W. M. Czarnecki, A. Dudzik,


A. Huang, P. Georgiev, R. Powell, T. Ewalds, D. Horgan, M. Kroiss, I. Danihelka, J. Agapiou,
J. Oh, V. Dalibard, D. Choi, L. Sifre, Y. Sulsky, S. Vezhnevets, J. Molloy, T. Cai, D. Budden,
T. Paine, C. Gulcehre, Z. Wang, T. Pfaff, T. Pohlen, Y. Wu, D. Yogatama, J. Cohen, K. McKinney,
O. Smith, T. Schaul, T. Lillicrap, C. Apps, K. Kavukcuoglu, D. Hassabis, and D. Silver. AlphaS-
tar: Mastering the Real-Time Strategy Game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019.
[25] Z. Wang, T. Schaul, M. Hessel, H. Van Hasselt, M. Lanctot, and N. De Freitas. Dueling network
architectures for deep reinforcement learning. arXiv preprint arXiv:1511.06581, 2015.

[26] C. J. Watkins and P. Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.

[27] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement


learning. Machine learning, 8(3-4):229–256, 1992.

Appendix A. Implementation details
Here we describe several technical details of our implementation which may potentially influence
the obtained results.
In most papers on value-based algorithms the hyperparameters recommended for Atari games assume raw input in the range [0, 255], while in various implementations of policy gradient algorithms normalized input in the range [0, 1] is considered. Deviating from these conventions may damage the convergence speed both for value-based and policy gradient algorithms, as a change of the input domain requires hyperparameter retuning.
We use the MSE loss that emerges from the theoretical intuition for DQN, while in many sources it is recommended to use the Huber loss35 instead to stabilize learning.
In all value-based algorithms except c51 we update the target network every K-th frame instead of exponentially smoothing its parameters, as it is computationally cheaper. For c51 we remove the target network heuristic, as the a priori limited domain prevents unbounded growth of predictions.
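A sketch of the two update schemes mentioned here (hard update every K-th step, as we use, versus exponential smoothing with a hypothetical coefficient tau); target_net and net are assumed to be PyTorch modules of identical architecture:

def hard_update(target_net, net, step, K=1000):
    """Copy the online network into the target network every K-th step."""
    if step % K == 0:
        target_net.load_state_dict(net.state_dict())

def soft_update(target_net, net, tau=0.005):
    """Exponentially smooth the target parameters towards the online ones."""
    for p_target, p in zip(target_net.parameters(), net.parameters()):
        p_target.data.mul_(1.0 - tau).add_(tau * p.data)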
We do not architecturally force the quantiles output by the network in Quantile Regression DQN to satisfy $\zeta_0 \le \zeta_1 \le \dots \le \zeta_{A-1}$. As in the original paper, we assume that all A outputs of the network are arbitrary real values and use a standard linear transformation as our last layer.
In dueling architectures we subtract the mean of A(s, a) across actions instead of the theoretically assumed maximum, as proposed by the original paper's authors.
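In code this aggregation amounts to one line; a sketch, assuming value of shape (batch, 1) and advantage of shape (batch, |A|) as PyTorch tensors:

def dueling_aggregate(value, advantage):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), computed per batch element."""
    return value + advantage - advantage.mean(dim=1, keepdim=True)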
We implement sampling from the prioritized replay using a SumTree data structure and in informal experiments confirmed the acceleration it provides. The importance sampling weight annealing β(t) is represented by an initial value β(0) = β which is then linearly annealed to 1 during the first Tβ frames; both β and Tβ are hyperparameters.
We do not allow priorities P(T) to be greater than 1, clipping them as suggested in the original paper. This may mitigate the effect of prioritized replay but stabilizes the process.
As the importance sampling weights $w(\mathcal{T}) = \frac{1}{B\,P(\mathcal{T})}$ are potentially very close to zero, in the original article it was proposed to normalize them by $\max_{\mathcal{T}} w(\mathcal{T})$. In some implementations the maximum is taken over the whole experience replay, while in others it is taken over the current batch, which is not theoretically justified but computationally much faster. We stick to the latter option.
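A sketch of this per-batch normalization (using the definition w(T) = 1/(B·P(T)) above; the argument names are hypothetical):

import numpy as np

def importance_weights(batch_sampling_probs, batch_size):
    """Weights w(T) = 1 / (B * P(T)), normalized by the maximum over the current batch."""
    w = 1.0 / (batch_size * np.asarray(batch_sampling_probs))
    return w / w.max()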
For noisy layers we use factorized noise sampling: for a layer with m inputs and n outputs we sample $\varepsilon_1 \in \mathbb{R}^n$, $\varepsilon_2 \in \mathbb{R}^m$ from standard normal distributions and scale both elementwise using $f(\varepsilon) = \operatorname{sign}(\varepsilon)\sqrt{|\varepsilon|}$. Thus we use $f(\varepsilon_1)f(\varepsilon_2)^T$ as our noise sample for the weight matrix and $f(\varepsilon_1)$ as the noise sample for the bias. All noise is shared across the mini-batch. Noise is resampled on each forward pass through the network and thus is independent between evaluation, selection and interaction. Despite all these simplifications, we found noisy layers to be the most computationally expensive modification of DQN, leading to a substantial degradation of wall-clock time.
For policy gradient algorithms we add an additional policy entropy term to the loss to force exploration. We also define the actor loss as a scalar function that yields the same gradients as the corresponding gradient estimation (40) for A2C, so that it can be computed using PyTorch mechanics. For PPO, objective (51) provides an analogous «actor loss»; thus, in both policy gradient algorithms the full loss is defined as a sum of the actor, critic and entropy losses, with the two latter scaled using scalar hyperparameters.
We use a shared network architecture for policy gradient algorithms with one feature extractor and two heads, one for the policy and one for the critic.
The KL-penalty is not used in our PPO implementation. Also, we do not normalize advantage estimations across the roll-out to zero mean and unit standard deviation, as is additionally done in some implementations.
We use the PyTorch default initialization for linear and convolutional layers, although orthogonal initialization of all layers is reported to be beneficial for policy gradient algorithms. Initial values of the sigmas for noisy layers are set to be constant and equal to $\frac{\sigma_{init}}{\sqrt{m}}$, where $\sigma_{init}$ is a hyperparameter and m is the number of inputs, in accordance with the original paper.
We use Adam as our optimizer with default β1 = 0.9, β2 = 0.999, ε = 1e−8. No gradient
clipping is performed.

35 The Huber loss is defined as
$$\text{Loss}(y, \hat{y}) = \begin{cases} (y - \hat{y})^2 & \text{if } |y - \hat{y}| < 1 \\ |y - \hat{y}| & \text{otherwise} \end{cases}$$

Appendix B. Hyperparameters

DQN QR-DQN c51 Rainbow A2C PPO


Reward discount factor γ 0.99
ε(t)-greedy strategy $0.01 + 0.99\,e^{-t/30\,000}$ - -
Interactions per training step 4 -
Batch size B 128 - 32
Rollout capacity - 40 1024
PPO number of epochs - 3
Replay buffer initialization size36 10 000 transitions -
Replay buffer capacity M 1 000 000 transitions -
Target network updates K each 1000-th step -
Number of atoms A - 51 -
Vmin , Vmax - - [−10, 10] -
Noisy layers std initialization - - - 0.5 -
Multistep N - - - 3 -
Prioritization degree α - - - 0.5 -
Prioritization bias correction β - - - 0.4 -
Unbiased prioritization after - - - 100 000 steps -
GAE coeff. λ - 0.95
Critic loss weight - 0.5
Entropy loss weight - 0.01
PPO clip  - 0.1
Optimizer Adam
Learning rate 0.0001
Table 7: Selected hyperparameters for Atari Pong

36 number of transitions to collect in replay memory before starting network optimization using mini-batch sampling.

Appendix C. Training statistics on Pong

Figure 6: DQN loss behaviour during training on Pong (raw loss and loss averaged across 1000 steps, vs. network update step).

Figure 7: Loss behaviours of c51, QR-DQN and Rainbow during training on Pong.

Figure 8: Rainbow statistics during training. Left: smoothed with window 1000 median of importance sampling weights from sampled mini-batches. Right: average noise magnitude logged at each 20-th step of training.

Figure 9: A2C loss behaviour during training (actor, critic and entropy losses vs. network update step).

Figure 10: PPO loss behaviour during training (actor, critic and entropy losses vs. network update step).

Appendix D. Playing Pong behaviour

Figure 11: DQN playing one episode of Pong (predicted V(s) vs. reward-to-go over episode steps, with lost and won rounds marked).

Figure 12: c51 playing one episode of Pong (predicted V(s) vs. reward-to-go over episode steps, with lost and won rounds marked).

Figure 13: c51 value distribution prediction during one episode of Pong.

Figure 14: Quantile Regression DQN playing one episode of Pong (predicted V(s) vs. reward-to-go over episode steps, with lost and won rounds marked).

Figure 15: Quantile Regression DQN value distribution prediction during one episode of Pong.

Figure 16: Rainbow playing one episode of Pong (exploration turned off, i.e. all noise samples are zero).

Figure 17: Rainbow value distribution prediction during one episode of Pong (exploration turned off, i.e. all noise samples are zero).

Figure 18: A2C playing one episode of Pong (predicted V(s) vs. reward-to-go over episode steps, with lost and won rounds marked).

Figure 19: A2C policy distribution during one episode of Pong.

Figure 20: PPO playing one episode of Pong (predicted V(s) vs. reward-to-go over episode steps, with lost and won rounds marked).

Figure 21: PPO policy distribution during one episode of Pong.
