Reinforcement Learning: A Literature Review (v2)
Iscte - Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal
Abstract
This paper contains a literature review of Reinforcement Learning and its evolution. Reinforcement Learning is a part of Machine Learning and comprises algorithms and techniques for achieving optimal control of an agent in an environment, providing a form of Artificial Intelligence. This agent can be a physical or virtual robot, a controller simulating a player in a game, a bot trading stocks, etc. The study starts at Q-learning [56], published in 1989, and follows the thread of algorithms and frameworks up until 2020, looking at the main insights each paper brings to the field regarding strategies to handle RL.
1. Introduction
Reinforcement Learning (RL) is a part of the Machine Learning (ML) field. As seen in Figure 1, RL is considered an active type of ML [78]. Intrinsic Motivation (IM) is also considered an active type of ML [78], but differs from RL because it lacks the feedback mechanism from a supervisor that RL has. This feedback mechanism is also what distinguishes Supervised from Unsupervised ML. While [78] considers IM a separate field of ML, several RL works use IM as a strategy to cope with complex, sparse-reward environments.
Figure 1 - A diagram of the Reinforcement Learning context and techniques
RL is an important area of study because it may enable society to automate tasks that, in the past, we never thought could be automated. Autonomous driving is one application of RL. Others may include robots that can perform tasks like preparing ingredients and cooking food, with the same robot preparing several different dishes without any human intervention or any task-specific programming. Stock market trading can also be performed by an RL agent that, instead of being programmed with specific rules, learns the best trading rules by itself.
In theory, RL can reach a state of Artificial General Intelligence (AGI) [21, p.16], or even further, of Artificial Super Intelligence (ASI) [21, p.16], the ultimate goal being an agent that performs any task a human performs, but faster and with more precision.
While surveys of RL algorithms are usually restricted to a single family, either Flat Reinforcement Learning (FRL) [40, p.1], Hierarchical Reinforcement Learning (HRL) [36, p.1] or Meta Reinforcement Learning (MRL) [71, p.1], our approach is to mix all types in chronological order (by year of print publication), since, despite having fundamentally different approaches, they all propose to solve the problem of optimal control. There is no specific order of papers within each year.
Figure 2 shows, in a simplified way, a comparison between FRL, HRL and MRL methods. While FRL contains a core we call the "planner" that processes observations and rewards to produce micro-actions (actions that are atomic and not divisible into sub-actions), HRL combines a planner level with a "skills" level that we can also call macro-actions or options. Skills are combinations of micro-actions that can be used by the planner. In HRL the planner can also choose to perform micro-actions combined with macro-actions or other sub-planners. Planning with micro-actions alone operates over a higher-dimensional search space, so it may also involve a higher computational cost. MRL is an approach that focuses on creating agents that learn how to learn, and that can model concepts. While we separate skills from concepts in the MRL sketch, a skill can be seen as the concept of an action, if we consider how a skill adapts to every similar type of action.
Figure 2 - Flat Reinforcement Learning (FRL) vs. Hierarchical Reinforcement Learning (HRL) vs. Meta Reinforcement Learning (MRL)
One of the early papers that introduced Q-learning, a root of FRL approaches, also discusses how HRL methods can be used to solve the learning problem [56], but over time some degree of separation between FRL and HRL became noticeable, even though HRL implementations of planners sometimes resort to FRL methods. MRL algorithms have formed a thread separate from FRL and HRL, but in some aspects they still feel like HRL.
FRL refers to plain RL algorithms, as opposed to HRL or MRL algorithms, because FRL methods treat the problem as one big, flat search space [40, p.228]. FRL methods have limitations because the models in this set of approaches are oriented towards basic actions only [46, p.1].
In HRL, through hierarchies that decompose problems into smaller problems, RL agents can learn faster and can be applied to problems with larger action and state spaces, because of the inherent dimensionality reduction of the hierarchical approach [36, p.1].
MRL concentrates on learning an inductive bias that can accelerate learning a new task by training on many previous tasks [31, p.3], and in this way adapt to unseen tasks the agent will need to perform in the future [68, p.1].
This survey holds brief descriptions of the methods, algorithms, techniques and frameworks proposed in the selected publications, with focus on the insights each one brings. It also shows a trend in the interest of researchers, universities [27, 4, 18, 12, 20, 57, 30, 7, 13, 22, 32] and companies like Google [8, 5, 2, 33, 19, 23, 9, 10, 11, 14, 1, 28], OpenAI [57], Amazon [31], Alibaba [16] and others investing in the field of RL.
As stated in [15, p.151], almost all deep RL algorithms consist of simple recipes that combine a dataset specification, an optimization procedure, a cost function, etc., so by rearranging these elements we can obtain many different algorithms. This effect can be seen throughout the chronology of RL.
Figure 3 shows the number of papers per year in this survey and, assuming the papers not represented here are equally distributed throughout the years covered, we can state that there is rapid progress and increasing interest in the field of RL in recent years, one of the motivations for this study.
Figure 3 - Distribution of papers, in this survey, per year (2020 partial year)
This study starts at Q-learning [56], published in 1989, and follows the thread of algorithms and frameworks up until 2020, looking at the main insights each paper brings to the field with regard to strategies to handle RL. The motivation for a chronologically ordered study is to allow us to understand which ideas have become baselines for subsequent algorithms, what problems researchers have been trying to solve with each approach, and which trends of thought RL researchers followed over time; like connecting the dots in a drawing, this allows us to form a general image of RL and of the direction it is taking.
Following the chronology, we can see what, after 2017, some papers have called ensemble approaches [60, 61, 75]; other papers, while not using that term, present the same hybrid approach of joining the strengths of model-free and model-based solutions [11, 12, 13, 32, 60, 61, 66, 75].
As a note on the use of BibTeX entries in this study's bibliography: we adopted that format to provide faster access to the works cited herein and, in this way, a more useful reference tool.
2. Acronyms
The following table lists some acronyms broadly used in the RL field, some of which are also used throughout this paper.
AI - Artificial Intelligence
ML - Machine Learning
NN - Neural Networks
RL - Reinforcement Learning
3. Preliminary Concepts
The RL problem
The RL problem can be described as a flow in which an agent learns from interaction with an environment, receiving state descriptions and rewards in order to decide which actions to perform [34], with the aim of achieving optimal control of the agent.
While this logic has been the foundation of many RL algorithms [34], concepts like IM [59, 78] and proprioception (self state) [45] (the sense of body position or self-movement) have been introduced in some works, so we can design an extended RL agent-environment interaction, like the one in Figure 5.
In Figure 5, the agent not only takes inputs from the "external" environment but also from an "internal environment" represented by the intrinsic motivations / rewards and by the self-state readings.
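To make the extended interaction concrete, the following toy sketch implements the loop of Figure 5 in Python. The chain environment, the random agent and the count-based intrinsic bonus are illustrative placeholders of our own, not taken from any of the surveyed papers.

```python
import random

# Toy, self-contained version of the extended agent-environment loop of Figure 5.
# Environment, agent and intrinsic bonus are illustrative placeholders only.

class ToyChainEnv:
    """A 10-state chain; reaching state 9 yields the only extrinsic reward (sparse)."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                        # action in {-1, +1}
        self.state = max(0, min(9, self.state + action))
        done = self.state == 9
        return self.state, (1.0 if done else 0.0), done

class RandomAgent:
    def __init__(self):
        self.visit_counts = {}                     # "internal environment": novelty memory

    def act(self, observation):
        return random.choice([-1, +1])

    def intrinsic_reward(self, next_observation):  # simple count-based novelty bonus
        self.visit_counts[next_observation] = self.visit_counts.get(next_observation, 0) + 1
        return 1.0 / self.visit_counts[next_observation]

env, agent = ToyChainEnv(), RandomAgent()
obs, total_return = env.reset(), 0.0
for _ in range(100):
    action = agent.act(obs)
    next_obs, extrinsic, done = env.step(action)
    intrinsic = agent.intrinsic_reward(next_obs)
    total_return += extrinsic + intrinsic          # combined external + internal reward
    obs = next_obs
    if done:
        break
print("return (extrinsic + intrinsic):", round(total_return, 2))
```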
States, Policies, Rewards and Time
Each RL problem can be described using a Markov Decision Process (MDP) and also a Semi-Markov Decision Process (SMDP).
Figure 6 - A transition graph of a Markov Decision Process for usage and recharging of a
robot’s battery [34, p.52].
Figure 6 shows an MDP: each circle represents a state and each arrow a transition with a given probability. From each state there is more than one possible transition. A policy represents the strategy the system uses at a point in time to choose the next transition and the transitions it expects to make in the future. Rewards represent the gain the system obtains by reaching a certain state. For instance, if the robot's battery has a charge of 40% and it decides to charge itself to 100%, the reward may be a representation of a 60% charge, and if the charge is at 10% and it decides to go charge itself, then the reward may be a representation of a 90% increase in charge. This is a very simple example; in a real application, reward calculation would also need to take into account rewards other than the battery charge. If the robot has goals to achieve in the environment, rewards for charging the battery will certainly have to compete with rewards for achieving other purposeful tasks.
SMDPs introduce the time variable. In MDPs each state/transition is expected to last a single, uniform unit of time, a discrete time step; this is where SMDPs come into play, since in SMDPs transitions can have different durations rather than being limited to single-unit time changes, introducing in this way the concept of time [39].
In RL, finite and infinite horizon settings have very different complexities: in the first case, the agent knows when to stop processing because a goal has been met, while in the second the agent is treated as a living entity with a long life span and multiple, endless goals throughout its existence [16, 13, 61, 32].
The horizon concept has also been used to create hybrid model-free and model-based approaches, allowing one to decide the size of rollouts taken from batch data of the model-based part in order to feed the model-free part of these mixed models [16, 61].
Extrinsic Motivation versus Intrinsic Motivation (IM)
Extrinsic motivation happens when an agent obtains a reward from the environment after completing a task, and it is also present when the agent tries to learn how to solve a problem through a function of future rewards from the environment [78].
In [78], as shown in Table 2, Intrinsic Motivation is considered the basis for a new field of ML. While IM can be viewed this way, it can also be viewed as an RL strategy to accomplish better performance in a sparse-reward environment [41, 43, 46, 54].
RL Problem Formulation
Figures 7 and 8 contain a set of basic RL formulas/equations and the pseudocode of tabular solution methods like Policy Iteration, Value Iteration, MC methods, and Temporal Difference (TD) with Q-learning, as well as one example of an approximate solution method, Deep Q-Learning.
γ (gamma) is an important hyperparameter of RL, since it dictates the weight future rewards have in the calculation of the total reward at any given time of the agent's life. γ is part of a structural equation of RL: the Bellman equation. This equation integrates current and future rewards in a recurrent way, so that the agent can choose its actions without being shortsighted by only looking at the immediate reward. This lookahead behaviour is an attempt to create an intelligent agent.
None of the presented equations take Intrinsic Motivation into consideration, but introducing IM in the formulation is an easy step, since IM can be considered just another reward coming from the environment.
The reward formula introduces H, the horizon of the system, which can be infinite or finite. Naturally, the bigger H is, the more complex the RL problem becomes. H is also an important variable allowing the combination of model-free and model-based algorithms, as will be seen in the Literature Review part of this paper.
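For reference, the quantities discussed above can be written in the conventional textbook notation (standard RL notation, not a transcription of Figure 7):

```latex
% Discounted return over a (possibly infinite) horizon H, with discount factor gamma
G_t = \sum_{k=0}^{H} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma \le 1

% Bellman optimality equation for the action-value function Q*
Q^{*}(s,a) = \mathbb{E}\!\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a')
             \,\middle|\, s_t = s,\ a_t = a \right]
```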
Figure 7 - Formulas/Equations and pseudocode [89]
Benefits and challenges of RL
Automation has, in recent decades, been achieved by creating sequential and conditional software instructions executed by agents (physical robots or software agents). This type of automation is viable in very controlled conditions, where there is very little stochasticity and the performed tasks are very specific and not extremely complex. RL aims to achieve more than the type of optimal control obtained in that setting.
Optimal control of an agent performing a number of very different tasks in a stochastic environment is a goal that will allow us to control systems able to perform more complex tasks in more complex environments, with a real chance of acting without human supervision. This is the benefit that RL research aims to achieve.
While many algorithms have already been created, some with better performance than others, the field of RL still lacks an algorithm that can optimally perform any given task in any given environment with any given dynamics.
The main challenges of RL have to do with how an agent can understand, process and apply, in a rational manner, the data it gets, the way humans are able to do. At least three top-level approaches have been studied in the field of RL: FRL, HRL and MRL.
RL Application Domains
Any activity that involves real-time decision making based on data is a candidate for the application of RL. If real-time operation is not needed, fields like Supervised or Unsupervised ML, as shown in Table 2, can be used to help with decision making, but when real-time operation is needed, RL (with or without IM) is a potential answer.
Here are some examples of potential applications of RL, amongst many others:
- Healthcare [86].
5. Literature Review
Year of 1989
FRL - Q-learning - Chris Watkins does not coin the "Q-learning" term in [56], but proves that the one-step Q-learning method can converge to the optimal value function and policy [56, p.112]. Addresses estimation of q* with action-value functions, which are now often called "Q-functions" [34, p.71]. Watkins draws a parallel between animal conditioning / learning and learning algorithms [56, p.3]. Explores learning as a problem of obtaining delayed rewards [56, p.24], in which the agent does not try to maximize immediate rewards [56, p.41]. Defines Q-functions as the expected return from starting at a given state and following a sequence of policies [56, p.46], searching in a look-ahead tree and recognizing the problem of combinatorial explosion at greater tree depths [56, pp.59-60]. Establishes the principles of a Q-table by describing a policy as a single action for each state, stored as a function from states to actions [56, p.102]. This dependence on a Q-table naturally limits the method to finite and discrete state spaces. Q-learning is an off-policy algorithm [33, p.1][39, p.206] and uses the greedy policy [5, p.3]. A limitation of one-step methods is that a reward only directly influences the value of the state-action pair that produced it, while the other pairs are only indirectly influenced, which can make learning slow, because several updates are needed to propagate a reward to the relevant previous actions and states [2, p.2].
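A minimal tabular one-step Q-learning loop, in the spirit of Watkins' formulation, is sketched below; the environment interface (reset/step) is an assumed placeholder.

```python
import random
from collections import defaultdict

def q_learning(env, actions=(0, 1), episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                         # the Q-table: (state, action) -> value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy; the bootstrap target below is greedy (off-policy)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])   # one-step update
            state = next_state
    return Q
```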
Year of 1992
FRL - REINFORCE - describes a set of model-free algorithms with a gradient-based approach compatible with backpropagation [58, pp.23-24]. These algorithms are described as belonging to an associative RL class [58, p.1]. The agent is considered to be a feedforward network with learning units, which can be identified as a Neural Network (NN) with its respective neurons [58, p.6].
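As an illustration of the score-function (likelihood-ratio) update behind REINFORCE, the sketch below updates a linear-softmax policy over discrete actions from one sampled trajectory; it is a simplified stand-in, not the exact network formulation of [58].

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, states, actions, rewards, gamma=0.99, lr=0.01):
    """theta: (n_features, n_actions) parameters of a linear-softmax policy."""
    returns, G = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):        # discounted return-to-go
        G = rewards[t] + gamma * G
        returns[t] = G
    for s, a, G in zip(states, actions, returns):
        probs = softmax(s @ theta)                 # pi(. | s)
        grad_log = np.outer(s, -probs)             # d log pi(a|s) / d theta
        grad_log[:, a] += s
        theta += lr * G * grad_log                 # gradient ascent on expected return
    return theta
```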
Year of 1993
HRL - Feudal Reinforcement Learning - this method explores a way to speed up RL by creating multiple Q-learning layers at different resolutions, thereby dividing RL problems into layers that know how to define tasks and layers that know how to solve them, in a lords / serfs type of relationship [38, p.1]. Each command at an upper layer is associated with a reward that the lower layer tries to maximize, and the authors claim that this method is more efficient than flat Q-learning [38, p.1].
Year of 1994
FRL - Modified Connectionist Q-Learning (MCQ-L) - also known as SARSA, this algorithm proposes a method to address high-dimensional continuous state spaces by using NNs as function approximators together with backpropagation [67, p.1]. It is an on-policy algorithm that fits the action-value function to the current policy, and then refines the policy greedily with respect to those action-values [44, p.1].
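The on-policy character of SARSA is visible in its update rule: the bootstrap target uses the action actually chosen by the current policy in the next state, rather than the greedy maximum used by Q-learning. A one-function sketch, reusing the Q-table convention of the Q-learning example above:

```python
def sarsa_update(Q, state, action, reward, next_state, next_action, alpha=0.1, gamma=0.99):
    # next_action is sampled from the same (e.g. epsilon-greedy) policy being improved
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```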
Year of 1997
HRL - Hierarchical Abstract Machines (HAM) - uses a library of plans describing the decomposition of higher-level activities into lower-level activities [55, p.1]. HAMs are finite state machines/programs and work in a non-deterministic way [55, pp.1-2]. The paper states that with HAMs knowledge can be reused across different problems, requiring only a recombination of component solutions to attack a larger, more complex problem [55, p.1]. Machines are associated with skills: for example, when finding a wall the current machine can call a back-off machine or a follow-wall machine as a policy [55, p.3]. The authors of HAM also propose HAMQ, a combination of HAM and Q-learning [55, p.5].
Year of 1998
HRL - Options Framework - defines the term "options", which includes actions like picking up an object, going to have a meal, or traveling to some place, as well as micro / primitive actions such as muscle contractions and joint movements, as a way to provide temporal abstraction [39, p.181]. All types of options are used in planning a task and can also be changed during the execution of a plan, which, by being changed mid-execution, can be corrected to perform better [39, p.181]. This approach is well suited to stochastic, changing environments [39, p.207]. Options are evaluated during their application to produce more learning [39, p.181]. Besides the options mechanism, this paper also explores the concept of sub-goals [90, p.181]. Options are given, i.e., defined by the programmer [47, p.2].
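One possible way to encode an option as described above is shown below; the field names are our own illustration, since the framework does not prescribe a concrete data structure.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    initiation_set: Set[Any]              # states in which the option may be invoked
    policy: Callable[[Any], Any]          # intra-option policy: state -> (micro-)action
    termination: Callable[[Any], float]   # probability of terminating in a given state

    def can_start(self, state) -> bool:
        return state in self.initiation_set
```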
Year of 2005
FRL - Neural Fitted Q Iteration (NFQ) - an improvement on Fitted Q Iteration algorithms that proposes the use of a Neural Network to train a Q-value function, taking advantage of the ability of NNs to approximate nonlinear functions [27, p.317]. In [27] the learning failures and slow learning of NNs are discussed, and a mechanism of storing previous experiences (a sort of Experience Replay) is proposed to deal with these problems. It uses off-line batch learning to be able to use advanced supervised learning techniques that can converge faster than on-line learning [27, p.319], and also to avoid the destruction of previous learning that the latter approach can produce [27, p.327].
Year of 2008
HRL - MAXQ framework - uses a divide-and-conquer approach, decomposing an MDP into smaller MDPs [40, pp.227-228] and acting recursively [40, p.229]. It depends on the identification of goals and subgoals by the programmer [40, p.227]. It focuses on both state abstraction [40, p.227][40, p.260] and temporal abstraction [40, p.237]. It is an online, model-free algorithm [40, p.227].
Year of 2011
FRL - Neural Fitted Q Iteration with Continuous Actions (NFQCA) - this work's approach consists of batch learning of a Q-function based on experiences of state transitions, with a learning process for Neural Network based controllers comprising continuous action values [29, p.144]. It overcomes the limitation of Q-learning, which can only be used with discrete actions [29, p.147]. It uses a Critic implemented as a neural Q-function and an Actor represented as a neural policy function [29, p.147]. It is similar to the Deterministic Policy Gradient (DPG) algorithm [5, p.2].
FRL - Doubly Robust (DR) - proposes a method to perform optimal control in a partially observed environment [73, p.1]. It focuses on evaluating policies taking context into account, as well as previous actions and rewards [73, p.1]. It tackles the bias and variance problems deriving from models of rewards and models of past policies, respectively [73, p.1]. It defines a doubly robust estimator that uses both the estimate of expected reward and the estimate of action probabilities [73, p.3].
FRL - Probabilistic Inference for Learning Control (PILCO) - proposes a sample-efficient model-based RL algorithm for high-dimensional problems, with a probabilistic dynamics model to reduce model bias (by including uncertainty) and the use of approximate inference to perform policy evaluation [74, p.1]. PILCO can get stuck in local optima [74, p.7].
Year of 2013
FRL - Deep Q Network (DQN) - addresses the problem of processing high-dimensional sensory input with RL [8, p.1]. It uses a convolutional neural network that estimates future rewards by processing raw pixels [19, p.1]. It achieves high performance, and a new level of generality, by being able to perform across a set of different problems without problem-specific features, using the same network architecture, raw input, and parameter values such as the step size, discount rate and exploration parameters [65, p.437]. While it can process high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces, so applying it to continuous action spaces requires discretization (which brings the curse of dimensionality) [5, p.1]. It can be compared with NFQ, except that NFQ uses a batch update with a per-iteration cost proportional to the data set size, whereas DQN uses stochastic gradient updates that have a small cost per iteration and scale well with bigger data sets [8, p.4].
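The core of a DQN training step is the bootstrap target computed on a minibatch sampled from the replay buffer, using a periodically copied target network. The sketch below assumes `q_net` and `target_net` are callables mapping a batch of states to per-action value estimates (NumPy arrays); it is illustrative, not the original implementation.

```python
import numpy as np

def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    next_q = target_net(next_states)                      # shape (batch, n_actions)
    max_next_q = next_q.max(axis=1)                       # greedy value of the next state
    return rewards + gamma * (1.0 - dones) * max_next_q   # y = r + gamma * max_a' Q_target

def dqn_td_errors(q_net, states, actions, targets):
    q_values = q_net(states)                              # shape (batch, n_actions)
    chosen = q_values[np.arange(len(actions)), actions]   # Q(s, a) for the taken actions
    return targets - chosen                               # minimised e.g. with a squared loss
```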
Year of 2014
FRL - Deterministic Policy Gradient (DPG) - proposes an algorithm, for continuous action spaces, based on the expected gradient of the action-value function, which can be estimated more easily than the stochastic policy gradient [26, p.1]. DPG uses off-policy optimization with an actor-critic architecture and, by employing an exploratory behaviour policy, learns deterministic target policies [26, p.1]. It shows that deterministic policy gradient algorithms are better suited for high-dimensional action spaces than their stochastic counterparts [26, p.1].
Year of 2015
FRL - Generalized Advantage Estimation (GAE) - advances high-dimensional continuous control with RL [17, p.2] by using state-value functions, which have a lower-dimensional input than state-action value functions and are therefore easier to learn than the latter [17, p.12].
https://github.com/higgsfield/RL-Adventure-2/blob/master/2.gae.ipynb
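GAE is commonly implemented as an exponentially weighted sum of TD residuals, controlled by the discount gamma and a second parameter lambda that trades bias for variance; a short sketch follows, where `values` holds V(s_0..s_T) including the bootstrap value of the final state.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages, gae = np.zeros(T), 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]   # TD residual
        gae = delta + gamma * lam * not_done * gae                          # recursive accumulation
        advantages[t] = gae
    return advantages
```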
FRL - Trust Region Policy Optimization (TRPO) - provides a method to optimize large nonlinear policies like NNs [4, p.1]. The method is similar to policy gradient methods [4, p.1]. It performs well in physical control tasks like walking, swimming, etc. [4, p.1][4, p.6]. TRPO does not need to learn an action-value function [5, p.8]. It establishes bounds that limit updates of the policy parameters, preventing the new policy from diverging too much from the existing policy [5, p.8], making training easier and avoiding the inability of gradient descent to make progress in more complex tasks [4, p.7].
FRL - Deep Deterministic Policy Gradient (DDPG) - extends DQN ideas and can operate in continuous action spaces, which DQN cannot, making DDPG more suitable for physical control tasks [5, p.1]; it also performs better with fewer experience steps (though still many) than DQN [5, pp.8-9]. It is an actor-critic algorithm, able to generalize well and operate without a model of the environment [5, p.1] across a diversity of domains [5, p.8]. DDPG combines the actor-critic structure with NN function approximation, following the strategies (i.e. off-policy training with a replay buffer, and a separate target network) implemented in DQN to make those NNs more stable; it also takes advantage of the Batch Normalization technique [5, pp.2-3] (see Glossary for Batch Normalization). DDPG is able to learn good policies from the pixels of a camera image and/or from joint angles or other low-dimensional inputs, and to operate in a stochastic environment [5, p.2]. It can learn in big state and action spaces, and uses mini-batches to be able to operate with a large network [5, p.3]. DDPG is not sample efficient [16, p.1].
https://github.com/higgsfield/RL-Adventure-2/blob/master/5.ddpg.ipynb
FRL - Stochastic Value Gradient (SVG) - defines a framework for continuous control policies that use backpropagation [65, p.1]. SVG proposes the use of a deterministic function with external noise so that stochasticity can be integrated into the Bellman equation [65, p.1]. Within the SVG framework several algorithms are presented, named SVG(∞), SVG(0) and SVG(1), each with a different Bellman equation recursion depth [65, p.3].
Year of 2016
HRL - Option-Critic Architecture - proposes an option-critic architecture that can learn the internal policies and termination conditions of options [42, p.1726] from [90]. It presents an alternative approach that merges the problem of discovering options with the problem of learning options, works with linear and nonlinear function approximators, and favours transfer learning [42, p.1726]. The Critic is implemented with a NN [42, p.1731]. Options (or skills [45, p.9]) can be learnt without any specification of subgoals or pseudo-rewards [42, p.1731].
HRL - STRategic Attentive Writer (STRAW) - this model can be used either in RL or in Natural Language Processing (NLP) [49, p.1] because it has a general sequence prediction architecture [49, p.6]. It proposes a deep RNN setup that can learn macro-actions and output a multi-step action plan, updating the plan with new observations once in a while and using rewards as the driver of the learning process [49, p.1]. The model may suffer from an inability to react to stochastic environments while it is executing a plan, but this focus on execution allows it to control computational costs [49, p.2]. It uses differentiable attentive reading [49, p.3] and A3C [49, p.5].
HRL - Hierarchical Deep Reinforcement Learning Network (H-DRLN) - an architecture in which the controller can choose either a skill or a primitive action (in this case it will execute for only one timestep) [50, p.1556]. H-DRLN proposes a mechanism called skill distillation (a variation of policy distillation) that allows knowledge to be retained efficiently and the system to scale to lifelong scenarios [50, p.1553].
FRL - Actor Critic with Experience Replay (ACER) - proposes an actor-critic architecture using DRL and Experience Replay for continuous and discrete control problems [19, pp.1-2]. It introduces the following mechanisms: stochastic dueling network architectures and truncated importance sampling with bias correction [19, p.1]. It also proposes a new TRPO method [19, p.1]. ACER can be seen as an off-policy counterpart of A3C [19, p.2]. The new TRPO method works by maintaining a running average of previous policies, used as a reference to prevent the updated policies from drifting too far away from it [19, p.5]. Humanoid agents, having a higher-dimensional action space, benefit significantly from truncation and bias correction [19, p.10]. ACER uses the Retrace algorithm [19, p.3].
FRL - Bootstrapped-DQN - proposes a method to learn faster while doing deep exploration with deep NNs in a complex environment [23, p.1], with a low computational cost [23, p.2], which allows scalability [23, p.8]. It applies the statistical resampling principle of bootstrapping (see Glossary) [23, p.2]. It implements a network with several heads (Neural Networks) that take input from a shared NN performing feature extraction from a frame [23, p.2].
FRL - Asynchronous Advantage Actor Critic (A3C) - proposes a method that implements asynchronous gradient descent in a parallel actor-learner setup [2, p.1]. The method performs well in several types of optimal control problems, like motor control or navigating from visual input [2, p.1]. It presents an alternative to Experience Replay (because of its limitations with on-policy learning) by using parallelism/multithreading, with multiple agents learning on several instances of the environment [2, p.1] and the possibility of using different exploration policies [2, p.3]. It can run efficiently on far less powerful hardware than other methods while performing at the same level or better [2, p.1][2, p.7]. Multithreading also showed benefits for older methods like one-step Q-learning, one-step Sarsa and n-step Q-learning [2, p.6].
https://github.com/awjuliani/DeepRL-Agents/blob/master/A3C-Doom.ipynb
FRL - Policy Gradient and Q-Learning (PGQL) - proposes a model-free algorithm that combines off-policy Q-learning on a replay buffer with on-policy gradient optimization [44, p.1].
DR2 can be used in high-dimensional state and action spaces [72, p.1].
Year of 2017
HRL - FeUdal Networks (FUNs) - proposes a method that works at two levels with different time resolutions: the manager, which selects goals and is motivated by an extrinsic reward, and the worker [41, p.1]. The worker produces primitive actions at each time step and is motivated by an intrinsic reward [41, p.1]. FUNs allow extremely long timescale credit assignment (which is good for sparse-reward problems) and memorisation [41, p.1]. The method uses a new type of RNN called a dilated LSTM, which enables backpropagation over hundreds of steps [41, p.1]. The Manager's goals are trained with an approximate transition policy gradient [41, p.2]. FUNs use A3C for optimisation [41, p.5]. The method is also good at transfer and multitask learning [41, p.8].
HRL - Abstract Markov Decision Processes (AMDP) - proposes a model-based [51, p.2] method for fast planning in large state-action spaces where the number of objects induces combinatorial growth [51, p.1] and the environment is stochastic [51, p.2]. It divides goals into subgoals that it solves recursively, and selects only the information of the state space that is needed for each decision [51, p.1]. It uses graph nodes to represent primitive actions or sub-problems [51, p.2].
FRL - 51-Atom Agent (C51) - explores the value distribution in approximate distributional RL, instead of using the expectation of state values as expectation-based RL does [9, p.1]. The C51 algorithm provides a more meaningful state-value analysis because, instead of calculating only one value per state, it embraces the notion that stochasticity can produce very disparate state values, and therefore a distributional approach makes more sense [9].
https://github.com/flyyufelix/C51-DDQN-Keras
FRL - EXpert ITeration (EXIT) - this algorithm is motivated by the dual-process theory of human thought [30, p.9] and uses tree search to assist NN training [30, p.1]. The method implements apprentices (which generalize policies with a NN) and experts (which explore with tree search), iterating through training in an Imitation Learning style with self-play [30, pp.2-3].
FRL - AlphaZero - proposes a general-purpose algorithm that starts without any knowledge of the environment, except for the available actions (rules), and improves through self-play [14, pp.1-2]. It uses a NN to calculate action values and MCTS to simulate self-play [14, pp.2-3].
FRL - Actor Critic using Kronecker-factored Trust Region (ACKTR) - as the name says, this algorithm uses an actor-critic architecture [20, p.1]. It can be used with continuous and discrete control policies, and employs deep NNs plus a variation of TRPO with a new technique called Kronecker-Factored Approximate Curvature (K-FAC) [20, pp.1-2].
https://github.com/openai/baselines/tree/master/baselines/acktr
FRL - Imagination-Augmented Agents (I2A) - combines model-free and model-based aspects by using imagined rollouts from a learned environment model, leading to an agent acting better in a complex world (i.e. better generalization) [11, pp.1-2]. It states that reward prediction is helpful but can be discarded while still obtaining good performance [11, p.6].
https://github.com/higgsfield/Imagination-Augmented-Agents
FRL - Proximal Policy Optimization (PPO) - proposes an algorithm that is as efficient and reliable as TRPO but simpler to implement [3, p.1], requiring only minor changes to a vanilla policy gradient implementation [3, p.8]. PPO can be used on continuous high-dimensional control problems [3, p.7]. PPO introduces a new family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent [3, p.1]. Normal policy gradient methods execute one gradient update per data sample, whereas PPO uses an objective function that allows multiple epochs of minibatch updates [3, p.1].
https://github.com/higgsfield/RL-Adventure-2/blob/master/3.ppo.ipynb
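The heart of PPO is the clipped surrogate objective; a NumPy sketch over a batch of samples is given below, where the log-probabilities come from the updated and the data-collecting policies and the advantages are, for example, GAE estimates.

```python
import numpy as np

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    ratio = np.exp(log_probs_new - log_probs_old)             # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))           # negated: minimise this loss
```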
FRL - Soft Q-Learning (SQL) - proposes an algorithm for continuous action and state spaces, with improved compositionality and exploration, allowing transfer learning from task to task [57, p.1]. It uses an energy-based model represented by an energy function or by a neural network as an energy function approximator [57, p.3]. https://github.com/haarnoja/softqlearning
FRL - Value Prediction Network (VPN) - proposes a method that combines the model-free and model-based approaches in the same neural network [66, p.1]. It explores option-conditional predictions of future reward values instead of predicting future observations or sensory data [66, p.1]. VPN performs well in stochastic environments when compared to purely model-free or model-based algorithms [66, p.1]. The new neural network architecture combines a dynamics model over state abstractions (the model-based part) with a mapping from these abstractions to rewards (the model-free part) [66, p.1]. VPN can be combined with other RL algorithms [66, p.4].
MRL - Model Agnostic Meta Learning (MAML) - proposes a model-agnostic method, which means it can be used in a variety of learning problems [68, p.1]. MAML seeks good generalization from a small amount of training data on a new task, and a trained model that is easy to fine-tune [68, p.1]. https://github.com/cbfinn/maml /
https://github.com/cbfinn/maml_rl
HRL - Fine Grained Action Repetition (FiGAR) - proposes a framework that allows an agent to choose the timeframe and repetition of an action [64, p.1]. It can be applied in combination with existing RL algorithms like DDPG, TRPO or A3C and improves policy optimization by working with macro-actions at different time frames [64, p.10]. FiGAR is not prepared to interrupt a macro-action, which makes it less efficient in a stochastic environment [64, p.10].
Year of 2018
FRL - Model-based Value Expansion (MVE) - a technique that takes advantage of an imagination mechanism, like the I2A algorithm, and of learned dynamics models [13, p.1], but restricts the horizon length, which ultimately improves the learning sample complexity [13, p.8]. MVE tries to combine the best of the model-free and model-based approaches [13, p.1], as MBMF does. MVE estimates the short-term horizon with a learned dynamics model (model-based) and the long-term horizon with Q-learning (model-free) [13, p.1].
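The idea of expanding value estimates over a short model-based horizon can be sketched as follows; `model.step`, `policy` and `q_value` are assumed placeholder callables, and the horizon length is the restriction mentioned above.

```python
def mve_style_target(model, policy, q_value, state, horizon=3, gamma=0.99):
    total, discount = 0.0, 1.0
    for _ in range(horizon):                        # model-based part: imagined rollout
        action = policy(state)
        state, reward = model.step(state, action)   # learned dynamics model prediction
        total += discount * reward
        discount *= gamma
    action = policy(state)
    return total + discount * q_value(state, action)   # model-free tail estimate (Q-learning)
```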
FRL - World Models - this method creates a virtual environment and dynamics model to train the RL agent [1, p.2]. It does so from high-dimensional image data [1, p.7], sometimes capturing details that are not important while missing others that are relevant [1, p.8]. The agent also needs to explore the real world so that the model can be refined [1, p.8]. This approach differs from many other models that train the agent by having it interact with the real environment [1, p.1]. In order to prevent the agent from exploiting the virtual model's flaws (by generating an adversarial policy [1, p.6]), training occurs in a noise-added version of that same virtual environment [1, p.2]. Training in the virtual environment may, in some cases, lead to inadequate policies that fail in the real environment [1, p.6]. By having the virtual environment, the model is able to plan ahead, since it has access to a probability distribution of future events [1, p.4].
https://learningtopredict.github.io/ https://worldmodels.github.io
FRL - Twin Delayed DDPG (TD3) - proposes an algorithm (extending DDPG [6, p.6]) with an actor-critic architecture [6, p.2] that uses two critic NNs [6, p.6]. It employs target networks to limit errors arising from function approximation and stochastic optimization [6, p.8]. By using two critics and taking the minimum of their values, it avoids an overestimation problem [6, p.1]. It aims at solving problems affecting continuous-control actor-critic settings, like overestimation bias and accumulation of error [6, p.1]. It is assumed that even an overestimated value may be used as an upper bound for the true value [6, p.1].
https://github.com/higgsfield/RL-Adventure-2/blob/master/6.td3.ipynb
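The clipped double-Q target used by TD3 can be sketched as below: the minimum over the two target critics counters overestimation, and clipped noise on the target action provides target-policy smoothing. All networks are assumed placeholder callables; the noise constants follow common defaults, not necessarily those of [6].

```python
import numpy as np

def td3_target(q1_target, q2_target, actor_target, reward, next_state, done,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    noise = np.clip(np.random.normal(0.0, noise_std), -noise_clip, noise_clip)
    next_action = actor_target(next_state) + noise                    # smoothed target action
    min_q = min(q1_target(next_state, next_action),
                q2_target(next_state, next_action))                   # pessimistic estimate
    return reward + gamma * (1.0 - done) * min_q
```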
FRL - Soft Actor-Critic (SAC) - proposes a sample-efficient, stable, model-free, off-policy algorithm with an actor-critic architecture and deep NNs [7, p.1]. It explores the concept of maximum entropy to create robust policies in high-dimensional and continuous spaces [7, pp.1-2]. SAC performs at the same level as DDPG, PPO and TD3 in simple tasks, but performs better in more complex ones [7, p.7]. SAC learns stochastic policies that are converted into deterministic policies in the end (for performance reasons) [7, p.7]. https://github.com/haarnoja/sac
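The maximum-entropy objective underlying SAC augments the usual return with the entropy of the policy's action distribution, weighted by a temperature α (standard formulation, shown here for reference):

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \Big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```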
HRL - Modulated Policy Hierarchies (MPH) - proposes a method to solve problems with sparse rewards, applying intrinsic motivation and modulation signals communicated through bit-vectors [48, pp.1-2]. Instead of selecting among the available skills in an exclusive way, MPH uses bit vectors to combine the available skills (implemented as NNs) [48, pp.1-2]. Each level of the hierarchy is trained separately using PPO in order to avoid non-stationarity problems [48, p.1]. The lower level (worker) combines the modulation signals with the environment state to produce a final action [48, p.4]. MPH implements time scales (temporal abstraction) by activating higher-level policies less frequently [48, p.4]. As intrinsic motivation, the method uses a curiosity-guided exploration bonus, which accelerates learning in a sparse-reward environment [48, p.4].
is less frequently used [59, p.1]. This method can be used in discrete and continuous settings [59,
p.5].
FRL - Probabilistic Ensembles with Trajectory Sampling (PETS) - this algorithm explores uncertainty-aware deep neural network dynamics models to tackle the problem of poor asymptotic performance in model-based architectures [60, pp.1-2]; other algorithms use Bayesian nonparametric models instead [60, p.9]. It applies several model-free techniques to a model-based setting (i.e. ensembling, outputting Gaussian distribution parameters, MPC) [60, p.9]. PETS does not use policy learning [60, p.9]. https://github.com/kchua/handful-of-trials
FRL - STochastic Ensemble Value Expansion (STEVE) - proposes an algorithm that mixes model-based and model-free approaches in order to mitigate the bias errors of the model-based approach and the sample inefficiency of model-free methods [61, p.1]. The environment model is only used occasionally, with rollouts of different horizon sizes, to avoid introducing too much of the error inherent to model-based methods [61, p.1]. STEVE is an extension of MVE [61, p.2]. The algorithm uses uncertainty-awareness mechanisms [61].
MRL - Proximal Meta-Policy Search (ProMP) - proposes a meta-RL algorithm with improved credit assignment and improved meta-policy gradient estimation [71, p.1]. ProMP achieves effective identification of the tasks to be learned by optimizing the pre-update sampling distribution [71, p.9].
HRL - Hierarchical Self-Play (HSP) - this method tackles complex tasks with sparse rewards [53, p.8]. It uses unsupervised asymmetric self-play (to explore the decomposition of tasks) and a continuous sub-goal vector [53, pp.1-2]. The agent sets its own goals, forced by adversarial rewards (according to what the environment allows), and then tries to achieve these goals within a time limit [53, p.1]. The low-level policies have access to the current state and to the goal vector (an encoded target state) [53, pp.1-2]. The higher-level policies are trained in a sparse-reward logic [53, p.2]. HSP defines two levels of policies implemented as NNs that are first prepared by exploring and creating skills (by having two lower-level actors playing against each other) and later trained (at the higher level) using external rewards [53, p.2]. HSP breaks episodes into smaller segments in order to avoid problems arising from higher complexity [53, p.4]. It has limitations in the self-play task, because some knowledge of the domain needs to be embedded in order to recognize whether the task has been completed successfully [53, p.8].
Year of 2019
FRL - Stochastic Lower Bounds Optimization (SLBO) - proposes a model-based algorithm that can be used in continuous control settings [62, p.10]. SLBO learns the models using a multi-step prediction loss [62, p.8]. It makes use of TRPO for policy optimization, introducing an entropy term into the objective function [62, p.9]. SLBO is an instantiation of a framework aimed at giving theoretical guarantees when designing and analyzing model-based algorithms [62, p.1].
FRL - Model-Based Policy Optimization (MBPO) - proposes using short model-generated rollouts branched from real data, as a better solution than building one big model of the environment [32, p.1]. MBPO combines the generalization strength of model-free algorithms with the learning speed of model-based ones [32, p.2]. This approach allows it to be used in high-dimensional problems [32, p.2]. MBPO also works well with long-horizon tasks [32, p.9].
FRL - Bootstrap Dual Policy Iteration (BDPI) - a model-free, very sample-efficient algorithm for continuous state spaces and discrete action spaces [22, p.1]. BDPI uses a value-based approach with an actor-critic architecture having several critics (implemented with a flavour of DQN) operating off-policy [22, p.1]. The actor is trained using all the critics [22, p.2]. With BDPI, hyperparameters do not need a lot of tuning, since the algorithm is very robust to hyperparametrization [22, p.2]. BDPI uses an experience buffer [22, p.3].
https://github.com/vub-ai-lab/bdpi
FRL - Deterministic Value Gradients (DVG) - proposes a method for infinite-horizon problems [16, p.3316]. DVG is a model-based algorithm that uses an actor-critic architecture [16, p.3320].
FRL - Deep Soft Policy Gradient (DSPG) - proposes a maximum-entropy, model-free RL algorithm with an actor-critic architecture and off-policy optimization [25, p.3425]. DSPG combines policy-based and value-based methods [25, pp.3425-3426]. DSPG can be used in continuous control problems [25, p.3430].
HRL - Hierarchical Actor-Critic (HAC) - proposes an HRL method (based on Hindsight Action Transitions) that solves the non-stationarity problem arising when more than one level of policies is learnt at the same time; it does this by training each layer as if the lower layer were already stable and optimal, using a simulated transition function [36, pp.1-3]. HAC can be used in continuous state and action spaces with sparse rewards, and has successfully implemented an architecture with more than two layers learning policies at different levels in parallel [36, pp.1-2]. HAC can also be used in discrete problems [36, p.10]. Each layer has access to the external current state, but its goals are provided by the upper layer as sub-goals [36, p.2].
HRL - Model-Free HRL framework - proposes a model-free HRL algorithm that focuses on unsupervised subgoal discovery, skill learning and the use of intrinsic motivation [54].
Year of 2020
FRL - Meta Q-Learning (MQL) - proposes an off-policy algorithm that takes context into account when meta-training policies [31, p.1]. Context serves as a meta-training technique [31, p.9]. MQL handles the hyper-parameter sensitivity problem of off-policy methods by adapting to the distribution shift [31, p.9]. MQL reuses data from the replay buffer [31, p.10].
FRL - Advantage-weighted Behavior Model (ABM) - addresses learning from batches of off-policy data, which can produce worse results than on-policy processing [28, p.1]. ABM explores the policies in a batch to create a weighted model, used as a prior, that can correct the policy being learned, and in this way combines off-policy and on-policy strengths [28, p.2].
Table 3 - Flat Reinforcement Learning (FRL), Hierarchical Reinforcement Learning (HRL) and Meta Reinforcement Learning (MRL) papers in this study
Table 3 shows how much work has been done in each major branch of RL. FRL stands out as the approach with the most research, while MRL has the least. Since FRL is math-intensive, this shows that RL research has been more focused on the math-driven approach than on the logic-driven approach. This does not mean that FRL has no logical side to it, but math may be over-valued, and researchers may need to step off the FRL track and research more on HRL, MRL or other approaches yet to come.
Figure 9 - Word cloud of the bibliography
Figure 9 illustrates, as a word cloud, the key concepts found in the examined bibliography. "Learning", "Policy", "Value", "State", "Model", "Action", "Agent", "Data", "Gradient", "Time", "Reward", "Deep" and "Neural" are the most prominent words in this cloud and define the most important concepts of RL. The concepts of "Gradient", "Deep" and "Neural", while not core RL concepts, are, as seen in the cloud, very important to RL research, because gradients have been one of the most important optimization methods in RL, and "Deep" / "Neural" show how important Neural Networks have been to the advancement of RL.
6. Conclusion
The findings reveal several points worth noting:
● Off-policy methods are more efficient at learning than on-policy ones, but off-policy methods tend to be less effective than on-policy methods, because they do not integrate the knowledge related to the policy currently in use [28, p.1];
● HRL allows us to address more complex tasks than FRL, with fewer learning episodes, but HRL algorithms are more complex due to working with multiple layers [36, p.1];
● High-dimensional continuous state / action spaces are much more difficult to address than their discrete counterparts;
● Model-free algorithms are able to generalize better than model-based ones, but are slower at learning policies [32, p.1]. Model-based algorithms also suffer from possible model bias [16, p.3317]. Model-free algorithms are very sensitive to hyperparameters and are also very sample inefficient [25, p.3425]. Generally, model-based methods achieve worse asymptotic performance [60, p.9]. Model-based algorithms work well in environments whose dynamics are easy to model, but in complex and noisy environments the learnt environment model will prove less performant [61, p.1];
Bibliography
[1] @incollection{NIPS2018_7512,
title = {Recurrent World Models Facilitate Policy Evolution},
author = {Ha, David and Schmidhuber, J\"{u}rgen},
booktitle = {Advances in Neural Information Processing Systems 31},
editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
pages = {2450--2462}, year = {2018}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution.pdf}}
[2] @InProceedings{pmlr-v48-mniha16,
title = {Asynchronous Methods for Deep Reinforcement Learning},
author = {Volodymyr Mnih and Adria Puigdomenech Badia and Mehdi Mirza and Alex Graves and Timothy Lillicrap
and Tim Harley and David Silver and Koray Kavukcuoglu},
booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
pages = {1928--1937}, year = {2016}, editor = {Maria Florina Balcan and Kilian Q. Weinberger},
volume = {48}, series = {Proceedings of Machine Learning Research}, address = {New York, New York, USA},
month = {20--22 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v48/mniha16.pdf},
url = {http://proceedings.mlr.press/v48/mniha16.html}, Eprint = {arXiv:1602.01783}}
[3] @misc{1707.06347,
Author = {John Schulman and Filip Wolski and Prafulla Dhariwal and Alec Radford and Oleg Klimov},
Title = {Proximal Policy Optimization Algorithms}, Year = {2017}, Eprint = {arXiv:1707.06347},}
[4] @InProceedings{pmlr-v37-schulman15,
title = {Trust Region Policy Optimization},
author = {John Schulman and Sergey Levine and Pieter Abbeel and Michael Jordan and Philipp Moritz},
booktitle = {Proceedings of the 32nd International Conference on Machine Learning},
pages = {1889--1897}, year = {2015}, editor = {Francis Bach and David Blei}, volume = {37},
series = {Proceedings of Machine Learning Research}, address = {Lille, France}, month = {07--09 Jul},
publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v37/schulman15.pdf},
url = {http://proceedings.mlr.press/v37/schulman15.html}, Eprint = {arXiv:1502.05477}}
[5] @article{Lillicrap2015ContinuousCW,
title={Continuous control with deep reinforcement learning},
author={Timothy P. Lillicrap and Jonathan J. Hunt and Alexander Pritzel and Nicolas Manfred Otto Heess and Tom
Erez and Yuval Tassa and David Silver and Daan Wierstra}, journal={CoRR}, year={2015},
volume={abs/1509.02971}, Eprint = {arXiv:1509.02971}}
[6] @InProceedings{pmlr-v80-fujimoto18a,
title = {Addressing Function Approximation Error in Actor-Critic Methods},
author = {Fujimoto, Scott and van Hoof, Herke and Meger, David},
booktitle = {Proceedings of the 35th International Conference on Machine Learning},
pages = {1587--1596}, year = {2018}, editor = {Dy, Jennifer and Krause, Andreas}, volume = {80},
series = {Proceedings of Machine Learning Research}, address = {Stockholmsmässan, Stockholm Sweden},
month = {10--15 Jul}, publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v80/fujimoto18a/fujimoto18a.pdf},
url = {http://proceedings.mlr.press/v80/fujimoto18a.html}, Eprint = {arXiv:1802.09477}}
[7] @InProceedings{pmlr-v80-haarnoja18b,
title = {Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor},
author = {Haarnoja, Tuomas and Zhou, Aurick and Abbeel, Pieter and Levine, Sergey},
booktitle = {Proceedings of the 35th International Conference on Machine Learning},pages = {1861--1870},
year = {2018}, editor = {Dy, Jennifer and Krause, Andreas}, volume = {80},
series = {Proceedings of Machine Learning Research}, address = {Stockholmsmässan, Stockholm Sweden},
month = {10--15 Jul}, publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v80/haarnoja18b/haarnoja18b.pdf},
url = {http://proceedings.mlr.press/v80/haarnoja18b.html}, Eprint = {arXiv:1801.01290}}
[8] @incollection{mnih-atari-2013,
title = {Playing Atari With Deep Reinforcement Learning},
author = {Volodymyr Mnih and Koray Kavukcuoglu and David Silver and Alex Graves and Ioannis Antonoglou and
Daan Wierstra and Martin Riedmiller},
booktitle = {NIPS Deep Learning Workshop},
year = {2013},
url = {https://arxiv.org/pdf/1312.5602.pdf}, }
[9] @InProceedings{pmlr-v70-bellemare17a,
title = {A Distributional Perspective on Reinforcement Learning},
author = {Marc G. Bellemare and Will Dabney and R{\'e}mi Munos},
booktitle = {Proceedings of the 34th International Conference on Machine Learning},
pages = {449--458}, year = {2017}, editor = {Doina Precup and Yee Whye Teh}, volume = {70},
series = {Proceedings of Machine Learning Research},
address = {International Convention Centre, Sydney, Australia},
month = {06--11 Aug}, publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v70/bellemare17a/bellemare17a.pdf},
url = {http://proceedings.mlr.press/v70/bellemare17a.html}}
[10] @inproceedings{DBLP:conf/aaai/DabneyRBM18,
author = {Will Dabney and Mark Rowland and Marc G. Bellemare and R{\'{e}}mi Munos},
editor = {Sheila A. McIlraith and Kilian Q. Weinberger},
title = {Distributional Reinforcement Learning With Quantile Regression},
booktitle = {Proceedings of the Thirty-Second {AAAI} Conference on Artificial Intelligence,
(AAAI-18), the 30th innovative Applications of Artificial Intelligence
(IAAI-18), and the 8th {AAAI} Symposium on Educational Advances in
Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February
2-7, 2018},
pages = {2892--2901}, publisher = {{AAAI} Press}, year = {2018},
url = {https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17184},
timestamp = {Tue, 23 Oct 2018 06:42:15 +0200}, biburl = {https://dblp.org/rec/conf/aaai/DabneyRBM18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[11] @incollection{NIPS2017_7152,
title = {Imagination-Augmented Agents for Deep Reinforcement Learning},
author = {Racani\`{e}re, S\'{e}bastien and Weber, Theophane and Reichert, David and Buesing, Lars and Guez,
Arthur and Jimenez Rezende, Danilo and Puigdom\`{e}nech Badia, Adri\`{a} and Vinyals, Oriol and Heess, Nicolas and Li, Yujia
and Pascanu, Razvan and Battaglia, Peter and Hassabis, Demis and Silver, David and Wierstra, Daan},
booktitle = {Advances in Neural Information Processing Systems 30},
editor = {I. Guyon and U. V. Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan
and R. Garnett},
pages = {5690--5701}, year = {2017}, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/
7152-imagination-augmented-agents-for-deep-reinforcement-learning.pdf}}
[12] @inproceedings{DBLP:conf/icra/NagabandiKFL18,
author = {Anusha Nagabandi and Gregory Kahn and Ronald S. Fearing and Sergey Levine},
title = {Neural Network Dynamics for Model-Based Deep Reinforcement Learning
with Model-Free Fine-Tuning},
booktitle = {2018 {IEEE} International Conference on Robotics and
Automation, {ICRA} 2018, Brisbane, Australia, May 21-25, 2018},
pages = {7559--7566}, publisher = {{IEEE}}, year = {2018},
url = {https://doi.org/10.1109/ICRA.2018.8463189}, doi = {10.1109/ICRA.2018.8463189},
timestamp = {Wed, 16 Oct 2019 14:14:51 +0200}, biburl = {https://dblp.org/rec/conf/icra/NagabandiKFL18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}, Eprint = {arXiv:1708.02596}}
[13] @article{DBLP:journals/corr/abs-1803-00101,
author = {Vladimir Feinberg and Alvin Wan and Ion Stoica and Michael I. Jordan and Joseph E. Gonzalez
and Sergey Levine},
title = {Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning},
journal = {CoRR}, volume = {abs/1803.00101}, year = {2018}, url = {http://arxiv.org/abs/1803.00101},
archivePrefix = {arXiv}, eprint = {1803.00101}, timestamp = {Mon, 13 Aug 2018 16:47:50 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1803-00101.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[14] @article{DBLP:journals/corr/abs-1712-01815,
author = {David Silver and Thomas Hubert and Julian Schrittwieser and Ioannis Antonoglou and
Matthew Lai and Arthur Guez and Marc Lanctot and Laurent Sifre and Dharshan Kumaran and
Thore Graepel and Timothy P. Lillicrap and Karen Simonyan and Demis Hassabis},
title = {Mastering Chess and Shogi by Self-Play with a General Reinforcement
Learning Algorithm},
journal = {CoRR}, volume = {abs/1712.01815}, year = {2017}, url = {http://arxiv.org/abs/1712.01815},
archivePrefix = {arXiv}, eprint = {1712.01815}, timestamp = {Mon, 13 Aug 2018 16:46:01 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1712-01815.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}, eprint = {arXiv:1712.01815}}
[15] @book{Goodfellow-et-al-2016,
title={Deep Learning}, author={Ian Goodfellow and Yoshua Bengio and Aaron Courville},
publisher={MIT Press}, note={\url{http://www.deeplearningbook.org}}, year={2016}}
[16] @inproceedings{DBLP:conf/aaai/CaiPT20,
author = {Qingpeng Cai and Ling Pan and Pingzhong Tang},
title = {Deterministic Value-Policy Gradients},
booktitle = {The Thirty-Fourth {AAAI} Conference on Artificial Intelligence, {AAAI}
2020, The Thirty-Second Innovative Applications of Artificial Intelligence
Conference, {IAAI} 2020, The Tenth {AAAI} Symposium on Educational
Advances in Artificial Intelligence, {EAAI} 2020, New York, NY, USA,
February 7-12, 2020},
pages = {3316--3323}, publisher = {{AAAI} Press}, year = {2020},
url = {https://aaai.org/ojs/index.php/AAAI/article/view/5732}, timestamp = {Thu, 04 Jun 2020 16:49:55 +0200},
biburl = {https://dblp.org/rec/conf/aaai/CaiPT20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}, eprint = {arXiv:1909.03939}}
[17] @inproceedings{DBLP:journals/corr/SchulmanMLJA15,
author = {John Schulman and Philipp Moritz and Sergey Levine and Michael I. Jordan and Pieter Abbeel},
editor = {Yoshua Bengio and Yann LeCun},
title = {High-Dimensional Continuous Control Using Generalized Advantage Estimation},
booktitle = {4th International Conference on Learning Representations, {ICLR} 2016,
San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings},
year = {2016}, url = {http://arxiv.org/abs/1506.02438}, timestamp = {Thu, 25 Jul 2019 14:25:38 +0200},
biburl = {https://dblp.org/rec/journals/corr/SchulmanMLJA15.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[18] @inproceedings{DBLP:conf/nips/HoE16,
author = {Jonathan Ho and Stefano Ermon},
editor = {Daniel D. Lee and Masashi Sugiyama and Ulrike von Luxburg and Isabelle Guyon and Roman Garnett},
title = {Generative Adversarial Imitation Learning},
booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information
Processing Systems 2016, December 5-10, 2016,
Barcelona, Spain},
pages = {4565--4573}, year = {2016},
url = {http://papers.nips.cc/paper/6391-generative-adversarial-imitation-learning}}
[19] @inproceedings{DBLP:conf/iclr/0001BHMMKF17,
author = {Ziyu Wang and Victor Bapst and Nicolas Heess and Volodymyr Mnih and R{\'{e}}mi Munos and
Koray Kavukcuoglu and Nando de Freitas},
title = {Sample Efficient Actor-Critic with Experience Replay},
booktitle = {5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2017}, eprint = {arXiv:1611.01224}}
[20] @incollection{NIPS2017_7112,
title = {Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation},
author = {Wu, Yuhuai and Mansimov, Elman and Grosse, Roger B and Liao, Shun and Ba, Jimmy},
booktitle = {Advances in Neural Information Processing Systems 30},
editor = {I. Guyon and U. V. Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan
and R. Garnett},
pages = {5279--5288}, year = {2017}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7112-scalable-trust-region-method-for-deep-reinforcement-learning-using-kronecker-factored-approximation.pdf}}
[21] @article{Kaplan2019,
doi = {10.1016/j.bushor.2018.08.004}, url = {https://doi.org/10.1016/j.bushor.2018.08.004}, year = {2019},
month = jan, publisher = {Elsevier {BV}}, volume = {62}, number = {1}, pages = {15--25},
author = {Andreas Kaplan and Michael Haenlein}, journal = {Business Horizons},
title = {Siri, Siri, in my hand: Who's the fairest in the land? On the interpretations, illustrations, and
implications of artificial intelligence}}
[22] @inproceedings{DBLP:conf/pkdd/SteckelmacherPR19,
author = {Denis Steckelmacher and H{\'{e}}l{\`{e}}ne Plisnier and Diederik M. Roijers and Ann Now{\'{e}}},
editor = {Ulf Brefeld and {\'{E}}lisa Fromont and Andreas Hotho and Arno J. Knobbe and
Marloes H. Maathuis and C{\'{e}}line Robardet},
title = {Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics},
booktitle = {Machine Learning and Knowledge Discovery in Databases - European Conference, {ECML}
{PKDD} 2019, W{\"{u}}rzburg, Germany, September 16-20, 2019, Proceedings, Part {III}},
series = {Lecture Notes in Computer Science}, volume = {11908}, pages= {19--34}, publisher = {Springer},
year = {2019}, url = {https://doi.org/10.1007/978-3-030-46133-1\_2}, doi = {10.1007/978-3-030-46133-1\_2},
timestamp = {Mon, 04 May 2020 14:19:13 +0200}, biburl = {https://dblp.org/rec/conf/pkdd/SteckelmacherPR19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org},
eprint = {arXiv:1903.04193}}
[23] @incollection{NIPS2016_6501,
title = {Deep Exploration via Bootstrapped DQN},
author = {Osband, Ian and Blundell, Charles and Pritzel, Alexander and Van Roy, Benjamin},
booktitle = {Advances in Neural Information Processing Systems 29},
editor = {D. D. Lee and M. Sugiyama and U. V. Luxburg and I. Guyon and R. Garnett},
pages = {4026--4034}, year = {2016}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn.pdf}}
[24] @misc{1903.09366,
author = {Heecheol Kim and Masanori Yamada and Kosuke Miyoshi and Hiroshi Yamakawa},
title = {Macro Action Reinforcement Learning with Sequence Disentanglement using Variational Autoencoder},
year = {2019}, eprint = {arXiv:1903.09366}}
[25] @inproceedings{ijcai2019-475,
title = {Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning},
author = {Shi, Wenjie and Song, Shiji and Wu, Cheng},
booktitle = {Proceedings of the Twenty-Eighth International Joint Conference on
Artificial Intelligence, {IJCAI-19}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
pages = {3425--3431}, year = {2019}, month = {7}, doi = {10.24963/ijcai.2019/475},
url = {https://doi.org/10.24963/ijcai.2019/475}}
[26] @inproceedings{10.5555/3044805.3044850,
author = {Silver, David and Lever, Guy and Heess, Nicolas and Degris, Thomas and Wierstra, Daan and
Riedmiller, Martin},
title = {Deterministic Policy Gradient Algorithms}, year = {2014}, publisher = {JMLR.org},
booktitle = {Proceedings of the 31st International Conference on International Conference on
Machine Learning - Volume 32},
pages = {I–387–I–395}, numpages = {9}, location = {Beijing, China}, series = {ICML’14}}
[27] @inproceedings{DBLP:conf/ecml/Riedmiller05,
author = {Martin A. Riedmiller},
editor = {Jo{\~{a}}o Gama and Rui Camacho and Pavel Brazdil and Al{\'{\i}}pio Jorge and
Lu{\'{\i}}s Torgo},
title = {Neural Fitted {Q} Iteration - First Experiences with a Data Efficient
Neural Reinforcement Learning Method},
booktitle = {Machine Learning: {ECML} 2005, 16th European Conference on Machine
Learning, Porto, Portugal, October 3-7, 2005, Proceedings},
series = {Lecture Notes in Computer Science},
volume = {3720}, pages = {317--328}, publisher = {Springer}, year = {2005},
url = {https://doi.org/10.1007/11564096\_32}, doi = {10.1007/11564096\_32},
timestamp = {Tue, 14 May 2019 10:00:54 +0200}, biburl = {https://dblp.org/rec/conf/ecml/Riedmiller05.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[28] @article{DBLP:journals/corr/abs-2002-08396,
author = {Noah Y. Siegel and Jost Tobias Springenberg and Felix Berkenkamp and
Abbas Abdolmaleki and Michael Neunert and Thomas Lampe and Roland Hafner and
Nicolas Heess and Martin A. Riedmiller},
title = {Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement
Learning},
journal = {CoRR}, volume = {abs/2002.08396}, year = {2020}, url = {https://arxiv.org/abs/2002.08396},
archivePrefix = {arXiv}, eprint = {2002.08396}, timestamp = {Mon, 02 Mar 2020 16:46:06 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2002-08396.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[29] @article{Hafner2011,
doi = {10.1007/s10994-011-5235-x}, url = {https://doi.org/10.1007/s10994-011-5235-x},
year = {2011}, month = feb, publisher = {Springer Science and Business Media {LLC}}, volume = {84},
number = {1-2}, pages = {137--169}, author = {Roland Hafner and Martin Riedmiller},
title = {Reinforcement learning in feedback control},
journal = {Machine Learning}}
[30] @inproceedings{DBLP:conf/nips/AnthonyTB17,
author = {Thomas Anthony and Zheng Tian and David Barber},
editor = {Isabelle Guyon and Ulrike von Luxburg and Samy Bengio and Hanna M. Wallach and
Rob Fergus and S. V. N. Vishwanathan and Roman Garnett},
title = {Thinking Fast and Slow with Deep Learning and Tree Search},
booktitle = {Advances in Neural Information Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, 4-9 December 2017,
Long Beach, CA, {USA}},
pages = {5360--5370}, year = {2017},
url={http://papers.nips.cc/paper/7120-thinking-fast-and-slow-with-deep-learning-and-tree-search},
timestamp = {Fri, 06 Mar 2020 16:56:07 +0100}, biburl = {https://dblp.org/rec/conf/nips/AnthonyTB17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[31] @inproceedings{Fakoor2020Meta-Q-Learning,
title={Meta-Q-Learning},
author={Rasool Fakoor and Pratik Chaudhari and Stefano Soatto and Alexander J. Smola},
booktitle={International Conference on Learning Representations}, year={2020},
url={https://openreview.net/forum?id=SJeD3CEFPH}}
[32] @incollection{NIPS2019_9416,
title = {When to Trust Your Model: Model-Based Policy Optimization},
author = {Janner, Michael and Fu, Justin and Zhang, Marvin and Levine, Sergey},
booktitle = {Advances in Neural Information Processing Systems 32},
editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R.
Garnett},
pages = {12519--12530}, year = {2019}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/9416-when-to-trust-your-model-model-based-policy-optimization.pdf}}
[33] @inproceedings{1611.01626,
title={Combining policy gradient and Q-learning},
author={Brendan O'Donoghue and Remi Munos and Koray Kavukcuoglu and Volodymyr Mnih},
booktitle={ICLR}, year={2016}, eprint = {arXiv:1611.01626}}
[34] @book{Sutton_Barto_2018,
place={Cambridge, Massachusetts}, edition={Second edition},
series={Adaptive computation and machine learning series}, title={Reinforcement learning: an introduction},
ISBN={9780262039246}, publisher={The MIT Press}, author={Sutton, Richard S. and Barto, Andrew G.},
year={2018}, collection={Adaptive computation and machine learning series}}
[35] @article{Mnih_et_al_2015,
title={Human-level control through deep reinforcement learning},
volume={518}, ISSN={0028-0836, 1476-4687}, DOI={10.1038/nature14236}, number={7540}, journal={Nature},
author={Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Rusu, Andrei A. and Veness, Joel
and Bellemare, Marc G. and Graves, Alex and Riedmiller, Martin and Fidjeland, Andreas K.
and Ostrovski, Georg and et al.},
year={2015}, month={Feb}, pages={529–533}}
[36] @inproceedings{DBLP:conf/iclr/LevyKPS19,
author = {Andrew Levy and George Dimitri Konidaris and Robert Platt Jr. and Kate Saenko},
title = {Learning Multi-Level Hierarchies with Hindsight},
booktitle = {7th International Conference on Learning Representations, {ICLR} 2019,
New Orleans, LA, USA, May 6-9, 2019},
publisher = {OpenReview.net}, year = {2019},
url = {https://openreview.net/forum?id=ryzECoAcY7},
timestamp = {Tue, 19 Nov 2019 08:34:00 +0100},
biburl = {https://dblp.org/rec/conf/iclr/LevyKPS19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[37] @article{Liu_Bellet_2019,
title={Escaping the Curse of Dimensionality in Similarity Learning: Efficient Frank-Wolfe Algorithm
and Generalization Bounds},
volume={333}, ISSN={09252312}, DOI={10.1016/j.neucom.2018.12.060}, journal={Neurocomputing},
author={Liu, Kuan and Bellet, Aurélien}, year={2019}, month={Mar}, pages={185–199}}
[38] @inproceedings{DBLP:conf/nips/DayanH92,
author = {Peter Dayan and Geoffrey E. Hinton},
editor = {Stephen Jose Hanson and Jack D. Cowan and C. Lee Giles},
title = {Feudal Reinforcement Learning},
booktitle = {Advances in Neural Information Processing Systems 5, {[NIPS} Conference,
Denver, Colorado, USA, November 30 - December 3, 1992]},
pages = {271--278}, publisher = {Morgan Kaufmann}, year = {1992},
url = {http://papers.nips.cc/paper/714-feudal-reinforcement-learning},
timestamp = {Fri, 06 Mar 2020 16:57:04 +0100}, biburl = {https://dblp.org/rec/conf/nips/DayanH92.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[39] @article{DBLP:journals/ai/SuttonPS99,
author = {Richard S. Sutton and Doina Precup and Satinder P. Singh},
title = {Between MDPs and Semi-MDPs: {A} Framework for Temporal Abstraction
in Reinforcement Learning},
journal = {Artif. Intell.}, volume = {112}, number = {1-2}, pages = {181--211}, year = {1999},
url = {https://doi.org/10.1016/S0004-3702(99)00052-1}, doi = {10.1016/S0004-3702(99)00052-1},
timestamp = {Sat, 27 May 2017 14:24:41 +0200}, biburl = {https://dblp.org/rec/journals/ai/SuttonPS99.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[40] @article{DBLP:journals/jair/Dietterich00,
author = {Thomas G. Dietterich},
title = {Hierarchical Reinforcement Learning with the {MAXQ} Value Function Decomposition},
journal = {J. Artif. Intell. Res.}, volume = {13}, pages = {227--303},year = {2000},
url = {https://doi.org/10.1613/jair.639}, doi = {10.1613/jair.639},
timestamp = {Mon, 21 Jan 2019 15:01:17 +0100}, biburl = {https://dblp.org/rec/journals/jair/Dietterich00.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[41] @inproceedings{DBLP:conf/icml/VezhnevetsOSHJS17,
author = {Alexander Sasha Vezhnevets and Simon Osindero and Tom Schaul and Nicolas Heess and
Max Jaderberg and David Silver and Koray Kavukcuoglu},
editor = {Doina Precup and Yee Whye Teh},
title = {FeUdal Networks for Hierarchical Reinforcement Learning},
booktitle = {Proceedings of the 34th International Conference on Machine Learning,
{ICML} 2017, Sydney, NSW, Australia, 6-11 August 2017},
series = {Proceedings of Machine Learning Research},
volume = {70}, pages = {3540--3549}, publisher = {{PMLR}}, year = {2017},
url = {http://proceedings.mlr.press/v70/vezhnevets17a.html},
timestamp = {Wed, 29 May 2019 08:41:45 +0200},
biburl = {https://dblp.org/rec/conf/icml/VezhnevetsOSHJS17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[42] @inproceedings{DBLP:conf/aaai/BaconHP17,
author = {Pierre{-}Luc Bacon and Jean Harb and Doina Precup},
editor = {Satinder P. Singh and Shaul Markovitch},
title = {The Option-Critic Architecture},
booktitle = {Proceedings of the Thirty-First {AAAI} Conference on Artificial Intelligence,
February 4-9, 2017, San Francisco, California, {USA}},
pages = {1726--1734}, publisher = {{AAAI} Press}, year = {2017},
url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14858},
timestamp = {Mon, 06 Mar 2017 11:36:24 +0100},
biburl = {https://dblp.org/rec/conf/aaai/BaconHP17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[43] @inproceedings{DBLP:conf/nips/NachumGLL18,
author = {Ofir Nachum and Shixiang Gu and Honglak Lee and Sergey Levine},
editor = {Samy Bengio and Hanna M. Wallach and Hugo Larochelle and Kristen Grauman and
Nicol{\`{o}} Cesa{-}Bianchi and Roman Garnett},
title = {Data-Efficient Hierarchical Reinforcement Learning},
booktitle = {Advances in Neural Information Processing Systems 31: Annual Conference
on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December
2018, Montr{\'{e}}al, Canada},
pages = {3307--3317},year = {2018},
url = {http://papers.nips.cc/paper/7591-data-efficient-hierarchical-reinforcement-learning},
timestamp = {Fri, 06 Mar 2020 17:00:31 +0100},
biburl = {https://dblp.org/rec/conf/nips/NachumGLL18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[44] @inproceedings{DBLP:conf/iclr/ODonoghueMKM17,
author = {Brendan O'Donoghue and R{\'{e}}mi Munos and Koray Kavukcuoglu and
Volodymyr Mnih},
title = {Combining policy gradient and Q-learning},
booktitle = {5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2017}, url = {https://openreview.net/forum?id=B1kJ6H9ex},
timestamp = {Thu, 25 Jul 2019 14:25:50 +0200}, biburl = {https://dblp.org/rec/conf/iclr/ODonoghueMKM17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[45] @article{DBLP:journals/corr/HeessWTLRS16,
author = {Nicolas Heess and Gregory Wayne and Yuval Tassa and Timothy P. Lillicrap and
Martin A. Riedmiller and David Silver},
title = {Learning and Transfer of Modulated Locomotor Controllers},
journal = {CoRR}, volume = {abs/1610.05182}, year = {2016}, url = {http://arxiv.org/abs/1610.05182},
archivePrefix = {arXiv}, eprint = {1610.05182}, timestamp = {Mon, 13 Aug 2018 16:47:23 +0200},
biburl = {https://dblp.org/rec/journals/corr/HeessWTLRS16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[46] @inproceedings{DBLP:conf/nips/KulkarniNST16,
author = {Tejas D. Kulkarni and Karthik Narasimhan and Ardavan Saeedi and Josh Tenenbaum},
editor = {Daniel D. Lee and Masashi Sugiyama and Ulrike von Luxburg and Isabelle Guyon and Roman Garnett},
title = {Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation},
booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference
on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain},
pages = {3675--3683}, year = {2016},
url = {http://papers.nips.cc/paper/6233-hierarchical-deep-reinforcement-learning-integrating-temporal-abstraction-and-intrinsic-motivation},
timestamp = {Fri, 06 Mar 2020 17:00:15 +0100}, biburl = {https://dblp.org/rec/conf/nips/KulkarniNST16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[47] @inproceedings{DBLP:conf/iclr/FransH0AS18,
author = {Kevin Frans and Jonathan Ho and Xi Chen and Pieter Abbeel and John Schulman},
title = {Meta Learning Shared Hierarchies},
booktitle = {6th International Conference on Learning Representations, {ICLR} 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2018}, url = {https://openreview.net/forum?id=SyX0IeWAW},
timestamp = {Thu, 25 Jul 2019 14:26:00 +0200}, biburl = {https://dblp.org/rec/conf/iclr/FransH0AS18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[48] @article{DBLP:journals/corr/abs-1812-00025,
author = {Alexander Pashevich and Danijar Hafner and James Davidson and Rahul Sukthankar and
Cordelia Schmid},
title = {Modulated Policy Hierarchies}, journal = {CoRR}, volume = {abs/1812.00025}, year = {2018},
url = {http://arxiv.org/abs/1812.00025}, archivePrefix = {arXiv}, eprint = {1812.00025},
timestamp = {Tue, 01 Jan 2019 15:01:25 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-1812-00025.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[49] @inproceedings{DBLP:conf/nips/VezhnevetsMOGVA16,
author = {Alexander Vezhnevets and Volodymyr Mnih and Simon Osindero and Alex Graves and
Oriol Vinyals and John Agapiou and Koray Kavukcuoglu},
editor = {Daniel D. Lee and Masashi Sugiyama and Ulrike von Luxburg and Isabelle Guyon and
Roman Garnett},
title = {Strategic Attentive Writer for Learning Macro-Actions},
booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference
on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain},
pages = {3486--3494}, year = {2016},
url = {http://papers.nips.cc/paper/6414-strategic-attentive-writer-for-learning-macro-actions},
timestamp = {Fri, 06 Mar 2020 17:00:15 +0100}, biburl = {https://dblp.org/rec/conf/nips/VezhnevetsMOGVA16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[50] @inproceedings{DBLP:conf/aaai/TesslerGZMM17,
author = {Chen Tessler and Shahar Givony and Tom Zahavy and Daniel J. Mankowitz and Shie Mannor},
editor = {Satinder P. Singh and Shaul Markovitch},
title = {A Deep Hierarchical Approach to Lifelong Learning in Minecraft},
booktitle = {Proceedings of the Thirty-First {AAAI} Conference on Artificial Intelligence,
February 4-9, 2017, San Francisco, California, {USA}},
pages = {1553--1561}, publisher = {{AAAI} Press}, year = {2017},
url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14630},
timestamp = {Mon, 06 Mar 2017 11:36:24 +0100}, biburl = {https://dblp.org/rec/conf/aaai/TesslerGZMM17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[51] @inproceedings{ICAPS1715759,
author = {Nakul Gopalan and Marie desJardins and Michael Littman
and James MacGlashan and Shawn Squire and Stefanie Tellex
and John Winder and Lawson Wong},
title = {Planning with Abstract Markov Decision Processes},
booktitle = {International Conference on Automated Planning and Scheduling},
year = {2017}, url = {https://aaai.org/ocs/index.php/ICAPS/ICAPS17/paper/view/15759}}
[52] @article{DBLP:journals/corr/MankowitzMM16,
author = {Daniel J. Mankowitz and Timothy A. Mann and Shie Mannor},
title = {Iterative Hierarchical Optimization for Misspecified Problems (IHOMP)},
journal = {CoRR}, volume = {abs/1602.03348}, year = {2016}, url = {http://arxiv.org/abs/1602.03348},
archivePrefix = {arXiv}, eprint = {1602.03348}, timestamp = {Wed, 17 Jul 2019 17:00:48 +0200},
biburl = {https://dblp.org/rec/journals/corr/MankowitzMM16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[53] @article{DBLP:journals/corr/abs-1811-09083,
author = {Sainbayar Sukhbaatar and Emily Denton and Arthur Szlam and Rob Fergus},
title = {Learning Goal Embeddings via Self-Play for Hierarchical Reinforcement Learning},
journal = {CoRR}, volume = {abs/1811.09083}, year = {2018}, url = {http://arxiv.org/abs/1811.09083},
archivePrefix = {arXiv}, eprint = {1811.09083}, timestamp = {Fri, 30 Nov 2018 12:44:28 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1811-09083.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[54] @inproceedings{DBLP:conf/aaai/RafatiN19a,
author = {Jacob Rafati and David C. Noelle},
title = {Learning Representations in Model-Free Hierarchical Reinforcement Learning},
booktitle = {The Thirty-Third {AAAI} Conference on Artificial Intelligence, {AAAI}
2019, The Thirty-First Innovative Applications of Artificial Intelligence
Conference, {IAAI} 2019, The Ninth {AAAI} Symposium on Educational
Advances in Artificial Intelligence, {EAAI} 2019, Honolulu, Hawaii,
USA, January 27 - February 1, 2019},
pages = {10009--10010}, publisher = {{AAAI} Press}, year = {2019},
url = {https://doi.org/10.1609/aaai.v33i01.330110009}, doi = {10.1609/aaai.v33i01.330110009},
timestamp = {Wed, 25 Sep 2019 11:05:09 +0200}, biburl = {https://dblp.org/rec/conf/aaai/RafatiN19a.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[55] @inproceedings{DBLP:conf/nips/ParrR97,
author = {Ronald Parr and Stuart J. Russell},
editor = {Michael I. Jordan and Michael J. Kearns and Sara A. Solla},
title = {Reinforcement Learning with Hierarchies of Machines},
booktitle = {Advances in Neural Information Processing Systems 10, {[NIPS} Conference,
Denver, Colorado, USA, 1997]},
pages = {1043--1049}, publisher = {The {MIT} Press}, year = {1997},
url = {http://papers.nips.cc/paper/1384-reinforcement-learning-with-hierarchies-of-machines},
timestamp = {Fri, 06 Mar 2020 17:00:38 +0100}, biburl = {https://dblp.org/rec/conf/nips/ParrR97.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[56] @phdthesis{Watkins1989,
author = {Watkins, Christopher}, year = {1989}, month = {04}, title = {Learning From Delayed Rewards},
school = {King’s College}}
[57] @inproceedings{DBLP:conf/icml/HaarnojaTAL17,
author = {Tuomas Haarnoja and Haoran Tang and Pieter Abbeel and Sergey Levine},
editor = {Doina Precup and Yee Whye Teh},
title = {Reinforcement Learning with Deep Energy-Based Policies},
booktitle = {Proceedings of the 34th International Conference on Machine Learning,
{ICML} 2017, Sydney, NSW, Australia, 6-11 August 2017},
series = {Proceedings of Machine Learning Research},
volume = {70}, pages = {1352--1361}, publisher = {{PMLR}}, year = {2017},
url = {http://proceedings.mlr.press/v70/haarnoja17a.html}, timestamp = {Wed, 29 May 2019 08:41:45 +0200},
biburl = {https://dblp.org/rec/conf/icml/HaarnojaTAL17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[58] @article{10.1007/BF00992696,
author = {Williams, Ronald J.},
title = {Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning},
year = {1992}, issue_date = {May 1992}, publisher = {Kluwer Academic Publishers}, address = {USA},
volume = {8}, number = {3--4}, issn = {0885-6125}, url = {https://doi.org/10.1007/BF00992696},
doi = {10.1007/BF00992696}, journal = {Mach. Learn.}, month = may, pages = {229–256}, numpages = {28},
keywords = {connectionist networks, mathematical analysis, Reinforcement learning, gradient descent}}
[59] @inproceedings{DBLP:conf/iclr/SukhbaatarLKSSF18,
author = {Sainbayar Sukhbaatar and Zeming Lin and Ilya Kostrikov and Gabriel Synnaeve and Arthur Szlam and
Rob Fergus},
title = {Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play},
booktitle = {6th International Conference on Learning Representations, {ICLR} 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2018}, url = {https://openreview.net/forum?id=SkT5Yg-RZ},
timestamp = {Thu, 25 Jul 2019 14:25:46 +0200}, biburl = {https://dblp.org/rec/conf/iclr/SukhbaatarLKSSF18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[60] @incollection{NIPS2018_7725,
title = {Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models},
author = {Chua, Kurtland and Calandra, Roberto and McAllister, Rowan and Levine, Sergey},
booktitle = {Advances in Neural Information Processing Systems 31},
editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
pages = {4754--4765}, year = {2018}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7725-deep-reinforcement-learning-in-a-handful-of-trials-using-probabilistic-dynamics-models.pdf}}
[61] @inproceedings{DBLP:conf/nips/BuckmanHTBL18,
author = {Jacob Buckman and Danijar Hafner and George Tucker and Eugene Brevdo and Honglak Lee},
editor = {Samy Bengio and Hanna M. Wallach and Hugo Larochelle and Kristen Grauman and
Nicol{\`{o}} Cesa{-}Bianchi and Roman Garnett},
title = {Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion},
booktitle = {Advances in Neural Information Processing Systems 31: Annual Conference
on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montr{\'{e}}al, Canada},
pages = {8234--8244}, year = {2018},
url = {http://papers.nips.cc/paper/8044-sample-efficient-reinforcement-learning-with-stochastic-ensemble-value-expansion},
timestamp = {Fri, 06 Mar 2020 17:00:31 +0100}, biburl = {https://dblp.org/rec/conf/nips/BuckmanHTBL18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[62] @inproceedings{DBLP:conf/iclr/LuoXLTDM19,
author = {Yuping Luo and Huazhe Xu and Yuanzhi Li and Yuandong Tian and Trevor Darrell and
Tengyu Ma},
title = {Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees},
booktitle = {7th International Conference on Learning Representations, {ICLR} 2019,
New Orleans, LA, USA, May 6-9, 2019},
publisher = {OpenReview.net}, year = {2019}, url = {https://openreview.net/forum?id=BJe1E2R5KX},
timestamp = {Thu, 25 Jul 2019 14:26:05 +0200},
biburl = {https://dblp.org/rec/conf/iclr/LuoXLTDM19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[63] @inproceedings{DBLP:journals/corr/ParisottoBS15,
author = {Emilio Parisotto and Lei Jimmy Ba and Ruslan Salakhutdinov},
editor = {Yoshua Bengio and Yann LeCun},
title = {Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning},
booktitle = {4th International Conference on Learning Representations, {ICLR} 2016,
San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings},
year = {2016}, url = {http://arxiv.org/abs/1511.06342},
timestamp = {Thu, 25 Jul 2019 14:25:38 +0200},
biburl = {https://dblp.org/rec/journals/corr/ParisottoBS15.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[64] @inproceedings{DBLP:conf/iclr/SharmaLR17,
author = {Sahil Sharma and Aravind S. Lakshminarayanan and Balaraman Ravindran},
title = {Learning to Repeat: Fine Grained Action Repetition for Deep Reinforcement Learning},
booktitle = {5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2017},
url = {https://openreview.net/forum?id=B1GOWV5eg},
timestamp = {Thu, 25 Jul 2019 14:25:59 +0200},
biburl = {https://dblp.org/rec/conf/iclr/SharmaLR17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[65] @incollection{NIPS2015_5796,
title = {Learning Continuous Control Policies by Stochastic Value Gradients},
author = {Heess, Nicolas and Wayne, Gregory and Silver, David and Lillicrap, Timothy and Erez, Tom and
Tassa, Yuval},
booktitle = {Advances in Neural Information Processing Systems 28},
editor = {C. Cortes and N. D. Lawrence and D. D. Lee and M. Sugiyama and R. Garnett},
pages = {2944--2952}, year = {2015}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/5796-learning-continuous-control-policies-by-stochastic-value-gradients.pdf}}
[66] @inproceedings{DBLP:conf/nips/OhSL17,
author = {Junhyuk Oh and Satinder Singh and Honglak Lee},
editor = {Isabelle Guyon and Ulrike von Luxburg and Samy Bengio and Hanna M. Wallach and
Rob Fergus and S. V. N. Vishwanathan and Roman Garnett},
title = {Value Prediction Network},
booktitle = {Advances in Neural Information Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, {USA}},
pages = {6118--6128}, year = {2017},
url = {http://papers.nips.cc/paper/7192-value-prediction-network},
timestamp = {Fri, 06 Mar 2020 16:56:22 +0100},
biburl = {https://dblp.org/rec/conf/nips/OhSL17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[67] @article{Rummery1994,
author = {Rummery, G. and Niranjan, Mahesan}, year = {1994}, month = {11}, pages = {},
title = {On-Line Q-Learning Using Connectionist Systems}, journal = {Technical Report CUED/F-INFENG/TR 166}}
[68] @article{DBLP:journals/corr/abs-2006-08875,
author = {Zichuan Lin and Garrett Thomas and Guangwen Yang and Tengyu Ma},
title = {Model-based Adversarial Meta-Reinforcement Learning},
journal = {CoRR}, volume = {abs/2006.08875}, year = {2020},
url = {https://arxiv.org/abs/2006.08875},
archivePrefix = {arXiv}, eprint = {2006.08875}, timestamp = {Wed, 17 Jun 2020 14:28:54 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2006-08875.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[69] @inproceedings{DBLP:conf/icml/FinnAL17,
author = {Chelsea Finn and Pieter Abbeel and Sergey Levine},
editor = {Doina Precup and Yee Whye Teh},
title = {Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks},
booktitle = {Proceedings of the 34th International Conference on Machine Learning,
{ICML} 2017, Sydney, NSW, Australia, 6-11 August 2017},
series = {Proceedings of Machine Learning Research},
volume = {70}, pages = {1126--1135}, publisher = {{PMLR}}, year = {2017},
url = {http://proceedings.mlr.press/v70/finn17a.html},
timestamp = {Wed, 29 May 2019 08:41:45 +0200},
biburl = {https://dblp.org/rec/conf/icml/FinnAL17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[70] @inproceedings{DBLP:conf/icml/RakellyZFLQ19,
author = {Kate Rakelly and Aurick Zhou and Chelsea Finn and Sergey Levine and
Deirdre Quillen},
editor = {Kamalika Chaudhuri and Ruslan Salakhutdinov},
title = {Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables},
booktitle = {Proceedings of the 36th International Conference on Machine Learning,
{ICML} 2019, 9-15 June 2019, Long Beach, California, {USA}},
series = {Proceedings of Machine Learning Research},
volume = {97}, pages = {5331--5340}, publisher = {{PMLR}}, year = {2019},
url = {http://proceedings.mlr.press/v97/rakelly19a.html},
timestamp = {Tue, 11 Jun 2019 15:37:38 +0200},
biburl = {https://dblp.org/rec/conf/icml/RakellyZFLQ19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[71] @inproceedings{DBLP:conf/iclr/RothfussLCAA19,
author = {Jonas Rothfuss and Dennis Lee and Ignasi Clavera and Tamim Asfour and Pieter Abbeel},
title = {ProMP: Proximal Meta-Policy Search},
booktitle = {7th International Conference on Learning Representations, {ICLR} 2019,
New Orleans, LA, USA, May 6-9, 2019},
publisher = {OpenReview.net}, year = {2019},
url = {https://openreview.net/forum?id=SkxXCi0qFX},
timestamp = {Thu, 25 Jul 2019 14:25:43 +0200},
biburl = {https://dblp.org/rec/conf/iclr/RothfussLCAA19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[72] @article{DBLP:journals/corr/DuanSCBSA16,
author = {Yan Duan and John Schulman and Xi Chen and Peter L. Bartlett and Ilya Sutskever and Pieter Abbeel},
title = {RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning},
journal = {CoRR}, volume = {abs/1611.02779}, year = {2016},
url = {http://arxiv.org/abs/1611.02779},
archivePrefix = {arXiv}, eprint = {1611.02779}, timestamp = {Mon, 03 Sep 2018 12:15:29 +0200},
biburl = {https://dblp.org/rec/journals/corr/DuanSCBSA16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[73] @inproceedings{DBLP:conf/icml/DudikLL11,
author = {Miroslav Dud{\'{\i}}k and John Langford and Lihong Li},
editor = {Lise Getoor and Tobias Scheffer},
title = {Doubly Robust Policy Evaluation and Learning},
booktitle = {Proceedings of the 28th International Conference on Machine Learning,
{ICML} 2011, Bellevue, Washington, USA, June 28 - July 2, 2011},
pages = {1097--1104}, publisher = {Omnipress}, year = {2011},
url = {https://icml.cc/2011/papers/554\_icmlpaper.pdf},
timestamp = {Wed, 03 Apr 2019 17:43:35 +0200},
biburl = {https://dblp.org/rec/conf/icml/DudikLL11.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[74] @inproceedings{DBLP:conf/icml/DeisenrothR11,
author = {Marc Peter Deisenroth and Carl Edward Rasmussen},
editor = {Lise Getoor and Tobias Scheffer},
title = {{PILCO:} {A} Model-Based and Data-Efficient Approach to Policy Search},
booktitle = {Proceedings of the 28th International Conference on Machine Learning,
{ICML} 2011, Bellevue, Washington, USA, June 28 - July 2, 2011},
pages = {465--472}, publisher = {Omnipress}, year = {2011},
url = {https://icml.cc/2011/papers/323\_icmlpaper.pdf},
timestamp = {Wed, 03 Apr 2019 17:43:35 +0200},
biburl = {https://dblp.org/rec/conf/icml/DeisenrothR11.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[75] @inproceedings{DBLP:conf/iclr/KurutachCDTA18,
author = {Thanard Kurutach and Ignasi Clavera and Yan Duan and Aviv Tamar and Pieter Abbeel},
title = {Model-Ensemble Trust-Region Policy Optimization},
booktitle = {6th International Conference on Learning Representations, {ICLR} 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2018},
url = {https://openreview.net/forum?id=SJJinbWRZ},
timestamp = {Thu, 25 Jul 2019 14:25:59 +0200},
biburl = {https://dblp.org/rec/conf/iclr/KurutachCDTA18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[77] @misc{flet-berliac_2020,
title={The Promise of Hierarchical Reinforcement Learning},
url={https://web.archive.org/web/20200501215221/https://thegradient.pub/the-promise-of-hierarchical-reinforcement-learning/},
author={Flet-Berliac, Yannis},
year={2020},
month={Apr}
}
[78] @article{DBLP:journals/corr/abs-1908-06976,
author = {Arthur Aubret and La{\"{e}}titia Matignon and Salima Hassas},
title = {A survey on intrinsic motivation in reinforcement learning},
journal = {CoRR},
volume = {abs/1908.06976},
year = {2019},
url = {http://arxiv.org/abs/1908.06976},
archivePrefix = {arXiv},
eprint = {1908.06976},
timestamp = {Mon, 26 Aug 2019 13:20:40 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1908-06976.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
[81] @misc{kiran2020deep,
title={Deep Reinforcement Learning for Autonomous Driving: A Survey},
author={B Ravi Kiran and Ibrahim Sobh and Victor Talpaert and Patrick Mannion and Ahmad A. Al Sallab
and Senthil Yogamani and Patrick Pérez},
year={2020},
eprint={2002.00444},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
[82] @inbook{Azhikodan2019,
author = {Azhikodan, Akhil and Bhat, Anvitha and Jadhav, Mamatha},
year = {2019},
month = {05},
pages = {41-49},
title = {Stock Trading Bot Using Deep Reinforcement Learning},
isbn = {978-981-10-8200-9},
journal = {Lecture Notes in Networks and Systems},
doi = {10.1007/978-981-10-8201-6_5}
}
[83] @article{Kober2013,
author = {Kober, Jens and Bagnell, J. and Peters, Jan},
year = {2013},
month = {09},
pages = {1238-1274},
title = {Reinforcement Learning in Robotics: A Survey},
volume = {32},
journal = {The International Journal of Robotics Research},
doi = {10.1177/0278364913495721}
}
[84] @techreport{Fischer2018Reinforcement,
address = {N\"{u}rnberg},
author = {Thomas G. Fischer},
copyright = {http://www.econstor.eu/dspace/Nutzungsbedingungen},
keywords = {330; financial markets; reinforcement learning; survey; trading systems; machine learning},
language = {eng},
number = {12/2018},
publisher = {Friedrich-Alexander-Universit\"{a}t Erlangen-N\"{u}rnberg, Institute for Economics},
title = {Reinforcement learning in financial markets - a survey},
type = {FAU Discussion Papers in Economics},
url = {http://hdl.handle.net/10419/183139},
year = {2018}
}
[85] @article{Liang_2019,
title={A Deep Reinforcement Learning Network for Traffic Light Cycle Control},
volume={68},
ISSN={1939-9359},
url={http://dx.doi.org/10.1109/TVT.2018.2890726},
DOI={10.1109/tvt.2018.2890726},
number={2},
journal={IEEE Transactions on Vehicular Technology},
publisher={Institute of Electrical and Electronics Engineers (IEEE)},
author={Liang, Xiaoyuan and Du, Xunsheng and Wang, Guiling and Han, Zhu},
year={2019},
month={Feb},
pages={1243–1253}
}
[86] @misc{yu2020reinforcement,
title={Reinforcement Learning in Healthcare: A Survey},
author={Chao Yu and Jiming Liu and Shamim Nemati},
year={2020},
eprint={1908.08796},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
[87] @misc{shao2019survey,
title={A Survey of Deep Reinforcement Learning in Video Games},
author={Kun Shao and Zhentao Tang and Yuanheng Zhu and Nannan Li and Dongbin Zhao},
year={2019},
eprint={1912.10944},
archivePrefix={arXiv},
primaryClass={cs.MA}
}
[88] @inproceedings{HanCai2017,
author = {Cai, Han and Ren, Kan and Zhang, W and Malialis, K and Wang, J and Yu, Y and Guo, D},
year = {2017},
month = {02},
pages = {},
title = {Real-Time Bidding by Reinforcement Learning in Display Advertising}
}