Reinforcement Learning: A Literature Review (v2)
Iscte - Instituto Universitário de Lisboa (ISCTE-IUL), Lisbon, Portugal
Abstract
This paper contains a literature review of Reinforcement Learning and its evolution. Reinforcement Learning is a part of Machine Learning and comprises algorithms and techniques for achieving optimal control of an agent in an environment, providing a form of Artificial Intelligence. This agent can be a physical or virtual robot, a controller simulating a player in a game, a bot trading stocks, etc. The study starts at Q-learning [56], published in 1989, and follows the thread of algorithms and frameworks up until 2020, looking at the main insights each paper brings to the field regarding strategies to handle RL.
1. Introduction
Reinforcement Learning (RL) is a part of the Machine Learning (ML) field. As seen in Figure 1, RL is considered an active type of ML [78]. Intrinsic Motivation (IM) is also considered an active type of ML [78], but differs from RL because it lacks the feedback mechanism from a supervisor that RL has. This feedback mechanism is also what distinguishes Supervised from Unsupervised ML. While [78] considers IM a separate field of ML, several RL works use IM as a strategy to cope with complex, sparse-reward environments.
Figure 1 - A diagram of the Reinforcement Learning context and techniques
RL is an important area of study because it may enable society to automate tasks that, in the past, we never thought could be automated. Autonomous driving is one application of RL. Others may include robots that can perform tasks like preparing ingredients and cooking food, with the same robot preparing several different dishes without any human intervention or any task-specific programming. Stock market trading can also be performed by an RL agent that, instead of being programmed with specific rules, learns the best trading rules by itself.
In theory, RL can reach a state of Artificial General Intelligence (AGI) [21, p.16], or even further, of Artificial Super Intelligence (ASI) [21, p.16], the ultimate goal being an agent that performs any task a human performs, but faster and with more precision.
While surveys of RL algorithms are usually restricted to a single family, either Flat Reinforcement Learning (FRL) [40, p.1], Hierarchical Reinforcement Learning (HRL) [36, p.1] or Meta Reinforcement Learning (MRL) [71, p.1], our approach is to mix all types in chronological order (by year of print publication), since, despite having fundamentally different approaches, they all propose to solve the problem of optimal control. There is no specific order of papers within each year.
Figure 2 shows, in a simplified way, a comparison between FRL, HRL and MRL methods. While FRL contains a core we call the "planner" that processes observations and rewards to produce micro-actions (actions that are atomic and not divisible into sub-actions), HRL combines a planner level with a "skills" level that we can also call macro-actions or options. Skills are combinations of micro-actions that can be used by the planner. In HRL the planner can also choose to perform micro-actions combined with macro-actions or other sub-planners. Planning with micro-actions alone operates over a higher-dimensional search space, so it may also involve a higher computational cost. MRL is an approach that focuses on creating agents that learn how to learn, and that can model concepts. While we separate skills from concepts in the MRL sketch, a skill can be seen as the concept of an action, if we consider how a skill adapts to every similar type of action.
Figure 2 - Flat Reinforcement Learning (FRL) vs. Hierarchical Reinforcement Learning (HRL) vs. Meta Reinforcement Learning (MRL)
One of the early papers that introduced Q-learning, a root of FRL approaches, also discusses how HRL methods can be used to solve the learning problem [56], but over time some degree of separation between FRL and HRL became noticeable, even though HRL implementations of planners sometimes resort to FRL methods. MRL algorithms have formed a thread separate from FRL and HRL, but in some aspects they still feel like HRL.
FRL refers to plain RL algorithms, as opposed to HRL or MRL algorithms, because FRL methods treat the problem as one big, flat search space [40, p.228]. FRL methods have limitations because the models in this set of approaches are oriented towards basic actions only [46, p.1].
In HRL, through hierarchies that decompose problems into smaller problems, RL agents can learn faster and can be applied to problems with larger action and state spaces, because of the inherent dimensionality reduction of the hierarchical approach [36, p.1].
MRL concentrates on learning an inductive bias that can accelerate learning a new task by training on many previous tasks [31, p.3], and in this way adapt to unseen tasks the agent will need to perform in the future [68, p.1].
This survey holds brief descriptions of the methods, algorithms, techniques and frameworks proposed in the selected publications, with focus on the insights each one brings. It also shows a trend in the interest of researchers, universities [27, 4, 18, 12, 20, 57, 30, 7, 13, 22, 32] and companies like Google [8, 5, 2, 33, 19, 23, 9, 10, 11, 14, 1, 28], OpenAI [57], Amazon [31], Alibaba [16] and others investing in the field of RL.
As stated in [15, p.151], almost all deep RL algorithms consist of simple recipes that combine a dataset specification, an optimization procedure, a cost function, etc., so by rearranging these elements we can obtain many different algorithms. This effect can be seen throughout the chronology of RL.
Figure 3 shows the number of papers per year in this survey and, assuming the papers not represented here are equally distributed throughout the years covered, we can state that there is rapid progress and increasing interest in the field of RL in recent years, one of the motivations for this study.
Figure 3 - Distribution of papers, in this survey, per year (2020 partial year)
This study starts at Q-learning [56], published in 1989, and follows the thread of algorithms and frameworks up until 2020, looking at the main insights each paper brings to the field with regard to strategies to handle RL. The motivation for a chronologically ordered study is to allow us to understand which ideas have become baselines for subsequent algorithms, what problems researchers have been trying to solve with each approach, and which trends of thought RL researchers followed over time; like connecting the dots in a drawing, this allows us to form a general image of RL and of the direction it is taking.
Following the chronology, we can see what, after 2017, some papers have called ensemble approaches [60, 61, 75]; other papers, while not using that term, present the same hybrid approach of joining the strengths of model-free and model-based solutions [11, 12, 13, 32, 60, 61, 66, 75].
As a note on the use of BibTeX entries in this study's bibliography: we adopted that format to provide faster access to the works cited herein and, in this way, a more useful reference tool.
2. Acronyms
The following table lists some acronyms broadly used in the RL field, some of which are also used throughout this paper.
AI - Artificial Intelligence
ML - Machine Learning
NN - Neural Networks
RL - Reinforcement Learning
3. Preliminary Concepts
The RL problem
The RL problem can be described as a flow in which an agent learns from interaction with an environment, receiving state descriptions and rewards in order to decide which actions to perform [34], with the aim of achieving optimal control of the agent.
While this logic has been the foundation of many RL algorithms [34], concepts like IM [59, 78] and proprioception (self state) [45] (the sense of body position or self-movement) have been introduced in some works, so we can design an extended RL agent-environment interaction, like the one in Figure 5.
In Figure 5, the agent not only takes inputs from the "external" environment but also from an "internal environment" represented by the intrinsic motivations / rewards and by the self-state readings.
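To make the extended interaction concrete, the following toy sketch implements the loop of Figure 5 in Python. The chain environment, the random agent and the count-based intrinsic bonus are illustrative placeholders of our own, not taken from any of the surveyed papers.

```python
import random

# Toy, self-contained version of the extended agent-environment loop of Figure 5.
# Environment, agent and intrinsic bonus are illustrative placeholders only.

class ToyChainEnv:
    """A 10-state chain; reaching state 9 yields the only extrinsic reward (sparse)."""
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):                        # action in {-1, +1}
        self.state = max(0, min(9, self.state + action))
        done = self.state == 9
        return self.state, (1.0 if done else 0.0), done

class RandomAgent:
    def __init__(self):
        self.visit_counts = {}                     # "internal environment": novelty memory

    def act(self, observation):
        return random.choice([-1, +1])

    def intrinsic_reward(self, next_observation):  # simple count-based novelty bonus
        self.visit_counts[next_observation] = self.visit_counts.get(next_observation, 0) + 1
        return 1.0 / self.visit_counts[next_observation]

env, agent = ToyChainEnv(), RandomAgent()
obs, total_return = env.reset(), 0.0
for _ in range(100):
    action = agent.act(obs)
    next_obs, extrinsic, done = env.step(action)
    intrinsic = agent.intrinsic_reward(next_obs)
    total_return += extrinsic + intrinsic          # combined external + internal reward
    obs = next_obs
    if done:
        break
print("return (extrinsic + intrinsic):", round(total_return, 2))
```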
States, Policies, Rewards and Time
Each RL problem can be described using a Markov Decision Process (MDP) and also a Semi-Markov Decision Process (SMDP).
Figure 6 - A transition graph of a Markov Decision Process for usage and recharging of a
robot’s battery [34, p.52].
Figure 6 shows an MDP: each circle represents a state and each arrow a transition with a given probability. From each state there is more than one possible transition. A policy represents the strategy the system uses at a point in time to choose the next transition and the transitions it expects to make in the future. Rewards represent the gain the system obtains by reaching a certain state. For instance, if the robot's battery has a charge of 40% and it decides to charge itself to 100%, the reward may be a representation of a 60% charge, and if the charge is at 10% and it decides to go charge itself, then the reward may be a representation of a 90% increase in charge. This is a very simple example; in a real application, reward calculation would also need to take into account rewards other than the battery charge. If the robot has goals to achieve in the environment, rewards for charging the battery will certainly have to compete with rewards for achieving other purposeful tasks.
SMDPs introduce the time variable. In MDPs each state/transition is expected to last a single, uniform unit of time, a discrete time step; this is where SMDPs come into play, since in SMDPs transitions can have different durations rather than being limited to single-unit time changes, introducing in this way the concept of time [39].
In RL, finite and infinite horizon settings have very different complexities: in the first case, the agent knows when to stop processing because a goal has been met, while in the second the agent is treated as a living entity with a long life span and multiple, endless goals throughout its existence [16, 13, 61, 32].
The horizon concept has also been used to create hybrid model-free and model-based approaches, allowing one to decide the size of rollouts taken from batch data of the model-based part in order to feed the model-free part of these mixed models [16, 61].
Extrinsic Motivation versus Intrinsic Motivation (IM)
Extrinsic motivation happens when an agent obtains a reward from the environment after completing a task, and it is also present when the agent tries to learn how to solve a problem through a function of future rewards from the environment [78].
In [78], as shown in Table 2, Intrinsic Motivation is considered the basis for a new field of ML. While IM can be viewed this way, it can also be viewed as an RL strategy to accomplish better performance in a sparse-reward environment [41, 43, 46, 54].
RL Problem Formulation
Figures 7 and 8 contain a set of basic RL formulas/equations and the pseudocode of tabular solution methods like Policy Iteration, Value Iteration, MC methods, and Temporal Difference (TD) with Q-learning, as well as one example of an approximate solution method, Deep Q-Learning.
γ (gamma) is an important hyperparameter of RL, since it dictates the weight future rewards have in the calculation of the total reward at any given time of the agent's life. γ is part of a structural equation of RL: the Bellman equation. This equation integrates current and future rewards in a recurrent way, so that the agent can choose its actions without being shortsighted by only looking at the immediate reward. This lookahead behaviour is an attempt to create an intelligent agent.
None of the presented equations take Intrinsic Motivation into consideration, but introducing IM in the formulation is an easy step, since IM can be considered just another reward coming from the environment.
The reward formula introduces H, the horizon of the system, which can be infinite or finite. Naturally, the bigger H is, the more complex the RL problem becomes. H is also an important variable allowing the combination of model-free and model-based algorithms, as will be seen in the Literature Review part of this paper.
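For reference, the quantities discussed above can be written in the conventional textbook notation (standard RL notation, not a transcription of Figure 7):

```latex
% Discounted return over a (possibly infinite) horizon H, with discount factor gamma
G_t = \sum_{k=0}^{H} \gamma^{k}\, r_{t+k+1}, \qquad 0 \le \gamma \le 1

% Bellman optimality equation for the action-value function Q*
Q^{*}(s,a) = \mathbb{E}\!\left[ r_{t+1} + \gamma \max_{a'} Q^{*}(s_{t+1}, a')
             \,\middle|\, s_t = s,\ a_t = a \right]
```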
Figure 7 - Formulas/Equations and pseudocode [89]
Benefits and challenges of RL
Automation has, in recent decades, been achieved by creating sequential and conditional software instructions executed by agents (physical robots or software agents). This type of automation is viable in very controlled conditions, where there is very little stochasticity and the performed tasks are very specific and not extremely complex. RL aims to achieve more than the type of optimal control obtained in that setting.
Optimal control of an agent performing a number of very different tasks in a stochastic environment is a goal that will allow us to control systems able to perform more complex tasks in more complex environments, with a real chance of acting without human supervision. This is the benefit that RL research aims to achieve.
While many algorithms have already been created, some with better performance than others, the field of RL still lacks an algorithm that can optimally perform any given task in any given environment with any given dynamics.
The main challenges of RL have to do with how an agent can understand, process and apply, in a rational manner, the data it gets, the way humans are able to do. At least three top-level approaches have been studied in the field of RL: FRL, HRL and MRL.
RL Application Domains
Any activity that involves real-time decision making based on data is a candidate for the application of RL. If real-time operation is not needed, fields like Supervised or Unsupervised ML, as shown in Table 2, can be used to help with decision making, but when real-time operation is needed, RL (with or without IM) is a potential answer.
Here are some examples of potential applications of RL, amongst many others:
- Healthcare [86].
5. Literature Review
Year of 1989
FRL - Q-learning - Chris Watkins does not coin the "Q-learning" term in [56], but proves that the one-step Q-learning method can converge to the optimal value function and policy [56, p.112]. Addresses estimation of q* with action-value functions, which are now often called "Q-functions" [34, p.71]. Watkins draws a parallel between animal conditioning / learning and learning algorithms [56, p.3]. Explores learning as a problem of obtaining delayed rewards [56, p.24], in which the agent does not try to maximize immediate rewards [56, p.41]. Defines Q-functions as the expected return from starting at a given state and following a sequence of policies [56, p.46], searching in a look-ahead tree and recognizing the problem of combinatorial explosion at greater tree depths [56, pp.59-60]. Establishes the principles of a Q-table by describing a policy as a single action for each state, stored as a function from states to actions [56, p.102]. This dependence on a Q-table naturally limits the method to finite and discrete state spaces. Q-learning is an off-policy algorithm [33, p.1][39, p.206] and uses the greedy policy [5, p.3]. A limitation of one-step methods is that a reward only directly influences the value of the state-action pair that produced it, while the other pairs are only indirectly influenced, which can make learning slow, because several updates are needed to propagate a reward to the relevant previous actions and states [2, p.2].
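A minimal tabular one-step Q-learning loop, in the spirit of Watkins' formulation, is sketched below; the environment interface (reset/step) is an assumed placeholder.

```python
import random
from collections import defaultdict

def q_learning(env, actions=(0, 1), episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)                         # the Q-table: (state, action) -> value
    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy behaviour policy; the bootstrap target below is greedy (off-policy)
            if random.random() < epsilon:
                action = random.choice(actions)
            else:
                action = max(actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            best_next = max(Q[(next_state, a)] for a in actions)
            target = reward + (0.0 if done else gamma * best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])   # one-step update
            state = next_state
    return Q
```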
Year of 1992
FRL - REINFORCE - describes a set of model-free algorithms with a gradient-based approach compatible with backpropagation [58, pp.23-24]. These algorithms are described as belonging to an associative RL class [58, p.1]. The agent is considered to be a feedforward network with learning units, which can be identified as a Neural Network (NN) with its respective neurons [58, p.6].
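As an illustration of the score-function (likelihood-ratio) update behind REINFORCE, the sketch below updates a linear-softmax policy over discrete actions from one sampled trajectory; it is a simplified stand-in, not the exact network formulation of [58].

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_update(theta, states, actions, rewards, gamma=0.99, lr=0.01):
    """theta: (n_features, n_actions) parameters of a linear-softmax policy."""
    returns, G = np.zeros(len(rewards)), 0.0
    for t in reversed(range(len(rewards))):        # discounted return-to-go
        G = rewards[t] + gamma * G
        returns[t] = G
    for s, a, G in zip(states, actions, returns):
        probs = softmax(s @ theta)                 # pi(. | s)
        grad_log = np.outer(s, -probs)             # d log pi(a|s) / d theta
        grad_log[:, a] += s
        theta += lr * G * grad_log                 # gradient ascent on expected return
    return theta
```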
Year of 1993
HRL - Feudal Reinforcement Learning - this method explores a way to speed up RL by creating multiple Q-learning layers at different resolutions, thereby dividing RL problems into layers that know how to define tasks and layers that know how to solve them, in a lords / serfs type of relationship [38, p.1]. Each command at an upper layer is associated with a reward that the lower layer tries to maximize, and the authors claim that this method is more efficient than flat Q-learning [38, p.1].
Year of 1994
FRL - Modified Connectionist Q-Learning (MCQ-L) - also known as SARSA, this algorithm proposes a method to address high-dimensional continuous state spaces by using NNs as function approximators together with backpropagation [67, p.1]. It is an on-policy algorithm that fits the action-value function to the current policy, and then refines the policy greedily with respect to those action-values [44, p.1].
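The on-policy character of SARSA is visible in its update rule: the bootstrap target uses the action actually chosen by the current policy in the next state, rather than the greedy maximum used by Q-learning. A one-function sketch, reusing the Q-table convention of the Q-learning example above:

```python
def sarsa_update(Q, state, action, reward, next_state, next_action, alpha=0.1, gamma=0.99):
    # next_action is sampled from the same (e.g. epsilon-greedy) policy being improved
    target = reward + gamma * Q[(next_state, next_action)]
    Q[(state, action)] += alpha * (target - Q[(state, action)])
```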
Year of 1997
HRL - Hierarchical Abstract Machines (HAM) - uses a library of plans describing the decomposition of higher-level activities into lower-level activities [55, p.1]. HAMs are finite state machines/programs and work in a non-deterministic way [55, pp.1-2]. The paper states that with HAMs knowledge can be reused across different problems, requiring only a recombination of component solutions to attack a larger, more complex problem [55, p.1]. Machines are associated with skills: for example, when finding a wall the current machine can call a back-off machine or a follow-wall machine as a policy [55, p.3]. The authors of HAM also propose HAMQ, a combination of HAM and Q-learning [55, p.5].
Year of 1998
HRL - Options Framework - defines the term "options", which includes actions like picking up an object, going to have a meal, or traveling to some place, as well as micro / primitive actions such as muscle contractions and joint movements, as a way to provide temporal abstraction [39, p.181]. All types of options are used in planning a task and can also be changed during the execution of a plan, which, by being changed mid-execution, can be corrected to perform better [39, p.181]. This approach is well suited to stochastic, changing environments [39, p.207]. Options are evaluated during their application to produce more learning [39, p.181]. Besides the options mechanism, this paper also explores the concept of sub-goals [90, p.181]. Options are given, i.e., defined by the programmer [47, p.2].
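One possible way to encode an option as described above is shown below; the field names are our own illustration, since the framework does not prescribe a concrete data structure.

```python
from dataclasses import dataclass
from typing import Any, Callable, Set

@dataclass
class Option:
    initiation_set: Set[Any]              # states in which the option may be invoked
    policy: Callable[[Any], Any]          # intra-option policy: state -> (micro-)action
    termination: Callable[[Any], float]   # probability of terminating in a given state

    def can_start(self, state) -> bool:
        return state in self.initiation_set
```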
Year of 2005
FRL - Neural Fitted Q Iteration (NFQ) - an improvement on Fitted Q Iteration algorithms that proposes the use of a Neural Network to train a Q-value function, taking advantage of the ability of NNs to approximate nonlinear functions [27, p.317]. In [27] the learning failures and slow learning of NNs are discussed, and a mechanism of storing previous experiences (a sort of Experience Replay) is proposed to deal with these problems. It uses off-line batch learning to be able to use advanced supervised learning techniques that can converge faster than on-line learning [27, p.319], and also to avoid the destruction of previous learning that the latter approach can produce [27, p.327].
Year of 2008
HRL - MAXQ framework - uses a divide-and-conquer approach, decomposing an MDP into smaller MDPs [40, pp.227-228] and acting recursively [40, p.229]. It depends on the identification of goals and subgoals by the programmer [40, p.227]. It focuses on both state abstraction [40, p.227][40, p.260] and temporal abstraction [40, p.237]. It is an online, model-free algorithm [40, p.227].
Year of 2011
FRL - Neural Fitted Q Iteration with Continuous Actions (NFQCA) - this work's approach consists of batch learning of a Q-function based on experiences of state transitions, with a learning process for Neural Network based controllers comprising continuous action values [29, p.144]. It overcomes the limitation of Q-learning, which can only be used with discrete actions [29, p.147]. It uses a Critic implemented as a neural Q-function and an Actor represented as a neural policy function [29, p.147]. It is similar to the Deterministic Policy Gradient (DPG) algorithm [5, p.2].
FRL - Doubly Robust (DR) - proposes a method to perform optimal control in a partially observed environment [73, p.1]. It focuses on evaluating policies taking context into account, as well as previous actions and rewards [73, p.1]. It tackles the bias and variance problems deriving from models of rewards and models of past policies, respectively [73, p.1]. It defines a doubly robust estimator that uses both the estimate of expected reward and the estimate of action probabilities [73, p.3].
FRL - Probabilistic Inference for Learning Control (PILCO) - proposes a sample-efficient model-based RL algorithm for high-dimensional problems, with a probabilistic dynamics model to reduce model bias (by including uncertainty) and the use of approximate inference to perform policy evaluation [74, p.1]. PILCO can get stuck in local optima [74, p.7].
Year of 2013
FRL - Deep Q Network (DQN) - addresses the problem of processing high-dimensional sensory input with RL [8, p.1]. It uses a convolutional neural network that estimates future rewards by processing raw pixels [19, p.1]. It achieves high performance, and a new level of generality, by being able to perform across a set of different problems without problem-specific features, using the same network architecture, raw input, and parameter values such as the step size, discount rate and exploration parameters [65, p.437]. While it can process high-dimensional observation spaces, it can only handle discrete and low-dimensional action spaces, so applying it to continuous action spaces requires discretization (which brings the curse of dimensionality) [5, p.1]. It can be compared with NFQ, except that NFQ uses a batch update with a per-iteration cost proportional to the data set size, whereas DQN uses stochastic gradient updates that have a small cost per iteration and scale well with bigger data sets [8, p.4].
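The core of a DQN training step is the bootstrap target computed on a minibatch sampled from the replay buffer, using a periodically copied target network. The sketch below assumes `q_net` and `target_net` are callables mapping a batch of states to per-action value estimates (NumPy arrays); it is illustrative, not the original implementation.

```python
import numpy as np

def dqn_targets(target_net, rewards, next_states, dones, gamma=0.99):
    next_q = target_net(next_states)                      # shape (batch, n_actions)
    max_next_q = next_q.max(axis=1)                       # greedy value of the next state
    return rewards + gamma * (1.0 - dones) * max_next_q   # y = r + gamma * max_a' Q_target

def dqn_td_errors(q_net, states, actions, targets):
    q_values = q_net(states)                              # shape (batch, n_actions)
    chosen = q_values[np.arange(len(actions)), actions]   # Q(s, a) for the taken actions
    return targets - chosen                               # minimised e.g. with a squared loss
```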
Year of 2014
FRL - Deterministic Policy Gradient (DPG) - proposes an algorithm, for continuous action spaces, based on the expected gradient of the action-value function, which can be estimated more easily than the stochastic policy gradient [26, p.1]. DPG uses off-policy optimization with an actor-critic architecture and, by employing an exploratory behaviour policy, learns deterministic target policies [26, p.1]. It shows that deterministic policy gradient algorithms are better suited for high-dimensional action spaces than their stochastic counterparts [26, p.1].
Year of 2015
FRL - Generalized Advantage Estimation (GAE) - advances high-dimensional continuous control with RL [17, p.2] by using state-value functions, which have a lower-dimensional input than state-action value functions and are therefore easier to learn than the latter [17, p.12].
https://github.com/higgsfield/RL-Adventure-2/blob/master/2.gae.ipynb
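GAE is commonly implemented as an exponentially weighted sum of TD residuals, controlled by the discount gamma and a second parameter lambda that trades bias for variance; a short sketch follows, where `values` holds V(s_0..s_T) including the bootstrap value of the final state.

```python
import numpy as np

def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    T = len(rewards)
    advantages, gae = np.zeros(T), 0.0
    for t in reversed(range(T)):
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * values[t + 1] * not_done - values[t]   # TD residual
        gae = delta + gamma * lam * not_done * gae                          # recursive accumulation
        advantages[t] = gae
    return advantages
```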
FRL - Trust Region Policy Optimization (TRPO) - provides a method to optimize large nonlinear policies like NNs [4, p.1]. The method is similar to policy gradient methods [4, p.1]. It performs well in physical control tasks like walking, swimming, etc. [4, p.1][4, p.6]. TRPO does not need to learn an action-value function [5, p.8]. It establishes bounds that limit updates of the policy parameters, preventing the new policy from diverging too much from the existing policy [5, p.8], making training easier and avoiding the inability of gradient descent to make progress in more complex tasks [4, p.7].
FRL - Deep Deterministic Policy Gradient (DDPG) - extends DQN ideas and can operate in continuous action spaces, which DQN cannot, making DDPG more suitable for physical control tasks [5, p.1]; it also performs better with fewer experience steps (though still many) than DQN [5, pp.8-9]. It is an actor-critic algorithm, able to generalize well and operate without a model of the environment [5, p.1] across a diversity of domains [5, p.8]. DDPG combines the actor-critic structure with NN function approximation, following the strategies (i.e. off-policy training with a replay buffer, and a separate target network) implemented in DQN to make those NNs more stable; it also takes advantage of the Batch Normalization technique [5, pp.2-3] (see Glossary for Batch Normalization). DDPG is able to learn good policies from the pixels of a camera image and/or from joint angles or other low-dimensional inputs, and to operate in a stochastic environment [5, p.2]. It can learn in big state and action spaces, and uses mini-batches to be able to operate with a large network [5, p.3]. DDPG is not sample efficient [16, p.1].
https://github.com/higgsfield/RL-Adventure-2/blob/master/5.ddpg.ipynb
FRL - Stochastic Value Gradient (SVG) - defines a framework for continuous control policies that use backpropagation [65, p.1]. SVG proposes the use of a deterministic function with external noise so that stochasticity can be integrated into the Bellman equation [65, p.1]. Within the SVG framework several algorithms are presented, named SVG(∞), SVG(0) and SVG(1), each with a different Bellman equation recursion depth [65, p.3].
Year of 2016
HRL - Option-Critic Architecture - proposes an option-critic architecture that can learn the internal policies and termination conditions of options [42, p.1726] from [90]. It presents an alternative approach that merges the problem of discovering options with the problem of learning options, works with linear and nonlinear function approximators, and favours transfer learning [42, p.1726]. The Critic is implemented with a NN [42, p.1731]. Options (or skills [45, p.9]) can be learnt without any specification of subgoals or pseudo-rewards [42, p.1731].
HRL - STRategic Attentive Writer (STRAW) - this model can be used either in RL or in Natural Language Processing (NLP) [49, p.1] because it has a general sequence prediction architecture [49, p.6]. It proposes a deep RNN setup that can learn macro-actions and output a multi-step action plan, updating the plan with new observations once in a while and using rewards as the driver of the learning process [49, p.1]. The model may suffer from an inability to react to stochastic environments while it is executing a plan, but this focus on execution allows it to control computational costs [49, p.2]. It uses differentiable attentive reading [49, p.3] and A3C [49, p.5].
HRL - Hierarchical Deep Reinforcement Learning Network (H-DRLN) - an architecture in which the controller can choose either a skill or a primitive action (in this case it will execute for only one timestep) [50, p.1556]. H-DRLN proposes a mechanism called skill distillation (a variation of policy distillation) that allows knowledge to be retained efficiently and the system to scale to lifelong scenarios [50, p.1553].
FRL - Actor Critic with Experience Replay (ACER) - proposes an actor-critic architecture using DRL and Experience Replay for continuous and discrete control problems [19, pp.1-2]. It introduces the following mechanisms: stochastic dueling network architectures and truncated importance sampling with bias correction [19, p.1]. It also proposes a new TRPO method [19, p.1]. ACER can be seen as an off-policy counterpart of A3C [19, p.2]. The new TRPO method works by maintaining a running average of previous policies, used as a reference to prevent the updated policies from drifting too far away from it [19, p.5]. Humanoid agents, having a higher-dimensional action space, benefit significantly from truncation and bias correction [19, p.10]. ACER uses the Retrace algorithm [19, p.3].
FRL - Bootstrapped-DQN - proposes a method to learn faster while doing deep exploration with deep NNs in a complex environment [23, p.1], with a low computational cost [23, p.2], which allows scalability [23, p.8]. It applies the statistical resampling principle of bootstrapping (see Glossary) [23, p.2]. It implements a network with several heads (Neural Networks) that take input from a shared NN performing feature extraction from a frame [23, p.2].
FRL - Asynchronous Advantage Actor Critic (A3C) - proposes a method that implements asynchronous gradient descent in a parallel actor-learner setup [2, p.1]. The method performs well in several types of optimal control problems, like motor control or navigating from visual input [2, p.1]. It presents an alternative to Experience Replay (because of its limitations with on-policy learning) by using parallelism/multithreading, with multiple agents learning on several instances of the environment [2, p.1] and the possibility of using different exploration policies [2, p.3]. It can run efficiently on far less powerful hardware than other methods while performing at the same level or better [2, p.1][2, p.7]. Multithreading also showed benefits for older methods like one-step Q-learning, one-step Sarsa and n-step Q-learning [2, p.6].
https://github.com/awjuliani/DeepRL-Agents/blob/master/A3C-Doom.ipynb
FRL - Policy Gradient and Q-Learning (PGQL) - proposes a model-free algorithm that combines off-policy Q-learning on a replay buffer with on-policy gradient optimization [44, p.1].
DR2 can be used in high-dimensional state and action spaces [72, p.1].
Year of 2017
HRL - FeUdal Networks (FUNs) - proposes a method that works at two levels with different time resolutions: the manager, which selects goals and is motivated by an extrinsic reward, and the worker [41, p.1]. The worker produces primitive actions at each time step and is motivated by an intrinsic reward [41, p.1]. FUNs allow extremely long timescale credit assignment (which is good for sparse-reward problems) and memorisation [41, p.1]. The method uses a new type of RNN called a dilated LSTM, which enables backpropagation over hundreds of steps [41, p.1]. The Manager's goals are trained with an approximate transition policy gradient [41, p.2]. FUNs use A3C for optimisation [41, p.5]. The method is also good at transfer and multitask learning [41, p.8].
HRL - Abstract Markov Decision Processes (AMDP) - proposes a model-based [51, p.2] method for fast planning in large state-action spaces where the number of objects induces combinatorial growth [51, p.1] and the environment is stochastic [51, p.2]. It divides goals into subgoals that it solves recursively, and selects only the information of the state space that is needed for each decision [51, p.1]. It uses graph nodes to represent primitive actions or sub-problems [51, p.2].
FRL - 51-Atom Agent (C51) - explores the value distribution in approximate distributional RL, instead of using the expectation of state values as expectation-based RL does [9, p.1]. The C51 algorithm provides a more meaningful state-value analysis because, instead of calculating only one value per state, it embraces the notion that stochasticity can produce very disparate state values, and therefore a distributional approach makes more sense [9].
https://github.com/flyyufelix/C51-DDQN-Keras
FRL - EXpert ITeration (EXIT) - this algorithm is motivated by the dual-process theory of human thought [30, p.9] and uses tree search to assist NN training [30, p.1]. The method implements apprentices (which generalize policies with a NN) and experts (which explore with tree search), iterating through training in an Imitation Learning style with self-play [30, pp.2-3].
FRL - AlphaZero - proposes a general-purpose algorithm that starts without any knowledge of the environment, except for the available actions (rules), and improves through self-play [14, pp.1-2]. It uses a NN to calculate action values and MCTS to simulate self-play [14, pp.2-3].
FRL - Actor Critic using Kronecker-factored Trust Region (ACKTR) - as the name says, this algorithm uses an actor-critic architecture [20, p.1]. It can be used with continuous and discrete control policies, and employs deep NNs plus a variation of TRPO with a new technique called Kronecker-Factored Approximate Curvature (K-FAC) [20, pp.1-2].
https://github.com/openai/baselines/tree/master/baselines/acktr
FRL - Imagination-Augmented Agents (I2A) - combines model-free and model-based aspects by using imagined rollouts from a learned environment model, leading to an agent acting better in a complex world (i.e. better generalization) [11, pp.1-2]. It states that reward prediction is helpful but can be discarded while still obtaining good performance [11, p.6].
https://github.com/higgsfield/Imagination-Augmented-Agents
FRL - Proximal Policy Optimization (PPO) - proposes an algorithm that is as efficient and reliable as TRPO but simpler to implement [3, p.1], requiring only minor changes to a vanilla policy gradient implementation [3, p.8]. PPO can be used on continuous high-dimensional control problems [3, p.7]. PPO introduces a new family of policy gradient methods that alternate between sampling data through interaction with the environment and optimizing a "surrogate" objective function using stochastic gradient ascent [3, p.1]. Normal policy gradient methods execute one gradient update per data sample, whereas PPO uses an objective function that allows multiple epochs of minibatch updates [3, p.1].
https://github.com/higgsfield/RL-Adventure-2/blob/master/3.ppo.ipynb
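The heart of PPO is the clipped surrogate objective; a NumPy sketch over a batch of samples is given below, where the log-probabilities come from the updated and the data-collecting policies and the advantages are, for example, GAE estimates.

```python
import numpy as np

def ppo_clip_loss(log_probs_new, log_probs_old, advantages, clip_eps=0.2):
    ratio = np.exp(log_probs_new - log_probs_old)             # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -np.mean(np.minimum(unclipped, clipped))           # negated: minimise this loss
```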
FRL - Soft Q-Learning (SQL) - proposes an algorithm for continuous action and state spaces, with improved compositionality and exploration, allowing transfer learning from task to task [57, p.1]. It uses an energy-based model represented by an energy function or by a neural network as an energy function approximator [57, p.3]. https://github.com/haarnoja/softqlearning
FRL - Value Prediction Network (VPN) - proposes a method that combines the model-free and model-based approaches in the same neural network [66, p.1]. It explores option-conditional predictions of future reward values instead of predicting future observations or sensory data [66, p.1]. VPN performs well in stochastic environments when compared to purely model-free or model-based algorithms [66, p.1]. The new neural network architecture combines a dynamics model over state abstractions (the model-based part) with a mapping from these abstractions to rewards (the model-free part) [66, p.1]. VPN can be combined with other RL algorithms [66, p.4].
MRL - Model Agnostic Meta Learning (MAML) - proposes a model-agnostic method, which means it can be used in a variety of learning problems [68, p.1]. MAML seeks good generalization from a small amount of training data on a new task, and a trained model that is easy to fine-tune [68, p.1]. https://github.com/cbfinn/maml /
https://github.com/cbfinn/maml_rl
HRL - Fine Grained Action Repetition (FiGAR) - proposes a framework that allows an agent to choose the timeframe and repetition of an action [64, p.1]. It can be applied in combination with existing RL algorithms like DDPG, TRPO or A3C and improves policy optimization by working with macro-actions at different time frames [64, p.10]. FiGAR is not prepared to interrupt a macro-action, which makes it less efficient in a stochastic environment [64, p.10].
Year of 2018
FRL - Model-based Value Expansion (MVE) - a technique that takes advantage of an imagination mechanism, like the I2A algorithm, and of learned dynamics models [13, p.1], but restricts the horizon length, which ultimately improves the learning sample complexity [13, p.8]. MVE tries to combine the best of the model-free and model-based approaches [13, p.1], as MBMF does. MVE estimates the short-term horizon with a learned dynamics model (model-based) and the long-term horizon with Q-learning (model-free) [13, p.1].
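The idea of expanding value estimates over a short model-based horizon can be sketched as follows; `model.step`, `policy` and `q_value` are assumed placeholder callables, and the horizon length is the restriction mentioned above.

```python
def mve_style_target(model, policy, q_value, state, horizon=3, gamma=0.99):
    total, discount = 0.0, 1.0
    for _ in range(horizon):                        # model-based part: imagined rollout
        action = policy(state)
        state, reward = model.step(state, action)   # learned dynamics model prediction
        total += discount * reward
        discount *= gamma
    action = policy(state)
    return total + discount * q_value(state, action)   # model-free tail estimate (Q-learning)
```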
FRL - World Models - this method creates a virtual environment and dynamics model to train the RL agent [1, p.2]. It does so from high-dimensional image data [1, p.7], sometimes capturing details that are not important while missing others that are relevant [1, p.8]. The agent also needs to explore the real world so that the model can be refined [1, p.8]. This approach differs from many other models that train the agent by having it interact with the real environment [1, p.1]. In order to prevent the agent from exploiting the virtual model's flaws (by generating an adversarial policy [1, p.6]), training occurs in a noise-added version of that same virtual environment [1, p.2]. Training in the virtual environment may, in some cases, lead to inadequate policies that fail in the real environment [1, p.6]. By having the virtual environment, the model is able to plan ahead, since it has access to a probability distribution of future events [1, p.4].
https://learningtopredict.github.io/ https://worldmodels.github.io
FRL - Twin Delayed DDPG (TD3) - proposes an algorithm (extending DDPG [6, p.6]) with an actor-critic architecture [6, p.2] that uses two critic NNs [6, p.6]. It employs target networks to limit errors arising from function approximation and stochastic optimization [6, p.8]. By using two critics and taking the minimum of their values, it avoids an overestimation problem [6, p.1]. It aims at solving problems affecting continuous-control actor-critic settings, like overestimation bias and accumulation of error [6, p.1]. It is assumed that even an overestimated value may be used as an upper bound for the true value [6, p.1].
https://github.com/higgsfield/RL-Adventure-2/blob/master/6.td3.ipynb
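The clipped double-Q target used by TD3 can be sketched as below: the minimum over the two target critics counters overestimation, and clipped noise on the target action provides target-policy smoothing. All networks are assumed placeholder callables; the noise constants follow common defaults, not necessarily those of [6].

```python
import numpy as np

def td3_target(q1_target, q2_target, actor_target, reward, next_state, done,
               gamma=0.99, noise_std=0.2, noise_clip=0.5):
    noise = np.clip(np.random.normal(0.0, noise_std), -noise_clip, noise_clip)
    next_action = actor_target(next_state) + noise                    # smoothed target action
    min_q = min(q1_target(next_state, next_action),
                q2_target(next_state, next_action))                   # pessimistic estimate
    return reward + gamma * (1.0 - done) * min_q
```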
FRL - Soft Actor-Critic (SAC) - proposes a sample-efficient, stable, model-free, off-policy algorithm with an actor-critic architecture and deep NNs [7, p.1]. It explores the concept of maximum entropy to create robust policies in high-dimensional and continuous spaces [7, pp.1-2]. SAC performs at the same level as DDPG, PPO and TD3 in simple tasks, but performs better in more complex ones [7, p.7]. SAC learns stochastic policies that are converted into deterministic policies in the end (for performance reasons) [7, p.7]. https://github.com/haarnoja/sac
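The maximum-entropy objective underlying SAC augments the usual return with the entropy of the policy's action distribution, weighted by a temperature α (standard formulation, shown here for reference):

```latex
J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}
         \Big[ r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \Big]
```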
HRL - Modulated Policy Hierarchies (MPH) - proposes a method to solve problems with sparse rewards, applying intrinsic motivation and modulation signals communicated through bit-vectors [48, pp.1-2]. Instead of selecting among the available skills in an exclusive way, MPH uses bit vectors to combine the available skills (implemented as NNs) [48, pp.1-2]. Each level of the hierarchy is trained separately using PPO in order to avoid non-stationarity problems [48, p.1]. The lower level (worker) combines the modulation signals with the environment state to produce a final action [48, p.4]. MPH implements time scales (temporal abstraction) by activating higher-level policies less frequently [48, p.4]. As intrinsic motivation, the method uses a curiosity-guided exploration bonus, which accelerates learning in a sparse-reward environment [48, p.4].
is less frequently used [59, p.1]. This method can be used in discrete and continuous settings [59,
p.5].
FRL - Probabilistic Ensembles with Trajectory Sampling (PETS) - this algorithm explores uncertainty-aware deep neural network dynamics models to tackle the problem of poor asymptotic performance in model-based architectures [60, pp.1-2]; other algorithms use Bayesian nonparametric models instead [60, p.9]. It applies several model-free techniques to a model-based setting (i.e. ensembling, outputting Gaussian distribution parameters, MPC) [60, p.9]. PETS does not use policy learning [60, p.9]. https://github.com/kchua/handful-of-trials
FRL - STochastic Ensemble Value Expansion (STEVE) - proposes an algorithm that mixes model-based and model-free approaches in order to mitigate the bias errors of the model-based approach and the sample inefficiency of model-free methods [61, p.1]. The environment model is only used occasionally, with rollouts of different horizon sizes, to avoid introducing too much of the error inherent to model-based methods [61, p.1]. STEVE is an extension of MVE [61, p.2]. The algorithm uses uncertainty-awareness mechanisms [61].
MRL - Proximal Meta-Policy Search (ProMP) - proposes a meta-RL algorithm with improved credit assignment and improved meta-policy gradient estimation [71, p.1]. ProMP achieves effective identification of the tasks to be learned by optimizing the pre-update sampling distribution [71, p.9].
HRL - Hierarchical Self-Play (HSP) - this method tackles complex tasks with sparse rewards [53, p.8]. It uses unsupervised asymmetric self-play (to explore the decomposition of tasks) and a continuous sub-goal vector [53, pp.1-2]. The agent sets its own goals, forced by adversarial rewards (according to what the environment allows), and then tries to achieve these goals within a time limit [53, p.1]. The low-level policies have access to the current state and to the goal vector (an encoded target state) [53, pp.1-2]. The higher-level policies are trained in a sparse-reward logic [53, p.2]. HSP defines two levels of policies implemented as NNs that are first prepared by exploring and creating skills (by having two lower-level actors playing against each other) and later trained (at the higher level) using external rewards [53, p.2]. HSP breaks episodes into smaller segments in order to avoid problems arising from higher complexity [53, p.4]. It has limitations in the self-play task, because some knowledge of the domain needs to be embedded in order to recognize whether the task has been completed successfully [53, p.8].
Year of 2019
FRL - Stochastic Lower Bounds Optimization (SLBO) - proposes a model-based algorithm that can be used in continuous control settings [62, p.10]. SLBO learns the models using a multi-step prediction loss [62, p.8]. It makes use of TRPO for policy optimization, introducing an entropy term into the objective function [62, p.9]. SLBO is an instantiation of a framework aimed at giving theoretical guarantees when designing and analyzing model-based algorithms [62, p.1].
FRL - Model-Based Policy Optimization (MBPO) - proposes using short model-generated rollouts branched from real data, as a better solution than building one big model of the environment [32, p.1]. MBPO combines the generalization strength of model-free algorithms with the learning speed of model-based ones [32, p.2]. This approach allows it to be used in high-dimensional problems [32, p.2]. MBPO also works well with long-horizon tasks [32, p.9].
FRL - Bootstrap Dual Policy Iteration (BDPI) - a model-free, very sample-efficient algorithm for continuous state spaces and discrete action spaces [22, p.1]. BDPI uses a value-based approach with an actor-critic architecture having several critics (implemented with a flavour of DQN) operating off-policy [22, p.1]. The actor is trained using all the critics [22, p.2]. With BDPI, hyperparameters do not need a lot of tuning, since the algorithm is very robust to hyperparametrization [22, p.2]. BDPI uses an experience buffer [22, p.3].
https://github.com/vub-ai-lab/bdpi
FRL - Deterministic Value Gradients (DVG) - proposes a method for infinite-horizon problems [16, p.3316]. DVG is a model-based algorithm that uses an actor-critic architecture [16, p.3320].
FRL - Deep Soft Policy Gradient (DSPG) - proposes a maximum-entropy, model-free RL algorithm with an actor-critic architecture and off-policy optimization [25, p.3425]. DSPG combines policy-based and value-based methods [25, pp.3425-3426]. DSPG can be used in continuous control problems [25, p.3430].
HRL - Hierarchical Actor-Critic (HAC) - proposes an HRL method (based on Hindsight Action Transitions) that solves the non-stationarity problem arising when more than one level of policies is learnt at the same time; it does this by training each layer as if the lower layer were already stable and optimal, using a simulated transition function [36, pp.1-3]. HAC can be used in continuous state and action spaces with sparse rewards, and has successfully implemented an architecture with more than two layers learning policies at different levels in parallel [36, pp.1-2]. HAC can also be used in discrete problems [36, p.10]. Each layer has access to the external current state, but its goals are provided by the upper layer as sub-goals [36, p.2].
HRL - Model-Free HRL framework - proposes a model-free HRL algorithm that focuses on unsupervised subgoal discovery, skill learning and the use of intrinsic motivation [54].
Year of 2020
FRL - Meta Q-Learning (MQL) - proposes an off-policy algorithm that takes context into account when meta-training policies [31, p.1]. Context serves as a meta-training technique [31, p.9]. MQL handles the hyper-parameter sensitivity problem of off-policy methods by adapting to the distribution shift [31, p.9]. MQL reuses data from the replay buffer [31, p.10].
FRL - Advantage-weighted Behavior Model (ABM) - addresses learning from batches of off-policy data, which can produce worse results than on-policy processing [28, p.1]. ABM explores the policies in a batch to create a weighted model, used as a prior, that can correct the policy being learned, and in this way combines off-policy and on-policy strengths [28, p.2].
Table 3 - Flat Reinforcement Learning (FRL), Hierarchical Reinforcement Learning (HRL) and Meta Reinforcement Learning (MRL) papers in this study
Table 3 shows how much work has been done in each major branch of RL. FRL stands out as the approach with the most research, while MRL has the least. Since FRL is math-intensive, this shows that RL research has been more focused on the math-driven approach than on the logic-driven approach. This does not mean that FRL has no logical side to it, but math may be over-valued, and researchers may need to step off the FRL track and research more on HRL, MRL or other approaches yet to come.
Figure 9 - Word cloud of the bibliography
Figure 9 illustrates, as a word cloud, the key concepts found in the examined bibliography. "Learning", "Policy", "Value", "State", "Model", "Action", "Agent", "Data", "Gradient", "Time", "Reward", "Deep" and "Neural" are the most prominent words in this cloud and define the most important concepts of RL. The concepts of "Gradient", "Deep" and "Neural", while not core RL concepts, are, as seen in the cloud, very important to RL research, because gradients have been one of the most important optimization methods in RL, and "Deep" / "Neural" show how important Neural Networks have been to the advancement of RL.
6. Conclusion
The findings reveal several points worth noting:
● Off-policy methods are more efficient at learning than on-policy ones, but off-policy methods tend to be less effective than on-policy methods, because they do not integrate the knowledge related to the policy currently in use [28, p.1];
● HRL allows us to address more complex tasks than FRL, with fewer learning episodes, but HRL algorithms are more complex due to working with multiple layers [36, p.1];
● High-dimensional continuous state / action spaces are much more difficult to address than their discrete counterparts;
● Model-free algorithms are able to generalize better than model-based ones, but are slower at learning policies [32, p.1]. Model-based algorithms also suffer from possible model bias [16, p.3317]. Model-free algorithms are very sensitive to hyperparameters and are also very sample inefficient [25, p.3425]. Generally, model-based methods achieve worse asymptotic performance [60, p.9]. Model-based algorithms work well in environments whose dynamics are easy to model, but in complex and noisy environments the learnt environment model will prove less performant [61, p.1];
Bibliography
[1] @incollection{NIPS2018_7512,
title = {Recurrent World Models Facilitate Policy Evolution},
author = {Ha, David and Schmidhuber, J\"{u}rgen},
booktitle = {Advances in Neural Information Processing Systems 31},
editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
pages = {2450--2462}, year = {2018}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7512-recurrent-world-models-facilitate-policy-evolution.pdf}}
[2] @InProceedings{pmlr-v48-mniha16,
title = {Asynchronous Methods for Deep Reinforcement Learning},
author = {Volodymyr Mnih and Adria Puigdomenech Badia and Mehdi Mirza and Alex Graves and Timothy Lillicrap
and Tim Harley and David Silver and Koray Kavukcuoglu},
booktitle = {Proceedings of The 33rd International Conference on Machine Learning},
pages = {1928--1937}, year = {2016}, editor = {Maria Florina Balcan and Kilian Q. Weinberger},
volume = {48}, series = {Proceedings of Machine Learning Research}, address = {New York, New York, USA},
month = {20--22 Jun}, publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v48/mniha16.pdf},
url = {http://proceedings.mlr.press/v48/mniha16.html}, Eprint = {arXiv:1602.01783}}
[3] @misc{1707.06347,
Author = {John Schulman and Filip Wolski and Prafulla Dhariwal and Alec Radford and Oleg Klimov},
Title = {Proximal Policy Optimization Algorithms}, Year = {2017}, Eprint = {arXiv:1707.06347},}
[4] @InProceedings{pmlr-v37-schulman15,
title = {Trust Region Policy Optimization},
author = {John Schulman and Sergey Levine and Pieter Abbeel and Michael Jordan and Philipp Moritz},
booktitle = {Proceedings of the 32nd International Conference on Machine Learning},
pages = {1889--1897}, year = {2015}, editor = {Francis Bach and David Blei}, volume = {37},
series = {Proceedings of Machine Learning Research}, address = {Lille, France}, month = {07--09 Jul},
publisher = {PMLR}, pdf = {http://proceedings.mlr.press/v37/schulman15.pdf},
url = {http://proceedings.mlr.press/v37/schulman15.html}, Eprint = {arXiv:1502.05477}}
[5] @article{Lillicrap2015ContinuousCW,
title={Continuous control with deep reinforcement learning},
author={Timothy P. Lillicrap and Jonathan J. Hunt and Alexander Pritzel and Nicolas Manfred Otto Heess and Tom
Erez and Yuval Tassa and David Silver and Daan Wierstra}, journal={CoRR}, year={2015},
volume={abs/1509.02971}, Eprint = {arXiv:1509.02971}}
[6] @InProceedings{pmlr-v80-fujimoto18a,
title = {Addressing Function Approximation Error in Actor-Critic Methods},
author = {Fujimoto, Scott and van Hoof, Herke and Meger, David},
booktitle = {Proceedings of the 35th International Conference on Machine Learning},
pages = {1587--1596}, year = {2018}, editor = {Dy, Jennifer and Krause, Andreas}, volume = {80},
series = {Proceedings of Machine Learning Research}, address = {Stockholmsmässan, Stockholm Sweden},
month = {10--15 Jul}, publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v80/fujimoto18a/fujimoto18a.pdf},
url = {http://proceedings.mlr.press/v80/fujimoto18a.html}, Eprint = {arXiv:1802.09477}}
[7] @InProceedings{pmlr-v80-haarnoja18b,
title = {Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor},
author = {Haarnoja, Tuomas and Zhou, Aurick and Abbeel, Pieter and Levine, Sergey},
booktitle = {Proceedings of the 35th International Conference on Machine Learning},pages = {1861--1870},
year = {2018}, editor = {Dy, Jennifer and Krause, Andreas}, volume = {80},
series = {Proceedings of Machine Learning Research}, address = {Stockholmsmässan, Stockholm Sweden},
month = {10--15 Jul}, publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v80/haarnoja18b/haarnoja18b.pdf},
url = {http://proceedings.mlr.press/v80/haarnoja18b.html}, Eprint = {arXiv:1801.01290}}
[8] @incollection{mnih-atari-2013,
title = {Playing Atari With Deep Reinforcement Learning},
author = {Volodymyr Mnih and Koray Kavukcuoglu and David Silver and Alex Graves and Ioannis Antonoglou and
Daan Wierstra and Martin Riedmiller},
booktitle = {NIPS Deep Learning Workshop},
year = {2013},
url = {https://arxiv.org/pdf/1312.5602.pdf}, }
[9] @InProceedings{pmlr-v70-bellemare17a,
title = {A Distributional Perspective on Reinforcement Learning},
author = {Marc G. Bellemare and Will Dabney and R{\'e}mi Munos},
booktitle = {Proceedings of the 34th International Conference on Machine Learning},
pages = {449--458}, year = {2017}, editor = {Doina Precup and Yee Whye Teh}, volume = {70},
series = {Proceedings of Machine Learning Research},
address = {International Convention Centre, Sydney, Australia},
month = {06--11 Aug}, publisher = {PMLR},
pdf = {http://proceedings.mlr.press/v70/bellemare17a/bellemare17a.pdf},
url = {http://proceedings.mlr.press/v70/bellemare17a.html}}
[10] @inproceedings{DBLP:conf/aaai/DabneyRBM18,
author = {Will Dabney and Mark Rowland and Marc G. Bellemare and R{\'{e}}mi Munos},
editor = {Sheila A. McIlraith and Kilian Q. Weinberger},
title = {Distributional Reinforcement Learning With Quantile Regression},
booktitle = {Proceedings of the Thirty-Second {AAAI} Conference on Artificial Intelligence,
(AAAI-18), the 30th innovative Applications of Artificial Intelligence
(IAAI-18), and the 8th {AAAI} Symposium on Educational Advances in
Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February
2-7, 2018},
pages = {2892--2901}, publisher = {{AAAI} Press}, year = {2018},
url = {https://www.aaai.org/ocs/index.php/AAAI/AAAI18/paper/view/17184},
timestamp = {Tue, 23 Oct 2018 06:42:15 +0200}, biburl = {https://dblp.org/rec/conf/aaai/DabneyRBM18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[11] @incollection{NIPS2017_7152,
title = {Imagination-Augmented Agents for Deep Reinforcement Learning},
author = {Racani\`{e}re, S\'{e}bastien and Weber, Theophane and Reichert, David and Buesing, Lars and Guez,
Arthur and Jimenez Rezende, Danilo and Puigdom\`{e}nech Badia, Adri\`{a} and Vinyals, Oriol and Heess, Nicolas and Li, Yujia
and Pascanu, Razvan and Battaglia, Peter and Hassabis, Demis and Silver, David and Wierstra, Daan},
booktitle = {Advances in Neural Information Processing Systems 30},
editor = {I. Guyon and U. V. Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan
and R. Garnett},
pages = {5690--5701}, year = {2017}, publisher = {Curran Associates, Inc.}, url = {http://papers.nips.cc/paper/
7152-imagination-augmented-agents-for-deep-reinforcement-learning.pdf}}
[12] @inproceedings{DBLP:conf/icra/NagabandiKFL18,
author = {Anusha Nagabandi and Gregory Kahn and Ronald S. Fearing and Sergey Levine},
title = {Neural Network Dynamics for Model-Based Deep Reinforcement Learning
with Model-Free Fine-Tuning},
booktitle = {2018 {IEEE} International Conference on Robotics and
Automation, {ICRA} 2018, Brisbane, Australia, May 21-25, 2018},
pages = {7559--7566}, publisher = {{IEEE}}, year = {2018},
url = {https://doi.org/10.1109/ICRA.2018.8463189}, doi = {10.1109/ICRA.2018.8463189},
timestamp = {Wed, 16 Oct 2019 14:14:51 +0200}, biburl = {https://dblp.org/rec/conf/icra/NagabandiKFL18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}, Eprint = {arXiv:1708.02596}}
[13] @article{DBLP:journals/corr/abs-1803-00101,
author = {Vladimir Feinberg and Alvin Wan and Ion Stoica and Michael I. Jordan and Joseph E. Gonzalez
and Sergey Levine},
title = {Model-Based Value Estimation for Efficient Model-Free Reinforcement Learning},
journal = {CoRR}, volume = {abs/1803.00101}, year = {2018}, url = {http://arxiv.org/abs/1803.00101},
archivePrefix = {arXiv}, eprint = {1803.00101}, timestamp = {Mon, 13 Aug 2018 16:47:50 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1803-00101.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[14] @article{DBLP:journals/corr/abs-1712-01815,
author = {David Silver and Thomas Hubert and Julian Schrittwieser and Ioannis Antonoglou and
Matthew Lai and Arthur Guez and Marc Lanctot and Laurent Sifre and Dharshan Kumaran and
Thore Graepel and Timothy P. Lillicrap and Karen Simonyan and Demis Hassabis},
title = {Mastering Chess and Shogi by Self-Play with a General Reinforcement
Learning Algorithm},
journal = {CoRR}, volume = {abs/1712.01815}, year = {2017}, url = {http://arxiv.org/abs/1712.01815},
archivePrefix = {arXiv}, eprint = {1712.01815}, timestamp = {Mon, 13 Aug 2018 16:46:01 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1712-01815.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}, eprint = {arXiv:1712.01815}}
[15] @book{Goodfellow-et-al-2016,
title={Deep Learning}, author={Ian Goodfellow and Yoshua Bengio and Aaron Courville},
publisher={MIT Press}, note={\url{http://www.deeplearningbook.org}}, year={2016}}
[16] @inproceedings{DBLP:conf/aaai/CaiPT20,
author = {Qingpeng Cai and Ling Pan and Pingzhong Tang},
title = {Deterministic Value-Policy Gradients},
booktitle = {The Thirty-Fourth {AAAI} Conference on Artificial Intelligence, {AAAI}
2020, The Thirty-Second Innovative Applications of Artificial Intelligence
Conference, {IAAI} 2020, The Tenth {AAAI} Symposium on Educational
Advances in Artificial Intelligence, {EAAI} 2020, New York, NY, USA,
February 7-12, 2020},
pages = {3316--3323}, publisher = {{AAAI} Press}, year = {2020},
url = {https://aaai.org/ojs/index.php/AAAI/article/view/5732}, timestamp = {Thu, 04 Jun 2020 16:49:55 +0200},
biburl = {https://dblp.org/rec/conf/aaai/CaiPT20.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}, eprint = {arXiv:1909.03939}}
[17] @inproceedings{DBLP:journals/corr/SchulmanMLJA15,
author = {John Schulman and Philipp Moritz and Sergey Levine and Michael I. Jordan and Pieter Abbeel},
editor = {Yoshua Bengio and Yann LeCun},
title = {High-Dimensional Continuous Control Using Generalized Advantage Estimation},
booktitle = {4th International Conference on Learning Representations, {ICLR} 2016,
San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings},
year = {2016}, url = {http://arxiv.org/abs/1506.02438}, timestamp = {Thu, 25 Jul 2019 14:25:38 +0200},
biburl = {https://dblp.org/rec/journals/corr/SchulmanMLJA15.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[18] @inproceedings{DBLP:conf/nips/HoE16,
author = {Jonathan Ho and Stefano Ermon},
editor = {Daniel D. Lee and Masashi Sugiyama and Ulrike von Luxburg and Isabelle Guyon and Roman Garnett},
title = {Generative Adversarial Imitation Learning},
booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information
Processing Systems 2016, December 5-10, 2016,
Barcelona, Spain},
pages = {4565--4573}, year = {2016},
url = {http://papers.nips.cc/paper/6391-generative-adversarial-imitation-learning}}
[19] @inproceedings{DBLP:conf/iclr/0001BHMMKF17,
author = {Ziyu Wang and Victor Bapst and Nicolas Heess and Volodymyr Mnih and R{\'{e}}mi Munos and
Koray Kavukcuoglu and Nando de Freitas},
title = {Sample Efficient Actor-Critic with Experience Replay},
booktitle = {5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2017}, eprint = {arXiv:1611.01224}}
[20] @incollection{NIPS2017_7112,
title = {Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation},
author = {Wu, Yuhuai and Mansimov, Elman and Grosse, Roger B and Liao, Shun and Ba, Jimmy},
booktitle = {Advances in Neural Information Processing Systems 30},
editor = {I. Guyon and U. V. Luxburg and S. Bengio and H. Wallach and R. Fergus and S. Vishwanathan
and R. Garnett},
pages = {5279--5288}, year = {2017}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7112-scalable-trust-region-method-for-deep-reinforcement-learning-using-kronecker-factored-approximation.pdf}}
[21] @article{Kaplan2019,
doi = {10.1016/j.bushor.2018.08.004}, url = {https://doi.org/10.1016/j.bushor.2018.08.004}, year = {2019},
month = jan, publisher = {Elsevier {BV}}, volume = {62}, number = {1}, pages = {15--25},
author = {Andreas Kaplan and Michael Haenlein}, journal = {Business Horizons},
title = {Siri, Siri, in my hand: Who's the fairest in the land? On the interpretations, illustrations, and
implications of artificial intelligence}}
[22] @inproceedings{DBLP:conf/pkdd/SteckelmacherPR19,
author = {Denis Steckelmacher and H{\'{e}}l{\`{e}}ne Plisnier and Diederik M. Roijers and Ann Now{\'{e}}},
editor = {Ulf Brefeld and {\'{E}}lisa Fromont and Andreas Hotho and Arno J. Knobbe and
Marloes H. Maathuis and C{\'{e}}line Robardet},
title = {Sample-Efficient Model-Free Reinforcement Learning with Off-Policy Critics},
booktitle = {Machine Learning and Knowledge Discovery in Databases - European Conference, {ECML}
{PKDD} 2019, W{\"{u}}rzburg, Germany, September 16-20, 2019, Proceedings, Part {III}},
series = {Lecture Notes in Computer Science}, volume = {11908}, pages= {19--34}, publisher = {Springer},
year = {2019}, url = {https://doi.org/10.1007/978-3-030-46133-1\_2}, doi = {10.1007/978-3-030-46133-1\_2},
timestamp = {Mon, 04 May 2020 14:19:13 +0200}, biburl = {https://dblp.org/rec/conf/pkdd/SteckelmacherPR19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org},
eprint = {arXiv:1903.04193}}
[23] @incollection{NIPS2016_6501,
title = {Deep Exploration via Bootstrapped DQN},
author = {Osband, Ian and Blundell, Charles and Pritzel, Alexander and Van Roy, Benjamin},
booktitle = {Advances in Neural Information Processing Systems 29},
editor = {D. D. Lee and M. Sugiyama and U. V. Luxburg and I. Guyon and R. Garnett},
pages = {4026--4034}, year = {2016}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/6501-deep-exploration-via-bootstrapped-dqn.pdf}}
[24] @misc{1903.09366,
author = {Heecheol Kim and Masanori Yamada and Kosuke Miyoshi and Hiroshi Yamakawa},
title = {Macro Action Reinforcement Learning with Sequence Disentanglement using Variational Autoencoder},
year = {2019}, eprint = {arXiv:1903.09366}}
[25] @inproceedings{ijcai2019-475,
title = {Soft Policy Gradient Method for Maximum Entropy Deep Reinforcement Learning},
author = {Shi, Wenjie and Song, Shiji and Wu, Cheng},
booktitle = {Proceedings of the Twenty-Eighth International Joint Conference on
Artificial Intelligence, {IJCAI-19}},
publisher = {International Joint Conferences on Artificial Intelligence Organization},
pages = {3425--3431}, year = {2019}, month = {7}, doi = {10.24963/ijcai.2019/475},
url = {https://doi.org/10.24963/ijcai.2019/475}}
[26] @inproceedings{10.5555/3044805.3044850,
author = {Silver, David and Lever, Guy and Heess, Nicolas and Degris, Thomas and Wierstra, Daan and
Riedmiller, Martin},
title = {Deterministic Policy Gradient Algorithms}, year = {2014}, publisher = {JMLR.org},
booktitle = {Proceedings of the 31st International Conference on International Conference on
Machine Learning - Volume 32},
pages = {I–387–I–395}, numpages = {9}, location = {Beijing, China}, series = {ICML’14}}
[27] @inproceedings{DBLP:conf/ecml/Riedmiller05,
author = {Martin A. Riedmiller},
editor = {Jo{\~{a}}o Gama and Rui Camacho and Pavel Brazdil and Al{\'{\i}}pio Jorge and
Lu{\'{\i}}s Torgo},
title = {Neural Fitted {Q} Iteration - First Experiences with a Data Efficient
Neural Reinforcement Learning Method},
booktitle = {Machine Learning: {ECML} 2005, 16th European Conference on Machine
Learning, Porto, Portugal, October 3-7, 2005, Proceedings},
series = {Lecture Notes in Computer Science},
volume = {3720}, pages = {317--328}, publisher = {Springer}, year = {2005},
url = {https://doi.org/10.1007/11564096\_32}, doi = {10.1007/11564096\_32},
timestamp = {Tue, 14 May 2019 10:00:54 +0200}, biburl = {https://dblp.org/rec/conf/ecml/Riedmiller05.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[28] @article{DBLP:journals/corr/abs-2002-08396,
author = {Noah Y. Siegel and Jost Tobias Springenberg and Felix Berkenkamp and
Abbas Abdolmaleki and Michael Neunert and Thomas Lampe and Roland Hafner and
Nicolas Heess and Martin A. Riedmiller},
title = {Keep Doing What Worked: Behavioral Modelling Priors for Offline Reinforcement
Learning},
journal = {CoRR}, volume = {abs/2002.08396}, year = {2020}, url = {https://arxiv.org/abs/2002.08396},
archivePrefix = {arXiv}, eprint = {2002.08396}, timestamp = {Mon, 02 Mar 2020 16:46:06 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-2002-08396.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[29] @article{Hafner2011,
doi = {10.1007/s10994-011-5235-x}, url = {https://doi.org/10.1007/s10994-011-5235-x},
year = {2011}, month = feb, publisher = {Springer Science and Business Media {LLC}}, volume = {84},
number = {1-2}, pages = {137--169}, author = {Roland Hafner and Martin Riedmiller},
title = {Reinforcement learning in feedback control},
journal = {Machine Learning}}
[30] @inproceedings{DBLP:conf/nips/AnthonyTB17,
author = {Thomas Anthony and Zheng Tian and David Barber},
editor = {Isabelle Guyon and Ulrike von Luxburg and Samy Bengio and Hanna M. Wallach and
Rob Fergus and S. V. N. Vishwanathan and Roman Garnett},
title = {Thinking Fast and Slow with Deep Learning and Tree Search},
booktitle = {Advances in Neural Information Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, 4-9 December 2017,
Long Beach, CA, {USA}},
pages = {5360--5370}, year = {2017},
url={http://papers.nips.cc/paper/7120-thinking-fast-and-slow-with-deep-learning-and-tree-search},
timestamp = {Fri, 06 Mar 2020 16:56:07 +0100}, biburl = {https://dblp.org/rec/conf/nips/AnthonyTB17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[31] @inproceedings{Fakoor2020Meta-Q-Learning,
title={Meta-Q-Learning},
author={Rasool Fakoor and Pratik Chaudhari and Stefano Soatto and Alexander J. Smola},
booktitle={International Conference on Learning Representations}, year={2020},
url={https://openreview.net/forum?id=SJeD3CEFPH}}
[32] @incollection{NIPS2019_9416,
title = {When to Trust Your Model: Model-Based Policy Optimization},
author = {Janner, Michael and Fu, Justin and Zhang, Marvin and Levine, Sergey},
booktitle = {Advances in Neural Information Processing Systems 32},
editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d\textquotesingle Alch\'{e}-Buc and E. Fox and R.
Garnett},
pages = {12519--12530}, year = {2019}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/9416-when-to-trust-your-model-model-based-policy-optimization.pdf}}
[33] @inproceedings{1611.01626,
title={Combining policy gradient and Q-learning},
author={Brendan O'Donoghue and Remi Munos and Koray Kavukcuoglu and Volodymyr Mnih},
booktitle={ICLR}, year={2016}, eprint = {arXiv:1611.01626}}
[34] @book{Sutton_Barto_2018,
place={Cambridge, Massachusetts}, edition={Second edition},
series={Adaptive computation and machine learning series}, title={Reinforcement learning: an introduction},
ISBN={9780262039246}, publisher={The MIT Press}, author={Sutton, Richard S. and Barto, Andrew G.},
year={2018}, collection={Adaptive computation and machine learning series}}
[35] @article{Mnih_et_al_2015,
title={Human-level control through deep reinforcement learning},
volume={518}, ISSN={0028-0836, 1476-4687}, DOI={10.1038/nature14236}, number={7540}, journal={Nature},
author={Mnih, Volodymyr and Kavukcuoglu, Koray and Silver, David and Rusu, Andrei A. and Veness, Joel
and Bellemare, Marc G. and Graves, Alex and Riedmiller, Martin and Fidjeland, Andreas K.
and Ostrovski, Georg and et al.},
year={2015}, month={Feb}, pages={529–533}}
[36] @inproceedings{DBLP:conf/iclr/LevyKPS19,
author = {Andrew Levy and George Dimitri Konidaris and Robert Platt Jr. and Kate Saenko},
title = {Learning Multi-Level Hierarchies with Hindsight},
booktitle = {7th International Conference on Learning Representations, {ICLR} 2019,
New Orleans, LA, USA, May 6-9, 2019},
publisher = {OpenReview.net}, year = {2019},
url = {https://openreview.net/forum?id=ryzECoAcY7},
timestamp = {Tue, 19 Nov 2019 08:34:00 +0100},
biburl = {https://dblp.org/rec/conf/iclr/LevyKPS19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[37] @article{Liu_Bellet_2019,
title={Escaping the Curse of Dimensionality in Similarity Learning: Efficient Frank-Wolfe Algorithm
and Generalization Bounds},
volume={333}, ISSN={09252312}, DOI={10.1016/j.neucom.2018.12.060}, journal={Neurocomputing},
author={Liu, Kuan and Bellet, Aurélien}, year={2019}, month={Mar}, pages={185–199}}
[38] @inproceedings{DBLP:conf/nips/DayanH92,
author = {Peter Dayan and Geoffrey E. Hinton},
editor = {Stephen Jose Hanson and Jack D. Cowan and C. Lee Giles},
title = {Feudal Reinforcement Learning},
booktitle = {Advances in Neural Information Processing Systems 5, {[NIPS} Conference,
Denver, Colorado, USA, November 30 - December 3, 1992]},
pages = {271--278}, publisher = {Morgan Kaufmann}, year = {1992},
url = {http://papers.nips.cc/paper/714-feudal-reinforcement-learning},
timestamp = {Fri, 06 Mar 2020 16:57:04 +0100}, biburl = {https://dblp.org/rec/conf/nips/DayanH92.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[39] @article{DBLP:journals/ai/SuttonPS99,
author = {Richard S. Sutton and Doina Precup and Satinder P. Singh},
title = {Between MDPs and Semi-MDPs: {A} Framework for Temporal Abstraction
in Reinforcement Learning},
journal = {Artif. Intell.}, volume = {112}, number = {1-2}, pages = {181--211}, year = {1999},
url = {https://doi.org/10.1016/S0004-3702(99)00052-1}, doi = {10.1016/S0004-3702(99)00052-1},
timestamp = {Sat, 27 May 2017 14:24:41 +0200}, biburl = {https://dblp.org/rec/journals/ai/SuttonPS99.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[40] @article{DBLP:journals/jair/Dietterich00,
author = {Thomas G. Dietterich},
title = {Hierarchical Reinforcement Learning with the {MAXQ} Value Function Decomposition},
journal = {J. Artif. Intell. Res.}, volume = {13}, pages = {227--303},year = {2000},
url = {https://doi.org/10.1613/jair.639}, doi = {10.1613/jair.639},
timestamp = {Mon, 21 Jan 2019 15:01:17 +0100}, biburl = {https://dblp.org/rec/journals/jair/Dietterich00.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[41] @inproceedings{DBLP:conf/icml/VezhnevetsOSHJS17,
author = {Alexander Sasha Vezhnevets and Simon Osindero and Tom Schaul and Nicolas Heess and
Max Jaderberg and David Silver and Koray Kavukcuoglu},
editor = {Doina Precup and Yee Whye Teh},
title = {FeUdal Networks for Hierarchical Reinforcement Learning},
booktitle = {Proceedings of the 34th International Conference on Machine Learning,
{ICML} 2017, Sydney, NSW, Australia, 6-11 August 2017},
series = {Proceedings of Machine Learning Research},
volume = {70}, pages = {3540--3549}, publisher = {{PMLR}}, year = {2017},
url = {http://proceedings.mlr.press/v70/vezhnevets17a.html},
timestamp = {Wed, 29 May 2019 08:41:45 +0200},
biburl = {https://dblp.org/rec/conf/icml/VezhnevetsOSHJS17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[42] @inproceedings{DBLP:conf/aaai/BaconHP17,
author = {Pierre{-}Luc Bacon and Jean Harb and Doina Precup},
editor = {Satinder P. Singh and Shaul Markovitch},
title = {The Option-Critic Architecture},
booktitle = {Proceedings of the Thirty-First {AAAI} Conference on Artificial Intelligence,
February 4-9, 2017, San Francisco, California, {USA}},
pages = {1726--1734}, publisher = {{AAAI} Press}, year = {2017},
url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14858},
timestamp = {Mon, 06 Mar 2017 11:36:24 +0100},
biburl = {https://dblp.org/rec/conf/aaai/BaconHP17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[43] @inproceedings{DBLP:conf/nips/NachumGLL18,
author = {Ofir Nachum and Shixiang Gu and Honglak Lee and Sergey Levine},
editor = {Samy Bengio and Hanna M. Wallach and Hugo Larochelle and Kristen Grauman and
Nicol{\`{o}} Cesa{-}Bianchi and Roman Garnett},
title = {Data-Efficient Hierarchical Reinforcement Learning},
booktitle = {Advances in Neural Information Processing Systems 31: Annual Conference
on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December
2018, Montr{\'{e}}al, Canada},
pages = {3307--3317},year = {2018},
url = {http://papers.nips.cc/paper/7591-data-efficient-hierarchical-reinforcement-learning},
timestamp = {Fri, 06 Mar 2020 17:00:31 +0100},
biburl = {https://dblp.org/rec/conf/nips/NachumGLL18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[44] @inproceedings{DBLP:conf/iclr/ODonoghueMKM17,
author = {Brendan O'Donoghue and R{\'{e}}mi Munos and Koray Kavukcuoglu and
Volodymyr Mnih},
title = {Combining policy gradient and Q-learning},
booktitle = {5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2017}, url = {https://openreview.net/forum?id=B1kJ6H9ex},
timestamp = {Thu, 25 Jul 2019 14:25:50 +0200}, biburl = {https://dblp.org/rec/conf/iclr/ODonoghueMKM17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[45] @article{DBLP:journals/corr/HeessWTLRS16,
author = {Nicolas Heess and Gregory Wayne and Yuval Tassa and Timothy P. Lillicrap and
Martin A. Riedmiller and David Silver},
title = {Learning and Transfer of Modulated Locomotor Controllers},
journal = {CoRR}, volume = {abs/1610.05182}, year = {2016}, url = {http://arxiv.org/abs/1610.05182},
archivePrefix = {arXiv}, eprint = {1610.05182}, timestamp = {Mon, 13 Aug 2018 16:47:23 +0200},
biburl = {https://dblp.org/rec/journals/corr/HeessWTLRS16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[46] @inproceedings{DBLP:conf/nips/KulkarniNST16,
author = {Tejas D. Kulkarni and Karthik Narasimhan and Ardavan Saeedi and Josh Tenenbaum},
editor = {Daniel D. Lee and Masashi Sugiyama and Ulrike von Luxburg and Isabelle Guyon and Roman Garnett},
title = {Hierarchical Deep Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation},
booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference
on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain},
pages = {3675--3683}, year = {2016},
url = {http://papers.nips.cc/paper/6233-hierarchical-deep-reinforcement-learning-integrating-temporal-abstraction-and-intrinsic-motivation},
timestamp = {Fri, 06 Mar 2020 17:00:15 +0100}, biburl = {https://dblp.org/rec/conf/nips/KulkarniNST16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[47] @inproceedings{DBLP:conf/iclr/FransH0AS18,
author = {Kevin Frans and Jonathan Ho and Xi Chen and Pieter Abbeel and John Schulman},
title = {Meta Learning Shared Hierarchies},
booktitle = {6th International Conference on Learning Representations, {ICLR} 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2018}, url = {https://openreview.net/forum?id=SyX0IeWAW},
timestamp = {Thu, 25 Jul 2019 14:26:00 +0200}, biburl = {https://dblp.org/rec/conf/iclr/FransH0AS18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[48] @article{DBLP:journals/corr/abs-1812-00025,
author = {Alexander Pashevich and Danijar Hafner and James Davidson and Rahul Sukthankar and
Cordelia Schmid},
title = {Modulated Policy Hierarchies}, journal = {CoRR}, volume = {abs/1812.00025}, year = {2018},
url = {http://arxiv.org/abs/1812.00025}, archivePrefix = {arXiv}, eprint = {1812.00025},
timestamp = {Tue, 01 Jan 2019 15:01:25 +0100}, biburl = {https://dblp.org/rec/journals/corr/abs-1812-00025.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[49] @inproceedings{DBLP:conf/nips/VezhnevetsMOGVA16,
author = {Alexander Vezhnevets and Volodymyr Mnih and Simon Osindero and Alex Graves and
Oriol Vinyals and John Agapiou and Koray Kavukcuoglu},
editor = {Daniel D. Lee and Masashi Sugiyama and Ulrike von Luxburg and Isabelle Guyon and
Roman Garnett},
title = {Strategic Attentive Writer for Learning Macro-Actions},
booktitle = {Advances in Neural Information Processing Systems 29: Annual Conference
on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain},
pages = {3486--3494}, year = {2016},
url = {http://papers.nips.cc/paper/6414-strategic-attentive-writer-for-learning-macro-actions},
timestamp = {Fri, 06 Mar 2020 17:00:15 +0100}, biburl = {https://dblp.org/rec/conf/nips/VezhnevetsMOGVA16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[50] @inproceedings{DBLP:conf/aaai/TesslerGZMM17,
author = {Chen Tessler and Shahar Givony and Tom Zahavy and Daniel J. Mankowitz and Shie Mannor},
editor = {Satinder P. Singh and Shaul Markovitch},
title = {A Deep Hierarchical Approach to Lifelong Learning in Minecraft},
booktitle = {Proceedings of the Thirty-First {AAAI} Conference on Artificial Intelligence,
February 4-9, 2017, San Francisco, California, {USA}},
pages = {1553--1561}, publisher = {{AAAI} Press}, year = {2017},
url = {http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14630},
timestamp = {Mon, 06 Mar 2017 11:36:24 +0100}, biburl = {https://dblp.org/rec/conf/aaai/TesslerGZMM17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[51] @inproceedings{ICAPS1715759,
author = {Nakul Gopalan and Marie desJardins and Michael Littman
and James MacGlashan and Shawn Squire and Stefanie Tellex
and John Winder and Lawson Wong},
title = {Planning with Abstract Markov Decision Processes},
booktitle = {International Conference on Automated Planning and Scheduling},
year = {2017}, url = {https://aaai.org/ocs/index.php/ICAPS/ICAPS17/paper/view/15759}}
[52] @article{DBLP:journals/corr/MankowitzMM16,
author = {Daniel J. Mankowitz and Timothy A. Mann and Shie Mannor},
title = {Iterative Hierarchical Optimization for Misspecified Problems (IHOMP)},
journal = {CoRR}, volume = {abs/1602.03348}, year = {2016}, url = {http://arxiv.org/abs/1602.03348},
archivePrefix = {arXiv}, eprint = {1602.03348}, timestamp = {Wed, 17 Jul 2019 17:00:48 +0200},
biburl = {https://dblp.org/rec/journals/corr/MankowitzMM16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[53] @article{DBLP:journals/corr/abs-1811-09083,
author = {Sainbayar Sukhbaatar and Emily Denton and Arthur Szlam and Rob Fergus},
title = {Learning Goal Embeddings via Self-Play for Hierarchical Reinforcement Learning},
journal = {CoRR}, volume = {abs/1811.09083}, year = {2018}, url = {http://arxiv.org/abs/1811.09083},
archivePrefix = {arXiv}, eprint = {1811.09083}, timestamp = {Fri, 30 Nov 2018 12:44:28 +0100},
biburl = {https://dblp.org/rec/journals/corr/abs-1811-09083.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[54] @inproceedings{DBLP:conf/aaai/RafatiN19a,
author = {Jacob Rafati and David C. Noelle},
title = {Learning Representations in Model-Free Hierarchical Reinforcement Learning},
booktitle = {The Thirty-Third {AAAI} Conference on Artificial Intelligence, {AAAI}
2019, The Thirty-First Innovative Applications of Artificial Intelligence
Conference, {IAAI} 2019, The Ninth {AAAI} Symposium on Educational
Advances in Artificial Intelligence, {EAAI} 2019, Honolulu, Hawaii,
USA, January 27 - February 1, 2019},
pages = {10009--10010}, publisher = {{AAAI} Press}, year = {2019},
url = {https://doi.org/10.1609/aaai.v33i01.330110009}, doi = {10.1609/aaai.v33i01.330110009},
timestamp = {Wed, 25 Sep 2019 11:05:09 +0200}, biburl = {https://dblp.org/rec/conf/aaai/RafatiN19a.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[55] @inproceedings{DBLP:conf/nips/ParrR97,
author = {Ronald Parr and Stuart J. Russell},
editor = {Michael I. Jordan and Michael J. Kearns and Sara A. Solla},
title = {Reinforcement Learning with Hierarchies of Machines},
booktitle = {Advances in Neural Information Processing Systems 10, {[NIPS} Conference,
Denver, Colorado, USA, 1997]},
pages = {1043--1049}, publisher = {The {MIT} Press}, year = {1997},
url = {http://papers.nips.cc/paper/1384-reinforcement-learning-with-hierarchies-of-machines},
timestamp = {Fri, 06 Mar 2020 17:00:38 +0100}, biburl = {https://dblp.org/rec/conf/nips/ParrR97.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[56] @phdthesis{Watkins1989,
author = {Watkins, Christopher}, year = {1989}, month = {04}, title = {Learning From Delayed Rewards},
school = {King’s College}}
[57] @inproceedings{DBLP:conf/icml/HaarnojaTAL17,
author = {Tuomas Haarnoja and Haoran Tang and Pieter Abbeel and Sergey Levine},
editor = {Doina Precup and Yee Whye Teh},
title = {Reinforcement Learning with Deep Energy-Based Policies},
booktitle = {Proceedings of the 34th International Conference on Machine Learning,
{ICML} 2017, Sydney, NSW, Australia, 6-11 August 2017},
series = {Proceedings of Machine Learning Research},
volume = {70}, pages = {1352--1361}, publisher = {{PMLR}}, year = {2017},
url = {http://proceedings.mlr.press/v70/haarnoja17a.html}, timestamp = {Wed, 29 May 2019 08:41:45 +0200},
biburl = {https://dblp.org/rec/conf/icml/HaarnojaTAL17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[58] @article{10.1007/BF00992696,
author = {Williams, Ronald J.},
title = {Simple Statistical Gradient-Following Algorithms for Connectionist Reinforcement Learning},
year = {1992}, issue_date = {May 1992}, publisher = {Kluwer Academic Publishers}, address = {USA},
volume = {8}, number = {3--4}, issn = {0885-6125}, url = {https://doi.org/10.1007/BF00992696},
doi = {10.1007/BF00992696}, journal = {Mach. Learn.}, month = may, pages = {229–256}, numpages = {28},
keywords = {connectionist networks, mathematical analysis, Reinforcement learning, gradient descent}}
[59] @inproceedings{DBLP:conf/iclr/SukhbaatarLKSSF18,
author = {Sainbayar Sukhbaatar and Zeming Lin and Ilya Kostrikov and Gabriel Synnaeve and Arthur Szlam and
Rob Fergus},
title = {Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play},
booktitle = {6th International Conference on Learning Representations, {ICLR} 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2018}, url = {https://openreview.net/forum?id=SkT5Yg-RZ},
timestamp = {Thu, 25 Jul 2019 14:25:46 +0200}, biburl = {https://dblp.org/rec/conf/iclr/SukhbaatarLKSSF18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[60] @incollection{NIPS2018_7725,
title = {Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models},
author = {Chua, Kurtland and Calandra, Roberto and McAllister, Rowan and Levine, Sergey},
booktitle = {Advances in Neural Information Processing Systems 31},
editor = {S. Bengio and H. Wallach and H. Larochelle and K. Grauman and N. Cesa-Bianchi and R. Garnett},
pages = {4754--4765}, year = {2018}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/7725-deep-reinforcement-learning-in-a-handful-of-trials-using-probabilistic-dynamics-models.pdf}}
[61] @inproceedings{DBLP:conf/nips/BuckmanHTBL18,
author = {Jacob Buckman and Danijar Hafner and George Tucker and Eugene Brevdo and Honglak Lee},
editor = {Samy Bengio and Hanna M. Wallach and Hugo Larochelle and Kristen Grauman and
Nicol{\`{o}} Cesa{-}Bianchi and Roman Garnett},
title = {Sample-Efficient Reinforcement Learning with Stochastic Ensemble Value Expansion},
booktitle = {Advances in Neural Information Processing Systems 31: Annual Conference
on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montr{\'{e}}al, Canada},
pages = {8234--8244}, year = {2018},
url = {http://papers.nips.cc/paper/8044-sample-efficient-reinforcement-learning-with-stochastic-ensemble-value-expansion},
timestamp = {Fri, 06 Mar 2020 17:00:31 +0100}, biburl = {https://dblp.org/rec/conf/nips/BuckmanHTBL18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[62] @inproceedings{DBLP:conf/iclr/LuoXLTDM19,
author = {Yuping Luo and Huazhe Xu and Yuanzhi Li and Yuandong Tian and Trevor Darrell and
Tengyu Ma},
title = {Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees},
booktitle = {7th International Conference on Learning Representations, {ICLR} 2019,
New Orleans, LA, USA, May 6-9, 2019},
publisher = {OpenReview.net}, year = {2019}, url = {https://openreview.net/forum?id=BJe1E2R5KX},
timestamp = {Thu, 25 Jul 2019 14:26:05 +0200},
biburl = {https://dblp.org/rec/conf/iclr/LuoXLTDM19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[63] @inproceedings{DBLP:journals/corr/ParisottoBS15,
author = {Emilio Parisotto and Lei Jimmy Ba and Ruslan Salakhutdinov},
editor = {Yoshua Bengio and Yann LeCun},
title = {Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning},
booktitle = {4th International Conference on Learning Representations, {ICLR} 2016,
San Juan, Puerto Rico, May 2-4, 2016, Conference Track Proceedings},
year = {2016}, url = {http://arxiv.org/abs/1511.06342},
timestamp = {Thu, 25 Jul 2019 14:25:38 +0200},
biburl = {https://dblp.org/rec/journals/corr/ParisottoBS15.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[64] @inproceedings{DBLP:conf/iclr/SharmaLR17,
author = {Sahil Sharma and Aravind S. Lakshminarayanan and Balaraman Ravindran},
title = {Learning to Repeat: Fine Grained Action Repetition for Deep Reinforcement Learning},
booktitle = {5th International Conference on Learning Representations, {ICLR} 2017,
Toulon, France, April 24-26, 2017, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2017},
url = {https://openreview.net/forum?id=B1GOWV5eg},
timestamp = {Thu, 25 Jul 2019 14:25:59 +0200},
biburl = {https://dblp.org/rec/conf/iclr/SharmaLR17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[65] @incollection{NIPS2015_5796,
title = {Learning Continuous Control Policies by Stochastic Value Gradients},
author = {Heess, Nicolas and Wayne, Gregory and Silver, David and Lillicrap, Timothy and Erez, Tom and
Tassa, Yuval},
booktitle = {Advances in Neural Information Processing Systems 28},
editor = {C. Cortes and N. D. Lawrence and D. D. Lee and M. Sugiyama and R. Garnett},
pages = {2944--2952}, year = {2015}, publisher = {Curran Associates, Inc.},
url = {http://papers.nips.cc/paper/5796-learning-continuous-control-policies-by-stochastic-value-gradients.pdf}}
[66] @inproceedings{DBLP:conf/nips/OhSL17,
author = {Junhyuk Oh and Satinder Singh and Honglak Lee},
editor = {Isabelle Guyon and Ulrike von Luxburg and Samy Bengio and Hanna M. Wallach and
Rob Fergus and S. V. N. Vishwanathan and Roman Garnett},
title = {Value Prediction Network},
booktitle = {Advances in Neural Information Processing Systems 30: Annual Conference
on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, {USA}},
pages = {6118--6128}, year = {2017},
url = {http://papers.nips.cc/paper/7192-value-prediction-network},
timestamp = {Fri, 06 Mar 2020 16:56:22 +0100},
biburl = {https://dblp.org/rec/conf/nips/OhSL17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[67] @article{Rummery1994,
author = {Rummery, G. and Niranjan, Mahesan}, year = {1994}, month = {11}, pages = {},
title = {On-Line Q-Learning Using Connectionist Systems}, journal = {Technical Report CUED/F-INFENG/TR 166}}
[68] @article{DBLP:journals/corr/abs-2006-08875,
author = {Zichuan Lin and Garrett Thomas and Guangwen Yang and Tengyu Ma},
title = {Model-based Adversarial Meta-Reinforcement Learning},
journal = {CoRR}, volume = {abs/2006.08875}, year = {2020},
url = {https://arxiv.org/abs/2006.08875},
archivePrefix = {arXiv}, eprint = {2006.08875}, timestamp = {Wed, 17 Jun 2020 14:28:54 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-2006-08875.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[69] @inproceedings{DBLP:conf/icml/FinnAL17,
author = {Chelsea Finn and Pieter Abbeel and Sergey Levine},
editor = {Doina Precup and Yee Whye Teh},
title = {Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks},
booktitle = {Proceedings of the 34th International Conference on Machine Learning,
{ICML} 2017, Sydney, NSW, Australia, 6-11 August 2017},
series = {Proceedings of Machine Learning Research},
volume = {70}, pages = {1126--1135}, publisher = {{PMLR}}, year = {2017},
url = {http://proceedings.mlr.press/v70/finn17a.html},
timestamp = {Wed, 29 May 2019 08:41:45 +0200},
biburl = {https://dblp.org/rec/conf/icml/FinnAL17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[70] @inproceedings{DBLP:conf/icml/RakellyZFLQ19,
author = {Kate Rakelly and Aurick Zhou and Chelsea Finn and Sergey Levine and
Deirdre Quillen},
editor = {Kamalika Chaudhuri and Ruslan Salakhutdinov},
title = {Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables},
booktitle = {Proceedings of the 36th International Conference on Machine Learning,
{ICML} 2019, 9-15 June 2019, Long Beach, California, {USA}},
series = {Proceedings of Machine Learning Research},
volume = {97}, pages = {5331--5340}, publisher = {{PMLR}}, year = {2019},
url = {http://proceedings.mlr.press/v97/rakelly19a.html},
timestamp = {Tue, 11 Jun 2019 15:37:38 +0200},
biburl = {https://dblp.org/rec/conf/icml/RakellyZFLQ19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[71] @inproceedings{DBLP:conf/iclr/RothfussLCAA19,
author = {Jonas Rothfuss and Dennis Lee and Ignasi Clavera and Tamim Asfour and Pieter Abbeel},
title = {ProMP: Proximal Meta-Policy Search},
booktitle = {7th International Conference on Learning Representations, {ICLR} 2019,
New Orleans, LA, USA, May 6-9, 2019},
publisher = {OpenReview.net}, year = {2019},
url = {https://openreview.net/forum?id=SkxXCi0qFX},
timestamp = {Thu, 25 Jul 2019 14:25:43 +0200},
biburl = {https://dblp.org/rec/conf/iclr/RothfussLCAA19.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[72] @article{DBLP:journals/corr/DuanSCBSA16,
author = {Yan Duan and John Schulman and Xi Chen and Peter L. Bartlett and Ilya Sutskever and Pieter Abbeel},
title = {RL$^2$: Fast Reinforcement Learning via Slow Reinforcement Learning},
journal = {CoRR}, volume = {abs/1611.02779}, year = {2016},
url = {http://arxiv.org/abs/1611.02779},
archivePrefix = {arXiv}, eprint = {1611.02779}, timestamp = {Mon, 03 Sep 2018 12:15:29 +0200},
biburl = {https://dblp.org/rec/journals/corr/DuanSCBSA16.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[73] @inproceedings{DBLP:conf/icml/DudikLL11,
author = {Miroslav Dud{\'{\i}}k and John Langford and Lihong Li},
editor = {Lise Getoor and Tobias Scheffer},
title = {Doubly Robust Policy Evaluation and Learning},
booktitle = {Proceedings of the 28th International Conference on Machine Learning,
{ICML} 2011, Bellevue, Washington, USA, June 28 - July 2, 2011},
pages = {1097--1104}, publisher = {Omnipress}, year = {2011},
url = {https://icml.cc/2011/papers/554\_icmlpaper.pdf},
timestamp = {Wed, 03 Apr 2019 17:43:35 +0200},
biburl = {https://dblp.org/rec/conf/icml/DudikLL11.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[74] @inproceedings{DBLP:conf/icml/DeisenrothR11,
author = {Marc Peter Deisenroth and Carl Edward Rasmussen},
editor = {Lise Getoor and Tobias Scheffer},
title = {{PILCO:} {A} Model-Based and Data-Efficient Approach to Policy Search},
booktitle = {Proceedings of the 28th International Conference on Machine Learning,
{ICML} 2011, Bellevue, Washington, USA, June 28 - July 2, 2011},
pages = {465--472}, publisher = {Omnipress}, year = {2011},
url = {https://icml.cc/2011/papers/323\_icmlpaper.pdf},
timestamp = {Wed, 03 Apr 2019 17:43:35 +0200},
biburl = {https://dblp.org/rec/conf/icml/DeisenrothR11.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[75] @inproceedings{DBLP:conf/iclr/KurutachCDTA18,
author = {Thanard Kurutach and Ignasi Clavera and Yan Duan and Aviv Tamar and Pieter Abbeel},
title = {Model-Ensemble Trust-Region Policy Optimization},
booktitle = {6th International Conference on Learning Representations, {ICLR} 2018,
Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings},
publisher = {OpenReview.net}, year = {2018},
url = {https://openreview.net/forum?id=SJJinbWRZ},
timestamp = {Thu, 25 Jul 2019 14:25:59 +0200},
biburl = {https://dblp.org/rec/conf/iclr/KurutachCDTA18.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}}
[77] @misc{flet-berliac_2020,
title={The Promise of Hierarchical Reinforcement Learning},
url={https://web.archive.org/web/20200501215221/https://thegradient.pub/the-promise-of-hierarchical-reinforcement-learning/},
author={Flet-Berliac, Yannis},
year={2020},
month={Apr}
}
[78] @article{DBLP:journals/corr/abs-1908-06976,
author = {Arthur Aubret and La{\"{e}}titia Matignon and Salima Hassas},
title = {A survey on intrinsic motivation in reinforcement learning},
journal = {CoRR},
volume = {abs/1908.06976},
year = {2019},
url = {http://arxiv.org/abs/1908.06976},
archivePrefix = {arXiv},
eprint = {1908.06976},
timestamp = {Mon, 26 Aug 2019 13:20:40 +0200},
biburl = {https://dblp.org/rec/journals/corr/abs-1908-06976.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
[81] @misc{kiran2020deep,
title={Deep Reinforcement Learning for Autonomous Driving: A Survey},
author={B Ravi Kiran and Ibrahim Sobh and Victor Talpaert and Patrick Mannion and Ahmad A. Al Sallab
and Senthil Yogamani and Patrick Pérez},
year={2020},
eprint={2002.00444},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
[82] @inbook{Azhikodan2019,
author = {Azhikodan, Akhil and Bhat, Anvitha and Jadhav, Mamatha},
year = {2019},
month = {05},
pages = {41-49},
title = {Stock Trading Bot Using Deep Reinforcement Learning},
isbn = {978-981-10-8200-9},
journal = {Lecture Notes in Networks and Systems},
doi = {10.1007/978-981-10-8201-6_5}
}
[83] @article{Kober2013,
author = {Kober, Jens and Bagnell, J. and Peters, Jan},
year = {2013},
month = {09},
pages = {1238-1274},
title = {Reinforcement Learning in Robotics: A Survey},
volume = {32},
journal = {The International Journal of Robotics Research},
doi = {10.1177/0278364913495721}
}
[84] @techreport{Fischer2018Reinforcement,
address = {N\"{u}rnberg},
author = {Thomas G. Fischer},
copyright = {http://www.econstor.eu/dspace/Nutzungsbedingungen},
keywords = {330; financial markets; reinforcement learning; survey; trading systems; machine learning},
language = {eng},
number = {12/2018},
publisher = {Friedrich-Alexander-Universit\"{a}t Erlangen-N\"{u}rnberg, Institute for Economics},
title = {Reinforcement learning in financial markets - a survey},
type = {FAU Discussion Papers in Economics},
url = {http://hdl.handle.net/10419/183139},
year = {2018}
}
[85] @article{Liang_2019,
title={A Deep Reinforcement Learning Network for Traffic Light Cycle Control},
volume={68},
ISSN={1939-9359},
url={http://dx.doi.org/10.1109/TVT.2018.2890726},
DOI={10.1109/tvt.2018.2890726},
number={2},
journal={IEEE Transactions on Vehicular Technology},
publisher={Institute of Electrical and Electronics Engineers (IEEE)},
author={Liang, Xiaoyuan and Du, Xunsheng and Wang, Guiling and Han, Zhu},
year={2019},
month={Feb},
pages={1243–1253}
}
[86] @misc{yu2020reinforcement,
title={Reinforcement Learning in Healthcare: A Survey},
author={Chao Yu and Jiming Liu and Shamim Nemati},
year={2020},
eprint={1908.08796},
archivePrefix={arXiv},
primaryClass={cs.LG}
}
[87] @misc{shao2019survey,
title={A Survey of Deep Reinforcement Learning in Video Games},
author={Kun Shao and Zhentao Tang and Yuanheng Zhu and Nannan Li and Dongbin Zhao},
year={2019},
eprint={1912.10944},
archivePrefix={arXiv},
primaryClass={cs.MA}
}
[88] @inproceedings{HanCai2017,
author = {Cai, Han and Ren, Kan and Zhang, W and Malialis, K and Wang, J and Yu, Y and Guo, D},
year = {2017},
month = {02},
pages = {},
title = {Real-Time Bidding by Reinforcement Learning in Display Advertising}
}