Abstract
Reinforcement Learning (RL), bolstered by the expressive capabilities of Deep Neural
Networks (DNNs) for function approximation, has demonstrated considerable success in
numerous applications. However, its practicality in addressing various real-world scenarios,
characterized by diverse and unpredictable dynamics, noisy signals, and large state and
action spaces, remains limited. This limitation stems from poor data efficiency, limited
generalization capabilities, a lack of safety guarantees, and the absence of interpretability,
among other factors. To overcome these challenges and improve performance across these
crucial metrics, one promising avenue is to incorporate additional structural information
about the problem into the RL learning process. Various sub-fields of RL have proposed
methods for incorporating such inductive biases. We amalgamate these diverse methodologies
under a unified framework, shedding light on the role of structure in the learning problem,
and classify these methods into distinct patterns of incorporating structure. By leveraging
this comprehensive framework, we provide valuable insights into the challenges of structured
RL and lay the groundwork for a design pattern perspective on RL research. This novel
perspective paves the way for future advancements and aids in developing more effective
and efficient RL algorithms that can potentially handle real-world scenarios better.
1. Introduction
Reinforcement Learning (RL) has contributed to a range of sequential decision-making and
control problems like games (Silver et al., 2016), robotic manipulation (Lee et al., 2020b),
and optimizing chemical reactions (Zhou et al., 2017). Most of the traditional research in
RL focuses on designing agents that learn to solve a sequential decision problem induced
by the inherent dynamics of a task, e.g., the differential equations governing the cart pole
task (Sutton & Barto, 2018) in the classic control suite of OpenAI Gym (Brockman et al.,
2016). However, their performance significantly degrades when even minor aspects of the
environment change (Meng & Khushi, 2019; Lu et al., 2020). Moreover, deploying RL agents
Figure 1: Overview of our framework. Side information can be used to achieve im-
proved performance across metrics such as Sample Efficiency, Generalization,
Interpretability, and Safety. We discuss this process in Section 4. A particular
source of side information is decomposability in a learning problem, which can be
categorized into four archetypes along a spectrum - Latent, Factored, Relational,
and Modular - explained further in Section 5.1. Incorporating side information
about decomposability amounts to adding structure to a learning pipeline, and
this process can be categorized into seven different patterns - Abstraction, Aug-
mentation, Auxiliary Optimization, Auxiliary Model, Warehouse, Environment
Generation, and Explicitly Designed - discussed further in Section 6.
side information into the learning pipeline, such as using the LLM to generate an intrinsic
reward (Klissarov et al., 2024), can improve the speed of convergence of the RL agent, make
it robust to variations in the problem and potentially help with making it safer and more
interpretable.
Structure of the Paper. To better guide the reader, the paper is structured as follows:
(i) In Section 2 we discuss the related works. We cover previous surveys on different areas in
RL and previous works aimed at incorporating domain knowledge into RL methods. (ii) In
Section 3, we describe the background and notation needed to formalize the relevant aspects
of the RL problems. We additionally define the RL pipeline that we use in the later sections.
(iii) In Section 4, we introduce side information and define the additional metrics that can
be addressed by incorporating side information into an RL pipeline. (iv) In Section 5, we
formulate structure as side information about decomposability and categorize decompositions
in the literature into four archetypes on the spectrum of decomposability (Höfer, 2017).
Using these archetypes, we demonstrate how various problem formulations in RL fall into the
proposed framework. (v) In Section 6, we formulate seven patterns of incorporating structure
into the RL learning process and provide an overview of each pattern by connecting it to
the relevant surveyed literature. We represent each pattern graphically as a plug-and-play
modification to the RL pipeline introduced in Section 3. We additionally provide a literature
survey for each pattern as a table and show possible research areas as empty spaces. (vi) In
Section 7, we discuss how our framework opens new avenues for research while providing a
common reference point for understanding what kind of design decisions work under which
2. Related Work
Multiple surveys have previously covered different areas in RL. However, none have covered
the methods of explicitly and holistically incorporating structure in RL. In the following
sections, we divide our literature review into surveys that tackle different problem settings,
additional objectives, individual decompositions, and previous works incorporating domain
knowledge into RL pipelines.
Different RL settings. Kirk et al. (2023) survey the field of Zero-Shot generalization and
briefly discuss the need for more restrictive structured assumptions for their setting. While
their survey argues for the requirement of similar assumptions, our work specifically lays out
a framework for surveying approaches that utilize these assumptions. Additionally,
our work is not limited to the setting of zero-shot generalization but covers additional areas
of interpretability, safety, and sample efficiency in RL. Beck et al. (2023) cover the field of
Meta-RL and discuss the role of structure in Meta-Exploration, Transfer, and the POMDP
formulation of Meta-Learning. However, their focus is on surveying the Meta-Learning
setting, and they do not delve deeper into grounding what structure means, as we do in
our work. This is also the case with the survey of exploration methods in RL (Amin et al.,
2021b), where they argue for the need to choose the policy space to reflect prior information
about the structure of the solution to ensure that the exploration behavior follows the same
structure. Our framework grounds this idea in decomposition and argues for incorporating
this information using one of many patterns.
Additional objectives. Individual surveys have additionally covered the multiple objec-
tives defined in Section 4. Garcia and Fernandez (2015) provide a comprehensive review of
the literature on safety in RL and divides the methods based on whether they modify the
optimization criterion or the exploration process. We use their categorization when examining
patterns that use structural information for safety, but we cover additional
objectives beyond safety. In a similar vein, Glanois et al. (2021) cover methods that add
interpretability to the RL pipeline and judge interpretability along the same axes as our
work, namely, the definitions proposed by Lipton (2018).
Incorporating domain knowledge into RL. Certain surveys have also been conducted
on methods incorporating domain knowledge into RL. Eßer et al. (2023) survey methods
that incorporate additional knowledge to tackle real-world deployment in RL. To this end,
they categorize sources of knowledge into three types: (i) Scientific Knowledge, that covers
empirical knowledge about the problem; (ii) World Knowledge, that covers an intuitive
understanding of the problem that can be incorporated into the pipeline; and, (iii) Expert
knowledge, available to experienced professionals in the form of experience. They formalize an
RL pipeline and then look at methods incorporating this knowledge into different parts of the
pipeline, such as problem representation, learning strategy, task structuring, and Sim2Real
transfer. In addition to focusing specifically on domain knowledge about decomposability, our
approach is source-agnostic: we focus on the specific part of the MDP on which structural
assumptions are imposed and on the nature of such assumptions, and we categorize methods
into patterns of incorporating these assumptions. Additionally, our patterns framework
covers a broader range of methods that apply to more settings than Sim2Real.
The intersection of side information and patterns has previously been discussed by Jon-
schkowski et al. (2015) and inspires our categorization as well. However, they predominantly
discuss patterns for supervised and semi-supervised settings and mention trivial extensions
to state representations in RL. Our formulation of patterns covers the RL pipeline more
holistically by additionally looking at assumptions on components such as actions, transition
dynamics, learned models, and previously learned skills. Moreover, our formulation solely
focuses on different ways of biasing RL pipelines, holding little relevance for supervised and
semi-supervised learning communities.
3. Preliminaries
The following sections summarize the main background necessary for our approach to
studying structural decompositions and related patterns. In Section 3.1, we formalize the
sequential decision-making problem as an MDP. Section 3.2 then presents the RL framework
for solving MDPs and introduces the RL pipeline.
For the sum in Equation (1) to be tractable, we either assume the horizon of the problem to
be of a fixed length T (finite-horizon return), i.e., the trajectory terminates after T steps, or
we discount future rewards by a discount factor γ (infinite-horizon return). Discounting,
however, can also be applied to finite horizons. Solving an MDP amounts to determining
the policy π∗ ∈ Π that maximizes the expectation over the returns of its trajectories. This
expectation can be captured by the (state-action) value function Q ∈ Q. Given a policy π,
the expectation can be written recursively:
the expectation can be written recursively:
T
X
Qπ (s, a) = Eπ rt | s0 = s, a0 = a = Eπ R(s, a) + γEa′ ∼π(·|s′ ) [Qπ (s′ , a′ )] .
(2)
t=0
Thus, the goal can now be formulated as the task of finding an optimal policy that maximizes
Qπ(s, a).
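To make the recursion in Equation (2) concrete, the following is a minimal sketch of tabular policy evaluation that repeatedly applies the Bellman expectation backup; the small MDP (its transition matrix, rewards, and uniform policy) is an illustrative placeholder rather than anything from the surveyed literature.

```python
import numpy as np

# Hypothetical tabular MDP: 3 states, 2 actions (illustrative placeholders).
n_states, n_actions, gamma = 3, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # R[s, a]
pi = np.full((n_states, n_actions), 1.0 / n_actions)              # uniform policy

def evaluate_q(P, R, pi, gamma, n_iters=500):
    """Iteratively apply the Bellman expectation operator from Equation (2)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        v_next = (pi * Q).sum(axis=1)      # E_{a' ~ pi(.|s')}[Q(s', a')] for each s'
        Q = R + gamma * P @ v_next         # backup: R(s, a) + gamma * E_{s'}[...]
    return Q

print(evaluate_q(P, R, pi, gamma))
```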
We also consider Partially Observable MDPs (POMDPs), which model situations where the
state is not fully observable. A POMDP is defined as a 7-tuple M = ⟨S, A, O, R, P, ξ, ρ⟩,
where S, A, R, P, ρ remain the same as defined above. Instead of observing the state s ∈ S,
the agent now has access to an observation o ∈ O that is generated from the actual state
through an emission function ξ : S × A → ∆(O). Thus, the observation takes the state's
role in the experience-generation process. However, solving POMDPs requires maintaining
an additional belief over states, since multiple (s, a) can lead to the same o.
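As a minimal sketch of the belief maintenance this requires, the snippet below performs a Bayesian belief update for a discrete POMDP with the emission function ξ(o | s′, a) defined above; the random transition and emission tables are illustrative placeholders.

```python
import numpy as np

def belief_update(b, a, o, P, Xi):
    """Bayesian belief update for a discrete POMDP.

    b  : current belief over states, shape (S,)
    P  : transition probabilities, P[s, a, s']
    Xi : emission probabilities, Xi[s', a, o], i.e., xi(o | s', a)
    """
    predicted = b @ P[:, a, :]              # sum_s P(s' | s, a) b(s)
    unnormalized = Xi[:, a, o] * predicted  # weight by the likelihood of o
    return unnormalized / unnormalized.sum()

# Illustrative usage on a random 2-state, 2-action, 2-observation POMDP.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(2), size=(2, 2))   # (S, A, S')
Xi = rng.dirichlet(np.ones(2), size=(2, 2))  # (S', A, O)
print(belief_update(np.array([0.5, 0.5]), a=0, o=1, P=P, Xi=Xi))
```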
with the dynamics P and a reward function R. The pipeline might generate experiences by
directly interacting with E, i.e., learning from experience, or by simulating a learned model
M̂ of the environment. The optimization procedure encompasses the interplay between the
current policy π, its value function Q, the reward R, and the learning objective J.
β : Ω → Ω × Z
This implies that we now augment our tuple Ω with an additional function Z that
operates on other tuple elements X ∈ Ω. For example, side information could be used to
learn state abstractions by adding an encoder Z that maps the state space S to a latent
representation κ that can be used for control. We discuss the general templates for Z
in Section 5 and classify different methods of biasing Ω with Z into patterns in Section 6.
The natural follow-up question, then, becomes the impact of incorporating side information
into the learning pipeline. In this work, we focus on four ways side information can be used
and formally define them in the following sections.
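As a schematic sketch (not a prescribed implementation) of the mapping β : Ω → Ω × Z, the snippet below attaches a hypothetical state encoder Z to a simplified pipeline tuple so that the policy acts on the latent representation κ instead of the raw state; all names here are illustrative.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Optional

@dataclass
class Pipeline:
    """A simplified stand-in for the pipeline tuple Omega."""
    policy: Callable[[Any], int]                    # maps (encoded) states to actions
    encoder: Optional[Callable[[Any], Any]] = None  # the structural component Z

def augment(pipeline: Pipeline, Z: Callable[[Any], Any]) -> Pipeline:
    """beta : Omega -> Omega x Z, realized here as attaching a state encoder."""
    return replace(pipeline, encoder=Z)

def act(pipeline: Pipeline, state):
    kappa = pipeline.encoder(state) if pipeline.encoder else state
    return pipeline.policy(kappa)

# Illustrative usage: a toy encoder that keeps only the first two state features.
base = Pipeline(policy=lambda kappa: int(sum(kappa) > 0))
structured = augment(base, Z=lambda s: s[:2])
print(act(structured, state=(0.3, -0.1, 5.0)))
```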
4.3 Interpretability
Interpretability refers to a mechanistic understanding of a system to make it more trans-
parent. Lipton (2018) enumerates three fundamental properties of model interpretability:
(i) Simulatability refers to the ability of a human to simulate the inner workings of a
4.4 Safety
Safety refers to learning policies that maximize the expectation of the return in problems in
which it is important to ensure reasonable system performance and/or respect safety-related
constraints during the learning and/or deployment processes. For example, Model-based
RL methods usually learn a model of the environment and then use it to plan a sequence
of actions. However, such models are often learned from noisy data, and deploying them
in the real world might lead an agent to catastrophic states. Therefore, methods in the
Safe-RL literature focus on incorporating safety-related constraints into the training process
to mitigate such issues.
While Safety in RL is a vast field in and of itself (Garcia & Fernandez, 2015), we consider
two specific categories in this work: Safe Learning with constraints and Safe Exploration.
The former subjects the learning process to one or more constraints ci ∈ C (Altman, 1999).
Depending on the required strictness, these can be incorporated in many different ways,
such as safety in expectation, safety in values, safe trajectories, and safe states and actions.
We can formulate this as
\max_{\pi \in \Pi} \mathbb{E}_\pi(G) \quad \text{s.t.} \quad c_i = \{h_i \leq \alpha\}, \qquad (6)
where hi can be a function related to the returns, trajectories, values, states, and actions,
and α is a safety threshold. Consequently, side information can be used in the formulation
of such constraints.
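One common way (among several) to operationalize a constraint of the form in Equation (6) is a Lagrangian relaxation. The sketch below runs primal-dual updates for a single constraint h ≤ α; the quadratic return and constraint estimators are hypothetical placeholders standing in for rollout-based estimates.

```python
def lagrangian_step(theta, lam, estimate_return, estimate_h, alpha,
                    lr_theta=1e-2, lr_lambda=1e-2):
    """One primal-dual update for  max_theta E[G]  s.t.  h(theta) <= alpha.

    estimate_return / estimate_h return (value, gradient) at theta and are
    placeholders for Monte-Carlo estimates obtained from rollouts.
    """
    g, g_grad = estimate_return(theta)
    h, h_grad = estimate_h(theta)
    theta = theta + lr_theta * (g_grad - lam * h_grad)   # ascend the Lagrangian
    lam = max(0.0, lam + lr_lambda * (h - alpha))        # dual ascent, lam >= 0
    return theta, lam

# Illustrative usage with quadratic stand-ins for the return and constraint.
est_G = lambda th: (-(th - 2.0) ** 2, -2.0 * (th - 2.0))  # (value, gradient)
est_h = lambda th: (th ** 2, 2.0 * th)                    # (value, gradient)
theta, lam = 0.0, 0.0
for _ in range(2000):
    theta, lam = lagrangian_step(theta, lam, est_G, est_h, alpha=1.0)
print(theta, lam)
```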
On the other hand, Safe Exploration modifies the exploration process subject to external
knowledge, which in our case translates to incorporating side information into the exploration
process. While this intuitively overlaps with using side information for directed exploration,
the distinguishing feature of such work is that the final goal of the directed exploration is
safety, which might come at the cost of sample efficiency and/or generalization.
[Figure: the spectrum of decomposability, ranging from Monolithic to Distributed.]
explaining how it decomposes complex systems and categorizes such decompositions into
four archetypes. In Section 5.2 - Section 5.5, we discuss these archetypes further to connect
them with existing literature.
Z : X → κ. (7)
Latent States and Actions. Latent representations of states are used for tackling
scenarios such as rich observation spaces (Du et al., 2019) and contextual settings (Hallak
et al., 2015). Latent actions have been similarly explored in settings with stochastic action
sets (Boutilier et al., 2018).
Latent Transition and Rewards. While latent states allow decomposing transition
matrices indirectly, the transition matrix can also be decomposed directly into low-rank
approximations. Linear MDPs (Papini et al., 2021) and applications in Model-
based RL (van der Pol et al., 2020a) have studied this form of direct decomposition. A
similar decomposition can also be applied to rewards by assuming the reward signal to be
generated from a latent function that can be learned as an auxiliary learning objective (Wang
et al., 2020).
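The following sketch illustrates the low-rank idea behind such direct decompositions in the style of a linear MDP, where P(s′ | s, a) is represented through a small inner product φ(s, a)·μ(s′); the random features and the post-hoc normalization are illustrative simplifications (exact linear MDPs constrain φ and μ instead).

```python
import numpy as np

S, A, d = 6, 3, 2                     # states, actions, latent rank (placeholders)
rng = np.random.default_rng(1)
phi = rng.random((S, A, d))           # features phi(s, a)
mu = rng.random((d, S))               # per-dimension measures mu(s')

P = phi @ mu                          # low-rank transition tensor, shape (S, A, S)
P /= P.sum(axis=-1, keepdims=True)    # normalize into valid distributions (simplification)

assert np.allclose(P.sum(axis=-1), 1.0)
print(P[0, 0])                        # P(. | s=0, a=0)
```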
Factored States and Actions. Factored state and action spaces have been explored in
the Factored MDPs (Kearns & Koller, 1999; Boutilier et al., 2000; Guestrin et al., 2003b).
Methods in this setting traditionally capture subsequent state distribution using mechanisms
such as Dynamic Bayesian Networks (Mihajlovic & Petkovic, 2001).
Factored action representations have also been used for tackling high-dimensional ac-
tions (Mahajan et al., 2021). These methods either impose a factorized structure on subsets
of a high-dimensional action set (Kim & Dean, 2002) or impose this structure through the
Q-values that lead to the final action (Tang et al., 2022). Crucially, these methods can
potentially exploit some form of independence resulting from such factorization, either in
the state representations or transitions.
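A minimal sketch of such a factored transition, in the spirit of DBN-based Factored MDP models, is shown below: each next-state variable depends only on a small set of parent variables. The parent sets and conditional distributions are illustrative placeholders.

```python
import numpy as np

PARENTS = {0: [0], 1: [0, 1], 2: [2]}   # next-state variable -> parent variable indices

def sample_next_state(state, action, cpds, rng):
    """state: tuple of binary variables; cpds[i](parent_values, action) -> probabilities."""
    next_state = []
    for i, parents in PARENTS.items():
        parent_values = tuple(state[j] for j in parents)
        probs = cpds[i](parent_values, action)
        next_state.append(int(rng.choice(len(probs), p=probs)))
    return tuple(next_state)

# Example CPDs: each variable flips with a probability depending on its parents.
cpds = {i: (lambda pv, a: np.array([0.8, 0.2]) if sum(pv) + a == 0
            else np.array([0.3, 0.7])) for i in PARENTS}
print(sample_next_state((0, 1, 0), action=1, cpds=cpds, rng=np.random.default_rng(0)))
```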
and Graph Laplacians (Mahadevan & Maggioni, 2007). Recent approaches have started
looking into DNN representations (Zambaldi et al., 2019; Garg et al., 2020), with extensions
into modeling problem aspects such as morphology in Robotic tasks (Wang et al., 2018) in a
relational manner, or learning diffusion operators for constructing intrinsic rewards (Klissarov
& Machado, 2023).
Relational Tasks. A parallel line of work looks at capturing relations in a multi-task
setting, where task perturbations come in the form of goals and corresponding rewards
(Sohn et al., 2018; Illanes et al., 2020; Kumar et al., 2022). Most work aims at integrating
these relationships into the optimization procedure and/or additionally capturing them as
models. We delve deeper into specifics in later sections.
Z : X → {X1, . . . , XN}. (11)
Such modularity can exist along the following axes: (i) Spatial Modularity allows learning
quantities specific to parts of the state space, thus effectively reducing the dimensionality
of the states; (ii) Temporal Modularity allows breaking down tasks into sequences over
a learning horizon and, thus, learning modular quantities in a sequence; (iii) Functional
Modularity allows decomposing the policy architecture into functionally modular parts, even
if the problem is spatially and temporally monolithic.
A potential consequence of such breakdown is the emergence of a hierarchy, and when
learning problems exploit this hierarchical relationship, these problems come under the
purview of Hierarchical RL (HRL) (Pateria et al., 2022). The learned policies can also exhibit
a hierarchy, where each level can choose lower-level policies to execute the subtasks. Each
level can be treated as a planning problem (Yang et al., 2018) or a learning problem (Sohn
et al., 2018), thus allowing solutions to combine planning and learning through the hierarchy.
Hierarchy, however, is not a necessity for modularity.
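The sketch below shows a schematic two-level hierarchy in this spirit: a high-level policy selects among lower-level sub-policies, each of which then acts for a fixed number of steps. The fixed-horizon termination and the placeholder policies are illustrative simplifications of option-style HRL.

```python
import random

class HierarchicalAgent:
    """Two-level hierarchy: the high level picks a sub-policy, which then
    acts for k steps before control returns to the high level."""

    def __init__(self, high_level, sub_policies, k=5):
        self.high_level = high_level        # maps state -> sub-policy index
        self.sub_policies = sub_policies    # list of state -> action callables
        self.k = k
        self._active, self._steps_left = None, 0

    def act(self, state):
        if self._steps_left == 0:           # time to choose a new sub-policy
            self._active = self.high_level(state)
            self._steps_left = self.k
        self._steps_left -= 1
        return self.sub_policies[self._active](state)

# Illustrative usage with random placeholder policies.
agent = HierarchicalAgent(high_level=lambda s: random.randrange(2),
                          sub_policies=[lambda s: 0, lambda s: random.randrange(3)])
print([agent.act(state=None) for _ in range(6)])
```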
Modularity in States and Goals. Modular decomposition of state spaces has primarily
been studied for high-level planning and state abstractions in HRL methods (Kokel et al., 2021).
Approaches such as Q-decomposition (Russell & Zimdars, 2003; Bouton et al., 2019) have
explored agent design by communicating Q values learned by individual agents on parts
of the state-action space to an arbitrator that suggests the following action. Additionally,
the literature on skills has looked into the direction of training policies for individual parts
of the state-space (Goyal et al., 2020). Similarly, partial models only make predictions for
specific parts of the observation-action spaces in Model-Based settings (Talvitie & Singh,
2008; Khetarpal et al., 2021). Goals have been considered explicitly in methods that either
use goals as an interface between levels of hierarchy (Kulkarni et al., 2016; Nachum et al.,
2018; Gehring et al., 2021), or as outputs of task specification methods (Jiang et al., 2019;
Illanes et al., 2020).
[Figure: for each pattern, (a) Abstraction, (b) Augmentation, (c) Auxiliary Optimization, and (d) Auxiliary Model, the distribution of surveyed works across the additional objectives Sample Efficiency, Generalization, Interpretability, and Safety.]
when a passenger is successfully dropped off at their destination and incurs a minor penalty
for each time step to encourage efficiency.
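A minimal sketch of such a reward function for the running taxi example is given below; the numeric values are illustrative assumptions chosen to mirror the description above (a bonus on a successful drop-off and a small per-step penalty), not those of any specific benchmark.

```python
def taxi_reward(dropped_off_at_destination: bool,
                step_penalty: float = -1.0,
                dropoff_bonus: float = 20.0) -> float:
    """Reward a successful drop-off and mildly penalize every time step."""
    reward = step_penalty
    if dropped_off_at_destination:
        reward += dropoff_bonus
    return reward

print(taxi_reward(False), taxi_reward(True))
```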
For each of the following sections, we present a table of the surveyed methods that
categorizes the work in the following manner: (i) The structured space, information about
which is incorporated as side information; (ii) The type of decomposition exhibited for that
structured space. We specifically categorize works that use structured task distributions
through goals and/or rewards; (iii) The additional objectives for which the decomposition
is utilized. Our rationale behind the table format is twofold: to categorize the existing
literature and to highlight areas where further research might be lucrative. The empty spots
in the tables are those for which we could not yet find literature and where we believe
additional work can be important; they therefore highlight avenues for future research.
Finding appropriate abstractions can be a challenging task in itself. Too much abstraction
can lead to loss of critical information, while too little might not significantly reduce
complexity (Dockhorn & Kruse, 2023). Consequently, learning-based methods that jointly
learn abstractions factor this granularity into the learning process.
Abstractions have been thoroughly explored in the literature, with early work addressing
a formal theory on state abstractions (Li et al., 2006; Sutton & Barto, 2018). Recent works
have primarily used abstractions for tackling generalization. Thus, we see in Section 6
that generalization is the most explored use case for abstractions. However, the aforementioned
advantages of abstraction mean that these approaches are usually also interleaved with sample
efficiency gains and safety benefits. Given the widespread use of abstractions in the literature, the following
paragraphs explore how different abstractions impact each use case.
[Table: surveyed Abstraction methods, categorized by structured space, decomposition type, and the additional objectives they address. Works cited: Modular: Steccanella et al. (2022), Furelos-Blanco et al. (2021, 2022); Rewards/Latent: Zhang et al. (2021a), Barreto et al. (2017, 2018), Borsa et al. (2016); Rewards/Factored: Perez et al. (2020), Sodhani et al. (2021, 2022a), Wang et al. (2020); Dynamics/Latent: Zhang et al. (2020, 2021a), Borsa et al. (2019), Perez et al. (2020); Dynamics/Factored: Fu et al. (2021); Dynamics/Modular: Sun et al. (2021).]
et al., 2022), templates of dynamics across tasks (Sun et al., 2021), or even be combined
with options to preserve optimal values (Abel et al., 2020).
Sample Efficiency. Latent variable models improve sample efficiency across the RL
pipeline. Latent state abstractions help improve sample efficiency in Model-based RL (Gelada
et al., 2019) and also help improve the tractability of policy learning over options in
HRL (Steccanella et al., 2022). In model-free tasks, these are also learned as inverse models
for visual features (Allen et al., 2021) or control in a latent space (Lee et al., 2020a). Latent
transition models demonstrate efficiency gains by capturing task-relevant information in noisy
settings (Fu et al., 2021), by preserving bisimulation distances between original states (Zhang
et al., 2021), or by utilizing factorized abstractions (Perez et al., 2020). Learned latent
abstractions (Gallouedec & Dellandrea, 2023) also contribute to the exploration mechanism
in the Go-Explore regime (Ecoffet et al., 2021).
Latent action models expedite convergence of policy gradient methods such as REIN-
FORCE (Williams, 1992) by shortening the learning horizon in stochastic scenarios like
dialog generation (Zhao et al., 2019). Action embeddings, on the other hand, help reduce
the dimensionality of large action spaces (Chandak et al., 2019).
Safety and Interpretability. Relational abstractions are an excellent choice for inter-
pretability since they capture decompositions of complex interactions. The combination of
object-centric representations and learned abstractions adds transparency (Adjodah et al.,
2018), while symbolic interjections, such as tracking the relational distance between objects,
help improve performance (Garnelo et al., 2016).
State and reward abstractions help with safety. Latent states help to learn safe causal
inference models by embedding confounders (Yang et al., 2022). On the other hand, meshes
(Talele & Byl, 2019; Gillen & Byl, 2021) help benchmark metrics such as robustness in a
learned policy.
[Table: surveyed Augmentation methods, categorized by structured space, decomposition type, and the additional objectives they address. Works cited: Modular: Pathak et al. (2019), Devin et al. (2019); Rewards/Factored: Huang et al. (2020); Dynamics/Latent: Sodhani et al. (2022b), Guo et al. (2022), Wang and van Hoof (2022); Dynamics/Factored: Goyal et al. (2021); Policies/Modular: Raza and Lin (2019), Haarnoja et al. (2018a), Verma et al. (2018), Marzi et al. (2023).]
i.e., some form of regularization that controls how the modification of the optimization
procedure respects the original objective needs to exist. For reward shaping techniques,
this amounts to the invariance of the optimal policy under the shaped reward (Ng et al.,
1999). For auxiliary objectives, this manifests in some form of entropy (Fox et al., 2016)
or divergence regularization (Eysenbach et al., 2019). Constraints ensure this through
recursion (Lee et al., 2022), while baselines control the variance of updates (Wu et al., 2018).
The most extensive use of constraints is in the safety literature, where constraints either help
control the updates using some safety criterion or constrain the exploration. Consequently,
in Section 6, the auxiliary optimization pattern peaks in its proclivity towards addressing
safety. In the following paragraphs, we cover methods that optimize individual aspects of
the optimization procedure, namely, rewards, learning objectives, constraints, and parallel
optimization.
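For the reward-shaping case mentioned above, policy invariance holds for potential-based shaping (Ng et al., 1999), where the shaped reward adds the term γΦ(s′) − Φ(s). The sketch below shows this with a hypothetical potential function based on distance to a goal; the potential and the grid example are illustrative assumptions.

```python
def shaped_reward(r, s, s_next, potential, gamma=0.99, done=False):
    """Potential-based reward shaping: r' = r + gamma * Phi(s') - Phi(s).

    `potential` encodes side information (e.g., negative distance to the goal);
    potential-based shaping preserves the optimal policy (Ng et al., 1999).
    """
    phi_next = 0.0 if done else potential(s_next)   # terminal potential set to 0
    return r + gamma * phi_next - potential(s)

# Illustrative usage: potential = negative Manhattan distance to a goal cell.
goal = (4, 4)
potential = lambda s: -(abs(s[0] - goal[0]) + abs(s[1] - goal[1]))
print(shaped_reward(r=-1.0, s=(0, 0), s_next=(0, 1), potential=potential))
```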
[Table: surveyed Auxiliary Optimization methods, categorized by structured space, decomposition type, and the additional objectives they address. Works cited: Factored: Tavakol and Brefeld (2014), Trimponias and Dietterich (2023), Ross and Pineau (2008), Lyu et al. (2023), Lee et al. (2022); Relational: Li et al. (2021); Modular: Nachum et al. (2018), Khetarpal et al. (2020), Lyu et al. (2019); Actions/Latent: Ok et al. (2018), Amin et al. (2021a), Yang et al. (2020b), Zhang et al. (2019a, 2019b, 2021), Gupta et al. (2017), Lyu et al. (2023); Actions/Factored: Balaji et al. (2020), Wu et al. (2018), Metz et al. (2017), Spooner et al. (2021), Tang et al. (2022), Khamassi et al. (2017), Tavakol and Brefeld (2014); Actions/Modular: Metz et al. (2017), Klissarov and Machado (2023), Lyu et al. (2019), Jain et al. (2021a); Rewards/Factored: Belogolovsky et al. (2021), Trimponias and Dietterich (2023), Saxe et al. (2017), Buchholz and Scheftelowitsch (2019), Prakash et al. (2020), Baheri (2020), Huang et al. (2020); Dynamics/Latent: Mu et al. (2022a), Henaff et al. (2022), Lee and Chung (2021); Dynamics/Factored: Belogolovsky et al. (2021), Liao et al. (2021), Buchholz and Scheftelowitsch (2019); Dynamics/Relational: Mu et al. (2022a), Illanes et al. (2020); Policies/Latent: Hausman et al. (2018), Gupta et al. (2017).]
Auxiliary Learning Objectives. Skill-based methods transfer skills between agents with
different morphologies by learning invariant subspaces and using those to create a transfer
auxiliary objective (through a reward signal) (Gupta et al., 2017), or an entropy-based term
for policy regularization (Hausman et al., 2018). Discovering appropriate sub-tasks (Solway
et al., 2014) in hierarchical settings is a highly sample-inefficient process. Li et al. (2021)
tackle this by composing values of the sub-trajectories under the current policy, which
they subsequently use for behavior cloning. Latent decompositions additionally help with
robustness and safety when used for some form of policy regularization (Zhang et al.,
2020). Auxiliary losses usually help with generalization and are an excellent entry point for
human-like inductive biases (Kumar et al., 2022). Metrics inspired by the geometry of latent
decompositions help learn optimal values in multi-task settings (Wang et al., 2023).
Action selection mechanisms can exploit domain knowledge for safety and interpretabil-
ity (Zhang et al., 2021) or for directed exploration to improve sample efficiency (Amin
et al., 2021a). Hierarchical settings benefit from latent state decompositions incorporated
via modification of the termination condition (Harutyunyan et al., 2019). Additionally,
state-action equivalences help scale Q-learning to large spaces through factorization (lyu
et al., 2023).
Our taxi agent could learn a latent model of city traffic based on past experiences.
This model could be used to plan routes that avoid traffic and reach destinations faster.
Alternatively, the agent could learn an ensembling technique to combine multiple models,
each of which models specific components of the traffic dynamics. With models, there is
usually a trade-off between model complexity and accuracy, and it is essential to manage this
carefully to avoid overfitting and maintain robustness. To this end, incorporating structure
helps make the model-learning phase more efficient while allowing reuse for generalization.
Hence, in Section 6, the auxiliary model pattern shows a solid propensity to utilize structure
for sample efficiency. In the following paragraphs, we explicitly discuss models that utilize
decompositions and models used for creating decompositions.
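As a schematic sketch of the ensembling idea from the taxi example, the snippet below combines several component dynamics models by weighted averaging; the split into components and the averaging rule are illustrative assumptions rather than a specific method from the literature.

```python
import numpy as np

class EnsembleDynamicsModel:
    """Combine component models that each capture part of the dynamics
    (e.g., separate models for light and heavy traffic)."""

    def __init__(self, components, weights=None):
        self.components = components
        self.weights = weights or [1.0 / len(components)] * len(components)

    def predict(self, state, action):
        preds = [m(state, action) for m in self.components]
        return np.average(preds, axis=0, weights=self.weights)

# Illustrative usage with two placeholder component models.
light_traffic = lambda s, a: np.asarray(s) + 1.5 * np.asarray(a)
heavy_traffic = lambda s, a: np.asarray(s) + 0.5 * np.asarray(a)
model = EnsembleDynamicsModel([light_traffic, heavy_traffic])
print(model.predict(state=[0.0, 0.0], action=[1.0, 0.0]))
```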
[Table: surveyed Auxiliary Model methods, categorized by structured space, decomposition type, and the additional objectives they address. Works cited: Relational: Biza et al. (2022b), Pitis et al. (2020); Modular: Furelos-Blanco et al. (2021), Yang et al. (2018); Rewards/Latent: Zhang et al. (2021a), van der Pol et al. (2020a), Lee and Chung (2021), Sohn et al. (2018, 2020); Rewards/Factored: Wang et al. (2020), Baheri (2020), Sohn et al. (2018); Dynamics/Latent: Zhang et al. (2021a), Woo et al. (2022), van der Pol et al. (2020a), Fu et al. (2021), Guo et al. (2022), Wang and van Hoof (2022); Dynamics/Factored: Schiewer and Wiskott (2021), Fu et al. (2021), Kaiser et al. (2019), Goyal et al. (2021); Dynamics/Relational: Buesing et al. (2019), van Rossum et al. (2021); Dynamics/Modular: Abdulhai et al. (2022), Wu et al. (2019), Wen et al. (2020).]
Models with structured representations. Young et al. (2023) utilize a factored
decomposition of the state space to demonstrate the benefits of model-based methods in
combinatorially complex environments. Similarly, the Dreamer models (Hafner et al., 2020, 2023) utilize
latent representations of pixel-based environments.
Object-oriented representations for states help bypass the need to learn latent factors
using CNNs in MBRL (Biza et al., 2022a) or as random variables whose posterior can be
refined using NNs (Veerapaneni et al., 2020). Graph (Convolutional) Networks (Zhang
et al., 2019) capture rich higher-order interaction data, such as crowd navigation (Chen
et al., 2020), or invariances (Kipf et al., 2020). Action equivalences help learn latent models
(Abstract MDPs) (van der Pol et al., 2020a) for planning and value iteration.
irrelevance, such as observational and interventional data in Causal RL (Gasse et al., 2021),
or task-relevant vs. irrelevant data (Fu et al., 2021) help with generalization and sample effi-
ciency gains. Latent representations help models capture control-relevant information (Wang
et al., 2022) or subtask dependencies (Sohn et al., 2018).
Models for safety usually incorporate some measure of cost to abstract safe states (Simao
et al., 2021), or unawareness to factor states and actions (Innes & Lascarides, 2020).
Alternatively, models can also directly guide exploration mechanisms through latent causal
decompositions (Seitzer et al., 2021) and state subspaces (Ghorbani et al., 2020) to gain
sample efficiency. Generative methods such as CycleGAN (Zhu et al., 2017) are also excellent
ways to use latent models of different components of an MDP to generate counterfactual
trajectories (Woo et al., 2022).
The taxi from our running example could maintain a database of value functions or
policies for different parts of the city or at different times of the day. These could be reused
as the taxi navigates through the city, making learning more efficient. Portfolios generally
improve efficiency and generalization and have been traditionally implemented through the
skills and options frameworks. An essential consideration in this pattern is managing the
portfolio’s size and diversity to avoid biasing the learning process too much toward past
experiences.
So far, portfolios have primarily been applied to improving sample efficiency and generalization.
However, they also overlap with interpretability since the stored data can be easily used to
analyze the agent’s behavior and understand the policy for novel scenarios. Consequently,
these objectives are equitably distributed in Section 6.
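A minimal sketch of such a warehouse for the taxi example is shown below: policies are stored and retrieved by a context key (here, a hypothetical district and time-of-day pair), with a fallback for unseen contexts. The keys and placeholder policies are illustrative.

```python
class PolicyWarehouse:
    """Store policies keyed by context and fall back to a default policy."""

    def __init__(self, default_policy):
        self.default_policy = default_policy
        self.store = {}

    def add(self, context, policy):
        self.store[context] = policy

    def act(self, context, state):
        policy = self.store.get(context, self.default_policy)
        return policy(state)

# Illustrative usage with placeholder policies.
warehouse = PolicyWarehouse(default_policy=lambda s: "explore")
warehouse.add(("downtown", "rush_hour"), lambda s: "take_ring_road")
print(warehouse.act(("downtown", "rush_hour"), state=None))
print(warehouse.act(("suburbs", "night"), state=None))
```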
Policy portfolios. Policy subspaces (Gaya et al., 2022b) utilize shared latent parameters
in policies to learn a subspace, the linear combinations of which help create new policies.
Extending these subspaces by warehousing additional policies naturally extends them to
continual settings (Gaya et al., 2022a).
Using goals and rewards, task factorization enables warehousing policies and Q-values in
multi-task lifelong settings. Relationship graphs between existing tasks generated from a
latent space provide a way to model lifelong multi-task learning problems (Mendez et al.,
2022b). On the other hand, methods such as those presented by Devin et al. (2017) factor
MDPs into agent-specific and task-specific degrees of variation, for which individual modules
are trained. Disentanglement using variational encoder-decoder models (Hu & Montana,
2019) helps control morphologically different agents by factorizing dynamics into shared and
agent-specific factors. Additionally, methods similar to the work of Raza and Lin (2019)
partition the agent’s problem into interconnected sub-agents that learn local control policies.
Methods that utilize the skills framework effectively create a portfolio of learned primitives,
similar to options in HRL. These are subsequently used for maximizing mutual information
in lower layers (Florensa et al., 2017), sketching together a policy (Heess et al., 2016),
diversity-seeking priors in continual settings (Eysenbach et al., 2019), or for partitioned
state spaces (Mankowitz et al., 2015). Similarly, Gupta et al. (2017) apply the portfolio
pattern to a latent embedding space, learned using auxiliary optimization.
Decomposed Models. Decompositions that inherently exist in models lead to approaches
that often ensemble multiple models that individually reflect different aspects of the prob-
lem. Ensemble methods such as Recurrent Independent Mechanisms (Goyal et al., 2021)
capture the dynamics in individual modules that sparsely interact and use attention mech-
anisms (Vaswani et al., 2017). Ensembling dynamics also helps with few-shot adaptation
to unseen MDPs (Lee & Chung, 2021). Factored models combined with relational decom-
positions help bind actions to object-centric representations (Biza et al., 2022b). Latent
representations in hierarchical settings (Abdulhai et al., 2022) mitigate the sample inefficiency
of methods such as the Deep Option Critic (Bacon et al., 2017).
environments while incorporating methods that use auxiliary models to induce structure
in the environment generation process. The decomposition is reflected in the aspects of
the environment generation that are impacted by the generative process, such as dynamics,
reward structure, state space, etc. Given the online nature of environment generation,
methods in this pattern address curriculum learning in one way or another.
In the taxi example, a curriculum of tasks could be generated, starting with simple tasks
(like navigating an empty grid) and gradually introducing complexity (like adding traffic
and passengers with different destinations). Ensuring that the generated MDPs provide
good coverage of the problem space is crucial to avoid overfitting to a specific subset of
tasks. This necessitates additional diversity constraints that must be incorporated into the
environment generation process. Structure, crucially, provides additional interpretability
and controllability in the environment generation process, thus making benchmarking easier
than methods that use unsupervised techniques (Laskin et al., 2021).
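A minimal sketch of such a curriculum generator for the taxi example is shown below; the configuration fields (grid size, traffic density, number of passengers) and their schedules are illustrative knobs, not a fixed interface from the surveyed methods.

```python
import random

def generate_taxi_curriculum(n_levels, seed=0):
    """Yield environment configurations of increasing difficulty."""
    rng = random.Random(seed)
    for level in range(n_levels):
        yield {
            "grid_size": 5 + 2 * level,                # larger city grids
            "traffic_density": min(0.1 * level, 0.8),  # gradually more congestion
            "n_passengers": 1 + level // 2,            # more passengers over time
            "seed": rng.randrange(10**6),              # diversity within each level
        }

for config in generate_taxi_curriculum(4):
    print(config)
```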
Rule-based grammars help model the compositional nature of learning problems. Kumar
et al. (2021) utilize this to impact the transition dynamics and generate environments. This
allows them to train agents with an implicit compositional curriculum. Kumar et al. (2022)
use these grammars in their auxiliary optimization procedure. Another way to capture task
dependencies is through latent graphical models, which generate the state-space, reward
functions, and transition dynamics (Wang et al., 2021; Bauer et al., 2023).
Latent dynamics models allow simulating task distributions, which help with generaliza-
tion (Lee & Chung, 2021). Clustering methods such as Exploratory Task Clustering (Chu
& Wang, 2023) explore task similarities by meta-learning a clustering method through an
exploration policy. In some sense, they recover a factored decomposition on the task space
where individual clusters can be further used for policy adaptation.
In the case of the taxi, a neural architecture could be designed to process the city grid
as an image and output a policy. Techniques like convolutional layers could be used to
capture the spatial structure of the city grid. Different network parts could be specialized
for different subtasks, like identifying passenger locations and planning routes. However,
this pattern involves a considerable amount of manual tuning and experimentation, and it is
critical to ensure that these designs generalize well across different tasks. Designing specific
neural architectures can provide better interpretability, making it possible to decompose
different components and simulate them independently. Consequently, this pattern shows the
highest proclivity to interpretability, with Generalization being a close second in Section 6.
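The snippet below sketches such an explicitly designed architecture for the taxi example, assuming PyTorch: convolutional layers process the city grid as an image, and two separate heads specialize in passenger localization and action selection. The layer sizes, the two-head split, and the observation format are illustrative design choices, not a specific architecture from the literature.

```python
import torch
import torch.nn as nn

class GridTaxiPolicy(nn.Module):
    """Convolutional backbone over the city grid with two specialized heads."""

    def __init__(self, n_channels=3, grid=10, n_actions=6):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(n_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        feat = 32 * grid * grid
        self.passenger_head = nn.Linear(feat, grid * grid)  # where is the passenger?
        self.action_head = nn.Linear(feat, n_actions)       # which move to take?

    def forward(self, obs):
        features = self.backbone(obs)
        return self.action_head(features), self.passenger_head(features)

policy = GridTaxiPolicy()
action_logits, passenger_logits = policy(torch.zeros(1, 3, 10, 10))
print(action_logits.shape, passenger_logits.shape)
```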
Splitting Functionality. One way to bias the architecture is to split its functionality
into different parts. Most of the works that achieve such disambiguation are either Factored
or Relational. Structured Control Nets (Srouji et al., 2018) model linear and non-linear
aspects of the dynamics individually and combine them additively to gain sample efficiency
and generalization. Alternatively, Bi-linear Value Networks (Hong et al., 2022) architec-
turally decompose dynamics into state and goal-conditioned components to produce a
goal-conditioned Q-function. Action Branching architectures (Tavakoli et al., 2018) use
a shared representation that is then factored into separate action branches for individual
functionality. This approach bears similarity to capturing multi-task representations using
bottlenecks (D’Eramo et al., 2020).
Relational and Modular biases manifest in hierarchical architectures. This also allows
them to add more interpretability to the architecture. Two-step hybrid policies (Mu et al.,
2022b), for example, demonstrate an explicit method to make policies more interpretable
through splitting actions into pruners and selector modules. On the other hand, routing
hierarchies explicitly capture modularity using sub-modules that a separate policy can use
for routing them (Shu et al., 2018; Yang et al., 2020a).
Specialized Modules. These are a class of methods that combines the best of both
worlds by capturing invariance in additional specialized modules. Such modules capture
relational structure in semantic meaning (Lampinen et al., 2022), relational encoders for
auxiliary models (Guo et al., 2022), or specialized architectures for incorporating domain
knowledge (Payani & Fekri, 2020).
Scalability measures how methods scale with the increasing problem complexity in terms
of the size of the state and action spaces, complex dynamics, noisy reward signals, and
longer task horizons. On the one hand, methods might specifically require low-dimensional
spaces and might not scale well as the size of these spaces increases; on the other,
some methods might be overkill for simple problems but better suited for large spaces.
Robustness measures the response of methods to changes in the environment. While the
notion overlaps with generalization, robustness for our purposes more holistically looks at
central properties of the data distribution, such as initial state distributions and multi-modal
evaluation returns. Under this notion, fundamentally different learning dynamics might be
robust to different kinds of changes in the environment.
Structure of the Section. In the following subsections, we cover sub-fields of RL that lie
at different areas of the Scalability and Robustness space. Each subsection covers an existing
sub-field and its challenges. We then present some examples in which our framework can
bolster further research and practice in these fields. Finally, we collate this discussion into
takeaways that can be combined into specific areas of further research. These are shown in
the blue boxes at the end of each section.
7.1 Offline RL
Offline Reinforcement Learning (also known as batch RL) (Prudencio et al., 2023) involves
learning from a fixed dataset without further interaction with the environment. This
approach can be beneficial when active exploration is costly, risky, or infeasible. Consequently,
such methods are highly data-dependent due to their reliance on the collected dataset,
and they do not generalize well due to the limitations of the pre-collected data. The
three dominant paradigms in Offline RL – Behavior Cloning (Bain & Sammut, 1995), Q-
Learning (Kumar et al., 2020), and Sequence Modelling (Chen et al., 2021) – uniformly
degrade in performance as the state-space increases (Bhargava et al., 2024). Offline RL faces
challenges, including effectively overcoming distributional shifts and exploiting the available
dataset. Structural decomposition can be crucial in addressing these challenges in different
ways, as summarized in the following paragraphs.
Improved Exploitation of Dataset. Task decomposition allows learning individual
policies or value functions for different subtasks, which could potentially leverage the available
data more effectively. For example, a modular decomposition that trains a portfolio of policies,
one per module on the corresponding subset of the data, might be more sample-efficient
than learning a single policy for the entire task. Task decompositions, thus, open up new
avenues for developing specialized algorithms that effectively learn from limited data about
each subtask while balancing the effects of learning different subtasks. Practitioners can
leverage such decompositions to maximize the utility of their available datasets by training
models that effectively handle specific subtasks, potentially improving the overall system’s
performance with the same dataset.
Mitigating Distributional Shift. The structural information could potentially help
mitigate the effect of distributional shifts. For instance, if some factors are less prone to
distributional shifts in a factored decomposition, we could focus more on those factors during
learning. This opens up avenues for gaining theoretical insights into the complex interplay of
structural decompositions, task distributions, and policy performance. On the other hand,
practical methods for environments where distributional shifts are common could leverage
structural decomposition to create more robust RL systems.
Auxiliary Tasks for Exploration. Structural decomposition can be used to define
auxiliary tasks that facilitate learning from the dataset. For instance, in a relational
decomposition, we could define auxiliary tasks that involve predicting the relationships
between different entities, which could help in learning a valuable representation of the data.
Using the proposed framework, researchers can explore how to define meaningful auxiliary
tasks that help the agent learn a better representation of the environment. This could lead
to new methods that efficiently exploit the available data by learning about these auxiliary
tasks. Practitioners can design auxiliary tasks based on the specific decompositions of their
problem. For example, if the task has a clear relational structure, auxiliary tasks that predict
the relations between different entities can potentially improve the agent’s understanding of
the environment and its overall performance.
7.2 Unsupervised RL
Unsupervised RL (Laskin et al., 2021) refers to the sub-field of behavior learning in RL,
where an agent learns to interact with an environment without receiving explicit feedback
or guidance in the form of rewards. Methods in this area can be characterized based on the
nature of the metrics that are used to evaluate performance intrinsically (Srinivas & Abbeel,
2021). Knowledge-based methods define a self-supervised task by making predictions on
some aspect of the environment (Pathak et al., 2017; Chen et al., 2022), Data-based
methods maximize the state visitation entropy for exploring the environment (Hazan et al.,
2019; Zhang et al., 2021b; Guo et al., 2021; Mutti et al., 2021, 2022), and Competence-based
methods maximize the mutual information between the trajectories and space of learned
skills (Mohamed & Rezende, 2015; Gregor et al., 2016; Baumli et al., 2021; Jiang et al., 2022;
Zeng et al., 2022). The pre-training phase allows these methods to learn the underlying
structure of data. However, this phase also requires large amounts of data and, thus, impacts
the scalability of such methods for problems where the learned representations are not very
useful. Consequently, such methods currently handle medium-complexity problems, and
better scalability remains a topic for further research.
Structural decompositions can help such methods by improving the pre-training phase’s
tractability and the fine-tuning phase’s generality. Latent decompositions could help exploit
structure in unlabeled data, while relational decompositions could add interpretability to
the learned representations. Through augmentation, conditioning policies on specific parts
of the state space can reduce the data needed for fine-tuning. Additionally, understanding
problem decomposition can simplify complex problems into more manageable sub-problems,
effectively reducing the perceived problem complexity, and such decompositions can be
incorporated into external curricula for fine-tuning. Incorporating portfolios guided by decompositions for
competence-based methods can boost the fine-tuning process of the learned skills.
Abstractions. Abstractions can play a crucial role in such situations, where structural
decompositions using abstraction patterns can make methods more sample-efficient. Temporal
abstractions, often realized as options, allow the agent to make decisions over extended
periods, thereby encapsulating potential temporal dependencies within these extended
actions. This can effectively convert a non-Markovian problem into a Markovian one at the
level of options. State abstractions abstract away irrelevant aspects of the state and, thus,
can sometimes ignore specific temporal dependencies, rendering the process Markovian at the
level of the abstracted states. Thus, research into the role of decompositions in abstraction
opens up possibilities to understand the dependencies between non-Markovian models and
the abstractions they use to solve problems with incomplete information. Abstraction can
also simplify the observation space in POMDPs, reducing the complexity of the belief update
process. The abstraction might involve grouping similar observations, identifying higher-level
features, or other simplifications. Abstractions allow us to break partial observability into
different types instead of always assuming the worst-case scenario. Utilizing such restricted
assumptions on partial observability can help us build more specific algorithms and derive
convergence and optimality guarantees for such scenarios.
Big worlds. As we extend the information content of the environment to its extremity,
we delve into the realm of the big world hypothesis in RL (Javed, 2023), where the agent’s
environment is multiple orders of magnitude larger than the agent. The agent cannot
represent the optimal value function and policy even in the limit of infinite data. In such
scenarios, the agent must make decisions under significant uncertainty, which presents several
challenges, including exploration, generalization, and efficient learning. Even though the
hypothesis suggests that incorporating side information might not be beneficial in learning the
optimal policy and value in such scenarios, structural decomposition of large environments in
different ways can allow benchmarking methods along different axes while allowing a deeper
study into the performance of algorithms on parts of the environment that the agent has not
yet experienced. Modular decomposition can guide the agent’s exploration process by helping
the agent explore different parts of the environment independently. Incorporating modularity
opens a gateway to novel methods and theoretical insights about the relationships between
task decomposition, exploration, and learning efficiency in large environments. Relational
decompositions can help the agent learn relationships between different entities, bolstering
its ability to generalize to unseen parts of the environment. Finally, structural information
can be used to facilitate more efficient learning. For instance, in an auxiliary optimization
pattern, the agent could learn faster by optimizing auxiliary tasks that are easier to learn or
provide helpful information about the environment’s structure.
task’s decomposition can guide adaptation strategies in Meta-RL. This could lead to novel
methods or theories on adapting to new tasks more effectively based on their structure.
Acknowledgments
We wish to thank Robert Kirk and Rohan Chitnis for their discussion and comments on
drafts of this work. We would also like to thank Vincent François-Lavet, Khimya Khetarpal,
and Rishabh Aggarwal for providing additional relevant references in the literature.
References
Abdulhai, M., Kim, D., Riemer, M., Liu, M., Tesauro, G., & How, J. (2022). Context-
specific Representation Abstraction for Deep Option Learning. In Proceedings of the
Thirty-Sixth Conference on Artificial Intelligence (AAAI’22).
Abel, D., Hershkowitz, D., Barth-Maron, G., Brawner, S., O’Farrell, K., MacGlashan, J., &
Tellex, S. (2015). Goal-Based Action Priors. In Proceedings of the 25th International
Conference on Automated Planning and Scheduling (ICAPS’15).
Abel, D., Umbanhowar, N., Khetarpal, K., Arumugam, D., Precup, D., & Littman, M. (2020).
Value Preserving State-action Abstractions. In Proceedings of the 23rd International
Conference on Artificial Intelligence and Statistics (AISTATS’20).
Adjodah, D., Klinger, T., & Joseph, J. (2018). Symbolic Relation Networks for Reinforcement
Learning. In Proceedings of the Workshop on Relational Representation Learning in
the 31st Conference on Neural Information Processing Systems (NeurIPS’18).
Adriaensen, S., Biedenkapp, A., Shala, G., Awad, N., Eimer, T., Lindauer, M., & Hutter, F.
(2022). Automated Dynamic Algorithm Configuration. Journal of Artificial Intelligence
Research (JAIR), 75, 1633–1699.
Agarwal, A., Kakade, S., Krishnamurthy, A., & Sun, W. (2020). FLAMBE: Structural
Complexity and Representation Learning of Low Rank MDPs. In Larochelle, H.,
Ranzato, M., Hadsell, R., Balcan, M.-F., & Lin, H. (Eds.), Proceedings of the 34th
International Conference on Advances in Neural Information Processing Systems
(NeurIPS’20). Curran Associates.
Agarwal, R., Machado, M., Castro, P., & Bellemare, M. (2021). Contrastive Behavioral
Similarity Embeddings for Generalization in Reinforcement Learning. In Proceedings
of the 9th International Conference on Learning Representations (ICLR’21).
Alabdulkarim, A., Singh, M., Mansi, G., Hall, K., & Riedl, M. (2022). Experiential Explana-
tions for Reinforcement Learning. arXiv preprint, arXiv:2210.04723.
Alet, F., Schneider, M., Lozano-Pérez, T., & Kaelbling, L. (2020). Meta-learning Cu-
riosity Algorithms. In Proceedings of the 8th International Conference on Learning
Representations (ICLR’20).
Allen, C., Parikh, N., Gottesman, O., & Konidaris, G. (2021). Learning Markov State
Abstractions for Deep Reinforcement Learning. In Ranzato, M., Beygelzimer, A.,
Nguyen, K., Liang, P., Vaughan, J., & Dauphin, Y. (Eds.), Proceedings of the 35th
International Conference on Advances in Neural Information Processing Systems
(NeurIPS’21). Curran Associates.
Altman, E. (1999). Constrained Markov decision processes. Routledge.
Amin, S., Gomrokchi, M., Aboutalebi, H., Satija, H., & Precup, D. (2021a). Locally
Persistent Exploration in Continuous Control Tasks With Sparse Rewards. In Meila,
M., & Zhang, T. (Eds.), Proceedings of the 38th International Conference on Machine
Learning (ICML’21), Vol. 139 of Proceedings of Machine Learning Research. PMLR.
Amin, S., Gomrokchi, M., Satija, H., van Hoof, H., & Precup, D. (2021b). A Survey of
Exploration Methods in Reinforcement Learning. arXiv preprint, arXiv:2109.00157.
Andersen, G., & Konidaris, G. (2017). Active Exploration for Learning Symbolic Representa-
tions. In Guyon, I., von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan,
S., & Garnett, R. (Eds.), Proceedings of the 31st International Conference on Advances
in Neural Information Processing Systems (NeurIPS’17). Curran Associates.
Andreas, J., Klein, D., & Levine, S. (2018). Learning With Latent Language. In Proceed-
ings of the 2018 Conference of the North American Chapter of the Association for
Computational Linguistics - Human Language Technologies (NAACL - HLT’18).
Azizzadenesheli, K., Lazaric, A., & Anandkumar, A. (2017). Reinforcement Learning in Rich-
observation MDPs Using Spectral Methods. In Proceedings of the 3rd Multidisciplinary
Conference on Reinforcement Learning and Decision Making (RLDM’17).
Bacon, P., Harb, J., & Precup, D. (2017). The Option-critic Architecture. In Singh, S.,
& Markovitch, S. (Eds.), Proceedings of the Thirty-First Conference on Artificial
Intelligence (AAAI’17). AAAI Press.
Baheri, A. (2020). Safe Reinforcement Learning With Mixture Density Network: A Case
Study in Autonomous Highway Driving. arXiv preprint, arXiv:2007.01698.
Bain, M., & Sammut, C. (1995). A Framework for Behavioural Cloning. In Machine
Intelligence.
Balaji, B., Christodoulou, P., Jeon, B., & Bell-Masterson, J. (2020). FactoredRL: Leveraging
Factored Graphs for Deep Reinforcement Learning. In Deep Reinforcement Learning
Workshop in the 34th International Conference on Advances in Neural Information
Processing Systems (NeurIPS’20).
Bapst, V., Sanchez-Gonzalez, A., Doersch, C., Stachenfeld, K., Kohli, P., Battaglia, P., &
Hamrick, J. (2019). Structured Agents for Physical Construction. In Chaudhuri, K., &
Salakhutdinov, R. (Eds.), Proceedings of the 36th International Conference on Machine
Learning (ICML’19), Vol. 97. Proceedings of Machine Learning Research.
Barreto, A., Borsa, D., Hou, S., Comanici, G., Aygün, E., Hamel, P., Toyama, D., Hunt, J.,
Mourad, S., Silver, D., & Precup, D. (2019). The Option Keyboard: Combining Skills in
Reinforcement Learning. In Wallach, H., Larochelle, H., Beygelzimer, A., d’Alche Buc,
F., Fox, E., & Garnett, R. (Eds.), Proceedings of the 33rd International Conference on
Advances in Neural Information Processing Systems (NeurIPS’19). Curran Associates.
Barreto, A., Borsa, D., Quan, J., Schaul, T., Silver, D., Hessel, M., Mankowitz, D., Zidek, A.,
& Munos, R. (2018). Transfer in Deep Reinforcement Learning Using Successor Features
and Generalised Policy Improvement. In Dy, J., & Krause, A. (Eds.), Proceedings of the
35th International Conference on Machine Learning (ICML’18), Vol. 80. Proceedings
of Machine Learning Research.
Barreto, A., Dabney, W., Munos, R., Hunt, J., Schaul, T., van Hasselt, H., & Silver, D.
(2017). Successor Features for Transfer in Reinforcement Learning. In Guyon, I.,
von Luxburg, U., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., & Garnett,
R. (Eds.), Proceedings of the 31st International Conference on Advances in Neural
Information Processing Systems (NeurIPS’17). Curran Associates.
Bauer, J., Baumli, K., Baveja, S., Behbahani, F., Bhoopchand, A., Bradley-Schmieg, N.,
Chang, M., Clay, N., Collister, A., Dasagi, V., Gonzalez, L., Gregor, K., Hughes, E.,
Kashem, S., Loks-Thompson, M., Openshaw, H., Parker-Holder, J., Pathak, S., Nieves,
N., Rakicevic, N., Rocktäschel, T., Schroecker, Y., Sygnowski, J., Tuyls, K., York, S.,
Zacherl, A., & Zhang, L. (2023). Human-timescale Adaptation in an Open-ended Task
Space. arXiv preprint, arXiv:2301.07608.
Baumli, K., Warde-Farley, D., Hansen, S., & Mnih, V. (2021). Relative Variational Intrinsic
Control. In Yang, Q., Leyton-Brown, K., & Mausam (Eds.), Proceedings of the Thirty-
Fifth Conference on Artificial Intelligence (AAAI’21). Association for the Advancement
of Artificial Intelligence, AAAI Press.
Beck, J., Vuorio, R., Liu, E., Xiong, Z., Zintgraf, L., Finn, C., & Whiteson, S. (2023). A
Survey of Meta-reinforcement Learning. arXiv preprint, arXiv:2301.08028.
Bellman, R. (1954). Some Applications of the Theory of Dynamic Programming - A Review.
Operations Research, 2 (3), 275–288.
Belogolovsky, S., Korsunsky, P., Mannor, S., Tessler, C., & Zahavy, T. (2021). Inverse
Reinforcement Learning in Contextual MDPs. Machine Learning, 110 (9), 2295–2334.
Benjamins, C., Eimer, T., Schubert, F., Mohan, A., Döhler, S., Biedenkapp, A., Rosenhahn,
B., Hutter, F., & Lindauer, M. (2023). Contextualize Me - The Case for Context in
Reinforcement Learning. Transactions on Machine Learning Research, 2835-8856.
Bewley, T., & Lecue, F. (2022). Interpretable Preference-based Reinforcement Learning
With Tree-structured Reward Functions. In Proceedings of the 21st International
Conference on Autonomous Agents and Multiagent Systems (AAMAS’22).
Beyret, B., Shafti, A., & Faisal, A. (2019). Dot-to-dot: Explainable Hierarchical Reinforcement
Learning for Robotic Manipulation. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS’19), pp. 5014–5019.
Bhargava, P., Chitnis, R., Geramifard, A., Sodhani, S., & Zhang, A. (2024). Decision
Transformer is a Robust Contender for Offline Reinforcement Learning. In Proceedings
of the 12th International Conference on Learning Representations (ICLR’24).
Bhatt, V., Tjanaka, B., Fontaine, M., & Nikolaidis, S. (2022). Deep Surrogate Assisted
Generation of Environments. In Proceedings of the 36th International Conference on
Advances in Neural Information Processing Systems (NeurIPS’22).
Biza, O., Kipf, T., Klee, D., Platt, R., van de Meent, J., & Wong, L. (2022a). Factored
World Models for Zero-shot Generalization in Robotic Manipulation. arXiv preprint,
arXiv:2202.05333.
Biza, O., Platt, R., van de Meent, J., Wong, L., & Kipf, T. (2022b). Binding Actions to Objects
in World Models. In Workshop on the Elements of Reasoning: Objects, Structure and
Causality in the 10th International Conference on Learning Representations (ICLR’22).
Borsa, D., Barreto, A., Quan, J., Mankowitz, D., van Hasselt, H., Munos, R., Silver, D., &
Schaul, T. (2019). Universal Successor Features Approximators. In Proceedings of the
7th International Conference on Learning Representations (ICLR’19).
Borsa, D., Graepel, T., & Shawe-Taylor, J. (2016). Learning Shared Representations in
Multi-task Reinforcement Learning. arXiv preprint, arXiv:1603.02041.
Boutilier, C., Cohen, A., Hassidim, A., Mansour, Y., Meshi, O., Mladenov, M., & Schuur-
mans, D. (2018). Planning and Learning With Stochastic Action Sets. In Lang, J.
(Ed.), Proceedings of the 27th International Joint Conference on Artificial Intelligence
(IJCAI’18).
Boutilier, C., Dearden, R., & Goldszmidt, M. (1995). Exploiting Structure in Policy Con-
struction. In Mellish, C. (Ed.), Proceedings of the 14th International Joint Conference
on Artificial Intelligence (IJCAI’95). Morgan Kaufmann Publishers.
Boutilier, C., Dearden, R., & Goldszmidt, M. (2000). Stochastic Dynamic Programming
With Factored Representations. Artificial Intelligence, 121 (1-2), 49–107.
Bouton, M., Julian, K., Nakhaei, A., Fujimura, K., & Kochenderfer, M. (2019). Decomposition Methods With Deep Corrections for Reinforcement Learning. In Proceedings of
the 18th International Conference on Autonomous Agents and MultiAgent Systems
(AAMAS’19).
Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., & Zaremba,
W. (2016). OpenAI Gym. arXiv preprint, arXiv:1606.01540.
Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A.,
Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan,
T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler,
E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford,
A., Sutskever, I., & Amodei, D. (2020). Language Models are Few-Shot Learners. In
Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.-F., & Lin, H. (Eds.), Proceedings
of the 34th International Conference on Advances in Neural Information Processing
Systems (NeurIPS’20), pp. 1877–1901. Curran Associates.
Buchholz, P., & Scheftelowitsch, D. (2019). Computation of Weighted Sums of Rewards for
Concurrent MDPs. Mathematical Methods of Operations Research, 89 (1), 1–42.
Buesing, L., Weber, T., Zwols, Y., Heess, N., Racanière, S., Guez, A., & Lespiau, J. (2019).
Woulda, Coulda, Shoulda: Counterfactually-guided Policy Search. In Proceedings of
the 7th International Conference on Learning Representations (ICLR’19).
Burgess, C., Matthey, L., Watters, N., Kabra, R., Higgins, I., Botvinick, M., & Lerchner,
A. (2019). MONet: Unsupervised Scene Decomposition and Representation. arXiv
preprint, arXiv:1901.11390.
Castro, P., Kastner, T., Panangaden, P., & Rowland, M. (2021). MICo: Improved Represen-
tations via Sampling-based State Similarity for Markov Decision Processes. In Ranzato,
M., Beygelzimer, A., Nguyen, K., Liang, P., Vaughan, J., & Dauphin, Y. (Eds.),
Proceedings of the 35th International Conference on Advances in Neural Information
Processing Systems (NeurIPS’21). Curran Associates.
Castro, P., Kastner, T., Panangaden, P., & Rowland, M. (2023). A Kernel Perspective
on Behavioural Metrics for Markov Decision Processes. Transactions on Machine
Learning Research, 2835-8856.
Chandak, Y., Theocharous, G., Kostas, J., Jordan, S., & Thomas, P. (2019). Learning Action
Representations for Reinforcement Learning. In Chaudhuri, K., & Salakhutdinov,
R. (Eds.), Proceedings of the 36th International Conference on Machine Learning
(ICML’19), Vol. 97. Proceedings of Machine Learning Research.
Chen, C., Gao, Z., Xu, K., Yang, S., Li, Y., Ding, B., Feng, D., & Wang, H. (2022).
Nuclear Norm Maximization Based Curiosity-driven Learning. arXiv preprint,
arXiv:2205.10484.
Chen, C., Hu, S., Nikdel, P., Mori, G., & Savva, M. (2020). Relational Graph Learning for
Crowd Navigation. In Proceedings of the 2020 IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS’20).
Chen, C., Wan, T., Shi, P., Ding, B., Gao, Z., & Feng, D. (2022). Uncertainty Estimation
Based Intrinsic Reward For Efficient Reinforcement Learning. In Proceedings of the
2022 IEEE International Conference on Joint Cloud Computing (JCC’22), pp. 1–8.
Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., Abbeel, P., Srinivas, A.,
& Mordatch, I. (2021). Decision Transformer: Reinforcement Learning via Sequence
Modeling. In Ranzato, M., Beygelzimer, A., Nguyen, K., Liang, P., Vaughan, J., &
Dauphin, Y. (Eds.), Proceedings of the 35th International Conference on Advances in
Neural Information Processing Systems (NeurIPS’21). Curran Associates.
Cheung, W., Simchi-Levi, D., & Zhu, R. (2020). Reinforcement Learning for Non-stationary
Markov Decision Processes: The Blessing of (More) Optimism. In III, H. D., & Singh,
A. (Eds.), Proceedings of the 37th International Conference on Machine Learning
(ICML’20), Vol. 98. Proceedings of Machine Learning Research.
Christodoulou, P., Lange, R., Shafti, A., & Faisal, A. (2019). Reinforcement Learning
With Structured Hierarchical Grammar Representations of Actions. arXiv preprint,
arXiv:1910.02876.
Chu, Z., & Wang, H. (2023). Meta-reinforcement Learning via Exploratory Task Clustering.
arXiv preprint, arXiv:2302.07958.
Co-Reyes, J., Miao, Y., Peng, D., Real, E., Le, Q., Levine, S., Lee, H., & Faust, A. (2021).
Evolving Reinforcement Learning Algorithms. In Proceedings of the 9th International
Conference on Learning Representations (ICLR’21).
Dayan, P. (1993). Improving Generalization for Temporal Difference Learning: The Successor
Representation. Neural Computation, 5 (4), 613–624.
D’Eramo, C., Tateo, D., Bonarini, A., Restelli, M., & Peters, J. (2020). Sharing Knowledge
in Multi-task Deep Reinforcement Learning. In Proceedings of the 8th International
Conference on Learning Representations (ICLR’20).
Devin, C., Geng, D., Abbeel, P., Darrell, T., & Levine, S. (2019). Plan Arithmetic: Composi-
tional Plan Vectors for Multi-task Control. In Wallach, H., Larochelle, H., Beygelzimer,
A., d’Alche Buc, F., Fox, E., & Garnett, R. (Eds.), Proceedings of the 33rd Interna-
tional Conference on Advances in Neural Information Processing Systems (NeurIPS’19).
Curran Associates.
Devin, C., Gupta, A., Darrell, T., Abbeel, P., & Levine, S. (2017). Learning Modular Neural
Network Policies for Multi-task and Multi-robot Transfer. In Proceedings of the 2017
IEEE International Conference on Robotics and Automation (ICRA’17).
Ding, W., Lin, H., Li, B., & Zhao, D. (2022). Generalizing Goal-conditioned Reinforcement
Learning With Variational Causal Reasoning. In Proceedings of the 36th International
Conference on Advances in Neural Information Processing Systems (NeurIPS’22).
Diuk, C., Cohen, A., & Littman, M. (2008). An Object-oriented Representation for Efficient
Reinforcement Learning. In Cohen, W., McCallum, A., & Roweis, S. (Eds.), Proceedings
of the 25th International Conference on Machine Learning (ICML’08). Omnipress.
Dockhorn, A., & Kruse, R. (2023). State and Action Abstraction for Search and Reinforcement
Learning Algorithms. In Artificial Intelligence in Control and Decision-making Systems:
Dedicated to Professor Janusz Kacprzyk, pp. 181–198. Springer.
Du, S., Krishnamurthy, A., Jiang, N., Agarwal, A., Dudík, M., & Langford, J. (2019). Provably
Efficient RL With Rich Observations via Latent State Decoding. In Chaudhuri, K., &
Salakhutdinov, R. (Eds.), Proceedings of the 36th International Conference on Machine
Learning (ICML’19), Vol. 97. Proceedings of Machine Learning Research.
Dunion, M., McInroe, T., Luck, K., Hanna, J., & Albrecht, S. (2023a). Conditional Mu-
tual Information for Disentangled Representations in Reinforcement Learning. In
Proceedings of the 37th International Conference on Advances in Neural Information
Processing Systems (NeurIPS’23).
Dunion, M., McInroe, T., Luck, K., Hanna, J., & Albrecht, S. (2023b). Temporal Disentan-
glement of Representations for Improved Generalisation in Reinforcement Learning.
In Proceedings of the 11th International Conference on Learning Representations
(ICLR’23).
Dzeroski, S., De Raedt, L., & Driessens, K. (2001). Relational Reinforcement Learning. Machine
Learning, 43 (1/2), 7–52.
Ecoffet, A., Huizinga, J., Lehman, J., Stanley, K., & Clune, J. (2021). First Return, Then
Explore. Nature, 590 (7847), 580–586.
Eimer, T., Lindauer, M., & Raileanu, R. (2023). Hyperparameters in Reinforcement Learning
and How To Tune Them. In Proceedings of the International Conference on Machine
Learning (ICML’23).
Eßer, J., Bach, N., Jestel, C., Urbann, O., & Kerner, S. (2023). Guided Reinforcement
Learning: A Review and Evaluation for Efficient and Effective Real-World Robotics
[Survey]. IEEE Robotics & Automation Magazine, 30 (2), 67–85.
Eysenbach, B., Gupta, A., Ibarz, J., & Levine, S. (2019). Diversity is All You Need: Learning
Skills Without a Reward Function. In Proceedings of the 7th International Conference
on Learning Representations (ICLR’19).
Fern, A., Yoon, S., & Givan, R. (2006). Approximate Policy Iteration With a Policy Language
Bias: Solving Relational Markov Decision Processes. Journal of Artificial Intelligence
Research, 25, 75–118.
Fitch, R., Hengst, B., Suc, D., Calbert, G., & Scholz, J. (2005). Structural Abstraction
Experiments in Reinforcement Learning. In Zhang, S., & Jarvis, R. (Eds.), Proceedings
of the 18th Australian Joint Conference on Artificial Intelligence, Vol. 3809, pp. 164–
175.
Florensa, C., Duan, Y., & Abbeel, P. (2017). Stochastic Neural Networks for Hierarchical
Reinforcement Learning. In Proceedings of the 5th International Conference on Learning
Representations (ICLR’17).
Fox, R., Pakman, A., & Tishby, N. (2016). Taming the Noise in Reinforcement Learning via
Soft Updates. In Ihler, A., & Janzing, D. (Eds.), Proceedings of the 32nd Conference
on Uncertainty in Artificial Intelligence (UAI’16). AUAI Press.
Fu, X., Yang, G., Agrawal, P., & Jaakkola, T. (2021). Learning Task Informed Abstractions.
In Meila, M., & Zhang, T. (Eds.), Proceedings of the 38th International Conference on
Machine Learning (ICML’21), Vol. 139 of Proceedings of Machine Learning Research.
PMLR.
Furelos-Blanco, D., Law, M., Jonsson, A., Broda, K., & Russo, A. (2021). Induction and
Exploitation of Subgoal Automata for Reinforcement Learning. Journal of Artificial
Intelligence Research, 70, 1031–1116.
Gallouedec, Q., & Dellandrea, E. (2023). Cell-free Latent Go-explore. In Proceedings of the
40th International Conference on Machine Learning (ICML’23).
Garcia, J., & Fernandez, F. (2015). A Comprehensive Survey on Safe Reinforcement Learning.
Journal of Machine Learning Research, 16, 1437–1480.
Garg, S., Bajpai, A., & Mausam (2020). Symbolic Network: Generalized Neural Policies
for Relational MDPs. In III, H. D., & Singh, A. (Eds.), Proceedings of the 37th
International Conference on Machine Learning (ICML’20), Vol. 98. Proceedings of
Machine Learning Research.
Garnelo, M., Arulkumaran, K., & Shanahan, M. (2016). Towards Deep Symbolic Reinforce-
ment Learning. arXiv preprint, arXiv:1609.05518.
Gasse, M., Grasset, D., Gaudron, G., & Oudeyer, P. (2021). Causal Reinforcement Learning
Using Observational and Interventional Data. arXiv preprint, arXiv:2106.14421.
Gaya, J., Doan, T., Caccia, L., Soulier, L., Denoyer, L., & Raileanu, R. (2022a). Building
a Subspace of Policies for Scalable Continual Learning. In Decision Awareness in
Reinforcement Learning Workshop at the 39th International Conference on Machine
Learning (ICML’22).
Gaya, J., Soulier, L., & Denoyer, L. (2022b). Learning a Subspace of Policies for Online
Adaptation in Reinforcement Learning. In Proceedings of the 10th International
Conference on Learning Representations (ICLR’22).
Gehring, J., Synnaeve, G., Krause, A., & Usunier, N. (2021). Hierarchical Skills for Efficient
Exploration. In Ranzato, M., Beygelzimer, A., Nguyen, K., Liang, P., Vaughan, J., &
Dauphin, Y. (Eds.), Proceedings of the 35th International Conference on Advances in Neural Information Processing Systems (NeurIPS’21). Curran Associates.
Gupta, A., Devin, C., Liu, Y., Abbeel, P., & Levine, S. (2017). Learning Invariant Feature
Spaces to Transfer Skills With Reinforcement Learning. In Proceedings of the 5th
International Conference on Learning Representations (ICLR’17).
Gupta, A., Mendonca, R., Liu, Y., Abbeel, P., & Levine, S. (2018). Meta-Reinforcement
Learning of Structured Exploration Strategies. In Bengio, S., Wallach, H., Larochelle,
H., Grauman, K., Cesa-Bianchi, N., & Garnett, R. (Eds.), Proceedings of the 31st
International Conference on Advances in Neural Information Processing Systems
(NeurIPS’18). Curran Associates.
Gur, I., Jaques, N., Miao, Y., Choi, J., Tiwari, M., Lee, H., & Faust, A. (2021). Environment
Generation for Zero-shot Compositional Reinforcement Learning. In Ranzato, M.,
Beygelzimer, A., Nguyen, K., Liang, P., Vaughan, J., & Dauphin, Y. (Eds.), Proceedings
of the 35th International Conference on Advances in Neural Information Processing
Systems (NeurIPS’21). Curran Associates.
Haarnoja, T., Hartikainen, K., Abbeel, P., & Levine, S. (2018a). Latent Space Policies for
Hierarchical Reinforcement Learning. In Dy, J., & Krause, A. (Eds.), Proceedings of the
35th International Conference on Machine Learning (ICML’18), Vol. 80. Proceedings
of Machine Learning Research.
Haarnoja, T., Pong, V., Zhou, A., Dalal, M., Abbeel, P., & Levine, S. (2018b). Composable
Deep Reinforcement Learning for Robotic Manipulation. In 2018 IEEE International
Conference on Robotics and Automation (ICRA’18).
Hafner, D., Lillicrap, T., Ba, J., & Norouzi, M. (2020). Dream to Control: Learning Behaviors
by Latent Imagination. In III, H. D., & Singh, A. (Eds.), Proceedings of the 37th
International Conference on Machine Learning (ICML’20), Vol. 98. Proceedings of
Machine Learning Research.
Hafner, D., Pasukonis, J., Ba, J., & Lillicrap, T. (2023). Mastering Diverse Domains Through
World Models. arXiv preprint, arXiv:2301.04104.
Hallak, A., Castro, D. D., & Mannor, S. (2015). Contextual Markov Decision Processes.
arXiv preprint, arXiv:1502.02259.
Hansen-Estruch, P., Zhang, A., Nair, A., Yin, P., & Levine, S. (2022). Bisimulation Makes
Analogies in Goal-conditioned Reinforcement Learning. In Chaudhuri, K., Jegelka,
S., Song, L., Szepesvári, C., Niu, G., & Sabato, S. (Eds.), Proceedings of the 39th
International Conference on Machine Learning (ICML’22), Vol. 162 of Proceedings of
Machine Learning Research. PMLR.
Harutyunyan, A., Dabney, W., Borsa, D., Heess, N., Munos, R., & Precup, D. (2019). The
Termination Critic. In Proceedings of the 22nd International Conference on Artificial
Intelligence and Statistics (AISTATS’19).
Hausman, K., Springenberg, J., Wang, Z., Heess, N., & Riedmiller, M. (2018). Learning an
Embedding Space for Transferable Robot Skills. In Proceedings of the 6th International
Conference on Learning Representations (ICLR’18).
Hazan, E., Kakade, S., Singh, K., & van Soest, A. (2019). Provably Efficient Maximum
Entropy Exploration. In Chaudhuri, K., & Salakhutdinov, R. (Eds.), Proceedings of the
36th International Conference on Machine Learning (ICML’19), Vol. 97. Proceedings of Machine Learning Research.
Koller, D., & Parr, R. (1999). Computing Factored Value Functions for Policies in Structured
MDPs. In Dean, T. (Ed.), Proceedings of the 16th International Joint Conference on
Artificial Intelligence (IJCAI’99). Morgan Kaufmann Publishers.
Kooi, J., Hoogendoorn, M., & François-Lavet, V. (2022). Disentangled (Un)Controllable
Features. arXiv preprint, arXiv:2211.00086.
Kulkarni, T., Narasimhan, K., Saeedi, A., & Tenenbaum, J. (2016). Hierarchical Deep
Reinforcement Learning: Integrating Temporal Abstraction and Intrinsic Motivation.
In Lee, D., Sugiyama, M., von Luxburg, U., Guyon, I., & Garnett, R. (Eds.), Proceedings
of the 30th International Conference on Advances in Neural Information Processing
Systems (NeurIPS’16). Curran Associates.
Kumar, A., Zhou, A., Tucker, G., & Levine, S. (2020). Conservative Q-learning for Offline
Reinforcement Learning. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.-F.,
& Lin, H. (Eds.), Proceedings of the 34th International Conference on Advances in
Neural Information Processing Systems (NeurIPS’20). Curran Associates.
Kumar, S., Correa, C., Dasgupta, I., Marjieh, R., Hu, M., Hawkins, R., Daw, N., Cohen,
J., Narasimhan, K., & Griffiths, T. (2022). Using Natural Language and Program
Abstractions to Instill Human Inductive Biases in Machines. In Proceedings of the
36th International Conference on Advances in Neural Information Processing Systems
(NeurIPS’22).
Kumar, S., Dasgupta, I., Cohen, J., Daw, N., & Griffiths, T. (2021). Meta-learning of
Structured Task Distributions in Humans and Machines. In Proceedings of the 9th
International Conference on Learning Representations (ICLR’21).
Lampinen, A., Roy, N., Dasgupta, I., Chan, S., Tam, A., Mcclelland, J., Yan, C., Santoro,
A., Rabinowitz, N., Wang, J., & Hill, F. (2022). Tell me Why! Explanations Support
Learning Relational and Causal Structure. In Chaudhuri, K., Jegelka, S., Song, L.,
Szepesvári, C., Niu, G., & Sabato, S. (Eds.), Proceedings of the 39th International
Conference on Machine Learning (ICML’22), Vol. 162 of Proceedings of Machine
Learning Research. PMLR.
Lan, C., & Agarwal, R. (2023). Revisiting Bisimulation: A Sampling-based State Similarity
Pseudo-metric. In The First Tiny Papers Track at the 11th International Conference
on Learning Representations (ICLR’23).
Lan, C., Bellemare, M., & Castro, P. (2021). Metrics and Continuity in Reinforcement
Learning. In Yang, Q., Leyton-Brown, K., & Mausam (Eds.), Proceedings of the Thirty-
Fifth Conference on Artificial Intelligence (AAAI’21). Association for the Advancement
of Artificial Intelligence, AAAI Press.
Lan, Q., Mahmood, A., Yan, S., & Xu, Z. (2023). Learning to Optimize for Reinforcement
Learning. arXiv preprint, arXiv:2302.01470.
Laroche, R., & Feraud, R. (2022). Reinforcement Learning Algorithm Selection. In Proceed-
ings of the 6th International Conference on Learning Representations (ICLR’22).
Laskin, M., Yarats, D., Liu, H., Lee, K., Zhan, A., Lu, K., Cang, C., Pinto, L., & Abbeel, P.
(2021). URLB: Unsupervised Reinforcement Learning Benchmark. In Vanschoren, J.,
& Yeung, S. (Eds.), Proceedings of the Neural Information Processing Systems Track
on Datasets and Benchmarks (NeurIPS’21). Curran Associates.
Lu, M., Shahn, Z., Sow, D., Doshi-Velez, F., & Lehman, L. (2020). Is Deep Reinforcement
Learning Ready for Practical Applications in Healthcare? A Sensitivity Analysis of
Duel-DDQN for Hemodynamic Management in Sepsis Patients. In Proceedings of the
American Medical Informatics Association Annual Symposium (AMIA’20).
Luis, J., Miao, Y., Co-Reyes, J., Parisi, A., Tan, J., Real, E., & Faust, A. (2022). Multi-
objective Evolution for Generalizable Policy Gradient Algorithms. In Workshop on
Generalizable Policy Learning in Physical World in the 10th International Conference
on Learning Representations (ICLR’22).
Lyu, D., Yang, F., Liu, B., & Gustafson, S. (2019). SDRL: Interpretable and Data-efficient
Deep Reinforcement Learning Leveraging Symbolic Planning. In Hentenryck, P. V., &
Zhou, Z. (Eds.), Proceedings of the Thirty-Third Conference on Artificial Intelligence
(AAAI’19). AAAI Press.
Lyu, Y., Côme, A., Zhang, Y., & Talebi, M. (2023). Scaling Up Q-learning via Exploiting
State-action Equivalence. Entropy, 25 (4), 584.
Mahadevan, S., & Maggioni, M. (2007). Proto-value Functions: A Laplacian Framework
for Learning Representation and Control in Markov Decision Processes. Journal of
Machine Learning Research, 8, 2169–2231.
Mahajan, A., Samvelyan, M., Mao, L., Makoviychuk, V., Garg, A., Kossaifi, J., Whiteson, S.,
Zhu, Y., & Anandkumar, A. (2021). Reinforcement Learning in Factored Action Spaces
Using Tensor Decompositions. In Workshop on Relational Representation Learning in
the 34th Conference on Neural Information Processing Systems (NeurIPS’21).
Mahajan, A., & Tulabandhula, T. (2017). Symmetry Learning for Function Approximation
in Reinforcement Learning. arXiv preprint, arXiv:1706.02999.
Mambelli, D., Träuble, F., Bauer, S., Schölkopf, B., & Locatello, F. (2022). Compositional
Multi-object Reinforcement Learning With Linear Relation Networks. In Workshop on
the Elements of Reasoning: Objects, Structure and Causality at the 10th International
Conference on Learning Representations (ICLR’22).
Mankowitz, D., Mann, T., & Mannor, S. (2015). Bootstrapping Skills. arXiv preprint,
arXiv:1506.03624.
Mannor, S., & Tamar, A. (2023). Towards Deployable RL – what’s Broken With RL Research
and a Potential Fix. arXiv preprint, arXiv:2301.01320.
Martinez, D., Alenya, G., & Torras, C. (2017). Relational Reinforcement Learning With
Guided Demonstrations. Artificial Intelligence, 247, 295–312.
Marzi, T., Khehra, A., Cini, A., & Alippi, C. (2023). Feudal Graph Reinforcement Learning.
arXiv preprint, arXiv:2304.05099.
Mausam, & Weld, D. (2003). Solving Relational MDPs With First-order Machine Learning. In
Proceedings of the Workshop on Planning under Uncertainty and Incomplete Information
at the 13th International Conference on Automated Planning & Scheduling (ICAPS’03).
Mendez, J., Hussing, M., Gummadi, M., & Eaton, E. (2022a). CompoSuite: A Compositional
Reinforcement Learning Benchmark. In Chandar, S., Pascanu, R., & Precup, D. (Eds.),
Proceedings of the 1st Conference on Lifelong Learning Agents (CoLLAs’22). PMLR.
Mutti, M., Mancassola, M., & Restelli, M. (2022). Unsupervised Reinforcement Learning in
Multiple Environments. In Sycara, K., Honavar, V., & Spaan, M. (Eds.), Proceedings
of the Thirty-Sixth Conference on Artificial Intelligence (AAAI’22). Association for
the Advancement of Artificial Intelligence, AAAI Press.
Mutti, M., Pratissoli, L., & Restelli, M. (2021). Task-agnostic Exploration via Policy
Gradient of a Non-parametric State Entropy Estimate. In Yang, Q., Leyton-Brown, K.,
& Mausam (Eds.), Proceedings of the Thirty-Fifth Conference on Artificial Intelligence
(AAAI’21). Association for the Advancement of Artificial Intelligence, AAAI Press.
Nachum, O., Gu, S., Lee, H., & Levine, S. (2018). Data-efficient Hierarchical Reinforcement
Learning. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N.,
& Garnett, R. (Eds.), Proceedings of the 31st International Conference on Advances
in Neural Information Processing Systems (NeurIPS’18). Curran Associates.
Nam, T., Sun, S., Pertsch, K., Hwang, S., & Lim, J. (2022). Skill-based Meta-reinforcement
Learning. In Proceedings of the 10th International Conference on Learning Represen-
tations (ICLR’22).
Narvekar, S., Sinapov, J., Leonetti, M., & Stone, P. (2016). Source Task Creation for
Curriculum Learning. In Jonker, C., Marsella, S., Thangarajah, J., & Tuyls, K. (Eds.),
Proceedings of the International Conference on Autonomous Agents & Multiagent
Systems (AAMAS’16), pp. 566–574.
Ng, A., Harada, D., & Russell, S. (1999). Policy Invariance Under Reward Transformations:
Theory and Application to Reward Shaping. In Bratko, I. (Ed.), Proceedings of
the Sixteenth International Conference on Machine Learning (ICML’99). Morgan
Kaufmann Publishers.
Oh, J., Hessel, M., Czarnecki, W., Xu, Z., van Hasselt, H., Singh, S., & Silver, D. (2020).
Discovering Reinforcement Learning Algorithms. In Larochelle, H., Ranzato, M.,
Hadsell, R., Balcan, M.-F., & Lin, H. (Eds.), Proceedings of the 34th International
Conference on Advances in Neural Information Processing Systems (NeurIPS’20).
Curran Associates.
Ok, J., Proutière, A., & Tranos, D. (2018). Exploration in Structured Reinforcement Learning.
In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., & Garnett,
R. (Eds.), Proceedings of the 31st International Conference on Advances in Neural
Information Processing Systems (NeurIPS’18). Curran Associates.
Oliva, M., Banik, S., Josifovski, J., & Knoll, A. (2022). Graph Neural Networks for Relational
Inductive Bias in Vision-based Deep Reinforcement Learning of Robot Control. In
Proceedings of the 2022 International Joint Conference on Neural Networks (IJCNN’22), pp. 1–9.
OpenAI (2023). GPT-4 Technical Report. arXiv preprint, arXiv:2303.08774.
Papini, M., Tirinzoni, A., Pacchiano, A., Restelli, M., Lazaric, A., & Pirotta, M. (2021). Re-
inforcement Learning in Linear MDPs: Constant Regret and Representation Selection.
In Ranzato, M., Beygelzimer, A., Nguyen, K., Liang, P., Vaughan, J., & Dauphin,
Y. (Eds.), Proceedings of the 35th International Conference on Advances in Neural
Information Processing Systems (NeurIPS’21). Curran Associates.
Parker-Holder, J., Nguyen, V., & Roberts, S. J. (2020). Provably Efficient Online Hyperparameter Optimization With Population-based Bandits. In Larochelle, H., Ranzato, M.,
Hadsell, R., Balcan, M.-F., & Lin, H. (Eds.), Proceedings of the 34th International
Conference on Advances in Neural Information Processing Systems (NeurIPS’20).
Curran Associates.
Parker-Holder, J., Rajan, R., Song, X., Biedenkapp, A., Miao, Y., Eimer, T., Zhang, B.,
Nguyen, V., Calandra, R., Faust, A., Hutter, F., & Lindauer, M. (2022). Automated
Reinforcement Learning (AutoRL): A Survey and Open Problems. Journal of Artificial
Intelligence Research (JAIR), 74, 517–568.
Parr, R., & Russell, S. (1997). Reinforcement Learning With Hierarchies of Machines. In
Proceedings of the Tenth International Conference on Advances in Neural Information
Processing Systems (NeurIPS’97).
Pateria, S., Subagdja, B., Tan, A., & Quek, C. (2022). Hierarchical Reinforcement Learning:
A Comprehensive Survey. ACM Computing Surveys, 54 (5), 109:1–109:35.
Pathak, D., Agrawal, P., Efros, A., & Darrell, T. (2017). Curiosity-driven Exploration by
Self-supervised Prediction. In Precup, D., & Teh, Y. (Eds.), Proceedings of the 34th
International Conference on Machine Learning (ICML’17), Vol. 70. Proceedings of
Machine Learning Research.
Pathak, D., Lu, C., Darrell, T., Isola, P., & Efros, A. (2019). Learning to Control Self-
assembling Morphologies: a Study of Generalization via Modularity. In Wallach,
H., Larochelle, H., Beygelzimer, A., d’Alche Buc, F., Fox, E., & Garnett, R. (Eds.),
Proceedings of the 33rd International Conference on Advances in Neural Information
Processing Systems (NeurIPS’19). Curran Associates.
Payani, A., & Fekri, F. (2020). Incorporating Relational Background Knowledge Into Rein-
forcement Learning via Differentiable Inductive Logic Programming. arXiv preprint,
arXiv:2003.10386.
Peng, X., Chang, M., Zhang, G., Abbeel, P., & Levine, S. (2019). MCP: Learning Composable
Hierarchical Control With Multiplicative Compositional Policies. In Wallach, H.,
Larochelle, H., Beygelzimer, A., d’Alche Buc, F., Fox, E., & Garnett, R. (Eds.),
Proceedings of the 33rd International Conference on Advances in Neural Information
Processing Systems (NeurIPS’19). Curran Associates.
Perez, C., Such, F., & Karaletsos, T. (2020). Generalized Hidden Parameter MDPs Transfer-
able Model-based RL in a Handful of Trials. In Rossi, F., Conitzer, V., & Sha, F. (Eds.),
Proceedings of the Thirty-Fourth Conference on Artificial Intelligence (AAAI’20). As-
sociation for the Advancement of Artificial Intelligence, AAAI Press.
Peters, J., Buhlmann, P., & Meinshausen, N. (2016). Causal Inference by Using Invariant
Prediction: Identification and Confidence Intervals. Journal of the Royal Statistical
Society. Series B (Statistical Methodology), 78 (5), 947–1012.
Pitis, S., Creager, E., & Garg, A. (2020). Counterfactual Data Augmentation Using Locally
Factored Dynamics. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.-F., & Lin,
H. (Eds.), Proceedings of the 34th International Conference on Advances in Neural
Information Processing Systems (NeurIPS’20). Curran Associates.
Prakash, B., Waytowich, N., Ganesan, A., Oates, T., & Mohsenin, T. (2020). Guiding Safe
Reinforcement Learning Policies Using Structured Language Constraints. In Espinoza,
H., Hernández-Orallo, J., Chen, X. C., ÓhÉigeartaigh, S., Huang, X., Castillo-Effen,
M., Mallah, R., & McDermid, J. (Eds.), Proceedings of the Workshop on Artificial
Intelligence Safety (SafeAI), co-located with 34th Conference on Artificial Intelligence
(AAAI’20), Vol. 2560, pp. 153–161.
Prakash, B., Waytowich, N., Oates, T., & Mohsenin, T. (2022). Towards an Interpretable Hi-
erarchical Agent Framework Using Semantic Goals. arXiv preprint, arXiv:2210.08412.
Prudencio, R., Maximo, M., & Colombini, E. (2023). A Survey on Offline Reinforcement
Learning: Taxonomy, Review, and Open Problems. IEEE Transactions on Neural Networks and Learning Systems.
Puterman, M. (2014). Markov decision processes: discrete stochastic dynamic programming.
John Wiley & Sons.
Raza, S., & Lin, M. (2019). Policy Reuse in Reinforcement Learning for Modular Agents.
In IEEE 2nd International Conference on Information and Computer Technologies
(ICICT’19).
Ross, S., & Pineau, J. (2008). Model-based Bayesian Reinforcement Learning in Large
Structured Domains. In Proceedings of the 24th Conference in Uncertainty in Artificial
Intelligence (UAI’08).
Russell, S., & Zimdars, A. (2003). Q-Decomposition for Reinforcement Learning Agents. In
Fawcett, T., & Mishra, N. (Eds.), Proceedings of the 20th International Conference on
Machine Learning (ICML’03). Omnipress.
Rusu, A., Colmenarejo, S., Gülçehre, C., Desjardins, G., Kirkpatrick, J., Pascanu, R., Mnih,
V., Kavukcuoglu, K., & Hadsell, R. (2016). Policy Distillation. In Proceedings of 4th
International Conference on Learning Representations (ICLR’16).
Salimans, T., Ho, J., Chen, X., & Sutskever, I. (2017). Evolution Strategies as a Scalable
Alternative to Reinforcement Learning. arXiv preprint, arXiv:1703.03864.
Sanner, S., & Boutilier, C. (2005). Approximate Linear Programming for First-order MDPs.
In Proceedings of the 21st Conference in Uncertainty in Artificial Intelligence (UAI’05),
pp. 509–517.
Saxe, A., Earle, A., & Rosman, B. (2017). Hierarchy Through Composition With Multitask
LMDPs. In Precup, D., & Teh, Y. (Eds.), Proceedings of the 34th International
Conference on Machine Learning (ICML’17), Vol. 70. Proceedings of Machine Learning
Research.
Schaul, T., Horgan, D., Gregor, K., & Silver, D. (2015). Universal Value Function Ap-
proximators. In Bach, F., & Blei, D. (Eds.), Proceedings of the 32nd International
Conference on Machine Learning (ICML’15), Vol. 37. Omnipress.
Schiewer, R., & Wiskott, L. (2021). Modular Networks Prevent Catastrophic Interference in
Model-based Multi-task Reinforcement Learning. In Proceedings of the Seventh Inter-
national Conference on Machine Learning, Optimization, and Data Science (LOD’21),
Vol. 13164, pp. 299–313.
Seitzer, M., Schölkopf, B., & Martius, G. (2021). Causal Influence Detection for Improving
Efficiency in Reinforcement Learning. In Ranzato, M., Beygelzimer, A., Nguyen, K.,
Liang, P., Vaughan, J., & Dauphin, Y. (Eds.), Proceedings of the 35th International
Conference on Advances in Neural Information Processing Systems (NeurIPS’21).
Curran Associates.
Shanahan, M., Nikiforou, K., Creswell, A., Kaplanis, C., Barrett, D., & Garnelo, M. (2020).
An Explicitly Relational Neural Network Architecture. In III, H. D., & Singh, A. (Eds.),
Proceedings of the 37th International Conference on Machine Learning (ICML’20),
Vol. 98. Proceedings of Machine Learning Research.
Sharma, A., Gu, S., Levine, S., Kumar, V., & Hausman, K. (2020). Dynamics-aware
Unsupervised Discovery of Skills. In Proceedings of the 8th International Conference
on Learning Representations (ICLR’20).
Sharma, V., Arora, D., Geisser, F., Mausam, & Singla, P. (2022). SymNet 2.0: Effectively
Handling Non-fluents and Actions in Generalized Neural Policies for RDDL Relational
MDPs. In de Campos, C., & Maathuis, M. (Eds.), Proceedings of The 38th Uncertainty
in Artificial Intelligence Conference (UAI’22). PMLR.
Shu, T., Xiong, C., & Socher, R. (2018). Hierarchical and Interpretable Skill Acquisition in
Multi-task Reinforcement Learning. In Proceedings of the 6th International Conference
on Learning Representations (ICLR’18).
Shyam, P., Jaskowski, W., & Gomez, F. (2019). Model-based Active Exploration. In
Chaudhuri, K., & Salakhutdinov, R. (Eds.), Proceedings of the 36th International
Conference on Machine Learning (ICML’19), Vol. 97. Proceedings of Machine Learning
Research.
Silver, D., Huang, A., Maddison, C., Guez, A., Sifre, L., Driessche, G., Schrittwieser, J.,
Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J.,
Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel,
T., & Hassabis, D. (2016). Mastering the Game of Go With Deep Neural Networks and Tree Search. Nature, 529 (7587), 484–489.
Simao, T., Jansen, N., & Spaan, M. (2021). AlwaysSafe: Reinforcement Learning Without
Safety Constraint Violations During Training. In Proceedings of the 20th International
Conference on Autonomous Agents and MultiAgent Systems (AAMAS’21).
Singh, G., Peri, S., Kim, J., Kim, H., & Ahn, S. (2021). Structured World Belief for
Reinforcement Learning in POMDPs. In Meila, M., & Zhang, T. (Eds.), Proceedings
of the 38th International Conference on Machine Learning (ICML’21), Vol. 139 of
Proceedings of Machine Learning Research. PMLR.
Sodhani, S., Levine, S., & Zhang, A. (2022a). Improving Generalization With Approximate
Factored Value Functions. In Workshop on the Elements of Reasoning: Objects, Struc-
ture and Causality at the 10th International Conference on Learning Representations
(ICLR’22).
Sodhani, S., Meier, F., Pineau, J., & Zhang, A. (2022b). Block Contextual MDPs for
Continual Learning. In Learning for Dynamics and Control Conference.
Sodhani, S., Zhang, A., & Pineau, J. (2021). Multi-task Reinforcement Learning With
Context-based Representations. In Meila, M., & Zhang, T. (Eds.), Proceedings of the
38th International Conference on Machine Learning (ICML’21), Vol. 139 of Proceedings
of Machine Learning Research. PMLR.
Sohn, S., Oh, J., & Lee, H. (2018). Hierarchical Reinforcement Learning for Zero-shot
Generalization With Subtask Dependencies. In Bengio, S., Wallach, H., Larochelle,
H., Grauman, K., Cesa-Bianchi, N., & Garnett, R. (Eds.), Proceedings of the 31st
International Conference on Advances in Neural Information Processing Systems
(NeurIPS’18). Curran Associates.
Sohn, S., Woo, H., Choi, J., & Lee, H. (2020). Meta Reinforcement Learning With Au-
tonomous Inference of Subtask Dependencies. In Proceedings of the 8th International
Conference on Learning Representations (ICLR’20).
Solway, A., Diuk, C., Córdova, N., Yee, D., Barto, A., Niv, Y., & Botvinick, M. (2014).
Optimal Behavioral Hierarchy. PLoS Computational Biology, 10 (8).
Song, Y., Suganthan, P., Pedrycz, W., Ou, J., He, Y., & Chen, Y. (2023). Ensemble
Reinforcement Learning: A Survey. arXiv preprint, arXiv:2303.02618.
Spooner, T., Vadori, N., & Ganesh, S. (2021). Factored Policy Gradients: Leveraging
Structure for Efficient Learning in MOMDPs. In Ranzato, M., Beygelzimer, A.,
Nguyen, K., Liang, P., Vaughan, J., & Dauphin, Y. (Eds.), Proceedings of the 35th
International Conference on Advances in Neural Information Processing Systems
(NeurIPS’21). Curran Associates.
Srinivas, A., & Abbeel, P. (2021). Unsupervised Learning for Reinforcement Learning.
Tutorial in the 9th International Conference on Learning Representations (ICLR’21).
Srouji, M., Zhang, J., & Salakhutdinov, R. (2018). Structured Control Nets for Deep
Reinforcement Learning. In Dy, J., & Krause, A. (Eds.), Proceedings of the 35th
International Conference on Machine Learning (ICML’18), Vol. 80. Proceedings of
Machine Learning Research.
Steccanella, L., Totaro, S., & Jonsson, A. (2022). Hierarchical Representation Learning for
Markov Decision Processes. In Proceedings of the Thirty-Sixth Conference on Artificial
Intelligence (AAAI’22).
Strehl, A., Li, L., & Littman, M. (2009). Reinforcement Learning in Finite MDPs: PAC
Analysis. Journal of Machine Learning Research, 10.
Sun, W., Jiang, N., Krishnamurthy, A., Agarwal, A., & Langford, J. (2019). Model-based RL
in Contextual Decision Processes: PAC Bounds and Exponential Improvements Over
Model-free Approaches. In Proceedings of the 32nd Conference on Learning Theory
(COLT’19).
Sun, Y., Ma, S., Madaan, R., Bonatti, R., Huang, F., & Kapoor, A. (2023). SMART:
Self-supervised Multi-task pretrAining With contRol Transformers. In Proceedings of
the 11th International Conference on Learning Representations (ICLR’23).
Sun, Y., Yin, X., & Huang, F. (2021). TempLe: Learning Template of Transitions for Sample
Efficient Multi-task RL. In Yang, Q., Leyton-Brown, K., & Mausam (Eds.), Proceedings
of the Thirty-Fifth Conference on Artificial Intelligence (AAAI’21). Association for the Advancement of Artificial Intelligence, AAAI Press.
Wang, T., Du, S., Torralba, A., Isola, P., Zhang, A., & Tian, Y. (2022). Denoised MDPs:
Learning World Models Better Than the World Itself. In Chaudhuri, K., Jegelka,
S., Song, L., Szepesvári, C., Niu, G., & Sabato, S. (Eds.), Proceedings of the 39th
International Conference on Machine Learning (ICML’22), Vol. 162 of Proceedings of
Machine Learning Research. PMLR.
Wang, T., Liao, R., Ba, J., & Fidler, S. (2018). Nervenet: Learning Structured Policy
With Graph Neural Networks. In Proceedings of the 6th International Conference on
Learning Representations (ICLR’18).
Wang, T., Torralba, A., Isola, P., & Zhang, A. (2023). Optimal Goal-reaching Reinforcement
Learning via Quasimetric Learning. In Krause, A., Brunskill, E., Cho, K., Engelhardt,
B., Sabato, S., & Scarlett, J. (Eds.), Proceedings of the 40th International Conference on Machine Learning (ICML’23), Vol. 202 of Proceedings of Machine Learning Research, pp. 36411–36430. PMLR.
Wen, Z., Precup, D., Ibrahimi, M., Barreto, A., Roy, B., & Singh, S. (2020). On Efficiency
in Hierarchical Reinforcement Learning. In Larochelle, H., Ranzato, M., Hadsell, R.,
Balcan, M.-F., & Lin, H. (Eds.), Proceedings of the 34th International Conference on
Advances in Neural Information Processing Systems (NeurIPS’20). Curran Associates.
Whitehead, S., & Lin, L. (1995). Reinforcement Learning of Non-Markov Decision Processes.
Artificial Intelligence, 73 (1-2), 271–306.
Williams, R. (1992). Simple Statistical Gradient-following Algorithms for Connectionist
Reinforcement Learning. Machine Learning, 8, 229–256.
Wolf, L., & Musolesi, M. (2023). Augmented Modular Reinforcement Learning Based on
Heterogeneous Knowledge. arXiv preprint, arXiv:2306.01158.
Woo, H., Yoo, G., & Yoo, M. (2022). Structure Learning-based Task Decomposition for
Reinforcement Learning in Non-stationary Environments. In Proceedings of the Thirty-
Sixth Conference on Artificial Intelligence (AAAI’22).
Wu, B., Gupta, J., & Kochenderfer, M. (2019). Model Primitive Hierarchical Lifelong
Reinforcement Learning. In Elkind, E., Veloso, M., Agmon, N., & Taylor, M. (Eds.),
Proceedings of the Eighteenth International Conference on Autonomous Agents and
MultiAgent Systems (AAMAS’19), pp. 34–42.
Wu, C., Rajeswaran, A., Duan, Y., Kumar, V., Bayen, A., Kakade, S., Mordatch, I., &
Abbeel, P. (2018). Variance Reduction for Policy Gradient With Action-dependent
Factorized Baselines. In Proceedings of the 6th International Conference on Learning
Representations (ICLR’18).
Xu, D., & Fekri, F. (2021). Interpretable Model-based Hierarchical Reinforcement Learning
Using Inductive Logic Programming. arXiv preprint, arXiv:2106.11417.
Xu, K., Verma, S., Finn, C., & Levine, S. (2020). Continual Learning of Control Primitives:
Skill Discovery via Reset-games. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan,
M.-F., & Lin, H. (Eds.), Proceedings of the 34th International Conference on Advances
in Neural Information Processing Systems (NeurIPS’20). Curran Associates.
Yang, C., Hung, I., Ouyang, Y., & Chen, P. (2022). Training a Resilient Q-network Against
Observational Interference. In Proceedings of the Thirty-Sixth Conference on Artificial
Intelligence (AAAI’22).
Yang, F., Lyu, D., Liu, B., & Gustafson, S. (2018). PEORL: Integrating Symbolic Planning
and Hierarchical Reinforcement Learning for Robust Decision-making. In Lang, J.
(Ed.), Proceedings of the 27th International Joint Conference on Artificial Intelligence
(IJCAI’18).
Yang, R., Xu, H., Wu, Y., & Wang, X. (2020a). Multi-task Reinforcement Learning With
Soft Modularization. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.-F., & Lin,
H. (Eds.), Proceedings of the 34th International Conference on Advances in Neural
Information Processing Systems (NeurIPS’20). Curran Associates.
Yang, Y., Zhang, G., Xu, Z., & Katabi, D. (2020b). Harnessing Structures for Value-
based Planning and Reinforcement Learning. In Proceedings of the 8th International
Conference on Learning Representations (ICLR’20).
Yarats, D., Fergus, R., Lazaric, A., & Pinto, L. (2021). Reinforcement Learning With
Prototypical Representations. In Meila, M., & Zhang, T. (Eds.), Proceedings of the
38th International Conference on Machine Learning (ICML’21), Vol. 139 of Proceedings
of Machine Learning Research. PMLR.
Yin, D., Thiagarajan, S., Lazic, N., Rajaraman, N., Hao, B., & Szepesvári, C. (2023).
Sample Efficient Deep Reinforcement Learning via Local Planning. arXiv preprint,
arXiv:2301.12579.
Young, K., Ramesh, A., Kirsch, L., & Schmidhuber, J. (2023). The Benefits of Model-based
Generalization in Reinforcement Learning. In Proceedings of the 40th International
Conference on Machine Learning (ICML’23).
Yu, D., Ma, H., Li, S., & Chen, J. (2022). Reachability Constrained Reinforcement Learning.
In Chaudhuri, K., Jegelka, S., Song, L., Szepesvári, C., Niu, G., & Sabato, S. (Eds.),
Proceedings of the 39th International Conference on Machine Learning (ICML’22),
Vol. 162 of Proceedings of Machine Learning Research. PMLR.
Zambaldi, D., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., Tuyls, K., Reichert,
D., Lillicrap, T., Lockhart, E., Shanahan, M., Langston, V., Pascanu, R., Botvinick,
M., Vinyals, O., & Battaglia, P. (2019). Deep Reinforcement Learning With Relational
Inductive Biases. In Proceedings of the 7th International Conference on Learning Representations (ICLR’19).
Zeng, K., Zhang, Q., Chen, B., Liang, B., & Yang, J. (2022). APD: Learning Diverse
Behaviors for Reinforcement Learning Through Unsupervised Active Pre-training.
IEEE Robotics Automation Letters, 7 (4), 12251–12258.
Zhang, A., Lyle, C., Sodhani, S., Filos, A., Kwiatkowska, M., Pineau, J., Gal, Y., & Precup, D.
(2020). Invariant Causal Prediction for Block MDPs. In III, H. D., & Singh, A. (Eds.),
Proceedings of the 37th International Conference on Machine Learning (ICML’20),
Vol. 98. Proceedings of Machine Learning Research.
Zhang, A., McAllister, R., Calandra, R., Gal, Y., & Levine, S. (2021). Learning Invariant
Representations for Reinforcement Learning Without Reconstruction. In Proceedings
of the 9th International Conference on Learning Representations (ICLR’21).
Zhang, A., Sodhani, S., Khetarpal, K., & Pineau, J. (2020). Multi-task Reinforcement Learning as a Hidden-parameter Block MDP. arXiv preprint, arXiv:2007.07206.
Zhang, A., Sodhani, S., Khetarpal, K., & Pineau, J. (2021a). Learning Robust State
Abstractions for Hidden-parameter Block MDPs. In Proceedings of the 9th International
Conference on Learning Representations (ICLR’21).
Zhang, C., Cai, Y., Huang, L., & Li, J. (2021b). Exploration by Maximizing Renyi En-
tropy for Reward-free RL Framework. In Yang, Q., Leyton-Brown, K., & Mausam
(Eds.), Proceedings of the Thirty-Fifth Conference on Artificial Intelligence (AAAI’21).
Association for the Advancement of Artificial Intelligence, AAAI Press.
Zhang, D., Courville, A., Bengio, Y., Zheng, Q., Zhang, A., & Chen, R. (2023). Latent State
Marginalization as a Low-cost Approach for Improving Exploration. In Proceedings of
the 11th International Conference on Learning Representations (ICLR’23).
Zhang, H., Chen, H., Xiao, C., Li, B., Liu, M., Boning, D., & Hsieh, C. (2020). Robust Deep
Reinforcement Learning Against Adversarial Perturbations on State Observations. In
Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M.-F., & Lin, H. (Eds.), Proceedings
of the 34th International Conference on Advances in Neural Information Processing
Systems (NeurIPS’20). Curran Associates.
Zhang, H., Gao, Z., Zhou, Y., Zhang, H., Wu, K., & Lin, F. (2019a). Faster and Safer
Training by Embedding High-level Knowledge Into Deep Reinforcement Learning.
arXiv preprint, arXiv:1910.09986.
Zhang, S., & Sridharan, M. (2022). A Survey of Knowledge-based Sequential Decision-making
Under Uncertainty. AI Magazine, 43 (2), 249–266.
Zhang, S., Tong, H., Xu, J., & Maciejewski, R. (2019). Graph Convolutional Networks: A
Comprehensive Review. Computational Social Networks, 6 (1), 1–23.
Zhang, X., Zhang, S., & Yu, Y. (2021). Domain Knowledge Guided Offline Q Learning. In
Second Offline Reinforcement Learning Workshop at the 35th International Conference
on Advances in Neural Information Processing Systems (NeurIPS’21).
Zhao, T., Xie, K., & Eskénazi, M. (2019). Rethinking Action Spaces for Reinforcement
Learning in End-to-end Dialog Agents With Latent Variable Models. In Proceed-
ings of the 2019 Conference of the North American Chapter of the Association for
Computational Linguistics: Human Language Technologies.
Zhou, A., Kumar, V., Finn, C., & Rajeswaran, A. (2022). Policy Architectures for Composi-
tional Generalization in Control. arXiv preprint, arXiv:2203.05960.
Zhou, Z., Li, X., & Zare, R. (2017). Optimizing Chemical Reactions With Deep Reinforcement
Learning. ACS central science, 3 (12), 1337–1344.
Zhu, J., Park, T., Isola, P., & Efros, A. (2017). Unpaired Image-to-image Translation
Using Cycle-consistent Adversarial Networks. In Proceedings of the 20th International
Conference on Computer Vision (ICCV’17).