Benchmarking Reinforcement Learning Techniques for Autonomous Navigation

Zifan Xu1, Bo Liu1, Xuesu Xiao2,3, Anirudh Nair1, and Peter Stone1,4

1 Department of Computer Science, University of Texas at Austin; 2 Department of Computer Science, George Mason University; 3 Everyday Robots; 4 Sony AI. This work has taken place in the Learning Agents Research Group (LARG) at UT Austin. LARG research is supported in part by NSF (CPS-1739964, IIS-1724157, NRI-1925082), ONR (N00014-18-2243), FLI (RFP2-000), ARO (W911NF-19-2-0333), DARPA, Lockheed Martin, GM, and Bosch. Peter Stone serves as the Executive Director of Sony AI America and receives financial compensation for this work. The terms of this arrangement have been reviewed and approved by the University of Texas at Austin in accordance with its policy on objectivity in research.

Abstract— Deep reinforcement learning (RL) has brought many successes for autonomous robot navigation. However, there still exist important limitations that prevent real-world use of RL-based navigation systems. For example, most learning approaches lack safety guarantees, and learned navigation systems may not generalize well to unseen environments. Despite a variety of recent learning techniques that tackle these challenges in general, the lack of an open-source benchmark and of reproducible learning methods specifically for autonomous navigation makes it difficult for roboticists to choose which learning methods to use for their mobile robots, and for learning researchers to identify the current shortcomings of general learning methods for autonomous navigation. In this paper, we identify four major desiderata for applying deep RL approaches to autonomous navigation: (D1) reasoning under uncertainty, (D2) safety, (D3) learning from limited trial-and-error data, and (D4) generalization to diverse and novel environments. Then, we explore four major classes of learning techniques with the purpose of achieving one or more of the four desiderata: memory-based neural network architectures (D1), safe RL (D2), model-based RL (D2, D3), and domain randomization (D4). By deploying these learning techniques in a new open-source large-scale navigation benchmark and in real-world environments, we perform a comprehensive study aimed at establishing to what extent these techniques can achieve these desiderata for RL-based navigation systems.

I. INTRODUCTION

Autonomous robot navigation, i.e., moving a robot from one point to another without colliding with any obstacle, has been studied by the robotics community for decades. Classical navigation systems [1], [2] can successfully solve such navigation problems in many real-world scenarios, e.g., handling noisy, partially observable sensory input while still providing verifiable collision-free safety guarantees. However, these systems require extensive engineering effort and can still be brittle in challenging scenarios, e.g., in highly constrained environments. This is reflected by a recent competition (The BARN Challenge [3]) held at ICRA 2022, which suggests that even experienced roboticists tend to underestimate how difficult navigation scenarios are for real robots. Recently, data-driven approaches have also been used to tackle the navigation problem [4] thanks to advances in the machine learning community. In particular, Reinforcement Learning (RL), i.e., learning from self-supervised trial-and-error data, has achieved tremendous progress on multiple fronts, including safety [5]–[7], generalizability [8]–[11], sample efficiency [12], [13], and addressing temporal data [14]–[16]. For the problem of navigation, navigation systems learned with RL [17] have the potential to relieve roboticists from the extensive engineering efforts [18]–[22] spent on developing and fine-tuning classical systems. Moreover, a simple case study conducted in five randomly generated obstacle courses where classical navigation systems often fail shows that RL-based navigation has the potential to achieve superior behaviors in terms of successful collision avoidance and goal reaching (Fig. 1 left).

Despite such promising advantages, learning-based navigation systems are far from finding their way into real-world robotics use cases, which currently still rely heavily on their classical counterparts. Such reluctance to adopt learning-based systems in the real world stems from a series of fundamental limitations of learning methods, e.g., lack of safety, explainability, and generalizability. To make things even worse, the lack of well-established comparison metrics and reproducible learning methods further obfuscates the effects of different learning approaches on navigation across both the robotics and learning communities, making it difficult to assess the state of the art and therefore to adopt learned navigation systems in the real world.

To facilitate research toward developing RL-based navigation systems that can be deployed in real-world scenarios, we introduce a new open-source large-scale navigation benchmark with a variety of challenging, highly constrained obstacle courses to evaluate different learning approaches, along with implementations of several state-of-the-art RL algorithms. The obstacle courses resemble highly constrained real-world navigation environments (Fig. 1 right) and present major challenges to existing classical navigation systems, while RL-based navigation systems have the potential to perform well in them (Fig. 1 left).

We identify four major desiderata that ought to be fulfilled by any learning-based system that is to be deployed: (D1) reasoning under uncertainty of partially observed sensory inputs, (D2) safety, (D3) learning from limited trial-and-error data, and (D4) generalization to diverse and novel environments. By deploying four major classes of learning techniques, namely memory-based neural network architectures, safe RL, model-based RL, and domain randomization, we perform extensive experiments and empirically compare a large range of RL-based methods based on the degree to which they achieve each of these desiderata. Moreover, by deploying six selected navigation systems in three qualitatively different real-world navigation environments, we investigate to what degree the conclusions drawn from the benchmark can be applied to the real world. Supplementary videos and material for this work are available on the project webpage.1

1 https://cs.gmu.edu/~xiao/Research/RLNavBenchmark/
Fig. 1: Left: Success rates of two classical navigation systems, DWA [2] (red) and E-Band [1] (blue), and vanilla end-to-end RL-based navigation systems (green, individually trained) in five randomly generated difficult obstacle courses. The insets at the top show top-down views of the five obstacle courses. Right: Navigation environments in the real world (left) and in the proposed benchmark (right) are similar in terms of the robot perception system (e.g., white/red laser scans and cyan/purple costmaps).
II. DESIDERATA FOR LEARNING-BASED NAVIGATION

In this section, we introduce four desiderata for learning-based autonomous navigation systems and briefly discuss the learning techniques that serve as their corresponding solutions.

(D1) reasoning under uncertainty of partially observed sensory inputs. Autonomous navigation without explicit mapping and localization is usually formalized as a Partially Observable Markov Decision Process (POMDP), where the agent produces the motion of the robot based only on limited sensory inputs that are usually not sufficient to recover the full state of the navigation environment. Most RL approaches solve POMDPs by maintaining a history of past observations and actions [14], [15]. Neural network architectures that process sequential data, such as Recurrent Neural Networks (RNNs), are then employed to encode this history and address partial observability. In this study, we investigate various design choices of history-dependent architectures.
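As a concrete illustration of such history-dependent architectures, the sketch below summarizes a fixed-length history of past observations and actions with a GRU before producing a motion command; transformer-style encoders over the same history are another option evaluated later in the paper. This is only a minimal sketch of the general idea: the layer sizes, history length, and input dimensions below are assumptions, not the architectures used in the benchmark.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Minimal history-dependent policy: a GRU summarizes the last H
    (observation, action) pairs, and a linear head outputs motion commands.
    Sizes are illustrative, not the ones used in the benchmark."""

    def __init__(self, obs_dim=720, act_dim=2, hidden_dim=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(obs_dim + act_dim, hidden_dim), nn.ReLU())
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)  # e.g., linear and angular velocity

    def forward(self, obs_history, act_history):
        # obs_history: (batch, H, obs_dim); act_history: (batch, H, act_dim)
        x = torch.cat([obs_history, act_history], dim=-1)
        x = self.encoder(x)
        _, h = self.gru(x)              # h: (1, batch, hidden_dim), summary of the history
        return self.head(h.squeeze(0))  # motion command conditioned on the history

# Usage: H = 4 past laser scans (720 beams each) and actions
policy = RecurrentPolicy()
obs_hist = torch.zeros(1, 4, 720)
act_hist = torch.zeros(1, 4, 2)
action = policy(obs_hist, act_hist)
```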
(D2) safety. Even though in some cases deep RL methods achieve performance comparable to classical navigation, they still suffer from poor explainability and do not guarantee collision-free navigation. The lack of a safety guarantee is a major challenge preventing RL-based navigation from being used in the real world. Prior works have addressed this challenge by formalizing navigation as a multi-objective problem that treats collision avoidance as a separate objective from reaching the goal, and solving it with Lagrangian or Lyapunov-based methods [5]. For simplicity, we only explore the Lagrangian method and investigate whether explicitly treating safety as a separate objective leads to safer and smoother learned navigation behavior.
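For reference, a minimal sketch of the Lagrangian relaxation idea: the constrained objective (maximize return subject to an expected-collision-cost budget) is turned into an unconstrained one with a learnable multiplier that grows whenever the observed cost exceeds the budget. The update rule, hyperparameters, and cost budget below are illustrative assumptions and are not taken from the paper.

```python
import torch

# Constrained RL via Lagrangian relaxation (illustrative sketch):
#   max_pi E[return]  s.t.  E[collision cost] <= cost_limit
# becomes  max_pi min_{lambda >= 0}  E[return] - lambda * (E[cost] - cost_limit)

log_lambda = torch.zeros(1, requires_grad=True)   # lambda = exp(log_lambda) >= 0
lambda_opt = torch.optim.Adam([log_lambda], lr=1e-3)
cost_limit = 0.0                                  # e.g., no expected collisions (assumed budget)

def lagrangian_policy_loss(returns, costs):
    """Policy maximizes return while the (detached) multiplier penalizes violations."""
    lam = log_lambda.exp().detach()
    return -(returns - lam * costs).mean()

def update_multiplier(episode_costs):
    """Gradient ascent on lambda: it grows when the average cost exceeds the budget."""
    lam = log_lambda.exp()
    loss = -(lam * (episode_costs.mean() - cost_limit))  # minimizing this ascends lambda
    lambda_opt.zero_grad()
    loss.backward()
    lambda_opt.step()
```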
(D3) learning from limited trial-and-error data. Although deep RL approaches can relieve roboticists from extensive engineering effort, a large amount of data is still required to train a typical deep RL agent. However, autonomous navigation data is usually expensive to collect in the real world. Therefore, data collection is usually conducted in simulation, e.g., in the Robot Operating System (ROS) Gazebo simulator, which provides an easy interface with real-world robots. However, simulating a full navigation stack from perception to actuation is computationally more expensive than other RL domains, e.g., MuJoCo or Atari games [23], [24], which places a high requirement on sample efficiency. Most prior works have used off-policy RL algorithms to improve sample efficiency with experience replay [25], [26]. In addition, model-based RL methods can explicitly improve sample efficiency and are widely used in robot control problems. In this study, we compare two common classes of model-based RL methods [12], [13] combined with an off-policy RL algorithm, and empirically study to what extent model-based approaches improve sample efficiency when provided with different amounts of data.
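To make the role of experience replay concrete, here is a minimal replay buffer of the kind used by off-policy RL algorithms, which lets the learner reuse each expensive simulated navigation transition many times; the capacity and batch size are arbitrary placeholders, not values from the paper.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (obs, action, reward, next_obs, done) transitions for off-policy updates."""

    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, obs, action, reward, next_obs, done):
        self.buffer.append((obs, action, reward, next_obs, done))

    def sample(self, batch_size=256):
        batch = random.sample(self.buffer, batch_size)
        return list(zip(*batch))  # tuples of obs, actions, rewards, next_obs, dones
```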
(D4) generalization to diverse and novel environments. The ultimate goal of deep RL approaches for autonomous navigation is to learn a generalizable policy for all kinds of navigation environments in the real world. A simple strategy is to train the agent in as many diverse navigation environments as possible, i.e., domain randomization, but it is unclear how many training environments are necessary to efficiently achieve good generalization. Utilizing the large-scale navigation benchmark proposed in this paper, we empirically study the dependence of generalization on the number of training environments.

III. NAVIGATION BENCHMARK

This section details the proposed navigation benchmark for RL-based navigation systems, which aims to provide a unified and comprehensive testbed for future autonomous navigation research. First, Sec. III-A discusses the differences between the proposed benchmark and existing navigation benchmarks. In Sec. III-B and III-C, the navigation task is formally defined and formulated as a POMDP. More detailed background on MDPs and POMDPs can be found on the project webpage. Finally, Sec. III-D introduces the simulated and real-world environments that benchmark different aspects of navigation performance.

A. Existing Navigation Benchmarks

Our proposed benchmark differs from existing benchmarks in three aspects: (1) high-fidelity physics: the navigation tasks are simulated by Gazebo [27], which is based on realistic physical dynamics and therefore tests motion planners that directly produce low-level motion commands, i.e., linear and angular velocities, in contrast to high-level instructions such as turn left, turn right, or move forward [28], [29].
In other words, we focus on “how to navigate” (motion planning), instead of “where to navigate” (path planning); (2) ROS integration: our benchmark is based on ROS [30], which allows seamless transfer of a navigation method developed and benchmarked in simulation directly onto a physical robot with little (if any) effort; and (3) collision-free navigation: the benchmark includes both static and dynamic environments and requires collision-free navigation, whereas other benchmarks either assume that collisions are possible [29] or that collision avoidance will be addressed by other low-level controllers outside the scope of the benchmark [28]. A special case is the photo-realistic Interactive Gibson Benchmark by Xia et al. [31], which intentionally allows physical interaction with objects (e.g., pushing) and therefore poses no challenges to the collision-avoidance system.
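To illustrate the kind of interface points (1) and (2) imply, the sketch below subscribes to a 2D laser scan and publishes low-level velocity commands on standard ROS topics. The topic names, control rate, and the placeholder policy() function are assumptions rather than the benchmark's actual code.

```python
#!/usr/bin/env python
import rospy
from sensor_msgs.msg import LaserScan
from geometry_msgs.msg import Twist

latest_scan = None

def scan_callback(msg):
    global latest_scan
    latest_scan = msg.ranges  # raw laser readings used as the policy observation

def policy(scan):
    # Placeholder for a learned policy: returns (linear, angular) velocity.
    return 0.5, 0.0

if __name__ == "__main__":
    rospy.init_node("rl_local_planner")
    rospy.Subscriber("/scan", LaserScan, scan_callback)
    cmd_pub = rospy.Publisher("/cmd_vel", Twist, queue_size=1)
    rate = rospy.Rate(10)  # send motion commands at 10 Hz
    while not rospy.is_shutdown():
        if latest_scan is not None:
            v, w = policy(latest_scan)
            cmd = Twist()
            cmd.linear.x = v    # linear velocity (m/s)
            cmd.angular.z = w   # angular velocity (rad/s)
            cmd_pub.publish(cmd)
        rate.sleep()
```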
Fig. 2: Three types of navigation environments: static (left), dynamic-box (middle), and dynamic-wall (right). The red squares mark the obstacle fields, and the yellow arrows mark the direction of navigation. In dynamic-wall, the green (blue) arrows indicate the case when the two walls are moving apart (together). In dynamic-box, the red arrows indicate the velocities of the obstacles.

B. Navigation Problem Definition

Definition 1 (Robot Navigation Problem). Situated within a navigation environment e, which includes the locations of all obstacles at any time t, a start location (x_i, y_i), a start orientation θ_i, and a goal location (x_g, y_g), the navigation problem T_e is to maximize the probability p of a mobile robot reaching the goal location from the start location and orientation, under a constraint on the number of collisions with any obstacle, C < 1, and a time limit t < T_max.

A navigation problem can be formally defined as above. Given the current location (x_t, y_t), the robot is considered to have reached the goal location if and only if its distance to the goal location is smaller than a threshold, d_t < d_s, where d_t is the Euclidean distance between (x_t, y_t) and (x_g, y_g), and d_s is a constant threshold.
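The goal-reaching and constraint checks of Definition 1 translate directly into code; the sketch below assumes a fixed distance threshold d_s and time limit T_max, whose concrete values here are placeholders rather than benchmark settings.

```python
import math

D_S = 0.5      # goal-reaching distance threshold d_s (placeholder value, in meters)
T_MAX = 100.0  # time limit T_max (placeholder value, in seconds)

def episode_outcome(x_t, y_t, x_g, y_g, num_collisions, t):
    """Evaluate one step of a navigation episode against Definition 1."""
    d_t = math.hypot(x_t - x_g, y_t - y_g)  # Euclidean distance to the goal
    if num_collisions >= 1:                 # constraint C < 1: any collision fails the episode
        return "failure"
    if t >= T_MAX:                          # time limit t < T_max
        return "timeout"
    if d_t < D_S:                           # goal reached iff d_t < d_s
        return "success"
    return "running"
```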
C. POMDP Formulation

A navigation task T_e can be formulated as a POMDP. The reward function consists of three terms: a terminal reward with coefficient b_f for reaching the goal, an auxiliary progress reward with coefficient b_p proportional to the decrease in distance to the goal (d_{t-1} − d_t), and a collision penalty with coefficient b_c. The first term rewards the goal being reached by the agent, which matches the objective of the navigation task in Definition 1. The second and third terms are auxiliary rewards that facilitate training by encouraging local progress and penalizing collisions.

We perform a grid search over different values of the coefficients in this reward function. The results show that the auxiliary reward term (d_{t-1} − d_t) is necessary for successful training, and that a much smaller coefficient b_p relative to b_f leads to better asymptotic performance. The agent can learn without the collision penalty (b_c = 0), but a moderate value of b_c improves the asymptotic performance and speeds up training. For all the experiments in this paper, we fix the coefficients to b_f = 20, b_p = 1, and b_c = 4.

In our experiments, the RL algorithm solves a multi-task RL problem in which tasks are randomly sampled from a task distribution T_e ∼ p(T_e). Here, the task distribution p(T_e) := U({e_i}_{i=1}^N) is a uniform distribution over a set of N navigation environments {e_i}_{i=1}^N. The overall objective of this multi-task RL problem is to find an optimal policy

π* = argmax_π E_{T_e ∼ p(T_e), τ ∼ π} [ Σ_{t=0}^∞ γ^t R_e(s_t, a_t) ].
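As a concrete reading of this reward and task distribution, the sketch below combines a terminal goal reward, the progress term (d_{t-1} − d_t), and a collision penalty using the reported coefficients b_f = 20, b_p = 1, b_c = 4, and samples a training environment uniformly. The exact functional form (e.g., the use of indicator terms) and the argument names are our assumptions, not a specification taken from the paper.

```python
import random

B_F, B_P, B_C = 20.0, 1.0, 4.0  # reward coefficients reported in the paper

def reward(d_prev, d_curr, reached_goal, collided):
    """Sketch of the three-term reward: goal bonus, progress shaping, collision penalty.
    The indicator-based form is an assumption consistent with the described terms."""
    r = B_F * float(reached_goal)   # terminal term matching the goal-reaching objective
    r += B_P * (d_prev - d_curr)    # auxiliary reward encouraging local progress
    r -= B_C * float(collided)      # auxiliary penalty discouraging collisions
    return r

def sample_task(environments):
    """Uniform task distribution p(T_e) = U({e_i}) over N training environments."""
    return random.choice(environments)
```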
TABLE I: (D1) Success rate (%) (↑) of policies trained with different neural network architectures and history lengths. H
is the history length of the memory. Bold font indicates the best success rate for each type of environment.
Methods                  Baseline (model-free)   Lagrangian method   MPC (model-based)   DWA    TEB
Success rate (%) (↑)     65 ± 4                  74 ± 2              70 ± 3              82     70
Survival time (s) (↑)    8.0 ± 1.5               16.2 ± 2.5          55.7 ± 4.9          62.7   26.9
Traversal time (s) (↓)   7.5 ± 0.3               8.6 ± 0.2           24.7 ± 2.0          35.6   26.9

TABLE II: (D2) Success rate (↑), survival time (↑), and traversal time (↓) of policies trained with the Lagrangian method, MPC with a probabilistic transition model, and DWA. The bold font indicates the best number achieved for each type of metric.
To compare against classical navigation systems, which are believed to have better safety, we also add evaluation metrics from a classical navigation stack with the Dynamic Window Approach (DWA) [2] local planner.

The Lagrangian method reduces the gap between training and test environments. When deployed in the training environments, both the baseline MLP and the safe RL method achieve about 80% success rate. However, in the test environments, the Lagrangian method has a better success rate of 74% compared to 65% for the baseline MLP. We hypothesize that the safety constraint applied by the safe RL method acts as a form of regularization and therefore improves generalization to unseen environments.

The Lagrangian method increases the average survival time in failed episodes. As expected, the Lagrangian method increases the average survival time by 8.2s compared to the baseline MLP, at the cost of a 1.1s longer average traversal time. However, such improved safety is still worse than that of the classical navigation systems, given the best survival time of 88.6s achieved by DWA.
C. Model-based RL (D2 and D3)

To explore how model-based approaches help with autonomous navigation tasks, we implement Dyna-style, MPC, and MBPO, and evaluate these methods in static environments. The transition models are represented either by a deterministic NN or by a probabilistic NN that predicts the mean and variance of the next state. During training in static-train-50, the policies are saved when 100k, 500k, and 2000k transition samples have been collected, and are then tested in static-test. The success rates of these policies are reported in Table IV.
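The paper does not spell out its planner, so the following is only a generic sketch of MPC with a learned probabilistic dynamics model: candidate action sequences are sampled (random shooting), rolled out through the model by sampling from its predicted Gaussian, scored with the task reward, and the first action of the best sequence is executed. The horizon, population size, and the dynamics_model / reward_fn interfaces are assumptions.

```python
import numpy as np

def mpc_action(obs, dynamics_model, reward_fn, act_dim=2,
               horizon=10, num_candidates=256):
    """Random-shooting MPC with a probabilistic (Gaussian) transition model."""
    # Sample candidate action sequences, e.g., bounded linear/angular velocities.
    candidates = np.random.uniform(-1.0, 1.0,
                                   size=(num_candidates, horizon, act_dim))
    returns = np.zeros(num_candidates)
    states = np.repeat(obs[None, :], num_candidates, axis=0)
    for t in range(horizon):
        actions = candidates[:, t, :]
        # The probabilistic model predicts a mean and variance of the next state;
        # sampling from it propagates model uncertainty through the rollout.
        mean, var = dynamics_model(states, actions)
        states = mean + np.sqrt(var) * np.random.randn(*mean.shape)
        returns += reward_fn(states, actions)
    best = np.argmax(returns)
    return candidates[best, 0, :]  # execute only the first action, then replan
```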
Model-based methods do not improve sample efficiency. As shown in the second and third columns of Table IV, better success rates of 13% and 58% are achieved by the baseline MLP method when provided with only 100k and 500k transition samples, respectively. In addition, higher success rates at 500k transition samples are observed for probabilistic models compared to their deterministic counterparts, which indicates more efficient learning with probabilistic transition models. Notice that MBPO exploits the model more heavily than the Dyna-style method, which leads to much worse asymptotic performance (about 20% success rate in the end).

Model-based methods with probabilistic dynamics models improve asymptotic performance. In the last column of Table IV, both Dyna-style and MPC with probabilistic dynamics models achieve slightly better success rates of 70%, compared to 65% for the baseline MLP method, when sufficient transition samples (2000k) are given to the learning agent.

The MPC policy performs conservatively when deployed in unseen test environments and shows better safety performance. The safety performance of MPC policies with probabilistic dynamics models is also tested (see Table II). We observe that agents with MPC policies navigate very conservatively, with an average traversal time of 24.7s, about two times longer than the MLP baseline. In the meantime, MPC policies achieve improved safety with the best survival time of 55.7s among the RL-based methods.
Transition samples          100k     500k      2000k
MLP                         13 ± 7   58 ± 2    65 ± 4
Dyna-style deterministic    8 ± 2    30 ± 10   66 ± 5
MPC deterministic           0 ± 0    21 ± 10   62 ± 3
Dyna-style probabilistic    0 ± 0    48 ± 4    70 ± 1
MPC probabilistic           0 ± 0    45 ± 4    70 ± 3
MBPO                        0 ± 0    0 ± 0     21.9 ± 3

TABLE IV: (D3) Success rate (%) (↑) of policies trained with different model-based methods and different numbers of transition samples. The bold font indicates the best success rate for each number of transition samples.

D. Domain Randomization (D4)

To explore how generalization depends on the degree of randomness in the training environments, baseline MLP policies with a history length of one are trained on environment sets with 5, 10, 50, 100, and 250 training environments. The trained policies are tested on the same static-test. To investigate the performance gap between training and test, the policies trained with 50, 100, and 250 environments are also tested on static-train-50, which is part of their training sets. Fig. 4 shows the success rate of policies trained with different numbers of training environments.
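The training loop behind this experiment amounts to resampling a training environment for every episode; the sketch below shows that structure for a set of N benchmark environments. The env constructor, policy interface, and update call are placeholders, not the benchmark's API.

```python
import random

def train_with_domain_randomization(make_env, env_ids, policy, num_episodes=10_000):
    """Each episode runs in an environment drawn uniformly from the training set,
    so the policy never overfits to a single obstacle-course layout."""
    for episode in range(num_episodes):
        env = make_env(random.choice(env_ids))  # e.g., env_ids = list(range(250))
        obs, done = env.reset(), False
        while not done:
            action = policy.act(obs)
            obs, rew, done, info = env.step(action)
            policy.observe(obs, action, rew, done)  # store transition for the learner
        policy.update()                              # off-policy update after each episode
```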
The generalization to unseen environments improves with an increasing number of training environments. As shown in Fig. 4, the performance on the unseen test environments monotonically increases from 43% to 74% as the number of training environments increases from 5 to 250. Moreover, the gap between training and test environments gradually shrinks as more training environments are added, given that the policies are robust enough to maintain a similar performance of about 80% on the training environments.

Fig. 4: (D4) Success rate (%) of policies trained with different numbers of training environments.

                            real-world-1        real-world-2        real-world-3
Method       H   # envs         traversal time (s) (↓) (# successful trials (↑) / total # trials)
MLP          1   50         6.9 (1/3)           10.6 (1/3)          N (0/3)
MLP          1   250        4.6 ± 0.8 (3/3)     6.6 ± 0.6 (3/3)     22.6 ± 0.5 (3/3)
Transformer  4   50         6.1 ± 0.4 (3/3)     6.1 ± 0.1 (2/3)     20.5 ± 2 (2/3)
Lagrangian   1   50         4.4 ± 0.6 (3/3)     7.1 ± 0.1 (2/3)     26.2 (1/3)
MPC          1   50         13.2 ± 0.7 (3/3)    24.8 ± 3.7 (3/3)    N (0/3)
DWA          -   -          16.2 ± 0.7 (3/3)    35.2 ± 8.2 (2/3)    66.9 ± 0.6 (3/3)

TABLE III: Physical experiments. The table shows the traversal time (s) (↓) and the number of successful trials (↑) of 5 RL-based navigation systems and a classical navigation system (DWA) evaluated in three real-world environments. The bold font indicates the best traversal time when all three trials are successful.
E. Physical experiments

To study the consistency of the above observations between simulation and the real world, we deploy one baseline MLP policy, the best policy for each studied desideratum, and one classical navigation system (DWA [2]) in the three real-world environments introduced in Sec. III-D. Each deployment is repeated three times, and the average traversal time and the number of successful trials are reported in Table III.

Even though the best memory-based policy, a transformer architecture with a history length of 4, was only marginally better than the baseline MLP in simulation, in the real world it navigates very smoothly and fails only once each in real-world-2 and real-world-3, while the baseline MLP fails most of the trials in all the environments, including the benchmark-like environment. One possible reason for this is that simulations are typically more predictable than the real world; therefore, it is particularly important to use historical data in the real world to estimate the environment and the current state of the robot. Similarly, the MLP policy trained with 250 environments can successfully navigate in all the environments without any failures, while the baseline MLP trained with 50 environments fails most of the trials. Safe RL improves the chances of success in all the environments and navigates more safely by performing backups and small adjustments of the robot's pose. Similar to the simulation, MPC navigates very conservatively and succeeds in all the trials in real-world-1 and real-world-2, but has much more difficulty generalizing to the large-scale real-world-3.
V. CONCLUSION

In this section, we discuss the conclusions we draw from these benchmark experiments. We organize these conclusions by the desiderata as follows:

(D1) reasoning under uncertainty of partially observed sensory inputs does not obviously benefit from adding memory in simulated static environments and in very random dynamic (dynamic-box) environments, but much more significant improvements were observed in the real world and in more challenging dynamic environments (dynamic-wall).

(D2) safety is improved by both safe RL and model-based MPC methods. However, classical navigation systems still achieve the best safety performance, at the cost of very long traversal times. Whether RL-based navigation systems can achieve safety guarantees similar to those of classical navigation systems, and whether safety can be improved without significantly sacrificing traversal time, are still open questions.

(D3) the ability to learn from limited trial-and-error data is not improved by the evaluated model-based methods. Currently, we observe that model-based RL methods indeed improve sample efficiency, but only when the number of imaginary rollouts from the learned model is large (e.g., ≥ 2000k) and when they are sampled with randomness. We therefore hypothesize that the improvement comes from the robustness brought by learning on more data sampled from the learned model. Hence, this result motivates not only more accurate model learning to reduce the number of imaginary rollouts, but also a theoretical understanding of how the model helps improve the robustness or even the safety of navigation.

(D4) the generalization to diverse and novel environments is improved by increasing the randomness of the training environments. However, a noticeable gap of about 5% between training and test environments is not eliminated even when the number of training environments is increased to 250. This reflects the limitation of simple domain randomization for increasing generalization, which is, however, widely used by the community.

In summary, although the proposed benchmark is not intended to represent every real-world navigation scenario, it serves as a simple yet comprehensive testbed for RL-based navigation methods. We observed that, for every desideratum, no method can achieve a 100% success rate on all training environments, even though we ensured that every environment is indeed individually solvable. This alone indicates that there exists an optimization and generalization challenge when we have a large number of training environments, as in our proposed benchmark.
REFERENCES

[1] S. Quinlan and O. Khatib, “Elastic bands: Connecting path planning and control,” in Proceedings IEEE International Conference on Robotics and Automation. IEEE, 1993, pp. 802–807.
[2] D. Fox, W. Burgard, and S. Thrun, “The dynamic window approach to collision avoidance,” IEEE Robotics & Automation Magazine, vol. 4, no. 1, pp. 23–33, 1997.
[3] X. Xiao, Z. Xu, Z. Wang, Y. Song, G. Warnell, P. Stone, T. Zhang, S. Ravi, G. Wang, H. Karnan et al., “Autonomous ground navigation in highly constrained spaces: Lessons learned from the BARN Challenge at ICRA 2022,” arXiv preprint arXiv:2208.10473, 2022.
[4] X. Xiao, B. Liu, G. Warnell, and P. Stone, “Motion planning and control for mobile robot navigation using machine learning: a survey,” Autonomous Robots, pp. 1–29, 2022.
[5] Y. Chow, O. Nachum, A. Faust, M. Ghavamzadeh, and E. A. Duéñez-Guzmán, “Lyapunov-based safe policy optimization for continuous control,” CoRR, vol. abs/1901.10031, 2019. [Online]. Available: http://arxiv.org/abs/1901.10031
[6] G. Thomas, Y. Luo, and T. Ma, “Safe reinforcement learning by imagining the near future,” 2022.
[7] E. Rodríguez-Seda, D. Stipanovic, and M. Spong, “Lyapunov-based cooperative avoidance control for multiple Lagrangian systems with bounded sensing uncertainties,” in 2011 50th IEEE Conference on Decision and Control and European Control Conference (CDC-ECC), Dec. 2011, pp. 4207–4213.
[8] K. Cobbe, O. Klimov, C. Hesse, T. Kim, and J. Schulman, “Quantifying generalization in reinforcement learning,” in ICML, 2019.
[9] K. Cobbe, C. Hesse, J. Hilton, and J. Schulman, “Leveraging procedural generation to benchmark reinforcement learning,” arXiv preprint arXiv:1912.01588, 2019.
[10] N. Justesen, R. R. Torrado, P. Bontrager, A. Khalifa, J. Togelius, and S. Risi, “Illuminating generalization in deep reinforcement learning through procedural level generation,” arXiv: Learning, 2018.
[11] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, “Domain randomization for transferring deep neural networks from simulation to the real world,” in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 23–30.
[12] R. S. Sutton, “Dyna, an integrated architecture for learning, planning, and reacting,” SIGART Bulletin, vol. 2, no. 4, pp. 160–163, Jul. 1991. [Online]. Available: https://doi.org/10.1145/122344.122377
[13] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, “Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning,” in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7559–7566.
[14] M. J. Hausknecht and P. Stone, “Deep recurrent q-learning for partially observable MDPs,” in AAAI Fall Symposia, 2015.
[15] D. Wierstra, A. Förster, J. Peters, and J. Schmidhuber, “Solving deep memory POMDPs with recurrent policy gradients,” in ICANN, 2007.
[16] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” Advances in Neural Information Processing Systems, vol. 31, 2018.
[17] H.-T. L. Chiang, A. Faust, M. Fiser, and A. Francis, “Learning navigation behaviors end-to-end with AutoRL,” IEEE Robotics and Automation Letters, vol. 4, no. 2, pp. 2007–2014, 2019.
[18] X. Xiao, B. Liu, G. Warnell, J. Fink, and P. Stone, “APPLD: Adaptive planner parameter learning from demonstration,” IEEE Robotics and Automation Letters, vol. 5, no. 3, pp. 4541–4547, 2020.
[19] Z. Wang, X. Xiao, B. Liu, G. Warnell, and P. Stone, “APPLI: Adaptive planner parameter learning from interventions,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.
[20] Z. Wang, X. Xiao, G. Warnell, and P. Stone, “APPLE: Adaptive planner parameter learning from evaluative feedback,” IEEE Robotics and Automation Letters, vol. 6, no. 4, pp. 7744–7749, 2021.
[21] Z. Xu, G. Dhamankar, A. Nair, X. Xiao, G. Warnell, B. Liu, Z. Wang, and P. Stone, “APPLR: Adaptive planner parameter learning from reinforcement,” in 2021 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2021.
[22] X. Xiao, Z. Wang, Z. Xu, B. Liu, G. Warnell, G. Dhamankar, A. Nair, and P. Stone, “APPL: Adaptive planner parameter learning,” Robotics and Autonomous Systems, vol. 154, p. 104132, 2022.
[23] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems, 2012, pp. 5026–5033.
[24] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, “The arcade learning environment: An evaluation platform for general agents,” Journal of Artificial Intelligence Research, vol. 47, pp. 253–279, Jun. 2013.
[25] H.-T. L. Chiang, A. Faust, M. Fiser, and A. Francis, “Learning navigation behaviors end-to-end with AutoRL,” IEEE Robotics and Automation Letters, vol. 4, pp. 2007–2014, 2019.
[26] A. Wahid, A. Toshev, M. Fiser, and T.-W. E. Lee, “Long range neural navigation policies for the real world,” 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 82–89, 2019.
[27] N. Koenig and A. Howard, “Design and use paradigms for Gazebo, an open-source multi-robot simulator,” in 2004 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), vol. 3. IEEE, 2004, pp. 2149–2154.
[28] Y. Zhu, R. Mottaghi, E. Kolve, J. J. Lim, A. Gupta, L. Fei-Fei, and A. Farhadi, “Target-driven visual navigation in indoor scenes using deep reinforcement learning,” in 2017 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 3357–3364.
[29] L. Harries, S. Lee, J. Rzepecki, K. Hofmann, and S. Devlin, “MazeExplorer: A customisable 3D benchmark for assessing generalisation in reinforcement learning,” in 2019 IEEE Conference on Games (CoG). IEEE, 2019, pp. 1–4.
[30] Stanford Artificial Intelligence Laboratory et al., “Robotic operating system.” [Online]. Available: https://www.ros.org
[31] F. Xia, W. B. Shen, C. Li, P. Kasimbeg, M. E. Tchapmi, A. Toshev, R. Martín-Martín, and S. Savarese, “Interactive Gibson benchmark: A benchmark for interactive navigation in cluttered environments,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 713–720, 2020.
[32] S. Fujimoto, H. van Hoof, and D. Meger, “Addressing function approximation error in actor-critic methods,” 2018.
[33] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International Conference on Machine Learning. PMLR, 2018, pp. 1861–1870.
[34] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” arXiv preprint arXiv:1509.02971, 2015.