Learning To Explore Using Active Neural SLAM
Published as a conference paper at ICLR 2020
ABSTRACT
This work presents a modular and hierarchical approach to learn policies for ex-
ploring 3D environments, called ‘Active Neural SLAM’. Our approach leverages
the strengths of both classical and learning-based methods, by using analytical path planners with a learned SLAM module, and global and local policies. The
use of learning provides flexibility with respect to input modalities (in the SLAM
module), leverages structural regularities of the world (in global policies), and
provides robustness to errors in state estimation (in local policies). Such use of
learning within each module retains its benefits, while at the same time, hierarchical
decomposition and modular training allow us to sidestep the high sample complex-
ities associated with training end-to-end policies. Our experiments in visually and
physically realistic simulated 3D environments demonstrate the effectiveness of
our approach over past learning and geometry-based approaches. The proposed
model can also be easily transferred to the PointGoal task and was the winning
entry of the CVPR 2019 Habitat PointGoal Navigation Challenge.
1 INTRODUCTION
Navigation is a critical task in building intelligent agents. Navigation tasks can be expressed in
many forms, for example, point goal tasks involve navigating to specific coordinates and semantic
navigation involves finding the path to a specific scene or object. Irrespective of the task, a core
problem for navigation in unknown environments is exploration, i.e., how to efficiently visit as much of the environment as possible. This is useful for maximizing coverage to give the best chance of finding the target in unknown environments, or for efficiently pre-mapping environments on a limited time budget.
Recent work from Chen et al. (2019) has used end-to-end learning to tackle this problem. Their
motivation is three-fold: a) learning provides flexibility to the choice of input modalities (classical
systems rely on observing geometry through the use of specialized sensors, while learning systems
can infer geometry directly from RGB images), b) use of learning can improve robustness to errors in
explicit state estimation, and c) learning can effectively leverage structural regularities of the real
world, leading to more efficient behavior in previously unseen environments. This led to their design
of an end-to-end trained neural network-based policy that processed raw sensory observations to
directly output actions that the agent should execute.
While the use of learning for exploration is well-motivated, casting the exploration problem as an
end-to-end learning problem has its own drawbacks. Learning about mapping, state-estimation, and
path-planning purely from data in an end-to-end manner can be prohibitively expensive. Consequently,
past end-to-end learning work for exploration from Chen et al. (2019) relies on the use of imitation
learning and many millions of frames of experience, but still performs worse than classical methods
that don’t require any training at all.
This motivates our work. In this paper, we investigate alternate formulations of employing learning for exploration that retain the advantages that learning has to offer, but don't suffer from the
drawbacks of full-blown end-to-end learning. Our key conceptual insight is that use of learning for
leveraging structural regularities of indoor environments, robustness to state-estimation errors, and
flexibility with respect to input modalities, happens at different time scales and can thus be factored
out. This motivates the use of learning in a modular and hierarchical fashion inside of what one
may call a ‘classical navigation pipeline’. This results in navigation policies that can work with raw
sensory inputs such as RGB images, are robust to state estimation errors, and leverage the regularities
of real-world layouts. The result is extremely competitive performance over both geometry-based and recent learning-based methods, while requiring a fraction of the number of samples.
More specifically, our proposed exploration architecture comprises a learned Neural SLAM module,
a global policy, and a local policy, that are interfaced via the map and an analytical path planner. The
learned Neural SLAM module produces free space maps and estimates agent pose from input RGB
images and motion sensors. The global policy consumes this free-space map with the agent pose and
employs learning to exploit structural regularities in layouts of real-world environments to produce
long-term goals. These long-term goals are used to generate short-term goals for the local policy
(using a geometric path-planner). This local policy uses learning to directly map raw RGB images
to actions that the agent should execute. Use of learning in the SLAM module provides flexibility with respect to input modality, the learned global policy can exploit regularities in layouts of real-world environments, and the learned local policy can use visual feedback to exhibit more robust behavior.
At the same time, the hierarchical and modular design and the use of analytical planning significantly cut down the search space during training, leading to better performance as well as sample efficiency.
We demonstrate our proposed approach in visually and physically realistic simulators for the task
of geometric exploration (visit as much area as possible). We work with the Habitat simulator from
Savva et al. (2019). While Habitat is already visually realistic (it uses real-world scans from Chang
et al. (2017) and Xia et al. (2018) as environments), we improve its physical realism by using actuation and odometry sensor noise models that we collected by conducting physical experiments on a real mobile robot. Our experiments and ablations in this realistic simulation reveal the effectiveness of
our proposed approach for the task of exploration. A straightforward modification of our method also
tackles point-goal navigation tasks, and won the AI Habitat challenge at CVPR 2019 across all tracks.
2 RELATED WORK
Navigation has been well studied in classical robotics. There has been a renewed interest in the use
of learning to arrive at navigation policies, for a variety of tasks. Our work builds upon concepts in
classical robotics and learning for navigation. We survey related works below.
Navigation Approaches. Classical approaches to navigation break the problem into two parts:
mapping and path planning. Mapping is done via simultaneous localization and mapping (Thrun
et al., 2005; Hartley and Zisserman, 2003; Fuentes-Pacheco et al., 2015), by fusing information from
multiple views of the environment. While sparse reconstruction can be done well with monocular
RGB images (Mur-Artal and Tardós, 2017), dense mapping is inefficient (Newcombe et al., 2011)
or requires specialized scanners such as Kinect (Izadi et al., 2011). Maps are used to compute
paths to goal locations via path planning (Kavraki et al., 1996; Lavalle and Kuffner Jr, 2000;
Canny, 1988). These classical methods have inspired recent learning-based techniques. Researchers
have designed neural network policies that reason via spatial representations (Gupta et al., 2017;
Parisotto and Salakhutdinov, 2018; Zhang et al., 2017; Henriques and Vedaldi, 2018; Gordon et al.,
2018), topological representations (Savinov et al., 2018; 2019), or use differentiable and trainable
planners (Tamar et al., 2016; Lee et al., 2018; Gupta et al., 2017; Khan et al., 2017). Our work
furthers this research: we study a hierarchical and modular decomposition of the problem and
employ learning inside these components instead of end-to-end learning. Research also focuses on
incorporating semantics in SLAM (Pronobis and Jensfelt, 2012; Walter et al., 2013).
Exploration in Navigation. While a number of works focus on passive map-building, path planning
and goal-driven policy learning, a much smaller body of work tackles the problem of active
SLAM, i.e., how to actively control the camera for map building. We point readers to Fuentes-
Pacheco et al. (2015) for a detailed survey, and summarize the major themes below. Most such
works frame this problem as a Partially Observable Markov Decision Process (POMDP) that are
approximately solved (Martinez-Cantin et al., 2009; Kollar and Roy, 2008), and/or seek to find a
sequence of actions that minimizes uncertainty of maps (Stachniss et al., 2005; Carlone et al., 2014).
Another line of work explores by picking vantage points (such as on the frontier between explored
and unexplored regions (Dornhege and Kleiner, 2013; Holz et al., 2010; Yamauchi, 1997; Xu et al.,
2017)). Recent works from Chen et al. (2019); Savinov et al. (2019); Fang et al. (2019) attack this
problem via learning. Our proposed modular policies unify the last two lines of research, and we
show improvements over representative methods from both these lines of work. Exploration has also
been studied more generally in RL in the context of exploration-exploitation trade-off (Sutton and
Barto, 2018; Kearns and Singh, 2002; Auer, 2002; Jaksch et al., 2010).
Hierarchical and Modular Policies. Hierarchical RL (Dayan and Hinton, 1993; Sutton et al.,
1999; Barto and Mahadevan, 2003) is an active area of research, aimed at automatically discovering
hierarchies to speed up learning. However, this has proven to be challenging, and thus most work has
resorted to using hand-defining hierarchies. For example in the context of navigation, Bansal et al.
(2019) and Kaufmann et al. (2019) design modular policies for navigation, that interface learned
policies with low-level feedback controllers. Hierarchical and modular policies have also been used
for Embodied Question Answering (Das et al., 2018a; Gordon et al., 2018; Das et al., 2018b).
3 TASK SETUP
We follow the exploration task setup proposed by Chen et al. (2019) where the objective is to
maximize the coverage in a fixed time budget. The coverage is defined as the total area in the map
known to be traversable. Our objective is to train a policy which takes in an observation $s_t$ at each time step $t$ and outputs a navigational action $a_t$ to maximize the coverage.
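To make the objective concrete, the following sketch shows one way the coverage metric could be computed from an explored-area map. The 5cm x 5cm cell size is taken from the map representation in Section 4; the function names and the normalization used for relative coverage (% Cov) are our own illustrative assumptions, not the released evaluation code.

```python
import numpy as np

CELL_AREA_M2 = 0.05 * 0.05  # each map cell covers 5cm x 5cm (see Section 4)

def coverage_m2(explored: np.ndarray, traversable: np.ndarray) -> float:
    """Absolute coverage: total area (m^2) known to be traversable.

    `explored` and `traversable` are boolean H x W grids; a cell counts
    once it has been seen and is known to be free space.
    """
    return float(np.logical_and(explored, traversable).sum()) * CELL_AREA_M2

def relative_coverage(explored: np.ndarray, traversable: np.ndarray) -> float:
    """% Cov: fraction of the scene's traversable area that has been covered."""
    total_m2 = float(traversable.sum()) * CELL_AREA_M2
    return coverage_m2(explored, traversable) / total_m2
```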
We try to make our experimental setup in simulation as realistic as possible with the goal of trans-
ferring trained policies to the real world. We use the Habitat simulator (Savva et al., 2019) with the
Gibson (Xia et al., 2018) and Matterport (MP3D) (Chang et al., 2017) datasets for our experiments.
Both Gibson and Matterport datasets are based on real-world scene reconstructions and are thus significantly more realistic than the synthetic SUNCG dataset (Song et al., 2017) used for past research on
exploration (Chen et al., 2019; Fang et al., 2019).
In addition to using synthetic scenes, prior works on learning-based navigation have also assumed simplistic agent motion. Some works limit agent motion to a grid with 90 degree rotations (Zhu et al., 2017;
Gupta et al., 2017; Chaplot et al., 2018). Other works which implement fine-grained control, typically
assume unrealistic agent motion with no noise (Savva et al., 2019) or perfect knowledge of agent
pose (Chaplot et al., 2016). Since the motion is simplistic, it becomes trivial to estimate the agent
pose in most cases even if it is not assumed to be known. The reason behind these assumptions on
agent motion and pose is that motion and sensor noise models are not known. In order to relax both
these assumptions, we collect motion and sensor data in the real-world and implement more realistic
agent motion and sensor noise models in the simulator as described in the following subsection.
Figure 1: Overview of our approach. The Neural SLAM module predicts a map and agent pose estimate from
incoming RGB observations and sensor readings. This map and pose are used by a Global policy to output a
long-term goal, which is converted to a short-term goal using an analytic path planner. A Local Policy is trained
to navigate to this short-term goal.
The control commands are implemented as $u_{Forward} = (0.25, 0, 0)$, $u_{Right} = (0, 0, -10\pi/180)$ and $u_{Left} = (0, 0, 10\pi/180)$. In practice, a robot can also rotate slightly while moving forward and translate a bit while rotating on-the-spot, creating a rotational actuation noise in the forward action and, similarly, a translational actuation noise in the on-the-spot rotation actions.
We use a LoCoBot (http://locobot.org) to collect data for building the actuation and sensor noise models. We use the
pyrobot API (Murali et al., 2019) along with ROS (Quigley et al., 2009) to implement the control
commands and get sensor readings. For each action a, we fit a separate Gaussian Mixture Model
for the actuation noise and sensor noise, making a total of 6 models. Each component in these
Gaussian mixture models is a multi-variate Gaussian in 3 variables, x, y and o. For each model, we
collect 600 datapoints. The number of components in each Gaussian mixture model is chosen using
cross-validation. We implement these actuation and sensor noise models in the Habitat simulator for
our experiments. We have released the noise models, along with their implementation in the Habitat
simulator in the open-source code.
4 METHODS
We propose a modular navigation model, ‘Active Neural SLAM’. It consists of three components: a
Neural SLAM module, a Global policy and a Local policy as shown in Figure 1. The Neural SLAM
module predicts the map of the environment and the agent pose based on the current observations
and previous predictions. The Global policy uses the predicted map and agent pose to produce a
long-term goal. The long-term goal is converted into a short-term goal using path planning. The
Local policy takes navigational actions based on the current observation to reach the short-term goal.
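A minimal sketch of how the three components could be wired together at inference time is shown below. The interfaces mirror the description above (and the 25-step goal interval from Appendix E), but all class and method names are placeholders rather than the released implementation.

```python
def explore_episode(env, slam, global_policy, planner, local_policy,
                    max_steps=1000, goal_interval=25):
    """One exploration episode with the modular Active Neural SLAM loop."""
    obs, sensor_pose = env.reset()
    map_est, pose_est = slam.reset()     # m_0 = zeros, x_0 = map center
    long_term_goal = None
    for t in range(max_steps):
        # Neural SLAM: update map and pose estimate from RGB + sensor reading.
        map_est, pose_est = slam.update(obs, sensor_pose, map_est, pose_est)
        # Global policy: sample a new long-term goal every few steps.
        if t % goal_interval == 0:
            long_term_goal = global_policy.act(map_est, pose_est)
        # Analytical planner: shortest path to the long-term goal,
        # truncated to a nearby short-term goal.
        short_term_goal = planner.plan(map_est, pose_est, long_term_goal)
        # Local policy: map RGB + relative goal to a low-level action.
        action = local_policy.act(obs, short_term_goal)
        obs, sensor_pose = env.step(action)
```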
Map Representation. The Active Neural SLAM model internally maintains a spatial map, $m_t$, and a pose of the agent, $x_t$. The spatial map, $m_t$, is a $2 \times M \times M$ matrix where $M \times M$ denotes the map size and each element in this spatial map corresponds to a cell of size 25cm$^2$ (5cm $\times$ 5cm) in the physical world. Each element in the first channel denotes the probability of an obstacle at the corresponding location and each element in the second channel denotes the probability of that location being explored. A cell is considered to be explored when it is known to be free space or an obstacle. The spatial map is initialized with all zeros at the beginning of an episode, $m_0 = [0]^{2 \times M \times M}$. The pose $x_t \in \mathbb{R}^3$ denotes the $x$ and $y$ coordinates of the agent and the orientation of the agent at time $t$. The agent always starts at the center of the map facing east at the beginning of the episode, $x_0 = (M/2, M/2, 0.0)$.
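As an illustration of this representation, a map with 5cm cells and the initial pose at the map center could be set up as follows; the M = 480 value is the training map size from Appendix E, and the coordinate-conversion helper is a hypothetical addition.

```python
import numpy as np

M = 480             # map size used during training (Appendix E)
CELL_SIZE_M = 0.05  # each cell is 5cm x 5cm

# Channel 0: obstacle probability, channel 1: explored probability.
m0 = np.zeros((2, M, M), dtype=np.float32)

# The agent starts at the center of the map facing east (orientation 0).
x0 = (M / 2, M / 2, 0.0)

def pose_to_cell(dx_m: float, dy_m: float) -> tuple:
    """Convert a metric offset from the start position (meters) to map indices."""
    return (int(M / 2 + dx_m / CELL_SIZE_M), int(M / 2 + dy_m / CELL_SIZE_M))
```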
Neural SLAM Module. The Neural SLAM Module ($f_{SLAM}$) takes in the current RGB observation, $s_t$, the current and last sensor reading of the agent pose $x'_{t-1:t}$, the last agent pose and map estimates, $\hat{x}_{t-1}$, $m_{t-1}$, and outputs an updated map, $m_t$, and the current agent pose estimate, $\hat{x}_t$ (see Figure 2): $m_t, \hat{x}_t = f_{SLAM}(s_t, x'_{t-1:t}, \hat{x}_{t-1}, m_{t-1}|\theta_S)$, where $\theta_S$ denote the trainable parameters of the Neural SLAM module. It consists of two learned components, a Mapper and a Pose Estimator. The Mapper ($f_{Map}$) outputs an egocentric top-down 2D spatial map, $p^{ego}_t \in [0, 1]^{2 \times V \times V}$ (where $V$ is the vision range), predicting the obstacles and the explored area in the current observation. The Pose Estimator ($f_{PE}$) predicts the agent pose ($\hat{x}_t$) based on the past pose estimate ($\hat{x}_{t-1}$) and the last two egocentric map predictions ($p^{ego}_{t-1:t}$).
Figure 2: Architecture of the Neural SLAM module: The Neural SLAM module ($f_{SLAM}$) takes in the current RGB observation, $s_t$, the current and last sensor reading of the agent pose $x'_{t-1:t}$, the last agent pose estimate, $\hat{x}_{t-1}$, and the map at the previous time step, $m_{t-1}$, and outputs an updated map, $m_t$, and the current agent pose estimate, $\hat{x}_t$. 'ST' denotes spatial transformation.
We perform two transformations before passing $h_t$ to the Global Policy model. The first transformation subsamples a window of size $4 \times G \times G$ around the agent from $h_t$. The second transformation performs max pooling operations to get an output of size $4 \times G \times G$ from $h_t$. Both transformations are stacked to form a tensor of size $8 \times G \times G$ and passed as input to the Global Policy model. The Global Policy uses a convolutional neural network to predict a long-term goal, $g^l_t$, in $G \times G$ space: $g^l_t = \pi_G(h_t|\theta_G)$, where $\theta_G$ are the parameters of the Global Policy.
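A sketch of the two input transformations is given below, assuming $h_t$ is a 4-channel map tensor of spatial size $M \times M$ (its exact contents are not reproduced here). The tensor and function names are illustrative.

```python
import torch
import torch.nn.functional as F

def global_policy_input(h, agent_cell, G=240):
    """Build the 8 x G x G input to the Global Policy from a 4 x M x M map tensor."""
    _, M, _ = h.shape
    r, c = agent_cell
    half = G // 2
    # 1) Crop a 4 x G x G window centred on the agent (zero-padding keeps it in bounds).
    padded = F.pad(h, (half, half, half, half))
    local_view = padded[:, r:r + G, c:c + G]
    # 2) Max-pool the full map down to 4 x G x G for coarse global context.
    coarse_view = F.adaptive_max_pool2d(h.unsqueeze(0), (G, G)).squeeze(0)
    # Stack both views into an 8 x G x G tensor.
    return torch.cat([local_view, coarse_view], dim=0)
```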
Planner. The Planner takes the long-term goal ($g^l_t$), the spatial obstacle map ($m_t$), and the agent pose estimate ($\hat{x}_t$) as input and computes the short-term goal $g^s_t$, i.e., $g^s_t = f_{Plan}(g^l_t, m_t, \hat{x}_t)$. It computes the shortest path from the current agent location to the long-term goal ($g^l_t$) using the Fast Marching Method (Sethian, 1996) based on the current spatial map $m_t$. The unexplored area is considered as free space for planning. We compute a short-term goal coordinate (the farthest point within $d_s$ (= 0.25m) from the agent) on the planned path.
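The planning step could be approximated with the scikit-fmm package, which implements the Fast Marching Method on regular grids. The sketch below treats non-obstacle (including unexplored) cells as free space, as described above; it is one possible implementation under these assumptions, not the released planner.

```python
import numpy as np
import skfmm  # scikit-fmm: Fast Marching Method on regular grids

def short_term_goal(obstacle_map, goal_cell, agent_cell, d_s=0.25, cell_m=0.05):
    """Pick the cell within d_s of the agent that is farthest along the planned path."""
    # Distance-to-goal field: zero level set at the goal, obstacles masked out.
    phi = np.ones_like(obstacle_map, dtype=np.float64)
    phi[goal_cell] = 0.0
    masked = np.ma.MaskedArray(phi, mask=(obstacle_map > 0.5))
    dist_to_goal = skfmm.distance(masked, dx=cell_m)

    # Among reachable cells within d_s of the agent, choose the one with the
    # smallest remaining distance to the goal, i.e. the farthest point on the path.
    rows, cols = np.indices(obstacle_map.shape)
    d_agent = np.hypot(rows - agent_cell[0], cols - agent_cell[1]) * cell_m
    candidates = np.where((d_agent <= d_s) & ~np.ma.getmaskarray(dist_to_goal),
                          dist_to_goal.filled(np.inf), np.inf)
    return np.unravel_index(np.argmin(candidates), candidates.shape)
```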
Local Policy. The Local Policy takes as input the current RGB observation ($s_t$) and the short-term goal ($g^s_t$) and outputs a navigational action, $a_t = \pi_L(s_t, g^s_t|\theta_L)$, where $\theta_L$ are the parameters of the Local Policy. The short-term goal coordinate is transformed into relative distance and angle from the agent's location before being passed to the Local Policy. The Local Policy is a recurrent neural network consisting of a pretrained ResNet18 (He et al., 2016) as the visual encoder.
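A sketch of such a recurrent local policy, in the spirit of the description above and Appendix D, is given below. For brevity, the relative goal is fed as continuous values here, whereas the paper discretizes it into bins (Appendix D); layer sizes and the action head are assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class LocalPolicy(nn.Module):
    """RGB observation + relative (distance, angle) to short-term goal -> action logits."""

    def __init__(self, num_actions=3, hidden_size=256):
        super().__init__()
        resnet = torchvision.models.resnet18()  # load ImageNet weights in practice
        self.encoder = nn.Sequential(*list(resnet.children())[:-1])  # drop the fc layer
        self.goal_embed = nn.Linear(2, 32)
        self.gru = nn.GRU(512 + 32, hidden_size, batch_first=True)
        self.action_head = nn.Linear(hidden_size, num_actions)

    def forward(self, rgb, rel_goal, hidden=None):
        # rgb: (B, 3, 128, 128); rel_goal: (B, 2) = (distance, angle).
        feat = self.encoder(rgb).flatten(1)                       # (B, 512)
        x = torch.cat([feat, self.goal_embed(rel_goal)], dim=1)   # (B, 544)
        out, hidden = self.gru(x.unsqueeze(1), hidden)
        return self.action_head(out.squeeze(1)), hidden
```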
5 EXPERIMENTAL SETUP
We use the Habitat simulator (Savva et al., 2019) with the Gibson (Xia et al., 2018) and Matterport
(MP3D) (Chang et al., 2017) datasets for our experiments. Both Gibson and MP3D consist of scenes which are 3D reconstructions of real-world environments.
Table 1: Exploration performance of the proposed model, Active Neural SLAM (ANS) and baselines. The baselines are adapted from [1] Savva et al. (2019), [2] Mirowski et al. (2017) and [3] Chen et al. (2019).
Method                          Gibson Val              MP3D Test (Domain Generalization)
                                % Cov.    Cov. (m2)     % Cov.    Cov. (m2)
RL + 3LConv [1]                 0.737     22.838        0.332     47.758
RL + Res18                      0.747     23.188        0.341     49.175
RL + Res18 + AuxDepth [2]       0.779     24.467        0.356     51.959
RL + Res18 + ProjDepth [3]      0.789     24.863        0.378     54.775
Active Neural SLAM (ANS)        0.948     32.701        0.521     73.281
[Figure 3: three panels (Gibson Val - Large, Gibson Val - Small, Gibson Val - Overall); x-axis: Episode length; y-axis: % Coverage; curves: ANS, RL + 3LConv + GRU, RL + Res18 + GRU, RL + Res18 + GRU + AuxDepth, RL + Res18 + GRU + ProjDepth]
Figure 3: Plot showing the % Coverage as the episode progresses for ANS and the baselines on the large and small scenes in the Gibson Val set as well as the overall Gibson Val set.
6 RESULTS
We train the proposed ANS model and all the baselines for the Exploration task with 10 million
frames on the Gibson training set. The results are shown in Table 1. The results on the Gibson Val set
are averaged over a total of 994 episodes in 14 different unseen scenes. The proposed model achieves
an average absolute and relative coverage of 32.701m2 /0.948 as compared to 24.863m2 /0.789 for
the best baseline. This indicates that the proposed model is more efficient and effective at exhaustive
exploration as compared to the baselines. This is because our hierarchical policy architecture reduces the horizon of the long-term exploration problem: instead of taking tens of low-level navigational actions, the Global policy takes only a few long-term goal actions. We also report the
domain generalization performance on the Exploration task in Table 1 (see shaded region), where
all models trained on Gibson are evaluated on the Matterport domain. ANS leads to higher domain
generalization performance (73.281m2 /0.521 vs 54.775m2 /0.378). The absolute coverage is higher
and % Cov is lower for the Matterport domain as it consists of larger scenes on average. On a
set of small MP3D test scenes (comparable to Gibson scene sizes), ANS achieved a performance
of 31.407m2 /0.836 as compared to 23.091m2 /0.620 for the best baseline. Some visualizations of
policy execution are provided in Figure 4 (see https://devendrachaplot.github.io/projects/Neural-SLAM for visualization videos).
In Fig. 3, we plot the relative coverage (% Cov) of all the models as the episode progresses on the
large and small scene sets, as well as the overall Gibson Val set. The plot on the small scene set
shows that ANS is able to almost completely explore the small scenes in around 500 steps; however, the baselines are only able to explore 85-90% of the small scenes in 1000 steps (see Fig. 3 center).
This indicates that ANS explores more efficiently in small scenes. The plot on the large scenes set
shows that the performance gap between ANS and baselines widens as the episode progresses (see
Fig. 3 left). Looking at the behavior of the baselines, we saw that they often got stuck in local areas.
This behavior indicates that they are unable to remember explored areas over long-time horizons and
are ineffective at long-term planning. On the other hand, ANS uses a Global policy on the map which
allows it to have the memory of explored areas over long-time horizons, and plan effectively to reach
distant long-term goals by leveraging analytical planners. As a result, it is able to explore effectively
in large scenes with long episode lengths.
Figure 4: Exploration visualization. Figure showing a sample trajectory of the Active Neural SLAM model
in the Exploration task. Top: RGB observations seen by the agent. Inset: Global ground truth map and pose
(not visible to the agent). Bottom: Local map and pose predictions. Long-term goals selected by the Global
policy are shown by blue circles. The ground-truth map and pose are under-laid in grey. Map prediction is
overlaid in green, with dark green denoting correct predictions and light green denoting false positives. Agent
pose predictions are shown in red. The light blue shaded region shows the explored area.
6.1 ABLATIONS
Local Policy. An alternative to learning a Local Policy is to have a deterministic policy which follows
the plan given by the Planner. As shown in Table 2, the ANS model performs slightly worse without
the Local Policy. The Local Policy is designed to adapt to small errors in Mapping. We observed the Local policy overcoming false positives encountered in mapping. For example, the Neural SLAM module could sometimes wrongly predict a carpet as an obstacle. In this case, the planner would plan
to go around the carpet. However, if the short-term goal is beyond the carpet, the Local policy can
understand that the carpet is not an obstacle based on the RGB observation and learn to walk over it.
Global Policy. An alternative to learning a Global Policy for sampling long-term goals is to use a
classical algorithm called Frontier-based exploration (FBE) (Yamauchi, 1997). A frontier is defined
as the boundary between the explored free space and the unexplored space. Frontier-based exploration essentially samples points on this frontier as goals to explore the space. There are different variants
of Frontier-based exploration based on the sampling strategy. Holz et al. (2010) compare different
sampling strategies and find that sampling the point on the frontier closest to the agent gives the
best results empirically. We implement this variant and use it in place of our learned Global Policy.
As shown in Table 2, the performance of the Frontier-based exploration policy is comparable on
small scenes, but around 10% lower on large scenes, relative to the Global policy. This indicates the
importance of learning as compared to classical exploration methods in larger scenes. Qualitatively,
we observed that Frontier-based exploration spent a lot of time exploring corners or small areas behind
furniture. In contrast, the trained Global policy ignored small spaces and chose distant long-term
goals which led to higher coverage.
Pose Estimation. A difference between ANS and the baselines is that ANS uses additional supervi-
sion to train the Pose Estimator. In order to understand whether the performance gain is coming from
this additional supervision, we remove the Pose Estimator from ANS and just use the input sensor
reading as our pose estimate. Results in Table 2 show that the ANS still outperforms the baselines
even without the Pose Estimator. We also observed that performance without the pose estimator drops
only about 1% on small scenes, but around 10% on large scenes. This is expected because larger
scenes take longer to explore, and pose errors accumulate over time to cause drift. Passing the ground truth pose as input to the baselines instead of the sensor reading did not improve their performance.
Figure 5: Real-world Transfer. Left: Image showing the living area in an apartment used for the real-world
experiments. Right: Sample images seen by the robot and the predicted map. The long-term goal selected by
the Global Policy is shown by a blue circle on the map.
7 CONCLUSION
In this paper, we proposed a modular navigational model which leverages the strengths of classical
and learning-based navigational methods. We show that the proposed model outperforms prior
methods on both Exploration and PointGoal tasks and shows strong generalization across domains,
goals, and tasks. In the future, the proposed model can be extended to complex semantic tasks
such as Semantic Goal Navigation and Embodied Question Answering by using a semantic Neural
SLAM module which creates a multi-channel map capturing semantic properties of the objects in the
environment. The model can also be combined with prior work on Localization to relocalize in a
previously created map for efficient navigation in subsequent episodes.
ACKNOWLEDGEMENTS
This work was supported by IARPA DIVA D17PC00340, ONR Grant N000141812861, ONR MURI,
ONR Young Investigator, DARPA MCS, and Apple. We would also like to acknowledge NVIDIA’s
GPU support. We thank Guillaume Lample for discussions and coding during the initial stages of
this project.
Licenses for referenced datasets.
Gibson: http://svl.stanford.edu/gibson2/assets/GDS_agreement.pdf
Matterport3D: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf
REFERENCES
Peter Anderson, Angel Chang, Devendra Singh Chaplot, Alexey Dosovitskiy, Saurabh Gupta, Vladlen
Koltun, Jana Kosecka, Jitendra Malik, Roozbeh Mottaghi, Manolis Savva, et al. On evaluation of
embodied navigation agents. arXiv preprint arXiv:1807.06757, 2018.
Peter Auer. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine
Learning Research, 3(Nov):397–422, 2002.
Somil Bansal, Varun Tolani, Saurabh Gupta, Jitendra Malik, and Claire Tomlin. Combining optimal
control and learning for visual navigation in novel environments. In Conference on Robot Learning
(CoRL), 2019.
Andrew G Barto and Sridhar Mahadevan. Recent advances in hierarchical reinforcement learning.
Discrete event dynamic systems, 13(1-2):41–77, 2003.
John Canny. The complexity of robot motion planning. MIT press, 1988.
Luca Carlone, Jingjing Du, Miguel Kaouk Ng, Basilio Bona, and Marina Indri. Active slam and
exploration with particle filters using kullback-leibler divergence. Journal of Intelligent & Robotic
Systems, 75(2):291–311, 2014.
Angel Chang, Angela Dai, Thomas Funkhouser, Maciej Halber, Matthias Niebner, Manolis Savva,
Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor
environments. In 2017 International Conference on 3D Vision (3DV), pages 667–676. IEEE, 2017.
Devendra Singh Chaplot and Guillaume Lample. Arnold: An autonomous agent to play fps games.
In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
Devendra Singh Chaplot, Guillaume Lample, Kanthashree Mysore Sathyendra, and Ruslan Salakhut-
dinov. Transfer deep reinforcement learning in 3d environments: An empirical study. In NIPS Deep Reinforcement Learning Workshop, 2016.
Devendra Singh Chaplot, Emilio Parisotto, and Ruslan Salakhutdinov. Active neural localization.
ICLR, 2018.
Tao Chen, Saurabh Gupta, and Abhinav Gupta. Learning exploration policies for navigation. In
ICLR, 2019.
Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of
neural machine translation: Encoder-decoder approaches. Eighth Workshop on Syntax, Semantics
and Structure in Statistical Translation, 2014.
Abhishek Das, Samyak Datta, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Embodied
question answering. In CVPR, 2018a.
Abhishek Das, Georgia Gkioxari, Stefan Lee, Devi Parikh, and Dhruv Batra. Neural modular control
for embodied question answering. In Conference on Robot Learning, pages 53–62, 2018b.
Peter Dayan and Geoffrey E Hinton. Feudal reinforcement learning. In Advances in neural information
processing systems, pages 271–278, 1993.
Christian Dornhege and Alexander Kleiner. A frontier-void-based approach for autonomous explo-
ration in 3d. Advanced Robotics, 27(6):459–468, 2013.
Kuan Fang, Alexander Toshev, Li Fei-Fei, and Silvio Savarese. Scene memory transformer for
embodied agents in long-horizon tasks. In CVPR, 2019.
Daniel Gordon, Aniruddha Kembhavi, Mohammad Rastegari, Joseph Redmon, Dieter Fox, and Ali
Farhadi. Iqa: Visual question answering in interactive environments. In Proceedings of the IEEE
Conference on Computer Vision and Pattern Recognition, pages 4089–4098, 2018.
Saurabh Gupta, James Davidson, Sergey Levine, Rahul Sukthankar, and Jitendra Malik. Cognitive
mapping and planning for visual navigation. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition, pages 2616–2625, 2017.
Richard Hartley and Andrew Zisserman. Multiple view geometry in computer vision. Cambridge
university press, 2003.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image
recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition,
pages 770–778, 2016.
Joao F Henriques and Andrea Vedaldi. Mapnet: An allocentric spatial memory for mapping envi-
ronments. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition,
pages 8476–8484, 2018.
Dirk Holz, Nicola Basilico, Francesco Amigoni, and Sven Behnke. Evaluating the efficiency of
frontier-based exploration strategies. In ISR 2010 (41st International Symposium on Robotics) and
ROBOTIK 2010 (6th German Conference on Robotics), pages 1–8. VDE, 2010.
Shahram Izadi, David Kim, Otmar Hilliges, David Molyneaux, Richard Newcombe, Pushmeet
Kohli, Jamie Shotton, Steve Hodges, Dustin Freeman, Andrew Davison, and Andrew Fitzgibbon.
KinectFusion: real-time 3D reconstruction and interaction using a moving depth camera. UIST,
2011.
Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks. In
Advances in neural information processing systems, pages 2017–2025, 2015.
Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement
learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
Elia Kaufmann, Mathias Gehrig, Philipp Foehn, René Ranftl, Alexey Dosovitskiy, Vladlen Koltun,
and Davide Scaramuzza. Beauty and the beast: Optimal methods meet learning for drone racing.
In 2019 International Conference on Robotics and Automation (ICRA), pages 690–696. IEEE,
2019.
Lydia E Kavraki, Petr Svestka, J-C Latombe, and Mark H Overmars. Probabilistic roadmaps for path
planning in high-dimensional configuration spaces. RA, 1996.
Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time.
Machine learning, 49(2-3):209–232, 2002.
Arbaaz Khan, Clark Zhang, Nikolay Atanasov, Konstantinos Karydis, Daniel D Lee, and Vijay
Kumar. End-to-end navigation in unknown environments using neural networks. arXiv preprint
arXiv:1707.07385, 2017.
S. Kohlbrecher, J. Meyer, O. von Stryk, and U. Klingauf. A flexible and scalable slam system with
full 3d motion estimation. In Proc. IEEE International Symposium on Safety, Security and Rescue
Robotics (SSRR). IEEE, November 2011.
Thomas Kollar and Nicholas Roy. Trajectory optimization using reinforcement learning for map
exploration. The International Journal of Robotics Research, 27(2):175–196, 2008.
Ilya Kostrikov. Pytorch implementations of reinforcement learning algorithms. https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail, 2018.
Guillaume Lample and Devendra Singh Chaplot. Playing FPS games with deep reinforcement
learning. In Thirty-First AAAI Conference on Artificial Intelligence, 2017.
Steven M Lavalle and James J Kuffner Jr. Rapidly-exploring random trees: Progress and prospects.
In Algorithmic and Computational Robotics: New Directions, 2000.
Lisa Lee, Emilio Parisotto, Devendra Singh Chaplot, Eric Xing, and Ruslan Salakhutdinov. Gated
path planning networks. In ICML, 2018.
Ruben Martinez-Cantin, Nando de Freitas, Eric Brochu, José Castellanos, and Arnaud Doucet. A
bayesian exploration-exploitation approach for optimal online sensing and planning with a visually
guided mobile robot. Autonomous Robots, 27(2):93–103, 2009.
Piotr Mirowski, Razvan Pascanu, Fabio Viola, Hubert Soyer, Andrew J Ballard, Andrea Banino,
Misha Denil, Ross Goroshin, Laurent Sifre, Koray Kavukcuoglu, et al. Learning to navigate in
complex environments. ICLR, 2017.
Raul Mur-Artal and Juan D Tardós. Orb-slam2: An open-source slam system for monocular, stereo,
and rgb-d cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.
Adithyavairavan Murali, Tao Chen, Kalyan Vasudev Alwala, Dhiraj Gandhi, Lerrel Pinto, Saurabh
Gupta, and Abhinav Gupta. Pyrobot: An open-source robotics framework for research and
benchmarking. arXiv preprint arXiv:1906.08236, 2019.
Richard A Newcombe, Steven J Lovegrove, and Andrew J Davison. Dtam: Dense tracking and
mapping in real-time. In 2011 international conference on computer vision, pages 2320–2327.
IEEE, 2011.
Emilio Parisotto and Ruslan Salakhutdinov. Neural map: Structured memory for deep reinforcement
learning. ICLR, 2018.
Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito,
Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in
pytorch. NIPS 2017 Autodiff Workshop, 2017.
Andrzej Pronobis and Patric Jensfelt. Large-scale semantic mapping and reasoning with heteroge-
neous modalities. In 2012 IEEE International Conference on Robotics and Automation, pages
3515–3522. IEEE, 2012.
Morgan Quigley, Brian Gerkey, Ken Conley, Josh Faust, Tully Foote, Jeremy Leibs, Eric Berger, Rob
Wheeler, and Andrew Ng. Ros: an open-source robot operating system. In Proc. of the IEEE Intl.
Conf. on Robotics and Automation (ICRA) Workshop on Open Source Robotics, Kobe, Japan, May
2009.
Nikolay Savinov, Alexey Dosovitskiy, and Vladlen Koltun. Semi-parametric topological memory for
navigation. In International Conference on Learning Representations (ICLR), 2018.
Nikolay Savinov, Anton Raichuk, Raphaël Marinier, Damien Vincent, Marc Pollefeys, Timothy
Lillicrap, and Sylvain Gelly. Episodic curiosity through reachability. In ICLR, 2019.
Manolis Savva, Abhishek Kadian, Oleksandr Maksymets, Yili Zhao, Erik Wijmans, Bhavana Jain,
Julian Straub, Jia Liu, Vladlen Koltun, Jitendra Malik, et al. Habitat: A platform for embodied
ai research. In Proceedings of the IEEE International Conference on Computer Vision, pages
9339–9347, 2019.
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy
optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
James A Sethian. A fast marching level set method for monotonically advancing fronts. Proceedings
of the National Academy of Sciences, 93(4):1591–1595, 1996.
Shuran Song, Fisher Yu, Andy Zeng, Angel X Chang, Manolis Savva, and Thomas Funkhouser.
Semantic scene completion from a single depth image. In CVPR, 2017.
Cyrill Stachniss, Giorgio Grisetti, and Wolfram Burgard. Information gain-based exploration using
rao-blackwellized particle filters. In Robotics: Science and Systems, volume 2, pages 65–72, 2005.
Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
Richard S Sutton, Doina Precup, and Satinder Singh. Between mdps and semi-mdps: A framework
for temporal abstraction in reinforcement learning. Artificial intelligence, 112(1-2):181–211, 1999.
Aviv Tamar, Yi Wu, Garrett Thomas, Sergey Levine, and Pieter Abbeel. Value iteration networks. In
Advances in Neural Information Processing Systems, pages 2154–2162, 2016.
Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic robotics. MIT press, 2005.
Matthew R Walter, Sachithra Hemachandra, Bianca Homberg, Stefanie Tellex, and Seth Teller.
Learning semantic maps from natural language descriptions. In Robotics: Science and Systems,
2013.
Fei Xia, Amir R. Zamir, Zhi-Yang He, Alexander Sax, Jitendra Malik, and Silvio Savarese. Gibson
Env: real-world perception for embodied agents. In Computer Vision and Pattern Recognition
(CVPR), 2018 IEEE Conference on. IEEE, 2018.
Kai Xu, Lintao Zheng, Zihao Yan, Guohang Yan, Eugene Zhang, Matthias Niessner, Oliver Deussen,
Daniel Cohen-Or, and Hui Huang. Autonomous reconstruction of unknown indoor scenes guided
by time-varying tensor fields. ACM Transactions on Graphics (TOG), 36(6):202, 2017.
Brian Yamauchi. A frontier-based approach for autonomous exploration. In CIRA, volume 97, page 146, 1997.
Jingwei Zhang, Lei Tai, Joschka Boedecker, Wolfram Burgard, and Ming Liu. Neural slam: Learning
to explore with external memory. arXiv preprint arXiv:1706.09520, 2017.
Yuke Zhu, Roozbeh Mottaghi, Eric Kolve, Joseph J Lim, Abhinav Gupta, Li Fei-Fei, and Ali Farhadi.
Target-driven visual navigation in indoor scenes using deep reinforcement learning. In Robotics
and Automation (ICRA), 2017 IEEE International Conference on, pages 3357–3364. IEEE, 2017.
Table 3: Performance of the proposed model, Active Neural SLAM (ANS) and all the baselines on the PointGoal task. 'ANS - Task Transfer' refers to the ANS model transferred to the PointGoal task after training on the Exploration task.
                                             Gibson Val      MP3D Test       Hard-GEDR       Hard-Dist
                                                             (Domain Gen.)   (Goal Gen.)     (Goal Gen.)
Train Task    Method                         Succ    SPL     Succ    SPL     Succ    SPL     Succ    SPL
PointGoal     Random                         0.027   0.021   0.010   0.010   0.000   0.000   0.000   0.000
              RL + Blind                     0.625   0.421   0.136   0.087   0.052   0.020   0.008   0.006
              RL + 3LConv + GRU              0.550   0.406   0.102   0.080   0.072   0.046   0.006   0.006
              RL + Res18 + GRU               0.561   0.422   0.160   0.125   0.176   0.109   0.004   0.003
              RL + Res18 + GRU + AuxDepth    0.640   0.461   0.189   0.143   0.277   0.197   0.013   0.011
              RL + Res18 + GRU + ProjDepth   0.614   0.436   0.134   0.111   0.180   0.129   0.008   0.004
              IL + Res18 + GRU               0.823   0.725   0.365   0.318   0.682   0.558   0.359   0.310
              CMP                            0.827   0.730   0.320   0.270   0.670   0.553   0.369   0.318
              ANS                            0.951   0.848   0.593   0.496   0.824   0.710   0.662   0.534
Exploration   ANS - Task Transfer            0.950   0.846   0.588   0.490   0.821   0.703   0.665   0.532
A POINTGOAL EXPERIMENTS
PointGoal has been the most studied task in the recent literature on navigation, where the objective is to navigate to a goal location whose relative coordinates are given as input, within a limited time budget. We
follow the PointGoal task setup from Savva et al. (2019), using train/val/test splits for both Gibson
and Matterport datasets. Note that the set of scenes used in each split is disjoint, which means the
agent is tested on new scenes never seen during training. The Gibson test set is not public but rather held out on an online evaluation server (https://evalai.cloudcv.org/web/challenges/challenge-page/254). We report the performance of our model on the Gibson test set
when submitted to the online server but also use the validation set as another test set for extensive
comparison and analysis. We do not use the validation set for hyper-parameter tuning.
Savva et al. (2019) identify two measures to quantify the difficulty of a PointGoal dataset. The first is
the average geodesic distance (distance along the shortest path) to the goal location from the starting
location of the agent, and the second is the average geodesic to Euclidean distance ratio (GED ratio).
The GED ratio is always greater than or equal to 1, with higher ratios resulting in harder episodes. The
train/val/test splits in the Gibson dataset come from the same distribution, with similar average geodesic distance and GED ratio. In order to analyze the performance of the proposed model on
out-of-set goal distribution, we create two harder sets, Hard-Dist and Hard-GEDR. In the Hard-Dist
set, the geodesic distance to goal is always more than 10m and the average geodesic distance to the
goal is 13.48m as compared to 6.9/6.5/7.0m in the train/val/test splits (Savva et al., 2019). The Hard-GEDR set consists of episodes with an average GED ratio of 2.52 and a minimum GED ratio of 2.0, as compared to an average GED ratio of 1.37 in the Gibson val set.
We also follow the episode specification from Savva et al. (2019). Each episode ends when either the
agent takes the stop action or at a maximum of 500 timesteps. An episode is considered a success
when the final position of the agent is within 0.2m of the goal location. In addition to Success rate
(Succ), we also use Success weighted by (normalized inverse) Path Length or SPL as a metric for
evaluation for the PointGoal task as proposed by Anderson et al. (2018).
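For reference, SPL as defined by Anderson et al. (2018) can be written as
$$\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i, \ell_i)},$$
where $N$ is the number of test episodes, $S_i$ is a binary success indicator, $\ell_i$ is the geodesic (shortest-path) distance from the start to the goal, and $p_i$ is the length of the path actually taken by the agent.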
In Table 3, we show the performance of the proposed model transferred to the PointGoal task along
with the baselines trained on the PointGoal task with the same amount of data (10 million frames).
The proposed model achieves a success rate/SPL of 0.950/0.846 as compared to 0.827/0.730 for the
best baseline model on Gibson val set. We also report the performance of the proposed model trained
from scratch on the PointGoal task for 10 million frames. The results indicate that the performance of
ANS transferred from Exploration is comparable to ANS trained on PointGoal. This highlights a key
advantage of our model. It allows us to transfer the knowledge of obstacle avoidance and control in
low-level navigation across tasks, as the Local Policy and Neural SLAM module are task-invariant.
Sample efficiency. RL models are typically trained for more than 10 million samples. In order to
compare the performance and sample-efficiency, we trained the best performing RL model (RL +
Res18 + GRU + ProjDepth) for 75 million frames and it achieved a Succ/SPL of 0.678/0.486. ANS
Figure 7: Performance of the proposed ANS model along with CMP and IL + Res18 + GRU (GRU) baselines
with increase in geodesic distance to goal and increase in GED Ratio on the Gibson Val set.
Figure 8: Figure showing sample trajectories of the proposed model along with the predicted map in the
PointGoal task. The starting and goal locations are shown by black squares and blue circles, respectively. The
ground-truth map is under-laid in grey. Map prediction is overlaid in green, with dark green denoting correct
predictions and light green denoting false positives. The blue shaded region shows the explored area prediction.
On the left, we show some successful trajectories which indicate that the model is effective at long distance
goals with high GED ratio. On the right, we show a failure case due to mapping error.
reaches a Succ/SPL of 0.789/0.703 at only 1 million frames. These numbers indicate that ANS achieves a >75× speedup as compared to the best RL baseline.
Domain and Goal Generalization: In Table 3 (see shaded region), we evaluate all the baselines and ANS trained on the PointGoal task in the Gibson domain on the test set in the Matterport domain as well as on the harder goal sets in Gibson. We also transfer ANS trained on Exploration in Gibson to all the 3 sets. The results show that ANS outperforms all the baselines on all generalization sets. Interestingly, RL based methods almost fail completely on the Hard-Dist set.
Figure 6: Screenshot of CVPR 2019 Habitat Challenge Results. The proposed model was submitted under code-name 'Arnold'.
We also analyze the performance of the proposed model as compared to the two best baselines, CMP and IL + Res18 + GRU, as a function of geodesic distance to goal and GED ratio in Figure 7. The performance of the baselines drops faster as compared to ANS, especially with the increase in goal distance.
This indicates that end-to-end learning methods are
effective at short-term navigation but struggle when long-term planning is required to reach a distant
goal. In Figure 8, we show some example trajectories of the ANS model along with the predicted
map. The successful trajectories indicate that the model exhibits strong backtracking behavior which
makes it effective at distant goals requiring long-term planning. Figure 9 visualizes a trajectory in the PointGoal task, showing first-person observations and the corresponding map predictions. Please refer to the project webpage for visualization videos.
Habitat Challenge Results. We submitted the ANS model to the CVPR 2019 Habitat PointGoal Navigation Challenge. The results are shown in Figure 6. ANS was submitted under the code-name
‘Arnold’. ANS was the winning entry for both RGB and RGB-D tracks among over 150 submissions
from 16 teams, achieving an SPL of 0.805 (RGB) and 0.948 (RGB-D) on the Test Challenge set.
Figure 9: Pointgoal visualization. Figure showing sample trajectories of the proposed model along with
predicted map in the Pointgoal task as the episode progresses. The starting and goal locations are shown by
black squares and blue circles, respectively. Ground truth map is under-laid in grey. Map prediction is overlaid
in green, with dark green denoting correct predictions and light green denoting false positives. Blue shaded
region shows the explored area prediction.
The action space consists of three actions: Forward: move forward by 25cm, Turn Right: on the spot rotation clockwise by 10 degrees, and Turn Left: on the spot rotation counter-clockwise by 10 degrees. The control commands are implemented as $u_{Forward} = (0.25, 0, 0)$, $u_{Right} = (0, 0, -10\pi/180)$ and $u_{Left} = (0, 0, 10\pi/180)$. In practice, a robot can also rotate slightly while moving forward and translate a bit while rotating on-the-spot, creating a rotational actuation noise in the forward action and, similarly, a translational actuation noise in the on-the-spot rotation actions.
We use a LoCoBot (http://locobot.org) to collect data for building the actuation and sensor noise models. We use the PyRobot API (Murali et al., 2019) along with ROS (Quigley et al., 2009) to implement the control commands and get sensor readings. In order to get an accurate agent pose, we use a Hokuyo UST-10LX Scanning Laser Rangefinder (LiDAR), which is very precise in our scenario as we take static readings in 2D (Kohlbrecher et al., 2011). We install the LiDAR on the LoCoBot by
replacing the arm with the LiDAR. We note that the Hokuyo UST-10LX Scanning Laser Rangefinder
is an expensive sensor. It costs $1600 as compared to the whole Locobot costing less than $2000
without the arm. Using expensive sensors can improve the performance of a model, however, for a
method to be scalable, it should ideally work with cheaper sensors too. In order to demonstrate the
scalability of our method, we use the LiDAR only to collect the data for building noise models and
not for training or deploying navigation policies in the real-world.
For the sensor estimate, we use the Kobuki base odometry available in Locobot. We approximate
the LiDAR pose estimate to be the true pose of the agent as it is orders of magnitude more accurate
than the base sensor. For each action, we collect 600 datapoints from both the base sensor and the
LiDAR, making a total of 3600 datapoints (600 ∗ 3 ∗ 2). We use 500 datapoints for each action
to fit the actuation and sensor noise models and use the remaining 100 datapoints for validation.
For each action $a$, the LiDAR pose estimates give us samples of $p_1^i$ and the base sensor readings give us samples of $p_1'^i$, $i = 1, 2, \ldots, 600$. The difference between the LiDAR estimates ($p_1^i$) and the control command ($\Delta u_a$) gives us samples of the actuation noise for the action $a$: $\epsilon_{act,a}^i = p_1^i - \Delta u_a$, and the difference between the base sensor readings and the LiDAR estimates gives us samples of the sensor noise, $\epsilon_{sen,a}^i = p_1'^i - p_1^i$.
For each action $a$, we fit a separate Gaussian Mixture Model for the actuation noise and the sensor noise using the samples $\epsilon_{act,a}^i$ and $\epsilon_{sen,a}^i$ respectively, making a total of 6 models. We fit Gaussian mixture models with the number of components ranging from 1 to 20 and pick the model with the highest likelihood on the validation set. Each component in these Gaussian mixture models is a multi-variate Gaussian in 3 variables, $x$, $y$ and $o$. We implement these actuation and sensor noise models in the Habitat simulator for our experiments.
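As an illustrative sketch, this fitting procedure could be implemented with scikit-learn's GaussianMixture as below. The 500/100 split and the 1-20 component range follow the text; the function name and the random shuffling are our own additions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_noise_model(samples, max_components=20, n_train=500, seed=0):
    """Fit a GMM over 3D noise samples (x, y, o) for one action.

    The number of components is selected by held-out log-likelihood.
    """
    rng = np.random.default_rng(seed)
    samples = rng.permutation(samples)           # samples: (600, 3)
    train, val = samples[:n_train], samples[n_train:]

    best_model, best_ll = None, -np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, covariance_type="full",
                              random_state=seed).fit(train)
        ll = gmm.score(val)                      # mean held-out log-likelihood
        if ll > best_ll:
            best_model, best_ll = gmm, ll
    return best_model

# One model per (action, noise type): 3 actions x {actuation, sensor} = 6 models.
```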
$$m_t, \hat{x}_t = f_{SLAM}(s_t, x'_{t-1:t}, m_{t-1}|\theta_S, b_{t-1})$$
where $\theta_S$ denote the trainable parameters and $b_{t-1}$ denotes internal representations of the Neural SLAM module. The Neural SLAM module can be broken down into two parts, a Mapper ($f_{Map}$) and a Pose Estimator Unit ($f_{PE}$). The Mapper outputs an egocentric top-down 2D spatial map, $p^{ego}_t \in [0, 1]^{2 \times V \times V}$ (where $V$ is the vision range), predicting the obstacles and the explored area in the current observation: $p^{ego}_t = f_{Map}(s_t|\theta_M)$, where $\theta_M$ are the parameters of the Mapper. It consists of ResNet18 convolutional layers to produce an embedding of the observation. This embedding is passed through two fully-connected layers followed by 3 deconvolutional layers to get the first-person top-down 2D spatial map prediction.
Now, we would like to add the egocentric map prediction ($p^{ego}_t$) to the geocentric map from the previous time step ($m_{t-1}$). In order to transform the egocentric map to the geocentric frame, we need the pose of the agent in the geocentric frame. The sensor reading $x'_t$ is typically noisy. Thus, we have a Pose Estimator to correct the sensor reading and give an estimate of the agent's geocentric pose.

In order to estimate the pose of the agent, we first calculate the relative pose change ($dx$) from the last time step using the sensor readings at the current and last time step ($x'_{t-1}$, $x'_t$). Then we use a Spatial Transformation (Jaderberg et al., 2015) on the egocentric map prediction at the last frame ($p^{ego}_{t-1}$) based on the relative pose change ($dx$), $p'_{t-1} = f_{ST}(p^{ego}_{t-1}|dx)$. Note that the parameters of this Spatial Transformation are not learnt, but calculated using the pose change ($dx$). This transforms the projection at the last step to the current egocentric frame of reference. If the sensor was accurate, $p'_{t-1}$ would highly overlap with $p^{ego}_t$. The Pose Estimator Unit takes in $p'_{t-1}$ and $p^{ego}_t$ as input and predicts the relative pose change: $\hat{dx}_t = f_{PE}(p'_{t-1}, p^{ego}_t|\theta_P)$. The intuition is that by looking at the egocentric predictions of the last two frames, the pose estimator can learn to predict the small translation and/or rotation that would align them better. The predicted relative pose change is then added to the last pose estimate to get the final pose estimate, $\hat{x}_t = \hat{x}_{t-1} + \hat{dx}_t$.

Finally, the egocentric spatial map prediction is transformed to the geocentric frame using the current pose prediction of the agent ($\hat{x}_t$) using another Spatial Transformation and aggregated with the previous spatial map ($m_{t-1}$) using a Channel-wise Pooling operation: $m_t = m_{t-1} + f_{ST}(p^{ego}_t|\hat{x}_t)$.
Combining all the functions and transformations:
$$m_t, \hat{x}_t = f_{SLAM}(s_t, x'_{t-1:t}, m_{t-1}|\theta_S, b_{t-1})$$
$$p^{ego}_t = f_{Map}(s_t|\theta_M)$$
$$\hat{x}_t = \hat{x}_{t-1} + f_{PE}(f_{ST}(p^{ego}_{t-1}|\hat{x}_{t-1:t}), p^{ego}_t|\theta_P)$$
$$m_t = m_{t-1} + f_{ST}(p^{ego}_t|\hat{x}_t)$$
where $\theta_M, \theta_P \in \theta_S$, and $p^{ego}_{t-1}, \hat{x}_{t-1} \in b_{t-1}$.
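The map update in the last two equations could be sketched with PyTorch's affine_grid/grid_sample standing in for the spatial transformation $f_{ST}$. This is a simplified assumption: it ignores the metric scaling between map cells and meters, assumes the pose is already expressed in normalized map coordinates, and uses an element-wise max as a stand-in for the channel-wise pooling used for aggregation.

```python
import torch
import torch.nn.functional as F

def spatial_transform(p_ego, pose, map_size):
    """Paste an egocentric prediction p_ego (B, 2, V, V) into a geocentric
    (B, 2, M, M) frame at pose (x, y, theta), given in normalized map coords."""
    B = p_ego.shape[0]
    x, y, theta = pose[:, 0], pose[:, 1], pose[:, 2]
    cos, sin = torch.cos(theta), torch.sin(theta)
    # Affine matrix mapping output (geocentric) coords to source (egocentric) coords.
    affine = torch.stack(
        [torch.stack([cos, -sin, x], dim=1),
         torch.stack([sin,  cos, y], dim=1)], dim=1)              # (B, 2, 3)
    grid = F.affine_grid(affine, [B, 2, map_size, map_size], align_corners=False)
    # Zero-pad the egocentric prediction to M x M before resampling.
    pad = (map_size - p_ego.shape[-1]) // 2
    p_full = F.pad(p_ego, (pad, pad, pad, pad))
    return F.grid_sample(p_full, grid, align_corners=False)

def map_update(m_prev, p_ego, pose_est, map_size):
    # Element-wise max approximates the channel-wise pooling aggregation.
    return torch.max(m_prev, spatial_transform(p_ego, pose_est, map_size))
```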
D ARCHITECTURE DETAILS
We use PyTorch (Paszke et al., 2017) for implementing and training our model. The Mapper in the
Neural SLAM module consists of ResNet18 convolutional layers followed by 2 fully-connected
layers trained with a dropout of 0.5, followed by 3 deconvolutional layers. The Pose Estimator
consists of 3 convolutional layers followed by 3 fully connected layers. The Global Policy is a 5
layer convolutional network followed by 3 fully connected layers. We also pass the agent orientation
as a separate input (not captured in the map tensor) to the Global Policy. It is processed by an
Embedding layer and added as an input to the fully-connected layers. The Local Policy consists of pretrained ResNet18 convolutional layers followed by fully connected layers and a recurrent
GRU layer. In addition to the RGB observation, the Local policy receives relative distance and
angle to the short-term goal as input. We bin the relative distance (bin size increasing with distance),
17
Published as a conference paper at ICLR 2020
relative angle (5 degree bins) and current timestep (30 time step bins) before passing them through
embedding layers. This kind of discretization has been used previously for RL policies (Lample and Chaplot, 2017; Chaplot and Lample, 2017) and it improved sample efficiency as compared to passing
the continuous values as input directly. For a fair comparison, we use the same discretization for
all the baselines as well. The short-term goal is processed using Embedding layers. For the exact
architectures of all the modules, please refer to the open-source code.
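A sketch of this discretization is given below. The 5-degree angle bins and 30 time-step bins follow the text; the exact distance bin boundaries are not given, so the increasing-width edges used here are an assumption.

```python
import torch
import torch.nn as nn

def bin_angle(angle_deg: float) -> int:
    """Relative angle in [-180, 180) mapped to 5-degree bins (72 bins)."""
    return int((angle_deg + 180) // 5) % 72

def bin_timestep(t: int, episode_len: int = 1000, n_bins: int = 30) -> int:
    return min(t * n_bins // episode_len, n_bins - 1)

def bin_distance(dist_m: float,
                 edges=(0.25, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0)) -> int:
    """Bin size increases with distance (assumed edges)."""
    for i, edge in enumerate(edges):
        if dist_m < edge:
            return i
    return len(edges)

# Each discrete input gets its own embedding, concatenated with the visual features.
angle_embed, dist_embed, time_embed = nn.Embedding(72, 8), nn.Embedding(8, 8), nn.Embedding(30, 8)
goal_feat = torch.cat([dist_embed(torch.tensor([bin_distance(1.3)])),
                       angle_embed(torch.tensor([bin_angle(-42.0)])),
                       time_embed(torch.tensor([bin_timestep(120)]))], dim=1)  # (1, 24)
```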
E HYPERPARAMETER DETAILS
We train all the components with 72 parallel threads, with each thread using one of the 72 scenes in
the Gibson training set. We maintain a FIFO memory of size 500000 for training the Neural SLAM
module. After one step in all the environments (i.e. every 72 steps) we perform 10 updates to the
Neural SLAM module with a batch size of 72. We use Adam optimizer with a learning rate of 0.0001.
We use binary cross-entropy loss for obstacle map and explored area prediction and MSE Loss for
pose prediction (in meters and radians). The obstacle map and explored area loss coefficients are 1
and the pose loss coefficient is 10000 (as MSE loss in meters and radians is much smaller).
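The supervised loss described above could be combined as in the sketch below; tensor shapes and function names are illustrative.

```python
import torch.nn.functional as F

def neural_slam_loss(pred_map, gt_map, pred_pose, gt_pose,
                     obstacle_coeff=1.0, explored_coeff=1.0, pose_coeff=10000.0):
    """Weighted sum of obstacle-map, explored-area and pose losses.

    pred_map, gt_map: (B, 2, V, V) with obstacle / explored channels in [0, 1].
    pred_pose, gt_pose: (B, 3) pose in meters (x, y) and radians (orientation).
    """
    obstacle_loss = F.binary_cross_entropy(pred_map[:, 0], gt_map[:, 0])
    explored_loss = F.binary_cross_entropy(pred_map[:, 1], gt_map[:, 1])
    pose_loss = F.mse_loss(pred_pose, gt_pose)
    return (obstacle_coeff * obstacle_loss
            + explored_coeff * explored_loss
            + pose_coeff * pose_loss)
```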
The Global policy samples a new goal every 25 timesteps. We use Proximal Policy Optimization
(PPO) (Schulman et al., 2017) for training the Global policy. Our PPO implementation for the Global
Policy is based on Kostrikov (2018). The reward for the Global policy is the increase in coverage in
m2 scaled by 0.02. It is trained with 72 parallel threads and a horizon length of 40 steps (40 steps for
Global policy is equivalent to 1000 low-level timesteps as Global policy samples a new goal after
every 25 timesteps). We use 36 mini-batches and do 4 epochs in each PPO update. We use Adam
optimizer with a learning rate of 0.000025, a discount factor of γ = 0.99, an entropy coefficient of
0.001, and a value loss coefficient of 0.5 for training the Global Policy.
The Local Policy is trained using binary cross-entropy loss. We use Adam optimizer with a learning
rate of 0.0001 for training the Local Policy.
Input frame size is 128 × 128; the vision range for the SLAM module is V = 64, i.e. 3.2m (each cell is 5cm in length). Since there are no parameters dependent on the map size, it can be adaptive. We use a map size of M = 480 (equivalent to 24m) for training and M = 960 (equivalent to 48m) for evaluation. A map of size 48m × 48m is large enough for all scenes in the Gibson val set. The size of the Global Policy input is constant, G = 240, which means we downscale the map by 2 times during training and 4 times during evaluation. All hyperparameters are available in the code.
F ADDITIONAL RESULTS
[Figure 10: three panels (Gibson Val - Large, Gibson Val - Small, Gibson Val - Overall); x-axis: Episode length; y-axis: Coverage (m2); curves include RL + Res18 + GRU, RL + Res18 + GRU + AuxDepth, RL + Res18 + GRU + ProjDepth]
Figure 10: Plot showing the absolute Coverage in m2 as the episode progresses for ANS and the baselines on the large and small scenes in the Gibson Val set as well as the overall Gibson Val set.