Zero-Shot Terrain Generalization For Visual Locomotion Policies
Alejandro Escontrela¹,², George Yu¹, Peng Xu¹, Atil Iscen¹, Jie Tan¹
I. INTRODUCTION
Fig. 1. A Laikago robot navigating a variety of complex terrains not encountered during training. (Panels include (c) Office 2, (d) Hilly, (e) Mountainous, and (f) Maze.)

The ability to traverse unstructured terrains makes legged robots an appealing solution to a wide variety of tasks, including disaster relief, last-mile delivery, industrial inspection, and planetary exploration [1], [2]. To deploy robots in these settings successfully, we must design controllers that work well across many different terrains. Due to the diversity of environments that a legged robot can operate in, hand-engineering such a controller presents unique challenges. Deep Reinforcement Learning (DRL) has proven itself capable of automatically acquiring control policies to accomplish a large variety of challenging locomotion tasks. However, many of these approaches learn control policies that succeed in a single type of terrain with limited variations. This limits the robot's ability to generalize to new or unseen environments, which is a crucial feature of a useful locomotion controller.

In this paper, we develop an end-to-end reinforcement learning system that enables legged robots to traverse a large variety of terrains. To facilitate learning generalizable policies, we make two purposeful design decisions for our learning system. First, we formulate the problem as a Multi-Task Partially Observable Markov Decision Problem and show that the robot learns a robust policy that works well across a wide variety of tasks (terrains). To this end, we develop a novel procedural terrain generation method, which can efficiently generate a large variety of terrains for training. Second, we design an end-to-end neural network architecture that can handle both perception and locomotion. We call this parameterization a visual-locomotion policy. While many prior works in the legged robot literature focused on blind walking, which does not involve exteroceptive sensors (e.g., camera, LiDAR), we find that exteroceptive perception is essential for robots to navigate in diverse environments. Our end-to-end visual-locomotion policy takes both exteroceptive (a LiDAR scan) and proprioceptive information of the robot and outputs low-level motor commands. We embed the Policies Modulating Trajectory Generator (PMTG) [3] framework into our policy architecture to generate cyclic and smooth actuation patterns, and to facilitate the learning of robust locomotion policies.

We evaluate our learning system using a high-fidelity physics simulator [4] and visually-realistic indoor scans [5] (Figure 1). We test the learned policy in thirteen realistic simulation environments (five training and eight testing). Our system learns highly generalizable locomotion policies, which demonstrate zero-shot generalization to unseen testing environments. We also show that our visual-locomotion policy's parameterization is key to generalization and yields far better performance than commonly-used reactive policies. This paper's main contributions include an end-to-end visual-locomotion policy parameterization and a complete multi-task learning system, with which a quadruped robot learns a single locomotion policy that can traverse a diverse set of terrains.

¹ Google Brain Robotics, {georgeyu,pengxu,atil,jietan}@google.com
² Georgia Institute of Technology, [email protected]
Work performed while Alejandro was an intern at Google Brain.
II. RELATED WORK

A. Legged Locomotion

Locomotion controllers can be developed using trajectory optimization [6], whole-body control [7], model predictive control [8], and state-machines [9]. While the controllers developed by these techniques can generalize to a certain degree, expertise and manual tuning are often needed to adapt them to different terrains.

In contrast, Deep Reinforcement Learning [10] can automatically learn agile and robust locomotion skills [11], [12], [13], [14]. Prior work in RL has learned policies that are specific to a single environment [15], or generalize to variations of a single type of terrain [16], [17], [18]. Recently, Lee et al. [14] combined various techniques, such as ActuatorNet [13], PMTG [3], curriculum learning, and “learning by cheating” [19], which successfully performed zero-shot transfer from simulation to many challenging terrains in the real world. While our paper's high-level goal is similar to this prior work, our approach incorporates exteroceptive sensors that enable the robot to navigate in cluttered indoor environments where blind walking may have difficulties.
B. Multi-Task Reinforcement Learning

Multi-task reinforcement learning (MTRL) [20] is a promising approach to train generalizable policies that can accomplish a wide variety of tasks. Hessel et al. [21] learned a single policy that achieves state-of-the-art performance on 57 Atari games. Yu et al. [22] evaluated the performance of various RL algorithms on a grasping and manipulation benchmark and demonstrated that a single control policy is capable of completing a variety of complex robotic manipulation tasks. In this paper, we apply MTRL to develop a learning system for locomotion that enables legged robots to navigate in a large variety of environments.
III. METHODS

In this work, we frame legged locomotion as a multi-task reinforcement learning (MTRL) problem and define each task as a type of terrain that the legged robot (agent) must traverse. To learn generalizable locomotion policies, our learning system consists of a procedural terrain generator that can efficiently generate diverse training environments, and an end-to-end visual-locomotion policy architecture that directly maps the robot's exteroceptive and proprioceptive observations to motor commands.

A. Multi-Task Reinforcement Learning Formulation

Given a distribution of tasks M, each task Mi ∈ M is a Partially Observable Markov Decision Process (POMDP). A POMDP is a tuple Mi = ⟨S, O, A, Ti, Ri⟩, where S is the state space, O is the observation space, A is the action space, Ti : S × A × S → R+ is the transition probability function, and Ri : S × A → R is the reward function. During training, the agent is presented with randomly sampled tasks Mi ∈ M (Section III-B). The solution of the multi-task POMDP is a stochastic policy π : O × A → R+ that maximizes the expected accumulated reward over the episode length T:

π* = arg max_π E_{Mi∈M} [ Σ_{t=0}^{T} r(st, at) ]

Our problem is partially observable because of the limited sensors onboard the robot¹. The robot is equipped with a LiDAR sensor to perceive the distances d to the surrounding environment. Proprioceptive information comes from a simulated IMU sensor, which includes measurements of the roll φ, pitch θ, and the angular velocity of the torso βω = (φ̇, θ̇, ψ̇), and from motor encoders that measure the robot's 12 joint angles q. The complete observation at timestep t is

st = [aᵀt−1, ot, sᵀTG, gd,t, gh,t],

where ot = [dᵀt, βωᵀt, qᵀt, φt, θt] are the sensor observations, gd and gh are the distance and relative heading to the target, at−1 is the action at the last timestep, and sTG are the parameters of the trajectory generator (Section III-C). Unlike some prior work in MTRL, where the task ID is part of the observation [22], [23], we purposefully choose not to leverage such information, because identifying tasks automatically in the real world is challenging. Instead, we would like to train a policy that can rely on its own perception input and demonstrates zero-shot generalization to new tasks, without knowing the task ID explicitly. In Section IV, we demonstrate that perception is crucial in learning policies which generalize well to new tasks. The output action at of the policy specifies the desired joint angles, which are tracked by PD controllers on the simulated robot.

¹ Although we use a simulated robot due to limited access to the physical robot during COVID-19, we strive to make the simulation, including the sensor measurement, as faithful as possible to the real robot.

We employ a simple reward function, which encourages the agent to navigate to a target location g = (xg, yg, zg) (the red ball in Figure 1):

rt = (gd,t−1 − gd,t) / ∆t,

where gd,t is the Euclidean distance from the robot to the target location at timestep t, and ∆t is the timestep duration. This reward can be interpreted as the speed at which the robot is moving towards the target location. Once the robot's center of mass is within a threshold distance of the target location, the task is complete.
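To make the observation and reward definitions concrete, the following Python sketch shows one way they could be assembled. It is an illustrative sketch only, not the authors' implementation; the argument names and shapes are assumptions, and the raw sensor values are taken as already-available NumPy arrays.

import numpy as np

# Illustrative sketch of s_t = [a_{t-1}, o_t, s_TG, g_d, g_h] and of the
# navigation reward described above. Argument names are hypothetical.
def build_observation(prev_action, lidar_d, torso_ang_vel, joint_q,
                      roll, pitch, tg_state, goal_dist, goal_heading):
    # o_t = [d_t, omega_t, q_t, phi_t, theta_t]
    o_t = np.concatenate([lidar_d, torso_ang_vel, joint_q, [roll, pitch]])
    return np.concatenate([prev_action, o_t, tg_state, [goal_dist, goal_heading]])

def navigation_reward(goal_dist_prev, goal_dist, dt):
    # Speed of progress toward the target: r_t = (g_{d,t-1} - g_{d,t}) / dt
    return (goal_dist_prev - goal_dist) / dt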
B. Terrain Parameterization and Procedural Task Generation

We develop a procedural terrain generator to generate diverse and challenging terrains that provide the robot with a large quantity of rich training data. The environment is composed of m × n pillars, each pillar having cross-sectional dimensions l and w, and height h. We denote H = {hi,j} ∈ R^{m×n} as the height field for all the pillars. During training, we select a task Mi and adjust each pillar's height to reflect the chosen task. Each task is a set of randomly generated terrains that belong to the same type (e.g., flat, stairs). Each type of terrain is described by a parameter vector φi, which provides the lower and upper bounds for the random sampling. The terrain generator constructs the heightfield H from the given parameter vector φ. For example, the parameter vector φ for the rugged terrain task (Fig. 3b) includes the minimum and maximum values of the heightfield; for the stairs task, the parameter vector defines the height and length of each step. Table I summarizes the parameters and generation rules for selected terrain types. With this simple parameterization, we can generate over ten different types of terrains that a robot may encounter in the real world. Our procedural terrain generation algorithm provides a rich set of training data essential for generalizable policies to emerge.

TABLE I
TERRAIN PARAMETERIZATION AND GENERATION FOR SELECTED EXAMPLES.

Terrain    | Parameters φ                                                              | Terrain generation
Flat       | No parameters                                                             | H = 0
Rugged     | Min terrain height hmin; max terrain height hmax; Gaussian kernel std σ  | H ∼ Um,n(hmin, hmax); apply Gaussian smoothing with σ on H
Holes      | Number of holes n; hole depth h                                           | H = 0; sample n index pairs (i, j); H(i, j) = h
Obstacles  | Number of obstacles n; obstacle height h                                  | H = 0; sample n index pairs (i, j); H(i, j) = h
Stairs     | Stair step height h; stair step length l                                  | H(0, :) = 0; set column lengths to l; H(i+1, :) = H(i, :) + h

Fig. 3. (a) Obstacles (b) Rugged (c) Stairs (d) Cliff.
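As a concrete illustration of the generation rules in Table I, the sketch below maps a terrain type and parameter vector φ to a height field H. It is a simplified sketch, not the authors' released generator; the dictionary keys, default grid size, and example parameter values are assumptions.

import numpy as np
from scipy.ndimage import gaussian_filter

def generate_heightfield(terrain_type, phi, m=50, n=50, rng=np.random):
    H = np.zeros((m, n))                                  # Flat: H = 0
    if terrain_type == "rugged":
        H = rng.uniform(phi["h_min"], phi["h_max"], size=(m, n))
        H = gaussian_filter(H, sigma=phi["sigma"])        # Gaussian smoothing with std sigma
    elif terrain_type in ("holes", "obstacles"):
        i = rng.randint(0, m, size=phi["n"])              # sample n index pairs (i, j)
        j = rng.randint(0, n, size=phi["n"])
        H[i, j] = phi["h"]                                # negative h for holes, positive for obstacles
    elif terrain_type == "stairs":
        for i in range(1, m):                             # pillar length assumed set to the step length l
            H[i, :] = H[i - 1, :] + phi["h"]              # H(i+1, :) = H(i, :) + h
    return H

# During training a task M_i is sampled and the pillar heights are set to H, e.g.:
# H = generate_heightfield("rugged", {"h_min": -0.05, "h_max": 0.05, "sigma": 2.0})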
C. Visual-Locomotion Policy Architecture

Exteroceptive perception plays a crucial role when legged robots need to navigate different terrains and environments with obstacles and humans [24], [25]. As such, we aim to incorporate perception into our policy architecture such that information from the robot's surroundings can modulate locomotion. Additionally, the policy's low-level actuation commands need to be smooth and realizable on the physical robot. To this end, we seek to restrict the search space of possible gaits to those that are cyclic and smooth, while remaining expressive enough that perception can modulate locomotion sufficiently to work on different terrains.

Fig. 2. Overview of the visual-locomotion policy architecture. (a) Visual-locomotion policy architecture. (b) The locomotion component using PMTG [3] for smooth and cyclic actuation patterns.

In our visual-locomotion policy architecture (Fig. 2), we use two separate neural network encoders to process the proprioceptive and exteroceptive inputs. The upper branch of Fig. 2a processes the LiDAR input, while the lower branch handles the proprioceptive information. The learned lower-dimensional features are concatenated with the target information before being passed to the policy's locomotion component. We chose to use Policies Modulating Trajectory Generators (PMTG) [3] as our locomotion component architecture (Fig. 2b). PMTG encourages the policy to learn smooth and cyclic locomotion behaviors. PMTG outputs a desired trajectory for the legs that is modulated by a learned policy πθ(·): the policy observes the state of the trajectory generator (TG), stg, and the robot's observation st, then outputs the parameters of the TG, ptg (including gait frequency, swing height, and stride length), and a residual action term µfb. The final output action of our visual-locomotion policy is the combination of the trajectory generator output and the residual action: at = µtg + µfb. Please refer to the original paper [3] for more details. As detailed in [16], our visual-locomotion policy architecture achieves a separation of concerns between basic locomotion skills and terrain perception, which enables the robot to adapt its smooth locomotion behaviors according to its surrounding environments.
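The sketch below illustrates this architecture with Keras layers: two small encoders, concatenation with the target information, and a head that outputs the TG parameters ptg and the residual µfb. It is a schematic sketch rather than the authors' network; the input and output sizes are placeholders, while the encoder widths (32, 16, 4) and policy trunk (256, 128) follow the training details given in Section IV-A.

import tensorflow as tf
from tensorflow.keras import layers

def mlp(x, sizes):
    for s in sizes:
        x = layers.Dense(s, activation="relu")(x)
    return x

lidar_in = tf.keras.Input(shape=(1024,), name="lidar_scan")       # flattened LiDAR vector d (size is a placeholder)
proprio_in = tf.keras.Input(shape=(17,), name="proprioception")   # angular velocity, joint angles, roll, pitch
target_in = tf.keras.Input(shape=(2,), name="target")             # g_d, g_h

lidar_feat = mlp(lidar_in, (32, 16, 4))       # exteroceptive encoder (upper branch)
proprio_feat = mlp(proprio_in, (32, 16, 4))   # proprioceptive encoder (lower branch)
features = layers.Concatenate()([lidar_feat, proprio_feat, target_in])

trunk = mlp(features, (256, 128))                   # locomotion component
p_tg = layers.Dense(4, name="tg_params")(trunk)     # gait frequency, swing height, stride length, ... (size assumed)
mu_fb = layers.Dense(12, name="residual")(trunk)    # residual joint-angle term mu_fb

policy_net = tf.keras.Model([lidar_in, proprio_in, target_in], [p_tg, mu_fb])
# The final action a_t = mu_tg + mu_fb, where mu_tg is produced by the
# trajectory generator driven by p_tg.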
IV. EXPERIMENTAL RESULTS

We design experiments to validate the proposed system's ability to learn a visual locomotion policy that generalizes well to terrains not encountered during training. In particular, we would like to answer the following two questions:
• Can our system learn visual locomotion policies that demonstrate zero-shot generalization to new terrains?
• Can our policy architecture effectively use the LiDAR input and the PMTG parameterization to improve the generalization performance over unseen terrains?

A. Experiment Details

To answer the above questions, we evaluate our system using a simulated Unitree Laikago quadruped robot [26],
which weighs approximately 22kg and is actuated by 12 motors. We simulate the onboard Velodyne VLP-16 (Puck) LiDAR sensor, which provides the perception of the surrounding environment (see Figure 2b). The LiDAR measures the distance from the surrounding obstacles and terrain to the robot. This sensor supports 16 channels, a 360° horizontal field of view, and a 30° vertical field of view. We add Gaussian noise to the ground-truth distance readings in simulation to mimic the real-world noise model. The 3D LiDAR scan matrix D is normalized to the range [0, 1] and flattened to a vector d.
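A minimal sketch of this preprocessing (noise injection, normalization, and flattening) is shown below; the noise level and maximum range are placeholder values, not the ones used in the paper.

import numpy as np

def preprocess_lidar(D, max_range=100.0, noise_std=0.02, rng=np.random):
    D_noisy = D + rng.normal(0.0, noise_std, size=D.shape)   # Gaussian sensor noise
    D_norm = np.clip(D_noisy / max_range, 0.0, 1.0)          # normalize to [0, 1]
    return D_norm.flatten()                                  # LiDAR vector d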
Our policy computes joint target positions (at), which are converted to target joint torques by a PD controller running at 1kHz. Rigid-body dynamics and contacts are also simulated at 1kHz. In other words, the position and velocity (provided by PyBullet [4]) and the desired torque (provided by the PD controller) are sent to the actuator model every 1ms. The actuator model then computes 10 internal 100µs steps and provides the effective output torque of the actuator, which is then used by PyBullet to compute joint accelerations. The simulation environment is configured to use an action repeat of 10 steps, which means that our policy computes a new action at and receives a state st every 10ms (100Hz).
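The timing scheme above can be summarized by the following sketch of one policy step; policy, pd_controller, actuator_model, and sim are stand-in objects used for illustration, not an actual PyBullet API.

SIM_DT = 0.001       # physics and PD control at 1kHz
ACTION_REPEAT = 10   # the policy acts every 10ms (100Hz)

def policy_step(policy, pd_controller, actuator_model, sim, obs):
    action = policy(obs)                                    # desired joint angles a_t
    for _ in range(ACTION_REPEAT):                          # 10 simulation steps per action
        q, qdot = sim.joint_states()
        desired_torque = pd_controller(action, q, qdot)
        torque = actuator_model(desired_torque, q, qdot)    # 10 internal 100us substeps
        sim.apply_torques(torque)
        sim.step(SIM_DT)
    return sim.observation()                                # next state, observed at 100Hz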
We train the visual-locomotion policy using the MTRL formulation with simulated environments randomly generated by our procedural task generation method (Section III-B). We use a distributed version of Proximal Policy Optimization (PPO) [27] in TF-Agents [28] for training. We use a 2-layer fully-connected neural network of dimensions (512, 256) to parameterize the value function and another network of dimensions (256, 128) to parameterize the policy. The policy outputs the parameters of a multivariate Gaussian distribution, from which we sample actions during training. During evaluation we use a greedy policy, executing the mean of the multivariate Gaussian distribution provided by the policy network. The exteroceptive and proprioceptive input encoders both have dimensions (32, 16, 4). We use the ReLU activation function for all layers in both networks [29]. Advantages are estimated using Generalized Advantage Estimation [30].
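A minimal (non-distributed) sketch of how these networks could be configured with TF-Agents is given below, assuming a tf_env TFEnvironment that wraps the simulation; the optimizer and learning rate are placeholders, not values from the paper.

import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.networks import actor_distribution_network, value_network

actor_net = actor_distribution_network.ActorDistributionNetwork(
    tf_env.observation_spec(),
    tf_env.action_spec(),
    fc_layer_params=(256, 128),       # policy network dimensions from the paper
    activation_fn=tf.nn.relu)
value_net = value_network.ValueNetwork(
    tf_env.observation_spec(),
    fc_layer_params=(512, 256),       # value network dimensions from the paper
    activation_fn=tf.nn.relu)

agent = ppo_agent.PPOAgent(
    tf_env.time_step_spec(),
    tf_env.action_spec(),
    optimizer=tf.keras.optimizers.Adam(3e-4),   # placeholder learning rate
    actor_net=actor_net,
    value_net=value_net,
    use_gae=True)                               # Generalized Advantage Estimation
agent.initialize()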
We then evaluate the trained policies on a suite of testing environments not encountered during training. Figure 1 illustrates a subset of these testing environments. These high-fidelity simulated environments are created in the PyBullet physics engine [4] with Gibson scenes [5]. A policy's ability to successfully navigate across a given terrain is measured by the task completion rate, tcr, which captures how close the agent gets to the target relative to its starting position:

tcr = 1 − gd,T / gd,0,

where gd,T is the final Euclidean distance between the robot and the target when the robot falls or completes the task, and gd,0 is the distance at the beginning of the episode. A task completion rate of 1 indicates successful navigation to the target, whereas a tcr close to zero means that the robot cannot navigate across the terrain.
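A one-line helper makes the metric explicit; for example, a robot that starts 10 m from the target and ends an episode 2 m from it achieves tcr = 1 − 2/10 = 0.8.

def task_completion_rate(initial_goal_dist, final_goal_dist):
    # tcr = 1 - g_{d,T} / g_{d,0}
    return 1.0 - final_goal_dist / initial_goal_dist

print(task_completion_rate(10.0, 2.0))   # 0.8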
TABLE II
GENERALIZATION PERFORMANCE OF OUR VISUAL-LOCOMOTION POLICY.

TABLE III
COMPARISON OF OUR PROPOSED METHOD TO OTHER POLICIES DEPLOYED IN A MTRL TRAINING REGIME. THE PERFORMANCE DECREASES WHEN THE POLICY DOES NOT USE A PMTG PARAMETERIZATION, WHEN THE POLICY IS NOT PROVIDED EXTEROCEPTIVE INPUTS FROM THE LIDAR, AND WHEN MULTI-TASK TRAINING IS PERFORMED IN A SEQUENTIAL MANNER.

B. The Impact of MTRL on Generalization

Table II shows the generalization performance of our visual-locomotion policy trained on different types of terrains (rows) and tested in unseen environments (columns), including a maze (Maze), a steep and rugged mountain (Mountain), two indoor scenarios (Office 1 and Office 2), an office space with moving humans (Dynamic Env), a forest scene with rugged terrain and obstacles (Forest), a winding path with a cliff on both sides (Cliff), and a randomly-generated continuous mesh (Continuous). Policies trained on a single type of terrain achieve a low task completion rate in the testing environments due to a lack of diverse training data. In contrast, our approach achieves much higher generalization performance. For instance, our method on average achieves a task completion rate of 67% on the mountain task, while policies trained on a single type of terrain only achieve 28% at best (see Figure 4 for a snapshot of our policy navigating up the rugged mountain trail). These results indicate that our MTRL formulation with procedural task generation, combined with the visual-locomotion policy architecture, results in superior generalization performance. The policy learned with our system can be successfully deployed in new, unseen environments.

Fig. 4. Snapshot of a Laikago robot navigating through mountainous terrain not encountered during training. Please refer to the supplementary video for more examples of the agent navigating challenging terrains.

C. Ablation Studies

We perform three ablation studies to understand the importance of each design decision in our system. Table III summarizes their impacts on the resulting generalization performance of the policy.

a) PMTG: We replace the locomotion component of the visual-locomotion policy with a reactive policy that does not have a trajectory generator. Our PMTG-parameterized visual-locomotion policy performs 28%-218% better than a pure reactive locomotion component. We find that PMTG produces smoother actions and leads to improved zero-shot generalization to new terrains.

b) Exteroceptive input: We remove the LiDAR input from the visual-locomotion policy. Observing Table III, it is clear that the exteroceptive information plays a critical role in learning generalizable locomotion policies that can adapt to a wide variety of terrains. This finding agrees with results from the field of experimental psychology, which establish the importance of exteroceptive observations in guiding foot placement when navigating over complex terrains [25], [24].

Figure 5 visualizes the trajectory produced by our visual
Fig. 5. Visualization of trajectory generated by our method in an environment with many obstacles. Foot Z positions for the left hind, right hind, left forward, and right forward feet are shown.

Fig. 6. Visualization of trajectory generated by our method in a rugged terrain. Foot Z positions for the left hind, right hind, left forward, and right forward feet are shown. The rugged terrain requires that the robot carefully place its feet to maintain balance.