
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 5, NO. 2, APRIL 2020

DeepGait: Planning and Control of Quadrupedal Gaits Using Deep Reinforcement Learning

Vassilios Tsounis, Mitja Alge, Joonho Lee, Farbod Farshidian, and Marco Hutter

Abstract—This letter addresses the problem of legged locomotion in non-flat terrain. As legged robots such as quadrupeds are to be deployed in terrains with geometries which are difficult to model and predict, the need arises to equip them with the capability to generalize well to unforeseen situations. In this work, we propose a novel technique for training neural-network policies for terrain-aware locomotion, which combines state-of-the-art methods for model-based motion planning and reinforcement learning. Our approach is centered on formulating Markov decision processes using the evaluation of dynamic feasibility criteria in place of physical simulation. We thus employ policy-gradient methods to independently train policies which respectively plan and execute foothold and base motions in 3D environments using both proprioceptive and exteroceptive measurements. We apply our method within a challenging suite of simulated terrain scenarios which contain features such as narrow bridges, gaps and stepping-stones, and train policies which succeed in locomoting effectively in all cases.

Fig. 1. The suite of terrains: the baseline Flat-World scenario (left), the Random-Stairs scenario (bottom center), and the composite Temple-Ascent scenario (right) comprising a set of winding stairs and two derelict bridges containing stepping-stones and gaps of varying size.
Index Terms—Legged robots, deep learning in robotics and automation, motion and path planning.

Manuscript received September 10, 2019; accepted January 18, 2020. Date of publication March 9, 2020; date of current version March 27, 2020. This letter was recommended for publication by Associate Editor Z. Li and Editor N. Tsagarakis upon evaluation of the reviewers' comments. This work was supported in part by Intel Labs, in part by the Swiss National Science Foundation (SNSF) through Projects 166232 and 188596, in part by the National Centre of Competence in Research Robotics (NCCR Robotics), in part by the European Union's Horizon 2020 research and innovation program under Grant Agreement 780883, and in part by ANYmal Research, a community to advance legged robotics. (Vassilios Tsounis and Mitja Alge contributed equally to this work.) (Corresponding author: Vassilios Tsounis.) The authors are with the Robotic Systems Lab, ETH Zürich, 8092 Zürich, Switzerland. Digital Object Identifier 10.1109/LRA.2020.2979660

I. INTRODUCTION

LEGGED locomotion in non-flat terrain, both structured and unstructured, poses a significant challenge in robotics. Operating autonomously in such environments requires addressing the problem of multi-contact motion planning and control. If a legged robot such as ANYmal [1] is to traverse complex environments autonomously, it must possess the capability to select footholds appropriate for the terrain, while also retaining balance at all times. This work deals specifically with the problem of planning and executing sequences of footholds for quadrupedal locomotion in rigid non-flat terrain using proprioceptive and exteroceptive sensing.

Dynamically walking on non-flat terrain necessitates the optimization of continuous state-input trajectories such as the motion of the base, as well as discrete decision variables such as which surface, and when, to make contact with. This has been addressed predominantly using model-based approaches, such as those employing deterministic optimization techniques [2], [3], in conjunction with other heuristics [4], to plan motions for both the base and feet. Although some of the aforementioned approaches are able to solve such problems comprising both continuous and discrete variables [2], [3], these remain too computationally intensive to be executed online. Thus, only kinostatic approaches [4], [5] have managed to perform online foothold planning. Also, most methods typically employ some form of parameterization or qualification of the terrain [4], [6], [7] in order to simplify the search for viable footholds.

One of the primary challenges in multi-contact planning for multi-legged systems lies in dealing with the combinatorial problem due to the vast number of contact configurations admissible for the terrain. Typical solutions involve either assuming the gait pattern [4], [5] or employing sampling-based search techniques [7], [8]. There also exist works that have combined both optimization and sampling-based methods [5], [6], [9]. However, these typically resort to decoupling the selection of footholds from the optimization of base motions and thus remain kinostatic, as they tend to neglect the dynamics of the system.

Some works have also incorporated machine-learning techniques for facilitating terrain perception [6], [10], [11]. Others have employed Deep Reinforcement Learning (DRL) [12], [13] for realizing end-to-end terrain-aware locomotion. The use of the latter, however, still poses several challenges, namely: (1) how to eliminate undesirable yet retain beneficial emergent behavior, and (2) how to reduce overall sample complexity and train policies efficiently.

We propose a new method that combines state-of-the-art model-based and model-free methods to enable quadrupedal systems to traverse complex non-flat terrain. Our formulation consists of: (1) a terrain-aware planner that generates sequences of footholds and base motions that direct the robot towards a target heading, and (2) a foothold and base motion controller which executes the aforementioned sequence while maintaining balance and dealing with disturbances.
Both planner and controller are realized as stochastic policies parameterized using Neural-Network (NN) function approximation, which are optimized using state-of-the-art Deep Reinforcement Learning (DRL) algorithms.

Our contributions with this work are: (1) A novel method for training kinodynamic motion planners, which employs a Trajectory Optimization (TO) technique for determining so-called transition feasibility between discrete support phases using a coarse model of the robot. This removes the need for a planner to interact with both physics and a controller during training, allows the two policies to be trained independently, and leads to a significant reduction in overall sample complexity. (2) A simple formulation for realizing dynamic walking controllers that use target footholds as references and rely solely on proprioceptive sensing. This allows us to train controllers that can fully exploit the kinematics and dynamics of the robot in order to track arbitrary target footholds, irrespective of the planner used to generate them.

We evaluate the performance of our method across a set of challenging locomotion scenarios using a physics simulator and present results thereof. Our experiments demonstrate that the planner can generalize well across terrain types, and the controller succeeds in tracking reference footholds while always balancing the robot. Moreover, we illustrate the advantages of our method by comparing it with a state-of-the-art model-based approach [4].

II. PRELIMINARIES

A. Reinforcement Learning

We consider the problem of sequential decision making in which an agent interacts with an environment with the objective of maximizing cumulative reward. We model this problem as a discrete-time infinite Markov Decision Process (MDP) with a discounted expected return objective. Such an MDP consists of a set of states S, a set of actions A, a transition dynamics distribution, an initial state distribution, a scalar reward function r(st, at, st+1), and a scalar discount factor γ ∈ [0, 1). The agent selects actions according to a policy π with the objective of maximizing the expected return E[ Σ_{k=t}^{∞} γ^k r_{t+k} ], where rt is the scalar reward resulting from the state transition at time-step t. As we consider infinite MDPs in which S and A are infinite sets, we use parameterized stochastic policies πθ(a|ot), which are distributions over actions a ∈ A conditioned on observations ot ∈ O given parameter vectors θ ∈ R^n.

B. Model of the System

The robot comprises an unactuated floating base and four articulated legs with actuated rotational joints. The state of the robot is specified as: rB ∈ R^3, the absolute¹ position of the base; RB ∈ SO(3), the rotation matrix representing the absolute attitude of the base; vB, ωB ∈ R^3, the absolute linear and angular velocities of the base; and qj, q̇j ∈ R^12, the angular positions and velocities of the joints. The robot is controlled using joint torques τj ∈ R^12. Moreover, we assume that we can extract robocentric measurements of local terrain elevation via the mapping MR : R^2 × R → R^{32×32} with a resolution of 4 cm. In order to reason precisely about gaits and transitions between contact supports, we define a parameterization thereof that encompasses all necessary information. We thus parameterize a gait as a sequence of so-called support phases. Each phase is defined by the tuple

Φ := ⟨RB, rB, vB, rF, cF, tE, tS⟩ ∈ Φ    (1)

where cF ∈ {0, 1}^4 is a vector indicating for each of the feet a closed, 1, or open, 0, contact w.r.t. the terrain, rF ∈ R^{3×4} are the stacked absolute positions of the feet, and tE, tS ∈ R are the phase timing variables. For every phase Φt, t − tE defines the time at which the switch to the current contact configuration occurred, while t + tS the switch to the next. Fig. 2(b) illustrates the aforementioned quantities.

¹We define a global inertial frame W for the world, and local body-fixed frames B for the base and F for the feet. Left sub-scripts denote the frame in which a vector is expressed, but are omitted for absolute quantities.

Fig. 2. (a) Overview of the proposed control structure used at deployment time. (b) Phases within a sequence are indexed using s, and every index corresponds to a point in time centered around a window defined by the durations tE and tS. The center of the window is defined by the motion of the base as captured by the phase Φs. tS defines the time-to-switch from the current contact support to the next, specified in Φs+1, and tE defines the time elapsed since the switch from the previous contact support, specified in Φs−1, to the current.
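For concreteness, the support-phase tuple in (1) can be represented as a small data structure. The sketch below is illustrative only and not the authors' implementation (which is in C++); the field names, the timing helper, and the example values are our own assumptions.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SupportPhase:
    """One support phase, mirroring Eq. (1): Phi = <R_B, r_B, v_B, r_F, c_F, t_E, t_S>."""
    R_B: np.ndarray   # (3, 3) absolute base attitude (rotation matrix)
    r_B: np.ndarray   # (3,)   absolute base position
    v_B: np.ndarray   # (3,)   absolute base linear velocity
    r_F: np.ndarray   # (3, 4) stacked absolute foot positions
    c_F: np.ndarray   # (4,)   contact flags: 1 = closed (stance), 0 = open (swing)
    t_E: float        # time elapsed since the switch to the current contact configuration
    t_S: float        # time until the switch to the next contact configuration

    def contact_switch_times(self, t: float) -> tuple:
        """Return (previous switch time, next switch time) for phase time t."""
        return t - self.t_E, t + self.t_S

# Example: a full-stance phase with the base at the origin (illustrative numbers).
phase = SupportPhase(
    R_B=np.eye(3), r_B=np.zeros(3), v_B=np.zeros(3),
    r_F=np.array([[0.35, 0.35, -0.35, -0.35],
                  [0.25, -0.25, 0.25, -0.25],
                  [0.0, 0.0, 0.0, 0.0]]),
    c_F=np.ones(4), t_E=0.3, t_S=0.4,
)
print(phase.contact_switch_times(t=1.0))  # (0.7, 1.4)
```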
III. METHODOLOGY

We propose a two-level hierarchy comprising a high-level Gait Planner (GP) and a low-level Gait Controller (GC) operating at different time-scales, inspired by [13]. The GP, evaluated at roughly 2 Hz, serves as a local terrain-aware planner, and uses both exteroceptive and proprioceptive measurements to generate a finite sequence of support phases, i.e., a phase plan. The GC, evaluated at 100 Hz, serves as a hybrid motion planner and controller, and uses only proprioceptive sensing in combination with the aforementioned phase plan to output joint position references. Finally, a joint-space PD controller (with zero target joint velocity) uses these joint position references to compute joint torques at 400 Hz. The highest-level command to the system is provided as the desired walking direction in the form of the deviation of base attitude w.r.t. the goal. Fig. 2(a) provides an overview of the system.
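The rate hierarchy described above (GP at roughly 2 Hz, GC at 100 Hz, joint-space PD at 400 Hz with the GC output held between updates) can be sketched as nested decimation of a single high-rate loop. This is a sketch under our own assumptions about the interfaces; query_gait_planner, query_gait_controller, the PD gains, and the loop structure are hypothetical stand-ins, not the authors' code.

```python
import numpy as np

PD_RATE, GC_RATE, GP_RATE = 400, 100, 2   # Hz, as stated in the text
KP, KD = 80.0, 2.0                        # assumed joint-space PD gains

def query_gait_planner(observation):      # hypothetical: returns a phase plan
    return ["phase_plan"]

def query_gait_controller(phase_plan, proprioception):  # hypothetical: returns joint position targets
    return np.zeros(12)

def pd_torque(q_des, q, dq):
    # Joint-space PD with zero target joint velocity, as described in the text.
    return KP * (q_des - q) - KD * dq

q, dq = np.zeros(12), np.zeros(12)
q_des, phase_plan = np.zeros(12), None
for tick in range(PD_RATE * 2):           # two seconds of control at the PD rate
    if tick % (PD_RATE // GP_RATE) == 0:  # every 200 ticks -> 2 Hz
        phase_plan = query_gait_planner(observation=None)
    if tick % (PD_RATE // GC_RATE) == 0:  # every 4 ticks -> 100 Hz
        q_des = query_gait_controller(phase_plan, proprioception=(q, dq))
    tau = pd_torque(q_des, q, dq)         # 400 Hz; q_des is held constant between GC updates
    # ... apply tau to the robot or simulator and update (q, dq) ...
```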
A. Gait Planning

The GP operates by sequentially querying the planning policy πθP to generate the aforementioned phase plan. We thus formulate an MDP in order to train πθP using DRL, and our objective is to ensure that the resulting policy learns to respect the kinodynamic properties and limits of the robot, as well as contact constraints, when proposing phase transitions. Moreover, we aim to avoid direct interaction with the physics of the system, and instead craft the transition dynamics of the MDP by employing a transition feasibility criterion realized as a Linear Program (LP) using the framework defined in [14]. Lastly, we avoid explicitly modeling or qualifying the terrain, as done in [4], [8], and instead directly use measurements of local terrain elevation. The resulting MDP allows us to train πθP to infer a distribution over phase transitions which are not only feasible but also maximize locomotion performance (refer to the discussion in Section V).

Support Phase Transition Feasibility: Transition feasibility amounts to evaluating if a feasible motion exists between a pair of support phases Φt, Φ*t+1, where the former is assumed while the latter is a candidate successor. As previously mentioned, we employ the general framework defined in [14] to design a convex LP using the Convex Resolution Of Centroidal dynamics trajectories (CROC) formulation. We use CROC to derive a set of linear equality and inequality constraints, forming a convex polytope, using the following terms:
1) A Centroidal Dynamics model of the system.
2) The contact force unilateral and friction constraints.
3) Assume angular momentum rate of zero.
4) Parameterization of CoM motion as a Bezier curve.
5) Restrict motion of the feet w.r.t. the base.
6) Restrict contact forces in magnitude and direction.
The specific variant of CROC we employ uses a trivial cost, a
time-discretization of the CoM trajectory, and incorporates the
parameterization of the contact forces into the decision variables
of the optimization. We refer the reader to [14] for further details
regarding CROC. The resulting formulation allows us to real-
ize the transition feasibility mapping Fcroc : Φ × Φ → {0, 1}.
Therefore, we evaluate the LP for pairs of phases Φt , Φ∗t+1 , to
determine if the corresponding phase transition is feasible, 1, or
not, 0.
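The role of Fcroc can be illustrated with a generic convex feasibility check: assemble the linear equality and inequality constraints for a candidate transition and ask an LP solver, with a trivial cost, whether any point satisfies them. The sketch below uses SciPy's linprog purely as a feasibility oracle; the constraint matrices are placeholders for illustration and are not the actual CROC constraints from [14].

```python
import numpy as np
from scipy.optimize import linprog

def transition_is_feasible(A_ub, b_ub, A_eq, b_eq) -> int:
    """Generic stand-in for F_croc: returns 1 if the convex polytope
    {x : A_ub x <= b_ub, A_eq x = b_eq} is non-empty, else 0."""
    n = A_ub.shape[1]
    res = linprog(c=np.zeros(n),             # trivial cost: we only care about feasibility
                  A_ub=A_ub, b_ub=b_ub,
                  A_eq=A_eq, b_eq=b_eq,
                  bounds=[(None, None)] * n,
                  method="highs")
    return int(res.status == 0)              # status 0 -> an admissible point was found

# Toy example: x1 + x2 = 1 with 0 <= x1, x2 <= 1 is feasible.
A_ub = np.vstack([np.eye(2), -np.eye(2)])
b_ub = np.array([1.0, 1.0, 0.0, 0.0])
A_eq = np.array([[1.0, 1.0]])
b_eq = np.array([1.0])
print(transition_is_feasible(A_ub, b_ub, A_eq, b_eq))  # 1
```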
MDP Specification: Modeling locomotion as discrete sequences of support phases Φ, as defined in Section II-B, allows us to formulate phase transitions that exhibit the Markov property [15], and thus lends itself to modeling the overall problem of gait planning as an MDP. Specifically, we define the state of the MDP as the tuple sP := ⟨Φ, rG⟩, where rG is the current absolute position of the goal in the world, while observations and actions are defined as the tuples oP := ⟨oR, ov, oF, oc, oM⟩ and aP := ⟨aR, aB, av, aF, ac, at⟩, respectively. Observations consist of terms pertaining to the current state of the robot and the coincident terrain in the form of: the attitude w.r.t. the goal oR ∈ R, the CoM velocity ov ∈ R^2, the feet positions oF ∈ R^8, the feet contact states oc ∈ R^4 and the local height-map oM ∈ R^{32×32}. Conversely, actions contain terms pertaining to changes to the current phase Φ, in the form of the CoM rotation aR ∈ Rclip, CoM translation aB ∈ R^2clip, CoM velocity av ∈ R^2clip, feet positions aF ∈ R^8clip, feet contact states ac ∈ R^3clip and the phase timings at ∈ R^2clip. All action terms are scaled, offset and clipped to lie in Rclip := [−1, 1]. The exact definitions for actions and observations are provided in Table I.

TABLE I. Transformations used to form oP from phases and phase transitions from aP: the matrix Rz(α) defines a rotation about the world's z-axis by angle α, and fdec : R^3clip → {0, 1}^4 decodes a 3-digit binary encoding into a vector of contact states.

Transition Dynamics: We define the state transition dynamics for this MDP employing a formalism defining so-called termination condition functions T(sP,t, aP,t, sP,t+1), which determine if an episode terminates. By formulating an episode termination as a transition into an absorbing terminal state, we can say that an episode under this MDP terminates whenever sP,s+t = sP,s, ∀t > 0. In this MDP, in particular, we employ the following termination conditions:
1) Tfootholds: Checks for obstacles or gaps within the vicinity of each foothold using a fixed eight-point grid surrounding each foot.
2) Tbase: Checks for collisions between the base and the terrain.
3) Tfeasibility: Evaluates Fcroc(Φt, Φ*t+1).

Thus, each step of this MDP proceeds as follows: (1) Given a state sP,t, the MDP computes the corresponding observation oP,t, which is constructed according to the transformations in Table I, and is passed to the agent to select an appropriate action according to πθP. (2) The selected action aP,t is used to compute the candidate phase Φ*t+1, again using the set of transformations defined in Table I. (3) The aforementioned termination conditions are used to assert if the phase transition is feasible. This formulation therefore allows the agent to propose the phase transition directly, while the MDP only checks if it is feasible and otherwise terminates the episode.

In addition, we outline certain considerations regarding the computation of Φ*t+1. First, if a foot would be in contact for both phases Φt and Φ*t+1, the new foothold is reset to that of Φt since stance feet cannot change positions. Second, the z-coordinates of the feet and CoM positions are adjusted according to the height of the terrain at their respective locations, and for the CoM in particular, the height is set to a constant hcom above the lowest foothold. Lastly, to evaluate the kinematic constraints of CROC, we need to infer the intermediate orientations. To this end, we use a first-order approximation of the angular velocity of the base by linearly interpolating between the starting and next base attitudes of each transition. Doing so also renders zero angular momentum between consecutive support phases, which is in line with the assumptions made by CROC. One can thus envision an extension to CROC that includes angular momentum, which we intend to explore in future work.
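A minimal sketch of one step of this MDP, under our own naming assumptions: the agent proposes the phase transition directly, and the environment only applies the observation/action transformations of Table I and the termination checks, with the feasibility LP standing in for physics. All callables are assumed interfaces, not the authors' API, and the reward computation is omitted.

```python
def gp_rollout_step(state, policy, obs_fn, action_to_phase_fn, termination_checks):
    """One agent-environment interaction of the planner MDP.
    obs_fn implements the Table I observation transformations, action_to_phase_fn the
    action transformations, and termination_checks = (t_footholds, t_base, t_feasibility),
    each returning True when its condition triggers termination."""
    phase, goal = state
    o_P = obs_fn(phase, goal)                      # (1) build observation o_P,t from the state
    a_P = policy(o_P)                              # agent samples an action from pi_thetaP
    candidate = action_to_phase_fn(phase, a_P)     # (2) compute the candidate phase Phi*_{t+1}
    t_footholds, t_base, t_feasibility = termination_checks
    terminated = (t_footholds(candidate)           # (3) obstacle/gap grid around each foothold
                  or t_base(candidate)             #     base-terrain collision
                  or t_feasibility(phase, candidate))  # F_croc(Phi_t, Phi*_{t+1}) fails
    next_state = state if terminated else (candidate, goal)
    return next_state, a_P, terminated
```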

Reward Function: We design a reward function which drives the agent to learn behaviors for reaching the goal position, facing the goal as much as possible, minimizing kinematic effort during phase transitions and inhibiting long stance phases. The final reward function is specified as the combination of multiplicative and additive terms

rP(sP,t, aP,t, sP,t+1) := rp · rh^2 · rk − rc    (2)

where rp rewards the agent for bringing the average foothold position closer towards the goal and penalizes moving it away, rh penalizes the robot for not facing the goal, rk penalizes moving the feet away from the nominal footholds B rNF located beneath the shoulders, and rc penalizes not lifting a foot over multiple steps, therefore promoting exploration and preventing the policy from getting stuck in the local optimum of remaining in a constant stance.

Furthermore, we would like to emphasize certain features of the multiplicative term in the above reward function. Specifically, this term results in a penalty that is small when rp is small, i.e., at the beginning of training, and large when rp is large, i.e., towards the end of training, thus resulting in a form of automatic scaling of the overall multiplicative term. We found that using these multiplicative rewards results in beneficial gradients throughout all iterations of training, as their values are ensured never to be too large as to hinder exploration, and never too small as to have negligible effect. Furthermore, as rp is computed using the average position of footholds and not of the base, the agent is required to walk in order to maximize reward, as opposed to merely leaning. The latter aspect is important, since leaning forward also inhibits the motion of the front legs, therefore making it much harder to walk. Table II details the aforementioned reward terms.

TABLE II. The GP and GC reward terms. The subscript k indexes each foot, and ncontact,k counts the current number of consecutive state updates for which foot k has remained grounded. bk is a binary variable specifying whether a foot is within a cylinder of radius d = 5 cm of the target foothold. An over-lined letter denotes the conjugate of the binary variable, e.g. b̄k. The weighting factors are: wp = 25, wk = 80, wc = 0.01, wtc = 0.1, wsw = 0.01, wt = 0.001, wv = 0.5, we = 2, wsl = 0.02, wa = 0.2.

Policy Definition: We parameterize the GP's policy as a Gaussian distribution with a diagonal covariance matrix πθP(a|oP,t) := N(a|μθP(oP,t), σθP). The mean μθP(oP,t) is output by a NN which inputs both exteroceptive and proprioceptive measurements into a series of NN layers, similar to those proposed in [13]. First, MR is input into three CNN layers, the output of which is subsequently input into one more fully-connected layer. The resulting latent output from the height-map is concatenated with the raw proprioceptive measurements, then fed into two more fully-connected layers with ReLU and tanh nonlinearities, and finally passed through a linear output layer. The standard-deviation parameters σθP are realized by an additional layer that is independent of the observations and is used to drive exploration during training. Fig. 3(a) provides a graphical depiction of the NN model. Due to the inclusion of high-dimensional height-map data in the observations oP as well as the relatively large dimensionality of the actions aP, we trained πθP with a variant of Proximal Policy Optimization (PPO) using clipped loss and a Generalized Advantage Estimation (GAE) critic [16].

Fig. 3. The neural-network models used for the latent parameters of the policy distributions of the (a) GP and (b) GC, respectively.
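The GP policy architecture described above can be sketched as follows. This is a schematic re-implementation in PyTorch under assumed layer sizes (channel counts, kernel sizes, and hidden widths are not stated in the text and are our own choices); the original was implemented with the TensorFlow C/C++ API. The input and output dimensions follow the observation and action definitions of Section III-A (oR, ov, oF, oc give 15 proprioceptive inputs; aR, aB, av, aF, ac, at give 18 action dimensions).

```python
import torch
import torch.nn as nn

class GaitPlannerPolicy(nn.Module):
    """Gaussian policy with diagonal covariance: a CNN encodes the 32x32 height-map,
    the latent code is concatenated with proprioceptive inputs and passed through an MLP."""
    def __init__(self, proprio_dim: int = 15, action_dim: int = 18):
        super().__init__()
        self.cnn = nn.Sequential(                      # three CNN layers on the height-map M_R
            nn.Conv2d(1, 8, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(8, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, stride=2), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(16 * 3 * 3, 64), nn.ReLU(),      # one fully-connected layer on the CNN output
        )
        self.mlp = nn.Sequential(                      # two fully-connected layers, then a linear output
            nn.Linear(64 + proprio_dim, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.Tanh(),
            nn.Linear(64, action_dim),
        )
        # Observation-independent log standard deviations driving exploration.
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, height_map, proprio):
        z = self.cnn(height_map)                       # height_map: (B, 1, 32, 32)
        mean = self.mlp(torch.cat([z, proprio], dim=-1))
        return torch.distributions.Normal(mean, self.log_std.exp())

policy = GaitPlannerPolicy()
dist = policy(torch.zeros(1, 1, 32, 32), torch.zeros(1, 15))
action = dist.sample()                                 # scaled, offset and clipped downstream
```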
B. Gait Control

The GC is responsible for executing the support phase sequence provided by the GP while maintaining balance at all times. It operates by tracking a series of footholds and base positions extracted from the support phase sequence generated by the GP. In order to learn this behavior, we define an MDP with transition dynamics which incorporates the physics of the system, and specify an appropriate parameterization for πθC. Training a GC agent in such an MDP requires only that a target phase sequence be provided, and does not assume that a GP is available a priori. In fact, the target phase sequence can be provided arbitrarily as long as the target footholds are feasible. However, in this work, we elected to utilize a GP for this purpose solely as a matter of convenience, so as to avoid the use of additional elements.

Target Foothold Extraction: Assuming the GP is queried at some time t, we denote the resulting phase sequence as Φ*0:N,t, where Φ*0,t is the initial phase as measured by the GP before generating the sequence of length N. This amounts to rolling out² the planning policy by recursively evaluating πθP using its own output.

²The small range and dimensions selected for the elevation map, in conjunction with the limitation on maximum step length assumed by the planner, allow us to extract multiple successive samples of MR from within the effective field-of-view afforded by exteroceptive sensing.
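Rolling out the planner recursively to obtain the phase plan Φ*0:N can be sketched as below; planner_policy, obs_fn and action_to_phase_fn are assumed interfaces, and using the policy mean deterministically at deployment time is our own assumption rather than a detail stated in the text.

```python
def rollout_phase_plan(initial_phase, goal, planner_policy, obs_fn, action_to_phase_fn, length):
    """Recursively evaluate the planning policy on its own output to produce
    a phase sequence Phi*_{0:N} of the requested length (assumed interfaces)."""
    plan = [initial_phase]                      # Phi*_0: the phase measured before planning
    phase = initial_phase
    for _ in range(length):
        o_P = obs_fn(phase, goal)               # observation built from the latest phase
        a_P = planner_policy(o_P)               # e.g., the mean of pi_thetaP at deployment
        phase = action_to_phase_fn(phase, a_P)  # candidate successor phase
        plan.append(phase)
    return plan                                 # consumed by the GC as its target phase sequence
```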


MDP Definition: Given a phase sequence Φ*1:N, the GC proceeds to extract the following target quantities: (a) target position for the base r*B, (b) target feet contact states c*F, and (c) target foothold positions r*F for all legs. In the case of r*F, targets are set by looking ahead into the phase plan so as to ensure that both swing and stance legs have valid references at all times. Thus, the GC computes the foothold tracking errors B rF,err at 100 Hz, while the target footholds are updated at approximately 2 Hz. We specify MDP states sC, observations oC and actions aC defined as

sC := ⟨RB, rB, vB, ωB, qj, q̇j, nF, cF⟩
oC := ⟨B rF,err, c*F, B eWz, zBF, B vB, B ωB, cF, qj, q̇j, q*j, η⟩,   aC := ⟨q*j⟩    (3)

where B eWz is the gravity-aligned z-axis of frame W expressed in coordinates of frame B, zBF is the distance between the lowest stance foot and the base along the z-axis of W, q*j is the vector of previous target joint positions, and η ∈ [0, 1] is a phase variable indicating the normalized time within a support phase. The transition dynamics of this MDP includes the generation of the phase plan using the GP, the physics of the system and the joint-space PD controller. As the PD controller is evaluated at 400 Hz and the GC at 100 Hz, we apply a zero-order hold of the joint positions output by the policy when computing joint torque commands. For this MDP, we also define the following termination conditions:
1) Tattitude: Angle between eBz and eWz exceeds 60°.
2) Tcontact: Base collides with the terrain.

We have found that these two simple, yet effective, termination conditions are those principally responsible for the balancing and recovery behaviors learned during training, and thus their importance must not be understated. Moreover, we designed a reward function that emphasizes tracking of target foothold positions and contact states, but also contains terms that inhibit extraneous and aggressive motions during locomotion. The resulting reward function is defined as

rC(sC,t, aC,t) := re + rtc + rsw + rsl + rt + rv + ra    (4)

where re and rtc are the task-specific rewards penalizing deviations from the target foothold positions and contact states, rsl penalizes foot-slip for grounded feet, rsw penalizes large velocities for swing legs, rt penalizes joint torques, rv penalizes vertical linear and roll-pitch angular velocities of the base, and ra penalizes large angles between the unit vectors eBz and eWz of the base and world frame respectively.
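The observation oC in (3) can be assembled from measured and target quantities roughly as follows. This is a minimal sketch under our own assumptions about frame conventions and term ordering; the argument names are illustrative, not the authors' API.

```python
import numpy as np

def build_gc_observation(R_B, r_F_err_B, c_F_target, z_BF, v_B_B, w_B_B,
                         c_F, q_j, dq_j, q_j_target_prev, eta):
    """Concatenate the terms of o_C in Eq. (3). R_B is the base attitude; arguments
    suffixed with _B are expressed in the base frame. The ordering is illustrative."""
    e_z_W_in_B = R_B.T @ np.array([0.0, 0.0, 1.0])   # gravity-aligned z-axis of W in base coordinates
    return np.concatenate([
        r_F_err_B.ravel(),        # foothold tracking errors for the four feet
        c_F_target,               # target contact states c*_F
        e_z_W_in_B,               # B e^W_z
        [z_BF],                   # height of the base above the lowest stance foot
        v_B_B, w_B_B,             # base linear and angular velocities in B
        c_F,                      # measured contact states
        q_j, dq_j,                # joint positions and velocities
        q_j_target_prev,          # previous joint position targets q*_j
        [eta],                    # normalized time within the current support phase
    ])
```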
Policy Definition: The GC's policy, like that of the GP, is also parameterized as a Gaussian distribution with diagonal covariance matrix πθC(a|oC,t) := N(a|μθC(oC,t), σθC). While the mean μθC(oC,t) is output by a simple NN with two fully-connected layers using tanh non-linearities, shown in Fig. 3(b), the standard deviation coefficients σθC are, just as in the case of the GP, output by an additional layer of parameters which is independent of oC,t. Due to the relatively small dimensionality of the MDP and πθC, we train the latter using Trust-Region Policy Optimization (TRPO), also employing a GAE critic [17].

IV. RESULTS

A. Experimental Setup

In order to evaluate our approach, we crafted a suite of terrain scenarios for training and testing the GC and GP agents, as depicted in Fig. 1. The first and most basic scenario consists of an infinite flat plane we refer to as Flat-World, which we use to establish a baseline for performance and behavior. Secondly, the Random-Stairs terrain presents a 20 × 20 m² square area consisting of 1 × 1 m² flat regions of randomly selected elevation. The elevation changes were generated in a way that results in an effective inclination diagonally across the map. The third terrain scenario is that which we call Temple-Ascent, a composite terrain consisting of gaps, stepping-stones, stairs as well as flat regions. We realized the MDP environment for the GP using our own implementation of CROC in C++, while for the MDP environment of the GC we used the RaiSim [18] multi-body physics engine. All DRL algorithms were implemented using the TensorFlow³ C/C++ API.⁴

³[Online]. Available: https://github.com/leggedrobotics/tensorflow-cpp
⁴For the GP, we used a PC with 2x Intel Xeon E5-2680v4 (@2.4 GHz) CPUs, 128 GB of RAM, and an Nvidia GTX Titan (Pascal), and for the GC a PC with a single Intel Core i7-8700K (@3.7 GHz) CPU, 64 GB of RAM and an Nvidia GTX 2080 Ti GPU.

B. Gait Planner

Training Setup: Training of GP policies in the terrain suite consists of a set of episodes where the robot's objective is to reach a goal position from a sufficiently distant starting location. Both starting and goal positions are selected randomly at the start of each episode. However, this procedure differs depending on the features of the terrain, as we must avoid invalid starting positions and unreachable goal positions, which would negatively impact the resulting policies during training. Once valid starting and goal positions have been sampled, the robot's initial attitude and footholds are also sampled uniformly from within respective bounds.

We thus trained two separate GP policies for Random-Stairs and Temple-Ascent respectively, using PPO with only 14 parallel workers running on the respective desktop computer over 200 k iterations, which amounts to a total of two billion samples per run. Hyper-parameter values are listed in Table IV. We did not need to train a separate GP in Flat-World, and instead used that trained in Temple-Ascent for the respective performance evaluations.

Performance Metrics: In order to assess the performance of GP policies, we define the Episodic Success Rate (ESR), which measures the number of successfully reached goal positions over a finite number of episodes. Essentially, we execute a sufficiently large number of episodes where the robot tracks a reference goal position in the world and assert if the robot has reached within a 0.5 m vicinity of the goal and within a maximum permissible episode duration.

Training Results: GP training required approximately 82 hours in each terrain scenario. Throughout our experiments we found that the randomization scheme mentioned above, used for realizing the initial state distribution of the MDP, was crucial for successfully learning to traverse all parts of the terrains. This demonstrates that if the agent does not observe all aspects of the terrain from the very beginning of training, it is often unable to generalize to unseen cases at test time.

Furthermore, we observed that, as the centroidal dynamics model employed by CROC is relatively conservative, it tends to limit the set of transitions that the policy can generate. This conservativeness is furthered by the fact that in this work, we limit the possible contact states that the GP's policy can output to only those with three and four active contacts. Such a restriction was helpful for reducing the complexity of the problem, and we intend to extend to the general case of two and single contact configurations in future work.

We tested the GP policies in their respective terrain scenarios and evaluated their performance using the ESR metric. In all cases, we have observed that fully trained policies can generate valid support phase sequences which lead the robot to the goal with an average ESR nearing 100.0%. The performance of GP policies trained and tested in the terrain suite is presented in Table III, where they have been deployed together with respective GC policies.

TABLE III. Performance of the GC on the different terrain scenarios in Temple-Ascent, and under different kinds of variations to the system. The nominal system is that with which the GC was trained, and all variations are performed only at test time. mB is the mass of the base, while lshank is the length of the shank links. ESR values are listed as percentages, and all results are presented as empirical means plus-minus the corresponding standard deviations.

Another important observation regarding the output of the GP has to do with the types of gaits it manifests. In the case of Flat-World as well as in the flat regions of Temple-Ascent, we observe that the GP tends to output mostly cyclic support phases. This indicates that the agent learns to generate cyclic gaits even though no aspect of the MDP ever directed it to do so. In certain cases, however, such as the Stepping-Stones and Gaps bridges, as well as when performing sharp point-turns, the GP outputs acyclic support phases.

Sample Complexity: One key contribution of this work is the significant reduction in sample complexity afforded by the use of transition feasibility instead of physical simulation to formulate the GP's MDP. Using the transition feasibility check, we can evaluate the MDP's transition dynamics at several thousands of steps-per-second, where each step corresponds to potentially several seconds of simulation time. Conversely, using a physics simulator typically requires several hundred or even thousands of steps to evaluate just one second of simulation time. Specifically, during training, we executed episodes with a maximum length of 50 steps, with each step corresponding to an average duration of approximately 2.6 s, which amounts to 130 s of simulation time. However, the physics simulator using a time-step of 2.5 ms would require 24 k steps to simulate the duration above. As the throughput of the transition feasibility LP and the physics simulator is, for our formulation, 1 kHz and 60 kHz respectively, we can estimate an 18-fold effective reduction in sample complexity.
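The 18-fold figure can be reproduced from the quantities above: each planner step covers about 2.6 s of simulated time at an LP throughput of roughly 1 kHz, whereas the simulator advances 2.5 ms per step at roughly 60 kHz. A quick check of that estimate:

```python
lp_rate_hz, sim_rate_hz = 1_000.0, 60_000.0       # throughput of the feasibility LP and the simulator
sim_time_per_lp_step = 2.6                        # seconds of simulated time covered by one planner step
sim_time_per_sim_step = 0.0025                    # simulator time-step in seconds

lp_speedup = lp_rate_hz * sim_time_per_lp_step    # ~2600 s of simulated time per wall-clock second
sim_speedup = sim_rate_hz * sim_time_per_sim_step # ~150 s of simulated time per wall-clock second
print(lp_speedup / sim_speedup)                   # ~17.3, i.e. roughly an 18-fold reduction
```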
C. Gait Controller

Training Setup: Training a GC agent involves collecting MDP transitions over a rich set of target footholds. In order to achieve such a distribution of training samples, we ensure that both the initial state distribution of the MDP as well as the target footholds generated by the GP are appropriately and sufficiently randomized. Initial states are generated by first uniformly sampling initial and goal positions of the base from within the bounds of the world. We then orientate the base by sampling uniformly from attitudes centered on the current orientation facing the goal, and bounded by the vector of Euler angles [0.1, 0.1, π/4]. Moreover, we randomize the initial feet positions by uniformly sampling xy coordinates from a 0.1 × 0.1 m² box defined in the base frame B and centered around nominal values that would place the feet below the shoulders. Furthermore, to randomize the target footholds, we perform a randomized fixed rollout of the GP up to the maximum permissible length of an episode. Essentially, we roll out the GP however many times necessary such that the resulting phase sequence meets or exceeds the duration of an episode, and randomize the target footholds at each step by adding a bias uniformly sampled from [−0.1, 1.0] in the xy plane while ensuring that the z coordinates are fixed to the terrain.

With the aforementioned sampling scheme, we trained a GC agent using TRPO in Flat-World using only 24 parallel workers for a total of 20 k iterations. Moreover, as part of ongoing work to extend our method to full 3D foothold tracking, we present preliminary results for GC policies for stair-climbing by first pre-training in Random-Stairs and then also on the stairs section of Temple-Ascent. Table IV presents the hyper-parameters most pertinent to the training of the GC, for all of the cases mentioned above.

TABLE IV. Policy optimization algorithm hyper-parameters for the GC using TRPO and the GP using PPO (see [16], [19] for details).

We want to emphasize that in all cases, the same hyper-parameters were used, as we only adapted the initial state distribution accordingly for each terrain. TRPO was employed using mostly the default hyper-parameters specified in [17], [19].

Performance Metrics: We define two metrics for quantifying the performance of GC policies at test-time. First, we define the Foothold Tracking Error Rate (FTER)

FTER := (1/T) Σ_{t=0}^{T} [ ( Σ_{k=1}^{4} c*F,k ‖r*F,k − rF,k‖2 ) / ( Σ_{k=1}^{4} c*F,k ) ]    (5)

that measures the mean foothold tracking error throughout an individual episode of length T, and is computed as a function of the desired contact states c*F,k, the desired foothold positions r*F,k and the measured feet positions rF,k while in contact with the terrain, for each foot and at every time-step. Secondly, we define the Foothold Tracking Score (FTS) as the ratio of successfully tracked footholds over the total generated by the GP within an episode. At the end of each support phase, we check if feet which were previously in swing phase have contacted the ground within 5 cm of the target foothold in the xy plane, and increment the FTS by one for each foot with a successful touchdown in the aforementioned region. These metrics are important with regard to the combined use of the GP and GC as they quantify how reliably a GC can execute the footholds generated by the GP. The planner has been trained to select footholds within a minimum distance of 5 cm from any changes in elevation exceeding 1 cm. As long as the controller can maintain foothold tracking within this region, the combined system is ensured to operate safely.
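Equation (5) can be computed directly from logged episode data. A minimal NumPy sketch follows; the array shapes and the guard against steps with no stance feet are our own assumptions.

```python
import numpy as np

def foothold_tracking_error_rate(c_F_target, r_F_target, r_F_measured):
    """FTER per Eq. (5). Shapes: c_F_target (T, 4) with {0,1} entries,
    r_F_target and r_F_measured (T, 4, 3)."""
    errors = np.linalg.norm(r_F_target - r_F_measured, axis=-1)   # (T, 4) per-foot position errors
    weighted = (c_F_target * errors).sum(axis=-1)                 # only feet meant to be in contact count
    n_contacts = np.maximum(c_F_target.sum(axis=-1), 1)           # guard against division by zero
    return float((weighted / n_contacts).mean())                  # average over the episode of length T

# Toy usage: a 3-step episode with perfect tracking of two stance feet gives FTER = 0.
c = np.array([[1, 1, 0, 0]] * 3, dtype=float)
r_target = np.zeros((3, 4, 3))
r_measured = np.zeros((3, 4, 3))
print(foothold_tracking_error_rate(c, r_target, r_measured))      # 0.0
```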
Training Results: GC training endured for approximately 58 hours for Flat-World and approximately 116 hours for Random-Stairs and Temple-Ascent. The discrepancy in durations is due to the increased computational cost incurred in the physics engine when evaluating contacts between the terrain mesh and the multi-body system. Training in Flat-World results in a policy that succeeds in generalizing well to planar foothold tracking, while training in Random-Stairs and Temple-Ascent extends these capabilities to 3D. However, the stair-climbing agent trained in the latter case exhibits worse MER and FTS than those trained in the former. This difference is due to the difficulties in designing sampling schemes that always initialize the robot in valid initial states, but also to the similarity of the foothold targets generated by the GP as a result of the repetitive terrain features in the suite.

We evaluated the performance of the GC policies within Temple-Ascent across five runs, each consisting of 100 episodes with a maximum length of 90 s. We also perturb the model of the robot (i.e., that with which the GC was trained) to assess the robustness of the policies. Specifically, we increased the mass of the base by 25% and varied the lengths of the shank links by ±10%. In each case, ESR, FTS, and FTER values were recorded in order to compute empirical means and standard deviations. The resulting performance measurements are presented in Table III,⁵ and Fig. 4 shows time-series plots of the policy overcoming a large 40 cm gap. The final policies deployed are demonstrated in the supplementary video.⁶

⁵Although mean performance for stairs is ≥90% in the nominal case, the variance is noticeably higher for the perturbed models, indicating that the policy is more sensitive w.r.t. model variations than that for other terrains.
⁶[Online]. Available: https://youtu.be/s1rrM1oczI4

Fig. 4. Plots of the GC policy overcoming the 40 cm gap: (a) planar pose of the base, (b) desired vs measured positions of the feet, and (c) log-scale norm of the foothold tracking errors in the X-Y plane. The vertical dotted lines denote the times at which the foothold targets change according to the phase plan, gray regions denote the time spent crossing the gap, and the horizontal line in (c) denotes the 5 cm error tolerance defined by the foothold tracking cylinder.

D. Comparison to Existing Approaches

As mentioned in Section I, most methods addressing planning and control of multi-contact motions employ optimization, sampling-based search, or a combination thereof. Although those relying solely on optimization [2]-[4] can be kinodynamic, they are not feasible online, yet those that employ sampling-based search [7], [8], or both [5], [6], [9], can be used online but remain kinostatic. The issue is that kinostatic methods are unable to fully exploit the dynamics of the system, and typically decouple foothold selection from the optimization of the base and feet motions, limiting their possible set of solutions. In addition, nearly all of the aforementioned approaches use some form of modelling [2], [6] or qualification [4], [5], [7] of the terrain. On the contrary, our approach relies on a minimal geometric representation of the terrain in the form of a height-map, performs gait-free kinodynamic planning by jointly optimizing base and swing-feet motions together with the selection of footholds, and is executable online.

Lastly, to illustrate how our method performs compared to others, we present an experiment in which the robot is to walk over a set of large gaps of different sizes. Specifically, we compare with Free-Gait, a state-of-the-art model-based framework for perceptive locomotion [4] which jointly optimizes body poses and footholds using a Sequential Quadratic Programming solver. The experiment consists of two trials, corresponding to gaps of different sizes: 30 cm and 40 cm. Fig. 5 provides snap-shots of the respective trials. Only our policies were able to traverse the 40 cm gap. Free-Gait relies on a heuristic scoring of the surrounding terrain and selects footholds using a deterministic greedy search within the vicinity of a nominal foothold. As it ensures that the CoM remains within some margin of the support polygon, it does not sacrifice static stability, even momentarily, to extend the reach of the front legs. Conversely, our planner can identify valid footholds across the gap in a single shot, and our controller then orientates the base to extend leg reach while retaining balance at all times.


Fig. 5. Snap-shots of the comparison between our policies (top row) and Free-Gait (bottom row) overcoming the 40 cm gap. The figures are numbered left-to-right to indicate the order of the frames.

V. CONCLUSION & DISCUSSION

This work proposes a new approach for training a two-layer hierarchy of NN policies which realizes terrain-aware locomotion for quadruped robots. We decompose locomotion into two parts trained independently using model-free DRL, and which interface via a carefully designed parameterization of quadrupedal gaits. As physical simulation incurs a high computational cost for use with DRL, our method reduces sample-complexity for training the gait planner by using a transition feasibility criterion realized as a convex LP to formulate the transition dynamics of an MDP. Within a certain context, our formulation can resemble bilevel optimization schemes used in nonlinear programming. The upper-level optimization performed by model-free DRL optimizes the selection of footholds and instantaneous CoM motion for the coincident terrain. The lower-level optimization performed by the model-based LP optimizes phase transitions, thus resulting in optimal CoM trajectories between the support phases proposed by the higher level. Although in this work we only used a trivial cost for the LP, we could instead formulate a cost that penalizes contact forces and CoM accelerations, and include it as an additional reward term in the MDP. Such a coupling of the LP and MDP optimization objectives would allow us to reason about both feasibility and optimality of the resulting gait plan. Furthermore, as the planner learns to generate relatively conservative foothold sequences due to the restrained centroidal dynamics model, the burden of realizing fully dynamic and optimal motions for the base and swing-feet is placed on the gait controller. Lastly, we have demonstrated the efficacy of this approach by successfully training and testing in a suite of challenging terrain scenarios. The resulting policies not only prove to be effective in the suite of rigid non-flat terrains but also manage to generalize well to previously unseen cases.

REFERENCES

[1] M. Hutter et al., "ANYmal - a highly mobile and dynamic quadrupedal robot," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2016, pp. 38-44.
[2] S. Kuindersma et al., "Optimization-based locomotion planning, estimation, and control design for the Atlas humanoid robot," Auton. Robots, vol. 40, pp. 429-455, 2016.
[3] A. W. Winkler, C. D. Bellicoso, M. Hutter, and J. Buchli, "Gait and trajectory optimization for legged systems through phase-based end-effector parameterization," IEEE Robot. Autom. Lett., vol. 3, no. 3, pp. 1560-1567, Jul. 2018.
[4] P. Fankhauser, M. Bjelonic, C. D. Bellicoso, T. Miki, and M. Hutter, "Robust rough-terrain locomotion with a quadrupedal robot," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 1-8.
[5] D. Belter, P. Łabęcki, and P. Skrzypczyński, "Adaptive motion planning for autonomous rough terrain traversal with a walking robot," J. Field Robot., vol. 33, pp. 337-370, 2016.
[6] M. Kalakrishnan, J. Buchli, P. Pastor, M. Mistry, and S. Schaal, "Learning, planning, and control for quadruped locomotion over challenging terrain," Int. J. Robot. Res., vol. 30, pp. 236-258, 2011.
[7] S. Tonneau, N. Mansard, C. Park, D. Manocha, F. Multon, and J. Pettré, "A reachability-based planner for sequences of acyclic contacts in cluttered environments," in Robotics Research. Berlin, Germany: Springer, 2018, pp. 287-303.
[8] S. Tonneau, A. Del Prete, J. Pettré, C. Park, D. Manocha, and N. Mansard, "An efficient acyclic contact planner for multiped robots," IEEE Trans. Robot., vol. 34, no. 3, pp. 586-601, Jun. 2018.
[9] R. J. Griffin, G. Wiedebach, S. McCrory, S. Bertrand, I. Lee, and J. Pratt, "Footstep planning for autonomous walking over rough terrain," 2019, arXiv:1907.08673.
[10] O. A. V. Magaña et al., "Fast and continuous foothold adaptation for dynamic locomotion through CNNs," IEEE Robot. Autom. Lett., vol. 4, no. 2, pp. 2140-2147, Apr. 2019.
[11] T. Klamt and S. Behnke, "Towards learning abstract representations for locomotion planning in high-dimensional state spaces," in Proc. Int. Conf. Robot. Autom., 2019, pp. 922-928.
[12] N. Heess et al., "Emergence of locomotion behaviours in rich environments," 2017, arXiv:1707.02286.
[13] X. B. Peng, G. Berseth, K. Yin, and M. Van De Panne, "DeepLoco: Dynamic locomotion skills using hierarchical deep reinforcement learning," ACM Trans. Graph., vol. 36, pp. 41:1-41:13, 2017.
[14] P. Fernbach, S. Tonneau, O. Stasse, J. Carpentier, and M. Taïx, "C-CROC: Continuous and convex resolution of centroidal dynamic trajectories for legged robots in multi-contact scenarios," IEEE Trans. Robot., 2020.
[15] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 2nd ed. Cambridge, MA, USA: MIT Press, 2018.
[16] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," 2017, arXiv:1707.06347.
[17] J. Schulman, P. Moritz, S. Levine, M. Jordan, and P. Abbeel, "High-dimensional continuous control using generalized advantage estimation," 2015, arXiv:1506.02438.
[18] J. Hwangbo, J. Lee, and M. Hutter, "Per-contact iteration method for solving contact dynamics," IEEE Robot. Autom. Lett., vol. 3, no. 2, pp. 895-902, Apr. 2018.
[19] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," in Proc. Int. Conf. Mach. Learn., 2015, pp. 1889-1897.
