AMP: Adversarial Motion Priors For Stylized Physics-Based Character Control
XUE BIN PENG∗ , University of California, Berkeley, USA
ZE MA∗ , Shanghai Jiao Tong University, China
PIETER ABBEEL, University of California, Berkeley, USA
SERGEY LEVINE, University of California, Berkeley, USA
ANGJOO KANAZAWA, University of California, Berkeley, USA
Fig. 1. Our framework enables physically simulated characters to solve challenging tasks while adopting stylistic behaviors specified by unstructured motion data. Left: A character learns to traverse an obstacle course using a variety of locomotion skills. Right: A character learns to walk to and punch a target.
Synthesizing graceful and life-like behaviors for physically simulated characters has been a fundamental challenge in computer animation. Data-driven methods that leverage motion tracking are a prominent class of techniques for producing high fidelity motions for a wide range of behaviors. However, the effectiveness of these tracking-based methods often hinges on carefully designed objective functions, and when applied to large and diverse motion datasets, these methods require significant additional machinery to select the appropriate motion for the character to track in a given scenario. In this work, we propose to obviate the need to manually design imitation objectives and mechanisms for motion selection by utilizing a fully automated approach based on adversarial imitation learning. High-level task objectives that the character should perform can be specified by relatively simple reward functions, while the low-level style of the character's behaviors can be specified by a dataset of unstructured motion clips, without any explicit clip selection or sequencing. For example, a character traversing an obstacle course might utilize a task-reward that only considers forward progress, while the dataset contains clips of relevant behaviors such as running, jumping, and rolling. These motion clips are used to train an adversarial motion prior, which specifies style-rewards for training the character through reinforcement learning (RL). The adversarial RL procedure automatically selects which motion to perform, dynamically interpolating and generalizing from the dataset. Our system produces high-quality motions that are comparable to those achieved by state-of-the-art tracking-based techniques, while also being able to easily accommodate large datasets of unstructured motion clips. Composition of disparate skills emerges automatically from the motion prior, without requiring a high-level motion planner or other task-specific annotations of the motion clips. We demonstrate the effectiveness of our framework on a diverse cast of complex simulated characters and a challenging suite of motor control tasks.

CCS Concepts: • Computing methodologies → Procedural animation; Adversarial learning; Control methods.

Additional Key Words and Phrases: Wireless sensor networks, media access control, multi-channel, radio interference, time synchronization

ACM Reference Format:
Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa. 2021. AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control. ACM Trans. Graph. 40, 4, Article 144 (August 2021), 20 pages. https://doi.org/10.1145/3450626.3459670

∗Joint first authors.

Authors' addresses: Xue Bin Peng, University of California, Berkeley, 2121 Berkeley Way, Berkeley, CA, 94704, USA, [email protected]; Ze Ma, Shanghai Jiao Tong University, 800 Dongchuan Rd, Shanghai, 200240, China, [email protected]; Pieter Abbeel, University of California, Berkeley, 2121 Berkeley Way, Berkeley, CA, 94704, USA, [email protected]; Sergey Levine, University of California, Berkeley, 2121 Berkeley Way, Berkeley, CA, 94704, USA, [email protected]; Angjoo Kanazawa, University of California, Berkeley, 2121 Berkeley Way, Berkeley, CA, 94704, USA, [email protected].

1 INTRODUCTION

Synthesizing natural and life-like motions for virtual characters is a crucial element for breathing life into immersive experiences, such as films and games. The demand for realistic motions becomes even more apparent for VR applications, where users are provided with rich modalities through which to interact with virtual agents. Developing control strategies that are able to replicate the properties of naturalistic behaviors is also of interest for robotic systems, as natural motions implicitly encode important properties, such as safety and energy efficiency, which are vital for effective operation of robots in the real world. While examples of natural motions are commonplace, identifying the underlying characteristics that constitute these behaviors is nonetheless challenging, and more difficult still to replicate in a controller.

So what are the characteristics that constitute natural and life-like behaviors? Devising quantitative metrics of the naturalness of motions has been a fundamental challenge for optimization-based
character animation techniques [Al Borno et al. 2013; Wampler et al. 2014; Wang et al. 2009]. Heuristics such as symmetry, stability, and effort minimization can improve the realism of motions produced by physically simulated characters [Grochow et al. 2004; Mordatch et al. 2012, 2013; Yu et al. 2018]. But these strategies may not be broadly applicable to all behaviors of interest. Effective applications of these heuristics often require careful balancing of the various objectives, a tuning process that may need to be repeated for each task. Data-driven methods are able to mitigate some of these challenges by leveraging motion clips recorded from real-life actors to guide the behaviors of simulated characters [Da Silva et al. 2008; Liu et al. 2010; Muico et al. 2009; Sok et al. 2007]. A common instantiation of this approach is to utilize a tracking objective that encourages a character to follow particular reference trajectories relevant for a given task. These tracking-based methods can produce high-quality motions for a large repertoire of skills. But extending these techniques to effectively leverage large unstructured motion datasets remains challenging, since a suitable motion clip needs to be selected for the character to track at each time step. This selection process is typically performed by a motion planner, which generates reference trajectories for solving a particular task [Bergamin et al. 2019; Park et al. 2019; Peng et al. 2017]. However, constructing an effective motion planner can itself be a challenging endeavour, and entails significant overhead to annotate and organize the motion clips for a desired task. For many applications, it is not imperative to exactly track a particular reference motion. Since a dataset typically provides only a limited collection of example motions, a character will inevitably need to deviate from the reference motions in order to effectively perform a given task. Therefore, the intent is often not for the character to closely track a particular motion, but to adopt general behavioral characteristics depicted in the dataset. We refer to these behavioral characteristics as a style.

In this work, we aim to develop a system where users can specify high-level task objectives for a character to perform, while the low-level style of a character's movements can be controlled through examples provided in the form of unstructured motion clips. To control the style of a character's motions, we propose adversarial motion priors (AMP), a method for imitating behaviors from raw motion clips without requiring any task-specific annotations or organization of the dataset. Given a set of reference motions that constitutes a desired motion style, the motion prior is modeled as an adversarial discriminator, trained to differentiate behaviors depicted in the dataset from those produced by the character. The motion prior therefore acts as a general measure of similarity between the motions produced by a character and the motions in the dataset. By incorporating the motion prior in a goal-conditioned reinforcement learning framework, our system is able to train physically simulated characters to perform challenging tasks with natural and life-like behaviors. Composition of diverse behaviors emerges automatically from the motion prior, without the need for a motion planner or other mechanism for selecting which clip to imitate.

The central contribution of this work is an adversarial learning approach for physics-based character animation that combines goal-conditioned reinforcement learning with an adversarial motion prior, which enables the style of a character's movements to be controlled via example motion clips, while the task is specified through a simple reward function. We present one of the first adversarial learning systems that is able to produce high-quality full-body motions for physically simulated characters. By combining the motion prior with additional task objectives, our system provides a convenient interface through which users can specify high-level directions for controlling a character's behaviors. These task objectives allow our characters to acquire more complex skills than those demonstrated in the original motion clips. While our system is built on well-known adversarial imitation learning techniques, we propose a number of important design decisions that lead to substantially higher quality results than those achieved by prior work, enabling our characters to learn highly dynamic and diverse motor skills from unstructured motion data.

2 RELATED WORK

Developing systems that can synthesize natural motions for virtual characters is one of the fundamental challenges of computer animation. These procedural animation techniques can be broadly categorized as kinematic methods and physics-based methods. Kinematic methods generally do not explicitly utilize the equations of motion for motion synthesis. Instead, these methods often leverage datasets of motion clips to generate motions for a character [Lee et al. 2002, 2010b]. Given a motion dataset, controllers can be constructed to select an appropriate motion clip to play back for a particular scenario [Agrawal and van de Panne 2016; Safonova and Hodgins 2007; Treuille et al. 2007]. Data-driven methods using generative models, such as Gaussian processes [Levine et al. 2012; Ye and Liu 2010] and neural networks [Holden et al. 2017; Ling et al. 2020; Zhang et al. 2018], have also been applied to synthesize motions online. When provided with sufficiently large and high-quality datasets, kinematic methods are able to produce realistic motions for a large variety of sophisticated skills [Agrawal and van de Panne 2016; Lee et al. 2018, 2010b; Levine et al. 2011; Starke et al. 2019]. However, their ability to synthesize motions for novel situations can be limited by the availability of data. For complex tasks and environments, it can be difficult to collect a sufficient amount of data to cover all possible behaviors that a character may need to perform. This is particularly challenging for nonhuman and fictional creatures, where motion data can be scarce. In this work, we combine data-driven techniques with physics-based animation methods to develop characters that produce realistic and responsive behaviors in novel tasks and environments.

Physics-Based Methods: Physics-based methods address some of the limitations of kinematic methods by synthesizing motions from first principles. These methods typically leverage a physics simulation, or more general knowledge of the equations of motion, to generate motions for a character [Raibert and Hodgins 1991; Wampler et al. 2014]. Optimization techniques, such as trajectory optimization and reinforcement learning, play a pivotal role in many physics-based methods, where controllers that drive a character's motions are produced by optimizing an objective function [Mordatch et al. 2012; Tan et al. 2014; van de Panne et al. 1994]. While these methods are able to synthesize physically plausible motions for novel scenarios, even in the absence of motion data, designing effective objectives that lead to natural behaviors can be exceptionally
difficult. Heuristics derived from prior knowledge of the characteristics of natural motions are commonly included in the objective function, such as symmetry, stability, effort minimization, and many more [Mordatch et al. 2012; Wang et al. 2009; Yu et al. 2018]. Simulating more biologically accurate actuators can also improve motion quality [Geijtenbeek et al. 2013; Jiang et al. 2019; Wang et al. 2012], but may nonetheless yield unnatural behaviors.

Imitation Learning: The challenges of designing objective functions that lead to natural motions have spurred the adoption of data-driven physics-based animation techniques [Da Silva et al. 2008; Kwon and Hodgins 2017; Lee et al. 2010a; Sharon and van de Panne 2005; Zordan and Hodgins 2002], which utilize reference motion data to improve motion quality. Reference motions are typically incorporated through an imitation objective that encourages a character to imitate motions in the dataset. The imitation objective is commonly implemented as a tracking objective, which attempts to minimize the pose error between the simulated character and target poses from a reference motion [Lee et al. 2010a; Liu et al. 2016, 2010; Peng et al. 2018a; Sok et al. 2007]. Since the pose error is generally computed with respect to a single target pose at a time, some care is required to select an appropriate target pose from the dataset. A simple strategy is to synchronize the simulated character with a given reference motion using a phase variable [Lee et al. 2019; Peng et al. 2018a,b], which is provided as an additional input to the controller. The target pose at each time step can then be conveniently determined by selecting the target pose according to the phase. This strategy has been effective for imitating individual motion clips, but it can be difficult to scale to datasets containing multiple disparate motions, as it may not be possible to synchronize and align multiple reference motions according to a single phase variable. Recent methods have extended these tracking-based techniques to larger motion datasets by explicitly providing target poses from the reference motion that is being tracked as inputs to the controller [Bergamin et al. 2019; Chentanez et al. 2018; Park et al. 2019; Won et al. 2020]. This then allows a controller to imitate different motions depending on the input target poses. However, selecting the appropriate motion for a character to imitate in a given scenario can still entail significant algorithmic overhead. These methods often require a high-level motion planner that selects which motion clip the character should imitate for a given task [Bergamin et al. 2019; Park et al. 2019; Peng et al. 2017]. The character's performance on a particular task can therefore be constrained by the performance of the motion planner.

Another major limitation of tracking-based imitation techniques is the need for a pose error metric when computing the tracking objective [Liu et al. 2010; Peng et al. 2018a; Sharon and van de Panne 2005]. These error metrics are often manually designed, and it can be challenging to construct and tune a common metric that is effective across all skills that a character is to imitate. Adversarial imitation learning provides an appealing alternative [Abbeel and Ng 2004; Ho and Ermon 2016; Ziebart et al. 2008], where instead of using a manually designed imitation objective, these algorithms train an adversarial discriminator to differentiate between behaviors generated by an agent and behaviors depicted in the demonstration data (e.g. reference motions). The discriminator then serves as the objective function for training a control policy to imitate the demonstrations. While these methods have shown promising results for motion imitation tasks [Merel et al. 2017; Wang et al. 2017], adversarial learning algorithms can be notoriously unstable, and the resulting motion quality still falls well behind what has been achieved with state-of-the-art tracking-based techniques. Peng et al. [2019b] were able to produce substantially more realistic motions by regularizing the discriminator with an information bottleneck. However, their method still requires a phase variable to synchronize the policy and discriminator with the reference motion. Therefore, their results are limited to imitating a single motion per policy, and thus not suitable for learning from large diverse motion datasets. In this work, we propose an adversarial method for learning general motion priors from large unstructured datasets that contain diverse motion clips. Our approach does not necessitate any synchronization between the policy and reference motion. Furthermore, our approach does not require a motion planner, or any task-specific annotation and segmentation of the motion clips [Bergamin et al. 2019; Park et al. 2019; Peng et al. 2017]. Instead, composition of multiple motions in furtherance of a task objective emerges automatically through the motion prior. We also present a number of design decisions for stabilizing the adversarial training process, leading to consistent and high-quality results.

Latent Space Models: Latent space models can also act as a form of motion prior that leads to more life-like behaviors. These models specify controls through a learned latent representation, which is then mapped to controls for the underlying system [Burgard et al. 2008; Florensa et al. 2017; Hausman et al. 2018; Heess et al. 2016]. The latent representation is typically learned through a pre-training phase using supervised learning or reinforcement learning techniques to encode a diverse range of behaviors into a latent representation. Once trained, this latent representation can be used to build a control hierarchy, where the latent space model acts as a low-level controller, and a separate high-level controller is trained to specify controls via the latent space [Florensa et al. 2017; Haarnoja et al. 2018; Lynch et al. 2020]. For motion control of simulated characters, the latent representation can be trained to encode behaviors from reference motion clips, which then constrains the behavior of a character to be similar to those observed in the motion data, therefore leading to more natural behaviors for downstream tasks [Merel et al. 2019; Peng et al. 2019a]. However, since the realism of the character's motions is enforced implicitly through the latent representation, rather than explicitly through an objective function, it is still possible for the high-level controller to specify latent encodings that produce unnatural behaviors [Merel et al. 2020; Peng et al. 2019a]. Luo et al. [2020] proposed an adversarial domain confusion loss to prevent the high-level controller from specifying encodings that are different from those observed during pre-training. However, since this adversarial objective is applied in the latent space, rather than on the actual motions produced by the character, the model is nonetheless prone to generating unnatural behaviors. Our proposed motion prior directly enforces similarity between the motions produced by the character and those in the reference motion dataset, which enables our method to produce higher fidelity motions than what has been demonstrated by latent space models. Our motion
prior also does not require a separate pre-training phase, and instead,
can be trained jointly with the policy.
3 OVERVIEW
Given a dataset of reference motions and a task objective defined by a
reward function, our system synthesizes a control policy that enables
a character to achieve the task objective in a physically simulated
environment, while utilizing behaviors that resemble the motions
in the dataset. Crucially, the character’s behaviors need not exactly
match any specific motion in the dataset, instead its movements
need only to adopt more general characteristics exhibited by the
corpus of reference motions. These reference motions collectively
provide an example-based definition of a behavioral style, and by
providing the system with different motion datasets, the character
can then be trained to perform a task in a variety of distinct styles.

Figure 2 provides a schematic overview of the system. The motion dataset M consists of a collection of reference motions, where each motion m^i = {q̂_t^i} is represented as a sequence of poses q̂_t^i. The motion clips may be collected from mocap of real-life actors or from artist-authored keyframe animations. Unlike previous frameworks, our system can be applied directly on raw motion data, without requiring task-specific annotations or segmentation of a clip into individual skills. The motion of the simulated character is controlled by a policy π(a_t | s_t, g) that maps the state of the character s_t and a given goal g to a distribution over actions a_t. The actions from the policy specify target positions for proportional-derivative (PD) controllers positioned at each of the character's joints, which in turn produce control forces that drive the motion of the character. The goal g specifies a task reward function r_t^G = r^G(s_t, a_t, s_{t+1}, g), which defines high-level objectives for the character to satisfy (e.g. walking in a target direction or punching a target). The style objective r_t^S = r^S(s_t, s_{t+1}) is specified by an adversarial discriminator, trained to differentiate between motions depicted in the dataset and motions produced by the character. The style objective therefore acts as a task-agnostic motion prior that provides an a priori estimate of the naturalness or style of a given motion, independent of a specific task. The style objective then encourages the policy to produce motions that resemble behaviors depicted in the dataset.

Fig. 2. Schematic overview of the system. Given a motion dataset defining a desired motion style for the character, the system trains a motion prior that specifies style-rewards r_t^S for the policy during training. These style-rewards are combined with task-rewards r_t^G and used to train a policy that enables a simulated character to satisfy task-specific goals g, while also adopting behaviors that resemble the reference motions in the dataset.

4 BACKGROUND

Our system combines techniques from goal-conditioned reinforcement learning and generative adversarial imitation learning to train control policies that enable simulated characters to perform challenging tasks in a desired behavioral style. In this section, we provide a brief review of these techniques.

4.1 Goal-Conditioned Reinforcement Learning

Our characters are trained through a goal-conditioned reinforcement learning framework, where an agent interacts with an environment according to a policy π in order to fulfill a given goal g ∈ G sampled according to a goal distribution g ∼ p(g). At each time step t, the agent observes the state s_t ∈ S of the system, then samples an action a_t ∈ A from a policy a_t ∼ π(a_t | s_t, g). The agent then applies that action, which results in a new state s_{t+1}, as well as a scalar reward r_t = r(s_t, a_t, s_{t+1}, g). The agent's objective is to learn a policy that maximizes its expected discounted return J(π),

\[ J(\pi) = \mathbb{E}_{p(\mathbf{g})} \, \mathbb{E}_{p(\tau \mid \pi, \mathbf{g})} \left[ \sum_{t=0}^{T-1} \gamma^t r_t \right], \tag{1} \]

where p(τ | π, g) = p(s_0) ∏_{t=0}^{T-1} p(s_{t+1} | s_t, a_t) π(a_t | s_t, g) represents the likelihood of a trajectory τ = {(s_t, a_t, r_t)_{t=0}^{T-1}, s_T} under a policy π for a goal g. p(s_0) is the initial state distribution, and p(s_{t+1} | s_t, a_t) represents the dynamics of the environment. T denotes the time horizon of a trajectory, and γ ∈ [0, 1) is a discount factor.
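As a concrete illustration of Equation 1, the objective can be estimated with Monte Carlo rollouts of the policy on goals drawn from p(g). The following is a generic sketch of that computation, not code from the paper; the trajectory dictionary layout is an assumption made for illustration.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Inner sum of Equation 1: sum_t gamma^t * r_t for one trajectory."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def estimate_objective(sampled_trajectories, gamma=0.99):
    """Monte Carlo estimate of J(pi): average discounted return over rollouts."""
    returns = [discounted_return(traj["rewards"], gamma) for traj in sampled_trajectories]
    return float(np.mean(returns))
```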
4.2 Generative Adversarial Imitation Learning

Generative adversarial imitation learning (GAIL) [Ho and Ermon 2016] adapts techniques developed for generative adversarial networks (GAN) [Goodfellow et al. 2014] to the domain of imitation learning. In the interest of brevity, we exclude the goal g from the notation, but the following discussion readily generalizes to goal-conditioned settings. Given a dataset of demonstrations M = {(s_i, a_i)}, containing states s_i and actions a_i recorded from an unknown demonstration policy, the objective is to train a policy π(a|s) that imitates the behaviors of the demonstrator. Behavioral cloning can be used to directly fit a policy to map from states observed in M to their corresponding actions using supervised learning [Bojarski et al. 2016; Pomerleau 1988]. However, if only a small number of demonstrations are available, then behavioral cloning techniques are prone to drift [Ross et al. 2011]. Furthermore, behavioral cloning is not directly applicable in settings where the demonstration actions are not observable (e.g. reference motion data).

GAIL addresses some of the limitations of behavioral cloning by learning an objective function that measures the similarity between the policy and the demonstrations, and then updating π via reinforcement learning to optimize the learned objective. The objective is modeled as a discriminator D(s, a), trained to predict whether a given state s and action a is sampled from the demonstrations M
extracts a set of features relevant for determining the characteristics of a given motion. The resulting features are then used as inputs to the discriminator D(Φ(s), Φ(s′)). The set of features includes:

• Linear velocity and angular velocity of the root, represented in the character's local coordinate frame.
• Local rotation of each joint.
• Local velocity of each joint.
• 3D positions of the end-effectors (e.g. hands and feet), represented in the character's local coordinate frame.

The root is designated to be the character's pelvis. The character's local coordinate frame is defined with the origin located at the root, the x-axis oriented along the root link's facing direction, and the y-axis aligned with the global up vector. The 3D rotation of each spherical joint is encoded using two 3D vectors corresponding to the normal and tangent in the coordinate frame. This rotation encoding provides a smooth and unique representation of a given rotation. This set of observation features for the discriminator is selected to provide a compact representation of the motion across a single state transition. The observations also do not include any task-specific features, thus enabling the motion prior to be trained without requiring task-specific annotation of the reference motions, and allowing motion priors trained with the same dataset to be used for different tasks.
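To make the construction of these discriminator observations concrete, the following is a minimal sketch of a feature-extraction routine in the spirit of Φ(s). The hypothetical CharacterState container, its field names, and the to_local helper are assumptions for illustration; they are not part of the original system.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class CharacterState:
    # Hypothetical container; field names are illustrative assumptions.
    root_lin_vel: np.ndarray      # (3,) world-frame linear velocity of the root (pelvis)
    root_ang_vel: np.ndarray      # (3,) world-frame angular velocity of the root
    joint_rotations: np.ndarray   # (J, 6) per-joint rotations in a normal-tangent encoding
    joint_velocities: np.ndarray  # (J, 3) per-joint local velocities
    end_effector_pos: np.ndarray  # (E, 3) world-frame positions of hands and feet
    root_pos: np.ndarray          # (3,) world-frame root position
    heading_rot: np.ndarray       # (3, 3) rotation mapping world vectors into the
                                  # heading-aligned local frame of the character

def to_local(heading_rot: np.ndarray, v: np.ndarray) -> np.ndarray:
    """Express a world-frame vector in the character's local coordinate frame."""
    return heading_rot @ v

def discriminator_observation(s: CharacterState) -> np.ndarray:
    """Assemble a feature vector in the spirit of Phi(s) described above."""
    feats = [
        to_local(s.heading_rot, s.root_lin_vel),
        to_local(s.heading_rot, s.root_ang_vel),
        s.joint_rotations.reshape(-1),
        s.joint_velocities.reshape(-1),
        # End-effector positions expressed relative to the root, in the local frame.
        (s.end_effector_pos - s.root_pos) @ s.heading_rot.T,
    ]
    return np.concatenate([np.asarray(f).reshape(-1) for f in feats])
```

The discriminator input for a state transition would then be the concatenation of discriminator_observation(s) and discriminator_observation(s′).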
5.4 Gradient Penalty

The interplay between the discriminator and generator in a GAN often results in unstable training dynamics. One source of instability is due to function approximation errors in the discriminator, where the discriminator may assign nonzero gradients on the manifold of real data samples [Mescheder et al. 2018]. These nonzero gradients can cause the generator to overshoot and move off the data manifold, instead of converging to the manifold, leading to oscillations and instability during training. To mitigate this phenomenon, a gradient penalty can be applied to penalize nonzero gradients on samples from the dataset [Gulrajani et al. 2017; Kodali et al. 2017; Mescheder et al. 2018]. We incorporate this technique to improve training stability. The discriminator objective is then given by:

\[ \arg\min_{D} \; \mathbb{E}_{d_{\mathcal{M}}(\mathbf{s},\mathbf{s}')}\!\left[ \left( D(\Phi(\mathbf{s}), \Phi(\mathbf{s}')) - 1 \right)^2 \right] + \mathbb{E}_{d_{\pi}(\mathbf{s},\mathbf{s}')}\!\left[ \left( D(\Phi(\mathbf{s}), \Phi(\mathbf{s}')) + 1 \right)^2 \right] + \frac{w^{\mathrm{gp}}}{2} \, \mathbb{E}_{d_{\mathcal{M}}(\mathbf{s},\mathbf{s}')}\!\left[ \left\| \nabla_{\phi} D(\phi) \big|_{\phi = (\Phi(\mathbf{s}), \Phi(\mathbf{s}'))} \right\|^2 \right], \tag{8} \]

where w^gp is a manually specified coefficient. Note that the gradient penalty is calculated with respect to the observation features φ = (Φ(s), Φ(s′)), not the full set of state features (s, s′). As we show in our experiments, the gradient penalty is crucial for stable training and effective performance.
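As an illustration, a minimal PyTorch-style sketch of the least-squares discriminator loss with the gradient penalty of Equation 8 is shown below. The network disc, the batch tensors, and the coefficient value are assumptions for illustration; they are not taken from the authors' implementation.

```python
import torch

def discriminator_loss(disc, obs_real, obs_fake, w_gp=10.0):
    """Least-squares discriminator loss with a gradient penalty on real samples.

    disc:     a torch.nn.Module mapping (batch, feat) -> (batch, 1) scores
    obs_real: features (Phi(s), Phi(s')) of transitions from the motion dataset
    obs_fake: features of transitions generated by the policy
    w_gp:     gradient-penalty coefficient (the value here is an arbitrary example)
    """
    # Real transitions are pushed toward a score of +1, policy transitions toward -1.
    loss_real = ((disc(obs_real) - 1.0) ** 2).mean()
    loss_fake = ((disc(obs_fake) + 1.0) ** 2).mean()

    # Gradient penalty: penalize nonzero gradients of D on the real-data manifold.
    obs_gp = obs_real.detach().requires_grad_(True)
    scores = disc(obs_gp)
    grad = torch.autograd.grad(outputs=scores.sum(), inputs=obs_gp,
                               create_graph=True)[0]
    gp = grad.pow(2).sum(dim=-1).mean()

    return loss_real + loss_fake + 0.5 * w_gp * gp
```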
ALGORITHM 1: Training with AMP
 1: input M: dataset of reference motions
 2: D ← initialize discriminator
 3: π ← initialize policy
 4: V ← initialize value function
 5: B ← ∅ initialize replay buffer
 6: while not done do
 7:   for trajectory i = 1, ..., m do
 8:     τ^i ← {(s_t, a_t, r_t^G)_{t=0}^{T-1}, s_T, g} collect trajectory with π
 9:     for time step t = 0, ..., T − 1 do
10:       d_t ← D(Φ(s_t), Φ(s_{t+1}))
11:       r_t^S ← calculate style reward according to Equation 7 using d_t
12:       r_t ← w^G r_t^G + w^S r_t^S
13:       record r_t in τ^i
14:     end for
15:     store τ^i in B
16:   end for
17:   for update step = 1, ..., n do
18:     b^M ← sample batch of K transitions {(s_j, s_j′)}_{j=1}^{K} from M
19:     b^π ← sample batch of K transitions {(s_j, s_j′)}_{j=1}^{K} from B
20:     update D according to Equation 8 using b^M and b^π
21:   end for
22:   update V and π using data from trajectories {τ^i}_{i=1}^{m}
23: end while

6 MODEL REPRESENTATION

Given a high-level task objective and a dataset of reference motions, the agent is responsible for learning a control policy that fulfills the task objectives, while utilizing behaviors that resemble the motions depicted in the dataset. In this section, we detail the design of various components of the learning framework.

6.1 States and Actions

The state s_t consists of a set of features that describes the configuration of the character's body. The features are similar to those used by Peng et al. [2018a], which include the relative positions of each link with respect to the root, the rotation of each link as represented using the 6D normal-tangent encoding, along with the link's linear and angular velocities. All features are recorded in the character's local coordinate system. Unlike previous systems, which synchronize the policy with a particular reference motion by including additional phase information in the state, such as scalar phase variables [Lee et al. 2019; Peng et al. 2018a] or target poses [Bergamin et al. 2019; Chentanez et al. 2018; Won et al. 2020], our policies are not trained to explicitly imitate any specific motion from the dataset. Therefore, no such synchronization or phase information is necessary.

Each action a_t specifies target positions for PD controllers positioned at each of the character's joints. For spherical joints, each target is specified in the form of a 3D exponential map q ∈ R^3 [Grassia 1998], where the rotation axis v and rotation angle θ can be determined according to:

\[ \mathbf{v} = \frac{\mathbf{q}}{\left\| \mathbf{q} \right\|_2}, \qquad \theta = \left\| \mathbf{q} \right\|_2 . \tag{9} \]

This representation provides a more compact parameterization than the 4D axis-angle or quaternion representations used in prior systems [Peng et al. 2018a; Won et al. 2020], while also avoiding gimbal lock from parameterizations such as Euler angles. Target rotations for revolute joints are specified as 1D rotation angles q = θ.
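For clarity, the conversion of Equation 9 from an exponential-map target to an axis-angle pair can be written in a few lines; this is a generic utility sketch rather than code from the system, and the epsilon guard for near-zero rotations is an added assumption.

```python
import numpy as np

def exp_map_to_axis_angle(q: np.ndarray, eps: float = 1e-8):
    """Convert a 3D exponential-map vector q into (axis v, angle theta), as in Eq. (9)."""
    theta = float(np.linalg.norm(q))
    if theta < eps:
        # Near-zero rotation: the axis is arbitrary; return a default direction.
        return np.array([1.0, 0.0, 0.0]), 0.0
    return q / theta, theta
```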
6.2 Network Architecture

Each policy π is modeled by a neural network that maps a given state s_t and goal g to a Gaussian distribution over actions π(a_t | s_t, g) = N(μ(s_t, g), Σ), with an input-dependent mean μ(s_t, g) and a fixed diagonal covariance matrix Σ. The mean is specified by a fully-connected network with two hidden layers, consisting of 1024 and 512 ReLU units [Nair and Hinton 2010], followed by a linear output layer. The values of the covariance matrix Σ = diag(σ_1, σ_2, ...) are manually specified and kept fixed over the course of training. The value function V(s_t, g) and discriminator D(s_t, s_{t+1}) are modeled by separate networks with a similar architecture as the policy.
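The following PyTorch sketch illustrates this kind of Gaussian policy with a fixed diagonal covariance. The layer sizes follow the text, but the initial standard deviation value and the class interface are illustrative assumptions, not the paper's settings.

```python
import math
import torch
import torch.nn as nn

class GaussianPolicy(nn.Module):
    """pi(a | s, g) = N(mu(s, g), Sigma), with a fixed diagonal covariance."""

    def __init__(self, obs_dim, goal_dim, act_dim, init_std=0.05):
        super().__init__()
        self.mean_net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 512), nn.ReLU(),
            nn.Linear(512, act_dim),  # linear output layer for the action mean
        )
        # Fixed, non-trainable standard deviations (one per action dimension).
        self.register_buffer("log_std", torch.full((act_dim,), math.log(init_std)))

    def forward(self, obs, goal):
        mu = self.mean_net(torch.cat([obs, goal], dim=-1))
        return torch.distributions.Normal(mu, self.log_std.exp())
```

Sampling an action and its log-probability then amounts to dist = policy(obs, goal); a = dist.sample(); logp = dist.log_prob(a).sum(-1).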
6.3 Training

Our policies are trained using a combination of GAIL [Ho and Ermon 2016] and proximal policy optimization (PPO) [Schulman et al. 2017]. Algorithm 1 provides an overview of the training process. At each time step t, the agent receives a task-reward r_t^G = r^G(s_t, a_t, s_{t+1}, g) from the environment; it then queries the motion prior for a style-reward r_t^S = r^S(s_t, s_{t+1}), computed according to Equation 7. The two rewards are combined according to Equation 4 to yield the reward for the particular time step. Following the approach proposed by Peng et al. [2018a], we incorporate reference state initialization and early termination. Reference state initialization is applied by initializing the character to states sampled randomly from all motion clips in the dataset. Early termination is triggered on most tasks when any part of the character's body, with the exception of the feet, makes contact with the ground. This termination criterion is disabled for more contact-rich tasks, such as rolling or getting up after a fall.

Once a batch of data has been collected with the policy, the recorded trajectories are used to update the policy and value function. The value function is updated with target values computed using TD(λ) [Sutton and Barto 1998]. The policy is updated using advantages computed using GAE(λ) [Schulman et al. 2015]. Each trajectory recorded from the policy is also stored in a replay buffer B, containing trajectories from past training iterations. The discriminator is updated according to Equation 8 using minibatches of transitions (s, s′) sampled from the reference motion dataset M and transitions from the replay buffer B. The replay buffer helps to stabilize training by preventing the discriminator from overfitting to the most recent batch of trajectories from the policy.
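A compact sketch of the per-step reward combination (line 12 of Algorithm 1) and a standard GAE(λ) advantage computation for the policy update is given below. The weight values are placeholders rather than the paper's settings, and early-termination bookkeeping is omitted for brevity.

```python
import numpy as np

def combined_reward(r_task, r_style, w_task=0.5, w_style=0.5):
    # r_t = w^G * r_t^G + w^S * r_t^S; the weights here are example values only.
    return w_task * r_task + w_style * r_style

def gae_advantages(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.

    rewards:    combined rewards r_0 ... r_{T-1}
    values:     value-function predictions V(s_0) ... V(s_{T-1})
    last_value: V(s_T), the bootstrap value for the final state
    """
    values = np.append(np.asarray(values, dtype=float), last_value)
    advantages = np.zeros(len(rewards))
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    # TD(lambda)-style value targets can be recovered as advantages + values[:-1].
    return advantages, advantages + values[:-1]
```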
7 TASKS

To evaluate AMP's effectiveness for controlling the style of a character's motions, we apply our framework to train complex 3D simulated characters to perform various motion control tasks using different motion styles. The characters include a 34 DoF humanoid, a 59 DoF T-Rex, and a 64 DoF dog. A summary of each task is provided below. Please refer to Appendix A for a more in-depth description of each task and their respective reward functions.

Target Heading: In this task, the character's objective is to move along a target heading direction d^* at a target speed v^*. The goal for the policy is specified as g_t = (d̃_t^*, v^*), with d̃_t^* being the target direction in the character's local coordinate frame. The target speed is selected randomly between v^* ∈ [1, 5] m/s. For slower moving styles, such as Zombie and Stealthy, the target speed is fixed at 1 m/s.

Target Location: In this task, the character's objective is to move to a target location x^*. The goal g_t = x̃_t^* records the target location in the character's local coordinate frame.

Dribbling: To evaluate our system on more complex object manipulation tasks, we train policies for a dribbling task, where the character's objective is to dribble a soccer ball to a target location. The goal g_t = x̃_t^* records the relative position of the target location with respect to the character. The state s_t is augmented with additional features that describe the state of the ball, including the position x̃_t^ball, orientation q̃_t^ball, linear velocity ẋ̃_t^ball, and angular velocity q̇̃_t^ball of the ball in the character's local coordinate frame.

Strike: To demonstrate AMP's ability to compose diverse behaviors, we consider a task where the character's objective is to strike a target using a designated end-effector (e.g. hands). The target may be located at various distances from the character. Therefore, the character must first move close to the target before striking it. These distinct phases entail different optimal behaviors, and thus require the policy to compose and transition between the appropriate skills. The goal g_t = (x̃_t^*, h_t) records the location of the target x̃_t^* in the character's local coordinate frame, along with an indicator variable h_t that specifies if the target has already been hit.

Obstacles: Finally, we consider tasks that involve visual perception and interaction with more complex environments, where the character's objective is to traverse an obstacle-filled terrain while maintaining a target speed. Policies are trained for two types of environments: 1) an environment containing a combination of obstacles, including gaps, steps, and overhead obstructions that the character must duck under; and 2) an environment containing narrow stepping stones that requires more precise contact planning. Examples of the environments are shown in Figures 1 and 3. In order for the policy to perceive the upcoming obstacles, the state is augmented with a 1D height-field of the upcoming terrain.
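The task reward functions themselves are defined in Appendix A, which is not part of this excerpt. Purely as an illustration of how such a task-reward can be kept simple, a velocity-tracking reward for the Target Heading task could take a form like the following; the exponential shape, the scale, and the overshoot clamp are assumptions, not the paper's definition.

```python
import numpy as np

def target_heading_reward(root_vel_xy, target_dir_xy, target_speed, scale=0.25):
    """Hypothetical task reward: match the velocity component along the target heading.

    root_vel_xy:   (2,) planar velocity of the character's root
    target_dir_xy: (2,) unit vector of the desired heading d*
    target_speed:  desired speed v*
    """
    speed_along_heading = float(np.dot(root_vel_xy, target_dir_xy))
    # No penalty for moving faster than the target speed (an assumed design choice).
    err = target_speed - min(speed_along_heading, target_speed)
    return float(np.exp(-scale * err ** 2))
```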
8 RESULTS

We evaluate our framework's effectiveness on a suite of challenging motion control tasks with complex simulated characters. First, we demonstrate that our approach can readily scale to large unstructured datasets containing diverse motion clips, which then enables our characters to perform challenging tasks in a natural and life-like manner by imitating behaviors from the dataset. The characters automatically learn to compose and generalize different skills from the motion data in order to fulfill high-level task objectives, without requiring mechanisms for explicit motion selection. We then evaluate AMP on a single-clip imitation task, and show that our method is able to closely imitate a diverse corpus of dynamic and acrobatic skills, producing motions that are nearly indistinguishable from reference motions recorded from human actors. Behaviors learned by the characters can be viewed in the supplementary video.
Fig. 3. The motion prior can be trained with large datasets of diverse motions, enabling simulated characters to perform complex tasks by composing a wider range of skills. Each environment is denoted by "Character: Task (Dataset)". Panels visible in this excerpt: (a) Humanoid: Target Location (Locomotion); (b) Humanoid: Target Location (Zombie); (g) Humanoid: Stepping Stones (Cartwheel); (h) Humanoid: Stepping Stones (Jump).
Table 1. Performance statistics of combining AMP with additional task objectives. Performance is recorded as the average normalized task return, with 0 being the minimum possible return per episode and 1 being the maximum possible return. The return is averaged across 3 models initialized with different random seeds, with 32 episodes recorded per model. The motion prior can be trained with different datasets to produce policies that adopt distinct stylistic behaviors when performing a particular task.

Character | Task | Dataset | Task Return
Humanoid | Target Heading | Locomotion | 0.90 ± 0.01
Humanoid | Target Heading | Walk | 0.46 ± 0.01
Humanoid | Target Heading | Run | 0.63 ± 0.01
Humanoid | Target Heading | Stealthy | 0.89 ± 0.02
Humanoid | Target Heading | Zombie | 0.94 ± 0.00
Humanoid | Target Location | Locomotion | 0.63 ± 0.01
Humanoid | Target Location | Zombie | 0.50 ± 0.00
Humanoid | Obstacles | Run + Leap + Roll | 0.27 ± 0.10
Humanoid | Stepping Stones | Cartwheel | 0.43 ± 0.03
Humanoid | Stepping Stones | Jump | 0.56 ± 0.12
Humanoid | Dribble | Locomotion | 0.78 ± 0.05
Humanoid | Dribble | Zombie | 0.60 ± 0.04
Humanoid | Strike | Walk + Punch | 0.73 ± 0.02
T-Rex | Target Location | Locomotion | 0.36 ± 0.03

Table 2. Summary statistics of the different datasets used to train the motion priors. We record the total length of motion clips in each dataset, along with the number of clips, and the number of subjects (e.g. human actors) that the clips were recorded from.

Character | Dataset | Size (s) | Clips | Subjects
Humanoid | Cartwheel | 13.6 | 3 | 1
Humanoid | Jump | 28.6 | 10 | 4
Humanoid | Locomotion | 434.1 | 56 | 8
Humanoid | Run | 204.4 | 47 | 3
Humanoid | Run + Leap + Roll | 22.1 | 10 | 7
Humanoid | Stealthy | 136.5 | 3 | 1
Humanoid | Walk | 229.6 | 9 | 5
Humanoid | Walk + Punch | 247.8 | 15 | 9
Humanoid | Zombie | 18.3 | 1 | 1
T-Rex | Locomotion | 10.5 | 5 | 1
Fig. 7. AMP can be used to train complex non-humanoid characters, such as a 59 DoF T-Rex and a 64 DoF dog. By providing the motion prior with different
reference motion clips, the characters can be trained to perform various locomotion gaits, such as trotting and cantering.
downstream tasks more quickly. However, the pre-training stage used to construct the low-level controller can itself be sample intensive. In our experiments, the low-level controllers are trained using 300 million samples before being transferred to downstream tasks. With AMP, no such pre-training is necessary, and the motion prior can be trained jointly with the policy.

8.4 Single-Clip Imitation

Although our goal is to train characters with large motion datasets, to evaluate the effectiveness of our framework for imitating behaviors from motion clips, we consider a single-clip imitation task. In this setting, the character's objective is to imitate a single motion clip at a time, without additional task objectives. Therefore, the policy is trained solely to maximize the style-reward r_t^S from the motion prior. Unlike previous motion tracking methods, our approach does not require a manually designed tracking objective or a phase-based synchronization of the reference motion and the policy [Peng et al. 2018a]. Table 3 summarizes the performance of policies trained using AMP to imitate a diverse corpus of motions. Figures 6 and 7 illustrate examples of motions learned by the characters. Performance is evaluated using the average pose error, where the pose error e_t^pose at each time step t is computed between the pose of the simulated character and the reference motion using the relative positions of each joint with respect to the root (in units of meters),

\[ e_t^{\mathrm{pose}} = \frac{1}{N_{\mathrm{joint}}} \sum_{j \in \mathrm{joints}} \left\| \left( \mathbf{x}_t^j - \mathbf{x}_t^{\mathrm{root}} \right) - \left( \hat{\mathbf{x}}_t^j - \hat{\mathbf{x}}_t^{\mathrm{root}} \right) \right\|_2 . \tag{10} \]

x_t^j and x̂_t^j denote the 3D Cartesian position of joint j from the simulated character and the reference motion, and N_joint is the total number of joints in the character's body. This method of evaluating motion similarity has previously been reported to better conform to human perception of motion similarity [Harada et al. 2004; Tang et al. 2008]. Since AMP does not use a phase variable to synchronize the policy with the reference motion, the motions may progress at different rates, resulting in de-synchronization that can lead to large pose errors even when the overall motions are similar. To better evaluate the similarity of the motions, we first apply dynamic time warping (DTW) to align the reference motion with the motion of the simulated character [Sakoe and Chiba 1978], before computing the pose error between the two aligned motions. DTW is applied using Equation 10 as the cost function.

AMP is able to closely imitate a large variety of highly dynamic skills, while also avoiding many of the visual artifacts exhibited by prior adversarial motion imitation systems [Merel et al. 2017; Wang et al. 2017]. We compare the performance of our system to results produced by the motion tracking approach from Peng et al. [2018a], which uses a manually designed reward function and requires synchronization of the policy with a reference motion via a phase variable. Figure 8 compares the learning curves of the different methods. Since the tracking-based policies are synchronized with their respective reference motions, they are generally able to learn faster and achieve lower errors than policies trained with AMP. Nonetheless, our method is able to produce results of comparable quality without the need to manually design or tune reward functions for different motions. However, for some motions, such as the Front-Flip, AMP is prone to converging to locally optimal behaviors, where instead of performing a flip, the character learns to simply shuffle forwards in order to avoid falling. Tracking-based methods can mitigate these local optima by terminating an episode early if the character's pose deviates too far from the reference motion [Peng et al. 2018a; Won et al. 2020]. However, this strategy is not directly applicable to AMP, since the policy is not synchronized with the reference motion. But as shown in the previous sections, this lack of synchronization is precisely what allows AMP to easily leverage large datasets of diverse motion clips to solve more complex tasks.
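To make the evaluation metric concrete, here is a small self-contained sketch of the per-frame pose error of Equation 10 and a standard dynamic-time-warping alignment that uses it as the cost. This is a generic re-implementation for illustration, not the authors' evaluation code, and the root-joint index convention is an assumption.

```python
import numpy as np

def pose_error(joint_pos, ref_joint_pos, root=0):
    """Equation 10: mean root-relative joint position error between two poses.

    joint_pos, ref_joint_pos: (J, 3) joint positions; index `root` is the root joint.
    """
    rel = joint_pos - joint_pos[root]
    rel_ref = ref_joint_pos - ref_joint_pos[root]
    return float(np.mean(np.linalg.norm(rel - rel_ref, axis=-1)))

def dtw_aligned_error(sim_frames, ref_frames):
    """Average pose error after dynamic time warping, using Eq. 10 as the cost."""
    n, m = len(sim_frames), len(ref_frames)
    cost = np.array([[pose_error(s, r) for r in ref_frames] for s in sim_frames])
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack along the optimal warping path and report the mean matched cost.
    i, j, path_cost, steps = n, m, 0.0, 0
    while i > 0 and j > 0:
        path_cost += cost[i - 1, j - 1]
        steps += 1
        _, i, j = min([(acc[i - 1, j - 1], i - 1, j - 1),
                       (acc[i - 1, j], i - 1, j),
                       (acc[i, j - 1], i, j - 1)])
    return path_cost / max(steps, 1)
```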
can be conveniently incorporated into downstream tasks, without requiring retraining for each new task. While the motion prior does not require direct access to task-specific information, the data used to train the motion prior is generated by policies trained to perform a particular task. This may introduce some task dependencies into the motion prior, which can hinder its ability to be transferred to other tasks. Training motion priors using data generated from larger and more diverse repertoires of tasks may help to facilitate transferring the learned motion priors to new tasks. Our experiments also focus primarily on tasks that involve temporal composition of different skills, which require the character to perform different behaviors at different points in time. However, spatial composition might also be vital for some tasks that require a character to perform multiple skills simultaneously. Developing motion priors that are more amenable to spatial composition of disparate skills may lead to more flexible and sophisticated behaviors. Despite these limitations, we hope this work provides a useful tool that enables physically simulated characters to take advantage of the large motion datasets that have been so effective for kinematic animation techniques, and opens exciting directions for future exploration in data-driven physics-based character animation.

ACKNOWLEDGMENTS

We thank Sony Interactive Entertainment for providing reference motion data for this project, Bonny Ho for narrating the video, the anonymous reviewers for their helpful feedback, and AWS for providing computational resources. This research was funded by an NSERC Postgraduate Scholarship and a Berkeley Fellowship for Graduate Study.

REFERENCES

Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from http://tensorflow.org/.
Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship Learning via Inverse Reinforcement Learning. In Proceedings of the Twenty-First International Conference on Machine Learning (ICML '04). 1.
Shailen Agrawal and Michiel van de Panne. 2016. Task-based Locomotion. ACM Transactions on Graphics (Proc. SIGGRAPH 2016) 35, 4 (2016).
M. Al Borno, M. de Lasa, and A. Hertzmann. 2013. Trajectory Optimization for Full-Body Movements with Complex Contacts. IEEE Transactions on Visualization and Computer Graphics 19, 8 (2013), 1405–1414.
Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein Generative Adversarial Networks. In Proceedings of Machine Learning Research, Vol. 70. PMLR, 214–223.
Kevin Bergamin, Simon Clavet, Daniel Holden, and James Richard Forbes. 2019. DReCon: Data-Driven Responsive Control of Physics-Based Characters. ACM Trans. Graph. 38, 6, Article 206 (Nov. 2019), 11 pages.
David Berthelot, Tom Schumm, and Luke Metz. 2017. BEGAN: Boundary Equilibrium Generative Adversarial Networks. CoRR abs/1703.10717 (2017). arXiv:1703.10717
Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin Zhang, Jake Zhao, and Karol Zieba. 2016. End to End Learning for Self-Driving Cars. CoRR abs/1604.07316 (2016). arXiv:1604.07316
W. Burgard, O. Brock, and C. Stachniss. 2008. Learning Omnidirectional Path Following Using Dimensionality Reduction. 257–264.
Nuttapong Chentanez, Matthias Müller, Miles Macklin, Viktor Makoviychuk, and Stefan Jeschke. 2018. Physics-Based Motion Capture Imitation with Deep Reinforcement Learning. In Proceedings of the 11th Annual International Conference on Motion, Interaction, and Games (MIG '18). Article 1, 10 pages.
CMU. [n.d.]. CMU Graphics Lab Motion Capture Database. http://mocap.cs.cmu.edu/.
Erwin Coumans et al. 2013. Bullet physics library. Open source: bulletphysics.org 15, 49 (2013), 5.
M. Da Silva, Y. Abe, and J. Popovic. 2008. Simulation of Human Motion Data using Short-Horizon Model-Predictive Control. Computer Graphics Forum (2008).
Carlos Florensa, Yan Duan, and Pieter Abbeel. 2017. Stochastic Neural Networks for Hierarchical Reinforcement Learning. In Proceedings of the International Conference on Learning Representations (ICLR).
Thomas Geijtenbeek, Michiel van de Panne, and A. Frank van der Stappen. 2013. Flexible Muscle-Based Locomotion for Bipedal Creatures. ACM Transactions on Graphics 32, 6 (2013).
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In Advances in Neural Information Processing Systems 27. 2672–2680.
F. Sebastin Grassia. 1998. Practical Parameterization of Rotations Using the Exponential Map. J. Graph. Tools 3, 3 (March 1998), 29–48.
Keith Grochow, Steven L. Martin, Aaron Hertzmann, and Zoran Popović. 2004. Style-Based Inverse Kinematics. ACM Trans. Graph. 23, 3 (Aug. 2004), 522–531.
Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. 2017. Improved Training of Wasserstein GANs. In Advances in Neural Information Processing Systems 30. 5767–5777.
Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. 2018. Latent Space Policies for Hierarchical Reinforcement Learning. In Proceedings of Machine Learning Research, Vol. 80. PMLR, 1851–1860.
T. Harada, S. Taoka, T. Mori, and T. Sato. 2004. Quantitative evaluation method for pose and motion similarity based on human perception. In 4th IEEE/RAS International Conference on Humanoid Robots, Vol. 1. 494–512.
Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin Riedmiller. 2018. Learning an Embedding Space for Transferable Robot Skills. In International Conference on Learning Representations.
Nicolas Heess, Gregory Wayne, Yuval Tassa, Timothy P. Lillicrap, Martin A. Riedmiller, and David Silver. 2016. Learning and Transfer of Modulated Locomotor Controllers. CoRR abs/1610.05182 (2016). arXiv:1610.05182
Jonathan Ho and Stefano Ermon. 2016. Generative Adversarial Imitation Learning. In Advances in Neural Information Processing Systems 29. 4565–4573.
Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-Functioned Neural Networks for Character Control. ACM Trans. Graph. 36, 4, Article 42 (July 2017), 13 pages.
Yifeng Jiang, Tom Van Wouwe, Friedl De Groote, and C. Karen Liu. 2019. Synthesis of Biologically Realistic Human Motion Using Joint Torque Actuation. ACM Trans. Graph. 38, 4, Article 72 (July 2019), 12 pages.
Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. 2018. End-to-end Recovery of Human Shape and Pose. In Computer Vision and Pattern Recognition (CVPR).
Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive Growing of GANs for Improved Quality, Stability, and Variation. CoRR abs/1710.10196 (2017). arXiv:1710.10196
Liyiming Ke, Matt Barnes, Wen Sun, Gilwoo Lee, Sanjiban Choudhury, and Siddhartha S. Srinivasa. 2019. Imitation Learning as f-Divergence Minimization. CoRR abs/1905.12888 (2019). arXiv:1905.12888
Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations (ICLR 2014).
Naveen Kodali, Jacob D. Abernethy, James Hays, and Zsolt Kira. 2017. How to Train Your DRAGAN. CoRR abs/1705.07215 (2017). arXiv:1705.07215 http://arxiv.org/abs/1705.07215
Taesoo Kwon and Jessica K. Hodgins. 2017. Momentum-Mapped Inverted Pendulum Models for Controlling Dynamic Human Motions. ACM Trans. Graph. 36, 4, Article 145d (Jan. 2017), 14 pages. https://doi.org/10.1145/3072959.2983616
Jehee Lee, Jinxiang Chai, Paul S. A. Reitsma, Jessica K. Hodgins, and Nancy S. Pollard. 2002. Interactive Control of Avatars Animated with Human Motion Data. ACM Trans. Graph. 21, 3 (July 2002), 491–500. https://doi.org/10.1145/566654.566607
Kyungho Lee, Seyoung Lee, and Jehee Lee. 2018. Interactive Character Animation by Learning Multi-Objective Control. ACM Trans. Graph. 37, 6, Article 180 (Dec. 2018), 10 pages. https://doi.org/10.1145/3272127.3275071
Seunghwan Lee, Moonseok Park, Kyoungmin Lee, and Jehee Lee. 2019. Scalable Muscle-Actuated Human Simulation and Control. ACM Trans. Graph. 38, 4, Article 73 (July 2019), 13 pages. https://doi.org/10.1145/3306346.3322972
Yoonsang Lee, Sungeun Kim, and Jehee Lee. 2010a. Data-Driven Biped Control. ACM Trans. Graph. 29, 4, Article 129 (July 2010), 8 pages. https://doi.org/10.1145/1778765.1781155
Yongjoon Lee, Kevin Wampler, Gilbert Bernstein, Jovan Popović, and Zoran Popović. 2010b. Motion Fields for Interactive Character Locomotion. ACM Trans. Graph. 29, 6, Article 138 (Dec. 2010), 8 pages. https://doi.org/10.1145/1882261.1866160
Sergey Levine, Yongjoon Lee, Vladlen Koltun, and Zoran Popović. 2011. Space-Time Planning with Parameterized Locomotion Controllers. ACM Trans. Graph. 30, 3, Article 23 (May 2011), 11 pages. https://doi.org/10.1145/1966394.1966402
Sergey Levine, Jack M. Wang, Alexis Haraux, Zoran Popović, and Vladlen Koltun. 2012. Continuous Character Control with Low-Dimensional Embeddings. ACM Transactions on Graphics 31, 4 (2012), 28.
Hung Yu Ling, Fabio Zinno, George Cheng, and Michiel Van De Panne. 2020. Character Controllers Using Motion VAEs. ACM Trans. Graph. 39, 4, Article 40 (July 2020), 12 pages. https://doi.org/10.1145/3386569.3392422
Libin Liu, Michiel van de Panne, and KangKang Yin. 2016. Guided Learning of Control Graphs for Physics-Based Characters. ACM Transactions on Graphics 35, 3 (2016).
Libin Liu, KangKang Yin, Michiel van de Panne, and Baining Guo. 2012. Terrain runner: control, parameterization, composition, and planning for highly dynamic motions. ACM Transactions on Graphics (TOG) 31, 6 (2012), 154.
Libin Liu, KangKang Yin, Michiel van de Panne, Tianjia Shao, and Weiwei Xu. 2010. Sampling-based contact-rich motion control. ACM Trans. Graph. 29, 4, Article 128 (July 2010), 10 pages. https://doi.org/10.1145/1778765.1778865
Ying-Sheng Luo, Jonathan Hans Soeseno, Trista Pei-Chun Chen, and Wei-Chao Chen. 2020. CARL: Controllable Agent with Reinforcement Learning for Quadruped Locomotion. ACM Trans. Graph. 39, 4, Article 38 (July 2020), 10 pages. https://doi.org/10.1145/3386569.3392433
Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey Levine, and Pierre Sermanet. 2020. Learning Latent Plans from Play. In Proceedings of the Conference on Robot Learning (Proceedings of Machine Learning Research, Vol. 100), Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura (Eds.). PMLR, 1113–1132. http://proceedings.mlr.press/v100/lynch20a.html
X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. 2017. Least Squares Generative Adversarial Networks. In 2017 IEEE International Conference on Computer Vision (ICCV). 2813–2821. https://doi.org/10.1109/ICCV.2017.304
Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Wayne, Yee Whye Teh, and Nicolas Heess. 2019. Neural Probabilistic Motor Primitives for Humanoid Control. In International Conference on Learning Representations. https://openreview.net/forum?id=BJl6TjRcY7
Josh Merel, Yuval Tassa, Dhruva TB, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Greg Wayne, and Nicolas Heess. 2017. Learning human behaviors from motion capture by adversarial imitation. CoRR abs/1707.02201 (2017). arXiv:1707.02201 http://arxiv.org/abs/1707.02201
Josh Merel, Saran Tunyasuvunakool, Arun Ahuja, Yuval Tassa, Leonard Hasenclever, Vu Pham, Tom Erez, Greg Wayne, and Nicolas Heess. 2020. Catch and Carry: Reusable Neural Controllers for Vision-Guided Whole-Body Tasks. ACM Trans. Graph. 39, 4, Article 39 (July 2020), 14 pages. https://doi.org/10.1145/3386569.3392474
Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. 2018. Which Training Methods for GANs do actually Converge?. In Proceedings of the 35th International Conference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, Stockholmsmässan, Stockholm Sweden, 3481–3490. http://proceedings.mlr.press/v80/mescheder18a.html
Igor Mordatch, Emanuel Todorov, and Zoran Popović. 2012. Discovery of Complex Behaviors through Contact-Invariant Optimization. ACM Trans. Graph. 31, 4, Article 43 (July 2012), 8 pages. https://doi.org/10.1145/2185520.2185539
Igor Mordatch, Jack M. Wang, Emanuel Todorov, and Vladlen Koltun. 2013. Animating Human Lower Limbs Using Contact-Invariant Optimization. ACM Trans. Graph. 32, 6, Article 203 (Nov. 2013), 8 pages. https://doi.org/10.1145/2508363.2508365
Uldarico Muico, Yongjoon Lee, Jovan Popović, and Zoran Popović. 2009. Contact-Aware Nonlinear Control of Dynamic Characters. In ACM SIGGRAPH 2009 Papers (New Orleans, Louisiana) (SIGGRAPH ’09). Association for Computing Machinery, New York, NY, USA, Article 81, 9 pages. https://doi.org/10.1145/1576246.1531387
Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In Proceedings of the 27th International Conference on International Conference on Machine Learning (Haifa, Israel) (ICML’10). Omnipress, Madison, WI, USA, 807–814.
Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training Generative Neural Samplers using Variational Divergence Minimization. In Advances in Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon, and R. Garnett (Eds.), Vol. 29. Curran Associates, Inc., 271–279. https://proceedings.neurips.cc/paper/2016/file/cedebb6e872f539bef8c3f919874e9d7-Paper.pdf
Soohwan Park, Hoseok Ryu, Seyoung Lee, Sunmin Lee, and Jehee Lee. 2019. Learning Predict-and-Simulate Policies from Unorganized Human Motion Data. ACM Trans. Graph. 38, 6, Article 205 (Nov. 2019), 11 pages. https://doi.org/10.1145/3355089.3356501
Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018a. DeepMimic: Example-guided Deep Reinforcement Learning of Physics-based Character Skills. ACM Trans. Graph. 37, 4, Article 143 (July 2018), 14 pages. https://doi.org/10.1145/3197517.3201311
Xue Bin Peng, Glen Berseth, and Michiel van de Panne. 2016. Terrain-adaptive Locomotion Skills Using Deep Reinforcement Learning. ACM Trans. Graph. 35, 4, Article 81 (July 2016), 12 pages. https://doi.org/10.1145/2897824.2925881
Xue Bin Peng, Glen Berseth, Kangkang Yin, and Michiel Van De Panne. 2017. DeepLoco: Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning. ACM Trans. Graph. 36, 4, Article 41 (July 2017), 13 pages. https://doi.org/10.1145/3072959.3073602
Xue Bin Peng, Michael Chang, Grace Zhang, Pieter Abbeel, and Sergey Levine. 2019a. MCP: Learning Composable Hierarchical Control with Multiplicative Compositional Policies. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 3681–3692. http://papers.nips.cc/paper/8626-mcp-learning-composable-hierarchical-control-with-multiplicative-compositional-policies.pdf
Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine. 2018b. SFV: Reinforcement Learning of Physical Skills from Videos. ACM Trans. Graph. 37, 6, Article 178 (Nov. 2018), 14 pages.
Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. 2019b. Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and GANs by Constraining Information Flow. In International Conference on Learning Representations. https://openreview.net/forum?id=HyxPx3R9tm
Dean A. Pomerleau. 1988. ALVINN: An Autonomous Land Vehicle in a Neural Network. In Proceedings of the 1st International Conference on Neural Information Processing Systems (NIPS’88). MIT Press, Cambridge, MA, USA, 305–313.
Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks. CoRR abs/1511.06434 (2015). arXiv:1511.06434 http://arxiv.org/abs/1511.06434
Marc H. Raibert and Jessica K. Hodgins. 1991. Animation of Dynamic Legged Locomotion. In Proceedings of the 18th Annual Conference on Computer Graphics and Interactive Techniques (SIGGRAPH ’91). Association for Computing Machinery, New York, NY, USA, 349–358. https://doi.org/10.1145/122718.122755
Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning (Proceedings of Machine Learning Research, Vol. 15), Geoffrey Gordon, David Dunson, and Miroslav Dudík (Eds.). JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL, USA, 627–635. http://proceedings.mlr.press/v15/ross11a.html
Alla Safonova and Jessica K. Hodgins. 2007. Construction and Optimal Search of Interpolated Motion Graphs. ACM Trans. Graph. 26, 3 (July 2007), 106–es. https://doi.org/10.1145/1276377.1276510
H. Sakoe and S. Chiba. 1978. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 1 (1978), 43–49. https://doi.org/10.1109/TASSP.1978.1163055
Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. 2016. Improved Techniques for Training GANs. CoRR abs/1606.03498 (2016). arXiv:1606.03498 http://arxiv.org/abs/1606.03498
John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. 2015. High-Dimensional Continuous Control Using Generalized Advantage Estimation. CoRR abs/1506.02438 (2015). arXiv:1506.02438
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017). arXiv:1707.06347 http://arxiv.org/abs/1707.06347
SFU. [n.d.]. SFU Motion Capture Database. http://mocap.cs.sfu.ca/.
Dana Sharon and Michiel van de Panne. 2005. Synthesis of Controllers for Stylized Planar Bipedal Walking. In Proc. of IEEE International Conference on Robotics and Automation.
Kwang Won Sok, Manmyung Kim, and Jehee Lee. 2007. Simulating Biped Behaviors from Human Motion Data. ACM Trans. Graph. 26, 3 (July 2007), 107–es. https://doi.org/10.1145/1276377.1276511
Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural State Machine for Character-Scene Interactions. ACM Trans. Graph. 38, 6, Article 209 (Nov. 2019), 14 pages. https://doi.org/10.1145/3355089.3356505
Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning (1st ed.). MIT Press, Cambridge, MA, USA.
Jie Tan, Yuting Gu, C. Karen Liu, and Greg Turk. 2014. Learning Bicycle Stunts. ACM Trans. Graph. 33, 4, Article 50 (July 2014), 12 pages. https://doi.org/10.1145/2601097.2601121
Jeff Tang, Howard Leung, Taku Komura, and Hubert Shum. 2008. Emulating human perception of motion similarity. Computer Animation and Virtual Worlds 19 (08 2008), 211–221. https://doi.org/10.1002/cav.260
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas, David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lillicrap, and Martin A. Riedmiller. 2018. DeepMind Control Suite. CoRR abs/1801.00690 (2018). arXiv:1801.00690 http://arxiv.org/abs/1801.00690
Faraz Torabi, Garrett Warnell, and Peter Stone. 2018. Generative Adversarial Imitation from Observation. CoRR abs/1807.06158 (2018). arXiv:1807.06158 http://arxiv.org/abs/1807.06158
Adrien Treuille, Yongjoon Lee, and Zoran Popović. 2007. Near-Optimal Character Animation with Continuous Control. In ACM SIGGRAPH 2007 Papers (San Diego, California) (SIGGRAPH ’07). Association for Computing Machinery, New York, NY, USA, 7–es. https://doi.org/10.1145/1275808.1276386
Michiel van de Panne, Ryan Kim, and Eugene Fiume. 1994. Virtual Wind-up Toys for Animation. In Proceedings of Graphics Interface ’94. 208–215.
Kevin Wampler, Zoran Popović, and Jovan Popović. 2014. Generalizing Locomotion Style to New Animals with Inverse Optimal Regression. ACM Trans. Graph. 33, 4, Article 49 (July 2014), 11 pages. https://doi.org/10.1145/2601097.2601192
Jack M. Wang, David J. Fleet, and Aaron Hertzmann. 2009. Optimizing Walking Controllers. In ACM SIGGRAPH Asia 2009 Papers (Yokohama, Japan) (SIGGRAPH Asia ’09). Association for Computing Machinery, New York, NY, USA, Article 168, 8 pages. https://doi.org/10.1145/1661412.1618514
Jack M. Wang, Samuel R. Hamner, Scott L. Delp, and Vladlen Koltun. 2012. Optimizing Locomotion Controllers Using Biologically-Based Actuators and Objectives. ACM Trans. Graph. 31, 4, Article 25 (July 2012), 11 pages. https://doi.org/10.1145/2185520.2185521
Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Freitas, Gregory Wayne, and Nicolas Heess. 2017. Robust Imitation of Diverse Behaviors. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Curran Associates, Inc., 5320–5329. https://proceedings.neurips.cc/paper/2017/file/044a23cadb567653eb51d4eb40acaa88-Paper.pdf
Jungdam Won, Deepak Gopinath, and Jessica Hodgins. 2020. A Scalable Approach to Control Diverse Behaviors for Physically Simulated Characters. ACM Trans. Graph. 39, 4, Article 33 (July 2020), 12 pages. https://doi.org/10.1145/3386569.3392381
Yuting Ye and C. Karen Liu. 2010. Synthesis of Responsive Motion Using a Dynamic Model. Computer Graphics Forum (2010). https://doi.org/10.1111/j.1467-8659.2009.01625.x
Wenhao Yu, Greg Turk, and C. Karen Liu. 2018. Learning Symmetric and Low-Energy Locomotion. ACM Trans. Graph. 37, 4, Article 144 (July 2018), 12 pages. https://doi.org/10.1145/3197517.3201397
He Zhang, Sebastian Starke, Taku Komura, and Jun Saito. 2018. Mode-Adaptive Neural Networks for Quadruped Motion Control. ACM Trans. Graph. 37, 4, Article 145 (July 2018), 11 pages. https://doi.org/10.1145/3197517.3201366
Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum Entropy Inverse Reinforcement Learning. In Proceedings of the 23rd National Conference on Artificial Intelligence - Volume 3 (Chicago, Illinois) (AAAI’08). AAAI Press, 1433–1438.
Victor Brian Zordan and Jessica K. Hodgins. 2002. Motion Capture-Driven Simulations That Hit and React. In Proceedings of the 2002 ACM SIGGRAPH/Eurographics Symposium on Computer Animation (San Antonio, Texas) (SCA ’02). Association for Computing Machinery, New York, NY, USA, 89–96. https://doi.org/10.1145/545261.545276
APPENDIX

A TASKS
In this section, we provide a detailed description of each task and the task reward functions used during training.

Target Heading: In this task, the objective for the character is to move along a target heading direction d∗ at a target speed 𝑣∗. The goal input for the policy is specified as g𝑡 = (d̃𝑡∗, 𝑣∗), with d̃𝑡∗ being the target direction in the character's local coordinate frame. The task-reward is calculated according to:

    r_t^G = \exp\left(-0.25 \left(v^* - \mathbf{d}^* \cdot \dot{\mathbf{x}}_t^{\text{com}}\right)^2\right),    (11)

where ẋ𝑡com is the center-of-mass velocity of the character at time step 𝑡, and the target speed is selected randomly between 𝑣∗ ∈ [1, 5] m/s. For slower moving styles, such as Zombie and Stealthy, the target speed is fixed at 1 m/s.

Target Location: In this task, the character's objective is to move to a target location x∗. The goal g𝑡 = x̃𝑡∗ records the target location in the character's local coordinate frame. The task-reward is given by:

    r_t^G = 0.7 \exp\left(-0.5\,\|\mathbf{x}^* - \mathbf{x}_t^{\text{root}}\|^2\right) + 0.3 \exp\left(-\left(\max\left(0,\; v^* - \mathbf{d}_t^* \cdot \dot{\mathbf{x}}_t^{\text{com}}\right)\right)^2\right).    (12)

Here, 𝑣∗ = 1 m/s specifies a minimum target speed at which the character should move towards the target, and the character will not be penalized for moving faster than this threshold. d𝑡∗ is a unit vector on the horizontal plane that points from the character's root to the target.

Dribbling: To evaluate our system on more complex object manipulation tasks, we train policies for a dribbling task, where the objective is for the character to dribble a soccer ball to a target location. The reward function is given by:

    r_t^G = 0.1\, r_t^{\text{cv}} + 0.1\, r_t^{\text{cp}} + 0.3\, r_t^{\text{bv}} + 0.5\, r_t^{\text{bp}}    (13)
    r_t^{\text{cv}} = \exp\left(-1.5 \left(\max\left(0,\; v^* - \mathbf{d}_t^{\text{ball}} \cdot \dot{\mathbf{x}}_t^{\text{com}}\right)\right)^2\right)    (14)
    r_t^{\text{cp}} = \exp\left(-0.5\,\|\mathbf{x}_t^{\text{ball}} - \mathbf{x}_t^{\text{com}}\|^2\right)    (15)
    r_t^{\text{bv}} = \exp\left(-\left(\max\left(0,\; v^* - \mathbf{d}_t^* \cdot \dot{\mathbf{x}}_t^{\text{ball}}\right)\right)^2\right)    (16)
    r_t^{\text{bp}} = \exp\left(-0.5\,\|\mathbf{x}_t^* - \mathbf{x}_t^{\text{ball}}\|^2\right).    (17)

𝑟𝑡cv and 𝑟𝑡cp encourage the character to move towards and stay near the ball, where x𝑡ball and ẋ𝑡ball represent the position and velocity of the ball, d𝑡ball is a unit vector pointing from the character to the ball, and 𝑣∗ = 1 m/s is the target velocity at which the character should move towards the ball. Similarly, 𝑟𝑡bv and 𝑟𝑡bp encourage the character to move the ball to the target location, with d𝑡∗ denoting a unit vector pointing from the ball to the target. The goal g𝑡 = x̃𝑡∗ records the relative position of the target location with respect to the character. The state s𝑡 is augmented with additional features that describe the state of the ball, including the position $\tilde{\mathbf{x}}_t^{\text{ball}}$, orientation $\tilde{\mathbf{q}}_t^{\text{ball}}$, linear velocity $\dot{\tilde{\mathbf{x}}}_t^{\text{ball}}$, and angular velocity $\dot{\tilde{\mathbf{q}}}_t^{\text{ball}}$ of the ball in the character's local coordinate frame.

Strike: Finally, to further demonstrate our approach's ability to compose diverse behaviors, we consider a task where the character's objective is to strike a target using a designated end-effector (e.g. hands). The target may be located at various distances from the character. Therefore, the character must first move close to the target before striking it. These distinct phases of the task entail different optimal behaviors, and thus require the policy to compose and transition between the appropriate skills. The goal g𝑡 = (x̃𝑡∗, ℎ𝑡) records the location of the target x̃𝑡∗ in the character's local coordinate frame, along with an indicator variable ℎ𝑡 that specifies if the target has already been hit. The task-reward is partitioned into three phases:

    r_t^G = \begin{cases} 1, & \text{target has been hit} \\ 0.3\, r_t^{\text{near}} + 0.3, & \|\mathbf{x}^* - \mathbf{x}_t^{\text{root}}\| < 1.375\,\text{m} \\ 0.3\, r_t^{\text{far}}, & \text{otherwise.} \end{cases}    (18)

If the character is far from the target x∗, 𝑟𝑡far encourages the character to move to the target using a similar reward function as the Target Location task (Equation 12). Once the character is within a given distance of the target, 𝑟𝑡near encourages the character to strike the target with a particular end-effector,

    r_t^{\text{near}} = 0.2 \exp\left(-2\,\|\mathbf{x}^* - \mathbf{x}_t^{\text{eff}}\|^2\right) + 0.8 \left(\text{clip}\left(\mathbf{d}_t^* \cdot \dot{\mathbf{x}}_t^{\text{eff}},\, 0,\, 1\right)\right)^3,

where x𝑡eff and ẋ𝑡eff denote the position and velocity of the end-effector, and d𝑡∗ is a unit vector pointing from the character's root to the target. After striking the target, the character receives a constant reward of 1 for the remaining time steps.

Obstacles: Lastly, we consider tasks that involve visual perception and interaction with more complex environments, where the character's goal is to traverse an obstacle-filled environment while maintaining a target speed. Policies are trained for two types of environments: 1) an environment containing a combination of obstacles, including gaps, steps, and overhead obstacles that the character must duck under; and 2) an environment containing stepping stones that requires more precise contact planning. Examples of the environments are available in Figures 1 and 3. The task-reward is the same as the one used for the Target Heading task (Equation 11), except the target heading is fixed along the direction of forward progress. In order for the policy to perceive the upcoming obstacles, the state is augmented with a 1D height-field of the upcoming terrain. The height-field records the height of the terrain at 100 sample locations, uniformly spanning 10 m ahead of the character.
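To make the reward specifications above concrete, the following sketch implements the Target Heading reward (Equation 11) and the phased Strike reward (Equation 18) in Python with NumPy. The function and argument names (e.g. target_heading_reward, strike_reward, root_pos) are illustrative and not taken from the released AMP implementation; the constants follow the equations above, and r_far is assumed to be computed separately with a Target Location-style reward (Equation 12).

    import numpy as np

    def target_heading_reward(com_vel, target_dir, target_speed):
        """Target Heading task-reward (Equation 11)."""
        vel_err = target_speed - np.dot(target_dir, com_vel)
        return np.exp(-0.25 * vel_err ** 2)

    def strike_reward(target_pos, root_pos, eff_pos, eff_vel, target_hit, r_far):
        """Phased Strike task-reward (Equation 18), with the r^near term inlined."""
        if target_hit:
            # Constant reward of 1 after the target has been hit.
            return 1.0
        if np.linalg.norm(target_pos - root_pos) < 1.375:
            # r^near: bring the end-effector to the target with high velocity.
            dist_term = np.exp(-2.0 * np.sum((target_pos - eff_pos) ** 2))
            d_star = target_pos - root_pos
            d_star = d_star / (np.linalg.norm(d_star) + 1e-8)
            vel_term = np.clip(np.dot(d_star, eff_vel), 0.0, 1.0) ** 3
            r_near = 0.2 * dist_term + 0.8 * vel_term
            return 0.3 * r_near + 0.3
        # Far phase: move towards the target, e.g. with a Target Location reward.
        return 0.3 * r_far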
B AMP HYPERPARAMETERS
Hyperparameter settings used in the AMP experiments are available in Table 4. For single-clip imitation tasks, we found that a smaller discount factor 𝛾 = 0.95 allows the character to more closely imitate a given reference motion. A larger discount factor 𝛾 = 0.99 is used for experiments that include additional task objectives, such as Dribble and Strike, since these tasks may require longer-horizon planning.
Table 4. AMP hyperparameters.

Parameter | Value
w^G Task-Reward Weight | 0.5
w^S Style-Reward Weight | 0.5
w^gp Gradient Penalty | 10
Samples Per Update Iteration | 4096
Batch Size | 256
K Discriminator Batch Size | 256
π Policy Stepsize (Single-Clip Imitation) | 2 × 10^−6
π Policy Stepsize (Tasks) | 4 × 10^−6
V Value Stepsize (Single-Clip Imitation) | 10^−4
V Value Stepsize (Tasks) | 2 × 10^−5
D Discriminator Stepsize | 10^−5
B Discriminator Replay Buffer Size | 10^5
γ Discount (Single-Clip Imitation) | 0.95
γ Discount (Tasks) | 0.99
SGD Momentum | 0.9
GAE(λ) | 0.95
TD(λ) | 0.95
PPO Clip Threshold | 0.2
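The settings in Table 4 can be collected into a small configuration object. The sketch below is one way to organize them, assuming the task- and style-rewards are combined linearly with the listed weights; the dictionary keys and the combine_rewards helper are illustrative names, not identifiers from the released code.

    # Hyperparameters from Table 4, organized as a flat config dict.
    AMP_HPARAMS = {
        "task_reward_weight": 0.5,         # w^G
        "style_reward_weight": 0.5,        # w^S
        "gradient_penalty": 10.0,          # w^gp
        "samples_per_update": 4096,
        "batch_size": 256,
        "disc_batch_size": 256,            # K
        "policy_stepsize": 2e-6,           # 4e-6 when task objectives are included
        "value_stepsize": 1e-4,            # 2e-5 when task objectives are included
        "disc_stepsize": 1e-5,
        "disc_replay_buffer_size": 10**5,  # B
        "discount": 0.95,                  # 0.99 when task objectives are included
        "sgd_momentum": 0.9,
        "gae_lambda": 0.95,
        "td_lambda": 0.95,
        "ppo_clip_threshold": 0.2,
    }

    def combine_rewards(task_reward, style_reward, hparams=AMP_HPARAMS):
        """Weighted combination of the task-reward and style-reward."""
        return (hparams["task_reward_weight"] * task_reward
                + hparams["style_reward_weight"] * style_reward)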
C LATENT SPACE MODEL
The latent space model follows a similar architecture as Peng et al. [2019a] and Merel et al. [2019]. During pretraining, an encoder 𝑞(z𝑡|g𝑡) maps a goal g𝑡 to a distribution over latent variables z𝑡. A latent encoding z𝑡 ∼ 𝑞(z𝑡|g𝑡) is then sampled from the encoder distribution and passed to the policy as an input 𝜋(a𝑡|s𝑡, z𝑡). The latent distribution is modeled as a Gaussian distribution 𝑞(z𝑡|g𝑡) = N(𝜇𝑞(g𝑡), Σ𝑞(g𝑡)), with mean 𝜇𝑞(g𝑡) and diagonal covariance matrix Σ𝑞(g𝑡). The encoder is trained jointly with the policy using the following objective:

    \arg\max_{\pi, q}\; \mathbb{E}_{p(\tau|\pi,q)}\left[\sum_{t=0}^{T-1} \gamma^t r_t\right] + \lambda\, \mathbb{E}_{p(\mathbf{g}_t)}\left[D_{\text{KL}}\left[q(\cdot|\mathbf{g}_t)\,\|\,p_0\right]\right].    (19)

𝜏 = {(s𝑡, a𝑡, g𝑡, 𝑟𝑡)_{𝑡=0}^{𝑇−1}, s𝑇, g𝑇} represents the goal-augmented trajectory, where the goal g𝑡 may vary at each time step, and

    p(\tau|\pi, q) = p(\mathbf{g}_0)\, p(\mathbf{s}_0) \prod_{t=0}^{T-1} p(\mathbf{g}_{t+1}|\mathbf{g}_t)\, p(\mathbf{s}_{t+1}|\mathbf{s}_t, \mathbf{a}_t)    (20)
        \times \int_{\mathbf{z}_t} \pi(\mathbf{a}_t|\mathbf{s}_t, \mathbf{z}_t)\, q(\mathbf{z}_t|\mathbf{g}_t)\, d\mathbf{z}_t    (21)

is the likelihood of a trajectory under a given policy 𝜋 and encoder 𝑞. Similar to a VAE, we include a KL-regularizer with respect to a variational prior 𝑝0(z𝑡) = N(0, 𝐼) and coefficient 𝜆. The policy and encoder are trained end-to-end with PPO using the reparameterization trick [Kingma and Welling 2014]. Once trained, the latent space model can be transferred to downstream tasks by using 𝜋(a𝑡|s𝑡, z𝑡) as a low-level controller, and then training a separate high-level controller 𝑢(z𝑡|s𝑡, g𝑡) that specifies latent encodings z𝑡 for the low-level controller. The parameters of 𝜋 are fixed, and a new high-level controller 𝑢 is trained for each downstream task.

During pretraining, the latent space model is trained using a motion imitation task, where the objective is for the character to imitate a corpus of motion clips. A reference motion is selected randomly at the start of each episode, and a new reference motion is selected every 5-10s. The goal g𝑡 = (q̂𝑡+1, q̂𝑡+2) specifies target poses from the reference motion at the next two time steps.

The networks used for 𝜋 and 𝑢 follow a similar architecture as the networks used for the policies trained with AMP. The encoder 𝑞 is modeled by a network consisting of two hidden layers, with 512 and 256 hidden units, followed by a linear output layer for 𝜇𝑞(g𝑡) and Σ𝑞(g𝑡). The size of the latent encoding is set to 16D. Hyperparameter settings are available in Table 5.

Table 5. Latent space model hyperparameters.

Parameter | Value
Latent Encoding Dimension | 16
λ KL-Regularizer | 10^−4
Samples Per Update Iteration | 4096
Batch Size | 256
π Policy Stepsize (Pre-Training) | 2.5 × 10^−6
u Policy Stepsize (Downstream Task) | 10^−4
V Value Stepsize | 10^−3
γ Discount (Pre-Training) | 0.95
γ Discount (Downstream Task) | 0.99
SGD Momentum | 0.9
GAE(λ) | 0.95
TD(λ) | 0.95
PPO Clip Threshold | 0.2
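A minimal sketch of the encoder and the reparameterized latent sampling described above is given below in PyTorch. The class and function names are illustrative; the 512/256 hidden layers and the 16-D latent follow the description above, while the goal dimension, the ReLU activations, and the log-variance parameterization of the diagonal covariance are assumptions of this sketch.

    import torch
    import torch.nn as nn

    class GoalEncoder(nn.Module):
        """Encoder q(z_t | g_t): Gaussian with mean mu_q(g_t) and diagonal covariance."""

        def __init__(self, goal_dim, latent_dim=16):
            super().__init__()
            self.trunk = nn.Sequential(
                nn.Linear(goal_dim, 512), nn.ReLU(),
                nn.Linear(512, 256), nn.ReLU(),
            )
            self.mu_head = nn.Linear(256, latent_dim)
            self.logvar_head = nn.Linear(256, latent_dim)

        def forward(self, g):
            h = self.trunk(g)
            return self.mu_head(h), self.logvar_head(h)

    def sample_latent(mu, logvar):
        """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, I)."""
        eps = torch.randn_like(mu)
        return mu + torch.exp(0.5 * logvar) * eps

    def kl_to_standard_normal(mu, logvar):
        """KL[ q(z|g) || N(0, I) ] for a diagonal Gaussian, per sample."""
        return 0.5 * torch.sum(torch.exp(logvar) + mu ** 2 - 1.0 - logvar, dim=-1)

    # Usage: the sampled z_t is concatenated with the state s_t and fed to the
    # low-level policy pi(a_t | s_t, z_t); the KL term above is weighted by the
    # coefficient lambda from Table 5 during pretraining.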
D SPATIAL COMPOSITION
Our experiments have so far focused primarily on temporal compositions of skills, where a character performs different skills at different points in time in order to fulfill particular task objectives, such as walking to a target and then punching it. In this section, we explore settings that require spatial composition of multiple skills, where the task requires a character to perform different skills simultaneously. To evaluate AMP in this setting, we consider a compositional task where a character needs to walk along a target heading direction while also waving its hand at a target height. The motion prior is trained using a dataset consisting of both walking motions and waving motions, but none of the motion clips show examples of walking and waving at the same time. Therefore, the onus is on the policy to spatially compose these different classes of skills in order to fulfill the two disparate objectives simultaneously.

In this task, the character has two objectives: 1) a target heading objective for moving along a target direction d∗ at a target speed 𝑣∗, and 2) a waving objective for raising its right hand to a target height 𝑦∗. The goal input for the policy is given by g𝑡 = (d̃𝑡∗, 𝑣∗, 𝑦∗), with d̃𝑡∗ being the target direction in the character's local coordinate frame. The composite reward is calculated according to:

    r_t^G = 0.5\, r_t^{\text{heading}} + 0.5\, r_t^{\text{wave}},    (22)
where 𝑟𝑡heading is the same as the reward used for the Target Heading task (Equation 11), and 𝑟𝑡wave is specified according to:

    r_t^{\text{wave}} = \exp\left(-16 \left(y_t^{\text{hand}} - y^*\right)^2\right),    (23)

where 𝑦𝑡hand is the height of the character's right hand.

To evaluate AMP's ability to compose disparate skills spatially, we compare policies trained using both walking and waving motions with policies trained with only walking motions or only waving motions. Table 6 compares the performance of the different policies with respect to the target heading and waving objectives. Although the motion prior was not trained with any reference motions that show both walking and waving at the same time, the policy was able to discover behaviors that combine these different skills, enabling the character to walk along different directions while also waving its hand at various heights. The policies trained with only walking motions tend to ignore the waving objective and exhibit solely walking behaviors. Policies trained with only the waving motions are able to fulfill the waving objective, but learn a clumsy shuffling gait in order to follow the target heading direction. These results suggest that AMP does exhibit some capability for spatial composition of different skills. However, the policies trained with both datasets can still exhibit some unnatural behaviors, particularly when the target height for the hand is high.

Table 6. Performance of policies trained using different datasets on a spatial compositional task that combines following a target heading and waving the character's hand at a target height. The normalized task returns for each objective are averaged across 100 episodes for each model. The model trained with both walking and waving motions achieves relatively high rewards on both objectives, while the models trained with only one type of motion perform well only on one of the objectives.

Dataset (Size) | Heading Return | Waving Return
Wave (51.7s) | 0.683 ± 0.195 | 0.949 ± 0.144
Walk (229.7s) | 0.945 ± 0.192 | 0.306 ± 0.378
Wave + Walk (281.4s) | 0.885 ± 0.184 | 0.891 ± 0.202
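For completeness, a small sketch of the composite reward in Equations 22 and 23 is shown below; the heading term reimplements the Target Heading reward (Equation 11) inline, and all names are illustrative rather than identifiers from the released code.

    import numpy as np

    def wave_reward(hand_height, target_height):
        """Waving objective (Equation 23)."""
        return np.exp(-16.0 * (hand_height - target_height) ** 2)

    def spatial_composition_reward(com_vel, target_dir, target_speed,
                                   hand_height, target_height):
        """Composite reward r^G_t = 0.5 r^heading_t + 0.5 r^wave_t (Equation 22)."""
        # Heading term, following the Target Heading reward (Equation 11).
        vel_err = target_speed - np.dot(target_dir, com_vel)
        r_heading = np.exp(-0.25 * vel_err ** 2)
        r_wave = wave_reward(hand_height, target_height)
        return 0.5 * r_heading + 0.5 * r_wave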
Fig. 9. Learning curves comparing AMP to the motion tracking approach proposed by Peng et al. [2018a] (Motion Tracking) on the single-clip imitation tasks.
3 policies initialized with different random seeds are trained for each method and motion. AMP produces results of comparable quality when compared to
prior tracking-based methods, without requiring a manually designed reward function or synchronization between the policy and reference motion.
Fig. 10. Learning curves of applying AMP to various tasks and datasets.