0% found this document useful (0 votes)

96 views20 pages

AMP: Adversarial Motion Priors For Stylized Physics-Based Character Control

Reinforcement Learning

Uploaded by

Ziming Meng

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

96 views20 pages

AMP: Adversarial Motion Priors For Stylized Physics-Based Character Control

Reinforcement Learning

Uploaded by

Ziming Meng

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 20

AMP: Adversarial Motion Priors for Stylized Physics-Based Character

Control
XUE BIN PENG∗ , University of California, Berkeley, USA
ZE MA∗ , Shanghai Jiao Tong University, China
PIETER ABBEEL, University of California, Berkeley, USA
SERGEY LEVINE, University of California, Berkeley, USA
ANGJOO KANAZAWA, University of California, Berkeley, USA

Fig. 1. Our framework enables physically simulated character to solve challenging tasks while adopting stylistic behaviors specified by unstructured motion
data. Left: A character learns to traverse an obstacles course using a variety of locomotion skills. Right: A character learns to walk to and punch a target.

Synthesizing graceful and life-like behaviors for physically simulated charac- system produces high-quality motions that are comparable to those achieved
ters has been a fundamental challenge in computer animation. Data-driven by state-of-the-art tracking-based techniques, while also being able to easily
methods that leverage motion tracking are a prominent class of techniques accommodate large datasets of unstructured motion clips. Composition of
for producing high fidelity motions for a wide range of behaviors. However, disparate skills emerges automatically from the motion prior, without re-
the effectiveness of these tracking-based methods often hinges on carefully quiring a high-level motion planner or other task-specific annotations of
designed objective functions, and when applied to large and diverse motion the motion clips. We demonstrate the effectiveness of our framework on
datasets, these methods require significant additional machinery to select the a diverse cast of complex simulated characters and a challenging suite of
appropriate motion for the character to track in a given scenario. In this work, motor control tasks.
we propose to obviate the need to manually design imitation objectives and
CCS Concepts: • Computing methodologies → Procedural animation;
mechanisms for motion selection by utilizing a fully automated approach
Adversarial learning; Control methods.
based on adversarial imitation learning. High-level task objectives that the
character should perform can be specified by relatively simple reward func- Additional Key Words and Phrases: Wireless sensor networks, media access
tions, while the low-level style of the character’s behaviors can be specified control, multi-channel, radio interference, time synchronization
by a dataset of unstructured motion clips, without any explicit clip selection
ACM Reference Format:
or sequencing. For example, a character traversing an obstacle course might
Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa.
utilize a task-reward that only considers forward progress, while the dataset
2021. AMP: Adversarial Motion Priors for Stylized Physics-Based Character
contains clips of relevant behaviors such as running, jumping, and rolling.
Control. ACM Trans. Graph. 40, 4, Article 144 (August 2021), 20 pages. https:
These motion clips are used to train an adversarial motion prior, which spec-
//doi.org/10.1145/3450626.3459670
ifies style-rewards for training the character through reinforcement learning
(RL). The adversarial RL procedure automatically selects which motion to
perform, dynamically interpolating and generalizing from the dataset. Our 1 INTRODUCTION
∗ Joint first authors.
Synthesizing natural and life-like motions for virtual characters
is a crucial element for breathing life into immersive experiences,
Authors’ addresses: Xue Bin Peng, University of California, Berkeley, 2121 Berkeley such as films and games. The demand for realistic motions becomes
Way, Berkeley, CA, 94704, USA, [email protected]; Ze Ma, Shanghai Jiao Tong
University, 800 Dongchuan Rd, Shanghai, 200240, China, [email protected]; even more apparent for VR applications, where users are provided
Pieter Abbeel, University of California, Berkeley, 2121 Berkeley Way, Berkeley, CA, with rich modalities through which to interact with virtual agents.
94704, USA, [email protected]; Sergey Levine, University of California, Berkeley, Developing control strategies that are able to replicate the properties
2121 Berkeley Way, Berkeley, CA, 94704, USA, [email protected]; Angjoo
Kanazawa, University of California, Berkeley, 2121 Berkeley Way, Berkeley, CA, 94704, of naturalistic behaviors is also of interest for robotic systems, as
USA, [email protected]. natural motions implicitly encode important properties, such as
Permission to make digital or hard copies of all or part of this work for personal or
safety and energy efficiency, which are vital for effective operation
classroom use is granted without fee provided that copies are not made or distributed of robots in the real world. While examples of natural motions
for profit or commercial advantage and that copies bear this notice and the full citation are commonplace, identifying the underlying characteristics that
on the first page. Copyrights for components of this work owned by others than the
author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or constitute these behaviors is nonetheless challenging, and more
republish, to post on servers or to redistribute to lists, requires prior specific permission difficult still to replicate in a controller.
and/or a fee. Request permissions from [email protected]. So what are the characteristics that constitute natural and life-
© 2021 Copyright held by the owner/author(s). Publication rights licensed to ACM.
0730-0301/2021/8-ART144 $15.00 like behaviors? Devising quantitative metrics of the naturalness of
https://doi.org/10.1145/3450626.3459670 motions has been a fundamental challenge for optimization-based

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
144:2 • Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa

character animation techniques [Al Borno et al. 2013; Wampler et al. reward function. We present one of the first adversarial learning
2014; Wang et al. 2009]. Heuristics such as symmetry, stability, and systems that is able to produce high-quality full-body motions for
effort minimization can improve the realism of motions produced by physically simulated characters. By combining the motion prior
physically simulated characters [Grochow et al. 2004; Mordatch et al. with additional task objectives, our system provides a convenient
2012, 2013; Yu et al. 2018]. But these strategies may not be broadly interface through which users can specify high-level directions for
applicable to all behaviors of interest. Effective applications of these controlling a character’s behaviors. These task objectives allow our
heuristics often require careful balancing of the various objectives, characters to acquire more complex skills than those demonstrated
a tuning process that may need to be repeated for each task. Data- in the original motion clips. While our system is built on well-known
driven methods are able to mitigate some of these challenges by adversarial imitation learning techniques, we propose a number of
leveraging motion clips recorded from real-life actors to guide the important design decisions that lead to substantially higher quality
behaviors of simulated characters [Da Silva et al. 2008; Liu et al. results than those achieved by prior work, enabling our characters
2010; Muico et al. 2009; Sok et al. 2007]. A common instantiation to learn highly dynamic and diverse motors skills from unstructured
of this approach is to utilize a tracking objective that encourages a motion data.
character to follow particular reference trajectories relevant for a
given task. These tracking-based methods can produce high-quality 2 RELATED WORK
motions for a large repertoire skills. But extending these techniques Developing systems that can synthesize natural motions for vir-
to effectively leverage large unstructured motion datasets remains tual characters is one of the fundamental challenges of computer
challenging, since a suitable motion clip needs to be selected for animation. These procedural animation techniques can be broadly
the character to track at each time step. This selection process is categorized as kinematic methods and physics-based methods. Kine-
typically performed by a motion planner, which generates reference matic methods generally do not explicitly utilize the equations of
trajectories for solving a particular task [Bergamin et al. 2019; Park motion for motion synthesis. Instead, these methods often lever-
et al. 2019; Peng et al. 2017]. However, constructing an effective age datasets of motion clips to generate motions for a character
motion planner can itself be a challenging endeavour, and entails [Lee et al. 2002, 2010b]. Given a motion dataset, controllers can be
significant overhead to annotate and organize the motion clips constructed to select an appropriate motion clip to play back for
for a desired task. For many applications, it is not imperative to a particular scenario [Agrawal and van de Panne 2016; Safonova
exactly track a particular reference motion. Since a dataset typically and Hodgins 2007; Treuille et al. 2007]. Data-driven methods using
provides only a limited collection of example motions, a character generative models, such as Gaussian processes [Levine et al. 2012;
will inevitably need to deviate from the reference motions in order Ye and Liu 2010] and neural networks [Holden et al. 2017; Ling et al.
to effectively perform a given task. Therefore, the intent is often not 2020; Zhang et al. 2018], have also been applied to synthesize mo-
for the character to closely track a particular motion, but to adopt tions online. When provided with sufficiently large and high-quality
general behavioral characteristics depicted in the dataset. We refer datasets, kinematic methods are able to produce realistic motions
to these behavioral characteristics as a style. for a large variety of sophisticated skills [Agrawal and van de Panne
In this work, we aim to develop a system where users can specify 2016; Lee et al. 2018, 2010b; Levine et al. 2011; Starke et al. 2019].
high-level task objectives for a character to perform, while the low- However, their ability to synthesize motions for novel situations
level style of a character’s movements can be controlled through can be limited by the availability of data. For complex tasks and
examples provided in the form of unstructured motion clips. To environments, it can be difficult to collect a sufficient amount of data
control the style of a character’s motions, we propose adversarial to cover all possible behaviors that a character may need to perform.
motion priors (AMP), a method for imitating behaviors from raw This is particularly challenging for nonhuman and fictional crea-
motion clips without requiring any task-specific annotations or tures, where motion data can be scarce. In this work, we combine
organization of the dataset. Given a set of reference motions that data-driven techniques with physics-based animation methods to
constitutes a desired motion style, the motion prior is modeled as develop characters that produce realistic and responsive behaviors
an adversarial discriminator, trained to differentiate between behav- to novel tasks and environments.
iors depicted in the dataset from those produced by the character.
The motion prior therefore acts as a general measure of similarity Physics-Based Methods: Physics-based methods address some of
between the motions produced by a character and the motions in the limitations of kinematic methods by synthesizing motions from
the dataset. By incorporating the motion prior in a goal-conditioned first principles. These methods typically leverage a physics sim-
reinforcement learning framework, our system is able to train physi- ulation, or more general knowledge of the equations of motion,
cally simulated characters to perform challenging tasks with natural to generate motions for a character [Raibert and Hodgins 1991;
and life-like behaviors. Composition of diverse behaviors emerges Wampler et al. 2014]. Optimization techniques, such as trajectory
automatically from the motion prior, without the need for a motion optimization and reinforcement learning, play a pivotal role in many
planner or other mechanism for selecting which clip to imitate. physics-based methods, where controllers that drive a character’s
The central contribution of this work is an adversarial learning motions are produced by optimizing an objective function [Mor-
approach for physics-based character animation that combines goal- datch et al. 2012; Tan et al. 2014; van de Panne et al. 1994]. While
conditioned reinforcement with an adversarial motion prior, which these methods are able to synthesize physically plausible motions
enables the style of a character’s movements to be controlled via for novel scenarios, even in the absence of motion data, designing ef-
example motion clips, while the task is specified through a simple fective objectives that lead to natural behaviors can be exceptionally

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control • 144:3

difficult. Heuristics derived from prior knowledge of the character- function for training a control policy to imitate the demonstrations.
istics of natural motions are commonly included into the objective While these methods have shown promising results for motion imi-
function, such as symmetry, stability, effort minimization, and many tation tasks [Merel et al. 2017; Wang et al. 2017], adversarial learning
more [Mordatch et al. 2012; Wang et al. 2009; Yu et al. 2018]. Simu- algorithms can be notoriously unstable and the resulting motion
lating more biologically accurate actuators can also improve motion quality still falls well behind what has been achieved with state-of-
quality [Geijtenbeek et al. 2013; Jiang et al. 2019; Wang et al. 2012], the-art tracking-based techniques. Peng et al. [2019b] was able to
but may nonetheless yield unnatural behaviors. able to produce substantially more realistic motions by regularizing
the discriminator with an information bottleneck. However, their
Imitation Learning: The challenges of designing objective func- method still requires a phase variable to synchronize the policy and
tions that lead to natural motions have spurred the adoption of discriminator with the reference motion. Therefore, their results are
data-driven physics-based animation techniques [Da Silva et al. limited to imitating a single motion per policy, and thus not suitable
2008; Kwon and Hodgins 2017; Lee et al. 2010a; Sharon and van de for learning from large diverse motion datasets. In this work, we
Panne 2005; Zordan and Hodgins 2002], which utilizes reference propose an adversarial method for learning general motion priors
motion data to improve motion quality. Reference motions are typi- from large unstructured datasets that contain diverse motion clips.
cally incorporated through an imitation objective that encourages a Our approach does not necessitate any synchronization between
character to imitate motions in the dataset. The imitation objective the policy and reference motion. Furthermore, our approach does
is commonly implemented as a tracking objective, which attempts not require a motion planner, or any task-specific annotation and
to minimize the pose error between the simulated character and segmentation of the motion clips [Bergamin et al. 2019; Park et al.
target poses from a reference motion [Lee et al. 2010a; Liu et al. 2019; Peng et al. 2017]. Instead, composition of multiple motions
2016, 2010; Peng et al. 2018a; Sok et al. 2007]. Since the pose error is in furtherance of a task objective emerges automatically through
generally computed with respect to a single target pose at a time, the motion prior. We also present a number of design decisions for
some care is required to select an appropriate target pose from the stabilizing the adversarial training process, leading to consistent
dataset. A simple strategy is to synchronize the simulated character and high-quality results.
with a given reference motion using a phase variable [Lee et al.
2019; Peng et al. 2018a,b], which is provided as an additional input Latent Space Models: Latent space models can also act as a form
to the controller. The target pose at each time step can then be of motion prior that leads to more life-like behaviors. These mod-
conveniently determined by selecting the target pose according to els specify controls through a learned latent representation, which
the phase. This strategy has been effective for imitating individual is then mapped to controls for the underlying system [Burgard
motion clips, but it can be difficult to scale to datasets containing et al. 2008; Florensa et al. 2017; Hausman et al. 2018; Heess et al.
multiple disparate motions, as it may not be possible to synchronize 2016]. The latent representation is typically learned through a pre-
and align multiple reference motions according to a single-phase training phase using supervised learning or reinforcement learning
variable. Recent methods have extended these tracking-based tech- techniques to encode a diverse range of behaviors into a latent rep-
niques to larger motion datasets by explicitly providing target poses resentation. Once trained, this latent representation can be used to
from the reference motion that is being tracked as inputs to the con- build a control hierarchy, where the latent space model acts as a
troller [Bergamin et al. 2019; Chentanez et al. 2018; Park et al. 2019; low-level controller, and a separate high-level controller is trained to
Won et al. 2020]. This then allows a controller to imitate different specify controls via the latent space [Florensa et al. 2017; Haarnoja
motions depending on the input target poses. However, selecting the et al. 2018; Lynch et al. 2020]. For motion control of simulated char-
appropriate motion for a character to imitate in a given scenario can acters, the latent representation can be trained to encode behaviors
still entail significant algorithmic overhead. These methods often from reference motion clips, which then constrains the behavior
require a high-level motion planner that selects which motion clip of a character to be similar to those observed in the motion data,
the character should imitate for a given task [Bergamin et al. 2019; therefore leading to more natural behaviors for downstream tasks
Park et al. 2019; Peng et al. 2017]. The character’s performance on a [Merel et al. 2019; Peng et al. 2019a]. However, since the realism
particular task can therefore be constrained by the performance of of the character’s motions is enforced implicitly through the latent
the motion planner. representation, rather than explicitly through an objective function,
Another major limitation of tracking-based imitation techniques it is still possible for the high-level controller to specify latent encod-
is the need for a pose error metric when computing the tracking ings that produce unnatural behaviors [Merel et al. 2020; Peng et al.
objective [Liu et al. 2010; Peng et al. 2018a; Sharon and van de Panne 2019a]. Luo et al. [2020] proposed an adversarial domain confusion
2005]. These error metrics are often manually-designed, and it can be loss to prevent the high-level controller from specifying encodings
challenging to construct and tune a common metric that is effective that are different from those observed during pre-training. However,
across all skills that a character is to imitate. Adversarial imitation since this adversarial objective is applied in the latent space, rather
learning provides an appealing alternative [Abbeel and Ng 2004; than on the actual motions produced by the character, the model is
Ho and Ermon 2016; Ziebart et al. 2008], where instead of using a nonetheless prone to generating unnatural behaviors. Our proposed
manually-designed imitation objective, these algorithms train an ad- motion prior directly enforces similarity between the motions pro-
versarial discriminator to differentiate between behaviors generated duced by the character and those in the reference motion dataset,
by an agent from behaviors depicted in the demonstration data (e.g. which enables our method to produce higher fidelity motions than
reference motions). The discriminator then serves as the objective what has been demonstrated by latent space models. Our motion

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
144:4 • Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa

prior also does not require a separate pre-training phase, and instead,
can be trained jointly with the policy.

3 OVERVIEW
Given a dataset of reference motions and a task objective defined by a
reward function, our system synthesizes a control policy that enables
a character to achieve the task objective in a physically simulated
environment, while utilizing behaviors that resemble the motions
in the dataset. Crucially, the character’s behaviors need not exactly
match any specific motion in the dataset, instead its movements
need only to adopt more general characteristics exhibited by the
corpus of reference motions. These reference motions collectively
provide an example-based definition of a behavioral style, and by
providing the system with different motion datasets, the character
can then be trained to perform a task in a variety of distinct styles. Fig. 2. Schematic overview of the system. Given a motion dataset defining a
Figure 2 provides a schematic overview of the system. The motion desired motion style for the character, the system trains a motion prior that
dataset M consists of a collection of reference motions, where each specifies style-rewards 𝑟𝑡𝑆 for the policy during training. These style-rewards
motion m𝑖 = {^ q𝑖𝑡 } is represented as a sequence of poses q ^𝑖𝑡 . The mo- are combined with task-rewards 𝑟𝑡𝐺 and used to train a policy that enables
tion clips may be collected from the mocap of real-life actors or from a simulated character to satisfy task-specific goals g, while also adopting
behaviors that resemble the reference motions in the dataset.
artist-authored keyframe animations. Unlike previous frameworks,
our system can be applied directly on raw motion data, without
requiring task-specific annotations or segmentation of a clip into a policy that maximizes its expected discounted return 𝐽 (𝜋),
individual skills. The motion of the simulated character is controlled "𝑇 −1 #
by a policy 𝜋 (a𝑡 |s𝑡 , g) that maps the state of the character s𝑡 and Õ
a given goal g to a distribution over actions a𝑡 . The actions from 𝐽 (𝜋) = E𝑝 ( g) E𝑝 (𝜏 |𝜋,g) 𝛾 𝑡 𝑟𝑡 , (1)
𝑡 =0
the policy specify target positions for proportional-derivative (PD)
controllers positioned at each of the character’s joints, which in where 𝑝 (𝜏 |𝜋, g) = 𝑝 (s0 ) 𝑇𝑡 =−01 𝑝 (s𝑡 +1 |s𝑡 , a𝑡 )𝜋 (a𝑡 |s𝑡 , g) represents
Î
turn produce control forces that drive the motion of the character. the likelihood of a trajectory 𝜏 = {(s𝑡 , a𝑡 , 𝑟𝑡 )𝑇𝑡 =−01, s𝑇 } under a policy
The goal g specifies a task reward function 𝑟𝑡𝐺 = 𝑟 𝐺 (s𝑡 , a𝑡 , s𝑡 +1, g), 𝜋 for a goal g. 𝑝 (s0 ) is the initial state distribution, and 𝑝 (s𝑡 +1 |s𝑡 , a𝑡 )
which defines high-level objectives for the character to satisfy (e.g. represents the dynamics of the environment. 𝑇 denotes the time
walking in a target direction or punching a target). The style objec- horizon of a trajectory, and 𝛾 ∈ [0, 1) is a discount factor.
tive 𝑟𝑡𝑆 = 𝑟 𝑆 (s𝑡 , s𝑡 +1 ) is specified by an adversarial discriminator,
trained to differentiate between motions depicted in the dataset 4.2 Generative Adversarial Imitation Learning
from motions produced by the character. The style objective there-
fore acts as a task-agnostic motion prior that provides an a-priori Generative adversarial imitation learning (GAIL) [Ho and Ermon
estimate of the naturalness or style of a given motion, independent 2016] adapts techniques developed for generative adversarial net-
of a specific task. The style objective then encourages the policy to works (GAN) [Goodfellow et al. 2014] to the domain of imitation
produce motions that resemble behaviors depicted in the dataset. learning. In the interest of brevity, we exclude the goal g from
the notation, but the following discussion readily generalizes to
goal-conditioned settings. Given a dataset of demonstrations M =
4 BACKGROUND
{(s𝑖 , a𝑖 )}, containing states s𝑖 and actions a𝑖 recorded from an un-
Our system combines techniques from goal-conditioned reinforce- known demonstration policy, the objective is to train a policy 𝜋 (a|s)
ment learning and generative adversarial imitation learning to train that imitates the behaviors of the demonstrator. Behavioral cloning
control policies that enable simulated characters to perform chal- can be used to directly fit a policy to map from states observed in M
lenging tasks in a desired behavioral style. In this section, we provide to their corresponding actions using supervised learning [Bojarski
a brief review of these techniques. et al. 2016; Pomerleau 1988]. However, if only a small amount of
demonstrations are available, then behavioral cloning techniques
4.1 Goal-Conditioned Reinforcement Learning are prone to drift [Ross et al. 2011]. Furthermore, behavioral cloning
Our characters are trained through a goal-conditioned reinforce- is not directly applicable in settings where the demonstration actions
ment learning framework, where an agent interacts with an envi- are not observable (e.g. reference motion data).
ronment according to a policy 𝜋 in order to fulfill a given goal g ∈ G GAIL addresses some of the limitations of behavioral cloning by
sampled according to a goal distribution g ∼ 𝑝 (g). At each time step learning an objective function that measures the similarity between
𝑡, the agent observes the state s𝑡 ∈ S of the system, then samples the policy and the demonstrations, and then updating 𝜋 via rein-
an action a𝑡 ∈ A from a policy a𝑡 ∼ 𝜋 (a𝑡 |s𝑡 , g). The agent then forcement learning to optimize the learned objective. The objective
applies that action, which results in a new state s𝑡 +1 , as well as a is modeled as a discriminator 𝐷 (s, a), trained to predict whether a
scalar reward 𝑟𝑡 = 𝑟 (s𝑡 , a𝑡 , s𝑡 +1, g). The agent’s objective is to learn given state s and action a is sampled from the demonstrations M

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control • 144:5

or generated by running the policy 𝜋, 5.1 Imitation from Observations

arg min −E𝑑 M ( s,a) [log (𝐷 (s, a))] − E𝑑 𝜋 ( s,a) [log (1 − 𝐷 (s, a))] . The original formulation of GAIL requires access to the demonstra-
𝐷 tor’s actions [Ho and Ermon 2016]. However, when the demonstra-
(2) tions are provided in the form of motion clips, the actions taken by
the demonstrator are unknown, and only states are observed in the
𝑑 M (s, a) and 𝑑 𝜋 (s, a) denote the state-action distribution of the
data. To extend GAIL to settings with state-only demonstrations,
dataset and policy respectively. The policy is then trained using the
the discriminator can be trained on state transitions 𝐷 (s, s ′ ) instead
RL objective detailed in Equation 1, with rewards specified by,
of state-action pairs 𝐷 (s, a) [Torabi et al. 2018],
𝑟𝑡 = −log (1 − 𝐷 (s𝑡 , a𝑡 )) . (3)
arg min − E𝑑 M ( s,s′ ) log 𝐷 (s, s ′ ) − E𝑑 𝜋 ( s,s′ ) log 1 − 𝐷 (s, s ′ ) .

This adversarial training procedure can be interpreted as training a 𝐷
(5)
policy to produce states and actions that appear to the discrimina-
tor as being indistinguishable from the demonstrations. It can be 𝑑 M (s, s ′ ) and 𝑑 𝜋 (s, s ′ ) denote the likelihood of observing a state
shown that this objective minimizes the Jensen-Shannon divergence transition from s to s ′ in the dataset M and by following policy
between 𝑑 M (s, a) and 𝑑 𝜋 (s, a) [Ke et al. 2019; Nowozin et al. 2016]. 𝜋 respectively. Note that if the demonstrator is different from the
agent (e.g. a human actor), the observed state transitions may not be
5 ADVERSARIAL MOTION PRIOR physically consistent for the agent, and therefore impossible for the
In this work, we consider reward functions that consist of two agent to perfectly reproduce. Despite this discrepancy, we show that
components specifying: 1) what task a character should perform, the discriminator still provides an effective objective for imitating a
and 2) how the character should go about performing that task, wide range of behaviors.
𝑟 (s𝑡 , a𝑡 , s𝑡 +1, g) = 𝑤 𝐺 𝑟 𝐺 (s𝑡 , a𝑡 , s𝑡 , g) + 𝑤 𝑆 𝑟 𝑆 (s𝑡 , s𝑡 +1 ). (4) 5.2 Least-Squares Discriminator
The what is represented by a task-specific reward 𝑟 𝐺 (s𝑡 , a𝑡 , s𝑡 , g), The standard GAN objective detailed in Equation 5 typically uses
which defines high-level objectives that a character should satisfy a sigmoid cross-entropy loss function. However, this loss tends to
(e.g. moving to a target location). The how is represented through lead to optimization challenges due to vanishing gradients as the
a learned task-agnostic style-reward 𝑟 𝑆 (s𝑡 , s𝑡 +1 ), which specifies sigmoid function saturates, which can hamper training of the policy
low-level details of the behaviors that the character should adopt [Arjovsky et al. 2017]. A myriad of techniques have been proposed
when performing the task (e.g., walking vs. running to a target). The to address these optimization challenges [Arjovsky et al. 2017; Berth-
two reward terms are combined linearly with weights 𝑤 𝐺 and 𝑤 𝑆 . elot et al. 2017; Gulrajani et al. 2017; Karras et al. 2017; Kodali et al.
The task-reward 𝑟 𝐺 can be relatively intuitive and simple to design. 2017; Mescheder et al. 2018; Radford et al. 2015; Salimans et al. 2016].
However, it can be exceptionally difficult to design a style-reward 𝑟 𝑆 In this work, we adopt the loss function proposed for least-squares
that leads a character to learn naturalistic behaviors, or behaviors GAN (LSGAN) [Mao et al. 2017], which has demonstrated more
that conform to a particular style. Learning effective style objectives stable training and higher quality results for image synthesis tasks.
will therefore be the primary focus of this work. The following objective is used to train the discriminator,
We propose to model the style-reward with a learned discrimi- h 2i h 2i
nator, which we refer to as an adversarial motion prior (AMP), by arg min E𝑑 M ( s,s′ ) 𝐷 (s, s ′ ) − 1 + E𝑑 𝜋 ( s,s′ ) 𝐷 (s, s ′ ) + 1 .
analogy to the adversarial pose priors that were previously pro-
𝐷
(6)
posed for vision-based pose estimation tasks [Kanazawa et al. 2018].
Unlike standard tracking objectives, which measure pose similarity The discriminator is trained by solving a least-squares regression
with respect to a specific reference motion, the motion prior returns problem to predict a score of 1 for samples from the dataset and
a general score indicating the similarity of the character’s motion −1 for samples recorded from the policy. The reward function for
to the motions depicted in the dataset, without explicitly compar- training the policy is then given by
ing to a particular motion clip. Given a motion dataset, the motion 𝑟 (s𝑡 , s𝑡 +1 ) = max 0, 1 − 0.25(𝐷 (s𝑡 , s𝑡 +1 ) − 1) 2 .

(7)
prior is trained using the GAIL framework to predict whether a
state transition (s𝑡 , s𝑡 +1 ) is a real sample from the dataset or a fake The additional offset, scaling, and clipping are applied to bound the
sample produced by the character. The motion prior is independent reward between [0, 1], as is common practice in previous RL frame-
of the task-specific goal g, therefore a single motion prior can be works [Peng et al. 2018a, 2016; Tassa et al. 2018]. Mao et al. [2017]
applied to multiple tasks, and different motion priors can be applied showed that this least-squares objective minimizes the Pearson 𝜒 2
to train policies that perform the same task but in different styles. By divergence between 𝑑 M (s, s ′ ) and 𝑑 𝜋 (s, s ′ ).
combining GAIL with additional task objectives, our approach de-
couples task specification from style specification, thereby enabling 5.3 Discriminator Observations
our characters to perform tasks that may not be depicted in the Since the discriminator specifies rewards for training the policy,
original demonstrations. However, adversarial RL techniques are selecting an appropriate set of features for use by the discriminator
known to be highly unstable. In the following sections, we discuss when making its predictions is vital to provide the policy with effec-
a number of design decisions to stabilize the training process and tive feedback. As such, before a state transition is provided as input
produce higher fidelity results. to the discriminator, we first apply an observation map Φ(s𝑡 ) that

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
144:6 • Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa

extracts a set of features relevant for determining the characteristics ALGORITHM 1: Training with AMP
of a given motion. The resulting features are then used as inputs to 1: input M: dataset of reference motions
the discriminator 𝐷 (Φ(s), Φ(s ′ )). The set of features include: 2: 𝐷 ← initialize discriminator
• Linear velocity and angular velocity of the root, represented 3: 𝜋 ← initialize policy
4: 𝑉 ← initialize value function
in the character’s local coordinate frame.
5: B ← ∅ initialize reply buffer
• Local rotation of each joint.
• Local velocity of each joint. 6: while not done do
• 3D positions of the end-effectors (e.g. hands and feet), repre- 7: for trajectory 𝑖 = 1, ..., 𝑚 do
sented in the character’s local coordinate frame. 8: 𝜏 𝑖 ← { (s𝑡 , a𝑡 , 𝑟𝑡𝐺 )𝑡𝑇=−01 , s𝑇𝐺 , g} collect trajectory with 𝜋
The root is designated to be the character’s pelvis. The character’s 9: for time step 𝑡 = 0, ...,𝑇 − 1 do
10: 𝑑𝑡 ← 𝐷 (Φ(s𝑡 ), Φ(s𝑡 +1 ))
local coordinate frame is defined with the origin located at the
11: 𝑟𝑡𝑆 ← calculate style reward according to Equation 7 using 𝑑𝑡
root, the x-axis oriented along the root link’s facing direction, and
12: 𝑟𝑡 ← 𝑤𝐺 𝑟𝑡𝐺 + 𝑤𝑆 𝑟𝑡𝑆
the y-axis aligned with the global up vector. The 3D rotation of 13: record 𝑟𝑡 in 𝜏 𝑖
each spherical joint is encoded using two 3D vectors corresponding 14: end for
to the normal and tangent in the coordinate frame. This rotation 15: store 𝜏 𝑖 in B
encoding provides a smooth and unique representation of a given 16: end for
rotation. This set of observation features for the discriminator is
selected to provide a compact representation of the motion across 17: for update step = 1, ..., 𝑛 do
a single state transition. The observations also do not include any 18: 𝑏 M ← sample batch of 𝐾 transitions { (s 𝑗 , s′𝑗 ) }𝐾 𝑗 =1 from M
task-specific features, thus enabling the motion prior to be trained 19: 𝑏 𝜋 ← sample batch of 𝐾 transitions { (s 𝑗 , s′𝑗 ) }𝐾
𝑗 =1 from B
without requiring task-specific annotation of the reference motions, 20: update 𝐷 according to Equation 8 using 𝑏 M and 𝑏 𝜋
and allowing motion priors trained with the same dataset to be used 21: end for
for different tasks.
22: update 𝑉 and 𝜋 using data from trajectories {𝜏 𝑖 }𝑚
𝑖=1
23: end while
5.4 Gradient Penalty
The interplay between the discriminator and generator in a GAN
often results in unstable training dynamics. One source of instability
is due to function approximation errors in the discriminator, where depicted in the dataset. In this section, we detail the design of various
the discriminator may assign nonzero gradients on the manifold of components of the learning framework.
real data samples [Mescheder et al. 2018]. These nonzero gradients
can cause the generator to overshoot and move off the data manifold, 6.1 States and Actions
instead of converging to the manifold, leading to oscillations and The state s𝑡 consists of a set of features that describes the configura-
instability during training. To mitigate this phenomenon, a gradient tion of the character’s body. The features are similar to those used by
penalty can be applied to penalize nonzero gradients on samples Peng et al. [2018a], which include the relative positions of each link
from the dataset [Gulrajani et al. 2017; Kodali et al. 2017; Mescheder with respective to the root, the rotation of each link as represented
et al. 2018]. We incorporate this technique to improve training using the 6D normal-tangent encoding, along with the link’s linear
stability. The discriminator objective is then given by: and angular velocities. All features are recorded in the character’s lo-
h cal coordinate system. Unlike previous systems, which synchronize
2i
arg min E𝑑 M ( s,s′ ) 𝐷 (Φ(s), Φ(s ′ )) − 1 the policy with a particular reference motion by including additional
𝐷
h phase information in the state, such as scalar phase variables [Lee
2i
+ E𝑑 𝜋 ( s,s′ ) 𝐷 Φ(s), Φ(s ′ ) + 1 et al. 2019; Peng et al. 2018a] or target poses [Bergamin et al. 2019;

Chentanez et al. 2018; Won et al. 2020], our policies are not trained
𝑤 gp
2
to explicitly imitate any specific motion from the dataset. Therefore,

+ E𝑑 M ( s,s′ ) ∇𝜙 𝐷 (𝜙) 𝜙=(Φ( s),Φ( s′ )) , (8)

2 no such synchronization or phase information is necessary.
Each action a𝑡 specifies target positions for PD controllers posi-
where 𝑤 gp is a manually-specified coefficient. Note, the gradient
tioned at each of the character’s joints. For spherical joints, each
penalty is calculated with respect to the observation features 𝜙 =
target is specified in the form of a 3D exponential map q ∈ R3
(Φ(s), Φ(s ′ )), not the full set of state features (s, s ′ ). As we show in
[Grassia 1998], where the rotation axis v and rotation angle 𝜃 can
our experiments, the gradient penalty is crucial for stable training
be determined according to:
and effective performance.
q
v= , 𝜃 = ||q|| 2 . (9)
6 MODEL REPRESENTATION ||q|| 2
Given a high-level task objective and a dataset of reference motions, This representation provides a more compact parameterization than
the agent is responsible for learning a control policy that fulfills the the 4D axis-angle or quaternion representations used in prior sys-
task objectives, while utilizing behaviors that resemble the motions tems [Peng et al. 2018a; Won et al. 2020], while also avoiding gimbal

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control • 144:7

lock from parameterizations such as euler angles. Target rotations direction in the character’s local coordinate frame. The target speed
for revolute joints are specified as 1D rotation angles 𝑞 = 𝜃 . is selected randomly between 𝑣 ∗ ∈ [1, 5]m/s. For slower moving
styles, such as Zombie and Stealthy, the target speed is fixed at 1m/s.
6.2 Network Architecture
Each policy 𝜋 is modeled by a neural network that maps a given state Target Location: In this task, the character’s objective is to move
s𝑡 and goal g to a Gaussian distribution over actions 𝜋 (a𝑡 |s𝑡 , g) = to a target location x∗ . The goal g𝑡 = x̃𝑡∗ records the target location
N (𝜇 (s𝑡 , g), Σ), with an input-dependent mean 𝜇 (s𝑡 , g) and a fixed in the character’s local coordinate frame.
diagonal covariance matrix Σ. The mean is specified by a fully-
connected network with two hidden layers, consisting of 1024 and Dribbling: To evaluate our system on more complex object ma-
512 ReLU [Nair and Hinton 2010], followed by a linear output nipulation tasks, we train policies for a dribbling task, where the
layer. The values of the covariance matrix Σ = diag(𝜎1, 𝜎2, ...) are character’s objective is to dribble a soccer ball to a target location.
manually-specified and kept fixed over the course of training. The The goal g𝑡 = x̃𝑡∗ records the relative position of the target loca-
value function 𝑉 (s𝑡 , g) and discriminator 𝐷 (s𝑡 , s𝑡 +1 ) are modeled tion with respect to the character. The state s𝑡 is augmented with
by separate networks with a similar architecture as the policy. additional features that describe the state of the ball, including the
position x̃𝑡ball , orientation q̃𝑡ball , linear velocity x¤̃ 𝑡ball , and angular
6.3 Training velocity q¤̃ 𝑡ball of the ball in the character’s local coordinate frame.
Our policies are trained using a combination of GAIL [Ho and Ermon
2016] and proximal-policy optimization (PPO) [Schulman et al. 2017]. Strike: To demonstrate AMP’s ability to compose diverse behav-
Algorithm 1 provides an overview of the training process. At each iors, we consider a task where the character’s objective is to strike
time step 𝑡, the agent receives a task-reward 𝑟𝑡𝐺 = 𝑟 𝐺 (s𝑡 , a𝑡 , s𝑡 +1, g) a target using a designated end-effector (e.g. hands). The target may
from the environment, it then queries the motion prior for a style- be located at various distances from the character. Therefore, the
reward 𝑟𝑡𝑆 = 𝑟 𝑆 (s𝑡 , s𝑡 +1 ), computed according to Equation 7. The character must first move close to the target before striking it. These
two rewards are combined according to Equation 4 to yield the distinct phases entail different optimal behaviors, and thus require
reward for the particular timstep. Following the approach proposed the policy to compose and transition between the appropriate skills.
by Peng et al. [2018a], we incorporate reference state initialization The goal g𝑡 = (x̃𝑡∗, ℎ𝑡 ) records the location of the target x̃𝑡∗ in the
and early termination. Reference state initialization is applied by character’s local coordinate frame, along with an indicator variable
initializing the character to states sampled randomly from all motion ℎ𝑡 that specifies if the target has already been hit.
clips in the dataset. Early termination is triggered on most tasks
when any part of the character’s body, with exception of the feet, Obstacles: Finally, we consider tasks that involve visual percep-
makes contact with the ground. This termination criteria is disabled tion and interaction with more complex environments, where the
for more contact-rich tasks, such as rolling or getting up after a fall. character’s objective is to traverse an obstacle-filled terrain, while
Once a batch of data has been collected with the policy, the maintaining a target speed. Policies are trained for two types of envi-
recorded trajectories are used to update the policy and value func- ronments: 1) An environment containing a combination of obstacles
tion. The value function is updated with target values computed include gaps, steps, and overhead obstructions that the character
using TD(𝜆) [Sutton and Barto 1998]. The policy is updated using must duck under. 2) An environment containing narrow stepping
advantages computed using GAE(𝜆) [Schulman et al. 2015]. Each stones that requires more precise contact planning. Examples of the
trajectory recorded from the policy is also stored in a replay buffer environments are available in Figure 1 and 3. In order for the policy
B, containing trajectories from past training iterations. The dis- to perceive the upcoming obstacles, the state is augmented with a
criminator is updated according to Equation 8 using minibatches 1D height-field of the upcoming terrain.
of transitions (s, s ′ ) sampled from the reference motion data set M
and transitions from the replay buffer B. The replay buffer helps to
8 RESULTS
stabilize training by preventing the discriminator from overfitting
to the most recent batch of trajectories from the policy. We evaluate our framework’s effectiveness on a suite of challenging
motion control tasks with complex simulated characters. First, we
7 TASKS demonstrate that our approach can readily scale to large unstruc-
tured datasets containing diverse motion clips, which then enables
To evaluate AMP’s effectiveness for controlling the style of a char-
our characters to perform challenging tasks in a natural and life-like
acter’s motions, we apply our framework to train complex 3D sim-
manner by imitating behaviors from the dataset. The characters
ulated characters to perform various motion control tasks using
automatically learn to compose and generalize different skills from
different motion styles. The characters include a 34 DoF humanoid, a
the motion data in order to fulfill high-level task objectives, without
59 DoF T-Rex, and a 64 DoF dog. A summary of each task is provided
requiring mechanisms for explicit motion selection. We then evalu-
below. Please refer to Appendix A for a more in-depth description
ate AMP on a single-clip imitation task, and show that our method
of each task and their respective reward functions.
is able to closely imitate a diverse corpus of dynamic and acrobatic
Target Heading: In this task, the character’s objective is to move skills, producing motions that are nearly indistinguishable from
along a target heading direction d∗ at a target speed 𝑣 ∗ . The goal reference motions recorded from human actors. Behaviors learned
for the policy is specified as g𝑡 = ( d̃𝑡∗, 𝑣 ∗ ), with d̃𝑡∗ being the target by the characters can be viewed in the supplementary video.

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
144:8 • Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa

(a) Humanoid: Target Location (Locomotion) (b) Humanoid: Target Location (Zombie)

(c) Humanoid: Target Heading (Locomotion + Getup)

(d) Humanoid: Dribble (Locomotion) (e) Humanoid: Strike (Walk + Punch)

(f) Humanoid: Obstacles (Run + Leap + Roll)

(g) Humanoid: Stepping Stones (Cartwheel) (h) Humanoid: Stepping Stones (Jump)
Fig. 3. The motion prior can be trained with large datasets of diverse motions, enabling simulated characters to perform complex tasks by composing a wider
range of skills. Each environment is denoted by "Character: Task (Dataset)".

8.1 Experimental Setup 8.2 Tasks

All environments are simulated using the Bullet physics engine [Coumans In this section, we demonstrate AMP’s effectiveness for controlling
et al. 2013], with a simulation frequency of 1.2kHz. The policy is the style of a character’s motions as it performs other high-level
queried at 30Hz, and each action specifies target positions for PD tasks. The weights for the task-reward and style-reward are set to
controllers positioned at the character’s joints. All neural networks 𝑤 𝐺 = 0.5 and 𝑤 𝑆 = 0.5 for all tasks. The character can be trained
are implemented using Tensorflow [Abadi et al. 2015]. The gradient to perform tasks in a variety of distinct styles by providing the
penalty coefficient is set to 𝑤 gp = 10. Detailed hyperparameter motion prior with different datasets. Figure 3 illustrates behaviors
settings are available in Appendix B. Reference motion clips are learned by the Humanoid on various tasks. Table 1 records the
collected from a combination of public mocap libraries, custom performance of the policies with respect to the normalized task
recorded mocap clips, and artist-authored keyframe animations return, and summary statistics of the different datasets used to train
[CMU [n.d.]; SFU [n.d.]; Zhang et al. 2018]. Depending on the task the motion priors are available in Table 2. AMP can accommodate
and character, each policy is trained with 100-300 million samples, large unstructured datasets, with the largest dataset containing 56
requiring approximately 30-140 hours on 16 CPU cores. Code for clips from 8 different human actors, for a total of 434s of motion data.
our system will be released upon publication of this paper. In the case of the Target Heading task, a motion prior trained using
a locomotion dataset, containing walking, running, and jogging
motions, leads to a policy that executes different locomotion gaits
depending on the target speed. Transitions between various gaits

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control • 144:9

Table 1. Performance statistics of combining AMP with additional task Table 2. Summary statistics of the different datasets used to train the motion
objectives. Performance is recorded as the average normalized task return, priors. We record the total length of motion clips in each dataset, along
with 0 being the minimum possible return per episode and 1 being the with the number of clips, and the number of subjects (e.g. human actors)
maximum possible return. The return is averaged across 3 models initialized that the clips were recorded from.
with different random seeds, with 32 episodes recorded per model. The
motion prior can be trained with different datasets to produce policies that
Character Dataset Size (s) Clips Subjects
adopt distinct stylistic behaviors when performing a particular task. Humanoid Cartwheel 13.6 3 1
Jump 28.6 10 4
Character Task Dataset Task Return Locomotion 434.1 56 8
Humanoid Target Locomotion 0.90 ± 0.01 Run 204.4 47 3
Heading Walk 0.46 ± 0.01 Run + Leap + Roll 22.1 10 7
Run 0.63 ± 0.01 Stealthy 136.5 3 1
Stealthy 0.89 ± 0.02 Walk 229.6 9 5
Zombie 0.94 ± 0.00 Walk + Punch 247.8 15 9
Target Locomotion 0.63 ± 0.01 Zombie 18.3 1 1
Location Zombie 0.50 ± 0.00 T-Rex Locomotion 10.5 5 1
Obstacles Run + Leap + Roll 0.27 ± 0.10
Stepping Cartwheel 0.43 ± 0.03
Stones Jump 0.56 ± 0.12
Dribble Locomotion 0.78 ± 0.05
Zombie 0.60 ± 0.04
Strike Walk + Punch 0.73 ± 0.02
Target
T-Rex Location Locomotion 0.36 ± 0.03

emerge automatically through the motion prior, with the character

adopting walking gaits at slow speeds (∼ 1m/s), switching to jogging
gaits at faster speeds (∼ 2.5m/s), and breaking into a fast run as the Fig. 4. Performance of Target Heading policies trained with different
target speed approaches (∼ 4.5m/s). The motion prior also leads to datasets. Left: Learning curves comparing the normalized task returns
other human-like strategies, such as banking into turns, and slowing of policies trained with a large dataset of diverse locomotion clips to policies
down before large changes in direction. The policies develop similar trained with only walking or running reference motions. Three models are
trained using each dataset. Right: Comparison of the target speed with the
behaviors for the Target Location task. When the target is near the
average speed achieved by the different policies. Policies trained using the
character, the policy adopts slower walking gaits. But when the larger Locomotion dataset is able to more closely follow the various target
target is further away, the character automatically transitions into speeds by imitating different gaits.
a run. These intricate behaviors arise naturally from the motion
prior, without requiring a motion planner to explicitly select which
motion the character should execute in a given scenario, such as our policies can in large part be attributed to the motion prior, and
those used in prior systems [Bergamin et al. 2019; Luo et al. 2020; is not solely a result of the task objective.
Peng et al. 2017]. In addition to standard locomotion gaits, the To further illustrate AMP’s ability to compose disparate skills,
motion prior can also be trained for more stylistic behaviors, such we introduce additional reference motions into the dataset for get-
as walking like a shambling zombie or walking in a stealthy manner. ting up from the ground in various configurations. These additional
Our framework enables the character to acquire these distinct styles motion clips then enable our character to recover from a fall and
by simply providing the motion prior with different unstructured continue to perform a given task (Figure 3(c)). The policy also dis-
motion datasets. covers novel recovery behaviors that are not present in the dataset.
To determine whether the transitions between distinct gaits are a When the character falls forward, it tucks its body into a roll during
product of the motion prior or a result of the task objective, we train the fall in order to more quickly transition into a getup behavior.
policies to perform the Target Heading task using limited datasets While this particular behavior is not present in the motion clips,
containing only walking or running data. Figure 4 compares the the policy is able to generalize behaviors observed in the dataset to
performance of policies trained with these different datasets. Policies produce novel and naturalistic strategies for new scenarios.
trained with only walking motions learn to perform only walking For the Strike task (Figure 1), the motion prior is trained using a
gaits, and do not show any transitions to faster running gaits even at collection of walking motion clips and punching motion clips. The
faster target speeds. As a result, these policies are not able to achieve resulting policy learns to walk to the target when it is far away, and
the faster target speeds. Similarly, policies trained with only running then transition to a punching motion once it is within range to hit
motions are not able to match slower target speeds. Training the the target. Note that the motion clips in the dataset contain strictly
motion prior with a diverse dataset results in more flexible and walking-only motions or punching-only motion, and none of the
optimal policies that are able to achieve a wider range of target clips show an actor walking to and punching a target. Instead, the
speeds. This indicates that the diversity of behaviors exhibited by policy learns to temporally sequence these different behaviors in

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
144:10 • Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa

Fig. 5. Learning curves comparing the task performance of AMP to latent

space models (Latent Space) and policies trained from scratch without
motion data (No Data). Our method achieves comparable performance
across the various tasks, while also producing higher fidelity motions.

order to fulfill the high-level task objectives. Again, this composition

of different skills emerges automatically from the motion prior,
without requiring a motion planner or other mechanisms for motion
selection.
Finally, our system can also train visuomotor policies for travers-
ing obstacle-filled environments. By providing the motion prior with Fig. 6. Snapshots of behaviors learned by the Humanoid on the single-clip
a collection of locomotion clips and rolling clips, the character learns imitation tasks. Top-to-bottom: back-flip, side-flip, cartwheel, spin, spin-
to utilize these diverse behaviors to traverse the different obstacles. kick, roll. AMP enables the character to closely imitate a diverse corpus of
The character learns to leap over obstacles such as gaps. But as it highly dynamic and acrobatic skills.
approaches the overhead obstructions, the character transitions into
a rolling behavior in order to pass underneath the obstacles. Previ- to train the motion prior. The learned low-level controller is then
ous systems that have demonstrated similar composition of diverse used to train separate high-level controllers for each downstream
maneuvers for clearing obstacle have typically required a separate task. Note that reference motions are used only during pre-training,
motion planner or manual annotations [Liu et al. 2012; Park et al. and the high-level controllers are trained to optimize only the task
2019]. Our approach provides a unified framework where the same objectives. A more in-depth description of the experimental setup
underlying algorithm is able to learn how to perform the various for the latent space model is available in Appendix C.
skills and which skill to execute in a given scenario. Furthermore, A qualitative comparison of the behaviors learned using AMP and
the character can also be trained to traverse obstacles in distinct the latent space model is available in the supplementary video. Fig-
styles by providing the motion prior with different motion clips, ure 5 compares the task performance of the different models, along
such as jumping or cartwheeling across stepping stones (Figure 3). with a baseline model trained from scratch for each task without
leveraging any motion data. Both AMP and the latent space models
8.3 Comparisons are able to produce substantially more life-like behaviors than the
An alternative approach for learning a motion prior from unstruc- baseline models. For the latent space models, since the low-level
tured motion data is to build a latent space model [Heess et al. 2016; and high-level controllers are trained separately, it is possible for
Lynch et al. 2020; Merel et al. 2020; Peng et al. 2019a]. Unlike AMP, the distribution of encodings specified by the high-level controller
which encourages a character to adopt a desired motion style directly to be different than the distribution of encodings observed by the
through an optimization objective, a latent space model enforces a low-level controller during pre-training [Luo et al. 2020]. This in
particular motion style indirectly, by using a latent representation turn can result in unnatural motions that deviate from the behav-
to constrain the policy’s actions to those that produce motions of iors depicted in the original dataset. AMP enforces a motion style
the desired style. To compare AMP to these latent space models, directly through the reward function, and is therefore able to better
we first pre-train a low-level controller using a motion tracking mitigate some of these artifacts. The more structured exploration
objective to imitate the same set of reference motions that are used behaviors from the latent space model enable the policies to solve

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control • 144:11

(a) T-Rex (Walk)

(b) Dog (Trot)

(c) Dog (Canter)

Fig. 7. AMP can be used to train complex non-humanoid characters, such as a 59 DoF T-Rex and a 64 DoF dog. By providing the motion prior with different
reference motion clips, the characters can be trained to perform various locomotion gaits, such as trotting and cantering.

downstream tasks more quickly. However, the pre-training stage pose errors even when the overall motions are similar. To better
used to construct the low-level controller can itself be sample inten- evaluate the similarity of the motions, we first apply dynamic time
sive. In our experiments, the low-level controllers are trained using warping (DTW) to align the reference motion with the motion of
300 million samples before being transferred to downstream tasks. the simulated character [Sakoe and Chiba 1978], before computing
With AMP, no such pre-training is necessary, and the motion prior the pose error between the two aligned motions. DTW is applied
can be trained jointly with the policy. using Equation 10 as the cost function.
AMP is able to closely imitate a large variety of highly dynamic
8.4 Single-Clip Imitation skills, while also avoiding many of the visual artifacts exhibited
Although our goal is to train characters with large motion datasets, by prior adversarial motion imitation systems [Merel et al. 2017;
to evaluate the effectiveness of our framework for imitating behav- Wang et al. 2017]. We compare the performance of our system to
iors from motion clips, we consider a single-clip imitation task. In results produced by the motion tracking approach from Peng et al.
this setting, the character’s objective is to imitate a single motion [2018a], which uses a manually designed reward function and re-
clip at a time, without additional task objectives. Therefore, the quires synchronization of the policy with a reference motion via
policy is trained solely to maximize the style-reward 𝑟𝑡𝑆 from the a phase variable. Figure 8 compares the learning curves of the dif-
motion prior. Unlike previous motion tracking methods, our ap- ferent methods. Since the tracking-based policies are synchronized
proach does not require a manually designed tracking objective or a with their respective reference motions, they are generally able to
phase-based synchronization of the reference motion and the policy learn faster and achieve lower errors than policies trained with AMP.
Peng et al. [2018a]. Table 3 summarizes the performance of policies Nonetheless, our method is able to produce results of comparable
trained using AMP to imitate a diverse corpus of motions. Figure 6 quality without the need to manually design or tune reward func-
and 7 illustrate examples of motions learned by the characters. Per- tions for different motions. However, for some motions, such as the
formance is evaluated using the average pose error, where the pose Front-Flip, AMP is prone to converging to locally optimal behaviors,
error 𝑒𝑡
pose
at each time step 𝑡 is computed between the pose of where instead of performing a flip, the character learns to simply
the simulated character and the reference motion using the relative shuffle forwards in order to avoid falling. Tracking-based methods
positions of each joint with respect to the root (in units of meters), can mitigate these local optima by terminating an episode early
1 Õ if the character’s pose deviates too far from the reference motion
pose
(x𝑡 − x𝑡root ) − (^ ^𝑡root ) . (10)
𝑗 𝑗
= joint x𝑡 − x [Peng et al. 2018a; Won et al. 2020]. However, this strategy is not

𝑒𝑡
𝑁 2
𝑗 ∈joints directly applicable to AMP, since the policy is not synchronized
x𝑡 and x
𝑗
^𝑡 denote the 3D Cartesian position of joint 𝑗 from the
𝑗 with the reference motion. But as shown in the previous sections,
this lack of synchronization is precisely what allows AMP to eas-
simulated character and the reference motion, and 𝑁 joint is the total
ily leverage large datasets of diverse motion clips to solve more
number of joints in the character’s body. This method of evaluating
complex tasks.
motion similarity has previously been reported to better conform
to human perception of motion similarity [Harada et al. 2004; Tang
et al. 2008]. Since AMP does not use a phase variable to synchronize
the policy with the reference motion, the motions may progress at
different rates, resulting in de-synchronization that can lead to large

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
144:12 • Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa

Table 3. Performance statistics of imitating individual motion clips without

task objectives. "Dataset Size" records the total length of motion data used
for each skill. Performance is recorded as the average pose error (in units of
meters) between the time-warped trajectories from the reference motion and
simulated character. The pose error is averaged across 3 models initialized
with different random seeds, with 32 episodes recorded per model. Each
episode has a maximum length of 20s. We compare our method (AMP) with
the motion tracking approach proposed by Peng et al. [2018a]. AMP is able
to closely imitate a diverse repertoire of complex motions, without manual
reward engineering.
Dataset Motion AMP
Character Motion Size Tracking (Ours)
Humanoid Back-Flip 1.75s 0.076 ± 0.021 0.150 ± 0.028
Cartwheel 2.72s 0.039 ± 0.011 0.067 ± 0.014
Crawl 2.93s 0.044 ± 0.001 0.049 ± 0.007
Dance 1.62s 0.038 ± 0.001 0.055 ± 0.015
Front-Flip 1.65s 0.278 ± 0.040 0.425 ± 0.010
Jog 0.83s 0.029 ± 0.001 0.056 ± 0.001
Jump 1.77s 0.033 ± 0.001 0.083 ± 0.022
Roll 2.02s 0.072 ± 0.018 0.088 ± 0.008
Run 0.80s 0.028 ± 0.002 0.075 ± 0.015
Spin 1.00s 0.063 ± 0.022 0.047 ± 0.002
Fig. 8. Learning curves of various methods on the single-clip imitation
Side-Flip 2.44s 0.191 ± 0.043 0.124 ± 0.012
tasks. We compare AMP to the motion tracking approach proposed by Peng
Spin-Kick 1.28s 0.042 ± 0.001 0.058 ± 0.012
et al. [2018a] (Motion Tracking), as well a version of AMP without velocity
Walk 1.30s 0.018 ± 0.005 0.030 ± 0.001
features for the discriminator (AMP - No Vel), and AMP without the gradient
Zombie 1.68s 0.049 ± 0.013 0.058 ± 0.014
penalty regularizer (AMP - No GP). A comprehensive collection of learning
T-Rex Turn 2.13s 0.098 ± 0.011 0.284 ± 0.023
curves for all skills are available in the Appendix. AMP produces results
Walk 2.00s 0.069 ± 0.005 0.096 ± 0.027 of comparable quality when compared to prior tracking-based methods,
Dog Canter 0.45s 0.026 ± 0.002 0.034 ± 0.002 without requiring a manually designed reward function or synchronization
Pace 0.63s 0.020 ± 0.001 0.024 ± 0.003 between the policy and reference motion. Velocity features and gradient
Spin 0.73s 0.026 ± 0.002 0.086 ± 0.008 penalty are vital for effective and consistent results on challenging skills.
Trot 0.52s 0.019 ± 0.001 0.026 ± 0.001
9 DISCUSSION AND LIMITATIONS
In this work, we presented an adversarial learning system for physics-
8.5 Ablations based character animation that enables characters to imitate diverse
Our system is able to produce substantially higher fidelity motions behaviors from large unstructured datasets, without the need for
than prior adversarial learning frameworks for physics-based char- motion planners or other mechanisms for clip selection. Our system
acter control [Merel et al. 2017; Wang et al. 2017]. In this section, we allows users to specify high-level task objectives for controlling a
identify critical design decisions that lead to more stable training character’s behaviors, while the more granular low-level style of a
and higher quality results. Figure 8 compares learning curves of character’s motions can be controlled using a learned motion prior.
policies trained on the single-clip imitation tasks with different com- Composition of disparate skills in furtherance of a task objective
ponents of the system disabled. Gradient penalty proves to be the emerges automatically from the motion prior. The motion prior also
most vital component. Models trained without this regularization enables our characters to closely imitate a rich repertoire of highly
tend to exhibit large performance fluctuations over the course of the dynamic skills, and produces results that are on par with tracking-
training, and lead to noticeable visual artifacts in the final policies, based techniques, without requiring manual reward engineering or
as shown in the supplementary video. The addition of the gradient synchronization between the controller and the reference motions.
penalty not only improves stability during training, but also leads Our system demonstrates that adversarial imitation learning tech-
to substantially faster learning across a large set of skills. The inclu- niques can indeed produce high fidelity motions for complex skills.
sion of velocity features in the discriminator’s observations is also However, like many other GAN-based techniques, AMP is suscepti-
an important component for imitating some motions. In principle, ble to mode collapse. When provided with a large dataset of diverse
including consecutive poses as input to the discriminator should motion clips, the policy is prone to imitating only a small subset
provide some information that can be used to infer the joint veloci- of the example behaviors, ignoring other behaviors that may ulti-
ties. But we found that this was insufficient for some motions, such mately be more optimal for a given task. The motion priors in our
as rolling. As shown in the supplementary video, in the absence of experiments are also trained from scratch for each policy. But since
velocity features, the character is prone to converging to a strat- the motion prior is largely task-agnostic, it should in principle be
egy of holding a fixed pose on the ground, instead of performing possible to transfer and reuse motion priors for different policies
a roll. The additional velocity features are able to mitigate these and tasks. Exploring techniques for developing general and trans-
undesirable behaviors. ferable motion priors may lead to modular objective functions that

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control • 144:13

can be conveniently incorporated into downstream tasks, without W. Burgard, O. Brock, and C. Stachniss. 2008. Learning Omnidirectional Path Following
requiring retraining for each new task. While the motion prior does Using Dimensionality Reduction. 257–264.
Nuttapong Chentanez, Matthias Müller, Miles Macklin, Viktor Makoviychuk, and Stefan
not require direct access to task-specific information, the data used Jeschke. 2018. Physics-Based Motion Capture Imitation with Deep Reinforcement
to train the motion prior is generated by policies trained to perform Learning. In Proceedings of the 11th Annual International Conference on Motion,
Interaction, and Games (Limassol, Cyprus) (MIG ’18). Association for Computing
a particular task. This may introduce some task dependencies into Machinery, New York, NY, USA, Article 1, 10 pages. https://doi.org/10.1145/3274247.
the motion prior, which can hinder its ability to be transferred to 3274506
other tasks. Training motion priors using data generated from larger CMU. [n.d.]. CMU Graphics Lab Motion Capture Database. http://mocap.cs.cmu.edu/.
Erwin Coumans et al. 2013. Bullet physics library. Open source: bulletphysics. org 15, 49
and more diverse repertoires of tasks may help to facilitate trans- (2013), 5.
ferring the learned motion priors to new tasks. Our experiments M. Da Silva, Y. Abe, and J. Popovic. 2008. Simulation of Human Motion Data using
also focus primarily on tasks that involve temporal composition Short-Horizon Model-Predictive Control. Computer Graphics Forum (2008). https:
//doi.org/10.1111/j.1467-8659.2008.01134.x
of different skills, which require the character to perform different Carlos Florensa, Yan Duan, and Pieter Abbeel. 2017. Stochastic Neural Networks for
behaviors at different points in time. However, spatial composition Hierarchical Reinforcement Learning. In Proceedings of the International Conference
on Learning Representations (ICLR).
might also be vital for some tasks that require a character to perform Thomas Geijtenbeek, Michiel van de Panne, and A. Frank van der Stappen. 2013. Flexible
multiple skills simultaneously. Developing motion priors that are Muscle-Based Locomotion for Bipedal Creatures. ACM Transactions on Graphics 32,
more amenable to spatial composition of disparate skills may lead to 6 (2013).
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
more flexible and sophisticated behaviors. Despite these limitations, Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative Adversarial Nets. In
we hope this work provides a useful tool that enables physically Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling,
simulated characters to take advantage of the large motion datasets C. Cortes, N. D. Lawrence, and K. Q. Weinberger (Eds.). Curran Associates, Inc.,
2672–2680. http://papers.nips.cc/paper/5423-generative-adversarial-nets.pdf
that have been so effective for kinematic animation techniques, F. Sebastin Grassia. 1998. Practical Parameterization of Rotations Using the Exponential
and open exciting directions for future exploration in data-driven Map. J. Graph. Tools 3, 3 (March 1998), 29–48. https://doi.org/10.1080/10867651.
1998.10487493
physics-based character animation. Keith Grochow, Steven L. Martin, Aaron Hertzmann, and Zoran Popović. 2004. Style-
Based Inverse Kinematics. ACM Trans. Graph. 23, 3 (Aug. 2004), 522–531. https:
//doi.org/10.1145/1015706.1015755
ACKNOWLEDGMENTS Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C
We thank Sony Interactive Entertainment for providing reference Courville. 2017. Improved Training of Wasserstein GANs. In Advances in Neural
Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach,
motion data for this project, Bonny Ho for narrating the video, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., 5767–5777.
the anonymous reviewers for their helpful feedback, and AWS for http://papers.nips.cc/paper/7159-improved-training-of-wasserstein-gans.pdf
providing computational resources. This research was funded by Tuomas Haarnoja, Kristian Hartikainen, Pieter Abbeel, and Sergey Levine. 2018. Latent
Space Policies for Hierarchical Reinforcement Learning (Proceedings of Machine
an NSERC Postgraduate Scholarship, and a Berkeley Fellowship for Learning Research, Vol. 80), Jennifer Dy and Andreas Krause (Eds.). PMLR, Stock-
Graduate Study. holmsmässan, Stockholm Sweden, 1851–1860. http://proceedings.mlr.press/v80/
haarnoja18a.html
T. Harada, S. Taoka, T. Mori, and T. Sato. 2004. Quantitative evaluation method for pose
REFERENCES and motion similarity based on human perception. In 4th IEEE/RAS International
Conference on Humanoid Robots, 2004., Vol. 1. 494–512 Vol. 1. https://doi.org/10.
Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, 1109/ICHR.2004.1442140
Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Karol Hausman, Jost Tobias Springenberg, Ziyu Wang, Nicolas Heess, and Martin
Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Riedmiller. 2018. Learning an Embedding Space for Transferable Robot Skills. In
Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat International Conference on Learning Representations. https://openreview.net/forum?
Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, id=rk07ZXZRb
Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Nicolas Heess, Gregory Wayne, Yuval Tassa, Timothy P. Lillicrap, Martin A. Riedmiller,
Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin and David Silver. 2016. Learning and Transfer of Modulated Locomotor Controllers.
Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine CoRR abs/1610.05182 (2016). arXiv:1610.05182 http://arxiv.org/abs/1610.05182
Learning on Heterogeneous Systems. http://tensorflow.org/ Software available Jonathan Ho and Stefano Ermon. 2016. Generative Adversarial Imitation Learning.
from tensorflow.org. In Advances in Neural Information Processing Systems 29, D. D. Lee, M. Sugiyama,
Pieter Abbeel and Andrew Y. Ng. 2004. Apprenticeship Learning via Inverse Rein- U. V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., 4565–4573.
forcement Learning. In Proceedings of the Twenty-First International Conference on http://papers.nips.cc/paper/6391-generative-adversarial-imitation-learning.pdf
Machine Learning (Banff, Alberta, Canada) (ICML ’04). Association for Computing Daniel Holden, Taku Komura, and Jun Saito. 2017. Phase-Functioned Neural Networks
Machinery, New York, NY, USA, 1. https://doi.org/10.1145/1015330.1015430 for Character Control. ACM Trans. Graph. 36, 4, Article 42 (July 2017), 13 pages.
Shailen Agrawal and Michiel van de Panne. 2016. Task-based Locomotion. ACM https://doi.org/10.1145/3072959.3073663
Transactions on Graphics (Proc. SIGGRAPH 2016) 35, 4 (2016). Yifeng Jiang, Tom Van Wouwe, Friedl De Groote, and C. Karen Liu. 2019. Synthesis
M. Al Borno, M. de Lasa, and A. Hertzmann. 2013. Trajectory Optimization for Full- of Biologically Realistic Human Motion Using Joint Torque Actuation. ACM Trans.
Body Movements with Complex Contacts. IEEE Transactions on Visualization and Graph. 38, 4, Article 72 (July 2019), 12 pages. https://doi.org/10.1145/3306346.
Computer Graphics 19, 8 (2013), 1405–1414. https://doi.org/10.1109/TVCG.2012.325 3322966
Martin Arjovsky, Soumith Chintala, and Léon Bottou. 2017. Wasserstein Generative Angjoo Kanazawa, Michael J. Black, David W. Jacobs, and Jitendra Malik. 2018. End-to-
Adversarial Networks (Proceedings of Machine Learning Research, Vol. 70), Doina end Recovery of Human Shape and Pose. In Computer Vision and Pattern Regognition
Precup and Yee Whye Teh (Eds.). PMLR, International Convention Centre, Sydney, (CVPR).
Australia, 214–223. http://proceedings.mlr.press/v70/arjovsky17a.html Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. 2017. Progressive Growing
Kevin Bergamin, Simon Clavet, Daniel Holden, and James Richard Forbes. 2019. DReCon: of GANs for Improved Quality, Stability, and Variation. CoRR abs/1710.10196 (2017).
Data-Driven Responsive Control of Physics-Based Characters. ACM Trans. Graph. arXiv:1710.10196 http://arxiv.org/abs/1710.10196
38, 6, Article 206 (Nov. 2019), 11 pages. https://doi.org/10.1145/3355089.3356536 Liyiming Ke, Matt Barnes, Wen Sun, Gilwoo Lee, Sanjiban Choudhury, and Sid-
David Berthelot, Tom Schumm, and Luke Metz. 2017. BEGAN: Boundary Equilibrium dhartha S. Srinivasa. 2019. Imitation Learning as f-Divergence Minimization. CoRR
Generative Adversarial Networks. CoRR abs/1703.10717 (2017). arXiv:1703.10717 abs/1905.12888 (2019). arXiv:1905.12888 http://arxiv.org/abs/1905.12888
http://arxiv.org/abs/1703.10717 Diederik P. Kingma and Max Welling. 2014. Auto-Encoding Variational
Mariusz Bojarski, Davide Del Testa, Daniel Dworakowski, Bernhard Firner, Beat Flepp, Bayes. In 2nd International Conference on Learning Representations, ICLR
Prasoon Goyal, Lawrence D. Jackel, Mathew Monfort, Urs Muller, Jiakai Zhang, Xin 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
Zhang, Jake Zhao, and Karol Zieba. 2016. End to End Learning for Self-Driving Cars. arXiv:http://arxiv.org/abs/1312.6114v10 [stat.ML]
CoRR abs/1604.07316 (2016). arXiv:1604.07316 http://arxiv.org/abs/1604.07316

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
144:14 • Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa

Naveen Kodali, Jacob D. Abernethy, James Hays, and Zsolt Kira. 2017. How to Train York, NY, USA, Article 81, 9 pages. https://doi.org/10.1145/1576246.1531387
Your DRAGAN. CoRR abs/1705.07215 (2017). arXiv:1705.07215 http://arxiv.org/abs/ Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted
1705.07215 Boltzmann Machines. In Proceedings of the 27th International Conference on Interna-
Taesoo Kwon and Jessica K. Hodgins. 2017. Momentum-Mapped Inverted Pendulum tional Conference on Machine Learning (Haifa, Israel) (ICML’10). Omnipress, Madison,
Models for Controlling Dynamic Human Motions. ACM Trans. Graph. 36, 4, Article WI, USA, 807–814.
145d (Jan. 2017), 14 pages. https://doi.org/10.1145/3072959.2983616 Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. 2016. f-GAN: Training Genera-
Jehee Lee, Jinxiang Chai, Paul S. A. Reitsma, Jessica K. Hodgins, and Nancy S. Pollard. tive Neural Samplers using Variational Divergence Minimization. In Advances in
2002. Interactive Control of Avatars Animated with Human Motion Data. ACM Neural Information Processing Systems, D. Lee, M. Sugiyama, U. Luxburg, I. Guyon,
Trans. Graph. 21, 3 (July 2002), 491–500. https://doi.org/10.1145/566654.566607 and R. Garnett (Eds.), Vol. 29. Curran Associates, Inc., 271–279. https://proceedings.
Kyungho Lee, Seyoung Lee, and Jehee Lee. 2018. Interactive Character Animation by neurips.cc/paper/2016/file/cedebb6e872f539bef8c3f919874e9d7-Paper.pdf
Learning Multi-Objective Control. ACM Trans. Graph. 37, 6, Article 180 (Dec. 2018), Soohwan Park, Hoseok Ryu, Seyoung Lee, Sunmin Lee, and Jehee Lee. 2019. Learning
10 pages. https://doi.org/10.1145/3272127.3275071 Predict-and-Simulate Policies from Unorganized Human Motion Data. ACM Trans.
Seunghwan Lee, Moonseok Park, Kyoungmin Lee, and Jehee Lee. 2019. Scalable Muscle- Graph. 38, 6, Article 205 (Nov. 2019), 11 pages. https://doi.org/10.1145/3355089.
Actuated Human Simulation and Control. ACM Trans. Graph. 38, 4, Article 73 (July 3356501
2019), 13 pages. https://doi.org/10.1145/3306346.3322972 Xue Bin Peng, Pieter Abbeel, Sergey Levine, and Michiel van de Panne. 2018a. Deep-
Yoonsang Lee, Sungeun Kim, and Jehee Lee. 2010a. Data-Driven Biped Control. ACM Mimic: Example-guided Deep Reinforcement Learning of Physics-based Charac-
Trans. Graph. 29, 4, Article 129 (July 2010), 8 pages. https://doi.org/10.1145/1778765. ter Skills. ACM Trans. Graph. 37, 4, Article 143 (July 2018), 14 pages. https:
1781155 //doi.org/10.1145/3197517.3201311
Yongjoon Lee, Kevin Wampler, Gilbert Bernstein, Jovan Popović, and Zoran Popović. Xue Bin Peng, Glen Berseth, and Michiel van de Panne. 2016. Terrain-adaptive Loco-
2010b. Motion Fields for Interactive Character Locomotion. ACM Trans. Graph. 29, motion Skills Using Deep Reinforcement Learning. ACM Trans. Graph. 35, 4, Article
6, Article 138 (Dec. 2010), 8 pages. https://doi.org/10.1145/1882261.1866160 81 (July 2016), 12 pages. https://doi.org/10.1145/2897824.2925881
Sergey Levine, Yongjoon Lee, Vladlen Koltun, and Zoran Popović. 2011. Space-Time Xue Bin Peng, Glen Berseth, Kangkang Yin, and Michiel Van De Panne. 2017. DeepLoco:
Planning with Parameterized Locomotion Controllers. ACM Trans. Graph. 30, 3, Dynamic Locomotion Skills Using Hierarchical Deep Reinforcement Learning. ACM
Article 23 (May 2011), 11 pages. https://doi.org/10.1145/1966394.1966402 Trans. Graph. 36, 4, Article 41 (July 2017), 13 pages. https://doi.org/10.1145/3072959.
Sergey Levine, Jack M. Wang, Alexis Haraux, Zoran Popović, and Vladlen Koltun. 3073602
2012. Continuous Character Control with Low-Dimensional Embeddings. ACM Xue Bin Peng, Michael Chang, Grace Zhang, Pieter Abbeel, and Sergey Levine. 2019a.
Transactions on Graphics 31, 4 (2012), 28. MCP: Learning Composable Hierarchical Control with Multiplicative Composi-
Hung Yu Ling, Fabio Zinno, George Cheng, and Michiel Van De Panne. 2020. Character tional Policies. In Advances in Neural Information Processing Systems 32, H. Wallach,
Controllers Using Motion VAEs. ACM Trans. Graph. 39, 4, Article 40 (July 2020), H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Cur-
12 pages. https://doi.org/10.1145/3386569.3392422 ran Associates, Inc., 3681–3692. http://papers.nips.cc/paper/8626-mcp-learning-
Libin Liu, Michiel van de Panne, and KangKang Yin. 2016. Guided Learning of Control composable-hierarchical-control-with-multiplicative-compositional-policies.pdf
Graphs for Physics-Based Characters. ACM Transactions on Graphics 35, 3 (2016). Xue Bin Peng, Angjoo Kanazawa, Jitendra Malik, Pieter Abbeel, and Sergey Levine.
Libin Liu, KangKang Yin, Michiel van de Panne, and Baining Guo. 2012. Terrain runner: 2018b. SFV: Reinforcement Learning of Physical Skills from Videos. ACM Trans.
control, parameterization, composition, and planning for highly dynamic motions. Graph. 37, 6, Article 178 (Nov. 2018), 14 pages.
ACM Transactions on Graphics (TOG) 31, 6 (2012), 154. Xue Bin Peng, Angjoo Kanazawa, Sam Toyer, Pieter Abbeel, and Sergey Levine. 2019b.
Libin Liu, KangKang Yin, Michiel van de Panne, Tianjia Shao, and Weiwei Xu. 2010. Variational Discriminator Bottleneck: Improving Imitation Learning, Inverse RL, and
Sampling-based contact-rich motion control. ACM Trans. Graph. 29, 4, Article 128 GANs by Constraining Information Flow. In International Conference on Learning
(July 2010), 10 pages. https://doi.org/10.1145/1778765.1778865 Representations. https://openreview.net/forum?id=HyxPx3R9tm
Ying-Sheng Luo, Jonathan Hans Soeseno, Trista Pei-Chun Chen, and Wei-Chao Chen. Dean A. Pomerleau. 1988. ALVINN: An Autonomous Land Vehicle in a Neural Network.
2020. CARL: Controllable Agent with Reinforcement Learning for Quadruped In Proceedings of the 1st International Conference on Neural Information Processing
Locomotion. ACM Trans. Graph. 39, 4, Article 38 (July 2020), 10 pages. https: Systems (NIPS’88). MIT Press, Cambridge, MA, USA, 305–313.
//doi.org/10.1145/3386569.3392433 Alec Radford, Luke Metz, and Soumith Chintala. 2015. Unsupervised Representa-
Corey Lynch, Mohi Khansari, Ted Xiao, Vikash Kumar, Jonathan Tompson, Sergey tion Learning with Deep Convolutional Generative Adversarial Networks. CoRR
Levine, and Pierre Sermanet. 2020. Learning Latent Plans from Play. In Proceedings of abs/1511.06434 (2015). arXiv:1511.06434 http://arxiv.org/abs/1511.06434
the Conference on Robot Learning (Proceedings of Machine Learning Research, Vol. 100), Marc H. Raibert and Jessica K. Hodgins. 1991. Animation of Dynamic Legged Loco-
Leslie Pack Kaelbling, Danica Kragic, and Komei Sugiura (Eds.). PMLR, 1113–1132. motion. In Proceedings of the 18th Annual Conference on Computer Graphics and
http://proceedings.mlr.press/v100/lynch20a.html Interactive Techniques (SIGGRAPH ’91). Association for Computing Machinery, New
X. Mao, Q. Li, H. Xie, R. Y. K. Lau, Z. Wang, and S. P. Smolley. 2017. Least Squares York, NY, USA, 349–358. https://doi.org/10.1145/122718.122755
Generative Adversarial Networks. In 2017 IEEE International Conference on Computer Stephane Ross, Geoffrey Gordon, and Drew Bagnell. 2011. A Reduction of Imitation
Vision (ICCV). 2813–2821. https://doi.org/10.1109/ICCV.2017.304 Learning and Structured Prediction to No-Regret Online Learning (Proceedings of
Josh Merel, Leonard Hasenclever, Alexandre Galashov, Arun Ahuja, Vu Pham, Greg Machine Learning Research, Vol. 15), Geoffrey Gordon, David Dunson, and Miroslav
Wayne, Yee Whye Teh, and Nicolas Heess. 2019. Neural Probabilistic Motor Primi- Dudík (Eds.). JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL,
tives for Humanoid Control. In International Conference on Learning Representations. USA, 627–635. http://proceedings.mlr.press/v15/ross11a.html
https://openreview.net/forum?id=BJl6TjRcY7 Alla Safonova and Jessica K. Hodgins. 2007. Construction and Optimal Search of
Josh Merel, Yuval Tassa, Dhruva TB, Sriram Srinivasan, Jay Lemmon, Ziyu Wang, Interpolated Motion Graphs. ACM Trans. Graph. 26, 3 (July 2007), 106–es. https:
Greg Wayne, and Nicolas Heess. 2017. Learning human behaviors from motion //doi.org/10.1145/1276377.1276510
capture by adversarial imitation. CoRR abs/1707.02201 (2017). arXiv:1707.02201 H. Sakoe and S. Chiba. 1978. Dynamic programming algorithm optimization for spoken
http://arxiv.org/abs/1707.02201 word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing 26, 1
Josh Merel, Saran Tunyasuvunakool, Arun Ahuja, Yuval Tassa, Leonard Hasenclever, Vu (1978), 43–49. https://doi.org/10.1109/TASSP.1978.1163055
Pham, Tom Erez, Greg Wayne, and Nicolas Heess. 2020. Catch and Carry: Reusable Tim Salimans, Ian J. Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and
Neural Controllers for Vision-Guided Whole-Body Tasks. ACM Trans. Graph. 39, 4, Xi Chen. 2016. Improved Techniques for Training GANs. CoRR abs/1606.03498
Article 39 (July 2020), 14 pages. https://doi.org/10.1145/3386569.3392474 (2016). arXiv:1606.03498 http://arxiv.org/abs/1606.03498
Lars Mescheder, Andreas Geiger, and Sebastian Nowozin. 2018. Which Training Meth- John Schulman, Philipp Moritz, Sergey Levine, Michael I. Jordan, and Pieter Abbeel. 2015.
ods for GANs do actually Converge?. In Proceedings of the 35th International Con- High-Dimensional Continuous Control Using Generalized Advantage Estimation.
ference on Machine Learning (Proceedings of Machine Learning Research, Vol. 80), CoRR abs/1506.02438 (2015). arXiv:1506.02438
Jennifer Dy and Andreas Krause (Eds.). PMLR, Stockholmsmässan, Stockholm Swe- John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov.
den, 3481–3490. http://proceedings.mlr.press/v80/mescheder18a.html 2017. Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017).
Igor Mordatch, Emanuel Todorov, and Zoran Popović. 2012. Discovery of Complex arXiv:1707.06347 http://arxiv.org/abs/1707.06347
Behaviors through Contact-Invariant Optimization. ACM Trans. Graph. 31, 4, Article SFU. [n.d.]. SFU Motion Capture Database. http://mocap.cs.sfu.ca/.
43 (July 2012), 8 pages. https://doi.org/10.1145/2185520.2185539 Dana Sharon and Michiel van de Panne. 2005. Synthesis of Controllers for Stylized
Igor Mordatch, Jack M. Wang, Emanuel Todorov, and Vladlen Koltun. 2013. Animating Planar Bipedal Walking. In Proc. of IEEE International Conference on Robotics and
Human Lower Limbs Using Contact-Invariant Optimization. ACM Trans. Graph. 32, Animation.
6, Article 203 (Nov. 2013), 8 pages. https://doi.org/10.1145/2508363.2508365 Kwang Won Sok, Manmyung Kim, and Jehee Lee. 2007. Simulating Biped Behaviors
Uldarico Muico, Yongjoon Lee, Jovan Popović, and Zoran Popović. 2009. Contact-Aware from Human Motion Data. ACM Trans. Graph. 26, 3 (July 2007), 107–es. https:
Nonlinear Control of Dynamic Characters. In ACM SIGGRAPH 2009 Papers (New //doi.org/10.1145/1276377.1276511
Orleans, Louisiana) (SIGGRAPH ’09). Association for Computing Machinery, New

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control • 144:15

Sebastian Starke, He Zhang, Taku Komura, and Jun Saito. 2019. Neural State Machine
for Character-Scene Interactions. ACM Trans. Graph. 38, 6, Article 209 (Nov. 2019),
14 pages. https://doi.org/10.1145/3355089.3356505
Richard S. Sutton and Andrew G. Barto. 1998. Introduction to Reinforcement Learning
(1st ed.). MIT Press, Cambridge, MA, USA.
Jie Tan, Yuting Gu, C. Karen Liu, and Greg Turk. 2014. Learning Bicycle Stunts. ACM
Trans. Graph. 33, 4, Article 50 (July 2014), 12 pages. https://doi.org/10.1145/2601097.
2601121
Jeff Tang, Howard Leung, Taku Komura, and Hubert Shum. 2008. Emulating human
perception of motion similarity. Computer Animation and Virtual Worlds 19 (08
2008), 211–221. https://doi.org/10.1002/cav.260
Yuval Tassa, Yotam Doron, Alistair Muldal, Tom Erez, Yazhe Li, Diego de Las Casas,
David Budden, Abbas Abdolmaleki, Josh Merel, Andrew Lefrancq, Timothy P. Lilli-
crap, and Martin A. Riedmiller. 2018. DeepMind Control Suite. CoRR abs/1801.00690
(2018). arXiv:1801.00690 http://arxiv.org/abs/1801.00690
Faraz Torabi, Garrett Warnell, and Peter Stone. 2018. Generative Adversarial Imitation
from Observation. CoRR abs/1807.06158 (2018). arXiv:1807.06158 http://arxiv.org/
abs/1807.06158
Adrien Treuille, Yongjoon Lee, and Zoran Popović. 2007. Near-Optimal Character
Animation with Continuous Control. In ACM SIGGRAPH 2007 Papers (San Diego,
California) (SIGGRAPH ’07). Association for Computing Machinery, New York, NY,
USA, 7–es. https://doi.org/10.1145/1275808.1276386
Michiel van de Panne, Ryan Kim, and Eugene Flume. 1994. Virtual Wind-up Toys for
Animation. In Proceedings of Graphics Interface ’94. 208–215.
Kevin Wampler, Zoran Popović, and Jovan Popović. 2014. Generalizing Locomotion
Style to New Animals with Inverse Optimal Regression. ACM Trans. Graph. 33, 4,
Article 49 (July 2014), 11 pages. https://doi.org/10.1145/2601097.2601192
Jack M. Wang, David J. Fleet, and Aaron Hertzmann. 2009. Optimizing Walking
Controllers. In ACM SIGGRAPH Asia 2009 Papers (Yokohama, Japan) (SIGGRAPH
Asia ’09). Association for Computing Machinery, New York, NY, USA, Article 168,
8 pages. https://doi.org/10.1145/1661412.1618514
Jack M. Wang, Samuel R. Hamner, Scott L. Delp, and Vladlen Koltun. 2012. Optimizing
Locomotion Controllers Using Biologically-Based Actuators and Objectives. ACM
Trans. Graph. 31, 4, Article 25 (July 2012), 11 pages. https://doi.org/10.1145/2185520.
2185521
Ziyu Wang, Josh S Merel, Scott E Reed, Nando de Freitas, Gregory Wayne, and
Nicolas Heess. 2017. Robust Imitation of Diverse Behaviors. In Advances
in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio,
H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30. Cur-
ran Associates, Inc., 5320–5329. https://proceedings.neurips.cc/paper/2017/file/
044a23cadb567653eb51d4eb40acaa88-Paper.pdf
Jungdam Won, Deepak Gopinath, and Jessica Hodgins. 2020. A Scalable Approach to
Control Diverse Behaviors for Physically Simulated Characters. ACM Trans. Graph.
39, 4, Article 33 (July 2020), 12 pages. https://doi.org/10.1145/3386569.3392381
Yuting Ye and C. Karen Liu. 2010. Synthesis of Responsive Motion Using a Dynamic
Model. Computer Graphics Forum (2010). https://doi.org/10.1111/j.1467-8659.2009.
01625.x
Wenhao Yu, Greg Turk, and C. Karen Liu. 2018. Learning Symmetric and Low-Energy
Locomotion. ACM Trans. Graph. 37, 4, Article 144 (July 2018), 12 pages. https:
//doi.org/10.1145/3197517.3201397
He Zhang, Sebastian Starke, Taku Komura, and Jun Saito. 2018. Mode-Adaptive Neural
Networks for Quadruped Motion Control. ACM Trans. Graph. 37, 4, Article 145 (July
2018), 11 pages. https://doi.org/10.1145/3197517.3201366
Brian D. Ziebart, Andrew Maas, J. Andrew Bagnell, and Anind K. Dey. 2008. Maximum
Entropy Inverse Reinforcement Learning. In Proceedings of the 23rd National Confer-
ence on Artificial Intelligence - Volume 3 (Chicago, Illinois) (AAAI’08). AAAI Press,
1433–1438.
Victor Brian Zordan and Jessica K. Hodgins. 2002. Motion Capture-Driven Simulations
That Hit and React. In Proceedings of the 2002 ACM SIGGRAPH/Eurographics Sympo-
sium on Computer Animation (San Antonio, Texas) (SCA ’02). Association for Comput-
ing Machinery, New York, NY, USA, 89–96. https://doi.org/10.1145/545261.545276

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
144:16 • Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa

APPENDIX q̃𝑡ball , linear velocity x¤̃ 𝑡ball , and angular velocity q¤̃ 𝑡ball of the ball in
A TASKS the character’s local coordinate frame.
In this section, we provide a detailed description of each task, and Strike: Finally, to further demonstrate our approach’s ability to
the task reward functions used during training. compose diverse behaviors, we consider a task where the charac-
ter’s objective is to strike a target using a designated end-effector
Target Heading: In this task, the objective for the character is to (e.g. hands). The target may be located at various distances from
move along a target heading direction d∗ at a target speed 𝑣 ∗ . The the character. Therefore, the character must first move close to the
goal input for the policy is specified as g𝑡 = ( d̃𝑡∗, 𝑣 ∗ ), with d̃𝑡∗ being target before striking it. These distinct phases of the task entail
the target direction in the character’s local coordinate frame. The different optimal behaviors, and thus requires the policy to compose
task-reward is calculated according to: and transition between the appropriate skills. The goal g𝑡 = (x̃𝑡∗, ℎ𝑡 )
records the location of the target x̃𝑡∗ in the character’s local coor-
2
𝑟𝑡𝐺 = exp −0.25 𝑣 ∗ − d∗ · x¤ 𝑡com , (11)
dinate frame, along with an indicator variable ℎ𝑡 that specifies if
where 𝑥¤𝑡com is the center-of-mass velocity of the character at time the target has already been hit. The task-reward is partitioned into
step 𝑡, and the target speed is selected randomly between 𝑣 ∗ ∈ three phases:
[1, 5]m/s. For slower moving styles, such as Zombie and Stealthy,  1, target has been hit
the target speed is fixed at 1m/s.


(18)


𝑟𝑡𝐺 = 0.3 𝑟𝑡near + 0.3, ||x∗ − x𝑡root || < 1.375𝑚 .
Target Location: In this task, the character’s objective is to move 
 0.3 𝑟 far,

otherwise
to a target location x∗ . The goal g𝑡 = x̃𝑡∗ records the target location  𝑡

in the character’s local coordinate frame. The task-reward is given If the character is far from the target x∗ , 𝑟𝑡far encourages the char-
by: acter to move to the target using a similar reward function as the
Target Location task (Equation 12). Once the character is within a
𝑟𝑡𝐺 = 0.7 exp −0.5||x∗ − x𝑡root || 2 given distance of the target, 𝑟𝑡near encourages the character to strike
2 the target with a particular end-effector,
+ 0.3 exp − max 0, 𝑣 ∗ − d𝑡∗ · x¤ 𝑡com . (12)
2
Here, 𝑣 ∗ = 1𝑚/𝑠 specifies a minimum target speed at which the 𝑟𝑡near = 0.2 exp −2||x∗ − x𝑡eff || 2 + 0.8 clip d𝑡∗ · x¤ 𝑡eff , 0, 1 ,
3
character should move towards the target, and the character will
not be penalized for moving faster than this threshold. d𝑡∗ is a unit where x𝑡eff and x¤ 𝑡eff denote the position and velocity of the end-
vector on the horizontal plane that points from the character’s root effector, and d𝑡∗ is a unit vector pointing from the character’s root to
to the target. the target. After striking the target, the character receives a constant
reward of 1 for the remaining time steps.
Dribbling: To evaluate our system on more complex object ma-
nipulation tasks, we train policies for a dribbling task, where the Obstacles: Finally, we consider tasks that involve visual percep-
objective is for the character to dribble a soccer ball to a target tion and interaction with more complex environments, where the
location. The reward function is given by: character’s goal is to traverse an obstacle filled environment, while
maintaining a target speed. Policies are trained for two types of envi-
cp bp
𝑟𝑡𝐺 = 0.1𝑟𝑡cv + 0.1𝑟𝑡 + 0.3𝑟𝑡bv + 0.5𝑟𝑡 (13) ronments. 1) An environment containing a combination of obstacles
2 including gaps, steps, and overhead obstacles that the character
𝑟𝑡cv = exp −1.5 max 0, 𝑣 ∗ − d𝑡ball · x¤ 𝑡com (14) must duck under. 2) An environment containing stepping stones
that requires more precise contact planning. Examples of the envi-
cp
𝑟𝑡 = exp −0.5 ||x𝑡ball − x𝑡com || 2 (15) ronment are available in Figure 1 and 3. The task-reward is the same
2 as the one used for the Target Heading task (Equation 11), except
𝑟𝑡bv = exp −max 0, 𝑣 ∗ − d𝑡∗ · x¤ 𝑡ball (16) the target heading is fixed along the direction of forward progress.
In order for the policy to perceive the upcoming obstacles, the state
bp
𝑟𝑡 = exp −0.5 ||x𝑡∗ − x𝑡com || 2 . (17) is augmented with a 1D height-field of the upcoming terrain. The
cp
height-field records the height of the terrain at 100 sample locations,
𝑟𝑡cv and 𝑟𝑡 encourages the character to move towards and stay near uniformly spanning 10m ahead of the character.
the ball, where x𝑡ball and x¤ 𝑡ball represent the position and velocity
of the ball, d𝑡ball is a unit vector pointing from the character to the B AMP HYPERPARAMETERS
ball, and 𝑣 ∗ = 1m/s is the target velocity at which the character Hyperparameter settings used in the AMP experiments are available
bp
should move towards the ball. Similarly, 𝑟𝑡bv and 𝑟𝑡 encourages the in Table 4. For single-clip imitation tasks, we found that a smaller
character to move the ball to the target location, with d𝑡∗ denoting a discount factor 𝛾 = 0.95 allows the character to more closely imitate
unit vector pointing from the ball to the target. The goal g𝑡 = x̃𝑡∗ a given reference motion. A larger discount factor 𝛾 = 0.99 is used
records the relative position of the target location with respect to the for experiments that include additional task objective, since these
character. The state s𝑡 is augmented with additional features that tasks may require longer horizon planning, such as Dribble and
describe the state of the ball, including the position x̃𝑡ball , orientation Strike.

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control • 144:17

Table 4. AMP hyperparameters. During pretrainig, the latent space model is trained using a motion
imitation, where the objective is for the character to imitate a corpus
Parameter Value of motion clips. A reference motion is selected randomly at the start
𝑤 𝐺 Task-Reward Weight 0.5 of each episode, and a new reference motion is selected every 5-10s.
𝑤 𝑆 Style-Reward Weight 0.5 The goal g𝑡 = (^𝑞𝑡 +1, 𝑞^𝑡 +2 ) specifies target poses from the reference
𝑤 gp Gradient Penalty 10 motion at the next two time steps.
Samples Per Update Iteration 4096 The networks used for 𝜋 and 𝑢 follow a similar architecture as
Batch Size 256 the networks used for the policies trained with AMP. The encoder
𝐾 Discriminator Batch Size 256 𝑞 is modeled by a network consisting of two hidden layers, with
𝜋 Policy Stepsize (Single-Clip Imitation) 2 × 10−6 512 and 256 hidden units, followed by a linear output layer for
𝜋 Policy Stepsize (Tasks) 4 × 10−6 𝜇𝑞 (g𝑡 ) and Σ𝑞 (g𝑡 ). The size of the latent encoding is set to 16D.
𝑉 Value Stepsize (Single-Clip Imitation) 10−4 Hyperparameter settings are available in Table 5.
𝑉 Value Stepsize (Tasks) 2 × 10−5
𝐷 Discriminator Stepsize 10−5 Table 5. Latent space model hyperparameters.
B Discriminator Replay Buffer Size 105
𝛾 Discount (Single-Clip Imitation) 0.95 Parameter Value
𝛾 Discount (Tasks) 0.99 Latent Encoding Dimension 16
SGD Momentum 0.9 𝜆 KL-Regularizer 10−4
GAE(𝜆) 0.95 Samples Per Update Iteration 4096
TD(𝜆) 0.95 Batch Size 256
PPO Clip Threshold 0.2 𝜋 Policy Stepsize (Pre-Training) 2.5 × 10−6
𝑢 Policy Stepsize (Downstream Task) 10−4
𝑉 Value Stepsize 10−3
C LATENT SPACE MODEL 𝛾 Discount (Pre-Training) 0.95
The latent space model follows a similar architecture as Peng et al. 𝛾 Discount (Downstream Task) 0.99
[2019a] and Merel et al. [2019]. During pretraining, an encoder SGD Momentum 0.9
𝑞(z𝑡 |g𝑡 ) maps a goal g𝑡 to a distribution over latent variables z𝑡 . GAE(𝜆) 0.95
A latent encoding z𝑡 ∼ 𝑞(z𝑡 |g𝑡 ) is then sampled from the encoder TD(𝜆) 0.95
distribution and passed to the policy as an input 𝜋 (a𝑡 |s𝑡 , z𝑡 ). The PPO Clip Threshold 0.2
latent distribution is modeled as a Gaussian distribution 𝑞(z𝑡 |g𝑡 ) =
N (𝜇𝑞 (g𝑡 ), Σ𝑞 (g𝑡 )), with mean 𝜇𝑞 (g𝑡 ) and diagonal covariance ma-
trix Σ𝑞 (g𝑡 ). The encoder is trained jointly with the policy using the
following objective: D SPATIAL COMPOSITION
Our experiments have so far focused primarily on temporal composi-
"𝑇 −1 #
Õ
tions of skills, where a character performs different skills at different
𝑡
arg max E𝑝 (𝜏 |𝜋,𝑞) 𝛾 𝑟𝑡 + 𝜆E𝑝 ( g𝑡 ) [DKL [𝑞(·|g𝑡 )||𝑝 0 ]] .
points in time in order to fulfill particular task objectives, such as
𝜋,𝑞 𝑡 =0
(19)
walking to a target and then punching it. In this section, we explore
𝜏 = {(s𝑡 , a𝑡 , g𝑡 , 𝑟𝑡 )𝑇𝑡 =−01, s𝑇 , g𝑇 } represents the goal-augmented tra- settings that require spatial composition of multiple skills, where the
jectory, where the goal g𝑡 may vary at each time step, and task requires a character to perform different skills simultaneously.
To evaluate AMP in this setting, we consider a compositional task
𝑇Ö−1
where a character needs to walk along a target heading direction
𝑝 (𝜏 |𝜋, 𝑞) =𝑝 (g0 )𝑝 (s𝑡 ) (g𝑡 +1 |g𝑡 )𝑝 (s𝑡 +1 |s𝑡 , a𝑡 ) (20)
while also waving its hand at a target height. The motion prior
𝑡 =0
∫ is trained using a dataset consisting of both walking motions and
𝜋 (a𝑡 |s𝑡 , z𝑡 )𝑞(z𝑡 |g𝑡 )𝑑z𝑡 (21) waving motions, but none of the motion clips show examples of
z𝑡 walking and waving at the same time. Therefore, the onus is on the
is the likelihood of a trajectory under a given policy 𝜋 and encoder policy to spatially compose these different classes of skills in order
𝑞. Similar to a VAE, we include a KL-regularizer with respect to a to fulfill the two disparate objectives simultaneously.
variational prior 𝑝 0 (z𝑡 ) = N (0, 𝐼 ) and coefficient 𝜆. The policy and In this task, the character has two objectives: 1) a target heading
encoder are trained end-to-end with PPO using the reparameteriza- objective for moving along a target direction d∗ at a target speed
tion trick [Kingma and Welling 2014]. Once trained, the latent space 𝑣 ∗ , 2) and a waving objective for raising its right hand to a target
model can be transferred to downstream tasks by using 𝜋 (a𝑡 |s𝑡 , z𝑡 ) height 𝑦 ∗ . The goal input for the policy is given by g𝑡 = ( d̃𝑡∗, 𝑣 ∗, 𝑦 ∗ ),
as a low-level controller, and then training a separate high-level with d̃𝑡∗ being the target direction in the character’s local coordinate
controller 𝑢 (z𝑡 |s𝑡 , g𝑡 ) that specifies latent encodings z𝑡 for the low- frame. The composite reward is calculated according to:
level controller. The parameters of 𝜋 are fixed, and a new high-level
heading waving
controller 𝑢 is trained for each downstream task. 𝑟𝑡𝐺 = 0.5𝑟𝑡 + 0.5𝑟𝑡 , (22)

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
144:18 • Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa

heading
where 𝑟𝑡 the same as the reward used for the Target Heading gait in order to follow the target heading direction. These results
task equation 11, and 𝑟𝑡wave is specified according to: suggest that AMP does exhibit some capability for spatial composi-
2 tion different skills. However, the policies trained with both datasets
wave hand ∗
𝑟𝑡 = exp −16 𝑦𝑡 −𝑦 , (23) can still exhibit some unnatural behaviors, particularly when the
target height for the hand is high.
where 𝑦𝑡hand is the height of character’s right hand.
To evaluate AMP’s ability to compose disparate skills spatially, we Table 6. Performance of policies trained using different dataset on a spatial
compare policies trained using both walking and waving motions, compositional task that combines following a target heading and waving
with policies trained with only walking motions or only waving the character’s hand at a target height. The normalized task returns for each
motions. Table 6 compares the performance of the different policies objective is averaged across 100 episodes for each model. The model trained
with respect to the target heading and waving objectives. Although with both walking and waving motions achieves relatively high rewards on
both objectives, while the models trained with only one type of motions
the motion prior was not trained with any reference motions that
perform well only on one of the objectives.
show both walking and waving at the same time, the policy was able
to discover behaviors that combine these different skills, enabling
the character to walk along different directions while also waving Dataset (Size) Heading Return Waving Return
its hand at various heights. The policies trained with only walking Wave (51.7s) 0.683 ± 0.195 0.949 ± 0.144
motions tend to ignore the waving objective, and exhibit solely Walk (229.7s) 0.945 ± 0.192 0.306 ± 0.378
walking behaviors. Policies trained with only the waving motion Wave + Walk (281.4s) 0.885 ± 0.184 0.891 ± 0.202
are able to fulfill the waving objective, but learns a clumsy shuffling

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control • 144:19

Fig. 9. Learning curves comparing AMP to the motion tracking approach proposed by Peng et al. [2018a] (Motion Tracking) on the single-clip imitation tasks.
3 policies initialized with different random seeds are trained for each method and motion. AMP produces results of comparable quality when compared to
prior tracking-based methods, without requiring a manually designed reward function or synchronization between the policy and reference motion.

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.
144:20 • Xue Bin Peng, Ze Ma, Pieter Abbeel, Sergey Levine, and Angjoo Kanazawa

Fig. 10. Learning curves of applying AMP to various tasks and datasets.

ACM Trans. Graph., Vol. 40, No. 4, Article 144. Publication date: August 2021.

Advanced Techniques in GSAP Animation: Definitive Reference for Developers and Engineers
From Everand
Advanced Techniques in GSAP Animation: Definitive Reference for Developers and Engineers
Richard Johnson
No ratings yet
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
From Everand
Pedestrian Detection: Please, suggest a subtitle for a book with title 'Pedestrian Detection' within the realm of 'Computer Vision'. The suggested subtitle should not have ':'.
Fouad Sabry
No ratings yet
Geometric Feature Learning: Unlocking Visual Insights through Geometric Feature Learning
From Everand
Geometric Feature Learning: Unlocking Visual Insights through Geometric Feature Learning
Fouad Sabry
No ratings yet
3M Versaflo Respirator Systems Are Easy To Select: Modular Means Versatile
No ratings yet
3M Versaflo Respirator Systems Are Easy To Select: Modular Means Versatile
2 pages
R23 DBMS Syllabus
No ratings yet
R23 DBMS Syllabus
3 pages
Kernel Methods: Fundamentals and Applications
From Everand
Kernel Methods: Fundamentals and Applications
Fouad Sabry
No ratings yet
Axis Mobile Features
No ratings yet
Axis Mobile Features
34 pages
B - Com - II Money and Financial System Additional Sub Point
No ratings yet
B - Com - II Money and Financial System Additional Sub Point
32 pages
UNIT 2.2 Functional Modeling
No ratings yet
UNIT 2.2 Functional Modeling
23 pages
F - B - Gan-B L D: W W: Eature Ased Vs Ased Earning From Emonstrations Hen and HY
No ratings yet
F - B - Gan-B L D: W W: Eature Ased Vs Ased Earning From Emonstrations Hen and HY
14 pages
Yuan PhysDiff Physics-Guided Human Motion Diffusion Model ICCV 2023 Paper
No ratings yet
Yuan PhysDiff Physics-Guided Human Motion Diffusion Model ICCV 2023 Paper
12 pages
5S in Warehouse Operations
No ratings yet
5S in Warehouse Operations
11 pages
Hierarchical Diffusion Policy For Kinematics-Aware Multi-Task Robotic Manipulation
No ratings yet
Hierarchical Diffusion Policy For Kinematics-Aware Multi-Task Robotic Manipulation
12 pages
Final - Accops Certified Sales Professional - Fundamentals
No ratings yet
Final - Accops Certified Sales Professional - Fundamentals
14 pages
BD X Paper
No ratings yet
BD X Paper
14 pages
Tennis Skills Main
No ratings yet
Tennis Skills Main
14 pages
Mescab Catalogue - Compressed
No ratings yet
Mescab Catalogue - Compressed
7 pages
Perpetual Humanoid Control For Real-Time Simulated Avatars
No ratings yet
Perpetual Humanoid Control For Real-Time Simulated Avatars
14 pages
Time Sensitive Character Control
No ratings yet
Time Sensitive Character Control
11 pages
2410 05260v1-Source
No ratings yet
2410 05260v1-Source
22 pages
Data Structure Course
No ratings yet
Data Structure Course
48 pages
Section C - Digital MCQ2
No ratings yet
Section C - Digital MCQ2
6 pages
Building Support Structures, 2nd Ed., Analysis and Design with SAP2000 Software
From Everand
Building Support Structures, 2nd Ed., Analysis and Design with SAP2000 Software
Wolfgang Schueller
4.5/5 (15)
B11 - B12 - B13 - 0141 - MAT2002 - 100318 - Dr. Sheerin Kayenat - Fall 22-23 - TEE
No ratings yet
B11 - B12 - B13 - 0141 - MAT2002 - 100318 - Dr. Sheerin Kayenat - Fall 22-23 - TEE
2 pages
GREIL-Crowds: Crowd Simulation With Deep Reinforcement Learning and Examples
No ratings yet
GREIL-Crowds: Crowd Simulation With Deep Reinforcement Learning and Examples
15 pages
Zhang Generating Human Motion From Textual Descriptions With Discrete Representations CVPR 2023 Paper
No ratings yet
Zhang Generating Human Motion From Textual Descriptions With Discrete Representations CVPR 2023 Paper
11 pages
Amy Corns - Connecting Scatter Plots and Correlation Coefficients Activity
No ratings yet
Amy Corns - Connecting Scatter Plots and Correlation Coefficients Activity
23 pages
Sagemcom Dgci384
No ratings yet
Sagemcom Dgci384
6 pages
DRe Con
No ratings yet
DRe Con
11 pages
Super Track
No ratings yet
Super Track
13 pages
Large-Scale Reusable Adversarial Skill Embeddings For Physically
No ratings yet
Large-Scale Reusable Adversarial Skill Embeddings For Physically
18 pages
EQUIP9-Operations-Use Case Challenge
No ratings yet
EQUIP9-Operations-Use Case Challenge
6 pages
High School Students' Perceptions of Motivations For Cyberbullying An Exploratory Study
No ratings yet
High School Students' Perceptions of Motivations For Cyberbullying An Exploratory Study
6 pages
Chi 等 - 2024 - Diffusion Policy Visuomotor Policy Learning via Action Diffusion
No ratings yet
Chi 等 - 2024 - Diffusion Policy Visuomotor Policy Learning via Action Diffusion
19 pages
Dynamical Movement Primitives Learning Attractor Models For Motor Behaviors
No ratings yet
Dynamical Movement Primitives Learning Attractor Models For Motor Behaviors
46 pages
Behavior Planning For Character Animation
No ratings yet
Behavior Planning For Character Animation
11 pages
Motion Specification in Computer Animation
No ratings yet
Motion Specification in Computer Animation
2 pages
I H B D M: Mitating Uman Ehaviour With Iffusion Odels
No ratings yet
I H B D M: Mitating Uman Ehaviour With Iffusion Odels
24 pages
Story To Motion
No ratings yet
Story To Motion
8 pages
Diff Control
No ratings yet
Diff Control
8 pages
Information and Software Technology: Andrew Austin, Casper Holmgreen, Laurie Williams
No ratings yet
Information and Software Technology: Andrew Austin, Casper Holmgreen, Laurie Williams
10 pages
Operational Amplifier
No ratings yet
Operational Amplifier
18 pages
FULLTEXT01
No ratings yet
FULLTEXT01
64 pages
Aplikasi Ujian Online Masuk Universitas Merdeka Madiun Berbasis Android
No ratings yet
Aplikasi Ujian Online Masuk Universitas Merdeka Madiun Berbasis Android
12 pages
Food Irradiation: Communication Strategies To Bridge The Gap Between Scientists and The Public
No ratings yet
Food Irradiation: Communication Strategies To Bridge The Gap Between Scientists and The Public
10 pages
Nursery Pmamp Dadeldhura
No ratings yet
Nursery Pmamp Dadeldhura
16 pages
Bond Strength of Concrete Plugs Embedded in Tubula PDF
No ratings yet
Bond Strength of Concrete Plugs Embedded in Tubula PDF
16 pages
Ch4-Operationa Based Railway Planning
No ratings yet
Ch4-Operationa Based Railway Planning
40 pages
EEPROM Cross Reference (In Detail)
No ratings yet
EEPROM Cross Reference (In Detail)
11 pages
Character Animation Change Survey
No ratings yet
Character Animation Change Survey
13 pages
Let's Create Local and Global Ads: Activity 2: Home Business! Lead in
100% (1)
Let's Create Local and Global Ads: Activity 2: Home Business! Lead in
3 pages
ABIDES-Gym: Gym Environments For Multi-Agent Discrete Event Simulation and Application To Financial Markets
No ratings yet
ABIDES-Gym: Gym Environments For Multi-Agent Discrete Event Simulation and Application To Financial Markets
9 pages
Plamo: Plan and Move in Rich 3D Physical Environments
No ratings yet
Plamo: Plan and Move in Rich 3D Physical Environments
19 pages
WANDR Intention-Guided Human Motion Generation
No ratings yet
WANDR Intention-Guided Human Motion Generation
10 pages
Compliance Under Case-B'.: Notes
No ratings yet
Compliance Under Case-B'.: Notes
10 pages
Diffusion Policy Attacker: Crafting Adversarial Attacks For Diffusion-Based Policies
No ratings yet
Diffusion Policy Attacker: Crafting Adversarial Attacks For Diffusion-Based Policies
16 pages
Siggraph2023 Calm
No ratings yet
Siggraph2023 Calm
15 pages
T2M-GPT - Generating Human Motion From Textual Descriptions
No ratings yet
T2M-GPT - Generating Human Motion From Textual Descriptions
15 pages
GREER, David - DEng. - 2015
No ratings yet
GREER, David - DEng. - 2015
155 pages
Deepmind Control Suite
No ratings yet
Deepmind Control Suite
24 pages
A Virtual Rodent Predicts The Structure of Neural Activity Across Behaviors
No ratings yet
A Virtual Rodent Predicts The Structure of Neural Activity Across Behaviors
55 pages
Computer Animation
No ratings yet
Computer Animation
33 pages
Pathways to Machine Learning and Soft Computing: 邁向機器學習與軟計算之路（國際英文版）
From Everand
Pathways to Machine Learning and Soft Computing: 邁向機器學習與軟計算之路（國際英文版）
Jyh-Horng Jeng
No ratings yet
Sketchup 1pp PDF
No ratings yet
Sketchup 1pp PDF
115 pages
Kinematic Analysis of A Planer Robot Using Artificial Neural Network
No ratings yet
Kinematic Analysis of A Planer Robot Using Artificial Neural Network
7 pages
Disertatie AlexDinu
No ratings yet
Disertatie AlexDinu
15 pages
ML CharAnim SCSS2019 AlexDinu
No ratings yet
ML CharAnim SCSS2019 AlexDinu
6 pages
Consumer Durable Industry: Presented By-Kasturi Mandal A Vijay Kumar Sasi Kumar Umesh G S Arun Kumar Barun Bardhan
0% (1)
Consumer Durable Industry: Presented By-Kasturi Mandal A Vijay Kumar Sasi Kumar Umesh G S Arun Kumar Barun Bardhan
60 pages
Trends1 Aio Pretest
No ratings yet
Trends1 Aio Pretest
4 pages
Motion mm2020
No ratings yet
Motion mm2020
9 pages
5 Tracking Human Motion and Actions For Interactive Robots
No ratings yet
5 Tracking Human Motion and Actions For Interactive Robots
8 pages
Interactive Character Animation Using Simulated Physics - A State-Of-The-Art Review
No ratings yet
Interactive Character Animation Using Simulated Physics - A State-Of-The-Art Review
24 pages
Robot Learning System Based On Adaptive Neural Control and Dynamic Movement Primitives
No ratings yet
Robot Learning System Based On Adaptive Neural Control and Dynamic Movement Primitives
11 pages
Composite Dynamic Movement Primitives Based On Neural Networks For Human-Robot Skill Transfer
No ratings yet
Composite Dynamic Movement Primitives Based On Neural Networks For Human-Robot Skill Transfer
11 pages
IROS 2003 Schaal
No ratings yet
IROS 2003 Schaal
21 pages
22 Scheme Physics For Cse Module 5 Notes
100% (1)
22 Scheme Physics For Cse Module 5 Notes
32 pages
Del Mocap A Lo Real
No ratings yet
Del Mocap A Lo Real
10 pages
CEPT
No ratings yet
CEPT
19 pages
Chapter 19 - Continual Improvement Methods With Six Sigma and Lean
No ratings yet
Chapter 19 - Continual Improvement Methods With Six Sigma and Lean
8 pages
Control Strategies For Physically Simulated Characters Performing Two Player Competitive Sports
No ratings yet
Control Strategies For Physically Simulated Characters Performing Two Player Competitive Sports
11 pages
Simulating Biped Behaviors From Human Motion Data: Kwang Won Sok Manmyung Kim Jehee Lee Seoul National University
No ratings yet
Simulating Biped Behaviors From Human Motion Data: Kwang Won Sok Manmyung Kim Jehee Lee Seoul National University
10 pages
Machine Learning - Advanced Concepts
From Everand
Machine Learning - Advanced Concepts
Derrick Mwiti
No ratings yet
Development of A Bipedal Robot That Walks Like An Animation Character
No ratings yet
Development of A Bipedal Robot That Walks Like An Animation Character
7 pages
Deep Cloth Paper
No ratings yet
Deep Cloth Paper
12 pages
Xue Bin (Jason) Peng: Year 2, PHD in Computer Science
No ratings yet
Xue Bin (Jason) Peng: Year 2, PHD in Computer Science
3 pages
A Deep Learning Framework For Character Motion Synthesis and Editing
No ratings yet
A Deep Learning Framework For Character Motion Synthesis and Editing
11 pages
Adversarial Character Animation
No ratings yet
Adversarial Character Animation
13 pages
Learning To Fight: Keywords
No ratings yet
Learning To Fight: Keywords
8 pages
Pose Controlled Physically-Based Motion: School of Computer Science and Engineering The Hebrew University of Jerusalem
No ratings yet
Pose Controlled Physically-Based Motion: School of Computer Science and Engineering The Hebrew University of Jerusalem
11 pages
Optimization As Motion Selection Principle in Robot Action
No ratings yet
Optimization As Motion Selection Principle in Robot Action
11 pages

AMP: Adversarial Motion Priors For Stylized Physics-Based Character Control

Uploaded by

AMP: Adversarial Motion Priors For Stylized Physics-Based Character Control

Uploaded by

AMP: Adversarial Motion Priors for Stylized Physics-Based Character

or generated by running the policy 𝜋, 5.1 Imitation from Observations

(c) Humanoid: Target Heading (Locomotion + Getup)

(d) Humanoid: Dribble (Locomotion) (e) Humanoid: Strike (Walk + Punch)

(f) Humanoid: Obstacles (Run + Leap + Roll)

8.1 Experimental Setup 8.2 Tasks

emerge automatically through the motion prior, with the character

Fig. 5. Learning curves comparing the task performance of AMP to latent

order to fulfill the high-level task objectives. Again, this composition

(a) T-Rex (Walk)

(b) Dog (Trot)

(c) Dog (Canter)

Table 3. Performance statistics of imitating individual motion clips without

You might also like