Model-based Deep Reinforcement Learning for Robotic Systems
Anusha Nagabandi
Doctor of Philosophy

in the Graduate Division of the University of California, Berkeley

Summer 2020
Abstract
by
Anusha Nagabandi
Deep learning has shown promising results in robotics, but we are still far from having
intelligent systems that can operate in the unstructured settings of the real world, where
disturbances, variations, and unobserved factors lead to a dynamic environment. The
premise of the work in this thesis is that model-based deep RL provides an efficient and
effective framework for making sense of the world, thus allowing for reasoning and adaptation
capabilities that are necessary for successful operation in the dynamic settings of the world.
We first build up a model-based deep RL framework and demonstrate that it can indeed
allow for efficient skill acquisition, as well as the ability to repurpose models to solve a variety
of tasks. We then scale up these approaches to enable locomotion with a 6-DoF legged
robot on varying terrains in the real world, as well as dexterous manipulation with a 24-DoF
anthropomorphic hand in the real world. Next, we focus on the inevitable mismatch between
an agent’s training conditions and the test conditions in which it may actually be deployed,
thus illuminating the need for adaptive systems. Inspired by the ability of humans and animals
to adapt quickly in the face of unexpected changes, we present a meta-learning algorithm
within this model-based RL framework to enable online adaptation of large, high-capacity
models using only small amounts of data from the new task. We demonstrate these fast
adaptation capabilities in both simulation and the real world, with experiments such as a
6-legged robot adapting online to an unexpected payload or suddenly losing a leg. We then
further extend the capabilities of our robotic systems by enabling the agents to reason directly
from raw image observations. Bridging the benefits of representation learning techniques
with the adaptation capabilities of meta-RL, we present a unified framework for effective
meta-RL from images. With robotic arms in the real world that learn peg insertion and
ethernet cable insertion to varying targets, we show the fast acquisition of new skills, directly
from raw image observations in the real world. Finally, we conclude by discussing the key
limitations of our existing approaches and present promising directions for future work in the
area of model-based deep RL for robotic systems.
Acknowledgments
I would like to thank my advisors, Ronald Fearing and Sergey Levine, for their invaluable
mentorship these past five years. I would like to thank Claire Tomlin for being a part of my
qualification exam committee, and Hannah Stuart for being a part of my qualification exam
committee and dissertation committee. Additionally, I would like to thank my wonderful
collaborators, without whom the work in this thesis would not have been possible: Greg Kahn,
Ignasi Clavera, Kate Rakelly, Guangzhao (Philip) Yang, Thomas Asmar, Simin Liu, Zihao
(Tony) Zhao, Vikash Kumar, Chelsea Finn, and Pieter Abbeel. I would also like to thank
everyone in my labs (RAIL and BML) and BAIR for always being willing to help each other,
provide insightful feedback and discussions about research, create a collaborative atmosphere,
talk about life, or even just eat meals together and share memes. In particular, I’d like
to thank Abhishek Gupta, Alex Lee, Andrea Bajcsy, Ashvin Nair, Austin Buchan, Carlos
Casarez, Carolyn Chen, Coline Devin, Conrad Holda, Duncan Haldane, Eric Mazumdar,
Esther Rolf, Ethan Schaler, Greg Kahn, Jessica Lee, Justin Yim, Kate Rakelly, Liyu Wang,
Marvin Zhang, Michael Janner, Nick Rhinehart, Sareum Kim, Somil Bansal, Vitchyr Pong,
and many others! You have all taught me so much, and you have inspired me daily with your
kindness, hard work, dedication, enthusiasm, and thoughtfulness. I truly feel so privileged to
have gotten to know you all.
I would like to thank Marcel Bergerman and Maria Yamanaka for sparking my interest
in robotics in high school, Aaron Steinfeld for supporting me during my first ever research
experience after my freshman year of college, and all of my wonderful colleagues and mentors
throughout internships at Carnegie Robotics, Texas Instruments, MIT Lincoln Lab, and
Google Brain.
I would also like to thank all of the wonderful friends that I’ve made in Berkeley, without
whom these past 5 years seem unimaginable. Whether we met as roommates, lab mates,
ultimate frisbee friends, dance team friends, or just friends, you guys have very quickly
become like family. In addition to those mentioned above, I’d like to give a special thank you
to Keertana Settaluri, Matt McPhail, and Spencer Kent for making Berkeley feel like home.
I would also like to thank my high school friends and my college friends for continuing to
be a big part of my life. I am so thankful for all of the wonderful experiences that we have
shared together, and there are not enough words that I can use to tell you all how much
you mean to me. I would like to give a special thanks to Akshay Ravi for always supporting
me, for delicious baked goods, for unlimited laughs and good times, and for making me a
better person every day. Finally, a huge thank you to my family, especially to my parents
and my sister, without whom I would not be the person that I am today. Thank you for your
never-ending love and support.
Contents

1 Introduction
1.1 Motivation
1.2 Online Reasoning and Decision Making for Generalizable Systems
1.3 Overview, Organization, and Contributions
8 Conclusion
Bibliography
Chapter 1
Introduction
From carefully planned-out acrobatics and dynamic walking maneuvers (Kuindersma et al.
2016; Tajima, Honda, and Suga 2009), to precise perception-based manipulation for folding
towels (Miller et al. 2012) and reorienting cubes (Akkaya et al. 2019), robots these days
can demonstrate very impressive capabilities. Although both traditional control approaches
as well as deep learning approaches have demonstrated promising robotics results, we are
unfortunately still far from having intelligent robotic systems operating in the real world.
1.1 Motivation
To understand why we don’t yet have robotic systems deployed in real-world settings such
as homes, we must come to terms with the unstructured nature of the real world, where
disturbances, variations, and unobserved factors lead to an extremely dynamic environment.
Moving away from controlled and instrumented settings such as labs, what we really want
is a robot that can rely on its own on-board sensors (such as cameras) to perceive an
unstructured world without the strong assumption of perfect state estimation, and that can
very quickly adapt to a variety of new tasks as the situation demands, as opposed to being
overly specialized for a single task. For example, a robot that is only capable of moving a
certain object from one fixed position to another may be useful in a fixed factory setting
where no amount of variation ever occurs (Fig. 1.1), but it will rarely be useful anywhere else.
Figure 1.1: Robots in a Hyundai car factory (Narasimhan 2018), increasing factory efficiency through repeated
execution of pre-programmed commands.
What we want instead is a robot that can be trained to sort boxes in a warehouse, for
instance, but can then adapt its behavior to heavy or slippery packages, to the presence of
an unexpected human arm during its operation, to the fact that the current stack of boxes
has fallen, or to the fact that the delivery truck has parked farther away than usual.
Centuries of research efforts have brought us to the point of being able to design complex
systems in controlled settings by reasoning through the specific mechanisms at hand; and
while we are surprisingly effective at enabling super-human types of capabilities such as
beating human Chess champions (Campbell, Hoane Jr, and F.-h. Hsu 2002) and winning
Jeopardy (Ferrucci 2012), we unfortunately cannot yet enable the robust and generalizable
capabilities of even a young child around the house. Although efforts such as RoboCup (Kitano
et al. 1997) and the Darpa Robotics Challenge (Krotkov et al. 2017) have taken steps toward
encouraging robots to operate in more realistic scenarios, there is still much progress to be
made. The real world is full of challenges for robots, including irregular terrains, the lack of
perfect state information for all objects in the world, the resulting difficulty of recognizing
or correcting for errors, and even small 1cm errors being enough to spill a coffee mug or
drop a bowl. Humans, on the other hand, can feel around their pocket to find a key and
then proceed to unlock a door entirely in the dark, without any sense of exactly where the
keyhole is. We can pour coffee into a mug regardless of its precise location, using our visual
perception. We can continue to move by hobbling on one foot after rolling an ankle. We
can walk on stairs or slopes or ice. The development of online reasoning is critical for these
types of capabilities, and will directly result in more generalizable and adaptable robots that
can be deployed into the world. The premise of the work in this thesis is that model-based
deep RL can provide an efficient and effective framework for making sense of the world; and
in turn, this ability to make sense of the world can allow for these types of reasoning and
adaptation capabilities that are necessary for successful operation in the dynamic settings of
the real world.
look like, we still need to address the next problem of actually working with it. Consider
the task of putting away a chair. Where do I hold onto it? How do I pull it? The answer is
different for every specific instance. For instance, do I hold it at the top and pull on it? Do I
hold it in the middle and push on it? How will this type of chair slide on this type of floor?
Do I need to understand physics here? Will the chair scratch the floor? Should I try to lift
the chair up before moving it? What if the chair has things on top? Do I need to put away
all the clothes first? How do I pick up and fold flexible materials like a shirt? As you can see,
the difficulty of this problem escalated very quickly, and yet we have still only considered
one small task involved with cleaning a room, which is moving a chair. Now, consider a
robot that learns how to move a certain type of chair from exactly 1 fixed position to some
other goal position, in a lab. Unfortunately, that skill is entirely useless in the real world if it
cannot generalize to moving chairs from arbitrary positions, or moving a slightly different
type of chair, or moving the chair even when the floor is slightly different. In other words,
being a specialist is not nearly as important as being a generalist when we want something
that works in the real world. Furthermore, it has become clear over the years that it is not
sufficient to set rules or specific control laws to dictate what each actuator of a robot should
do; rather, this ability to generalize must arise from enabling robots to reason on their own
and act accordingly.
Much work has tried to enable generalization to new scenarios through approaches such
as training on large amounts of data, domain randomization (Tobin et al. 2017), and domain
adaptation (Bousmalis et al. 2018). Other work has instead emphasized the idea of learning
predictive models (Pathak 2019) as being the key mechanism for enabling adaptation to
scenarios past those seen during training. For example, general concepts such as object
permanence (Baillargeon, Spelke, and Wasserman 1985) and approximate physics (Hespos
and VanMarle 2012) are learned by kids at a very young age, and they display explicit shock
or confusion when faced with seemingly impossible situations that violate their internal
predictive models. Prior work in robotics has extended this idea, that “play” teaches infants
to use experimentation to learn a model of how things work (Agrawal 2018), which can then
be used to perform new tasks. This concept of creating generalist (as opposed to specialist)
agents by taking the time to develop good predictive models is further supported by the
link between long childhoods and the resulting intelligence of a species (Piantadosi and Kidd
2016). The work in this thesis is inspired by this idea of predictive models as a mechanism for
learning to generalize – that perhaps the most general level at which to learn is the predictive
level, and that these predictive models can be re-purposed downstream in the process of
learning other skills or re-planning in the face of uncertainty.
This observation that the format of predictive models allows us to re-use them for future
decision making is powerful, and further supported by evidence in both humans and animals.
Humans often perform “what if” types of reasoning by querying their internal predictive
models before making decisions; when choosing a future job, for example, we often imagine
the consequences and possible futures that may result from each choice, before assessing those
possible futures and making our decision. Relatedly, the firing patterns of place cells have
been shown (Johnson and Redish 2007) in rats to demonstrate “what if” types of reasoning
when the rat comes to critical decision points in a maze. When examined at fine time scales,
these place cells, which normally encode concepts such as physical location, were shown to
demonstrate a looking-ahead type of firing pattern which scanned both of the possible turn
directions before making a decision at the T-junction of a maze. Having explicit predictive
mechanisms, then, seems to help both animals and humans to make sense of the world and
allow for explicit querying and reasoning in unfamiliar situations.
“Generalization is the concept that humans and animals use past learning in
present situations of learning if the conditions in the situations are regarded
as similar.” (Gluck, Mercado, and Myers 2007)
“The most difficult problems for computers are informally known as AI-complete...
solving them is equivalent to the general aptitude of human intelligence... beyond
the capabilities of a purpose-specific algorithm. AI-complete problems are
hypothesized to include... dealing with unexpected circumstances while
solving any real world problem” (Shapiro 1992)
With these in mind, we highlight three main concepts that will reappear in the contents of
this thesis: abstracting common properties, using past learning in present situations, and
going beyond purpose-specific algorithms to deal with unexpected circumstances. The first
concept of abstracting common properties is addressed in this thesis with the building of
predictive models; we learn models that can explain what they observe. The second concept is
addressed with meta-learning algorithms, which learn to use prior knowledge to quickly solve
new tasks. And the third concept of going “beyond purpose-specific” is addressed with work
in online adaptation, continual learning, and online inference from raw sensory observations
in an effort to succeed even in the face of unexpected circumstances.
Figure 1.2: Overview figure to visualize the flow and general contributions of each chapter of this thesis,
for the overall goal of enabling the deployment of robotic systems into the real world. The difficulty axis
here encompasses difficulty of the tasks (e.g., contacts), difficulty from state dimensionality (e.g., working
with images), difficulty from action dimensionality (e.g., working with high-DoF robots), as well as difficulty
induced from operating in the real world. The diversity axis, on the other hand, focuses on the generalization
aspect of getting robots into the real world. This axis encompasses the re-usability of learned knowledge, the
variety of achievable tasks, the capability of adapting online to address unexpected perturbations at run time,
and the ability to adjust behaviors as necessary to succeed in new and unseen situations.
Chapter 7 presents a unified framework for effective meta-RL from images. Robotic arms in the
real world that learn peg insertion and ethernet cable insertion to varying targets show
the fast acquisition of new skills, directly from raw image observations in the real world.
Finally, chapter 8 concludes by discussing the key limitations of our existing approaches
and presenting promising directions for future work in this area of model-based deep RL for
robotic systems.
Chapter 2

MBRL: Learning Neural Network Models for Control
2.1 Introduction
Model-free deep reinforcement learning algorithms have been shown to be capable of learning a
wide range of tasks, ranging from playing video games from images (V. Mnih, K. Kavukcuoglu,
et al. 2013; Oh et al. 2016) to learning complex locomotion skills (J. Schulman et al. 2015).
However, such methods suffer from very high sample complexity, often requiring millions of
samples to achieve good performance (J. Schulman et al. 2015). Model-based reinforcement
learning algorithms are generally regarded as being more efficient (M. P. Deisenroth, Neumann,
Peters, et al. 2013). However, to achieve good sample efficiency, these model-based algorithms
1: https://sites.google.com/view/mbmf
have conventionally used either simple function approximators (Lioutikov et al. 2014) or
Bayesian models that resist overfitting (M. Deisenroth and C. Rasmussen 2011) in order
to effectively learn the dynamics using few samples. This makes them difficult to apply
to a wide range of complex, high-dimensional tasks. Although a number of prior works
have attempted to mitigate these shortcomings by using large, expressive neural networks
to model the complex dynamical systems typically used in deep reinforcement learning
benchmarks (Brockman et al. 2016; Emanuel Todorov, Erez, and Tassa 2012a), such models
often do not perform well (S. Gu et al. 2016) and have been limited to relatively simple,
low-dimensional tasks (Mishra, Abbeel, and Mordatch 2017).
In this work, we demonstrate that multi-layer neural network models can in fact achieve excellent sample complexity in a model-based reinforcement learning algorithm. The resulting models can then be used for model-based control, which we perform using model predictive control (MPC) with a simple random-sampling shooting method (Richards 2004). We demonstrate that this method can acquire effective locomotion gaits for a variety of MuJoCo benchmark systems (Emanuel Todorov, Erez, and Tassa 2012a), including the swimmer, half-cheetah, hopper, and ant. Fig. 7.3 shows that these models can be used at run-time to execute a variety of locomotion tasks such as trajectory following, where the agent executes a path through a given set of sparse waypoints that represent desired center-of-mass positions. Additionally, each system uses less than four hours worth of data, indicating that the sample complexity of our model-based approach is low enough to be applied in the real world, and is dramatically lower than that of pure model-free learners. In particular, when comparing this model-based approach’s ability to follow arbitrary desired trajectories with a model-free approach’s ability to learn just a competent forward-moving gait, our results show that the model-based method uses only 3%, 10%, and 14% of the data that is used by a model-free approach (for half-cheetah, swimmer, and ant, respectively). Relatedly, our model-based method can achieve qualitatively good forward gaits for the swimmer, cheetah, hopper, and ant using 20−80× fewer data points than are required by a model-free approach.

Figure 2.2: Our method learns a dynamics model that enables a simulated quadrupedal robot to autonomously follow user-defined waypoints. Training for this task uses 7×10^5 time steps (collected without knowledge of test-time navigation tasks), and the learned model can be reused at test time to follow arbitrary desired trajectories, such as the U-turn shown above.
Although such model-based methods are drastically more sample efficient and more flexible
than task-specific policies learned with model-free reinforcement learning, their asymptotic
performance is usually worse than model-free learners due to model bias. Model-free algo-
rithms are not limited by the accuracy of the model, and therefore can achieve better final
performance, though at the expense of much higher sample complexity (M. P. Deisenroth,
Neumann, Peters, et al. 2013; Kober, J. A. Bagnell, and Peters 2013). To address this issue,
we use our model-based algorithm, which can quickly achieve moderately proficient behavior,
to initialize a model-free learner, which can slowly achieve near-optimal behavior. The learned
demonstrated with PILCO that we could find has 11 dimensions (Meger et al. 2015), while
the most complex task in our work has 49 dimensions and features challenging properties
such as frictional contacts.
Although neural networks have been widely used in earlier work to model plant dynamics
(K. J. Hunt et al. 1992; Bekey and Goldberg 1992), more recent model-based algorithms have
achieved only limited success in applying such models to the more complex benchmark tasks
that are commonly used in deep reinforcement learning. Several works have proposed to use
deep neural network models for building predictive models of images (Watter et al. 2015), but
these methods have either required extremely large datasets for training (Watter et al. 2015)
or were applied to short-horizon control tasks (Wahlström, Schön, and M. P. Deisenroth 2015).
In contrast, we consider long-horizon simulated locomotion tasks, where the high-dimensional
systems and contact-rich environment dynamics provide a considerable modeling challenge.
(Mishra, Abbeel, and Mordatch 2017) proposed a relatively complex time-convolutional model
for dynamics prediction, but only demonstrated results on low-dimensional (2D) manipulation
tasks. (Gal, McAllister, and Carl Edward Rasmussen 2016) extended PILCO (M. Deisenroth
and C. Rasmussen 2011) using Bayesian neural networks, but only presented results on a
low-dimensional cart-pole swingup task, which does not include contacts.
Aside from training neural network dynamics models for model-based reinforcement
learning, we also explore how such models can be used to accelerate a model-free learner.
Prior work on model-based acceleration has explored a variety of avenues. The classic Dyna (R.
Sutton 1991) algorithm proposed to use a model to generate simulated experience that could
be included in a model-free algorithm. This method was extended (Silver, R. S. Sutton, and
Müller 2008; Asadi 2015) to work with deep neural network policies, but performed best
with models that were not neural networks (S. Gu et al. 2016). Model learning has also been
used to accelerate model-free Bellman backups (Heess, Wayne, et al. 2015), but the gains
in performance from including the model were relatively modest. Prior work has also used
model-based learners to guide policy optimization through supervised learning (Levine, Finn,
et al. 2017), but the models that were used were typically local linear models. In a similar
way, we also use supervised learning to initialize the policy, but we then fine-tune this policy
with model-free learning to achieve the highest returns. Our model-based method is more
expressive and flexible than local linear models, and it does not require multiple samples
from the same initial state for local linearization.
parameterization for fθ (st , at ) would take as input the current state st and action at , and
output the predicted next state ŝt+1 . However, this function can be difficult to learn when
the states st and st+1 are too similar and the action has seemingly little effect on the output;
this difficulty becomes more pronounced as the time between states ∆t becomes smaller.
We overcome this issue by instead learning a dynamics function that predicts the change
in state st over the time step duration of ∆t. Thus, the predicted next state is as follows:
ŝt+1 = st + fθ (st , at ). Note that increasing this ∆t increases the information available from
each data point, and can help with not only dynamics learning but also with planning
using the learned dynamics model (Sec. 2.4). However, increasing ∆t also increases the
discretization and variation of the underlying continuous-time dynamics, which can make the
learning process more difficult.
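To make this parameterization concrete, the following is a minimal sketch of such a state-delta dynamics model, assuming a PyTorch implementation; the layer sizes and activations are illustrative placeholders rather than the exact architecture used in this work.

```python
import torch
import torch.nn as nn

class DeltaDynamicsModel(nn.Module):
    """Predicts the change in state, so that s_hat_{t+1} = s_t + f_theta(s_t, a_t)."""

    def __init__(self, state_dim, action_dim, hidden_dim=500):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, state_dim),  # output: predicted state difference
        )

    def forward(self, state, action):
        # f_theta(s_t, a_t): predicted state change over one time step of duration dt
        return self.net(torch.cat([state, action], dim=-1))

    def predict_next_state(self, state, action):
        # s_hat_{t+1} = s_t + f_theta(s_t, a_t)
        return state + self.forward(state, action)
```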
using stochastic gradient descent. While training on the training dataset D, we also calculate
the mean squared error in Eqn. 2.1 on a validation set Dval , composed of trajectories not
stored in the training dataset.
Although this error provides an estimate of how well the learned dynamics function
predicts next state, we would in fact like to know how well the model can predict further into
the future because we will ultimately use this model for longer-horizon control (Sec. 2.4). We
therefore calculate H-step validation errors by propagating the learned dynamics function
forward H times to make multi-step open-loop predictions. For each given sequence of
true actions (at , . . . at+H−1 ) from Dval , we compare the corresponding ground-truth states
(st+1 , . . . st+H ) to the dynamics model’s multi-step state predictions (ŝt+1 , . . . ŝt+H ), calculated
as

$$
E^{(H)}_{\text{val}} = \frac{1}{|\mathcal{D}_{\text{val}}|} \sum_{\mathcal{D}_{\text{val}}} \frac{1}{H} \sum_{h=1}^{H} \frac{1}{2} \left\lVert s_{t+h} - \hat{s}_{t+h} \right\rVert^2,
\qquad
\hat{s}_{t+h} =
\begin{cases}
s_t & h = 0 \\
\hat{s}_{t+h-1} + f_\theta(\hat{s}_{t+h-1}, a_{t+h-1}) & h > 0
\end{cases}
\tag{2.2}
$$
In this work, we do not use this H-step validation during training; however, this is a meaningful
metric that we keep track of, since the controller (introduced in the next section) uses such
H-step predictions when performing action selection.
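A sketch of this H-step open-loop validation metric is shown below; it assumes a model exposing a predict_next_state(s, a) helper that operates on NumPy arrays, and its bookkeeping over start indices is illustrative rather than an exact transcription of Eqn. 2.2.

```python
import numpy as np

def h_step_validation_error(model, val_trajectories, H):
    """Average H-step open-loop prediction error in the spirit of Eqn. 2.2.

    val_trajectories: list of (states, actions) arrays with shapes
    (T+1, state_dim) and (T, action_dim).
    """
    errors = []
    for states, actions in val_trajectories:
        T = len(actions)
        for t in range(T - H + 1):
            s_hat = states[t]                  # h = 0: start from the true state
            err = 0.0
            for h in range(1, H + 1):
                # open-loop: feed the model's own prediction back in
                s_hat = model.predict_next_state(s_hat, actions[t + h - 1])
                err += 0.5 * np.sum((states[t + h] - s_hat) ** 2)
            errors.append(err / H)
    return float(np.mean(errors)) if errors else float("nan")
```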
Calculating the exact optimum of Eqn. 2.3 is difficult due to the dynamics and reward
functions being nonlinear, but many techniques exist for obtaining approximate solutions to
finite-horizon control problems that are sufficient for succeeding at the desired task. In this
work, we use a simple random-sampling shooting method (Rao 2009) in which K candidate
action sequences are randomly generated, the corresponding state sequences are predicted
using the learned dynamics model, the rewards for all sequences are calculated, and the
candidate action sequence with the highest expected cumulative reward is chosen. Rather
than have the policy execute this action sequence in open-loop, we use model predictive
control (MPC): the policy executes only the first action at , receives updated state information
st+1 , and recalculates the optimal action sequence at the next time step. Note that for
higher-dimensional action spaces and longer horizons, random sampling with MPC may be
insufficient, and investigating other methods (W. Li and E. Todorov 2004) in future work
could improve performance.
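The sketch below illustrates this random-sampling shooting procedure combined with MPC; the batched predict_next_state interface, the reward_fn signature, and the horizon, sample count, and action bounds are all assumptions chosen for illustration.

```python
import numpy as np

def mpc_random_shooting(state, model, reward_fn, action_dim, H=10, K=1000,
                        action_low=-1.0, action_high=1.0, rng=None):
    """Return the first action of the best of K randomly sampled H-step sequences."""
    if rng is None:
        rng = np.random.default_rng()
    # Sample K candidate action sequences uniformly at random.
    candidates = rng.uniform(action_low, action_high, size=(K, H, action_dim))
    returns = np.zeros(K)
    states = np.tile(state, (K, 1))
    for h in range(H):
        next_states = model.predict_next_state(states, candidates[:, h])
        returns += reward_fn(states, candidates[:, h], next_states)
        states = next_states
    best = np.argmax(returns)
    # MPC: execute only this first action, then replan at the next time step.
    return candidates[best, 0]
```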
Note that this combination of predictive dynamics model plus controller is beneficial in
that the model is trained only once, but by simply changing the reward function, we can
accomplish a variety of goals at run-time, without a need for live task-specific retraining.
To improve the performance of our model-based learning algorithm, we gather additional
on-policy data by alternating between gathering data with the current model and retraining
the model using the aggregated data. This on-policy data aggregation (i.e., reinforcement
learning) improves performance by mitigating the mismatch between the data’s state-action
distribution and the model-based controller’s distribution (Ross, G. J. Gordon, and D. Bagnell
2011). Alg. 1 and Fig. 2.3 provide an overview of our model-based reinforcement learning
algorithm.
First, random trajectories are collected and added to dataset Drand, which is used to train fθ by performing gradient descent on the objective in Eqn. 2.1. Then, the model-based MPC controller (Sec. 2.4) gathers T new on-policy datapoints and adds these datapoints to a separate dataset Drl. The dynamics function fθ is then retrained using data from both Drand and Drl. Note that during retraining, the neural network dynamics function’s weights are warm-started with the weights from the previous iteration. The algorithm continues alternating between training the model and gathering additional data until a predefined maximum number of iterations is reached.

Figure 2.3: Illustration of Algorithm 1. On the first iteration, random actions are performed and used to initialize Drand. On all following iterations, this iterative procedure is used to train the dynamics model using the data in the datasets Drand and Drl, run the MPC controller using the learned model for action selection, aggregate collected data into Drl, and retrain the model on updated data.
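The sketch below summarizes this alternating procedure; the gym-style environment interface, the collect_rollout helper, and the train_model and mpc_policy callables are assumptions for illustration, not the implementation used in this work.

```python
import numpy as np

def collect_rollout(env, policy, length):
    """Roll out `policy` (or uniform random actions if policy is None) in a
    gym-style environment; returns (states, actions) arrays."""
    states, actions = [env.reset()], []
    for _ in range(length):
        a = env.action_space.sample() if policy is None else policy(states[-1])
        s_next, _, done, _ = env.step(a)
        states.append(s_next)
        actions.append(a)
        if done:
            break
    return np.array(states), np.array(actions)

def model_based_rl(env, model, train_model, mpc_policy,
                   n_random_rollouts=10, n_iterations=5, rollout_length=1000):
    """Train f_theta on Drand, gather on-policy data into Drl with the MPC
    controller, aggregate, and retrain (warm-starting is assumed to happen
    inside train_model)."""
    D_rand = [collect_rollout(env, None, rollout_length)
              for _ in range(n_random_rollouts)]
    D_rl = []
    for _ in range(n_iterations):
        train_model(model, D_rand + D_rl)
        D_rl.append(collect_rollout(env, lambda s: mpc_policy(s, model),
                                    rollout_length))
    return model
```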
2: https://sites.google.com/view/mbmf
Prior to training, both the inputs and outputs in the dataset were pre-processed to have
mean 0 and standard deviation 1. Furthermore, we normalize all environments such that all
actions fall in the range [−1, 1]. Shown in tables 2.5 and 2.6 below are additional parameters
for these tasks and domains, such as the duration of each environment timestep, the planning
horizon for the MPC controller, and the number of action sequences sampled by the controller
for each action selection step.
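A minimal sketch of this preprocessing step is shown below; keeping the normalization statistics around (here in a plain dictionary, as an illustrative choice) is necessary so that model predictions can be un-normalized at control time.

```python
import numpy as np

def normalize_dataset(inputs, targets, eps=1e-8):
    """Standardize model inputs (state-action pairs) and outputs (state
    differences) to mean 0 and standard deviation 1."""
    stats = {
        "in_mean": inputs.mean(axis=0), "in_std": inputs.std(axis=0) + eps,
        "out_mean": targets.mean(axis=0), "out_std": targets.std(axis=0) + eps,
    }
    inputs_n = (inputs - stats["in_mean"]) / stats["in_std"]
    targets_n = (targets - stats["out_mean"]) / stats["out_std"]
    return inputs_n, targets_n, stats
```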
and half-cheetah agents; the other agents also exhibited similar trends. After each design
decision was evaluated, we used the best outcome of that evaluation for the remainder of the
evaluations.
(A) Training steps. Fig. 2.5a shows varying numbers of gradient descent steps taken
during each iteration of model learning. As expected, training for too few epochs negatively
affects learning performance, with 20 epochs causing swimmer to reach only half of the other
experiments’ performance.
(B) Dataset aggregation. Fig. 2.5b shows varying amounts of (initial) random data
versus (aggregated) on-policy data used within each mini-batch of stochastic gradient descent
when training the learned dynamics function. We see that training with at least some
aggregated on-policy rollouts significantly improves performance, revealing the benefits of
improving learned models with reinforcement learning. However, our method still works well
with even just 30% of each mini-batch coming from on-policy rollouts, showing the advantage
of model-based reinforcement learning being off-policy.
(C) Controller. Fig. 2.5c shows the effect of varying the horizon H and the number of
random samples K used at each time step by the model-based controller. We see that too
short of a horizon is harmful for performance, perhaps due to greedy behavior and entry into
unrecoverable states. Additionally, the model-based controller for half-cheetah shows worse
performance for longer horizons. This is further revealed below in Fig. 2.6, which illustrates a
single 100-step validation rollout (as explained in Eqn. 2.2). We see here that the open-loop
predictions for certain state elements, such as the center of mass x position, diverge from
ground truth. Thus, a large H leads to the use of an inaccurate model for making predictions,
which is detrimental to task performance. Finally, with regards to the number of randomly
sampled trajectories evaluated, we expect that this value will need to be higher for systems with
higher-dimensional action spaces.
(D) Number of initial random trajectories. Fig. 2.5d shows varying numbers of
random trajectories used to initialize our model-based approach. We see that although a
Figure 2.5: Analysis of design decisions for our model-based reinforcement learning approach. (a) Training
steps, (b) dataset training split, (c) horizon and number of actions sampled, (d) initial random trajectories.
Training for more epochs, leveraging on-policy data, planning with medium-length horizons and many action
samples were the best design choices, while data aggregation caused the number of initial trajectories to have little effect.
Figure 2.6: We compare a true rollout (solid line) to its corresponding multi-step prediction from the learned
model (dotted line) on the half-cheetah. Although we learn to predict certain elements of the state space
well, note the eventual divergence of the open-loop predictions over time. Even so, the MPC controller can
successfully use the model to control an agent by performing short-horizon planning.
higher amount of initial training data leads to higher initial performance, data aggregation
allows low-data initialization runs to reach a high final performance level, highlighting how
on-policy data from reinforcement learning improves sample efficiency.
(a) Swimmer, left (b) Swimmer, right (c) Ant, left (d) Ant, right (e) Ant, u-turn
Figure 2.7: Trajectory following results for the swimmer and ant, with blue dots representing the center-of-
mass positions that were specified as the desired trajectory to follow. For each of these agents, we train the
dynamics model only once on random trajectories, but can then use it at run-time to execute various desired
trajectories.
model-free methods by training a policy to mimic the learned model-based controller, and
then using the resulting imitation policy as the initialization for a model-free reinforcement
learning algorithm.
which we optimize using stochastic gradient descent. To achieve desired performance and
address the data distribution problem, we applied DAGGER (Ross, G. J. Gordon, and
D. Bagnell 2011): This consisted of iterations of training the policy, performing on-policy
rollouts, querying the “expert” MPC controller for “true” action labels for those visited states,
and then retraining the policy.
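The sketch below illustrates this DAGGER-style initialization; the gym-style environment and the mpc_expert, policy, and fit_policy callables are assumed interfaces, and the loop is an illustration of the procedure described above rather than the code used in this work.

```python
def dagger_imitation(env, mpc_expert, policy, fit_policy,
                     n_iters=5, rollout_length=500):
    """Iteratively roll out the current policy, relabel visited states with
    the MPC "expert" actions, aggregate, and retrain by supervised learning."""
    dataset_states, dataset_actions = [], []
    for _ in range(n_iters):
        s = env.reset()
        for _ in range(rollout_length):
            dataset_states.append(s)
            dataset_actions.append(mpc_expert(s))   # query the expert label
            s, _, done, _ = env.step(policy(s))     # but act with the policy
            if done:
                break
        fit_policy(policy, dataset_states, dataset_actions)  # supervised update
    return policy
```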
2.6.2 Model-free RL
After initialization, we can use the policy πφ , which was trained on data generated by the
learned model-based controller, as an initial policy for a model-free reinforcement learning
algorithm. Specifically, we use trust region policy optimization (TRPO) (J. Schulman et al.
2015); such policy gradient algorithms are a good choice for model-free fine-tuning since they
do not require any critic or value function for initialization (Grondman et al. 2012), though
our method could also be combined with other model-free RL algorithms.
TRPO is also a common choice for the benchmark tasks we consider, and it provides
us with a natural way to compare purely model-free learning with our model-based pre-
initialization approach. Initializing TRPO with the learned expert policy πφ is as simple as
using πφ as the initial policy for TRPO, instead of a standard randomly initialized policy.
Although this approach of combining model-based and model-free methods is extremely
simple, we demonstrate the efficacy of this approach in our experiments.
ant) to learn the fastest forward-moving gait possible. The model-free approach we compare
with is the rllab3 implementation of trust region policy optimization (TRPO) (J. Schulman
et al. 2015), which is known to be successful on these tasks. The code and videos of these
experiments are available online4 .
deviation (std) is another parameter of importance. Optimizing this std parameter according
to the imitation learning loss function results in worse TRPO performance than arbitrarily
using a larger std, perhaps because a higher std on the initial policy leads to more exploration
and thus is more beneficial to TRPO. Therefore, we train our policy’s mean network using
the standard imitation learning loss function, but we manually select the std to be 1.0.
As mentioned above, we use rllab’s (Duan, X. Chen, et al. 2016) implementation of
the TRPO algorithm for the model-free RL algorithm. We run TRPO with the following
parameters for all agents: batch size 50000, base epsilon 10^-5, discount factor 0.995, and step
size 0.5.
Reward $r$ for each agent:
Swimmer: $s^{\text{xvel}}_{t+1} - 0.5 \lVert a_t / 50 \rVert_2^2$
Half-Cheetah: $s^{\text{xvel}}_{t+1} - 0.05 \lVert a_t / 1 \rVert_2^2$
Hopper: $s^{\text{xvel}}_{t+1} + 1 - 0.005 \lVert a_t / 200 \rVert_2^2$
Ant: $s^{\text{xvel}}_{t+1} + 0.5 - 0.005 \lVert a_t / 150 \rVert_2^2$
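For illustration, these reward functions can be written directly as code; the function below mirrors the expressions above, with s_next_xvel denoting the forward (x) velocity of the next state.

```python
import numpy as np

def mujoco_reward(agent, s_next_xvel, action):
    """Reward r for each agent, as listed above."""
    penalty = lambda scale, coeff: coeff * np.sum((np.asarray(action) / scale) ** 2)
    if agent == "swimmer":
        return s_next_xvel - penalty(50.0, 0.5)
    if agent == "half_cheetah":
        return s_next_xvel - penalty(1.0, 0.05)
    if agent == "hopper":
        return s_next_xvel + 1.0 - penalty(200.0, 0.005)
    if agent == "ant":
        return s_next_xvel + 0.5 - penalty(150.0, 0.005)
    raise ValueError(f"unknown agent: {agent}")
```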
Figure 2.8: Plots show the mean and standard deviation over multiple runs and compare our model-based
approach, a model-free approach (TRPO, (J. Schulman et al. 2015)), and our hybrid model-based plus
model-free approach. Our combined approach shows a 3 − 5× improvement in sample efficiency for all shown
agents. Note that the x-axis uses a logarithmic scale.
2.8 Discussion
In this chapter, we presented a model-based RL algorithm that is able to learn neural network
dynamics functions for simulated locomotion tasks using a small number of samples. We
described a number of important design decisions for effectively and efficiently training
neural network dynamics models, and we presented experiments that evaluated these design
parameters. Our method quickly discovered a dynamics model that led to an effective gait;
that model could be applied to different trajectory following tasks at run-time, or the initial
CHAPTER 2. MBRL: LEARNING NEURAL NETWORK MODELS FOR CONTROL24
gait could then be fine-tuned with model-free learning to achieve high task rewards on
benchmark Mujoco agents.
In addition to looking at the difference in sample complexity between our hybrid Mb-Mf
approach and a pure model-free approach, there are also takeaways from the model-based
approach alone. Our model-based algorithm cannot always reach extremely high rewards
on its own, but it offers practical use by allowing quick and successful discovery of complex
and realistic gaits. In general, our model-based approach can very quickly become competent
at a task, whereas model-free approaches can very slowly become experts. For example,
when we have a small legged robot with unknown dynamics and we want it to accomplish
tasks in the real-world (such as exploration, construction, search and rescue, etc.), achieving
reliable walking gaits that can follow any desired trajectory is a superior skill to that of just
running straight forward as fast as possible. Additionally, consider the ant: A model-free
approach requires 5 × 10^6 points to achieve a steady walking forward gait, but using just
14% of those data points, our model-based approach can allow for travel in any direction and
along arbitrary desired trajectories. Training such a dynamics model only once and applying
it to various tasks is compelling; especially when looking toward application to real robots,
this sample efficiency can bring these methods out of the simulation world and into the realm
of feasibility.
Chapter 3

Scaling Up MBRL for Locomotion
3.1 Introduction
Legged millirobots are an effective platform for applications, such as exploration, mapping,
and search and rescue, because their small size and mobility allows them to navigate through
complex, confined, and hard-to-reach environments that are often inaccessible to aerial vehicles
1: https://sites.google.com/view/imageconddyn
and untraversable by wheeled robots. Millirobots also provide additional benefits in the form
of low power consumption and low manufacturing costs, which enables scaling them to large
teams that can accomplish more complex tasks. This superior mobility, accessibility, and
scalability makes legged millirobots some of the most mission-capable small robots available.
However, the same properties that enable these systems to traverse complex environments
are precisely what make them difficult to control.
Modeling the hybrid dynamics of underactuated legged millirobots from first principles
is exceedingly difficult due to complicated ground contact physics that arise while moving
dynamically on complex terrains. Furthermore, cheap and rapid manufacturing techniques
cause each of these robots to exhibit varying dynamics. Due to these modeling challenges,
many locomotion strategies for such systems are hand-engineered and heuristic. These
manually designed controllers impose simplifying assumptions, which not only constrain
the overall capabilities of these platforms, but also impose a heavy burden on the engineer.
Additionally, and perhaps most importantly, they preclude the opportunity for adapting and
improving over time.
In this chapter, we explore how learning can be used to automatically acquire locomotion strategies in diverse environments for small, low-cost, and highly dynamic legged robots. Choosing an appropriate learning algorithm requires consideration of a number of factors. First, the learned model needs to be expressive enough to cope with the highly dynamic and nonlinear nature of legged millirobots, as well as with high-dimensional sensory observations such as images. Second, the algorithm must allow the robot to learn quickly from modest amounts of data, so as to make it a practical algorithm for real-world application. Third, the learned general-purpose models must be able to be deployed on a wide range of navigational tasks in a diverse set of environments, with minimal human supervision.

Figure 3.2: VelociRoACH: our small, mobile, highly dynamic, and bio-inspired hexapedal millirobot, shown with a camera mounted for terrain imaging.
The primary contribution of the work in this chapter is an approach for controlling
dynamic legged millirobots that learns an expressive and high-dimensional image-conditioned
neural network dynamics model, which is then combined with a model predictive controller
(MPC) to follow specified paths. Our sample efficient learning-based approach uses less than
17 minutes of real-world data to learn to follow desired paths in a desired environment, and
we empirically show that it outperforms a conventional differential drive control strategy
for highly dynamic maneuvers. Our method also enables adaptation to diverse terrains by
conditioning its dynamics predictions on its own observed images, allowing it to predict how
terrain features such as gravel or turf will alter the system’s response. The work in this
chapter leverages and builds upon recent advances in learning to achieve a high-performing
CHAPTER 3. SCALING UP MBRL FOR LOCOMOTION 27
and sample efficient approach for controlling dynamic legged millirobots on various terrains
in the real world.
and can be applied to real systems, they have not yet been shown to work for high dimensional
systems or more complex systems, such as fast robots operating in highly dynamic regimes
on irregular surfaces with challenging contact dynamics.
Model-free Policy Learning: Rather than optimizing gaits, prior work in model-
free reinforcement learning algorithms has demonstrated the ability to instead learn these
behaviors from scratch. Work in this area, including Q-learning (Volodymyr Mnih, Koray
Kavukcuoglu, Silver, Rusu, et al. 2015; Oh et al. 2016), actor-critic methods (T. Lillicrap
et al. 2016; V. Mnih, Badia, et al. 2016), and policy gradients (J. Schulman et al. 2015), has
learned complex skills in high-dimensional state spaces, including skills for simulated robotic
locomotion tasks. However, the high sample complexity of such purely model-free algorithms
makes them difficult to use for learning in the real world, where sample collection is limited
by time and other physical constraints. Unlike these approaches, our model-based learning
method uses only minutes of experience to achieve generalizable real-world locomotion skills
that were not explicitly seen during training, and it further exemplifies the benefits in sample
complexity that arise from incorporating models with learning-based approaches.
Model Learning: Although the sample efficiency of model-based learning is appealing,
and although data-driven approaches can eliminate the need to impose restrictive assumptions
or approximations, the challenge lies in the difficulty of learning a good model. Relatively
simple function approximators such as time-varying linear models have been used to model
dynamics of systems (Lioutikov et al. 2014; Yip and Camarillo 2014), including our Ve-
lociRoACH (Buchan, Haldane, and Ronald S Fearing 2013) platform. However, these models
have not yet been shown to possess enough representational power (i.e., accuracy) to generalize
to complex locomotion tasks. Prior work has also investigated learning probabilistic dynamics
models (M. Deisenroth and C. Rasmussen 2011; Ko and Fox 2008), including Gaussian
process models for simulated legged robots (M. P. Deisenroth, Calandra, et al. 2012). While
these approaches can be sample efficient, it is intractable to scale them to higher dimensions,
as needed especially when incorporating rich sensory inputs such as image observations.
In contrast, our method employs expressive neural network dynamics models, which eas-
ily scale to high dimensional inputs. Other modeling approaches have leveraged smaller
neural networks for dynamics modeling, but they impose strict and potentially restrictive
structure to their formulation, such as designing separate modules to represent the various
segments of a stride (Crusea et al. 1998), approximating actuators as muscles and tuning
these parameters (Xiong, Worgotter, and Manoonpong 2014), or calculating equations of
motion and learning error terms on top of these specific models (Grandia, Pardo, and Buchli
2018). Instead, we demonstrate a sample efficient, expressive, and high-dimensional neural
network dynamics model that is free to learn without the imposition of an approximated
hand-specified structure.
Environment Adaptation: The dynamics of a robot depend not only on its own
configuration, but also on its environment. Prior methods generally categorize the problem of
adapting to diverse terrains into two stages: first, the terrain is recognized by a classifier trained
with human-specified labels (or, less often, using unsupervised learning methods (Leffler
2009)), and second, the gait is adapted to the terrain. This general approach has been
used for autonomous vehicles (Thrun, Montemerlo, et al. 2006; Leffler 2009), larger legged
robots (Kolter, Abbeel, and Ng 2008; Kalakrishnan et al. 2010; Zucker et al. 2011; Xiong,
Worgotter, and Manoonpong 2014; Hoepflinger et al. 2010), and for legged millirobots (Wu
et al. 2016; Bermudez et al. 2012). In contrast, our method does not require any human labels
at run time, and it adapts to terrains based entirely on autonomous exploration: the dynamics
model is simply conditioned on image observations of the terrain, and it automatically learns
to recognize the visual cues of terrain features that affect the robot’s dynamics.
3.3.1 Overview
We provide an overview of our approach in Fig. 3.3. Note that this model-based RL framework is also the one shown in Fig. 2.3 of the previous chapter, and we will now develop the dynamics model component.

Since we require a parameterization of the dynamics model that can cope with high-dimensional state and action spaces and the complex dynamics of legged millirobots, we represent the dynamics function fθ(st, at) as a multilayer neural network, parameterized by θ. As before, this function outputs the predicted change in state that occurs as a result of executing action at from state st, over the time step duration of ∆t. Thus, the predicted next state is given by: ŝt+1 = st + fθ(st, at). While choosing too small of a ∆t leads to too small of a state difference to allow meaningful learning, increasing the ∆t too much can also make the learning process more difficult because it increases the complexity of the underlying continuous-time dynamics. As described in Algorithm 1 from Section 2.4, initial training data is collected by placing the robot in arbitrary start states and executing random trajectories.

Figure 3.3: Image-conditioned model-based learning for locomotion control: A closed-loop MPC controller performs action selection by using predictions from the learned dynamics model, which is conditioned on the current state st, action at, and image It.
$$
c(s_t, a_t) = f_p\, p + f_h\, h + f_f\, f, \tag{3.1}
$$
where the parameter fp penalizes perpendicular distance p away from the desired path,
parameter ff encourages forward progress f along the path, and parameter fh maintains
the heading h of the system toward the desired direction. Rather than executing the entire
sequence of selected optimal actions, we use model predictive control (MPC) to execute only
the first action at , and we then replan at the next time step, given updated state information.
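A literal rendering of Eqn. 3.1 as code is shown below, using the weight values reported later in this chapter (fp = 50, ff = 10, fh = 5); the sign convention for the forward-progress term is an assumption, chosen so that greater progress lowers the cost, consistent with ff "encouraging" forward progress under cost minimization.

```python
def path_following_cost(perp_dist, heading_err, fwd_progress,
                        f_p=50.0, f_h=5.0, f_f=10.0):
    """c(s_t, a_t) = f_p * p + f_h * h + f_f * f  (Eqn. 3.1).

    Here f is taken to be the negative of the forward progress along the
    desired path, so minimizing the cost rewards progress."""
    return f_p * perp_dist + f_h * heading_err + f_f * (-fwd_progress)
```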
As currently described, our model-based RL approach can successfully follow arbitrary
paths when trained and tested on a single terrain. However, in order to traverse complex
and varied terrains, it is necessary to adjust the dynamics to the terrain conditions; we will
develop such a model in the following section.
Figure 3.4: Our image-conditioned neural network dynamics model. The model takes as input the current
state st , action at , and image It . The image is passed through the convolutional layers of AlexNet (Krizhevsky,
Sutskever, and Hinton 2012) pre-trained on ImageNet (Deng et al. 2009), which is then flattened and projected
into a lower dimension through multiplication with a random fixed matrix to obtain et . The image and
concatenated state-action vectors are passed through their own fully connected layers, fused via an outer
product, flattened, and passed through more fully connected layers to obtain a predicted state difference ∆ŝt .
input not only the current robot state st and action at , but also the current image observation
It . The model (Fig. 3.4) passes image It through the first eight layers of AlexNet (Krizhevsky,
Sutskever, and Hinton 2012). The resulting activations are flattened into a vector, and this
vector is then multiplied by a fixed random matrix in order to produce a lower dimensional
feature vector et . The concatenated state-action vector [st ; at ] is passed through a hidden
layer and combined with et through an outer product. As opposed to a straightforward
concatenation of [st ; at ; et ], this outer product allows for higher-order integration of terrain
information terms with the state and action information terms. This combined layer is
then passed through another hidden layer and output layer to produce a prediction of state
difference ∆ŝt .
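The sketch below illustrates this image-conditioned architecture, assuming a PyTorch/torchvision implementation; using torchvision's AlexNet feature extractor as a stand-in for the "first eight layers", as well as the feature dimensions and head sizes, are illustrative assumptions rather than the exact model used in this work.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class ImageConditionedDynamics(nn.Module):
    """Predicts a state difference from (state, action, image), fusing frozen
    pre-trained image features with state-action features via an outer product."""

    def __init__(self, state_dim, action_dim, img_feat_dim=16, sa_feat_dim=32):
        super().__init__()
        alexnet = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
        self.conv = alexnet.features              # pre-trained conv layers, frozen
        for p in self.conv.parameters():
            p.requires_grad = False
        # Fixed random projection of the flattened conv activations to e_t.
        self.register_buffer("proj", torch.randn(256 * 6 * 6, img_feat_dim))
        self.sa_fc = nn.Sequential(
            nn.Linear(state_dim + action_dim, sa_feat_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(img_feat_dim * sa_feat_dim, 256), nn.ReLU(),
            nn.Linear(256, state_dim))

    def forward(self, state, action, image):
        # image: (B, 3, 224, 224); conv activations flatten to (B, 256*6*6)
        e_t = torch.flatten(self.conv(image), 1) @ self.proj
        sa = self.sa_fc(torch.cat([state, action], dim=-1))
        fused = torch.einsum("bi,bj->bij", e_t, sa).flatten(1)  # outer product
        return self.head(fused)                   # predicted state difference
```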
Training the entire image-conditioned neural network dynamics model with only minutes
of data—corresponding to tens of thousands of datapoints—and in only a few environments
would result in catastrophic overfitting. Thus, to perform feature extraction on the images,
we use the AlexNet (Krizhevsky, Sutskever, and Hinton 2012) layer weights optimized from
training on the task of image classification on the ImageNet (Deng et al. 2009) dataset,
which contains 15 million diverse images. Although gathering and labelling this large image
dataset was a significant effort, we note that such image datasets are ubiquitous and their
learned features have been shown to transfer well to other tasks (Razavian et al. 2014). By
using these pre-trained and transferable features, the image-conditioned dynamics model
is sample-efficient and can automatically adapt to different terrains without any manual
labelling of terrain information.
We show in our experiments that this image-conditioned dynamics model outperforms
a naïvely trained dynamics model that is trained simply on an aggregation of all the data.
Furthermore, the performance of the image-conditioned dynamics model is comparable, on
each terrain, to individual dynamics models that are specifically trained (and tested) on that
terrain.
Figure 3.5: Over 15 teleoperated trials performed on rough terrain, a legged robot succeeded in navigating
through the terrain 90% of the time, whereas a wheeled robot of comparable size succeeded only 30% of the
time.
Figure 3.6: (a-d) Conceptual sagittal plane drawing of robot leg positions as a function of crank angle, and (e)
isometric solid model view of the leg transmission model. Both of these images are borrowed with permission
from (Casarez 2018).
each transmission side is driven by a 3.6 ohm DC motor with a 21.3:1 gear reduction. As
shown in Fig. 3.6, the fore and aft legs are constrained to be 180◦ out of phase from the
middle leg. Similar to the design of the X2-VelociRoACH (Haldane and Ronald S Fearing
2015), this VelociRoACH uses two connected output cranks per side to transmit forces at
each leg contact.
The VelociRoACH carries an ImageProc embedded circuit board3 , which includes a 40
MHz Microchip dsPIC33F microprocessor, a six axis inertial measurement unit (IMU), an
802.15.4 wireless radio (XBee), and motor control circuitry. We added 14-bit magnetic rotary
encoders to the motors on each side of the robot to monitor absolute position. Additional
sensory information includes battery voltage and back-EMF signals from the motors.
The onboard microcontroller runs a low-level 1 kHz control loop and processes communi-
cation signals from the XBee. Due to computational limits of the microprocessor, we stream
data from the robot to a laptop for calculating controller commands, and then stream these
commands back to the microprocessor for execution. To bypass the problem of using only
on-board sensors for state estimation, we also use an OptiTrack motion capture system to
stream robot pose information during experiments. The motion capture system does not
provide any information about the environment terrain, so we also mounted a 3.4 gram
monocular color camera onto the VelociRoACH, which communicates directly with the laptop
via a radio frequency USB receiver.
3: https://github.com/biomimetics/imageproc_pcb
epochs, using the Adam optimizer (Kingma and J. Ba 2014) with learning rate 0.001 and
batchsize 1000.
The process of using the neural network dynamics model and the cost function to select
the best candidate action sequence at each time step is done in real-time. Relevant parameters
for our model-based controller are the number of candidate action sequences sampled at each
time step N = 500, the horizon H = 4, and parameters fp = 50, ff = 10, and fh = 5 for the
perpendicular, forward, and heading components of the trajectory following cost function
from Eqn. 3.1.
Note that the training data is gathered entirely using random trajectories, and therefore,
the paths executed by the controller at run-time differ substantially from the training data.
This illustrates the use of off-policy training data, and that the model exhibits considerable
generalization. Furthermore, although the model is trained only once, we use it to accomplish
a variety of tasks at run-time by simply changing the desired path in the cost function. This
decoupling of the task from the dynamics eliminates the need for task-specific training, which
further improves overall sample efficiency.
We define the state s_t of the VelociRoACH to be [x, y, z, v_x, v_y, v_z, cos(φ_r), sin(φ_r), cos(φ_p), sin(φ_p), cos(φ_y), sin(φ_y), ω_x, ω_y, ω_z, cos(a_L), sin(a_L), cos(a_R), sin(a_R), v_{a_L}, v_{a_R}, bemf_L, bemf_R, V_bat]^T. The center of mass positions (x, y, z) and the Euler angles to describe the center of mass pose (φ_r, φ_p, φ_y) come from the OptiTrack motion capture system. The angular velocities (ω_x, ω_y, ω_z) come from the gyroscope onboard the IMU, and the motor crank positions (a_L, a_R) come from the magnetic rotary encoders, which give a notion of leg position. We include (bemf_L, bemf_R) because back-EMF provides a notion of motor torque/velocity, and (V_bat) because the voltage of the battery affects the VelociRoACH's performance. Note that the state includes sin and cos of angular values, which is common practice and allows the neural network to avoid wrapping issues.
Finally, we define the action space a_t of the VelociRoACH to represent the desired velocity
setpoints for the rotation of the legs, and we achieve these setpoints using a lower-level PID
controller onboard the system. We discuss two of the possible action abstraction choices
below in Section 3.4.3.
Figure 3.7: Trajectories executed by the model-based controller when the control outputs are (Top:) direct
motor PWM values and (Bottom:) leg velocity setpoints, which a lower-level controller is tasked with
achieving. Note that for each of these options, the corresponding dynamics model is trained using data where a_t represents the indicated choice of action abstraction.
the battery level, as well as due to the leg kinematics leading to different forces at different
stages of the leg rotation.
At the same time, outputting desired velocities and then designing a lower-level PID
controller to achieve those velocities involves an additional stage of parameter tuning, and one
concern includes unpredictable behavior caused by not achieving the desired velocity within
the time ∆t before the next setpoint is received. Each of these action abstraction options has pros and cons that manifest themselves differently on different systems. Thus, it is appealing for an algorithm to adapt easily to the user's choice of action abstraction, since the best choice may depend on the details of the system at hand.
Table 3.1: Trajectory following cost incurred vs. amount of training data
In comparing our method to the differential drive controller, all cost numbers reported in
the tables below are calculated on the same cost function (Eqn. 3.1) that indicates how well
the executed path aligns with the desired path. Each reported number represents an average
over 10 runs. Fig. 3.8 illustrates that the model-based learning method and the differential
drive control strategy are comparable at low speeds, across different trajectories on carpet.
However, the learned model-based approach outperforms the differential drive strategy at
Figure 3.9: VelociRoACH telemetry data from the learned model performing a left turn. The
top plots show 1kHz data from on-board the robot, with zoomed-in plots on the right. The bottom plots
show 10Hz data from the motion capture system, during the same run.
Figure 3.10: Distribution of commanded actions for the right vs. left side of the VelociRoACH, where the
commanded actions are velocity setpoints for the legs, in units of leg revolutions per second. The top plots
show values from multiple “straight” runs, and the bottom plots show values from multiple “left” runs.
Figure 3.11: Four “left” runs (top) from the learned method, and four “left” runs (bottom) from the differential drive method on the VelociRoACH. The blue dots correspond to the right motor, and the black dots correspond to the left motor. The differential drive runs show a clear correlation between heading error and resulting control commands (as prescribed), whereas the learned model's commands do not follow this simple prescribed relationship.
Figure 3.13: Top: Execution of our model-based learning method, using an image-conditioned dynamics
model, on various desired paths on four terrains (styrofoam, gravel, carpet, and turf). Note that the path
boundaries are outlined for visualization purposes only, and were not present during the experiments. Bottom:
Example images from the onboard camera during the runs (shaky images due to body motion).
some knowledge about the surface. Also, performance diminishes when the model is trained
on data from both terrains, which indicates that naïvely combining data in order to learn a
joint dynamics model is insufficient.
Table 3.2: Costs incurred by the VelociRoACH while executing a straight line path. The model-based
controller has the best performance when executed on the surface that it was trained on, indicating that the
model incorporates knowledge about the environment. Performance deteriorates when the model is trained on
one terrain but tested on another, as well as when the model is jointly trained on all data from all surfaces.
                              Carpet    Styrofoam
Differential Drive            13.85     15.45
Model trained on carpet        5.69     18.62
Model trained on styrofoam    22.25      8.15
Model trained on both          7.52     15.76
We have shown so far that when trained on data gathered from a single terrain, our
model-based approach is superior to the common differential drive approach, and that our
approach improves with more data. Furthermore, the experiment above demonstrated that
the robot’s dynamics depend on the environment. Thus, we would like our approach to be
able to control the VelociRoACH on a variety of terrains.
A standard approach for this would be to train a dynamics model using data from all
terrains. However, as shown above in Table 3.2 as well as below in Fig. 3.14, a model that is
naïvely trained on all data from multiple terrains and then tested on one of those terrains is
significantly worse than a model that is trained solely on that particular terrain. The main
reason that this naïve approach does not work well is that the dynamics themselves differ
greatly with terrain, and a dynamics model that takes only the robot’s current state and
action as inputs receives a weak and indirect signal about the robot’s environment.
To have a direct signal about the environment, our image-conditioned model takes an
additional input: an image taken from an onboard camera, as shown in Fig. 3.13. In
Figure 3.14, we compare the performance of this image-conditioned approach to that of
alternative approaches on the task of path following for four different paths (straight, left,
right, zigzag) on four different surfaces (styrofoam, carpet, gravel, turf). We compare the
image-conditioned learned dynamics approach to various alternate approaches:
1. Training a separate dynamics model on each terrain, and testing on that same terrain.
2. Naïvely training one joint dynamics model on all training data, with no images or labels
of which terrain the data came from.
3. Training one joint dynamics model using data with explicit terrain labels in the form of
a one-hot vector (where the activation of a single vector element directly corresponds
to a terrain).
The naïve approach of training one joint dynamics model using an aggregation of all data
performs worse than the other learning-based methods. The method of having a separate
Figure 3.14: Comparison of our image-conditioned model-based approach vs. alternate methods, evaluated on
four different terrains, each with four different paths (straight, left, right, and zigzag) on the VelociRoACH.
The methods that we compare to include: a hand-engineered differential drive controller, a joint dynamics
model that is naïvely trained on all data from all terrains, an “oracle” approach that uses a separate dynamics
model on each terrain, and another “oracle” approach where the joint dynamics model is trained using all
data with extra one-hot vector labels of the terrain. Our method outperforms the differential drive method
and the naïve model-based controller, while performing similarly to the oracle baselines without needing any
explicit labels.
dynamics model for each terrain, as well as the method of training one joint dynamics
model using one-hot vectors as terrain labels, both perform well on all terrains. However,
both of these methods require human supervision to label the training data and to specify
which terrain the robot is on at test time. In contrast, our image-conditioned approach
performs just as well as the separate and one-hot models, but does not require any additional
supervision beyond an onboard monocular camera. Finally, our image-conditioned approach
also substantially outperforms the differential drive baseline on all terrains.
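As an illustration of how such image conditioning can be implemented, the sketch below concatenates a learned image feature vector with the state and action before predicting the state change. The small convolutional encoder and the layer sizes are assumptions for illustration, not the exact architecture used in these experiments.

    import torch
    import torch.nn as nn

    class ImageConditionedDynamics(nn.Module):
        """Predicts the state change s_{t+1} - s_t from (s_t, a_t, image_t).
        Illustrative architecture; all sizes are placeholders."""
        def __init__(self, state_dim, action_dim, feat_dim=32):
            super().__init__()
            # Small convolutional encoder for the onboard camera image.
            self.encoder = nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(32, feat_dim), nn.ReLU(),
            )
            # MLP dynamics model conditioned on the image features.
            self.mlp = nn.Sequential(
                nn.Linear(state_dim + action_dim + feat_dim, 500), nn.ReLU(),
                nn.Linear(500, 500), nn.ReLU(),
                nn.Linear(500, state_dim),
            )

        def forward(self, state, action, image):
            feat = self.encoder(image)                    # terrain-dependent context
            x = torch.cat([state, action, feat], dim=-1)
            return self.mlp(x)                            # predicted state difference

Because the image enters the model as just another input, the planner and cost function described earlier in this chapter can remain unchanged.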
3.4.9 Discussion
In this chapter, we presented a sample-efficient model-based learning algorithm using image-
conditioned neural network dynamics models that enabled accurate locomotion of a low-cost,
Chapter 4
Figure 4.2: PDDM can efficiently and effectively learn complex dexterous manipulation skills in both
simulation and the real world. Here, less than 4 hours of experience is needed for the Shadow Hand to learn
to rotate two free-floating Baoding balls in the palm, without any prior knowledge of system dynamics.
to scale up to more complex and realistic tasks requiring fine motor skills. In this chapter,
we demonstrate that our method of online planning with deep dynamics models (PDDM)
addresses both of these limitations; we show that improvements in learned dynamics models,
together with improvements in online model-predictive control, can indeed enable efficient and
effective learning of flexible contact-rich dexterous manipulation skills – even on a 24-DoF anthropomorphic hand in the real world, using just 4 hours of purely real-world data
to learn to simultaneously coordinate multiple free-floating objects. Videos of the experiments
as well as the code are available online1 .
4.1 Introduction
Dexterous manipulation with multi-fingered hands represents a grand challenge in robotics:
the versatility of the human hand is as yet unrivaled by the capabilities of robotic systems,
and bridging this gap will enable more general and capable robots. Although some real-world
tasks can be accomplished with simple parallel jaw grippers, there are countless tasks in
which dexterity in the form of redundant degrees of freedom is critical. In fact, dexterous
manipulation is defined (Okamura, Smaby, and Cutkosky 2000) as being object-centric,
with the goal of controlling object movement through precise control of forces and motions –
something that is not possible without the ability to simultaneously impact the object from
multiple directions. Through added controllability and stability, multi-fingered hands enable
useful fine motor skills that are necessary for deliberate interaction with objects. For example,
using only two fingers to attempt common tasks such as opening the lid of a jar, hitting a
nail with a hammer, or writing on paper with a pencil would quickly encounter the challenges
of slippage, complex contact forces, and underactuation. Success in such settings requires a
sufficiently dexterous hand, as well as an intelligent policy that can endow such a hand with
the appropriate control strategy.
The principal challenges in dexterous manipulation stem from the need to coordinate
numerous joints and impart complex forces onto the object of interest. The need to repeatedly
establish and break contacts presents an especially difficult problem for analytic approaches,
which require accurate models of the physics of the system. Learning offers a promising
data-driven alternative. Model-free reinforcement learning (RL) methods can learn policies
that achieve good performance on complex tasks (Van Hoof et al. 2015; Levine, Finn, et al.
2016; Rajeswaran et al. 2017); however, we will show that these state-of-the-art algorithms
struggle when a high degree of flexibility is required, such as moving a pencil to follow
arbitrary user-specified strokes. Here, complex contact dynamics and high chances of task
failure make the overall skill much more difficult. Model-free methods also require large
amounts of data, making them difficult to use in the real world. Model-based RL methods, on
the other hand, can be much more efficient, but have not yet been scaled up to such complex
tasks. In this work, we aim to push the boundary on this task complexity; consider, for
instance, the task of rotating two Baoding balls around the palm of your hand (Figure 4.2).
1 https://sites.google.com/view/pddm
We will discuss how model-based RL methods can solve such tasks, both in simulation and
on a real-world robot.
Algorithmically, we present a technique that combines elements of recently-developed uncertainty-aware neural network models with state-of-the-art gradient-free trajectory optimization. While the individual components of our method are based heavily on prior work, we show that their combination is both novel and critical. Our approach, based on deep model-based RL, challenges the general machine learning community's notion that models are difficult to learn and do not yet deliver control results that are as impressive as model-free methods. In this chapter, we push forward the empirical results of model-based RL, in both simulation and the real world, on a suite of dexterous manipulation tasks starting with a 9-DoF three-fingered hand (Zhu et al. 2019) rotating a valve, and scaling up to a 24-DoF anthropomorphic hand executing handwriting and manipulating free-floating objects (Figure 4.3). These realistic tasks require not only learning about interactions between the robot and objects in the world, but also effective planning to find precise and coordinated maneuvers while avoiding task failure (e.g., dropping objects). The work in this chapter demonstrates for the first time that deep neural network models can indeed enable sample-efficient and autonomous discovery of fine motor skills with high-dimensional manipulators, including a real-world dexterous hand trained entirely using just 4 hours of real-world data.
Figure 4.3: Task suite of simulated and real-world dexterous manipulation: valve rotation, in-hand reorientation, handwriting, and manipulating Baoding balls.
stable grasp state, and (Bai and Liu 2014)’s controller used the conservation of mechanical
energy to control the tilt of a palm and roll objects to desired positions on the hand. These
types of manipulation techniques have thus far struggled to scale to more complex tasks
or sophisticated manipulators in simulation as well as the real world, perhaps due to their
need for precise characterization (Okada 1982) of the system and its environment. Reasoning
through contact models and motion cones (Chavan-Dafle, Holladay, and Rodriguez 2018;
Kolbert, Chavan-Dafle, and Rodriguez 2016), for example, requires computation time that
scales exponentially with the number of contacts, and has thus been limited to simpler
manipulators and more controlled tasks. With this work, we aim to significantly scale up the
complexity of feasible tasks, while also minimizing such task-specific formulations.
More recent work in deep RL has studied this question through the use of data-driven
learning to make sense of observed phenomena (Andrychowicz, B. Baker, et al. 2018; Van
Hoof et al. 2015). These methods, while powerful, require large amounts of system interaction
to learn successful control policies, making them difficult to apply in the real world. Some
work (Rajeswaran et al. 2017; Zhu et al. 2019) has used expert demonstrations to improve
this sample efficiency. In contrast, our method is sample efficient without requiring any
expert demonstrations, and is still able to leverage data-driven learning techniques to acquire
challenging dexterous manipulation skills.
Model-based RL has the potential to provide both efficient and flexible learning. In fact,
methods that assume perfect knowledge of system dynamics can achieve very impressive
manipulation behaviors (Mordatch, Popović, and Emanuel Todorov 2012; Lowrey et al.
2018) using generally applicable learning and control techniques. Other work has focused
on learning these models using high-capacity function approximators (M. P. Deisenroth,
Neumann, Peters, et al. 2013; Lenz, Knepper, and Saxena 2015; Levine, Finn, et al. 2016;
Nagabandi, Yang, et al. 2017; Nagabandi, Kahn, Ronald S. Fearing, et al. 2018; Williams,
Wagener, et al. 2017; Chua et al. 2018a) and probabilistic dynamics models (M. Deisenroth
and C. Rasmussen 2011; Ko and Fox 2008; M. P. Deisenroth, Calandra, et al. 2012; Doerr et al.
2017). Our method combines components from multiple prior works, including uncertainty
estimation (M. Deisenroth and C. Rasmussen 2011; Chua et al. 2018a; Kurutach et al.
2018a), deep models and model-predictive control (MPC) (Nagabandi, Yang, et al. 2017),
and stochastic optimization for planning (Williams, Aldrich, and Theodorou 2015). Model-
based RL methods, including recent work in uncertainty estimation (A. Malik et al. 2019)
and combining policy networks with online planning (T. Wang and Jimmy Ba 2019), have
unfortunately mostly been studied and shown on lower-dimensional (and often, simulated)
benchmark tasks, and scaling these methods to higher dimensional tasks such as dexterous
manipulation has proven to be a challenge. As illustrated in our evaluation, the particular
synthesis of different ideas in this work allows model-based RL to push forward the task
complexity of achievable dexterous manipulation skills, and to extend this progress to even a
real-world robotic hand.
Random Shooting: The simplest gradient-free optimizer is the one used in the previous chapters of this work. It simply generates K independent random action sequences {A_0 . . . A_K}, where each sequence A_k = {a^k_0 . . . a^k_{H−1}} is of length H. Given a reward function r(s_t, a_t) that defines the task, and given future state predictions ŝ_{t+1} = ŝ_t + f_θ(ŝ_t, a_t) from the learned dynamics model f_θ, the optimal action sequence A_{k*} is selected to be the one with the highest predicted reward:

k* = arg max_k R_k = arg max_k Σ_{t'=t}^{t+H−1} r(ŝ_{t'}, a^k_{t'}).

We showed in the previous chapters that this approach can achieve success on continuous control tasks with learned models, but it does have numerous drawbacks: it scales poorly with the dimension of both the planning horizon and the action space, and it is often insufficient for achieving high task performance, since a sequence of actions sampled at random rarely leads directly to meaningful behavior.
After M iterations, the optimal actions are selected to be the resulting mean of the action
distribution.
considers covariances between time steps and uses a softer update rule that more effectively
integrates a larger number of samples into the distribution update. As derived by recent
model-predictive path integral work (Williams, Aldrich, and Theodorou 2015; Lowrey et al.
2018), this general update rule takes the following form for time step t, reward-weighting factor γ, and reward R_k from each of the K predicted trajectories:

μ_t = ( Σ_{k=0}^{N} e^{γ·R_k} a^k_t ) / ( Σ_{j=0}^{N} e^{γ·R_j} )   ∀ t ∈ {0 . . . H − 1}.   (4.2)
Rather than sampling the action samples from a random policy or from iteratively refined Gaussians, we instead apply a filtering technique to explicitly produce smoother candidate action sequences. Given the iteratively updated mean distribution μ_t from above, we generate K action sequences a^k_t = n^k_t + μ_t, where each noise sample n^k_t is generated from i.i.d. Gaussian samples u^k_t using filtering coefficient β as follows:

n^k_t = β · u^k_t + (1 − β) · n^k_{t−1},   where u^k_t ∼ N(0, Σ) and n^k_{t<0} = 0.

By coupling time steps to each other, this filtering also reduces the effective degrees of freedom or dimensionality of the search space, thus allowing for better scaling with dimensionality.
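A minimal sketch of this filtered, reward-weighted planner is shown below. It assumes a hypothetical rollout_reward function that scores a candidate action sequence under the learned dynamics model, and it is meant only to illustrate the β-filtered sampling together with the soft update of Eqn. 4.2, not to reproduce the exact implementation.

    import numpy as np

    def pddm_plan(mu, rollout_reward, K, H, action_dim, beta, gamma, noise_std=0.2, n_iters=1):
        """One planning step: sample K time-correlated action sequences around the
        running mean mu (shape [H, action_dim]), score them with the learned model,
        and apply the reward-weighted soft update. Returns the first action and the
        updated mean (which is reused at the next time step)."""
        for _ in range(n_iters):
            # Time-correlated noise: n_t = beta * u_t + (1 - beta) * n_{t-1}.
            u = np.random.normal(0.0, noise_std, size=(K, H, action_dim))
            n = np.zeros_like(u)
            n[:, 0] = beta * u[:, 0]
            for t in range(1, H):
                n[:, t] = beta * u[:, t] + (1.0 - beta) * n[:, t - 1]
            candidates = mu[None] + n                         # K candidate action sequences

            # Predicted reward R_k of each candidate under the learned dynamics model.
            R = np.array([rollout_reward(candidates[k]) for k in range(K)])

            # Reward-weighted soft update of the mean (cf. Eqn. 4.2); subtracting the
            # max reward before exponentiating is only for numerical stability.
            w = np.exp(gamma * (R - R.max()))
            mu = (w[:, None, None] * candidates).sum(axis=0) / w.sum()
        return mu[0], mu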
4.3.3 Overview
After using the models and predicted rewards to perform action selection, we take one step a_t of the selected action plan, receive updated state information s_{t+1}, and then replan at the following time step. This closed-loop method of replanning using updated information at every time step helps to mitigate some model inaccuracies by preventing the accumulation of model error. Note that this control procedure also allows us to easily swap out new reward
functions or goals at run-time, independent of the trained model. Overall, the full procedure
of PDDM involves iteratively performing actions in the real world (through online planning
with the use of the learned model) and then using those observations to update that learned
model, as stated in Algorithm 1 from Section 2.4. Further implementation details of this
procedure are provided in the sections below.
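The outer loop of this procedure can be summarized schematically as follows; the environment interface, plan_action, and train_model below are hypothetical placeholders standing in for the components described above, rather than the actual implementation.

    def run_model_based_rl(env, model, plan_action, train_model, dataset, num_iters, rollout_len):
        """Alternate between collecting data with online planning (MPC) under the
        current learned model and refitting that model on the aggregated dataset."""
        for it in range(num_iters):
            s = env.reset()
            for t in range(rollout_len):
                a = plan_action(model, s)          # online planning with the learned model
                s_next, done = env.step(a)
                dataset.append((s, a, s_next))     # aggregate real-world experience
                s = s_next
                if done:                           # e.g., a dropped object triggers a reset
                    s = env.reset()
            model = train_model(model, dataset)    # update the learned dynamics model
        return model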
3. How does the performance as well as sample efficiency of PDDM compare to that of
other state-of-the-art algorithms?
5. Can we apply these lessons learned from simulation to enable a 24-DoF humanoid hand
to manipulate free-floating objects in the real world?
All experiment videos as well as the released code can be found online on the project website2 .
2 https://sites.google.com/view/pddm
Baoding Balls: While manipulation of one free object is already a challenge, sharing the
compact workspace with other objects exacerbates the challenge and truly tests the dexterity
of the manipulator. We examine this challenge with Baoding balls, where the goal is to rotate
two balls around the palm without dropping them. The objects influence the dynamics of
not only the hand, but also each other; inconsistent movement of either one knocks the other
out of the hand, leading to failure.
(Table of per-task reward functions R and planner settings T, H, K, γ, β, M, and E; for example, the handwriting reward includes a −20‖tip_z‖ term and an indicator penalty on forward tipping of the pencil.)
experiments, though we observed similar trends on other tasks. In the first plot, we see
that a sufficiently large architecture is crucial, indicating that the model must have enough
capacity to represent the complex dynamical system. In the second plot, we see that the use
of ensembles is helpful, especially earlier in training when non-ensembled models can overfit
badly and thus exhibit overconfident and harmful behavior. This suggests that ensembles
are an enabling factor in using sufficiently high-capacity models. In the third plot, we see
that there is not much difference between resetting model weights randomly at each training
iteration versus warmstarting them from their previous values.
In the fourth plot, we see that using a planning horizon that is either too long or too
short can be detrimental: Short horizons lead to greedy planning, while long horizons
suffer from compounding errors in the predictions. In the fifth plot, we study the type of
planning algorithm and see that PDDM, with action smoothing and soft updates, greatly
Figure 4.5: Baoding task performance (in simulation), plotted against the number of datapoints, for various design decisions: (Top) model architecture (2x64, 2x250, 2x500), ensemble size (1, 3, 5), warmstarting model weights (True/False), and (Bottom) planning horizon, controller type, and reward-weighting γ.
outperforms the others. In the final plot, we study the effect of the γ reward-weighting
variable, showing that medium values provide the best balance of dimensionality reduction
and smooth integration of action samples versus loss of control authority. Here, too soft of
a weighting leads to minimal movement of the hand, and too hard of a weighting leads to
aggressive behaviors that frequently drop the objects.
4.4.4 Comparisons
In this section, we compare our method to the following state-of-the-art model-based and
model-free RL algorithms: Nagabandi et al. (Nagabandi, Kahn, Ronald S. Fearing, et al. 2018) is the method introduced in the previous two chapters, which learns a deterministic neural network model combined with a random shooting MPC controller; PETS (Chua et al. 2018a) combines uncertainty-aware deep network dynamics models with sampling-
based uncertainty propagation; NPG (Kakade 2002) is a model-free natural policy gradient
method, and has been used in prior work on learning manipulation skills (Rajeswaran et al.
2017); SAC (Haarnoja, Zhou, Abbeel, et al. 2018) is an off-policy model-free RL algorithm;
MBPO (Janner et al. 2019) is a recent hybrid approach that uses data from its model to
accelerate policy learning. On our suite of dexterous manipulation tasks, PDDM consistently
outperforms prior methods both in terms of learning speed and final performance, even
solving tasks that prior methods cannot.
Valve turning: We first experiment with a three-fingered hand rotating a valve, with starting and goal positions chosen randomly from the range [−π, π]. On this simpler task, we confirm that most of the prior methods do in fact succeed. We also see that even on this simpler task, policy gradient approaches such as NPG require prohibitively large amounts of data (note the log scale of Figure 4.6).
Figure 4.6: Valve turning: task reward vs. number of datapoints for MBPO, PETS, Nagabandi et al., SAC, NPG, and PDDM (ours).
Figure 4.7: In-hand reorientation of a cube. Our method achieves the best results for both (top) 2 and
(bottom) 8 goal angles, and model-free algorithms and methods that directly learn a policy, such as MBPO,
struggle with 8 goal angles.
Figure 4.8: Left: Manipulating a pencil to make its tip follow a fixed desired path, where PDDM learns substantially faster than SAC and NPG. Right: Manipulating a pencil to make its tip follow arbitrary paths, where only PDDM succeeds. Note that due to the decoupling of dynamics from task, PDDM requires a similar amount of training data in both of these scenarios.
Figure 4.9: PDDM outperforms prior model-based and model-free methods on the simulated Baoding balls task.
Baoding Balls: This task is particularly challenging due to the inter-object interactions,
which can lead to drastically discontinuous dynamics and frequent failures from dropping
the objects. We were unable to get the other model-based or model-free methods to succeed
at this task (Figure 4.9), but PDDM solves it using just 100,000 data points, or 2.7 hours
worth of data. Additionally, we can employ the model that was trained on 100-step rollouts
to then run for much longer (1000 steps) at test time. The model learned for this task can
also be repurposed, without additional training, to perform a variety of related tasks (see
video3 ): moving a single ball to a goal location in the hand, posing the hand, and performing
clockwise rotations instead of the learned counter-clockwise ones.
3 https://sites.google.com/view/pddm
Figure 4.10: Real-world Baoding balls hardware setup with the ShadowHand (left), Franka-Emika arm used
for the automated reset mechanism (middle), and the resulting success rates for 90◦ and 180◦ turns (right).
Hardware setup: In order to run this experiment in the real world (Figure 4.10), we use a camera tracker to provide low-latency, robust, and accurate 3D position estimates of the Baoding balls. To enable this tracking, we employ a dilated CNN modeled after the one
in KeypointNet (Suwajanakorn et al. 2018). The input to the system is a 280x180 RGB
stereo pair (no explicit depth) from a calibrated 12 cm baseline camera rig. The output is a
spatial softmax for the 2D location and depth of the center of each sphere in camera frame.
Standard pinhole camera equations convert 2D and depth into 3D points in the camera
frame, and an additional calibration finally converts it into the ShadowHand’s coordinate
system. Training of the model is done in sim, with fine-tuning on real-world data. Our
semi-automated process of composing static scenes with the spheres, moving the stereo rig,
and using VSLAM algorithms to label the images using relative poses of the camera views
substantially decreased the amount of hand-labelling that was required. We hand-labeled
only about 100 images in 25 videos, generating over 10,000 training images. We observe
average tracking errors of 5 mm and latency of 20 ms, split evenly between image capture
and model inference.
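The back-projection from a tracked 2D keypoint and its estimated depth to a 3D point uses the standard pinhole camera model; a small sketch is given below, where the intrinsics fx, fy, cx, cy and the camera-to-hand transform T_cam_to_hand are assumed to come from the calibration mentioned above.

    import numpy as np

    def keypoint_to_3d(u, v, depth, fx, fy, cx, cy, T_cam_to_hand):
        """Back-project a 2D keypoint (u, v) with estimated depth into a 3D point in
        the camera frame, then transform it into the ShadowHand's coordinate system."""
        x = (u - cx) * depth / fx              # standard pinhole back-projection
        y = (v - cy) * depth / fy
        p_cam = np.array([x, y, depth, 1.0])   # homogeneous point in the camera frame
        p_hand = T_cam_to_hand @ p_cam         # apply the extrinsic calibration
        return p_hand[:3]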
As shown in the supplementary video, we also implement an automated reset mechanism,
which consists of a ramp that funnels the dropped Baoding balls to a specific position and
then triggers a preprogrammed 7-DoF Franka-Emika arm to use its parallel jaw gripper
to pick them up and return them to the Shadow Hand’s palm. The planner commands the
hand at 10 Hz, and these commands are passed via a zero-order hold to the low-level position controller
that operates at 1kHz. The episode terminates if the specific task horizon of 10 seconds
has elapsed or if the hand drops either ball, at which point a reset request is issued again.
Numerous sources of delays in real robotic systems, in addition to the underactuated nature
of the real Shadow Hand, make the task quite challenging in the real world.
Results: After less than 2 hours of real-world training, PDDM is able to learn 90◦ rotations
without dropping the two Baoding balls with a success rate of about 100%, and can achieve a
success rate of about 54% on the challenging 180◦ rotation task, as shown in Figure 4.10. An
example trajectory of the robot rotating the Baoding balls using PDDM is shown in Figure 4.2,
and videos on the project website 4 illustrate task progress through various stages of training.
Qualitatively, we note that performance improves fastest during the first 1.5 hours of training;
after this, the system must learn the more complex transition of transferring the control of
a Baoding ball from the pinky to the thumb (with a period of time in between, where the
hand has only indirect control of the ball through wrist movement). These results illustrate
that, although the real-world version of this task is substantially more challenging than its
simulated counterpart, our method can learn to perform it with considerable proficiency
using a modest amount of real-world training data.
4.5 Discussion
In this chapter, we presented a method for using deep model-based RL to learn dexterous
manipulation skills with multi-fingered hands. We demonstrated results on challenging non-
prehensile manipulation tasks, including controlling free-floating objects, agile finger gaits for
repositioning objects in the hand, and precise control of a pencil to write user-specified strokes.
As we showed in our experiments, our method achieves substantially better results than prior
deep model-based RL methods, and also demonstrates advantages over model-free RL: it
requires substantially less training data and results in a model that can be flexibly reused
to perform a wide variety of user-specified tasks. In addition to analyzing the approach on
our simulated suite of tasks using 1-2 hours worth of training data, we demonstrated PDDM
on a real-world 24 DoF anthropomorphic hand, showing successful in-hand manipulation of
objects using just 4 hours worth of entirely real-world interactions.
4 https://sites.google.com/view/pddm/
Chapter 5
Figure 5.2: We implement our sample-efficient meta-reinforcement learning algorithm on a real legged
millirobot, enabling online adaptation to new tasks and unexpected occurrences such as losing a leg (shown
here), novel terrains and slopes, errors in pose estimation, and pulling payloads.
Our approach uses meta-learning to train a dynamics model prior such that, when
combined with a small amount of recent data, this meta-learned prior can be rapidly adapted
to the local context. Our experiments demonstrate online adaptation for continuous control
tasks on both simulated and real-world agents. We first show simulated agents adapting their
behavior online to novel terrains, crippled body parts, and highly-dynamic environments. We
also illustrate the importance of incorporating online adaptation into autonomous agents
that operate in the real world by applying our method to a real dynamic legged millirobot.
We demonstrate the agent’s learned ability to quickly adapt online to a missing leg, adjust
to novel terrains and slopes, account for miscalibration or errors in pose estimation, and
compensate for pulling payloads. Videos of the experiments as well as the code are available
online1 .
5.1 Introduction
Both model-based and model-free reinforcement learning (RL) methods generally operate in
one of two regimes: all training is performed in advance, producing a model or policy that can
be used at test-time to make decisions in settings that approximately match those seen during
training; or, training is performed online (e.g., as in the case of online temporal-difference
learning), in which case the agent can slowly modify its behavior as it interacts with the
environment. However, in both of these cases, dynamic changes such as failure of a robot’s
components, encountering a new terrain, environmental factors such as lighting and wind, or
other unexpected perturbations, can cause the agent to fail. In contrast, humans can rapidly
adapt their behavior to unseen physical perturbations and changes in their dynamics (Braun
et al. 2009): adults can learn to walk on crutches in just a few seconds, people can adapt
almost instantaneously to picking up an object that is unexpectedly heavy, and children that
can walk on carpet and grass can quickly figure out how to walk on ice without having to
relearn how to walk. How is this possible? If an agent has encountered a large number of
perturbations in the past, it can in principle use that experience to learn how to adapt. In
this work, we propose a meta-learning approach for learning online adaptation.
Motivated by the ability to tackle real-world applications, we specifically develop a model-
based meta-RL algorithm. In this model-based setting, data for updating the model is readily
available at every timestep in the form of recent experiences. Crucially, the meta-training
process for training such an adaptive model can be much more sample efficient than model-
free meta-RL approaches (Duan, John Schulman, X. Chen, Peter L. Bartlett, et al. 2016a;
J. X. Wang et al. 2016a; Finn, Abbeel, and Levine 2017c). Further, our approach foregoes the episodic framework on which model-free meta-RL approaches rely, where tasks are pre-defined to be different rewards or environments, and tasks exist at the trajectory level
only. Instead, our method considers each timestep to potentially be a new “task”, where
any detail or setting could have changed at any timestep. This view induces a more general
meta-RL problem setting by allowing the notion of a task to represent anything from existing
1 https://sites.google.com/berkeley.edu/metaadaptivecontrol
5.3 Preliminaries
In this section, we present model-based RL, introduce the meta-learning formulation, and
describe two main meta-learning approaches.
5.3.2 Meta-Learning
Meta-learning is concerned with automatically learning learning algorithms that are more
efficient and effective than learning each task from scratch. These algorithms leverage data
from previous tasks to acquire a learning procedure that can quickly adapt to new tasks.
These methods operate under the assumption that the previous meta-training tasks and the
new meta-test tasks are drawn from the same task distribution ρ(T ) and share a common
structure that can be exploited for fast learning. In the supervised learning setting, we aim to learn a function f_θ with parameters θ that minimizes a supervised learning loss L. The goal of meta-learning is then to find a learning procedure, denoted as θ' = u_ψ(D^tr_T, θ), that can use just small amounts of data D^tr_T from each task T to generate updated or adapted parameters θ' such that the objective L is well optimized on other data D^test_T from that same task.
We can formalize this meta-learning problem setting as optimizing for the parameters of the learning procedure θ, ψ as follows:

min_{θ,ψ} E_{T ∼ ρ(T)} [ L(D^test_T, θ'_T) ]   s.t.   θ'_T = u_ψ(D^tr_T, θ),   (5.1)

where D^tr_T, D^test_T are sampled without replacement from the meta-training dataset D_T.
Once meta-training optimizes for the parameters θ*, ψ*, the learning procedure u_{ψ*}(·, θ*) can then be used to learn new held-out tasks from small amounts of data. We will also refer to this learned learning procedure u as the update function.
the new tasks. Using the notation from before, MAML prescribes the learning algorithm above to be gradient descent:

u_ψ(D^tr_T, θ) = θ − α ∇_θ L(D^tr_T, θ).   (5.2)

The learning rate α may be a learnable parameter (in which case ψ = α) or fixed as a hyperparameter, leading to ψ = ∅. Despite the update rule being fixed, a meta-learned initialization θ of an overparameterized deep network followed by gradient descent is as expressive as update rules represented by deep recurrent networks (Finn and Levine 2017b).
assumes that the environment is locally consistent, in that every segment of length j − i is
from the same environment. Even though this assumption is not always correct, it allows us
to learn to adapt from data without knowing when the environment has changed. Due to the
fast nature of our adaptation (less than a second), this assumption is seldom violated.
We pose the meta-RL problem in this setting as an optimization over (θ, ψ) with respect to a maximum likelihood meta-objective. The meta-objective is the likelihood of the data under a predictive model p̂_{θ'}(s' | s, a) with parameters θ', where θ' = u_ψ(τ_E(t − M, t − 1), θ) corresponds to model parameters that were updated using the past M data points. Concretely, this corresponds to the following optimization:

min_{θ,ψ} E_{τ_E(t−M, t+K) ∼ D} [ L(τ_E(t, t + K), θ'_E) ]   s.t.   θ'_E = u_ψ(τ_E(t − M, t − 1), θ).   (5.3)
In the meta-objective in Equation 5.3, note that the past M points are used to adapt θ into θ', and the loss of this θ' is evaluated on the future K points. Thus, at every time step t, we use
the past M timesteps to provide insight into how to adapt our model such that it performs
well for timesteps in the near future. As outlined in Algorithm 4 below, the update rule uψ
for the inner update and a gradient step on θ for the outer update allow us to optimize this
meta-objective of adaptation. By training with this objective during meta-train time, we
learn the ability to fine-tune the model using just M data points and thus achieve fast online
adaptation. Note that, in our experiments, we compare this approach to using the same M
data points to adapt a model that was not meta-learned, and we see that this meta-training
objective is indeed necessary to enable fast adaptation at test time. We instantiate two
versions of our algorithm below, using a recurrence-based meta-learner and a gradient-based
meta-learner.
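For the gradient-based instantiation (GrBAL), the meta-objective of Equation 5.3 can be sketched as follows. This is a schematic only: the nll_loss function (the model's negative log-likelihood on a batch of transitions, evaluated at an explicit list of parameter tensors) and the trajectory slicing are assumptions made for illustration.

    import torch

    def grbal_meta_loss(model_params, nll_loss, traj, t, M, K, inner_lr):
        """GrBAL meta-objective at time t for one trajectory: adapt the model on the
        past M transitions with one inner gradient step, then evaluate the adapted
        parameters on the next K transitions (cf. Equation 5.3)."""
        past = traj[t - M : t]        # data used for the inner adaptation (u_psi)
        future = traj[t : t + K]      # data used to evaluate the adapted model

        # Inner update: one gradient step on the past M transitions.
        inner_loss = nll_loss(model_params, past)
        grads = torch.autograd.grad(inner_loss, model_params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(model_params, grads)]

        # Outer objective: loss of the adapted parameters on the future K transitions.
        return nll_loss(adapted, future)

Meta-training then minimizes the average of this quantity over sampled trajectory segments with respect to the initial parameters (and, optionally, the inner learning rate), so that a single gradient step on recent data yields a model that predicts the near future well.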
2. Does our approach enable fast adaptation to varying dynamics, tasks, and environments,
both inside and outside of the training distribution?
2 https://sites.google.com/berkeley.edu/metaadaptivecontrol
Half-cheetah (HC): disabled joint. For each rollout during meta-training, we randomly
sample a joint to be disabled (i.e., the agent cannot apply torques to that joint). At test
time, we evaluate performance in two different situations: (a) disabling a joint unseen during
training, to test generalization in the face of a new environment (referred to as “HC Dis.
Gen”), and (b) switching between disabled joints during a rollout, to test fast adaptation to
changing dynamics (referred to as “HC Dis. F.A.”).
HC: sloped terrain. For each rollout during meta-training, we randomly select an upward
or downward slope of low steepness. At test time, we evaluate performance on an unseen
setting of a steep hill that first goes up and then down (referred to as “HC Hill”).
HC: pier. In this experiment, the cheetah runs over a series of blocks that are floating on
water. Each block moves up and down when stepped on, and the changes in the dynamics
are frequent due to each block having different damping and friction properties. The HC is
meta-trained by varying these block properties, and tested on a specific (randomly-selected)
configuration of properties.
Ant: crippled leg. For each meta-training rollout, we randomly sample a leg to cripple
on this quadrupedal robot. This causes unexpected and drastic changes to the underlying
dynamics. We evaluate this agent at test time in two different situations: (a) crippling a leg
unseen during training, to test generalization in the face of a new environment (referred to
as “Ant Crip. Gen.”), and (b) switching between normal operation and having a crippled
leg within a rollout, to test fast adaptation to changing dynamics (referred to as “Ant Crip. F.A.”).
In the following sections, we evaluate our model-based meta-RL methods (GrBAL and
ReBAL) in comparison to the prior methods described below.
evaluation (Krause, Kahembwe, et al. 2017). This final comparison evaluates the benefit of
explicitly training for adaptability (i.e., meta-training).
All model-based approaches (MB, MB+DE, GrBAL, and ReBAL) use the same neural
network architecture, and use the same planner within experiments: MPPI (Williams, Aldrich,
and Theodorou 2015) for the simulated experiments and random shooting (RS) (Nagabandi,
Kahn, Ronald S Fearing, et al. 2017) for the real-world experiments.
model-based methods (including our approach) using the equivalent of 1.5-3 hours of real-
world experience. Our methods result in superior or equivalent performance to the model-free
agent that is trained with 1000× more data. Our methods also surpass the performance of
the non-meta-learned model-based approaches. Finally, our performance closely matches the
high asymptotic performance of the model-free meta-RL method for half-cheetah disabled,
and achieves a suboptimal performance for ant crippled but, again, it does so with the
equivalent of 1000× less data. Note that this suboptimality in asymptotic performance
is a known issue with model-based methods, and thus an interesting direction for future
efforts. The improvement in sample efficiency from using model-based methods matches prior
findings (Marc Deisenroth and Carl E Rasmussen 2011; Nagabandi, Kahn, Ronald S Fearing,
et al. 2017; Kurutach et al. 2018b); the most important evaluation, which we discuss in more
detail next, is the ability for our method to adapt online to drastic dynamics changes in only
a handful of timesteps.
Figure 5.6: Compared to model-free RL, model-free meta-RL, and model-based RL methods, our
model-based meta-RL methods achieve good performance with 1000× less data. Dotted lines indicate
performance at convergence. For MB+DE+MPPI, we perform dynamic evaluation at test time on
the final MB+MPPI model.
2. A model that saw no perturbation forces (e.g., trained on a single task) during training
did the worst at test time.
3. The middle 3 training ranges show comparable performance in the “constant force =
4” case, which is an out-of-distribution task for those models. Thus, there is not actually a strong restriction on what needs to be seen during training in order for adaptation to occur at test time (though there is a general trend that more is better).
Figure 5.9: Learning curves, for different values of K = M , of GrBAL in the half-cheetah disabled
and sloped terrain environments. The curves suggest that GrBAL performance is fairly robust to the values of these hyperparameters.
Reward function:
Half-cheetah: (x_{t+1} − x_t)/0.01 − 0.05‖a_t‖²
Ant: (x_{t+1} − x_t)/0.02 − 0.005‖a_t‖² + 0.05
5.6.7 Hyperparameters
Below, we list the hyperparameters of our experiments. In all experiments we used a single
gradient step for the update rule of GrBAL. The learning rate (LR) of TRPO in the table
below corresponds to the Kullback–Leibler divergence constraint. # Task/itr corresponds to
the number of tasks sampled while collecting new data, whereas # TS/itr is the total number of time steps collected (across all tasks). T refers to the number of steps in each rollout, nA
is the number of actions sampled for each controller step of performing action selection, H is
the planning horizon for the MPC planner, and K = M is the number of past steps used for
adapting the model as well as the number of future steps on which to validate the adapted
model’s prediction loss.
(Hyperparameter tables, one per setting, list: LR, Inner LR, Epochs, K, M, Batch Size, # Tasks/itr, # TS/itr, T, nA Train, H Train, nA Test, and H Test.)
Figure 5.10: GrBAL significantly outperforms both MB and MB+DE, when tested on environments
that require online adaptation and/or were never seen during training.
Table 5.5: Trajectory following costs for real-world GrBAL and MB results when tested on three
terrains that were seen during training. Tested here for left turn (Left), straight line (Str), zig-zag
(Z-z), and figure-8 shapes (F-8). The methods perform comparably, indicating that online adaptation
is not needed in settings that are indeed seen during training, but including it is not detrimental.
Next, we test the performance of our method on what it is intended for: fast online
adaptation of the learned model to enable successful execution in new and changing environ-
ments at test time. Similar to the comparisons above, we compare GrBAL to a model-based
method (MB) that involves neither meta-training nor online adaptation, as well as a dynamic
evaluation method that involves online adaptation of that MB model (MB+DE). Our results
in Fig. 5.10 demonstrate that GrBAL substantially outperforms MB and MB+DE. Unlike
MB and MB+DE, GrBAL can quickly (1) adapt online to a missing leg, (2) adjust to novel
terrains and slopes, (3) account for miscalibration or errors in pose estimation, and (4)
compensate for pulling unexpected payloads.
None of these environments were seen during training time, but the agent’s ability to learn
how to learn enables it to quickly leverage its prior knowledge and fine-tune to adapt to these
Figure 5.11: The dotted black line indicates the desired trajectory in the xy plane. GrBAL adapts
online to prevent drift from a missing leg, prevents sliding sideways down a slope, accounts for
pose miscalibration errors, and adjusts to pulling payloads (left to right). Note that none of
these tasks/environments were seen during training time, and they require fast and effective online
adaptation for success.
new environments online (i.e., in less than a second). Furthermore, the poor performance of
the MB and MB+DE baselines demonstrate not only the need for adaptation, but also the
importance of meta-learning to give us good initial parameters to adapt from. The qualitative
results of these experiments, shown in Fig. 5.11, show that GrBAL allows the legged robot
to adapt online and effectively follow the target trajectories, even in the presence of new
environments and unexpected perturbations at test time.
5.8 Discussion
In this chapter, we presented an approach for model-based meta-RL that enabled fast, online
adaptation of large and expressive models in dynamic environments. We showed that meta-
learning a model for online adaptation resulted in a method that is able to adapt to unseen
situations or sudden and drastic changes in the environment, and is also sample efficient to
train. We provided two instantiations of our approach (ReBAL and GrBAL), and we provided
a comparison with other prior methods on a range of continuous control tasks. Finally, we
showed that (compared to model-free meta-RL approaches), our approach is practical for
real-world applications, and that this capability to adapt quickly is particularly important
under complex real-world dynamics.
Chapter 6
6.1 Introduction
Human and animal learning is characterized not just by a capacity to acquire complex skills,
but also the ability to adapt rapidly when those skills must be carried out under new or
changing conditions. For example, animals can quickly adapt to walking and running on
different surfaces (Herman 2017) and humans can easily modulate force during reaching move-
ments in the presence of unexpected perturbations (Flanagan and Wing 1993). Furthermore,
these experiences are remembered, and can be recalled to adapt more quickly when similar
disturbances occur in the future (Doyon and Benali 2005). Since learning entirely new models
on such short time-scales is impractical, we can devise algorithms that explicitly train models
to adapt quickly from small amounts of data. Such online adaptation is crucial for intelligent
systems operating in the real world, where changing factors and unexpected perturbations are
the norm. In this chapter, we propose an algorithm for fast and continuous online learning
that utilizes deep neural network models to build and maintain a task distribution, allowing
for the natural development of both generalization as well as task specialization.
Our working example is continuous adaptation in the model-based reinforcement learning
(RL) setting, though our approach generally addresses any online learning scenario with
streaming data. We assume that each “trial” consists of multiple tasks, and that the
delineation between the tasks is not provided explicitly to the learner – instead, the method
must adaptively decide what “tasks” even represent, when to instantiate new tasks, and when
to continue updating old ones. For example, a robot running over changing terrain might
need to handle uphill and downhill slopes, and might choose to maintain separate models
that become specialized to each slope, adapting to each one in turn based on the currently
inferred surface.
We perform adaptation simply by using online stochastic gradient descent (SGD) on the
model parameters, while maintaining a mixture model over model parameters for different
tasks. The mixture is updated via the Chinese restaurant process (Stimberg, Ruttor, and
Opper 2012), which enables new tasks to be instantiated as needed over the course of a trial.
Although online learning is perhaps one of the oldest applications of SGD (Bottou 1998),
modern parametric models such as deep neural networks are exceedingly difficult to train
1 https://sites.google.com/berkeley.edu/onlineviameta
online with this method. They typically require medium-sized minibatches and multiple
epochs to arrive at sensible solutions, which is not suitable when receiving data in an online
streaming setting. One of our key observations is that meta-learning can be used to learn
a prior initialization for the parameters that makes such direct online adaptation feasible,
with only a handful of gradient steps. The meta-training procedure we use is based on
model-agnostic meta-learning (MAML) (Finn, Abbeel, and Levine 2017a), where a prior
weight initialization is learned for a model so as to optimize improvement on any task from a
meta-training task distribution after a small number of gradient steps.
Meta-learning with MAML has previously been extended to model-based RL (Nagabandi,
Clavera, et al. 2018), but only for the k-shot adaptation setting: The meta-learned prior
model is adapted to the k most recent time steps, but the adaptation is not carried forward in
time (i.e., adaptation is always performed from the prior itself). Note that this is the method
presented in the previous chapter. This rigid batch-mode setting is restrictive in an online
learning setup and is insufficient for tasks that are further outside of the training distribution.
A more natural formulation is one where the model receives a continuous stream of data and
must adapt online to a potentially non-stationary task distribution. This requires both fast
adaptation and the ability to recall prior tasks, as well as an effective adaptation strategy to
interpolate as needed between the two.
The primary contribution of this work is a meta-learning for online learning (MOLe)
algorithm that uses expectation maximization, in conjunction with a Chinese restaurant
process prior on the task distribution, to learn mixtures of neural network models that are
each updated with online SGD. In contrast to prior multi-task and meta-learning methods,
our method’s online assignment of soft task probabilities allows for task specialization to
emerge naturally, without requiring task delineations to be specified in advance. We evaluate
MOLe in the context of model-based RL on a suite of challenging simulated robotic tasks
including disturbances, environmental changes, and simulated motor failures. The simulated
experiments show a half-cheetah agent and a hexapedal crawler robot performing continuous
model adaptation in an online setting. Our results show online instantiation of new tasks,
the ability to adapt to out-of-distribution tasks, and the ability to recognize and revert back
to prior tasks. Additionally, we demonstrate that MOLe outperforms the state-of-the-art
prior method that was introduced in the previous chapter (which does k-shot model-based
meta-RL), as well as natural baselines such as continuous gradient updates for adaptation
and online learning without meta-training.
6.3.1.1 Overview
Let pθ(Tt ) (yt |xt ) represent the predictive distribution of the model on input xt , for an unknown
task Tt at time step t. In our mixture model, each option Ti for the task Tt corresponds to its
own set of model parameters θ(Ti ). Our goal is to estimate model parameters θt (Ti ) for each
task Ti in the non-stationary task distribution: This requires inferring the distribution over
tasks at each step P (Tt = Ti |xt , yt ) ∀Ti given some data observations, using that inferred
task distribution to make predictions ŷt = pθ(Tt ) (yt |xt ), and also using it to update model
parameters from θt (Ti ) to θt+1 (Ti ) ∀Ti . In practice, the parameters θ(Ti ) of each model will
correspond to the weights of a neural network fθ(Ti ) .
Each model begins with some prior parameter vector θ∗ , which we will discuss in more
detail in Section 6.3.2. Since the number of tasks is also unknown, we begin with one task at
time step 0, where |T | = 1 and thus θ0 (T ) = {θ0 (T0 )} = {θ∗ }. From here, we continuously
update all parameters in θt (T ) at each time step (as explained below) and add new tasks
as needed, in the attempt to model the true underlying process P (Yt |Xt , Tt ), which we only
observe in the form of incoming xt and yt observations. Since underlying task identities Tt
are unknown, we must also estimate this P (Tt ) at each time step.
Thus, the online learning problem consists of iteratively inferring task probabilities
P (Tt = Ti |xt , yt ), and then using that inferred task distribution to adapt θt (Ti ) ∀Ti at each
time step t. The process of inferring the task probabilities is described in further detail below, but the model update step under the current inferred task distribution is done by
optimizing the expected log-likelihood of the data, given by

L = −E_{T_t ∼ P(T_t | x_t, y_t)} [ log p_{θ_t(T_t)}(y_t | x_t) ].   (6.1)
Intuitively, this objective for updating the model seeks model parameters that best explain the
observed data, under the current task distribution. Overall, the algorithm iterates between
(a) using observed data to infer posterior task probabilities and then (b) using the inferred
task probabilities to perform a soft update of model parameters for all tasks. This iterative
procedure is detailed below, along with a mechanism for automatically instantiating new
tasks as needed.
Here, the likelihood of the data p(yt |xt , Tt = Ti ) is directly given by the model as
pθt (Ti ) (yt |xt ), and the task prior can be chosen as desired. In this work, we choose to
formulate the task prior P (Tt = Ti ) using a Chinese restaurant process (CRP). The CRP is
an instantiation of a Dirichlet process. In the CRP, at time t, the probability of each task Ti
should be given by
P(T_t = T_i) = \frac{n_{T_i}}{t - 1 + \alpha} \qquad (6.3)
where nTi is the expected number of datapoints in task Ti for all steps 1, . . . , t − 1, and α is
a hyperparameter that controls the instantiation of new tasks. The prior therefore becomes
P(T_t = T_i) = \frac{\sum_{t'=1}^{t-1} P(T_{t'} = T_i \mid x_{t'}, y_{t'})}{t - 1 + \alpha} \quad \text{and} \quad P(T_t = T_{\text{new}}) = \frac{\alpha}{t - 1 + \alpha}. \qquad (6.4)
Intuitively, this prior induces a bias that says that tasks seen more often are more likely, and
α controls the possibility of a new task. Combining this choice of prior with the likelihood
given by the predictive model, we derive the following posterior task probability distribution:
P(T_t = T_i \mid x_t, y_t) \propto p_{\theta_t(T_i)}(y_t \mid x_t)\left[\sum_{t'=1}^{t-1} P(T_{t'} = T_i \mid x_{t'}, y_{t'}) + \mathbb{1}(T_i = T_{\text{new}})\,\alpha\right]. \qquad (6.5)
Having estimated the latent task probabilities, we next perform the M step, which improves
the expected log-likelihood in Equation 6.1 based on the inferred task distribution. Since
each task starts at t = 0 from the prior θ∗ , the values of all parameters after t + 1 update
steps become θt+1 (T ) as follows:
\theta_{t+1}(T_i) = \theta^* - \beta \sum_{t'=0}^{t} P(T_{t'} = T_i \mid x_{t'}, y_{t'})\, \nabla_{\theta_{t'}(T_i)} \log p_{\theta_{t'}(T_i)}(y_{t'} \mid x_{t'}) \quad \forall\, T_i. \qquad (6.6)
If we assume that all parameters of θt (T ) have already been updated for the previous time
steps 0, . . . , t, we can approximate the last iteration of this update by simply updating all
parameters from θt (T ) to θt+1 (T ) on the newest data:
\theta_{t+1}(T_i) = \theta_t(T_i) - \beta\, P(T_t = T_i \mid x_t, y_t)\, \nabla_{\theta_t(T_i)} \log p_{\theta_t(T_i)}(y_t \mid x_t) \quad \forall\, T_i. \qquad (6.7)
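To make the iteration above concrete, the following is a minimal sketch (in Python/NumPy) of a single online E/M update. A toy linear-Gaussian model stands in for the neural network dynamics model, and the function names, the toy model, and the rule used here for spawning a new task are illustrative assumptions rather than the exact implementation.

```python
import numpy as np

def log_lik(theta, x, y, sigma=1.0):
    # Gaussian log-likelihood of y under a toy linear model y ~ N(theta @ x, sigma^2 I);
    # in MOLe this role is played by the neural network dynamics model p_theta(y_t | x_t).
    err = y - theta @ x
    return -0.5 * np.sum(err ** 2) / sigma ** 2

def grad_neg_log_lik(theta, x, y, sigma=1.0):
    # Gradient of the negative log-likelihood (the loss of Eq. 6.1) w.r.t. theta for the toy model.
    err = y - theta @ x
    return -np.outer(err, x) / sigma ** 2

def em_online_step(models, counts, t, x_t, y_t, theta_star, alpha=1.0, beta=1e-2):
    """One online E/M iteration: CRP prior (Eq. 6.4), posterior (Eq. 6.5), soft update (Eq. 6.7)."""
    # E step: likelihood of the newest datapoint under every instantiated task, plus a fresh
    # copy of the meta-learned prior standing in for a potential new task.
    liks = np.array([np.exp(log_lik(th, x_t, y_t)) for th in models] +
                    [np.exp(log_lik(theta_star, x_t, y_t))])
    prior = np.array(counts + [alpha]) / (t - 1 + alpha)   # CRP prior over known tasks + "new"
    post = liks * prior
    post = post / post.sum()

    # Simplified stand-in for the new-task rule: instantiate a task if the fresh prior copy
    # explains the data better than every existing task.
    if post[-1] == post.max():
        models.append(theta_star.copy())
        counts.append(0.0)
    else:
        post = post[:-1] / post[:-1].sum()

    # M step: every task takes a gradient step on the newest data, weighted by its posterior.
    for i in range(len(models)):
        models[i] = models[i] - beta * post[i] * grad_neg_log_lik(models[i], x_t, y_t)
        counts[i] += post[i]
    return models, counts, post

# Example: start from the (meta-learned) prior and stream datapoints one at a time.
theta_star = np.zeros((2, 3))
models, counts = [theta_star.copy()], [0.0]
for t, (x_t, y_t) in enumerate([(np.ones(3), np.ones(2))] * 5, start=1):
    models, counts, post = em_online_step(models, counts, t, x_t, y_t, theta_star)
```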
Figure 6.2: Overview of our algorithm for online learning with mixture of networks. The algorithm decides
“task” delineations on its own, starting with only the meta-learned prior θ∗ and adding new tasks as it deems
necessary for the overall objective of log-likelihood of the observed data. Each instantiated task has its
own set of model parameters, and the algorithm alternates between an expectation (E) step of estimating the
posterior task probabilities, and a maximization (M) step of optimizing the log likelihood of data under that
inferred task distribution with respect to the model parameters.
6.3.2 Meta-Learning the Prior
We formulated an algorithm above for performing online adaptation using continually incoming
data. For this method, we choose to meta-train the prior using the model-agnostic meta-
learning (MAML) algorithm. This meta-training algorithm is an appropriate choice, because
it results in a prior that is specifically intended for gradient-based fine-tuning. Before we
further discuss our choice of meta-training procedure, we first give an overview of MAML
and meta-learning in general.
Recall from the previous chapter that, given a distribution of tasks, a meta-learning
algorithm produces a learning procedure which can quickly adapt to a new task. MAML
optimizes for an initialization of a deep network θ∗ that achieves good k-shot task generaliza-
tion when fine-tuned using just a few (k) datapoints from that task. At train time, MAML
sees small amounts of data from a distribution of tasks, where data D_T from each task T can
be split into training and validation subsets (D_T^{tr} and D_T^{val}). We will define D_T^{tr} as having
k datapoints. MAML optimizes for model parameters θ such that one or more gradient
steps on D_T^{tr} result in a minimal loss L on D_T^{val}. In our case, we will set D_{T_t}^{tr} = (x_t, y_t) and
D_{T_t}^{val} = (x_{t+1}, y_{t+1}), and the loss L will correspond to the negative log likelihood objective
introduced in the previous section.
The MAML meta-RL objective is defined as follows:
\min_\theta \sum_{T} \mathcal{L}\left(\theta - \eta \nabla_\theta \mathcal{L}(\theta, D_T^{tr}),\; D_T^{val}\right) = \min_\theta \sum_{T} \mathcal{L}(\phi_T, D_T^{val}), \qquad (6.8)
where a good θ is one that uses a small amount of data D_T^{tr} to perform an inner update
φ_T = θ − η∇_θ L(θ, D_T^{tr}) with learning rate η, such that this updated information φ_T is then
able to optimize the objective well on the unseen data D_T^{val} from that same task. After meta-
training with this objective, the resulting θ∗ acts as a prior from which effective fine-tuning
can occur on a new task Ttest at test-time. Here, only a small amount of recent experience
from D_{T_{test}}^{tr} is needed in order to update the meta-learned prior θ∗ into a φ_{T_{test}} that is more
representative of the current task at hand.
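As a concrete reference for Equation 6.8, the following is a minimal PyTorch sketch of one MAML meta-gradient step on a toy regression model. The two-parameter linear model, the synthetic tasks, and the learning rates are illustrative assumptions, not the networks or values used in our experiments.

```python
import torch

def maml_meta_step(theta, tasks, inner_lr=0.01, outer_lr=0.001):
    """One meta-gradient step on the objective of Eq. 6.8, for a toy linear regression model."""
    def loss(params, data):
        x, y = data
        w, b = params                           # two-parameter toy model; the thesis uses a deep network
        return ((x @ w + b - y) ** 2).mean()

    meta_loss = 0.0
    for d_tr, d_val in tasks:
        # Inner update: phi_T = theta - eta * grad L(theta, D_T^tr), keeping the graph so that
        # the meta-gradient can later flow through this adaptation step.
        grads = torch.autograd.grad(loss(theta, d_tr), theta, create_graph=True)
        phi = [p - inner_lr * g for p, g in zip(theta, grads)]
        # Outer objective: loss of the adapted parameters on held-out data from the same task.
        meta_loss = meta_loss + loss(phi, d_val)

    meta_grads = torch.autograd.grad(meta_loss, theta)
    with torch.no_grad():
        for p, g in zip(theta, meta_grads):
            p -= outer_lr * g                   # gradient step on the summed post-adaptation losses
    return meta_loss.item()

# Example usage with two synthetic tasks, each split into (D_tr, D_val).
theta = [torch.zeros(3, 1, requires_grad=True), torch.zeros(1, requires_grad=True)]
make_task = lambda: ((torch.randn(8, 3), torch.randn(8, 1)), (torch.randn(8, 3), torch.randn(8, 1)))
print(maml_meta_step(theta, [make_task(), make_task()]))
```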
Although MAML (Finn, Abbeel, and Levine 2017a) demonstrated this fast adaptation of
deep neural networks and the work in the previous chapter (Nagabandi, Clavera, et al. 2018)
extended this framework to model-based meta RL, these methods address adaptation in the
k-shot setting, always adapting directly from the meta-learned prior and not allowing further
adaptation or specialization. In this work, we have extended these capabilities by enabling
more evolution of knowledge through a temporally-extended online adaptation procedure,
which was presented in the previous section. Note that our procedure for continual online
learning is initialized with this prior that was meta-trained for k-shot adaptation, with the
intuitive rationale that MAML trains this model to be able to change significantly using only
a small number of datapoints and gradient steps – which is not true in general for a deep
neural network. We show in Sec. 6.5 that our method for continual online learning with
a mixture of models (initialized from this meta-learned prior) outperforms both standard
k-shot adaptation with the same prior (where model parameters are updated at each step
from the prior itself), and also outperforms naively taking many adaptation steps away from
that prior (where model parameters are updated at each step directly from the parameter
values of the previous time step).
We note that it is quite possible to modify the MAML meta-training algorithm to optimize
the model directly with respect to the weighted updates discussed in Section 6.3.1.2. This
simply requires computing the task weights (the E step) on each batch during meta-training,
and then constructing a computation graph where all gradient updates are multiplied by
their respective weights. Standard automatic differentiation software can then compute the
corresponding meta-gradient. For short trial lengths, this is not substantially more complex
than standard MAML; for longer trial lengths, truncated backpropagation is an option.
Although such a meta-training procedure better matches the way that the model is used
during online adaptation, we found that it did not substantially improve our results. While
it’s possible that the difference might be more significant if meta-training for longer-term
adaptation, this observation does suggest that simply meta-training with MAML is sufficient
for enabling effective continuous online adaptation in non-stationary multi-task settings.
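As a rough sketch of this weighted variant (reusing the toy loss and task format from the MAML sketch above), the inner update can simply be scaled by the inferred task weight before the meta-gradient is computed; task_weight here is assumed to come from the E step of Equation 6.5.

```python
import torch

def weighted_inner_update(theta, d_tr, task_weight, inner_lr, loss):
    # Inner MAML update in which the gradient step is scaled by the inferred task probability
    # (the E-step weight of Eq. 6.5); because create_graph=True is used, standard automatic
    # differentiation can compute the meta-gradient through this weighted update.
    grads = torch.autograd.grad(loss(theta, d_tr), theta, create_graph=True)
    return [p - task_weight * inner_lr * g for p, g in zip(theta, grads)]
```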
1. Can MOLe autonomously discover some task structure amid a stream of non-stationary
data?
2. Can MOLe adapt to tasks that are further outside of the task distribution than can be
handled by a k-shot learning approach?
(a) k-shot adaptation with meta-learning: Always adapt from the meta-trained prior θ∗,
as typically done with meta-learning methods (Nagabandi, Clavera, et al. 2018), including
the one introduced in the previous chapter. This method is often insufficient for adapting
to tasks that are further outside of the training distribution, and the adaptation is also not
carried forward in time for future use.
(b) continued adaptation with meta-learning: Always take gradient steps from the
previous time step’s parameters (without reverting back to the meta-learned prior). This
method often overfits to recently observed tasks, so it should illustrate the importance of our
method’s ability to identify task structure in order to avoid overfitting and enable recall.
(c) model-based RL: Train a model on the same data as the methods above, using
standard supervised learning, and keep this model fixed throughout the trials (i.e., no meta-
learning and no adaptation). For context, this is the model-based RL framework that was
introduced in the first few chapters, containing no explicit meta-learning or adaptation
mechanisms.
(d) model-based RL with online gradient updates: Use the same model from
model-based RL (i.e., no meta-learning), but adapt it online at test time using gradient descent.
This is representative of commonly used dynamic evaluation methods (Rei 2015; Krause,
Kahembwe, et al. 2017; Krause, Lu, et al. 2016; Fortunato, Blundell, and Vinyals 2017).
2
https://sites.google.com/berkeley.edu/onlineviameta
Table 6.1: Hyperparameters for train-time
6.5.1 Hyperparameters
In all experiments, we use a dynamics model consisting of three hidden layers, each of
dimension 500, with ReLU nonlinearities. The control method that we use is random-
shooting model predictive control (MPC) where 1000 candidate action sequences each of
horizon length H=10 are sampled at each time step, fed through the predictive model, and
ranked by their expected reward. The first action step from the highest-scoring candidate
action sequence is then executed before the entire planning process repeats again at the next
time step.
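A minimal sketch of this random-shooting MPC procedure is shown below; dynamics_fn, reward_fn, and the action bounds are hypothetical stand-ins for the learned dynamics model, the task reward, and the robot's actuator limits, and the toy point-mass example at the end is purely illustrative.

```python
import numpy as np

def random_shooting_mpc(state, dynamics_fn, reward_fn, action_dim,
                        n_candidates=1000, horizon=10, act_low=-1.0, act_high=1.0):
    """Sample candidate action sequences, roll them through the learned dynamics model,
    rank them by predicted reward, and return only the first action of the best sequence."""
    states = np.repeat(state[None, :], n_candidates, axis=0)
    actions = np.random.uniform(act_low, act_high, size=(n_candidates, horizon, action_dim))
    returns = np.zeros(n_candidates)
    for h in range(horizon):
        returns += reward_fn(states, actions[:, h])          # predicted reward at this step
        states = dynamics_fn(states, actions[:, h])          # predicted next states
    best = np.argmax(returns)
    return actions[best, 0]                                  # execute this action, then replan

# Toy usage with a point-mass model: next_state = state + action, reward = -||state||.
action = random_shooting_mpc(
    state=np.array([1.0, -2.0]),
    dynamics_fn=lambda s, a: s + a,
    reward_fn=lambda s, a: -np.linalg.norm(s, axis=1),
    action_dim=2)
```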
In Tables 6.1 and 6.2, we list relevant training and testing parameters for the various
methods used in our experiments. # Task/itr corresponds to the number of tasks sampled
during each iteration of collecting data to train the model, and # TS/itr is the total number
of time steps collected during that iteration (sum over all tasks). The inner and outer
learning rates control the sizes of the gradient steps during meta-training, where each iteration
out of the “Iters” iterations consists of “Epochs” epochs of model training, with each epoch
consisting of a full pass through the dataset.
Figure 6.3: Half-cheetah robot, shown traversing a landscape with ‘basins’ that was not encountered during
training.
Figure 6.4: Results on half-cheetah terrain traversal. The poorly performing model-based RL shows that a
single model is not sufficient, and the poorly performing model-based RL with online gradient updates shows
that a meta-learned initialization is critical. The three meta-learning approaches perform similarly on these
tasks of different slopes. Note, however, that the performance of k-shot adaptation does deteriorate when the
tasks are further away from the training task distribution, such as the last column above where the test tasks
introduce crippling of joints. Since this unexpected perturbation is far from what was seen during training, it
calls for taking multiple gradient steps away from the prior in order to actually succeed. We see that MOLe
succeeds in all of these task settings.
Figure 6.5: Latent task distribution over time for two half-cheetah landscape traversal tasks, where encountered
terrain slopes vary within each run. Interestingly, we find that MOLe chooses to only use a single latent task
variable to describe varying terrain.
Figure 6.6: Results on the motor malfunction trials, where different trials are shown with task distributions that
change at different frequencies (or stay constant, as in the first column). Here, online learning is critical for
good performance, k-shot adaptation is insufficient for these tasks that are very different from the tasks seen
during training, and continued gradient steps leads to overfitting to recently seen data. MOLe, however,
demonstrates high performance in all of these types of task distributions.
Figure 6.7: Latent task variable distribution over the course of an online learning trial where the underlying
motor malfunction changes every 500 timesteps. We find that MOLe is able to successfully recover the
underlying task structure by recognizing when the underlying task has changed, and even recalling previously
seen tasks. As such, MOLe allows for both specialization as well as generalization.
6.5.4 Crippling of End Effectors on Six-Legged Crawler
To further examine the effects of our continual online adaptation algorithm, we study another,
more complex agent: a 6-legged crawler (Fig. 6.8). In these experiments, models are trained
on random joints being crippled (i.e., unable to apply actuator commands). In Fig. 6.9, we
present two illustrative test tasks: (1) the agent sees a set configuration of crippling for the
duration of its test-time experience, and (2) the agent receives alternating periods of experience,
between regions of normal operation and regions of having crippled legs.
Figure 6.8: Six-legged crawler robot, shown with crippled legs at run time.
The first setting is similar to data seen during training, and thus, we see that even the
model-based RL and model-based RL with online gradient updates baselines do not fail. The
methods that include both meta-learning and adaptation, however, do have higher performance.
Furthermore, we see again that taking continued gradient steps is not detrimental in this
single-task setting.
The second setting’s non-stationary task distribution (when the leg crippling is dynamic)
illustrates the need for online adaptation (model-based RL fails), the need for a good meta-
learning prior to adapt from (failure of model-based RL with online gradient updates), the
harm of overfitting to recent experience and thus forgetting older skills (low performance of
continued gradient steps), and the need for further adaptation away from the prior (limited
performance of k-shot adaptation).
With MOLe, this agent is able to build its own representation of “task” switches, and
we see that this switch does indeed correspond to recognizing regions of leg crippling (left
of Fig. 6.10). The plot of the cumulative sum of rewards (right of Fig. 6.10) of each of the
three meta-learning plus adaptation methods includes this same task switch pattern every
500 steps: Here, we can clearly see that steps 500-1000 and 1500-2000 were the crippled
regions. Continued gradient steps actually performs worse on the second and third times
it sees normal operation, whereas MOLe is noticeably better as it sees the task more often.
Note this improvement of both skills with MOLe, where development of one skill actually
does not hinder the other.
Finally, we examine experiments where the crawler experiences (during each trial) walking
straight, making turns, and sometimes having a crippled leg. During the first 500 time steps
of “walking forward in a normal configuration,” the performance of continued gradient steps
was comparable to MOLe (within roughly 10%), but during the last 500 time steps of that same
task, its performance was 200% lower. Note this detrimental effect of performing continual updates at each step
without allowing for separate task specialization/adaptation.
Figure 6.9: Quantitative results on crawler. For a fixed task (left column), adaptation is not necessary and
all methods perform well. In contrast, when tasks change dynamically within the trial, only MOLe effectively
learns online.
Figure 6.10: Results on crawler experiments. Left: Online recognition of latent task probabilities for
alternating periods of normal/crippled experience. Right: MOLe improves from seeing the same tasks
multiple times; MOLe allows for improvement of both skills, without letting the development of one skill
hinder the other.
6.6 Discussion
In this chapter, we presented an online learning method for neural network models that
can handle non-stationary, multi-task settings within each trial. Our method adapts the
model directly with SGD, where an EM algorithm uses a Chinese restaurant process prior
to maintain a distribution over tasks and handle non-stationarity. Although SGD generally
makes for a poor online learning algorithm in the streaming setting for large parametric
models such as deep neural networks, we observed that, by (1) meta-training the model for
fast adaptation with MAML and (2) employing our online algorithm for probabilistic updates
at test time, we can enable effective online learning with neural networks.
In our experiments, we applied this approach to model-based RL, and we demonstrated
that it could be used to adapt the behavior of simulated robots faced with various new and
unexpected tasks. The results showed that our method can develop its own notion of task,
continuously adapt away from the meta-learned prior as necessary (to succeed at tasks further
outside of the training distribution, which require more adaptation), and recall tasks it has
seen before. This ability to effectively adapt online and develop specializations as well as
maintain generalization is a critical step toward enabling the deployment of robots into the
real world, where the uncontrolled settings are never exactly what was seen during train time.
Chapter 7
Adaptive Online Reasoning from Images via Latent Dynamics Models
Figure 7.2: Our approach (MELD) enables meta-RL directly from raw image observations in the real
world, as shown here by the learned tasks of peg insertion and ethernet cable insertion. MELD allows for
sample-efficiency during meta-training time (enabling us to perform this training procedure in the real world),
as well as the learned ability to perform fast adaptation to new tasks at test time. This ability to perform
a broad range of skills as necessitated by the changing conditions of test-time settings is critical toward the
goal of deploying robots into the real world.
In this chapter, we address the difficulty of working from raw sensory observations such as a robot’s onboard
cameras. Unlike the previous chapters, whose methods operated directly from information
such as joint positions/velocities and object positions/velocities, we now scale up our learning
and adaptation capabilities to the more general setting of working directly from raw image
observations.
Operating from raw image observations introduces challenges in the form of high-
dimensional and partially observable inputs. Making sense of these pixel inputs adds a
representation learning burden on top of the already challenging task learning problem. Prior
work has demonstrated success in single-task RL from these types of inputs by explicitly
addressing the representation learning problem via the learning of latent dynamics models
to make sense of the inputs. Unlike the work in previous chapters which learned models
and then used the model predictions for action selection, the work in this chapter instead
uses models to address the representation learning and meta-learning aspects of the problem,
while allowing for model-free RL techniques to address the task learning problem within this
learned representation space.
We refer to the problem of making sense of incoming observations in order to understand
the underlying state of the system as performing “latent state estimation.” By learning a
latent dynamics model to address this representation learning problem, we still make use
of all of the benefits of models that we’ve seen thus far – their efficient training using even
off-policy data, their adaptability, and their generalization capabilities. Inspired by the use of
these latent state dynamics models to improve single-task RL from high dimensional inputs,
this work aims to extend these capabilities into the meta-RL regime of being able to perform
a broad range of skills. In this work, we posit that the task inference problem in meta-RL can
actually be cast into this same framework of latent state estimation, with the unknown task
variable viewed as part of the hidden variables that must be inferred. Leveraging this idea
of fusing both task and state inference into a unified framework, we present our algorithm
MELD, Meta-RL with Latent Dynamics: a practical algorithm for meta-RL from image
observations that quickly acquires new skills at test time via posterior inference in a learned
latent state model over joint state and task variables. We show that MELD outperforms prior
meta-RL methods on a range of simulated robotic locomotion and manipulation problems
including peg insertion and object placing. Furthermore, we demonstrate MELD on two real
robots, learning to perform peg insertion into varying target boxes with a Sawyer robot, and
learning to insert an ethernet cable into new locations after only 4 hours of meta-training on
a WidowX robot. Videos of the experiments are available online1 .
7.1 Introduction
Robots that operate in unstructured environments, such as homes and offices, must be able
to perform a broad range of skills using only their on-board sensors. The problem of learning
from raw sensory observations is often tackled in the context of state estimation: building
1
https://sites.google.com/view/meld-lsm/home
Figure 7.3: Real-world cable insertion: At test time our algorithm (MELD) enables a 7-DoF WidowX
robot to insert the ethernet cable into novel locations and orientations within a single trial, operating from
image observations. MELD achieves this result by meta-training a latent dynamics model to capture task
and state information in the hidden variables, as well as a policy that conditions on these variables, across 20
meta-training tasks with varying locations and orientations.
models that infer the unknown state variables from sensory readings (Thrun, Burgard, Fox,
et al. 2005; Barfoot 2017; Tremblay et al. 2018). In principle, deep reinforcement learning (RL)
algorithms can automate this process by directly learning to map sensory inputs to actions,
obviating the need for explicit state estimation. This automation comes at a steep cost in
sample efficiency since the agent must learn to interpret observations from reward supervision
alone. Fortunately, unsupervised learning in the form of general-purpose latent state (or
dynamics) models can serve as an additional training signal to help solve the representation
learning problem (Finn, X. Y. Tan, et al. 2016; Ghadirzadeh et al. 2017; A. X. Lee et al. 2019;
M. Zhang et al. 2019), substantially improving the performance of end-to-end RL. However,
even the best-performing algorithms in this class require hours of training to learn a single
task (Hafner et al. 2018; M. Zhang et al. 2019; A. X. Lee et al. 2019) and lack a mechanism
to transfer knowledge to subsequent tasks.
General purpose autonomous robots must be able to perform a wide variety of tasks and
quickly acquire new skills. For example, consider a robot tasked with assembling electronics
in a data center. This robot must be able to insert cables of varying shapes, sizes, colors,
and weights into the correct ports with the appropriate amounts of force. While standard
RL algorithms require hundreds or thousands of trials to learn a policy for each new setting,
meta-RL methods hold the promise of drastically reducing the number of trials required.
Given a task distribution, such as the variety of ways to insert electronic cables described
above, meta-RL algorithms leverage a set of training tasks to meta-learn a mechanism that
can quickly learn unseen tasks from the same distribution. Despite impressive results in
simulation demonstrating that agents can learn new tasks in a handful of trials (J. X. Wang
et al. 2016a; Duan, John Schulman, X. Chen, Peter L Bartlett, et al. 2016b; Finn, Abbeel,
and Levine 2017b; Rakelly et al. 2019), these model-free meta-RL algorithms remain largely
unproven on real-world robotic systems, due in large part to the inefficiency of the meta-training
process. In this paper, we show that the same latent dynamics models that greatly improve
efficiency in end-to-end single-task RL can also, with minimal modification, be used for
meta-RL by treating the unknown task information as a hidden variable to be estimated
from experience. We formalize the connection between latent state inference and meta-RL,
and leverage this insight in our proposed algorithm MELD, Meta-RL with Latent Dynamics.
To derive our algorithm, we cast meta-RL and latent state inference into a single partially
observed Markov decision process (POMDP) in which task and state variables are aspects
of a more general per-time step hidden variable. Concretely, we represent the agent’s belief
over the hidden variable as the variational posterior in a sequential VAE latent state model
that takes observations and rewards as input, and we condition the agent’s policy on this
belief. During meta-training, the latent state model and policy are trained across a fixed set
of training tasks sampled from the task distribution. The trained system can then quickly
learn to succeed on a new unseen task by using a small amount of data from the new task
to infer the posterior belief over the hidden variable, and then executing the meta-learned
policy conditioned on this belief.
We show in simulation that MELD substantially outperforms prior work on several
challenging locomotion and manipulation problems such as running at varying velocities,
inserting a peg into varying targets, and putting away mugs of unknown weights to varying
locations on a shelf. Next, we run MELD on a real Sawyer robot to solve the problem of peg
insertion into varying targets, verifying that MELD can reason jointly about state and task
information. Finally, we evaluate our method with a real WidowX robotic arm, where we
show that MELD is able to successfully perform ethernet cable insertion into ports at novel
locations and orientations after four hours of meta-training.
7.3 Preliminaries
In this work, we leverage tools from latent state modeling to design an efficient meta-RL
method that can operate in the real world from image observations. In this section, we review
latent state models and meta-RL, developing a formalism that will allow us to derive our
algorithm in Section 7.4.
To put this formulation in context with the meta-learning formulations introduced in the
previous two chapters, there are two main aspects to look at. First, the objective here is the
direct optimization of rewards, unlike the previous chapters where the objective was model
prediction error. As discussed at the beginning of this chapter, this change corresponds
with the decision to allow a model-free RL algorithm to learn a policy, rather than taking a
model-based planning approach to action selection. Second, the adaptation mechanism (e.g.,
update rule) fφ was prescribed in the previous chapters to be gradient descent, whereas it is
now left to be a more general learned procedure.
In general, meta-RL methods may differ in how the adaptation procedure fφ is represented
(e.g., as probabilistic inference (Rakelly et al. 2019; Zintgraf et al. 2019), as a recurrent
Figure 7.4: (a): When only partial observations of the underlying state are available, latent dynamics models
can glean state information zt from a history of observations. (b): Meta-RL considers a task distribution
where the current task T is an unobserved variable that controls dynamics and rewards. (c): We interpret T
as part of zt , allowing us to leverage latent dynamics models for efficient image-based meta-RL.
update (Duan, John Schulman, X. Chen, Peter L Bartlett, et al. 2016b; J. X. Wang et al.
2016a), as a gradient step (Finn, Abbeel, and Levine 2017b)), how often the adaptation
procedure occurs (e.g., at every timestep (Duan, John Schulman, X. Chen, Peter L Bartlett,
et al. 2016b; Zintgraf et al. 2019) or once per episode (Rakelly et al. 2019; Humplik et al.
2019)), and also in how the optimization is performed (e.g., on-policy (Duan, John Schulman,
X. Chen, Peter L Bartlett, et al. 2016b), off-policy (Rakelly et al. 2019)). Differences aside,
these methods all typically optimize this objective end-to-end, creating a representation
learning bottleneck when learning from image inputs that are ubiquitous in real-world robotics.
In the following section, we show how the latent state models discussed in Section 7.3.1
can be re-purposed for joint representation and task learning, and how this insight leads to a
practical algorithm for image-based meta-RL.
Note that the only change relative to the latent state model of Section 7.3.1 is the inclusion of
rewards as part of the observed evidence, and while this change appears simple, it enables
meta-learning by allowing the hidden state to capture task information.
Posterior inference in this model then gives the agent’s belief bt = p(zt |x1:t , r1:t , a1:t−1 )
over latent state and task variables zt . Conditioned on this belief, the policy πθ (at |bt ) can
learn to modulate its actions and adapt its behavior to the task. Prescribing the adaptation
procedure fφ from Equation 7.1 to be posterior inference in our latent state model, the
meta-training objective in MELD is:
\max_\theta \; \mathbb{E}_{T \sim p(T)} \; \mathbb{E}_{x_t \sim p_T(\cdot \mid s_t),\; a_t \sim \pi_\theta(\cdot \mid b_t),\; s_{t+1} \sim p_T(\cdot \mid s_t, a_t),\; r_t \sim r_T(\cdot \mid s_t, a_t)} \left[\sum_{t=1}^{T} \gamma^t r_t\right] \quad \text{where} \quad b_t = p(z_t \mid x_{1:t}, r_{1:t}, a_{1:t-1}). \qquad (7.3)
By melding state and task inference into a unified framework of latent state estimation,
MELD inherits the same representation learning mechanism as latent state models discussed
in Section 7.3.1 to enable efficient meta-RL with images.
The latent state model is trained by maximizing the following objective:
\mathcal{L}_{\text{model}}(x_{1:T}, r_{1:T}, a_{1:T-1}) = \mathbb{E}_{z_{1:T} \sim q_\phi}\Bigg[\sum_{t=1}^{T} \Big(\log p_\phi(x_t \mid z_t) + \log p_\phi(r_t \mid z_t)\Big) - D_{KL}\big(q_\phi(z_1 \mid x_1, r_1)\,\|\, p(z_1)\big) - \sum_{t=2}^{T} D_{KL}\big(q_\phi(z_t \mid x_t, r_t, z_{t-1}, a_{t-1})\,\|\, p_\phi(z_t \mid z_{t-1}, a_{t-1})\big)\Bigg]. \qquad (7.4)
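As a schematic illustration (not the exact implementation), the following sketch computes the negative of this objective, assuming the per-timestep posterior, prior/dynamics, and decoder distributions have already been produced by the corresponding networks; the dimensions in the toy usage are arbitrary.

```python
import torch
from torch.distributions import Normal, kl_divergence

def meld_model_loss(posteriors, priors, decoders_x, decoders_r, x, r):
    """Negative of the objective in Eq. 7.4, given per-timestep distributions: posteriors[t] is
    q(z_t | ...), priors[0] is the unit-Gaussian p(z_1), priors[t >= 1] is the learned dynamics
    p(z_t | z_{t-1}, a_{t-1}), and decoders_x/decoders_r are p(x_t | z_t) and p(r_t | z_t)
    evaluated at reparameterized samples of z_t."""
    loss = 0.0
    for t in range(len(posteriors)):
        # Reconstruction terms: log-likelihood of the observed image and reward at step t.
        loss = loss - decoders_x[t].log_prob(x[t]).sum() - decoders_r[t].log_prob(r[t]).sum()
        # KL terms: posterior against the unit Gaussian (t = 1) or the latent dynamics (t >= 2).
        loss = loss + kl_divergence(posteriors[t], priors[t]).sum()
    return loss

# Toy usage with T = 3 steps, an 8-dimensional latent, a flattened 64-dim "image", scalar reward.
T = 3
post  = [Normal(torch.zeros(8), torch.ones(8)) for _ in range(T)]
prior = [Normal(torch.zeros(8), torch.ones(8)) for _ in range(T)]
dec_x = [Normal(torch.zeros(64), torch.ones(64)) for _ in range(T)]
dec_r = [Normal(torch.zeros(1), torch.ones(1)) for _ in range(T)]
print(meld_model_loss(post, prior, dec_x, dec_r, torch.zeros(T, 64), torch.zeros(T, 1)))
```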
The first two terms encourage a rich latent representation zt by requiring it to reconstruct
observations and rewards, while the last term keeps the inference network consistent with
latent dynamics. The first timestep posterior qφ (z1 |x1 , r1 ) is modeled separately from the
remaining steps, and p(z1 ) is chosen to be a fixed unit Gaussian N (0, I). The learned
inference networks qφ (z1 |x1 , r1 ) and qφ (zt |xt , rt , zt−1 , at−1 ), decoder networks pφ (xt |zt ) and
pφ (rt |zt ), and dynamics pφ (zt |zt−1 , at−1 ) are all fully connected networks that predict the
output parameters of Gaussian distributions. We follow the architecture of the latent variable
model from SLAC (A. X. Lee et al. 2019), which models two layers of latent variables. Since
our observations consist of RGB camera images, we use convolutional layers in the observation
encoder and decoder. Both of these networks include the same convolutional architecture
(the decoder is simply the transpose of the encoder) that consists of five convolutional layers.
The layers have 32, 64, 128, 256, and 256 filters and the corresponding filter sizes are 5, 3, 3,
3, 4. For environments in which the robot observes two images (such as a fixed scene image
as well as a first-person image from a wrist camera), we concatenate the images together and
apply rectangular filters. All other model networks are fully connected and consist of 2 hidden
replay
buffer
Figure 7.5: MELD meta-training alternates between collecting data with πθ , training the latent state model
using the collected data from the replay buffer, and training the actor πθ and critic Qζ conditioned on the
belief from the current model.
We use ReLU activations after each layer. We train all networks with
the Adam optimizer with learning rate 0.001. We use the soft actor-critic (SAC) (Haarnoja,
Zhou, Abbeel, et al. 2018) RL algorithm in this work, due to its high sample efficiency and
performance. The SAC algorithm maximizes discounted returns as well as policy entropy
via policy iteration. The actor πθ (at |bt ) and the critic Qψ (bt , at ) are conditioned on the
posterior belief bt , modeled as fully connected neural networks, and trained as prescribed by
the SAC algorithm. The critic is trained to minimize the soft Bellman error, which takes the
entropy of the policy into account in the backup. We instantiate the actor and critic as fully
connected networks with 2 hidden layers of 256 units each. We follow the implementation of
SAC, including the use of 2 Q-networks and the tanh actor output activation.
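The following is a sketch of these network sizes in PyTorch; the strides, the RGB input assumption, and the belief dimension are choices made here only to produce a runnable example, since the text above specifies only the filter counts, kernel sizes, and hidden layer widths.

```python
import torch.nn as nn

# Convolutional observation encoder with the filter counts and kernel sizes listed above; the
# assumed strides map a 64x64 input down to a 1x1 feature map.
image_encoder = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
    nn.Conv2d(64, 128, kernel_size=3, stride=2), nn.ReLU(),
    nn.Conv2d(128, 256, kernel_size=3, stride=1), nn.ReLU(),
    nn.Conv2d(256, 256, kernel_size=4, stride=1), nn.ReLU(),
)

def mlp(in_dim, out_dim, hidden, layers=2):
    # Fully connected trunk: hidden=256 for the actor and critic, hidden=32 for the other
    # model networks described above.
    mods, d = [], in_dim
    for _ in range(layers):
        mods += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    return nn.Sequential(*mods, nn.Linear(d, out_dim))

belief_dim, action_dim = 256, 7                         # example sizes; the belief summarizes the posterior
actor  = mlp(belief_dim, 2 * action_dim, hidden=256)    # mean and log-std of a tanh-squashed Gaussian
critic = mlp(belief_dim + action_dim, 1, hidden=256)    # one of the two Q-networks, Q(b_t, a_t)
```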
During meta-training, MELD alternates between collecting data with the current policy,
training the model by optimizing Lmodel , and training the policy with the current model.
Relevant hyper-parameters for meta-training can be found in Table 7.1. Meta-training and
meta-testing are described in Algorithm 7 and Algorithm 8 respectively.
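A high-level sketch of this alternation is shown below; collect_fn, model_update_fn, and policy_update_fn are hypothetical placeholders for the corresponding MELD components, and the schedule constants are illustrative rather than the values used in our experiments.

```python
def meld_meta_train(train_tasks, collect_fn, model_update_fn, policy_update_fn,
                    replay_buffer, n_iters=100, model_steps=200, policy_steps=200):
    # Outer loop: (1) collect rollouts on each training task with the belief-conditioned policy,
    # (2) train the latent state model on replay data (Eq. 7.4), (3) train the SAC actor/critic
    # on beliefs produced by the current model.
    for _ in range(n_iters):
        for task in train_tasks:
            replay_buffer.append(collect_fn(task))
        for _ in range(model_steps):
            model_update_fn(replay_buffer)
        for _ in range(policy_steps):
            policy_update_fn(replay_buffer)
```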
Table 7.1: Relevant hyperparameters for meta-training (Parameter / Value).
Figure 7.6: Image-based 2D navigation: trajectory traces (left) and rewards (right) of meta-learned exploration,
finding the goal in Episode 1 (red) and, by persisting information across episodes, going directly to it in
Episode 2 (blue).
1. Can MELD capture and propagate state and task information over time to explore
effectively in a new task?
2. How does MELD compare to prior meta-RL methods in enabling fast acquisition of
new skills at test time in challenging simulated control problems?
3. Can MELD enable real robots to quickly acquire skills via meta-RL from images?
Figure 7.7: Simulated locomotion and manipulation meta-RL environments in the MuJoCo simulator (Emanuel
Todorov, Erez, and Tassa 2012a): running at different velocities, reaching to varying goals, inserting a peg
into varying boxes in varying locations, and putting away mugs of varying weights to varying locations.
The goal for each task is available to the robot only via per-timestep rewards, but is illustrated here for
visualization purposes.
The episode length is 50 time steps, and the observation consists of a single 64x64 pixel image
from a tracking camera (as shown in Fig. 7.8a), which sees a view of the full cheetah.
In all three Sawyer environments, we control the robot by commanding joint delta-positions
for all 7 joints. The reward function indicates the difference between the current end-effector
pose xee and a goal pose xgoal , as follows:
This reward function encourages precision near the goal, which is particularly important for
the peg insertion task. We impose a maximum episode length of 40 time steps for these
environments. The observations for all three of these environments consist of two images
concatenated to form a 64x128 image: one from a fixed scene camera, and one from a
wrist-mounted first-person view camera. These image observations for each environment
are shown in Fig. 7.8b-d. The simulation time step and control frequency for each of these
simulated environments is listed in Table 7.2.
Table 7.2: Simulation Environments
For all environments, we train with 30 meta-training tasks and evaluate on 10 meta-test
tasks from the same distribution that are not seen during training.
Figure 7.8: 64x64 and 64x128 image observations, seen as input by MELD for (a) Cheetah-vel, (b) Reacher,
(c) Peg-insertion, and (d) Shelf-placing.
7.5.2.2 Comparisons
We compare MELD to two representative state-of-the-art meta-RL algorithms, PEARL (Rakelly
et al. 2019) and RL2 (Duan, John Schulman, X. Chen, Peter L Bartlett, et al. 2016b). PEARL
models a belief over a probabilistic latent task variable as a function of un-ordered batches of
transitions, and conditions the policy on both the current observation and this inferred task
belief. Unlike MELD, this algorithm assumes an exploration phase of several trajectories in
the new task to gather information before adapting, so to report its best performance, we
evaluate only after this exploration phase. RL2 models the policy as a recurrent network
that directly maps observations, actions, and rewards to actions. To apply PEARL and RL2
(Figure 7.9 panels: Cheetah-vel, Reacher, Peg-insertion, Shelf-placing; legend: MELD (ours), PEARL, RL2, SLAC, SAC with [img, rew]; x-axis: Env Steps (Millions).)
Figure 7.9: Rewards on test tasks versus meta-training environment steps, comparing MELD to prior methods.
See text for definitions of these success metrics, and discussion on the scale of these metrics.
to environments with image observations, we augment them with the same convolutional
encoder architecture used by MELD. We also compare MELD to the end-to-end RL algorithm
SAC (Haarnoja, Zhou, Abbeel, et al. 2018), with a sequence of observations and rewards
as input. This input provides SAC with enough information to solve these dense reward
tasks, so this comparison reveals the importance of the explicit representation learning and
meta-learning mechanisms in MELD. Finally, to verify the need for meta-learning on these
held-out tasks, we compare to SLAC (A. X. Lee et al. 2019), which infers a latent state from
observations but does not perform meta-learning.
Figure 7.11: (Left) The peg and ethernet cable insertion problems. (Right) 64x128 image observations seen
by the Sawyer for peg insertion and the WidowX for ethernet cable insertion.
The network switch (A) is mounted to a 3D-printed housing (B) with a gear attached. We control the rotation
of the housing through motor 1. This setup is then mounted on top of a linear rail (C), and
motor 2 controls its translational displacement through a timing pulley. In our experiments,
the training task distribution consisted of 20 different tasks, where each task was randomly
assigned from a rotational range of 16 degrees and a translational range of 2 cm.
Figure 7.12: Automatic task reset mechanism for ethernet cable insertion: The network switch is rotated
and translated by a series of motors in order to generate different tasks for meta-learning. This allows our
meta-learning process to be entirely automated, without needing human intervention to reset either the robot
or the task at the beginning of each rollout.
After training across these 20 meta-training tasks using a total of 4 hours (50,000 samples
(Figure 7.13 axes: Train Rewards and Consistent Task Success vs. Environment Steps; curves: Peg (Train=Eval), Ethernet (Train), Ethernet (Eval).)
Figure 7.13: In red, rewards on train tasks during meta-training for Sawyer peg insertion (left) and WidowX
ethernet cable insertion (right). In blue, success rate of ethernet cable insertion on unseen eval tasks.
at 3.3Hz) worth of data, MELD achieves a success rate of 96% over three rounds of evaluation
in each of the 10 randomly sampled evaluation tasks that were not seen during training.
Videos and more experiments can be found online3 .
7.6 Discussion
In this chapter, we drew upon the insight that meta-RL can be cast into the framework of
latent state inference. This allowed us to combine the adaptation and fast skill-acquisition
capabilities of meta-learning with the efficiency of unsupervised latent state models when
learning from raw image observations. Based on this principle, we designed MELD, a practical
algorithm for meta-RL with image observations. We showed that MELD outperforms prior
methods on simulated locomotion and manipulation tasks with image observations, and is
efficient enough to perform meta-RL directly from images in the real world. By operating
directly from raw image observations rather than assuming access to the underlying state
information, as well as by explicitly enabling the fast acquisition of new skills, the work in
this chapter takes a critical step toward the deployment of intelligent and adaptive robots
into the real world.
3
https://sites.google.com/view/meld-lsm/home
Chapter 8
Conclusion
In this thesis, we presented the idea that model-based deep RL provides an efficient and
effective framework for making sense of the world; and in turn, this ability to make sense
of the world allows for reasoning and adaptation capabilities that give rise to the general-
ization that is necessary for successful operation in the real world. We started in chapter
2 by building up a model-based deep RL framework and demonstrating that it allows for
efficient and effective autonomous skill acquisition for various agents in simulation. We also
demonstrated the ability to repurpose the learned models to execute a variety of paths at
test time. Next, in chapter 3, we extended this framework to enable locomotion with a 6-DoF
legged robot in the real world by learning image-conditioned models that are able to address
various types of terrains. Then, in chapter 4, we scaled up both modeling and control aspects
of our model-based deep RL framework and demonstrated efficient and effective learning
of challenging dexterous manipulation skills with a 24-DoF anthropomorphic hand, both
in simulation and in the real world. We then switched focus to the inevitable mismatch
between an agent’s training conditions and the test conditions in which it may actually be
deployed. In chapter 5, we presented a meta-learning algorithm within our model-based
deep RL framework to enable online adaptation of large, high-capacity models using only
small amounts of data from new tasks. We demonstrated these fast adaptation capabilities
in both simulation and the real-world, with experiments such as a 6-legged robot adapting
online to an unexpected payload or suddenly losing a leg. Next, in chapter 6, we further
improved these adaptation capabilities by formulating an online learning approach as an
expectation maximization problem. By maintaining and building a task distribution with a
mixture of models, this algorithm demonstrated both generalization as well as specialization.
Finally, in chapter 7, we further extended the capabilities of our robotic systems by enabling
the agents to reason directly from raw image observations. By combining efficiency benefits
from representation learning with the adaptation capabilities of meta-RL, we presented a
unified framework for effective meta-RL from images. This approach demonstrated not only
efficient meta-learning due to the learned latent dynamics model, but it also demonstrated
fast acquisition of new skills at test time due to the meta-learned procedure for performing
online inference. We demonstrated these results on robotic arms in the real world performing
peg insertion and ethernet cable insertion to varying targets, thus showing the fast acquisition
of new skills directly from raw image observations in the real world.
Looking ahead, we see a few promising directions for future work in this area of model-based
deep RL for robotic systems. We discuss some open questions and future directions below.
Incorporating other sensing modalities: Thanks to the large amounts of prior work in
computer vision research, great strides have been made in learning directly from image inputs.
While there is still work to be done in that direction, there has been relatively little
effort in incorporating other sources of sensing modalities such as touch and sound. These
additional senses can be critical for many tasks in the real world, such as buttoning a shirt or
braiding our own hair, where most of the work happens without vision to guide our actions.
Even tasks that do incorporate vision can benefit from additional sensing modalities, for example
when operating in environments with occlusions, or when reasoning further about details that are
not obvious from vision alone, such as the weight, composition, or interactions between objects.
What to model and predict: Throughout this thesis, we talked about the importance and
effectiveness of learning predictive models. However, in the real world, it’s often not obvious
what to predict. In most of the chapters in this thesis, we had some “state” representation and
thus formulated the problem as predicting the next state. However, knowing what to include
in this state ahead of time is a challenge in the real world; it is impossible to include the
position of every possible item that may be relevant, and then predict the future value of that
entry. Instead, the last chapter presented an algorithm that operated directly from image
inputs. As is common for representation learning approaches, however, we used pixel-based
reconstruction loss as part of the training objective. These types of pixel-based predictions
are unfortunately challenging as well as severely limiting. Consider, for example, a robot
arm that might push a bowl of cereal off of the edge of the table. In this case, what we
care to predict is that the bowl will break and the milk will splatter. Predicting each pixel
value of the resulting image is not at all appropriate in these types of situations, because
trying to figure out where every drop of milk or every piece of shattered glass might end up
is not only unimportant, but also impossible. Recent work has shown promising alternatives
for pixel-based reconstruction losses, such as contrastive learning (Srinivas, Laskin, and
Abbeel 2020; T. Chen et al. 2020) and other similarity-based (R. Zhang et al. 2018) learning
approaches. Additionally, there may also be effective ways to combine autonomous image
segmentation processes with predictive processes on those segments. The development and
integration of such approaches – to produce predictive models that don’t rely on fixed state
representations or on pixel-based predictions – are a promising direction for future work.
Planning over long horizons: RL can be described as decision making to control a system
in order to accomplish some given goal. This problem statement encompasses many types of
challenges, from flying a quadcopter to unloading a dishwasher. While we can reason over
short time horizons for tasks such as maneuvering a quadcopter to follow a given trajectory
or figuring out in-hand manipulation of an object, we must reason over a much longer time
horizon to do multi-stage reasoning or tasks with sparse rewards. Consider setting a dinner
table or cleaning an entire room, where “success” cannot be achieved without reasoning over a
long time horizon and doing many intermediate things before actually achieving that success.
Although model-based deep RL algorithms can now solve tasks that are challenging in many
ways, from locomotion (M. P. Deisenroth, Calandra, et al. 2012; Hester, Quinlan, and Stone
2010; Morimoto and Atkeson 2003) to manipulation (M. P. Deisenroth, Englert, et al. 2014;
Depraetere et al. 2014; M. P. Deisenroth, Carl Edward Rasmussen, and Fox 2011; Atkeson
1998), these approaches still struggle with compounding model errors when trying to scale
to multi-step reasoning or long-horizon planning tasks. Work in the areas of gradient-based
optimization techniques (Mordatch, Emanuel Todorov, and Popović 2012; Ratliff et al. 2009),
generating sub-goals (McGovern and Barto 2001), learning options (Stolle and Precup 2002),
generating successor representations (Kulkarni et al. 2016), and hierarchical reinforcement
learning (Peng et al. 2017; Vezhnevets et al. 2017; Stulp and Schaal 2011; Kaelbling, Littman,
and A. W. Moore 1996) have shown promise in this area of addressing long-horizon tasks.
Integrating, developing, and applying these ideas to enable reasoning over varying time
horizons and autonomous behaviors in the dynamic conditions of real-world environments is
a critical direction for future work.
Bibliography
[47] Xingye Da, Ross Hartley, and Jessy Grizzle. “Supervised Learning for Stabilizing
Underactuated Bipedal Robot Locomotion, with Outdoor Experiments on the Wave
Field”. In: ICRA. 2017.
[48] M. Deisenroth and C. Rasmussen. “A model-based and data-efficient approach to
policy search”. In: ICML. 2011.
[49] Marc Peter Deisenroth, Roberto Calandra, et al. “Toward fast policy search for learning
legged locomotion”. In: IROS. 2012.
[50] Marc Peter Deisenroth, Peter Englert, et al. “Multi-task policy search for robotics”.
In: 2014 IEEE International Conference on Robotics and Automation (ICRA). IEEE.
2014, pp. 3876–3881.
[51] Marc Peter Deisenroth, Gerhard Neumann, Jan Peters, et al. “A survey on policy
search for robotics”. In: Foundations and Trends in Robotics. 2013.
[52] Marc Peter Deisenroth and Jan Peters. “Solving nonlinear continuous state-action-
observation POMDPs for mechanical systems with Gaussian noise”. In: ().
[53] Marc Peter Deisenroth, Carl Edward Rasmussen, and Dieter Fox. “Learning to control
a low-cost manipulator using data-efficient reinforcement learning”. In: (2011).
[54] Marc Deisenroth and Carl E Rasmussen. “PILCO: A model-based and data-efficient
approach to policy search”. In: International Conference on machine learning (ICML).
2011, pp. 465–472.
[55] Jia Deng et al. “Imagenet: A large-scale hierarchical image database”. In: CVPR. 2009.
[56] B Depraetere et al. “Comparison of model-free and model-based methods for time
optimal hit control of a badminton robot”. In: Mechatronics 24.8 (2014), pp. 1021–1030.
[57] Andreas Doerr et al. “Model-Based Policy Search for Automatic Tuning of Multivariate
PID Controllers”. In: CoRR abs/1703.02899 (2017). arXiv: 1703.02899. url: http://arxiv.org/abs/1703.02899.
[58] Mehmet R Dogar and Siddhartha S Srinivasa. “Push-grasping with dexterous hands:
Mechanics and a method”. In: Intelligent Robots and Systems (IROS), 2010 IEEE/RSJ
International Conference on. IEEE. 2010, pp. 2123–2130.
[59] Julien Doyon and Habib Benali. “Reorganization and plasticity in the adult brain
during learning of motor skills”. In: Current opinion in neurobiology 15.2 (2005),
pp. 161–167.
[60] Yan Duan, Xi Chen, et al. “Benchmarking deep reinforcement learning for continuous
control”. In: ICML. 2016.
[61] Yan Duan, John Schulman, Xi Chen, Peter L. Bartlett, et al. “RL2: Fast Reinforcement
Learning via Slow Reinforcement Learning”. In: CoRR abs/1611.02779 (2016). arXiv:
1611.02779.
[62] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, et al. “RL2: Fast Reinforcement
Learning via Slow Reinforcement Learning”. In: arXiv preprint arXiv:1611.02779
(2016).
[63] Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, et al. “RL2: Fast Reinforcement
Learning via Slow Reinforcement Learning”. In: arXiv:1611.02779 (2016).
[64] John Duchi, Elad Hazan, and Yoram Singer. “Adaptive subgradient methods for online
learning and stochastic optimization”. In: Journal of Machine Learning Research
(2011).
[65] David A Ferrucci. “Introduction to “this is watson””. In: IBM Journal of Research and
Development 56.3.4 (2012), pp. 1–1.
[66] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for
Fast Adaptation of Deep Networks”. In: International Conference on Machine Learning
(ICML) (2017).
[67] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning for
Fast Adaptation of Deep Networks”. In: arXiv preprint arXiv:1703.03400 (2017).
[68] Chelsea Finn, Pieter Abbeel, and Sergey Levine. “Model-Agnostic Meta-Learning
for Fast Adaptation of Deep Networks”. In: CoRR abs/1703.03400 (2017). arXiv:
1703.03400.
[69] Chelsea Finn and Sergey Levine. “Deep visual foresight for planning robot motion”.
In: ICRA. 2017.
[70] Chelsea Finn and Sergey Levine. “Meta-Learning and Universality: Deep Represen-
tations and Gradient Descent can Approximate any Learning Algorithm”. In: CoRR
abs/1710.11622 (2017). arXiv: 1710.11622.
[71] Chelsea Finn, Xin Yu Tan, et al. “Deep spatial autoencoders for visuomotor learning”.
In: ICRA. IEEE. 2016, pp. 512–519.
[72] Chelsea Finn, Tianhe Yu, et al. “One-Shot Visual Imitation Learning via Meta-
Learning”. In: CoRL. 2017, pp. 357–368.
[73] J Randall Flanagan and Alan M Wing. “Modulation of grip force with load force
during point-to-point arm movements”. In: Experimental Brain Research 95.1 (1993),
pp. 131–143.
[74] Peter Florence, Lucas Manuelli, and Russ Tedrake. “Self-supervised correspondence in
visuomotor policy learning”. In: IEEE Robotics and Automation Letters 5.2 (2019),
pp. 492–499.
[75] Meire Fortunato, Charles Blundell, and Oriol Vinyals. “Bayesian recurrent neural
networks”. In: arXiv preprint arXiv:1704.02798 (2017).
[76] Justin Fu, Sergey Levine, and Pieter Abbeel. “One-Shot Learning of Manipulation
Skills with Online Dynamics Adaptation and Neural Network Priors”. In: CoRR
abs/1509.06841 (2015). arXiv: 1509.06841.
[77] Yarin Gal, Rowan Thomas McAllister, and Carl Edward Rasmussen. “Improving
PILCO with bayesian neural network dynamics models”. In: Data-Efficient Machine
Learning workshop. 2016.
[78] Sebastien Gay, Jose Santos-Victor, and Auke Ijspeert. “Learning Robot Gait Stability
using Neural Networks as Sensory Feedback Function for Central Pattern Generators”.
In: 2013.
[79] Carles Gelada et al. “DeepMDP: Learning Continuous Latent Space Models for
Representation Learning”. In: ICML. 2019, pp. 2170–2179.
[80] Ali Ghadirzadeh et al. “Deep predictive policy training using reinforcement learning”.
In: IROS. IEEE. 2017, pp. 2351–2358.
[81] Mark A Gluck, Eduardo Mercado, and Catherine E Myers. Learning and memory:
From brain to behavior. Macmillan Higher Education, 2007.
[82] Ruben Grandia, Diego Pardo, and Jonas Buchli. “Contact Invariant Model Learning
for Legged Robot Locomotion”. In: RAL. 2018.
[83] Karol Gregor and Frederic Besse. “Temporal difference variational auto-encoder”. In:
arXiv preprint arXiv:1806.03107 (2018).
[84] Ivo Grondman et al. “A survey of actor-critic reinforcement learning: standard and
natural policy gradients”. In: IEEE Transactions on Systems, Man, and Cybernetics.
2012.
[85] S. Gu et al. “Continuous deep Q-learning with model-based acceleration”. In: ICML.
2016.
[86] Shixiang Gu, Ethan Holly, et al. “Deep reinforcement learning for robotic manipulation
with asynchronous off-policy updates”. In: ICRA. IEEE. 2017, pp. 3389–3396.
[87] Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, et al. “Q-Prop: sample-efficient
policy gradient with an off-policy critic”. In: ICLR. 2017.
[88] Shixiang Gu, Timothy Lillicrap, Ilya Sutskever, et al. “Continuous Deep Q-Learning
with Model-based Acceleration”. In: International Conference on Machine Learning.
2016, pp. 2829–2838.
[89] Vijaykumar Gullapalli, Judy A Franklin, and Hamid Benbrahim. “Acquiring robot
skills via reinforcement learning”. In: IEEE CSM 14.1 (1994), pp. 13–24.
[90] Abhishek Gupta et al. “Meta-reinforcement learning of structured exploration strate-
gies”. In: NeurIPS. 2018, pp. 5302–5311.
[91] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, et al. “Soft actor-critic: Off-policy
maximum entropy deep reinforcement learning with a stochastic actor”. In: arXiv
preprint arXiv:1801.01290 (2018).
[92] Tuomas Haarnoja, Aurick Zhou, Kristian Hartikainen, et al. “Soft Actor-Critic Algo-
rithms and Applications”. In: arXiv preprint arXiv:1812.05905 (2018).
[93] Danijar Hafner et al. “Learning latent dynamics for planning from pixels”. In: arXiv
preprint arXiv:1811.04551 (2018).
[94] Danijar Hafner et al. “Learning latent dynamics for planning from pixels”. In: ICML.
2019, pp. 2555–2565.
[95] Duncan W Haldane and Ronald S Fearing. “Roll oscillation modulated turning in
dynamic millirobots”. In: IEEE Int. Conf. on Robotics and Automation. 2014, pp. 4569–
4575.
[96] Duncan W Haldane and Ronald S Fearing. “Running beyond the bio-inspired regime”.
In: IEEE Int. Conf. on Robotics and Automation. 2015, pp. 4539–4546.
[97] Duncan W Haldane, Kevin C Peterson, et al. “Animal-inspired design and aerody-
namic stabilization of a hexapedal millirobot”. In: IEEE Int. Conf. on Robotics and
Automation. 2013, pp. 3279–3286.
[98] Matthew Hausknecht and Peter Stone. “Deep Recurrent Q-Learning for Partially
Observable MDPs”. In: 2015 AAAI Fall Symposium Series. 2015.
[99] Nicolas Heess, Jonathan J Hunt, et al. “Memory-based control with recurrent neural
networks”. In: arXiv preprint arXiv:1512.04455 (2015).
[100] Nicolas Heess, Gregory Wayne, et al. “Learning continuous control policies by stochastic
value gradients”. In: NIPS. 2015.
[101] Dominik Henrich and Heinz Wörn. Robot manipulation of deformable objects. Springer
Science & Business Media, 2012.
[102] Robert Herman. Neural control of locomotion. Vol. 18. Springer, 2017.
[103] Susan J Hespos and Kristy VanMarle. “Physics for infants: Characterizing the origins
of knowledge about objects, substances, and number”. In: Wiley Interdisciplinary
Reviews: Cognitive Science 3.1 (2012), pp. 19–27.
[104] Todd Hester, Michael Quinlan, and Peter Stone. “Generalized model learning for
reinforcement learning on a humanoid robot”. In: 2010 IEEE International Conference
on Robotics and Automation. IEEE. 2010, pp. 2369–2374.
[105] Mark A Hoepflinger et al. “Haptic terrain classification for legged robots”. In: ICRA.
2010.
[106] Matthew Hoffman, Francis R Bach, and David M Blei. “Online learning for latent
Dirichlet allocation”. In: Advances in Neural Information Processing Systems. 2010,
pp. 856–864.
[107] Aaron M Hoover, Samuel Burden, et al. “Bio-inspired design and dynamic maneuver-
ability of a minimally actuated six-legged robot”. In: IEEE RAS and EMBS Int. Conf.
on Biomedical Robotics and Biomechatronics. 2010, pp. 869–876.
[108] Aaron M Hoover and Ronald S Fearing. “Fast scale prototyping for folded millirobots”.
In: IEEE Int. Conf. on Robotics and Automation. 2008, pp. 886–892.
[109] Rein Houthooft et al. “Evolved policy gradients”. In: NeurIPS. 2018.
[110] Jan Humplik et al. “Meta reinforcement learning as task inference”. In: arXiv preprint
arXiv:1905.06424 (2019).
[111] K. J. Hunt et al. “Neural networks for control systems—a survey”. In: Automatica.
1992.
[112] Marco Hutter et al. “ANYmal - a highly mobile and dynamic quadrupedal robot”. In:
IEEE/RSJ Int. Conf. on Intell. Robots and Systems. 2016, pp. 38–44.
[113] Maximilian Igl et al. “Deep Variational Reinforcement Learning for POMDPs”. In:
ICML. 2018, pp. 2122–2131.
[114] Amir Jafari et al. “On no-regret learning, fictitious play, and Nash equilibrium”. In:
ICML. Vol. 1. 2001, pp. 226–233.
[115] Stephen James, Michael Bloesch, and Andrew J Davison. “Task-Embedded Control
Networks for Few-Shot Imitation Learning”. In: CoRL. 2018, pp. 783–795.
[116] Michael Janner et al. “When to Trust Your Model: Model-Based Policy Optimization”.
In: arXiv preprint arXiv:1906.08253 (2019).
[117] Ghassen Jerfel et al. “Online gradient-based mixtures for transfer modulation in
meta-learning”. In: arXiv preprint arXiv:1812.06080 (2018).
[118] Adam Johnson and A David Redish. “Neural ensembles in CA3 transiently encode
paths forward of the animal at a decision point”. In: Journal of Neuroscience 27.45
(2007), pp. 12176–12189.
[119] Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. “Planning and
acting in partially observable stochastic domains”. In: AI 101.1-2 (1998), pp. 99–134.
[120] Leslie Pack Kaelbling, Michael L Littman, and Andrew W Moore. “Reinforcement
learning: A survey”. In: Journal of artificial intelligence research 4 (1996), pp. 237–285.
[121] Sham M Kakade. “A natural policy gradient”. In: Advances in neural information
processing systems. 2002, pp. 1531–1538.
[122] Mrinal Kalakrishnan et al. “Fast, robust quadruped locomotion over challenging
terrain”. In: IEEE Int. Conf. on Robotics and Automation. 2010, pp. 2665–2670.
[123] Dmitry Kalashnikov et al. “Scalable Deep Reinforcement Learning for Vision-Based
Robotic Manipulation”. In: CoRL. 2018, pp. 651–673.
[124] Peter Karkus, David Hsu, and Wee Sun Lee. “Qmdp-net: Deep learning for planning
under partial observability”. In: NeurIPS. 2017, pp. 4694–4704.
[125] Maximilian Karl et al. “Deep Variational Bayes Filters: Unsupervised Learning of
State Space Models from Raw Data”. In: ICLR. 2016.
[126] Sousso Kelouwani et al. “Online system identification and adaptive control for PEM
fuel cell maximum efficiency tracking”. In: IEEE Transactions on Energy Conversion
27.3 (2012), pp. 580–592.
[127] S Mohammad Khansari-Zadeh and Aude Billard. “Learning stable nonlinear dynamical
systems with gaussian mixture models”. In: IEEE Transactions on Robotics. 2011.
[128] Sangbae Kim, Jonathan E Clark, and Mark R Cutkosky. “iSprawl: Design and tuning
for high-speed autonomous open-loop running”. In: The International Journal of
Robotics Research 25.9 (2006), pp. 903–912.
[129] D. Kingma and J. Ba. “Adam: A method for stochastic optimization”. In: ICLR. 2014.
[130] James Kirkpatrick et al. “Overcoming catastrophic forgetting in neural networks”. In:
Proceedings of the National Academy of Sciences (2017).
[131] Hiroaki Kitano et al. “RoboCup: A challenge problem for AI”. In: AI Magazine 18.1
(1997), pp. 73–73.
[132] Jonathan Ko and Dieter Fox. “GP-BayesFilters: Bayesian filtering using Gaussian
process prediction and observation models”. In: Autonomous Robots 27.1 (2009),
pp. 75–90.
[133] Jonathan Ko and Dieter Fox. “GP-BayesFilters: Bayesian filtering using gaussian
process prediction and observation models”. In: IROS. 2008.
[134] Jens Kober, J Andrew Bagnell, and Jan Peters. “Reinforcement learning in robotics:
A survey”. In: IJRR (2013).
[135] Roman Kolbert, Nikhil Chavan-Dafle, and Alberto Rodriguez. “Experimental validation
of contact dynamics for in-hand manipulation”. In: International Symposium on
Experimental Robotics. Springer. 2016, pp. 633–645.
[136] J Zico Kolter, Pieter Abbeel, and Andrew Y Ng. “Hierarchical apprenticeship learning
with application to quadruped locomotion”. In: Advances in Neural Information
Processing Systems. 2008, pp. 769–776.
[137] Haldun Komsuoglu, Anirudha Majumdar, et al. “Characterization of dynamic behaviors
in a hexapod robot”. In: Experimental Robotics. Springer. 2014, pp. 667–684.
[138] Haldun Komsuoglu, Kiwon Sohn, et al. “A physical model for dynamical arthropod
running on level ground”. In: Departmental Papers (ESE) (2008), p. 466.
[139] Ben Krause, Emmanuel Kahembwe, et al. “Dynamic Evaluation of Neural Sequence
Models”. In: CoRR abs/1709.07432 (2017). arXiv: 1709.07432.
[140] Ben Krause, Liang Lu, et al. “Multiplicative LSTM for sequence modelling”. In: arXiv
preprint arXiv:1609.07959 (2016).
[141] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “Imagenet classification with
deep convolutional neural networks”. In: NIPS. 2012.
[142] Klas Kronander, Etienne Burdet, and Aude Billard. “Task transfer via collaborative
manipulation for insertion assembly”. In: WHRI. Citeseer. 2014.
[143] Eric Krotkov et al. “The darpa robotics challenge finals: Results and perspectives”. In:
Journal of Field Robotics 34.2 (2017), pp. 229–240.
[161] T. Lillicrap et al. “Continuous control with deep reinforcement learning”. In: ICLR.
2016.
[162] Timothy Lillicrap et al. “Continuous control with deep reinforcement learning”. In:
CoRR abs/1509.02971 (2015). arXiv: 1509.02971.
[163] Rudolf Lioutikov et al. “Sample-based information-theoretic stochastic optimal control”.
In: ICRA. 2014.
[164] Daniel J Lizotte et al. “Automatic Gait Optimization with Gaussian Process Regres-
sion.” In: IJCAI. Vol. 7. 2007, pp. 944–949.
[165] David Lopez-Paz et al. “Gradient Episodic Memory for Continual Learning”. In:
Advances in Neural Information Processing Systems. 2017.
[166] Kendall Lowrey et al. “Plan Online, Learn Offline: Efficient Learning and Exploration
via Model-Based Control”. In: arXiv preprint arXiv:1811.01848 (2018).
[167] Sridhar Mahadevan and Jonathan Connell. “Automatic programming of behavior-based
robots using reinforcement learning”. In: AI 55.2-3 (1992), pp. 311–365.
[168] Ali Malik et al. “Calibrated Model-Based Deep Reinforcement Learning”. In: arXiv
preprint arXiv:1906.08312 (2019).
[169] Patrizio Manganiello et al. “Optimization of Perturbative PV MPPT Methods Through
Online System Identification.” In: IEEE Trans. Industrial Electronics 61.12 (2014),
pp. 6812–6821.
[170] Arthur Joseph McClung III. Techniques for dynamic maneuvering of hexapedal legged
robots. Vol. 67. 11. 2006.
[171] Amy McGovern and Andrew G Barto. “Automatic discovery of subgoals in reinforce-
ment learning using diverse density”. In: (2001).
[172] David Meger et al. “Learning legged swimming gaits from experience”. In: ICRA. 2015.
[173] Franziska Meier, Daniel Kappler, et al. “Towards Robust Online Inverse Dynamics
Learning”. In: Proceedings of the IEEE/RSJ Conference on Intelligent Robots and
Systems. IEEE, 2016.
[174] Franziska Meier and Stefan Schaal. “Drifting Gaussian Processes with Varying Neigh-
borhood Sizes for Online Model Learning”. In: Proceedings of the IEEE International
Conference on Robotics and Automation (ICRA) 2016. IEEE, May 2016.
[175] Russell Mendonca et al. “Guided Meta-Policy Search”. In: arXiv preprint arXiv:1904.00956
(2019).
[176] Stephen Miller et al. “A geometric approach to robotic laundry folding”. In: The
International Journal of Robotics Research 31.2 (2012), pp. 249–267.
[177] Nikhil Mishra, Pieter Abbeel, and Igor Mordatch. “Prediction and control with temporal
segment models”. In: ICML. 2017.
[178] Nikhil Mishra, Mostafa Rohaninejad, et al. “A simple neural attentive meta-learner”.
In: NIPS 2017 Workshop on Meta-Learning. 2017.
[179] Nikhil Mishra, Mostafa Rohaninejad, et al. “Meta-learning with temporal convolutions”.
In: arXiv preprint arXiv:1707.03141 (2017).
[180] V. Mnih, A. P. Badia, et al. “Asynchronous methods for deep reinforcement learning”.
In: ICML. 2016.
[181] V. Mnih, K. Kavukcuoglu, et al. “Playing Atari with deep reinforcement learning”. In:
Workshop on Deep Learning, NIPS. 2013.
[182] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, et al. “Playing Atari
with deep reinforcement learning”. In: arXiv preprint arXiv:1312.5602 (2013).
[183] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, et al. “Human-
level control through deep reinforcement learning”. In: Nature. 2015.
[184] Igor Mordatch, Zoran Popović, and Emanuel Todorov. “Contact-invariant optimiza-
tion for hand manipulation”. In: Proceedings of the ACM SIGGRAPH/Eurographics
symposium on computer animation. Eurographics Association. 2012, pp. 137–144.
[185] Igor Mordatch, Emanuel Todorov, and Zoran Popović. “Discovery of complex behaviors
through contact-invariant optimization”. In: ACM Transactions on Graphics (TOG)
31.4 (2012), pp. 1–8.
[186] Jun Morimoto and Christopher G Atkeson. “Minimax differential dynamic program-
ming: An application to robust biped walking”. In: NIPS. 2003.
[187] Tsendsuren Munkhdalai and Hong Yu. “Meta Networks”. In: International Conference
on Machine Learning (ICML) (2017).
[188] Tsendsuren Munkhdalai and Hong Yu. “Meta networks”. In: arXiv preprint arXiv:1703.00837
(2017).
[189] Tsendsuren Munkhdalai, Xingdi Yuan, et al. “Learning Rapid-Temporal Adaptations”.
In: arXiv preprint arXiv:1712.09926 (2017).
[190] K. Murphy. “Dynamic Bayesian Networks: Representation, Inference and Learning”.
PhD thesis. Dept. Computer Science, UC Berkeley, 2002. url: https://www.cs.ubc.
ca/~murphyk/Thesis/thesis.html.
[191] Anusha Nagabandi, Ignasi Clavera, et al. “Learning to Adapt: Meta-Learning for
Model-Based Control”. In: arXiv preprint arXiv:1803.11347 (2018).
[192] Anusha Nagabandi, Gregory Kahn, Ronald S. Fearing, et al. “Neural Network Dynamics
for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning”. In: 2018
IEEE International Conference on Robotics and Automation, ICRA 2018, Brisbane,
Australia, May 21-25, 2018. 2018, pp. 7559–7566.
[193] Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, et al. “Neural Network Dynamics
for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning”. In:
arXiv preprint arXiv:1708.02596 (2017).
[209] Steven T Piantadosi and Celeste Kidd. “Extraordinary intelligence and the care of
infants”. In: Proceedings of the National Academy of Sciences 113.25 (2016), pp. 6874–
6879.
[210] Joelle Pineau, Geoff Gordon, Sebastian Thrun, et al. “Point-based value iteration: An
anytime algorithm for POMDPs”. In: IJCAI. Vol. 3. 2003, pp. 1025–1032.
[211] Lerrel Pinto and Abhinav Gupta. “Supersizing self-supervision: Learning to grasp
from 50k tries and 700 robot hours”. In: International Conference on Robotics and
Automation (ICRA). IEEE. 2016.
[212] Akshara Rai et al. “Learning Feedback Terms for Reactive Planning and Control”. In:
Proceedings 2017 IEEE International Conference on Robotics and Automation (ICRA).
Piscataway, NJ, USA: IEEE, May 2017.
[213] Marc Raibert et al. “BigDog, the rough-terrain quadruped robot”. In: IFAC Proceedings
Volumes 41.2 (2008), pp. 10822–10825.
[214] Aravind Rajeswaran et al. “Learning complex dexterous manipulation with deep
reinforcement learning and demonstrations”. In: arXiv preprint arXiv:1709.10087
(2017).
[215] Kate Rakelly et al. “Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic
Context Variables”. In: ICML. 2019.
[216] A. Rao. “A survey of numerical methods for optimal control”. In: Advances in the
Astronautical Sciences. 2009.
[217] Nathan Ratliff et al. “CHOMP: Gradient optimization techniques for efficient motion
planning”. In: 2009 IEEE International Conference on Robotics and Automation. IEEE.
2009, pp. 489–494.
[218] Sachin Ravi and Hugo Larochelle. “Optimization as a Model for Few-Shot Learning”.
In: International Conference on Learning Representations (ICLR). 2017.
[219] Sachin Ravi and Hugo Larochelle. “Optimization as a model for few-shot learning”. In:
International Conference on Learning Representations (ICLR) (2017).
[220] Ali Sharif Razavian et al. “CNN features off-the-shelf: an astounding baseline for
recognition”. In: CVPR Workshops (CVPRW). 2014.
[221] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, and Christoph H Lampert. “iCaRL:
Incremental classifier and representation learning”. In: Proc. CVPR. 2017.
[222] Marek Rei. “Online Representation Learning in Recurrent Neural Language Models”.
In: CoRR abs/1508.03854 (2015). arXiv: 1508.03854.
[223] A. Richards. “Robust constrained model predictive control”. PhD thesis. MIT, 2004.
[224] Samuel Ritter et al. “Been There, Done That: Meta-Learning with Episodic Recall”.
In: arXiv preprint arXiv:1805.09692 (2018).
[225] Stéphane Ross, Geoffrey J Gordon, and Drew Bagnell. “A reduction of imitation
learning and structured prediction to no-regret online learning”. In: AISTATS. 2011.
[226] Stéphane Ross, Joelle Pineau, et al. “A Bayesian approach for learning and planning in
partially observable Markov decision processes”. In: JMLR 12.May (2011), pp. 1729–
1770.
[227] Jonas Rothfuss et al. “ProMP: Proximal Meta-Policy Search”. In: arXiv preprint
arXiv:1810.06784 (2018).
[228] Andrei A Rusu et al. “Progressive neural networks”. In: arXiv:1606.04671 (2016).
[229] Steindór Sæmundsson, Katja Hofmann, and Marc Peter Deisenroth. “Meta Rein-
forcement Learning with Latent Variable Gaussian Processes”. In: arXiv preprint
arXiv:1803.07551 (2018).
[230] Doyen Sahoo et al. “Online deep learning: Learning deep neural networks on the fly”.
In: arXiv preprint arXiv:1711.03705 (2017).
[231] Yoshiaki Sakagami et al. “The intelligent ASIMO: System overview and integration”.
In: Intelligent Robots and Systems, 2002. IEEE/RSJ International Conference on.
Vol. 3. IEEE. 2002, pp. 2478–2483.
[232] Adam Santoro et al. “Meta-learning with memory-augmented neural networks”. In:
International Conference on Machine Learning (ICML). 2016.
[233] Adam Santoro et al. “One-shot learning with memory-augmented neural networks”. In:
arXiv preprint arXiv:1605.06065 (2016).
[234] Sosale Shankara Sastry and Alberto Isidori. “Adaptive control of linearizable systems”.
In: IEEE Transactions on Automatic Control (1989).
[235] Alexander Sax et al. “Mid-Level Visual Representations Improve Generalization and
Sample Efficiency for Learning Visuomotor Policies”. In: CoRL. 2019.
[236] Jürgen Schmidhuber and Rudolf Huber. “Learning to generate artificial fovea trajec-
tories for target detection”. In: International Journal of Neural Systems (1991).
[237] Jürgen Schmidhuber. “Evolutionary principles in self-referential learning”. In: Diploma
thesis, Institut f. Informatik, Tech. Univ. Munich (1987).
[238] Jürgen Schmidhuber. “Learning to control fast-weight memories: An alternative to
dynamic recurrent networks”. In: Neural Computation (1992).
[239] Tanner Schmidt, Richard Newcombe, and Dieter Fox. “Self-supervised visual descriptor
learning for dense correspondence”. In: IEEE Robotics and Automation Letters 2.2
(2016), pp. 420–427.
[240] Gerrit Schoettler et al. “Deep Reinforcement Learning for Industrial Insertion Tasks
with Visual Inputs and Natural Reward Signals”. In: ICLR. 2019.
[241] J. Schulman et al. “Trust region policy optimization”. In: ICML. 2015.
[242] John Schulman, Sergey Levine, et al. “Trust region policy optimization”. In: CoRR,
abs/1502.05477 (2015).
[243] John Schulman, Philipp Moritz, et al. “High-dimensional continuous control using
generalized advantage estimation”. In: ICLR. 2016.
[244] Stuart C Shapiro. Encyclopedia of Artificial Intelligence. 2nd ed. John Wiley & Sons, 1992.
[245] Maruan Al-Shedivat et al. “Continuous adaptation via meta-learning in nonstationary
and competitive environments”. In: arXiv preprint arXiv:1710.03641 (2017).
[246] David Silver, Julian Schrittwieser, et al. “Mastering the game of Go without human
knowledge”. In: Nature (2017).
[247] David Silver, Richard S Sutton, and Martin Müller. “Sample-based learning and search
with permanent and transient memories”. In: ICML. 2008.
[248] Avi Singh et al. “End-to-End Robotic Reinforcement Learning without Reward Engi-
neering”. In: Robotics: Science and Systems (RSS). 2019.
[249] Xingyou Song et al. “Rapidly Adaptable Legged Robots via Evolutionary Meta-
Learning”. In: arXiv preprint arXiv:2003.01239 (2020).
[250] Aravind Srinivas, Michael Laskin, and Pieter Abbeel. “Curl: Contrastive unsupervised
representations for reinforcement learning”. In: arXiv preprint arXiv:2004.04136 (2020).
[251] Florian Stimberg, Andreas Ruttor, and Manfred Opper. “Bayesian inference for change
points in dynamical systems with reusable states: a Chinese restaurant process approach”.
In: Artificial Intelligence and Statistics. 2012, pp. 1117–1124.
[252] Martin Stolle and Doina Precup. “Learning options in reinforcement learning”. In:
International Symposium on abstraction, reformulation, and approximation. Springer.
2002, pp. 212–223.
[253] Freek Stulp and Stefan Schaal. “Hierarchical reinforcement learning with movement
primitives”. In: 2011 11th IEEE-RAS International Conference on Humanoid Robots.
IEEE. 2011, pp. 231–238.
[254] Balakumar Sundaralingam and Tucker Hermans. “Geometric in-hand regrasp planning:
Alternating optimization of finger gaits and in-grasp manipulation”. In: 2018 IEEE
International Conference on Robotics and Automation (ICRA). IEEE. 2018, pp. 231–
238.
[255] Flood Sung et al. “Learning to learn: Meta-critic networks for sample efficient learning”.
In: arXiv preprint arXiv:1706.09529 (2017).
[256] R. Sutton. “Dyna, an integrated architecture for learning, planning, and reacting”. In:
AAAI. 1991.
[257] Supasorn Suwajanakorn et al. “Discovery of Latent 3D Keypoints via End-to-end
Geometric Reasoning”. In: Neural Information Processing Systems (NIPS). 2018.
[258] Ryosuke Tajima, Daisaku Honda, and Keisuke Suga. “Fast running experiments
involving a humanoid robot”. In: 2009 IEEE International Conference on Robotics
and Automation. IEEE. 2009, pp. 1571–1576.
[259] Marko Tanaskovic et al. “Adaptive model predictive control for constrained linear
systems”. In: Control Conference (ECC), 2013 European. IEEE. 2013.
[260] Russ Tedrake, Teresa Weirui Zhang, and H Sebastian Seung. “Learning to walk in 20
minutes”. In: Proceedings of the Fourteenth Yale Workshop on Adaptive and Learning
Systems. Vol. 95585. Yale University New Haven (CT). 2005, pp. 1939–1412.
[261] Matthew Tesch, Jeff Schneider, and Howie Choset. “Using response surfaces and
expected improvement to optimize snake robot gait parameters”. In: IEEE/RSJ Int.
Conf. on Intell. Robots and Systems. 2011, pp. 1069–1074.
[262] Sebastian Thrun. “Lifelong learning algorithms”. In: Learning to learn. Springer, 1998.
[263] Sebastian Thrun, Wolfram Burgard, and Dieter Fox. Probabilistic Robotics. MIT Press,
2005.
[264] Sebastian Thrun, Mike Montemerlo, et al. “Stanley: The robot that won the DARPA
Grand Challenge”. In: Journal of field Robotics (2006).
[265] Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business
Media, 1998.
[266] Sebastian Thrun and Lorien Pratt. “Learning to learn: Introduction and overview”. In:
Learning to learn. Springer, 1998.
[267] Josh Tobin et al. “Domain randomization for transferring deep neural networks from
simulation to the real world”. In: 2017 IEEE/RSJ International Conference on Intelli-
gent Robots and Systems (IROS). IEEE. 2017, pp. 23–30.
[268] Emanuel Todorov, Tom Erez, and Yuval Tassa. “MuJoCo: A physics engine for model-
based control”. In: IROS. 2012.
[269] Emanuel Todorov, Tom Erez, and Yuval Tassa. “MuJoCo: A physics engine for model-
based control”. In: International Conference on Intelligent Robots and Systems (IROS).
2012.
[270] Jonathan Tremblay et al. “Deep Object Pose Estimation for Semantic Robotic Grasping
of Household Objects”. In: CoRL. 2018, pp. 306–316.
[271] Samuel J Underwood and Iqbal Husain. “Online parameter estimation and adap-
tive control of permanent-magnet synchronous machines”. In: IEEE Transactions on
Industrial Electronics 57.7 (2010), pp. 2435–2443.
[272] Herke Van Hoof et al. “Learning robot in-hand manipulation with tactile features”. In:
Humanoid Robots (Humanoids), 2015 IEEE-RAS 15th International Conference on.
IEEE. 2015, pp. 121–127.
[273] Alexander Sasha Vezhnevets et al. “Feudal networks for hierarchical reinforcement
learning”. In: arXiv preprint arXiv:1703.01161 (2017).
[274] Niklas Wahlström, Thomas B Schön, and Marc Peter Deisenroth. “From pixels to
torques: policy learning with deep dynamical models”. In: Deep learning workshop at
ICML (2015).
[275] Martin J Wainwright and Michael Irwin Jordan. Graphical models, exponential families,
and variational inference. Now Publishers Inc, 2008.
[276] Jane X Wang et al. “Learning to reinforcement learn”. In: arXiv preprint arXiv:1611.05763
(2016).
[277] Jane X Wang et al. “Learning to reinforcement learn”. In: arXiv:1611.05763 (2016).
[278] Tingwu Wang and Jimmy Ba. “Exploring Model-based Planning with Policy Networks”.
In: arXiv preprint arXiv:1906.08649 (2019).
[279] Yu-Xiong Wang, Deva Ramanan, and Martial Hebert. “Growing a brain: Fine-tuning
by increasing model capacity”. In: IEEE Computer Society Conference on Computer
Vision and Pattern Recognition (CVPR). 2017.
[280] Manuel Watter et al. “Embed to control: a locally linear latent dynamics model for
control from raw images”. In: NIPS. 2015.
[281] Ari Weinstein and Matthew Botvinick. “Structure Learning in Motor Control: A Deep
Reinforcement Learning Model”. In: CoRR abs/1706.06827 (2017). arXiv: 1706.06827.
[282] Grady Williams, Andrew Aldrich, and Evangelos Theodorou. “Model Predictive
Path Integral Control using Covariance Variable Importance Sampling”. In: CoRR
abs/1509.01149 (2015). arXiv: 1509.01149.
[283] Grady Williams, Nolan Wagener, et al. “Information theoretic mpc for model-based
reinforcement learning”. In: International Conference on Robotics and Automation
(ICRA). 2017.
[284] X Alice Wu et al. “Integrated ground reaction force sensing and terrain classification
for small legged robots”. In: RAL (2016).
[285] Annie Xie et al. “Few-Shot Goal Inference for Visuomotor Learning and Planning”. In:
CoRL. 2018, pp. 40–52.
[286] Xiaofeng Xiong, Florentin Wörgötter, and Poramate Manoonpong. “Neuromechanical
Control for Hexapedal Robot Walking on Challenging Surfaces and Surface Classifica-
tion”. In: RAS. 2014.
[287] Zhe Xu and Emanuel Todorov. “Design of a highly biomimetic anthropomorphic robotic
hand towards artificial limb regeneration”. In: 2016 IEEE International Conference on
Robotics and Automation (ICRA). IEEE. 2016, pp. 3485–3492.
[288] Narri Yadaiah and G Sowmya. “Neural network based state estimation of dynamical
systems”. In: IJCNN. IEEE. 2006, pp. 1042–1049.
[289] Denis Yarats et al. “Improving Sample Efficiency in Model-Free Reinforcement Learning
from Images”. In: arXiv preprint arXiv:1910.01741 (2019).
[290] Michael C Yip and David B Camarillo. “Model-less feedback control of continuum
manipulators in constrained environments”. In: IEEE Transactions on Robotics. 2014.
[291] A Steven Younger, Sepp Hochreiter, and Peter R Conwell. “Meta-learning with back-
propagation”. In: International Joint Conference on Neural Networks. IEEE. 2001.
[292] Tianhe Yu et al. “One-Shot Imitation from Observing Humans via Domain-Adaptive
Meta-Learning”. In: ICLR. 2018.
[293] David Zarrouk, Duncan W Haldane, and Ronald S Fearing. “Dynamic legged locomotion
for palm-size robots”. In: SPIE Defense+ Security. International Society for Optics
and Photonics. 2015, 94671S–94671S.
[294] Andy Zeng, Shuran Song, Johnny Lee, et al. “TossingBot: Learning to throw arbitrary
objects with residual physics”. In: arXiv preprint arXiv:1903.11239 (2019).
[295] Andy Zeng, Shuran Song, Stefan Welker, et al. “Learning Synergies between Pushing
and Grasping with Self-supervised Deep RL”. In: arXiv preprint arXiv:1803.09956
(2018).
[296] Friedemann Zenke, Ben Poole, and Surya Ganguli. “Continual learning through synaptic
intelligence”. In: International Conference on Machine Learning. 2017.
[297] Marvin Zhang et al. “SOLAR: Deep Structured Representations for Model-Based
Reinforcement Learning”. In: ICML. 2019, pp. 7444–7453.
[298] Richard Zhang et al. “The unreasonable effectiveness of deep features as a perceptual
metric”. In: Proceedings of the IEEE conference on computer vision and pattern
recognition. 2018, pp. 586–595.
[299] Henry Zhu et al. “Dexterous manipulation with deep reinforcement learning: Efficient,
general, and low-cost”. In: 2019 International Conference on Robotics and Automation
(ICRA). IEEE. 2019, pp. 3651–3657.
[300] Luisa Zintgraf et al. “VariBAD: A Very Good Method for Bayes-Adaptive Deep RL
via Meta-Learning”. In: arXiv preprint arXiv:1910.08348 (2019).
[301] Matt Zucker et al. “Optimization and learning for rough terrain legged locomotion”.
In: The International Journal of Robotics Research 30.2 (2011), pp. 175–191.