Learning For Control From Multiple Demonstrations
z_t = [s_t^*; u_t^*], for t = 0..T − 1.
We use the following notation: y = {y_j^k | j = 0..N^k − 1, k = 0..M − 1}, z = {z_t | t = 0..T − 1}, and similarly for other indexed variables.
The generative model for the ideal trajectory is given by an initial state distribution z_0 ~ N(μ_0, Σ_0) and an approximate model of the dynamics

    z_{t+1} = f(z_t) + ω_t^(z),   ω_t^(z) ~ N(0, Σ^(z)).    (1)
The dynamics model does not need to be particularly accurate: in our experiments, we use a single generic model learned from a large corpus of data that is not specific to the trajectory we want to perform. In our experiments (Section 5) we provide some concrete examples showing how accurately the generic model captures the true dynamics for our helicopter.^1
Our generative model represents each demonstration as a set of independent observations of the hidden, ideal trajectory z. Specifically, our model assumes

    y_j^k = z_{τ_j^k} + ω_j^(y),   ω_j^(y) ~ N(0, Σ^(y)).    (2)
Here τ_j^k is the time index in the hidden trajectory to which the observation y_j^k is mapped. The noise term in the observation equation captures both inaccuracy in estimating the observed trajectories from sensor data, as well as errors in the maneuver that are the result of the human pilot's imperfect demonstration.^2
The time indices τ_j^k are unobserved, and our model assumes the following distribution with parameters d_i^k:

    P(τ_{j+1}^k | τ_j^k) = d_1^k   if τ_{j+1}^k − τ_j^k = 1
                           d_2^k   if τ_{j+1}^k − τ_j^k = 2
                           d_3^k   if τ_{j+1}^k − τ_j^k = 3
                           0       otherwise                    (3)

    τ_0^k ≡ 0.    (4)
To accommodate small, gradual shifts in time between the hidden and observed trajectories, our model assumes the observed trajectories are subsampled versions of the hidden trajectory. We found that having a hidden trajectory length equal to twice the average length of the demonstrations, i.e., T = 2 · (1/M) Σ_{k=1}^M N^k, gives sufficient resolution.
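For concreteness, here is a minimal sketch of sampling from the basic generative model of Eqs. (1)-(4). The toy dynamics f, the dimensions, and the noise covariances below are placeholders for illustration, not the helicopter model used in the paper.

```python
import numpy as np

def sample_demonstration(f, T, N, d, Sigma_z, Sigma_y, z0):
    """Sample a hidden ideal trajectory (Eq. 1) and one misaligned,
    noisy demonstration of it (Eqs. 2-4). All inputs are illustrative."""
    dim = z0.shape[0]
    # Hidden trajectory: z_{t+1} = f(z_t) + Gaussian process noise.
    z = np.zeros((T, dim))
    z[0] = z0
    for t in range(T - 1):
        z[t + 1] = f(z[t]) + np.random.multivariate_normal(np.zeros(dim), Sigma_z)
    # Time indices: tau_0 = 0; tau advances by 1, 2, or 3 hidden steps
    # with probabilities d = (d_1, d_2, d_3), so the demonstration is a
    # subsampled version of the hidden trajectory.
    tau = np.zeros(N, dtype=int)
    for j in range(1, N):
        tau[j] = min(tau[j - 1] + np.random.choice([1, 2, 3], p=d), T - 1)
    # Observations: y_j = z_{tau_j} + Gaussian observation noise.
    y = z[tau] + np.random.multivariate_normal(np.zeros(dim), Sigma_y, size=N)
    return z, tau, y

# Example usage with a toy linear "crude model".
dim = 2
z, tau, y = sample_demonstration(
    f=lambda z: 0.99 * z, T=200, N=100, d=[0.25, 0.5, 0.25],
    Sigma_z=0.01 * np.eye(dim), Sigma_y=0.05 * np.eye(dim),
    z0=np.array([1.0, 0.0]))
```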
Figure 1 depicts the graphical model corresponding to our basic generative model. Note that each observation y_j^k depends on the hidden trajectory's state at time τ_j^k, which means that for τ_j^k unobserved, y_j^k depends on all states in the hidden trajectory that it could be associated with.
^1 The state transition model also predicts the controls as a function of the previous state and controls. In our experiments we predict u_{t+1} as u_t plus Gaussian noise.

^2 Even though our observations, y, are correlated over time with each other due to the dynamics governing the observed trajectory, our model assumes that the observations y_j^k are independent for all j = 0..N^k − 1 and k = 0..M − 1.

Figure 1. Graphical model representing our trajectory assumptions. (Shaded nodes are observed.)

2.2. Extensions to the Generative Model

Thus far we have assumed that the expert demonstrations are misaligned copies of the ideal trajectory, merely corrupted by Gaussian noise. Listgarten et
al. have used this same basic generative model (for
the case where f(·) is the identity function) to align
speech signals and biological data (Listgarten, 2006;
Listgarten et al., 2005). We now augment the basic
model to account for other sources of error which are
important for modeling and control.
2.2.1. Learning Local Model Parameters
For many systems, we can substantially improve our modeling accuracy by using a time-varying model f_t(·) that is specific to the vicinity of the intended trajectory at each time t. We express f_t as our crude model, f, augmented with a bias term^3, β_t:

    z_{t+1} = f_t(z_t) + ω_t^(z) ≡ f(z_t) + β_t + ω_t^(z).

To regularize our model, we assume that β_t changes only slowly over time. We have β_{t+1} ~ N(β_t, Σ^(β)).
We incorporate the bias into our observation model by computing the observed bias β̂_j^k = y_j^k − f(y_{j−1}^k) for each of the observed state transitions, and modeling this as a direct observation of the true model bias corrupted by Gaussian noise. The result of this modification is that the ideal trajectory must not only look similar to the demonstration trajectories, but it must also obey a dynamics model which includes those errors consistently observed in the demonstrations.
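As a small illustration, the observed bias for each demonstrated transition can be computed directly from the data and the crude model; this sketch assumes one demonstration is stored as an array of state vectors and that f maps a state to the predicted next state.

```python
import numpy as np

def observed_biases(y, f):
    """Observed model bias for each transition of one demonstration:
    beta_hat_j = y_j - f(y_{j-1}), for j = 1..N-1 (Section 2.2.1)."""
    return np.array([y[j] - f(y[j - 1]) for j in range(1, len(y))])
```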
2.2.2. Factoring out Demonstration Drift
During aerobatic maneuvers it is often difficult, even for an expert pilot, to keep the helicopter centered around a fixed position. The recorded position trajectory will often drift around unintentionally. Since these position errors are highly correlated, they are not explained well by the Gaussian noise term in our observation model.
^3 Our generative model can incorporate richer local models. We discuss our choice of merely using biases in our generative trajectory model in more detail in Section 4.

To capture such slow drift in the demonstrated trajectories, we augment the latent trajectory's state with a drift vector δ_t^k for each time t and each demonstrated trajectory k. We model the drift as a zero-mean random walk with (relatively) small variance. The state observations are now noisy measurements of z_t + δ_t^k rather than merely z_t.
2.2.3. Incorporating Prior Knowledge
Even though it might be hard to specify the complete ideal trajectory in state space, we might still have prior knowledge about the trajectory. Hence, we introduce additional observations ρ_t = ρ(z_t) corresponding to our prior knowledge about the ideal trajectory at time t. The function ρ(z_t) computes some features of the hidden state z_t, and our expert supplies the value ρ_t that this feature should take. For example, for the case of a helicopter performing an in-place flip, we use an observation that corresponds to our expert pilot's knowledge that the helicopter should stay at a fixed position while it is flipping. We assume that these observations may be corrupted by Gaussian noise, where the variance of the noise expresses our confidence in the accuracy of the expert's advice. In the case of the flip, the variance expresses our knowledge that it is, in fact, impossible to flip perfectly in-place and that the actual position of the helicopter may vary slightly from the position given by the expert.
Incorporating prior knowledge of this kind can greatly
enhance the learned ideal trajectory. We give more
detailed examples in Section 5.
2.2.4. Model Summary
In summary, we have the following generative model:
    z_{t+1} = f(z_t) + β_t + ω_t^(z),              (5)
    β_{t+1} = β_t + ω_t^(β),                       (6)
    δ_{t+1}^k = δ_t^k + ω_t^(δ),                   (7)
    ρ_t = ρ(z_t) + ω_t^(ρ),                        (8)
    y_j^k = z_{τ_j^k} + δ_{τ_j^k}^k + ω_j^(y),     (9)
    τ_{j+1}^k ~ P(τ_{j+1}^k | τ_j^k).              (10)

Here ω_t^(z), ω_t^(β), ω_t^(δ), ω_t^(ρ), ω_j^(y) are zero-mean Gaussian random variables with respective covariance matrices Σ^(z), Σ^(β), Σ^(δ), Σ^(ρ), Σ^(y). The transition probabilities for τ_j^k are defined by Eqs. (3, 4) with parameters d_1^k, d_2^k, d_3^k (collectively denoted d).
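One way to make Eqs. (5)-(9) amenable to a standard extended Kalman smoother is to stack the hidden trajectory state, the bias, and the per-demonstration drifts into a single latent vector. The sketch below is an illustrative construction of that augmented transition function; the crude model f, the dimensions, and the variable names are assumptions, not the authors' code.

```python
import numpy as np

def make_augmented_dynamics(f, dim_z):
    """Transition function g for the stacked latent state
    x = [z_t; beta_t; delta_t^0; ...; delta_t^{M-1}]  (Eqs. 5-7):
    z advances by the crude model plus the bias, while the bias and the
    per-demonstration drifts follow random walks (their means stay put)."""
    def g(x):
        z = x[:dim_z]
        beta = x[dim_z:2 * dim_z]
        deltas = x[2 * dim_z:]   # stacked drift vectors, one per demonstration
        return np.concatenate([f(z) + beta, beta, deltas])
    return g
```

Running an extended Kalman smoother over such an augmented state, with measurement rows built from Eqs. (8) and (9), is one way to obtain the distributions over the latent states that the E-step in Section 3 requires.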
3. Trajectory Learning Algorithm
Our learning algorithm automatically finds the time-alignment indexes τ, the time-index transition probabilities d, and the covariance matrices Σ^(·) by (approximately) maximizing the joint likelihood of the observed trajectories y and the observed prior knowledge about the ideal trajectory ρ, while marginalizing out over the unobserved, intended trajectory z. Concretely, our algorithm (approximately) solves

    max_{τ, Σ^(·), d}  log P(y, ρ, τ; Σ^(·), d).    (11)

Then, once our algorithm has found τ, d, Σ^(·), it finds the most likely hidden trajectory, namely the trajectory z that maximizes the joint likelihood of the observed trajectories y and the observed prior knowledge about the ideal trajectory ρ for the learned parameters τ, d, Σ^(·).^4
The joint optimization in Eq. (11) is difficult because (as can be seen in Figure 1) the lack of knowledge of the time-alignment index variables introduces a very large set of dependencies between all the variables. However, when τ is known, the optimization problem in Eq. (11) greatly simplifies thanks to context-specific independencies (Boutilier et al., 1996). When τ is fixed, we obtain a model such as the one shown in Figure 2. In this model we can directly estimate the multinomial parameters d in closed form; and we have a standard HMM parameter learning problem for the covariances Σ^(·), which can be solved using the EM algorithm (Dempster et al., 1977), often referred to as Baum-Welch in the context of HMMs. Concretely, for our setting, the EM algorithm's E-step computes the pairwise marginals over sequential hidden state variables by running an (extended) Kalman smoother; the M-step then uses these marginals to update the covariances Σ^(·).
Figure 2. Example of graphical model when is known.
(Shaded nodes are observed.)
^4 Note that maximizing over the hidden trajectory and the covariance parameters simultaneously introduces undesirable local maxima: the likelihood score would be highest (namely infinity) for a hidden trajectory with a sequence of states exactly corresponding to the (crude) dynamics model f(·) and state-transition covariance matrices equal to all-zeros, as long as the observation covariances are non-zero. Hence we marginalize out the hidden trajectory to find τ, d, Σ^(·).

To also optimize over the time-indexing variables τ, we propose an alternating optimization procedure. For fixed Σ^(·) and d, and for fixed z, we can find the optimal time-indexing variables using dynamic programming over the time-index assignments for each demonstration independently. The dynamic programming algorithm to find τ is known in the speech recognition literature as dynamic time warping (Sakoe & Chiba, 1978) and in the biological sequence alignment literature as the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970). The fixed z we use is the one that maximizes the likelihood of the observations for the current setting of parameters τ, d, Σ^(·).^5
In practice, rather than alternating between complete optimizations over Σ^(·), d and τ, we only partially optimize over Σ^(·), running only one iteration of the EM algorithm.
We provide the complete details of our algorithm in
the full paper (Coates et al., 2008).
4. Local Model Learning
For complex dynamical systems, the state z_t used in the dynamics model often does not correspond to the complete state of the system, since the latter could involve large numbers of previous states or unobserved variables that make modeling difficult.^6 However, when we only seek to model the system dynamics along a specific trajectory, knowledge of both z_t and how far we are along that trajectory is often sufficient to accurately predict the next state z_{t+1}.

Once the alignments between the demonstrations are computed by our trajectory learning algorithm, we can use the time-aligned demonstration data to learn a sequence of trajectory-specific models. The time indices of the aligned demonstrations now accurately associate the demonstration data points with locations along the learned trajectory, allowing us to build models for the state at time t using the appropriate corresponding data from the demonstration trajectories.^7
^5 Fixing z means the dynamic time warping step only approximately optimizes the original objective. Unfortunately, without fixing z, the independencies required to obtain an efficient dynamic programming algorithm do not hold. In practice we find our approximation works very well.

^6 This is particularly true for helicopters. Whereas the state of the helicopter is very crudely captured by the 12D rigid-body state representation we use for our controllers, the true physical state of the system includes, among others, the airflow around the helicopter, the rotor head speed, and the actuator dynamics.

^7 We could learn the richer local model within the trajectory alignment algorithm, updating the dynamics model during the M-step. We chose not to do so since these models are more computationally expensive to estimate. The richer models have minimal influence on the alignment because the biases capture the average model error; the richer models capture the derivatives around it. Given the limited influence on the alignment, we chose to save computational time and only estimate the richer models after alignment.
Figure 3. Our XCell Tempest autonomous helicopter.
To construct an accurate nonlinear model to predict z_{t+1} from z_t, using the aligned data, one could use locally weighted linear regression (Atkeson et al., 1997), where a linear model is learned based on a weighted dataset. Data points from our aligned demonstrations that are nearer to the current time index along the trajectory, t, and nearer the current state, z_t, would be weighted more highly than data far away. While this allows us to build a more accurate model from our time-aligned data, the weighted regression must be done online, since the weights depend on the current state, z_t. For performance reasons^8 this may often be impractical. Thus, we weight data only based on the time index, and learn a parametric model in the remaining variables (which, in our experiments, has the same form as the global crude model, f(·)). Concretely, when estimating the model for the dynamics at time t, we weight a data point at time t′ by:^9
    W(t′) = exp(−(t − t′)² / σ²),

where σ is a bandwidth parameter. Typical values for σ are between one and two seconds in our experiments.
Since the weights for the data points now only depend on the time index, we can precompute all models f_t(·) along the entire trajectory. The ability to precompute the models is a feature crucial to our control algorithm, which relies heavily on fast simulation.
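A minimal sketch of fitting one such time-weighted local linear model is given below; the affine feature map, the ridge term (in the spirit of the regularization toward the crude model mentioned in footnote 9), and the array layout are assumptions for illustration.

```python
import numpy as np

def fit_local_model(X, X_next, t_idx, t, sigma, lam=1e-3):
    """Fit a weighted linear model z_{t+1} ~ A z_t + b from aligned data.

    X, X_next : arrays of aligned states and their successor states
    t_idx     : time index (along the learned trajectory) of each row of X
    t         : time for which the local model f_t is being estimated
    sigma     : bandwidth of the weighting W(t') from Section 4
    lam       : small ridge term to keep the fit well conditioned
    """
    w = np.exp(-((t_idx - t) ** 2) / sigma ** 2)       # W(t')
    Phi = np.hstack([X, np.ones((X.shape[0], 1))])     # affine features [z, 1]
    WPhi = Phi * w[:, None]
    # Weighted ridge regression: (Phi^T W Phi + lam I) Theta = Phi^T W X_next
    Theta = np.linalg.solve(Phi.T @ WPhi + lam * np.eye(Phi.shape[1]),
                            WPhi.T @ X_next)
    A, b = Theta[:-1].T, Theta[-1]
    return A, b
```

Because the weights depend only on the time index, this fit can be run once per time step t offline, yielding the precomputed sequence of local models f_t(·).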
^8 During real-time control execution, our model is queried roughly 52000 times per second. Even with KD-tree or cover-tree data structures a full locally weighted model would be much too slow.

^9 In practice, the data points along a short segment of the trajectory lie in a low-dimensional subspace of the state space. This sometimes leads to an ill-conditioned parameter estimation problem. To mitigate this problem, we regularize our models toward the crude model f(·).

5. Experimental Results

5.1. Experimental Setup

To test our algorithm, we had our expert helicopter pilot fly our XCell Tempest helicopter (Figure 3), which can perform professional, competition-level maneuvers.^10
We collected multiple demonstrations from our expert for a variety of aerobatic trajectories: continuous in-place flips and rolls, a continuous tail-down tic-toc, and an airshow, which consists of the following maneuvers in rapid sequence: split-S, snap roll, stall-turn, loop, loop with pirouette, stall-turn with pirouette, hurricane (fast backward funnel), knife-edge, flips and rolls, tic-toc and inverted hover.

The (crude) helicopter dynamics f(·) is constructed using the method of Abbeel et al. (2006a).^11 The helicopter dynamics model predicts linear and angular accelerations as a function of current state and inputs. The next state is then obtained by integrating forward in time using the standard rigid-body equations.
In the trajectory learning algorithm, we have bias terms β_t for each of the predicted accelerations. We use the state-drift variables, δ_t^k, for position only.

For the flips, rolls, and tic-tocs we incorporated our prior knowledge that the helicopter should stay in place. We added a measurement of the form:

    0 = p(z_t) + ω^(0),   ω^(0) ~ N(0, Σ^(0)),
where p(·) is a function that returns the position coordinates of z_t, and Σ^(0) is a diagonal covariance matrix. This measurement, which is a direct observation of the pilot's intended trajectory, is similar to advice given to a novice human pilot to describe the desired maneuver: a good flip, roll, or tic-toc trajectory stays close to the same position.
We also used additional advice in the airshow to indicate that the vertical loops, stall-turns and split-S should all lie in a single vertical plane; that the hurricanes should lie in a horizontal plane; and that a good knife-edge stays in a vertical plane. These measurements take the form:

    c = N^⊤ p(z_t) + ω^(1),   ω^(1) ~ N(0, Σ^(1)),

where, again, p(z_t) returns the position coordinates of z_t. N is a vector normal to the plane of the maneuver, c is a constant, and Σ^(1) is a diagonal covariance matrix.
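Such advice can be fed to the smoother as extra measurement rows; the sketch below shows one plausible encoding of the in-place and in-plane observations above as linear measurements of the position coordinates. The state layout, helper names, and scalar variances are assumptions for illustration.

```python
import numpy as np

def position_selector(dim_z, pos_idx):
    """Matrix P such that p(z) = P @ z extracts the position coordinates of z."""
    P = np.zeros((len(pos_idx), dim_z))
    for row, i in enumerate(pos_idx):
        P[row, i] = 1.0
    return P

def in_place_measurement(P, var):
    """Advice '0 = p(z_t) + noise': measurement matrix, target, covariance."""
    return P, np.zeros(P.shape[0]), var * np.eye(P.shape[0])

def in_plane_measurement(P, normal, c, var):
    """Advice 'c = N^T p(z_t) + noise' for a plane with normal N and offset c."""
    H = normal.reshape(1, -1) @ P   # a single 1 x dim_z measurement row
    return H, np.array([c]), np.array([[var]])
```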
^10 We instrumented the helicopter with a Microstrain 3DM-GX1 orientation sensor. A ground-based camera system measures the helicopter's position. A Kalman filter uses these measurements to track the helicopter's position, velocity, orientation and angular rate.

^11 The model of Abbeel et al. (2006a) naturally generalizes to any orientation of the helicopter regardless of the flight regime from which data is collected. Hence, even without collecting data from aerobatic flight, we can reasonably attempt to use such a model for aerobatic flying, though we expect it to be relatively inaccurate.
Figure 4. Colored lines: demonstrations. Black dotted line: trajectory inferred by our algorithm. (See text for details.)
5.2. Trajectory Learning Results
Figure 4(a) shows the horizontal and vertical position of the helicopter during the two loops flown during the airshow. The colored lines show the expert pilot's demonstrations. The black dotted line shows the inferred ideal path produced by our algorithm. The loops are more rounded and more consistent in the inferred ideal path; we did not incorporate any prior knowledge to this effect. Figure 4(b) shows a top-down view of the same demonstrations and inferred trajectory. The prior successfully encouraged the inferred trajectory to lie in a vertical plane, while obeying the system dynamics.
Figure 4(c) shows one of the bias terms, namely the model prediction errors for the Z-axis acceleration of the helicopter, computed from the demonstrations before time-alignment. Figure 4(d) shows the result after alignment (in color) as well as the inferred acceleration error (black dotted). We see that the unaligned bias measurements allude to errors approximately in the -1G to -2G range for the first 40 seconds of the airshow (a period that involves high-G maneuvering that is not predicted accurately by the crude model). However, only the aligned biases precisely show the magnitudes and locations of these errors along the trajectory. The alignment allows us to build our ideal trajectory based upon a much more accurate model that is tailored to match the dynamics observed in the demonstrations.

Results for other maneuvers and state variables are similar. At the URL provided in the introduction we posted movies which simultaneously replay the different demonstrations, before alignment and after alignment. The movies visualize the alignment results in many state dimensions simultaneously.
5.3. Flight Results
After constructing the idealized trajectory and models using our algorithm, we attempted to fly the trajectory on the actual helicopter.
Our helicopter uses a receding-horizon differential dynamic programming (DDP) controller (Jacobson & Mayne, 1970). DDP approximately solves general continuous state-space optimal control problems by taking advantage of the fact that optimal control problems with linear dynamics and a quadratic reward function (known as linear quadratic regulator (LQR) problems) can be solved efficiently. It is well known that the solution to the (time-varying, finite-horizon) LQR problem is a sequence of linear feedback controllers. In short, DDP iteratively approximates the general control problem with LQR problems until convergence, resulting in a sequence of linear feedback controllers that are approximately optimal. In the receding-horizon algorithm, we not only run DDP initially to design the sequence of controllers, but also re-run DDP during control execution at every time step and recompute the optimal controller over a fixed-length time interval (the horizon), assuming the precomputed controller and cost-to-go are correct after this horizon.
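To make the LQR subproblem concrete, here is a generic finite-horizon, time-varying LQR backward pass for a linearized tracking problem; the matrices and horizon are placeholders, and this is only an illustrative sketch, not the authors' DDP implementation.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, Qf):
    """Finite-horizon LQR: given linearized error dynamics x_{t+1} = A_t x_t + B_t u_t
    and stage costs x^T Q x + u^T R u (terminal cost x^T Qf x), return the
    time-varying feedback gains K_t so that u_t = -K_t x_t is optimal."""
    H = len(A)
    P = Qf                           # cost-to-go Hessian at the horizon end
    K = [None] * H
    for t in reversed(range(H)):
        BtP = B[t].T @ P
        K[t] = np.linalg.solve(R + BtP @ B[t], BtP @ A[t])
        P = Q + A[t].T @ P @ (A[t] - B[t] @ K[t])
    return K
```

A receding-horizon scheme would relinearize the local models f_t around the current state estimate, run such a backward pass over the next few seconds of the target trajectory, apply only the first feedback control, and repeat at the next time step.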
As described in Section 4, our algorithm outputs a sequence of learned local parametric models, each of the form described by Abbeel et al. (2006a). Our implementation linearizes these models on the fly with a 2 second horizon (at 20 Hz). Our reward function penalizes error from the target trajectory, s_t^*, as well as deviation from the desired controls, u_t^*, and the desired control velocities, u_{t+1}^* − u_t^*.
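A quadratic cost of this form could look as follows; the weight matrices Qs, Ru, and Rdu are illustrative placeholders, since the paper does not list the actual penalty coefficients.

```python
import numpy as np

def tracking_cost(s, u, u_prev, s_star, u_star, u_prev_star, Qs, Ru, Rdu):
    """Penalize trajectory error, control deviation, and control-velocity
    deviation, mirroring the cost terms described above."""
    ds = s - s_star
    du = u - u_star
    ddu = (u - u_prev) - (u_star - u_prev_star)
    return ds @ Qs @ ds + du @ Ru @ du + ddu @ Rdu @ ddu
```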
First we compare our results with the previous state-of-the-art in aerobatic helicopter flight, namely the in-place rolls and flips of Abbeel et al. (2007). That work used hand-specified target trajectories and a single nonlinear model for the entire trajectory.

Figure 5(a) shows the Y-Z position^12 and the collective (thrust) control inputs for the in-place rolls for both their controller and ours. Our controller achieves (i) better position performance (standard deviation of approximately 2.3 meters in the Y-Z plane, compared to about 4.6 meters) and (ii) lower overall collective control values (which roughly represent the amount of energy being used to fly the maneuver).

Similarly, Figure 5(b) shows the X-Z position and the collective control inputs for the in-place flips for both controllers. As for the rolls, we see that our controller significantly outperforms that of Abbeel et al. (2007), both in position accuracy and in control energy expended.

^12 These are the position coordinates projected into a plane orthogonal to the axis of rotation.
Figure 5. Flight results. (a),(b) Solid black: our results. Dashed red: Abbeel et al. (2007). (c) Dotted black: autonomous
tic-toc. Solid colored: expert demonstrations. (See text for details.)
Besides flips and rolls, we also performed autonomous tic-tocs, widely considered to be an even more challenging aerobatic maneuver. During the (tail-down) tic-toc maneuver the helicopter pitches quickly backward and forward in-place with the tail pointed toward the ground (resembling an inverted clock pendulum). The complex relationship between pitch angle, horizontal motion, vertical motion, and thrust makes it extremely difficult to create a feasible tic-toc trajectory by hand. Our attempts to use such a hand-coded trajectory with the DDP algorithm from Abbeel et al. (2007) failed repeatedly. By contrast, our algorithm readily yields an excellent feasible trajectory that was successfully flown on the first attempt. Figure 5(c) shows the expert trajectories (in color), and the autonomously flown tic-toc (black dotted). Our controller significantly outperforms the expert's demonstrations.
We also applied our algorithm to successfully fly a complete aerobatic airshow, which consists of the following maneuvers in rapid sequence: split-S, snap roll, stall-turn, loop, loop with pirouette, stall-turn with pirouette, hurricane (fast backward funnel), knife-edge, flips and rolls, tic-toc and inverted hover.

The trajectory-specific local model learning typically captures the dynamics well enough to fly all the aforementioned maneuvers reliably. Moreover, since our computer controller flies the trajectory very consistently, we can repeatedly acquire data from the same vicinity of the target trajectory on the real helicopter. Similar to Abbeel et al. (2007), we incorporate this flight data into our model learning, allowing us to improve flight accuracy even further. For example, during the first autonomous airshow our controller achieved an RMS position error of 3.29 meters, and this procedure improved performance to 1.75 meters RMS position error.
Videos of all our flights are available at:
http://heli.stanford.edu
6. Related Work
Although no prior works span our entire setting of
learning for control from multiple demonstrations,
there are separate pieces of work that relate to var-
ious components of our approach.
Atkeson and Schaal (1997) use multiple demonstra-
tions to learn a model for a robot arm, and then nd an
optimal controller in their simulator, initializing their
optimal control algorithm with one of the demonstra-
tions.
The work of Calinon et al. (2007) considered learning
trajectories and constraints from demonstrations for
robotic tasks. There, they do not consider the system's
dynamics or provide a clear mechanism for the inclu-
sion of prior knowledge. Our formulation presents a
principled, joint optimization which takes into account
the multiple demonstrations, as well as the (complex)
system dynamics and prior knowledge. While Calinon
et al. (2007) also use some form of dynamic time warp-
ing, they do not try to optimize a joint objective cap-
turing both the system dynamics and time-warping.
Among others, An et al. (1988) and, more recently,
Abbeel et al. (2006b) have exploited the idea of
trajectory-indexed model learning for control. How-
ever, contrary to our setting, their algorithms neither time-align nor coherently integrate data from multiple trajectories.
While the work by Listgarten et al. (Listgarten, 2006;
Listgarten et al., 2005) does not consider robotic con-
trol and model learning, they also consider the prob-
lem of multiple continuous time series alignment with
a hidden time series.
Our work also has strong similarities with recent work
on inverse reinforcement learning, which extracts a re-
ward function (rather than a trajectory) from the ex-
pert demonstrations. See, e.g., Ng and Russell (2000);
Abbeel and Ng (2004); Ratliff et al. (2006); Neu and
Szepesvari (2007); Ramachandran and Amir (2007);
Syed and Schapire (2008).
Most prior work on autonomous helicopter flight only considers the flight regime close to hover. There are three notable exceptions. The aerobatic work of Gavrilets et al. (2002) comprises three maneuvers: split-S, snap roll, and stall-turn, which we also include during the first 10 seconds of our airshow for comparison. They record pilot demonstrations, and then hand-engineer a sequence of desired angular rates and velocities, as well as transition points. Ng et al. (2004) have their autonomous helicopter perform sustained inverted hover. We compared the performance of our system with the work of Abbeel et al. (2007), by far the most advanced autonomous aerobatics results to date, in Section 5.
7. Conclusion
We presented an algorithm that takes advantage of multiple suboptimal trajectory demonstrations to (i) extract (an estimate of) the ideal demonstration, and (ii) learn a local model along this trajectory. Our algorithm is generally applicable for learning trajectories and dynamics models along trajectories from multiple demonstrations. We showed the effectiveness of our algorithm for control by applying it to the challenging problem of autonomous helicopter aerobatics. The ideal target trajectory and the local models output by our trajectory learning algorithm enable our controllers to significantly outperform the prior state of the art.
Acknowledgments
We thank Garett Oku for piloting and building our
helicopter. Adam Coates is supported by a Stanford
Graduate Fellowship. This work was also supported
in part by the DARPA Learning Locomotion program
under contract number FA8650-05-C-7261.
References
Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (2007).
An application of reinforcement learning to aerobatic he-
licopter flight. NIPS 19.
Abbeel, P., Ganapathi, V., & Ng, A. Y. (2006a). Learning
vehicular dynamics with application to modeling heli-
copters. NIPS 18.
Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning
via inverse reinforcement learning. Proc. ICML.
Abbeel, P., Quigley, M., & Ng, A. Y. (2006b). Using inac-
curate models in reinforcement learning. Proc. ICML.
An, C. H., Atkeson, C. G., & Hollerbach, J. M. (1988).
Model-based control of a robot manipulator. MIT Press.
Atkeson, C., & Schaal, S. (1997). Robot learning from
demonstration. Proc. ICML.
Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Lo-
cally weighted learning for control. Artificial Intelligence
Review, 11.
Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D.
(1996). Context-specific independence in Bayesian net-
works. Proc. UAI.
Calinon, S., Guenter, F., & Billard, A. (2007). On learn-
ing, representing and generalizing a task in a humanoid
robot. IEEE Trans. on Systems, Man and Cybernetics,
Part B.
Coates, A., Abbeel, P., & Ng, A. Y. (2008). Learning
for control from multiple demonstrations (Full version).
http://heli.stanford.edu/icml2008.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977).
Maximum likelihood from incomplete data via the EM
algorithm. J. of the Royal Statistical Society.
Gavrilets, V., Martinos, I., Mettler, B., & Feron, E. (2002).
Control logic for automated aerobatic flight of minia-
ture helicopter. AIAA Guidance, Navigation and Con-
trol Conference.
Jacobson, D. H., & Mayne, D. Q. (1970). Differential dy-
namic programming. Elsevier.
Listgarten, J. (2006). Analysis of sibling time series data:
alignment and difference detection. Doctoral disserta-
tion, University of Toronto.
Listgarten, J., Neal, R. M., Roweis, S. T., & Emili, A.
(2005). Multiple alignment of continuous time series.
NIPS 17.
Needleman, S., & Wunsch, C. (1970). A general method
applicable to the search for similarities in the amino acid
sequence of two proteins. J. Mol. Biol.
Neu, G., & Szepesvari, C. (2007). Apprenticeship learning
using inverse reinforcement learning and gradient meth-
ods. Proc. UAI.
Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J.,
Tse, B., Berger, E., & Liang, E. (2004). Autonomous in-
verted helicopter ight via reinforcement learning. ISER.
Ng, A. Y., & Russell, S. (2000). Algorithms for inverse
reinforcement learning. Proc. ICML.
Ramachandran, D., & Amir, E. (2007). Bayesian inverse
reinforcement learning. Proc. IJCAI.
Ratliff, N., Bagnell, J., & Zinkevich, M. (2006). Maximum
margin planning. Proc. ICML.
Sakoe, H., & Chiba, S. (1978). Dynamic programming al-
gorithm optimization for spoken word recognition. IEEE
Transactions on Acoustics, Speech, and Signal Process-
ing.
Syed, U., & Schapire, R. E. (2008). A game-theoretic ap-
proach to apprenticeship learning. NIPS 20.
A. Trajectory Learning Algorithm
As described in Section 3, our algorithm (approximately) solves

    max_{τ, Σ^(·), d}  log P(y, ρ, τ; Σ^(·), d).    (12)

Then, once our algorithm has found τ, d, Σ^(·), it finds the most likely hidden trajectory, namely the trajectory z that maximizes the joint likelihood of the observed trajectories y and the observed prior knowledge about the ideal trajectory ρ for the learned parameters τ, d, Σ^(·).
To optimize Eq. (12), we alternatingly optimize over Σ^(·), d and τ. Section 3 provides the high-level description; below we provide the detailed description of our algorithm.
1. Initialize the parameters to hand-chosen defaults. A typical choice: Σ^(·) = I, d_i^k = 1/3, τ_j^k = j · (T − 1)/(N^k − 1).

2. E-step for latent trajectory: for the current setting of τ, Σ^(·), run an (extended) Kalman smoother to find the distributions for the latent states, N(μ_{t|T−1}, Σ_{t|T−1}).

3. M-step for latent trajectory: update the covariances Σ^(·) using the standard EM update.

4. E-step for the time indexing (using hard assignments): run dynamic time warping to find τ that maximizes the joint probability P(z, y, ρ, τ), where z is fixed to μ_{t|T−1}, namely the mode of the distribution obtained from the Kalman smoother.

5. M-step for the time indexing: estimate d from τ.

6. Repeat steps 2-5 until convergence.
A.1. Steps 2 and 3 details: EM for non-linear dynamical systems
Steps 2 and 3 in our algorithm correspond to the stan-
dard E and M steps of the EM algorithm applied to a
non-linear dynamical system with Gaussian noise. For
completeness we provide the details below.
In particular, we have:

    z_{t+1} = f(z_t) + ω_t,   ω_t ~ N(0, Q),
    y_t = h(z_t) + ν_t,   ν_t ~ N(0, R).
In the E-step, for t = 0..T − 1, the Kalman smoother computes the parameters μ_{t|t} and Σ_{t|t} for the distribution N(μ_{t|t}, Σ_{t|t}), which is the distribution of z_t conditioned on all observations up to and including time t. Along the way, the smoother also computes μ_{t+1|t} and Σ_{t+1|t}. These are the parameters for the distribution of z_{t+1} given only the measurements up to time t. Finally, during the backward pass, the parameters μ_{t|T−1} and Σ_{t|T−1} are computed, which give the distribution for z_t given all measurements.

After running the Kalman smoother (for the E-step), we can use the computed quantities to update Q and R in the M-step. In particular, we can compute:^13

    ε_t = μ_{t+1|T−1} − f(μ_{t|T−1}),
    A_t = Df(μ_{t|T−1}),
    L_t = Σ_{t|t} A_t^⊤ Σ_{t+1|t}^{−1},
    P_t = Σ_{t+1|T−1} − Σ_{t+1|T−1} L_t^⊤ A_t^⊤ − A_t L_t Σ_{t+1|T−1},

    Q = (1/T) Σ_{t=0}^{T−1} [ ε_t ε_t^⊤ + A_t Σ_{t|T−1} A_t^⊤ + P_t ],

    ŷ_t = y_t − h(μ_{t|T−1}),
    C_t = Dh(μ_{t|T−1}),

    R = (1/T) Σ_{t=0}^{T−1} [ ŷ_t ŷ_t^⊤ + C_t Σ_{t|T−1} C_t^⊤ ].

^13 The notation Df(z) is the Jacobian of f evaluated at z.
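A sketch of these M-step updates, given the quantities produced by an extended Kalman smoother, might look as follows; the container layout, function names, and the normalization over the available transitions are assumptions of this sketch.

```python
import numpy as np

def m_step_covariances(mu_s, Sig_s, Sig_f, Sig_p, f, h, Df, Dh, y):
    """Update Q and R from smoother output (Section A.1).

    mu_s[t], Sig_s[t] : smoothed mean/covariance of z_t given all measurements
    Sig_f[t]          : filtered covariance of z_t given measurements up to t
    Sig_p[t]          : predicted covariance of z_{t+1} given measurements up to t
    Df, Dh            : Jacobians of the dynamics f and measurement function h
    y[t]              : measurement at time t
    """
    T = len(mu_s)
    Q = np.zeros((mu_s[0].size, mu_s[0].size))
    R = np.zeros((y[0].size, y[0].size))
    for t in range(T - 1):
        eps = mu_s[t + 1] - f(mu_s[t])
        A = Df(mu_s[t])
        L = Sig_f[t] @ A.T @ np.linalg.inv(Sig_p[t])
        P = Sig_s[t + 1] - Sig_s[t + 1] @ L.T @ A.T - A @ L @ Sig_s[t + 1]
        Q += np.outer(eps, eps) + A @ Sig_s[t] @ A.T + P
    for t in range(T):
        r = y[t] - h(mu_s[t])
        C = Dh(mu_s[t])
        R += np.outer(r, r) + C @ Sig_s[t] @ C.T
    # Average over the available transitions/measurements.
    return Q / (T - 1), R / T
```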
A.2. Steps 4 and 5 details: Dynamic time warping
In Step 4 our goal is to compute τ as:

    τ = arg max_τ  log P(z, y, ρ, τ; Σ^(·), d)
      = arg max_τ  Σ_{k=0}^{M−1} Σ_{j=0}^{N^k−1} [ ℓ(y_j^k | z_{τ_j^k}, τ_j^k) + ℓ(τ_j^k | τ_{j−1}^k) ],    (14)

where ℓ(·) denotes the corresponding log-likelihood terms.
Note that the inner summations in the above expression are independent: the likelihoods for each of the M observation sequences can be maximized separately. Hence, in the following, we will omit the k superscript, as the algorithm can be applied separately for each sequence of observations and indices.
At this point, we can solve the maximization over τ using a dynamic programming algorithm known in the speech recognition literature as dynamic time warping (Sakoe & Chiba, 1978) and in the biological sequence alignment literature as the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970). For completeness, we provide the details for our setting below.
We define the quantity Q(s, t) to be the maximum obtainable value of the first s + 1 terms of the inner summation if we choose τ_s = t.
For s = 0 we have:

    Q(0, t) = ℓ(y_0 | z_t, τ_0 = t) + ℓ(τ_0 = t).    (15)
And for s > 0:

    Q(s, t) = ℓ(y_s | z_t, τ_s = t)
              + max_{τ_1,...,τ_{s−1}} [ ℓ(τ_s = t | τ_{s−1}) + Σ_{j=0}^{s−1} ( ℓ(y_j | z_{τ_j}, τ_j) + ℓ(τ_j | τ_{j−1}) ) ]
            = ℓ(y_s | z_t, τ_s = t) + max_{t′} [ ℓ(τ_s = t | τ_{s−1} = t′) + Q(s − 1, t′) ].    (16)
The equations (15) and (16) can be used to compute max_t Q(N^k − 1, t) for each observation sequence (and the maximizing solution, τ), which is exactly the maximizing value of the inner summation in Eq. (14). The maximization in Eq. (16) can be restricted to the relevant values of t′, namely those with t − t′ ∈ {1, 2, 3}, since the transition probability in Eq. (3) is zero for any other step.
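A minimal sketch of this dynamic-programming recursion for a single demonstration, using hard assignments, is given below; it assumes the observation log-likelihoods have already been evaluated into a matrix and that the step log-probabilities log d_1, log d_2, log d_3 are supplied, which is a simplification for illustration.

```python
import numpy as np

def dtw_align(log_obs, log_step, T):
    """Maximize the inner summation of Eq. (14) for one demonstration.

    log_obs[s, t] : log-likelihood of observation y_s given tau_s = t
    log_step[i]   : log d_{i+1}, the log-probability of a time-index step of i+1
    Returns the maximizing time indices tau_0, ..., tau_{N-1}.
    """
    N = log_obs.shape[0]
    Q = np.full((N, T), -np.inf)
    back = np.zeros((N, T), dtype=int)
    Q[0, 0] = log_obs[0, 0]                      # tau_0 = 0 (Eq. 4)
    for s in range(1, N):
        for t in range(T):
            for i, step in enumerate((1, 2, 3)): # allowed increments (Eq. 3)
                tp = t - step
                if tp >= 0 and Q[s - 1, tp] + log_step[i] > Q[s, t]:
                    Q[s, t] = Q[s - 1, tp] + log_step[i]
                    back[s, t] = tp
            Q[s, t] += log_obs[s, t]
    tau = np.zeros(N, dtype=int)
    tau[-1] = int(np.argmax(Q[-1]))              # best final index
    for s in range(N - 1, 0, -1):                # backtrack
        tau[s - 1] = back[s, tau[s]]
    return tau
```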