Learning For Control From Multiple Demonstrations
z_t = [s_t^*; u_t^*], for t = 0..T − 1.
We use the following notation: y = {y_j^k | j = 0..N^k − 1, k = 0..M − 1}, z = {z_t | t = 0..T − 1}, and similarly for other indexed variables.
The generative model for the ideal trajectory is given by an initial state distribution z_0 ~ N(μ_0, Σ_0) and an approximate model of the dynamics

    z_{t+1} = f(z_t) + ω_t^(z),   ω_t^(z) ~ N(0, Σ^(z)).    (1)
The dynamics model does not need to be particularly accurate: in our experiments, we use a single generic model learned from a large corpus of data that is not specific to the trajectory we want to perform. In our experiments (Section 5) we provide some concrete examples showing how accurately the generic model captures the true dynamics for our helicopter.^1
Our generative model represents each demonstration as a set of independent observations of the hidden, ideal trajectory z. Specifically, our model assumes

    y_j^k = z_{τ_j^k} + ω_j^(y),   ω_j^(y) ~ N(0, Σ^(y)).    (2)
Here τ_j^k is the time index in the hidden trajectory to which the observation y_j^k is mapped. The noise term in the observation equation captures both inaccuracy in estimating the observed trajectories from sensor data, as well as errors in the maneuver that are the result of the human pilot's imperfect demonstration.^2
The time indices τ_j^k are unobserved, and our model assumes the following distribution with parameters d_i^k:

    P(τ_{j+1}^k | τ_j^k) = d_1^k   if τ_{j+1}^k − τ_j^k = 1
                           d_2^k   if τ_{j+1}^k − τ_j^k = 2
                           d_3^k   if τ_{j+1}^k − τ_j^k = 3
                           0       otherwise                    (3)

    τ_0^k ≡ 0.    (4)
To accommodate small, gradual shifts in time between the hidden and observed trajectories, our model assumes the observed trajectories are subsampled versions of the hidden trajectory. We found that having a hidden trajectory length equal to twice the average length of the demonstrations, i.e., T = 2 · (1/M) Σ_{k=1}^M N^k, gives sufficient resolution.
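For concreteness, here is a minimal sketch of sampling from the basic generative model of Eqs. (1)-(4). The toy dynamics f, the dimensions, and the noise covariances below are placeholders for illustration, not the helicopter model used in the paper.

```python
import numpy as np

def sample_demonstration(f, T, N, d, Sigma_z, Sigma_y, z0):
    """Sample a hidden ideal trajectory (Eq. 1) and one misaligned,
    noisy demonstration of it (Eqs. 2-4). All inputs are illustrative."""
    dim = z0.shape[0]
    # Hidden trajectory: z_{t+1} = f(z_t) + Gaussian process noise.
    z = np.zeros((T, dim))
    z[0] = z0
    for t in range(T - 1):
        z[t + 1] = f(z[t]) + np.random.multivariate_normal(np.zeros(dim), Sigma_z)
    # Time indices: tau_0 = 0; tau advances by 1, 2, or 3 hidden steps
    # with probabilities d = (d_1, d_2, d_3), so the demonstration is a
    # subsampled version of the hidden trajectory.
    tau = np.zeros(N, dtype=int)
    for j in range(1, N):
        tau[j] = min(tau[j - 1] + np.random.choice([1, 2, 3], p=d), T - 1)
    # Observations: y_j = z_{tau_j} + Gaussian observation noise.
    y = z[tau] + np.random.multivariate_normal(np.zeros(dim), Sigma_y, size=N)
    return z, tau, y

# Example usage with a toy linear "crude model".
dim = 2
z, tau, y = sample_demonstration(
    f=lambda z: 0.99 * z, T=200, N=100, d=[0.25, 0.5, 0.25],
    Sigma_z=0.01 * np.eye(dim), Sigma_y=0.05 * np.eye(dim),
    z0=np.array([1.0, 0.0]))
```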
Figure 1 depicts the graphical model corresponding to our basic generative model. Note that each observation y_j^k depends on the hidden trajectory's state at time τ_j^k, which means that for τ_j^k unobserved, y_j^k depends on all states in the hidden trajectory that it could be associated with.
^1 The state transition model also predicts the controls as a function of the previous state and controls. In our experiments we predict u_{t+1} as u_t plus Gaussian noise.

^2 Even though our observations, y, are correlated over time with each other due to the dynamics governing the observed trajectory, our model assumes that the observations y_j^k are independent for all j = 0..N^k − 1 and k = 0..M − 1.

Figure 1. Graphical model representing our trajectory assumptions. (Shaded nodes are observed.)

2.2. Extensions to the Generative Model

Thus far we have assumed that the expert demonstrations are misaligned copies of the ideal trajectory, merely corrupted by Gaussian noise. Listgarten et
al. have used this same basic generative model (for
the case where f(·) is the identity function) to align
speech signals and biological data (Listgarten, 2006;
Listgarten et al., 2005). We now augment the basic
model to account for other sources of error which are
important for modeling and control.
2.2.1. Learning Local Model Parameters
For many systems, we can substantially improve our modeling accuracy by using a time-varying model f_t(·) that is specific to the vicinity of the intended trajectory at each time t. We express f_t as our crude model, f, augmented with a bias term^3, β_t:

    z_{t+1} = f_t(z_t) + ω_t^(z) ≡ f(z_t) + β_t + ω_t^(z).

To regularize our model, we assume that β_t changes only slowly over time. We have β_{t+1} ~ N(β_t, Σ^(β)).
We incorporate the bias into our observation model by computing the observed bias β̂_j^k = y_j^k − f(y_{j−1}^k) for each of the observed state transitions, and modeling this as a direct observation of the true model bias corrupted by Gaussian noise. The result of this modification is that the ideal trajectory must not only look similar to the demonstration trajectories, but it must also obey a dynamics model which includes those errors consistently observed in the demonstrations.
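As a small illustration, the observed bias for each demonstrated transition can be computed directly from the data and the crude model; this sketch assumes one demonstration is stored as an array of state vectors and that f maps a state to the predicted next state.

```python
import numpy as np

def observed_biases(y, f):
    """Observed model bias for each transition of one demonstration:
    beta_hat_j = y_j - f(y_{j-1}), for j = 1..N-1 (Section 2.2.1)."""
    return np.array([y[j] - f(y[j - 1]) for j in range(1, len(y))])
```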
2.2.2. Factoring out Demonstration Drift
During aerobatic maneuvers it is often difficult, even for an expert pilot, to keep the helicopter centered around a fixed position. The recorded position trajectory will often drift around unintentionally. Since these position errors are highly correlated, they are not explained well by the Gaussian noise term in our observation model.
^3 Our generative model can incorporate richer local models. We discuss our choice of merely using biases in our generative trajectory model in more detail in Section 4.

To capture such slow drift in the demonstrated trajectories, we augment the latent trajectory's state with a drift vector δ_t^k for each time t and each demonstrated trajectory k. We model the drift as a zero-mean random walk with (relatively) small variance. The state observations are now noisy measurements of z_t + δ_t^k rather than merely z_t.
2.2.3. Incorporating Prior Knowledge
Even though it might be hard to specify the complete ideal trajectory in state space, we might still have prior knowledge about the trajectory. Hence, we introduce additional observations ρ_t = ρ(z_t) corresponding to our prior knowledge about the ideal trajectory at time t. The function ρ(z_t) computes some features of the hidden state z_t, and our expert supplies the value ρ_t that this feature should take. For example, for the case of a helicopter performing an in-place flip, we use an observation that corresponds to our expert pilot's knowledge that the helicopter should stay at a fixed position while it is flipping. We assume that these observations may be corrupted by Gaussian noise, where the variance of the noise expresses our confidence in the accuracy of the expert's advice. In the case of the flip, the variance expresses our knowledge that it is, in fact, impossible to flip perfectly in-place and that the actual position of the helicopter may vary slightly from the position given by the expert.
Incorporating prior knowledge of this kind can greatly
enhance the learned ideal trajectory. We give more
detailed examples in Section 5.
2.2.4. Model Summary
In summary, we have the following generative model:
    z_{t+1} = f(z_t) + β_t + ω_t^(z),              (5)
    β_{t+1} = β_t + ω_t^(β),                       (6)
    δ_{t+1}^k = δ_t^k + ω_t^(δ),                   (7)
    ρ_t = ρ(z_t) + ω_t^(ρ),                        (8)
    y_j^k = z_{τ_j^k} + δ_{τ_j^k}^k + ω_j^(y),     (9)
    τ_{j+1}^k ~ P(τ_{j+1}^k | τ_j^k).              (10)

Here ω_t^(z), ω_t^(β), ω_t^(δ), ω_t^(ρ), ω_j^(y) are zero-mean Gaussian random variables with respective covariance matrices Σ^(z), Σ^(β), Σ^(δ), Σ^(ρ), Σ^(y). The transition probabilities for τ_j^k are defined by Eqs. (3, 4) with parameters d_1^k, d_2^k, d_3^k (collectively denoted d).
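One way to make Eqs. (5)-(9) amenable to a standard extended Kalman smoother is to stack the hidden trajectory state, the bias, and the per-demonstration drifts into a single latent vector. The sketch below is an illustrative construction of that augmented transition function; the crude model f, the dimensions, and the variable names are assumptions, not the authors' code.

```python
import numpy as np

def make_augmented_dynamics(f, dim_z):
    """Transition function g for the stacked latent state
    x = [z_t; beta_t; delta_t^0; ...; delta_t^{M-1}]  (Eqs. 5-7):
    z advances by the crude model plus the bias, while the bias and the
    per-demonstration drifts follow random walks (their means stay put)."""
    def g(x):
        z = x[:dim_z]
        beta = x[dim_z:2 * dim_z]
        deltas = x[2 * dim_z:]   # stacked drift vectors, one per demonstration
        return np.concatenate([f(z) + beta, beta, deltas])
    return g
```

Running an extended Kalman smoother over such an augmented state, with measurement rows built from Eqs. (8) and (9), is one way to obtain the distributions over the latent states that the E-step in Section 3 requires.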
3. Trajectory Learning Algorithm
Our learning algorithm automatically finds the time-alignment indexes τ, the time-index transition probabilities d, and the covariance matrices Σ^(·) by (approximately) maximizing the joint likelihood of the observed trajectories y and the observed prior knowledge about the ideal trajectory ρ, while marginalizing out over the unobserved, intended trajectory z. Concretely, our algorithm (approximately) solves

    max_{τ, Σ^(·), d}  log P(y, ρ, τ; Σ^(·), d).    (11)

Then, once our algorithm has found τ, d, Σ^(·), it finds the most likely hidden trajectory, namely the trajectory z that maximizes the joint likelihood of the observed trajectories y and the observed prior knowledge about the ideal trajectory ρ for the learned parameters τ, d, Σ^(·).^4
The joint optimization in Eq. (11) is difficult because (as can be seen in Figure 1) the lack of knowledge of the time-alignment index variables introduces a very large set of dependencies between all the variables. However, when τ is known, the optimization problem in Eq. (11) greatly simplifies thanks to context-specific independencies (Boutilier et al., 1996). When τ is fixed, we obtain a model such as the one shown in Figure 2. In this model we can directly estimate the multinomial parameters d in closed form; and we have a standard HMM parameter learning problem for the covariances Σ^(·), which can be solved using the EM algorithm (Dempster et al., 1977), often referred to as Baum-Welch in the context of HMMs. Concretely, for our setting, the EM algorithm's E-step computes the pairwise marginals over sequential hidden state variables by running an (extended) Kalman smoother; the M-step then uses these marginals to update the covariances Σ^(·).
Figure 2. Example of graphical model when is known.
(Shaded nodes are observed.)
^4 Note that maximizing over the hidden trajectory and the covariance parameters simultaneously introduces undesirable local maxima: the likelihood score would be highest (namely infinity) for a hidden trajectory with a sequence of states exactly corresponding to the (crude) dynamics model f(·) and state-transition covariance matrices equal to all-zeros, as long as the observation covariances are non-zero. Hence we marginalize out the hidden trajectory to find τ, d, Σ^(·).

To also optimize over the time-indexing variables τ, we propose an alternating optimization procedure. For fixed Σ^(·) and d, and for fixed z, we can find the optimal time-indexing variables using dynamic programming over the time-index assignments for each demonstration independently. The dynamic programming algorithm to find τ is known in the speech recognition literature as dynamic time warping (Sakoe & Chiba, 1978) and in the biological sequence alignment literature as the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970). The fixed z we use is the one that maximizes the likelihood of the observations for the current setting of parameters τ, d, Σ^(·).^5
In practice, rather than alternating between complete optimizations over Σ^(·), d and τ, we only partially optimize over Σ^(·), running only one iteration of the EM algorithm.
We provide the complete details of our algorithm in
the full paper (Coates et al., 2008).
4. Local Model Learning
For complex dynamical systems, the state z_t used in the dynamics model often does not correspond to the complete state of the system, since the latter could involve large numbers of previous states or unobserved variables that make modeling difficult.^6 However, when we only seek to model the system dynamics along a specific trajectory, knowledge of both z_t and how far we are along that trajectory is often sufficient to accurately predict the next state z_{t+1}.

Once the alignments between the demonstrations are computed by our trajectory learning algorithm, we can use the time-aligned demonstration data to learn a sequence of trajectory-specific models. The time indices of the aligned demonstrations now accurately associate the demonstration data points with locations along the learned trajectory, allowing us to build models for the state at time t using the appropriate corresponding data from the demonstration trajectories.^7
^5 Fixing z means the dynamic time warping step only approximately optimizes the original objective. Unfortunately, without fixing z, the independencies required to obtain an efficient dynamic programming algorithm do not hold. In practice we find our approximation works very well.

^6 This is particularly true for helicopters. Whereas the state of the helicopter is very crudely captured by the 12D rigid-body state representation we use for our controllers, the true physical state of the system includes, among others, the airflow around the helicopter, the rotor head speed, and the actuator dynamics.

^7 We could learn the richer local model within the trajectory alignment algorithm, updating the dynamics model during the M-step. We chose not to do so since these models are more computationally expensive to estimate. The richer models have minimal influence on the alignment because the biases capture the average model error; the richer models capture the derivatives around it. Given the limited influence on the alignment, we chose to save computational time and only estimate the richer models after alignment.
Figure 3. Our XCell Tempest autonomous helicopter.
To construct an accurate nonlinear model to predict z_{t+1} from z_t, using the aligned data, one could use locally weighted linear regression (Atkeson et al., 1997), where a linear model is learned based on a weighted dataset. Data points from our aligned demonstrations that are nearer to the current time index along the trajectory, t, and nearer the current state, z_t, would be weighted more highly than data far away. While this allows us to build a more accurate model from our time-aligned data, the weighted regression must be done online, since the weights depend on the current state, z_t. For performance reasons^8 this may often be impractical. Thus, we weight data only based on the time index, and learn a parametric model in the remaining variables (which, in our experiments, has the same form as the global crude model, f(·)). Concretely, when estimating the model for the dynamics at time t, we weight a data point at time t′ by:^9
    W(t′) = exp(−(t − t′)² / σ²),

where σ is a bandwidth parameter. Typical values for σ are between one and two seconds in our experiments.
Since the weights for the data points now only depend on the time index, we can precompute all models f_t(·) along the entire trajectory. The ability to precompute the models is a feature crucial to our control algorithm, which relies heavily on fast simulation.
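A minimal sketch of fitting one such time-weighted local linear model is given below; the affine feature map, the ridge term (in the spirit of the regularization toward the crude model mentioned in footnote 9), and the array layout are assumptions for illustration.

```python
import numpy as np

def fit_local_model(X, X_next, t_idx, t, sigma, lam=1e-3):
    """Fit a weighted linear model z_{t+1} ~ A z_t + b from aligned data.

    X, X_next : arrays of aligned states and their successor states
    t_idx     : time index (along the learned trajectory) of each row of X
    t         : time for which the local model f_t is being estimated
    sigma     : bandwidth of the weighting W(t') from Section 4
    lam       : small ridge term to keep the fit well conditioned
    """
    w = np.exp(-((t_idx - t) ** 2) / sigma ** 2)       # W(t')
    Phi = np.hstack([X, np.ones((X.shape[0], 1))])     # affine features [z, 1]
    WPhi = Phi * w[:, None]
    # Weighted ridge regression: (Phi^T W Phi + lam I) Theta = Phi^T W X_next
    Theta = np.linalg.solve(Phi.T @ WPhi + lam * np.eye(Phi.shape[1]),
                            WPhi.T @ X_next)
    A, b = Theta[:-1].T, Theta[-1]
    return A, b
```

Because the weights depend only on the time index, this fit can be run once per time step t offline, yielding the precomputed sequence of local models f_t(·).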
^8 During real-time control execution, our model is queried roughly 52000 times per second. Even with KD-tree or cover-tree data structures a full locally weighted model would be much too slow.

^9 In practice, the data points along a short segment of the trajectory lie in a low-dimensional subspace of the state space. This sometimes leads to an ill-conditioned parameter estimation problem. To mitigate this problem, we regularize our models toward the crude model f(·).

5. Experimental Results

5.1. Experimental Setup

To test our algorithm, we had our expert helicopter pilot fly our XCell Tempest helicopter (Figure 3), which can perform professional, competition-level maneuvers.^10
We collected multiple demonstrations from our expert for a variety of aerobatic trajectories: continuous in-place flips and rolls, a continuous tail-down tic-toc, and an airshow, which consists of the following maneuvers in rapid sequence: split-S, snap roll, stall-turn, loop, loop with pirouette, stall-turn with pirouette, hurricane (fast backward funnel), knife-edge, flips and rolls, tic-toc and inverted hover.

The (crude) helicopter dynamics f(·) is constructed using the method of Abbeel et al. (2006a).^11 The helicopter dynamics model predicts linear and angular accelerations as a function of current state and inputs. The next state is then obtained by integrating forward in time using the standard rigid-body equations.
In the trajectory learning algorithm, we have bias terms β_t for each of the predicted accelerations. We use the state-drift variables, δ_t^k, for position only.

For the flips, rolls, and tic-tocs we incorporated our prior knowledge that the helicopter should stay in place. We added a measurement of the form:

    0 = p(z_t) + ω^(0),   ω^(0) ~ N(0, Σ^(0)),
where p(·) is a function that returns the position coordinates of z_t, and Σ^(0) is a diagonal covariance matrix. This measurement, which is a direct observation of the pilot's intended trajectory, is similar to advice given to a novice human pilot to describe the desired maneuver: a good flip, roll, or tic-toc trajectory stays close to the same position.
We also used additional advice in the airshow to indicate that the vertical loops, stall-turns and split-S should all lie in a single vertical plane; that the hurricanes should lie in a horizontal plane; and that a good knife-edge stays in a vertical plane. These measurements take the form:

    c = N^⊤ p(z_t) + ω^(1),   ω^(1) ~ N(0, Σ^(1)),

where, again, p(z_t) returns the position coordinates of z_t. N is a vector normal to the plane of the maneuver, c is a constant, and Σ^(1) is a diagonal covariance matrix.
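Such advice can be fed to the smoother as extra measurement rows; the sketch below shows one plausible encoding of the in-place and in-plane observations above as linear measurements of the position coordinates. The state layout, helper names, and scalar variances are assumptions for illustration.

```python
import numpy as np

def position_selector(dim_z, pos_idx):
    """Matrix P such that p(z) = P @ z extracts the position coordinates of z."""
    P = np.zeros((len(pos_idx), dim_z))
    for row, i in enumerate(pos_idx):
        P[row, i] = 1.0
    return P

def in_place_measurement(P, var):
    """Advice '0 = p(z_t) + noise': measurement matrix, target, covariance."""
    return P, np.zeros(P.shape[0]), var * np.eye(P.shape[0])

def in_plane_measurement(P, normal, c, var):
    """Advice 'c = N^T p(z_t) + noise' for a plane with normal N and offset c."""
    H = normal.reshape(1, -1) @ P   # a single 1 x dim_z measurement row
    return H, np.array([c]), np.array([[var]])
```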
^10 We instrumented the helicopter with a Microstrain 3DM-GX1 orientation sensor. A ground-based camera system measures the helicopter's position. A Kalman filter uses these measurements to track the helicopter's position, velocity, orientation and angular rate.

^11 The model of Abbeel et al. (2006a) naturally generalizes to any orientation of the helicopter regardless of the flight regime from which data is collected. Hence, even without collecting data from aerobatic flight, we can reasonably attempt to use such a model for aerobatic flying, though we expect it to be relatively inaccurate.
Figure 4. Colored lines: demonstrations. Black dotted line: trajectory inferred by our algorithm. (See text for details.)
5.2. Trajectory Learning Results
Figure 4(a) shows the horizontal and vertical position of the helicopter during the two loops flown during the airshow. The colored lines show the expert pilot's demonstrations. The black dotted line shows the inferred ideal path produced by our algorithm. The loops are more rounded and more consistent in the inferred ideal path; we did not incorporate any prior knowledge to this effect. Figure 4(b) shows a top-down view of the same demonstrations and inferred trajectory. The prior successfully encouraged the inferred trajectory to lie in a vertical plane, while obeying the system dynamics.
Figure 4(c) shows one of the bias terms, namely the model prediction errors for the Z-axis acceleration of the helicopter, computed from the demonstrations before time-alignment. Figure 4(d) shows the result after alignment (in color) as well as the inferred acceleration error (black dotted). We see that the unaligned bias measurements allude to errors approximately in the -1G to -2G range for the first 40 seconds of the airshow (a period that involves high-G maneuvering that is not predicted accurately by the crude model). However, only the aligned biases precisely show the magnitudes and locations of these errors along the trajectory. The alignment allows us to build our ideal trajectory based upon a much more accurate model that is tailored to match the dynamics observed in the demonstrations.

Results for other maneuvers and state variables are similar. At the URL provided in the introduction we posted movies which simultaneously replay the different demonstrations, before alignment and after alignment. The movies visualize the alignment results in many state dimensions simultaneously.
5.3. Flight Results
After constructing the idealized trajectory and models using our algorithm, we attempted to fly the trajectory on the actual helicopter.
Our helicopter uses a receding-horizon differential dynamic programming (DDP) controller (Jacobson & Mayne, 1970). DDP approximately solves general continuous state-space optimal control problems by taking advantage of the fact that optimal control problems with linear dynamics and a quadratic reward function (known as linear quadratic regulator (LQR) problems) can be solved efficiently. It is well known that the solution to the (time-varying, finite-horizon) LQR problem is a sequence of linear feedback controllers. In short, DDP iteratively approximates the general control problem with LQR problems until convergence, resulting in a sequence of linear feedback controllers that are approximately optimal. In the receding-horizon algorithm, we not only run DDP initially to design the sequence of controllers, but also re-run DDP during control execution at every time step and recompute the optimal controller over a fixed-length time interval (the horizon), assuming the precomputed controller and cost-to-go are correct after this horizon.
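To make the LQR subproblem concrete, here is a generic finite-horizon, time-varying LQR backward pass for a linearized tracking problem; the matrices and horizon are placeholders, and this is only an illustrative sketch, not the authors' DDP implementation.

```python
import numpy as np

def lqr_backward_pass(A, B, Q, R, Qf):
    """Finite-horizon LQR: given linearized error dynamics x_{t+1} = A_t x_t + B_t u_t
    and stage costs x^T Q x + u^T R u (terminal cost x^T Qf x), return the
    time-varying feedback gains K_t so that u_t = -K_t x_t is optimal."""
    H = len(A)
    P = Qf                           # cost-to-go Hessian at the horizon end
    K = [None] * H
    for t in reversed(range(H)):
        BtP = B[t].T @ P
        K[t] = np.linalg.solve(R + BtP @ B[t], BtP @ A[t])
        P = Q + A[t].T @ P @ (A[t] - B[t] @ K[t])
    return K
```

A receding-horizon scheme would relinearize the local models f_t around the current state estimate, run such a backward pass over the next few seconds of the target trajectory, apply only the first feedback control, and repeat at the next time step.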
As described in Section 4, our algorithm outputs a sequence of learned local parametric models, each of the form described by Abbeel et al. (2006a). Our implementation linearizes these models on the fly with a 2 second horizon (at 20 Hz). Our reward function penalizes error from the target trajectory, s_t^*, as well as deviation from the desired controls, u_t^*, and the desired control velocities, u_{t+1}^* − u_t^*.
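A quadratic cost of this form could look as follows; the weight matrices Qs, Ru, and Rdu are illustrative placeholders, since the paper does not list the actual penalty coefficients.

```python
import numpy as np

def tracking_cost(s, u, u_prev, s_star, u_star, u_prev_star, Qs, Ru, Rdu):
    """Penalize trajectory error, control deviation, and control-velocity
    deviation, mirroring the cost terms described above."""
    ds = s - s_star
    du = u - u_star
    ddu = (u - u_prev) - (u_star - u_prev_star)
    return ds @ Qs @ ds + du @ Ru @ du + ddu @ Rdu @ ddu
```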
First we compare our results with the previous state-of-the-art in aerobatic helicopter flight, namely the in-place rolls and flips of Abbeel et al. (2007). That work used hand-specified target trajectories and a single nonlinear model for the entire trajectory.

Figure 5(a) shows the Y-Z position^12 and the collective (thrust) control inputs for the in-place rolls for both their controller and ours. Our controller achieves (i) better position performance (standard deviation of approximately 2.3 meters in the Y-Z plane, compared to about 4.6 meters) and (ii) lower overall collective control values (which roughly represent the amount of energy being used to fly the maneuver).

Similarly, Figure 5(b) shows the X-Z position and the collective control inputs for the in-place flips for both controllers. As for the rolls, we see that our controller significantly outperforms that of Abbeel et al. (2007), both in position accuracy and in control energy expended.

^12 These are the position coordinates projected into a plane orthogonal to the axis of rotation.
Figure 5. Flight results. (a),(b) Solid black: our results. Dashed red: Abbeel et al. (2007). (c) Dotted black: autonomous
tic-toc. Solid colored: expert demonstrations. (See text for details.)
Besides flips and rolls, we also performed autonomous tic-tocs, widely considered to be an even more challenging aerobatic maneuver. During the (tail-down) tic-toc maneuver the helicopter pitches quickly backward and forward in-place with the tail pointed toward the ground (resembling an inverted clock pendulum). The complex relationship between pitch angle, horizontal motion, vertical motion, and thrust makes it extremely difficult to create a feasible tic-toc trajectory by hand. Our attempts to use such a hand-coded trajectory with the DDP algorithm from Abbeel et al. (2007) failed repeatedly. By contrast, our algorithm readily yields an excellent feasible trajectory that was successfully flown on the first attempt. Figure 5(c) shows the expert trajectories (in color), and the autonomously flown tic-toc (black dotted). Our controller significantly outperforms the expert's demonstrations.
We also applied our algorithm to successfully fly a complete aerobatic airshow, which consists of the following maneuvers in rapid sequence: split-S, snap roll, stall-turn, loop, loop with pirouette, stall-turn with pirouette, hurricane (fast backward funnel), knife-edge, flips and rolls, tic-toc and inverted hover.

The trajectory-specific local model learning typically captures the dynamics well enough to fly all the aforementioned maneuvers reliably. Moreover, since our computer controller flies the trajectory very consistently, we can repeatedly acquire data from the same vicinity of the target trajectory on the real helicopter. Similar to Abbeel et al. (2007), we incorporate this flight data into our model learning, allowing us to improve flight accuracy even further. For example, during the first autonomous airshow our controller achieved an RMS position error of 3.29 meters, and this procedure improved performance to 1.75 meters RMS position error.
Videos of all our flights are available at:
http://heli.stanford.edu
6. Related Work
Although no prior works span our entire setting of
learning for control from multiple demonstrations,
there are separate pieces of work that relate to var-
ious components of our approach.
Atkeson and Schaal (1997) use multiple demonstra-
tions to learn a model for a robot arm, and then nd an
optimal controller in their simulator, initializing their
optimal control algorithm with one of the demonstra-
tions.
The work of Calinon et al. (2007) considered learning
trajectories and constraints from demonstrations for
robotic tasks. There, they do not consider the system's
dynamics or provide a clear mechanism for the inclu-
sion of prior knowledge. Our formulation presents a
principled, joint optimization which takes into account
the multiple demonstrations, as well as the (complex)
system dynamics and prior knowledge. While Calinon
et al. (2007) also use some form of dynamic time warp-
ing, they do not try to optimize a joint objective cap-
turing both the system dynamics and time-warping.
Among others, An et al. (1988) and, more recently,
Abbeel et al. (2006b) have exploited the idea of
trajectory-indexed model learning for control. How-
ever, contrary to our setting, their algorithms neither time-align nor coherently integrate data from multiple trajectories.
While the work by Listgarten et al. (Listgarten, 2006;
Listgarten et al., 2005) does not consider robotic con-
trol and model learning, they also consider the prob-
lem of multiple continuous time series alignment with
a hidden time series.
Our work also has strong similarities with recent work
on inverse reinforcement learning, which extracts a re-
ward function (rather than a trajectory) from the ex-
pert demonstrations. See, e.g., Ng and Russell (2000);
Abbeel and Ng (2004); Ratliff et al. (2006); Neu and
Szepesvari (2007); Ramachandran and Amir (2007);
Syed and Schapire (2008).
Most prior work on autonomous helicopter flight only considers the flight regime close to hover. There are three notable exceptions. The aerobatic work of Gavrilets et al. (2002) comprises three maneuvers: split-S, snap roll, and stall-turn, which we also include during the first 10 seconds of our airshow for comparison. They record pilot demonstrations, and then hand-engineer a sequence of desired angular rates and velocities, as well as transition points. Ng et al. (2004) have their autonomous helicopter perform sustained inverted hover. We compared the performance of our system with the work of Abbeel et al. (2007), by far the most advanced autonomous aerobatics results to date, in Section 5.
7. Conclusion
We presented an algorithm that takes advantage of multiple suboptimal trajectory demonstrations to (i) extract (an estimate of) the ideal demonstration, and (ii) learn a local model along this trajectory. Our algorithm is generally applicable for learning trajectories and dynamics models along trajectories from multiple demonstrations. We showed the effectiveness of our algorithm for control by applying it to the challenging problem of autonomous helicopter aerobatics. The ideal target trajectory and the local models output by our trajectory learning algorithm enable our controllers to significantly outperform the prior state of the art.
Acknowledgments
We thank Garett Oku for piloting and building our
helicopter. Adam Coates is supported by a Stanford
Graduate Fellowship. This work was also supported
in part by the DARPA Learning Locomotion program
under contract number FA8650-05-C-7261.
References
Abbeel, P., Coates, A., Quigley, M., & Ng, A. Y. (2007).
An application of reinforcement learning to aerobatic he-
licopter flight. NIPS 19.
Abbeel, P., Ganapathi, V., & Ng, A. Y. (2006a). Learning
vehicular dynamics with application to modeling heli-
copters. NIPS 18.
Abbeel, P., & Ng, A. Y. (2004). Apprenticeship learning
via inverse reinforcement learning. Proc. ICML.
Abbeel, P., Quigley, M., & Ng, A. Y. (2006b). Using inac-
curate models in reinforcement learning. Proc. ICML.
An, C. H., Atkeson, C. G., & Hollerbach, J. M. (1988).
Model-based control of a robot manipulator. MIT Press.
Atkeson, C., & Schaal, S. (1997). Robot learning from
demonstration. Proc. ICML.
Atkeson, C. G., Moore, A. W., & Schaal, S. (1997). Lo-
cally weighted learning for control. Artificial Intelligence
Review, 11.
Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D.
(1996). Context-specific independence in Bayesian net-
works. Proc. UAI.
Calinon, S., Guenter, F., & Billard, A. (2007). On learn-
ing, representing and generalizing a task in a humanoid
robot. IEEE Trans. on Systems, Man and Cybernetics,
Part B.
Coates, A., Abbeel, P., & Ng, A. Y. (2008). Learning
for control from multiple demonstrations (Full version).
http://heli.stanford.edu/icml2008.
Dempster, A. P., Laird, N. M., & Rubin, D. B. (1977).
Maximum likelihood from incomplete data via the EM
algorithm. J. of the Royal Statistical Society.
Gavrilets, V., Martinos, I., Mettler, B., & Feron, E. (2002).
Control logic for automated aerobatic flight of minia-
ture helicopter. AIAA Guidance, Navigation and Con-
trol Conference.
Jacobson, D. H., & Mayne, D. Q. (1970). Differential dy-
namic programming. Elsevier.
Listgarten, J. (2006). Analysis of sibling time series data:
alignment and difference detection. Doctoral disserta-
tion, University of Toronto.
Listgarten, J., Neal, R. M., Roweis, S. T., & Emili, A.
(2005). Multiple alignment of continuous time series.
NIPS 17.
Needleman, S., & Wunsch, C. (1970). A general method
applicable to the search for similarities in the amino acid
sequence of two proteins. J. Mol. Biol.
Neu, G., & Szepesvari, C. (2007). Apprenticeship learning
using inverse reinforcement learning and gradient meth-
ods. Proc. UAI.
Ng, A. Y., Coates, A., Diel, M., Ganapathi, V., Schulte, J.,
Tse, B., Berger, E., & Liang, E. (2004). Autonomous in-
verted helicopter ight via reinforcement learning. ISER.
Ng, A. Y., & Russell, S. (2000). Algorithms for inverse
reinforcement learning. Proc. ICML.
Ramachandran, D., & Amir, E. (2007). Bayesian inverse
reinforcement learning. Proc. IJCAI.
Ratliff, N., Bagnell, J., & Zinkevich, M. (2006). Maximum
margin planning. Proc. ICML.
Sakoe, H., & Chiba, S. (1978). Dynamic programming al-
gorithm optimization for spoken word recognition. IEEE
Transactions on Acoustics, Speech, and Signal Process-
ing.
Syed, U., & Schapire, R. E. (2008). A game-theoretic ap-
proach to apprenticeship learning. NIPS 20.
A. Trajectory Learning Algorithm
As described in Section 3, our algorithm (approximately) solves

    max_{τ, Σ^(·), d}  log P(y, ρ, τ; Σ^(·), d).    (12)

Then, once our algorithm has found τ, d, Σ^(·), it finds the most likely hidden trajectory, namely the trajectory z that maximizes the joint likelihood of the observed trajectories y and the observed prior knowledge about the ideal trajectory ρ for the learned parameters τ, d, Σ^(·).
To optimize Eq. (12), we alternatingly optimize over Σ^(·), d and τ. Section 3 provides the high-level description; below we provide the detailed description of our algorithm.
1. Initialize the parameters to hand-chosen defaults. A typical choice: Σ^(·) = I, d_i^k = 1/3, τ_j^k = j · (T − 1)/(N^k − 1).

2. E-step for latent trajectory: for the current setting of τ, Σ^(·), run an (extended) Kalman smoother to find the distributions for the latent states, N(μ_{t|T−1}, Σ_{t|T−1}).

3. M-step for latent trajectory: update the covariances Σ^(·) using the standard EM update.

4. E-step for the time indexing (using hard assignments): run dynamic time warping to find τ that maximizes the joint probability P(z, y, ρ, τ), where z is fixed to μ_{t|T−1}, namely the mode of the distribution obtained from the Kalman smoother.

5. M-step for the time indexing: estimate d from τ.

6. Repeat steps 2-5 until convergence.
A.1. Steps 2 and 3 details: EM for non-linear dynamical systems
Steps 2 and 3 in our algorithm correspond to the stan-
dard E and M steps of the EM algorithm applied to a
non-linear dynamical system with Gaussian noise. For
completeness we provide the details below.
In particular, we have:

    z_{t+1} = f(z_t) + ω_t,   ω_t ~ N(0, Q),
    y_t = h(z_t) + ν_t,   ν_t ~ N(0, R).
In the E-step, for t = 0..T − 1, the Kalman smoother computes the parameters μ_{t|t} and Σ_{t|t} for the distribution N(μ_{t|t}, Σ_{t|t}), which is the distribution of z_t conditioned on all observations up to and including time t. Along the way, the smoother also computes μ_{t+1|t} and Σ_{t+1|t}. These are the parameters for the distribution of z_{t+1} given only the measurements up to time t. Finally, during the backward pass, the parameters μ_{t|T−1} and Σ_{t|T−1} are computed, which give the distribution for z_t given all measurements.

After running the Kalman smoother (for the E-step), we can use the computed quantities to update Q and R in the M-step. In particular, we can compute:^13

    ε_t = μ_{t+1|T−1} − f(μ_{t|T−1}),
    A_t = Df(μ_{t|T−1}),
    L_t = Σ_{t|t} A_t^⊤ Σ_{t+1|t}^{−1},
    P_t = Σ_{t+1|T−1} − Σ_{t+1|T−1} L_t^⊤ A_t^⊤ − A_t L_t Σ_{t+1|T−1},

    Q = (1/T) Σ_{t=0}^{T−1} [ ε_t ε_t^⊤ + A_t Σ_{t|T−1} A_t^⊤ + P_t ],

    ŷ_t = y_t − h(μ_{t|T−1}),
    C_t = Dh(μ_{t|T−1}),

    R = (1/T) Σ_{t=0}^{T−1} [ ŷ_t ŷ_t^⊤ + C_t Σ_{t|T−1} C_t^⊤ ].

^13 The notation Df(z) is the Jacobian of f evaluated at z.
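A sketch of these M-step updates, given the quantities produced by an extended Kalman smoother, might look as follows; the container layout, function names, and the normalization over the available transitions are assumptions of this sketch.

```python
import numpy as np

def m_step_covariances(mu_s, Sig_s, Sig_f, Sig_p, f, h, Df, Dh, y):
    """Update Q and R from smoother output (Section A.1).

    mu_s[t], Sig_s[t] : smoothed mean/covariance of z_t given all measurements
    Sig_f[t]          : filtered covariance of z_t given measurements up to t
    Sig_p[t]          : predicted covariance of z_{t+1} given measurements up to t
    Df, Dh            : Jacobians of the dynamics f and measurement function h
    y[t]              : measurement at time t
    """
    T = len(mu_s)
    Q = np.zeros((mu_s[0].size, mu_s[0].size))
    R = np.zeros((y[0].size, y[0].size))
    for t in range(T - 1):
        eps = mu_s[t + 1] - f(mu_s[t])
        A = Df(mu_s[t])
        L = Sig_f[t] @ A.T @ np.linalg.inv(Sig_p[t])
        P = Sig_s[t + 1] - Sig_s[t + 1] @ L.T @ A.T - A @ L @ Sig_s[t + 1]
        Q += np.outer(eps, eps) + A @ Sig_s[t] @ A.T + P
    for t in range(T):
        r = y[t] - h(mu_s[t])
        C = Dh(mu_s[t])
        R += np.outer(r, r) + C @ Sig_s[t] @ C.T
    # Average over the available transitions/measurements.
    return Q / (T - 1), R / T
```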
A.2. Steps 4 and 5 details: Dynamic time warping
In Step 4 our goal is to compute τ as:

    τ = arg max_τ  log P(z, y, ρ, τ; Σ^(·), d)
      = arg max_τ  Σ_{k=0}^{M−1} Σ_{j=0}^{N^k−1} [ ℓ(y_j^k | z_{τ_j^k}, τ_j^k) + ℓ(τ_j^k | τ_{j−1}^k) ],    (14)

where ℓ(·) denotes the corresponding log-likelihood terms.
Note that the inner summations in the above expression are independent: the likelihoods for each of the M observation sequences can be maximized separately. Hence, in the following, we will omit the k superscript, as the algorithm can be applied separately for each sequence of observations and indices.
At this point, we can solve the maximization over τ using a dynamic programming algorithm known in the speech recognition literature as dynamic time warping (Sakoe & Chiba, 1978) and in the biological sequence alignment literature as the Needleman-Wunsch algorithm (Needleman & Wunsch, 1970). For completeness, we provide the details for our setting below.
We define the quantity Q(s, t) to be the maximum obtainable value of the first s + 1 terms of the inner summation if we choose τ_s = t.
For s = 0 we have:

    Q(0, t) = ℓ(y_0 | z_t, τ_0 = t) + ℓ(τ_0 = t).    (15)
And for s > 0:

    Q(s, t) = ℓ(y_s | z_t, τ_s = t)
              + max_{τ_1,...,τ_{s−1}} [ ℓ(τ_s = t | τ_{s−1}) + Σ_{j=0}^{s−1} ( ℓ(y_j | z_{τ_j}, τ_j) + ℓ(τ_j | τ_{j−1}) ) ]
            = ℓ(y_s | z_t, τ_s = t) + max_{t′} [ ℓ(τ_s = t | τ_{s−1} = t′) + Q(s − 1, t′) ].    (16)
The equations (15) and (16) can be used to compute max_t Q(N^k − 1, t) for each observation sequence (and the maximizing solution, τ), which is exactly the maximizing value of the inner summation in Eq. (14). The maximization in Eq. (16) can be restricted to the relevant values of t′, namely those with t − t′ ∈ {1, 2, 3}, since the transition probability in Eq. (3) is zero for any other step.
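A minimal sketch of this dynamic-programming recursion for a single demonstration, using hard assignments, is given below; it assumes the observation log-likelihoods have already been evaluated into a matrix and that the step log-probabilities log d_1, log d_2, log d_3 are supplied, which is a simplification for illustration.

```python
import numpy as np

def dtw_align(log_obs, log_step, T):
    """Maximize the inner summation of Eq. (14) for one demonstration.

    log_obs[s, t] : log-likelihood of observation y_s given tau_s = t
    log_step[i]   : log d_{i+1}, the log-probability of a time-index step of i+1
    Returns the maximizing time indices tau_0, ..., tau_{N-1}.
    """
    N = log_obs.shape[0]
    Q = np.full((N, T), -np.inf)
    back = np.zeros((N, T), dtype=int)
    Q[0, 0] = log_obs[0, 0]                      # tau_0 = 0 (Eq. 4)
    for s in range(1, N):
        for t in range(T):
            for i, step in enumerate((1, 2, 3)): # allowed increments (Eq. 3)
                tp = t - step
                if tp >= 0 and Q[s - 1, tp] + log_step[i] > Q[s, t]:
                    Q[s, t] = Q[s - 1, tp] + log_step[i]
                    back[s, t] = tp
            Q[s, t] += log_obs[s, t]
    tau = np.zeros(N, dtype=int)
    tau[-1] = int(np.argmax(Q[-1]))              # best final index
    for s in range(N - 1, 0, -1):                # backtrack
        tau[s - 1] = back[s, tau[s]]
    return tau
```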