Active Policy Learning For Robot Planning and Exploration Under Uncertainty
Abstract: This paper proposes a simulation-based active policy learning algorithm for finite-horizon, partially-observed sequential decision processes. The algorithm is tested in the domain
of robot navigation and exploration under uncertainty. In such a
setting, the expected cost, which must be minimized, is a function of
the belief state (filtering distribution). This filtering distribution
is in turn nonlinear and subject to discontinuities, which arise
because of constraints in the robot motion and control models. As a
result, the expected cost is non-differentiable and very expensive
to simulate. The new algorithm overcomes the first difficulty and
reduces the number of required simulations as follows. First, it
assumes that we have carried out previous simulations which
returned values of the expected cost for different corresponding
policy parameters. Second, it fits a Gaussian process (GP)
regression model to these values, so as to approximate the
expected cost as a function of the policy parameters. Third, it
uses the GP predicted mean and variance to construct a statistical
measure that determines which policy parameters should be used
in the next simulation. The process is then repeated using the new
parameters and the newly gathered expected cost observation.
Since the objective is to find the policy parameters that
minimize the expected cost, this iterative active learning approach
effectively trades off exploration (in regions where the
GP variance is large) against exploitation (where the GP mean is
low). In our experiments, a robot uses the proposed algorithm
to plan an optimal path for accomplishing a series of tasks,
while maximizing the information about its pose and map
estimates. These estimates are obtained with a standard filter
for simultaneous localization and mapping. Upon gathering new
observations, the robot updates the state estimates and is able to
replan a new path in the spirit of open-loop feedback control.
I. INTRODUCTION
The direct policy search method for reinforcement learning
has led to significant achievements in control and robotics [1,
2, 3, 4]. The success of the method does often, however, hinge
on our ability to formulate expressions for the gradient of
the expected cost [5, 4, 6]. In some important applications in
robotics, such as exploration, constraints in the robot motion
and control models make it hard, and often impossible, to
compute derivatives of the cost function with respect to the
robot actions. In this paper, we present a direct policy search
method for continuous policy spaces that relies on active
learning to side-step the need for gradients.
The proposed active policy learning approach also seems
to be more appropriate in situations where the cost function
has many local minima that cause the gradient methods to
get stuck. Moreover, in situations where the cost function is
very expensive to evaluate by simulation, an active learning approach that is designed to minimize the number of evaluations
might be more suitable than gradient methods, which often require many cost evaluations.
The average mean square error (AMSE) cost is

$$C_{AMSE}(\theta) = \mathbb{E}_{p(x_{0:T},\, y_{1:T} \mid \theta)}\left[ \sum_{t=1}^{T} \lambda^{T-t}\, (\hat{x}_t - x_t)(\hat{x}_t - x_t)' \right],$$

or, when only the state at the final time $T$ is of interest,

$$C_{AMSE}(\theta) = \mathbb{E}_{p(x_T,\, y_{1:T} \mid \theta)}\left[ (\hat{x}_T - x_T)(\hat{x}_T - x_T)' \right], \qquad (1)$$

where $\hat{x}_t$ denotes the filtering estimate of the state $x_t$, $\theta$ denotes the policy parameters, and $\lambda \in [0,1]$ is a discount factor.
Fig. 2. The overall solution approach in the open-loop control (OLC) setting.
Here, N denotes the number of Monte Carlo samples and T is the planning
horizon. In replanning with open-loop feedback control (OLFC), one simply
uses the present position and the estimated posterior distribution (instead of the
prior) as the starting point for the simulations. One can apply this strategy with
either approaching or receding control horizons. It is implicit in the pseudo-code that we freeze the seed of the random number generator so as to reduce variance.
In practice, this cost is approximated by simulating $N$ trajectories under the policy and filtering each of them:

$$\hat{C}_{AMSE}(\theta) = \frac{1}{N} \sum_{i=1}^{N} (\hat{x}_T^{(i)} - x_T^{(i)})(\hat{x}_T^{(i)} - x_T^{(i)})', \qquad (2)$$

where the superscript $(i)$ indexes the $i$-th simulated trajectory.
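As a concrete illustration of this Monte Carlo approximation, the following minimal sketch (not the implementation used in the paper; `simulate_trajectory` and `filter_estimate` are hypothetical stand-ins for the simulator and the SLAM filter, and the trace is used here merely to scalarize the error matrix) evaluates the empirical cost for a given policy parameter vector, freezing the random seed as suggested in the caption of Fig. 2:

```python
import numpy as np

def estimate_amse(theta, simulate_trajectory, filter_estimate, n_samples=50, seed=0):
    """Monte Carlo estimate of the AMSE cost of equation (2) for parameters theta.

    simulate_trajectory(theta, rng) -> (states, observations): one sampled rollout.
    filter_estimate(theta, observations) -> estimated terminal state x_hat_T.
    Freezing the seed (common random numbers) reduces the variance of comparisons
    between different values of theta.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_samples):
        states, observations = simulate_trajectory(theta, rng)
        x_hat_T = filter_estimate(theta, observations)
        err = np.asarray(x_hat_T) - np.asarray(states[-1])   # terminal estimation error
        errors.append(np.outer(err, err))                    # (x_hat - x)(x_hat - x)'
    # Scalarize the averaged error matrix (the trace is one common choice).
    return float(np.trace(np.mean(errors, axis=0)))
```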
[Figure: one-dimensional illustration of the approach, showing the true cost, the GP mean cost and GP variance at the evaluated data points, and the resulting infill function over the policy parameter.]
The simulated costs $C_{1:n} = \{C_1, \ldots, C_n\}$, obtained at the policy parameters $\theta_{1:n}$, are modelled jointly with the cost $C_{n+1}$ at a new candidate $\theta_{n+1}$ using a Gaussian process:

$$\begin{bmatrix} C_{1:n} \\ C_{n+1} \end{bmatrix} \sim \mathcal{N}\left( \mu \mathbf{1}, \; \sigma^2 \begin{bmatrix} K & k \\ k^T & k \end{bmatrix} \right),$$

where $k^T = [k(\theta_{n+1}, \theta_1) \; \cdots \; k(\theta_{n+1}, \theta_n)]$, $k = k(\theta_{n+1}, \theta_{n+1})$ and $K$ is the training data kernel matrix
with entries $k(\theta_i, \theta_j)$ for $i = 1, \ldots, n$ and $j = 1, \ldots, n$.
Since we are interested in regression, the Matérn kernel is a
suitable choice for $k(\cdot, \cdot)$ [14].
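Purely as an illustration (not the authors' code), a Matérn 5/2 kernel and the corresponding training kernel matrix $K$ could be computed as follows; the length-scale `ell` is a hypothetical hyperparameter that would in practice be estimated or fixed by hand:

```python
import numpy as np

def matern52(theta_a, theta_b, ell=1.0):
    """Matern 5/2 kernel k(theta_a, theta_b) with length-scale ell."""
    r = np.linalg.norm(np.asarray(theta_a, float) - np.asarray(theta_b, float)) / ell
    return (1.0 + np.sqrt(5.0) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5.0) * r)

def kernel_matrix(thetas, ell=1.0):
    """Training kernel matrix K with entries k(theta_i, theta_j)."""
    n = len(thetas)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = matern52(thetas[i], thetas[j], ell)
    return K
```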
We assign a normal-inverse-Gamma conjugate prior to the
parameters: $\mu \sim \mathcal{N}(0, \sigma^2 \delta^2)$ and $\sigma^2 \sim \mathcal{IG}(a/2, b/2)$. The
priors play an essential role at the beginning of the design
process, when there are only a few data. Classical Bayesian
analysis allows us to obtain analytical expressions for the
posterior modes of these quantities:

$$\hat{\mu} = (\mathbf{1}^T K^{-1} \mathbf{1} + \delta^{-2})^{-1}\, \mathbf{1}^T K^{-1} C_{1:n}$$

$$\hat{\sigma}^2 = \frac{b + C_{1:n}^T K^{-1} C_{1:n} - (\mathbf{1}^T K^{-1} \mathbf{1} + \delta^{-2})\, \hat{\mu}^2}{n + a + 2}.$$
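In code, these posterior-mode expressions reduce to a few linear solves. The sketch below is illustrative only, under the same normal-inverse-Gamma prior, with `delta2` standing for $\delta^2$ and `a`, `b` for the inverse-Gamma hyperparameters (their default values here are arbitrary placeholders):

```python
import numpy as np

def posterior_modes(K, C, delta2=100.0, a=1.0, b=1.0):
    """Posterior modes of the GP mean mu and scale sigma^2 (see the text)."""
    n = len(C)
    ones = np.ones(n)
    Kinv_C = np.linalg.solve(K, C)
    Kinv_1 = np.linalg.solve(K, ones)
    denom = ones @ Kinv_1 + 1.0 / delta2          # 1' K^-1 1 + delta^-2
    mu_hat = (ones @ Kinv_C) / denom
    sigma2_hat = (b + C @ Kinv_C - denom * mu_hat**2) / (n + a + 2)
    return mu_hat, sigma2_hat
```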
Fig. 5. Infill function. [Figure: the GP mean cost Ĉ(θ) and its uncertainty ŝ(θ) plotted against the policy parameter θ, together with the lowest observed cost C_min; regions of low and high infill are indicated.]
Using the previous estimates, the GP predictive mean and
variance are given by

$$\hat{C}(\theta) = \hat{\mu} + k^T K^{-1} (C_{1:n} - \mathbf{1}\hat{\mu})$$

$$\hat{s}^2(\theta) = \hat{\sigma}^2 \left[ k(\theta, \theta) - k^T K^{-1} k + \frac{(1 - \mathbf{1}^T K^{-1} k)^2}{\mathbf{1}^T K^{-1} \mathbf{1} + \delta^{-2}} \right],$$

with $k$ and $K$ defined as before, but with the kernel now evaluated at the query point $\theta$.
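These predictive equations translate directly into code. The sketch below is illustrative only; it takes a kernel function (for example `matern52` from the earlier sketch) and the posterior modes $\hat{\mu}$, $\hat{\sigma}^2$ as inputs:

```python
import numpy as np

def gp_predict(theta_new, thetas, C, kernel, mu_hat, sigma2_hat, delta2=100.0):
    """GP predictive mean C_hat(theta_new) and variance s_hat^2(theta_new)."""
    n = len(C)
    ones = np.ones(n)
    K = np.array([[kernel(ti, tj) for tj in thetas] for ti in thetas])
    k_vec = np.array([kernel(theta_new, ti) for ti in thetas])

    Kinv_k = np.linalg.solve(K, k_vec)
    Kinv_1 = np.linalg.solve(K, ones)

    C_hat = mu_hat + k_vec @ np.linalg.solve(K, C - ones * mu_hat)
    s2_hat = sigma2_hat * (kernel(theta_new, theta_new) - k_vec @ Kinv_k
                           + (1.0 - ones @ Kinv_k) ** 2
                           / (ones @ Kinv_1 + 1.0 / delta2))
    return C_hat, s2_hat
```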
Given this prediction, the probability that the cost at a candidate $\theta$ improves on the lowest cost $C_{\min}$ observed so far is

$$p(C(\theta) \le C_{\min}) = \Phi\!\left( \frac{C_{\min} - \hat{C}(\theta)}{\hat{s}(\theta)} \right),$$

where $C(\theta) \sim \mathcal{N}(\hat{C}(\theta), \hat{s}(\theta)^2)$ and $\Phi$ denotes the CDF of the
standard Normal distribution. This measure was proposed several decades ago by [27], who used a univariate Wiener process.
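To show how such a criterion drives the choice of the next simulation, here is an illustrative sketch (not the paper's implementation; a simple random candidate search stands in for whatever global optimizer is used to maximize the infill, and `gp_predict_fn` is assumed to wrap the GP prediction for a single candidate):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(C_hat, s_hat, C_min):
    """P(C(theta) <= C_min) under the GP, i.e. Phi((C_min - C_hat)/s_hat)."""
    return norm.cdf((C_min - C_hat) / max(s_hat, 1e-12))

def next_policy_parameters(gp_predict_fn, C_min, bounds, n_candidates=1000, seed=0):
    """Pick the candidate theta with the highest probability of improvement."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    candidates = rng.uniform(lo, hi, size=(n_candidates, lo.size))
    best_theta, best_score = None, -np.inf
    for theta in candidates:
        C_hat, s2_hat = gp_predict_fn(theta)
        score = probability_of_improvement(C_hat, np.sqrt(max(s2_hat, 0.0)), C_min)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta
```

The chosen $\theta$ is then simulated, the new cost observation is appended to $C_{1:n}$, and the GP is refitted, which is the active learning loop described in the abstract.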
As an alternative to simulating the AMSE cost directly, one can compute a lower bound on it using the posterior Cramér-Rao bound (PCRB). The bound is obtained from the information matrix $J_t$, which obeys the recursion

$$J_{t+1} = A_{t+1} + D_t - C_t^T (J_t + B_t)^{-1} C_t, \qquad (3)$$

where

$$A_{t+1} = \mathbb{E}\left[ -\Delta_{x_{t+1}}^{x_{t+1}} \log p(y_{t+1} \mid x_{t+1}) \right]$$
$$B_t = \mathbb{E}\left[ -\Delta_{x_t}^{x_t} \log p(x_{t+1} \mid x_t) \right]$$
$$C_t = \mathbb{E}\left[ -\Delta_{x_t}^{x_{t+1}} \log p(x_{t+1} \mid x_t) \right]$$
$$D_t = \mathbb{E}\left[ -\Delta_{x_{t+1}}^{x_{t+1}} \log p(x_{t+1} \mid x_t) \right],$$

where the expectations are with respect to the simulated trajectories and $\Delta$ denotes the Laplacian operator. By simulating
(sampling) trajectories, using our observation and transition
models, one can easily approximate these expectations with
Monte Carlo averages. These averages can be computed offline and hence the expensive recursion of equation (3) only
needs to be done once for all scenarios.
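A minimal sketch of this recursion, assuming the Monte Carlo estimates of $A_{t+1}$, $B_t$, $C_t$, $D_t$ have already been computed from simulated trajectories (this is only an illustration of equation (3), not the authors' code):

```python
import numpy as np

def pcrb_information(A_seq, B_seq, C_seq, D_seq, J0):
    """Run the information-matrix recursion of equation (3).

    A_seq[t], B_seq[t], C_seq[t], D_seq[t]: Monte Carlo estimates of the
    matrices A_{t+1}, B_t, C_t, D_t; J0: prior information matrix.
    Returns [J_1, ..., J_T]; the trace of inv(J_T) then lower-bounds the
    mean square error of any estimator of x_T.
    """
    J = np.asarray(J0, float)
    history = []
    for A, B, C, D in zip(A_seq, B_seq, C_seq, D_seq):
        A, B, C, D = (np.asarray(M, float) for M in (A, B, C, D))
        J = A + D - C.T @ np.linalg.solve(J + B, C)
        history.append(J)
    return history
```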
The PCRB approximation method of [35] applies to nonlinear (NL) models with additive noise only. This is not the
case in our setting and hence is a potential source of error. An
alternative PCRB approximation method that overcomes this
shortcoming, in the context of jump Markov linear (JML)
models, was proposed by [36]. We try both approximations
in our experiments and refer to them as NL-PCRB and JML-PCRB respectively.
The AMSE simulation approach of Section II-A using the
EKF requires that we perform an expensive Riccati update
(EKF covariance update) for each simulated trajectory. In contrast, the simulation approach using the PCRB only requires
one Riccati update (equation (3)). Thus, the latter approach is
considerably cheaper. Yet, the PCRB is only a lower bound
and hence is not guaranteed to be tight. In
the following section, we provide empirical comparisons
between these simulation approaches.
V. EXPERIMENTS
We present two sets of experiments. The first experiment is
very simple as it is aimed at illustrating the approach. It involves a fixed-horizon stochastic planning domain. The second
set of experiments is concerned with exploration with receding
horizon policies in more realistic settings. In all cases, the aim
is to find the optimal path in terms of posterior information
about the map and the robot pose. For clarity, other terms
contributing to the cost, such as time and obstacles, are not
considered here, but incorporating them would be straightforward.
[Figure: evolution of the cost over planning steps for OLC1, OLC3 and OLFC3.]
A. Fixed-horizon planning
The first experiment is the one described in Figure 1.
Here, the start and end positions of the path are fixed. The
robot has to compute the coordinates of three intermediate
way-points and, hence, the policy has six parameters. For
illustration purposes we chose a simple environment consisting
of 5 landmarks (with vague priors). We placed an informative
prior on the initial robot pose. Figure 6 shows three different
robot trajectories computed during policy optimization. The
trajectories are also indicated in the Monte Carlo AMSE cost
evolution plot. The 6D optimization requires less than 50
iterations. We found that the optimal trajectory allowed the
robot to observe the maximum number of features. However,
since the prior on the robot's initial pose is informative
(narrow Gaussian), feature A is originally detected with very
low uncertainty. Consequently, the robot tries to maintain
that feature in the field of view to improve its localization.
A greedy strategy would have focused only on feature A,
improving the estimation of that feature and of the robot pose,
but at the expense of the global posterior estimate.
B. Receding-horizon planning
In this experiment, the a priori map has high uncertainty
(1 meter standard deviation; see Figure 7). The robot is
a differential drive vehicle equipped with odometers and a
stereo camera that provides the location of features. The field
of view is limited to 7 meters and 90°, which are typical values
for reliable stereo matching. We assume that the camera and
a detection system provide a set of observations every
0.5 seconds. The sensor noise is Gaussian for both range and
bearing, with standard deviations σ_range = 0.2 · range and
σ_bearing = 0.5°. The policy is given by a set of ordered
way-points. Each way-point is defined in terms of heading
and distance with respect to the robot pose at the preceding
way-point. The distance between way-points is limited to 10
meters and the heading is restricted to the interval [−3π/4, 3π/4] radians.
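As a purely illustrative sketch of this way-point parameterization (a hypothetical helper, with the limits hard-coded from the values above), a policy vector of (heading, distance) pairs could be unrolled into global way-point coordinates as follows:

```python
import numpy as np

MAX_DISTANCE = 10.0                  # meters, limit stated in the text
HEADING_LIMIT = 3.0 * np.pi / 4.0    # radians, heading restricted to [-3*pi/4, 3*pi/4]

def policy_to_waypoints(policy, start_pose):
    """Convert (heading, distance) pairs into global way-point coordinates.

    policy: sequence of (heading, distance) pairs, each relative to the pose
    at the preceding way-point; start_pose: (x, y, orientation).
    """
    x, y, phi = start_pose
    waypoints = []
    for heading, distance in policy:
        heading = np.clip(heading, -HEADING_LIMIT, HEADING_LIMIT)
        distance = np.clip(distance, 0.0, MAX_DISTANCE)
        phi = phi + heading                                   # new travel direction
        x, y = x + distance * np.cos(phi), y + distance * np.sin(phi)
        waypoints.append((x, y))
    return waypoints
```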
Fig. 7. Trajectories generated using OLC1, OLC3 and OLFC3 (three panels, axes in meters, showing the landmark locations and map estimates). The blue and red ellipses represent the landmark and robot location 95%
confidence intervals. The robot field of view is shown in green. OLC3 is more exploratory than OLC1, which gets stuck repeatedly updating
the first landmark it encounters. Yet only OLFC3, because it is able to replan at each step, fully explores the map and reduces
the uncertainty.
[Figures: evolution of the cost over planning steps, comparing OLC1, OLC3 and OLFC3, and comparing the NL-PCRB, JML-PCRB and AMSE simulation approaches.]
[11] M. L. Hernandez, "Optimal sensor trajectories in bearings-only tracking," in Proceedings of the Seventh International Conference on Information Fusion, P. Svensson and J. Schubert, Eds., vol. II. Mountain View, CA: International Society of Information Fusion, Jun 2004, pp. 893–900.
[12] T. Kollar and N. Roy, "Using reinforcement learning to improve exploration trajectories for error minimization," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2006.
[13] D. R. Jones, M. Schonlau, and W. J. Welch, "Efficient global optimization of expensive black-box functions," Journal of Global Optimization, vol. 13, no. 4, pp. 455–492, 1998.
[14] T. J. Santner, B. Williams, and W. Notz, The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.
[15] E. S. Siah, M. Sasena, and J. L. Volakis, "Fast parameter optimization of large-scale electromagnetic objects using DIRECT with Kriging metamodeling," IEEE Transactions on Microwave Theory and Techniques, vol. 52, no. 1, pp. 276–285, January 2004.
[16] L. P. Kaelbling, "Learning in embedded systems," Ph.D. dissertation, Stanford University, 1990.
[17] A. W. Moore and J. Schneider, "Memory-based stochastic optimization," in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., vol. 8. The MIT Press, 1996, pp. 1066–1072.
[18] N. Bergman and P. Tichavsky, "Two Cramér-Rao bounds for terrain-aided navigation," IEEE Transactions on Aerospace and Electronic Systems, 1999, in review.
[19] M. L. Hernandez, T. Kirubarajan, and Y. Bar-Shalom, "Multisensor resource deployment using posterior Cramér-Rao bounds," IEEE Transactions on Aerospace and Electronic Systems, vol. 40, no. 2, pp. 399–416, April 2004.
[20] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, 1995.
[21] O. Tremois and J. P. LeCadre, "Optimal observer trajectory in bearings-only tracking for manoeuvring sources," IEE Proceedings on Radar, Sonar and Navigation, vol. 146, no. 1, pp. 31–39, 1999.
[22] R. D. Smallwood and E. J. Sondik, "The optimal control of partially observable Markov processes over a finite horizon," Operations Research, vol. 21, pp. 1071–1088, 1973.
[23] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3, pp. 229–256, 1992.
[24] A. Y. Ng and M. I. Jordan, "Pegasus: A policy search method for large MDPs and POMDPs," in Uncertainty in Artificial Intelligence (UAI2000), 2000.
[25] J. M. Maciejowski, Predictive Control: with Constraints. Prentice-Hall, 2002.
[26] J. A. Castellanos and J. D. Tardos, Mobile Robot Localization and Map Building: A Multisensor Fusion Approach. Kluwer Academic Publishers, 1999.
[27] H. J. Kushner, "A new method of locating the maximum of an arbitrary multipeak curve in the presence of noise," Journal of Basic Engineering, vol. 86, pp. 97–106, 1964.
[28] D. R. Jones, "A taxonomy of global optimization methods based on response surfaces," Journal of Global Optimization, vol. 21, pp. 345–383, 2001.
[29] M. J. Sasena, "Flexibility and efficiency enhancement for constrained global design optimization with Kriging approximations," Ph.D. dissertation, University of Michigan, 2002.
[30] D. R. Jones, C. D. Perttunen, and B. E. Stuckman, "Lipschitzian optimization without the Lipschitz constant," Journal of Optimization Theory and Applications, vol. 79, no. 1, pp. 157–181, October 1993.
[31] J. M. Gablonsky, "Modification of the DIRECT algorithm," Ph.D. dissertation, Department of Mathematics, North Carolina State University, Raleigh, North Carolina, 2001.
[32] D. E. Finkel, DIRECT Optimization Algorithm User Guide. Center for Research in Scientific Computation, North Carolina State University, 2003.
[33] A. J. Smola, S. V. N. Vishwanathan, and T. Hofmann, "Kernel methods for missing variables," in AI Stats, 2005, pp. 325–332.
[34] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, Massachusetts: MIT Press, 2006.
[35] P. Tichavsky, C. Muravchik, and A. Nehorai, "Posterior Cramér-Rao bounds for discrete-time nonlinear filtering," IEEE Transactions on Signal Processing, vol. 46, no. 5, pp. 1386–1396, 1998.
[36] N. Bergman, A. Doucet, and N. J. Gordon, "Optimal estimation and Cramér-Rao bounds for partial non-Gaussian state space models," Annals of the Institute of Statistical Mathematics, vol. 52, no. 1, pp. 1–17, 2001.