Active Policy Learning For Robot Planning and Exploration Under Uncertainty
Abstract: This paper proposes a simulation-based active policy learning algorithm for finite-horizon, partially-observed sequential decision processes. The algorithm is tested in the domain
of robot navigation and exploration under uncertainty. In such a
setting, the expected cost, which must be minimized, is a function of
the belief state (filtering distribution). This filtering distribution
is in turn nonlinear and subject to discontinuities, which arise
because of constraints in the robot motion and control models. As a
result, the expected cost is non-differentiable and very expensive
to simulate. The new algorithm overcomes the first difficulty and
reduces the number of required simulations as follows. First, it
assumes that we have carried out previous simulations which
returned values of the expected cost for different corresponding
policy parameters. Second, it fits a Gaussian process (GP)
regression model to these values, so as to approximate the
expected cost as a function of the policy parameters. Third, it
uses the GP predicted mean and variance to construct a statistical
measure that determines which policy parameters should be used
in the next simulation. The process is then repeated using the new
parameters and the newly gathered expected cost observation.
Since the objective is to find the policy parameters that
minimize the expected cost, this iterative active learning approach
effectively trades off exploration (in regions where the
GP variance is large) against exploitation (where the GP mean is
low). In our experiments, a robot uses the proposed algorithm
to plan an optimal path for accomplishing a series of tasks,
while maximizing the information about its pose and map
estimates. These estimates are obtained with a standard filter
for simultaneous localization and mapping. Upon gathering new
observations, the robot updates the state estimates and is able to
replan a new path in the spirit of open-loop feedback control.
I. INTRODUCTION
The direct policy search method for reinforcement learning
has led to significant achievements in control and robotics [1,
2, 3, 4]. The success of the method does often, however, hinge
on our ability to formulate expressions for the gradient of
the expected cost [5, 4, 6]. In some important applications in
robotics, such as exploration, constraints in the robot motion
and control models make it hard, and often impossible, to
compute derivatives of the cost function with respect to the
robot actions. In this paper, we present a direct policy search
method for continuous policy spaces that relies on active
learning to side-step the need for gradients.
The proposed active policy learning approach also seems
to be more appropriate in situations where the cost function
has many local minima that cause the gradient methods to
get stuck. Moreover, in situations where the cost function is
very expensive to evaluate by simulation, an active learning approach that is designed to minimize the number of evaluations
might be more suitable than gradient methods, which often require many cost evaluations.
The average mean square error (AMSE) cost is

$$C_{AMSE}(\theta) = \mathbb{E}_{p(x_{0:T},\, y_{1:T} \mid \theta)}\left[ \sum_{t=1}^{T} \lambda^{T-t}\, (\hat{x}_t - x_t)(\hat{x}_t - x_t)' \right],$$

or, when only the state at the final time $T$ is of interest,

$$C_{AMSE}(\theta) = \mathbb{E}_{p(x_T,\, y_{1:T} \mid \theta)}\left[ (\hat{x}_T - x_T)(\hat{x}_T - x_T)' \right], \qquad (1)$$

where $\hat{x}_t$ denotes the filtering estimate of the state $x_t$, $\theta$ denotes the policy parameters, and $\lambda \in [0,1]$ is a discount factor.
Fig. 2. The overall solution approach in the open-loop control (OLC) setting.
Here, N denotes the number of Monte Carlo samples and T is the planning
horizon. In replanning with open-loop feedback control (OLFC), one simply
uses the present position and the estimated posterior distribution (instead of the
prior) as the starting point for the simulations. One can apply this strategy with
either approaching or receding control horizons. It is implicit in the pseudo-code that we freeze the seed of the random number generator so as to reduce variance.
In practice, this cost is approximated by simulating $N$ trajectories under the policy and filtering each of them:

$$\hat{C}_{AMSE}(\theta) = \frac{1}{N} \sum_{i=1}^{N} (\hat{x}_T^{(i)} - x_T^{(i)})(\hat{x}_T^{(i)} - x_T^{(i)})', \qquad (2)$$

where the superscript $(i)$ indexes the $i$-th simulated trajectory.
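As a concrete illustration of this Monte Carlo approximation, the following minimal sketch (not the implementation used in the paper; `simulate_trajectory` and `filter_estimate` are hypothetical stand-ins for the simulator and the SLAM filter, and the trace is used here merely to scalarize the error matrix) evaluates the empirical cost for a given policy parameter vector, freezing the random seed as suggested in the caption of Fig. 2:

```python
import numpy as np

def estimate_amse(theta, simulate_trajectory, filter_estimate, n_samples=50, seed=0):
    """Monte Carlo estimate of the AMSE cost of equation (2) for parameters theta.

    simulate_trajectory(theta, rng) -> (states, observations): one sampled rollout.
    filter_estimate(theta, observations) -> estimated terminal state x_hat_T.
    Freezing the seed (common random numbers) reduces the variance of comparisons
    between different values of theta.
    """
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_samples):
        states, observations = simulate_trajectory(theta, rng)
        x_hat_T = filter_estimate(theta, observations)
        err = np.asarray(x_hat_T) - np.asarray(states[-1])   # terminal estimation error
        errors.append(np.outer(err, err))                    # (x_hat - x)(x_hat - x)'
    # Scalarize the averaged error matrix (the trace is one common choice).
    return float(np.trace(np.mean(errors, axis=0)))
```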
[Figure: one-dimensional illustration of the approach, showing the true cost, the GP mean cost and GP variance at the evaluated data points, and the resulting infill function over the policy parameter.]
The simulated costs $C_{1:n} = \{C_1, \ldots, C_n\}$, obtained at the policy parameters $\theta_{1:n}$, are modelled jointly with the cost $C_{n+1}$ at a new candidate $\theta_{n+1}$ using a Gaussian process:

$$\begin{bmatrix} C_{1:n} \\ C_{n+1} \end{bmatrix} \sim \mathcal{N}\left( \mu \mathbf{1}, \; \sigma^2 \begin{bmatrix} K & k \\ k^T & k \end{bmatrix} \right),$$

where $k^T = [k(\theta_{n+1}, \theta_1) \; \cdots \; k(\theta_{n+1}, \theta_n)]$, $k = k(\theta_{n+1}, \theta_{n+1})$ and $K$ is the training data kernel matrix
with entries $k(\theta_i, \theta_j)$ for $i = 1, \ldots, n$ and $j = 1, \ldots, n$.
Since we are interested in regression, the Matérn kernel is a
suitable choice for $k(\cdot, \cdot)$ [14].
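Purely as an illustration (not the authors' code), a Matérn 5/2 kernel and the corresponding training kernel matrix $K$ could be computed as follows; the length-scale `ell` is a hypothetical hyperparameter that would in practice be estimated or fixed by hand:

```python
import numpy as np

def matern52(theta_a, theta_b, ell=1.0):
    """Matern 5/2 kernel k(theta_a, theta_b) with length-scale ell."""
    r = np.linalg.norm(np.asarray(theta_a, float) - np.asarray(theta_b, float)) / ell
    return (1.0 + np.sqrt(5.0) * r + 5.0 * r**2 / 3.0) * np.exp(-np.sqrt(5.0) * r)

def kernel_matrix(thetas, ell=1.0):
    """Training kernel matrix K with entries k(theta_i, theta_j)."""
    n = len(thetas)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = matern52(thetas[i], thetas[j], ell)
    return K
```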
We assign a normal-inverse-Gamma conjugate prior to the
parameters: $\mu \sim \mathcal{N}(0, \sigma^2 \delta^2)$ and $\sigma^2 \sim \mathcal{IG}(a/2, b/2)$. The
priors play an essential role at the beginning of the design
process, when there are only a few data. Classical Bayesian
analysis allows us to obtain analytical expressions for the
posterior modes of these quantities:

$$\hat{\mu} = (\mathbf{1}^T K^{-1} \mathbf{1} + \delta^{-2})^{-1}\, \mathbf{1}^T K^{-1} C_{1:n}$$

$$\hat{\sigma}^2 = \frac{b + C_{1:n}^T K^{-1} C_{1:n} - (\mathbf{1}^T K^{-1} \mathbf{1} + \delta^{-2})\, \hat{\mu}^2}{n + a + 2}.$$
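In code, these posterior-mode expressions reduce to a few linear solves. The sketch below is illustrative only, under the same normal-inverse-Gamma prior, with `delta2` standing for $\delta^2$ and `a`, `b` for the inverse-Gamma hyperparameters (their default values here are arbitrary placeholders):

```python
import numpy as np

def posterior_modes(K, C, delta2=100.0, a=1.0, b=1.0):
    """Posterior modes of the GP mean mu and scale sigma^2 (see the text)."""
    n = len(C)
    ones = np.ones(n)
    Kinv_C = np.linalg.solve(K, C)
    Kinv_1 = np.linalg.solve(K, ones)
    denom = ones @ Kinv_1 + 1.0 / delta2          # 1' K^-1 1 + delta^-2
    mu_hat = (ones @ Kinv_C) / denom
    sigma2_hat = (b + C @ Kinv_C - denom * mu_hat**2) / (n + a + 2)
    return mu_hat, sigma2_hat
```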
Fig. 5. Infill function. [Figure: the GP mean cost Ĉ(θ) and its uncertainty ŝ(θ) plotted against the policy parameter θ, together with the lowest observed cost C_min; regions of low and high infill are indicated.]
Using the previous estimates, the GP predictive mean and
variance are given by

$$\hat{C}(\theta) = \hat{\mu} + k^T K^{-1} (C_{1:n} - \mathbf{1}\hat{\mu})$$

$$\hat{s}^2(\theta) = \hat{\sigma}^2 \left[ k(\theta, \theta) - k^T K^{-1} k + \frac{(1 - \mathbf{1}^T K^{-1} k)^2}{\mathbf{1}^T K^{-1} \mathbf{1} + \delta^{-2}} \right],$$

with $k$ and $K$ defined as before, but with the kernel now evaluated at the query point $\theta$.
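These predictive equations translate directly into code. The sketch below is illustrative only; it takes a kernel function (for example `matern52` from the earlier sketch) and the posterior modes $\hat{\mu}$, $\hat{\sigma}^2$ as inputs:

```python
import numpy as np

def gp_predict(theta_new, thetas, C, kernel, mu_hat, sigma2_hat, delta2=100.0):
    """GP predictive mean C_hat(theta_new) and variance s_hat^2(theta_new)."""
    n = len(C)
    ones = np.ones(n)
    K = np.array([[kernel(ti, tj) for tj in thetas] for ti in thetas])
    k_vec = np.array([kernel(theta_new, ti) for ti in thetas])

    Kinv_k = np.linalg.solve(K, k_vec)
    Kinv_1 = np.linalg.solve(K, ones)

    C_hat = mu_hat + k_vec @ np.linalg.solve(K, C - ones * mu_hat)
    s2_hat = sigma2_hat * (kernel(theta_new, theta_new) - k_vec @ Kinv_k
                           + (1.0 - ones @ Kinv_k) ** 2
                           / (ones @ Kinv_1 + 1.0 / delta2))
    return C_hat, s2_hat
```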
Given this prediction, the probability that the cost at a candidate $\theta$ improves on the lowest cost $C_{\min}$ observed so far is

$$p(C(\theta) \le C_{\min}) = \Phi\!\left( \frac{C_{\min} - \hat{C}(\theta)}{\hat{s}(\theta)} \right),$$

where $C(\theta) \sim \mathcal{N}(\hat{C}(\theta), \hat{s}(\theta)^2)$ and $\Phi$ denotes the CDF of the
standard Normal distribution. This measure was proposed several decades ago by [27], who used a univariate Wiener process.
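To show how such a criterion drives the choice of the next simulation, here is an illustrative sketch (not the paper's implementation; a simple random candidate search stands in for whatever global optimizer is used to maximize the infill, and `gp_predict_fn` is assumed to wrap the GP prediction for a single candidate):

```python
import numpy as np
from scipy.stats import norm

def probability_of_improvement(C_hat, s_hat, C_min):
    """P(C(theta) <= C_min) under the GP, i.e. Phi((C_min - C_hat)/s_hat)."""
    return norm.cdf((C_min - C_hat) / max(s_hat, 1e-12))

def next_policy_parameters(gp_predict_fn, C_min, bounds, n_candidates=1000, seed=0):
    """Pick the candidate theta with the highest probability of improvement."""
    rng = np.random.default_rng(seed)
    lo, hi = np.asarray(bounds[0], float), np.asarray(bounds[1], float)
    candidates = rng.uniform(lo, hi, size=(n_candidates, lo.size))
    best_theta, best_score = None, -np.inf
    for theta in candidates:
        C_hat, s2_hat = gp_predict_fn(theta)
        score = probability_of_improvement(C_hat, np.sqrt(max(s2_hat, 0.0)), C_min)
        if score > best_score:
            best_theta, best_score = theta, score
    return best_theta
```

The chosen $\theta$ is then simulated, the new cost observation is appended to $C_{1:n}$, and the GP is refitted, which is the active learning loop described in the abstract.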
As an alternative to simulating the AMSE cost directly, one can compute a lower bound on it using the posterior Cramér-Rao bound (PCRB). The bound is obtained from the information matrix $J_t$, which obeys the recursion

$$J_{t+1} = A_{t+1} + D_t - C_t^T (J_t + B_t)^{-1} C_t, \qquad (3)$$

where

$$A_{t+1} = \mathbb{E}\left[ -\Delta_{x_{t+1}}^{x_{t+1}} \log p(y_{t+1} \mid x_{t+1}) \right]$$
$$B_t = \mathbb{E}\left[ -\Delta_{x_t}^{x_t} \log p(x_{t+1} \mid x_t) \right]$$
$$C_t = \mathbb{E}\left[ -\Delta_{x_t}^{x_{t+1}} \log p(x_{t+1} \mid x_t) \right]$$
$$D_t = \mathbb{E}\left[ -\Delta_{x_{t+1}}^{x_{t+1}} \log p(x_{t+1} \mid x_t) \right],$$

where the expectations are with respect to the simulated trajectories and $\Delta$ denotes the Laplacian operator. By simulating
(sampling) trajectories, using our observation and transition
models, one can easily approximate these expectations with
Monte Carlo averages. These averages can be computed offline and hence the expensive recursion of equation (3) only
needs to be done once for all scenarios.
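A minimal sketch of this recursion, assuming the Monte Carlo estimates of $A_{t+1}$, $B_t$, $C_t$, $D_t$ have already been computed from simulated trajectories (this is only an illustration of equation (3), not the authors' code):

```python
import numpy as np

def pcrb_information(A_seq, B_seq, C_seq, D_seq, J0):
    """Run the information-matrix recursion of equation (3).

    A_seq[t], B_seq[t], C_seq[t], D_seq[t]: Monte Carlo estimates of the
    matrices A_{t+1}, B_t, C_t, D_t; J0: prior information matrix.
    Returns [J_1, ..., J_T]; the trace of inv(J_T) then lower-bounds the
    mean square error of any estimator of x_T.
    """
    J = np.asarray(J0, float)
    history = []
    for A, B, C, D in zip(A_seq, B_seq, C_seq, D_seq):
        A, B, C, D = (np.asarray(M, float) for M in (A, B, C, D))
        J = A + D - C.T @ np.linalg.solve(J + B, C)
        history.append(J)
    return history
```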
The PCRB approximation method of [35] applies to nonlinear (NL) models with additive noise only. This is not the
case in our setting and hence is a potential source of error. An
alternative PCRB approximation method that overcomes this
shortcoming, in the context of jump Markov linear (JML)
models, was proposed by [36]. We try both approximations
in our experiments and refer to them as NL-PCRB and JML-PCRB respectively.
The AMSE simulation approach of Section II-A using the
EKF requires that we perform an expensive Riccati update
(EKF covariance update) for each simulated trajectory. In contrast, the simulation approach using the PCRB only requires
one Riccati update (equation (3)). Thus, the latter approach is
considerably cheaper. Yet, the PCRB is only a lower bound
and hence is not guaranteed to be tight. In
the following section, we provide empirical comparisons
between these simulation approaches.
V. EXPERIMENTS
We present two sets of experiments. The first experiment is
very simple as it is aimed at illustrating the approach. It involves a fixed-horizon stochastic planning domain. The second
set of experiments is concerned with exploration with receding
horizon policies in more realistic settings. In all cases, the aim
is to find the optimal path in terms of posterior information
about the map and the robot pose. For clarity, other terms
contributing to the cost, such as time and obstacles, are not
considered here, but incorporating them would be straightforward.
[Figure: evolution of the cost over planning steps for OLC1, OLC3 and OLFC3.]
A. Fixed-horizon planning
The first experiment is the one described in Figure 1.
Here, the start and end positions of the path are fixed. The
robot has to compute the coordinates of three intermediate
way-points and, hence, the policy has six parameters. For
illustration purposes we chose a simple environment consisting
of 5 landmarks (with vague priors). We placed an informative
prior on the initial robot pose. Figure 6 shows three different
robot trajectories computed during policy optimization. The
trajectories are also indicated in the Monte Carlo AMSE cost
evolution plot. The 6D optimization requires less than 50
iterations. We found that the optimal trajectory allowed the
robot to observe the maximum number of features. However,
since the prior on the robot's initial pose is informative
(narrow Gaussian), feature A is originally detected with very
low uncertainty. Consequently, the robot tries to maintain
that feature in the field of view to improve its localization.
A greedy strategy would have focused only on feature A,
improving the estimation of that feature and of the robot pose,
but at the expense of the global posterior estimate.
B. Receding-horizon planning
In this experiment, the a priori map has high uncertainty
(1 meter standard deviation; see Figure 7). The robot is
a differential drive vehicle equipped with odometers and a
stereo camera that provides the location of features. The field
of view is limited to 7 meters and 90°, which are typical values
for reliable stereo matching. We assume that the camera and
a detection system provide a set of observations every
0.5 seconds. The sensor noise is Gaussian for both range and
bearing, with standard deviations σ_range = 0.2 · range and
σ_bearing = 0.5°. The policy is given by a set of ordered
way-points. Each way-point is defined in terms of heading
and distance with respect to the robot pose at the preceding
way-point. The distance between way-points is limited to 10
meters and the heading is restricted to the interval [−3π/4, 3π/4] radians.
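As a purely illustrative sketch of this way-point parameterization (a hypothetical helper, with the limits hard-coded from the values above), a policy vector of (heading, distance) pairs could be unrolled into global way-point coordinates as follows:

```python
import numpy as np

MAX_DISTANCE = 10.0                  # meters, limit stated in the text
HEADING_LIMIT = 3.0 * np.pi / 4.0    # radians, heading restricted to [-3*pi/4, 3*pi/4]

def policy_to_waypoints(policy, start_pose):
    """Convert (heading, distance) pairs into global way-point coordinates.

    policy: sequence of (heading, distance) pairs, each relative to the pose
    at the preceding way-point; start_pose: (x, y, orientation).
    """
    x, y, phi = start_pose
    waypoints = []
    for heading, distance in policy:
        heading = np.clip(heading, -HEADING_LIMIT, HEADING_LIMIT)
        distance = np.clip(distance, 0.0, MAX_DISTANCE)
        phi = phi + heading                                   # new travel direction
        x, y = x + distance * np.cos(phi), y + distance * np.sin(phi)
        waypoints.append((x, y))
    return waypoints
```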
Fig. 7. Trajectories generated using OLC1, OLC3 and OLFC3 (three panels, axes in meters, showing the landmark locations and map estimates). The blue and red ellipses represent the landmark and robot location 95%
confidence intervals. The robot field of view is shown in green. OLC3 is more exploratory than OLC1, which gets stuck repeatedly updating
the first landmark it encounters. Yet only OLFC3, because it is able to replan at each step, fully explores the map and reduces
the uncertainty.
[Figures: evolution of the cost over planning steps, comparing OLC1, OLC3 and OLFC3, and comparing the NL-PCRB, JML-PCRB and AMSE simulation approaches.]
[11] M. L. Hernandez, "Optimal sensor trajectories in bearings-only tracking," in Proceedings of the Seventh International Conference on Information Fusion, P. Svensson and J. Schubert, Eds., vol. II. Mountain View, CA: International Society of Information Fusion, Jun 2004, pp. 893–900.
[12] T. Kollar and N. Roy, "Using reinforcement learning to improve exploration trajectories for error minimization," in Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), 2006.
[13] D. R. Jones, M. Schonlau, and W. J. Welch, "Efficient global optimization of expensive black-box functions," Journal of Global Optimization, vol. 13, no. 4, pp. 455–492, 1998.
[14] T. J. Santner, B. Williams, and W. Notz, The Design and Analysis of Computer Experiments. Springer-Verlag, 2003.
[15] E. S. Siah, M. Sasena, and J. L. Volakis, "Fast parameter optimization of large-scale electromagnetic objects using DIRECT with Kriging metamodeling," IEEE Transactions on Microwave Theory and Techniques, vol. 52, no. 1, pp. 276–285, January 2004.
[16] L. P. Kaelbling, "Learning in embedded systems," Ph.D. dissertation, Stanford University, 1990.
[17] A. W. Moore and J. Schneider, "Memory-based stochastic optimization," in Advances in Neural Information Processing Systems, D. S. Touretzky, M. C. Mozer, and M. E. Hasselmo, Eds., vol. 8. The MIT Press, 1996, pp. 1066–1072.
[18] N. Bergman and P. Tichavsky, "Two Cramér-Rao bounds for terrain-aided navigation," IEEE Transactions on Aerospace and Electronic Systems, 1999, in review.
[19] M. L. Hernandez, T. Kirubarajan, and Y. Bar-Shalom, "Multisensor resource deployment using posterior Cramér-Rao bounds," IEEE Transactions on Aerospace and Electronic Systems, vol. 40, no. 2, pp. 399–416, April 2004.
[20] D. P. Bertsekas, Dynamic Programming and Optimal Control. Athena Scientific, 1995.
[21] O. Tremois and J. P. LeCadre, "Optimal observer trajectory in bearings-only tracking for manoeuvring sources," IEE Proceedings on Radar, Sonar and Navigation, vol. 146, no. 1, pp. 31–39, 1999.
[22] R. D. Smallwood and E. J. Sondik, "The optimal control of partially observable Markov processes over a finite horizon," Operations Research, vol. 21, pp. 1071–1088, 1973.
[23] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, vol. 8, no. 3, pp. 229–256, 1992.
[24] A. Y. Ng and M. I. Jordan, "Pegasus: A policy search method for large MDPs and POMDPs," in Uncertainty in Artificial Intelligence (UAI2000), 2000.
[25] J. M. Maciejowski, Predictive Control: with Constraints. Prentice-Hall, 2002.
[26] J. A. Castellanos and J. D. Tardos, Mobile Robot Localization and Map Building: A Multisensor Fusion Approach. Kluwer Academic Publishers, 1999.
[27] H. J. Kushner, "A new method of locating the maximum of an arbitrary multipeak curve in the presence of noise," Journal of Basic Engineering, vol. 86, pp. 97–106, 1964.
[28] D. R. Jones, "A taxonomy of global optimization methods based on response surfaces," Journal of Global Optimization, vol. 21, pp. 345–383, 2001.
[29] M. J. Sasena, "Flexibility and efficiency enhancement for constrained global design optimization with Kriging approximations," Ph.D. dissertation, University of Michigan, 2002.
[30] D. R. Jones, C. D. Perttunen, and B. E. Stuckman, "Lipschitzian optimization without the Lipschitz constant," Journal of Optimization Theory and Applications, vol. 79, no. 1, pp. 157–181, October 1993.
[31] J. M. Gablonsky, "Modification of the DIRECT algorithm," Ph.D. dissertation, Department of Mathematics, North Carolina State University, Raleigh, North Carolina, 2001.
[32] D. E. Finkel, DIRECT Optimization Algorithm User Guide. Center for Research in Scientific Computation, North Carolina State University, 2003.
[33] A. J. Smola, S. V. N. Vishwanathan, and T. Hofmann, "Kernel methods for missing variables," in AI Stats, 2005, pp. 325–332.
[34] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. Cambridge, Massachusetts: MIT Press, 2006.
[35] P. Tichavsky, C. Muravchik, and A. Nehorai, "Posterior Cramér-Rao bounds for discrete-time nonlinear filtering," IEEE Transactions on Signal Processing, vol. 46, no. 5, pp. 1386–1396, 1998.
[36] N. Bergman, A. Doucet, and N. J. Gordon, "Optimal estimation and Cramér-Rao bounds for partial non-Gaussian state space models," Annals of the Institute of Statistical Mathematics, vol. 52, no. 1, pp. 1–17, 2001.