An AO* Algorithm for Planning with Continuous Resources
Emmanuel Benazera
Ronen Brafman , Nicolas Meuleau
NASA Ames Research Center
Mail Stop 269-3
Moffett Field, CA 94035-1000
{ebenazer, brafman, nmeuleau}@email.arc.nasa.gov
Mausam
Eric A. Hansen
Abstract
We consider the problem of optimal planning in stochastic
domains with resource constraints, where resources are continuous and the choice of action at each step may depend on
the current resource level. Our principal contribution is the
HAO* algorithm, a generalization of the AO* algorithm that
performs search in a hybrid state space that is modeled using
both discrete and continuous state variables. The search algorithm leverages knowledge of the starting state to focus computational effort on the relevant parts of the state space. We
claim that this approach is especially effective when resource
limitations contribute to reachability constraints. Experimental results show its effectiveness in the domain that motivates our research: automated planning for planetary exploration rovers.
Introduction
Control of planetary exploration rovers presents several important challenges for research in automated planning. Because of difficulties inherent in communicating with devices
on other planets, remote rovers must operate autonomously
over substantial periods of time (Bresina et al. 2002). The
planetary surfaces on which they operate are very uncertain
environments: there is a great deal of uncertainty about the
duration, energy consumption, and outcome of a rover's actions. Currently, instructions sent to planetary rovers are in
the form of a simple plan for attaining a single goal (e.g.,
photographing some interesting rock). The rover attempts
to carry this out, and, when done, remains idle. If it fails
early on, it makes no attempt to recover and possibly achieve
an alternative goal. This may have a serious impact on missions. For example, it has been estimated that the 1997 Mars
Pathfinder rover spent between 40% and 75% of its time doing nothing because plans did not execute as expected. The
current MER rovers (aka Spirit and Opportunity) require an
average of 3 days to visit a single rock, but in future missions, multiple rock visits in a single communication cycle
will be possible (Pedersen et al. 2005). As a result, it is
expected that space scientists will request a large number of
form:
V_n^0(x) = 0,

V_n^{t+1}(x) = \max_{a \in A_n(x)} \sum_{n' \in N} \Pr(n' \mid n, x, a) \int_{x'} \Pr(x' \mid n, x, a, n') \left[ R_{n'}(x') + V_{n'}^t(x') \right] dx'    (1)
Solution Approach
Feng et al. describe a dynamic programming (DP) algorithm
that solves this Bellman optimality equation. In particular,
they show that the continuous integral over x0 can be computed exactly, as long as the transition function satisfies certain conditions. This algorithm is rather involved, so we will
treat it as a black-box in our algorithm. In fact, it can be
replaced by any other method for carrying out this computation. This also simplifies the description of our algorithm
in the next section and allows us to focus on our contribution. We do explain the ideas and the assumptions behind
the algorithm of Feng et al. in Section 3.
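To make the role of this black box concrete, here is a minimal Python sketch (our own illustration, not the authors' code) of a finite-horizon DP loop in which the hybrid backup of Eqn. 1 is delegated to an exchangeable routine; solve_finite_horizon, continuous_backup, and zero_fn are hypothetical stand-ins for the exact integration of Feng et al. (2004) or any equivalent method.

```python
def solve_finite_horizon(discrete_states, actions, horizon, continuous_backup, zero_fn):
    # V maps each discrete state n to a value function over the continuous variables x.
    V = {n: zero_fn for n in discrete_states}        # V_n^0(x) = 0 for all n
    for _ in range(horizon):
        # One synchronous backup: the black box returns the function
        #   x -> max_{a in A_n(x)} sum_{n'} Pr(n'|n,x,a) *
        #        integral over x' of Pr(x'|n,x,a,n') [R_{n'}(x') + V_{n'}(x')] dx'
        V = {n: continuous_backup(n, actions, V) for n in discrete_states}
    return V
```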
The difficulty we address in this paper is the potentially huge size of the state space, which makes DP infeasible. One reason for this size is the existence of continuous variables. But even if we only consider the discrete component of the state space, the size of the state
space is exponential in the number of propositional variables comprising the discrete component. To address this
issue, we use forward heuristic search in the form of a
novel variant of the AO* algorithm. Recall that AO* is
an algorithm for searching AND/OR graphs (Pearl 1984;
Hansen and Zilberstein 2001). Such graphs arise in problems where there are choices (the OR components), and each
choice can have multiple consequences (the AND component), as is the case in planning under uncertainty. AO* can
be very effective in solving such planning problems when
there is a large state space. One reason for this is that AO*
only considers states that are reachable from an initial state.
Another reason is that given an informative heuristic function, AO* focuses on states that are reachable in the course
of executing a good plan. As a result, AO* often finds an
optimal plan by exploring a small fraction of the entire state
space.
The challenge we face in applying AO* to this problem is
that of performing state-space search in a continuous state space. Our solution is to search in an aggregate
state space that is represented by a search graph in which
there is a node for each distinct value of the discrete component of the state. In other words, each node of our search
graph represents a region of the continuous state space in
which the discrete value is the same. In this approach, different actions may be optimal for different Markov states in
the aggregate state associated with a search node, especially
since the best action is likely to depend on how much energy or time is remaining. To address this problem and still
find an optimal solution, we associate a value estimate with
each of the Markov states in an aggregate. That is, we attach to each search node a value function (function of the
continuous variables) instead of the simple scalar value used
by standard AO*. Following the approach of (Feng et al.
2004), this value function can be represented and computed
efficiently due to the continuous nature of these states and
the simplifying assumptions made about the transition functions. Using these value estimates, we can associate different actions with different Markov states within the aggregate
state corresponding to a search node.
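As a rough illustration of what such a per-node value function can look like with a single continuous resource, the sketch below (our own code; the class and action names are invented) stores the value estimate as a piecewise-constant function of x and records the greedy action on each piece, so different Markov states (n, x) within the same search node can select different actions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Piece:
    lo: float           # interval [lo, hi) of resource values
    hi: float
    value: float        # estimated value on this interval
    best_action: str    # greedy action on this interval

@dataclass
class NodeValueFunction:
    pieces: List[Piece]

    def value(self, x: float) -> float:
        for p in self.pieces:
            if p.lo <= x < p.hi:
                return p.value
        return 0.0

    def action(self, x: float) -> str:
        for p in self.pieces:
            if p.lo <= x < p.hi:
                return p.best_action
        return "noop"

# Example: below 20 resource units only a nearby goal is worth pursuing.
vf = NodeValueFunction([Piece(0.0, 20.0, 5.0, "sample_near_rock"),
                        Piece(20.0, 100.0, 12.0, "drive_to_far_rock")])
assert vf.action(10.0) == "sample_near_rock" and vf.action(50.0) == "drive_to_far_rock"
```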
In order to select which node on the fringe of the search
graph to expand, we also need to associate a scalar value
with each search node. Thus, we maintain for a search node
both a heuristic estimate of the value function (which is used
to make action selections), and a heuristic estimate of the
priority which is used to decide which search node to expand
next. Details are given in the following section.
We note that LAO*, a generalization of AO*, allows for
policies that contain loops in order to specify behavior
over an infinite horizon (Hansen and Zilberstein 2001). We
could use similar ideas to extend LAO* to our setting. However, we need not consider loops for two reasons: (1) our
problems have a bounded horizon; (2) an optimal policy
will not contain any intentional loop because returning to
the same discrete state with fewer resources cannot buy us
anything. Our current implementation assumes any loop is
intentional and discards actions that create such a loop.
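A minimal sketch of the loop test implied here (our own helper, not from the paper): a transition from n to n' is discarded when n' is n itself or one of n's ancestors in the explicit graph, since revisiting a discrete state with fewer resources cannot improve the value.

```python
def creates_loop(n_prime, n) -> bool:
    # True if n_prime is n itself or an ancestor of n in the explicit graph.
    # Each node is assumed to carry a `parents` list (its parents in the explicit graph).
    stack, seen = [n], set()
    while stack:
        cur = stack.pop()
        if cur is n_prime:
            return True
        for parent in cur.parents:
            if id(parent) not in seen:
                seen.add(id(parent))
                stack.append(parent)
    return False
```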
Hybrid AO*
A simple way of understanding HAO* is as an AO* variant
where states with identical discrete component are expanded
in unison. HAO* works with two graphs:
- The explicit graph describes all the states that have been generated so far and the AND/OR edges that connect them. The nodes of the explicit graph are stored in two lists: OPEN and CLOSED.
- The greedy policy (or partial solution) graph, denoted GREEDY in the algorithms, is a sub-graph of the explicit graph describing the current optimal policy.
In standard AO*, a single action will be associated with each
node in the greedy graph. However, as described before,
multiple actions can be associated with each node, because
different actions may be optimal for different Markov states
represented by an aggregate state.
Data Structures
The main data structure represents a search node n. It contains:
- The value of the discrete state. In our application, these are the discrete state variables and the set of goals achieved.
- Pointers to its parents and children in the explicit and greedy policy graphs.
- Pn(·), a probability distribution on the continuous variables in node n. For each x ∈ X, Pn(x) is an estimate of the probability density of passing through state (n, x) under the current greedy policy. It is obtained by progressing the initial state forward through the optimal actions of the greedy policy. With each Pn, we maintain the probability of passing through n under the greedy policy:

M(P_n) = \int_{x \in X} P_n(x) \, dx .
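A compact Python sketch of this search-node record (field names are ours; V and g correspond to the value function and the expansion priority discussed earlier). Pn is kept piecewise constant, so its total mass M(Pn) reduces to a sum over its pieces.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SearchNode:
    discrete_state: frozenset                       # discrete variable values and goals achieved
    parents: List["SearchNode"] = field(default_factory=list)
    children: Dict[str, List["SearchNode"]] = field(default_factory=dict)   # action -> successors
    P: List[Tuple[float, float, float]] = field(default_factory=list)       # (lo, hi, density) pieces
    V: object = None                                # value function over x (e.g. NodeValueFunction)
    g: float = 0.0                                  # scalar priority used to order expansions

    def mass(self) -> float:
        # M(P_n): probability of passing through n under the greedy policy.
        return sum((hi - lo) * density for lo, hi, density in self.P)
```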
1:  Create the root node n0 associated with the initial state.
2:  Pn0 = initial distribution on resources.
3:  Vn0 = 0 everywhere in X.
4:  gn0 = 0.
5:  OPEN = GREEDY = {n0}.
6:  CLOSED = ∅.
7:  while OPEN ∩ GREEDY ≠ ∅ do
8:      n = arg max_{n' ∈ OPEN ∩ GREEDY} (gn').
9:      Move n from OPEN to CLOSED.
10:     for all (a, n') ∈ A × N not expanded yet in n and reachable under Pn do
11:         if n' ∉ OPEN ∪ CLOSED then
12:             Create the data structure to represent n' and add the transition (n, a, n') to the explicit graph.
13:             Get Hn'.
14:             Vn' = Hn' everywhere in X.
15:             if n' is terminal then
16:                 Add n' to CLOSED.
17:             else
18:                 Add n' to OPEN.
19:         else if n' is not an ancestor of n in the explicit graph then
20:             Add the transition (n, a, n') to the explicit graph.
21:     if some pair (a, n') was expanded at the previous step (line 10) then
22:         Update Vn for the expanded node n and some of its ancestors in the explicit graph, with Algorithm 2.
23:         Update Pn' and gn' using Algorithm 3 for the nodes n' that are children of the expanded node or of a node where the optimal decision changed at the previous step (line 22).
24:         Move every node n' ∈ CLOSED where P changed back into OPEN.
Algorithm 1: Hybrid AO*
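The test "reachable under Pn" in line 10 can be read as follows; the sketch below (our own formulation, with hypothetical helpers applicable, discrete_outcomes, and support) enumerates the (action, discrete successor) pairs that receive positive probability for some resource level in the support of Pn.

```python
def reachable_pairs(n, actions, applicable, discrete_outcomes, support):
    pairs = set()
    for region in support(n.P):                     # regions of x with positive mass under P_n
        for a in actions:
            if applicable(a, n.discrete_state, region):
                for n_prime, prob in discrete_outcomes(a, n.discrete_state):
                    if prob > 0.0:
                        pairs.add((a, n_prime))
    return pairs   # pairs already expanded in n are filtered out by the caller
```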
[Figure: (a) Initial GREEDY graph. Actions have multiple possible discrete effects (e.g., a0 has two possible effects in n0). The curves represent the current probability distribution P and value function V over x values for n3; n2 is a fringe node.]
Here X^old_n denotes the set of values x of the continuous variables such that the state (n, x) has already been expanded before (X^old_n = ∅ if n has never been expanded). The techniques used to represent the continuous probability distributions Pn and to compute the continuous integrals are discussed below.

[Algorithm 3: Updating the state distributions Pn. The update is restricted to x ∈ S(Pn) ∩ X^old_n, and gn is updated following Eqn. 3.]
Properties
Handling Continuous Variables
Computationally, the most challenging aspect of HAO* is
the handling of continuous state variables, and particularly
the computation of the continuous integral in Bellman backups and Eqns. 2 and 3. We approach this problem using the
ideas developed in (Feng et al. 2004) for the same application domain. However, we note that HAO* could also be
used with other models of uncertainty and continuous variables, as long as the value functions can be computed exactly
in finite time. The approach of (Feng et al. 2004) exploits
the structure in the continuous value functions of the type of
problems we are addressing. These value functions typically
appear as collections of humps and plateaus, each of which
corresponds to a region in the state space where similar goals
are pursued by the optimal policy (see Fig. 3). The sharpness of the hump or the edge of a plateau reflects uncertainty
of achieving these goals. Constraints imposing minimal resource levels before attempting risky actions introduce sharp
cuts in the regions. Such structure is exploited by grouping
states that belong to the same plateau, while reserving a fine
discretization for the regions of the state space where it is
the most useful (such as the edges of plateaus).
To adapt the approach of (Feng et al. 2004), we make
some assumptions that imply that our value functions can
be represented as piece-wise constant or linear. Specifically,
we assume that the continuous state space induced by every
discrete state can be divided into hyper-rectangles in each
of which the following holds: (i) The same actions are applicable. (ii) The reward function is piece-wise constant or
linear. (iii) The distribution of discrete effects of each action is identical. (iv) The set of arrival values or value variations for the continuous variables is discrete and constant. Assumptions (i-iii) follow from the hypotheses made in our domain models. Assumption (iv) comes down to discretizing the actions' resource consumptions, which is an approximation. It contrasts with the naive approach that consists of discretizing the state space regardless of the relevance of the partition introduced. Instead, we discretize the action outcomes first, and then deduce a partition of the state space from it. The state-space partition is kept as coarse as possible, so that only the relevant distinctions between (continuous) states are taken into account. Given the above conditions, it can be shown (see (Feng et al. 2004)) that for any finite horizon and any discrete state, there exists a partition of the continuous space into hyper-rectangles over which the optimal value function is piece-wise constant or linear. The implementation represents the value functions as kd-trees, using a fast algorithm to intersect kd-trees (Friedman et al. 1977), and merging adjacent pieces of the value function based on their value. We augmented this approach by representing the continuous state distributions Pn as piecewise-constant functions of the continuous variables. Under the set of hypotheses above, if the initial probability distribution on the continuous variables is piecewise constant, then the probability distribution after any finite number of actions is too, and Eqn. 2 may always be computed in finite time. (A deterministic starting state x0 is represented by a uniform distribution with a very small rectangular support centered on x0.)
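The closure property used here can be illustrated with a short sketch (ours, for a single resource; the Piece encoding and the progress function are our own): pushing a piecewise-constant density through an action whose consumption takes finitely many discretized values yields a probability-weighted mixture of shifted copies, which is again piecewise constant.

```python
from typing import List, Tuple

Piece = Tuple[float, float, float]    # (lo, hi, density) with constant density on [lo, hi)

def progress(pieces: List[Piece], consumptions: List[Tuple[float, float]]) -> List[Piece]:
    """Progress a piecewise-constant density through one action outcome.

    consumptions: list of (delta, probability) pairs, i.e. the discretized
    resource consumptions of the action."""
    out: List[Piece] = []
    for lo, hi, density in pieces:
        for delta, prob in consumptions:
            new_lo, new_hi = max(0.0, lo - delta), max(0.0, hi - delta)
            if new_hi > new_lo:
                out.append((new_lo, new_hi, density * prob))
    return out   # overlapping pieces can be merged afterwards into a canonical partition

# Example: uniform mass on [20, 30), consumption of 5 or 10 units with equal probability.
print(progress([(20.0, 30.0, 0.1)], [(5.0, 0.5), (10.0, 0.5)]))
```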
Heuristic Functions
The heuristic function Hn helps focus the search on truly
useful reachable states. It is essential for tackling real-size
problems. Our heuristic function is obtained by solving a
relaxed problem. The relaxation is very simple: we assume
deterministic transitions for the continuous variables, i.e.,
Pr(x' | n, x, a, n') ∈ {0, 1}. If we assume the action consumes the minimum amount of each resource, we obtain an
admissible heuristic function. A non-admissible, but probably more informative heuristic function is obtained by using
the mean resource consumption.
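A one-function sketch of this relaxation (our own code): each action's stochastic consumption is collapsed to a single deterministic value, the minimum for an admissible (optimistic) heuristic or the mean for a non-admissible but usually more informative one.

```python
def deterministic_consumption(consumptions, mode="min"):
    """consumptions: list of (delta, probability) pairs for one action; returns one delta."""
    if mode == "min":
        return min(delta for delta, _ in consumptions)   # optimistic: keeps the heuristic admissible
    if mode == "mean":
        return sum(delta * prob for delta, prob in consumptions)
    raise ValueError("mode must be 'min' or 'mean'")

# Example: an action consuming 5 units with probability 0.8 and 15 units with probability 0.2.
assert deterministic_consumption([(5.0, 0.8), (15.0, 0.2)], "min") == 5.0
assert deterministic_consumption([(5.0, 0.8), (15.0, 0.2)], "mean") == 7.0
```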
The central idea is to use the same algorithm to solve both
the relaxed and the original problem. Unlike classical approaches where a relaxed plan is generated for every search
state, we generate a relaxed search-graph using our HAO*
algorithm once with a deterministic-consumption model and
a trivial heuristic. The value function Vn of a node in the relaxed graph represents the heuristic function Hn of the associated node in the original problem graph. Solving the
relaxed problem with HAO* is considerably easier, because
the structure and the updates of the value functions Vn and
of the probabilities Pn are much simpler than in the original
domain. However, we run into the following problem: deterministic consumption implies that the number of reachable
states for any given initial state is very small (because only
one continuous assignment is possible). This means that in a
single expansion, we obtain information about a small number of states. To address this problem, instead of starting
with the initial resource values, we assume a uniform distribution over the possible range of resource values. Because
it is relatively easy to work with a uniform distribution, the
computation is simple relative to the real problem, but we
obtain an estimate for many more states. It is still likely that
we reach states for which no heuristic estimate was obtained
using these initial values. In that case, we simply recompute
starting with this initial state.
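The resulting lookup can be pictured as follows (a sketch under our own naming; relaxed_graph maps discrete states to relaxed-problem nodes, covers and solve_relaxed_from are hypothetical helpers): the heuristic for an original-problem state (n, x) is read from the relaxed graph, and recomputed from (n, x) when that state was never reached with the uniform initial distribution.

```python
def heuristic(n, x, relaxed_graph, solve_relaxed_from):
    node = relaxed_graph.get(n.discrete_state)
    if node is not None and node.V.covers(x):        # an estimate exists for this resource level
        return node.V.value(x)
    # Otherwise re-solve the relaxed problem starting from (n, x) and cache the result.
    relaxed_graph[n.discrete_state] = solve_relaxed_from(n.discrete_state, x)
    return relaxed_graph[n.discrete_state].V.value(x)
```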
[Figure 2: The rover problem used in the experiments: locations L1-L4 connected by paths whose resource consumptions lie in intervals such as [20,30] and [15,18]; target rocks T1(5), T2(10), T3(10), T4(15), and T5(15); and targets lost or re-acquired along particular paths (e.g. "Lose T4", "Lose T2, T5", "Reacquire T4", "Lose T1").]
Experimental Evaluation
We tested our algorithm on a slightly simplified variant of
the rover model used for NASA Ames October 2004 Intelligent Systems demo (Pedersen et al. 2005). In this domain,
a planetary rover moves in a planar graph made of locations
and paths, sets up instruments at different rocks, and performs experiments on the rocks. Actions may fail, and their
energy and time consumption are uncertain. Resource consumptions are drawn from two types of distributions: uniform
and normal, and then discretized. The problem instance used
in our preliminary experiments is illustrated in figure 2. It
contains 5 target rocks (T1 to T5) to be tested. To take a
picture of a target rock, this target must be tracked. To track
a target, we must register it before doing the first move.
Later, different targets can be lost and re-acquired when navigating along different paths. These changes are modeled as
action effects in the discrete state. Overall, the problem contains 43 propositional state variables and 37 actions. Therefore, there are 2^48 different discrete states, which is far beyond the reach of a flat DP algorithm.
The results presented here were obtained using a preliminary implementation of the piecewise constant DP approximations described in (Feng et al. 2004) based on a flat
representation of state partitions instead of kd-trees. This
is considerably slower than an optimal implementation. To
compensate, our domain features a single abstract continuous resource, while the original domain contains two resources (time and energy). Another difference in our implementation is in the number of nodes expanded at each iteration. We adopt the finding of (Hansen and Zilberstein 2001) that overall convergence speeds up if all the nodes in OPEN are expanded at once, instead of prioritizing them based on gn values and changing the value functions after each expansion. (In this implementation, we do not have to maintain exact probability distributions Pn; we just need to keep track of the supports of these distributions, which can be approximated by lower and upper bounds on each continuous variable.)
Table 1: Performance as a function of the initial resource level (column A).

A     B        C      D      E      F   G  H
30    0.1      39     39     38     9   1  239
40    0.4      176    163    159    9   1  1378
50    1.8      475    456    442    12  1  4855
60    7.6      930    909    860    32  2  12888
70    13.4     1548   1399   1263   22  2  25205
80    32.4     2293   2148   2004   33  2  42853
90    87.3     3127   3020   2840   32  2  65252
100   119.4    4673   4139   3737   17  2  102689
110   151.0    6594   5983   5446   69  3  155733
120   213.3    12564  11284  9237   39  3  268962
130   423.2    19470  17684  14341  41  3  445107
140   843.1    28828  27946  24227  22  3  17113
150   1318.9   36504  36001  32997  22  3  1055056
[Figure: Expected utility as a function of the initial resource level.]
Table 2: Execution time and search effort at an initial resource level of 130, as a function of the parameter in the second column.

Initial resource            Execution time   # nodes created by AO*   # nodes expanded by AO*
130               0.00      426.8            17684                    14341
130               0.50      371.9            17570                    14018
130               1.00      331.9            17486                    13786
130               1.50      328.4            17462                    13740
130               2.00      330.0            17462                    13740
130               2.50      320.0            17417                    13684
130               3.00      322.1            17417                    13684
130               3.50      318.3            17404                    13668
130               4.00      319.3            17404                    13668
130               4.50      319.3            17404                    13668
130               5.00      318.5            17404                    13668
130               5.50      320.4            17404                    13668
130               6.00      315.5            17356                    13628
Conclusions
We presented a variant of the AO* algorithm that, to the best
of our knowledge, is the first algorithm to deal with: limited continuous resources, uncertainty, and oversubscription
planning. We developed a sophisticated reachability analysis involving continuous variables that could be useful for
heuristic search algorithms at large. Our preliminary implementation of this algorithm shows very promising results on
a domain of practical importance. We are able to handle
problems with 2^48 discrete states, as well as a continuous
component.
In the near future, we hope to report on a more mature version of the algorithm, which we are currently implementing.
It includes: (1) a full implementation of the techniques described in (Feng et al. 2004); (2) a rover model with two
continuous variables; (3) a more informed heuristic function, as discussed in Section 3.
Acknowledgements
This work was funded by the NASA Intelligent Systems
program. Eric Hansen was supported in part by NSF grant
IIS-9984952, NASA grant NAG-2-1463 and a NASA Summer Faculty Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are
those of the authors and do not reflect the views of the NSF
or NASA. This work was performed during Mausam's and Eric Hansen's visits to NASA Ames Research Center.
References
E. Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC, 1999.
J. Bresina, R. Dearden, N. Meuleau, S. Ramakrishnan,
D. Smith, and R. Washington. Planning under continuous
time and resource uncertainty: A challenge for AI. In Proceedings of the Eighteenth Conference on Uncertainty in
Artificial Intelligence, pages 77-84, 2002.
Z. Feng, R. Dearden, N. Meuleau, and R. Washington. Dynamic programming for structured continuous Markov decision problems. In Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence, 2004.