
An AO* Algorithm for Planning with Continuous Resources

Emmanuel Benazera, Ronen Brafman, Nicolas Meuleau
NASA Ames Research Center
Mail Stop 269-3
Moffett Field, CA 94035-1000
{ebenazer, brafman, nmeuleau}@email.arc.nasa.gov

Mausam
Dept. of Computer Science and Engineering
University of Washington
Seattle, WA 98195-2350
[email protected]

Eric A. Hansen
Dept. of Computer Science and Engineering
Mississippi State University
Mississippi State, MS 39762
[email protected]

Abstract
We consider the problem of optimal planning in stochastic
domains with resource constraints, where resources are continuous and the choice of action at each step may depend on
the current resource level. Our principal contribution is the
HAO* algorithm, a generalization of the AO* algorithm that
performs search in a hybrid state space that is modeled using
both discrete and continuous state variables. The search algorithm leverages knowledge of the starting state to focus computational effort on the relevant parts of the state space. We
claim that this approach is especially effective when resource
limitations contribute to reachability constraints. Experimental results show its effectiveness in the domain that motivates
our research: automated planning for planetary exploration
rovers.

Introduction
Control of planetary exploration rovers presents several important challenges for research in automated planning. Because of difficulties inherent in communicating with devices
on other planets, remote rovers must operate autonomously
over substantial periods of time (Bresina et al. 2002). The
planetary surfaces on which they operate are very uncertain
environments: there is a great deal of uncertainty about the
duration, energy consumption, and outcome of a rover's actions. Currently, instructions sent to planetary rovers are in
the form of a simple plan for attaining a single goal (e.g.,
photographing some interesting rock). The rover attempts
to carry this out, and, when done, remains idle. If it fails
early on, it makes no attempt to recover and possibly achieve
an alternative goal. This may have a serious impact on missions. For example, it has been estimated that the 1997 Mars
Pathfinder rover spent between 40% and 75% of its time doing nothing because plans did not execute as expected. The
current MER rovers (aka Spirit and Opportunity) require an
average of 3 days to visit a single rock, but in future missions, multiple rock visits in a single communication cycle
will be possible (Pedersen et al. 2005). As a result, it is
expected that space scientists will request a large number of
potential tasks for future rovers to perform, more than may
be feasible, presenting an oversubscribed planning problem.

Research Institute for Advanced Computer Science.
QSS Group Inc.
Copyright © 2005, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

Working in this application domain, our goal is to provide
a planning algorithm that can generate reliable contingent
plans that respond to different events and action outcomes.
Such plans must optimize the expected value of the experiments conducted by the rover, while being aware of its time,
energy, and memory constraints. In particular, we must pay
attention to the fact that given any initial state, there are multiple locations the rover could reach, and many experiments
the rover could conduct, most combinations of which are infeasible due to resource constraints. To address this problem
we need a faithful model of the rover's domain, and an algorithm that can generate optimal or near-optimal plans for
such domains. General features of our problem include: (1)
a concrete starting state; (2) continuous resources (including
time) with stochastic consumption; (3) uncertain action effects; (4) several possible one-time-rewards, only a subset of
which are achievable in a single run. This type of problem is
of general interest, and includes a large class of (stochastic)
logistics problems, among others.
Past work has dealt with some features of this problem.
Related work on MDPs with resource constraints includes
the model of constrained MDPs developed in the OR community (Altman 1999). A constrained MDP is solved by
a linear program that includes constraints on resource consumption, and finds the best feasible policy, given an initial
state and resource allocation. A drawback of the constrained
MDP model is that it does not include resources in the state
space, and thus, a policy cannot be conditioned on resource
availability. Moreover, it does not model stochastic resource
consumption. In the area of decision-theoretic planning,
several techniques have been proposed to handle uncertain
continuous variables (e.g., Feng et al. 2004; Younes and
Simmons 2004; Guestrin et al. 2004). Smith (2004) and
van den Briel et al. (2004) consider the problem of oversubscription planning, i.e., planning with a large set of goals
that cannot all be achieved. They provide techniques
for selecting a subset of goals for which to plan, but they
deal only with deterministic domains. Finally, Meuleau et
al. (2004) present preliminary experiments towards scaling up
decision-theoretic approaches to planetary rover problems.
Our contribution in this paper is an implemented algorithm, Hybrid AO* (HAO*), that handles all of these problems together: oversubscription planning, uncertainty, and
limited continuous resources. Of these, the most essential
features of our algorithm are its ability to handle hybrid
state-spaces and to utilize the fact that many states are unreachable due to resource constraints.
In our approach, resources are included in the state description. This allows decisions to be made based on resource availability, and it allows a stochastic resource consumption model (as opposed to constrained MDPs). Although this increases the size of the state space, we assume
that the value functions may be represented compactly. We
use the work of Feng et al. (2004) on piecewise constant and
linear approximations of dynamic programming (DP) in our
implementation. However, standard DP does not exploit the
fact that the reachable state space is much smaller than the
complete state space, especially in the presence of resource
constraints. Our contribution is to show how to use the
forward heuristic search algorithm called AO* (Pearl 1984;
Hansen and Zilberstein 2001) to solve MDPs with resource
constraints and continuous resource variables. Unlike DP,
forward search keeps track of the trajectory from the start
state to each reachable state, and thus it can check whether
the trajectory is feasible or violates a resource constraint.
This allows heuristic search to prune infeasible trajectories
and can dramatically reduce the number of states that must
be considered to find an optimal policy. This is particularly
important in our domain where the discrete state space is
huge (exponential in the number of goals), yet the portion
reachable from any initial state is relatively small because
of the resource constraints. It is well-known that heuristic
search can be more efficient than DP because it leverages a
search heuristic and reachability constraints to focus computation on the relevant parts of the state space. We show
that for problems with resource constraints, this advantage
can be even greater than usual because resource constraints
further limit reachability.
The paper is structured as follows: In Section 2 we describe the basic action and goal model. In Section 3 we
explain our planning algorithm, HAO*. Initial experimental results are described in Section 4, and we conclude in
Section 5.

Problem Definition and Solution Approach


Problem Formulation
We consider a Markov decision process (MDP) with both
continuous and discrete state variables (also called a hybrid MDP (Guestrin et al. 2004) or Generalized State
MDP (Younes and Simmons 2004)). Each state corresponds
to an assignment to a set of state variables. These variables
may be discrete or continuous. Continuous variables typically represent resources, where one possible type of resource is time. Discrete variables model other aspects of the
state, including (in our application) the set of goals achieved
so far by the rover. (Keeping track of already-achieved
goals ensures a Markovian reward structure, since we reward achievement of a goal only if it was not achieved in
the past.) Although our models typically contain multiple
discrete variables, this plays no role in the description of our
algorithm, and so, for notational convenience, we model the
discrete component as a single variable $n$.
A Markov state $s \in S$ is a pair $(n, x)$ where $n \in N$ is
the discrete variable, and $x = (x_i)$ is a vector of continuous
variables. The domain of each $x_i$ is an interval $X_i$ of the real
line, and $X = \bigotimes_i X_i$ is the hypercube over which the continuous variables are defined. We assume an explicit initial
state, denoted $(n_0, x_0)$, and one or more absorbing terminal states. One terminal state corresponds to the situation in
which all goals have been achieved. Others model situations
in which resources have been exhausted or an action has resulted in some error condition that requires executing a safe
sequence by the rover and terminating plan execution.
Actions can have executability constraints. For example,
an action cannot be executed in a state that does not have its
minimum resource requirements. $A_n(x)$ denotes the set of
actions executable in state $(n, x)$.
State transition probabilities are given by the function
$\Pr(s' \mid s, a)$, where $s = (n, x)$ denotes the state before
action $a$ and $s' = (n', x')$ denotes the state after action $a$,
also called the arrival state. Following (Feng et al. 2004),
the probabilities are decomposed into:
- the discrete marginals $\Pr(n' \mid n, x, a)$. For all $(n, x, a)$, $\sum_{n' \in N} \Pr(n' \mid n, x, a) = 1$;
- the continuous conditionals $\Pr(x' \mid n, x, a, n')$. For all $(n, x, a, n')$, $\int_{x' \in X} \Pr(x' \mid n, x, a, n')\,dx' = 1$.
Any transition that results in negative value for some continuous variable is viewed as a transition into a terminal state.
The reward of a transition is a function of the arrival
state only. More complex dependencies are possible, but
this is sufficient for our goal-based domain models. We let
$R_n(x) \ge 0$ denote the reward associated with a transition to
state $(n, x)$.
In our application domain, continuous variables model
non-replenishable resources. This translates into the general
assumption that the value of the continuous variables is nonincreasing. Moreover, we assume that each action has some
minimum positive consumption of at least one resource. We
do not utilize this assumption directly. However, it has two
implications upon which the correctness of our approach depends: (1) the values of the continuous variables are a-priori
bounded, and (2) the number of possible steps in any execution of a plan is bounded, which we refer to by saying the
problem has a bounded horizon. Note that the actual number of steps until termination can vary depending on actual
resource consumption.
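As an illustration of the model above, the two-stage transition (discrete marginal first, then continuous conditional) can be sketched in Python. This is only a minimal sketch under stated assumptions: the names `HybridState`, `sample_transition`, and `TERMINAL`, the toy probabilities, and the choice to clamp exhausted resources to zero are all hypothetical and do not come from the paper.

```python
import random
from dataclasses import dataclass

TERMINAL = "terminal"  # hypothetical label for an absorbing terminal state


@dataclass(frozen=True)
class HybridState:
    n: str      # discrete component (e.g., location and goals achieved)
    x: tuple    # continuous resource levels (e.g., energy, time)


def sample_transition(state, marginal, conditional, rng):
    """Sample s' in two stages: n' ~ Pr(n'|n,x,a), then x' ~ Pr(x'|n,x,a,n').

    `marginal` maps each n' to its probability (must sum to 1) and
    `conditional(n_next, rng)` returns a sampled x' for that n'.
    """
    assert abs(sum(marginal.values()) - 1.0) < 1e-9
    names = list(marginal)
    n_next = rng.choices(names, weights=[marginal[k] for k in names])[0]
    x_next = conditional(n_next, rng)
    # A negative resource level is treated as a transition to a terminal state.
    # Clamping the level to 0 is an assumption made for this sketch.
    if any(v < 0 for v in x_next):
        return HybridState(TERMINAL, tuple(max(v, 0.0) for v in x_next))
    return HybridState(n_next, x_next)


# Toy model (hypothetical numbers): one action taken with 5 units of energy.
s0 = HybridState("start", (5.0,))
marginal = {"at_rock": 0.8, "stuck": 0.2}


def conditional(n_next, rng):
    used = rng.uniform(2.0, 4.0) if n_next == "at_rock" else rng.uniform(4.0, 7.0)
    return (s0.x[0] - used,)


rng = random.Random(0)
samples = [sample_transition(s0, marginal, conditional, rng) for _ in range(1000)]
```

In this toy model the resource level never increases, which mirrors the non-replenishable resource assumption made in the text.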
Given an initial state $(n_0, x_0)$, the objective is to find a
policy that maximizes expected cumulative reward.$^1$ In our
application, this is equal to the sum of the rewards for the
goals achieved before running out of a resource. Note that
there is no direct incentive to save resources: an optimal
solution would save resources only if this allows achieving more goals. Therefore, we stay in a standard decision-theoretic framework. This problem is solved by solving
Bellman's optimality equation, which takes the following
form:

$$V_n^0(x) = 0,$$
$$V_n^{t+1}(x) = \max_{a \in A_n(x)} \left[ \sum_{n' \in N} \Pr(n' \mid n, x, a) \int_{x'} \Pr(x' \mid n, x, a, n') \left( R_{n'}(x') + V_{n'}^{t}(x') \right) dx' \right] \quad (1)$$

$^1$ Our algorithm can easily be extended to deal with an uncertain
starting state, as long as its probability distribution is known.

Note that the index $t$ represents the iteration or time-step of
DP, and does not necessarily correspond to time in the planning problem. The duration of actions is one of the biggest
sources of uncertainty in our rover problems, and we typically model time as one of the continuous resources $x_i$.
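To make the backup in Bellman's optimality equation concrete, here is a minimal Python sketch of one iteration on a single resource discretized to integer levels. The integer grid is a simplification introduced for this example only (the paper relies on an exact piecewise representation instead), and `bellman_backup`, the outcome tuple layout, and the toy numbers are hypothetical.

```python
def bellman_backup(V, actions, levels):
    """Apply one backup of the optimality equation on an integer resource grid.

    V[n][x] is the value of discrete state n at resource level x (iteration t).
    actions[n] is a list of actions; each action is a list of outcomes
    (prob, n_next, cost, reward), i.e. a joint distribution over the arrival
    discrete state and the resource consumption.
    Returns the value function at iteration t + 1.
    """
    V_new = {}
    for n in V:
        V_new[n] = []
        for x in levels:
            best = 0.0  # assumption: value 0 when no action is executable
            for outcomes in actions.get(n, []):
                # Executability constraint: the state must hold at least the
                # minimum consumption of the action.
                if x < min(cost for _, _, cost, _ in outcomes):
                    continue
                q = 0.0
                for prob, n_next, cost, reward in outcomes:
                    x_next = x - cost
                    # A negative level is a terminal transition; it is assumed
                    # to contribute zero reward here.
                    if x_next >= 0:
                        q += prob * (reward + V[n_next][x_next])
                best = max(best, q)
            V_new[n].append(best)
    return V_new


# Toy problem: one action from "start" that reaches "done" for reward 10,
# consuming 2 or 4 resource units with equal probability.
levels = range(6)
actions = {"start": [[(0.5, "done", 2, 10.0), (0.5, "done", 4, 10.0)]]}
V0 = {"start": [0.0] * 6, "done": [0.0] * 6}
V1 = bellman_backup(V0, actions, levels)
```

With 2 or 3 units available, only the cheaper outcome succeeds, so the backed-up value is half the reward; with 4 or more units both outcomes succeed.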

Solution Approach
Feng et al. describe a dynamic programming (DP) algorithm
that solves this Bellman optimality equation. In particular,
they show that the continuous integral over $x'$ can be computed exactly, as long as the transition function satisfies certain conditions. This algorithm is rather involved, so we will
treat it as a black-box in our algorithm. In fact, it can be
replaced by any other method for carrying out this computation. This also simplifies the description of our algorithm
in the next section and allows us to focus on our contribution. We do explain the ideas and the assumptions behind
the algorithm of Feng et al. in Section 3.
The difficulty we address in this paper is the potentially huge size of the state space, which makes DP infeasible. One reason for this size is the existence of continuous variables. But even if we only consider the discrete component of the state space, the size of the state
space is exponential in the number of propositional variables comprising the discrete component. To address this
issue, we use forward heuristic search in the form of a
novel variant of the AO* algorithm. Recall that AO* is
an algorithm for searching AND/OR graphs (Pearl 1984;
Hansen and Zilberstein 2001). Such graphs arise in problems where there are choices (the OR components), and each
choice can have multiple consequences (the AND component), as is the case in planning under uncertainty. AO* can
be very effective in solving such planning problems when
there is a large state space. One reason for this is that AO*
only considers states that are reachable from an initial state.
Another reason is that given an informative heuristic function, AO* focuses on states that are reachable in the course
of executing a good plan. As a result, AO* often finds an
optimal plan by exploring a small fraction of the entire state
space.
The challenge we face in applying AO* to this problem is
the challenge of performing state-space search in a continuous state space. Our solution is to search in an aggregate
state space that is represented by a search graph in which
there is a node for each distinct value of the discrete component of the state. In other words, each node of our search
graph represents a region of the continuous state space in
which the discrete value is the same. In this approach, different actions may be optimal for different Markov states in
the aggregate state associated with a search node, especially
since the best action is likely to depend on how much energy or time is remaining. To address this problem and still
find an optimal solution, we associate a value estimate with
each of the Markov states in an aggregate. That is, we attach to each search node a value function (function of the
continuous variables) instead of the simple scalar value used
by standard AO*. Following the approach of (Feng et al.
2004), this value function can be represented and computed
efficiently due to the continuous nature of these states and
the simplifying assumptions made about the transition functions. Using these value estimates, we can associate different actions with different Markov states within the aggregate
state corresponding to a search node.
In order to select which node on the fringe of the search
graph to expand, we also need to associate a scalar value
with each search node. Thus, we maintain for a search node
both a heuristic estimate of the value function (which is used
to make action selections), and a heuristic estimate of the
priority which is used to decide which search node to expand
next. Details are given in the following section.
We note that LAO*, a generalization of AO*, allows for
policies that contain loops in order to specify behavior
over an infinite horizon (Hansen and Zilberstein 2001). We
could use similar ideas to extend LAO* to our setting. However, we need not consider loops for two reasons: (1) our
problems have a bounded horizon; (2) an optimal policy
will not contain any intentional loop because returning to
the same discrete state with fewer resources cannot buy us
anything. Our current implementation assumes any loop is
intentional and discards actions that create such a loop.

Hybrid AO*
A simple way of understanding HAO* is as an AO* variant
where states with identical discrete component are expanded
in unison. HAO* works with two graphs:
- The explicit graph describes all the states that have been generated so far and the AND/OR edges that connect them. The nodes of the explicit graph are stored in two lists: OPEN and CLOSED.
- The greedy policy (or partial solution) graph, denoted GREEDY in the algorithms, is a sub-graph of the explicit graph describing the current optimal policy.
In standard AO*, a single action will be associated with each
node in the greedy graph. However, as described before,
multiple actions can be associated with each node, because
different actions may be optimal for different Markov states
represented by an aggregate state.

Data Structures
The main data structure represents a search node $n$. It contains:
- The value of the discrete state. In our application these are the discrete state variables and the set of goals achieved.
- Pointers to its parents and children in the explicit and greedy policy graphs.
- $P_n(\cdot)$: a probability distribution on the continuous variables in node $n$. For each $x \in X$, $P_n(x)$ is an estimate of the probability density of passing through state $(n, x)$ under the current greedy policy. It is obtained by progressing the initial state forward through the optimal actions of the greedy policy. With each $P_n$, we maintain the probability of passing through $n$ under the greedy policy:
  $$M(P_n) = \int_{x \in X} P_n(x)\,dx .$$
- $H_n(\cdot)$: the heuristic function. For each $x \in X$, $H_n(x)$ is a heuristic estimate of the optimal expected reward from state $(n, x)$.
- $V_n(\cdot)$: the value function. At the leaf nodes of the explicit graph, $V_n = H_n$. At the non-leaf nodes of the explicit graph, $V_n$ is obtained by backing up the $H$ functions from the descendant leaves. If the heuristic function $H_{n'}$ is admissible in all leaf nodes $n'$, then $V_n(x)$ is an upper bound on the optimal reward to come from $(n, x)$ for all $x$ reachable under the greedy policy.
- $g_n$: a heuristic estimate of the increase in value of the greedy policy that we would get by expanding node $n$. If $H_n$ is admissible then $g_n$ represents an upper bound on the gain in expected reward. The gain $g_n$ is used to determine the priority of nodes in the OPEN list ($g_n = 0$ if $n$ is in CLOSED), and to bound the error of the greedy solution at each iteration of the algorithm.
Note that some of this information is redundant. Nevertheless, it is convenient to maintain all of it so that the
algorithm can easily access it. HAO* uses the customary
OPEN and CLOSED lists maintained by AO*. They encode
the explicit graph and the current greedy policy. CLOSED
contains expanded nodes, and OPEN contains unexpanded
nodes and nodes that need to be re-expanded.
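The search-node record described above might be laid out as follows. This is a hedged sketch rather than the authors' implementation: it assumes a single continuous variable and piecewise-constant functions stored as lists of intervals, and the names `Piece`, `SearchNode`, `integral`, and `mass` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Piece:
    lo: float      # lower bound of the interval
    hi: float      # upper bound of the interval
    value: float   # constant value of the function on [lo, hi)


def integral(pieces):
    """Integrate a piecewise-constant function over its pieces."""
    return sum((p.hi - p.lo) * p.value for p in pieces)


@dataclass
class SearchNode:
    discrete_state: frozenset                      # e.g., goals achieved so far
    P: list = field(default_factory=list)          # density over resource levels
    H: list = field(default_factory=list)          # heuristic function pieces
    V: list = field(default_factory=list)          # value function pieces
    g: float = 0.0                                 # expansion priority (gain)
    parents: list = field(default_factory=list)    # parents in the explicit graph
    children: dict = field(default_factory=dict)   # action -> list of child nodes

    def mass(self):
        """M(P_n): probability of passing through this node under the greedy policy."""
        return integral(self.P)


# Hypothetical node reached with probability 0.7 overall.
node = SearchNode(
    discrete_state=frozenset({"rock1_photographed"}),
    P=[Piece(0.0, 2.0, 0.25), Piece(2.0, 4.0, 0.1)],
)
```

Keeping the mass alongside the density is redundant, as the text notes, but it makes the reachability test for a node a single comparison.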

1: Create the root node $n_0$ which represents the initial state.
2: $P_{n_0}$ = initial distribution on resources.
3: $V_{n_0} = 0$ everywhere in $X$.
4: $g_{n_0} = 0$.
5: OPEN = GREEDY = $\{n_0\}$.
6: CLOSED = $\emptyset$.
7: while OPEN $\cap$ GREEDY $\neq \emptyset$ do
8:   $n = \arg\max_{n' \in \text{OPEN} \cap \text{GREEDY}} (g_{n'})$.
9:   Move $n$ from OPEN to CLOSED.
10:  for all $(a, n') \in A \times N$ not expanded yet in $n$ and reachable under $P_n$ do
11:    if $n' \notin$ OPEN $\cup$ CLOSED then
12:      Create the data structure to represent $n'$ and add the transition $(n, a, n')$ to the explicit graph.
13:      Get $H_{n'}$.
14:      $V_{n'} = H_{n'}$ everywhere in $X$.
15:      if $n'$ is terminal then
16:        Add $n'$ to CLOSED.
17:      else
18:        Add $n'$ to OPEN.
19:    else if $n'$ is not an ancestor of $n$ in the explicit graph then
20:      Add the transition $(n, a, n')$ to the explicit graph.
21:  if some pair $(a, n')$ was expanded at previous step (10) then
22:    Update $V_n$ for the expanded node $n$ and some of its ancestors in the explicit graph, with Algorithm 2.
23:    Update $P_{n'}$ and $g_{n'}$ using Algorithm 3 for the nodes $n'$ that are children of the expanded node or of a node where the optimal decision changed at the previous step (22). Move every node $n' \in$ CLOSED where $P$ changed back into OPEN.
Algorithm 1: Hybrid AO*
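The loop condition and node selection of Algorithm 1 (lines 7 to 9) can be sketched in a few lines of Python. The function name `select_node_to_expand` and the representation of OPEN and GREEDY as Python sets of node identifiers are assumptions made for this illustration only.

```python
def select_node_to_expand(open_nodes, greedy_nodes, g):
    """Return the node of OPEN ∩ GREEDY with the highest priority g, or None.

    A None result corresponds to the termination test of the main loop:
    no open node remains in the greedy policy graph.
    """
    candidates = open_nodes & greedy_nodes
    if not candidates:
        return None
    return max(candidates, key=lambda n: g[n])


# Hypothetical snapshot of the search: n7 has a large gain but is not part
# of the greedy policy graph, so it is not selected.
open_nodes = {"n2", "n4", "n7"}
greedy_nodes = {"n0", "n2", "n4"}
g = {"n0": 0.0, "n2": 0.4, "n4": 1.3, "n7": 9.0}
chosen = select_node_to_expand(open_nodes, greedy_nodes, g)
```

Restricting the choice to the greedy graph is what keeps the search focused on states that the current best policy can actually reach.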

The HAO* Algorithm


Algorithm 1 presents the main procedure. The crucial steps
are described in detail below.
Expanding a node (lines 10 to 20): At each iteration,
HAO* expands the open node $n$ with the highest priority $g_n$
in the greedy graph. An important distinction between AO*
and HAO* is that in the latter, nodes are often only partially
expanded (i.e., not all Markov states associated with a discrete node are considered). Thus, nodes in the CLOSED
list are sometimes put back in OPEN (line 23). The reason
for this is that a Markov state associated with this node, that
was previously considered unreachable, may now be reachable. Technically, what happens is that as a result of finding a new path to a node, the probability distribution over
it is updated (line 23), possibly increasing the probability of
some Markov state from 0 to some positive value. This process is illustrated in Figure 1. Thus, while standard AO* expands only tip nodes, HAO* sometimes expands nodes that
were moved from CLOSED to OPEN and are in the middle
of the greedy policy subgraph.
Next, HAO* considers all possible successors $(a, n')$ of
$n$ given the state distribution $P_n$. Typically, when $n$ is expanded for the first time, we enumerate all actions $a$ possible
in $(n, x)$ ($a \in A_n(x)$) for some reachable $x$ ($P_n(x) > 0$),
and all arrival states $n'$ that can result from such a transition ($\Pr(n' \mid n, x, a) > 0$).$^2$ If $n$ was previously expanded
(i.e., it has been put back in OPEN), only actions and arrival
nodes not yet expanded are considered. In line 11, we check
whether a node has already been generated. This is not necessary if the graph is a tree (i.e., there is only one way to get
to each discrete state).$^3$ In line 15, a node $n'$ is terminal if no
action is executable in it (because of lack of resources). In
our application domain each goal pays only once, thus the
nodes in which all goals of the problem have been achieved
are also terminal. Finally, the test in line 19 prevents loops
in the explicit graph. As discussed earlier, such loops are
always suboptimal.

$^2$ We assume that performing an action in a state where it is not
allowed is an error that ends execution with zero or constant reward.
$^3$ Sometimes it is beneficial to use the tree implementation of
AO* when the problem graph is almost a tree, by duplicating nodes
that represent the same (discrete) state reached through different
paths.

Figure 1: Node re-expansion.
(a) Initial GREEDY graph. Actions have multiple possible discrete effects (e.g., $a_0$ has two possible effects in $n_0$). The curves represent the current probability distribution $P$ and value function $V$ over $x$ values for $n_3$. $n_2$ is a fringe node.
(b) GREEDY graph with $n_2$ expanded. Since the path $(n_0, n_2, n_3)$ is optimal for some resource levels in $n_0$, $P_{n_3}$ has changed. As a consequence, $n_3$ has been re-expanded, showing that node $n_5$ is now reachable from $n_3$ under $a_2$, and action $a_4$ has become do-able in $n_3$.


Updating the value functions (lines 22 to 23): As in standard AO*, the value of a newly expanded node must be updated. This consists of recomputing its value function with
Bellman's equation (Eqn. 1), based on the value functions
of all children of $n$ in the explicit graph. Note that these
backups involve all continuous states $x \in X$ for each node,
not just the reachable values of $x$. However, they consider
only actions and arrival nodes that are reachable according to
$P_n$. Once the value of a state is updated, its new value must
be propagated backward in the explicit graph. The backward propagation stops at nodes where the value function is
not modified, and/or at the root node. The whole process is
performed by applying Algorithm 2 to the newly expanded
node.
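The backward propagation just described can be sketched as follows, assuming an acyclic explicit graph given as parent and child adjacency maps. The function `propagate_values`, the `update` callback (which stands in for the Bellman backup and reports whether the value function changed), and the toy graph are hypothetical; sorting the work set is only there to make the example deterministic.

```python
def propagate_values(start, parents, children, update):
    """Back up values from `start` toward the root, children before parents.

    `update(n)` recomputes the value function of n and returns True if it
    changed. Returns the order in which nodes were updated.
    """
    def descendants(n):
        seen, stack = set(), list(children.get(n, ()))
        while stack:
            m = stack.pop()
            if m not in seen:
                seen.add(m)
                stack.extend(children.get(m, ()))
        return seen

    Z = {start}
    order = []
    while Z:
        # Choose a node with no descendant in Z, so that every node is
        # updated only after all of its pending descendants.
        n = next(m for m in sorted(Z) if not (descendants(m) & Z))
        Z.remove(n)
        order.append(n)
        if update(n):
            Z.update(parents.get(n, ()))
    return order


# Toy explicit graph: n0 -> n1 -> n3 and n0 -> n2 -> n3.
children = {"n0": ["n1", "n2"], "n1": ["n3"], "n2": ["n3"]}
parents = {"n1": ["n0"], "n2": ["n0"], "n3": ["n1", "n2"]}

order_all = propagate_values("n3", parents, children, lambda n: True)
order_stop = propagate_values("n3", parents, children, lambda n: n == "n3")
```

In the first call every update changes the value, so the propagation reaches the root, which is updated once after both of its children. In the second call only the expanded node changes, so the propagation stops at its parents.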
1: $Z = \{n\}$ // $n$ is the newly expanded node.
2: while $Z \neq \emptyset$ do
3:   Choose a node $n' \in Z$ that has no descendant in $Z$.
4:   Remove $n'$ from $Z$.
5:   Update $V_{n'}$ following Eqn. 1.
6:   if $V_{n'}$ was modified at the previous step then
7:     Add all parents of $n'$ in the explicit graph to $Z$.
8:     if optimal decision changes for some $(n', x)$, $P_{n'}(x) > 0$ then
9:       Update the greedy subgraph (GREEDY) at $n'$ if necessary.
10:      Mark $n'$ for use at line 23 of Algorithm 1.
Algorithm 2: Updating the value functions $V_n$.

Updating the state distributions (line 23): The $P_n$'s represent
the state distribution under the greedy policy, and they need
to be updated after recomputing the greedy policy. More
precisely, $P$ needs to be updated in each descendant of a
node where the optimal decision changed. To update a node
$n$, we consider all its parents $n'$ in the greedy policy graph,
and all the actions $a$ that can lead from one of the parents to
$n$. The probability of getting to $n$ with a continuous component $x$ is the sum, over all $(n', a)$ and all possible values
$x'$ of the continuous component, of the probability of
arriving from $n'$ and $x'$ under $a$. This can be expressed as:

$$P_n(x) = \sum_{(n', a) \in \Omega_n} \int_{X'} P_{n'}(x') \Pr(n \mid n', x', a) \Pr(x \mid n', x', a, n)\,dx' . \quad (2)$$

Here, $X'$ is the domain of possible values for $x'$, and $\Omega_n$ is
the set of pairs $(n', a)$ where $a$ is the greedy action in $n'$ for
some reachable resource level:

$$\Omega_n = \{(n', a) \in N \times A : \exists x \in X,\ P_{n'}(x) > 0,\ \mu_{n'}(x) = a,\ \Pr(n \mid n', x, a) > 0\} ,$$

where $\mu_n(x) \in A$ is the greedy action in $(n, x)$. Clearly, we
can restrict our attention to state-action pairs in $\Omega_n$ only.
Note that this operation may induce a loss of total probability mass ($P_n < \sum_{n'} P_{n'}$) because we can run out of a
resource during the transition and end up in a sink state.
When the distribution $P_n$ of a node $n$ in the OPEN list
is updated, its priority $g_n$ is recomputed using the following
equation (the priority of nodes in CLOSED is maintained as
0):

$$g_n = \int_{x \in S(P_n) \setminus X_n^{old}} P_n(x) H_n(x)\,dx ; \quad (3)$$

where $S(P)$ is the support of $P$: $S(P) = \{x \in X : P(x) > 0\}$, and $X_n^{old}$ contains all $x \in X$
such that the state $(n, x)$ has already been expanded before
($X_n^{old} = \emptyset$ if $n$ has never been expanded). The techniques
used to represent the continuous probability distributions
$P_n$ and compute the continuous integrals are discussed
in the next sub-section. Algorithm 3 presents the state
distribution updates. It applies to the set of nodes where the
greedy decision changed during value updates (including
the newly expanded node, i.e. $n$ in HAO* Algorithm 1).

1: $Z$ = children of nodes where the optimal decision changed when updating value functions in Algorithm 1.
2: while $Z \neq \emptyset$ do
3:   Choose a node $n \in Z$ that has no ancestor in $Z$.
4:   Remove $n$ from $Z$.
5:   Update $P_n$ following Eqn. 2.
6:   if $P_n$ was modified at step 5 then
7:     Move $n$ from CLOSED to OPEN.
8:     Update the greedy subgraph (GREEDY) at $n$ if necessary.
9:     Update $g_n$ following Eqn. 3.
Algorithm 3: Updating the state distributions $P_n$.
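A discretized version of the distribution update (Eqn. 2) and of the priority computation (Eqn. 3) might look like the sketch below. It assumes a single resource on an integer grid and treats the priority integral as ranging over reachable levels that have not been expanded yet; the function names, argument layout, and numbers are hypothetical.

```python
def propagate_distribution(parent_terms, n_levels):
    """Discretized Eqn. 2 for one node n.

    parent_terms is a list of (P_parent, outcomes) pairs, one per (n', a) that
    leads to n under the greedy policy. P_parent[x0] is the probability mass of
    the parent at level x0 restricted to states where a is the greedy action.
    outcomes lists (prob, cost): the joint probability of arriving in n and
    consuming `cost` units.
    """
    P = [0.0] * n_levels
    for P_parent, outcomes in parent_terms:
        for x0, mass in enumerate(P_parent):
            for prob, cost in outcomes:
                x = x0 - cost
                if x >= 0:  # mass that would go negative is lost to a sink state
                    P[x] += mass * prob
    return P


def gain(P, H, expanded_levels):
    """Discretized Eqn. 3: sum of P(x) * H(x) over reachable, unexpanded levels."""
    return sum(
        p * h
        for x, (p, h) in enumerate(zip(P, H))
        if p > 0 and x not in expanded_levels
    )


# One parent holding all its mass at level 5; the action reaches n with cost
# 2 (prob 0.5), cost 4 (prob 0.3), or cost 7 (prob 0.2, resource exhausted).
parent = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
P_n = propagate_distribution([(parent, [(0.5, 2), (0.3, 4), (0.2, 7)])], 6)
H_n = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
g_first = gain(P_n, H_n, expanded_levels=set())
g_after = gain(P_n, H_n, expanded_levels={3})
```

The example also shows the loss of probability mass mentioned in the text: 0.2 of the parent's mass never arrives in the node because the resource runs out during the transition.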

Handling Continuous Variables
Computationally, the most challenging aspect of HAO* is
the handling of continuous state variables, and particularly
the computation of the continuous integral in Bellman backups and Eqns. 2 and 3. We approach this problem using the
ideas developed in (Feng et al. 2004) for the same application domain. However, we note that HAO* could also be
used with other models of uncertainty and continuous variables, as long as the value functions can be computed exactly
in finite time. The approach of (Feng et al. 2004) exploits
the structure in the continuous value functions of the type of
problems we are addressing. These value functions typically
appear as collections of humps and plateaus, each of which
corresponds to a region in the state space where similar goals
are pursued by the optimal policy (see Fig. 3). The sharpness of the hump or the edge of a plateau reflects uncertainty
of achieving these goals. Constraints imposing minimal resource levels before attempting risky actions introduce sharp
cuts in the regions. Such structure is exploited by grouping
states that belong to the same plateau, while reserving a fine
discretization for the regions of the state space where it is
the most useful (such as the edges of plateaus).
To adapt the approach of (Feng et al. 2004), we make
some assumptions that imply that our value functions can
be represented as piecewise constant or linear. Specifically,
we assume that the continuous state space induced by every
discrete state can be divided into hyper-rectangles in each
of which the following holds: (i) The same actions are applicable. (ii) The reward function is piecewise constant or
linear. (iii) The distribution of discrete effects of each action
is identical. (iv) The set of arrival values or value variations for the continuous variables is discrete and constant.
Assumptions (i)-(iii) follow from the hypotheses made in our
domain models. Assumption (iv) comes down to discretizing the actions' resource consumptions, which is an approximation. It contrasts with the naive approach that consists of
discretizing the state space regardless of the relevance of the
partition introduced. Instead, we discretize the action outcomes first, and then deduce a partition of the state space
from it. The state-space partition is kept as coarse as possible, so that only the relevant distinctions between (continuous) states are taken into account. Given the above conditions, it can be shown (see (Feng et al. 2004)) that for any
finite horizon, for any discrete state, there exists a partition
of the continuous space into hyper-rectangles over which the
optimal value function is piecewise constant or linear. The
implementation represents the value functions as kd-trees,
using a fast algorithm to intersect kd-trees (Friedman et al.
1977), and merging adjacent pieces of the value function
based on their value. We augmented this approach by representing the continuous state distributions $P_n$ as piecewise
constant functions of the continuous variables. Under the
set of hypotheses above, if the initial probability distribution
on the continuous variables is piecewise constant, then the
probability distribution after any finite number of actions is
too, and Eqn. 2 may always be computed in finite time.$^4$

Properties
As for standard AO*, it can be shown that if the heuristic
functions $H_n$ are admissible (optimistic), the actions have
positive resource consumptions, and the continuous backups
are computed exactly, then: (i) at each step of HAO*, $V_n(x)$
is an upper bound on the optimal expected return in $(n, x)$,
for all $(n, x)$ expanded by HAO*; (ii) HAO* terminates after
a finite number of iterations; (iii) after termination, $V_n(x)$ is
equal to the optimal expected return in $(n, x)$, for all $(n, x)$
reachable under the greedy policy ($P_n(x) > 0$). Moreover,
if we assume that, in each state, there is a "done" action that
terminates execution with zero reward (in a rover problem,
we would then start a safe sequence), then we can evaluate
the greedy policy at each step of the algorithm by assuming that execution ends each time we reach a leaf of the
greedy subgraph. Under the same hypotheses, the error of
the greedy policy at each step of the algorithm is bounded by
$\sum_{n \in \text{GREEDY} \cap \text{OPEN}} g_n$. This property allows trading computation time for accuracy by stopping the algorithm early.
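The error bound on the greedy policy (the sum of the priorities $g_n$ over open nodes of the greedy graph) suggests a simple early-stopping test. The sketch below is only an illustration: the tolerance `epsilon` and the function names are assumptions, and the bound itself holds only under the hypotheses stated in the text (admissible heuristics and exact backups).

```python
def greedy_error_bound(open_nodes, greedy_nodes, g):
    """Sum of the gains g_n over the open nodes of the greedy policy graph."""
    return sum(g[n] for n in open_nodes & greedy_nodes)


def can_stop(open_nodes, greedy_nodes, g, epsilon):
    """Stop early once the bound on the greedy policy's error is within epsilon."""
    return greedy_error_bound(open_nodes, greedy_nodes, g) <= epsilon


# Hypothetical snapshot: two open nodes remain in the greedy graph.
open_nodes = {"n2", "n4", "n7"}
greedy_nodes = {"n0", "n2", "n4"}
g = {"n0": 0.0, "n2": 0.25, "n4": 0.5, "n7": 9.0}
bound = greedy_error_bound(open_nodes, greedy_nodes, g)
```

With `epsilon = 0` this reduces to the termination test of the main loop whenever all open greedy nodes have strictly positive gain.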

Heuristic Functions
The heuristic function $H_n$ helps focus the search on truly
useful reachable states. It is essential for tackling real-size
problems. Our heuristic function is obtained by solving a
relaxed problem. The relaxation is very simple: we assume
deterministic transitions for the continuous variables, i.e.,
$\Pr(x' \mid n, x, a, n') \in \{0, 1\}$. If we assume the action consumes the minimum amount of each resource, we obtain an
admissible heuristic function. A non-admissible, but probably more informative heuristic function is obtained by using
the mean resource consumption.
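To make the difference between the two relaxations concrete, the sketch below applies them to a single hypothetical action with a discretized consumption distribution. The numbers, the single-resource setting, and the function names are illustrative assumptions, not taken from the rover model.

```python
from typing import List, Tuple

Outcome = Tuple[float, float]  # (resource consumed, probability)


def relaxed_consumption(outcomes: List[Outcome], mode: str) -> float:
    """Collapse a consumption distribution to one deterministic value."""
    if mode == "min":
        return min(c for c, _ in outcomes)
    if mode == "mean":
        return sum(c * p for c, p in outcomes)
    raise ValueError("mode must be 'min' or 'mean'")


def expected_reward(x: float, outcomes: List[Outcome], reward: float) -> float:
    """True expected reward: the action pays off only if consumption fits in x."""
    return sum(p * reward for c, p in outcomes if c <= x)


def relaxed_reward(x: float, outcomes: List[Outcome], reward: float, mode: str) -> float:
    """Reward under the deterministic relaxation."""
    return reward if relaxed_consumption(outcomes, mode) <= x else 0.0


# Hypothetical action: consumes 5 or 9 units with equal probability, reward 10.
outcomes = [(5.0, 0.5), (9.0, 0.5)]
x = 6.0
true_value = expected_reward(x, outcomes, 10.0)
min_value = relaxed_reward(x, outcomes, 10.0, "min")
mean_value = relaxed_reward(x, outcomes, 10.0, "mean")
print(true_value, min_value, mean_value)
```

In this toy case the minimum-consumption relaxation overestimates the true value, as an admissible heuristic must, while the mean-consumption relaxation underestimates it at x = 6.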
The central idea is to use the same algorithm to solve both
the relaxed and the original problem. Unlike classical approaches where a relaxed plan is generated for every search
state, we generate a relaxed search-graph using our HAO*
algorithm once with a deterministic-consumption model and
a trivial heuristic. The value function Vn of a node in the
relaxed graph represents the heuristic function Hn of the associated node in the original problem graph. Solving the
relaxed problem with HAO* is considerably easier, because
the structure and the updates of the value functions Vn and
of the probabilities Pn are much simpler than in the original
domain. However, we run into the following problem: deterministic consumption implies that the number of reachable
states for any given initial state is very small (because only
one continuous assignment is possible). This means that in a
single expansion, we obtain information about a small number of states. To address this problem, instead of starting
with the initial resource values, we assume a uniform distribution over the possible range of resource values. Because
it is relatively easy to work with a uniform distribution, the
computation is simple relative to the real problem, but we
obtain an estimate for many more states. It is still likely that
we reach states for which no heuristic estimate was obtained
using these initial values. In that case, we simply recompute
starting with this initial state.

4 A deterministic starting state x0 is represented by a uniform distribution with very small rectangular support centered in x0.

Figure 2: Case study: the rover navigates around five target rocks (T1 to T5). The number with each rock is the reward received on testing that rock. [Diagram not reproduced; recoverable labels: locations L1 to L4; rewards T1 (5), T2 (10), T3 (10), T4 (15), T5 (15); path annotations [20,30], [20,30], [15,20], [15,18]; tracking annotations "Lose T4", "Lose T2, T5", "Reacquire T4", "Lose T1".]

Experimental Evaluation
We tested our algorithm on a slightly simplified variant of
the rover model used for the NASA Ames October 2004 Intelligent Systems demo (Pedersen et al. 2005). In this domain,
a planetary rover moves in a planar graph made of locations
and paths, sets up instruments at different rocks, and performs experiments on the rocks. Actions may fail, and their
energy and time consumption are uncertain. Resource consumptions are drawn from two types of distributions, uniform
and normal, and then discretized. The problem instance used
in our preliminary experiments is illustrated in Figure 2. It
contains 5 target rocks (T1 to T5) to be tested. To take a
picture of a target rock, this target must be tracked. To track
a target, we must register it before doing the first move.5
Later, different targets can be lost and re-acquired when navigating along different paths. These changes are modeled as
action effects in the discrete state. Overall, the problem contains 43 propositional state variables and 37 actions. Therefore, there are $2^{48}$ different discrete states, which is far beyond the reach of a flat DP algorithm.
The results presented here were obtained using a preliminary implementation of the piecewise constant DP approximations described in (Feng et al. 2004) based on a flat
representation of state partitions instead of kd-trees. This
is considerably slower than an optimal implementation. To
compensate, our domain features a single abstract continuous resource, while the original domain contains two resources (time and energy). Another difference in our implementation is in the number of nodes expanded at each iteration. We adapt the findings of (Hansen and Zilberstein 2001)
that overall convergence speeds up if all the nodes in OPEN
are expanded at once, instead of prioritizing them based on
gn values and changing the value functions after each expansion.6 Finally, these preliminary experiments do not use
the sophisticated heuristics presented earlier, but the following simple admissible heuristic: Hn is the constant function
equal to the sum of the utilities of all the goals not achieved
in n.
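This simple heuristic can be sketched directly from its definition. The rewards below are the per-rock values shown in Figure 2; representing a discrete state by the set of targets already tested, and the name `goal_sum_heuristic`, are assumptions made for the example.

```python
from typing import Dict, FrozenSet

# Reward for testing each target rock, as labelled in Figure 2.
REWARDS: Dict[str, int] = {"T1": 5, "T2": 10, "T3": 10, "T4": 15, "T5": 15}


def goal_sum_heuristic(achieved: FrozenSet[str]) -> int:
    """Sum of the utilities of all goals not yet achieved in the discrete state.

    The value does not depend on the resource level x, so H_n is a constant
    function of x; it ignores resource limits and is therefore optimistic.
    """
    return sum(r for target, r in REWARDS.items() if target not in achieved)


h_start = goal_sum_heuristic(frozenset())
h_after_t1 = goal_sum_heuristic(frozenset({"T1"}))
h_after_three = goal_sum_heuristic(frozenset({"T1", "T2", "T5"}))
print(h_start, h_after_t1, h_after_three)
```

Because the estimate never drops until a goal is actually achieved, it prunes little, which is consistent with its role here as a baseline heuristic.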
We varied the initial amount of resource available to the
rover. As available resource increases, more nodes are
reachable and more reward can be gained. The performance
of the algorithm is presented in Table 1. We see that the
number of reachable discrete states is much smaller than the
total number of states ($2^{48}$) and the number of nodes in an
optimal policy is surprisingly small. This indicates that AO*
is particularly well suited to our rover problems. However,
the number of nodes expanded is quite close to the number
of reachable discrete states. Thus, our current simple heuristic is only slightly effective in reducing the search space,
and reachability makes the largest difference. This suggests
that much progress can be obtained by using better heuristics. The last column measures the total number of reachable
Markov states, after discretizing the action consumptions as
in (Feng et al. 2004). This is the space that a forward search
algorithm manipulating Markov states, instead of discrete
states, would have to tackle. In most cases, it would be impossible to explore such a space with poor-quality heuristics
such as ours. This indicates that our algorithm is quite effective in scaling up to very large problems by exploiting
the structure presented by continuous resources.
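The two observations above (reachability prunes far more than the heuristic does) can be checked with a few lines of arithmetic on the figures reported in Table 1. The sketch below only recomputes ratios from columns A, C and E of that table; the variable names are assumptions.

```python
# (A: initial resource, C: reachable discrete states, E: nodes expanded by AO*),
# copied from Table 1.
TABLE1 = [
    (30, 39, 38), (40, 176, 159), (50, 475, 442), (60, 930, 860),
    (70, 1548, 1263), (80, 2293, 2004), (90, 3127, 2840),
    (100, 4673, 3737), (110, 6594, 5446), (120, 12564, 9237),
    (130, 19470, 14341), (140, 28828, 24227), (150, 36504, 32997),
]
TOTAL_DISCRETE_STATES = 2 ** 48

# Fraction of reachable discrete states that the search actually expands.
expanded_over_reachable = {a: e / c for a, c, e in TABLE1}

# Fraction of all discrete states that are reachable at all.
reachable_fraction = {a: c / TOTAL_DISCRETE_STATES for a, c, e in TABLE1}

for a, c, e in TABLE1:
    print(f"A={a:3d}  expanded/reachable={expanded_over_reachable[a]:.3f}  "
          f"reachable/total={reachable_fraction[a]:.2e}")
```

For every resource level the search expands well over two thirds of the reachable discrete states, while the reachable states remain a vanishing fraction of the full discrete space.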
Figure 3 shows the converged value function of the initial state of the problem. The value function is comprised of
several plateaus, where different sets of goals are achieved.
For example, the first plateau (until resource level 23) corresponds to the case where the resource level is insufficient for

5 Therefore, starting to track some targets is a typical example of set-up actions, that is, actions that are not necessary in the nominal plan but that we must have performed before if we want to deviate from this plan, for instance, by changing goals if the current resource levels are below the expectations.

6 In this implementation, we do not have to maintain exact probability distributions Pn. We just need to keep track of the supports of these distributions, which can be approximated by lower and upper bounds on each continuous variable.

  A        B      C      D      E   F  G        H
 30      0.1     39     39     38   9  1      239
 40      0.4    176    163    159   9  1     1378
 50      1.8    475    456    442  12  1     4855
 60      7.6    930    909    860  32  2    12888
 70     13.4   1548   1399   1263  22  2    25205
 80     32.4   2293   2148   2004  33  2    42853
 90     87.3   3127   3020   2840  32  2    65252
100    119.4   4673   4139   3737  17  2   102689
110    151.0   6594   5983   5446  69  3   155733
120    213.3  12564  11284   9237  39  3   268962
130    423.2  19470  17684  14341  41  3   445107
140    843.1  28828  27946  24227  22  3    17113
150   1318.9  36504  36001  32997  22  3  1055056

Table 1: Performance of the algorithm for different initial resource levels. A: initial resource (abstract unit). B: execution time (s). C: # reachable discrete states. D: # nodes created by AO*. E: # nodes expanded by AO*. F: # nodes in the optimal policy graph. G: # goals achieved in the longest branch of the optimal solution. H: # reachable Markov states.
Figure 3: Value function of the initial state. [Plot not reproduced; horizontal axis: initial resource (0 to 140); vertical axis: expected utility (0 to 35).]

any goal to be achieved. The next plateau (until 44) depicts
the region in which the target T1 is tested. The remaining
resources are still not enough to move to a new location and
generate additional rewards. In the region between 44 and
61, the rover decides to move to L4 and test T4. Note that
the location L2 is farther than L4, and so the rover does not
attempt to move to L2 yet. The next plateau corresponds to
the region in which the optimal strategy is to move to L2 and
test both T2 and T5, as enough resources for that are now
available. In the last region (beyond 101), three
goals (T1, T2 and T5) are tested and a reward of 30 is obtained.
When Hn is admissible, we can bound the error of the
current greedy graph by summing gn over fringe nodes. In
Table 2 we describe the time/value tradeoff we found for this
domain. On the one hand, we see that even a large compromise in quality leads to only about a 25% reduction in time.
On the other hand, we see that much of this reduction is obtained at a very small price ($\epsilon = 0.5$). Additional experiments are required to learn if this is a general phenomenon.

Initial    $\epsilon$   Execution   # nodes          # nodes
resource                time        created by AO*   expanded by AO*
130        0.00         426.8       17684            14341
130        0.50         371.9       17570            14018
130        1.00         331.9       17486            13786
130        1.50         328.4       17462            13740
130        2.00         330.0       17462            13740
130        2.50         320.0       17417            13684
130        3.00         322.1       17417            13684
130        3.50         318.3       17404            13668
130        4.00         319.3       17404            13668
130        4.50         319.3       17404            13668
130        5.00         318.5       17404            13668
130        5.50         320.4       17404            13668
130        6.00         315.5       17356            13628

Table 2: Complexity of computing an $\epsilon$-optimal policy. The optimal return for an initial resource of 130 is 30.
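The time/value tradeoff in Table 2 can be quantified with a short calculation. The sketch below only recomputes relative time savings from the reported execution times; the variable names are assumptions.

```python
# Error tolerance and execution time pairs copied from Table 2
# (initial resource 130).
EPSILONS = [0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0, 5.5, 6.0]
TIMES = [426.8, 371.9, 331.9, 328.4, 330.0, 320.0, 322.1,
         318.3, 319.3, 319.3, 318.5, 320.4, 315.5]

baseline = TIMES[0]  # time needed for the exact (epsilon = 0) solution

# Relative reduction in execution time for each tolerance.
reduction = {eps: (baseline - t) / baseline for eps, t in zip(EPSILONS, TIMES)}

# Share of the largest observed saving that is already obtained at epsilon = 0.5.
share_at_half = reduction[0.5] / reduction[6.0]

for eps in EPSILONS:
    print(f"epsilon={eps:.2f}  time reduction={100 * reduction[eps]:.1f}%")
print(f"share of the maximum saving reached at epsilon=0.5: {share_at_half:.2f}")
```

On these figures, a tolerance of 0.5 already yields roughly half of the saving obtained at a tolerance of 6.0.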

Conclusions
We presented a variant of the AO* algorithm that, to the best
of our knowledge, is the first algorithm to deal with limited continuous resources, uncertainty, and oversubscription
planning. We developed a sophisticated reachability analysis involving continuous variables that could be useful for
heuristic search algorithms at large. Our preliminary implementation of this algorithm shows very promising results on
a domain of practical importance. We are able to handle
problems with $2^{48}$ discrete states, as well as a continuous
component.
In the near future, we hope to report on a more mature version of the algorithm, which we are currently implementing.
It includes: (1) a full implementation of the techniques described in (Feng et al. 2004); (2) a rover model with two
continuous variables; (3) a more informed heuristic function, as discussed in Section 3.

Acknowledgements
This work was funded by the NASA Intelligent Systems
program. Eric Hansen was supported in part by NSF grant
IIS-9984952, NASA grant NAG-2-1463 and a NASA Summer Faculty Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are
those of the authors and do not reflect the views of the NSF
or NASA. This work was performed during Mausam and
Eric Hansen's visit at NASA Ames Research Center.

References
E. Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC, 1999.
J. Bresina, R. Dearden, N. Meuleau, S. Ramakrishnan, D. Smith, and R. Washington. Planning under continuous time and resource uncertainty: A challenge for AI. In Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence, pages 77-84, 2002.
Z. Feng, R. Dearden, N. Meuleau, and R. Washington. Dynamic programming for structured continuous Markov decision problems. In Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence, pages 154-161, 2004.
J.H. Friedman, J.L. Bentley, and R.A. Finkel. An algorithm for finding best matches in logarithmic expected time. ACM Trans. Mathematical Software, 3(3):209-226, 1977.
C. Guestrin, M. Hauskrecht, and B. Kveton. Solving factored MDPs with continuous and discrete variables. In Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence, pages 235-242, 2004.
E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35-62, 2001.
N. Meuleau, R. Dearden, and R. Washington. Scaling up decision theoretic planning to planetary rover problems. In AAAI-04: Proceedings of the Workshop on Learning and Planning in Markov Processes: Advances and Challenges, pages 66-71, Technical Report WS-04-08, AAAI Press, Menlo Park, CA, 2004.
J. Pearl. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley, 1984.
L. Pedersen, D. Smith, M. Deans, R. Sargent, C. Kunz, D. Lees, and S. Rajagopalan. Mission planning and target tracking for autonomous instrument placement. Submitted to 2005 IEEE Aerospace Conference, 2005.
D. Smith. Choosing objectives in over-subscription planning. In Proceedings of the Fourteenth International Conference on Automated Planning and Scheduling, pages 393-401, 2004.
M. van den Briel, R. Sanchez, M.B. Do, and S. Kambhampati. Effective approaches for partial satisfaction (oversubscription) planning. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, pages 562-569, 2004.
H.L.S. Younes and R.G. Simmons. Solving generalized semi-Markov decision processes using continuous phase-type distributions. In Proceedings of the Nineteenth National Conference on Artificial Intelligence, pages 742-747, 2004.
