
Discretized Approximations for POMDP with Average Cost

Huizhen Yu
Laboratory for Information and Decision Systems
EECS Dept., MIT
Cambridge, MA 02139

Dimitri P. Bertsekas
Laboratory for Information and Decision Systems
EECS Dept., MIT
Cambridge, MA 02139
Abstract

In this paper, we propose a new lower approximation scheme for POMDP with discounted and average cost criteria. The approximating functions are determined by their values at a finite number of belief points, and can be computed efficiently using value iteration algorithms for finite-state MDP. While for discounted problems several lower approximation schemes have been proposed earlier, ours seems the first of its kind for average cost problems. We focus primarily on the average cost case, and we show that the corresponding approximation can be computed efficiently using multi-chain algorithms for finite-state MDP. We give a preliminary analysis showing that, regardless of the existence of the optimal average cost J* in the POMDP, the approximation obtained is a lower bound of the optimal liminf average cost function, and can also be used to calculate an upper bound on the optimal limsup average cost function, as well as bounds on the cost of executing the stationary policy associated with the approximation. We show the convergence of the cost approximation when the optimal average cost is constant and the optimal differential cost is continuous.
1 INTRODUCTION

We consider discrete-time infinite horizon partially observable Markov decision processes (POMDP) with the state space S, the observation space Z, and the control space U all being finite. Let X be the set of probability distributions on S, called the belief space, and let g_u(s) be the per-stage cost function. With the average cost criterion, we minimize over the policies the average expected cost (1/N) E{ Σ_{t=0}^{N−1} g_{u_t}(s_t) | s_0 ∼ x }, as N goes to infinity, when the initial state s_0 follows the distribution x. POMDPs with the average cost criterion are substantially more difficult to analyze than those with discounted cost. Although there are optimality equations whose solution provides the optimal average cost function and a stationary optimal policy, in general there is no guarantee that a solution exists, and there are no finite computation algorithms to obtain it. Therefore, discretized approximations are computationally appealing as approximate solutions for average cost POMDP, since the problem of finite-state MDPs with average cost is well understood and can be solved with several commonly used algorithms.

We note that a discretization scheme for discounted POMDP that gives a lower approximation was first proposed by (Lovejoy, 1991). It was later improved by (Zhou and Hansen, 2001). There have been no proposals of discretization schemes for average cost POMDP, to our knowledge. A conceptually different alternative for approximately solving average cost POMDP is the finite memory approach (Aberdeen and Baxter, 2002). In this approach, one seeks a policy that is average cost optimal within a class of finite-state controllers. The advantage of the finite memory approach is that a suboptimal policy can be learned in a model-free fashion, i.e., with a simulator rather than an explicit transition probability model of the system. By contrast, the discretization approaches of Lovejoy, and of Zhou and Hansen, as well as ours, require an exact mechanism for generating beliefs/conditional state distributions as the system is operating.

We have recently become aware of the related work by (Ormoneit and Glynn, 2002) on MDP with continuous state space and average cost. Our POMDP scheme can be viewed as a special case of their general approximation scheme. However, the lower approximation property is special to POMDP, and the corresponding asymptotic convergence results are also different in the two works.
The starting point for our discretization methodology is the discounted problem, for which we introduce a new lower approximation scheme, based on a fictitious optimistic controller that receives extra information about the hidden states. The cost of this controller, a lower bound to the optimal cost, can be calculated using finite-state MDP methods, and can be used as an approximate cost-to-go function in a one-step lookahead scheme. We extend our approach to the average cost criterion, where the discretized problem can be solved by multi-chain algorithms for finite-state MDP. We show that the corresponding approximate cost is a lower bound to the optimal liminf average cost function, and can be used to obtain an upper bound to the optimal limsup average cost function, as well as bounds on the cost of the stationary policy associated with the approximation. We show asymptotic convergence of the cost approximation of the discretization scheme, assuming that the optimal average cost is constant and the optimal differential cost is continuous.

The paper is organized as follows. In Section 2, we consider discretized approximations in the discounted case, and introduce a new approximation scheme. We prove asymptotic convergence for two main discretization schemes. In Section 3, we extend the discretized approximations to the average cost case, and give an analysis of error bounds and asymptotic convergence. Finally, in Section 4, we present experimental results. Due to space limitations, some of the proofs have been omitted. They can be found in an expanded version of this paper (Yu and Bertsekas, 2004), which also addresses some additional topics, including a general framework for deriving upper and lower approximation schemes for POMDP.
2 DISCOUNTED CASE

We introduce a new approximation scheme and summarize known discretized lower approximation schemes for the discounted case. The belief MDPs associated with them will be the basis for the lower approximation schemes in the average cost case. The results obtained here will also be useful there.

In the discounted case, we minimize the discounted cost

    E{ Σ_{t=0}^∞ β^t g_{u_t}(s_t) | s_0 ∼ x }

for a fixed β ∈ [0, 1). The optimal cost function J*(x) satisfies the Bellman equation J*(x) = (TJ*)(x), where

    (TJ)(x) = min_{u∈U} [ x′g_u + β E_z{ J(τ_u(x, z)) } ],

′ denotes transpose, g_u denotes the per-stage cost vector, and τ_u(x, z) denotes the conditional distribution of s_1 after applying control u and observing z.

A few notations for expectations will be used throughout the text. At places where emphasis on the distribution is necessary, we use the symbol E_{z|x,u}{·}, which should be read as Σ_z p(z|x, u)(·), and is equivalent to the conditional expectation E_z{· | x, u}.
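To make the notation concrete, the following is a minimal sketch (not from the paper) of the belief update τ_u(x, z) and of evaluating the Bellman operator T at a belief x. The model arrays P[u] (transition matrices) and O[u] (observation matrices), as well as the function names, are assumptions made here for illustration.

```python
import numpy as np

def belief_update(x, u, z, P, O):
    """tau_u(x, z): posterior over s_1 given belief x, control u, observation z.
    Assumed layout: P[u][s, s'] = p(s' | s, u), O[u][s', z] = p(z | s', u)."""
    unnorm = O[u][:, z] * (x @ P[u])       # p(s', z | x, u) for each s'
    pz = unnorm.sum()                      # p(z | x, u)
    return (unnorm / pz if pz > 0 else x), pz

def bellman(J, x, g, P, O, beta):
    """(T J)(x) = min_u [ x' g_u + beta * E_z { J(tau_u(x, z)) } ],
    with J any callable on beliefs and g[u] the per-stage cost vector."""
    values = []
    for u in range(len(g)):
        expected_next = 0.0
        for z in range(O[u].shape[1]):
            x_next, pz = belief_update(x, u, z, P, O)
            if pz > 0:
                expected_next += pz * J(x_next)
        values.append(x @ g[u] + beta * expected_next)
    return min(values)
```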
2.1 A NEW INEQUALITY

The optimal cost J*(·) is concave, i.e., for any convex combination x̄ = Σ_i λ_i(x̄) x_i, where λ_i(x̄) ≥ 0 and Σ_i λ_i(x̄) = 1, we have J*(x̄) ≥ Σ_i λ_i(x̄) J*(x_i). Using this property with x̄ = τ_u(x, z) in the Bellman equation, we have the following inequality, which was proposed by (Zhou and Hansen, 2001) for a discretized cost approximation:

    J*(x) ≥ min_u [ x′g_u + β E_{z|x,u}{ Σ_i λ_i(τ_u(x, z)) J*(x_i) } ].    (1)
We introduce a new inequality, which follows from concavity of E_{z|x,u}{ J*(τ_u(x, z)) } in x.

Proposition 1 For all x ∈ X, x_i ∈ X and λ_i(x) ≥ 0 such that x = Σ_i λ_i(x) x_i and Σ_i λ_i(x) = 1, the optimal cost J*(x) satisfies

    J*(x) ≥ min_u [ x′g_u + β Σ_i λ_i(x) E_{z|x_i,u}{ J*(τ_u(x_i, z)) } ].    (2)
We present here, however, an alternative proof that uses the interpretation of a modified process in which there is additional information about the randomness of the initial distribution. This argument has the same spirit as the region-observable POMDP (Zhang and Liu, 1997), and can be generalized (Yu and Bertsekas, 2004). Since Prop. 1 implies concavity of J*(·),¹ which is not used in the proof, it can also be used to establish concavity of J*(·) without an induction argument.

¹This is so because x′g_u = Σ_i λ_i(x) x_i′ g_u and min Σ ≥ Σ min.
Proof: Consider a new process P̄, otherwise identical to the original POMDP, except that the initial distribution of s_0 is generated by a mixture of m distributions x_i marginally identical to x. By this we mean that there is a random variable q taking values from 1 to m with

    p(q = k | x) = λ_k(x),    p(s_0 | q = k) = x_k(s_0).

Assume q is not accessible to the controller. The optimal cost for this new process equals J*(x), and is achieved by the policy π* that is optimal in the original POMDP. Denote its action at x by a. We have

    J*(x) = x′g_a + E{ E{ Σ_{t=1}^∞ β^t g(s_t, u_t) | x, a, z, q } | x, a }.

Let τ_a(x, (z, q)) be the distribution p(s_1 | x, a, (z, q)). As q, and hence τ_a(x, (z, q)), are inaccessible to π*, by the optimality of J*(·) we have that in the last equation

    E{ Σ_{t=1}^∞ β^t g(s_t, u_t) | x, a, z, q } ≥ β J*( τ_a(x, (z, q)) ).

Since τ_a(x, (z, q)) = τ_a(x_i, z) given q = i, it follows that

    J*(x) ≥ x′g_a + β E_{(z,q)|x,a}{ J*( τ_a(x, (z, q)) ) }
          = x′g_a + β Σ_i λ_i(x) E_{z|x_i,a}{ J*( τ_a(x_i, z) ) }
          ≥ min_u [ x′g_u + β Σ_i λ_i(x) E_{z|x_i,u}{ J*( τ_u(x_i, z) ) } ].    ∎
2.2 DISCRETIZED APPROXIMATIONS

We first summarize known lower approximation schemes, and then prove asymptotic convergence for two main schemes corresponding to the inequalities (1) and (2).

2.2.1 Approximation Schemes

Let G = {x_i} be a finite set of beliefs such that their convex hull is X. A simple choice is to discretize X into a regular grid, so we refer to the x_i as grid points. By choosing different x_i and λ_i(·) in the inequalities (1) and (2), we obtain lower cost approximations that are functionally determined by their values at a finite number of beliefs.
Definition 1 (ε-Discretization Scheme) Call (G, Λ) an ε-discretization scheme, where G = {x_i} is a set of n beliefs, Λ = (λ_1(·), ..., λ_n(·)) is a convex representation scheme such that x = Σ_i λ_i(x) x_i for all x ∈ X, and ε is a scalar characterizing the fineness of the discretization, defined by

    ε = max_{x∈X} max_{x_i∈G: λ_i(x)>0} ‖x − x_i‖.
Given (G, Λ), let T̃_{Di}, i = 1, 2, be the associated mappings corresponding to the right-hand sides of inequalities (1) and (2), respectively:

    (T̃_{D1} J)(x) = min_u [ x′g_u + β Σ_i E_{z|x,u}{ λ_i(τ_u(x, z)) } J(x_i) ],    (3)
    (T̃_{D2} J)(x) = min_u [ x′g_u + β Σ_i λ_i(x) E_{z|x_i,u}{ J(τ_u(x_i, z)) } ].    (4)

Associated with these mappings are their unique belief MDPs on the continuous belief space X, which we will refer to as the modified belief MDPs. The optimal cost functions J̃_i in these modified belief MDPs satisfy, respectively,

    (T̃_{Di} J̃_i)(x) = J̃_i(x) ≤ J*(x),    ∀ x ∈ X, i = 1, 2.

Both J̃_i are functionally determined by their values at a finite number of beliefs, which will be called supporting points, and whose set is denoted by C. In particular, the function J̃_1 can be computed by solving a corresponding finite-state MDP on C = G = {x_i}, and the function J̃_2 can be computed by solving a corresponding finite-state MDP on C = { τ_u(x_i, z) | x_i ∈ G, u ∈ U, z ∈ Z }.²

²More precisely, C = { τ_u(x_i, z) | x_i ∈ G, u ∈ U, z ∈ Z, such that p(z | x_i, u) > 0 }.

The computation can thus be done efficiently by variants of value iteration methods, or linear programming.
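As a concrete illustration (not the authors' implementation), the fixed point of T̃_{D1} restricted to the grid G can be computed by ordinary value iteration on the induced finite-state MDP. The convex representation is abstracted here as an assumed function weights(x) returning the coefficients λ_i(x), and the model arrays g, P, O follow the layout of the earlier sketch.

```python
import numpy as np

def value_iteration_D1(G, weights, g, P, O, beta, iters=5000, tol=1e-8):
    """Solve J(x_j) = min_u [ x_j' g_u + beta * sum_z p(z|x_j,u) *
    sum_i lambda_i(tau_u(x_j,z)) J(x_i) ] for all grid points x_j in G."""
    G = [np.asarray(xj, dtype=float) for xj in G]
    J = np.zeros(len(G))
    for _ in range(iters):
        J_new = np.empty_like(J)
        for j, xj in enumerate(G):
            best = np.inf
            for u in range(len(g)):
                val = xj @ g[u]
                joint = O[u].T * (xj @ P[u])   # joint[z, s'] = p(s', z | x_j, u)
                for z in range(joint.shape[0]):
                    pz = joint[z].sum()
                    if pz > 0:
                        val += beta * pz * (weights(joint[z] / pz) @ J)
                best = min(best, val)
            J_new[j] = best
        if np.max(np.abs(J_new - J)) < tol:
            break
        J = J_new
    return J_new
```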
Usually X is partitioned into convex regions, and beliefs in a region are represented as convex combinations of its vertices. The function J̃_1 is then piecewise linear on each region, and the function J̃_2 is piecewise linear and concave on each region. To see the latter, let q(x_i, u) = E_{z|x_i,u}{ J̃_2(τ_u(x_i, z)) }; we then have

    J̃_2(x) = min_u [ x′g_u + β Σ_i λ_i(x) q(x_i, u) ].
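For example, once the table q(x_i, u) has been computed, J̃_2 can be evaluated at an arbitrary belief as in the following sketch (weights(x) again stands for an assumed convex-representation routine; q has one row per grid point and one column per control).

```python
import numpy as np

def evaluate_J2(x, g, q, weights, beta):
    """J~_2(x) = min_u [ x' g_u + beta * sum_i lambda_i(x) q(x_i, u) ]."""
    x = np.asarray(x, dtype=float)
    lam = weights(x)
    return min(x @ g[u] + beta * (lam @ q[:, u]) for u in range(len(g)))
```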
The simplest case for both mappings is when G consists of the vertices of the belief simplex, i.e., G = { e_s | s ∈ S }, where e_s(s) = 1 and e_s(s′) = 0 for s′ ≠ s, s, s′ ∈ S. Denote the corresponding mappings by T̃_{D1}^0 and T̃_{D2}^0, respectively, i.e.,

    (T̃_{D1}^0 J)(x) = min_u [ x′g_u + β Σ_{s∈S} p(s | x, u) J(e_s) ],    (5)
    (T̃_{D2}^0 J)(x) = min_u [ x′g_u + β Σ_{s∈S} x(s) E_{z|s,u}{ J(τ_u(e_s, z)) } ].    (6)

The mapping T̃_{D1}^0 is the QMDP approximation, suggested by (Littman, Cassandra, and Kaelbling, 1995), who have shown good results for certain applications.
In the belief MDP associated with T̃_{D1}^0, the states will be observable after the initial step. In the belief MDP associated with T̃_{D2}^0, the previous state will be revealed at each stage. One can show that T̃_{D2}^0 gives a better approximation than T̃_{D1}^0 in both the discounted and the average cost cases. For the comparison of the T̃_{Di} in general, by concavity of J*, one can relax the inequality J* ≥ T̃_{D2} J* to obtain an inequality of the same form as the inequality J* ≥ T̃_{D1} J*. See (Yu and Bertsekas, 2004) for these details.
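As an illustration of the simplest scheme (5), the sketch below computes the QMDP lower bound for a discounted problem: it solves the underlying fully observable MDP for Q(s, u) by value iteration and then takes min_u x′Q(·, u) at a belief x. The data layout (g[u], P[u]) is an assumption carried over from the earlier sketches, not something fixed by the paper.

```python
import numpy as np

def qmdp_lower_bound(x, g, P, beta, iters=5000, tol=1e-10):
    """Q(s,u) = g_u(s) + beta * sum_{s'} p(s'|s,u) min_{u'} Q(s',u');
    at the fixed point, min_u x' Q(., u) equals the mapping (5) applied to
    the underlying MDP's optimal values, a lower bound on the discounted J*(x)."""
    nS, nU = len(x), len(g)
    Q = np.zeros((nS, nU))
    for _ in range(iters):
        V = Q.min(axis=1)
        Q_new = np.stack([g[u] + beta * (P[u] @ V) for u in range(nU)], axis=1)
        if np.max(np.abs(Q_new - Q)) < tol:
            Q = Q_new
            break
        Q = Q_new
    return float((np.asarray(x, dtype=float) @ Q).min())
```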
By concatenating mappings we obtain other discretized lower approximations. For example,

    T T̃_{Di},  i = 1, 2;    T̃_I T̃_{D2},    (7)

where T̃_I denotes a region-observable-POMDP type of mapping (Zhang and Liu, 1997). In the concatenated mapping (T̃_I T̃_{D2}) we only need grid points to be on lower-dimensional spaces.

Let T̃ be any of the above mappings. Its associated modified belief MDP is not necessarily a POMDP model. It is straightforward to show the following,³ by comparing the N-stage optimal cost of the modified MDP to that of the original POMDP. This result also holds for β = 1.

³Use induction and concavity, or alternatively an argument similar to the proof of Prop. 1.

Proposition 2 Let J_0 be a concave function on X. For any β ∈ [0, 1],

    (T̃^N J_0)(x) ≤ (T^N J_0)(x),    ∀ x ∈ X, ∀ N.
2.2.2 Asymptotic Convergence

We will now provide a limiting theorem for T̃_{D1} and T̃_{D2}, using the uniform continuity property of J*(·). We first give some conventional notations related to policies, to be used throughout the paper. Let μ be a stationary policy, and J_μ be its cost. We define the mapping T_μ by

    (T_μ J)(x) = x′g_{μ(x)} + β E_{z|x,μ(x)}{ J( τ_{μ(x)}(x, z) ) },

and similarly, for any control u, we define T_u to be the mapping that has the single control u in place of μ(x) in T_μ. Let T̃ be either T̃_{D1} or T̃_{D2}, and similarly let T̃_μ and T̃_u correspond to a policy μ and a control u, respectively.
The function J*(x) is continuous on X. For any continuous function v(·), E_{z|x,u}{ v(τ_u(x, z)) } is also continuous on X. As X is compact, by the uniform continuity of the corresponding functions, we have the following lemma.

Lemma 1 Let v(·) be a continuous function on X. For any δ > 0, there exists ε̄ > 0 such that for any ε-discretization scheme (G, Λ) with ε ≤ ε̄,

    | (T_u v)(x) − (T̃_u v)(x) | ≤ δ,    ∀ x ∈ X, u ∈ U,

where T̃ is either T̃_{D1} or T̃_{D2} associated with (G, Λ).
By Lemma 1, and the standard error bounds ‖J_μ − J*‖_∞ ≤ (2/(1−β)) ‖T J̃ − J̃‖_∞ for the one-step lookahead policy μ based on J̃, and ‖J_μ − J*‖_∞ ≤ (1/(1−β)) ‖T_μ J* − J*‖_∞ for any stationary policy μ (see, e.g., (Bertsekas, 2001)), we have the following limiting theorem, which states that the lower approximation and the cost of its lookahead policy, as well as the cost of the policy that is optimal with respect to the modified belief MDP, all converge to the optimal cost of the original POMDP.
Theorem 1 Let (G_k, Λ_k) be a sequence of ε_k-discretization schemes with ε_k → 0 as k → ∞. Let J̃_k, μ_k and μ̃_k be such that

    J̃_k = T̃_k J̃_k = T̃_{k,μ̃_k} J̃_k,    T_{μ_k} J̃_k = T J̃_k,

where T̃_k is either T̃_{D1} or T̃_{D2} associated with (G_k, Λ_k). Then, for any fixed β ∈ [0, 1),

    J̃_k → J*,    J_{μ_k} → J*,    J_{μ̃_k} → J*,    as k → ∞.
3 DISCRETIZED APPROXIMATIONS FOR AVERAGE COST CRITERION

In average cost POMDP, the objective is to minimize the average cost (1/N) E{ Σ_{t=0}^{N−1} g(s_t, u_t) | s_0 ∼ x_0 }, as N goes to infinity. For POMDP with average cost, in order that a stationary optimal policy exist, it is sufficient that the following functional equations, in the belief MDP notation,

    J(x) = min_u E_{x̄|x,u}{ J(x̄) },    (8)
    J(x) + h(x) = min_{u∈U(x)} [ x′g_u + E_{x̄|x,u}{ h(x̄) } ],
    where U(x) = arg min_u E_{x̄|x,u}{ J(x̄) },

admit a bounded solution (J*(·), h*(·)). The stationary policy that attains the minimum is then optimal, with its average cost being J*(x). However, there are no finite computation algorithms to obtain it. (For a general analysis of POMDP with average cost, see (Fernandez-Gaucherand, Arapostathis, and Marcus, 1991) or the survey by (Arapostathis et al., 1993).)
We now extend the application of the discretized approximations to the average cost case. First, note that solving the corresponding average cost problem in the discretized approach is much easier. Let T̃ be any of the mappings from Eqs. (3)-(7) in Section 2.2.1. For its associated modified belief MDP, writing g̃_u(x) for the cost per stage, we have the following average cost optimality equations:

    J(x) = min_u Ẽ_{x̄|x,u}{ J(x̄) },    (9)
    J(x) + h(x) = min_{u∈U(x)} [ g̃_u(x) + Ẽ_{x̄|x,u}{ h(x̄) } ],
    where U(x) = arg min_u Ẽ_{x̄|x,u}{ J(x̄) },

and we use Ẽ to indicate that the expectation is taken with respect to the distributions p̃(x̄ | x, u) of the modified MDP, which satisfy

    p̃(x̄ | x, u) = 0,    ∀ (x, u), x̄ ∉ C,

with C being the finite set of supporting beliefs. There are bounded solutions (J̃(·), h̃(·)) to the optimality equations (9) for the following reason. Every finite-state MDP admits a solution to its average cost optimality equations. Furthermore, if x ∉ C, then x is transient and unreachable from C, and the next belief x̄ belongs to C under any control u in the modified MDP. It follows that the optimality equations (9) restricted to {x} ∪ C are the optimality equations for a finite-state MDP with |C| + 1 states, so the solution (J̃(x̄), h̃(x̄)) exists for x̄ ∈ {x} ∪ C, with its values on C independent of x. This is essentially the algorithm to solve for J̃(·) and h̃(·) in two stages, and obtain an optimal stationary policy for the modified MDP.
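For concreteness, the following is a minimal sketch of one standard way to solve the average cost optimality equations of a finite-state MDP, namely relative value iteration. It covers only the simplest case (a unichain, aperiodic model) and is merely a stand-in for the multi-chain algorithms actually needed here in general (see Puterman, 1994, Ch. 10); the data layout is an assumption.

```python
import numpy as np

def relative_value_iteration(g, P, iters=100000, tol=1e-9):
    """Approximately solve lambda + h(s) = min_u [ g_u(s) + sum_{s'} p(s'|s,u) h(s') ]
    for a unichain, aperiodic finite-state MDP, normalizing h at state 0.
    g[u][s] = per-stage cost, P[u][s, s'] = transition probability."""
    n_states = len(g[0])
    h = np.zeros(n_states)
    lam = 0.0
    for _ in range(iters):
        Th = np.min([g[u] + P[u] @ h for u in range(len(g))], axis=0)
        lam_new, h_new = Th[0], Th - Th[0]
        if np.max(np.abs(h_new - h)) < tol and abs(lam_new - lam) < tol:
            return lam_new, h_new
        lam, h = lam_new, h_new
    return lam, h
```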
Concerns arise, however, about using an arbitrary optimal policy for the modified MDP as a suboptimal control in the original POMDP. Although all average cost optimal policies behave equally optimally in the asymptotic sense, they do so in the modified MDP, in which all the states x ∉ C are transient. As an illustration, suppose that for the completely observable MDP the optimal average cost is constant over all states; then, in the modified MDP corresponding to the QMDP approximation scheme, any control at any belief x ∉ C has the same asymptotic average cost. The situation worsens if even the completely observable MDP itself has a large number of states that are transient under its optimal policies. We therefore speculate that, for the modified MDP, we should aim to compute policies with additional optimality guarantees relating to their finite-stage behavior. Fortunately, for finite-state MDPs there are efficient algorithms for computing such policies.

In the following we present the algorithm, after a brief review of the related results for finite-state MDP, and give a preliminary analysis of error bounds and asymptotic convergence. We show sufficient conditions for the convergence of the cost approximation, assuming that the optimal average cost of the POMDP is constant.
3.1 ALGORITHM

We first briefly review related results for finite-state MDPs. Since the average cost measures the asymptotic behavior of a policy, of two policies having the same average cost, one can incur significantly larger cost over finitely many steps than the other. The concept of n-discount optimality is useful for differentiating between such policies. It is also closely related to Blackwell optimality. A policy π* is n-discount optimal if its costs in the discounted problems satisfy

    limsup_{β↑1} (1−β)^{−n} ( J_β^{π*}(s) − J_β^{π}(s) ) ≤ 0,    ∀ s, π.

By definition, an (n+1)-discount optimal policy is also k-discount optimal for k = −1, 0, ..., n. A policy is called Blackwell optimal if it is optimal for all the discounted problems with discount factor β ∈ [β̄, 1) for some β̄ < 1. For finite-state MDPs, a policy is Blackwell optimal if and only if it is ∞-discount optimal. By contrast, any (−1)-discount optimal policy is average cost optimal.
For any finite-state MDP, there exist stationary average cost optimal policies and, furthermore, stationary n-discount optimal and Blackwell optimal policies. In particular, there exist functions J(·), h(·) and w_k(·), k = 0, ..., n+1, with w_0 = h, such that they satisfy the following nested equations:

    J(s) = min_{u∈U(s)} E_{s̄|s,u}{ J(s̄) },    (10)
    J(s) + h(s) = min_{u∈U_{−1}(s)} [ g_u(s) + E_{s̄|s,u}{ h(s̄) } ],
    w_{k−1}(s) + w_k(s) = min_{u∈U_{k−1}(s)} E_{s̄|s,u}{ w_k(s̄) },    k = 1, ..., n+1,

where

    U_{−1}(s) = arg min_{u∈U(s)} E_{s̄|s,u}{ J(s̄) },
    U_0(s) = arg min_{u∈U_{−1}(s)} [ g_u(s) + E_{s̄|s,u}{ h(s̄) } ],
    U_k(s) = arg min_{u∈U_{k−1}(s)} E_{s̄|s,u}{ w_k(s̄) },    k = 1, ..., n+1.

Any stationary policy that attains the minimum in the right-hand sides of the equations in (10) is an n-discount optimal policy.

For finite-state MDPs, a stationary n-discount optimal policy not only exists, but can also be efficiently computed by multi-chain algorithms. Furthermore, in order to obtain a Blackwell optimal policy, which is ∞-discount optimal, it is sufficient to compute an (N−2)-discount optimal policy, where N is the number of states of the finite-state MDP. We refer readers to (Puterman, 1994), Chapter 10, especially Section 10.3, for details of the algorithm as well as its theoretical analysis.
This leads to the following algorithm for computing an n-discount optimal policy for the modified MDP defined on the continuous belief space. We first solve the average cost problem on C, and then determine optimal controls at the transient states x ∉ C. Note that there are no conditions (such as unichain conditions) at all on this modified belief MDP.
The algorithm solving the modified MDP

1. Compute an n-discount optimal solution for the finite-state MDP problem associated with C. Let J̃(x_i), h̃(x_i), and w̃_k(x_i), k = 1, ..., n+1, with x_i ∈ C, be the corresponding functions obtained that satisfy Eq. (10) on C.

2. For any belief x, let the control set U_{n+1} be computed at the last step of the sequence of optimizations:

    U_{−1} = arg min_u Ẽ_{x_i|x,u}{ J̃(x_i) },
    U_0 = arg min_{u∈U_{−1}} [ g̃_u(x) + Ẽ_{x_i|x,u}{ h̃(x_i) } ],
    U_k = arg min_{u∈U_{k−1}} Ẽ_{x_i|x,u}{ w̃_k(x_i) },    1 ≤ k ≤ n+1.

Let u be any control in U_{n+1}, and let μ̃(x) = u. Also, if x ∉ C, define

    J̃(x) = Ẽ_{x_i|x,u}{ J̃(x_i) },
    h̃(x) = g̃_u(x) + Ẽ_{x_i|x,u}{ h̃(x_i) } − J̃(x).
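A sketch of Step 2 above, the nested arg-min filtering of controls at a belief x, is given below. It assumes the modified transition probabilities p̃(x_i | x, u) over C, the modified per-stage costs g̃_u(x), and the Step-1 functions are available as arrays; all the names here are placeholders rather than the authors' code.

```python
import numpy as np

def select_control(g_mod, P_mod, J_tilde, h_tilde, w_list, tol=1e-9):
    """g_mod[u] = g~_u(x); P_mod[u][i] = p~(x_i | x, u) over the supporting set C;
    J_tilde, h_tilde and the entries of w_list are the Step-1 functions on C.
    Returns one control from the final set U_{n+1}."""
    n_u = len(g_mod)

    def filter_controls(values, subset):
        best = min(values[u] for u in subset)
        return [u for u in subset if values[u] <= best + tol]

    U = list(range(n_u))
    # U_{-1}: expected average cost of the next supporting belief
    U = filter_controls([P_mod[u] @ J_tilde for u in range(n_u)], U)
    # U_0: break ties using the differential cost h~
    U = filter_controls([g_mod[u] + P_mod[u] @ h_tilde for u in range(n_u)], U)
    # U_k, k = 1, ..., n+1: break remaining ties using w~_k
    for w_k in w_list:
        U = filter_controls([P_mod[u] @ w_k for u in range(n_u)], U)
    return U[0]
```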
With the above algorithm we obtain an (n−1)-discount optimal policy for the modified MDP. When n = |C| − 1, we obtain an ∞-discount optimal policy for the modified MDP,⁴ since the algorithm essentially computes a Blackwell optimal policy for every finite-state MDP restricted to {x} ∪ C, for all x. Thus, for the modified MDP, for any other policy μ and any x ∈ X,

    limsup_{β↑1} (1−β)^{−n} ( J̃_β^{μ̃}(x) − J̃_β^{μ}(x) ) ≤ 0,    ∀ n ≥ −1.

It is also straightforward to see that

    J̃(x) = lim_{β↑1} (1−β) J̃_β(x),    ∀ x ∈ X,    (11)

where J̃_β(x) are the optimal discounted costs of the modified MDP, and the convergence is uniform over X, since J̃_β(x) and J̃(x) are piecewise linear interpolations of their values on a finite set of beliefs.

⁴Note that ∞-discount optimality and Blackwell optimality are equivalent for finite-state MDPs; however, they are not equivalent in the case of a continuous state space. In the modified MDP, although for each x there exists a β̄(x) ∈ (0, 1) such that μ̃(x) is optimal for all β-discounted problems with β̄(x) ≤ β < 1, we may have sup_x β̄(x) = 1 due to the continuity of the belief space.
3.2 ANALYSIS OF ERROR BOUNDS

We now show how to bound the optimal average cost of the original POMDP, and how to bound the cost of executing, in the original POMDP, the suboptimal policy that is optimal for the modified MDP.

Let V_N^π(x) = E{ Σ_{t=0}^{N−1} g_{u_t}(x_t) | x_0 = x } be the N-stage cost of a non-randomized policy π, which can be non-stationary, in the original POMDP. Let

    J*_-(x) = inf_π liminf_{N→∞} (1/N) V_N^π(x),    J*_+(x) = inf_π limsup_{N→∞} (1/N) V_N^π(x).
It is straightforward to show⁵ that J̃(x) ≤ J*_+(x), ∀ x ∈ X. We now show that J̃(x) ≤ J*_-(x), ∀ x ∈ X.

⁵Since in the discounted case the corresponding lower approximation satisfies J̃_β(x) ≤ J*_β(x), by Eq. (11) and a Tauberian theorem we have, for the approximate average cost, J̃(x) = lim_{β↑1} (1−β) J̃_β(x) ≤ liminf_{β↑1} (1−β) J*_β(x) ≤ inf_π limsup_{N→∞} (1/N) V_N^π(x) = J*_+(x).

Proposition 3 The optimal average cost function J̃(x) of the modified MDP satisfies J̃(x) ≤ J*_-(x), ∀ x ∈ X.

Proof: Let V*_N(x) and Ṽ*_N(x) be the optimal N-stage cost functions of the original POMDP and of the modified belief MDP, respectively. By Prop. 2 in Section 2.2.1, we have Ṽ*_N(x) ≤ V*_N(x), ∀ N. Thus

    J̃(x) = liminf_{N→∞} (1/N) Ṽ*_N(x) ≤ liminf_{N→∞} (1/N) V*_N(x) ≤ J*_-(x).    ∎
Next we give a simple upper bound on J*_+(·).

Theorem 2 The optimal liminf and limsup average cost functions satisfy

    J̃(x) ≤ J*_-(x) ≤ J*_+(x) ≤ max_{x̄∈C} J̃(x̄) + δ,

where δ = max_{x∈X} [ (T_{μ̃} h̃)(x) − J̃(x) − h̃(x) ], with μ̃ the stationary policy associated with the approximation and T_{μ̃} the mapping of Section 2.2.2 taken with β = 1, and J̃(x), h̃(x) and C are defined as in the modified MDP.
This statement is a consequence of the following lemma, whose proof, omitted here, follows by bounding the expected cost per stage in the summation of the N-stage cost.

Lemma 2 Let J(x) and h(x) be any bounded functions on X, and let μ be any stationary policy. Define the constants δ⁺ and δ⁻ by

    δ⁺ = max_{x∈X} [ g_{μ(x)}(x) + E_{x̄|x,μ(x)}{ h(x̄) } − J(x) − h(x) ],
    δ⁻ = min_{x∈X} [ g_{μ(x)}(x) + E_{x̄|x,μ(x)}{ h(x̄) } − J(x) − h(x) ].

Then V_N^μ(x), the N-stage cost of executing policy μ, satisfies

    ζ⁻(x) + δ⁻ ≤ liminf_{N→∞} (1/N) V_N^μ(x) ≤ limsup_{N→∞} (1/N) V_N^μ(x) ≤ ζ⁺(x) + δ⁺,    ∀ x ∈ X,

where ζ⁺(x) and ζ⁻(x) are defined by

    ζ⁺(x) = max_{x̄∈D_x^μ} J(x̄),    ζ⁻(x) = min_{x̄∈D_x^μ} J(x̄),

and D_x^μ denotes the set of beliefs reachable from x under policy μ.
Let μ̃ be the stationary policy that is optimal for the modified MDP. We can use Lemma 2 to bound the liminf and limsup average cost of μ̃ in the original POMDP. For example, if the optimal average cost J*_MDP of the completely observable MDP problem equals a constant λ̄ over all states, then we also have J̃(x) = λ̄, ∀ x ∈ X, for this modified MDP. The cost of executing the policy μ̃ in the original POMDP can therefore be bounded by

    λ̄ + δ⁻ ≤ liminf_{N→∞} (1/N) V_N^{μ̃}(x) ≤ limsup_{N→∞} (1/N) V_N^{μ̃}(x) ≤ λ̄ + δ⁺.

The quantities δ⁺ and δ⁻ can be hard to calculate exactly in general, since J̃(·) and h̃(·) obtained from the modified MDP are piecewise linear functions. The bounds may also be loose. On the other hand, these functions may indicate the structure of the original problem, and help us refine the discretization scheme in the approximation.
3.3 ANALYSIS OF ASYMPTOTIC CONVERGENCE

Let (G, Λ) be an ε-discretization scheme, and let J̃_ε and J̃_{ε,β} be the optimal average cost and discounted cost, respectively, in the modified MDP associated with (G, Λ) and either T̃_{D1} or T̃_{D2}. Recall that in the discounted case (Theorem 1), for a fixed discount factor β, we have asymptotic convergence to optimality:

    lim_{ε→0} J̃_{ε,β}(x) = J*_β(x).

We now address the question whether J̃_ε(x) → J*(x) as ε → 0, when J*(x) = J*_-(x) = J*_+(x) exists.
This question of asymptotic convergence under the average cost criterion is hard to tackle, for a couple of reasons. First of all, it is not clear when J*(x) exists. (Fernandez-Gaucherand, Arapostathis, and Marcus, 1991) have shown that under certain conditions (such as the condition that |J*_β(x) − J*_β(x̄)| is bounded for all β ∈ [0, 1), and its relaxed variants), the optimal average cost J*(x) exists and equals a constant λ* over X, and furthermore

    λ* = lim_{β↑1} (1−β) J*_β(x),    ∀ x ∈ X.    (12)

However, even when Eq. (12) holds, in general we have

    lim_{ε→0} J̃_ε(x) = lim_{ε→0} lim_{β↑1} (1−β) J̃_{ε,β}(x) ≠ lim_{β↑1} lim_{ε→0} (1−β) J̃_{ε,β}(x) = λ*,

that is, the order of the two limits cannot be interchanged in general. To ensure that J̃_ε → λ*, we therefore need stronger conditions than those that guarantee the existence of λ*. We now show that a sufficient condition is the continuity of the optimal differential cost h*(·).
Theorem 3 Suppose the average cost optimality equations (8) admit a bounded solution (J*(x), h*(x)) with J*(x) equal to a constant λ*. Then, if the differential cost h*(x) is continuous on X, we have

    lim_{ε→0} J̃_ε(x) = λ*,    ∀ x ∈ X,

and the convergence is uniform, where J̃_ε is the optimal average cost function of the modified MDP corresponding to either T̃_{D1} or T̃_{D2} with an associated ε-discretization scheme (G, Λ).
Proof: Let μ̃ be the optimal policy for the modified MDP associated with an ε-discretization scheme. Let T̃ be the mapping corresponding to the modified MDP, defined by (T̃v)(x) = min_u [ g̃_u(x) + Ẽ_{x̄|x,u}{ v(x̄) } ]. Since h*(x) is continuous on X, by Lemma 1 in Section 2.2.2 we have that for any δ > 0 there exists ε̄ > 0 such that, for all ε-discretization schemes with ε < ε̄,

    | (T_{μ̃} h*)(x) − (T̃_{μ̃} h*)(x) | ≤ δ.    (13)

We now apply the result of Lemma 2 in the modified MDP, with J = λ*, h = h*, and μ = μ̃. That is, by the same argument as in Lemma 2, we have

    J̃_ε(x) = liminf_{N→∞} (1/N) Ṽ_N^{μ̃}(x) ≥ λ* + η,    ∀ x ∈ X,

where η = min_{x∈X} [ (T̃_{μ̃} h*)(x) − h*(x) − λ* ]. Since

    λ* + h*(x) = (T h*)(x) ≤ (T_{μ̃} h*)(x),

and | (T_{μ̃} h*)(x) − (T̃_{μ̃} h*)(x) | ≤ δ by Eq. (13), we have

    (T̃_{μ̃} h*)(x) − h*(x) ≥ λ* − δ.

Hence η ≥ −δ, and λ* − δ ≤ J̃_ε(x) ≤ λ* for all ε ≤ ε̄ and all x ∈ X, which proves the uniform convergence of J̃_ε to λ*.    ∎
Note that the inequality J̃_ε ≤ λ* is crucial in the preceding proof. Note also that the proof does not generalize to the case where J*(x) is not constant. A fairly strong sufficient condition that guarantees the existence of a constant J* and a continuous h* is that the family of functions J*_β(x), β ∈ [0, 1), is equicontinuous on X. (For a proof see (Ross, 1968) or Theorem 6.3 (iv) in (Arapostathis et al., 1993).)
4 PRELIMINARY EXPERIMENTS

We demonstrate our approach on a set of toy problems: Paint, Bridge-repair, and Shuttle. The sizes of the problems are summarized in Table 1. Their descriptions and parameters are as specified in A. Cassandra's POMDP File Repository (http://cs.brown.edu/research/ai/pomdp/examples), and we define costs to be negative rewards when a problem has a reward model.

Table 1: Sizes of Problems (|S|, |U|, |Z|)

    Problem    |S|   |U|   |Z|
    Paint       4     4     2
    Bridge      5    12     5
    Shuttle     8     3     5
We used some simple grid patterns. One pattern, referred to as k-E, consists of k grid points on each edge, in addition to the vertices of the belief simplex. Another pattern, referred to as n-R, consists of n randomly chosen grid points, in addition to the vertices of the simplex. The combined pattern is referred to as k-E+n-R. Thus the grid pattern for the QMDP approximation is 0-E, for instance, and 2-E+10-R is a combined pattern. The grid pattern then induces a partition of the belief space and a convex representation (interpolation) scheme, which we kept implicitly and computed by linear programming on-line.
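One way to compute a convex representation λ(x) on-line by linear programming, as mentioned above, is the small feasibility LP sketched below (using scipy; not the authors' code). It returns one valid set of weights over the grid, which need not coincide with the partition-induced scheme actually used in the experiments.

```python
import numpy as np
from scipy.optimize import linprog

def convex_weights(x, G):
    """Find lambda >= 0 with sum_i lambda_i = 1 and sum_i lambda_i * G[i] = x."""
    G = np.asarray(G, dtype=float)                  # shape: (num_grid_points, |S|)
    n = G.shape[0]
    A_eq = np.vstack([G.T, np.ones((1, n))])
    b_eq = np.concatenate([np.asarray(x, dtype=float), [1.0]])
    res = linprog(c=np.zeros(n), A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * n, method="highs")
    if not res.success:
        raise ValueError("x is not in the convex hull of the grid")
    return res.x
```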
The algorithm for solving the modified finite-state MDP was implemented by solving a system of linear equations for each policy iteration. This may not be the most efficient way. When the number of supporting points became large, we computed policies that were no higher than 5-discount optimal.
Figure 1 shows the average cost approximations of T̃_{D1} and T̃_{D2} with a few grid patterns for the problem Paint. In all cases we obtained a constant average cost for the modified MDP. The horizontal axis is labeled by the grid pattern, and the vertical axis is the approximate cost. The red curve is obtained by T̃_{D1}, and the blue curve by T̃_{D2}. As will be shown below, the approximation obtained by T̃_{D2} with 3-E is already near optimal. The policies generated by T̃_{D2} are not always better, however. We also notice, as indicated by the drop in the curves when using the grid pattern 4-E, that the improvement of the cost approximation does not depend solely on the number of grid points, but also on where they are positioned.

[Figure 1: Average cost approximation for the problem Paint using various grid patterns (0-E, 1-E, 2-E+10-R, 3-E, 3-E+10-R, 3-E+100-R, 4-E on the horizontal axis; approximate average cost on the vertical axis). Blue: T̃_{D2}; red: T̃_{D1}.]
In Table 2 we summarize the cost approximations obtained (column LB) and the simulated costs of the policies (column S. Policy) for the three problems. The approximation schemes obtaining the LB values in Table 2, as well as the policies simulated, are listed in Table 3. The column N. UB shows the numerically computed upper bound on the optimal cost from Theorem 2, obtained by sampling the values of (T_{μ̃} h̃)(x) − h̃(x) − J̃(x) at hundreds of randomly generated beliefs and taking the maximum over them. Thus the N. UB values are under-estimates of the exact upper bound. For both Paint and Shuttle the number of simulated trajectories is 160, and for Bridge it is 1000. Each trajectory has 500 steps starting from the same belief. The first number in S. Policy in Table 2 is the mean of the average costs of the simulated trajectories, and the standard error listed as the second number is estimated from bootstrap samples: we created 100 pseudo-random samples by sampling from the empirical distribution of the original sample, and computed the standard deviation of the mean estimator over these 100 pseudo-random samples.
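A minimal sketch of this bootstrap standard-error computation (function and parameter names are ours, not the authors'):

```python
import numpy as np

def bootstrap_se(trajectory_avg_costs, n_boot=100, seed=0):
    """Resample the per-trajectory average costs with replacement n_boot times
    and return the standard deviation of the resampled means."""
    rng = np.random.default_rng(seed)
    data = np.asarray(trajectory_avg_costs, dtype=float)
    means = [rng.choice(data, size=data.size, replace=True).mean()
             for _ in range(n_boot)]
    return float(np.std(means))
```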
As shown in Table 2, we find that a policy from the discretized approximation with a very coarse grid can already be comparable to the optimal. This is verified by simulating the policy (S. Policy) and comparing its average cost against the lower bound on the optimal cost (LB), which in turn shows that the lower approximation is near optimal.

Table 2: Average Cost Approximations and Simulated Average Cost of Policies

    Problem      LB         N. UB      S. Policy
    Paint       -0.170     -0.052     -0.172 ± 0.002
    Bridge     241.798    241.880    241.700 ± 1.258
    Shuttle     -1.842     -1.220     -1.835 ± 0.007

Table 3: Approximation Schemes in LB and Simulated Policies in Table 2

    Problem     LB                   S. Policy
    Paint       T̃_{D2} w/ 3-E        T̃_{D1} w/ 1-E
    Bridge      T̃_{D2} w/ 0-E        T̃_{D2} w/ 0-E
    Shuttle     T̃_{D1,2} w/ 2-E      T̃_{D1} w/ 2-E
We find that in some cases the upper bounds may be too loose to be informative. For example, for the problem Paint we know that there is a simple policy achieving zero average cost, so a near-zero upper bound does not tell us much about the optimal cost. In the experiments we also observe that an approximation scheme with more grid points does not necessarily provide a better upper bound on the optimal cost.
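For reference, the sampled upper-bound estimate (the N. UB column) can be sketched as follows, under the assumption that J̃, h̃, and the one-step mapping (T_{μ̃} h̃) are available as callables on beliefs; all names are placeholders.

```python
import numpy as np

def sampled_upper_bound(J_tilde, h_tilde, T_mu_h, J_on_C, n_states,
                        n_samples=500, seed=0):
    """Estimate max_{x in C} J~(x) + delta from Theorem 2, with delta approximated
    by sampling random beliefs; as noted above, this under-estimates the exact bound."""
    rng = np.random.default_rng(seed)
    delta = -np.inf
    for _ in range(n_samples):
        x = rng.dirichlet(np.ones(n_states))        # a random belief
        delta = max(delta, T_mu_h(x) - h_tilde(x) - J_tilde(x))
    return max(J_on_C) + delta
```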
5 CONCLUSION

In this paper we have proposed a discretized lower approximation approach for POMDP with average cost. We have shown that the approximations can be computed efficiently using multi-chain algorithms for finite-state MDP, and that they can be used for bounding the optimal liminf and limsup average cost functions, as well as for generating suboptimal policies. Thus, like the finite-state controller approach, our approach also bypasses difficult analytic questions such as the existence of bounded solutions to the average cost optimality equations. We have also introduced a new lower approximation scheme for both the discounted and average cost cases, and shown asymptotic convergence of two main approximation schemes in the average cost case under certain conditions.

Acknowledgements

This work is supported by NSF Grant ECS-0218328. We thank Leslie Kaelbling for helpful discussions.
References

Aberdeen, D. and J. Baxter (2002). Internal-state policy-gradient algorithms for infinite-horizon POMDPs. Technical report, RSISE, Australian National University.

Arapostathis, A., V. S. Borkar, E. Fernandez-Gaucherand, M. K. Ghosh, and S. I. Marcus (1993). Discrete-time controlled Markov processes with average cost criterion: a survey. SIAM J. Control and Optimization 31(2): 282-344.

Bertsekas, D. P. (2001). Dynamic Programming and Optimal Control, Vols. I, II. Athena Scientific, second edition.

Fernandez-Gaucherand, E., A. Arapostathis, and S. I. Marcus (1991). On the average cost optimality equation and the structure of optimal policies for partially observable Markov decision processes. Ann. Operations Research 29: 439-470.

Littman, M. L., A. R. Cassandra, and L. P. Kaelbling (1995). Learning policies for partially observable environments: Scaling up. In Int. Conf. Machine Learning.

Lovejoy, W. S. (1991). Computationally feasible bounds for partially observed Markov decision processes. Operations Research 39(1): 162-175.

Ormoneit, D. and P. Glynn (2002). Kernel-based reinforcement learning in average-cost problems. IEEE Trans. Automatic Control 47(10): 1624-1636.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc.

Ross, S. M. (1968). Arbitrary state Markovian decision processes. Ann. Mathematical Statistics 39(6): 2118-2122.

Yu, H. and D. P. Bertsekas (2004). Inequalities and their applications in value approximation for discounted and average cost POMDP. LIDS tech. report, MIT. To appear.

Zhang, N. L. and W. Liu (1997). A model approximation scheme for planning in partially observable stochastic domains. J. Artificial Intelligence Research 7: 199-230.

Zhou, R. and E. A. Hansen (2001). An improved grid-based approximation algorithm for POMDPs. In Int. J. Conf. Artificial Intelligence.