

CORC Tech Report TR-2002-07
Robust dynamic programming∗

G. Iyengar

Submitted Dec. 3rd, 2002. Revised May 4, 2004.

Abstract
In this paper we propose a robust formulation for discrete time dynamic programming (DP). The
objective of the robust formulation is to systematically mitigate the sensitivity of the DP optimal policy
to ambiguity in the underlying transition probabilities. The ambiguity is modeled by associating a
set of conditional measures with each state-action pair. Consequently, in the robust formulation each
policy has a set of measures associated with it. We prove that when this set of measures has a certain
“Rectangularity” property all the main results for finite and infinite horizon DP extend to natural robust
counterparts. We identify families of sets of conditional measures for which the computational complexity
of solving the robust DP is only modestly larger than solving the DP, typically logarithmic in the size
of the state space. These families of sets are constructed from the confidence regions associated with
density estimation, and therefore, can be chosen to guarantee any desired level of confidence in the robust
optimal policy. Moreover, the sets can be easily parameterized from historical data. We contrast the
performance of robust and non-robust DP on small numerical examples.

1 Introduction
This paper is concerned with sequential decision making in uncertain environments. Decisions are made
in stages and each decision, in addition to providing an immediate reward, changes the context of future
decisions; thereby affecting the future rewards. Due to the uncertain nature of the environment, there is
limited information about both the immediate reward from each decision and the resulting future state. In
order to achieve a good performance over all the stages the decision maker has to trade-off the immediate
payoff with future payoffs. Dynamic programming (DP) is the mathematical framework that allows the
decision maker to efficiently compute a good overall strategy by succinctly encoding the evolving information
state. In the DP formalism the uncertainty in the environment is modeled by a Markov process whose
transition probability depends both on the information state and the action taken by the decision maker. It
is assumed that the transition probability corresponding to each state-action pair is known to the decision
maker, and the goal is to choose a policy, i.e. a rule that maps states to actions, that maximizes some
performance measure. Puterman (1994) provides an excellent introduction to the DP formalism and its
various applications. In this paper, we assume that the reader has some prior knowledge of DP.
∗ Submitted to Math. Oper. Res. Do not distribute.
† IEOR Department, Columbia University, Email: [email protected]. Research partially supported by NSF grants CCR-00-09972 and DMS-01-04282.
The DP formalism encodes information in the form of a “reward-to-go” function (see Puterman, 1994, for
details) and chooses an action that maximizes the sum of the immediate reward and the expected “reward-
to-go”. Thus, to compute the optimal action in any given state the “reward-to-go” function for all the future
states must be known. In many applications of DP, the number of states and actions available in each state
are large; consequently, the computational effort required to compute the optimal policy for a DP can be
overwhelming – Bellman’s “curse of dimensionality”. For this reason, considerable recent research effort has
focused on developing algorithms that compute an approximately optimal policy efficiently (Bertsekas and
Tsitsiklis, 1996; de Farias and Van Roy, 2002).
Fortunately, for many applications the DP optimal policy can be computed with a modest computational
effort. In this paper we restrict attention to this class of DPs. Typically, the transition probability of the
underlying Markov process is estimated from historical data and is, therefore, subject to statistical errors. In
current practice, these errors are ignored and the optimal policy is computed assuming that the estimate is,
indeed, the true transition probability. The DP optimal policy is quite sensitive to perturbations in the tran-
sition probability and ignoring the estimation errors can lead to serious degradation in performance (Nilim
and El Ghaoui, 2002; Tsitsiklis et al., 2002). Degradation in performance due to estimation errors in param-
eters has also been observed in other contexts (Ben-Tal and Nemirovski, 1997; Goldfarb and Iyengar, 2003).
Therefore, there is a need to develop DP models that explicitly account for the effect of errors.
In order to mitigate the effect of estimation errors we assume that the transition probability corresponding
to a state-action pair is not exactly known. The ambiguity in the transition probability is modeled by
associating a set P(s, a) of conditional measures with each state-action pair (s, a). (We adopt the convention
of the decision analysis literature wherein uncertainty refers to random quantities with known probability
measures and ambiguity refers to unknown probability measures (see, e.g. Epstein and Schneider, 2001)).
Consequently, in our formulation each policy has a set of measures associated with it. The value of a
policy is the minimum expected reward over the set of associated measures, and the goal of the decision
maker is to choose a policy with maximum value, i.e. we adopt a maximin approach. We will refer to this
formulation as robust DP. We prove that, when the set of measures associated with a policy satisfy a certain
“Rectangularity” property (Epstein and Schneider, 2001), the following results extend to natural robust
counterparts: the Bellman recursion, the optimality of deterministic policies, the contraction property of
the value iteration operator, and the policy iteration algorithm. “Rectangularity” is a sort of independence
assumption and is a minimal requirement for these results to hold. However, this assumption is not always
appropriate, and is particularly troublesome in the infinite horizon setting (see Appendix A for details).
We show that if the decision maker is restricted to stationary policies the effects of the “Rectangularity”
assumption are not serious.
There is some previous work on modeling ambiguity in the transition probability and mitigating its effect
on the optimal policy. Satia and Lave (1973); White and Eldeib (1994); Bagnell et al. (2001) investigate
ambiguity in the context of infinite horizon DP with finite state and action spaces. They model ambiguity
by constraining the transition probability matrix to lie in a pre-specified polytope. They do not discuss how
one constructs this polytope. Moreover, the complexity of the resulting robust DP is at least an order of
magnitude higher than DP. Shapiro and Kleywegt (2002) investigate ambiguity in the context of stochastic
programming and propose a sampling based method for solving the maximin problem. However, they do not
discuss how to choose and calibrate the set of ambiguous priors. None of this work discusses the dynamic
structure of the ambiguity; in particular, there is no discussion of the central role of “Rectangularity”. Our
theoretical contributions are based on recent work on uncertain priors in the economics literature (Gilboa
and Schmeidler, 1989; Epstein and Schneider, 2001, 2002; Hansen and Sargent, 2001). The focus of this

body of work is on the axiomatic justification for uncertain priors in the context of multi-period utility
maximization. It does not provide any means of selecting the set of uncertain priors nor does it focus on
efficiently solving the resulting robust DP.
In this paper we identify families of sets of conditional measures that have the following desirable proper-
ties. These families of sets provide a means for setting any desired level of confidence in the robust optimal
policy. For a given confidence level, the corresponding set from each family is easily parameterizable from
data. The complexity of solving the robust DP corresponding to these families of sets is only modestly
larger than the non-robust counterpart. These families of sets are constructed from the confidence regions
associated with density estimation.
While this paper was being prepared for publication we became aware of a technical report by Nilim
and El Ghaoui (2002) where they formulate finite horizon robust DP in the context of an aircraft routing
problem. A “robust counterpart” for the Bellman equation appears in their paper but they do not justify
that this “robust counterpart”, indeed, characterizes the robust value function. Like all the previous work
on robust DP, Nilim and El Ghaoui also do not recognize the importance of Rectangularity. However, they
do introduce sets based on confidence regions and show that the finite horizon robust DP corresponding to
these sets can be solved efficiently.
The paper has two distinct and fairly independent parts. The first part, comprising Section 2 and
Section 3 presents the robust DP theory. In Section 2 we formulate finite horizon robust DP and the
“Rectangularity” property that leads to the robust counterpart of the Bellman recursion; and Section 3
formulates the robust extension of discounted infinite horizon DP. The focus of the second part, comprising
Section 4 and Section 5, is on computation. In Section 4 we describe three families of sets of conditional
measures that are based on the confidence regions, and show that the computational effort required to solve
the robust DP corresponding to these sets is only modestly higher than that required to solve the non-
robust counterpart. The results in this section, although independently obtained, are not new and were first
obtained by Nilim and El Ghaoui (2002). In Section 5 we provide basic examples and computational results.
Section 6 includes some concluding remarks.

2 Finite horizon robust dynamic programming


Decisions are made at discrete points in time t ∈ T = {0, 1, . . .} referred to as decision epochs. In this
section we assume that T is finite, i.e. T = {0, . . . , N − 1} for some N ≥ 1. At each epoch t ∈ T the system
occupies a state s ∈ St , where St is assumed to be discrete (finite or countably infinite). In a state s ∈ St the
decision maker is allowed to choose an action a ∈ At (s), where At (s) is assumed to be discrete. Although
many results in this paper extend to non-discrete state and action sets, we avoid this generality because the
associated measurability issues would detract from the ideas that we want to present in this work.
For any discrete set B, we will denote the set of probability measures on B by M(B). Decision makers
can choose actions either randomly or deterministically. A random action in a state s ∈ St corresponds
to an element qs ∈ M(A(s)) with the interpretation that an action a ∈ A(s) is selected with probability
qs (a). Degenerate probability measures that assign all the probability mass to a single action correspond to
deterministic actions.
Associated with each epoch t ∈ T and state-action pair (s, a), a ∈ A(s), s ∈ S t , is a set of conditional
measures Pt (s, a) ⊆ M(St+1 ) with the interpretation that if at epoch t, action a is chosen in state s, the
state st+1 at the next epoch t + 1 is determined by some conditional measure psa ∈ Pt (s, a). Thus, the state
transition is ambiguous. (We adopt the convention of the decision analysis literature wherein uncertainty

refers to random quantities with known probability measures and ambiguity refers to unknown probability
measures (see, e.g. Epstein and Schneider, 2001)).
The decision maker receives a reward rt (st , at , st+1 ) when the action at ∈ A(st ) is chosen in state st ∈ S
at the decision epoch t, and the state at the next epoch is st+1 ∈ S. Since st+1 is ambiguous, we allow the
reward at time t to depend on st+1 as well. Note that one can assume, without loss of generality, that the
reward rt (·, ·, ·) is certain. The reward rN (s) at the epoch N is only a function of the state s ∈ SN .
We will refer to the collection of objects $\big\{T, \{S_t, A_t, \mathcal{P}_t, r_t(\cdot,\cdot,\cdot) : t \in T\}\big\}$ as a finite horizon ambiguous
Markov decision process (AMDP). The notation above is a modification of that in Puterman (1994) and the
structure of ambiguity is motivated by Epstein and Schneider (2001).
A decision rule dt is a procedure for selecting actions in each state at a specified decision epoch t ∈ T . We
will call a decision rule history dependent if it depends on the entire past history of the system as represented
by the sequence of past states and actions, i.e. dt is a function of the history ht = (s0 , a0 , . . . , st−1 , at−1 , st ).
Let Ht denote the set of all histories ht . Then a randomized decision rule dt is a map dt : Ht 7→ M(A(st )).
A decision rule dt is called deterministic if it puts all the probability mass on a single action a ∈ A(s t ), and
Markovian if it is a function of the current state st alone.
The set of all conditional measures consistent with a deterministic Markov decision rule $d_t$ is given by
$$\mathcal{T}^{d_t} = \Big\{ p : S_t \mapsto \mathcal{M}(S_{t+1}) \;:\; \forall s \in S_t,\ p_s \in \mathcal{P}_t(s, d_t(s)) \Big\}, \qquad (1)$$
i.e. for every state $s \in S$, the next state can be determined by any $p \in \mathcal{P}_t(s, d_t(s))$. The set of all conditional measures consistent with a history dependent decision rule $d_t$ is given by
$$\mathcal{T}^{d_t} = \Big\{ p : H_t \mapsto \mathcal{M}(A(s_t) \times S_{t+1}) \;:\; \forall h_t \in H_t,\ p_{h_t}(a,s) = q_{d_t(h_t)}(a)\, p_{s_t a}(s),\ p_{s_t a} \in \mathcal{P}(s_t, a),\ a \in A(s_t),\ s \in S_{t+1} \Big\}. \qquad (2)$$

A policy prescribes the decision rule to be used at all decision epochs. Thus, a policy π is a sequence of
decision rules, i.e. π = (dt : t ∈ T ). Given the ambiguity in the conditional measures, a policy π induces a
collection of measure on the history space HN . We assume that the set T π of measures consistent with a
policy π has the following structure.

Assumption 1 (Rectangularity) The set $\mathcal{T}^\pi$ of measures consistent with a policy $\pi$ is given by
$$\mathcal{T}^\pi = \Big\{ \mathbf{P} : \forall h_N \in H_N,\ \mathbf{P}(h_N) = \prod_{t \in T} p_{h_t}(a_t, s_{t+1}),\ p_{h_t} \in \mathcal{T}^{d_t},\ t \in T \Big\} = \mathcal{T}^{d_0} \times \mathcal{T}^{d_1} \times \cdots \times \mathcal{T}^{d_{N-1}}, \qquad (3)$$

where the notation in (3) simply denotes that each $\mathbf{P} \in \mathcal{T}^\pi$ is a product of $p_t \in \mathcal{T}^{d_t}$, and vice versa.

The Rectangularity assumption is motivated by the structure of the recursive multiple priors in Epstein and
Schneider (2001). We will defer discussing the implications of this assumption until after we define the
objective of the decision maker.
The reward $V_0^\pi(s)$ generated by a policy $\pi$ starting from the initial state $s_0 = s$ is defined as follows.
$$V_0^\pi(s) = \inf_{\mathbf{P} \in \mathcal{T}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t \in T} r_t(s_t, d_t(h_t), s_{t+1}) + r_N(s_N) \Big], \qquad (4)$$

where EP denotes the expectation with respect to the fixed measure P ∈ T π . Equation (4) defines the
reward of a policy π to be the minimum expected reward over all measures consistent with the policy π.
Thus, we take a worst-case approach in defining the reward. In the optimization literature this approach is

known as the robust approach (Ben-Tal and Nemirovski, 1998). Let Π denote the set of all history dependent
policies. Then the goal of robust DP is to characterize the robust value function
$$V_0^*(s) = \sup_{\pi \in \Pi} \big\{ V_0^\pi(s) \big\} = \sup_{\pi \in \Pi} \inf_{\mathbf{P} \in \mathcal{T}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t \in T} r_t(s_t, d_t(h_t), s_{t+1}) + r_N(s_N) \Big], \qquad (5)$$

and an optimal policy $\pi^*$ if the supremum is achieved.
In order to appreciate the implications of the Rectangularity assumption the objective (5) has to be
interpreted in an adversarial setting: the decision maker chooses π; an adversary observes π, and chooses a
measure P ∈ T π that minimizes the reward. In this context, Rectangularity is a form of an independence
assumption: the choice of a particular distribution p̄ ∈ P(st , at ) in a state-action pair (st , at ) at time t does
not limit the choices of the adversary in the future. This, in turn, leads to a separability property that is
crucial for establishing the robust counterpart of the Bellman recursion (see Theorem 1). Such a model for
an adversary is not always appropriate. See Appendix A for an example of such a situation. We will return
to this issue in the context of infinite horizon models in Section 3.
The optimistic value $\bar{V}_0^\pi(s)$ of a policy $\pi$ starting from the initial state $s_0 = s$ is defined as
$$\bar{V}_0^\pi(s) = \sup_{\mathbf{P} \in \mathcal{T}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t \in T} r_t(s_t, d_t(h_t), s_{t+1}) + r_N(s_N) \Big]. \qquad (6)$$

Let V0π (s0 ; P) denote the non-robust value of a policy π corresponding to a particular choice P ∈ T π . Then
V̄0π (s0 ) ≥ V0π (s0 ; P) ≥ V0π (s0 ). Analogous to the robust value function V0∗ (s), the optimistic value function
V̄0∗ (s) is defined as
$$\bar{V}_0^*(s) = \sup_{\pi \in \Pi} \big\{ \bar{V}_0^\pi(s) \big\} = \sup_{\pi \in \Pi} \sup_{\mathbf{P} \in \mathcal{T}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t \in T} r_t(s_t, d_t(h_t), s_{t+1}) + r_N(s_N) \Big]. \qquad (7)$$

Remark 1 Since our interest is in computing the robust optimal policy π ∗ , we will restrict attention to the
robust value function V0∗ . However, all the results in this paper imply a corresponding result for the optimistic
value function V̄0∗ with the inf P∈T π (·) replaced by supP∈T π (·).

Let $V_n^\pi(h_n)$ denote the reward obtained by using policy $\pi$ over epochs $n, n+1, \ldots, N-1$, starting from the history $h_n$, i.e.
$$V_n^\pi(h_n) = \inf_{\mathbf{P} \in \mathcal{T}_n^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t=n}^{N-1} r_t(s_t, d_t(h_t), s_{t+1}) + r_N(s_N) \Big], \qquad (8)$$
where Rectangularity implies that the set of conditional measures $\mathcal{T}_n^\pi$ consistent with the policy $\pi$ and the history $h_n$ is given by
$$\mathcal{T}_n^\pi = \Big\{ \mathbf{P}_n : H_n \mapsto \mathcal{M}\Big(\prod_{t=n}^{N-1}(A_t \times S_{t+1})\Big) : \forall h_n \in H_n,\ \mathbf{P}_{h_n}(a_n, s_{n+1}, \ldots, a_{N-1}, s_N) = \prod_{t=n}^{N-1} p_{h_t}(a_t, s_{t+1}),\ p_{h_t} \in \mathcal{T}^{d_t},\ t = n, \ldots, N-1 \Big\}$$
$$= \mathcal{T}^{d_n} \times \mathcal{T}^{d_{n+1}} \times \cdots \times \mathcal{T}^{d_{N-1}} = \mathcal{T}^{d_n} \times \mathcal{T}_{n+1}^\pi. \qquad (9)$$

Let $V_n^*(h_n)$ denote the optimal reward starting from the history $h_n$ at the epoch $n$, i.e.
$$V_n^*(h_n) = \sup_{\pi \in \Pi_n} \big\{ V_n^\pi(h_n) \big\} = \sup_{\pi \in \Pi_n} \inf_{\mathbf{P} \in \mathcal{T}_n^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t=n}^{N-1} r_t(s_t, d_t(h_t), s_{t+1}) + r_N(s_N) \Big], \qquad (10)$$
where Πn is the set of all history dependent randomized policies for epochs t ≥ n.

Theorem 1 (Bellman equation) The set of functions $\{V_n^* : n = 0, 1, \ldots, N\}$ satisfies the following robust Bellman equation:
$$V_N^*(h_N) = r_N(s_N),$$
$$V_n^*(h_n) = \sup_{a \in A(s_n)} \inf_{p \in \mathcal{P}(s_n, a)} \mathbf{E}^p\Big[ r_n(s_n, a, s) + V_{n+1}^*(h_n, a, s) \Big], \quad n = 0, \ldots, N-1. \qquad (11)$$

Proof: From (9) it follows that
$$V_n^*(h_n) = \sup_{\pi \in \Pi_n} \inf_{\mathbf{P} = (p, \bar{\mathbf{P}}) \in \mathcal{T}^{d_n} \times \mathcal{T}_{n+1}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t=n}^{N-1} r_t(s_t, d_t(h_t), s_{t+1}) + r_N(s_N) \Big].$$
Since the conditional measures $\bar{\mathbf{P}}$ do not affect the first term $r_n(s_n, d_n(h_n), s_{n+1})$, we have:
$$V_n^*(h_n) = \sup_{\pi \in \Pi_n} \inf_{(p, \bar{\mathbf{P}}) \in \mathcal{T}^{d_n} \times \mathcal{T}_{n+1}^\pi} \mathbf{E}^p\bigg[ r_n(s_n, d_n(h_n), s_{n+1}) + \mathbf{E}^{\bar{\mathbf{P}}}\Big[ \sum_{t=n+1}^{N-1} r_t(s_t, d_t(h_t), s_{t+1}) + r_N(s_N) \Big] \bigg]$$
$$= \sup_{\pi \in \Pi_n} \inf_{p \in \mathcal{T}^{d_n}} \mathbf{E}^p\bigg[ r_n(s_n, d_n(h_n), s_{n+1}) + \inf_{\bar{\mathbf{P}} \in \mathcal{T}_{n+1}^\pi} \mathbf{E}^{\bar{\mathbf{P}}}\Big[ \sum_{t=n+1}^{N-1} r_t(s_t, d_t(h_t), s_{t+1}) + r_N(s_N) \Big] \bigg]$$
$$= \sup_{\pi \in \Pi_n} \inf_{p \in \mathcal{T}^{d_n}} \mathbf{E}^p\Big[ r_n(s_n, d_n(h_n), s_{n+1}) + V_{n+1}^\pi(h_n, d_n(h_n), s_{n+1}) \Big], \qquad (12)$$
where the last equality follows from the definition of $V_{n+1}^\pi(h_{n+1})$ in (8).

Let $(d_n(h_n)(\omega), s_{n+1}(\omega))$ denote any realization of the random action-state pair corresponding to the (randomized) decision rule $d_n$. Then $V_{n+1}^\pi(h_n, d_n(h_n)(\omega), s_{n+1}(\omega)) \le V_{n+1}^*(h_n, d_n(h_n)(\omega), s_{n+1}(\omega))$. Therefore, (12) implies that
$$V_n^*(h_n) \le \sup_{\pi \in \Pi_n} \inf_{p \in \mathcal{T}^{d_n}} \mathbf{E}^p\Big[ r_n(s_n, d_n(h_n), s_{n+1}) + V_{n+1}^*(h_n, d_n(h_n), s_{n+1}) \Big]$$
$$= \sup_{d_n \in D_n} \inf_{p \in \mathcal{T}^{d_n}} \mathbf{E}^p\Big[ r_n(s_n, d_n(h_n), s_{n+1}) + V_{n+1}^*(h_n, d_n(h_n), s_{n+1}) \Big], \qquad (13)$$
where $D_n$ is the set of all history dependent decision rules at time $n$, and (13) follows from the fact that the term within the expectation only depends on $d_n \in D_n$.

Since $V_{n+1}^*(h_{n+1}) = \sup_{\pi \in \Pi_{n+1}} \{ V_{n+1}^\pi(h_{n+1}) \}$, it follows that for all $\epsilon > 0$ there exists a policy $\pi_{n+1}^\epsilon \in \Pi_{n+1}$ such that $V_{n+1}^{\pi_{n+1}^\epsilon}(h_{n+1}) \ge V_{n+1}^*(h_{n+1}) - \epsilon$, for all $h_{n+1} \in H_{n+1}$. For all $d_n \in D_n$, $(d_n, \pi_{n+1}^\epsilon) \in \Pi_n$. Therefore,
$$V_n^*(h_n) = \sup_{\pi \in \Pi_n} \inf_{p \in \mathcal{T}^{d_n}} \mathbf{E}^p\Big[ r_n(s_n, d_n(h_n), s_{n+1}) + V_{n+1}^\pi(h_n, d_n(h_n), s_{n+1}) \Big]$$
$$\ge \sup_{d_n \in D_n} \inf_{p \in \mathcal{T}^{d_n}} \mathbf{E}^p\Big[ r_n(s_n, d_n(h_n), s_{n+1}) + V_{n+1}^{\pi_{n+1}^\epsilon}(h_n, d_n(h_n), s_{n+1}) \Big]$$
$$\ge \sup_{d_n \in D_n} \inf_{p \in \mathcal{T}^{d_n}} \mathbf{E}^p\Big[ r_n(s_n, d_n(h_n), s_{n+1}) + V_{n+1}^*(h_n, d_n(h_n), s_{n+1}) \Big] - \epsilon. \qquad (14)$$
Since $\epsilon > 0$ is arbitrary, (13) and (14) imply that
$$V_n^*(h_n) = \sup_{d_n \in D_n} \inf_{p \in \mathcal{T}^{d_n}} \mathbf{E}^p\Big[ r_n(s_n, d_n(h_n), s_{n+1}) + V_{n+1}^*(h_n, d_n(h_n), s_{n+1}) \Big].$$

The definition of $\mathcal{T}^{d_n}$ in (2) implies that $V_n^*(h_n)$ can be rewritten as follows.
$$V_n^*(h_n) = \sup_{q \in \mathcal{M}(A(s_n))} \inf_{p_{s_n a} \in \mathcal{P}_n(s_n, a)} \bigg\{ \sum_{a \in A(s_n)} q(a) \Big[ \sum_{s \in S} p_{s_n a}(s) \big[ r_n(s_n, a, s) + V_{n+1}^*(h_n, a, s) \big] \Big] \bigg\}$$
$$= \sup_{q \in \mathcal{M}(A(s_n))} \bigg\{ \sum_{a \in A(s_n)} q(a) \inf_{p_{s_n a} \in \mathcal{P}_n(s_n, a)} \Big[ \sum_{s \in S} p_{s_n a}(s) \big[ r_n(s_n, a, s) + V_{n+1}^*(h_n, a, s) \big] \Big] \bigg\}$$
$$= \sup_{a \in A(s_n)} \inf_{p \in \mathcal{P}_n(s_n, a)} \bigg\{ \sum_{s \in S} p(s) \big[ r_n(s_n, a, s) + V_{n+1}^*(h_n, a, s) \big] \bigg\}, \qquad (15)$$
where (15) follows from the fact that
$$\sup_{u \in W} w(u) \ge \sum_{u \in W} q(u) w(u),$$
for all discrete sets $W$, functions $w : W \mapsto \mathbb{R}$, and probability measures $q$ on $W$.


While this paper was being prepared for publication we became aware of a technical report by Nilim and El
Ghaoui (2002) where they formulate robust solutions to finite-horizon AMDPs with finite state and action
spaces. A “robust counterpart” of the Bellman equation appears in their paper. This “robust counterpart”
reduces to the robust Bellman equation (11) provided one assumes that the set of measures P(s, a) is convex.
The convexity assumption is very restrictive, e.g. a discrete set of measures P(s, a) = {q 1 , . . . , qm } is not
convex. Moreover, they do not prove that the solution Vt (s) of the “robust counterpart” is the robust value
function, i.e. there exists a policy that achieves Vt (s). Their paper does not discuss the dynamic structure
of the ambiguity; in particular, there is no discussion of the structure of the set T π of measures consistent
with a policy. The robust Bellman equation characterizes the robust value function if and only if T π satisfies
Rectangularity, it would be impossible to claim that the solution of a recursion is the robust value function
without invoking Rectangularity is some form. In summary, while the robust solutions to AMDPs were
addressed in Nilim and El Ghaoui (2002), we provide the necessary theoretical justification for the robust
Bellman recursion and generalize the result to countably infinite state and action sets.
The following corollary establishes that one can restrict the decision maker to deterministic policies
without affecting the achievable robust reward.

Corollary 1 Let ΠD be the set of all history dependent deterministic policies. Then ΠD is adequate for
characterizing the value function Vn in the sense that for all n = 0, . . . , N − 1,
$$V_n^*(h_n) = \sup_{\pi \in \Pi_D} \big\{ V_n^\pi(h_n) \big\}.$$

Proof: This result follows from (11). The details are left to the reader.
Next, we show that it suffices to restrict oneself to deterministic Markov policies, i.e. policies where the
deterministic decision rule dt at any epoch t is a function of only the current state st .

Theorem 2 (Markov optimality) For all $n = 0, \ldots, N$, the robust value function $V_n^*(h_n)$ is a function of the current state $s_n$ alone, and $V_n^*(s_n) = \sup_{\pi \in \Pi_{MD}} \{V_n^\pi(s_n)\}$, $n \in T$, where $\Pi_{MD}$ is the set of all deterministic Markov policies. Therefore, the robust Bellman equation (11) reduces to
$$V_n^*(s_n) = \sup_{a \in A(s_n)} \inf_{p \in \mathcal{P}_n(s_n, a)} \mathbf{E}^p\Big[ r_n(s_n, a, s) + V_{n+1}^*(s) \Big], \quad n \in T. \qquad (16)$$

Proof: The result is established by induction on the epoch t. For t = N , the value function V N∗ (hN ) =
rN (sN ) and is, therefore, a function of only the current state.

Next, suppose the result holds for all $t > n$. From the Bellman equation (11) we have
$$V_n^*(h_n) = \sup_{a \in A(s_n)} \inf_{p \in \mathcal{P}_n(s_n, a)} \mathbf{E}^p\Big[ r_n(s_n, a, s) + V_{n+1}^*(h_n, a, s) \Big]$$
$$= \sup_{a \in A(s_n)} \inf_{p \in \mathcal{P}_n(s_n, a)} \mathbf{E}^p\Big[ r_n(s_n, a, s) + V_{n+1}^*(s) \Big], \qquad (17)$$
where (17) follows from the induction hypothesis. Since the right hand side of (17) depends on $h_n$ only via $s_n$, the result follows.

The recursion relation (16) forms the basis for robust DP. This relation establishes that, provided $V_{n+1}^*(s')$ is known for all $s' \in S$, computing $V_n^*(s)$ reduces to a collection of optimization problems. Suppose the action set $A(s)$ is finite. Then the optimal decision rule $d_n^*$ at epoch $n$ is given by
$$d_n^*(s) = \operatorname*{argmax}_{a \in A(s)} \inf_{p \in \mathcal{P}_n(s, a)} \mathbf{E}^p\Big[ r_n(s, a, s') + V_{n+1}^*(s') \Big].$$
Hence, in order to compute the value function $V_n^*$ efficiently one must be able to efficiently solve the optimization problem $\inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p[v]$ for a specified $s \in S$, $a \in A(s)$ and $v \in \mathbb{R}^{|S|}$. In Section 4 we describe three families of sets $\mathcal{P}(s,a)$ of conditional measures for which $\inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p[v]$ can be solved efficiently.
As noted in Remark 1, Theorem 2 implies the following result for the optimistic value function V̄n∗ .

Theorem 3 For $n = 0, \ldots, N$, the optimistic value function $\bar{V}_n^*(h_n)$ is a function of the current state $s_n$ alone, and
$$\bar{V}_n^*(s_n) = \sup_{\pi \in \Pi_{MD}} \big\{ \bar{V}_n^\pi(s_n) \big\}, \quad n \in T,$$
where $\Pi_{MD}$ is the set of all deterministic Markov policies. Therefore,
$$\bar{V}_n^*(s_n) = \sup_{a \in A(s_n)} \sup_{p \in \mathcal{P}_n(s_n, a)} \mathbf{E}^p\Big[ r_n(s_n, a, s) + \bar{V}_{n+1}^*(s) \Big], \quad n \in T. \qquad (18)$$

3 Infinite horizon robust dynamic programming


In this section we formulate robust infinite horizon DP with a discounted reward criterion and
describe methods for solving this problem. Robust infinite horizon DP with finite state and action spaces
was addressed in Satia (1968); Satia and Lave (1973). A special case of the robust DP where the decision
maker is restricted to stationary policies appears in Bagnell et al. (2001). We will contrast our contributions
with the previous work as we establish the main results of this section.
The setup is similar to the one introduced in Section 2. As before, we assume that the decisions epochs
are discrete, however now the set T = {0, 1, 2, . . .} = Z+ . The system state s ∈ S, where S is assumed
to be discrete, and in state s ∈ S the decision maker is allowed to take a randomized action chosen from a
discrete set A(s). As the notation suggests, in this section we assume that the state space is not a function
of the decision epoch t ∈ T .
Unlike in the finite horizon setting, we assume that the set of conditional measures P(s, a) ⊆ M(S) is not
a function of the decision epoch t ∈ T . We continue to assume that the set T π of measures consistent with a
policy $\pi$ satisfies Rectangularity, i.e. $\mathcal{T}^\pi = \prod_{t \in T} \mathcal{T}^{d_t}$. Note that Rectangularity implies that the adversary is
allowed to choose a possibly different conditional measure p ∈ P(s, a) every time the state-action pair (s, a)
is encountered. Hence we will refer to this adversary model as the dynamic model. In many applications
of robust DP the transition probability is, in fact, fixed but the decision maker is only able to estimate it to within a set. In such situations the dynamic model is not appropriate (see Appendix A for a discussion).
Instead, one would prefer a static model where the adversary is restricted to choose the same, but unknown,
psa ∈ P(s, a) every time the state-action pair (s, a) is encountered. We contrast the implications of the two
models in Lemma 3. Bagnell et al. (2001) also has some discussion on this issue.
As before, the reward r(st , at , st+1 ) is a function of the current state st , the action at ∈ A(st ), and the
future state st+1 ; however, it is not a function of the decision epoch t. We will also assume that the reward
is bounded, i.e. $\sup_{s,s' \in S, a \in A(s)} \{r(s,a,s')\} = R < \infty$. The reward $V_\lambda^\pi(s)$ received by employing a policy $\pi$ when the initial state $s_0 = s$ is given by
$$V_\lambda^\pi(s) = \inf_{\mathbf{P} \in \mathcal{T}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t=0}^{\infty} \lambda^t r(s_t, d_t(h_t), s_{t+1}) \Big], \qquad (19)$$
where $\lambda \in (0,1)$ is the discount factor. It is clear that for all policies $\pi$, $\sup_{s \in S} \{V_\lambda^\pi(s)\} \le \frac{R}{1-\lambda}$. The optimal
reward in state $s$ is given by
$$V_\lambda^*(s) = \sup_{\pi \in \Pi} \big\{ V_\lambda^\pi(s) \big\} = \sup_{\pi \in \Pi} \inf_{\mathbf{P} \in \mathcal{T}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t=0}^{\infty} \lambda^t r(s_t, d_t(h_t), s_{t+1}) \Big], \qquad (20)$$

where Π is the set of all history dependent randomized policies. The optimistic value function V̄λ∗ can be
defined as follows. ½ hX
∞ i¾
∗ P t
V̄λ (s) = sup sup E λ r(st , dt (ht ), st+1 ) . (21)
π∈Π P∈T π t=0
As noted in Remark 1, all the results in this section imply a corresponding result for the optimistic value
function V̄λ∗ with the inf P∈T π (·) replaced by supP∈T π (·).
The following result is the infinite horizon counterpart of Theorem 2.
Theorem 4 (Markov optimality) The decision maker can be restricted to deterministic Markov policies
without any loss in performance, i.e. Vλ∗ (s) = supπ∈ΠM D {Vλπ (s)}, where ΠM D is the set of all deterministic
Markov policies.
Proof: Since P(s, a) only depends on the current state-action pair, this result follows from robust extensions
of Theorem 5.5.1, Theorem 5.5.3 and Proposition 6.2.1 in Puterman (1994).
Let $\mathcal{V}$ denote the set of all bounded real valued functions on the discrete set $S$. Let $\|V\|$ denote the $L_\infty$ norm on $\mathcal{V}$, i.e.
$$\|V\| = \max_{s \in S} |V(s)|.$$
Then $(\mathcal{V}, \|\cdot\|)$ is a Banach space. Let $D$ be any subset of all deterministic Markov decision rules. Define the robust Bellman operator $\mathcal{L}_D$ on $\mathcal{V}$ as follows: For all $V \in \mathcal{V}$,
$$\mathcal{L}_D V(s) = \sup_{d \in D} \inf_{p \in \mathcal{P}(s, d(s))} \mathbf{E}^p\big[ r(s, d(s), s') + \lambda V(s') \big], \quad s \in S. \qquad (22)$$

Theorem 5 (Bellman equation) The operator $\mathcal{L}_D$ satisfies the following properties:

(a) The operator $\mathcal{L}_D$ is a contraction mapping on $\mathcal{V}$; in particular, for all $U, V \in \mathcal{V}$,
$$\|\mathcal{L}_D U - \mathcal{L}_D V\| \le \lambda \|U - V\|. \qquad (23)$$

(b) The operator equation $\mathcal{L}_D V = V$ has a unique solution. Moreover,
$$V(s) = \sup_{\{\pi : d_t^\pi \in D\}} \inf_{\mathbf{P} \in \mathcal{T}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t=0}^{\infty} \lambda^t r(s_t, d_t(h_t), s_{t+1}) \Big],$$
where $\mathcal{T}^\pi$ is defined in (3).

Proof: Let $U, V \in \mathcal{V}$. Fix $s \in S$, and assume that $\mathcal{L}_D U(s) \ge \mathcal{L}_D V(s)$. Fix $\epsilon > 0$ and choose $d \in D$ such that for all $s \in S$,
$$\inf_{p \in \mathcal{P}(s,d(s))} \mathbf{E}^p\big[ r(s,d(s),s') + \lambda U(s') \big] \ge \mathcal{L}_D U(s) - \epsilon.$$
Choose a conditional probability measure $p_s \in \mathcal{P}(s, d(s))$, $s \in S$, such that
$$\mathbf{E}^{p_s}\big[ r(s,d(s),s') + \lambda V(s') \big] \le \inf_{p \in \mathcal{P}(s,d(s))} \mathbf{E}^p\big[ r(s,d(s),s') + \lambda V(s') \big] + \epsilon.$$
Then
$$0 \le \mathcal{L}_D U(s) - \mathcal{L}_D V(s) \le \Big( \inf_{p \in \mathcal{P}(s,d(s))} \mathbf{E}^p\big[ r(s,d(s),s') + \lambda U(s') \big] + \epsilon \Big) - \Big( \inf_{p \in \mathcal{P}(s,d(s))} \mathbf{E}^p\big[ r(s,d(s),s') + \lambda V(s') \big] \Big)$$
$$\le \Big( \mathbf{E}^{p_s}\big[ r(s,d(s),s') + \lambda U(s') \big] + \epsilon \Big) - \Big( \mathbf{E}^{p_s}\big[ r(s,d(s),s') + \lambda V(s') \big] - \epsilon \Big)$$
$$= \lambda \mathbf{E}^{p_s}[U - V] + 2\epsilon \;\le\; \lambda \mathbf{E}^{p_s}|U - V| + 2\epsilon \;\le\; \lambda \|U - V\| + 2\epsilon.$$
Repeating the argument for the case $\mathcal{L}_D U(s) \le \mathcal{L}_D V(s)$ implies that
$$|\mathcal{L}_D U(s) - \mathcal{L}_D V(s)| \le \lambda \|U - V\| + 2\epsilon, \quad \forall s \in S,$$
i.e. $\|\mathcal{L}_D U - \mathcal{L}_D V\| \le \lambda \|U - V\| + 2\epsilon$. Since $\epsilon$ was arbitrary, this establishes part (a) of the Theorem.

Since $\mathcal{L}_D$ is a contraction operator on a Banach space, the Banach fixed point theorem implies that the operator equation $\mathcal{L}_D V = V$ has a unique solution $V \in \mathcal{V}$.
Fix $\pi$ such that $d_t^\pi \in D$, for all $t \ge 0$. Then
$$V(s) = \mathcal{L}_D V(s) \ge \inf_{p_0 \in \mathcal{P}(s, d_0^\pi(s))} \mathbf{E}^{p_0}\big[ r(s, d_0^\pi(s), s_1) + \lambda V(s_1) \big] \qquad (24)$$
$$\ge \inf_{p_0 \in \mathcal{P}(s, d_0^\pi(s))} \mathbf{E}^{p_0}\Big[ r(s, d_0^\pi(s), s_1) + \lambda \inf_{p_1 \in \mathcal{P}(s_1, d_1^\pi(s_1))} \mathbf{E}^{p_1}\big[ r(s_1, d_1^\pi(s_1), s_2) + \lambda V(s_2) \big] \Big] \qquad (25)$$
$$= \inf_{\mathbf{P} \in \mathcal{T}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t=0}^{1} \lambda^t r(s_t, d_t^\pi(s_t), s_{t+1}) + \lambda^2 V(s_2) \Big], \qquad (26)$$
where (24) follows from the fact that choosing a particular action $d_0^\pi(s)$ can only lower the value of the right hand side, (25) follows by iterating the same argument once more, and (26) follows from the Rectangularity assumption. Thus, for all $n \ge 0$,
$$V(s) \ge \inf_{\mathbf{P} \in \mathcal{T}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t=0}^{n} \lambda^t r(s_t, d_t^\pi(s_t), s_{t+1}) + \lambda^{n+1} V(s_{n+1}) \Big]$$
$$= \inf_{\mathbf{P} \in \mathcal{T}^\pi} \mathbf{E}^{\mathbf{P}}\Big[ \sum_{t=0}^{\infty} \lambda^t r(s_t, d_t^\pi(s_t), s_{t+1}) + \lambda^{n+1} V(s_{n+1}) - \sum_{t=n+1}^{\infty} \lambda^t r(s_t, d_t^\pi(s_t), s_{t+1}) \Big]$$
$$\ge V^\pi(s) - \lambda^{n+1} \|V\| - \frac{\lambda^{n+1} R}{1 - \lambda},$$
where $R = \sup_{s,s' \in S, a \in A(s)} \{r(s,a,s')\} < \infty$. Since $n$ is arbitrary, it follows that
$$V(s) \ge \sup_{\{\pi : d_t^\pi \in D,\ \forall t\}} \big\{ V^\pi(s) \big\}. \qquad (27)$$
The Robust Value Iteration Algorithm:
Input: $V \in \mathcal{V}$, $\epsilon > 0$
Output: $\tilde{V}$ such that $\|\tilde{V} - V^*\| \le \frac{\epsilon}{2}$

For each $s \in S$, set $\tilde{V}(s) = \sup_{a \in A(s)} \inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p\big[ r(s,a,s') + \lambda V(s') \big]$.
while $\|\tilde{V} - V\| \ge \frac{(1-\lambda)}{4\lambda} \cdot \epsilon$ do
    $V = \tilde{V}$
    $\forall s \in S$, set $\tilde{V}(s) = \sup_{a \in A(s)} \inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p\big[ r(s,a,s') + \lambda V(s') \big]$.
end while
return $\tilde{V}$

Figure 1: Robust value iteration algorithm
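A minimal Python sketch of the algorithm in Figure 1 follows. The arguments S, A, r and inner_min form a hypothetical interface (inner_min(s, a, v) is assumed to return $\inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p[v]$ for the chosen family of sets from Section 4); they are not objects defined in the paper.

```python
import numpy as np

def robust_value_iteration(S, A, r, inner_min, lam, eps):
    """Sketch of the robust value iteration algorithm of Figure 1.

    Assumed interface: A[s] lists the actions in state s, r[s][a] is the reward
    vector (r(s, a, s'))_{s'}, inner_min(s, a, v) = inf_{p in P(s,a)} E^p[v],
    0 < lam < 1 is the discount factor and eps > 0 the accuracy target."""
    def apply_L(V):
        # One application of the robust Bellman operator (30).
        return np.array([max(inner_min(s, a, r[s][a] + lam * V) for a in A[s])
                         for s in S])

    V = np.zeros(len(S))
    V_new = apply_L(V)
    while np.max(np.abs(V_new - V)) >= (1.0 - lam) / (4.0 * lam) * eps:
        V, V_new = V_new, apply_L(V_new)
    return V_new
```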

Fix $\epsilon > 0$ and choose a deterministic decision rule $d \in D$ such that for all $s \in S$
$$V(s) = \mathcal{L}_D V(s) \le \inf_{p \in \mathcal{P}(s,d(s))} \mathbf{E}^p\big[ r(s,d(s),s') + \lambda V(s') \big] + \epsilon.$$
Consider the policy $\pi = (d, d, \ldots)$. An argument similar to the one above establishes that for all $n \ge 0$
$$V(s) \le V^\pi(s) + \lambda^n \|V\| + \frac{\epsilon}{1-\lambda}. \qquad (28)$$
Since $\epsilon$ and $n$ are arbitrary, it follows from (27) and (28) that $V(s) = \sup_{\{\pi : d_t^\pi \in D,\ \forall t\}} \{V^\pi(s)\}$.

Corollary 2 The properties of the operator $\mathcal{L}_D$ imply the following:

(a) Let $d$ be any deterministic decision rule. Then the value $V_\lambda^\pi$ of the stationary policy $\pi = (d, d, \ldots)$ is the unique solution of the operator equation
$$V(s) = \inf_{p \in \mathcal{P}(s,d(s))} \mathbf{E}^p\big[ r(s,d(s),s') + \lambda V(s') \big], \quad s \in S. \qquad (29)$$

(b) The value function $V_\lambda^*$ is the unique solution of the operator equation
$$V(s) = \sup_{a \in A(s)} \inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p\big[ r(s,a,s') + \lambda V(s') \big], \quad s \in S. \qquad (30)$$
Moreover, for all $\epsilon > 0$, there exists an $\epsilon$-optimal stationary policy, i.e. there exists $\pi^\epsilon = (d^\epsilon, d^\epsilon, \ldots)$ such that $V_\lambda^{\pi^\epsilon} \ge V_\lambda^* - \epsilon$.

Proof: The results follow by setting $D = \{d\}$ and $D = \prod_{s \in S} A(s)$ respectively.
Theorem 4 and part (b) of Corollary 2 for the special case of finite state and action spaces appear in Satia
(1968) with an additional assumption that the set of conditional measures P(s, a) is convex. (Their proof,
in fact, extends to non-convex P(s, a).) Also, they do not explicitly prove that the solution of (30) is indeed
the robust value function. Theorem 5 for general D, and in particular for D = {d}, is new. The special case
D = {d} is crucial for establishing the policy improvement algorithm.
From Theorem 5, Corollary 2 and convergence results for contraction operators on Banach spaces, it
follows that the robust value iteration algorithm displayed in Figure 1 computes an ²-optimal policy. This
algorithm is the robust analog of the value iteration algorithm for non-robust DPs (see Section 6.3.2 in
Puterman, 1994, for details). The following Lemma establishes this approximation result for the robust
value iteration algorithm.

Lemma 1 Let $\tilde{V}$ be the output of the robust value iteration algorithm shown in Figure 1. Then
$$\|\tilde{V} - V_\lambda^*\| \le \frac{\epsilon}{4},$$
where $V_\lambda^*$ is the optimal value defined in (20). Let $d$ be a decision rule such that
$$\inf_{p \in \mathcal{P}(s,d(s))} \mathbf{E}^p\big[ r(s,d(s),s') + \lambda \tilde{V}(s') \big] \ge \sup_{a \in A(s)} \inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p\big[ r(s,a,s') + \lambda \tilde{V}(s') \big] - \frac{\epsilon}{2}.$$
Then, the policy $\pi = (d, d, \ldots)$ is $\epsilon$-optimal.

Proof: Since Theorem 5 establishes that $\mathcal{L}_D$ is a contraction operator, this result is a simple extension of Theorem 6.3.1 in Puterman (1994). The details are left to the reader.
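Extracting the near-greedy decision rule of Lemma 1 from an approximate value function is a one-liner in terms of the same hypothetical inner_min interface sketched earlier; the stationary policy built from this rule is then $\epsilon$-optimal by Lemma 1.

```python
def greedy_robust_rule(S, A, r, inner_min, lam, V_tilde):
    """Decision rule d maximizing inf_p E^p[r(s,d(s),s') + lam*V~(s')] (Lemma 1)."""
    return {s: max(A[s], key=lambda a: inner_min(s, a, r[s][a] + lam * V_tilde))
            for s in S}
```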
Suppose the action set $A(s)$ is finite. Then robust value iteration reduces to
$$\tilde{V}(s) = \max_{a \in A(s)} \inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p\big[ r(s,a,s') + \lambda V(s') \big].$$

For this iteration to be efficient one must be able to efficiently solve the optimization problem inf p∈P(s,a) Ep [v]
for a specified s ∈ S, a ∈ A(s) and v ∈ R|S| . These optimization problems are identical to those solved
in finite state problems. In Section 4 we show that for suitable choices for the set P(s, a) of conditional
measures the complexity of solving such problems is only modestly larger than evaluating E p [v] for a fixed p.
We next present a policy iteration approach for computing Vλ∗ . As a first step, Lemma 2 below establishes
that policy evaluation is a robust optimization problem.

Lemma 2 (Policy evaluation) Let $d$ be a deterministic decision rule and $\pi = (d, d, \ldots)$ be the corresponding stationary policy. Then $V^\pi$ is the optimal solution of the robust optimization problem
$$\begin{array}{ll} \text{maximize} & \sum_{s \in S} \alpha(s) V(s), \\ \text{subject to} & V(s) \le \mathbf{E}^p[r_s + \lambda V], \quad \forall p \in \mathcal{P}(s, d(s)),\ s \in S, \end{array} \qquad (31)$$
where $\alpha(s) > 0$, $s \in S$, and $r_s \in \mathbb{R}^{|S|}$ with $r_s(s') = r(s, d(s), s')$, $s' \in S$.

Proof: The constraint in (31) can be restated as $V \le \mathcal{L}_d V$, where $\mathcal{L}_d = \mathcal{L}_D$ with $D = \{d\}$. Corollary 2 implies that $V^\pi = \mathcal{L}_d V^\pi$, i.e. $V^\pi$ is feasible for (31). Therefore, the optimal value of (31) is at least $\sum_{s \in S} \alpha(s) V^\pi(s)$.
Fix $\epsilon > 0$. For every $s \in S$, choose $p_s \in \mathcal{P}(s, d(s))$ such that
$$V^\pi(s) = \mathcal{L}_d V^\pi(s) \ge \mathbf{E}^{p_s}\big[ r(s,d(s),s') + \lambda V^\pi(s') \big] - \epsilon.$$
Then for any $V$ feasible for (31)
$$V(s) - V^\pi(s) \le \mathbf{E}^{p_s}\big[ r(s,d(s),s') + \lambda V(s') \big] - \Big( \mathbf{E}^{p_s}\big[ r(s,d(s),s') + \lambda V^\pi(s') \big] - \epsilon \Big) = \lambda \mathbf{E}^{p_s}\big[ V(s') - V^\pi(s') \big] + \epsilon.$$
Iterating this argument for $n$ time steps, we get the bound
$$V(s) - V^\pi(s) \le \lambda^n \|V - V^\pi\| + \frac{\epsilon}{1-\lambda}.$$
Since $n$ and $\epsilon$ are arbitrary, all $V$ feasible for (31) satisfy $V \le V^\pi$. Since $\alpha(s) > 0$, $s \in S$, it follows that the value of (31) is at most $\sum_{s \in S} \alpha(s) V^\pi(s)$. This establishes the result.

The Robust Policy Iteration Algorithm:
Input: decision rule $d_0$, $\epsilon > 0$
Output: $\epsilon$-optimal decision rule $d^*$

Set $n = 0$ and $\pi_n = (d_n, d_n, \ldots)$. Solve (31) to compute $V^{\pi_n}$. Set $\tilde{V} \leftarrow \mathcal{L}_D V^{\pi_n}$, $D = \prod_{s \in S} A(s)$.
For each $s \in S$, choose
$$d_{n+1}(s) \in \Big\{ a \in A(s) : \inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p\big[ r(s,a,s') + \lambda V^{\pi_n}(s') \big] \ge \tilde{V}(s) - \epsilon \Big\};$$
setting $d_{n+1}(s) = d_n(s)$ if possible.
while ($d_{n+1} \ne d_n$) do
    $n = n + 1$; Solve (31) to compute $V^{\pi_n}$. Set $\tilde{V} \leftarrow \mathcal{L}_D V^{\pi_n}$, $D = \prod_{s \in S} A(s)$.
    For each $s \in S$, choose
    $$d_{n+1}(s) \in \Big\{ a \in A(s) : \inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p\big[ r(s,a,s') + \lambda V^{\pi_n}(s') \big] \ge \tilde{V}(s) - \epsilon \Big\};$$
    setting $d_{n+1}(s) = d_n(s)$ if possible.
end while
return $d_{n+1}$

Figure 2: Robust policy iteration algorithm

Since Ep [rs +λV ] is a linear function of p, (31) is a convex optimization problem. Typically, (31) can be solved
efficiently only if S is finite and the robust constraint can be reformulated as a small collection of deterministic
constraints. In Section 4 we introduce some natural candidates for the set P(s, a) of conditional measures.
Dualizing the constraints in (31) leads to a compact representation for some of these sets. However, for most
practical applications, the policy evaluation step is computationally expensive and is usually replaced by an m-step look-ahead value iteration (Puterman, 1994).
Lemma 1 leads to the robust policy iteration algorithm displayed in Figure 2. Suppose (31) is efficiently
solvable; then finite convergence of this algorithm for the special case of finite state and action spaces follows
from Theorem 6.4.2 in Puterman (1994). A rudimentary version of the robust policy iteration algorithm for this
special case appears in Satia and Lave (1973) (see also Satia, 1968). They compute the value of a policy
π = (d, d, . . .), i.e. solve the robust optimization problem (31), via the following iterative procedure:

(a) For every $s \in S$, fix $p_s \in \mathcal{P}(s, d(s))$. Solve the set of equations
$$V(s) = \mathbf{E}^{p_s}\big[ r(s,d(s),s') + \lambda V(s') \big], \quad s \in S.$$
Since $\lambda < 1$, this set of equations has a unique solution (see Theorem 6.1.1 in Puterman, 1994).

(b) Fix $V$, and solve
$$\tilde{p}(s) \leftarrow \operatorname*{argmin}_{p \in \mathcal{P}(s,d(s))} \Big\{ \mathbf{E}^p\big[ r(s,d(s),s') + \lambda V(s') \big] \Big\}, \quad s \in S.$$
If $V(s) = \mathbf{E}^{\tilde{p}_s}\big[ r(s,d(s),s') + \lambda V(s') \big]$, for all $s \in S$, stop; otherwise, $p(s) \leftarrow \tilde{p}(s)$, $s \in S$, return to (a).

However, it is not clear, and Satia and Lave (1973) do not show, that this iterative procedure converges.
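For completeness, a Python sketch of the procedure (a)–(b) above follows. The routine argmin_p(s, v), which returns a minimizing measure in P(s, d(s)), is a hypothetical interface, and, as the text notes, convergence of the loop has not been established.

```python
import numpy as np

def satia_lave_policy_value(S, r_d, argmin_p, lam, tol=1e-8, max_iter=1000):
    """Iterative evaluation of a stationary policy pi = (d, d, ...), following
    steps (a)-(b) of Satia and Lave (1973) as described in the text.

    Assumed interface: r_d[s] is the reward vector (r(s, d(s), s'))_{s'} and
    argmin_p(s, v) returns some p in P(s, d(s)) minimizing E^p[v]."""
    n = len(S)
    # Initial selection of conditional measures (any p_s in P(s, d(s)) will do).
    P = np.array([argmin_p(s, r_d[s]) for s in S])
    V = np.zeros(n)
    for _ in range(max_iter):
        # (a) Solve V(s) = E^{p_s}[r(s,d(s),s') + lam*V(s')], i.e. (I - lam*P)V = c.
        c = np.array([P[s] @ r_d[s] for s in S])
        V = np.linalg.solve(np.eye(n) - lam * P, c)
        # (b) Re-optimize the adversary's measure for the current V.
        P_new = np.array([argmin_p(s, r_d[s] + lam * V) for s in S])
        # Stopping test of step (b): V is already a fixed point for the new measures.
        if np.allclose(V, np.array([P_new[s] @ (r_d[s] + lam * V) for s in S]),
                       atol=tol):
            return V
        P = P_new
    return V
```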

Given the relative ease with which value iteration and policy iteration translate to the robust setting, one
might attempt to solve the robust DP by the following natural analog of the linear programming method
for DP (Puterman, 1994):
$$\begin{array}{ll} \text{maximize} & \sum_{s \in S} \alpha(s) V(s), \\ \text{subject to} & V(s) \ge \inf_{p \in \mathcal{P}(s,a)} \mathbf{E}^p\big[ r(s,a,s') + \lambda V(s') \big], \quad a \in A(s),\ s \in S. \end{array} \qquad (32)$$

Unfortunately, (32) is not a convex optimization problem. Hence, the LP method does not appear to have
a tractable analog in the robust setting.
Recall that in the beginning of this section we had proposed two models for the adversary. The first was
a dynamic model where the set T π of measures consistent with a policy π satisfies Rectangularity. So far we have
assumed that this model prevails. In the second, static model, the adversary was restricted to employing
a fixed psa ∈ P(s, a) whenever the state-action pair (s, a) is encountered. The last result in this section
establishes that if the decision maker is restricted to stationary policies the implications of the static and
dynamic models are, in fact, identical.

Lemma 3 (Dynamic vs Static adversary) Let $d$ be any decision rule and let $\pi = (d, d, \ldots)$ be the corresponding stationary policy. Let $V_\lambda^\pi$ and $\widehat{V}_\lambda^\pi$ be the value of $\pi$ in the dynamic and static model respectively. Then $\widehat{V}_\lambda^\pi = V_\lambda^\pi$.

Proof: We prove the result for deterministic decision rules. The same technique extends to randomized policies but the notation becomes complicated.
Clearly $\widehat{V}_\lambda^\pi \ge V_\lambda^\pi$. Thus, we only need to establish that $\widehat{V}_\lambda^\pi \le V_\lambda^\pi$. Fix $\epsilon > 0$ and choose $\bar{p} : S \mapsto \mathcal{M}(S)$ such that $\bar{p}_s \in \mathcal{P}(s, d(s))$, for all $s \in S$, and $V_\lambda^\pi(s) \ge \mathbf{E}^{\bar{p}_s}\big[ r(s,d(s),s') + \lambda V_\lambda^\pi(s') \big] - \epsilon$. Let $V_{\lambda,\bar{p}}^\pi$ denote the non-robust value of the policy $\pi$ corresponding to the fixed conditional measure $\bar{p}$. Clearly $V_{\lambda,\bar{p}}^\pi \ge \widehat{V}_\lambda^\pi$. Thus, the result will follow if we show that $V_{\lambda,\bar{p}}^\pi \le V_\lambda^\pi$.
From results for non-robust DP we have that $V_{\lambda,\bar{p}}^\pi(s) = \mathbf{E}^{\bar{p}_s}\big[ r(s,d(s),s') + \lambda V_{\lambda,\bar{p}}^\pi(s') \big]$. Therefore,
$$\big( V_{\lambda,\bar{p}}^\pi - V_\lambda^\pi \big)(s) \le \mathbf{E}^{\bar{p}_s}\big[ r(s,d(s),s') + \lambda V_{\lambda,\bar{p}}^\pi(s') \big] - \Big( \mathbf{E}^{\bar{p}_s}\big[ r(s,d(s),s') + \lambda V_\lambda^\pi(s') \big] - \epsilon \Big) = \lambda \mathbf{E}^{\bar{p}_s}\big[ V_{\lambda,\bar{p}}^\pi(s') - V_\lambda^\pi(s') \big] + \epsilon.$$
Iterating this bound for $n$ time steps, we get
$$V_{\lambda,\bar{p}}^\pi(s) - V_\lambda^\pi(s) \le \lambda^n \|V_{\lambda,\bar{p}}^\pi - V_\lambda^\pi\| + \frac{\epsilon}{1-\lambda}.$$
Since $n$ and $\epsilon$ are arbitrary, it follows that $V_{\lambda,\bar{p}}^\pi \le V_\lambda^\pi$.
In the proof of the result we have implicitly established that the “best-response” of a dynamic adversary
when the decision maker employs a stationary policy is, in fact, static, i.e. the adversary chooses the same
psa ∈ P(s, a) every time the pair (s, a) is encountered. Consequently, the optimal stationary policy in a
static model can be computed by solving (30). Bagnell et al. (2001) establish that when the set P(s, a) of
conditional measures is convex and the decision maker is restricted to stationary policies the optimal policies
for the decision maker and the adversary are the same in both the static and dynamic models. We extend this
result to non-convex sets. In addition we show that the value of any stationary policy, optimal or otherwise,
is the same in both models. While solving (30) is, in general, NP-complete (Littman, 1994), the problem is
tractable provided the sets P(s, a) are “nice” convex sets. In particular, the problem is tractable for the
families of sets discussed in Section 4.
Lemma 3 highlights an interesting asymmetry between the decision maker and the adversary that is a
consequence of the fact that the adversary plays second. While it is optimal for a dynamic adversary to play

static (stationary) policies when the decision maker is restricted to stationary policies, it is not optimal for
the decision maker to play stationary policies against a static adversary. The optimal policy for the decision
maker in the static model is the so-called universal policy (Cover, 1991).

4 Tractable sets of conditional measures


Section 2 and Section 3 were devoted to extending results from non-robust DP theory. In this and the next
section we focus on computational issues. Since computations are only possible when state and action spaces
are finite (or are suitably truncated versions of infinite sets), we restrict ourselves to this special case. The
results in this section are not new and are included for completeness. They were first obtained by Nilim and El Ghaoui (2002).
In the absence of any ambiguity, the value of an action a ∈ A(s) in state s ∈ S is given by E p [v] = pT v,
where p is the conditional measure and v is a random variable that takes value v(s 0 ) = r(s, a, s0 ) + V (s0 )
in state s0 ∈ S. Thus, the complexity of evaluating the value of a state-action pair is O(|S|). When the
conditional measure is ambiguous, the value of the state-action pair (s, a) is given by inf p∈P(s,a) Ep [v].
In this section, we introduce three families of sets of conditional measures P(s, a) which only result in a
modest increase in complexity, typically logarithmic in |S|. These families of sets are constructed from
approximations of the confidence regions associated with density estimation. Two of these families are also
discussed in Nilim and El Ghaoui (2002). We distinguish our contribution in the relevant sections.
Note that since supp∈P(s,a) Ep [v] = − inf p∈P(s,a) Ep [−v], it follows that the recursion (18) for the opti-
mistic value function can also be computed efficiently for these families of sets.

4.1 Sets based on relative entropy


As mentioned in the introduction, the motivation for the robust methodology was to systematically correct
for the statistical errors associated with estimating the transition probabilities using historical data. Thus, a
natural choice for the sets P(s, a) of conditional measures are the confidence regions associated with density
estimation. In this section, we show how to construct such sets for any desired confidence level ω ∈ (0, 1).
We also show that the optimization problem inf p∈P(s,a) Ep [v] can be efficiently solved for this class of sets.
Suppose the underlying controlled Markov chain is stationary. Suppose also that we have historical data
consisting of triples $\{(s_j, a_j, s'_j) : j \ge 1\}$, with the interpretation that state $s'_j$ was observed in period $t+1$
when the action $a_j$ was employed in state $s_j$ in period $t$. Then the maximum likelihood estimate $\hat{p}_{sa}$ of the conditional measure corresponding to the state-action pair $(s,a)$ is given by
$$\hat{p}_{sa} = \operatorname*{argmax}_{p \in \mathcal{M}(S)} \Big\{ \sum_{s' \in S} n(s'|s,a) \log(p(s')) \Big\}, \qquad (33)$$
where
$$n(s'|s,a) = \sum_j \mathbf{1}\big( (s,a,s') = (s_j, a_j, s'_j) \big),$$
is the number of samples of the triple $(s,a,s')$. Let $q \in \mathcal{M}(S)$ be defined as
$$q(s') = \frac{n(s'|s,a)}{\sum_{u \in S} n(u|s,a)}, \quad s' \in S.$$
Then, (33) is equivalent to
$$\hat{p}_{sa} = \operatorname*{argmin}_{p \in \mathcal{M}(S)} D(q \| p), \qquad (34)$$

where $D(p_1 \| p_2)$ is the Kullback-Leibler or the relative entropy distance (see Chapter 2 in Cover and Thomas, 1991) between two measures $p_1, p_2 \in \mathcal{M}(S)$ and is defined as follows:
$$D(p_1 \| p_2) = \sum_{s \in S} p_1(s) \log\Big( \frac{p_1(s)}{p_2(s)} \Big). \qquad (35)$$
The function $D(p_1 \| p_2) \ge 0$ with equality if and only if $p_1 = p_2$ (however, $D(p_1 \| p_2) \ne D(p_2 \| p_1)$). Thus, we have that the maximum likelihood estimate of the conditional measure is given by
$$\hat{p}_{sa}(s') = q(s') = \frac{n(s'|s,a)}{\sum_{u \in S} n(u|s,a)}, \quad s' \in S,\ s \in S,\ a \in A(s). \qquad (36)$$

More generally, let $g^j : S \mapsto \mathbb{R}$, $j = 1, \ldots, k$ be $k$ functions defined on the state space $S$ (typically, $g^j(s) = s^j$, i.e. the $j$-th moment) and
$$\bar{g}_{sa}^j = \frac{1}{n_{sa}} \sum_{s' \in S} n(s'|s,a)\, g^j(s'), \quad j = 1, \ldots, k,$$
be the sample averages of the moments corresponding to the state-action pair $(s,a)$. Let $p_{sa}^0 \in \mathcal{M}(S)$ be the prior distribution on $S$ conditioned on the state-action pair $(s,a)$. Then the maximum likelihood solution $\hat{p}_{sa}$ is given by
$$\hat{p}_{sa} = \operatorname*{argmin}_{\{p \in \mathcal{M}(S) : \mathbf{E}^p[g^j] = \bar{g}_{sa}^j,\ j = 1, \ldots, k\}} D(p \| p^0_{sa}) \qquad (37)$$
provided the set $\{p \in \mathcal{M}(S) : \mathbf{E}^p[g^j] = \bar{g}_{sa}^j,\ j = 1, \ldots, k\} \ne \emptyset$.
Let $p_{sa}$, $a \in A(s)$, $s \in S$ denote the unknown true state transition of the stationary Markov chain. Then a standard result in statistical information theory (see Cover and Thomas, 1991, for details) implies the following convergence in probability:
$$n_{sa}\, D(p_{sa} \| \hat{p}_{sa}) \Longrightarrow \frac{1}{2}\chi^2_{|S|-1}, \qquad (38)$$
where $n_{sa} = \sum_{s' \in S} n(s'|s,a)$ is the number of samples of the state-action pair $(s,a)$ and $\chi^2_{|S|-1}$ denotes a $\chi^2$ random variable with $|S|-1$ degrees of freedom (note that the maximum likelihood estimate $\hat{p}_{sa}$ is, itself, a function of the sample size $n_{sa}$). Therefore,
$$\mathbf{P}\big\{ p : D(p \| \hat{p}_{sa}) \le t \big\} \approx \mathbf{P}\big\{ \chi^2_{|S|-1} \le 2 n_{sa} t \big\} = F_{|S|-1}(2 n_{sa} t).$$

Let $\omega \in (0,1)$ and $t_\omega = F^{-1}_{|S|-1}(\omega)/(2 n_{sa})$. Then
$$\mathcal{P} = \Big\{ p \in \mathcal{M}(S) : D(p \| \hat{p}_{sa}) \le t_\omega \Big\}, \qquad (39)$$
is the $\omega$-confidence set for the true transition probability $p_{sa}$. Since $D(p\|q)$ is a convex function of the pair $(p,q)$ (Cover and Thomas, 1991), $\mathcal{P}$ is convex for all $t \ge 0$.
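A short sketch of how the estimate (36) and the radius $t_\omega$ are computed from transition counts follows; the function name and the use of scipy's chi-squared quantile are illustrative choices, not prescribed by the paper.

```python
import numpy as np
from scipy.stats import chi2

def relative_entropy_ball(counts, omega):
    """Maximum likelihood estimate (36) and radius t_omega of the set (39).

    counts[s'] = n(s'|s, a) for a fixed state-action pair (s, a);
    omega in (0, 1) is the desired confidence level."""
    counts = np.asarray(counts, dtype=float)
    n_sa = counts.sum()
    p_hat = counts / n_sa                              # estimate (36)
    t_omega = chi2.ppf(omega, df=len(counts) - 1) / (2.0 * n_sa)
    return p_hat, t_omega
```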
The following results establish that an ²-approximate solution for the robust problem corresponding to
the set P in (39) can be computed efficiently.

Lemma 4 The value of the optimization problem
$$\begin{array}{ll} \text{minimize} & \mathbf{E}^p[v] \\ \text{subject to} & p \in \mathcal{P} = \big\{ p \in \mathcal{M}(S) : D(p \| q) \le t,\ q \in \mathcal{M}(S) \big\}, \end{array} \qquad (40)$$
where $t > 0$, is equal to
$$-\min_{\gamma \ge 0} \Big\{ t\gamma + \gamma \log \mathbf{E}^q\big[ \exp\big( -\tfrac{v}{\gamma} \big) \big] \Big\}. \qquad (41)$$
The complexity of computing an $\epsilon$-optimal solution for (41) is $O\Big( |S| \Big\lceil \log_2 \frac{\Delta v \max\{t, |t + \log(q_{\min})|\}}{2 \epsilon t} \Big\rceil \Big)$, where $\Delta v = \max_{s \in S}\{v(s)\} - \min_{s \in S}\{v(s)\}$ and $q_{\min} = \mathbf{P}(v(s) = \min\{v\})$.

Proof: The Lagrangian $L$ for the optimization problem (40) is given by
$$L = \sum_{s \in S} p(s) v(s) - \gamma \Big( t - \sum_{s \in S} p(s) \log\Big( \frac{p(s)}{q(s)} \Big) \Big) - \mu \Big( \sum_{s \in S} p(s) - 1 \Big), \qquad (42)$$
where $\gamma \ge 0$ and $\mu \in \mathbb{R}$. Taking the derivative of $L$ with respect to $p(s)$ and setting it to zero, we get
$$v(s) + \gamma \Big( \log\Big( \frac{p(s)}{q(s)} \Big) + 1 \Big) - \mu = 0, \quad s \in S,$$
i.e.
$$\gamma \log\Big( \frac{p(s)}{q(s)} \Big) + v(s) = \mu - \gamma. \qquad (43)$$
From (43) it follows that
$$p(s) = q(s) \exp\Big( -1 + \frac{\mu - v(s)}{\gamma} \Big), \quad s \in S. \qquad (44)$$
Thus, the non-negativity constraints $p(s) \ge 0$ are never active. Since $p$ is constrained to be a probability, i.e. $\sum_{s \in S} p(s) = 1$, (43) implies that the Lagrangian
$$L = \mu - \gamma - \gamma t.$$
Also, (44) together with the fact that $p \in \mathcal{M}(S)$ implies that
$$\mu - \gamma = -\gamma \log\Big( \sum_{s \in S} q(s) e^{-\frac{v(s)}{\gamma}} \Big).$$
Thus, the Lagrangian $L(\gamma) = -t\gamma - \gamma \log\big( \mathbf{E}^q\big[ \exp\big( -\tfrac{v}{\gamma} \big) \big] \big)$. For $t > 0$ the set $\mathcal{P}$ in (40) has a strictly feasible point; therefore, the value of (40) is equal to $\max_{\gamma \ge 0} L(\gamma)$.
Suppose $v(s) = v$ for all $s \in S$. Then, the value of (40) is trivially $v$. Next assume that there exist $s, s' \in S$ such that $v(s) \ne v(s')$. In this case, by suitably shifting and scaling the vector $v$, one can assume that $v(s) \in [0,1]$, $\min_{s \in S}\{v(s)\} = 0$ and $\max_{s \in S}\{v(s)\} = 1$. Note that this shifting and scaling is an $O(|S|)$ operation.
Let $f(\gamma) = \gamma t + \gamma \log\big( \sum_{s \in S} q(s) e^{-\frac{v(s)}{\gamma}} \big)$ denote the objective function of (41). Then $f(\gamma)$ is convex and
$$f'(\gamma) = t + \log\Big( \sum_{s \in S} q(s) e^{-\frac{v(s)}{\gamma}} \Big) + \frac{1}{\gamma} \cdot \frac{\sum_{s \in S} q(s) v(s) e^{-\frac{v(s)}{\gamma}}}{\sum_{s \in S} q(s) e^{-\frac{v(s)}{\gamma}}}.$$
Since $f(\gamma)$ is convex, it follows that $f'(\gamma)$ is non-decreasing. It is easy to verify that $f'(0) = \lim_{\gamma \to 0} f'(\gamma) = t + \log(q_{\min})$, where $q_{\min} = \mathrm{Prob}(v = 0)$, $f'(\tfrac{1}{t}) > 0$ and $|f'(\gamma)| \le \max\{t, |t + \log(q_{\min})|\}$.
Clearly, $\gamma = 0$ is optimum if $f'(0) \ge 0$. Otherwise, the optimum value lies in the interval $[0, \tfrac{1}{t}]$, and after $N$ iterations of a bisection algorithm the optimum value $\gamma^*$ is guaranteed to lie in an interval $[\gamma_1, \gamma_2]$ with $\gamma_2 - \gamma_1 \le \tfrac{1}{t} 2^{-N}$. Let $\bar{\gamma} = \tfrac{1}{2}(\gamma_1 + \gamma_2)$. Then
$$f(\gamma^*) - f(\bar{\gamma}) \ge -\frac{\epsilon}{2} |f'(\bar{\gamma})| \ge -\frac{\epsilon}{2} \max\{t, |t + \log(q_{\min})|\}. \qquad (45)$$
Thus, it follows that an $\epsilon$-optimal solution of $\min_\gamma f(\gamma)$ can be computed in $\big\lceil \log_2 \frac{\max\{t, |t + \log(q_{\min})|\}}{2 \epsilon t} \big\rceil$ bisections. The result follows by recognizing that each evaluation of $f'(\gamma)$ is an $O(|S|)$ operation.
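The following sketch evaluates (41) numerically by minimizing the one-dimensional convex function $f(\gamma)$ over the interval $[0, 1/t]$ identified in the proof. Using a bounded scalar minimizer from scipy in place of an explicit bisection on $f'(\gamma)$ is an implementation convenience, not the paper's prescription.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.optimize import minimize_scalar

def inner_min_entropy(v, q, t):
    """Value of min { E^p[v] : D(p||q) <= t } via the dual (41), assuming t > 0."""
    v = np.asarray(v, dtype=float)
    q = np.asarray(q, dtype=float)

    def f(gamma):
        # f(gamma) = t*gamma + gamma * log E^q[exp(-v/gamma)], evaluated stably.
        return t * gamma + gamma * logsumexp(-v / gamma, b=q)

    # Limit as gamma -> 0+: f tends to minus the minimum of v on the support of q.
    f_limit = -np.min(v[q > 0])
    res = minimize_scalar(f, bounds=(1e-12, 1.0 / t), method="bounded")
    return -min(f_limit, res.fun)
```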
As mentioned above, relative entropy-based sets of conditional measures of the form (39) were first introduced
in Nilim and El Ghaoui (2002). Our analysis is different but essentially equivalent to their approach.

4.2 Sets based on L2 approximations for the relative entropy


In this section we consider conservative approximations for the relative entropy sets. For this family of sets the optimization $\inf_{p \in \mathcal{P}} \mathbf{E}^p[v]$ can be solved to optimality in $O(|S| \log(|S|))$ time.
Since $\log(1+x) \le x$ for all $x \in \mathbb{R}$, it follows that
$$D(p \| q) = \sum_{s \in S} p(s) \log\Big( \frac{p(s)}{q(s)} \Big) \le \sum_{s \in S} p(s) \cdot \Big( \frac{p(s) - q(s)}{q(s)} \Big) = \sum_{s \in S} \frac{(p(s) - q(s))^2}{q(s)}.$$
Thus, a conservative approximation for the uncertainty set defined in (39) is given by
$$\mathcal{P} = \Big\{ p \in \mathcal{M}(S) : \sum_{s \in S} \frac{(p(s) - q(s))^2}{q(s)} \le t \Big\}. \qquad (46)$$

Lemma 5 The value of the optimization problem
$$\begin{array}{ll} \text{minimize} & \mathbf{E}^p[v] \\ \text{subject to} & p \in \mathcal{P} = \Big\{ p \in \mathcal{M}(S) : \sum_{s \in S} \frac{(p(s)-q(s))^2}{q(s)} \le t \Big\}, \end{array} \qquad (47)$$
is equal to
$$\max_{\mu \ge 0} \Big\{ \mathbf{E}^q[v - \mu] - \sqrt{t \, \mathrm{Var}^q[v - \mu]} \Big\}, \qquad (48)$$
and the complexity of (48) is $O(|S| \log(|S|))$.


Proof: Let $y = p - q$. Then $p \in \mathcal{P}$ if and only if $\sum_s \frac{y^2(s)}{q(s)} \le t$, $\sum_s y(s) = 0$ and $y \ge -q$. Thus, the value of (47) is equal to
$$\mathbf{E}^q[v] + \begin{array}[t]{ll} \text{minimize} & \sum_s y(s) v(s), \\ \text{subject to} & \sum_s \frac{y^2(s)}{q(s)} \le t, \quad \sum_s y(s) = 0, \quad y \ge -q. \end{array} \qquad (49)$$
Lagrangian duality implies that the value of (49) is equal to
$$\mathbf{E}^q[v] + \max_{\mu \ge 0,\ \gamma \in \mathbb{R}} \ \min_{\{y : \sum_s \frac{y^2(s)}{q(s)} \le t\}} \Big\{ -\sum_{s \in S} \mu(s) q(s) + \sum_{s \in S} y(s)\big( v(s) - \gamma - \mu(s) \big) \Big\}$$
$$= \max_{\mu \ge 0,\ \gamma \in \mathbb{R}} \Big\{ \mathbf{E}^q[v - \mu] - \sqrt{t \sum_{s \in S} q(s) (v(s) - \mu(s) - \gamma)^2} \Big\}, \qquad (50)$$
$$= \max_{\mu \ge 0} \Big\{ \mathbf{E}^q[v - \mu] - \sqrt{t \, \mathrm{Var}^q[v - \mu]} \Big\}. \qquad (51)$$
This establishes that the value of (47) is equal to that of (48).

The optimum value of the inner minimization in (50) is attained at
$$y^*(s) = -\frac{\sqrt{t\, q(s)}\; z(s)}{\|z\|}, \quad s \in S,$$
where $z(s) = \sqrt{q(s)}\big( v(s) - \mu(s) - \mathbf{E}^q[v - \mu] \big)$, $s \in S$. Let $B = \{ s \in S : \mu(s) > 0 \}$. Then complementary slackness conditions imply that $y^*(s) = -q(s)$, for all $s \in B$, or equivalently,
$$v(s) - \mu(s) = \frac{\|z\|}{\sqrt{t}} + \mathbf{E}^q[v - \mu] = \alpha, \quad \forall s \in B, \qquad (52)$$
i.e. $v(s) - \mu(s)$ is a constant for all $s \in B$. Since the optimal value of (47) is at least $v_{\min} = \min_{s \in S}\{v(s)\}$, it follows that $\alpha \ge v_{\min}$.
Suppose $\alpha$ is known. Then the optimal $\mu^*$ is given by
$$\mu^*(s) = \begin{cases} v(s) - \alpha, & v(s) \ge \alpha, \\ 0, & \text{otherwise.} \end{cases} \qquad (53)$$
Thus, the dual optimization problem (48) reduces to solving for the optimal $\alpha$. To this end, let $\{\hat{v}(k) : 1 \le k \le |S|\}$ denote the values $\{v(s) : s \in S\}$ arranged in increasing order – an $O(|S|\log(|S|))$ operation. Let $\hat{q}$ denote the correspondingly sorted values of the measure $q$.
Suppose $\alpha \in [\hat{v}_n, \hat{v}_{n+1})$. Then
$$\mathbf{E}^q[v - \mu] = a_n + b_n \alpha, \qquad \mathrm{Var}^q[v - \mu] = c_n + b_n \alpha^2 - (a_n + b_n \alpha)^2,$$
where $a_n = \sum_{k \le n} \hat{q}(k)\hat{v}(k)$, $b_n = \sum_{k > n} \hat{q}(k)$ and $c_n = \sum_{k \le n} \hat{q}(k)\hat{v}^2(k)$. Note that, once the sorting is done, computing $\{(a_n, b_n, c_n) : 1 \le n \le |S|\}$ is $O(|S|)$.
The dual objective $f(\alpha)$ as a function of $\alpha$ is
$$f(\alpha) = \mathbf{E}^q[v - \mu] - \sqrt{t \, \mathrm{Var}^q[v - \mu]} = a_n + b_n \alpha - \sqrt{t\big( c_n + b_n \alpha^2 - (a_n + b_n \alpha)^2 \big)}.$$
If $\alpha$ is optimal, it must be that $f'(\alpha) = 0$, i.e. $\alpha$ is a root of the quadratic equation
$$b_n^2 \big( c_n + b_n \alpha^2 - (a_n + b_n \alpha)^2 \big) = t \big( b_n(1 - b_n)\alpha - a_n \big)^2. \qquad (54)$$
Thus, the optimal $\alpha$ can be computed by sequentially checking whether a root of (54) lies in $[\hat{v}_n, \hat{v}_{n+1})$, $n = 1, \ldots, |S|$. Since this is an $O(|S|)$ operation, we have that the overall complexity of computing a solution of (48) is $O(|S| \log(|S|))$.
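Alternatively, since (47) is a small convex program, it can be handed directly to a generic solver. The sketch below uses cvxpy (an assumption about available tooling, not part of the paper) rather than the $O(|S|\log|S|)$ dual recursion of the proof; it is convenient for validating a faster implementation.

```python
import numpy as np
import cvxpy as cp

def inner_min_l2(v, q, t):
    """Value of (47): min E^p[v] over the set (46) around the estimate q."""
    v = np.asarray(v, dtype=float)
    q = np.asarray(q, dtype=float)
    p = cp.Variable(len(q), nonneg=True)
    constraints = [cp.sum(p) == 1,
                   cp.sum(cp.multiply(cp.square(p - q), 1.0 / q)) <= t]
    prob = cp.Problem(cp.Minimize(v @ p), constraints)
    prob.solve()
    return prob.value
```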
Sets of conditional measures of the form (46) have also been investigated in Nilim and El Ghaoui (2002).
However, they do not identify these sets as inner, i.e. conservative, approximations of relative entropy sets.
Moreover, they are not able to solve the problem (47) – their algorithm is only able to solve (47) when the set (46) is expanded to $\{p : \mathbf{1}^T p = 1,\ \sum_{s \in S} \frac{(p(s)-q(s))^2}{q(s)} \le t\}$, i.e. when the constraint $p \ge 0$ is dropped.

4.3 Sets based on L1 approximation to relative entropy


From Lemma 12.6.1 in Cover and Thomas (1991), we have that
$$D(p \| q) \ge \frac{1}{2 \ln(2)} \| p - q \|_1^2,$$
where $\|p - q\|_1$ is the $L_1$-distance between the measures $p, q \in \mathcal{M}(S)$. Thus, the set
$$\mathcal{P} = \Big\{ p : \| p - q \|_1 \le \sqrt{2 \ln(2) t} \Big\}, \qquad (55)$$
is an outer approximation, i.e. relaxation, of the relative entropy uncertainty set (39). To the best of our knowledge, this family of sets has not been previously analyzed.

Lemma 6 The value of the optimization problem
$$\begin{array}{ll} \text{minimize} & \mathbf{E}^p[v] \\ \text{subject to} & p \in \mathcal{P} = \big\{ p : \|p - q\|_1 \le \sqrt{2\ln(2)t} \big\}, \end{array} \qquad (56)$$
is equal to
$$\max_{\mu \ge 0} \Big\{ \mathbf{E}^q[v - \mu] - \frac{1}{2}\sqrt{2\ln(2)t}\, \big( \max_{s}\{v(s) - \mu(s)\} - \min_{s}\{v(s) - \mu(s)\} \big) \Big\}, \qquad (57)$$
and the complexity of (57) is $O(|S| \log(|S|))$.


Proof: Let $y(s) = p(s) - q(s)$, $s \in S$. Then $p \in \mathcal{P}$ if and only if $\|y\|_1 \le \sqrt{2\ln(2)t}$, $\sum_{s \in S} y(s) = 0$ and $y \ge -q$. Therefore, the value of (56) is equal to
$$\mathbf{E}^q[v] + \begin{array}[t]{ll} \text{minimize} & \sum_{s \in S} y(s) v(s) \\ \text{subject to} & \|y\|_1 \le \sqrt{2\ln(2)t}, \quad \sum_{s \in S} y(s) = 0, \quad y \ge -q. \end{array}$$
From Lagrangian duality we have that the value of this optimization problem is equal to
$$\mathbf{E}^q[v] + \max_{\mu \ge 0,\ \gamma \in \mathbb{R}} \ \min_{\{y : \|y\|_1 \le \sqrt{2\ln(2)t}\}} \Big\{ -\sum_{s \in S} \mu(s) q(s) + \sum_{s \in S} y(s)\big( v(s) - \mu(s) - \gamma \big) \Big\}$$
$$= \mathbf{E}^q[v] + \max_{\mu \ge 0,\ \gamma \in \mathbb{R}} \Big\{ -\sum_{s \in S} \mu(s) q(s) - \sqrt{2\ln(2)t}\, \| v - \mu - \gamma \mathbf{1} \|_\infty \Big\}$$
$$= \max_{\mu \ge 0} \Big\{ \mathbf{E}^q[v - \mu] - \frac{1}{2}\sqrt{2\ln(2)t}\, \big( \max_s\{v(s) - \mu(s)\} - \min_s\{v(s) - \mu(s)\} \big) \Big\}.$$
Let $\mu^*$ be the optimal dual solution and let $\alpha = \max_{s \in S}\{v(s) - \mu^*(s)\}$. It is easy to see that
$$\mu^*(s) = \begin{cases} v(s) - \alpha, & v(s) > \alpha, \\ 0, & \text{otherwise.} \end{cases}$$
Thus, the dual optimization problem (57) reduces to solving for the optimal $\alpha$. To this end, let $\{\hat{v}(k) : 1 \le k \le |S|\}$ denote the values $\{v(s) : s \in S\}$ arranged in increasing order – an $O(|S|\log(|S|))$ operation. Let $\hat{q}$ denote the correspondingly sorted values of the measure $q$.
Suppose $\alpha \in [\hat{v}_n, \hat{v}_{n+1})$. Then, the dual function $f(\alpha)$ is given by
$$f(\alpha) = \mathbf{E}^q[v - \mu] - \frac{1}{2}\sqrt{2\ln(2)t}\, \big( \max_s\{v(s) - \mu(s)\} - \min_s\{v(s) - \mu(s)\} \big)$$
$$= \sum_{k \le n} \hat{q}(k)\hat{v}(k) + \frac{1}{2}\sqrt{2\ln(2)t}\, \hat{v}_1 + \Big( \sum_{k > n} \hat{q}(k) - \frac{1}{2}\sqrt{2\ln(2)t} \Big)\alpha.$$
Since $f(\alpha)$ is linear, the optimal is always obtained at the end points. Thus, the optimal value of $\alpha$ is given by
$$\alpha = \min\Big\{ \hat{v}(n) : \sum_{k > n} \hat{q}(k) < \frac{1}{2}\sqrt{2\ln(2)t} \Big\},$$
i.e. $\alpha$ can be computed in $O(|S|)$ time.
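The optimal measure for the $L_1$ set (55) can also be written down directly: shift half of the $L_1$ budget from the most expensive states onto the cheapest state. The sketch below is a restatement of the $O(|S|\log|S|)$ procedure in the proof, under the assumption that ties are broken arbitrarily.

```python
import numpy as np

def inner_min_l1(v, q, t):
    """Value of (56): min E^p[v] over {p in M(S) : ||p - q||_1 <= sqrt(2 ln(2) t)}."""
    v = np.asarray(v, dtype=float)
    q = np.asarray(q, dtype=float)
    beta = np.sqrt(2.0 * np.log(2.0) * t)
    p = q.copy()
    i_min = int(np.argmin(v))
    # At most beta/2 of mass can be moved; the cheapest state can absorb 1 - q[i_min].
    budget = min(beta / 2.0, 1.0 - q[i_min])
    p[i_min] += budget
    # Remove the same total amount of mass from the most expensive states first.
    for i in np.argsort(v)[::-1]:
        if budget <= 0:
            break
        if i == i_min:
            continue
        removed = min(p[i], budget)
        p[i] -= removed
        budget -= removed
    return float(p @ v)
```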


See Nilim and El Ghaoui (2002) for other families of sets, in particular sets based on L ∞ and L2 norms.
Although these families are popular in modeling, they do not have any basis in statistical theory. Conse-
quently, parameterizing these sets is nearly impossible. We, therefore, do not recommend using these sets
to model ambiguity in the transition probability.

[Figure 3: Robust stopping problem (|S| = 100, N = 10, m = 40). Top panel: relative nominal performance M(ω) versus ω; bottom panel: relative worst-case performance R(ω) versus ω.]

5 Computational results
5.1 Robust finite horizon optimal stopping problems
Suppose the state transition matrix A of a Markov chain were known. Clearly, the non-robust policy designed for A would then be superior to a robust policy designed for any set P containing A. The rationale behind the robust formulation is that if there is an error in estimating A, then the performance of the policy designed for A can be significantly worse than that of a robust policy. In this section we investigate this claim in the context of finite horizon optimal stopping problems.
In an optimal stopping problem the state evolves according to an uncontrollable stationary Markov chain M on a finite state space S. At each decision epoch t and each state s ∈ S, the decision maker has two actions available: stop or continue. If the decision maker stops in state s at time t, the reward received is g_t(s); if the action is to continue, the cost incurred is f_t(s) and the state s_{t+1} evolves according to M.
The problem has a finite time horizon N and, if the decision maker does not stop before N , the reward at
time N is h(s). Once stopped, the state remains in the stopped state, yielding no reward thereafter. The
objective is to choose a policy to maximize the total expected reward.
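On the nominal chain the stopping problem is solved by the standard backward recursion V_N = h, V_t(s) = max{g_t(s), f_t(s) + Σ_{s'} M(s,s')V_{t+1}(s')}, with the continuation term written additively as in the worst-case recursion displayed later in this section. The following sketch (our notation, not the experimental code) also records the resulting stop/continue decision rules.

import numpy as np

def nominal_stopping_dp(M, g, f, h):
    """Backward recursion for the finite-horizon optimal stopping problem.
    M: |S| x |S| transition matrix; g, f: N x |S| arrays with the stopping rewards
    and the continuation terms; h: length-|S| terminal reward.
    Returns V_0 and the boolean stop/continue decision rules."""
    N, S = g.shape
    V = np.asarray(h, dtype=float).copy()
    stop = np.zeros((N, S), dtype=bool)
    for t in range(N - 1, -1, -1):
        cont = f[t] + M @ V                 # continuation value under M
        stop[t] = g[t] >= cont              # stop wherever stopping is at least as good
        V = np.maximum(g[t], cont)
    return V, stop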
The experimental setup was as follows. Once the time horizon N and the size of the state space S were selected, a transition matrix A was randomly generated. In order to keep the problem tractable we assumed a bound on the number m of 1-step neighbors and ensured that A induced an irreducible Markov chain. The rewards g_t(s), the costs f_t(s) and the terminal reward h(s) were all randomly generated.
A single sample path of length 100|S|^2 was generated according to the above (randomly selected) Markov chain. This sample path was used to compute the maximum likelihood estimate A_ml of the transition matrix. We will call A_ml the nominal Markov chain. The non-robust DP assumed that the underlying Markov chain is governed by A_ml; let V^{nr}_0 denote the corresponding non-robust value function. The ambiguity in the transition matrix was modeled by the relative entropy sets defined in (39), and the ambiguity structure was applied independently to each row of the transition matrix. For each ω ∈ {0.05, 0.10, . . . , 0.95} we computed the robust stopping policy using the robust Bellman recursion (11). Let V^{r,ω}_0 denote the robust value function corresponding to the confidence level ω.
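The maximum likelihood estimate A_ml is simply the matrix of empirical transition frequencies along the sample path; a minimal sketch (function name ours, and the uniform treatment of unvisited rows is our choice) is:

import numpy as np

def mle_transition_matrix(path, num_states):
    """Maximum likelihood estimate of a transition matrix from a single sample path:
    count observed transitions and normalize each row."""
    counts = np.zeros((num_states, num_states))
    for s, s_next in zip(path[:-1], path[1:]):
        counts[s, s_next] += 1.0
    row_sums = counts.sum(axis=1, keepdims=True)
    # rows corresponding to unvisited states are set to the uniform distribution
    return np.where(row_sums > 0, counts / np.maximum(row_sums, 1.0), 1.0 / num_states)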
The first performance measure we consider is the loss associated with employing robust policies on the nominal Markov chain. Clearly the non-robust value function V^{nr}_0 is the optimal value of the optimal stopping problem defined on the nominal chain.

[Figure 4: Robust stopping problem (|S| = 100, N = 10, m = 80). Top panel: relative nominal performance M(ω) versus ω; bottom panel: relative worst-case performance R(ω) versus ω.]

[Figure 5: Robust stopping problem (|S| = 200, N = 20, m = 80). Top panel: relative nominal performance M(ω) versus ω; bottom panel: relative worst-case performance R(ω) versus ω.]

Let V^{r,ω}_{0,ml} denote the reward generated by the robust policy corresponding to the confidence level ω when that policy is run on the nominal chain. Clearly V^{nr}_0 ≥ V^{r,ω}_{0,ml}. We will call the ratio
\[
  M(\omega) = \frac{\sum_{s \in S} V^{r,\omega}_{0,ml}(s)}{\sum_{s \in S} V^{nr}_0(s)}
\]

the relative nominal performance of the robust policy. The ratio M (ω) measures the loss associated with
using a robust policy designed for a confidence level ω. Clearly M (ω) ≤ 1 and we expect the ratio to decrease
as ω increases.
The second performance measure we consider is the worst case performance of the non-robust policy. Let d^{nr}_t denote the non-robust optimal decision rule at epoch t. Define V^{nr}_{N,w} = h and, for t = 0, . . . , N − 1,
\[
  V^{nr}_{t,w}(s) = \begin{cases} g_t(s), & \text{if } d^{nr}_t(s) = \text{stop},\\ f_t(s) + \inf_{p \in \mathcal{P}(s,\omega)} \mathrm{E}^p\bigl[V^{nr}_{t+1,w}\bigr], & \text{otherwise}, \end{cases}
\]
where P(s, ω) is the set of conditional measures for state s ∈ S corresponding to a confidence level ω; i.e. the recursion evaluates the fixed non-robust policy under the worst-case conditional measures. Thus, V^{nr}_{0,w} denotes the worst case value of the non-robust policy. We will call the ratio
\[
  R(\omega) = \frac{\sum_{s \in S} V^{r,\omega}_0(s)}{\sum_{s \in S} V^{nr}_{0,w}(s)}
\]

the relative worst-case performance of the robust policy. Since the robust policy optimizes the worst case

value, R(ω) ≥ 1 and we expect the ratio to increase as ω increases. The ratio R(ω) measures the relative
gain associated with using the robust policy when the transition probabilities are ambiguous.
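Computing V^{nr}_{0,w} amounts to evaluating the fixed non-robust policy with the continuation expectation replaced by a worst-case expectation oracle, such as the routines sketched in Section 4. The following sketch (names ours) performs this evaluation; the ratios M(ω) and R(ω) are then simple sums of the resulting value functions.

import numpy as np

def worst_case_value_of_policy(q_rows, g, f, h, stop, worst_mean):
    """Evaluate a fixed stopping policy under worst-case conditional measures.
    q_rows[s]: nominal conditional measure (row of A_ml) for state s;
    stop: N x |S| boolean decision rules of the (non-robust) policy;
    worst_mean(v, q): oracle returning min_{p in P(q, omega)} E_p[v]."""
    N, S = g.shape
    V = np.asarray(h, dtype=float).copy()
    for t in range(N - 1, -1, -1):
        cont = np.array([f[t, s] + worst_mean(V, q_rows[s]) for s in range(S)])
        V = np.where(stop[t], g[t], cont)
    return V

# M(omega) = V_robust_on_nominal.sum() / V_nonrobust.sum()
# R(omega) = V_robust.sum() / worst_case_value_of_policy(...).sum()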
Figures 3-5 plot the relative nominal and worst case performance for three different simulation runs.
The plots show that the relative loss in nominal performance of the robust policy even at ω = 0.95 is
approximately 15%. On the other hand the worst case performance improves with ω and is greater than
250% at ω = 0.95. This behavior may be explained by the fact that the robust and non-robust optimal policies differ only on a few states in the entire trellis. Thus, the robust policy appears to be able to track the mean behavior while at the same time improving the worst case behavior by altering the action in a small number of
critical states. The relative nominal performance appears to be fairly stable as a function of the time horizon
N , the size of the state space |S| and the number of 1-step neighbors m. These numerical experiments are
clearly quite preliminary. We are currently conducting experiments to further understand the relative merits
of the robust approach.

5.2 Robust infinite horizon dynamic programs


In this section we contrast the computational effort required to solve discounted infinite horizon robust and
non-robust DPs. This comparison is done by averaging the CPU time and the number of iterations required
to solve randomly generated problems. The details of the experiments are given below. All computations
were done in the MATLAB6.1 R12 computing environment and, therefore, only the relative values of the
run times are significant.
The first set of experiments compared the required computational effort as a function of the uncertainty
level. For this set of experiments, the size |S| of the state space was set to |S| = 500, the number of
actions |A(s)| was |A(s)| = 10, and the discount rate λ = 0.9. The rewards r(s, a, s′) were assumed to be independent of the future state s′ and distributed uniformly over [0, 10]. The state transition matrix was also randomly generated. The ambiguity in the transition structure was assumed to be given by the L_2 approximation to the relative entropy sets (see (46)).
For each value of ω ∈ {0.05, 0.1, . . . , 0.95} the results were averaged over N = 10 random instances. The
random instances were solved using both value iteration and policy iteration. The robust value iteration
followed the algorithm described in Figure 1 and was terminated once the difference between successive
iterates was less than τ = 10^{-6}. However, the robust policy iteration did not entirely follow the algorithm
in Figure 2 – instead of solving (31), the value function of the policy πn was computed by iteratively solving
the operator equation (29).
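Schematically, the robust value iteration of Figure 1 applies the operator V ↦ max_a{ r(s,a) + λ inf_{p∈P(s,a)} E^p[V] } until successive iterates are within the tolerance τ; the inner infimum is exactly the worst-case expectation problem of Section 4. The sketch below (our simplification: rewards independent of the future state, as in the experiments) delegates the inner problem to an oracle; the robust policy evaluation used in the policy iteration runs is obtained by iterating the same operator with the action fixed to π_n(s).

import numpy as np

def robust_value_iteration(r, q, worst_mean, lam=0.9, tol=1e-6, max_iter=100000):
    """Sketch of robust value iteration.
    r[s][a]: immediate reward; q[s][a]: nominal conditional measure for (s, a);
    worst_mean(v, q_sa): returns inf_{p in P(s,a)} E_p[v] for the chosen ambiguity set."""
    S = len(r)
    V = np.zeros(S)
    for _ in range(max_iter):
        V_new = np.array([
            max(r[s][a] + lam * worst_mean(V, q[s][a]) for a in range(len(r[s])))
            for s in range(S)
        ])
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V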
The results for this set of experiments are shown in Table 1. The columns labeled iter display the number
of iterations and the columns marked time display the run time in seconds. From these results it is clear
that both the run times and the number of iterations are insensitive to ω. Both the non-robust and robust DP require approximately the same number of iterations to solve the problem. However, the run time per robust value iteration is close to twice the run time per non-robust value iteration.
The second set of experiments compared the run time as a function of the size |S| of the state space. In this set of experiments, the uncertainty level ω = 0.95 and the discount rate λ = 0.9. For each value of |S| = 200, 400, . . . , 2000, the results were averaged over N = 10 random instances. Each random instance
was generated just as in the first set of experiments. As before, each instance was solved using both value
iteration and policy iteration.
The results for this set of experiments are shown in Table 2. From the results in the table, it appears that
the number of iterations is insensitive to the size |S|. Moreover, the robust and non-robust DP algorithms
require approximately the same number of iterations. Therefore, based on Lemma 5, we expect that the

              Value iteration                          Policy iteration
          Non-robust          Robust              Non-robust          Robust
  ω       iter    time (sec)  iter    time (sec)  iter    time (sec)  iter    time (sec)
0.05 153.7 4.73 152.9 8.23 3.9 4.29 3.0 6.67
0.10 153.6 4.66 152.7 8.48 3.9 4.31 3.0 6.87
0.15 153.9 4.69 152.6 8.60 3.6 4.36 2.9 6.79
0.20 153.8 4.61 152.2 8.64 3.5 4.21 2.8 6.60
0.25 154.0 4.80 152.1 8.50 3.9 4.32 2.8 6.54
0.30 154.2 4.72 151.9 8.73 3.8 4.21 2.8 6.73
0.35 153.7 4.68 151.8 8.71 3.6 4.28 2.9 6.88
0.40 154.1 4.70 151.8 8.79 3.8 4.26 2.8 6.72
0.45 153.4 4.67 151.8 8.84 3.7 4.37 2.8 6.76
0.50 154.2 4.69 151.8 8.69 3.3 4.20 2.8 6.70
0.55 153.5 4.65 151.6 8.84 3.3 4.35 2.8 6.74
0.60 153.5 4.72 151.5 8.84 3.3 4.39 2.8 6.76
0.65 154.1 4.75 151.5 8.88 3.5 4.40 2.9 6.96
0.70 153.9 4.71 151.5 8.83 3.7 4.36 2.9 6.90
0.75 153.3 4.73 151.4 8.80 3.3 4.29 2.9 6.90
0.80 153.2 4.64 151.1 8.86 3.8 4.30 3.0 7.14
0.85 154.1 4.68 151.0 8.83 3.6 4.24 3.0 7.10
0.90 153.4 4.76 150.8 8.79 3.4 4.33 3.0 7.10
0.95 153.5 4.74 150.7 8.84 3.7 4.26 3.0 7.12
Table 1: Robust DP vs ω (|S| = 500, |A(s)| = 10, λ = 0.9)

run time of robust DP is at most a logarithmic factor higher than the run time of the non-robust version.
Regressing the run time t_v of the non-robust value iteration on the state space size |S|, we get

  log(t_v) ≈ 2.1832 log(|S|) − 11.6401.

Regressing the run time t_p of the non-robust policy iteration on the state space size |S|, we get

  log(t_p) ≈ 2.0626 log(|S|) − 11.7612.
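These fits are ordinary least squares in log–log space; for instance, regressing the non-robust value iteration run times reported in Table 2 on |S| with NumPy reproduces the coefficients quoted above (the script is ours):

import numpy as np

# |S| and the non-robust value iteration run times from Table 2
sizes = np.array([200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000])
t_v = np.array([1.000, 3.840, 9.420, 19.180, 33.320,
                49.480, 66.300, 87.040, 112.160, 136.360])
slope, intercept = np.polyfit(np.log(sizes), np.log(t_v), deg=1)
# slope ~ 2.18 and intercept ~ -11.64, matching the regression reported above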

The regression results are plotted in Figure 6. The upper plot corresponds to value iteration and the bottom
plot corresponds to policy iteration. The dotted line in both plots is the best fit line obtained by regression
and the solid line is the observed run times. Clearly the regression approximation fits the observed run times
very well.
In the upper plot of Figure 7, the solid line corresponds to log(t_rv), where t_rv is the observed run time of robust value iteration, and the dotted line corresponds to the upper bound log(t̄_rv) expected on the basis of Lemma 5, i.e.

  log(t̄_rv) = log log(|S|) + log(t_v) = log log(|S|) + 2.1832 log(|S|) − 11.6401.

Clearly, the expected upper bound dominates the observed run times, i.e. t_rv ≪ t_v log(|S|). In the second plot of Figure 7 we plot log(t_rp), where t_rp is the run time of robust policy iteration, and the corresponding expected upper bound log(t̄_rp). Once again, the upper bound clearly dominates.

              Value iteration                          Policy iteration
          Non-robust          Robust              Non-robust          Robust
 |S|      iter    time (sec)  iter    time (sec)  iter    time (sec)  iter    time (sec)
200 154.2 1.000 153.6 0.710 3.4 0.480 3.5 0.329
400 154.3 3.840 153.7 5.350 3.3 1.720 3.4 2.421
600 153.6 9.420 153.4 17.654 3.1 3.960 3.4 7.939
800 154.8 19.180 153.6 42.129 3.2 7.320 3.6 19.939
1000 154.9 33.320 153.5 83.986 3.2 11.460 3.2 36.368
1200 153.8 49.480 153.2 143.217 3.5 17.860 3.7 68.953
1400 155.2 66.300 153.2 223.425 3.1 22.020 3.5 104.234
1600 154.2 87.040 153.6 337.874 3.6 33.680 3.5 157.546
1800 153.8 112.160 153.4 483.529 3.2 40.600 3.3 214.227
2000 154.2 136.360 153.7 649.510 3.5 55.340 3.8 322.526
Table 2: Robust DP vs |S| (|A(s)| = 10, ω = 0.95, λ = 0.9)

These computational results are still preliminary and there are many unresolved issues. For example, although the bounds dominate the run times of robust DP, the two lines appear to converge, leading one to believe that the bound may not hold for larger state spaces. However, recall that the bounds are constructed using linear regression and, therefore, there is the possibility that the bound will shift upward when larger state spaces are considered. The codes for both non-robust and robust DP need to be optimized before one can completely trust the run times and iteration counts.

6 Conclusion
In this paper we propose a robust formulation for the discrete time DP. This formulation attempts to mitigate
the impact of errors in estimating the transition probabilities by choosing a maximin optimal policy, where
the minimization is over a set of transition probabilities. This set summarizes the limited knowledge that the decision maker has about the transition probabilities of the underlying Markov chain. A natural family of sets describing the knowledge of the decision maker is the family of confidence regions around the maximum likelihood estimates of the transition probability. This family of sets was first introduced in Nilim and El Ghaoui (2002). Since these confidence regions are described in terms of the relative entropy or Kullback-Leibler distance, we are led to the sets described in Section 4.1. The family of relative entropy based sets can be easily parameterized by setting the desired confidence level. We also introduce two other families of sets that are approximations of the relative entropy based sets.
Since the transition probabilities are ambiguous, every policy now has a set of measures associated with it.
We prove that when this set of measures satisfies a certain Rectangularity property most of the important results in DP theory, such as the Bellman recursion, the optimality of deterministic Markov policies, the contraction property of the value iteration operator, etc., extend to natural robust counterparts. On the computational front, we show that the computational effort required to solve the robust DP corresponding to sets of
conditional measures based on confidence regions is only modestly higher than that required to solve the
non-robust DP. Our preliminary computational results appear to confirm this experimentally. While parts
of the theory presented in this paper have been addressed by other authors, we provide a unifying framework
for the theory of robust DP.

[Figure 6: Regression results for non-robust DP run times. Top panel: log(t_v) versus log(|S|) for value iteration; bottom panel: log(t_p) versus log(|S|) for policy iteration. The dotted lines are the regression fits, the solid lines the observed run times.]

The robust value function V ∗ provides a lower bound on the achievable performance; one can also define
an optimistic value function V̄ ∗ that provides an upper bound on the achievable performance. All the results
in this paper imply corresponding results for the optimistic value function; in particular, there are value iteration and policy iteration algorithms that efficiently characterize the optimistic value function.
Some unresolved issues that remain are as follows. The computational results presented in this paper are
very preliminary. While the initial results are promising, more experiments need to be performed in order
to better understand the performance of robust DP on practical examples. As indicated in the introduction,
we restricted our attention to problems where the non-robust DP is tractable. In most of the interesting
applications of DP, this is not the case and one has to resort to approximate DP. One would, therefore,
be interested in developing the robust counterpart of approximate DP. Such an approach might be able to
prevent instabilities observed in approximate DP (Bertsekas and Tsitsiklis, 1996).

Acknowledgments
The author would like to thank the anonymous referees for insightful comments that have substantially
improved the presentation and the conceptual content of the paper.

A Consequences of Rectangularity
We will begin with an example that illustrates the inappropriateness of Rectangularity in a finite horizon setting. This example is a dynamic version of the Ellsberg urn problem (Ellsberg, 1961) discussed in Epstein and Schneider (2001).
Suppose an urn contains 30 red balls and 60 balls that are either blue or green. At time 0 a ball is drawn from the urn and the color of the ball is revealed at time t = 2. At the intermediate time t = 1 the decision maker is told whether the drawn ball is green. Thus, the state transition structure is as shown in Figure 8, where p_b = P{ball is blue}.
Suppose $p_b \in [\underline{p}, \bar{p}] \subseteq [0, 2/3]$ is ambiguous.

[Figure 7: Run times of robust DP and corresponding bounds. Top panel: observed log(t_rv) and the upper bound log(t̄_rv) versus log(|S|) for value iteration; bottom panel: the same comparison for policy iteration.]


[Figure 8: Dynamic Ellsberg experiment. From s^0_1 the chain moves to s^1_1 = {r, b} with probability 1/3 + p_b and to s^1_2 = {g} with probability 2/3 − p_b; from s^1_1 it moves to s^2_1 = {r} with probability (1/3)/(1/3 + p_b) and to s^2_2 = {b} with probability p_b/(1/3 + p_b); from s^1_2 it moves to s^2_3 = {g} with probability 1.]

Consider the robust optimal stopping problem where the state transition is given by Figure 8. In each state s ∈ S_t at time t = 0, 1 there are two actions {s, c} available, where c denotes continue and s denotes stop. Let π̄ = (d̄_0, d̄_1) denote the policy that chooses the deterministic action c in every state s ∈ S_t, t = 0, 1. Then the state-transition structure in Figure 8 implies that the conditional measures consistent with the decision rules d̄_i, i = 0, 1, are given by
\[
  \mathcal{T}^{\bar{d}_0} = \Bigl\{ \bigl(p(s^1_1 \mid s^0_1),\, p(s^1_2 \mid s^0_1)\bigr) = \bigl(\tfrac{1}{3} + \alpha,\ \tfrac{2}{3} - \alpha\bigr) : \alpha \in [\underline{p}, \bar{p}] \Bigr\},
\]
\[
  \mathcal{T}^{\bar{d}_1} = \Bigl\{ \bigl(p(s^2_1 \mid s^1_1),\, p(s^2_2 \mid s^1_1)\bigr) = \Bigl(\tfrac{1/3}{1/3 + \alpha},\ \tfrac{\alpha}{1/3 + \alpha}\Bigr),\ p(s^2_3 \mid s^1_2) = 1 : \alpha \in [\underline{p}, \bar{p}] \Bigr\}.
\]

Thus,
\[
  \mathcal{T}^{\bar{d}_0} \times \mathcal{T}^{\bar{d}_1} = \Bigl\{ \bigl(p(s^1_1 \mid s^0_1), p(s^1_2 \mid s^0_1)\bigr) = \bigl(\tfrac{1}{3} + \alpha,\ \tfrac{2}{3} - \alpha\bigr),\ \bigl(p(s^2_1 \mid s^1_1), p(s^2_2 \mid s^1_1)\bigr) = \Bigl(\tfrac{1/3}{1/3 + \alpha'},\ \tfrac{\alpha'}{1/3 + \alpha'}\Bigr),\ p(s^2_3 \mid s^1_2) = 1 : \alpha, \alpha' \in [\underline{p}, \bar{p}] \Bigr\},
\]

where α and α′ need not be equal. However, the set of measures T^π̄ consistent with the policy π̄ satisfies
\[
  \mathcal{T}^{\bar{\pi}} = \Bigl\{ \bigl(p(s^1_1 \mid s^0_1), p(s^1_2 \mid s^0_1)\bigr) = \bigl(\tfrac{1}{3} + \alpha,\ \tfrac{2}{3} - \alpha\bigr),\ \bigl(p(s^2_1 \mid s^1_1), p(s^2_2 \mid s^1_1)\bigr) = \Bigl(\tfrac{1/3}{1/3 + \alpha},\ \tfrac{\alpha}{1/3 + \alpha}\Bigr),\ p(s^2_3 \mid s^1_2) = 1 : \alpha \in [\underline{p}, \bar{p}] \Bigr\} \neq \mathcal{T}^{\bar{d}_0} \times \mathcal{T}^{\bar{d}_1}.
\]

The problem arises because the information structure in Figure 8 assumes that there is a single urn that determines the conditional measures at both epochs t = 0, 1; whereas Rectangularity demands that the conditional measures at epochs t = 0, 1 be independent, i.e. in this case, that they be determined by an independent copy of the urn used at t = 0.
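The gap between T^π̄ and T^{d̄_0} × T^{d̄_1} can be checked numerically: a product measure built from α ≠ α′ admits no single urn composition that generates both conditionals. The following small check (values and names are ours) illustrates this.

# Numerical check of the Ellsberg example (values and names are ours).
p_lo, p_hi = 0.0, 2.0 / 3.0

def product_measure(alpha, alpha_prime):
    """Element of T^{d0} x T^{d1}: alpha parameterizes the t = 0 conditional,
    alpha_prime the t = 1 conditional."""
    p_root = (1.0 / 3.0 + alpha, 2.0 / 3.0 - alpha)                 # P(. | s^0_1)
    p_mid = ((1.0 / 3.0) / (1.0 / 3.0 + alpha_prime),
             alpha_prime / (1.0 / 3.0 + alpha_prime))               # P(. | s^1_1)
    return p_root, p_mid

def in_T_pi_bar(p_root, p_mid, tol=1e-9):
    """Membership in T^{pi_bar}: a single alpha must generate both conditionals."""
    alpha = p_root[0] - 1.0 / 3.0
    same_urn = abs(p_mid[1] - alpha / (1.0 / 3.0 + alpha)) < tol
    return (p_lo - tol <= alpha <= p_hi + tol) and same_urn

print(in_T_pi_bar(*product_measure(0.0, 0.5)))   # False: only in the product set
print(in_T_pi_bar(*product_measure(0.5, 0.5)))   # True: single-urn consistent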
Assuming that Rectangularity holds in this setting is equivalent to assuming that the a priori distribution on the composition of the urn is given by
\[
  (p_r, p_b, p_g) \in \mathcal{P} = \Bigl\{ \Bigl( \tfrac{1}{3}\cdot\tfrac{1/3 + \alpha}{1/3 + \alpha'},\ \alpha'\cdot\tfrac{1/3 + \alpha}{1/3 + \alpha'},\ \tfrac{2}{3} - \alpha \Bigr) : \alpha, \alpha' \in [\underline{p}, \bar{p}] \Bigr\}.
\]
A very counterintuitive prior indeed! This example clearly shows that Rectangularity may not always be an
appropriate property to impose on an AMDP. In spite of the counterexample above, Rectangularity is often
appropriate for finite horizon AMDPs because the sources of the ambiguity in different periods are typically
independent of each other.
Rectangularity implies that the adversary is able to choose a different conditional measure every time a
state-action pair (s, a) is encountered. This adversary model should not raise an alarm in a finite horizon
setting where a state-action pair is never revisited. However, the situation is very different in an infinite horizon setting where a state-action pair can be revisited. In this setting Rectangularity may not be appropriate in situations where there is ambiguity but the transition probabilities are not dynamically changing. Deciding whether Rectangularity is appropriate can often be a function of the time scale of events. Suppose one is interested in a robust analysis of network routing algorithms where the action at each node is the choice of the outgoing edge and the ambiguity is with respect to the delay on the network edges. For a traffic network the Rectangularity assumption might be appropriate because the time elapsed in returning to a node is sufficiently long that the parameters could have shifted. On the other hand, for data networks that operate at much higher speeds the ambiguity might evolve on a slower time scale, and therefore,
Rectangularity might not be appropriate. On a positive note, Lemma 3 shows that the problems with
Rectangularity disappear if one restricts the decision maker to stationary policies.

References
Bagnell, J., Ng, A., and Schneider, J. (2001). Solving uncertain Markov decision problems. Technical report,
Robotics Inst., CMU.

Ben-Tal, A. and Nemirovski, A. (1997). Robust truss topology design via semidefinite programming. SIAM
J. Optim., 7(4):991–1016.

Ben-Tal, A. and Nemirovski, A. (1998). Robust convex optimization. Math. Oper. Res., 23(4):769–805.

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.

Cover, T. M. (1991). Universal portfolios. Mathematical Finance, 1(1):1–29.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. John Wiley & Sons, New York.

de Farias, D. and Van Roy, B. (2002). The linear programming approach to approximate dynamic programming. Submitted to Oper. Res.

Ellsberg, D. (1961). Risk, ambiguity and the Savage axioms. Quart. J. Econ., 75(4):643–669.

Epstein, L. G. and Schneider, M. (2001). Recursive multiple priors. Technical Report 485, Rochester Center
for Economic Research. Available at http://rcer.econ.rochester.edu. To appear in J. Econ. Theory.

Epstein, L. G. and Schneider, M. (2002). Learning under Ambiguity. Technical Report 497, Rochester Center
for Economic Research. Available at http://rcer.econ.rochester.edu.

Gilboa, I. and Schmeidler, D. (1989). Maxmin expected utility with non-unique priors. J. Math. Econ.,
18:141–153.

Goldfarb, D. and Iyengar, G. (2003). Robust portfolio selection problems. Math. Oper. Res., 28(1):1–38.

Hansen, L. P. and Sargent, T. J. (2001). Robust control and model uncertainty. American Economic Review,
91:60–66.

Littman, M. (1994). Memoryless policies: Theoretical limitations and practical results. In Cliff, D., Husbands,
P., and Wilson, S. W., editors, From Animals to Animats: SAB ’94, pages 238–245. MIT Press.

Nilim, A. and El Ghaoui, L. (2002). Robust Solutions to Markov Decision Problems with Uncertain Transition Matrices. Submitted to Operations Research. UC Berkeley Tech Report UCB-ERL-M02/31.

Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley
series in probability and mathematical statistics. John Wiley & Sons.

Satia, J. K. (1968). Markovian Decision Process with Uncertain Transition Matrices or/and Probabilistic
Observation of States. PhD thesis, Stanford University.

Satia, J. K. and Lave, R. L. (1973). Markov Decision Processes with Uncertain Transition Probabilities.
Oper. Res., 21(3):728–740.

Shapiro, A. and Kleywegt, A. J. (2002). Minimax analysis of stochastic problems. To appear in Optimization
Methods and Software.

Tsitsiklis, J., Simester, D., and Sun, P. (2002). Dynamic optimization for direct marketing problem. Presented at INFORMS 2002.

White, C. C. and Eldieb, H. K. (1994). Markov decision processes with imprecise transition probabilities.
Oper. Res., 43:739–749.
