0% found this document useful (0 votes)
41 views14 pages

European Journal of Operational Research: Stefan Woerner, Marco Laumanns, Rico Zenklusen, Apostolos Fertis

This document summarizes a research paper about using approximate dynamic programming techniques to solve stochastic linear control problems with compact state and action spaces. The paper presents an approximate relative value iteration algorithm that generates piecewise-linear convex approximations of the relative value functions. The algorithm also provides lower bounds on the optimal average cost. The algorithm is applied to inventory management problems and shown to find policies that are as good or better than existing heuristics, with optimality gaps never exceeding 5%.

Uploaded by

julio perez
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
41 views14 pages

European Journal of Operational Research: Stefan Woerner, Marco Laumanns, Rico Zenklusen, Apostolos Fertis

This document summarizes a research paper about using approximate dynamic programming techniques to solve stochastic linear control problems with compact state and action spaces. The paper presents an approximate relative value iteration algorithm that generates piecewise-linear convex approximations of the relative value functions. The algorithm also provides lower bounds on the optimal average cost. The algorithm is applied to inventory management problems and shown to find policies that are as good or better than existing heuristics, with optimality gaps never exceeding 5%.

Uploaded by

julio perez
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 14

European Journal of Operational Research 241 (2015) 85–98

Contents lists available at ScienceDirect

European Journal of Operational Research


journal homepage: www.elsevier.com/locate/ejor

Decision Support

Approximate dynamic programming for stochastic linear control


problems on compact state spaces
Stefan Woerner a,⇑, Marco Laumanns a, Rico Zenklusen b, Apostolos Fertis c
a
IBM Research, Saeumerstrasse 4, 8803 Rueschlikon, Switzerland
b
ETH Zurich, Raemistrasse 101, 8092 Zurich, Switzerland
c
SMA und Partner AG, Gubelstrasse 28, 8050 Zurich, Switzerland

a r t i c l e i n f o a b s t r a c t

Article history: This paper addresses Markov Decision Processes over compact state and action spaces. We investigate the
Received 28 November 2013 special case of linear dynamics and piecewise-linear and convex immediate costs for the average cost
Accepted 2 August 2014 criterion. This model is very general and covers many interesting examples, for instance in inventory
Available online 10 August 2014
management. Due to the curse of dimensionality, the problem is intractable and optimal policies usually
cannot be computed, not even for instances of moderate size.
Keywords: We show the existence of optimal policies and of convex and bounded relative value functions that
Dynamic programming
solve the average cost optimality equation under reasonable and easy-to-check assumptions. Based on
Markov processes
Inventory
these insights, we propose an approximate relative value iteration algorithm based on piecewise-linear
Dual sourcing convex relative value function approximations. Besides computing good policies, the algorithm also
Multiple sourcing provides lower bounds to the optimal average cost, which allow us to bound the optimality gap of any
given policy for a given instance.
The algorithm is applied to the well-studied Multiple Sourcing Problem as known from inventory
management. Multiple sourcing is known to be a hard problem and usually tackled by parametric
heuristics. We analyze several MSP instances with two and more suppliers and compare our results to
state-of-the-art heuristics. For the considered scenarios, our policies are always at least as good as the
best known heuristic, and strictly better in most cases. Moreover, by using the computed lower bounds
we show for all instances that the optimality gap has never exceeded 5%, and that it has been much
smaller for most of them.
Ó 2014 Elsevier B.V. All rights reserved.

1. Introduction value function approximations, together with a non-decreasing


sequence of lower bounds to the optimal average cost. The relative
In this paper we address Markov Decision Processes (MDPs) value functions approximation can then be used to derive a good
with compact state and action space, linear dynamics as well as policies. We show the effectiveness of this algorithm on the Multi-
piecewise-linear and convex immediate costs. We denote this class ple Sourcing Problem (MSP) known in inventory management and
of MDPs as Stochastic Linear Control Problems (SLCPs). SLCPs are a compare the results to the current best-known heuristics.
broad class of problems that represent many practically relevant The paper is organized as follows. Section 2 surveys the related
problems in various areas, for instance in inventory management. literature for average-cost MDPs, ADP, and the MSP. In Section 3,
We study existence and characterization policies that are we formally introduce SLCPs. Section 4 discusses the related con-
optimal for the average cost criterion. In particular, we develop cepts and results of MDP theory. In Section 5, we review and
conditions that imply the existence of a convex solution to the extend results about the existence of optimal policies and relative
Average Cost Optimality Equation (ACOE). Such solutions can be value functions for general state space MDPs. In particular, we
used to characterize optimal policies. In addition, we develop an prove the existence of a convex solution to the ACOE under rather
Approximate Dynamic Programming (ADP) algorithm for SLCPs. mild assumptions that are typically fulfilled in practice. This moti-
The algorithm generates piecewise-linear and convex relative vates our approximation algorithm, which is given in Section 6. In
Section 7, we apply our algorithms to various MSP instances, com-
pare the results with state-of-the-art heuristics, and discuss the
⇑ Corresponding author. differences.
E-mail address: [email protected] (S. Woerner).

http://dx.doi.org/10.1016/j.ejor.2014.08.003
0377-2217/Ó 2014 Elsevier B.V. All rights reserved.
86 S. Woerner et al. / European Journal of Operational Research 241 (2015) 85–98

2. Related work and our contributions Piecewise-linear convex value function approximations were
also considered by Lincoln and Rantzer (2006) and Shapiro
In this section we review the relevant literature about MDPs, (2011). Lincoln and Rantzer (2006) also focus on SLCPs, but instead
ADP and MSP, and discuss our contributions in the context of exist- of the average cost criterion, they consider discounted cost, for
ing results and open questions. which the existence of optimal policies and relative value functions
is well established. Lincoln and Rantzer introduce a ‘‘Relaxed Value
Iteration’’ and exploit the fact that the Bellman operator can be
2.1. Average cost Markov decision processes expressed as a Multi-Parametric LP (MPLP) and thus preserves
piecewise-linearity and convexity. In contrast to our approach,
While average cost MDP with finite state and action spaces are Lincoln and Rantzer (2006) have to solve an MPLP in every iteration
well understood (Bertsekas, 2007), much less in known for MDPs to obtain the hyperplane representation of the next relative value
with continuous state and action spaces. From the point of view function, which is a very complex operation (exponential in the
of the present paper, the existing work on continuous state space worst case and practically solvable only in low dimensions). In
MDPs can be classified into two groups, distinguished by continu- order to reduce the complexity of the relative value function, some
ity assumptions on the transition law. of the hyperplanes are dropped as long as given deviation bounds
Most authors assume the transition law to be strongly continu- are respected. Satisfying the bounds implies performance guaran-
ous (Hernández-Lerma & Lasserre, 1990; Montes-de Oca & tees for the resulting policies and allows to control the trade-off
Hernández-Lerma, 1996; Kurano, Nakagami, & Huang, 2000). This between computational effort and performance of the policy. How-
simplifies the analysis of the existence of optimal policies and ever, to get meaningful performance guarantees, the bounds have to
relative value functions. However, in the more general model con- be chosen quite tight, which implies that most hyperplanes have to
sidered here, we only assume a weakly continuous transition law, be kept. Unfortunately, solving an MPLP with many hyperplanes is
so these results cannot be applied. impractical even in smaller dimensions. This drawback might be the
There are only very few papers that consider weakly continuous reason why this approach has never been applied to problems of
transition laws, most importantly Schäl (1993) as well as realistic size. For more details on MPLPs in control, including finite
Hernández-Lerma and Lasserre (2002, chap. 12). Schäl (1993) horizon approximation techniques, we refer to Jones, Baric, and
follows the limiting discount factor approach to establish the exis- Morari (2007) as well as Jones and Morari (2009).
tence of optimal policies. We apply these ideas to SLCPs and extend In this paper, we develop a different approach to cope with the
the results to the existence of a convex bounded solution to the inherent complexity of the exact value iteration step. Instead of
ACOE. solving an MPLP to find all hyperplanes in every step and then
Hernández-Lerma and Lasserre (2002, chap. 12) follow the infi- dropping some of them, we compute only a subset in the first
nite dimensional Linear Programming (LP) approach. They provide place. We choose the subsets in such a way that the associated
conditions that imply existence of optimal policies and relative lower bounds are non-decreasing.
value functions. However, their results only hold almost every- Piecewise-linear convex value functions have also been used by
where with respect to an invariant measure. The conditions we Shapiro (2011) to compute a lower approximation of the optimal
propose in Section 5 imply that all their results are applicable as value function in the Stochastic Dual Dynamic Programming
well. (SDDP) method, which originated in Pereira and Pinto (1991).
Arapostathis, Borkar, Fernández-Gaucherand, Ghosh, and SDDP is typically applied to the Sample Average Approximation
Marcus (1993) provide a general survey about MDPs with the aver- problem. It is designed for finite horizon problems, whereas the
age cost criterion ranging from finite to Borel state and action technique proposed here targets infinite horizon problems.
spaces.
Finally, we also note that convexity of value functions of MDPs 2.3. Multiple sourcing
has been an important object of study for a long time. However,
most of these structural results have been established for finite Inventory management with a single supplier is a well studied
horizon or discounted cost problems only (Dynkin, 1972; problem. Scarf (1960) and Iglehart (1963) proved that the optimal
Hinderer, 1984; Hernández-Lerma & Runggaldier, 1994; policy of an infinite horizon discounted cost problem with deter-
Hernández-Lerma, Piovesan, & Runggaldier, 1995). ministic lead times and stationary stochastic demand is an ðs; SÞ
policy. Veinott and Wagner (1965) concluded that this holds also
for infinite horizon average cost per stage problems.
2.2. Approximate dynamic programming While single sourcing is well understood, there exist only few
results about the structure of optimal policies if multiple suppliers
ADP is an active research area with a variety of literature. Stan- are available. Fukuda (1964) proved that for two suppliers, a dual
dard textbooks include Bertsekas and Tsitsiklis (1996) as well as base stock policy is optimal when no fixed order costs are
Powell (2007). accounted, lead times are deterministic, and the difference of the
The typical approach in ADP is to approximate an optimal rela- lead times of the two suppliers is exactly equal to one. However,
tive value function with a combination of a fixed set of basis an optimal policy for a general MSP is highly state-dependent
functions. Such approaches were shown to work well for some (Whittemore & Saunders, 1977). Therefore, parametric heuristics
applications (Schweitzer & Seidmann, 1985; De Farias & Van Roy, and the computation of their optimal parameters is an active field
2003, 2004; Farias & Van Roy, 2006, chap. 6). However, determin- of research.
ing a good set of basis functions is often very difficult and problem- Minner (2003) provided a detailed review of MSPs and dis-
specific. cussed different heuristics and different setups. Scheller-Wolf,
As we will show in Section 5, there exists a convex solution to Veeraraghavan, and Van Houtum (2006) showed how the optimal
the ACOE for SLCPs, therefore we propose to use a piecewise-linear parameters of the Single Index Policy (SIP) can be computed for a
convex approximation of the relative value function instead. Thus, general dual sourcing problem with deterministic lead times. In
instead of fixing a set of basis functions, we take the maximum of a Veeraraghavan and Scheller-Wolf (2008), the more complex Dual
set of hyperplanes, which we generate in a particular way during Index Policy (DIP) was analyzed and a simulation-based approach
the run of the algorithm. to determine the optimal parameters was presented, while Arts,
S. Woerner et al. / European Journal of Operational Research 241 (2015) 85–98 87

Van Vuuren, and Kiesmuller (2009) provide a different way of W :¼ fð~ uÞ 2 Rnþm jQ~
x; ~ x 6 b; R~
u þ V~
x 6 dg;
approximating the optimal DIP parameters. Finally, Klosterhalfen,
which implies that W is also a polytope.
Kiesmuller, and Minner (2011) analyzed the Constant Order Policy
For any ð~
x; ~ x0 is given by
uÞ 2 W, the random successor state ~
(COP) and gave an overview of the three aforementioned heuristics
and their performance based on numerical studies. The COP was !
x0 :¼ A~
~ x þ B~
uþ C ð3Þ
also studied before, e.g. by Janssen and de Kok (1999) who pro- !
vided an algorithm to compute the optimal parameters under ser- where ðA; B; C Þ is a multi-dimensional random variable that takes
vice level constraints. Since the MSP is known to be a very difficult values in Rnn  Rnm  Rn . The support and distribution of
!
problem and since there are well-studied heuristics for it, it serves ðA; B; C Þ are given as
n !o n ! o
as a good test case for the new ADP approach proposed in this
supp ðA; B; C Þ :¼ ðAj ; Bj ; C j Þj j ¼ 1; . . . ; J ;
paper. h ! ! i
qj :¼ P ðA; B; C Þ ¼ ðAj ; Bj ; C j Þ ; j ¼ 1; . . . ; J;
2.4. Open questions and our contributions
where J 2 N.
In this paper, we provide results to the following open ques- In general, (3) does not guarantee that the resulting problem is
tions regarding SLCPs: consistent, because the successor state ~ x0 does not have to be an
element of S. In order to ensure feasibility, we make the following
 existence of optimal policies, assumption.
 existence and structure of the solution to the ACOE,
 suitable approximation of optimal relative value functions, and Assumption 1 (Feasibility). For every ð~
x; ~
uÞ 2 W we assume that
 performance evaluation of approximated policies and !
heuristics. Aj~
x þ Bj~
u þ C j 2 S; 8j ¼ 1; . . . ; J:
Given the previous definitions, it is possible to state the transi-
We prove the existence of optimal policies as well as convex x0 2 S for a given state-action
tion probabilities to a particular state ~
solutions to the ACOE under very mild assumptions. This signifi- pair ð~
x; ~
uÞ 2 W as
cantly extends known results and constitutes the main theoretical
contribution of the paper, stated in Theorem 6. These insights from X
J

the theoretical analysis motivate our new approximate Relative x0 j~


P½~ x; ~
u ¼ qj  In ! o ð~
x0 Þ; ð4Þ
j¼1 Aj~ uþ C
xþBj~ j
Value Iteration (RVI) algorithm based on piecewise-linear convex
functions. Besides good control policies, the algorithm also pro- where Ifg ðÞ denotes the indicator function of a given set. Note that
vides a non-decreasing sequence of lower bounds to the optimal (4) defines an atomic transition law. In Section 5, we show that it is
average costs. The lower bounds allow us to bound the optimality weakly continuous, but not strongly continuous. This is the reason
gap of any given policy. why most existing results about continuous state space MDPs are
We apply this algorithm to MSP instances and compare the per- not applicable, since they assume strongly continuous transition
formance with state-of-the-art heuristics. MSP is a typical example laws, as discussed in Section 2.1.
of a seemingly simple stochastic control problem with simple The immediate costs c : W ! RP0 are assumed to be piecewise-
dynamics and cost structure. However, nothing is known about linear and convex, non-negative, and defined as the maximum over
the structure of optimal policies, and the only known way to find a set of hyperplanes as
an optimal policy is via exact dynamic programming, which is usu- n o
ally intractable. Various heuristic policies have been proposed and c : ð~ uÞ # max ~
x; ~ fTw~ gTw~
x þ~ u þ hw ð5Þ
w¼1;...;W
shown to perform well on small instances by comparing them to
the exact solution, but the lack of lower bounds has so far pre- where ~ fw 2 Rn ; ~
gw 2 Rm ; hw 2 R, for all w ¼ 1; . . . ; W.
vented to judge their performance on instances that are too large In this paper we focus on the infinite horizon average cost
to be solved to optimality. objective to evaluate the performance of a given policy. We con-
sider only deterministic policies, thus a policy is a map from the
state space into the action space. We call a policy p : S ! U feasible
3. Model
if pð~xÞ 2 Uð~xÞ for all ~ x 2 S, and we denote the set of all feasible
policies by P.
We now formally introduce SLCPs, as for instance studied by
Suppose a stationary policy p 2 P, or a sequence of policies
Lincoln and Rantzer (2006).
Let the state space S be defined as
ps 2 P; s 2 N is given. We denote the random state that is reached
n o after t periods by ~ xt when applying the policy p, respectively the
S :¼ ~ x 6~
x 2 Rn jQ~ b ; ð1Þ policies ps ; s ¼ 0; . . . ; t  1, for a given initial state ~
x0 2 S.
The infinite horizon average cost per stage for a given initial
where Q 2 Rsn and ~ b 2 Rs define a polytope and the set of feasible state ~x 2 S and a policy p 2 P is defined as
actions UðxÞ for a given state ~
~ x 2 S be defined as "  #
1 X N1 
n o 
J p ð~
xÞ :¼ lim sup E xt ; pð~
cð~ xt ÞÞ ~x0 ¼ ~
x ;
Uð~
xÞ :¼ ~u 2 Rm jR~ x 6~
u þ V~ d ; ð2Þ N!1 N t¼0


where R 2 Rrm ; V 2 Rrn and ~


d 2 Rr define a polytope for each state and the optimal average cost is
~
x 2 S. We assume that Uð~xÞ – ; for all ~
x 2 S. The set of all actions U q :¼ inf inf J p ð~xÞ:
~
x2S p2P
is defined as
[ Later we will show that the optimal average cost does not depend
U :¼ Uð~
xÞ:
on the initial state ~
x 2 S.
~
x2S
A common approach to study average cost behavior of MDPs is
Using (1) and (2), the space of feasible state-action pairs W can be to analyze the corresponding discounted cost problems as the dis-
defined as count factor approaches one (Schäl, 1993). We also follow this
88 S. Woerner et al. / European Journal of Operational Research 241 (2015) 85–98

approach here and define the discounted cost objective for a policy Lemma 1. Given a convex function h : S ! R; T a h : S ! R is again a
p 2 P, discount factor a 2 ð0; 1Þ and initial state ~x 2 S as convex function for all a 2 ½0; 1.
"  # As noted for instance by Lincoln and Rantzer (2006) the Bellman
XN1 
V a;p ð~
xÞ :¼ lim E at cð~xt ; pð~xt ÞÞ ~x0 ¼ ~x : operator of a piecewise-linear convex function can be written as an
N!1
t¼0
 MPLP. This immediately implies that it also preserves piecewise-
linearity in addition to convexity. We derive the corresponding
MPLP since we need the notation later in this section in order to
4. The Bellman operator and its properties prove Lemma 2.
Suppose h : S ! R is piecewise-linear and convex and given as
A basic tool from MDP theory is the Bellman operator, which we n o
will need throughout the remainder of this paper. In the following, xÞ ¼ max ~
hð~ /Tk~
x þ wk :
k¼1;...;K
we provide a formal definition of the Bellman operator and the
ACOE for SLCP and discuss some of their properties. For a more Using the provided definitions, T a h can be written as
( )
detailed discussion of its properties we refer to the standard liter- X
J
!
T a hð~
xÞ ¼ min cð~ uÞ þ a qj  hðAj~
x; ~ x þ Bj~
u þ C jÞ
ature for discrete state space problems (Puterman, 1994; ~
u2Uð~

j¼1
Bertsekas, 2007), since a lot of results hold in the continuous state ( )
XJ  ! 
space setup as well.
¼ min qj  cð~ uÞ þ a  hðAj~
x; ~ x þ Bj~
u þ C jÞ
~
u2Uð~

j¼1
Definition 1 (Bellman Operator). Given a function h : S ! R and a (
XJ n 
discount factor a 2 ð0; 1, the corresponding Q-function ¼ min qj  max ~ fTw~ gTw~
x þ~ u þ hw
~
u2Uð~ k¼1;...;K
Q a : W ! R and Bellman operator T a are given by xÞ
j¼1 w¼1;...;W
)
X
J
!  ! o
Q a ð~
x; ~
ujhÞ :¼ cð~ uÞ þ a
x; ~ qj hðAj~
x þ Bj~
u þ C j Þ; ð~
x; ~
uÞ 2 W; þa u ~ Tk ðAj~
xþ Bj~
u þ C j Þ þ wk
j¼1
(
T a hð~
xÞ :¼ inf fQ a ð~
x; ~
ujhÞg; ~
x 2 S: XJ n 
~
u2Uð~
xÞ ¼ min qj  max ~ fTw þ au
~ Tk Aj ~
x
~
u2Uð~
xÞ k¼1;...;K
j¼1 w¼1;...;W
Choosing a ¼ 1 corresponds to the average cost per stage objective, )
 T   ! o
in which case we usually drop the index and write just Q and T, ~ Tk Bj ~ ~ Tk C j þ awk
gw þ au
þ ~ u þ hw þ au :
respectively. We define the Bellman operator by using the infimum.
However, whenever possible, we replace it by the minimum. For readability, we merge the indices k ¼ 1; . . . ; K and w ¼ 1; . . . ; W
For every h : S ! R we define a corresponding policy ph 2 P for into i ¼ 1; . . . ; I :¼ K  W and introduce the following notation:
~
x 2 S by
!  T
ph ð~xÞ :¼ argmin fQ a ð~x; ~
ujhÞg; ð6Þ F i;j :¼ ~fTw þ au
~ Tk Aj ;
~
u2Uð~

!  T 
G i;j :¼ ~gw þ au~ Tk Bj T ;
assuming that the minimum is achieved for all ~
x 2 S.  
!
Hi;j :¼ hw þ au ~ Tk C j þ awk :
Definition 2 (Average Cost Optimality Equation/Inequality). A scalar
q 2 R and a relative value function h : S ! R are said to solve the Using this notation, we can simplify the expression of T a h to
( )
Average Cost Optimality Equation (ACOE) if XJ n! !T o
T
T a hð~
xÞ ¼ min qj  max F i;j~
x þ G i;j~
u þ Hi;j
q þ hð~xÞ ¼ Thð~xÞ 8~x 2 S: ~
u2Uð~

j¼1
i¼1;...;I
8 9
A relative value function h : S ! R satisfies the Average Cost Opti- >
> >
>
>
> >
>
mality Inequality (ACOI) if <XJ
 =
¼ min qj  min fj
q þ hð~xÞ P Thð~xÞ 8~x 2 S: ~
u2Uð~xÞ>
> f j 2R >
>
>
> j¼1 ! ! >
>
Studying ACOE and ACOI is of interest because of the following : fjP F ~
T xþ G T ~ uþHi;j
;
i;j i;j
two theorems. ( )
Def: of Uð~xÞ X J
¼ min m
qj  f j : ð7Þ
~
u2R
j¼1
Theorem 1 (Hernández-Lerma and Lasserre (2002, chap. 12)). Sup- ~
f 2RJ
pose q 2 R and h : S ! R satisfy the ACOE, and assume the policy ph u6~
R~ dV~
x
!T !
is well-defined, then G i;j~uf j 6 F Ti;j~xHi;j

xÞ ¼ q ¼ q
J ph ð~ 8~
x 2 S: For a fixed state ~x 2 S this is an LP and we refer to (7) as the primal
LP formulation of the Bellman operator. This implies that T a hð~ xÞ is
an MPLP with parameter ~ x 2 S. Therefore, the Bellman operator pre-
Theorem 2 (Schäl (1993), Proposition 1.3). Suppose h satisfies the
serves piecewise-linearity and convexity (Lincoln & Rantzer, 2006;
ACOI, and assume the policy ph is well-defined, then
Jones et al., 2007). Moreover, evaluating the Bellman operator for
a given state can also return a supporting hyperplane as described
xÞ ¼ q
J ph ð~ 8~
x 2 S:
in the following.
In other words, ACOE and ACOI imply that ph is an optimal pol-
icy, and the optimal average cost is independent of the initial state.
Lemma 2. Evaluating T a hð~ xÞ for a fixed ~
x 2 S by solving the dual LP
From standard convex optimization theory, it follows that the corresponding to (7) also results in a supporting hyperplane of T a h at
Bellman operator as defined here preserves convexity (Boyd & ~
x, denoted by
Vandenberghe, 1999, (2.11)). We state this in the following lemma T
for further reference.
u
~ ½~
x; h ~x þ w½~
x; h:
S. Woerner et al. / European Journal of Operational Research 241 (2015) 85–98 89

Proof of Lemma 2. In order to prove the lemma, we analyze the convex relative value function that satisfies the ACOE. If there
dual LP formulation of (7), but first we introduce the following exists a policy that achieves the minimum in the Bellman operator
notation to simplify matters: of such a relative value function, the ACOE implies that this policy
is average cost optimal and achieves the optimal cost q for every
0! 1 0! 1
F T1;1 GT initial state ~
x 2 S.
B! C B !1;1 C There are different approaches to study average cost MDPs, pri-
B T C B T C
B F 1;2 C BG C marily the limiting discount factor approach (Schäl, 1993) and the
F :¼ B C 2 RðIJÞn ; G :¼ B 1;2 C 2 RðIJÞm ;
B . C B . C infinite dimensional LP approach (Hernández-Lerma & Lasserre,
B . C B . C
@ . A @ . A 2002, chap. 12). The drawback of the infinite-dimensional LP
!T !T
F I;J G I;J approach is that its results might only hold for a subset of the state
0 1 0 1 space. This subset will be an ergodic class under a resulting optimal
H1;1 IJ
B H1;2 C B IJ C policy, but the policy is only defined for this subset, so this does not
! B C B C
H :¼ B C 2 RðIJÞ ; J :¼ B C 2 RðIJÞJ ; yield any information about the remaining state space. In this
B . C. B .. C
@ . A @.A paper we mainly follow the first approach.
HI;J IJ The proofs of theorems, lemmas, and corollaries are given in
Appendix A. We skip them here, because most of them are very
where IJ 2 RJJ denotes the identity matrix of dimension J. This technical and do not provide many further insights.
allows us to write (7) as To begin with, we introduce the concept of Weak Accessibility
8 !
< ~ T ~


!) (WA). This concept is known in discrete state space MDPs
0 u  R O ~
u ~
d  V~
x (Bertsekas, 2007), but the atomic transition law (4) allows us to
T a hð~
xÞ ¼ min ~  ~ 6 ! ;
~
u2R m: ~
q f  G J f ðF~
x þ HÞ extend it to SLCPs over compact state and action spaces.
~f 2 RJ
Definition 3 (Weak Accessibility). An SLCP satisfies WA if there
where ~
0 and O denote the appropriate zero vector, respectively zero exist a T 2 N and a s > 0 such that for every ~ y; ~
z 2 S there is a
matrix. The dual LP can now be stated as sequence of policies pt;~y;~z 2 P; t 2 N, such that

!T ! xs ¼ ~
P½9s 6 T : ~ zj~ y  P s:
x0 ¼ ~ ð9Þ
xÞ ¼ max u
T a hð~ ~ð K Þ ~
x þ wð K Þ ; ð8Þ
!
K 2L Next, Lemma 3 shows that the n-step cost differences between
where any two states remain bounded under WA and continuous termi-
( 
T !) nal cost.
 ~
! rþIJ  R O ! 0
L :¼ K 2 R60  K¼ ;
 G J ~
q Lemma 3. If an SLCP satisfies WA and h : S ! R is continuous then

! there exists M P 0 such that for all a 2 ½0; 1; n 2 N, and ~
y; ~
z2S
! !T V ! !T ~
d
u
~ ð K Þ :¼ K ; wð K Þ :¼ K ! :  n 
T hð~ n
~
F H a yÞ  T a hðzÞ 6 M: ð10Þ

We refer to (8) as the dual LP formulation of the Bellman operator. Next, we summarize results from Schäl (1993) on the existence
! of optimal policies for general state space MDPs.
Thus, each vector of dual variables K 2 L leads to a hyperplane
! T !
u
~ð K Þ ~x þ wð K Þ for each ~
x 2 S. Since L is a linearly constrained set Definition 4 (Hernández-Lerma and Lasserre (2002, chap. 12)). A
! !
independent of x 2 S, and u
~ ~ ð K Þ as well as wð K Þ are linear in transition law P : W ! MðSÞ is weakly continuous if for every
! continuous function / : S ! R
K 2 L, it follows that for every ~x 2 S, the optimal solution of (8) will
be an extreme point of L. Denoting the finite set of extreme points of
Z
L by extðLÞ allows us to write ð~
x; ~
uÞ # /ð~
yÞP½d~
yj~
x; ~
u
S
!T !
T a hð~
xÞ ¼ max u
~ð K Þ ~
x þ wð K Þ : is a continuous function on W, where MðSÞ is the space of probabil-
K2extðLÞ
ity measures on S.
Therefore, T a h is the maximum over a finite set of hyperplanes. Denoting
! !
u
~ ð K Þ by u
~ ½~
x; h and wð K Þ by w½~
x; h completes the proof. h Definition 5 (Michael (1956)). Let F : XY be a multi-valued
mapping between topological spaces X and Y. F is said to be u.s.c.
From standard MDP theory it is known that the Bellman opera- if for every open B  Y, the set
tor can be used to compute lower bounds to the optimal average
F 1 ðBÞ :¼ fx 2 XjFðxÞ  Bg
cost (Bertsekas, 2007). This also holds for the compact case, and
leads to the following theorem. is open as well.
Schäl (1993) introduces a slightly weaker form of the following
Theorem 3. Assume h : S ! R is a bounded function. A non- assumptions. We adapt them according to the assumptions already
decreasing sequence of lower bounds to the optimal average cost q made throughout this paper.
is given by
n o Assumption 2 (Schäl (1993), Condition (W)).
qðT n hÞ :¼ inf T nþ1 hð~xÞ  T n hð~xÞ ; n 2 N:
~
x2S
1. S is compact;
5. Existence of optimal policies and relative value functions 2. Uð~xÞ is compact for all ~
x 2 S and ~
x 2 S # Uð~
xÞ is an upper semi-
continuous (u.s.c.) multi-valued mapping;
In this section we discuss the existence of stationary determin- 3. P : W ! MðSÞ is weakly continuous;
istic average cost optimal policies. We show that there exists a 4. c : W ! R is continuous.
90 S. Woerner et al. / European Journal of Operational Research 241 (2015) 85–98

Lemma 4. Assumption 2 holds for SLCPs. Since the Bellman operator preserves convexity, as stated in
Lemma 1, we can define the following sequence of convex
functions:
Theorem 4 (Schäl (1993), Proposition 2.1). Assume Assumption 2
holds and let be a 2 ð0; 1Þ. Thus, there exists a discounted cost-optimal hn ð~ ~
xÞ :¼ T n hð xÞ  nq :
stationary policy pa and a value function V a : S ! R such that
  By Lemma 7, the sequence is uniformly bounded. In addition, The-
V a ð~
xÞ ¼ T a V a ð~
xÞ ¼ Q a ~x; pa ð~
xÞ 8~
x 2 S: ð11Þ orem 3 and Lemma 6 imply that
inf fThn ð~ xÞg ¼ q
xÞ  hn ð~ 8n 2 N;
~
x
Corollary 1. Under the assumptions of Theorem 4, V a : S ! R is con-
vex and continuous for SLCPs. In particular, for every bounded measur- since otherwise either the lower bound property stated in Theo-
able function V 0 : S ! R it holds that rem 3 or inequality (13) would be violated. In particular, this
implies that for all ~
x2S
xÞ ¼ lim T na V 0 ð~
V a ð~ xÞ 8~
x 2 S;
n!1 q þ hn ð~xÞ 6 Thn ð~xÞ;
where the convergence is uniform. hn ð~
xÞ 6 hnþ1 ð~
xÞ:
Moreover, Schäl (1993) introduces Thus, hn defines a non-decreasing sequence of uniformly bounded
ma :¼ minV a ð~
xÞ; convex functions and hence converges point-wise. We denote the
~
x2S
limit as
xÞ :¼ V a ð~
wa ð~ xÞ  ma : ð12Þ 
h ð~
xÞ :¼ lim hn ð~
xÞ:
n!1
For every a 2 ð0; 1Þ; wa is a non-negative convex function. In addi-

tion, Schäl (1993) makes the following assumption. It follows that h is also convex and bounded and satisfies
q þ h ð~xÞ 6 Th ð~xÞ 8~x 2 S: ð15Þ
Assumption 3 (Schäl (1993), Condition (B)).
By definition hn satisfies
xÞ < 1 8~
sup wa ð~ x 2 S:
a2ð0;1Þ q þ hnþ1 ð~xÞ ¼ Thn ð~xÞ:
In the following, we show that a slightly stronger condition
holds. Since the Bellman operator is continuous we get
q þ h ð~xÞ ¼ Th ð~xÞ: ð16Þ
Lemma 5. If WA holds, Assumption 3 is satisfied and there exists This allows us to state our main theorem.
M < 1 such that
sup wa ð~
xÞ 6 M 8~
x 2 S: Theorem 6 (Existence of Convex Solution to ACOE). It exists a

a2ð0;1Þ bounded convex function h : S ! R and a scalar q that satisfy the
Lemmas 4 and 5 imply that the main result of Schäl (1993) also ACOE
holds for SLCPs.
q þ h ð~xÞ ¼ Th ð~xÞ 8~x 2 S:

Theorem 5 (Schäl (1993), Theorem 3.8). Given that Assumptions 2 In particular, h is continuous on the interior of S and u.s.c. on S. If

and 3 hold, there exists a stationary deterministic policy p 2 P that is there exists a policy that achieves the minimum in Th for every
average cost optimal with optimal average cost q independent of the ~
x 2 S, this policy is optimal.

initial state. In addition, for any sequence ak – arrow1; ðk ! 1Þ, Even if a policy corresponding to h does not exist, the ACOE can
be used to generate -optimal policies for any  > 0.
q ¼ lim ð1  ak Þmak : In the remainder of this section we show that our results imme-
k!1
We now use these results to prove the existence of a convex diately imply the existence of a continuous convex and bounded
solution to the ACOE. optimal relative value function in the simpler special case of a
strongly continuous transition law. This also illustrates the diffi-
culty of weakly continuous transition laws.
Lemma 6. Suppose an arbitrary sequence ak % 1; k ! 1, and define
the two functions Definition 6 (Gordienko and Hernández-Lerma (1995), Assump-
~
hð xÞ :¼ lim supwak ð~
xÞ; tion 2.2(c)). A transition law P : W ! MðSÞ is strongly continuous,
k!1 if for each measurable and bounded function / : S ! R the map
hð~
xÞ :¼ lim inf wak ð~
xÞ: Z
k!1
ð~
x; ~
uÞ # /ð~
yÞP½d~
yj~
x; ~
u
For all ~
x 2 S, it holds that hð~ ~
xÞ 6 hð xÞ, and further more we have S

~
q þ hð ~
xÞ 6 T hð xÞ; ð13Þ is a continuous function on W.

q þ hð~xÞ P Thð~xÞ: ð14Þ

 and h are bounded, and h
 is convex. Theorem 7. If h and q satisfy the ACOE as given in Theorem 6 and
In addition, h 
the transition law is strongly continuous, then h is a continuous, con-
Theorem 2 implies that any well-defined policy corresponding vex and bounded function and ph is a well-defined optimal policy.
to h would be optimal, since (14) is equal to the ACOI. However,
the limit inferior does not preserve convexity. Therefore, we will
use h to show the existence of a convex function that satisfies 6. Approximation algorithm
the ACOE.
We know from the discussion in Section 4 that the Bellman
~
Lemma 7. jT n hðxÞ  nq j is bounded uniformly for all n 2 N; ~
x 2 S. operator preserves piecewise-linear convex functions and that
S. Woerner et al. / European Journal of Operational Research 241 (2015) 85–98 91

we can get a supporting hyperplane by evaluating the dual LP (8). rent relative value function at all points that influence the value of
In addition, Theorem 6 implies the existence of a convex solution the corresponding lower bound by computing the supporting
to the ACOE. These results enable us to propose an approximate hyperplanes. Using the introduced notation, we state MARVI in
version of RVI that uses piecewise-linear convex functions. Our Algorithm 1. The algorithm always terminates as there is only a
algorithm produces a non-decreasing sequence of lower bounds finite number of hyperplanes to be added in a particular step.
to the optimal average cost, we therefore call it Monotone Approx-
imate RVI (MARVI). We use the resulting relative value function to Algorithm 1. MARVI (N = number of steps)
determine a corresponding policy as defined in (6). Thus, we can
simulate the system under this policy to estimate the resulting set h0 ð~
xÞ ¼ 0 for all ~x2S
average cost. The lower bounds allow us to bound the optimality compute qðh0 Þ
gap. for n ¼ 1; . . . ; N do
set hn ð~
xÞ ¼ hn1 ð~ xÞ þ qðhn1 Þ
6.1. Lower bound computation repeat
compute qðhn Þ; ~ xn ; ~
un
A lower bound for the optimal average cost per stage can be
compute ðu ~ ½~ xn ; hn1 ; w½~ xn ; hn1 Þ
computed from every given relative value function as shown in
add hyperplane to ht if not redundant
Theorem 3. If the relative value function is piecewise-linear and
for j ¼ 1; . . . ; J do
convex, the lower bound can be computed efficiently as discussed !
in the following. xj ¼ Aj~
set ~ xn þ Bj~ un þ C j
Suppose a piecewise-linear convex relative value function is compute ðu ~ ½~xj ; hn1 ; w½~
xj ; hn1 Þ
given as add hyperplane to hn if not redundant
 T  end for
xÞ ¼ max u
hð~ ~ k~
x þ wk :
k¼1;...;K until no hyperplanes were added to hn
shift hn by qðhn Þ
In (7) it is shown that Thð~ xÞ can be evaluated by solving an LP for a
given ~x 2 S. It is easy to see that if hn ¼ hn1 then
break for-loop
minfThð~
xÞg ð17Þ end if
~
x2S
end for
can be evaluated as an LP as well, since there is only a linear depen- return differential cost function hN , lower bound qðhN Þ
dence on ~
x in the right-hand side of the LP in (7), and S is linearly
constrained.
In order to compute the lower bound, we consider the state As mentioned before, the sequence of lower bounds produced
space decomposition by MARVI is non-decreasing, as we prove in the following
  theorem.
S l :¼ ~ ~ Tl~
x 2 Sju ~ Tk~
x þ wl P u x þ wk ; 8k ¼ 1; . . . ; K ; l ¼ 1; . . . ; K;
ð18Þ
Theorem 8 (Monotonicity of MARVI Lower Bounds). The sequence of
which decomposes the state space into polyhedral regions, each lower bounds generated by MARVI is non-decreasing.
defined by one active hyperplane among those used in the descrip-
tion of h. The lower bound corresponding to h, denoted by qðhÞ, can
then be written as Proof of Theorem 8. Suppose a piecewise-linear convex relative
value function hn1 generated by MARVI. The corresponding lower
qðhÞ ¼ minfThð~xÞ  hð~xÞg ¼ min minfThð~xÞ  hð~xÞg bound and minimizer are qðhn1 Þ and ð~ xn1 ; ~
un1 Þ. Note that
~
x2S l¼1;...;K ~
x2S l
  T  Thn1 ð~
xÞ P hn1 ð~
xÞ, for all ~
x 2 S (Bertsekas, 2007). Therefore, we
xÞ  u
¼ min min Thð~ ~l ~
x þ wl : initialize hn with hn1 þ qn1 , since it is the smallest function that
l¼1;...;K ~
x2S l
satisfies this property. Suppose no further hyperplanes were added
For every l ¼ 1; . . . ; K the corresponding inner minimization prob- within an iteration step. This implies that
lem is an LP, since it can be constructed from (17) by adding a linear
term to the objective and adding the required linear constraints to Qð~
xn ; ~
un jhn Þ ¼ Q ð~
xn ; ~
un jThn1 Þ and hn ð~
xn Þ ¼ Thn1 ð~
xn Þ:
guarantee ~ x 2 S l . Thus we can compute the lower bound by solving
K LPs. If such a sub-problem is infeasible, it means that the corre- This leads to
sponding hyperplane is never active and can be deleted. Note that
the sub-problems are independent and can be solved in parallel. qðhn Þ ¼ minfThn ð~xÞ  hn ð~xÞg ¼ Q ð~xn ; ~
un jhn Þ  hn ð~
xn Þ
The minimizer of the lower bound computation is a state-action ~
x2S

pair ð~x; ~
uÞ 2 W, where the minimization over the action is hidden ¼ Qð~
xn ; ~ xn Þ P T 2 hn1 ð~
un jThn1 Þ  Thn1 ð~ xn Þ  Thn1 ð~
xn Þ
within the Bellman operator. We denote the minimizing state- n o
2
action pair for a particular relative value function h by P min T hn1 ð~ xÞ  Thn1 ð~xÞ ¼ qðThn1 Þ P qðhn1 Þ
~
x2S
ð~
xðhÞ; ~
uðhÞÞ 2 W. If the relative value function has an index, hn ,
we write ð~ xn ; ~
un Þ instead. where the last inequality holds because of the monotonicity of
lower bounds. h
6.2. Monotone Approximate Relative Value Iteration (MARVI)
MARVI and Theorem 8 can also be applied to unbounded state
As stated in Lemma 2, evaluating Thð~xÞ via the dual LP returns a spaces. However, the presented existence theory does not hold in
supporting hyperplane of Th at ~ x, which we denote by its current form. Whether it can be extended to unbounded state
ðu
~ ½~
x; h; w½~
x; hÞ. spaces is subject of further research.
In every iteration step, MARVI computes the current lower In Sections 6.2.1 and 6.2.2, we provide two ways of speeding
bound, checks the corresponding minimizer, and updates the cur- up the basic algorithm. First, we explain how the lower bound
92 S. Woerner et al. / European Journal of Operational Research 241 (2015) 85–98

computation during the iteration steps can be sped up signifi- 7. Application example: multiple sourcing
cantly. Second, we introduce a way to reduce the number of
hyperplanes after every iteration step without decreasing the We focus on the MSP to apply our results and as a test case for
associated lower bound. MARVI. First, we formally introduce MSPs and show that they
satisfy the assumptions made within this paper. Second, we inves-
6.2.1. Improving the lower bound computation tigate a set of dual sourcing instances where we increase the lead
The lower bound computation within MARVI can be sped up time of the slow supplier. We provide the results obtained with
significantly. Within every step, MARVI adds some hyperplanes MARVI, compare them with the DIP, and discuss the differences.
to the current relative value function approximation and then There are other parametric heuristics studied in the literature
recomputes the lower bound. The lower bound computation besides the DIP, such as SIP and COP (Scheller-Wolf et al., 2006;
results in an LP for every hyperplane. However, for earlier added Veeraraghavan & Scheller-Wolf, 2008; Klosterhalfen et al., 2011).
hyperplanes the optimal solution is already known if the new In our studies, DIP has always been superior to SIP and COP.
hyperplanes are ignored. For the LP corresponding to an old hyper- Therefore, we only consider DIP in the forthcoming analysis. Since
plane, adding hyperplanes translates to adding new constraints. MARVI provides lower bounds to the optimal average cost, we can
Instead of solving the updated LP again, we check whether the bound the optimality gap for different policies although the
old optimal solution stays feasible when the additional constraints problems cannot be solved exactly.
are added. If this is the case, the old solution stays optimal and the At last, we discuss an MSP instance with six different suppliers
LP does not have to be resolved. and a larger state and action space. We can solve this instance to
optimality and show that the complexity does not necessarily
correlate with the dimension of the state space, but with other
6.2.2. Reducing the number of hyperplanes problem parameters.
Suppose a piecewise-linear convex relative value function
h : S ! R is given as 7.1. Problem definition
 
xÞ :¼ max u
hð~ ~ Tk~
xþ wk :
k¼1;...;K We address the MSP as defined in Veeraraghavan and Scheller-
Wolf (2008). Thus, we consider a periodic review problem with
We show how to construct a function h ~ : S ! R, defined by a subset N 2 N suppliers, where unmet demand is backlogged and satisfied
of the hyperplanes of h, which corresponds to a lower bound greater as soon as possible. A positive inventory level represents stock on
than or equal to the lower bound corresponding to h. hand, and a negative one represents backlogged demand. The
Consider the sub-problem of the lower bound computation cor- notation is provided in Table 1.
responding to hyperplane k 2 f1; . . . ; K g. Every hyperplane l – k of With the notation in Table 1 we can state the MSP dynamics as
h translates into constraints in the sub-problem. There are J con- X
N X
N
straints for every hyperplane to cover the piecewise-linear convex ytþ1 ¼ yt þ yi;t;0 þ dðLi ¼ 0Þui;t  dt ;
objective and one constraint to restrict to the feasible set S k  S. i¼1 i¼1

This follows from (7) and (18). zi;tþ1;j1 ¼ zi;t;j ; 8i 2 f1; . . . ; Ng; j 2 f1; . . . ; Li g;
Removing a hyperplane from h affects all lower bound sub- zi;tþ1;Li ¼ ui;t ; 8i 2 f1; . . . ; NjLi > 0g;
problems, where the corresponding constraints are active for the
where dðÞ is 1 if the given condition is satisfied and 0 otherwise.
optimal solution, and the sub-problem corresponding to the hyper- 
For readability we denote states as ~
plane itself. We use this relation to construct a directed graph G on  xt :¼ yt ; zi;t;j j j ¼ 1; . . . ;
Li ; i ¼ 1; . . . ; NÞ and actions as ~
ut :¼ ui;t j i ¼ 1; . . . ; N , respec-
the set of hyperplanes, respectively sub-problems.
tively. Therefore, state and action space are defined as
Consider two nodes k – l 2 f1; . . . ; K g. We add a directed edge
N
ðl; kÞ if the sub-problem corresponding to k has a constraint corre-  i Li ;
  ½0; u
S :¼ ½y; y
sponding to l that is active for the optimal solution. If no such con- i¼1
N
straint exists, hyperplane l does not affect the optimal solution of  i :
U :¼ ½0; u
sub-problem k. i¼1
P
Next, we compute the strongly connected components of G, i.e., with corresponding dimensions dimS ¼ 1 þ Ni¼1 Li and dimU ¼ N.
the maximal strongly connected sub-graphs. We construct a sec- State and action space are illustrated in Fig. 1.
ond graph G0 with the strongly connected components of G as The immediate costs are defined as
nodes. There is an edge between two nodes of G0 if there is at least X
N
one edge between the corresponding strongly connected compo- xt ; ~
cð~ ut Þ ¼ hðyt Þþ þ bðyt Þ þ ci ui;t ;
nents of G. G0 is called the condensation of G. i¼1

It is known that for every directed graph, its condensation is a


Table 1
Directed Acyclic Graph (DAG). Every DAG has at least one root
MSP notation.
node. We pick a root node of G0 that corresponds to a strongly con-
nected component with minimal number of elements. We keep all Parameter Description

hyperplanes that are elements of the chosen root node and drop h; b P 0 Unit holding/backlogging costs
the rest. ci P 0 Unit order cost of supplier i 2 f1; . . . ; N g
Li 2 N Lead time of supplier i 2 f1; . . . ; N g
By construction, the sub-problems of the kept hyperplanes are
yt 2 R Inventory level at period t 2 N
not affected by the removed hyperplanes, i.e., their optimal solu- 
y; y Inventory level bounds
tions stay the same. Since we reduce the number of sub-problems ui;t P 0 Order quantity at supplier i 2 f1; . . . ; N g at period t 2 N
but do not change the remaining ones, the resulting lower bound is i
u Maximal order quantity at supplier i 2 f1; . . . ; N g
greater than or equal to the original lower bound. zi;t;j Open order at period t 2 N at supplier i 2 f1; . . . ; N g,
We apply this hyperplane reduction at the end of every MARVI arriving in j 2 f1; . . . ; Li g periods
dt Demand at period t 2 N (i.i.d.)
iteration step. Since the lower bounds are never decreased by this
Dmax P 0 Maximal demand
method, Theorem 8 still holds.
S. Woerner et al. / European Journal of Operational Research 241 (2015) 85–98 93

where ðÞþ and ðÞ denote the positive and negative part of the the maximal inventory level and no open orders. This state can
given number. Since ðÞþ and ðÞ are piecewise-linear convex func- always be reached by placing enough orders and assuming zero
tions, it follows that c : W ! R is piecewise-linear and convex as demand, which has positive probability by Assumption 4. Starting
well. It can be easily seen that MSP is an SLCP. from this state, one can reach any inventory level with constant
Without loss of generality we make the following assumption. probability by assuming constant demand and adapting the orders
accordingly. Similarly, the open orders of the target state can be
Assumption 4. Zero demand has positive probability. reached by assuming zero demand again and placing the orders
Otherwise, we could constantly place orders of the size of the needed. Since the state space is compact, the needed number of
minimal demand at the cheapest supplier. This will shift the periods can be bounded uniformly. h
demand distribution accordingly.
Corollary 2. For every MSP instance as defined in this section, there

7.1.1. Feasibility exists a convex and bounded relative value function h : S ! R and a
In order to obtain a compact state space, we impose bounds on scalar q that satisfy the corresponding ACOE.
inventory level and order quantities. However, these bounds can
interfere with Assumption 1. We have to make sure that the
7.2. Dual index policy
system remains feasible. Therefore, additional state-dependent
constraints on the action space have to be added.
In its standard form DIP is a heuristic for MSPs with two
First, we have to make sure that the inventory level upper
suppliers. DIP determines the order quantities as a function of
bound is never violated. We achieve this by bounding the inven-
the Inventory Position of the system P t as well as the so-called
tory position (the sum of inventory level and all open orders) by
Expedited Inventory Position P et . The Inventory Position P t is the
the upper bound for the inventory level:
sum of stock on hand and all open orders:
X
N X Li
N X
ui;t þ yt þ :
zi;t;j 6 y X Li
N X
i¼1 i¼1 j¼1 Pt :¼ yt þ yi;t;j :
i¼1 j¼1
Second, we have to enforce a lower bound for the inventory level. We
add a constraint that guarantees that the minimal inventory level at The Expedited Inventory Position Pet is the sum of stock on hand and
the next time step is greater than or equal to the lower bound: all open orders that will arrive within the lead time of the fast
X
N X
N supplier, i.e.,
yt þ zi;t;1 þ ui;t dðLi ¼ 0Þ P y þ Dmax : X
N X
L2
i¼1 i¼1 Pet :¼ yt þ yi;t;j ;
i¼1 j¼1
However, this constraint might be infeasible if there is no supplier
with lead time equal to zero. In this case, we add an artificial where L2 denotes the lead time of the fast supplier. DIP depends on
supplier with lead time zero that allows orders of the size of the two parameters. No exact way to compute the optimal parameters
maximal demand. This supplier is assumed to be very expensive is known, but efficient simulation-based approaches have been pro-
and should never be used in practice. The artificial supplier can be posed (Veeraraghavan & Scheller-Wolf, 2008; Arts et al., 2009). DIP
interpreted as lost sales if the backlog is getting too big. If the unit is defined as follows.
order cost of the artificial supplier and inventory level lower bound
are chosen properly, this constraint will never be binding except for Definition 7 (Dual Index Policy). DIP uses two parameters
transient states. zr ; ze 2 Z and is defined as
These additional constraints make sure that the system is feasi-  
ble, i.e. Assumption 1 holds. u2;t ¼ max 0; zr  Pet ;
 
u1;t ¼ max 0; ze  ðPt þ u2;t Þ :
7.1.2. Weak accessibility
WA allows us to apply the existence results in Section 5 and we DIP can be extended to multiple suppliers by introducing one
show that it holds for MSPs. index per supplier and the corresponding order-up-to levels.

Theorem 9 (Weak Accessibility). WA holds for MSPs. 7.3. Results: dual sourcing
Proof of Theorem 9. In order to prove the theorem, we have to
show that there exist T 2 N and s > 0, such that for any two states We consider a set of MSP instances with two suppliers and
~ z 2 S, there are policies pt;~y; ~z 2 P; t ¼ 0; . . . ; ðT  1Þ, such that
y; ~ increasing complexity. We vary the lead time of the slow supplier
P½9t 6 T : ~
xt ¼ ~
yj~ x P s;
x0 ¼ ~ ð19Þ
so ~
z can be reached from ~
y with probability greater than or equal to
s in at most T periods. It can be assumed that the system starts with Table 2
Dual sourcing instances.

Parameter Value

z1t1 z1t2 u1t N 2


ðh; bÞ ð1; 100Þ

½y; y ½100; 100
dt yt z2t1 u2t L1 1; . . . ; 10
L2 0
ðc1 ; c2 Þ ð0; 20Þ
u3t i
u 20; i ¼ 1; 2
dt Uðf0; 1; 2; 4; 8; 16gÞ
dimS ðL1 þ 1Þ
Fig. 1. Illustration of MSP state and action space: 3 suppliers with lead times 2, 1, dimU 2
and 0, respectively.
94 S. Woerner et al. / European Journal of Operational Research 241 (2015) 85–98

Table 3
Simulation results: lead time of slow supplier, MARVI computation time, lower bound, number of hyperplanes, estimated costs of MARVI policy, estimated costs of DIP, optimal
DIP parameters, max. optimality gap for MARVI policy, max. optimality gap for DIP.

L1 T (seconds) Lower bound #Planes MARVI DIP zopt


r zopt
e
MARVI/LB (%) DIP/LB (%)

1 <1 21.67 4 21.69 21.69 32 0 0.10 0.10


2 <1 28.24 16 28.08 28.08 40 0 0.00 0.00
3 6 31.37 73 31.35 31.74 48 8 0.00 1.18
4 59 32.69 352 32.98 33.46 52 16 0.90 2.36
5 134 33.47 629 34.25 34.93 57 16 2.34 4.36
6 383 33.96 1246 35.41 36.04 62 16 4.27 6.12

37 29
MARVI MARVI
35 28
DIP DIP
33

Holding Costs
Lower Bound 27
Total Costs

31 26
29 25
27 24
25 23
23 22
21 21
1 2 3 4 5 6 1 2 3 4 5 6
L1 L1
9 4
MARVI MARVI
8
Backlogging Costs

DIP 3.5 DIP


7
Ordering Costs

3
6
2.5
5
2
4
1.5
3
1
2
1 0.5
0 0
1 2 3 4 5 6 1 2 3 4 5 6
L1 L1

Fig. 2. Simulation results: contribution of different cost factors to total costs of MARVI policy and DIP w.r.t. lead time of slower supplier.

Table 4
without changing the other parameters. The problem parameters
MSP instance and MARVI results.
are given in Table 2.
We run MARVI with 15 steps to approximate the optimal Parameter Value
policy and to obtain a lower bound. The resulting policy and N 6
DIP with optimal parameters are evaluated by simulation of ðh; bÞ ð1; 100Þ

½y; y ½100; 100
100,000 periods. All computations were carried out on 4 hexa-
i
u 20; i ¼ 1; . . . ; 6
core Xeon X5660 CPUs (2.67 gigahertz). Maximal memory usage
dt Uðf0; 1; 2; 4; 8; 16gÞ
was less than 1 gigabytes. The detailed results are given in dimS 16
Table 3. dimU 6
Table 3 shows that the difference between the average costs of Comp. time 12 seconds
the MARVI policy and the DIP increases with the lead time Lower bound 26.625
difference of the two suppliers. In the remainder of this section, MARVI costs 26.581
we analyze how the different cost factors contribute to the total MARVI/ LB 0.00%
# Planes 76
costs and what differences there are between the policies. To this
end, we show the average holding, backlogging, ordering and total
costs for the different instances and policies in Fig. 2. Table 5
Table 3 shows that zopt
e is equal to the maximal demand for the Simulation results: supplier lead times, unit order costs, average and relative order
scenarios with L1 greater than or equal to 4. Since the fast supplier quantities.
can deliver immediately in the next period this implies that there Supplier-ID Lead time Unit cost Avg. order Rel. order (%)
is no backlogging at all under the DIP, as can be seen in Fig. 2.
1 5 0.000 0.834 16.2
If the lead time difference is getting too big, the DIP reacts by 2 4 0.015 1.972 38.2
preventing backlogging completely, which leads to high holding 3 3 0.165 0.445 8.6
costs. On the other hand, the MARVI policy allows backlogging 4 2 0.765 1.090 21.1
and usually uses the fast supplier more often to reduce the holding 5 1 2.265 0.384 7.4
6 0 9.015 0.431 8.4
costs significantly compared to DIP.
S. Woerner et al. / European Journal of Operational Research 241 (2015) 85–98 95

37 1,000 100
All 6 Suppliers 900 Comp. Time (sec.) 90 Supplier 1

Optimal Average Costs

Comp. Time/#Planes
35 Suppliers 1 & x #Planes Supplier x

Supplier Usage (%)


800 80
33 700 70
600 60
31 500 50
400 40
29 300 30
27 200 20
100 10
25 0 0
2 3 4 5 6 2 3 4 5 6 2 3 4 5 6
Second Supplier Second Supplier Second Supplier

Fig. 3. Simulation results: the optimal costs for all six suppliers compared to the corresponding dual sourcing instances (left). The computation time and number of planes for
the dual sourcing instances (middle). The relative usage of the two suppliers (right).

7.4. Results: multiple sourcing We first tested MARVI on different MSP instances with two sup-
pliers and compared it to the DIP. The policy computed with MAR-
We now go beyond dual sourcing and consider an MSP instance VI consistently outperformed the DIP. In the considered dual
with six suppliers with different lead times and unit order costs. sourcing scenarios, we found that a bigger lead time difference
We choose the unit order costs in such a way that all suppliers leads to a worse DIP performance.
are used. The complete specification of the instance is given in We also investigated a higher dimensional MSP instance with
Tables 4 and 5. six suppliers. Even though the problem has larger state and action
We approximate the optimal policy for all six suppliers with 15 spaces than all the other dual sourcing instances, it was solved fas-
MARVI steps. Furthermore, we analyze how much is ordered on ter than most of them, in just a few seconds. This shows that the
average at the different suppliers in a simulation run over complexity does not only depend on state/action space dimension
100,000 periods. The results of this experiment are given in Tables but also on other problem characteristics. Determining what drives
4 and 5. the complexity is an open task for further research. Once this is
Next, we compare dual sourcing instances where we keep the cheapest supplier and add one of the remaining suppliers. The results of this experiment are shown in Fig. 3. Note that the gap between simulated costs and the MARVI lower bounds has always been less than 1%, usually close to zero. Therefore we consider the resulting policies optimal and only report the optimal costs.

The results of this experiment allow several conclusions. First, a strong portfolio effect can be observed. Using all six suppliers decreases the costs by more than 4% compared to the best of the considered dual sourcing combinations (the other possible combinations achieve about the same performance as the reported ones; the corresponding results are shown in Appendix B). In particular, the combination of suppliers 3 and 6 achieves the best dual sourcing result (optimal average costs of 27.454), even though these two suppliers are hardly used in the instance with all six suppliers.

Second, the computation times vary considerably and are not at all monotone in the lead times and the state space dimension, respectively. The dual sourcing instances where the average order quantities of the two suppliers are about the same are the easiest to solve, whereas the instances with a strong focus on one supplier are the hardest. In particular, the instance with all six suppliers was solved the fastest and with the smallest number of hyperplanes.

8. Conclusion and outlook

We introduced simple and realistic conditions that imply the existence of an optimal policy and a convex solution to the ACOE for SLCPs, a broad class of MDPs that is particularly useful in inventory management.

We exploited these insights to construct a new approximation algorithm, MARVI, to compute good policies for this broad class of problems. MARVI provides lower bounds on the optimal average costs, which can be used to bound the optimality gap of a particular policy.

We first tested MARVI on different MSP instances with two suppliers and compared it to the DIP. The policy computed with MARVI consistently outperformed the DIP. In the considered dual sourcing scenarios, we found that a bigger lead time difference leads to a worse DIP performance.

We also investigated a higher-dimensional MSP instance with six suppliers. Even though this problem has larger state and action spaces than all the other dual sourcing instances, it was solved faster than most of them, in just a few seconds. This shows that the complexity depends not only on the state/action space dimension but also on other problem characteristics. Determining what drives the complexity is an open task for further research. Once this is understood, it might help in designing new, more efficient policy structures, value function approximations, and solution algorithms.

Another open point is related to the convergence of MARVI itself. We know that the sequence of lower bounds converges, as it is monotone and bounded, but it is not clear whether it converges to the optimal average cost. Nevertheless, all empirical evidence points in this direction.

There are several interesting extensions to the presented theory, algorithm, and application. First, the presented theory does not provide conditions under which the solution of the ACOE is continuous up to the boundary of the state space as well. Such conditions would immediately imply the existence of a corresponding policy. Moreover, the restriction to compact state spaces is sometimes a bit artificial. It is therefore an open task to extend the presented theory to unbounded state spaces. This could be one way to resolve the possible discontinuity of the relative value function on the boundary of the state space.

It is straightforward to apply MARVI to discounted cost problems, which are much simpler than the average cost problem considered here. The resulting value function approximation would provide a lower bound on the optimal expected discounted cost for any initial state. Another extension could be to replace the expectation in the Bellman operator by the worst case. This is easy to implement, but it is not clear how to interpret this setup within an infinite horizon setting.

Lastly, the MSP as introduced here can easily be extended to stochastic lead times as described, e.g., in Kaplan (1970) and Ehrhardt (1984). This again yields an SLCP, so the theory and algorithm presented in this paper remain applicable.

To conclude, a promising new approach to approximating optimal policies for SLCPs was presented in this paper and successfully tested. There is still potential to further develop the corresponding theory, as well as to extend the algorithm to different settings.

Acknowledgment

Rico Zenklusen has been partially supported by NSF grant CCF-1115849. Furthermore, the authors would like to thank the three referees for their constructive comments, which helped to improve the paper.

Appendix A. Technical proofs

Proof of Lemma 3. $T_\alpha^n h(\tilde{x})$ denotes the optimal expected $n$-period discounted costs with terminal costs $h$ for a process started in $\tilde{x} \in S$. The corresponding optimal policies are given by
$$\pi_t(\tilde{x}) := \operatorname*{argmin}_{\tilde{u} \in U(\tilde{x})} \left\{ Q\big(\tilde{x}, \tilde{u} \mid T_\alpha^{(n-t)} h\big) \right\}, \quad t = 0, \ldots, n, \ \tilde{x} \in S.$$
With this notation it follows that for all $\tilde{x} \in S$
$$T_\alpha^n h(\tilde{x}) = E\left[ \sum_{t=0}^{n} \alpha^t c(\tilde{x}_t, \pi_t(\tilde{x}_t)) + \alpha^n h(\tilde{x}_n) \,\Big|\, \tilde{x}_0 = \tilde{x} \right].$$
Since $h$ is continuous and $S$ compact, there exists $M_h > 0$ such that $|h(\tilde{x})| \le M_h$ for all $\tilde{x} \in S$. In order to compute bounds, we can drop $\alpha^n h(\tilde{x}_n)$ and add $\pm M_h$ accordingly. Thus, w.l.o.g. we can assume $h = 0$ in the remainder of this proof.

The immediate costs $c$ are non-negative and continuous, hence bounded on the compactum $W$. Therefore, there exists $M_c > 0$ such that $0 \le c(\tilde{x}, \tilde{u}) \le M_c$ for all $(\tilde{x}, \tilde{u}) \in W$. Therefore, for all $\tilde{x} \in S$ and $n \in \mathbb{N}$ it holds that
$$|T_\alpha^n h(\tilde{x})| \le \sum_{t=0}^{n} \alpha^t M_c \le \sum_{t=0}^{\infty} \alpha^t M_c = \frac{M_c}{1-\alpha}. \tag{A.1}$$
WA implies that there exist $T \in \mathbb{N}$ and $s > 0$ such that for every $\tilde{y}, \tilde{z} \in S$ there are policies $\pi_{t,\tilde{y},\tilde{z}} \in \Pi$ such that (9) holds. We use this to construct new policies, denoted by $\tilde{\pi}_{t,\tilde{y},\tilde{z}} \in \Pi$, as follows. For $t = 0, \ldots, T-1$ we follow the policies $\pi_{t,\tilde{y},\tilde{z}} \in \Pi$. After $T$ periods we are in a new state $\tilde{w} \in S$. For $t = T, \ldots, 2T-1$ we apply the policies $\pi_{t,\tilde{w},\tilde{z}} \in \Pi$, and so on.

The resulting expected discounted cost to go from initial state $\tilde{y}$ to target state $\tilde{z}$ when following $\tilde{\pi}_{t,\tilde{y},\tilde{z}}$ can be defined as
$$C(\tilde{y}, \tilde{z}) := E\left[ \sum_{t=0}^{\Gamma(\tilde{z})} \alpha^t c\big(\tilde{x}_t, \tilde{\pi}_{t,\tilde{y},\tilde{z}}(\tilde{x}_t)\big) \,\Big|\, \tilde{x}_0 = \tilde{y} \right],$$
where $\Gamma(\tilde{z})$ denotes the first hitting time of $\tilde{z}$.

Exploiting WA we can bound $C(\tilde{y}, \tilde{z})$ as follows. We start the process in $\tilde{y}$ and run it for $T$ periods. WA implies that we reach $\tilde{z}$ with probability greater than or equal to $s$. If we do not reach $\tilde{z}$, we repeat this experiment with the new initial state and run the process for another $T$ periods. This is like tossing a coin with success probability greater than or equal to $s$ and cost $TM_c$. Therefore, we can bound the expected discounted costs by
$$C(\tilde{y}, \tilde{z}; \pi_{t,\tilde{y},\tilde{z}}) \le \sum_{i=0}^{\infty} TM_c\,(i+1)(1-s)^i s = \frac{TM_c}{s} =: M,$$
and $M$ is a uniform bound for $C(\tilde{y}, \tilde{z})$.
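For completeness, the value of the series is a routine computation: with $x = 1-s \in [0,1)$ and the identity $\sum_{i \ge 0} (i+1)x^i = (1-x)^{-2}$,
$$\sum_{i=0}^{\infty} TM_c\,(i+1)(1-s)^i s \;=\; TM_c\, s \sum_{i=0}^{\infty} (i+1)(1-s)^i \;=\; \frac{TM_c\, s}{\big(1-(1-s)\big)^2} \;=\; \frac{TM_c}{s}.$$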
We combine the previous results and construct a new process to get an upper bound for $T_\alpha^n h(\tilde{y})$ in terms of $T_\alpha^n h(\tilde{z})$. Therefore, we start the process in $\tilde{y}$, run it until we reach $\tilde{z}$ by following $\tilde{\pi}_{t,\tilde{y},\tilde{z}}$, and afterwards follow the optimal policies for $n$ more periods. The cost of this process is given by
$$E\left[ \sum_{t=0}^{\Gamma(\tilde{z})-1} \alpha^t c\big(\tilde{x}_t, \tilde{\pi}_{t,\tilde{y},\tilde{z}}(\tilde{x}_t)\big) + \sum_{t=\Gamma(\tilde{z})}^{n+\Gamma(\tilde{z})} \alpha^t c\big(\tilde{x}_t, \pi_{(t-\Gamma(\tilde{z}))}(\tilde{x}_t)\big) \,\Big|\, \tilde{x}_0 = \tilde{y} \right].$$
With the previous results we can bound it by
$$M + E\big[\alpha^{\Gamma(\tilde{z})}\big]\, T_\alpha^n h(\tilde{z}).$$
Since $\alpha \in [0,1]$ we can just drop $E\big[\alpha^{\Gamma(\tilde{z})}\big]$ and still get an upper bound. By construction, this is an upper bound for $T_\alpha^n h(\tilde{y})$, and leads to
$$T_\alpha^n h(\tilde{y}) - T_\alpha^n h(\tilde{z}) \le M + T_\alpha^n h(\tilde{z}) - T_\alpha^n h(\tilde{z}) = M. \tag{A.2}$$
By switching $\tilde{y}$ and $\tilde{z}$ and applying the same argument again we can complete the proof. □

Proof of Lemma 4. Assumptions 2.1 and 2.4 are satisfied by the definition of SLCPs. Furthermore, $U(\tilde{x})$ is compact by definition, and the u.s.c. of $\tilde{x} \in S \mapsto U(\tilde{x})$ follows directly from Aubin and Frankowska (1999), Proposition 1.4.8, which implies Assumption 2.2. It remains to be shown that the transition law defined in (4) is weakly continuous. Suppose $\phi : S \to \mathbb{R}$ is continuous. From (4) it follows that
$$\int_S \phi(\tilde{y})\, P[d\tilde{y} \mid \tilde{x}, \tilde{u}] = \sum_{j=1}^{J} q_j\, \phi(A_j \tilde{x} + B_j \tilde{u} + C_j).$$
However, since $\phi$ is continuous,
$$(\tilde{x}, \tilde{u}) \mapsto \sum_{j=1}^{J} q_j\, \phi(A_j \tilde{x} + B_j \tilde{u} + C_j)$$
is continuous as well. □
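The scenario structure in (4) also makes the expectation in the proof above directly computable: the integral against the transition law collapses to a finite sum. The following Python sketch evaluates this sum for toy scenario data $(A_j, B_j, C_j, q_j)$ invented purely for illustration; continuity of the resulting map in $(\tilde{x}, \tilde{u})$ is inherited from $\phi$ because the sum is finite.

```python
import numpy as np

# Toy scenario data (J = 2), invented for illustration only.
A = [np.array([[1.0, 0.0], [0.0, 0.5]]), np.array([[0.9, 0.1], [0.0, 1.0]])]
B = [np.array([[1.0], [0.0]]),           np.array([[1.0], [0.5]])]
C = [np.array([0.0, -1.0]),              np.array([-2.0, 0.0])]
q = [0.7, 0.3]                           # scenario probabilities, sum to 1

def expected_phi(phi, x, u):
    """E[phi(next state) | x, u] = sum_j q_j * phi(A_j x + B_j u + C_j)."""
    return sum(qj * phi(Aj @ x + Bj @ u + Cj)
               for qj, Aj, Bj, Cj in zip(q, A, B, C))

phi = lambda y: float(np.sum(np.abs(y)))       # any continuous test function
print(expected_phi(phi, np.array([1.0, 2.0]), np.array([0.5])))
```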

Proof of Corollary 1. It is known that for $\alpha \in (0,1)$ the Bellman operator $T_\alpha$ is a contraction mapping for bounded measurable functions w.r.t. the supremum norm (Bertsekas, 2007). Hence, for bounded and measurable $f, g : S \to \mathbb{R}$ it holds that
$$\| T_\alpha f - T_\alpha g \|_\infty \le \alpha \| f - g \|_\infty,$$
where $\|\cdot\|_\infty$ denotes the supremum norm. By the Banach Fixed-Point Theorem this implies that $T_\alpha^n V_0$ converges uniformly to a unique fixed-point $V_\alpha$ independent of $V_0$ as $n$ approaches infinity. In particular, if we choose $V_0$ to be continuous, the uniform convergence implies that the unique fixed-point is also continuous.

Furthermore, if we assume that $V_0$ is convex, it follows from Lemma 1 that $T_\alpha^n V_0$ is also convex for $n \in \mathbb{N}$. Since the limit of a point-wise convergent sequence of convex functions is again convex, this also holds for the unique fixed-point. □

Proof of Lemma 5. From Corollary 1 we know that for every $\alpha \in (0,1)$ and an arbitrary continuous $V_0 : S \to \mathbb{R}$ we have
$$w_\alpha(\tilde{x}) = \lim_{n \to \infty} \left( T_\alpha^n V_0(\tilde{x}) - T_\alpha^n V_0(\tilde{x}_\alpha) \right) \quad \forall \tilde{x} \in S,$$
where
$$\tilde{x}_\alpha := \operatorname*{argmin}_{\tilde{x} \in S} V_\alpha(\tilde{x}).$$
However, if WA holds, we know from Lemma 3 that there exists a constant $M < \infty$ such that for every $n \in \mathbb{N}$ and $\alpha \in (0,1)$
$$\left| T_\alpha^n V_0(\tilde{x}) - T_\alpha^n V_0(\tilde{x}_\alpha) \right| \le M.$$
Hence, this also holds for the limit $n \to \infty$. □

Proof of Lemma 6. By definition it immediately follows that $h(\tilde{x}) \le \bar{h}(\tilde{x})$, and that both functions are bounded. Since the limit superior preserves convexity, it also follows that $\bar{h}$ is a convex function.

By using (12) in (11) we can derive that for all $k \in \mathbb{N}$
$$(1-\alpha_k)\, m_{\alpha_k} + w_{\alpha_k}(\tilde{x}) = T_{\alpha_k} w_{\alpha_k}(\tilde{x}).$$

Theorem 5 implies that taking the limit superior of the left-hand side leads to
$$\limsup_{k \to \infty} \left\{ (1-\alpha_k)\, m_{\alpha_k} + w_{\alpha_k}(\tilde{x}) \right\} = \rho^* + \bar{h}(\tilde{x}).$$
The limit superior of the right-hand side can be bounded as follows:
$$\begin{aligned}
\limsup_{k \to \infty} T_{\alpha_k} w_{\alpha_k}(\tilde{x}) &= \lim_{K \to \infty} \sup_{k \ge K} \inf_{\tilde{u} \in U(\tilde{x})} \left\{ c(\tilde{x}, \tilde{u}) + \alpha_k \sum_{j=1}^{J} q_j\, w_{\alpha_k}(A_j \tilde{x} + B_j \tilde{u} + C_j) \right\} &&\text{(A.3)}\\
&\le \lim_{K \to \infty} \inf_{\tilde{u} \in U(\tilde{x})} \sup_{k \ge K} \left\{ c(\tilde{x}, \tilde{u}) + \alpha_k \sum_{j=1}^{J} q_j\, w_{\alpha_k}(A_j \tilde{x} + B_j \tilde{u} + C_j) \right\} &&\text{(A.4)}\\
&\le \lim_{K \to \infty} \inf_{\tilde{u} \in U(\tilde{x})} \left\{ c(\tilde{x}, \tilde{u}) + \sum_{j=1}^{J} q_j \sup_{k \ge K} \left\{ \alpha_k\, w_{\alpha_k}(A_j \tilde{x} + B_j \tilde{u} + C_j) \right\} \right\} &&\text{(A.5)}\\
&= \lim_{K \to \infty} T\Big( \sup_{k \ge K} \alpha_k w_{\alpha_k} \Big)(\tilde{x}) &&\text{(A.6)}\\
&= T\Big( \lim_{K \to \infty} \sup_{k \ge K} \alpha_k w_{\alpha_k} \Big)(\tilde{x}) &&\text{(A.7)}\\
&= T\Big( \limsup_{k \to \infty} \alpha_k w_{\alpha_k} \Big)(\tilde{x}) &&\text{(A.8)}\\
&= T \bar{h}(\tilde{x}). &&\text{(A.9)}
\end{aligned}$$
In (A.3) we apply the definitions of limit superior and the Bellman operator. Next, we interchange supremum and infimum in (A.4), which can at most increase the value of the expression (Rockafellar, 1970, Lemma 36.1). In (A.5) we take the supremum and the discount factor and put them into the sum, which again can at most increase the value. Once more, we apply the definition of the (undiscounted) Bellman operator in (A.6), and exploit its continuity (w.r.t. the supremum norm) in (A.7). Using the definitions of limit superior and $\bar{h}$ in (A.8) and (A.9), respectively, completes the derivation and, together with the previous discussion, proves (13). Inequality (14) can be shown along the same lines. □

Proof of Lemma 7. The proof follows the idea of a similar result for finite state/action space MDPs stated in Bertsekas (2007), Proposition 4.3.1. From Lemma 6 we know that for all $\tilde{x} \in S$
$$\rho^* + \bar{h}(\tilde{x}) \le T \bar{h}(\tilde{x}).$$
We can repeat this argument $n$ times to get
$$n \rho^* + \bar{h}(\tilde{x}) \le T^n \bar{h}(\tilde{x}),$$
which can be reformulated as
$$\bar{h}(\tilde{x}) \le T^n \bar{h}(\tilde{x}) - n \rho^*.$$
Since $\bar{h}$ is bounded, this proves that $T^n \bar{h}(\tilde{x}) - n \rho^*$ is bounded from below.

An upper bound can be proven as follows. For any $\tilde{x} \in S$ and arbitrary $\tilde{v} \in U(\tilde{x})$ it holds that
$$\begin{aligned}
T \bar{h}(\tilde{x}) &= \inf_{\tilde{u} \in U(\tilde{x})} \left\{ c(\tilde{x}, \tilde{u}) + \sum_{j=1}^{J} q_j\, \bar{h}(A_j \tilde{x} + B_j \tilde{u} + C_j) \right\} \le c(\tilde{x}, \tilde{v}) + \sum_{j=1}^{J} q_j\, \bar{h}(A_j \tilde{x} + B_j \tilde{v} + C_j)\\
&= c(\tilde{x}, \tilde{v}) + \sum_{j=1}^{J} q_j\, h(A_j \tilde{x} + B_j \tilde{v} + C_j) + \sum_{j=1}^{J} q_j \left( \bar{h}(A_j \tilde{x} + B_j \tilde{v} + C_j) - h(A_j \tilde{x} + B_j \tilde{v} + C_j) \right)\\
&\le c(\tilde{x}, \tilde{v}) + \sum_{j=1}^{J} q_j\, h(A_j \tilde{x} + B_j \tilde{v} + C_j) + \sup_{\tilde{x} \in S} \left( \bar{h}(\tilde{x}) - h(\tilde{x}) \right).
\end{aligned}$$
Since $h$ and $\bar{h}$ are both bounded, there exists $M \ge 0$ such that
$$\sup_{\tilde{x} \in S} \left( \bar{h}(\tilde{x}) - h(\tilde{x}) \right) \le M.$$
Taking the infimum over $\tilde{v} \in U(\tilde{x})$ and applying (14) leads to
$$T \bar{h}(\tilde{x}) \le \inf_{\tilde{v} \in U(\tilde{x})} \left\{ c(\tilde{x}, \tilde{v}) + \sum_{j=1}^{J} q_j\, h(A_j \tilde{x} + B_j \tilde{v} + C_j) \right\} + M = T h(\tilde{x}) + M \le \rho^* + h(\tilde{x}) + M.$$
Inequality (14), the monotonicity of the Bellman operator, and applying it $(n-1)$ times lead to
$$T^n \bar{h}(\tilde{x}) \le n \rho^* + h(\tilde{x}) + M.$$
Since $h$ is bounded, this completes the proof. □
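The boundedness established in Lemma 7 is also the mechanism that makes relative value iteration work numerically: subtracting a running offset from the Bellman iterates keeps them bounded while the offset tracks the average cost. The following self-contained Python sketch illustrates this on a small randomly generated finite MDP; this toy model is not one of the paper's SLCP instances, and the bracketing of the average cost by $\min(Th - h)$ and $\max(Th - h)$ is the standard finite-state analogue of the lower-bound idea used by MARVI.

```python
import numpy as np

# Random finite MDP: 50 states, 10 actions, strictly positive transition
# probabilities (hence unichain and aperiodic).
n_states, n_actions = 50, 10
rng = np.random.default_rng(seed=1)
cost = rng.uniform(0.0, 1.0, size=(n_states, n_actions))        # c(x, u)
P = rng.uniform(0.1, 1.0, size=(n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)                                # stochastic

def bellman(h):
    """(Th)(x) = min_u { c(x, u) + sum_y P[x, u, y] * h(y) }."""
    return np.min(cost + P @ h, axis=1)

h = np.zeros(n_states)
for _ in range(2000):
    Th = bellman(h)
    h = Th - Th.min()   # subtract offset: iterates stay bounded (cf. Lemma 7)

diff = bellman(h) - h   # min/max of Th - h bracket the optimal average cost
print(f"optimal average cost in [{diff.min():.6f}, {diff.max():.6f}]")
```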
Proof of Theorem 6. Equation (16) implies the existence of a bounded convex $\bar{h}$ that together with $\rho^*$ satisfies the ACOE. Therefore, Theorem 1 implies that any well-defined policy corresponding to $\bar{h}$ is optimal. The continuity on the interior of $S$ and the u.s.c. on $S$ follow from the convexity and boundedness of $\bar{h}$ on $S$ (Rockafellar, 1970). □

Proof of Theorem 7. If the transition law is strongly continuous, it follows immediately that $Q(\tilde{x}, \tilde{u} \mid h^*)$ is a continuous function. Therefore, $T h^*$ is also continuous, which implies that $h^*$ is continuous since it satisfies the ACOE. The minimum of a continuous convex function over a compact set is well-defined, and therefore $\pi_{h^*}$ is well-defined. □

Appendix B. Additional numerical experiments

Table B.6 shows the results of all possible combinations of two suppliers in Section 7.4.

Table B.6
Additional simulation results: optimal average costs, MARVI computation time and number of planes for the different dual sourcing supplier combinations.

S2 \ S1         1         2         3         4         5
Costs
2          36.429
3          33.586    33.537
4          31.262    31.275    31.492
5          29.581    29.563    29.813    31.119
6          27.768    27.454    27.429    28.843    33.369
T (sec.)
2             451
3             114        54
4              23        10         2
5              31        15         2         0
6             151        74         7         2         1
#Planes
2             936
3             382       187
4             155        68        26
5             237       136        30         8
6             727       323       103        35         5

References

Arapostathis, A., Borkar, V. S., Fernández-Gaucherand, E., Ghosh, M. K., & Marcus, S. I. (1993). Discrete-time controlled Markov processes with average cost criterion: A survey. SIAM Journal on Control and Optimization, 31(2), 282–344.
Arts, J., Van Vuuren, M., & Kiesmuller, G. (2009). Efficient optimization of the dual-index policy using Markov chains. Tech. rep., Technische Universiteit Eindhoven.
Aubin, J.-P., & Frankowska, H. (1999). Set-valued analysis. Birkhäuser.
Bertsekas, D. P. (2007). Dynamic programming and optimal control (3rd ed., Vol. II). Athena Scientific.
Bertsekas, D. P., & Tsitsiklis, J. N. (1996). Neuro-dynamic programming. Athena Scientific.

Boyd, S., & Vandenberghe, L. (1999). Convex optimization. Cambridge University Press.
De Farias, D. P., & Van Roy, B. (2003). The linear programming approach to approximate dynamic programming. Operations Research, 51(6), 850–865.
De Farias, D. P., & Van Roy, B. (2004). On constraint sampling in the linear programming approach to approximate dynamic programming. Mathematics of Operations Research, 29(3), 462–478.
Dynkin, E. (1972). Stochastic concave dynamic programming. Mathematics of the USSR Sbornik, 16, 501–515.
Ehrhardt, R. (1984). (s, S) policies for a dynamic inventory model with stochastic lead times. Operations Research, 32(1), 121–132.
Farias, V. F., & Van Roy, B. (2006). Tetris: A study of randomized constraint sampling. In G. Calafiore & F. Dabbenne (Eds.), Probabilistic and randomized methods for design under uncertainty (pp. 189–202). Springer.
Fukuda, Y. (1964). Optimal policies for the inventory problem with negotiable leadtime. Management Science, 10(4), 690–708.
Gordienko, E., & Hernández-Lerma, O. (1995). Average cost Markov control processes with weighted norms: Existence of canonical policies. Applied Mathematics (Warsaw), 23(2), 199–218.
Hernández-Lerma, O., & Lasserre, J. (1990). Average cost optimal policies for Markov control processes with Borel state space and unbounded costs. Systems & Control Letters, 15(4), 349–356.
Hernández-Lerma, O., & Lasserre, J. (2002). The linear programming approach. In E. Feinberg & A. Shwartz (Eds.), Handbook of Markov decision processes: Methods and applications. Handbooks in operations research and management science (pp. 377–407). Springer Science+Business Media.
Hernández-Lerma, O., Piovesan, C., & Runggaldier, W. (1995). Numerical aspects of monotone approximations in convex stochastic control problems. Annals of Operations Research, 56(2), 135–156.
Hernández-Lerma, O., & Runggaldier, W. (1994). Monotone approximations for convex stochastic control problems. Journal of Mathematical Systems, Estimation, and Control, 4, 99–140.
Hinderer, K. (1984). On the structure of solutions of stochastic dynamic programs. In Proc. 7th conf. on probability theory (pp. 173–182).
Iglehart, D. L. (1963). Optimality of (s, S) policies in the infinite horizon dynamic inventory problem. Management Science, 9(2), 259–267.
Janssen, F., & de Kok, T. (1999). A two-supplier inventory model. International Journal of Production Economics, 59(1–3), 395–403.
Jones, C. N., Baric, M., & Morari, M. (2007). Multiparametric linear programming with applications to control. European Journal of Control, 13(2–3), 152–170.
Jones, C. N., & Morari, M. (2009). Approximate explicit MPC using bilevel optimization. In Proceedings of the European control conference. Budapest, Hungary.
Kaplan, R. S. (1970). A dynamic inventory model with stochastic lead times. Management Science, 16(7), 491–507.
Klosterhalfen, S., Kiesmuller, G., & Minner, S. (2011). A comparison of the constant-order and dual-index policy for dual sourcing. International Journal of Production Economics, 133(1), 302–311.
Kurano, M., Nakagami, J.-I., & Huang, Y. (2000). Constrained Markov decision processes with compact state and action spaces: The average case. Optimization, 48(2), 255–269.
Lincoln, B., & Rantzer, A. (2006). Relaxing dynamic programming. IEEE Transactions on Automatic Control, 51(8), 1249–1260.
Michael, E. (1956). Continuous selections I. Annals of Mathematics, 63(2), 361–382.
Minner, S. (2003). Multiple-supplier inventory models in supply chain management: A review. International Journal of Production Economics, 265–279.
Montes-de-Oca, R., & Hernández-Lerma, O. (1996). Value iteration in average cost Markov control processes on Borel spaces. Acta Applicandae Mathematicae, 42, 203–222.
Pereira, M., & Pinto, L. (1991). Multi-stage stochastic optimization applied to energy planning. Mathematical Programming, 52(1–3), 359–375.
Powell, W. B. (2007). Approximate dynamic programming: Solving the curses of dimensionality (1st ed.). Wiley-Interscience.
Puterman, M. L. (1994). Markov decision processes – Discrete stochastic dynamic programming. John Wiley & Sons.
Rockafellar, R. T. (1970). Convex analysis. Princeton University Press.
Scarf, H. (1960). The optimality of (S, s) policies in the dynamic inventory problem. Mathematical Methods in Social Sciences, 196–202.
Schäl, M. (1993). Average optimality in dynamic programming with general state space. Mathematics of Operations Research, 18(1), 163–172.
Scheller-Wolf, A., Veeraraghavan, S. K., & Van Houtum, G.-J. (2006). Effective dual sourcing with a single index policy. Working paper, Tepper School of Business, Carnegie Mellon University.
Schweitzer, P. J., & Seidmann, A. (1985). Generalized polynomial approximations in Markovian decisions processes. Journal of Mathematical Analysis and Applications, 110(2), 568–582.
Shapiro, A. (2011). Analysis of stochastic dual dynamic programming method. European Journal of Operational Research, 209(1), 63–72.
Veeraraghavan, S. K., & Scheller-Wolf, A. (2008). Now or later: A simple policy for effective dual sourcing in capacitated systems. Operations Research, 56(4), 850–864.
Veinott, A. F., & Wagner, H. M. (1965). Computing optimal (s, S) inventory policies. Management Science, 11(5), 525–552.
Whittemore, A. S., & Saunders, S. C. (1977). Optimal inventory under stochastic demand with two supply options. SIAM Journal on Applied Mathematics, 32(2), 293–305.
