Infinite-Horizon Dynamic Programming
Tianxiao Zheng
SAIF
1. Introduction
Unlike the finite-horizon case, the infinite-horizon model has a stationary structure: both the one-period rewards and the stochastic kernel for the state process are time homogeneous.
Intuitively, we may view the infinite-horizon model as the limit of the finite-horizon model as the
time horizon goes to infinity. The difficulty of the infinite-horizon case is that there is no general
theory to guarantee the existence of a solution to the Bellman equation. For bounded rewards, we
can use the powerful Contraction Mapping Theorem to deal with this issue.
2. Principle of Optimality
The principle of optimality in the infinite horizon states that the continuation of an optimal plan, from any date and state it reaches, is itself optimal for the continuation problem.
1. It can be shown that the value function (which is independent of time) satisfies the Bellman equation (a functional equation)

   V(s) = max_{a ∈ Γ(s)} { u(s, a) + β ∫ V(s′) P(s, a; s′) ds′ },

   where s′ is the state variable in the next period and Γ(s) is the set of feasible actions a.
2. Verification theorem
Given any s0, for any feasible policy πt, we can use the Bellman equation to derive

V∗(st) ≥ u(st, πt) + β Et[V∗(st+1)],

where V∗ is the solution to the Bellman equation. Multiplying by β^t and rearranging yield

β^t V∗(st) − β^{t+1} Et[V∗(st+1)] ≥ β^t u(st, πt).
Taking expectations conditional on time 0 and summing over t = 0, 1, ..., n − 1, we obtain

V∗(s0) − β^n E0[V∗(sn)] ≥ E0 Σ_{t=0}^{n−1} β^t u(st, πt).
If the transversality condition (recall that in the finite horizon, V∗_T(sT) = uT(sT))

lim_{n→∞} E0[β^n V∗(sn)] = 0

holds, then letting n → ∞ gives V∗(s0) ≥ E0 Σ_{t=0}^{∞} β^t u(st, πt) for any feasible policy π. The inequality holds with equality if we replace π by π∗, the optimal policy generated from the Bellman equation. In that case, the right-hand side becomes V(s0), the value of the Markov decision problem, giving V∗(s0) = V(s0).
The result demonstrates that under the transversality condition the solution to the Bellman
equation gives the value function for the Markov decision problem. In addition, any plan
generated by the optimal policy correspondence from the dynamic programming problem is
optimal for the Markov decision problem.
We should emphasize that the transversality condition is a sufficient condition, but not a
necessary one. It is quite strong because it requires the limit to converge to zero for any
feasible policy. This condition is often violated in many applications with unbounded rewards.
However, if any feasible plan that violates the transversality condition is dominated by some feasible plan that satisfies this condition, then the solution to the Bellman equation is the value function and the associated policy function generates an optimal plan (see Stokey, Lucas, and Prescott, 1989).
3. Any optimal policy obtained from solving the Markov decision problem can be generated by
solving the Bellman equation.
As an example, consider a decision maker who holds an option to pay a fixed investment cost I and collect the current payoff zt by exercising it. The decision maker is risk neutral, so he maximizes his expected return. The discount factor is equal to the inverse of the gross interest rate, β = 1/R.
• Find the Bellman equation and solve the dynamic programming problem. For simplicity, we assume that zt is i.i.d. with cumulative distribution function F(z), z ∈ [0, B], B > I.
• Find the mean waiting period until the option is exercised.
The Bellman equation is V(z) = max{ z − I, βE[V(z′)] }. Note that βE[V(z′)] is a constant because zt is i.i.d. Therefore, if z − I > βE[V(z′)], the decision maker chooses to exercise the option, and waits otherwise. As a result,
V(z) = z − I    if z > z∗,
V(z) = z∗ − I   if z < z∗,

where the constant z∗ − I = βE[V(z′)] is the value of waiting and z∗ is the indifference threshold. Taking expectations, E[V(z′)] = (z∗ − I) + ∫_{z∗}^{B} (z − z∗) dF(z), and substituting into the indifference condition gives

z∗ − I = [β/(1 − β)] ∫_{z∗}^{B} (z − z∗) dF(z).
From this equation, we see that z∗ ∈ [I, B]. The decision maker will not exercise the option for zt ∈ (I, z∗) because there is an option value of waiting.
The probability of not exercising the option in any given period is λ = F(z∗). Consequently, the probability that the option is exercised after exactly j periods of waiting is λ^j (1 − λ). The mean waiting period is then

Σ_{j=0}^{∞} j λ^j (1 − λ) = (1 − λ) λ (d/dλ) Σ_{j=0}^{∞} λ^j = (1 − λ) λ / (1 − λ)^2 = λ/(1 − λ).
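As a quick numerical illustration (not part of the notes), the sketch below solves the threshold equation and computes the mean waiting period under the assumption that F is uniform on [0, B]; the values of I, B, and β are illustrative.

```python
# A minimal numerical sketch of the option-exercise example. The distribution of z is
# left general in the text; here we assume F is uniform on [0, B], and the values of
# I, B, and beta are illustrative.
from scipy.optimize import brentq

I, B, beta = 1.0, 2.0, 0.95             # investment cost, upper bound of z, discount factor

def gap(z_star):
    # z* - I  minus  beta/(1-beta) * integral_{z*}^{B} (z - z*) dF(z);
    # for F uniform on [0, B] the integral equals (B - z*)**2 / (2B).
    return (z_star - I) - beta / (1.0 - beta) * (B - z_star) ** 2 / (2.0 * B)

z_star = brentq(gap, I, B)              # the root lies in [I, B], as argued above
lam = z_star / B                        # probability of waiting each period, F(z*)
print(f"threshold z* = {z_star:.4f}")
print(f"mean waiting period = {lam / (1.0 - lam):.4f}")
```

With these illustrative numbers the threshold lies well above I, reflecting the option value of waiting.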
Define the Bellman operator T̂ by

(T̂ f)(s) = max_{a ∈ Γ(s)} { u(s, a) + β ∫ f(s′) P(s, a; s′) ds′ },

where f is a continuous function on S. Then the solution to the Bellman equation is a fixed point of T̂ in that T̂ V = V.
The set of bounded and continuous functions on the state space S, endowed with the sup norm, is a Banach space C(S). The operator T̂ is a contraction if (1) u is bounded and continuous; (2) Γ is nonempty, compact-valued, and continuous; (3) the stochastic kernel P(s, a; s′) is such that ∫ f(s′) P(s, a; s′) ds′ is continuous in (s, a) for any bounded and continuous function f; and (4) β ∈ (0, 1).
The contraction property of the Bellman operator T̂ gives the existence and uniqueness of the solution to the Bellman equation. It justifies the guess-and-verify method for finding the value function: as long as we find a solution, it is the solution. Below is a simple example.
max_{{ct}_{t=0}^{∞}}  E0 Σ_{t=0}^{∞} β^t log(ct)

subject to ct + Kt+1 = zt Kt^α, where zt follows a Markov process with transition equation log(zt+1) = ρ log(zt) + σ εt+1. Here, ρ ∈ (0, 1) and εt+1 is normally distributed with mean 0 and variance 1.
We write the Bellman equation as

V(K, z) = max_c { log c + β E[V(zK^α − c, z′) | z] }.
Given the log utility, we guess that the value function takes the functional form V(K, z) = d0 + d1 log z + d2 log K. The maximization problem becomes

max_c { log c + βE[V(zK^α − c, z′)] } = max_c { log c + βd0 + βd2 log(zK^α − c) + βd1 E[log z′] }.

The first-order condition gives c = zK^α/(1 + βd2), so that K′ = zK^α − c = βd2 zK^α/(1 + βd2). Substituting back and using E[log z′ | z] = ρ log z, the guess must satisfy

d0 + d1 log z + d2 log K = log[ zK^α/(1 + βd2) ] + βd2 log[ βd2 zK^α/(1 + βd2) ] + βd1 ρ log z + βd0.

Matching coefficients yields

d2 = α + αβd2,
d1 = 1 + βd2 + βρd1,                                   (2)
d0 = −log(1 + βd2) + βd2 log[ βd2/(1 + βd2) ] + βd0.

Solving the first equation gives d2 = α/(1 − αβ), so that βd2/(1 + βd2) = αβ.
The decision rule can also be derived: Kt+1 = αβ zt Kt^α.
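As a small check of the algebra (not in the original notes), the sketch below computes d0, d1, d2 from system (2) in closed form and confirms that the implied saving rate βd2/(1 + βd2) equals αβ, which is exactly the decision rule above; the parameter values are illustrative.

```python
# Closed-form coefficients of the guessed value function V(K, z) = d0 + d1*log(z) + d2*log(K),
# computed from system (2); alpha, beta, rho are illustrative values.
import numpy as np

alpha, beta, rho = 0.36, 0.96, 0.9

d2 = alpha / (1.0 - alpha * beta)                      # from d2 = alpha + alpha*beta*d2
d1 = (1.0 + beta * d2) / (1.0 - beta * rho)            # from d1 = 1 + beta*d2 + beta*rho*d1
d0 = (-np.log(1.0 + beta * d2)
      + beta * d2 * np.log(beta * d2 / (1.0 + beta * d2))) / (1.0 - beta)

print(f"d0 = {d0:.4f}, d1 = {d1:.4f}, d2 = {d2:.4f}")
print("saving rate equals alpha*beta:",
      np.isclose(beta * d2 / (1.0 + beta * d2), alpha * beta))
```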
The contraction mapping theorem also implies that, starting from an arbitrary v0 ∈ C(S), repeated application of the Bellman operator converges to the fixed point:

lim_{N→∞} T̂^N v0 = V.
This property gives rise to a numerical algorithm known as value function iteration for finding V .
We start with an arbitrary guess V0(s) and iterate the Bellman operator,

Vn+1 = T̂ Vn,

until Vn converges. The contraction mapping theorem guarantees the convergence of this algorithm. In particular, the contraction property implies that ‖Vn − V‖ converges to zero at a geometric rate.
Note that in the case where we set v0(s) = 0, the value function iteration algorithm is equivalent to solving a finite-horizon problem by backward induction. Suppose we stop the iteration at N because convergence is attained, e.g., ‖VN(s) − VN−1(s)‖ ≈ 10^{−15}. The equivalent finite-horizon problem is then defined with ut(st, at) = u(st, at) for t = 0, 1, ..., N − 1 and uN(sN, aN) = 0.
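The following sketch, which is not part of the notes, implements value function iteration for the growth example above in the deterministic special case zt ≡ 1 (σ = 0), so that the closed-form policy K′ = αβ K^α can serve as an accuracy check; the grid bounds and parameter values are illustrative.

```python
# Value function iteration for the deterministic special case (z_t = 1) of the growth
# model above, on a capital grid; all numbers are illustrative assumptions.
import numpy as np

alpha, beta = 0.36, 0.96
k_grid = np.linspace(0.05, 0.5, 500)            # assumed capital grid
V = np.zeros_like(k_grid)                        # v0 = 0, as in the finite-horizon analogy

# consumption implied by every (K, K') pair; infeasible pairs get payoff -inf
C = k_grid[:, None] ** alpha - k_grid[None, :]
U = np.where(C > 0, np.log(np.maximum(C, 1e-12)), -np.inf)

for it in range(2000):
    V_new = np.max(U + beta * V[None, :], axis=1)    # one application of the Bellman operator
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = k_grid[np.argmax(U + beta * V[None, :], axis=1)]
print("max policy error vs alpha*beta*K^alpha:",
      np.max(np.abs(policy - alpha * beta * k_grid ** alpha)))
```

The remaining policy error is governed by the grid spacing, while the value function itself converges at the geometric rate β, as stated above.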
An alternative to value function iteration is policy iteration (Howard's improvement algorithm), which alternates a policy evaluation step and a policy improvement step; a sketch of the full procedure appears after the list below.

1. Choose an arbitrary policy g0 and compute the value function V0 implied by g0. On discretized grids of the state space, this is usually done by solving a linear system. There also exists a fast method to compute V0(s) by defining an operator B̂,

   (B̂ V)(s) = u(s, g0(s)) + β ∫ V(s′) P(s, g0(s); s′) ds′,

   and finding the fixed point V0 = B̂ V0. Iterating on B̂ a small number of times yields an approximation of V0.
2. Generate an improved policy g1(s) that solves the two-period problem

   g1(s) = arg max_{a ∈ Γ(s)} { u(s, a) + β ∫ V0(s′) P(s, a; s′) ds′ }.
3. Given g1, continue the cycle of the value function evaluation step and the policy improvement step until the first iteration n at which gn = gn−1 (or, alternatively, ‖Vn − Vn−1‖ falls below a tolerance). Since such a gn satisfies the Bellman equation, it is optimal.
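A minimal sketch of these three steps for the same deterministic (zt ≡ 1) growth example is given below; here the policy evaluation step solves the linear system (I − βPg)V = ug exactly instead of iterating on B̂, and the grid and parameters are again illustrative assumptions.

```python
# Howard policy iteration for the deterministic (z_t = 1) growth example; policy
# evaluation solves the linear system (I - beta*P_g) V = u_g on an assumed grid.
import numpy as np

alpha, beta = 0.36, 0.96
k_grid = np.linspace(0.05, 0.5, 300)
n = len(k_grid)

C = k_grid[:, None] ** alpha - k_grid[None, :]
U = np.where(C > 0, np.log(np.maximum(C, 1e-12)), -np.inf)

g = np.zeros(n, dtype=int)                      # arbitrary initial policy g0: save the lowest grid point
for it in range(100):
    # policy evaluation: V = u_g + beta * V[g], solved exactly as a linear system
    P = np.zeros((n, n))
    P[np.arange(n), g] = 1.0
    u_g = U[np.arange(n), g]
    V = np.linalg.solve(np.eye(n) - beta * P, u_g)
    # policy improvement: solve the two-period problem given V
    g_new = np.argmax(U + beta * V[None, :], axis=1)
    if np.array_equal(g_new, g):
        break
    g = g_new

print("converged after", it, "improvement steps")
print("max policy error:", np.max(np.abs(k_grid[g] - alpha * beta * k_grid ** alpha)))
```

In this kind of example, policy iteration typically converges in far fewer iterations than value function iteration, at the cost of one linear solve per step.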
Under the contraction conditions above, further properties of the value function can be established for problems with state x, shock z, action a ∈ Γ(x, z), and a transition function φ for the next-period state:

• Under the conditions (1) u(., z, a) is continuous and bounded for each z, a; (2) u(., z, a) is strictly increasing; (3) for each z, Γ(., z) is increasing (x < x′ implies Γ(x, z) ⊂ Γ(x′, z)); and (4) φ(., a, z, z′) is increasing for each a, z, z′, the value function V(., z) is strictly increasing for each z.

• Under conditions (1)-(4) together with (5) at each z, for all x, a, x′, a′ and θ ∈ (0, 1), u(θx + (1 − θ)x′, z, θa + (1 − θ)a′) ≥ θu(x, z, a) + (1 − θ)u(x′, z, a′), and (6) φ(., ., z, z′) is concave for each z, z′, the value function V(., z) is strictly concave for each z, and the policy correspondence G is a single-valued continuous function.

• Under (5) together with (7) for each z, u(., z, .) is continuously differentiable on the interior of X × A; (8) for each z, z′, φ(., ., z, z′) is differentiable on the interior of X × A; and (9) at each z, for all x, x′ and θ ∈ (0, 1), a ∈ Γ(x, z) and a′ ∈ Γ(x′, z) imply that θa + (1 − θ)a′ ∈ Γ(θx + (1 − θ)x′, z), the value function V(., z) is continuously differentiable.
Analyzing the existence and properties of the value function is nontrivial for unbounded reward
functions. By contrast, unbounded reward functions do not pose any difficulty for the Maximum
Principle to work. To present the infinite horizon maximum principle, we write the Lagrangian
form for the optimal control problem
L = E[ Σ_{t=0}^{∞} ( β^t u(xt, zt, at) − β^{t+1} μt+1 ( xt+1 − φ(xt, zt, at, zt+1) ) ) ]
The first-order conditions are

ua(xt, zt, at) + β Et[μt+1 φa(xt, zt, at, zt+1)] = 0,
μt = ux(xt, zt, at) + β Et[μt+1 φx(xt, zt, at, zt+1)].

Setting μt = Vx(xt, zt), we can see that the two conditions above are equivalent to the first-order condition and the envelope condition of the Bellman equation. The Lagrange multiplier μt is interpreted as the shadow value of the state variable, i.e., the derivative of the value function with respect to the state.
Recall that in the finite-horizon case, we have a terminal condition μT = ∂uT/∂xT that allows us to solve the problem by backward induction. There is no well-defined terminal condition in the infinite-horizon case. Here, a sufficient boundary condition takes the form of the transversality condition

lim_{T→∞} E[β^T μT xT] = 0.
For a special class of control problems, the Euler class, the transversality condition can also be shown to be necessary (cf. Ekeland and Scheinkman, 1986, and Kamihigashi, 2000).
In practice, it may be possible to use simple tricks to transform a general optimal control problem into a member of the Euler class. Suppose it is possible to perform a change of variables such that the state transition equation becomes

xt+1 = at.

This simplifies the solution by both the Bellman equation and the maximum principle.
1. Bellman equation
The Bellman equation becomes

V(x, z) = max_a { u(x, z, a) + β ∫ V(a, z′) Q(z, z′) dz′ },

with first-order condition 0 = ua(x, z, a) + β ∫ Vx(a, z′) Q(z, z′) dz′. The envelope condition becomes very simple,

Vx(x, z) = ux(x, z, g(x, z)),

where x′ = a = g(x, z) is the optimal policy. Substituting the envelope condition into the
first-order condition yields the Euler equation

0 = ua(x, z, a) + β ∫ ux(x′, z′, a′) Q(z, z′) dz′
  = ua(x, z, g(x, z)) + β ∫ ux( g(x, z), z′, g(g(x, z), z′) ) Q(z, z′) dz′.
This is a functional equation for the optimal policy g. For the Euler class, instead of solving the original Bellman equation, we can solve the Euler equation directly; a numerical check of this equation for the earlier growth example follows the next item.
2. Maximum principle
The first-order conditions become

ua(xt, zt, at) + β Et[μt+1] = 0,     μt = ux(xt, zt, at).

Substituting the second equation into the first one, we get the sequential form of the Euler equation

0 = ua(xt, zt, xt+1) + β Et[ux(xt+1, zt+1, xt+2)].
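As a numerical check (not in the notes), the sketch below verifies that the closed-form policy g(x, z) = αβ z x^α from the earlier growth example satisfies the Euler equation above, with x = K, u(x, z, a) = log(z x^α − a) and x′ = a; the parameter values and the Monte Carlo approximation of the conditional expectation are illustrative.

```python
# Check that g(x, z) = alpha*beta*z*x**alpha satisfies the Euler equation
# 0 = u_a(x, z, a) + beta*E[u_x(x', z', a')] with u(x, z, a) = log(z*x**alpha - a)
# and x' = a; alpha, beta, rho, sigma are illustrative values.
import numpy as np

alpha, beta, rho, sigma = 0.36, 0.96, 0.9, 0.1
rng = np.random.default_rng(0)

def g(x, z):
    return alpha * beta * z * x ** alpha

def euler_residual(x, z, n_draws=10_000):
    a = g(x, z)
    u_a = -1.0 / (z * x ** alpha - a)                       # marginal cost of saving today
    z_next = np.exp(rho * np.log(z) + sigma * rng.standard_normal(n_draws))
    a_next = g(a, z_next)
    u_x = alpha * z_next * a ** (alpha - 1) / (z_next * a ** alpha - a_next)
    return u_a + beta * u_x.mean()                          # should be (numerically) zero

for x, z in [(0.1, 1.0), (0.3, 1.2), (0.5, 0.8)]:
    print(f"residual at (x={x}, z={z}): {euler_residual(x, z):.2e}")
```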
To get some economic sense of the transversality condition, we consider a simple example.
By defining Kt+1 ≡ at, the problem can be rewritten as

max_{{ct}_{t=0}^{T}}  E[ Σ_{t=0}^{T} β^t u( zt F(Kt) − at ) ],    β ∈ (0, 1).

KT+1 should be non-negative.
• If KT+1 = 0, the following condition should be satisfied: E[β^T u′(cT)] > 0;
• If KT+1 > 0, the following condition should be satisfied: E[β^T u′(cT)] = 0.
We can combine the two conditions as E[β^T u′(cT) KT+1] = 0. This is the transversality condition in the finite-horizon case. The economic meaning is that the expected discounted shadow value of the terminal state (e.g., capital or wealth) must be zero. In the infinite-horizon case, we take the limit of this condition as T → ∞.
Consider next a consumption-saving example,

max_{{ct}}  E0 Σ_{t=0}^{∞} β^t ct^{1−γ}/(1 − γ),     γ > 0,

subject to xt+1 = Rt+1(xt − ct), xt+1 > 0, x0 > 0 given, where Rt+1 > 0 is i.i.d. drawn from a given distribution. By defining yt+1 = xt+1/Rt+1, we have yt+1 = at = xt − ct = yt Rt − ct, so that ct = yt Rt − at.
The first-order (Euler) condition is

ct^{−γ} = β Et[Rt+1 ct+1^{−γ}].
An obvious guess of the consumption policy is that ct = Cxt (0 < C < 1). Plugging the conjecture
into the Euler equation yields
(Cxt)^{−γ} = β Et[Rt+1 (Cxt+1)^{−γ}] = β Et[Rt+1 C^{−γ} (Rt+1 xt − Rt+1 Cxt)^{−γ}].
The above equation gives us C = 1 − (β Et[Rt+1^{1−γ}])^{1/γ}. Consider the Bellman equation

V(x) = max_c { c^{1−γ}/(1 − γ) + β Et[V(Rt+1(x − c))] }.

An obvious guess of the value function is V(x) = Bx^{1−γ}/(1 − γ). Plugging the conjecture into the Bellman equation yields

B = [ 1 − (β Et[Rt+1^{1−γ}])^{1/γ} ]^{−γ}.
To verify the transversality condition

lim_{t→∞} E0[β^t V(xt)] = 0,

we compute

E0[β^t V(xt)] = [β^t B/(1 − γ)] E0[xt^{1−γ}] = [β^t B/(1 − γ)] E0[Rt^{1−γ} (xt−1 − ct−1)^{1−γ}]
             = [β^t B/(1 − γ)] (1 − C)^{1−γ} E0[Rt^{1−γ} xt−1^{1−γ}]                      (9)
             = [β^t B/(1 − γ)] (1 − C)^{t(1−γ)} x0^{1−γ} E0[ ∏_{j=1}^{t} Rj^{1−γ} ].

Since the Rj are i.i.d., E0[∏_{j=1}^{t} Rj^{1−γ}] = (E[R^{1−γ}])^t, so E0[β^t V(xt)] is proportional to [β(1 − C)^{1−γ} E[R^{1−γ}]]^t, and the transversality condition holds provided β(1 − C)^{1−γ} E[R^{1−γ}] < 1. Using (1 − C)^γ = βE[R^{1−γ}], this factor equals 1 − C, so the condition is satisfied whenever 0 < C < 1. The transversality condition from the maximum principle involves the same object, since μT = u′(cT) = C^{−γ} xT^{−γ} and B = C^{−γ}:

E0[β^T cT^{−γ} xT] = β^T B E0[xT^{1−γ}].
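To make this concrete, the sketch below assumes an illustrative lognormal distribution for R (the notes leave the distribution unspecified), checks the Euler equation under the rule ct = Cxt, and computes the geometric factor β(1 − C)^{1−γ}E[R^{1−γ}] that governs the transversality condition; all parameter values are assumptions.

```python
# Numerical check of the consumption rule c_t = C*x_t with
# C = 1 - (beta*E[R**(1-gamma)])**(1/gamma), assuming an illustrative lognormal return R.
import numpy as np

beta, gamma = 0.95, 2.0
mu_r, sigma_r = 0.03, 0.15                     # assumed lognormal return parameters
rng = np.random.default_rng(0)
R = np.exp(mu_r + sigma_r * rng.standard_normal(1_000_000))

ER = np.mean(R ** (1.0 - gamma))
C = 1.0 - (beta * ER) ** (1.0 / gamma)

# Euler equation (C*x)**(-gamma) = beta*E[R*(C*x')**(-gamma)] with x' = R*(1-C)*x, at x = 1:
lhs = C ** (-gamma)
rhs = beta * np.mean(R * (C * R * (1.0 - C)) ** (-gamma))
print("Euler residual:", lhs - rhs)

# Transversality condition: E0[beta^t V(x_t)] shrinks by this factor every period
tvc_factor = beta * (1.0 - C) ** (1.0 - gamma) * ER
print("TVC factor (should be < 1):", tvc_factor, "  1 - C =", 1.0 - C)
```

As derived above, the printed factor should coincide with 1 − C and lie below one, so the transversality condition holds for this illustrative specification.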
References
Alvarez, F., Stokey, N. L., 1998. Dynamic programming with homogeneous functions. Journal of Economic Theory 82, 167–189.
Durán, J., 2000. On dynamic programming with unbounded returns. Economic Theory 15, 339–352.
Ekeland, I., Scheinkman, J. A., 1986. Transversality conditions for some infinite horizon discrete
time optimization problems. Mathematics of Operations Research 11, 216–229.
Kamihigashi, T., 2000. A simple proof of Ekeland and Scheinkman's result on the necessity of a transversality condition. Economic Theory 15, 463–468.
Stokey, N. L., Lucas, R. E., Prescott, E. C., 1989. Recursive Methods in Economic Dynamics.
Harvard University Press.