EE 223 Spring 2024

Homework 6
Due by 11 p.m. on Monday, 18 March 2024.
The homework should be submitted as a scanned pdf file to ananth at berkeley
dot edu
Please retain a copy of your submitted solution for self-grading.

1. This was the last problem on Homework 5, postponed to this homework set.

Let X ∶= {1, . . . , d} be a finite state space, U a finite set of control actions, and ([p_ij(u)], u ∈ U) a family of controlled transition probability matrices on X.
Let 0 < β < 1 be a given discount factor. We consider the discounted dy-
namic programming problem of minimizing the overall expected β-discounted
cost starting from each initial state, where the one-step cost incurred when
being in state i and using control action u is c(i, u), for some given collec-
tion of real numbers (c(i, u), i ∈ X , u ∈ U).
Let V∗ ∶ X → R be the corresponding optimal β-discounted costs, which we can think of as a vector of length d, indexed by the initial state. Thus V∗ is the fixed point of the Bellman equation associated to the β-discounted control problem.
Consider the following variant of the value iteration algorithm. Given a
function V ∶ X → R, define the function S(V ) ∶ X → R via

S(V)(i) ∶= min_u [ c(i, u) + β ∑_{j≠i} p_ij(u) V(j) ] / ( 1 − β p_ii(u) ),   for all i ∈ X.

We then consider the sequence of iterates (S^k(V), k ≥ 0), where S^0(V) ∶= V and S^k(V) ∶= S(S^{k−1}(V)) for k ≥ 1.
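For concreteness, here is a minimal numerical sketch of how one might implement the operator S and run the iteration. The array layout (P[u, i, j] = p_ij(u), costs c[i, u]) and the small randomly generated test model are assumptions made purely for illustration; they are not part of the problem.

```python
import numpy as np

def S(V, P, c, beta):
    """One application of the modified value-iteration operator S.

    P : shape (num_actions, d, d), with P[u, i, j] = p_ij(u)
    c : shape (d, num_actions),    with c[i, u] the one-step cost
    V : shape (d,), the current value estimate
    """
    num_actions, d, _ = P.shape
    SV = np.empty(d)
    for i in range(d):
        candidates = []
        for u in range(num_actions):
            off_diag = P[u, i] @ V - P[u, i, i] * V[i]  # sum over j != i of p_ij(u) V(j)
            candidates.append((c[i, u] + beta * off_diag) / (1.0 - beta * P[u, i, i]))
        SV[i] = min(candidates)
    return SV

# Iterate S on a small randomly generated model, just to watch the iterates settle.
rng = np.random.default_rng(0)
d, num_actions, beta = 4, 2, 0.9
P = rng.random((num_actions, d, d))
P /= P.sum(axis=2, keepdims=True)   # normalize each row into a probability vector
c = rng.random((d, num_actions))
V = np.zeros(d)
for _ in range(200):
    V = S(V, P, c, beta)
print(V)  # after many iterations, V is numerically a fixed point of S
```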

(a) Show that

lim_{k→∞} ∥S^k(V) − V∗∥_∞ = 0,

for all V ∶ X → R. Here ∥⋅∥_∞ denotes the L∞ norm on real-valued functions on X.

(b) Find the best ρ > 0 that you can such that we have

∥S^k(V) − V∗∥_∞ ≤ ρ^k ∥V − V∗∥_∞   for all V ∶ X → R and all k ≥ 0.

Here ρ will depend on β.

2. Let 0 < β < 1 be a discount factor.


Let X = {0, 1, 2}, U = {a, b} and Y = {y0 , y1 } be the state space, the action
space, and the space of observations respectively.
Consider the controlled transition probability matrices
P(a) = [P_ij(a)] =
    [ 1/2  1/2   0  ]
    [  0   1/2  1/2 ]
    [ 1/2   0   1/2 ] ,

P(b) = [P_ij(b)] =
    [ 0  1  0 ]
    [ 0  0  1 ]
    [ 1  0  0 ] ,

where the rows and columns of each matrix are enumerated by states in the
order 0, 1, 2.
Let c(i, u), for i ∈ X and u ∈ U, denote the cost of taking action u when in state i.
The observation at each time is a noisy function of the current state, given
by
p(y0 | 0) = 1,   p(y1 | 0) = 0,   p(y0 | 1) = 1/2,   p(y1 | 1) = 1/2,   p(y0 | 2) = 1/4,   p(y1 | 2) = 3/4.
To be completely precise, the framework is that of our usual control prob-
lem. Namely, (Xk , k ≥ 0) denotes the state process, (Uk , k ≥ 0) the control
process, and (Yk , k ≥ 0) the observation process, with the evolution given
by
P(X_{k+1} = j | X_k = i, U_k = u, X_0^{k−1}, U_0^{k−1}, Y_0^k) = P_ij(u),
and the observation given by

P(Y_k = y | X_k = i, X_0^{k−1}, U_0^{k−1}, Y_0^{k−1}) = p(y | i).

Also, we have written X_0^{k−1} for the sequence (X_0, . . . , X_{k−1}), interpreted to be the empty sequence when k = 0, and similarly for U_0^{k−1}, Y_0^{k−1}, etc.

The problem we wish to solve is the partially observed discounted control
problem

Minimize_g  E^g [ ∑_{k=0}^∞ β^k c(X_k, U_k) ],

where the minimization is over all strategies g = (g_k, k ≥ 0) with U_k = g_k(Y_0, . . . , Y_k), and, as usual, E^g denotes that we are taking expectations when the strategy g is in effect.
Write down the Bellman equation that allows you to solve this problem.
Your answer should be sufficiently explicit to make it clear how one could
incorporate the specific form of the controlled transition probability matri-
ces and the probability of the observation at each time given the current
state. The discount factor β and the per-step costs c(i, u) can be treated as
unspecified variables.
Your answer should also explain briefly the intuitive reasoning behind the
Bellman equation you wrote down. There is no need to give a formal proof.
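One ingredient you may find useful when making the answer explicit is the conditional distribution of the current state given the observations and actions so far, and how it is updated by one action/observation pair. The sketch below only shows how the specific matrices P(a), P(b) and the observation probabilities p(y|i) would enter such an update; the choice to work with this conditional distribution, and the function name, are assumptions of this sketch rather than part of the problem statement.

```python
import numpy as np

# Transition matrices and observation likelihoods from the problem statement,
# with states ordered 0, 1, 2 and observations ordered (y0, y1).
P = {
    "a": np.array([[0.5, 0.5, 0.0],
                   [0.0, 0.5, 0.5],
                   [0.5, 0.0, 0.5]]),
    "b": np.array([[0.0, 1.0, 0.0],
                   [0.0, 0.0, 1.0],
                   [1.0, 0.0, 0.0]]),
}
obs = np.array([[1.0, 0.0],      # p(y0|0), p(y1|0)
                [0.5, 0.5],      # p(y0|1), p(y1|1)
                [0.25, 0.75]])   # p(y0|2), p(y1|2)

def belief_update(pi, u, y):
    """Conditional distribution of the next state after taking action u and seeing y.

    pi : length-3 array, current conditional distribution of the state.
    u  : "a" or "b".
    y  : 0 for y0, 1 for y1.
    """
    predicted = pi @ P[u]                 # distribution of the next state before observing
    unnormalized = predicted * obs[:, y]  # multiply by the observation likelihood p(y | j)
    return unnormalized / unnormalized.sum()

# Example: start from a uniform belief, take action a, observe y1.
print(belief_update(np.ones(3) / 3, "a", 1))
```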

3. Consider the controlled Markov chain model with state space X ∶= {1, 2, 3},
action space U ∶= {a, b}, and transition probability matrices
[P_ij(a)] =
    [  0   1/2  1/2 ]
    [  1    0    0  ]
    [ 1/2  1/2   0  ] ,

[P_ij(b)] =
    [  0   1/2  1/2 ]
    [  0    0    1  ]
    [ 1/2  1/2   0  ] .
Note that the transition probabilities from states 1 and 3 do not depend on
the control choice. Suppose the one-step costs are given by:

c(1, a) = c(1, b) = 10,
c(2, a) = c(2, b) = 0,
c(3, a) = c(3, b) = 10.

Let 0 < β < 1 be the discount factor. In the value-iteration algorithm for
finding the optimal control strategy, we start with an initial function V^(0) on the state space and form the sequence of iterates (V^(n), n ≥ 0) by letting V^(n+1) = T V^(n), where

T V(i) ∶= min_u { c(i, u) + β ∑_j P_ij(u) V(j) }.

Let µ^(n) denote a minimizer at the n-th step of value iteration, i.e. for each i ∈ X, µ^(n)(i) satisfies

V^(n+1)(i) = c(i, µ^(n)(i)) + β ∑_j P_ij(µ^(n)(i)) V^(n)(j).

We proved in class that, during value iteration from an arbitrary initial function for a finite-state, finite-action-space discounted-cost optimal control problem, there is a finite N such that for all n ≥ N the strategy µ^(n) is an optimal control strategy for the problem. In this example, show that if V^(0) is such that V^(0)(1) ≠ V^(0)(3), then the sequence (µ^(n), n ≥ 0) will not converge. Thus, even though value iteration eliminates all non-optimal stationary Markov strategies in finitely many steps, the sequence of stationary Markov control strategies it proposes need not converge in general.
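To experiment with this numerically before writing the argument, one can run a few steps of value iteration and record a minimizer at each step. The sketch below does this for the matrices and costs above; the particular value of β and the initial function V^(0) are arbitrary choices for illustration, not specified by the problem.

```python
import numpy as np

# Transition matrices and one-step costs from the problem statement
# (states 1, 2, 3 stored at indices 0, 1, 2; actions a, b at indices 0, 1).
P = np.array([[[0.0, 0.5, 0.5],
               [1.0, 0.0, 0.0],
               [0.5, 0.5, 0.0]],   # P(a)
              [[0.0, 0.5, 0.5],
               [0.0, 0.0, 1.0],
               [0.5, 0.5, 0.0]]])  # P(b)
c = np.array([[10.0, 10.0],
              [0.0, 0.0],
              [10.0, 10.0]])
beta = 0.9                          # any fixed discount factor in (0, 1)

V = np.array([5.0, 0.0, 0.0])       # an initial function with V(1) != V(3)
for n in range(10):
    Q = c + beta * np.einsum("uij,j->iu", P, V)  # Q[i, u] = c(i, u) + beta * sum_j P_ij(u) V(j)
    mu = Q.argmin(axis=1)                        # a minimizer mu^(n) at this step
    V = Q.min(axis=1)                            # V^(n+1) = T V^(n)
    print(n, mu)                                 # watch whether the chosen actions settle down
```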

4. Let µ1 and µ2 define stationary Markov policies in a finite-state, finite-control-space discounted dynamic programming problem with one-step costs c(i, u) and state transition probabilities p_ij(u). Thus µ1 and µ2 are functions from the state space X to the set of controls U. We denote the discount factor by 0 < β < 1.

(a) Let µ3 denote a stationary Markov policy that, when in state i, chooses
the action u to minimize

c(i, u) + β ∑_j p_ij(u) min{ W_∞^µ1(j), W_∞^µ2(j) },

where W_∞^µ denotes the overall expected discounted cost when the stationary Markov control strategy µ is in effect. Show that

W_∞^µ3 ≤ min{ W_∞^µ1, W_∞^µ2 },

(the inequality is meant to hold coordinatewise, i.e. state by state, as usual).

(b) Let µ4 be defined by

µ4(i) = { µ1(i)   if W_∞^µ1(i) ≤ W_∞^µ2(i),
        { µ2(i)   if W_∞^µ2(i) < W_∞^µ1(i).

Show that
W_∞^µ4 ≤ min{ W_∞^µ1, W_∞^µ2 }.
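Numerically, W_∞^µ for a fixed stationary Markov policy µ is the solution of a linear system, since it satisfies W_∞^µ = c_µ + β P_µ W_∞^µ, where c_µ(i) = c(i, µ(i)) and P_µ is the transition matrix under µ. Below is a minimal sketch of that computation, assuming the same array layout as in the earlier sketches (P[u, i, j] = p_ij(u), costs c[i, u]); the helper name policy_cost is just a made-up label.

```python
import numpy as np

def policy_cost(mu, P, c, beta):
    """Discounted cost vector W_inf^mu of a stationary Markov policy mu.

    mu : length-d integer array, mu[i] = index of the action used in state i.
    P  : shape (num_actions, d, d), P[u, i, j] = p_ij(u).
    c  : shape (d, num_actions), one-step costs c[i, u].
    Solves the linear system W = c_mu + beta * P_mu W.
    """
    d = len(mu)
    idx = np.arange(d)
    P_mu = P[mu, idx]   # row i is the transition row under action mu[i]
    c_mu = c[idx, mu]   # one-step cost in state i under action mu[i]
    return np.linalg.solve(np.eye(d) - beta * P_mu, c_mu)
```

With this in hand, one can tabulate W_∞^µ1 and W_∞^µ2 on small examples and sanity-check the inequalities claimed for µ3 and µ4 before writing the proofs.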

5. Consider a controlled Markov chain with state space the set of nonnegative
integers X = {0, 1, 2, . . .} and action space U = {0, 1}. When action u = 1 is
taken the state moves from the current state i to i + 1, for all i ≥ 0, and the
cost incurred is 1. When action u = 0 is taken the state stays at the current
state i, for all i ≥ 0, and the cost incurred is 1/(1 + i).
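Before attempting the parts below, it can help to compare a couple of simple strategies numerically. The sketch that follows evaluates "stay forever at i" against "move m steps and then stay"; these two comparators and the particular values of β, i and m are only illustrative choices, and neither strategy is claimed to be optimal.

```python
def stay_forever(i, beta):
    # Discounted cost of taking u = 0 forever from state i: sum_k beta^k / (1 + i).
    return 1.0 / ((1 + i) * (1 - beta))

def move_then_stay(i, m, beta):
    # Pay cost 1 for m steps while moving from i to i + m, then stay there forever.
    moving = sum(beta ** k for k in range(m))
    return moving + beta ** m * stay_forever(i + m, beta)

for beta in (0.5, 0.9, 0.99):
    i = 3
    print(beta,
          round(stay_forever(i, beta), 3),
          [round(move_then_stay(i, m, beta), 3) for m in (1, 5, 20)])
```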

(a) Consider the problem of choosing a control strategy to minimize the long term average cost. Show that the optimal long term average cost is 0.
(b) Show that for every discount factor 0 < β < 1, if i is large enough then
the optimal action to take in state i for the purpose of minimizing the
overall β-discounted cost is the action u = 0.
(c) Show that for every state i, if the discount factor 0 < β < 1 is suffi-
ciently close to 1, then the optimal action to take in state i for the pur-
pose of minimizing the overall β-discounted cost is the action u = 1.
(d) Conclude that there is no Blackwell optimal strategy in this problem.
