Homework - 07 - 223 - Spring 2024
Homework 7
Due by 11 p.m. on Thursday, 11 April 2024.
The homework should be submitted as a scanned pdf file to ananth at berkeley
dot edu
Please retain a copy of your submitted solution for self-grading.
1. This was the last problem on Homework 6, postponed to this homework set.
Consider a controlled Markov chain with state space the set of nonnegative
integers X = {0, 1, 2, . . .} and action space U = {0, 1}. When action u = 1 is
taken the state moves from the current state i to i + 1, for all i ≥ 0, and the
cost incurred is 1. When action u = 0 is taken the state stays at the current
state i, for all i ≥ 0, and the cost 1/(1+i) is incurred.
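The following is a minimal Python sketch of these dynamics (the step function mirrors the description above; the always-stay run at the end is only an illustration, not a claim about the optimal policy):

def step(i, u):
    """One transition of the controlled chain: u = 1 moves i to i + 1 at
    cost 1; u = 0 keeps the state at i at cost 1/(1+i)."""
    if u == 1:
        return i + 1, 1.0
    return i, 1.0 / (1 + i)

# Example: always staying (u = 0) from state i0 gives average cost 1/(1+i0),
# since the state never moves.
i, total, T = 0, 0.0, 10000
for _ in range(T):
    i, c = step(i, 0)
    total += c
print(total / T)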
If it does not rain when she moves and the umbrella is at her initial location,
she has the option of taking it with her, which incurs a cost of V , because
of the inconvenience of carrying an umbrella with her even though it is not
raining.
Find the optimal strategy for whether she should take the umbrella when it
does not rain (if it happens to be at her initial location) so as to minimize
her long term average cost.
Hint: First argue that the problem can be modeled by a 4-state controlled
Markov chain with state space X = {(1, r), (1, n), (0, r), (0, n)} and action
space U = {0, 1}. Here the state (1, r) means that the umbrella is at her
current location and it is raining; (1, n) means that the umbrella is at her
current location and it is not raining; (0, r) means that the umbrella is not at
her current location and it is raining; and (0, n) means that the umbrella is
not at her current location and it is not raining. The action u = 1 corresponds
to the decision to take the umbrella if the umbrella is at the current location,
and u = 0 corresponds to the decision to not take the umbrella even though
the umbrella is at the current location. Then observe that the control action
that is taken in states (0, r) and (0, n) is irrelevant. Also observe that even
though it looks as if, for any stationary Markov policy, one has to compute
a stationary distribution on a set of size 4, this stationary distribution can
be determined in terms of one parameter in [0, 1] (and the probability of
raining, i.e. p).
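Following the hint, here is a sketch of the average-cost computation for the one-parameter family of stationary randomized policies, where q ∈ [0, 1] is the probability of taking the umbrella in state (1, n). It assumes she always takes the umbrella when it is raining and it is available, and it uses a symbol W as a stand-in for the cost, defined earlier in the problem, of making a move in the rain without the umbrella; both the parameterization by q and the symbol W are our shorthand, not part of the hint itself.

def avg_cost(q, p, V, W):
    """Long-run average cost of the policy that takes the umbrella with
    probability q in state (1, n) and always takes it in state (1, r).
    p = probability of rain on a move, V = cost of carrying the umbrella
    when it is not raining, W = stand-in for the cost of a rainy move
    without the umbrella."""
    # Whether the umbrella is at her current location is itself a 2-state
    # chain: from "here" it stays "here" w.p. p + (1-p)*q and becomes
    # "elsewhere" w.p. (1-p)*(1-q); from "elsewhere" she walks to where it
    # is, so it is "here" again w.p. 1. Solving the balance equations:
    pi1 = 1.0 / (1.0 + (1 - p) * (1 - q))  # stationary P(umbrella is with her)
    pi0 = 1.0 - pi1                        # stationary P(umbrella elsewhere)
    # V is paid when she carries it needlessly; W when she is caught out.
    return pi1 * (1 - p) * q * V + pi0 * p * W

# Comparing the two deterministic choices for some illustrative numbers:
p, V, W = 0.3, 1.0, 5.0
print(avg_cost(0.0, p, V, W))  # q = 0: never take it when it is not raining
print(avg_cost(1.0, p, V, W))  # q = 1: always take it

Note that the resulting expression is a ratio of two affine functions of q, hence monotone on [0, 1], which suggests why comparing the two endpoint policies suffices.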
3. Consider the average cost optimal control problem for the following con-
trolled Markov chain model with countable state space and finite action
space. The state space is the set of positive integers, {1, 2, . . .}. There are
two possible actions u1 and u2 . The transition probabilities under action u1
are given by
P_{i,i+1}(u1) = 1, i ≥ 1,
and under action u2 are given by
P_{i,i}(u2) = 1, i ≥ 1.
The one-step costs under action u1 are given by
c(i, u1) = 1, i ≥ 1,
and under action u2 they are given by
c(i, u2) = 1/i, i ≥ 1.
(a) Argue that, starting from any initial probability distribution, the opti-
mal long term average cost is 0.
(b) Consider the Bellman equation for the average cost control problem,
which in general reads
J^* + h(i) = min_u [ c(i, u) + Σ_j P_{ij}(u) h(j) ], for all states i.
Write down the Bellman equation for this problem. Can you find a
solution (J^*, (h(i), i ≥ 1)) for the Bellman equation?
(c) Can you find a solution (J^*, (h(i), i ≥ 1)) for the Bellman equation
for which h is a bounded function (i.e. there is a finite constant K < ∞
such that |h(i)| ≤ K for all i ≥ 1)?
(d) Show that, starting from any initial distribution, the following nonsta-
tionary control strategy is optimal for the long term average cost:
If we are in state i for the first time, we use the control action u2 for i
successive times, and then use the control action u1 once.
(This will move us to state i+1 for the first time, after which we repeat
the above prescription, and so on. A simulation sketch of this strategy is
given after part (f).)
(e) Show that there is no stationary optimal Markovian control strategy
for this control problem.
Note: Our terminology in this course is that a Markovian strategy is
given by a deterministic function from the state space to the space of
actions. Further, the strategy is said to be optimal if it is optimal from
every initial distribution.
(f) Find a stationary randomized Markovian control strategy for this con-
trol problem which is optimal for the long term average cost, i.e.
achieves the long term average optimal cost 0.
Note: A randomized Markovian control strategy is a function from the
state space to probability distributions on the set of actions.
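As a numerical sanity check (not a substitute for the argument asked for in part (d)), the nonstationary strategy can be simulated; the running average cost visibly decays toward 0 as the horizon grows:

def simulate(T):
    """Average cost over T steps of the part-(d) strategy: on reaching
    state i for the first time, play u2 (cost 1/i) i times, then play u1
    (cost 1) once, which moves the chain to i + 1."""
    i, remaining_u2 = 1, 1  # start in state 1, with u2 to be played once
    total = 0.0
    for _ in range(T):
        if remaining_u2 > 0:
            total += 1.0 / i   # action u2: stay at i
            remaining_u2 -= 1
        else:
            total += 1.0       # action u1: move to i + 1
            i += 1
            remaining_u2 = i   # play u2 i times at the new state
    return total / T

for T in (10**3, 10**4, 10**5):
    print(T, simulate(T))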
4. We wish to control a finite state Markov chain to minimize the long run
average cost. We can observe the state of the Markov chain and base our
control action at each time k ≥ 0 on the entire state sequence up to and
including the state at time k. However, we are not sure what the transition
probability matrix is.
More precisely, let X ∶= {1, 2} denote the state space and let U ∶= {a, b} de-
note the set of control actions. Let Θ ∶= {θ1 , θ2 }. The underlying controlled
transition probability matrices are modeled as being P(u, θ) := [p_{ij}(u, θ)],
where i, j ∈ X , u ∈ U, and θ is either θ1 or θ2 , but we are not sure which.
We adopt a Bayesian viewpoint with our prior probability being that the two
possibilities for θ are equiprobable.
Assume that the cost we incur when in state i and taking action u does not
depend on the underlying θ and, for concreteness, is given by
c(1, a) = 1, c(2, a) = 5, c(1, b) = 0, c(2, b) = 6.
Also, for concreteness, assume that
P(a, θ1) = [ 0.5 0.5 ; 1 0 ],   P(a, θ2) = [ 0.9 0.1 ; 1 0 ],
P(b, θ1) = [ 0.8 0.2 ; 1 0 ],   P(b, θ2) = [ 0.2 0.8 ; 1 0 ],
where each 2 × 2 matrix is written row by row, with the rows separated by a
semicolon.
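To make the Bayesian viewpoint concrete, the posterior on θ after each observed transition updates by Bayes' rule. A minimal sketch using the matrices above (the function belief_update and the examples at the end are ours, for illustration only):

# Transition matrices from the problem data, indexed by (action, theta).
P = {
    ('a', 1): [[0.5, 0.5], [1.0, 0.0]],
    ('a', 2): [[0.9, 0.1], [1.0, 0.0]],
    ('b', 1): [[0.8, 0.2], [1.0, 0.0]],
    ('b', 2): [[0.2, 0.8], [1.0, 0.0]],
}

def belief_update(belief, i, u, j):
    """Posterior probability of theta1 after observing the transition
    i -> j under action u, starting from prior P(theta = theta1) = belief.
    States are 1 and 2; matrix rows and columns are 0-indexed."""
    l1 = P[(u, 1)][i - 1][j - 1] * belief        # weight of theta1
    l2 = P[(u, 2)][i - 1][j - 1] * (1 - belief)  # weight of theta2
    return l1 / (l1 + l2)

# From the equiprobable prior, seeing 1 -> 1 under action b favors theta1:
print(belief_update(0.5, 1, 'b', 1))  # 0.8 * 0.5 / (0.8 * 0.5 + 0.2 * 0.5) = 0.8
# Transitions out of state 2 are identical under both theta, so they are
# uninformative and the belief is unchanged:
print(belief_update(0.5, 2, 'a', 1))  # stays 0.5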
We are also given a cost function c ∶ X × U → R, where c(1, u) = 1 and
c(2, u) = 0 for all u ∈ U.
The observation at time 0 is given by Y0 = X0 , and the observation at time
1 is given by Y1 = ∗.
There is a cost L > 0 for using action b at time 0, zero cost for using the
control action a at time 0, and zero cost, whatever the control action, at time
1.
The terminal cost of ending up with X2 = 1 is K > 0, and the terminal cost
of ending up with X2 = 2 is 0.
(a) Find the optimal policy to minimize the overall expected cost (from the
given initial condition) over all policies of the type U0 = g0 (Y0 ), U1 =
g1 (Y0 , Y1 ).
(b) Find the optimal policy to minimize the overall expected cost (from the
given initial condition) over all policies of the type U0 = g0 (Y0 ), U1 =
g1 (Y1 ). This can be considered a distributed control problem, since
the controller at time 1 does not have access to the observation at time
0.
(c) Is there a signalling aspect to the optimal control in the second case
(i.e. the case of distributed control)? If so, explain what it is, in your
own words.
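For a problem of this size, the two policy classes in (a) and (b) can be compared by brute-force enumeration. The sketch below uses hypothetical placeholders p0, P1, P2 for the initial law and the two controlled transition kernels (substitute the ones specified in the problem) and illustrative numbers for L and K; it exploits the fact that Y1 = ∗ carries no information, so in case (a) the time-1 action may depend on Y0 = X0, while in case (b) it must be a constant.

import itertools

STATES, ACTIONS = (1, 2), ('a', 'b')

# Hypothetical placeholders -- substitute the data from the problem:
p0 = {1: 0.5, 2: 0.5}                   # P(X0 = i), placeholder
def P1(i, u): return {1: 0.5, 2: 0.5}   # P(X1 = . | X0 = i, U0 = u), placeholder
def P2(i, u): return {1: 0.5, 2: 0.5}   # P(X2 = . | X1 = i, U1 = u), placeholder
L, K = 1.0, 3.0                         # illustrative values only

def cost(g0, g1):
    """Expected cost of U0 = g0[X0], U1 = g1[X0]: the cost L is paid when
    U0 = b, and the terminal cost K is paid when X2 = 1."""
    total = 0.0
    for x0 in STATES:
        u0 = g0[x0]
        stage = L if u0 == 'b' else 0.0
        for x1, q1 in P1(x0, u0).items():
            for x2, q2 in P2(x1, g1[x0]).items():
                total += p0[x0] * q1 * q2 * (stage + (K if x2 == 1 else 0.0))
    return total

maps = [dict(zip(STATES, c)) for c in itertools.product(ACTIONS, repeat=2)]
consts = [{1: u, 2: u} for u in ACTIONS]       # case (b): g1 ignores X0
best_a = min(cost(g0, g1) for g0 in maps for g1 in maps)
best_b = min(cost(g0, g1) for g0 in maps for g1 in consts)
print(best_a, best_b)  # case (b) can never beat case (a)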