Assignment Two
Due Date: October 28, 2024
1 Theoretical Problems
1. Suppose that two assets have jointly normally distributed returns R1 and R2: R1 has mean µ1 and standard deviation σ1, R2 has mean µ2 and standard deviation σ2, and the correlation between R1 and R2 is ρ. Assume µ1 > µ2. Investors observe N i.i.d. observations of (R1, R2), estimate µ̂1 and µ̂2 by the sample means, and declare that µ1 > µ2 (asset one has the higher expected return) if µ̂1 > µ̂2, and that µ2 > µ1 otherwise.
(a) Determine the probability that investors will be correct.
(b) For a prescribed confidence level α, determine the number of observations required to make the probability that investors are correct equal to α.
(c) Write a computer program to implement the formulas you derived in the previous two parts (you may want to experiment with this program to determine how long an observation period is required for what you consider to be “reasonable” values of the parameters).
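A minimal Python sketch of such a program, assuming the answers to parts (a) and (b) take the standard form Φ(√N (µ1 − µ2)/σd) with σd² = σ1² + σ2² − 2ρσ1σ2; the parameter values in the example calls are illustrative assumptions, not prescribed by the problem:

```python
# Sketch for part (c): probability of a correct ranking and the N needed for a
# target confidence level, assuming the formula stated in the lead-in above.
import numpy as np
from scipy.stats import norm

def prob_correct(mu1, mu2, sigma1, sigma2, rho, N):
    """Probability that the sample means are ordered correctly (part (a))."""
    sigma_d = np.sqrt(sigma1**2 + sigma2**2 - 2 * rho * sigma1 * sigma2)
    return norm.cdf(np.sqrt(N) * (mu1 - mu2) / sigma_d)

def required_N(mu1, mu2, sigma1, sigma2, rho, alpha):
    """Smallest N making the probability of being correct at least alpha (part (b))."""
    sigma_d = np.sqrt(sigma1**2 + sigma2**2 - 2 * rho * sigma1 * sigma2)
    z = norm.ppf(alpha)
    return int(np.ceil((z * sigma_d / (mu1 - mu2))**2))

# Illustrative values only (loosely "equity vs. bond"-like annual figures).
print(prob_correct(0.08, 0.04, 0.20, 0.10, 0.2, N=100))
print(required_N(0.08, 0.04, 0.20, 0.10, 0.2, alpha=0.95))
```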
2. Consider the modified search for a stochastic policy in reinforcement learning, with entropy regularization:
$$\max_{\pi}\; \sum_{k=1}^{K} \pi_k\, Q(s_t, A_k) \;-\; \frac{1}{\beta} \sum_{k=1}^{K} \pi_k \log\frac{\pi_k}{\omega_k} \tag{1}$$
$$0 \le \pi_k \le 1, \qquad \sum_{k=1}^{K} \pi_k = 1. \tag{2}$$
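The maximizer of (1) subject to (2) has the familiar exponential-family form π_k ∝ ω_k exp(βQ(s_t, A_k)); the sketch below checks this numerically against random feasible policies. The values chosen for Q, ω, and β are illustrative assumptions, not part of the problem.

```python
# Numerical sanity check of problem (1)-(2); Q, omega, beta are illustrative.
import numpy as np

rng = np.random.default_rng(0)
K, beta = 4, 2.0
Q = rng.normal(size=K)                    # stand-in for Q(s_t, A_k)
omega = np.array([0.1, 0.2, 0.3, 0.4])    # stand-in reference weights omega_k

def objective(pi):
    # sum_k pi_k Q_k - (1/beta) sum_k pi_k log(pi_k / omega_k)
    return pi @ Q - (1.0 / beta) * np.sum(pi * np.log(pi / omega))

# Candidate maximizer: pi_k proportional to omega_k * exp(beta * Q_k).
candidate = omega * np.exp(beta * Q)
candidate /= candidate.sum()

# Compare with many random points of the probability simplex.
best_random = max(objective(p) for p in rng.dirichlet(np.ones(K), size=50_000))
print(objective(candidate), best_random)  # the candidate's value should be the larger one
```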
3. Consider the Boltzmann weighted average of a function h(i) defined on
a binary set I = {1, 2}:
$$\mathrm{Boltz}_{\beta}\, h \;=\; \sum_{i \in I} h(i)\, \frac{\exp(\beta h(i))}{\sum_{j \in I} \exp(\beta h(j))}. \tag{3}$$
(a) Verify that this functional smoothly interpolates between the max and the mean of h(i), which are obtained in the limits β → ∞ and β → 0, respectively.
(b) A functional B mapping real-valued functions on I to R is called a non-expansion if min_i h(i) ≤ Bh ≤ max_i h(i) and |Bh − Bh′| ≤ max_i |h(i) − h′(i)|. By taking β = 1, h(1) = 100, h(2) = 1, h′(1) = 1, h′(2) = 0, show that Boltz_β is not a non-expansion.
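A quick numerical check of the arithmetic in part (b), using only the values stated in the problem (Boltz₁h ≈ 100 and Boltz₁h′ ≈ 0.73, so their difference exceeds max_i |h(i) − h′(i)| = 99):

```python
# Evaluate the Boltzmann weighted average (3) at the values given in part (b).
import numpy as np

def boltz(h, beta):
    """Boltzmann weighted average of the values h over I = {1, 2}."""
    w = np.exp(beta * h)
    return float(np.sum(h * w) / w.sum())

beta = 1.0
h  = np.array([100.0, 1.0])   # h(1), h(2)
hp = np.array([1.0, 0.0])     # h'(1), h'(2)

lhs = abs(boltz(h, beta) - boltz(hp, beta))   # |Boltz h - Boltz h'|, about 99.27
rhs = np.max(np.abs(h - hp))                  # max_i |h(i) - h'(i)| = 99
print(lhs, rhs, lhs <= rhs)                   # lhs > rhs, so Boltz_beta is not a non-expansion
```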
4. Consider the infinite horizon finite state Markov decision process problem presented in class, and suppose that π, π′ : S → A are two deterministic policies such that:
$$Q^{\pi}(s, \pi'(s)) \ge V^{\pi}(s), \qquad \forall s \in S.$$
Show that $V^{\pi'}(s) \ge V^{\pi}(s)$ for all s ∈ S.
5. Consider the infinite horizon finite state Markov decision process presented in class, with discount factor γ ∈ (0, 1). Suppose that R(s, a, s′) = R(s, a) and let π : S → A be a deterministic policy. Let M be the dynamic programming max operator:
$$MV(s) = \max_{a}\left( R(s, a) + \gamma \sum_{s'} p(s'|s, a)\, V(s') \right). \tag{4}$$
Let:
$$\bar{c} = \max_{s}\left( MV^{\pi}(s) - V^{\pi}(s) \right), \tag{5}$$
and define a policy π′ using:
$$\pi'(s) = \arg\max_{a}\left( R(s, a) + \gamma \sum_{s'} p(s'|s, a)\, V^{\pi}(s') \right) = \arg\max_{a} Q^{\pi}(s, a). \tag{6}$$
Let $V^{\pi'}$ be the value of this policy. Show that
$$V^{\pi'}(s) \le V^{\pi}(s) + \frac{\bar{c}}{1 - \gamma} \tag{7}$$
for all s ∈ S.
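As a sanity check on inequality (7), here is a small numerical illustration; the five-state, three-action MDP, its rewards and transitions, and the base policy are arbitrary assumptions made up for the example, not data from the problem:

```python
# Numerically check the bound (7) (and the improvement in Problem 4) on a random MDP.
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 5, 3, 0.9                          # arbitrary sizes and discount factor
R = rng.normal(size=(nS, nA))                      # R(s, a)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))      # p(s'|s, a), shape (nS, nA, nS)

def policy_value(pi):
    """Solve (I - gamma * P_pi) V = R_pi for a deterministic policy pi."""
    P_pi = P[np.arange(nS), pi]
    R_pi = R[np.arange(nS), pi]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

pi = rng.integers(nA, size=nS)                     # an arbitrary deterministic policy
V_pi = policy_value(pi)

Q_pi = R + gamma * P @ V_pi                        # Q^pi(s, a)
MV_pi = Q_pi.max(axis=1)                           # (M V^pi)(s), equation (4)
c_bar = np.max(MV_pi - V_pi)                       # equation (5)

pi_prime = Q_pi.argmax(axis=1)                     # greedy policy, equation (6)
V_pi_prime = policy_value(pi_prime)

print(np.all(V_pi_prime <= V_pi + c_bar / (1 - gamma)))   # bound (7)
print(np.all(V_pi_prime >= V_pi - 1e-10))                  # Problem 4: V^{pi'} >= V^{pi}
```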
2 Computational Problems
1. Using operators that are not non-expansions can lead to the loss of a solution of a generalized Bellman equation. To illustrate this phenomenon, consider the MDP problem on the set I = {1, 2} with two actions a and b and the following specification:

Updating the action-value estimates with a, a′ drawn from the Boltzmann policy with β = 16.55 and learning rate α = 0.1 leads to oscillating estimates Q̂(s, a) that do not settle to stable values as the number of iterations increases.
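One possible sketch of the experiment described above (a SARSA-style update, consistent with a, a′ being drawn from the Boltzmann policy); the arrays R and P below are placeholders for the problem's reward and transition specification, and the discount factor γ is an additional assumption:

```python
# SARSA-style updates with a Boltzmann behaviour policy (beta = 16.55, alpha = 0.1).
# R and P are PLACEHOLDERS; replace them with the MDP specified in the problem.
import numpy as np

rng = np.random.default_rng(0)
beta, alpha, gamma = 16.55, 0.1, 0.98          # gamma is an assumed discount factor
n_states, n_actions = 2, 2                     # states I = {1, 2}, actions {a, b}

R = np.zeros((n_states, n_actions))                  # R[s, a]      (placeholder)
P = np.full((n_states, n_actions, n_states), 0.5)    # P[s, a, s']  (placeholder)

def boltzmann(q_row):
    """Boltzmann (softmax) action probabilities for one state's Q-values."""
    w = np.exp(beta * (q_row - q_row.max()))   # subtract max for numerical stability
    return w / w.sum()

Q = np.zeros((n_states, n_actions))
history = []
s = 0
a = rng.choice(n_actions, p=boltzmann(Q[s]))
for t in range(100_000):
    s_next = rng.choice(n_states, p=P[s, a])
    a_next = rng.choice(n_actions, p=boltzmann(Q[s_next]))   # a' drawn from the Boltzmann policy
    Q[s, a] += alpha * (R[s, a] + gamma * Q[s_next, a_next] - Q[s, a])
    history.append(Q.copy())
    s, a = s_next, a_next

# Plot or inspect `history` to see whether the estimates settle or keep oscillating.
```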