RL Paper Deepsk
Time: 2 Hours
(a) Compute vπ(s1) for the policy π that always chooses a1 (γ = 1). (10 Marks)
(b) Find the optimal value function v∗(s1). (10 Marks)
Consider the 2×2 grid world below, where B and D are terminal states:
A   B (Terminal, +10)
C   D (Terminal, −5)
Compute V(A) using the Bellman equation (γ = 0.8, non-terminal rewards = −1, random policy). (10 Marks)
Section C: Policy Iteration and Value Iteration (25 Marks)
5. Perform one iteration of policy evaluation for the given MDP. (10 Marks)
6. Apply value iteration for 2 steps on the MDP in Question 5. (15 Marks)
7. Explain the ε-greedy strategy (ε = 0.1) and compute the exploration probability for 3 arms. (5 Marks)
Complete Solutions
Section A
1. (a) The Bellman equation for vπ(s) is:
\[ v_\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,\bigl[ r + \gamma\, v_\pi(s') \bigr] \]
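This backup can be transcribed directly into code. A minimal sketch, assuming hypothetical policy and transition dictionaries (the names and the tiny example MDP are illustrative, not taken from the question paper):

def bellman_backup(s, policy, transitions, V, gamma):
    # Bellman expectation backup for v_pi at state s:
    #   v_pi(s) = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]
    # policy[s]           : dict action -> probability pi(a|s)
    # transitions[(s, a)] : list of (probability, next_state, reward)
    # V                   : dict of current value estimates
    return sum(
        pi_a * sum(p * (r + gamma * V[s_next])
                   for p, s_next, r in transitions[(s, a)])
        for a, pi_a in policy[s].items()
    )

# Tiny illustrative example: one state, one action, deterministic transition.
V = {"s1": 0.0, "s2": 0.0}
policy = {"s1": {"a1": 1.0}}
transitions = {("s1", "a1"): [(1.0, "s2", 5.0)]}
print(bellman_backup("s1", policy, transitions, V, gamma=1.0))  # 5.0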
Section B
3. (a) For the policy π that always chooses a1:
4. For the grid world under the random policy (γ = 0.8, step reward −1):
Terminal states: V(B) = 10, V(D) = −5. Non-terminal values are initialised to 0, so only the transition to B (chosen with probability 1/4) contributes to the first backup:
\[ V(A) = -1 + 0.8 \times \tfrac{1}{4} \times 10 = -1 + 2 = 1 \]
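A quick numerical check of this backup (a minimal sketch; the assumption that only the move to B yields a non-zero successor value, with the other three actions reaching zero-valued states, follows the solution above):

# One Bellman backup for state A under the uniform random policy (4 actions).
gamma = 0.8
step_reward = -1
successor_values = [10, 0, 0, 0]   # terminal B, plus three zero-valued successors
v_A = step_reward + gamma * sum(0.25 * v for v in successor_values)
print(v_A)  # 1.0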
Section C
5. First iteration of policy evaluation:
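The MDP from Question 5 is not reproduced in this extract, so the numbers cannot be restated here; the sketch below shows one synchronous sweep of iterative policy evaluation on a placeholder two-state MDP (all states, rewards, and transition probabilities are assumptions, not the exam's MDP):

# One synchronous sweep of iterative policy evaluation.
gamma = 0.9
states = ["s1", "s2"]
policy = {"s1": "a1", "s2": "a1"}          # deterministic placeholder policy
# transitions[(s, a)] = list of (probability, next_state, reward)
transitions = {
    ("s1", "a1"): [(1.0, "s2", 0.0)],
    ("s2", "a1"): [(1.0, "s1", 1.0)],
}
V = {s: 0.0 for s in states}               # initial value estimates

V_new = {}
for s in states:
    a = policy[s]
    V_new[s] = sum(p * (r + gamma * V[s_next])
                   for p, s_next, r in transitions[(s, a)])
V = V_new
print(V)  # {'s1': 0.0, 's2': 1.0}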
Section D
7. UCB scores (n_total = 10 + 5 + 20 = 35):
\[ \mathrm{UCB}_1 = 3 + \sqrt{\frac{2 \ln 35}{10}} \approx 3 + 0.84 = 3.84 \]
\[ \mathrm{UCB}_2 = 4 + \sqrt{\frac{2 \ln 35}{5}} \approx 4 + 1.19 = 5.19 \]
\[ \mathrm{UCB}_3 = 2 + \sqrt{\frac{2 \ln 35}{20}} \approx 2 + 0.60 = 2.60 \]
Choose Arm 2 (highest UCB).
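A short numerical check of these scores (a minimal sketch; the empirical means 3, 4, 2 and pull counts 10, 5, 20 are taken from the solution above):

import math

# UCB1 score for each arm: empirical mean + sqrt(2 ln N / n_i)
means = [3, 4, 2]       # empirical mean reward of each arm
counts = [10, 5, 20]    # number of times each arm has been pulled
n_total = sum(counts)   # 35

ucb = [m + math.sqrt(2 * math.log(n_total) / n) for m, n in zip(means, counts)]
print([round(u, 2) for u in ucb])              # [3.84, 5.19, 2.6]
print("choose arm", ucb.index(max(ucb)) + 1)   # choose arm 2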
7. ε-greedy strategy: with probability ε = 0.1 an arm is chosen uniformly at random (exploration), and with probability 1 − ε = 0.9 the arm with the highest estimated value is chosen (exploitation). With 3 arms, the total exploration probability is ε = 0.1, so each individual arm is selected through exploration with probability ε/3 ≈ 0.033.
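A minimal sketch of ε-greedy selection for the 3-arm case (the value estimates below are placeholders, not figures from the paper):

import random

def epsilon_greedy(q_values, epsilon=0.1):
    # With probability epsilon pick a uniformly random arm (explore),
    # otherwise pick the arm with the highest estimated value (exploit).
    if random.random() < epsilon:
        return random.randrange(len(q_values))
    return q_values.index(max(q_values))

q = [3.0, 4.0, 2.0]        # placeholder value estimates for 3 arms
print(epsilon_greedy(q))   # usually arm index 1; a random arm 10% of the time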
Section E
9. (a) Model-based: uses a learned or given model of the environment (transition and reward functions) to plan and predict outcomes.
Model-free: learns value functions or policies directly from experience, without an explicit model of the environment.
(b) Expected reward with 70% accurate model: