
Reinforcement Learning Mathematics Exam (100 Marks)
Time: 2 Hours

Section A: Basic Concepts (20 Marks)


1. (a) Define the Bellman equation for the state-value function vπ(s). (5 Marks)
(b) Calculate the expected return Gt for rewards Rt+1 = 2, Rt+2 = −1, Rt+3 = 3 with γ = 0.9. (5 Marks)

2. (a) Differentiate between deterministic and stochastic policies. (4 Marks)


(b) Given the policy

π(a|s) = 0.7 if a = 1,  0.3 if a = 2,

compute π(a = 1|s). (6 Marks)

Section B: Value Functions and Bellman Equations (30 Marks)
3. Consider an MDP with:

• States S = {s1, s2}, Actions A = {a1, a2}
• Transitions:
  – In s1: a1 → s2 (+5), a2 → s1 (−1)
  – In s2: all actions terminate (+10)

(a) Compute vπ(s1) for the policy π that always chooses a1 (γ = 1). (10 Marks)
(b) Find the optimal value function v∗(s1). (10 Marks)

4. For the 2×2 gridworld:

A    B (Terminal, +10)
C    D (Terminal, −5)

Compute V(A) using the Bellman equation (γ = 0.8, non-terminal rewards = −1, random policy). (10 Marks)

Section C: Policy Iteration and Value Iteration (25 Marks)
5. Perform one iteration of policy evaluation for:

• States S = {s1, s2}, with s2 terminal
• Transition: s1 → s2 (+5)
• γ = 0.9, initial V(s1) = V(s2) = 0

(10 Marks)

6. Apply value iteration for 2 steps on the MDP in Question 5. (15 Marks)

Section D: Exploration vs Exploitation (15 Marks)


7. Given bandit arms:

• Arm 1: 10 pulls, avg=3


• Arm 2: 5 pulls, avg=4
• Arm 3: 20 pulls, avg=2

Compute UCB scores (c = 2) and choose next arm. (10 Marks)

8. Explain the ϵ-greedy strategy (ϵ = 0.1) and compute the exploration probability per arm for 3 arms. (5 Marks)

Section E: Model-Based vs Model-Free (10 Marks)


9. (a) Contrast model-based and model-free RL. (4 Marks)
(b) Compute the expected reward for a model with 70% accuracy (R1 = 3, R2 = 5). (6 Marks)

Complete Solutions
Section A
1. (a) The Bellman equation for vπ(s) is:

vπ(s) = Eπ[Rt+1 + γ vπ(St+1) | St = s]
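
For reference (not part of the original answer), the same equation expanded over the policy and the one-step dynamics p(s′, r | s, a) reads:

vπ(s) = Σa π(a|s) Σs′,r p(s′, r | s, a) [ r + γ vπ(s′) ]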

(b) Expected return calculation:

Gt = 2 + 0.9(−1) + 0.9²(3) = 2 − 0.9 + 2.43 = 3.53
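
As a quick check of this arithmetic (a minimal Python sketch, not part of the original paper):

# Discounted return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3}
rewards = [2, -1, 3]        # R_{t+1}, R_{t+2}, R_{t+3}
gamma = 0.9
G_t = sum(gamma**k * r for k, r in enumerate(rewards))
print(round(G_t, 2))        # 3.53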

2. (a) Deterministic policy: maps each state to a specific action, a = π(s).
Stochastic policy: specifies a probability distribution over actions, π(a|s) = P[At = a | St = s].
(b) Directly from the definition:

π(a = 1|s) = 0.7

Section B
3. (a) For the policy π always choosing a1:

vπ(s1) = 5 + γ vπ(s2) = 5 + 1 × 10 = 15

(b) Optimal value function:

v∗(s1) = max{ 5 + 10 = 15 (a1),  −1 + 15 = 14 (a2 loops) } = 15
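
Both results can be reproduced with a short Python sketch of this two-state MDP (assuming, as above, that terminating from s2 yields a return of 10):

# Question 3 MDP: in s1, a1 -> s2 (+5), a2 -> s1 (-1); every action in s2 terminates (+10).
gamma = 1.0
v_s2 = 10.0                                       # return collected when terminating from s2

# (a) policy that always chooses a1 in s1
v_pi_s1 = 5 + gamma * v_s2                        # 15.0

# (b) optimal value: compare a1 against a2 (which pays -1, returns to s1, then acts optimally)
v_star_s1 = max(5 + gamma * v_s2,                 # a1: 15
                -1 + gamma * (5 + gamma * v_s2))  # a2 then a1: 14
print(v_pi_s1, v_star_s1)                         # 15.0 15.0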

4. Bellman equation for state A under the random policy:

V(A) = −1 + 0.8 [ (1/4)V(B) + (1/4)V(C) + (1/4)V(A) + (1/4)V(A) ]

Terminal states: V(B) = 10, V(D) = −5. Non-terminal values V(A) = V(C) = 0 (initial assumption):

V(A) = −1 + 0.8 × (1/4)(10) = −1 + 2 = 1
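
A one-line check of this backup (a sketch that keeps the initial assumption V(A) = V(C) = 0 on the right-hand side):

# One Bellman backup for state A under the uniform random policy (Question 4).
gamma = 0.8
V = {"A": 0.0, "B": 10.0, "C": 0.0, "D": -5.0}   # terminals fixed; non-terminals initialised to 0
V_A = -1 + gamma * 0.25 * (V["B"] + V["C"] + V["A"] + V["A"])
print(V_A)   # 1.0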

Section C
5. First iteration of policy evaluation:

V(s1) = 5 + γ V(s2) = 5 + 0.9 × 0 = 5

V(s2) remains 0 (terminal).

6. Value iteration steps:

Step 1: V1(s1) = 5 + 0.9 × 0 = 5
Step 2: V2(s1) = 5 + 0.9 × 0 = 5

The value of s1 does not change after the first sweep because s2 is terminal and V(s2) stays 0.
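
A minimal sketch of the two sweeps:

# Value iteration for Questions 5-6: single transition s1 -> s2 (+5), s2 terminal.
gamma = 0.9
V = {"s1": 0.0, "s2": 0.0}
for step in (1, 2):
    V["s1"] = 5 + gamma * V["s2"]              # only one action is available from s1
    print(f"Step {step}: V(s1) = {V['s1']}")   # 5.0 after both sweeps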

Section D
7. UCB scores with bonus √(2 ln ntotal / ni), where ntotal = 10 + 5 + 20 = 35:

UCB1 = 3 + √(2 ln 35 / 10) ≈ 3 + 0.84 = 3.84
UCB2 = 4 + √(2 ln 35 / 5) ≈ 4 + 1.19 = 5.19
UCB3 = 2 + √(2 ln 35 / 20) ≈ 2 + 0.60 = 2.60

Choose Arm 2 (highest UCB score).
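
These numbers can be reproduced with the following sketch (using the √(2 ln ntotal / ni) bonus written above):

import math

# UCB scores for Question 7: bonus_i = sqrt(2 * ln(N) / n_i), with N = total pulls.
pulls = {1: 10, 2: 5, 3: 20}
avgs = {1: 3.0, 2: 4.0, 3: 2.0}
N = sum(pulls.values())                                            # 35
ucb = {arm: avgs[arm] + math.sqrt(2 * math.log(N) / pulls[arm]) for arm in pulls}
print(ucb)                       # {1: ~3.84, 2: ~5.19, 3: ~2.60}
print(max(ucb, key=ucb.get))     # 2  -> pull Arm 2 next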

8. ϵ-greedy strategy:

• Exploit the best arm with probability 1 − ϵ = 0.9
• Explore randomly with probability ϵ = 0.1
• Exploration probability per arm: 0.1/3 ≈ 0.033
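
For completeness, the per-arm selection probabilities implied by this scheme (a sketch that assumes a single greedy arm and uniform random exploration over all 3 arms):

# Epsilon-greedy selection probabilities for 3 arms (Question 8).
epsilon, n_arms = 0.1, 3
greedy_arm = 0                               # assume arm 0 currently has the highest estimate
probs = [epsilon / n_arms] * n_arms          # exploration mass, spread uniformly: ~0.033 each
probs[greedy_arm] += 1 - epsilon             # the greedy arm also receives the exploitation mass
print(probs)                                 # [0.9333..., 0.0333..., 0.0333...]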

Section E
9. (a) Model-based RL: uses a model of the environment (its transitions and rewards) for planning and prediction.
Model-free RL: learns directly from experience, without an explicit model.
(b) Expected reward with a 70% accurate model:

0.7 × (3 + 5) = 0.7 × 8 = 5.6
