Exam RL Questions
Part A: Kernel Methods
Part B: Reinforcement Learning
1. Consider the following Markov Decision Process (MDP) with discount factor γ = 0.5.
Upper case letters A, B, C represent states; arcs represent state transitions; lower case
letters ab, ba, bc, ca, cb represent actions; signed integers represent rewards; and fractions
represent transition probabilities.
[Diagram: MDP with states A, B, C; actions ab, ba, bc, ca, cb; rewards +2, −8, −2, +8, +4; and transition probabilities 1/4 and 3/4 shown on the arcs.]
• Define the state-value function V π(s) for a discounted MDP
[1 mark]
• Write down the Bellman expectation equation for state-value functions
[2 marks]
• Consider the uniform random policy π1 (s,a) that takes all actions from state s with
equal probability. Starting with an initial value function of V1 (A) = V1 (B) = V1 (C) =
2, apply one synchronous iteration of iterative policy evaluation (i.e. one backup for
each state) to compute a new value function V2 (s)
[3 marks]
• Consider a deterministic policy π(s). Prove that if a new policy π′ is greedy with
respect to V π then it must be better than or equal to π, i.e. V π′(s) ≥ V π(s) for all s;
and that if V π′(s) = V π(s) for all s then π′ must be an optimal policy.
[5 marks]
• Starting with an initial value function of V1 (A) = V1 (B) = V1 (C) = 2, apply one
synchronous iteration of value iteration (i.e. one backup for each state) to compute
a new value function V2 (s).
[3 marks]
[Total 20 marks]
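A minimal Python sketch of the two backups used in this question, one synchronous sweep of iterative policy evaluation and one of value iteration, assuming the MDP is encoded as a dictionary where mdp[s][a] is a list of (probability, next state, reward) triples and policy[s][a] is the probability of taking action a in state s; the encoding and function names are illustrative, not prescribed by the question.

def policy_evaluation_sweep(mdp, policy, V, gamma):
    # One synchronous Bellman expectation backup per state:
    # V(s) <- sum_a pi(a|s) sum_{s'} p(s'|s,a) [r + gamma V(s')]
    V_new = {}
    for s in mdp:
        total = 0.0
        for a, transitions in mdp[s].items():
            q = sum(p * (r + gamma * V[s2]) for p, s2, r in transitions)
            total += policy[s][a] * q
        V_new[s] = total
    return V_new

def value_iteration_sweep(mdp, V, gamma):
    # One synchronous Bellman optimality backup per state:
    # V(s) <- max_a sum_{s'} p(s'|s,a) [r + gamma V(s')]
    V_new = {}
    for s in mdp:
        V_new[s] = max(
            sum(p * (r + gamma * V[s2]) for p, s2, r in transitions)
            for transitions in mdp[s].values()
        )
    return V_new

Starting either sweep from V1(A) = V1(B) = V1(C) = 2 with γ = 0.5 reproduces the corresponding V2(s) once the transitions from the diagram are entered into the dictionary.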
2. Consider an undiscounted Markov Reward Process with two states A and B. The transition
matrix and reward function are unknown, but you have observed two sample episodes:
A + 3 → A + 2 → B − 4 → A + 4 → B − 3 → terminate
B − 2 → A + 3 → B − 3 → terminate
In the above episodes, sample state transitions and sample rewards are shown at each step,
e.g. A + 3 → A indicates a transition from state A to state A, with a reward of +3.
• Using first-visit Monte-Carlo evaluation, estimate the state-value function V (A),V (B)
[2 marks]
• Using every-visit Monte-Carlo evaluation, estimate the state-value function V (A),V (B)
[2 marks]
• Draw a diagram of the Markov Reward Process that best explains these two episodes
(i.e. the model that maximises the likelihood of the data - although it is not necessary
to prove this fact). Show rewards and transition probabilities on your diagram.
[4 marks]
• Solve the Bellman equation to give the true state-value function V (A),V (B). Hint:
solve the Bellman equations directly, rather than iteratively.
• What value function would batch TD(0) find, i.e. if TD(0) was applied repeatedly
to these two episodes?
[2 marks]
• What value function would batch TD(1) find, using accumulating eligibility traces?
[2 marks]
[Total 20 marks]
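A minimal Python sketch of first-visit and every-visit Monte-Carlo evaluation for an undiscounted MRP, assuming each episode is encoded as a list of (state, reward) pairs where the reward is the one received on leaving that state; the encoding and function name are illustrative, not prescribed by the question.

from collections import defaultdict

def mc_evaluate(episodes, first_visit=True):
    # Average the undiscounted return following each (first or every) visit to a state.
    returns = defaultdict(list)
    for episode in episodes:
        rewards = [r for _, r in episode]
        seen = set()
        for t, (s, _) in enumerate(episode):
            if first_visit and s in seen:
                continue
            seen.add(s)
            returns[s].append(sum(rewards[t:]))   # return = sum of remaining rewards
    return {s: sum(g) / len(g) for s, g in returns.items()}

# The two observed episodes, encoded as above.
episodes = [
    [("A", 3), ("A", 2), ("B", -4), ("A", 4), ("B", -3)],
    [("B", -2), ("A", 3), ("B", -3)],
]
first_visit_V = mc_evaluate(episodes, first_visit=True)
every_visit_V = mc_evaluate(episodes, first_visit=False)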
• Represent the rat’s state s by a vector of two binary features, bell(s) ∈ {0, 1} and
light(s) ∈ {0, 1}. Write down the sequence of feature vectors corresponding to this
episode.
[3 marks]
• Write down the sequence of λ-returns vtλ corresponding to this episode, for λ = 0.5
and b = 2, l = −2
[3 marks]
• Using the forward-view TD(λ) algorithm and your linear function approximator,
what is the sequence of updates to weight b? What is the total update to weight b?
Use λ = 0.5, γ = 1, α = 0.5 and start with b = 2, l = −2
[3 marks]
• Define the TD(λ) accumulating eligibility trace et when using linear value function
approximation
[1 mark]
• Write down the sequence of eligibility traces et corresponding to the bell, using
λ = 0.5, γ = 1
[3 marks]
• Using the backward-view TD(λ) algorithm and your linear function approximator,
what is the sequence of updates to weight b? (Use offline updates, i.e. do not
actually change your weights, just accumulate your updates.) What is the total update
to weight b? Use λ = 0.5, γ = 1, α = 0.5 and start with b = 2, l = −2
[3 marks]
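A minimal Python sketch of offline backward-view TD(λ) with accumulating eligibility traces and a linear value function v(s) = w·x(s), matching the setting of the eligibility-trace parts above; the episode format, a list of (feature vector, reward) pairs, and the function name are illustrative assumptions.

import numpy as np

def td_lambda_offline(episode, w, lam=0.5, gamma=1.0, alpha=0.5):
    # Offline backward-view TD(lambda): the weights w stay fixed for the
    # whole episode and the per-step updates are only accumulated.
    w = np.asarray(w, dtype=float)
    e = np.zeros_like(w)                 # accumulating eligibility trace
    total_update = np.zeros_like(w)
    for t, (x, r) in enumerate(episode):
        x = np.asarray(x, dtype=float)
        # Successor feature vector; the terminal state has value 0.
        x_next = (np.asarray(episode[t + 1][0], dtype=float)
                  if t + 1 < len(episode) else np.zeros_like(w))
        delta = r + gamma * w @ x_next - w @ x    # TD error
        e = gamma * lam * e + x                   # accumulate the trace
        total_update += alpha * delta * e         # offline: accumulate, do not apply
    return total_update

With feature vectors x(s) = [bell(s), light(s)] and initial weights w = [b, l] = [2, −2], the per-step terms alpha * delta * e give the sequence of updates to b and l, and their sum is the total update asked for.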
END OF PAPER