
Advanced Topics in Machine Learning, GI13, 2010/11


Answer any THREE questions. Each question is worth 20 marks. Use separate answer books for PART A and PART B. Gatsby PhD students only: answer either TWO questions from PART A and ONE question from PART B; or ONE question from PART A and TWO questions from PART B.

Marks for each part of each question are indicated in square brackets

Calculators are NOT permitted

Part A: Kernel Methods

Part B: Reinforcement Learning
1. Consider the following Markov Decision Process (MDP) with discount factor γ = 0.5. Upper case letters A, B, C represent states; arcs represent state transitions; lower case letters ab, ba, bc, ca, cb represent actions; signed integers represent rewards; and fractions represent transition probabilities.
[Figure: MDP diagram over states A, B, C; actions ab, ba, bc, ca, cb label the arcs, with rewards +2, −8, −2, +8, +4 and transition probabilities 1/4 and 3/4 marked on them.]
• Define the state-value function V^π(s) for a discounted MDP
[1 mark]

• Write down the Bellman expectation equation for state-value functions
[2 marks]
• Consider the uniform random policy π1(s, a) that takes all actions from state s with equal probability. Starting with an initial value function of V1(A) = V1(B) = V1(C) = 2, apply one synchronous iteration of iterative policy evaluation (i.e. one backup for each state) to compute a new value function V2(s)
[3 marks]

• Apply one iteration of greedy policy improvement to compute a new, deterministic policy π2(s)
[2 marks]

• Consider a deterministic policy π(s). Prove that if a new policy π′ is greedy with respect to V^π then it must be better than or equal to π, i.e. V^π′(s) ≥ V^π(s) for all s; and that if V^π′(s) = V^π(s) for all s then π′ must be an optimal policy.
[5 marks]

• Define the optimal state-value function V^*(s) for an MDP
[1 mark]

• Write down the Bellman optimality equation for state-value functions


[2 marks]

• Starting with an initial value function of V1(A) = V1(B) = V1(C) = 2, apply one synchronous iteration of value iteration (i.e. one backup for each state) to compute a new value function V2(s).
[3 marks]

• Is your new value function V2(s) optimal? Justify your answer.
[1 mark]

[Total 20 marks]
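A minimal Python sketch (for reference only, not part of the exam paper) of the two backup operations used in this question: one synchronous sweep of iterative policy evaluation under a stochastic policy, and one synchronous sweep of value iteration. The transition and reward entries in the dictionary below are placeholder assumptions, not a reconstruction of the MDP in the diagram above; only the state and action names and γ = 0.5 are taken from the question.

# Sketch: synchronous iterative policy evaluation and value iteration for a
# small tabular MDP.  The dynamics below are PLACEHOLDERS, not the MDP from
# the exam diagram.
#
# mdp[state][action] = list of (probability, next_state, reward)
mdp = {
    "A": {"ab": [(1.0, "B", -8)]},
    "B": {"ba": [(1.0, "A", +2)], "bc": [(1.0, "C", -2)]},
    "C": {"ca": [(0.25, "A", +8), (0.75, "A", +4)],
          "cb": [(1.0, "B", +4)]},
}
gamma = 0.5

def policy_evaluation_step(V, policy):
    # One synchronous backup of the Bellman expectation equation:
    # V(s) <- sum_a pi(a|s) sum_{s'} P(s'|s,a) [R + gamma V(s')]
    return {
        s: sum(policy[s][a] * sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
               for a, outs in acts.items())
        for s, acts in mdp.items()
    }

def value_iteration_step(V):
    # One synchronous backup of the Bellman optimality equation:
    # V(s) <- max_a sum_{s'} P(s'|s,a) [R + gamma V(s')]
    return {
        s: max(sum(p * (r + gamma * V[s2]) for p, s2, r in outs)
               for outs in acts.values())
        for s, acts in mdp.items()
    }

V1 = {"A": 2.0, "B": 2.0, "C": 2.0}
uniform = {s: {a: 1.0 / len(acts) for a in acts} for s, acts in mdp.items()}
print("one policy-evaluation backup:", policy_evaluation_step(V1, uniform))
print("one value-iteration backup:  ", value_iteration_step(V1))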

2. Consider an undiscounted Markov Reward Process with two states A and B. The transition
matrix and reward function are unknown, but you have observed two sample episodes:

A + 3 → A + 2 → B − 4 → A + 4 → B − 3 → terminate

B − 2 → A + 3 → B − 3 → terminate

In the above episodes, sample state transitions and sample rewards are shown at each step,
e.g. A + 3 → A indicates a transition from state A to state A, with a reward of +3.

• Using first-visit Monte-Carlo evaluation, estimate the state-value function V(A), V(B)

[2 marks]

• Using every-visit Monte-Carlo evaluation, estimate the state-value function V(A), V(B)

[2 marks]

• Draw a diagram of the Markov Reward Process that best explains these two episodes
(i.e. the model that maximises the likelihood of the data - although it is not necessary
to prove this fact). Show rewards and transition probabilities on your diagram.
[4 marks]

• Define the Bellman equation for a Markov reward process


[2 marks]

• Solve the Bellman equation to give the true state-value function V(A), V(B). Hint: solve the Bellman equations directly, rather than iteratively.
[4 marks]

• What value function would batch TD(0) find, i.e. if TD(0) was applied repeatedly
to these two episodes?
[2 marks]

• What value function would batch TD(1) find, using accumulating eligibility traces?

[2 marks]

• What value function would LSTD(0) find?


[2 marks]

[Total 20 marks]
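A minimal Python sketch (for reference only, not part of the exam paper) of first-visit and every-visit Monte-Carlo evaluation applied to the two undiscounted episodes above. Encoding each episode as a list of (state, reward) pairs is an assumption made for this illustration; the sample data itself is copied from the question.

# Sketch: first-visit vs every-visit Monte-Carlo evaluation on the two
# observed episodes (undiscounted).  ("A", 3) means the process was in
# state A and a reward of +3 was received on the transition out of A.
episodes = [
    [("A", 3), ("A", 2), ("B", -4), ("A", 4), ("B", -3)],
    [("B", -2), ("A", 3), ("B", -3)],
]

def monte_carlo(episodes, first_visit):
    returns = {}                          # state -> list of sampled returns
    for episode in episodes:
        visited = set()
        for t, (s, _) in enumerate(episode):
            if first_visit and s in visited:
                continue                  # first-visit: count s once per episode
            visited.add(s)
            G = sum(r for _, r in episode[t:])   # undiscounted return from time t
            returns.setdefault(s, []).append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print("first-visit MC:", monte_carlo(episodes, first_visit=True))
print("every-visit MC:", monte_carlo(episodes, first_visit=False))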



3. A rat is involved in an experiment. It experiences one episode. At the first step it hears
a bell. At the second step it sees a light. At the third step it both hears a bell and sees
a light. It then receives some food, worth +1 reward, and the episode terminates on the
fourth step. All other rewards were zero. The experiment is undiscounted.

• Represent the rat’s state s by a vector of two binary features, bell(s) ∈ {0, 1} and
light(s) ∈ {0, 1}. Write down the sequence of feature vectors corresponding to this
episode.
[3 marks]

• Approximate the state-value function by a linear combination of these features with two parameters: b · bell(s) + l · light(s). If b = 2 and l = −2 then write down the sequence of approximate values corresponding to this episode.
[3 marks]

• Define the λ-return v_t^λ
[1 mark]

• Write down the sequence of λ-returns v_t^λ corresponding to this episode, for λ = 0.5 and b = 2, l = −2
[3 marks]

• Using the forward-view TD(λ) algorithm and your linear function approximator, what is the sequence of updates to weight b? What is the total update to weight b? Use λ = 0.5, γ = 1, α = 0.5 and start with b = 2, l = −2
[3 marks]

• Define the TD(λ) accumulating eligibility trace e_t when using linear value function approximation
[1 mark]

• Write down the sequence of eligibility traces e_t corresponding to the bell, using λ = 0.5, γ = 1
[3 marks]

• Using the backward-view TD(λ) algorithm and your linear function approximator, what is the sequence of updates to weight b? (Use offline updates, i.e. do not actually change your weights, just accumulate your updates.) What is the total update to weight b? Use λ = 0.5, γ = 1, α = 0.5 and start with b = 2, l = −2
[3 marks]

Δb_1 = α δ_1 e_1 = 0.5 × (0 + (−2) − 2) × 1 = −2
Δb_2 = α δ_2 e_2 = 0.5 × (0 + 0 − (−2)) × 1/2 = 1/2
Δb_3 = α δ_3 e_3 = 0.5 × (1 + 0 − 0) × 5/4 = 5/8
Σ Δb = −2 + 1/2 + 5/8 = −7/8


[Total 20 marks]
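A minimal Python sketch (for reference only, not part of the exam paper) of the backward-view TD(λ) computation in the final part of this question, using accumulating eligibility traces, the linear approximator V(s) = b·bell(s) + l·light(s), and offline updates. The encoding of the episode as feature vectors and the convention that the terminal state has value zero are assumptions of this sketch; with λ = 0.5, γ = 1, α = 0.5, b = 2, l = −2 it reproduces the Δb updates listed above.

# Sketch: backward-view TD(lambda) with accumulating traces and a linear
# value function V(s) = b*bell(s) + l*light(s), applied offline (weights
# are held fixed; updates are only accumulated).
lam, gamma, alpha = 0.5, 1.0, 0.5
w = [2.0, -2.0]                                  # [b, l], held fixed

# One feature vector [bell, light] per step, with the reward received on
# leaving that step; the episode terminates after the third step.
steps = [([1, 0], 0.0), ([0, 1], 0.0), ([1, 1], 1.0)]

def value(x, w):
    return sum(xi * wi for xi, wi in zip(x, w))

e = [0.0, 0.0]                                   # accumulating eligibility trace
total = [0.0, 0.0]                               # accumulated offline updates
for t, (x, r) in enumerate(steps):
    x_next = steps[t + 1][0] if t + 1 < len(steps) else None
    v_next = value(x_next, w) if x_next is not None else 0.0   # terminal value 0
    delta = r + gamma * v_next - value(x, w)                   # TD error
    e = [gamma * lam * ei + xi for ei, xi in zip(e, x)]        # e <- gamma*lam*e + x
    update = [alpha * delta * ei for ei in e]
    total = [u + du for u, du in zip(total, update)]
    print(f"step {t + 1}: delta = {delta:+g}, e_bell = {e[0]:g}, "
          f"update to b = {update[0]:+g}")
print("total update to b:", total[0])            # sums to -7/8, as shown above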

END OF PAPER
