
AI 3000 / CS5500 : Reinforcement Learning

Exam № 1
Due Date : 23/10/2021, 3.00 PM

Easwar Subramanian, IIT Hyderabad 23/10/2021

Instructions

Read all the instructions below carefully before you start answering the questions.

• Please include your institute roll number on the first page of the answer sheet.

• This is an open-book exam.

• Seeking help from other individuals (including classmates) is not allowed.

• Plagiarism in answers will be dealt with strictly.

• The exam has 5 problems for a total of 70 points.

• All answers should include suitable justification; otherwise, no marks will be awarded.

• The estimated amount of work for this exam is about three hours.

• Please submit the answer sheets by 3:00pm IST, Saturday October 23, 2021.

• Submit your answer-sheet as a private post to me and the Instructors on Piazza under
the midterm-exam tab.

Problem 1 : Markov Reward Process

Consider the following snakes and ladders game, as depicted in the figure below.

• The initial state is S, and a fair four-sided die is used to decide the next state at each time step

• The player must land exactly on state W to win

• Die throws that would take the player beyond state W leave the state unchanged

(a) Identify the states and the transition matrix of this Markov process. (1 Point)

(b) Construct a suitable reward function and discount factor, and use the Bellman equation for the
Markov reward process to compute how long it takes "on average" (the expected number
of die throws) to reach state W from any other state. (7 Points)
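
For illustration only (and not part of the required answer), the expected number of throws in part (b) can be obtained numerically by solving the Bellman evaluation equation of the induced Markov reward process, using a reward of 1 per die throw and γ = 1. The matrix Q below is a hypothetical placeholder, since the actual board layout is given by the figure; replace it with the sub-stochastic matrix of transitions among the non-terminal states identified in part (a).

```python
import numpy as np

# Hypothetical placeholder: Q[i, j] is the probability of moving from
# non-terminal state i to non-terminal state j; the missing mass in each
# row is the probability of reaching the terminal state W.  Replace Q and
# the state labels with the ones identified in part (a).
states = ["S", "X1", "X2"]
Q = np.array([
    [0.25, 0.25, 0.25],   # from S   (remaining 0.25 goes to W)
    [0.00, 0.25, 0.50],   # from X1  (remaining 0.25 goes to W)
    [0.00, 0.00, 0.75],   # from X2  (remaining 0.25 goes to W)
])

# With reward 1 per die throw and gamma = 1, the Bellman evaluation equation
# v = 1 + Q v gives the expected number of throws to reach W, i.e.
# v = (I - Q)^{-1} 1.
v = np.linalg.solve(np.eye(len(states)) - Q, np.ones(len(states)))
print(dict(zip(states, np.round(v, 3))))
```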

Problem 2 : Bellman Equations and Dynamic Programming

(a) Consider an MDP M = ⟨S, A, P, R, γ⟩ where the reward function has the structure

R(s, a) = R_1(s, a) + R_2(s, a).

Suppose we are given action value functions Q^π_1 and Q^π_2, for a given policy π, corresponding
to reward functions R_1 and R_2, respectively. Explain whether it is possible to combine these
action value functions in a simple manner to compute the action value function Q^π
corresponding to the composite reward function R. (4 Points)

(b) Let M = ⟨S, A, P, R, γ⟩ be an MDP with finite state and action spaces. We further assume
that the reward function R is a deterministic function of the current state s ∈ S and action
a ∈ A. Let f and g be two arbitrary action value functions mapping a state-action pair of the
MDP to a real number, i.e. f, g : S × A → ℝ. Let L denote the Bellman optimality operator
(for the action value function) given by

(Lf)(s, a) = R(s, a) + γ ⟨P(s, a), V_f⟩ = R(s, a) + γ Σ_{s′} P(s′ | s, a) V_f(s′),

where V_f(s) = max_a f(s, a). Prove that

‖Lf − Lg‖∞ ≤ γ ‖f − g‖∞.

(6 Points)
[Note: The Bellman optimality operator defined above is for action value functions and is
different from the one defined in the lectures, which is for value functions. Think of
V_f as a transformation that turns a vector f ∈ ℝ^{|S||A|} into a vector of length |S|.
The max norm of an action value function f is defined as ‖f‖∞ = max_s max_a |f(s, a)|.]
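
For illustration only (and not part of the required proof), the sketch below applies the operator L to two arbitrary action value functions on a small randomly generated MDP and checks numerically that ‖Lf − Lg‖∞ ≤ γ‖f − g‖∞. All quantities here (state and action counts, γ, P, R, f, g) are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Placeholder MDP: P[s, a] is a distribution over next states,
# R[s, a] is a deterministic reward.
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))

def bellman_optimality(f):
    """(Lf)(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * V_f(s'),
    with V_f(s) = max_a f(s, a)."""
    v_f = f.max(axis=1)              # V_f, a vector of length |S|
    return R + gamma * P @ v_f       # array of shape (|S|, |A|)

f = rng.random((n_states, n_actions))
g = rng.random((n_states, n_actions))

lhs = np.abs(bellman_optimality(f) - bellman_optimality(g)).max()  # ||Lf - Lg||_inf
rhs = gamma * np.abs(f - g).max()                                  # gamma * ||f - g||_inf
print(f"||Lf - Lg||_inf = {lhs:.4f},  gamma * ||f - g||_inf = {rhs:.4f},  holds: {lhs <= rhs}")
```

Such a numerical check is of course not a proof, but it can help catch errors in how the operator or the norm has been written down.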

Problem 3 : Monte Carlo Methods

Consider a Markov process with two states S = {S, A}, with transition probabilities as shown in
the table below, where p ∈ (0, 1) is a non-zero probability. To generate an MRP from this Markov
chain, assume that the rewards for being in states S and A are 1 and 0, respectively. In addition,
let the discount factor of the MRP be γ = 1.

From \ To      S        A
S            1 − p      p
A              0        1

(a) Provide a generic form for a typical trajectory starting at state S. (1 Point)

(b) Estimate V(S) using first-visit MC. (2 Points)

(c) Estimate V(S) using every-visit MC. (2 Points)

(d) What is the true value of V(S)? (3 Points)

(e) Explain whether the every-visit MC estimate is biased. (2 Points)

(f) In general, for an MRP, comment on the convergence properties of the first-visit MC and every-visit
MC algorithms. (2 Points)
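
For illustration only (and not part of the required answers), the following sketch simulates this MRP for an assumed value of p and computes the first-visit and every-visit MC estimates of V(S). It assumes the reward of 1 is collected at every time step the chain spends in S, and that an episode can be truncated once the chain enters the absorbing state A, since all subsequent rewards are zero.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.25              # assumed value of p for this simulation; any p in (0, 1) works
n_episodes = 10_000

first_visit_returns, every_visit_returns = [], []

for _ in range(n_episodes):
    # One trajectory: stay in S with probability 1 - p, jump to A with probability p.
    # A is absorbing with reward 0, so the episode is truncated on reaching A.
    steps_in_S = 1
    while rng.random() >= p:
        steps_in_S += 1

    # With gamma = 1, the return observed from the k-th visit to S is the number
    # of time steps spent in S from that visit onwards.
    returns_per_visit = list(range(steps_in_S, 0, -1))
    first_visit_returns.append(returns_per_visit[0])   # return from the first visit only
    every_visit_returns.extend(returns_per_visit)      # returns from every visit

print("first-visit MC estimate of V(S):", np.mean(first_visit_returns))
print("every-visit MC estimate of V(S):", np.mean(every_visit_returns))
```

Comparing the two printed estimates for a few values of p may be a useful sanity check for parts (b)-(f).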

Problem 4 : Problem Formulation

Consider a SUBWAY outlet in your locality. Customers arrive at the store at times governed
by an unknown probability distribution. The outlet sells sandwiches with a certain type of bread
(choice of 4 types) and filling (choice of 5 types). If a customer cannot get the desired sandwich,
he/she will not visit the store again. Ingredients need to be discarded 3 days after
purchase. The store owner wants to figure out a policy for buying ingredients in such a way
as to maximize his long-term profit using reinforcement learning. To this end, we will formulate
the problem as an MDP. You are free to make other assumptions regarding the problem setting.
Please enumerate your assumptions while answering the questions below.

(a) Suggest a suitable state and action space for the MDP. (5 Points)

(b) Devise an appropriate reward function for the MDP. (3 Points)

(c) Would you use a discounted or an undiscounted setting in your MDP formulation? Justify your
answer. (3 Points)

(d) Would you use dynamic programming or reinforcement learning to solve the problem?
Explain with reasons. (3 Points)

(e) Between MC and TD methods, which would you use for learning? Why? (3 Points)

(f) Is function approximation required to solve this problem? Why or why not? (3 Points)

Problem 5 : Miscellaneous Questions

(a) What algorithm results if, in the TD(λ) algorithm, we set λ = 1? (1 Point)

(b) What are possible reasons to study the TD(λ) method over the TD(0) method? (2 Points)

(c) Given an MDP, does scaling the rewards by a positive scale factor change the optimal
policy? (3 Points)

(d) In off-policy evaluation, would it be beneficial to have the behaviour policy be deterministic
and the target policy be stochastic? (2 Points)

(e) Under what conditions do temporal difference methods for policy evaluation converge to the true value
of the policy π? Explain intuitively the reasoning behind those conditions. (3 Points)

(f) Why do MC methods for policy evaluation yield an unbiased estimate of the true value of
the policy? (2 Points)

(g) Let M = ⟨S, A, P, R, γ⟩ be an MDP with finite state and action spaces. We further assume
that the reward function R is non-negative for all state-action pairs. In addition, suppose that
for every state s ∈ S, there is some action a_s such that P(s | s, a_s) ≥ p for some p ∈ [0, 1].
We intend to find the optimal value function V* using value iteration. Initialize V_0(s) = 0 for all
states of the MDP, and let V_t(s) denote the value of state s after t iterations. Prove that for all
states s and t ≥ 0, V_{t+1}(s) ≥ pγ V_t(s). (4 Points)

(h) Explain whether it is possible to parallelize the value iteration algorithm. (3 Points)
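
For illustration only (and not part of the required answers), the sketch below runs the synchronous value-iteration update V_{t+1}(s) = max_a [R(s, a) + γ Σ_{s′} P(s′|s, a) V_t(s′)] referred to in part (g) on a placeholder random MDP with non-negative rewards; the MDP sizes, γ and the stopping tolerance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Placeholder random MDP with non-negative rewards, as assumed in part (g).
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))

# Synchronous value iteration: V_0(s) = 0 and
# V_{t+1}(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) V_t(s') ].
V = np.zeros(n_states)
for t in range(1000):
    V_new = (R + gamma * P @ V).max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:    # stop once successive iterates agree
        V = V_new
        break
    V = V_new

print(f"approximate V* after {t + 1} iterations:", np.round(V, 4))
```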

ALL THE BEST

