AI 3000 / CS5500: Reinforcement Learning Exam 1: Instructions
Exam № 1
Due Date : 23/10/2021, 3:00 PM
Instructions
Read all the instructions below carefully before you start answering the questions.
• Please include your institute roll number on the first page of the answer sheet.
• All answers should include suitable justification; otherwise no marks will be awarded.
• The estimated amount of work for this exam is about three hours.
• Please submit the answer sheets by 3:00 PM IST, Saturday, October 23, 2021.
• Submit your answer sheet as a private post to me and the Instructors on Piazza under
the midterm-exam tab.
Problem 1 : Markov Reward Process
Consider the snakes and ladders game depicted in the figure below.
• The initial state is S and a fair four-sided die is used to decide the next state at each time step.
• Die throws that would take you beyond state W leave the state unchanged.
(a) Identify the states and the transition matrix of this Markov process. (1 Point)
(b) Construct a suitable reward function and discount factor, and use the Bellman equation for the
Markov reward process to compute how long it takes "on average" (the expected number
of die throws) to reach state W from any other state. (7 Points)
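The exact transition probabilities for part (b) depend on the board in the figure, but the computation itself reduces to a small linear solve: with a reward of 1 per throw, γ = 1, and W treated as terminal, the expected number of throws h satisfies h = 1 + Qh over the non-terminal states. The sketch below illustrates this on a hypothetical 3-state restriction Q; the matrix entries and the helper name expected_throws are assumptions for illustration, not the actual game.

```python
import numpy as np

def expected_throws(Q):
    """Expected number of throws to reach the absorbing state W.

    Q is the transition matrix restricted to the non-terminal states
    (rows/columns exclude W). With reward 1 per throw and gamma = 1,
    the Bellman equation h = 1 + Q h gives h = (I - Q)^{-1} 1.
    """
    n = Q.shape[0]
    return np.linalg.solve(np.eye(n) - Q, np.ones(n))

# Hypothetical 3 non-terminal states feeding into W; the real matrix must be
# read off from the snakes-and-ladders board in the figure.
Q = np.array([
    [0.25, 0.50, 0.00],   # from the first state (remaining mass goes to W)
    [0.00, 0.25, 0.50],   # from the second state
    [0.00, 0.00, 0.75],   # throws past W leave the state unchanged
])
print(expected_throws(Q))  # expected number of throws from each non-terminal state
```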
Problem 2 : Markov Decision Processes
(a) Consider an MDP M = ⟨S, A, P, R, γ⟩ where the reward function has the structure
R(s, a) = R1(s, a) + R2(s, a).
Suppose we are given action value functions Qπ1 and Qπ2, for a given policy π, corresponding
to reward functions R1 and R2, respectively. Explain whether it is possible to combine these
action value functions in a simple manner to compute the action value function Qπ
corresponding to the composite reward function R. (4 Points)
(b) Let M = ⟨S, A, P, R, γ⟩ be an MDP with finite state and action spaces. We further assume
that the reward function R is a deterministic function of the current state s ∈ S and action
a ∈ A. Let f and g be two arbitrary action value functions mapping a state-action pair of the
MDP to a real number, i.e. f, g : S × A → R. Let L denote the Bellman optimality operator
(for the action value function) given by,
(Lf)(s, a) = R(s, a) + γ Σ_{s′ ∈ S} P(s′ | s, a) V_f(s′), where V_f(s) = max_{a′ ∈ A} f(s, a′).
Show that L is a contraction with respect to the max norm, i.e. ‖Lf − Lg‖∞ ≤ γ ‖f − g‖∞.
(6 Points)
[ Note : The Bellman optimality operator defined above is for action value functions and is
different from the one defined in the lectures, which is for value functions. Think of
V_f(s) as a transformation that turns a vector f ∈ R^{|S||A|} into a vector of length |S|.
The max norm of an action value function f is defined as ‖f‖∞ = max_s max_a |f(s, a)|. ]
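As a numerical sanity check on the definition (not a substitute for the proof asked for in part (b)), the sketch below applies the operator L to two random action value functions on a randomly generated finite MDP and compares ‖Lf − Lg‖∞ with γ‖f − g‖∞. The MDP, the seed, and all variable names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9

# Random finite MDP: P[s, a, s'] is a transition kernel, R[s, a] a deterministic reward.
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))

def L(f):
    """Bellman optimality operator for action value functions:
    (Lf)(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) * max_a' f(s', a')."""
    Vf = f.max(axis=1)          # V_f(s') = max_a' f(s', a')
    return R + gamma * P @ Vf   # result has shape (nS, nA)

f = rng.normal(size=(nS, nA))
g = rng.normal(size=(nS, nA))

lhs = np.abs(L(f) - L(g)).max()    # ||Lf - Lg||_inf
rhs = gamma * np.abs(f - g).max()  # gamma * ||f - g||_inf
print(lhs <= rhs + 1e-12, lhs, rhs)
```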
Problem 3 : Monte Carlo Methods
Consider a Markov process with two states S = {S, A} with transition probabilities as shown in
the table below, where p ∈ (0, 1) is a non-zero probability. To generate an MRP from this Markov
chain, assume that the rewards for being in states S and A are 1 and 0, respectively. In addition,
let the discount factor of the MRP be γ = 1.
From \ To    S        A
S            1 − p    p
A            0        1
(a) Provide a generic form for a typical trajectory starting at state S. (1 Point)
(f) In general, for an MRP, comment on the convergence properties of the first-visit MC and
every-visit MC algorithms. (2 Points)
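For intuition about the two estimators on this particular chain, the sketch below samples trajectories started at S and forms the first-visit and every-visit Monte Carlo estimates of V(S) with γ = 1, treating the episode as ending once the chain is absorbed in A (an assumption, since all rewards from A onwards are 0). The function names and episode count are illustrative.

```python
import random

def sample_rewards(p, rng):
    """Rewards along one trajectory started at S: a reward of 1 is received
    while in S; the episode ends once the chain is absorbed in A."""
    rewards = []
    while True:
        rewards.append(1.0)        # reward for being in state S
        if rng.random() < p:       # transition S -> A (absorbing)
            return rewards

def mc_estimates(p, n_episodes=20000, seed=0):
    """First-visit and every-visit Monte Carlo estimates of V(S) with gamma = 1."""
    rng = random.Random(seed)
    first_visit, every_visit = [], []
    for _ in range(n_episodes):
        rewards = sample_rewards(p, rng)
        # With gamma = 1, the return from time t is the sum of the later rewards.
        returns = [sum(rewards[t:]) for t in range(len(rewards))]
        first_visit.append(returns[0])   # return from the first visit to S only
        every_visit.extend(returns)      # returns from every visit to S
    return sum(first_visit) / len(first_visit), sum(every_visit) / len(every_visit)

print(mc_estimates(p=0.25))  # both estimates approach 1/p as the number of episodes grows
```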
Problem 4 : MDP Formulation
Consider a SUBWAY outlet in your locality. Customers arrive at the store at times governed
by an unknown probability distribution. The outlet sells sandwiches with a certain type of bread
(choice of 4 types) and filling (choice of 5 types). If a customer cannot get the desired sandwich,
he/she will not visit the store again. Ingredients need to be discarded 3 days after
purchase. The store owner wants to figure out a policy for buying ingredients in such a way
as to maximize his long-term profit using reinforcement learning. To this end, we will formulate
the problem as an MDP. You are free to make other assumptions regarding the problem setting.
Please enumerate your assumptions while answering the questions below.
(a) Suggest a suitable state and action space for the MDP. (5 Points)
(c) Would you use a discounted or undiscounted setting in your MDP formulation? Justify your
answer. (3 Points)
(d) Would you use dynamic programming or reinforcement learning to solve the problem?
Explain with reasons. (3 Points)
(e) Between MC and TD methods, which would you use for learning? Why? (3 Points)
(f) Is function approximation required to solve this problem? Why or why not? (3 Points)
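Many encodings are possible for part (a); purely as one illustrative sketch of the kind of assumptions the question asks you to enumerate, the containers below track inventory per ingredient broken down by remaining shelf life, plus a coarse time feature, and treat an action as an order quantity per ingredient. Every field, name, and the ingredient catalogue here is a hypothetical assumption, not part of the problem statement.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

# Hypothetical ingredient catalogue (the problem only states 4 breads and 5 fillings).
BREADS = ("bread_1", "bread_2", "bread_3", "bread_4")
FILLINGS = ("filling_1", "filling_2", "filling_3", "filling_4", "filling_5")

@dataclass
class State:
    """One possible state encoding: stock of each ingredient broken down by
    remaining shelf life (1-3 days), plus the day of the week."""
    stock: Dict[Tuple[str, int], int] = field(default_factory=dict)  # (ingredient, days_left) -> units
    day_of_week: int = 0                                             # 0-6, demand may vary by day

@dataclass
class Action:
    """One possible action encoding: units of each ingredient to buy today."""
    order: Dict[str, int] = field(default_factory=dict)              # ingredient -> units purchased
```

Tracking units by remaining shelf life is one way to reflect the 3-day spoilage constraint in the state; other choices are equally defensible provided the assumptions are stated.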
Problem 5 : Miscellaneous Questions
(a) What algorithm results if, in the TD(λ) algorithm, we set λ = 1? (1 Point)
(b) What are the possible reasons to study TD(λ) over the TD(0) method? (2 Points)
(c) Given an MDP, does scaling the rewards by a positive scale factor change the optimal
policy? (3 Points)
(d) In off-policy evaluation, would it be beneficial to have the behaviour policy be deterministic
and the target policy be stochastic? (2 Points)
(e) Under what conditions do temporal difference methods for policy evaluation converge to the true value
of the policy π? Explain intuitively the reasoning behind those conditions. (3 Points)
(f) Why do MC methods for policy evaluation yield an unbiased estimate of the true value of
the policy? (2 Points)
(g) Let M = ⟨S, A, P, R, γ⟩ be an MDP with finite state and action spaces. We further assume
that the reward function R is non-negative for all state-action pairs. In addition, suppose that
for every state s ∈ S, there is some action a_s such that P(s′ | s, a_s) ≥ p for some p ∈ [0, 1].
We intend to find the optimal value function V∗ using value iteration. Initialize V_0(s) = 0 for all
states of the MDP and let V_t(s) denote the value of state s after t iterations. Prove that for all
states s and t ≥ 0, V_{t+1}(s) ≥ pγ V_t(s). (4 Points)
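As a numerical illustration of the claimed bound (not a proof), the sketch below runs value iteration on a small random MDP with non-negative rewards in which action 0 keeps each state in place with probability at least p, which is how this sketch reads the condition P(s′ | s, a_s) ≥ p; that reading, the random construction, and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma, p = 5, 3, 0.9, 0.3

# Random MDP with non-negative rewards.
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA))                      # R(s, a) >= 0 for all (s, a)

# Make action 0 in every state keep the agent in place with probability at least p.
for s in range(nS):
    P[s, 0] = (1 - p) * P[s, 0]
    P[s, 0, s] += p

V = np.zeros(nS)                              # V_0(s) = 0 for all states
for t in range(50):
    V_next = (R + gamma * P @ V).max(axis=1)  # value iteration update
    assert np.all(V_next >= p * gamma * V - 1e-12), "bound violated"
    V = V_next
print("V_{t+1}(s) >= p * gamma * V_t(s) held at every iteration")
```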