AI 3000 / CS 5500: Reinforcement Learning
Assignment № 1
Due Date : 27/09/2021
Teaching Assistants : Shantam Gulati and Megha Gupta
Problem 1 : Markov Reward Process
Consider a fair four-sided dice with faces marked as {'1', '2', '3', '4'}. The dice is tossed repeatedly and independently. By formulating a suitable Markov reward process (MRP) and using the Bellman equation for MRPs, find the expected number of tosses required for the pattern '1234' to appear. Specifically, answer the following questions.
(a) Identify the states, transition probabilities and terminal states (if any) of the MRP. (3 Points)
(b) Construct a suitable reward function and discount factor, and use the Bellman equation for MRPs to find the 'average' number of tosses required for the pattern '1234' to appear. (7 Points)
[Explanation : For the target pattern to occur, four consecutive tosses of the dice should result in different faces coming up on top, in the specific order '1', '2', '3' and '4'.]
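The Bellman equation of the MRP constructed in part (b) can also be solved numerically as a sanity check on the analytical answer. The following sketch is only an illustration under assumed modelling choices (NumPy, states that track the longest matched prefix of the pattern, a reward of 1 per toss and γ = 1); it is not the only valid formulation.

```python
import numpy as np

PATTERN = "1234"            # target face sequence
FACES = "1234"              # faces of the fair four-sided dice
P_FACE = 1.0 / len(FACES)

def next_state(matched, face):
    """Longest prefix of PATTERN that is a suffix of the tosses seen so far."""
    s = PATTERN[:matched] + face
    for k in range(len(s), 0, -1):
        if PATTERN.startswith(s[-k:]):
            return k
    return 0

n = len(PATTERN)                          # states 0..n; state n (full match) is terminal
P = np.zeros((n + 1, n + 1))
for i in range(n):
    for face in FACES:
        P[i, next_state(i, face)] += P_FACE
P[n, n] = 1.0                             # absorbing terminal state

# Bellman equation for the MRP with reward 1 per toss and gamma = 1:
# v = r + P v on non-terminal states, with v(terminal) = 0.
r = np.ones(n)
v = np.linalg.solve(np.eye(n) - P[:n, :n], r)
print(v[0])                               # expected number of tosses from the start state
```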
Problem 2 : A Dice Game as a Finite Horizon MDP
Consider a dice game in which a player is eligible for a reward equal to 3x² + 5, where x is the value of the face of the dice that comes up on top. A player is allowed to roll the dice at most N times. At every time step, after having observed the outcome of the dice roll, the player can either collect the eligible reward and quit the game, or roll the dice one more time with no immediate reward. If the player has not stopped earlier, then at the terminal time N the game ends and the player receives the reward corresponding to the outcome of the dice roll at time N.
The goal of this problem is to model the game as an MDP and formulate a policy that helps the player decide, at any time step n < N, whether to continue or quit the game. As a specific case, let's consider a fair four-sided dice for this game. It then follows that one can model the game as a finite horizon MDP (with horizon N) consisting of four states S = {1, 2, 3, 4} and two actions A = {Continue, Quit}. One can assume that the discount factor (γ) is 1. For any n ≤ N, denote by V_n(s) and Q_n(s, a) the state and action value functions for state s and action a at time step n.
[Hint : A finite horizon MDP is solved backwards in time. One first computes the value of each state at the terminal time and then uses it to compute the value of each state at intermediate times. Note that the value of a state at any intermediate time is equal to the best action value achievable in that state at that time. The best action value for a state, at any time, is evaluated by considering all possible actions from that state at that time.]
(a) Evaluate the value function V_N(s) for each state s of the MDP. (1 Point)
(b) Compute Q_{N−1}(s, a) for each state-action pair of the MDP. (2 Points)
(c) Evaluate the value function V_{N−1}(s) for each state s of the MDP. (1 Point)
(d) For any time 2 < n ≤ N, express V_{n−1}(s) recursively in terms of V_n(s). (2 Points)
(e) For any time 2 < n ≤ N, express Q_{n−1}(s, "Continue") in terms of Q_n(s, "Continue"). (2 Points)
(f) What is the optimal policy at any time n that lets a player decide whether to continue or quit based on the current state s? (2 Points)
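The backward recursion described in the hint above can be prototyped in a few lines. The sketch below is only an illustration of parts (a)-(f), assuming NumPy and an arbitrarily chosen horizon N = 5 for demonstration; it does not replace the closed-form expressions asked for above.

```python
import numpy as np

N = 5                                   # horizon (hypothetical value, for illustration only)
faces = np.array([1, 2, 3, 4])
reward = 3 * faces**2 + 5               # eligible reward 3x^2 + 5 for face x

V = np.zeros((N + 1, 4))                # V[n, s-1] = value at time n when the top face is s
V[N] = reward                           # at terminal time N the player collects the reward
for n in range(N - 1, 0, -1):           # backward induction; V[0] is unused
    q_quit = reward                               # Quit: take the eligible reward now
    q_continue = np.full(4, V[n + 1].mean())      # Continue: fair dice, no immediate reward
    V[n] = np.maximum(q_quit, q_continue)

print(V[1])                             # values at the first decision epoch
```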
Problem 3 : MDPs with Modified Rewards
Let M be an MDP given by ⟨S, A, P, R, γ⟩ with |S| < ∞, |A| < ∞ and γ ∈ [0, 1). Let M̂ = ⟨S, A, P, R̂, γ⟩ be another MDP with a modified reward function R̂ such that
R(s, a, s′) − R̂(s, a, s′) = ε.
Given a policy π, let V^π and V̂^π be the value functions under policy π for the MDPs M and M̂ respectively.
(a) Derive an expression that relates V^π(s) to V̂^π(s) for any state s ∈ S of the MDP. (5 Points)
(b) Derive an expression that relates the optimal value functions V∗ and V̂∗. (3 Points)
(c) Will M and M̂ have the same optimal policy? Explain briefly. (2 Points)
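Before attempting the derivations, the relationship can be observed empirically. The sketch below is an illustration only, assuming NumPy and an arbitrarily chosen small random MDP and deterministic policy; it evaluates the policy exactly under R and under R̂ = R − ε and prints the gap between the two value functions.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, nA, gamma, eps = 3, 2, 0.9, 0.5      # small illustrative MDP, not from the assignment

P = rng.random((nS, nA, nS)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((nS, nA, nS))
pi = rng.integers(nA, size=nS)            # an arbitrary deterministic policy

def policy_value(rewards):
    """Solve V = R_pi + gamma * P_pi V exactly for the fixed policy pi."""
    P_pi = P[np.arange(nS), pi]                                       # nS x nS
    R_pi = (P[np.arange(nS), pi] * rewards[np.arange(nS), pi]).sum(axis=1)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)

V = policy_value(R)
V_hat = policy_value(R - eps)             # R - R_hat = eps for every (s, a, s')
print(V - V_hat)                          # the gap is the same constant in every state
```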
Problem 4 : Grid World
Consider the grid world problem shown in Figure 1. The grid has two terminal states with positive payoff (+1 and +10). The bottom row is a cliff, where each state is a terminal state with negative payoff (-10). The greyed squares in the grid are walls. The agent starts from the yellow state S. As usual, the agent has four actions A = {Left, Right, Up, Down} to choose from in any non-terminal state, and actions that would take the agent off the grid leave the state unchanged. Notice that, if the agent follows the dashed path, it needs to be careful not to step into any terminal state in the bottom row that has negative payoff. There are four possible (optimal) paths that an agent can take.
Figure 1: Modified Grid World
• Prefer the close exit (state with reward +1) but risk the cliff (dashed path to +1)
• Prefer the distant exit (state with reward +10) but risk the cliff (dashed path to +10)
• Prefer the close exit (state with reward +1) by avoiding the cliff (solid path to +1)
• Prefer the distant exit (state with reward +10) by avoiding the cliff (solid path to +10)
There are two free parameters in this problem. One is the discount factor γ and the other is the noise factor (η) of the environment. Noise makes the environment stochastic. For example, a noise of 0.2 means that the agent's chosen action succeeds only 80% of the time; the remaining 20% of the time, the agent may end up in an unintended state after having chosen an action.
(a) Identify which values of γ and η lead to each of the optimal paths listed above, with reasoning. If necessary, you could implement the value iteration algorithm on this environment and observe the optimal paths for various choices of γ and η. (10 Points)
[Hint : For the discount factor, try high and low γ values such as 0.9 and 0.1 respectively. For noise, consider deterministic and stochastic environments with noise level η equal to 0 or 0.5 respectively.]
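If you do implement value iteration, the main modelling question is how the noise η enters the transition probabilities. The sketch below is a hypothetical stand-in, not the grid of Figure 1: a four-state corridor with a +1 exit on the left and a +10 exit on the right, using one simple convention for noise (the chosen action succeeds with probability 1 − η, otherwise the opposite action is executed). It only illustrates how γ and η shape the greedy policy.

```python
import numpy as np

GAMMA, ETA = 0.9, 0.2                   # try GAMMA in {0.1, 0.9} and ETA in {0.0, 0.5}
N_S, TERM = 4, -1                       # TERM marks the end of an episode

def step(s, a):
    """Deterministic move: a = 0 goes left, a = 1 goes right. Returns (s', reward)."""
    if a == 0:
        return (TERM, 1.0) if s == 0 else (s - 1, 0.0)
    return (TERM, 10.0) if s == N_S - 1 else (s + 1, 0.0)

def transitions(s, a, eta):
    """Chosen action succeeds with prob 1 - eta; otherwise the opposite action happens."""
    return [(1.0 - eta, *step(s, a)), (eta, *step(s, 1 - a))]

def q_value(V, s, a):
    return sum(p * (r + GAMMA * (0.0 if s2 == TERM else V[s2]))
               for p, s2, r in transitions(s, a, ETA))

V = np.zeros(N_S)
for _ in range(500):                    # value iteration (synchronous updates)
    V = np.array([max(q_value(V, s, a) for a in range(2)) for s in range(N_S)])

policy = [max(range(2), key=lambda a: q_value(V, s, a)) for s in range(N_S)]
print(V, policy)                        # rerun with different GAMMA and ETA and compare
```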
Problem 5 : Iterative Policy Evaluation
Let M be an MDP given by ⟨S, A, P, R, γ⟩ with |S| < ∞, |A| < ∞ and γ ∈ [0, 1). We are given a policy π and the task is to evaluate V^π(s) for every state s ∈ S of the MDP. To this end, we use the iterative policy evaluation algorithm. It is the analog of the algorithm described in slide 9 of Lecture 6 for the policy evaluation case. We start the iterative policy evaluation algorithm with an initial guess V_1 and let V_{k+1} be the (k+1)-th iterate of the value function corresponding to policy π. Our constraint on compute infrastructure does not allow us to wait for the successive iterates of the value function to converge to the true value function V^π given by V^π = (I − γP)^{−1}R. Instead, we let the algorithm terminate at time step k + 1 when the distance between successive iterates satisfies ‖V_{k+1} − V_k‖_∞ ≤ ε for a given ε > 0.
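The stopping rule above is easy to prototype, and doing so also gives a numerical check of the bound in part (a). The sketch below is an illustration only, assuming NumPy and a small randomly generated transition matrix and reward vector already restricted to a fixed policy π.

```python
import numpy as np

rng = np.random.default_rng(0)
nS, gamma, eps = 5, 0.9, 1e-3            # illustrative sizes, not taken from the assignment

P_pi = rng.random((nS, nS)); P_pi /= P_pi.sum(axis=1, keepdims=True)
R_pi = rng.random(nS)

V_exact = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)   # V^pi = (I - gamma P)^{-1} R

V = np.zeros(nS)                          # initial guess V_1
while True:
    V_next = R_pi + gamma * P_pi @ V      # one sweep of iterative policy evaluation
    delta = np.max(np.abs(V_next - V))
    V = V_next
    if delta <= eps:                      # stop when ||V_{k+1} - V_k||_inf <= eps
        break

print(np.max(np.abs(V - V_exact)), eps * gamma / (1 - gamma))   # error vs the bound in (a)
```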
(a) Prove that the error between the obtained value function estimate V_{k+1} and the true value function V^π satisfies
‖V_{k+1} − V^π‖_∞ ≤ εγ / (1 − γ)
(5 Points)
(b) Prove that the iterative policy evaluation algorithm converges geometrically, i.e.,
‖V_{k+1} − V^π‖_∞ ≤ γ^k ‖V_1 − V^π‖_∞
(2 Points)
(c) Let v denote a value function and consider the Bellman optimality operator L given by
(Lv)(s) = max_{a ∈ A} Σ_{s′ ∈ S} P(s′ | s, a) [ R(s, a, s′) + γ v(s′) ].
Prove that the Bellman optimality operator L satisfies the monotonicity property. That is, for any two value functions u and v such that u ≤ v (this means u(s) ≤ v(s) for all s ∈ S), we have L(u) ≤ L(v). (3 Points)
Problem 6 : On Contractions
(a) Let P and Q be two contractions defined on a normed vector space ⟨V, ‖·‖⟩. Prove that the compositions P ◦ Q and Q ◦ P are contractions on the same normed vector space. (5 Points)
(b) What are suitable contraction (or Lipschitz) coefficients for the contractions P ◦ Q and Q ◦ P? (1 Point)
(c) Define the operator B as F ◦ L, where L is the Bellman optimality operator and F is any other suitable operator. For example, F could play the role of a function approximator applied to the Bellman backup L. Under what conditions would the value iteration algorithm converge to a unique solution if the operator B is used in place of L (in the value iteration algorithm)? Explain your answer. (2 Points)
Problem 7 : Programming Value and Policy Iteration
Implement the value iteration and policy iteration algorithms and test them on the 'Frozen Lake' environment in OpenAI Gym. 'Frozen Lake' is a grid-world-like environment available in Gym. The purpose of this exercise is to help you get hands-on experience with Gym and to understand the implementation details of the value and policy iteration algorithm(s).
This question will not be graded but will still come in handy for future assignments.
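To get started, the sketch below reads the tabular model of 'Frozen Lake' out of Gym and runs a plain value iteration loop on it; policy iteration can be built analogously on top of the same model. It assumes the gym package with the FrozenLake-v1 environment id and the env.unwrapped.P model attribute (exact ids and wrapper behaviour can differ between gym/gymnasium versions), where P[s][a] is a list of (prob, next_state, reward, done) tuples.

```python
import gym
import numpy as np

env = gym.make("FrozenLake-v1")           # environment id may differ across gym versions
model = env.unwrapped.P                   # model[s][a] = [(prob, s2, reward, done), ...]
nS, nA = env.observation_space.n, env.action_space.n
gamma, tol = 0.99, 1e-8

def q_values(V, s):
    """One-step lookahead for every action in state s using the known model."""
    return [sum(p * (r + gamma * (0.0 if done else V[s2]))
                for p, s2, r, done in model[s][a]) for a in range(nA)]

# Value iteration
V = np.zeros(nS)
while True:
    V_next = np.array([max(q_values(V, s)) for s in range(nS)])
    delta = np.max(np.abs(V_next - V))
    V = V_next
    if delta < tol:
        break

policy = np.array([int(np.argmax(q_values(V, s))) for s in range(nS)])
print(policy.reshape(4, 4))               # default map is 4 x 4
```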