AI 3000 / CS 5500 : Reinforcement Learning

Assignment № 2
Due Date : 19/10/2021
Teaching Assistants : Chaitanya Devaguptapu and Deepayan Das

Easwar Subramanian, IIT Hyderabad 06/10/2021

Problem 1 : Model Free Prediction and Control

Consider the MDP shown below with states {A, B, C, D, E, F, G}. Normally, an agent can either move left or right in each state. However, in state C, the agent has the choice to either move left or jump forward, as state D of the MDP has a hurdle. There is no right action from state C. The jump action from state C will place the agent either in square D or in square E with probability 0.5 each. The reward for each action at each state s is depicted in the figure below alongside the corresponding arrow. The terminal state is G and has a reward of zero. Assume a discount factor of γ = 1.

Consider the following sample trajectories of the Markov chain, annotated with rewards, to answer the questions below:

• A −(+1)→ B −(+1)→ C −(−2)→ B −(+1)→ C −(+1)→ D −(+1)→ E −(+1)→ F −(+10)→ G
• A −(+1)→ B −(+1)→ C −(+1)→ D −(+1)→ E −(+1)→ F −(+10)→ G
• A −(+1)→ B −(+1)→ C −(+4)→ E −(+1)→ F −(+10)→ G
• A −(+1)→ B −(+1)→ C −(+4)→ E −(−2)→ D −(+1)→ E −(+1)→ F −(+10)→ G
• A −(+1)→ B −(+1)→ C −(+4)→ E −(−2)→ D −(+1)→ E −(+1)→ F −(−2)→ E −(+1)→ F −(+10)→ G

(a) Evaluate V (s) using first visit Monte-Carlo method for all states s of the MDP. (2 Points)
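[Illustration : A minimal Python sketch of first-visit Monte Carlo evaluation on the rollouts listed above. The trajectory encoding, the variable names and the use of Python are illustrative assumptions, not part of the required solution.]

# First-visit Monte Carlo evaluation sketch (gamma = 1).
# Each trajectory is encoded as (list_of_states, list_of_rewards), where
# rewards[i] is received on the transition states[i] -> states[i+1].

from collections import defaultdict

trajectories = [
    (list("ABCBCDEFG"),  [1, 1, -2, 1, 1, 1, 1, 10]),
    (list("ABCDEFG"),    [1, 1, 1, 1, 1, 10]),
    (list("ABCEFG"),     [1, 1, 4, 1, 10]),
    (list("ABCEDEFG"),   [1, 1, 4, -2, 1, 1, 10]),
    (list("ABCEDEFEFG"), [1, 1, 4, -2, 1, 1, -2, 1, 10]),
]

returns = defaultdict(list)
for states, rewards in trajectories:
    seen = set()
    for i, s in enumerate(states[:-1]):          # exclude the terminal state G
        if s not in seen:                        # first visit to s in this episode
            seen.add(s)
            returns[s].append(sum(rewards[i:]))  # gamma = 1, so the return is an undiscounted sum

V = {s: sum(g) / len(g) for s, g in returns.items()}   # average of first-visit returns
print(V)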

(b) Which states are likely to have different value estimates if evaluated using every visit MC as
compared to first visit MC ? Why ? (1 Point)

(c) Now consider a policy πf that always moves forward (using the actions right or jump). Compute the true values of V πf (s) for all states of the MDP. (2 Points)

(d) Consider trajectories 2, 3 and 4 from the above list of rollouts. Compute V πf (s) for all states of the MDP using maximum likelihood estimation. (2 Points)
[Hint : An MLE (or certainty-equivalence) estimate is a value estimate computed from a model that is itself estimated from the sample trajectories. For example, to compute V (B) we need V (C), and one needs to calculate the state transition probabilities of going from state C to states D and E, respectively, using the samples. Use the transition probabilities so obtained to compute V (C).]
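[Illustration : A small Python sketch of the counting step in the hint above, applied to trajectories 2, 3 and 4; the rollout encoding is an illustrative assumption. The estimated probabilities would then be plugged into the backup for V (C), as the hint suggests.]

# Estimate P(D | C, jump) and P(E | C, jump) by counting the successor of C
# in the chosen rollouts (trajectories 2, 3 and 4); under pi_f this is the
# only stochastic transition.

rollouts = [list("ABCDEFG"), list("ABCEFG"), list("ABCEDEFG")]

succ_counts = {"D": 0, "E": 0}
for states in rollouts:
    i = states.index("C")
    succ_counts[states[i + 1]] += 1

total = sum(succ_counts.values())
p_hat = {s: n / total for s, n in succ_counts.items()}
print(p_hat)   # {'D': 1/3, 'E': 2/3} for these three rollouts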

(e) Suppose, using policy πf , we collect infinitely many trajectories of the above MDP. If we
compute the value function V πf using Monte Carlo and TD(0) evaluations, would the two
methods converge to the same value function ? Justify your answer. (2 Points)

(f) Fill in the blank cells of the table below with the Q-values that result from applying the Q-
learning update for the 4 transitions specified by the episode below. You may leave Q-values
that are unaffected by the current update blank. Use learning rate α = 0.5. Assume all Q-
values are initialized to 0. (2 Points)

Episode (four transitions, each given as s, a, r, s′) :
Transition 1 : s = C, a = jump, r = +4, s′ = E
Transition 2 : s = E, a = right, r = +1, s′ = F
Transition 3 : s = F, a = left, r = −2, s′ = E
Transition 4 : s = E, a = right, r = +1, s′ = F

Q(C, left) Q(C, jump) Q(E, left) Q(E, right) Q(F, left) Q(F, right)
Initial 0 0 0 0 0 0
Transition 1
Transition 2
Transition 3
Transition 4

(g) After running the Q-learning algorithm using the four transitions given above, construct a
greedy policy using the current values of the Q-table in states C, E and F . (1 Point)
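[Illustration : A minimal tabular Q-learning sketch for the episode in part (f), using α = 0.5, γ = 1 and zero-initialised Q-values. The dictionary encoding of the transitions and action sets is an illustrative assumption.]

# Q(s, a) <- Q(s, a) + alpha * (r + gamma * max_b Q(s', b) - Q(s, a))

from collections import defaultdict

alpha, gamma = 0.5, 1.0
actions = {"C": ["left", "jump"], "E": ["left", "right"], "F": ["left", "right"]}
Q = defaultdict(float)                      # all Q-values initialised to 0

episode = [                                 # the four (s, a, r, s') transitions of part (f)
    ("C", "jump", 4, "E"),
    ("E", "right", 1, "F"),
    ("F", "left", -2, "E"),
    ("E", "right", 1, "F"),
]

for t, (s, a, r, s_next) in enumerate(episode, start=1):
    target = r + gamma * max(Q[(s_next, b)] for b in actions[s_next])
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    print(f"Transition {t}: Q({s}, {a}) = {Q[(s, a)]}")

# The greedy policy of part (g) picks argmax_a Q(s, a) in each of the states C, E and F.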

Problem 2 : On Learning Rates

In any TD based algorithm, the update rule is of the following form

V (s) ← V (s) + αt [r + γV (s′) − V (s)]

where αt is the learning rate at the t-th time step. Here, the time step t refers to the t-th time we update the value of the state s. Among other conditions, the learning rate αt has to obey the Robbins-Monro conditions given by,

∑_{t=0}^{∞} αt = ∞

∑_{t=0}^{∞} αt² < ∞

for convergence to the true V (s). Other conditions being the same, reason out whether the following values of αt would result in convergence. (5 Points)

(1) αt = 1/t

(2) αt = 1/t²

(3) αt = 1/t^(2/3)

(4) αt = 1/t^(1/2)

Generalize the above result for αt = 1/t^p for any positive real number p (i.e. p ∈ R+).
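[Illustration : A numerical Python sketch that can help build intuition about the two conditions; it only shows the growth of finite partial sums and is not a proof. The analytic answer follows from the p-series test: ∑ 1/t^q diverges if and only if q ≤ 1.]

# Compare the growth of the partial sums of alpha_t and alpha_t^2 for alpha_t = 1 / t**p.

def partial_sums(p, T=10**6):
    s1 = sum(1.0 / t ** p for t in range(1, T + 1))         # partial sum of alpha_t
    s2 = sum(1.0 / t ** (2 * p) for t in range(1, T + 1))   # partial sum of alpha_t^2
    return s1, s2

for p in [1.0, 2.0, 2 / 3, 0.5]:
    s1, s2 = partial_sums(p)
    print(f"p = {p:.2f}: sum alpha_t ~ {s1:.2f}, sum alpha_t^2 ~ {s2:.2f}")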

Problem 3 : Q-Learning

Consider a single-state MDP with two actions. That is, S = {s} and A = {a1 , a2 }. Assume the discount factor γ of the MDP and the horizon length to be 1. Both actions yield random rewards, with the expected reward of each action being a constant c ≥ 0. That is,

E(r|a1 ) = c and E(r|a2 ) = c

where r ∼ Rai , i ∈ {1, 2}.

(a) What are the true values of Q(s, a1 ), Q(s, a2 ) and V ∗ (s) ? (1 Point)

(b) Consider a collection of n prior samples of reward r obtained by choosing action a1 or a2 from
state s. Denote Q̂(s, a1 ) and Q̂(s, a2 ) to be the sample estimates of action value functions
Q(s, a1 ) and Q(s, a2 ), respectively. Let π̂ be a greedy policy obtained with respect to the
estimated Q̂(s, ai ), i ∈ {1, 2}. That is,

π̂(s) = arg max_a Q̂(s, a)

Prove that the estimated value of the policy π̂, denoted by V̂ π̂ , is a biased estimate of the
optimal value function V ∗ (s). (4 Points)
[Note : Assume that actions a1 and a2 have been chosen an equal number of times.]

(c) Let us now consider that the first action a1 always gives a constant reward of c whereas
the second action a2 gives a reward c + N (−0.2, 1) (normal distribution with mean -0.2 and
unit variance). Which is the better action to take in expectation ? Would TD control algorithms like Q-learning or SARSA, trained using finite samples, always favor the action that is best in expectation ? Explain. (3 Points)
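[Illustration : A small simulation of the setting in part (b): both actions have true mean c, yet the greedy estimate max_i Q̂(s, a_i) tends to lie above V ∗ (s) = c whenever the estimates are noisy. The Gaussian noise, the value of c and the sample sizes are illustrative assumptions.]

import random

random.seed(0)
c, n, runs = 1.0, 5, 20000

bias_acc = 0.0
for _ in range(runs):
    # n noisy reward samples per action, both actions with true mean c
    q1 = sum(c + random.gauss(0.0, 1.0) for _ in range(n)) / n   # Q_hat(s, a1)
    q2 = sum(c + random.gauss(0.0, 1.0) for _ in range(n)) / n   # Q_hat(s, a2)
    bias_acc += max(q1, q2) - c          # greedy value estimate minus the true V*(s) = c

print("average (V_hat - V*):", bias_acc / runs)   # typically > 0: an upward bias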

Problem 4 : Importance Sampling

Consider a single-state MDP with a finite action space, such that |A| = K. Assume the discount factor γ of the MDP and the horizon length to be 1. For taking an action a ∈ A, let Ra (r) denote the unknown distribution of the reward r, bounded in the range [0, 1]. Suppose we have collected a dataset consisting of action-reward pairs {(a, r)} by sampling a ∼ πb , where πb is a stochastic behaviour policy and r ∼ Ra . Using this dataset, we now wish to estimate V π = Eπ [r|a ∼ π] for some target policy π. We assume that π is fully supported on πb .

(a) Suppose the dataset consists of a single sample (a, r). Estimate V π using importance sampling (IS). Is the obtained IS estimate of V π unbiased ? Explain. (2 Points)
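[Illustration : A sketch of the ordinary (per-sample) importance sampling estimator from behaviour-policy data; with a single pair (a, r) it reduces to a single weighted reward. The function name, the dictionary representation of the policies and the toy numbers are illustrative assumptions.]

def is_estimate(data, pi, pi_b):
    """data: list of (action, reward) pairs collected with a ~ pi_b.
    pi, pi_b: dicts mapping action -> probability, with pi fully supported on pi_b."""
    weighted = [(pi[a] / pi_b[a]) * r for a, r in data]   # importance-weighted rewards
    return sum(weighted) / len(weighted)

# Toy usage with K = 3 actions, a uniform behaviour policy and a deterministic target:
pi_b = {"a1": 1 / 3, "a2": 1 / 3, "a3": 1 / 3}
pi   = {"a1": 0.0, "a2": 1.0, "a3": 0.0}
print(is_estimate([("a2", 0.7)], pi, pi_b))   # single sample: weight 3, estimate 2.1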

(b) Compute Eπb [ π(a|·) / πb (a|·) ]. (1 Point)

(c) For the case that πb is a uniformly random policy (all K actions are equiprobable) and π is a deterministic policy, provide an expression for the importance sampling ratio. (1 Point)

(d) For this sub-question, consider the special case when the reward r for choosing any action is identical, given by a deterministic constant r [i.e., r = R(a), ∀a ∈ A]. For a uniform behaviour policy πb and a deterministic target policy π, calculate the variance of V π estimated using the importance sampling (IS) method. (5 Points)
[Note : The variance needs to be computed under the measure πb .]
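[Illustration : A simulation that can be used to sanity-check the analytic variance asked for in part (d): a uniform behaviour policy over K actions, a deterministic target policy and the same constant reward for every action. K, the reward value and the sample count are illustrative assumptions; the closed-form expression is left to the derivation.]

import random

random.seed(0)
K, r_const, n_samples = 4, 0.8, 200000
target_action = 0                                  # the deterministic target picks action 0

estimates = []
for _ in range(n_samples):
    a = random.randrange(K)                        # a ~ pi_b, uniform over the K actions
    weight = K if a == target_action else 0.0      # importance ratio pi(a) / pi_b(a)
    estimates.append(weight * r_const)             # single-sample IS estimate of V^pi

mean = sum(estimates) / n_samples
var = sum((x - mean) ** 2 for x in estimates) / n_samples
print("empirical mean:", mean, "empirical variance (under pi_b):", var)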

(e) Derive an upper bound for the variance of the IS estimate of V π for the general case when
the reward distribution is bounded in the range [0, 1]. (3 Points)

(f) We now consider the case of a multi-state (i.e. |S| > 1), multi-step MDP. We further assume P (s0 ) to be the initial start-state distribution (i.e. s0 ∼ P (s0 )), where s0 is the start state of the MDP. Let τ denote a trajectory (state-action sequence) given by (s0 , a0 , s1 , a1 , · · · , st , at , · · · ), with actions a0:∞ ∼ πb . Let P and Q be the joint distributions over the entire trajectory τ induced by the behaviour policy πb and the target policy π, respectively. Provide a compact expression for the importance sampling weight P (τ )/Q(τ ). (3 Points)

[Note : A probability distribution P is fully supported on another probability distribution Q if Q assigns non-zero probability to every outcome that is assigned non-zero probability by P.]

Problem 5 : Game of Tic-Tac-Toe

Consider a 3 × 3 Tic-Tac-Toe game. The aim of this problem is to implement a Tic-Tac-Toe agent
using Q-learning. This is a two-player game in which the opponent is part of the environment.

(a) Develop a Tic-Tac-Toe environment with the following methods (a minimal skeleton sketch is given after this list). (5 Points)

(1) An init method that starts with an empty board position, assigns both player symbols
(’X’ or ’O’) and determines who starts the game. For simplicity, you may assume that
the agent always plays ’X’ and the opponent plays ’O’.
(2) An act method that takes as input a move suggested by the agent. This method should
check if the move is valid and place the ’X’ in the appropriate board position.
(3) A print method that prints the current board position.
(4) You are free to add other methods inside the environment as you deem fit.
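[Illustration : One possible skeleton for such an environment; the class name, the print_board method name and the reward/termination handling are illustrative assumptions and are left deliberately schematic.]

import random

class TicTacToeEnv:
    def __init__(self, agent_symbol="X", opponent_symbol="O"):
        """Start with an empty board, fix the symbols and toss a coin for who starts."""
        self.board = [" "] * 9                      # squares 0..8 in row-major order
        self.agent, self.opponent = agent_symbol, opponent_symbol
        self.agent_starts = random.random() < 0.5

    def act(self, move):
        """Place the agent's 'X' at square `move` (0-8) after checking validity."""
        if self.board[move] != " ":
            raise ValueError(f"invalid move: square {move} is already occupied")
        self.board[move] = self.agent
        # ... opponent reply, reward computation and terminal check would go here ...

    def print_board(self):
        rows = [self.board[i:i + 3] for i in (0, 3, 6)]
        print("\n-----\n".join("|".join(r) for r in rows))

env = TicTacToeEnv()
env.act(4)            # the agent plays the centre square
env.print_board()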

(b) Develop two opponents for the Q-learning agent to train against, namely, a random agent and a safe agent (a sketch of the safe-agent heuristic is given after this list). (5 Points)

(1) A random agent picks a square among all available empty squares in a (uniform) ran-
dom fashion
(2) A safe agent uses the following heuristic to choose a square. If there is a winning move
for the safe agent, then the corresponding square is picked. Else, if there is a blocking
move, the corresponding square is chosen. A blocking move prevents the opponent from winning on its very next move. If there are no winning or blocking moves, the
safe agent behaves like the random agent.
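[Illustration : A sketch of the safe agent's move selection described in (2); the board encoding (a list of 9 cells with " " for empty) and the helper names are illustrative assumptions.]

import random

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),    # rows
         (0, 3, 6), (1, 4, 7), (2, 5, 8),    # columns
         (0, 4, 8), (2, 4, 6)]               # diagonals

def winning_move(board, symbol):
    """Return a square that completes three-in-a-row for `symbol`, or None."""
    for line in LINES:
        values = [board[i] for i in line]
        if values.count(symbol) == 2 and values.count(" ") == 1:
            return line[values.index(" ")]
    return None

def safe_agent_move(board, me, opponent):
    move = winning_move(board, me)             # 1. take a winning square if one exists
    if move is None:
        move = winning_move(board, opponent)   # 2. otherwise block the opponent
    if move is None:                           # 3. otherwise behave like the random agent
        move = random.choice([i for i, cell in enumerate(board) if cell == " "])
    return move

print(safe_agent_move(["O", "O", " ", "X", "X", " ", " ", " ", " "], "O", "X"))   # -> 2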

(c) The Q-learning agent now has the task of learning to play Tic-Tac-Toe by playing several games against the safe and random opponents. The training will be done using tabular Q-learning by playing 10,000 games. In each of these 10,000 games, a fair coin toss determines who makes the first move. After every 200 games, assess the efficacy of the learning by playing 100 games against the opponent using the fully greedy policy with respect to the current Q-table. Record the number of wins in those 100 games. This way, one can study the progress of the training as a function of training epochs. Plot the training progress graph as suggested. In addition, after the training is completed (that is, after 10,000 games of training are done), the trained agent’s performance is ascertained by playing 1000 games against the opponents and recording the total number of wins, draws and losses in those 1000 games. The training and testing process is described below; a schematic sketch of the training/evaluation loop is given after the list. (10 Points)

(1) Training is done only against the random player, but the learnt Q-table is tested against both the random and the safe players.
(2) Training is done only against the safe player, but the learnt Q-table is tested against both the random and the safe players.
(3) In every game of training, we randomly select the opponent. The learnt Q-table is tested against both the random and the safe players.
(4) Among the three agents developed, which agent is best ? Why ?
(5) Is the Q-learning agent developed unbeatable against any possible opponent ? If not,
suggest ways to improve the training process.
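[Illustration : A schematic sketch of the training/evaluation loop described in (c). The functions play_training_game and play_greedy_game are placeholders (assumed to update the Q-table in place and to return "win"/"draw"/"loss", respectively); they are not part of any existing library.]

import random

def train_and_evaluate(Q, opponent, play_training_game, play_greedy_game,
                       n_train=10_000, eval_every=200, n_eval=100, n_test=1000):
    win_curve = []
    for game in range(1, n_train + 1):
        agent_starts = random.random() < 0.5            # fair coin toss for the first move
        play_training_game(Q, opponent, agent_starts)   # one epsilon-greedy Q-learning game
        if game % eval_every == 0:                      # periodic greedy evaluation
            wins = sum(play_greedy_game(Q, opponent, random.random() < 0.5) == "win"
                       for _ in range(n_eval))
            win_curve.append((game, wins))              # data points for the progress plot
    results = [play_greedy_game(Q, opponent, random.random() < 0.5) for _ in range(n_test)]
    summary = {k: results.count(k) for k in ("win", "draw", "loss")}
    return win_curve, summary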

[Note : A useful diagnostic could be to keep a count of how many times each state-action pair is visited, along with the latest Q value for each state-action pair. The idea is that, if a state-action pair is visited more often, its Q value gets updated frequently and is consequently likely to be closer to the ’optimal’ value. Although it is not necessary to use the concept of afterstates discussed in class, it may be useful for accelerating the training process.]

ALL THE BEST
