
Deep Reinforcement Learning (Spring 2022)

Assignment 1

March 16, 2022

1 MDPs [15 pts]


A boy is being chased around the school yard by bullies and must choose whether to Fight or
Run.

• There are three states:

– Ok (O), where he is fine for the moment.


– Danger (D), where the bullies are right on his heels.
– Caught (C), where the bullies catch up with him and administer noogies.

• He begins in state O 75% of the time.

• He begins in state D 25% of the time.

The transition graph of the MDP is given in the accompanying figure (omitted here).

1. Fill out the table with the results of value iteration with a discount factor γ = 0.9 [9 pts]:

k     V^k(O)     V^k(D)     V^k(C)
1        2         -1         -5
2
3
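For reference, each new row of the table comes from applying one Bellman backup to the previous row; written out for state O (with the rewards R and transition probabilities p read off the MDP graph):

V^{k+1}(O) = \max_{a \in \{\text{Fight}, \text{Run}\}} \Big[ R(O, a) + \gamma \big( p(O \mid O, a) V^k(O) + p(D \mid O, a) V^k(D) + p(C \mid O, a) V^k(C) \big) \Big]

with analogous updates for D and C.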

2. At k = 2 with γ = 0.9, what policy would you select? Is it necessarily true that this is
the optimal policy? At k = 3, what policy would you select? Is it necessarily true that
this is the optimal policy? [6 pts]

2 Value Iteration Theorem [35 pts]


In this problem, we will deal with contractions and fixed points and prove an important result
from the value iteration theorem. From lecture, we know that the Bellman backup operator
B given below is a contraction whose fixed point is V^*, the optimal value function of the
MDP. The symbols have their usual meanings: γ is the discount factor with 0 ≤ γ < 1, and in all
parts ||V|| denotes the infinity norm of the vector.

(BV)(s) = \max_a \Big[ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \Big]    (1)

We also saw the contraction operator B_π, which is the Bellman backup operator for a
particular policy, given below:

(B_\pi V)(s) = \mathbb{E}_{a \sim \pi} \Big[ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \Big]    (2)

(a) Recall that ||BV − BV′|| ≤ γ||V − V′|| for any two value functions V and V′. Prove
that B_π is also a contraction mapping: ||B_π V − B_π V′|| ≤ γ||V − V′||. [5 pts]

(b) Prove that the fixed point of B_π is unique. What is the fixed point of B_π? [5 pts]
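As an ungraded sanity check, the contraction property from part (a) and the fixed point asked about in part (b) can be observed numerically. The snippet below is a minimal sketch on a made-up random MDP and stochastic policy (all names, shapes, and values are illustrative only, not part of the assignment's starter code).

import numpy as np

# Hypothetical toy MDP and stochastic policy (shapes/values are illustrative only).
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 3, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # p(s'|s,a): rows sum to 1
R = rng.random((n_states, n_actions))      # R(s,a)
pi = rng.random((n_states, n_actions))     # pi(a|s)
pi /= pi.sum(axis=1, keepdims=True)

def bellman_pi(V):
    # (B_pi V)(s) = E_{a~pi}[ R(s,a) + gamma * sum_{s'} p(s'|s,a) V(s') ], Eq. (2)
    Q = R + gamma * P @ V                  # Q[s, a]
    return (pi * Q).sum(axis=1)

# Contraction check (part a): ||B_pi V - B_pi V'|| <= gamma * ||V - V'|| in sup norm.
V1, V2 = rng.random(n_states), rng.random(n_states)
lhs = np.max(np.abs(bellman_pi(V1) - bellman_pi(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(lhs <= rhs + 1e-12)                  # expected: True

# Iterating B_pi converges to a single fixed point regardless of the start (part b).
V = rng.random(n_states)
for _ in range(2000):
    V = bellman_pi(V)
print(np.max(np.abs(V - bellman_pi(V))))   # ~0: V is (numerically) the fixed point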

In value iteration, we repeatedly apply the Bellman backup operator B to improve our value
function. At the end of value iteration, we can recover a greedy policy π from the value function
using the equation below:

\pi(s) = \arg\max_a \Big[ R(s, a) + \gamma \sum_{s' \in S} p(s' \mid s, a)\, V(s') \Big]    (3)

Suppose we run value iteration for a finite number of steps to obtain a value function V (V
has not necessarily converged to V^*). Say now that we evaluate our policy π obtained using
the formula above to get V^π. Note that here and for the rest of Q2, π refers to the
greedy policy.

(c) Is V^π always the same as V? Justify your answer. [5 pts]

In lecture, we learned that running value iteration until a certain tolerance can bring us close
to recovering the optimal value function. Let V_n and V_{n+1} be the outputs of value iteration at
the nth and (n+1)th iterations respectively. Let ε > 0 and consider the point in value iteration
at which ||V_{n+1} − V_n|| < ε(1 − γ)/(2γ). Let π be the greedy policy given the value function V_{n+1}.
You will now prove that this policy π is ε-optimal. This result justifies why halting value
iteration when the difference between successive iterations is sufficiently small ensures that the
decision policy obtained by being greedy with respect to the value function is near-optimal.
Precisely, if

\| V_{n+1} - V_n \| < \frac{\epsilon (1 - \gamma)}{2\gamma}    (4)

then

\| V^\pi - V^* \| \le \epsilon    (5)
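Before proving the result, it may help to see it numerically. The following sketch (again on a made-up random MDP, illustrative only) runs value iteration until condition (4) holds, extracts the greedy policy of equation (3), evaluates it exactly, and checks conclusion (5).

import numpy as np

# Hypothetical random MDP, for illustration only.
rng = np.random.default_rng(1)
n_states, n_actions, gamma, eps = 4, 3, 0.9, 1e-2
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)          # p(s'|s,a)
R = rng.random((n_states, n_actions))      # R(s,a)

def backup(V):
    # (BV)(s) = max_a [ R(s,a) + gamma * sum_{s'} p(s'|s,a) V(s') ], Eq. (1)
    return (R + gamma * P @ V).max(axis=1)

# Run value iteration until condition (4) holds.
V = np.zeros(n_states)
while True:
    V_next = backup(V)
    gap = np.max(np.abs(V_next - V))
    V = V_next
    if gap < eps * (1 - gamma) / (2 * gamma):
        break

# Greedy policy from Eq. (3) and its exact value V^pi: solve (I - gamma*P_pi) V = R_pi.
pi = (R + gamma * P @ V).argmax(axis=1)
P_pi = P[np.arange(n_states), pi]
R_pi = R[np.arange(n_states), pi]
V_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

# Approximate V* with many more backups and check conclusion (5).
V_star = V.copy()
for _ in range(5000):
    V_star = backup(V_star)
print(np.max(np.abs(V_pi - V_star)) <= eps)    # expected: True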
(d) When π is the greedy policy, what is the relationship between B and B_π? [2 pts]

(e) Prove that ||V^π − V_{n+1}|| ≤ ε/2.

Hint: Introduce an in-between term and leverage the triangle inequality. [6 pts]

(f) Prove that ||B^k V − B^k V′|| ≤ γ^k ||V − V′||. [3 pts]

(g) Prove that ||V^* − V_{n+1}|| ≤ ε/2. [7 pts]

Hints: Note that ||V^* − V_{n+1}|| = ||V^* + V_{n+2} − V_{n+2} − V_{n+1}||, and you can repeatedly
apply this trick. It may also be useful to leverage part (f) and recall that V^* is the fixed
point of the contraction B.

(h) Use the results from parts (e) and (g) to show that ||V^π − V^*|| ≤ ε. [2 pts]

3 Simulation Lemma and Model-based Learning [25 pts]


Consider two finite horizon MDPs M := {S, A, H, r_h, P_h, s_0} and M̃ := {S, A, H, r_h, P̃_h, s_0}.
Consider an arbitrary stochastic stationary policy π : S → ∆(A). In finite horizon MDPs, the
state value is also a function of the number of steps the agent has taken so far, and we denote by
V^π_h(s) the expected value of following π starting from s at step h. We define V^π = V^π_0(s_0) as the
expected total reward of π under M, and Ṽ^π = Ṽ^π_0(s_0) as the expected total reward of π under M̃.
We denote by P^π_h the state-action distribution of π under M at step h.
(a) Prove the following equality:
V^\pi - \tilde{V}^\pi = \sum_{h=0}^{H-1} \mathbb{E}_{s,a \sim P^\pi_h} \Big[ \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)} \tilde{V}^\pi_{h+1}(s') - \mathbb{E}_{s' \sim \tilde{P}_h(\cdot \mid s,a)} \tilde{V}^\pi_{h+1}(s') \Big]

Here Ṽ^π_{h+1}(s') is the expected total reward of π under M̃ from time h + 1. [10 Points]

(b) Imagine the following situation: we have M as the true real MDP, and we have M̃ as
some learned model that is supposed to approximate the real MDP M. Given M̃, the
natural thing to do is to compute the optimal policy under M̃, i.e.,

\tilde{\pi}^* = \arg\max_{\pi \in \Pi} \tilde{V}^\pi,

where Π ⊂ {π : S → ∆(A)} is a pre-defined policy class. Let us also denote the true
optimal policy π^* under the real model as:

\pi^* = \arg\max_{\pi \in \Pi} V^\pi.

A natural question is: what is the performance of π̃^* under the real model M,
compared to that of π^* under M? To answer this question, let us assume r(s, a) ∈ [0, 1] for all
s, a and prove the following inequality:
V^{\pi^*} - V^{\tilde{\pi}^*} \le H \sum_{h=0}^{H-1} \Big[ \mathbb{E}_{s,a \sim P^{\pi^*}_h} \big\| \tilde{P}_h(\cdot \mid s, a) - P_h(\cdot \mid s, a) \big\|_1 + \mathbb{E}_{s,a \sim P^{\tilde{\pi}^*}_h} \big\| \tilde{P}_h(\cdot \mid s, a) - P_h(\cdot \mid s, a) \big\|_1 \Big]

(Hint: \int |f(x) g(x)| \, dx \le \| f(\cdot) \|_1 \, \| g(\cdot) \|_\infty) [15 Points]
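One way the hint can be instantiated (a sketch under the stated assumption r(s, a) ∈ [0, 1], which gives ||Ṽ^π_{h+1}||_∞ ≤ H): for any fixed s, a, and step h,

\Big| \sum_{s'} \big( P_h(s' \mid s, a) - \tilde{P}_h(s' \mid s, a) \big) \tilde{V}^\pi_{h+1}(s') \Big| \le \big\| P_h(\cdot \mid s, a) - \tilde{P}_h(\cdot \mid s, a) \big\|_1 \big\| \tilde{V}^\pi_{h+1} \big\|_\infty \le H \big\| \tilde{P}_h(\cdot \mid s, a) - P_h(\cdot \mid s, a) \big\|_1.

Combining this with the decomposition from part (a), applied under each of π^* and π̃^*, is one route to the stated bound.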

4 Frozen Lake MDP [25 Points]


Now you will implement value iteration and policy iteration for the Frozen Lake environment
from OpenAI Gym. We have provided custom versions of this environment in the starter code.

(a) (coding) Read through vi_and_pi.py and implement policy_evaluation,
policy_improvement, and policy_iteration. The stopping tolerance (defined as
max_s |V_old(s) − V_new(s)|) is tol = 10^{-3}. Use γ = 0.9. Return the optimal value function
and the optimal policy. [10 pts]
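For orientation, below is a minimal sketch of these three routines, assuming a Gym-style tabular transition model in which P[s][a] is a list of (probability, next_state, reward, terminal) tuples; the actual signatures in vi_and_pi.py may differ.

import numpy as np

def policy_evaluation(P, nS, nA, policy, gamma=0.9, tol=1e-3):
    # Iterate the Bellman backup for a fixed policy until
    # max_s |V_old(s) - V_new(s)| < tol.
    V = np.zeros(nS)
    while True:
        V_new = np.zeros(nS)
        for s in range(nS):
            for prob, s_next, reward, done in P[s][policy[s]]:
                V_new[s] += prob * (reward + gamma * V[s_next] * (not done))
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

def policy_improvement(P, nS, nA, V, gamma=0.9):
    # Act greedily with respect to V, as in Eq. (3).
    Q = np.zeros((nS, nA))
    for s in range(nS):
        for a in range(nA):
            for prob, s_next, reward, done in P[s][a]:
                Q[s, a] += prob * (reward + gamma * V[s_next] * (not done))
    return Q.argmax(axis=1)

def policy_iteration(P, nS, nA, gamma=0.9, tol=1e-3):
    # Alternate evaluation and greedy improvement until the policy is stable.
    policy = np.zeros(nS, dtype=int)
    while True:
        V = policy_evaluation(P, nS, nA, policy, gamma, tol)
        new_policy = policy_improvement(P, nS, nA, V, gamma)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy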

(b) (coding) Implement value_iteration in vi_and_pi.py. The stopping tolerance is
tol = 10^{-3}. Use γ = 0.9. Return the optimal value function and the optimal policy. [10 pts]
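A matching sketch for value iteration under the same assumed transition-model format (again, not necessarily the exact interface of the starter code):

import numpy as np

def value_iteration(P, nS, nA, gamma=0.9, tol=1e-3):
    # Repeatedly apply the Bellman backup of Eq. (1) until
    # max_s |V_old(s) - V_new(s)| < tol, then act greedily as in Eq. (3).
    V = np.zeros(nS)
    while True:
        Q = np.zeros((nS, nA))
        for s in range(nS):
            for a in range(nA):
                for prob, s_next, reward, done in P[s][a]:
                    Q[s, a] += prob * (reward + gamma * V[s_next] * (not done))
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)
        V = V_new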

(c) (written) Run both methods on the Deterministic-4x4-FrozenLake-v0 and
Stochastic-4x4-FrozenLake-v0 environments. In the second environment, the dynamics
of the world are stochastic. How does stochasticity affect the number of iterations
required, and the resulting policy? [5 pts]
