Assignment 2
Welcome to Assignment 2. This notebook will help you understand:
- Policy Evaluation and Policy Improvement
- Value and Policy Iteration
- Bellman Equations
Gridworld City, a thriving metropolis with a booming technology industry, has recently experienced
an influx of grid-loving software engineers. Unfortunately, the city’s street parking system, which
charges a fixed rate, is struggling to keep up with the increased demand. To address this, the city
council has decided to modify the pricing scheme to better promote social welfare. In general, the
city considers social welfare higher when more parking is being used, the exception being that the
city prefers that at least one spot is left unoccupied (so that it is available in case someone really
needs it). The city council has created a Markov decision process (MDP) to model the demand
for parking with a reward function that reflects its preferences. Now the city has hired you — an
expert in dynamic programming — to help determine an optimal policy.
1.2 Preliminaries
You’ll need two imports to complete this assignment:
- numpy: The fundamental package for scientific computing with Python.
- tools: A module containing an environment and a plotting function.
There are also some other lines in the cell below that are used for grading and plotting — you
needn’t worry about them.
In this notebook, all cells are locked except those that you are explicitly asked to modify. It is up
to you to decide how to implement your solution in these cells, but please do not import other
libraries — doing so will break the autograder.
[1]: %matplotlib inline
import numpy as np
import tools
import grader
In the city council’s parking MDP, states are nonnegative integers indicating how many parking
spaces are occupied, actions are nonnegative integers designating the price of street parking, the
reward is a real value describing the city’s preference for the situation, and time is discretized by
hour. As might be expected, charging a high price is likely to decrease occupancy over the hour,
while charging a low price is likely to increase it.
For now, let’s consider an environment with three parking spaces and three price points. Note that
an environment with three parking spaces actually has four states — zero, one, two, or three spaces
could be occupied.
[2]: # ---------------
# Discussion Cell
# ---------------
num_spaces = 3
num_prices = 3
env = tools.ParkingWorld(num_spaces, num_prices)
V = np.zeros(num_spaces + 1)
pi = np.ones((num_spaces + 1, num_prices)) / num_prices
The value function is a one-dimensional array where the i-th entry gives the value of i spaces being
occupied.
[3]: V
We can represent the policy as a two-dimensional array where the (i, j)-th entry gives the probability
of taking action j in state i.
[4]: pi
[6]: V[0] = 10
pi[0] = [0.75, 0.15, 0.10]  # give action 0 the highest probability in state 0
tools.plot(V, pi)
We can visualize a value function and policy with the plot function in the tools module. On
the left, the value function is displayed as a barplot. State zero has an expected return of ten,
while the other states have an expected return of zero. On the right, the policy is displayed on a
two-dimensional grid. Each vertical strip gives the policy at the labeled state. In state zero, action
zero is the darkest because the agent’s policy makes this choice with the highest probability. In the
other states the agent has the equiprobable policy, so the vertical strips are colored uniformly.
You can access the state space and the action set as attributes of the environment.
[7]: env.S
[7]: [0, 1, 2, 3]
[8]: env.A
[8]: [0, 1, 2]
You will need to use the environment’s transitions method to complete this assignment. The
method takes a state and an action and returns a 2-dimensional array, where the entry at (i, 0) is
the reward for transitioning to state i from the current state and the entry at (i, 1) is the conditional
probability of transitioning to state i given the current state and action.
[9]: state = 3
action = 1
transitions = env.transitions(state, action)
transitions
(Output: a 4 × 2 array; row i gives the reward for moving to state i and the probability of that transition.)
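As a quick illustration (not a graded cell), the rewards and probabilities in this array can be combined with a value function to form the expected one-step return that the Bellman updates below are built from; the discount factor of 0.9 here is simply the value used later in the notebook.

# Sketch: expected one-step return from state 3 under the uniform random
# policy `pi`, computed directly from the transitions array.
state = 3
gamma = 0.9
expected_return = 0.0
for action in env.A:
    transitions = env.transitions(state, action)
    for next_state in env.S:
        reward = transitions[next_state, 0]
        prob = transitions[next_state, 1]
        expected_return += pi[state, action] * prob * (reward + gamma * V[next_state])
print(expected_return)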
You’re now ready to begin the assignment! First, the city council would like you to evaluate the
quality of the existing pricing scheme. Policy evaluation works by iteratively applying the Bellman
equation for vπ to a working value function, as an update rule, as shown below.
$$v(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v(s') \right]$$
This update can either occur “in-place” (i.e. the update rule is sequentially applied to each state)
or with “two-arrays” (i.e. the update rule is simultaneously applied to each state). Both versions
converge to vπ but the in-place version usually converges faster. In this assignment, we will
be implementing all update rules in-place, as is done in the pseudocode of chapter 4 of the
textbook.
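For contrast, here is a minimal sketch of the two-array variant (not used in this assignment, and assuming the env.transitions interface described above); every update in a sweep reads only the previous sweep's values and writes into a fresh array.

# Sketch of "two-array" policy evaluation, for comparison with the in-place
# version used in the graded cells below.
def evaluate_policy_two_arrays(env, V, pi, gamma, theta):
    delta = float('inf')
    while delta > theta:
        new_V = np.zeros(len(env.S))
        for s in env.S:
            for a in env.A:
                transitions = env.transitions(s, a)
                for s_next in env.S:
                    reward, prob = transitions[s_next, 0], transitions[s_next, 1]
                    new_V[s] += pi[s][a] * prob * (reward + gamma * V[s_next])
        delta = np.max(np.abs(new_V - V))
        V = new_V
    return V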
We have written an outline of the policy evaluation algorithm described in chapter 4.1 of the
textbook. It is left to you to fill in the bellman_update function to complete the algorithm.
[11]: # lock
def evaluate_policy(env, V, pi, gamma, theta):
    delta = float('inf')
    while delta > theta:
        delta = 0
        for s in env.S:
            v = V[s]
            bellman_update(env, V, pi, s, gamma)
            delta = max(delta, abs(v - V[s]))
    return V
[12]: # -----------
# Graded Cell
# -----------
def bellman_update(env, V, pi, s, gamma):
    """Mutate ``V`` according to the Bellman update equation."""
    # YOUR CODE HERE
    v = 0
    for action in env.A:
        action_prob = pi[s][action]
        transitions = env.transitions(s, action)
        for next_state in env.S:
            reward = transitions[next_state, 0]
            prob = transitions[next_state, 1]
            v += action_prob * prob * (reward + gamma * V[next_state])
    V[s] = v
The cell below uses the policy evaluation algorithm to evaluate the city’s policy, which charges a
constant price of one.
[13]: # --------------
# Debugging Cell
# --------------
# Feel free to make any changes to this cell to debug your code
gamma = 0.9
theta = 0.1
# the city charges a constant price of one
city_policy = np.zeros((num_spaces + 1, num_prices))
city_policy[:, 1] = 1
V = np.zeros(num_spaces + 1)
V = evaluate_policy(env, V, city_policy, gamma, theta)
print(V)
[14]: # -----------
# Tested Cell
# -----------
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.
# build test policy
city_policy = np.zeros((num_spaces + 1, num_prices))
city_policy[:, 1] = 1
gamma = 0.9
theta = 0.1
V = np.zeros(num_spaces + 1)
V = evaluate_policy(env, V, city_policy, gamma, theta)
# make sure the value function is within 2 decimal places of the correct answer
assert grader.near(V, answer, 1e-2)
You can use the plot function to visualize the final value function and policy.
[15]: # lock
tools.plot(V, city_policy)
Observe that the value function qualitatively resembles the city council’s preferences: it
monotonically increases as more parking is used, until there is no parking left, in which case the value
is lower. Because of the relatively simple reward function (more reward is accrued when many
but not all parking spots are taken and less reward is accrued when few or all parking spots are
taken) and the highly stochastic dynamics function (each state has positive probability of being
reached each time step) the value functions of most policies will qualitatively resemble this graph.
However, depending on the intelligence of the policy, the scale of the graph will differ. In other
words, better policies will increase the expected return at every state rather than changing the
relative desirability of the states. Intuitively, the value of a less desirable state can be increased
by making it less likely to remain in a less desirable state. Similarly, the value of a more desirable
state can be increased by making it more likely to remain in a more desirable state. That is to
say, good policies are policies that spend more time in desirable states and less time in undesirable
states. As we will see in this assignment, such a steady state distribution is achieved by setting the
price to be low in low occupancy states (so that the occupancy will increase) and setting the price
high when occupancy is high (so that full occupancy will be avoided).
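To make this pricing intuition concrete, here is a small sketch (discussion only, using the three-space environment from above) that prints the expected next-hour occupancy from state 2 at each price; if the dynamics behave as described earlier, higher prices should give lower expected occupancy.

# Sketch: expected next-hour occupancy from state 2 (two of three spaces
# taken), for each available price.
for price in env.A:
    transitions = env.transitions(2, price)
    expected_occupancy = sum(s * transitions[s, 1] for s in env.S)
    print(price, round(float(expected_occupancy), 2))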
Now the city council would like you to compute a more efficient policy using policy iteration. Policy
iteration works by alternating between evaluating the existing policy and making the policy greedy
with respect to the existing value function. We have written an outline of the policy iteration
algorithm described in chapter 4.3 of the textbook. We will make use of the policy evaluation
algorithm you completed in section 1. It is left to you to fill in the q_greedify_policy function,
such that it modifies the policy at s to be greedy with respect to the q-values at s, to complete the
policy improvement algorithm.
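Here, the q-values at s are the one-step lookahead values computed from the current value function, and the greedified policy puts all of its probability on a maximizing action:

$$q(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v(s') \right], \qquad \pi(s) \leftarrow \arg\max_{a} q(s, a)$$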
[16]: def improve_policy(env, V, pi, gamma):
    policy_stable = True
    for s in env.S:
        old = pi[s].copy()
        q_greedify_policy(env, V, pi, s, gamma)
        if not np.array_equal(pi[s], old):
            policy_stable = False
    return pi, policy_stable
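The outline’s driver routine, policy_iteration, is not reproduced above; a minimal sketch consistent with the description (alternating full evaluation and full improvement until the policy is stable) and with the calls in the cells below might look like this:

# Sketch: alternate policy evaluation and policy improvement until the
# improvement step no longer changes the policy.
def policy_iteration(env, gamma, theta):
    V = np.zeros(len(env.S))
    pi = np.ones((len(env.S), len(env.A))) / len(env.A)
    policy_stable = False
    while not policy_stable:
        V = evaluate_policy(env, V, pi, gamma, theta)
        pi, policy_stable = improve_policy(env, V, pi, gamma)
    return V, pi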
[17]: # -----------
# Graded Cell
# -----------
def q_greedify_policy(env, V, pi, s, gamma):
    """Mutate ``pi`` to be greedy with respect to the q-values induced by ``V``."""
    # YOUR CODE HERE
    A = np.zeros(len(env.A))
    for action in env.A:
        transitions = env.transitions(s, action)
        for next_state in env.S:
            reward = transitions[next_state, 0]
            prob = transitions[next_state, 1]
            A[action] += prob * (reward + gamma * V[next_state])
    best_action = np.argmax(A)
    pi[s] = np.eye(len(env.A))[best_action]
[18]: # --------------
# Debugging Cell
# --------------
# Feel free to make any changes to this cell to debug your code
gamma = 0.9
theta = 0.1
env = tools.ParkingWorld(num_spaces=6, num_prices=4)
V = np.array([7, 6, 5, 4, 3, 2, 1])
pi = np.ones((7, 4)) / 4
new_pi, stable = improve_policy(env, V, pi, gamma)
# the value function has not changed, so the greedy policy should not change
new_pi, stable = improve_policy(env, V, new_pi, gamma)
print(stable)
[19]: # -----------
# Tested Cell
# -----------
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.
gamma = 0.9
theta = 0.1
env = tools.ParkingWorld(num_spaces=10, num_prices=4)
V_answer = [81.60, 83.28, 85.03, 86.79, 88.51, 90.16, 91.70, 93.08, 94.25, 95.25, 89.45]
pi_answer = [
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 0, 1],
[0, 0, 0, 1],
]
V, pi = policy_iteration(env, gamma, theta)
# make sure the value function and policy match the correct answers
assert grader.near(V, V_answer, 1e-2)
assert np.all(pi == pi_answer)
When you are ready to test the policy iteration algorithm, run the cell below.
[20]: env = tools.ParkingWorld(num_spaces=10, num_prices=4)
gamma = 0.9
theta = 0.1
V, pi = policy_iteration(env, gamma, theta)
You can use the plot function to visualize the final value function and policy.
[21]: tools.plot(V, pi)
You can check the value function (rounded to one decimal place) and policy against the answer below:

State   Value   Action
0       81.6    0
1       83.3    0
2       85.0    0
3       86.8    0
4       88.5    0
5       90.2    0
6       91.7    0
7       93.1    0
8       94.3    0
9       95.3    3
10      89.5    3
The city has also heard about value iteration and would like you to implement it. Value iteration
works by iteratively applying the Bellman optimality equation for v∗ to a working value function,
as an update rule, as shown below.
$$v(s) \leftarrow \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v(s') \right]$$
We have written an outline of the value iteration algorithm described in chapter 4.4 of the textbook.
It is left to you to fill in the bellman_optimality_update function to complete the value iteration
algorithm.
[22]: def value_iteration(env, gamma, theta):
    V = np.zeros(len(env.S))
    while True:
        delta = 0
        for s in env.S:
            v = V[s]
            bellman_optimality_update(env, V, s, gamma)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    pi = np.ones((len(env.S), len(env.A))) / len(env.A)
    for s in env.S:
        q_greedify_policy(env, V, pi, s, gamma)
    return V, pi
[23]: # -----------
# Graded Cell
# -----------
def bellman_optimality_update(env, V, s, gamma):
    """Mutate ``V`` according to the Bellman optimality update equation."""
    # YOUR CODE HERE
    v = np.zeros(len(env.A))
    for action in env.A:
        transitions = env.transitions(s, action)
        for next_state in env.S:
            reward = transitions[next_state, 0]
            prob = transitions[next_state, 1]
            v[action] += prob * (reward + gamma * V[next_state])
    V[s] = np.max(v)
[24]: # --------------
# Debugging Cell
# --------------
# Feel free to make any changes to this cell to debug your code
gamma = 0.9
env = tools.ParkingWorld(num_spaces=6, num_prices=4)
V = np.array([7, 6, 5, 4, 3, 2, 1], dtype=float)
for s in env.S:
    bellman_optimality_update(env, V, s, gamma)
print(V)
[25]: # -----------
# Tested Cell
# -----------
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.
gamma = 0.9
env = tools.ParkingWorld(num_spaces=10, num_prices=4)
V = np.zeros(len(env.S))
for _ in range(10):
    for s in env.S:
        bellman_optimality_update(env, V, s, gamma)
# make sure value function is exactly correct
answer = [61, 63, 65, 67, 69, 71, 72, 74, 75, 76, 71]
assert np.all(V == answer)
When you are ready to test the value iteration algorithm, run the cell below.
[26]: env = tools.ParkingWorld(num_spaces=10, num_prices=4)
gamma = 0.9
theta = 0.1
V, pi = value_iteration(env, gamma, theta)
You can use the plot function to visualize the final value function and policy.
[27]: tools.plot(V, pi)
You can check your value function (rounded to one decimal place) and policy against the answer below:

State   Value   Action
0       81.6    0
1       83.3    0
2       85.0    0
3       86.8    0
4       88.5    0
5       90.2    0
6       91.7    0
7       93.1    0
8       94.3    0
9       95.3    3
10      89.5    3
In the value iteration algorithm above, a policy is not explicitly maintained until the value function has converged. Below, we have written an identically behaving value iteration algorithm that maintains an updated policy. Writing value iteration in this form makes its relationship to policy iteration more evident. Policy iteration alternates between doing complete greedifications and complete evaluations. On the other hand, value iteration alternates between doing local greedifications and local evaluations.
[28]: def value_iteration2(env, gamma, theta):
    V = np.zeros(len(env.S))
    pi = np.ones((len(env.S), len(env.A))) / len(env.A)
    while True:
        delta = 0
        for s in env.S:
            v = V[s]
            q_greedify_policy(env, V, pi, s, gamma)
            bellman_update(env, V, pi, s, gamma)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    return V, pi
You can try the second value iteration algorithm by running the cell below.
[29]: env = tools.ParkingWorld(num_spaces=10, num_prices=4)
gamma = 0.9
theta = 0.1
V, pi = value_iteration2(env, gamma, theta)
tools.plot(V, pi)
1.6 Wrapping Up