Assignment 2
Welcome to Assignment 2. This notebook will help you understand:
- Policy Evaluation and Policy Improvement
- Value and Policy Iteration
- Bellman Equations
Gridworld City, a thriving metropolis with a booming technology industry, has recently experienced
an influx of grid-loving software engineers. Unfortunately, the city’s street parking system, which
charges a fixed rate, is struggling to keep up with the increased demand. To address this, the city
council has decided to modify the pricing scheme to better promote social welfare. In general, the
city considers social welfare higher when more parking is being used, the exception being that the
city prefers that at least one spot is left unoccupied (so that it is available in case someone really
needs it). The city council has created a Markov decision process (MDP) to model the demand
for parking with a reward function that reflects its preferences. Now the city has hired you — an
expert in dynamic programming — to help determine an optimal policy.
1.2 Preliminaries
You’ll need two imports to complete this assignment:
- numpy: The fundamental package for scientific computing with Python.
- tools: A module containing an environment and a plotting function.
There are also some other lines in the cell below that are used for grading and plotting — you
needn’t worry about them.
In this notebook, all cells are locked except those that you are explicitly asked to modify. It is up
to you to decide how to implement your solution in these cells, but please do not import other
libraries — doing so will break the autograder.
[1]: %matplotlib inline
import numpy as np
import tools
import grader
In the city council’s parking MDP, states are nonnegative integers indicating how many parking
spaces are occupied, actions are nonnegative integers designating the price of street parking, the
reward is a real value describing the city’s preference for the situation, and time is discretized by
hour. As might be expected, charging a high price is likely to decrease occupancy over the hour,
while charging a low price is likely to increase it.
For now, let’s consider an environment with three parking spaces and three price points. Note that
an environment with three parking spaces actually has four states — zero, one, two, or three spaces
could be occupied.
[2]: # ---------------
# Discussion Cell
# ---------------
num_spaces = 3
num_prices = 3
env = tools.ParkingWorld(num_spaces, num_prices)
V = np.zeros(num_spaces + 1)
pi = np.ones((num_spaces + 1, num_prices)) / num_prices
The value function is a one-dimensional array where the i-th entry gives the value of i spaces being
occupied.
[3]: V
We can represent the policy as a two-dimensional array where the (i, j)-th entry gives the probability
of taking action j in state i.
[4]: pi
[6]: V[0] = 10
pi[0] = [0.75, 0.15, 0.10]  # give action 0 the highest probability in state 0
tools.plot(V, pi)
We can visualize a value function and policy with the plot function in the tools module. On
the left, the value function is displayed as a barplot. State zero has an expected return of ten,
while the other states have an expected return of zero. On the right, the policy is displayed on a
two-dimensional grid. Each vertical strip gives the policy at the labeled state. In state zero, action
zero is the darkest because the agent’s policy makes this choice with the highest probability. In the
other states the agent has the equiprobable policy, so the vertical strips are colored uniformly.
You can access the state space and the action set as attributes of the environment.
[7]: env.S
[7]: [0, 1, 2, 3]
[8]: env.A
[8]: [0, 1, 2]
You will need to use the environment’s transitions method to complete this assignment. The
method takes a state and an action and returns a 2-dimensional array, where the entry at (i, 0) is
the reward for transitioning to state i from the current state and the entry at (i, 1) is the conditional
probability of transitioning to state i given the current state and action.
[9]: state = 3
action = 1
transitions = env.transitions(state, action)
transitions
(Output: a 4 × 2 array; row i gives the reward for moving to state i and the probability of that transition.)
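As a quick illustration (not a graded cell), the rewards and probabilities in this array can be combined with a value function to form the expected one-step return that the Bellman updates below are built from; the discount factor of 0.9 here is simply the value used later in the notebook.

# Sketch: expected one-step return from state 3 under the uniform random
# policy `pi`, computed directly from the transitions array.
state = 3
gamma = 0.9
expected_return = 0.0
for action in env.A:
    transitions = env.transitions(state, action)
    for next_state in env.S:
        reward = transitions[next_state, 0]
        prob = transitions[next_state, 1]
        expected_return += pi[state, action] * prob * (reward + gamma * V[next_state])
print(expected_return)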
You’re now ready to begin the assignment! First, the city council would like you to evaluate the
quality of the existing pricing scheme. Policy evaluation works by iteratively applying the Bellman
equation for vπ to a working value function, as an update rule, as shown below.
$$v(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v(s') \right]$$
This update can either occur “in-place” (i.e. the update rule is sequentially applied to each state)
or with “two-arrays” (i.e. the update rule is simultaneously applied to each state). Both versions
converge to vπ but the in-place version usually converges faster. In this assignment, we will
be implementing all update rules in-place, as is done in the pseudocode of chapter 4 of the
textbook.
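For contrast, here is a minimal sketch of the two-array variant (not used in this assignment, and assuming the env.transitions interface described above); every update in a sweep reads only the previous sweep's values and writes into a fresh array.

# Sketch of "two-array" policy evaluation, for comparison with the in-place
# version used in the graded cells below.
def evaluate_policy_two_arrays(env, V, pi, gamma, theta):
    delta = float('inf')
    while delta > theta:
        new_V = np.zeros(len(env.S))
        for s in env.S:
            for a in env.A:
                transitions = env.transitions(s, a)
                for s_next in env.S:
                    reward, prob = transitions[s_next, 0], transitions[s_next, 1]
                    new_V[s] += pi[s][a] * prob * (reward + gamma * V[s_next])
        delta = np.max(np.abs(new_V - V))
        V = new_V
    return V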
We have written an outline of the policy evaluation algorithm described in chapter 4.1 of the
textbook. It is left to you to fill in the bellman_update function to complete the algorithm.
[11]: # lock
def evaluate_policy(env, V, pi, gamma, theta):
    delta = float('inf')
    while delta > theta:
        delta = 0
        for s in env.S:
            v = V[s]
            bellman_update(env, V, pi, s, gamma)
            delta = max(delta, abs(v - V[s]))
    return V
[12]: # -----------
# Graded Cell
# -----------
def bellman_update(env, V, pi, s, gamma):
    """Mutate ``V`` according to the Bellman update equation."""
    # YOUR CODE HERE
    v = 0
    for action in env.A:
        action_prob = pi[s][action]
        transitions = env.transitions(s, action)
        for next_state in env.S:
            reward = transitions[next_state, 0]
            prob = transitions[next_state, 1]
            v += action_prob * prob * (reward + gamma * V[next_state])
    V[s] = v
The cell below uses the policy evaluation algorithm to evaluate the city’s policy, which charges a
constant price of one.
[13]: # --------------
# Debugging Cell
# --------------
# Feel free to make any changes to this cell to debug your code
gamma = 0.9
theta = 0.1
# the city charges a constant price of one
city_policy = np.zeros((num_spaces + 1, num_prices))
city_policy[:, 1] = 1
V = np.zeros(num_spaces + 1)
V = evaluate_policy(env, V, city_policy, gamma, theta)
print(V)
[14]: # -----------
# Tested Cell
# -----------
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.
# build test policy
city_policy = np.zeros((num_spaces + 1, num_prices))
city_policy[:, 1] = 1
gamma = 0.9
theta = 0.1
V = np.zeros(num_spaces + 1)
V = evaluate_policy(env, V, city_policy, gamma, theta)
# make sure the value function is within 2 decimal places of the correct answer
assert grader.near(V, answer, 1e-2)
You can use the plot function to visualize the final value function and policy.
[15]: # lock
tools.plot(V, city_policy)
Observe that the value function qualitatively resembles the city council’s preferences: it
monotonically increases as more parking is used, until there is no parking left, in which case the value
is lower. Because of the relatively simple reward function (more reward is accrued when many
but not all parking spots are taken and less reward is accrued when few or all parking spots are
taken) and the highly stochastic dynamics function (each state has positive probability of being
reached each time step) the value functions of most policies will qualitatively resemble this graph.
However, depending on the intelligence of the policy, the scale of the graph will differ. In other
words, better policies will increase the expected return at every state rather than changing the
relative desirability of the states. Intuitively, the value of a less desirable state can be increased
by making it less likely to remain in a less desirable state. Similarly, the value of a more desirable
state can be increased by making it more likely to remain in a more desirable state. That is to
say, good policies are policies that spend more time in desirable states and less time in undesirable
states. As we will see in this assignment, such a steady state distribution is achieved by setting the
price to be low in low occupancy states (so that the occupancy will increase) and setting the price
high when occupancy is high (so that full occupancy will be avoided).
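To make this pricing intuition concrete, here is a small sketch (discussion only, using the three-space environment from above) that prints the expected next-hour occupancy from state 2 at each price; if the dynamics behave as described earlier, higher prices should give lower expected occupancy.

# Sketch: expected next-hour occupancy from state 2 (two of three spaces
# taken), for each available price.
for price in env.A:
    transitions = env.transitions(2, price)
    expected_occupancy = sum(s * transitions[s, 1] for s in env.S)
    print(price, round(float(expected_occupancy), 2))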
Now the city council would like you to compute a more efficient policy using policy iteration. Policy
iteration works by alternating between evaluating the existing policy and making the policy greedy
with respect to the existing value function. We have written an outline of the policy iteration
algorithm described in chapter 4.3 of the textbook. We will make use of the policy evaluation
algorithm you completed in section 1. It is left to you to fill in the q_greedify_policy function,
such that it modifies the policy at s to be greedy with respect to the q-values at s, to complete the
policy improvement algorithm.
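Here, the q-values at s are the one-step lookahead values computed from the current value function, and the greedified policy puts all of its probability on a maximizing action:

$$q(s, a) = \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v(s') \right], \qquad \pi(s) \leftarrow \arg\max_{a} q(s, a)$$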
[16]: def improve_policy(env, V, pi, gamma):
    policy_stable = True
    for s in env.S:
        old = pi[s].copy()
        q_greedify_policy(env, V, pi, s, gamma)
        if not np.array_equal(pi[s], old):
            policy_stable = False
    return pi, policy_stable
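The outline’s driver routine, policy_iteration, is not reproduced above; a minimal sketch consistent with the description (alternating full evaluation and full improvement until the policy is stable) and with the calls in the cells below might look like this:

# Sketch: alternate policy evaluation and policy improvement until the
# improvement step no longer changes the policy.
def policy_iteration(env, gamma, theta):
    V = np.zeros(len(env.S))
    pi = np.ones((len(env.S), len(env.A))) / len(env.A)
    policy_stable = False
    while not policy_stable:
        V = evaluate_policy(env, V, pi, gamma, theta)
        pi, policy_stable = improve_policy(env, V, pi, gamma)
    return V, pi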
[17]: # -----------
# Graded Cell
# -----------
def q_greedify_policy(env, V, pi, s, gamma):
    """Mutate ``pi`` to be greedy with respect to the q-values induced by ``V``."""
    # YOUR CODE HERE
    A = np.zeros(len(env.A))
    for action in env.A:
        transitions = env.transitions(s, action)
        for next_state in env.S:
            reward = transitions[next_state, 0]
            prob = transitions[next_state, 1]
            A[action] += prob * (reward + gamma * V[next_state])
    best_action = np.argmax(A)
    pi[s] = np.eye(len(env.A))[best_action]
[18]: # --------------
# Debugging Cell
# --------------
# Feel free to make any changes to this cell to debug your code
gamma = 0.9
theta = 0.1
env = tools.ParkingWorld(num_spaces=6, num_prices=4)
V = np.array([7, 6, 5, 4, 3, 2, 1])
pi = np.ones((7, 4)) / 4
new_pi, stable = improve_policy(env, V, pi, gamma)
# the value function has not changed, so the greedy policy should not change
new_pi, stable = improve_policy(env, V, new_pi, gamma)
print(stable)
[19]: # -----------
# Tested Cell
# -----------
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.
gamma = 0.9
theta = 0.1
env = tools.ParkingWorld(num_spaces=10, num_prices=4)
V_answer = [81.60, 83.28, 85.03, 86.79, 88.51, 90.16, 91.70, 93.08, 94.25, 95.25, 89.45]
pi_answer = [
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[1, 0, 0, 0],
[0, 0, 0, 1],
[0, 0, 0, 1],
]
V, pi = policy_iteration(env, gamma, theta)
# make sure the value function and policy match the correct answers
assert grader.near(V, V_answer, 1e-2)
assert np.all(pi == pi_answer)
When you are ready to test the policy iteration algorithm, run the cell below.
[20]: env = tools.ParkingWorld(num_spaces=10, num_prices=4)
gamma = 0.9
theta = 0.1
V, pi = policy_iteration(env, gamma, theta)
You can use the plot function to visualize the final value function and policy.
[21]: tools.plot(V, pi)
You can check the value function (rounded to one decimal place) and policy against the answer below:

State   Value   Action
0       81.6    0
1       83.3    0
2       85.0    0
3       86.8    0
4       88.5    0
5       90.2    0
6       91.7    0
7       93.1    0
8       94.3    0
9       95.3    3
10      89.5    3
The city has also heard about value iteration and would like you to implement it. Value iteration
works by iteratively applying the Bellman optimality equation for v∗ to a working value function,
as an update rule, as shown below.
$$v(s) \leftarrow \max_{a} \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma v(s') \right]$$
We have written an outline of the value iteration algorithm described in chapter 4.4 of the textbook.
It is left to you to fill in the bellman_optimality_update function to complete the value iteration
algorithm.
[22]: def value_iteration(env, gamma, theta):
    V = np.zeros(len(env.S))
    while True:
        delta = 0
        for s in env.S:
            v = V[s]
            bellman_optimality_update(env, V, s, gamma)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    pi = np.ones((len(env.S), len(env.A))) / len(env.A)
    for s in env.S:
        q_greedify_policy(env, V, pi, s, gamma)
    return V, pi
[23]: # -----------
# Graded Cell
# -----------
def bellman_optimality_update(env, V, s, gamma):
    """Mutate ``V`` according to the Bellman optimality update equation."""
    # YOUR CODE HERE
    v = np.zeros(len(env.A))
    for action in env.A:
        transitions = env.transitions(s, action)
        for next_state in env.S:
            reward = transitions[next_state, 0]
            prob = transitions[next_state, 1]
            v[action] += prob * (reward + gamma * V[next_state])
    V[s] = np.max(v)
[24]: # --------------
# Debugging Cell
# --------------
# Feel free to make any changes to this cell to debug your code
gamma = 0.9
env = tools.ParkingWorld(num_spaces=6, num_prices=4)
V = np.array([7, 6, 5, 4, 3, 2, 1], dtype=float)
for s in env.S:
    bellman_optimality_update(env, V, s, gamma)
print(V)
[25]: # -----------
# Tested Cell
# -----------
# The contents of the cell will be tested by the autograder.
# If they do not pass here, they will not pass there.
gamma = 0.9
env = tools.ParkingWorld(num_spaces=10, num_prices=4)
V = np.zeros(len(env.S))
for _ in range(10):
    for s in env.S:
        bellman_optimality_update(env, V, s, gamma)
# make sure value function is exactly correct
answer = [61, 63, 65, 67, 69, 71, 72, 74, 75, 76, 71]
assert np.all(V == answer)
When you are ready to test the value iteration algorithm, run the cell below.
[26]: env = tools.ParkingWorld(num_spaces=10, num_prices=4)
gamma = 0.9
theta = 0.1
V, pi = value_iteration(env, gamma, theta)
You can use the plot function to visualize the final value function and policy.
[27]: tools.plot(V, pi)
You can check your value function (rounded to one decimal place) and policy against the answer below:

State   Value   Action
0       81.6    0
1       83.3    0
2       85.0    0
3       86.8    0
4       88.5    0
5       90.2    0
6       91.7    0
7       93.1    0
8       94.3    0
9       95.3    3
10      89.5    3
In the value iteration algorithm above, a policy is not explicitly maintained until the value function has converged. Below, we have written an identically behaving value iteration algorithm that maintains an updated policy. Writing value iteration in this form makes its relationship to policy iteration more evident. Policy iteration alternates between doing complete greedifications and complete evaluations. On the other hand, value iteration alternates between doing local greedifications and local evaluations.
[28]: def value_iteration2(env, gamma, theta):
    V = np.zeros(len(env.S))
    pi = np.ones((len(env.S), len(env.A))) / len(env.A)
    while True:
        delta = 0
        for s in env.S:
            v = V[s]
            q_greedify_policy(env, V, pi, s, gamma)
            bellman_update(env, V, pi, s, gamma)
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break
    return V, pi
You can try the second value iteration algorithm by running the cell below.
[29]: env = tools.ParkingWorld(num_spaces=10, num_prices=4)
gamma = 0.9
theta = 0.1
V, pi = value_iteration2(env, gamma, theta)
tools.plot(V, pi)
1.6 Wrapping Up