
Week 2: Markov Decision Processes

Bolei Zhou

UCLA

October 3, 2023



Announcement

1 You should have been informed if you received a PTE; email the TA if not
2 Piazza Discussion Forum: https://piazza.com/ucla/fall2023/cs260r
1 All course-related questions should go there
2 Expected response time from the TAs: < 36 hours
3 Assignment 1 will be released at
https://github.com/ucla-rlcourse/assignment-2023fall
1 Due in two weeks
4 Friday’s TA discussion session: programming and PyTorch basics
1 TAs will post a weekly schedule soon: assignments, clarification of
technical details, some recent papers, etc.
2 The discussion session is optional; you can go to either session



Plan

1 Last Week
1 Course overview
2 Basic components of RL: reward, policy, value function, action-value
function, model
3 A simplified RL task: k-armed Bandit Problem
4 Markov Decision Process (MDP)
1 Markov Chain → Markov Reward Process (MRP) → Markov Decision
Process (MDP)
2 Policy evaluation in MDP
3 Control in MDP: policy iteration and value iteration
4 Improving dynamic programming
5 Textbook of Sutton and Barto: Chapter 3 and Chapter 4



Reinforcement Learning: sequential decision making



Reward

1 A reward is a scalar feedback signal
2 The reward indicates how well the agent is doing at step t
3 The objective of RL is to maximize the cumulative reward
1 Cumulative reward: G_t = Σ_{k=0}^∞ R_{t+k+1}
2 Expected cumulative reward: E_π[G_t]
4 Example: roll a die five times and receive the face number as the
reward; what is the expected cumulative reward at the beginning? Or
before the third roll?
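As a quick check of this example, a short Monte Carlo sketch in Python: the expected cumulative reward is 5 × 3.5 = 17.5 at the beginning and 3 × 3.5 = 10.5 before the third roll (three rolls remaining). The simulation size is an arbitrary choice.

import random

def expected_cumulative_reward(rolls_left, n_sim=100000):
    """Monte Carlo estimate of E[G] when rolls_left fair-die rolls remain."""
    total = 0
    for _ in range(n_sim):
        total += sum(random.randint(1, 6) for _ in range(rolls_left))
    return total / n_sim

print(expected_cumulative_reward(5))   # ~17.5 at the beginning (5 x 3.5)
print(expected_cumulative_reward(3))   # ~10.5 before the third roll (3 x 3.5)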



Sequential Decision Making and Delayed Reward

1 Objective of the agent: select a series of actions to maximize the
cumulative reward
2 Actions may have long-term consequences
1 Reward may be delayed
2 Trade-off between immediate reward and long-term reward



Major components of an RL agent

1 An RL agent may include one or more of these components:


1 Policy π(s): agent’s behavior function
2 Value function V(s): how good is each state or action
3 Action-value function Q(s,a): how good is an action in a certain state
4 Model: agent’s state representation of the environment



Policy

1 A policy is the agent’s behavior model
2 It is a mapping from state/observation to action
3 Stochastic policy: action probability π(a|s) = P[A_t = a | S_t = s]
4 Deterministic policy: action a_t^* = arg max_a Q(s_t, a)



Value function

1 Value function: expected discounted sum of future rewards under a
particular policy π
1 It is used to quantify the goodness of states if following a particular
policy π

    v^π(s) = E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]    (1)

2 Action-value function, or Q function

    Q(s, a) = E_π[G_t | S_t = s, A_t = a] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ]    (2)



Model

1 A model predicts what the environment will do next


1 Predict the next state: P[S_{t+1} = s′ | S_t = s, A_t = a]
2 Predict the next reward: E[R_{t+1} | S_t = s, A_t = a]
2 Sometimes the world model is given, e.g., Newtonian mechanics F = ma



k-Armed Bandit Problem as a simplified RL example

1 A multi-armed bandit is a tuple ⟨A, R⟩
2 There are k actions to take at each step t
3 R_a(r) = P(r | a) is an unknown probability distribution over rewards
4 At each step t the agent selects an action a_t ∈ A, then the
environment generates a reward r_t ∼ R_{a_t}
5 The goal of the agent is to maximize the cumulative reward Σ_{τ=1}^T r_τ
import random

class BernoulliArm():
    """Arm that returns reward 1.0 with probability p, and 0.0 otherwise."""
    def __init__(self, p):
        self.p = p
    def draw(self):
        if random.random() > self.p:
            return 0.0
        else:
            return 1.0

class NormalArm():
    """Arm that returns a reward drawn from a Gaussian N(mu, sigma)."""
    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma
    def draw(self):
        return random.gauss(self.mu, self.sigma)
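A tiny usage sketch of the two arm classes above (the arm parameters are arbitrary):

arm1 = BernoulliArm(0.3)      # pays 1.0 with probability 0.3
arm2 = NormalArm(1.0, 0.5)    # pays a Gaussian reward with mean 1.0
print([arm1.draw() for _ in range(5)])
print([round(arm2.draw(), 2) for _ in range(5)])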



Definition of Value Function and Action-Value Function

1 The action-value is the mean reward for action a

    Q(a) = E[r | a]    (3)

2 The optimal value

    V^* = Q(a^*) = max_{a∈A} Q(a)    (4)

3 To estimate Q(a), we can compute Q_t(a) at step t as

    Q_t(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
           = Σ_{i=1}^{t−1} r_i · 1{A_i = a} / Σ_{i=1}^{t−1} 1{A_i = a}    (5)



Greedy Action and ε-Greedy Action to Take

1 The estimate of Q(a) at step t

    Q_t(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
           = Σ_{i=1}^{t−1} r_i · 1{A_i = a} / Σ_{i=1}^{t−1} 1{A_i = a}    (6)

2 Greedy action selection algorithm: A_t = arg max_a Q_t(a)
3 Problem with the greedy algorithm? It never explores actions whose current estimates look worse.
4 ε-Greedy: act greedily most of the time, but with a small probability ε select a
random action (ε is usually set to about 0.1)
1 with probability 1 − ε: A_t = arg max_a Q_t(a)
2 with probability ε: A_t = uniform(A)



ε-Greedy Algorithm

Algorithm 1 A simple ε-greedy bandit algorithm

1: for a = 1 to k do
2:   Q(a) ← 0, N(a) ← 0
3: end for
4: loop
5:   A ← arg max_a Q(a) with probability 1 − ε, or a uniformly random action from A with probability ε
6:   r ← bandit(A)
7:   N(A) ← N(A) + 1
8:   Q(A) ← Q(A) + (1/N(A)) [r − Q(A)]
9: end loop

Deriving the incremental update: NewEstimate = OldEstimate + StepSize [Target − OldEstimate]

    Q_t(a_t) = Q_{t−1}(a_t) + (1/N_t(a_t)) (r_t − Q_{t−1}(a_t))    (7)
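To make the loop concrete, here is a minimal Python sketch of the ε-greedy bandit algorithm above, run against the BernoulliArm class from the earlier slide; the arm probabilities, ε, and number of steps are made-up values for illustration.

import random

def epsilon_greedy_bandit(arms, epsilon=0.1, steps=1000):
    k = len(arms)
    Q = [0.0] * k          # value estimates
    N = [0] * k            # pull counts
    total_reward = 0.0
    for t in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                  # explore
        else:
            a = max(range(k), key=lambda i: Q[i])    # exploit
        r = arms[a].draw()
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                    # incremental mean, Eq. (7)
        total_reward += r
    return Q, total_reward

# Hypothetical example: three Bernoulli arms with unknown success probabilities.
arms = [BernoulliArm(0.2), BernoulliArm(0.5), BernoulliArm(0.7)]
Q, G = epsilon_greedy_bandit(arms, epsilon=0.1, steps=5000)
print("estimated Q:", [round(q, 2) for q in Q], "cumulative reward:", G)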
Why the bandit is a simplified RL task

1 No delayed reward: feedback is instantaneous
2 No change of state: there is only one state
3 Rewards are independent of previous actions



Markov Decision Process (MDP)

1 The Markov Decision Process can model many real-world problems. It
formally describes the framework of reinforcement learning
2 Under an MDP, the environment is fully observable.
1 Optimal control primarily deals with continuous MDPs
2 Partially observable problems can be converted into MDPs



Defining Three Markov Models

• Markov Processes
• Markov Reward Processes (MRPs)
• Markov Decision Processes (MDPs)



Markov Property

1 The history of states: h_t = {s_1, s_2, s_3, ..., s_t}
2 State s_t is Markovian if and only if:

    p(s_{t+1} | s_t) = p(s_{t+1} | h_t)    (8)
    p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)    (9)

3 “The future is independent of the past given the present”



Markov Process/Markov Chain

1 State transition matrix P specifies p(s_{t+1} = s′ | s_t = s)

    P = [ P(s_1|s_1)  P(s_2|s_1)  ...  P(s_N|s_1) ]
        [ P(s_1|s_2)  P(s_2|s_2)  ...  P(s_N|s_2) ]
        [     ...         ...     ...      ...    ]
        [ P(s_1|s_N)  P(s_2|s_N)  ...  P(s_N|s_N) ]



Example of MP

1 Sample episodes starting from s3


1 s3 , s4 , s5 , s6 , s6
2 s3 , s2 , s3 , s2 , s1
3 s3 , s4 , s4 , s5 , s5
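A short sketch of sampling episodes like these from a transition matrix in Python; the 7-state matrix below is a made-up stand-in for the chain shown in the figure, not the actual numbers.

import random

# Hypothetical 7-state transition matrix (each row sums to 1).
P = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.0, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]

def sample_episode(P, start, length):
    """Sample a state sequence of the given length from the Markov chain."""
    s = start
    episode = [s]
    for _ in range(length - 1):
        s = random.choices(range(len(P)), weights=P[s])[0]
        episode.append(s)
    return episode

# Episodes starting from s3 (index 2 with 0-based indexing).
print(sample_episode(P, start=2, length=5))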



Markov Reward Process (MRP)

1 Markov Reward Process is a Markov Chain + reward


2 Definition of Markov Reward Process (MRP)
1 S is a (finite) set of states (s ∈ S)
2 P is the dynamics/transition model that specifies P(S_{t+1} = s′ | s_t = s)
3 R is a reward function R(s_t = s) = E[r_t | s_t = s]
4 Discount factor γ ∈ [0, 1]
3 If there is a finite number of states, R can be represented as a vector
1 We focus on the tabular representation first
2 What happens if there are infinitely many states?



Example of MRP

Reward: +5 in s_1, +10 in s_7, and 0 in all other states, so we can
represent R = [5, 0, 0, 0, 0, 0, 10]
Return and Value Function

1 Definition of Horizon
1 Maximum number of time steps in each episode/trajectory
2 Can be infinite; otherwise it is called a finite Markov (reward) process
3 For example, roughly 100 moves per game of Go, 80 moves per game of chess
2 Definition of Return
1 Discounted sum of rewards from time step t to the horizon

    G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 R_{t+4} + ... + γ^{T−t−1} R_T

3 Definition of the state value function V_t(s) for an MRP
1 Expected return from time t in state s

    V_t(s) = E[G_t | s_t = s]
           = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... + γ^{T−t−1} R_T | s_t = s]

2 Present value of accumulated future rewards


Why Discount Factor γ

1 Avoid infinite returns in cyclic Markov processes


2 Uncertainty about the future
1 If the reward is financial, immediate rewards may earn more interest
than delayed rewards
3 Animal/human behavior shows a preference for immediate reward
4 It is sometimes possible to use undiscounted Markov reward processes
(i.e. γ = 1), e.g. if all sequences terminate.
1 γ = 0: only care about the immediate reward
2 γ = 1: future rewards are weighted the same as immediate rewards



Example of MRP

1 Reward: +5 in s_1, +10 in s_7, and 0 in all other states, so we can
represent R = [5, 0, 0, 0, 0, 0, 10]
2 Sample returns G for 3-step episodes with γ = 1/2
1 return for s_4, s_5, s_6, s_7: 0 + (1/2) × 0 + (1/4) × 10 = 2.5
2 return for s_4, s_3, s_2, s_1: 0 + (1/2) × 0 + (1/4) × 5 = 1.25
3 return for s_4, s_5, s_6, s_6: 0
3 How do we compute the value function? For example, the value of state
s_4 is V(s_4) = E[G_t | s_t = s_4]



Computing the Value of a Markov Reward Process

1 Value function: expected return starting from state s

    V(s) = E[G_t | s_t = s] = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | s_t = s]

2 The MRP value function satisfies the following Bellman equation:

    V(s) = R(s) + γ Σ_{s′∈S} P(s′|s) V(s′)

where R(s) is the immediate reward and the second term is the discounted sum of
future rewards.

3 Practice: derive the Bellman equation from the definition of V(s)
1 Hint: V(s) = E[R_{t+1} + γ E[R_{t+2} + γ R_{t+3} + γ^2 R_{t+4} + ...] | s_t = s]



Understanding Bellman Equation

1 The Bellman equation describes the recursive relation between the values of states

    V(s) = R(s) + γ Σ_{s′∈S} P(s′|s) V(s′)



Matrix Form of Bellman Equation for MRP

Therefore, we can express the value function in matrix form:

    [ V(s_1) ]   [ R(s_1) ]       [ P(s_1|s_1)  P(s_2|s_1)  ...  P(s_N|s_1) ] [ V(s_1) ]
    [ V(s_2) ] = [ R(s_2) ] + γ   [ P(s_1|s_2)  P(s_2|s_2)  ...  P(s_N|s_2) ] [ V(s_2) ]
    [   ...  ]   [   ...  ]       [     ...         ...     ...      ...    ] [   ...  ]
    [ V(s_N) ]   [ R(s_N) ]       [ P(s_1|s_N)  P(s_2|s_N)  ...  P(s_N|s_N) ] [ V(s_N) ]

    V = R + γPV

1 Analytic solution for the value of an MRP: V = (I − γP)^{−1} R
1 The matrix inverse has complexity O(N^3) for N states
2 Only feasible for small MRPs
2 Other methods to solve this?
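A minimal NumPy sketch of the analytic solution V = (I − γP)^{−1} R for the 7-state example, using the reward vector from the earlier slide and a hypothetical transition matrix (the real one is defined by the chain in the figure):

import numpy as np

# Hypothetical transition matrix for the 7-state example (rows sum to 1).
P = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.0, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
R = np.array([5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0])
gamma = 0.5

# Analytic solution V = (I - gamma * P)^{-1} R, solved as a linear system.
V = np.linalg.solve(np.eye(7) - gamma * P, R)
print(np.round(V, 3))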



Iterative Algorithm for Computing Value of a MRP

1 Dynamic Programming
2 Monte-Carlo evaluation
3 Temporal-Difference learning



Monte Carlo Algorithm for Computing Value of a MRP

Algorithm 2 Monte Carlo simulation to calculate the MRP value function

1: i ← 0, G_t ← 0
2: while i ≠ N do
3:   generate an episode starting from state s and time t
4:   using the generated episode, calculate the return g = Σ_{k=t}^{H−1} γ^{k−t} r_k
5:   G_t ← G_t + g, i ← i + 1
6: end while
7: V_t(s) ← G_t / N

1 For example, to calculate V(s_4) we can generate many trajectories and
then take the average of the returns:
1 return for s_4, s_5, s_6, s_7: 0 + (1/2) × 0 + (1/4) × 10 = 2.5
2 return for s_4, s_3, s_2, s_1: 0 + (1/2) × 0 + (1/4) × 5 = 1.25
3 return for s_4, s_5, s_6, s_6: 0
4 more trajectories
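A minimal Monte Carlo sketch of Algorithm 2 for V(s_4), using the reward vector R = [5, 0, 0, 0, 0, 0, 10] and a hypothetical 7-state transition matrix (the true transition probabilities are those of the chain in the figure):

import random

# Hypothetical transition matrix for the 7-state MRP (stand-in for the figure).
P = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.0, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
R = [5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0]

def mc_value(start, gamma=0.5, horizon=10, n_episodes=10000):
    """Monte Carlo estimate of V(start): average discounted return over episodes."""
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = start, 0.0, 1.0
        for _ in range(horizon):
            s = random.choices(range(7), weights=P[s])[0]
            g += discount * R[s]
            discount *= gamma
        total += g
    return total / n_episodes

print(mc_value(start=3))   # estimate of V(s4); s4 is index 3 with 0-based indexing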



Iterative Algorithm for Computing Value of a MRP

Algorithm 3 Iterative algorithm to calculate the MRP value function

1: for all states s ∈ S, V′(s) ← 0, V(s) ← ∞
2: while ||V − V′|| > ε do
3:   V ← V′
4:   for all states s ∈ S, V′(s) ← R(s) + γ Σ_{s′∈S} P(s′|s) V(s′)
5: end while
6: return V′(s) for all s ∈ S
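A short NumPy sketch of Algorithm 3; P and R are assumed to be an N×N transition matrix and a length-N reward vector supplied by the caller.

import numpy as np

def mrp_value_iterative(P, R, gamma=0.5, eps=1e-6):
    """Iterate V'(s) = R(s) + gamma * sum_s' P(s'|s) V(s') until convergence."""
    P, R = np.asarray(P), np.asarray(R)
    V_new = np.zeros(len(R))
    V = np.full(len(R), np.inf)
    while np.max(np.abs(V - V_new)) > eps:
        V = V_new
        V_new = R + gamma * P @ V
    return V_new

# Example call with a transition matrix P (N x N) and reward vector R (length N):
# V = mrp_value_iterative(P, R, gamma=0.5)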



Markov Decision Process (MDP)

1 A Markov Decision Process is a Markov Reward Process with decisions.
2 Definition of MDP
1 S is a finite set of states
2 A is a finite set of actions
3 P is the dynamics/transition model for each action a in state s,
P(s_{t+1} = s′ | s_t = s, a_t = a)
4 R is a reward function R(s_t = s, a_t = a) = E[r_t | s_t = s, a_t = a]
5 Discount factor γ ∈ [0, 1]
3 An MDP is a tuple (S, A, P, R, γ)



Policy in MDP

1 A policy specifies what action to take in each state
2 Given a state, it specifies a distribution over actions
3 Policy: π(a|s) = P(a_t = a | s_t = s)
4 Policies are stationary (time-independent): A_t ∼ π(a|s) for any t > 0



Policy in MDP

1 Given an MDP (S, A, P, R, γ) and a policy π
2 The state and reward sequence S_1, R_2, S_2, R_3, ... is a Markov reward
process (S, P^π, R^π, γ), where

    P^π(s′|s) = Σ_{a∈A} π(a|s) P(s′|s, a)
    R^π(s) = Σ_{a∈A} π(a|s) R(s, a)
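A small NumPy sketch of this reduction from an MDP plus a policy to the induced MRP; the array shapes (P as [A, S, S], R as [S, A], pi as [S, A]) are assumptions made for the example.

import numpy as np

def mdp_to_mrp(P, R, pi):
    """Collapse an MDP and a policy into the induced MRP (P_pi, R_pi).

    Assumed shapes: P[a, s, s2] = P(s2 | s, a), R[s, a] = R(s, a),
    and pi[s, a] = pi(a | s).
    """
    P_pi = np.einsum('sa,ast->st', pi, P)   # P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    R_pi = np.sum(pi * R, axis=1)           # R_pi(s)    = sum_a pi(a|s) R(s,a)
    return P_pi, R_pi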



Comparison of MP/MRP and MDP



Value function for MDP

1 The state-value function v^π(s) of an MDP is the expected return
starting from state s and following policy π

    v^π(s) = E_π[G_t | s_t = s]    (10)

2 The action-value function q^π(s, a) is the expected return starting
from state s, taking action a, and then following policy π

    q^π(s, a) = E_π[G_t | s_t = s, A_t = a]    (11)

3 The relation between v^π(s) and q^π(s, a):

    v^π(s) = Σ_{a∈A} π(a|s) q^π(s, a)    (12)



Bellman Expectation Equation

1 The state-value function can be decomposed into the immediate reward
plus the discounted value of the successor state,

    v^π(s) = E_π[R_{t+1} + γ v^π(s_{t+1}) | s_t = s]    (13)

2 The action-value function can be similarly decomposed,

    q^π(s, a) = E_π[R_{t+1} + γ q^π(s_{t+1}, A_{t+1}) | s_t = s, A_t = a]    (14)



Bellman Expectation Equation for V^π and Q^π

    v^π(s) = Σ_{a∈A} π(a|s) q^π(s, a)    (15)
    q^π(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^π(s′)    (16)

Thus

    v^π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^π(s′) )    (17)
    q^π(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) Σ_{a′∈A} π(a′|s′) q^π(s′, a′)    (18)



Backup Diagram for V^π

    v^π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^π(s′) )    (19)



Backup Diagram for Q^π

    q^π(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) Σ_{a′∈A} π(a′|s′) q^π(s′, a′)    (20)



Policy Evaluation

1 Evaluate the value of each state under a given policy π: compute v^π(s)
2 Also called (value) prediction



Example: Navigate the boat

Figure: Markov Chain/MRP: Go with river stream

Figure: MDP: Navigate the boat



Example: Policy Evaluation

1 Two actions: Left or Right
2 For all actions, reward: +5 in s_1, +10 in s_7, and 0 in all other states, so
we can represent R = [5, 0, 0, 0, 0, 0, 10]
3 Take a deterministic policy π(s) = Left with γ = 0 for every state
s; what is the value of the policy?
1 V^π = [5, 0, 0, 0, 0, 0, 10], since γ = 0



Example: Policy Evaluation

1 R = [5, 0, 0, 0, 0, 0, 10]
2 Practice 1: deterministic policy π(s) = Left with γ = 0.5 for every
state s; what are the state values under this policy?
3 Practice 2: stochastic policy P(π(s) = Left) = 0.5 and
P(π(s) = Right) = 0.5 with γ = 0.5 for every state s; what are the
state values under this policy?
4 Iteration t:

    v_t^π(s) = Σ_a P(π(s) = a) ( r(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_{t−1}^π(s′) )
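A small iterative policy-evaluation sketch for these two practice questions. It assumes a deterministic "boat" dynamics in which Left moves one state toward s_1 and Right one state toward s_7, and that the reward depends only on the current state (consistent with V^π = R when γ = 0 on the previous slide); the real dynamics are given by the figure.

import numpy as np

R = np.array([5.0, 0, 0, 0, 0, 0, 10.0])   # reward of each state (assumption)
gamma = 0.5
N = 7

def next_state(s, a):
    """Assumed deterministic dynamics: a = 0 moves Left, a = 1 moves Right."""
    return max(s - 1, 0) if a == 0 else min(s + 1, N - 1)

def evaluate(pi_left, iters=200):
    """Iterative policy evaluation; pi_left = probability of choosing Left."""
    v = np.zeros(N)
    for _ in range(iters):
        v_new = np.zeros(N)
        for s in range(N):
            for a, p in ((0, pi_left), (1, 1.0 - pi_left)):
                v_new[s] += p * (R[s] + gamma * v[next_state(s, a)])
        v = v_new
    return v

print(np.round(evaluate(pi_left=1.0), 3))   # Practice 1: always Left
print(np.round(evaluate(pi_left=0.5), 3))   # Practice 2: 50/50 Left/Right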



Decision Making in Markov Decision Process (MDP)

1 Prediction (evaluate a given policy):
1 Input: MDP ⟨S, A, P, R, γ⟩ and policy π, or MRP ⟨S, P^π, R^π, γ⟩
2 Output: value function v^π
2 Control (search for the optimal policy):
1 Input: MDP ⟨S, A, P, R, γ⟩
2 Output: optimal value function v^* and optimal policy π^*
3 Prediction and control in an MDP can be solved by dynamic
programming.



Dynamic Programming

Dynamic Programming is a very general solution method for problems
that have two properties:
1 Optimal substructure
1 Principle of optimality applies
2 Optimal solution can be decomposed into subproblems
2 Overlapping subproblems
1 Subproblems recur many times
2 Solutions can be cached and reused
Markov decision processes satisfy both properties
1 Bellman equation gives recursive decomposition
2 Value function stores and reuses solutions



Prediction: Policy evaluation on MDP

1 Objective: evaluate a given policy π for an MDP
2 Output: the value function v^π under the policy
3 Solution: iterate the Bellman expectation backup
4 Algorithm: synchronous backup
1 At each iteration t + 1, update v_{t+1}(s) from v_t(s′) for all states
s ∈ S, where s′ is a successor state of s

    v_{t+1}(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_t(s′) )    (21)

5 Convergence: v_1 → v_2 → ... → v^π



Policy evaluation: Iteration on Bellman expectation backup

Bellman expectation backup for a particular policy:

    v_{t+1}(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_t(s′) )    (22)

Or, in the form of the induced MRP ⟨S, P^π, R^π, γ⟩:

    v_{t+1}(s) = R^π(s) + γ Σ_{s′∈S} P^π(s′|s) v_t(s′)    (23)



Evaluating a Random Policy in the Small Gridworld
Example 4.1 in the Sutton RL textbook.

1 Undiscounted episodic MDP (γ = 1)
2 Nonterminal states 1, ..., 14
3 Two terminal states (the two shaded squares)
4 Actions leading out of the grid leave the state unchanged, e.g., P(7|7, right) = 1
5 Reward is −1 until the terminal state is reached
6 Transitions are deterministic given the action, e.g., P(6|5, right) = 1
7 Uniform random policy π(l|·) = π(r|·) = π(u|·) = π(d|·) = 0.25
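A compact Python sketch of iterative policy evaluation for this 4×4 gridworld, evaluating the uniform random policy; the grid layout (terminal states in two opposite corners) follows the textbook's Example 4.1, and the iteration count is an arbitrary choice.

import numpy as np

GRID = 4
TERMINAL = {(0, 0), (GRID - 1, GRID - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    if state in TERMINAL:
        return state, 0.0
    r, c = state
    dr, dc = action
    nr, nc = min(max(r + dr, 0), GRID - 1), min(max(c + dc, 0), GRID - 1)
    return (nr, nc), -1.0

def evaluate_random_policy(iters=1000, gamma=1.0):
    V = np.zeros((GRID, GRID))
    for _ in range(iters):
        V_new = np.zeros((GRID, GRID))
        for r in range(GRID):
            for c in range(GRID):
                if (r, c) in TERMINAL:
                    continue
                for a in ACTIONS:   # uniform random policy, pi(a|s) = 0.25
                    (nr, nc), reward = step((r, c), a)
                    V_new[r, c] += 0.25 * (reward + gamma * V[nr, nc])
        V = V_new
    return V

print(np.round(evaluate_random_policy(), 1))   # approaches the textbook values (e.g., -14, -20, -22)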



Evaluating a Random Policy in the Small Gridworld

1 Iteratively evaluate the random policy



A live demo on policy evaluation

    v^π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^π(s′) )    (24)

1 https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html



Practice: Gridworld

Textbook Example 3.5: GridWorld



MDP Control

1 Compute the optimal policy

    π^*(s) = arg max_π v^π(s)    (25)

2 The optimal policy for an MDP in an infinite-horizon problem (the agent
acts forever) is
1 Deterministic
2 Stationary (does not depend on the time step)
3 Unique? Not necessarily; there may be state-actions with identical
optimal values



Optimal Value Function

1 The optimal state-value function v^*(s) is the maximum value
function over all policies

    v^*(s) = max_π v^π(s)

2 The optimal policy

    π^*(s) = arg max_π v^π(s)

3 An MDP is “solved” when we know the optimal value function
4 There exists a unique optimal value function, but there can be multiple
optimal policies (e.g., two actions that have the same optimal value)



Finding Optimal Policy

1 An optimal policy can be found by maximizing over q^*(s, a):

    π^*(a|s) = 1 if a = arg max_{a∈A} q^*(s, a), and 0 otherwise

2 There is always a deterministic optimal policy for any MDP
3 If we know q^*(s, a), we immediately have the optimal policy



Policy Search

1 One option is to enumerate all policies and pick the best
2 The number of deterministic policies is |A|^{|S|}
3 Other approaches, such as policy iteration and value iteration, are more
efficient
1 Policy iteration
2 Value iteration



Improving a Policy through Policy Iteration

1 Iterate between two steps:
1 Evaluate the policy π (compute v^π given the current π)
2 Improve the policy by acting greedily with respect to v^π

    π′ = greedy(v^π)    (26)



Policy Improvement
1 Compute the state-action value of the current policy π_i:

    q^{π_i}(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^{π_i}(s′)    (27)

2 Compute the new policy π_{i+1} for all s ∈ S by

    π_{i+1}(s) = arg max_a q^{π_i}(s, a)    (28)
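A tabular policy-iteration sketch under assumed array shapes (P as [A, S, S], R as [S, A]); it illustrates Eqs. (27)-(28) combined with iterative policy evaluation, and is not the course's reference implementation.

import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_iters=500):
    """Tabular policy iteration. Assumed shapes: P is [A, S, S], R is [S, A]."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)   # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: iterate the Bellman expectation backup, Eq. (22)
        v = np.zeros(n_states)
        for _ in range(eval_iters):
            v = R[np.arange(n_states), pi] + gamma * P[pi, np.arange(n_states)] @ v
        # Policy improvement: act greedily w.r.t. q^{pi_i}, Eqs. (27)-(28)
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        pi_new = np.argmax(q, axis=1)
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new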



Monotonic Improvement in Policy
1 Consider a deterministic policy a = π(s)
2 We improve the policy through

    π′(s) = arg max_a q^π(s, a)

3 This improves the value from any state s over one step,

    q^π(s, π′(s)) = max_{a∈A} q^π(s, a) ≥ q^π(s, π(s)) = v^π(s)

4 It therefore improves the value function, v^{π′}(s) ≥ v^π(s):

    v^π(s) ≤ q^π(s, π′(s)) = E_{π′}[R_{t+1} + γ v^π(S_{t+1}) | S_t = s]
           ≤ E_{π′}[R_{t+1} + γ q^π(S_{t+1}, π′(S_{t+1})) | S_t = s]
           ≤ E_{π′}[R_{t+1} + γ R_{t+2} + γ^2 q^π(S_{t+2}, π′(S_{t+2})) | S_t = s]
           ≤ E_{π′}[R_{t+1} + γ R_{t+2} + ... | S_t = s] = v^{π′}(s)
Monotonic Improvement in Policy

1 If improvements stop,

    q^π(s, π′(s)) = max_{a∈A} q^π(s, a) = q^π(s, π(s)) = v^π(s)

2 then the Bellman optimality equation has been satisfied,

    v^π(s) = max_{a∈A} q^π(s, a)

3 Therefore v^π(s) = v^*(s) for all s ∈ S, so π is an optimal policy



Bellman Optimality Equation

1 The optimal value functions satisfy the Bellman optimality
equations:

    v^*(s) = max_a q^*(s, a)
    q^*(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^*(s′)

thus

    v^*(s) = max_a ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^*(s′) )
    q^*(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) max_{a′} q^*(s′, a′)



Value Iteration: Turning the Bellman Optimality Equation into an Update Rule

1 Suppose we know the solution to the subproblems v^*(s′), which is optimal.
2 Then the optimal v^*(s) can be found by iterating
the following Bellman optimality backup rule,

    v(s) ← max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v(s′) )

3 The idea of value iteration is to apply these updates iteratively



Algorithm of Value Iteration

1 Objective: find the optimal policy π
2 Solution: iterate the Bellman optimality backup
3 Value iteration algorithm:
1 initialize k = 1 and v_0(s) = 0 for all states s
2 for k = 1 : H
1 for each state s

    q_{k+1}(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_k(s′)    (29)
    v_{k+1}(s) = max_a q_{k+1}(s, a)    (30)

2 k ← k + 1
3 To retrieve the optimal policy after value iteration:

    π(s) = arg max_a ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_{k+1}(s′) )    (31)
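A matching value-iteration sketch under the same assumed array shapes as the policy-iteration example (P as [A, S, S], R as [S, A]), implementing Eqs. (29)-(31).

import numpy as np

def value_iteration(P, R, gamma=0.9, horizon=1000, tol=1e-8):
    """Tabular value iteration with the Bellman optimality backup."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(horizon):
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # q_{k+1}(s, a), Eq. (29)
        v_new = q.max(axis=1)                          # v_{k+1}(s),   Eq. (30)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    # Policy extraction, Eq. (31)
    pi = np.argmax(R + gamma * np.einsum('ast,t->sa', P, v), axis=1)
    return pi, v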



Example: Shortest Path

After the optimal values are reached, we run policy extraction to retrieve
the optimal policy.



Difference between Policy Iteration and Value Iteration

1 Policy iteration includes: policy evaluation + policy improvement,
and the two are repeated iteratively until the policy converges.
2 Value iteration includes: finding the optimal value function + one
policy extraction. The two are not repeated, because once the
value function is optimal, the policy extracted from it is also
optimal (i.e., converged).
3 Finding the optimal value function can also be seen as a combination of
policy improvement (due to the max) and truncated policy evaluation
(the reassignment of v(s) after just one sweep of all states, regardless
of convergence).



Summary for Prediction and Control in MDP

Table: Dynamic Programming Algorithms


Problem      Bellman Equation               Algorithm
Prediction   Bellman Expectation Equation   Iterative Policy Evaluation
Control      Bellman Expectation Equation   Policy Iteration
Control      Bellman Optimality Equation    Value Iteration



Demo of policy iteration and value iteration

1 Policy iteration: iteration of policy evaluation and policy
improvement (update)
2 Value iteration
3 https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
Policy iteration and value iteration on FrozenLake

1 https://github.com/ucla-rlcourse/RLexample/tree/master/MDP



Improving Dynamic Programming

1 A major drawback to the DP methods is that they involve operations


over the entire state set of the MDP, that is, they require sweeps of
the state set.
2 The state set can be very large; for example, the game of backgammon
has over 10^20 states, so a single sweep could take thousands of years.
3 Asynchronous DP algorithms are in-place iterative DP methods that are not
organized in terms of systematic sweeps of the state set
4 The values of some states may be updated several times before the
values of others are updated once.



Improving Dynamic Programming

Synchronous dynamic programming is usually slow. Three simple ideas
extend DP to asynchronous dynamic programming:
1 In-place dynamic programming
2 Prioritized sweeping
3 Real-time dynamic programming



In-Place Dynamic Programming

1 Synchronous value iteration stores two copies of the value function:

for all s in S:
    v_new(s) ← max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_old(s′) )
v_old ← v_new

2 In-place value iteration stores only one copy of the value function:

for all s in S:
    v(s) ← max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v(s′) )



Prioritized Sweeping

1 Use the magnitude of the Bellman error to guide state selection, e.g.

    | max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v(s′) ) − v(s) |

2 Backup the state with the largest remaining Bellman error


3 Update Bellman error of affected states after each backup
4 Can be implemented efficiently by maintaining a priority queue



Real-Time Dynamic Programming

1 To solve a given MDP, we can run an iterative DP algorithm at the


same time that an agent is actually experiencing the MDP
2 The agent’s experience can be used to determine the states to which
the DP algorithm applies its updates
3 We can apply updates to states as the agent visits them, focusing on
the parts of the state set that are most relevant to the agent
4 After each time step (S_t, A_t), back up the state S_t:

    v(S_t) ← max_{a∈A} ( R(S_t, a) + γ Σ_{s′∈S} P(s′|S_t, a) v(s′) )



Sample Backups

1 The key design behind RL algorithms such as Q-learning and SARSA in the
next lectures
2 Use sample rewards and sample transitions ⟨S, A, R, S′⟩
rather than the reward function R and the transition dynamics P
3 Benefits:
1 Model-free: no advance knowledge of the MDP required
2 Breaks the curse of dimensionality through sampling
3 The cost of a backup is constant, independent of n = |S|



Approximate Dynamic Programming

1 Use a function approximator v̂(s, w)
2 Fitted value iteration repeats at each iteration k:
1 Sample states s from the state cache S̃

    ṽ_k(s) = max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v̂(s′, w_k) )

2 Train the next value function v̂(s, w_{k+1}) using the targets ⟨s, ṽ_k(s)⟩
3 This is the key idea behind Deep Q-Learning
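A rough sketch of one fitted value-iteration step with a linear approximator v̂(s, w) = w·φ(s); the feature map, the least-squares fit, and the array shapes are illustrative assumptions, not the lecture's specification.

import numpy as np

def fitted_vi_step(P, R, W, phi, states, gamma=0.9):
    """One fitted value-iteration step.

    Assumptions: P is [A, S, S], R is [S, A], phi(s) returns a feature vector,
    and the approximator is linear, v_hat(s, w) = w . phi(s).
    """
    Phi = np.stack([phi(s) for s in range(P.shape[1])])   # [S, d] feature matrix
    v_hat = Phi @ W                                       # current v_hat(s', w_k)
    q = R + gamma * np.einsum('ast,t->sa', P, v_hat)
    targets = q[states].max(axis=1)                       # tilde v_k(s) on sampled states
    # Fit w_{k+1} by least squares on the sampled targets <s, tilde v_k(s)>
    W_next, *_ = np.linalg.lstsq(Phi[states], targets, rcond=None)
    return W_next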



End

1 Summary: MDP, policy evaluation, policy iteration, and value
iteration
2 Next week: model-free methods
3 Reading: textbook Chapter 5 and Chapter 6

