
Week 2: Markov Decision Processes

Bolei Zhou

UCLA

October 3, 2023



Announcement

1 You should have been informed if you received a PTE; email the TA if not
2 Piazza Discussion Forum: https://piazza.com/ucla/fall2023/cs260r
1 All course-related questions should go there
2 Expected response time from the TAs: < 36 hours
3 Assignment 1 will be released at
https://github.com/ucla-rlcourse/assignment-2023fall
1 Due in two weeks
4 Friday’s TA discussion session: programming and PyTorch basics
1 TAs will post a weekly schedule soon: assignments, clarification of
technical details, some recent papers, etc.
2 The discussion session is optional; you can go to either session



Plan

1 Last Week
1 Course overview
2 Basic components of RL: reward, policy, value function, action-value
function, model
3 A simplified RL task: k-armed Bandit Problem
4 Markov Decision Process (MDP)
1 Markov Chain → Markov Reward Process (MRP) → Markov Decision
Process (MDP)
2 Policy evaluation in MDP
3 Control in MDP: policy iteration and value iteration
4 Improving dynamic programming
5 Textbook of Sutton and Barto: Chapter 3 and Chapter 4



Reinforcement Learning: sequential decision making



Reward

1 A reward is a scalar feedback signal
2 The reward indicates how well the agent is doing at step t
3 The objective of RL is to maximize the cumulative reward
1 Cumulative reward: G_t = Σ_{k=0}^∞ R_{t+k+1}
2 Expected cumulative reward: E_π[G_t]
4 Example: roll a die five times and receive the face number as the
reward; what is the expected cumulative reward at the beginning? Or
before the third roll?
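As a quick check of this example, a short Monte Carlo sketch in Python: the expected cumulative reward is 5 × 3.5 = 17.5 at the beginning and 3 × 3.5 = 10.5 before the third roll (three rolls remaining). The simulation size is an arbitrary choice.

import random

def expected_cumulative_reward(rolls_left, n_sim=100000):
    """Monte Carlo estimate of E[G] when rolls_left fair-die rolls remain."""
    total = 0
    for _ in range(n_sim):
        total += sum(random.randint(1, 6) for _ in range(rolls_left))
    return total / n_sim

print(expected_cumulative_reward(5))   # ~17.5 at the beginning (5 x 3.5)
print(expected_cumulative_reward(3))   # ~10.5 before the third roll (3 x 3.5)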



Sequential Decision Making and Delayed Reward

1 Objective of the agent: select a series of actions to maximize the
cumulative reward
2 Actions may have long-term consequences
1 Reward may be delayed
2 Trade-off between immediate reward and long-term reward



Major components of an RL agent

1 An RL agent may include one or more of these components:


1 Policy π(s): agent’s behavior function
2 Value function V(s): how good is each state or action
3 Action-value function Q(s,a): how good is an action in a certain state
4 Model: agent’s state representation of the environment



Policy

1 A policy is the agent’s behavior model
2 It is a mapping from state/observation to action
3 Stochastic policy: action probability π(a|s) = P[A_t = a | S_t = s]
4 Deterministic policy: action a_t^* = arg max_a Q(s_t, a)



Value function

1 Value function: expected discounted sum of future rewards under a
particular policy π
1 It is used to quantify the goodness of states if following a particular
policy π

    v^π(s) = E_π[G_t | S_t = s] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s ]    (1)

2 Action-value function, or Q function

    Q(s, a) = E_π[G_t | S_t = s, A_t = a] = E_π[ Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a ]    (2)



Model

1 A model predicts what the environment will do next


1 Predict the next state: P[S_{t+1} = s′ | S_t = s, A_t = a]
2 Predict the next reward: E[R_{t+1} | S_t = s, A_t = a]
2 Sometimes the world model is given, e.g., Newtonian mechanics F = ma



k-Armed Bandit Problem as a simplified RL example

1 A multi-armed bandit is a tuple ⟨A, R⟩
2 There are k actions to take at each step t
3 R_a(r) = P(r | a) is an unknown probability distribution over rewards
4 At each step t the agent selects an action a_t ∈ A, then the
environment generates a reward r_t ∼ R_{a_t}
5 The goal of the agent is to maximize the cumulative reward Σ_{τ=1}^T r_τ
import random

class BernoulliArm():
    """Arm that returns reward 1.0 with probability p, and 0.0 otherwise."""
    def __init__(self, p):
        self.p = p
    def draw(self):
        if random.random() > self.p:
            return 0.0
        else:
            return 1.0

class NormalArm():
    """Arm that returns a reward drawn from a Gaussian N(mu, sigma)."""
    def __init__(self, mu, sigma):
        self.mu = mu
        self.sigma = sigma
    def draw(self):
        return random.gauss(self.mu, self.sigma)
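A tiny usage sketch of the two arm classes above (the arm parameters are arbitrary):

arm1 = BernoulliArm(0.3)      # pays 1.0 with probability 0.3
arm2 = NormalArm(1.0, 0.5)    # pays a Gaussian reward with mean 1.0
print([arm1.draw() for _ in range(5)])
print([round(arm2.draw(), 2) for _ in range(5)])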



Definition of Value Function and Action-Value Function

1 The action-value is the mean reward for action a

    Q(a) = E[r | a]    (3)

2 The optimal value

    V^* = Q(a^*) = max_{a∈A} Q(a)    (4)

3 To estimate Q(a), we can compute Q_t(a) at step t as

    Q_t(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
           = Σ_{i=1}^{t−1} r_i · 1{A_i = a} / Σ_{i=1}^{t−1} 1{A_i = a}    (5)



Greedy Action and ε-Greedy Action to Take

1 The estimate of Q(a) at step t

    Q_t(a) = (sum of rewards when a taken prior to t) / (number of times a taken prior to t)
           = Σ_{i=1}^{t−1} r_i · 1{A_i = a} / Σ_{i=1}^{t−1} 1{A_i = a}    (6)

2 Greedy action selection algorithm: A_t = arg max_a Q_t(a)
3 Problem with the greedy algorithm? It never explores actions whose current estimates look worse.
4 ε-Greedy: act greedily most of the time, but with a small probability ε select a
random action (ε is usually set to about 0.1)
1 with probability 1 − ε: A_t = arg max_a Q_t(a)
2 with probability ε: A_t = uniform(A)



ε-Greedy Algorithm

Algorithm 1 A simple ε-greedy bandit algorithm

1: for a = 1 to k do
2:   Q(a) ← 0, N(a) ← 0
3: end for
4: loop
5:   A ← arg max_a Q(a) with probability 1 − ε, or a uniformly random action from A with probability ε
6:   r ← bandit(A)
7:   N(A) ← N(A) + 1
8:   Q(A) ← Q(A) + (1/N(A)) [r − Q(A)]
9: end loop

Deriving the incremental update: NewEstimate = OldEstimate + StepSize [Target − OldEstimate]

    Q_t(a_t) = Q_{t−1}(a_t) + (1/N_t(a_t)) (r_t − Q_{t−1}(a_t))    (7)
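To make the loop concrete, here is a minimal Python sketch of the ε-greedy bandit algorithm above, run against the BernoulliArm class from the earlier slide; the arm probabilities, ε, and number of steps are made-up values for illustration.

import random

def epsilon_greedy_bandit(arms, epsilon=0.1, steps=1000):
    k = len(arms)
    Q = [0.0] * k          # value estimates
    N = [0] * k            # pull counts
    total_reward = 0.0
    for t in range(steps):
        if random.random() < epsilon:
            a = random.randrange(k)                  # explore
        else:
            a = max(range(k), key=lambda i: Q[i])    # exploit
        r = arms[a].draw()
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                    # incremental mean, Eq. (7)
        total_reward += r
    return Q, total_reward

# Hypothetical example: three Bernoulli arms with unknown success probabilities.
arms = [BernoulliArm(0.2), BernoulliArm(0.5), BernoulliArm(0.7)]
Q, G = epsilon_greedy_bandit(arms, epsilon=0.1, steps=5000)
print("estimated Q:", [round(q, 2) for q in Q], "cumulative reward:", G)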
Why the bandit is a simplified RL task

1 No delayed reward: feedback is instantaneous
2 No change of state: there is only one state
3 Rewards are independent of previous actions



Markov Decision Process (MDP)

1 The Markov Decision Process can model many real-world problems. It
formally describes the framework of reinforcement learning
2 Under an MDP, the environment is fully observable.
1 Optimal control primarily deals with continuous MDPs
2 Partially observable problems can be converted into MDPs



Defining Three Markov Models

• Markov Processes
• Markov Reward Processes (MRPs)
• Markov Decision Processes (MDPs)



Markov Property

1 The history of states: h_t = {s_1, s_2, s_3, ..., s_t}
2 State s_t is Markovian if and only if:

    p(s_{t+1} | s_t) = p(s_{t+1} | h_t)    (8)
    p(s_{t+1} | s_t, a_t) = p(s_{t+1} | h_t, a_t)    (9)

3 “The future is independent of the past given the present”



Markov Process/Markov Chain

1 State transition matrix P specifies p(s_{t+1} = s′ | s_t = s)

    P = [ P(s_1|s_1)  P(s_2|s_1)  ...  P(s_N|s_1) ]
        [ P(s_1|s_2)  P(s_2|s_2)  ...  P(s_N|s_2) ]
        [     ...         ...     ...      ...    ]
        [ P(s_1|s_N)  P(s_2|s_N)  ...  P(s_N|s_N) ]



Example of MP

1 Sample episodes starting from s3


1 s3 , s4 , s5 , s6 , s6
2 s3 , s2 , s3 , s2 , s1
3 s3 , s4 , s4 , s5 , s5
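A short sketch of sampling episodes like these from a transition matrix in Python; the 7-state matrix below is a made-up stand-in for the chain shown in the figure, not the actual numbers.

import random

# Hypothetical 7-state transition matrix (each row sums to 1).
P = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.0, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]

def sample_episode(P, start, length):
    """Sample a state sequence of the given length from the Markov chain."""
    s = start
    episode = [s]
    for _ in range(length - 1):
        s = random.choices(range(len(P)), weights=P[s])[0]
        episode.append(s)
    return episode

# Episodes starting from s3 (index 2 with 0-based indexing).
print(sample_episode(P, start=2, length=5))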



Markov Reward Process (MRP)

1 Markov Reward Process is a Markov Chain + reward


2 Definition of Markov Reward Process (MRP)
1 S is a (finite) set of states (s ∈ S)
2 P is the dynamics/transition model that specifies P(S_{t+1} = s′ | s_t = s)
3 R is a reward function R(s_t = s) = E[r_t | s_t = s]
4 Discount factor γ ∈ [0, 1]
3 If there is a finite number of states, R can be represented as a vector
1 We focus on the tabular representation first
2 What happens if there are infinitely many states?



Example of MRP

Reward: +5 in s_1, +10 in s_7, and 0 in all other states, so we can
represent R = [5, 0, 0, 0, 0, 0, 10]
Return and Value Function

1 Definition of Horizon
1 Maximum number of time steps in each episode/trajectory
2 Can be infinite; otherwise it is called a finite Markov (reward) process
3 For example, roughly 100 moves per game of Go, 80 moves per game of chess
2 Definition of Return
1 Discounted sum of rewards from time step t to the horizon

    G_t = R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + γ^3 R_{t+4} + ... + γ^{T−t−1} R_T

3 Definition of the state value function V_t(s) for an MRP
1 Expected return from time t in state s

    V_t(s) = E[G_t | s_t = s]
           = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... + γ^{T−t−1} R_T | s_t = s]

2 Present value of accumulated future rewards


Why Discount Factor γ

1 Avoid infinite returns in cyclic Markov processes


2 Uncertainty about the future
1 If the reward is financial, immediate rewards may earn more interest
than delayed rewards
3 Animal/human behavior shows a preference for immediate reward
4 It is sometimes possible to use undiscounted Markov reward processes
(i.e. γ = 1), e.g. if all sequences terminate.
1 γ = 0: only care about the immediate reward
2 γ = 1: future rewards are weighted the same as immediate rewards



Example of MRP

1 Reward: +5 in s_1, +10 in s_7, and 0 in all other states, so we can
represent R = [5, 0, 0, 0, 0, 0, 10]
2 Sample returns G for 3-step episodes with γ = 1/2
1 return for s_4, s_5, s_6, s_7: 0 + (1/2) × 0 + (1/4) × 10 = 2.5
2 return for s_4, s_3, s_2, s_1: 0 + (1/2) × 0 + (1/4) × 5 = 1.25
3 return for s_4, s_5, s_6, s_6: 0
3 How do we compute the value function? For example, the value of state
s_4 is V(s_4) = E[G_t | s_t = s_4]



Computing the Value of a Markov Reward Process

1 Value function: expected return starting from state s

    V(s) = E[G_t | s_t = s] = E[R_{t+1} + γ R_{t+2} + γ^2 R_{t+3} + ... | s_t = s]

2 The MRP value function satisfies the following Bellman equation:

    V(s) = R(s) + γ Σ_{s′∈S} P(s′|s) V(s′)

where R(s) is the immediate reward and the second term is the discounted sum of
future rewards.

3 Practice: derive the Bellman equation from the definition of V(s)
1 Hint: V(s) = E[R_{t+1} + γ E[R_{t+2} + γ R_{t+3} + γ^2 R_{t+4} + ...] | s_t = s]



Understanding Bellman Equation

1 The Bellman equation describes the recursive relation between the values of states

    V(s) = R(s) + γ Σ_{s′∈S} P(s′|s) V(s′)



Matrix Form of Bellman Equation for MRP

Therefore, we can express the value function in matrix form:

    [ V(s_1) ]   [ R(s_1) ]       [ P(s_1|s_1)  P(s_2|s_1)  ...  P(s_N|s_1) ] [ V(s_1) ]
    [ V(s_2) ] = [ R(s_2) ] + γ   [ P(s_1|s_2)  P(s_2|s_2)  ...  P(s_N|s_2) ] [ V(s_2) ]
    [   ...  ]   [   ...  ]       [     ...         ...     ...      ...    ] [   ...  ]
    [ V(s_N) ]   [ R(s_N) ]       [ P(s_1|s_N)  P(s_2|s_N)  ...  P(s_N|s_N) ] [ V(s_N) ]

    V = R + γPV

1 Analytic solution for the value of an MRP: V = (I − γP)^{−1} R
1 The matrix inverse has complexity O(N^3) for N states
2 Only feasible for small MRPs
2 Other methods to solve this?
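A minimal NumPy sketch of the analytic solution V = (I − γP)^{−1} R for the 7-state example, using the reward vector from the earlier slide and a hypothetical transition matrix (the real one is defined by the chain in the figure):

import numpy as np

# Hypothetical transition matrix for the 7-state example (rows sum to 1).
P = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.0, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
])
R = np.array([5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0])
gamma = 0.5

# Analytic solution V = (I - gamma * P)^{-1} R, solved as a linear system.
V = np.linalg.solve(np.eye(7) - gamma * P, R)
print(np.round(V, 3))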



Iterative Algorithm for Computing Value of a MRP

1 Dynamic Programming
2 Monte-Carlo evaluation
3 Temporal-Difference learning



Monte Carlo Algorithm for Computing Value of a MRP

Algorithm 2 Monte Carlo simulation to calculate the MRP value function

1: i ← 0, G_t ← 0
2: while i ≠ N do
3:   generate an episode starting from state s and time t
4:   using the generated episode, calculate the return g = Σ_{k=t}^{H−1} γ^{k−t} r_k
5:   G_t ← G_t + g, i ← i + 1
6: end while
7: V_t(s) ← G_t / N

1 For example, to calculate V(s_4) we can generate many trajectories and
then take the average of the returns:
1 return for s_4, s_5, s_6, s_7: 0 + (1/2) × 0 + (1/4) × 10 = 2.5
2 return for s_4, s_3, s_2, s_1: 0 + (1/2) × 0 + (1/4) × 5 = 1.25
3 return for s_4, s_5, s_6, s_6: 0
4 more trajectories
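A minimal Monte Carlo sketch of Algorithm 2 for V(s_4), using the reward vector R = [5, 0, 0, 0, 0, 0, 10] and a hypothetical 7-state transition matrix (the true transition probabilities are those of the chain in the figure):

import random

# Hypothetical transition matrix for the 7-state MRP (stand-in for the figure).
P = [
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.4, 0.0, 0.6, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.3, 0.0, 0.7, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.3, 0.0, 0.7, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.3, 0.4, 0.3],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
]
R = [5.0, 0.0, 0.0, 0.0, 0.0, 0.0, 10.0]

def mc_value(start, gamma=0.5, horizon=10, n_episodes=10000):
    """Monte Carlo estimate of V(start): average discounted return over episodes."""
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = start, 0.0, 1.0
        for _ in range(horizon):
            s = random.choices(range(7), weights=P[s])[0]
            g += discount * R[s]
            discount *= gamma
        total += g
    return total / n_episodes

print(mc_value(start=3))   # estimate of V(s4); s4 is index 3 with 0-based indexing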



Iterative Algorithm for Computing Value of a MRP

Algorithm 3 Iterative algorithm to calculate the MRP value function

1: for all states s ∈ S, V′(s) ← 0, V(s) ← ∞
2: while ||V − V′|| > ε do
3:   V ← V′
4:   for all states s ∈ S, V′(s) ← R(s) + γ Σ_{s′∈S} P(s′|s) V(s′)
5: end while
6: return V′(s) for all s ∈ S
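A short NumPy sketch of Algorithm 3; P and R are assumed to be an N×N transition matrix and a length-N reward vector supplied by the caller.

import numpy as np

def mrp_value_iterative(P, R, gamma=0.5, eps=1e-6):
    """Iterate V'(s) = R(s) + gamma * sum_s' P(s'|s) V(s') until convergence."""
    P, R = np.asarray(P), np.asarray(R)
    V_new = np.zeros(len(R))
    V = np.full(len(R), np.inf)
    while np.max(np.abs(V - V_new)) > eps:
        V = V_new
        V_new = R + gamma * P @ V
    return V_new

# Example call with a transition matrix P (N x N) and reward vector R (length N):
# V = mrp_value_iterative(P, R, gamma=0.5)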



Markov Decision Process (MDP)

1 A Markov Decision Process is a Markov Reward Process with decisions.
2 Definition of MDP
1 S is a finite set of states
2 A is a finite set of actions
3 P is the dynamics/transition model for each action a in state s,
P(s_{t+1} = s′ | s_t = s, a_t = a)
4 R is a reward function R(s_t = s, a_t = a) = E[r_t | s_t = s, a_t = a]
5 Discount factor γ ∈ [0, 1]
3 An MDP is a tuple (S, A, P, R, γ)



Policy in MDP

1 A policy specifies what action to take in each state
2 Given a state, it specifies a distribution over actions
3 Policy: π(a|s) = P(a_t = a | s_t = s)
4 Policies are stationary (time-independent): A_t ∼ π(a|s) for any t > 0



Policy in MDP

1 Given an MDP (S, A, P, R, γ) and a policy π
2 The state and reward sequence S_1, R_2, S_2, R_3, ... is a Markov reward
process (S, P^π, R^π, γ), where

    P^π(s′|s) = Σ_{a∈A} π(a|s) P(s′|s, a)
    R^π(s) = Σ_{a∈A} π(a|s) R(s, a)
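A small NumPy sketch of this reduction from an MDP plus a policy to the induced MRP; the array shapes (P as [A, S, S], R as [S, A], pi as [S, A]) are assumptions made for the example.

import numpy as np

def mdp_to_mrp(P, R, pi):
    """Collapse an MDP and a policy into the induced MRP (P_pi, R_pi).

    Assumed shapes: P[a, s, s2] = P(s2 | s, a), R[s, a] = R(s, a),
    and pi[s, a] = pi(a | s).
    """
    P_pi = np.einsum('sa,ast->st', pi, P)   # P_pi(s'|s) = sum_a pi(a|s) P(s'|s,a)
    R_pi = np.sum(pi * R, axis=1)           # R_pi(s)    = sum_a pi(a|s) R(s,a)
    return P_pi, R_pi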



Comparison of MP/MRP and MDP



Value function for MDP

1 The state-value function v^π(s) of an MDP is the expected return
starting from state s and following policy π

    v^π(s) = E_π[G_t | s_t = s]    (10)

2 The action-value function q^π(s, a) is the expected return starting
from state s, taking action a, and then following policy π

    q^π(s, a) = E_π[G_t | s_t = s, A_t = a]    (11)

3 The relation between v^π(s) and q^π(s, a):

    v^π(s) = Σ_{a∈A} π(a|s) q^π(s, a)    (12)



Bellman Expectation Equation

1 The state-value function can be decomposed into the immediate reward
plus the discounted value of the successor state,

    v^π(s) = E_π[R_{t+1} + γ v^π(s_{t+1}) | s_t = s]    (13)

2 The action-value function can be similarly decomposed,

    q^π(s, a) = E_π[R_{t+1} + γ q^π(s_{t+1}, A_{t+1}) | s_t = s, A_t = a]    (14)



Bellman Expectation Equation for V^π and Q^π

    v^π(s) = Σ_{a∈A} π(a|s) q^π(s, a)    (15)
    q^π(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^π(s′)    (16)

Thus

    v^π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^π(s′) )    (17)
    q^π(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) Σ_{a′∈A} π(a′|s′) q^π(s′, a′)    (18)



Backup Diagram for V^π

    v^π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^π(s′) )    (19)



Backup Diagram for Q^π

    q^π(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) Σ_{a′∈A} π(a′|s′) q^π(s′, a′)    (20)



Policy Evaluation

1 Evaluate the value of each state under a given policy π: compute v^π(s)
2 Also called (value) prediction



Example: Navigate the boat

Figure: Markov Chain/MRP: Go with river stream

Figure: MDP: Navigate the boat



Example: Policy Evaluation

1 Two actions: Left or Right
2 For all actions, reward: +5 in s_1, +10 in s_7, and 0 in all other states, so
we can represent R = [5, 0, 0, 0, 0, 0, 10]
3 Take a deterministic policy π(s) = Left with γ = 0 for every state
s; what is the value of the policy?
1 V^π = [5, 0, 0, 0, 0, 0, 10], since γ = 0



Example: Policy Evaluation

1 R = [5, 0, 0, 0, 0, 0, 10]
2 Practice 1: deterministic policy π(s) = Left with γ = 0.5 for every
state s; what are the state values under this policy?
3 Practice 2: stochastic policy P(π(s) = Left) = 0.5 and
P(π(s) = Right) = 0.5 with γ = 0.5 for every state s; what are the
state values under this policy?
4 Iteration t:

    v_t^π(s) = Σ_a P(π(s) = a) ( r(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_{t−1}^π(s′) )
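A small iterative policy-evaluation sketch for these two practice questions. It assumes a deterministic "boat" dynamics in which Left moves one state toward s_1 and Right one state toward s_7, and that the reward depends only on the current state (consistent with V^π = R when γ = 0 on the previous slide); the real dynamics are given by the figure.

import numpy as np

R = np.array([5.0, 0, 0, 0, 0, 0, 10.0])   # reward of each state (assumption)
gamma = 0.5
N = 7

def next_state(s, a):
    """Assumed deterministic dynamics: a = 0 moves Left, a = 1 moves Right."""
    return max(s - 1, 0) if a == 0 else min(s + 1, N - 1)

def evaluate(pi_left, iters=200):
    """Iterative policy evaluation; pi_left = probability of choosing Left."""
    v = np.zeros(N)
    for _ in range(iters):
        v_new = np.zeros(N)
        for s in range(N):
            for a, p in ((0, pi_left), (1, 1.0 - pi_left)):
                v_new[s] += p * (R[s] + gamma * v[next_state(s, a)])
        v = v_new
    return v

print(np.round(evaluate(pi_left=1.0), 3))   # Practice 1: always Left
print(np.round(evaluate(pi_left=0.5), 3))   # Practice 2: 50/50 Left/Right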



Decision Making in Markov Decision Process (MDP)

1 Prediction (evaluate a given policy):
1 Input: MDP ⟨S, A, P, R, γ⟩ and policy π, or MRP ⟨S, P^π, R^π, γ⟩
2 Output: value function v^π
2 Control (search for the optimal policy):
1 Input: MDP ⟨S, A, P, R, γ⟩
2 Output: optimal value function v^* and optimal policy π^*
3 Prediction and control in an MDP can be solved by dynamic
programming.



Dynamic Programming

Dynamic Programming is a very general solution method for problems
that have two properties:
1 Optimal substructure
1 Principle of optimality applies
2 Optimal solution can be decomposed into subproblems
2 Overlapping subproblems
1 Subproblems recur many times
2 Solutions can be cached and reused
Markov decision processes satisfy both properties
1 Bellman equation gives recursive decomposition
2 Value function stores and reuses solutions



Prediction: Policy evaluation on MDP

1 Objective: evaluate a given policy π for an MDP
2 Output: the value function v^π under the policy
3 Solution: iterate the Bellman expectation backup
4 Algorithm: synchronous backup
1 At each iteration t + 1, update v_{t+1}(s) from v_t(s′) for all states
s ∈ S, where s′ is a successor state of s

    v_{t+1}(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_t(s′) )    (21)

5 Convergence: v_1 → v_2 → ... → v^π



Policy evaluation: Iteration on Bellman expectation backup

Bellman expectation backup for a particular policy:

    v_{t+1}(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_t(s′) )    (22)

Or, in the form of the induced MRP ⟨S, P^π, R^π, γ⟩:

    v_{t+1}(s) = R^π(s) + γ Σ_{s′∈S} P^π(s′|s) v_t(s′)    (23)



Evaluating a Random Policy in the Small Gridworld
Example 4.1 in the Sutton RL textbook.

1 Undiscounted episodic MDP (γ = 1)
2 Nonterminal states 1, ..., 14
3 Two terminal states (the two shaded squares)
4 Actions leading out of the grid leave the state unchanged, e.g., P(7|7, right) = 1
5 Reward is −1 until the terminal state is reached
6 Transitions are deterministic given the action, e.g., P(6|5, right) = 1
7 Uniform random policy π(l|·) = π(r|·) = π(u|·) = π(d|·) = 0.25
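A compact Python sketch of iterative policy evaluation for this 4×4 gridworld, evaluating the uniform random policy; the grid layout (terminal states in two opposite corners) follows the textbook's Example 4.1, and the iteration count is an arbitrary choice.

import numpy as np

GRID = 4
TERMINAL = {(0, 0), (GRID - 1, GRID - 1)}
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right

def step(state, action):
    """Deterministic transition; moves off the grid leave the state unchanged."""
    if state in TERMINAL:
        return state, 0.0
    r, c = state
    dr, dc = action
    nr, nc = min(max(r + dr, 0), GRID - 1), min(max(c + dc, 0), GRID - 1)
    return (nr, nc), -1.0

def evaluate_random_policy(iters=1000, gamma=1.0):
    V = np.zeros((GRID, GRID))
    for _ in range(iters):
        V_new = np.zeros((GRID, GRID))
        for r in range(GRID):
            for c in range(GRID):
                if (r, c) in TERMINAL:
                    continue
                for a in ACTIONS:   # uniform random policy, pi(a|s) = 0.25
                    (nr, nc), reward = step((r, c), a)
                    V_new[r, c] += 0.25 * (reward + gamma * V[nr, nc])
        V = V_new
    return V

print(np.round(evaluate_random_policy(), 1))   # approaches the textbook values (e.g., -14, -20, -22)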



Evaluating a Random Policy in the Small Gridworld

1 Iteratively evaluate the random policy



A live demo on policy evaluation

    v^π(s) = Σ_{a∈A} π(a|s) ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^π(s′) )    (24)

1 https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html



Practice: Gridworld

Textbook Example 3.5: GridWorld



MDP Control

1 Compute the optimal policy

    π^*(s) = arg max_π v^π(s)    (25)

2 The optimal policy for an MDP in an infinite-horizon problem (the agent
acts forever) is
1 Deterministic
2 Stationary (does not depend on the time step)
3 Unique? Not necessarily; there may be state-actions with identical
optimal values



Optimal Value Function

1 The optimal state-value function v^*(s) is the maximum value
function over all policies

    v^*(s) = max_π v^π(s)

2 The optimal policy

    π^*(s) = arg max_π v^π(s)

3 An MDP is “solved” when we know the optimal value function
4 There exists a unique optimal value function, but there can be multiple
optimal policies (e.g., two actions that have the same optimal value)



Finding Optimal Policy

1 An optimal policy can be found by maximizing over q^*(s, a):

    π^*(a|s) = 1 if a = arg max_{a∈A} q^*(s, a), and 0 otherwise

2 There is always a deterministic optimal policy for any MDP
3 If we know q^*(s, a), we immediately have the optimal policy



Policy Search

1 One option is to enumerate all policies and pick the best
2 The number of deterministic policies is |A|^{|S|}
3 Other approaches, such as policy iteration and value iteration, are more
efficient
1 Policy iteration
2 Value iteration



Improving a Policy through Policy Iteration

1 Iterate between two steps:
1 Evaluate the policy π (compute v^π given the current π)
2 Improve the policy by acting greedily with respect to v^π

    π′ = greedy(v^π)    (26)



Policy Improvement
1 Compute the state-action value of the current policy π_i:

    q^{π_i}(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^{π_i}(s′)    (27)

2 Compute the new policy π_{i+1} for all s ∈ S by

    π_{i+1}(s) = arg max_a q^{π_i}(s, a)    (28)
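A tabular policy-iteration sketch under assumed array shapes (P as [A, S, S], R as [S, A]); it illustrates Eqs. (27)-(28) combined with iterative policy evaluation, and is not the course's reference implementation.

import numpy as np

def policy_iteration(P, R, gamma=0.9, eval_iters=500):
    """Tabular policy iteration. Assumed shapes: P is [A, S, S], R is [S, A]."""
    n_actions, n_states, _ = P.shape
    pi = np.zeros(n_states, dtype=int)   # arbitrary initial deterministic policy
    while True:
        # Policy evaluation: iterate the Bellman expectation backup, Eq. (22)
        v = np.zeros(n_states)
        for _ in range(eval_iters):
            v = R[np.arange(n_states), pi] + gamma * P[pi, np.arange(n_states)] @ v
        # Policy improvement: act greedily w.r.t. q^{pi_i}, Eqs. (27)-(28)
        q = R + gamma * np.einsum('ast,t->sa', P, v)
        pi_new = np.argmax(q, axis=1)
        if np.array_equal(pi_new, pi):
            return pi, v
        pi = pi_new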



Monotonic Improvement in Policy
1 Consider a deterministic policy a = π(s)
2 We improve the policy through

    π′(s) = arg max_a q^π(s, a)

3 This improves the value from any state s over one step,

    q^π(s, π′(s)) = max_{a∈A} q^π(s, a) ≥ q^π(s, π(s)) = v^π(s)

4 It therefore improves the value function, v^{π′}(s) ≥ v^π(s):

    v^π(s) ≤ q^π(s, π′(s)) = E_{π′}[R_{t+1} + γ v^π(S_{t+1}) | S_t = s]
           ≤ E_{π′}[R_{t+1} + γ q^π(S_{t+1}, π′(S_{t+1})) | S_t = s]
           ≤ E_{π′}[R_{t+1} + γ R_{t+2} + γ^2 q^π(S_{t+2}, π′(S_{t+2})) | S_t = s]
           ≤ E_{π′}[R_{t+1} + γ R_{t+2} + ... | S_t = s] = v^{π′}(s)
Monotonic Improvement in Policy

1 If improvements stop,

    q^π(s, π′(s)) = max_{a∈A} q^π(s, a) = q^π(s, π(s)) = v^π(s)

2 then the Bellman optimality equation has been satisfied,

    v^π(s) = max_{a∈A} q^π(s, a)

3 Therefore v^π(s) = v^*(s) for all s ∈ S, so π is an optimal policy



Bellman Optimality Equation

1 The optimal value functions satisfy the Bellman optimality
equations:

    v^*(s) = max_a q^*(s, a)
    q^*(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^*(s′)

thus

    v^*(s) = max_a ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v^*(s′) )
    q^*(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) max_{a′} q^*(s′, a′)



Value Iteration: Turning the Bellman Optimality Equation into an Update Rule

1 Suppose we know the solution to the subproblems v^*(s′), which is optimal.
2 Then the optimal v^*(s) can be found by iterating
the following Bellman optimality backup rule,

    v(s) ← max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v(s′) )

3 The idea of value iteration is to apply these updates iteratively



Algorithm of Value Iteration

1 Objective: find the optimal policy π
2 Solution: iterate the Bellman optimality backup
3 Value iteration algorithm:
1 initialize k = 1 and v_0(s) = 0 for all states s
2 for k = 1 : H
1 for each state s

    q_{k+1}(s, a) = R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_k(s′)    (29)
    v_{k+1}(s) = max_a q_{k+1}(s, a)    (30)

2 k ← k + 1
3 To retrieve the optimal policy after value iteration:

    π(s) = arg max_a ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_{k+1}(s′) )    (31)
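A matching value-iteration sketch under the same assumed array shapes as the policy-iteration example (P as [A, S, S], R as [S, A]), implementing Eqs. (29)-(31).

import numpy as np

def value_iteration(P, R, gamma=0.9, horizon=1000, tol=1e-8):
    """Tabular value iteration with the Bellman optimality backup."""
    n_actions, n_states, _ = P.shape
    v = np.zeros(n_states)
    for _ in range(horizon):
        q = R + gamma * np.einsum('ast,t->sa', P, v)   # q_{k+1}(s, a), Eq. (29)
        v_new = q.max(axis=1)                          # v_{k+1}(s),   Eq. (30)
        if np.max(np.abs(v_new - v)) < tol:
            v = v_new
            break
        v = v_new
    # Policy extraction, Eq. (31)
    pi = np.argmax(R + gamma * np.einsum('ast,t->sa', P, v), axis=1)
    return pi, v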



Example: Shortest Path

After the optimal values are reached, we run policy extraction to retrieve
the optimal policy.



Difference between Policy Iteration and Value Iteration

1 Policy iteration includes: policy evaluation + policy improvement,
and the two are repeated iteratively until the policy converges.
2 Value iteration includes: finding the optimal value function + one
policy extraction. The two are not repeated, because once the
value function is optimal, the policy extracted from it is also
optimal (i.e., converged).
3 Finding the optimal value function can also be seen as a combination of
policy improvement (due to the max) and truncated policy evaluation
(the reassignment of v(s) after just one sweep of all states, regardless
of convergence).



Summary for Prediction and Control in MDP

Table: Dynamic Programming Algorithms


Problem      Bellman Equation               Algorithm
Prediction   Bellman Expectation Equation   Iterative Policy Evaluation
Control      Bellman Expectation Equation   Policy Iteration
Control      Bellman Optimality Equation    Value Iteration



Demo of policy iteration and value iteration

1 Policy iteration: iteration of policy evaluation and policy
improvement (update)
2 Value iteration
3 https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
Policy iteration and value iteration on FrozenLake

1 https://github.com/ucla-rlcourse/RLexample/tree/master/MDP



Improving Dynamic Programming

1 A major drawback to the DP methods is that they involve operations


over the entire state set of the MDP, that is, they require sweeps of
the state set.
2 The state set can be very large; for example, the game of backgammon
has over 10^20 states, so a single sweep could take thousands of years.
3 Asynchronous DP algorithms are in-place iterative DP methods that are not
organized in terms of systematic sweeps of the state set
4 The values of some states may be updated several times before the
values of others are updated once.



Improving Dynamic Programming

Synchronous dynamic programming is usually slow. Three simple ideas
extend DP to asynchronous dynamic programming:
1 In-place dynamic programming
2 Prioritized sweeping
3 Real-time dynamic programming



In-Place Dynamic Programming

1 Synchronous value iteration stores two copies of the value function:

for all s in S:
    v_new(s) ← max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v_old(s′) )
v_old ← v_new

2 In-place value iteration stores only one copy of the value function:

for all s in S:
    v(s) ← max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v(s′) )



Prioritized Sweeping

1 Use the magnitude of the Bellman error to guide state selection, e.g.

    | max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v(s′) ) − v(s) |

2 Backup the state with the largest remaining Bellman error


3 Update Bellman error of affected states after each backup
4 Can be implemented efficiently by maintaining a priority queue



Real-Time Dynamic Programming

1 To solve a given MDP, we can run an iterative DP algorithm at the


same time that an agent is actually experiencing the MDP
2 The agent’s experience can be used to determine the states to which
the DP algorithm applies its updates
3 We can apply updates to states as the agent visits them, focusing on
the parts of the state set that are most relevant to the agent
4 After each time step (S_t, A_t), back up the state S_t:

    v(S_t) ← max_{a∈A} ( R(S_t, a) + γ Σ_{s′∈S} P(s′|S_t, a) v(s′) )



Sample Backups

1 The key design behind RL algorithms such as Q-learning and SARSA in the
next lectures
2 Use sample rewards and sample transitions ⟨S, A, R, S′⟩
rather than the reward function R and the transition dynamics P
3 Benefits:
1 Model-free: no advance knowledge of the MDP required
2 Breaks the curse of dimensionality through sampling
3 The cost of a backup is constant, independent of n = |S|



Approximate Dynamic Programming

1 Use a function approximator v̂(s, w)
2 Fitted value iteration repeats at each iteration k:
1 Sample states s from the state cache S̃

    ṽ_k(s) = max_{a∈A} ( R(s, a) + γ Σ_{s′∈S} P(s′|s, a) v̂(s′, w_k) )

2 Train the next value function v̂(s, w_{k+1}) using the targets ⟨s, ṽ_k(s)⟩
3 This is the key idea behind Deep Q-Learning
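A rough sketch of one fitted value-iteration step with a linear approximator v̂(s, w) = w·φ(s); the feature map, the least-squares fit, and the array shapes are illustrative assumptions, not the lecture's specification.

import numpy as np

def fitted_vi_step(P, R, W, phi, states, gamma=0.9):
    """One fitted value-iteration step.

    Assumptions: P is [A, S, S], R is [S, A], phi(s) returns a feature vector,
    and the approximator is linear, v_hat(s, w) = w . phi(s).
    """
    Phi = np.stack([phi(s) for s in range(P.shape[1])])   # [S, d] feature matrix
    v_hat = Phi @ W                                       # current v_hat(s', w_k)
    q = R + gamma * np.einsum('ast,t->sa', P, v_hat)
    targets = q[states].max(axis=1)                       # tilde v_k(s) on sampled states
    # Fit w_{k+1} by least squares on the sampled targets <s, tilde v_k(s)>
    W_next, *_ = np.linalg.lstsq(Phi[states], targets, rcond=None)
    return W_next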



End

1 Summary: MDP, policy evaluation, policy iteration, and value
iteration
2 Next week: model-free methods
3 Reading: textbook Chapter 5 and Chapter 6

