2023 Week 3: Model-Free
Bolei Zhou
UCLA
1 Last week
1 MDP, policy evaluation, policy iteration, and value iteration for solving a known MDP
2 This week
1 Model-free prediction: Estimate the value function of an unknown MDP
2 Model-free control: Optimize the value function of an unknown MDP
1 Both policy iteration and value iteration assume direct access to the dynamics and rewards of the environment
N(S_t) ← N(S_t) + 1
v(S_t) ← v(S_t) + (1/N(S_t)) (G_t − v(S_t))
3 n-step TD: v(S_t) ← v(S_t) + α (G_t^(n) − v(S_t))
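These updates translate directly into code. Below is a minimal sketch of every-visit incremental Monte-Carlo prediction implementing the running-mean update above; the Gym-style environment (classic 4-tuple `step()`), the `policy(state)` callable, and the parameter names are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes=1000, gamma=1.0):
    """Every-visit incremental MC prediction: v(S_t) += (G_t - v(S_t)) / N(S_t)."""
    V = defaultdict(float)   # state-value estimates v(s)
    N = defaultdict(int)     # visit counts N(s)
    for _ in range(num_episodes):
        # Roll out one full episode under the fixed policy.
        episode, state, done = [], env.reset(), False
        while not done:
            next_state, reward, done, _ = env.step(policy(state))
            episode.append((state, reward))
            state = next_state
        # Walk backwards to accumulate the returns G_t, then apply the running-mean update.
        G = 0.0
        for state, reward in reversed(episode):
            G = reward + gamma * G
            N[state] += 1
            V[state] += (G - V[state]) / N[state]
    return V
```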
1 Model-free prediction
1 Evaluate the state value only by interacting with the environment
2 Many algorithms can do it: Temporal-Difference (TD) learning and the Monte-Carlo (MC) method
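For comparison, here is a minimal TD(0) prediction sketch under the same assumed Gym-style interface: instead of waiting for the full return, it bootstraps from the current estimate of the next state. The step size `alpha` is an illustrative parameter.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes=1000, alpha=0.05, gamma=1.0):
    """TD(0) prediction: v(S_t) += alpha * (R_{t+1} + gamma * v(S_{t+1}) - v(S_t))."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            next_state, reward, done, _ = env.step(policy(state))
            # The TD target bootstraps from the current estimate of the next state.
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```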
1 Model-free control:
1 Optimize the value function of an unknown MDP
2 Generate an optimal control policy
2 Generalized Policy Iteration (GPI) with MC or TD in the loop
π′ = greedy(v_π)    (2)
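Acting greedily with respect to v_π requires a one-step lookahead through the model, so in the model-free setting the improvement step is taken greedily (or ε-greedily) over the action-value function Q instead. A minimal sketch, assuming a tabular Q array of shape (num_states, num_actions):

```python
import numpy as np

def greedy(Q):
    """Greedy policy improvement over Q: pi'(s) = argmax_a Q(s, a)."""
    return np.argmax(Q, axis=1)

def epsilon_greedy_action(Q, state, epsilon):
    """With probability epsilon explore uniformly at random, otherwise act greedily."""
    if np.random.random() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[state]))
```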
Algorithm 2 (MC control with ε-greedy exploration)
1: Initialize Q(S, A) = 0, N(S, A) = 0, ε = 1, k = 1
2: π_k = ε-greedy(Q)
3: loop
4:    Sample the k-th episode (S_1, A_1, R_2, ..., S_T) ∼ π_k
5:    for each state S_t and action A_t in the episode do
6:        N(S_t, A_t) ← N(S_t, A_t) + 1
7:        Q(S_t, A_t) ← Q(S_t, A_t) + (1/N(S_t, A_t)) (G_t − Q(S_t, A_t))
8:    end for
9:    k ← k + 1, ε ← 1/k
10:   π_k = ε-greedy(Q)
11: end loop
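A minimal sketch of Algorithm 2 in code, again assuming a tabular Gym-style environment with discrete states and actions; `num_states`, `num_actions`, and the episode budget are illustrative parameters.

```python
import numpy as np

def mc_control(env, num_states, num_actions, num_episodes=5000, gamma=1.0):
    """Monte-Carlo control with an epsilon-greedy policy and epsilon = 1/k."""
    Q = np.zeros((num_states, num_actions))
    N = np.zeros((num_states, num_actions))
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k
        # Sample the k-th episode with the current epsilon-greedy policy.
        episode, state, done = [], env.reset(), False
        while not done:
            if np.random.random() < epsilon:
                action = np.random.randint(num_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # Incremental every-visit update of Q toward the observed return G_t.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            N[state, action] += 1
            Q[state, action] += (G - Q[state, action]) / N[state, action]
    return Q
```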
2 Follow the ε-greedy policy for one step, then bootstrap the action-value function (the SARSA update):
Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]
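A minimal SARSA sketch under the same assumed interface: choose A_{t+1} with the ε-greedy policy, then bootstrap the update from Q(S_{t+1}, A_{t+1}) as in the equation above. The step size and exploration rate are illustrative.

```python
import numpy as np

def sarsa(env, num_states, num_actions, num_episodes=5000,
          alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control (SARSA)."""
    Q = np.zeros((num_states, num_actions))

    def select(state):
        # Epsilon-greedy action selection over the current Q.
        if np.random.random() < epsilon:
            return np.random.randint(num_actions)
        return int(np.argmax(Q[state]))

    for _ in range(num_episodes):
        state, done = env.reset(), False
        action = select(state)
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = select(next_state)
            # On-policy target: bootstrap from the action actually taken next.
            target = reward + (0.0 if done else gamma * Q[next_state, next_action])
            Q[state, action] += alpha * (target - Q[state, action])
            state, action = next_state, next_action
    return Q
```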
1 Off-policy learning: follow a behavior policy µ to generate the trajectory, and update the target policy π from it:
S_1, A_1, R_2, ..., S_T ∼ µ
Update π using S_1, A_1, R_2, ..., S_T
2 It leads to many benefits:
1 Learn about the optimal policy while following an exploratory policy
2 Learn from observing humans or other agents
3 Re-use experience generated from old policies π_1, π_2, ..., π_{t−1}
Off-Policy Control with Q-Learning
https://github.com/ucla-rlcourse/RLexample/tree/master/modelfree
The trajectory is again generated by the behavior policy µ: S_1, A_1, R_2, ..., S_T ∼ µ
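A minimal off-policy Q-learning sketch: the behavior policy µ is ε-greedy over the current Q, while the bootstrap target maximizes over the next action, so the learned Q corresponds to the greedy target policy. The environment interface and hyperparameters are assumptions; runnable tabular examples in the same spirit live in the course repo linked above.

```python
import numpy as np

def q_learning(env, num_states, num_actions, num_episodes=5000,
               alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control (Q-learning): behave epsilon-greedily, learn the greedy target."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(num_episodes):
        state, done = env.reset(), False
        while not done:
            # Behavior policy mu: epsilon-greedy over the current Q.
            if np.random.random() < epsilon:
                action = np.random.randint(num_actions)
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, done, _ = env.step(action)
            # Target policy pi: greedy, hence the max over next actions.
            target = reward + (0.0 if done else gamma * np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
    return Q
```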