2023 Week2 Lecture Before
Bolei Zhou
UCLA
October 3, 2023
1 You should have been informed if you received a PTE; email the TA if not
2 Piazza Discussion Forum:
https://piazza.com/ucla/fall2023/cs260r
1 All course-related questions should go there
2 Expected response time from TA: <36 hours
3 Assignment1 will be out at
https://github.com/ucla-rlcourse/assignment-2023fall
1 Due in two weeks
4 Friday’s TA discussion session: programming and PyTorch basics
1 TAs will come up with a weekly schedule soon: assignments, clarification of
technical details, some more recent papers, etc.
2 Discussion is optional; you can go to either session
1 Last Week
1 Course overview
2 Basic components of RL: reward, policy, value function, action-value
function, model
3 A simplified RL task: k-armed Bandit Problem
4 Markov Decision Process (MDP)
1 Markov Chain → Markov Reward Process (MRP) → Markov Decision Process (MDP)
2 Policy evaluation in MDP
3 Control in MDP: policy iteration and value iteration
4 Improving dynamic programming
import random

# Bandit arm with a Gaussian reward distribution (k-armed bandit example)
class NormalArm:
    def __init__(self, mu, sigma):
        self.mu = mu          # mean reward
        self.sigma = sigma    # reward standard deviation

    def draw(self):
        return random.gauss(self.mu, self.sigma)   # sample one reward
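For context, here is a hypothetical usage of this arm class in an ε-greedy bandit loop; the arm parameters, ε, and step count are illustrative values, not from the lecture.

import random

# Hypothetical ε-greedy loop over three NormalArm bandits (made-up parameters).
arms = [NormalArm(1.0, 1.0), NormalArm(2.0, 1.0), NormalArm(1.5, 1.0)]
epsilon = 0.1
counts = [0] * len(arms)
values = [0.0] * len(arms)   # running estimate of each arm's mean reward

for t in range(1000):
    if random.random() < epsilon:
        a = random.randrange(len(arms))                      # explore
    else:
        a = max(range(len(arms)), key=lambda i: values[i])   # exploit
    reward = arms[a].draw()
    counts[a] += 1
    values[a] += (reward - values[a]) / counts[a]            # incremental mean update

print(values)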
• Markov Processes
• Markov Reward Processes (MRPs)
• Markov Decision Processes (MDPs)
1 Definition of Horizon
1 Maximum number of time steps in each episode/trajectory
2 Can be infinite; otherwise the process is called a finite Markov (reward) process
3 Per game: 100 moves for Go, 80 moves for chess
2 Definition of Return
1 Discounted sum of rewards from time step t to horizon
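For reference, in the notation used last week the return from time step t and the state value of an MRP are
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots + \gamma^{T-t-1} R_T
v(s) = \mathbb{E}[G_t \mid s_t = s]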
V = R + γPV
1 Analytic solution for the value of an MRP: V = (I − γP)^{-1} R (see the NumPy sketch after this list)
1 Computing the matrix inverse costs O(N^3) for N states
2 Only feasible for small MRPs
2 Other methods to solve this?
1 Dynamic Programming
2 Monte-Carlo evaluation
3 Temporal-Difference learning
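As a quick illustration of the analytic solution, here is a minimal NumPy sketch; the 3-state transition matrix, reward vector, and discount are made-up values.

import numpy as np

# Hypothetical 3-state MRP: made-up transition matrix P, reward vector R, discount gamma.
P = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])
R = np.array([1.0, 2.0, 0.0])
gamma = 0.9

# Analytic solution V = (I - γP)^{-1} R; cubic in the number of states, so only for small MRPs.
V = np.linalg.solve(np.eye(3) - gamma * P, R)
print(V)

Solving the linear system directly avoids forming the inverse explicitly, but the cost is still cubic; the iterative methods listed above scale better.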
P^{\pi}(s'|s) = \sum_{a \in A} \pi(a|s) P(s'|s, a)
R^{\pi}(s) = \sum_{a \in A} \pi(a|s) R(s, a)
v^{\pi}(s) = \sum_{a \in A} \pi(a|s) \, q^{\pi}(s, a)    (15)
q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, v^{\pi}(s')    (16)
Thus
v^{\pi}(s) = \sum_{a \in A} \pi(a|s) \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, v^{\pi}(s') \Big)    (17)
q^{\pi}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \sum_{a' \in A} \pi(a'|s') \, q^{\pi}(s', a')    (18)
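To make equation (16) concrete, here is a small helper sketch; the array layout P[s, a, s'] and R[s, a] is an assumption for illustration, not notation from the lecture.

import numpy as np

def q_from_v(P, R, v, gamma):
    # q^π(s, a) = R(s, a) + γ Σ_{s'} P(s'|s, a) v^π(s')   (Eq. 16)
    # P: [S, A, S] transition probabilities, R: [S, A] rewards, v: [S] state values.
    return R + gamma * P @ v   # matmul over the last axis gives shape [S, A]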
1 R = [5, 0, 0, 0, 0, 0, 10]
2 Practice 1: deterministic policy π(s) = Left for every state s, with γ = 0.5. What are
the state values under this policy?
3 Practice 2: stochastic policy with P(π(s) = Left) = 0.5 and P(π(s) = Right) = 0.5 for
every state s, with γ = 0.5. What are the state values under this policy?
4 Iteration t (a code sketch follows this list):
v_t^{\pi}(s) = \sum_{a} P(\pi(s) = a) \Big( r(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, v_{t-1}^{\pi}(s') \Big)
5 Convergence: v_1 → v_2 → ... → v^π
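A minimal sketch of this iterative evaluation, assuming arrays P[s, a, s'], R[s, a], and a stochastic policy pi[s, a]; this layout is an assumption, and the chain MRP's exact dynamics are not reproduced here.

import numpy as np

def policy_evaluation(P, R, pi, gamma, tol=1e-8):
    # Iterate v_t(s) = Σ_a π(a|s) (R(s, a) + γ Σ_{s'} P(s'|s, a) v_{t-1}(s')) to convergence.
    # P: [S, A, S], R: [S, A], pi: [S, A]; returns v: [S].
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v            # one-step backup, shape [S, A]
        v_new = (pi * q).sum(axis=1)     # average actions under the policy
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new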
v^{\pi}(s) = \sum_{a \in A} \pi(a|s) \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, v^{\pi}(s') \Big)    (24)
1 https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_dp.html
\pi' = \mathrm{greedy}(v^{\pi})    (26)
3 This improves the value from any state s over one step:
q^{\pi}(s, \pi'(s)) = \max_{a \in A} q^{\pi}(s, a) \ge q^{\pi}(s, \pi(s)) = v^{\pi}(s)
4 It therefore improves the value function: v^{\pi'}(s) \ge v^{\pi}(s)
v^{\pi}(s) \le q^{\pi}(s, \pi'(s)) = \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma v^{\pi}(S_{t+1}) \mid S_t = s \big]
\le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma q^{\pi}(S_{t+1}, \pi'(S_{t+1})) \mid S_t = s \big]
\le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma R_{t+2} + \gamma^2 q^{\pi}(S_{t+2}, \pi'(S_{t+2})) \mid S_t = s \big]
\le \mathbb{E}_{\pi'}\big[ R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s \big] = v^{\pi'}(s)
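Putting evaluation and greedy improvement together gives policy iteration; below is a sketch under the same assumed P[s, a, s'] / R[s, a] layout, reusing the hypothetical policy_evaluation helper from above.

import numpy as np

def policy_iteration(P, R, gamma):
    # Alternate policy evaluation and greedy improvement until the policy is stable.
    S, A = R.shape
    pi = np.ones((S, A)) / A                     # start from the uniform random policy
    while True:
        v = policy_evaluation(P, R, pi, gamma)   # evaluate the current π
        q = R + gamma * P @ v                    # q^π(s, a)
        greedy = np.eye(A)[q.argmax(axis=1)]     # π'(s) = argmax_a q^π(s, a), as one-hot rows
        if np.array_equal(greedy, pi):
            return greedy, v                     # π' = π: no further improvement possible
        pi = greedy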
Monotonic Improvement in Policy
1 If improvements stop, i.e. q^{\pi}(s, \pi'(s)) = \max_{a \in A} q^{\pi}(s, a) = q^{\pi}(s, \pi(s)) = v^{\pi}(s), then the Bellman optimality equation is satisfied, thus
v^{*}(s) = \max_{a} \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, v^{*}(s') \Big)
q^{*}(s, a) = R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \max_{a'} q^{*}(s', a')
Value iteration repeatedly applies this Bellman optimality backup:
1 For every state s: v_{k+1}(s) \leftarrow \max_{a} \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, v_k(s') \Big)
2 k ← k + 1
3 To retrieve the optimal policy after the value iteration:
\pi(s) = \arg\max_{a} \Big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, v_{k+1}(s') \Big)    (31)
After the optimal values are reached, we run policy extraction to retrieve
the optimal policy.
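A minimal sketch of value iteration plus policy extraction, again under the assumed P[s, a, s'] / R[s, a] array layout.

import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    # Iterate v_{k+1}(s) = max_a (R(s, a) + γ Σ_{s'} P(s'|s, a) v_k(s')), then extract π (Eq. 31).
    v = np.zeros(P.shape[0])
    while True:
        q = R + gamma * P @ v        # Bellman optimality backup, shape [S, A]
        v_new = q.max(axis=1)
        if np.max(np.abs(v_new - v)) < tol:
            break
        v = v_new
    pi = q.argmax(axis=1)            # policy extraction: greedy with respect to the final values
    return pi, v_new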
1 https://github.com/ucla-rlcourse/RLexample/tree/master/MDP
1 Compute backup targets with the current approximation: \tilde{v}_k(s) = \max_a \big( R(s, a) + \gamma \sum_{s' \in S} P(s'|s, a) \, \hat{v}(s', w_k) \big)
2 Train the next value function \hat{v}(s, w_{k+1}) using the targets \langle s, \tilde{v}_k(s) \rangle
3 This is the key idea behind Deep Q-Learning (see the PyTorch sketch below)
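A minimal sketch of one such fitted update, assuming PyTorch is available and a small tabular model (P, R) so that the backup targets can be computed exactly; the network, optimizer, and hyperparameters are illustrative choices, not part of the lecture.

import torch
import torch.nn as nn

def fitted_vi_step(v_hat, states, P, R, gamma, steps=200, lr=1e-2):
    # One fitted value-iteration step:
    #   1. compute targets ṽ_k(s) = max_a (R(s, a) + γ Σ_{s'} P(s'|s, a) v̂(s', w_k))
    #   2. regress v̂(s, w_{k+1}) onto the pairs <s, ṽ_k(s)>
    # states: [N, d] state features, P: [N, A, N] transitions, R: [N, A] rewards.
    with torch.no_grad():
        v_next = v_hat(states).squeeze(-1)                    # v̂(s', w_k) for every state
        targets = (R + gamma * P @ v_next).max(dim=1).values  # ṽ_k(s)
    optim = torch.optim.Adam(v_hat.parameters(), lr=lr)
    for _ in range(steps):                                    # supervised regression onto the targets
        loss = nn.functional.mse_loss(v_hat(states).squeeze(-1), targets)
        optim.zero_grad()
        loss.backward()
        optim.step()
    return v_hat

Replacing the exact max-backup targets with sampled one-step targets from experience, and the state-value network with a Q-network, gives the update used in deep Q-learning.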