
Lecture 10: Reinforcement Learning and Links with Games
Reminders
● Feedback on your project.

References for this lecture:

1. Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in Neural Information Processing Systems 30 (2017).
2. Slides on RL: https://dpmd.ai/DeepMindxUCL21
3. Notes on the repeated Prisoner's Dilemma: https://sites.math.northwestern.edu/~clark/364/handouts/repeated.pdf
Today
1. The RL setting
2. The policy gradient method:
   1. Policy gradient 101
   2. A game perspective on actor-critic!
3. Q-learning:
   1. Tabular Q-learning 101
   2. A variational inequality perspective on Q-learning
Disclaimer
- Not a full course on RL!!!!
- More like a crash course.
- Check the course linked in these slides.
The RL setting
Agent and Environment
At each step t the agent:
• Receives observation ot and reward rt
• Executes action at

The environment:
• Receives action at
• Emits observation ot+1 and reward rt+1

Usually: the observation is the state st of the environment.

In the following we assume ot = st (we move from state to state).
Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
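
As a concrete illustration, here is a minimal sketch of this interaction loop (not from the lecture): it assumes a hypothetical `env` object with gym-style `reset()`/`step()` methods and uses a placeholder random policy.

```python
import random

def run_episode(env, policy, max_steps=1000):
    """Roll out one episode of the agent-environment loop described above
    (assumes a hypothetical gym-style env exposing reset() and step())."""
    state = env.reset()                      # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)               # agent executes a_t
        state, reward, done = env.step(action)  # env emits s_{t+1} and r_{t+1}
        total_reward += reward
        if done:                             # episode ends
            break
    return total_reward

# Placeholder policy: pick an action uniformly at random (purely illustrative).
def random_policy(state, actions=(0, 1)):
    return random.choice(actions)
```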
The Goal in RL
A reward rt is a scalar feedback signal:
- It indicates how well the agent is doing at step t.
- The agent's job is to maximize its future cumulative reward

Rt := rt+1 + γ rt+2 + γ² rt+3 + …

Rt is called the return and γ is called the discount factor.

Reinforcement learning is based on the reward hypothesis:

Our goal can be formalized as the outcome of maximizing a cumulative reward.

Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
The Discounted reward
Why do we consider an infinite horizon with a discount factor?

Rt := rt+1 + γ rt+2 + γ² rt+3 + … = ∑_{k=0}^{∞} γ^k r_{t+k+1}

Why not a finite horizon T?
We will show that if T is a random variable, then the finite formulation actually recovers the formulation with an infinite horizon:

R := r1 + … + rT = ∑_{k=0}^{T−1} r_{k+1}
The Discounted reward
Why not a finite horizon T?

R := r1 + … + rT = ∑_{k=0}^{T−1} r_{k+1}

Answer: a fixed finite horizon is not realistic (you do not know when you will die).
If you did, your behaviour would drastically change (e.g., people who are diagnosed with cancer).
Solution: a random finite horizon

ℙ(T = t) = γ^t (1 − γ)
The Discounted reward
Solution: a random finite horizon

ℙ(T = t) = γ^t (1 − γ)

Why geometric?
1. Simple.
2. Memoryless: ℙ(T = t | T ≥ t0) = ℙ(T = t − t0)
   • Conditioned on the fact that you are still "alive", the end of the episode is not more likely as time passes.
3. It recovers the infinite time horizon:

𝔼_T[R] = 𝔼_T[ ∑_{t=0}^{T−1} r_{t+1} ] = ∑_{t=0}^{∞} r_{t+1} ℙ(T > t) = ∑_{t=0}^{∞} γ^{t+1} r_{t+1} ∝ ∑_{t=0}^{∞} γ^t r_{t+1}

(the inner sum stops at the random horizon T, not at infinity, and the last step is a proportionality rather than an equality, since ℙ(T > t) = γ^{t+1} = γ · γ^t).
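
A quick numerical sanity check of this equivalence (a sketch, not from the lecture): sample the horizon T from the geometric distribution above, average the undiscounted return up to T, and compare with the discounted return. The reward sequence below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
rewards = np.array([1.0, 0.5, 2.0, 0.0, 1.5] * 40)   # arbitrary sequence r_1, r_2, ...

# Discounted return: sum_t gamma^t * r_{t+1}
discounted = np.sum(gamma ** np.arange(len(rewards)) * rewards)

# Monte Carlo estimate of E_T[r_1 + ... + r_T] with P(T = t) = gamma^t (1 - gamma)
T_samples = rng.geometric(1.0 - gamma, size=200_000) - 1   # numpy's geometric starts at 1
estimate = np.mean([rewards[:T].sum() for T in T_samples])

# estimate ≈ gamma * discounted (the proportionality factor gamma from the slide)
print(discounted, estimate / gamma)
```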
Connection with games
• Similar issues in game theory:
  • The equilibria are not the same depending on whether the horizon is known or not (see next week).
• When the horizon is known (e.g., in the repeated Prisoner's Dilemma):
  • you will defect at the last timestep (best strategy),
  • hence defect at the previous timestep (and so on).
• When the horizon is not known:
  • No backward induction!
  • New subgame perfect equilibria (cf. next lecture).
  • You can cooperate!
Conclusion on Finite vs. Infinite Horizon
- A finite, known horizon can lead to pathological behaviour.

- Because they are rational, agents in both RL and games can act differently close to the end of the episode (if they know the horizon).

- These undesired behaviours (in real life we usually do not know the exact horizon) are avoided with an infinite horizon and a discounted reward.

- Such an infinite-horizon discounted reward is equivalent to the cumulative reward with a random end of the episode.

Rational agent: an agent that aims at maximizing its reward.


Back to RL: Maximizing value by taking actions

Goal: select actions to maximise cumulative reward.


• Actions may have long-term consequences.
• Reward may be delayed.
• Better to sacrifice immediate reward to gain more long-term
reward.
• Examples:
• Refuelling a helicopter (might prevent a crash in several hours)
• Defensive moves in a game (may help chances of winning later)
• Learning a new skill (can be costly & time-consuming at first)

A mapping from states to actions is called a policy


Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
Agent, Environment, and Policy
At each step t the agent:
• Receives state st and reward rt
• Executes action at ∼ πθ(a | st )

The environment:
• Receives action at
• Moves to state st+1 and emits reward rt+1

Goal: learn a policy πθ that maximizes the cumulative reward.

Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
Policy Gradient
Policy Objective Functions
Goal: given a policy πθ(a | s), find the best parameters θ:

θ* ∈ argmax_θ 𝔼_{s0∼d0, at∼πθ(a|st)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

rt+1: reward obtained for picking at in st.
d0: distribution over the initial state.
st+1: sampled by the environment, as a function of at ∼ πθ(a | st).

Usually one writes

J(θ) := 𝔼_{s0∼d0, at∼πθ(a|st)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]
Policy Objective Functions
Goal: given a policy πθ(a | s), find the best parameters θ.

Policy-based reinforcement learning is an optimization problem: find θ that maximises J(θ).

Focus on stochastic gradient ascent:
- Efficient.
- Easy to implement with deep nets.

Recall: J(θ) := 𝔼_{s0∼d0, at∼πθ(a|st)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]
Gradients on parameterized policies
Goal: compute ∇θ J(θ) where

J(θ) := 𝔼_{πθ}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

Problem: the trajectories depend on θ! Thus, we cannot simply switch expectation and differentiation.

The trick (shown here for a single-step reward):

∇θ 𝔼_{a∼πθ(a|st)}[rt+1] = ∇θ ∫_a rt+1(a, st) πθ(a | st) da
                        = ∫_a rt+1(a, st) ∇θ πθ(a | st) da
                        = ∫_a rt+1(a, st) (∇θ πθ(a | st) / πθ(a | st)) πθ(a | st) da
                        = ∫_a rt+1(a, st) ∇θ log πθ(a | st) · πθ(a | st) da
                        = 𝔼_{a∼πθ(a|st)}[ rt+1 ∇θ log πθ(a | st) ]
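
As an illustration (a sketch, not from the slides), the identity ∇θ 𝔼[r(a)] = 𝔼[r(a) ∇θ log πθ(a)] can be checked numerically for a softmax policy over a few actions; the reward values and parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 3.0, 0.5])       # arbitrary r(a) for 3 actions
theta = np.array([0.2, -0.1, 0.4])        # softmax policy parameters (one logit per action)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)

# Exact gradient of E_{a~pi_theta}[r(a)] = sum_a pi_theta(a) r(a) for a softmax policy
exact_grad = pi * (rewards - pi @ rewards)

# Score-function estimate: average of r(a) * grad_theta log pi_theta(a)
actions = rng.choice(3, size=200_000, p=pi)
grad_log_pi = np.eye(3)[actions] - pi      # grad_theta log softmax(theta)[a] = e_a - pi
estimate = (rewards[actions][:, None] * grad_log_pi).mean(axis=0)

print(exact_grad)
print(estimate)   # matches the exact gradient up to Monte Carlo noise
```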
Policy Gradient Theorem
Theorem: under certain regularity assumptions on πθ (for instance, πθ has a density for continuous action spaces), we have

∇J(θ) = 𝔼_{πθ}[ ∑_t Q^{πθ}(st, at) ∇θ log πθ(at | st) ] = 𝔼_{s∼p^{πθ}, a∼πθ}[ Q^{πθ}(s, a) ∇θ log πθ(a | s) ]

where Q^π(st, at) := 𝔼_π[ ∑_{t′=0}^{∞} γ^{t′} r_{t′+1} | S0 = st, A0 = at ] is called the Q-value
(we start the trajectory at st by picking the action at).

Note that Q^π(st, at) = 𝔼_π[ rt+1 + γ Q^π(st+1, at+1) | S0 = st, A0 = at ]: Bellman's fixed point equation.
Policy Gradient Theorem
Corollary: under the same regularity assumptions on πθ, we have

∇J(θ) = 𝔼_{πθ}[ ∑_t (Q^{πθ}(st, at) − B(st)) ∇θ log πθ(at | st) ]

where Q^π is the Q-value defined above, and B(st) is any function independent of the action at. This function is called the baseline.
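
A small numerical illustration of why a baseline is allowed (a sketch, not from the slides): subtracting any action-independent B leaves the expected gradient unchanged, because 𝔼_{a∼πθ}[∇θ log πθ(a|s)] = 0, while it can shrink the variance. The Q-values and parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.3, 0.0, -0.2])
q_values = np.array([2.0, 2.5, 1.5])       # pretend Q^pi(s, a) for one fixed state s

pi = np.exp(theta - theta.max()); pi /= pi.sum()
actions = rng.choice(3, size=200_000, p=pi)
grad_log_pi = np.eye(3)[actions] - pi       # grad_theta log pi(a|s) for a softmax policy

def grad_samples(baseline):
    """Per-sample gradient estimates (Q(s,a) - B(s)) * grad log pi(a|s)."""
    return (q_values[actions] - baseline)[:, None] * grad_log_pi

no_baseline = grad_samples(0.0)
with_baseline = grad_samples(pi @ q_values)   # baseline B(s) = V(s) = E_a[Q(s, a)]

print(no_baseline.mean(axis=0), with_baseline.mean(axis=0))          # same expectation
print(no_baseline.var(axis=0).sum(), with_baseline.var(axis=0).sum())  # variance drops
```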
Actor Critic
Given a policy πθ.

Problem: it is not easy to estimate its Q-value Q^{πθ}.

Idea: use another network (the critic) to do it!

Q^{πθ}(st, at) ≈ rt+1 + γ Vw(st+1)

Vw is the critic (parametrised by w).

rt+1: reward for picking at ∼ πθ(a | st). Vw(st+1): value of the next state, estimated by the critic.

Last idea: also use Vw(st) as a baseline.


Actor-Critic Meta Algorithm
Two “players”:
• Policy πθ learnt by policy gradient on J(θ).
• Q-value of πθ estimated with the critic Vw.

Learning steps:
1. Take an action according to the policy at ∼ πθ(a | st).
2. Update the critic using the reward rt+1 caused by at.
3. Estimate Q^{πθ}(st, at) using the critic Vw and update πθ.

Interactions between the critic and the policy! GAME!


Connecting Generative Adversarial Networks and Actor-Critic Methods
Paper by David Pfau and Oriol Vinyals (2017)

Connection between GANs and actor-critic:

Policy <-> Generator & Critic <-> Discriminator


Standard Actor-Critic
Critic: update the parameters w of Vw by TD.
Actor: update θ by policy gradient.
Standard actor-critic:
• Initialise s0, θ, w
• for t = 0, 1, 2, … do:
  • Sample at ∼ πθ(a | st)
  • Sample rt+1 and st+1
  • δt+1 := rt+1 + γ Vw(st+1) − Vw(st)   [one-step TD-error, or advantage]
  • w ← w + η δt+1 ∇w Vw(st)   [TD(0) critic update]
  • θ ← θ + α δt+1 ∇θ log πθ(at | st)   [policy gradient step]
Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
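
A minimal runnable sketch of this loop (an illustration under assumptions, not the lecture's code): a tabular softmax actor and a tabular critic on a tiny random MDP; the MDP, step sizes, and number of steps are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, eta, alpha = 4, 2, 0.9, 0.1, 0.05

# Arbitrary toy MDP: random transition probabilities and rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over s'
R = rng.normal(size=(n_states, n_actions))                        # reward r_{t+1} for taking a in s

theta = np.zeros((n_states, n_actions))   # actor: softmax policy parameters
V = np.zeros(n_states)                    # critic: tabular value estimates V_w

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for t in range(50_000):
    pi_s = policy(s)
    a = rng.choice(n_actions, p=pi_s)            # a_t ~ pi_theta(. | s_t)
    r = R[s, a]                                  # r_{t+1}
    s_next = rng.choice(n_states, p=P[s, a])     # s_{t+1}

    delta = r + gamma * V[s_next] - V[s]         # one-step TD error (advantage estimate)
    V[s] += eta * delta                          # TD(0) critic update
    theta[s] += alpha * delta * (np.eye(n_actions)[a] - pi_s)  # policy gradient step (softmax score)
    s = s_next

print(np.round(policy(0), 3), np.round(V, 3))
```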
Another Actor-Critic
Another idea to approximate Q^{πθ}(st, at):

directly use a neural network: Q^{πθ}(st, at) ≈ Qw(st, at).

Learn it with a DQN-style loss:

ℒ(w) = 𝔼_{s,a,r,s′∼πθ}[ (Qw(s, a) − y(s′, r))² ]

where y(s′, r) = r + γ max_{a′} Qw̄(s′, a′)

and w̄ is a copy of w (no differentiation through it!!!).
Intuition: learn to match the target y.
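
A sketch of how this target and loss could be computed with a frozen copy (an illustration under assumptions, using a linear Q-function in place of a deep net; all names and numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features, gamma = 3, 5, 0.99
w = rng.normal(size=(n_actions, n_features))   # "online" parameters w (linear Q stands in for a net)
w_bar = w.copy()                               # frozen copy w_bar, refreshed only occasionally

def q_values(params, features):
    return params @ features                   # Q(s, .) for all actions, shape (n_actions,)

def dqn_step(w, w_bar, s_feat, a, r, s_next_feat, lr=0.01):
    """One semi-gradient step on (Q_w(s,a) - y)^2, with y built from the frozen copy."""
    y = r + gamma * q_values(w_bar, s_next_feat).max()   # target: no gradient flows through w_bar
    td_error = q_values(w, s_feat)[a] - y
    w = w.copy()
    w[a] -= lr * 2.0 * td_error * s_feat                 # gradient only through Q_w(s, a)
    return w, td_error ** 2

# Toy usage with an arbitrary transition (s, a, r, s').
w, loss = dqn_step(w, w_bar, rng.normal(size=n_features), a=1, r=0.5,
                   s_next_feat=rng.normal(size=n_features))
```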





Connecting Generative Adversarial Networks and Actor-Critic Methods
Paper by David Pfau and Oriol Vinyals (2017)

Connection between GANs and actor-critic:

Policy <-> Generator & Critic <-> Discriminator


Q-learning
Q-Learning: Tabular Setting
Goal: learn the Q-value of each state-action pair.

Given a state s and a policy π, we want to estimate the cumulative reward for each action:

Q^π(s, a) = 𝔼[ rt+1 + γ rt+2 + … | st = s, at = a, π ]

Fixed point equation:

Q^π(s, a) = 𝔼[ rt+1 + γ Q^π(st+1, at+1) | st = s, at = a, at+1 ∼ π(· | st+1) ]
Q-Learning: Tabular Setting
Thm: the best Q-value corresponds to the optimal policy: π*(s) = argmax_a Q*(s, a), where

Q*(s, a) = 𝔼[ rt+1 + γ max_{a′} Q*(st+1, a′) | st = s, at = a ]
Q-Learning: Tabular Setting
Goal: learn the optimal Q-value of each state-action pair.
Given a state s, we want to estimate the cumulative reward for each action (assuming subsequent optimal play):

Q*(s, a) = 𝔼_{(s′,r)∼P(·|a,s)}[ r + γ max_{a′} Q*(s′, a′) ]

Fixed point equation F(Q*) = 0, where

F(Q)(s, a) := Q(s, a) − 𝔼_{(s′,r)∼P(·|a,s)}[ r + γ max_{a′} Q(s′, a′) ]

Note: the max is a non-linear operation! In the tabular case, (s, a) indexes the entries of Q, and the expectation could be rewritten with a transition matrix.
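
As an illustration of that last remark (a sketch on an arbitrary toy MDP, not the lecture's code), here is the operator F written with an explicit transition matrix; iterating Q ← Q − F(Q) is just value iteration on Q.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition matrix P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # expected reward r(s, a)

def F(Q):
    """F(Q)(s,a) = Q(s,a) - (r(s,a) + gamma * sum_s' P[s,a,s'] * max_a' Q(s',a'))."""
    bellman_backup = R + gamma * P @ Q.max(axis=1)   # expectation written with the transition matrix
    return Q - bellman_backup

# Finding Q* such that F(Q*) = 0 by fixed-point iteration (value iteration on Q).
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q = Q - F(Q)            # equivalently: Q = R + gamma * P @ Q.max(axis=1)

print(np.abs(F(Q)).max())   # ~0: Q is (numerically) the fixed point Q*
```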




Standard Q-Learning algorithm
• Initialize Q0(s, a), ∀s, a
• Initialize s0
• For t = 0, …, T:
  • Sample at ∼ π(st)   (exploration + exploitation)
  • Get rt+1, st+1 from the environment
  • Compute δt := rt+1 + γ max_a Qt(st+1, a) − Qt(st, at)
  • Update Q: Qt+1(st, at) = Qt(st, at) + α δt

Usually π(st) = argmax_a Qt(st, a) for exploitation and π(st) = U({a}) (uniform over actions) for exploration (e.g., ε-greedy).
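
A minimal runnable version of this algorithm (a sketch on an arbitrary toy MDP, not the lecture's code), with an ε-greedy behaviour policy mixing exploitation and exploration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, eps = 4, 2, 0.9, 0.1, 0.1
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy transition matrix P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # toy rewards

Q = np.zeros((n_states, n_actions))   # Q_0
s = 0                                 # s_0
for t in range(100_000):
    # epsilon-greedy: exploitation (argmax) with prob 1-eps, exploration (uniform) with prob eps
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    delta = r + gamma * Q[s_next].max() - Q[s, a]   # TD error with the max over next actions
    Q[s, a] += alpha * delta                        # Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + alpha * delta
    s = s_next

print(np.round(Q, 2))
```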
Q-learning as a Variational Inequality

Standard Q-learning:

Qt+1(st, at) = Qt(st, at) − α[ Qt(st, at) − (rt+1 + γ max_a Qt(st+1, a)) ]

Q-learning as a VIP:

Qt+1 = Qt − α F̃(Qt)

where F̃(Q) is a stochastic estimate of F(Q):

F̃(Q)(s, a) = { F(Q)(st, at)   if s = st and a = at
             { 0              otherwise
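
In code, the VIP view just repackages the tabular update above (a sketch; the transition numbers are arbitrary placeholders): the stochastic operator is zero everywhere except at the visited pair (st, at).

```python
import numpy as np

def F_tilde(Q, s_t, a_t, r_next, s_next, gamma=0.9):
    """Stochastic estimate of F(Q): nonzero only at the visited pair (s_t, a_t)."""
    out = np.zeros_like(Q)
    out[s_t, a_t] = Q[s_t, a_t] - (r_next + gamma * Q[s_next].max())
    return out

# One Q-learning step written as the VIP iteration Q_{t+1} = Q_t - alpha * F_tilde(Q_t).
Q = np.zeros((4, 2))
Q = Q - 0.1 * F_tilde(Q, s_t=0, a_t=1, r_next=0.5, s_next=2)
```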
Questions
What to do with that? What is challenging in this VIP?

Qt+1 = Qt − α F̃(Qt)

- Analysis of Q-learning using the tools from the last 4 classes.

However:
- It is a stochastic VIP.
- The stochasticity depends on the policy π (which may itself depend on Q).
- Is this VIP monotone?????
Conclusion
1. The discounted reward can be seen as a random finite horizon.

2. Actor-critic looks like a GAN for the RL framework:
   1. Actor <—> Generator
   2. Critic <—> Discriminator

3. Tabular Q-learning can be seen as a stochastic variational inequality!

4. We are not even multi-agent yet!!!!
