
Lecture 10: Reinforcement Learning and Links with Games
Reminders
● Feedback on your project.

References for this lecture:

1. Lowe, Ryan, et al. "Multi-agent actor-critic for mixed cooperative-competitive environments." Advances in Neural Information Processing Systems 30 (2017).
2. Slides on RL: https://dpmd.ai/DeepMindxUCL21
3. Notes on the repeated Prisoner's Dilemma: https://sites.math.northwestern.edu/~clark/364/handouts/repeated.pdf
Today
1. The RL setting
2. The policy gradient method:
   1. Policy gradient 101
   2. A game perspective on actor-critic!
3. Q-learning:
   1. Tabular Q-learning 101
   2. A variational inequality perspective on Q-learning
Disclaimer
- Not a full course on RL!!!!
- More like a crash course.
- Check the course linked in these slides.
The RL setting
Agent and Environment
At each step t the agent:
• Receives observation ot and reward rt
• Executes action at

The environment:
• Receives action at
• Emits observation ot+1 and reward rt+1

Usually: the observation is the state st of the environment.

In the following we assume ot = st (we move from state to state).
Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
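
As a concrete illustration, here is a minimal sketch of this interaction loop (not from the lecture): it assumes a hypothetical `env` object with gym-style `reset()`/`step()` methods and uses a placeholder random policy.

```python
import random

def run_episode(env, policy, max_steps=1000):
    """Roll out one episode of the agent-environment loop described above
    (assumes a hypothetical gym-style env exposing reset() and step())."""
    state = env.reset()                      # initial state s_0
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)               # agent executes a_t
        state, reward, done = env.step(action)  # env emits s_{t+1} and r_{t+1}
        total_reward += reward
        if done:                             # episode ends
            break
    return total_reward

# Placeholder policy: pick an action uniformly at random (purely illustrative).
def random_policy(state, actions=(0, 1)):
    return random.choice(actions)
```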
The Goal in RL
A reward rt is a scalar feedback signal:
- It indicates how well the agent is doing at step t.
- The agent's job is to maximize its future cumulative reward

Rt := rt+1 + γ rt+2 + γ² rt+3 + …

Rt is called the return and γ is called the discount factor.

Reinforcement learning is based on the reward hypothesis:

Our goal can be formalized as the outcome of maximizing a cumulative reward.

Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
The Discounted reward
Why do we consider an infinite horizon with a discount factor?

Rt := rt+1 + γ rt+2 + γ² rt+3 + … = ∑_{k=0}^{∞} γ^k r_{t+k+1}

Why not a finite horizon T?
We will show that if T is a random variable, then the finite formulation actually recovers the formulation with an infinite horizon:

R := r1 + … + rT = ∑_{k=0}^{T−1} r_{k+1}
The Discounted reward
Why not a finite horizon T?

R := r1 + … + rT = ∑_{k=0}^{T−1} r_{k+1}

Answer: a fixed finite horizon is not realistic (you do not know when you will die).
If you did, your behaviour would drastically change (e.g., people who are diagnosed with cancer).
Solution: a random finite horizon

ℙ(T = t) = γ^t (1 − γ)
The Discounted reward
Solution: a random finite horizon

ℙ(T = t) = γ^t (1 − γ)

Why geometric?
1. Simple.
2. Memoryless: ℙ(T = t | T ≥ t0) = ℙ(T = t − t0)
   • Conditioned on the fact that you are still "alive", the end of the episode is not more likely as time passes.
3. It recovers the infinite time horizon:

𝔼_T[R] = 𝔼_T[ ∑_{t=0}^{T−1} r_{t+1} ] = ∑_{t=0}^{∞} r_{t+1} ℙ(T > t) = ∑_{t=0}^{∞} γ^{t+1} r_{t+1} ∝ ∑_{t=0}^{∞} γ^t r_{t+1}

(the inner sum stops at the random horizon T, not at infinity, and the last step is a proportionality rather than an equality, since ℙ(T > t) = γ^{t+1} = γ · γ^t).
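
A quick numerical sanity check of this equivalence (a sketch, not from the lecture): sample the horizon T from the geometric distribution above, average the undiscounted return up to T, and compare with the discounted return. The reward sequence below is an arbitrary illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
rewards = np.array([1.0, 0.5, 2.0, 0.0, 1.5] * 40)   # arbitrary sequence r_1, r_2, ...

# Discounted return: sum_t gamma^t * r_{t+1}
discounted = np.sum(gamma ** np.arange(len(rewards)) * rewards)

# Monte Carlo estimate of E_T[r_1 + ... + r_T] with P(T = t) = gamma^t (1 - gamma)
T_samples = rng.geometric(1.0 - gamma, size=200_000) - 1   # numpy's geometric starts at 1
estimate = np.mean([rewards[:T].sum() for T in T_samples])

# estimate ≈ gamma * discounted (the proportionality factor gamma from the slide)
print(discounted, estimate / gamma)
```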
Connection with games
• Similar issues in game theory:
  • The equilibria are not the same depending on whether the horizon is known or not (see next week).
• When the horizon is known (e.g., in the repeated Prisoner's Dilemma):
  • you will defect at the last timestep (best strategy),
  • hence defect at the previous timestep (and so on).
• When the horizon is not known:
  • No backward induction!
  • New subgame perfect equilibria (cf. next lecture).
  • You can cooperate!
Conclusion on Finite vs. Infinite Horizon
- A finite, known horizon can lead to pathological behaviour.

- Because they are rational, agents in both RL and games can act differently close to the end of the episode (if they know the horizon).

- These undesired behaviours (in real life we usually do not know the exact horizon) are avoided with an infinite horizon and a discounted reward.

- Such an infinite-horizon discounted reward is equivalent to the cumulative reward with a random end of the episode.

Rational agent: an agent that aims at maximizing its reward.


Back to RL: Maximizing value by taking actions

Goal: select actions to maximise cumulative reward.


• Actions may have long-term consequences.
• Reward may be delayed.
• Better to sacrifice immediate reward to gain more long-term
reward.
• Examples:
• Refuelling a helicopter (might prevent a crash in several hours)
• Defensive moves in a game (may help chances of winning later)
• Learning a new skill (can be costly & time-consuming at first)

A mapping from states to actions is called a policy


Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
Agent, Environment, and Policy
At each step t the agent:
• Receives state st and reward rt
• Executes action at ∼ πθ(a | st )

The environment:
• Receives action at
• Moves to state st+1 and emits reward rt+1

Goal: learn a policy πθ that maximizes the cumulative reward.

Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
Policy Gradient
Policy Objective Functions
Goal: given a policy πθ(a | s), find the best parameters θ:

θ* ∈ argmax_θ 𝔼_{s0∼d0, at∼πθ(a|st)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

rt+1: reward obtained for picking at in st.
d0: distribution over the initial state.
st+1: sampled by the environment, as a function of at ∼ πθ(a | st).

Usually one writes

J(θ) := 𝔼_{s0∼d0, at∼πθ(a|st)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]
Policy Objective Functions
Goal: given a policy πθ(a | s), find the best parameters θ.

Policy-based reinforcement learning is an optimization problem: find θ that maximises J(θ).

Focus on stochastic gradient ascent:
- Efficient.
- Easy to implement with deep nets.

Recall: J(θ) := 𝔼_{s0∼d0, at∼πθ(a|st)}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]
Gradients on parameterized policies
Goal: compute ∇θ J(θ) where

J(θ) := 𝔼_{πθ}[ ∑_{t=0}^{∞} γ^t r_{t+1} ]

Problem: the trajectories depend on θ! Thus, we cannot simply switch expectation and differentiation.

The trick (shown here for a single-step reward):

∇θ 𝔼_{a∼πθ(a|st)}[rt+1] = ∇θ ∫_a rt+1(a, st) πθ(a | st) da
                        = ∫_a rt+1(a, st) ∇θ πθ(a | st) da
                        = ∫_a rt+1(a, st) (∇θ πθ(a | st) / πθ(a | st)) πθ(a | st) da
                        = ∫_a rt+1(a, st) ∇θ log πθ(a | st) · πθ(a | st) da
                        = 𝔼_{a∼πθ(a|st)}[ rt+1 ∇θ log πθ(a | st) ]
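
As an illustration (a sketch, not from the slides), the identity ∇θ 𝔼[r(a)] = 𝔼[r(a) ∇θ log πθ(a)] can be checked numerically for a softmax policy over a few actions; the reward values and parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 3.0, 0.5])       # arbitrary r(a) for 3 actions
theta = np.array([0.2, -0.1, 0.4])        # softmax policy parameters (one logit per action)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

pi = softmax(theta)

# Exact gradient of E_{a~pi_theta}[r(a)] = sum_a pi_theta(a) r(a) for a softmax policy
exact_grad = pi * (rewards - pi @ rewards)

# Score-function estimate: average of r(a) * grad_theta log pi_theta(a)
actions = rng.choice(3, size=200_000, p=pi)
grad_log_pi = np.eye(3)[actions] - pi      # grad_theta log softmax(theta)[a] = e_a - pi
estimate = (rewards[actions][:, None] * grad_log_pi).mean(axis=0)

print(exact_grad)
print(estimate)   # matches the exact gradient up to Monte Carlo noise
```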
Policy Gradient Theorem
Theorem: under certain regularity assumptions on πθ (for instance, πθ has a density for continuous action spaces), we have

∇J(θ) = 𝔼_{πθ}[ ∑_t Q^{πθ}(st, at) ∇θ log πθ(at | st) ] = 𝔼_{s∼p^{πθ}, a∼πθ}[ Q^{πθ}(s, a) ∇θ log πθ(a | s) ]

where Q^π(st, at) := 𝔼_π[ ∑_{t′=0}^{∞} γ^{t′} r_{t′+1} | S0 = st, A0 = at ] is called the Q-value
(we start the trajectory at st by picking the action at).

Note that Q^π(st, at) = 𝔼_π[ rt+1 + γ Q^π(st+1, at+1) | S0 = st, A0 = at ]: Bellman's fixed point equation.
Policy Gradient Theorem
Corollary: under the same regularity assumptions on πθ, we have

∇J(θ) = 𝔼_{πθ}[ ∑_t (Q^{πθ}(st, at) − B(st)) ∇θ log πθ(at | st) ]

where Q^π is the Q-value defined above, and B(st) is any function independent of the action at. This function is called the baseline.
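
A small numerical illustration of why a baseline is allowed (a sketch, not from the slides): subtracting any action-independent B leaves the expected gradient unchanged, because 𝔼_{a∼πθ}[∇θ log πθ(a|s)] = 0, while it can shrink the variance. The Q-values and parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
theta = np.array([0.3, 0.0, -0.2])
q_values = np.array([2.0, 2.5, 1.5])       # pretend Q^pi(s, a) for one fixed state s

pi = np.exp(theta - theta.max()); pi /= pi.sum()
actions = rng.choice(3, size=200_000, p=pi)
grad_log_pi = np.eye(3)[actions] - pi       # grad_theta log pi(a|s) for a softmax policy

def grad_samples(baseline):
    """Per-sample gradient estimates (Q(s,a) - B(s)) * grad log pi(a|s)."""
    return (q_values[actions] - baseline)[:, None] * grad_log_pi

no_baseline = grad_samples(0.0)
with_baseline = grad_samples(pi @ q_values)   # baseline B(s) = V(s) = E_a[Q(s, a)]

print(no_baseline.mean(axis=0), with_baseline.mean(axis=0))          # same expectation
print(no_baseline.var(axis=0).sum(), with_baseline.var(axis=0).sum())  # variance drops
```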
Actor Critic
Given a policy πθ.

Problem: it is not easy to estimate its Q-value Q^{πθ}.

Idea: use another network (the critic) to do it!

Q^{πθ}(st, at) ≈ rt+1 + γ Vw(st+1)

Vw is the critic (parametrised by w).

rt+1: reward for picking at ∼ πθ(a | st). Vw(st+1): value of the next state, estimated by the critic.

Last idea: also use Vw(st) as a baseline.


Actor-Critic Meta Algorithm
Two “players”:
• Policy πθ learnt by policy gradient on J(θ).
• Q-value of πθ estimated with the critic Vw.

Learning steps:
1. Take an action according to the policy at ∼ πθ(a | st).
2. Update the critic using the reward rt+1 caused by at.
3. Estimate Q^{πθ}(st, at) using the critic Vw and update πθ.

Interactions between the critic and the policy! GAME!


Connecting Generative Adversarial Networks and Actor-Critic Methods
Paper by David Pfau and Oriol Vinyals (2017)

Connection between GANs and actor-critic:

Policy <-> Generator & Critic <-> Discriminator


Standard Actor-Critic
Critic: update the parameters w of Vw by TD.
Actor: update θ by policy gradient.
Standard actor-critic:
• Initialise s0, θ, w
• for t = 0, 1, 2, … do:
  • Sample at ∼ πθ(a | st)
  • Sample rt+1 and st+1
  • δt+1 := rt+1 + γ Vw(st+1) − Vw(st)   [one-step TD-error, or advantage]
  • w ← w + η δt+1 ∇w Vw(st)   [TD(0) critic update]
  • θ ← θ + α δt+1 ∇θ log πθ(at | st)   [policy gradient step]
Source: https://deepmind.com/learning-resources/reinforcement-learning-series-2021
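
A minimal runnable sketch of this loop (an illustration under assumptions, not the lecture's code): a tabular softmax actor and a tabular critic on a tiny random MDP; the MDP, step sizes, and number of steps are all arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, eta, alpha = 4, 2, 0.9, 0.1, 0.05

# Arbitrary toy MDP: random transition probabilities and rewards.
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] is a distribution over s'
R = rng.normal(size=(n_states, n_actions))                        # reward r_{t+1} for taking a in s

theta = np.zeros((n_states, n_actions))   # actor: softmax policy parameters
V = np.zeros(n_states)                    # critic: tabular value estimates V_w

def policy(s):
    p = np.exp(theta[s] - theta[s].max())
    return p / p.sum()

s = 0
for t in range(50_000):
    pi_s = policy(s)
    a = rng.choice(n_actions, p=pi_s)            # a_t ~ pi_theta(. | s_t)
    r = R[s, a]                                  # r_{t+1}
    s_next = rng.choice(n_states, p=P[s, a])     # s_{t+1}

    delta = r + gamma * V[s_next] - V[s]         # one-step TD error (advantage estimate)
    V[s] += eta * delta                          # TD(0) critic update
    theta[s] += alpha * delta * (np.eye(n_actions)[a] - pi_s)  # policy gradient step (softmax score)
    s = s_next

print(np.round(policy(0), 3), np.round(V, 3))
```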
Another Actor-Critic
Another idea to approximate Q^{πθ}(st, at):

directly use a neural network: Q^{πθ}(st, at) ≈ Qw(st, at).

Learn it with a DQN-style loss:

ℒ(w) = 𝔼_{s,a,r,s′∼πθ}[ (Qw(s, a) − y(s′, r))² ]

where y(s′, r) = r + γ max_{a′} Qw̄(s′, a′)

and w̄ is a copy of w (no differentiation through it!!!).
Intuition: learn to match the target y.
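
A sketch of how this target and loss could be computed with a frozen copy (an illustration under assumptions, using a linear Q-function in place of a deep net; all names and numbers are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, n_features, gamma = 3, 5, 0.99
w = rng.normal(size=(n_actions, n_features))   # "online" parameters w (linear Q stands in for a net)
w_bar = w.copy()                               # frozen copy w_bar, refreshed only occasionally

def q_values(params, features):
    return params @ features                   # Q(s, .) for all actions, shape (n_actions,)

def dqn_step(w, w_bar, s_feat, a, r, s_next_feat, lr=0.01):
    """One semi-gradient step on (Q_w(s,a) - y)^2, with y built from the frozen copy."""
    y = r + gamma * q_values(w_bar, s_next_feat).max()   # target: no gradient flows through w_bar
    td_error = q_values(w, s_feat)[a] - y
    w = w.copy()
    w[a] -= lr * 2.0 * td_error * s_feat                 # gradient only through Q_w(s, a)
    return w, td_error ** 2

# Toy usage with an arbitrary transition (s, a, r, s').
w, loss = dqn_step(w, w_bar, rng.normal(size=n_features), a=1, r=0.5,
                   s_next_feat=rng.normal(size=n_features))
```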





Connecting Generative Adversarial Networks and Actor-Critic Methods
Paper by David Pfau and Oriol Vinyals (2017)

Connection between GANs and actor-critic:

Policy <-> Generator & Critic <-> Discriminator


Q-learning
Q-Learning: Tabular Setting
Goal: learn the Q-value of each state-action pair.

Given a state s and a policy π, we want to estimate the cumulative reward for each action:

Q^π(s, a) = 𝔼[ rt+1 + γ rt+2 + … | st = s, at = a, π ]

Fixed point equation:

Q^π(s, a) = 𝔼[ rt+1 + γ Q^π(st+1, at+1) | st = s, at = a, at+1 ∼ π(· | st+1) ]
Q-Learning: Tabular Setting
Thm: the best Q-value corresponds to the optimal policy: π*(s) = argmax_a Q*(s, a), where

Q*(s, a) = 𝔼[ rt+1 + γ max_{a′} Q*(st+1, a′) | st = s, at = a ]
Q-Learning: Tabular Setting
Goal: learn the optimal Q-value of each state-action pair.
Given a state s, we want to estimate the cumulative reward for each action (assuming subsequent optimal play):

Q*(s, a) = 𝔼_{(s′,r)∼P(·|a,s)}[ r + γ max_{a′} Q*(s′, a′) ]

Fixed point equation F(Q*) = 0, where

F(Q)(s, a) := Q(s, a) − 𝔼_{(s′,r)∼P(·|a,s)}[ r + γ max_{a′} Q(s′, a′) ]

Note: the max is a non-linear operation! In the tabular case, (s, a) indexes the entries of Q, and the expectation could be rewritten with a transition matrix.
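
As an illustration of that last remark (a sketch on an arbitrary toy MDP, not the lecture's code), here is the operator F written with an explicit transition matrix; iterating Q ← Q − F(Q) is just value iteration on Q.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # transition matrix P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # expected reward r(s, a)

def F(Q):
    """F(Q)(s,a) = Q(s,a) - (r(s,a) + gamma * sum_s' P[s,a,s'] * max_a' Q(s',a'))."""
    bellman_backup = R + gamma * P @ Q.max(axis=1)   # expectation written with the transition matrix
    return Q - bellman_backup

# Finding Q* such that F(Q*) = 0 by fixed-point iteration (value iteration on Q).
Q = np.zeros((n_states, n_actions))
for _ in range(1000):
    Q = Q - F(Q)            # equivalently: Q = R + gamma * P @ Q.max(axis=1)

print(np.abs(F(Q)).max())   # ~0: Q is (numerically) the fixed point Q*
```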




Standard Q-Learning algorithm
• Initialize Q0(s, a), ∀s, a
• Initialize s0
• For t = 0, …, T:
  • Sample at ∼ π(st)   (exploration + exploitation)
  • Get rt+1, st+1 from the environment
  • Compute δt := rt+1 + γ max_a Qt(st+1, a) − Qt(st, at)
  • Update Q: Qt+1(st, at) = Qt(st, at) + α δt

Usually π(st) = argmax_a Qt(st, a) for exploitation and π(st) = U({a}) (uniform over actions) for exploration (e.g., ε-greedy).
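
A minimal runnable version of this algorithm (a sketch on an arbitrary toy MDP, not the lecture's code), with an ε-greedy behaviour policy mixing exploitation and exploration:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma, alpha, eps = 4, 2, 0.9, 0.1, 0.1
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # toy transition matrix P[s, a, s']
R = rng.normal(size=(n_states, n_actions))                        # toy rewards

Q = np.zeros((n_states, n_actions))   # Q_0
s = 0                                 # s_0
for t in range(100_000):
    # epsilon-greedy: exploitation (argmax) with prob 1-eps, exploration (uniform) with prob eps
    a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
    r = R[s, a]
    s_next = rng.choice(n_states, p=P[s, a])
    delta = r + gamma * Q[s_next].max() - Q[s, a]   # TD error with the max over next actions
    Q[s, a] += alpha * delta                        # Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + alpha * delta
    s = s_next

print(np.round(Q, 2))
```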
Q-learning as a Variational Inequality

Standard Q-learning:

Qt+1(st, at) = Qt(st, at) − α[ Qt(st, at) − (rt+1 + γ max_a Qt(st+1, a)) ]

Q-learning as a VIP:

Qt+1 = Qt − α F̃(Qt)

where F̃(Q) is a stochastic estimate of F(Q):

F̃(Q)(s, a) = { F(Q)(st, at)   if s = st and a = at
             { 0              otherwise
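
In code, the VIP view just repackages the tabular update above (a sketch; the transition numbers are arbitrary placeholders): the stochastic operator is zero everywhere except at the visited pair (st, at).

```python
import numpy as np

def F_tilde(Q, s_t, a_t, r_next, s_next, gamma=0.9):
    """Stochastic estimate of F(Q): nonzero only at the visited pair (s_t, a_t)."""
    out = np.zeros_like(Q)
    out[s_t, a_t] = Q[s_t, a_t] - (r_next + gamma * Q[s_next].max())
    return out

# One Q-learning step written as the VIP iteration Q_{t+1} = Q_t - alpha * F_tilde(Q_t).
Q = np.zeros((4, 2))
Q = Q - 0.1 * F_tilde(Q, s_t=0, a_t=1, r_next=0.5, s_next=2)
```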
Questions
What to do with that? What is challenging in this VIP?

Qt+1 = Qt − α F̃(Qt)

- Analysis of Q-learning using the tools from the last 4 classes.

However:
- It is a stochastic VIP.
- The stochasticity depends on the policy π (which may itself depend on Q).
- Is this VIP monotone?????
Conclusion
1. The discounted reward can be seen as a random finite horizon.

2. Actor-critic looks like a GAN for the RL framework:
   1. Actor <—> Generator
   2. Critic <—> Discriminator

3. Tabular Q-learning can be seen as a stochastic variational inequality!

4. We are not even multi-agent yet!!!!
