
Reinforcement Learning

Reinforcement Learning: An Introduction
• Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto, 2nd Edition. The book is available for free; references refer to the final PDF version.
  http://incompleteideas.net/book/the-book-2nd.html
• http://incompleteideas.net/book/RLbook2020.pdf
http://angelpawstherapy.org/positive-reinforcement-dog-training.html
Nature of Learning
• We learn from past experiences.
  – When an infant plays, waves its arms, or looks about, it has no explicit teacher.
  – But it does have direct interaction with its environment.
• Years of positive compliments as well as negative criticism have helped shape who we are today.
• Reinforcement learning: a computational approach to learning from interaction.

Richard Sutton and Andrew Barto, Reinforcement Learning: An Introduction
Nishant Shukla, Machine Learning with TensorFlow
Reinforcement Learning

https://www.cs.utexas.edu/~eladlieb/RLRG.html
Machine Learning, Tom Mitchell, 1997

http://www.cs.cmu.edu/~tom/mlbook.html
Frozen Lake

S F F F
F H F H
F F F H
H F F G

(S = start, F = frozen surface, H = hole, G = goal)
Frozen Lake World (OpenAI Gym)

S F F F
F H F H
F F F H
H F F G

Agent → Environment: (1) action (right, left, up, down)
Environment → Agent: (2) state, reward
Frozen Lake World (OpenAI Gym)

S F F F
F H F H
F F F H
H F F G

Agent → Environment: (1) action: RIGHT
Environment → Agent: (2) state: 1, reward: 0

https://gym.openai.com/
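A minimal sketch of this interaction loop with the OpenAI Gym API (the environment id "FrozenLake-v1", the is_slippery flag, and the exact reset/step return signature depend on the installed gym version, so treat those details as assumptions):

import gym

env = gym.make("FrozenLake-v1", is_slippery=False)
state = env.reset()                      # start at S (state 0)
action = 2                               # in FrozenLake, action 2 usually means RIGHT
next_state, reward, done, info = env.step(action)
print(next_state, reward, done)          # e.g. 1, 0.0, False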
NEXT: Try the Frozen Lake Game for Real?

Frozen Lake: Even if you know the way, ask.
Q-function (state-action value function)

Inputs: (1) state, (2) action
Output: (3) quality (expected reward)

Q(state, action)
Policy using Q-function

Q(state, action)

Q(s1, LEFT): 0
Q(s1, RIGHT): 0.5
Q(s1, UP): 0
Q(s1, DOWN): 0.3
Optimal Policy, and Max Q

Q(state, action)

Max Q = the maximum Q value over all actions in a state; the optimal policy picks the action that attains it.
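In symbols (my reconstruction of the slide's point, not text copied from the slide):

\pi^*(s) = \arg\max_a Q(s, a)
\max Q(s) = \max_a Q(s, a)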
Frozen Lake: optimal policy with Q

[figure: Frozen Lake grid with arrows marking the Q-greedy path from S to G]
Finding, Learning Q

• Assume (believe) that Q in s' exists!

• My situation:
  - I am in state s.
  - When I do action a, I will go to state s'.
  - When I do action a, I will get reward r.
  - Q in s', i.e. Q(s', a'), exists!
• How can we express Q(s, a) using Q(s', a')?
State, Action, Reward

S F F F
F H F H
F F F H
H F F G

Future Reward

Learning Q(s, a)?
Learning Q(s, a): a 16x4 Table
16 states and 4 actions (up, down, left, right)

Learning Q(s, a): Table
Initial Q values are all 0.

[table: every cell of the 4x4 grid annotated with four action values, all initialized to 0]
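A minimal sketch of this table in code (assuming NumPy and the 16-state, 4-action Frozen Lake layout above):

import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros([n_states, n_actions])   # Q[s, a]; every entry starts at 0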
Learning Q(s, a) Table (with many trials)
Initial Q values are all 0.

Learning Q(s, a) Table: one success!
[figure: after one successful episode, the entries along the path that reached the goal become 1; all other entries stay 0]

Learning Q(s, a) Table: optimal policy
[figure: following the entries with value 1 from S traces a path to the goal G]
Dummy Q-learning algorithm
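The slide's algorithm box did not survive extraction; below is a hedged sketch of the "dummy" update it describes (no discount, no learning rate), in the spirit of Mitchell's Q-learning for deterministic worlds. The gym environment id and the reset/step return values are assumptions:

import gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=False)   # assumed env id; deterministic variant
Q = np.zeros([env.observation_space.n, env.action_space.n])

for episode in range(2000):
    s = env.reset()
    done = False
    while not done:
        a = np.argmax(Q[s, :])                    # pick an action (exploration is added later)
        s_next, r, done, _ = env.step(a)
        Q[s, a] = r + np.max(Q[s_next, :])        # "dummy" update: overwrite with r + max Q(s', a')
        s = s_next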
Exploit vs. Exploration

[figure: the Q-table after one success; always exploiting the known path of 1s means other, possibly better, paths are never tried]

http://home.deib.polimi.it/restelli/MyWebSite/pdf/rl5.pdf
Exploit vs. Exploration: E-greedy
Epsilon-Greedy

e = 0.1
if np.random.rand() < e:
    a = env.action_space.sample()    # explore: pick a random action
else:
    a = np.argmax(Q[s, :])           # exploit: pick the best known action
Exploit vs. Exploration: decaying E-greedy

for i in range(1000):
    e = 0.1 / (i + 1)                    # epsilon shrinks as training proceeds
    if np.random.rand() < e:
        a = env.action_space.sample()    # explore less and less over time
    else:
        a = np.argmax(Q[s, :])           # exploit the current Q-table
Exploit vs. Exploration: add random noise

a = np.argmax(Q[s, :] + np.random.randn(n_actions))

Example: a = argmax([0.5, 0.6, 0.3, 0.2, 0.5] + [0.1, 0.2, 0.7, 0.3, 0.1])
           = argmax([0.6, 0.8, 1.0, 0.5, 0.6]) = 2
(without the noise, argmax of [0.5, 0.6, 0.3, 0.2, 0.5] would have been 1)
Exploit vs. Exploration: add decaying random noise

for i in range(1000):
    a = np.argmax(Q[s, :] + np.random.randn(n_actions) / (i + 1))   # noise fades out over time
Discounted Future Reward

[figure: Frozen Lake Q-table in which two different paths to the goal both carry value 1; without discounting, a longer path looks just as good as a shorter one]
Learning Q(s, a) with discounted reward

Discounted future reward: with a discount factor γ (0 < γ < 1), a reward that arrives k steps in the future is worth γ^k times an immediate reward.

Learning Q(s, a) with discounted reward:
Q(s, a) ← r + γ max_{a'} Q(s', a')

Q-learning algorithm
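The full algorithm box is an image in the original slide. Here is a hedged sketch of tabular Q-learning on Frozen Lake that combines the discounted update with decaying-noise exploration; the gym API details (env id, reset/step signatures) are assumptions:

import gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=False)   # deterministic world
Q = np.zeros([env.observation_space.n, env.action_space.n])
gamma = 0.99

for i in range(2000):
    s = env.reset()
    done = False
    while not done:
        # exploration: add decaying random noise to the Q row before taking argmax
        a = np.argmax(Q[s, :] + np.random.randn(env.action_space.n) / (i + 1))
        s_next, r, done, _ = env.step(a)
        Q[s, a] = r + gamma * np.max(Q[s_next, :])    # discounted update
        s = s_next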
Convergence

• In deterministic worlds
• In finite state spaces
Machine Learning, Tom Mitchell, 1997
Windy Frozen Lake
[figure: the Frozen Lake grid, now windy (slippery): the agent does not always move in the direction it chooses]
Deterministic vs. Stochastic (nondeterministic)

• In deterministic models, the output is fully determined by the parameter values and the initial conditions.
• Stochastic models possess some inherent randomness.
  - The same set of parameter values and initial conditions can lead to an ensemble of different outputs.
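In Frozen Lake terms, the difference is just a flag (a sketch; the is_slippery keyword exists in gym's FrozenLake, but the exact env id depends on your gym version):

import gym

deterministic_env = gym.make("FrozenLake-v1", is_slippery=False)  # same action, same outcome
stochastic_env    = gym.make("FrozenLake-v1", is_slippery=True)   # slippery ice: the agent may
                                                                   # slide in a different direction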
Stochastic (non-deterministic)
Our previous Q-learning does not work.

Score over time: 0.0165

Why does it not work in stochastic (non-deterministic) worlds?
Stochastic (non-deterministic) world

• Solution?
  - Listen to Q(s') (just a little bit)
  - Update Q(s) a little bit at a time (learning rate)

• Like our life mentors:
  - Don't just listen to and follow one mentor
  - Listen to many mentors
Learning incrementally

Old update (take the target at face value):
Q(s, a) ← r + γ max_{a'} Q(s', a')

• Learning rate α
  - e.g. α = 0.1

New update (move only part of the way toward the target):
Q(s, a) ← (1 - α) Q(s, a) + α [r + γ max_{a'} Q(s', a')]

Learning with learning rate

Q(s, a) ← r + γ max_{a'} Q(s', a')

Q(s, a) ← (1 - α) Q(s, a) + α [r + γ max_{a'} Q(s', a')]

Q(s, a) ← Q(s, a) + α [r + γ max_{a'} Q(s', a') - Q(s, a)]
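In code, the learning-rate update is one line (a sketch; alpha, gamma, and the Q table, state, and transition variables are as defined in the earlier snippets):

alpha, gamma = 0.1, 0.99
# blend the old estimate with the new target instead of overwriting it
Q[s, a] = (1 - alpha) * Q[s, a] + alpha * (r + gamma * np.max(Q[s_next, :]))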
Q-learning algorithm

Convergence

Q̂(s, a) ← (1 - α) Q̂(s, a) + α [r + γ max_{a'} Q̂(s', a')]
Machine Learning, Tom Mitchell, 1997


Q-Table (?)

Can a table still work for a 100x100 maze, or for an 80x80-pixel screen with 2 colors (black/white)? An 80x80 binary image alone has 2^6400 possible states, far too many for a table.


Q-function Approximation

Option 1: the network takes (1) state s and (2) action a as input, and outputs (3) the quality (reward) for that given action (e.g., LEFT: 0.5).

Q-function Approximation

Option 2: the network takes only (1) state s as input, and outputs (2) the quality (reward) for all actions at once (e.g., [0.5, 0.1, 0.0, 0.8] = LEFT: 0.5, RIGHT: 0.1, UP: 0.0, DOWN: 0.8).

Q-function Approximation
Q-Network training (linear regression)

H(x) = Wx
cost(W) = (1/m) Σ_{i=1}^{m} (Wx^(i) - y^(i))^2

[diagram: (1) input state s → (2) output Ws, which approximates Q(s)]
Q-Network training (linear regression)

cost(W) = (Ws - y)^2
y = r + γ max_{a'} Q(s')

[diagram: (1) input state s → (2) output Ws ≈ Q(s)]
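A minimal sketch of such a linear Q-network for Frozen Lake, trained by gradient descent on (Ws - y)^2 with a one-hot state encoding. All names here are my own; the original lecture uses TensorFlow, but plain NumPy keeps the idea visible:

import numpy as np

n_states, n_actions = 16, 4
W = np.random.uniform(0, 0.01, size=(n_states, n_actions))  # linear "network": Q(s) = one_hot(s) @ W
gamma, lr = 0.99, 0.1

def one_hot(s):
    x = np.zeros(n_states)
    x[s] = 1.0
    return x

def train_step(s, a, r, s_next, done):
    q_pred = one_hot(s) @ W                           # predicted Q values for state s, shape (n_actions,)
    y = r if done else r + gamma * np.max(one_hot(s_next) @ W)
    error = q_pred[a] - y                             # only the taken action contributes to the loss
    W[s, a] -= lr * error                             # gradient step on (Ws - y)^2 (constant factor folded into lr)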
Q-Network training (math notations)

• Approximate the Q* function using parameters θ:
  Q̂(s, a|θ) ≈ Q*(s, a)

• Choose θ to minimize:
  min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ))]^2
http://introtodeeplearning.com/6.S091DeepReinforcementLearning.pdf
Algorithm
Convergence
Deep Reinforcement Learning
Reinforcement + Neural Net
DQN: Deep, Replay, Separated Networks
• Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin Riedmiller
• DeepMind Technologies

https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
Q-Nets are unstable
Two big issues

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind

1. Correlations between samples

Playing Atari with Deep Reinforcement Learning, V. Mnih et al.

1. Correlations between samples
2. Non-stationary targets

min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ))]^2

prediction: Ŷ = Q̂(s_t, a_t|θ)        target: Y = r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ)

The prediction and the target share the same θ, so every update that improves the prediction also moves the target.
DQN’s solutions
1. Go deep
2. Capture and replay (correlations between samples)
3. Separate networks: create a target network (non-stationary targets)
1. Go Deep

Problem 2: correlations between samples
Solution 2: experience replay

Capture: store each transition (s_1, a_1, r_2, s_2), (s_2, a_2, r_3, s_3), (s_3, a_3, r_4, s_4), ..., (s_t, a_t, r_{t+1}, s_{t+1}) in a replay buffer.
Replay: sample random minibatches from the buffer and minimize
min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ))]^2
ICML 2016 Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind
Solution 2: experience replay

Playing Atari with Deep Reinforcement Learning, V. Mnih et al.
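A hedged sketch of such a replay buffer (class and variable names are my own, not from the paper):

import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=50000):
        self.buffer = deque(maxlen=capacity)          # old transitions fall off the far end

    def add(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size=32):
        # random sampling breaks the correlation between consecutive transitions
        return random.sample(list(self.buffer), batch_size)

# usage: buffer.add(s, a, r, s_next, done) each step; once the buffer is large enough,
# train on buffer.sample(32) instead of on the latest transition alone.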
Problem 2: correlations between samples

(s_1, a_1, r_2, s_2)
(s_2, a_2, r_3, s_3)
(s_3, a_3, r_4, s_4)
...
(s_t, a_t, r_{t+1}, s_{t+1})
Problem 3: Non-stationary targets

min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ))]^2

prediction: Ŷ = Q̂(s_t, a_t|θ)        target: Y = r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ)
Solution 3: separate target network

min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ̄))]^2

[diagram: the prediction network (weights θ) computes Q̂(s); a separate target network (frozen weights θ̄) computes the target Y]
Solution 3: copy network
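A minimal sketch of how the frozen copy is used and refreshed (the names, the tabular weights, and the update interval are illustrative assumptions, not values from the paper):

import numpy as np

theta        = np.random.uniform(0, 0.01, size=(16, 4))   # online network weights (θ)
theta_target = theta.copy()                                # frozen target network weights (θ̄)
gamma, copy_every = 0.99, 100

def target_value(r, s_next, done):
    # targets are computed with θ̄, so they stay fixed while θ is being updated
    return r if done else r + gamma * np.max(theta_target[s_next, :])

for step in range(10000):
    # ... train theta on minibatches of (s, a, r, s_next, done) using target_value(...) ...
    if step % copy_every == 0:
        theta_target = theta.copy()    # periodically copy θ into θ̄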
Understanding the Nature Paper (2015)
DQN’s three solutions

1. Go deep
2. Capture and replay
   • Correlations between samples
3. Separate networks
   • Non-stationary targets

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind


2. Train from replay memory

https://github.com/awjuliani/DeepRL-Agents
DQN’s three solutions

1. Go deep
2. Capture and replay
   • Correlations between samples
3. Separate networks
   • Non-stationary targets

Tutorial: Deep Reinforcement Learning, David Silver, Google DeepMind


Implementing the Nature Paper

Human-level control through deep reinforcement learning, Nature
http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
Solution 3: separate target network

min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ̄))]^2

[diagram: the prediction network (θ) computes Q̂(s); the target network (θ̄) computes the target Y]

DQN vs. target DQN

min_θ Σ_{t=0}^{T} [Q̂(s_t, a_t|θ) - (r_t + γ max_{a'} Q̂(s_{t+1}, a'|θ̄))]^2
Solution 3: copy network

[diagram: the weights θ of the prediction network are periodically copied into the target network weights θ̄]
DQN works reasonably well
DQN: 2013 vs. 2015

• DQN implementations
  - https://github.com/songrotek/DQN-Atari-Tensorflow
  - https://github.com/dennybritz/reinforcement-learning/blob/master/DQN/dqn.py
  - https://github.com/devsisters/DQN-tensorflow
  - http://www.ehrenbrav.com/2016/08/teaching-your-computer-to-play-super-mario-bros-a-fork-of-the-google-deepmind-atari-machine-learning-project
Deep Reinforcement Learning
https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark

Never Give Up (NGU)
From R2D2 to Never Give Up

The LSTM is replaced with a Transformer.


Agent57
• Improves on the NGU agent
• A meta-controller allows each actor of the agent to choose a different trade-off between short-term and long-term performance

https://deepmind.com/blog/article/Agent57-Outperforming-the-human-Atari-benchmark
Papers
• https://deepmind.com/research/dqn/
• https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
• https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf
• http://nn.cs.utexas.edu/downloads/papers/stanley.gecco02_1.pdf
References
• Simple Reinforcement Learning with TensorFlow, https://medium.com/emergent-future/
• http://kvfrans.com/simple-algoritms-for-solving-cartpole/ (written by a high school student)
• Deep Reinforcement Learning: Pong from Pixels, Andrej Karpathy blog, http://karpathy.github.io/2016/05/31/rl/
• Machine Learning, Tom Mitchell, 1997
• Deep RL Bootcamp, https://sites.google.com/view/deep-rl-bootcamp/lectures
• Reinforcement Learning: An Introduction, Sutton and Barto, 2nd Edition. The book is available for free; references refer to the final PDF version.
• Some other additional references that may be useful:
  - Reinforcement Learning: State-of-the-Art, Marco Wiering and Martijn van Otterlo, Eds.
  - Artificial Intelligence: A Modern Approach, Stuart J. Russell and Peter Norvig
  - Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville
  - David Silver's course on Reinforcement Learning
