L13 Reinforcement Learning

Machine Learning

(Học máy – IT3190E)

Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology

2023
Contents 2

¡ Introduction
¡ Supervised learning
¡ Unsupervised learning
¡ Reinforcement learning
¡ Practical advice
Reinforcement Learning problem 3

¡ Goal: Learn to choose actions that maximize

      r_0 + γ r_1 + γ² r_2 + ⋯ ,   where 0 ≤ γ < 1

   (γ is the discount factor for future rewards)


(Mitchell, 1997)
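The quantity above is the discounted return. A minimal sketch of computing it for a finite reward sequence (the reward values and γ are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Illustrative: three steps of reward
print(discounted_return([1.0, 0.0, 5.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*5.0 = 5.05
```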
Characteristics of Reinforcement learning 4

¡ What makes Reinforcement Learning (RL) different from other machine learning paradigms?
v There is no explicit supervisor, only a reward signal
v Training examples are of form ((S, A), R)
v Feedback is often delayed
v Time really matters (sequential, not independent data)
v Agent's actions affect the subsequent data it receives
¡ Examples of RL
v Play games better than humans
v Manage an investment portfolio
v Make a humanoid robot walk
v …
Reward 5

¡ A reward R_t is a scalar feedback signal
¡ It indicates how well the agent is doing at step t
¡ The agent's job is to maximize cumulative reward
¡ Reinforcement learning is based on the reward hypothesis:
v All goals can be described by the maximization of expected cumulative reward
Examples of reward 6

¡ Play games better than humans


v + reward for increasing score
v - reward for decreasing score

¡ Manage an investment portfolio


v + reward for each $ in bank

¡ Make a humanoid robot walk


v + reward for forward motion
v - reward for falling over
Sequential decision making 7

¡ Goal: Select actions to maximize total future reward


¡ Actions may have long term consequences
¡ Reward may be delayed
¡ It may be better to sacrifice an immediate reward to
gain more long-term reward
¡ Examples:
v A financial investment (may take months to mature)
v Blocking opponent moves (might help winning chances, after
many moves from now)
Agent and Environment (1) 8

n At each step t, the agent:
q Executes action A_t
q Receives observation O_t
q Receives scalar reward R_t

[Figure: the agent, which receives observation O_t and reward R_t and emits action A_t]
Agent and Environment (2) 9

n At each step t, the agent:
q Executes action A_t
q Receives observation O_t
q Receives scalar reward R_t
n At each step t, the environment:
q Receives action A_t
q Emits observation O_{t+1}
q Emits scalar reward R_{t+1}
n t increments at environment step

[Figure: agent-environment loop; action A_t flows from the agent to the environment, observation O_t and reward R_t flow back]
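To make the interaction loop concrete, here is a minimal sketch of the agent-environment interface in Python. The toy environment and the random agent are illustrative stand-ins, not part of the slides:

```python
import random

class ToyEnvironment:
    """A trivial 1-D world: positions 0..4; entering position 4 ends the episode with reward +1."""
    def __init__(self):
        self.position = 0

    def step(self, action):                           # action is -1 (left) or +1 (right)
        self.position = max(0, min(4, self.position + action))
        observation = self.position                   # O_{t+1}
        reward = 1.0 if self.position == 4 else 0.0   # R_{t+1}
        done = self.position == 4
        return observation, reward, done

def random_agent(observation):                        # a stand-in agent that ignores its observation
    return random.choice([-1, +1])

env, obs, done, t = ToyEnvironment(), 0, False, 0
while not done:
    action = random_agent(obs)                        # the agent executes action A_t
    obs, reward, done = env.step(action)              # the environment emits O_{t+1} and R_{t+1}
    t += 1
print("episode finished after", t, "steps with final reward", reward)
```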
History and State 10

¡ The history is the sequence of observations, actions, rewards:
      H_t = O_1, R_1, A_1, …, A_{t-1}, O_t, R_t
v All observable variables up to time t
v The sensorimotor stream of the agent
¡ What happens next depends on the history:
v The agent selects actions
v The environment selects observations/rewards
¡ State is the information used to determine what
happens next
¡ Formally, state is a function of the history:
𝑆𝑡 = 𝑓(𝐻𝑡)
Environment state 11

n The environment state S_t^e is the environment's private representation
q The information the environment uses to pick the next observation or reward
n The environment state is not usually visible to the agent

[Figure: agent-environment loop, with the environment state S_t^e held inside the environment]
Agent state 12

n The agent state S_t^a is the agent's internal representation
q The information the agent uses to pick the next action
q It is the information used by reinforcement learning algorithms
n It can be a function of the history:
      S_t^a = f(H_t)

[Figure: agent-environment loop, with the agent state S_t^a held inside the agent]
Information state 13

¡ An information state (a.k.a. Markov state) contains all useful information from the history
¡ A state S_t is Markov if and only if:
      P(S_{t+1} | S_t) = P(S_{t+1} | S_1, …, S_t)
v The future is independent of the past given the present:
      H_{1:t} → S_t → H_{t+1:∞}
v Once the state is known, the history may be thrown away
v The state is a sufficient statistic of the future
v The environment state S_t^e is Markov
v The history H_t is Markov
Fully observable environments 14

n Full observability: the agent directly observes the environment state:
      O_t = S_t^a = S_t^e
n Agent state = Environment state = Information state
n Formally, this is a Markov decision process (MDP)

[Figure: agent-environment loop in which the agent observes the state S_t directly]
Partially observable environments 15

¡ Partial observability: The agent indirectly observes the environment:
v E.g., a robot with camera vision isn't told its absolute location
v E.g., a trading agent only observes current prices
v E.g., a poker playing agent only observes public cards

¡ Now, Agent state ≠ Environment state


¡ Formally this is a partially observable Markov decision
process (POMDP)
¡ Agent must construct its own state representation S_t^a:
v E.g., by using the complete history: S_t^a = H_t
v E.g., by using a recurrent neural network: S_t^a = σ(S_{t-1}^a W_s + O_t W_o)
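A small numeric sketch of the recurrent state update above, using NumPy. The state/observation sizes, the random weights, and the choice of a sigmoid for σ are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3                                  # illustrative sizes
rng = np.random.default_rng(0)
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))   # recurrent weights
W_o = rng.normal(scale=0.1, size=(obs_dim, state_dim))     # observation weights

s = np.zeros(state_dim)                                    # initial agent state S_0^a
for o in rng.normal(size=(5, obs_dim)):                    # a stream of 5 observations O_t
    s = sigmoid(s @ W_s + o @ W_o)                         # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)
print("agent state after 5 observations:", s)
```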
Major components of a RL agent 16

An RL agent may include one or more of these components:

¡ Policy: Agent's behavior function

¡ Value function: How good is each state and/or action

¡ Model: Agent's representation of the environment


Policy 17

¡ A policy is the agent's behavior


¡ It is a map from state to action
¡ Deterministic policy: 𝑎 = 𝜋(𝑠)
¡ Stochastic policy: π(a|s) = P(A_t = a | S_t = s)
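A small sketch of the two kinds of policy for a tabular problem; the states, actions, and probabilities are made up for illustration:

```python
import random

actions = ["N", "E", "S", "W"]

# Deterministic policy: a lookup table mapping state -> action
deterministic_policy = {(0, 0): "E", (0, 1): "E", (0, 2): "S"}

def act_deterministic(state):
    return deterministic_policy[state]                  # a = pi(s)

# Stochastic policy: a distribution pi(a|s) over actions for each state
stochastic_policy = {(0, 0): [0.1, 0.7, 0.1, 0.1]}      # probabilities for N, E, S, W

def act_stochastic(state):
    probs = stochastic_policy[state]
    return random.choices(actions, weights=probs, k=1)[0]   # sample a ~ pi(.|s)

print(act_deterministic((0, 0)), act_stochastic((0, 0)))
```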
Value function 18

¡ Value function is a prediction of future reward
¡ Used to evaluate the goodness/badness of states
¡ And therefore, to select between actions
      v_π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯ | S_t = s ]
   where R_{t+1}, R_{t+2}, … are generated by following policy π starting at state s
¡ For each policy π, we have a value v_π(s)
¡ We want to find the optimal policy π* such that
      v*(s) = max_π v_π(s),  ∀s
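The definition above can be read operationally: v_π(s) is the average discounted return over episodes that start in s and follow π. A rough Monte Carlo sketch on a made-up two-state chain (the chain, rewards, and probabilities are illustrative assumptions, not from the slides):

```python
import random

# Illustrative chain: from state 0 the agent gets reward 0 and either ends the episode
# or moves to state 1, each with probability 0.5; from state 1 the episode ends with reward 1.
def sample_rewards(state):
    rewards = []
    while state != "end":
        if state == 0:
            rewards.append(0.0)
            state = 1 if random.random() < 0.5 else "end"
        else:                                    # state == 1
            rewards.append(1.0)
            state = "end"
    return rewards

def mc_value_estimate(state, gamma=0.9, episodes=10000):
    """Estimate v_pi(state) as the average sampled discounted return."""
    total = 0.0
    for _ in range(episodes):
        total += sum((gamma ** k) * r for k, r in enumerate(sample_rewards(state)))
    return total / episodes

print(mc_value_estimate(0))   # roughly 0.5 * 0.9 = 0.45
print(mc_value_estimate(1))   # 1.0
```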
Model 19

¡ A model predicts what the environment will do next
¡ P predicts the next state:
      P_{ss'}^a = P(S_{t+1} = s' | S_t = s, A_t = a)
¡ R predicts the next (immediate) reward:
      R_s^a = E[ R_{t+1} | S_t = s, A_t = a ]
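For a small finite problem, both parts of a model can be stored as tables. A minimal sketch with made-up states, actions, and numbers (not the maze example):

```python
# Transition model P[s][a] -> {next_state: probability}
P = {
    "s0": {"right": {"s1": 1.0}},
    "s1": {"right": {"s2": 0.8, "s1": 0.2}},
}

# Reward model R[s][a] -> expected immediate reward
R = {
    "s0": {"right": -1.0},
    "s1": {"right": -1.0},
}

def expected_next_value(s, a, v, gamma=0.9):
    """One-step lookahead with the model: E[R_{t+1}] + gamma * sum_s' P(s'|s,a) * v(s')."""
    return R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a].items())

v = {"s0": -2.0, "s1": -1.0, "s2": 0.0}        # some current value estimates (illustrative)
print(expected_next_value("s1", "right", v))   # -1.0 + 0.9*(0.8*0.0 + 0.2*(-1.0)) = -1.18
```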
Maze example 20

¡ Rewards: -1 per
time-step
¡ Actions: N, E, S, W
¡ States: Agent's
location

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Maze example: Policy 21

¡ Arrows represent
policy π(s) for each
state s

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Maze example: Value function 22

¡ Numbers represent
value v_π(s) of each
state s

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Maze example: Model 23

¡ Agent may have an internal model of the environment
¡ Dynamics: How actions
change the state
¡ Rewards: How much reward
from each state
¡ Grid layout represents the transition model P_{ss'}^a
¡ Numbers represent the immediate reward R_s^a from each state s (same for all actions a)

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Categorizing RL agents (1) 24

¡ Value-based
v No policy
v Value function

¡ Policy-based
v Policy
v No value function

¡ Actor critic
v Policy
v Value function
Categorizing RL agents (2) 25

¡ Model-free
v Policy and/or Value function
v No model

¡ Model-based
v Policy and/or Value function
v Model
Exploration and Exploitation (1) 26

¡ Reinforcement learning is like trial-and-error learning


¡ The agent should discover a good policy from its experiences of the environment, without losing too much reward along the way
Exploration and Exploitation (2) 27

¡ Exploration finds more information about the environment


¡ Exploitation exploits known information to maximize reward
¡ It is usually important to both explore and exploit
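One standard way to balance the two is ε-greedy action selection: with probability ε take a random action (explore), otherwise take the action currently believed best (exploit). A minimal sketch, assuming a tabular Q estimate of the kind introduced later in this lecture; the ε value and table contents are illustrative:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit argmax_a Q[state][a]."""
    if random.random() < epsilon:
        return random.choice(actions)                         # exploration
    return max(actions, key=lambda a: Q[state].get(a, 0.0))   # exploitation

# Illustrative Q table for a single state with three actions
Q = {"s": {"left": 0.2, "right": 0.8, "stay": 0.5}}
print(epsilon_greedy(Q, "s", ["left", "right", "stay"], epsilon=0.1))
```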
Exploration and Exploitation: Examples 28

¡ Restaurant selection
v Exploitation: Go to your favorite restaurant
v Exploration: Try a new restaurant

¡ Online banner advertisements


v Exploitation: Show the most successful advertisement
v Exploration: Show a different advertisement

¡ Game playing
v Exploitation: Play the move you believe is best
v Exploration: Play an experimental move
Q-Learning: What to learn 29

¡ We might try to have the agent learn the value function v*
¡ It could then do a one-step lookahead search to choose the best action from any state s, because
      π(s) = argmax_a [ r(s, a) + γ v*(δ(s, a)) ]
v δ: S × A → S maps a given state s and action a to the next state
v r: S × A → ℝ provides the reward of action a from state s
¡ A problem:
v This works well if the agent knows the functions δ and r
v But when it doesn't, it cannot choose actions this way
Q-Function 30

¡ Define a new function, very similar to v*:
      Q(s, a) = r(s, a) + γ v*(δ(s, a))
v Q(s, a) shows how good it is to perform action a when in state s
v whereas v*(s) shows how good it is for the agent to be in state s
¡ If the agent learns Q, it can choose the optimal action even without knowing δ and r:
      π(s) = argmax_a [ r(s, a) + γ v*(δ(s, a)) ] = argmax_a Q(s, a)
¡ Q is the value function the agent will learn
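Once Q has been learned (or approximated), acting greedily is just an argmax over the actions available in the current state. A minimal sketch with an illustrative, hypothetical Q table:

```python
# A learned Q table: Q[state][action] -> estimated value (illustrative numbers)
Q = {
    "s1": {"up": 72.0, "right": 90.0, "down": 81.0},
    "s2": {"left": 81.0, "right": 100.0},
}

def greedy_policy(Q, state):
    """pi(s) = argmax_a Q(s, a): pick the action with the largest Q value in this state."""
    return max(Q[state], key=Q[state].get)

print(greedy_policy(Q, "s1"))   # 'right'
```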


Training rule to learn Q 31

¡ Note that Q and v* are closely related:
      v*(s) = max_{a'} Q(s, a')
¡ Which allows us to write Q recursively as
      Q(s_t, a_t) = r(s_t, a_t) + γ v*(δ(s_t, a_t))
                  = r(s_t, a_t) + γ v*(s_{t+1})
                  = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
¡ Let Q* denote the learner (agent)'s current approximation to Q, and consider the training rule
      Q*(s, a) ← r(s, a) + γ max_{a'} Q*(s', a')
v where s' is the state resulting from applying action a in state s
Q-Learning for deterministic worlds 32

For each s, a, initialize the table entry Q*(s, a) ← 0
Observe the current state s
Do forever:
v Select an action a and execute it
v Receive immediate reward r
v Observe the new state s'
v Update the table entry for Q*(s, a) as follows:
      Q*(s, a) ← r + γ max_{a'} Q*(s', a')
v s ← s'

(Note: finite action space, finite state space)
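Below is a compact sketch of this procedure for a small deterministic grid world. The 2×3 grid, the reward of 100 for entering the goal cell, and the episodic restart are illustrative assumptions in the spirit of Mitchell's example, not details given on the slide:

```python
import random

# Deterministic grid world: states are (row, col) on a 2x3 grid; the goal is (0, 2).
# Entering the goal yields reward 100; every other transition yields reward 0.
ROWS, COLS, GOAL = 2, 3, (0, 2)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def delta(s, a):
    """Deterministic transition function delta(s, a) -> next state (walls block movement)."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else s

def reward(s, a):
    return 100.0 if delta(s, a) == GOAL else 0.0

# Initialize Q*(s, a) <- 0 for every state-action pair
Q = {(r, c): {a: 0.0 for a in ACTIONS} for r in range(ROWS) for c in range(COLS)}
gamma = 0.9

for _ in range(2000):                       # "do forever", truncated for the sketch
    s = (1, 0)                              # start each episode in the bottom-left cell
    while s != GOAL:
        a = random.choice(list(ACTIONS))    # select an action (random exploration here)
        r, s_next = reward(s, a), delta(s, a)
        # Q*(s, a) <- r + gamma * max_a' Q*(s', a')
        Q[s][a] = r + gamma * max(Q[s_next].values())
        s = s_next

print(Q[(1, 1)])   # Q((1,1),'up') and Q((1,1),'right') both converge to 90 here
```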
Updating Q* 33

¡ Q*(s_1, a_right) ← r + γ · max_{a'} Q*(s_2, a')
                   ← 0 + 0.9 · max{63, 81, 100}
                   ← 90
¡ Note that if rewards are non-negative, then
      ∀ s, a, n:  Q*_{n+1}(s, a) ≥ Q*_n(s, a)
      ∀ s, a, n:  0 ≤ Q*_n(s, a) ≤ Q(s, a)
v where Q*_n is the value at iteration n
(Mitchell, 1997)
Q-Learning for non-deterministic worlds 34

¡ What if reward and next state are non-deterministic?
¡ We redefine v* and Q by taking expected values:
      v*(s) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + ⋯ ]
      Q(s, a) = E[ r(s, a) + γ v*(δ(s, a)) ]
              = Σ_{s', r} P(s', r | s, a) [ r + γ v*(s') ]
¡ Q-learning generalizes to non-deterministic worlds
v Alter the training rule at iteration n to:
      Q*_n(s, a) ← (1 − α_n) Q*_{n−1}(s, a) + α_n [ r + γ max_{a'} Q*_{n−1}(s', a') ]
v where α_n is sometimes known as the learning rate
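The only change relative to the deterministic rule is that the new target is blended with the old estimate using α_n. A minimal sketch of a single update; the decaying schedule α_n = 1/(1 + visits_n(s, a)) follows Mitchell (1997), and the numbers are illustrative:

```python
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))   # Q*[s][a], initialized to 0
visits = defaultdict(lambda: defaultdict(int))

def q_update(s, a, r, s_next, actions, gamma=0.9):
    """Non-deterministic Q-learning update:
    Q*_n(s,a) <- (1 - alpha_n) Q*_{n-1}(s,a) + alpha_n [r + gamma * max_a' Q*_{n-1}(s',a')]."""
    visits[s][a] += 1
    alpha = 1.0 / (1.0 + visits[s][a])        # decaying learning rate (Mitchell, 1997)
    target = r + gamma * max(Q[s_next][a2] for a2 in actions)
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target

# Example: one observed transition (s='s1', a='right') -> reward 0, next state 's2'
actions = ["left", "right"]
Q["s2"]["right"] = 100.0                      # illustrative existing estimate
q_update("s1", "right", 0.0, "s2", actions)
print(Q["s1"]["right"])                       # 0.5 * (0 + 0.9 * 100) = 45.0
```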


References 35

• D. Silver. Lecture 1: Introduction to Reinforcement Learning. (https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
• T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
