L13 Reinforcement Learning

Machine Learning

(Học máy – IT3190E)

Khoat Than
School of Information and Communication Technology
Hanoi University of Science and Technology

2023
Contents 2

¡ Introduction
¡ Supervised learning
¡ Unsupervised learning
¡ Reinforcement learning
¡ Practical advice
Reinforcement Learning problem 3

¡ Goal: Learn to choose actions that maximize

      r_0 + γ r_1 + γ² r_2 + ⋯ ,   where 0 ≤ γ < 1

   (γ is the discount factor for future rewards)


(Mitchell, 1997)
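The quantity above is the discounted return. A minimal sketch of computing it for a finite reward sequence (the reward values and γ are illustrative, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute r_0 + gamma*r_1 + gamma^2*r_2 + ... for a finite reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

# Illustrative: three steps of reward
print(discounted_return([1.0, 0.0, 5.0], gamma=0.9))  # 1.0 + 0.9*0.0 + 0.81*5.0 = 5.05
```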
Characteristics of Reinforcement learning 4

¡ What makes Reinforcement Learning (RL) different from other machine learning paradigms?
v There is no explicit supervisor, only a reward signal
v Training examples are of form ((S, A), R)
v Feedback is often delayed
v Time really matters (sequential, not independent data)
v Agent's actions affect the subsequent data it receives
¡ Examples of RL
v Play games better than humans
v Manage an investment portfolio
v Make a humanoid robot walk
v …
Reward 5

¡ A reward R_t is a scalar feedback signal
¡ It indicates how well the agent is doing at step t
¡ The agent's job is to maximize cumulative reward
¡ Reinforcement learning is based on the reward hypothesis:
v All goals can be described by the maximization of expected cumulative reward
Examples of reward 6

¡ Play games better than humans


v + reward for increasing score
v - reward for decreasing score

¡ Manage an investment portfolio


v + reward for each $ in bank

¡ Make a humanoid robot walk


v + reward for forward motion
v - reward for falling over
Sequential decision making 7

¡ Goal: Select actions to maximize total future reward


¡ Actions may have long term consequences
¡ Reward may be delayed
¡ It may be better to sacrifice an immediate reward to
gain more long-term reward
¡ Examples:
v A financial investment (may take months to mature)
v Blocking opponent moves (might help winning chances, after
many moves from now)
Agent and Environment (1) 8

n At each step t, the agent:
q Executes action A_t
q Receives observation O_t
q Receives scalar reward R_t

[Figure: the agent, which receives observation O_t and reward R_t and emits action A_t]
Agent and Environment (2) 9

n At each step t, the agent:
q Executes action A_t
q Receives observation O_t
q Receives scalar reward R_t
n At each step t, the environment:
q Receives action A_t
q Emits observation O_{t+1}
q Emits scalar reward R_{t+1}
n t increments at environment step

[Figure: agent-environment loop; action A_t flows from the agent to the environment, observation O_t and reward R_t flow back]
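To make the interaction loop concrete, here is a minimal sketch of the agent-environment interface in Python. The toy environment and the random agent are illustrative stand-ins, not part of the slides:

```python
import random

class ToyEnvironment:
    """A trivial 1-D world: positions 0..4; entering position 4 ends the episode with reward +1."""
    def __init__(self):
        self.position = 0

    def step(self, action):                           # action is -1 (left) or +1 (right)
        self.position = max(0, min(4, self.position + action))
        observation = self.position                   # O_{t+1}
        reward = 1.0 if self.position == 4 else 0.0   # R_{t+1}
        done = self.position == 4
        return observation, reward, done

def random_agent(observation):                        # a stand-in agent that ignores its observation
    return random.choice([-1, +1])

env, obs, done, t = ToyEnvironment(), 0, False, 0
while not done:
    action = random_agent(obs)                        # the agent executes action A_t
    obs, reward, done = env.step(action)              # the environment emits O_{t+1} and R_{t+1}
    t += 1
print("episode finished after", t, "steps with final reward", reward)
```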
History and State 10

¡ The history is the sequence of observations, actions, rewards:
      H_t = O_1, R_1, A_1, …, A_{t-1}, O_t, R_t
v All observable variables up to time t
v The sensorimotor stream of the agent
¡ What happens next depends on the history:
v The agent selects actions
v The environment selects observations/rewards
¡ State is the information used to determine what
happens next
¡ Formally, state is a function of the history:
𝑆𝑡 = 𝑓(𝐻𝑡)
Environment state 11

n The environment state S_t^e is the environment's private representation
q The information the environment uses to pick the next observation or reward
n The environment state is not usually visible to the agent

[Figure: agent-environment loop, with the environment state S_t^e held inside the environment]
Agent state 12

n The agent state S_t^a is the agent's internal representation
q The information the agent uses to pick the next action
q It is the information used by reinforcement learning algorithms
n It can be a function of the history:
      S_t^a = f(H_t)

[Figure: agent-environment loop, with the agent state S_t^a held inside the agent]
Information state 13

¡ An information state (a.k.a. Markov state) contains all useful information from the history
¡ A state S_t is Markov if and only if:
      P(S_{t+1} | S_t) = P(S_{t+1} | S_1, …, S_t)
v The future is independent of the past given the present:
      H_{1:t} → S_t → H_{t+1:∞}
v Once the state is known, the history may be thrown away
v The state is a sufficient statistic of the future
v The environment state S_t^e is Markov
v The history H_t is Markov
Fully observable environments 14

n Full observability: the agent directly observes the environment state:
      O_t = S_t^a = S_t^e
n Agent state = Environment state = Information state
n Formally, this is a Markov decision process (MDP)

[Figure: agent-environment loop in which the agent observes the state S_t directly]
Partially observable environments 15

¡ Partial observability: The agent indirectly observes the environment:
v E.g., a robot with camera vision isn't told its absolute location
v E.g., a trading agent only observes current prices
v E.g., a poker playing agent only observes public cards

¡ Now, Agent state ≠ Environment state


¡ Formally this is a partially observable Markov decision
process (POMDP)
¡ Agent must construct its own state representation S_t^a:
v E.g., by using the complete history: S_t^a = H_t
v E.g., by using a recurrent neural network: S_t^a = σ(S_{t-1}^a W_s + O_t W_o)
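A small numeric sketch of the recurrent state update above, using NumPy. The state/observation sizes, the random weights, and the choice of a sigmoid for σ are illustrative assumptions:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

state_dim, obs_dim = 4, 3                                  # illustrative sizes
rng = np.random.default_rng(0)
W_s = rng.normal(scale=0.1, size=(state_dim, state_dim))   # recurrent weights
W_o = rng.normal(scale=0.1, size=(obs_dim, state_dim))     # observation weights

s = np.zeros(state_dim)                                    # initial agent state S_0^a
for o in rng.normal(size=(5, obs_dim)):                    # a stream of 5 observations O_t
    s = sigmoid(s @ W_s + o @ W_o)                         # S_t^a = sigma(S_{t-1}^a W_s + O_t W_o)
print("agent state after 5 observations:", s)
```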
Major components of a RL agent 16

An RL agent may include one or more of these components:

¡ Policy: Agent's behavior function

¡ Value function: How good is each state and/or action

¡ Model: Agent's representation of the environment


Policy 17

¡ A policy is the agent's behavior


¡ It is a map from state to action
¡ Deterministic policy: 𝑎 = 𝜋(𝑠)
¡ Stochastic policy: π(a|s) = P(A_t = a | S_t = s)
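A small sketch of the two kinds of policy for a tabular problem; the states, actions, and probabilities are made up for illustration:

```python
import random

actions = ["N", "E", "S", "W"]

# Deterministic policy: a lookup table mapping state -> action
deterministic_policy = {(0, 0): "E", (0, 1): "E", (0, 2): "S"}

def act_deterministic(state):
    return deterministic_policy[state]                  # a = pi(s)

# Stochastic policy: a distribution pi(a|s) over actions for each state
stochastic_policy = {(0, 0): [0.1, 0.7, 0.1, 0.1]}      # probabilities for N, E, S, W

def act_stochastic(state):
    probs = stochastic_policy[state]
    return random.choices(actions, weights=probs, k=1)[0]   # sample a ~ pi(.|s)

print(act_deterministic((0, 0)), act_stochastic((0, 0)))
```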
Value function 18

¡ Value function is a prediction of future reward
¡ Used to evaluate the goodness/badness of states
¡ And therefore, to select between actions
      v_π(s) = E_π[ R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯ | S_t = s ]
   where R_{t+1}, R_{t+2}, … are generated by following policy π starting at state s
¡ For each policy π, we have a value v_π(s)
¡ We want to find the optimal policy π* such that
      v*(s) = max_π v_π(s),  ∀s
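The definition above can be read operationally: v_π(s) is the average discounted return over episodes that start in s and follow π. A rough Monte Carlo sketch on a made-up two-state chain (the chain, rewards, and probabilities are illustrative assumptions, not from the slides):

```python
import random

# Illustrative chain: from state 0 the agent gets reward 0 and either ends the episode
# or moves to state 1, each with probability 0.5; from state 1 the episode ends with reward 1.
def sample_rewards(state):
    rewards = []
    while state != "end":
        if state == 0:
            rewards.append(0.0)
            state = 1 if random.random() < 0.5 else "end"
        else:                                    # state == 1
            rewards.append(1.0)
            state = "end"
    return rewards

def mc_value_estimate(state, gamma=0.9, episodes=10000):
    """Estimate v_pi(state) as the average sampled discounted return."""
    total = 0.0
    for _ in range(episodes):
        total += sum((gamma ** k) * r for k, r in enumerate(sample_rewards(state)))
    return total / episodes

print(mc_value_estimate(0))   # roughly 0.5 * 0.9 = 0.45
print(mc_value_estimate(1))   # 1.0
```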
Model 19

¡ A model predicts what the environment will do next
¡ P predicts the next state:
      P_{ss'}^a = P(S_{t+1} = s' | S_t = s, A_t = a)
¡ R predicts the next (immediate) reward:
      R_s^a = E[ R_{t+1} | S_t = s, A_t = a ]
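For a small finite problem, both parts of a model can be stored as tables. A minimal sketch with made-up states, actions, and numbers (not the maze example):

```python
# Transition model P[s][a] -> {next_state: probability}
P = {
    "s0": {"right": {"s1": 1.0}},
    "s1": {"right": {"s2": 0.8, "s1": 0.2}},
}

# Reward model R[s][a] -> expected immediate reward
R = {
    "s0": {"right": -1.0},
    "s1": {"right": -1.0},
}

def expected_next_value(s, a, v, gamma=0.9):
    """One-step lookahead with the model: E[R_{t+1}] + gamma * sum_s' P(s'|s,a) * v(s')."""
    return R[s][a] + gamma * sum(p * v[s2] for s2, p in P[s][a].items())

v = {"s0": -2.0, "s1": -1.0, "s2": 0.0}        # some current value estimates (illustrative)
print(expected_next_value("s1", "right", v))   # -1.0 + 0.9*(0.8*0.0 + 0.2*(-1.0)) = -1.18
```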
Maze example 20

¡ Rewards: -1 per
time-step
¡ Actions: N, E, S, W
¡ States: Agent's
location

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Maze example: Policy 21

¡ Arrows represent
policy π(s) for each
state s

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Maze example: Value function 22

¡ Numbers represent
value v_π(s) of each
state s

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Maze example: Model 23

¡ Agent may have an internal model of the environment
¡ Dynamics: How actions
change the state
¡ Rewards: How much reward
from each state
¡ Grid layout represents the transition model P_{ss'}^a
¡ Numbers represent the immediate reward R_s^a from each state s (same for all actions a)

(https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
Categorizing RL agents (1) 24

¡ Value-based
v No policy
v Value function

¡ Policy-based
v Policy
v No value function

¡ Actor critic
v Policy
v Value function
Categorizing RL agents (2) 25

¡ Model-free
v Policy and/or Value function
v No model

¡ Model-based
v Policy and/or Value function
v Model
Exploration and Exploitation (1) 26

¡ Reinforcement learning is like trial-and-error learning


¡ The agent should discover a good policy from its experiences of the environment, without losing too much reward along the way
Exploration and Exploitation (2) 27

¡ Exploration finds more information about the environment


¡ Exploitation exploits known information to maximize reward
¡ It is usually important to both explore and exploit
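One standard way to balance the two is ε-greedy action selection: with probability ε take a random action (explore), otherwise take the action currently believed best (exploit). A minimal sketch, assuming a tabular Q estimate of the kind introduced later in this lecture; the ε value and table contents are illustrative:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit argmax_a Q[state][a]."""
    if random.random() < epsilon:
        return random.choice(actions)                         # exploration
    return max(actions, key=lambda a: Q[state].get(a, 0.0))   # exploitation

# Illustrative Q table for a single state with three actions
Q = {"s": {"left": 0.2, "right": 0.8, "stay": 0.5}}
print(epsilon_greedy(Q, "s", ["left", "right", "stay"], epsilon=0.1))
```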
Exploration and Exploitation: Examples 28

¡ Restaurant selection
v Exploitation: Go to your favorite restaurant
v Exploration: Try a new restaurant

¡ Online banner advertisements


v Exploitation: Show the most successful advertisement
v Exploration: Show a different advertisement

¡ Game playing
v Exploitation: Play the move you believe is best
v Exploration: Play an experimental move
Q-Learning: What to learn 29

¡ We might try to have the agent learn the value function v*
¡ It could then do a one-step lookahead search to choose the best action from any state s, because
      π(s) = argmax_a [ r(s, a) + γ v*(δ(s, a)) ]
v δ: S × A → S maps a given state s and action a to the next state
v r: S × A → ℝ provides the reward of action a from state s
¡ A problem:
v This works well if the agent knows the functions δ and r
v But when it doesn't, it cannot choose actions this way
Q-Function 30

¡ Define a new function, very similar to v*:
      Q(s, a) = r(s, a) + γ v*(δ(s, a))
v Q(s, a) shows how good it is to perform action a when in state s
v whereas v*(s) shows how good it is for the agent to be in state s
¡ If the agent learns Q, it can choose the optimal action even without knowing δ and r:
      π(s) = argmax_a [ r(s, a) + γ v*(δ(s, a)) ] = argmax_a Q(s, a)
¡ Q is the value function the agent will learn
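Once Q has been learned (or approximated), acting greedily is just an argmax over the actions available in the current state. A minimal sketch with an illustrative, hypothetical Q table:

```python
# A learned Q table: Q[state][action] -> estimated value (illustrative numbers)
Q = {
    "s1": {"up": 72.0, "right": 90.0, "down": 81.0},
    "s2": {"left": 81.0, "right": 100.0},
}

def greedy_policy(Q, state):
    """pi(s) = argmax_a Q(s, a): pick the action with the largest Q value in this state."""
    return max(Q[state], key=Q[state].get)

print(greedy_policy(Q, "s1"))   # 'right'
```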


Training rule to learn Q 31

¡ Note that Q and v* are closely related:
      v*(s) = max_{a'} Q(s, a')
¡ Which allows us to write Q recursively as
      Q(s_t, a_t) = r(s_t, a_t) + γ v*(δ(s_t, a_t))
                  = r(s_t, a_t) + γ v*(s_{t+1})
                  = r(s_t, a_t) + γ max_{a'} Q(s_{t+1}, a')
¡ Let Q* denote the learner (agent)'s current approximation to Q, and consider the training rule
      Q*(s, a) ← r(s, a) + γ max_{a'} Q*(s', a')
v where s' is the state resulting from applying action a in state s
Q-Learning for deterministic worlds 32

For each s, a, initialize the table entry Q*(s, a) ← 0
Observe the current state s
Do forever:
v Select an action a and execute it
v Receive immediate reward r
v Observe the new state s'
v Update the table entry for Q*(s, a) as follows:
      Q*(s, a) ← r + γ max_{a'} Q*(s', a')
v s ← s'

(Note: finite action space, finite state space)
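Below is a compact sketch of this procedure for a small deterministic grid world. The 2×3 grid, the reward of 100 for entering the goal cell, and the episodic restart are illustrative assumptions in the spirit of Mitchell's example, not details given on the slide:

```python
import random

# Deterministic grid world: states are (row, col) on a 2x3 grid; the goal is (0, 2).
# Entering the goal yields reward 100; every other transition yields reward 0.
ROWS, COLS, GOAL = 2, 3, (0, 2)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def delta(s, a):
    """Deterministic transition function delta(s, a) -> next state (walls block movement)."""
    r, c = s[0] + ACTIONS[a][0], s[1] + ACTIONS[a][1]
    return (r, c) if 0 <= r < ROWS and 0 <= c < COLS else s

def reward(s, a):
    return 100.0 if delta(s, a) == GOAL else 0.0

# Initialize Q*(s, a) <- 0 for every state-action pair
Q = {(r, c): {a: 0.0 for a in ACTIONS} for r in range(ROWS) for c in range(COLS)}
gamma = 0.9

for _ in range(2000):                       # "do forever", truncated for the sketch
    s = (1, 0)                              # start each episode in the bottom-left cell
    while s != GOAL:
        a = random.choice(list(ACTIONS))    # select an action (random exploration here)
        r, s_next = reward(s, a), delta(s, a)
        # Q*(s, a) <- r + gamma * max_a' Q*(s', a')
        Q[s][a] = r + gamma * max(Q[s_next].values())
        s = s_next

print(Q[(1, 1)])   # Q((1,1),'up') and Q((1,1),'right') both converge to 90 here
```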
Updating Q* 33

¡ Q*(s_1, a_right) ← r + γ · max_{a'} Q*(s_2, a')
                   ← 0 + 0.9 · max{63, 81, 100}
                   ← 90
¡ Note that if rewards are non-negative, then
      ∀ s, a, n:  Q*_{n+1}(s, a) ≥ Q*_n(s, a)
      ∀ s, a, n:  0 ≤ Q*_n(s, a) ≤ Q(s, a)
v where Q*_n is the value at iteration n
(Mitchell, 1997)
Q-Learning for non-deterministic worlds 34

¡ What if reward and next state are non-deterministic?
¡ We redefine v* and Q by taking expected values:
      v*(s) = E[ r_t + γ r_{t+1} + γ² r_{t+2} + ⋯ ]
      Q(s, a) = E[ r(s, a) + γ v*(δ(s, a)) ]
              = Σ_{s', r} P(s', r | s, a) [ r + γ v*(s') ]
¡ Q-learning generalizes to non-deterministic worlds
v Alter the training rule at iteration n to:
      Q*_n(s, a) ← (1 − α_n) Q*_{n−1}(s, a) + α_n [ r + γ max_{a'} Q*_{n−1}(s', a') ]
v where α_n is sometimes known as the learning rate
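The only change relative to the deterministic rule is that the new target is blended with the old estimate using α_n. A minimal sketch of a single update; the decaying schedule α_n = 1/(1 + visits_n(s, a)) follows Mitchell (1997), and the numbers are illustrative:

```python
from collections import defaultdict

Q = defaultdict(lambda: defaultdict(float))   # Q*[s][a], initialized to 0
visits = defaultdict(lambda: defaultdict(int))

def q_update(s, a, r, s_next, actions, gamma=0.9):
    """Non-deterministic Q-learning update:
    Q*_n(s,a) <- (1 - alpha_n) Q*_{n-1}(s,a) + alpha_n [r + gamma * max_a' Q*_{n-1}(s',a')]."""
    visits[s][a] += 1
    alpha = 1.0 / (1.0 + visits[s][a])        # decaying learning rate (Mitchell, 1997)
    target = r + gamma * max(Q[s_next][a2] for a2 in actions)
    Q[s][a] = (1.0 - alpha) * Q[s][a] + alpha * target

# Example: one observed transition (s='s1', a='right') -> reward 0, next state 's2'
actions = ["left", "right"]
Q["s2"]["right"] = 100.0                      # illustrative existing estimate
q_update("s1", "right", 0.0, "s2", actions)
print(Q["s1"]["right"])                       # 0.5 * (0 + 0.9 * 100) = 45.0
```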


References 35

• D. Silver. Lecture 1: Introduction to Reinforcement Learning. (https://www.davidsilver.uk/wp-content/uploads/2020/03/intro_RL.pdf)
• T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
