Q-Learning: Reinforcement Learning, Basic Q-Learning Algorithm, Common Modifications
Reinforcement Learning
- Delayed reward
- Encourages exploration
- We don't necessarily know the precise results of our actions before we take them
- We don't necessarily know everything about the current state
- Life-long learning
Our Problem
- We don't immediately know how beneficial our last move was
- Rewards: 100 for a win, -100 for a loss
- We don't know what the new state will be from an action
- The current state is well defined
- Life-long learning?
Q-Learning Basics
- At each step s, choose the action a which maximizes the function Q(s, a)
- Q(s, a) = immediate reward for taking the action + best utility (Q) of the resulting state (note the recursive definition)
- Q is the estimated utility function: it tells us how good an action is, given a certain state
- More formally ...
Formal Definition
Q(s, a) = r(s, a) + γ · max_a' Q(s', a')

where:
- r(s, a) = immediate reward
- γ = relative value of delayed vs. immediate rewards (0 to 1)
- s' = the new state after action a
- a, a' = actions in states s and s', respectively

Selected action: π(s) = argmax_a Q(s, a)
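For example, with γ = 0.9 (a value chosen here only for illustration; the slides do not fix γ), an action with no immediate reward that leads to a state whose best entry is 100 gets Q(s, a) = 0 + 0.9 · 100 = 90.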
Q Learning Algorithm
For each state-action pair (s, a), initialize the table entry Q(s, a) to zero
Observe the current state s
Do forever:
- Select an action a and execute it
- Receive immediate reward r
- Observe the new state s'
- Update the table entry for Q(s, a) as follows: Q(s, a) = r + γ · max_a' Q(s', a')
- s = s'
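A minimal sketch of this loop in Python, assuming a simple episodic environment interface (env.reset, env.step) and an actions(s) helper, none of which are specified in the slides:

    import random
    from collections import defaultdict

    def q_learning(env, actions, gamma=0.9, episodes=1000):
        """Tabular Q-learning: every table entry Q[(s, a)] starts at zero."""
        Q = defaultdict(float)
        for _ in range(episodes):
            s = env.reset()                                # observe the current state
            done = False
            while not done:
                a = random.choice(actions(s))              # select an action and execute it
                s_next, r, done = env.step(a)              # immediate reward r, new state s'
                best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions(s_next))
                Q[(s, a)] = r + gamma * best_next          # update the table entry
                s = s_next                                 # s = s'
        return Q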
Example Problem
[Figure: state-transition diagram for the example problem. States s1 through s6, where s6 is the end state. Paired actions connect the states in both directions: a12/a21 between s1 and s2, a14/a41 between s1 and s4, a23/a32 between s2 and s3, a25/a52 between s2 and s5, and a45/a54 between s4 and s5; a36 leads from s3 to s6 and a56 from s5 to s6.]
Initial State
Initial Q-table (every state-action pair starts at zero):
Q(s1,a12)=0   Q(s1,a14)=0   Q(s2,a21)=0   Q(s2,a23)=0
Q(s2,a25)=0   Q(s3,a32)=0   Q(s3,a36)=0   Q(s4,a41)=0
Q(s4,a45)=0   Q(s5,a54)=0   Q(s5,a52)=0   Q(s5,a56)=0
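For concreteness, the example problem can be written down directly; the dictionary and helper names below are illustrative, not from the slides, and the reward of 100 for reaching the end state follows the earlier "Our Problem" slide:

    # Each action aij leads from state si to state sj
    transitions = {
        "a12": ("s1", "s2"), "a21": ("s2", "s1"),
        "a14": ("s1", "s4"), "a41": ("s4", "s1"),
        "a23": ("s2", "s3"), "a32": ("s3", "s2"),
        "a25": ("s2", "s5"), "a52": ("s5", "s2"),
        "a45": ("s4", "s5"), "a54": ("s5", "s4"),
        "a36": ("s3", "s6"), "a56": ("s5", "s6"),   # s6 is the end state
    }

    def actions(state):
        """Actions available in the given state."""
        return [a for a, (src, _) in transitions.items() if src == state]

    def reward(action):
        """100 for reaching the end state s6, 0 otherwise."""
        return 100 if transitions[action][1] == "s6" else 0

    # The initial Q-table: one zero entry per state-action pair
    Q = {(src, a): 0 for a, (src, _) in transitions.items()}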
The Algorithm
Q-table: all entries still 0
Current position: s1
Available actions: a12, a14
Chose a12
Next Move
Q-table: all entries still 0
Current position: s2
Available actions: a21, a25, a23
Chose a23
Next Move
Q-table: all entries still 0
Current position: s3
Available actions: a32, a36
Chose a36
Update Q(s3,a36)
Q-table: Q(s3, a36) = 100; all other entries 0
Current position: s6 (the final state!)
Update: Q(s3, a36) = r = 100 (s6 is the end state, so the future term max_a' Q(s6, a') contributes nothing)
New Game
Q-table: Q(s3, a36) = 100; all other entries 0
Current position: s2
Available actions: a21, a25, a23
Chose a23
After observing s3, the update rule will set Q(s2, a23) = r + γ · max_a' Q(s3, a') = 0 + γ · 100, so the reward propagates back one step toward the start.
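To make the propagation concrete, the short sketch below replays the walkthrough with an assumed γ = 0.9, and assumes the new game follows the same path s1 → s2 → s3 → s6; after the second pass the reward has moved one step closer to the start:

    GAMMA = 0.9                     # assumed discount factor; not fixed by the slides
    Q = {}                          # missing entries are treated as 0

    def update(s, a, r, s_next, next_actions):
        """Q(s, a) = r + gamma * max_a' Q(s', a')."""
        best_next = max((Q.get((s_next, a2), 0) for a2 in next_actions), default=0)
        Q[(s, a)] = r + GAMMA * best_next

    for game in range(2):           # first game, then the new game over the same path
        update("s1", "a12", 0, "s2", ["a21", "a23", "a25"])
        update("s2", "a23", 0, "s3", ["a32", "a36"])
        update("s3", "a36", 100, "s6", [])        # s6 is terminal, so no future term

    # After game 1: Q[("s3","a36")] = 100, everything else 0
    # After game 2: Q[("s2","a23")] = 90 as well, since max_a' Q(s3, a') is now 100
    print(Q)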
Properties
- The table can become very large for complex environments, like a game
- We do not estimate values for unseen state-action pairs
- How do we fix these problems?
One fix is to approximate Q with a neural network:
- Inputs are the state and action
- Output is a number between 0 and 1 that represents the utility
- Helpful idea: use multiple neural networks, one for each action
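A rough sketch of that idea, with one tiny network per action trained toward the usual target r + γ · max_a' Q(s', a'); the hidden-layer size, learning rate, and sigmoid output are choices made here, not taken from the slides:

    import numpy as np

    class QNetwork:
        """One small network per action: state features in, utility in (0, 1) out."""

        def __init__(self, n_features, n_hidden=16, lr=0.01, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_features))
            self.W2 = rng.normal(0.0, 0.1, (1, n_hidden))
            self.lr = lr

        def predict(self, state):
            self.h = np.tanh(self.W1 @ state)                       # hidden layer
            self.y = 1.0 / (1.0 + np.exp(-(self.W2 @ self.h)[0]))   # sigmoid keeps the output in (0, 1)
            return self.y

        def train(self, state, target):
            """One gradient step on squared error toward target = r + gamma * max_a' Q(s', a')."""
            err = self.predict(state) - target
            d_out = err * self.y * (1.0 - self.y)                   # through the sigmoid
            d_hidden = (self.W2[0] * d_out) * (1.0 - self.h ** 2)   # through the tanh layer
            self.W2 -= self.lr * d_out * self.h[None, :]
            self.W1 -= self.lr * np.outer(d_hidden, state)

    # One network per action, as suggested above (feature size chosen arbitrarily)
    nets = {a: QNetwork(n_features=4) for a in ["a12", "a14"]}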
Enhancements
- Exploration strategy
- Store past state-action transitions and retrain on them periodically (see the sketch below)
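A minimal sketch of storing transitions and retraining on them periodically; the buffer size, batch size, and the update callback are assumptions, not from the slides:

    import random
    from collections import deque

    replay_buffer = deque(maxlen=10_000)     # bounded history of past transitions

    def remember(s, a, r, s_next):
        """Store a past state-action transition."""
        replay_buffer.append((s, a, r, s_next))

    def replay(update, batch_size=32):
        """Re-apply the Q update to a random sample of stored transitions."""
        if len(replay_buffer) >= batch_size:
            for s, a, r, s_next in random.sample(replay_buffer, batch_size):
                update(s, a, r, s_next)      # e.g. Q(s,a) = r + gamma * max_a' Q(s',a')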
Exploration Strategy
- We want to focus exploration on the good states
- We also want to explore all states
- Solution: randomly choose the next action, but give a higher probability to the actions that currently have better utility
P(a_i | s) = k^Q(s, a_i) / Σ_j k^Q(s, a_j)

where k > 0 determines how strongly the choice favors actions with high Q values.
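A small sketch of this selection rule; the value of k is an arbitrary choice here:

    import random

    def choose_action(state, actions, Q, k=2.0):
        """Pick action a_i with probability k**Q(s, a_i) / sum_j k**Q(s, a_j)."""
        weights = [k ** Q.get((state, a), 0) for a in actions]
        return random.choices(actions, weights=weights)[0]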
- Look farther into the future of a move
- Update the Q function after looking farther ahead
- Speeds up the learning process
- We will discuss this more when the time comes