Q-Learning: Reinforcement Learning, Basic Q-Learning Algorithm, Common Modifications

Q-learning is a reinforcement learning algorithm that learns an optimal action-selection policy for an agent interacting with its environment. It works by learning an action-value function (the Q-function) that estimates the expected utility of taking a given action in a given state, and it updates these Q-values from the rewards and punishments received from the environment using temporal difference learning, without requiring a model of the environment. The basic algorithm can be enhanced with neural-network function approximation of the Q-function, exploration strategies that balance exploitation and exploration, and multi-step temporal difference methods that speed up learning by bootstrapping estimates of future rewards.

Uploaded by

Shawn Taylor
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
36 views

Q-Learning: Reinforcement Learning Basic Q-Learning Algorithm Common Modifications

Q-learning is a reinforcement learning algorithm that learns the optimal action-selection policy for an agent interacting with its environment. The algorithm works by learning an action-value function (Q-function) that estimates the expected utility of taking a given action in a given state. It uses temporal difference learning to update the Q-values based on rewards and punishments received from the environment, without requiring a model of the environment. The algorithm can also be enhanced by using neural network function approximation for the Q-function, exploration strategies to balance exploitation and exploration, and temporal difference learning to speed up learning by bootstrapping estimates of future rewards.

Uploaded by

Shawn Taylor
Copyright
© Attribution Non-Commercial (BY-NC)
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

Q-Learning

Reinforcement learning
Basic Q-learning algorithm
Common modifications

Reinforcement Learning

Delayed reward: we don't immediately know whether we did the correct thing
Encourages exploration
We don't necessarily know the precise results of our actions before we do them
We don't necessarily know all about the current state
Life-long learning

Our Problem

We don't immediately know how beneficial our last move was
Rewards: 100 for a win, -100 for a loss
We don't know what the new state will be from an action
The current state is well defined
Life-long learning?

Q-Learning Basics

In each state s, choose the action a that maximizes the function Q(s, a).

Q(s, a) = immediate reward for taking the action + best utility (Q) of the resulting state. Note: this is a recursive definition.

Q is the estimated utility function: it tells us how good an action is in a given state. More formally ...

Formal Definition
Q(s, a) = r(s, a) + γ · max_a' Q(s', a')

where:
r(s, a) = immediate reward
γ = relative value of delayed vs. immediate rewards (0 to 1)
s' = the new state after action a
a, a' = actions in states s and s', respectively

Selected action: π(s) = argmax_a Q(s, a)

Q Learning Algorithm
For each state-action pair (s, a), initialize the table entry Q(s, a) to zero.
Observe the current state s.
Do forever:
--- Select an action a and execute it
--- Receive immediate reward r
--- Observe the new state s'
--- Update the table entry for Q(s, a) as follows: Q(s, a) = r + γ · max_a' Q(s', a')
--- s = s'
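
A minimal Python sketch of this loop for a deterministic environment (not from the slides; the callables available_actions, step, and is_terminal are assumed interfaces, and actions are chosen at random here rather than by an exploration strategy):

```python
# Minimal sketch of the tabular Q-learning loop above.
# Assumptions: deterministic environment described by three callables, random action choice.
import random

def q_learning(start_state, available_actions, step, is_terminal,
               gamma=0.5, episodes=1000):
    """available_actions(s) -> list of actions; step(s, a) -> (reward, next_state);
    is_terminal(s) -> bool. Returns the learned table Q[(s, a)]."""
    Q = {}                                              # table entries default to zero
    for _ in range(episodes):
        s = start_state
        while not is_terminal(s):
            a = random.choice(available_actions(s))     # select an action and execute it
            r, s_next = step(s, a)                      # immediate reward, new state
            if is_terminal(s_next):
                future = 0.0                            # no actions from the terminal state
            else:
                future = max(Q.get((s_next, a2), 0.0) for a2 in available_actions(s_next))
            Q[(s, a)] = r + gamma * future              # Q(s, a) = r + gamma * max_a' Q(s', a')
            s = s_next
    return Q
```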

Example Problem

[State-transition diagram: states s1 through s5 plus a terminal state s6. Action aij moves the agent from state si to state sj; the available actions are a12/a21, a23/a32, a14/a41, a45/a54, a25/a52, and the one-way actions a36 and a56 into the end state s6.]

γ = 0.5; r = 100 if moving into state s6, 0 otherwise
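
For illustration, the example problem can be written down as data and fed to the q_learning sketch above (the names below are illustrative, not from the slides):

```python
# The example problem as data: transitions[s][a] is the state reached by action a from s.
transitions = {
    's1': {'a12': 's2', 'a14': 's4'},
    's2': {'a21': 's1', 'a23': 's3', 'a25': 's5'},
    's3': {'a32': 's2', 'a36': 's6'},
    's4': {'a41': 's1', 'a45': 's5'},
    's5': {'a54': 's4', 'a52': 's2', 'a56': 's6'},
}

def available_actions(s):
    return list(transitions[s].keys())

def step(s, a):
    s_next = transitions[s][a]
    reward = 100 if s_next == 's6' else 0    # r = 100 for moving into s6, 0 otherwise
    return reward, s_next

def is_terminal(s):
    return s == 's6'

# Using the q_learning sketch from above, with gamma = 0.5:
# Q = q_learning('s1', available_actions, step, is_terminal, gamma=0.5)
```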

Initial State
Q(s1, a12) = 0     Q(s1, a14) = 0
Q(s2, a21) = 0     Q(s2, a23) = 0     Q(s2, a25) = 0
Q(s3, a32) = 0     Q(s3, a36) = 0
Q(s4, a41) = 0     Q(s4, a45) = 0
Q(s5, a54) = 0     Q(s5, a52) = 0     Q(s5, a56) = 0


The Algorithm
Q(s1, a12) = 0     Q(s1, a14) = 0
Q(s2, a21) = 0     Q(s2, a23) = 0     Q(s2, a25) = 0
Q(s3, a32) = 0     Q(s3, a36) = 0
Q(s4, a41) = 0     Q(s4, a45) = 0
Q(s5, a54) = 0     Q(s5, a52) = 0     Q(s5, a56) = 0

Current position: s1. Available actions: a12, a14. Chose a12.


Update Q(s1, a12)


Q(s1, a12) = 0     Q(s1, a14) = 0
Q(s2, a21) = 0     Q(s2, a23) = 0     Q(s2, a25) = 0
Q(s3, a32) = 0     Q(s3, a36) = 0
Q(s4, a41) = 0     Q(s4, a45) = 0
Q(s5, a54) = 0     Q(s5, a52) = 0     Q(s5, a56) = 0

Current position: s2. Available actions: a21, a25, a23.
Update Q(s1, a12):
Q(s1, a12) = r + 0.5 * max(Q(s2, a21), Q(s2, a25), Q(s2, a23)) = 0 + 0.5 * 0 = 0

Next Move
Q(s1, a12) = 0     Q(s1, a14) = 0
Q(s2, a21) = 0     Q(s2, a23) = 0     Q(s2, a25) = 0
Q(s3, a32) = 0     Q(s3, a36) = 0
Q(s4, a41) = 0     Q(s4, a45) = 0
Q(s5, a54) = 0     Q(s5, a52) = 0     Q(s5, a56) = 0

Current position: s2. Available actions: a21, a25, a23. Chose a23.


Update Q(s2, a23)


Q(s1, a12) = 0     Q(s1, a14) = 0
Q(s2, a21) = 0     Q(s2, a23) = 0     Q(s2, a25) = 0
Q(s3, a32) = 0     Q(s3, a36) = 0
Q(s4, a41) = 0     Q(s4, a45) = 0
Q(s5, a54) = 0     Q(s5, a52) = 0     Q(s5, a56) = 0

Current position: s3. Available actions: a32, a36.
Update Q(s2, a23):
Q(s2, a23) = r + 0.5 * max(Q(s3, a32), Q(s3, a36)) = 0 + 0.5 * 0 = 0

Next Move
Q(s1, a12) = 0     Q(s1, a14) = 0
Q(s2, a21) = 0     Q(s2, a23) = 0     Q(s2, a25) = 0
Q(s3, a32) = 0     Q(s3, a36) = 0
Q(s4, a41) = 0     Q(s4, a45) = 0
Q(s5, a54) = 0     Q(s5, a52) = 0     Q(s5, a56) = 0

Current position: s3. Available actions: a32, a36. Chose a36.


Update Q(s3,a36)
Q(s1, a12) = 0     Q(s1, a14) = 0
Q(s2, a21) = 0     Q(s2, a23) = 0     Q(s2, a25) = 0
Q(s3, a32) = 0     Q(s3, a36) = 100
Q(s4, a41) = 0     Q(s4, a45) = 0
Q(s5, a54) = 0     Q(s5, a52) = 0     Q(s5, a56) = 0

Current position: s6. FINAL STATE!
Update Q(s3, a36):
Q(s3, a36) = r = 100 (s6 is terminal, so there is no future Q term)


New Game
Q(s1, a12) = 0     Q(s1, a14) = 0
Q(s2, a21) = 0     Q(s2, a23) = 0     Q(s2, a25) = 0
Q(s3, a32) = 0     Q(s3, a36) = 100
Q(s4, a41) = 0     Q(s4, a45) = 0
Q(s5, a54) = 0     Q(s5, a52) = 0     Q(s5, a56) = 0

Current position: s2. Available actions: a21, a25, a23. Chose a23.


Update Q(s2, a23)


Q(s1, a12) = 0     Q(s1, a14) = 0
Q(s2, a21) = 0     Q(s2, a23) = 50    Q(s2, a25) = 0
Q(s3, a32) = 0     Q(s3, a36) = 100
Q(s4, a41) = 0     Q(s4, a45) = 0
Q(s5, a54) = 0     Q(s5, a52) = 0     Q(s5, a56) = 0

Current position: s3. Available actions: a32, a36.
Update Q(s2, a23):
Q(s2, a23) = r + 0.5 * max(Q(s3, a32), Q(s3, a36)) = 0 + 0.5 * 100 = 50

Final State (after many iterations)


Q(s1, a12) = 25     Q(s1, a14) = 25
Q(s2, a21) = 12.5   Q(s2, a23) = 50    Q(s2, a25) = 50
Q(s3, a32) = 25     Q(s3, a36) = 100
Q(s4, a41) = 12.5   Q(s4, a45) = 50
Q(s5, a54) = 25     Q(s5, a52) = 25    Q(s5, a56) = 100
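
As a quick sanity check (not on the slides), these converged values are a fixed point of the update rule Q(s, a) = r + 0.5 * max_a' Q(s', a'); the short script below verifies every entry against the example's transition graph:

```python
# Sanity check: the converged table is a fixed point of Q(s, a) = r + 0.5 * max_a' Q(s', a').
transitions = {'s1': {'a12': 's2', 'a14': 's4'}, 's2': {'a21': 's1', 'a23': 's3', 'a25': 's5'},
               's3': {'a32': 's2', 'a36': 's6'}, 's4': {'a41': 's1', 'a45': 's5'},
               's5': {'a54': 's4', 'a52': 's2', 'a56': 's6'}}
Q = {('s1', 'a12'): 25, ('s1', 'a14'): 25, ('s2', 'a21'): 12.5, ('s2', 'a23'): 50,
     ('s2', 'a25'): 50, ('s3', 'a32'): 25, ('s3', 'a36'): 100, ('s4', 'a41'): 12.5,
     ('s4', 'a45'): 50, ('s5', 'a54'): 25, ('s5', 'a52'): 25, ('s5', 'a56'): 100}

for (s, a), value in Q.items():
    s_next = transitions[s][a]
    reward = 100 if s_next == 's6' else 0                     # r = 100 for entering s6
    future = 0 if s_next == 's6' else max(Q[(s_next, a2)] for a2 in transitions[s_next])
    assert value == reward + 0.5 * future, (s, a)             # every entry checks out
```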


Properties

Convergence: our approximation will converge to the true Q function, but we must visit every state-action pair infinitely many times!

Problems: the table can be very large for complex environments such as a game, and we do not estimate unseen state-action values. How do we fix these problems?

Neural Network Approximation

Instead of the table, use a neural network:
The inputs are the state and action; the output is a number between 0 and 1 that represents the utility.
Helpful idea: use multiple neural networks, one for each action.
Encoding the states and actions *properly* will be challenging.
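
As a concrete illustration of the "one network per action" idea (not from the slides), here is a minimal PyTorch sketch; the one-hot state encoding, network sizes, and training details are assumptions:

```python
# Minimal sketch (assumptions: one-hot state encoding, one small network per action,
# sigmoid output so the estimated utility lies in [0, 1] as on the slide).
import torch
import torch.nn as nn

N_STATES = 6      # e.g. s1..s6 from the example problem
N_ACTIONS = 4     # hypothetical fixed action set

def make_q_net(n_states: int) -> nn.Module:
    """One network per action: maps a one-hot state to a utility in [0, 1]."""
    return nn.Sequential(nn.Linear(n_states, 16), nn.ReLU(),
                         nn.Linear(16, 1), nn.Sigmoid())

q_nets = [make_q_net(N_STATES) for _ in range(N_ACTIONS)]
optimizers = [torch.optim.SGD(net.parameters(), lr=0.1) for net in q_nets]
loss_fn = nn.MSELoss()

def one_hot(state_index: int) -> torch.Tensor:
    x = torch.zeros(1, N_STATES)
    x[0, state_index] = 1.0
    return x

def q_value(state_index: int, action: int) -> torch.Tensor:
    return q_nets[action](one_hot(state_index))

def train_step(state: int, action: int, target: float) -> None:
    """Nudge Q(state, action) toward a (rescaled) Q-learning target r + gamma * max_a' Q(s', a')."""
    optimizers[action].zero_grad()
    loss = loss_fn(q_value(state, action), torch.tensor([[target]], dtype=torch.float32))
    loss.backward()
    optimizers[action].step()
```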

Enhancements

Exploration strategy
Store past state-action transitions and retrain on them periodically (a sketch of such a replay buffer follows this list)
Temporal difference learning
The values may change as time progresses
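
The "store past transitions and retrain on them" idea is a simple form of experience replay. A minimal sketch, assuming the tabular Q dictionary and γ = 0.5 from the example (the buffer size and helper names are illustrative):

```python
# Minimal experience-replay sketch (assumption: tabular Q stored in a dict, gamma = 0.5).
import random
from collections import deque

GAMMA = 0.5
replay_buffer = deque(maxlen=1000)          # holds (s, a, r, s_next) tuples

def remember(s, a, r, s_next):
    """Store a past state-action transition for later retraining."""
    replay_buffer.append((s, a, r, s_next))

def replay(Q, available_actions, batch_size=32):
    """Periodically re-apply the Q-learning update to randomly sampled past transitions."""
    batch = random.sample(list(replay_buffer), min(batch_size, len(replay_buffer)))
    for s, a, r, s_next in batch:
        future = max((Q.get((s_next, a2), 0.0) for a2 in available_actions(s_next)),
                     default=0.0)
        Q[(s, a)] = r + GAMMA * future
```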

Exploration Strategy

We want to explore all states, but also want to focus exploration on the good states.
Solution: randomly choose the next action, giving a higher probability to the actions that currently have better utility:

P(a_i | s) = k^Q(s, a_i) / Σ_j k^Q(s, a_j)

where k > 0 determines how strongly the selection favors actions with high Q values.
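
A minimal sketch of this probabilistic selection rule (the helper available_actions and the value of k are assumptions):

```python
# Minimal sketch: choose an action with probability proportional to k ** Q(s, a).
# A larger k favors exploitation of high-Q actions; k near 1 gives nearly uniform exploration.
import random

def choose_action(Q, s, available_actions, k=2.0):
    actions = available_actions(s)                       # assumed helper: actions legal in s
    weights = [k ** Q.get((s, a), 0.0) for a in actions]
    return random.choices(actions, weights=weights)[0]   # weighted random choice
```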

Temporal Difference (TD) Learning


Look farther into the future of a move
Update the Q function after looking farther ahead
Speeds up the learning process
We will discuss this more when the time comes
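
One common way to "look farther ahead", sketched here only as an illustration (the slides defer the details): form an n-step target from the next several rewards before bootstrapping from the Q table. The helper names below are assumptions:

```python
# Minimal sketch: an n-step lookahead target, one way to "look farther ahead" before updating Q.
GAMMA = 0.5

def n_step_target(rewards, final_state, Q, available_actions, gamma=GAMMA):
    """Discounted sum of the next n observed rewards plus the bootstrapped estimate afterwards.
    Assumes available_actions returns [] for terminal states."""
    target = 0.0
    for i, r in enumerate(rewards):                      # rewards r_t, r_t+1, ..., r_t+n-1
        target += (gamma ** i) * r
    future = max((Q.get((final_state, a), 0.0) for a in available_actions(final_state)),
                 default=0.0)
    return target + (gamma ** len(rewards)) * future     # + gamma^n * max_a Q(s_t+n, a)
```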
