
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Week 5: Machine Learning enabled by prior Theories

Video 5.4 Reinforcement Learning – Part 3 Q-learning


Q-Learning

Q-learning is a model-free, off-policy, temporal-difference (TD) reinforcement learning algorithm.

The goal of Q-learning is to learn a policy, which tells an agent what action to take under what circumstances.

For any finite Markov decision process (FMDP), Q-learning finds a policy that is optimal in the sense that it maximizes the expected value of the total reward over all successive steps, starting from the current state.

Q-learning can identify an optimal action-selection policy for any given FMDP, given infinite exploration time
and a partly-random policy.

"Q" names the function Q(s,a) that can be said to stand for the "quality" of an action a taken in a given state s.

Suppose we have the optimal Q-function Q*(s, a); then the optimal policy in state s is argmax_a Q*(s, a).
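In code, extracting this greedy policy is a one-liner. A minimal sketch, assuming the Q-function is stored as a Python dict keyed by (state, action) pairs; the example values are taken from the gridworld table later in this lecture:

```python
def greedy_policy(Q, state, actions):
    """Return the action with the highest Q-value in the given state."""
    return max(actions, key=lambda a: Q[(state, a)])

# Hypothetical learned values for state 9 of the gridworld example:
Q = {(9, "N"): -8.0, (9, "S"): 2.0, (9, "W"): 2.0, (9, "E"): 8.0}
print(greedy_policy(Q, 9, ["N", "S", "W", "E"]))  # -> E
```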
Q-learning Algorithm

Initialize Q(s, a) arbitrarily
Repeat (for each episode):
    Initialize s
    Repeat (for each step of the episode):
        Choose a in s using a policy derived from Q (e.g., ε-greedy)
        Take action a, observe r, s′
        Q(s, a) ← Q(s, a) + α [r + γ max_a′ Q(s′, a′) − Q(s, a)]
        s ← s′
    until s is terminal

With α = 1, or with both α = 1 and γ = 1, the updating formula simplifies to:


Q(s, a) ← r + γ max_a′ Q(s′, a′)    (α = 1)
Q(s, a) ← r + max_a′ Q(s′, a′)    (α = 1 and γ = 1)
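A minimal Python sketch of the whole loop above. The env object with reset() and step() methods and the ε-greedy exploration rate are illustrative assumptions, not part of the lecture:

```python
import random

def q_learning(env, states, actions, alpha=1.0, gamma=0.5,
               epsilon=0.1, episodes=1000):
    """Tabular Q-learning; env.reset() is assumed to return a start state,
    env.step(s, a) a (reward, next_state, done) triple."""
    Q = {(s, a): 0.0 for s in states for a in actions}  # initialize (here: all zeros)
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy: mostly exploit, sometimes explore
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            r, s_next, done = env.step(s, a)
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            best_next = max(Q[(s_next, a2)] for a2 in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```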
Example

[Figure: the example gridworld. Transitions into the goal state give r = 8, transitions into a penalty state give r = −8, and all other transitions give r = 0.]
States and Actions

States: s ∈ {1, …, 20}, arranged in a 4 × 5 grid:

 1  2  3  4  5
 6  7  8  9 10
11 12 13 14 15
16 17 18 19 20

Actions: a ∈ {N, S, E, W} (move one cell north, south, east or west).

Assume α = 1 and γ = 0.5, so the update becomes Q(s, a) ← r + 0.5 max_a′ Q(s′, a′).
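On this grid a move changes the state index by −5 (N), +5 (S), +1 (E) or −1 (W). A small sketch of the transition function, assuming that a move off the grid leaves the state unchanged (the slides do not specify border behaviour):

```python
def next_state(s, a, rows=4, cols=5):
    """Next state for action a in state s on a rows x cols grid
    numbered 1..rows*cols row by row; off-grid moves leave s unchanged."""
    row, col = divmod(s - 1, cols)
    if a == "N" and row > 0:
        row -= 1
    elif a == "S" and row < rows - 1:
        row += 1
    elif a == "W" and col > 0:
        col -= 1
    elif a == "E" and col < cols - 1:
        col += 1
    return row * cols + col + 1

assert next_state(7, "N") == 2   # moving North from state 7 reaches state 2
assert next_state(9, "E") == 10  # moving East from state 9 reaches state 10
```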


Initializing the Q(s, a) function

(rows: actions, columns: states)

States:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
N        0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
S        0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
W        0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
E        0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
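In code this all-zero table is a single array; a sketch using NumPy (the action-row ordering is a convention chosen here):

```python
import numpy as np

ACTIONS = ["N", "S", "W", "E"]      # row order as in the table above
Q = np.zeros((len(ACTIONS), 20))    # Q[action, state - 1]

print(Q[ACTIONS.index("N"), 7 - 1])  # Q(7, N) before any episode -> 0.0
```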
An Episode

[Figure: the 4 × 5 grid of states 1–20 with the first episode's path drawn on it; the episode ends by moving North from state 7 into a penalty state.]
Calculating new Q(s, a) values

1st–3rd steps: every intermediate reward is 0 and Q is still all zeros, so each update gives Q(s, a) ← 0 + 0.5 · 0 = 0.

4th step: the move North from state 7 enters the penalty state (r = −8), so Q(7, N) ← −8 + 0.5 · max_a′ Q(s′, a′) = −8 + 0.5 · 0 = −8.
The Q(s, a) function after the first episode

States:  1  2  3  4  5  6   7  8  9 10 11 12 13 14 15 16 17 18 19 20
N        0  0  0  0  0  0  -8  0  0  0  0  0  0  0  0  0  0  0  0  0
S        0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0  0  0  0  0
W        0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0  0  0  0  0
E        0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0  0  0  0  0
A second episode

[Figure: the 4 × 5 grid with the second episode's path; this episode ends by moving East from state 9 into the goal state.]
Calculating new Q(s, a) values

1st–3rd steps: the visited entries are still 0 and the intermediate rewards are 0 (the only nonzero entry so far, Q(7, N) = −8, is never the maximum in its state), so these updates leave Q unchanged.

4th step: the move East from state 9 enters the goal state (r = 8), so Q(9, E) ← 8 + 0.5 · max_a′ Q(s′, a′) = 8 + 0.5 · 0 = 8.
The Q(s, a) function after the second episode

States:  1  2  3  4  5  6   7  8  9 10 11 12 13 14 15 16 17 18 19 20
N        0  0  0  0  0  0  -8  0  0  0  0  0  0  0  0  0  0  0  0  0
S        0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0  0  0  0  0
W        0  0  0  0  0  0   0  0  0  0  0  0  0  0  0  0  0  0  0  0
E        0  0  0  0  0  0   0  0  8  0  0  0  0  0  0  0  0  0  0  0
The Q(s, a) function after a few episodes

States:  1  2  3  4  5  6    7   8   9 10 11  12   13  14 15 16 17 18 19 20
N        0  0  0  0  0  0   -8  -8  -8  0  0   1    2   4  0  0  0  0  0  0
S        0  0  0  0  0  0  0.5   1   2  0  0  -8   -8  -8  0  0  0  0  0  0
W        0  0  0  0  0  0   -8   1   2  0  0  -8  0.5   1  0  0  0  0  0  0
E        0  0  0  0  0  0    2   4   8  0  0   1    2  -8  0  0  0  0  0  0
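Note how the goal reward 8 is discounted by γ = 0.5 for each step back along a path to the goal (8, 4, 2, 1, 0.5). Reading the greedy policy off this table in code also makes the ties visible; a short sketch, with the six middle-state columns copied from the table above:

```python
ACTIONS = ["N", "S", "W", "E"]

# Q-values for states 7, 8, 9, 12, 13, 14 (in the order N, S, W, E)
Q = {
    7:  [-8, 0.5,  -8,  2],
    8:  [-8,   1,   1,  4],
    9:  [-8,   2,   2,  8],
    12: [ 1,  -8,  -8,  1],
    13: [ 2,  -8, 0.5,  2],
    14: [ 4,  -8,   1, -8],
}

for s, values in Q.items():
    best = max(values)
    greedy = [a for a, v in zip(ACTIONS, values) if v == best]
    print(s, greedy)
# States 12 and 13 print ['N', 'E']: these ties are why
# more than one optimal policy exists.
```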
One of the optimal policies

An optimal policy takes the greedy action argmax_a Q(s, a) in each state. Reading the table above: East in states 7, 8 and 9 (entering the goal from state 9), North in state 14, and, resolving the ties in states 12 and 13 toward North, North in those two states as well.
An optimal policy graphically

[Figure: the 4 × 5 grid with arrows marking this optimal policy's action in each state along the path to the goal.]
Another of the optimal policies

The Q-table is the same; only the greedy choices differ. Because Q(12, N) = Q(12, E) = 1 and Q(13, N) = Q(13, E) = 2, the ties can equally be resolved toward East: East in states 12 and 13, North from state 14 into state 9, and East in states 7, 8 and 9 as before. Both policies collect the same discounted reward, so the optimal policy is not unique.
Another optimal policy graphically

[Figure: the 4 × 5 grid with arrows for the alternative optimal policy.]
NPTEL

Video Course on Machine Learning

Professor Carl Gustaf Jansson, KTH

Thanks for your attention!

The next lecture, 5.5, will be on the topic:

Case-Based Reasoning
