Q-Learning: Reinforcement Learning, Basic Q-Learning Algorithm, Common Modifications
Reinforcement Learning
- Delayed reward
- Encourages exploration
- We don't necessarily know the precise results of our actions before we take them
- We don't necessarily know everything about the current state
- Life-long learning
Our Problem
- We don't immediately know how beneficial our last move was
- Rewards: 100 for a win, -100 for a loss
- We don't know what the new state will be from an action
- The current state is well defined
- Life-long learning?
Q-Learning Basics
- At each step s, choose the action a which maximizes the function Q(s, a)
- Q(s, a) = immediate reward for taking the action + best utility (Q) of the resulting state (note the recursive definition)
- Q is the estimated utility function: it tells us how good an action is, given a certain state
- More formally ...
Formal Definition
Q(s, a) = r(s, a) + γ · max_a' Q(s', a')

where:
- r(s, a) = immediate reward
- γ = relative value of delayed vs. immediate rewards (0 to 1)
- s' = the new state after action a
- a, a' = actions in states s and s', respectively

Selected action: π(s) = argmax_a Q(s, a)
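For example, with γ = 0.9 (a value chosen here only for illustration; the slides do not fix γ), an action with no immediate reward that leads to a state whose best entry is 100 gets Q(s, a) = 0 + 0.9 · 100 = 90.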
Q Learning Algorithm
For each state-action pair (s, a), initialize the table entry Q(s, a) to zero
Observe the current state s
Do forever:
- Select an action a and execute it
- Receive immediate reward r
- Observe the new state s'
- Update the table entry for Q(s, a) as follows: Q(s, a) = r + γ · max_a' Q(s', a')
- s = s'
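A minimal sketch of this loop in Python, assuming a simple episodic environment interface (env.reset, env.step) and an actions(s) helper, none of which are specified in the slides:

    import random
    from collections import defaultdict

    def q_learning(env, actions, gamma=0.9, episodes=1000):
        """Tabular Q-learning: every table entry Q[(s, a)] starts at zero."""
        Q = defaultdict(float)
        for _ in range(episodes):
            s = env.reset()                                # observe the current state
            done = False
            while not done:
                a = random.choice(actions(s))              # select an action and execute it
                s_next, r, done = env.step(a)              # immediate reward r, new state s'
                best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions(s_next))
                Q[(s, a)] = r + gamma * best_next          # update the table entry
                s = s_next                                 # s = s'
        return Q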
Example Problem
[Figure: state-transition diagram for the example problem. States s1 through s6, where s6 is the end state. Paired actions connect the states in both directions: a12/a21 between s1 and s2, a14/a41 between s1 and s4, a23/a32 between s2 and s3, a25/a52 between s2 and s5, and a45/a54 between s4 and s5; a36 leads from s3 to s6 and a56 from s5 to s6.]
Initial State
Initial Q-table (every state-action pair starts at zero):
Q(s1,a12)=0   Q(s1,a14)=0   Q(s2,a21)=0   Q(s2,a23)=0
Q(s2,a25)=0   Q(s3,a32)=0   Q(s3,a36)=0   Q(s4,a41)=0
Q(s4,a45)=0   Q(s5,a54)=0   Q(s5,a52)=0   Q(s5,a56)=0
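For concreteness, the example problem can be written down directly; the dictionary and helper names below are illustrative, not from the slides, and the reward of 100 for reaching the end state follows the earlier "Our Problem" slide:

    # Each action aij leads from state si to state sj
    transitions = {
        "a12": ("s1", "s2"), "a21": ("s2", "s1"),
        "a14": ("s1", "s4"), "a41": ("s4", "s1"),
        "a23": ("s2", "s3"), "a32": ("s3", "s2"),
        "a25": ("s2", "s5"), "a52": ("s5", "s2"),
        "a45": ("s4", "s5"), "a54": ("s5", "s4"),
        "a36": ("s3", "s6"), "a56": ("s5", "s6"),   # s6 is the end state
    }

    def actions(state):
        """Actions available in the given state."""
        return [a for a, (src, _) in transitions.items() if src == state]

    def reward(action):
        """100 for reaching the end state s6, 0 otherwise."""
        return 100 if transitions[action][1] == "s6" else 0

    # The initial Q-table: one zero entry per state-action pair
    Q = {(src, a): 0 for a, (src, _) in transitions.items()}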
The Algorithm
Q-table: all entries still 0
Current position: s1
Available actions: a12, a14
Chose a12
Next Move
Q-table: all entries still 0
Current position: s2
Available actions: a21, a25, a23
Chose a23
Next Move
Q-table: all entries still 0
Current position: s3
Available actions: a32, a36
Chose a36
Update Q(s3,a36)
Q-table: Q(s3, a36) = 100; all other entries 0
Current position: s6 (the final state!)
Update: Q(s3, a36) = r = 100 (s6 is the end state, so the future term max_a' Q(s6, a') contributes nothing)
New Game
Q-table: Q(s3, a36) = 100; all other entries 0
Current position: s2
Available actions: a21, a25, a23
Chose a23
After observing s3, the update rule will set Q(s2, a23) = r + γ · max_a' Q(s3, a') = 0 + γ · 100, so the reward propagates back one step toward the start.
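To make the propagation concrete, the short sketch below replays the walkthrough with an assumed γ = 0.9, and assumes the new game follows the same path s1 → s2 → s3 → s6; after the second pass the reward has moved one step closer to the start:

    GAMMA = 0.9                     # assumed discount factor; not fixed by the slides
    Q = {}                          # missing entries are treated as 0

    def update(s, a, r, s_next, next_actions):
        """Q(s, a) = r + gamma * max_a' Q(s', a')."""
        best_next = max((Q.get((s_next, a2), 0) for a2 in next_actions), default=0)
        Q[(s, a)] = r + GAMMA * best_next

    for game in range(2):           # first game, then the new game over the same path
        update("s1", "a12", 0, "s2", ["a21", "a23", "a25"])
        update("s2", "a23", 0, "s3", ["a32", "a36"])
        update("s3", "a36", 100, "s6", [])        # s6 is terminal, so no future term

    # After game 1: Q[("s3","a36")] = 100, everything else 0
    # After game 2: Q[("s2","a23")] = 90 as well, since max_a' Q(s3, a') is now 100
    print(Q)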
Properties
- The table can become very large for complex environments, like a game
- We do not estimate values for unseen state-action pairs
- How do we fix these problems?
One fix is to approximate Q with a neural network:
- Inputs are the state and action
- Output is a number between 0 and 1 that represents the utility
- Helpful idea: use multiple neural networks, one for each action
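A rough sketch of that idea, with one tiny network per action trained toward the usual target r + γ · max_a' Q(s', a'); the hidden-layer size, learning rate, and sigmoid output are choices made here, not taken from the slides:

    import numpy as np

    class QNetwork:
        """One small network per action: state features in, utility in (0, 1) out."""

        def __init__(self, n_features, n_hidden=16, lr=0.01, seed=0):
            rng = np.random.default_rng(seed)
            self.W1 = rng.normal(0.0, 0.1, (n_hidden, n_features))
            self.W2 = rng.normal(0.0, 0.1, (1, n_hidden))
            self.lr = lr

        def predict(self, state):
            self.h = np.tanh(self.W1 @ state)                       # hidden layer
            self.y = 1.0 / (1.0 + np.exp(-(self.W2 @ self.h)[0]))   # sigmoid keeps the output in (0, 1)
            return self.y

        def train(self, state, target):
            """One gradient step on squared error toward target = r + gamma * max_a' Q(s', a')."""
            err = self.predict(state) - target
            d_out = err * self.y * (1.0 - self.y)                   # through the sigmoid
            d_hidden = (self.W2[0] * d_out) * (1.0 - self.h ** 2)   # through the tanh layer
            self.W2 -= self.lr * d_out * self.h[None, :]
            self.W1 -= self.lr * np.outer(d_hidden, state)

    # One network per action, as suggested above (feature size chosen arbitrarily)
    nets = {a: QNetwork(n_features=4) for a in ["a12", "a14"]}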
Enhancements
- Exploration strategy
- Store past state-action transitions and retrain on them periodically (see the sketch below)
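A minimal sketch of storing transitions and retraining on them periodically; the buffer size, batch size, and the update callback are assumptions, not from the slides:

    import random
    from collections import deque

    replay_buffer = deque(maxlen=10_000)     # bounded history of past transitions

    def remember(s, a, r, s_next):
        """Store a past state-action transition."""
        replay_buffer.append((s, a, r, s_next))

    def replay(update, batch_size=32):
        """Re-apply the Q update to a random sample of stored transitions."""
        if len(replay_buffer) >= batch_size:
            for s, a, r, s_next in random.sample(replay_buffer, batch_size):
                update(s, a, r, s_next)      # e.g. Q(s,a) = r + gamma * max_a' Q(s',a')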
Exploration Strategy
- We want to focus exploration on the good states
- We also want to explore all states
- Solution: randomly choose the next action, but give a higher probability to the actions that currently have better utility
P(a_i | s) = k^Q(s, a_i) / Σ_j k^Q(s, a_j)

where k > 0 determines how strongly the choice favors actions with high Q values.
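A small sketch of this selection rule; the value of k is an arbitrary choice here:

    import random

    def choose_action(state, actions, Q, k=2.0):
        """Pick action a_i with probability k**Q(s, a_i) / sum_j k**Q(s, a_j)."""
        weights = [k ** Q.get((state, a), 0) for a in actions]
        return random.choices(actions, weights=weights)[0]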
- Look farther into the future of a move
- Update the Q function after looking farther ahead
- Speeds up the learning process
- We will discuss this more when the time comes