CZ3005 Module 5 - Reinforcement Learning
Artificial Intelligence
Reinforcement Learning
https://personal.ntu.edu.sg/hanwangzhang/
Email: [email protected]
Office: N4-02c-87
Lesson Outline
• Some RL algorithms:
– Monte-Carlo
– Temporal difference
– Q-learning
– Deep Q-Network
Reinforcement Learning
• Motivation
– In the last lecture, we computed the value function and found the optimal policy using the known transition function
– But what if the transition function is not available?
– We can still learn the value function and find the optimal policy without the transition function
• From experience
(Diagram: Experience → Learning → Policy/Value)
RL algorithms
• Types of model-free learning:
– Monte Carlo (learns by sampling)
– Q-Learning (learns by bootstrapping)
– DQN
– …
• Basic idea: the agent runs around in the world (e.g., acting randomly) and gains experience to learn from
• What experience? Many trajectories!
• How do we learn?
– Use experience to learn an empirical state value function
An Example
(Figure: a one-dimensional grid world with cell 1, cell 2, cell 3, cell 4; a start point and a destination are marked)
One-dimensional Grid World
• Trajectory or episode:
– The sequence of states from the starting state to the terminal state
– The robot starts at the start point and ends at the destination
• The representation of the three episodes (also written out as code below):
– Episode 1: cell 2 → cell 3 → cell 4, with rewards −1, 10
– Episode 2: cell 2 → cell 3 → cell 2 → cell 3 → cell 4, with rewards −1, −1, −1, 10
– Episode 3: cell 2 → cell 1 → cell 2 → cell 3 → cell 4, with rewards −1, −1, −1, 10
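The episodes can also be written down directly as data. Below is a small Python sketch; the action names "left" and "right" are labels I am assuming for moves between adjacent cells, they are not given on the slide.

```python
# The three sampled episodes from the 1-D grid world, each a list of
# (state, action, reward) steps; every move costs -1 and reaching cell 4 gives +10.
episodes = [
    # episode 1: cell 2 -> cell 3 -> cell 4
    [("cell2", "right", -1), ("cell3", "right", 10)],
    # episode 2: cell 2 -> cell 3 -> cell 2 -> cell 3 -> cell 4
    [("cell2", "right", -1), ("cell3", "left", -1),
     ("cell2", "right", -1), ("cell3", "right", 10)],
    # episode 3: cell 2 -> cell 1 -> cell 2 -> cell 3 -> cell 4
    [("cell2", "left", -1), ("cell1", "right", -1),
     ("cell2", "right", -1), ("cell3", "right", 10)],
]
```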
Compute Value Function
• Idea: average the return observed after visits to a state s
• First-visit MC: average returns only for the first time s is visited in each episode (sketched in code after this list)
• Return in one episode (trajectory): G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + ⋯
• Representation: A table
– Filled with the Q-value given a state and an action
(Table: one column per state (cell 1, cell 2, cell 3) and one row per action)
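A minimal sketch (not the lecture's code) of first-visit Monte Carlo estimation of V(s), assuming episodes are stored as (state, action, reward) steps like the list above; the function and variable names are mine.

```python
from collections import defaultdict

def first_visit_mc_v(episodes, gamma=0.9):
    """First-visit MC: V(s) is the average of the returns observed after
    the first visit to s in each episode."""
    returns = defaultdict(list)                    # state -> list of first-visit returns
    for episode in episodes:
        # Compute the return G_t at every time step by scanning backwards.
        g, g_at = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):
            g = episode[t][2] + gamma * g
            g_at[t] = g
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s not in seen:                      # only the first visit of s counts
                seen.add(s)
                returns[s].append(g_at[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}
```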
Computing Q-value
• MC for estimating Q:
– A slight difference from estimating the value function
– Average the returns following the first visit to each state-action pair (s, a) in an episode
• We calculate the return for (cell 2, right) in the first episode with γ = 0.9
– Episode 1: cell 2 → cell 3 → cell 4, with rewards −1, 10
– G = −1 + 0.9 × 10 = 8
Compute Q-Value (cont’d)
• Similarly, the return for (cell 2, right) in the second episode with γ = 0.9
– Episode 2: cell 2 → cell 3 → cell 2 → cell 3 → cell 4, with rewards −1, −1, −1, 10
– G = −1 + 0.9 × (−1) + 0.9² × (−1) + 0.9³ × 10 ≈ 4.6
• Averaging the first-visit returns over the three episodes gives the Q-table:
              cell 1   cell 2   cell 3
    right      6.2      6.9     10
    left       0        4.6      6.2
• Selecting an action: at cell 2, we choose right, since Q(cell 2, right) = 6.9 > Q(cell 2, left) = 4.6 (reproduced in the code sketch below)
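The same recipe applied to Q(s, a), sketched below; with the episodes list from earlier and γ = 0.9 it reproduces the table above, e.g. Q(cell 2, right) ≈ (8 + 4.6 + 8)/3 ≈ 6.9 and Q(cell 2, left) ≈ 4.6.

```python
from collections import defaultdict

def first_visit_mc_q(episodes, gamma=0.9):
    """First-visit MC for Q: average the return following the first
    occurrence of each (state, action) pair in every episode."""
    returns = defaultdict(list)
    for episode in episodes:
        g, g_at = 0.0, [0.0] * len(episode)
        for t in range(len(episode) - 1, -1, -1):   # backward pass for returns
            g = episode[t][2] + gamma * g
            g_at[t] = g
        seen = set()
        for t, (s, a, _) in enumerate(episode):
            if (s, a) not in seen:                  # only the first visit of (s, a) counts
                seen.add((s, a))
                returns[(s, a)].append(g_at[t])
    return {sa: sum(gs) / len(gs) for sa, gs in returns.items()}

q = first_visit_mc_q(episodes, gamma=0.9)
# q[("cell2", "right")] ≈ 6.9, q[("cell2", "left")] ≈ 4.6, q[("cell3", "right")] == 10.0
```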
MC control algorithm
• Policy evaluation: estimate Q from Monte-Carlo returns under the current policy
• Policy improvement: make the policy greedy (or ε-greedy) with respect to the estimated Q (see the sketch below)
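A sketch of this loop under assumptions not stated on the slide: the environment is exposed through hypothetical env.reset() / env.step(action) calls, exploration is ε-greedy, and Q is updated with a constant step size instead of storing all returns.

```python
import random

def mc_control(env, actions, n_episodes=1000, gamma=0.9, epsilon=0.1, alpha=0.1):
    """Alternate MC policy evaluation (sample an episode, compute returns)
    with policy improvement (act epsilon-greedily w.r.t. the current Q)."""
    q = {}
    for _ in range(n_episodes):
        # Policy evaluation: generate one episode with the current policy.
        episode, s, done = [], env.reset(), False
        while not done:
            if random.random() < epsilon:
                a = random.choice(actions)                              # explore
            else:
                a = max(actions, key=lambda a_: q.get((s, a_), 0.0))    # exploit
            s_next, r, done = env.step(a)     # hypothetical environment interface
            episode.append((s, a, r))
            s = s_next
        # Policy improvement: move Q towards the observed returns.
        g = 0.0
        for s, a, r in reversed(episode):
            g = r + gamma * g
            q[(s, a)] = q.get((s, a), 0.0) + alpha * (g - q.get((s, a), 0.0))
    return q
```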
Q-Learning is Bootstrapping
Q-Learning: the blessings of Temporal Difference
Q(s, a) ← Q(s, a) + α [ r + γ max_a′ Q(s′, a′) − Q(s, a) ]
new estimate = old estimate + learning rate × (new sample − old estimate), where the new sample is r + γ max_a′ Q(s′, a′)
Q-Learning
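A minimal tabular sketch of this update rule, assuming Q is kept as a plain Python dict from (state, action) to value; the function name and the default alpha/gamma values are illustrative, not from the lecture.

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One temporal-difference step: move Q(s, a) towards the new sample
    r + gamma * max_a' Q(s', a')."""
    old = q.get((s, a), 0.0)                                            # old estimate
    new_sample = r + gamma * max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = old + alpha * (new_sample - old)                        # new estimate
    return q[(s, a)]
```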
A Step-by-step Example
• 5-room environment as MDP
– We'll number each room 0 through 4
– The outside of the building can be thought of as one big room, numbered 5
– Episodes end at room 5
– Notice that doors at rooms 1 and 4 lead into the building from room 5 (outside)
A Step-by-step Example (cont’d)
• Goal
– Put an agent in any room, and from that room, go outside (or room 5)
• Reward
– The doors that lead immediately to the goal have an instant reward of 100
– Other doors not directly connected to the target room have zero reward
• Reward matrix R (rows: states 0–5, columns: actions 0–5):
          action:   0    1    2    3    4    5
    state 0:        0    0    0    0    0    0
    state 1:        0    0    0    0    0  100
    state 2:        0    0    0    0    0    0
    state 3:        0    0    0    0    0    0
    state 4:        0    0    0    0    0  100
    state 5:        0    0    0    0    0  100
Q-Learning Step by Step
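As a sketch of the step-by-step computation, the loop below runs tabular Q-learning on the 5-room example. The room adjacency (which doors connect which rooms) is my reading of the lecture's figure, and the discount factor and step size are illustrative, so treat them as assumptions; the reward follows the slide's R matrix (100 for any action that enters room 5, 0 otherwise).

```python
import random

# Assumed door connectivity of the 5-room building (room 5 is "outside").
doors = {0: [4], 1: [3, 5], 2: [3], 3: [1, 2, 4], 4: [0, 3, 5], 5: [1, 4, 5]}
GOAL, GAMMA, ALPHA = 5, 0.8, 1.0                  # illustrative hyperparameters

# Reward from the slide's R matrix: 100 for entering room 5, otherwise 0.
R = {(s, a): (100 if a == GOAL else 0) for s in doors for a in doors[s]}
Q = {(s, a): 0.0 for s in doors for a in doors[s]}

for _ in range(500):                              # loop over many episodes
    s = random.randint(0, 4)                      # put the agent in any room
    while s != GOAL:                              # episode ends at room 5
        a = random.choice(doors[s])               # explore: pick a random door
        s_next = a                                # taking "action a" means moving to room a
        best_next = max(Q[(s_next, a2)] for a2 in doors[s_next])
        Q[(s, a)] += ALPHA * (R[(s, a)] + GAMMA * best_next - Q[(s, a)])
        s = s_next

# Acting greedily w.r.t. the learned Q then leads from any room to room 5.
```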
Q-Learning Step by Step (cont’d)
• When we loop over many episodes, the Q-values converge to the final Q matrix
Deep Q-Network
• Approximate the Q-function with a neural network q̂(s, a, w_q): the state s and action a are the inputs, and w_q are the network weights
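A minimal PyTorch-style sketch of the idea, under assumptions not in the slides (a small state vector, two discrete actions, a network that outputs one Q-value per action, and an already-sampled batch of transitions); it only illustrates the TD update on q̂(s, a, w_q), not the full DQN recipe (replay buffer, exploration schedule, Atari preprocessing).

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """q_hat(s, ., w_q): maps a state to one Q-value per discrete action."""
    def __init__(self, state_dim=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions),
        )
    def forward(self, s):
        return self.net(s)

q_net = QNetwork()
target_net = QNetwork()
target_net.load_state_dict(q_net.state_dict())        # frozen copy for stable targets
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

def dqn_update(states, actions, rewards, next_states, dones):
    """One gradient step on the TD error for a sampled batch of transitions."""
    q_sa = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        best_next = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * best_next * (1 - dones)
    loss = nn.functional.mse_loss(q_sa, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```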
DQN in Atari
• Pong’s video
• https://www.youtube.com/watch?v=PSQt5KGv7Vk
• Beats humans on many games