CZ3005 Module 5 - Reinforcement Learning

The document outlines the fundamentals of Reinforcement Learning (RL), including algorithms such as Monte Carlo, Q-learning, and Deep Q-Networks. It explains value functions, policy evaluation, and control, and shows how optimal policies can be learned from experience when the transition function is unknown. It also discusses the application of Q-learning and the use of neural networks in Deep Q-Networks to handle the large state and action spaces of real-world problems.

CZ3005

Artificial Intelligence

Reinforcement Learning

Asst/P Hanwang Zhang

https://personal.ntu.edu.sg/hanwangzhang/
Email: [email protected]
Office: N4-02c-87
Lesson Outline

• Some RL algorithms:
– Monte-Carlo
– Temporal difference
– Q-learning
– Deep Q-Network
Reinforcement Learning

• Motivation
– In the last lecture, we computed the value function and found the optimal policy using the transition function
– But what if the transition function is not available?
– We can still learn the value function and find the optimal policy without the transition function
• From experience

[Diagram: Experience --(learning)--> Policy/Value]

• Types of RL algorithms (model-free learning):
– Monte Carlo (learning by sampling)
– Q-Learning (learning by bootstrapping)
– DQN (Q-Learning with a DNN for function approximation)
– …
What is Monte Carlo

• Idea behind MC:
– Just use randomness to solve a problem
• Simple definition:
– Solve a problem by generating suitable random numbers and observing the fraction of numbers that obey some property
• An example: estimating π (the constant, not the policy π in RL):
– Put N dots on the square uniformly at random (a circle is inscribed in the square)
– Let M be the number of dots that land inside the circle
– Since the circle covers π/4 of the square's area, π ≈ 4M/N
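A minimal Python sketch of this idea (the function name and dot count are illustrative):

import random

def estimate_pi(num_dots: int = 1_000_000) -> float:
    """Monte Carlo estimate of pi: drop random dots on the unit square
    and count how many land inside the inscribed quarter circle."""
    in_circle = 0
    for _ in range(num_dots):
        x, y = random.random(), random.random()   # a random dot in [0, 1) x [0, 1)
        if x * x + y * y <= 1.0:                   # inside the quarter circle of radius 1
            in_circle += 1
    # area ratio (quarter circle / square) = pi / 4, so pi ~= 4 * M / N
    return 4.0 * in_circle / num_dots

print(estimate_pi())   # e.g. 3.1418... (varies from run to run)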
Monte Carlo in RL: Prediction

• Basic idea: we run in the world, gain experience, and learn from it
• What experience? Many trajectories!

• What do we learn? The value function!

– Recall that the return is the total discounted reward: G_t = R_{t+1} + γ R_{t+2} + γ² R_{t+3} + …
– Recall that the value function is the expected return from state s: V_π(s) = E_π[ G_t | S_t = s ]

• How do we learn?
– Use experience to compute an empirical state-value function
An Example

• One-dimensional grid world


– A robot is in a 1x4 world
– State: current cell
– Action: left or right
– Reward:
• Move one step (-1)
• Reach the destination cell (+10) (ignoring the one-step reward)

[Grid world: cell1 | cell2 | cell3 | cell4; start point: cell2, destination: cell4]
One-dimensional Grid World

• Trajectory or episode:
– The sequence of states from the starting state to the terminal state
– The robot starts in cell2 (start point) and ends in cell4 (destination)
• The representation of the three episodes (with the reward of each step):
– Episode 1: cell2 → cell3 → cell4, rewards: -1, +10
– Episode 2: cell2 → cell3 → cell2 → cell3 → cell4, rewards: -1, -1, -1, +10
– Episode 3: cell2 → cell1 → cell2 → cell3 → cell4, rewards: -1, -1, -1, +10
Compute Value Function
• Idea: average the returns observed after visits to state s
• First-visit MC: average returns only for the first time s is visited in an episode
• Return in one episode (trajectory): G_t = R_{t+1} + γ R_{t+2} + … + γ^{T-t-1} R_T
• We calculate the return for cell2 in the first episode (cell2 → cell3 → cell4, rewards -1, +10) with γ = 0.9:

G(cell2) = -1 + 0.9 × 10 = 8
Compute Value Function (cont’d)
• Similarly, the return for cell2 in the second episode (cell2 → cell3 → cell2 → cell3 → cell4, rewards -1, -1, -1, +10), counting from its first visit:

G(cell2) = -1 + 0.9 × (-1) + 0.9^2 × (-1) + 0.9^3 × 10 = 4.58

• Similarly, the return for cell2 in the third episode (cell2 → cell1 → cell2 → cell3 → cell4, rewards -1, -1, -1, +10), counting from its first visit:

G(cell2) = -1 + 0.9 × (-1) + 0.9^2 × (-1) + 0.9^3 × 10 = 4.58

• The empirical value function for cell2 is

V(cell2) = (8 + 4.58 + 4.58) / 3 ≈ 5.72

Why First Visit?
Compute Value Function (cont’d)

• Given these three episodes, we compute the value function for all non-terminal states:

V(cell1) = 6.2,  V(cell2) = 5.72,  V(cell3) = 8.73

• We can get a more accurate value function with more episodes


First Visit Monte Carlo Policy Evaluation

• Average returns only for the first time s is visited in an episode

• Algorithm (a code sketch follows this slide)
– Initialize:
• π: the policy to be evaluated
• V: an arbitrary state-value function
• Returns(s): an empty list, for every state s
– Repeat many times:
• Generate an episode using π
• For each state s appearing in the episode:
– G ← the return following the first occurrence of s
– Append G to Returns(s)
– V(s) ← average(Returns(s))
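A minimal Python sketch of first-visit MC prediction on the 1x4 grid world above (the episode encoding and variable names are illustrative, not from the slides):

from collections import defaultdict

GAMMA = 0.9  # discount factor used in the worked example

# Each episode is a list of (state, reward) pairs, where the reward is the one
# received on leaving that state; these are the three episodes from the slides.
episodes = [
    [("cell2", -1), ("cell3", 10)],
    [("cell2", -1), ("cell3", -1), ("cell2", -1), ("cell3", 10)],
    [("cell2", -1), ("cell1", -1), ("cell2", -1), ("cell3", 10)],
]

returns = defaultdict(list)   # Returns(s): list of first-visit returns
V = {}                        # empirical state-value function

for episode in episodes:
    # return from each time step, computed backwards: G_t = r_t + gamma * G_{t+1}
    G = 0.0
    step_returns = [0.0] * len(episode)
    for t in reversed(range(len(episode))):
        G = episode[t][1] + GAMMA * G
        step_returns[t] = G
    # first-visit: only the first occurrence of each state in the episode counts
    seen = set()
    for t, (state, _) in enumerate(episode):
        if state not in seen:
            seen.add(state)
            returns[state].append(step_returns[t])

for state, rs in returns.items():
    V[state] = sum(rs) / len(rs)

print(V)   # approximately {'cell2': 5.72, 'cell3': 8.73, 'cell1': 6.2}

Running it reproduces the empirical values computed by hand in the previous slides.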
Monte Carlo in RL: Control

• Now we have the value function of all states under a given policy π

• We need to improve the policy
• Policy iteration:
– Policy evaluation
– Policy improvement
• However, to improve the policy we need to know how good each action is
Q-value

• Estimates how good an action a is when in a state s

• Defined as the expected return starting from s, taking action a, and thereafter following policy π:

Q_π(s, a) = E_π[ G_t | S_t = s, A_t = a ]

• Representation: a table (the Q-table)
– Filled with the Q-value for each state and action

[Q-table: one column per state (cell1, cell2, cell3) and one row per action (left, right)]
Computing Q-value

• MC for estimating Q:
– Only a slight difference from estimating the value function
– Average the returns from the first time each state-action pair (s, a) is visited in an episode
• We calculate the return for (cell2, right) in the first episode (cell2 → cell3 → cell4, rewards -1, +10) with γ = 0.9:

G(cell2, right) = -1 + 0.9 × 10 = 8
Compute Q-Value (cont’d)
• Similarly, the return for (cell2, right) in the second episode (cell2 → cell3 → cell2 → cell3 → cell4, rewards -1, -1, -1, +10), counting from its first visit:

G(cell2, right) = -1 + 0.9 × (-1) + 0.9^2 × (-1) + 0.9^3 × 10 = 4.58

• Similarly, the return for (cell2, right) in the third episode (cell2 → cell1 → cell2 → cell3 → cell4, rewards -1, -1, -1, +10); the pair (cell2, right) is first visited when the robot returns to cell2 and moves right:

G(cell2, right) = -1 + 0.9 × 10 = 8

• The empirical Q-value for (cell2, right) is

Q(cell2, right) = (8 + 4.58 + 8) / 3 ≈ 6.9
Q-Value for Control

• Filling the Q-table

– By going through all state-action pairs, we get a complete Q-table with all the entries filled
– A possible Q-table example:

           cell1   cell2   cell3
  right     6.2     6.9     10
  left      0       4.6     6.2

• Selecting an action (a code sketch follows this slide)

– At cell2, we choose right (Q = 6.9 > 4.6)
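A tiny sketch of greedy action selection from such a Q-table (the nested-dictionary layout is illustrative):

# Q-table as a nested dict: Q[state][action], using the example values above
Q = {
    "cell1": {"left": 0.0, "right": 6.2},
    "cell2": {"left": 4.6, "right": 6.9},
    "cell3": {"left": 6.2, "right": 10.0},
}

def greedy_action(state: str) -> str:
    """Pick the action with the largest Q-value in the given state."""
    return max(Q[state], key=Q[state].get)

print(greedy_action("cell2"))   # 'right'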
MC control algorithm

Policy evaluation: estimate the Q-values Q_π(s, a) from episodes, as shown above

Policy improvement: update the policy to prefer the actions with the highest current Q-values (greedily, or ε-greedily so that exploration continues)

The two steps alternate until the policy stops changing; a code sketch follows.
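A compact sketch of this loop, assuming a generic episodic environment with env.reset() and env.step(action) returning (next_state, reward, done) (a hypothetical Gym-style interface, not defined in the slides), and using ε-greedy action selection for the improvement step:

import random
from collections import defaultdict

GAMMA, EPSILON, NUM_EPISODES = 0.9, 0.1, 5000   # illustrative hyper-parameters

def epsilon_greedy(Q, state, actions):
    """Policy improvement step: mostly greedy w.r.t. Q, sometimes random."""
    if random.random() < EPSILON:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def mc_control(env, actions):
    """First-visit Monte Carlo control: interleave evaluation and improvement."""
    Q = defaultdict(float)        # Q(s, a) estimates
    returns = defaultdict(list)   # observed first-visit returns per (s, a)
    for _ in range(NUM_EPISODES):
        # generate an episode with the current (epsilon-greedy) policy
        episode, state, done = [], env.reset(), False
        while not done:
            action = epsilon_greedy(Q, state, actions)
            next_state, reward, done = env.step(action)
            episode.append((state, action, reward))
            state = next_state
        # policy evaluation: first-visit return for every (s, a) in the episode
        G, first_visit = 0.0, {}
        for t in reversed(range(len(episode))):
            s, a, r = episode[t]
            G = r + GAMMA * G
            first_visit[(s, a)] = G   # later assignments keep only the earliest visit's return
        for pair, g in first_visit.items():
            returns[pair].append(g)
            Q[pair] = sum(returns[pair]) / len(returns[pair])
        # policy improvement is implicit: epsilon_greedy always reads the updated Q
    return Q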
Q-Learning is Bootstrapping
Q-Learning: the blessings of Temporal Difference

• Previously, we needed the whole trajectory before we could compute a return

• In Q-learning, we only need a one-step trajectory: (s, a, r, s')
• The difference is in how the Q-value is computed
– Previously: average the complete returns observed after (s, a)

– Now, an updating rule:

Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') - Q(s, a) ]

where Q(s, a) on the right-hand side is the old estimation, α is the learning rate, r + γ max_{a'} Q(s', a') is the new sample, and the result on the left-hand side is the new estimation
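A one-function sketch of this update (the dictionary-based Q-table and the hyper-parameter values are illustrative):

ALPHA, GAMMA = 0.1, 0.9   # learning rate and discount factor (illustrative values)

def q_learning_update(Q, s, a, r, s_next, actions):
    """One Q-learning step from a single (s, a, r, s') transition.
    Q is a dict mapping (state, action) -> value."""
    old = Q.get((s, a), 0.0)                                      # old estimation
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)   # max over a' of Q(s', a')
    td_target = r + GAMMA * best_next                             # new sample
    Q[(s, a)] = old + ALPHA * (td_target - old)                   # new estimation
    return Q[(s, a)]

Because the update only needs (s, a, r, s'), learning can happen after every single step instead of waiting for the episode to end.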
Q-Learning
A Step-by-step Example
• A 5-room environment as an MDP
– We number the rooms 0 through 4
– The outside of the building can be thought of as one big room, numbered 5

– The episode ends at room 5

– Notice that the doors at rooms 1 and 4 lead into the building from room 5 (the outside)

A Step-by-step Example (cont’d)
• Goal
– Put an agent in any room and, from that room, go outside (i.e. reach room 5)
• Reward
– The doors that lead immediately to the goal have an instant reward of 100
– Other doors, not directly connected to the target room, have zero reward
• Reward matrix R (rows = current state, columns = action, i.e. the room moved to):

            action
            0    1    2    3    4    5
       0 [  0    0    0    0    0    0  ]
       1 [  0    0    0    0    0   100 ]
R =    2 [  0    0    0    0    0    0  ]
       3 [  0    0    0    0    0    0  ]
       4 [  0    0    0    0    0   100 ]
       5 [  0    0    0    0    0   100 ]
     state
Q-Learning Step by Step

• Initialize the matrix Q as a zero matrix

• Loop for each episode until convergence

– Initial state: we are currently in room 1 (1st outer loop)
– Loop for each step of the episode (until we reach room 5)
• … (next slide)
Q-Learning Step by Step (cont’d)
• … (continued from the last slide)
– Loop for each step of the episode (until room 5)
• By random selection, we go from room 1 to room 5
• We get a reward of 100
• Update Q:
– From room 5 there are 3 possible actions: go to 1, 4 or 5; the update uses the one with the maximum Q-value

Q(1, 5) ← R(1, 5) + γ max[ Q(5, 1), Q(5, 4), Q(5, 5) ] = 100 + γ × 0 = 100
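A compact sketch of the whole loop for this 5-room example, assuming the simplified update Q(s, a) ← R(s, a) + γ max_{a'} Q(s', a') shown above, a discount factor of 0.8 (an assumed value; the slides do not show γ), and a door list reconstructed from the room layout described earlier (also an assumption):

import random

GAMMA = 0.8   # discount factor (assumed; not shown in the slides)

# Doors between rooms (room 5 is the outside); assumed from the room diagram.
DOORS = {
    0: [4],
    1: [3, 5],
    2: [3],
    3: [1, 2, 4],
    4: [0, 3, 5],
    5: [1, 4, 5],
}
GOAL = 5
# Reward matrix: 100 for any move into room 5, 0 otherwise (as in the slides)
R = {(s, a): (100 if a == GOAL else 0) for s in DOORS for a in DOORS[s]}

Q = {(s, a): 0.0 for s in DOORS for a in DOORS[s]}   # initialize Q as a zero matrix

for _ in range(1000):                    # loop for each episode
    state = random.randrange(6)          # start in a random room
    while state != GOAL:                 # loop for each step until we reach room 5
        action = random.choice(DOORS[state])   # explore by random selection
        next_state = action                    # choosing a door moves us to that room
        best_next = max(Q[(next_state, a)] for a in DOORS[next_state])
        Q[(state, action)] = R[(state, action)] + GAMMA * best_next
        state = next_state

# After training, follow the greedy policy from room 2: it reaches room 5
state = 2
while state != GOAL:
    state = max(DOORS[state], key=lambda a: Q[(state, a)])
    print(state)                         # e.g. 3, then 1 (or 4), then 5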
Q-Learning Step by Step (cont’d)
• When we loop over many episodes, the Q-table converges

• According to this converged Q-table, we can select actions

– E.g. we are at room 2
– Greedily select the action with the maximum Q-value at each step until we reach room 5
An Example of Iteration Process

• A complex grid world example


• https://cs.stanford.edu/people/karpathy/reinforcejs/gridworld_td.html
Deep Q-Network

• Previously, we represented the Q-value as a table

• However, a tabular representation is often insufficient
– Many real-world problems have enormous state and/or action spaces
– Backgammon: 10^20 states
– Computer Go: 10^170 states
– Robots: continuous state spaces
• We use a neural network as a black box to replace the table (a sketch follows this slide)
– Input a state and an action; output the Q-value

[Diagram: inputs s and a feed a network with weights w_q that outputs the estimate q̂(s, a, w_q)]
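A minimal PyTorch sketch of such a network (the layer sizes and the one-hot action encoding are illustrative assumptions, not from the slides):

import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Approximates q_hat(s, a, w): state and action in, scalar Q-value out."""
    def __init__(self, state_dim: int, num_actions: int, hidden: int = 64):
        super().__init__()
        self.num_actions = num_actions
        self.net = nn.Sequential(
            nn.Linear(state_dim + num_actions, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),          # one Q-value for this (s, a) pair
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # action is an integer index; encode it one-hot and concatenate with the state
        a_onehot = nn.functional.one_hot(action, self.num_actions).float()
        return self.net(torch.cat([state, a_onehot], dim=-1)).squeeze(-1)

# usage: the Q-value of action 1 in a random 4-dimensional state
q = QNetwork(state_dim=4, num_actions=2)
print(q(torch.randn(1, 4), torch.tensor([1])))

Note that the Atari DQN on the next slide uses a different arrangement: the network takes only the state (the stacked frames) as input and outputs one Q-value per action, covering all 18 buttons at once.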
DQN in Atari

• Output is q(s, a) for each of the 18 buttons

• Input state s is a stack of raw pixels from the last 4 frames

• Reward is the change in score for that step


DQN in Atari (cont’d)

• Video of DQN playing Pong:
• https://www.youtube.com/watch?v=PSQt5KGv7Vk
• It beats human players on many games
