
2.3 Value Function Approximation

The document discusses various concepts in reinforcement learning, focusing on value function approximation (VFA) and importance sampling techniques. It highlights the challenges of large state spaces and the need for function approximation methods, including linear and non-linear approaches, to efficiently learn policies and value functions. Additionally, it addresses convergence issues in reinforcement learning algorithms and introduces deep reinforcement learning as a solution to these challenges.


Some slides are from: Katerina Fragkiadaki (CMU), David Silver (DeepMind), Hado van Hasselt (DeepMind)

COMP 4901Z: Reinforcement Learning


2.3 Value Function Approximation

Long Chen (Dept. of CSE)


Two Types of Importance Sampling

• Ordinary Importance Sampling

$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$

• Weighted Importance Sampling

$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$

• Weighted IS is a biased estimator
• For the first-visit method with a single return, the expectation is $v_b(s)$ rather than $v_\pi(s)$.
• Ordinary IS is an unbiased estimator
• For the first-visit method, its expectation is always $v_\pi(s)$
2
Two Types of Importance Sampling

• Ordinary Importance Sampling

$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$

• Weighted Importance Sampling

$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$

• The variance of ordinary IS is in general unbounded, whereas in the weighted estimator the largest weight on any single return is one.
• Suppose the ratio were ten: the ordinary importance-sampling estimate would be ten times the observed return.
3
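As a concrete illustration of the two estimators above, here is a minimal NumPy sketch (not from the slides); the arrays rho and G are hypothetical first-visit importance ratios and returns for a single state.

import numpy as np

# Hypothetical first-visit data for one state s:
# rho[i] is the importance-sampling ratio of episode i, G[i] the observed return.
rho = np.array([0.5, 2.0, 10.0])
G = np.array([1.0, 0.0, 1.0])

# Ordinary importance sampling: divide by the number of first visits.
V_ordinary = np.sum(rho * G) / len(G)        # 3.5

# Weighted importance sampling: divide by the sum of the ratios,
# so no single return receives a weight larger than one.
V_weighted = np.sum(rho * G) / np.sum(rho)   # 0.84

print(V_ordinary, V_weighted)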
SARSA Algorithm for On-Policy Control

4
Q-Learning Algorithm for Off-Policy Control

• Q-Learning: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$

• SARSA: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$

5
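A minimal tabular sketch of the two updates above, assuming Q is a NumPy array indexed by (state, action) and alpha, gamma are the step size and discount; the function names are illustrative, not from the lecture.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy target: bootstrap from the greedy action in s_next.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy target: bootstrap from the action actually taken in s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])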
Double Tabular Q-Learning

6
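The algorithm box from this slide is not reproduced here; as a hedged sketch, the standard double Q-learning update (van Hasselt, 2010) keeps two tables and on each step updates one of them, using the other for evaluation. The names and the coin-flip scheme below are assumptions for illustration.

import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng=np.random):
    # Randomly choose which table to update; the other provides the evaluation.
    if rng.random() < 0.5:
        a_star = np.argmax(Q1[s_next])            # select with Q1
        target = r + gamma * Q2[s_next, a_star]   # evaluate with Q2
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        a_star = np.argmax(Q2[s_next])            # select with Q2
        target = r + gamma * Q1[s_next, a_star]   # evaluate with Q1
        Q2[s, a] += alpha * (target - Q2[s, a])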
2.3 Value Function Approximation
Function Approximation and Deep RL

• The policy, value function, model, and agent state update are all functions
• We want to learn these from experience
• If there are too many states, we need to approximate
• This is often called deep reinforcement learning
• when using neural networks to represent these functions

8
Large-Scale Reinforcement Learning

• In problems with a large number of states, e.g.


• Backgammon: $10^{20}$ states
• Go: $10^{170}$ states
• Helicopter: continuous state space
• Robots: real world
• Tabular methods that enumerate every single state do not work
• How can we scale up the model-free methods for prediction and control
from the last two lectures?

9
Value Function Approximation (VFA)

• So far we have represented value function by a lookup table


• Every state $s$ has an entry $V(s)$, or
• Every state-action pair $(s, a)$ has an entry $Q(s, a)$
• Problem with large MDPs:
• There are too many states and/or actions to store in memory
• It is too slow to learn the value of each state individually
• Solution for large MDPs:
• Estimate value function with function approximation
$v(s; \mathbf{w}) \approx v_\pi(s)$ or $q(s, a; \mathbf{w}) \approx q_\pi(s, a)$
• Generalize from seen states to unseen states
• Update parameters $\mathbf{w}$ using MC or TD learning
10
Agent State Update

• When the environment state is not fully observable ($S_t^{\text{env}} \neq O_t$)
• Use the agent state
$S_t = u(S_{t-1}, A_{t-1}, O_t; \omega)$
with parameters $\omega$
• Henceforth, $S_t$ denotes the agent state
• Think of this as either a vector inside the agent,
or, in the simplest case, just the current observation: $S_t = O_t$

11
Value Function Approximation (VFA)

• Value function approximation (VFA) replaces the table with a general parameterized form:

• When we update the parameters $\mathbf{w}$, the values of many states change simultaneously!
12
Policy Approximation

• Policy approximation replaces the table with a general parameterized form

13
Classes of Function Approximation

• Tabular: a table with an entry for each MDP state


• Linear function approximation
• Consider a fixed agent state update (e.g., $S_t = O_t$)
• Fixed feature map: $\phi: \mathcal{S} \to \mathbb{R}^m$
• Values are a linear function of the features: $v(s; \mathbf{w}) = \mathbf{w}^\top \phi(s)$
• Differentiable function approximation
• $v(s; \mathbf{w})$ is a differentiable function of $\mathbf{w}$, which could be non-linear
• E.g., a convolutional neural network that takes pixels as input
• Another interpretation: features are not fixed, but learnt

14
Which Function Approximation?

• There are many function approximators, e.g.


• Linear combinations of features
• Neural networks
• Decision tree
• Nearest neighbour
• Fourier/wavelet bases
•…

15
Classes of Function Approximation

• In principle, any function approximator can be used, but RL has specific properties:
• Experience is not i.i.d. – successive time steps are correlated
• Agent’s policy affects the data it receives
• Regression targets can be non-stationary
• … because of changing policies (which can change the target and the data!)
• … because of bootstrapping
• … because of non-stationary dynamics (e.g., other learning agents)
• … because the world is large (never quite in the same state)

16
Classes of Function Approximation

• Which function approximation should you choose?


• This depends on your goals:
• Tabular: good theory but does not scale/generalize
• Linear: reasonably good theory, but requires good features
• Non-linear: less well-understood, but scales well
• Flexible, and less reliant on picking good features first (e.g., by hand)
• (Deep) neural nets often perform quite well, and remain a popular
choice

17
Function Approximator Examples

• Image representation for classification

18
Function Approximator Examples

• Pixel space

19
Function Approximator Examples

• Convolutional neural network (CNN) architectures

20
Function Approximator Examples

• Recurrent neural network (RNN) architectures

21
Function Approximator Examples

• Recurrent neural network (RNN) architectures

22
Gradient-based Algorithms
Gradient Descent

• Let $J(\mathbf{w})$ be a differentiable function of parameter vector $\mathbf{w}$
• Define the gradient of $J(\mathbf{w})$ to be:

$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left( \frac{\partial J(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial J(\mathbf{w})}{\partial w_n} \right)^\top$

• To find a local minimum of $J(\mathbf{w})$, adjust $\mathbf{w}$ in the direction of the negative gradient:

$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w})$

where $\alpha$ is a step-size parameter

24
Gradient Descent

• Let $J(\mathbf{w})$ be a differentiable function of parameter vector $\mathbf{w}$
• Define the gradient of $J(\mathbf{w})$ to be:

$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left( \frac{\partial J(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial J(\mathbf{w})}{\partial w_n} \right)^\top$

• Starting from a guess $\mathbf{w}_0$
• We consider the sequence $\mathbf{w}_0, \mathbf{w}_1, \mathbf{w}_2, \ldots$
• s.t. $\mathbf{w}_{k+1} = \mathbf{w}_k - \frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}_k)$
• We then have $J(\mathbf{w}_0) \geq J(\mathbf{w}_1) \geq J(\mathbf{w}_2) \geq \cdots$


25
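A minimal numeric sketch of this descent iteration, using a toy quadratic objective $J(\mathbf{w}) = \|\mathbf{w}\|^2$ chosen purely for illustration:

import numpy as np

def J(w):
    return np.sum(w ** 2)        # toy objective (assumed for illustration)

def grad_J(w):
    return 2.0 * w               # its gradient

w = np.array([1.0, -2.0])        # initial guess w_0
alpha = 0.1                      # step size
for k in range(100):
    w = w - 0.5 * alpha * grad_J(w)   # w_{k+1} = w_k - (1/2) * alpha * grad J(w_k)
# J(w_0) >= J(w_1) >= ... ; w approaches the minimizer (here, the zero vector).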
Value Function Approx. By Stochastic Gradient Descent

• Goal: find parameter vector $\mathbf{w}$ minimizing the mean-squared error between the true value function $v_\pi(S)$ and its approximation $v(S; \mathbf{w})$:

$J(\mathbf{w}) = \mathbb{E}_{S \sim d}\left[ \left( v_\pi(S) - v(S; \mathbf{w}) \right)^2 \right]$

where $d$ is a distribution over states (typically induced by the policy and dynamics)
• Gradient descent finds a local minimum:

$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha \, \mathbb{E}_{S \sim d}\left[ \left( v_\pi(S) - v(S; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S; \mathbf{w}) \right]$

• Stochastic gradient descent (SGD) samples the gradient:

$\Delta \mathbf{w} = \alpha \left( G_t - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w})$

• Note: the Monte Carlo return $G_t$ is a sample for $v_\pi(S_t)$
• The expected update is equal to the full gradient update
• We often write $\nabla v(S_t)$ as shorthand for $\nabla_{\mathbf{w}} v(S_t; \mathbf{w}) \big|_{\mathbf{w} = \mathbf{w}_t}$


26
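A sketch of a single stochastic-gradient step from the slide above, assuming callables v(s, w) and grad_v(s, w) for the approximator and its gradient (both hypothetical placeholders):

def sgd_vfa_update(w, S_t, G_t, v, grad_v, alpha):
    # Delta w = alpha * (G_t - v(S_t; w)) * grad_w v(S_t; w),
    # where the sampled return G_t stands in for the unknown v_pi(S_t).
    prediction_error = G_t - v(S_t, w)
    return w + alpha * prediction_error * grad_v(S_t, w)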
Feature Vectors

• Represent state by a feature vector:

$\phi(s) = \left( \phi_1(s), \ldots, \phi_m(s) \right)^\top$

• $\phi: \mathcal{S} \to \mathbb{R}^m$ is a fixed mapping from state (e.g., observation) to features
• Shorthand: $\phi_t = \phi(S_t)$
• For example:
• Distance of robot from landmarks
• Trends in the stock market
• Piece and pawn configurations in chess
27
Linear Value Function Approximation

• Represent the value function by a linear combination of features:

$v(s; \mathbf{w}) = \mathbf{w}^\top \phi(s) = \sum_{j=1}^{m} w_j \, \phi_j(s)$

• Objective function is quadratic in parameters $\mathbf{w}$:

$J(\mathbf{w}) = \mathbb{E}_{S \sim d}\left[ \left( v_\pi(S) - \phi(S)^\top \mathbf{w} \right)^2 \right]$

• Stochastic gradient descent converges on the global optimum
• Update rule is particularly simple:

$\nabla_{\mathbf{w}} v(S_t; \mathbf{w}) = \phi(S_t) = \phi_t$
$\Delta \mathbf{w} = \alpha \left( v_\pi(S_t) - v(S_t; \mathbf{w}) \right) \phi_t$

• Update = step-size × prediction error × feature value


28
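In the linear case the gradient is just the feature vector, so the update above becomes especially simple. A sketch, with phi an assumed feature map returning a NumPy vector:

import numpy as np

def v_linear(s, w, phi):
    return np.dot(w, phi(s))                  # v(s; w) = w^T phi(s)

def linear_vfa_update(w, s, target, alpha, phi):
    # Update = step-size x prediction error x feature value.
    features = phi(s)                         # grad_w v(s; w) = phi(s)
    prediction_error = target - np.dot(w, features)
    return w + alpha * prediction_error * features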
Incremental Prediction Algorithm

• Have assumed the true value function $v_\pi(s)$ given by a supervisor
• But in RL there is no supervisor, only rewards
• In practice, we substitute a target for $v_\pi(s)$
• For MC, the target is the return $G_t$:
$\Delta \mathbf{w} = \alpha \left( G_t - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w})$
• For TD(0), the target is the TD target:
$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma v(S_{t+1}; \mathbf{w}) - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w})$
• For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$:
$\Delta \mathbf{w} = \alpha \left( G_t^\lambda - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w})$

$G_t^\lambda = R_{t+1} + \gamma \left( (1 - \lambda) \, v(S_{t+1}; \mathbf{w}) + \lambda \, G_{t+1}^\lambda \right)$
29
Monte Carlo with Value Function Approximation

• The return $G_t$ is an unbiased, noisy sample of the true value $v_\pi(S_t)$
• Can therefore apply supervised learning to “training data”:
$\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \ldots, \langle S_T, G_T \rangle$
• For example, using linear Monte Carlo policy evaluation:
$\Delta \mathbf{w} = \alpha \left( G_t - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w}) = \alpha \left( G_t - v(S_t; \mathbf{w}) \right) \phi_t$
• Linear Monte Carlo evaluation converges to a local optimum
• Even when using non-linear value function approximation it converges
(but perhaps to a local optimum)

30
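A sketch of linear Monte Carlo evaluation over one episode, assuming the episode is given as a list of (state, reward) pairs collected under the policy being evaluated; returns are accumulated backwards and used as regression targets.

import numpy as np

def mc_linear_evaluation(w, episode, phi, alpha, gamma):
    # episode: [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)]
    G = 0.0
    for S_t, R_next in reversed(episode):
        G = R_next + gamma * G                               # return G_t
        features = phi(S_t)
        w = w + alpha * (G - np.dot(w, features)) * features
    return w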
Monte Carlo with Value Function Approximation

31
TD Learning with Value Function Approximation

• The TD target $R_{t+1} + \gamma v(S_{t+1}; \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$
• Can still apply supervised learning to “training data”:
$\langle S_1, R_2 + \gamma v(S_2; \mathbf{w}) \rangle, \langle S_2, R_3 + \gamma v(S_3; \mathbf{w}) \rangle, \ldots, \langle S_{T-1}, R_T + \gamma v(S_T; \mathbf{w}) \rangle$
• For example, using linear TD(0):
$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma v(S_{t+1}; \mathbf{w}) - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w}) = \alpha \, \delta_t \, \phi_t$
where $\delta_t = R_{t+1} + \gamma v(S_{t+1}; \mathbf{w}) - v(S_t; \mathbf{w})$ is the “TD error”
• This is akin to a non-stationary regression problem
• But it’s a bit different: the target depends on our parameters!
We ignore the dependence of the target on $\mathbf{w}$: this is called a semi-gradient method!
32
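A sketch of one semi-gradient TD(0) step with linear features; the target $R + \gamma v(S'; \mathbf{w})$ is computed with the current weights but treated as a constant, so no gradient flows through it. phi, alpha and gamma are assumed as before.

import numpy as np

def semi_gradient_td0_update(w, s, r, s_next, done, phi, alpha, gamma):
    v_s = np.dot(w, phi(s))
    v_next = 0.0 if done else np.dot(w, phi(s_next))  # part of the target; held fixed
    td_error = r + gamma * v_next - v_s               # delta_t
    return w + alpha * td_error * phi(s)              # semi-gradient update: alpha * delta_t * phi_t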
TD Learning with Value Function Approximation

33
Control with Value Function Approximation

• Policy evaluation: approximate policy evaluation, $q(S_t, A_t; \mathbf{w}) \approx q_\pi$

• Policy improvement: $\varepsilon$-greedy policy improvement

34
Action-Value Function Approximation

• Should we use action-in, or action-out?


• Action-in: $q(s, a; \mathbf{w}) = \mathbf{w}^\top \phi(s, a)$
• Action-out: $\mathbf{q}(s; \mathbf{w}) = W \phi(s)$, such that $q(s, a; \mathbf{w}) = \mathbf{q}(s; \mathbf{w})[a]$
• One reuses the same weights, the other the same features
• Unclear which is better in general
• If we want to use continuous actions, action-in is easier (later lecture)
• For (small) discrete action spaces, action-out is common (e.g., DQN)

35
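A sketch of the two parameterizations for the linear case; phi_sa and phi_s are assumed feature maps over state-action pairs and states respectively.

import numpy as np

# Action-in: one weight vector shared across actions, features depend on (s, a).
def q_action_in(s, a, w, phi_sa):
    return np.dot(w, phi_sa(s, a))           # q(s, a; w) = w^T phi(s, a)

# Action-out: a weight matrix W maps state features to one value per action.
def q_action_out(s, W, phi_s):
    return W @ phi_s(s)                      # vector q(s; W); q(s, a; W) is its a-th entry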
Convergence and Divergence
Convergence Questions

• When do incremental prediction algorithms converge?


• When using bootstrapping (i.e., TD)?
• When using (e.g., linear) value function approximation?
• When using off-policy learning?
• Ideally, we would like algorithms that converge in all cases
• Alternatively, we want to understand when algorithms do, or do not,
converge

37
Example of Divergence

• What if we use TD only on this transition?

38
Example of Divergence

$w_{t+1} = w_t + \alpha_t \left( R + \gamma v(s') - v(s) \right) \nabla v(s)$
$= w_t + \alpha_t \left( R + \gamma v(s') - v(s) \right) \phi(s)$
$= w_t + \alpha_t \left( 0 + \gamma \cdot 2 w_t - w_t \right)$
$= w_t + \alpha_t \left( 2\gamma - 1 \right) w_t$

• Consider $w_t > 0$. If $\gamma > \frac{1}{2}$, then $w_{t+1} > w_t$.
$\Rightarrow \lim_{t \to \infty} w_t = \infty$
39
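A quick numerical check of this recursion (the step size and initial weight below are assumed values):

gamma, alpha, w = 0.9, 0.1, 1.0
for t in range(50):
    w = w + alpha * (2 * gamma - 1) * w   # w_{t+1} = w_t + alpha_t (2 gamma - 1) w_t
print(w)  # grows geometrically whenever gamma > 1/2; shrinks to zero when gamma < 1/2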
Example of Divergence

• Algorithms that combine


• Bootstrapping
• Off-policy learning, and
• Function approximation
… may diverge
• This is sometimes called the deadly triad.
40
Deadly Triad

• Consider sampling on-policy, over an episode. Update:


$\Delta w = \alpha \left( 0 + \gamma \cdot 2w - w \right) + \alpha \left( 0 + \gamma \cdot 0 - 2w \right)$
$= \alpha \left( 2\gamma - 3 \right) w$
• This multiplier is negative for all $\gamma \in [0, 1]$
• $\Rightarrow$ convergence ($w$ goes to zero, which is optimal here)

41
Deadly Triad

• With tabular features, this is just regression


• Answer may be sub-optimal, but no divergence occurs
• Specifically, if we only update $v(s)$ (the left-most state):
• $v(s) = w_0$ will converge to $\gamma \, v(s')$
• $v(s') = w_1$ will stay where it was initialized

42
Deadly Triad

• What if we use multi-step returns?
• Still consider only updating the left-most state:

$\Delta w = \alpha \left( G^\lambda - v(s) \right)$
$= \alpha \left( R + \gamma \left[ (1 - \lambda)\, v(s') + \lambda \left( R' + \gamma\, v(s'') \right) \right] - v(s) \right)$, with $R = R' = 0$, $v(s'') = 0$, $v(s) = w$, $v(s') = 2w$
$= \alpha \left( 2\gamma (1 - \lambda) - 1 \right) w$

• The multiplier is negative when $2\gamma (1 - \lambda) < 1 \Rightarrow \lambda > 1 - \frac{1}{2\gamma}$
• E.g., when $\gamma = 0.9$, we need $\lambda > 1 - \frac{1}{1.8} \approx 0.45$
43
Convergence of Prediction and Control Algorithms

• Tabular control learning algorithms (e.g., Q-learning) can be extended to FA


(e.g., Deep Q Network — DQN)
• The theory of control with function approximation is not fully developed
• Tracking is often preferred to convergence
(i.e., continually adapting the policy instead of converging to a fixed policy)

44
Deep Q Network (DQN)
Deep Reinforcement Learning

DL: Deep Learning; RL: Reinforcement Learning


• DL: It requires large amounts of hand-labelled training data.
• RL: It can learn from a scalar reward signal that is frequently sparse, noisy
and delayed.
• DL: It assumes the data samples to be independent.
• RL: It typically encounters sequences of highly correlated states.
• DL: It assumes a fixed underlying distribution.
• RL: The data distribution changes as the algorithm learns new behaviors.

Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.

46
DQN in Atari

• End-to-end learning of values $q(s, a)$ from pixels $s$
• Input state $s$ is a stack of raw pixels from the last 4 frames
• Output is $q(s, a)$ for 18 joystick/button positions
• Reward is the change in score for that step

Network architecture and hyperparameters fixed across all games

47


DQN

• Approximate the optimal action-value function $q_*(s, a)$ by $q(s, a; \mathbf{w})$

48
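As a hedged illustration, a DQN-style action-out network can be sketched in PyTorch as below, loosely following the architecture reported in the DQN papers (a stack of 4 preprocessed 84x84 frames, three convolutional layers, one output per action); the exact layer sizes here are assumptions, not the lecture's specification.

import torch
import torch.nn as nn

class DQN(nn.Module):
    # Input: stack of 4 preprocessed frames; output: one Q-value per action.
    def __init__(self, num_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # assumes 84x84 input frames
            nn.Linear(512, num_actions),
        )

    def forward(self, x):            # x: (batch, 4, 84, 84)
        return self.net(x)           # (batch, num_actions) action values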
DQN Results in Atari

49
Temporal Difference (TD) Learning

• Observe state $s_t$ and perform action $a_t$

• Environment provides new state $s_{t+1}$ and reward $r_t$

• TD target: $y_t = r_t + \gamma \cdot \max_a q(s_{t+1}, a; \mathbf{w})$

• TD error: $\delta_t = q_t - y_t$, where $q_t = q(s_t, a_t; \mathbf{w})$

• Goal: make $q_t$ close to $y_t$, for all $t$ (equivalently, make $\delta_t^2$ small)

50
Temporal Difference (TD) Learning

• TD error: $\delta_t = q_t - y_t$, where $q_t = q(s_t, a_t; \mathbf{w})$

• TD learning: find $\mathbf{w}$ by minimizing $L(\mathbf{w}) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}$

• Online gradient descent:
• Observe $(s_t, a_t, r_t, s_{t+1})$ and compute $\delta_t$
• Compute the gradient $\mathbf{g}_t = \frac{\partial \, (\delta_t^2 / 2)}{\partial \mathbf{w}} = \delta_t \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$
• Gradient descent: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}_t$
• Discard $(s_t, a_t, r_t, s_{t+1})$ after using it

51
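A hedged PyTorch sketch of this online procedure for a Q-network q(s, a; w); the target y_t is computed under torch.no_grad() so that, as in the semi-gradient updates earlier, no gradient flows through it. All names are illustrative.

import torch

def online_td_step(q_net, optimizer, s, a, r, s_next, gamma):
    # TD target y_t = r_t + gamma * max_a q(s_{t+1}, a; w), held fixed.
    with torch.no_grad():
        y = r + gamma * q_net(s_next.unsqueeze(0)).max(dim=1).values.squeeze()
    q = q_net(s.unsqueeze(0))[0, a]      # q_t = q(s_t, a_t; w)
    loss = 0.5 * (q - y) ** 2            # delta_t^2 / 2
    optimizer.zero_grad()
    loss.backward()                      # g_t = delta_t * dq/dw
    optimizer.step()                     # w <- w - alpha * g_t (for plain SGD)
    # The transition (s_t, a_t, r_t, s_{t+1}) is then discarded.
    return loss.item()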
Shortcoming 1: Waste of Experience

• A transition: $(s_t, a_t, r_t, s_{t+1})$

• Experience: all the transitions, for $t = 1, 2, \ldots$
• Previously, we discard $(s_t, a_t, r_t, s_{t+1})$ after using it
• It is a waste.

52
Shortcoming 2: Correlated Updates

• Previously, we used $(s_t, a_t, r_t, s_{t+1})$ sequentially, for $t = 1, 2, \ldots$, to update $\mathbf{w}$.

• Consecutive states, $s_t$ and $s_{t+1}$, are strongly correlated (which is bad).
• This violates the i.i.d. assumption commonly made for stochastic gradient methods
(a similar issue arises in continual learning!)

53
Extra Reading Materials

• Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.


• Human-level Control through Deep Reinforcement Learning. Nature, 2015.

55
Thanks & QA?
