
2.3 Value Function Approximation

The document discusses various concepts in reinforcement learning, focusing on value function approximation (VFA) and importance sampling techniques. It highlights the challenges of large state spaces and the need for function approximation methods, including linear and non-linear approaches, to efficiently learn policies and value functions. Additionally, it addresses convergence issues in reinforcement learning algorithms and introduces deep reinforcement learning as a solution to these challenges.


Some slides are from: Katerina Fragkiadaki (CMU), David Silver (DeepMind), Hado van Hasselt (DeepMind)

COMP 4901Z: Reinforcement Learning


2.3 Value Function Approximation

Long Chen (Dept. of CSE)


Two Types of Importance Sampling

• Ordinary Importance Sampling

$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$

• Weighted Importance Sampling

$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$

• Weighted IS is a biased estimator
• For the first-visit method with a single return, the expectation is $v_b(s)$ rather than $v_\pi(s)$.
• Ordinary IS is an unbiased estimator
• For the first-visit method, its expectation is always $v_\pi(s)$
2
Two Types of Importance Sampling

• Ordinary Importance Sampling

$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{|\mathcal{T}(s)|}$

• Weighted Importance Sampling

$V(s) = \frac{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1} G_t}{\sum_{t \in \mathcal{T}(s)} \rho_{t:T(t)-1}}$

• The variance of ordinary IS is in general unbounded, whereas in the weighted estimator the largest weight on any single return is one.
• Suppose the ratio were ten: the ordinary importance-sampling estimate would be ten times the observed return.
3
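As a concrete illustration of the two estimators above, here is a minimal NumPy sketch (not from the slides); the arrays rho and G are hypothetical first-visit importance ratios and returns for a single state.

import numpy as np

# Hypothetical first-visit data for one state s:
# rho[i] is the importance-sampling ratio of episode i, G[i] the observed return.
rho = np.array([0.5, 2.0, 10.0])
G = np.array([1.0, 0.0, 1.0])

# Ordinary importance sampling: divide by the number of first visits.
V_ordinary = np.sum(rho * G) / len(G)        # 3.5

# Weighted importance sampling: divide by the sum of the ratios,
# so no single return receives a weight larger than one.
V_weighted = np.sum(rho * G) / np.sum(rho)   # 0.84

print(V_ordinary, V_weighted)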
SARSA Algorithm for On-Policy Control

4
Q-Learning Algorithm for Off-Policy Control

• Q-Learning: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_a Q(S_{t+1}, a) - Q(S_t, A_t) \right]$

• SARSA: $Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma Q(S', A') - Q(S, A) \right]$

5
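A minimal tabular sketch of the two updates above, assuming Q is a NumPy array indexed by (state, action) and alpha, gamma are the step size and discount; the function names are illustrative, not from the lecture.

import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy target: bootstrap from the greedy action in s_next.
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy target: bootstrap from the action actually taken in s_next.
    td_target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (td_target - Q[s, a])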
Double Tabular Q-Learning

6
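The algorithm box from this slide is not reproduced here; as a hedged sketch, the standard double Q-learning update (van Hasselt, 2010) keeps two tables and on each step updates one of them, using the other for evaluation. The names and the coin-flip scheme below are assumptions for illustration.

import numpy as np

def double_q_update(Q1, Q2, s, a, r, s_next, alpha, gamma, rng=np.random):
    # Randomly choose which table to update; the other provides the evaluation.
    if rng.random() < 0.5:
        a_star = np.argmax(Q1[s_next])            # select with Q1
        target = r + gamma * Q2[s_next, a_star]   # evaluate with Q2
        Q1[s, a] += alpha * (target - Q1[s, a])
    else:
        a_star = np.argmax(Q2[s_next])            # select with Q2
        target = r + gamma * Q1[s_next, a_star]   # evaluate with Q1
        Q2[s, a] += alpha * (target - Q2[s, a])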
2.3 Value Function Approximation
Function Approximation and Deep RL

• The policy, value function, model, and agent state update are all functions
• We want to learn these from experience
• If there are too many states, we need to approximate
• This is often called deep reinforcement learning
• when using neural networks to represent these functions

8
Large-Scale Reinforcement Learning

• In problems with a large number of states, e.g.


• Backgammon: $10^{20}$ states
• Go: $10^{170}$ states
• Helicopter: continuous state space
• Robots: real world
• Tabular methods that enumerate every single state do not work
• How can we scale up the model-free methods for prediction and control
from the last two lectures?

9
Value Function Approximation (VFA)

• So far we have represented value function by a lookup table


• Every state $s$ has an entry $V(s)$, or
• Every state-action pair $(s, a)$ has an entry $Q(s, a)$
• Problem with large MDPs:
• There are too many states and/or actions to store in memory
• It is too slow to learn the value of each state individually
• Solution for large MDPs:
• Estimate value function with function approximation
$v(s; \mathbf{w}) \approx v_\pi(s)$ or $q(s, a; \mathbf{w}) \approx q_\pi(s, a)$
• Generalize from seen states to unseen states
• Update parameters $\mathbf{w}$ using MC or TD learning
10
Agent State Update

• When the environment state is not fully observable ($S_t^{\text{env}} \neq O_t$)
• Use the agent state
$S_t = u(S_{t-1}, A_{t-1}, O_t; \omega)$
with parameters $\omega$
• Henceforth, $S_t$ denotes the agent state
• Think of this as either a vector inside the agent,
or, in the simplest case, just the current observation: $S_t = O_t$

11
Value Function Approximation (VFA)

• Value function approximation (VFA) replaces the table with a general parameterized form:

• When we update the parameters $\mathbf{w}$, the values of many states change simultaneously!
12
Policy Approximation

• Policy approximation replaces the table with a general parameterized form

13
Classes of Function Approximation

• Tabular: a table with an entry for each MDP state


• Linear function approximation
• Consider a fixed agent state update (e.g., $S_t = O_t$)
• Fixed feature map: $\phi: \mathcal{S} \to \mathbb{R}^m$
• Values are a linear function of the features: $v(s; \mathbf{w}) = \mathbf{w}^\top \phi(s)$
• Differentiable function approximation
• $v(s; \mathbf{w})$ is a differentiable function of $\mathbf{w}$, which could be non-linear
• E.g., a convolutional neural network that takes pixels as input
• Another interpretation: features are not fixed, but learnt

14
Which Function Approximation?

• There are many function approximators, e.g.


• Linear combinations of features
• Neural networks
• Decision tree
• Nearest neighbour
• Fourier/wavelet bases
•…

15
Classes of Function Approximation

• In principle, any function approximator can be used, but RL has specific properties:
• Experience is not i.i.d. – successive time steps are correlated
• Agent’s policy affects the data it receives
• Regression targets can be non-stationary
• … because of changing policies (which can change the target and the data!)
• … because of bootstrapping
• … because of non-stationary dynamics (e.g., other learning agents)
• … because the world is large (never quite in the same state)

16
Classes of Function Approximation

• Which function approximation should you choose?


• This depends on your goals:
• Tabular: good theory but does not scale/generalize
• Linear: reasonably good theory, but requires good features
• Non-linear: less well-understood, but scales well
• Flexible, and less reliant on picking good features first (e.g., by hand)
• (Deep) neural nets often perform quite well, and remain a popular
choice

17
Function Approximator Examples

• Image representation for classification

18
Function Approximator Examples

• Pixel space

19
Function Approximator Examples

• Convolutional neural network (CNN) architectures

20
Function Approximator Examples

• Recurrent neural network (RNN) architectures

21
Function Approximator Examples

• Recurrent neural network (RNN) architectures

22
Gradient-based Algorithms
Gradient Descent

• Let $J(\mathbf{w})$ be a differentiable function of parameter vector $\mathbf{w}$
• Define the gradient of $J(\mathbf{w})$ to be:

$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left( \frac{\partial J(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial J(\mathbf{w})}{\partial w_n} \right)^\top$

• To find a local minimum of $J(\mathbf{w})$, adjust $\mathbf{w}$ in the direction of the negative gradient:

$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w})$

where $\alpha$ is a step-size parameter

24
Gradient Descent

• Let $J(\mathbf{w})$ be a differentiable function of parameter vector $\mathbf{w}$
• Define the gradient of $J(\mathbf{w})$ to be:

$\nabla_{\mathbf{w}} J(\mathbf{w}) = \left( \frac{\partial J(\mathbf{w})}{\partial w_1}, \ldots, \frac{\partial J(\mathbf{w})}{\partial w_n} \right)^\top$

• Starting from a guess $\mathbf{w}_0$
• We consider the sequence $\mathbf{w}_0, \mathbf{w}_1, \mathbf{w}_2, \ldots$
• s.t. $\mathbf{w}_{k+1} = \mathbf{w}_k - \frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}_k)$
• We then have $J(\mathbf{w}_0) \geq J(\mathbf{w}_1) \geq J(\mathbf{w}_2) \geq \cdots$


25
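A minimal numeric sketch of this descent iteration, using a toy quadratic objective $J(\mathbf{w}) = \|\mathbf{w}\|^2$ chosen purely for illustration:

import numpy as np

def J(w):
    return np.sum(w ** 2)        # toy objective (assumed for illustration)

def grad_J(w):
    return 2.0 * w               # its gradient

w = np.array([1.0, -2.0])        # initial guess w_0
alpha = 0.1                      # step size
for k in range(100):
    w = w - 0.5 * alpha * grad_J(w)   # w_{k+1} = w_k - (1/2) * alpha * grad J(w_k)
# J(w_0) >= J(w_1) >= ... ; w approaches the minimizer (here, the zero vector).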
Value Function Approx. By Stochastic Gradient Descent

• Goal: find parameter vector $\mathbf{w}$ minimizing the mean-squared error between the true value function $v_\pi(S)$ and its approximation $v(S; \mathbf{w})$:

$J(\mathbf{w}) = \mathbb{E}_{S \sim d}\left[ \left( v_\pi(S) - v(S; \mathbf{w}) \right)^2 \right]$

where $d$ is a distribution over states (typically induced by the policy and dynamics)
• Gradient descent finds a local minimum:

$\Delta \mathbf{w} = -\frac{1}{2} \alpha \nabla_{\mathbf{w}} J(\mathbf{w}) = \alpha \, \mathbb{E}_{S \sim d}\left[ \left( v_\pi(S) - v(S; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S; \mathbf{w}) \right]$

• Stochastic gradient descent (SGD) samples the gradient:

$\Delta \mathbf{w} = \alpha \left( G_t - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w})$

• Note: the Monte Carlo return $G_t$ is a sample for $v_\pi(S_t)$
• The expected update is equal to the full gradient update
• We often write $\nabla v(S_t)$ as shorthand for $\nabla_{\mathbf{w}} v(S_t; \mathbf{w}) \big|_{\mathbf{w} = \mathbf{w}_t}$


26
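A sketch of a single stochastic-gradient step from the slide above, assuming callables v(s, w) and grad_v(s, w) for the approximator and its gradient (both hypothetical placeholders):

def sgd_vfa_update(w, S_t, G_t, v, grad_v, alpha):
    # Delta w = alpha * (G_t - v(S_t; w)) * grad_w v(S_t; w),
    # where the sampled return G_t stands in for the unknown v_pi(S_t).
    prediction_error = G_t - v(S_t, w)
    return w + alpha * prediction_error * grad_v(S_t, w)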
Feature Vectors

• Represent state by a feature vector:

$\phi(s) = \left( \phi_1(s), \ldots, \phi_m(s) \right)^\top$

• $\phi: \mathcal{S} \to \mathbb{R}^m$ is a fixed mapping from state (e.g., observation) to features
• Shorthand: $\phi_t = \phi(S_t)$
• For example:
• Distance of robot from landmarks
• Trends in the stock market
• Piece and pawn configurations in chess
27
Linear Value Function Approximation

• Represent the value function by a linear combination of features:

$v(s; \mathbf{w}) = \mathbf{w}^\top \phi(s) = \sum_{j=1}^{m} w_j \, \phi_j(s)$

• Objective function is quadratic in parameters $\mathbf{w}$:

$J(\mathbf{w}) = \mathbb{E}_{S \sim d}\left[ \left( v_\pi(S) - \phi(S)^\top \mathbf{w} \right)^2 \right]$

• Stochastic gradient descent converges on the global optimum
• Update rule is particularly simple:

$\nabla_{\mathbf{w}} v(S_t; \mathbf{w}) = \phi(S_t) = \phi_t$
$\Delta \mathbf{w} = \alpha \left( v_\pi(S_t) - v(S_t; \mathbf{w}) \right) \phi_t$

• Update = step-size × prediction error × feature value


28
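In the linear case the gradient is just the feature vector, so the update above becomes especially simple. A sketch, with phi an assumed feature map returning a NumPy vector:

import numpy as np

def v_linear(s, w, phi):
    return np.dot(w, phi(s))                  # v(s; w) = w^T phi(s)

def linear_vfa_update(w, s, target, alpha, phi):
    # Update = step-size x prediction error x feature value.
    features = phi(s)                         # grad_w v(s; w) = phi(s)
    prediction_error = target - np.dot(w, features)
    return w + alpha * prediction_error * features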
Incremental Prediction Algorithm

• Have assumed the true value function $v_\pi(s)$ given by a supervisor
• But in RL there is no supervisor, only rewards
• In practice, we substitute a target for $v_\pi(s)$
• For MC, the target is the return $G_t$:
$\Delta \mathbf{w} = \alpha \left( G_t - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w})$
• For TD(0), the target is the TD target:
$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma v(S_{t+1}; \mathbf{w}) - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w})$
• For TD($\lambda$), the target is the $\lambda$-return $G_t^\lambda$:
$\Delta \mathbf{w} = \alpha \left( G_t^\lambda - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w})$

$G_t^\lambda = R_{t+1} + \gamma \left( (1 - \lambda) \, v(S_{t+1}; \mathbf{w}) + \lambda \, G_{t+1}^\lambda \right)$
29
Monte Carlo with Value Function Approximation

• The return $G_t$ is an unbiased, noisy sample of the true value $v_\pi(S_t)$
• Can therefore apply supervised learning to “training data”:
$\langle S_1, G_1 \rangle, \langle S_2, G_2 \rangle, \ldots, \langle S_T, G_T \rangle$
• For example, using linear Monte Carlo policy evaluation:
$\Delta \mathbf{w} = \alpha \left( G_t - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w}) = \alpha \left( G_t - v(S_t; \mathbf{w}) \right) \phi_t$
• Linear Monte Carlo evaluation converges to a local optimum
• Even when using non-linear value function approximation it converges
(but perhaps to a local optimum)

30
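A sketch of linear Monte Carlo evaluation over one episode, assuming the episode is given as a list of (state, reward) pairs collected under the policy being evaluated; returns are accumulated backwards and used as regression targets.

import numpy as np

def mc_linear_evaluation(w, episode, phi, alpha, gamma):
    # episode: [(S_0, R_1), (S_1, R_2), ..., (S_{T-1}, R_T)]
    G = 0.0
    for S_t, R_next in reversed(episode):
        G = R_next + gamma * G                               # return G_t
        features = phi(S_t)
        w = w + alpha * (G - np.dot(w, features)) * features
    return w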
Monte Carlo with Value Function Approximation

31
TD Learning with Value Function Approximation

• The TD target $R_{t+1} + \gamma v(S_{t+1}; \mathbf{w})$ is a biased sample of the true value $v_\pi(S_t)$
• Can still apply supervised learning to “training data”:
$\langle S_1, R_2 + \gamma v(S_2; \mathbf{w}) \rangle, \langle S_2, R_3 + \gamma v(S_3; \mathbf{w}) \rangle, \ldots, \langle S_{T-1}, R_T + \gamma v(S_T; \mathbf{w}) \rangle$
• For example, using linear TD(0):
$\Delta \mathbf{w} = \alpha \left( R_{t+1} + \gamma v(S_{t+1}; \mathbf{w}) - v(S_t; \mathbf{w}) \right) \nabla_{\mathbf{w}} v(S_t; \mathbf{w}) = \alpha \, \delta_t \, \phi_t$
where $\delta_t = R_{t+1} + \gamma v(S_{t+1}; \mathbf{w}) - v(S_t; \mathbf{w})$ is the “TD error”
• This is akin to a non-stationary regression problem
• But it’s a bit different: the target depends on our parameters!
We ignore the dependence of the target on $\mathbf{w}$: this is called a semi-gradient method!
32
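A sketch of one semi-gradient TD(0) step with linear features; the target $R + \gamma v(S'; \mathbf{w})$ is computed with the current weights but treated as a constant, so no gradient flows through it. phi, alpha and gamma are assumed as before.

import numpy as np

def semi_gradient_td0_update(w, s, r, s_next, done, phi, alpha, gamma):
    v_s = np.dot(w, phi(s))
    v_next = 0.0 if done else np.dot(w, phi(s_next))  # part of the target; held fixed
    td_error = r + gamma * v_next - v_s               # delta_t
    return w + alpha * td_error * phi(s)              # semi-gradient update: alpha * delta_t * phi_t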
TD Learning with Value Function Approximation

33
Control with Value Function Approximation

• Policy evaluation: approximate policy evaluation, $q(S_t, A_t; \mathbf{w}) \approx q_\pi$

• Policy improvement: $\varepsilon$-greedy policy improvement

34
Action-Value Function Approximation

• Should we use action-in, or action-out?


• Action-in: $q(s, a; \mathbf{w}) = \mathbf{w}^\top \phi(s, a)$
• Action-out: $\mathbf{q}(s; \mathbf{w}) = W \phi(s)$, such that $q(s, a; \mathbf{w}) = \mathbf{q}(s; \mathbf{w})[a]$
• One reuses the same weights, the other the same features
• Unclear which is better in general
• If we want to use continuous actions, action-in is easier (later lecture)
• For (small) discrete action spaces, action-out is common (e.g., DQN)

35
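A sketch of the two parameterizations for the linear case; phi_sa and phi_s are assumed feature maps over state-action pairs and states respectively.

import numpy as np

# Action-in: one weight vector shared across actions, features depend on (s, a).
def q_action_in(s, a, w, phi_sa):
    return np.dot(w, phi_sa(s, a))           # q(s, a; w) = w^T phi(s, a)

# Action-out: a weight matrix W maps state features to one value per action.
def q_action_out(s, W, phi_s):
    return W @ phi_s(s)                      # vector q(s; W); q(s, a; W) is its a-th entry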
Convergence and Divergence
Convergence Questions

• When do incremental prediction algorithms converge?


• When using bootstrapping (i.e., TD)?
• When using (e.g., linear) value function approximation?
• When using off-policy learning?
• Ideally, we would like algorithms that converge in all cases
• Alternatively, we want to understand when algorithms do, or do not,
converge

37
Example of Divergence

• What if we use TD only on this transition?

38
Example of Divergence

$w_{t+1} = w_t + \alpha_t \left( R + \gamma v(s') - v(s) \right) \nabla v(s)$
$= w_t + \alpha_t \left( R + \gamma v(s') - v(s) \right) \phi(s)$
$= w_t + \alpha_t \left( 0 + \gamma \cdot 2 w_t - w_t \right)$
$= w_t + \alpha_t \left( 2\gamma - 1 \right) w_t$

• Consider $w_t > 0$. If $\gamma > \frac{1}{2}$, then $w_{t+1} > w_t$.
$\Rightarrow \lim_{t \to \infty} w_t = \infty$
39
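A quick numerical check of this recursion (the step size and initial weight below are assumed values):

gamma, alpha, w = 0.9, 0.1, 1.0
for t in range(50):
    w = w + alpha * (2 * gamma - 1) * w   # w_{t+1} = w_t + alpha_t (2 gamma - 1) w_t
print(w)  # grows geometrically whenever gamma > 1/2; shrinks to zero when gamma < 1/2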
Example of Divergence

• Algorithms that combine


• Bootstrapping
• Off-policy learning, and
• Function approximation
… may diverge
• This is sometimes called the deadly triad.
40
Deadly Triad

• Consider sampling on-policy, over an episode. Update:


$\Delta w = \alpha \left( 0 + \gamma \cdot 2w - w \right) + \alpha \left( 0 + \gamma \cdot 0 - 2w \right)$
$= \alpha \left( 2\gamma - 3 \right) w$
• This multiplier is negative for all $\gamma \in [0, 1]$
• $\Rightarrow$ convergence ($w$ goes to zero, which is optimal here)

41
Deadly Triad

• With tabular features, this is just regression


• Answer may be sub-optimal, but no divergence occurs
• Specifically, if we only update $v(s)$ (the left-most state):
• $v(s) = w_0$ will converge to $\gamma \, v(s')$
• $v(s') = w_1$ will stay where it was initialized

42
Deadly Triad

• What if we use multi-step returns?
• Still consider only updating the left-most state:

$\Delta w = \alpha \left( G^\lambda - v(s) \right)$
$= \alpha \left( R + \gamma \left[ (1 - \lambda)\, v(s') + \lambda \left( R' + \gamma\, v(s'') \right) \right] - v(s) \right)$, with $R = R' = 0$, $v(s'') = 0$, $v(s) = w$, $v(s') = 2w$
$= \alpha \left( 2\gamma (1 - \lambda) - 1 \right) w$

• The multiplier is negative when $2\gamma (1 - \lambda) < 1 \Rightarrow \lambda > 1 - \frac{1}{2\gamma}$
• E.g., when $\gamma = 0.9$, we need $\lambda > 1 - \frac{1}{1.8} \approx 0.45$
43
Convergence of Prediction and Control Algorithms

• Tabular control learning algorithms (e.g., Q-learning) can be extended to FA


(e.g., Deep Q Network — DQN)
• The theory of control with function approximation is not fully developed
• Tracking is often preferred to convergence
(i.e., continually adapting the policy instead of converging to a fixed policy)

44
Deep Q Network (DQN)
Deep Reinforcement Learning

DL: Deep Learning; RL: Reinforcement Learning


• DL: It requires large amounts of hand-labelled training data.
• RL: It can learn from a scalar reward signal that is frequently sparse, noisy
and delayed.
• DL: It assumes the data samples to be independent.
• RL: It typically encounters sequences of highly correlated states.
• DL: It assumes a fixed underlying distribution.
• RL: The data distribution changes as the algorithm learns new behaviors.

Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.

46
DQN in Atari

• End-to-end learning of values $q(s, a)$ from pixels $s$
• Input state $s$ is a stack of raw pixels from the last 4 frames
• Output is $q(s, a)$ for 18 joystick/button positions
• Reward is the change in score for that step

Network architecture and hyperparameters fixed across all games

47


DQN

• Approximate the optimal action-value function $q_*(s, a)$ by $q(s, a; \mathbf{w})$

48
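As a hedged illustration, a DQN-style action-out network can be sketched in PyTorch as below, loosely following the architecture reported in the DQN papers (a stack of 4 preprocessed 84x84 frames, three convolutional layers, one output per action); the exact layer sizes here are assumptions, not the lecture's specification.

import torch
import torch.nn as nn

class DQN(nn.Module):
    # Input: stack of 4 preprocessed frames; output: one Q-value per action.
    def __init__(self, num_actions=18):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(4, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),   # assumes 84x84 input frames
            nn.Linear(512, num_actions),
        )

    def forward(self, x):            # x: (batch, 4, 84, 84)
        return self.net(x)           # (batch, num_actions) action values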
DQN Results in Atari

49
Temporal Difference (TD) Learning

• Observe state $s_t$ and perform action $a_t$

• Environment provides new state $s_{t+1}$ and reward $r_t$

• TD target: $y_t = r_t + \gamma \cdot \max_a q(s_{t+1}, a; \mathbf{w})$

• TD error: $\delta_t = q_t - y_t$, where $q_t = q(s_t, a_t; \mathbf{w})$

• Goal: make $q_t$ close to $y_t$, for all $t$ (equivalently, make $\delta_t^2$ small)

50
Temporal Difference (TD) Learning

• TD error: $\delta_t = q_t - y_t$, where $q_t = q(s_t, a_t; \mathbf{w})$

• TD learning: find $\mathbf{w}$ by minimizing $L(\mathbf{w}) = \frac{1}{T} \sum_{t=1}^{T} \frac{\delta_t^2}{2}$

• Online gradient descent:
• Observe $(s_t, a_t, r_t, s_{t+1})$ and compute $\delta_t$
• Compute the gradient $\mathbf{g}_t = \frac{\partial \, (\delta_t^2 / 2)}{\partial \mathbf{w}} = \delta_t \cdot \frac{\partial q(s_t, a_t; \mathbf{w})}{\partial \mathbf{w}}$
• Gradient descent: $\mathbf{w} \leftarrow \mathbf{w} - \alpha \cdot \mathbf{g}_t$
• Discard $(s_t, a_t, r_t, s_{t+1})$ after using it

51
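A hedged PyTorch sketch of this online procedure for a Q-network q(s, a; w); the target y_t is computed under torch.no_grad() so that, as in the semi-gradient updates earlier, no gradient flows through it. All names are illustrative.

import torch

def online_td_step(q_net, optimizer, s, a, r, s_next, gamma):
    # TD target y_t = r_t + gamma * max_a q(s_{t+1}, a; w), held fixed.
    with torch.no_grad():
        y = r + gamma * q_net(s_next.unsqueeze(0)).max(dim=1).values.squeeze()
    q = q_net(s.unsqueeze(0))[0, a]      # q_t = q(s_t, a_t; w)
    loss = 0.5 * (q - y) ** 2            # delta_t^2 / 2
    optimizer.zero_grad()
    loss.backward()                      # g_t = delta_t * dq/dw
    optimizer.step()                     # w <- w - alpha * g_t (for plain SGD)
    # The transition (s_t, a_t, r_t, s_{t+1}) is then discarded.
    return loss.item()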
Shortcoming 1: Waste of Experience

• A transition: $(s_t, a_t, r_t, s_{t+1})$

• Experience: all the transitions, for $t = 1, 2, \ldots$
• Previously, we discard $(s_t, a_t, r_t, s_{t+1})$ after using it
• It is a waste.

52
Shortcoming 2: Correlated Updates

• Previously, we used $(s_t, a_t, r_t, s_{t+1})$ sequentially, for $t = 1, 2, \ldots$, to update $\mathbf{w}$.

• Consecutive states, $s_t$ and $s_{t+1}$, are strongly correlated (which is bad).
• This violates the i.i.d. assumption commonly made for stochastic gradient methods
(a similar issue arises in continual learning!)

53
Extra Reading Materials

• Playing Atari with Deep Reinforcement Learning. In NIPS workshop, 2013.


• Human-level Control through Deep Reinforcement Learning. Nature, 2015.

55
Thanks & QA?
