2.3 Value Function Approximation
Q-Learning Algorithm for Off-Policy Control
Function Approximation and Deep RL
• The policy, value function, model, and agent-state update are all functions
• We want to learn these functions from experience
• If there are too many states, we need to approximate them
• This is often called deep reinforcement learning when neural networks are used to represent these functions
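As an illustration of representing a value function compactly, here is a minimal sketch of a linear approximator $v(s; w) = w^\top x(s)$; the polynomial feature map and the parameter shapes are assumptions for illustration, not from the slides:

```python
import numpy as np

def features(state, n=4):
    """Hypothetical feature map: a scalar state encoded as polynomial features."""
    return np.array([state ** i for i in range(n)], dtype=float)

def v(state, w):
    """Approximate value v(s; w) = w . x(s), linear in the features."""
    return w @ features(state, n=len(w))

w = np.zeros(4)  # parameter vector, to be learned from experience
```

The same interface applies when a neural network replaces the linear form: only `v` and its gradient change, not the learning algorithms built on top of them.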
Large-Scale Reinforcement Learning
Value Function Approximation (VFA)
Classes of Function Approximation
Which Function Approximation?
Function Approximator Examples
• Pixel space
Gradient-based Algorithms
Gradient Descent
$$\nabla_w J(w) = \begin{pmatrix} \dfrac{\partial J(w)}{\partial w_1} \\ \vdots \\ \dfrac{\partial J(w)}{\partial w_n} \end{pmatrix}$$
• To find a local minimum of $J(w)$, adjust $w$ in the direction of the negative gradient
$$\Delta w = -\tfrac{1}{2}\,\alpha\,\nabla_w J(w)$$
where $\alpha$ is a step-size parameter
Gradient Descent
$$\nabla_w J(w) = \begin{pmatrix} \dfrac{\partial J(w)}{\partial w_1} \\ \vdots \\ \dfrac{\partial J(w)}{\partial w_n} \end{pmatrix}$$
• Starting from a guess $w_0$
• We consider the sequence $w_0, w_1, w_2, \ldots$
• s.t. $\Delta w_{t+1} = -\tfrac{1}{2}\,\alpha\,\nabla_w J(w_t)$
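The iteration above can be checked numerically. A minimal sketch, assuming a simple quadratic objective $J(w) = \lVert w \rVert^2$ whose gradient is known in closed form (the objective and step count are illustrative choices):

```python
import numpy as np

def grad_J(w):
    # Gradient of the illustrative objective J(w) = ||w||^2.
    return 2.0 * w

alpha = 0.1                          # step-size parameter
w = np.array([1.0, -2.0])            # initial guess w_0
for _ in range(100):
    w = w - 0.5 * alpha * grad_J(w)  # Delta w = -(1/2) * alpha * grad J(w)
```

With this objective each step scales $w$ by $1 - \alpha$, so the iterates shrink geometrically toward the minimiser at the origin.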
• Goal: find the parameter vector $w$ minimizing the mean-squared error between the true value function $v_\pi(s)$ and its approximation $v(s; w)$
$$J(w) = \mathbb{E}_d\!\left[\left(v_\pi(S) - v(S; w)\right)^2\right]$$
where $d$ is a distribution over states (typically induced by the policy and dynamics)
• Gradient descent finds a local minimum
$$\Delta w = -\tfrac{1}{2}\,\alpha\,\nabla_w J(w) = \alpha\,\mathbb{E}_d\!\left[\left(v_\pi(S) - v(S; w)\right)\nabla_w v(S; w)\right]$$
• Stochastic gradient descent (SGD) samples the gradient
$$\Delta w = \alpha\,\left(G_t - v(S_t; w)\right)\nabla_w v(S_t; w)$$
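A sketch of the sampled update for a linear approximator, where $\nabla_w v(s; w)$ is simply the feature vector; the feature values and the fixed target used here are made up for illustration:

```python
import numpy as np

def v(x, w):
    return w @ x                      # linear value function; grad_w v(s; w) = x

def sgd_update(w, x, target, alpha=0.1):
    """One sampled gradient step: Delta w = alpha * (target - v) * grad_w v."""
    return w + alpha * (target - v(x, w)) * x

w = np.zeros(2)
x = np.array([1.0, 0.5])              # hypothetical features of S_t
for _ in range(200):
    w = sgd_update(w, x, target=3.0)  # regress v(S_t; w) toward a sampled return G_t = 3
```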
$$G_t^\lambda = R_{t+1} + \gamma\left((1 - \lambda)\,v(S_{t+1}; w) + \lambda\,G_{t+1}^\lambda\right)$$
Monte Carlo with Value Function Approximation
TD Learning with Value Function Approximation
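In TD learning with VFA, the Monte Carlo target is replaced by the bootstrapped target $R_{t+1} + \gamma\,v(S_{t+1}; w)$, with the gradient taken only through $v(S_t; w)$ (semi-gradient). A minimal sketch on an assumed two-state toy chain (the chain, features, and constants are illustrative):

```python
import numpy as np

def td0_update(w, x, r, x_next, gamma=0.9, alpha=0.1):
    """Semi-gradient TD(0): target r + gamma * v(S'; w), gradient only via v(S; w)."""
    td_error = r + gamma * (w @ x_next) - (w @ x)
    return w + alpha * td_error * x

# Toy chain: state A -> state B -> terminal, with rewards 0 then 1.
xA, xB, x_term = np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.zeros(2)
w = np.zeros(2)
for _ in range(500):
    w = td0_update(w, xA, 0.0, xB)      # A -> B, reward 0
    w = td0_update(w, xB, 1.0, x_term)  # B -> terminal, reward 1
# After convergence, v(B; w) is near 1 and v(A; w) near gamma * 1 = 0.9
```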
Control with Value Function Approximation
Action-Value Function Approximation
Convergence and Divergence
Convergence Questions
Example of Divergence
$$\Rightarrow \lim_{t \to \infty} w_t = \infty$$
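One classic instance of such divergence (assumed here for illustration, following the well-known Tsitsiklis and Van Roy two-state example) has a transition from a state with feature value 1 to a state with feature value 2, both valued through a single shared weight $w$, reward 0, and TD updates applied only on this transition:

```python
gamma, alpha = 0.9, 0.1
w = 1.0
for t in range(100):
    td_error = 0.0 + gamma * (2.0 * w) - 1.0 * w  # reward 0; v(s') = 2w, v(s) = w
    w += alpha * td_error * 1.0                   # feature of s is 1, so the gradient is 1
# Each step multiplies w by 1 + alpha * (2 * gamma - 1) = 1.08 > 1, so w_t grows without bound
```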
Deadly Triad
• The multiplier is negative when $2\gamma(1 - W) < 1 \;\Rightarrow\; W > 1 - \dfrac{1}{2\gamma}$
• E.g., with $\gamma = 0.9$, we need $W > \dfrac{4}{9} \approx 0.45$
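A quick arithmetic check of the threshold, assuming the condition reads $2\gamma(1 - W) < 1$:

```python
gamma = 0.9
# From 2 * gamma * (1 - W) < 1  =>  W > 1 - 1 / (2 * gamma)
W_threshold = 1 - 1 / (2 * gamma)
```

This gives $W_\text{threshold} = 4/9 \approx 0.444$, matching the slide's $\approx 0.45$.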
Convergence of Prediction and Control Algorithms
Deep Q Network (DQN)
Deep Reinforcement Learning
DQN in Atari
DQN Results in Atari
Temporal Difference (TD) Learning
Shortcoming 1: Waste of Experience
Shortcoming 2: Correlated Updates
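DQN's experience replay addresses both shortcomings: stored transitions are reused many times (shortcoming 1), and sampling them uniformly at random breaks the correlation between consecutive updates (shortcoming 2). A minimal sketch, with capacity and batch size chosen arbitrarily:

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores transitions for reuse; uniform random sampling decorrelates updates."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old transitions are evicted when full

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):
    buf.add(t, 0, 0.0, t + 1, False)  # dummy transitions for illustration
batch = buf.sample(32)                # near-i.i.d. minibatch instead of a correlated stream
```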
Extra Reading Materials
Thanks & Q&A