09 - Monte Carlo Learning
Saverio Bolognani
policy iteration
▶ policy evaluation
    V ← (I − γ P^π)^{−1} R^π,   where   P^π_{x,x'} = Σ_u π(x,u) P^u_{x,x'}
▶ policy improvement
    π(x, u) ← argmin_ν Σ_u ν(x,u) [ R_x^u + γ Σ_{x'} P^u_{xx'} V(x') ]

value iteration
    V(x) ← min_π Σ_u π(x,u) [ R_x^u + γ Σ_{x'} P^u_{xx'} V(x') ]
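For concreteness, here is a minimal tabular sketch of the two recapped steps (illustrative code, not from the lecture; it assumes the model is known, stored as transition probabilities P[u, x, x'] and stage costs R[x, u], with discount factor gamma):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Solve V = (I - gamma * P_pi)^(-1) R_pi for a stochastic policy pi[x, u]."""
    N = P.shape[1]
    P_pi = np.einsum('xu,uxy->xy', pi, P)   # P^pi_{x,x'} = sum_u pi(x,u) P^u_{x,x'}
    R_pi = np.sum(pi * R, axis=1)           # R^pi_x = sum_u pi(x,u) R_x^u
    return np.linalg.solve(np.eye(N) - gamma * P_pi, R_pi)

def policy_improvement(P, R, V, gamma):
    """Greedy improvement: put all mass on argmin_u [ R_x^u + gamma * sum_x' P^u_{xx'} V(x') ]."""
    Q = R + gamma * np.einsum('uxy,y->xu', P, V)
    pi = np.zeros_like(R)
    pi[np.arange(R.shape[0]), np.argmin(Q, axis=1)] = 1.0
    return pi
```

Alternating the two functions until the policy stops changing gives the model-based policy iteration recalled above.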
Can we find the optimal policy without model information?
Two different settings:
Q function
The Q function returns the expected future cost of
▶ taking action u at state x
▶ applying policy π at subsequent times
    Q^π(x,u) = R_x^u + γ Σ_{x'} P^u_{xx'} V^π(x')
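As a small illustration, this definition can be evaluated numerically in one line once V^π is available (assumed array shapes: P[u, x, x'], R[x, u], V_pi[x]):

```python
import numpy as np

def q_from_v(P, R, V_pi, gamma):
    # Q^pi[x, u] = R[x, u] + gamma * sum_x' P[u, x, x'] * V_pi[x']
    return R + gamma * np.einsum('uxy,y->xu', P, V_pi)
```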
Bellman equation
In value function V
    V^π(x) = Σ_u π(x,u) [ R_x^u + γ Σ_{x'} P^u_{xx'} V^π(x') ]

In Q function
    Q^π(x,u) = R_x^u + γ Σ_{x'} P^u_{xx'} Σ_{u'} π(x',u') Q^π(x',u')
(the inner sum Σ_{u'} π(x',u') Q^π(x',u') is exactly V^π(x'))
Bellman optimality principle
In value function V
    V^*(x) = min_u ( R_x^u + γ Σ_{x'} P^u_{xx'} V^*(x') )

In Q function
    Q^*(x,u) = R_x^u + γ Σ_{x'} P^u_{xx'} min_{u'} Q^*(x',u')
(the inner minimum min_{u'} Q^*(x',u') equals Σ_{u'} π^*(x',u') Q^*(x',u'))

that is
    Q^*(x,u) = R_x^u + γ Σ_{x'} P^u_{xx'} V^*(x')
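This form of the optimality principle suggests a fixed-point iteration carried out directly on Q. A hedged sketch, under the same assumed array shapes P[u, x, x'] and R[x, u] as before:

```python
import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Iterate Q(x,u) <- R_x^u + gamma * sum_x' P^u_{xx'} min_u' Q(x',u') to convergence."""
    Q = np.zeros(R.shape)
    for _ in range(max_iter):
        V = Q.min(axis=1)                                 # min_u' Q(x', u')
        Q_new = R + gamma * np.einsum('uxy,y->xu', P, V)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```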
Policy iteration
Remember:
    V ← (I − γ P^π)^{−1} R^π
Policy iteration with Q function
Remember:
Experimental policy evaluation
We can delegate the policy evaluation to experiments/simulations!
    Q^π(x_k, u_k) ≈ g_k,   where g_k is the return (the discounted cumulative cost) observed from step k onward in the episode
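A minimal sketch of this idea (illustrative, with assumptions: each episode is a list of (x_k, u_k, c_k) triples generated under π, g_k is the discounted cost accumulated from step k to the end of the episode, and every-visit averaging is used):

```python
from collections import defaultdict

def mc_q_evaluation(episodes, gamma):
    """Estimate Q^pi(x, u) as the average of the returns observed after visiting (x, u)."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards so that g is the return observed from step k onward.
        for x, u, c in reversed(episode):
            g = c + gamma * g
            returns[(x, u)].append(g)
    # Q^pi(x, u) ~ empirical average of the observed returns g_k
    return {xu: sum(gs) / len(gs) for xu, gs in returns.items()}
```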
A few notes
The estimate of Q does not satisfy the Bellman equation (except in the limit).
Why?
Exploration vs exploitation
The greedy policy improvement step can get you stuck in a suboptimal solution:
▶ you haven't learned enough, so you only have a rough approximation of Q^π
▶ your policy is optimized for this approximate Q^π and does not explore further
    π(x,u) ← e^{−βQ(x,u)} / Σ_u e^{−βQ(x,u)}
As the learning progresses from episode to episode, ϵ → 0 or β → ∞:
▶ explore less
▶ exploit more
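For concreteness, a hedged sketch of the two exploration mechanisms referred to above, ϵ-greedy action selection and the softmax (Boltzmann) policy π(x,u) ∝ e^{−βQ(x,u)}; the annealing schedules in the comments are illustrative choices, not prescribed by the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q_row, epsilon):
    """Q_row: 1-D array of Q(x, u) over actions u at the current state x (costs, so greedy = argmin)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))     # explore: uniformly random action
    return int(np.argmin(Q_row))                 # exploit: greedy action

def softmax_action(Q_row, beta):
    """Sample u with probability proportional to exp(-beta * Q(x, u))."""
    logits = -beta * (Q_row - np.min(Q_row))     # shift for numerical stability
    p = np.exp(logits) / np.sum(np.exp(logits))
    return int(rng.choice(len(Q_row), p=p))

# Illustrative schedules: explore less and exploit more as episodes progress,
# e.g. epsilon = 1 / (episode + 1) -> 0 and beta = beta0 * (episode + 1) -> infinity.
```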
Scalability and parametrization of Q
In theory, we can store Q for all state-action pairs, as in a gigantic look-up table.
However, this quickly becomes computationally infeasible when the state and action
spaces are large:
▶ n-dimensional state → exponential table size
▶ multiple-input systems → exponentially many action combinations
▶ continuous state and/or action space → quickly intractable
Parametrization of Q
Very often (almost always) the Q function is parametrized via a "low"-dimensional
vector of parameters:
    Q(x,u) = Q_θ(x,u),   θ ∈ R^d
Advantages
▶ Much smaller memory requirements (d instead of NM)
▶ Problem size is (apparently) independent of N → continuous state space
▶ A smart parametrization may allow us to "guess" the Q function at state-action
  pairs that have not been observed (interpolation/extrapolation)
▶ Prior information on the problem may suggest a smart parametrization (example?)
Disadvantages
▶ Reduced degrees of freedom
▶ The solution of the Bellman equation for a given policy π may not belong to
  Q = {Q_θ, θ ∈ R^d}
  Possible consequences for
  ▶ convergence of the policy iteration
  ▶ optimality of the limit
▶ Computing the θ that best approximates the data collected from an episode may
  be computationally expensive
Example: linear parametrization
" Disclaimer
More advanced parametrizations are often used (e.g., neural networks).
Nevertheless, understanding the simple case of linear parametrization gives you a solid
understanding of the core idea: the same can be applied to other parametrizations, typically
by relying on customized algorithms instead of these simple steps.
d
X
Qθ (x, u) = ϕℓ (x, u)θl = ϕ⊤ (x, u)θ
ℓ=1
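As an illustration of this linear parametrization, a small sketch with an assumed feature map ϕ (here, polynomial features of a scalar state with one block per discrete action; any other basis works the same way):

```python
import numpy as np

NUM_ACTIONS = 2      # assumed, for illustration
D_STATE = 3          # features per action: 1, x, x^2

def phi(x, u):
    """Feature vector phi(x, u) in R^d, with d = D_STATE * NUM_ACTIONS."""
    features = np.zeros(D_STATE * NUM_ACTIONS)
    features[u * D_STATE:(u + 1) * D_STATE] = [1.0, x, x ** 2]
    return features

def q_theta(x, u, theta):
    # Q_theta(x, u) = phi(x, u)^T theta
    return phi(x, u) @ theta
```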
Projected Bellman equation
Matrix form
    Q(x,u) = R_x^u + γ Σ_{x'} Σ_{u'} P^u_{xx'} π(x',u') Q(x',u')

The right-hand side defines the Bellman operator B(Q), which in matrix form reads

    B(Q) = R + γ T^π Q        (Q and R stacked as NM-dimensional vectors)
The solution of the projected Bellman equation Q_θ = Π(B(Q_θ)), where Π denotes the
ρ-weighted projection onto the span of the basis functions, is given by the solution of
    Φ^⊤ρΦ θ = γ Φ^⊤ρT^πΦ θ + Φ^⊤ρR
where Φ ∈ R^{NM×d} is the matrix of basis functions.
Proof:
As Q_θ = Φθ, we rewrite B(Q_θ) as R + γT^πΦθ.
To compute the projection, we look for the θ̂ that minimizes ∥Φθ̂ − R − γT^πΦθ∥²_ρ.
This can be computed in closed form by zeroing the gradient. Setting θ̂ = θ
(Bellman equation) returns directly
    Φ^⊤ρΦ θ = γ Φ^⊤ρT^πΦ θ + Φ^⊤ρR
or, equivalently,
    (Φ^⊤ρΦ − γ Φ^⊤ρT^πΦ) θ = Φ^⊤ρR
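A short numerical sketch of this closed-form solution (model-based and purely illustrative; Phi, rho, T_pi and Rvec are assumed to be given as arrays):

```python
import numpy as np

def projected_bellman_solution(Phi, rho, T_pi, Rvec, gamma):
    """Phi: (NM x d) basis matrix, rho: (NM,) weights, T_pi: (NM x NM), Rvec: (NM,) stage costs."""
    W = np.diag(rho)                                        # weighting matrix rho
    A = Phi.T @ W @ Phi - gamma * Phi.T @ W @ T_pi @ Phi
    b = Phi.T @ W @ Rvec
    return np.linalg.solve(A, b)                            # theta; the approximate Q^pi is Phi @ theta
```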
Model free solution
    (Φ^⊤ρΦ − γ Φ^⊤ρT^πΦ) θ = Φ^⊤ρR

Each of these matrices can be estimated from data, using the observed feature vectors ϕ(x_k, u_k).
Key idea of the proof
Consider the first matrix, Φ^⊤ρΦ. Let's prove that you can construct it as the
empirical average of the terms ϕ(x_k, u_k)ϕ(x_k, u_k)^⊤.
For simplicity, let's enumerate the state-action pairs from s = 1 to S.
There is no need for them to be finite; it's just for convenience.
Let’s then consider a single data point (state-action pair) s.
The corresponding term is
    ϕ(s)ϕ(s)^⊤ =  [ ϕ_1(s)ϕ_1(s)   ⋯   ϕ_1(s)ϕ_d(s) ]
                  [      ⋮          ⋱        ⋮       ]
                  [ ϕ_d(s)ϕ_1(s)   ⋯   ϕ_d(s)ϕ_d(s) ]

When many samples are considered in an empirical average, the generic element
in position (i, j) of this matrix will be

    E[ ϕ(s)ϕ(s)^⊤ ]_{ij} = Σ_{s=1}^{S} ρ(s) ϕ_i(s) ϕ_j(s)

where 0 < ρ(s) < 1 is the frequency of the state-action pair s in the dataset.
Let us instead consider the matrix

    Φ^⊤ρΦ = [ ϕ_1  ⋯  ϕ_d ]^⊤ diag(ρ(1), …, ρ(S)) [ ϕ_1  ⋯  ϕ_d ]

where ϕ_ℓ ∈ R^S is the ℓ-th column of Φ (the ℓ-th basis function evaluated at all
state-action pairs): its element in position (i, j) is Σ_{s=1}^{S} ρ(s) ϕ_i(s) ϕ_j(s),
exactly the empirical average computed above.
A similar reasoning can be done for the other two matrices, Φ^⊤ρT^πΦ and Φ^⊤ρR.
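Putting the pieces together, the three matrices can be replaced by empirical averages over samples (x_k, u_k, c_k, x_{k+1}, u_{k+1}) collected under π, which yields a model-free estimate of θ. A hedged, LSTD-style sketch (the feature map ϕ is assumed to be given):

```python
import numpy as np

def lstd_q(samples, phi, gamma, d):
    """samples: list of (x, u, c, x_next, u_next) tuples collected under the policy pi."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, u, c, x_next, u_next in samples:
        f, f_next = phi(x, u), phi(x_next, u_next)
        A += np.outer(f, f - gamma * f_next)   # empirical counterpart of Phi^T rho (Phi - gamma T^pi Phi)
        b += c * f                             # empirical counterpart of Phi^T rho R
    A /= len(samples)
    b /= len(samples)
    return np.linalg.solve(A, b)               # theta such that Q_theta = Phi theta approximates Q^pi
```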
Key points
The control engineer flowchart
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License
https://bsaver.io/COCO