09 - Monte Carlo Learning

The document discusses computational control and Monte Carlo learning, focusing on policy iteration and value iteration methods that rely on model information. It explores the challenges of finding optimal policies without model information, including the use of Q functions and the Bellman equation. Additionally, it addresses the concepts of exploration vs. exploitation, scalability of Q functions, and the projected Bellman equation for model-free solutions.


Computational Control

Monte Carlo learning

Saverio Bolognani

Automatic Control Laboratory (IfA)


ETH Zurich

All the algorithmic steps that we saw relied on model information, in the form of
transition probabilities.

policy iteration
▶ policy evaluation
$$V \leftarrow (I - \gamma P^\pi)^{-1} R^\pi, \qquad P^\pi_{x,x'} = \sum_u \pi(x, u)\, P^u_{x,x'}$$
▶ policy improvement
$$\pi(x, u) \leftarrow \operatorname*{argmin}_\nu \sum_u \nu(x, u)\left[ R^u_x + \gamma \sum_{x'} P^u_{x,x'} V(x') \right]$$

value iteration
$$V(x) \leftarrow \min_\pi \sum_u \pi(x, u)\left[ R^u_x + \gamma \sum_{x'} P^u_{x,x'} V(x') \right]$$

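Purely as an illustration (not part of the original slides), here is a minimal NumPy sketch of one policy-iteration sweep on a small synthetic MDP; the sizes and the arrays P, R, pi are all assumptions. Note how the model P appears explicitly in both steps.

```python
import numpy as np

# Hypothetical tabular data: N states, M actions (assumed, not from the slides).
# P[u, x, y] = transition probability, R[x, u] = expected stage cost.
N, M, gamma = 4, 2, 0.9
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(N), size=(M, N))      # each P[u, x, :] sums to 1
R = rng.uniform(0.0, 1.0, size=(N, M))
pi = np.full((N, M), 1.0 / M)                   # stochastic policy pi(x, u)

# Policy evaluation: V = (I - gamma * P_pi)^{-1} R_pi  -- needs the model.
P_pi = np.einsum('xu,uxy->xy', pi, P)           # P^pi_{x,x'} = sum_u pi(x,u) P^u_{x,x'}
R_pi = np.sum(pi * R, axis=1)
V = np.linalg.solve(np.eye(N) - gamma * P_pi, R_pi)

# Greedy policy improvement -- also needs the model (P appears explicitly).
Q_greedy = R + gamma * np.einsum('uxy,y->xu', P, V)
pi = np.eye(M)[Q_greedy.argmin(axis=1)]         # deterministic argmin policy
```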
Can we find the optimal policy without model information?
Two different settings:

Based on collected data (typically in the form of repeated episodes)

$(x_0, u_0, r_0),\ (x_1, u_1, r_1),\ \dots,\ (x_T, u_T, r_T)$

▶ repetition of the control task in a numerical simulator
▶ repetition of the control task in a controlled environment

Online during the control task with no prior training
→ adaptive control
- extremely challenging dual control task (learn AND control)
- few guarantees
- an open problem for decades!

Q function

A simple reformulation transforms the problem into a much more convenient form:

Q function
The Q function returns the expected future cost of
taking action u at state x
applying policy π at subsequent times
$$Q^\pi(x, u) = R^u_x + \gamma \sum_{x'} P^u_{x,x'}\, V^\pi(x')$$

$V^\pi$ is defined over $N$ states

$Q^\pi$ is defined over $N$ states $\times$ $M$ actions

The two are trivially related by

$$V^\pi(x) = \sum_u \pi(x, u)\, Q^\pi(x, u)$$

Bellman equation

In value function V
" #
X X ′
π
V (x) = π(x, u) Rxu +γ u
Pxx π
′ V (x )

u x′

In Q function
$$Q^\pi(x, u) = R^u_x + \gamma \sum_{x'} P^u_{x,x'} \underbrace{\sum_{u'} \pi(x', u')\, Q^\pi(x', u')}_{V^\pi(x')}$$

Very similar model information needed


System of linear equations (N equations vs. NM equations)
“Consistency” relation that completely defines V / Q based on cost and
transition probabilities

Bellman optimality principle

In value function V
$$V^*(x) = \min_u \left( R^u_x + \gamma \sum_{x'} P^u_{x,x'}\, V^*(x') \right)$$

In Q function
$$Q^*(x, u) = R^u_x + \gamma \sum_{x'} P^u_{x,x'} \underbrace{\min_{u'} Q^*(x', u')}_{\sum_{u'} \pi^*(x', u')\, Q^*(x', u')}$$

that is
$$Q^*(x, u) = R^u_x + \gamma \sum_{x'} P^u_{x,x'}\, V^*(x')$$

same model information


swapped min with expectation operator
value/Q function implicitly defines the optimal strategy – how?
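One possible answer, made explicit here (it follows from the policy-improvement steps later in the deck): the greedy policy can be read directly off $Q^*$ without any model, whereas extracting it from $V^*$ still requires the transition probabilities:

$$\pi^*(x) = \operatorname*{argmin}_u Q^*(x, u), \qquad \pi^*(x) = \operatorname*{argmin}_u \left( R^u_x + \gamma \sum_{x'} P^u_{x,x'}\, V^*(x') \right)$$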

Policy iteration

Remember:

1 Compute the value V associated with the policy π

$$V \leftarrow (I - \gamma P^\pi)^{-1} R^\pi$$

2 Greedy update of the policy π

$$\pi(x, u) \leftarrow \operatorname*{argmin}_\nu \sum_u \nu(x, u)\left[ R^u_x + \gamma \sum_{x'} P^u_{x,x'}\, V(x') \right]$$

Model information is needed both in the
1 policy evaluation step and in the
2 policy improvement step.

Policy iteration with Q function

Remember:

1 Compute the function Q(x, u) associated with the policy π by solving the Bellman equation

$$Q(x, u) = R^u_x + \gamma \sum_{x'} P^u_{x,x'} \sum_{u'} \pi(x', u')\, Q(x', u')$$

2 Greedy update of the policy π

$$\pi(x, u) \leftarrow \operatorname*{argmin}_\nu \sum_u \nu(x, u)\, Q(x, u)$$

1 The policy evaluation step is complex:
▶ it requires model information
▶ it is high-dimensional
2 The policy improvement step is now trivial and requires no model information

Why is this a good idea in a model-free setting?
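For contrast with the V-based sketch earlier, the same sweep written with the Q function (again a NumPy sketch with assumed model arrays P, R and policy pi, not from the slides): the evaluation step still uses the model, but the improvement step only reads the Q table.

```python
import numpy as np

# Hypothetical model, as in the earlier sketch: P[u, x, y], R[x, u], pi[x, u].
N, M, gamma = 4, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(N), size=(M, N))
R = rng.uniform(0.0, 1.0, size=(N, M))
pi = np.full((N, M), 1.0 / M)

# 1) Policy evaluation in Q: solve the NM-dimensional linear Bellman system
#    Q = R + gamma * T_pi Q, where T_pi[(x,u),(y,v)] = P[u,x,y] * pi[y,v].
T_pi = np.einsum('uxy,yv->xuyv', P, pi).reshape(N * M, N * M)
Q = np.linalg.solve(np.eye(N * M) - gamma * T_pi, R.reshape(N * M)).reshape(N, M)

# 2) Policy improvement: greedy in Q -- no model information needed here.
pi = np.eye(M)[Q.argmin(axis=1)]
```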

Experimental policy evaluation
We can delegate the policy evaluation to experiment/simulations!

1 Let the system be controlled by a policy π and collect a (long) episode

$(x_0, u_0, r_0),\ (x_1, u_1, r_1),\ \dots,\ (x_T, u_T, r_T)$

2 Compute the empirical cost $g_0, g_1, \dots, g_T$ where

$$g_k = \sum_{i=k}^{T} \gamma^{i-k}\, r_i$$

3 Interpret $g_k$ as a realization of the return in the state-action pair $(x_k, u_k)$

$$Q^\pi(x_k, u_k) \approx g_k$$

or, in case of multiple visits,

$$Q^\pi(x, u) = \frac{\sum_{t=1}^{T} \mathbb{1}[x_t = x,\ u_t = u]\; g_t}{\sum_{t=1}^{T} \mathbb{1}[x_t = x,\ u_t = u]}$$

Ergodicity assumption!

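A minimal sketch of this every-visit Monte Carlo estimate (Python, not part of the slides); `episode` is assumed to be a list of (x, u, r) tuples collected under the policy π.

```python
from collections import defaultdict

def mc_q_estimate(episode, gamma):
    """Every-visit Monte Carlo estimate of Q^pi from one episode.

    episode: list of (x, u, r) tuples, assumed collected under policy pi.
    Returns a dict mapping (x, u) -> averaged empirical return g.
    """
    # Empirical returns g_k = sum_{i>=k} gamma^(i-k) r_i, computed backwards.
    g, returns = 0.0, []
    for (_, _, r) in reversed(episode):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()

    # Average the returns over all visits to each state-action pair.
    total, count = defaultdict(float), defaultdict(int)
    for (x, u, _), g_k in zip(episode, returns):
        total[(x, u)] += g_k
        count[(x, u)] += 1
    return {su: total[su] / count[su] for su in total}

# Toy usage (states and actions are just labels here).
episode = [('s0', 'a0', 1.0), ('s1', 'a1', 0.0), ('s0', 'a0', 2.0)]
print(mc_q_estimate(episode, gamma=0.9))
```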
A few notes

The policy cannot be evaluated (Q cannot be computed) until the episode is terminated.
Why?

The estimate of Q does not satisfy the Bellman equation (except in the limit).
Why?

In other words, Markovianity is not exploited to make the computation of Q more efficient.

Exploration vs exploitation
The greedy policy improvement step can get you stuck in a suboptimal solution:
you haven’t learned enough, so you have a rough approximation of Qπ
your policy is optimized for this approximate Qπ and does not explore further

ϵ-greedy policy improvement

$$\pi(x, u) \leftarrow \begin{cases} \operatorname*{argmin}_{u} Q(x, u) & \text{with probability } 1 - \epsilon \\ \text{Uniform}(U) & \text{with probability } \epsilon \end{cases}$$

Large ϵ → more exploration

Boltzmann policy improvement

$$\pi(x, u) \leftarrow \frac{e^{-\beta Q(x, u)}}{\sum_{u'} e^{-\beta Q(x, u')}}$$

Small β → more exploration

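A minimal sketch of both exploratory rules over a Q table (Python, not from the slides); `q_row` is assumed to hold the costs Q(x, ·) of the current state, so lower is better.

```python
import numpy as np

def epsilon_greedy_action(q_row, epsilon, rng):
    """q_row: costs Q(x, .) for the current state; lower cost is better."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))   # explore: uniform over actions
    return int(np.argmin(q_row))               # exploit: greedy (minimum cost)

def boltzmann_action(q_row, beta, rng):
    """Sample an action with probability proportional to exp(-beta * Q)."""
    logits = -beta * (q_row - q_row.min())     # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(q_row), p=probs))

rng = np.random.default_rng(0)
q_row = np.array([1.0, 0.2, 0.7])              # hypothetical costs for 3 actions
print(epsilon_greedy_action(q_row, epsilon=0.1, rng=rng))
print(boltzmann_action(q_row, beta=5.0, rng=rng))
```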
As the learning progresses from episode to episode, ϵ → 0 or β → ∞
explore less
exploit more

If done “properly”, then with probability 1


each state-action pair is visited infinitely often
the policy converges to a greedy policy

Scalability and parametrization of Q

In theory, we can store Q for all state-action pairs, like in a gigantic table.
However, this quickly becomes computationally infeasible when the state and action spaces are large.
n-dimensional state → exponential size
multiple-input systems
continuous state and/or action space → quickly intractable

Parametrization of Q
Very often (almost always) the Q function is parametrized via a “low” dimensional
vector of parameters
$$Q(x, u) = Q_\theta(x, u), \qquad \theta \in \mathbb{R}^d$$

Advantages
Much smaller memory requirements (d instead of NM)
Problem size is (apparently) independent of N → continuous state space
A smart parametrization may make it possible to “guess” the Q function in state-action
pairs that have not been observed (interpolation/extrapolation)
Prior information on the problem may suggest a smart parametrization
(example?)

Disadvantages
Reduced degrees of freedom
The solution to the Bellman equation for a given policy π may not belong to

$$\mathcal{Q} = \{Q_\theta,\ \theta \in \mathbb{R}^d\}$$
Possible consequences on
▶ convergence of policy iterations
▶ optimality of the limit
Computing θ that best approximates the data collected from an episode may
be computationally expensive

Example: linear parametrization

" Disclaimer
More advanced parametrizations are often used (e.g., neural networks).
Nevertheless, understanding the simple case of linear parametrization gives you a solid
understanding of the core idea: the same can be applied to other parametrizations, typically
by relying on customized algorithms instead of these simple steps.

$$Q_\theta(x, u) = \sum_{\ell=1}^{d} \phi_\ell(x, u)\, \theta_\ell = \phi^\top(x, u)\, \theta$$

$\phi_\ell(x, u)$ are basis functions, such as

▶ polynomials $x, u, x^2, u^2, xu, \dots$
▶ quadratic forms $x^\top A_\ell x, \dots$
▶ specific features (Tetris: total height, and 21 others)

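A minimal sketch of a linear parametrization with polynomial features (Python, not from the slides); the scalar state/action and the particular features are only illustrative assumptions.

```python
import numpy as np

def phi(x, u):
    """Hypothetical polynomial feature vector for scalar state x and action u."""
    return np.array([1.0, x, u, x * x, u * u, x * u])

theta = np.zeros(6)                # d = 6 parameters instead of an N x M table

def q_theta(x, u):
    """Q_theta(x, u) = phi(x, u)^T theta."""
    return phi(x, u) @ theta

print(q_theta(0.5, -1.0))          # evaluates even at unvisited (x, u) pairs
```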
Projected Bellman equation

We cannot aim at solving the Bellman equation

$$Q(x, u) = \underbrace{R^u_x + \gamma \sum_{x'} P^u_{x,x'} \sum_{u'} \pi(x', u')\, Q(x', u')}_{B(Q)}$$

because the solution is (in general) not in the form $\phi^\top \theta$.


Define the following notion of “best approximant”

$$\Pi(Q) = \operatorname*{argmin}_{Q_\theta \in \mathcal{Q}} \|Q_\theta - Q\|^2_\rho$$

Projected Bellman equation

Determine $Q_\theta$ that satisfies
$$Q_\theta = \Pi\big(B(Q_\theta)\big)$$

Matrix form

$$Q(x, u) = \underbrace{R^u_x + \gamma \sum_{x'} P^u_{x,x'} \sum_{u'} \pi(x', u')\, Q(x', u')}_{B(Q)}$$

Let Q be a vector in $\mathbb{R}^{NM}$ that encodes the Q function.
Let R be a vector in $\mathbb{R}^{NM}$ with the expected immediate cost of each state-input pair.
Let $T^\pi$ be the transition probability matrix between state-input pairs, determined by both
▶ the transition probabilities $P^u_{x,x'}$
▶ the policy $\pi(x, u)$

Q-Bellman equation in matrix form

$$B(Q) = R + \gamma T^\pi Q$$
(NM-dimensional)

The solution to the projected Bellman equation $Q_\theta = \Pi(B(Q_\theta))$ is given by the solution of
$$\Phi^\top \rho\, \Phi\, \theta = \gamma\, \Phi^\top \rho\, T^\pi \Phi\, \theta + \Phi^\top \rho\, R$$
where $\Phi \in \mathbb{R}^{NM \times d}$ is the matrix of basis functions.

Proof:
As $Q_\theta = \Phi\theta$, we rewrite $B(Q_\theta)$ as $R + \gamma T^\pi \Phi\theta$.
To compute the projection, we look for $\hat\theta$ that minimizes $\|\Phi\hat\theta - R - \gamma T^\pi \Phi\theta\|^2_\rho$.
This can be computed in closed form by zeroing the gradient. Setting $\hat\theta = \theta$ (projected Bellman equation) returns directly

$$\Phi^\top \rho\, \Phi\, \theta = \gamma\, \Phi^\top \rho\, T^\pi \Phi\, \theta + \Phi^\top \rho\, R$$

or, equivalently,
$$\left(\Phi^\top \rho\, \Phi - \gamma\, \Phi^\top \rho\, T^\pi \Phi\right) \theta = \Phi^\top \rho\, R$$

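For completeness, a short reconstruction of the omitted gradient step (assuming the weighted norm $\|v\|^2_\rho = v^\top \rho\, v$ with $\rho$ diagonal):

$$\nabla_{\hat\theta} \left\| \Phi\hat\theta - R - \gamma T^\pi \Phi\theta \right\|^2_\rho = 2\, \Phi^\top \rho \left( \Phi\hat\theta - R - \gamma T^\pi \Phi\theta \right) = 0
\quad\Longrightarrow\quad
\Phi^\top \rho\, \Phi\, \hat\theta = \gamma\, \Phi^\top \rho\, T^\pi \Phi\, \theta + \Phi^\top \rho\, R.$$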
The Bellman equation in θ is a system of d linear equations (instead of NM)

Model free solution

 
$$\left(\Phi^\top \rho\, \Phi - \gamma\, \Phi^\top \rho\, T^\pi \Phi\right) \theta = \Phi^\top \rho\, R$$

How do we compute the matrices $\Phi^\top \rho\, \Phi$, $\Phi^\top \rho\, T^\pi \Phi$, and $\Phi^\top \rho\, R$?


For every sample $(x_k, u_k, r_k)$ of the episode and the successive sample $(x_{k+1}, u_{k+1})$, we construct

$$\phi(x_k, u_k)\, \phi(x_k, u_k)^\top, \qquad \phi(x_k, u_k)\, \phi(x_{k+1}, u_{k+1})^\top, \qquad \phi(x_k, u_k)\, r_k$$

where $\phi(x_k, u_k)$ is the vector $\begin{bmatrix} \phi_1(x_k, u_k) & \cdots & \phi_d(x_k, u_k) \end{bmatrix}^\top$.

The empirical average of these terms corresponds to the matrices above, where $\rho$ encodes the frequency of each state-input pair in the episode.

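A minimal sketch of this estimation procedure (Python, not from the slides); `episode` is assumed to be a list of (x, u, r) tuples generated under a fixed policy π, and `phi` a feature map as in the earlier sketch. Alternating this estimate with the greedy improvement step gives a model-free approximate policy iteration, as summarized in the key points below.

```python
import numpy as np

def lstd_q(episode, phi, gamma):
    """Estimate theta from one episode by averaging the three sample terms.

    episode: list of (x, u, r) tuples collected under a fixed policy pi.
    phi: feature map, phi(x, u) -> length-d vector.
    Solves the empirical version of (Phi'rho Phi - gamma Phi'rho T_pi Phi) theta = Phi'rho R.
    """
    d = len(phi(*episode[0][:2]))
    A = np.zeros((d, d))   # average of phi(x_k, u_k) phi(x_k, u_k)^T
    B = np.zeros((d, d))   # average of phi(x_k, u_k) phi(x_{k+1}, u_{k+1})^T
    b = np.zeros(d)        # average of phi(x_k, u_k) r_k
    for (x, u, r), (xn, un, _) in zip(episode[:-1], episode[1:]):
        f = phi(x, u)
        A += np.outer(f, f)
        B += np.outer(f, phi(xn, un))
        b += f * r
    n = len(episode) - 1
    return np.linalg.solve(A / n - gamma * B / n, b / n)

# Toy usage with hypothetical scalar states/actions and polynomial features.
phi = lambda x, u: np.array([1.0, x, u, x * u])
rng = np.random.default_rng(0)
episode = [(rng.normal(), int(rng.integers(2)), rng.normal()) for _ in range(200)]
theta = lstd_q(episode, phi, gamma=0.9)
```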
Key idea of the proof

Consider the first matrix, $\Phi^\top \rho\, \Phi$. Let’s prove that you can construct it as the empirical average of the terms $\phi(x_k, u_k)\, \phi(x_k, u_k)^\top$.
For simplicity, let’s enumerate the state-action pairs from $s = 1$ to $S$.
There is no need for them to be finite, it’s just for convenience.
Let’s then consider a single data point (state-action pair) s.
The corresponding term is
 
$$\phi(s)\, \phi(s)^\top = \begin{bmatrix} \phi_1(s)\phi_1(s) & \dots & \phi_1(s)\phi_d(s) \\ \vdots & & \vdots \\ \phi_d(s)\phi_1(s) & \dots & \phi_d(s)\phi_d(s) \end{bmatrix}$$

When many samples are considered in an empirical average, the generic element
in position i, j of this matrix will be
$$\mathbb{E}\left[ \phi(s)\, \phi(s)^\top \right]_{ij} = \sum_{s=1}^{S} \rho(s)\, \phi_i(s)\, \phi_j(s)$$

where 0 < ρ(s) < 1 is the frequency of the state-action pair s in the dataset.

Let us instead consider the matrix

$$\Phi^\top \rho\, \Phi = \begin{bmatrix} \phi_1^\top \\ \vdots \\ \phi_d^\top \end{bmatrix} \begin{bmatrix} \rho(1) & & \\ & \ddots & \\ & & \rho(S) \end{bmatrix} \begin{bmatrix} \phi_1 & \cdots & \phi_d \end{bmatrix}$$

A generic element in position i, j is

$$\left[ \Phi^\top \rho\, \Phi \right]_{ij} = \phi_i^\top \begin{bmatrix} \rho(1) & & \\ & \ddots & \\ & & \rho(S) \end{bmatrix} \phi_j = \sum_{s=1}^{S} \rho(s)\, \phi_i(s)\, \phi_j(s)$$

which is identical to the expression that we found for $\left[\phi(s)\,\phi(s)^\top\right]_{ij}$.

A similar reasoning can be done for the other two matrices, $\Phi^\top \rho\, T^\pi \Phi$ and $\Phi^\top \rho\, R$.

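A small numerical check of this identity (Python, not from the slides); the features and frequencies below are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
S, d = 5, 3                              # state-action pairs and features (assumed)
Phi = rng.normal(size=(S, d))            # Phi[s, l] = phi_l(s)
rho = rng.dirichlet(np.ones(S))          # visit frequencies of the S pairs

lhs = Phi.T @ np.diag(rho) @ Phi                                  # Phi^T rho Phi
rhs = sum(rho[s] * np.outer(Phi[s], Phi[s]) for s in range(S))    # weighted average of phi(s) phi(s)^T
print(np.allclose(lhs, rhs))             # True
```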
Key points

By using the Q-function representation, policy iteration can be split into
▶ a trivial policy improvement
▶ a policy evaluation that can be delegated to an experiment/simulation

Alternating between policy improvement and episodic experiments allows computing the optimal strategy

Sufficient exploration needs to be guaranteed (not too greedy!)

Q-function approximations are needed to ensure scalability

Many variations exist!

The control engineer flowchart

This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License

https://bsaver.io/COCO
