09 - Monte Carlo Learning
Saverio Bolognani
policy iteration
▶ policy evaluation
    V ← (I − γ P^π)^{−1} R^π,   where   P^π_{x,x'} = Σ_u π(x,u) P^u_{x,x'}
▶ policy improvement
    π(x, u) ← argmin_ν Σ_u ν(x,u) [ R_x^u + γ Σ_{x'} P^u_{xx'} V(x') ]

value iteration
    V(x) ← min_π Σ_u π(x,u) [ R_x^u + γ Σ_{x'} P^u_{xx'} V(x') ]
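For concreteness, here is a minimal tabular sketch of the two recapped steps (illustrative code, not from the lecture; it assumes the model is known, stored as transition probabilities P[u, x, x'] and stage costs R[x, u], with discount factor gamma):

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma):
    """Solve V = (I - gamma * P_pi)^(-1) R_pi for a stochastic policy pi[x, u]."""
    N = P.shape[1]
    P_pi = np.einsum('xu,uxy->xy', pi, P)   # P^pi_{x,x'} = sum_u pi(x,u) P^u_{x,x'}
    R_pi = np.sum(pi * R, axis=1)           # R^pi_x = sum_u pi(x,u) R_x^u
    return np.linalg.solve(np.eye(N) - gamma * P_pi, R_pi)

def policy_improvement(P, R, V, gamma):
    """Greedy improvement: put all mass on argmin_u [ R_x^u + gamma * sum_x' P^u_{xx'} V(x') ]."""
    Q = R + gamma * np.einsum('uxy,y->xu', P, V)
    pi = np.zeros_like(R)
    pi[np.arange(R.shape[0]), np.argmin(Q, axis=1)] = 1.0
    return pi
```

Alternating the two functions until the policy stops changing gives the model-based policy iteration recalled above.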
Can we find the optimal policy without model information?
Two different settings:
Q function
The Q function returns the expected future cost of
▶ taking action u at state x
▶ applying policy π at subsequent times
    Q^π(x,u) = R_x^u + γ Σ_{x'} P^u_{xx'} V^π(x')
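As a small illustration, this definition can be evaluated numerically in one line once V^π is available (assumed array shapes: P[u, x, x'], R[x, u], V_pi[x]):

```python
import numpy as np

def q_from_v(P, R, V_pi, gamma):
    # Q^pi[x, u] = R[x, u] + gamma * sum_x' P[u, x, x'] * V_pi[x']
    return R + gamma * np.einsum('uxy,y->xu', P, V_pi)
```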
Bellman equation
In value function V
    V^π(x) = Σ_u π(x,u) [ R_x^u + γ Σ_{x'} P^u_{xx'} V^π(x') ]

In Q function
    Q^π(x,u) = R_x^u + γ Σ_{x'} P^u_{xx'} Σ_{u'} π(x',u') Q^π(x',u')
(the inner sum Σ_{u'} π(x',u') Q^π(x',u') is exactly V^π(x'))
Bellman optimality principle
In value function V
    V^*(x) = min_u ( R_x^u + γ Σ_{x'} P^u_{xx'} V^*(x') )

In Q function
    Q^*(x,u) = R_x^u + γ Σ_{x'} P^u_{xx'} min_{u'} Q^*(x',u')
(the inner minimum min_{u'} Q^*(x',u') equals Σ_{u'} π^*(x',u') Q^*(x',u'))

that is
    Q^*(x,u) = R_x^u + γ Σ_{x'} P^u_{xx'} V^*(x')
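This form of the optimality principle suggests a fixed-point iteration carried out directly on Q. A hedged sketch, under the same assumed array shapes P[u, x, x'] and R[x, u] as before:

```python
import numpy as np

def q_value_iteration(P, R, gamma, tol=1e-8, max_iter=10_000):
    """Iterate Q(x,u) <- R_x^u + gamma * sum_x' P^u_{xx'} min_u' Q(x',u') to convergence."""
    Q = np.zeros(R.shape)
    for _ in range(max_iter):
        V = Q.min(axis=1)                                 # min_u' Q(x', u')
        Q_new = R + gamma * np.einsum('uxy,y->xu', P, V)
        if np.max(np.abs(Q_new - Q)) < tol:
            return Q_new
        Q = Q_new
    return Q
```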
Policy iteration
Remember:
    V ← (I − γ P^π)^{−1} R^π
Policy iteration with Q function
Remember:
Experimental policy evaluation
We can delegate the policy evaluation to experiments/simulations!
    Q^π(x_k, u_k) ≈ g_k,   where g_k is the return (the discounted cumulative cost) observed from step k onward in the episode
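A minimal sketch of this idea (illustrative, with assumptions: each episode is a list of (x_k, u_k, c_k) triples generated under π, g_k is the discounted cost accumulated from step k to the end of the episode, and every-visit averaging is used):

```python
from collections import defaultdict

def mc_q_evaluation(episodes, gamma):
    """Estimate Q^pi(x, u) as the average of the returns observed after visiting (x, u)."""
    returns = defaultdict(list)
    for episode in episodes:
        g = 0.0
        # Walk the episode backwards so that g is the return observed from step k onward.
        for x, u, c in reversed(episode):
            g = c + gamma * g
            returns[(x, u)].append(g)
    # Q^pi(x, u) ~ empirical average of the observed returns g_k
    return {xu: sum(gs) / len(gs) for xu, gs in returns.items()}
```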
A few notes
The estimate of Q does not satisfy the Bellman equation (except in the limit).
Why?
Exploration vs exploitation
The greedy policy improvement step can get you stuck in a suboptimal solution:
▶ you haven't learned enough, so you only have a rough approximation of Q^π
▶ your policy is optimized for this approximate Q^π and does not explore further
    π(x,u) ← e^{−βQ(x,u)} / Σ_u e^{−βQ(x,u)}
As the learning progresses from episode to episode, ϵ → 0 or β → ∞:
▶ explore less
▶ exploit more
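For concreteness, a hedged sketch of the two exploration mechanisms referred to above, ϵ-greedy action selection and the softmax (Boltzmann) policy π(x,u) ∝ e^{−βQ(x,u)}; the annealing schedules in the comments are illustrative choices, not prescribed by the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q_row, epsilon):
    """Q_row: 1-D array of Q(x, u) over actions u at the current state x (costs, so greedy = argmin)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))     # explore: uniformly random action
    return int(np.argmin(Q_row))                 # exploit: greedy action

def softmax_action(Q_row, beta):
    """Sample u with probability proportional to exp(-beta * Q(x, u))."""
    logits = -beta * (Q_row - np.min(Q_row))     # shift for numerical stability
    p = np.exp(logits) / np.sum(np.exp(logits))
    return int(rng.choice(len(Q_row), p=p))

# Illustrative schedules: explore less and exploit more as episodes progress,
# e.g. epsilon = 1 / (episode + 1) -> 0 and beta = beta0 * (episode + 1) -> infinity.
```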
Scalability and parametrization of Q
In theory, we can store Q for all state-action pairs, as in a gigantic look-up table.
However, this quickly becomes computationally infeasible when the state and action
spaces are large:
▶ n-dimensional state → exponential table size
▶ multiple-input systems → exponentially many action combinations
▶ continuous state and/or action space → quickly intractable
Parametrization of Q
Very often (almost always) the Q function is parametrized via a "low"-dimensional
vector of parameters:
    Q(x,u) = Q_θ(x,u),   θ ∈ R^d
Advantages
▶ Much smaller memory requirements (d instead of NM)
▶ Problem size is (apparently) independent of N → continuous state space
▶ A smart parametrization may allow us to "guess" the Q function at state-action
  pairs that have not been observed (interpolation/extrapolation)
▶ Prior information on the problem may suggest a smart parametrization (example?)
Disadvantages
▶ Reduced degrees of freedom
▶ The solution of the Bellman equation for a given policy π may not belong to
  Q = {Q_θ, θ ∈ R^d}
  Possible consequences for
  ▶ convergence of the policy iteration
  ▶ optimality of the limit
▶ Computing the θ that best approximates the data collected from an episode may
  be computationally expensive
Example: linear parametrization
" Disclaimer
More advanced parametrizations are often used (e.g., neural networks).
Nevertheless, understanding the simple case of linear parametrization gives you a solid
understanding of the core idea: the same can be applied to other parametrizations, typically
by relying on customized algorithms instead of these simple steps.
d
X
Qθ (x, u) = ϕℓ (x, u)θl = ϕ⊤ (x, u)θ
ℓ=1
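As an illustration of this linear parametrization, a small sketch with an assumed feature map ϕ (here, polynomial features of a scalar state with one block per discrete action; any other basis works the same way):

```python
import numpy as np

NUM_ACTIONS = 2      # assumed, for illustration
D_STATE = 3          # features per action: 1, x, x^2

def phi(x, u):
    """Feature vector phi(x, u) in R^d, with d = D_STATE * NUM_ACTIONS."""
    features = np.zeros(D_STATE * NUM_ACTIONS)
    features[u * D_STATE:(u + 1) * D_STATE] = [1.0, x, x ** 2]
    return features

def q_theta(x, u, theta):
    # Q_theta(x, u) = phi(x, u)^T theta
    return phi(x, u) @ theta
```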
Projected Bellman equation
Matrix form
    Q(x,u) = R_x^u + γ Σ_{x'} Σ_{u'} P^u_{xx'} π(x',u') Q(x',u')

The right-hand side defines the Bellman operator B(Q), which in matrix form reads

    B(Q) = R + γ T^π Q        (Q and R stacked as NM-dimensional vectors)
The solution of the projected Bellman equation Q_θ = Π(B(Q_θ)), where Π denotes the
ρ-weighted projection onto the span of the basis functions, is given by the solution of
    Φ^⊤ρΦ θ = γ Φ^⊤ρT^πΦ θ + Φ^⊤ρR
where Φ ∈ R^{NM×d} is the matrix of basis functions.
Proof:
As Q_θ = Φθ, we rewrite B(Q_θ) as R + γT^πΦθ.
To compute the projection, we look for the θ̂ that minimizes ∥Φθ̂ − R − γT^πΦθ∥²_ρ.
This can be computed in closed form by zeroing the gradient. Setting θ̂ = θ
(Bellman equation) returns directly
    Φ^⊤ρΦ θ = γ Φ^⊤ρT^πΦ θ + Φ^⊤ρR
or, equivalently,
    (Φ^⊤ρΦ − γ Φ^⊤ρT^πΦ) θ = Φ^⊤ρR
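A short numerical sketch of this closed-form solution (model-based and purely illustrative; Phi, rho, T_pi and Rvec are assumed to be given as arrays):

```python
import numpy as np

def projected_bellman_solution(Phi, rho, T_pi, Rvec, gamma):
    """Phi: (NM x d) basis matrix, rho: (NM,) weights, T_pi: (NM x NM), Rvec: (NM,) stage costs."""
    W = np.diag(rho)                                        # weighting matrix rho
    A = Phi.T @ W @ Phi - gamma * Phi.T @ W @ T_pi @ Phi
    b = Phi.T @ W @ Rvec
    return np.linalg.solve(A, b)                            # theta; the approximate Q^pi is Phi @ theta
```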
Model free solution
    (Φ^⊤ρΦ − γ Φ^⊤ρT^πΦ) θ = Φ^⊤ρR

Each of these matrices can be estimated from data, using the observed feature vectors ϕ(x_k, u_k).
Key idea of the proof
Consider the first matrix, Φ^⊤ρΦ. Let's prove that you can construct it as the
empirical average of the terms ϕ(x_k, u_k)ϕ(x_k, u_k)^⊤.
For simplicity, let's enumerate the state-action pairs from s = 1 to S.
There is no need for them to be finite; it's just for convenience.
Let’s then consider a single data point (state-action pair) s.
The corresponding term is
    ϕ(s)ϕ(s)^⊤ =  [ ϕ_1(s)ϕ_1(s)   ⋯   ϕ_1(s)ϕ_d(s) ]
                  [      ⋮          ⋱        ⋮       ]
                  [ ϕ_d(s)ϕ_1(s)   ⋯   ϕ_d(s)ϕ_d(s) ]

When many samples are considered in an empirical average, the generic element
in position (i, j) of this matrix will be

    E[ ϕ(s)ϕ(s)^⊤ ]_{ij} = Σ_{s=1}^{S} ρ(s) ϕ_i(s) ϕ_j(s)

where 0 < ρ(s) < 1 is the frequency of the state-action pair s in the dataset.
Let us instead consider the matrix

    Φ^⊤ρΦ = [ ϕ_1  ⋯  ϕ_d ]^⊤ diag(ρ(1), …, ρ(S)) [ ϕ_1  ⋯  ϕ_d ]

where ϕ_ℓ ∈ R^S is the ℓ-th column of Φ (the ℓ-th basis function evaluated at all
state-action pairs): its element in position (i, j) is Σ_{s=1}^{S} ρ(s) ϕ_i(s) ϕ_j(s),
exactly the empirical average computed above.
A similar reasoning can be done for the other two matrices, Φ^⊤ρT^πΦ and Φ^⊤ρR.
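Putting the pieces together, the three matrices can be replaced by empirical averages over samples (x_k, u_k, c_k, x_{k+1}, u_{k+1}) collected under π, which yields a model-free estimate of θ. A hedged, LSTD-style sketch (the feature map ϕ is assumed to be given):

```python
import numpy as np

def lstd_q(samples, phi, gamma, d):
    """samples: list of (x, u, c, x_next, u_next) tuples collected under the policy pi."""
    A = np.zeros((d, d))
    b = np.zeros(d)
    for x, u, c, x_next, u_next in samples:
        f, f_next = phi(x, u), phi(x_next, u_next)
        A += np.outer(f, f - gamma * f_next)   # empirical counterpart of Phi^T rho (Phi - gamma T^pi Phi)
        b += c * f                             # empirical counterpart of Phi^T rho R
    A /= len(samples)
    b /= len(samples)
    return np.linalg.solve(A, b)               # theta such that Q_theta = Phi theta approximates Q^pi
```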
Key points
The control engineer flowchart
This work is licensed under a
Creative Commons Attribution-ShareAlike 4.0 International License
https://bsaver.io/COCO