MORL (Multiple Objective Reinforcement Learning)
Computational Intelligence
Vrije Universiteit Amsterdam
July 7, 2018
http://roijers.info/motutorial.html
Multi-Objective Motivation
Problem Taxonomy
Solution Concepts
[Figure: motivating example with a village and a mine]
V : Π → R
V^π = E^π[ Σ_t r_t ]
π* = arg max_π V^π
V : Π → R^n
V_w^π = f(V^π, w)
Linear case:
V_w^π = Σ_{i=1}^n w_i V_i^π = w · V^π
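In the linear case the scalarized value is just a dot product. A minimal sketch; the value vector and weights below are made-up numbers:

```python
import numpy as np

# Linear scalarization: V_w^pi = w . V^pi
V_pi = np.array([3.0, 1.5])   # hypothetical 2-objective value of a policy
w = np.array([0.7, 0.3])      # hypothetical preference weights (sum to 1)
V_w = float(w @ V_pi)         # scalarized value
print(V_w)                    # 0.7*3.0 + 0.3*1.5 = 2.55
```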
[Diagram: the learning setting — a learning algorithm interacts with the environment, observing rewards, states, and weights, and produces a single solution]
Multi-Objective Motivation
Problem Taxonomy
Solution Concepts
Stochastic policy: π : S × A → [0, 1]
Deterministic policy: π : S → A
V^π = E[R_0 | π]
V^π(s) = E[R_t | π, s_t = s]
Theorem
For any additive infinite-horizon single-objective MDP, there exists a
deterministic stationary optimal policy [Howard 1960]
R : S × A × S → R^n
V^π = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | π ]
V^π(s) = E[ Σ_{k=0}^∞ γ^k r_{t+k+1} | π, s_t = s ]
Multi-Objective Motivation
Problem Taxonomy
Solution Concepts
Utility-based approach:
I Execution phase: select one policy maximizing scalar utility V_w^π, where w may be hidden or implicit
I Planning phase: find a set of policies containing an optimal solution for each possible w; if w is unknown, the size of this set is generally > 1
I Deduce optimal solution set from three factors:
1 Multi-objective scenario
2 Properties of scalarization function
3 Allowable policies
1 Multi-objective scenario
I Known weights → single policy
I Unknown weights or decision support → multiple policies
3 Allowable policies
I Deterministic
I Stochastic
Multi-Objective Motivation
Problem Taxonomy
Solution Concepts
utility = #cans × ppc/(ppc + ppb) + #bottles × ppb/(ppc + ppb)
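With the prices known, this is just a linear scalarization with fixed weights; a quick numeric check with made-up prices:

```python
# Hypothetical prices: ppc = price per can, ppb = price per bottle.
ppc, ppb = 0.10, 0.25
cans, bottles = 12, 4
w = (ppc / (ppc + ppb), ppb / (ppc + ppb))   # known weights derived from the prices
utility = cans * w[0] + bottles * w[1]
print(round(utility, 3))                     # 12*0.286 + 4*0.714 ≈ 6.286
```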
V_w^π = w · V^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ]
Note: only cell in taxonomy that does not require multi-objective methods
Definition
The undominated set U(Π) is the subset of all possible policies Π for which there exists a w for which the scalarized value is maximal:
U(Π) = {π : π ∈ Π ∧ ∃w ∀(π′ ∈ Π) V_w^π ≥ V_w^{π′}}
Definition
A coverage set CS(Π) is a subset of U(Π) that, for every w, contains a
policy with maximal scalarized value, i.e.,
CS(Π) ⊆ U(Π) ∧ (∀w)(∃π) π ∈ CS(Π) ∧ ∀(π′ ∈ Π) V_w^π ≥ V_w^{π′}
Definition
The convex hull CH(Π) is the subset of Π for which there exists a w that maximizes the linearly scalarized value:
CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π′ ∈ Π) w · V^π ≥ w · V^{π′}}
Definition
The convex coverage set CCS(Π) is a subset of CH(Π) that, for every w,
contains a policy whose linearly scalarized value is maximal, i.e.,
CCS(Π) ⊆ CH(Π) ∧ (∀w)(∃π) π ∈ CCS(Π) ∧ ∀(π′ ∈ Π) w · V^π ≥ w · V^{π′}
[Plots: value vectors in objective space and their scalarized values Vw as a function of w1]
Vw = w0·V0 + w1·V1, where w0 = 1 − w1
Definition
A scalarization function is strictly monotonically increasing if changing a policy such that its value increases in one or more objectives, without decreasing in any other objectives, also increases the scalarized value:
(∀i V_i^π ≥ V_i^{π′} ∧ ∃i V_i^π > V_i^{π′}) ⇒ (∀w V_w^π > V_w^{π′})
Definition
A policy π Pareto-dominates another policy π 0 when its value is at least as
high in all objectives and strictly higher in at least one objective:
V^π ≻_P V^{π′} ⇔ ∀i V_i^π ≥ V_i^{π′} ∧ ∃i V_i^π > V_i^{π′}
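A direct translation of this dominance test into code; a minimal sketch:

```python
import numpy as np

def pareto_dominates(v, u):
    """v Pareto-dominates u: at least as good in every objective
    and strictly better in at least one."""
    v, u = np.asarray(v), np.asarray(u)
    return bool(np.all(v >= u) and np.any(v > u))

# (3, 2) dominates (2, 2); (3, 0) and (0, 3) are incomparable.
assert pareto_dominates((3, 2), (2, 2))
assert not pareto_dominates((3, 0), (0, 3)) and not pareto_dominates((0, 3), (3, 0))
```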
[Plots: four example sets of value vectors in objective space, V1 vs V2]
Definition
The Pareto front is the set of all policies that are not Pareto-dominated:
PF(Π) = {π : π ∈ Π ∧ ¬∃(π′ ∈ Π) V^{π′} ≻_P V^π}
Definition
A Pareto coverage set is a subset of PF(Π) such that, for every π′ ∈ Π, it contains a policy that either dominates π′ or has equal value to π′:
PCS(Π) ⊆ PF(Π) ∧ ∀(π′ ∈ Π)(∃π) π ∈ PCS(Π) ∧ (V^π ≻_P V^{π′} ∨ V^π = V^{π′})
[Plots: value vectors in objective space (V0 vs V1) and their scalarized values Vw as a function of w1]
Applications
For known w
V_w^π = w · E[ Σ_{k=0}^∞ γ^k r_{t+k+1} ] = E[ Σ_{k=0}^∞ γ^k (w · r_{t+k+1}) ].
R_w = w · R
Apply standard VI (value iteration)
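A minimal sketch of this recipe for a tabular MOMDP, assuming transition and reward arrays T (S×A×S) and R (S×A×S×n); these names and shapes are assumptions, not part of the slides:

```python
import numpy as np

def scalarized_value_iteration(T, R, w, gamma=0.95, eps=1e-6):
    """Known-w case: collapse the vector reward with w up front, then run
    ordinary value iteration on the resulting single-objective MDP.
    T: transition probabilities, shape (S, A, S).
    R: vector-valued rewards, shape (S, A, S, n)."""
    R_w = R @ w                                   # R_w(s, a, s') = w . R(s, a, s')
    V = np.zeros(T.shape[0])
    while True:
        # Q(s, a) = sum_{s'} T(s, a, s') * (R_w(s, a, s') + gamma * V(s'))
        Q = np.einsum('sat,sat->sa', T, R_w + gamma * V[None, None, :])
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < eps:
            return V_new, Q.argmax(axis=1)        # scalar values and a greedy policy
        V = V_new
```

Because w is folded into the reward up front, everything after the first line is ordinary single-objective value iteration.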
Inner loop
I Adapting operators of single objective method (e.g., value iteration)
I Series of multi-objective operations (e.g. Bellman backups)
Outer loop
I Single objective method as subroutine
I Series of single-objective problems
At backup:
I generate all value vectors for s, a-pair
I prune away those that are not optimal for any w
where u + V = {u + v : v ∈ V }, and
U ⊕ V = {u + v : u ∈ U ∧ v ∈ V }
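The two set operators used in the multi-objective backups, sketched with value vectors represented as tuples:

```python
def translate(u, V):
    """u + V = {u + v : v in V}"""
    return {tuple(ui + vi for ui, vi in zip(u, v)) for v in V}

def cross_sum(U, V):
    """U (+) V = {u + v : u in U and v in V}"""
    return {tuple(ui + vi for ui, vi in zip(u, v)) for u in U for v in V}

# Example from the slides: (2, 0) + {(1.5, 0), (0, 1.5)} = {(3.5, 0), (2, 1.5)}
print(translate((2.0, 0.0), {(1.5, 0.0), (0.0, 1.5)}))
```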
Extremely simple MOMDP:
1 state: s; 2 actions: a1 and a2
Deterministic transitions
Deterministic rewards:
R(s, a1, s) → (2, 0)
R(s, a2, s) → (0, 2)
γ = 0.5
[Plot: scalarized value Vw as a function of w1]
Iteration 1:
V0(s) = {(0, 0)}
Q1(s, a1) = {(2, 0)}
Q1(s, a2) = {(0, 2)}
V1(s) = CPrune(∪_a Q1(s, a)) = {(2, 0), (0, 2)}
[Plot: scalarized values Vw over w1 for the vectors in V1(s)]
Iteration 2:
V1(s) = {(2, 0), (0, 2)}
Q2(s, a1) = {(3, 0), (2, 1)}
Q2(s, a2) = {(1, 2), (0, 3)}
V2(s) = CPrune({(3, 0), (2, 1), (1, 2), (0, 3)}) = {(3, 0), (0, 3)}
[Plot: scalarized values Vw over w1; (2, 1) and (1, 2) are pruned]
Iteration 3:
V2(s) = {(3, 0), (0, 3)}
Q3(s, a1) = {(3.5, 0), (2, 1.5)}
Q3(s, a2) = {(1.5, 2), (0, 3.5)}
V3(s) = CPrune({(3.5, 0), (2, 1.5), (1.5, 2), (0, 3.5)}) = {(3.5, 0), (0, 3.5)}
[Plot: scalarized values Vw over w1 for the vectors in V3(s)]
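The whole walk-through fits in a few lines. A sketch under two assumptions: the backup for this one-state, deterministic example is Q_{k+1}(s, a) = R(s, a, s) + γ·V_k(s), and CPrune is approximated by checking maximality on a dense grid of weights (the exact operator uses a linear program per vector):

```python
import numpy as np

def c_prune_2d(vectors, n_weights=1001):
    """Approximate CPrune for two objectives: keep a vector only if it is
    maximal for some linear weight w = (1 - w1, w1) on a dense grid."""
    V = np.array(sorted(set(map(tuple, vectors))))
    w1 = np.linspace(0.0, 1.0, n_weights)
    W = np.stack([1.0 - w1, w1], axis=1)          # (n_weights, 2)
    best = (W @ V.T).argmax(axis=1)               # a maximizing vector per weight
    return {tuple(map(float, V[i])) for i in np.unique(best)}

def chvi_single_state(rewards, gamma=0.5, iterations=3):
    """Convex-hull value iteration on the one-state example:
    Q_{k+1}(s, a) = R(s, a, s) + gamma * V_k(s), then prune the union."""
    V = {(0.0, 0.0)}
    for _ in range(iterations):
        Q_union = set()
        for r in rewards:                         # one backup per action
            Q_union |= {tuple(ri + gamma * vi for ri, vi in zip(r, v)) for v in V}
        V = c_prune_2d(Q_union)
    return V

# Rewards of the two actions in the example MOMDP.
print(chvi_single_state([(2.0, 0.0), (0.0, 2.0)]))
```

Running it reproduces V3(s) = {(3.5, 0), (0, 3.5)} from the slides.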
[Plots: outer-loop approach — scalarized value uw as a function of w1 for value vectors (1, 8), (7, 2), and (5, 6); a corner weight wc and the maximal possible improvement Δ are marked]
ε-approximate solver
Produces an ε-CCS
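A simplified, linear-support-style sketch of such an outer-loop planner for two objectives. `solve_for_w` stands in for any single-objective solver and is an assumed name; real OLS additionally prioritizes corner weights by their maximal possible improvement Δ, which is what yields the ε-CCS guarantee with an ε-approximate solver:

```python
import numpy as np

def outer_loop_2d(solve_for_w, max_iter=100, eps=1e-9):
    """Repeatedly solve scalarized problems at candidate corner weights and
    collect the value vectors of the returned policies (a partial CCS)."""
    def scal(w1, v):
        return (1.0 - w1) * v[0] + w1 * v[1]

    S = []                              # value vectors found so far
    corners = [0.0, 1.0]                # corner weights still to examine
    for _ in range(max_iter):
        if not corners:
            break                       # no corner can improve: S covers all w
        w1 = corners.pop()
        v = tuple(solve_for_w(np.array([1.0 - w1, w1])))
        if scal(w1, v) <= max((scal(w1, u) for u in S), default=-np.inf) + eps:
            continue                    # no improvement at this corner weight
        for u in S:                     # new corners: intersections of the new
            denom = (u[1] - u[0]) - (v[1] - v[0])   # value line with known ones
            if abs(denom) > eps:
                w_star = (v[0] - u[0]) / denom
                if eps < w_star < 1.0 - eps:
                    corners.append(w_star)
        S.append(v)
    return S

# Toy "solver": picks the best of three hidden value vectors for a given w.
candidates = [(1.0, 8.0), (7.0, 2.0), (5.0, 6.0)]
print(outer_loop_2d(lambda w: max(candidates, key=lambda v: float(w @ np.asarray(v)))))
```

On the toy solver the loop recovers all three value vectors: (1, 8) and (7, 2) at the extreme weights, and (5, 6) at their corner weight w1 = 0.5.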
Comparing Inner and Outer Loop
Similar to CHVI
where u + V = {u + v : v ∈ V },
U ⊕ V = {u + v : u ∈ U ∧ v ∈ V }
Extremely simple MOMDP:
1 state: s; 2 actions: a1 and a2
Deterministic rewards:
R(s, a1, s) → (2, 0)
R(s, a2, s) → (0, 2)
γ = 0.5
V0(s) = {(0, 0)}
[Plot: value vectors in objective space, V0 vs V1]
Iteration 1:
V0(s) = {(0, 0)}
Q1(s, a1) = {(2, 0)}
Q1(s, a2) = {(0, 2)}
V1(s) = PPrune(∪_a Q1(s, a)) = {(2, 0), (0, 2)}
[Plot: value vectors in objective space, V0 vs V1]
Iteration 2:
V1(s) = {(2, 0), (0, 2)}
Q2(s, a1) = {(3, 0), (2, 1)}
Q2(s, a2) = {(1, 2), (0, 3)}
V2(s) = PPrune({(3, 0), (2, 1), (1, 2), (0, 3)}) = {(3, 0), (2, 1), (1, 2), (0, 3)} (none are Pareto-dominated)
[Plot: value vectors in objective space, V0 vs V1]
Iteration 3:
V2(s) = {(3, 0), (2, 1), (1, 2), (0, 3)}
Q3(s, a1) = {(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5)}
Q3(s, a2) = {(1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)}
V3(s) = PPrune({(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5), (1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)})
[Plot: value vectors in objective space, V0 vs V1]
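The same backup with Pareto pruning instead of convex pruning; a sketch that reproduces the growing sets above (a1/a2 rewards and γ as in the example):

```python
def pareto_dominates(v, u):
    """v Pareto-dominates u: at least as good everywhere, strictly better somewhere."""
    return all(vi >= ui for vi, ui in zip(v, u)) and any(vi > ui for vi, ui in zip(v, u))

def p_prune(vectors):
    """PPrune: keep one copy of every vector that no other vector Pareto-dominates."""
    vectors = set(map(tuple, vectors))
    return {v for v in vectors if not any(pareto_dominates(u, v) for u in vectors)}

def pareto_vi_single_state(rewards, gamma=0.5, iterations=3):
    """Pareto value iteration on the one-state example: same backup as CHVI,
    but only Pareto-dominated vectors are pruned."""
    V = {(0.0, 0.0)}
    for _ in range(iterations):
        Q_union = set()
        for r in rewards:
            Q_union |= {tuple(ri + gamma * vi for ri, vi in zip(r, v)) for v in V}
        V = p_prune(Q_union)
    return V

print(sorted(pareto_vi_single_state([(2.0, 0.0), (0.0, 2.0)])))
```

Unlike CPrune, PPrune keeps every undominated vector, so the set doubles each iteration here: 2, 4, then 8 vectors.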
Inner Loop: Pareto-Q
No longer deterministic
Applications
[Taxonomy diagram: learning algorithm, single solution]
Definition
A multi-objective multi-armed bandit (MOMAB) (Drugan & Nowé, 2013) is a tuple ⟨A, P⟩ where
A is a finite set of actions or arms, and
P is a set of probability density functions, Pa(r) : R^d → [0, 1], over vector-valued rewards r of length d, associated with each arm a ∈ A.
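A minimal sketch of such a bandit; the Gaussian reward model, class name, and parameters are assumptions made only for illustration:

```python
import numpy as np

class GaussianMOMAB:
    """A MOMAB <A, P>: a fixed set of arms, each with a d-dimensional Gaussian
    reward distribution (the definition only requires some density P_a over R^d)."""
    def __init__(self, means, cov, seed=0):
        self.means = np.asarray(means)          # shape (|A|, d): mean reward per arm
        self.cov = np.asarray(cov)              # shared (d, d) covariance
        self.rng = np.random.default_rng(seed)

    @property
    def arms(self):
        return range(len(self.means))

    def pull(self, a):
        """Draw a vector-valued reward r ~ P_a."""
        return self.rng.multivariate_normal(self.means[a], self.cov)

# Two arms, two objectives: one arm favours objective 1, the other objective 2.
bandit = GaussianMOMAB(means=[[1.0, 0.0], [0.0, 1.0]], cov=np.eye(2) * 0.1)
print(bandit.pull(0))
```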
[Plot: posterior density p over an arm's mean reward mu(a) after 1, 3, and 10 pulls]
Multi-Objective Challenges
No maximizing action?
I Model the scalarization (/utility) function explicitly
I Learn about this function through user interaction
Action selection
I Sample vector-valued means from the multivariate posterior mean-reward distributions
I Sample a utility function from the posterior over utility functions
I Scalarize the sampled reward vectors with the sampled utility function
I Take the maximizing action
When to query the user for preferences?
I Sample vector-valued means and a utility function again
I Query when the resulting maximizing action disagrees with the one from the first set of samples (see the sketch below)
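A sketch of one step of this interactive Thompson-sampling-style scheme. The Gaussian arm posteriors, the Dirichlet posterior over linear utility weights, and all names are modelling assumptions made to keep the example short:

```python
import numpy as np

rng = np.random.default_rng(0)

class GaussianArm:
    """Independent Gaussian posterior over one arm's mean reward vector
    (unit observation noise, standard normal prior -- an assumption)."""
    def __init__(self, n_objectives):
        self.n, self.sum = 0, np.zeros(n_objectives)

    def sample_mean(self):
        post_var = 1.0 / (1.0 + self.n)
        return rng.normal(self.sum * post_var, np.sqrt(post_var))

    def update(self, r):
        self.n += 1
        self.sum += r

def sample_utility_weights(alpha):
    """Posterior over linear utility functions, modelled here as a Dirichlet
    over the weight simplex (one possible choice, not the only one)."""
    return rng.dirichlet(alpha)

def its_step(arms, alpha):
    """Sample means and a utility function, act greedily on the scalarized
    samples, and query the user only when a second, independent sample
    disagrees about which arm is best."""
    def greedy():
        w = sample_utility_weights(alpha)
        means = np.array([a.sample_mean() for a in arms])
        return int(np.argmax(means @ w))
    choice = greedy()
    query_user = (greedy() != choice)     # disagreement -> ask for a preference
    return choice, query_user
```

After pulling the chosen arm, `update` folds the observed reward vector into that arm's posterior; answers to user queries would in turn update the utility posterior `alpha` (not shown).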
Applications
Finite-horizon MOMDPs
Deterministic policies
Partial observability (MOPOMDP)
Finite-state controllers
Evolutionary method
Promising applications
AAMAS
ICML
I Wenlong Lyu, Fan Yang, Changhao Yan, Dian Zhou, Xuan Zeng — Batch Bayesian Optimization via Multi-objective Acquisition Ensemble for Automated Analog Circuit Design
I Eugenio Bargiacchi, Timothy Verstraeten, Diederik M. Roijers, Ann Nowé, Hado van Hasselt — Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems
ALA workshop