MORL (Multiple Objective Reinforcement Learning)

This document provides an overview of a tutorial on multi-objective planning and learning given by Shimon Whiteson and Diederik Roijers. The tutorial covers motivation and concepts in the morning sessions, including different scenarios that require multi-objective models like medical treatment problems with multiple health outcomes. It then covers methods for multi-objective decision making in the afternoon sessions. The slides and tutorial are based on previous work surveying multi-objective sequential decision making problems and approaches.


Multi-Objective Planning & Learning

Shimon Whiteson & Diederik M. Roijers

Department of Computer Science


University of Oxford

Computational Intelligence
Vrije Universiteit Amsterdam

July 7, 2018

Whiteson & Roijers Multi-Objective Planning July 7, 2018 1 / 112


Schedule

08:30-09:15: Motivation & Concepts (Shimon)

09:15-09:30: Short Break

09:30-10:15: Motivation & Concepts cont’d (Shimon)

10:15-10:45: Coffee Break

10:45-11:30: Methods (Diederik)

11:30-11:45: Short Break

11:45-12:30: Methods & Applications (Diederik)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 2 / 112


Note

Get the latest version of the slides at:

http://roijers.info/motutorial.html

This tutorial is based on our survey article:


Diederik Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A
Survey of Multi-Objective Sequential Decision-Making. Journal of Artificial
Intelligence Research, 48:67–113, 2013.

and Diederik’s dissertation:

Diederik Roijers. Multi-Objective Decision-Theoretic Planning. University of
Amsterdam, 2016. http://roijers.info/pub/thesis.pdf

Whiteson & Roijers Multi-Objective Planning July 7, 2018 3 / 112


Part 1: Motivation & Concepts

Multi-Objective Motivation

MDPs & MOMDPs

Problem Taxonomy

Solution Concepts

Whiteson & Roijers Multi-Objective Planning July 7, 2018 4 / 112


Medical Treatment

Chance of being cured, having side effects, or dying

Whiteson & Roijers Multi-Objective Planning July 7, 2018 5 / 112


Traffic Coordination

Latency, throughput, fairness, environmental impact, etc.

Whiteson & Roijers Multi-Objective Planning July 7, 2018 6 / 112


Mining Commodities

Gold collected, silver collected

(figure: a village and a mine)

[Roijers et al. 2013, 2014]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 7 / 112


Grid World
Getting the treasure, minimising fuel costs

Whiteson & Roijers Multi-Objective Planning July 7, 2018 8 / 112


Do We Need Multi-Objective Models?

Whiteson & Roijers Multi-Objective Planning July 7, 2018 9 / 112


Do We Need Multi-Objective Models?

Sutton’s Reward Hypothesis: “All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).”

Source: http://rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html

Whiteson & Roijers Multi-Objective Planning July 7, 2018 9 / 112


Do We Need Multi-Objective Models?

Sutton’s Reward Hypothesis: “All of what we mean by goals and purposes can be well thought of as maximization of the expected value of the cumulative sum of a received scalar signal (reward).”

Source: http://rlai.cs.ualberta.ca/RLAI/rewardhypothesis.html

V : Π → R

V π = Eπ [ Σt rt ]

π ∗ = arg maxπ V π

Whiteson & Roijers Multi-Objective Planning July 7, 2018 9 / 112


Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!

V : Π → Rn

Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112


Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!

V : Π → Rn

Objection: why not just scalarize?

Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112


Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!

V : Π → Rn

Objection: why not just scalarize?

Scalarization function projects multi-objective value to a scalar:

Vwπ = f (Vπ , w)

Linear case:

Vwπ = Σi wi Viπ = w · Vπ

A priori prioritization of the objectives

Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112


Why Multi-Objective Decision Making?
The weak argument: real-world problems are multi-objective!

V : Π → Rn

Objection: why not just scalarize?

Scalarization function projects multi-objective value to a scalar:

Vwπ = f (Vπ , w)

Linear case:

Vwπ = Σi wi Viπ = w · Vπ

A priori prioritization of the objectives


The weak argument is necessary but not sufficient

Whiteson & Roijers Multi-Objective Planning July 7, 2018 10 / 112


Why Multi-Objective Decision Making?

The strong argument: a priori scalarization is sometimes impossible, infeasible, or undesirable

Instead produce the coverage set of undominated solutions

Yields three scenarios for planning or off-line RL

Yields two scenarios for on-line RL

Whiteson & Roijers Multi-Objective Planning July 7, 2018 11 / 112


Unknown-Weights Planning Scenario

Weights known in execution phase but not in planning phase

Example: mining commodities [Roijers et al. 2013]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 12 / 112


Decision-Support Planning Scenario

Quantifying priorities is infeasible

Choosing between options is easier

Example: medical treatment

Whiteson & Roijers Multi-Objective Planning July 7, 2018 13 / 112


Known Weights Planning Scenario

Scalarization yields intractable problem

Whiteson & Roijers Multi-Objective Planning July 7, 2018 14 / 112


Reinforcement Learning Scenarios

Same scenarios apply for offline RL

For example, unknown-weights scenario becomes:

(diagram: in the learning phase a learning algorithm computes a coverage set from a dataset; in the selection phase, scalarization with the revealed weights selects a policy; in the execution phase the selected policy is executed)

For online RL there are two more scenarios

Whiteson & Roijers Multi-Objective Planning July 7, 2018 15 / 112


Dynamic-Weights Online RL Scenario

Scalarization changes over time, e.g., market prices

Caching policies for different prices speeds adaptation


[Natarajan & Tadepalli, 2005]

(diagram: in the learning and execution phase, the learning algorithm continually interacts with the environment, receiving rewards, states, and weights)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 16 / 112


Interactive Decision-Support Online RL Scenario

Scalarization initially unknown

Learned via user interaction

Learning from environment and user simultaneously

(diagram: the learning algorithm interacts with both the environment and the user during the learning and execution phase, then hands over a single solution for the execution-only phase)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 17 / 112


Summary of Motivation

Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori.

Whiteson & Roijers Multi-Objective Planning July 7, 2018 18 / 112


Summary of Motivation

Multi-objective methods are useful because many problems are naturally characterized by multiple objectives and cannot be easily scalarized a priori.

The burden of proof rests with the a priori scalarization, not with the multi-objective modeling.

Whiteson & Roijers Multi-Objective Planning July 7, 2018 18 / 112


Part 1: Motivation & Concepts

Multi-Objective Motivation

MDPs & MOMDPs

Problem Taxonomy

Solution Concepts

Whiteson & Roijers Multi-Objective Planning July 7, 2018 19 / 112


Markov Decision Process (MDP)
A single-objective MDP is a tuple hS, A, T , R, µ, γi where:
I S is a finite set of states
I A is a finite set of actions
I T : S × A × S → [0, 1] is a transition function
I R : S × A × S → R is a reward function
I µ : S → [0, 1] is a probability distribution over initial states
I γ ∈ [0, 1) is a discount factor

(figure from Poole & Mackworth, Artificial Intelligence: Foundations of Computational Agents, 2010)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 20 / 112
Returns & Policies
Goal: maximize expected return, which is typically additive:

Rt = Σ∞k=0 γ k rt+k+1

A stationary policy conditions only on the current state:

π : S × A → [0, 1]

A deterministic stationary policy maps states directly to actions:

π:S →A

Whiteson & Roijers Multi-Objective Planning July 7, 2018 21 / 112


Value Functions in MDPs

A state-independent value function V π specifies the expected return when following π from the initial state:

V π = E [R0 | π] (1)

A state value function of a policy π:

V π (s) = E [Rt | π, st = s]

The Bellman equation restates this expectation recursively for stationary policies:

V π (s) = Σa π(s, a) Σs′ T (s, a, s′) [R(s, a, s′) + γV π (s′)]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 22 / 112
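
As an illustration of the Bellman equation above, a minimal policy-evaluation sketch in Python; the two-state MDP, its transition and reward tables, and the policy are made-up illustrative values, not taken from the tutorial.

```python
# Minimal sketch: iterative policy evaluation for a tiny made-up MDP.
# V^pi(s) = sum_a pi(s,a) sum_s' T(s,a,s') [R(s,a,s') + gamma * V^pi(s')]

states, actions, gamma = [0, 1], [0, 1], 0.9

# T[s][a][s'] and R[s][a][s'] are illustrative values (assumptions).
T = {0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}},
     1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}}
R = {0: {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}},
     1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}}
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0, 1: 0.0}}  # stochastic stationary policy

V = {s: 0.0 for s in states}
for _ in range(1000):  # repeated sweeps until (approximate) convergence
    V = {s: sum(pi[s][a] * sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                               for s2 in states)
                for a in actions)
         for s in states}

print(V)  # expected discounted return of pi from each state
```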


Optimality in MDPs

Theorem
For any additive infinite-horizon single-objective MDP, there exists a
deterministic stationary optimal policy [Howard 1960]

All optimal policies share the same optimal value function:

V ∗ (s) = maxπ V π (s)

V ∗ (s) = maxa Σs′ T (s, a, s′) [R(s, a, s′) + γV ∗ (s′)]

Extract the optimal policy using local action selection:

π ∗ (s) = arg maxa Σs′ T (s, a, s′) [R(s, a, s′) + γV ∗ (s′)]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 23 / 112


Multi-Objective MDP (MOMDP)
Vector-valued reward and value:

R : S × A × S → Rn

Vπ = E [ Σ∞k=0 γ k rk+1 | π]

Vπ (s) = E [ Σ∞k=0 γ k rt+k+1 | π, st = s]

Vπ (s) imposes only a partial ordering, e.g.,

Viπ (s) > Viπ′ (s) but Vjπ (s) < Vjπ′ (s).

Definition of optimality no longer clear

Whiteson & Roijers Multi-Objective Planning July 7, 2018 24 / 112


Part 1: Motivation & Concepts

Multi-Objective Motivation

MDPs & MOMDPs

Problem Taxonomy

Solution Concepts

Whiteson & Roijers Multi-Objective Planning July 7, 2018 25 / 112


Axiomatic vs. Utility-Based Approach

Axiomatic approach: define optimal solution set to be Pareto front

Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112


Axiomatic vs. Utility-Based Approach

Axiomatic approach: define optimal solution set to be Pareto front

Utility-based approach:
I Execution phase: select one policy maximizing scalar utility Vwπ ,
where w may be hidden or implicit

Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112


Axiomatic vs. Utility-Based Approach

Axiomatic approach: define optimal solution set to be Pareto front

Utility-based approach:
I Execution phase: select one policy maximizing scalar utility Vwπ ,
where w may be hidden or implicit
I Planning phase: find set of policies containing optimal solution
for each possible w; if w unknown, size of set generally > 1

Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112


Axiomatic vs. Utility-Based Approach

Axiomatic approach: define optimal solution set to be Pareto front

Utility-based approach:
I Execution phase: select one policy maximizing scalar utility Vwπ ,
where w may be hidden or implicit
I Planning phase: find set of policies containing optimal solution
for each possible w; if w unknown, size of set generally > 1
I Deduce optimal solution set from three factors:
1 Multi-objective scenario
2 Properties of scalarization function
3 Allowable policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 26 / 112


Three Factors

1 Multi-objective scenario
I Known weights → single policy
I Unknown weights or decision support → multiple policies

2 Properties of scalarization function


I Linear
I Monotonically increasing

3 Allowable policies
I Deterministic
I Stochastic

Whiteson & Roijers Multi-Objective Planning July 7, 2018 27 / 112


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 28 / 112


Part 1: Motivation & Concepts

Multi-Objective Motivation

MDPs & MOMDPs

Problem Taxonomy

Solution Concepts

Whiteson & Roijers Multi-Objective Planning July 7, 2018 29 / 112


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 30 / 112


Linear Scalarization Functions
Computes inner product of w and Vπ :

Vwπ = Σi wi Viπ = w · Vπ , w ∈ Rn

wi quantifies importance of i-th objective


Simple and intuitive, e.g., when utility translates to money:

revenue = #cans × ppc + #bottles × ppb

Whiteson & Roijers Multi-Objective Planning July 7, 2018 31 / 112


Linear Scalarization Functions
Computes inner product of w and Vπ :

Vwπ = Σi wi Viπ = w · Vπ , w ∈ Rn

wi quantifies importance of i-th objective


Simple and intuitive, e.g., when utility translates to money:

revenue = #cans × ppc + #bottles × ppb

Vwπ typically constrained to be a convex combination:

∀i wi ≥ 0, Σi wi = 1

utility = #cans × ppc/(ppc + ppb) + #bottles × ppb/(ppc + ppb)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 31 / 112


Linear Scalarization & Single Policy

No special methods required: just apply f to each reward vector

Inner product distributes over addition yielding a normal MDP:

Vwπ = w · Vπ = w · E [ Σ∞k=0 γ k rt+k+1 ] = E [ Σ∞k=0 γ k (w · rt+k+1 )]

Apply standard methods to an MDP with:

Rw (s, a, s′) = w · R(s, a, s′), (2)

yielding a single deterministic stationary policy

Whiteson & Roijers Multi-Objective Planning July 7, 2018 32 / 112
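
A minimal sketch of the scalarize-then-solve idea above: with a linear scalarization, the vector reward can be collapsed to Rw = w · R up front and any standard single-objective solver applied afterwards. The weight vector and reward numbers are illustrative assumptions, not from the slides.

```python
import numpy as np

# Linear scalarization: V_w^pi = w . V^pi, so scalarizing rewards up front
# (R_w = w . R) and solving the resulting MDP yields the same optimal policy.
w = np.array([0.7, 0.3])                      # a priori weights (assumption)

def scalarize_reward(r_vec, w):
    """Collapse a vector reward R(s,a,s') in R^n to the scalar w . R(s,a,s')."""
    return float(np.dot(w, r_vec))

# Example: the bottles-and-cans reward of one transition (illustrative numbers).
r_vec = np.array([3.0, 1.0])                  # (#cans, #bottles) collected
print(scalarize_reward(r_vec, w))             # 0.7*3 + 0.3*1 = 2.4
```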


Problem Taxonomy
Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Example: collecting bottles and cans

Whiteson & Roijers Multi-Objective Planning July 7, 2018 33 / 112


Problem Taxonomy
Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Example: collecting bottles and cans

Note: only cell in taxonomy that does not require multi-objective methods

Whiteson & Roijers Multi-Objective Planning July 7, 2018 33 / 112


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 34 / 112


Multiple Policies

Unknown weights or decision support → multiple policies

During planning w is unknown

Size of solution set is generally > 1

Set should not contain policies suboptimal for all w

Whiteson & Roijers Multi-Objective Planning July 7, 2018 35 / 112


Undominated & Coverage Sets

Definition
The undominated set U(Π) is the subset of all possible policies Π for
which there exists a w for which the scalarized value is maximal:

U(Π) = {π : π ∈ Π ∧ ∃w ∀(π′ ∈ Π) Vwπ ≥ Vwπ′ }

Whiteson & Roijers Multi-Objective Planning July 7, 2018 36 / 112


Undominated & Coverage Sets

Definition
The undominated set U(Π) is the subset of all possible policies Π for
which there exists a w for which the scalarized value is maximal:

U(Π) = {π : π ∈ Π ∧ ∃w ∀(π′ ∈ Π) Vwπ ≥ Vwπ′ }

Definition
A coverage set CS(Π) is a subset of U(Π) that, for every w, contains a
policy with maximal scalarized value, i.e.,

CS(Π) ⊆ U(Π) ∧ (∀w)(∃π) [ π ∈ CS(Π) ∧ ∀(π′ ∈ Π) Vwπ ≥ Vwπ′ ]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 36 / 112


Example

Vwπ w = true w = false


π = π1 5 0
π = π2 0 5
π = π3 5 2
π = π4 2 2

One binary weight feature: only two possible weights

Weights are not objectives but two possible scalarizations

Whiteson & Roijers Multi-Objective Planning July 7, 2018 37 / 112


Example

Vwπ w = true w = false


π = π1 5 0
π = π2 0 5
π = π3 5 2
π = π4 2 2

One binary weight feature: only two possible weights

Weights are not objectives but two possible scalarizations

U(Π) = {π1 , π2 , π3 } but CS(Π) = {π1 , π2 } or {π2 , π3 }

Whiteson & Roijers Multi-Objective Planning July 7, 2018 37 / 112
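
A small sketch that reproduces the example above in code: for each of the two possible weights it finds the policies with maximal scalarized value, giving the undominated set and one possible coverage set.

```python
# Scalarized values from the example table: V_w^pi for w in {true, false}.
V = {"pi1": {"true": 5, "false": 0},
     "pi2": {"true": 0, "false": 5},
     "pi3": {"true": 5, "false": 2},
     "pi4": {"true": 2, "false": 2}}
weights = ["true", "false"]

best_per_w = {w: max(V[p][w] for p in V) for w in weights}

# U(Pi): policies that are maximal for at least one w.
U = {p for p in V if any(V[p][w] == best_per_w[w] for w in weights)}

# One coverage set: pick a single maximal policy per w.
CS = {max(V, key=lambda p: V[p][w]) for w in weights}

print(U)   # {'pi1', 'pi2', 'pi3'} (set order may vary)
print(CS)  # e.g. {'pi1', 'pi2'}; {'pi3', 'pi2'} is equally valid
```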


Execution Phase

Single policy selected from CS(Π) and executed

Unknown weights: weights revealed and maximizing policy selected:

π ∗ = arg maxπ∈CS(Π) Vwπ

Decision support: CS(Π) is manually inspected by the user

Whiteson & Roijers Multi-Objective Planning July 7, 2018 38 / 112


Linear Scalarization & Multiple Policies

Definition
The convex hull CH(Π) is the subset of Π for which there exists a w that
maximizes the linearly scalarized value:

CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π′ ∈ Π) w · Vπ ≥ w · Vπ′ }

Whiteson & Roijers Multi-Objective Planning July 7, 2018 39 / 112


Linear Scalarization & Multiple Policies

Definition
The convex hull CH(Π) is the subset of Π for which there exists a w that
maximizes the linearly scalarized value:

CH(Π) = {π : π ∈ Π ∧ ∃w ∀(π′ ∈ Π) w · Vπ ≥ w · Vπ′ }

Definition
The convex coverage set CCS(Π) is a subset of CH(Π) that, for every w,
contains a policy whose linearly scalarized value is maximal, i.e.,

CCS(Π) ⊆ CH(Π) ∧ (∀w)(∃π) [ π ∈ CCS(Π) ∧ ∀(π′ ∈ Π) w · Vπ ≥ w · Vπ′ ]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 39 / 112


Visualization
(figure: the same set of value vectors shown in objective space, V1 against V0, and in weight space, scalarized value Vw as a function of w1)

Vw = w0 V0 + w1 V1 , w0 = 1 − w1

Whiteson & Roijers Multi-Objective Planning July 7, 2018 40 / 112


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Example: mining gold and silver

Whiteson & Roijers Multi-Objective Planning July 7, 2018 41 / 112


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 42 / 112


Monotonically Increasing Scalarization Functions
Mining example: Vπ1 = (3, 0), Vπ2 = (0, 3), Vπ3 = (1, 1)
Choosing Vπ3 implies a nonlinear scalarization function

Whiteson & Roijers Multi-Objective Planning July 7, 2018 43 / 112


Monotonically Increasing Scalarization Functions

Definition
A scalarization function is strictly monotonically increasing if changing a
policy such that its value increases in one or more objectives, without
decreasing in any other objectives, also increases the scalarized value:
(∀i Viπ ≥ Viπ′ ∧ ∃i Viπ > Viπ′ ) ⇒ (∀w Vwπ > Vwπ′ )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 44 / 112


Monotonically Increasing Scalarization Functions

Definition
A scalarization function is strictly monotonically increasing if changing a
policy such that its value increases in one or more objectives, without
decreasing in any other objectives, also increases the scalarized value:
(∀i Viπ ≥ Viπ′ ∧ ∃i Viπ > Viπ′ ) ⇒ (∀w Vwπ > Vwπ′ )

Definition
A policy π Pareto-dominates another policy π′ when its value is at least as
high in all objectives and strictly higher in at least one objective:

Vπ ≻P Vπ′ ⇔ ∀i Viπ ≥ Viπ′ ∧ ∃i Viπ > Viπ′

A policy is Pareto optimal if no policy Pareto-dominates it.

Whiteson & Roijers Multi-Objective Planning July 7, 2018 44 / 112
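
The Pareto-dominance test translates directly into code; a minimal sketch (the example vectors are the γ = 0.5 values from White's example, which appears a few slides later):

```python
def pareto_dominates(v, v_other):
    """True iff value vector v Pareto-dominates v_other:
    at least as good in every objective and strictly better in at least one."""
    at_least_as_good = all(a >= b for a, b in zip(v, v_other))
    strictly_better = any(a > b for a, b in zip(v, v_other))
    return at_least_as_good and strictly_better

# White's example values for gamma = 0.5 (see the upcoming slides):
print(pareto_dominates((4.0, 2.0), (2.0, 2.0)))  # True: pi_ns dominates pi_3
print(pareto_dominates((6.0, 0.0), (0.0, 6.0)))  # False: incomparable vectors
```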


Nonlinear Scalarization Can Destroy Additivity

Nonlinear scalarization and expectation do not commute:

Vwπ = f (Vπ , w) = f (E [ Σ∞k=0 γ k rt+k+1 ], w) ≠ E [ Σ∞k=0 γ k f (rt+k+1 , w)]

Bellman-based methods not applicable

Local action selection no longer yields an optimal policy:

π ∗ (s) ≠ arg maxa Σs′ T (s, a, s′) [R(s, a, s′) + γV ∗ (s′)]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 45 / 112


Deterministic vs. Stochastic Policies

Stochastic policies are fine in most settings

Sometimes inappropriate, e.g., medical treatment

In MDPs, requiring deterministic policies is not restrictive

Optimal value attainable with deterministic stationary policy:

π ∗ (s) = arg maxa Σs′ T (s, a, s′) [R(s, a, s′) + γV ∗ (s′)]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 46 / 112


Deterministic vs. Stochastic Policies

Stochastic policies are fine in most settings

Sometimes inappropriate, e.g., medical treatment

In MDPs, requiring deterministic policies is not restrictive

Optimal value attainable with deterministic stationary policy:

π ∗ (s) = arg maxa Σs′ T (s, a, s′) [R(s, a, s′) + γV ∗ (s′)]

Similar for MOMDPs with linear scalarization


MOMDPs with nonlinear scalarization:
I Stochastic policies may be preferable if allowed
I Nonstationary policies may be preferable otherwise

Whiteson & Roijers Multi-Objective Planning July 7, 2018 46 / 112


White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112


White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
3 deterministic stationary policies, all Pareto-optimal:
Vπ1 = ( 3/(1−γ), 0 ), Vπ2 = ( 0, 3/(1−γ) ), Vπ3 = ( 1/(1−γ), 1/(1−γ) )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112


White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
3 deterministic stationary policies, all Pareto-optimal:
Vπ1 = ( 3/(1−γ), 0 ), Vπ2 = ( 0, 3/(1−γ) ), Vπ3 = ( 1/(1−γ), 1/(1−γ) )
πns alternates between a1 and a2 , starting with a1 :
Vπns = ( 3/(1 − γ2 ), 3γ/(1 − γ2 ) )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112


White’s Example (1982)
3 actions: R(a1 ) = (3, 0), R(a2 ) = (0, 3), R(a3 ) = (1, 1)
3 deterministic stationary policies, all Pareto-optimal:
Vπ1 = ( 3/(1−γ), 0 ), Vπ2 = ( 0, 3/(1−γ) ), Vπ3 = ( 1/(1−γ), 1/(1−γ) )
πns alternates between a1 and a2 , starting with a1 :
Vπns = ( 3/(1 − γ2 ), 3γ/(1 − γ2 ) )
Thus πns ≻P π3 when γ ≥ 0.5, e.g., γ = 0.5 and f (Vπ ) = V1π V2π :
V π1 = V π2 = 0, V π3 = 4, V πns = 8

(figure: the attainable value vectors (V1, V2) in White's example for γ = 0.3, 0.5, 0.7, and 0.95)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 47 / 112
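
A quick numeric check of the claims above, assuming γ = 0.5 and the product utility f(Vπ) = V1π V2π from the slide:

```python
gamma = 0.5

# Discounted value vectors of the stationary policies and of pi_ns (alternate a1/a2).
V_pi1 = (3 / (1 - gamma), 0.0)                                  # (6, 0)
V_pi2 = (0.0, 3 / (1 - gamma))                                  # (0, 6)
V_pi3 = (1 / (1 - gamma), 1 / (1 - gamma))                      # (2, 2)
V_pins = (3 / (1 - gamma**2), 3 * gamma / (1 - gamma**2))       # (4, 2)

f = lambda v: v[0] * v[1]   # nonlinear (product) scalarization from the slide
print([f(v) for v in (V_pi1, V_pi2, V_pi3, V_pins)])  # [0.0, 0.0, 4.0, 8.0]
```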


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Example: radiation vs. chemotherapy

Whiteson & Roijers Multi-Objective Planning July 7, 2018 48 / 112


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 49 / 112


Mixture Policies

A mixture policy πm selects the i-th policy from a set of N deterministic policies with probability pi , where Σi pi = 1

Values are convex combination of values of constituent policies

In White’s example, replace πns by πm :

Vπm = p1 Vπ1 + (1 − p1 )Vπ2 = ( 3p1 /(1 − γ), 3(1 − p1 )/(1 − γ) )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 50 / 112
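
A minimal sketch showing how the mixture value interpolates between the constituent policies' values (White's example, γ = 0.5):

```python
gamma = 0.5
V_pi1 = (3 / (1 - gamma), 0.0)   # (6, 0)
V_pi2 = (0.0, 3 / (1 - gamma))   # (0, 6)

def mixture_value(p1, V1, V2):
    """Expected value of the mixture policy that runs pi1 w.p. p1 and pi2 otherwise."""
    return tuple(p1 * a + (1 - p1) * b for a, b in zip(V1, V2))

for p1 in (0.0, 0.25, 0.5, 1.0):
    print(p1, mixture_value(p1, V_pi1, V_pi2))
# 0.0 (0.0, 6.0), 0.25 (1.5, 4.5), 0.5 (3.0, 3.0), 1.0 (6.0, 0.0)
```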


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Example: studying vs. networking

Whiteson & Roijers Multi-Objective Planning July 7, 2018 51 / 112


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 52 / 112


Pareto Sets

Definition
The Pareto front is the set of all policies that are not Pareto dominated:

PF (Π) = {π : π ∈ Π ∧ ¬∃(π′ ∈ Π), Vπ′ ≻P Vπ }

Whiteson & Roijers Multi-Objective Planning July 7, 2018 53 / 112


Pareto Sets

Definition
The Pareto front is the set of all policies that are not Pareto dominated:

PF (Π) = {π : π ∈ Π ∧ ¬∃(π′ ∈ Π), Vπ′ ≻P Vπ }

Definition
A Pareto coverage set is a subset of PF (Π) such that, for every π′ ∈ Π, it
contains a policy that either dominates π′ or has equal value to π′ :

PCS(Π) ⊆ PF (Π) ∧ ∀(π′ ∈ Π)(∃π) [ π ∈ PCS(Π) ∧ (Vπ ≻P Vπ′ ∨ Vπ = Vπ′ )]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 53 / 112


Visualization

(figure: value vectors in objective space, V1 against V0, and scalarized value Vw as a function of w1 in weight space)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 54 / 112


Visualization

(figure: value vectors in objective space, V1 against V0, and scalarized value Vw as a function of w1 in weight space)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 55 / 112


Problem Taxonomy
Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Example: radiation vs. chemotherapy (again)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 56 / 112


Problem Taxonomy
Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Example: radiation vs. chemotherapy (again)

Note: the only setting that requires a Pareto front!

Whiteson & Roijers Multi-Objective Planning July 7, 2018 56 / 112


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 57 / 112


Mixture Policies
A CCS(ΠDS ) is also a CCS(Π) but not necessarily a PCS(Π)
But a PCS(Π) can be made by mixing policies in a CCS(ΠDS )

(figure: value vectors in objective space, V1 against V0; mixtures of CCS policies span the line segments between their value vectors)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 58 / 112


Problem Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Example: studying vs. networking (again)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 59 / 112


Part 2: Methods and Applications

Convex Coverage Set Planning Methods


I Inner Loop: Convex Hull Value Iteration
I Outer Loop: Optimistic Linear Support

Pareto Coverage Set Planning Methods


I Inner loop (non-stationary): Pareto-Q
I Outer loop issues

MOPOMDP Convex Coverage Set Planning: OLSAR

Applications

Whiteson & Roijers Multi-Objective Planning July 7, 2018 60 / 112


Taxonomy
Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 61 / 112


Taxonomy
Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Known transition and reward functions → planning

Whiteson & Roijers Multi-Objective Planning July 7, 2018 61 / 112


Taxonomy
Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Known transition and reward functions → planning


Unknown transition and reward functions → learning

Whiteson & Roijers Multi-Objective Planning July 7, 2018 61 / 112


Background: Value Iteration

Initial value estimate V0 (s)

Apply Bellman backups until convergence:

Vk+1 (s) ← maxa Σs′ T (s, a, s′) [ R(s, a, s′) + γVk (s′) ]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 62 / 112


Background: Value Iteration

Initial value estimate V0 (s)

Apply Bellman backups until convergence:

Vk+1 (s) ← maxa Σs′ T (s, a, s′) [ R(s, a, s′) + γVk (s′) ]

Can also be written:

Vk+1 (s) ← maxa Qk+1 (s, a)

Qk+1 (s, a) ← Σs′ T (s, a, s′) [ R(s, a, s′) + γVk (s′) ]

Optimal policy is easy to retrieve from Q-table

Whiteson & Roijers Multi-Objective Planning July 7, 2018 62 / 112
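
A minimal value-iteration sketch matching the backups above; the two-state MDP is a made-up illustrative example, not from the slides.

```python
# Minimal value iteration sketch:
# V_{k+1}(s) <- max_a sum_s' T(s,a,s') [R(s,a,s') + gamma * V_k(s')]
states, actions, gamma = [0, 1], [0, 1], 0.9

# Illustrative transition and reward tables (assumptions, not from the slides).
T = {0: {0: {0: 0.8, 1: 0.2}, 1: {0: 0.1, 1: 0.9}},
     1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}}
R = {0: {0: {0: 1.0, 1: 0.0}, 1: {0: 0.0, 1: 2.0}},
     1: {0: {0: 0.5, 1: 0.5}, 1: {0: 0.0, 1: 1.0}}}

def q(V, s, a):
    """One-step lookahead value of taking a in s and following V afterwards."""
    return sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states)

V = {s: 0.0 for s in states}
for _ in range(1000):
    V = {s: max(q(V, s, a) for a in actions) for s in states}

# Retrieve the optimal policy greedily from the converged values (local action selection).
policy = {s: max(actions, key=lambda a: q(V, s, a)) for s in states}
print(V, policy)
```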


Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 63 / 112


Scalarize MOMDP + Value Iteration

For known w:

Vwπ = w · E [ Σ∞k=0 γ k rt+k+1 ] = E [ Σ∞k=0 γ k (w · rt+k+1 )].

Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112


Scalarize MOMDP + Value Iteration

For known w:

Vwπ = w · E [ Σ∞k=0 γ k rt+k+1 ] = E [ Σ∞k=0 γ k (w · rt+k+1 )].

Scalarize reward function of MOMDP

Rw = w · R

Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112


Scalarize MOMDP + Value Iteration

For known w:

Vwπ = w · E [ Σ∞k=0 γ k rt+k+1 ] = E [ Σ∞k=0 γ k (w · rt+k+1 )].

Scalarize reward function of MOMDP

Rw = w · R

Apply standard VI

Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112


Scalarize MOMDP + Value Iteration

For known w:

Vwπ = w · E [ Σ∞k=0 γ k rt+k+1 ] = E [ Σ∞k=0 γ k (w · rt+k+1 )].

Scalarize reward function of MOMDP

Rw = w · R

Apply standard VI

Does not return multi-objective value

Whiteson & Roijers Multi-Objective Planning July 7, 2018 64 / 112


Scalarized Value Iteration

Adapt Bellman backup:

w · Vk+1 (s) ← maxa w · Qk+1 (s, a)

Qk+1 (s, a) ← Σs′ T (s, a, s′) [ R(s, a, s′) + γVk (s′) ]

Returns multi-objective value.

Whiteson & Roijers Multi-Objective Planning July 7, 2018 65 / 112
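
A sketch of the adapted backup: Q-values are kept as vectors and the maximization is done on the scalarized values w · Q, so the multi-objective value of the chosen policy is available at the end. The single-state MOMDP is the one used in the CHVI example a few slides later; the weight vector is an illustrative assumption.

```python
import numpy as np

# Scalarized VI on the single-state MOMDP of the CHVI example:
# 1 state, 2 actions, R(s,a1)=(2,0), R(s,a2)=(0,2), gamma=0.5.
R = {0: np.array([2.0, 0.0]), 1: np.array([0.0, 2.0])}
gamma = 0.5
w = np.array([0.8, 0.2])          # known weights (illustrative assumption)

V = np.zeros(2)                   # vector-valued estimate for the single state
for _ in range(100):
    Q = {a: R[a] + gamma * V for a in R}          # vector Q-backup
    best = max(Q, key=lambda a: w @ Q[a])         # maximize the scalarized value
    V = Q[best]                                   # keep the *vector* of the maximizer

print(V, w @ V)   # converges to (4, 0): a1 is always optimal for w = (0.8, 0.2)
```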


Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 66 / 112


Inner versus Outer Loop

(diagram: a single-objective method; a multi-objective inner-loop method, which replaces the operators inside the solver; and a multi-objective outer-loop method, which calls the solver repeatedly)

Inner loop
I Adapting operators of single objective method (e.g., value iteration)
I Series of multi-objective operations (e.g. Bellman backups)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 67 / 112


Inner versus Outer Loop

(diagram: a single-objective method; a multi-objective inner-loop method, which replaces the operators inside the solver; and a multi-objective outer-loop method, which calls the solver repeatedly)

Inner loop
I Adapting operators of single objective method (e.g., value iteration)
I Series of multi-objective operations (e.g. Bellman backups)

Outer loop
I Single objective method as subroutine
I Series of single-objective problems

Whiteson & Roijers Multi-Objective Planning July 7, 2018 67 / 112


Inner Loop: Convex Hull Value Iteration

Barrett & Narayanan (2008)

Idea: do the backup for all w in parallel

New backup operators must handle sets of values.

At backup:
I generate all value vectors for s, a-pair
I prune away those that are not optimal for any w

Only need deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 68 / 112


Inner Loop: Convex Hull Value Iteration

Initial set of value vectors, e.g., V0 (s) = {(0, 0)}

All possible value vectors:

Qk+1 (s, a) ← ⊕s′ T (s, a, s′) [ R(s, a, s′) + γVk (s′) ]

where u + V = {u + v : v ∈ V }, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V }

Whiteson & Roijers Multi-Objective Planning July 7, 2018 69 / 112


Inner Loop: Convex Hull Value Iteration

Initial set of value vectors, e.g., V0 (s) = {(0, 0)}

All possible value vectors:

Qk+1 (s, a) ← ⊕s′ T (s, a, s′) [ R(s, a, s′) + γVk (s′) ]

where u + V = {u + v : v ∈ V }, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V }

Prune value vectors:

Vk+1 (s) ← CPrune ( ∪a Qk+1 (s, a) )

CPrune uses linear programs (e.g., Roijers et al. (2015))

Whiteson & Roijers Multi-Objective Planning July 7, 2018 69 / 112
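
A sketch of CPrune and of one CHVI backup on the example that follows. The LP keeps a vector only if some weight vector on the simplex makes it strictly better than every other candidate (ties are pruned here, a simplification of full CPrune); it uses scipy's linprog purely as an illustration, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def cprune(vectors, eps=1e-9):
    """Keep vectors that are strictly optimal for some w >= 0 with sum(w) = 1.
    (Vectors that only tie with others are pruned: a simplification.)"""
    vecs, kept = [np.asarray(v, dtype=float) for v in vectors], []
    d = len(vecs[0])
    for i, u in enumerate(vecs):
        others = [v for j, v in enumerate(vecs) if j != i]
        if not others:
            kept.append(u)
            continue
        # Variables (w_1..w_d, x). Maximize x subject to w.(v - u) + x <= 0 for
        # every other v, sum(w) = 1, w >= 0. u is kept iff the optimum x > 0.
        c = np.zeros(d + 1); c[-1] = -1.0                     # maximize x
        A_ub = np.hstack([np.array(others) - u, np.ones((len(others), 1))])
        b_ub = np.zeros(len(others))
        A_eq = np.array([[1.0] * d + [0.0]]); b_eq = [1.0]
        bounds = [(0, None)] * d + [(None, None)]
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds, method="highs")
        if res.success and -res.fun > eps:
            kept.append(u)
    return kept

# One CHVI backup step from the example: V_1(s) = {(2,0),(0,2)}, gamma = 0.5.
gamma, V1 = 0.5, [np.array([2.0, 0.0]), np.array([0.0, 2.0])]
Q2_a1 = [np.array([2.0, 0.0]) + gamma * v for v in V1]   # {(3,0), (2,1)}
Q2_a2 = [np.array([0.0, 2.0]) + gamma * v for v in V1]   # {(1,2), (0,3)}
print(cprune(Q2_a1 + Q2_a2))   # keeps (3,0) and (0,3); (2,1),(1,2) only tie at w=(0.5,0.5)
```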


CHVI Example

Extremely simple MOMDP:
1 state: s; 2 actions: a1 and a2
Deterministic transitions
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5
V0 (s) = {(0, 0)}

(figure: scalarized value Vw as a function of w1 )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 70 / 112


CHVI Example

Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)

γ = 0.5

Iteration 1:
V0 (s) = {(0, 0)}

Whiteson & Roijers Multi-Objective Planning July 7, 2018 71 / 112


CHVI Example

Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)

γ = 0.5

Iteration 1:
V0 (s) = {(0, 0)}
Q1 (s, a1 ) = {(2, 0)}
Q1 (s, a2 ) = {(0, 2)}

Whiteson & Roijers Multi-Objective Planning July 7, 2018 71 / 112


CHVI Example

Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5

Iteration 1:
V0 (s) = {(0, 0)}
Q1 (s, a1 ) = {(2, 0)}
Q1 (s, a2 ) = {(0, 2)}
V1 (s) = CPrune( ∪a Q1 (s, a)) = {(2, 0), (0, 2)}

(figure: scalarized value Vw as a function of w1 )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 71 / 112


CHVI Example

Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)

γ = 0.5

Iteration 2:
V1 (s) = {(2, 0), (0, 2)}

Whiteson & Roijers Multi-Objective Planning July 7, 2018 72 / 112


CHVI Example

Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)

γ = 0.5

Iteration 2:
V1 (s) = {(2, 0), (0, 2)}
Q2 (s, a1 ) = {(3, 0), (2, 1)}
Q2 (s, a2 ) = {(1, 2), (0, 3)}

Whiteson & Roijers Multi-Objective Planning July 7, 2018 72 / 112


CHVI Example

Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5

Iteration 2:
V1 (s) = {(2, 0), (0, 2)}
Q2 (s, a1 ) = {(3, 0), (2, 1)}
Q2 (s, a2 ) = {(1, 2), (0, 3)}
V2 (s) = CPrune({(3, 0), (2, 1), (1, 2), (0, 3)})

(figure: scalarized value Vw as a function of w1 )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 72 / 112


CHVI Example

Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5

Iteration 2:
V1 (s) = {(2, 0), (0, 2)}
Q2 (s, a1 ) = {(3, 0), (2, 1)}
Q2 (s, a2 ) = {(1, 2), (0, 3)}
V2 (s) = {(3, 0), (0, 3)}

(figure: scalarized value Vw as a function of w1 )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 73 / 112


CHVI Example

Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5

Iteration 3:
V2 (s) = {(3, 0), (0, 3)}
Q3 (s, a1 ) = {(3.5, 0), (2, 1.5)}
Q3 (s, a2 ) = {(1.5, 2), (0, 3.5)}
V3 (s) = CPrune({(3.5, 0), (2, 1.5), (1.5, 2), (0, 3.5)}) = {(3.5, 0), (0, 3.5)}

(figure: scalarized value Vw as a function of w1 )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 74 / 112


Convex Hull Value Iteration

CPrune retains at least one optimal vector for each w

Therefore, Vw that would have been computed by VI is kept

CHVI does not retain excess value vectors

Whiteson & Roijers Multi-Objective Planning July 7, 2018 75 / 112


Convex Hull Value Iteration

CPrune retains at least one optimal vector for each w

Therefore, Vw that would have been computed by VI is kept

CHVI does not retain excess value vectors

CHVI generates a lot of excess value vectors

Removal with linear programs (CPrune) is expensive

Whiteson & Roijers Multi-Objective Planning July 7, 2018 75 / 112


Outer Loop

(diagram: a single-objective method; a multi-objective inner-loop method, which replaces the operators inside the solver; and a multi-objective outer-loop method, which calls the solver repeatedly)

Repeatedly calls a single-objective solver


Generic multi-objective method
I multi-objective coordination graphs
I multi-objective (multi-agent) MDPs
I multi-objective partially observable MDPs

Whiteson & Roijers Multi-Objective Planning July 7, 2018 76 / 112


Outer Loop: Optimistic Linear Support

Optimistic linear support (OLS) adapts and improves linear support for POMDPs (Cheng (1988))

Solves scalarized instances for specific w

Whiteson & Roijers Multi-Objective Planning July 7, 2018 77 / 112


Outer Loop: Optimistic Linear Support

Optimistic linear support (OLS) adapts and improves linear support for POMDPs (Cheng (1988))

Solves scalarized instances for specific w

Terminates after checking only a finite number of weights

Returns exact CCS

Whiteson & Roijers Multi-Objective Planning July 7, 2018 77 / 112


Linear Support

(figure: scalarized value Vw as a function of w1 )
Whiteson & Roijers Multi-Objective Planning July 7, 2018 78 / 112
Linear Support

(figure: scalarized value Vw as a function of w1 )
Whiteson & Roijers Multi-Objective Planning July 7, 2018 79 / 112
Linear Support

(figure: scalarized value Vw as a function of w1 )
Whiteson & Roijers Multi-Objective Planning July 7, 2018 80 / 112
Linear Support

(figure: scalarized value Vw as a function of w1 )
Whiteson & Roijers Multi-Objective Planning July 7, 2018 81 / 112
Optimistic Linear Support
(figure: two weight-space plots of scalarized value uw over w1 for the value vectors (1,8), (7,2), and (5,6); the corner weight wc and the maximal possible improvement Δ at that corner are marked)

Priority queue, Q, for corner weights

Maximal possible improvement ∆ as priority

Stop when ∆ < ε

Whiteson & Roijers Multi-Objective Planning July 7, 2018 82 / 112
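
A compact linear-support-style skeleton for two objectives, the basis that OLS builds on: it keeps a queue of corner weights and repeatedly calls an assumed single-objective solver, solve_scalarized(w), which here is a stand-in that just maximizes over a hard-coded candidate set. Full OLS additionally orders the queue by the optimistic improvement Δ and stops when Δ < ε; both the solver and the candidate values below are illustrative assumptions.

```python
def solve_scalarized(w):
    # Stand-in solver: pretend the attainable value vectors are the three below
    # and return the one maximizing w . V. In practice this would be scalarized VI, etc.
    candidates = [(8.0, 0.0), (6.0, 5.0), (0.0, 9.0)]
    return max(candidates, key=lambda v: w[0] * v[0] + w[1] * v[1])

def corner_weight(u, v):
    """w1 at which the scalarized values of u and v intersect (2 objectives)."""
    denom = (u[0] - u[1]) - (v[0] - v[1])
    if abs(denom) < 1e-12:
        return None
    w1 = (v[1] - u[1]) / denom
    return w1 if 0.0 < w1 < 1.0 else None

def scal(v, w1):
    return w1 * v[0] + (1.0 - w1) * v[1]

ccs, queue = [], [0.0, 1.0]                 # start with the extremum weights
while queue:
    w1 = queue.pop()
    v = solve_scalarized((w1, 1.0 - w1))
    if ccs and max(scal(u, w1) for u in ccs) >= scal(v, w1) - 1e-9:
        continue                            # no improvement at this corner weight
    for u in ccs:                           # enqueue corner weights with the new vector
        wc = corner_weight(u, v)
        if wc is not None:
            queue.append(wc)
    ccs.append(v)

print(ccs)   # [(8.0, 0.0), (0.0, 9.0), (6.0, 5.0)]: the convex coverage set
```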


Optimistic Linear Support

Solving scalarized instance not always possible

ε-approximate solver

Produces an ε-CCS
Whiteson & Roijers Multi-Objective Planning July 7, 2018 83 / 112
Comparing Inner and Outer Loop

OLS (outer loop) advantages

I Any (cooperative) multi-objective decision problem

I Any single-objective / scalarized subroutine

I Inherits quality guarantees

I Faster for small and medium numbers of objectives

Inner loop faster for large numbers of objectives

Whiteson & Roijers Multi-Objective Planning July 7, 2018 84 / 112


Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 85 / 112


Inner Loop: Pareto-Q

Similar to CHVI

Different pruning operator

Pairwise comparisons: V(s) ≻P V′ (s)

Comparisons are cheaper, but there are many more vectors

Converges to correct Pareto coverage set (White (1982))

Executing a policy is no longer trivial (Van Moffaert & Nowé (2014))

Whiteson & Roijers Multi-Objective Planning July 7, 2018 86 / 112


Inner Loop: Pareto-Q

Compute all possible vectors:

Qk+1 (s, a) ← ⊕s′ T (s, a, s′) [ R(s, a, s′) + γVk (s′) ]

where u + V = {u + v : v ∈ V }, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V }

Whiteson & Roijers Multi-Objective Planning July 7, 2018 87 / 112


Inner Loop: Pareto-Q

Compute all possible vectors:

Qk+1 (s, a) ← ⊕s′ T (s, a, s′) [ R(s, a, s′) + γVk (s′) ]

where u + V = {u + v : v ∈ V }, and U ⊕ V = {u + v : u ∈ U ∧ v ∈ V }

Take the union across a

Prune Pareto-dominated vectors:

Vk+1 (s) ← PPrune ( ∪a Qk+1 (s, a) )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 87 / 112
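
A minimal PPrune sketch: keep only the vectors that are not Pareto-dominated by any other candidate (duplicates are kept once).

```python
def pareto_dominates(v, u):
    """v is at least as good everywhere and strictly better somewhere."""
    return all(a >= b for a, b in zip(v, u)) and any(a > b for a, b in zip(v, u))

def pprune(vectors):
    """Remove Pareto-dominated vectors (pairwise comparisons, no LPs needed)."""
    kept = []
    for v in vectors:
        if any(pareto_dominates(u, v) for u in vectors):
            continue                      # v is dominated by some other vector
        if v not in kept:                 # keep duplicates only once
            kept.append(v)
    return kept

# Second backup of the Pareto-Q example (see the following slides):
Q2 = [(3, 0), (2, 1), (1, 2), (0, 3)]
print(pprune(Q2))   # all four are Pareto-optimal, so nothing is pruned
```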


Pareto-Q Example

Extremely simple MOMDP:
1 state: s; 2 actions: a1 and a2
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5
V0 (s) = {(0, 0)}

(figure: value vectors in objective space, V1 against V0 )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 88 / 112


Pareto-Q Example

Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5

Iteration 1:
V0 (s) = {(0, 0)}
Q1 (s, a1 ) = {(2, 0)}
Q1 (s, a2 ) = {(0, 2)}
V1 (s) = PPrune( ∪a Q1 (s, a)) = {(2, 0), (0, 2)}

(figure: value vectors in objective space, V1 against V0 )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 89 / 112


Pareto-Q Example

Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5

Iteration 2:
V1 (s) = {(2, 0), (0, 2)}
Q2 (s, a1 ) = {(3, 0), (2, 1)}
Q2 (s, a2 ) = {(1, 2), (0, 3)}
V2 (s) = PPrune({(3, 0), (2, 1), (1, 2), (0, 3)})

(figure: value vectors in objective space, V1 against V0 )

Whiteson & Roijers Multi-Objective Planning July 7, 2018 90 / 112


Pareto-Q Example
Deterministic rewards:
R(s, a1 , s) → (2, 0)
R(s, a2 , s) → (0, 2)
γ = 0.5

Iteration 3:
V2 (s) = {(3, 0), (2, 1), (1, 2), (0, 3)}
Q3 (s, a1 ) = {(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5)}
Q3 (s, a2 ) = {(1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)}
V3 (s) = PPrune({(3.5, 0), (3, 0.5), (2.5, 1), (2, 1.5), (1.5, 2), (1, 2.5), (0.5, 3), (0, 3.5)})

(figure: value vectors in objective space, V1 against V0 )
Whiteson & Roijers Multi-Objective Planning July 7, 2018 91 / 112
Inner Loop: Pareto-Q

PCS size can explode

No longer deterministic

Cannot read policy from Q-table

Except for first action

Whiteson & Roijers Multi-Objective Planning July 7, 2018 92 / 112


Inner Loop: Pareto-Q

PCS size can explode

No longer deterministic

Cannot read policy from Q-table

Except for first action


“Track” a policy during execution (Van Moffaert & Nowé (2014))
I For deterministic transitions: s, a → s′
I From Qt=0 (s, a) subtract R(s, a)
I Correct for discount factor → Vt=1 (s′)
I Find Vt=1 (s′) in the Q-tables for s′

For stochastic transitions, see Kristof Van Moffaert’s PhD thesis

Whiteson & Roijers Multi-Objective Planning July 7, 2018 92 / 112
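
A sketch of the tracking procedure above for deterministic transitions, using the numbers from the Pareto-Q example; the variable names and the finite-lookahead Q-sets are illustrative assumptions.

```python
# Tracking a non-stationary Pareto-optimal policy through the Q-sets
# (deterministic transitions). Vectors come from the Pareto-Q example:
# 1 state, R(a1)=(2,0), R(a2)=(0,2), gamma=0.5.
gamma = 0.5
R = {"a1": (2.0, 0.0), "a2": (0.0, 2.0)}

# Q-sets at the current step and the value set of the next step (illustrative).
Q = {"a1": [(3.0, 0.0), (2.0, 1.0)], "a2": [(1.0, 2.0), (0.0, 3.0)]}
V_next = [(2.0, 0.0), (0.0, 2.0)]

target = (2.0, 1.0)                          # the value vector being followed
action = next(a for a, vecs in Q.items() if target in vecs)            # -> a1
remainder = tuple((t - r) / gamma for t, r in zip(target, R[action]))  # (0, 2)
print(action, remainder, remainder in V_next)  # a1 (0.0, 2.0) True: follow (0,2) next
```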


Outer Loop?

(diagram: a single-objective method; a multi-objective inner-loop method, which replaces the operators inside the solver; and a multi-objective outer-loop method, which calls the solver repeatedly)

Outer loop very difficult:

Vwπ = f (E [ Σ∞k=0 γ k rt+k+1 ], w) ≠ E [ Σ∞k=0 γ k f (rt+k+1 , w)]

Maximization does not do the trick!


Heuristic with non-linear f (Van Moffaert, Drugan, Nowé (2013))
Not guaranteed to find the optimal policy, or to converge
Whiteson & Roijers Multi-Objective Planning July 7, 2018 93 / 112
Taxonomy

Linear scalarization, single policy (known weights), deterministic or stochastic: one deterministic stationary policy
Linear scalarization, multiple policies (unknown weights or decision support), deterministic or stochastic: convex coverage set of deterministic stationary policies
Monotonically increasing scalarization, single policy, deterministic: one deterministic non-stationary policy
Monotonically increasing scalarization, single policy, stochastic: one mixture policy of two or more deterministic stationary policies
Monotonically increasing scalarization, multiple policies, deterministic: Pareto coverage set of deterministic non-stationary policies
Monotonically increasing scalarization, multiple policies, stochastic: convex coverage set of deterministic stationary policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 94 / 112


Part 2: Methods and Applications

Convex Coverage Set Planning Methods


I Inner Loop: Convex Hull Value Iteration
I Outer Loop: Optimistic Linear Support

Pareto Coverage Set Planning Methods


I Inner loop (non-stationary): Pareto-Q
I Outer loop issues

Interactive Online MORL: Interactive Thompson Sampling

Applications

Whiteson & Roijers Multi-Objective Planning July 7, 2018 95 / 112


Online Interactive Decision Support

(diagram: the learning algorithm interacts with both the environment and the user during the learning and execution phase, then hands over a single solution for the execution-only phase)

Simultaneous interaction with the environment and the decision maker

Whiteson & Roijers Multi-Objective Planning July 7, 2018 96 / 112


Multi-Objective Multi-Armed Bandits

Definition
A multi-objective multi-armed bandit (MOMAB) (Drugan & Nowé, 2013)
is a tuple hA, Pi where
A is a finite set of actions or arms, and
P is a set of probability density functions, Pa (r) : Rd → [0, 1] over
vector-valued rewards r of length d, associated with each arm a ∈ A.

Whiteson & Roijers Multi-Objective Planning July 7, 2018 97 / 112


Multi-Objective Multi-Armed Bandits

Definition
A multi-objective multi-armed bandit (MOMAB) (Drugan & Nowé, 2013)
is a tuple hA, Pi where
A is a finite set of actions or arms, and
P is a set of probability density functions, Pa (r) : Rd → [0, 1] over
vector-valued rewards r of length d, associated with each arm a ∈ A.

Can be seen as a single-state MOMDP

Whiteson & Roijers Multi-Objective Planning July 7, 2018 97 / 112


(Single-objective) Thompson Sampling
If we know the scalarization function: (SO) multi-armed bandit
Thompson sampling (Thompson, 1933) empirically best
I Basic idea: posterior distributions of mean reward for each arm: µai
I Draw a sample from the posterior mean-reward distribution for each ai
I Execute the action with the highest sample for its mean reward
(figure: posterior density p over the mean reward mu(a) of one arm, after 1, 3, and 10 pulls)
Whiteson & Roijers Multi-Objective Planning July 7, 2018 98 / 112
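
A minimal single-objective Thompson sampling sketch with Gaussian posteriors over each arm's mean (unit observation noise assumed for simplicity); the true means are made-up illustrative values.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.5, 1.0, 1.5]          # unknown to the agent (illustrative)
n_arms, horizon = len(true_means), 1000

# Gaussian posterior over each arm's mean, with a unit-variance likelihood.
post_mean = np.zeros(n_arms)
post_prec = np.ones(n_arms)           # prior precision 1 for every arm

for t in range(horizon):
    samples = rng.normal(post_mean, 1.0 / np.sqrt(post_prec))  # one sample per arm
    a = int(np.argmax(samples))                                # act greedily on the samples
    r = rng.normal(true_means[a], 1.0)                         # observe a reward
    # Conjugate Gaussian update of the chosen arm's posterior.
    post_mean[a] = (post_prec[a] * post_mean[a] + r) / (post_prec[a] + 1.0)
    post_prec[a] += 1.0

print(post_mean)   # concentrates near the true means, mostly pulling the best arm
```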
Multi-Objective Challenges

No maximising action?
I Model the scalarization (/utility) function explicitly
I Learn about this function through user interaction

Cannot access scalarization function directly


I Only pairwise preferences

Whiteson & Roijers Multi-Objective Planning July 7, 2018 99 / 112


Interactive Thompson Sampling

(diagram: ITS interacts with the MOMAB, choosing arms a(t) and observing reward vectors r(t), and with the user, proposing pairs of estimated mean vectors and observing preferences μx ≻ μy)

(Roijers, Zintgraf, & Nowé, 2017)

Whiteson & Roijers Multi-Objective Planning July 7, 2018 100 / 112


Interactive Thompson Sampling

(diagram: ITS interacts with the MOMAB, choosing arms a(t) and observing reward vectors r(t), and with the user, proposing pairs of estimated mean vectors and observing preferences μx ≻ μy)

(Roijers, Zintgraf, & Nowé, 2017)

Open question: when to ask the user for preferences

Whiteson & Roijers Multi-Objective Planning July 7, 2018 100 / 112


Interactive Thompson Sampling

Action selection
I Sample vector-valued means from multi-variate posterior mean reward
distributions.
I Sample utility function from the posterior over utility functions
I Scalarize reward vectors with the sampled utility function
I Take maximizing action
When to query the user for preferences?
I Sample vector-valued means and utility function again
I Query when the maximising actions disagree with the first set of
samples

Whiteson & Roijers Multi-Objective Planning July 7, 2018 101 / 112
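
A sketch of the ITS action-selection and query rule described above; sample_arm_means() and sample_utility_weights() are assumed stand-ins for the actual posterior samplers, and a linear utility function is assumed.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_arm_means():
    # Stand-in for sampling one mean-reward vector per arm from its posterior.
    return rng.normal([[1.0, 0.2], [0.4, 0.9], [0.7, 0.6]], 0.3)

def sample_utility_weights():
    # Stand-in for sampling a linear utility function from the posterior over w.
    return rng.dirichlet([2.0, 2.0])

def its_step():
    # Action selection: sample means and a utility function, act greedily.
    mu1, w1 = sample_arm_means(), sample_utility_weights()
    a1 = int(np.argmax(mu1 @ w1))
    # Query rule: draw a second set of samples; ask the user to compare the two
    # candidate mean vectors only if the greedy actions disagree.
    mu2, w2 = sample_arm_means(), sample_utility_weights()
    a2 = int(np.argmax(mu2 @ w2))
    query = a1 != a2
    return a1, query, (mu1[a1], mu2[a2]) if query else None

action, ask_user, comparison = its_step()
print(action, ask_user)
```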


Interactive Thompson Sampling: Some Results

Linear utility functions (Roijers, Zintgraf, & Nowé, 2017)

NB: cumulative regret (optimal scalarized reward minus actual scalarized reward) rather than cumulative reward

Whiteson & Roijers Multi-Objective Planning July 7, 2018 102 / 112


Conclusions Interactive Thompson Sampling

It is possible to learn about the user and the environment simultaneously

For linear scalarization function: hardly any additional regret

Non-linear scalarization function, Gaussian processes as model of utility function [Talk @ ALA]

Whiteson & Roijers Multi-Objective Planning July 7, 2018 103 / 112


Part 2: Methods and Applications

Convex Coverage Set Planning Methods


I Inner Loop: Convex Hull Value Iteration
I Outer Loop: Optimistic Linear Support

Pareto Coverage Set Planning Methods


I Inner loop (non-stationary): Pareto-Q
I Outer loop issues

Interactive Online MORL: Interactive Thompson Sampling

Applications

Whiteson & Roijers Multi-Objective Planning July 7, 2018 104 / 112


Treatment planning

Lizotte (2010, 2012)
I Maximizing effectiveness of the treatment
I Minimizing the severity of the side-effects

Finite-horizon MOMDPs

Deterministic policies

Whiteson & Roijers Multi-Objective Planning July 7, 2018 105 / 112


Epidemic control

Anthrax response (Soh & Demiris (2011))
I Minimizing loss of life
I Minimizing number of false alarms
I Minimizing cost of investigation

Partial observability
(MOPOMDP)

Finite-state controllers

Evolutionary method

Pareto coverage set

Whiteson & Roijers Multi-Objective Planning July 7, 2018 106 / 112


Semi-autonomous wheelchairs

Control system for wheelchairs (Soh & Demiris (2011))
I Maximizing safety
I Maximizing speed
I Minimizing power consumption

Partial observability
(MOPOMDP)

Finite-state controllers

Evolutionary method

Pareto coverage set

Whiteson & Roijers Multi-Objective Planning July 7, 2018 107 / 112


Broader Application

“Probabilistic Planning is Multi-objective” — Bryce et al. (2007)


I The expected return is not enough
I Cost of a plan
I Probability of success of a plan
I Non-goal terminal states

Whiteson & Roijers Multi-Objective Planning July 7, 2018 108 / 112


Broader Application

“Human-aligned artificial intelligence is a multiobjective problem” – Vamplew et al., 2018
I Philosophy journal (ethics)
I Decision problems have ethical implications
I Ethical decision-making always involves trade-offs
I To align this with people’s convictions and preferences is not trivial
I Multi-objective, whatever ethical framework you use

Whiteson & Roijers Multi-Objective Planning July 7, 2018 109 / 112


Broader Application

“Tim O’Reilly says the economy is running on the wrong algorithm” – Wired
I Companies typically only try to optimise profit
I This is bad, as consumers experience negative effects of this
I Consumers are customers
I Very bad as a long term strategy

Whiteson & Roijers Multi-Objective Planning July 7, 2018 110 / 112


Closing

Consider multiple objectives


I most problems have them
I a priori scalarization can be bad

Derive your solution set


I Pareto front often not necessary

Exciting growing field

Promising applications

Whiteson & Roijers Multi-Objective Planning July 7, 2018 111 / 112


At these conferences

AAMAS

I Luisa M. Zintgraf, Diederik M. Roijers, Sjoerd Linders, Catholijn M. Jonker, Ann Nowé — Ordered Preference Elicitation Strategies for Supporting Multi-Objective Decision Making

ICML

I Wenlong Lyu, Fan Yang, Changhao Yan, Dian Zhou, Xuan Zeng —
Batch Bayesian Optimization via Multi-objective Acquisition Ensemble
for Automated Analog Circuit Design
I Eugenio Bargiacchi, Timothy Verstraeten, Diederik M. Roijers, Ann Nowé, Hado van Hasselt — Learning to Coordinate with Coordination Graphs in Repeated Single-Stage Multi-Agent Decision Problems

Whiteson & Roijers Multi-Objective Planning July 7, 2018 112 / 112


At these conferences
IJCAI

I Chao Bian, Chao Qian, Ke Tang — A General Approach to Running Time Analysis of Multi-objective Evolutionary Algorithms
I Miguel Terra-Neves, Ines Lynce, Vasco Manquinho — Stratification for
Constraint-Based Multi-Objective Combinatorial Optimization

ALA workshop

I Diederik M. Roijers, Denis Steckelmacher, Ann Nowé — Multi-objective Reinforcement Learning for the Expected Utility of the Return
I Diederik M. Roijers, Luisa M. Zintgraf, Pieter Libin, Ann Nowé —
Interactive Multi-Objective Reinforcement Learning in Multi-Armed
Bandits for Any Utility Function

Whiteson & Roijers Multi-Objective Planning July 7, 2018 113 / 112
