
Eliciting User Preferences for Personalized

Multi-Objective Decision Making through


Comparative Feedback

Han Shao Lee Cohen Avrim Blum


TTIC TTIC TTIC
[email protected] [email protected] [email protected]

Yishay Mansour Aadirupa Saha Matthew R. Walter


Tel Aviv University and Google Research TTIC TTIC
[email protected] [email protected] [email protected]

Abstract
In this work, we propose a multi-objective decision making framework that accom-
modates different user preferences over objectives, where preferences are learned
via policy comparisons. Our model consists of a known Markov decision process
with a vector-valued reward function, with each user having an unknown prefer-
ence vector that expresses the relative importance of each objective. The goal is
to efficiently compute a near-optimal policy for a given user. We consider two
user feedback models. We first address the case where a user is provided with two
policies and returns their preferred policy as feedback. We then move to a different
user feedback model, where a user is instead provided with two small weighted
sets of representative trajectories and selects the preferred one. In both cases, we
suggest an algorithm that finds a nearly optimal policy for the user using a number
of comparison queries that scales quasilinearly in the number of objectives.

1 Introduction
Many real-world decision making problems involve optimizing over multiple objectives. For example,
when designing an investment portfolio, one’s investment strategy requires trading off maximizing
expected gain with minimizing risk. When using Google Maps for navigation, people are concerned
about various factors such as the worst and average estimated arrival time, traffic conditions, road sur-
face conditions (e.g., whether or not it is paved), and the scenery along the way. As the advancement
of technology gives rise to personalized machine learning (McAuley, 2022), in this paper, we design
efficient algorithms for personalized multi-objective decision making.
While prior works have concentrated on approximating the Pareto-optimal solution set1 (see Hayes
et al. (2022) and Roijers et al. (2013) for surveys), we aim to find the optimal personalized policy for a
user that reflects their unknown preferences over k objectives. Since the preferences are unknown, we
need to elicit users’ preferences by requesting feedback on selected policies. The problem of eliciting
preferences has been studied in Wang et al. (2022) using a strong query model that provides stochastic
feedback on the quality of a single policy. In contrast, our work focuses on a more natural and intuitive
query model, comparison queries, which query the user’s preference over two selected policies, e.g.,
‘do you prefer a policy which minimizes the average estimated arrival time or a policy which minimizes the
number of turns?’. Our goal is to find the optimal personalized policy using as few queries as possible.
1 The Pareto-optimal solution set contains an optimal personalized policy for every possible user preference.

37th Conference on Neural Information Processing Systems (NeurIPS 2023).


To the best of our knowledge, we are the first to provide algorithms with theoretical guarantees for
specific personalized multi-objective decision-making via policy comparisons.
Similar to prior works on multi-objective decision making, we model the problem using a finite-
horizon Markov decision process (MDP) with a k-dimensional reward vector, where each entry is a
non-negative scalar reward representative of one of the k objectives. To account for user preferences,
we assume that a user is characterized by a (hidden) k-dimensional preference vector with non-
negative entries, and that the personalized reward of the user for each state-action is the inner product
between this preference vector and the reward vector (for this state-action pair). We also distinguish
between the k-dimensional value of a policy, which is the expected cumulative reward when selecting
actions according to the policy, and the personalized value of a policy, which is the scalar expected
cumulative personalized reward when selecting actions according to this policy. The MDP is known
to the agent and the goal is to learn an optimal policy for the personalized reward function (henceforth,
the optimal personalized policy) of a user via policy comparative feedback.
Comparative feedback. If people could clearly define their preferences over objectives (e.g., “my
preference vector has 3 for the scenery objective, 2 for the traffic objective, and 1 for the road surface
objective”), the problem would be easy—one would simply use the personalized reward function
as a scalar reward function and solve for the corresponding policy. In particular, a similar problem
with noisy feedback regarding the value of a single multi-objective policy (as mapping from states
to actions) has been studied in Wang et al. (2022). As this type of fine-grained preference feedback
is difficult for users to define, especially in environments where sequential decisions are made, we
restrict the agent to rely solely on comparative feedback. Comparative feedback is widely used in
practice. For example, ChatGPT asks users to compare two responses to improve its performance.
This approach is more intuitive for users compared to asking for numerical scores of ChatGPT
responses.
Indeed, as knowledge of the user’s preference vector is sufficient to solve for their optimal person-
alized policy, the challenge is to learn a user’s preference vector using a minimal number of easily
interpretable queries. We therefore concentrate on comparison queries. The question is then what
exactly to compare? Comparing state-action pairs might not be a good option for the aforementioned
tasks—what is the meaning of two single steps in a route? Comparing single trajectories (e.g.,
routes in Google Maps) would not be ideal either. Consider for example two policies: one randomly
generates either (personalized) GOOD or (personalized) BAD trajectories while the other consistently
generates (personalized) MEDIOCRE trajectories. By solely comparing single trajectories without
considering sets of trajectories, we cannot discern the user’s preference regarding the two policies.
Interpretable policy representation. Since we are interested in learning preferences via policy
comparison queries, we also suggest an alternative, more interpretable representation of a policy.
Namely, we design an algorithm that given an explicit policy representation (as a mapping from states
to distributions over actions), returns a weighted set of trajectories of size at most k + 1, such that its
expected return is identical to the value of the policy.2 It immediately follows from our formalization
that for any user, the personalized return of the weighted trajectory set and the personalized value of
the policy are also identical.
In this work, we focus on answering two questions:
(1) How to find the optimal personalized policy by querying as few policy comparisons as possible?
(2) How can we find a more interpretable representation of policies efficiently?
Contributions. In Section 2, we formalize the problem of eliciting user preferences and finding
the optimal personalized policy via comparative feedback. As an alternative to an explicit policy
representation, we propose a weighted trajectory set as a more interpretable representation. In
Section 3, we provide an efficient algorithm for finding an approximate optimal personalized policy,
where the policies are given by their formal representations, thus answering (1). In Section 4,
we design two efficient algorithms that find the weighted trajectory set representation of a policy.
Combined with the algorithm in Section 3, we have an algorithm for finding an approximate optimal
personalized policy when policies are represented by weighted trajectory sets, thus answering (2).
Related Work. Multi-objective decision making has gained significant attention in recent years (see Roijers et al. (2013); Hayes et al. (2022) for surveys). Prior research has explored various approaches, such as assuming linear preferences or Bayesian settings, or finding an approximated Pareto frontier. However, incorporating comparison feedback (as was done for multi-armed bandits or active learning, e.g., Bengs et al. (2021) and Kane et al. (2017)) allows for a more comprehensive and systematic approach to handling different types of user preferences and provides (nearly) optimal personalized decision-making outcomes. We refer the reader to Appendix A for additional related work.
2 The return of a trajectory is the cumulative reward obtained in the trajectory. The expectation in the expected return of a weighted trajectory set is over the weights.

2 Problem Setup

Sequential decision model. We consider a Markov decision process (MDP) known3 to the agent
represented by a tuple ⟨S, A, s0 , P, R, H⟩, with finite state and action sets, S and A, respectively,
an initial state s0 ∈ S, and finite horizon H ∈ N. For example, in the Google Maps example a
state is an intersection and actions are turns. The transition function P : S × A 7→ SimplexS
maps state-action pairs into a state probability distribution. To model multiple objectives, the reward
function R : S × A 7→ [0, 1]k maps every state-action pair to a k-dimensional reward vector, where
each component corresponds to one of the k objectives (e.g., road surface condition, worst and
average estimated arrival time). The return of a trajectory τ = (s0 , a0 , . . . , sH−1 , aH−1 , sH ) is given
PH−1
by Φ(τ ) = t=0 R(st , at ).
A policy π is a mapping from states to a distribution over actions. We denote the set of policies by Π. The value of a policy π, denoted by V^π, is the expected cumulative reward obtained by executing the policy π starting from the initial state, s_0. Put differently, the value of π is V^π = V^π(s_0) = E_{S_0=s_0}[∑_{t=0}^{H−1} R(S_t, π(S_t))] ∈ [0, H]^k, where S_t is the random state at time step t when executing π, and the expectation is over the randomness of P and π. Note that V^π = E_τ[Φ(τ)], where τ = (s_0, π(s_0), S_1, . . . , π(S_{H−1}), S_H) is a random trajectory generated by executing π.
We assume the existence of a “do nothing” action a_0 ∈ A, available only from the initial state s_0, that has zero reward for each objective, R(s_0, a_0) = 0, and keeps the system in the initial state, i.e., P(s_0 | s_0, a_0) = 1 (e.g., this action corresponds to not commuting, or to refusing to play in a chess game).4 We also define the (deterministic) “do nothing” policy π_0 that always selects action a_0 and has a value of V^{π_0} = 0. From a mathematical perspective, the assumption of “do nothing” ensures that 0 belongs to the value vector space {V^π | π ∈ Π}, which is precisely what we need.
4 We remark that the “do nothing” action requires domain-specific knowledge; for example, in the Google Maps example “do nothing” is to stay put, and in the ChatGPT example “do nothing” is to answer nothing.
Since the rewards are bounded in [0, 1], we have that ∥V^π∥_2 ≤ √k H for every policy π. For convenience, we denote C_V = √k H. We denote by d ≤ k the rank of the space spanned by all the value vectors obtained by Π, i.e., d := rank(span({V^π | π ∈ Π})).
Linear preferences. To incorporate personalized preferences over objectives, we assume each user
is characterized by an unknown k-dimensional preference vector w∗ ∈ Rk+ with a bounded norm
1 ≤ ∥w∗ ∥2 ≤ Cw for some (unknown) Cw ≥ 1. We avoid assuming that Cw = 1 to accommodate
for general linear rewards. Note that the magnitude of w∗ does not change the personalized optimal
policy but affects the “indistinguishability” in the feedback model. By not normalizing ∥w∗ ∥2 = 1,
we allow users to have varying levels of discernment. This preference vector encodes preferences
over the multiple objectives and as a result, determines the user preferences over policies.
Formally, for a user characterized by w∗, the personalized value of a policy π is ⟨w∗, V^π⟩ ∈ R_+. We denote by π∗ := arg max_{π∈Π} ⟨w∗, V^π⟩ and v∗ := ⟨w∗, V^{π∗}⟩ the optimal personalized policy and its corresponding optimal personalized value for a user who is characterized by w∗. We remark that
the “do nothing” policy π0 (that always selects action a0 ) has a value of V π0 = 0, which implies
a personalized value of ⟨w∗ , V π0 ⟩ = 0 for every w∗ . For any two policies π1 and π2 , the user
characterized by w∗ prefers π1 over π2 if ⟨w∗ , V π1 − V π2 ⟩ > 0. Our goal is to find the optimal
personalized policy for a given user using as few interactions with them as possible.
Comparative feedback. Given two policies π_1, π_2, the user returns π_1 ≻ π_2 whenever ⟨w∗, V^{π_1} − V^{π_2}⟩ > ϵ; otherwise, the user returns “indistinguishable” (i.e., whenever |⟨w∗, V^{π_1} − V^{π_2}⟩| ≤ ϵ). Here ϵ > 0 measures the precision of the comparative feedback and is usually small. The agent can query the user about their policy preferences using two different types of policy representations:
1. Explicit policy representation of π: an explicit representation of the policy as a mapping π : S → Simplex^A.
2. Weighted trajectory set representation of π: a κ-sized set of trajectory-weight pairs {(p_i, τ_i)}_{i=1}^{κ} for some κ ≤ k + 1 such that (i) the weights p_1, . . . , p_κ are non-negative and sum to 1; (ii) every trajectory in the set is in the support5 of the policy π; and (iii) the expected return of these trajectories according to the weights is identical to the value of π, i.e., V^π = ∑_{i=1}^{κ} p_i Φ(τ_i). Such a comparison could be practical for humans. E.g., in the context of Google Maps, when the goal is to get from home to the airport: taking a specific route takes 40 minutes 90% of the time, but it can take 3 hours in case of an accident (which happens w.p. 10%), vs. taking the subway, which always has a duration of 1 hour.
In both cases, the feedback is identical and depends on the hidden precision parameter ϵ. As a result,
the value of ϵ will affect the number of queries and how close the value of the resulting personalized
policy is to the optimal personalized value. Alternatively, we can let the agent decide in advance on a
maximal number of queries, which will affect the optimality of the returned policy.
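As an illustration of this feedback model, a comparison oracle can be sketched as follows (a sketch only: the simulated user below follows the ϵ-threshold rule exactly, and the numbers are hypothetical).

    import numpy as np

    def compare(w_star, V1, V2, eps):
        # Returns 1 if policy 1 is preferred, 2 if policy 2 is preferred,
        # and 0 ("indistinguishable") if |<w*, V1 - V2>| <= eps.
        gap = float(np.dot(w_star, V1 - V2))
        if gap > eps:
            return 1
        if gap < -eps:
            return 2
        return 0

    # Hypothetical example with 3 objectives and precision eps = 0.1.
    w_star = np.array([3.0, 2.0, 1.0])
    print(compare(w_star, np.array([1.0, 0.0, 0.0]), np.array([0.9, 0.1, 0.0]), 0.1))  # 0

The same oracle is queried whether the two policies are presented explicitly or as weighted trajectory sets, since both representations determine the same value vectors.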
Technical Challenges. In Section 3, we find an approximately optimal policy using O(log(1/ϵ)) queries. To achieve this, we approach the problem in two steps. First, we identify a set of linearly independent policy values, and then we estimate the preference vector w∗ using a linear program that incorporates comparative feedback. The estimation error of w∗ usually depends on the condition number of the linear program. Therefore, the main challenge we face is how to search for linearly independent policy values that lead to a small condition number, and to provide a guarantee for this estimate.
In Section 4, we focus on how to design efficient algorithms to find the weighted trajectory set representation of a policy. Initially, we employ the well-known Carathéodory's theorem, which yields an inefficient algorithm with a potentially exponential running time of |S|^H. Our main challenge lies in developing an efficient algorithm with a running time of O(poly(H |S| |A|)). The approach based on Carathéodory's theorem treats the returns of trajectories as independent k-dimensional vectors, neglecting the fact that they are all generated from the same MDP. To overcome this challenge, we leverage the inherent structure of MDPs.

3 Learning from Explicit Policies


In this section, we consider the case where the interaction with a user is based on explicit policy comparison queries. We design an algorithm that outputs a policy that is nearly optimal for this user. For multiple different users, only part of the algorithm needs to be re-run for each user. For brevity, we relegate all proofs to the appendix.
If the user's preference vector w∗ (up to a positive scaling) were given, then one could compute the optimal policy and its personalized value efficiently, e.g., using the Finite Horizon Value Iteration algorithm. In our work, w∗ is unknown, and we interact with the user to learn w∗ through comparative feedback. Since the feedback model provides meaningful feedback only when the compared personalized policy values differ by at least ϵ, the exact value of w∗ cannot be recovered. We proceed by providing a high-level description of our ideas for estimating w∗.
(1) Basis policies: We find policies π_1, . . . , π_d and their respective values V^{π_1}, . . . , V^{π_d} ∈ [0, H]^k, such that the values are linearly independent and together span the entire space of value vectors.6 These policies will not necessarily be personalized optimal for the current user; instead, they serve only as building blocks to estimate the preference vector w∗. In Section 3.1 we describe an algorithm that finds a set of basis policies for any given MDP.
5 We require the trajectories in this set to be in the support of the policy, to avoid trajectories that do not make sense, such as trajectories that “teleport” between different unconnected states (e.g., commuting at 3 mph in Manhattan in one state and then at 40 mph in New Orleans for the subsequent state).
6 Recall that d ≤ k is the rank of the space spanned by all the value vectors obtained by all policies.
(2) Basis ratios: For the basis policies, denote by α_i > 0 the ratio of the personalized value of π_{i+1} to the personalized value of a benchmark policy, π_1, i.e.,
∀i ∈ [d − 1] : α_i ⟨w∗, V^{π_1}⟩ = ⟨w∗, V^{π_{i+1}}⟩ .   (1)
We will estimate an approximation α̂_i of α_i for all i ∈ [d − 1] using comparison queries. A detailed algorithm for estimating these ratios appears in Section 3.2. For intuition, if we obtain exact ratios α̂_i = α_i for every i ∈ [d − 1], then we can compute the vector w∗/∥w∗∥_1 as follows. Consider the d − 1 equations and d − 1 variables in Eq (1). Since d is the maximum number of value vectors that are linearly independent, and V^{π_1}, . . . , V^{π_d} form a basis, adding the equation ∥w∥_1 = 1 yields d independent equations with d variables, which allows us to solve for w∗. The details of computing an estimate of w∗ are described in Section 3.3.
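For a concrete illustration of this exact-ratio intuition, the following minimal Python sketch (not the paper's algorithm; the value vectors, d = k = 3, and the numpy dependency are assumptions) recovers w∗ up to positive scaling by solving the linear system of Eq (1) together with ∥w∥_1 = 1.

    import numpy as np

    # Hypothetical basis of linearly independent value vectors (d = k = 3).
    V = [np.array([2.0, 1.0, 0.0]),
         np.array([1.0, 2.0, 0.5]),
         np.array([0.5, 0.5, 2.0])]
    w_true = np.array([0.5, 0.3, 0.2])          # ground truth with ||w||_1 = 1

    # Exact ratios alpha_i = <w, V^{pi_{i+1}}> / <w, V^{pi_1}>, as in Eq (1).
    alphas = [np.dot(w_true, V[i + 1]) / np.dot(w_true, V[0]) for i in range(2)]

    # Linear system: alpha_i <w, V^{pi_1}> - <w, V^{pi_{i+1}}> = 0 for each i, plus ||w||_1 = 1.
    A = np.vstack([alphas[i] * V[0] - V[i + 1] for i in range(2)] + [np.ones(3)])
    b = np.array([0.0, 0.0, 1.0])
    print(np.linalg.solve(A, b))                # recovers w_true up to numerical error

With estimated rather than exact ratios, the quality of the recovered vector depends on how well-conditioned this system is, which is exactly why the basis policies are chosen carefully in Section 3.1.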

3.1 Finding a Basis of Policies

In order to efficiently find d policies with d linearly independent value vectors that span the space of
value vectors, one might think that selecting the k policies that each optimizes one of the k objectives
will suffice. However, this might fail—in Appendix N, we show an instance in which these k value
vectors are linearly dependent even though there exist k policies whose values span a space of rank k.
Moreover, our goal is to find not just any basis of policies, but a basis such that (1) the personalized value of the benchmark policy, ⟨w∗, V^{π_1}⟩, will be large (and hence the estimation error |α̂_i − α_i| of each ratio α_i will be small), and (2) the linear program generated by this basis of policies and the basis ratios will produce a good estimate of w∗.
Choice of π_1. Besides linear independence of values, another challenge is that the basis of policies should contain a benchmark policy, π_1 (where the index 1 is wlog), with a relatively large personalized value ⟨w∗, V^{π_1}⟩, so that the error of α̂_i is small (e.g., in the extreme case where ⟨w∗, V^{π_1}⟩ = 0, we will not be able to estimate α_i).
For any w ∈ R^k, we use π^w to denote a policy that maximizes the scalar reward ⟨w, R⟩, i.e.,
π^w = arg max_{π∈Π} ⟨w, V^π⟩ ,   (2)
and v^w = ⟨w, V^{π^w}⟩ to denote the corresponding personalized value. Let e_1, . . . , e_k denote the standard basis. To find a π_1 with large personalized value ⟨w∗, V^{π_1}⟩, we find policies π^{e_j} that maximize the j-th objective for every j = 1, . . . , k and then query the user to compare them until we find a π^{e_j} with (approximately) the maximal personalized value among them. This policy will be our benchmark policy, π_1. The details are described in lines 1–6 of Algorithm 1.
Choice of π_2, . . . , π_d. After finding π_1, we next search for the remaining d − 1 policies π_2, . . . , π_d sequentially (lines 8–13 of Algorithm 1). For i = 2, . . . , d, we find a direction u_i such that (i) the vector u_i is orthogonal to the space of current value vectors span(V^{π_1}, . . . , V^{π_{i−1}}), and (ii) there exists a policy π_i such that V^{π_i} has a significant component in the direction of u_i. Condition (i) is used to guarantee that the policy π_i has a value vector linearly independent of span(V^{π_1}, . . . , V^{π_{i−1}}). Condition (ii) is used to cope with the error caused by inaccurate approximation of the ratios α̂_i. Intuitively, when ∥α_i V^{π_1} − V^{π_{i+1}}∥_2 ≪ ϵ, the angle between α̂_i V^{π_1} − V^{π_{i+1}} and α_i V^{π_1} − V^{π_{i+1}} could be very large, which results in an inaccurate estimate of w∗ in the direction of α_i V^{π_1} − V^{π_{i+1}}. For example, if V^{π_1} = e_1 and V^{π_i} = e_1 + (ϵ/w_i∗) e_i for i = 2, . . . , k, then π_1 and π_i are “indistinguishable” and the estimated ratio α̂_{i−1} can be 1. Then the estimate of w∗ obtained by solving the linear equations in Eq (1) is (1, 0, . . . , 0), which could be far from the true w∗. Finding u_i's in which policy values have a large component can help with this problem.

Algorithm 1 Identification of Basis Policies
1: initialize π̃∗ ← π^{e_1}
2: for j = 2, . . . , k do
3:    compare π̃∗ and π^{e_j}
4:    if π^{e_j} ≻ π̃∗ then π̃∗ ← π^{e_j}
5: end for
6: π_1 ← π̃∗ and u_1 ← V^{π̃∗}/∥V^{π̃∗}∥_2
7: for i = 2, . . . , k do
8:    arbitrarily pick an orthonormal basis ρ_1, . . . , ρ_{k+1−i} for span(V^{π_1}, . . . , V^{π_{i−1}})^⊥
9:    j_max ← arg max_{j∈[k+1−i]} max(|v^{ρ_j}|, |v^{−ρ_j}|)
10:   if max(|v^{ρ_{j_max}}|, |v^{−ρ_{j_max}}|) > 0 then
11:      π_i ← π^{ρ_{j_max}} if |v^{ρ_{j_max}}| > |v^{−ρ_{j_max}}|; otherwise π_i ← π^{−ρ_{j_max}}. u_i ← ρ_{j_max}
12:   else
13:      return (π_1, π_2, . . .), (u_1, u_2, . . .)
14:   end if
15: end for
Algorithm 1 provides a more detailed description of this procedure. Note that if there are n different users, we will run Algorithm 1 at most k times instead of n times. The reason is that Algorithm 1 only utilizes preference comparisons while searching for π_1 (lines 1–6), and not for π_2, . . . , π_k (which contributes to the k^2 factor in the computational complexity). As there are at most k candidates for π_1, namely π^{e_1}, . . . , π^{e_k}, we execute lines 1–6 of Algorithm 1 for n rounds and lines 8–13 for only k rounds.

3.2 Computation of Basis Ratios

As we have mentioned before, comparing basis policies alone does not allow for the exact computation of the ratios α_i, as comparing π_1 and π_i can only reveal which is better but not by how much. To this end, we will use the “do nothing” policy to approximate every ratio α_i up to some additive error |α̂_i − α_i| using binary search over the parameter α̂_i ∈ [0, C_α] for some C_α ≥ 1 (to be determined later) and comparison queries of policy π_{i+1} with policy α̂_i π_1 + (1 − α̂_i)π_0 if α̂_i ≤ 1 (or comparing π_1 with (1/α̂_i) π_{i+1} + (1 − 1/α̂_i)π_0 instead if α̂_i > 1).7 Notice that the personalized value of α̂_i π_1 + (1 − α̂_i)π_0 is identical to the personalized value of π_1 multiplied by α̂_i. We stop once α̂_i is such that the user returns “indistinguishable”. Once we stop, the two policies have roughly the same personalized value, namely
if α̂_i ≤ 1, |α̂_i ⟨w∗, V^{π_1}⟩ − ⟨w∗, V^{π_{i+1}}⟩| ≤ ϵ ;  if α̂_i > 1, |⟨w∗, V^{π_1}⟩ − (1/α̂_i)⟨w∗, V^{π_{i+1}}⟩| ≤ ϵ .   (3)
Eq (1) combined with the above inequality implies that |α̂_i − α_i| ⟨w∗, V^{π_1}⟩ ≤ C_α ϵ. Thus, the approximation error of each ratio is bounded by |α̂_i − α_i| ≤ C_α ϵ / ⟨w∗, V^{π_1}⟩. To make sure the procedure will terminate, we need to set C_α ≥ v∗/⟨w∗, V^{π_1}⟩, since the α_i's must lie in the interval [0, v∗/⟨w∗, V^{π_1}⟩]. Stopping the binary search once Eq (3) holds, it takes at most O(d log(C_α ⟨w∗, V^{π_1}⟩ / ϵ)) comparison queries to estimate all the α_i's.
Due to the carefully picked π_1 in Algorithm 1, we can upper bound v∗/⟨w∗, V^{π_1}⟩ by 2k and derive an upper bound on |α̂_i − α_i| by selecting C_α = 2k.
Lemma 1. When ϵ ≤ v∗/(2k), we have v∗/⟨w∗, V^{π_1}⟩ ≤ 2k, and |α̂_i − α_i| ≤ 4k^2 ϵ/v∗ for every i ∈ [d].
In what follows we set C_α = 2k. The pseudo-code of the above process of estimating the α_i's is deferred to Algorithm 3 in Appendix C.
7 We write α̂_i π_1 + (1 − α̂_i)π_0 to indicate that π_1 is used with probability α̂_i, and that π_0 is used with probability 1 − α̂_i.
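To make the binary-search step concrete, here is a minimal Python sketch with a simulated user (the value vectors, w∗, and ϵ below are hypothetical, and for simplicity it only covers the case α̂_i ≤ 1; the case α̂_i > 1 would instead scale π_{i+1} as described above).

    import numpy as np

    def compare(w_star, V1, V2, eps):
        # Comparative feedback: 1 if policy 1 preferred, 2 if policy 2 preferred, 0 if indistinguishable.
        gap = float(np.dot(w_star, V1 - V2))
        return 1 if gap > eps else (2 if gap < -eps else 0)

    def estimate_ratio(V_pi1, V_pinext, w_star, eps, tol=1e-4):
        # Binary search for alpha_hat in [0, 1]. The mixture alpha_hat*pi1 + (1-alpha_hat)*pi0
        # has value alpha_hat * V^{pi1}, since the "do nothing" policy pi0 has value 0.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            mid = (lo + hi) / 2
            fb = compare(w_star, mid * V_pi1, V_pinext, eps)
            if fb == 0:          # indistinguishable: Eq (3) holds, stop
                return mid
            elif fb == 1:        # mixture preferred: alpha_hat is too large
                hi = mid
            else:                # pi_{i+1} preferred: alpha_hat is too small
                lo = mid
        return (lo + hi) / 2

    # Hypothetical instance with two objectives.
    w_star = np.array([0.6, 0.4])
    V_pi1, V_pi2 = np.array([3.0, 1.0]), np.array([1.0, 2.0])
    alpha_hat = estimate_ratio(V_pi1, V_pi2, w_star, eps=0.05)
    print(alpha_hat, np.dot(w_star, V_pi2) / np.dot(w_star, V_pi1))  # ~0.63 vs ~0.64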

3.3 Preference Approximation and Personalized Policy

We move on to present an algorithm that estimates w∗ and computes a nearly optimal personalized policy. Given the π_i's returned by Algorithm 1 and the α̂_i's returned by Algorithm 3, consider a matrix Â ∈ R^{d×k} whose first row is (V^{π_1})^⊤ and whose i-th row is (α̂_{i−1} V^{π_1} − V^{π_i})^⊤ for every i = 2, . . . , d. Let ŵ be a solution to Âx = e_1. We will show that ŵ is a good estimate of w′ := w∗/⟨w∗, V^{π_1}⟩ and that π^{ŵ} is a nearly optimal personalized policy. In particular, when ϵ is small, we have |⟨ŵ, V^π⟩ − ⟨w′, V^π⟩| = O(ϵ^{1/3}) for every policy π. Putting this together, we derive the following theorem.
Theorem 1. Consider the algorithm that computes Â and any solution ŵ to Âx = e_1, and outputs the policy π^{ŵ} = arg max_{π∈Π} ⟨ŵ, V^π⟩, which is the optimal personalized policy for the preference vector ŵ. Then the output policy π^{ŵ} satisfies v∗ − ⟨w∗, V^{π^{ŵ}}⟩ ≤ O((√k + 1)^{d + 14/3} ϵ^{1/3}), using O(k log(k/ϵ)) comparison queries.

Computation Complexity. We remark that Algorithm 1 solves Eq (2) for the optimal policy in a scalar-reward MDP at most O(k^2) times. Using, e.g., Finite Horizon Value Iteration to solve for the optimal policy takes O(H|S|^2|A|) steps. While the time complexity of returning the optimal policy for a single user is O(k^2 H|S|^2|A| + k log(k/ϵ)), considering n different users rather than one results in an overall time complexity of O((k^3 + n)H|S|^2|A| + nk log(k/ϵ)).
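As a concrete illustration of the estimation step in isolation (a sketch only: the basis values and ratio estimates below are hypothetical, and the real algorithm obtains the V^{π_i}'s and the final arg max from a scalar-reward MDP solver such as Finite Horizon Value Iteration), one can build Â and solve Âx = e_1 as follows.

    import numpy as np

    def estimate_w(V_basis, alpha_hat):
        # First row of A_hat is V^{pi_1}; the i-th row is (alpha_hat_{i-1} V^{pi_1} - V^{pi_i}).
        V1 = V_basis[0]
        rows = [V1] + [alpha_hat[i - 1] * V1 - V_basis[i] for i in range(1, len(V_basis))]
        A_hat = np.vstack(rows)
        e1 = np.zeros(len(V_basis)); e1[0] = 1.0
        w_hat, *_ = np.linalg.lstsq(A_hat, e1, rcond=None)   # a (min-norm) solution of A_hat x = e1
        return w_hat

    # Hypothetical basis values (d = k = 3) and ratio estimates from comparison queries.
    V_basis = [np.array([2.0, 1.0, 0.0]),
               np.array([1.0, 2.0, 0.5]),
               np.array([0.5, 0.5, 2.0])]
    w_hat = estimate_w(V_basis, alpha_hat=[0.92, 0.62])
    best = max(V_basis, key=lambda V: float(np.dot(w_hat, V)))  # stand-in for the argmax over all policies
    print(w_hat, best)

Note that ŵ estimates w′ = w∗/⟨w∗, V^{π_1}⟩ rather than w∗ itself; the positive scaling does not affect which policy maximizes ⟨ŵ, V^π⟩.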
Proof Technique. The standard technique typically starts by deriving an upper bound on ∥ŵ − w∗∥_2 and then uses this bound to upper bound sup_π |⟨ŵ, V^π⟩ − ⟨w∗, V^π⟩| by C_V ∥ŵ − w∗∥_2. However, this method fails to achieve a non-vacuous bound in case there are two basis policies that are close to each other. For instance, consider the returned basis policy values V^{π_1} = (1, 0, 0), V^{π_2} = (1, 1, 0), and V^{π_3} = (1, 0, η) for some η > 0. When η is extremely small, the estimate ŵ becomes highly inaccurate in the direction of (0, 0, 1), leading to a large ∥ŵ − w∗∥_2. Even in such cases, we can still obtain a non-vacuous guarantee, since the selection of V^{π_3} (line 11 of Algorithm 1) implies that no policy in Π exhibits a larger component in the direction of (0, 0, 1) than π_3.
Proof Sketch. The analysis of Theorem 1 has two parts. First, as mentioned in Section 3.1, when ∥α_i V^{π_1} − V^{π_{i+1}}∥_2 ≪ ϵ, the error of α̂_{i+1} can lead to an inaccurate estimate of w∗ in the direction of α_i V^{π_1} − V^{π_{i+1}}. Thus, we consider another estimate of w∗ based only on those π_{i+1}'s with a relatively large ∥α_i V^{π_1} − V^{π_{i+1}}∥_2. In particular, for any δ > 0, let d_δ := min_{i ≥ 2 : max(|v^{u_i}|, |v^{−u_i}|) ≤ δ} i − 1. That is to say, for i = 2, . . . , d_δ, the policy π_i satisfies ⟨u_i, V^{π_i}⟩ > δ, and for any policy π we have ⟨u_{d_δ+1}, V^π⟩ ≤ δ. Then, for any policy π and any unit vector ξ ∈ span(V^{π_1}, . . . , V^{π_{d_δ}})^⊥, we have ⟨ξ, V^π⟩ ≤ √k δ. This is because at round d_δ + 1 we pick an orthonormal basis ρ_1, . . . , ρ_{k−d_δ} of span(V^{π_1}, . . . , V^{π_{d_δ}})^⊥ (line 8 in Algorithm 1) and pick u_{d_δ+1} to be the direction in which some policy has the largest component, as described in line 9. Hence, |⟨ρ_j, V^π⟩| ≤ δ for all j ∈ [k − d_δ]. Then we have ⟨ξ, V^π⟩ = ∑_{j=1}^{k−d_δ} ⟨ξ, ρ_j⟩ ⟨ρ_j, V^π⟩ ≤ √k δ by the Cauchy-Schwarz inequality. Let Â^{(δ)} ∈ R^{d_δ×k} be the sub-matrix comprised of the first d_δ rows of Â. Then we consider an alternative estimate ŵ^{(δ)} = arg min_{x : Â^{(δ)}x = e_1} ∥x∥_2, the minimum-norm solution x to Â^{(δ)}x = e_1. We upper bound sup_π |⟨ŵ^{(δ)}, V^π⟩ − ⟨w′, V^π⟩| in Lemma 2 and sup_π |⟨ŵ, V^π⟩ − ⟨ŵ^{(δ)}, V^π⟩| in Lemma 3. Then we are done with the proof of Theorem 1.
Lemma 2. If |α̂_i − α_i| ≤ ϵ_α and α_i ≤ C_α for all i ∈ [d − 1], then for every δ ≥ 4 C_α^{2/3} C_V d^{1/3} ϵ_α^{1/3} we have |⟨ŵ^{(δ)}, V^π⟩ − ⟨w′, V^π⟩| ≤ O( C_α^{4/3} C_V^2 d_δ ∥w′∥_2 ϵ_α / δ^2 + √k δ ∥w′∥_2 ) for all π, where w′ = w∗/⟨w∗, V^{π_1}⟩.

Since we only remove the rows of Â corresponding to u_i's in the subspace where no policy's value has a large component, ŵ and ŵ^{(δ)} are close in terms of sup_π |⟨ŵ, V^π⟩ − ⟨ŵ^{(δ)}, V^π⟩|.
Lemma 3. If |α̂_i − α_i| ≤ ϵ_α and α_i ≤ C_α for all i ∈ [d − 1], then for every policy π and every δ ≥ 4 C_α^{2/3} C_V d^{1/3} ϵ_α^{1/3} we have |ŵ · V^π − ŵ^{(δ)} · V^π| ≤ O((√k + 1)^{d−d_δ} C_α ϵ^{(δ)}), where ϵ^{(δ)} = C_α^{4/3} C_V^2 d_δ ∥w′∥_2 ϵ_α / δ^2 + √k δ ∥w′∥_2 is the upper bound in Lemma 2.
Note that the result in Theorem 1 has a factor of k^{d/2}, which is exponential in d. Usually, we consider the case where k = O(1) is small, and thus k^d = O(1) is small. We get rid of the exponential dependence on d by applying ŵ^{(δ)} to estimate w∗ directly, which requires us to set the value of δ beforehand. The following theorem follows directly by assigning the optimal value for δ in Lemma 2.
Theorem 2. Consider the algorithm that computes Â and any solution ŵ^{(δ)} to Â^{(δ)}x = e_1 for δ = k^{5/3} ϵ^{1/3}, and outputs the policy π^{ŵ^{(δ)}} = arg max_{π∈Π} ⟨ŵ^{(δ)}, V^π⟩. Then the policy π^{ŵ^{(δ)}} satisfies v∗ − ⟨w∗, V^{π^{ŵ^{(δ)}}}⟩ ≤ O(k^{13/6} ϵ^{1/3}).

Notice that the algorithm in Theorem 2 needs to set the hyperparameter δ beforehand, while we do not have to set any hyperparameter in Theorem 1. An improper value of δ could degrade the performance of the algorithm. However, we can approximately estimate ϵ by binary searching over η ∈ [0, 1] and comparing π_1 against the scaled version of itself, (1 − η)π_1, until we find an η such that the user cannot distinguish between π_1 and (1 − η)π_1. We can then obtain an estimate of ϵ and use it to set the hyperparameter.
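A minimal sketch of this ϵ-estimation idea with a simulated user (the numbers are hypothetical; note that the search recovers the threshold η∗ = ϵ/⟨w∗, V^{π_1}⟩, from which an estimate of ϵ can then be derived):

    import numpy as np

    def indistinguishable(w_star, V1, V2, eps):
        # True iff the user cannot tell the two policies apart.
        return abs(float(np.dot(w_star, V1 - V2))) <= eps

    def estimate_threshold(V_pi1, w_star, eps, tol=1e-6):
        # Binary search for the largest eta in [0, 1] such that pi_1 and (1 - eta) * pi_1
        # are indistinguishable; the true threshold is eps / <w*, V^{pi_1}>.
        lo, hi = 0.0, 1.0
        while hi - lo > tol:
            eta = (lo + hi) / 2
            if indistinguishable(w_star, V_pi1, (1 - eta) * V_pi1, eps):
                lo = eta      # still indistinguishable: the threshold is larger
            else:
                hi = eta
        return (lo + hi) / 2

    # Hypothetical instance.
    w_star, V_pi1, eps = np.array([0.6, 0.4]), np.array([3.0, 1.0]), 0.05
    print(estimate_threshold(V_pi1, w_star, eps), eps / float(np.dot(w_star, V_pi1)))  # both ~0.023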
We remark that though we think of k as a small number, it is unclear whether the dependency on ϵ
in Theorems 1 and 2 is optimal. The tight dependency on ϵ is left as an open problem. We briefly
discuss a potential direction to improve this bound in Appendix I.

4 Learning from Weighted Trajectory Set Representation


In the last section, we represented policies using their explicit form as state-action mappings. However,
such a representation could be challenging for users to interpret. For example, how safe is a car

described by a list of |S| states and actions such as “turning left”? In this section, we design algorithms
that return a more interpretable policy representation—a weighted trajectory set.
Recall from the definition in Section 2 that a weighted trajectory set is a small set of trajectories from the support of the policy, together with corresponding weights, with the property that the expected return of the trajectories in the set (according to the weights) is exactly the value of the policy (henceforth, the exact value property).8 As these sets preserve all the information regarding the multi-objective values of policies, they can be used as policy representations in the policy comparison queries of Algorithm 3 without compromising feedback quality. Thus, using these representations obtains the same optimality guarantees regarding the returned policy as in Section 3 (but requires extra computation time to calculate the sets).
There are two key observations on which the algorithms in this section are based:
(1) Each policy π induces a distribution over trajectories. Let q^π(τ) denote the probability that a trajectory τ is sampled when selecting actions according to π. The expected return of all trajectories under q^π is identical to the value of the policy, i.e., V^π = ∑_τ q^π(τ) Φ(τ). In particular, the value of a policy is a convex combination of the returns of the trajectories in its support. However, we avoid using this convex combination to represent a policy, since the number of trajectories in the support of a policy could be exponential in the number of states and actions.
(2) The existence of a small weighted trajectory set is implied by Carathéodory’s theorem. Namely,
since the value of a policy is in particular a convex combination of the returns of the trajectories in
its support, Carathéodory’s theorem implies that there exist k + 1 trajectories in the support of the
policy and weights for them such that a convex combination of their returns is the value of the policy.
Such a (k + 1)-sized set will be the output of our algorithms.
We can apply the idea behind the proof of Carathéodory's theorem to compress trajectories as follows. For any (k + 2)-sized set of k-dimensional vectors {μ_1, . . . , μ_{k+2}} and any convex combination μ = ∑_{i=1}^{k+2} p_i μ_i of them, we can always find a (k + 1)-sized subset such that μ can be represented as a convex combination of the subset, by solving a linear equation. Given as input a probability distribution p over a set of k-dimensional vectors M, we pick k + 2 vectors from M and remove at least one of them through the above procedure. We repeat this step until we are left with at most k + 1 vectors. We refer to this algorithm as C4 (Compress Convex Combination using Carathéodory's theorem). The pseudocode is described in Algorithm 4, which is deferred to Appendix J due to the space limit. The construction of the algorithm immediately implies the following lemma.
Lemma 4. Given a set of k-dimensional vectors M ⊂ R^k and a distribution p over M, C4(M, p) outputs M′ ⊂ M with |M′| ≤ k + 1 and a distribution q ∈ Simplex^{M′} satisfying E_{μ∼q}[μ] = E_{μ∼p}[μ], in time O(|M| k^3).
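The following Python sketch implements the Carathéodory compression idea behind C4 (an illustrative implementation, not the paper's Algorithm 4; it finds the null-space direction with an SVD for clarity rather than matching the O(|M| k^3) bound exactly).

    import numpy as np

    def c4(M, p, tol=1e-12):
        # Given points M (n x k) and weights p summing to 1, return (indices, weights)
        # of at most k + 1 points whose weighted mean equals the original weighted mean.
        M = np.asarray(M, dtype=float)
        p = np.asarray(p, dtype=float).copy()
        n, k = M.shape
        active = [i for i in range(n) if p[i] > tol]
        while len(active) > k + 1:
            idx = active[: k + 2]                    # any k + 2 active points
            # Find lambda != 0 with sum(lambda) = 0 and sum(lambda_j * mu_j) = 0:
            # a null-space vector of the (k + 1) x (k + 2) matrix below.
            B = np.vstack([M[idx].T, np.ones(len(idx))])
            lam = np.linalg.svd(B)[2][-1]
            # Shift p along lam until some weight hits zero, keeping all weights >= 0;
            # both the total weight and the weighted mean are preserved.
            t = min(p[i] / lam[j] for j, i in enumerate(idx) if lam[j] > tol)
            for j, i in enumerate(idx):
                p[i] -= t * lam[j]
            active = [i for i in range(n) if p[i] > tol]
        return active, p[active]

    # Toy check: 6 random points in R^2 compressed to at most 3 with the same mean.
    rng = np.random.default_rng(0)
    M = rng.random((6, 2))
    p = np.ones(6) / 6
    idx, q = c4(M, p)
    print(len(idx), np.allclose(q @ M[idx], p @ M))   # <= 3, True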

So now we know how to compress a set of trajectories to the desired size. The main challenge is how to do it efficiently (in time O(poly(H |S| |A|))). Namely, since the set of all trajectory returns from the support of the policy could be of size Ω(|S|^H), using it as input to the C4 algorithm is inefficient. Instead, we will only use C4 as a subroutine when the number of trajectories is small.
We propose two efficient approaches for finding weighted trajectory representations. Both approaches take advantage of the property that all trajectories are generated from the same policy on the same MDP. The first starts with a small set of trajectories of length 1, expands them, compresses them, and repeats until we obtain a set of trajectories of length H. The other is based on the construction of a layer graph in which a policy corresponds to a flow; we show that finding representative trajectories is equivalent to flow decomposition.
In the next subsection, we describe the expanding and compressing approach and defer the flow-decomposition-based approach to Appendix K due to space considerations. We remark that the flow decomposition approach has a running time of O(H^2 |S|^2 + k^3 H |S|^2) (see the appendix for details), which underperforms the expanding and compressing approach (see Theorem 3) whenever |S|H + |S|k^3 = ω(k^4 + k|S|). For presentation purposes, in the following we only consider deterministic policies. Our techniques can be easily extended to random policies.9
8 Without asking for the exact value property, one could simply return a sample of O(log k/(ϵ′)^2) trajectories from the policy with uniform weights. With high probability, the average return of every objective is then within ϵ′ of its expected value. The problem is that this set does not necessarily capture rare events. For example, if the probability of a crash for any car is between (ϵ′)^4 and (ϵ′)^2, depending on the policy, users that only care about safety (i.e., no crashes) are unlikely to observe any “unsafe” trajectories at all, in which case we would miss valuable feedback.
Expanding and Compressing Approach.
The basic idea is to first find k + 1 trajectories of length 1 to represent V^π and then increase the length of the trajectories without increasing the number of trajectories. For a policy π, let V^π(s, h) = E_{S_0=s}[∑_{t=0}^{h−1} R(S_t, π(S_t))] be the value of π with initial state S_0 = s and time horizon h. Since we study the representation of a fixed policy π in this section, we slightly abuse notation and represent a trajectory by τ = (s, s_1, . . . , s_H). We denote the state of trajectory τ at time t by s^τ_t = s_t. For a trajectory prefix τ = (s, s_1, . . . , s_h) of π with initial state s and h ≤ H subsequent states, the return of τ is Φ(τ) = R(s, π(s)) + ∑_{t=1}^{h−1} R(s_t, π(s_t)). Let J(τ) be the expected return of trajectories (of length H) with prefix τ, i.e.,
J(τ) := Φ(τ) + V^π(s^τ_h, H − h) .

For any s ∈ S, let τ ◦ s denote the trajectory obtained by appending s to τ. We can solve V^π(s, h) for all s ∈ S, h ∈ [H] by dynamic programming in time O(kH |S|^2). Specifically, by definition we have V^π(s, 1) = R(s, π(s)) and
V^π(s, h + 1) = R(s, π(s)) + ∑_{s′∈S} P(s′ | s, π(s)) V^π(s′, h) .   (4)
Thus, we can represent V^π by
V^π = R(s_0, π(s_0)) + ∑_{s∈S} P(s | s_0, π(s_0)) V^π(s, H − 1) = ∑_{s∈S} P(s | s_0, π(s_0)) J((s_0, s)) .
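A minimal Python sketch of this dynamic program (Eq (4)) for a small, hypothetical MDP with a deterministic policy (the array shapes and the example numbers are assumptions for illustration):

    import numpy as np

    def policy_values(P, R, pi, H):
        # V[s, h] = k-dimensional value of policy pi from state s with horizon h + 1,
        # computed bottom-up via Eq (4).
        # P: (S, A, S) transition probabilities, R: (S, A, k) reward vectors,
        # pi: length-S array of actions (deterministic policy), H: horizon.
        S, A, k = R.shape
        V = np.zeros((S, H, k))
        for s in range(S):
            V[s, 0] = R[s, pi[s]]                           # V^pi(s, 1) = R(s, pi(s))
        for h in range(1, H):
            for s in range(S):
                a = pi[s]
                V[s, h] = R[s, a] + P[s, a] @ V[:, h - 1]   # Eq (4)
        return V

    # Tiny hypothetical MDP: 2 states, 2 actions, 2 objectives, horizon 3.
    P = np.zeros((2, 2, 2))
    P[0, 0] = [0.5, 0.5]; P[0, 1] = [1.0, 0.0]
    P[1, 0] = [0.0, 1.0]; P[1, 1] = [0.5, 0.5]
    R = np.array([[[1.0, 0.0], [0.0, 1.0]],
                  [[0.5, 0.5], [1.0, 0.0]]])
    pi = np.array([0, 1])
    print(policy_values(P, R, pi, H=3)[0, 2])   # value of pi from state 0 with horizon 3

These V^π(s, h) values are exactly the quantities needed to evaluate J(τ) for the trajectory prefixes used below.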

By applying C4, we can find a set of representative trajectories of length 1, F^{(1)} ⊂ {(s_0, s) | s ∈ S}, with |F^{(1)}| ≤ k + 1, and weights β^{(1)} ∈ Simplex^{F^{(1)}} such that
V^π = ∑_{τ∈F^{(1)}} β^{(1)}(τ) J(τ) .   (5)

Suppose that we are given a set of trajectories F^{(t)} of length t with weights β^{(t)} such that V^π = ∑_{τ∈F^{(t)}} β^{(t)}(τ) J(τ). We can first increase the length of the trajectories by 1 through Eq (4) and obtain a subset of {τ ◦ s | τ ∈ F^{(t)}, s ∈ S}, in which the trajectories are of length t + 1. Specifically, we have
V^π = ∑_{τ∈F^{(t)}, s∈S} β^{(t)}(τ) P(s | s^τ_t, π(s^τ_t)) J(τ ◦ s) .   (6)

Then we would like to compress the above convex combination through C4, as we want to keep track of at most k + 1 trajectories of length t + 1 to control the computing time. More formally, let J_{F^{(t)}} := {J(τ ◦ s) | τ ∈ F^{(t)}, s ∈ S} be the set of expected returns, and let p_{F^{(t)},β^{(t)}} ∈ Simplex^{F^{(t)}×S}, with p_{F^{(t)},β^{(t)}}(τ ◦ s) = β^{(t)}(τ) P(s | s^τ_t, π(s^τ_t)), be the weights appearing in Eq (6). Here p_{F^{(t)},β^{(t)}} defines a distribution over J_{F^{(t)}} with the probability of drawing J(τ ◦ s) being p_{F^{(t)},β^{(t)}}(τ ◦ s). Then we can apply C4 to (J_{F^{(t)}}, p_{F^{(t)},β^{(t)}}) and compress the set of representative trajectories {τ ◦ s | τ ∈ F^{(t)}, s ∈ S}. We start with trajectories of length 1 and repeat the process of expanding and compressing until we obtain trajectories of length H. The details are described in Algorithm 2.

9 In Section 3, we consider a special type of random policy: a mixture of a deterministic policy (the output of an algorithm that solves for the optimal policy in an MDP with scalar reward) with the “do nothing” policy. For this specific random policy, we can find weighted trajectory representations for both policies and then apply Algorithm 4 to compress the representation.
Algorithm 2 Expanding and compressing trajectories
1: compute V^π(s, h) for all s ∈ S, h ∈ [H] by dynamic programming according to Eq (4)
2: F^{(0)} = {(s_0)} and β^{(0)}(s_0) = 1
3: for t = 0, . . . , H − 1 do
4:    J_{F^{(t)}} ← {J(τ ◦ s) | τ ∈ F^{(t)}, s ∈ S} and p_{F^{(t)},β^{(t)}}(τ ◦ s) ← β^{(t)}(τ) P(s | s^τ_t, π(s^τ_t)) for τ ∈ F^{(t)}, s ∈ S // expanding step
5:    (J^{(t+1)}, β^{(t+1)}) ← C4(J_{F^{(t)}}, p_{F^{(t)},β^{(t)}}) and F^{(t+1)} ← {τ | J(τ) ∈ J^{(t+1)}} // compressing step
6: end for
7: output F^{(H)} and β^{(H)}

Theorem 3. Algorithm 2 outputs F^{(H)} and β^{(H)} satisfying |F^{(H)}| ≤ k + 1 and ∑_{τ∈F^{(H)}} β^{(H)}(τ) Φ(τ) = V^π, in time O(k^4 H |S| + kH |S|^2).

The proof of Theorem 3 follows immediately from the construction of the algorithm. According to Eq (5), we have V^π = ∑_{τ∈F^{(1)}} β^{(1)}(τ) J(τ). Then we can show that the output of Algorithm 2 is a valid weighted trajectory set by induction on the length of the representative trajectories. C4 guarantees that |F^{(t)}| ≤ k + 1 for all t = 1, . . . , H; thus we only keep track of at most k + 1 trajectories at each step, which yields the computational guarantee in the theorem. Combined with Theorem 1, we derive the following corollary.
Corollary 1. Running the algorithm in Theorem 1 with the weighted trajectory set representation returned by Algorithm 2 gives the same guarantee as that of Theorem 1, in time O(k^2 H|S|^2|A| + (k^5 H |S| + k^2 H |S|^2) log(k/ϵ)).

5 Discussion
In this paper, we designed efficient algorithms for learning users’ preferences over multiple objectives
from comparative feedback. The efficiency is expressed in both the running time and number of
queries (both polynomial in H, |S| , |A| , k and logarithmic in 1/ϵ). The learned preferences of a user
can then be used to reduce the problem of finding a personalized optimal policy for this user to a
(finite horizon) single scalar reward MDP, a problem with a known efficient solution. As we have
focused on minimizing the policy comparison queries, our algorithms are based on polynomial time
pre-processing calculations that save valuable comparison time for users.
The results in Section 3 are of independent interest and can be applied to a more general learning
setting, where for some unknown linear parameter w∗ , given a set of points X and access to
comparison queries of any two points, the goal is to learn arg maxx∈X ⟨w∗ , x⟩. E.g., in personalized
recommendations for coffee beans in terms of the coffee profile described by the coffee suppliers
(body, aroma, crema, roast level,...), while users could fail to describe their optimal coffee beans
profile, adopting the methodology in Section 3 can retrieve the ideal coffee beans for a user using
comparisons (where the mixing with “do nothing” is done by diluting the coffee with water and the
optimal coffee for a given profile is the one closest to it).
When moving from the explicit representation of policies as mappings from states to actions to a more
natural policy representation as a weighted trajectory set, we then obtained the same optimality guar-
antees in terms of the number of queries. While there could be other forms of policy representations
(e.g., a small subset of common states), one advantage of our weighted trajectory set representation is
that it captures the essence of the policy multi-objective value in a clear manner via O(k) trajectories
and weights. The algorithms provided in Section 4 are standalone and could also be of independent
interest for explainable RL (Alharin et al., 2020). For example, to exemplify the multi-objective performance of generic robotic vacuum cleaners (this is beneficial if we only have, e.g., 3 of them), we can apply the algorithms in Section 4 to generate weighted trajectory set representations and compare them directly without going through the algorithm in Section 3.
An interesting direction for future work is to relax the assumption that the MDP is known in advance.
One direct way is to first learn the model (in model-based RL), then apply our algorithms in the
learned MDP. The sub-optimality of the returned policy will then depend on both the estimation error
of the model and the error introduced by our algorithms (which depends on the parameters in the
learned model).

Acknowledgements
This work was supported in part by the National Science Foundation under grants CCF-2212968
and ECCS-2216899 and by the Defense Advanced Research Projects Agency under cooperative
agreement HR00112020003. The views expressed in this work do not necessarily reflect the position
or the policy of the Government and no official endorsement should be inferred. Approved for public
release; distribution is unlimited.
This project was supported in part by funding from the European Research Council (ERC) under the
European Union’s Horizon 2020 research and innovation program (grant agreement number 882396),
by the Israel Science Foundation (grant number 993/17), Tel Aviv University Center for AI and Data
Science (TAD), the Eric and Wendy Schmidt Fund, and the Yandex Initiative for Machine Learning
at Tel Aviv University.
We would like to thank all anonymous reviewers, especially Reviewer Gnoo, for their constructive
comments.

References
Ailon, N. (2012). An active learning algorithm for ranking from pairwise preferences with an almost
optimal query complexity. Journal of Machine Learning Research, 13(1):137–164.

Ailon, N., Karnin, Z. S., and Joachims, T. (2014). Reducing dueling bandits to cardinal bandits. In
ICML, volume 32, pages 856–864.

Akrour, R., Schoenauer, M., and Sebag, M. (2012). APRIL: Active preference learning-based
reinforcement learning. In Proceedings of the Joint European Conference on Machine Learning
and Knowledge Discovery in Databases, pages 116–131.

Alharin, A., Doan, T.-N., and Sartipi, M. (2020). Reinforcement learning interpretation methods: A
survey. IEEE Access, 8:171058–171077.

Balcan, M., Blum, A., and Vempala, S. S. (2015). Efficient representations for lifelong learning
and autoencoding. In Proceedings of the Annual Conference on Learning Theory (COLT), pages
191–210.

Balcan, M.-F., Vitercik, E., and White, C. (2016). Learning combinatorial functions from pairwise
comparisons. In Proceedings of the Annual Conference on Learning Theory (COLT).

Barrett, L. and Narayanan, S. (2008). Learning all optimal policies with multiple criteria. In
Proceedings of the International Conference on Machine Learning (ICML), pages 41–47.

Bengs, V., Busa-Fekete, R., El Mesaoudi-Paul, A., and Hüllermeier, E. (2021). Preference-based
online learning with dueling bandits: A survey. J. Mach. Learn. Res.

Bhatia, K., Pananjady, A., Bartlett, P., Dragan, A., and Wainwright, M. J. (2020). Preference
learning along multiple criteria: A game-theoretic perspective. In Advances in Neural Information
Processing Systems (NeurIPS), pages 7413–7424.

Biyik, E. and Sadigh, D. (2018). Batch active preference-based learning of reward functions. In
Proceedings of the Conference on Robot Learning (CoRL), pages 519–528.

Blum, A. and Shao, H. (2020). Online learning with primary and secondary losses. Advances in
Neural Information Processing Systems, 33:20427–20436.

Chatterjee, K. (2007). Markov decision processes with multiple long-run average objectives. In Pro-
ceedings of the International Conference on Foundations of Software Technology and Theoretical
Computer Science (FSTTCS), pages 473–484.

Chatterjee, K., Majumdar, R., and Henzinger, T. A. (2006). Markov decision processes with multiple
objectives. In Proceedings of the Annual Symposium on Theoretical Aspects of Computer Science
(STACS), pages 325–336.

Chen, X., Ghadirzadeh, A., Björkman, M., and Jensfelt, P. (2019). Meta-learning for multi-objective
reinforcement learning. In Proceedings of the IEEE/RSJ International Conference on Intelligent
Robots and Systems (IROS), pages 977–983.
Cheng, W., Fürnkranz, J., Hüllermeier, E., and Park, S.-H. (2011). Preference-based policy iteration:
Leveraging preference learning for reinforcement learning. In Proceedings of the Joint European
Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), pages
312–327.
Cheung, W. C. (2019). Regret minimization for reinforcement learning with vectorial feedback and
complex objectives. In Advances in Neural Information Processing Systems (NeurIPS).
Christiano, P. F., Leike, J., Brown, T., Martic, M., Legg, S., and Amodei, D. (2017). Deep reinforce-
ment learning from human preferences. In Advances in Neural Information Processing Systems
(NeurIPS).
Doumpos, M. and Zopounidis, C. (2007). Regularized estimation for preference disaggregation in
multiple criteria decision making. Computational Optimization and Applications, 38(1):61–80.
Fürnkranz, J., Hüllermeier, E., Cheng, W., and Park, S.-H. (2012). Preference-based reinforcement
learning: A formal framework and a policy iteration algorithm. Machine Learning, 89(1):123–156.
Hayes, C. F., Radulescu, R., Bargiacchi, E., Källström, J., Macfarlane, M., Reymond, M., Verstraeten,
T., Zintgraf, L. M., Dazeley, R., Heintz, F., Howley, E., Irissappane, A. A., Mannion, P., Nowé, A.,
de Oliveira Ramos, G., Restelli, M., Vamplew, P., and Roijers, D. M. (2022). A practical guide
to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent
Systems.
Ibarz, B., Leike, J., Pohlen, T., Irving, G., Legg, S., and Amodei, D. (2018). Reward learning from
human preferences and demonstrations in Atari. In Advances in Neural Information Processing
Systems (NeurIPS).
Jain, A., Sharma, S., Joachims, T., and Saxena, A. (2015). Learning preferences for manipulation tasks
from online coactive feedback. International Journal of Robotics Research, 34(10):1296–1313.
Jamieson, K. G. and Nowak, R. (2011). Active ranking using pairwise comparisons. Advances in
neural information processing systems, 24.
Kane, D. M., Lovett, S., Moran, S., and Zhang, J. (2017). Active classification with comparison
queries. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS),
pages 355–366. IEEE.
Knox, W. B., Hatgis-Kessell, S., Booth, S., Niekum, S., Stone, P., and Allievi, A. (2022). Models of
human preference for learning reward functions. arXiv preprint arXiv:2206.02231.
Lee, K., Smith, L., and Abbeel, P. (2021). PEBBLE: Feedback-efficient interactive reinforcement
learning via relabeling experience and unsupervised pre-training. arXiv preprint arXiv:2106.05091.
Mannor, S., Perchet, V., and Stoltz, G. (2014). Approachability in unknown games: Online learning
meets multi-objective optimization. In Conference on Learning Theory, pages 339–355. PMLR.
Mannor, S. and Shimkin, N. (2001). The steering approach for multi-criteria reinforcement learning.
In Advances in Neural Information Processing Systems (NeurIPS).
Mannor, S. and Shimkin, N. (2004). A geometric approach to multi-criterion reinforcement learning.
Journal of Machine Learning Research, 5:325–360.
Marinescu, R., Razak, A., and Wilson, N. (2017). Multi-objective influence diagrams with possibly
optimal policies. Proceedings of the AAAI Conference on Artificial Intelligence.
McAuley, J. (2022). Personalized Machine Learning. Cambridge University Press. in press.
Pacchiano, A., Saha, A., and Lee, J. (2022). Dueling RL: reinforcement learning with trajectory
preferences.

Ren, W., Liu, J., and Shroff, N. B. (2018). PAC ranking from pairwise and listwise queries: Lower
bounds and upper bounds. arXiv preprint arXiv:1806.02970.
Roijers, D. M., Vamplew, P., Whiteson, S., and Dazeley, R. (2013). A survey of multi-objective
sequential decision-making. Journal of Artificial Intelligence Research, 48:67–113.
Rothkopf, C. A. and Dimitrakakis, C. (2011). Preference elicitation and inverse reinforcement
learning. In Proceedings of the Joint European Conference on Machine Learning and Knowledge
Discovery in Databases, pages 34–48.
Sadigh, D., Dragan, A. D., Sastry, S., and Seshia, S. A. (2017). Active preference-based learning of
reward functions. In Proceedings of Robotics: Science and Systems (RSS).
Saha, A. and Gopalan, A. (2018). Battle of bandits. In Uncertainty in Artificial Intelligence.
Saha, A. and Gopalan, A. (2019). PAC Battling Bandits in the Plackett-Luce Model. In Algorithmic
Learning Theory, pages 700–737.
Saha, A., Koren, T., and Mansour, Y. (2021). Dueling convex optimization. In International
Conference on Machine Learning, pages 9245–9254. PMLR.
Sui, Y., Zhuang, V., Burdick, J., and Yue, Y. (2017). Multi-dueling bandits with dependent arms. In
Conference on Uncertainty in Artificial Intelligence, UAI’17.
Sui, Y., Zoghi, M., Hofmann, K., and Yue, Y. (2018). Advancements in dueling bandits. In IJCAI,
pages 5502–5510.
Wang, N., Wang, H., Karimzadehgan, M., Kveton, B., and Boutilier, C. (2022). IMO^3: Interactive multi-objective off-policy optimization. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22.
Wilson, A., Fern, A., and Tadepalli, P. (2012). A Bayesian approach for policy learning from
trajectory preference queries. In Advances in Neural Information Processing Systems (NeurIPS).
Wirth, C., Akrour, R., Neumann, G., and Fürnkranz, J. (2017a). A survey of preference-based
reinforcement learning methods. J. Mach. Learn. Res.
Wirth, C., Akrour, R., Neumann, G., and Fürnkranz, J. (2017b). A survey of preference-based
reinforcement learning methods. Journal of Machine Learning Research, 18(136):1–46.
Wirth, C., Fürnkranz, J., and Neumann, G. (2016). Model-free preference-based reinforcement
learning. In Proceedings of the National Conference on Artificial Intelligence (AAAI).
Yona, G., Moran, S., Elidan, G., and Globerson, A. (2022). Active learning with label comparisons.
In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI).
Zoghi, M., Whiteson, S., Munos, R., Rijke, M. d., et al. (2014). Relative upper confidence bound for
the k-armed dueling bandit problem. In JMLR Workshop and Conference Proceedings, number 32,
pages 10–18. JMLR.

A Related Work

Multi-objective sequential decision making There is a long history of work on multi-objective


sequential decision making (Roijers et al., 2013), with one key focus being the realization of efficient
algorithms for approximating the Pareto front (Chatterjee et al., 2006; Chatterjee, 2007; Marinescu
et al., 2017). Instead of finding a possibly optimal policy, we concentrate on specific user preferences and find a policy that is optimal for that specific user, just as the ideal car for one person could be a Chevrolet Spark (small) while for another it is a Ford Ranger (a truck).
In the context of multi-objective RL (Hayes et al., 2022), the goal can be formulated as one of learning
a policy for which the average return vector belongs to a target set (hence the term “multi-criteria”
RL), which existing work has treated as a stochastic game (Mannor and Shimkin, 2001, 2004). Other
works seek to maximize (in expectation) a scalar version of the reward that may correspond to a
weighted sum of the multiple objectives (Barrett and Narayanan, 2008; Chen et al., 2019) as we
consider here, or a nonlinear function of the objectives (Cheung, 2019). Multi-objective online
learning has also been studied, see (Mannor et al., 2014; Blum and Shao, 2020) for example.
The parameters that define this scalarization function (e.g., the relative objective weights) are often
unknown and vary with the task setting or user. In this case, preference learning (Wirth et al., 2017a)
is commonly used to elicit the value of these parameters. Doumpos and Zopounidis (2007) describe
an approach to eliciting a user’s relative weighting in the context of multi-objective decision-making.
Bhatia et al. (2020) learn preferences over multiple objectives from pairwise queries using a game-
theoretic approach to identify optimal randomized policies. In the context of RL involving both
scalar and vector-valued objectives, user preference queries provide an alternative to learning from
demonstrations, which may be difficult for people to provide (e.g., in the case of robots with high
degrees-of-freedom), or explicit reward specifications (Cheng et al., 2011; Rothkopf and Dimitrakakis,
2011; Akrour et al., 2012; Wilson et al., 2012; Fürnkranz et al., 2012; Jain et al., 2015; Wirth et al.,
2016; Christiano et al., 2017; Wirth et al., 2017b; Ibarz et al., 2018; Lee et al., 2021). These works
typically assume that noisy human preferences over a pair of trajectories are correlated with the
difference in their utilities (i.e., the reward acts as a latent term predictive of preference). Many
contemporary methods estimate the latent reward by minimizing the cross-entropy loss between the
reward-based predictions and the human-provided preferences (i.e., finding the reward that maximizes
the likelihood of the observed preferences) (Christiano et al., 2017; Ibarz et al., 2018; Lee et al., 2021;
Knox et al., 2022; Pacchiano et al., 2022).
Wilson et al. (2012) describe a Bayesian approach to policy learning whereby they query a user
for their preference between a pair of trajectories and use these preferences to maintain a posterior
distribution over the latent policy parameters. The task of choosing the most informative queries
is challenging due to the continuous space of trajectories, and is generally NP-hard (Ailon, 2012).
Instead, they assume access to a distribution over trajectories that accounts for their feasibility and
relevance to the target policy, and they describe two heuristic approaches to selecting trajectory
queries based on this distribution. Finally, Sadigh et al. (2017) describe an approach to active
preference-based learning in continuous state and action spaces. Integral to their work is the ability
to synthesize dynamically feasible trajectory queries. Biyik and Sadigh (2018) extend this approach
to the batch query setting.
Comparative feedback in other problems  Comparative feedback has been studied for other problems in learning theory, e.g., learning combinatorial functions (Balcan et al., 2016). One closely related problem is active ranking/learning using pairwise comparisons (Jamieson and Nowak, 2011; Kane et al., 2017; Saha et al., 2021; Yona et al., 2022). These works usually consider a given finite sample of points. Kane et al. (2017) imply a lower bound stating that the number of comparisons must be linear in the cardinality of the set, even if the points satisfy the linear structural constraint we assume in this work. In our work, the points are value vectors generated by running different policies in the same MDP and thus have a specific structure; besides, we allow comparisons of policies that are not in the policy set. Thus, we are able to obtain a query complexity that is sublinear in the number of policies. Another related problem using comparative/preference feedback is dueling bandits, which aim to learn through pairwise feedback (Ailon et al., 2014; Zoghi et al., 2014) (see also Bengs et al. (2021); Sui et al. (2018) for surveys), or more generally any subsetwise feedback (Sui et al., 2017; Saha and Gopalan, 2018, 2019; Ren et al., 2018). However, unlike dueling bandits, we consider noiseless comparative feedback.

B Proof of Lemma 1

Lemma 1. When $\epsilon \le \frac{v^*}{2k}$, we have $\frac{v^*}{\langle w^*, V^{\pi_1}\rangle} \le 2k$, and $|\hat{\alpha}_i - \alpha_i| \le \frac{4k^2\epsilon}{v^*}$ for every $i \in [d]$.

Proof. We first show that, according to our algorithm (lines 1-6 of Algorithm 1), the returned $\pi_1$ satisfies
\[
\langle w^*, V^{\pi_1}\rangle \ge \max_{i\in[k]} \big\langle w^*, V^{\pi^{e_i}}\big\rangle - \epsilon\,.
\]
This can be proved by induction over $k$. In the base case of $k = 2$, it is easy to see that the returned $\pi_1$ satisfies the above inequality. Suppose the above inequality holds for any $k \le n-1$; we prove that it also holds for $k = n$. After running the algorithm over $j = 2, \ldots, k-1$ (line 2), the returned policy $\tilde{\pi}^*$ satisfies
\[
\big\langle w^*, V^{\tilde{\pi}^*}\big\rangle \ge \max_{i\in[k-1]} \big\langle w^*, V^{\pi^{e_i}}\big\rangle - \epsilon\,.
\]
Then there are two cases:
• If $\big\langle w^*, V^{\tilde{\pi}^*}\big\rangle < \big\langle w^*, V^{\pi^{e_k}}\big\rangle - \epsilon$, we return $\pi_1 = \pi^{e_k}$, and also $\big\langle w^*, V^{\pi^{e_k}}\big\rangle \ge \big\langle w^*, V^{\pi^{e_i}}\big\rangle - \epsilon$ for all $i \in [k]$.
• If $\big\langle w^*, V^{\tilde{\pi}^*}\big\rangle \ge \big\langle w^*, V^{\pi^{e_k}}\big\rangle - \epsilon$, we return $\tilde{\pi}^*$, which satisfies $\big\langle w^*, V^{\tilde{\pi}^*}\big\rangle \ge \max_{i\in[k]} \big\langle w^*, V^{\pi^{e_i}}\big\rangle - \epsilon$.
As $\pi^{e_i}$ is the optimal personalized policy when the user's preference vector is $e_i$, we have that
\[
v^* = \big\langle w^*, V^{\pi^*}\big\rangle = \sum_{i=1}^{k} w_i^* \big\langle V^{\pi^*}, e_i\big\rangle \le \sum_{i=1}^{k} w_i^* \big\langle V^{\pi^{e_i}}, e_i\big\rangle \le \Big\langle w^*, \sum_{i=1}^{k} V^{\pi^{e_i}}\Big\rangle\,,
\]
where the last inequality holds because the entries of $V^{\pi^{e_i}}$ and $w^*$ are non-negative. Therefore, there exists $i \in [k]$ such that $\big\langle w^*, V^{\pi^{e_i}}\big\rangle \ge \frac{1}{k}\big\langle w^*, V^{\pi^*}\big\rangle = \frac{1}{k}v^*$.
Then we have
\[
\langle w^*, V^{\pi_1}\rangle \ge \max_{i\in[k]} \big\langle w^*, V^{\pi^{e_i}}\big\rangle - \epsilon \ge \frac{1}{k}v^* - \epsilon \ge \frac{1}{2k}v^*
\]
when $\epsilon \le \frac{v^*}{2k}$. By rearranging terms, we have $\frac{v^*}{\langle w^*, V^{\pi_1}\rangle} \le 2k$.
By setting $C_\alpha = 2k$, we have $|\hat{\alpha}_i - \alpha_i|\,\langle w^*, V^{\pi_1}\rangle \le C_\alpha\epsilon = 2k\epsilon$ and thus $|\hat{\alpha}_i - \alpha_i| \le \frac{4k^2\epsilon}{v^*}$.

C Pseudo Code of Computation of the Basis Ratios

The pseudo code for searching for the $\hat{\alpha}_i$'s is described in Algorithm 3.

Algorithm 3 Computation of Basis Ratios
1: input: $(V^{\pi_1}, \ldots, V^{\pi_d})$ and $C_\alpha$
2: for $i = 1, \ldots, d-1$ do
3:   let $l = 0$, $h = 2C_\alpha$ and $\hat{\alpha}_i = C_\alpha$
4:   while True do
5:     if $\hat{\alpha}_i > 1$ then
6:       compare $\pi_1$ and $\frac{1}{\hat{\alpha}_i}\pi_{i+1} + (1 - \frac{1}{\hat{\alpha}_i})\pi_0$; if $\pi_1 \succ \frac{1}{\hat{\alpha}_i}\pi_{i+1} + (1 - \frac{1}{\hat{\alpha}_i})\pi_0$ then $h \leftarrow \hat{\alpha}_i$, $\hat{\alpha}_i \leftarrow \frac{l+h}{2}$; if $\pi_1 \prec \frac{1}{\hat{\alpha}_i}\pi_{i+1} + (1 - \frac{1}{\hat{\alpha}_i})\pi_0$ then $l \leftarrow \hat{\alpha}_i$, $\hat{\alpha}_i \leftarrow \frac{l+h}{2}$
7:     else
8:       compare $\pi_{i+1}$ and $\hat{\alpha}_i\pi_1 + (1 - \hat{\alpha}_i)\pi_0$; if $\hat{\alpha}_i\pi_1 + (1 - \hat{\alpha}_i)\pi_0 \succ \pi_{i+1}$ then $h \leftarrow \hat{\alpha}_i$, $\hat{\alpha}_i \leftarrow \frac{l+h}{2}$; if $\hat{\alpha}_i\pi_1 + (1 - \hat{\alpha}_i)\pi_0 \prec \pi_{i+1}$ then $l \leftarrow \hat{\alpha}_i$, $\hat{\alpha}_i \leftarrow \frac{l+h}{2}$
9:     end if
10:    if "indistinguishable" is returned then
11:      break
12:    end if
13:  end while
14: end for
15: output: $(\hat{\alpha}_1, \ldots, \hat{\alpha}_{d-1})$
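For concreteness, the binary search above can be sketched in Python. This is only an illustrative sketch of Algorithm 3, not the authors' implementation: the comparison oracle compare(a, b) (returning ">", "<", or "indistinguishable") and the policy-mixing helper mix(...) are hypothetical interfaces introduced here.

def compute_basis_ratios(policies, pi0, compare, mix, C_alpha):
    """policies = [pi_1, ..., pi_d]; returns estimates (alpha_hat_1, ..., alpha_hat_{d-1})."""
    pi1 = policies[0]
    alpha_hat = []
    for i in range(1, len(policies)):          # policies[i] plays the role of pi_{i+1}
        lo, hi, a = 0.0, 2.0 * C_alpha, C_alpha
        while True:
            if a > 1:
                # compare pi_1 with (1/a) * pi_{i+1} + (1 - 1/a) * pi_0
                ans = compare(pi1, mix([(1.0 / a, policies[i]), (1.0 - 1.0 / a, pi0)]))
                if ans == "indistinguishable":
                    break
                if ans == ">":        # pi_1 preferred: current guess a is too large
                    hi = a
                else:                 # mixture preferred: a is too small
                    lo = a
            else:
                # compare pi_{i+1} with a * pi_1 + (1 - a) * pi_0
                ans = compare(policies[i], mix([(a, pi1), (1.0 - a, pi0)]))
                if ans == "indistinguishable":
                    break
                if ans == "<":        # mixture preferred: a is too large
                    hi = a
                else:                 # pi_{i+1} preferred: a is too small
                    lo = a
            a = (lo + hi) / 2.0
        alpha_hat.append(a)
    return alpha_hat

Each answered query halves the search interval $[l, h]$, which is the source of the logarithmic dependence on $1/\epsilon$ in the overall query complexity.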

D Proof of Theorem 1
Theorem 1. Consider the algorithm of computing $\hat{A}$ and any solution $\hat{w}$ to $\hat{A}x = e_1$, and outputting the policy $\pi^{\hat{w}} = \arg\max_{\pi\in\Pi}\langle\hat{w}, V^\pi\rangle$, which is the optimal personalized policy for the preference vector $\hat{w}$. Then the output policy $\pi^{\hat{w}}$ satisfies $v^* - \big\langle w^*, V^{\pi^{\hat{w}}}\big\rangle \le O\big((\sqrt{k}+1)^{d+\frac{14}{3}}\,\epsilon^{\frac{1}{3}}\big)$ using $O(k\log(k/\epsilon))$ comparison queries.

Proof. Theorem 1 follows by setting $\epsilon_\alpha = \frac{4k^2\epsilon}{v^*}$ and $C_\alpha = 2k$ as shown in Lemma 1 and combining the results of Lemmas 2 and 3 with the triangle inequality.
Specifically, for any policy $\pi$, we have
\begin{align*}
|\langle\hat{w}, V^\pi\rangle - \langle w', V^\pi\rangle| &\le \big|\langle\hat{w}, V^\pi\rangle - \big\langle\hat{w}^{(\delta)}, V^\pi\big\rangle\big| + \big|\big\langle\hat{w}^{(\delta)}, V^\pi\big\rangle - \langle w', V^\pi\rangle\big|\\
&\le O\Big((\sqrt{k}+1)^{d-d_\delta} C_\alpha\Big(\frac{C_\alpha C_V^4 d_\delta^{3/2}\|w'\|_2^2\,\epsilon_\alpha}{\delta^2} + \sqrt{k}\,\delta\|w'\|_2\Big)\Big) \quad (7)\\
&\le O\Big((\sqrt{k}+1)^{d-d_\delta+3}\|w'\|_2\Big(\frac{C_V^4 k^4\|w'\|_2\,\epsilon}{v^*\delta^2} + \delta\Big)\Big)\,.
\end{align*}
Since $\|w'\|_2 = \frac{\|w^*\|_2}{\langle w^*, V^{\pi_1}\rangle}$ and $\langle w^*, V^{\pi_1}\rangle \ge \frac{v^*}{2k}$ from Lemma 1, we derive
\begin{align*}
v^* - \big\langle w^*, V^{\pi^{\hat{w}}}\big\rangle &= \langle w^*, V^{\pi_1}\rangle\Big(\big\langle w', V^{\pi^*}\big\rangle - \big\langle w', V^{\pi^{\hat{w}}}\big\rangle\Big)\\
&\le \langle w^*, V^{\pi_1}\rangle\Big(\big\langle\hat{w}, V^{\pi^*}\big\rangle - \big\langle\hat{w}, V^{\pi^{\hat{w}}}\big\rangle + O\Big((\sqrt{k}+1)^{d-d_\delta+3}\|w'\|_2\Big(\frac{C_V^4 k^4\|w'\|_2\,\epsilon}{v^*\delta^2} + \delta\Big)\Big)\Big)\\
&\le O\Big(\langle w^*, V^{\pi_1}\rangle(\sqrt{k}+1)^{d-d_\delta+3}\|w'\|_2\Big(\frac{C_V^4 k^4\|w'\|_2\,\epsilon}{v^*\delta^2} + \delta\Big)\Big)\\
&= O\Big((\sqrt{k}+1)^{d-d_\delta+3}\|w^*\|_2\Big(\frac{C_V^4 k^5\|w^*\|_2\,\epsilon}{v^{*2}\delta^2} + \delta\Big)\Big)\\
&= O\Big(\Big(\frac{C_V^2\|w^*\|_2^2}{v^*}\Big)^{\frac{2}{3}}(\sqrt{k}+1)^{d+\frac{16}{3}}\,\epsilon^{\frac{1}{3}}\Big)\,.
\end{align*}
The first inequality follows from
\[
\big\langle w', V^{\pi^*}\big\rangle - \big\langle w', V^{\pi^{\hat{w}}}\big\rangle = \big\langle w', V^{\pi^*}\big\rangle - \big\langle\hat{w}, V^{\pi^*}\big\rangle + \big\langle\hat{w}, V^{\pi^*}\big\rangle - \big\langle\hat{w}, V^{\pi^{\hat{w}}}\big\rangle + \big\langle\hat{w}, V^{\pi^{\hat{w}}}\big\rangle - \big\langle w', V^{\pi^{\hat{w}}}\big\rangle
\]
and applying (7) twice, once for $\pi^*$ and once for $\pi^{\hat{w}}$. The second inequality uses $\big\langle\hat{w}, V^{\pi^*}\big\rangle - \big\langle\hat{w}, V^{\pi^{\hat{w}}}\big\rangle \le 0$, which holds since $\pi^{\hat{w}}$ maximizes $\langle\hat{w}, V^\pi\rangle$ over $\Pi$. The last equality follows by setting $\delta = \Big(\frac{C_V^4 k^5\|w^*\|_2\,\epsilon}{v^{*2}}\Big)^{\frac{1}{3}}$.

E Proof of Lemma 2
Lemma 2. If $|\hat{\alpha}_i - \alpha_i| \le \epsilon_\alpha$ and $\alpha_i \le C_\alpha$ for all $i \in [d-1]$, then for every $\delta \ge 4C_\alpha^{\frac{2}{3}} C_V d^{\frac{1}{3}} \epsilon_\alpha^{\frac{1}{3}}$ we have $\big|\big\langle\hat{w}^{(\delta)}, V^\pi\big\rangle - \langle w', V^\pi\rangle\big| \le O\Big(\frac{C_\alpha C_V^4 d_\delta^{3/2}\|w'\|_2^2\,\epsilon_\alpha}{\delta^2} + \sqrt{k}\,\delta\|w'\|_2\Big)$ for all $\pi$, where $w' = \frac{w^*}{\langle w^*, V^{\pi_1}\rangle}$.

To prove Lemma 2, we first define a matrix $A^{(\mathrm{full})}$.
Given the output $(V^{\pi_1}, \ldots, V^{\pi_d})$ of Algorithm 1, we have $\operatorname{rank}(\operatorname{span}(\{V^{\pi_1}, \ldots, V^{\pi_{d_\delta}}\})) = d_\delta$.
Let $b_1, \ldots, b_{d-d_\delta}$ be a set of orthonormal vectors that are orthogonal to $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{d_\delta}})$ and together with $V^{\pi_1}, \ldots, V^{\pi_{d_\delta}}$ form a basis for $\operatorname{span}(\{V^\pi \mid \pi \in \Pi\})$.
We define $A^{(\mathrm{full})} \in \mathbb{R}^{d \times k}$ as the matrix obtained by replacing the last $d - d_\delta$ rows of $A$ with $b_1, \ldots, b_{d-d_\delta}$, i.e.,
\[
A^{(\mathrm{full})} = \begin{pmatrix}
(V^{\pi_1})^\top \\
(\alpha_1 V^{\pi_1} - V^{\pi_2})^\top \\
\vdots \\
(\alpha_{d_\delta-1} V^{\pi_1} - V^{\pi_{d_\delta}})^\top \\
b_1^\top \\
\vdots \\
b_{d-d_\delta}^\top
\end{pmatrix}.
\]
Observation 1. We have that $\operatorname{span}(A^{(\mathrm{full})}) = \operatorname{span}(\{V^\pi \mid \pi \in \Pi\})$ and $\operatorname{rank}(A^{(\mathrm{full})}) = d$.

Lemma 5. For all $w \in \mathbb{R}^k$ satisfying $A^{(\mathrm{full})} w = e_1$, we have $|w \cdot V^\pi - w' \cdot V^\pi| \le \sqrt{k}\,\delta\|w'\|_2$ for all $\pi$.

We then show that there exists a $w \in \mathbb{R}^k$ satisfying $A^{(\mathrm{full})} w = e_1$ such that $\big|\hat{w}^{(\delta)} \cdot V^\pi - w \cdot V^\pi\big|$ is small for all $\pi \in \Pi$.

Lemma 6. If $|\hat{\alpha}_i - \alpha_i| \le \epsilon_\alpha$ and $\alpha_i \le C_\alpha$ for all $i \in [d-1]$, then for every $\delta \ge 4C_\alpha^{\frac{2}{3}} C_V d^{\frac{1}{3}} \epsilon_\alpha^{\frac{1}{3}}$ there exists a $w \in \mathbb{R}^k$ satisfying $A^{(\mathrm{full})} w = e_1$ s.t. $\big|\hat{w}^{(\delta)} \cdot V^\pi - w \cdot V^\pi\big| \le O\Big(\frac{C_\alpha C_V^4 d_\delta^{3/2}\|w'\|_2^2\,\epsilon_\alpha}{\delta^2}\Big)$ for all $\pi$.

We now derive Lemma 2 from the above two lemmas.

Proof of Lemma 2. Let $w$ be as defined in Lemma 6. Then for any policy $\pi$, we have
\[
\big|\hat{w}^{(\delta)} \cdot V^\pi - w' \cdot V^\pi\big| \le \big|\hat{w}^{(\delta)} \cdot V^\pi - w \cdot V^\pi\big| + |w \cdot V^\pi - w' \cdot V^\pi| \le O\Big(\frac{C_\alpha C_V^4 d_\delta^{3/2}\|w'\|_2^2\,\epsilon_\alpha}{\delta^2} + \sqrt{k}\,\delta\|w'\|_2\Big)\,,
\]
by applying Lemmas 5 and 6.
F Proofs of Lemma 5 and Lemma 6



Lemma 5. For all $w \in \mathbb{R}^k$ satisfying $A^{(\mathrm{full})} w = e_1$, we have $|w \cdot V^\pi - w' \cdot V^\pi| \le \sqrt{k}\,\delta\|w'\|_2$ for all $\pi$.

Proof of Lemma 5. Since $\operatorname{span}(A^{(\mathrm{full})}) = \operatorname{span}(\{V^\pi \mid \pi \in \Pi\})$, for every policy $\pi$ the value vector can be represented as a linear combination of the row vectors of $A^{(\mathrm{full})}$, i.e., there exists $a = (a_1, \ldots, a_d) \in \mathbb{R}^d$ s.t.
\[
V^\pi = \sum_{i=1}^{d} a_i A_i^{(\mathrm{full})} = A^{(\mathrm{full})\top} a\,. \quad (8)
\]
Now, for any unit vector $\xi \in \operatorname{span}(b_1, \ldots, b_{d-d_\delta})$, we have $|\langle V^\pi, \xi\rangle| \le \sqrt{k}\,\delta$.
The reason is that at round $d_\delta+1$, we pick an orthonormal basis $\rho_1, \ldots, \rho_{k-d_\delta}$ of $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{d_\delta}})^\perp$ (line 8 in Algorithm 1) and pick $u_{d_\delta+1}$ to be the direction in which there exists a policy with the largest component, as described in line 9. Hence, $|\langle\rho_j, V^\pi\rangle| \le \delta$ for all $j \in [k-d_\delta]$.
It then follows from the Cauchy-Schwarz inequality that $|\langle\xi, V^\pi\rangle| = \big|\sum_{j=1}^{k-d_\delta}\langle\xi, \rho_j\rangle\langle\rho_j, V^\pi\rangle\big| \le \sqrt{k}\,\delta$.
Combining this with the observation that $b_1, \ldots, b_{d-d_\delta}$ are pairwise orthogonal and that each of them is orthogonal to $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{d_\delta}})$, we have
\[
\sum_{i=d_\delta+1}^{d} a_i^2 = \Big\langle V^\pi, \sum_{i=d_\delta+1}^{d} a_i b_{i-d_\delta}\Big\rangle \le \sqrt{\sum_{i=d_\delta+1}^{d} a_i^2}\;\sqrt{k}\,\delta\,,
\]
which implies that
\[
\sqrt{\sum_{i=d_\delta+1}^{d} a_i^2} \le \sqrt{k}\,\delta\,. \quad (9)
\]
Since $w'$ satisfies $Aw' = e_1$, we have
\[
A^{(\mathrm{full})} w' = \big(1, 0, \ldots, 0, \langle b_1, w'\rangle, \ldots, \langle b_{d-d_\delta}, w'\rangle\big)\,.
\]
For any $w \in \mathbb{R}^k$ satisfying $A^{(\mathrm{full})} w = e_1$, consider $\tilde{w} = w + \sum_{i=1}^{d-d_\delta}\langle b_i, w'\rangle b_i$. Then we have $A^{(\mathrm{full})}\tilde{w} = A^{(\mathrm{full})} w'$.
Thus, applying (8) twice, we get
\[
\tilde{w}\cdot V^\pi = \tilde{w}^\top A^{(\mathrm{full})\top} a = w'^\top A^{(\mathrm{full})\top} a = w'\cdot V^\pi\,.
\]
Hence,
\[
|w\cdot V^\pi - w'\cdot V^\pi| = |w\cdot V^\pi - \tilde{w}\cdot V^\pi| \overset{(a)}{=} \Big|\sum_{i=1}^{d} a_i(w - \tilde{w})\cdot A_i^{(\mathrm{full})}\Big| = \Big|\sum_{i=d_\delta+1}^{d} a_i\langle b_{i-d_\delta}, w'\rangle\Big| \overset{(b)}{\le} \sqrt{\sum_{i=d_\delta+1}^{d} a_i^2}\;\|w'\|_2 \overset{(c)}{\le} \sqrt{k}\,\delta\|w'\|_2\,,
\]
where equality (a) follows from (8), inequality (b) from Cauchy-Schwarz, and inequality (c) from applying (9).

Lemma 6. If $|\hat{\alpha}_i - \alpha_i| \le \epsilon_\alpha$ and $\alpha_i \le C_\alpha$ for all $i \in [d-1]$, then for every $\delta \ge 4C_\alpha^{\frac{2}{3}} C_V d^{\frac{1}{3}} \epsilon_\alpha^{\frac{1}{3}}$ there exists a $w \in \mathbb{R}^k$ satisfying $A^{(\mathrm{full})} w = e_1$ s.t. $\big|\hat{w}^{(\delta)} \cdot V^\pi - w \cdot V^\pi\big| \le O\Big(\frac{C_\alpha C_V^4 d_\delta^{3/2}\|w'\|_2^2\,\epsilon_\alpha}{\delta^2}\Big)$ for all $\pi$.

Before proving Lemma 6, we introduce some notation and a claim.
• For any $x, y \in \mathbb{R}^k$, let $\theta(x, y)$ denote the angle between $x$ and $y$.
• For any subspace $U \subset \mathbb{R}^k$, let $\theta(x, U) := \min_{y \in U}\theta(x, y)$.
• For any two subspaces $U, U' \subset \mathbb{R}^k$, we define $\theta(U, U') = \max_{x \in U}\min_{y \in U'}\theta(x, y)$.
• For any matrix $M$, let $M_i$ denote the $i$-th row vector of $M$, $M_{i:j}$ denote the submatrix of $M$ composed of rows $i, i+1, \ldots, j$, and $M_{i:}$ denote the submatrix composed of all rows $j \ge i$.
• Let $\operatorname{span}(M)$ denote the span of the rows of $M$.
Recall that $\hat{A} \in \mathbb{R}^{d \times k}$ is defined as
\[
\hat{A} = \begin{pmatrix}(V^{\pi_1})^\top\\ (\hat{\alpha}_1 V^{\pi_1} - V^{\pi_2})^\top\\ \vdots\\ (\hat{\alpha}_{d-1} V^{\pi_1} - V^{\pi_d})^\top\end{pmatrix},
\]
which is the approximation of the matrix $A \in \mathbb{R}^{d \times k}$ defined by the true values of the $\alpha_i$'s, i.e.,
\[
A = \begin{pmatrix}(V^{\pi_1})^\top\\ (\alpha_1 V^{\pi_1} - V^{\pi_2})^\top\\ \vdots\\ (\alpha_{d-1} V^{\pi_1} - V^{\pi_d})^\top\end{pmatrix}.
\]
We denote by $\hat{A}^{(\delta)} = \hat{A}_{1:d_\delta}$ and $A^{(\delta)} = A_{1:d_\delta} \in \mathbb{R}^{d_\delta \times k}$ the sub-matrices comprised of the first $d_\delta$ rows of $\hat{A}$ and $A$, respectively.
Claim 1. If $|\hat{\alpha}_i - \alpha_i| \le \epsilon_\alpha$ and $\alpha_i \le C_\alpha$ for all $i \in [d-1]$, then for every $\delta \ge 4C_\alpha^{\frac{2}{3}} C_V d^{\frac{1}{3}} \epsilon_\alpha^{\frac{1}{3}}$, we have
\[
\theta\big(\operatorname{span}(A_{2:}^{(\delta)}), \operatorname{span}(\hat{A}_{2:}^{(\delta)})\big) \le \eta_{\epsilon_\alpha,\delta}\,, \quad (10)
\]
and
\[
\theta\big(\operatorname{span}(\hat{A}_{2:}^{(\delta)}), \operatorname{span}(A_{2:}^{(\delta)})\big) \le \eta_{\epsilon_\alpha,\delta}\,, \quad (11)
\]
where $\eta_{\epsilon_\alpha,\delta} = \frac{4C_\alpha C_V^2 d_\delta\epsilon_\alpha}{\delta^2}$.
To prove the above claim, we use the following lemma by Balcan et al. (2015).
Lemma 7 (Lemma 3 of Balcan et al. (2015)). Let $U_l = \operatorname{span}(\xi_1, \ldots, \xi_l)$ and $\hat{U}_l = \operatorname{span}(\hat{\xi}_1, \ldots, \hat{\xi}_l)$. Let $\epsilon_{acc}, \gamma_{new} \ge 0$ with $\epsilon_{acc} \le \gamma_{new}^2/(10l)$, and assume that $\theta(\hat{\xi}_i, \hat{U}_{i-1}) \ge \gamma_{new}$ for $i = 2, \ldots, l$, and that $\theta(\xi_i, \hat{\xi}_i) \le \epsilon_{acc}$ for $i = 1, \ldots, l$. Then,
\[
\theta(U_l, \hat{U}_l) \le \frac{2l\,\epsilon_{acc}}{\gamma_{new}}\,.
\]

Proof of Claim 1. For all $2 \le i \le d_\delta$, we have that
\[
\theta\big(\hat{A}_i^{(\delta)}, \operatorname{span}(\hat{A}_{2:i-1}^{(\delta)})\big) \ge \theta\big(\hat{A}_i^{(\delta)}, \operatorname{span}(\hat{A}_{1:i-1}^{(\delta)})\big) \ge \sin\Big(\theta\big(\hat{A}_i^{(\delta)}, \operatorname{span}(\hat{A}_{1:i-1}^{(\delta)})\big)\Big) \overset{(a)}{\ge} \frac{\big|\hat{A}_i^{(\delta)} \cdot u_i\big|}{\big\|\hat{A}_i^{(\delta)}\big\|_2} \overset{(b)}{\ge} \frac{\delta}{\big\|\hat{A}_i^{(\delta)}\big\|_2} = \frac{\delta}{\|\hat{\alpha}_{i-1} V^{\pi_1} - V^{\pi_i}\|_2} \ge \frac{\delta}{(C_\alpha+1)C_V}\,,
\]
where inequality (a) holds as $u_i$ is orthogonal to $\operatorname{span}(\hat{A}_{1:i-1}^{(\delta)})$ according to line 8 of Algorithm 1, and inequality (b) holds due to $\big|\hat{A}_i^{(\delta)} \cdot u_i\big| = |V^{\pi_i} \cdot u_i| \ge \delta$. The last inequality holds due to $\|\hat{\alpha}_{i-1} V^{\pi_1} - V^{\pi_i}\|_2 \le \hat{\alpha}_{i-1}\|V^{\pi_1}\|_2 + \|V^{\pi_i}\|_2 \le (C_\alpha+1)C_V$.
Similarly, we also have
\[
\theta\big(A_i^{(\delta)}, \operatorname{span}(A_{2:i-1}^{(\delta)})\big) \ge \frac{\delta}{(C_\alpha+1)C_V}\,.
\]
We continue by decomposing $V^{\pi_i}$ into the direction of $V^{\pi_1}$ and the direction perpendicular to $V^{\pi_1}$. For convenience, we denote $v_i^{\parallel} := V^{\pi_i} \cdot \frac{V^{\pi_1}}{\|V^{\pi_1}\|_2}$, $V_i^{\parallel} := v_i^{\parallel}\frac{V^{\pi_1}}{\|V^{\pi_1}\|_2}$, $V_i^{\perp} := V^{\pi_i} - V_i^{\parallel}$ and $v_i^{\perp} := \|V_i^{\perp}\|_2$.
Then we have
\[
\theta\big(A_i^{(\delta)}, \hat{A}_i^{(\delta)}\big) = \theta\big(\alpha_{i-1}V^{\pi_1} - V^{\pi_i},\ \hat{\alpha}_{i-1}V^{\pi_1} - V^{\pi_i}\big) = \theta\big(\alpha_{i-1}V^{\pi_1} - V_i^{\parallel} - V_i^{\perp},\ \hat{\alpha}_{i-1}V^{\pi_1} - V_i^{\parallel} - V_i^{\perp}\big)\,.
\]
If $(\hat{\alpha}_{i-1}V^{\pi_1} - V_i^{\parallel}) \cdot (\alpha_{i-1}V^{\pi_1} - V_i^{\parallel}) \ge 0$, i.e., $\hat{\alpha}_{i-1}V^{\pi_1} - V_i^{\parallel}$ and $\alpha_{i-1}V^{\pi_1} - V_i^{\parallel}$ point in the same direction, then
\begin{align*}
\theta\big(A_i^{(\delta)}, \hat{A}_i^{(\delta)}\big) &= \Big|\arctan\frac{\|\hat{\alpha}_{i-1}V^{\pi_1} - V_i^{\parallel}\|_2}{v_i^{\perp}} - \arctan\frac{\|\alpha_{i-1}V^{\pi_1} - V_i^{\parallel}\|_2}{v_i^{\perp}}\Big|\\
&\le \Big|\frac{\|\hat{\alpha}_{i-1}V^{\pi_1} - V_i^{\parallel}\|_2}{v_i^{\perp}} - \frac{\|\alpha_{i-1}V^{\pi_1} - V_i^{\parallel}\|_2}{v_i^{\perp}}\Big| \quad (12)\\
&= \frac{|\hat{\alpha}_{i-1} - \alpha_{i-1}|\,\|V^{\pi_1}\|_2}{v_i^{\perp}}\\
&\le \frac{\epsilon_\alpha C_V}{\delta}\,, \quad (13)
\end{align*}
where inequality (12) follows from the fact that the derivative of $\arctan$ is at most 1, i.e., $\frac{\partial \arctan x}{\partial x} = \lim_{a\to x}\frac{\arctan a - \arctan x}{a - x} = \frac{1}{1+x^2} \le 1$. Inequality (13) holds since $v_i^{\perp} \ge |\langle V^{\pi_i}, u_i\rangle| \ge \delta$.
If $(\hat{\alpha}_{i-1}V^{\pi_1} - V_i^{\parallel}) \cdot (\alpha_{i-1}V^{\pi_1} - V_i^{\parallel}) < 0$, i.e., $\hat{\alpha}_{i-1}V^{\pi_1} - V_i^{\parallel}$ and $\alpha_{i-1}V^{\pi_1} - V_i^{\parallel}$ point in opposite directions, then we have $\|\hat{\alpha}_{i-1}V^{\pi_1} - V_i^{\parallel}\|_2 + \|\alpha_{i-1}V^{\pi_1} - V_i^{\parallel}\|_2 = \|(\hat{\alpha}_{i-1} - \alpha_{i-1})V^{\pi_1}\|_2 \le \epsilon_\alpha\|V^{\pi_1}\|_2$.
Similarly, we have
\begin{align*}
\theta\big(\hat{A}_i^{(\delta)}, A_i^{(\delta)}\big) &= \arctan\frac{\|\hat{\alpha}_{i-1}V^{\pi_1} - V_i^{\parallel}\|_2}{v_i^{\perp}} + \arctan\frac{\|\alpha_{i-1}V^{\pi_1} - V_i^{\parallel}\|_2}{v_i^{\perp}}\\
&\le \frac{\|\hat{\alpha}_{i-1}V^{\pi_1} - V_i^{\parallel}\|_2}{v_i^{\perp}} + \frac{\|\alpha_{i-1}V^{\pi_1} - V_i^{\parallel}\|_2}{v_i^{\perp}}\\
&\le \frac{\epsilon_\alpha\|V^{\pi_1}\|_2}{v_i^{\perp}} \le \frac{\epsilon_\alpha C_V}{\delta}\,.
\end{align*}
By applying Lemma 7 with $\epsilon_{acc} = \frac{\epsilon_\alpha C_V}{\delta}$, $\gamma_{new} = \frac{\delta}{(C_\alpha+1)C_V}$, $(\xi_i, \hat{\xi}_i) = (A_{i+1}, \hat{A}_{i+1})$ (and $(\xi_i, \hat{\xi}_i) = (\hat{A}_{i+1}, A_{i+1})$), we have that when $\delta \ge 10^{\frac{1}{3}}(C_\alpha+1)^{\frac{2}{3}} C_V d^{\frac{1}{3}}\epsilon_\alpha^{\frac{1}{3}}$,
\[
\theta\big(\operatorname{span}(A_{2:}^{(\delta)}), \operatorname{span}(\hat{A}_{2:}^{(\delta)})\big) \le \frac{2d_\delta(C_\alpha+1)C_V^2\epsilon_\alpha}{\delta^2} \le \eta_{\epsilon_\alpha,\delta}\,,
\]
and
\[
\theta\big(\operatorname{span}(\hat{A}_{2:}^{(\delta)}), \operatorname{span}(A_{2:}^{(\delta)})\big) \le \eta_{\epsilon_\alpha,\delta}\,.
\]
The last step uses $C_\alpha \ge 1$, so that $2(C_\alpha+1) \le 4C_\alpha$; this completes the proof of Claim 1.

Proof of Lemma 6. Recall that $\hat{w}^{(\delta)} = \arg\min_{\hat{A}^{(\delta)}x = e_1}\|x\|_2$ is the minimum norm solution to $\hat{A}^{(\delta)}x = e_1$.
Thus, $\big\langle\hat{w}^{(\delta)}, b_i\big\rangle = 0$ for all $i \in [d-d_\delta]$ (the minimum norm solution lies in the row span of $\hat{A}^{(\delta)}$, which is orthogonal to each $b_i$).
Let $\lambda_1, \ldots, \lambda_{d_\delta-1}$ be any orthonormal basis of $\operatorname{span}(A_{2:}^{(\delta)})$.
We construct a vector $w$ satisfying $A^{(\mathrm{full})}w = e_1$ by removing $\hat{w}^{(\delta)}$'s component in $\operatorname{span}(A_{2:}^{(\delta)})$ and rescaling. Formally,
\[
w := \frac{\hat{w}^{(\delta)} - \sum_{i=1}^{d_\delta-1}\big\langle\hat{w}^{(\delta)}, \lambda_i\big\rangle\lambda_i}{1 - V^{\pi_1}\cdot\big(\sum_{i=1}^{d_\delta-1}\big\langle\hat{w}^{(\delta)}, \lambda_i\big\rangle\lambda_i\big)}\,. \quad (14)
\]
It is direct to verify that $A_1 \cdot w = V^{\pi_1}\cdot w = 1$ and $A_i \cdot w = 0$ for $i = 2, \ldots, d_\delta$. As a result, $A^{(\delta)}w = e_1$.
Combining this with the fact that $\hat{w}^{(\delta)}$ has zero component along $b_i$ for all $i \in [d-d_\delta]$, we have $A^{(\mathrm{full})}w = e_1$.
According to Claim 1, we have
\[
\theta\big(\operatorname{span}(A_{2:}^{(\delta)}), \operatorname{span}(\hat{A}_{2:}^{(\delta)})\big) \le \eta_{\epsilon_\alpha,\delta}\,.
\]
Thus, there exist unit vectors $\tilde{\lambda}_1, \ldots, \tilde{\lambda}_{d_\delta-1} \in \operatorname{span}(\hat{A}_{2:}^{(\delta)})$ such that $\theta(\lambda_i, \tilde{\lambda}_i) \le \eta_{\epsilon_\alpha,\delta}$.
Since $\hat{A}^{(\delta)}\hat{w}^{(\delta)} = e_1$, we have $\hat{w}^{(\delta)}\cdot\tilde{\lambda}_i = 0$ for all $i = 1, \ldots, d_\delta-1$, and therefore,
\[
\big|\hat{w}^{(\delta)}\cdot\lambda_i\big| = \big|\hat{w}^{(\delta)}\cdot(\lambda_i - \tilde{\lambda}_i)\big| \le \big\|\hat{w}^{(\delta)}\big\|_2\,\eta_{\epsilon_\alpha,\delta}\,.
\]
This implies that for any policy $\pi$,
\[
\Big|V^\pi\cdot\sum_{i=1}^{d_\delta-1}\big(\hat{w}^{(\delta)}\cdot\lambda_i\big)\lambda_i\Big| \le \|V^\pi\|_2\sqrt{d_\delta}\,\big\|\hat{w}^{(\delta)}\big\|_2\,\eta_{\epsilon_\alpha,\delta} \le C_V\sqrt{d_\delta}\,\big\|\hat{w}^{(\delta)}\big\|_2\,\eta_{\epsilon_\alpha,\delta}\,.
\]
Denote by $\gamma = V^{\pi_1}\cdot\big(\sum_{i=1}^{d_\delta-1}\big\langle\hat{w}^{(\delta)}, \lambda_i\big\rangle\lambda_i\big)$, which is no greater than $C_V\sqrt{d_\delta}\,\big\|\hat{w}^{(\delta)}\big\|_2\,\eta_{\epsilon_\alpha,\delta}$.
We have that
\begin{align*}
\big|\hat{w}^{(\delta)}\cdot V^\pi - w\cdot V^\pi\big| &\le \Big|\hat{w}^{(\delta)}\cdot V^\pi - \frac{1}{1-\gamma}\hat{w}^{(\delta)}\cdot V^\pi\Big| + \Big|\frac{1}{1-\gamma}\hat{w}^{(\delta)}\cdot V^\pi - w\cdot V^\pi\Big|\\
&\le \frac{\gamma\,\big\|\hat{w}^{(\delta)}\big\|_2 C_V}{1-\gamma} + \frac{1}{1-\gamma}\Big|\Big(\sum_{i=1}^{d_\delta-1}\big(\hat{w}^{(\delta)}\cdot\lambda_i\big)\lambda_i\Big)\cdot V^\pi\Big|\\
&\le \frac{\gamma\,\big\|\hat{w}^{(\delta)}\big\|_2 C_V}{1-\gamma} + \frac{C_V\sqrt{d_\delta}\,\big\|\hat{w}^{(\delta)}\big\|_2\,\eta_{\epsilon_\alpha,\delta}}{1-\gamma}\\
&\le 2\big(C_V\big\|\hat{w}^{(\delta)}\big\|_2 + 1\big)C_V\sqrt{d_\delta}\,\big\|\hat{w}^{(\delta)}\big\|_2\,\eta_{\epsilon_\alpha,\delta}\,, \quad (15)
\end{align*}
where the last inequality holds when $C_V\sqrt{d_\delta}\,\big\|\hat{w}^{(\delta)}\big\|_2\,\eta_{\epsilon_\alpha,\delta} \le \frac{1}{2}$.
Now we show that $\big\|\hat{w}^{(\delta)}\big\|_2 \le C\|w'\|_2$ for some constant $C$.
Since $\hat{w}^{(\delta)}$ is the minimum norm solution to $\hat{A}^{(\delta)}x = e_1$, we construct another solution to $\hat{A}^{(\delta)}x = e_1$, denoted by $\hat{w}_0$, in a similar manner to the construction in Eq (14), and show that $\|\hat{w}_0\|_2 \le C\|w'\|_2$.
Let $\xi_1, \ldots, \xi_{d_\delta-1}$ be any orthonormal basis of $\operatorname{span}(\hat{A}_{2:}^{(\delta)})$.
We construct $\hat{w}_0$ s.t. $\hat{A}^{(\delta)}\hat{w}_0 = e_1$ by removing the component of $w'$ in $\operatorname{span}(\hat{A}_{2:}^{(\delta)})$ and rescaling. Specifically, let
\[
\hat{w}_0 = \frac{w' - \sum_{i=1}^{d_\delta-1}\langle w', \xi_i\rangle\xi_i}{1 - \big\langle V^{\pi_1}, \sum_{i=1}^{d_\delta-1}\langle w', \xi_i\rangle\xi_i\big\rangle}\,. \quad (16)
\]
Since $A^{(\delta)}w' = e_1$, it directly follows that $\big\langle\hat{A}_1, \hat{w}_0\big\rangle = \langle V^{\pi_1}, \hat{w}_0\rangle = 1$ and that $\big\langle\hat{A}_i, \hat{w}_0\big\rangle = 0$ for $i = 2, \ldots, d_\delta$, i.e., $\hat{A}^{(\delta)}\hat{w}_0 = e_1$.
Since Claim 1 implies that $\theta\big(\operatorname{span}(\hat{A}_{2:}^{(\delta)}), \operatorname{span}(A_{2:}^{(\delta)})\big) \le \eta_{\epsilon_\alpha,\delta}$, there exist unit vectors $\tilde{\xi}_1, \ldots, \tilde{\xi}_{d_\delta-1} \in \operatorname{span}(A_{2:}^{(\delta)})$ such that $\theta(\xi_i, \tilde{\xi}_i) \le \eta_{\epsilon_\alpha,\delta}$.
As $w'$ has zero component in $\operatorname{span}(A_{2:}^{(\delta)})$, $w'$ has only a small component in $\operatorname{span}(\hat{A}_{2:}^{(\delta)})$. In particular,
\[
|\langle w', \xi_i\rangle| = \big|\big\langle w', \xi_i - \tilde{\xi}_i\big\rangle\big| \le \|w'\|_2\,\eta_{\epsilon_\alpha,\delta}\,,
\]
which implies that
\[
\Big\|\sum_{i=1}^{d_\delta-1}\langle w', \xi_i\rangle\xi_i\Big\|_2 \le \sqrt{d_\delta}\,\|w'\|_2\,\eta_{\epsilon_\alpha,\delta}\,.
\]
Hence
\[
\Big|\Big\langle V^{\pi_1}, \sum_{i=1}^{d_\delta-1}\langle w', \xi_i\rangle\xi_i\Big\rangle\Big| \le C_V\sqrt{d_\delta}\,\|w'\|_2\,\eta_{\epsilon_\alpha,\delta}\,.
\]
As a result, $\|\hat{w}_0\|_2 \le \frac{3}{2}\|w'\|_2$ when $C_V\sqrt{d_\delta}\,\|w'\|_2\,\eta_{\epsilon_\alpha,\delta} \le \frac{1}{3}$, which is true when $\epsilon_\alpha$ is small enough.
According to Lemma 1, $\epsilon_\alpha \le \frac{4k^2\epsilon}{v^*}$, so $\epsilon_\alpha \to 0$ as $\epsilon \to 0$.
Thus, we have $\big\|\hat{w}^{(\delta)}\big\|_2 \le \|\hat{w}_0\|_2 \le \frac{3}{2}\|w'\|_2$ and $C_V\sqrt{d_\delta}\,\big\|\hat{w}^{(\delta)}\big\|_2\,\eta_{\epsilon_\alpha,\delta} \le \frac{1}{2}$.
Combined with Eq (15), we get
\[
\big|\hat{w}^{(\delta)}\cdot V^\pi - w\cdot V^\pi\big| = O\big((C_V\|w'\|_2 + 1)C_V\sqrt{d_\delta}\,\|w'\|_2\,\eta_{\epsilon_\alpha,\delta}\big)\,.
\]
Since $C_V\|w'\|_2 \ge |\langle V^{\pi_1}, w'\rangle| = 1$, by substituting $\eta_{\epsilon_\alpha,\delta} = \frac{4C_\alpha C_V^2 d_\delta\epsilon_\alpha}{\delta^2}$ into the above equation, we have
\[
\big|\hat{w}^{(\delta)}\cdot V^\pi - w\cdot V^\pi\big| = O\Big(\frac{C_\alpha C_V^4 d_\delta^{3/2}\|w'\|_2^2\,\epsilon_\alpha}{\delta^2}\Big)\,,
\]
which completes the proof.

G Proof of Lemma 3

Lemma 3. If $|\hat{\alpha}_i - \alpha_i| \le \epsilon_\alpha$ and $\alpha_i \le C_\alpha$ for all $i \in [d-1]$, then for every policy $\pi$ and every $\delta \ge 4C_\alpha^{\frac{2}{3}} C_V d^{\frac{1}{3}}\epsilon_\alpha^{\frac{1}{3}}$, we have $\big|\hat{w}\cdot V^\pi - \hat{w}^{(\delta)}\cdot V^\pi\big| \le O\big((\sqrt{k}+1)^{d-d_\delta} C_\alpha\epsilon^{(\delta)}\big)$, where $\epsilon^{(\delta)} = \frac{C_\alpha C_V^4 d_\delta^{3/2}\|w'\|_2^2\,\epsilon_\alpha}{\delta^2} + \sqrt{k}\,\delta\|w'\|_2$ is the upper bound in Lemma 2.

Proof. Given the output $(V^{\pi_1}, \ldots, V^{\pi_d})$ of Algorithm 1, we have $\operatorname{rank}(\operatorname{span}(\{V^{\pi_1}, \ldots, V^{\pi_{d_\delta}}\})) = d_\delta$.
For $i = d_\delta+1, \ldots, d$, let $\psi_i$ be the normalized vector of $V^{\pi_i}$'s projection onto $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{i-1}})^\perp$, with $\|\psi_i\|_2 = 1$.
Then we have that $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{i-1}}, \psi_i) = \operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_i})$ and that $\{\psi_i \mid i = d_\delta+1, \ldots, d\}$ are orthonormal.
For every policy $\pi$, the value vector can be represented as a linear combination of $\hat{A}_1, \ldots, \hat{A}_{d_\delta}, \psi_{d_\delta+1}, \ldots, \psi_d$, i.e., there exists a unique $a = (a_1, \ldots, a_d) \in \mathbb{R}^d$ s.t. $V^\pi = \sum_{i=1}^{d_\delta} a_i \hat{A}_i + \sum_{i=d_\delta+1}^{d} a_i \psi_i$.
Since $\psi_i$ is orthogonal to $\psi_j$ for all $j \ne i$ and the $\psi_i$'s are orthogonal to $\operatorname{span}(\hat{A}_1, \ldots, \hat{A}_{d_\delta})$, we have $a_i = \langle V^\pi, \psi_i\rangle$ for $i \ge d_\delta+1$.
This implies that
\[
\big|\langle\hat{w}, V^\pi\rangle - \big\langle\hat{w}^{(\delta)}, V^\pi\big\rangle\big| \le \underbrace{\Big|\sum_{i=1}^{d_\delta} a_i\big(\big\langle\hat{w}, \hat{A}_i\big\rangle - \big\langle\hat{w}^{(\delta)}, \hat{A}_i\big\rangle\big)\Big|}_{(a)} + \underbrace{\Big|\sum_{i=d_\delta+1}^{d} a_i\langle\hat{w}, \psi_i\rangle\Big|}_{(b)} + \underbrace{\Big|\sum_{i=d_\delta+1}^{d} a_i\big\langle\hat{w}^{(\delta)}, \psi_i\big\rangle\Big|}_{(c)}\,.
\]
Since $\hat{A}^{(\delta)}\hat{w} = \hat{A}^{(\delta)}\hat{w}^{(\delta)} = e_1$, term (a) $= 0$.
We move on to bound term (c). Note that the vectors $\{\psi_i \mid i = d_\delta+1, \ldots, d\}$ are orthogonal to $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{d_\delta}})$ and that together with $V^{\pi_1}, \ldots, V^{\pi_{d_\delta}}$ they form a basis for $\operatorname{span}(\{V^\pi \mid \pi \in \Pi\})$.
Thus, we can let $b_i$ in the proof of Lemma 2 be $\psi_{i+d_\delta}$, and all the properties of $\{b_i \mid i \in [d-d_\delta]\}$ apply to $\{\psi_i \mid i = d_\delta+1, \ldots, d\}$ as well. Hence, similarly to Eq (9),
\[
\sqrt{\sum_{i=d_\delta+1}^{d} a_i^2} \le \sqrt{k}\,\delta\,.
\]
Consequently, term (c) is bounded by
\[
(c) \le \sqrt{k}\,\delta\,\big\|\hat{w}^{(\delta)}\big\|_2 \le \frac{3}{2}\sqrt{k}\,\delta\,\|w'\|_2\,,
\]
since $\big\|\hat{w}^{(\delta)}\big\|_2 \le \frac{3}{2}\|w'\|_2$ when $C_V\sqrt{d_\delta}\,\|w'\|_2\,\eta_{\epsilon_\alpha,\delta} \le \frac{1}{3}$, as discussed in the proof of Lemma 6.
Now all that is left is to bound term (b). We cannot bound term (b) in the same way as term (c) because $\|\hat{w}\|_2$ is not guaranteed to be bounded by $\|w'\|_2$.
For $i = d_\delta+1, \ldots, d$, we define
\[
\epsilon_i := \big|\big\langle\psi_i, \hat{A}_i\big\rangle\big|\,.
\]
For any $i, j = d_\delta+1, \ldots, d$, $\psi_i$ is perpendicular to $V^{\pi_1}$; thus $\big|\big\langle\psi_i, \hat{A}_j\big\rangle\big| = |\langle\psi_i, \hat{\alpha}_{j-1}V^{\pi_1} - V^{\pi_j}\rangle| = |\langle\psi_i, V^{\pi_j}\rangle|$. In particular, we have
\[
\epsilon_i = \big|\big\langle\psi_i, \hat{A}_i\big\rangle\big| = |\langle\psi_i, V^{\pi_i}\rangle|\,.
\]
Let $\hat{A}_i^{\parallel} := \hat{A}_i - \sum_{j=d_\delta+1}^{d}\big\langle\hat{A}_i, \psi_j\big\rangle\psi_j$ denote $\hat{A}_i$'s projection onto $\operatorname{span}(\hat{A}_1, \ldots, \hat{A}_{d_\delta})$.
Since $\hat{A}_i$ has zero component in direction $\psi_j$ for $j > i$, we have $\hat{A}_i^{\parallel} = \hat{A}_i - \sum_{j=d_\delta+1}^{i}\big\langle\hat{A}_i, \psi_j\big\rangle\psi_j$.
Then, we have
\[
0 = \big\langle\hat{w}, \hat{A}_i\big\rangle = \hat{w}\cdot\hat{A}_i^{\parallel} + \sum_{j=d_\delta+1}^{i}\big\langle\hat{A}_i, \psi_j\big\rangle\langle\hat{w}, \psi_j\rangle = \hat{w}\cdot\hat{A}_i^{\parallel} - \sum_{j=d_\delta+1}^{i}\langle V^{\pi_i}, \psi_j\rangle\langle\hat{w}, \psi_j\rangle\,,
\]
where the first equation holds due to $\hat{A}\hat{w} = e_1$.
By rearranging terms, we have
\[
\langle V^{\pi_i}, \psi_i\rangle\langle\hat{w}, \psi_i\rangle = \hat{w}\cdot\hat{A}_i^{\parallel} - \sum_{j=d_\delta+1}^{i-1}\langle V^{\pi_i}, \psi_j\rangle\langle\hat{w}, \psi_j\rangle\,. \quad (17)
\]
Recall that at iteration $j$ of Algorithm 1, in line 8 we pick an orthonormal basis $\rho_1, \ldots, \rho_{k+1-j}$ of $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{j-1}})^\perp$. Since $\psi_j$ lies in $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{j-1}})^\perp$ by the definition of $\psi_j$, $|\langle V^{\pi_i}, \psi_j\rangle|$ is no greater than the norm of $V^{\pi_i}$'s projection onto $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{j-1}})^\perp$. Therefore, we have
\[
|\langle V^{\pi_i}, \psi_j\rangle| \le \sqrt{k}\max_{l\in[k+1-j]}|\langle V^{\pi_i}, \rho_l\rangle| \overset{(d)}{\le} \sqrt{k}\max_{l\in[k+1-j]}\max\big(\big\langle V^{\pi^{\rho_l}}, \rho_l\big\rangle, \big\langle V^{\pi^{-\rho_l}}, -\rho_l\big\rangle\big) \overset{(e)}{=} \sqrt{k}\,|\langle V^{\pi_j}, u_j\rangle| \overset{(f)}{\le} \sqrt{k}\,|\langle V^{\pi_j}, \psi_j\rangle| = \sqrt{k}\,\epsilon_j\,, \quad (18)
\]
where inequality (d) holds because $\pi^{\rho_l}$ is the optimal personalized policy with respect to the preference vector $\rho_l$, and equality (e) holds due to the definition of $u_j$ (line 9 of Algorithm 1). Inequality (f) holds since $|\langle V^{\pi_j}, \psi_j\rangle|$ is the norm of $V^{\pi_j}$'s projection onto $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{j-1}})^\perp$ and $u_j$ belongs to $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{j-1}})^\perp$.
By taking absolute values on both sides of Eq (17), we have
\[
\epsilon_i|\langle\hat{w}, \psi_i\rangle| = \Big|\hat{w}\cdot\hat{A}_i^{\parallel} - \sum_{j=d_\delta+1}^{i-1}\langle V^{\pi_i}, \psi_j\rangle\langle\hat{w}, \psi_j\rangle\Big| \le \big|\hat{w}\cdot\hat{A}_i^{\parallel}\big| + \sqrt{k}\sum_{j=d_\delta+1}^{i-1}\epsilon_j|\langle\hat{w}, \psi_j\rangle|\,. \quad (19)
\]
We can now bound $\big|\hat{w}\cdot\hat{A}_i^{\parallel}\big|$ as follows:
\begin{align*}
\big|\hat{w}\cdot\hat{A}_i^{\parallel}\big| &= \big|\hat{w}^{(\delta)}\cdot\hat{A}_i^{\parallel}\big| \quad (20)\\
&= \Big|\hat{w}^{(\delta)}\cdot\Big(\hat{A}_i - \sum_{j=d_\delta+1}^{d}\big\langle\hat{A}_i, \psi_j\big\rangle\psi_j\Big)\Big| = \big|\hat{w}^{(\delta)}\cdot\hat{A}_i\big| \quad (21)\\
&\le \big|\hat{w}^{(\delta)}\cdot A_i\big| + \big|\hat{w}^{(\delta)}\cdot(\hat{A}_i - A_i)\big|\\
&\le |w'\cdot A_i| + \big|(\hat{w}^{(\delta)} - w')\cdot A_i\big| + \big|\hat{w}^{(\delta)}\cdot(\hat{A}_i - A_i)\big|\\
&\le 0 + (C_\alpha+1)\sup_\pi\big|(\hat{w}^{(\delta)} - w')\cdot V^\pi\big| + C_V\big\|\hat{w}^{(\delta)}\big\|_2\epsilon_\alpha\\
&\le C' C_\alpha\epsilon^{(\delta)}\,,
\end{align*}
for some constant $C' > 0$.
Eq (20) holds because $\hat{A}^{(\delta)}\hat{w} = \hat{A}^{(\delta)}\hat{w}^{(\delta)} = e_1$ and $\hat{A}_i^{\parallel}$ belongs to $\operatorname{span}(\hat{A}^{(\delta)})$. Eq (21) holds because $\hat{w}^{(\delta)}$ is the minimum norm solution to $\hat{A}^{(\delta)}x = e_1$, which implies that $\hat{w}^{(\delta)}\cdot\psi_i = 0$. The last inequality follows by applying Lemma 2.
We bound $\epsilon_i|\langle\hat{w}, \psi_i\rangle|$ by induction on $i = d_\delta+1, \ldots, d$. In the base case of $i = d_\delta+1$,
\[
\epsilon_{d_\delta+1}|\langle\hat{w}, \psi_{d_\delta+1}\rangle| \le \big|\hat{w}\cdot\hat{A}_{d_\delta+1}^{\parallel}\big| \le C' C_\alpha\epsilon^{(\delta)}\,.
\]
Then, by induction through Eq (19), we have for $i = d_\delta+2, \ldots, d$,
\[
\epsilon_i|\langle\hat{w}, \psi_i\rangle| \le (\sqrt{k}+1)^{i-d_\delta-1} C' C_\alpha\epsilon^{(\delta)}\,.
\]
Similarly to the derivation of Eq (18), we pick an orthonormal basis $\rho_1, \ldots, \rho_{k+1-i}$ of $\operatorname{span}(V^{\pi_1}, \ldots, V^{\pi_{i-1}})^\perp$ at line 8 of Algorithm 1; then we have that, for any policy $\pi$,
\[
|\langle V^\pi, \psi_i\rangle| \le \sqrt{k}\max_{l\in[k+1-i]}|\langle V^\pi, \rho_l\rangle| \le \sqrt{k}\,|\langle V^{\pi_i}, u_i\rangle| \le \sqrt{k}\,|\langle V^{\pi_i}, \psi_i\rangle| = \sqrt{k}\,\epsilon_i\,.
\]
Then we have that term (b) is bounded by
\[
(b) = \Big|\sum_{i=d_\delta+1}^{d}\langle V^\pi, \psi_i\rangle\langle\hat{w}, \psi_i\rangle\Big| \le \sum_{i=d_\delta+1}^{d}|\langle V^\pi, \psi_i\rangle|\cdot|\langle\hat{w}, \psi_i\rangle| \le \sqrt{k}\sum_{i=d_\delta+1}^{d}\epsilon_i|\langle\hat{w}, \psi_i\rangle| \le (\sqrt{k}+1)^{d-d_\delta} C' C_\alpha\epsilon^{(\delta)}\,.
\]
Hence we have that for any policy $\pi$,
\[
\big|\langle\hat{w}, V^\pi\rangle - \big\langle\hat{w}^{(\delta)}, V^\pi\big\rangle\big| \le (\sqrt{k}+1)^{d-d_\delta} C' C_\alpha\epsilon^{(\delta)} + \frac{3}{2}\sqrt{k}\,\delta\,\|w'\|_2\,.
\]
2

H Proof of Theorem 2
Theorem 2. Consider the algorithm of computing $\hat{A}$ and any solution $\hat{w}^{(\delta)}$ to $\hat{A}^{(\delta)}x = e_1$ for $\delta = k^{\frac{5}{3}}\epsilon^{\frac{1}{3}}$, and outputting the policy $\pi^{\hat{w}^{(\delta)}} = \arg\max_{\pi\in\Pi}\big\langle\hat{w}^{(\delta)}, V^\pi\big\rangle$. Then the policy $\pi^{\hat{w}^{(\delta)}}$ satisfies $v^* - \big\langle w^*, V^{\pi^{\hat{w}^{(\delta)}}}\big\rangle \le O\big(k^{\frac{13}{6}}\epsilon^{\frac{1}{3}}\big)$.

Proof of Theorem 2. As shown in Lemma 1, we set $C_\alpha = 2k$ and have $\epsilon_\alpha = \frac{4k^2\epsilon}{v^*}$. We have $\|w'\|_2 = \frac{\|w^*\|_2}{\langle w^*, V^{\pi_1}\rangle}$ and showed that $\langle w^*, V^{\pi_1}\rangle \ge \frac{v^*}{2k}$ in the proof of Lemma 1.
By applying Lemma 2 and setting $\delta = \Big(\frac{C_V^4 k^5\|w^*\|_2\,\epsilon}{v^{*2}}\Big)^{\frac{1}{3}}$, we have
\begin{align*}
v^* - \big\langle w^*, V^{\pi^{\hat{w}^{(\delta)}}}\big\rangle &= \langle w^*, V^{\pi_1}\rangle\Big(\big\langle w', V^{\pi^*}\big\rangle - \big\langle w', V^{\pi^{\hat{w}^{(\delta)}}}\big\rangle\Big)\\
&\le \langle w^*, V^{\pi_1}\rangle\Big(\big\langle\hat{w}^{(\delta)}, V^{\pi^*}\big\rangle - \big\langle\hat{w}^{(\delta)}, V^{\pi^{\hat{w}^{(\delta)}}}\big\rangle + O\Big(\sqrt{k}\,\|w'\|_2\Big(\frac{C_V^4 k^5\|w^*\|_2\,\epsilon}{v^{*2}}\Big)^{\frac{1}{3}}\Big)\Big)\\
&= O\Big(\sqrt{k}\,\|w^*\|_2\Big(\frac{C_V^4 k^5\|w^*\|_2\,\epsilon}{v^{*2}}\Big)^{\frac{1}{3}}\Big)\,,
\end{align*}
where the last step uses $\big\langle\hat{w}^{(\delta)}, V^{\pi^*}\big\rangle \le \big\langle\hat{w}^{(\delta)}, V^{\pi^{\hat{w}^{(\delta)}}}\big\rangle$ and $\langle w^*, V^{\pi_1}\rangle\|w'\|_2 = \|w^*\|_2$.

I Dependency on ϵ
In this section, we would like to discuss a potential way of improving the dependency on $\epsilon$ in Theorems 1 and 2.
Consider a toy example where the three returned basis policies are $\pi_1$ with $V^{\pi_1} = (1, 0, 0)$, $\pi_2$ with $V^{\pi_2} = (1, 1, 1)$ and $\pi_3$ with $V^{\pi_3} = (1, \eta, -\eta)$ for some $\eta > 0$, and $w^* = (1, w_2, w_3)$ for some $w_2, w_3$.
The estimated ratio $\hat{\alpha}_1$ lies in $[1 + w_2 + w_3 - \epsilon, 1 + w_2 + w_3 + \epsilon]$, and $\hat{\alpha}_2$ lies in $[1 + \eta w_2 - \eta w_3 - \epsilon, 1 + \eta w_2 - \eta w_3 + \epsilon]$. Suppose that $\hat{\alpha}_1 = 1 + w_2 + w_3 + \epsilon$ and $\hat{\alpha}_2 = 1 + \eta w_2 - \eta w_3 + \epsilon$.
By solving
\[
\begin{pmatrix} 1 & 0 & 0\\ w_2 + w_3 + \epsilon & -1 & -1\\ \eta w_2 - \eta w_3 + \epsilon & -\eta & \eta \end{pmatrix}\hat{w} = \begin{pmatrix}1\\0\\0\end{pmatrix},
\]
we can derive $\hat{w}_2 = w_2 + \frac{\epsilon}{2}\big(1 + \frac{1}{\eta}\big)$ and $\hat{w}_3 = w_3 + \frac{\epsilon}{2}\big(1 - \frac{1}{\eta}\big)$.
The quantity measuring sub-optimality that we care about is $\sup_\pi|\langle\hat{w}, V^\pi\rangle - \langle w^*, V^\pi\rangle|$, which is upper bounded by $C_V\|\hat{w} - w^*\|_2$. But the $\ell_2$ distance between $\hat{w}$ and $w^*$ depends on the condition number of $\hat{A}$, which is large when $\eta$ is small. To obtain a non-vacuous upper bound in Section 3, we introduce another estimate, $\hat{w}^{(\delta)}$, based on the truncated version of $\hat{A}$, and then upper bound $\|\hat{w}^{(\delta)} - w^*\|_2$ and $\sup_\pi|\langle\hat{w}, V^\pi\rangle - \langle\hat{w}^{(\delta)}, V^\pi\rangle|$ separately.
However, it is unclear if $\sup_\pi|\langle\hat{w}, V^\pi\rangle - \langle w^*, V^\pi\rangle|$ actually depends on the condition number of $\hat{A}$. Due to the construction of Algorithm 1, we can obtain some extra information about the set of all policy values.
First, since we find $\pi_2$ before $\pi_3$, $\eta$ must be no greater than 1. According to the algorithm, $V^{\pi_2}$ is the optimal policy when the preference vector is $u_2$ (see line 11 of Algorithm 1 for the definition of $u_2$), and $V^{\pi_3}$ is the optimal policy when the preference vector is $u_3 = (0, 1, -1)$. Note that the angle between $u_2$ and $V^{\pi_2}$ is no greater than 45 degrees according to the definition of $u_2$. Then the values of all policies can only lie in the small box $B = \{x\in\mathbb{R}^3 \mid u_2^\top x \le |\langle u_2, V^{\pi_2}\rangle|,\ u_3^\top x \le |\langle u_3, V^{\pi_3}\rangle|\}$. It is direct to check that for any $x\in B$, $|\langle\hat{w}, x\rangle - \langle w^*, x\rangle| < (1+\sqrt{2})\epsilon$.
This example illustrates that even when the condition number of $\hat{A}$ is large, $\sup_\pi|\langle\hat{w}, V^\pi\rangle - \langle w^*, V^\pi\rangle|$ can be small. It is unclear whether this holds in general. Applying this additional information to upper bound $\sup_\pi|\langle\hat{w}, V^\pi\rangle - \langle w^*, V^\pi\rangle|$ directly, instead of through bounding $C_V\|\hat{w} - w^*\|_2$, is a possible way of improving the $\epsilon^{\frac{1}{3}}$ term.
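As a quick numeric sanity check of this toy example (a sketch only; numpy and the particular values of $w_2$, $w_3$, $\eta$, $\epsilon$ below are our own illustrative choices), solving the perturbed linear system reproduces the closed-form expressions for $\hat{w}_2$ and $\hat{w}_3$ given above.

import numpy as np

w2, w3, eta, eps = 0.5, 0.3, 0.1, 0.01          # illustrative values
A_hat = np.array([
    [1.0,                    0.0,  0.0],
    [w2 + w3 + eps,         -1.0, -1.0],
    [eta * (w2 - w3) + eps, -eta,  eta],
])
w_hat = np.linalg.solve(A_hat, np.array([1.0, 0.0, 0.0]))
print(w_hat[1], w2 + eps / 2 * (1 + 1 / eta))   # both 0.555
print(w_hat[2], w3 + eps / 2 * (1 - 1 / eta))   # both 0.255

With $\eta = 0.1$, the error in the second coordinate is already $5.5\epsilon$, illustrating how $\|\hat{w} - w^*\|_2$ blows up as $\eta \to 0$.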

J Description of C4 and Proof of Lemma 4

Algorithm 4 C4: Compress Convex Combination using Carathéodory's theorem
1: input: a set of $k$-dimensional vectors $M\subset\mathbb{R}^k$ and a distribution $p\in\operatorname{Simplex}^M$
2: while $|M| > k+1$ do
3:   arbitrarily pick $k+2$ vectors $\mu_1, \ldots, \mu_{k+2}$ from $M$
4:   solve for $x\in\mathbb{R}^{k+2}$ s.t. $\sum_{i=1}^{k+2} x_i(\mu_i\circ 1) = 0$, where $\mu\circ 1$ denotes the vector obtained by appending 1 to $\mu$
5:   $i_0 \leftarrow \arg\max_{i\in[k+2]}\frac{|x_i|}{p(\mu_i)}$
6:   if $x_{i_0} < 0$ then $x\leftarrow -x$
7:   $\gamma\leftarrow\frac{p(\mu_{i_0})}{x_{i_0}}$ and $\forall i\in[k+2]$, $p(\mu_i)\leftarrow p(\mu_i) - \gamma x_i$
8:   remove every $\mu_i$ with $p(\mu_i) = 0$ from $M$
9: end while
10: output $M$ and $p$

Lemma 4. Given a set of $k$-dimensional vectors $M\subset\mathbb{R}^k$ and a distribution $p$ over $M$, C4$(M, p)$ outputs $M'\subset M$ with $|M'|\le k+1$ and a distribution $q$ over $M'$ satisfying that $\mathbb{E}_{\mu\sim q}[\mu] = \mathbb{E}_{\mu\sim p}[\mu]$, in time $O(|M|k^3)$.

Proof. The proof is similar to the proof of Carathéodory's theorem. Given the vectors $\mu_1, \ldots, \mu_{k+2}$ picked in line 3 of Algorithm 4 and their probability masses $p(\mu_i)$, we solve for $x\in\mathbb{R}^{k+2}$ s.t. $\sum_{i=1}^{k+2} x_i(\mu_i\circ 1) = 0$ in the algorithm.
Note that there exists a non-zero solution $x$ because $\{\mu_i\circ 1 \mid i\in[k+2]\}$ are linearly dependent. Besides, $x$ satisfies $\sum_{i=1}^{k+2} x_i = 0$. Therefore,
\[
\sum_{i=1}^{k+2}\big(p(\mu_i) - \gamma x_i\big) = \sum_{i=1}^{k+2} p(\mu_i)\,.
\]
For all $i$, if $x_i < 0$, then $p(\mu_i) - \gamma x_i \ge 0$ as $\gamma > 0$; if $x_i > 0$, then $\frac{x_i}{p(\mu_i)} \le \frac{x_{i_0}}{p(\mu_{i_0})} = \frac{1}{\gamma}$ and thus $p(\mu_i) - \gamma x_i \ge 0$.
Hence, after one iteration, the updated $p$ is still a probability distribution over $M$ (i.e., $p(\mu)\ge 0$ for all $\mu\in M$ and $\sum_{\mu\in M} p(\mu) = 1$). Besides, $\sum_{i=1}^{k+2}(p(\mu_i) - \gamma x_i)\mu_i = \sum_{i=1}^{k+2} p(\mu_i)\mu_i - \gamma\sum_{i=1}^{k+2} x_i\mu_i = \sum_{i=1}^{k+2} p(\mu_i)\mu_i$.
Therefore, after one iteration, the expected value $\mathbb{E}_{\mu\sim p}[\mu]$ is unchanged. When we finally output $(M', q)$, we have that $q$ is a distribution over $M'$ and that $\mathbb{E}_{\mu\sim q}[\mu] = \mathbb{E}_{\mu\sim p}[\mu]$.
Due to line 6 of the algorithm, we know that $x_{i_0} > 0$. Hence $p(\mu_{i_0}) - \gamma x_{i_0} = p(\mu_{i_0}) - \frac{p(\mu_{i_0})}{x_{i_0}}x_{i_0} = 0$. We therefore remove at least one vector $\mu_{i_0}$ from $M$ in each iteration, so the algorithm runs for at most $|M|$ iterations.
Finally, solving for $x$ takes $O(k^3)$ time and thus Algorithm 4 takes $O(|M|k^3)$ time in total.
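For illustration, C4 can be implemented in a few lines of numpy. This is a minimal sketch of Algorithm 4 rather than a reference implementation: the null-space vector is obtained here via an SVD, and the small tolerance used to drop zeroed-out points is our own choice.

import numpy as np

def c4(M, p):
    """M: (n, k) array of support vectors; p: length-n probability vector."""
    M = np.asarray(M, dtype=float)
    p = np.asarray(p, dtype=float)
    keep = p > 0                                        # drop zero-mass points up front
    M, p = M[keep].copy(), p[keep].copy()
    k = M.shape[1]
    while M.shape[0] > k + 1:
        idx = np.arange(k + 2)                          # any k+2 support points
        aug = np.hstack([M[idx], np.ones((k + 2, 1))])  # rows are mu_i with a 1 appended
        x = np.linalg.svd(aug.T)[2][-1]                 # nonzero x with aug^T x = 0
        j0 = int(np.argmax(np.abs(x) / p[idx]))         # i_0 in Algorithm 4
        if x[j0] < 0:
            x = -x
        gamma = p[idx[j0]] / x[j0]
        p[idx] -= gamma * x
        p[idx[j0]] = 0.0                                # removed exactly
        keep = p > 1e-12                                # drop zeroed-out points
        M, p = M[keep], p[keep]
    return M, p

Each pass removes at least one support point while preserving the weighted mean, and the per-iteration cost is dominated by the $O(k^3)$ null-space computation, consistent with the $O(|M|k^3)$ bound of Lemma 4.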

K Flow Decomposition Based Approach

We first introduce an algorithm based on the idea of flow decomposition.
For that, we construct a layer graph $G = ((L^{(0)}\cup\ldots\cup L^{(H+1)}), E)$ with $H+2$ pairwise disjoint layers $L^{(0)}, \ldots, L^{(H+1)}$, where every layer $t\le H$ contains a set of vertices labeled by the (possibly duplicated) states reachable at the corresponding time step $t$, i.e., $\{s\in S \mid \Pr(S_t = s\,|\,S_0 = s_0) > 0\}$. Let us denote by $x_s^{(t)}$ the vertex in $L^{(t)}$ labeled by state $s$.
Layer $L^{(H+1)} = \{x_*^{(H+1)}\}$ contains only an artificial vertex, $x_*^{(H+1)}$, labeled by an artificial state $*$.
For $t = 0, \ldots, H-1$, for every $x_s^{(t)}\in L^{(t)}$ and $x_{s'}^{(t+1)}\in L^{(t+1)}$, we connect $x_s^{(t)}$ and $x_{s'}^{(t+1)}$ by an edge labeled by $(s, s')$ if $P(s'|s, \pi(s)) > 0$. Every vertex $x_s^{(H)}$ in layer $H$ is connected to $x_*^{(H+1)}$ by one edge, which is labeled by $(s, *)$.
We denote by $E^{(t)}$ the edges between $L^{(t)}$ and $L^{(t+1)}$. Note that every trajectory $\tau = (s_0, s_1, \ldots, s_H)$ corresponds to a single path $(x_{s_0}^{(0)}, x_{s_1}^{(1)}, \ldots, x_{s_H}^{(H)}, x_*^{(H+1)})$ of length $H+2$ from $x_{s_0}^{(0)}$ to $x_*^{(H+1)}$. This is a one-to-one mapping, and in the following we use path and trajectory interchangeably.
The policy corresponds to an $(x_{s_0}^{(0)}, x_*^{(H+1)})$-flow with flow value 1 in the graph $G$. In particular, the flow is defined as follows. When the layer $t$ is clear from the context, we refer to the vertex $x_s^{(t)}$ simply as vertex $s$.
For $t = 0, \ldots, H-1$, for any edge $(s, s')\in E^{(t)}$, let $f: E\to\mathbb{R}^+$ be defined as
\[
f(s, s') = \sum_{\tau: (s_t^\tau, s_{t+1}^\tau) = (s, s')} q^\pi(\tau)\,, \quad (22)
\]
where $q^\pi(\tau)$ is the probability of $\tau$ being sampled.
For any edge $(s, *)\in E^{(H)}$, let $f(s, *) = \sum_{(s', s)\in E^{(H-1)}} f(s', s)$. It is direct to check that the function $f$ is a well-defined flow. We can therefore compute $f$ by dynamic programming: for all $(s_0, s)\in E^{(0)}$, we have $f(s_0, s) = P(s|s_0, \pi(s_0))$, and for $(s, s')\in E^{(t)}$,
\[
f(s, s') = P(s'|s, \pi(s))\sum_{s'': (s'', s)\in E^{(t-1)}} f(s'', s)\,. \quad (23)
\]
Now we are ready to present our algorithm, which decomposes $f$, in Algorithm 5. Each iteration of Algorithm 5 zeroes out at least one edge, and thus the algorithm stops within $|E|$ rounds.

Algorithm 5 Flow decomposition based approach
1: initialize $Q\leftarrow\emptyset$
2: calculate $f(e)$ for all edges $e\in E$ by dynamic programming according to Eq (23)
3: while $\exists e\in E$ s.t. $f(e) > 0$ do
4:   pick a path $\tau = (s_0, s_1, \ldots, s_H, *)\in L^{(0)}\times L^{(1)}\times\ldots\times L^{(H+1)}$ s.t. $f(s_i, s_{i+1}) > 0$ for all $i\ge 0$
5:   $f_\tau\leftarrow\min_{e\text{ in }\tau} f(e)$
6:   $Q\leftarrow Q\cup\{(\tau, f_\tau)\}$, $f(e)\leftarrow f(e) - f_\tau$ for every $e$ in $\tau$
7: end while
8: output $Q$

Theorem 4. Algorithm 5 outputs $Q$ satisfying that $\sum_{(\tau, f_\tau)\in Q} f_\tau\Phi(\tau) = V^\pi$ in time $O(H^2|S|^2)$.

The core idea of the proof is that for any edge $(s, s')\in E^{(t)}$, the flow on $(s, s')$ captures the probability of $S_t = s\wedge S_{t+1} = s'$, and thus the value of the policy, $V^\pi$, is linear in $\{f(e) \mid e\in E\}$.
The output $Q$ contains at most $|E|$ weighted paths (trajectories). We can further compress the representation through C4, which takes $O(|Q|k^3)$ time.
Corollary 2. Executing Algorithm 5 to obtain the output $Q$ and then running C4 over $\{(\Phi(\tau), f_\tau) \mid (\tau, f_\tau)\in Q\}$ returns a $(k+1)$-sized weighted trajectory representation in time $O(H^2|S|^2 + k^3 H|S|^2)$.
We remark that the running time of this flow decomposition approach underperforms that of the expanding and compressing approach (see Theorem 3) whenever $|S|H + |S|k^3 = \omega(k^4 + k|S|)$.
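The following Python sketch illustrates the two phases of Algorithm 5: the dynamic program of Eq (23) for the edge flows, followed by peeling off positive-flow paths. It is a simplified illustration under assumed interfaces (P[s][a] maps to a dictionary of successor probabilities, pi is a deterministic policy given as a function of the state, and the numerical tolerance is our own choice); the artificial layer $L^{(H+1)}$ is omitted since it carries no reward.

from collections import defaultdict

def flow_decomposition(P, pi, s0, H):
    # Phase 1: dynamic programming for the edge flows f[t][(s, s_next)], as in Eq (23).
    f = [defaultdict(float) for _ in range(H)]
    mass = {s0: 1.0}                       # Pr(S_t = s) over reachable states
    for t in range(H):
        nxt = defaultdict(float)
        for s, m in mass.items():
            for s2, pr in P[s][pi(s)].items():
                f[t][(s, s2)] += m * pr
                nxt[s2] += m * pr
        mass = dict(nxt)

    # Phase 2: greedily peel off positive-flow paths (trajectories s_0, ..., s_H).
    paths = []
    while any(v > 1e-12 for v in f[0].values()):
        traj, s, w = [s0], s0, float("inf")
        for t in range(H):
            s2 = next(sp for (sa, sp), v in f[t].items() if sa == s and v > 1e-12)
            w = min(w, f[t][(s, s2)])
            traj.append(s2)
            s = s2
        for t in range(H):
            f[t][(traj[t], traj[t + 1])] -= w
        paths.append((tuple(traj), w))
    return paths

The returned list of (trajectory, weight) pairs can then be fed to C4 (applied to the return vectors $\Phi(\tau)$) to obtain the $(k+1)$-sized representation of Corollary 2.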

L Proof of Theorem 4
Theorem 4. Algorithm 5 outputs $Q$ satisfying that $\sum_{(\tau, f_\tau)\in Q} f_\tau\Phi(\tau) = V^\pi$ in time $O(H^2|S|^2)$.

Proof. Correctness: The function $f$ defined by Eq (22) is a well-defined flow since for all $t = 1, \ldots, H$ and all $s\in L^{(t)}$, we have that
\[
\sum_{s'\in L^{(t+1)}: (s,s')\in E^{(t)}} f(s,s') = \sum_{s'\in L^{(t+1)}: (s,s')\in E^{(t)}}\ \sum_{\tau: (s_t^\tau, s_{t+1}^\tau)=(s,s')} q^\pi(\tau) = \sum_{\tau: s_t^\tau = s} q^\pi(\tau) = \sum_{s''\in L^{(t-1)}: (s'',s)\in E^{(t-1)}} f(s'',s)\,.
\]
In the following, we first show that Algorithm 5 terminates with $f(e) = 0$ for all $e\in E$.
First, after each iteration, $f$ is still a feasible $(x^{(0)}, x^{(H+1)})$-flow, with the total flow out of $x^{(0)}$ reduced by $f_\tau$. Besides, for every edge $e$ with $f(e) > 0$ at the beginning, we have $f(e)\ge 0$ throughout the algorithm because we never reduce $f(e)$ by an amount greater than $f(e)$.
Then, since $f$ is an $(x^{(0)}, x^{(H+1)})$-flow and $f(e)\ge 0$ for all $e\in E$, we can always find a path $\tau$ in line 4 of Algorithm 5. Otherwise, the set of vertices reachable from $x^{(0)}$ through edges with positive flow would not contain $x^{(H+1)}$, and the flow out of this set would equal the total flow out of $x^{(0)}$; but since the other vertices are not reachable, there is no flow out of this set, which is a contradiction.
In line 6, there exists at least one edge $e$ such that $f(e) > 0$ is reduced to 0. Hence, the algorithm runs for at most $|E|$ iterations and terminates with $f(e) = 0$ for all $e\in E$.
Thus we have that for any $(s,s')\in E^{(t)}$, $f(s,s') = \sum_{(\tau, f_\tau)\in Q: (s_t^\tau, s_{t+1}^\tau)=(s,s')} f_\tau$.
Then we have
\begin{align*}
V^\pi &= \sum_\tau q^\pi(\tau)\Phi(\tau) = \sum_\tau q^\pi(\tau)\Big(\sum_{t=0}^{H-1} R(s_t^\tau, \pi(s_t^\tau))\Big)\\
&= \sum_{t=0}^{H-1}\sum_\tau q^\pi(\tau) R(s_t^\tau, \pi(s_t^\tau))\\
&= \sum_{t=0}^{H-1}\sum_{(s,s')\in E^{(t)}} R(s, \pi(s))\sum_{\tau: (s_t^\tau, s_{t+1}^\tau)=(s,s')} q^\pi(\tau)\\
&= \sum_{t=0}^{H-1}\sum_{(s,s')\in E^{(t)}} R(s, \pi(s)) f(s,s')\\
&= \sum_{t=0}^{H-1}\sum_{(s,s')\in E^{(t)}} R(s, \pi(s))\sum_{(\tau, f_\tau)\in Q: (s_t^\tau, s_{t+1}^\tau)=(s,s')} f_\tau\\
&= \sum_{(\tau, f_\tau)\in Q} f_\tau\Big(\sum_{t=0}^{H-1} R(s_t^\tau, \pi(s_t^\tau))\Big)\\
&= \sum_{(\tau, f_\tau)\in Q} f_\tau\Phi(\tau)\,.
\end{align*}
Computational complexity: Solving for $f$ takes $O(|E|)$ time. The algorithm runs for $O(|E|)$ iterations and each iteration takes $O(H)$ time. Since $|E| = O(|S|^2 H)$, the total running time of Algorithm 5 is $O(|S|^2 H^2)$. C4 then takes $O(k^3|E|)$ time.

M Proof of Theorem 3
Theorem 3. Algorithm 2 outputs $F^{(H)}$ and $\beta^{(H)}$ satisfying that $|F^{(H)}| \le k+1$ and $\sum_{\tau\in F^{(H)}}\beta^{(H)}(\tau)\Phi(\tau) = V^\pi$ in time $O(k^4 H|S| + kH|S|^2)$.

Proof. Correctness: C4 guarantees that $|F^{(H)}| \le k+1$.
We prove $\sum_{\tau\in F^{(H)}}\beta^{(H)}(\tau)\Phi(\tau) = V^\pi$ by induction on $t = 1, \ldots, H$.
Recall that for any trajectory $\tau$ of length $h$, $J(\tau) = \Phi(\tau) + V^\pi(s_h^\tau, H-h)$ was defined as the expected return of trajectories (of length $H$) with prefix $\tau$.
In addition, recall that $J_{F^{(t)}} = \{J(\tau\circ s) \mid \tau\in F^{(t)}, s\in S\}$ and that $p_{F^{(t)},\beta^{(t)}}$ was defined by letting $p_{F^{(t)},\beta^{(t)}}(\tau\circ s) = \beta^{(t)}(\tau)P(s|s_t^\tau, \pi(s_t^\tau))$.
For the base case, we have that at $t = 1$,
\[
V^\pi = R(s_0, \pi(s_0)) + \sum_{s\in S} P(s|s_0, \pi(s_0)) V^\pi(s, H-1) = \sum_{s\in S} P(s|s_0, \pi(s_0)) J((s_0)\circ s) = \sum_{s\in S} p_{F^{(0)},\beta^{(0)}}((s_0)\circ s) J((s_0)\circ s) = \sum_{\tau\in F^{(1)}}\beta^{(1)}(\tau) J(\tau)\,.
\]
Suppose that $V^\pi = \sum_{\tau'\in F^{(t)}}\beta^{(t)}(\tau') J(\tau')$ holds at time $t$; we prove that the statement holds at time $t+1$:
\begin{align*}
V^\pi &= \sum_{\tau'\in F^{(t)}}\beta^{(t)}(\tau') J(\tau') = \sum_{\tau'\in F^{(t)}}\beta^{(t)}(\tau')\big(\Phi(\tau') + V^\pi(s_t^{\tau'}, H-t)\big)\\
&= \sum_{\tau'\in F^{(t)}}\beta^{(t)}(\tau')\Big(\Phi(\tau') + \sum_{s\in S} P(s|s_t^{\tau'}, \pi(s_t^{\tau'}))\big(R(s_t^{\tau'}, \pi(s_t^{\tau'})) + V^\pi(s, H-t)\big)\Big)\\
&= \sum_{\tau'\in F^{(t)}}\sum_{s\in S}\beta^{(t)}(\tau') P(s|s_t^{\tau'}, \pi(s_t^{\tau'}))\big(\Phi(\tau') + R(s_t^{\tau'}, \pi(s_t^{\tau'})) + V^\pi(s, H-t)\big)\\
&= \sum_{\tau'\in F^{(t)}}\sum_{s\in S}\beta^{(t)}(\tau') P(s|s_t^{\tau'}, \pi(s_t^{\tau'}))\big(\Phi(\tau'\circ s) + V^\pi(s, H-t)\big)\\
&= \sum_{\tau'\in F^{(t)}}\sum_{s\in S} p_{F^{(t)},\beta^{(t)}}(\tau'\circ s) J(\tau'\circ s)\\
&= \sum_{\tau\in F^{(t+1)}}\beta^{(t+1)}(\tau) J(\tau)\,,
\end{align*}
where the last equality holds because C4 preserves the expected value of the weighted set (Lemma 4).
By induction, the statement holds at $t = H$, i.e., $V^\pi = \sum_{\tau\in F^{(H)}}\beta^{(H)}(\tau) J(\tau) = \sum_{\tau\in F^{(H)}}\beta^{(H)}(\tau)\Phi(\tau)$.
Computational complexity: Solving $V^\pi(s, h)$ for all $s\in S$, $h\in[H]$ takes $O(kH|S|^2)$ time. In each round, we need to call C4 on at most $(k+1)|S|$ vectors, which takes $O(k^4|S|)$ time. Thus, we need $O(k^4 H|S| + kH|S|^2)$ time in total.

N Example of maximizing individual objective
Observation 2. Assume there exist $k > 2$ policies that together yield $k$ linearly independent value vectors. Consider the $k$ different policies $\pi_1^*, \ldots, \pi_k^*$ such that each $\pi_i^*$ maximizes objective $i\in[k]$. Then, their respective value vectors $V_1^*, \ldots, V_k^*$ are not necessarily linearly independent. Moreover, if $V_1^*, \ldots, V_k^*$ are linearly dependent, it does not mean that $k$ linearly independent value vectors do not exist.

Proof. For simplicity, we show an example with a horizon of $H = 1$, but the results can be extended to any $H\ge 1$. We show an example where there are 4 different value vectors, of which 3 are obtained by the $k = 3$ policies that maximize the 3 objectives and are linearly dependent.
Consider an MDP with a single state (also known as a multi-armed bandit) with 4 actions with deterministic reward vectors (which are also the expected values of the 4 possible policies in this case):
\[
r(1) = \begin{pmatrix}8\\4\\2\end{pmatrix},\quad r(2) = \begin{pmatrix}1\\2\\3\end{pmatrix},\quad r(3) = \begin{pmatrix}85/12\\25/6\\35/12\end{pmatrix}\approx\begin{pmatrix}7.083\\4.167\\2.917\end{pmatrix},\quad r(4) = \begin{pmatrix}1\\3\\2\end{pmatrix}.
\]
Denote by $\pi^a$ the fixed policy that always selects action $a$. Clearly, policy $\pi^1$ maximizes the first objective, policy $\pi^2$ the third, and policy $\pi^3$ the second ($\pi^4$ does not maximize any objective). However,
• $r(3)$ linearly depends on $r(1)$ and $r(2)$, as
\[
\frac{5}{6}r(1) + \frac{5}{12}r(2) = r(3)\,.
\]
• In addition, $r(4)$ is linearly independent of $r(1), r(2)$: assume not. Then there exist $\beta_1, \beta_2\in\mathbb{R}$ s.t.
\[
\beta_1\cdot r(1) + \beta_2\cdot r(2) = \begin{pmatrix}8\beta_1 + \beta_2\\ 4\beta_1 + 2\beta_2\\ 2\beta_1 + 3\beta_2\end{pmatrix} = \begin{pmatrix}1\\3\\2\end{pmatrix} = r(4)\,.
\]
Hence, the first equation implies $\beta_2 = 1 - 8\beta_1$, and then $4\beta_1 + 2 - 16\beta_1 = 3$, hence $\beta_1 = -\frac{1}{12}$ and $\beta_2 = \frac{5}{3}$. Substituting into the third equation yields $-\frac{1}{6} + 5 = 2$, which is a contradiction.
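Both claims in the example are easy to verify numerically; the following snippet (assuming numpy) checks the linear dependence of $r(3)$ on $r(1), r(2)$ and the full rank of $\{r(1), r(2), r(4)\}$.

import numpy as np

r1 = np.array([8.0, 4.0, 2.0])
r2 = np.array([1.0, 2.0, 3.0])
r3 = np.array([85 / 12, 25 / 6, 35 / 12])
r4 = np.array([1.0, 3.0, 2.0])

print(np.allclose(5 / 6 * r1 + 5 / 12 * r2, r3))       # True: r(3) is a combination of r(1), r(2)
print(np.linalg.matrix_rank(np.stack([r1, r2, r4])))    # 3: r(1), r(2), r(4) are linearly independent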

