
Unsupervised Meta-Learning for Reinforcement Learning

Abhishek Gupta * 1 Benjamin Eysenbach * 2 Chelsea Finn 3 Sergey Levine 1

arXiv:1806.04640v3 [cs.LG] 30 Apr 2020

*Equal contribution. 1 UC Berkeley, 2 Carnegie Mellon University, 3 Stanford University. Correspondence to: Abhishek Gupta <[email protected]>.

Abstract

Meta-reinforcement learning (meta-RL) algorithms leverage experience from learning previous tasks to learn how to learn new tasks quickly. However, this process requires a large number of meta-training tasks to be provided for meta-learning. In effect, meta-RL shifts the human burden from algorithm to task design. In this work we automate the process of task design, devising a meta-learning algorithm that does not require manual design of meta-training tasks. We propose a family of unsupervised meta-RL algorithms based on the insight that task proposals based on mutual information can be used to train optimal meta-learners. Experimentally, our unsupervised meta-RL algorithm, which does not require manual task design, substantially improves on learning from scratch, and is competitive with supervised meta-RL approaches on benchmark tasks.

1. Introduction

Reusing past experience for faster learning of new tasks is a key challenge for machine learning. Meta-learning methods achieve this by using past experience to explicitly optimize for rapid adaptation (Mishra et al., 2017; Snell et al., 2017; Schmidhuber, 1987; Finn et al., 2017a; Gupta et al., 2018; Wang et al., 2016; Al-Shedivat et al., 2017). In the context of reinforcement learning (RL), meta-reinforcement learning (meta-RL) algorithms can learn to solve new RL tasks more quickly through experience on past tasks (Duan et al., 2016b; Gupta et al., 2018; Finn et al., 2017a). Typical meta-RL algorithms assume the ability to sample from a pre-specified task distribution, and these algorithms learn to solve new tasks drawn from this distribution very quickly. However, specifying a task distribution is tedious and requires a significant amount of supervision (Finn et al., 2017b; Duan et al., 2016b) that may be difficult to provide for large, real-world problem settings. The performance of meta-learning algorithms critically depends on the meta-training task distribution, and meta-learning algorithms generalize best to new tasks which are drawn from the same distribution as the meta-training tasks (Finn & Levine, 2018). In effect, meta-RL offloads the design burden from algorithm design to task design. While meta-RL acquires representations for fast adaptation to the specified task distribution, specifying this task distribution is often tedious and challenging. Can we automate the process of task design, thereby doing away with human supervision entirely?

In this paper, we take a step towards unsupervised meta-RL: meta-learning from a task distribution that is acquired automatically, rather than requiring manual design of the meta-training tasks. While unsupervised meta-RL does not make any assumptions about the reward functions on which it will be evaluated at test time, it does assume that the environment dynamics remain the same. This allows an unsupervised meta-RL agent to utilize environment interactions to meta-train a model that is optimized to be effective for learning from previously unseen reward functions in that environment at meta-test time. Our method can also be thought of as automatically acquiring an environment-specific learning procedure for deep neural network policies, somewhat related to data-driven initialization procedures explored in supervised learning (Krähenbühl et al., 2015; Hsu et al., 2018).

The primary contribution of our work is a framework for unsupervised meta-RL. We describe a family of unsupervised meta-RL algorithms and provide analysis to show that unsupervised meta-RL methods based on mutual information can be optimal, in a minimax sense. Our experiments show that, for a variety of robotic control tasks, unsupervised meta-RL can effectively acquire RL procedures. These procedures not only learn faster than standard RL approaches that learn from scratch, but also outperform prior methods that do pure exploration and then fine-tuning at test time. Our results even approach the performance of an oracle method that relies on hand-designed task distributions.

2. Related Work
Our work lies at the intersection of meta-RL, goal generation, and unsupervised exploration. Meta-learning algorithms use data from multiple tasks to learn how to learn, acquiring rapid adaptation procedures from experience (Schmidhuber, 1987; Naik & Mammone, 1992; Thrun & Pratt, 1998; Bengio et al., 1992; Hochreiter et al., 2001; Santoro et al., 2016; Andrychowicz et al., 2016; Ravi & Larochelle, 2017; Finn et al., 2017a; Munkhdalai & Yu, 2017). These approaches have been extended into the setting of RL (Duan et al., 2016b; Wang et al., 2016; Finn et al., 2017a; Sung et al., 2017; Gupta et al., 2018; Mendonca et al., 2019; Houthooft et al., 2018; Stadie et al., 2018; Rakelly et al., 2019; Nagabandi et al., 2018a). In practice, the performance of meta-learning algorithms depends on the user-specified meta-training task distribution. We aim to lift this limitation and provide a general recipe for avoiding manual task engineering for meta-RL. A handful of prior meta-learning methods have used self-proposed task distributions for learning supervised learning procedures (Hsu et al., 2018; Antoniou & Storkey, 2019; Lin et al., 2019; Ji et al., 2019). In contrast, our work deals with the RL setting, where the environment dynamics provide a rich inductive bias that our meta-learner can exploit. In the RL setting, task distributions can be obtained in a variety of ways, including adversarial goal generation (Sukhbaatar et al., 2017; Held et al., 2017) and information-theoretic methods (Gregor et al., 2016; Eysenbach et al., 2018; Co-Reyes et al., 2018; Achiam et al., 2018). The most similar work is Jabri et al. (2019), which also considers the unsupervised application of meta-learning to RL tasks. We build upon this work by proving that an optimal meta-learner can be acquired using mutual information-based task proposals.

Exploration methods that seek out novel states are also closely related to goal generation methods (Pathak et al., 2017; Schmidhuber, 2009; Bellemare et al., 2016; Osband et al., 2016; Stadie et al., 2015), but do not by themselves aim to generate new tasks or learn to adapt more quickly to new tasks, only to achieve wide coverage of the state space. Model-based RL methods (Deisenroth & Rasmussen, 2011; Chua et al., 2018; Srinivas et al., 2018; Nagabandi et al., 2018b; Finn & Levine, 2017b; Atkeson & Santamaria, 1997) use unsupervised experience to learn a dynamics model, but do not learn how to efficiently use this model to explore to solve new tasks.

Goal-conditioned RL (Schaul et al., 2015; Andrychowicz et al., 2017; Pong et al., 2018) is also related to our work, and our analysis will study this special case first before generalizing to the general case of arbitrary tasks. As we discuss in Section 3.4, goal-reaching itself is not enough, as goal-reaching agents are not optimized to efficiently explore to determine which goal they should reach, relying instead on a hand-specified goal parameterization that doesn't allow these algorithms to work with arbitrary reward functions.

Figure 1. Unsupervised meta-reinforcement learning: given an environment, unsupervised meta-RL produces an environment-specific learning algorithm that quickly acquires new policies that maximize any task reward function. (Diagram labels: environment, unsupervised task acquisition, meta-RL, meta-learned environment-specific RL algorithm, reward function, fast adaptation, reward-maximizing policy.)

3. Unsupervised Meta-RL

We consider the problem of learning a reinforcement learning algorithm that can quickly solve new tasks in a given environment. This meta-RL process could, for example, tune the hyperparameters of another RL algorithm, or could replace the RL update rule itself with a learned update rule. Unlike prior work, we aim to do so without depending on any human supervision or information about the tasks that will be provided for meta-testing. A task reward is provided at meta-test time, and the learned RL procedure should adapt to this task reward as quickly as possible. We assume that all test-time tasks have the same dynamics, and differ only in their reward functions. Our algorithm will therefore need to utilize unsupervised environment interaction to learn an RL algorithm. In effect, the dynamics themselves will be the supervision for our learning algorithm.

We formalize the meta-training setting as a controlled Markov process (CMP) – a Markov decision process without a reward function, C = (S, A, P, γ, ρ), with state space S, action space A, transition dynamics P, discount factor γ and initial state distribution ρ. The CMP, along with a reward function r, produces a Markov decision process M = (S, A, P, γ, ρ, r). We define a learning algorithm f : D → π as a function that takes as input a dataset of experience from the MDP, D = {(s_i, a_i, r_i, s'_i)} ∼ M, and outputs a policy π(a | s). Evaluation of the learning procedure f is carried out over a handful of episodes. In episode i, the learning procedure f observes all previous data {τ_1, ..., τ_{i−1}} and outputs a policy to be used in iteration i. We evaluate the learning procedure f by summing its cumulative reward across iterations:

R(f, r_z) = \sum_i \mathbb{E}_{\pi = f(\{\tau_1, \ldots, \tau_{i-1}\}),\ \tau_i \sim \pi} \left[ \sum_t r_z(s_t, a_t) \right]

Our aim is to take this CMP and produce an environment-specific learning algorithm f that can quickly learn an optimal policy π*_r(a | s) for any reward function r. We refer to this problem as unsupervised meta-RL, and illustrate the problem setting in Fig. 1.
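For illustration, the evaluation protocol above can be written as a short sketch. The code below is illustrative only: it assumes a simple tabular CMP interface (reset and step methods) and represents the learning procedure f as a callable; these interface names are assumptions made for the example, not part of the formalism.

```python
def evaluate_learning_procedure(f, cmp, reward_fn, num_episodes, horizon):
    """Estimate R(f, r_z): the sum of per-episode returns when the learning
    procedure f is re-invoked on all previously observed trajectories.

    f         : callable mapping a list of past trajectories to a policy,
                where the policy maps a state to an action
    cmp       : controlled Markov process with reset() -> state and
                step(state, action) -> next_state (dynamics only, no reward)
    reward_fn : r_z(state, action) -> float, supplied only at meta-test time
    """
    past_trajectories = []
    total_return = 0.0
    for _ in range(num_episodes):
        policy = f(past_trajectories)            # policy used for episode i
        state = cmp.reset()
        trajectory, episode_return = [], 0.0
        for _ in range(horizon):
            action = policy(state)
            next_state = cmp.step(state, action)
            reward = reward_fn(state, action)    # observed reward is stored for f
            trajectory.append((state, action, reward, next_state))
            episode_return += reward
            state = next_state
        past_trajectories.append(trajectory)
        total_return += episode_return
    return total_return
```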
We now sketch a recipe for unsupervised meta-RL, analyze when this recipe is optimal, and then instantiate a practical approximation to this theoretically-motivated approach by building upon known meta-learning algorithms and unsupervised exploration methods.

3.1. A General Recipe

To construct an unsupervised meta-RL algorithm, we leverage the insight that, to acquire a fast learning algorithm without task supervision, we can simply leverage standard meta-learning techniques, but with unsupervised task proposal mechanisms. Our unsupervised meta-RL framework therefore consists of a task proposal mechanism and a meta-learning method. For reasons that will become more apparent later, we will define the task distribution as a mapping from a latent variable z ∼ p(z) to a reward function r_z(s, a) : S × A → R.¹ That is, for each value of the random variable z, we have a different reward function r_z(s, a). Under this formulation, learning a task distribution amounts to optimizing a parametric form for the reward function r_z(s, a) that maps each z ∼ p(z) to a different reward function. The choice of this parametric form represents an important design decision for an unsupervised meta-learning method, and the resulting set of tasks is often referred to as a task or goal proposal procedure. In the following section, we will discuss a theoretical framework that allows us to make this choice so as to minimize the worst-case regret of the subsequently meta-learned learning algorithm f.

¹ In most cases p(z) is chosen to be a uniform categorical, so it is not challenging to specify.

The second component is the meta-learning algorithm, which takes the family of reward functions induced by p(z) and r_z(s, a), along with the associated CMP, and meta-learns an RL algorithm f that can quickly adapt to any task from the task distribution defined by p(z) and r_z(s, a) in the given CMP. The meta-learned algorithm f can then learn new tasks quickly at meta-test time, when a user-specified reward function is actually provided. Fig. 1 summarizes this generic design for an unsupervised meta-RL algorithm.

The "no free lunch theorem" (Wolpert et al., 1995; Whitley & Watson, 2005) might lead us to expect that a truly generic approach to proposing a task distribution would not yield a learning procedure f that is effective on any real tasks. However, the assumption that the dynamics remain the same across tasks affords us an inductive bias with which we pay for our lunch. In the following sections, we will discuss how to formulate acquiring the optimal unsupervised learning procedure, which minimizes regret on new meta-test tasks in the absence of any prior knowledge. Since our analysis will focus on a restricted class of learning procedures, our results are lower bounds for the performance of general learning procedures. We first define an optimal meta-learner and then show how we can train one without requiring task distributions to be hand-specified.

3.2. Optimal Meta-Learners

We begin our analysis by considering the optimal learning procedure when the task distribution is known. For a task distribution p(r_z), the optimal learning procedure f* is given by

f^* \triangleq \arg\max_f \mathbb{E}_{p(r_z)}[R(f, r_z)].

Other learning procedures f may achieve lower reward, and we define the regret incurred by using a suboptimal learning procedure as the difference in expected reward, compared with the optimal learning procedure:

\mathrm{Regret}(f, p(r_z)) \triangleq \mathbb{E}_{p(r_z)}[R(f^*, r_z)] - \mathbb{E}_{p(r_z)}[R(f, r_z)].

Minimizing this regret is equivalent to maximizing the expected reward objective used by most meta-RL methods (Finn et al., 2017a; Duan et al., 2016b). Note that different task distributions p(r_z) will have different optimal learning procedures f*. For example, the optimal behavior for manipulation tasks involves moving a robot's arms, while the optimal behavior for locomotion tasks involves moving a robot's legs. Therefore, f* depends on p(r_z). We next define the notion of an optimal unsupervised meta-learner, which does not require prior knowledge of p(r_z).

In unsupervised meta-reinforcement learning, the reward distribution p(r_z) is unknown. In this setting, we evaluate a learning procedure f based on its regret against the worst-case task distribution for CMP C:

\mathrm{Regret}_{\mathrm{WC}}(f, C) = \max_{p(r_z)} \mathrm{Regret}(f, p(r_z)).    (1)

For a CMP C, we define the optimal unsupervised learning procedure as follows:

Definition 1. The optimal unsupervised learning procedure f*_C for a CMP C is defined as

f^*_C \triangleq \arg\min_f \mathrm{Regret}_{\mathrm{WC}}(f, C).

Note that the optimal unsupervised learning procedure may be different for different CMPs. We can also define the optimal unsupervised meta-learning algorithm F*, which takes as input a CMP C and returns the optimal unsupervised learning procedure f*_C for that CMP:

Definition 2. The optimal unsupervised meta-learner F*(C) = f*_C is a function that takes as input a CMP C and outputs the corresponding optimal unsupervised learning procedure f*_C:

F^* \triangleq \arg\min_F \mathrm{Regret}_{\mathrm{WC}}(F(C), C).

Note that the optimal unsupervised meta-learner F* is universal – it does not depend on any particular task distribution, or any particular CMP. The next sections discuss how to find the minimax learning procedure, which minimizes the worst-case regret (Eq. 1).
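The worst-case regret in Eq. 1 and Definitions 1 and 2 can also be approximated empirically, as in the sketch below. This is purely a diagnostic illustration: it maximizes over a finite set of candidate task distributions rather than over all distributions, and it assumes access to the optimal procedure f* for each candidate (the hypothetical f_star_for callable), reusing evaluate_learning_procedure from the earlier sketch.

```python
def expected_reward(f, cmp, task_distribution, num_samples, num_episodes, horizon):
    """Monte Carlo estimate of E_{p(r_z)}[R(f, r_z)], where task_distribution()
    samples a reward function r_z from p(r_z)."""
    returns = []
    for _ in range(num_samples):
        reward_fn = task_distribution()
        returns.append(evaluate_learning_procedure(
            f, cmp, reward_fn, num_episodes, horizon))
    return sum(returns) / len(returns)

def worst_case_regret(f, f_star_for, cmp, candidate_distributions,
                      num_samples, num_episodes, horizon):
    """Approximate Regret_WC(f, C) by maximizing Regret(f, p) over a finite
    set of candidate task distributions p. f_star_for(p) returns the optimal
    learning procedure for distribution p and is assumed to be given."""
    regrets = []
    for p in candidate_distributions:
        reward_opt = expected_reward(f_star_for(p), cmp, p,
                                     num_samples, num_episodes, horizon)
        reward_f = expected_reward(f, cmp, p,
                                   num_samples, num_episodes, horizon)
        regrets.append(reward_opt - reward_f)
    return max(regrets)
```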
3.3. Special Case: Goal-Reaching Tasks

We start by deriving an optimal unsupervised meta-learner for the special case where all tasks are assumed to be goal-state reaching tasks, and then generalize this approach to solve arbitrary tasks in Section 3.4. We restrict our analysis to CMPs with deterministic dynamics, and consider episodes with finite horizon T and a discount factor of γ = 1. Each task corresponds to reaching a goal state s_g at the last time step of the episode, so the reward function is r_g(s_t) ≜ 1(t = T) · 1(s_t = s_g).

We first derive the optimal learning procedure for the case where p(s_g) is known, and then derive the optimal procedure for the case where p(s_g) is unknown.

3.3.1. The Optimal Learning Procedure for Known p(s_g)

In the case of goal-reaching tasks, the optimal fast learning procedure f searches through potential goal states until it finds the goal and then navigates to that goal state in all subsequent episodes. Define f_π as the learning procedure that uses policy π to explore until the goal is found, and then always returns to the goal state. We will restrict our attention to the set of learning procedures F_π ≜ {f_π} constructed in this fashion, so our theoretical results will be lower bounds on the performance of arbitrary learning procedures. The learning procedure f_π incurs one unit of regret for each step before it has found the goal, and zero regret afterwards. The expected cumulative regret is therefore the expectation of the hitting time. To compute the expected hitting time, we define ρ_π^T(s) as the probability that policy π visits state s at time step t = T. If s_g is the true goal, then the event that the policy π reaches s_g at the final step of an episode is a Bernoulli random variable with parameter p = ρ_π^T(s_g). Thus, the expected hitting time of this goal state is

\mathrm{HittingTime}_\pi(s_g) = \frac{1}{\rho_\pi^T(s_g)}.

The regret of the learning procedure f_π is

\mathrm{Regret}(f_\pi, p(r_g)) = \int \mathrm{HittingTime}_\pi(s_g)\, p(s_g)\, ds_g = \int \frac{p(s_g)}{\rho_\pi^T(s_g)}\, ds_g.    (2)

To now compute the optimal learning procedure f_π, we can minimize the regret in Equation 2 w.r.t. the marginal distribution ρ_π^T. Using the calculus of variations (for more details refer to Appendix C in Lee et al. (2019)), the exploration policy for the optimal meta-learner, π*, satisfies:

\rho_{\pi^*}^T(s_g) = \frac{\sqrt{p(s_g)}}{\int \sqrt{p(s'_g)}\, ds'_g}.    (3)

Thus, when the goal sampling distribution p(s_g) is known, the optimal learning procedure is obtained by finding π* satisfying Eq. 3 and then using f_{π*} as the learning procedure. The next section considers the case where p(s_g) is not known.
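For intuition, the discrete analogues of Eq. 2 and Eq. 3 can be checked numerically: on a small state space with a known, non-uniform goal distribution, the square-root marginal of Eq. 3 attains lower expected hitting-time regret than a uniform exploration marginal. The 4-state example below is made up purely for illustration.

```python
import numpy as np

def goal_reaching_regret(rho_T, p_goal):
    """Discrete version of Eq. 2: expected hitting time, sum_s p(s_g) / rho_T(s_g)."""
    return np.sum(p_goal / rho_T)

def optimal_exploration_marginal(p_goal):
    """Discrete version of Eq. 3: rho*(s_g) proportional to sqrt(p(s_g))."""
    rho = np.sqrt(p_goal)
    return rho / rho.sum()

p_goal = np.array([0.7, 0.1, 0.1, 0.1])    # known, non-uniform goal distribution
uniform = np.full(4, 0.25)
rho_star = optimal_exploration_marginal(p_goal)

print(goal_reaching_regret(uniform, p_goal))   # 4.0, i.e. |S| for a uniform marginal
print(goal_reaching_regret(rho_star, p_goal))  # ~3.19, smaller when p(s_g) is known
```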
3.3.2. The Optimal Unsupervised Learning Procedure for Goal-Reaching Tasks

In the case of goal-reaching tasks where the goal distribution p(s_g) is not known, the optimal unsupervised learning procedure can be constructed from a policy with a uniform marginal state distribution (proof in Appendix A):

Lemma 1. Let π be a policy for which ρ_π^T(s) is uniform. Then f_π has the lowest worst-case regret among learning procedures in F_π.

One route for constructing this optimal unsupervised learning procedure is to first acquire a policy π for which ρ_π^T(s) is uniform and then return f_π. However, finding such a policy π is challenging, especially in high-dimensional state spaces and in the absence of resets. Instead, we will take an alternate route, acquiring f_π directly without ever computing π. In addition to sidestepping the requirement of computing π, this approach will also have the benefit of generalizing beyond goal-reaching tasks to arbitrary task distributions.

Our approach for directly computing the optimal unsupervised learning procedure hinges on the observation that the optimal unsupervised learning procedure is the optimal (supervised) learning procedure for goals proposed from a uniform distribution. Thus, the optimal unsupervised learning procedure will come not as a result of a careful construction, but rather as the output of an optimization procedure (i.e., meta-learning). We can therefore obtain the optimal unsupervised learning procedure by applying a meta-learning algorithm to a task distribution that samples goals uniformly. To ensure that the resulting learning procedure f lies within the set F_π, we will only consider "memoryless" meta-learning algorithms that maintain no internal state before the true goal is found.² While sampling goals uniformly is itself a challenging problem, we can use the same trick as before: instead of constructing this uniform goal distribution directly, we instead find an optimization problem for which the solution is to sample goals uniformly.

² MAML satisfies this requirement, as the internal parameters are updated by policy gradient, which is zero because the reward is zero before the true goal is found.

The optimization problem that we use will involve two latent variables, the final state s_T and an auxiliary latent variable z sampled from a prior µ(z). The optimization problem will be to find a conditional distribution µ(s_T | z) such that the mutual information between z and s_T is maximized:

\max_{\mu(s_T \mid z)} I_\mu(s_T ; z)    (4)

The conditional distribution µ(s_T | z) that optimizes Equation 4 is one with a uniform marginal distribution over terminal states (proof in Appendix A):

Lemma 2. Assume there exists a conditional distribution µ(s_T | z) satisfying the following two properties:
1. The marginal distribution over terminal states is uniform: µ(s_T) = ∫ µ(s_T | z) µ(z) dz = Unif(S); and
2. The conditional distribution µ(s_T | z) is a Dirac: ∀z, ∃ s_z s.t. µ(s_T | z) = 1(s_T = s_z).

Then any solution µ(s_T | z) to the mutual information objective (Eq. 4) satisfies µ(s_T) = Unif(S) and µ(s_T | z) = 1(s_T = s_z).

3.3.3. Optimizing Mutual Information

To optimize the above mutual information objective, we note that a conditional distribution µ(s_T | z) can be defined implicitly via a latent-conditioned policy µ(a | s, z). This policy is not a meta-learned model, but rather will become part of the task proposal mechanism. For a given prior µ(z) and latent-conditioned policy µ(a | s, z), the joint likelihood is

\mu(\tau, z) = \mu(z)\, p(s_1) \prod_t p(s_{t+1} \mid s_t, a_t)\, \mu(a_t \mid s_t, z),

and the marginal likelihood is simply given by

\mu(s_T, z) = \int \mu(\tau, z)\, ds_1\, da_1 \cdots da_{T-1}.

The purpose of our repeated indirection now becomes clear: prior work (Eysenbach et al., 2018; Achiam et al., 2018) has proposed efficient algorithms for maximizing the mutual information objective (Eq. 4) when the conditional distribution µ(s_T | z) is defined implicitly in terms of a latent-conditioned policy. At this point, we finally can sample goals uniformly, by sampling z ∼ µ(z) followed by s_T ∼ µ(s_T | z).

Recall that we wanted to obtain a uniform goal distribution so that we could apply meta-learning to obtain the optimal learning procedure. However, the input to meta-learning procedures is not a distribution over goals but a distribution over reward functions. We therefore define our task proposal distribution p(r_z) by sampling z ∼ p(z) and using the corresponding reward function r_z(s_T, a_T) ≜ log p(s_T | z), resulting in a uniform distribution as described in Lemma 2.
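In practice the density p(s_T | z) is rarely available in closed form; methods such as DIAYN instead train a discriminator q_φ(z | s) and use log q_φ(z | s) as the proposal reward (by Bayes' rule, log p(s | z) = log q_φ(z | s) + log p(s) − log p(z), and the extra terms do not depend on z). The snippet below is a minimal NumPy sketch of this surrogate with a hypothetical linear-softmax discriminator; it is illustrative only and not the implementation used in the experiments of Section 4.

```python
import numpy as np

def discriminator_log_probs(state_features, W):
    """Hypothetical linear-softmax discriminator q_phi(z | s): returns
    log-probabilities over the latent z for a single state's features."""
    logits = W @ state_features
    logits = logits - logits.max()                 # numerical stability
    return logits - np.log(np.exp(logits).sum())

def proposed_task_reward(z, state_features, W):
    """Task-proposal reward r_z(s) = log q_phi(z | s), the practical surrogate
    for log p(s_T | z) used to define the proposed reward functions."""
    return discriminator_log_probs(state_features, W)[z]

# Usage: sample a latent task, then reward states by how recognizable z is from them.
rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))    # 5 latent tasks, 3-dimensional state features
z = int(rng.integers(5))       # z ~ p(z), a uniform categorical prior
s = rng.normal(size=3)         # a visited state
print(proposed_task_reward(z, s, W))
```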
3.4. General Case: Trajectory-Matching Tasks

To extend the analysis in the previous section to the general case, and thereby derive a framework for optimal unsupervised meta-learning, we will consider "trajectory-matching" tasks. These tasks are a trajectory-based generalization of goal reaching: while goal-reaching tasks only provide a positive reward when the policy reaches the goal state, trajectory-matching tasks only provide a positive reward when the policy executes the optimal trajectory. The trajectory-matching case is more general because, while trajectory matching can represent different goal-reaching tasks, it can also represent tasks that are not simply goal reaching, such as reaching a goal while avoiding a dangerous region or reaching a goal in a particular way. Moreover, the trajectory-matching case is actually also a generalization of the typical reinforcement learning case with Markovian rewards, because any such task can be represented by a trajectory-reaching objective as well. Please refer to Section 3.4.3 for a more complete discussion.

As before, we will restrict our attention to CMPs with deterministic dynamics. These non-Markovian tasks essentially amount to a problem where an RL algorithm must "guess" the optimal policy, and only receives a reward if its behavior is perfectly consistent with that optimal policy.

We will show that optimizing the mutual information between z and trajectories to obtain a task proposal distribution, and subsequently optimizing a meta-learner for this distribution, will give us the optimal unsupervised meta-learner for this class of reward functions. We subsequently show that unsupervised meta-learning for the trajectory-matching task is at least as hard as unsupervised meta-learning for general tasks. As before, let us begin with an analysis of optimal meta-learners in the case where the distribution over trajectory-matching tasks p(τ*) is known, and subsequently direct our attention to formulating an optimal unsupervised meta-learner.

3.4.1. Optimal Meta-Learner for Known p(τ*)

Formally, we define a distribution of trajectory-matching tasks by a distribution over desired trajectories, p(τ*). For each goal trajectory τ*, the corresponding trajectory-level reward function is

r_{\tau^*}(\tau) \triangleq 1(\tau = \tau^*).

The analysis from Section 3.3 can be repurposed here. As before, we restrict our attention to learning procedures f_π ∈ F_π. After running the exploration policy to discover trajectories that obtain reward, the policy will deterministically keep executing the desired trajectory. We can define the hitting time as the expected number of episodes to match the target trajectory:

\mathrm{HittingTime}_\pi(\tau^*) = \frac{1}{\pi(\tau^*)}.

We then define regret as the expected hitting time:

\mathrm{Regret}(f_\pi, p(r_\tau)) = \int \mathrm{HittingTime}_\pi(\tau)\, p(\tau)\, d\tau = \int \frac{p(\tau)}{\pi(\tau)}\, d\tau.    (5)

This definition of regret allows us to optimize for an optimal learning procedure, and we obtain an exploration policy for the optimal learning procedure satisfying

\pi^*(\tau) = \frac{\sqrt{p(\tau)}}{\int \sqrt{p(\tau')}\, d\tau'}.
3.4.2. Optimal Unsupervised Learning Procedure for Trajectory-Matching Tasks

As described in Section 3.2, obtaining such a policy requires knowing the trajectory distribution p(τ), and we must resort to optimizing the worst-case regret. As argued in Lemma 1, the solution to this min-max optimization is a learning procedure whose exploration policy is a uniform distribution over trajectories.

Lemma 3. Let π be a policy for which π(τ) is uniform. Then f_π has the lowest worst-case regret among learning procedures in F_π.

We can acquire an unsupervised meta-learner of this form by proposing and meta-learning on a task distribution that is uniform over trajectories. How might we actually propose a task distribution that is uniform over trajectories? As argued for the goal-reaching case, we can do so by optimizing a trajectory-level mutual information objective:

I(\tau ; z) = H[\tau] - H[\tau \mid z].

The optimal policy for this objective has a uniform distribution over trajectories and, conditioned on a particular latent z, deterministically produces a single trajectory in a deterministic CMP. The analysis for the case of stochastic dynamics is more involved and is left to future work. By optimizing a task proposal distribution that maximizes trajectory-level mutual information, and subsequently performing meta-learning on the proposed tasks, we can acquire the optimal unsupervised meta-learner for trajectory-matching tasks, under the definition in Section 3.2.

3.4.3. Relationship to General Reward-Maximizing Tasks

Now that we have derived the optimal meta-learner for trajectory-matching tasks, we observe that trajectory matching is a super-set of the problem of optimizing any possible Markovian reward function at test time. For a given initial state distribution, each reward function is optimized by a particular trajectory. However, trajectories produced by a non-Markovian policy (i.e., a policy with memory) are not necessarily the unique optimum for any Markovian reward function. Let R_τ denote the set of trajectory-level reward functions, and R_{s,a} denote the set of all state-action level reward functions. Bounding the worst-case regret on R_τ minimizes an upper bound on the worst-case regret on R_{s,a}:

\min_{r_\tau \in R_\tau} \mathbb{E}_\pi[r_\tau(\tau)] \le \min_{r \in R_{s,a}} \mathbb{E}_\pi\Big[\sum_t r(s_t, a_t)\Big] \quad \forall \pi.

This inequality holds for all policies π, so it also holds for the policy that maximizes the LHS. While we aim to maximize the RHS, we only know how to maximize the LHS, which gives us a lower bound on the RHS. In general, this bound is loose, because the set of all Markovian reward functions is smaller than the set of all trajectory-level reward functions (i.e., trajectory-matching tasks). However, this bound becomes tight when considering meta-learning on the set of all possible (non-Markovian) reward functions.

In the discussion of meta-learning thus far, we have restricted our attention to tasks where the reward is provided at the last time step T of each episode and to the set of learning procedures F_π that maintain no internal state before the true goal or trajectory is found. In this restricted setting, the best that an optimal meta-learner can do is go directly to a goal or execute a particular trajectory in every episode according to the optimal exploration policy as discussed previously, essentially performing a version of posterior sampling. In the more general case with arbitrary reward functions and arbitrary learning procedures, intermediate rewards along a trajectory may be informative, and the optimal exploration strategy may be different from posterior sampling (Rothfuss et al., 2019; Duan et al., 2016b; Wang et al., 2016).

Nonetheless, the analysis presented in this section provides us insight into the behavior of optimal meta-learning algorithms and allows us to understand the qualities desirable for unsupervised task proposals. The general proposed scheme for unsupervised meta-learning has a significant benefit over standard universal value function and goal-reaching style algorithms: it can be applied to arbitrary reward functions going beyond simple goal reaching, and doesn't require the goal to be known in a parametric form beforehand.

3.5. Summary of Analysis

Through our analysis, we introduced the notion of optimal meta-learners and analyzed their exploration behavior and regret on a class of goal-reaching problems. We showed that on these problems, when the test-time task distribution is unknown, the optimal meta-training task distribution for minimizing worst-case test-time regret is uniform over the space of goals. We also showed that this optimal task distribution can be acquired by a simple mutual information maximization scheme. We subsequently extended the analysis to the more general case of matching arbitrary trajectories, as a proxy for the more general class of arbitrary reward functions. In the following section, we will discuss how we can derive a practical algorithm for unsupervised meta-learning from this analysis.

3.6. A Practical Algorithm

Following the derivation in the previous section, we can instantiate a practical unsupervised meta-RL algorithm by constructing a task proposal mechanism based on a mutual information objective.
A variety of different mutual information objectives can be formulated, including mutual information between single states and z (Eysenbach et al., 2018), pairs of start and end states and z (Gregor et al., 2016), and entire trajectories and z (Achiam et al., 2018; Sharma et al., 2019; Warde-Farley et al., 2018). We will use DIAYN and leave a full examination of possible mutual information objectives for future work.

DIAYN optimizes mutual information by training a discriminator network D_φ(z|·) that predicts which z was used to generate the states in a given rollout according to a latent-conditioned policy π(a|s, z). Our task proposal distribution is thus defined by r_z(s, a) = log(D_φ(z|s)). The complete unsupervised meta-learning algorithm is as follows: first, we acquire r_z(s, a) by running DIAYN, which learns D_φ(z|s) and a latent-conditioned policy π(a|s, z) (which is discarded). Then, we use z ∼ p(z) to propose tasks r_z(s, a) to a standard meta-RL algorithm. This meta-RL algorithm uses the proposed tasks to learn how to learn, acquiring a fast learning algorithm f which can then learn new tasks quickly. While, in principle, any meta-RL algorithm could be used, we use MAML (Finn et al., 2017a) as our meta-learning algorithm. Note that the learning algorithm f returned by MAML is defined simply as running gradient descent using the initial parameters found by MAML as initialization, as discussed in prior work (Finn & Levine, 2017a). The method is summarized in Algorithm 1.

Algorithm 1 Unsupervised Meta-RL Pseudocode
  Input: M \ R, an MDP without a reward function
  D_φ ← DIAYN() or D_φ ← random
  while not converged do
    Sample latent task variables z ∼ p(z)
    Define task reward r_z(s) using D_φ(z|s)
    Update f using MAML with reward r_z(s)
  end while
  Return: a learning algorithm f : D → π

In addition to mutual information maximizing task proposals, we will also consider random task proposals, where we also use a discriminator as the reward, according to r(s, z) = log D_{φ_rand}(z|s), but where the parameters φ_rand are chosen randomly (i.e., a random weight initialization for a neural network). While such random reward functions are not optimal, we find that they can surprisingly be used to acquire useful task distributions for simple tasks, though they are not as effective as the tasks become more complicated.
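The control flow of Algorithm 1 can be summarized in the sketch below. The helper callables (sample_rollouts, maml_meta_update) and the discriminator interface are assumptions standing in for the DIAYN and MAML components described above; the sketch illustrates the pipeline rather than reproducing either codebase.

```python
import numpy as np

def unsupervised_meta_rl(cmp, discriminator, num_skills, init_params,
                         sample_rollouts, maml_meta_update, meta_iterations=1000):
    """Sketch of Algorithm 1 with hypothetical helper signatures.

    discriminator    : callable D_phi(z, s) -> probability that state s was produced
                       under skill z; trained with DIAYN (UML-DIAYN) or left at a
                       random initialization (UML-RANDOM)
    sample_rollouts  : callable (cmp, params, reward_fn) -> batch of trajectories
    maml_meta_update : callable (params, rollouts, reward_fn) -> updated params
    """
    theta = init_params
    for _ in range(meta_iterations):
        z = int(np.random.randint(num_skills))                      # z ~ p(z)
        reward_fn = lambda s, a, z=z: np.log(discriminator(z, s))   # r_z(s) = log D_phi(z|s)
        rollouts = sample_rollouts(cmp, theta, reward_fn)           # self-proposed task data
        theta = maml_meta_update(theta, rollouts, reward_fn)        # MAML outer step
    # At meta-test time, f is plain gradient descent on the true task reward,
    # starting from the meta-learned initialization theta.
    return theta
```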
4. Experimental Evaluation

In our experiments, we aim to understand whether unsupervised meta-learning as described in Section 3.1 can provide us with an accelerated RL procedure on new tasks. Whereas standard meta-learning requires a hand-specified task distribution at meta-training time, unsupervised meta-learning learns the task distribution through unsupervised interaction with the environment. A fair baseline that likewise requires no reward supervision at training time, and only uses rewards at test time, is learning via RL from scratch without any meta-learning. As an upper bound, we include the unfair comparison to a standard meta-learning approach, where the meta-training distribution is manually designed. This method has access to a hand-specified task distribution that is not available to our method. We evaluate two variants of our approach: (a) task acquisition based on DIAYN followed by meta-learning using MAML, and (b) task acquisition using a randomly initialized discriminator followed by meta-learning using MAML.

4.1. Tasks and Implementation Details

Our experiments study three simulated environments of varying difficulty: 2D point navigation, 2D locomotion using the "HalfCheetah," and 3D locomotion using the "Ant," with the latter two environments being modifications of popular RL benchmarks (Duan et al., 2016a). While the 2D navigation environment allows for direct control of position, HalfCheetah and Ant can only control their center of mass via feedback control with high-dimensional actions (6D for HalfCheetah, 8D for Ant) and observations (17D for HalfCheetah, 111D for Ant).

The evaluation tasks, shown in Figure 5, are similar to prior work (Finn et al., 2017a; Pong et al., 2018): 2D navigation and Ant require navigating to goal positions, while the HalfCheetah must run at different goal velocities. These tasks are not accessible to our algorithm during meta-training. Please refer to Appendix C for details about hyperparameters for both MAML and DIAYN.

4.2. Fast Adaptation after Unsupervised Meta-RL

The comparison between the two variants of unsupervised meta-learning and learning from scratch is shown in Figure 2. We also add a comparison to VIME (Houthooft et al., 2016), a standard novelty-based exploration method, where we pretrain a policy with the VIME reward and then finetune it on the meta-test tasks. In all cases, the UML-DIAYN variant of unsupervised meta-learning produces an RL procedure that outperforms RL from scratch and VIME-init, suggesting that unsupervised interaction with the environment and meta-learning is effective in producing environment-specific but task-agnostic priors that accelerate learning on new, previously unseen tasks. The comparison with VIME shows that the speed of learning is not just about exploration but is indeed about fast adaptation. In our experiments thus far, UML-DIAYN always performs better than learning from scratch, although the benefit varies across tasks depending on the actual performance of DIAYN.
We also perform significantly better than a baseline of simply initializing from a DIAYN-trained contextual policy, and then finetuning the best skill with the actual task reward.

Figure 2 (panels: 2D navigation, Half-Cheetah, Ant). Unsupervised meta-learning accelerates learning: After unsupervised meta-learning, our approach (UML-DIAYN and UML-RANDOM) quickly learns a new task significantly faster than learning from scratch, especially on complex tasks. Learning the task distribution with DIAYN helps more for complex tasks. Results are averaged across 20 evaluation tasks, and 3 random seeds for testing. UML-DIAYN and UML-RANDOM also significantly outperform learning with DIAYN initialization or VIME.

Figure 3 (panels: 2D Navigation, Half-Cheetah, Ant Navigation). Comparison with handcrafted tasks: Unsupervised meta-learning (UML-DIAYN) is competitive with meta-training on handcrafted reward functions (i.e., an oracle). A misspecified, handcrafted meta-training task distribution often performs worse, illustrating the benefits of learning the task distribution.

Interestingly, in many cases (in Figure 3) the performance of unsupervised meta-learning with DIAYN matches that of the hand-designed task distribution. We see that on the 2D navigation task, while handcrafted meta-learning is able to learn very quickly initially, it performs similarly after 100 steps. For the cheetah environment as well, handcrafted meta-learning is able to learn very quickly to start off, but is quickly matched by unsupervised meta-RL with DIAYN. On the ant task, we see that hand-crafted meta-learning does better than UML-DIAYN, likely because the task distribution is challenging, and a better unsupervised task proposal algorithm would improve performance.

The comparison between the two unsupervised meta-learning variants is also illuminating: while the DIAYN-based variant of our method generally achieves the best performance, even the random discriminator is often able to provide a sufficient diversity of tasks to produce meaningful acceleration over learning from scratch in the case of 2D navigation and ant. This result has two interesting implications. First, it suggests that unsupervised meta-learning is an effective tool for learning an environment prior. Although the performance of unsupervised meta-learning can be improved with better coverage using DIAYN (as seen in Figure 2), even the random discriminator version provides competitive advantages over learning from scratch. Second, the comparison provides a clue for identifying the source of the structure learned through unsupervised meta-learning: though the particular task distribution has an effect on performance, simply interacting with the environment (without structured objectives, using a random discriminator) already allows meta-RL to learn effective adaptation strategies in a given environment.

5. Discussion and Future Work

We presented an unsupervised approach to meta-RL, where meta-learning is used to acquire an efficient RL procedure without requiring hand-specified task distributions. This approach accelerates RL without relying on the manual supervision required for conventional meta-learning algorithms. We provide a theoretical derivation that argues that task proposals based on mutual information maximization can provide a minimum worst-case regret meta-learner, under certain assumptions. Our experiments indicate that unsupervised meta-RL can accelerate learning on a range of tasks.

Our approach also opens a number of questions about unsupervised meta-learning algorithms. One limitation of our analysis is that it only considers deterministic dynamics, and only considers task distributions where posterior sampling is optimal. Extending our analysis to stochastic dynamics and more realistic task distributions may allow unsupervised meta-RL to acquire learning algorithms that can more effectively solve real-world tasks.
References

Joshua Achiam, Harrison Edwards, Dario Amodei, and Pieter Abbeel. Variational option discovery algorithms. arXiv preprint arXiv:1807.10299, 2018.

Maruan Al-Shedivat, Trapit Bansal, Yuri Burda, Ilya Sutskever, Igor Mordatch, and Pieter Abbeel. Continuous adaptation via meta-learning in nonstationary and competitive environments. arXiv preprint arXiv:1710.03641, 2017.

Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In Neural Information Processing Systems (NIPS), 2016.

Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.

Antreas Antoniou and Amos Storkey. Assume, augment and learn: Unsupervised few-shot meta-learning via random labels and data augmentation. arXiv preprint arXiv:1902.09884, 2019.

Christopher G Atkeson and Juan Carlos Santamaria. A comparison of direct and model-based reinforcement learning. In Proceedings of International Conference on Robotics and Automation, volume 4, pp. 3557–3564. IEEE, 1997.

Marc G. Bellemare, Sriram Srinivasan, Georg Ostrovski, Tom Schaul, David Saxton, and Rémi Munos. Unifying count-based exploration and intrinsic motivation. CoRR, abs/1606.01868, 2016. URL http://arxiv.org/abs/1606.01868.

Samy Bengio, Yoshua Bengio, Jocelyn Cloutier, and Jan Gecsei. On the optimization of a synaptic learning rule. In Optimality in Artificial and Biological Neural Networks, 1992.

Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, pp. 4754–4765, 2018.

John D Co-Reyes, YuXuan Liu, Abhishek Gupta, Benjamin Eysenbach, Pieter Abbeel, and Sergey Levine. Self-consistent trajectory autoencoder: Hierarchical reinforcement learning with trajectory embeddings. arXiv preprint arXiv:1806.02813, 2018.

Marc Deisenroth and Carl E Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pp. 465–472, 2011.

Yan Duan, Xi Chen, Rein Houthooft, John Schulman, and Pieter Abbeel. Benchmarking deep reinforcement learning for continuous control. In International Conference on Machine Learning, pp. 1329–1338, 2016a.

Yan Duan, John Schulman, Xi Chen, Peter L Bartlett, Ilya Sutskever, and Pieter Abbeel. RL2: Fast reinforcement learning via slow reinforcement learning. arXiv preprint arXiv:1611.02779, 2016b.

Benjamin Eysenbach, Abhishek Gupta, Julian Ibarz, and Sergey Levine. Diversity is all you need: Learning skills without a reward function. arXiv preprint arXiv:1802.06070, 2018.

Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. CoRR, abs/1710.11622, 2017a. URL http://arxiv.org/abs/1710.11622.

Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2786–2793. IEEE, 2017b.

Chelsea Finn and Sergey Levine. Meta-learning and universality: Deep representations and gradient descent can approximate any learning algorithm. International Conference on Learning Representations, 2018.

Chelsea Finn, Pieter Abbeel, and Sergey Levine. Model-agnostic meta-learning for fast adaptation of deep networks. arXiv preprint arXiv:1703.03400, 2017a.

Chelsea Finn, Tianhe Yu, Tianhao Zhang, Pieter Abbeel, and Sergey Levine. One-shot visual imitation learning via meta-learning. CoRR, abs/1709.04905, 2017b. URL http://arxiv.org/abs/1709.04905.

Karol Gregor, Danilo Jimenez Rezende, and Daan Wierstra. Variational intrinsic control. arXiv preprint arXiv:1611.07507, 2016.

Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, and Sergey Levine. Meta-reinforcement learning of structured exploration strategies. arXiv preprint arXiv:1802.07245, 2018.

David Held, Xinyang Geng, Carlos Florensa, and Pieter Abbeel. Automatic goal generation for reinforcement learning agents. arXiv preprint arXiv:1705.06366, 2017.

Sepp Hochreiter, A Steven Younger, and Peter R Conwell. Learning to learn using gradient descent. In International Conference on Artificial Neural Networks, 2001.

Rein Houthooft, Xi Chen, Yan Duan, John Schulman, Filip De Turck, and Pieter Abbeel. VIME: Variational information maximizing exploration. In Advances in Neural Information Processing Systems, 2016.

Rein Houthooft, Richard Y Chen, Phillip Isola, Bradly C Stadie, Filip Wolski, Jonathan Ho, and Pieter Abbeel. Evolved policy gradients. arXiv preprint arXiv:1802.04821, 2018.

Kyle Hsu, Sergey Levine, and Chelsea Finn. Unsupervised learning via meta-learning. arXiv preprint arXiv:1810.02334, 2018.

Allan Jabri, Kyle Hsu, Abhishek Gupta, Ben Eysenbach, Sergey Levine, and Chelsea Finn. Unsupervised curricula for visual meta-reinforcement learning. In Advances in Neural Information Processing Systems, pp. 10519–10530, 2019.

Zilong Ji, Xiaolong Zou, Tiejun Huang, and Si Wu. Unsupervised few-shot learning via self-supervised training. arXiv preprint arXiv:1912.12178, 2019.

Philipp Krähenbühl, Carl Doersch, Jeff Donahue, and Trevor Darrell. Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856, 2015.

Lisa Lee, Benjamin Eysenbach, Emilio Parisotto, Eric P. Xing, Sergey Levine, and Ruslan Salakhutdinov. Efficient exploration via state marginal matching. CoRR, abs/1906.05274, 2019. URL http://arxiv.org/abs/1906.05274.

Jianxin Lin, Yijun Wang, Yingce Xia, Tianyu He, and Zhibo Chen. Learning to transfer: Unsupervised meta domain translation. arXiv preprint arXiv:1906.00181, 2019.

Russell Mendonca, Abhishek Gupta, Rosen Kralev, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Guided meta-policy search. CoRR, abs/1904.00956, 2019.

Nikhil Mishra, Mostafa Rohaninejad, Xi Chen, and Pieter Abbeel. A simple neural attentive meta-learner. In NIPS 2017 Workshop on Meta-Learning, 2017.

Tsendsuren Munkhdalai and Hong Yu. Meta networks. International Conference on Machine Learning (ICML), 2017.

Anusha Nagabandi, Ignasi Clavera, Simin Liu, Ronald S Fearing, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Learning to adapt in dynamic, real-world environments through meta-reinforcement learning. arXiv preprint arXiv:1803.11347, 2018a.

Anusha Nagabandi, Gregory Kahn, Ronald S Fearing, and Sergey Levine. Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 7559–7566. IEEE, 2018b.

Devang K Naik and RJ Mammone. Meta-neural networks that learn by learning. In International Joint Conference on Neural Networks (IJCNN), 1992.

Ian Osband, Charles Blundell, Alexander Pritzel, and Benjamin Van Roy. Deep exploration via bootstrapped DQN. CoRR, abs/1602.04621, 2016. URL http://arxiv.org/abs/1602.04621.

Deepak Pathak, Pulkit Agrawal, Alexei A. Efros, and Trevor Darrell. Curiosity-driven exploration by self-supervised prediction. In ICML, 2017.

Vitchyr Pong, Shixiang Gu, Murtaza Dalal, and Sergey Levine. Temporal difference models: Model-free deep RL for model-based control. arXiv preprint arXiv:1802.09081, 2018.

Kate Rakelly, Aurick Zhou, Deirdre Quillen, Chelsea Finn, and Sergey Levine. Efficient off-policy meta-reinforcement learning via probabilistic context variables. arXiv preprint arXiv:1903.08254, 2019.

Sachin Ravi and Hugo Larochelle. Optimization as a model for few-shot learning. In International Conference on Learning Representations (ICLR), 2017.

Jonas Rothfuss, Dennis Lee, Ignasi Clavera, Tamim Asfour, and Pieter Abbeel. ProMP: Proximal meta-policy search. In International Conference on Learning Representations (ICLR), 2019.

Adam Santoro, Sergey Bartunov, Matthew Botvinick, Daan Wierstra, and Timothy Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning (ICML), 2016.

Tom Schaul, Daniel Horgan, Karol Gregor, and David Silver. Universal value function approximators. In International Conference on Machine Learning, pp. 1312–1320, 2015.

Jürgen Schmidhuber. Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook. PhD thesis, Technische Universität München, 1987.

Jürgen Schmidhuber. Driven by compression progress: A simple principle explains essential aspects of subjective beauty, novelty, surprise, interestingness, attention, curiosity, creativity, art, science, music, jokes. In Computational Creativity: An Interdisciplinary Approach, 2009. URL http://drops.dagstuhl.de/opus/volltexte/2009/2197/.

Archit Sharma, Shixiang Gu, Sergey Levine, Vikash Kumar, and Karol Hausman. Dynamics-aware unsupervised discovery of skills. arXiv preprint arXiv:1907.01657, 2019.

Jake Snell, Kevin Swersky, and Richard Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pp. 4080–4090, 2017.

Aravind Srinivas, Allan Jabri, Pieter Abbeel, Sergey Levine, and Chelsea Finn. Universal planning networks. arXiv preprint arXiv:1804.00645, 2018.

Bradly C Stadie, Sergey Levine, and Pieter Abbeel. Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814, 2015.

Bradly C. Stadie, Ge Yang, Rein Houthooft, Xi Chen, Yan Duan, Yuhuai Wu, Pieter Abbeel, and Ilya Sutskever. Some considerations on learning to explore via meta-reinforcement learning. CoRR, abs/1803.01118, 2018. URL http://arxiv.org/abs/1803.01118.

Sainbayar Sukhbaatar, Zeming Lin, Ilya Kostrikov, Gabriel Synnaeve, Arthur Szlam, and Rob Fergus. Intrinsic motivation and automatic curricula via asymmetric self-play. arXiv preprint arXiv:1703.05407, 2017.

Flood Sung, Li Zhang, Tao Xiang, Timothy Hospedales, and Yongxin Yang. Learning to learn: Meta-critic networks for sample efficient learning. arXiv preprint arXiv:1706.09529, 2017.

Sebastian Thrun and Lorien Pratt. Learning to learn. Springer Science & Business Media, 1998.

Jane X Wang, Zeb Kurth-Nelson, Dhruva Tirumala, Hubert Soyer, Joel Z Leibo, Remi Munos, Charles Blundell, Dharshan Kumaran, and Matt Botvinick. Learning to reinforcement learn. arXiv preprint arXiv:1611.05763, 2016.

David Warde-Farley, Tom Van de Wiele, Tejas Kulkarni, Catalin Ionescu, Steven Hansen, and Volodymyr Mnih. Unsupervised control through non-parametric discriminative rewards. arXiv preprint arXiv:1811.11359, 2018.

Darrell Whitley and Jean Paul Watson. Complexity theory and the no free lunch theorem, 2005.

David H Wolpert, William G Macready, et al. No free lunch theorems for search. Technical report, Technical Report SFI-TR-95-02-010, Santa Fe Institute, 1995.
A. Proofs

Lemma 1. Let π be a policy for which ρ_π^T(s) is uniform. Then π has the lowest worst-case regret.

Proof of Lemma 1. To begin, we note that all goal distributions p(s_g) have equal regret for policies where ρ_π^T(s) = 1/|S| is uniform:

\mathrm{Regret}_p(\pi) = \int \frac{p(s_g)}{\rho_\pi^T(s_g)}\, ds_g = \int \frac{p(s_g)}{1/|S|}\, ds_g = |S|.

Now, consider a policy π′ for which ρ_{π′}^T(s) is not uniform. For simplicity, we will assume that the argmin is unique, though the proof holds for non-unique argmins as well. The worst-case goal distribution will choose the state s⁻ that the policy is least likely to visit:

p^-(s_g) \triangleq 1(s_g = \arg\min_s \rho_{\pi'}^T(s)).

Thus, the worst-case regret for policy π′ is strictly greater than the regret for a uniform π:

\max_p \mathrm{Regret}_p(\pi') = \mathrm{Regret}_{p^-}(\pi') = \int \frac{1(s_g = \arg\min_s \rho_{\pi'}^T(s))}{\rho_{\pi'}^T(s_g)}\, ds_g = \frac{1}{\min_s \rho_{\pi'}^T(s)} > |S|.    (6)

Thus, a policy π′ for which ρ_{π′}^T is non-uniform cannot be minimax, so the optimal policy has a uniform marginal ρ_π^T.
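The inequality in Eq. 6 can be sanity-checked numerically in the discrete case: the adversary concentrates the goal distribution on the least-visited state, so the worst-case regret of an exploration marginal equals the reciprocal of its smallest entry, which is minimized (at |S|) by the uniform marginal. The 4-state example below is made up for illustration.

```python
import numpy as np

def worst_case_goal_regret(rho_T):
    """Worst-case regret over goal distributions for an exploration marginal
    rho_T: the adversary puts all mass on the least-visited state (Eq. 6)."""
    return 1.0 / rho_T.min()

uniform = np.full(4, 0.25)
non_uniform = np.array([0.4, 0.3, 0.2, 0.1])

print(worst_case_goal_regret(uniform))      # 4.0, equal to |S|
print(worst_case_goal_regret(non_uniform))  # 10.0, strictly greater than |S|
```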
the meta-training process, we see that the meta-learned ini-
tialization becomes more and more effective at learning new
Lemma 2: Mutual information I(sT ; z) is maximized by a
tasks. This shows a clear correlation between additional
task distribution p(sg ) which is uniform over goal states.
meta-training and improved meta test-time performance.
Proof of Lemma 2. We define a latent variable model,
where we sample a latent variable z from a uniform prior B.1. Analysis of Learned Task Distributions
p(z) and sample goals from a conditional distribution We can analyze the tasks discovered through unsupervised
p(sT | z). To begin, note that the mutual information can exploration and compare them to tasks we evaluate on at
be written as a difference of entropies: meta-test time. Figure 5 illustrates these distributions using
scatter plots for 2D navigation and the Ant, and a histogram
Ip (sT ; z) = Hp [sT ] − Hp [sT | z]
for the HalfCheetah. Note that we visualize dimensions of
The conditional entropy Hp [sT | z] attains the smallest pos- the state that are relevant for the evaluation tasks – positions
sible value (zero) when each latent variable z corresponds to and velocities – but these dimensions are not specified in any
exactly one final state, sz . In contrast, the marginal entropy way during unsupervised task acquisition, which operates
Hp [sT ] attains the largest possible
R value (log |S|) when the on the entire state space. Although the tasks proposed via
marginal distribution p(sT ) = p(sT | z)p(z)dz is uni- unsupervised exploration provide fairly broad coverage, they
form. Thus, a task uniform distribution p(sg ) maximizes are clearly quite distinct from the meta-test tasks, suggesting
I(sT ; z). Note that for any non-uniform task distribution the approach can tolerate considerable distributional shift.
q(sT ), we have Hq [sT ] < Hp [sT ]. Since the conditional Qualitatively, many of the tasks proposed via unsupervised
entropy Hp [sT | z] is zero, no distribution can achieve a exploration such as jumping and falling that are not relevant
smaller conditional entropy. This, for all non-uniform task for the evaluation tasks. Our choice of the evaluation tasks
distributions q, we have Iq (sT ; z) < Ip (sT ; z). Thus, the was largely based on prior work, and therefore not tailored
optimal task distribution must be uniform. to this exploration procedure. The results for unsupervised
Figure 5 (panels: 2D navigation, Ant, Half-Cheetah). Learned meta-training task distribution and evaluation tasks: We plot the center of mass for various skills discovered by the point mass and the Ant using DIAYN, and a blue histogram of goal velocities for the cheetah. Evaluation tasks, which are not provided to the algorithm during meta-training, are plotted as red 'x' for the Ant and point mass, and as a green histogram for the cheetah. While the meta-training distribution is broad, it does not fully cover the evaluation tasks. Nonetheless, meta-learning on this learned task distribution enables efficient learning on a test task distribution.

C. Hyperparameter Details

Figure 6. Environments: (Left) Half-Cheetah and (Right) Ant.

For all our experiments, we used DIAYN to acquire the task proposals, using 20 skills for half-cheetah and ant and 50 skills for 2D navigation. We illustrate the half-cheetah and ant environments in Fig. 6. We ran the domains using the standard DIAYN hyperparameters described in https://github.com/ben-eysenbach/sac to acquire task proposals. These proposals were then fed into the MAML algorithm https://github.com/cbfinn/maml_rl, with inner learning rate 0.1, meta learning rate 0.01, inner batch size 40, path length 100, using 2-layer networks with 300 units each with ReLU nonlinearities. We vary the meta-batch size (the outer batch size) according to the number of skills: 50 for pointmass, 20 for cheetah, and 20 for ant. The test-time learning is done with the same parameters for the UMRL variants, and done using REINFORCE with the Adam optimizer for the comparison with learning from scratch. We swept over learning rates for learning from scratch via vanilla policy gradient, and found that using Adam with an adaptive step size is the most stable and quickest at learning.
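For reference, the hyperparameters reported above can be collected in one place. The sketch below only restates the values given in this appendix; the key names are illustrative and are not the configuration keys used by the linked DIAYN or maml_rl codebases.

```python
# Hyperparameters reported in Appendix C, gathered for reference.
UNSUPERVISED_META_RL_CONFIG = {
    "num_skills": {"2d_navigation": 50, "half_cheetah": 20, "ant": 20},
    "maml": {
        "inner_learning_rate": 0.1,
        "meta_learning_rate": 0.01,
        "inner_batch_size": 40,
        "path_length": 100,
        "meta_batch_size": {"2d_navigation": 50, "half_cheetah": 20, "ant": 20},
        "policy_network": {"hidden_layers": 2, "hidden_units": 300,
                           "nonlinearity": "ReLU"},
    },
    "test_time_baseline": {"algorithm": "REINFORCE", "optimizer": "Adam",
                           "learning_rate": "swept"},
}
```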