This lecture covers meta-learning and transfer learning in reinforcement learning (RL), emphasizing the importance of prior knowledge and task similarity for effective learning. It outlines strategies for transfer learning, such as forward transfer, multi-task transfer, and meta-learning, addresses challenges like domain shift and finetuning issues, and highlights how meta-learning can accelerate learning by leveraging experience from multiple tasks and improving exploration.


Meta-Learning & Transfer Learning

CS 285
Instructor: Sergey Levine
UC Berkeley
What’s the problem?

[Figure: two example tasks, one labeled "this is easy (mostly)" and one labeled "this is impossible". Why?]
Montezuma’s revenge
• Getting key = reward
• Opening door = reward
• Getting killed by skull = bad
Montezuma’s revenge
• We know what to do because we understand what
these sprites mean!
• Key: we know it opens doors!
• Ladders: we know we can climb them!
• Skull: we don’t know what it does, but we know it
can’t be good!
• Prior understanding of problem structure can help
us solve complex tasks quickly!
Can RL use the same prior knowledge as us?

• If we’ve solved prior tasks, we might acquire useful knowledge for solving a new task
• How is the knowledge stored?
• Q-function: tells us which actions or states are good
• Policy: tells us which actions are potentially useful
• some actions are never useful!
• Models: what are the laws of physics that govern the world?
• Features/hidden states: provide us with a good representation
• Don’t underestimate this!
Transfer learning terminology

transfer learning: using experience from one set of tasks for faster
learning and better performance on a new task

in RL, task = MDP!
source domain → target domain

“shot”: number of attempts in the target domain
• 0-shot: just run a policy trained in the source domain
• 1-shot: try the task once
• few-shot: try the task a few times
How can we frame transfer learning problems?

1. Forward transfer: learn policies that transfer effectively


a) Train on source task, then run on target task (or finetune)
b) Relies on the tasks being quite similar!
2. Multi-task transfer: train on many tasks, transfer to a new task
a) Sharing representations and layers across tasks in multi-task learning
b) New task needs to be similar to the distribution of training tasks
3. Meta-learning: learn to learn on many tasks
a) Accounts for the fact that we’ll be adapting to a new task during training!
Pretraining + Finetuning

The most popular transfer learning method in (supervised) deep learning!


What issues are we likely to face?
➢ Domain shift: representations learned in the source domain might not work well in the target domain
➢ Difference in the MDP: some things that are possible to do in the source domain are not possible to do in the target domain
➢ Finetuning issues: if pretraining & finetuning, the finetuning process may still need to explore, but the optimal policy during finetuning may be deterministic!
Domain adaptation in computer vision
[Figure: domain-adversarial training. Train on the source domain ("train here") and aim to do well on the target domain ("do well here") with the same network. A domain classifier guesses the domain from the intermediate features z; its gradient is reversed into the feature layer, forcing that layer to be invariant to domain (correct answer for the label head, incorrect answer for the domain classifier).]

Is this true?
Invariance assumption: everything that is different between domains is irrelevant.
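To make the reversed-gradient idea concrete, here is a minimal sketch of domain-adversarial feature learning in PyTorch. The encoder, heads, dimensions, and loss weighting are illustrative assumptions, not the reference implementation of Ganin et al.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; flips the gradient sign in the backward pass."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient: the encoder is pushed to *confuse* the domain classifier.
        return -ctx.lam * grad_output, None

encoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU())   # shared features z
label_head = nn.Linear(128, 10)                          # trained on source labels
domain_head = nn.Linear(128, 2)                          # guesses source vs. target

x_src, y_src = torch.randn(32, 64), torch.randint(0, 10, (32,))
x_tgt = torch.randn(32, 64)

z_src, z_tgt = encoder(x_src), encoder(x_tgt)
z_all = torch.cat([z_src, z_tgt])
domain_labels = torch.cat([torch.zeros(32), torch.ones(32)]).long()

task_loss = F.cross_entropy(label_head(z_src), y_src)
domain_loss = F.cross_entropy(domain_head(GradReverse.apply(z_all, 1.0)), domain_labels)
(task_loss + domain_loss).backward()   # a combined optimizer step would follow
```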
Domain adaptation in RL for dynamics?
Why is invariance not enough when the dynamics don’t match?

When might this not work?

Eysenbach et al., “Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers”
What if we can also finetune?

1. RL tasks are generally much less diverse
• Features are less general
• Policies & value functions become overly specialized
2. Optimal policies in fully observed MDPs are deterministic
• Loss of exploration at convergence
• Low-entropy policies adapt very slowly to new settings

See “exploration 2” lecture on unsupervised skill discovery and “control as inference” lecture on MaxEnt RL methods!
How to maximize forward transfer?
Basic intuition: the more varied the training domain is, the more likely
we are to generalize in zero shot to a slightly different domain.
“Randomization” (dynamics/appearance/etc.): widely used for
simulation to real world transfer (e.g., in robotics)
EPOpt: randomizing physical parameters

[Figure: training on a single torso mass vs. training on a model ensemble (train/test performance); ensemble adaptation handles unmodeled effects by adapting to target-domain data.]

Rajeswaran et al., “EPOpt: Learning Robust Neural Network Policies Using Model Ensembles.”
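A minimal sketch of dynamics randomization in Python. The parameter names, ranges, and `make_env` constructor are illustrative assumptions; EPOpt additionally optimizes a worst-case (CVaR) objective over the sampled models, which is omitted here.

```python
import random

def sample_physical_params():
    # Illustrative ranges; in practice these come from the simulator's parameters.
    return {"torso_mass": random.uniform(3.0, 9.0),
            "friction": random.uniform(0.5, 1.5)}

def train_with_randomized_dynamics(make_env, update_policy, n_iterations):
    """Each iteration trains on a model sampled from the ensemble, so the policy
    must be robust to a range of dynamics rather than a single model."""
    for _ in range(n_iterations):
        env = make_env(**sample_physical_params())  # hypothetical env constructor
        update_policy(env)                          # any standard RL update
```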


More randomization!

Sadeghi et al., “CAD2RL: Real Single-Image Flight without a Single Real Image.” 2016

Xue Bin Peng et al., “Sim-to-Real Transfer of Robotic Control with Dynamics Randomization.” 2018
Lee et al., “Learning Quadrupedal Locomotion over Challenging Terrain.” 2020
Some suggested readings
Domain adaptation:

Tzeng, Hoffman, Zhang, Saenko, Darrell. Deep Domain Confusion: Maximizing for Domain Invariance. 2014.

Ganin, Ustinova, Ajakan, Germain, Larochelle, Laviolette, Marchand, Lempitsky. Domain-Adversarial Training of Neural Networks. 2015.

Tzeng*, Devin*, et al., Adapting Visuomotor Representations with Weak Pairwise Constraints. 2016.

Eysenbach et al., Off-Dynamics Reinforcement Learning: Training for Transfer with Domain Classifiers. 2020.

Finetuning:

Finetuning via MaxEnt RL: Haarnoja*, Tang*, et al. (2017). Reinforcement Learning with Deep Energy-Based Policies.

Andreas et al. Modular multitask reinforcement learning with policy sketches. 2017.

Florensa et al. Stochastic neural networks for hierarchical reinforcement learning. 2017.

Kumar et al. One Solution is Not All You Need: Few-Shot Extrapolation via Structured MaxEnt RL. 2020

Simulation to real world transfer:

Rajeswaran, et al. (2017). EPOpt: Learning Robust Neural Network Policies Using Model Ensembles.

Yu et al. (2017). Preparing for the Unknown: Learning a Universal Policy with Online System Identification.

Sadeghi & Levine. (2017). CAD2RL: Real Single Image Flight without a Single Real Image.

Tobin et al. (2017). Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World.

Tan et al. (2018). Sim-to-Real: Learning Agile Locomotion For Quadruped Robots.

…and many many others!


How can we frame transfer learning problems?

1. Forward transfer: learn policies that transfer effectively


a) Train on source task, then run on target task (or finetune)
b) Relies on the tasks being quite similar!
2. Multi-task transfer: train on many tasks, transfer to a new task
a) Sharing representations and layers across tasks in multi-task learning
b) New task needs to be similar to the distribution of training tasks
3. Meta-learning: learn to learn on many tasks
a) Accounts for the fact that we’ll be adapting to a new task during training!
Can we learn faster by learning multiple tasks?

[Figure: several tasks, each being learned.]

Multi-task learning can:
- Accelerate learning of all tasks that are learned together
- Provide better pre-training for down-stream tasks
Can we solve multiple tasks at once?

Multi-task RL corresponds to single-task RL in a joint MDP

[Figure: an MDP (MDP 0, MDP 1, MDP 2, etc.) is picked randomly, i.e. sampled in the first state, and the episode then proceeds in that MDP.]
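A minimal sketch of this reduction, assuming gym-style environments with the classic reset()/step() interface; `envs`, `policy`, and the task-id conditioning are illustrative placeholders.

```python
import random

def run_joint_mdp_episode(envs, policy):
    """Multi-task RL as single-task RL in a joint MDP: the first 'transition'
    picks one of the MDPs at random, then the episode proceeds in that MDP."""
    task_id = random.randrange(len(envs))
    env = envs[task_id]
    obs, done, ep_return = env.reset(), False, 0.0
    while not done:
        action = policy(obs, task_id)   # pass the task id so the shared policy knows which MDP it is in
        obs, reward, done, _ = env.step(action)
        ep_return += reward
    return ep_return
```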
How does the model know what to do?

• What if the policy can do multiple things in the same environment?

Contextual policies: condition the policy on a context that specifies the task (e.g., do dishes or laundry).

images: Peng, van de Panne, Peters

Goal-conditioned policies: condition the policy on a goal, which is simply another state to reach (a minimal sketch follows this list).
➢ Convenient: no need to manually define rewards for each task
➢ Can transfer in zero shot to a new task if it’s another goal!
➢ Often hard to train in practice (see references)
➢ Not all tasks are goal-reaching tasks!

A few relevant papers:
• Kaelbling. Learning to achieve goals.
• Schaul et al. Universal value function approximators.
• Andrychowicz et al. Hindsight experience replay.
• Eysenbach et al. C-learning: Learning to achieve goals via recursive classification.
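Below is a minimal sketch of a goal-conditioned (equivalently, contextual) policy network in PyTorch. The architecture, hidden sizes, and dimensions are illustrative assumptions; the key point is simply that the goal (or context) enters as an extra input.

```python
import torch
import torch.nn as nn

class GoalConditionedPolicy(nn.Module):
    """pi(a | s, g): the goal g (another state, or a task context) is an extra input."""
    def __init__(self, obs_dim, goal_dim, act_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, obs, goal):
        # Concatenate state and goal; the reward can be defined as reaching the goal,
        # so no per-task reward engineering is needed.
        return self.net(torch.cat([obs, goal], dim=-1))

policy = GoalConditionedPolicy(obs_dim=17, goal_dim=3, act_dim=6)
action = policy(torch.randn(1, 17), torch.randn(1, 3))   # zero-shot transfer: just swap the goal
```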
Meta-Learning
What is meta-learning?

• If you’ve learned 100 tasks already, can you figure out how to learn more efficiently?
• Now having multiple tasks is a huge advantage!
• Meta-learning = learning to learn
• In practice, very closely related to multi-task
learning
• Many formulations
• Learning an optimizer
• Learning an RNN that ingests experience
• Learning a representation

image credit: Ke Li
Why is meta-learning a good idea?

• Deep reinforcement learning, especially model-free, requires a huge number of samples
• If we can meta-learn a faster reinforcement learner, we can learn
new tasks efficiently!
• What can a meta-learned learner do differently?
• Explore more intelligently
• Avoid trying actions that are known to be useless
• Acquire the right features more quickly
Meta-learning with supervised learning

image credit: Ravi & Larochelle ‘17


Meta-learning with supervised learning

[Figure: a conventional supervised learner maps an input (e.g., an image) to an output (e.g., a label). A meta-learner instead reads in a (few-shot) training set along with a test input and predicts the test label.]

• How to read in the training set?
• Many options; RNNs can work
• More on this later

What is being “learned”?

[Figure: the RNN weights are meta-learned (shared across tasks), while the RNN hidden state is what adapts to each (few-shot) training set.]
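To make the “read in the training set with an RNN” idea concrete, here is a minimal sketch in PyTorch. The architecture, shapes, and use of the final hidden state are illustrative assumptions, not a specific published model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RNNMetaLearner(nn.Module):
    """Reads the few-shot training set as a sequence of (input, label) pairs,
    then predicts the label of a test input from the resulting hidden state."""
    def __init__(self, x_dim, n_classes, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(x_dim + n_classes, hidden, batch_first=True)
        self.head = nn.Linear(hidden + x_dim, n_classes)
        self.n_classes = n_classes

    def forward(self, support_x, support_y, query_x):
        # support_x: (B, K, x_dim), support_y: (B, K) int labels, query_x: (B, x_dim)
        y_onehot = F.one_hot(support_y, self.n_classes).float()
        _, h = self.rnn(torch.cat([support_x, y_onehot], dim=-1))
        # The final hidden state summarizes the training set (the "adapted" part);
        # the RNN weights themselves are what gets meta-learned across tasks.
        return self.head(torch.cat([h[-1], query_x], dim=-1))

learner = RNNMetaLearner(x_dim=32, n_classes=5)
logits = learner(torch.randn(4, 5, 32), torch.randint(0, 5, (4, 5)), torch.randn(4, 32))
```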
Meta Reinforcement Learning
The meta reinforcement learning problem

[Figure: an example meta-RL task distribution with different target velocities (0.5 m/s, 0.7 m/s, -0.2 m/s, -0.7 m/s).]

Contextual policies and meta-learning

[Figure: the task-specific adaptation plays the role of the “context” of a contextual policy.]
Meta-RL with recurrent policies

[Figure: the RNN weights are meta-learned, while the RNN hidden state carries the task-specific adaptation.]
Meta-RL with recurrent policies

Crucially, the RNN hidden state is not reset between episodes!

[Figure: rewards received across episodes of a meta-episode (+0, +0, +1, +1, +0, +0).]
Why recurrent policies learn to explore

Optimizing total reward over the entire meta-episode (the sequence of episodes, not just a single episode) with an RNN policy automatically learns to explore!
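A minimal sketch of a recurrent meta-RL policy in the spirit of RL2, in PyTorch. The environment interface (gym-style reset()/step()), discrete actions, and all sizes are assumptions; the essential detail is that the hidden state is reset only once per meta-episode.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentPolicy(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden=128):
        super().__init__()
        self.rnn = nn.GRUCell(obs_dim + act_dim + 1, hidden)  # input: (obs, prev action, prev reward)
        self.pi = nn.Linear(hidden, act_dim)

    def step(self, obs, prev_action, prev_reward, h):
        h = self.rnn(torch.cat([obs, prev_action, prev_reward], dim=-1), h)
        return self.pi(h), h   # action logits and updated hidden state

def run_meta_episode(env, policy, act_dim, n_episodes):
    h = torch.zeros(1, policy.rnn.hidden_size)        # reset once per *meta*-episode only
    prev_a, prev_r = torch.zeros(1, act_dim), torch.zeros(1, 1)
    total_reward = 0.0
    for _ in range(n_episodes):                       # hidden state carries across episodes
        obs, done = env.reset(), False
        while not done:
            obs_t = torch.as_tensor(obs, dtype=torch.float32).unsqueeze(0)
            logits, h = policy.step(obs_t, prev_a, prev_r, h)
            action = torch.distributions.Categorical(logits=logits).sample()
            obs, r, done, _ = env.step(action.item())
            prev_a = F.one_hot(action, act_dim).float()
            prev_r = torch.tensor([[float(r)]])
            total_reward += r
    return total_reward   # maximizing this over the whole meta-episode rewards exploration
```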
Meta-RL with recurrent policies

Heess, Hunt, Lillicrap, Silver. Memory-based control with recurrent neural networks. 2015.
Wang, Kurth-Nelson, Tirumala, Soyer, Leibo, Munos, Blundell, Kumaran, Botvinick. Learning to Reinforcement Learn. 2016.
Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel. RL2: Fast Reinforcement Learning via Slow Reinforcement Learning. 2016.
Architectures for meta-RL
standard RNN (LSTM) architecture
Duan, Schulman, Chen, Bartlett, Sutskever, Abbeel. RL2: Fast Reinforcement Learning via Slow Reinforcement Learning. 2016.

attention + temporal convolution
Mishra, Rohaninejad, Chen, Abbeel. A Simple Neural Attentive Meta-Learner.

parallel permutation-invariant context encoder
Rakelly*, Zhou*, Quillen, Finn, Levine. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables.
Gradient-Based Meta-Learning
Back to representations…

is pretraining a type of meta-learning?


better features = faster learning of new task!
Meta-RL as an optimization problem

Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.
MAML for RL in pictures
What did we just do??

Just another computation graph…
Can implement with any autodiff package (e.g., TensorFlow)
But has favorable inductive bias…
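Since the whole procedure really is just another computation graph, a minimal MAML sketch fits in a few lines of PyTorch. This version uses a toy sine-regression loss for clarity; the RL variant in Finn et al. replaces the inner and outer losses with policy-gradient surrogates. All sizes, hyperparameters, and the `sample_task` generator are illustrative.

```python
import math
import torch
import torch.nn as nn

def forward(params, x):
    w1, b1, w2, b2 = params
    return torch.tanh(x @ w1 + b1) @ w2 + b2

def sample_task(n_train=10, n_val=10):
    """Toy task distribution: sine waves with random amplitude and phase."""
    amp, phase = torch.rand(1) * 4.9 + 0.1, torch.rand(1) * math.pi
    f = lambda x: amp * torch.sin(x + phase)
    x_tr, x_val = torch.rand(n_train, 1) * 10 - 5, torch.rand(n_val, 1) * 10 - 5
    return x_tr, f(x_tr), x_val, f(x_val)

params = [nn.Parameter(0.1 * torch.randn(1, 40)), nn.Parameter(torch.zeros(40)),
          nn.Parameter(0.1 * torch.randn(40, 1)), nn.Parameter(torch.zeros(1))]
meta_opt = torch.optim.Adam(params, lr=1e-3)
inner_lr = 0.01

for meta_step in range(1000):
    meta_loss = 0.0
    for _ in range(4):                               # a small batch of tasks
        x_tr, y_tr, x_val, y_val = sample_task()
        # Inner loop: one gradient step on this task's training data, keeping
        # the graph so the outer update can differentiate through it.
        inner_loss = ((forward(params, x_tr) - y_tr) ** 2).mean()
        grads = torch.autograd.grad(inner_loss, params, create_graph=True)
        adapted = [p - inner_lr * g for p, g in zip(params, grads)]
        # Outer loop: evaluate the adapted parameters on held-out data.
        meta_loss = meta_loss + ((forward(adapted, x_val) - y_val) ** 2).mean()
    meta_opt.zero_grad()
    meta_loss.backward()
    meta_opt.step()
```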
MAML for RL in videos
[Video frames: after MAML training, a single gradient step adapts the policy to the forward-reward task and, separately, to the backward-reward task.]
More on MAML/gradient-based meta-learning
for RL
MAML meta-policy gradient estimators:
• Finn, Abbeel, Levine. Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks.
• Foerster, Farquhar, Al-Shedivat, Rocktaschel, Xing, Whiteson. DiCE: The Infinitely
Differentiable Monte Carlo Estimator.
• Rothfuss, Lee, Clavera, Asfour, Abbeel. ProMP: Proximal Meta-Policy Search.

Improving exploration:
• Gupta, Mendonca, Liu, Abbeel, Levine. Meta-Reinforcement Learning of Structured
Exploration Strategies.
• Stadie*, Yang*, Houthooft, Chen, Duan, Wu, Abbeel, Sutskever. Some Considerations on
Learning to Explore via Meta-Reinforcement Learning.

Hybrid algorithms (not necessarily gradient-based):


• Houthooft, Chen, Isola, Stadie, Wolski, Ho, Abbeel. Evolved Policy Gradients.
• Fernando, Sygnowski, Osindero, Wang, Schaul, Teplyashin, Sprechmann, Pritzel, Rusu. Meta-
Learning by the Baldwin Effect.
Meta-RL as a POMDP
Meta-RL as… partially observed RL?

[Figure: the task is treated as a latent variable z that encapsulates the information the policy needs to solve the current task.]

A simple approach: sample z from some approximate posterior (e.g., variational), then act as though z was correct!

This is not optimal! Why? But it’s pretty good, both in theory and in practice.

See, e.g. Russo, Roy. Learning to Optimize via Posterior Sampling.
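A minimal PyTorch sketch of "sample z from an approximate posterior and act as though it were correct". The encoder architecture, mean pooling, and dimensions are illustrative assumptions; PEARL's actual encoder combines per-transition Gaussian factors and is trained with a KL term to the prior.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Maps a set of context transitions (s, a, r, s') to an approximate posterior q(z | context)."""
    def __init__(self, transition_dim, z_dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(transition_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2 * z_dim))

    def forward(self, context):                   # context: (N, transition_dim)
        stats = self.net(context).mean(dim=0)     # permutation-invariant pooling
        mu, log_std = stats.chunk(2)
        return torch.distributions.Normal(mu, log_std.exp())

encoder = ContextEncoder(transition_dim=20, z_dim=5)
q_z = encoder(torch.randn(64, 20))   # posterior after observing 64 context transitions
z = q_z.rsample()                    # posterior sampling: draw one z...
# ...and act as though z were correct: the policy conditions on it, e.g. pi(a | s, z)
```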


Variational inference for meta-RL

[Objective: maximize the post-update reward (same as standard meta-RL) while keeping the posterior over z close to the prior.]

Rakelly*, Zhou*, Quillen, Finn, Levine. Efficient Off-Policy Meta-Reinforcement learning via
Probabilistic Context Variables. ICML 2019.
Specific instantiation: PEARL

Perform the maximization using soft actor-critic (SAC), a state-of-the-art off-policy RL algorithm.

Rakelly*, Zhou*, Quillen, Finn, Levine. Efficient Off-Policy Meta-Reinforcement learning via
Probabilistic Context Variables. ICML 2019.
References on meta-RL, inference, and POMDPs

• Rakelly*, Zhou*, Quillen, Finn, Levine. Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables. ICML 2019.

• Zintgraf, Igl, Shiarlis, Mahajan, Hofmann, Whiteson. Variational Task Embeddings for Fast Adaptation in Deep Reinforcement Learning.

• Humplik, Galashov, Hasenclever, Ortega, Teh, Heess. Meta reinforcement learning as task inference.
The three perspectives on meta-RL

[Figure: in each perspective, the adapted quantity (RNN hidden state, adapted parameters, or latent variable z) encapsulates everything needed to solve the task.]

1. Recurrent (black-box) policies:
+ conceptually simple
+ relatively easy to apply
- vulnerable to meta-overfitting
- challenging to optimize in practice

2. Gradient-based meta-learning (e.g., MAML):
+ good extrapolation (“consistent”)
+ conceptually elegant
- complex, requires many samples

3. Inference (POMDP) perspective:
+ simple, effective exploration via posterior sampling
+ elegant reduction to solving a special POMDP
- vulnerable to meta-overfitting
- challenging to optimize in practice

But they’re not that different!
• The inference perspective is just perspective 1, but with stochastic hidden variables!
• The gradient-based update is just a particular architecture choice for the learned adaptation.
Meta-RL and emergent phenomena
➢ Meta-RL gives rise to model-free episodic learning: Ritter, Wang, Kurth-Nelson, Jayakumar, Blundell, Pascanu, Botvinick. Been There, Done That: Meta-Learning with Episodic Recall.
➢ Meta-RL gives rise to model-based adaptation: Wang, Kurth-Nelson, Kumaran, Tirumala, Soyer, Leibo, Hassabis, Botvinick. Prefrontal Cortex as a Meta-Reinforcement Learning System.
➢ Meta-RL gives rise to causal reasoning (!): Dasgupta, Wang, Chiappa, Mitrovic, Ortega, Raposo, Hughes, Battaglia, Botvinick, Kurth-Nelson. Causal Reasoning from Meta-Reinforcement Learning.

Humans and animals seemingly learn behaviors in a variety of ways:


➢ Highly efficient but (apparently) model-free RL
➢ Episodic recall
➢ Model-based RL
➢ Causal inference
➢ etc.

Perhaps each of these is a separate “algorithm” in the brain


But maybe these are all emergent phenomena resulting from meta-RL?
