
Transfer Learning in Deep Reinforcement Learning: A Survey

Zhuangdi Zhu, Kaixiang Lin, Anil K. Jain, and Jiayu Zhou

arXiv:2009.07888v7 [cs.LG] 4 Jul 2023

• Zhuangdi Zhu, Anil K. Jain, and Jiayu Zhou are with the Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, 48824. E-mail: {zhuzhuan, jain, jiayuz}@msu.edu
• Kaixiang Lin is with the Amazon Alexa AI. E-mail: [email protected]

Abstract—Reinforcement learning is a learning paradigm for solving sequential decision-making problems. Recent years have witnessed remarkable progress in reinforcement learning upon the fast development of deep neural networks. Along with the promising prospects of reinforcement learning in numerous domains such as robotics and game-playing, transfer learning has arisen to tackle various challenges faced by reinforcement learning, by transferring knowledge from external expertise to facilitate the efficiency and effectiveness of the learning process. In this survey, we systematically investigate the recent progress of transfer learning approaches in the context of deep reinforcement learning. Specifically, we provide a framework for categorizing the state-of-the-art transfer learning approaches, under which we analyze their goals, methodologies, compatible reinforcement learning backbones, and practical applications. We also draw connections between transfer learning and other relevant topics from the reinforcement learning perspective and explore potential challenges that await future research progress.

Index Terms—Transfer Learning, Reinforcement Learning, Deep Learning, Survey.

1 INTRODUCTION

Reinforcement Learning (RL) is an effective framework to solve sequential decision-making tasks, where a learning agent interacts with the environment to improve its performance through trial and error [1]. Originating from cybernetics and thriving in computer science, RL has been widely applied to tackle challenging tasks which were previously intractable. Traditional RL algorithms were mostly designed for tabular cases, which provide principled solutions to simple tasks but face difficulties when handling highly complex domains, e.g. tasks with 3D environments. With the recent advances in deep learning research, the combination of RL and deep neural networks has been developed to address such challenging tasks. This combination is referred to as Deep Reinforcement Learning (DRL) [2], which learns powerful function approximators using deep neural networks to address complicated domains. DRL has achieved notable success in applications such as robotics control [3, 4] and game playing [5]. It also thrives in domains such as health informatics [6], electricity networks [7], and intelligent transportation systems [8, 9], to name just a few.

Besides its remarkable advancement, RL still faces intriguing difficulties induced by the exploration-exploitation dilemma [1]. Specifically, for practical RL problems, the environment dynamics are usually unknown, and the agent cannot exploit knowledge about the environment until enough interaction experiences are collected via exploration. Due to partial observability, sparse feedback, and the high complexity of state and action spaces, acquiring sufficient interaction samples can be prohibitive or even incur safety concerns for domains such as automatic driving and health informatics. The abovementioned challenges have motivated various efforts to improve the current RL procedure. As a result, transfer learning (TL), or equivalently referred to as knowledge transfer, which is a technique that utilizes external expertise to benefit the learning process of the target domain, has become a crucial topic in RL.

While TL techniques have been extensively studied in supervised learning [10], it is still an emerging topic for RL. Transfer learning can be more complicated for RL, in that the knowledge needs to transfer in the context of a Markov Decision Process. Moreover, due to the delicate components of the Markov Decision Process, expert knowledge may take different forms that need to transfer in different ways. Previous efforts on summarizing TL in the RL domain did not cover research of the last decade [11, 12], during which time considerable TL breakthroughs have been achieved, empowered by deep learning techniques. Hence, in this survey, we make a comprehensive investigation of the latest TL approaches in RL.

The contributions of our survey are multifold: 1) We investigated up-to-date research involving new DRL backbones and TL algorithms over the recent decade. To the best of our knowledge, this survey is the first attempt to survey TL approaches in the context of deep reinforcement learning. We reviewed TL methods that can tackle more evolved RL tasks, and also studied new TL schemes that are not deeply discussed by prior literature, such as representation disentanglement (Sec 5.5) and policy distillation (Sec 5.3). 2) We provided systematic categorizations that cover a broader and deeper view of TL developments in DRL. Our main analysis is anchored on a fundamental question, i.e. what is the transferred knowledge in RL, following which we conducted more refined analysis. Most TL strategies, including those discussed in prior surveys, are well suited in our categorization framework. 3) Reflecting on the developments of TL methods in DRL, we brought new thoughts on its future directions, including how to do reasoning over miscellaneous knowledge forms and how to leverage knowledge in a more efficient and principled manner. We also pointed out the prominent applications of TL for DRL and its opportunities to thrive in the future era of AGI.
The rest of this survey is organized as follows: In Section 2 we introduce RL preliminaries, including the recent key developments based on deep neural networks. Next, we discuss the definition of TL in the context of RL and its relevant topics (Section 2.4). In Section 3, we provide a framework to categorize TL approaches from multiple perspectives, analyze their fundamental differences, and summarize their evaluation metrics (Section 3.3). In Section 5, we elaborate on different TL approaches in the context of DRL, organized by the format of transferred knowledge, such as reward shaping (Section 5.1), learning from demonstrations (Section 5.2), or learning from teacher policies (Section 5.3). We also investigate TL approaches by the way that knowledge transfer occurs, such as inter-task mapping (Section 5.4), or learning transferrable representations (Section 5.5), etc. We discuss contemporary applications of TL in the context of DRL in Section 6 and provide some future perspectives and open questions in Section 7.

2 DEEP REINFORCEMENT LEARNING AND TRANSFER LEARNING

2.1 Reinforcement Learning Basics

Markov Decision Process: A typical RL problem can be considered as training an agent to interact with an environment that follows a Markov Decision Process (MDP) [13]. The agent starts with an initial state and performs an action accordingly, which yields a reward to guide the agent's actions. Once the action is taken, the MDP transits to the next state by following the underlying transition dynamics of the MDP. The agent accumulates the time-discounted rewards along with its interactions. A subsequence of interactions is referred to as an episode. The above-mentioned components of an MDP can be represented using a tuple, i.e. M = (µ0, S, A, T, γ, R), in which:
• µ0 is the set of initial states.
• S is the state space.
• A is the action space.
• T : S × A × S → R is the transition probability distribution, where T(s′|s, a) specifies the probability of the state transitioning to s′ upon taking action a from state s.
• R : S × A × S → R is the reward distribution, where R(s, a, s′) is the reward that an agent can get by taking action a from state s with the next state being s′.
• γ is a discount factor, with γ ∈ (0, 1].

An RL agent behaves in M by following its policy π, which is a mapping from states to actions: π : S → A. For a stochastic policy π, π(a|s) denotes the probability of taking action a from state s. Given an MDP M and a policy π, one can derive a value function V^π_M(s), which is defined over the state space: V^π_M(s) = E[r0 + γ r1 + γ² r2 + … ; π, s], where ri = R(si, ai, si+1) is the reward that an agent receives by taking action ai in the i-th state si, with the next state transiting to si+1. The expectation E is taken over s0 ∼ µ0, ai ∼ π(·|si), si+1 ∼ T(·|si, ai). The value function estimates the quality of being in state s, by evaluating the expected rewards that an agent can get from s following policy π. Similar to the value function, a policy also carries a Q-function, which estimates the quality of taking action a from state s: Q^π_M(s, a) = E_{s′∼T(·|s,a)}[R(s, a, s′) + γ V^π_M(s′)].

Reinforcement Learning Goals: Standard RL aims to learn an optimal policy π*_M with the optimal value and Q-function, s.t. ∀s ∈ S, π*_M(s) = argmax_{a∈A} Q*_M(s, a), where Q*_M(s, a) = sup_π Q^π_M(s, a). The learning objective can be reduced to maximizing the expected return:

J(π) := E_{(s,a)∼µ^π(s,a)} [ Σ_t γ^t r_t ],

where µ^π(s, a) is the stationary state-action distribution induced by π [14].

Built upon recent progress of DRL, some literature has extended the RL objective to achieving miscellaneous goals under different conditions, referred to as Goal-Conditional RL (GCRL). In GCRL, the agent policy π(·|s, g) is dependent not only on state observations s but also on the goal g being optimized. Each individual goal g ∼ G can be differentiated by its reward function r(s_t, a_t, g), hence the objective for GCRL becomes maximizing the expected return over the distribution of goals: J(π) := E_{(s_t,a_t)∼µ^π, g∼G} [ Σ_t γ^t r(s_t, a_t, g) ] [15]. A prototypical example of GCRL is a maze locomotion task, where the learning goals are manifested as desired locations in the maze [16].

Episodic vs. Non-episodic Reinforcement Learning: In episodic RL, the agent performs in finite episodes of length H, and will be reset to an initial state ∈ µ0 when the episode ends [1], whereas in non-episodic RL, the learning agent continuously interacts with the MDP without any state reset [17]. To encompass the episodic concept in infinite MDPs, episodic RL tasks usually assume the existence of a set of absorbing states S0, which indicates the termination of episodic tasks [18, 19]; any action taken upon an absorbing state will only transit to itself with zero reward.

2.2 Reinforcement Learning Algorithms

There are two major methods to conduct RL: Model-Based and Model-Free. In model-based RL, a learned or provided model of the MDP is used for policy learning. In model-free RL, the optimal policy is learned without modeling the transition dynamics or reward functions. In this section, we start introducing RL techniques from a model-free perspective, due to its relative simplicity, which also provides foundations for many model-based methods.

Prediction and Control: An RL problem can be disassembled into two subtasks: prediction and control [1]. In the prediction phase, the quality of the current policy is evaluated. In the control phase, or the policy improvement phase, the learning policy is adjusted based on evaluation results from the prediction step. Policies can be improved by iterating through these two steps, known as policy iteration.
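To make the prediction-control loop concrete, below is a minimal sketch (not taken from the survey) of tabular policy iteration on a generic finite MDP. The transition tensor `P`, reward tensor `R`, discount `gamma`, and iteration limits are illustrative assumptions, following the tuple M = (µ0, S, A, T, γ, R) of Section 2.1.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.99, eval_iters=200, tol=1e-8):
    """Tabular policy iteration for a finite MDP.

    P[s, a, s'] -- transition probability T(s'|s, a)
    R[s, a, s'] -- reward R(s, a, s')
    Returns a deterministic policy (one action index per state) and its value function.
    """
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)          # start from an arbitrary policy
    while True:
        # Prediction: evaluate V^pi by iterating the Bellman expectation backup.
        V = np.zeros(n_states)
        for _ in range(eval_iters):
            V_new = np.array([
                np.sum(P[s, policy[s]] * (R[s, policy[s]] + gamma * V))
                for s in range(n_states)
            ])
            converged = np.max(np.abs(V_new - V)) < tol
            V = V_new
            if converged:
                break
        # Control: greedy improvement w.r.t. Q^pi(s, a) = sum_s' T(s'|s,a)[R(s,a,s') + gamma V(s')].
        Q = np.einsum("ijk,ijk->ij", P, R + gamma * V[None, None, :])
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):       # policy is stable, hence optimal
            return policy, V
        policy = new_policy
```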

For model-free policy iteration, the target policy is optimized without requiring knowledge of the MDP transition dynamics. Traditional model-free RL includes Monte-Carlo methods, which estimate the value of each state using samples of episodes starting from that state. Monte-Carlo methods can be on-policy if the samples are collected by following the target policy, or off-policy if the episodic samples are collected by following a behavior policy that is different from the target policy.

Temporal Difference (TD) Learning is an alternative to Monte-Carlo for solving the prediction problem. The key idea behind TD-learning is to learn the state quality function by bootstrapping. It can also be extended to solve the control problem, so that both the value function and the policy get improved simultaneously. Examples of on-policy TD-learning algorithms include SARSA [20], Expected SARSA [21], Actor-Critic [22], and its deep neural network extension called A3C [23]. Off-policy TD-learning approaches include SAC [24] for continuous state-action spaces and Q-learning [25] for discrete state-action spaces, along with its variants built on deep neural networks, such as DQN [26], Double-DQN [26], Rainbow [27], etc. TD-learning approaches focus more on estimating the state-action value functions.

Policy Gradient, on the other hand, is a mechanism that emphasizes direct optimization of a parameterizable policy. Traditional policy-gradient approaches include REINFORCE [28]. Recent years have witnessed the joint presence of TD-learning and policy-gradient approaches. Representative algorithms along this line include Trust Region Policy Optimization (TRPO) [29], Proximal Policy Optimization (PPO) [30], Deterministic Policy Gradient (DPG) [31] and its extensions such as DDPG [32] and Twin Delayed DDPG [33].

Unlike model-free methods that learn purely from trial-and-error, Model-Based RL (MBRL) explicitly learns the transition dynamics or cost functions of the environment. The dynamics model can sometimes be treated as a black box for better sampling-based planning. Representative examples include the Monte-Carlo method dubbed random shooting [34] and its cross-entropy method (CEM) variants [35, 36]. The modeled dynamics can also facilitate learning with data generation [37] and value estimation [38]. For MBRL with white-box modeling, the transition models become differentiable and can facilitate planning with direct gradient propagation. Methods along this line include differential planning for policy gradients [39], action sequence search [40], and value gradient methods [41, 42]. One advantage of MBRL is its higher sample efficiency compared with model-free RL, although it can be challenging for complex domains, where it is usually more difficult to learn the dynamics than to learn a policy.

2.3 Transfer Learning in the Context of Reinforcement Learning

Remark 1. Without loss of clarity, for the rest of this survey, we refer to MDPs, domains, and tasks equivalently.

Remark 2. [Transfer Learning in the Context of RL] Given a set of source domains 𝓜_s = {M_s | M_s ∈ 𝓜_s} and a target domain M_t, Transfer Learning aims to learn an optimal policy π* for the target domain, by leveraging exterior information I_s from 𝓜_s as well as interior information I_t from M_t:

π* = argmax_π E_{s∼µ^t_0, a∼π} [Q^π_{M_t}(s, a)],

where π = ϕ(I_s ∼ 𝓜_s, I_t ∼ M_t) : S^t → A^t is a policy learned for the target domain M_t based on information from I_t and I_s.

In the above definition, we use ϕ(I) to denote the learned policy based on information I, which is usually approximated with deep neural networks in DRL. For the simplistic case, knowledge can transfer between two agents within the same domain, resulting in |𝓜_s| = 1 and M_s = M_t. One can consider regular RL without TL as a special case of the above definition, by treating I_s = ∅, so that a policy π is learned purely on the feedback provided by the target domain, i.e. π = ϕ(I_t).

2.4 Related Topics

In addition to TL, other efforts have been made to benefit RL by leveraging different forms of supervision. In this section, we briefly discuss other techniques that are relevant to TL by analyzing the differences and connections between transfer learning and these relevant techniques, which we hope can further clarify the scope of this survey.

Continual Learning is the ability of sequentially learning multiple tasks that are temporally or spatially related, without forgetting the previously acquired knowledge. Continual learning is a specialized yet more challenging scenario of TL, in that the learned knowledge needs to be transferred along a sequence of dynamically-changing tasks that cannot be foreseen, rather than over a fixed group of tasks. Hence, different from most TL methods discussed in this survey, the ability of automatic task detection and of avoiding catastrophic forgetting is usually indispensable in continual learning [43].

Hierarchical RL has been proposed to resolve complex real-world tasks. Different from traditional RL, in hierarchical RL the action space is grouped into different granularities to form higher-level macro actions. Accordingly, the learning task is also decomposed into hierarchically dependent sub-goals. Well-known hierarchical RL frameworks include Feudal learning [44], the Options framework [45], Hierarchical Abstract Machines [46], and MAXQ [47]. Given the higher-level abstraction over tasks, actions, and state spaces, hierarchical RL can facilitate knowledge transfer across similar domains.

Multi-task RL learns an agent with generalized skills across various tasks, so that it can solve MDPs randomly sampled from a fixed yet unknown distribution [48]. A larger concept of multi-task learning also incorporates multi-task supervised learning and unsupervised learning [49]. Multi-task learning is naturally related to TL, in that the learned skills, typically manifested as representations, need to be effectively shared among domains. Many TL techniques later discussed in this survey can be readily applied to solve multi-task RL scenarios, such as policy distillation [50] and representation sharing [51]. One notable challenge in multi-task learning is negative transfer, which is induced by the irrelevance or conflicting properties of the learned tasks. Hence, some recent work in multi-task RL focuses on a trade-off between sharing and individualizing function modules [52–54].

Generalization in RL refers to the ability of learning agents to adapt to unseen domains. Generalization is a crucial property for RL to achieve, especially since classical RL assumes identical training and inference MDPs, whereas the real world is constantly changing. Generalization in RL is considered more challenging than in supervised learning due to the non-stationarity of MDPs, although the latter has provided inspirations for the former [55].
Meta-learning is an effective direction towards generalization, which also draws close connections to TL. Some TL techniques discussed in this survey are actually designed for meta-RL. However, meta-learning is particularly focused on learning methods that lead to fast adaptation to unseen domains, whereas TL is a broader concept and covers scenarios where the target environment can be (partially) observable. To tackle unseen tasks in RL, some meta-RL methods focus on training MDP generation [56] and variation estimation [57]. We refer readers to [58] for a more focused survey on meta-RL.

3 ANALYZING TRANSFER LEARNING

In this section, we discuss TL approaches in RL from different angles. We also use a prototype to illustrate the potential variants residing in knowledge transfer among domains, then summarize important metrics for TL evaluation.

3.1 Categorization of Transfer Learning Approaches

TL approaches can be organized by answering the following key questions:
1) What knowledge is transferred: Knowledge from the source domain can take different forms, such as expert experiences [59], the action probability distribution of an expert policy [60], or even a potential function that estimates the quality of demonstrations in the target MDP [61]. The divergence in representations and granularities of knowledge fundamentally influences how TL is performed. The quality of the transferred knowledge, e.g. whether it comes from an oracle [62] or a suboptimal teacher [63], also affects the way TL methods are designed.
2) What RL frameworks fit the TL approach: We can rephrase this question into other forms, e.g., is the TL approach policy-agnostic, or only applicable to certain RL backbones, such as the Temporal Difference (TD) methods? Answers to this question are closely related to the representation of knowledge. For example, transferring knowledge from expert demonstrations is usually policy-agnostic (see Section 5.2), while policy distillation, to be discussed in Section 5.3, may not be suitable for a DQN backbone, which does not explicitly learn a policy function.
3) What is the difference between the source and the target domain: Some TL approaches fit where the source domain Ms and the target domain Mt are equivalent, whereas others are designed to transfer knowledge between different domains. For example, in video gaming tasks where observations are RGB pixels, Ms and Mt may share the same action space (A) but differ in their observation spaces (S). For goal-conditioned RL [64], the two domains may differ only by the reward distribution: Rs ≠ Rt.
4) What information is available in the target domain: While knowledge from source domains is usually accessible, it can be prohibitive to sample from the target domain, or the reward signal can be sparse or delayed. Examples include adapting an auto-driving agent pre-trained on simulated platforms to real environments [65]. The accessibility of information in the target domain can affect the way that TL approaches are designed.
5) How sample-efficient the TL approach is: TL enables RL with better initial performance, hence it usually requires fewer interactions compared with learning from scratch. Based on the sampling cost, we can categorize TL approaches into the following classes: (i) Zero-shot transfer, which learns an agent that is directly applicable to the target domain without requiring any training interactions; (ii) Few-shot transfer, which only requires a few samples (interactions) from the target domain; (iii) Sample-efficient transfer, where an agent can benefit from TL to be more sample-efficient compared to normal RL.

3.2 Case Analysis of Transfer Learning in the Context of Reinforcement Learning

We now use HalfCheetah¹ as a working example to illustrate how TL can occur between the source and the target domain. HalfCheetah is a standard DRL benchmark for solving physical locomotion tasks, in which the objective is to train a two-legged agent to run fast without losing control of itself.

3.2.1 Potential Domain Differences:

During TL, the differences between the source and target domain may reside in any component of an MDP:
• S (State space): domains can be made different by extending or constraining the available positions for the HalfCheetah agent to move.
• A (Action space): the action space can be adjusted by changing the range of available torques for the thigh, shin, or foot of the agent.
• R (Reward function): a domain can be simplified by using only the distance moved forward as the reward, or be complicated by using the scale of the accelerated velocity in each direction as an extra penalty cost.
• T (Transition dynamics): two domains can differ by following different physical rules, leading to different transition probabilities given the same state-action pairs.
• µ0 (Initial states): the source and target domains may have different initial states, specifying where and with what posture the agent can start moving.
• τ (Trajectories): the source and target domains may allow a different number of steps for the agent to move before a task is done.

3.2.2 Transferrable Knowledge:

Without losing generality, we list below some transferrable knowledge, assuming that the source and target domains are variants of HalfCheetah:
• Demonstrated trajectories: the target agent can learn from the behavior of a pre-trained expert, e.g. a sequence of running demonstrations.
• Model dynamics: the RL agent may access a model of the physical dynamics for the source domain that is also partly applicable to the target domain. It can perform dynamic programming based on the physical rules, running fast without losing control due to the accelerated velocity.
• Teacher policies: an expert policy may be consulted by the learning agent, which outputs the probability of taking different actions upon a given state example.
• Teacher value functions: besides the teacher policy, the learning agent may also refer to the value function derived by a teacher policy, which implies the quality of state-actions from the teacher's point of view.

1. https://gym.openai.com/envs/HalfCheetah-v2/
3.3 Evaluation Metrics

In this section, we present some representative metrics for evaluating TL approaches, which have also been partly summarized in prior work [11, 66]:
• Jumpstart performance (jp): the initial performance (returns) of the agent.
• Asymptotic performance (ap): the ultimate performance (returns) of the agent.
• Accumulated rewards (ar): the area under the learning curve of the agent.
• Transfer ratio (tr): the ratio between the asymptotic performance of the agent with TL and the asymptotic performance of the agent without TL.
• Time to threshold (tt): the learning time (iterations) needed for the target agent to reach a certain performance threshold.
• Performance with fixed training epochs (pe): the performance achieved by the target agent after a specific number of training iterations.
• Performance sensitivity (ps): the variance in returns using different hyper-parameter settings.

The above criteria mainly focus on the learning process of the target agent. In addition, we introduce the following metrics from the perspective of the transferred knowledge, which, although commensurately important for evaluation, have not been explicitly discussed by prior art:
• Necessary knowledge amount (nka): the necessary amount of knowledge required for TL in order to achieve certain performance thresholds. Examples along this line include the number of designed source tasks [67], the number of expert policies, or the number of demonstrated interactions [68] required to enable knowledge transfer.
• Necessary knowledge quality (nkq): the guaranteed quality of the knowledge required to enable effective TL. This metric helps in answering questions such as (i) does the TL approach rely on near-oracle knowledge, such as expert demonstrations or policies [69], or (ii) is the TL technique feasible even given suboptimal knowledge [63]?

TL approaches differ in various perspectives, including the forms of transferred knowledge, the RL frameworks utilized to enable such transfer, and the gaps between the source and the target domain. It may be biased to evaluate TL from just one viewpoint. We believe that explicating these TL-related metrics helps in designing more generalizable and efficient TL approaches.

In general, most of the abovementioned metrics can be considered as evaluating two abilities of a TL approach: mastery and generalization. Mastery refers to how well the learned agent can ultimately perform in the target domain, while generalization refers to the ability of the learning agent to quickly adapt to the target domain.
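A minimal sketch (not from the survey) of how several of these metrics could be computed from logged learning curves is given below; the array shapes, the 10%-of-training window used for the jumpstart and asymptotic estimates, and the synthetic curves are illustrative assumptions.

```python
import numpy as np

def transfer_metrics(returns_tl, returns_no_tl, threshold):
    """Compute TL evaluation metrics from per-iteration return curves (1-D arrays)."""
    n = len(returns_tl)
    head = max(1, n // 10)                      # early/late-training window (assumption)
    metrics = {
        "jumpstart (jp)": returns_tl[:head].mean(),
        "asymptotic (ap)": returns_tl[-head:].mean(),
        "accumulated rewards (ar)": np.trapz(returns_tl),   # area under the learning curve
        "transfer ratio (tr)": returns_tl[-head:].mean() / returns_no_tl[-head:].mean(),
    }
    above = np.nonzero(returns_tl >= threshold)[0]
    metrics["time to threshold (tt)"] = int(above[0]) if above.size else None
    return metrics

# Example with synthetic curves: the TL agent starts higher and converges faster.
steps = np.arange(500)
with_tl = 300 - 200 * np.exp(-steps / 80) + np.random.randn(500) * 5
without_tl = 300 - 290 * np.exp(-steps / 160) + np.random.randn(500) * 5
print(transfer_metrics(with_tl, without_tl, threshold=250.0))
```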

4 RELATED WORK

There are prior efforts in summarizing TL research in RL. One of the earliest literatures is [11]. Their main categorization is from the perspective of problem setting, in which the TL scenarios may vary in the number of domains involved and the difference of the state-action space among domains. A similar categorization is adopted by [12], with more refined analysis dimensions including the objective of TL. As pioneer surveys for TL in RL, neither [11] nor [12] covered research over the last decade. For instance, [11] emphasized different task-mapping methods, which are more suitable for domains with tabular or mild state-action space dimensions. There are other surveys focused on specific subtopics that interplay between RL and TL. For instance, [70] consolidated sim-to-real TL methods. They explored work that is more tailored for robotics domains, including domain generalization and zero-shot transfer, which is a favored application field of DRL as we discuss in Sec 6. [71] conducted an extensive database search and summarized benchmarks for evaluating TL algorithms in RL. [72] surveyed recent progress in multi-task RL. They partially share research focus with us by studying certain TL-oriented solutions towards multi-task RL, such as learning shared representations, PathNets, etc. We survey TL for RL with a broader spectrum in methodologies, applications, and evaluations, which naturally draws connections to the above literatures.

5 TRANSFER LEARNING APPROACHES DEEP DIVE

In this section, we elaborate on various TL approaches and organize them into different sub-topics, mostly by answering the question of "what knowledge is transferred". For each type of TL approach, we analyze it by following the other criteria mentioned in Section 3 and summarize the key evaluation metrics that are applicable to the discussed work. Figure 1 presents an overview of the different TL approaches discussed in this survey.

Fig. 1: An overview of different TL approaches, organized by the format of transferred knowledge.

5.1 Reward Shaping

We start by introducing the Reward Shaping approach, as it is applicable to most RL backbones and also largely overlaps with the other TL approaches discussed later. Reward Shaping (RS) is a technique that leverages exterior knowledge to reconstruct the reward distribution of the target domain to guide the agent's policy learning. More specifically, in addition to the environment reward signals, RS learns a reward-shaping function F : S × A × S → R to render auxiliary rewards, provided that the additional rewards contain external knowledge to guide the agent towards better action selections. Intuitively, an RS strategy will assign higher rewards to more beneficial state-actions to navigate the agent to desired trajectories. As a result, the agent will learn its policy using the newly shaped rewards R′: R′ = R + F, which means that RS has altered the target domain with a different reward function:

M = (S, A, T, γ, R) → M′ = (S, A, T, γ, R′).   (1)

Along the line of RS, Potential-Based Reward Shaping (PBRS) is one of the most classical approaches. [61] proposed PBRS to form a shaping function F as the difference between two potential functions Φ(·):

F(s, a, s′) = γΦ(s′) − Φ(s),   (2)

where the potential function Φ(·) comes from the knowledge of expertise and evaluates the quality of a given state. It has been proved that, without further restrictions on the underlying MDP or the shaping function F, PBRS is sufficient and necessary to preserve the policy invariance. Moreover, the optimal Q-functions in the original and transformed MDPs are related by the potential function: Q*_{M′}(s, a) = Q*_M(s, a) − Φ(s), which draws a connection between potential-based reward shaping and advantage-based learning approaches [73].

The idea of PBRS was extended by [74], which formulated the potential as a function over both the state and the action spaces. This approach is called Potential-Based state-action Advice (PBA). The potential function Φ(s, a) therefore evaluates how beneficial an action a is to take from state s:

F(s, a, s′, a′) = γΦ(s′, a′) − Φ(s, a).   (3)

PBA requires on-policy learning and can be sample-costly, as in Equation (3), a′ is the action taken by the learning policy once state s transitions to s′.

Traditional RS approaches assumed a static potential function, until [75] proposed a Dynamic Potential-Based (DPB) approach which makes the potential a function of both states and time: F(s, t, s′, t′) = γΦ(s′, t′) − Φ(s, t). They proved that this dynamic approach can still maintain policy invariance: Q*_{M′}(s, a) = Q*_M(s, a) − Φ(s, t), where t is the current timestep. [76] later introduced a way to incorporate any prior knowledge into a dynamic potential function structure, which is called Dynamic Potential-Based Advice (DPBA). The rationale behind DPBA is that, given any extra reward function R⁺ from prior knowledge, in order to add this extra reward to the original reward function, the potential function should satisfy: γΦ(s′, a′) − Φ(s, a) = F(s, a) = R⁺(s, a). If Φ is not static but learned as an extra state-action value function over time, then the Bellman equation for Φ is Φ^π(s, a) = r_Φ(s, a) + γΦ(s′, a′). The shaping reward F(s, a) is therefore the negation of r_Φ(s, a):

F(s, a) = γΦ(s′, a′) − Φ(s, a) = −r_Φ(s, a).   (4)

This leads to the approach of using the negation of R⁺ as the immediate reward to train an extra state-action value function Φ and the policy simultaneously. Accordingly, the dynamic potential function F becomes:

F_t(s, a) = γΦ_{t+1}(s′, a′) − Φ_t(s, a).   (5)

The advantage of DPBA is that it provides a framework to allow arbitrary knowledge to be shaped as auxiliary rewards.

Research along this line mainly focuses on designing different shaping functions F(s, a), while not much work has tackled the question of what knowledge can be used to derive this potential function. One work by [77] proposed to use RS to transfer an expert policy from the source domain Ms to the target domain Mt. This approach assumed the existence of two mapping functions MS and MA that can transform the state and action from the source to the target domain. Another work used demonstrated state-action samples from an expert policy to shape rewards [78]. Learning the augmented reward involves learning a discriminator to distinguish samples generated by an expert policy from samples generated by the target policy. The loss of the discriminator is applied to shape rewards to incentivize the learning agent to mimic the expert behavior. This work combines two TL approaches: RS and Learning from Demonstrations, the latter of which will be elaborated in Section 5.2.

The above-mentioned RS approaches are summarized in Table 1. They follow the potential-based RS principle that has been developed systematically: from the classical PBRS, which is built on a static potential shaping function of states, to PBA, which generates the potential as a function of both states and actions, and DPB, which learns a dynamic potential function of states and time, to the most recent DPBA, which involves a dynamic potential function of states and actions to be learned as an extra state-action value function in parallel with the environment value function. As an effective TL paradigm, RS has been widely applied to fields including robot training [79], spoken dialogue systems [80], and question answering [81]. It provides a feasible framework for transferring knowledge as the augmented reward and is generally applicable to various RL algorithms. RS has also been applied to multi-agent RL [82] and model-based RL [83]. Principled integration of RS with other TL approaches, such as Learning from Demonstrations (Section 5.2) and Policy Transfer (Section 5.3), will be an intriguing question for ongoing research.

Note that the RS approaches discussed so far are built upon a consensus that the source information for shaping the reward comes externally, which coincides with the notion of knowledge transfer. Some RS work also tackles the scenario where the shaped reward comes intrinsically. For instance, Belief Reward Shaping was proposed by [84], which utilizes a Bayesian reward shaping framework to generate a potential value that decays with experience, where the potential value comes from the critic itself.
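To make the potential-based shaping rule of Eq. (2) concrete, below is a minimal sketch (not from any cited paper) that augments an environment reward with F(s, a, s′) = γΦ(s′) − Φ(s); the particular potential function, which scores states by a task-specific heuristic, is an illustrative assumption.

```python
GAMMA = 0.99

def potential(state):
    """Heuristic state potential Phi(s) encoding prior knowledge (assumed here:
    a larger forward position is better, as in a locomotion task)."""
    forward_position = state[0]
    return forward_position

def shaped_reward(state, action, next_state, env_reward, gamma=GAMMA):
    """PBRS: R'(s, a, s') = R(s, a, s') + gamma * Phi(s') - Phi(s).
    Potential-based shaping of this form preserves the optimal policy of the original MDP."""
    F = gamma * potential(next_state) - potential(state)
    return env_reward + F

# Any RL algorithm can then be trained on (state, action, shaped_reward, next_state)
# transitions instead of the raw environment reward.
```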
Methods | MDP difference | Format of shaping reward | Knowledge source | Evaluation metrics
PBRS | Ms = Mt | F = γΦ(s′) − Φ(s) | ✗ | ap, ar
PBA | Ms = Mt | F = γΦ(s′, a′) − Φ(s, a) | ✗ | ap, ar
DPB | Ms = Mt | F = γΦ(s′, t′) − Φ(s, t) | ✗ | ap, ar
DPBA | Ms = Mt | Ft = γΦt+1(s′, a′) − Φt(s, a), Φ learned as an extra Q-function | ✗ | ap, ar
[77] | Ss ≠ St, As ≠ At | Ft = γΦt+1(s′, a′) − Φt(s, a) | πs | ap, ar
[78] | Ms = Mt | Ft = γΦt+1(s′, a′) − Φt(s, a) | DE | ap, ar

TABLE 1: A comparison of reward shaping approaches. ✗ denotes that the information is not revealed in the paper.

5.2 Learning from Demonstrations

Learning from Demonstrations (LfD) is a technique to assist RL by utilizing external demonstrations for more efficient exploration. The demonstrations may come from different sources with varying qualities. Research along this line usually addresses a scenario where the source and the target MDPs are the same: Ms = Mt, although there has been work that learns from demonstrations generated in a different domain [85, 86].

Depending on when the demonstrations are used for knowledge transfer, approaches can be organized into offline and online methods. For offline approaches, demonstrations are either used for pre-training RL components or for offline RL [87, 88]. When leveraging demonstrations for pre-training, RL components such as the value function V(s) [89], the policy π [90], or the model of transition dynamics [91] can be initialized by learning from demonstrations. For the online approach, demonstrations are directly used to guide agent actions for efficient exploration [92]. Most work discussed in this section follows the online transfer paradigm or combines offline pre-training with online RL [93].

Work along this line can also be categorized depending on what RL frameworks are compatible: some adopt the policy-iteration framework [59, 94, 95], some follow a Q-learning framework [92, 96], while recent work usually follows the policy-gradient framework [63, 78, 93, 97].

Demonstrations have been leveraged in the policy iteration framework by [98]. Later, [94] introduced the Direct Policy Iteration with Demonstrations (DPID) algorithm. This approach samples complete demonstrated rollouts DE from an expert policy πE, in combination with the self-generated rollouts Dπ gathered from the learning agent. Dπ ∪ DE are used to learn a Monte-Carlo estimation of the Q-value, Q̂, from which a learning policy can be derived greedily: π(s) = argmax_{a∈A} Q̂(s, a). This policy π is further regularized by a loss function L(s, πE) to minimize its discrepancy from the expert policy decision. Another example is the Approximate Policy Iteration with Demonstration (APID) algorithm, which was proposed by [59] and extended by [95]. Different from DPID, where both DE and Dπ are used for value estimation, the APID algorithm solely applies Dπ to approximate the Q-function. Expert demonstrations DE are used to learn the value function, which, given any state si, renders expert actions πE(si) with higher Q-value margins compared with other actions that are not shown in DE:

Q(si, πE(si)) − max_{a∈A\πE(si)} Q(si, a) ≥ 1 − ξi.   (6)

The term ξi is used to account for the case of imperfect demonstrations. [95] further extended the work of APID with a different evaluation loss:

Lπ = E_{(s,a)∼Dπ} ∥T*Q(s, a) − Q(s, a)∥,   (7)

where T*Q(s, a) = R(s, a) + γ E_{s′∼p(·|s,a)}[max_{a′} Q(s′, a′)]. Their work theoretically converges to the optimal Q-function compared with APID, as Lπ is minimizing the optimal Bellman residual instead of the empirical norm.

In addition to policy iteration, the following two approaches integrate demonstration data into the TD-learning framework, such as Q-learning. Specifically, [92] proposed the DQfD algorithm, which maintains two separate replay buffers to store demonstrated data and self-generated data, respectively, so that expert demonstrations can always be sampled with a certain probability. Their method leverages the refined priority replay mechanism [99], where the probability of sampling a transition i is based on its priority pi with a temperature parameter α: P(i) = p_i^α / Σ_k p_k^α. Another algorithm named LfDS was proposed by [96], which draws a close connection to reward shaping (Section 5.1). LfDS builds the potential value of a state-action pair as the highest similarity between the given pair and the expert demonstrations. This augmented reward assigns more credit to state-actions that are more similar to expert demonstrations, encouraging the agent towards expert-like behavior.

Besides Q-learning, recent work has integrated LfD into policy gradient [63, 69, 78, 93, 97]. A representative work along this line is Generative Adversarial Imitation Learning (GAIL) [69]. GAIL introduced the notion of occupancy measure dπ, which is the stationary state-action distribution derived from a policy π. Based on this notion, a new reward function is designed such that maximizing the accumulated new rewards encourages minimizing the distribution divergence between the occupancy measure of the current policy π and that of the expert policy πE. Specifically, the new reward is learned by adversarial training [62]: a discriminator D is learned to distinguish interactions sampled from the current policy π from those of the expert policy πE:

J_D = max_{D:S×A→(0,1)} E_{dπ} log[1 − D(s, a)] + E_{dE} log[D(s, a)].   (8)

Since πE is unknown, its state-action distribution dE is estimated based on the given expert demonstrations DE. The output of the discriminator is used as a new reward to encourage distribution matching, with r′(s, a) = −log(1 − D(s, a)). The RL process is naturally altered to perform distribution matching by min-max optimization:

min_π max_D J(π, D) := E_{dπ} log[1 − D(s, a)] + E_{dE} log[D(s, a)].
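The sketch below (an illustration, not the reference implementation of [69]) shows a discriminator update in the spirit of Eq. (8) and the shaped reward r′(s, a) = −log(1 − D(s, a)) that replaces the environment reward; the network size and optimizer settings are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class Discriminator(nn.Module):
    """D : S x A -> (0, 1), trained to output ~1 on expert pairs and ~0 on policy pairs."""
    def __init__(self, state_dim, action_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1), nn.Sigmoid(),
        )

    def forward(self, states, actions):
        return self.net(torch.cat([states, actions], dim=-1)).squeeze(-1)

def discriminator_step(disc, optimizer, expert_batch, policy_batch, eps=1e-8):
    """One ascent step on J_D = E_{d_pi}[log(1 - D)] + E_{d_E}[log D]."""
    d_expert = disc(*expert_batch)
    d_policy = disc(*policy_batch)
    loss = -(torch.log(d_expert + eps).mean() + torch.log(1.0 - d_policy + eps).mean())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

def imitation_reward(disc, states, actions, eps=1e-8):
    """Shaped reward r'(s, a) = -log(1 - D(s, a)) fed to the RL backbone (e.g. TRPO/PPO)."""
    with torch.no_grad():
        return -torch.log(1.0 - disc(states, actions) + eps)
```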

Policy Optimization from Demonstrations (POfD), which com- We summarize the above-discussed approaches in Table 2.
bines the discriminator reward with the environment reward: In general, demonstration data can help in both offline pre-
training for better initialization and online RL for efficient
max = Edπ [r(s, a)] − λDJS [dπ ||dE ]. (9) exploration. During the RL phase, demonstration data can
θ
be used together with self-generated data to encourage
Both GAIL and POfD are under an on-policy RL frame- expert-like behaviors (DDPGfD, DQFD), to shape value
work. To further improve the sample efficiency of TL, functions (APID), or to guide the policy update in the form
some off-policy algorithms have been proposed, such as of an auxiliary objective function (PID,GAIL, POfD). To
DDPGfD [78] which is built upon the DDPG framework. validate the algorithm robustness given different knowledge
DDPGfD shares a similar idea as DQfD in that they both resources, most LfD methods are evaluated using metrics that
use a second replay buffer for storing demonstrated data, either indicate the performance under limited demonstrations
and each demonstrated sample holds a sampling priority (nka) or suboptimal demonstrations (nka). The integration of
pi . For a demonstrated sample, its priority pi is augmented LfD with off-policy RL backbone makes it natural to adopt pe
with a constant bias ϵD > 0 for encouraging more frequent metrics for evaluating how learning efficiency can be further
sampling of expert demonstrations: improved by knowledge transfer. Developing more general
pi = δi2 + λ∥∇a Q(si , ai |θQ )∥2 + ϵ + ϵD , LfD approaches that are agnostic to RL frameworks and can
learn from sub-optimal or limited demonstrations would be
where δi is the TD-residual for transition, ∥∇a Q(si , ai |θQ )∥2 the ongoing focus for this research domain.
is the loss applied to the actor, and ϵ is a small positive
constant to ensure all transitions are sampled with some prob- 5.3 Policy Transfer
ability. Another work also adopted the DDPG framework to Policy transfer is a TL approach where the external knowledge
learn from demonstrations [93]. Their approach differs from takes the form of pre-trained policies from one or multiple
DDPGfD in that its objective function is augmented with source domains. Work discussed in this section is built upon
a Behavior Cloning Loss to encourage imitating on provided a many-to-one problem setting, described as below:
P|D |
demonstrations: LBC = i=1E ||π(si |θπ ) − ai ||2 .
To further address the issue of suboptimal demonstrations, Policy Transfer. A set of teacher policies πE1 , πE2 , . . . , πEK
in [93] the form of Behavior Cloning Loss is altered based on are trained on a set of source domains M1 , M2 , . . . , MK ,
the critic output, so that only demonstration actions with respectively. A student policy π is learned for a target domain
higher Q values will lead to the loss penalty: by leveraging knowledge from {πEi }Ki=1 .
|DE | For the one-to-one scenario with only one teacher policy,
∥π(si |θπ ) − ai ∥2 1[Q(si , ai ) > Q(si , π(si ))]. (10)
X
LBC = one can consider it as a special case of the above with K = 1.
i=1 Next, we categorize recent work of policy transfer into two
There are several challenges faced by LfD, one of which techniques: policy distillation and policy reuse.
is the imperfect demonstrations. Previous approaches usu-
ally presume near-oracle demonstrations. Towards tackling 5.3.1 Transfer Learning via Policy Distillation
suboptimal demonstrations, [59] leveraged the hinge-loss The idea of knowledge distillation has been applied to the
function to allow occasional violations of the property that field of RL to enable policy distillation. Knowledge distillation
Q(si , πE (si )) − max Q(si , a) ≥ 1. Some other work was first proposed by [104] as an approach of knowledge
a∈A\πE (si )
uses regularized objective to alleviate overfitting on biased ensemble from multiple teacher models into a single stu-
data [92, 99]. A different strategy is to leverage those sub- dent model. Conventional policy distillation approaches
optimal demonstrations only to boost the initial learning transfer the teacher policy following a supervised learning
stage. For instance, [63] proposed Self-Adaptive Imitation paradigm [105, 106]. Specifically, a student policy is learned
Learning (SAIL), which learns from suboptimal demonstra- by minimizing the divergence of action distributions between
tions using generative adversarial training while gradually the teacher policy πE and student policy πθ , which is denoted
selecting self-generated trajectories with high qualities to as H× (πE (τt )|πθ (τt )):
replace less superior demonstrations. |τ |
Another challenge faced by LfD is covariate drift ([100]):
X
min Eτ ∼πE [ ∇θ H× (πE (τt )|πθ (τt ))]. (11)
demonstrations may be provided in limited numbers, which θ
t=1
results in the learning agent lacking guidance on states that
are unseen in the demonstration dataset. This challenge is The above expectation is taken over trajectories sampled from
aggravated in MDPs with sparse reward feedbacks, as the the teacher policy πE , hence this approach is called teacher
learning agent cannot obtain much supervision information distillation. One example along this line is [105], in which N
from the environment either. Current efforts to address this teacher policies are learned for N source tasks separately, and
challenge include encouraging explorations by using an each teacher yields a dataset DE = {si , qi }N i=0 consisting of
entropy-regularized objective [101], decaying the effects of observations s and vectors of the corresponding Q-values
demonstration guidance by softening its regularization on q , such that qi = [Q(si , a1 ), Q(si , a2 ), ...|aj ∈ A]. Teacher
policy learning over time [102], and introducing disagreement policies are further distilled to a single student πθ by min-
regularizations by training an ensemble of policies based on imizing the KL-Divergence between each teacher πEi (a|s)
E
the given demonstrations, where the variance among policies and the student πθ , approximated  using  the datasetE D  :
E E
|D | q softmax(q )
minθ DKL (π E |πθ ) ≈ i=1 softmax τi ln softmax(qiθ ) .
P
serves as a negative reward function [103].
i
9
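Below is a brief sketch (an illustration under assumed tensor shapes, not the implementation of [105]) of the kind of distillation loss described above: teacher Q-values are converted into a softened action distribution with temperature τ, and the student is trained to match it with a KL objective.

```python
import torch
import torch.nn.functional as F

def distillation_loss(teacher_q, student_logits, tau=0.01):
    """KL(teacher || student) over actions for a batch of states.

    teacher_q      -- tensor [batch, n_actions] of teacher Q-values from the dataset D^E
    student_logits -- tensor [batch, n_actions] of student policy logits
    tau            -- temperature softening the teacher's Q-values
    """
    teacher_dist = F.softmax(teacher_q / tau, dim=-1)         # softmax(q^E / tau)
    student_log_dist = F.log_softmax(student_logits, dim=-1)  # log softmax of student policy
    teacher_log_dist = torch.log(teacher_dist + 1e-8)
    # KL(teacher || student) = sum_a p_teacher * (log p_teacher - log p_student)
    return (teacher_dist * (teacher_log_dist - student_log_dist)).sum(dim=-1).mean()

# Typical use: sample (s, q^E) pairs from the teacher-generated dataset, compute the
# student's logits for s, and take a gradient step on distillation_loss.
```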
Methods | Optimality guarantee | Format of transferred demonstrations | RL framework | Evaluation metrics
DPID | ✓ | Indicator binary loss: L(si) = 1{πE(si) ≠ π(si)} | API | ap, ar, nka
APID | ✗ | Hinge loss on the marginal loss: L(Q, π, πE) | API | ap, ar, nta, nkq
APID extended | ✓ | Marginal loss: L(Q, π, πE) | API | ap, ar, nta, nkq
[93] | ✓ | Increasing sampling priority and behavior cloning loss | DDPG | ap, ar, tr, pe, nkq
DQfD | ✗ | Cached transitions in the replay buffer | DQN | ap, ar, tr
LfDS | ✗ | Reward shaping function | DQN | ap, ar, tr
GAIL | ✓ | Reward shaping function: −λ log(1 − D(s, a)) | TRPO | ap, ar, tr, pe, nka
POfD | ✓ | Reward shaping function: r(s, a) − λ log(1 − D(s, a)) | TRPO, PPO | ap, ar, tr, pe, nka
DDPGfD | ✓ | Increasing sampling priority | DDPG | ap, ar, tr, pe
SAIL | ✗ | Reward shaping function: r(s, a) − λ log(1 − D(s, a)) | DDPG | ap, ar, tr, pe, nkq, nka

TABLE 2: A comparison of learning from demonstration approaches.

Another policy distillation approach is student distillation [51, 60], which resembles teacher distillation except that during the optimization step, the objective expectation is taken over trajectories sampled from the student policy instead of the teacher policy, i.e.: min_θ E_{τ∼πθ} [ Σ_{t=1}^{|τ|} ∇_θ H×(πE(τt)|πθ(τt)) ]. [60] summarized related work on both kinds of distillation approaches. Although it is feasible to combine both distillation approaches [100], we observe that more recent work focuses on student distillation, which empirically shows better exploration ability compared to teacher distillation, especially when the teacher policies are deterministic.

Taking an alternative perspective, there are two approaches to policy distillation: (1) minimizing the cross-entropy between the teacher and student policy distributions over actions [51, 107]; and (2) maximizing the probability that the teacher policy will visit trajectories generated by the student, i.e. max_θ P(τ ∼ πE | τ ∼ πθ) [50, 108]. One example of approach (1) is the Actor-Mimic algorithm [51]. This algorithm distills the knowledge of expert agents into the student by minimizing the cross-entropy between the student policy πθ and each teacher policy πEi over actions: Li(θ) = Σ_{a∈A_Ei} πEi(a|s) log πθ(a|s), where each teacher agent is learned using a DQN framework. The teacher policy is therefore derived from the Boltzmann distribution over the Q-function outputs: πEi(a|s) = e^{τ⁻¹ Q_Ei(s,a)} / Σ_{a′∈A_Ei} e^{τ⁻¹ Q_Ei(s,a′)}. An instantiation of approach (2) is the Distral algorithm [50], which learns a centroid policy πθ that is derived from K teacher policies. The knowledge in each teacher πEi is distilled to the centroid and gets transferred to the student, while both the transition dynamics Ti and reward distributions Ri for each source domain Mi are heterogeneous. The student policy is learned by maximizing a multi-task learning objective max_θ Σ_{i=1}^K J(πθ, πEi), where

J(πθ, πEi) = E_{(s_t,a_t)∼πθ} [ Σ_{t≥0} γ^t ( r_i(a_t, s_t) + (α/β) log πθ(a_t|s_t) − (1/β) log πEi(a_t|s_t) ) ],

in which both log πθ(a_t|s_t) and −log πEi(a_t|s_t) are used as augmented rewards. Therefore, the above approach also draws a close connection to Reward Shaping (Section 5.1). In effect, the log πθ(a_t|s_t) term guides the learning policy πθ to yield actions that are more likely to be generated by the teacher policy, whereas the entropy term −log πEi(a_t|s_t) encourages exploration. A similar approach was proposed by [107], which only uses the cross-entropy between the teacher and student policy, λH(πE(a_t|s_t) ∥ πθ(a_t|s_t)), to reshape rewards. Moreover, they adopted a dynamically fading coefficient to alleviate the effect of the augmented reward, so that the student policy becomes independent of the teachers after certain optimization iterations.

5.3.2 Transfer Learning via Policy Reuse

Policy reuse directly reuses policies from source tasks to build the target policy. The notion of policy reuse was proposed by [109], which directly learns the target policy as a weighted combination of different source-domain policies, where the probability for each source-domain policy to be used is related to its expected performance gain in the target domain: P(πEi) = exp(tWi) / Σ_{j=0}^K exp(tWj), where t is a dynamic temperature parameter that increases over time. Under a Q-learning framework, the Q-function of the target policy is learned in an iterative scheme: during every learning episode, Wi is evaluated for each expert policy πEi, and W0 is obtained for the learning policy, from which a reuse probability P is derived. Next, a behavior policy is sampled from this probability P. After each training episode, both Wi and the temperature t for calculating the reuse probability are updated accordingly. One limitation of this approach is that Wi, i.e. the expected return of each expert policy on the target task, needs to be evaluated frequently. This work was implemented in the tabular case, leaving the scalability issue unresolved. More recent work by [110] extended the policy improvement theorem [111] from one to multiple policies, which is named Generalized Policy Improvement. We refer to its main theorem as follows:

Theorem [Generalized Policy Improvement (GPI)]. Let {πi}_{i=1}^n be n policies and let {Q̂^{πi}}_{i=1}^n be their approximated action-value functions, s.t. |Q^{πi}(s, a) − Q̂^{πi}(s, a)| ≤ ϵ, ∀s ∈ S, a ∈ A, and i ∈ [n]. Define π(s) = argmax_a max_i Q̂^{πi}(s, a); then Q^π(s, a) ≥ max_i Q^{πi}(s, a) − (2/(1 − γ)) ϵ, ∀ s ∈ S, a ∈ A.

Based on this theorem, a policy improvement approach can be naturally derived by greedily choosing the action which renders the highest Q-value among all policies for a given state. Another work along this line is [110], in which an expert policy πEi is also trained on a different source domain Mi with reward function Ri, so that Q^π_{M0}(s, a) ≠ Q^π_{Mi}(s, a).
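A minimal sketch (not from [110]) of the GPI rule above: given approximate Q-functions for several source policies, the target agent acts greedily with respect to their pointwise maximum. Tabular Q-arrays and the toy sizes are assumptions for illustration.

```python
import numpy as np

def gpi_action(q_tables, state):
    """Generalized Policy Improvement over n source policies.

    q_tables -- array [n_policies, n_states, n_actions] of (approximate) Q^{pi_i}
    Returns the action arg max_a max_i Q^{pi_i}(state, a).
    """
    q_max = q_tables[:, state, :].max(axis=0)   # pointwise max over source policies
    return int(q_max.argmax())

# Example: three source policies evaluated on a 5-state, 4-action target task.
rng = np.random.default_rng(0)
q_tables = rng.normal(size=(3, 5, 4))
behavior = [gpi_action(q_tables, s) for s in range(5)]
print(behavior)   # one GPI-greedy action per state
```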

Another work along this line is [110], in which an expert policy πEi is also trained on a different source domain Mi with reward function Ri, so that Q^π_{M0}(s, a) ≠ Q^π_{Mi}(s, a). To efficiently evaluate the Q-functions of different source policies in the target MDP, a disentangled representation ψ(s, a) over the states and actions is learned using neural networks and is generalized across multiple tasks. Next, a task (reward) mapper wi is learned, based on which the Q-function can be derived: Q^{πi}(s, a) = ψ(s, a)⊤ wi. [110] proved that the loss of GPI is bounded by the difference between the source and the target tasks. In addition to policy reuse, their approach involves learning a shared representation ψ(s, a), which is also a form of transferred knowledge and will be elaborated more in Section 5.5.2.

We summarize the abovementioned policy transfer approaches in Table 3. In general, policy transfer can be realized by knowledge distillation, which can be optimized either from the student's perspective (student distillation) or from the teacher's perspective (teacher distillation). Alternatively, teacher policies can also be directly reused to update the target policy. Regarding evaluation, most of the abovementioned work has investigated a multi-teacher transfer scenario, hence the generalization ability or robustness is largely evaluated on metrics such as performance sensitivity (ps), e.g., performance given different numbers of teacher policies or source tasks. Performance with fixed epochs (pe) is another commonly shared metric to evaluate how quickly the learned policy can adapt to the target domain. All approaches discussed so far presume one or multiple expert policies that are always at the disposal of the learning agent. Open questions along this line include how to leverage imperfect policies for knowledge transfer, and how to refer to teacher policies within a limited budget.

5.4 Inter-Task Mapping

In this section, we review TL approaches that utilize mapping functions between the source and the target domains to assist knowledge transfer. Research in this domain can be analyzed from two perspectives: (1) which domain does the mapping function apply to, and (2) how is the mapped representation utilized. Most work discussed in this section shares a common assumption as below:

Assumption. One-to-one mappings exist between the source domain Ms and the target domain Mt.

Earlier work along this line requires a given mapping function [66, 112]. One example is [66], which assumes that each target state (action) has a unique correspondence in the source domain, and that two mapping functions XS, XA are provided over the state space and the action space, respectively, so that XS(St) → Ss and XA(At) → As. Based on XS and XA, a mapping function over the Q-values M(Qs) → Qt can be derived accordingly. Another work is done by [112], which transfers advice as the knowledge between two domains. In their setting, the advice comes from a human expert who provides the mapping function over the Q-values in the source domain and transfers it to the learning policy for the target domain. This advice encourages the learning agent to prefer certain good actions over others, which equivalently provides a relative ranking of actions in the new task.

More recent research tackles the inter-task mapping problem by learning a mapping function [113–115]. Most work learns a mapping function over the state space or a subset of the state space. In these works, state representations are usually divided into agent-specific and task-specific representations, denoted as sagent and senv, respectively. In [113] and [114], the mapping function is learned on the agent-specific sub-state, and the mapped representation is applied to reshape the immediate reward. For [113], the invariant feature space mapped from sagent can be applied across agents that have distinct action spaces but share some morphological similarity. Specifically, they assume that both agents have been trained on the same proxy task, based on which the mapping function is learned. The mapping function is learned using an encoder-decoder structure [116] to largely preserve information about the source domain. For transferring knowledge from the source agent to a new task, the environment reward is augmented with a shaped reward term to encourage the target agent to imitate the source agent on an embedded feature space:

r′(s, ·) = α( f(s^s_agent; θf) − g(s^t_agent; θg) ),   (12)

where f(s^s_agent) is the agent-specific state in the source domain, and g(s^t_agent) is for the target domain.

Another work is [115], which applied the Unsupervised Manifold Alignment (UMA) method [117] to automatically learn the state mapping. Their approach requires collecting trajectories from both the source and the target domain to learn such a mapping. While applying policy-gradient learning, trajectories from the target domain Mt are first mapped back to the source: τt → τs; then an expert policy in the source domain is applied to each initial state of those trajectories to generate near-optimal trajectories τ̃s, which are further mapped to the target domain: τ̃s → τ̃t. The deviation between τt and τ̃t is used as a loss to be minimized in order to improve the target policy. Similar ideas of using UMA for inter-task mapping can also be found in [118] and [119].

In addition to approaches that utilize mappings over states or actions, [120] proposed to learn an inter-task mapping over the transition dynamics space: S × A × S. Their work assumes that the source and target domains are different in terms of the transition space dimensionality. Transitions from both the source domain ⟨ss, as, s′s⟩ and the target domain ⟨st, at, s′t⟩ are mapped to a latent space Z. Given the latent feature representations, a similarity measure can be applied to find a correspondence between the source and target task triplets. Triplet pairs with the highest similarity in this feature space Z are used to learn a mapping function X: ⟨st, at, s′t⟩ = X(⟨ss, as, s′s⟩). After the transition mapping, states sampled from the expert policy in the source domain can be leveraged to render beneficial states in the target domain, which assists the target agent learning with a better initialization performance. A similar idea of mapping transition dynamics can be found in [121], which, however, requires a stronger assumption on the similarity of the transition probability and the state representations between the source and the target domains.

As summarized in Table 4, for TL approaches that utilize an inter-task mapping, the mapped knowledge can be (a subset of) the state space [113, 114], the Q-function [66], or (representations of) the state-action-state transitions [120]. In addition to being directly applicable in the target domain [120], the mapped representation can also be used as an augmented shaping reward [113, 114] or a loss objective [115] in order to guide the agent learning in the target domain.
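To illustrate how a mapped representation can serve as an augmented shaping reward in the spirit of Eq. (12), the following is a minimal sketch; the embedding networks f_embed and g_embed, the scaling factor alpha, and the use of a negative Euclidean distance as the shaping term are illustrative assumptions rather than the exact formulation of [113].

```python
import numpy as np

def shaped_reward(env_reward, s_agent_src, s_agent_tgt, f_embed, g_embed, alpha=0.1):
    """Augment the environment reward with a term that encourages the
    target agent to stay close to the source agent in a shared embedded
    feature space (cf. Eq. (12)). Here the shaping term is the negative
    distance between the two embeddings."""
    gap = np.linalg.norm(f_embed(s_agent_src) - g_embed(s_agent_tgt))
    return env_reward - alpha * gap

# Toy usage with identity "embeddings" over 2-D agent-specific states:
f = g = lambda s: np.asarray(s, dtype=float)
print(shaped_reward(1.0, s_agent_src=[0.0, 0.0], s_agent_tgt=[3.0, 4.0],
                    f_embed=f, g_embed=g))  # 1.0 - 0.1 * 5.0 = 0.5
```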
Paper | Transfer approach | MDP difference | RL framework | Evaluation metrics
[105] | Distillation | S, A | DQN | ap, ar
[106] | Distillation | S, A | DQN | ap, ar, pe, ps
[51] | Distillation | S, A | Soft Q-learning | ap, ar, tr, pe, ps
[50] | Distillation | S, A | A3C | ap, ar, pe, tt
[109] | Reuse | R | Tabular Q-learning | ap, ar, ps, tr
[110] | Reuse | R | DQN | ap, ar, pe, ps

TABLE 3: A comparison of policy transfer approaches.
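As a concrete reading of how a hand-specified inter-task mapping can be used to reuse Q-values (the strategy attributed to [66] above), here is a minimal sketch; the dictionary-based Q-tables and the map_state / map_action helpers are hypothetical stand-ins for the provided mappings XS and XA.

```python
def transfer_q_values(q_source, target_states, target_actions, map_state, map_action):
    """Initialize a target-task Q-table from a source-task Q-table via
    user-provided state/action mappings: Q_t(s, a) <- Q_s(X_S(s), X_A(a)).
    q_source: dict mapping (source_state, source_action) -> Q-value."""
    return {
        (s, a): q_source[(map_state(s), map_action(a))]
        for s in target_states
        for a in target_actions
    }
```

The target learner can then start from this warm-started table instead of a random initialization.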
Most inter-task mapping methods tackle domains with moderate state-action space dimensions, such as maze tasks or tabular MDPs, where the goal can be reaching a target state with a minimal number of transitions. Accordingly, tt has been used to measure TL performance. For tasks with a limited and discrete state-action space, evaluation is also conducted with different numbers of initial states collected in the target domain (nka).

5.5 Representation Transfer

This section reviews approaches that transfer knowledge in the form of representations learned by deep neural networks. They are built upon the following consensual assumption:

Assumption. [Existence of Task-Invariance Subspace] The state space (S), action space (A), or the reward space (R) can be disentangled into orthogonal subspaces, which are task-invariant such that knowledge can be transferred between domains on the universal subspace.

We organize recent work along this line into two subtopics: 1) approaches that directly reuse representations from the source domain (Section 5.5.1), and 2) approaches that learn to disentangle the source domain representations into independent sub-feature representations, some of which lie on the universal feature space shared by both the source and the target domains (Section 5.5.2).

5.5.1 Reusing Representations

A representative work of reusing representations is [122], which proposed the progressive neural network structure to enable knowledge transfer across multiple RL tasks in a progressive way. A progressive network is composed of multiple columns, and each column is a policy network for one specific task. It starts with a single column for training the first task, and then the number of columns increases with the number of new tasks. While training on a new task, neuron weights on the previous columns are frozen, and representations from those frozen tasks are applied to the new column via a collateral connection to assist in learning the new task.

The progressive network comes at the cost of a large network structure, as the network grows proportionally with the number of incoming tasks. A later framework called PathNet alleviates this issue by learning a network of a fixed size [123]. PathNet contains pathways, which are subsets of neurons whose weights contain the knowledge of previous tasks and are frozen during training on new tasks. The population of pathways is evolved using a tournament selection genetic algorithm [124].

Another approach of reusing representations for TL is modular networks [52, 53, 125]. For example, [52] proposed to decompose the policy network into a task-specific module and an agent-specific module. Specifically, let π be a policy performed by any agent (robot) r over the task Mk as a function ϕ over states s; it can be decomposed into two sub-modules gk and fr, i.e.:

π(s) := ϕ(senv, sagent) = fr(gk(senv), sagent),

where fr is the agent-specific module and gk is the task-specific module. Their core idea is that the task-specific module can be applied to different agents performing the same task, which serves as the transferred knowledge. Accordingly, the agent-specific module can be applied to different tasks for the same agent.

A model-based approach along this line is [125], which learns a model to map the state observation s to a latent representation z. The transition probability is modeled on the latent space instead of the original state space, i.e. ẑ_{t+1} = fθ(z_t, a_t), where θ is the parameter of the transition model, z_t is the latent representation of the state observation, and a_t is the action accompanying that state. Next, a reward module learns the value function as well as the policy from the latent space z using an actor-critic framework. One potential benefit of this latent representation is that knowledge can be transferred across tasks that have different rewards but share the same transition dynamics.

5.5.2 Disentangling Representations

Methods discussed in this section mostly focus on learning a disentangled representation. Specifically, we elaborate on TL approaches that are derived from two techniques: Successor Representation (SR) and Universal Value Function Approximation (UVFA).

Successor Representation (SR) is an approach to decouple the state features of a domain from its reward distributions. It enables knowledge transfer across multiple domains M = {M1, M2, . . . , MK}, so long as the only difference among them is their reward distributions: Ri ≠ Rj. SR was originally derived from neuroscience, before [126] proposed to leverage it as a generalization mechanism for state representations in the RL domain.

Different from the V-value or Q-value that describes states as dependent on the reward function, SR features a state based on the occupancy measure of its successor states. Specifically, SR decomposes the value function of any policy into two independent components, ψ and a reward mapping w: V^π(s) = Σ_{s′} ψ(s, s′) w(s′), where w(s′) is a reward mapping function that maps states to scalar rewards, and ψ is the SR, which describes any state s by the occupancy measure of the states visited in the future when following π, with 1[S = s′] as an indicator function:

ψ(s, s′) = E_π [ Σ_{i=t}^{∞} γ^{i−t} 1[S_i = s′] | S_t = s ].

The successor nature of SR makes it learnable using any TD-learning algorithm.
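As a minimal illustration of how the successor representation defined above can be learned with a TD update (a sketch of the tabular setting, not code from [126]), the one-hot indicator of the current state plays the role that the reward plays in the usual TD rule:

```python
import numpy as np

def sr_td_update(psi, s, s_next, gamma=0.99, lr=0.1):
    """One tabular TD update of the successor representation:
    psi(s, .) <- psi(s, .) + lr * (1[s] + gamma * psi(s', .) - psi(s, .)),
    where 1[s] is a one-hot indicator of the current state (the i = t
    term in the definition above)."""
    indicator = np.zeros(psi.shape[0])
    indicator[s] = 1.0
    psi[s] += lr * (indicator + gamma * psi[s_next] - psi[s])
    return psi

# psi is an |S| x |S| matrix; given any reward vector w over states,
# the value function is recovered as V(s) = psi[s] @ w.
```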
Methods | RL framework | MDP difference | Mapping function | Usage of mapping | Evaluation metrics
[66] | SARSA | Ss ≠ St, As ≠ At | M(Qs) → Qt | Q-value reuse | ap, ar, tt, tr
[112] | Q-learning | As ≠ At, Rs ≠ Rt | M(Qs) → advice | Relative Q ranking | ap, ar, tr
[113] | − | Ss ≠ St | M(st) → r′ | Reward shaping | ap, ar, pe, tr
[114] | SARSA(λ) | Ss ≠ St, Rs ≠ Rt | M(st) → r′ | Reward shaping | ap, ar, pe, tt
[115] | Fitted Value Iteration | Ss ≠ St | M(ss) → st | Penalty loss on state deviation from expert policy | ap, ar, pe, tr
[121] | Fitted Q Iteration | Ss × As ≠ St × At | M(⟨ss, as, s′s⟩) → ⟨st, at, s′t⟩ | Reduce random exploration | ap, ar, pe, tr, nta
[120] | − | Ss × As ≠ St × At | M(⟨ss, as, s′s⟩) → ⟨st, at, s′t⟩ | Reduce random exploration | ap, ar, pe, tr, nta

TABLE 4: A comparison of inter-task mapping approaches. "−" indicates no RL framework constraints.
In particular, [126] proved the feasibility of learning such a representation in a tabular case, in which the state transitions can be described using a matrix. SR was later extended by [110] from three perspectives: (i) the feature domain of SR is extended from states to state-action pairs; (ii) deep neural networks are used as function approximators to represent the SR ψ^π(s, a) and the reward mapper w; (iii) the Generalized Policy Improvement (GPI) algorithm is introduced to accelerate policy transfer for multiple tasks (Section 5.3.2). These extensions, however, are built upon a stronger assumption about the MDP:

Assumption. [Linearity of Reward Distributions] The reward functions of all tasks can be computed as a linear combination of a fixed set of features: r(s, a, s′) = ϕ(s, a, s′)⊤ w, where ϕ(s, a, s′) ∈ R^d denotes the latent representation of the state transition, and w ∈ R^d is the task-specific reward mapper.

Based on this assumption, SR can be decoupled from the rewards when evaluating the Q-function of any policy π in a task. The advantage of SR is that, once the knowledge of ψ^π(s, a) in the source domain Ms is observed, one can quickly get the performance evaluation of the same policy in the target domain Mt by replacing ws with wt: Q^π_{Mt}(s, a) = ψ^π(s, a)⊤ wt. Similar ideas of learning SR with a TD-algorithm on a latent representation ϕ(s, a, s′) can also be found in [127, 128]. Specifically, the work of [127] was developed based on a weaker assumption about the reward function: instead of requiring linearly-decoupled rewards, the latent space ϕ(s, a, s′) is learned with an encoder-decoder structure to ensure that the information loss is minimized when mapping states to the latent space. This structure, therefore, comes with the extra cost of learning a decoder fd to reconstruct the state: fd(ϕ(st)) ≈ st.

An intriguing question faced by the SR approach is: Is there a way that evades the linearity assumption about reward functions and still enables learning the SR without extra modular cost? An extended work of SR [67] answered this question affirmatively, proving that the reward functions do not necessarily have to follow the linear structure, yet at the cost of a looser performance lower-bound when applying the GPI approach for policy improvement. Specifically, rather than learning a reward-agnostic latent feature ϕ(s, a, s′) ∈ R^d for multiple tasks, [67] aims to learn a matrix ϕ(s, a, s′) ∈ R^{D×d} that interprets the basis functions of the latent space instead, where D is the number of seen tasks. Assuming k out of the D tasks are linearly independent, this matrix forms k basis functions for the latent space. Therefore, for any unseen task Mi, its latent features can be built as a linear combination of these basis functions, as well as its reward functions ri(s, a, s′). Based on the idea of basis functions for a task's latent space, they proposed that ϕ(s, a, s′) can be approximated by learning R(s, a, s′) directly, where R(s, a, s′) ∈ R^D is a vector of the reward functions of all seen tasks:

R(s, a, s′) = [ r1(s, a, s′), r2(s, a, s′), . . . , rD(s, a, s′) ].

Accordingly, learning ψ(s, a) for any policy πi in Mi becomes equivalent to learning a collection of Q-functions:

ψ^{πi}(s, a) = [ Q_1^{πi}(s, a), Q_2^{πi}(s, a), . . . , Q_D^{πi}(s, a) ].

A similar idea of using reward functions as features to represent unseen tasks is also proposed by [129], which assumes that ψ and w are observable quantities from the environment.

Universal Value Function Approximation (UVFA) is an alternative approach to learning disentangled state representations [64]. Same as SR, UVFA allows TL for multiple tasks that differ only by their reward functions (goals). Different from SR, which focuses on learning a reward-agnostic state representation, UVFA aims to find a function approximator that is generalized for both states and goals. The UVFA framework is built on a specific problem setting of goal-conditional RL: task goals are defined in terms of states, e.g., given the state space S and the goal space G, it satisfies that G ⊆ S. One instantiation of this problem setting can be an agent exploring different locations in a maze, where the goals are described as certain locations inside the maze. Under this problem setting, a UVFA module can be decoupled into a state embedding ϕ(s) and a goal embedding ψ(g), by applying the technique of matrix factorization to a reward matrix describing the goal-conditional task.

One merit of UVFA resides in its transferable embedding ϕ(s) across tasks which only differ by goals. Another benefit is its ability of continual learning when the set of goals keeps expanding over time. On the other hand, a key challenge of UVFA is that applying the matrix factorization is time-consuming, which makes it a practical concern for complex environments with a large state space |S|. Even with the learned embedding networks, a third stage of fine-tuning these networks via end-to-end training is still necessary.

UVFA has been connected to SR by [67], in which a set of independent rewards (tasks) themselves can be used as features for state representations.
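To ground how successor features and GPI combine for transfer (a sketch of the recipe described above, under the linear-reward assumption; the array shapes are illustrative assumptions):

```python
import numpy as np

def gpi_with_successor_features(psi_per_policy, w_target):
    """Each psi_per_policy[i] holds the successor features psi_i(s, a) of
    source policy i at the current state, with shape (n_actions, d).
    Under r = phi(s, a, s')^T w, we have Q_i(s, a) = psi_i(s, a)^T w, so a
    new task is evaluated by simply swapping in its reward weights
    w_target; GPI then acts greedily over max_i Q_i(s, a)."""
    q = np.stack([psi @ w_target for psi in psi_per_policy])  # (n_policies, n_actions)
    return int(q.max(axis=0).argmax())

# Example: two source policies, three actions, two-dimensional features.
psi_1 = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
psi_2 = np.array([[0.2, 0.2], [0.9, 0.1], [0.1, 0.9]])
w_new = np.array([0.0, 1.0])  # the new task only rewards the second feature
print(gpi_with_successor_features([psi_1, psi_2], w_new))  # -> action 2
```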

output of a policy-encoding mapping z = e(π) : S × A → Rk . representation space for states using multiple tasks with dif-
Based on USFA, the Q-function of any policy π for a task ferent dynamics for better generalization [132]. Alternatively,
specified by w can be formularized as the product of a TL mechanisms from the supervised learning domain, such
reward-agnostic Universal Successor Feature (USF) ψ and a as meta-learning, which enables the ability of fast adaptation
reward mapper w: Q(s, a, w, z) = ψ(s, a, z)⊤ w. Facilitated to new tasks [133], or importance sampling [134], which can
by the disentangled rewards and policy generalization, [130] compensate for the prior distribution changes [10], may also
further introduced a generalized TD-error as a function over shed light on this question.
tasks w and policy z , which allows them to approximate the 6 A PPLICATIONS
Q-function of any policy on any task using a TD-algorithm. In this section we summarize recent applications that are
closely related to using TL techniques for tackling RL
5.5.3 Summary and Discussion
domains.
We provide a summary of the discussed work in this
Robotics learning is a prominent application domain of
section in Table 5. Representation transfer can facilitate TL
RL. TL approaches in this field include robotics learning from
in multiple ways based on assumptions about certain task-
demonstrations, where expert demonstrations from humans
invariant property. Some assume that tasks are different only
or other robots are leveraged [135] Another is collaborative
in terms of their reward distributions. Other stronger as-
robotic training [136, 137], in which knowledge from different
sumptions include (i) decoupled dynamics, rewards [110], or
robots is transferred by sharing their policies and episodic
policies [130] from the Q-function representations, and (ii) the
demonstrations. Recent research focus is this domain is fast
feasibility of defining tasks in terms of states [130]. Based on
and robust adaptation to unseen tasks. One example towards
those assumptions, approaches such as TD-algorithms [67]
this goal is [138], in which robust robotics policies are trained
or matrix-factorization [64] become applicable to learn
using synthetic demonstrations to handle dynamic environ-
such disentangled representations. To further exploit the
ments. Another solution is to learn domain-invariant latent
effectiveness of disentangled structure, we consider that
representations. Examples include [139], which learns the
generalization approaches, which allow changing dynamics or
latent representation using 3D CAD models, and [140, 141]
state distributions, are important future work that is worth
which are derived based on the Generative-Adversarial
more attention in this domain.
Network. Another example is DARLA [142], which is a zero-
Most discussed work in this section tackles multi-task RL shot transfer approach to learn disentangled representations
or meta-RL scenarios, hence the agent’s generalization ability that are robust against domain shifts. We refer readers to
is extensively investigated. For instance, methods of modular [70, 143] for detailed surveys along this direction.
networks largely evaluated the zero-shot performance from Game Playing is a common test-bed for TL and RL
the meta-RL perspective [52, 130]. Given a fixed number of algorithms. It has evolved from classical benchmarks such
training epochs (pe), Transfer ratio (tr) is manifested differently as grid-world games to more complex settings such as
among these methods. It can be the relative performance of online-strategy games or video games with multimodal
a modular net architecture compared with a baseline, or inputs. One example is AlphaGo, which is an algorithm
the accumulated return in modified target domains, where for learning the online chessboard games using both TL
reward scores are negated for evaluating the dynamics and RL techniques [90]. AlphaGo is first pre-trained offline
transfer. Performance sensitivity (ps) is also broadly studied to using expert demonstrations and then learns to optimize its
estimate the robustness of TL. [110] analyzed the performance policy using Monte-Carlo Tree Search. Its successor, AlphaGo
sensitivity given varying source tasks, while [130] studied Master [144], even beat the world’s first ranked human
the performance on different unseen target domains. player. TL-DRL approaches are also thriving in video game
There are unresolved questions in this intriguing research playing. Especially, OpenAI has trained Dota2 agents that
topic. One is how to handle drastic changes of reward can surpass human experts [145]. State-of-the-art platforms
functions between domains. As discussed in [131], good include MineCraft, Atari, and Starcraft. [146] designed new RL
policies in one MDP may perform poorly in another due to benchmarks under the MineCraft platform. [147] provided
the fact that beneficial states or actions in Ms may become a comprehensive survey on DL applications in video game
detrimental in Mt with totally different reward functions. playing, which also covers TL and RL strategies from certain
Learning a set of basis functions [67] to represent unseen perspectives. A large portion of TL approaches reviewed in
tasks (reward functions), or decoupling policies from Q- this survey have been applied to the Atari platforms [148].
function representation [130] may serve as a good start Natural Language Processing (NLP) has evolved rapidly
to address this issue, as they propose a generalized latent along with the advancement of DL and RL. Applications
space, from which different tasks (reward functions) can be of RL to NLP range widely, from Question Answering
interpreted. However, the limitation of this work is that it is (QA) [149], Dialogue systems [150], Machine Translation [151],
not clear how many and what kind of sub-tasks need to be to an integration of NLP and Computer Vision tasks, such as
learned to make the latent space generalizable enough. Visual Question Answering (VQA) [152], Image Caption [153],
Another question is how to generalize the representation etc. Many NLP applications have implicitly applied TL
learning for TL across domains with different dynamics or approaches. Examples include learning from expert demon-
state-action spaces. A learned SR might not be transferrable strations for Spoken Dialogue Systems [154], VQA [152]; or
to an MDP with different transition dynamics, as the distri- reward shaping for Sequence Generation [155], Spoken Dialogue
bution of occupancy measure for SR may no longer hold. Systems [80],QA [81, 156], and Image Caption [153], or trans-
Potential solutions may include model-based approaches ferring policies for Structured Prediction [157] and VQA [158].
that approximate the dynamics directly or training a latent
14
Methods | Representations format | Assumptions | MDP difference | Learner | Evaluation metrics
Progressive Net [122] | Lateral connections to previously learned network modules | N/A | S, A | A3C | ap, ar, pe, ps, tr
PathNet [123] | Selected neural paths | N/A | S, A | A3C | ap, ar, pe, tr
Modular Net [52] | Task(agent)-specific network module | Disentangled state representation | S, A | Policy Gradient | ap, ar, pe, tt
Modular Net [125] | Dynamic transitions module learned on state latent representations | N/A | S, A | A3C | ap, ar, pe, tr, ps
SR [110] | SF | Reward function can be linearly decoupled | R | DQN | ap, ar, nka, ps
SR [127] | Encoder-decoder learned SF | N/A | R | DQN | ap, ar, pe, ps
SR [67] | Encoder-decoder learned SF | Rewards can be represented by a set of basis functions | R | Q(λ) | ap, pe
UVFA [64] | Matrix-factorized UF | Goal-conditional RL | R | Tabular Q-learning | ap, ar, pe, ps
UVFA with SR [130] | Policy-encoded UF | Reward function can be linearly decoupled | R | ϵ-greedy Q-learning | ap, ar, pe

TABLE 5: A comparison of TL approaches of representation transfer.
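As a concrete reading of the UVFA entry in Table 5, the following sketch factorizes a table of goal-conditional values into state and goal embeddings; the use of a truncated SVD is an illustrative stand-in for the matrix-factorization step described in Section 5.5.2, not the exact procedure of [64].

```python
import numpy as np

def factorize_value_table(v_table, rank):
    """Approximate an |S| x |G| table of goal-conditional values by a
    low-rank product V(s, g) ~= phi(s)^T psi(g), via truncated SVD."""
    u, sing, vt = np.linalg.svd(v_table, full_matrices=False)
    phi = u[:, :rank] * np.sqrt(sing[:rank])       # state embeddings: |S| x rank
    psi = vt[:rank, :].T * np.sqrt(sing[:rank])    # goal embeddings:  |G| x rank
    return phi, psi

# The learned embeddings generalize the table as V_hat(s, g) = phi[s] @ psi[g],
# and phi can be reused for new goals that only change the reward (goal) side.
```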
Large Model Training: RL from human and model-assisted feedback has become indispensable in training large models such as GPT-4 [159], Sparrow [160], PaLM [161], and LaMDA [162], which have shown tremendous breakthroughs in dialogue applications, search-engine answer improvement, artwork generation, etc. The TL method at the core of these systems is using human preferences as a reward signal for model fine-tuning, where the preference ranking itself is considered as a shaped reward. We believe that TL with carefully crafted human knowledge will help better align large models with human intent and hence achieve trustworthy and de-biased AI.

Health Informatics: RL has been applied to various healthcare tasks [163], including automatic medical diagnosis [164, 165], health resource scheduling [166], and drug discovery and development [167, 168], etc. Among these applications we observe emerging trends of leveraging prior knowledge to improve the RL procedure, especially given the difficulty of accessing large amounts of clinical data. Specifically, [169] utilized Q-learning for drug delivery individualization. They integrated the prior knowledge of the dose-response characteristics into their Q-learning framework to avoid random exploration. [170] applied a DQN framework for prescribing effective HIV treatments, in which they learned a latent representation to estimate the uncertainty when transferring a pretrained policy to unseen domains. [171] studied applying human-involved interactive RL training for health informatics.

Others: RL has also been utilized in many other real-life applications. Applications in Transportation Systems have adopted RL to address traffic congestion issues with better traffic signal scheduling and transportation resource allocation [8, 9, 172, 173]. We refer readers to [174] for a review along this line. Deep RL also provides effective solutions to problems in Finance, including portfolio management [175, 176], asset allocation [177], and trading optimization [178]. Another application is Electricity Systems, especially intelligent electricity networks, which can benefit from RL techniques for improved power-delivery decisions [179, 180] and active resource management [181]. [7] provides a detailed survey of RL techniques for electric power system applications.

7 FUTURE PERSPECTIVES

In this section, we present some open challenges and future directions in TL that are closely related to the DRL domain, based on both retrospectives of the methods discussed in this survey and outlooks on the emerging trends of AI.

Transfer Learning from Black-Box: Ranging from exterior teacher demonstrations to pre-trained function approximators, black-box resources are more accessible and predominant than well-articulated knowledge. Therefore, leveraging such black-box resources is indispensable for practical TL in DRL. One of the main challenges resides in estimating the optimality of black-box resources, which can be potentially noisy or biased. We consider that efforts can be made from the following perspectives:
1) Inferring the reasoning mechanism inside the black-box. Similar ideas have been explored in inverse RL and model-based RL, where the goal is to approximate the reward function or to learn the dynamics model under which the demonstrated knowledge becomes reasonable.
2) Designing effective feedback schemes, including leveraging domain-provided rewards, intrinsic reward feedback, or using human preference as feedback.
3) Improving the interpretability of the transferred knowledge [182, 183], which benefits evaluating and explaining the process of TL from a black-box. It can also alleviate catastrophic decision-making for high-stakes tasks such as autonomous driving.
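To ground the idea of using human preference as feedback (item 2 above, and the recipe behind the large-model fine-tuning mentioned in Section 6), the following is a minimal, illustrative sketch of a Bradley-Terry style preference loss for fitting a reward model; the scalar scores are assumed to come from a user-supplied reward network.

```python
import numpy as np

def preference_loss(r_preferred, r_rejected):
    """Negative log-likelihood of a pairwise human preference under a
    Bradley-Terry model: P(preferred beats rejected) =
    sigmoid(r_preferred - r_rejected)."""
    return float(np.log1p(np.exp(-(r_preferred - r_rejected))))

# If the reward model currently scores the preferred answer lower than the
# rejected one, the loss is large and its gradient pushes that score up.
print(preference_loss(r_preferred=0.2, r_rejected=1.0))  # ~1.17
```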
Knowledge Disentanglement and Fusion are both towards better knowledge sharing across domains. Disentangling knowledge is usually a prerequisite for efficient knowledge fusion, which may involve external knowledge from multiple source domains, with diverging qualities, presented in different modalities, etc. Disentangling knowledge in RL can be interpreted from different perspectives: 1) disentangling the action, state, or reward representations, as discussed in Section 5.5; 2) decomposing complex tasks into multiple skill snippets. The former is an effective direction for tackling meta-RL and multi-task RL, although some solutions hinge on strict assumptions about the problem setting, such as linear dependence among domain dynamics or learning goals. The latter is relevant to hierarchical RL and prototype learning from sequence data [184]. It has been relatively less discussed, aside from a few pioneering studies [132]. We believe that this direction is worth more research effort, as it not only benefits interpretable knowledge learning, but also aligns with human perception.

Framework-Agnostic Knowledge Transfer: Most contemporary TL approaches are designed for certain RL frameworks. Some are applicable to RL algorithms designed for a discrete action space, while others may only be feasible given a continuous action space. One fundamental reason behind this is the diversified development of RL algorithms. We expect that unified RL frameworks would contribute to the standardization of TL approaches in this field.

Evaluation and Benchmarking: Various evaluation metrics have been proposed to measure TL from different but complementary perspectives, although no single metric can summarize the efficacy of a TL approach. Designing a set of generalized, novel metrics would be beneficial for the development of TL in DRL. Moreover, with the rapid development of large-scale models, it is crucial to standardize evaluation from the perspectives of ethics and groundedness. The appropriateness of the transferred knowledge, such as potential stereotypes in human preferences, and the bias in the model itself should also be quantified as metrics.

Knowledge Transfer to and from Pre-Trained Large Models: By the time of this survey being finalized, unprecedented DL breakthroughs have been achieved in learning large-scale models built on massive computation resources and attributed data. One representative example is the Generative Pre-trained Transformer (GPT) [159]. Considering them as complete knowledge graphs whose training process may be inaccessible, there are more challenges in this direction besides learning from a black-box, which are faced by a larger AI community including the RL domain. We briefly point out two directions that are worth ongoing attention:
1) Efficient model fine-tuning with knowledge distillation. One important method for fine-tuning large models is RL with human feedback, in which the quantity and quality of human ratings are critical for realizing a good reward model. We anticipate other forms of TL methods in RL to be explored to further improve the efficiency of fine-tuning, such as imitation learning with adversarial training to achieve human-level performance.
2) Principled prompt engineering for knowledge extraction. More often the large model itself cannot be accessed, and only the inputs and outputs of the model are available. Such inference-based knowledge extraction requires delicate prompt designs. Some effective efforts include designing prompts with small task examples for one-shot learning, and decomposing complex tasks into architectural, contextual prompts. Prompt engineering is proving to be an important direction for effective knowledge extraction, which, with proper design, can largely benefit downstream tasks that depend on large-model resources.

8 ACKNOWLEDGEMENTS

This research was supported by the National Science Foundation (IIS-2212174, IIS-1749940), the National Institute of Aging (IRF1AG072449), and the Office of Naval Research (N00014-20-1-2382).

REFERENCES

[1] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[2] K. Arulkumaran, M. P. Deisenroth, M. Brundage, and A. A. Bharath, "A brief survey of deep reinforcement learning," arXiv preprint arXiv:1708.05866, 2017.
[3] S. Levine, C. Finn, T. Darrell, and P. Abbeel, "End-to-end training of deep visuomotor policies," The Journal of Machine Learning Research, 2016.
[4] S. Levine, P. Pastor, A. Krizhevsky, J. Ibarz, and D. Quillen, "Learning hand-eye coordination for robotic grasping with deep learning and large-scale data collection," The International Journal of Robotics Research, 2018.
[5] M. G. Bellemare, Y. Naddaf, J. Veness, and M. Bowling, "The arcade learning environment: An evaluation platform for general agents," Journal of Artificial Intelligence Research, 2013.
[6] M. R. Kosorok and E. E. Moodie, Adaptive Treatment Strategies in Practice: Planning Trials and Analyzing Data for Personalized Medicine. SIAM, 2015.
[7] M. Glavic, R. Fonteneau, and D. Ernst, "Reinforcement learning for electric power system decision and control: Past considerations and perspectives," IFAC-PapersOnLine, 2017.
[8] S. El-Tantawy, B. Abdulhai, and H. Abdelgawad, "Multiagent reinforcement learning for integrated network of adaptive traffic signal controllers (marlin-atsc): methodology and large-scale application on downtown toronto," IEEE Transactions on Intelligent Transportation Systems, 2013.
[9] H. Wei, G. Zheng, H. Yao, and Z. Li, "Intellilight: A reinforcement learning approach for intelligent traffic light control," ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2018.
[10] S. J. Pan and Q. Yang, "A survey on transfer learning," IEEE Transactions on Knowledge and Data Engineering, 2009.
[11] M. E. Taylor and P. Stone, "Transfer learning for reinforcement learning domains: A survey," Journal of Machine Learning Research, 2009.
[12] A. Lazaric, "Transfer in reinforcement learning: a framework and a survey." Springer, 2012.
[13] R. Bellman, "A markovian decision process," Journal of Mathematics and Mechanics, 1957.
[14] M. G. Bellemare, W. Dabney, and R. Munos, "A distributional perspective on reinforcement learning," in International Conference on Machine Learning. PMLR, 2017, pp. 449–458.
[15] M. Liu, M. Zhu, and W. Zhang, "Goal-conditioned reinforcement learning: Problems and solutions," arXiv preprint arXiv:2201.08299, 2022.
[16] C. Florensa, D. Held, X. Geng, and P. Abbeel, "Automatic goal generation for reinforcement learning agents," in International Conference on Machine Learning. PMLR, 2018, pp. 1515–1528.
[17] Z. Xu and A. Tewari, "Reinforcement learning in factored mdps: Oracle-efficient algorithms and tighter regret bounds for the non-episodic setting," NeurIPS, vol. 33, pp. 18226–18236, 2020.

[18] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, [35] Z. I. Botev, D. P. Kroese, R. Y. Rubinstein, and
and Y. Wu, “The surprising effectiveness of ppo in P. L’Ecuyer, “The cross-entropy method for optimiza-
cooperative multi-agent games,” NeurIPS, vol. 35, pp. tion,” in Handbook of statistics. Elsevier, 2013, vol. 31,
24 611–24 624, 2022. pp. 35–59.
[19] I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and [36] K. Chua, R. Calandra, R. McAllister, and S. Levine,
J. Tompson, “Discriminator-actor-critic: Addressing “Deep reinforcement learning in a handful of trials
sample inefficiency and reward bias in adversarial using probabilistic dynamics models,” NeurIPS, vol. 31,
imitation learning,” arXiv preprint arXiv:1809.02925, 2018.
2018. [37] R. S. Sutton, “Integrated architectures for learning,
[20] G. A. Rummery and M. Niranjan, On-line Q-learning planning, and reacting based on approximating dy-
using connectionist systems. University of Cambridge, namic programming,” in Machine learning proceedings
Department of Engineering Cambridge, England, 1994. 1990. Elsevier, 1990, pp. 216–224.
[21] H. Van Seijen, H. Van Hasselt, S. Whiteson, and [38] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gon-
M. Wiering, “A theoretical and empirical analysis of zalez, and S. Levine, “Model-based value estimation
expected sarsa,” IEEE Symposium on Adaptive Dynamic for efficient model-free reinforcement learning,” arXiv
Programming and Reinforcement Learning, 2009. preprint arXiv:1803.00101, 2018.
[22] V. Konda and J. Tsitsiklis, “Actor-critic algorithms,” [39] S. Levine and V. Koltun, “Guided policy search,” in
NeurIPS, 2000. International conference on machine learning. PMLR,
[23] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, 2013, pp. 1–9.
T. Harley, D. Silver, and K. Kavukcuoglu, “Asyn- [40] H. Bharadhwaj, K. Xie, and F. Shkurti, “Model-
chronous methods for deep reinforcement learning,” predictive control via cross-entropy and gradient-based
ICML, 2016. optimization,” in Learning for Dynamics and Control.
[24] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft PMLR, 2020, pp. 277–286.
actor-critic: Off-policy maximum entropy deep rein- [41] M. Deisenroth and C. E. Rasmussen, “Pilco: A model-
forcement learning with a stochastic actor,” Interna- based and data-efficient approach to policy search,” in
tional Conference on Machine Learning, 2018. Proceedings of the 28th International Conference on machine
[25] C. J. Watkins and P. Dayan, “Q-learning,” Machine learning (ICML-11), 2011, pp. 465–472.
learning, 1992. [42] Y. Gal, R. McAllister, and C. E. Rasmussen, “Improving
[26] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, pilco with bayesian neural network dynamics models,”
J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, in Data-efficient machine learning workshop, ICML, vol. 4,
A. K. Fidjeland, G. Ostrovski et al., “Human-level no. 34, 2016, p. 25.
control through deep reinforcement learning,” Nature, [43] C. H. Lampert, H. Nickisch, and S. Harmeling, “Learn-
2015. ing to detect unseen object classes by between-class
[27] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, attribute transfer,” IEEE Conference on Computer Vision
G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and Pattern Recognition, 2009.
and D. Silver, “Rainbow: Combining improvements in [44] P. Dayan and G. E. Hinton, “Feudal reinforcement
deep reinforcement learning,” AAAI, 2018. learning,” NeurIPS, 1993.
[28] R. J. Williams, “Simple statistical gradient-following [45] R. S. Sutton, D. Precup, and S. Singh, “Between mdps
algorithms for connectionist reinforcement learning,” and semi-mdps: A framework for temporal abstraction
Machine learning, 1992. in reinforcement learning,” Artificial intelligence, 1999.
[29] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and [46] R. Parr and S. J. Russell, “Reinforcement learning with
P. Moritz, “Trust region policy optimization,” ICML, hierarchies of machines,” NeurIPS, 1998.
2015. [47] T. G. Dietterich, “Hierarchical reinforcement learning
[30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and with the maxq value function decomposition,” Journal
O. Klimov, “Proximal policy optimization algorithms,” of artificial intelligence research, 2000.
arXiv preprint arXiv:1707.06347, 2017. [48] A. Lazaric and M. Ghavamzadeh, “Bayesian multi-
[31] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, task reinforcement learning,” in ICML-27th international
and M. Riedmiller, “Deterministic policy gradient conference on machine learning. Omnipress, 2010, pp.
algorithms,” 2014. 599–606.
[32] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, [49] Y. Zhang and Q. Yang, “A survey on multi-task
Y. Tassa, D. Silver, and D. Wierstra, “Continuous con- learning,” IEEE Transactions on Knowledge and Data
trol with deep reinforcement learning,” arXiv preprint Engineering, vol. 34, no. 12, pp. 5586–5609, 2021.
arXiv:1509.02971, 2015. [50] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirk-
[33] S. Fujimoto, H. Van Hoof, and D. Meger, “Addressing patrick, R. Hadsell, N. Heess, and R. Pascanu, “Distral:
function approximation error in actor-critic methods,” Robust multitask reinforcement learning,” NeurIPS,
arXiv preprint arXiv:1802.09477, 2018. 2017.
[34] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, [51] E. Parisotto, J. L. Ba, and R. Salakhutdinov, “Actor-
“Neural network dynamics for model-based deep mimic: Deep multitask and transfer reinforcement
reinforcement learning with model-free fine-tuning,” learning,” ICLR, 2016.
in 2018 IEEE international conference on robotics and [52] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine,
automation (ICRA). IEEE, 2018, pp. 7559–7566. “Learning modular neural network policies for multi-
task and multi-robot transfer,” 2017 IEEE International [71] M. Muller-Brockhausen, M. Preuss, and A. Plaat,
Conference on Robotics and Automation (ICRA), 2017. “Procedural content generation: Better benchmarks
[53] J. Andreas, D. Klein, and S. Levine, “Modular multitask for transfer reinforcement learning,” in 2021 IEEE
reinforcement learning with policy sketches,” ICML, Conference on games (CoG). IEEE, 2021, pp. 01–08.
2017. [72] N. Vithayathil Varghese and Q. H. Mahmoud, “A
[54] R. Yang, H. Xu, Y. Wu, and X. Wang, “Multi-task rein- survey of multi-task deep reinforcement learning,”
forcement learning with soft modularization,” NeurIPS, Electronics, vol. 9, no. 9, p. 1363, 2020.
vol. 33, pp. 4767–4777, 2020. [73] R. J. Williams and L. C. Baird, “Tight performance
[55] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, bounds on greedy policies based on imperfect value
“Meta-learning in neural networks: A survey,” IEEE functions,” Tech. Rep., 1993.
transactions on pattern analysis and machine intelligence, [74] E. Wiewiora, G. W. Cottrell, and C. Elkan, “Principled
vol. 44, no. 9, pp. 5149–5169, 2021. methods for advising reinforcement learning agents,”
[56] Z. Jia, X. Li, Z. Ling, S. Liu, Y. Wu, and H. Su, “Im- ICML, 2003.
proving policy optimization with generalist-specialist [75] S. M. Devlin and D. Kudenko, “Dynamic potential-
learning,” in International Conference on Machine Learn- based reward shaping,” ICAAMAS, 2012.
ing. PMLR, 2022, pp. 10 104–10 119. [76] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowé,
[57] W. Ding, H. Lin, B. Li, and D. Zhao, “Generalizing goal- “Expressing arbitrary reward functions as potential-
conditioned reinforcement learning with variational based advice,” AAAI, 2015.
causal reasoning,” arXiv preprint arXiv:2207.09081, [77] T. Brys, A. Harutyunyan, M. E. Taylor, and A. Nowé,
2022. “Policy transfer using reward shaping,” ICAAMS, 2015.
[58] R. Kirk, A. Zhang, E. Grefenstette, and T. Rocktäschel, [78] M. Večerı́k, T. Hester, J. Scholz, F. Wang, O. Pietquin,
“A survey of zero-shot generalisation in deep reinforce- B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Ried-
ment learning,” Journal of Artificial Intelligence Research, miller, “Leveraging demonstrations for deep rein-
vol. 76, pp. 201–264, 2023. forcement learning on robotics problems with sparse
[59] B. Kim, A.-m. Farahmand, J. Pineau, and D. Pre- rewards,” arXiv preprint arXiv:1707.08817, 2017.
cup, “Learning from limited demonstrations,” NeurIPS, [79] A. C. Tenorio-Gonzalez, E. F. Morales, and L. Vil-
2013. laseñor-Pineda, “Dynamic reward shaping: Training
[60] W. Czarnecki, R. Pascanu, S. Osindero, S. Jayakumar, a robot by voice,” Advances in Artificial Intelligence –
G. Swirszcz, and M. Jaderberg, “Distilling policy dis- IBERAMIA, 2010.
tillation,” The 22nd International Conference on Artificial [80] P.-H. Su, D. Vandyke, M. Gasic, N. Mrksic, T.-H.
Intelligence and Statistics, 2019. Wen, and S. Young, “Reward shaping with recur-
[61] A. Y. Ng, D. Harada, and S. Russell, “Policy invariance rent neural networks for speeding up on-line policy
under reward transformations: Theory and application learning in spoken dialogue systems,” arXiv preprint
to reward shaping,” ICML, 1999. arXiv:1508.03391, 2015.
[62] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, [81] X. V. Lin, R. Socher, and C. Xiong, “Multi-hop knowl-
D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, edge graph reasoning with reward shaping,” arXiv
“Generative adversarial nets,” NeurIPS, pp. 2672–2680, preprint arXiv:1808.10568, 2018.
2014. [82] S. Devlin, L. Yliniemi, D. Kudenko, and K. Tumer,
[63] Z. Zhu, K. Lin, B. Dai, and J. Zhou, “Learning sparse “Potential-based difference rewards for multiagent
rewarded tasks from sub-optimal demonstrations,” reinforcement learning,” ICAAMS, 2014.
arXiv preprint arXiv:2004.00530, 2020. [83] M. Grzes and D. Kudenko, “Learning shaping rewards
[64] T. Schaul, D. Horgan, K. Gregor, and D. Silver, “Uni- in model-based reinforcement learning,” Proc. AAMAS
versal value function approximators,” ICML, 2015. Workshop on Adaptive Learning Agents, 2009.
[65] C. Finn and S. Levine, “Meta-learning: from few-shot [84] O. Marom and B. Rosman, “Belief reward shaping in
learning to rapid reinforcement learning,” ICML, 2019. reinforcement learning,” AAAI, 2018.
[66] M. E. Taylor, P. Stone, and Y. Liu, “Transfer learning via [85] F. Liu, Z. Ling, T. Mu, and H. Su, “State
inter-task mappings for temporal difference learning,” alignment-based imitation learning,” arXiv preprint
Journal of Machine Learning Research, 2007. arXiv:1911.10947, 2019.
[67] A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, [86] K. Kim, Y. Gu, J. Song, S. Zhao, and S. Ermon, “Domain
M. Hessel, D. Mankowitz, A. Žı́dek, and R. Munos, adaptive imitation learning,” ICML, 2020.
“Transfer in deep reinforcement learning using suc- [87] Y. Ma, Y.-X. Wang, and B. Narayanaswamy, “Imitation-
cessor features and generalised policy improvement,” regularized offline learning,” International Conference
ICML, 2018. on Artificial Intelligence and Statistics, 2019.
[68] Z. Zhu, K. Lin, B. Dai, and J. Zhou, “Off-policy [88] M. Yang and O. Nachum, “Representation matters:
imitation learning from observations,” NeurIPS, 2020. Offline pretraining for sequential decision making,”
[69] J. Ho and S. Ermon, “Generative adversarial imitation arXiv preprint arXiv:2102.05815, 2021.
learning,” NeurIPS, 2016. [89] X. Zhang and H. Ma, “Pretraining deep actor-critic
[70] W. Zhao, J. P. Queralta, and T. Westerlund, “Sim-to-real reinforcement learning algorithms with expert demon-
transfer in deep reinforcement learning for robotics: a strations,” arXiv preprint arXiv:1801.10459, 2018.
survey,” in 2020 IEEE symposium series on computational [90] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre,
intelligence (SSCI). IEEE, 2020, pp. 737–744. G. Van Den Driessche, J. Schrittwieser, I. Antonoglou,
V. Panneershelvam, M. Lanctot et al., “Mastering the [109] F. Fernández and M. Veloso, “Probabilistic policy reuse
game of go with deep neural networks and tree search,” in a reinforcement learning agent,” Proceedings of the
Nature, 2016. fifth international joint conference on Autonomous agents
[91] S. Schaal, “Learning from demonstration,” NeurIPS, and multiagent systems, 2006.
1997. [110] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul,
[92] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, H. P. van Hasselt, and D. Silver, “Successor features for
B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband transfer in reinforcement learning,” NuerIPS, 2017.
et al., “Deep q-learning from demonstrations,” AAAI, [111] R. Bellman, “Dynamic programming,” Science, 1966.
2018. [112] L. Torrey, T. Walker, J. Shavlik, and R. Maclin, “Using
[93] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, advice to transfer knowledge acquired in one reinforce-
and P. Abbeel, “Overcoming exploration in reinforce- ment learning task to another,” European Conference on
ment learning with demonstrations,” IEEE International Machine Learning, 2005.
Conference on Robotics and Automation (ICRA), 2018. [113] A. Gupta, C. Devin, Y. Liu, P. Abbeel, and S. Levine,
[94] J. Chemali and A. Lazaric, “Direct policy iteration “Learning invariant feature spaces to transfer skills
with demonstrations,” International Joint Conference on with reinforcement learning,” ICLR, 2017.
Artificial Intelligence, 2015. [114] G. Konidaris and A. Barto, “Autonomous shaping:
[95] B. Piot, M. Geist, and O. Pietquin, “Boosted bellman Knowledge transfer in reinforcement learning,” ICML,
residual minimization handling expert demonstra- 2006.
tions,” Joint European Conference on Machine Learning [115] H. B. Ammar and M. E. Taylor, “Reinforcement learn-
and Knowledge Discovery in Databases, 2014. ing transfer via common subspaces,” Proceedings of the
[96] T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. 11th International Conference on Adaptive and Learning
Taylor, and A. Nowé, “Reinforcement learning from Agents, 2012.
demonstration through shaping,” International Joint [116] V. Badrinarayanan, A. Kendall, and R. Cipolla, “Segnet:
Conference on Artificial Intelligence, 2015. A deep convolutional encoder-decoder architecture
[97] B. Kang, Z. Jie, and J. Feng, “Policy optimization with for image segmentation,” IEEE transactions on pattern
demonstrations,” ICML, 2018. analysis and machine intelligence, 2017.
[98] D. P. Bertsekas, “Approximate policy iteration: A [117] C. Wang and S. Mahadevan, “Manifold alignment
survey and some new methods,” Journal of Control without correspondence,” International Joint Conference
Theory and Applications, 2011. on Artificial Intelligence, 2009.
[99] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, [118] B. Bocsi, L. Csató, and J. Peters, “Alignment-based
“Prioritized experience replay,” ICLR, 2016. transfer learning for robot models,” The 2013 Interna-
[100] S. Ross, G. Gordon, and D. Bagnell, “A reduction of tional Joint Conference on Neural Networks (IJCNN), 2013.
imitation learning and structured prediction to no- [119] H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor,
regret online learning,” AISTATS, 2011. “Unsupervised cross-domain transfer in policy gradient
[101] Y. Gao, J. Lin, F. Yu, S. Levine, T. Darrell et al., “Re- reinforcement learning via manifold alignment,” AAAI,
inforcement learning from imperfect demonstrations,” 2015.
arXiv preprint arXiv:1802.05313, 2018. [120] H. B. Ammar, K. Tuyls, M. E. Taylor, K. Driessens, and
[102] M. Jing, X. Ma, W. Huang, F. Sun, C. Yang, B. Fang, G. Weiss, “Reinforcement learning transfer via sparse
and H. Liu, “Reinforcement learning from imperfect coding,” ICAAMS, 2012.
demonstrations under soft expert guidance.” AAAI, [121] A. Lazaric, M. Restelli, and A. Bonarini, “Transfer of
2020. samples in batch reinforcement learning,” ICML, 2008.
[103] K. Brantley, W. Sun, and M. Henaff, “Disagreement- [122] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer,
regularized imitation learning,” ICLR, 2019. J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and
[104] G. Hinton, O. Vinyals, and J. Dean, “Distilling the R. Hadsell, “Progressive neural networks,” arXiv
knowledge in a neural network,” Deep Learning and preprint arXiv:1606.04671, 2016.
Representation Learning Workshop, NeurIPS, 2014. [123] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha,
[105] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, A. A. Rusu, A. Pritzel, and D. Wierstra, “Pathnet:
G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, Evolution channels gradient descent in super neural
K. Kavukcuoglu, and R. Hadsell, “Policy distillation,” networks,” arXiv preprint arXiv:1701.08734, 2017.
arXiv preprint arXiv:1511.06295, 2015. [124] I. Harvey, “The microbial genetic algorithm,” European
[106] H. Yin and S. J. Pan, “Knowledge transfer for deep Conference on Artificial Life, 2009.
reinforcement learning with hierarchical experience [125] A. Zhang, H. Satija, and J. Pineau, “Decoupling dy-
replay,” AAAI, 2017. namics and reward for transfer learning,” arXiv preprint
[107] S. Schmitt, J. J. Hudson, A. Zidek, S. Osindero, C. Do- arXiv:1804.10689, 2018.
ersch, W. M. Czarnecki, J. Z. Leibo, H. Kuttler, A. Zis- [126] P. Dayan, “Improving generalization for temporal dif-
serman, K. Simonyan et al., “Kickstarting deep rein- ference learning: The successor representation,” Neural
forcement learning,” arXiv preprint arXiv:1803.03835, Computation, 1993.
2018. [127] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gersh-
[108] J. Schulman, X. Chen, and P. Abbeel, “Equivalence man, “Deep successor reinforcement learning,” arXiv
between policy gradients and soft q-learning,” arXiv preprint arXiv:1606.02396, 2016.
preprint arXiv:1704.06440, 2017. [128] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Bur-
gard, “Deep reinforcement learning with successor features for navigation across similar environments,” IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[129] N. Mehta, S. Natarajan, P. Tadepalli, and A. Fern, “Transfer in variable-reward hierarchical reinforcement learning,” Machine Learning, 2008.
[130] D. Borsa, A. Barreto, J. Quan, D. Mankowitz, R. Munos, H. van Hasselt, D. Silver, and T. Schaul, “Universal successor features approximators,” ICLR, 2019.
[131] L. Lehnert, S. Tellex, and M. L. Littman, “Advantages and limitations of using successor features for transfer in reinforcement learning,” arXiv preprint arXiv:1708.00102, 2017.
[132] J. C. Petangoda, S. Pascual-Diaz, V. Adam, P. Vrancx, and J. Grau-Moya, “Disentangled skill embeddings for reinforcement learning,” arXiv preprint arXiv:1906.09223, 2019.
[133] C. Finn, P. Abbeel, and S. Levine, “Model-agnostic meta-learning for fast adaptation of deep networks,” ICML, 2017.
[134] B. Zadrozny, “Learning and evaluating classifiers under sample selection bias,” ICML, 2004.
[135] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, “A survey of robot learning from demonstration,” Robotics and Autonomous Systems, 2009.
[136] B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, “A survey of research on cloud robotics and automation,” IEEE Transactions on Automation Science and Engineering, 2015.
[137] S. Gu, E. Holly, T. Lillicrap, and S. Levine, “Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates,” IEEE International Conference on Robotics and Automation (ICRA), 2017.
[138] W. Yu, J. Tan, C. K. Liu, and G. Turk, “Preparing for the unknown: Learning a universal policy with online system identification,” arXiv preprint arXiv:1702.02453, 2017.
[139] F. Sadeghi and S. Levine, “Cad2rl: Real single-image flight without a single real image,” arXiv preprint arXiv:1611.04201, 2016.
[140] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., “Using simulation and domain adaptation to improve efficiency of deep robotic grasping,” IEEE International Conference on Robotics and Automation (ICRA), 2018.
[141] H. Bharadhwaj, Z. Wang, Y. Bengio, and L. Paull, “A data-efficient framework for training and sim-to-real transfer of navigation policies,” International Conference on Robotics and Automation (ICRA), 2019.
[142] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, “Darla: Improving zero-shot transfer in reinforcement learning,” ICML, 2017.
[143] J. Kober, J. A. Bagnell, and J. Peters, “Reinforcement learning in robotics: A survey,” The International Journal of Robotics Research, 2013.
[144] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., “Mastering the game of go without human knowledge,” Nature, 2017.
[145] OpenAI. (2019) Dota2 blog. [Online]. Available: https://openai.com/blog/openai-five/
[146] J. Oh, V. Chockalingam, S. Singh, and H. Lee, “Control of memory, active perception, and action in minecraft,” arXiv preprint arXiv:1605.09128, 2016.
[147] N. Justesen, P. Bontrager, J. Togelius, and S. Risi, “Deep learning for video game playing,” IEEE Transactions on Games, 2019.
[148] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, “Playing atari with deep reinforcement learning,” arXiv preprint arXiv:1312.5602, 2013.
[149] H. Chen, X. Liu, D. Yin, and J. Tang, “A survey on dialogue systems: Recent advances and new frontiers,” ACM SIGKDD Explorations Newsletter, 2017.
[150] S. P. Singh, M. J. Kearns, D. J. Litman, and M. A. Walker, “Reinforcement learning for spoken dialogue systems,” NeurIPS, 2000.
[151] B. Zoph and Q. V. Le, “Neural architecture search with reinforcement learning,” arXiv preprint arXiv:1611.01578, 2016.
[152] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, “Learning to reason: End-to-end module networks for visual question answering,” IEEE International Conference on Computer Vision, 2017.
[153] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, “Deep reinforcement learning-based image captioning with embedding reward,” IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[154] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, “Learning to compose neural networks for question answering,” arXiv preprint arXiv:1601.01705, 2016.
[155] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio, “An actor-critic algorithm for sequence prediction,” arXiv preprint arXiv:1607.07086, 2016.
[156] F. Godin, A. Kumar, and A. Mittal, “Learning when not to answer: a ternary reward structure for reinforcement learning based question answering,” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
[157] K.-W. Chang, A. Krishnamurthy, A. Agarwal, J. Langford, and H. Daumé III, “Learning to search better than your teacher,” 2015.
[158] J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra, “Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model,” NeurIPS, 2017.
[159] OpenAI, “Gpt-4 technical report,” arXiv, 2023.
[160] A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker et al., “Improving alignment of dialogue agents via targeted human judgements,” arXiv preprint arXiv:2209.14375, 2022.
[161] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., “Palm: Scaling language modeling with pathways,” arXiv preprint arXiv:2204.02311, 2022.
[162] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., “Lamda: Language models for dialog applications,” arXiv preprint arXiv:2201.08239, 2022.
[163] C. Yu, J. Liu, and S. Nemati, “Reinforcement learning in healthcare: A survey,” arXiv preprint arXiv:1908.08796, 2019.
[164] A. Alansary, O. Oktay, Y. Li, L. Le Folgoc, B. Hou, G. Vaillant, K. Kamnitsas, A. Vlontzos, B. Glocker, B. Kainz et al., “Evaluating reinforcement learning agents for anatomical landmark detection,” 2019.
[165] K. Ma, J. Wang, V. Singh, B. Tamersoy, Y.-J. Chang, A. Wimmer, and T. Chen, “Multimodal image registration with deep context reinforcement learning,” International Conference on Medical Image Computing and Computer-Assisted Intervention, 2017.
[166] T. S. M. T. Gomes, “Reinforcement learning for primary care e appointment scheduling,” 2017.
[167] A. Serrano, B. Imbernón, H. Pérez-Sánchez, J. M. Cecilia, A. Bueno-Crespo, and J. L. Abellán, “Accelerating drugs discovery with deep reinforcement learning: An early approach,” International Conference on Parallel Processing Companion, 2018.
[168] M. Popova, O. Isayev, and A. Tropsha, “Deep reinforcement learning for de novo drug design,” Science Advances, 2018.
[169] A. E. Gaweda, M. K. Muezzinoglu, G. R. Aronoff, A. A. Jacobs, J. M. Zurada, and M. E. Brier, “Incorporating prior knowledge into q-learning for drug delivery individualization,” Fourth International Conference on Machine Learning and Applications, 2005.
[170] T. W. Killian, S. Daulton, G. Konidaris, and F. Doshi-Velez, “Robust and efficient transfer learning with hidden parameter markov decision processes,” NeurIPS, 2017.
[171] A. Holzinger, “Interactive machine learning for health informatics: when do we need the human-in-the-loop?” Brain Informatics, 2016.
[172] L. Li, Y. Lv, and F.-Y. Wang, “Traffic signal timing via deep reinforcement learning,” IEEE/CAA Journal of Automatica Sinica, 2016.
[173] K. Lin, R. Zhao, Z. Xu, and J. Zhou, “Efficient large-scale fleet management via multi-agent deep reinforcement learning,” ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
[174] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, “A survey on reinforcement learning models and algorithms for traffic signal control,” ACM Computing Surveys (CSUR), 2017.
[175] J. Moody, L. Wu, Y. Liao, and M. Saffell, “Performance functions and reinforcement learning for trading systems and portfolios,” Journal of Forecasting, 1998.
[176] Z. Jiang and J. Liang, “Cryptocurrency portfolio management with deep reinforcement learning,” IEEE Intelligent Systems Conference (IntelliSys), 2017.
[177] R. Neuneier, “Enhancing q-learning for optimal asset allocation,” NeurIPS, 1998.
[178] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, “Deep direct reinforcement learning for financial signal representation and trading,” IEEE Transactions on Neural Networks and Learning Systems, 2016.
[179] G. Dalal, E. Gilboa, and S. Mannor, “Hierarchical decision making in electricity grid management,” International Conference on Machine Learning, 2016.
[180] F. Ruelens, B. J. Claessens, S. Vandael, B. De Schutter, R. Babuška, and R. Belmans, “Residential demand response of thermostatically controlled loads using batch reinforcement learning,” IEEE Transactions on Smart Grid, 2016.
[181] Z. Wen, D. O’Neill, and H. Maei, “Optimal demand response using device-based reinforcement learning,” IEEE Transactions on Smart Grid, 2015.
[182] Y. Li, J. Song, and S. Ermon, “Infogail: Interpretable imitation learning from visual demonstrations,” NeurIPS, 2017.
[183] R. Ramakrishnan and J. Shah, “Towards interpretable explanations for transfer learning in sequential tasks,” AAAI Spring Symposium Series, 2016.
[184] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart, “Retain: An interpretable predictive model for healthcare using reverse time attention mechanism,” NeurIPS, vol. 29, 2016.

Zhuangdi Zhu is currently a senior data and applied scientist with Microsoft. She obtained her Ph.D. degree from the Computer Science Department of Michigan State University. Zhuangdi has regularly published at prestigious machine learning conferences, including NeurIPS, ICML, KDD, and AAAI. Her research interests reside in both fundamental and applied machine learning. Her current research involves reinforcement learning and distributed machine learning.

Kaixiang Lin is an applied scientist at Amazon Web Services. He obtained his Ph.D. from Michigan State University. He has broad research interests across multiple fields, including reinforcement learning, human-robot interaction, and natural language processing. His research has been published at multiple top-tier machine learning and data mining conferences, such as ICLR, KDD, and NeurIPS. He regularly serves as a reviewer for top machine learning conferences.

Anil K. Jain is a University Distinguished Professor in the Department of Computer Science and Engineering at Michigan State University. His research interests include pattern recognition and biometric authentication. He served as the editor-in-chief of the IEEE Transactions on Pattern Analysis and Machine Intelligence and was a member of the United States Defense Science Board. He has received the Fulbright, Guggenheim, Alexander von Humboldt, and IAPR King Sun Fu awards. He is a member of the National Academy of Engineering and a foreign fellow of the Indian National Academy of Engineering and the Chinese Academy of Sciences.

Jiayu Zhou is an associate professor in the Department of Computer Science and Engineering at Michigan State University. He received his Ph.D. degree in computer science from Arizona State University in 2014. He has broad research interests in the fields of large-scale machine learning and data mining, as well as biomedical informatics. He has served as a technical program committee member for premier conferences such as NIPS, ICML, and SIGKDD. His papers have received the Best Student Paper Award at the 2014 IEEE International Conference on Data Mining (ICDM), the Best Student Paper Award at the 2016 International Symposium on Biomedical Imaging (ISBI), and the Best Paper Award at IEEE Big Data 2016.