Transfer Learning in Deep Reinforcement
Transfer Learning in Deep Reinforcement
Abstract—Reinforcement learning is a learning paradigm for solving sequential decision-making problems. Recent years have witnessed
remarkable progress in reinforcement learning upon the fast development of deep neural networks. Along with the promising prospects of
reinforcement learning in numerous domains such as robotics and game-playing, transfer learning has arisen to tackle various challenges
faced by reinforcement learning, by transferring knowledge from external expertise to facilitate the efficiency and effectiveness of the
learning process. In this survey, we systematically investigate the recent progress of transfer learning approaches in the context of deep
reinforcement learning. Specifically, we provide a framework for categorizing the state-of-the-art transfer learning approaches, under
which we analyze their goals, methodologies, compatible reinforcement learning backbones, and practical applications. We also draw
arXiv:2009.07888v7 [cs.LG] 4 Jul 2023
connections between transfer learning and other relevant topics from the reinforcement learning perspective and explore their potential
challenges that await future research progress.
1 I NTRODUCTION
efficient and principled manner. We also pointed out the π . Similar to the value function, a policy also carries a Q-
prominent applications of TL for DRL and its opportunities function, which estimates the quality of taking action a from
to thrive in the future era of AGI. state s: QπM (s, a) = Es′ ∼T (·|s,a) [R(s, a, s′ ) + γVM
π
(s′ )] .
The rest of this survey is organized as follows: In Section
2 we introduce RL preliminaries, including the recent key Reinforcement Learning Goals: Standard RL aims to learn
∗
development based on deep neural networks. Next, we an optimal policy πM with the optimal value and Q-
∗
discuss the definition of TL in the context of RL and function, s.t. ∀s ∈ S, πM (s) = arg max Q∗M (s, a), where
a∈A
its relevant topics (Section 2.4). In Section 3, we provide Q∗M (s, a) = sup QπM (s, a). The learning objective can be
a framework to categorize TL approaches from multiple π
reduced as maximizing the expected return:
perspectives, analyze their fundamental differences, and X
summarize their evaluation metrics (Section 3.3). In Section J(π) := E(s,a)∼µπ (s,a) [ γ t rt ],
5, we elaborate on different TL approaches in the context of t
DRL, organized by the format of transferred knowledge, such π
where µ (s, a) is the stationary state-action distribution in-
as reward shaping (Section 5.1), learning from demonstrations
duced by π [14].
(Section 5.2), or learning from teacher policies (Section 5.3). We
Built upon recent progress of DRL, some literature has
also investigate TL approaches by the way that knowledge
extended the RL objective to achieving miscellaneous goals
transfer occurs, such as inter-task mapping (Section 5.4), or
under different conditions, referred to as Goal-Conditional
learning transferrable representations (Section 5.5), etc. We
RL (GCRL). In GCRL, the agent policy π(·|s, g) is dependent
discuss contemporary applications of TL in the context of
not only on state observations s but also the goal g being
DRL in Section 6 and provide some future perspectives and
optimized. Each individual goal g ∼ G can be differentiated
open questions in Section 7.
by its reward function r(st , at , g), hence the objective for
GCRL becomes maximizing the expected return P over the dis-
2 D EEP R EINFORCEMENT L EARNING AND T RANS - tribution of goals: J(π) := E(st ,at )∼µπ ,g∼G [ t γ t r(s, a, g)]
[15]. A prototype example of GCRL can be maze locomotion
FER L EARNING
tasks, where the learning goals are manifested as desired
2.1 Reinforcement Learning Basics locations in the maze [16].
Markov Decision Process: A typical RL problem can be Episodic vs. Non-episodic Reinforcement Learning: In
considered as training an agent to interact with an envi- episodic RL, the agent performs in finite episodes of length
ronment that follows a Markov Decision Process (MPD) [13]. H , and will be reset to an initial state ∈ µ0 upon the episode
The agent starts with an initial state and performs an action ends [1]. Whereas in non-episodic RL, the learning agent
accordingly, which yields a reward to guide the agent actions. continuously interacts with the MDP without any state
Once the action is taken, the MDP transits to the next state by reset [17]. To encompass the episodic concept in infinite
following the underlying transition dynamics of the MDP. The MDPs, episodic RL tasks usually assume the existence of
agent accumulates the time-discounted rewards along with its a set of absorbing states S0 , which indicates the termination
interactions. A subsequence of interactions is referred to as of episodic tasks [18, 19], and any action taken upon an
an episode. The above-mentioned components in an MDP can absorbing state will only transit to itself with zero rewards.
be represented using a tuple, i.e. M = (µ0 , S, A, T , γ, R), in
which:
• µ0 is the set of initial states. 2.2 Reinforcement Learning Algorithms
• S is the state space. There are two major methods to conduct RL: Model-Based
• A is the action space. and Model-Free. In model-based RL, a learned or provided
• T : S×A×S → R is the transition probability distribution, model of the MDP is used for policy learning. In model-free
where T (s′ |s, a) specifies the probability of the state RL, optimal policy is learned without modeling the transition
transitioning to s′ upon taking action a from state s. dynamics or reward functions. In this section, we start intro-
• R : S × A × S → R is the reward distribution, where ducing RL techniques from a model-free perspective, due to
R(s, a, s′ ) is the reward that an agent can get by taking its relatively simplicity, which also provides foundations for
action a from state s with the next state being s′ . many model-based methods.
• γ is a discounted factor, with γ ∈ (0, 1]. Prediction and Control: an RL problem can be disas-
A RL agent behaves in M by following its policy π , sembled into two subtasks: prediction and control [1]. In the
which is a mapping from states to actions: π : S → A . For a prediction phase, the quality of the current policy is being
stochastic policy π , π(a|s) denotes the probability of taking evaluated. In the control phase or the policy improvement phase,
action a from state s. Given an MDP M and a policy π , one the learning policy is adjusted based on evaluation results
π from the prediction step. Policies can be improved by iterating
can derive a value function VM (s), which is defined over
the state space: VM (s) = E r0 + γr1 + γ 2 r2 + . . . ; π, s ,
π
through these two steps, known as policy iteration.
where ri = R(si , ai , si+1 ) is the reward that an agent For model-free policy iterations, the target policy is opti-
receives by taking action ai in the i-th state si , and the mized without requiring knowledge of the MDP transition
next state transits to si+1 . The expectation E is taken over dynamics. Traditional model-free RL includes Monte-Carlo
s0 ∼ µ0 , ai ∼ π(·|si ), si+1 ∼ T (·|si , ai ). The value function methods, which estimates the value of each state using
estimates the quality of being in state s, by evaluating the ex- samples of episodes starting from that state. Monte-Carlo
pected rewards that an agent can get from s following policy methods can be on-policy if the samples are collected by
3
following the target policy, or off-policy if the episodic samples In the above definition, we use ϕ(I) to denote the
are collected by following a behavior policy that is different learned policy based on information I , which is usually
from the target policy. approximated with deep neural networks in DRL. For the
Temporal Difference (TD) Learning is an alternative to simplistic case, knowledge can transfer between two agents
Monte-Carlo for solving the prediction problem. The key idea within the same domain, resulting in |Ms | = 1, and
behind TD-learning is to learn the state quality function by Ms = Mt . One can consider regular RL without TL as
bootstrapping. It can also be extended to solve the control a special case of the above definition, by treating Is = ∅, so
problem so that both value function and policy can get that a policy π is learned purely on the feedback provided
improved simultaneously. Examples of on-policy TD-learning by the target domain, i.e. π = ϕ(It ).
algorithms include SARSA [20], Expected SARSA [21], Actor-
Critic [22], and its deep neural network extension called A3C
2.4 Related Topics
[23]. The off-policy TD-learning approaches include SAC [24]
for continuous state-action spaces, and Q-learning [25] for In addition to TL, other efforts have been made to benefit RL
discrete state-action spaces, along with its variants built on by leveraging different forms of supervision. In this section,
deep-neural networks, such as DQN [26], Double-DQN [26], we briefly discuss other techniques that are relevant to TL by
Rainbow [27], etc. TD-learning approaches focus more on analyzing the differences and connections between transfer
estimating the state-action value functions. learning and these relevant techniques, which we hope can
Policy Gradient, on the other hand, is a mechanism further clarify the scope of this survey.
that emphasizes on direct optimization of a parameteriz- Continual Learning is the ability of sequentially learning
able policy. Traditional policy-gradient approaches include multiple tasks that are temporally or spatially related, with-
REINFORCE [28]. Recent years have witnessed the joint out forgetting the previously acquired knowledge. Continual
presence of TD-learning and policy-gradient approaches. Learning is a specialized yet more challenging scenario of TL,
Representative algorithms along this line include Trust region in that the learned knowledge needs to be transferred along
policy optimization (TRPO) [29], Proximal Policy optimization a sequence of dynamically-changing tasks that cannot be
(PPO) [30], Deterministic policy gradient (DPG) [31] and its foreseen, rather than learning a fixed group of tasks. Hence,
extensions such as DDPG [32] and Twin Delayed DDPG [33]. different from most TL methods discussed in this survey,
Unlike model-free methods that learn purely from trial- the ability of automatic task detection and avoiding catastrophic
and-error, Model-Based RL (MBRL) explicitly learns the forgetting is usually indispensable in continual learning [43].
transition dynamics or cost functions of the environment. Hierarchical RL has been proposed to resolve complex
The dynamics model can sometimes be treated as a black-box real-world tasks. Different from traditional RL, for hierarchi-
for better sampling-based planning. Representative examples cal RL, the action space is grouped into different granularities
include the Monte-Carlo method dubbed random shooting [34] to form higher-level macro actions. Accordingly, the learning
and its cross-entropy method (CEM) variants [35, 36]. The task is also decomposed into hierarchically dependent sub-
modeled dynamics can also facilitate learning with data gen- goals. Well-known hierarchical RL frameworks include Feu-
eration [37] and value estimation [38]. For MBRL with white- dal learning [44], Options framework[45], Hierarchical Abstract
box modeling, the transition models become differentiable Machines [46], and MAXQ [47]. Given the higher-level
and can facilitate planning with direct gradient propogation. abstraction on tasks, actions, and state spaces, hierarchical
Methods along this line include differential planning for policy RL can facilitate knowledge transfer across similar domains.
gradient [39] and action sequences search [40], and value Multi-task RL learns an agent with generalized skills
gradient methods [41, 42]. One advantage of MBRL is its across various tasks, hence it can solve MDPs randomly
higher sample efficiency than model-free RL, although it sampled from a fixed yet unknown distribution [48]. A larger
can be challenging for complex domains, where it is usually concept of multi-task learning also incorporates multi-task
more difficult to learn the dynamics than learning a policy. supervised learning and unsupervised learning [49]. Multi-task
learning is naturally related to TL, in that the learned skills,
typically manifested as representations, need to be effectively
2.3 Transfer Learning in the Context of Reinforcement shared among domains. Many TL techniques later discussed
Learning in this survey can be readily applied to solve multi-task RL
scenarios, such as policy distillation [50], and representation
Remark 1. Without losing clarify, for the rest of this survey, we
sharing [51]. One notable challenges in multi-task learning
refer to MDPs, domains, and tasks equivalently.
is negative transfer, which is induced by the irrelevance or
Remark 2. [Transfer Learning in the Context of RL] Given conflicting property for learned tasks. Hence, some recent
a set of source domains Ms = {Ms |Ms ∈ Ms } and a target work in multi-task RL focused on a trade-off between sharing
domain Mt , Transfer Learning aims to learn an optimal policy and individualizing function modules [52–54].
π ∗ for the target domain, by leveraging exterior information Is Generalization in RL refers to the ability of learning
from Ms as well as interior information It from Mt : agents to adapt to unseen domains. Generalization is a crucial
property for RL to achieve, especially when classical RL
π ∗ = arg max Es∼µt0 ,a∼π [QπM (s, a)], assumes identical training and inference MDPs, whereas the
π
real world is constantly changing. Generalization in RL is
where π = ϕ(Is ∼ Ms , It ∼ Mt ) : S t → At is a policy considered more challenging than in supervised learning
learned for the target domain Mt based on information from both due to the non-stationarity of MDPs, where the latter has
It and Is . provided inspirations for the former [55]. Meta-learning is an
4
effective direction towards generalization, which also draws approaches into the following classes: (i) Zero-shot transfer,
close connections to TL. Some TL techniques discussed in this which learns an agent that is directly applicable to the
survey are actually designed for meta-RL. However, meta- target domain without requiring any training interactions;
learning is particularly focused on the learning methods (ii) Few-shot transfer, which only requires a few samples
that lead to fast adaptation to unseen domains, whereas TL (interactions) from the target domain; (iii) Sample-efficient
is a broader concept and covers scenarios where the target transfer, where an agent can benefit by TL to be more
environment can be (partially) observable. To tackle unseen sample efficient compared to normal RL.
tasks in RL, some meta-RL methods focused on training
MDPs generation [56] and variations estimation [57]. We
refer readers to [58] for a more focused survey on meta RL. 3.2 Case Analysis of Transfer Learning in the context of
Reinforcement Learning
3 A NALYZING T RANSFER L EARNING We now use HalfCheetah1 as a working example to illustrate
In this section, we discuss TL approaches in RL from different how TL can occur between the source and the target domain.
angles. We also use a prototype to illustrate the potential HalfCheetah is a standard DRL benchmark for solving physical
variants residing in knowledge transfer among domains, locomotion tasks, in which the objective is to train a two-leg
then summarize important metrics for TL evaluation. agent to run fast without losing control of itself.
3.3 Evaluation metrics over the last decade. For instance, [11] emphasized on
In this section, we present some representative metrics for different task-mapping methods, which are more suitable for
evaluating TL approaches, which have also been partly domains with tabular or mild state-action space dimensions.
summarized in prior work [11, 66]: There are other surveys focused on specific subtopics that
• Jumpstart performance( jp): the initial performance (returns) interplay between RL and TL. For instance, [70] consolidated
of the agent. sim-to-real TL methods. They explored work that is more tai-
• Asymptotic performance (ap): the ultimate performance lored for robotics domains, including domain generalization
(returns) of the agent. and zero-shot transfer, which is a favored application field
• Accumulated rewards (ar): the area under the learning curve of DRL as we discussed in Sec 6. [71] conducted extensive
of the agent. database search and summarized benchmarks for evaluating
• Transfer ratio (tr): the ratio between asymptotic performance TL algorithms in RL. [72] surveyed recent progress in
of the agent with TL and asymptotic performance of the agent multi-task RL. They partially shared research focus with
without TL. us by studying certain TL oriented solutions towards multi-
• Time to threshold (tt): the learning time (iterations) needed task RL, such as learning shared representations, pathNets,
for the target agent to reach certain performance threshold. etc. We surveyed TL for RL with a broader spectrum in
• Performance with fixed training epochs (pe): the performance methodologies, applications, evaluations, which naturally
achieved by the target agent after a specific number of draws connections to the above literatures.
training iterations.
• Performance sensitivity (ps): the variance in returns using 5 T RANSFER L EARNING A PPROACHES D EEP D IVE
different hyper-parameter settings. In this section, we elaborate on various TL approaches and
The above criteria mainly focus on the learning process organize them into different sub-topics, mostly by answering
of the target agent. In addition, we introduce the following the question of “what knowledge is transferred”. For each type
metrics from the perspective of transferred knowledge, which, of TL approach, we analyze them by following the other
although commensurately important for evaluation, have not criteria mentioned in Section 3 and and summarize the key
been explicitly discussed by prior art: evaluation metrics that are applicable to the discussed work.
Figure 1 presents an overview of different TL approaches
• Necessary knowledge amount (nka): the necessary amount of
discussed in this survey.
the knowledge required for TL in order to achieve certain
performance thresholds. Examples along this line include
the number of designed source tasks [67], the number of 5.1 Reward Shaping
expert policies, or the number of demonstrated interactions We start by introducing the Reward Shaping approach, as
[68] required to enable knowledge transfer. it is applicable to most RL backbones and also largely
• Necessary knowledge quality (nkq): the guaranteed quality overlaps with the other TL approaches discussed later.
of the knowledge required to enable effective TL. This Reward Shaping (RS) is a technique that leverages the
metric helps in answering questions such as (i) Does the exterior knowledge to reconstruct the reward distribution of
TL approach rely on near-oracle knowledge, such as expert the target domain to guide the agent’s policy learning. More
demonstrations/policies [69], or (ii) is the TL technique specifically, in addition to the environment reward signals,
feasible even given suboptimal knowledge [63]? RS learns a reward-shaping function F : S × S × A → R
TL approaches differ in various perspectives, including the to render auxiliary rewards, provided that the additional
forms of transferred knowledge, the RL frameworks utilized rewards contain external knowledge to guide the agent
to enable such transfer, and the gaps between the source and for better action selections. Intuitively, an RS strategy will
the target domain. It maybe biased to evaluate TL from just assign higher rewards to more beneficial state-actions to
one viewpoint. We believe that explicating these TL related navigate the agent to desired trajectories. As a result, the
metrics helps in designing more generalizable and efficient agent will learn its policy using the newly shaped rewards
TL approaches. R′ : R′ = R + F , which means that RS has altered the target
In general, most of the abovementioned metrics can be domain with a different reward function:
considered as evaluating two abilities of a TL approach: the M = (S, A, T , γ, R)) → M′ = (S, A, T , γ, R′ ). (1)
mastery and generalization. Mastery refers to how well the
Along the line of RS, Potential based Reward Shaping (PBRS)
learned agent can ultimately perform in the target domain,
is one of the most classical approaches. [61] proposed PBRS
while generalization refers to the ability of the learning agent
to form a shaping function F as the difference between two
to quickly adapt to the target domain.
potential functions (Φ(·)):
F (s, a, s′ ) = γΦ(s′ ) − Φ(s), (2)
4 R ELATED W ORK
There are prior efforts in summarizing TL research in RL. One where the potential function Φ(·) comes from the knowledge
of the earliest literatures is [11] . Their main categorization of expertise and evaluates the quality of a given state.
is from the perspective of problem setting, in which the TL It has been proved that, without further restrictions on
scenarios may vary in the number of domains involved, and the underlying MDP or the shaping function F , PBRS
the difference of state-action space among domains. Similar is sufficient and necessary to preserve the policy invari-
categorization is adopted by [12], with more refined analysis ance. Moreover, the optimal Q-function in the original and
dimensions including the objective of TL. As pioneer surveys transformed MDP are related by the potential function:
for TL in RL, neither [11] nor [12] covered recent research Q∗M′ (s, a) = Q∗M (s, a) − Φ(s), which draws a connection
6
between potential based reward-shaping and advantage- Ms to the target domain Mt . This approach assumed
based learning approaches [73]. the existence of two mapping functions MS and MA that
The idea of PBRS was extended to [74], which formulated can transform the state and action from the source to the
the potential as a function over both the state and the target domain. Another work used demonstrated state-
action spaces. This approach is called Potential Based state- action samples from an expert policy to shape rewards [78].
action Advice (PBA). The potential function Φ(s, a) therefore Learning the augmented reward involves learning a dis-
evaluates how beneficial an action a is to take from state s: criminator to distinguish samples generated by an expert
F (s, a, s′ , a′ ) = γΦ(s′ , a′ ) − Φ(s, a). (3) policy from samples generated by the target policy. The
loss of the discriminator is applied to shape rewards to
PBA requires on-policy learning and can be sample-costly,
incentivize the learning agent to mimic the expert behavior.
as in Equation (3), a′ is the action to take upon state s is
This work combines two TL approaches: RS and Learning
transitioning to s′ by following the learning policy.
from Demonstrations, the latter of which will be elaborated in
Traditional RS approaches assumed a static potential
Section 5.2.
function, until [75] proposed a Dynamic Potential Based (DPB)
approach which makes the potential a function of both states
and time: F (s, t, s′ , t′ ) = γΦ(s′ , t′ ) − Φ(s, t).They proved The above-mentioned RS approaches are summarized
that this dynamic approach can still maintain policy invari- in Table 1. They follow the potential based RS principle
ance: Q∗M′ (s, a) = Q∗M (s, a) − Φ(s, t),where t is the current that has been developed systematically: from the classical
tilmestep. [76] later introduced a way to incorporate any PBRS which is built on a static potential shaping function of
prior knowledge into a dynamic potential function structure, states, to PBA which generates the potential as a function of
which is called Dynamic Value Function Advice (DPBA). both states and actions, and DPB which learns a dynamic
The rationale behind DPBA is that, given any extra reward potential function of states and time, to the most recent
function R+ from prior knowledge, in order to add this extra DPBA, which involves a dynamic potential function of states
reward to the original reward function, the potential function and actions to be learned as an extra state-action Value
should satisfy: γΦ(s′ , a′ ) − Φ(s, a) = F (s, a) = R+ (s, a). function in parallel with the environment Value function.
If Φ is not static but learned as an extra state-action As an effective TL paradigm, RS has been widely applied to
Value function overtime, then the Bellman equation for Φ fields including robot training [79], spoken dialogue systems
is : Φπ (s, a) = rΦ (s, a) + γΦ(s′ , a′ ). The shaping rewards [80], and question answering [81]. It provides a feasible
F (s, a) is therefore the negation of rΦ (s, a) : framework for transferring knowledge as the augmented
reward and is generally applicable to various RL algorithms.
F (s, a) = γΦ(s′ , a′ ) − Φ(s, a) = −rΦ (s, a). (4)
RS has also been applied to multi-agent RL [82] and model-
This leads to the approach of using the negation of R+ as based RL [83]. Principled integration of RS with other TL
the immediate reward to train an extra state-action Value approaches, such as Learning from demonstrations (Section 5.2)
function Φ and the policy simultaneously. Accordingly, the and Policy Transfer (Section 5.3) will be an intriguing question
dynamic potential function F becomes: for ongoing research.
TABLE 1: A comparison of reward shaping approaches. ✗ denotes that the information is not revealed in the paper.
actions for efficient explorations [92]. Most work discussed in algorithm named LfDS was proposed by [96], which draws a
this section follows the online transfer paradigm or combines close connection to reward shaping (Section 5.1). LfDS builds
offline pre-training with online RL [93]. the potential value of a state-action pair as the highest simi-
Work along this line can also be categorized depending on larity between the given pair and the expert demonstrations.
what RL frameworks are compatible: some adopts the policy- This augmented reward assigns more credits to state-actions
iteration framework [59, 94, 95], some follow a Q-learning that are more similar to expert demonstrations, encouraging
framework [92, 96], while recent work usually follows the the agent for expert-like behavior.
Besides Q-learning, recent work has integrated LfD into
policy-gradient framework [63, 78, 93, 97]. Demonstrations policy gradient [63, 69, 78, 93, 97]. A representative work
have been leveraged in the policy iterations framework by [98]. along this line is Generative Adversarial Imitation Learning
Later, [94] introduced the Direct Policy Iteration with Demon- (GAIL) [69]. GAIL introduced the notion of occupancy measure
strations (DPID) algorithm. This approach samples complete dπ , which is the stationary state-action distributions derived
demonstrated rollouts DE from an expert policy πE , in from a policy π . Based on this notion, a new reward function
combination with the self-generated rollouts Dπ gathered is designed such that maximizing the accumulated new
rewards encourages minimizing the distribution divergence
from the learning agent. Dπ ∪ DE are used to learn a Monte-
between the occupancy measure of the current policy π and
Carlo estimation of the Q-value: Q̂, from which a learning the expert policy πE . Specifically, the new reward is learned
policy can be derived greedily: π(s) = arg maxQ̂(s, a). This by adversarial training [62]: a discriminator D is learned to
a∈A distinguish interactions sampled from the current policy π
policy π is further regularized by a loss function L(s, πE ) to and the expert policy πE :
minimize its discrepancy from the expert policy decision.
Another example is the Approximate Policy Iteration with JD = max Edπ log[1 − D(s, a)] + EdE log[D(s, a)] (8)
D:S×A→(0,1)
Demonstration (APID) algorithm, which was proposed by [59]
and extended by [95]. Different from DPID where both DE Since πE is unknown, its state-action distribution dE is
and Dπ are used for value estimation, the APID algorithm estimated based on the given expert demonstrations DE .
solely applies Dπ to approximate on the Q function. Expert The output of the discriminator is used as new rewards to
demonstrations DE are used to learn the value function, encourage distribution matching, with r′ (s, a) = − log(1 −
which, given any state si , renders expert actions πE (si ) with D(s, a)). The RL process is naturally altered to perform
higher Q-value margins compared with other actions that distribution matching by min-max optimization:
are not shown in DE :
max min J(π, D) : = Edπ log[1 − D(s, a)] + EdE log[D(s, a)].
π D
Q(si , πE (si )) − max Q(si , a) ≥ 1 − ξi . (6)
a∈A\πE (si )
The philosophy in GAIL of using expert demonstrations
The term ξi is used to account for the case of imperfect for distribution matching has inspired other LfD algorithms.
demonstrations. [95] further extended the work of APID For example, [97] extended GAIL with an algorithm called
8
Policy Optimization from Demonstrations (POfD), which com- We summarize the above-discussed approaches in Table 2.
bines the discriminator reward with the environment reward: In general, demonstration data can help in both offline pre-
training for better initialization and online RL for efficient
max = Edπ [r(s, a)] − λDJS [dπ ||dE ]. (9) exploration. During the RL phase, demonstration data can
θ
be used together with self-generated data to encourage
Both GAIL and POfD are under an on-policy RL frame- expert-like behaviors (DDPGfD, DQFD), to shape value
work. To further improve the sample efficiency of TL, functions (APID), or to guide the policy update in the form
some off-policy algorithms have been proposed, such as of an auxiliary objective function (PID,GAIL, POfD). To
DDPGfD [78] which is built upon the DDPG framework. validate the algorithm robustness given different knowledge
DDPGfD shares a similar idea as DQfD in that they both resources, most LfD methods are evaluated using metrics that
use a second replay buffer for storing demonstrated data, either indicate the performance under limited demonstrations
and each demonstrated sample holds a sampling priority (nka) or suboptimal demonstrations (nka). The integration of
pi . For a demonstrated sample, its priority pi is augmented LfD with off-policy RL backbone makes it natural to adopt pe
with a constant bias ϵD > 0 for encouraging more frequent metrics for evaluating how learning efficiency can be further
sampling of expert demonstrations: improved by knowledge transfer. Developing more general
pi = δi2 + λ∥∇a Q(si , ai |θQ )∥2 + ϵ + ϵD , LfD approaches that are agnostic to RL frameworks and can
learn from sub-optimal or limited demonstrations would be
where δi is the TD-residual for transition, ∥∇a Q(si , ai |θQ )∥2 the ongoing focus for this research domain.
is the loss applied to the actor, and ϵ is a small positive
constant to ensure all transitions are sampled with some prob- 5.3 Policy Transfer
ability. Another work also adopted the DDPG framework to Policy transfer is a TL approach where the external knowledge
learn from demonstrations [93]. Their approach differs from takes the form of pre-trained policies from one or multiple
DDPGfD in that its objective function is augmented with source domains. Work discussed in this section is built upon
a Behavior Cloning Loss to encourage imitating on provided a many-to-one problem setting, described as below:
P|D |
demonstrations: LBC = i=1E ||π(si |θπ ) − ai ||2 .
To further address the issue of suboptimal demonstrations, Policy Transfer. A set of teacher policies πE1 , πE2 , . . . , πEK
in [93] the form of Behavior Cloning Loss is altered based on are trained on a set of source domains M1 , M2 , . . . , MK ,
the critic output, so that only demonstration actions with respectively. A student policy π is learned for a target domain
higher Q values will lead to the loss penalty: by leveraging knowledge from {πEi }Ki=1 .
|DE | For the one-to-one scenario with only one teacher policy,
∥π(si |θπ ) − ai ∥2 1[Q(si , ai ) > Q(si , π(si ))]. (10)
X
LBC = one can consider it as a special case of the above with K = 1.
i=1 Next, we categorize recent work of policy transfer into two
There are several challenges faced by LfD, one of which techniques: policy distillation and policy reuse.
is the imperfect demonstrations. Previous approaches usu-
ally presume near-oracle demonstrations. Towards tackling 5.3.1 Transfer Learning via Policy Distillation
suboptimal demonstrations, [59] leveraged the hinge-loss The idea of knowledge distillation has been applied to the
function to allow occasional violations of the property that field of RL to enable policy distillation. Knowledge distillation
Q(si , πE (si )) − max Q(si , a) ≥ 1. Some other work was first proposed by [104] as an approach of knowledge
a∈A\πE (si )
uses regularized objective to alleviate overfitting on biased ensemble from multiple teacher models into a single stu-
data [92, 99]. A different strategy is to leverage those sub- dent model. Conventional policy distillation approaches
optimal demonstrations only to boost the initial learning transfer the teacher policy following a supervised learning
stage. For instance, [63] proposed Self-Adaptive Imitation paradigm [105, 106]. Specifically, a student policy is learned
Learning (SAIL), which learns from suboptimal demonstra- by minimizing the divergence of action distributions between
tions using generative adversarial training while gradually the teacher policy πE and student policy πθ , which is denoted
selecting self-generated trajectories with high qualities to as H× (πE (τt )|πθ (τt )):
replace less superior demonstrations. |τ |
Another challenge faced by LfD is covariate drift ([100]):
X
min Eτ ∼πE [ ∇θ H× (πE (τt )|πθ (τt ))]. (11)
demonstrations may be provided in limited numbers, which θ
t=1
results in the learning agent lacking guidance on states that
are unseen in the demonstration dataset. This challenge is The above expectation is taken over trajectories sampled from
aggravated in MDPs with sparse reward feedbacks, as the the teacher policy πE , hence this approach is called teacher
learning agent cannot obtain much supervision information distillation. One example along this line is [105], in which N
from the environment either. Current efforts to address this teacher policies are learned for N source tasks separately, and
challenge include encouraging explorations by using an each teacher yields a dataset DE = {si , qi }N i=0 consisting of
entropy-regularized objective [101], decaying the effects of observations s and vectors of the corresponding Q-values
demonstration guidance by softening its regularization on q , such that qi = [Q(si , a1 ), Q(si , a2 ), ...|aj ∈ A]. Teacher
policy learning over time [102], and introducing disagreement policies are further distilled to a single student πθ by min-
regularizations by training an ensemble of policies based on imizing the KL-Divergence between each teacher πEi (a|s)
E
the given demonstrations, where the variance among policies and the student πθ , approximated using the datasetE D :
E E
|D | q softmax(q )
minθ DKL (π E |πθ ) ≈ i=1 softmax τi ln softmax(qiθ ) .
P
serves as a negative reward function [103].
i
9
Optimality
Methods Format of transferred demonstrations RL framework Evaluation metrics
guarantee
DPID ✓ Indicator binary-loss : L(si ) = 1{πE (si ) ̸= API ap, ar, nka
π(si )}
APID ✗ Hinge loss on the marginal-loss: L(Q, π, πE ) + API ap, ar, nta, nkq
APID extend ✓ Marginal-loss: L(Q, π, πE ) API ap, ar, nta, nkq
[93] ✓ Increasing sampling priority and behavior cloning loss DDPG ap, ar, tr, pe, nkq
DQfD ✗ Cached transitions in the replay buffer DQN ap, ar, tr
LfDS ✗ Reward shaping function DQN ap, ar, tr
GAIL ✓ Reward shaping function: −λ log(1 − D(s, a)) TRPO ap, ar, tr, pe, nka
POfD ✓ Reward shaping function: TRPO,PPO ap, ar, tr, pe, nka
r(s, a) − λ log(1 − D(s, a))
DDPGfD (pe) ✓ Increasing sampling priority DDPG ap, ar, tr, pe
SAIL ✗ Reward shaping function: r(s, a) − λ log(1 − D(s, a)) DDPG ap, ar, tr, pe, nkq, nka
TABLE 2: A comparison of learning from demonstration approaches.
Another policy distillation approach is student distil- student policy λH(πE (at |st )||πθ (at |st )) to reshape rewards.
lation [51, 60], which is resemblant to teacher distilla- Moreover, they adopted a dynamically fading coefficient
tion except that during the optimization step, the ob- to alleviate the effect of the augmented reward so that the
jective expectation is taken over trajectories sampled student policy becomes independent of the teachers after
from the student hP policy instead of the teacher i policy, certain optimization iterations.
|τ | ×
i.e.: minθ Eτ ∼πθ t=1 ∇ θ H (π E (τt )|π θ (τ t )) . [60] summa-
rized related work on both kinds of distillation approaches. 5.3.2 Transfer Learning via Policy Reuse
Although it is feasible to combine both distillation ap- Policy reuse directly reuses policies from source tasks to build
proaches [100], we observe that more recent work focuses the target policy. The notion of policy reuse was proposed by
on student distillation, which empirically shows better [109], which directly learns the target policy as a weighted
exploration ability compared to teacher distillation, especially combination of different source-domain policies, and the
when the teacher policies are deterministic. probability for each source domain policy to be used is
Taking an alternative perspective, there are two ap- related to its expected performance gain in the target domain:
proaches of policy distillation: (1) minimizing the cross- P (πEi ) = PKexpexp(tWi )
(tWj )
, where t is a dynamic temperature
entropy between the teacher and student policy distributions j=0
over actions [51, 107]; and (2) maximizing the probability parameter that increases over time. Under a Q-learning
that the teacher policy will visit trajectories generated by framework, the Q-function of the target policy is learned
the student, i.e. maxθ P (τ ∼ πE |τ ∼ πθ ) [50, 108]. One in an iterative scheme: during every learning episode, Wi
example of approach (1) is the Actor-mimic algorithm [51]. is evaluated for each expert policy πEi , and W0 is obtained
This algorithm distills the knowledge of expert agents into for the learning policy, from which a reuse probability P
the student by minimizing the cross entropy between the is derived. Next, a behavior policy is sampled from this
P πθ and each teacher policy πEi over actions:
student policy
probability P . After each training episode, both Wi and
Li (θ) = a∈AEi πEi (a|s) logπθ (a|s), where each teacher the temperature t for calculating the reuse probability is
agent is learned using a DQN framework. The teacher policy
updated accordingly. One limitation of this approach is that
is therefore derived from the Boltzmann −1 distributions over
e
τ QE (s,a)
i the Wi , i.e. the expected return of each expert policy on the
the Q-function output: πEi (a|s) = P τ −1 QE (s,a′ )
. An
a′ ∈AE e i target task, needs to be evaluated frequently. This work was
i
instantiation of approach (2) is the Distral algorithm [50]. implemented in a tabular case, leaving the scalability issue
which learns a centroid policy πθ that is derived from K unresolved. More recent work by [110] extended the policy
teacher policies. The knowledge in each teacher πEi is dis- improvement theorem [111] from one to multiple policies,
tilled to the centroid and get transferred to the student, while which is named as Generalized Policy Improvement. We refer
both the transition dynamics Ti and reward distributions Ri its main theorem as follows:
for source domain Mi are heterogeneous. The student policy
is learned by maximizing a multi-task learning objective Theorem. [Generalized Policy Improvement (GPI)] Let
maxθ K
P {πi }ni=1 be n policies and let {Q̂πi }ni=1 be their approximated
i=1 J(πθ , πEi ), where
X hX action-value functions, s.t: Qπi (s, a) − Q̂πi (s, a) ≤ ϵ ∀s ∈
J(πθ , πEi ) = E(st ,at )∼πθ γ t (ri (at , st )+ S, a ∈ A, and i ∈ [n]. Define π(s) = arg max maxQ̂πi (s, a),
t t≥0 a i
2
α 1 i then: Qπ (s, a) ≥ maxQπi (s, a) − 1−γ ϵ, ∀ s ∈ S, a ∈ A.
log πθ (at |st ) − log(πEi (at |st ))) , i
β β
Based on this theorem, a policy improvement approach
in which both log πθ (at |st ) and πθ are used as augmented can be naturally derived by greedily choosing the action
rewards. Therefore, the above approach also draws a close which renders the highest Q-value among all policies for
connection to Reward Shaping (Section 5.1). In effect, the a given state. Another work along this line is [110], in
log πθ (at |st ) term guides the learning policy πθ to yield which an expert policy πEi is also trained on a differ-
actions that are more likely to be generated by the teacher ent source domain Mi with reward function Ri , so that
policy, whereas the entropy term − log(πEi (at |st ) encour- QπM0 (s, a) ̸= QπMi (s, a). To efficiently evaluate the Q-
ages exploration. A similar approach was proposed by [107] functions of different source policies in the target MDP,
which only uses the cross-entropy between teacher and a disentangled representation ψ(s, a) over the states and
10
actions is learned using neural networks and is generalized the mapping function is learned on the agent-specific sub
across multiple tasks. Next, a task (reward) mapper wi is state, and the mapped representation is applied to reshape
learned, based on which the Q-function can be derived: the immediate reward. For [113], the invariant feature space
Qπi (s, a) = ψ(s, a)T wi . [110] proved that the loss of GPI is mapped from sagent can be applied across agents who have
bounded by the difference between the source and the target distinct action space but share some morphological similarity.
tasks. In addition to policy-reuse, their approach involves Specifically, they assume that both agents have been trained
learning a shared representation ψ(s, a), which is also a on the same proxy task, based on which the mapping function
form of transferred knowledge and will be elaborated more is learned. The mapping function is learned using an encoder-
in Section 5.5.2. decoder structure [116] to largely reserve information about
We summarize the abovementioned policy transfer ap- the source domain. For transferring knowledge from the
proaches in Table 3. In general, policy transfer can be realized source agent to a new task, the environment reward is
by knowledge distillation, which can be either optimized augmented with a shaped reward term to encourage the
from the student’s perspecive (student distillation), or from target agent to imitate the source agent on an embedded
the teacher’s perspective (teacher distillation) Alternatively, feature space:
teacher policies can also be directly reused to update the target
policy. Regarding evaluation, most of the abovementioned r′ (s, ·) = α f (ssagent ; θf ) − g(stagent ; θg ) , (12)
work has investigated a multi-teacher transfer scenario, hence
the generalization ability or robustness is largely evaluated on where f (ssagent ) is the agent-specific state in the source
metrics such as performance sensitivity(ps) (e.g. performance domain, and g(stagent ) is for the target domain.
given different numbers of teacher policies or source tasks Another work is [115] which applied the Unsupervised
). Performance with fixed epochs (pe) is another commonly Manifold Alignment (UMA) method [117] to automatically
shared metric to evaluate how the learned policy can quickly learn the state mapping. Their approach requires collecting
adapt to the target domain. All approaches discussed so far trajectories from both the source and the target domain
presumed one or multiple expert policies, which are always to learn such a mapping. While applying policy gradient
at the disposal of the learning agent. Open questions along learning, trajectories from the target domain Mt are first
this line include How to leverage imperfect policies for knowledge mapped back to the source: τt → τs , then an expert policy
transfer, or How to refer to teacher policies within a budget. in the source domain is applied to each initial state of those
∼
trajectories to generate near-optimal trajectories τs , which are
∼ ∼
5.4 Inter-Task Mapping further mapped to the target domain: τs → τt . The deviation
∼
In this section, we review TL approaches that utilize mapping between τt and τt are used as a loss to be minimized in order
functions between the source and the target domains to assist to improve the target policy. Similar ideas of using UMA for
knowledge transfer. Research in this domain can be analyzed inter-task mapping can also be found in [118] and [119].
from two perspectives: (1) which domain does the mapping In addition to approaches that utilizes mapping over
function apply to, and (2) how is the mapped representation states or actions, [120] proposed to learn an inter-task
utilized. Most work discussed in this section shares a common mapping over the transition dynamics space: S × A × S .
assumption as below: Their work assumes that the source and target domains
are different in terms of the transition space dimensionality.
Assumption. One-to-one mappings exist between the source
Transitions from both the source domain ⟨ss , as , s′s ⟩ and
domain Ms and the target domain Mt .
the target domain ⟨st , at , s′t ⟩ are mapped to a latent space
Earlier work along this line requires a given mapping Z . Given the latent feature representations, a similarity
function [66, 112]. One examples is [66] which assumes that measure can be applied to find a correspondence between
each target state (action) has a unique correspondence in the source and target task triplets. Triplet pairs with the
the source domain, and two mapping functions XS , XA highest similarity in this feature space Z are used to learn a
are provided over the state space and the action space, mapping function X : ⟨st , at , s′t ⟩ = X (⟨ss , as , s′s ⟩). After the
respectively, so that XS (S t ) → S s , XA (At ) → As . Based transition mapping, states sampled from the expert policy in
on XS and XA , a mapping function over the Q-values the source domain can be leveraged to render beneficial states
M (Qs ) → Qt can be derived accordingly. Another work in the target domain, which assists the target agent learning
is done by [112] which transfers advice as the knowledge with a better initialization performance. A similar idea of
between two domains. In their settings, the advice comes mapping transition dynamics can be found in [121], which,
from a human expert who provides the mapping function however, requires a stronger assumption on the similarity
over the Q-values in the source domain and transfers it to the of the transition probability and the state representations
learning policy for the target domain. This advice encourages between the source and the target domains.
the learning agent to prefer certain good actions over others, As summarized in Table 4, for TL approaches that utilize
which equivalently provides a relative ranking of actions in an inter-task mapping, the mapped knowledge can be (a
the new task. subset of) the state space [113, 114], the Q-function [66], or
More later research tackles the inter-task mapping prob- (representations of) the state-action-sate transitions [120].
lem by learning a mapping function [113–115]. Most work In addition to being directly applicable in the target do-
learns a mapping function over the state space or a subset of main [120], the mapped representation can also be used as an
the state space. In their work, state representations are usu- augmented shaping reward [113, 114] or a loss objective [115]
ally divided into agent-specific and task-specific representations, in order to guide the agent learning in the target domain.
denoted as sagent and senv , respectively. In [113] and [114], Most inter-task mapping methods tackle domains with
11
Paper Transfer approach MDP difference RL framework Evaluation metrics
[105] Distillation S, A DQN ap, ar
[106] Distillation S, A DQN ap, ar, pe, ps
[51] Distillation S, A Soft Q-learning ap, ar, tr, pe, ps
[50] Distillation S, A A3C ap, ar, pe, tt
[109] Reuse R Tabular Q-learning ap, ar, ps, tr
[110] Reuse R DQN ap, ar, pe, ps
moderate state-action space dimensions, such as maze tasks function ϕ over states s, it can be decomposed into two
or tabular MDPs, where the goal can be reaching a target sub-modules gk and fr , i.e.:
state with a minimal number of transitions. Accordingly, tt
has been used to measure TL performance. For tasks with π(s) := ϕ(senv , sagent ) = fr (gk (senv ), sagent ),
limited and discrete state-action space, evaluation is also where fr is the agent-specific module and gk is the task-
conducted with different number of initial states collected in specific module. Their core idea is that the task-specific
the target domain (nka). module can be applied to different agents performing
5.5 Representation Transfer the same task, which serves as a transferred knowledge.
Accordingly, the agent-specific module can be applied to
This section review approaches that transfer knowledge in different tasks for the same agent.
the form of representations learned by deep neural networks. A model-based approach along this line is [125], which
They are built upon the following consensual assumption: learns a model to map the state observation s to a latent-
Assumption. [Existence of Task-Invariance Subspace] representation z . The transition probability is modeled on
The state space (S ), action space (A), or the reward space (R) the latent space instead of the original state space, i.e. ẑt+1 =
can be disentangled into orthogonal subspaces, which are task- fθ (zt , at ), where θ is the parameter of the transition model,
invariant such that knowledge can be transferred between domains zt is the latent-representation of the state observation, and at
on the universal subspace. is the action accompanying that state. Next, a reward module
learns the value function as well as the policy from the
We organize recent work along this line into two
latent space z using an actor-critic framework. One potential
subtopics: 1) approaches that directly reuse representations
benefit of this latent representation is that knowledge can be
from the source domain (Section 5.5.1), and 2) approaches
transferred across tasks that have different rewards but share
that learn to disentangle the source domain representations
the same transition dynamics.
into independent sub-feature representations, some of which
are on the universal feature space shared by both the source
5.5.2 Disentangling Representations
and the target domains (Section 5.5.2).
Methods discussed in this section mostly focus on learning a
5.5.1 Reusing Representations disentangled representation. Specifically, we elaborate on TL
A representative work of reusing representations is [122], approaches that are derived from two techniques: Successor
which proposed the progressive neural network structure to Representation (SR) and Universal Value Function Approximat-
enable knowledge transfer across multiple RL tasks in a ing (UVFA).
progressive way. A progressive network is composed of Successor Representations (SR) is an approach to de-
multiple columns, and each column is a policy network for couple the state features of a domain from its reward
one specific task. It starts with one single column for training distributions. It enables knowledge transfer across multiple
the first task, and then the number of columns increases domains: M = {M1 , M2 , . . . , MK }, so long as the only
with the number of new tasks. While training on a new task, difference among them is the reward distributions: Ri ̸= Rj .
neuron weights on the previous columns are frozen, and SR was originally derived from neuroscience, until [126]
representations from those frozen tasks are applied to the proposed to leverage it as a generalization mechanism for
new column via a collateral connection to assist in learning state representations in the RL domain.
the new task. Different from the v -value or Q-value that describes
Progressive network comes with a cost of large network states as dependent on the reward function, SR features
structures, as the network grows proportionally with the a state based on the occupancy measure of its successor
number of incoming tasks. A later framework called PathNet states. Specifically, SR decomposes the value function of
alleviates this issue by learning a network with a fixed any policyPinto two independent components, ψ and R:
size [123]. PathNet contains pathways, which are subsets of V π (s) = s′ ψ(s, s′ )w(s′ ), where w(s′ ) is a reward map-
neurons whose weights contain the knowledge of previous ping function that maps states to scalar rewards, and ψ
tasks and are frozen during training on new tasks. The pop- is the SR which describes any state s as the occupancy
ulation of pathway is evolved using a tournament selection measure of the future occurred states when following π ,
genetic algorithm [124]. with 1[S = s′ ] = 1 as an indicator function:
Another approach of reusing representations for TL is ∞
γ i−t 1[Si = s′ ]|St = s].
X
modular networks [52, 53, 125]. For example, [52] proposed ψ(s, s′ ) = Eπ [
to decompose the policy network into a task-specific module i=t
and agent-specific module. Specifically, let π be a policy The successor nature of SR makes it learnable using
performed by any agent (robot) r over the task Mk as a any TD-learning algorithms. Especially, [126] proved the
12
RL MDP Mapping Evaluation
Methods Usage of mapping
framework difference function metrics
[66] SARSA St ̸= St , As ̸= At M (Qs ) → Qt Q value reuse ap, ar, tt, tr
[112] Q-learning As ̸= At , Rs ̸= Rt M (Qs ) → advice Relative Q ranking ap, ar, tr
[113] − Ss ̸= St M (st ) → r′ Reward shaping ap, ar, pe, tr
[114] SARSA(λ) Ss ̸= St Rs ̸= Rt M (st ) → r′ Reward shaping ap, ar, pe, tt
[115] Fitted Value Iter- Ss ̸= St M (ss ) → st Penalty loss on state deviation ap, ar, pe, tr
ation from expert policy
[121] Fitted Q Itera- Ss × As ̸= St × At M (ss , as, s′s ) → Reduce random exploration ap, ar, pe, tr, nta
tion (st , at , s′t )
[120] − Ss × As ̸= St × At M (ss , as, s′s ) → Reduce random exploration ap, ar, pe, tr, nta
(st , at , s′t )
In particular, [126] proved the feasibility of learning such a representation in the tabular case, in which the state transitions can be described using a matrix. SR was later extended by [110] from three perspectives: (i) the feature domain of SR is extended from states to state-action pairs; (ii) deep neural networks are used as function approximators to represent the SR ψ^π(s, a) and the reward mapper w; and (iii) the Generalized Policy Improvement (GPI) algorithm is introduced to accelerate policy transfer across multiple tasks (Section 5.3.2). These extensions, however, are built upon a stronger assumption about the MDP:

Assumption (Linearity of Reward Distributions). The reward functions of all tasks can be computed as a linear combination of a fixed set of features: r(s, a, s′) = ϕ(s, a, s′)⊤ w, where ϕ(s, a, s′) ∈ R^d denotes the latent representation of the state transition, and w ∈ R^d is the task-specific reward mapper.

Based on this assumption, SR can be decoupled from the rewards when evaluating the Q-function of any policy π on a task. The advantage of SR is that, once ψ^π(s, a) has been learned in the source domain M_s, one can quickly evaluate the same policy in the target domain M_t by replacing w_s with w_t: Q^π_{M_t}(s, a) = ψ^π(s, a)⊤ w_t. Similar ideas of learning the SR with a TD-algorithm on a latent representation ϕ(s, a, s′) can also be found in [127, 128]. Specifically, the work of [127] was developed based on a weaker assumption about the reward function: instead of requiring linearly-decoupled rewards, the latent space ϕ(s, a, s′) is learned with an encoder-decoder structure that minimizes the information loss when mapping states to the latent space. This structure, therefore, comes with the extra cost of learning a decoder f_d to reconstruct the state: f_d(ϕ(s_t)) ≈ s_t.
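As a rough illustration of this decoupling and of GPI-based transfer (a minimal sketch with hypothetical names and shapes, assuming the successor features of a set of source policies have already been learned and evaluated at the current state), the Q-value of every source policy on a new task is obtained by a dot product with the new reward mapper, and the agent acts greedily over the best of them:

```python
import numpy as np

def gpi_action(psi_source, w_target):
    """Generalized policy improvement with successor features (sketch).

    psi_source: array of shape (num_policies, num_actions, d); the successor
                features psi^{pi_j}(s, a) of each source policy at the
                current state s (assumed precomputed or predicted).
    w_target:   reward mapper of the target task, shape (d,).
    Returns the action maximizing max_j psi^{pi_j}(s, a) . w_target.
    """
    # Under the linearity assumption, Q_j(s, a) = psi^{pi_j}(s, a)^T w_target
    q_values = psi_source @ w_target          # (num_policies, num_actions)
    # GPI: pick the action whose best source policy promises the highest value
    return int(np.argmax(q_values.max(axis=0)))

# usage: 3 source policies, 4 actions, feature dimension d = 5
psi_at_state = np.random.rand(3, 4, 5)
w_new_task = np.random.rand(5)
a = gpi_action(psi_at_state, w_new_task)
```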
An intriguing question faced by the SR approach is: is there a way that evades the linearity assumption about reward functions and still enables learning the SR without extra modular cost? An extension of SR [67] answered this question affirmatively, proving that the reward functions do not necessarily have to follow the linear structure, yet at the cost of a looser performance lower-bound when applying the GPI approach for policy improvement. In particular, rather than learning a reward-agnostic latent feature ϕ(s, a, s′) ∈ R^d for multiple tasks, [67] learns a matrix ϕ(s, a, s′) ∈ R^{D×d} that encodes the basis functions of the latent space, where D is the number of seen tasks. Assuming that k out of the D tasks are linearly independent, this matrix forms k basis functions for the latent space. Therefore, for any unseen task M_i, its latent features, as well as its reward function r_i(s, a, s′), can be built as a linear combination of these basis functions. Based on this idea of basis functions for a task's latent space, they proposed that ϕ(s, a, s′) can be approximated by learning R(s, a, s′) directly, where R(s, a, s′) ∈ R^D is a vector that stacks the reward functions of the seen tasks:

R(s, a, s′) = [ r_1(s, a, s′), r_2(s, a, s′), . . . , r_D(s, a, s′) ].

Accordingly, learning ψ(s, a) for any policy π_i in M_i becomes equivalent to learning a collection of Q-functions:

ψ^{π_i}(s, a) = [ Q_1^{π_i}(s, a), Q_2^{π_i}(s, a), . . . , Q_D^{π_i}(s, a) ].
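A small toy example of this "rewards as features" view (hypothetical names; not the algorithm of [67] itself): if an unseen task's reward can be written as a weighted combination of the seen rewards, the same weights combine the stacked Q-values, by linearity of expected discounted returns.

```python
import numpy as np

def q_on_unseen_task(psi_pi_sa, alpha):
    """psi_pi_sa: (D,) vector [Q_1^pi(s,a), ..., Q_D^pi(s,a)] of a fixed policy pi;
    alpha: (D,) weights expressing the unseen reward as a combination of
    the D seen-task rewards, r_new = sum_j alpha_j * r_j."""
    # Q_new^pi(s, a) = sum_j alpha_j * Q_j^pi(s, a) = psi_pi(s, a) . alpha
    return float(psi_pi_sa @ alpha)

# usage: D = 3 seen tasks; the new task's reward is r_1 - 0.5 * r_3
psi_pi_sa = np.array([2.0, 0.7, -1.2])
alpha = np.array([1.0, 0.0, -0.5])
print(q_on_unseen_task(psi_pi_sa, alpha))  # 2.0 + 0.6 = 2.6
```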
A similar idea of using reward functions as features to represent unseen tasks was also proposed by [129], which assumes that ψ and w are observable quantities from the environment.

Universal Value Function Approximators (UVFA) [64] are an alternative approach to learning disentangled state representations. As with SR, UVFA allows TL across multiple tasks which differ only by their reward functions (goals). Different from SR, which focuses on learning a reward-agnostic state representation, UVFA aims to find a function approximator that generalizes over both states and goals. The UVFA framework is built on a specific problem setting of goal-conditional RL: task goals are defined in terms of states, i.e., given the state space S and the goal space G, it holds that G ⊆ S. One instantiation of this setting is an agent exploring different locations in a maze, where the goals are certain target locations inside the maze. Under this problem setting, a UVFA module can be decoupled into a state embedding ϕ(s) and a goal embedding ψ(g) by applying matrix factorization to a matrix of goal-conditional values that describes the task.

One merit of UVFA resides in its embedding ϕ(s), which is transferrable across tasks that differ only by their goals. Another benefit is its ability to learn continually as the set of goals keeps expanding over time. On the other hand, a key challenge of UVFA is that the matrix factorization is time-consuming, which makes it a practical concern for complex environments with a large state space |S|. Even with the learned embedding networks, a third stage of fine-tuning these networks via end-to-end training is still necessary.
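A minimal sketch of the factorization step is given below, assuming a small tabular problem where a states-by-goals value matrix has already been estimated (hypothetical helper factorize_value_matrix; a truncated SVD stands in for the low-rank factorization, whereas the actual pipeline must also handle partially observed matrices):

```python
import numpy as np

def factorize_value_matrix(V, rank):
    """Factorize a (num_states, num_goals) value matrix into a state
    embedding phi(s) and a goal embedding psi(g), so that
    V[s, g] ~= phi_s . psi_g."""
    U, S, Vt = np.linalg.svd(V, full_matrices=False)
    phi = U[:, :rank] * np.sqrt(S[:rank])        # (num_states, rank)
    psi = Vt[:rank, :].T * np.sqrt(S[:rank])     # (num_goals, rank)
    return phi, psi

# usage: 6 states, 4 goals
V = np.random.rand(6, 4)
phi, psi = factorize_value_matrix(V, rank=2)
approx = phi @ psi.T   # low-rank reconstruction of the value matrix
```

In the full pipeline, two networks would then be regressed toward these embedding targets and fine-tuned end-to-end, corresponding to the third stage mentioned above.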
UVFA has been connected to SR by [67], in which a set of independent rewards (tasks) themselves can be used as features for state representations. Another extension that combines UVFA with SR is the Universal Successor Feature Approximator (USFA) proposed by [130]. Following the same linearity assumption, USFA is defined as a function over a triplet of state, action, and a policy embedding z: ϕ(s, a, z) : S × A × R^k → R^d, where z is the output of a policy-encoding mapping z = e(π) : S × A → R^k. Based on USFA, the Q-function of any policy π on a task specified by w can be formulated as the product of a reward-agnostic Universal Successor Feature (USF) ψ and a reward mapper w: Q(s, a, w, z) = ψ(s, a, z)⊤ w. Facilitated by the disentangled rewards and policy generalization, [130] further introduced a generalized TD-error as a function over tasks w and policies z, which allows them to approximate the Q-function of any policy on any task using a TD-algorithm.
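To illustrate how the policy embedding enters such a successor-feature TD update, the following is a toy tabular stand-in with hypothetical array shapes (not the neural parameterization of [130]): ψ is updated toward a vector-valued target, and dotting it with any task's w afterwards recovers Q(s, a, w, z).

```python
import numpy as np

def usfa_td_update(psi, phi, s, a, s_next, a_next, z, gamma=0.99, alpha=0.1):
    """One TD update of a toy, tabular 'universal successor feature'
    psi[s, a, z] in R^d toward the vector-valued target
    phi(s, a, s') + gamma * psi[s', a', z],
    where a' is the action chosen by the policy encoded by z."""
    target = phi[s, a, s_next] + gamma * psi[s_next, a_next, z]
    psi[s, a, z] += alpha * (target - psi[s, a, z])
    return psi

# toy shapes: 4 states, 2 actions, 3 policy embeddings, feature dim d = 5
num_s, num_a, num_z, d = 4, 2, 3, 5
psi = np.zeros((num_s, num_a, num_z, d))
phi = np.random.rand(num_s, num_a, num_s, d)   # transition features phi(s, a, s')
psi = usfa_td_update(psi, phi, s=0, a=1, s_next=2, a_next=0, z=1)

w = np.random.rand(d)    # reward mapper of some task
q = psi[0, 1, 1] @ w     # Q(s=0, a=1, w, z=1) = psi(s, a, z)^T w
```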
5.5.3 Summary and Discussion

We provide a summary of the discussed work in this section in Table 5. Representation transfer can facilitate TL in multiple ways, based on assumptions about certain task-invariant properties. Some approaches assume that tasks differ only in terms of their reward distributions. Other, stronger assumptions include (i) decoupling dynamics, rewards [110], or policies [130] from the Q-function representation, and (ii) the feasibility of defining tasks in terms of states [130]. Based on those assumptions, approaches such as TD-algorithms [67] or matrix factorization [64] become applicable for learning such disentangled representations. To further exploit the effectiveness of the disentangled structure, we consider generalization approaches, which allow for changing dynamics or state distributions, to be important future work that deserves more attention in this domain.

Most of the discussed work in this section tackles multi-task RL or meta-RL scenarios, hence the agent's generalization ability is extensively investigated. For instance, methods based on modular networks largely evaluated the zero-shot performance from the meta-RL perspective [52, 130]. Given a fixed number of training epochs (pe), the transfer ratio (tr) is manifested differently among these methods: it can be the relative performance of a modular net architecture compared with a baseline, or the accumulated return in modified target domains, where reward scores are negated for evaluating dynamics transfer. Performance sensitivity (ps) is also broadly studied to estimate the robustness of TL: [110] analyzed the performance sensitivity given varying source tasks, while [130] studied the performance on different unseen target domains.

There are unresolved questions in this intriguing research topic. One is how to handle drastic changes of reward functions between domains. As discussed in [131], good policies in one MDP may perform poorly in another, since beneficial states or actions in M_s may become detrimental in M_t under totally different reward functions. Learning a set of basis functions [67] to represent unseen tasks (reward functions), or decoupling policies from the Q-function representation [130], may serve as a good start for addressing this issue, as both propose a generalized latent space from which different tasks (reward functions) can be interpreted. However, the limitation of this line of work is that it is not clear how many, and what kind of, sub-tasks need to be learned to make the latent space generalizable enough.

Another question is how to generalize representation learning for TL across domains with different dynamics or state-action spaces. A learned SR might not be transferrable to an MDP with different transition dynamics, as the distribution of the occupancy measure underlying the SR may no longer hold. Potential solutions may include model-based approaches that approximate the dynamics directly, or training a latent representation space for states using multiple tasks with different dynamics for better generalization [132]. Alternatively, TL mechanisms from the supervised learning domain, such as meta-learning, which enables fast adaptation to new tasks [133], or importance sampling [134], which can compensate for prior distribution changes [10], may also shed light on this question.

6 APPLICATIONS

In this section we summarize recent applications that are closely related to using TL techniques for tackling RL domains.

Robotics learning is a prominent application domain of RL. TL approaches in this field include robot learning from demonstrations, where expert demonstrations from humans or other robots are leveraged [135]. Another is collaborative robotic training [136, 137], in which knowledge from different robots is transferred by sharing their policies and episodic demonstrations. A recent research focus in this domain is fast and robust adaptation to unseen tasks. One example towards this goal is [138], in which robust robotics policies are trained using synthetic demonstrations to handle dynamic environments. Another solution is to learn domain-invariant latent representations. Examples include [139], which learns the latent representation using 3D CAD models, and [140, 141], which are derived based on the Generative Adversarial Network. Another example is DARLA [142], a zero-shot transfer approach that learns disentangled representations robust against domain shifts. We refer readers to [70, 143] for detailed surveys along this direction.

Game Playing is a common test-bed for TL and RL algorithms. It has evolved from classical benchmarks such as grid-world games to more complex settings such as online-strategy games or video games with multimodal inputs. One example is AlphaGo, an algorithm for learning the board game Go using both TL and RL techniques [90]. AlphaGo is first pre-trained offline using expert demonstrations and then learns to optimize its policy using Monte-Carlo Tree Search. Its successor, AlphaGo Master [144], even beat the world's top-ranked human player. TL-DRL approaches are also thriving in video game playing. Notably, OpenAI has trained Dota2 agents that can surpass human experts [145]. State-of-the-art platforms include MineCraft, Atari, and StarCraft. [146] designed new RL benchmarks under the MineCraft platform, and [147] provided a comprehensive survey on DL applications in video game playing, which also covers TL and RL strategies from certain perspectives. A large portion of the TL approaches reviewed in this survey have been applied to the Atari platform [148].

Natural Language Processing (NLP) has evolved rapidly along with the advancement of DL and RL. Applications of RL to NLP range widely, from Question Answering (QA) [149], Dialogue Systems [150], and Machine Translation [151], to integrations of NLP and Computer Vision tasks, such as Visual Question Answering (VQA) [152] and Image Captioning [153]. Many NLP applications have implicitly applied TL approaches. Examples include learning from expert demonstrations for Spoken Dialogue Systems [154] and VQA [152]; reward shaping for Sequence Generation [155], Spoken Dialogue Systems [80], QA [81, 156], and Image Captioning [153]; and transferring policies for Structured Prediction [157] and VQA [158].
Methods | Representation format | Assumptions | MDP difference | Learner | Evaluation metrics
Progressive Net [122] | Lateral connections to previously learned network modules | N/A | S, A | A3C | ap, ar, pe, ps, tr
PathNet [123] | Selected neural paths | N/A | S, A | A3C | ap, ar, pe, tr
Modular Net [52] | Task(agent)-specific network modules | Disentangled state representation | S, A | Policy Gradient | ap, ar, pe, tt
Modular Net [125] | Dynamic transition module learned on state latent representations | N/A | S, A | A3C | ap, ar, pe, tr, ps
SR [110] | SF | Reward function can be linearly decoupled | R | DQN | ap, ar, nka, ps
SR [127] | Encoder-decoder learned SF | N/A | R | DQN | ap, ar, pe, ps
SR [67] | Encoder-decoder learned SF | Rewards can be represented by a set of basis functions | R | Q(λ) | ap, pe
UVFA [64] | Matrix-factorized UF | Goal-conditional RL | R | Tabular Q-learning | ap, ar, pe, ps
UVFA with SR [130] | Policy-encoded UF | Reward function can be linearly decoupled | R | ϵ-greedy Q-learning | ap, ar, pe

TABLE 5: A comparison of TL approaches for representation transfer.
[18] C. Yu, A. Velu, E. Vinitsky, J. Gao, Y. Wang, A. Bayen, and Y. Wu, "The surprising effectiveness of ppo in cooperative multi-agent games," NeurIPS, vol. 35, pp. 24611–24624, 2022.
[19] I. Kostrikov, K. K. Agrawal, D. Dwibedi, S. Levine, and J. Tompson, "Discriminator-actor-critic: Addressing sample inefficiency and reward bias in adversarial imitation learning," arXiv preprint arXiv:1809.02925, 2018.
[20] G. A. Rummery and M. Niranjan, On-line Q-learning using connectionist systems. University of Cambridge, Department of Engineering, Cambridge, England, 1994.
[21] H. Van Seijen, H. Van Hasselt, S. Whiteson, and M. Wiering, "A theoretical and empirical analysis of expected sarsa," IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 2009.
[22] V. Konda and J. Tsitsiklis, "Actor-critic algorithms," NeurIPS, 2000.
[23] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, "Asynchronous methods for deep reinforcement learning," ICML, 2016.
[24] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, "Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor," International Conference on Machine Learning, 2018.
[25] C. J. Watkins and P. Dayan, "Q-learning," Machine Learning, 1992.
[26] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, 2015.
[27] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver, "Rainbow: Combining improvements in deep reinforcement learning," AAAI, 2018.
[28] R. J. Williams, "Simple statistical gradient-following algorithms for connectionist reinforcement learning," Machine Learning, 1992.
[29] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz, "Trust region policy optimization," ICML, 2015.
[30] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, "Proximal policy optimization algorithms," arXiv preprint arXiv:1707.06347, 2017.
[31] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, and M. Riedmiller, "Deterministic policy gradient algorithms," 2014.
[32] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, "Continuous control with deep reinforcement learning," arXiv preprint arXiv:1509.02971, 2015.
[33] S. Fujimoto, H. Van Hoof, and D. Meger, "Addressing function approximation error in actor-critic methods," arXiv preprint arXiv:1802.09477, 2018.
[34] A. Nagabandi, G. Kahn, R. S. Fearing, and S. Levine, "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning," in 2018 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2018, pp. 7559–7566.
[35] Z. I. Botev, D. P. Kroese, R. Y. Rubinstein, and P. L'Ecuyer, "The cross-entropy method for optimization," in Handbook of Statistics. Elsevier, 2013, vol. 31, pp. 35–59.
[36] K. Chua, R. Calandra, R. McAllister, and S. Levine, "Deep reinforcement learning in a handful of trials using probabilistic dynamics models," NeurIPS, vol. 31, 2018.
[37] R. S. Sutton, "Integrated architectures for learning, planning, and reacting based on approximating dynamic programming," in Machine Learning Proceedings 1990. Elsevier, 1990, pp. 216–224.
[38] V. Feinberg, A. Wan, I. Stoica, M. I. Jordan, J. E. Gonzalez, and S. Levine, "Model-based value estimation for efficient model-free reinforcement learning," arXiv preprint arXiv:1803.00101, 2018.
[39] S. Levine and V. Koltun, "Guided policy search," in International Conference on Machine Learning. PMLR, 2013, pp. 1–9.
[40] H. Bharadhwaj, K. Xie, and F. Shkurti, "Model-predictive control via cross-entropy and gradient-based optimization," in Learning for Dynamics and Control. PMLR, 2020, pp. 277–286.
[41] M. Deisenroth and C. E. Rasmussen, "Pilco: A model-based and data-efficient approach to policy search," in Proceedings of the 28th International Conference on Machine Learning (ICML-11), 2011, pp. 465–472.
[42] Y. Gal, R. McAllister, and C. E. Rasmussen, "Improving pilco with bayesian neural network dynamics models," in Data-Efficient Machine Learning Workshop, ICML, vol. 4, no. 34, 2016, p. 25.
[43] C. H. Lampert, H. Nickisch, and S. Harmeling, "Learning to detect unseen object classes by between-class attribute transfer," IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[44] P. Dayan and G. E. Hinton, "Feudal reinforcement learning," NeurIPS, 1993.
[45] R. S. Sutton, D. Precup, and S. Singh, "Between mdps and semi-mdps: A framework for temporal abstraction in reinforcement learning," Artificial Intelligence, 1999.
[46] R. Parr and S. J. Russell, "Reinforcement learning with hierarchies of machines," NeurIPS, 1998.
[47] T. G. Dietterich, "Hierarchical reinforcement learning with the maxq value function decomposition," Journal of Artificial Intelligence Research, 2000.
[48] A. Lazaric and M. Ghavamzadeh, "Bayesian multi-task reinforcement learning," in ICML - 27th International Conference on Machine Learning. Omnipress, 2010, pp. 599–606.
[49] Y. Zhang and Q. Yang, "A survey on multi-task learning," IEEE Transactions on Knowledge and Data Engineering, vol. 34, no. 12, pp. 5586–5609, 2021.
[50] Y. Teh, V. Bapst, W. M. Czarnecki, J. Quan, J. Kirkpatrick, R. Hadsell, N. Heess, and R. Pascanu, "Distral: Robust multitask reinforcement learning," NeurIPS, 2017.
[51] E. Parisotto, J. L. Ba, and R. Salakhutdinov, "Actor-mimic: Deep multitask and transfer reinforcement learning," ICLR, 2016.
[52] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine, "Learning modular neural network policies for multi-task and multi-robot transfer," 2017 IEEE International Conference on Robotics and Automation (ICRA), 2017.
[53] J. Andreas, D. Klein, and S. Levine, "Modular multitask reinforcement learning with policy sketches," ICML, 2017.
[54] R. Yang, H. Xu, Y. Wu, and X. Wang, "Multi-task reinforcement learning with soft modularization," NeurIPS, vol. 33, pp. 4767–4777, 2020.
[55] T. Hospedales, A. Antoniou, P. Micaelli, and A. Storkey, "Meta-learning in neural networks: A survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 5149–5169, 2021.
[56] Z. Jia, X. Li, Z. Ling, S. Liu, Y. Wu, and H. Su, "Improving policy optimization with generalist-specialist learning," in International Conference on Machine Learning. PMLR, 2022, pp. 10104–10119.
[57] W. Ding, H. Lin, B. Li, and D. Zhao, "Generalizing goal-conditioned reinforcement learning with variational causal reasoning," arXiv preprint arXiv:2207.09081, 2022.
[58] R. Kirk, A. Zhang, E. Grefenstette, and T. Rocktäschel, "A survey of zero-shot generalisation in deep reinforcement learning," Journal of Artificial Intelligence Research, vol. 76, pp. 201–264, 2023.
[59] B. Kim, A.-m. Farahmand, J. Pineau, and D. Precup, "Learning from limited demonstrations," NeurIPS, 2013.
[60] W. Czarnecki, R. Pascanu, S. Osindero, S. Jayakumar, G. Swirszcz, and M. Jaderberg, "Distilling policy distillation," The 22nd International Conference on Artificial Intelligence and Statistics, 2019.
[61] A. Y. Ng, D. Harada, and S. Russell, "Policy invariance under reward transformations: Theory and application to reward shaping," ICML, 1999.
[62] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," NeurIPS, pp. 2672–2680, 2014.
[63] Z. Zhu, K. Lin, B. Dai, and J. Zhou, "Learning sparse rewarded tasks from sub-optimal demonstrations," arXiv preprint arXiv:2004.00530, 2020.
[64] T. Schaul, D. Horgan, K. Gregor, and D. Silver, "Universal value function approximators," ICML, 2015.
[65] C. Finn and S. Levine, "Meta-learning: from few-shot learning to rapid reinforcement learning," ICML, 2019.
[66] M. E. Taylor, P. Stone, and Y. Liu, "Transfer learning via inter-task mappings for temporal difference learning," Journal of Machine Learning Research, 2007.
[67] A. Barreto, D. Borsa, J. Quan, T. Schaul, D. Silver, M. Hessel, D. Mankowitz, A. Žídek, and R. Munos, "Transfer in deep reinforcement learning using successor features and generalised policy improvement," ICML, 2018.
[68] Z. Zhu, K. Lin, B. Dai, and J. Zhou, "Off-policy imitation learning from observations," NeurIPS, 2020.
[69] J. Ho and S. Ermon, "Generative adversarial imitation learning," NeurIPS, 2016.
[70] W. Zhao, J. P. Queralta, and T. Westerlund, "Sim-to-real transfer in deep reinforcement learning for robotics: a survey," in 2020 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2020, pp. 737–744.
[71] M. Muller-Brockhausen, M. Preuss, and A. Plaat, "Procedural content generation: Better benchmarks for transfer reinforcement learning," in 2021 IEEE Conference on Games (CoG). IEEE, 2021, pp. 01–08.
[72] N. Vithayathil Varghese and Q. H. Mahmoud, "A survey of multi-task deep reinforcement learning," Electronics, vol. 9, no. 9, p. 1363, 2020.
[73] R. J. Williams and L. C. Baird, "Tight performance bounds on greedy policies based on imperfect value functions," Tech. Rep., 1993.
[74] E. Wiewiora, G. W. Cottrell, and C. Elkan, "Principled methods for advising reinforcement learning agents," ICML, 2003.
[75] S. M. Devlin and D. Kudenko, "Dynamic potential-based reward shaping," ICAAMAS, 2012.
[76] A. Harutyunyan, S. Devlin, P. Vrancx, and A. Nowé, "Expressing arbitrary reward functions as potential-based advice," AAAI, 2015.
[77] T. Brys, A. Harutyunyan, M. E. Taylor, and A. Nowé, "Policy transfer using reward shaping," ICAAMS, 2015.
[78] M. Večerík, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot, N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, "Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards," arXiv preprint arXiv:1707.08817, 2017.
[79] A. C. Tenorio-Gonzalez, E. F. Morales, and L. Villaseñor-Pineda, "Dynamic reward shaping: Training a robot by voice," Advances in Artificial Intelligence – IBERAMIA, 2010.
[80] P.-H. Su, D. Vandyke, M. Gasic, N. Mrksic, T.-H. Wen, and S. Young, "Reward shaping with recurrent neural networks for speeding up on-line policy learning in spoken dialogue systems," arXiv preprint arXiv:1508.03391, 2015.
[81] X. V. Lin, R. Socher, and C. Xiong, "Multi-hop knowledge graph reasoning with reward shaping," arXiv preprint arXiv:1808.10568, 2018.
[82] S. Devlin, L. Yliniemi, D. Kudenko, and K. Tumer, "Potential-based difference rewards for multiagent reinforcement learning," ICAAMS, 2014.
[83] M. Grzes and D. Kudenko, "Learning shaping rewards in model-based reinforcement learning," Proc. AAMAS Workshop on Adaptive Learning Agents, 2009.
[84] O. Marom and B. Rosman, "Belief reward shaping in reinforcement learning," AAAI, 2018.
[85] F. Liu, Z. Ling, T. Mu, and H. Su, "State alignment-based imitation learning," arXiv preprint arXiv:1911.10947, 2019.
[86] K. Kim, Y. Gu, J. Song, S. Zhao, and S. Ermon, "Domain adaptive imitation learning," ICML, 2020.
[87] Y. Ma, Y.-X. Wang, and B. Narayanaswamy, "Imitation-regularized offline learning," International Conference on Artificial Intelligence and Statistics, 2019.
[88] M. Yang and O. Nachum, "Representation matters: Offline pretraining for sequential decision making," arXiv preprint arXiv:2102.05815, 2021.
[89] X. Zhang and H. Ma, "Pretraining deep actor-critic reinforcement learning algorithms with expert demonstrations," arXiv preprint arXiv:1801.10459, 2018.
[90] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of go with deep neural networks and tree search," Nature, 2016.
[91] S. Schaal, "Learning from demonstration," NeurIPS, 1997.
[92] T. Hester, M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband et al., "Deep q-learning from demonstrations," AAAI, 2018.
[93] A. Nair, B. McGrew, M. Andrychowicz, W. Zaremba, and P. Abbeel, "Overcoming exploration in reinforcement learning with demonstrations," IEEE International Conference on Robotics and Automation (ICRA), 2018.
[94] J. Chemali and A. Lazaric, "Direct policy iteration with demonstrations," International Joint Conference on Artificial Intelligence, 2015.
[95] B. Piot, M. Geist, and O. Pietquin, "Boosted bellman residual minimization handling expert demonstrations," Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2014.
[96] T. Brys, A. Harutyunyan, H. B. Suay, S. Chernova, M. E. Taylor, and A. Nowé, "Reinforcement learning from demonstration through shaping," International Joint Conference on Artificial Intelligence, 2015.
[97] B. Kang, Z. Jie, and J. Feng, "Policy optimization with demonstrations," ICML, 2018.
[98] D. P. Bertsekas, "Approximate policy iteration: A survey and some new methods," Journal of Control Theory and Applications, 2011.
[99] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, "Prioritized experience replay," ICLR, 2016.
[100] S. Ross, G. Gordon, and D. Bagnell, "A reduction of imitation learning and structured prediction to no-regret online learning," AISTATS, 2011.
[101] Y. Gao, J. Lin, F. Yu, S. Levine, T. Darrell et al., "Reinforcement learning from imperfect demonstrations," arXiv preprint arXiv:1802.05313, 2018.
[102] M. Jing, X. Ma, W. Huang, F. Sun, C. Yang, B. Fang, and H. Liu, "Reinforcement learning from imperfect demonstrations under soft expert guidance," AAAI, 2020.
[103] K. Brantley, W. Sun, and M. Henaff, "Disagreement-regularized imitation learning," ICLR, 2019.
[104] G. Hinton, O. Vinyals, and J. Dean, "Distilling the knowledge in a neural network," Deep Learning and Representation Learning Workshop, NeurIPS, 2014.
[105] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, and R. Hadsell, "Policy distillation," arXiv preprint arXiv:1511.06295, 2015.
[106] H. Yin and S. J. Pan, "Knowledge transfer for deep reinforcement learning with hierarchical experience replay," AAAI, 2017.
[107] S. Schmitt, J. J. Hudson, A. Zidek, S. Osindero, C. Doersch, W. M. Czarnecki, J. Z. Leibo, H. Kuttler, A. Zisserman, K. Simonyan et al., "Kickstarting deep reinforcement learning," arXiv preprint arXiv:1803.03835, 2018.
[108] J. Schulman, X. Chen, and P. Abbeel, "Equivalence between policy gradients and soft q-learning," arXiv preprint arXiv:1704.06440, 2017.
[109] F. Fernández and M. Veloso, "Probabilistic policy reuse in a reinforcement learning agent," Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems, 2006.
[110] A. Barreto, W. Dabney, R. Munos, J. J. Hunt, T. Schaul, H. P. van Hasselt, and D. Silver, "Successor features for transfer in reinforcement learning," NeurIPS, 2017.
[111] R. Bellman, "Dynamic programming," Science, 1966.
[112] L. Torrey, T. Walker, J. Shavlik, and R. Maclin, "Using advice to transfer knowledge acquired in one reinforcement learning task to another," European Conference on Machine Learning, 2005.
[113] A. Gupta, C. Devin, Y. Liu, P. Abbeel, and S. Levine, "Learning invariant feature spaces to transfer skills with reinforcement learning," ICLR, 2017.
[114] G. Konidaris and A. Barto, "Autonomous shaping: Knowledge transfer in reinforcement learning," ICML, 2006.
[115] H. B. Ammar and M. E. Taylor, "Reinforcement learning transfer via common subspaces," Proceedings of the 11th International Conference on Adaptive and Learning Agents, 2012.
[116] V. Badrinarayanan, A. Kendall, and R. Cipolla, "Segnet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[117] C. Wang and S. Mahadevan, "Manifold alignment without correspondence," International Joint Conference on Artificial Intelligence, 2009.
[118] B. Bocsi, L. Csató, and J. Peters, "Alignment-based transfer learning for robot models," The 2013 International Joint Conference on Neural Networks (IJCNN), 2013.
[119] H. B. Ammar, E. Eaton, P. Ruvolo, and M. E. Taylor, "Unsupervised cross-domain transfer in policy gradient reinforcement learning via manifold alignment," AAAI, 2015.
[120] H. B. Ammar, K. Tuyls, M. E. Taylor, K. Driessens, and G. Weiss, "Reinforcement learning transfer via sparse coding," ICAAMS, 2012.
[121] A. Lazaric, M. Restelli, and A. Bonarini, "Transfer of samples in batch reinforcement learning," ICML, 2008.
[122] A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell, "Progressive neural networks," arXiv preprint arXiv:1606.04671, 2016.
[123] C. Fernando, D. Banarse, C. Blundell, Y. Zwols, D. Ha, A. A. Rusu, A. Pritzel, and D. Wierstra, "Pathnet: Evolution channels gradient descent in super neural networks," arXiv preprint arXiv:1701.08734, 2017.
[124] I. Harvey, "The microbial genetic algorithm," European Conference on Artificial Life, 2009.
[125] A. Zhang, H. Satija, and J. Pineau, "Decoupling dynamics and reward for transfer learning," arXiv preprint arXiv:1804.10689, 2018.
[126] P. Dayan, "Improving generalization for temporal difference learning: The successor representation," Neural Computation, 1993.
[127] T. D. Kulkarni, A. Saeedi, S. Gautam, and S. J. Gershman, "Deep successor reinforcement learning," arXiv preprint arXiv:1606.02396, 2016.
[128] J. Zhang, J. T. Springenberg, J. Boedecker, and W. Burgard, "Deep reinforcement learning with successor features for navigation across similar environments," IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2017.
[129] N. Mehta, S. Natarajan, P. Tadepalli, and A. Fern, "Transfer in variable-reward hierarchical reinforcement learning," Machine Learning, 2008.
[130] D. Borsa, A. Barreto, J. Quan, D. Mankowitz, R. Munos, H. van Hasselt, D. Silver, and T. Schaul, "Universal successor features approximators," ICLR, 2019.
[131] L. Lehnert, S. Tellex, and M. L. Littman, "Advantages and limitations of using successor features for transfer in reinforcement learning," arXiv preprint arXiv:1708.00102, 2017.
[132] J. C. Petangoda, S. Pascual-Diaz, V. Adam, P. Vrancx, and J. Grau-Moya, "Disentangled skill embeddings for reinforcement learning," arXiv preprint arXiv:1906.09223, 2019.
[133] C. Finn, P. Abbeel, and S. Levine, "Model-agnostic meta-learning for fast adaptation of deep networks," ICML, 2017.
[134] B. Zadrozny, "Learning and evaluating classifiers under sample selection bias," ICML, 2004.
[135] B. D. Argall, S. Chernova, M. Veloso, and B. Browning, "A survey of robot learning from demonstration," Robotics and Autonomous Systems, 2009.
[136] B. Kehoe, S. Patil, P. Abbeel, and K. Goldberg, "A survey of research on cloud robotics and automation," IEEE Transactions on Automation Science and Engineering, 2015.
[137] S. Gu, E. Holly, T. Lillicrap, and S. Levine, "Deep reinforcement learning for robotic manipulation with asynchronous off-policy updates," IEEE International Conference on Robotics and Automation (ICRA), 2017.
[138] W. Yu, J. Tan, C. K. Liu, and G. Turk, "Preparing for the unknown: Learning a universal policy with online system identification," arXiv preprint arXiv:1702.02453, 2017.
[139] F. Sadeghi and S. Levine, "Cad2rl: Real single-image flight without a single real image," arXiv preprint arXiv:1611.04201, 2016.
[140] K. Bousmalis, A. Irpan, P. Wohlhart, Y. Bai, M. Kelcey, M. Kalakrishnan, L. Downs, J. Ibarz, P. Pastor, K. Konolige et al., "Using simulation and domain adaptation to improve efficiency of deep robotic grasping," IEEE International Conference on Robotics and Automation (ICRA), 2018.
[141] H. Bharadhwaj, Z. Wang, Y. Bengio, and L. Paull, "A data-efficient framework for training and sim-to-real transfer of navigation policies," International Conference on Robotics and Automation (ICRA), 2019.
[142] I. Higgins, A. Pal, A. Rusu, L. Matthey, C. Burgess, A. Pritzel, M. Botvinick, C. Blundell, and A. Lerchner, "Darla: Improving zero-shot transfer in reinforcement learning," ICML, 2017.
[143] J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement learning in robotics: A survey," The International Journal of Robotics Research, 2013.
[144] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton et al., "Mastering the game of go without human knowledge," Nature, 2017.
[145] OpenAI. (2019) Dota2 blog. [Online]. Available: https://openai.com/blog/openai-five/
[146] J. Oh, V. Chockalingam, S. Singh, and H. Lee, "Control of memory, active perception, and action in minecraft," arXiv preprint arXiv:1605.09128, 2016.
[147] N. Justesen, P. Bontrager, J. Togelius, and S. Risi, "Deep learning for video game playing," IEEE Transactions on Games, 2019.
[148] V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller, "Playing atari with deep reinforcement learning," arXiv preprint arXiv:1312.5602, 2013.
[149] H. Chen, X. Liu, D. Yin, and J. Tang, "A survey on dialogue systems: Recent advances and new frontiers," ACM SIGKDD Explorations Newsletter, 2017.
[150] S. P. Singh, M. J. Kearns, D. J. Litman, and M. A. Walker, "Reinforcement learning for spoken dialogue systems," NeurIPS, 2000.
[151] B. Zoph and Q. V. Le, "Neural architecture search with reinforcement learning," arXiv preprint arXiv:1611.01578, 2016.
[152] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko, "Learning to reason: End-to-end module networks for visual question answering," IEEE International Conference on Computer Vision, 2017.
[153] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li, "Deep reinforcement learning-based image captioning with embedding reward," IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[154] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein, "Learning to compose neural networks for question answering," arXiv preprint arXiv:1601.01705, 2016.
[155] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio, "An actor-critic algorithm for sequence prediction," arXiv preprint arXiv:1607.07086, 2016.
[156] F. Godin, A. Kumar, and A. Mittal, "Learning when not to answer: a ternary reward structure for reinforcement learning based question answering," Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2019.
[157] K.-W. Chang, A. Krishnamurthy, A. Agarwal, J. Langford, and H. Daumé III, "Learning to search better than your teacher," 2015.
[158] J. Lu, A. Kannan, J. Yang, D. Parikh, and D. Batra, "Best of both worlds: Transferring knowledge from discriminative learning to a generative visual dialog model," NeurIPS, 2017.
[159] OpenAI, "Gpt-4 technical report," arXiv, 2023.
[160] A. Glaese, N. McAleese, M. Trebacz, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker et al., "Improving alignment of dialogue agents via targeted human judgements," arXiv preprint arXiv:2209.14375, 2022.
[161] A. Chowdhery, S. Narang, J. Devlin, M. Bosma, G. Mishra, A. Roberts, P. Barham, H. W. Chung, C. Sutton, S. Gehrmann et al., "Palm: Scaling language modeling with pathways," arXiv preprint arXiv:2204.02311, 2022.
[162] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du et al., "Lamda: Language models for dialog applications," arXiv preprint arXiv:2201.08239, 2022.
[163] C. Yu, J. Liu, and S. Nemati, "Reinforcement learning in healthcare: A survey," arXiv preprint arXiv:1908.08796, 2019.
[164] A. Alansary, O. Oktay, Y. Li, L. Le Folgoc, B. Hou, G. Vaillant, K. Kamnitsas, A. Vlontzos, B. Glocker, B. Kainz et al., "Evaluating reinforcement learning agents for anatomical landmark detection," 2019.
[165] K. Ma, J. Wang, V. Singh, B. Tamersoy, Y.-J. Chang, A. Wimmer, and T. Chen, "Multimodal image registration with deep context reinforcement learning," International Conference on Medical Image Computing and Computer-Assisted Intervention, 2017.
[166] T. S. M. T. Gomes, "Reinforcement learning for primary care e appointment scheduling," 2017.
[167] A. Serrano, B. Imbernón, H. Pérez-Sánchez, J. M. Cecilia, A. Bueno-Crespo, and J. L. Abellán, "Accelerating drugs discovery with deep reinforcement learning: An early approach," International Conference on Parallel Processing Companion, 2018.
[168] M. Popova, O. Isayev, and A. Tropsha, "Deep reinforcement learning for de novo drug design," Science Advances, 2018.
[169] A. E. Gaweda, M. K. Muezzinoglu, G. R. Aronoff, A. A. Jacobs, J. M. Zurada, and M. E. Brier, "Incorporating prior knowledge into q-learning for drug delivery individualization," Fourth International Conference on Machine Learning and Applications, 2005.
[170] T. W. Killian, S. Daulton, G. Konidaris, and F. Doshi-Velez, "Robust and efficient transfer learning with hidden parameter markov decision processes," NeurIPS, 2017.
[171] A. Holzinger, "Interactive machine learning for health informatics: when do we need the human-in-the-loop?" Brain Informatics, 2016.
[172] L. Li, Y. Lv, and F.-Y. Wang, "Traffic signal timing via deep reinforcement learning," IEEE/CAA Journal of Automatica Sinica, 2016.
[173] K. Lin, R. Zhao, Z. Xu, and J. Zhou, "Efficient large-scale fleet management via multi-agent deep reinforcement learning," ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2018.
[174] K.-L. A. Yau, J. Qadir, H. L. Khoo, M. H. Ling, and P. Komisarczuk, "A survey on reinforcement learning models and algorithms for traffic signal control," ACM Computing Surveys (CSUR), 2017.
[175] J. Moody, L. Wu, Y. Liao, and M. Saffell, "Performance functions and reinforcement learning for trading systems and portfolios," Journal of Forecasting, 1998.
[176] Z. Jiang and J. Liang, "Cryptocurrency portfolio management with deep reinforcement learning," IEEE Intelligent Systems Conference (IntelliSys), 2017.
[177] R. Neuneier, "Enhancing q-learning for optimal asset allocation," NeurIPS, 1998.
[178] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai, "Deep direct reinforcement learning for financial signal representation and trading," IEEE Transactions on Neural Networks and Learning Systems, 2016.
[179] G. Dalal, E. Gilboa, and S. Mannor, "Hierarchical decision making in electricity grid management," International Conference on Machine Learning, 2016.
[180] F. Ruelens, B. J. Claessens, S. Vandael, B. De Schutter, R. Babuška, and R. Belmans, "Residential demand response of thermostatically controlled loads using batch reinforcement learning," IEEE Transactions on Smart Grid, 2016.
[181] Z. Wen, D. O'Neill, and H. Maei, "Optimal demand response using device-based reinforcement learning," IEEE Transactions on Smart Grid, 2015.
[182] Y. Li, J. Song, and S. Ermon, "Infogail: Interpretable imitation learning from visual demonstrations," NeurIPS, 2017.
[183] R. Ramakrishnan and J. Shah, "Towards interpretable explanations for transfer learning in sequential tasks," AAAI Spring Symposium Series, 2016.
[184] E. Choi, M. T. Bahadori, J. Sun, J. Kulas, A. Schuetz, and W. Stewart, "Retain: An interpretable predictive model for healthcare using reverse time attention mechanism," NeurIPS, vol. 29, 2016.

Zhuangdi Zhu is currently a senior data and applied scientist with Microsoft. She obtained her Ph.D. degree from the Computer Science department of Michigan State University. Zhuangdi has regularly published on prestigious machine learning conferences including NeurIPS, ICML, KDD, AAAI, etc. Her research interests reside in both fundamental and applied machine learning. Her current research involves reinforcement learning and distributed machine learning.

Kaixiang Lin is an applied scientist at Amazon Web Services. He obtained his Ph.D. from Michigan State University. He has broad research interests across multiple fields, including reinforcement learning, human-robot interactions, and natural language processing. His research has been published on multiple top-tiered machine learning and data mining conferences such as ICLR, KDD, NeurIPS, etc. He serves as a reviewer for top machine learning conferences regularly.

Anil K. Jain is a University Distinguished Professor in the Department of Computer Science and Engineering at Michigan State University. His research interests include pattern recognition and biometric authentication. He served as the editor-in-chief of the IEEE Transactions on Pattern Analysis and Machine Intelligence and was a member of the United States Defense Science Board. He has received Fulbright, Guggenheim, Alexander von Humboldt, and IAPR King Sun Fu awards. He is a member of the National Academy of Engineering and a foreign fellow of the Indian National Academy of Engineering and the Chinese Academy of Sciences.

Jiayu Zhou is an associate professor in the Department of Computer Science and Engineering at Michigan State University. He received his Ph.D. degree in computer science from Arizona State University in 2014. He has broad research interests in the fields of large-scale machine learning and data mining as well as biomedical informatics. He has served as a technical program committee member for premier conferences such as NIPS, ICML, and SIGKDD. His papers have received the Best Student Paper Award at the 2014 IEEE International Conference on Data Mining (ICDM), the Best Student Paper Award at the 2016 International Symposium on Biomedical Imaging (ISBI), and the Best Paper Award at IEEE Big Data 2016.