
Imitation Bootstrapped Reinforcement Learning

Hengyuan Hu Suvir Mirchandani Dorsa Sadigh


Stanford University

Abstract—Despite the considerable potential of reinforcement learning (RL), robotic control tasks predominantly rely on imitation learning (IL) due to its better sample efficiency. However, it is costly to collect comprehensive expert demonstrations that enable IL to generalize to all possible scenarios, and any distribution shift would require recollecting data for finetuning. Therefore, RL is appealing if it can build upon IL as an efficient autonomous self-improvement procedure. We propose imitation bootstrapped reinforcement learning (IBRL), a novel framework for sample-efficient RL with demonstrations that first trains an IL policy on the provided demonstrations and then uses it to propose alternative actions for both online exploration and bootstrapping target values. Compared to prior works that oversample the demonstrations or regularize RL with an additional imitation loss, IBRL is able to utilize high quality actions from IL policies since the beginning of training, which greatly accelerates exploration and training efficiency. We evaluate IBRL on 6 simulation and 3 real-world tasks spanning various difficulty levels. IBRL significantly outperforms prior methods and the improvement is particularly more prominent in harder tasks.

I. INTRODUCTION

Despite achieving remarkable performance in many simulation domains [26, 31, 8], reinforcement learning (RL) has not been widely used in solving robotics and low level continuous control problems, especially in the real world. The main challenges of applying RL to continuous control problems are exploration and sample efficiency. In these settings, reward signals are often sparse by nature, and unlike learning in games where the sparse reward is often achievable within a fixed horizon, a randomly initialized neural policy may never finish a task, resulting in no signals for learning. Besides the hard exploration problem, RL often needs a large number of samples to converge, which hinders its adoption in the real world where massive parallel simulation is not available.

As a result, most learning-based robotics systems rely on imitation learning (IL) [4] or offline RL [20] with strong assumptions such as access to large specialized datasets. However, those methods come with their own challenges. Expert demonstrations are often expensive to collect and require access to expert operators and domain knowledge [21]. In addition, policies learned from static datasets suffer from distribution shifts when deployed in slightly different environments. Given these challenges, online RL algorithms – when carefully integrated with IL – can still play a valuable role in efficiently learning robot policies. An ideal RL algorithm for real world robotics applications should be able to benefit from human demonstrations and strong IL methods for sample-efficient learning. Moreover, it should go far beyond these IL techniques via self-improvement to reach higher performance or to address distribution shift.

The most straightforward way to use demonstration data in RL is to initialize the RL replay buffer with demonstrations and oversample those demonstrations during training [30]. This approach does not leverage the fact that IL policies trained on the demonstrations can indeed provide more useful information – they can output actions that may not be good enough to solve unseen scenarios, but can still provide some "lower bound" on the action quality when the initial RL actions are highly suboptimal. Another common approach is to pretrain the RL policy with human data and then fine-tune it with RL while applying additional regularization [12] to ensure that the knowledge from demonstrations does not get washed out quickly by the randomly initialized critics. This approach requires balancing the primary RL loss and the secondary IL regularization loss to achieve maximum performance, which may require hyper-parameter tuning that is infeasible in the real world. Additionally, this necessitates using the same architecture to fit IL and RL data, which is undesirable in complex tasks as RL and IL may require very different architectures.

We propose imitation bootstrapped reinforcement learning (IBRL), a method to effectively combine IL and RL for sample-efficient reinforcement learning. IBRL first trains a separate, standalone imitation policy on the provided demonstrations with a powerful neural network that is much deeper than the ones normally used in online RL. Then IBRL explicitly uses this IL policy in two phases to accelerate RL training. First, during the online interaction phase, both the IL policy and the RL policy propose an action, and the agent executes the action that has a higher Q-value according to the Q-function being trained by RL. Second, during the training phase of RL, the target for updating the Q-values again bootstraps from the better action among the ones proposed by either the RL or the IL policy. Similar to prior work, we also pre-fill the RL replay buffer with the demonstrations to provide learning signals before the policy collects its first online success. Fig. 1 illustrates the core idea of IBRL and how an IL policy is explicitly integrated in the interaction and training phases of RL. By keeping the IL policy separate, IBRL does not need an explicit regularization loss to prevent catastrophic forgetting and thus eliminates the need to search for proper hyperparameters to balance RL and IL. It also allows the IL policy to utilize deeper, more powerful networks that may be hard to train in RL with sparse reward. By explicitly considering actions from the IL policy, IBRL improves the quality of exploration and value estimation when the RL policy is inferior. It may also benefit from any potential generalization of the IL policy in states beyond the limited demonstration data.
[Figure] Fig. 1: Imitation-Bootstrapped Reinforcement Learning (IBRL). IBRL first trains an imitation learning policy and then uses it to propose additional actions for RL during both the online interaction phase (actor proposal) and the training phase (bootstrap proposal). We use the moving average of the online Q-function, i.e. the target Q-function Qϕ′, to decide which action to take.

We evaluate IBRL on 6 simulation and 3 real-world robotics tasks spanning various difficulty levels. All tasks use sparse 0/1 reward. IBRL matches or outperforms strong existing methods on all tasks and the improvement is more significant in harder tasks. In particular, IBRL nearly doubles the performance over the second best method in the hardest simulation task evaluated in this paper. In a challenging real-world deformable cloth hanging task, IBRL performs 2.4× better than the second best RL method. In fact, prior methods are unable to even surpass the BC baseline after 2 hours of real-world training on this task.

II. RELATED WORK

In this section, we review methods that address the sample efficiency of RL both with and without access to human demonstrations. We also cover a particularly relevant area of work that uses a reference policy in RL for various purposes.

Sample-Efficient RL. A number of recent works have greatly improved sample efficiency of RL by applying various regularization techniques. For instance, RED-Q [5] and Dropout-Q [15] apply regularization to the Q-function (critics) via ensembling or dropout so that they can be trained with a higher update-to-data (UTD) ratio (i.e., the number of updates for every transition collected), leading to faster convergence and thus higher sample efficiency. These approaches are commonly used in state-based RL, where it is computationally feasible to have a large number of independent critics made of shallow fully connected layers. For learning directly from pixel inputs, image augmentation such as random shifts [34] can instead boost performance and sample efficiency without the need of increasing the UTD ratio and thus maintains low computational cost. We apply RED-Q and image augmentation in our method, IBRL, for state- and pixel-based experiments respectively to build upon these strong foundations.

RL with Prior Demonstrations. In sparse reward settings, sample-efficient RL algorithms alone are insufficient because they are unlikely to collect any reward signal through random exploration. A common approach is to supply RL with successful prior data or human demonstrations so that it has some initial signals to learn from. The most straightforward approach that leverages demonstrations in RL is to include the demonstrations in the replay buffer and oversample the demonstrations during training with an off-policy RL algorithm [30]. Despite its simplicity, Ball et al. [2] recently have shown that this approach – Reinforcement Learning from Prior Data (RLPD) – when combined with modern sample-efficient RL techniques such as normalization, Q-ensembling, and image augmentation, outperforms many more complex RL algorithms in continuous control domains that utilize prior data. Meanwhile, Song et al. [28] provide theoretical analysis of a similar idea (Hybrid RL) and show that it is both effective and sample efficient.

Another commonly used approach is to pretrain the RL policy with demonstration data and then fine-tune it with online RL [14, 23, 22]. During RL fine-tuning, regularization is required to avoid catastrophic forgetting caused by undesirable learning signals from randomly initialized critics. Approaches such as Regularized Optimal Transport (ROT) [12] extend this idea to visual observations and integrate an optimal transport reward as well as adaptive weighting over the regularization loss. This regularized fine-tuning approach achieves strong results in simulation and real-world robotic tasks.

Apart from model-free RL, model-based RL is also well-positioned to use prior data. MoDem [13] is a model-based planning/RL method that uses demonstrations to pretrain the policy via behavioral cloning and then pretrains the world model and critic using demonstrations as well as rollouts from the pretrained BC policy. It then uses TD-MPC, a model predictive control (MPC) style planning algorithm augmented by Q-functions, to generate actions for online inference and update the Q-functions with temporal difference (TD) learning. MoDem compares favorably to a number of prior RL-with-demonstrations algorithms [23, 11, 25, 37].
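To make the demonstration-oversampling recipe described above (as used by RLPD and Hybrid RL) concrete, the minimal Python sketch below builds training batches that draw a fixed fraction from a demonstration buffer. The class name, the 50/50 ratio, and the transition format are illustrative assumptions, not details of any specific released implementation.

```python
import random

class MixedReplayBuffer:
    """Builds batches with a fixed fraction of expert transitions (illustrative)."""

    def __init__(self, demo_transitions, demo_fraction=0.5):
        self.demo = list(demo_transitions)   # fixed expert transitions
        self.online = []                     # filled during online interaction
        self.demo_fraction = demo_fraction

    def add(self, transition):
        self.online.append(transition)

    def sample(self, batch_size):
        # Oversample demos: they fill demo_fraction of every batch regardless of
        # how large the online buffer grows.
        n_demo = int(batch_size * self.demo_fraction)
        batch = random.choices(self.demo, k=n_demo)
        if self.online:
            batch += random.choices(self.online, k=batch_size - n_demo)
        return batch
```

IBRL, described in Section IV, instead pre-fills the replay buffer with the demonstrations and samples uniformly, relying on the separate IL policy rather than on oversampling.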
Compared to the three families of methods listed above, the uniqueness of our method, IBRL, stems from the use of a powerful, standalone IL policy that provides alternative high quality actions during both inference and training. In IBRL, the IL policy is directly integrated into the learning algorithm so that we no longer need to arbitrarily oversample demonstrations to overweight those learning signals. Additionally, because the IL policy is separate and will not be modified by RL gradients, IBRL eliminates the need for a carefully scheduled regularization loss that prevents the policy from forgetting. This further allows the RL and IL policies to use their own most suitable network architectures and loss formulations. Lastly, compared to the model-based approaches, IBRL achieves strong performance while incurring significantly lower computational cost, which makes it more suitable for high frequency control in the real world. As we show later in Section V, IBRL achieves superior performance over these alternative techniques.

Reference Policy in RL. Similar to IBRL, many prior works in RL and search have utilized a standalone policy (reference policy) trained on human demonstrations that is separate from the policy being trained online for various purposes. In human-AI coordination, reference policies trained from human data [1, 18] or induced from large language models [17] are used to regularize RL policy updates to stay close to human-like equilibria. In robot learning, prior works have used reference policies during online interaction to assist exploration. EfficientImitate [35] uses a fixed BC policy to propose action candidates for Monte Carlo Tree Search (MCTS) alongside actions from the policy being trained during online exploration. PEX (Policy Expansion) [38] samples actions from a mixture of the online RL policy and a reference offline RL policy during online exploration. In comparison, IBRL uses the IL reference policy in both the exploration and training stages, and we find it crucial to have both stages to achieve maximum sample efficiency and final performance. In addition, none of these prior works have been evaluated on real-world robot tasks, and PEX is only evaluated with low dimensional state inputs. We evaluate IBRL on real world robot tasks as well as simulations with both image and state inputs.

III. BACKGROUND

We consider a standard Markov decision process (MDP) consisting of a state space S, a continuous action space A = [−1, 1]^d, a deterministic state transition function T : S × A → S, a sparse reward function R : S × A → {0, 1} that returns 1 when the task is completed and 0 otherwise, and a discount factor γ.

Reinforcement Learning. IBRL builds on off-policy RL methods as they can easily consume demonstration data generated by humans. Deep RL methods for continuous action spaces jointly learn a policy (actor) πθ and one or multiple value functions (critics) Qϕ parameterized by neural networks θ and ϕ respectively. The value functions Qϕ are trained to minimize the TD-error L(ϕ) = [r_t + γ Qϕ′(s_{t+1}, πθ′(s_{t+1})) − Qϕ(s_t, a_t)]², while the policy is trained to output actions with high Q-values via L(θ) = −Qϕ(s, πθ(s)). πθ′ and Qϕ′ are target networks whose parameters θ′, ϕ′ are exponential moving averages of θ and ϕ respectively.

Imitation Learning. We assume access to a dataset D of demonstrations collected by expert human operators. Each trajectory ξ ∈ D consists of a sequence of transitions {(s_0, a_0), ..., (s_T, a_T)}. The most common IL method is behavior cloning (BC), which trains a parameterized policy µψ to minimize the negative log-likelihood of the data, i.e., L(ψ) = −E_{(s,a)∼D}[log µψ(a|s)]. In this work, we assume µψ follows an isotropic Gaussian as its action distribution for simplicity. We note that our framework can easily accommodate more powerful IL methods such as BC-RNN with a Gaussian mixture model [21]. With the isotropic assumption, the BC training objective for the policy can be formulated as the following squared loss: L(ψ) = E_{(s,a)∼D} ‖µψ(s) − a‖²_2.

IV. IMITATION BOOTSTRAPPED RL

A. Core Algorithm

The core idea of IBRL is to first train an IL policy µψ using expert demonstrations and then leverage this standalone reference IL policy in two phases in RL: 1) to help exploration during the online interaction, and 2) to help with target value estimation in TD learning (as shown in Fig. 1). We refer to the first phase as actor proposal and the second phase as bootstrap proposal.

We focus our discussion on off-policy RL methods since they often have higher sample efficiency by effectively reusing past experiences as well as human demonstrations. Most popular off-policy RL methods for continuous control, such as Soft Actor-Critic (SAC) [10] or Twin Delayed DDPG (TD3) [9], involve training Q-networks to evaluate action quality and training a separate policy network to generate actions with high Q-values. In IBRL, actor proposal generates additional actions alongside the RL policy to assist with exploration, while bootstrap proposal accelerates Q-network training.

Online Interaction: Actor Proposal. In sparse reward robotics tasks, such as picking up a block and receiving reward only when the block is picked up, randomly initialized Q-networks and policy networks may hardly obtain any successes even after a long period of interaction, resulting in no signal for learning. IBRL helps mitigate the exploration challenge by using a standalone IL policy µψ trained on human demonstrations D. IBRL uses this reference IL policy to propose an alternative action a^IL ∼ µψ(s) in addition to the action a^RL ∼ πθ(s) proposed by the RL policy at each online interaction step. Then, IBRL queries the target Q-network Qϕ′ and selects the action with the higher Q-value between the two candidates. That is, during online interaction, IBRL takes the action that has the higher Q-value between the one proposed by the imitation policy µψ and the one proposed by the RL policy πθ that is being trained:

a* = argmax_{a ∈ {a^IL, a^RL}} Qϕ′(s, a).    (1)

This is the actor proposal phase of IBRL (Fig. 1 middle).
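The actor proposal of Eq. (1) amounts to a single comparison under the target critic. The sketch below, in a PyTorch-like style, illustrates that selection; the module interfaces (il_policy, rl_actor, target_critic) and the exploration-noise handling are assumptions rather than the authors' implementation.

```python
import torch

@torch.no_grad()
def actor_proposal(obs, il_policy, rl_actor, target_critic, explore_std=0.1):
    """Pick between the IL and RL candidate actions using the target Q (Eq. 1)."""
    a_il = il_policy(obs)                                   # a^IL ~ mu_psi(s)
    a_rl = rl_actor(obs)                                    # a^RL = pi_theta(s)
    a_rl = (a_rl + explore_std * torch.randn_like(a_rl)).clamp(-1.0, 1.0)
    q_il = target_critic(obs, a_il)                         # Q_phi'(s, a^IL)
    q_rl = target_critic(obs, a_rl)                         # Q_phi'(s, a^RL)
    # argmax over the two candidates, done element-wise for a batch of states
    return torch.where(q_il >= q_rl, a_il, a_rl)
```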
RL Training: Bootstrap Proposal. Similarly, when computing the training targets for the Q-networks, instead of bootstrapping from Qϕ′(s_{t+1}, πθ′(s_{t+1})), we can bootstrap from the higher value between Qϕ′(s_{t+1}, a^IL_{t+1}) and Qϕ′(s_{t+1}, a^RL_{t+1}), where a^IL_{t+1} is sampled from the imitation policy while a^RL_{t+1} is sampled from the target actor πθ′:

Qϕ(s_t, a_t) ← r_t + γ max_{a′ ∈ {a^IL_{t+1}, a^RL_{t+1}}} Qϕ′(s_{t+1}, a′).    (2)

This essentially assumes that the future rollout will be carried out by a policy that always picks the action between {a^IL, a^RL} with the higher Q-value at every time step, which is precisely the greedy version of the exploration policy in IBRL. We refer to this phase of IBRL as bootstrap proposal (Fig. 1 right).

In summary, IBRL replaces the policy πθ in vanilla RL algorithms with a hybrid policy argmax_{a ∈ {a^IL, a^RL}} Qϕ′(s, a) in both inference and training. The idea of IBRL can be combined with any actor-critic style off-policy RL algorithm such as TD3 or SAC. In this paper, we use TD3 as our RL backbone because it has demonstrated strong performance and high sample efficiency in challenging RL from image settings [34]. Similar to prior works, we initialize the replay buffer with demonstrations but do not oversample those demonstrations. We provide detailed pseudocode of IBRL with the TD3 backbone in the Appendix.

Soft IBRL Variant. The discussion so far focuses on a greedy instantiation of IBRL that always selects the action with the higher Q-value. Although we find that this instantiation works well in practice – especially in realistic settings where the model processes raw pixels with deep image encoders – it is worth noting that, in theory, this method may get stuck in a local optimum.

Consider a tabular setting where the update of one Q(s, a) does not lead to changes in other Q-values; then the Q-value of the optimal action Q(s, a*) will never be updated if its initial value is smaller than Q(s, a^IL), leading to a suboptimal solution. This problem, however, can be easily circumvented by using a soft variant of IBRL that samples actions according to a Boltzmann distribution over Q-values instead of taking the argmax, i.e., changing Eq. (1) of actor proposal to

a* ∼ p_Q(a)    (3)

and changing Eq. (2) of bootstrap proposal to

Qϕ(s_t, a_t) ← r_t + γ Qϕ′(s_{t+1}, a′),  a′ ∼ p_Q(a_{t+1}),    (4)

where p_Q(a) ∝ exp(βQ(s, a)) for a ∈ {a^IL, a^RL}, with β ≥ 0 being the inverse temperature that controls the sharpness of the distribution.

Essentially, soft IBRL replaces the argmax operation with a softmax to avoid the possibility of masking out optimal actions. In practice, we find this soft version works better than the normal IBRL in state-based settings. However, this is not essential in the more realistic pixel-based settings, possibly because with deep image encoders, changing the Q-value for certain observation-action pairs will likely cause changes to the Q-values of many other correlated inputs, which brings sufficient stochasticity to the learning process and thus mitigates the masking effect. We demonstrate the effectiveness of soft IBRL in state-based experiments in Section V-C while using the normal argmax version for all pixel-based experiments due to its simplicity and the fact that it does not require additional hyperparameter tuning.
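For concreteness, the following sketch computes the bootstrap-proposal TD target of Eq. (2) together with the soft variant of Eq. (4). The interfaces are assumptions, and the single scalar critic here stands in for the double-Q/ensemble minimum used in practice.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def ibrl_td_target(reward, not_done, next_obs, il_policy, target_actor,
                   target_critic, gamma=0.99, beta=None):
    """TD target with bootstrap proposal (Eq. 2); beta enables the soft variant (Eq. 4)."""
    q_il = target_critic(next_obs, il_policy(next_obs))     # Q_phi'(s', a^IL)
    q_rl = target_critic(next_obs, target_actor(next_obs))  # Q_phi'(s', a^RL)
    if beta is None:
        # greedy IBRL: bootstrap from the better of the two candidate actions
        next_q = torch.maximum(q_il, q_rl)
    else:
        # soft IBRL: pick a candidate with probability p_Q(a) ∝ exp(beta * Q)
        probs = F.softmax(beta * torch.cat([q_il, q_rl], dim=-1), dim=-1)
        pick_rl = torch.multinomial(probs, num_samples=1).bool()  # index 1 -> RL
        next_q = torch.where(pick_rl, q_rl, q_il)
    return reward + gamma * not_done * next_q
```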
B. Benefits of IBRL

When using RL with access to prior demonstrations, recent work has shown that straightforward approaches such as oversampling the demonstrations as in RLPD or Hybrid RL [2, 28], and BC pretraining followed by RL with BC regularization on the policy as in ROT [12], are powerful techniques that are commonly used in real world robotics settings due to their simplicity, performance, and robustness. In this section, we discuss how IBRL's way of integrating IL with RL introduces additional important benefits in comparison to these methods.

Automatic balancing between RL and IL policies. First, IBRL does not require picking hyper-parameters or annealing schedules for a BC regularization weight. Unlike prior methods, IBRL does not need to worry about the IL policy being washed out in the early stage of training, nor does it need to worry about the BC loss causing RL to be suboptimal in the later stage of training. In IBRL, the balance between IL and RL changes automatically as the policy and critic improve.

Leveraging IL in both exploration and training. The explicit consideration of IL actions during both exploration and training through the argmax operation (or softmax in the soft variant) can lead to better exploration and training targets when the RL policy is underperforming. We show later in ablations that both actor proposal and bootstrap proposal are crucial for maximum sample efficiency.

Modular and flexible architecture choices for IL and RL. The modular design of IBRL easily enables selecting the "best of both worlds" from an IL and RL perspective. For example, we can use different network architectures that are most suited for the RL and IL tasks respectively. In Section V-C, we show that the widely used deep ResNet-18 encoder that achieves strong performance in IL performs poorly as the visual backbone for RL, while a shallow ViT encoder that performs worse in IL works quite well in RL. IBRL's modular integration of RL and IL also allows different action representations for IL and RL, such as a unimodal Gaussian for RL but a mixture of Gaussians for IL. This opens an avenue towards integrating more powerful IL methods [24, 6, 39] with RL, which we leave for future research.

C. Architectural Improvements

Regularization with Actor Dropout. Many prior works have demonstrated the benefit of regularization in RL for continuous control [9, 5, 33]. Additionally, as we discussed earlier, popular RL techniques that leverage prior data, such as oversampling demonstrations in training or adding a BC regularization loss to the policy update, implicitly introduce additional regularization to RL that has been shown to be useful. We observe that regularizing IBRL with dropout [29] in the policy network (actor) πθ, which we refer to as actor dropout, can further improve its stability and sample efficiency, especially in more challenging tasks where initial signals are noisy as successful episodes are less frequent. Although dropout has previously been applied in the critic to reduce overfitting on the value estimate [15], to the best of our knowledge, the application of dropout in the actor has not been well-studied before. We find that adding actor dropout in IBRL significantly improves sample efficiency, even when other regularization techniques such as image augmentation (DrQ) [34] or Q-ensembling (RED-Q) [5] are also present. Moreover, actor dropout accelerates convergence without increasing the update-to-data (UTD) ratio and requires negligible extra compute.

Improved Vision Encoder and Critic Designs. Prior online RL works in continuous control have mostly inherited the architecture from DrQ [33], which consists of a shallow ConvNet followed by linear layers. Despite its strong performance in many settings, we find this architecture to be a major bottleneck in more challenging tasks. Meanwhile, naïvely applying common deep architectures without massive training data from parallel simulators leads to poor performance. Therefore, we introduce a new Q-network design with a shallow ViT [7] style image encoder for learning from pixels, illustrated in Fig. 2. The general idea is to use Transformer layers so that relevant information from different parts of the image can be exchanged efficiently in a relatively shallow architecture that is expressive and yet easy to optimize. We first divide input images into overlapping patches and apply two convolution layers to get patch embeddings before feeding them into one Transformer layer. Then, we flatten the post-ViT patch embeddings in each channel and append the action and optionally proprioception data to each flattened channel before feeding them through an MLP to fuse this information. To reduce the dimensionality of the features without using large linear layers, we multiply the feature matrix with learned spatial embeddings [20] and sum over the channel dimension to get a 1-D vector before feeding it into the final Q-MLP. As TD3 utilizes two Q-heads for double Q-learning, we replicate the entire structure after the ViT for each Q-head. Similar to prior work, the actor is a fully connected network that takes the output of the ViT encoder as input. We show that this architecture greatly improves the performance of IBRL in complex manipulation tasks in Section V-C and show that it also improves baselines in the Appendix.

[Figure] Fig. 2: ViT-based Q-network. First, the ViT processes overlapping image patches. Action and proprioception inputs are appended to each channel and an MLP is used to fuse this information. The projected embeddings are reduced to a 1-D vector by multiplying with learned spatial embeddings and summing over the channel dimension. Finally, an MLP takes the embedding and outputs a scalar Q.
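As a rough, illustrative sketch of the ViT-style Q-network just described (image → two conv layers over overlapping patches → one Transformer layer → per-channel fusion with action and proprioception → reduction with learned spatial weights → Q-MLP): all layer sizes, the patch geometry, and the exact fusion details below are assumptions, and the authors' released architecture may differ.

```python
import torch
import torch.nn as nn

class ViTQHead(nn.Module):
    """Rough sketch of the ViT-style Q-network of Fig. 2 (sizes are assumptions)."""

    def __init__(self, in_ch=3, embed=128, hid=128):
        super().__init__()
        # two conv layers produce overlapping patch embeddings
        self.patchify = nn.Sequential(
            nn.Conv2d(in_ch, embed, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(embed, embed, kernel_size=3, stride=2), nn.ReLU())
        # a single Transformer layer exchanges information across patches
        self.vit = nn.TransformerEncoderLayer(d_model=embed, nhead=4,
                                              batch_first=True)
        # fuse each embedding channel (a vector over patches) with action/proprio
        self.fuse = nn.Sequential(nn.LazyLinear(hid), nn.ReLU(),
                                  nn.Linear(hid, hid), nn.ReLU())
        self.spatial = nn.Parameter(torch.randn(embed, hid))  # learned spatial weights
        self.q_mlp = nn.Sequential(nn.Linear(hid, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, img, action, proprio):
        x = self.patchify(img)                    # (B, embed, H', W')
        tokens = x.flatten(2).transpose(1, 2)     # (B, num_patches, embed)
        tokens = self.vit(tokens)                 # post-ViT patch embeddings
        feat = tokens.transpose(1, 2)             # (B, embed, num_patches)
        extra = torch.cat([action, proprio], dim=-1)
        extra = extra.unsqueeze(1).expand(-1, feat.shape[1], -1)
        feat = self.fuse(torch.cat([feat, extra], dim=-1))  # (B, embed, hid)
        feat = (feat * self.spatial).sum(dim=1)   # weight and sum over channels
        return self.q_mlp(feat)                   # scalar Q per (image, action) pair
```

In the TD3 backbone, everything after the Transformer layer would be replicated for the second Q-head, and the actor would reuse the ViT features.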
[Figure] Fig. 3: Performance on Meta-World (success rate vs. interaction steps ×1000; panels: Assembly, Box close, Coffee push, Stick pull, Aggregation). IBRL (without Actor Dropout) outperforms both MoDem and RLPD on all 4 tasks. RFT achieves similar performance to IBRL. The dashed lines indicate the average success rate of the BC policies used in IBRL.

V. EXPERIMENTS IN SIMULATION

We first conduct experiments in simulation environments to comprehensively compare IBRL against state-of-the-art methods in terms of performance and sample efficiency. We also perform ablations to understand the importance of different design choices.

A. Experimental Setup

Our evaluation suite consists of 4 tasks from Meta-World [36] and 2 tasks from Robomimic [21]. All environments use the sparse 0/1 task completion reward at the end of each episode. The 4 Meta-World tasks are a subset of the tasks evaluated in MoDem [13]. They span the medium, hard and very hard tiers of this benchmark as categorized in [25]. Since Meta-World does not come with human demonstrations, we use the scripted expert policies from [36] to generate 3 demonstrations per task. Although we use harder-than-average tasks from Meta-World, these tasks are often simple, and additionally, scripted demonstrations are inevitably different – much less noisy and cleaner – than human demonstrations, making these tasks too simple to distinguish between some of the stronger methods. Robomimic is a well-established benchmark with significantly more complex tasks and demonstrations collected by human teleoperators. We use two test scenarios: a medium-difficulty task PickPlaceCan (Can) with 10 demonstrations and a hard task NutAssemblySquare (Square) with 50 demonstrations. As documented in [21], the Square task is particularly challenging for RL, as RL methods without demonstrations have been unsuccessful even with hand-engineered dense rewards and substantial tuning.

B. Implementation of IBRL and Baselines

IBRL uses TD3 for RL and BC for IL. The BC policies in all pixel-based experiments use a ResNet-18 vision encoder. We integrate common best practices for RL such as random-shift image augmentation in pixel-based RL and RED-Q in state-based RL to ensure best performance. Unless specified otherwise, IBRL always uses actor dropout by default. Please see the Appendix for more implementation details and a complete list of hyper-parameters. We compare IBRL with three powerful baselines, RLPD, RFT and MoDem, that have been shown to outperform various other methods.

RLPD loads the demonstrations in the replay buffer and oversamples them during online RL such that 50% of the transitions in each batch come from demonstrations.

RFT (regularized fine-tuning) is a technique where the RL policy π is first pre-trained with demonstrations and then fine-tuned with online RL. During RL, it adds a BC loss αλ(π)L_BC, where α is the weight of the BC loss and λ is an annealing schedule. We use the soft Q-filtering technique from Regularized Optimal Transport (ROT) [12] to dynamically anneal λ. We use the best α = 0.1 found through hyper-parameter sweeping.

RLPD and RFT share the same TD3 backbone as IBRL. In our experiments, unless otherwise specified, IBRL, RLPD, and RFT share the same non-algorithmic building blocks, including network architecture, normalization, random-shift image augmentation, RED-Q, etc. We make these implementation decisions to ensure strong baselines and controlled comparisons against IBRL.

MoDem is a model-based approach that pre-trains a policy with BC and uses it to generate rollouts which are then used to pre-train a world model and critic. We use the original open-source implementation of MoDem. For our Meta-World experiments, we generate the prior demonstrations differently from the original paper [13], but we have confirmed that our rerun of MoDem with these demonstrations performs better on average than the results reported in the original paper.

C. Overall Results on Meta-World and Robomimic

IBRL matches or exceeds baselines in Meta-World. In Meta-World, we focus on the core algorithmic contributions of IBRL. Therefore, we disable actor dropout for IBRL. We also do not use our ViT-based architecture for IBRL, RFT, and RLPD but instead use the widely adopted ConvNet architecture from DrQ to ensure a fair comparison with MoDem, as it is complicated to tune network architectures for MoDem.

Fig. 3 shows the results of IBRL against the three baselines in each Meta-World task separately as well as in aggregation (rightmost). IBRL and RFT universally outperform RLPD and MoDem across all tasks in terms of both sample efficiency and final performance, solving all tasks within 40K samples. RFT has a small advantage over IBRL in the early stage of training thanks to its pretrained encoder and policy network. However, IBRL catches up quickly and achieves high performance within the same amount of samples, significantly outperforming RLPD which is also randomly initialized. Because the tasks are relatively simple, IBRL's advantage of integrating a more powerful IL model is less beneficial here, which may partially explain the similar performance between IBRL and RFT. However, it is worth noting that RFT requires additional tuning to find a proper range for the base regularization ratio α. In contrast, IBRL has no additional hyper-parameters during the RL stage, making it more desirable for real world applications where large scale hyper-parameter search is infeasible. Lastly, the more complex MoDem method performs much worse than IBRL and the two simpler baselines. Given that MoDem's computational cost is significantly higher than that of the other two baselines (10 hours for MoDem vs. 1 hour for the three model-free methods), we exclude MoDem from the more difficult and computationally intensive Robomimic experiments.

IBRL significantly exceeds baselines in Robomimic. In Robomimic, we run all methods with our new ViT-based architecture, as existing architectures become a major bottleneck in Square, the most complicated task in our simulation experiments. We also run state-based experiments to demonstrate the effectiveness of IBRL in isolation from network designs. We run IBRL with actor dropout to highlight our empirical improvement upon existing strong baselines. The ablations over different components of IBRL are in the next section.

Fig. 4 shows the performance of IBRL alongside the two strong baselines, RLPD and RFT. IBRL outperforms the baselines across all four settings. The performance of the BC policy (gray dashed lines) illustrates the relative difficulty of the tasks. For example, Square (pixel) is much harder than Can (pixel); BC performs much worse in Square despite having 5× as much demonstration data as Can. In the relatively simpler Can (pixel) task, all three methods are able to eventually solve the task, but IBRL solves it with fewer interaction steps and more stable training. In the Square (pixel) task, IBRL is the only method that is able to solve it within 0.5M samples, while the baselines attain less than 60% success. In the state-based setting, the improvement is even more striking as the existing methods fail to learn completely. Overfitting may be a major issue that leads to the failures of the baselines in state-based experiments, as we later see that their performance improves significantly after adding actor dropout, despite still being worse than IBRL.

D. Ablations on Robomimic

We perform ablations on the more challenging Robomimic tasks to understand the contribution of each component of IBRL. We first show that adding actor dropout to the baselines is not sufficient to match IBRL's performance. Then we ablate over the algorithmic components of IBRL and show that all of them contribute to its success. Finally, we show that our ViT-based architecture significantly improves sample efficiency and final performance for all RL methods.
[Figure] Fig. 4: Performance on Robomimic (success rate vs. interaction steps ×1000; panels: Can (pixel), Square (pixel), Can (state), Square (state)). IBRL significantly outperforms RFT and RLPD on all 4 scenarios. The gap between IBRL and the baselines is especially large on the more difficult Square task. The horizontal dashed lines are the scores of the BC policies used in IBRL.

[Figure] Fig. 5: IBRL vs. baselines and their variants with actor dropout (Square, pixel and state). Actor dropout significantly improves RFT in pixel-based RL and significantly improves both baselines in state-based RL.

[Figure] Fig. 6: Ablations on the algorithmic components of IBRL (Square, pixel and state).
Actor dropout on baselines. To ensure that the advantage of IBRL over the baselines is not solely from actor dropout, we augment both RLPD and RFT with actor dropout and show their performance in Fig. 5. First of all, IBRL still outperforms the strongest variant among the four baselines, "RFT with Actor Dropout", showing that actor dropout is not the only reason behind IBRL's new SoTA performance. However, it is worth noting that actor dropout significantly improves RFT in both pixel- and state-based settings and RLPD in the state-based setting. In the state-based setting, actor dropout essentially helps the two baselines solve the task, although at a lower sample efficiency than IBRL. Adding actor dropout to RFT essentially leads to a new approach that greatly surpasses existing methods excluding IBRL. This suggests that this technique should be considered for other methods beyond IBRL, especially considering that it adds negligible extra computational cost.

Algorithmic components of IBRL. To understand the importance of the key algorithmic components in IBRL, we perform ablations over actor proposal, bootstrap proposal, and actor dropout in Fig. 6. Overall, all three components are crucial for IBRL's strong performance. First, we can see that actor dropout is a powerful technique that improves sample efficiency and helps IBRL escape sub-optimal solutions. Nonetheless, we emphasize that the core ideas of IBRL play a crucial role even when actor dropout is enabled: removing either the bootstrap proposal or the actor proposal causes significant performance deterioration even when actor dropout is enabled. IBRL w/o Bootstrap Proposal shares a similar high-level structure to PEX [38], where a reference policy is used for proposing actions during exploration only. However, PEX trains the reference policy with offline RL and does not use actor dropout. IBRL is significantly less sample efficient without bootstrap proposal, indicating that using the IL policy in the target value computation leads to better training targets and faster convergence. We also verify the importance of the actor proposal; IBRL's performance decreases when removing actor proposal because it becomes less efficient at finding good actions in the early stage of training. It is interesting to see that IBRL w/o Bootstrap Proposal performs worse than IBRL w/o Actor Proposal, which further emphasizes the importance of using the IL policy during training.
[Figure] Fig. 7: Comparison between our ViT-based Q-network design, the DrQ net commonly used in prior RL work, and the ResNet-18 that achieves strong performance in imitation learning (Can (pixel) and Square (pixel); success rate vs. interaction steps ×1000). All IBRL runs use the same ResNet-18-based BC policy but different architectures for the RL networks. Dashed horizontal lines show the performance of BC. Our ViT performs significantly better in RL while the deeper ResNet-18 performs better in BC. IBRL takes advantage of the best architectures in both RL and IL.

[Figure] Fig. 8: Real-world tasks (Lift, Drawer, Hang). Top: illustrations of each task and the variation in the initialization of each task. Bottom: training curves for each task (success rate vs. interaction steps ×1000); the y-axis is the percentage of successful episodes during each 1000-step interval. IBRL consistently outperforms RLPD and RFT in all 3 tasks, with a larger gap on the more complex tasks that take more interaction steps to learn.
Ablation of Network Architecture. We demonstrate the effectiveness of our ViT-based architecture in Fig. 7. In both tasks, our ViT architecture achieves better performance than the widely adopted DrQ network. The near zero performance of the DrQ network in Square also reflects the difficulty of the task compared to the ones used in prior RL works. We also test the deep ResNet-18 encoder, the same one used in our BC policy, in RL. Note that this ResNet-18 replaces BatchNorm with GroupNorm [32] as BatchNorm is known to cause RL to diverge when used with moving average target networks [20]. Compared with the deeper and more computationally expensive ResNet-18, our proposed ViT architecture achieves better sample efficiency and final performance while also taking 50% less wall-clock time to train. Although the ViT performs better in online RL tasks, we also see from Fig. 7 (dashed lines) that the higher capacity ResNet-18 still dominates in BC. Thus, we empirically confirm that BC and RL may prefer different architectures, which is reasonable given that the training goals are different (fitting behaviors in the training data vs. extrapolating to better behaviors while avoiding overfitting to unsuccessful early exploration data). Prior works such as RFT are forced to use the same architecture to fit both RL and demonstration data, which may limit their performance. In contrast, IBRL allows us to choose different architectures that are most suitable for RL and IL respectively, which echoes one of the benefits of IBRL discussed in Section IV-B.

VI. REAL WORLD EXPERIMENTS

To fulfill IBRL's promise of performing sample-efficient policy improvement in real-world applications, we evaluate it on three real-world manipulation tasks of increasing difficulty and compare it against RFT and RLPD.

A. Experimental Setup

We design three tasks named Lift, Drawer and Hang. The first two tasks use a Franka Emika Panda robot and the third task uses a Franka Research 3 robot. Both robots are equipped with a Robotiq 2F-85 gripper. Actions are 7-dimensional, consisting of 6 dimensions for end-effector position and orientation deltas under a Cartesian impedance controller and 1 dimension for the absolute position of the gripper. Policies run at 10 Hz. For each task, we collect a small number of prior demonstrations via teleoperation with an Oculus VR controller, and then run different RL methods for a fixed number of interaction steps. All methods use the exact same hyper-parameters and network architectures as in the Robomimic tasks. We illustrate the three tasks in Fig. 8 and briefly describe them here.

Lift: The objective is to pick up a foam block. The initial location of the block is randomized over a roughly 22cm by 22cm–28cm trapezoid, which covers the entire area visible from the wrist-camera when the robot is at the home position. We collect 10 demonstrations for this task due to its simplicity. It uses wrist-camera images as observations.

Drawer: The objective is to open the top drawer in a set of plastic drawers in a fixed position. The initial pose of the robot is randomized by adding noise up to 10% of the joint limit to each joint. We collect 30 prior demonstrations and use wrist-camera images as observations.

Hang: The objective is to hang a deformable soft cloth on a metal hook. The initial location of the cloth is randomized over a roughly 28cm by 30cm rectangular region, and the hook is in a fixed position. The cloth is initialized so that its long side is roughly perpendicular to the hook. We use 30 prior demonstrations. This task uses third-person camera images as observations because the wrist-camera loses sight of the hook after picking up the cloth.

As the primary goal of our real-world evaluations is to compare the sample-efficiency and performance of various algorithms, we design rule-based success detectors and perform manual resets between episodes to ensure accurate reward and initial conditions. The details of the success detection and reset mechanism are in the Appendix. Note that sparse 0/1 reward from the success detector is the only source of reward.
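As a concrete illustration of the action interface described in this setup (6 end-effector delta dimensions plus 1 absolute gripper dimension, commanded at 10 Hz), the small container below is a hypothetical helper for readability, not part of the authors' system.

```python
from dataclasses import dataclass

@dataclass
class ArmAction:
    """Hypothetical container for the 7-D action used in the real-robot tasks."""
    dx: float       # end-effector position deltas (Cartesian impedance controller)
    dy: float
    dz: float
    droll: float    # end-effector orientation deltas
    dpitch: float
    dyaw: float
    gripper: float  # absolute gripper position

    def clipped(self, limit=1.0):
        # policies output normalized actions; clip to the expected range
        vals = [self.dx, self.dy, self.dz, self.droll, self.dpitch, self.dyaw,
                self.gripper]
        return [max(-limit, min(limit, v)) for v in vals]
```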
B. Results

Fig. 8 shows the training curves of IBRL and the baselines in the three tasks. Different tasks allow different interaction budgets based on their difficulty. The training curve measures the success rate of episodes within each 1000-step interval while the policy is being updated and exploration noise for actions is enabled. Overall, we see that IBRL learns consistently faster than RLPD and RFT across all three environments and is the only method that is able to outperform BC in the most challenging Hang task under a 30K interaction budget.

We take the last checkpoints of each method and perform 20 evaluations. All methods are evaluated using the same set of initial conditions for fairness. Table I summarizes the results.

TABLE I: Evaluation performance of IBRL on the real-world tasks.
                Lift    Lift (Hard Eval)   Drawer   Drawer (Early Stop)   Hang
# Demos         10      10                 30       30                    30
# Env Steps     8K      8K                 16K      10K                   30K
Time (mins)     32      32                 64       48                    120
BC              50%     0%                 55%      55%                   65%
RLPD            95%     80%                85%      0%                    15%
RFT             90%     75%                50%      15%                   35%
IBRL            100%    95%                95%      100%                  85%

In Lift, we first evaluate all methods using a uniform distribution of initial positions of the block and then evaluate them in a "Hard Eval" setting where the block is initialized at the boundary such that only part of the block is visible from the wrist-camera at the beginning of each episode. IBRL achieves the highest score in both settings. In the hard setting, the performance of all methods decreases, especially for BC, whose performance drops to 0 as it has not seen such cases in the demonstrations. However, IBRL still maintains a near perfect 95% success rate as it learns faster during RL and thus has seen more different initial positions. This illustrates that IBRL is highly suitable for real-world policy improvement to combat potential distribution shifts or to tackle unseen cases during original data collection.

The Drawer task is more challenging than Lift as it requires grasping the small drawer handle followed by a precise horizontal motion to open the drawer. We provide 30 demonstrations and run each method for 16K interaction steps. IBRL achieves the strongest performance at 95% success. From the learning curve in Fig. 8, we can clearly see that IBRL solves the task with far fewer samples. To verify this, we evaluate an "early stop" checkpoint after 10K interaction steps and find that IBRL already attains a perfect score while the baselines succeed less than 15% of the time. In fact, RLPD and RFT still cannot fully solve this task even after 16K steps, making IBRL at least 40% more sample efficient than the baselines in this task.

The Hang task is the hardest task as the robot must learn to pick the cloth up from the center and release it above the hook with enough precision so that the cloth rests on the hook and does not fall. We provide 30 demonstrations and run each method for 30K interaction steps. BC performs relatively well on this task because the demonstrations from the human expert are clean and always grasp and drop at the optimal location, which reduces the possible state space that the policy needs to handle. However, the deformable nature of the cloth makes it especially hard for RL as small differences in the grasp or drop locations may lead to drastically different outcomes that are hard to predict. Despite a significantly higher online interaction budget of 30K steps, RFT and RLPD are not able to even reach the performance level of BC. In contrast, IBRL exceeds the success rate of BC by 20%. Fig. 9 illustrates rollouts of BC and IBRL on two different initial conditions. In the top row, IBRL is able to solve the task with fewer steps than the BC policy. The bottom row shows a different scenario where BC fails to pick up the towel within the episode limit of 150 steps while IBRL can still solve the task.

[Figure] Fig. 9: Illustration of rollouts by BC and IBRL on the Hang task from two different initial cloth locations (displacements of roughly 19cm and 16cm; top row: BC succeeds after 82 steps, IBRL after 53 steps; bottom row: BC fails to pick up within 150 steps, IBRL succeeds after 53 steps). Note that IBRL can achieve task success in fewer timesteps than the BC policy, and can solve the task for certain initial states where the BC policy fails.

VII. DISCUSSION

Summary. We present IBRL, a novel way to use human demonstrations for sample efficient RL by first training an IL policy and then using it in RL to propose actions that improve online interaction and training-time target Q-value estimation. We show that IBRL outperforms prior SoTA methods across 6 simulation tasks spanning a wide range of difficulty levels, and the improvement is particularly significant in harder tasks. In real-world robotics tasks, IBRL also outperforms prior methods by a large margin in terms of sample-efficiency and final performance, making it an ideal solution for rapid real-world policy improvement, either to improve upon an existing IL policy or to help address performance deterioration caused by distribution shift. While we instantiated IBRL with specific choices of IL and RL algorithms, the framework is general and can in principle accommodate any IL method and off-policy RL method.

Limitations and Future Work. In our real-world experiments, we focus on evaluating the performance of IBRL, so we resort to manual resets to minimize noise from unsuccessful resets. A large scale deployment of IBRL in the real world should ideally enable autonomous reset, which we leave for future work. Additionally, the modular design of IBRL opens
new avenues for integrating various IL methods with RL. An exciting direction for future research is to extend IBRL to take advantage of recent IL advancements such as diffusion policies [24, 6] or learning with hybrid actions [3] for even better performance.

VIII. ACKNOWLEDGMENTS

This project was sponsored by ONR grant N00014-21-1-2298, NSF grants #2125511, #1941722, #2006388 and DARPA grant W911NF2210214. We would like to thank Yuchen Cui and Joey Hejna for their feedback and suggestions.

REFERENCES

[1] Anton Bakhtin, David J Wu, Adam Lerer, Jonathan Gray, Athul Paul Jacob, Gabriele Farina, Alexander H Miller, and Noam Brown. Mastering the game of no-press diplomacy via human-regularized reinforcement learning and planning. In International Conference on Learning Representations (ICLR), 2023.
[2] Philip J. Ball, Laura M. Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In International Conference on Machine Learning (ICML), 2023.
[3] Suneel Belkhale, Yuchen Cui, and Dorsa Sadigh. Hydra: Hybrid robot actions for imitation learning. In Conference on Robot Learning (CoRL), 2023.
[4] Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, Deeksha Manjunath, Igor Mordatch, Ofir Nachum, Carolina Parada, Jodilyn Peralta, Emily Perez, Karl Pertsch, Jornell Quiambao, Kanishka Rao, Michael Ryoo, Grecia Salazar, Pannag Sanketi, Kevin Sayed, Jaspiar Singh, Sumedh Sontakke, Austin Stone, Clayton Tan, Huong Tran, Vincent Vanhoucke, Steve Vega, Quan Vuong, Fei Xia, Ted Xiao, Peng Xu, Sichun Xu, Tianhe Yu, and Brianna Zitkovich. RT-1: Robotics Transformer for Real-World Control at Scale. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[5] Xinyue Chen, Che Wang, Zijian Zhou, and Keith W. Ross. Randomized ensembled double Q-learning: Learning fast without a model. In International Conference on Learning Representations (ICLR), 2021.
[6] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion Policy: Visuomotor Policy Learning via Action Diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
[7] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations (ICLR), 2020.
[8] FAIR, Anton Bakhtin, Noam Brown, Emily Dinan, Gabriele Farina, Colin Flaherty, Daniel Fried, Andrew Goff, Jonathan Gray, Hengyuan Hu, Athul Paul Jacob, Mojtaba Komeili, Karthik Konath, Minae Kwon, Adam Lerer, Mike Lewis, Alexander H. Miller, Sasha Mitts, Adithya Renduchintala, Stephen Roller, Dirk Rowe, Weiyan Shi, Joe Spisak, Alexander Wei, David Wu, Hugh Zhang, and Markus Zijlstra. Human-level play in the game of "Diplomacy" by combining language models with strategic reasoning. Science, 378(6624):1067–1074, 2022.
[9] Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), 2018.
[10] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), 2018.
[11] Danijar Hafner, Timothy P. Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. In International Conference on Learning Representations (ICLR), 2021.
[12] Siddhant Haldar, Vaibhav Mathur, Denis Yarats, and Lerrel Pinto. Watch and match: Supercharging imitation with regularized optimal transport. In Conference on Robot Learning (CoRL), 2022.
[13] Nicklas Hansen, Yixin Lin, Hao Su, Xiaolong Wang, Vikash Kumar, and Aravind Rajeswaran. MoDem: Accelerating Visual Model-Based Reinforcement Learning with Demonstrations. In International Conference on Learning Representations (ICLR), 2023.
[14] Todd Hester, Matej Vecerík, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew Sendonaris, Ian Osband, Gabriel Dulac-Arnold, John P. Agapiou, Joel Z. Leibo, and Audrunas Gruslys. Deep Q-learning from demonstrations. In AAAI Conference on Artificial Intelligence, 2018.
[15] Takuya Hiraoka, Takahisa Imagawa, Taisei Hashimoto, Takashi Onishi, and Yoshimasa Tsuruoka. Dropout Q-functions for doubly efficient reinforcement learning. In International Conference on Learning Representations (ICLR), 2022.
[16] Kyle Hsu, Moo Jin Kim, Rafael Rafailov, Jiajun Wu, and Chelsea Finn. Vision-based manipulators need to also see from their hands. In International Conference on Learning Representations (ICLR), 2022.
[17] Hengyuan Hu and Dorsa Sadigh. Language instructed reinforcement learning for human-AI coordination. In International Conference on Machine Learning (ICML), 2023.
[18] Athul Paul Jacob, David J Wu, Gabriele Farina, Adam Lerer, Hengyuan Hu, Anton Bakhtin, Jacob Andreas, and Noam Brown. Modeling strong and human-like gameplay with KL-regularized search. In International Conference on Machine Learning (ICML), 2022.
[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[20] Aviral Kumar, Anika Singh, Frederik Ebert, Mitsuhiko Nakamoto, Yanlai Yang, Chelsea Finn, and Sergey Levine. Pre-Training for Robots: Offline RL Enables Learning New Tasks in a Handful of Trials. In Robotics: Science and Systems (RSS), 2022.
[21] Ajay Mandlekar, Danfei Xu, Josiah Wong, Soroush Nasiriany, Chen Wang, Rohun Kulkarni, Li Fei-Fei, Silvio Savarese, Yuke Zhu, and Roberto Martín-Martín. What matters in learning from offline human demonstrations for robot manipulation. In Conference on Robot Learning (CoRL), 2021.
[22] Ashvin Nair, Bob McGrew, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Overcoming exploration in reinforcement learning with demonstrations. In International Conference on Robotics and Automation (ICRA), 2018.
[23] Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. In Proceedings of Robotics: Science and Systems (RSS), 2018.
[24] Moritz Reuss, Maximilian Li, Xiaogang Jia, and Rudolf Lioutikov. Goal conditioned imitation learning using score-based diffusion policies. In Robotics: Science and Systems (RSS), 2023.
[25] Younggyo Seo, Danijar Hafner, Hao Liu, Fangchen Liu, Stephen James, Kimin Lee, and Pieter Abbeel. Masked world models for visual control. In Conference on Robot Learning (CoRL), 2022.
[26] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, Yutian Chen, Timothy P. Lillicrap, Fan Hui, Laurent Sifre, George van den Driessche, Thore Graepel, and Demis Hassabis. Mastering the game of Go without human knowledge. Nature, 550(7676):354–359, 2017.
[27] Tom Silver, Kelsey R. Allen, Josh Tenenbaum, and Leslie Pack Kaelbling. Residual policy learning. CoRR, abs/1812.06298, 2018. URL http://arxiv.org/abs/1812.06298.
[28] Yuda Song, Yifei Zhou, Ayush Sekhari, Drew Bagnell, Akshay Krishnamurthy, and Wen Sun. Hybrid RL: Using both offline and online data can make RL efficient. In International Conference on Learning Representations (ICLR), 2023.
[29] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research (JMLR), 15(1):1929–1958, 2014.
[30] Matej Vecerík, Todd Hester, Jonathan Scholz, Fumin Wang, Olivier Pietquin, Bilal Piot, Nicolas Heess, Thomas Rothörl, Thomas Lampe, and Martin A. Riedmiller. Leveraging demonstrations for deep reinforcement learning on robotics problems with sparse rewards. arXiv:1707.08817, 2017.
[31] Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, L. Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander Sasha Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom Le Paine, Caglar Gulcehre, Ziyun Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy P. Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019.
[32] Yuxin Wu and Kaiming He. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[33] Denis Yarats, Ilya Kostrikov, and Rob Fergus. Image augmentation is all you need: Regularizing deep reinforcement learning from pixels. In International Conference on Learning Representations (ICLR), 2021.
[34] Denis Yarats, Rob Fergus, Alessandro Lazaric, and Lerrel Pinto. Mastering visual continuous control: Improved data-augmented reinforcement learning. In International Conference on Learning Representations (ICLR), 2022.
[35] Zhao-Heng Yin, Weirui Ye, Qifeng Chen, and Yang Gao. Planning for sample efficient imitation learning. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
[36] Tianhe Yu, Deirdre Quillen, Zhanpeng He, Ryan Julian, Karol Hausman, Chelsea Finn, and Sergey Levine. Meta-World: A benchmark and evaluation for multi-task and meta reinforcement learning. In Conference on Robot Learning (CoRL), 2019.
[37] Albert Zhan, Ruihan Zhao, Lerrel Pinto, Pieter Abbeel, and Michael Laskin. Learning visual robotic control efficiently with contrastive pre-training and data augmentation. In International Conference on Intelligent Robots and Systems (IROS), 2022.
[38] Haichao Zhang, Wei Xu, and Haonan Yu. Policy expansion for bridging offline-to-online reinforcement learning. In International Conference on Learning Representations (ICLR), 2023.
[39] Tony Z. Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv:2304.13705, 2023.
IX. P SEUDOCODE FOR IBRL

Algorithm 1 IBRL with TD3 backbone. Major modifications w.r.t. vanilla TD3 highlighted in blue.
1: Hyperparameters: number of critics E, number of critic updates G, update frequency U, exploration std σ, noise clip c
2: Train imitation policy µψ on demonstrations D = {ξ1, . . . , ξn} with the selected IL algorithm.
3: Initialize policy πθ, target policy πθ′, and critics Qϕi, target critics Qϕ′i for i = 1, 2, . . . , E
4: Initialize replay buffer B with demonstrations {ξ1, . . . , ξn}
5: for t = 1, . . . , num_rl_steps do
6:     Observe st from the environment
7:     Compute IL action a_t^IL ∼ µψ(st) and RL action a_t^RL = πθ(st) + ϵ, ϵ ∼ N(0, σ²)
8:     Sample a set K of 2 indices from {1, 2, . . . , E}
9:     Take the action with the higher Q-value: at = argmax_{a ∈ {a_t^RL, a_t^IL}} [min_{i∈K} Qϕ′i(st, a)]
10:    Store transition (st, at, rt, st+1) in B
11:    if t % U ≠ 0 then
12:        continue
13:    end if
14:    for g = 1, . . . , G do
15:        Sample a minibatch of N transitions (s_t^(j), a_t^(j), r_t^(j), s_{t+1}^(j)) from B
16:        Sample a set K of 2 indices from {1, 2, . . . , E}
17:        For each element j of the minibatch, compute the target Q-value
               y^(j) = r_t^(j) + γ max_{a′ ∈ {a^IL, a^RL}} [min_{i∈K} Qϕ′i(s_{t+1}^(j), a′)],
               where a^IL ∼ µψ(s_{t+1}^(j)) and a^RL = πθ′(s_{t+1}^(j)) + clip(ϵ, −c, c)
18:        Update ϕi by minimizing the loss L(ϕi) = (1/N) Σ_j [y^(j) − Qϕi(s_t^(j), a_t^(j))]² for i = 1, . . . , E
19:        Update target critics ϕ′i ← ρϕ′i + (1 − ρ)ϕi for i = 1, . . . , E
20:    end for
21:    Update θ with the last minibatch by maximizing (1/N) Σ_j min_{i=1,...,E} Qϕi(s_t^(j), πθ(s_t^(j)))
22:    Update target actor θ′ ← ρθ′ + (1 − ρ)θ
23: end for
Algorithm 1 contains the detailed pseudocode for IBRL. Lines 2–4 train the IL policy and initialize the RL policy, the critics, and the replay buffer. Lines 6–10 correspond to interacting with the environment, and line 9 is the actor proposal of IBRL. Note that the minimization over multiple critics, min_{i∈K} Qϕ′i(st, a), is part of TD3 and RED-Q. Lines 14–19 are the critic updates, and line 17 is the bootstrap proposal. Finally, lines 21–22 are the policy updates, which are identical to vanilla TD3. The final output of IBRL is the hybrid policy that acts according to at = argmax_{a ∈ {a^RL, a^IL}} [min_{i∈K} Qϕ′i(st, a)]. The pseudocode shown here uses the argmax action-selection scheme; the softmax version is obtained by replacing the action selection in line 9 and line 17 with a ∼ softmax_{a ∈ {a^IL, a^RL}}(βQ(a)).
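For concreteness, the action-selection rule used in the actor proposal (line 9) can be sketched in PyTorch as below. This is a minimal illustration, not the released implementation: il_policy, rl_actor, and target_critics are placeholder callables, and a single (batch-of-one) observation is assumed.

import torch

@torch.no_grad()
def propose_action(obs, il_policy, rl_actor, target_critics, sigma=0.1, beta=None):
    # IBRL actor proposal: choose between the IL action and the noisy RL action
    # using a pessimistic estimate (min over 2 randomly chosen target critics).
    a_il = il_policy(obs)
    a_rl = rl_actor(obs)
    a_rl = a_rl + sigma * torch.randn_like(a_rl)
    idx = torch.randperm(len(target_critics))[:2].tolist()
    def min_q(a):
        return torch.min(torch.stack([target_critics[i](obs, a) for i in idx]), dim=0).values
    q_il, q_rl = min_q(a_il), min_q(a_rl)
    if beta is None:
        # argmax variant (IBRL)
        return a_il if (q_il >= q_rl).all() else a_rl
    # softmax variant (Soft-IBRL): pick the IL action with probability
    # exp(beta*q_il) / (exp(beta*q_il) + exp(beta*q_rl)) = sigmoid(beta*(q_il - q_rl))
    p_il = torch.sigmoid(beta * (q_il - q_rl))
    return a_il if torch.rand(()) < p_il else a_rl

The same rule, applied to the target policy's action and the IL action at the next state, yields the bootstrap proposal in line 17.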

X. I MPLEMENTATION D ETAILS AND H YPERPARAMETERS


In this section we cover the implementation details of IBRL as well as the baselines.
The BC policies use a ResNet-18 encoder. The output of the ResNet encoder is flattened and then fed into MLPs to produce the final 7D actions. For all ResNet encoders used in this paper, we replace the BatchNorm layers with GroupNorm [32] and set the number of groups equal to the number of input channels. The modified ResNet performs similarly to the original one in BC but significantly better in RL, since BatchNorm does not interact well with the exponential-moving-average target networks used in RL. We train the BC policies using the Adam optimizer [19] with a batch size of 256 and a learning rate of 1e−4, and we use random-shift data augmentation to prevent overfitting. In Meta-World, we follow the camera position used in MoDem [13] for a fair comparison. Prior work [16] shows that wrist cameras improve generalization and sample efficiency, so we opt for wrist cameras whenever possible in Robomimic and the real-world experiments. Specifically, we use the wrist camera in Can (Robomimic), Lift (real-world), and Drawer (real-world). In Square (Robomimic) and Hang (real-world), we use the 3rd-person camera because the wrist camera may not capture the goal location in these tasks. In Robomimic, we additionally experiment with state-based IBRL, where the BC policies use a straightforward 4-layer MLP with 1024 hidden units per layer. The input to the policy is the stack of the three states at t, t−1, and t−2. We find that MLPs with stacked state inputs achieve similar performance to the LSTMs from [21]. We use a dropout of 0.5 in state-based BC to prevent overfitting.
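As an illustration, the BatchNorm-to-GroupNorm swap described above can be implemented roughly as follows (a minimal sketch; the helper name is ours, not from the released code):

import torch.nn as nn
from torchvision.models import resnet18

def replace_bn_with_gn(module: nn.Module) -> nn.Module:
    # Recursively swap every BatchNorm2d for a GroupNorm whose number of
    # groups equals the number of channels, as described above.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            gn = nn.GroupNorm(num_groups=child.num_features, num_channels=child.num_features)
            setattr(module, name, gn)
        else:
            replace_bn_with_gn(child)
    return module

bc_encoder = replace_bn_with_gn(resnet18(weights=None))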
Optimizer: Adam
Learning Rate: 1e−4
Batch Size: 256
Discount (γ): 0.99
Exploration Std. (σ): 0.1
Noise Clip (c): 0.3
EMA Update Factor (ρ): 0.99
Update Frequency (U): 2
Actor Dropout: 0.5
Q-Ensemble Size (E): 2 (Meta-World, Robomimic Pixel, Real-World); 5 (Robomimic State)
Num Critic Updates (G): 1 (Meta-World, Robomimic Pixel, Real-World); 5 (Robomimic State)
Inverse Temperature (β, Soft-IBRL only): 10
Image Size: 84 × 84 (Meta-World); 96 × 96 (Robomimic Pixel, Real-World); N/A (Robomimic State)
Use Proprioception: No (Meta-World); Yes (Robomimic Pixel, Real-World); N/A (Robomimic State)
Proprio Stack: 3 (Robomimic Pixel, Real-World); N/A otherwise
State Stack: 3 (Robomimic State only)
Action Repeat: 2 (Meta-World); 1 (Robomimic Pixel, Real-World, Robomimic State)

TABLE II: Hyperparameters for IBRL.

The major hyperparameters for RL in IBRL are listed in Table II. In pixel-based RL, the RL policies use the same camera view as the BC policies in each environment. Following DrQ-v2 [34], the actor and the two critics share the image encoder, but only the gradients from the critics are used to update the encoder. We also use random-shift data augmentation in RL to prevent overfitting and improve sample efficiency. Unlike [34], which only uses target networks for the critics, we also use a target actor, as we find it slightly improves training stability. In environments that use proprioception, we feed a stack of the three most recent proprioception readings (t, t−1, t−2) instead of only the current one. The details of our ViT-based architecture are shown in Fig. 17, Fig. 18, and Fig. 19. In state-based RL, we use Q-ensembling (RED-Q) with E = 5 and a higher UTD ratio of G = 5, as we find this combination achieves good sample efficiency. We also tried increasing the UTD ratio to G = 10 but found that it takes significantly longer wall-clock time to train without improving sample efficiency. The critics and the actor in state-based RL are 4-layer MLPs, shown in Fig. 20. As in state-based BC, we use a stack of three states as the input to the critics and the actor. We use actor dropout with p = 0.5 in all environments. In Meta-World, we inherit the action repeat value from prior work for a fair comparison; we do not use action repeat for Robomimic or the real-world tasks.
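For reference, the random-shift augmentation can be sketched as below, using a simple pad-and-crop formulation in the spirit of DrQ; the pad size of 4 pixels is an assumption for illustration, not a value taken from our configuration.

import torch
import torch.nn.functional as F

def random_shift(imgs: torch.Tensor, pad: int = 4) -> torch.Tensor:
    # Replication-pad each image by `pad` pixels and take a random crop of the
    # original size, with an independent shift per image in the batch.
    n, c, h, w = imgs.shape
    padded = F.pad(imgs, (pad, pad, pad, pad), mode="replicate")
    out = torch.empty_like(imgs)
    for i in range(n):
        top = int(torch.randint(0, 2 * pad + 1, (1,)))
        left = int(torch.randint(0, 2 * pad + 1, (1,)))
        out[i] = padded[i, :, top:top + h, left:left + w]
    return out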
The RLPD and RFT baselines share the same base RL implementation as IBRL. The core idea of RLPD [2] is to draw half of each training batch from the demonstrations and the other half from the RL replay buffer, upweighting the successful demonstration trajectories to address the exploration challenge. Note that the original RLPD paper uses SAC as the base RL algorithm, while our implementation uses the same TD3 backbone as IBRL for a controlled comparison. The original RLPD also disables the entropy backup of SAC on 3 of the 4 benchmarks it evaluates, making that SAC variant highly similar to TD3 in practice.
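A minimal sketch of this 50/50 sampling scheme is shown below; demo_buffer and replay_buffer are placeholder objects assumed to return dictionaries of tensors, not names from any released code.

import torch

def sample_rlpd_batch(demo_buffer, replay_buffer, batch_size):
    # RLPD-style symmetric sampling: half of each training batch comes from the
    # demonstrations and half from the online replay buffer.
    demo = demo_buffer.sample(batch_size // 2)
    online = replay_buffer.sample(batch_size - batch_size // 2)
    return {k: torch.cat([demo[k], online[k]], dim=0) for k in demo}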
RFT first pretrains the encoder and the policy head with BC and then runs RL with an additional BC loss term on the policy head for regularization. Different instantiations of this idea have appeared in prior work [22]. Specifically, our implementation of RFT resembles the one from ROT [12]. The policy loss is Lθ(πθ) = −Es∼D[Q(s, πθ(s))] + α λ(πθ) E(s,a)∼T[∥a − πθ(s)∥²], where D is the RL replay buffer and T is the demonstration dataset. Here α is the base regularization weight, and we set λ(πθ) = Es∼T[1{Q(s, πθ0(s)) > Q(s, πθ(s))}] to dynamically adjust the regularization strength, where πθ0 is the pretrained policy.
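A minimal sketch of this RFT actor loss is given below; the function and argument names are ours, and α is left as an explicit argument since its value is task dependent.

import torch

def rft_actor_loss(policy, pretrained_policy, critic, rl_obs, demo_obs, demo_act, alpha):
    # RFT-style actor loss: a TD3 actor term on replay-buffer states plus a BC term
    # on demonstration pairs, scaled by the dynamic weight lambda described above.
    rl_term = -critic(rl_obs, policy(rl_obs)).mean()
    with torch.no_grad():
        # lambda: fraction of demo states where the pretrained policy still scores higher
        lam = (critic(demo_obs, pretrained_policy(demo_obs)) >
               critic(demo_obs, policy(demo_obs))).float().mean()
    bc_term = ((demo_act - policy(demo_obs)) ** 2).sum(dim=-1).mean()
    return rl_term + alpha * lam * bc_term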

XI. V I T- BASED A RCHITECTURE I MPROVES A LL M ETHODS

In the main paper, we show that our ViT architecture improves the performance of IBRL over the commonly used network architecture from DrQ [34]. To show that the improvement from our ViT architecture is general, we additionally evaluate RLPD and RFT with both our ViT and the DrQ network. Fig. 10 summarizes the results: our ViT architecture clearly improves the performance of all three methods. This emphasizes that the improved network architecture is a general contribution that can be adopted independently of IBRL in future work.
[Fig. 10 panels: Can (pixel) and Square (pixel), one pair per method; x-axis: interaction steps (×1000); curves: IBRL (our ViT), IBRL (DrQ net), RLPD (our ViT), RLPD (DrQ net), RFT (our ViT), RFT (DrQ net).]
Fig. 10: Performance of our ViT vs. the DrQ network on IBRL, RLPD, and RFT. Our ViT-based architecture universally improves all three methods.

XII. A DDITIONAL D ETAILS OF R EAL -W ORLD E XPERIMENTS


A. Success Detection
We design rule-based systems to detect the success of each task and assign the final 0/1 reward for each episode. We run each episode for a maximum number of steps that depends on how long the task takes to complete, and an episode ends early when a success is detected.
Lift: The objective is to pick up a foam block. We detect whether the gripper is holding the block by checking that the gripper width is static and that the desired gripper width is smaller than the actual gripper width. The success detector returns 1 if the end effector has moved upward by at least 2 cm while holding the block. The maximum episode length is 75.
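A sketch of this rule is shown below; the observation field names and thresholds are illustrative assumptions rather than the exact values used in our system.

def lift_success(prev_obs, obs, ref_ee_height, width_eps=1e-3, lift_thresh=0.02):
    # Rule-based success check for Lift (illustrative field names).
    width_static = abs(obs["gripper_width"] - prev_obs["gripper_width"]) < width_eps
    holding = width_static and obs["gripper_width_cmd"] < obs["gripper_width"]
    lifted = obs["ee_pos"][2] - ref_ee_height >= lift_thresh  # raised by at least 2 cm (meters)
    return 1 if (holding and lifted) else 0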
Drawer: The objective is to open the top drawer in a set of plastic drawers in a fixed position. We attach a red patch to the
side of the drawer and install a side camera that detects the red patch. The red patch is visible to the side camera when the
drawer is open and invisible when the drawer is closed. The maximum episode length is 150.
Hang: The objective is to hang a deformable soft cloth on a metal hook. The cloth is the only red object in the scene, so we can track its location. The success detector returns 1 when the gripper is wide open, the red pixels are stable, and the highest red pixel is above a height threshold. The maximum episode length is 150.
B. Reset
For Drawer, the reset is straightforward as we manually close the drawer if it is not fully closed. The robot will sample a
new random initial location for the end effector at the beginning of each episode. For Lift and Hang, we follow a common
reset strategy for all methods. At the beginning of training, we put the object in the center of the initial area and do not move
it until the RL policy obtains its first success. If the object is moved before the first success, we put it back to the center.
After the RL policy succeeds for the first time, we gradually move the object from the center to the boundary in each reset.
If we reach the boundary before the training ends, we start resetting the object from top left to bottom right and repeat until
training terminates.
C. Safety Boundaries
To prevent the robot from damaging itself or the scene, we set a safety boundary on the end-effector position and rotation for each task. The boundary is set by first taking the range of end-effector positions and rotations observed in the human demonstrations and then enlarging that range by a fixed amount to give RL a modestly larger workspace. For Hang, we additionally block the region right beneath the hook because the robot arm would collide with the metal hook when the end effector enters that region. The episode terminates early with 0 reward when the safety boundary is violated.
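The boundary construction and check can be sketched as follows (illustrative only; the margin is a per-task constant that we do not specify here):

import numpy as np

def make_safety_bounds(demo_ee_poses: np.ndarray, margin: np.ndarray):
    # Per-task safety boundary: the min/max end-effector position and rotation seen
    # in the human demonstrations, widened by a fixed margin for RL.
    low = demo_ee_poses.min(axis=0) - margin
    high = demo_ee_poses.max(axis=0) + margin
    return low, high

def out_of_bounds(ee_pose: np.ndarray, low: np.ndarray, high: np.ndarray) -> bool:
    # The episode terminates early with 0 reward when the boundary is violated.
    return bool(np.any(ee_pose < low) or np.any(ee_pose > high))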
XIII. A BLATION OF ACTOR D ROPOUT ON M ETA -W ORLD
In the main paper, we do not use actor dropout for any method on Meta-World, as it does not meaningfully affect the conclusions on this simple benchmark. Fig. 11 and Fig. 12 show the performance of IBRL, RFT, and RLPD with and without actor dropout. Actor dropout slightly improves RLPD but makes little difference for IBRL and RFT, which are already highly competitive on this benchmark.
Given these results, we want to emphasize that Meta-World is a relatively simple benchmark for single-task RL: it was originally proposed for meta-learning, so the designers ensured that each individual task can be solved easily [36]. Specifically, Meta-World has a smaller action space (4-dimensional instead of the 7-dimensional space in Robomimic and the real world), shorter episodes (fewer than 100 steps), and limited randomness in the initial conditions. In the sparse-reward setting considered in this paper, we observe that Meta-World tasks are easy for modern RL methods that utilize demonstrations, even with highly limited data (i.e., 3 episodes of demonstrations). Therefore, these tasks do not provide enough signal to differentiate strong methods like IBRL and RFT, nor to justify the benefit of regularization techniques like actor dropout.
[Fig. 11 panels: Assembly, Box close, Coffee push, Stick pull, Aggregation; x-axis: interaction steps (×1000); curves: IBRL (w/o Actor Dropout), IBRL (w Actor Dropout), RFT, RFT (w Actor Dropout).]
Fig. 11: Performance of RFT with actor dropout compared with IBRL counterparts. Actor dropout does not meaningfully change the
performance of these strong methods on the relatively simple Meta-World benchmark.

[Fig. 12 panels: Assembly, Box close, Coffee push, Stick pull, Aggregation; x-axis: interaction steps (×1000); curves: IBRL (w/o Actor Dropout), IBRL (w Actor Dropout), RLPD, RLPD (w Actor Dropout).]
Fig. 12: Performance of RLPD with actor dropout compared with IBRL counterparts. Actor dropout slightly improves RLPD.

XIV. D ISCUSSION ON THE IL POLICY IN IBRL


To understand the role of the IL policy during IBRL training and at convergence, in Fig. 13 we plot the frequency with which IBRL selects the IL action when collecting online data. IBRL selects fewer actions from the IL policy at the beginning because the critics are randomly initialized and the RL actor can easily find actions with spuriously high Q-values. As the critics get updated, these incorrectly high Q-values are pushed down, and IBRL starts to pick more actions from the IL policy, since the critics learn from the demonstration data in the replay buffer that the IL actions are of high quality. The ratio of IL actions then steadily decreases in most cases as the RL policy improves. One exception is the hardest real-world task, Hang, where the ratio of IL actions keeps increasing. This is reasonable given that the IL policy is fairly strong on this task and the RL policy likely has not fully converged, as reflected by the high performance of IL and the imperfect performance of IBRL on this task. In all cases, however, the ratio never decreases to zero, indicating that IBRL still relies on the IL policies, which are parameterized by much deeper networks, even at convergence.
Next, we investigate how a suboptimal IL policy affects the performance of IBRL. We train suboptimal BC policies using the Multi-Human (MH) version of the Robomimic dataset instead of the Proficient-Human (PH) version used in standard IBRL. The average length of the 50 demonstrations is 149 steps in the PH dataset compared to 271 in the MH dataset, indicating that the MH dataset contains much less efficient motions. The dashed horizontal lines in Fig. 14 illustrate the performance gap between the BC policies trained on the two datasets: the BC (worse) policies achieve less than half the success rate of their counterparts trained on PH data. We then run IBRL with BC (worse) as the IL policy and keep everything else the same, i.e., we still add the PH data to the RL replay buffer for a controlled comparison. As expected, the performance of IBRL decreases because the worse IL policies cannot provide equally good alternative actions. However, IBRL eventually escapes from the worse BC policies and reaches equally good final policies.

XV. A DDITIONAL BASELINE : R ESIDUAL P OLICY L EARNING


In this section, we compare IBRL with an additional baseline, residual policy learning (RPL) [27]. The core idea of residual policy learning is to start from a base policy µ(s) and use RL to learn a policy π(s) that outputs an action residual on top of it, so the final action takes the form a = µ(s) + π(s). Our instantiation of RPL uses the same deep ResNet-18 BC policies as IBRL, and we additionally allow the residual policy to take the output of the BC policy as an extra input to provide a useful initial guess, i.e., a = µ(s) + π(s, µ(s)). The BC policy is kept fixed, and we optimize the residual policy using the same RL backbone as all other model-free RL methods in this paper. Following [27], we zero out the last layer of the residual policy so that the initial actions stay close to the BC actions.
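A minimal sketch of this residual actor is shown below, assuming flat state or feature inputs for simplicity; the class and argument names are ours.

import torch
import torch.nn as nn

class ResidualActor(nn.Module):
    # RPL-style actor: a = mu(s) + pi(s, mu(s)), with the BC policy mu kept frozen.
    def __init__(self, bc_policy: nn.Module, residual: nn.Module):
        super().__init__()
        self.bc_policy = bc_policy
        for p in self.bc_policy.parameters():
            p.requires_grad_(False)  # the BC policy is not updated
        self.residual = residual  # its last layer is zero-initialized elsewhere

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            a_bc = self.bc_policy(obs)
        # the residual network also sees the BC action as an additional input
        return a_bc + self.residual(torch.cat([obs, a_bc], dim=-1))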
Fig. 15 shows the performance of RPL alongside IBRL and the other baselines on the two image-based Robomimic tasks. RPL performs well compared with the other baselines but not as well as IBRL.
[Fig. 13 panels: Can (pixel), Square (pixel), Lift (Pixel, Real), Drawer (Pixel, Real), Hang (Pixel, Real); x-axis: interaction steps (×1000).]
Fig. 13: Percentage of actions from the BC policy selected during IBRL training.
[Fig. 14 panels: Can (pixel), Square (pixel); curves: IBRL, BC, IBRL (w/ worse BC), BC (worse). Fig. 15 panels: Can (pixel), Square (pixel); curves: IBRL, RPL, IBRL-RPL, RLPD, RFT. x-axis: interaction steps (×1000).]
Fig. 14: IBRL with a significantly worse BC policy trained from suboptimal demonstrations. This illustrates that IBRL can escape from substantially worse BC policies and achieve similar final performance at the cost of lower initial performance.
Fig. 15: Performance of Residual Policy Learning (RPL). RPL performs well among the baselines but still underperforms IBRL. RPL can be combined with IBRL to further improve performance on the harder Square task.

Inspired by the strong performance of RPL, we are interested in whether the residual formulation also benefits other methods. Therefore, we additionally run IBRL with an RPL-style modification to the input and output of the policy network (IBRL-RPL, dotted line in Fig. 15). We find that IBRL-RPL further improves sample efficiency over IBRL on the harder Square task and maintains roughly the same performance on the simpler Can task. It is encouraging that IBRL can be combined with existing techniques to achieve even better performance.
XVI. C OMPARISON WITH ROT

[Fig. 16 panels: Assembly, Box close, Coffee push, Stick pull, Aggregation; x-axis: interaction steps (×1000); curves: IBRL (w/o Actor Dropout), ROT, BC.]
Fig. 16: Comparison between IBRL and ROT, one of the best-performing RL methods that does not require environment reward. Note that IBRL and ROT have different assumptions because ROT does not use the sparse 0/1 reward from the environment. This comparison mainly illustrates the difference in peak performance under the two assumptions (sparse reward vs. no environment reward at all).

To understand the difference in peak performance between RL with sparse reward and RL that assumes no access to environment reward at all (such as inverse RL or online imitation learning), we compare IBRL against ROT [12], a powerful online imitation learning method that has been shown to outperform a wide range of other inverse RL methods. Our RFT baseline is closely related to ROT: ROT can be seen as RFT without the sparse reward from the environment but with a dense trajectory-matching reward computed via optimal transport instead. We emphasize that the methods considered in this paper have different assumptions from ROT, and from inverse RL/online imitation learning in general, as the latter family of methods does not assume access to any environment reward and instead uses rewards derived from demonstrations.
Fig. 16 shows IBRL and ROT on the Meta-World tasks. Unsurprisingly, IBRL performs significantly better than ROT. Note that the Meta-World tasks considered in this paper are harder than those in the original ROT paper, and we also run with a significantly smaller sample budget (60K vs. 1M steps). Additionally, we find that adding the OT reward to IBRL or RFT no longer helps, and sometimes hurts performance, once the ground-truth sparse reward is available, as it is challenging to balance the magnitudes of the two reward sources.
One takeaway from this experiment is that ground truth reward, even sparse, makes a huge difference in the performance
of the RL method. When sparse reward is accurate, IBRL learns efficiently without relying on any dense reward signals. This
suggests accurate and robust success prediction as an important research direction for RL on real robots.

VitEncoder(
(patch_embed): PatchEmbed(
(embed): Sequential(
(conv1): Conv2d(3, 128, kernel_size=(8, 8), stride=(4, 4))
(relu): ReLU()
(conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(2, 2))
)
)
(net): Sequential(
TransformerLayer(
(layer_norm1): LayerNorm()
(mha): MultiHeadAttention(
(qkv_proj): Linear(in_features=128, out_features=384, bias=True)
(out_proj): Linear(in_features=128, out_features=128, bias=True)
)
(layer_norm2): LayerNorm()
(linear1): Linear(in_features=128, out_features=512, bias=True)
(linear2): Linear(in_features=512, out_features=128, bias=True)
)
)
(norm): LayerNorm()
)

Fig. 17: Architecture of ViT encoder expressed in PyTorch style pseudocode. The shape of the input image is (3, 96, 96) in all experiments.
The shape of the output of the ViT encoder is (121, 128), i.e., 121 patches where each patch is a 128-dimensional vector.

Critic(
(spatial_embed) SpatialEmbed(
(weight): Parameter(128, 1024)
(input_proj): Sequential(
(0): Linear(in_features=155, out_features=1024, bias=True)
(1): LayerNorm()
(2): ReLU(inplace=True)
)
)
(q): Sequential(
(0): Linear(in_features=1058, out_features=1024, bias=True)
(1): LayerNorm()
(2): ReLU(inplace=True)
(3): Linear(in_features=1024, out_features=1024, bias=True)
(4): LayerNorm()
(5): ReLU(inplace=True)
(6): Linear(in_features=1024, out_features=1, bias=True)
)
)

Fig. 18: Architecture of the critic head expressed in PyTorch-style pseudocode. We first transpose the output of the ViT encoder from (121, 128) to (128, 121) and then append the three most recent proprioception readings (3 × 9,) and the action to evaluate (7,) to each channel; hence the input size of input_proj is 155 = 121 + 3 × 9 + 7. We apply an element-wise multiplication between the output of input_proj and weight, and sum over the channel dimension to produce a 1024-dimensional vector as the output of SpatialEmbed. Finally, we append the proprioception stack and the action to the output of SpatialEmbed again before feeding it to the Q-MLP (1058 = 1024 + 3 × 9 + 7).
Actor(
(compress): Sequential(
(0): Linear(in_features=15488, out_features=128, bias=True)
(1): LayerNorm()
(2): Dropout(p=0.5, inplace=False)
(3): ReLU()
)
(policy): Sequential(
(0): Linear(in_features=155, out_features=1024, bias=True)
(1): LayerNorm()
(2): Dropout(p=0.5, inplace=False)
(3): ReLU()
(4): Linear(in_features=1024, out_features=1024, bias=True)
(5): LayerNorm()
(6): Dropout(p=0.5, inplace=False)
(7): ReLU()
(8): Linear(in_features=1024, out_features=7, bias=True)
(9): Tanh()
)
)

Fig. 19: Architecture of the policy head. It takes the flattened output of the ViT encoder, i.e., 15488 = 121 × 128. We append the three most recent proprioception readings (3 × 9,) to the output of the compress module before feeding it to the policy module (155 = 128 + 3 × 9).

Critic(
(net): Sequential(
(0): Linear(in_features=3 * state_dim + action_dim, out_features=1024, bias=True)
(1): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(2): ReLU()
(3): Linear(in_features=1024, out_features=1024, bias=True)
(4): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(5): ReLU()
(6): Linear(in_features=1024, out_features=1024, bias=True)
(7): LayerNorm((1024,), eps=1e-05, elementwise_affine=True)
(8): ReLU()
(9): Linear(in_features=1024, out_features=1, bias=True)
)
)

Actor(
(net): Sequential(
(0): Linear(in_features=3 * state_dim, out_features=1024, bias=True)
(1): Dropout(p=0.5, inplace=False)
(2): ReLU()
(3): Linear(in_features=1024, out_features=1024, bias=True)
(4): Dropout(p=0.5, inplace=False)
(5): ReLU()
(6): Linear(in_features=1024, out_features=1024, bias=True)
(7): Dropout(p=0.5, inplace=False)
(8): ReLU()
(9): Linear(in_features=1024, out_features=action_dim, bias=True)
(10): Tanh()
)
)

Fig. 20: Architecture of critic and policy network in state-based RL.
