Improving Vision-Language-Action Model With Online Reinforcement Learning
Fig. 1: Illustration of our motivation. We adopt the fine-tuning pipeline used for large language models (LLMs) to enhance vision-language-action (VLA) models in the robotic domain, starting with supervised fine-tuning (SFT) followed by reinforcement learning (RL). However, we observe that standard online RL can be extremely unstable when applied to large VLA models. To address this, we propose an iterative RL method, iRe-VLA.
∗ Equal contribution.
† Corresponding author: [email protected]
1 Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China. [email protected]
2 University of California, Berkeley, USA.
3 Shanghai Qi Zhi Institute, Shanghai, China.

Abstract— Recent studies have successfully integrated large vision-language models (VLMs) into low-level robotic control by supervised fine-tuning (SFT) with expert robotic datasets, resulting in what we term vision-language-action (VLA) models. Although VLA models are powerful, how to improve these large models during interaction with environments remains an open question. In this paper, we explore how to further improve these VLA models via Reinforcement Learning (RL), a commonly used fine-tuning technique for large models. However, we find that directly applying online RL to large VLA models presents significant challenges, including training instability that severely impacts the performance of large models, and computing burdens that exceed the capabilities of most local machines. To address these challenges, we propose the iRe-VLA framework, which iterates between Reinforcement Learning and Supervised Learning to effectively improve VLA models, leveraging the exploratory benefits of RL while maintaining the stability of supervised learning. Experiments in two simulated benchmarks and a real-world manipulation suite validate the effectiveness of our method.

I. INTRODUCTION

It has become a recent trend to employ powerful pre-trained large language models (LLMs) and vision-language models (VLMs) for a variety of advanced tasks beyond their original scope, including dialogue systems [1], [2], [3], code generation [4], task planning [5], [6], and even low-level robotic control [7], [8]. By fine-tuning VLMs on robotic datasets with explicit action modeling, previous works have developed large vision-language-action (VLA) models [9], such as RT-2 [8], HiRT [10], and RoboFlamingo [11]. These models are capable of directly outputting low-level robotic control signals while also benefiting from the common-sense knowledge and reasoning abilities [12] encoded in large pre-trained models.

The fine-tuning of VLA models generally employs a supervised fine-tuning (SFT) approach [8], noted for its stability and scalability. However, SFT depends on high-quality expert datasets that are costly and difficult to obtain in the robotic domain [13]. Additionally, supervised learning may not fully align VLA models with physical environments due to distribution shift issues [14], [15]. We therefore ask how to further improve such large VLA models through interaction with the physical environment, beyond supervised learning.

Notably, Reinforcement Learning from Human Feedback (RLHF) [1], [16], [17] has been shown to better align large language models with human preferences, as illustrated in the upper left of Figure 1.

Inspired by the success of RLHF, we apply online RL to improve the VLA model and better align it with physical environments. However, the environments encountered by chatbots and embodied robots are markedly different. Chatbots are optimized using offline, human-labeled datasets with well-defined dynamics [1], while embodied robots necessitate online exploration in tasks characterized by long horizons and sparse rewards. Furthermore, previous research has shown that the online reinforcement learning (RL) process can be extremely unstable when applied to large neural networks [18], [19], [20]. Empirically, we also observe that directly applying a standard RL algorithm to large VLA models results in training instability and performance drops, as depicted on the right side of Figure 1.
To stabilize the RL process and effectively enhance the VLA model, we propose the novel iRe-VLA method, which iterates between online reinforcement learning stages and supervised learning stages. Specifically, during the RL stage, we freeze the VLM parameters and only train lightweight action heads to maintain training stability. In the subsequent supervised learning phase, we fine-tune the entire model on successful trajectories to fully utilize the expressive capabilities of the large model. Empirically, this two-stage approach consistently enhances the VLA's performance, stabilizes training, and is computationally more efficient. We have validated the iRe-VLA method through comprehensive experiments, including simulated MetaWorld [21], Franka-Kitchen [22], and real-world Panda manipulation task sets. In these domains, our method not only better aligns the VLA model with the original tasks but also autonomously solves unseen tasks. Furthermore, the VLA model's generalization ability is also improved through online interactions with the environment.
II. RELATED WORKS

Foundation Models for Embodied Control. Large language models (LLMs) and vision-language models (VLMs) trained on web-scale data encode knowledge of the physical world and exhibit impressive reasoning ability. With this prior knowledge, LLMs and VLMs can benefit embodied control tasks in many ways, ranging from providing rewards or values [23], [24], [25] for agents and modeling world dynamics [26], [27] to serving directly as the policy [5], [6], [28], [29], [30], [31], [32].
The literature that uses LLMs/VLMs directly as the agent's policy can be roughly divided into two categories: high-level planning and low-level control. Works in the first category leverage LLMs' reasoning ability to autoregressively generate textual step sequences [5], [6], [28] or code [33], thereby decomposing long-horizon tasks into feasible plans. However, these methods output textual plans that are not directly grounded in the physical world and require powerful low-level skills. Another line of work leverages VLMs to directly output low-level control signals and verifies that low-level skills themselves can also benefit from the prior knowledge encoded in pre-trained VLMs [7], [8], [10], [34], [11]. Since the original output of VLMs lies in the language space, these works need additional action-modeling components, such as adding action heads [10], [11] or replacing language tokens with actions [8].
Fine-tuning Large Models with RL. Reinforcement learning has been successfully used in natural language processing downstream tasks to better align the generated text with human preferences [1], [35], [36]. In this Reinforcement Learning from Human Feedback (RLHF) framework, a reward model is trained on a pre-collected human preference dataset, and the LLM is then optimized in a bandit environment under the constraint of not shifting too far from the original model [1], which can be seen as offline-style RL [37]. Different from RLHF for dialogue systems, fine-tuning VLA models faces unknown dynamics and requires online exploration [38], [39], [40]. For instance, GLAM [38] grounds the LLM's textual plans in simplified grid-world environments through online RL. LLaRP [39] grounds the high-level plans generated by VLMs in rearrangement tasks with dense-reward RL. However, they all assume that low-level skills (e.g., pick, goto) are available and only better ground the high-level plans. Different from them, we use RL to directly improve the low-level control signals output by the VLA policy, which involves much longer horizons (hundreds or thousands of steps) in sparse-reward physical environments.

III. PRELIMINARY

Reinforcement Learning. We adopt the standard partially observable Markov decision process (POMDP) framework from deep RL, where a task is modeled as M = (S, A, P_T, R, γ, O, P_E). S and A are the state space and action space for the task, and O is the robot observation space (e.g., visual images). P_T : S × A × S → [0, 1] is the state transition probability function, and R : S × A × S → R is the reward function for the task. In robotic tasks, the reward signal is often sparse, so we consider a binary reward in this paper, where R = 1 if the robot successfully finishes the task and R = 0 otherwise. P_E : S × O → [0, 1] gives the observation emission probabilities. A policy π_θ : O → A defines a probability distribution over the action space, parameterized by θ. The objective is to find parameters θ that maximize the expected discounted return of the policy π_θ:

J(\theta) = \mathbb{E}_{((s_0,o_0,a_0),(s_1,o_1,a_1),\ldots)\sim p_\theta}\left[\sum_t \gamma^t R(s_t, a_t)\right] \quad (1)
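As a concrete illustration of this sparse-reward convention (our own sketch, not code from the paper), the snippet below computes the discounted return inside the expectation of Equation (1) for an episode that only succeeds at its final step; the episode length and discount factor are placeholders.

```python
# Minimal sketch: discounted return under the sparse binary reward used here,
# where R = 1 only on the step where the task is completed.
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode."""
    ret = 0.0
    for t, r in enumerate(rewards):
        ret += (gamma ** t) * r
    return ret

# Example: a 200-step episode that only succeeds on the final step.
episode_rewards = [0.0] * 199 + [1.0]
print(discounted_return(episode_rewards))  # 0.99**199, roughly 0.135
```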
Vision-Language Model. Numerous vision-language models (VLMs) have been developed that can concurrently process visual and language inputs. These models can broadly be classified into two categories [8]: representation-learning models, such as CLIP [41], and generative models, such as BLIP-2 [42] and InstructBLIP [43]. Following [8], [34], [11], we employ generative VLMs in the format of {vision, text} → {text}. Formally, a generative VLM samples tokens x_{1:K} from p(x_{1:K} | I, c), conditioned on the input image I and instruction c. Since the original generative VLMs produce natural-language outputs, integrating these models into robotic control tasks requires an additional action-modeling component, detailed in the subsequent section.
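To make the {vision, text} → {text} interface concrete, here is a minimal sketch of sampling x_{1:K} from p(x_{1:K} | I, c). The `vlm` wrapper and its `encode` / `next_token_logits` methods are hypothetical stand-ins, not the BLIP-2 API.

```python
import torch

# Hypothetical generative-VLM wrapper: `vlm.encode(image, instruction)` returns a
# conditioning state and `vlm.next_token_logits(state, tokens)` returns next-token
# logits. The names are illustrative, not a real library interface.
def sample_tokens(vlm, image, instruction, max_tokens=16, temperature=1.0):
    state = vlm.encode(image, instruction)              # condition on image I and instruction c
    tokens = [vlm.bos_token_id]
    for _ in range(max_tokens):
        logits = vlm.next_token_logits(state, tokens)    # p(x_k | x_<k, I, c)
        probs = torch.softmax(logits / temperature, dim=-1)
        next_tok = torch.multinomial(probs, num_samples=1).item()
        tokens.append(next_tok)
        if next_tok == vlm.eos_token_id:
            break
    return tokens  # x_{1:K}, still in language space; an action head maps this to actions
```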
IV. METHOD

Our goal is to develop a learning method that effectively improves the VLA model through online interactions while keeping computational costs affordable for robotic systems. We start with a vision-language-action (VLA) model fine-tuned on robotic demonstrations. We detail the VLA architecture in Section IV-A and outline the learning pipeline of the iRe-VLA method in Section IV-B.

A. Model Architectures

Our VLA model transforms a vision input o ∈ O and a free-form language instruction l ∈ L into a low-level robotic action a ∈ A, represented as O × L → A. The model comprises a pre-trained large VLM and a lightweight action head, as illustrated on the left side of Figure 2.
[Fig. 2: Left: the VLA model architecture, where a token learner and a lightweight MLP action head ϕ sit on top of the VLM and output low-level actions. Right: the learning pipeline, alternating Stage 1 (RL in the environment) with Stage 2 (supervised learning on success trajectories).]
We utilize the BLIP-2 3B model [42] as our backbone VLM. Since the pre-trained VLM outputs text tokens in language space, an action head is designed to produce low-level control actions. These actions typically include changes in the end-effector's pose and the gripper's status. Following the design presented in [11], [34], we replace the VLM's final fully connected layer with a newly initialized action head. In the action head, a token learner [44] first converts the VLM's last hidden representation h ∈ R^{m×d} to h′ ∈ R^d. Subsequently, a multi-layer perceptron (MLP) [45] maps h′ to the action a ∈ R^{d_a}, where m and d denote the number of tokens and the embedding dimension of the VLM, respectively, and d_a represents the action dimension.
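A minimal PyTorch sketch of such an action head is shown below. It follows the shapes stated above (h ∈ R^{m×d} → h′ ∈ R^d → a ∈ R^{d_a}) but simplifies the token learner to attention pooling; the layer sizes and the BLIP-2-scale hidden dimension in the example are assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TokenLearner(nn.Module):
    """Compress the VLM's last hidden states h in R^{m x d} into a single vector h' in R^d
    via learned attention pooling (simplified relative to the set-transformer design in [44])."""
    def __init__(self, d: int):
        super().__init__()
        self.score = nn.Linear(d, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:   # h: (batch, m, d)
        weights = torch.softmax(self.score(h), dim=1)      # (batch, m, 1)
        return (weights * h).sum(dim=1)                     # (batch, d)

class ActionHead(nn.Module):
    """Map the pooled VLM representation to a low-level action a in R^{d_a}
    (e.g., end-effector pose delta plus gripper command)."""
    def __init__(self, d: int, d_a: int, hidden: int = 256):
        super().__init__()
        self.token_learner = TokenLearner(d)
        self.mlp = nn.Sequential(
            nn.Linear(d, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, d_a),
        )

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.token_learner(h))

# Example shapes: m = 32 tokens, d = 2560 (an assumed BLIP-2-scale hidden size), d_a = 7.
head = ActionHead(d=2560, d_a=7)
actions = head(torch.randn(4, 32, 2560))   # -> (4, 7)
```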
Low-Rank Adaptation (LoRA) [46]. Our VLA model comprises a large VLM backbone and a lightweight action head. However, fine-tuning the entire model, with its billions of parameters, requires significant computational resources. Furthermore, previous studies [47], [48] suggest that fine-tuning the whole large pre-trained model in limited-data regimes can result in over-fitting. Following the approach described in [47], we utilize the parameter-efficient LoRA method to fine-tune the VLM part. The total trainable parameters consist of the LoRA parameters θ and the action head parameters ϕ.
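The sketch below shows the LoRA idea from [46] as a self-contained PyTorch module: the pre-trained weight is frozen and only a low-rank update is trained. It is an illustration of the formulation rather than the adapter implementation actually used; in practice, a parameter-efficient fine-tuning library would typically wrap the VLM's projection layers instead.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update (alpha / r) * B A, as in [46]."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():          # the pre-trained weight stays frozen
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Only the LoRA parameters θ (lora_A, lora_B) and the action-head parameters ϕ are trained.
layer = LoRALinear(nn.Linear(2560, 2560), r=16)
out = layer(torch.randn(4, 2560))
```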
B. Learning Pipeline

We describe the learning pipeline in this section. First, we supervised fine-tune the VLA model on robotic datasets (Stage 0); then we iterate between online RL (Stage 1) and supervised learning (Stage 2).

Stage 0: Supervised Learning on Expert Dataset. We first perform standard supervised fine-tuning on the VLA model π_{θ,ϕ} with the expert robotic dataset D_e = {(o_1, l_1, a_1), (o_2, l_2, a_2), ..., (o_N, l_N, a_N)}. Formally, the learning objective is defined by a Mean Squared Error (MSE) loss:

J^0(\theta, \phi) = \mathbb{E}_{(o,l,a)\sim D_e}\left[\,\|\pi_{\theta,\phi}(o,l) - a\|_2^2\,\right] \quad (2)
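A sketch of this Stage-0 objective is given below. The `vla_model` call signature and the (o, l, a) batch format are placeholders for whatever VLA implementation and expert-data loader are used; only the MSE objective of Equation (2) is the point.

```python
import torch
import torch.nn.functional as F
from torch.utils.data import DataLoader

def supervised_finetune(vla_model, dataset, epochs=10, lr=1e-4, device="cuda"):
    """Stage 0: minimize E_{(o,l,a)~De} ||pi(o,l) - a||_2^2 over the trainable parameters
    (LoRA params θ and action-head params ϕ)."""
    loader = DataLoader(dataset, batch_size=64, shuffle=True)
    trainable = [p for p in vla_model.parameters() if p.requires_grad]
    optim = torch.optim.AdamW(trainable, lr=lr)
    vla_model.to(device).train()
    for _ in range(epochs):
        for obs, lang, action in loader:            # (o, l, a) triplets from the expert dataset
            pred = vla_model(obs.to(device), lang)   # predicted low-level action
            loss = F.mse_loss(pred, action.to(device))
            optim.zero_grad()
            loss.backward()
            optim.step()
    return vla_model
```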
After supervised fine-tuning, we obtain the initial VLA model π^0_{θ,ϕ}. The performance of π^0_{θ,ϕ} is highly correlated with the scale and quality of the expert dataset D_e. We then start to improve π^0_{θ,ϕ} through online RL.

Stage 1: Online RL with Frozen VLM. The SFT model, π^0_{θ,ϕ}, may not achieve optimal performance for new tasks. However, it serves as a valuable starting point since it has been trained on a variety of tasks from the robotic dataset. To enhance the performance of the SFT policy, we utilize online reinforcement learning (RL). In the RL process, we introduce a critic head that mirrors the structure of the action head, but with the output dimension set to one. To prevent model collapse and accelerate the learning process, we freeze the VLM parameters, θ, during this phase. Consequently, only the parameters of the action head, ϕ, are optimized:

J^1(\phi) = \mathbb{E}_{((s_0,o_0,a_0),(s_1,o_1,a_1),\ldots)\sim p_\phi}\left[\sum_t \gamma^t R(o_t, a_t)\right] \quad (3)

After online RL, the robot may discover new trajectories x_i that solve new tasks. We then collect these successful trajectories into an online dataset D_RL = D_RL ∪ x_i.
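The sketch below illustrates the Stage-1 setup under a Gym-style environment interface: the VLM parameters θ are frozen, a critic head with a one-dimensional output is added, and only the action head ϕ (plus the critic) is updated while successful episodes are appended to D_RL. The parameter-name convention, `vla_model.act`, `env.instruction`, and the `rl_update` callback are assumptions for illustration, not the paper's code.

```python
import torch.nn as nn

def freeze_vlm(vla_model):
    """Stage 1: freeze the VLM backbone parameters θ; only the action head ϕ stays trainable.
    Assumes action-head parameters are registered under the name 'action_head'."""
    for name, p in vla_model.named_parameters():
        p.requires_grad = "action_head" in name
    return vla_model

def make_critic_head(d: int, hidden: int = 256) -> nn.Module:
    """Critic head mirroring the action head's MLP, but with a 1-dimensional value output."""
    return nn.Sequential(nn.Linear(d, hidden), nn.ReLU(),
                         nn.Linear(hidden, hidden), nn.ReLU(),
                         nn.Linear(hidden, 1))

def collect_online_successes(env, vla_model, rl_update, num_steps, d_rl):
    """Run online RL on a new task; only successful trajectories enter the online dataset D_RL."""
    obs, traj = env.reset(), []
    for _ in range(num_steps):
        action = vla_model.act(obs)                          # placeholder inference call
        next_obs, reward, done, info = env.step(action)
        rl_update((obs, action, reward, next_obs, done))     # actor-critic update on ϕ and the critic
        traj.append((obs, env.instruction, action))
        if done:
            if reward > 0:                                   # sparse binary reward: task succeeded
                d_rl.extend(traj)                            # D_RL <- D_RL ∪ x_i
            obs, traj = env.reset(), []
        else:
            obs = next_obs
    return d_rl
```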
Stage 2: Supervised Learning on Both Expert and Online-collected Data. In Stage 1, while the agent conducts RL on new tasks, it risks forgetting previously learned tasks. Hence, in Stage 2, we supervise the whole model using both the newly collected online data D_RL and the original expert dataset D_e to mitigate catastrophic forgetting [49]. Formally, the objective can be written as:

J^2(\theta, \phi) = \mathbb{E}_{(o,l,a)\sim D_e \cup D_{RL}}\left[\,\|\pi_{\theta,\phi}(o,l) - a\|_2^2\,\right] \quad (4)
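Because Equation (4) is simply the Stage-0 objective applied to D_e ∪ D_RL, the Stage-2 step can be sketched as mixing the two datasets and reusing the hypothetical `supervised_finetune` helper from the Stage-0 sketch; the parameter-name convention for θ and ϕ is again an assumption.

```python
from torch.utils.data import ConcatDataset

def stage2_supervised_learning(vla_model, expert_dataset, online_dataset, **train_kwargs):
    """Stage 2: imitate De ∪ DRL with the full trainable set (LoRA params θ and head params ϕ)."""
    for name, p in vla_model.named_parameters():
        # Assumed naming convention: LoRA adapters contain 'lora', the head contains 'action_head';
        # the pre-trained base VLM weights remain frozen even in this stage.
        p.requires_grad = ("lora" in name) or ("action_head" in name)
    combined = ConcatDataset([expert_dataset, online_dataset])        # De ∪ DRL
    return supervised_finetune(vla_model, combined, **train_kwargs)   # hypothetical Stage-0 helper
```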
Iterate between Stage 1 and Stage 2. As previously noted, the agent in Stage 1 explores novel solutions for new tasks, while in Stage 2 it imitates all available success trajectories. By alternating between Stages 1 and 2, large VLA models progressively address a broader range of tasks while also preventing catastrophic forgetting on seen tasks. Furthermore, as suggested in previous works [50], [13], the VLA model can become more generalizable by imitating a wider range of tasks. The whole pipeline is outlined in Algorithm 1.

Algorithm 1: Iterative RL for VLA model (iRe-VLA)
Given: an expert dataset D_e, a supervised fine-tuned VLA model π^0_{θ,ϕ} with VLM parameters θ and action-head parameters ϕ, and an unseen task set T = {T_1, ..., T_n}.
1: Initialize the online dataset D_RL ← ∅; copy the weights of π^0_{θ,ϕ} to π^1_{θ,ϕ} and π^2_{θ,ϕ}.
2: for T_i in {T_1, ..., T_n} do
3:   # Stage 1: RL
4:   Copy the weights of π^2_{θ,ϕ} to π^1_{θ,ϕ}; initialize a critic head.
5:   Optimize ϕ with online reinforcement learning until convergence by Equation (3).
6:   Collect successful trajectories x_i into D_RL: D_RL ← D_RL ∪ x_i.
7:   # Stage 2: SL
8:   Copy the weights of π^1_{θ,ϕ} to π^2_{θ,ϕ}.
9:   Optimize θ, ϕ with supervised learning on D_e ∪ D_RL by Equation (4).
10: end for
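For completeness, the sketch below strings the hypothetical helpers from the previous sketches together in the order of Algorithm 1; it illustrates the control flow only and is not the paper's released implementation.

```python
import copy

def ire_vla(sft_model, expert_dataset, new_tasks, rl_update_factory, steps_per_task=100_000):
    """Iterate between online RL on the action head and supervised learning on De ∪ DRL,
    mirroring the structure of Algorithm 1 (illustrative control flow only)."""
    d_rl = []                                   # online dataset D_RL
    policy = copy.deepcopy(sft_model)           # start from the supervised fine-tuned policy π0
    for env in new_tasks:                       # one environment per new task T_i
        # Stage 1: RL with the VLM frozen; only ϕ and a fresh critic head are updated.
        policy = freeze_vlm(policy)
        critic = make_critic_head(d=policy.hidden_dim)    # hidden_dim is an assumed attribute
        rl_update = rl_update_factory(policy, critic)
        d_rl = collect_online_successes(env, policy, rl_update, steps_per_task, d_rl)
        # Stage 2: supervised learning on De ∪ DRL so new skills are learned without forgetting.
        policy = stage2_supervised_learning(policy, expert_dataset, d_rl)
    return policy, d_rl
```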
the computational capability of most local machines, while
V. E XPERIMENTS complete deployment on a remote server introduces parame-
In this section, we perform tense experiments in two sim- ter transmission issues and reduces the control frequency.
ulated benchmarks Metaworld and FrankaKitchen, and real- Our two-stage iRe-VLA framework addresses these chal-
world panda manipulation tasks to verify the effectiveness of lenges by distributing the computational load. In the first RL
our iRe-VLA framework. We aim to answer the following stage, iRe-VLA freezes the upper-layer VLM and only adapts
questions: the lightweight action head, thus keeping computational
demands affordable on the local machine. The second stage
• Why do we adopt a two-stage iterative RL process
of optimization is then delegated to remote services that
instead of standard RL?
can handle larger computational loads. For instance, in our
• Can iRe-VLA stabilize the training process and effec-
real-world experiments (see Section V-D), we conducted the
tively improve the VLA model in both expert tasks and
RL process locally using a single NVIDIA 4090 card and
unseen tasks?
performed the second stage on remote servers equipped with
• Can iRe-VLA lead to better generalization of the VLA
4 NVIDIA A100 cards.
model?
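As a rough way to see why Stage 1 fits on a local machine, one can count the parameters that actually require gradients in each stage; the helper below does this for any PyTorch module, and the per-stage notes reuse the hypothetical helpers sketched earlier.

```python
import torch.nn as nn

def trainable_parameter_count(model: nn.Module) -> int:
    """Parameters that receive gradients (and hence optimizer state) during training."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Hypothetical usage with the helpers sketched earlier:
#   Stage 1 (local machine): after freeze_vlm(policy), only the MLP action head ϕ and a small
#       critic head remain trainable, on the order of a few million parameters.
#   Stage 2 (remote servers): the LoRA parameters θ plus ϕ are trainable, far fewer than the
#       VLM's billions of frozen weights but enough to justify offloading to larger GPUs.
```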
[Fig. 3 (partial): task counts per domain and category. Franka-Kitchen: 5 expert tasks, 2 RL tasks, 1 unseen task. Real Panda: 10 expert tasks, 2 RL tasks, 8 unseen tasks.]
Fig. 3: We perform experiments in three domains. Each domain encompasses three categories: tasks observed in the expert
dataset, new tasks utilizing reinforcement learning, and hold-out unseen tasks. The tasks vary by required skills, as well as
the shapes and appearances of objects. The initial positions of objects in each task are randomized in every episode.
Fig. 4: Reinforcement learning process on new tasks. The SFT policy serves as a good starting point for new RL tasks compared to a learn-from-scratch policy. We also observe that fully fine-tuning VLA models can lead to performance degradation (orange lines), while freezing the VLM part avoids collapse.
Metaworld      | Original 25 tasks | Button-Press-new | Drawer-Open-new | Door-Open-new | Window-Open-new | Window-Close-new | Unseen 10 tasks
SFT Policy     | 0.83              | 0.56             | 0.48            | 0.40          | 0.32            | 0.28             | 0.51
PPO-Replay     | 0.69              | 0.80             | 0.24            | 0.32          | 0.04            | 0.36             | 0.39
iRe-VLA (Ours) | 0.83              | 1.00             | 0.84            | 0.84          | 0.80            | 0.96             | 0.80

TABLE I: Success rates on the Metaworld and Franka-Kitchen benchmarks with three categories of tasks (expert tasks, RL-trained tasks, and unseen tasks). Standard online RL results in performance even worse than the SFT policy, while iRe-VLA improves performance in all three categories of tasks.
As a baseline, we directly apply standard PPO [51] to the SFT policy on each new task and adopt the same expert-data replay strategy after each task, namely PPO-Replay.

Analysis. The results are presented in Table I. Standard PPO often exhibits instability when introduced to new RL tasks, as depicted in Figure 4. This instability not only affects performance in the RL tasks but also degrades performance in previously learned tasks, even with experience replay. This decline is likely due to noisy RL gradients that adversely affect the pre-trained representations within the VLA model. In contrast, our two-stage iRe-VLA method stabilizes the RL process and effectively enhances task performance across both seen and unseen tasks. The advantage of the iRe-VLA method is reflected in three aspects:

(1) Improved Performance in Original Tasks. We can continue to improve performance in seen expert tasks through online interaction. For instance, in the Franka-Kitchen benchmark, the supervised VLA model achieved a modest success rate in the expert task left-door-open due to limited demonstrations. Our iRe-VLA method improves the success rate of this task from 0.43 to 0.83.

(2) Improved Performance in RL Tasks. It is crucial for intelligent agents to adapt autonomously to tasks excluded from expert data. We explored various RL tasks (as detailed in the second column of Figure 3) and applied our iterative RL algorithm to address these tasks. As indicated in Table I, our iRe-VLA method successfully tackled new tasks in each domain without catastrophic forgetting [49].

(3) Improved Generalization in Unseen Tasks. In addition to the enhanced performance in RL-trained tasks through online iterations, we also observed increased success rates in unseen tasks, indicating better generalization ability. As the agent tackles an increasing variety of tasks automatically, its generalization ability correspondingly strengthens. For example, after mastering four types of window tasks in Metaworld, the agent effectively generalized to windows of unseen colors and shapes.
Ablation Study. In the iRe-VLA method, the whole VLM is trainable in the second supervised learning stage. We compare this design with an ablation that keeps the VLM frozen all the time; as shown in Figure 5, freezing the VLM all the time leads to performance drops.

Fig. 5: Ablations. Freezing the VLM all the time leads to performance drops.

D. Real-world Manipulation Experiments

Experiment Setups. Our real-world experiment follows the setup described in SERL [52], [53], a useful software suite for real-world RL. We first train a VLA model on 2,000 human-collected expert trajectories across various task categories, including pick (grasp), place, button-press, cable-route, and drawer operations.

We notice that the learned VLA model already shows a certain success rate on unseen objects thanks to its generalization ability. We then adopt online RL to further increase the success rate on unseen objects. We implemented several key design choices to enhance sample efficiency and ensure computational affordability within the context of large vision-language-action (VLA) models. To improve sample efficiency, we adopted the SACfD algorithm [54], [55]. Specifically, when introduced to a new task, each image is processed by the VLM only once, and the resulting latent output is stored in the buffer; subsequently, we run the SACfD algorithm in this latent space.
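A sketch of this design choice is shown below: every observation is encoded by the frozen VLM once, and only the resulting latent vector (with the usual transition fields) is stored, so the off-policy SACfD updates never re-run the VLM. The buffer API and the `vlm.encode` call are our own simplifications, not the SERL or SACfD interfaces.

```python
import numpy as np

class LatentReplayBuffer:
    """Replay buffer that stores frozen-VLM latents instead of raw images, so the
    SACfD-style updates operate entirely in the latent space."""
    def __init__(self, capacity: int, latent_dim: int, action_dim: int):
        self.capacity, self.idx, self.full = capacity, 0, False
        self.latent = np.zeros((capacity, latent_dim), dtype=np.float32)
        self.action = np.zeros((capacity, action_dim), dtype=np.float32)
        self.reward = np.zeros((capacity, 1), dtype=np.float32)
        self.next_latent = np.zeros((capacity, latent_dim), dtype=np.float32)
        self.done = np.zeros((capacity, 1), dtype=np.float32)

    def add(self, latent, action, reward, next_latent, done):
        i = self.idx
        self.latent[i], self.action[i] = latent, action
        self.reward[i], self.next_latent[i], self.done[i] = reward, next_latent, done
        self.idx = (self.idx + 1) % self.capacity
        self.full = self.full or self.idx == 0

    def sample(self, batch_size: int):
        high = self.capacity if self.full else self.idx
        j = np.random.randint(0, high, size=batch_size)
        return (self.latent[j], self.action[j], self.reward[j], self.next_latent[j], self.done[j])

# Each image is encoded by the frozen VLM exactly once before being stored, e.g.:
#   z = vlm.encode(image, instruction)          # hypothetical call; latent h' in R^d
#   buffer.add(z, action, reward, z_next, done)
```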
[Fig. 6: (a) Experiment Setups with a wrist camera; (b) Online RL Tasks such as "Pick up the eggplant" and "Grasp the orange Carrot"; success rates of the SFT policy, standard RL, and iRe-VLA (ours) on expert, RL, and unseen tasks.]

Results. The expert pick demonstrations were limited to blocks of four colors, and we extended the online RL to objects with irregular shapes, such as eggplants and carrots. The real-world RL training process for each new task costs around one hour, similar to the time cost in SERL [52]. The success rates before and after the RL process are shown in Figure 6; our iRe-VLA pipeline increased the success rate for picking eggplants or carrots from 0.35 to 0.80. Moreover, the success rates for the original tasks remained stable, and the picking success rate for unseen objects also improved from 0.37 to 0.61.

VI. CONCLUSION AND LIMITATION

In this paper, we explore ways to further enhance the VLA model through online reinforcement learning. Fine-tuning large VLA models presents several challenges, but our proposed iRe-VLA method stabilizes the training process and significantly reduces computational demands. Experiments on both simulated and real-world manipulation tasks confirm the effectiveness of iRe-VLA. A potential limitation is that it can only improve skills within seen skill types and cannot learn entirely new skills under sparse-reward online RL conditions.
REFERENCES

[1] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, et al., "Training language models to follow instructions with human feedback," Advances in neural information processing systems, vol. 35, pp. 27730–27744, 2022.
[2] A. Glaese, N. McAleese, Maja, J. Aslanides, V. Firoiu, T. Ewalds, M. Rauh, L. Weidinger, M. Chadwick, P. Thacker, et al., "Improving alignment of dialogue agents via targeted human judgements," arXiv preprint arXiv:2209.14375, 2022.
[3] R. Thoppilan, D. De Freitas, J. Hall, N. Shazeer, A. Kulshreshtha, H.-T. Cheng, A. Jin, T. Bos, L. Baker, Y. Du, et al., "Lamda: Language models for dialog applications," arXiv preprint arXiv:2201.08239, 2022.
[4] M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. d. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, et al., "Evaluating large language models trained on code," arXiv preprint arXiv:2107.03374, 2021.
[5] M. Ahn, A. Brohan, N. Brown, Y. Chebotar, O. Cortes, B. David, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, et al., "Do as i can, not as i say: Grounding language in robotic affordances," arXiv preprint arXiv:2204.01691, 2022.
[6] W. Huang, F. Xia, T. Xiao, H. Chan, J. Liang, P. Florence, A. Zeng, J. Tompson, I. Mordatch, Y. Chebotar, et al., "Inner monologue: Embodied reasoning through planning with language models," arXiv preprint arXiv:2207.05608, 2022.
[7] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, J. Dabis, C. Finn, K. Gopalakrishnan, K. Hausman, A. Herzog, J. Hsu, et al., "Rt-1: Robotics transformer for real-world control at scale," arXiv preprint arXiv:2212.06817, 2022.
[8] A. Brohan, N. Brown, J. Carbajal, Y. Chebotar, X. Chen, K. Choromanski, T. Ding, D. Driess, A. Dubey, C. Finn, et al., "Rt-2: Vision-language-action models transfer web knowledge to robotic control," arXiv preprint arXiv:2307.15818, 2023.
[9] Y. Ma, Z. Song, Y. Zhuang, J. Hao, and I. King, "A survey on vision-language-action models for embodied ai," arXiv preprint arXiv:2405.14093, 2024.
[10] J. Zhang, Y. Guo, X. Chen, Y.-J. Wang, Y. Hu, C. Shi, and J. Chen, "Hirt: Enhancing robotic control with hierarchical robot transformers," in 8th Annual Conference on Robot Learning.
[11] X. Li, M. Liu, H. Zhang, C. Yu, J. Xu, H. Wu, C. Cheang, Y. Jing, W. Zhang, H. Liu, et al., "Vision-language foundation models as effective robot imitators," arXiv preprint arXiv:2311.01378, 2023.
[12] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., "Chain-of-thought prompting elicits reasoning in large language models," Advances in neural information processing systems, vol. 35, pp. 24824–24837, 2022.
[13] A. Padalkar, A. Pooley, A. Jain, A. Bewley, A. Herzog, A. Irpan, A. Khazatsky, A. Rai, A. Singh, A. Brohan, et al., "Open x-embodiment: Robotic learning datasets and rt-x models," arXiv preprint arXiv:2310.08864, 2023.
[14] S. Belkhale, Y. Cui, and D. Sadigh, "Data quality in imitation learning," Advances in Neural Information Processing Systems, vol. 36, 2024.
[15] A. Kumar, A. Zhou, G. Tucker, and S. Levine, "Conservative q-learning for offline reinforcement learning," Advances in Neural Information Processing Systems, vol. 33, pp. 1179–1191, 2020.
[16] F. Liu et al., "Learning to summarize from human feedback," in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020.
[17] P. F. Christiano, J. Leike, T. Brown, M. Martic, S. Legg, and D. Amodei, "Deep reinforcement learning from human preferences," Advances in neural information processing systems, vol. 30, 2017.
[18] E. Parisotto, F. Song, J. Rae, R. Pascanu, C. Gulcehre, S. Jayakumar, M. Jaderberg, R. L. Kaufman, A. Clark, S. Noury, et al., "Stabilizing transformers for reinforcement learning," in International conference on machine learning. PMLR, 2020, pp. 7487–7498.
[19] M. Andrychowicz, A. Raichuk, P. Stańczyk, M. Orsini, S. Girgin, R. Marinier, L. Hussenot, M. Geist, O. Pietquin, M. Michalski, et al., "What matters for on-policy deep actor-critic methods? a large-scale study," in International conference on learning representations, 2020.
[20] K. Ota, D. K. Jha, and A. Kanezaki, "Training larger networks for deep reinforcement learning," arXiv preprint arXiv:2102.07920, 2021.
[21] T. Yu, D. Quillen, Z. He, R. Julian, K. Hausman, C. Finn, and S. Levine, "Meta-world: A benchmark and evaluation for multi-task and meta reinforcement learning," in Conference on robot learning. PMLR, 2020, pp. 1094–1100.
[22] A. Gupta, V. Kumar, C. Lynch, S. Levine, and K. Hausman, "Relay policy learning: Solving long-horizon tasks via imitation and reinforcement learning," arXiv preprint arXiv:1910.11956, 2019.
[23] Y. J. Ma, W. Liang, G. Wang, D.-A. Huang, O. Bastani, D. Jayaraman, Y. Zhu, L. Fan, and A. Anandkumar, "Eureka: Human-level reward design via coding large language models," arXiv preprint arXiv:2310.12931, 2023.
[24] L. Fan, G. Wang, Y. Jiang, A. Mandlekar, Y. Yang, H. Zhu, A. Tang, D.-A. Huang, Y. Zhu, and A. Anandkumar, "Minedojo: Building open-ended embodied agents with internet-scale knowledge," Advances in Neural Information Processing Systems, vol. 35, pp. 18343–18362, 2022.
[25] A. Adeniji, A. Xie, C. Sferrazza, Y. Seo, S. James, and P. Abbeel, "Language reward modulation for pretraining reinforcement learning," arXiv preprint arXiv:2308.12270, 2023.
[26] J. Lin, Y. Du, O. Watkins, D. Hafner, P. Abbeel, D. Klein, and A. Dragan, "Learning to model the world with language," arXiv preprint arXiv:2308.01399, 2023.
[27] A. W. Hanjie, V. Y. Zhong, and K. Narasimhan, "Grounding language to entities and dynamics for generalization in reinforcement learning," in International Conference on Machine Learning. PMLR, 2021, pp. 4051–4062.
[28] A. Zeng, A. Wong, S. Welker, K. Choromanski, F. Tombari, A. Purohit, M. Ryoo, V. Sindhwani, J. Lee, V. Vanhoucke, et al., "Socratic models: Composing zero-shot multimodal reasoning with language," arXiv preprint arXiv:2204.00598, 2022.
[29] D. Driess, F. Xia, M. S. Sajjadi, C. Lynch, A. Chowdhery, B. Ichter, A. Wahid, J. Tompson, Q. Vuong, T. Yu, et al., "Palm-e: An embodied multimodal language model," arXiv preprint arXiv:2303.03378, 2023.
[30] Y. Guo, Y.-J. Wang, L. Zha, Z. Jiang, and J. Chen, "Doremi: Grounding language model by detecting and recovering from plan-execution misalignment," arXiv preprint arXiv:2307.00329, 2023.
[31] I. Dasgupta, C. Kaeser-Chen, K. Marino, A. Ahuja, S. Babayan, F. Hill, and R. Fergus, "Collaborating with language models for embodied reasoning," arXiv preprint arXiv:2302.00763, 2023.
[32] Y.-J. Wang, B. Zhang, J. Chen, and K. Sreenath, "Prompt a robot to walk with large language models," arXiv preprint arXiv:2309.09969, 2023.
[33] J. Liang, W. Huang, F. Xia, P. Xu, K. Hausman, B. Ichter, P. Florence, and A. Zeng, "Code as policies: Language model programs for embodied control," arXiv preprint arXiv:2209.07753, 2022.
[34] W. Chen, O. Mees, A. Kumar, and S. Levine, "Vision-language models provide promptable representations for reinforcement learning," arXiv preprint arXiv:2402.02651, 2024.
[35] N. Stiennon, L. Ouyang, J. Wu, D. Ziegler, R. Lowe, C. Voss, A. Radford, D. Amodei, and P. F. Christiano, "Learning to summarize with human feedback," Advances in Neural Information Processing Systems, vol. 33, pp. 3008–3021, 2020.
[36] R. Ramamurthy, P. Ammanabrolu, K. Brantley, J. Hessel, R. Sifa, C. Bauckhage, H. Hajishirzi, and Y. Choi, "Is reinforcement learning (not) for natural language processing: Benchmarks, baselines, and building blocks for natural language policy optimization," arXiv preprint arXiv:2210.01241, 2022.
[37] S. Levine, A. Kumar, G. Tucker, and J. Fu, "Offline reinforcement learning: Tutorial, review, and perspectives on open problems," arXiv preprint arXiv:2005.01643, 2020.
[38] T. Carta, C. Romac, T. Wolf, S. Lamprier, O. Sigaud, and P.-Y. Oudeyer, "Grounding large language models in interactive environments with online reinforcement learning," in International Conference on Machine Learning. PMLR, 2023, pp. 3676–3713.
[39] A. Szot, M. Schwarzer, H. Agrawal, B. Mazoure, R. Metcalf, W. Talbott, N. Mackraz, R. D. Hjelm, and A. T. Toshev, "Large language models as generalizable policies for embodied tasks," in The Twelfth International Conference on Learning Representations, 2023.
[40] Y. Zhai, H. Bai, Z. Lin, J. Pan, S. Tong, Y. Zhou, A. Suhr, S. Xie, Y. LeCun, Y. Ma, et al., "Fine-tuning large vision-language models as decision-making agents via reinforcement learning," arXiv preprint arXiv:2405.10292, 2024.
[41] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al., "Learning transferable visual models from natural language supervision," in International conference on machine learning. PMLR, 2021, pp. 8748–8763.
[42] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-
image pre-training with frozen image encoders and large language
models,” in International conference on machine learning. PMLR,
2023, pp. 19 730–19 742.
[43] W. Dai, J. Li, D. Li, A. M. H. Tiong, J. Zhao, W. Wang, B. Li,
P. N. Fung, and S. Hoi, “Instructblip: Towards general-purpose vision-
language models with instruction tuning,” Advances in Neural Infor-
mation Processing Systems, vol. 36, 2024.
[44] J. Lee, Y. Lee, J. Kim, A. Kosiorek, S. Choi, and Y. W. Teh, “Set
transformer: A framework for attention-based permutation-invariant
neural networks,” in International conference on machine learning.
PMLR, 2019, pp. 3744–3753.
[45] M. Riedmiller and A. Lernen, “Multi layer perceptron,” Machine
Learning Lab Special Lecture, University of Freiburg, vol. 24, 2014.
[46] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang,
and W. Chen, “Lora: Low-rank adaptation of large language models,”
arXiv preprint arXiv:2106.09685, 2021.
[47] Z. Liu, J. Zhang, K. Asadi, Y. Liu, D. Zhao, S. Sabach, and R. Fakoor,
“Tail: Task-specific adapters for imitation learning with large pre-
trained models,” arXiv preprint arXiv:2310.05905, 2023.
[48] K. Bousmalis, G. Vezzani, D. Rao, C. Devin, A. X. Lee, M. Bauza,
T. Davchev, Y. Zhou, A. Gupta, A. Raju, et al., “Robocat: A self-
improving foundation agent for robotic manipulation,” arXiv preprint
arXiv:2306.11706, 2023.
[49] M. McCloskey and N. J. Cohen, “Catastrophic interference in connec-
tionist networks: The sequential learning problem,” in Psychology of
learning and motivation. Elsevier, 1989, vol. 24, pp. 109–165.
[50] O. M. Team, D. Ghosh, H. Walke, K. Pertsch, K. Black, O. Mees,
S. Dasari, J. Hejna, C. Xu, J. Luo, et al., “Octo: An open-source
generalist robot policy,” 2023.
[51] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov,
“Proximal policy optimization algorithms,” arXiv preprint
arXiv:1707.06347, 2017.
[52] J. Luo, Z. Hu, C. Xu, Y. L. Tan, J. Berg, A. Sharma, S. Schaal, C. Finn,
A. Gupta, and S. Levine, “Serl: A software suite for sample-efficient
robotic reinforcement learning,” arXiv preprint arXiv:2401.16013,
2024.
[53] J. Luo, C. Xu, F. Liu, L. Tan, Z. Lin, J. Wu, P. Abbeel, and S. Levine,
“Fmb: a functional manipulation benchmark for generalizable robotic
learning,” arXiv preprint arXiv:2401.08553, 2024.
[54] M. Vecerik, T. Hester, J. Scholz, F. Wang, O. Pietquin, B. Piot,
N. Heess, T. Rothörl, T. Lampe, and M. Riedmiller, “Leveraging
demonstrations for deep reinforcement learning on robotics problems
with sparse rewards,” arXiv preprint arXiv:1707.08817, 2017.
[55] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-
policy maximum entropy deep reinforcement learning with a stochastic
actor,” in International conference on machine learning. PMLR,
2018, pp. 1861–1870.